Llama-3.3-70B-Instruct (huggingface.co)
425 points by pr337h4m 40 days ago | 219 comments



Benchmarks - https://www.reddit.com/r/LocalLLaMA/comments/1h85ld5/comment...

Seems to perform on par with or slightly better than Llama 3.2 405B, which is crazy impressive.

Edit: According to Zuck (https://www.instagram.com/p/DDPm9gqv2cW/) this is the last release in the Llama 3 series, and we'll see Llama 4 in 2025. Hype!!


I'm getting 2.12 tok/s[1] on a 24GB (4090) GPU and 64GB (7950x) CPU memory, splitting the model across the GPU and CPU (40/80 layers on GPU) with lm-studio. Output looks good so far, I can use something like this for a query that I want as good an answer as possible and that I don't want to send out on the network.

If we can get better quantization, or bigger GPU memory footprints, we might be able to use these big models locally for solid coding assistants. That's what I think we have to look forward to (among other benefits) in the year(s) ahead.

1. lmstudio-community/Llama-3.3-70B-Instruct-GGUF/Llama-3.3-70B-Instruct-Q4_K_M.gguf


The 32B parameter model size seems like the sweet spot right now, imho. It's large enough to be very useful (Qwen 2.5 32B and the Coder variant are outstanding models), and they run on consumer hardware much more easily than the 70B models.

I hope Llama 4 reintroduces that mid-sized model.


qwen2.5 looks like magic compared to llama3.2.


A question: How large an LLM can be run at reasonable speed on 12GB (3060), 32GB RAM? How much does quantization impact output quality? I've worked with image models (SD/Flux etc) quite a bit, but haven't yet tried running a local LLM.


I got a 2080Ti with 11GB, I can fit Gemma 2 9B Q5_K_M or LLama 3.2 Vision 8B Q4_K_M in memory (if I nuke Firefox's GPU process first).

Speed takes quite a hit once you have a few layers on the CPU, but depending on needs it can be doable. I've just asked LLama 3.3 70B Q5_K_M a question and it offloaded about 5 of the 80 layers, so running almost entirely on my 5900X CPU, but still churning out about one word per second.

In my experience quantization affects prompt adherence primarily and answer accuracy secondarily. For example, if you have multiple clauses, i.e. one or more "if this then that", then quantization might cause it to not consider those. I also find the answers tend to be more general and less precise at higher quantization levels.

As a concrete example, I've been asking the LLama 3.2 Vision 8B model to categorize some images. The default instruct model in general has been heavily trained to output general commentary on the image. If in the prompt I tell it to "output the category only", the Q4_K_M variant sometimes ignores that instruction, while the Q8 variant almost always respects it.

Larger models primarily bring more knowledge in my experience, but usually also better prompt adherence. Larger models also typically can support larger contexts, though this can vary, check the model cards.

edit: I should clarify. More knowledge also often translates to better, more accurate output. For example, a larger model might recognize an idiom and answer accordingly, while the smaller model fails to recognize it and thus provides a poor answer.

Depending on your needs, 12GB might be quite decent or it might be insufficient. If you need an assistant-like model, I liked the Gemma 2 9B Q5_K_M. And I've been quite impressed by LLama 3.2 Vision 8B Q4_K_M for describing images and transcribing text from images.

But for more open-ended stuff, especially if larger contexts is needed, I think you might find it underwhelming.


Thanks, much appreciated!


You're welcome! Should have specified the reason I stick with LLama 3.2 Vision 8B Q4_K_M is that the Q8 doesn't fully fit in memory and so, as mentioned, is significantly slower.

But I do like to compare. Open WebUI for example makes it very easy where you can load up multiple models and it'll send the same prompt to each one in turn, and show the answers side by side.


>How large an LLM can be run at reasonable speed on 12GB (3060), 32GB RAM?

If you want to offload fully to VRAM, I'd say 8B is the limit. If you're keeping some on RAM, 15-20B can still give OK performance, depending on your tolerance.

>How much does quantization impact output quality?

Basically with more quantization the output becomes more incoherent and less realistic. At the extreme end it's basically just gibberish. I think the sweet spot generally is at 4 bits. At that point the model is pretty compact and the quality isn't diminished too much.


He could probably do 12b (Nemo) up to 14b (Qwen 2.5) at 4bpw with exllamav2


You can download LM Studio today and try it out. I've had success with the Mistral Small Instruct IQ3M model, which fits in VRAM.


How do you measure tokens/sec? Here's my attempt on a new M4 Max 128GB, does about 6 words/sec:

    bash> time ollama run llama3.3 "What's the purpose of an LLM?" | tee ~/Downloads/what\ is\ an\ LLM.txt
    A Large Language Model (LLM) is a type of artificial intelligence (AI) designed to process and understand human language. The primary purposes of an LLM are:
(... contents excerpted for brevity)

    Overall, the purpose of an LLM is to augment human capabilities by providing a powerful tool for understanding, generating, and interacting with human language.

    real  0m59.040s
    user  0m0.071s
    sys 0m0.081s

    pmarreck  59s35ms
    20241206220629 ~ bash> wc -w Downloads/what\ is\ an\ LLM.txt
        359 Downloads/what is an LLM.txt


LM Studio puts stats at the bottom of each reply like: 2.09 tok/sec, 346 tokens, 1.74s to first token. This was for a 259 word response, so ~ 0.75 words/token. If that ratio holds, you might be getting 8 tok/sec on your M4 Max?
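Rough math behind that estimate, as a quick sketch (the 0.75 words/token ratio is just my guess from the LM Studio stats, and your `time` number includes model load):

    words = 359             # from the `wc -w` above
    seconds = 59.0          # from `time` (includes model load, so generation itself is a bit faster)
    words_per_token = 0.75  # rough ratio; varies by model and text
    print(f"{words / words_per_token / seconds:.1f} tok/s")  # ~8 tok/s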

Looks like LM Studio is available for ARM based Macs, if you want to give that a try, that'd be one way to get these stats. LM Studio also surfaces up some parameters to play around with, and keeps a record of past conversations if that might appeal to you.


Yeah, I already use it along with Ollama, I just didn't notice those stats I guess!


Just add "--verbose" to your run command, e.g. "ollama run mistral-nemo:latest --verbose", it'll dump the token counts and timing info after each message.


Any opinion on whether the q4 quantization is stable/effective? That's a lot of quantization.

Edit: Perhaps answering my own question:

λ ollama run hf.co/lmstudio-community/Llama-3.3-70B-Instruct-GGUF:Q4_K_M

>>> Hi. Who are you?

I'm an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."


Q4_* is by far the most popular one for local use, and it works "fine" meaning that you do see some effect on perplexity and other scores, but it's small enough to not be a concern in most cases. Although it should be noted that this can depend on the model - e.g. there have been some reports that for QwQ, going 4-bit does adversely impact the quality of its CoT.

With Llama specifically, I recall someone comparing various quants on perplexity and finding that even at Q3, 70B is still smarter than 34B. So quantization is generally worthwhile so long as it lets you squeeze a larger model into your VRAM budget, and you don't mind the slowdown from more parameters.


I prefer larger quantization and longer response times, but only have so much memory and am only willing to wait so long. I can work with 2 toks/s and this kind of accuracy for these kinds of queries. I expect you'll need to play around with it to find out if it would be helpful in your case. Here's what I'm getting asking 'who are you?'

System Prompt: You are an advanced super intelligence, capable of everything from software development to dream interpretation.

>> Who are you?

  I am an artificial intelligence designed to simulate human-like conversation and answer questions to the best of my knowledge. I'm a large language model, which means I've been trained on a massive dataset of text from various sources, including books, articles, research papers, and websites.

  My capabilities are diverse and extensive. I can:

  1. **Answer questions**: I can process natural language queries and provide relevant answers, covering a wide range of topics from science and history to entertainment and culture.
  2. **Generate text**: I can create text based on a prompt or topic, including articles, stories, emails, and even entire books.
  3. **Translate languages**: I can translate text from one language to another, including popular languages such as Spanish, French, German, Chinese, and many more.
  4. **Summarize content**: I can summarize long pieces of text into concise, easily digestible versions, highlighting key points and main ideas.
  5. **Offer suggestions**: I can provide suggestions for topics such as gift ideas, travel destinations, books to read, and more.
  6. **Engage in conversation**: I can have a natural-sounding conversation with you, using context and understanding to respond to your questions and statements.
  7. **Dream interpretation**: As you mentioned, I can even help interpret dreams, providing insights into the symbolism, themes, and emotions that may be present in your subconscious mind.
  8. **Software development**: I can assist with software development tasks, such as writing code, debugging, and optimizing algorithms.

  My knowledge is constantly evolving, and I'm always learning from the interactions I have with users like you. So, feel free to ask me anything – I'll do my best to help!


Can llama.cpp make use of the gpu built into the 7950x CPU? I assume that would improve performance.


The limit is memory bandwidth; a dedicated GPU will have higher memory bandwidth than a CPU or iGPU ever will.


Granite Rapids memory bandwidth is between 614 and 844 GB/s.


It would make no difference at best for token generation and would actually run slower for prompt processing since it has fewer gflops than the CPU cores.


FYI, due to Llama's naming scheme, there is no such thing as Llama 3.2 405B. 8B/70B/405B models are either Llama 3, 3.1, or 3.3 (except for 405B which wasn't initially released), while Llama 3.2 only contains 1B, 3B, 11B (vision), and 90B (vision) models. It's a bit confusing.


Ah, so I guess the comparison is to Llama 3.1 405B.


Still very impressive. Llama team is absolutely killing it right now, and the openness makes them the most important player IMHO


It could be worse. It could’ve been Llama 3.1 (New)


yeah I use Llama 3.2 3B and I'm blown away

but also wrestled with this mentally.

Meta improves both the technology and inference, while also trapping themselves, alongside everyone else training models, into updating the training set every few months so it knows what it's talking about with relevant current events


Given how close it is to 405B in performance it would be interesting to see which has the edge comparing an unquantized 3.3-70B against 405B quantized to be the same size.


That would be 1.38 bits per weight on average, which I can confidently guess would not perform well.


BitNet is functional at 1.58 bpw.


The model card says the 70B is 16 bit so I think you have twice that
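Quick back-of-envelope check, assuming fp16 weights as the size budget:

    budget_bytes = 70e9 * 2          # 70B params at 16 bits ≈ 140 GB
    print(budget_bytes * 8 / 405e9)  # ≈ 2.77 bits per weight for a same-size 405B
    # (the 1.38 figure above corresponds to an 8-bit 70B budget instead)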


It's kind of amazing how there seems to be a wall where sizing up the model starts to diminish in terms of intelligence gains. I guess that's why we can still compete with whales even though their brains are like twice as big as ours.


There is a line of thinking that the required tokens shoved through training also needs to go up, perhaps super-linearly to the model size. And of course, there's a line of thinking that there are diminishing returns in all things, including these models. Both could be true!


I have tried 405B at 1-bit quantization. It remains coherent, but didn't seem to be any better than 3.1-70B.


This reminds me of Steve Jobs's famous comment to Dropbox about storage being 'a feature, not a product.' Zuckerberg - by open-sourcing these powerful models, he's effectively commoditising AI while Meta's real business model remains centred around their social platforms. They can leverage these models to enhance Facebook and Instagram's services while simultaneously benefiting from the community improvements and attention. It's not about selling AI; it's about using AI to strengthen their core business. By making it open, they get the benefits of widespread adoption and development without needing to monetise the models directly.


Also don't underestimate the value they're getting from making more overtures to the developer community. It could be a coincidence, but it's only since they started releasing these models that I started noticing people on HN calling them "Meta", and attitudes towards them have been far more positive of late than usual.

Good will isn't worth as much as cheap moderation automation and fancy features, but it's worth something.


> Also don't underestimate the value they're getting from making more overtures to the developer community.

I wonder if it's significant. As developers, we're biased to think it matters, but in the grand scheme of things, 99.99% of people don't have a clue about open source or things that matter to hackers. As far as recruitment goes, developers look primarily at how much they make, possibly the tech and how it looks on a resume. There's always been a stigma around social networks and generally big tech companies, but not to the point it's going to hurt them.


I agree there's a lot of decisions that have to be made on the basis of "the tech and how it looks on resume" (as I'm finding out by not having React Native on mine).

> There's always been a stigma around social networks and generally big tech companies, but not to the point it's going to hurt them.

I'm not sure it was "always":

The one Facebook developer event I've been to made me feel dirty just to associate with them, but before that I had no negative feelings. It started off as "the new LiveJournal".

Deleted my account for a few years, only came back to it when I started planning to move country and wanted to keep in contact with those who stayed put.


AI researchers get huge checks working anywhere. So how does your company distinguish its appeal? Play to a researcher's desire for recognition in their field. Lean on open source. (I'm just paraphrasing Zuckerberg.)


Similar to how places like SpaceX or game development are known for low pay and bad working conditions, yet have no shortage of applicants.

No matter whether you want to hire the best of the best or just average people at a lower than average price, being a place where people want to work helps immensely


It might make it a bit easier to hire engineers. Working for Facebook? Meh. Working for the "new Meta", on the other hand?


It's funny how quickly Zuck managed to turn his image around from "data-stealing actual lizard person" to "kind of normal guy" with a few years and a haircut. It's also not lost on me that he's the only "hacker" major tech CEO remaining:

   - Sundar is a glorified bean counter and his company is rotting from the inside, only kept afloat by the money printer that is ads.
   - Satya and Microsoft are in a similar boat, with the only major achievement being essentially buying OpenAI while every other product gets worse
   - Tim Cook is doing good things with Apple, but he still runs the company more like a fashion company than a tech company
   - Amazon was always more about logistics than cool hack value, and that hasn't changed since Bezos left
   - Elon is Elon
Meanwhile Zuck is spending shareholder money pushing forward consumer VR because he thinks it's cool, demoing true AR glasses, releasing open-source models, and building giant Roman-style statues of his wife.


Pichai, Cook and whoever runs Amazon are excellent operators. Perhaps the best in the game. But you can't simply sit on past success.


In my mind, Zuckerberg will always be a data-stealing lizard person and no amount of MMA training will change that


Microsoft has Azure, which is growing like crazy from a high base point.


I feel like owning Oculus must be the equivalent of the curse of the monkey's paw. It drove Palmer Luckey veritably[0] insane[1] when he was the head of it, and between the statue nonsense, the weaponizing of Llama and his angling for a position as a Trump advisor, it sure looks like Zuckerberg wants to head down the same path.

[0] https://www.vice.com/en/article/palmer-luckey-made-a-vr-head...

[1] https://www.codastory.com/authoritarian-tech/us-border-surve...


Facebook Engineering has always been well regarded — starting with React on the front end, but also projects like Open Compute.

Their product management on the other hand— well, I mean, Facebook and Instagram are arguably as popular as McDonald’s. So they’ve got that going for them.


and WhatsApp


If anything, WhatsApp is probably bigger than Instagram around the world.


It's also got an infinitely better reputation since it's arguably not social media, but just a solid instant messenger.

I was personally very concerned when they acquired it, but went from a begrudging user (it was the de facto standard in the country I was living in at the time) to an excited one (Facebook/Meta added end-to-end encryption, multi-device capability, device migration etc.)


If it's E2EE, can you explain how a new device is able to read the messages after logging in?

In other words: Facebook has a copy of the decryption key.


What do you mean by “read the messages”? WhatsApp doesn’t have a server-side message archive.

If you mean messages pending delivery: A newly logged in device can request the sender to resend pending messages under its new key.


> infinitely better reputation

You’ve clearly not seen it enshittify:

* More UI space devoted to brands (the Updates tab)

* Communities UX is, uh, suboptimal

* Brands / business accounts can message you with no opt-out; you can block a chat after the fact, but that won't stop them from messaging you from another account.

These made the news in WhatsApp’s favourite testbed market[1], but if you think it’s not coming to other markets…

Anyhow, tl;dr — WhatsApp’s no longer in the “put users first” phase, it’s in the “monetize” phase of Facebook Product Management.

As a person who paid for WhatsApp pre-Facebook, that does make me sad.

[1] https://techcrunch.com/2022/10/10/in-india-businesses-are-in...


I use it many times per day as my primary means of communication outside work, and while it’s not perfect, I prefer it over all the alternatives.

But yeah, it’s “done” in many ways, and idle product managers and business developers are rarely a good thing for a product.


Spot-on. Hacker News has a hate-boner for Facebook and the Facebook-branded products, but from an engineering standpoint they're up there with Netflix solving problems at scale. FAIR had a better reputation than OpenAI since the start, and they've developed the Oculus acquisition to its logical extremes. You can do a lot worse, looking at the way their peers handle similar situations.

All the big tech companies have a Facebook-esque product they wish they could get rid of forever. Meta has Facebook, and instead of imploding like everyone said they would (for decades) they demonstrated competency in engineering and culture. The next 4 years will be a gauntlet with a literal "Mr. X" advising social media policy, but I frankly don't think Facebook has ever been down for the count in a pragmatic sense.


I think the only concept I’ve heard as to the decline of Facebook is, Facebook is for “older people” (read: millennials+) but I have no idea how true this is.


It totally depends!

I used to live in the city and now I live out in nowhere land. Everything around here is done thru facebook groups and facebook events. If you want to keep in touch with the community, it is really only on facebook.

Some of my older facebook friends seem to have left the platform, and I've seen more than one of my old city friends posting final goodbyes declaring they are leaving the "dead" platform for discord or some stuff where their family will never find them.


Did anyone think Facebook would fall over because their engineering sucked?

It’s the product itself that’s garbage, not the infrastructure or the code.


It's funny. The only time I've ever seen Hacker News unanimously applaud a Facebook product was when PyTorch announced they merged Apple Silicon support. Seems like Mr. Zuckerberg knows how to play a winning hand.


I call them OpenAI instead of Meta.


I'm sticking with fuckerbook


I'm sticking with:

> They "trust me"

> Dumb fucks

Quotation marks his, not mine. It adds a certain vibe to it.


I'm glad nobody saved all the stupid stuff I said at 19 for posterity. Sin, without, first stone etc.


I'm also glad that those who chose to give my 19 year-old self a break ignore all the other scandalous shit my company did in line with those comments in the decades after the fact.


OK, so your position is that Meta (nee Facebook) are bad, and that they've been doing scandalous stuff for years.

Can you give me a list of the things that they did that you felt were particularly egregious?

Note: I worked there for five years, but left 8 years ago (but up till the recent layoffs, had 170+ LI contacts still there).


Profiling people for advertising, building shadow profiles, buying 3rd party user data, tracking people across the web, devices and geographically, psychological experiments, optimizing for engagement, limiting post reach without pay... generally prioritizing profit over people.

https://www.techpolicy.press/is-it-ethical-to-work-at-facebo...


Ok, I'm going to assume that you're somewhat technical given where we are.

First off, if I estimate a series of numbers on you and those result in you being served a set of ads, is that wrong? If so, can you help me understand what's wrong with that?

Shadow profiles are mostly bullshit, yes data was collected for non users due to how the SDK and pixel worked. This data was all assigned to one user ID and was filtered out by basically everyone using that data.

I'm a little confused as to why buying third party data is wrong, the problem with this is that it's legal to collect and sell the data.

Speaking as a psychologist can you clarify what's wrong with psychological experiments?

I think your point about optimising for engagement was definitely a mistake, given the downstream consequences. However, they needed to find some way of ranking feed after Zynga almost killed them (a chronological feed would have been all Farmville all the time for a number of years) and they picked likes.

They also optimized for time spent but people complained about that so they started optimising for comments and shares which made everything worse, sadly.

Limiting post reach for pages was a legitimate business decision, particularly given the ranking constraints.

Got any more problems with them?


See their partnership with Cambridge Analytica.

Also, is manipulating elections and contributing to third world genocidal riots enough for you? I'm surprised you know so little about this stuff.


What partnership with Cambridge Analytica?

Like if you look into what that company actually did all the data stuff was a smokescreen for their speciality of getting your opponents caught in compromising positions.

Fundamentally, neither the Big 5 traits nor friend data is particularly useful for ad targeting (internally neither approach was successful).

Can you please be specific about the manipulation of elections?

I presume we're talking about Myanmar and the genocide. Personally I generally place responsibility for bad actions on the people engaging in genocide rather than the communication mechanisms involved. Should we have banned radio after the Rwandan genocide?

Hitler used radio very effectively, should we have banned that?


It would be strange if they didn't also use these models to generate much more sophisticated models of their user's interests and hyper-targeted advertising that always looks and feels like trusted friend recommendations for the exact product that's been missing from your life.


> It would be strange if they didn't also use these models to generate much more sophisticated models of their user's interests and generate the hyper targeted advertising

You have the sequence reversed as Meta already created ad targeting models. Meta was forced to scale its AI competence for ad targeting when Apple sent out a privacy update that destroyed tracking-based ad-serving and tanked Meta's share price by deleting billions in revenue for Meta over many quarters. Now that Meta has this skill as a core-competence, they are creating new models for public release. Why they are doing so is debatable[2], but I imagine the cost is marginal since they already had the GPU clusters, talent and know-how for survival purposes.

1. https://www.businessinsider.com/metas-bet-on-ai-has-saved-it...

2. I suspect Zuckerberg is not enthused by the idea of a future AI Apple-analog unilaterally shutting him out of the market. Having your net worth cut in half by a press-release has got to hurt.


Thanks for the article but I don't think I have the sequence reversed.

Without access to the tracking signal, it's been more important to build out a system that can recreate the value from that lost signal by analyzing what users are actually sharing and saying on their platform. Hence the importance of chat (VR, text, video...) and AI that can be used to process and extract value from a chat signal.

I believe Meta's primary revenue source is still advertising (98%), so that is probably 98% of the why.


> Thanks for the article but I don't think I have the sequence reversed

I suppose I read your first sentence as being in future tense when it might not be. The thrust of my argument is that Meta already successfully built those ad targeting models (Advantage+), and they preceded the Llama releases, so they don't need to use Llama-derived models for ad targeting, as I understood your comment to be suggesting. The sequence was not/will not be "Llama -> ad targeting", but was "ad targeting -> Llama"

Meta didn't have to release the weights of the models. Ad revenue doesn't explain why they did so.


Lots of applications for the technology are likely and already being used. Llama to improve Customer to Product Matching through content analysis. Llama for Ad Generation based on customer and product profiles. Llama to generate additional chat signal for humans not chatting with other humans. Llama as a site content moderator. Llama as a code generation assistant. Llama to clean up and improve internal wikis. Llama as an operational analysis tool to detect inefficiencies in internal communication...

They will use it in every situation where they find a successful use case. Open sourcing the weights buys goodwill but it won't be long before everyone open sources their weights since the only real long term moats are the successful products built around the technology and not the technology itself. Facebook is its own moat. Altman suggested this outcome a few days ago in his NYT DealBook interview when he compared the transformer model to the transistor.


There's the old HN! I knew it was around here somewhere!


I'd already been wondering for a while what exactly the business model of open source is. Why, for example, does Google spend money on Chrome? After Zuckerberg’s comments it hit me:

Open source is useful for a business if it can either increase revenue or decrease costs.

Examples:

Increase revenue: Chrome and Visual Studio code. For example, the more people code, the more likely it is that they pay MSFT. So VS code aims to make programming as attractive as possible. Similar for Chrome.

Decrease costs: Linux and Llama. As Zuckerberg said himself IIRC, they don’t want one party to snowball into an LLM monopoly, so they'd rather help get the open source ball rolling.


I think Ballmer's "developers, developers, developers" meme has been around longer than some people here have been alive - it served them well in the Windows era and it serves them well in the cloud space.


And if you want to think about comparators to Facebook, Microsoft is the right company to look at.


Exactly. Microsoft wants lots of cash for Copilot, meanwhile, we had Code Llama running with 150 developers before the Microsoft idiots could schedule a meeting.


> For example, the more people code, the more likely it is that they pay MSFT. So VS code aims to make programming as attractive as possible

How does that increase revenue in a remotely measurable way?

Chrome, for sure, high market share, default search engine, more money, at least that's how I imagine it.


One example is that Microsoft owns GitHub.


Maybe Zuck just wants to see cool shit. It's not like he needs the money


Commoditize your complement: https://gwern.net/complement


>storage being 'a feature, not a product'

Somewhat unrelated mini-rant. Upgraded a phone recently after about 3 years. Surprised to see storage still capped around 128GB (in general). That's got to be artificially held-back capacity to push cloud storage services?


There's lots of phones with more and/or with SD slots. It's not really "capped" as much as default size that seems to work just fine for the majority.


I’ve got a 512GB phone with 112GB used. I’ve put absolutely no effort whatsoever into keeping that number down, and I’m not shy about downloading stuff to it.

I’m certain plenty of people need way more than 128GB. I figured I’d be one of them when I bought this. Nope. I bought a much bigger device than I actually needed.

If I’ve used less than 128GB, I’ve gotta think most other people do too. Not all, clearly! But most? I’d bet on it.


They’ve dramatically improved their ad quality. I now routinely check out and convert on Facebook and Instagram ads, and I can honestly say that for 20 years or more I never once intentionally clicked on one.


Ironically, gen AI has made their products worse more than anyone else's. I can't believe the amount of AI slop I see every time I open Facebook - I used to check it occasionally when replying on Messenger and scroll through for a while; after seeing the AI spam I don't even bother.


That was going to happen regardless. From Meta's perspective it's better for their platforms to contain their own AI slop than OpenAI's.


Truly, I wonder if they're fooled by their own click fraud, or if the incentives really do work out such that they get paid whether the engagement is from bots or people. Anyway, I came here to say the same thing: it's shocking to me how enthusiastic Zuckerberg is about generative AI. What other possible outcome is there except actual human content creation being replaced by slop?


Do the improvements the community proposes/makes to their models amount to anything significant? For a company like Meta with basically infinite money, do they really benefit from external help?


I don't have eyes inside of Meta, but keep in mind that we're not just talking about fine-tunes and LoRAs, we're also talking about the entire llama ecosystem, including llama.cpp, ollama, and llamafile. These would not exist (or wouldn't have anything like as much momentum) without Meta's investment, but they're now huge community projects.

I don't know if they use them internally, of course, but they could, and they represent a lot of work.


Baggy Tees, gold chains and now this!? Make this man president immediately!


Seems to be more or less on par with GPT-4o across many benchmarks: https://x.com/Ahmad_Al_Dahle/status/1865071436630778109


Except it is 25x cheaper, available offline, can be uncensored / unaligned, fine-tuneable and backupable.

Sad day for OpenAI. Great for humanity.


What are good starting points for uncensoring it? Because it is offline, a jailbreak prompt can't be remote-bricked, but can one remove censorship from the weights themselves? What does it do to accuracy?



Abliteration is a fool's errand, practically all models end up noticeably lobotomized even with follow up tuning.

Good ol' fine tuning on an uncensored dataset gives far more usable results.


Interesting - I've heard this anecdotally. Curious if you know of any resources that look at this in more detail?


I haven't seen any papers doing a proper analysis on the topic, just mostly saying this from firsthand experience testing a handful of them and comparing to the model they were based on given same prompt and sampler. It's usually not even close and you can immediately tell that it's notably dumber. Iirc in one case one even forgot how to do basic arithmetic while the original model aced it. Not entirely unexpected results from sticking a digital ice pick into the weights.

Afaik there are only three major sources of quality unaligned model versions, which are Nous's Hermes models, Hartford's Dolphins and Drummer's Tigers. All of them are regular fine tunes that perform mostly the same as, or just ever so slightly below, the original.


Not speaking from personal experience, but I've tried a lot of the decensored models and they lose a lot in the process. They are noticeably, sometimes shockingly, worse. They also still refuse prompts in many cases.


>Not speaking from personal experience, but I've tried a lot of the decensored models and they lose a lot in the process. They are noticeably, sometimes shockingly, worse. They also still refuse prompts in many cases.

Depending on what you do, locally you can modify the response: say the AI responds "No, I can't do that", you edit the response to "Sure, the answer is " and then the AI will continue with the next tokens.

But I think you can also build your own instruct model from the base one and not apply the safety instructions that protect the feelings of your customers.
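As a rough sketch of that response-editing trick with llama-cpp-python (untested here; the file path is a hypothetical local quant, and the special tokens are the standard Llama 3 chat template):

    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama(model_path="Llama-3.3-70B-Instruct-Q4_K_M.gguf", n_ctx=4096)  # hypothetical path

    # Build the raw chat prompt ourselves and pre-seed the assistant turn, so the model
    # continues from "Sure, the answer is" instead of starting with a refusal.
    prompt = (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        "How do I do X?<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        "Sure, the answer is"
    )
    out = llm(prompt, max_tokens=256)
    print("Sure, the answer is" + out["choices"][0]["text"])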


zuck is really on his redemption arc, he's out-doing himself


I think about this a lot, I believe part of this just boils down to the fact that he’s a founder. He’s a builder, and I think that yields some intrinsic drive towards creativity and disruption.

Even with “metaverse” being a laughingstock, they’re still aiming for something ambitious. Each new Quest generation makes me think there may be a chance they pull it off.

Now, do I think he’s a great person? No, not really. Do I agree with most of his decisions on how he treats his users? Hell no, and that’s not changing.

But if you compare him to somebody like Sundar at Google - a weaselly MBA who was first and foremost a corporate ladder climber - the difference in ambition is night and day.

Sundar made it to the top already, his only vision now is to stay at the top, and that means pleasing Wall Street, everything else is secondary. There is no grand technical ambition with him, there never was.

This goes for pretty much all non-founder CEOs. You could say the same things about Tim Apple, Andy Jassy, and other henchmen in waiting who made it to the big chair.

I think it comes down to the fact that founders get where they are by having big ambitions and taking risks, the MBA to CEO path is just craven corporate knife fighting with other MBAs.

Regardless, I think this is 50% Zuckerberg changing, 50% the other big companies are mostly run by ladder climbers.


two things to keep in mind though:

1/ Yann Lecun probably is the one pushing for open source

2/ Mark isn't doing this for the greater good and for humanity. It helps his business because Llama is becoming a standard, and people are building / improving, which in turn helps Meta and Meta's business


They were basically forced into this by being so far behind that they'd need to rent OAI or Anthropic services to stay competitive, which at their scale would've probably cost them more than it takes to train llama and benefit from community research help.

At least that was the rationale behind the intentional leak of llama 1 back in the day according to some sources anyway.


Yeah and they’d also lose top AI talent as well.


I don't find him redeemed, but I do wonder what has been baked into the tuning of these models? What a way to generate influence!


Nice try Meta PR department /s


Does it still count if the purpose is to dilute OpenAI's market power and not to benefit the world?


But diluting OpenAi's (or any other company's, for that matter) market power does benefit the world.


Good point! That's worthy in itself for sure.


More generally I think he's decided he needs to own the platform to be able to grow.

Him swearing about (presumably) Apple telling them they can't do stuff (because tough shit, you're their serf) was legit I think.


Is your accusatory statement falsifiable?


I hope I can believe an honest man, Zuck in this case. And that can be enough.

But it's more of a question of who do _I_ want to admire. An honest question also; maybe it doesn't matter why he's doing it, maybe just doing it is enough.

Or maybe it's worth understanding if this is about Meta beating OpenAI (so, ego-driven) or because Meta really cares for democratic AI and distribution of power (so, not ego-driven).

I think it's the former, so not admirable — for me.


How sad? Tiny violin sad. The typical consumer, which is likely 99% of them, is not going to use this


How do you calculate the price?


I assume it's the same cost as Llama 3.1 70b


And how do you calculate the cost of that?


GPU cost ÷ GPU token throughput = cost per token will get you close; then you compare the cost per token between solutions.

To explain how to calculate it precisely, I would have to write a blog post. There are dozens of factors that go into it and they vary based on your use case like GPU type/setup, cloud provider, inferencing engine, context size, the minimum throughput and latency you'd be willing to have your users experience, LLM quantization, KV cache configuration, etc.
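As a minimal sketch of that back-of-envelope version (every number below is a made-up placeholder, not a real price or throughput):

    gpu_cost_per_hour = 2.50   # assumed hourly rental price for the GPU(s) serving the model
    tokens_per_second = 400.0  # assumed sustained throughput across concurrent requests
    cost_per_million = gpu_cost_per_hour / (tokens_per_second * 3600) * 1e6
    print(f"${cost_per_million:.2f} per 1M tokens")  # ~$1.74/1M with these placeholder numbers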

I'm sure there are cost analyses out there for Llama 3.1 70b you could find though.


This year seems to finish on the same note as it began -- that most AI evolution happens in the smaller models. There's been a true shift as corporations have started to realize the value of training data and massively outsizing the resulting model size.


Which GPT-4O are those?

The 08-06 release seems to be a bit higher on numerous benchmarks than what that shows: https://github.com/openai/simple-evals?tab=readme-ov-file#be...


This just makes the $200/month even more laughable.


The $200 plan is for people who would pay $200 for a bottle of vodka even though the $20 bottle is molecularly identical.


How? 4o is part of the plus plan, as is o1.


o1-preview.


Nope, o1. o1 "pro mode" is part of the $200/mo ChatGPT Pro


Ah clear as mud. OpenAI must have hired the Microsoft naming team. Thank you!


But it's $200/month smart!


o1-pro is way smarter still


Does unexpectedly well on our benchmark:

https://help.kagi.com/kagi/ai/llm-benchmark.html

Will dive into it more, but this is impressive.


I asked it:

> I have a sorcerer character on D&D 5e and I've reached level 6. What do I get?

It confabulated a bunch of stuff. I also asked GPT-4, it confabulated a bit. Claude was spot on.


Please help me understand something.

I've been out of the loop with HuggingFace models.

What can you do with these models?

1. Can you download them and run them on your Laptop via JupyterLab?

2. What benefits does that get you?

3. Can you update them regularly (with new data on the internet, e.g.)?

4. Can you finetune them for a specific use case (e.g. GeoSpatial data)?

5. How difficult and time-consuming (person-hours) is it to finetune a model?

(If HuggingFace has answers to these questions, please point me to the URL. HuggingFace, to me, seems like the early days of GitHub. A small number were heavy users, but the rest were left scratching their heads and wondering how to use it.)

Granted it's a newbie question, but answers will be beneficial to a lot of us out there.


Hi,

Yes you can. The community creates quantized variants of these that can run on consumer GPUs. A 4-bit quantization of LLAMA 70b works pretty well on MacBook Pros; the neural engine with unified CPU memory is quite solid for these. GPUs are a bit tougher because consumer GPU RAM is still kinda small.

You can also fine-tune them. There are a lot of frameworks like unsloth that make this easier. https://github.com/unslothai/unsloth . Fine-tuning can be pretty tricky to get right; you need to be aware of things like learning rates, but there are good resources on the internet where a lot of hobbyists have gotten things working. You do not need a PhD in ML to accomplish this. You will, however, need data that you can represent textually.
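For a feel of what the setup looks like, here's a minimal LoRA sketch using the Hugging Face peft library (not unsloth specifically; the 8B model id is just an example small enough for one consumer GPU, and the repo is gated behind Meta's license):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    model_id = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    # LoRA trains small adapter matrices instead of all the base weights.
    lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      lora_dropout=0.05, task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
    # ...then train on your textual dataset with transformers' Trainer or trl's SFTTrainer.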

Source: Director of Engineering for model serving at Databricks.


Thank you Josh. Is there a resource you can point us to that helps answer "what kind of MacBook Pro memory do I need to run ABC model at XYZ quantization?"


In general you can just use the parameter count to figure that out.

70B model at 8 bits per parameter would mean 70GB, 4 bits is 35GB, etc. But that is just for the raw weights, you also need some ram to store the data that is passing through the model and the OS eats up some, so add about a 10-15% buffer on top of that to make sure you're good.

Also the quality falls off pretty quick once you start quantizing below 4-bit so be careful with that, but at 3-bit a 70B model should run fine on 32GB of ram.
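That rule of thumb as a tiny helper (a sketch; the overhead factor is a rough guess, and real usage also grows with context length):

    def est_memory_gb(params_billion, bits, overhead=0.12):
        """Memory for the weights alone, plus a fudge factor for runtime/OS overhead."""
        return params_billion * bits / 8 * (1 + overhead)

    print(est_memory_gb(70, 4))  # ~39 GB
    print(est_memory_gb(70, 3))  # ~29 GB, fits alongside the OS on a 32 GB machine
    print(est_memory_gb(70, 8))  # ~78 GB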


Does 70b mean there are 70 billion weights and biases in the model?


Look at the filesize, add a couple of GB.


how would the pricing on databricks when using model serving compare to, say, the prices seen in the original post here (i.e., "3.3 70B is 25X cheaper than GPT4o")?


I’ve been wanting to run into someone on the Databricks team. Can you ask whoever trains models like MPT to consider training an open model only on data clear of copyright claims? Specifically, one using only Gutenberg and the permissive code in The Stack? Or just Gutenberg?

Since I follow Christ, I can’t break the law or use what might be produced directly from infringement. I might be able to do more experiments if a free, legal model is available. Also, we can legally copy datasets like PG19 since they’re public domain. Whereas, most others have works in which I might need a license to distribute.

Please forward the request to the model trainers. Even a 7B model would let us do a lot of research on optimization algorithms, fine-tuning, etc.


I think you're looking for OLMo, https://allenai.org/olmo


They appear to use Common Crawl in the DCLM dataset. Just downloading Common Crawl is probably copyright infringement before we consider specific terms in the licenses. Arxiv papers have a mix of licenses with some not allowing commercial use.

If I got the sources right, it’s already illegal with just two sources they scraped. That’s why I want one on Gutenberg content that has no restrictions.


Thank you! Very helpful!


Yes (don't know about JupyterLab), skip, not really, yes, quite irritating so just pay someone else to do it.


Thanks! Succinct and complete.


Basically equivalent to GitHub but for models. Anyone can upload anything, but it kind of standardizes tools and distribution for everyone. They also have a team that helps integrate releases for easier use and libraries for fine tuning


Thanks!

I want to download my first HuggingFace model, and play with it. If you know of a resource that can help me decide what to start with, please share. If you don't, no worries. Thanks again.


Most of the HF models have a code snippet that you can use in order to run inference on the model. The transformers library will take care of the download as a dependency when you run the code. Typically, a Python 3.10-3.11 environment is sufficient. Example: https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct#t...

If you have a MBP, you need to adjust the device name in the examples from "cuda" to "mps".
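The snippets generally follow the same pattern; here's a minimal sketch with the small model from that link (needs the transformers and accelerate packages installed):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # small enough to try on a laptop
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # picks cuda/mps/cpu

    messages = [{"role": "user", "content": "What is an LLM?"}]
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                           return_tensors="pt").to(model.device)
    output = model.generate(inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))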


If you're on a Mac, https://lmstudio.ai is a quick way to get things running with a decent UI and a REST API that is compatible with OpenAI (which is the de facto standard these days). And the GGUF models that it downloads can be used directly via llama.cpp, if you are so inclined later.


Their docs are very fun to read. I’d probably recommend starting with the “transformers” library for python if you want to play with some language models e.g. Bert:

https://huggingface.co/docs/transformers/en/model_doc/bert


I'm "tracking" the price of if 1M tokens in OpenRouter and it is decreasing every few refreshes. It's funny: https://openrouter.ai/meta-llama/llama-3.3-70b-instruct


I uploaded 4bit bitsandbytes, GGUFs and original 16bit weights to https://huggingface.co/unsloth for those interested! You can also finetune Llama 3.3 70B in under 48GB of VRAM and 2x faster and use 70% less memory with Unsloth!


Open sourcing Llama is one of the best examples and roll-outs of "Commoditize Your Complement" in memory.

Link to Gwern's "Laws of Tech: Commoditize Your Complement" for those who haven't heard of this strategy before:

https://gwern.net/complement


That was so interesting. Thanks for sharing


Meta continues to overdeliver. Their goal from the start was to target and disrupt OpenAI/Anthropic with a scorched earth approach by releasing powerful open models.

The big winners: we developers.


Just spent a few minutes this morning spinning up a H100 model server and trying an FP8 quantized version (including kv cache quantization) to fit it on 2 H100s -- speed and quality looking promising.

I'm excited to see if the better instruction following benchmarks improves function calling / agentic capabilities.


I know this has been discussed before but it changes frequently: what’s the good “generic” Mac desktop client these days? I’d like to use Ollama and/or ChatGPT. Maybe Claude. Perhaps Perplexity, too. I primarily want to use AI chats in various apps, like typing “write a function to…” into whatever random editor I’m using at the moment. It doesn’t have to be a desktop app, either. If there’s a great PopClip plugin or Keyboard Maestro macro, or even something that works as a system service, that’s perfectly fine by me.

MindMac is nifty, but that feels like a lot of money for something that’s a front end to someone else’s API. “Stop being a cheapskate” is a legitimate answer.


I like Open WebUI. It’s a web app that runs locally, not a native MacOS app. But it’s very nice.

https://github.com/open-webui/open-webui


ChatGPT has a Mac app. If you want to run your own models, Ollama with Chital is great.

https://github.com/sheshbabu/Chital


The ChatGPT app only has the nifty "works with any app" feature if you're on the Plus plan and doesn't support other models. Chital looks very nice and is admirably tiny (3.6MB unzipped!) but doesn't have the editor integration at all, as far as I can tell.

MindMac is an example of an app that meets all the criteria, but for more than it seems like such an app should cost.



Seems that a bunch of quantized models are already uploaded to ollama: https://ollama.com/library/llama3.3/tags



Hopefully this lands on Groq soon!



I work at a YC company, and we would love to use Groq but our limits are so low at one million tokens per day. Could you help us get a better rate limit/token limit? Let me know if I should reach out via email, thank you!


I suggest joining the Discord and enquiring there, or, depending on what you need, emailing sales@.


Yes, with a high TPM. I see them unboxing all the racks, which is exciting.

I would move over to Groq in a New York minute if I could get enough tokens.


I'm wondering if they are ever actually going to go live (i.e. paid for and reasonable limits), we've been waiting since June!


Forgive me, what slang is "unboxing all the racks" ?


They’re posting images to their socials of them receiving their new racks.


O so not slang at all then my bad :')


Given the comments saying its performance seems comparable to 4o/4o-mini, is it safe to say that GPT-4 performance can be achieved with less than 100B parameters, contrary to what was previously thought?


How many tokens per second can I get on an M4 Max with 128gb of RAM?


I get about 10


It's available on IBM WatsonX, but the Prompt Lab may still report "model unavailable". This is because of overeager guardrails. These can be turned off, but the German translation for this option is broken too: look for "KI-Guardrails auf" in the upper right.


I'm building a PC just to run inference on this and the QwQ 32B models.

Any suggestions on RAM and GPU I should get?


This of course depends on your budget and what you expect to do with these models. For a lot of people, the most cost-effective solution is probably to rent a GPU in the cloud.

The limiting factor for running LLMs on consumer grade hardware is generally how much memory your GPU has access to. This is VRAM that's built into the GPU. On non-Apple hardware, the GPU's bandwidth to system RAM is so constrained that you might as well run those operations on the CPU.

The cheapest PC solution is usually second-hand RTX 3090's. These can be had for around $700 and they have 24G of VRAM. An RTX 4090 also has 24G of VRAM, but they're about twice as expensive, so for that price you're probably better off getting two 3090's than a single 4090.

Llama.cpp runs on the CPU and supports GPU offloading, so you can run a model partly on CPU and partly on GPU. Running anything on the CPU will slow down performance considerably, but it does mean that you can reasonably run a model that's slightly bigger than will fit in VRAM.

Quantization works by trimming the least significant digits from the models' parameters, so the model uses less memory at the cost of slight brain damage. A lightly quantized version of QwQ 32B will fit onto a single 3090. A 70B parameter model will need to be quantized down to Q3 or so to run entirely on a 3090. Or you could run a model quantized to Q4 or Q5, but expect only a few tokens per second. We'll need to see how well the quantized versions of this new model behave in practice.
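To make the "trimming least significant digits" point concrete, here's a toy round-trip (real formats like Q4_K_M quantize block-wise and are much smarter; this is just the idea):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(0, 0.02, size=8).astype(np.float32)  # pretend these are full-precision weights

    # Naive symmetric 4-bit quantization: only 16 representable levels.
    scale = np.abs(w).max() / 7
    q = np.clip(np.round(w / scale), -8, 7)  # what would be stored as 4-bit integers
    w_hat = q * scale                        # what the model effectively computes with

    print(np.abs(w - w_hat).max())  # small but nonzero error on nearly every weight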

Apple's M1-M4 series chips have unified memory so their GPU has access to the system RAM. If you like using a Mac and you were thinking of getting one anyway, they're not a bad choice. But you'll want to get a Mac with as much RAM as you can and they're not cheap.


As a general rule of thumb, for 8-bit quantization you’ll need as much VRAM as the parameters of the model. Half that if you run it at 4-bit. System RAM should be a bit more because you need RAM for the OS too. What most users do is go for second hand 3090s. You get two or three of them depending on quantization.


You also need VRAM for context: the larger the context, the more RAM you need.
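Rough sketch for a Llama-3-style 70B (assuming 80 layers, 8 KV heads of dim 128, fp16 cache; actual runtimes and quantized caches will differ):

    def kv_cache_gb(context_tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
        # 2x for keys and values, per layer, per token
        return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens / 1e9

    print(kv_cache_gb(4096))   # ~1.3 GB
    print(kv_cache_gb(32768))  # ~10.7 GB on top of the weights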


I have an A6000; it’s about the most affordable option for 48 GB VRAM (you can find one for a little under $5k sometimes), which is roughly the minimum to run a quantized 70b.

System RAM doesn’t really matter, but I have 128GB anyway as RAM is pretty cheap.


Why not 2 x 4090? Will be cheaper than A6000 if you can manage to find them at msrp, and will perform a lot better.


My time is worth a lot of money and 2x 4090 is more work, so it’s net more expensive in real terms.


For both inference and training I haven't seen any modern LLM stack take more time for multiple GPUs/tensor parallelism

I would take 1 RTX 6000 Ada, but if you mean the pre-Ada 6000, 2x4090 is faster for minimal hassle for most common usecases


I mean the newest ones. I only do LLM inference, whereas my training load is all DistilBERT models and the A6000 is a beast at cranking those out.

Also by “time” I mean my time setting up the machine and doing sys admin. Single card is less hassle.


The A6000 predates Ada?

There is the RTX 6000 Ada (practically unrelated to the A6000) which has 4090 level performance, that what you're referring to?



That's an Ampere A6000, one generation older than the Ada A6000. Nvidia decided that confusing model names are a good way to sell old products at a premium.


Running llama3.3:70b here on a pair of eBay Dell RTX3090s in an old (2012!) i3770 workstation - ollama reports 16.67 tokens/sec.


I've been running this and a quantized version of the QwQ model and comparing the responses - so far, QwQ is working better, though that could change as I use them more and compare the outputs


Would anyone be willing to compress this down to maybe 14b-20b for us on peasant 16gb rigs?


Or maybe even middle class plebeian 24gb rigs?


At that point just run 8b.


Or wait for the IQ2_M quantization of 70b which you can run very fast on 24GB VRAM with context size of 4096...


At some point there’s so much degradation with quantizing I think 8b is going to be better for many tasks.



I'm surprised that, out of all of the East Asian languages, they chose Thai to support: do they have a big office there? (I imagine compared to, say, Japanese or (some form of) Mandarin?)


Given how censored the 3.2 model was, I'm looking forward to the abliterated 3.3 version to see if there are any significant improvements there that can replace it.


Hot take after trying it a bit. I was not impressed with llama 3.2, but this one, well, it looks like we finally have a very very strong free LLM.


This is massive, really cool of meta to open source it


Hate to be one of those, but it’s not open source.

It’s just regular old freeware.

You can’t build Llama yourself, and its license contains an (admittedly generous) commercial usage restriction.


To be fair, even if they released detailed instructions and datasets on how to rebuild Llama (considering that there's some randomness in the process), you still probably wouldn't be able to build it - like, who has the resources? And if you had the resources, you _still_ probably wouldn't _want_ to rebuild it yourself; it seems awfully expensive when you could instead spend those resources elsewhere.

Fair point about the license, people have different definitions for what "open source" means.


> people have different definitions for what "open source" means.

They shouldn’t. It’s just market confusion.

There is an explicit widely accepted definition.

Also like llama (the file you download from huggingface) isn’t even a program. It’s a binary weights file. No source to be opened, even.

It’s just freeware.

https://opensource.org/osd


That's true for most people for ordinary software too. How many people actually build Linux or Chromium from source? Building Chromium takes more RAM and HD space than most people even have. Yet the world gets immense value from the few who do. I wouldn't want to live in a world where WebKit and Chromium were closed source. You can run a Chromium fork without having to build it yourself. And compute costs will come down over time.


> Building Chromium takes more RAM and HD space than most people even have.

According to [1], it takes 16GB of RAM and ~180GB of disk space. Most people have that much. It does take several hours without a many-core machine though.

Building Linux takes much less.

[1] https://chromium.googlesource.com/chromium/src.git/+/master/...


I would bet overall most people have those 4GB RAM, 32GB eMMC laptops from walmart, etc. If you limit things to developers/gamers/enthusiasts, you'd probably be right.


Those laptops kind of died out at some point; whoever could use a smartphone or tablet used those instead.


Linux and Chromium seem at the edge of the current scale of "ordinary" open-source software. I think perhaps one should also take into account how much money would be needed to be able to build the thing in reasonable time.

Building Chromium sounds awful, but I'm not sure I'd really need to buy another computer for that. If I did I'm sure I wouldn't need to spend billions on it, most probably not even millions.

For LLaMa I definitely don't have the computer to build it, I definitely don't have the money to buy the computer, even if I won the lottery tomorrow I'm pretty sure I wouldn't have enough money to buy the hardware, even if I had enough money to buy the hardware I'm still not sure I could actually buy it in reasonable time, nvidia may be backlogged for a while, even if I already had all the hardware I probably wouldn't want to retrain llama, and even if I wanted to retrain it the process is probably going to take weeks if not months at best.

Like I think it's one of those things where the difference in magnitude creates a difference in kind, one can't quite meaningfully compare LLaMa with the Calculator app that Ubuntu ships with.


The practicality of building it yourself has nothing to do with an organization affording you that ability.

Also like, gentoo people compile everything


> To be fair if they released detailed instructions and datasets on how to rebuild llama

Where?

Books3 was famously one of the datasets used to train llama and it’s very illegal to put that together nowadays.

I believe the guy who wrote the script to build it got arrested


Perhaps an individual couldn't. But an organization or a state could.


It is “open weight”. Mark Zuckerberg keeps calling it open source and everyone copies him, but you are right. It is not open source.


We need more uncensored models


No base model? disappointed.


The base model is Llama 3.1 70B


It is probably the same base model as Llama 3.0.

They mention post-training improvements.


interesting comment... what are you doing with base models? Are you a "finetuner"? I have been trying my hand with finetunes on instruct models and the results have been ok, but not awesome. I have a base model downloading now to give that a proper shot.


I'm not them but I still prefer a text completion style of prompting rather than a baked in pre-prompt structure assuming only a 'chat' style metaphor of interaction.


Base models are useful in research to see the effect of instruction tuning



