Llama-3.3-70B-Instruct (huggingface.co)
425 points by pr337h4m 40 days ago | 219 comments



Benchmarks - https://www.reddit.com/r/LocalLLaMA/comments/1h85ld5/comment...

Seems to perform on par with or slightly better than Llama 3.2 405B, which is crazy impressive.

Edit: According to Zuck (https://www.instagram.com/p/DDPm9gqv2cW/) this is the last release in the Llama 3 series, and we'll see Llama 4 in 2025. Hype!!


I'm getting 2.12 tok/s[1] on a 24GB (4090) GPU and 64GB (7950x) CPU memory, splitting the model across the GPU and CPU (40/80 layers on GPU) with lm-studio. Output looks good so far, I can use something like this for a query that I want as good an answer as possible and that I don't want to send out on the network.

If we can get better quantization, or bigger GPU memory footprints, we might be able to use these big models locally for solid coding assistants. That's what I think we have to look forward to (among other benefits) in the year(s) ahead.

1. lmstudio-community/Llama-3.3-70B-Instruct-GGUF/Llama-3.3-70B-Instruct-Q4_K_M.gguf


The 32B parameter model size seems like the sweet spot right now, imho. It's large enough to be very useful (Qwen 2.5 32B and the Coder variant are outstanding models), and they run on consumer hardware much more easily than the 70B models.

I hope Llama 4 reintroduces that mid-sized model.


qwen2.5 looks like magic compared to llama3.2.


A question: How large an LLM can be run at reasonable speed on 12GB (3060), 32GB RAM? How much does quantization impact output quality? I've worked with image models (SD/Flux etc) quite a bit, but haven't yet tried running a local LLM.


I got a 2080Ti with 11GB, I can fit Gemma 2 9B Q5_K_M or LLama 3.2 Vision 8B Q4_K_M in memory (if I nuke Firefox's GPU process first).

Speed takes quite a hit once you have a few layers on the CPU, but depending on needs it can be doable. I've just asked LLama 3.3 70B Q5_K_M a question and it offloaded about 5 of the 80 layers, so running almost entirely on my 5900X CPU, but still churning out about one word per second.

In my experience quantization affects prompt adherence primarily and answer accuracy secondarily. For example, if you have multiple clauses, i.e. one or more "if this then that", then quantization might cause it to not consider those. I also find the answers tend to be more general and less precise at higher quantization levels.

As a concrete example, I've been asking the LLama 3.2 Vision 8B model to categorize some images. The default instruct model in general has been heavily trained to output general commentary on the image. If in the prompt I tell it to "output the category only", the Q4_K_M variant sometimes ignores that instruction, while the Q8 variant almost always respects it.

Larger models primarily bring more knowledge in my experience, but usually also better prompt adherence. Larger models also typically can support larger contexts, though this can vary, check the model cards.

edit: I should clarify. More knowledge also often translates to better, more accurate output. For example, a larger model might recognize an idiom and answer accordingly, while the smaller model fails to recognize it and thus provides a poor answer.

Depending on your needs, 12GB might be quite decent or it might be insufficient. If you need an assistant-like model, I liked the Gemma 2 9B Q5_K_M. And I've been quite impressed by LLama 3.2 Vision 8B Q4_K_M for describing images and transcribing text from images.

But for more open-ended stuff, especially if larger contexts is needed, I think you might find it underwhelming.


Thanks, much appreciated!


You're welcome! Should have specified the reason I stick with LLama 3.2 Vision 8B Q4_K_M is that the Q8 doesn't fully fit in memory and so, as mentioned, is significantly slower.

But I do like to compare. Open WebUI for example makes it very easy where you can load up multiple models and it'll send the same prompt to each one in turn, and show the answers side by side.


>How large an LLM can be run at reasonable speed on 12GB (3060), 32GB RAM?

If you want to offload fully to VRAM, I'd say 8B is the limit. If you're keeping some on RAM, 15-20B can still give OK performance, depending on your tolerance.

>How much does quantization impact output quality?

Basically with more quantization the output becomes more incoherent and less realistic. At the extreme end it's basically just gibberish. I think the sweet spot generally is at 4 bits. At that point the model is pretty compact and the quality isn't diminished too much.


He could probably do 12b (Nemo) up to 14b (Qwen 2.5) at 4bpw with exllamav2


You can download LM Studio today and try it out. I've had success with the Mistral Small Instruct IQ3M model, which fits in VRAM.


How do you measure tokens/sec? Here's my attempt on a new M4 Max 128GB, does about 6 words/sec:

    bash> time ollama run llama3.3 "What's the purpose of an LLM?" | tee ~/Downloads/what\ is\ an\ LLM.txt
    A Large Language Model (LLM) is a type of artificial intelligence (AI) designed to process and understand human language. The primary purposes of an LLM are:
(... contents excerpted for brevity)

    Overall, the purpose of an LLM is to augment human capabilities by providing a powerful tool for understanding, generating, and interacting with human language.

    real  0m59.040s
    user  0m0.071s
    sys 0m0.081s

    pmarreck  59s35ms
    20241206220629 ~ bash> wc -w Downloads/what\ is\ an\ LLM.txt
        359 Downloads/what is an LLM.txt


LM Studio puts stats at the bottom of each reply like: 2.09 tok/sec, 346 tokens, 1.74s to first token. This was for a 259 word response, so ~ 0.75 words/token. If that ratio holds, you might be getting 8 tok/sec on your M4 Max?
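Rough math behind that estimate, as a quick sketch (the 0.75 words/token ratio is just my guess from the LM Studio stats, and your `time` number includes model load):

    words = 359             # from the `wc -w` above
    seconds = 59.0          # from `time` (includes model load, so generation itself is a bit faster)
    words_per_token = 0.75  # rough ratio; varies by model and text
    print(f"{words / words_per_token / seconds:.1f} tok/s")  # ~8 tok/s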

Looks like LM Studio is available for ARM based Macs, if you want to give that a try, that'd be one way to get these stats. LM Studio also surfaces up some parameters to play around with, and keeps a record of past conversations if that might appeal to you.


Yeah, I already use it along with Ollama, I just didn't notice those stats I guess!


Just add "--verbose" to your run command, e.g. "ollama run mistral-nemo:latest --verbose", it'll dump the token counts and timing info after each message.


Any opinion on whether the q4 quantization is stable/effective? That's a lot of quantization.

Edit: Perhaps answering my own question:

λ ollama run hf.co/lmstudio-community/Llama-3.3-70B-Instruct-GGUF:Q4_K_M

>>> Hi. Who are you?

I'm an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."


Q4_* is by far the most popular one for local use, and it works "fine" meaning that you do see some effect on perplexity and other scores, but it's small enough to not be a concern in most cases. Although it should be noted that this can depend on the model - e.g. there have been some reports that for QwQ, going 4-bit does adversely impact the quality of its CoT.

With Llama specifically, I recall someone comparing various quants on perplexity and finding that even at Q3, 70B is still smarter than 34B. So quantization is generally worthwhile so long as it lets you squeeze a larger model into your VRAM budget, and you don't mind the slowdown from more parameters.


I prefer larger quantization and longer response times, but only have so much memory and am only willing to wait so long. I can work with 2 toks/s and this kind of accuracy for these kinds of queries. I expect you'll need to play around with it to find out if it would be helpful in your case. Here's what I'm getting asking 'who are you?'

System Prompt: You are an advanced super intelligence, capable of everything from software development to dream interpretation.

>> Who are you?

  I am an artificial intelligence designed to simulate human-like conversation and answer questions to the best of my knowledge. I'm a large language model, which means I've been trained on a massive dataset of text from various sources, including books, articles, research papers, and websites.

  My capabilities are diverse and extensive. I can:

  1. **Answer questions**: I can process natural language queries and provide relevant answers, covering a wide range of topics from science and history to entertainment and culture.
  2. **Generate text**: I can create text based on a prompt or topic, including articles, stories, emails, and even entire books.
  3. **Translate languages**: I can translate text from one language to another, including popular languages such as Spanish, French, German, Chinese, and many more.
  4. **Summarize content**: I can summarize long pieces of text into concise, easily digestible versions, highlighting key points and main ideas.
  5. **Offer suggestions**: I can provide suggestions for topics such as gift ideas, travel destinations, books to read, and more.
  6. **Engage in conversation**: I can have a natural-sounding conversation with you, using context and understanding to respond to your questions and statements.
  7. **Dream interpretation**: As you mentioned, I can even help interpret dreams, providing insights into the symbolism, themes, and emotions that may be present in your subconscious mind.
  8. **Software development**: I can assist with software development tasks, such as writing code, debugging, and optimizing algorithms.

  My knowledge is constantly evolving, and I'm always learning from the interactions I have with users like you. So, feel free to ask me anything – I'll do my best to help!


Can llama.cpp make use of the gpu built into the 7950x CPU? I assume that would improve performance.


The limit is memory bandwidth; a dedicated GPU will have higher memory bandwidth than a CPU or iGPU ever will.


Granite Rapids memory bandwidth is between 614 and 844 GB/s.


It would make no difference at best for token generation and would actually run slower for prompt processing since it has fewer gflops than the CPU cores.


FYI, due to Llama's naming scheme, there is no such thing as Llama 3.2 405B. 8B/70B/405B models are either Llama 3, 3.1, or 3.3 (except for 405B which wasn't initially released), while Llama 3.2 only contains 1B, 3B, 11B (vision), and 90B (vision) models. It's a bit confusing.


Ah, so I guess the comparison is to Llama 3.1 405B.


Still very impressive. Llama team is absolutely killing it right now, and the openness makes them the most important player IMHO


It could be worse. It could’ve been Llama 3.1 (New)


yeah I use Llama 3.2 3B and I'm blown away

but also wrestled with this mentally.

Meta improves both the technology and inference, while also trapping themselves, alongside everyone else training models, into updating the training set every few months so it knows what it's talking about with relevant current events


Given how close it is to 405B in performance it would be interesting to see which has the edge comparing an unquantized 3.3-70B against 405B quantized to be the same size.


That would be 1.38 bits per weight on average, which I can confidently guess would not perform well.


BitNet is functional at 1.58 bpw.


The model card says the 70B is 16 bit so I think you have twice that
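Quick back-of-envelope check, assuming fp16 weights as the size budget:

    budget_bytes = 70e9 * 2          # 70B params at 16 bits ≈ 140 GB
    print(budget_bytes * 8 / 405e9)  # ≈ 2.77 bits per weight for a same-size 405B
    # (the 1.38 figure above corresponds to an 8-bit 70B budget instead)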


It's kind of amazing how there seems to be a wall where sizing up the model starts to diminish in terms of intelligence gains. I guess that's why we can still compete with whales even though their brains are like twice as big as ours.


There is a line of thinking that the required tokens shoved through training also needs to go up, perhaps super-linearly to the model size. And of course, there's a line of thinking that there are diminishing returns in all things, including these models. Both could be true!


I have tried 405B at 1-bit quantization. It remains coherent, but didn't seem to be any better than 3.1-70B.


This reminds me of Steve Jobs's famous comment to Dropbox about storage being 'a feature, not a product.' Zuckerberg - by open-sourcing these powerful models, he's effectively commoditising AI while Meta's real business model remains centred around their social platforms. They can leverage these models to enhance Facebook and Instagram's services while simultaneously benefiting from the community improvements and attention. It's not about selling AI; it's about using AI to strengthen their core business. By making it open, they get the benefits of widespread adoption and development without needing to monetise the models directly.


Also don't underestimate the value they're getting from making more overtures to the developer community. It could be a coincidence, but it's only since they started releasing these models that I started noticing people on HN calling them "Meta", and attitudes towards them have been far more positive of late than usual.

Good will isn't worth as much as cheap moderation automation and fancy features, but it's worth something.


> Also don't underestimate the value they're getting from making more overtures to the developer community.

I wonder if it's significant. As developers, we're biased to think it matters, but in the grand scheme of things, 99.99% of people don't have a clue about open source or things that matter to hackers. As far as recruitment goes, developers look primarily at how much they make, possibly the tech and how it looks on a resume. There's always been a stigma around social networks and generally big tech companies, but not to the point it's going to hurt them.


I agree there's a lot of decisions that have to be made on the basis of "the tech and how it looks on resume" (as I'm finding out by not having React Native on mine).

> There's always been a stigma around social networks and generally big tech companies, but not to the point it's going to hurt them.

I'm not sure it was "always":

The one Facebook developer event I've been to made me feel dirty just to associate with them, but before that I had no negative feelings. It started off as "the new LiveJournal".

Deleted my account for a few years, only came back to it when I started planning to move country and wanted to keep in contact with those who stayed put.


AI researchers get huge checks working anywhere. So how does your company distinguish its appeal? Play to a researcher's desire for recognition in their field. Lean on open source. (I'm just paraphrasing Zuckerberg.)


Similar to how places like SpaceX or game development are known for low pay and bad working conditions, yet have no shortage of applicants.

No matter whether you want to hire the best of the best or just average people at a lower than average price, being a place where people want to work helps immensely


It might make it a bit easier to hire engineers. Working for Facebook? Meh. Working for the "new Meta", on the other hand?


It's funny how quickly Zuck managed to turn his image around from "data-stealing actual lizard person" to "kind of normal guy" with a few years and a haircut. It's also not lost on me that he's the only "hacker" major tech CEO remaining:

   - Sundar is a glorified bean counter and his company is rotting from the inside, only kept afloat by the money printer that is ads.
   - Satya and Microsoft are in a similar boat, with the only major achievement being essentially buying OpenAI while every other product gets worse
   - Tim Cook is doing good things with Apple, but he still runs the company more like a fashion company than a tech company
   - Amazon was always more about logistics than cool hack value, and that hasn't changed since Bezos left
   - Elon is Elon
Meanwhile Zuck is spending shareholder money pushing forward consumer VR because he thinks it's cool, demoing true AR glasses, releasing open-source models, and building giant Roman-style statues of his wife.


Pichai, Cook and whoever runs Amazon are excellent operators. Perhaps the best in the game. But you can't simply sit on past success.


In my mind, Zuckerberg will always be a data-stealing lizard person and no amount of MMA training will change that


Microsoft has Azure, which is growing like crazy from a high base point.


I feel like owning Oculus must be the equivalent of the curse of the monkey's paw. It drove Palmer Luckey veritably[0] insane[1] when he was the head of it, and between the statue nonsense, the weaponizing of Llama and his angling for a position as a Trump advisor, it sure looks like Zuckerberg wants to head down the same path.

[0] https://www.vice.com/en/article/palmer-luckey-made-a-vr-head...

[1] https://www.codastory.com/authoritarian-tech/us-border-surve...


Facebook Engineering has always been well regarded — starting with React on the front end, but also projects like Open Compute.

Their product management on the other hand— well, I mean, Facebook and Instagram are arguably as popular as McDonald’s. So they’ve got that going for them.


and WhatsApp


If anything, WhatsApp is probably bigger than Instagram around the world.


It's also got an infinitely better reputation since it's arguably not social media, but just a solid instant messenger.

I was personally very concerned when they acquired it, but went from a begrudging user (it was the de facto standard in the country I was living in at the time) to an excited one (Facebook/Meta added end-to-end encryption, multi-device capability, device migration etc.)


If it's E2EE, can you explain how a new device is able to read the messages after logging in?

In other words: Facebook has a copy of the decryption key.


What do you mean by “read the messages”? WhatsApp doesn’t have a server-side message archive.

If you mean messages pending delivery: A newly logged in device can request the sender to resend pending messages under its new key.


> infinitely better reputation

You’ve clearly not seen it enshittify:

* More UI space devoted to brands (the Updates tab)

* Communities UX is, uh, suboptimal

* Brands / business accounts can message you with no opt-out; you can block a chat after the fact, but that won't stop them from messaging you from another account.

These made the news in WhatsApp’s favourite testbed market[1], but if you think it’s not coming to other markets…

Anyhow, tl;dr — WhatsApp’s no longer in the “put users first” phase, it’s in the “monetize” phase of Facebook Product Management.

As a person who paid for WhatsApp pre-Facebook, that does make me sad.

[1] https://techcrunch.com/2022/10/10/in-india-businesses-are-in...


I use it many times per day as my primary means of communication outside work, and while it’s not perfect, I prefer it over all the alternatives.

But yeah, it’s “done” in many ways, and idle product managers and business developers are rarely a good thing for a product.


Spot-on. Hacker News has a hate-boner for Facebook and the Facebook-branded products, but from an engineering standpoint they're up there with Netflix solving problems at scale. FAIR had a better reputation than OpenAI since the start, and they've developed the Oculus acquisition to its logical extremes. You can do a lot worse, looking at the way their peers handle similar situations.

All the big tech companies have a Facebook-esque product they wish they could get rid of forever. Meta has Facebook, and instead of imploding like everyone said they would (for decades) they demonstrated competency in engineering and culture. The next 4 years will be a gauntlet with a literal "Mr. X" advising social media policy, but I frankly don't think Facebook has ever been down for the count in a pragmatic sense.


I think the only concept I’ve heard as to the decline of Facebook is, Facebook is for “older people” (read: millennials+) but I have no idea how true this is.


It totally depends!

I used to live in the city and now I live out in nowhere land. Everything around here is done thru facebook groups and facebook events. If you want to keep in touch with the community, it is really only on facebook.

Some of my older facebook friends seem to have left the platform, and I've seen more than one of my old city friends posting final goodbyes declaring they are leaving the "dead" platform for discord or some stuff where their family will never find them.


Did anyone think Facebook would fall over because their engineering sucked?

It’s the product itself that’s garbage, not the infrastructure or the code.


It's funny. The only time I've ever seen Hacker News unanimously applaud a Facebook product was when PyTorch announced they merged Apple Silicon support. Seems like Mr. Zuckerberg knows how to play a winning hand.


I call them OpenAI instead of Meta.


I'm sticking with fuckerbook


I'm sticking with:

> They "trust me"

> Dumb fucks

Quotation marks his, not mine. It adds a certain vibe to it.


I'm glad nobody saved all the stupid stuff I said at 19 for posterity. Sin, without, first stone etc.


I'm also glad that those who chose to give my 19 year-old self a break ignore all the other scandalous shit my company did in line with those comments in the decades after the fact.


OK, so your position is that Meta (nee Facebook) are bad, and that they've been doing scandalous stuff for years.

Can you give me a list of the things that they did that you felt were particularly egregious?

Note: I worked there for five years, but left 8 years ago (but up till the recent layoffs, had 170+ LI contacts still there).


Profiling people for advertising, building shadow profiles, buying 3rd party user data, tracking people across the web, devices and geographically, psychological experiments, optimizing for engagement, limiting post reach without pay... generally prioritizing profit over people.

https://www.techpolicy.press/is-it-ethical-to-work-at-facebo...


Ok, I'm going to assume that you're somewhat technical given where we are.

First off, if I estimate a series of numbers on you and those result in you being served a set of ads, is that wrong? If so, can you help me understand what's wrong with that?

Shadow profiles are mostly bullshit, yes data was collected for non users due to how the SDK and pixel worked. This data was all assigned to one user ID and was filtered out by basically everyone using that data.

I'm a little confused as to why buying third party data is wrong, the problem with this is that it's legal to collect and sell the data.

Speaking as a psychologist can you clarify what's wrong with psychological experiments?

I think your point about optimising for engagement was definitely a mistake, given the downstream consequences. However, they needed to find some way of ranking feed after Zynga almost killed them (a chronological feed would have been all Farmville all the time for a number of years) and they picked likes.

They also optimized for time spent but people complained about that so they started optimising for comments and shares which made everything worse, sadly.

Limiting post reach for pages was a legitimate business decision, particularly given the ranking constraints.

Got any more problems with them?


See their partnership with Cambridge Analytica.

Also, is manipulating elections and contributing to third world genocidal riots enough for you? I'm surprised you know so little about this stuff.


What partnership with Cambridge Analytica?

Like if you look into what that company actually did all the data stuff was a smokescreen for their speciality of getting your opponents caught in compromising positions.

Fundamentally, neither the Big 5 traits nor friend data is particularly useful for ad targeting (internally neither approach was successful).

Can you please be specific about the manipulation of elections?

I presume we're talking about Myanmar and the genocide. Personally I generally place responsibility for bad actions on the people engaging in genocide rather than the communication mechanisms involved. Should we have banned radio after the Rwandan genocide?

Hitler used radio very effectively, should we have banned that?


It would be strange if they didn't also use these models to generate much more sophisticated models of their user's interests and hyper-targeted advertising that always looks and feels like trusted friend recommendations for the exact product that's been missing from your life.


> It would be strange if they didn't also use these models to generate much more sophisticated models of their user's interests and generate the hyper targeted advertising

You have the sequence reversed as Meta already created ad targeting models. Meta was forced to scale its AI competence for ad targeting when Apple sent out a privacy update that destroyed tracking-based ad-serving and tanked Meta's share price by deleting billions in revenue for Meta over many quarters. Now that Meta has this skill as a core-competence, they are creating new models for public release. Why they are doing so is debatable[2], but I imagine the cost is marginal since they already had the GPU clusters, talent and know-how for survival purposes.

1. https://www.businessinsider.com/metas-bet-on-ai-has-saved-it...

2. I suspect Zuckerberg is not enthused by the idea of a future AI Apple-analog unilaterally shutting him out of the market. Having your net worth cut in half by a press-release has got to hurt.


Thanks for the article but I don't think I have the sequence reversed.

Without access to the tracking signal, it's been more important to build out a system that can recreate the value from that lost signal by analyzing what users are actually sharing and saying on their platform. Hence the importance of chat (VR, text, video...) and AI that can be used to process and extract value from a chat signal.

I believe Meta's primary revenue source is still advertising (98%), so that is probably 98% of the why.


> Thanks for the article but I don't think I have the sequence reversed

I suppose I read your first sentence as being in future tense when it might not be. The thrust of my argument is that Meta already successfully built those ad targeting models (Advantage+), and they preceded the Llama releases, so they don't need to use Llama-derived models for ad targeting, as I understood your comment to be suggesting. The sequence was not/will not be "Llama -> ad targeting", but was "ad targeting -> Llama"

Meta didn't have to release the weights of the models. Ad revenue doesn't explain why they did so.


Lots of applications for the technology are likely and already being used. Llama to improve Customer to Product Matching through content analysis. Llama for Ad Generation based on customer and product profiles. Llama to generate additional chat signal for humans not chatting with other humans. Llama as a site content moderator. Llama as a code generation assistant. Llama to clean up and improve internal wikis. Llama as an operational analysis tool to detect inefficiencies in internal communication...

They will use it in every situation where they find a successful use case. Open sourcing the weights buys goodwill but it won't be long before everyone open sources their weights since the only real long term moats are the successful products built around the technology and not the technology itself. Facebook is its own moat. Altman suggested this outcome a few days ago in his NYT DealBook interview when he compared the transformer model to the transistor.


There's the old HN! I knew it was around here somewhere!


I'd already been wondering for a while what exactly the business model of open source is. Why, for example, does Google spend money on Chrome? After Zuckerberg’s comments it hit me:

Open source is useful for a business if it can either increase revenue or decrease costs.

Examples:

Increase revenue: Chrome and Visual Studio code. For example, the more people code, the more likely it is that they pay MSFT. So VS code aims to make programming as attractive as possible. Similar for Chrome.

Decrease costs: Linux and Llama. As Zuckerberg said himself IIRC, they don’t want one party to snowball into an LLM monopoly, so they'd rather help get the open source ball rolling.


I think Ballmer's "developers, developers, developers" meme has been around longer than some people here have been alive - it served them well in the Windows era and it serves them well in the cloud space.


And if you want to think about comparators to Facebook, Microsoft is the right company to look at.


Exactly. Microsoft wants lots of cash for Copilot, meanwhile, we had Code Llama running with 150 developers before the Microsoft idiots could schedule a meeting.


> For example, the more people code, the more likely it is that they pay MSFT. So VS code aims to make programming as attractive as possible

How does that increase revenue in a remotely measurable way?

Chrome, for sure, high market share, default search engine, more money, at least that's how I imagine it.


One example is that Microsoft owns GitHub.


Maybe Zuck just wants to see cool shit. It's not like he needs the money


Commoditize your complement: https://gwern.net/complement


>storage being 'a feature, not a product'

Somewhat unrelated mini-rant. Upgraded a phone recently after about 3 years. Surprised to see storage still capped around 128GB (in general). That's got to be artificially held-back capacity to push cloud storage services?


There's lots of phones with more and/or with SD slots. It's not really "capped" as much as default size that seems to work just fine for the majority.


I’ve got a 512GB phone with 112GB used. I’ve put absolutely no effort whatsoever into keeping that number down, and I’m not shy about downloading stuff to it.

I’m certain plenty of people need way more than 128GB. I figured I’d be one of them when I bought this. Nope. I bought a much bigger device than I actually needed.

If I’ve used less than 128GB, I’ve gotta think most other people do too. Not all, clearly! But most? I’d bet on it.


They’ve dramatically improved their ad quality. I now routinely check out and convert on Facebook and Instagram ads, and I can honestly say that for 20 years or more I never once intentionally clicked on one.


Ironically, gen AI has made their products worse more than anyone else's. I can't believe the amount of AI slop I see every time I open Facebook - I used to check it occasionally when replying on Messenger and scroll through for a while; after seeing the AI spam I don't even bother.


That was going to happen regardless. From Meta's perspective it's better for their platforms to contain their own AI slop than OpenAI's.


Truly, I wonder if they're fooled by their own click fraud, or if the incentives really do work out such that they get paid whether the engagement is from bots or people. Anyway, I came here to say the same thing: it's shocking to me how enthusiastic Zuckerberg is about generative AI. What other possible outcome is there except actual human content creation being replaced by slop?


Do the improvements the community proposes/makes to their models amount to anything significant? For a company like Meta with basically infinite money, do they really benefit from external help?


I don't have eyes inside of Meta, but keep in mind that we're not just talking about fine-tunes and LoRAs, we're also talking about the entire llama ecosystem, including llama.cpp, ollama, and llamafile. These would not exist (or wouldn't have anything like as much momentum) without Meta's investment, but they're now huge community projects.

I don't know if they use them internally, of course, but they could, and they represent a lot of work.


Baggy Tees, gold chains and now this!? Make this man president immediately!


Seems to be more or less on par with GPT-4o across many benchmarks: https://x.com/Ahmad_Al_Dahle/status/1865071436630778109


Except it is 25x cheaper, available offline, can be uncensored / unaligned, fine-tuneable and backupable.

Sad day for OpenAI. Great for humanity.


What are good starting points for uncensoring it? Because it is offline, a jailbreak prompt can't be remote-bricked, but can one remove censorship from the weights themselves? What does it do to accuracy?



Abliteration is a fool's errand, practically all models end up noticeably lobotomized even with follow up tuning.

Good ol' fine tuning on an uncensored dataset gives far more usable results.


Interesting - I've heard this anecdotally. Curious if you know of any resources that look at this in more detail?


I haven't seen any papers doing a proper analysis on the topic, just mostly saying this from firsthand experience testing a handful of them and comparing to the model they were based on given same prompt and sampler. It's usually not even close and you can immediately tell that it's notably dumber. Iirc in one case one even forgot how to do basic arithmetic while the original model aced it. Not entirely unexpected results from sticking a digital ice pick into the weights.

Afaik there are only three major sources of quality unaligned model versions, which are Nous's Hermes models, Hartford's Dolphins and Drummer's Tigers. All of them are regular fine tunes that perform mostly the same as, or just ever so slightly below, the original.


Not speaking from personal experience, but I've tried a lot of the decensored models and they lose a lot in the process. They are noticeably, sometimes shockingly, worse. They also still refuse prompts in many cases.


>Not speaking from personal experience, but I've tried a lot of the decensored models and they lose a lot in the process. They are noticeably, sometimes shockingly, worse. They also still refuse prompts in many cases.

Depending on what you do, locally you can modify the response: say the AI responds "No, I can't do that", you edit the response to "Sure, the answer is " and then the AI will continue with the next tokens.

But I think you can also build your own instruct model from the base one and not apply the safety instructions that protect the feelings of your customers.
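As a rough sketch of that response-editing trick with llama-cpp-python (untested here; the file path is a hypothetical local quant, and the special tokens are the standard Llama 3 chat template):

    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama(model_path="Llama-3.3-70B-Instruct-Q4_K_M.gguf", n_ctx=4096)  # hypothetical path

    # Build the raw chat prompt ourselves and pre-seed the assistant turn, so the model
    # continues from "Sure, the answer is" instead of starting with a refusal.
    prompt = (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        "How do I do X?<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        "Sure, the answer is"
    )
    out = llm(prompt, max_tokens=256)
    print("Sure, the answer is" + out["choices"][0]["text"])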


zuck is really on his redemption arc, he's out-doing himself


I think about this a lot, I believe part of this just boils down to the fact that he’s a founder. He’s a builder, and I think that yields some intrinsic drive towards creativity and disruption.

Even with “metaverse” being a laughingstock, they’re still aiming for something ambitious. Each new Quest generation makes me think there may be a chance they pull it off.

Now, do I think he’s a great person? No, not really. Do I agree with most of his decisions on how he treats his users? Hell no, and that’s not changing.

But if you compare him to somebody like Sundar at Google - a weaselly MBA who was first and foremost a corporate ladder climber - the difference in ambition is night and day.

Sundar made it to the top already, his only vision now is to stay at the top, and that means pleasing Wall Street, everything else is secondary. There is no grand technical ambition with him, there never was.

This goes for pretty much all non-founder CEOs. You could say the same things about Tim Apple, Andy Jassy, and other henchmen in waiting who made it to the big chair.

I think it comes down to the fact that founders get where they are by having big ambitions and taking risks, the MBA to CEO path is just craven corporate knife fighting with other MBAs.

Regardless, I think this is 50% Zuckerberg changing, 50% the other big companies are mostly run by ladder climbers.


two things to keep in mind though:

1/ Yann Lecun probably is the one pushing for open source

2/ Mark isn't doing this for the greater good and for humanity. It helps his business because Llama is becoming a standard, and people are building / improving, which in turn helps Meta and Meta's business


They were basically forced into this by being so far behind that they'd need to rent OAI or Anthropic services to stay competitive, which at their scale would've probably cost them more than it takes to train llama and benefit from community research help.

At least that was the rationale behind the intentional leak of llama 1 back in the day according to some sources anyway.


Yeah and they’d also lose top AI talent as well.


I don't find him redeemed, but I do wonder what has been baked into the tuning of these models? What a way to generate influence!


Nice try Meta PR department /s


Does it still count if the purpose is to dilute OpenAI's market power and not to benefit the world?


But diluting OpenAi's (or any other company's, for that matter) market power does benefit the world.


Good point! That's worthy in itself for sure.


More generally I think he's decided he needs to own the platform to be able to grow.

Him swearing about (presumably) Apple telling them they can't do stuff (because tough shit, you're their serf) was legit I think.


Is your accusatory statement falsifiable?


I hope I can believe an honest man, Zuck in this case. And that can be enough.

But it's more of a question of who do _I_ want to admire. An honest question also; maybe it doesn't matter why he's doing it, maybe just doing it is enough.

Or maybe it's worth understanding if this is about Meta beating OpenAI (so, ego-driven) or because Meta really cares for democratic AI and distribution of power (so, not ego-driven).

I think it's the former, so not admirable — for me.


How sad? Tiny violin sad. The typical consumer, which is likely 99% of them, is not going to use this


How do you calculate the price?


I assume it's the same cost as Llama 3.1 70b


And how do you calculate the cost of that?


GPU cost ÷ GPU token throughput = cost per token will get you close; then you compare the cost per token between solutions.

To explain how to calculate it precisely, I would have to write a blog post. There are dozens of factors that go into it and they vary based on your use case like GPU type/setup, cloud provider, inferencing engine, context size, the minimum throughput and latency you'd be willing to have your users experience, LLM quantization, KV cache configuration, etc.
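As a minimal sketch of that back-of-envelope version (every number below is a made-up placeholder, not a real price or throughput):

    gpu_cost_per_hour = 2.50   # assumed hourly rental price for the GPU(s) serving the model
    tokens_per_second = 400.0  # assumed sustained throughput across concurrent requests
    cost_per_million = gpu_cost_per_hour / (tokens_per_second * 3600) * 1e6
    print(f"${cost_per_million:.2f} per 1M tokens")  # ~$1.74/1M with these placeholder numbers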

I'm sure there are cost analyses out there for Llama 3.1 70b you could find though.


This year seems to finish on the same note as it began -- that most AI evolution happens in the smaller models. There's been a true shift as corporations have started to realize the value of training data and massively outsizing the resulting model size.


Which GPT-4O are those?

The 08-06 release seems to be a bit higher on numerous benchmarks than what that shows: https://github.com/openai/simple-evals?tab=readme-ov-file#be...


This just makes the $200/month even more laughable.


The $200 plan is for people who would pay $200 for a bottle of vodka even though the $20 bottle is molecularly identical.


How? 4o is part of the plus plan, as is o1.


o1-preview.


Nope, o1. o1 "pro mode" is part of the $200/mo ChatGPT Pro


Ah clear as mud. OpenAI must have hired the Microsoft naming team. Thank you!


But it's $200/month smart!


o1-pro is way smarter still


Does unexpectedly well on our benchmark:

https://help.kagi.com/kagi/ai/llm-benchmark.html

Will dive into it more, but this is impressive.


I asked it:

> I have a sorcerer character on D&D 5e and I've reached level 6. What do I get?

It confabulated a bunch of stuff. I also asked GPT-4, it confabulated a bit. Claude was spot on.


Please help me understand something.

I've been out of the loop with HuggingFace models.

What can you do with these models?

1. Can you download them and run them on your Laptop via JupyterLab?

2. What benefits does that get you?

3. Can you update them regularly (with new data on the internet, e.g.)?

4. Can you finetune them for a specific use case (e.g. GeoSpatial data)?

5. How difficult and time-consuming (person-hours) is it to finetune a model?

(If HuggingFace has answers to these questions, please point me to the URL. HuggingFace, to me, seems like the early days of GitHub. A small number were heavy users, but the rest were left scratching their heads and wondering how to use it.)

Granted it's a newbie question, but answers will be beneficial to a lot of us out there.


Hi,

Yes you can. The community creates quantized variants of these that can run on consumer GPUs. A 4-bit quantization of LLAMA 70b works pretty well on MacBook Pros; the neural engine with unified CPU memory is quite solid for these. GPUs are a bit tougher because consumer GPU RAM is still kinda small.

You can also fine-tune them. There are a lot of frameworks like unsloth that make this easier. https://github.com/unslothai/unsloth . Fine-tuning can be pretty tricky to get right; you need to be aware of things like learning rates, but there are good resources on the internet where a lot of hobbyists have gotten things working. You do not need a PhD in ML to accomplish this. You will, however, need data that you can represent textually.
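For a feel of what the setup looks like, here's a minimal LoRA sketch using the Hugging Face peft library (not unsloth specifically; the 8B model id is just an example small enough for one consumer GPU, and the repo is gated behind Meta's license):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    model_id = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    # LoRA trains small adapter matrices instead of all the base weights.
    lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      lora_dropout=0.05, task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
    # ...then train on your textual dataset with transformers' Trainer or trl's SFTTrainer.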

Source: Director of Engineering for model serving at Databricks.


Thank you Josh. Is there a resource you can point us to that helps answer "what kind of MacBook Pro memory do I need to run ABC model at XYZ quantization?"


In general you can just use the parameter count to figure that out.

70B model at 8 bits per parameter would mean 70GB, 4 bits is 35GB, etc. But that is just for the raw weights, you also need some ram to store the data that is passing through the model and the OS eats up some, so add about a 10-15% buffer on top of that to make sure you're good.

Also the quality falls off pretty quick once you start quantizing below 4-bit so be careful with that, but at 3-bit a 70B model should run fine on 32GB of ram.
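That rule of thumb as a tiny helper (a sketch; the overhead factor is a rough guess, and real usage also grows with context length):

    def est_memory_gb(params_billion, bits, overhead=0.12):
        """Memory for the weights alone, plus a fudge factor for runtime/OS overhead."""
        return params_billion * bits / 8 * (1 + overhead)

    print(est_memory_gb(70, 4))  # ~39 GB
    print(est_memory_gb(70, 3))  # ~29 GB, fits alongside the OS on a 32 GB machine
    print(est_memory_gb(70, 8))  # ~78 GB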


Does 70b mean there are 70 billion weights and biases in the model?


Look at the filesize, add a couple of GB.


how would the pricing on databricks when using model serving compare to, say, the prices seen in the original post here (i.e., "3.3 70B is 25X cheaper than GPT4o")?


I’ve been wanting to run into someone on the Databricks team. Can you ask whoever trains models like MPT to consider training an open model only on data clear of copyright claims? Specifically, one using only Gutenberg and the permissive code in The Stack? Or just Gutenberg?

Since I follow Christ, I can’t break the law or use what might be produced directly from infringement. I might be able to do more experiments if a free, legal model is available. Also, we can legally copy datasets like PG19 since they’re public domain. Whereas, most others have works in which I might need a license to distribute.

Please forward the request to the model trainers. Even a 7B model would let us do a lot of research on optimization algorithms, fine-tuning, etc.


I think you're looking for OLMo, https://allenai.org/olmo


They appear to use Common Crawl in the DCLM dataset. Just downloading Common Crawl is probably copyright infringement before we consider specific terms in the licenses. Arxiv papers have a mix of licenses with some not allowing commercial use.

If I got the sources right, it’s already illegal with just two sources they scraped. That’s why I want one on Gutenberg content that has no restrictions.


Thank you! Very helpful!


Yes (don't know about JupyterLab), skip, not really, yes, quite irritating so just pay someone else to do it.


Thanks! Succinct and complete.


Basically equivalent to GitHub but for models. Anyone can upload anything, but it kind of standardizes tools and distribution for everyone. They also have a team that helps integrate releases for easier use and libraries for fine tuning


Thanks!

I want to download my first HuggingFace model, and play with it. If you know of a resource that can help me decide what to start with, please share. If you don't, no worries. Thanks again.


Most of the HF models have a code snippet that you can use in order to run inference on the model. The transformers library will take care of the download as a dependency when you run the code. Typically, a Python 3.10-3.11 environment is sufficient. Example: https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct#t...

If you have a MBP, you need to adjust the device name in the examples from "cuda" to "mps".
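The snippets generally follow the same pattern; here's a minimal sketch with the small model from that link (needs the transformers and accelerate packages installed):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # small enough to try on a laptop
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # picks cuda/mps/cpu

    messages = [{"role": "user", "content": "What is an LLM?"}]
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                           return_tensors="pt").to(model.device)
    output = model.generate(inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))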


If you're on a Mac, https://lmstudio.ai is a quick way to get things running with a decent UI and a REST API that is compatible with OpenAI (which is the de facto standard these days). And the GGUF models that it downloads can be used directly via llama.cpp, if you are so inclined later.


Their docs are very fun to read. I’d probably recommend starting with the “transformers” library for python if you want to play with some language models e.g. Bert:

https://huggingface.co/docs/transformers/en/model_doc/bert


I'm "tracking" the price of if 1M tokens in OpenRouter and it is decreasing every few refreshes. It's funny: https://openrouter.ai/meta-llama/llama-3.3-70b-instruct


I uploaded 4bit bitsandbytes, GGUFs and original 16bit weights to https://huggingface.co/unsloth for those interested! You can also finetune Llama 3.3 70B in under 48GB of VRAM and 2x faster and use 70% less memory with Unsloth!


Open sourcing Llama is one of the best examples and roll-outs of "Commoditize Your Complement" in memory.

Link to Gwern's "Laws of Tech: Commoditize Your Complement" for those who haven't heard of this strategy before:

https://gwern.net/complement


That was so interesting. Thanks for sharing


Meta continues to overdeliver. Their goal from the start was to target and disrupt OpenAI/Anthropic with a scorched earth approach by releasing powerful open models.

The big winners: we developers.


Just spent a few minutes this morning spinning up a H100 model server and trying an FP8 quantized version (including kv cache quantization) to fit it on 2 H100s -- speed and quality looking promising.

I'm excited to see if the better instruction following benchmarks improves function calling / agentic capabilities.


I know this has been discussed before but it changes frequently: what’s the good “generic” Mac desktop client these days? I’d like to use Ollama and/or ChatGPT. Maybe Claude. Perhaps Perplexity, too. I primarily want to use AI chats in various apps, like typing “write a function to…” into whatever random editor I’m using at the moment. It doesn’t have to be a desktop app, either. If there’s a great PopClip plugin or Keyboard Maestro macro, or even something that works as a system service, that’s perfectly fine by me.

MindMac is nifty, but that feels like a lot of money for something that’s a front end to someone else’s API. “Stop being a cheapskate” is a legitimate answer.


I like Open WebUI. It’s a web app that runs locally, not a native MacOS app. But it’s very nice.

https://github.com/open-webui/open-webui


ChatGPT has a Mac app. If you want to run your own models, Ollama with Chital is great.

https://github.com/sheshbabu/Chital


The ChatGPT app only has the nifty "works with any app" feature if you're on the Plus plan and doesn't support other models. Chital looks very nice and is admirably tiny (3.6MB unzipped!) but doesn't have the editor integration at all, as far as I can tell.

MindMac is an example of an app that meets all the criteria, but for more than it seems like such an app should cost.



Seems that a bunch of quantized models are already uploaded to ollama: https://ollama.com/library/llama3.3/tags



Hopefully this lands on Groq soon!



I work at a YC company, and we would love to use Groq but our limits are so low at one million tokens per day. Could you help us get a better rate limit/token limit? Let me know if I should reach out via email, thank you!


I suggest joining the Discord and enquiring there, or, depending on what you need, emailing sales@.


Yes, with a high TPM. I see them unboxing all the racks, which is exciting.

I would move over to Groq in a New York minute if I could get enough tokens.


I'm wondering if they are ever actually going to go live (i.e. paid for and reasonable limits), we've been waiting since June!


Forgive me, what slang is "unboxing all the racks" ?


They’re posting images to their socials of them receiving their new racks.


O so not slang at all then my bad :')


Given the comments saying its performance seems comparable to 4o/4o-mini, is it safe to say that GPT-4 performance can be achieved with less than 100B parameters, contrary to what was previously thought?


How many tokens per second can I get on an M4 Max with 128gb of RAM?


I get about 10


It's available on IBM WatsonX, but the Prompt Lab may still report "model unavailable". This is because of overeager guardrails. These can be turned off, but the German translation for this option is broken too: look for "KI-Guardrails auf" in the upper right.


I'm building a PC just to run inference on this and the QwQ 32B models.

Any suggestions on RAM and GPU I should get?


This of course depends on your budget and what you expect to do with these models. For a lot of people, the most cost-effective solution is probably to rent a GPU in the cloud.

The limiting factor for running LLMs on consumer grade hardware is generally how much memory your GPU has access to. This is VRAM that's built into the GPU. On non-Apple hardware, the GPU's bandwidth to system RAM is so constrained that you might as well run those operations on the CPU.

The cheapest PC solution is usually second-hand RTX 3090's. These can be had for around $700 and they have 24G of VRAM. An RTX 4090 also has 24G of VRAM, but they're about twice as expensive, so for that price you're probably better off getting two 3090's than a single 4090.

Llama.cpp runs on the CPU and supports GPU offloading, so you can run a model partly on CPU and partly on GPU. Running anything on the CPU will slow down performance considerably, but it does mean that you can reasonably run a model that's slightly bigger than will fit in VRAM.

Quantization works by trimming the least significant digits from the models' parameters, so the model uses less memory at the cost of slight brain damage. A lightly quantized version of QwQ 32B will fit onto a single 3090. A 70B parameter model will need to be quantized down to Q3 or so to run entirely on a 3090. Or you could run a model quantized to Q4 or Q5, but expect only a few tokens per second. We'll need to see how well the quantized versions of this new model behave in practice.
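To make the "trimming least significant digits" point concrete, here's a toy round-trip (real formats like Q4_K_M quantize block-wise and are much smarter; this is just the idea):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(0, 0.02, size=8).astype(np.float32)  # pretend these are full-precision weights

    # Naive symmetric 4-bit quantization: only 16 representable levels.
    scale = np.abs(w).max() / 7
    q = np.clip(np.round(w / scale), -8, 7)  # what would be stored as 4-bit integers
    w_hat = q * scale                        # what the model effectively computes with

    print(np.abs(w - w_hat).max())  # small but nonzero error on nearly every weight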

Apple's M1-M4 series chips have unified memory so their GPU has access to the system RAM. If you like using a Mac and you were thinking of getting one anyway, they're not a bad choice. But you'll want to get a Mac with as much RAM as you can and they're not cheap.


As a general rule of thumb, for 8-bit quantization you’ll need as much VRAM as the parameters of the model. Half that if you run it at 4-bit. System RAM should be a bit more because you need RAM for the OS too. What most users do is go for second hand 3090s. You get two or three of them depending on quantization.


You also need VRAM for context: the larger the context, the more RAM you need.
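Rough sketch for a Llama-3-style 70B (assuming 80 layers, 8 KV heads of dim 128, fp16 cache; actual runtimes and quantized caches will differ):

    def kv_cache_gb(context_tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
        # 2x for keys and values, per layer, per token
        return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens / 1e9

    print(kv_cache_gb(4096))   # ~1.3 GB
    print(kv_cache_gb(32768))  # ~10.7 GB on top of the weights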


I have an A6000; it’s about the most affordable option for 48 GB VRAM (you can find one for a little under $5k sometimes), which is roughly the minimum to run a quantized 70b.

System RAM doesn’t really matter, but I have 128GB anyway as RAM is pretty cheap.


Why not 2 x 4090? Will be cheaper than A6000 if you can manage to find them at msrp, and will perform a lot better.


My time is worth a lot of money and 2x 4090 is more work, so it’s net more expensive in real terms.


For both inference and training I haven't seen any modern LLM stack take more time for multiple GPUs/tensor parallelism

I would take 1 RTX 6000 Ada, but if you mean the pre-Ada 6000, 2x4090 is faster for minimal hassle for most common usecases


I mean the newest ones. I only do LLM inference, whereas my training load is all DistilBERT models and the A6000 is a beast at cranking those out.

Also by “time” I mean my time setting up the machine and doing sys admin. Single card is less hassle.


The A6000 predates Ada?

There is the RTX 6000 Ada (practically unrelated to the A6000) which has 4090 level performance, that what you're referring to?



That's an Ampere A6000, one generation older than the Ada A6000. Nvidia decided that confusing model names are a good way to sell old products at a premium.


Running llama3.3:70b here on a pair of eBay Dell RTX3090s in an old (2012!) i3770 workstation - ollama reports 16.67 tokens/sec.


I've been running this and a quantized version of the QwQ model and comparing the responses - so far, QwQ is working better, though that could change as I use them more and compare the outputs


Would anyone be willing to compress this down to maybe 14b-20b for us on peasant 16gb rigs?


Or maybe even middle class plebeian 24gb rigs?


At that point just run 8b.


Or wait for the IQ2_M quantization of 70b which you can run very fast on 24GB VRAM with context size of 4096...


At some point there’s so much degradation with quantizing I think 8b is going to be better for many tasks.



I'm surprised that, out of all of the East Asian languages, they chose Thai to support: do they have a big office there? (I imagine compared to, say, Japanese or (some form of) Mandarin?)


Given how censored the 3.2 model was, I'm looking forward to the abliterated 3.3 version to see if there are any significant improvements there that can replace it.


Hot take after trying it a bit. I was not impressed with llama 3.2, but this one, well, it looks like we finally have a very very strong free LLM.


This is massive, really cool of meta to open source it


Hate to be one of those, but it’s not open source.

It’s just regular old freeware.

You can’t build Llama yourself, and its license contains an (admittedly generous) commercial usage restriction.


To be fair, even if they released detailed instructions and datasets on how to rebuild Llama (considering that there's some randomness in the process), you still probably wouldn't be able to build it - like, who has the resources? And if you had the resources, you _still_ probably wouldn't _want_ to rebuild it yourself; it seems awfully expensive when you could instead spend those resources elsewhere.

Fair point about the license, people have different definitions for what "open source" means.


> people have different definitions for what "open source" means.

They shouldn’t. It’s just market confusion.

There is an explicit widely accepted definition.

Also like llama (the file you download from huggingface) isn’t even a program. It’s a binary weights file. No source to be opened, even.

It’s just freeware.

https://opensource.org/osd


That's true for most people for ordinary software too. How many people actually build Linux or Chromium from source? Building Chromium takes more RAM and HD space than most people even have. Yet the world gets immense value from the few who do. I wouldn't want to live in a world where WebKit and Chromium were closed source. You can run a Chromium fork without having to build it yourself. And compute costs will come down over time.


> Building Chromium takes more RAM and HD space than most people even have.

According to [1], it takes 16GB of RAM and ~180GB of disk space. Most people have that much. It does take several hours without a many-core machine though.

Building Linux takes much less.

[1] https://chromium.googlesource.com/chromium/src.git/+/master/...


I would bet overall most people have those 4GB RAM, 32GB eMMC laptops from walmart, etc. If you limit things to developers/gamers/enthusiasts, you'd probably be right.


Those laptops kind of died out at some point; whoever could use a smartphone or tablet used those instead.


Linux and Chromium seem at the edge of the current scale of "ordinary" open-source software. I think perhaps one should also take into account how much money would be needed to be able to build the thing in reasonable time.

Building Chromium sounds awful, but I'm not sure I'd really need to buy another computer for that. If I did I'm sure I wouldn't need to spend billions on it, most probably not even millions.

For LLaMa I definitely don't have the computer to build it, I definitely don't have the money to buy the computer, even if I won the lottery tomorrow I'm pretty sure I wouldn't have enough money to buy the hardware, even if I had enough money to buy the hardware I'm still not sure I could actually buy it in reasonable time, nvidia may be backlogged for a while, even if I already had all the hardware I probably wouldn't want to retrain llama, and even if I wanted to retrain it the process is probably going to take weeks if not months at best.

Like I think it's one of those things where the difference in magnitude creates a difference in kind, one can't quite meaningfully compare LLaMa with the Calculator app that Ubuntu ships with.


The practicality of building it yourself has nothing to do with an organization affording you that ability.

Also like, gentoo people compile everything


> To be fair if they released detailed instructions and datasets on how to rebuild llama

Where?

Books3 was famously one of the datasets used to train llama and it’s very illegal to put that together nowadays.

I believe the guy who wrote the script to build it got arrested


Perhaps an individual couldn't. But an organization or a state could.


It is “open weight”. Mark Zuckerberg keeps calling it open source and everyone copies him, but you are right. It is not open source.


We need more uncensored models


No base model? disappointed.


The base model is Llama 3.1 70B


It is probably the same base model as Llama 3.0.

They mention post-training improvements.


interesting comment... what are you doing with base models? Are you a "finetuner"? I have been trying my hand with finetunes on instruct models and the results have been ok, but not awesome. I have a base model downloading now to give that a proper shot.


I'm not them but I still prefer a text completion style of prompting rather than a baked in pre-prompt structure assuming only a 'chat' style metaphor of interaction.


Base models are useful in research to see the effect of instruction tuning



