> Also, with 512 gigs of RAM and a massive CPU, it can run a 405 billion parameter Large Language Model. It's not fast, but it did run, giving me just under a token per second.
If you're serious about running LLMs and you can afford it, you'll of course want GPUs. But this might be a relatively affordable way to run really huge models like Llama 405B on your own hardware. This could be even more plausible on Ampere's upcoming 512-core CPU, though RAM bandwidth might be more of a bottleneck than CPU cores. Probably a niche use case, but intriguing.
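For a rough sense of why 512 GB of RAM is the interesting number here, a back-of-the-envelope sketch (the bandwidth figure and quantization overheads below are assumptions for illustration, not measurements of the AmpereOne):

```python
# Back-of-the-envelope sizing for a 405B-parameter model on a CPU-only box.
# The bandwidth number below is an assumption, not a measurement.

PARAMS = 405e9  # Llama 3.1 405B

for name, bytes_per_param in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    print(f"{name:>5}: ~{PARAMS * bytes_per_param / 1e9:,.0f} GB of weights")
# fp16 : ~810 GB -> doesn't fit in 512 GB
# 8-bit: ~405 GB -> fits, barely, with room left for KV cache and the OS
# 4-bit: ~200 GB -> fits comfortably

# Token generation is roughly memory-bandwidth bound: each generated token has
# to stream (most of) the quantized weights out of DRAM once.
assumed_bandwidth_gb_s = 250   # assumed effective DRAM bandwidth, GB/s
working_set_gb = 230           # ~4-bit weights plus KV cache and overhead
print(f"~{assumed_bandwidth_gb_s / working_set_gb:.1f} tokens/sec upper bound")
# ~1.1 tokens/sec, the same ballpark as "just under a token per second"
```

The same math is why the 512-core part may not help much for generation: unless memory bandwidth scales along with the cores, decode speed stays pinned to DRAM throughput.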
You know, there's nothing wrong with running a slow LLM.
Some people lack the resources to run an LLM on a GPU. Others want to try certain models without buying thousands of dollars of equipment just to try things out.
Either way, I see too many people putting the proverbial cart before the horse: they buy a video card, then try to fit LLMs into the limited VRAM they have, instead of playing around, even if at 1/10th the speed, and figuring out which models they want to run before deciding where to invest their money.
One token a second is worlds better than running nothing at all because someone told you that you shouldn't or can't because you don't have a fancy, expensive GPU.
> Some people lack the resources to run an LLM on a GPU.
Most people have a usable iGPU, which will run most models significantly slower than the CPU (because there's less available memory throughput, and/or more of it gets wasted on padding) but a lot cooler. NPUs will likely be a similar story.
It would be nice if there were an easy way to run only the initial prompt+context processing (which is generally compute bound) on the iGPU+NPU, then move to the CPU for the token generation stage.
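For anyone wondering why the two stages behave so differently, here's a minimal sketch of the generation loop; `model.prefill` and `model.decode_step` are hypothetical stand-ins, not a real llama.cpp or ollama API:

```python
# Illustrative only: `model` is a hypothetical transformer wrapper used to show
# why prefill and decode have different bottlenecks.

def generate(model, prompt_tokens, max_new_tokens):
    # Phase 1: prefill. The whole prompt goes through the model in one pass as
    # big matrix-matrix multiplies -> compute bound -> a good fit for iGPU/NPU.
    kv_cache = model.prefill(prompt_tokens)            # hypothetical call

    # Phase 2: decode. One token per step, and every step still streams nearly
    # all the weights from RAM -> memory-bandwidth bound -> the CPU's DRAM
    # bandwidth, not its core count, sets the tokens/sec here.
    token, output = prompt_tokens[-1], []
    for _ in range(max_new_tokens):
        logits, kv_cache = model.decode_step(token, kv_cache)  # hypothetical call
        token = int(logits.argmax())                   # greedy sampling for brevity
        output.append(token)
    return output
```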
> One token a second is worlds better than running nothing at all because someone told you that you shouldn't or can't because you don't have a fancy, expensive GPU.
It's a minuscule pittance, on hardware that costs as much as an AmpereOne.
It's not "really slow" at all, 1 tok/sec is absolutely par for the course given the overall model size. The 405B model was never actually intended for production use, so the fact that it can even kinda run at speeds that are almost usable is itself noteworthy.
It's a little under 1 token/sec using ollama, but that was with stock llama.cpp — apparently Ampere has their own optimized version that runs a little better on the AmpereOne. I haven't tested it yet with 405b.
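If you want a number rather than a feel, ollama's local REST API reports token counts and timings in its /api/generate response (field names as of recent versions; the model tag below is just an example, use whatever you have pulled):

```python
import requests  # pip install requests

# Measure generation speed from a local ollama instance. Timings are reported
# in nanoseconds; the model tag is an assumption, substitute your own.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:405b",
          "prompt": "Explain RAID 5 in one paragraph.",
          "stream": False},
    timeout=3600,  # a 405B model on CPU will take a while
).json()

print(resp["eval_count"] / (resp["eval_duration"] / 1e9), "generation tokens/sec")
print(resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9), "prompt tokens/sec")
```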
This resource looks very bad to me, as they don't check batched inference at all. That might make sense now, when most people are just running a single query at a time, but pretty soon almost everything will be running queries in parallel to take advantage of the compute.
My question wasn't about how to run multiple queries against the LLM, but rather how it's even possible, from a transformer-architecture point of view, to have a single LLM hosting multiple, different end clients. I'm probably missing something but can't figure that out yet.
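The short answer is that the batch dimension never mixes sequences: one set of weights, one KV cache per client. A toy sketch (not a real serving stack) of what that looks like for a single attention head:

```python
import numpy as np

# Toy sketch, not a real serving stack: two unrelated clients share one set of
# attention weights but keep separate KV caches, so their tokens never interact.

d = 64                                     # toy model/head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(x, kv_cache):
    """One attention step for one new token vector `x` of one client's sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    kv_cache["k"].append(k)                # this client's private history
    kv_cache["v"].append(v)
    K, V = np.stack(kv_cache["k"]), np.stack(kv_cache["v"])
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max())                # softmax over this client's own tokens only
    w /= w.sum()
    return w @ V

cache_a = {"k": [], "v": []}               # client A's KV cache
cache_b = {"k": [], "v": []}               # client B's KV cache
for _ in range(3):
    out_a = attend(rng.standard_normal(d), cache_a)
    out_b = attend(rng.standard_normal(d), cache_b)

# A batched server stacks these per-client steps into one tensor so the matrix
# multiplies run once for the whole batch; that's where the throughput win
# from batched inference comes from.
```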
Dad of a friend ran a store selling designer underwear for men. He sold boxer shorts for $70 over 25 years ago.
He once told me that he also had a rack of thongs. Now, the thongs didn't have a lot of material, obviously, and were inexpensive to manufacture, so if he applied his regular markup they'd end up selling for $15.
"However, notice the price tag is $25. I add an extra $10, because if a guy's looking to buy a thong, he's going to buy a thong".
I think about what he said when I see the chip prices on these high-end server CPUs.
Wondering who actually pays the public price on those AMD chips in big-OEM servers. Not saying the AmpereOne isn't also discountable, but these public prices always feel more like signalling to me than a reference to the actual price. Or maybe I'm lucky...
> So you have to look at TCO over a period of years, but that's still quite significant.
Your point is valid. However, if we're talking TCO, we should also factor in that TCO involves rack space, power (and power efficiency), cooling, cabling, ports on the ToR switch, and so on.
The AMD solution halves pretty much everything around the CPU, and in order to get the same thread count you should really compare 2x$5k vs $15k.
I get that SMT does have performance hits, but such penalties mostly show up when you're doing number crunching all day, every day. In real-world scenarios, where software spends a lot of time waiting for either disk I/O or network I/O, I'd expect that to be a non-issue.
AMD's offering is still competitive, in my opinion.
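To make the kind of comparison being argued here concrete, a toy TCO calculation; every number in it is an assumption for illustration, not a vendor figure:

```python
# Toy TCO sketch in the spirit of the thread. All prices, wattages, and rates
# below are assumptions for illustration only.

YEARS = 4
POWER_PRICE = 0.12        # assumed USD per kWh
RACK_PRICE = 300          # assumed USD per rack unit per year (space/cooling/ports)

def tco(chip_price, sockets, watts_per_socket, rack_units):
    capex = chip_price * sockets
    energy_kwh = watts_per_socket * sockets * 24 * 365 * YEARS / 1000
    opex = energy_kwh * POWER_PRICE + rack_units * RACK_PRICE * YEARS
    return capex + opex

# "2 x $5k vs $15k": two cheaper sockets spread over more platform vs one
# pricier socket in half the footprint (wattages are made up).
print(f"two $5k sockets: ${tco(5_000, 2, 350, 2):,.0f}")
print(f"one $15k socket: ${tco(15_000, 1, 500, 1):,.0f}")
```

With made-up numbers the ordering can go either way; the point is just that the chip's list price is only one line item once power, space, and per-box overhead are in the sum.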
The 9965 is a much faster chip than the Ampere, but it's also at the point of diminishing returns. The Ampere struggles to match much cheaper AMD offerings in the Phoronix suite; it is not better value at all.
Yeah, agreed. If you think artificial intelligence is going to be an important technology in the coming years, and you want to get a better understanding of how it works, it's useful to be able to run something that you have full control over. Especially since you become very aware what the shortcomings are, and you appreciate the efforts that go into running the big online models.
What you describe is very similar to my own experience first running llama.cpp on my desktop computer. It was slow and inaccurate, but that's beside the point. What impressed me was that I could write a question in English, and it would understand the question, and respond in English with an internally coherent and grammatically correct answer. This is coming from a desktop, not a rack full of servers in some hyperscaler's datacenter. This was like meeting a talking dog! The fact that what it says is unreliable is completely beside the point.
I think you still need to calibrate your expectations for what you can get from consumer grade hardware without a powerful GPU. I wouldn't look to a local LLM as a useful store of factual knowledge about the world. The amount of stuff that it knows is going to be hampered by the smaller size. That doesn't mean it can't be useful, it may be very helpful for specialized domains, like coding.
I hope and expect that over the next several years, hardware that's capable of running more powerful models will become cheaper and more widely available. But for now, the practical applications of local models that don't require a powerful GPU are fairly limited. If you really want to talk to an LLM that has a sophisticated understanding of the world, you're better off using Claude or Gemini or ChatGPT.
An old laptop can make a great homelab server if you don't need that much processing power. It's quiet, and it's got a built-in KVM and battery backup.
Is there much demand for that in the enterprise spaces where RHEL is used?
Some applications, like ESRI's, are already packaged by the vendor using Wine on RHEL.
I like Mastodon better too, but non-technical users seem to find it too complicated/nerdy. Threads is easy to use but apparently suppresses posts containing political content and perhaps as a result of inheriting Instagram's userbase, has an issue with users accusing other users of posting engagement bait, which isn’t conducive to building and sustaining a community.
Mastodon and Threads have also been comparatively slow/reluctant to respond to community feedback. With Mastodon this is understandable because it’s open source and resources are limited, but with Threads that’s not a problem which suggests it’s intentional.
"The corruption of Twitter"? Whenabout do you pinpoint that? For me, it took place when the platform went from being a place where virtually every user was just a regular person and it got raided and invaded by corporate media, "celebrities" and politicians, so pretty soon. Shortly afterwards it began curtailing freedom of speech heavily by way of very suspicious and surreptitious means (deplatforming, suppression of ideas, removal of posts, shadowbanning, penalizing certain keywords, collaborating with governments to silence opposition and dissent, etcetera). Only since Elon Musk took over all of those vile tactics went out the window globally, though I admit I don't like the walled-garden nature of it all one bit, but at least one can express oneself there more or less freely now, and community notes are truly a boon to every sane individual--albeit certainly not to legacy media and woke cultists.
With regards to federation, I think in this case it just takes the worst of both worlds, but to each their own. I'd prefer a truly decentralized approach.
If you want to see the outcome of an unmoderated platform without consequences or accountability, see how anonymous message boards evolve every time.
Society and face-to-face discussion have their own inherent methods of moderation, so I don't really understand this absolutist desire for an anonymous town square where we can all yell vitriol at each other. And Twitter isn't even that.
I’d rather not respond to the user below (because I don’t have the energy to talk to anyone who uses the word “woke”), but I did find this pre-print interesting re potential Twitter algorithm manipulation: https://eprints.qut.edu.au/253211/
>because I don’t have the energy to talk to anyone who uses the word “woke”
You did this to yourselves. You pushed the term (and the whole ideology--extreme constructivism under an abhorrent mockery of morality). Don't spit in the wind.
He doesn't. You didn't link to anything regarding what I explicitly mentioned. How dishonest of you (no surprise there). In fact, I knew someone would have the very same knee-jerk reaction you had and link that wikipedia article or a similar one. Go ahead, downvote me all you want. Silence my post just like Twitter used to do with non-woke standpoints. You won't have provided any real arguments to counter my original post, still. You're pointing to a whole other problem there: complying with court orders. And you can thank Lula for that.
>I don’t really understand this absolutist desire for an anonymous town square we can all yell vitriol at each other.
That's quite a ridiculous way of caricaturing free speech. Whenever the truth emerges through the filth, it's "vitriol" and "hate speech" for your delusional lot. Wokeness lost. Own it. Now everyone can see everything for what it is, with no corporate, partisan or governmental censors trying to manipulate reality to fit their abject narrative.
“collaborating with governments to silence opposition and dissent”: he reinstated the ban on opposition accounts in Brazil at the behest of the Brazilian government. Not sure how much more direct an example I could give.
The fuel cost to launch is significant here, as you'd want to put this into quite a high orbit. You absolutely don't want this stuff in low orbit where it will be taken down eventually by atmospheric drag, or by slamming into some other spacecraft.
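Rough numbers on that first point, assuming a 400 km parking orbit and a hydrolox-class departure stage (both assumptions, just to put a scale on "fuel cost"):

```python
import math

# Back-of-the-envelope delta-v for pushing waste from low Earth orbit all the
# way out of Earth's gravity well (one common "very high orbit" disposal idea).
MU = 3.986e14              # Earth's gravitational parameter, m^3/s^2
R_EARTH = 6.371e6          # m
r = R_EARTH + 400e3        # assumed 400 km parking orbit

v_circ = math.sqrt(MU / r)        # ~7.7 km/s, circular orbit speed
v_esc = math.sqrt(2 * MU / r)     # ~10.9 km/s, escape speed at that altitude
dv = v_esc - v_circ               # extra burn needed on top of reaching LEO

isp, g0 = 450, 9.81               # assumed high-end Isp (hydrolox-class stage)
mass_ratio = math.exp(dv / (isp * g0))          # Tsiolkovsky rocket equation
print(f"delta-v from LEO to escape: {dv / 1000:.1f} km/s")
print(f"propellant fraction of the departing stage: {1 - 1 / mass_ratio:.0%}")
# ~3.2 km/s, i.e. roughly half the departing stage's mass is propellant, on top
# of the ~9 km/s already spent just getting the payload into orbit.
```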
Another issue is the reliability of your launch vehicle. Sometimes these things blow up on launch, which you really don't want happening when it's loaded with radioactive waste. Developments made by SpaceX might contribute positively to both of these, by lowering the cost and increasing the reliability.
A third issue is that this stuff might turn out to be valuable after all. Hopefully not for making bombs, but future technology might come up with socially positive uses for material that's currently considered dangerous waste.
That is a reasonable question. I think the answer has a lot to do with why the dead are being dug up. If the dead are being exhumed by archeologists seeking to make information about our ancestors available to everyone, that's very different from grave robbers looking for gold.
If an archaeologist goes down to the local cemetery and digs someone up, runs a DNA test and publishes the results (‘oh, and I found a gold ring!’) it’s clearly not ok.
The line is there somewhere; I'd suggest it's seen as OK at 150-200 years or older.
There are some tasks that I would have used a search engine for in the past, because that was more or less the only option. Say, searching for "ffmpeg transcode syntax" and then spending 15 minutes comparing examples from Stack Overflow with the official documentation to try to make sense of them. Now I can tell Claude exactly what I'm trying to accomplish and it will give me an answer in 30 seconds that's either correct, or close enough for me to quickly make up the difference.
I'm still going to turn to Google to find out what a store's opening hours or phone number is, as well as a lot of other tasks. But there are types of queries that are better suited for an LLM, that previously could only be done in a search engine.
There's also a non-technical reason for LLM search. Google built its business on free search, paid for by advertising, which seemed like a good idea in the early 2000's. A few decades later and we have a better appreciation for the value of the ad-driven business model. Right now, there's a whole lot of money being thrown at online LLMs, so for the most part they're not really doing ads yet. It's refreshing to make a query and not have sponsored results at the top of the list. Obviously, the free online LLM business model isn't going to last indefinitely. In the pretty near future, we'll either need to start paying a usage fee, or parse through advertisements delivered by LLMs as well. But it's nice while it lasts.
> Obviously, the free online LLM business model isn't going to last indefinitely.
I think this is one of those points where LLMs have already changed the paradigm.
People liked being able to search, but would not pay for it. For many queries, the value wasn't there: users still had to scroll through pages or tinker to find the right query for the required result.
Eventually search got so much worse from SEO spam that Kagi stepped in to fill the void.
LLMs start from a different direction. The value is clearly there. OpenAI and the rest still have a ton of paying subscribers.
I do think eventually some will incorporate ads, but I think innovation has revealed that there's a market (perhaps a substantial one) for fee-based information search with LLMs.
Ads delivered via LLMs will cost more to distribute, which means higher cost for the businesses purchasing ads, perhaps high enough to deter a lot of smaller ads customers, so I think we'll see an interesting dynamic appear there. Especially if the ad-laden SEO-boosted sites suffer from further enshittification of Search, which has been spinning in its own vicious cycle lately.