Llama.cpp: Full CUDA GPU Acceleration (github.com/ggerganov)
728 points by gzer0 on June 13, 2023 | 310 comments



llama.cpp is great. It started off as a CPU-only solution and now looks like it wants to support any computation device it can.

I find it interesting that it's an example of ML software that's totally detached from the Python ML ecosystem and also popular.

Is Python so annoying to use that when a compelling non-Python solution appears, everyone will love it? Less hassle? Or did it take off for a different reason? Interested in hearing thoughts.


> Is Python so annoying to use that when a compelling non-Python solution appears, everyone will love it? Less hassle? Or did it take off for a different reason? Interested in hearing thoughts.

For me it's less about the language and more about the dependencies. When I want to run a Python ML program, I need to go through the hassle of figuring out what the minimum version is, which package manager/distribution I need to use, and what system libraries those dependencies need to function properly. If I want to build a package for my distribution, these problems are dialed up to 11 and make it difficult to integrate (especially when using Nix). On top of that, those dependencies typically hide the juicy details of the program I actually care about.

For something like C or C++? Usually the most complicated part is running `make` or `cmake` with `pkgconfig` somewhere in my path. Maybe install some missing system libraries if necessary.

I just don't want to install yet another hundred copies of dependencies in a virtualenv and just hope it's set up correctly.


> For me it's less about the language and more about the dependencies. When I want to run a Python ML program, I need to go through the hassle of figuring out what the minimum version is, which package manager/distribution I need to use, and what system libraries those dependencies need to function properly.

This is exactly why I hate Python. They even have a pseudo-package.json style dependencies file that you should supposedly be able to just run "install" with, but it NEVER works. Not once have I downloaded someone's Python project from GitHub, tried to install the dependencies and run it, and had it go smoothly and without issue.

The Python language itself may be great, I don't know, but I'm forever put off from learning or using it because clearly in all the years it's been around they have yet to figure out reproducibility of builds. And it's obviously possible! JavaScript manages to accomplish it just fine with npm and package.json. But for some reason Python and its community cannot figure it out.


Would the problem "how can I run this cool ML project from GitHub" be solved if developers would publish their container images on Docker Hub? The only downside I see is enormous image sizes.


"Here's a massive glob of mystery-meat state to run my massive glob of mystery-meat tensors".


"and by the way, inside the image I slightly modified one file of the tensors library (and this is undocumented) and this totally changes the output if the fix is not there"


They just need a Dockerfile that builds correctly. No need to make an image available; the Dockerfile should be able to build it consistently.


I use pypoetry for dependency management in Python projects. It helps a lot, but doesn't resolve the issue of pip packages failing to install because you're missing system libraries. At least it specifies the Python version to use. With many open source ML repos I have to guess what Python version to use.

I'd really like to see more Docker images (images, not Dockerfiles that fail to build). Maybe Flatpak or snap packages do the trick, too.


> Not once have I downloaded someone's Python project from GitHub, tried to install the dependencies and run it, and had it go smoothly and without issue.

Same here. If it can't resolve dependencies or whatever, then there will almost certainly be some kind of showstopping runtime error (probably because of API changes or something). I avoid Python programs at all cost nowadays.


This.

`pip` is the package manager that almost works.

Python is the language that almost supports package distribution.

I'll keep using `apt` on vanilla Debian.


Strong agree. I'll willingly install a handful of dependencies from my distro package manager, where the dependencies are battle-hardened Unixy tools and I can clearly see what they do and how they do it.

I'm not going to install thousands of dodgy-looking packages from pip, the only documentation for which is a Discord channel full of children exchanging 'dank memes'.

I like Python, but I simply do not trust the pip ecosystem at this point (same for npm, etc.).


> I'm not going to install thousands of dodgy-looking packages from pip, the only documentation for which is a Discord channel full of children exchanging 'dank memes'.

This made me laugh. It’s true, isn’t it? That’s really what we deal with day to day (for me in the js world, the create react app dependencies make my head spin)


> For something like C or C++? Usually the most complicated part is running `make` or `cmake` with `pkgconfig` somewhere in my path. Maybe install some missing system libraries if necessary.

I don't mean to detract from your main point re Python dependencies, but I find this about C to be rarely true. `make` etc build flows usually result in dependency-related compile errors, or may require a certain operating system. I notice this frequently with OSS or academic software that doesn't ship binaries.


Yep. I've given up on any C or C++ projects because I find they almost never work and waste hours of my time. Part of the issue might be the fact that I'm often using Windows or macOS, but I've had bad experiences on Linux also.


> `make` etc build flows usually result in dependency-related compile errors,

Which are displayed to me during the `./configure` step before the `make`, and usually require me to type "apt-get install [blah] [blah] [blah]", and to run configure again.


Not all configure scripts are created equal: a lot of them only tell you about missing dependencies one at a time.

I'm glad make, etc., works for you. But for me, neither C, C++, nor Python is particularly enjoyable dependency-wise.


So much this. As someone whose bread and butter is systems programming for things that run on end-user devices, every time I dig into a Python project I feel like I've been teleported into the darkest timeline, where everything is environment management hell.

Even the more complex and annoying dependency-management scenarios in native-land still feel positively idyllic in comparison to Python venvs.


When I initially started to learn Python (1.6), virtualenv was starting to be adopted, and since then things have hardly changed.

It also helps that even minor versions introduce breaking changes.

I doubt anyone really knows Python that well, unless they are on the core team.


It was fine, back in the early days (I started with 1.4-ish). I just downloaded the tarball, unpacked, configured, make, installed into /usr/local on my workstation, then downloaded and stuck any packages into site-packages. Numeric was sometimes tricky to compile right, but ye olde "configure && make && make install" worked fine.

Of course, that worked because 1) I was really only doing one project, not juggling multiple ones, 2) there weren't all that many dependencies (Numeric, plotting, etc.), and 3) I was already up to my eyeballs in the build system with SWIG and linking to the actual compute code, so I knew my way around the system.

But every now and then I just shake my fist at the clouds then mutter darkly about just installing the dang thing and maybe not taking on so many dependencies. :-)


I’m newish to python, I’ve only used it for machine learning projects and some web scraping. Could somebody elaborate on venv? I just started using it but now everyone in this thread is saying how much they hate it. Is there an alternative?


venv is fine. Remember that this is a self-selected sample of people. You’re going to bump into the flaws and gotchas but it’s a perfectly usable tool.


Uhm so I was professionally setting up ML distros and ML containers for cloud deployments. Venv is not fine, especially if you've seen how other langs do it.


Could you say more about what you don’t like about venv compared to how other languages do it?


Just store your deps in project directory instead of using hidden fucking magic.


The hidden magic is adjusting env vars used by python, LD, etc. Adjusting paths only seems like magic until you understand what it’s doing.

I’ve done this with plenty of languages, including C/C++.
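To make the "adjusting env vars" point concrete, here is a minimal sketch of roughly what an activate script changes: it sets `VIRTUAL_ENV`, puts the environment's bin directory (Scripts on Windows) at the front of `PATH`, and unsets `PYTHONHOME`. The helper name `activated_env` and the example path are made up for illustration, and real activate scripts also adjust the shell prompt, which is left out here.

```python
import os
from pathlib import Path


def activated_env(env_dir, base_env=None):
    """Return a copy of the environment as it would roughly look after activation.

    `activated_env` is a hypothetical helper, not part of the venv module.
    """
    env = dict(os.environ if base_env is None else base_env)
    bin_dir = Path(env_dir) / ("Scripts" if os.name == "nt" else "bin")
    env["VIRTUAL_ENV"] = str(env_dir)
    # Prepending means a bare "python" or "pip" resolves to the venv's copy first.
    env["PATH"] = str(bin_dir) + os.pathsep + env.get("PATH", "")
    # A stray PYTHONHOME would override where the interpreter looks for its library.
    env.pop("PYTHONHOME", None)
    return env


if __name__ == "__main__":
    # Example path is an assumption; any venv directory would do.
    env = activated_env("/home/user/project/.venv", {"PATH": "/usr/bin"})
    print(env["PATH"])
```

Passing a dictionary like this as `env=` to `subprocess.run` gives child processes the same view an activated shell would, without sourcing anything.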


Yes I know how it works but try explaining that to someone learning flask at a boot camp.


I mean I wouldn’t expect someone at a boot camp to understand this nor was this a topic of conversation, so /shrug?


The good news is they're working towards encouraging a standard .venv in project directory.

https://peps.python.org/pep-0704/


IMHO venvs are fine as an implementation detail, a building block for a slicker tool.

The annoyance with venvs is you have to create and activate them. In contrast for cargo (or stack or dotnet or yarn or pipenv or poetry), you just run the build tool in the project directory.

Another limitation of venv is it doesn't solve the problem of pinning a version of Python, so you need another tool.


Well, I just spent an hour diagnosing a build failure of llama.cpp due to it picking up the wrong nvcc path.

Dependency problems still happen even with C/C++.


I had the same issue... Turned out it was because I used the Flatpak version of IntelliJ IDEA and it had problems with paths. Running from a plain terminal worked fine.


The Flatpak version of IntelliJ isn't officially maintained by JetBrains. JetBrains only maintains the snap for Linux.


it's fortunate that flatpak solves this problem of reproducible environments so effectively


This is also the reason I like when I see a project in C or C++. It's often a ./configure && make or something. Sometimes when running a Python project, even if the dependencies install, there might be some mystery crash because package dependencies were not set correctly or something similar (I had a lot of trouble with AUTOMATIC1111 StableDiffusion UI when using some extensions that installed their own requirements that might be in conflict with the main project).

With a boring C project, if it compiles it probably works without hassle.

Feels validating that other people have these thoughts too and I'm not just some old fart.


I recently hit the "classic" case. Saw a CLI tool for an API I'd like to use, written in Python. Tried it and found out it didn't work on my machine. I later found out it was a bug in a dependency of that tool. 100 lines of shell script later, I had the functionality I needed, and a codebase which was actually free of unexpected surprises. I know, this is an extreme example, but as personal anecdotes go, Python has lost a lot of trust from my side. I also wonder how people can write >10k codebases without static types, but that is just me ....


It's not that Python is bad. It's that the people who want to just hack something together quickly go to Python, so any time you pick up some software written in Python it's marred with all kinds of compatibility issues and bugs where you can't just run it.

The answer is "yeah use this other software to make it work in an isolated way because the whole ecosystem is actually broken" and that's somehow acceptable.


I think it is the opposite for me, but I am also a fan of system-independent package managers, provided they support easy package configuration.

Otherwise you not only bind to system architecture and OS, you also bind yourself to a distribution.

I find that Automatic1111 plugins tend to not share dependencies and instead redownload them for their own use. Can make your hdd cry because some of these are larger models. Advantages and disadvantages probably...

There are package managers for C and some are quite good. But for most projects you are quite dependent on the package manager of your distro to supply you a fitting foundation. Sometimes it is easy, but if there is a problem, I think handling C is far harder than python. And I write quite a bit of C while I can only perhaps read python code.

No code is completely platform independent, especially a stable diffusion project, but Python is still more flexible than C by a long shot here.

Of course Llama is great. Time to get those LLMs on our devices for our personal dystopian AIs running amok.


> which package manager/distribution I need to use, and what system libraries those dependencies need to function properly

I don't understand why things are so complicated in Python+ML world.

Normally, when I have a Python project, I just pick the latest Python version - unless documentation specifically tells me otherwise (like if it's still Python 2 or if 3.11 is not yet supported). If the project maintainer had some sense, it will have a requirements list with exact locked versions, so I run `pip install -r requirements.txt` (if there is a requirements.txt), `pipenv sync` (if there is a Pipfile), or `poetry install` (if there's pyproject.toml). That's three commands to remember, and that's not one just because pip (the one de-facto package manager) has its limitations but community hadn't really decided on the successor. Kinda like `make` vs automake vs `cmake` (vs `bazel` and other less common stuff; same with Python).
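As a rough illustration of what "exact locked versions" means for a requirements.txt, here is a small sketch that flags lines not pinned with `==`. The function name `unpinned_requirements` and the sample file contents are made up, and the regex is a deliberate simplification that ignores URLs, editable installs and hashes.

```python
import re

# Simplified pattern: name, optional [extras], "==", an exact version with no
# wildcard, and an optional environment marker. Not a full requirements parser.
PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?\s*==\s*[^\s;*]+\s*(;.*)?$"
)


def unpinned_requirements(text):
    """Return the requirement lines that are not pinned to one exact version."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    sample = "numpy==1.24.3\ntorch>=2.0\nrequests\nflask==2.3.2  # web\n"
    print(unpinned_requirements(sample))
```

A repo whose requirements all pass a check like this is at least asking for the same versions each time, though transitive dependencies can still drift unless they are listed too.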

External libraries are typically not needed - because they'll be either provided in binary form with wheels (prebuilt for all most common system types), or automatically built during the installation process, assuming that `gcc`, `pkgconfig` and essential headers are available.

Although, I guess, maybe binary wheels aren't covering all those Nvidia driver/CUDA variations? I'm not an ML guy, so I'm not sure how this is handled - I've heard there are binary wheels for CUDA libraries, but never used that.

Ideally, there's Nix (and poetry2nix) that could take care of everything, but only a few folks write Flakes for their projects.

> Usually the most complicated part is running `make` or `cmake` with `pkgconfig` somewhere in my path

Getting the correct version of all the dependencies is the trickiest part as there is no universal package manager - so it's all highly OS/distro specific. Some projects vendor their dependencies just to avoid this (and risk getting stuck with awfully out-of-date stuff).

> Maybe install some missing system libraries if necessary.

And hope their ABIs (if they're just dynamically loaded)/headers (if linked with) are still compatible with what the project expects. At least that is my primary frustration when I try to build something and it says it doesn't work anymore with whatever the OS provides (mostly Debian stable's fault, lol). It is not exactly fun to backport a Debian package (twice so if doing this properly and not handwaving it with checkinstall).


> Ideally, there's Nix (and poetry2nix) that could take care of everything, but only a few folks write Flakes for their projects.

Relevant to "AI, Python, setting up is hard ... nix", there's stuff like:

https://github.com/nixified-ai/flake


The right combo for Nvidia/CUDA/RandomPythonML library is a nightmare at times. This is especially true if you want to use older hardware like a Tesla M40 (dirt cheap, still capable). And may your maker be with you if you tried to use your distro's native drivers first.

It's fair to say part of the blame is on Nvidia, but wow is it frustrating when you have to find eclectic mixes.


My personal recipe (on NixOS) is a pip-ed virtual environment for quick tests, or conda inside a nix-shell, on top of a dedicated zfs pool/conda mounted in ~/.conda with dedup=on, so nothing nixified and nothing that lasts through a nixos-rebuild...

Many pythonic projects, not only in the ML world, tend to be just developers' experiments, meant to be run as an experiment and not worth packaging as a stable, released program...

Oh, BTW projects like home-assistant fall into the same bucket...


I totally agree with you. The irony is that Python and languages like it were built in part to reduce complexity, not only of the language but also of building and running the code… I feel machine learning is a low enough level thing that it should not be tied to a high-level language like Python… so I can use Node, Ruby, PHP or whatever by adding a C binding, etc. That, to me, is why this is most interesting.


The problem is that python is designed assuming people want to use system-wide packages. In hindsight, that has turned out to be a mistake. Conda / venv try to bridge that gap but they’re kludgy, complex hacks compared to something like cargo or even npm.

Worse, because Python is a dynamic language, you also have to deal with all of that complexity at deployment time. (Vs C/C++/Zig/Rust where you can just ship the compiled binary).


> The problem is that python is designed assuming people want to use system-wide packages.

This wasn't true for decades, `virtualenv` was de-facto standard isolation solution (now baked in as `python -m venv`, still de-facto standard), and `pip` is the package manager (we don't talk about setuptools/distutils, shh!). If someone still used system-wide packages that was either because a) they were building a container or some single-purpose system; or b) they were sloppy or had no idea what they're doing (most likely, following some crappy tutorial). Or it was distro people creating packages to satisfy dependencies for Python programs - but that's a whole different story (and one's virtualenv shouldn't inherit system packages unless it is really really necessary and only if it makes sense to do so).

The problem started when one needed some external non-Python dependencies. Python had invented binary wheels and they've been around for a while (completely solving issues with e.g. PostgreSQL drivers, no one needs to worry about libpq), but I suppose depending on specific versions of kernel drivers and CUDA libraries is a more complex and nuanced subject.

> Vs C/C++/Zig/Rust where you can just ship the compiled binary

Only assuming that you can either statically link, or if all libraries' ABIs are stable (or if you're targeting a very specific ABI, but I've had my share of "version `GLIBC_2.xx' not found"s and not fond of those).

In a similar spirit, any Python project can be distributed as one binary (Python interpreter and a ZIP archive, bundled together) plus a set of zero or more .so files.
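The ZIP-archive half of that can be sketched with the standard library's `zipapp` module. This is only an illustration under assumptions: the tiny app and the helper name `build_and_run` are made up, and the sketch still relies on an interpreter already being installed, so it does not cover bundling the interpreter itself or any .so files.

```python
import subprocess
import sys
import tempfile
import zipapp
from pathlib import Path


def build_and_run():
    """Pack a one-file app into a .pyz archive and run it with the current interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "app"
        src.mkdir()
        # zipapp expects a __main__.py at the top level of the source directory.
        (src / "__main__.py").write_text('print("hello from a zip archive")\n')
        target = Path(tmp) / "app.pyz"
        zipapp.create_archive(src, target)
        # The interpreter can execute a zip archive directly.
        result = subprocess.run(
            [sys.executable, str(target)],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()


if __name__ == "__main__":
    print(build_and_run())
```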


> This wasn't true for decades, `virtualenv` was de-facto standard isolation solution (now baked in as `python -m venv`, still de-facto standard)

Right; but python itself doesn’t check your local virtual environment unless you “activate” it (ugh what). And it can’t handle transitive dependency conflicts, like node and cargo can. Both of those problems stem from python assuming that a simple, flat set of dependencies are passed in from its environment variables.


Virtual envs are actually quite simple -- they contain a bin/ directory with a linked python binary. When the python binary runs, it checks its sibling directories (it knows it was executed as e.g. /home/user/.venv/bin/python) for what to load. You don't need the activate shell scripts or anything, just running that binary within your venv is enough; the shell script is just for convenience of inserting the bin directory into the $PATH so just "python" or "pip" runs the right thing.
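A quick way to check this claim is to create a throwaway venv and call its interpreter by path, with no activation. The sketch below assumes the standard `venv` module is available and skips pip to keep it fast; the helper names `venv_python`, `runs_inside_venv` and `demo` are made up for illustration.

```python
import os
import subprocess
import tempfile
import venv
from pathlib import Path


def venv_python(env_dir):
    """Path of the interpreter inside a venv (layout differs on Windows)."""
    sub = "Scripts/python.exe" if os.name == "nt" else "bin/python"
    return Path(env_dir) / sub


def runs_inside_venv(env_dir):
    """Run the venv's interpreter directly and ask whether it sees a venv."""
    # Inside a venv, sys.prefix points at the venv and differs from sys.base_prefix.
    probe = "import sys; print(sys.prefix != sys.base_prefix)"
    result = subprocess.run(
        [str(venv_python(env_dir)), "-c", probe],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == "True"


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / ".venv"
        # with_pip=False avoids bootstrapping pip; symlinks mirrors the CLI default.
        venv.create(env_dir, with_pip=False, symlinks=(os.name != "nt"))
        return runs_inside_venv(env_dir)


if __name__ == "__main__":
    print(demo())
```

No activate script is sourced anywhere in this sketch; the interpreter works out that it lives in a venv from its own location.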


> the shell script is just for convenience of inserting the bin directory into the $PATH so just "python" or "pip" runs the right thing.

Or so any reference in the program you run that launches another binary or loads a DLL relying on the environment gets the right one, etc. There are some binaries you can run without activating a venv with no problem, and others will crash hard, and others will just subtly do the wrong thing if the conditions are “right” in your normal system environment.


Another implication of this is that it's impossible for 2 mutually incompatible copies of the same package to exist in the same environment. If packageA needs numpy 1.20 and packageB needs numpy 1.21, you're stuck.
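One way to see why: the interpreter keeps a single module object per import name in `sys.modules`, so whichever copy is found first on `sys.path` is the one every importer gets. The sketch below fakes two versions of a made-up package called `demo_pkg` in two temporary directories; the names and version strings are purely illustrative.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def write_fake_package(site_dir, version):
    """Create a minimal package named demo_pkg (hypothetical) with a version string."""
    pkg_dir = Path(site_dir) / "demo_pkg"
    pkg_dir.mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text(f'__version__ = "{version}"\n')


def import_twice():
    """Put two versions on sys.path and report what two imports actually see."""
    with tempfile.TemporaryDirectory() as tmp:
        site_a = Path(tmp) / "site_a"
        site_b = Path(tmp) / "site_b"
        write_fake_package(site_a, "1.20")
        write_fake_package(site_b, "1.21")
        sys.path[:0] = [str(site_a), str(site_b)]
        importlib.invalidate_caches()
        try:
            first = importlib.import_module("demo_pkg")
            second = importlib.import_module("demo_pkg")
            # Both imports return the same cached module: the first one found.
            return first.__version__, first is second
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(str(site_a))
            sys.path.remove(str(site_b))
            sys.modules.pop("demo_pkg", None)


if __name__ == "__main__":
    print(import_twice())
```

The 1.21 copy is on the path but never loaded, which is the flat-namespace behaviour that pip has to work around by choosing one version per environment.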


> Virtual envs are actually quite simple

You have never trashed your system from virtualenv?

Also, there is a problem when wheels assume they can have everything like tensorflow from years ago -- I don't know about now, since tf used to be tied to cuda versions you could get into trouble installing tf versions, even with venv, conda, etc.


> You have never trashed your system from virtualenv?

Unless one has done something they shouldn't have done (in particular, using sudo while working with virtualenv), this shouldn't be possible.

Due to limitations of most commonplace system-wide package managers (like dpkg, rpm or ebuild, not modern stuff like nix), system packages exist to support other system packages. One installs some program, it needs libraries, dependencies get pulled. And then it's the distro package manager's job to ensure compatibility and deal with multiple version conflicts (not fun).

But if you start or check out some project, common knowledge was that you shouldn't be relying on system packages, even if they're available and could work. With some obligatory exceptions, like when you're working on distribution packaging, or developing something meant to be tightly integrated with a particular distro (like corporate standard stuff).

That is, unless we're talking about some system libraries/drivers needed for CUDA in particular (which is system stuff) rather than virtualenv itself.


> That is, unless we're talking about some system libraries/drivers needed for CUDA in particular (which is system stuff) rather than virtualenv itself.

Sir, this is an ML thread.

Venv interacts with that poorly, though to be fair it could be Google's fault. Still, it shouldn't even be possible.


I mean, virtualenv is not supposed to interact with that at all. System libraries are the system package manager's responsibility. Doubly so, as - as I get it - all this stuff is directly tied to the kernel driver.

What Python's package manager (pip in virtualenv) should do is build (or download prebuilt binaries) the relevant bindings and that's the extent of it. If others say it works this way with C (that comment about cmake and pkgconfig), then it must work this way with Python too.


> If someone still used system-wide packages that was either because a) [...] b) [...]

Or simply because they are packagers for some distro and their users want a simple way to pull in some software by its name, while the upstream devs imagine people cloning their public repo and running the software from the checkout in their own home, with regular pulls, regularly rebuilding the needed surroundings...

Not to mention modern systems/distros with a not-really-POSIX vision like NixOS or Guix System...

> In a similar spirit, any Python project can be distributed as one binary

A single 10+Gb binary :-D


Ridiculous. A 10 GB binary only if machine learning models are involved. I have distributed full-stack binaries in 70 MB or less.


> For something like C or C++? Usually the most complicated part is running `make` or `cmake` with `pkgconfig` somewhere in my path. Maybe install some missing system libraries if necessary.

If you don't use virtual environments in python, isn't it basically the same in python? Just run `pip install` and maybe install some missing system libraries if necessary. In practice, it's not that simple in either language, and "maybe install some missing dependencies" sweeps a lot of pain under the rug.


I wonder if having a shared cache would make this more easeful. fwiw nix does that.


You might like Cog. It solves these problems for ML projects specifically.


Wow, so interesting to see the "depth" of anti-python feeling in some quarters. I guess that is the backlash from all the hordes of Python-bros.

Having used both C++ and Python for some time, the idea that managing C++ dependencies is easier than venv and pip install is one of the moments you wonder how credible HN opinion is on anything.

> a compelling non-Python solution appears

Confusing a large ML framework like pytorch that allows you to experiment and develop any type of model with a particular optimized implementation in a low level language suggests people are not even aware of basic workflows in this space.

> also popular

Of course it's popular. As in: People are delirious with LLM FOMO but can't fork over gazillions to cloud gatekeepers or NVIDIA, so anybody who can alleviate that pain is like a deus ex machina.

Of course llama.cpp and its creator are great. But the exercise primarily points out that there isn't a unified platform to both develop and deploy ML type models in a flexible and hardware agnostic way.

p.s. For julia lovers that usually jump at the "two-language problem" of Python: here is your chance to shine. There is a Llama.jl that wraps Llama.cpp. You want to develop a native one.


Managing C++ dependencies _is_ much easier! It's either "run this setup exe" or "extract this zip file/tarball/dmg and run".

This is because most people don't care about developing the project, just using it. So they don't care what the dependencies are, just that things work. C++ might be more difficult to handle dependencies to build things, but few people will look into hacking on the code before checking to see if it's even relevant.


> Managing C++ dependencies _is_ much easier! It's either "run this setup exe" or "extract this zip file/tarball/dmg and run".

Not sure if that was half sarcastic, but from experience, anything touching c++ & CUDA has the potential to devolve into a nightmare on a developer-friendly platform (hello kernel updates!), or worse, into a hair-pulling experience if you happen to use the malware from Redmond (to compound your misery, throw boost and Qt dependencies in the mix).

Then again, some of the configurations steps required will be the same for Python or C++. And if the C++ developer was kind enough to provide a one-stop compilation step only depending on an up-to-date compiler, it might, indeed, be better than a Python solution with strict package version dependencies.


Maybe this distinction explains indeed the dissonance! But it might be rather shortsighted given the state of those models and the need to tune them.


If we're only talking about end-user "binaries", you can also package Python projects into exe files or a similar format that bundles all the dependencies and is ready to run.


>Wow, so interesting to see the "depth" of anti-python feeling in some quarters. I guess that is the backlash from all the hordes of Python-bros.

I think you are generalizing. I do not hate on Python the language, but these ML projects are a very, very terrible experience. Maybe you can blame the devs of these ML projects, or the ones of the dependencies, but the experience is shit. You can follow step-by-step instructions that worked 11 days ago and today they are broken.

I had similar issues with Python GTK apps: if the app is old enough then you are screwed because that old GTK version is no longer packaged; if the app is very new then you are screwed again because it depends on the latest version of some GTK wrapper/helper.


I think what has happened is that because Python is sweet and easy to use for many things, it generated irrational expectations that it is perfect for all things. But it's just an interpreted language that started as a scripting and gluing oriented language.

Its deployment story is where this gap frequently shows. Desktop apps are at best passable, whereas e.g. Android apps are practically nonexistent despite the efforts of projects like kivy.


The problem seems to be getting a project that works on the developer machine packaged and distributed to non-developers. Some types of projects seem harder to distribute; these ML dependencies seem to change fast and everything breaks (maybe because the dependencies are not locked correctly).


I think the Python hate has been manufactured, starting with Google's launch of the Go language, which wanted to eat Python's cake.

And some jumped on the started bandwagon unknowingly, and never got off of it.


Just to balance things out: I still love Python. A lot!


I don't even write python, really, but I've been interfacing with llama.cpp and whisper.cpp through it recently because it's been most convenient. Before that I was using nodejs libraries that just wrap those two cpp libs.

I guess since these models are meant to be run "client side" or "at the edge" or whatever you want to call it, it helps if they can be neutrally used with just about any wrapper. Using them from Javascript instead of Python is sort of huge for moving ML off the server and into the client.

I haven't really dipped my toes into the space until llama and whisper cpp came along because they dropped the barrier extremely low. The documentation is simple, and excellent. The tools that it's enabled on top like text-generation-webui are next level easy.

git clone. make. download model. run.

That's it.


The quality of HN comments has been getting bad for a few months. This has nothing to do with the Python ML ecosystem. What you have to realize is that llama.cpp is doing inference on already-built models, that is, running the models.

Building (training) machine learning and deep learning models is much more complex, an order of magnitude more complex, than just running the models, and doing that in C or C++ would take you years where it would take just a few months with Python.

And complexity of `pip install` is nothing compared to that.

That's why no real ETL + deep learning training work is done in C or C++.


You're not entirely wrong but pretty much everything you use from python is written in C++ anyway, so what's your point?


The point is, as you pointed out, that you code against the appropriate level of abstraction. You write an ML-workflow-appropriate language like Python in something like C++/Rust, and ML flows in Python. That should really not be that hard to understand.


It is the same argument as "every HTML, CSS, JavaScript development you do is written in C/C++ anyway".


I am not an expert, but from what I've seen, PyTorch is mostly a thin wrapper over the C++ libtorch. The same is true for DOM as well, but nobody uses DOM directly, whereas everybody uses PyTorch.

The big shift is Jupyter, but that's mainly for exploratory programming. If you already know what you're doing, there's no reason why C++ should be worse than Python for training. It's likely that most ML engineers do not have experience with C++.

I don't have that experience either but from what I've seen, C++ is very powerful, so once you subtract the jupyter, there's not really too much left.

BTW: You cannot use DOM/CSSOM from C++, the only API is JS, so your argument is theoretical.


Yes, Python is incredibly annoying to use. Their dependency management is a total mess, and it's incredible how brittle packages are if there are even minor point changes in versions anywhere in a stack.


I have to agree. Installing dependencies for some git repos is a total crapshoot. I ended up wasting so much hard drive space with copies of pytorch. Meanwhile llama.cpp is just "make" and takes less time to build than to download one copy of pytorch.


So, the solution is that everyone should write code as self-contained C++ code and not use any software libraries ever. Dependency hell has been solved for all time!


Python was released in 1991. It obviously didn’t get package management right, just like C and C++ didn’t figure it out in the 70s.

Take a look at rust’s Cargo for what a modern package manager should look like. Or deno / Go if you swing that way.

Which old language gets package management right? None of them. None of them get it right.

And sure - conda / venv / CMake / etc help. But last century’s bad design decisions still shine through.


Well, in the cases where the libraries are causing more work than they are solving. Then yes.


There is a happy medium... somewhere. After following Postgres development for the better part of a decade, I think it's definitely closer to the python side of things... But man they (python) do make it hard to like using that ecosystem.

The flip side is like you said... You will just have to reimplement everything yourself and then you can never worry about dependencies again! Just hope you didn't introduce some obscure security issue in your hashmap implementation.


I personally really don't like Python much. I find it as tedious to write as Go, but without the added performance, type safety and autocomplete benefits that come with it in exchange.

If I have to use a dynamic language, at least make it batteries-included like Ruby. Sure it's also not performant but I get something back in exchange.

Python sits in a very uncomfortable spot which I don't find a use for. Too verbose for a dynamic language and not performant enough compared to a more static language.

The testing culture is also pretty poor in my opinion; packages rarely have proper tests (especially in the ML field).


In addition to the above:

1) function decorators etc have made the code unreadable

2) while code is succinct, a lot of abstraction is hidden in some C/C++ language binding somewhere, so, when there is a problem, it is hard to debug

3) Pytorch has become a monolithic monster with practically no-one understanding its functionality e-2-e


For me Python's main use cases are being BASIC replacement, and a saner alternative to Perl (I like Perl though) for OS scripting.

For everything else, I rather use compiled languages with JIT/AOT toolchains, and since most "Python libraries" for machine learning are actually C and C++, any language goes, there is nothing special about Python there.


The Python apologists are more annoying than the language.

It's always been obvious that ML’s marriage to Python has been credential-laden proponents in tangentially related fields following groupthink.

As soon as we got a reason to ignore those PhDs, their gatekept moat evaporated overnight and the community of [co-]dependencies became irrelevant overnight.


As far back as 2015, it’s been common to take neural net inference and get a C++ version of it. That’s what this is.

It didn’t make python obsolete then and it won’t now.

Training (orchestration, architecture definition etc.) and data munging (scraping, cleaning, analyzing etc.) are much easier with Python than C++, and so there is no chance that C++ takes over as the lingua franca of machine learning until those activities are rare.


python is syntactic sugar - the heavy lifting is done by c/c++ bindings.

many ML experts are not software engineers. They just want syntax to get their job done. fair enough.


Even "compiled" JAX or PyTorch can leave some performance on the table even if you hit the common path (it's also "praying" that the compiler actually works if you do anything non-standard).

But memory wise, there is almost no optimization or reuse (and it's sacrificed for performance), which leads to insane memory usage.

And it's not that they are bad - but optimal compilation of a graph is combinatorially explosive problem, impossible without heuristics and guesswork (what to reuse and waste memory vs recompute). A good programmer can do a significantly better job.


> Is Python so annoying to use that when a compelling non-Python solution appears, everyone will love it? Less hassle? Or did it take off for a different reason? Interested in hearing thoughts.

Exploring the landscape ends up with you having 29384232938792834234 different python environments, because that one thing requires specific versions of one set of libraries, while that other thing requires different versions of the same library and there is no middle ground.

It's horribly annoying and I absolutely love python!


I'm currently a Python dev for a living, but I spend a portion of my personal time trying to get better at C/C++, and in my case, it's strictly about the potential for writing faster code. I'm interested in getting into DSP stuff specifically, so it's 100% necessary if I want to do that, and I would also like to get my head wrapped around OpenCV for similarly creative reasons.

Python opened up the world of code to a lot more people, but there was a cost to that, and the real action as far as actual computer systems go is always gonna be at a much lower level, in much the same way that a lot of people can top up their fluids but most of us pay someone to change the oil.

I could actually totally see oil changes becoming completely robotic in the future, but we would first have to establish open standards for oil pans that all automakers adhered to.

The whole computers designing computers thing, outside of someone cracking cold fusion I don't think we'll ever have the juice for it. In my lifetime, a C-level programmer will never be out of work, but I suspect that the demand for Python programmers is going to slack off, while the supply continues to grow.


Yet another TEDIOUS BATTLE: Python vs. C++/C stack.

This project gained popularity due to the HIGH DEMAND for running large models with 1B+ parameters, like `llama`. Python dominates the interface and training ecosystem, but prior to llama.cpp, non-ML professionals showed little interest in a fast C++ interface library. While existing solutions like tensorflow-serving [1] in C++ were sufficiently fast with GPU support, llama.cpp took the initiative to optimize for CPU and trim unnecessary code, essentially code-golfing and sacrificing some algorithm correctness for improved performance, which isn't favored by "ML research".

NOTE: In my opinion, a true pioneer was DarkNet, which implemented the YOLO model series and significantly outperformed others [2]. Same trick basically like llama.cpp

[1] https://github.com/tensorflow/serving [2] https://github.com/pjreddie/darknet


I believe it's moreso for the (actively pursued) speed optimizations it provides. When inference is already computationally expensive any bit of performance is a big plus


yeah it's sad I guess but half the reason I am recommending this is that it "just works" so much more easily than installing half the Python ecosystem into a conda environment (ironically, just so that Python can then behave as a thin wrapper for calling the underlying native libraries ...)


I've been using it because there are bindings in other languages I know like .NET


I've been wondering for awhile now - what was ever the benefit to building these things in Python, other than pytorch and numpy being there to experiment on in hobbyist ways? There's no way that a serious AI is really going to be built in a scripting language, is there? Once you know what you actually want it to do, you're definitely going to rebuild it as close to the metal as you can, right?

Not to mention, to really protect source code and all the sugar around the training systems, it's going to be a good investment to get out of hobby land and manage your own memory and just code them in C/C++.

It strikes me that the hobby AI ethos aligns very well with scripting languages in that they both assume the availability of endless resources to push things a little dirtier and messier and see if anything interesting emerges. Which is great for hobby AI. It's probably not the future, though, unless resource availability outpaces the imagination of people to write more and more bloated scripts to accomplish what's already been proven.


incredibly condescending which always pairs well with ignorance. "Hobby AI" are the people (mathematicians, domain experts etc) that made this all possible so that you can now "just code it in C/C++".

Have you ever tried to iterate developing any serious class of algorithms in C++?

> Once you know what you actually want it to do

When is that exactly? Even the last few months of LLM land development show very clearly how everything is rapidly evolving (and will very likely continue for quite some time).

Numerical linear algebra stabilized decades ago so you do have low-level libraries in C++ (or even fortran) but there is quite some distance between an LLM and linear algebra.


> Have you ever tried to iterate developing any serious class of algorithms in C++?

Yes, and it's not so bad. A lot of ML deployments have been based on C/C++ for inference anyway (with Python driving the training). So that's really nothing new. I.e. most Python research code is not deployable in terms of quality / performance.


> Yes, and it's not so bad.

In the absence of alternatives it's workable. Libraries like Eigen or Armadillo help a lot. But Python with numpy has been extremely popular for a good reason.


Not quite. Nothing to do directly with python. This was the introduction of 4 and 8 bit quantization to a large number of people.

There wasn’t a python library like that anyone was used to using. Would have always been a C extension anyway.

Starting with the CPU in this situation made sense, strangely. There are Python wrappers now. I tried to make one for y’all in Rust in April, but haha I had a compiler issue I never solved.
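
For anyone who hasn't seen what low-bit quantization looks like mechanically, here is a minimal sketch of symmetric block quantization to 4 bits. It is a simplified illustration only, not ggml's actual on-disk format: the function names and the one-scale-per-block layout are assumptions made for the example.

```python
def quantize_block(values, bits=4):
    """Quantize a block of floats to signed integers plus one shared scale.

    Illustrative only: real formats differ in block size and packing.
    """
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit signed values
    amax = max(abs(v) for v in values)         # largest magnitude in the block
    scale = amax / qmax if amax > 0 else 1.0   # avoid dividing by zero
    quantized = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate floats from the integers and the shared scale."""
    return [scale * q for q in quantized]


block = [0.0, 0.5, -1.0, 0.25]
scale, q = quantize_block(block)
approx = dequantize_block(scale, q)
```

Each weight is stored as a small integer and only the scale is kept as a float, so the per-weight rounding error is at most half of the scale.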


It's a great project and an impressive achievement, but I'm also struggling to understand what people use it for that PyTorch wasn't offering. Easy deployment on iOS I guess? I would have thought that's a pretty small use case though.

Given the author hand-rolled his own FFT, I'm also guessing it's not as performant?


Pytorch (+GPU) dependency and python container type diversity are particularly bad. Programmers may not perceive this since they're already managing their python environment keeping all the OS/libs/containers/applications in the alignment required for things to work but it's quite complex. I couldn't do it.

In comparison I could just type git clone https://github.com/ggerganov/llama.cpp and make . And it worked. And since then I've managed to get llama.cpp clBLAS partial GPU acceleration working with my AMD RX 580 8GB. Plus with the llama.cpp CPU mmap stuff I can run multiple LLM IRC bot processes using the same model all sharing the RAM representation for free. Are there even ways to run 2 or 3 bit models in pytorch implementations like llama.cpp can do? It's pretty rad I could run a 65B llama in 27 GB of RAM on my 32GB RAM system (and still get better perplexity than 30B 8 bit).
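
As a back-of-the-envelope check on numbers like these, here is a rough sketch of the arithmetic. It only counts the weights themselves and ignores per-block scales, the KV cache and runtime overhead, so treat the results as approximations; the helper names are made up for the example.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate bytes needed to hold the weights alone."""
    return n_params * bits_per_weight / 8


def effective_bits(n_params, total_bytes):
    """Average bits per weight implied by a given memory footprint."""
    return total_bytes * 8 / n_params


GB = 1e9  # decimal gigabytes, to keep the arithmetic simple

fp16_65b = weight_bytes(65e9, 16) / GB      # 65B model at 16 bits per weight
int8_30b = weight_bytes(30e9, 8) / GB       # 30B model at 8 bits per weight
bits_65b_in_27gb = effective_bits(65e9, 27 * GB)

print(fp16_65b, int8_30b, bits_65b_in_27gb)
```

Under those assumptions, fitting 65B parameters into 27 GB works out to a little over 3 bits per weight, which is consistent with the 2 to 3 bit quantization mentioned above.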


> In comparison I could just type git clone https://github.com/ggerganov/llama.cpp and make . And it worked.

You're comparing a single, well managed project that had put effort into user onboarding against all projects of a different language and proclaiming that an entire language/ecosystem is crap.

The only real take away is that many projects, independent of language, put way too little effort towards onboarding users.


That's exactly it. Who wants to try pulling in an entire language with wide dependencies and its ecosystem of envs/containers/etc when a single program will do it? Not people who just want to run inference on pre-made models.


Easy deployment anywhere, not just iOS. I haven't used Python in years so I have no idea what package manager is the best now, completely forgot how to use virtualenv, and it only took a few weeks to completely fuck up my local Python install ("Your version of CUDA doesn't match the one used to blah blah")

Python is a mess. llama.cpp was literally a git clone followed by "cd llama.cpp && make && ./main" - I can recite the commands from memory and I haven't done any C/C++ development in a long time.


For most modern ML projects in python you can just do something like `conda env create -f environment.yaml` then straight to `./main.py`. This handles very complex dependencies in a single command.

The example you gave works because llama.cpp specifically strives to have no dependencies. But this is not an intrinsically useful goal; there's a reason software libraries were invented. I always have fun when I find out that the thing I'm trying to compile needs -std=C++26 and glibc 3.0, and I'm running on a HPC cluster where I can't use a system level package manager, and I don't want to be arsed to dockerize every small thing I want to run.

For scientific and ML uses, conda has basically solved the whole "python packaging is a mess" that people seem to still complain about, at least on the end-user side. Sure, conda is slow as hell but there's a drop in replacement (mamba) that solves that issue.


Conda? Mamba? Or should I use Venv? What are the commands to “activate” an environment? And why do I have to do that anyway, given that’s not needed in any other programming language? Which of those systems support nested dependencies properly? Do any of them support the dependency graph containing multiple, mutually incompatible copies of the same package?

Coming from rust (and nodejs before that), the package management situation in python feels like a mess. It’s barely better than C and C++ - both of which are also a disaster. (Make? Autotools? CMake? Use vendored dependencies? System dependencies? Which openblas package should I install from apt? Are any of them recent enough? Kill me.)

Node: npm install. npm start.

Rust: cargo run. cargo run --release.

I don’t want to pick from 18 flavours of “virtual environments” that I have to remember how to “activate”. And I don’t want to deal with transitive dependency conflicts, and I don’t want to be wading through my distro’s packages to figure out how to manually install dependencies.

I just want to run the program. Python and C both make that much more difficult than it needs to be.


Honest question, as I don't follow rust, does the cargo package manager handle non-rust dependencies? I've played around with it enough that I can say it's for sure a joy to use, but what if your rust program has to link against a very specific combination of versions of, say, Nvidia drivers and openssl and blas?

In general solving such environments (where some versions are left floating or only have like >= requirements) is an NP-hard problem. And it requires care and has to draw source code and/or binaries from various sources.

This is the problem that conda/mamba solves.

If you just want to install python packages, use pip+virtualenv. It's officially supported by python. And while pip has traditionally been a mess, there's been a bunch of active development lately, the major version number has gone from like 9.xx to 23.xx in like the past two or three years. They are trying, better late than never, especially for an ecosystem as developed as python's.

So, if you want to compare rust/cargo, and it handles non-rust deps, then the equivalent is conda. Otherwise, it's virtualenv+pip. I don't think there are any other serious options, but I agree that two is not necessarily better than one in this case. Not defending this state of affairs, just pointing out which would be the relevant comparison to rust.


> Honest question, as I don't follow rust, does the cargo package manager handle non-rust dependencies?

You can run arbitrary code at build time, and you can link to external libraries. For the likes of C libraries, you’ll typically bind to the library and either require that it be installed locally already, or bundle its source and compile it yourself (most commonly with the `cc` crate), or support both techniques (probably via a feature flag). The libsqlite3-sys crate is a fairly complex example of offering both, if you happen to want to look at an actual build.rs and Cargo.toml and the associated files.

> [pip’s] major version number has gone from like 9.xx to 23.xx in like the past two or three years.

It skipped from 10.0.1 to 18.0 in mid-2018, and the “major” part has since corresponded to the year it was released in, minus 2000. (Look through the table of contents in https://pip.pypa.io/en/stable/news/ to see all this easily.)
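
The scheme described above is simple enough to write down. This is a small sketch based only on that description (major version equals release year minus 2000, starting with 18.0); the function names are hypothetical and nothing here queries pip itself.

```python
def pip_major_for_year(year):
    """Major version pip would use for a release made in the given year,
    per the calendar-based scheme described above."""
    return year - 2000


def looks_calendar_versioned(version):
    """True if a pip version string falls in the calendar-based range (18+)."""
    major = int(version.split(".")[0])
    return major >= 18


print(pip_major_for_year(2023))
print(looks_calendar_versioned("10.0.1"), looks_calendar_versioned("18.0"))
```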


Hah, TIL! So much for semver... Calendarver?


> npm install. npm start.

Yeah, no… my current $DAYJOB has a confusing mix of nx, pnpm, npm commands in multiple projects. Python is bad but node is absolutely not a good example.


Eh. Node / npm on its own is generally fine, especially if you use a private package repository for internally shared libraries. The problems show up when compiling javascript for the web as well as nodejs. If you stick to server side javascript using node and npm, it all works pretty well. It’s much nicer than venv / conda. And it handles transitive dependency conflicts and all sorts of wacky problems.

It’s just that almost nobody does that.

What we want instead is to combine typescript, js bundlers, wasm, es modules and node packages all together to make web apps. And that’s more than enough to bring seasoned engineers to tears. Let alone adding in svelte / solidjs compilation on top of all that. I have sweats just thinking about it.


> If you stick to server side javascript using node and npm, it all works pretty well.

Rose colored glasses if you ask me. The difference is it seems you use Node often (daily?) and have rationalized the pain. Same goes for everyone defending Python (which I’m sorta in that camp, full disclosure), they are just used to the warts and have muscle memory for how to work around them just like you seem to be able to do with Node.


> Rose colored glasses if you ask me. The difference is it seems you use Node often (daily?) and have rationalized the pain.

It’s not just that. Node can also handle arbitrary dependency graphs, including transitive dependency conflicts, sibling dependencies and dev dependencies. And node doesn’t need symlink hacks to find locally installed dependencies. Node packages have also been expected to maintain semver compatibility from inception. They’re usually pretty good about it.

It’s not perfect, but it’s pretty nice.


@antfu/ni does a good job of determining which package manager should be run in the current project.


I have no dog in this fight, but the fact that you end your "conda solves this complexity" explanation with "and mamba is a replacement for conda" doesn't really sell me on it. Is it just speed as the main reason for mamba? The fact that the "one ring to rule them all" solve for Python's packaging woes even has a replacement sort of defeats the purpose a little.

I understand why people find Python's packaging story painful, I guess is what I'm saying.


conda takes up to several minutes to figure out dependencies just to tell you that one library requires an old version of a dependency and another requires a new one, making the whole setup impossible.

mamba can do this much quicker.
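
To make the "impossible setup" case concrete, here is a toy sketch of the simplest version of it: two packages that each constrain the same dependency to a version range. The package names and ranges are hypothetical, and real solvers like conda and mamba search across many packages at once, which is where the cost comes from; this only shows the final contradiction.

```python
def intersect(ranges):
    """Intersect version ranges given as (min_inclusive, max_exclusive) tuples.

    Returns the overlapping range, or None when the ranges cannot all be satisfied.
    """
    lo = max(r[0] for r in ranges)
    hi = min(r[1] for r in ranges)
    return (lo, hi) if lo < hi else None


# Hypothetical: lib_a needs dep >=1.0,<2.0 while lib_b needs dep >=2.0,<3.0.
lib_a_needs = ((1, 0), (2, 0))
lib_b_needs = ((2, 0), (3, 0))

conflict = intersect([lib_a_needs, lib_b_needs])                # no overlap
compatible = intersect([((1, 0), (3, 0)), ((2, 0), (4, 0))])    # overlap exists
```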


as always, a fast no is better than a slow maybe


That’s genuinely good to know! Cheers


As sibling elaborated, mamba just accelerates the NP-hard step of resolving conflicts using mathy SAT-solver methods. Also it has a much prettier cli/tui. I would be surprised if conda doesn't eventually merge the project, but you can already use it as an optional/experimental backend on the official conda distro. Haven't followed it at all but I suspect the folks who developed it wanted to use it right away so they forked the project, while the people who maintain conda are more conservative since it's critical infrastructure for ${large_number} of orgs.


> But this is not an intrinsically useful goal; there's a reason software libraries were invented.

No need to denigrate those who don't need to import all their code; it's perfectly fine to not rely on third parties when developing software - some of our best works result from this.


Thanks for the response. Interesting to see a lot of other people echoing the same comments - that dependency management in Python is an absolute PITA.

I know exactly what you mean, but I'm probably so inured to it by now that I've just come to accept it. Obviously not everyone feels this way!


Python sits on the C-glue segment of programming languages (where Perl, PHP, Ruby and Node are also notable members). Being a glue language means having APIs to a lot of external toolchains written in not only C/C++ but many other compiled languages, APIs and system resources. Conda, virtualenv, etc. are godsend modules for making it all work, or even better, to freeze things once they all work, without resorting to Docker, VMs or shell scripts. It's meant for application and DevOps people who need to slap together, e.g., ML, Numpy, Elasticsearch, AWS APIs and REST endpoints and Get $hit Done.

It's annoying to see them "glueys" compared to the binary compiled segment where the heavy lifting is done. Python and others exist to latch on and assimilate. Resistance is futile:

https://pypi.org/project/pyllamacpp/

https://www.npmjs.com/package/llama-node

https://packagist.org/packages/kambo/llama-cpp-php

https://github.com/yoshoku/llama_cpp.rb


You can “cd llama.cpp && make && ./main” because the developers made that happen.

A self contained zero dependency Python project is also literally “python app.py”


This is true, though in practice I so often find large, non-trivial examples of the former, but very seldom find examples of the latter.

The barriers to entry are lower with python, and so I think there tends to be a lower standard for deployment.


>Easy deployment on iOS I guess?

Deploying a PyTorch-running Python app to anything in a way a user can just run it is a struggle. Even iOS aside, that’s not a small use case; that’s all local and offline ML potential.

Really makes me wonder about the whole language: did they ever expect the code to have to run anywhere other than the machines the writer controls?


llama cpp is just cpp inference on the llama model. PyTorch is a library to train neural networks. I'm not sure why people are conflating these two totally different projects..


> Is Python so annoying to use that when a compelling non-Python solution appears, everyone will love it? Less hassle? Or did it take off for a different reason?

The desktop/laptop LLM use case is one where resource efficiency often makes the difference between “I can’t use this at all” and “I can use it”, rather than the more frequent “I can use it but it maybe runs a little slower”, and llama.cpp offers that. It also has offered new quantization options that the Python-based tooling hasn’t, which compounds the basic resource efficiency point.

(It’s also not an “everyone” thing: plenty of people are using the Python-based toolchains for LLMs; it’s possible for there to be multiple popular options in a space. Not everything is all or nothing.)


I think Python and R are generally superior (in terms of developer experience) when you have to do end-to-end ML work, including acquiring and munging data, plotting results etc. But even then, the core algorithms are generally implemented in libraries built out of native code (C/C++/Fortran), just wrapped in friendly bindings.

For LLMs, unless you're doing extensive work refactoring the inputs, there are fewer productivity gains to be had around the edges - the main gains are just speeding up training, evaluation and inference, i.e. pure performance.


Yeah, most Python software seems to have pretty poor compatibility; I usually need to downgrade my Python version to get stuff to run.


Python isn't annoying to use, at least for me and many I know. But it isn't known for speed. Or easy multithreading.


> Is Python so annoying to use that when a compelling non-Python solution appears, everyone will love it?

Yes.


Slightly OT:

I have been playing around with whisper.cpp; it's nice because I can run the large model (quantized to 8-bits) at roughly real-time with cublas on a Ryzen 2700 with a 1050Ti. I couldn't even run the pytorch whisper medium on this card with X11 also running.

It blows me away that I can get real-time speech-to-text of this quality on a machine that is almost 5 years old.


Seconded. I was playing around with it for my native language (Polish) and the large models actually blew me away. For example, it handled the spelling of "przescreenować" correctly, which is an English word with a Polish prefix and a conjugated suffix.


is there any dummy guide to get started with any of these?


Have you tried the Quick start in the https://github.com/ggerganov/whisper.cpp README?


This is an impressive use case


Is it possible to run on apple m1 devices or mobile phones or not yet?


I can recommend the MacWhisper app if you prefer a gui.


And Whisper Memos for iOS https://whispermemos.com/


The really nice part of Whisper is being able to use it offline and on-device, it seems whisper memos is uploading your audio and notes to a server of unknown security, confidentiality etc.

I like Aiko for on-device transcription both in macOS and iOS https://apps.apple.com/us/app/aiko/id1672085276


Whisper Memos uses OpenAI API. The upside is that it uses the largest model - that would take 2GB on your iPhone.


yeah the whisper.cpp github page has a demo for both. Have used it on my M1 MBA for the past few months.


Nice to see Georgi has started a company:

https://twitter.com/ggerganov/status/1666120568993730561?s=4...

Godspeed


Nice. He's obviously a talented engineer who's struck a nerve with the whisper.cpp/llama.cpp projects, so hope he has success with whatever he plans to do.


A lot of work going into refactoring proprietary code that can be randomly deprecated and outcompeted without any prior notice by any number of large competitors... problematic business model, in my opinion.


He's building a library, ggml; it's generally not hard to add support for new models. For instance llama.cpp already supports the Falcon 7B model (different architecture to llama). And given how politicised AI has become, there's unlikely to be many companies releasing weights for models competitive with the current models (e.g. LLaMA 65B). They may have private models that are better, like GPT3.5 and GPT4, but you can't run these on your own server so they're not competing with ggml.


> "there's unlikely to be many companies releasing weights for models competitive with the current models"

We're at the very dawn of this technology going mainstream, and you're saying that it's unlikely for new players to release new, competing and incompatible models?


Can someone ELI5 why AMD is not in this game? Is it really so much harder to implement this in a non-platform-specific library?


CUDA is the best supported solution, tends to get you access to the best performance, has a great profiler (it will literally tell you things like "your memory accesses don't seem to be coalesced properly" or "your kernel is ALU limited" as well as a bunch of useful stats), even works in windows, all of that.

OpenCL is (was?) the main open alternative to CUDA and was mainly backed by AMD and Apple. Apple got bored of it when they decided Metal was the future. AMD got bored of it when they developed rocm and HIP (basically a partially complete compatibility layer with CUDA).

There's also stuff like DirectML which only works on windows and e.g. various (Vulkan, directx etc.) compute shaders which are really more oriented at games.

There's also a bit of a performance aspect to it. Obviously GPGPU stuff is massively performance sensitive and CUDA gets you the best performance on the most widely supported platform.

AMD are also the main competitor in the space but all but totally dropped support for GPGPU in desktop cards, trying to instead focus on gaming for Radeon and compute in MI/CDNA. They seem to have realised their mistake a bit and are now introducing some support for RDNA2+ cards.


If I remember correctly it's not just that AMD has poor support for their consumer cards, their Rocm code doesn't compile to a device agnostic intermediate, so you have to recompile for each chip. New and old Cuda compatible cards (like all Geforce cards) can run your already shipped Cuda code, as long as it doesn't use new and unsupported features. So even if AMD had supported more cards, the development and user experience would be much worse anyway where you have to find out if your specific card is supported.


All RDNA2 GPUs and many Vega GPUs could use the same ISA (modulo bugs). Long ago, there was an assumption made that it was safer to treat each minor revision of the GPU hardware as having its own distinct ISA, so that if a hardware bug were found it could be addressed by the compiler without affecting the code generation for other hardware. In practice, this resulted in all but the flagship GPUs being ignored as libraries only ended up getting built for those GPUs in the official binary releases. And in source releases of the libraries, needlessly specific #ifdefs frequently broke compilation for all but the flagship ISAs.

There was an implicit assumption that just building for more ISAs was no big deal. That assumption was wrong, but the good news is that big improvements to compatibility can be made even for existing hardware just by more thoughtful handling of the GFX ISAs.

If you know what you're doing, it's possible to run ROCm on nearly all AMD GPUs. As I've been packaging the ROCm libraries for Debian, I've been enabling support for more hardware. Most GFX9 and GFX10 AMD GPUs should be supported in packages on Debian Experimental in the upcoming days. That said, it will need extensive testing on a wide variety of hardware before it's ready for general use. And we still have lots more libraries to package before all the apps that people care about will run on Debian.


True, it's better just to use OpenSYCL, which stores an intermediate device-agnostic form and compiles it as needed for the specific card.

I don't understand why SYCL isn't more widely used.


SYCL is still very early on in development and I don’t see it really picking up until support is up-streamed to the llvm project, at the very least. That said, I am a firm believer in the single source philosophy. There just isn’t a tractable alternative.


“Apple got bored of it when they decided Metal was the future. AMD got bored of it when they developed rocm and HIP”

Apple backed OpenCL because they needed an alternative after their divorce with Nvidia. No one was going to target an AMD alternative when they had such a trivial market share, so it had to be an open standard. Initially this arrangement was highly productive and OpenCL 1.x enjoyed terrific success. Vendors across compute markets piled support behind OpenCL and many even started actively participating in it. However this success is what precipitated the disastrous OpenCL 2.x series. In other words, OpenCL 2.x was far too revolutionary for many and far too conservative for others. What followed was Apple pulling out to pursue Metal, AMD having shoddy drivers, Nvidia all but ignoring it, and mobile chip vendors basically sticking to 1.2 and nothing more. Eventually this deadlock was fixed after OpenCL 3.0 walked back the changes of 2.x, but this was in large part because the backers of 2.x moved to SYCL.

As for AMD, OpenCL was a tremendous boon when it was first introduced. At least initially it gave them a fighting chance against CUDA. But it was never realistic for OpenCL to be a complete CUDA alternative. I mean, any standard that is basically “everything in CUDA and possibly more” is a standard no one could afford or bother to implement. ROCm and HIP are basically AMD using an API people are already familiar with, with software underneath to play to the strengths of their hardware.

“AMD are also the main competitor in the space but all but totally dropped support for GPGPU in desktop cards, trying to instead focus on gaming for Radeon and compute in MI/CDNA.”

Keep in mind that AMD has been under intense pressure to deliver world class HPC systems, and they managed to do so with ORNL Frontier. I don’t blame them for being selective with ROCm support because most of those product lines were in flight before ROCm started development in earnest. That said, Nvidia is obviously the clear leader for hardware support, and therefore the safest option for desktop users.


> partially complete compatibility layer

So… partial compatibility layer?


Is your point that "partially complete" is a redundant phrasing?

In this case I still prefer my version. I feel that it puts greater emphasis on the fact that it can potentially be complete, given the massive value that could give to the project.

Also "I would have written you a shorter letter but I did not have the time" sentiment springs to mind.


Thanks


Short version: AMD's software incompetence. Very few hardware companies have the competence to properly support their HW. You see this problem again and again: HW companies designing HW that's great on paper but can't be used properly because it's not properly supported with SW. Nvidia understands this and has 10 times as many SW engineers as HW engineers. AMD doesn't. Intel might too.


I think you're totally right with this, just to add - it's often possible to do a lot of things that are neat in hardware but create difficult problems in software. Virtually always when this happens it turns out to be nearly impossible to actually create software that takes advantage of it. So it's massively important to have a closed loop feedback system between the software and hardware so that the hardware guys don't accidentally tie the software up in knots. This is common in companies that consider themselves hardware companies first.


Examples being the PS3 cell architecture and HP's Itanium chips


Strongly agree. There's a surprising cultural difference between the two. As a software engineer in a different hardware company, I can see where the fault lines are, and it takes continual management effort to make it work properly.

(I note that if we had an "open" GPU architecture in the same way that we have CPU architectures, things might be a lot better, but the openness of the IBM PC seems to be a historical accident that no company will allow again)


Chalking it up to “software incompetence” is a bit simplistic, to say the least. AMD was on the brink of bankruptcy not too long ago and their GPU division was struggling to even tread water. They didn’t have an alternative to CUDA because they couldn’t afford one and no one would use it anyway, OpenCL stagnated because most vendors didn’t want to implement functionality that only the biggest players wanted, and their graphics division had to pivot from optimizing for gaming (where they could sell) to optimizing for compute as well.

Now that AMD has the capital they are playing catch up to Nvidia. But it’s going to take time for their software to improve. Hiring a boatload of programmers all at once isn’t going to solve that.


It's been a while since AMD was on the brink of bankruptcy, they had enough time to do something about compute and yet it's still not usable, see the George Hotz rant. OpenCL stagnated because 2.0 added mandatory features that Nvidia didn't support so it never got adopted by the biggest player.


llama.cpp can be run with a speedup for AMD GPUs when compiled with `LLAMA_CLBLAST=1` and there is also a HIPified fork [1] being worked on by a community contributor. The other week I was poking on how hard it would be to get an AMD card running w/ acceleration on Linux and was pleasantly surprised, it wasn't too bad: https://mostlyobvious.org/?link=/Reference%2FSoftware%2FGene...

That being said, it's important to note that ROCm is Linux only. Not only that, but ROCm's GPU support has actually been decreasing over the past few years. The current list: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... Previously (2022): https://docs.amd.com/bundle/Hardware_and_Software_Reference_...

The ELI5 is that a few years back, AMD split their graphics (RDNA) and compute (CDNA) architectures, which Nvidia does too, but notably (what Nvidia definitely doesn't do, and a key to their success IMO) AMD also decided they would simply not support any CUDA-parity compute features on Windows or their non "compute" cards. In practice, this means that community/open-source developers will never have, tinker, port, or develop on AMD hardware, while on Nvidia you can start with a GTX/RTX card on your laptop, and use the same code up to an H100 or DGX.

llama.cpp is a super-high profile project, has almost 200 contributors now, but AFAIK, no contributors from AMD. If AMD doesn't have the manpower, IMO they should simply be sending no-strings-attached free hardware to top open source project/library developers (and on the software side, their #1 priority should be making sure every single current GPU they sell is at least "enabled" if not "supported" in ROCm, on Linux and Windows).

[1] https://github.com/SlyEcho/llama.cpp/tree/hipblas


I just tried this [1] and it still uses my CPU even though the prompt says otherwise.

[1] https://github.com/ggerganov/llama.cpp/issues/1433#issuecomm...


I saw there was an answer already in your issue, although if you plan on doing a lot of inferencing on your GPU, I'd highly recommend you consider dual-booting into Linux. It turns out exllama merged ROCm support last week and it is more than 2X faster than the CLBlast code. A 13b gptq model at full context clocks in at 15t/s on my old Radeon VII. (Rumor has it that ROCm 5.6 may add Windows support, although it remains to be seen what that exactly entails.)


So it now uses the GPU after some help, but it is not that much faster on my Vega VII than on my 5950x 16 core cpu :/


short answer is they have somewhat competent hardware but software sucks or you can watch george hotz rant about how amd driver sucks

https://www.youtube.com/watch?v=Mr0rWJhv9jU


He got a tarball fix for the driver after his rant got viral. Still not looking good IMHO.

https://geohot.github.io/blog/jekyll/update/2023/06/07/a-div...


It looks like some great engineers inside the company are fighting the bureaucracy.


> So it fixed the main reported issue! Sadly, they asked me not to distribute it, and gave no more details on what the issue is.

I think they missed the thrust of the rant.


he goes from building a ml rack out of ati graphics cards to coding some python to recommending reading the unabomber manifesto, from marx to saying he owns a rolls royce... lord, please have mercy!


I think there are several reasons.

Firstly, nVidia has been at it much longer. Just because of this tools on nVidia side feel easier to set up / are more polished (at least that was my feeling when fiddling with ROCm like a year ago).

Second, but still related to #1, from the beginning even consumer nVidia cards were able to run CUDA, and this meant that hobbyists and prosumers/researchers on a budget bought nVidia cards, compounding even further the time/tooling advantage nVidia had. I.e. a huge user base of not only gamers but people that use their cards to do things other than gaming and know that things work on these cards.

These are, IMHO, the main reasons why everyone targets CUDA and explain why frameworks like Tensorflow or Pytorch targeted it as a first class citizen.


If AMD was software competent the frameworks would support their drivers just as well. No one wants a monopoly.


Agreed. Sorry if I gave the impression of being pro-nVidia. I am not.

But the reality is that when Tensorflow and Pytorch came to be there was no alternative. Now you need to jump through hoops to make it work with non CUDA hardware.

Additionally, while drivers play a role, I think the main difference is in the computing libraries (CUDA vs ROCm)


I'm hopeful for SYCL [0] to become the cross platform alternative, but there doesn't seem to be a lot of uptake from projects like this, so maybe my hope is misplaced. It's an official Khronos standard, and Intel seems to like it [1], but neither of those things are enough to change things.

Is there someone who knows this space who can comment on the likelihood that SYCL will be a good option eventually? Cross platform and cross vendor compatibility would be really nice, and not supporting the proprietary de facto standard would also be a bonus as long as the alternative works well enough.

[0] https://www.khronos.org/sycl/

[1] https://spec.oneapi.io/versions/latest/elements/sycl/source/...


> It's an official Khronos standard

I think that's the problem. Khronos isn't known for good UX, and being from Khronos is exactly the reason why I'm not even bothering to check it out. I want an alternative to CUDA, but I also want it to be as easy to use as CUDA.


Being from Khronos is also a reason why it might actually be usable in a decade's time, like Vulkan.

(Vulkan is from 2015 and is just recently starting to become usable.)


From a non-expert's standpoint, Vulkan feels quite unusable and complex.


From an expert's standpoint, it's still quite unusable and complex.


I'm no expert but if I understand correctly the CUDA cores are the main pull and the API to them.

They're supposed to be more optimized and more stable compared to AMD. That's how it was before anyway, not sure today.


Isn't the main component for AI matrix multiplication? What makes it so hard to create a good alternative API for matrix multiplication?


It's a lot more complicated than just writing a matrix multiplication kernel because there are all sorts of operations you need to have on top of matrix multiplication (non linearities, various ways of manipulating the data) and this sort of effort is only really worthwhile if it's well optimized.

On top of that, AMD's compute stack is fairly immature, their OpenCL support is buggy and ROCm compiles device specific code, so it has very limited hardware support and is kind of unrealistic to distribute compiled binaries for. Then, getting to the optimization aspect, NVIDIA has many tools which provide detailed information on the GPU's behavior, making it much easier to identify bottlenecks and optimize. AMD is still working on these.

Finally, NVIDIA went out of its way to support ML applications. They provide a lot of their own tooling to make using them easier. AMD seems to have struggled on the "easier" part.
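
To make the "more than matrix multiplication" point concrete, here is a minimal pure-Python sketch of one toy layer. The layer shape and the name `toy_layer` are hypothetical and chosen only for illustration; real frameworks fuse and tune each of these steps per GPU, which is where most of the effort goes.

```python
import math

def matmul(a, b):
    """Naive matrix product of two row-major lists of lists."""
    inner, cols = len(b), len(b[0])
    assert all(len(row) == inner for row in a), "shape mismatch"
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def gelu(x):
    """Tanh approximation of GELU, a common non-linearity."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rms_norm(row, eps=1e-6):
    """Divide a row by its root mean square (no learned scale here)."""
    rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
    return [v / rms for v in row]

def toy_layer(x, w_in, w_out):
    """Hypothetical layer: normalize, project, activate, project, softmax."""
    hidden = matmul([rms_norm(r) for r in x], w_in)
    hidden = [[gelu(v) for v in r] for r in hidden]
    return [softmax(r) for r in matmul(hidden, w_out)]

identity = [[1.0, 0.0], [0.0, 1.0]]
out = toy_layer([[1.0, 2.0]], identity, identity)
print(out)
```

Only two of the five steps in `toy_layer` are matrix multiplications; the rest are elementwise or row-wise operations that also need fast device kernels.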


Well I think there are 2 types, right? Tensor cores (which afaik AMD doesn't have), which are better for matrix ops, and CUDA cores, which are better for general parallel ops.

Maybe someone more clever than me can go into the specifics, I only understand the minimum of the low lvl GPU details.

Nice high lvl document

[0] https://www.acecloudhosting.com/blog/cuda-cores-vs-tensor-co...


I think API for matrix multiplication is just a part of the issue. CUDA tooling has better ergonomics, it's easier to set up and treated as first class citizen in tools like Tensorflow and Pytorch.

So, while I can't talk about the hardware differences in detail, developer experience is greatly on nVidia side and now AMD has a moat to overcome to catch up.


there is nccl, gpudirect, nvlink and so on and so forth.. It is not just matmul on gpus.


Planned economy, planned who does what.


I'm a bit surprised by the numbers. It's "only" a 2× speedup on a relatively top-end card (4090)? And you can only use one CPU core. With 16+ core CPUs becoming normal and 128GB+ RAM being cheap, that seems like leaving a lot on the table.

[edit] realized it's relative to the merged partial CUDA acceleration, so the speedup is more impressive, but still surprised by the core usage.


> And you can only use one CPU core.

because the core's job is solely to direct the GPU, which is doing all of the work.


To the replies; I think one feature of llama.cpp is it can handle models with more RAM than VRAM provides, this is where I would think more cores would be useful.


That cheap ram is about 10x slower than the VRAM. Didn't see any actual figures for latency, but there must be a reason, why newer gpus have memory chips on both sides of the PCB, as close to the GPU as possible


Excuse my ignorance, but can someone explain why Llama.cpp is so popular? Isn't it possible to port the PyTorch LLaMA to any environment using ONNX or something?


You can run it on RPi or any old hardware, only limited by the RAM and your patience. It is a lean code base easy to get up and running, and designed to be interfaced from any app without sacrificing performance.

They are also innovating (or at least implementing innovations from papers) different ways to fit bigger models in consumer HW, making them run faster and with better outputs.

Pytorch and other libs (bitsandbytes) can be horrible to setup with correct versions, and updating the repo is painful. PyTorch projects require a hefty GPU or enormous CPU+RAM resources, while llama.cpp is flexible enough to use GPU but doesn't require it and runs smaller models well on any laptop.

ONNX is a generalized ML platform for researchers to create new models with ease. Once your model is proven to work, there are many optimizations left on the table. At least for distributing an application that relies on LLM it would be easier to add llama.cpp than ONNX.


In Stable Diffusion land, onnx performance was not that great compared to ML compilers and some other implementations (at least when I tried it).

Also, llama.cpp is excellent at splitting the workload between CPU and accelerators since it's so CPU focused. You can run 13B or 33B in a 6GB GPU and still get some acceleration.

Also, as said above, quantization. That is make or break. There is no reason to run a 7B model at fp16 when you can run 13B or 30B in the same memory pool at 2-5 bits.
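
A rough back-of-the-envelope sketch of that memory trade-off, assuming weight storage is simply parameters times bits per weight. The helper name `weight_memory_gb` is hypothetical, and real quantized formats carry some extra per-block scale data, so actual files are somewhat larger than these figures.

```python
def weight_memory_gb(n_params_billion, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes).

    Ignores the KV cache, activations and per-block quantization overhead.
    """
    return n_params_billion * bits_per_weight / 8.0

examples = [
    ("7B at fp16", 7, 16),
    ("13B at 4 bits", 13, 4),
    ("30B at 3.5 bits", 30, 3.5),
]
for name, params, bits in examples:
    print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions a 30B model at around 3.5 bits per weight lands in roughly the same memory pool as a 7B model at fp16.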


Should be similar performance, but the ggml guy did it in what he knows best, and the biggest selling point is a single binary.


ONNX doesn't support the same level of quantization as GGML.

So basically GGML will run on hardware with less memory.


Or alternatively, bigger models with the same memory (just quantised harder).


Python Torture Chamber is, of course, an eminently viable tool, but I gather some people prefer a more streamlined toolchain like that of C. Or COBOL.


Llama.cpp runs better than pytorch on a much wider variety of hardware, including mobile phones, Raspberry Pis and more.


Anyone get performance numbers for other 30 series cards? 3060 12gb?

I’m curious how it compares to his apple silicon numbers


Not a 30 series, but on my 4090 I'm getting 32.0 tokens/s on a 13b q4_0 model (uses about 10GiB of VRAM) w/ full context (`-ngl 99 -n 2048 --ignore-eos` to force all layers on GPU memory and to do a full 2048 context). As a point of reference, currently exllama [1] runs a 4-bit GPTQ of the same 13b model at 83.5 tokens/s.

[1] https://github.com/turboderp/exllama
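
For a feel of what those rates mean in wall-clock time, a small sketch using only the two figures quoted above. It assumes a constant generation rate over the full 2048 tokens and ignores prompt processing; `generation_seconds` is a hypothetical helper name.

```python
def generation_seconds(n_tokens, tokens_per_second):
    """Time to generate n_tokens at a constant rate."""
    return n_tokens / tokens_per_second

llama_cpp_s = generation_seconds(2048, 32.0)
exllama_s = generation_seconds(2048, 83.5)
print(f"llama.cpp: {llama_cpp_s:.1f} s, exllama: {exllama_s:.1f} s, "
      f"ratio: {llama_cpp_s / exllama_s:.2f}x")
```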


Also, someone please let me use my 3050 4GB for something else than stable diffusion generation of silly thumbnail-sized pics. I'd be happy with an LLM that's specialized in insults and car analogies.


You can split inference between CPU and GPU using whatever available GPU vRAM with llama.cpp. And you can run many small models with 4GB of vRAM. Anything with 3B parameters quantized to 4bit should be fine.


Do you have your settings correct? I have a 1650 on an old computer and I have generated 512x512 pictures and merged models.

Heck, even using CPU, I've been able to generate 512x512.

If you aren't generating 512x512 pictures, lmk, I'll go grab my bat file's startup parameters.


I can’t do 512x512 on a laptop 1050 ti 4gb


Are you using automatic1111, do you have the -low vram setting checked?

Just to be clear, I can also make 512x512 with CPU, so you basically just need to have the correct config of the .bat file.


Did you buy that card for cuda? Cause otherwise I have no idea why someone would chose a 3050 over a 6600


What's a 6600, some AMD card? And why is it better?


Cheaper Nvidia cards are generally considered to have dubious value. Having seen the benchmarks I agree, but it's not like a game-changing difference really. For CUDA and ML stuff, the 3050 would run circles around the 6600.


For cuda and ML you'd be much better off choosing a 3060. Honestly, if you've only got the money for a 4gb 3050, you're prob better off working in google colab


With layering enabled, I don't necessarily agree. Not being able to load an entire model into memory isn't a dealbreaker these days. You can even layer onto swap space if your drive is fast enough, so there's really no excuse not to use the hardware if you have it. Unless you just like the cloud and hate setting stuff up yourself, or what have you.


Yeah for the price of a 4gb 3050 you could afford a 8gb rx 6600 xt which is way faster.


Is there a legitimate way to get the weights to actually use this without filling in forms?


You can use OpenLLaMA models (Apache 2.0 license, unrelated to LLaMA apart from their architecture and general approach to training):

https://huggingface.co/TheBloke/open-llama-7b-open-instruct-...

https://huggingface.co/SlyEcho/open_llama_7b_ggml

https://huggingface.co/SlyEcho/open_llama_3b_ggml


Not if you want the original LLAMA weights, but now there are other models like RedPajama available.


There’s a torrent linked in the Llama.cpp docs, it’s in a merge request on the LLaMA repo. Has all the files.


It's almost (or actually is) a "pirated" torrent. So it might not be "legitimate".


The best model currently is Falcon


No


They come a dime a dozen on HuggingFace, check out https://old.reddit.com/r/LocalLLaMA/wiki/models for a few options


Are these done as LLaMA deltas still? I.e. do I need to apply a patch to LLaMA, and so I still need to source LLaMA?


Most of them are merged models, so you don't need the base model.

It's stupidly simple to get going.


The best model currently is Falcon


Great news but I'd like to know how does it compare with just using torchlib.

If this is faster than torchlib, these optimizations should flow to torchlib as well.

Love the idea of not having to deal with python, though; dependency management is just horrible, I'd much rather have ML projects written in cpp.


I've been using llama.cpp with the python wrappers and the speed increase has been great, but it seemed to be limited to a max of 40 N_GPU_LAYERS. Going to have to update and see what sort of improvement I see.


I'm a total newb about the implementation details, but I'm curious if a hybrid is possible (GPU+CPU) to enable inference with even larger models than what fits in consumer GPU VRAM.


llama.cpp does it already. You tell it how many layers to offload to GPU, and it runs remaining ones on CPU.
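
A minimal sketch of how one might pick that layer count, assuming every layer takes about the same amount of memory and that a fixed VRAM budget is available for weights. The function name and the numbers are hypothetical; this is not how llama.cpp itself decides anything, since the user supplies the count (the `-ngl` option mentioned elsewhere in this thread).

```python
def split_layers(n_layers, layer_size_mb, vram_budget_mb):
    """Return (gpu_layers, cpu_layers) for a simple greedy split.

    Assumes uniform layer size; ignores the KV cache and scratch buffers.
    """
    if layer_size_mb <= 0:
        raise ValueError("layer_size_mb must be positive")
    fit = vram_budget_mb // layer_size_mb
    gpu_layers = max(0, min(n_layers, fit))
    return gpu_layers, n_layers - gpu_layers

# Hypothetical example: 40 layers of ~180 MB each with a 4000 MB budget.
gpu, cpu = split_layers(40, 180, 4000)
print(f"offload {gpu} layers to the GPU, keep {cpu} on the CPU")
```

In practice it is worth leaving headroom in the budget, since the context cache and temporary buffers also live in VRAM.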


RTX 3090 isn't cheap, more than 1000GBP new, crikey!


As someone who uses their computers every day, for 7 years... Then I hand them down to my kids.... Then I turn them into servers.

I find the cost of computing extremely affordable, even for high end stuff. What's the amortization on a 2-3k computer over 7 years? How about if I use it 4 hours a day actively and 24 hours passively?

I have considered spending 10-30k on a computer given the recent AI craze, but the thing stopping me is that by 2025, a 10-30k computer in the AI space is going to be 2-4x better. Only in the last 1 year are we finding out the importance of absurd amounts of VRAM. I feel like the 4090's 24gb VRAM is going to age alright at best, but most likely poorly. (Not that 4090 buyers are going to have qualms upgrading to the 6090)


This is pretty cool, do you find the server farm of older computers valuable for your own work?


Oh yeah, I have a computer for a minecraft server. A computer hosting my kiddo's website (just for fun, it's silly, but randomly he will want me to pull it up from outside of the house). That same computer hosts some listeners/watchdogs for a media computer, but I haven't actually used much of that information or features in a year (WFH kind of removed the need for me to use my remote tools).

I suppose that's it for now.

Oh, I thought of another use, I run a small business on the side and my interns occasionally don't have a laptop, I give them a crappy laptop. (they are basically just using excel/google sheets)


Why not get a used one?


I don't have a PC with a powerful GPU. What's the easiest way I can play with Llama on AWS, Google Cloud, or somebody else's computer?


You can play with Llama on your CPU. Depending on the model you use and the RAM you have available, the performance may be acceptable.


using llama.cpp

it runs on the CPU

this news story is that they are now extending GPU support to llama.cpp


Do you know about Oobabooga?

You can probably find a google colab link.


Is that like a LLM Stable Diffusion? Neat.


I don't quite get it. Where does one get the model from? Or is it just for people that can afford to spend the $ to train the models themselves?


You can download a model from a site like huggingface. Here is a list of models that can be used for inference:

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

This user has some models already compiled for use with GGML (look for models with that in the name):

https://huggingface.co/TheBloke

Or if you want to convert your own model the llama.cpp repo has good instructions. Briefly it's `python3 convert.py <model>` - then if you are using a large parameter model you may need to quantize it to fit in memory `./quantize <source_model> <destination_name> <quantization>`


Thanks


you can easily download llama weights by googling the ipfs link. it doesn't even seem to be illegal to do so, but IANAL


What do people see becoming of the non commercial license on llama projects? How much is this type of work wed to using LLaMA?


TBH people are kinda ignoring the LLaMA license now, as Meta seems to be doing. I see some pseudo commercial (encouraging donations and such) and a few straight up commercial services using a LLaMA backend.

llama.cpp specifically has Falcon on their roadmap, and some other quantized implementations already work with it. But the transition will be slow.


I think other LLMs will be used for commercial purposes in most cases. There are already a few and I'm sure there's more in the queue.


> With a boring C project, if it compiles it probably works without hassle.

amen.


Apart from all the graphics drivers and it actually working at all


Anyone know if whisper.cpp is GPU accelerated yet?


AFAIK it's not fully, but you can use cuBLAS/clBlast for a pretty good speedup.


Partially. Keep an eye out for ggml-cuda.cu getting updated.


Got downvoted out of view for saying this once but no less true. Absolutely a pity & a shame no one else has competed with this market dominance by Nvidia. Just a shit world entirely that we are single vendored up.

A lot of pissant shitty defenses of monopolization too. Wrong or right, this is a shit world we're in now. https://news.ycombinator.com/item?id=36304225


Apple's take on GPUs is quite different and very interesting IMO. Shared memory architecture with absolutely massive RAM (and hence VRAM) support, e.g. the new Mac Studio having 192GB RAM/VRAM which can run pretty massive models, much more so than is easily consumer-accessible even at the high end with Nvidia 4090s. It's not as fast as Nvidia, but it's not horribly far off in the latest chips.

As LLM adoption grows, I wonder whether Apple's approach will start to make more sense for consumer adoption, so that you can run models on your machine without needing to pay large subscription costs for AI-powered apps (since OpenAI et al have fairly high fees). The high cost of using the APIs in my opinion is a drag on certain types of adoption in consumer apps.

Llama.cpp actually first started as a way to run LLMs on Macs! At first CPU-only, but then later the first GPU driver backend added was Metal, not anything from Nvidia.


I agree, though it’s still vendor lock-in. I think a lot of devs aren’t aware of how tightly integrated these frameworks are with the hardware. It’s not trivial to separate the two and the companies driving this tech have no motive to do so, while also being amongst a very small number of hardware designers.


I think criticizing Apple for not supporting Nvidia is a classic missing-the-forest-for-the-trees. Unifying memory is a next logical step in general purpose computing. The flexibility that comes from it has unseen potential.


It's a next step that AMD has been trying to make happen for like 10 years, but not managing to make it commercially successful in anything but consoles, ending up permanently stuck in the lowest low end of the market. In both the PC and datacenter market spaces it has ended up with basically the opposite of product market fit. Nobody actually wants the generic CPU compute tied to the GPU compute.

The PC gamers really want the GPU component to be separately upgradable from the CPU. Non-gamer PC users don't care about the GPU performance, just cost. The datacenter folks want to be able to use a single $1k CPU to host $100k worth of GPUs.

It's plausible that AI accelerators follow a different path for the consumers. It's harder to see it happening for the datacenter market.


You’re missing the forest for the trees.

Being able to switch from an optimized CPU-centric workload to an optimized GPU-centric workload without any hardware changes sounds useful to me.

You could even do unified memory with upgradeable separate CPU and GPU. You won’t get the benefits of having them on the same chip, but there’s nothing intrinsic about the separation requiring separate memory space.


Thinking what Apple sells is a GPU bundled with a large amount of VRAM and a free ARM CPU makes things easier to accept.

NVIDIA never sold a GPU with expandable memory, either.


Is that RAM as fast as A100/H100 VRAM? AFAIR, gpus push ~ 1 TB/s ish


The Apple M2 Ultra memory bandwidth is 800GB/s so it's not a long way off 1TB/s.

A100 goes up to 2TB/s in the largest configuration, and H100 claims 3TB/s per GPU node. (These figures keep changing with new variants.) But you can buy several Mac Studios for the price :-)

The real use of H100 is for training, as it can pool many GPUs together with a dedicated high bandwidth network. You can't do that with Mac Studios.
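
One rough way to read these bandwidth figures, assuming single-stream token generation is limited by memory bandwidth and each generated token requires reading all the weights once. That is a common rule of thumb rather than anything stated in this thread, and the 8 GB model size below is a hypothetical value for illustration.

```python
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper bound on tokens/s if every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

MODEL_SIZE_GB = 8  # hypothetical quantized model size
bounds = {
    "M2 Ultra (800 GB/s)": max_tokens_per_second(800, MODEL_SIZE_GB),
    "A100 (2 TB/s)": max_tokens_per_second(2000, MODEL_SIZE_GB),
    "H100 (3 TB/s)": max_tokens_per_second(3000, MODEL_SIZE_GB),
}
for name, bound in bounds.items():
    print(f"{name}: at most ~{bound:.0f} tokens/s")
```

Real throughput will be lower than these bounds, but the ratios between devices give a feel for why memory bandwidth matters so much for inference.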


Those figures are really impressive. A single node is still useful for finetuning smaller models. Apple could move the needle in this market a bit.


In addition to the cost, OpenAI censors their models, and while that seems protective at first, if you think about it, if the model is the knowledge graph, then censoring data in the graph is censoring free speech.


I don't personally like model censorship, but you need to come up with a better argument than...whatever that is. It is entirely unconvincing, and seems like you misunderstand what free speech is.


Freedom of speech is a principle that supports the right of an individual or a community to articulate their opinions and ideas without fear of retaliation, censorship, or legal sanction.

Since the models are prevented from answering certain questions and indeed have been demonstrated to be biased toward one end of the political spectrum, and if you treat the model’s output as a knowledge graph, as proposed by John Schulman, one of the cofounders of OpenAI, then yes, I would say the suppression of freedom of speech is a valid argument to make. Otherwise why would there be a set of “uncensored” models that exist in the open source world?

I would suggest you read about these models and think about the implications. Perhaps that’ll lead you to reconsider your stance.


If we ignore the corporate structure details where your analogy breaks down, in the simplest case, choosing to self-inhibit is not violation of the freedom of speech. OpenAI is voluntarily self-censoring their output. This isn’t an ideological battle for them, it’s a business.


> I would suggest you read about these models and think about the implications.

I have done that, and come to the conclusion that "that's business, baby!"

Do you have any more meaningful objections to this "censorship" of a free thing you agreed to license?


I've said it before and I'll say it again. Cuda dominance is the darkest timeline.

OpenCL was the utopia timeline.


OpenCL still works however :)


Shame Apple never really invested in OpenCL. Things would look a lot different now if they hadn't abandoned it.


AMD almost pulled it off but software made Nvidia take the lead again. AI being CUDA dependent really gave them an edge, in both the consumer and business market.


I have hope that AMD can close that gap over the next few years. Their hardware is already great, and the business case for investing in AI software is crazy strong. Their stock price would probably get a bump just from announcing the investment.

It seems like a no brainer to hire a bunch of people to work on making PyTorch / Tensorflow on AMD become a competitive option. It’ll just take a few years.


I think it's been a couple of years already that we hoped better of AMD, to general disappointment


Overview of some of the efforts to move away from cuda: https://www.semianalysis.com/p/nvidiaopenaitritonpytorch


Can hardware vendors even put their differences aside to build such a thing? We can't even build a unified open raster graphics API, and now you're asking for machine learning acceleration in that vein?


Machine learning would probably be the simpler API. If you can speak Linear Algebra, you're most of the way there.


You're right, and it's why projects like the ONNX runtime exist to unify vendor-specific AI accelerators. Covering the basics isn't too hard.

What GP seems to be asking for is an open CUDA replacement, which is kinda like asking someone to fund a Free and Open Source cruise ship to compete with Carnival for you. You'll get somewhere with some effort, luck and good old human intuition, but Nvidia can outspend you 10:1 unless you have funding leverage from FAANG.


IMHO, it comes down to the software.

It turns out you need very different kernels for good performance on different GPUs, so OpenCL is a nice tool, but not sufficient; you need a hardware-specific kernel library.

From the framework side, each integration is relatively expensive to support, so you really don’t want to invest in many of them. Without some sort of kernel API standard, you’re into a proprietary solution, and NVidia did an amazing job at investing in their software, so that’s the way things go.

I think we had a pretty solid foundation for doing something smarter with PlaidML, but after we were bought by Intel, some architectural decisions and some business decisions consigned that to be a research project; I don’t know that it’s going anywhere.

These days, I’d probably look into OctoML / TVM, or maybe Modular, for a better solution in this space… or just buy NVidia.

(I worked a bit on Intel’s Meteor Lake VPU; it’s a lovely machine, but I’m not sure what the story will be for general framework integrations. I bet OpenVINO will run really well on it, though :-)


> Absolutely a pity & a shame no one else has competed with this market dominance by Nvidia.

Well, at least you admit it's not Nvidia's fault. Apparently Apple, Intel and AMD don't think there's much money to grab here.


I'm sure Intel and AMD heavily regret their neglect of OpenCL since the rise of LLMs and stable diffusion.


For me it’s doubly amazing that Intel does not exist in those discussions about alternatives that are already rare. They should write a book about how to blow up a successful semiconductor company from the inside.


FWIW, Intel has OpenVINO acceleration on their ARC GPU lineup. Their $300 Arc A770 outperforms the M1 Ultra by ~10% in OpenCL (which OpenVINO uses): https://browser.geekbench.com/opencl-benchmarks

It stands to reason that Intel is making highly price-competitive hardware at the moment, but people don't talk about them as much as Nvidia because they have a minuscule install base with primitive Windows drivers. I wouldn't count them out if their first showing is this impressive, though.


High technology hardware needs a lot of investment, and it's hard to gather enough resources (human, capital, market share).

But AMD is coming strong, and they are trying to compete with Nvidia now. https://www.forbes.com/sites/iainmartin/2023/05/31/lisa-su-s...


The others tried with OpenCL to build a more open environment, but this will always lose against a single vendor tailoring its solution for its own lineup.

Think of Apple's ecosystem vs Android or MS's Office-Outlook-Teams vs anything else.


Well, we could do a much much better job of it but in fact Qualcomm does compete with NVIDIA for use cases like this (inference). Both in mobile devices and the data center.

Disclaimer: I work at Qualcomm.


What is on offer optimized for running these LLMs?


The Hexagon NSP is reasonably well suited for running ML in general. I know it's used for some image/CV use cases and I think it will work well for language models, but it may be suboptimal for the recent large ones.

This processor shows up in both Snapdragon SoCs and the Cloud AI 100.


With any luck projects like MLC will help close the gap

https://github.com/mlc-ai/mlc-llm


You can also use TPUs or other training cards. Nvidia is just the best one that's accessible.

But I think you're getting downvoted since it's very off-topic.


I wonder if it would be possible to emulate/translate CUDA to target non-NVIDIA hardware.

I suspect it would be more of a legal challenge than a technical one.


That's what ROCm was trying to do


WebGPU strikes me as the answer to that. Perhaps I am missing something?


It could be. But there's quite a bit of momentum behind CUDA. Plus, CUDA is just wicked fast. I wrote a WebGPU version of LLaMA inference and there's still a bit of a gap in performance between WebGPU and CUDA. Admittedly, WebGPU can't access tensor cores and I undoubtedly need to optimize further.


Looking forward to the AMD MI300; hopefully it will be a game changer.


There are far more important things to be this worried about


tinygrad is trying to address this problem. We'll see if it's successful.


it’s uh… not that serious


Such a pity no one else can compete here presently. Would that others be able to gain a position where their software made them competitive on the free market.


Compete with Llama.cpp? Like transformers llama [0], exllama [1] (really fast), or lit-llama [2]?

exllama is really memory efficient and really fast

[0] https://huggingface.co/docs/transformers/main/model_doc/llam...

[1] https://github.com/turboderp/exllama

[2] https://github.com/Lightning-AI/lit-llama

EDIT: Or do you mean CUDA? Because yeah, it's such a shame AMD's ROCm is so bad even geohot gave up. Its examples don't even run without crashing.

https://github.com/RadeonOpenCompute/ROCm/issues/2198#issuec...


Also https://github.com/kayvr/TokenHawk, a WebGPU implementation of LLaMA.

edit: Note that this is my project.


Thanks for the tip about exllama. I've been on the lookout for a readable Python implementation to play with that is also fast and has support for quantized datasets.


There was free competition here, a while ago. OpenCL was formed by Apple, Khronos et al. to stave off CUDA's dominance. The platform languished from a lack of commitment though, and Apple eventually gave up on open GPU APIs entirely. Nvidia continued funding CUDA and scaling it for industry application, and the rest is history. The landscape of stakeholders is just too bitter to unseat CUDA for what it's used for - your best shot at democratizing AI inferencing acceleration is through something like Microsoft's ONNX[0] runtime.

[0] https://onnxruntime.ai/


CUDA had a lot of inertia, and OpenCL brought half-baked docs and half-baked support out of the gate. If they had focused on simplifying their API to be more user friendly for the 80% use case, it could've been a success. OpenCL always looked nice on the surface, but a few hours in you've exhausted the docs trying to figure out what to do and there's no good example code around. Of course, if they really wanted it to succeed they would've built a CUDA-to-OpenCL transpiler for the C API, or at least a comprehensive migration guide. I'm not convinced anyone involved was trying to make it popular.


Note that llama.cpp supports acceleration on both OpenCL and Apple Metal.


There’s also geohot’s tiny corp betting on AMD GPUs.


Not any more.


https://geohot.github.io/blog/jekyll/update/2023/06/07/a-div...

AMD gave him a binary blob driver and that fixed his problem. Also, tinygrad is the only Python framework I know that has full OpenCL acceleration.


What do you mean? At least as of June 7, geohot was still working on AMD driver builds and stability. https://geohot.github.io/blog/jekyll/update/2023/06/07/a-div...

So far it doesn’t look like AMD is fully on board with Tiny Corp, but they are talking…


Why not ggml?


Unclear what this is referring to, but if it means CUDA vs other things, it is worth noting that:

a) CUDA won in a free market because NVidia showed they cared about it

b) llama.cpp has support for OpenCL (via CLBlast) and Apple Metal

The OpenCL support already has a custom kernel for token generation.


There's Fabrice Bellard's textsynth server. https://bellard.org/ts_server/

No open source though.


This isn't a market.


My understanding from reading this is that a 3090 GPU gives a 2x speedup over a decent modern CPU. Is that really the case, or am I reading it wrong? My initial thought was that it would be far higher. Is this typical of inference for this kind of model? If so, why do we need such expensive hardware? Please excuse my lack of knowledge :)


I think it was a 2x total speedup vs the previous version, which already used the GPU for “most” things, so the real speedup is 2/(1-most), which could be a lot.
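
A minimal sketch of this kind of estimate, assuming Amdahl's law rather than the exact expression above. The function name and the example fractions are hypothetical and not taken from the benchmark; `fraction` here means the share of the previous version's runtime spent in the part that was newly moved to the GPU.

```python
def required_part_speedup(overall_speedup: float, fraction: float) -> float:
    """Solve Amdahl's law for the speedup of the accelerated part.

    Amdahl's law: overall = 1 / ((1 - fraction) + fraction / part)
    Rearranged:   part    = fraction / (1 / overall - (1 - fraction))
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if overall_speedup <= 0.0:
        raise ValueError("overall_speedup must be positive")
    # Share of the old runtime left for the accelerated part after the speedup.
    remaining = 1.0 / overall_speedup - (1.0 - fraction)
    if remaining <= 0.0:
        raise ValueError("overall speedup is not reachable with this fraction")
    return fraction / remaining


if __name__ == "__main__":
    # Hypothetical fractions, purely for illustration.
    for fraction in (1.0, 0.75, 0.6):
        part = required_part_speedup(2.0, fraction)
        print(f"fraction={fraction:.2f} -> part speedup ~{part:.1f}x")
```

Under this model, the smaller the share of runtime that was affected, the larger the speedup of that part has to be to produce the same 2x overall, which is the point being made: a 2x headline number can hide a much bigger gain on the code that actually moved.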


Thanks - that makes more sense. It wasn't clear from the article.



