Llama.cpp: Full CUDA GPU Acceleration

adeon · on June 13, 2023

llama.cpp is great. It started off as CPU-only solution and now looks like it wants to support any computation device it can.

I find it interesting that it's an example of an ML software that's totally detached from Python ML ecosystem and also popular.

Is Python so annoying to use that when a compelling non-Python solution appears, everyone will love it? Less hassle? Or did it take off for a different reason? Interested in hearing thoughts.

slabity · on June 13, 2023

> Is Python so annoying to use that when a compelling non-Python solution appears, everyone will love it? Less hassle? Or did it take off for a different reason? Interested in hearing thoughts.

For me it's less about the language and more about the dependencies. When I want to run a Python ML program, I need to go through the hassle of figuring out what the minimum version is, which package manager/distribution I need to use, and what system libraries those dependencies need to function properly. If I want to build a package for my distribution, these problems are dialed up to 11 and make it difficult to integrate (especially when using Nix). On top of that, those dependencies typically hide the juicy details of the program I actually care about.

For something like C or C++? Usually the most complicated part is running `make` or `cmake` with `pkgconfig` somewhere in my path. Maybe install some missing system libraries if necessary.

I just don't want to install yet another hundred copies of dependencies in a virtualenv and just hope it's set up correctly.

xingped · on June 13, 2023

> For me it's less about the language and more about the dependencies. When I want to run a Python ML program, I need to go through the hassle of figuring out what the minimum version is, which package manager/distribution I need to use, and what system libraries those dependencies need to function properly.

This is exactly why I hate Python. They even have a pseudo-package.json style dependencies file that you should supposedly be able to just run "install" with, but it NEVER works. Not once have I ever downloaded someone's Python project from github and tried to install the dependencies and run it has it ever gone smoothly and without issue.

The Python language itself may be great, I don't know, but I'm forever put off from learning or using it because clearly in all the years it's been around they have yet to figure out reproducibility of builds. And it's obviously possible! JavaScript manages to accomplish it just fine with npm and package.json. But for some reason Python and its community cannot figure it out.

tryauuum · on June 13, 2023

Would the problem "how can I run this cool ML project from GitHub" be solved if developers would publish their container images on dockerhub? The only downside I see is enormous image sizes

anotherhue · on June 13, 2023

"Here's a massive glob of mystery-meat state to run my massive glob of mystery-meat tensors".

rvnx · on June 14, 2023

"and by the way, inside the image I slightly modified one file of the tensors library (and this is undocumented) and this totally changes the output if the fix is not there"

hobo_in_library · on June 14, 2023

They just need a dockerfile that builds correctly. No need to make an image available, the dockerfile should be able to build it consistently

KETpXDDzR · on June 14, 2023

I use pypoetry for dependency management in Python projects. It helps a lot, but doesn't resolve the issue of pip packages to fail installing because you're missing system libraries. At least it specified the Python version to use. With many open source ML repos I have to guess what Python version to use.

I'd really like to see more Docker images (images, not Dockerfiles that fail to build). Maybe flatpack or snap packages do the trick, too.

buzzert · on June 14, 2023

> Not once have I ever downloaded someone's Python project from github and tried to install the dependencies and run it has it ever gone smoothly and without issue.

Same here. If it can't resolve dependencies or whatever, then there will almost certainly be some kind of showstopping runtime error (probably because of API changes or something). I avoid Python programs at all cost nowadays.

rubicks · on June 14, 2023

This.

`pip` is the package manager that almost works.

Python is the language that almost supports package distribution.

I'll keep using `apt` on vanilla Debian.

troad · on June 13, 2023

Strong agree. I'll willingly install a handful dependencies from my distro package manager, where the dependencies are battle-hardened Unixy tools and I can clearly see what they do and how they do it.

I'm not going to install thousands of dodgy-looking packages from pip, the only documentation for which is a Discord channel full of children exchanging 'dank memes'.

I like Python, but I simply do not trust the pip ecosystem at this point (same for npm, etc.).

barbariangrunge · on June 13, 2023

> I'm not going to install thousands of dodgy-looking packages from pip, the only documentation for which is a Discord channel full of children exchanging 'dank memes'.

This made me laugh. It’s true, isn’t it? That’s really what we deal with day to day (for me in the js world, the create react app dependencies make my head spin)

the__alchemist · on June 13, 2023

> For something like C or C++? Usually the most complicated part is running `make` or `cmake` with `pkgconfig` somewhere in my path. Maybe install some missing system libraries if necessary.

I don't mean to detract from your main point re Python dependencies, but I find this about C to be rarely true. `make` etc build flows usually result in dependency-related compile errors, or may require a certain operating system. I notice this frequently with OSS or academic software that doens't ship binaries.

andybak · on June 13, 2023

Yep. I've given up on any C or C++ projects because I find they almost never work and waste hours of my time. Part of the issue might be the fact that I'm often using Windows or MacOS but I've had bad experiences on Linux also.

pessimizer · on June 13, 2023

> `make` etc build flows usually result in dependency-related compile errors,

Which are displayed to me during the `./configure` step before the `make`, and usually require me to type "apt-get install [blah] [blah] [blah]", and to run configure again.

orra · on June 13, 2023

Not all configure scripts are created equal: a lot of them only tell you about missing dependencies one at a time.

I'm glad make, etc., works for yo. But for me, neither C, C++, nor Python are particularly enjoyable dependency wise.

potatolicious · on June 13, 2023

So much this. As someone whose bread and butter is systems programming for things that run on end-user devices, every time I dig into a Python project I feel like I've been teleported into the darkest timeline, where everything is environment management hell.

Even the more complex and annoying scenarios in native-land for dependency management still feels positively idyllic in comparison to Python venvs.

pjmlp · on June 13, 2023

When I initially started to learn Python (1.6), virtualenv was starting to be adopted, and since then thing have hardly changed.

It also helps that even minor versions introduce breaking changes.

I doubt anyone really knows Python that well, unless they are on the core team.

oddthink · on June 13, 2023

It was fine, back in the early days (I started with 1.4-ish). I just downloaded the tarball, unpacked, configured, make, installed into /usr/local on my workstation, then downloaded and stuck any packages into site-packages. Numeric was sometime tricky to compile right, but ye olde "configure && make && make install" worked fine.

Of course, that worked because 1) I was really only doing one project, not juggling multiple ones, 2) there weren't all that many dependencies (Numeric, plotting, etc.), and 3) I was already up to my eyeballs in the build system with SWIG and linking to the actual compute code, so I knew my way around the system.

But every now and then I just shake my fist at the clouds then mutter darkly about just installing the dang thing and maybe not taking on so many dependencies. :-)

barbariangrunge · on June 13, 2023

I’m newish to python, I’ve only used it for machine learning projects and some web scraping. Could somebody elaborate on venv? I just started using it but now everyone in this thread is saying how much they hate it. Is there an alternative?

BryantD · on June 13, 2023

venv is fine. Remember that this is a self-selected sample of people. You’re going to bump into the flaws and gotchas but it’s a perfectly usable tool.

throwawaymaths · on June 13, 2023

Uhm so I was professionally setting up ML distros and ML containers for cloud deployments. Venv is not fine, especially if you've seen how other langs do it.

collinmanderson · on June 14, 2023

Could you say more what you don’t like about venv compared to how other languages do it?

throwawaymaths · on June 14, 2023

Just store your deps in project directory instead of using hidden fucking magic.

tekknik · on June 14, 2023

The hidden magic is adjusting env vars used by python, LD, etc. Adjusting paths only seems like magic until you understand what it’s doing.

I’ve done with with plenty of languages including C/C++

throwawaymaths · on June 14, 2023

Yes I know how it works but try explaining that to someone learning flask at a boot camp.

tekknik · on June 18, 2023

I mean I wouldn’t expect someone at a boot camp to understand this nor was this a topic of conversation, so /shrug?

collinmanderson · on June 16, 2023

The good news is they're working towards encouraging a standard .venv in project directory.

https://peps.python.org/pep-0704/

orra · on June 13, 2023

IMHO venvs are fine as an implementation detail, a building block for a slicker tool.

The annoyance with venvs is you have to create and activate them. In contrast for cargo (or stack or dotnet or yarn or pipenv or poetry), you just run the build tool in the project directory.

Another limitation of venv is it doesn't solve the problem of pinning a versions of Python, so you need another tool.

bhy · on June 13, 2023

Well I just spent an hour to diagnose the build failure of llama.cpp due to it picking up wrong nvcc path.

Dependency problem still happens even with C/C++.

Tostino · on June 13, 2023

I had the same issue... Turned out it was because I used the flat pack version of intellij idea and it had problems with paths. Running from a plain terminal worked fine.

Faceless1230 · on June 13, 2023

The flatpak version of intellij isn't officially maintained by jetbrains. Jetbrains only maintains the snap for linux

paulmd · on June 13, 2023

it's fortunate that flatpak solves this problem of reproducible environments so effectively

adeon · on June 13, 2023

This is also the reason I like when I see a project in C or C++. It's often a ./configure && make or something. Sometimes running a Python project even if dependencies install, there might be some mystery crash because package dependencies were not set correctly or something similar (I had a lot of trouble with AUTOMATIC1111 StableDiffusion UI when using some extensions that installed their own requirements that might be in conflict with the main project).

With a boring C project, if it compiles it probably works without hassle.

Feels validating that other people have these thoughts too and I'm not just some old fart.

lynx23 · on June 13, 2023

I recently hit the "classic" case. Saw a CLI tool for an API I'd like to use, written in Python. Tried it and found out it didn't work on my machine. I later found out it was a bug in a dependency of that tool. 100 lines of shell script later, I had the functionality I needed, and a codebase which was actually free of unexpected surprises. I know, this is an extreme example, but as personal anecotes go, Python has lost a lot of trust from my side. I also wonder how people can write >10k codebases without static types, but that is just me ....

iopq · on June 13, 2023

It's not that Python is bad. It's the people who want to just hack something quick together go to Python, so any time you pick up some software written in Python it's marred with all kinds of compatibility issues and bugs where you can't just run it

The answer is "yeah use this other software to make it work in an isolated way because the whole ecosystem is actually broken" and that's somehow acceptable

raxxorraxor · on June 13, 2023

I think it is the opposite for me but I am also a fan of system independent package mangers, provided they support easy package configuration.

Otherwise you not only bind to system architecture and OS, you also bind yourself to a distribution.

I find that Automatic1111 plugins tend to not share dependencies and instead redownloads them for their own use. Can make your hdd cry because some of these are larger models. Advantages and disadvantages probably...

There are package managers for C and some are quite good. But for most projects you are quite dependent on the package manager of your distro to supply you a fitting foundation. Sometimes it is easy, but if there is a problem, I think handling C is far harder than python. And I write quite a bit of C while I can only perhaps read python code.

No code is completely platform independent, especially a stable diffusion project, but Python is still more flexible as C by a long shot here.

Of course Llama is great. Time to get those LLMs on our devices for our personal dystopian AIs running amok.

drdaeman · on June 13, 2023

> which package manager/distribution I need to use, and what system libraries those dependencies need to function properly

I don't understand why things are so complicated in Python+ML world.

Normally, when I have a Python project, I just pick the latest Python version - unless documentation specifically tells me otherwise (like if it's still Python 2 or if 3.11 is not yet supported). If the project maintainer had some sense, it will have a requirements list with exact locked versions, so I run `pip install -r requirements.txt` (if there is a requirements.txt), `pipenv sync` (if there is a Pipfile), or `poetry install` (if there's pyproject.toml). That's three commands to remember, and that's not one just because pip (the one de-facto package manager) has its limitations but community hadn't really decided on the successor. Kinda like `make` vs automake vs `cmake` (vs `bazel` and other less common stuff; same with Python).

External libraries are typically not needed - because they'll be either provided in binary form with wheels (prebuilt for all most common system types), or automatically built during the installation process, assuming that `gcc`, `pkgconfig` and essential headers are available.

Although, I guess, maybe binary wheels aren't covering all those Nvidia driver/CUDA variations? I'm not a ML guy, so I'm sure how this is handled - I've heard there are binary wheels for CUDA libraries, but never used that.

Ideally, there's Nix (and poetry2nix) that could take care of everything, but only a few folks write Flakes for their projects.

> Usually the most complicated part is running `make` or `cmake` with `pkgconfig` somewhere in my path

Getting the correct version of all the dependencies is the trickiest part as there is no universal package managers - so it's all highly OS/distro specific. Some projects vendor their dependencies just to avoid this (and risk getting stuck with awfully out-of-date stuff).

> Maybe install some missing system libraries if necessary.

And hope their ABIs (if they're just dynamically loaded)/headers (if linked with) are still compatible with what the project expects. At least that is my primary frustration when I try to build something and it says it doesn't work anymore with whatever OS provides (mostly, Debian stable fault lol). It is not exactly fun to backport a Debian package (twice so if doing this properly and not handwaving it with checkinstall).

rgoulter · on June 13, 2023

> Ideally, there's Nix (and poetry2nix) that could take care of everything, but only a few folks write Flakes for their projects.

Relevant to "AI, Python, setting up is hard ... nix", there's stuff like:

https://github.com/nixified-ai/flake

NBJack · on June 13, 2023

The right combo for Nvidia/CUDA/RandomPythonML library is a nightmare at times. This is especially true if you want to use older hardware like a Tesla M40 (dirt cheap, still capable). And your maker hopefully be with you if you if you tried to use your distro's native drivers first.

It's fair to say part of the blame is on Nvidia, but wow is it frustrating when you have to find eclectic mixes.

kkfx · on June 13, 2023

My personal recipe (on NixOS) is pip-ed virtual environment for quick tests, or conda inside a nix-shell, on top of a dedicated zfs pool/conda mounted in ~/.conda with dedup=on so nothing nixified and nothing that last a nixos-rebuild...

Many pythonic projects not only in ML world tend to be just developers experiments, so to be run as an experiment, not worth to be packaged as a stable, released program...

Oh, BTW projects like home-assistant fell in the same bucket...

taf2 · on June 13, 2023

I totally agree with you the irony being python and languages like were built in part to reduce the complexity not only of the language but also to build and run the code… I feel machine learning is a low enough level thing that it should not be tied to a high level language like python… so I can use node, ruby, php or whatever by adding a c binding etc that to me is why this is most interesting

josephg · on June 13, 2023

The problem is that python is designed assuming people want to use system-wide packages. In hindsight, that has turned out to be a mistake. Conda / venv try to bridge that gap but they’re kludgy, complex hacks compared to something like cargo or even npm.

Worse, because Python is a dynamic language, you also have to deal with all of that complexity at deployment time. (Vs C/C++/Zig/Rust where you can just ship the compiled binary).

drdaeman · on June 13, 2023

> The problem is that python is designed assuming people want to use system-wide packages.

This wasn't true for decades, `virtualenv` was de-facto standard isolation solution (now baked in as `python -m venv`, still de-facto standard), and `pip` is the package manager (we don't talk about setuptools/distutils, ssh!). If someone still used system-wide packages that was either because a) they were building a container or some single-purpose system; or b) they were sloppy or had no idea what they're doing (most likely, following some crappy tutorial). Or it was distro people creating packages to satisfy dependencies for Python programs - but that's a whole different story (and one's virtualenv shouldn't inherit system packages unless it is really really necessary and iif it makes sense to do so).

The problem started when one needed some external non-Python dependencies. Python had invented binary wheels and they're around for a while (completely solving issues with e.g. PostgreSQL drivers, no one needs to worry about libpq), but I suppose depending on specific versions of kernel drivers and CUDA libraries is a more complex and nuanced subject.

> Vs C/C++/Zig/Rust where you can just ship the compiled binary

Only assuming that you can either statically link, or if all libraries' ABIs are stable (or if you're targeting a very specific ABI, but I've had my share of "version `GLIBC_2.xx' not found"s and not fond of those).

In a similar spirit, any Python project can be distributed as one binary (Python interpreter and a ZIP archive, bundled together) plus a set of zero or more .so files.

josephg · on June 13, 2023

> This wasn't true for decades, `virtualenv` was de-facto standard isolation solution (now baked in as `python -m venv`, still de-facto standard)

Right; but python itself doesn’t check your local virtual environment unless you “activate” it (ugh what). And it can’t handle transitive dependency conflicts, like node and cargo can. Both of those problems stem from python assuming that a simple, flat set of dependencies are passed in from its environment variables.

Erwin · on June 13, 2023

Virtual envs are actually quite simple -- they contain a bin/ directory with a linked python binary. When the python binary runs, it checks it sibling directories (it knows it was executed as e.g. /home/user/.venv/bin/python) for what to load. You don't need the activate shell scripts or anything, just running that binary within your venv is enough; the shell script is just for convenient of inserting the bin directory into the $PATH so just "python" or "pip" runs the right thing.

dragonwriter · on June 13, 2023

> the shell script is just for convenient of inserting the bin directory into the $PATH so just "python" or "pip" runs the right thing.

Or so any reference in the program you run that launches another binary or loads a DLL relying onnthe environment gets the right one, etc. There are some binaries you can run without activating a venv with no problem, and others will crash hard, and others will just subtly do the wrong thing if the conditions are “right” in your normal system environment.

josephg · on June 14, 2023

Another implication of this is that its impossible for 2 mutually incompatible copies of the same package to exist in the same environment. If packageA needs numpy 1.20 and packageB needs numpy 1.21, you're stuck.

throwawaymaths · on June 13, 2023

> Virtual envs are actually quite simple

You have never trashed your system from virtualenv?

Also, there is a problem when wheels assume they can have everything like tensorflow from years ago -- I don't know about now, since tf used to be tied to cuda versions you could get into trouble installing tf versions, even with venv, conda, etc.

drdaeman · on June 14, 2023

> You have never trashed your system from virtualenv?

Unless one have done something they shouldn't have done (in particular, using sudo while working with virtualenv), this shouldn't be possible.

Due to limitations of most commonplace system-wide package managers (like, dpkg, rpm or ebuild, not modern stuff like nix) system packages exist to support other system packages. One installs some program, it needs libraries, dependencies get pulled. And then its distro package managers' job to ensure compatibility and deal with multiple version conflicts (not fun).

But if you start or check out some project, common knowledge was that you shouldn't be using on system packages, even if they're available and could work. With some obligatory exceptions like when you're working on a distribution packaging, or developing something meant to be tightly integrated with a particular distro (like a corporate standard stuff).

That is, unless we're talking about some system libraries/drivers needed for CUDA in particular (which is system stuff) rather than virtualenv itself.

throwawaymaths · on June 14, 2023

> That is, unless we're talking about some system libraries/drivers needed for CUDA in particular (which is system stuff) rather than virtualenv itself.

Sir, this is an ML thread.

Venv interacts with that poorly, though to be fair it could be googles fault. Still it shouldn't be even possible.

drdaeman · on June 15, 2023

I mean, virtualenv is not supposed to interact with that at all. System libraries are systems' package manager responsibility. Doubly so, as - as I get it - all this stuff is directly tied to the kernel driver.

What Python's package manager (pip in virtualenv) should do is build (or download prebuilt binaries) the relevant bindings and that's the extent of it. If others say it works this way with C (that comment about cmake and pkgconfig), then it must work this way with Python too.

kkfx · on June 13, 2023

> If someone still used system-wide packages that was either because a) [...] b) [...]

Or simply because they are packagers for some distro and they user want a simple way to pull-in some software by it's name, while the upstream devs imaging people cloning their public repo and run the software from the checkout in their own home, with regular pull, regularly rebuilding the needed surroundings...

Not to talking about modern systems/distro with not-really-posix vision like NixOS or Guix System...

> In a similar spirit, any Python project can be distributed as one binary

A single 10+Gb binary :-D

v3ss0n · on June 13, 2023

Ridiculous, 10gb binary only if machine learning models involved. I had distributed full stack binaries in 70mb or less.

6gvONxR4sf7o · on June 13, 2023

> For something like C or C++? Usually the most complicated part is running `make` or `cmake` with `pkgconfig` somewhere in my path. Maybe install some missing system libraries if necessary.

If you don't use virtual environments in python, isn't it basically the same in python? Just run `pip install` and maybe install some missing system libraries if necessary. In practice, it's not that simple in either language, and "maybe install some missing dependencies" sweeps a lot of pain under the rug.

leoh · on June 13, 2023

I wonder if having a shared cache would make this more easeful. fwiw nix does that.

jelling · on June 13, 2023

You might like Cog. It solves these problems for ML projects in specific.

nologic01 · on June 13, 2023

Wow, so interesting to see the "depth" of anti-python feeling in some quarters. I guess that is the backlash from all the hordes of Python-bros.

Having used both C++ and Python for some time, the idea that managing C++ dependencies is easier than venv and pip install is one of the moments you wonder how credible is HN opinion on anything.

> a compelling non-Python solution appears

Confusing a large ML framework like pytorch that allows you to experiment and develop any type of model with a particular optimized implementation in a low level language suggests people are not even aware of basic workflows in this space.

> also popular

Ofcourse its popular. As in: People are delirious with LLM FOMO but can't fork gazillions to cloud gatekeepers or NVIDIA so anybody who can alleviate that pain is like a deus-ex-machina.

Ofcourse llama.cpp and its creator are great. But the exercise primarily points out that there isn't a unified platfrom to both develop and deploy ML type models in a flexible and hardware agnostic way.

p.s. For julia lovers that usually jump at the "two-language problem" of Python: here is your chance to shine. There is a Llama.jl that wraps Llama.cpp. You want to develop a native one.

mook · on June 13, 2023

Managing C++ dependencies _is_ much easier! It's either "run this setup exe" or "extract this zip file/tarball/dmg and run".

This is because most people don't care about developing the project, just using it. So they don't care what the dependencies are, just that things work. C++ might be more difficult to handle dependencies to build things, but few people will look into hacking on the code before checking to see if it's even relevant.

fransje26 · on June 14, 2023

> Managing C++ dependencies _is_ much easier! It's either "run this setup exe" or "extract this zip file/tarball/dmg and run".

Not sure if that was half sarcastic, but from experience, anything touching c++ & CUDA has the potential to devolve into a nightmare on a developer-friendly platform (hello kernel updates!), or worse, into a hair-pulling experience if you happen to use the malware from Redmond (to compound your misery, throw boost and Qt dependencies in the mix).

Then again, some of the configurations steps required will be the same for Python or C++. And if the C++ developer was kind enough to provide a one-stop compilation step only depending on an up-to-date compiler, it might, indeed, be better than a Python solution with strict package version dependencies.

nologic01 · on June 13, 2023

Maybe this distinction explains indeed the dissonance! But it might be rather shortsighted given the state of those models and the need to tune them.

ric2b · on June 13, 2023

If we're only talking about end-user "binaries" you can also package Python protects into exe files or similar format that bundle all the dependencies and are ready to run.

simion314 · on June 13, 2023

>Wow, so interesting to see the "depth" of anti-python feeling in some quarters. I guess that is the backlash from all the hordes of Python-bros.

I think you are generalizing. I do not hate on Python the language but this ML projects are a very , very terrible experience. Maybe you can lame the devs of this ML projects, or the ones of the dependencies but the experience is shit. You can follow a step by step instruction that worked 11 day ago and today is broken.

I had similar issues with Python GTK apps, if the app is old enough then you are crewed because that old GTK version is no longer packaged, if the app is very new then you are screwed again because it depends on latest version of some GTK wrapper/helper.

nologic01 · on June 13, 2023

I think what has happened is that because Python is sweet and easy to use for many things, it generated irrational expectations that is perfect for all things. But its just an interpreted language that started as a scripting and gluing oriented language.

Its deployment story is where this gap frequently shows. Desktop apps at best passable, whereas e.g. android apps practically non-existing despite the efforts of projects like kivy.

simion314 · on June 13, 2023

The problem seems to be getting a project that works on the developer machine packaged and distributed to non developers. Some type of projects seem harder to distribute, this ML dependencies seem to change fast and everything breaks(maybe because the dependencies are not locked correctly).

tempera · on June 14, 2023

I think the Python hate has been manufactured, starting with Google's launch of Go language, which wanted to be eat Python's cake.

And some jumped on the started bandwagon unknowingly, and never got off of it.

lexandstuff · on June 13, 2023

Just to balance things out: I still love Python. A lot!

noman-land · on June 13, 2023

I don't even write python, really, but I've been interfacing with llama.cpp and whisper.cpp through it recently because it's been most convenient. Before that I was using nodejs libraries that just wrap those two cpp libs.

I guess since these models are meant to be run "client side" or "at the edge" or whatever you want to call it, it helps if they can be neutrally used with just about any wrapper. Using them from Javascript instead of Python is sort of huge for moving ML off the server and into the client.

I haven't really dipped my toes into the space until llama and whisper cpp came along because they dropped the barrier extremely low. The documentation is simple, and excellent. The tools that it's enabled on top like text-generation-webui are next level easy.

git clone. make. download model. run.

That's it.

v3ss0n · on June 13, 2023

Quality of HN comments are getting bad for a few months. This is nothing to do with python ML ecosystem and what you have to realize llmcpp doing is it is inferencing already built models- which is running the models.

Building (training) machine learning, deep learning models are much more complex , order of magnitude complex than just running the models and doing that in C or C++ would take you years which would take just a few month with python.

And complexity of `pip install` is nothing compared to that.

That's why no real ETL+Deep learning, training work is done in c or c++.

cztomsik · on June 13, 2023

You're not entirely wrong but pretty much everything you use from python is written in C++ anyway, so what's your point?

ageofwant · on June 13, 2023

The point is, as you pointed out, that you code against the appropriate level of abstraction. You write a ML workflow appropriate language like Python in something like C++/rust, and ML flows in Python. That should really not be that hard to understand.

v3ss0n · on June 13, 2023

It is same argument as "Every HTML, CSS , JAVASCRIPT" development you do is written in C/C++ anyways .

cztomsik · on June 14, 2023

I am not an expert, but from what I've seen, PyTorch is mostly a thin wrapper over the C++ libtorch. The same is true for DOM as well, but nobody uses DOM directly, whereas everybody uses PyTorch.

The big shift is Jupyter, but that's mainly for exploratory programming. If you already know what you're doing, there's no reason why C++ should be worse than Python for training. It's likely that most ML engineers do not have experience with C++.

I don't have that experience either but from what I've seen, C++ is very powerful, so once you subtract the jupyter, there's not really too much left.

BTW: You cannot use DOM/CSSOM from C++, the only API is JS, so your argument is theoretical.

forgingahead · on June 13, 2023

Yes Python is incredibly annoying to use. Their dependency management is a total mess, and it's incredible how brittle packages if there are even minor point changes in versions anywhere in a stack.

sp332 · on June 13, 2023

I have to agree. Installing dependencies for some git repos is a total crapshoot. I ended up wasting so much hard drive space with copies of pytorch. Meanwhile llama.cpp is just "make" and takes less time to build than to download one copy of pytorch.

cshimmin · on June 13, 2023

So, the solution is that everyone should write code as self-contained C++ code and not use any software libraries ever. Dependency hell has been solved for all time!

josephg · on June 13, 2023

Python was released in 1991. It obviously didn’t get package management right, just like C and C++ didn’t figure it out in the 70s.

Take a look at rust’s Cargo for what a modern package manager should look like. Or deno / Go if you swing that way.

Which old language gets package management right? None of them. None of them get it right.

And sure - conda / venv / CMake / etc help. But last century’s bad design decisions still shine through.

sp332 · on June 13, 2023

Well, in the cases where the libraries are causing more work than they are solving. Then yes.

Tostino · on June 13, 2023

There is a happy medium... somewhere. After following Postgres development for the better part of a decade, I think it's definitely closer to the python side of things... But man they (python) do make it hard to like using that ecosystem.

The flip side is like you said... You will just have to reimplement everything yourself and then you can never worry about dependencies again! Just hope you didn't introduce some obscure security issue in your hashmap implementation .

realusername · on June 13, 2023

I personally really don't like much Python, I find it as tedious to write as Go but without the added performance, typesafety and autocomplete benefits that comes with it in exchange.

If I have use a dynamic language, at least make it battery included like Ruby. Sure it's also not performant but I get something back in exchange.

Python sits in a very uncomfortable spot which I don't find a use for. Too verbose for a dynamic language and not performant enough compared to a more static language.

The testing culture is also pretty poor in my opinion, packages rarely have proper tests, (and especially in the ML field)

emmender · on June 13, 2023

In addition to the above:

1) function decorators etc have made the code unreadable

2) while code is succinct, a lot of abstraction is hidden in some C/C++ language binding somewhere, so, when there is a problem, it is hard to debug

3) Pytorch has become a monolithic monster with practically no-one understanding its functionality e-2-e

pjmlp · on June 13, 2023

For me Python's main use cases are being BASIC replacement, and a saner alternative to Perl (I like Perl though) for OS scripting.

For everything else, I rather use compiled languages with JIT/AOT toolchains, and since most "Python libraries" for machine learning are actually C and C++, any language goes, there is nothing special about Python there.

yieldcrv · on June 13, 2023

The Python apologists are more annoying than the language.

Its always been obvious that ML’s marriage to python has always been credential ladened proponents in tangentially related fields following group think.

As soon as we got a reason to ignore those PhDs, their gatekept moat evaporated overnight and the community of [co-]dependencies became irrelevant overnight.

dontreact · on June 13, 2023

As far back as 2015, it’s been common to take neural net inference and get a C++ version of it. That’s what this is.

It didn’t make python obsolete then and it won’t now.

Training (orchestration, architecture definition etc.) and data munging (scraping cleaning analyzing etc.) are much easier with python than C++, and so there is no chance that C++ takes over as the lingua Franca of machine learning until those activities are rare

emmender · on June 13, 2023

python is syntactic sugar - the heavy lifting is done by c/c++ bindings.

many ML experts are not software engineers. They just want syntax to get their job done. fair enough.

bartwr · on June 13, 2023

Even "compiled" JAX or PyTorch can leave some performance even if you hit the common path (It's also "praying" that the compiler actually works if you do anything non-standard).

But memory wise, there is almost no optimization or reuse (and it's sacrificed for performance), which leads to insane memory usage.

And it's not that they are bad - but optimal compilation of a graph is combinatorially explosive problem, impossible without heuristics and guesswork (what to reuse and waste memory vs recompute). A good programmer can do a significantly better job.

MrYellowP · on June 13, 2023

> Is Python so annoying to use that when a compelling non-Python solution appears, everyone will love it? Less hassle? Or did it take off for a different reason? Interested in hearing thoughts.

Exploring the landscape ends up with you having 29384232938792834234 different python environments, because that one thing requires specific versions of one set of libraries, while that other thing requires different versions of the same library and there is no middle ground.

It's horribly annoying and I absolutely love python!

jtode · on June 13, 2023

I'm currently a Python dev for a living, but I spend a portion of my personal time trying to get better at C/C++, and in my case, it's strictly about the potential for writing faster code. I'm interesting in getting into DSP stuff specifically, so it's 100% necessary if I want to do that, and I would also like to get my head wrapped around OpenCV for similarly creative reasons.

Python opened up the world of code to a lot more people, but there was a cost to that, and the real action as far as actual computer systems go is always gonna be at a much lower level, in much the same way that a lot of people can top up their fluids but most of us pay someone to change the oil.

I could actually totally see oil changes becoming completely robotic in the future, but we would first have to establish open standards for oil pans that all automakers adhered to.

The whole computers designing computers thing, outside of someone cracking cold fusion I don't think we'll ever have the juice for it. In my lifetime, a C-level programmer will never be out of work, but I suspect that the demand for Python programmers is going to slack off, while the supply continues to grow.

tension000 · on June 13, 2023

Yet another TEDIOUS BATTLE: Python vs. C++/C stack.

This project gained popularity due to the HIGH DEMAND for running large models with 1B+ parameters, like `llama`. Python dominates the interface and training ecosystem, but prior to llama.cpp, non-ML professionals showed little interest in a fast C++ interface library. While existing solutions like tensorflow-serving [1] in C++ were sufficiently fast with GPU support, llama.cpp took the initiative to optimize for CPU and trim unnecessary code, essentially code-golfing and sacrificing some algorithm correctness for improved performance, which isn't favored by "ML research".

NOTE: In my opinion, a true pioneer was DarkNet, which implemented the YOLO model series and significantly outperformed others [2]. Same trick basically like llama.cpp

[1] https://github.com/tensorflow/serving [2] https://github.com/pjreddie/darknet

csjh · on June 13, 2023

I believe it's moreso for the (actively pursued) speed optimizations it provides. When inference is already computationally expensive any bit of performance is a big plus

zmmmmm · on June 13, 2023

yeah it's sad I guess but half the reason I am recommending this is that it "just works" so much more easily than installing half the Python ecosystem into a conda environment (ironically, just so that Python can then behave as a thin wrapper for calling the underlying native libraries ...)

plaguuuuuu · on June 13, 2023

I've been using it because there are bindings in other languages I know like .NET

noduerme · on June 13, 2023

I've been wondering for awhile now - what was ever the benefit to building these things in Python, other than pytorch and numpy being there to experiment on in hobbyist ways? There's no way that a serious AI is really going to be built in a scripting language, is there? Once you know what you actually want it to do, you're definitely going to rebuild it as close to the metal as you can, right?

Not to mention, to really protect source code and all the sugar around the training systems, it's going to be a good investment to get out of hobby land and manage your own memory and just code them in C/C++.

It strikes me that the hobby AI ethos aligns very well with scripting languages in that they both assume the availability of endless resources to push things a little dirtier and messier and see if anything interesting emerges. Which is great for hobby AI. It's probably not the future, though, unless resource availability outpaces the imagination of people to write more and more bloated scripts to accomplish what's already been proven.

nologic01 · on June 13, 2023

incredibly condescending which always pairs well with ignorance. "Hobby AI" are the people (mathematicians, domain experts etc) that made this all possible so that you can now "just code it in C/C++".

Have you ever tried to iterate developing any serious class of algorithms in C++?

> Once you know what you actually want it to do

When is that exactly? Even the last few months of LLM land development show very clearly how everything is rapidly evolving (and will very likely continue for quite some time).

Numerical linear algebra stabilized decades ago so you do have low-level libraries in C++ (or even fortran) but there is quite some distance between an LLM and linear algebra.

fisf · on June 13, 2023

> Have you ever tried to iterate developing any serious class of algorithms in C++?

Yes, and it's not so bad. A lot of ML deployments have been based on C/C++ for inference anyway (with Python driving the training). So that's really nothing new. I.e. most Python research code is not deployable in terms of quality / performance.

nologic01 · on June 13, 2023

> Yes, and it's not so bad.

In the absence of alternatives its workable. Libraries like eigen or armadillo help a lot. But Python with numpy has been extremely popular for a good reason.

b33j0r · on June 13, 2023

Not quite. Nothing to do directly with python. This was the introduction of 4 and 8 bit quantization to a large number of people.

There wasn’t a python library like that anyone was used to using. Would have always been a C extension anyway.

Starting with the cpu in this situation made sense, strangely. There are python wrappers now. I tried to make one for ya’ll in rust in April, but haha I had a compiler issue I never solved.

nmfisher · on June 13, 2023

It's a great project and an impressive achievement, but I'm also struggling to understand what people use it for that PyTorch wasn't offering. Easy deployment on iOS I guess? I would have thought that's a pretty small use case though.

Given the author hand-rolled his own FFT, I'm also guessing it's not as performant?

superkuh · on June 13, 2023

Pytorch (+GPU) dependency and python container type diversity are particularly bad. Programmers may not perceive this since they're already managing their python environment keeping all the OS/libs/containers/applications in the alignment required for things to work but it's quite complex. I couldn't do it.

In comparison I could just type git clone https://github.com/ggerganov/llama.cpp and make . And it worked. And since then I've managed to get llama.cpp clBLAS partial GPU acceleration working with my AMD RX 580 8GB. Plus with the llama.cpp CPU mmap stuff I can run multiple LLM IRC bot processes using the same model all sharing the RAM representation for free. Are there even ways to run 2 or 3 bit models in pytorch implementations like llama.cpp can do? It's pretty rad I could run a 65B llama in 27 GB of RAM on my 32GB RAM system (and still get better perplexity than 30B 8 bit).

manfre · on June 13, 2023

> In comparison I could just type git clone https://github.com/ggerganov/llama.cpp and make . And it worked.

You're comparing a single, well managed project that had put effort into user onboarding against all projects of a different language and proclaiming that an entire language/ecosystem is crap.

The only real take away is that many projects, independent of language, put way too little effort towards onboarding users.

superkuh · on June 13, 2023

That's exactly it. Who wants to try pulling in an entire language with wide depenencies and it's ecosystem of envs/containers/etc when a single program will do it? Not people who just want to run inference on pre-made models.

civilitty · on June 13, 2023

Easy deployment anywhere, not just iOS. I haven't used Python in years so I have no idea what package manager is the best now, completely forgot how to use virtualenv, and it only took a few weeks to completely fuck up my local Python install ("Your version of CUDA doesn't match the one used to blah blah")

Python is a mess. llama.cpp was literally a git clone followed by "cd llama.cpp && make && ./main" - I can recite the commands from memory and I haven't done any C/C++ development in a long time.

cshimmin · on June 13, 2023

For most modern ML projects in python you can just do something like `conda env create -f environment.yaml` then straight to `./main.py`. This handles very complex dependencies in a single command.

The example you gave works because llama.cpp specifically strives to have no dependencies. But this is not an intrinsically useful goal; there's a reason software libraries were invented. I always have fun when I find out that the thing I'm trying to compile needs -std=C++26 and glibc 3.0, and I'm running on a HPC cluster where I can't use a system level package manager, and I don't want to be arsed to dockerize every small thing I want to run.

For scientific and ML uses, conda has basically solved the whole "python packaging is a mess" that people seem to still complain about, at least on the end-user side. Sure, conda is slow as hell but there's a drop in replacement (mamba) that solves that issue.

josephg · on June 13, 2023

Conda? Mamba? Or should I use Venv? What are the commands to “activate” an environment? And why do I have to do that anyway, given thats not needed in any other programming language? Which of those systems support nested dependencies properly? Do any of them support the dependency graph containing multiple, mutually incompatible copies of the same package?

Coming from rust (and nodejs before that), the package management situation in python feels like a mess. It’s barely better than C and C++ - both of which are also a disaster. (Make? Autotools? CMake? Use vendored dependencies? System dependencies? Which openblas package should I install from apt? Are any of them recent enough? Kill me.)

Node: npm install. npm start.

Rust: cargo run. Cargo run —-release.

I don’t want to pick from 18 flavours of “virtual environments” that I have to remember how to to “activate”. And I don’t want to deal with transitive dependency conflicts, and I don’t want to be wading through my distro’s packages to figure out how to manually install dependencies.

I just want to run the program. Python and C both make that much more difficult than it needs to be.

cshimmin · on June 13, 2023

Honest question, as I don't follow rust, does the cargo package manager handle non-rust dependencies? I've played around with it enough that I can say it's for sure a joy to use, but what if your rust program has to link against a very specific combination of versions of, say, Nvidia drivers and openssl and blas?

In general solving such environments (where some versions are left floating or only have like >= requirements) is an NP-hard problem. And it requires care and has to draw source code and/or binaries from various sources.

This is the problem that conda/mamba solves.

If you just want to install python packages, use pip+virtualenv. It's officially supported by python. And while pip has traditionally been a mess, there's been a bunch of active development lately, the major version number has gone from like 9.xx to 23.xx in like the past two or three years. They are trying, better late than never, especially for an ecosystem as developed as python's.

So, if you want to compare rust/cargo, and it handles non-rust deps, then the equivalent is conda. Otherwise, it's virtualenv+pip. I don't think there are any other serious options, but I agree that two is not necessarily better than one in this case. Not defending this state of affairs, just pointing out which would be the relevant comparison to rust.

chrismorgan · on June 13, 2023

> Honest question, as I don't follow rust, does the cargo package manager handle non-rust dependencies?

You can run arbitrary code at build time, and you can link to external libraries. For the likes of C libraries, you’ll typically bind to the library and either require that it be installed locally already, or bundle its source and compile it yourself (most commonly with the `cc` crate), or support both techniques (probably via a feature flag). The libsqlite3-sys crate is a fairly complex example of offering both, if you happen to want to look at an actual build.rs and Cargo.toml and the associated files.

> [pip’s] major version number has gone from like 9.xx to 23.xx in like the past two or three years.

It skipped from 10.0.1 to 18.0 in mid-2018, and the “major” part has since corresponded to the year it was released in, minus 2000. (Look through the table of contents in https://pip.pypa.io/en/stable/news/ to see all this easily.)

cshimmin · on June 13, 2023

Hah, TIL! So much for semver... Calendarver?

baq · on June 13, 2023

> npm install. npm start.

Yeah, no… my current $DAYJOB has a confusing mix of nx, pnpm, npm commands in multiple projects. Python is bad but node is absolutely not a good example.

josephg · on June 13, 2023

Eh. Node / npm on its own is generally fine, especially if you use a private package repository for internally shared libraries. The problems show up when compiling javascript for the web as well as nodejs. If you stick to server side javascript using node and npm, it all works pretty well. It’s much nicer than venv / conda. And it handles transitive dependency conflicts and all sorts of wacky problems.

It’s just that almost nobody does that.

What we want instead is to combine typescript, js bundlers, wasm, es modules and node packages all together to make web apps. And that’s more than enough to bring seasoned engineers to tears. Let alone adding in svelte / solidjs compilation on top of all that. I have sweats just thinking about it.

jsjohnst · on June 13, 2023

> If you stick to server side javascript using node and npm, it all works pretty well.

Rose colored glasses if you ask me. The difference is it seems you use Node often (daily?) and have rationalized the pain. Same goes for everyone defending Python (which I’m sorta in that camp, full disclosure), they are just used to the worts and have muscle memory for how to work around them just like you seem to be able to do with Node.

josephg · on June 14, 2023

> Rose colored glasses if you ask me. The difference is it seems you use Node often (daily?) and have rationalized the pain.

It’s not just that. Node can also handle arbitrary dependency graphs, including transitive dependency conflicts, sibling dependencies and dev dependencies. And node doesn’t need symlink hacks to find locally installed dependencies. Node packages have also been expected to maintain semver compatibility from inception. They’re usually pretty good about it.

It’s not perfect, but it’s pretty nice.

dimava · on June 13, 2023

@antfu/ni does a good job on determining which of package managers should be run in the currect project

girvo · on June 13, 2023

I have no dog in this fight, but the fact that you end your "conda solves this complexity" explanation with "and mamba is a replacement for conda" doesn't really sell me on it. Is it just speed as the main reason for mamba? The fact that the "one ring to rule them all" solve for pythons packaging woes even has a replacement sort of defeats the purpose a little.

I understand why people find Python's packaging story painful, I guess is what I'm saying.

gre · on June 13, 2023

conda takes up to several minutes to figure out dependencies just to tell you that one library requires an old version of a dependency and another requires a new one, making the whole setup impossible.

mamba can do this much quicker.

dekhn · on June 13, 2023

as always, a fast no is better than a slow maybe

girvo · on June 13, 2023

That’s genuinely good to know! Cheers

cshimmin · on June 13, 2023

As sibling elaborated, mamba just accelerates the NP-hard step of resolving conflicts using mathy SAT-solver methods. Also it has a much prettier cli/tui. I would be surprised if conda doesn't eventually merge the project, but you can already use it as an optional/experimental backend on the official conda distro. Haven't followed it at all but I suspect the folks who developed it wanted to use it right away so they forked the project, while the people who maintain conda are more conservative since it's critical infrastructure for ${large_number} of orgs.

intelVISA · on June 13, 2023

> But this is not an intrinsically useful goal; there's a reason software libraries were invented.

No need to denigrate those who don't need to import all their code; it's perfectly fine to not rely on third parties when developing software - some of our best works result from this.

nmfisher · on June 13, 2023

Thanks for the response. Interesting to see a lot of other people echoing the same comments - that dependency management in Python is an absolute PITA.

I know exactly what you mean, but I'm probably so inured to it by now that I've just come to accept it. Obviously not everyone feels this way!

ojosilva · on June 13, 2023

Python sits on the C-glue segment of programming languages (where Perl, PHP, Ruby and Node are also notable members). Being a glue language means having APIs to a lot of external toolchains written in not only C/C++ but many other compiled languages, APIs and system resources. Conda, virtualenv, etc. are godsend modules for making it all work, or even better, to freeze things once they all work, without resourcing to Docker, VMs or shell scripts. It's meant for application and DevOps people who need to slap together, ie, ML, Numpy, Elasticsearch, AWS APIs and REST endpoints and Get $hit Done.

It's annoying to see them "glueys" compared to the binary compiled segment where the heavy lifting is done. Python and others exist to latch on and assimilate. Resistance is futile:

https://pypi.org/project/pyllamacpp/

https://www.npmjs.com/package/llama-node

https://packagist.org/packages/kambo/llama-cpp-php

https://github.com/yoshoku/llama_cpp.rb

vachina · on June 13, 2023

You can “cd llama.cpp && make && ./main" because the developers made that happen.

A self contained zero dependency Python project is also literally “python app.py”

peatmoss · on June 13, 2023

This is true, though in practice I so often find large, non-trivial examples of the former, but very seldom find examples of the latter.

The barriers to entry are lower with python, and so I think there tends to be a lower standard for deployment.

whywhywhywhy · on June 13, 2023

>Easy deployment on iOS I guess?

Deploying a PyTorch running Python app to anything in a way a user can just run it is a struggle. Even iOS aside that’s not a small use case that’s all local and offline ML potential.

Really makes me wonder about the whole language, did they ever expect the code to have to run elsewhere than the machines the writer controls.

slashtom · on June 13, 2023

llama cpp is just cpp inference on the llama model. PyTorch is a library to train neural networks. I'm not sure why people are conflating these two totally different projects..

dragonwriter · on June 13, 2023

> Is Python so annoying to use that when a compelling non-Python solution appears, everyone will love it? Less hassle? Or did it take off for a different reason?

The desktop/laptop LLMs use case is one where resource efficiency often makes the difference between “I can’t use this at all” rather than the more frequent “I can use it but it maybe runs a little slower”, and llama.cpp offers that. It also has offered new quantization options that the Python-based tooling hasn’t, which compounds the basic resource efficiency point.

(It’s also not an “everyone” thing: plenty of people are using the Python-based toolchains for LLMs, its possible for their to be multiple popular options in a space . Not everything is all or nothing.)

thom · on June 13, 2023

I think Python and R are generally superior (in terms of developer experience) when you have to do end-to-end ML work, including acquiring and munging data, plotting results etc. But even then, the core algorithms are generally implemented in libraries built out of native code (C/C++/Fortran), just wrapped in friendly bindings.

For LLMs, unless you're doing extensive work refactoring the inputs, there are fewer productivity gains to be had around the edges - the main gains are just speeding up training, evaluation and inference, i.e. pure performance.

kristofferR · on June 13, 2023

Yeah, most Python software seem to have pretty poor compatibility, I usually need to downgrade my Python version to get stuff to run.

PeterStuer · on June 13, 2023

Python isn't annoying to use, at least for me and many I know. But it isn't known for speed. And or easy multithreading.

IshKebab · on June 13, 2023

> Is Python so annoying to use that when a compelling non-Python solution appears, everyone will love it?

Yes.

aidenn0 · on June 13, 2023

Slightly OT:

I have been playing around with whisper.cpp; it's nice because I can run the large model (quantized to 8-bits) at roughly real-time with cublas on a Ryzen 2700 with a 1050Ti. I couldn't even run the pytorch whisper medium on this card with X11 also running.

It blows me away that I can get real-time speech-to-text of this quality on a machine that is almost 5 years old.

eurekin · on June 13, 2023

Seconded. I were playing around for my native language (Polish) and the large models actually blew me away. For example, it handled "przescreenować" spelling correctly, which is an english word with a polish prefix and a conjugated suffix.

nitinreddy88 · on June 13, 2023

is there any dummy guide to get started with any of these?

Tepix · on June 13, 2023

Have you tried the Quick start in the https://github.com/ggerganov/whisper.cpp README?

m3affan · on June 13, 2023

This is an impresdive use case

moneywoes · on June 13, 2023

Is it possible to run on apple m1 devices or mobile phones or not yet?

michelb · on June 13, 2023

I can recommend the MacWhisper app if you prefer a gui.

Void_ · on June 13, 2023

And Whisper Memos for iOS https://whispermemos.com/

b33f · on June 13, 2023

The really nice part of Whisper is being able to use it offline and on-device, it seems whisper memos is uploading your audio and notes to a server of unknown security, confidentiality etc.

I like Aiko for on-device transcription both in macOS and iOS https://apps.apple.com/us/app/aiko/id1672085276

Void_ · on June 13, 2023

Whisper Memos uses OpenAI API. The upside is that it uses the largest model - that would take 2GB on your iPhone.

raihansaputra · on June 13, 2023

yeah the whisper.cpp github page has a demo for both. Have used it on my M1 MBA for the past few months.

shon · on June 13, 2023

Nice to see Georgi has started a company:

https://twitter.com/ggerganov/status/1666120568993730561?s=4...

Godspeed

fnands · on June 13, 2023

Nice. He's obviously a talented engineer who's struck a nerve with the whisper.cpp/llama.cpp projects, so hope he has success with whatever he plans to do.

csmpltn · on June 13, 2023

A lot of work going into refactoring proprietary code that can be randomly deprecated and outcompeted without any prior notice by any number of large competitors... problematic business model, in my opinion.

logicchains · on June 13, 2023

He's building a library, ggml; it's generally not hard to add support for new models. For instance llama.ccp already supports the Falcon 7B model (different architecture to llama). And given how politicised AI has become, there's unlikely to be many companies releasing weights for models competitive with the current models (e.g. LLaMA 65B). They may have private models that are better, like GPT3.5 and GPT4, but you can't run these on your own server so they're not competing with ggml.

csmpltn · on June 13, 2023

> "there's unlikely to be many companies releasing weights for models competitive with the current models"

We're at the very dawn of this technology going mainstream, and you're saying that it's unlikely for new players to release new, competing and incompatible models?

underdeserver · on June 13, 2023

Can someone ELI5 why AMD is not in this game? Is it really so much harder to implement this in a non-platform-specific library?

captainbland · on June 13, 2023

CUDA is the best supported solution, tends to get you access to the best performance, has a great profiler (it will literally tell you things like "your memory accesses don't seem to be coalesced properly" or "your kernel is ALU limited" as well as a bunch of useful stats), even works in windows, all of that.

OpenCL is (was?) the main open alternative to CUDA and was mainly backed by AMD and Apple. Apple got bored of it when they decided Metal was the future. AMD got bored of it when they developed rocm and HIP (basically a partially complete compatibility layer with CUDA).

There's also stuff like DirectML which only works on windows and e.g. various (Vulkan, directx etc.) compute shaders which are really more oriented at games.

There's also a bit of a performance aspect to it. Obviously GPGPU stuff is massively performance sensitive and CUDA gets you the best performance on the most widely supported platform.

AMD are also the main competitor in the space but all but totally dropped support for GPGPU in desktop cards, trying to instead focus on gaming for Radeon and compute in MI/CDNA. They seem to have realised their mistake a bit and are now introducing some support for RDNA2+ cards.

sorenjan · on June 13, 2023

If I remember correctly it's not just that AMD has poor support for their consumer cards, their Rocm code doesn't compile to a device agnostic intermediate, so you have to recompile for each chip. New and old Cuda compatible cards (like all Geforce cards) can run your already shipped Cuda code, as long as it doesn't use new and unsupported features. So even if AMD had supported more cards, the development and user experience would be much worse anyway where you have to find out if your specific card is supported.

slavik81 · on June 13, 2023

All RDNA2 GPUs and many Vega GPUs could use the same ISA (modulo bugs). Long ago, there was an assumption made that it was safer to treat each minor revision of the GPU hardware as having its own distinct ISA, so that if a hardware bug were found it could be addressed by the compiler without affecting the code generation for other hardware. In practice, this resulted in all but the flagship GPUs being ignored as libraries only ended up getting built for those GPUs in the official binary releases. And in source releases of the libraries, needlessly specific #ifdefs frequently broke compilation for all but the flagship ISAs.

There was an implicit assumption that just building for more ISAs was no big deal. That assumption was wrong, but the good news is that big improvements to compatibility can be made even for existing hardware just by more thoughtful handling of the GFX ISAs.

If you know what you're doing, it's possible to run ROCm on nearly all AMD GPUs. As I've been packaging the ROCm libraries for Debian, I've been enabling support for more hardware. Most GFX9 and GFX10 AMD GPUs should be supported in packages on Debian Experimental in the upcoming days. That said, it will need extensive testing on a wide variety of hardware before it's ready for general use. And we still have lots more libraries to package before all the apps that people care about will run on Debian.

jahav · on June 13, 2023

True, it's better just to use OpenSYCL that stores intermediate device agnostic form and complies it as needed to specific card.

I don't understand why isn't SYCL more widely used.

SubjectToChange · on June 15, 2023

SYCL is still very early on in development and I don’t see it really picking up until support is up-streamed to the llvm project, at the very least. That said, I am a firm believer in the single source philosophy. There just isn’t a tractable alternative.

SubjectToChange · on June 15, 2023

“Apple got bored of it when they decided Metal was the future. AMD got bored of it when they developed rocm and HIP”

Apple backed OpenCL because they needed an alternative after their divorce with Nvidia. No one was going to target an AMD alternative when they had such a trivial market share, so it had to be an open standard. Initially this arrangement was highly productive and OpenCL 1.x enjoyed terrific success. Vendors across compute markets piled support behind OpenCL and many even started actively participating in it. However this success is what precipitated the disastrous OpenCL 2.x series. In other words, OpenCL 2.x was far too revolutionary for many and far too conservative for others. What followed was Apple pulling out to pursue Metal, AMD having shoddy drivers, Nvidia all but ignoring it, and mobile chip vendors basically sticking to 1.2 and nothing more. Eventually this deadlock was fixed after OpenCL 3.0 walked back the changes of 2.x, but this was in large part because the backers of 2.x moved to SYCL.

As for AMD, OpenCL was a tremendous boon when it was first introduced. At least initially it gave them a fighting chance against CUDA. But it was never realistic for OpenCL to be a complete CUDA alternative. I mean, any standard that is basically “everything in CUDA and possibly more” is a standard no one could afford or bother to implement. ROCm and HIP are basically AMD using an API people are already familiar with software underneath to play to the strengths of their hardware.

“AMD are also the main competitor in the space but all but totally dropped support for GPGPU in desktop cards, trying to instead focus on gaming for Radeon and compute in MI/CDNA.”

Keep in mind that AMD has been under intense pressure to deliver world class HPC systems, and they managed to do so with ORNL Frontier. I don’t blame them for being selective with ROCm support because most of those product lines were in flight before ROCm started development in earnest. That said, Nvidia is obviously the clear leader for hardware support, as therefore the safest option for desktop users.

znpy · on June 13, 2023

> partially complete compatibility layer

So… partial compatibility layer?

captainbland · on June 13, 2023

Is your point that "partially complete" is a redundant phrasing?

In this case I still prefer my version. I feel that it puts greater emphasis on the fact that it can potentially be complete, given the massive value that could give to the project.

Also "I would have written you a shorter letter but I did not have the time" sentiment springs to mind.

throwaway888abc · on June 13, 2023

Thanks

WithinReason · on June 13, 2023

Short version: AMD's software incompetence. Very few hardware companies have the competence to properly support their HW. You see this problem again and again, HW companies designing HW that's great on paper but can't be used properly because it's not properly supported with SW. Nvidia understands this and has 10 times as many SW engineers than HW engineers. AMD doesn't. Intel might too.

SilverBirch · on June 13, 2023

I think you're totally right with this, just to add - it's often possible to do a lot of things that are neat in hardware but create difficult problems in software. Virtually always when this happens it turns out to be nearly impossible to actually create software that takes advantage of it. So it's massively important to have a closed loop feedback system between the software and hardware so that the hardware guys don't accidentally tie the software up in knots. This is common in companies that consider themselves hardware companies first.

davidgl · on June 13, 2023

Examples being the PS3 cell architecture and HP's Itanium chips

pjc50 · on June 13, 2023

Strongly agree. There's a surprising cultural difference between the two. As a software engineer in a different hardware company, I can see where the fault lines are, and it takes continual management effort to make it work properly.

(I note that if we had an "open" GPU architecture in the same way that we have CPU architectures, things might be a lot better, but the openness of the IBM PC seems to be a historical accident that no company will allow again)

SubjectToChange · on June 15, 2023

Chalking it up to “software incompetence” is a bit simplistic, to say the least. AMD was on the brink of bankruptcy not too long ago and their GPU division was struggling to even trend water. They didn’t have an alternative to CUDA because they couldn’t afford one and no one would use it anyway, OpenCL stagnated because most vendors didn’t want to implement functionality that only the biggest players wanted, and their graphics division had to pivot from optimizing for gaming (where they could sell) to optimizing for compute as well.

Now that AMD has the capital they are playing catch up to Nvidia. But it’s going to take time for their software to improve. Hiring at boat load of programmers all at once isn’t going to solve that.

WithinReason · on June 15, 2023

It's been a while since AMD was on the brink of bankruptcy, they had enough time to do something about compute and yet it's still not usable, see the George Hotz rant. OpenCL stagnated because 2.0 added mandatory features that Nvidia didn't support so it never got adopted by the biggest player.

lhl · on June 13, 2023

llama.cpp can be run with a speedup for AMD GPUs when compiled with `LLAMA_CLBLAST=1` and there is also a HIPified fork [1] being worked on by a community contributor. The other week I was poking on how hard it would be to get an AMD card running w/ acceleration on Linux and was pleasantly surprised, it wasn't too bad: https://mostlyobvious.org/?link=/Reference%2FSoftware%2FGene...

That being said, it's important to note that ROCm is Linux only. Not only that, but ROCm's GPU support has actually been decreasing over the past few years. The current list: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... Previously (2022): https://docs.amd.com/bundle/Hardware_and_Software_Reference_...

The ELI5 is that a few years back, AMD split their graphics (RDNA) and compute (CDNA) architectures, which Nvidia does too, but notably (what Nvidia definitely doesn't do, and a key to their success IMO) AMD also decided they would simply not support any CUDA-parity compute features on Windows or their non "compute" cards. In practice, this means that community/open-source developers will never have, tinker, port, or develop on AMD hardware, while on Nvidia you can start with a GTX/RTX card on your laptop, and use the same code up to an H100 or DGX.

llama.cpp is a super-high profile project, has almost 200 contributiors now, but AFAIK, no contributors from AMD. If AMD doesn't have the manpower, IMO they should simply be sending nsa free hardware to top open source project/library developers (and on the software side, their #1 priority should be making sure every single current GPU they sell is at least "enabled" if not "supported" in ROCm, on Linux and Windows).

[1] https://github.com/SlyEcho/llama.cpp/tree/hipblas

Kelteseth · on June 13, 2023

I just tried this [1] and it still uses my CPU even though the prompt says otherwise.

[1] https://github.com/ggerganov/llama.cpp/issues/1433#issuecomm...

lhl · on June 13, 2023

I saw there was an answer already in your issue, although you plan on doing a lot of inferencing on your GPU, I'd highly recommend you consider dual-booting into Linux. It turns out exllama merged ROCm support last week and more than 2X faster than the CLBlast code. A 13b gptq model at full context clocks in at 15t/s on my old Radeon VII. (Rumor has it that ROCm 5.6 may add Windows support, although it remains to be seen what that exactly entails.)

Kelteseth · on June 13, 2023

So it now uses the GPU after some help, but it is not that much faster on my Vega VII than on my 5950x 16 core cpu :/

born-jre · on June 13, 2023

short answer is they have somewhat competent hardware but software sucks or you can watch george hotz rant about how amd driver sucks

https://www.youtube.com/watch?v=Mr0rWJhv9jU

alecco · on June 13, 2023

He got a tarball fix for the driver after his rant got viral. Still not looking good IMHO.

https://geohot.github.io/blog/jekyll/update/2023/06/07/a-div...

xiphias2 · on June 13, 2023

It looks like some great engineers inside the company fighting the burocracy.

jandrese · on June 13, 2023

> So it fixed the main reported issue! Sadly, they asked me not to distribute it, and gave no more details on what the issue is.

I think they missed the thrust of the rant.

HellDunkel · on June 13, 2023

he goes from building a ml rack out of ati graphics cards to coding some python to recommending reading the unabomber manifesto, from marx to saying he owns a rolls royce... lord, please have mercy!

marcyb5st · on June 13, 2023

I think there are several reasons.

Firstly, nVidia has been at it much longer. Just because of this tools on nVidia side feel easier to set up / are more polished (at least that was my feeling when fiddling with ROCm like a year ago).

Second but still related to #1, from the beginning even consumers nVidia cards were able to run CUDA and this made so that hobbyist and prosumers/researchers on a budget bought nVidia cards compounding even further the time/tooling advantage nVidia had. I.e. a huge user base of not only gamers but people that use their cards to do other things than gaming and know that things work on these cards.

These are, IMHO, the main reasons why everyone targets CUDA and explain why frameworks like Tensorflow or Pytorch targeted it as a first class citizen.

rapsey · on June 13, 2023

If AMD was software competent the frameworks would support their drivers just as well. No one wants a monopoly.

marcyb5st · on June 13, 2023

Agreed. Sorry if I gave the impression of being pro-nVidia. I am not.

But the reality is that when Tensorflow and Pytorch came to be there was no alternative. Now you need to jump through hoops to make it work with non CUDA hardware.

Additionally, while drivers play a role, I think the main difference is in the computing libraries (CUDA vs ROCm)

sorenjan · on June 13, 2023

I'm hopeful for SYCL [0] to become the cross platform alternative, but there doesn't seem to be a lot of uptake from projects like this, so maybe my hope is misplaced. It's an official Khronos standard, and Intel seems to like it [1], but neither of those things are enough to change things.

Someone who knows about this space that can comment on the likelihood that SYCL will be a good option eventually? Cross platform and cross vendor compatibility would be really nice, and not supporting the proprietary de facto standard would also be a bonus as long as the alternative works well enough.

[0] https://www.khronos.org/sycl/

[1] https://spec.oneapi.io/versions/latest/elements/sycl/source/...

mschuetz · on June 13, 2023

> It's an official Khronos standard

I think that's the problem. Khronos isn't known for good UX, and being from Khronos is exactly the reason why I'm not even bothering to check it out. I want an alternative to CUDA, but I also want it to be as easy to use as CUDA.

LoganDark · on June 13, 2023

Being from Khronos is also a reason why it might actually be usable in a decade's time, like Vulkan.

(Vulkan is from 2015 and is just recently starting to become usable.)

sicariusnoctis · on June 17, 2023

From a non-expert's standpoint, Vulkan feels quite unusable and complex.

mschuetz · on June 20, 2023

From an expert's standpoint, it's still quite unusable and complex.

bilekas · on June 13, 2023

I'm no expert but if I understand correctly the CUDA cores are the main pull and the API to them.

They're supposed to be more optimized and more stable compared to AMD. That's how it was before anyway, not sure today.

Aardwolf · on June 13, 2023

Isn't the main component for AI matrix multiplication? What makes it so hard to create a good alternative API for matrix multiplication?

dotnet00 · on June 13, 2023

It's a lot more complicated than just writing a matrix multiplication kernel because there are all sorts of operations you need to have on top of matrix multiplication (non linearities, various ways of manipulating the data) and this sort of effort is only really worthwhile if it's well optimized.

On top of that, AMD's compute stack is fairly immature, their OpenCL support is buggy and ROCm compiles device specific code, so it has very limited hardware support and is kind of unrealistic to distribute compiled binaries for. Then, getting to the optimization aspect, NVIDIA has many tools which provide detailed information on the GPU's behavior, making it much easier to identify bottlenecks and optimize. AMD is still working on these.

Finally, NVIDIA went out of its way to support ML applications. They provide a lot of their own tooling to make using them easier. AMD seems to have struggled on the "easier" part.

bilekas · on June 13, 2023

Well I think there are 2 types right ? Tensor cores (which afaik AMD dont have) which are better for matrix ops, and CUDO which are better for general parallel ops.

Maybe someone more clever than me can go into the specifics, I only understand the minimum of the low lvl GPU details.

Nice high lvl document

[0] https://www.acecloudhosting.com/blog/cuda-cores-vs-tensor-co...

marcyb5st · on June 13, 2023

I think API for matrix multiplication is just a part of the issue. CUDA tooling has better ergonomics, it's easier to set up and treated as first class citizen in tools like Tensorflow and Pytorch.

So, while I can't talk about the hardware differences in detail, developer experience is greatly on nVidia side and now AMD has a moat to overcome to catch up.

emmender · on June 13, 2023

there is nccl, gpudirect, nvlink and so on and so forth.. It is not just matmul on gpus.

lofaszvanitt · on June 13, 2023

Planned economy, planned who does what.

vid · on June 13, 2023

I'm a bit surprised by the numbers. It's "only" a 2× speedup on a relatively top-end card (4090)? And you can only use one CPU core. With 16+ core CPUs becoming normal and 128GB+ RAM being cheap, that seems like leaving a lot on the table.

[edit] realized it's relative to the merged partial CUDA acceleration, so the speedup is more impressive, but still surprised by the core usage.

LoganDark · on June 13, 2023

> And you can only use one CPU core.

because the core's job is solely to direct the GPU, which is doing all of the work.

vid · on June 13, 2023

To the replies; I think one feature of llama.cpp is it can handle models with more RAM than VRAM provides, this is where I would think more cores would be useful.

eurekin · on June 13, 2023

That cheap ram is about 10x slower than the VRAM. Didn't see any actual figures for latency, but there must be a reason, why newer gpus have memory chips on both sides of the PCB, as close to the GPU as possible

shgidigo · on June 13, 2023

Excuse me for my ignorance, but can someone explain why the Llama.cpp is so popular? isn't it possible to port the pytroch lama to any environment using onnx or something?

jmiskovic · on June 13, 2023

You can run it on RPi or any old hardware, only limited by the RAM and your patience. It is a lean code base easy to get up and running, and designed to be interfaced from any app without sacrificing performance.

They are also innovating (or at least implementing innovations from papers) different ways to fit bigger models in consumer HW, making them run faster and with better outputs.

Pytorch and other libs (bitsandbytes) can be horrible to setup with correct versions, and updating the repo is painful. PyTorch projects require a hefty GPU or enormous CPU+RAM resources, while llama.cpp is flexible enough to use GPU but doesn't require it and runs smaller models well on any laptop.

ONNX is a generalized ML platform for researchers to create new models with ease. Once your model is proven to work, there are many optimizations left on the table. At least for distributing an application that relies on LLM it would be easier to add llama.cpp than ONNX.

brucethemoose2 · on June 13, 2023

In Stable Diffusion land, onnx performance was not that great compared to ML compilers and some other implementations (at least when I tried it).

Also, llama.cpp is excellent at splitting the workload between CPU and accelerators since its so CPU focused. You can run 13B or 33B in a 6GB GPU and still get some acceleration.

Also, as said above, quantization. That is make or break. There is no reason to run a 7B model at fp16 when you can run 13B or 30B in the same memory pool at 2-5 bits.

v3ss0n · on June 13, 2023

Should be similar performance but gglm guy did it in what he knows best and biggest selling point is single binary

ianpurton · on June 13, 2023

ONNX doesn't support the same level of quantization as GGML.

So basically GGML will run on hardware with less memory.

regularfry · on June 13, 2023

Or alternatively, bigger models with the same memory (just quantised harder).

ykonstant · on June 13, 2023

Python Torture Chamber is, of course, an eminently viable tool, but I gather some people prefer a more streamlined toolchain like that of C. Or COBOL.

naasking · on June 13, 2023

Llama.cpp runs better than pytorch on a much wider variety of hardware, including mobile phones, Raspberry Pis and more.

getcrunk · on June 13, 2023

Anyone get performance numbers for other 30 series cards? 3060 12gb?

I’m curious how it compares to his apple silicon numbers

lhl · on June 13, 2023

Not a 30 series, but on my 4090 I'm getting 32.0 tokens/s on a 13b q4_0 model (uses about 10GiB of VRAM) w/ full context (`-ngl 99 -n 2048 --ignore-eos` to force all layers on GPU memory and to do a full 2048 context). As a point of reference, currently exllama [1] runs a 4-bit GPTQ of the same 13b model at 83.5 tokens/s.

[1] https://github.com/turboderp/exllama

speed_spread · on June 13, 2023

Also, someone please let me use my 3050 4GB for something else than stable diffusion generation of silly thumbnail-sized pics. I'd be happy with an LLM that's specialized in insults and car analogies.