The idea that this is a drop-in replacement for numpy (e.g., `import cupy as np`) is quite nice, though I've gotten similar benefit out of using `pytorch` for this purpose. It's a very popular and well-supported library with a syntax that's similar to numpy's.
However, the AMD-GPU compatibility for CuPy is quite an attractive feature.
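To make the drop-in idea concrete, here's a minimal sketch (assuming CuPy and a supported GPU are installed; the fallback import keeps the same code running on CPU):

```python
# The same function runs on CPU or GPU depending only on which
# module is bound to the `np` name.
try:
    import cupy as np   # GPU path, if CuPy and a compatible GPU are present
except ImportError:
    import numpy as np  # CPU fallback

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print(softmax(np.asarray([1.0, 2.0, 3.0])))
```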
So it's possible to write array API code that consumes arrays from any of those libraries and delegates computation to them, without having to explicitly import any of them in your source code.
The only limitation for now is that PyTorch's array API compliance (and, to a lesser extent, CuPy's) is still incomplete, so in practice one needs to go through a compatibility layer (hopefully only temporarily); see the sketch below.
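Presumably the layer in question is `array-api-compat`; a minimal sketch, assuming that package is installed:

```python
# Backend-agnostic code via array-api-compat: `standardize` works on
# numpy, cupy, or torch arrays without importing any of them here.
from array_api_compat import array_namespace

def standardize(x):
    xp = array_namespace(x)             # namespace of whatever produced x
    return (x - xp.mean(x)) / xp.std(x)
```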
It's interesting to see hardware/software/API co-development in practice again.
The last time I think this happened at market scale was with the early 3D accelerator APIs: Glide/OpenGL/DirectX. Which has been a minute! (And, to a lesser extent, CPU vectorization extensions.)
Curious how much of Nvidia's successful strategy was driven by people who were there during that period.
Powerful first mover flywheel: build high performing hardware that allows you to define an API -> people write useful software that targets your API, because you have the highest performance -> GOTO 10 (because now more software is standardized on your API, so you can build even more performant hardware to optimize its operations)
An excellent example of Array API usage can be found in scikit-learn: estimators written against NumPy can now run on various backends, courtesy of Array API compatible libraries such as CuPy and PyTorch.
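For instance (a sketch assuming a recent scikit-learn with array API support, a CUDA machine, and an estimator that supports dispatch; coverage varies by version):

```python
import cupy as cp
import sklearn
from sklearn.decomposition import PCA

# Opt in to array API dispatch so estimators operate on non-numpy arrays.
sklearn.set_config(array_api_dispatch=True)

X = cp.random.rand(1000, 50)                  # data already lives on the GPU
pca = PCA(n_components=2, svd_solver="full")
X_2d = pca.fit_transform(X)                   # stays a cupy array end to end
```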
And of course the native Python solution is memoryview. If you need to inter-operate with libraries like numpy but cannot import numpy, use memoryview. It is designed for fast low-level buffer access, which is why it has more C documentation than Python documentation: https://docs.python.org/3/c-api/memoryview.html
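A tiny illustration of the pure-Python side:

```python
# memoryview gives zero-copy, sliceable access to any buffer-protocol object.
data = bytearray(b"hello world")
view = memoryview(data)
view[0:5] = b"HELLO"             # writes through to `data` without copying
print(data)                      # bytearray(b'HELLO world')
print(view.format, view.nbytes)  # 'B' 11
```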
One could also `import jax.numpy as jnp`. All those libraries have more or less complete implementations of numpy and scipy functionality (I believe CuPy covers the most functions, especially when it comes to scipy).
Also: you can just mix and match all those functions and tensors thanks to the `__cuda_array_interface__` protocol.
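A sketch of that zero-copy interop, assuming CUDA builds of both libraries:

```python
import cupy as cp
import torch

a = cp.arange(10, dtype=cp.float32)    # CuPy array in GPU memory
t = torch.as_tensor(a, device="cuda")  # wraps the same memory via
                                       # __cuda_array_interface__, no copy
t += 1                                 # the change is visible from CuPy too
print(a)                               # [ 1.  2. ... 10.]
```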
Indeed, has anyone so far successfully drop-in replaced numpy with cupy in a project and achieved massive improvements? Because, you know, when dealing with a GPU it is very important to actually understand how data flows back and forth to it, not only the algorithmic nature of the code written.
As a side note, it is funny how this gets released in 2024 and not in, say, 2014...
Oh yes, I've personally used CuPy for great speed-ups compared to NumPy in radar signal processing, taking code that took 30 seconds with NumPy down to 1 second with CuPy. The code basically performed a bunch of math on about 100 MB of data, so the PCIe bottleneck was not a big issue.
Also, CuPy was first released in 2015; this post is just a reminder that such things exist.
Yeah, the data managed by cupy generally stays on the GPU, and you can control when you pull it out pretty straightforwardly. It's great if most of your work happens in a small number of standard operations, like matrix operations or Fourier transforms, the sort of thing cupy provides for you. You can get custom kernels running through cupy, but at some point it's easier to just write C/C++.
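In practice that control looks something like this (a sketch):

```python
import cupy as cp

x = cp.random.rand(4096, 4096)   # allocated directly on the GPU
y = cp.fft.fft2(x)               # chains of operations stay on the device
mag = cp.abs(y).mean()           # still on the GPU
result = cp.asnumpy(mag)         # the single explicit device-to-host copy
```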
As nice as it is to have a drop in replacement, most of the cost of GPU computing is moving memory around. Wouldn’t be surprised if this catches unsuspecting programmers in a few performance traps.
The moving-data-around cost is conventional wisdom in GP-GPU circles.
Is it changing though? Not only do PCIe interfaces keep doubling in performance, but CPU-GPU memory coherence is a thing.
I guess it depends on your target: 8x H100s across a PCIe bridge is going to have quite different costs vs an APU (APUs have gotten quite powerful, not even mentioning the MI300A).
Exactly my experience. You end up having to juggle a whole different set of requirements and design factors in addition to whatever it is that you’re already doing. Usually after a while the results are worth it, but I found the “drop-in” idea to be slightly misleading. Just because the API is the same does not make it a drop-in replacement.
It only supports AMD cards supported by ROCm, which is quite a limited set.
I know you can enable ROCm for other hardware as well, but it's not supported and quite hit or miss. I've had limited success with running stuff against ROCm on unsupported cards, mainly having issues with memory management IIRC.
When I packaged the ROCm libraries that shipped in the Ubuntu 24.04 universe repository, I built and tested them with almost every discrete AMD GPU architecture from Vega to CDNA 2 and RDNA 3 (plus a few APUs). None of that is officially supported by AMD, but it is supported by me on a volunteer basis (for whatever that is worth).
I think that every library required to build cupy is available in the universe repositories, though I've never tried building it myself.
Yes. The primary difference in the support matrix is that all discrete RDNA 1 and RDNA 2 GPUs are enabled in the Debian packages [1]. There is also Fiji / Polaris support enabled in the Debian packages, although there are a lot of bugs with those.
I'm supposed to finish my undergraduate degree with an internship at the Italian national research center, where I'll have to use PyTorch to turn ML models from papers into code. I've tried looking at the tutorial, but I feel like there's a lot going on to grasp. Until now I've only used NumPy (and pandas in combination with NumPy). I'm quite excited, but I'm a bit on edge because I can't know whether I'll be up to the task or not.
You'll do fine :) PyTorch has an API that is somewhat similar to numpy, although if you've never programmed a GPU you might want to get up to speed on that first.
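A few lines to show how numpy-like it feels (a sketch; the device move at the end is the main new concept):

```python
import torch

x = torch.linspace(0.0, 1.0, steps=5)  # cf. np.linspace
y = x ** 2 + 1                          # elementwise math, numpy-style
print(y.mean(), y.sum())                # reductions read like numpy

if torch.cuda.is_available():           # the GPU part is the new bit
    y = y.to("cuda")
```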