PyTorch for WebGPU (praeclarum.org)
313 points by mighdoll on May 19, 2023 | 74 comments


I'm excited about this for probably different reasons than most: I think TypeScript could be a more ergonomic way to develop ML models than Python, because you can automatically infer and check tensor dimensions while you are writing code! Compare this to the mess of comments you usually see in PyTorch code telling you that x is of shape [x, y, z].

  // An empty 3x4 matrix
  const tensorA = tensor([3, 4])
  
  // An empty 4x5 matrix
  const tensorB = tensor([4, 5])

  const good = multiplyMatrix(tensorA, tensorB);
        ^
        Inferred type is Tensor<readonly [3, 5]>
  
  const bad = multiplyMatrix(tensorB, tensorA);
                             ^^^^^^^
                             Argument of type 'Tensor<readonly [4, 5]>' is not 
                             assignable to parameter of type '[never, "Differing 
                             types", 3 | 5]'.(2345)
I prototyped this for PotatoGPT [1] and some kind stranger on the internet wrote up a more extensive take [2]. You can play with an early version on the TypeScript playground here [3] (uses a Twitter shortlink for brevity).

[1] https://github.com/newhouseb/potatogpt

[2] https://sebinsua.com/type-safe-tensors

[3] https://t.co/gUzzTl4AAN
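
To make this concrete, here is a minimal sketch of how the shapes can ride along in the types (this is not the actual PotatoGPT implementation, whose error-reporting trick is fancier; everything here is just illustrative):

  // Shapes are tuples of number literals, captured via a const type parameter
  // (requires TypeScript 5.0+).
  type Shape = readonly number[];

  interface Tensor<S extends Shape> {
    readonly shape: S;
    readonly data: Float32Array;
  }

  function tensor<const S extends Shape>(shape: S): Tensor<S> {
    const size = shape.reduce((a, b) => a * b, 1);
    return { shape, data: new Float32Array(size) };
  }

  // Matrix multiply: the inner dimensions must unify, and the result type is
  // built from the outer dimensions. (Only the shape logic is shown; the
  // actual arithmetic is elided.)
  function multiplyMatrix<R extends number, K extends number, C extends number>(
    a: Tensor<readonly [R, K]>,
    b: Tensor<readonly [K, C]>
  ): Tensor<readonly [R, C]> {
    return tensor([a.shape[0], b.shape[1]] as const);
  }
With that, tensor([3, 4]) infers as Tensor<readonly [3, 4]>, and swapping the arguments to multiplyMatrix fails to type-check, much as shown above.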


That work looks really interesting! I am also excited about type safety when it comes to tensors. My understanding was that this type-safe approach to tensor shapes had run into issues because it was difficult (maybe even impossible?) to reason about the shape of some common operators at compile time. But perhaps those operators are not really necessary. [0]

Some sort of typed 'named tensor' that could be combined with einsum notation at runtime would be awesome, e.g. (I don't really know TS/JS well, so this is pseudocode):

  import * as t from 'pytorch'
  import { nn } from 'pytorch'

  const tensorA: Tensor<[Batch, Seq, Emb]> = t.randn([10, 10, 10]) // initialize tensor
  const transformLayer = nn.Einsum([Batch, Seq, Emb], [Emb], [Batch, Seq]) // (Batch, Seq, Emb), (Emb) -> (Batch, Seq)

  const tensorB: Tensor<[Emb2]> = t.randn([20])

  const transformedOutput = transformLayer(tensorA, tensorB) // type error: Emb2 does not match Emb

[0]: https://github.com/pytorch/pytorch/issues/26889


This is a great thread, thanks! Somehow I missed it when looking for prior art.

When I initially started implementing this I was hung up on similar concerns. For example, in GPT-2/PotatoGPT the MLP layer is 4x the width of the residual stream. I went down a rabbit hole of addition and multiplication in TypeScript types (the type system is Turing complete, so it's technically possible!) and, after crashing my TS language server a bunch, switched tactics.

Where I ended up was to use symbolic equivalence, which turned out to be more ergonomic anyway, i.e.

  type Multiply<A extends number, B extends number> = 
    number & { label: `${A} * ${B}` }
  const Multiply = <A extends number, B extends number>(a: A, b: B) => 
    a * b as Multiply<A, B>;
such that

  tensor([
    params.EmbeddingDimensions, // This is a literal with known size
    Multiply(4, params.EmbeddingDimensions)] as const)
is inferred as

  Tensor<readonly [768, Multiply<4, 768>]>
Notably, switching to a more symbolic approach makes it easier to type-check dimensions that can change at runtime, so something like:

  tensor([Var(tokens.length, 'Sequence Length'), 
          Multiply(4, Var(tokens.length, 'Sequence Length'))])
infers as

  Tensor<readonly [
     Var<'Sequence Length'>, 
     Multiply<4, Var<'Sequence Length'>>]> 
And you'll get all the same correctness constraints that you would if these were known dimensions.
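
Here Var is a labeled number in the same spirit as Multiply, i.e. something like:

  // Sketch of Var: a runtime value branded with a symbolic label, mirroring
  // the Multiply definition above (the real definition may differ slightly).
  type Var<Label extends string> = number & { label: Label }
  const Var = <Label extends string>(value: number, label: Label) =>
    value as Var<Label>;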

The downside to this approach is that typescript won't know that Multiply<4, Var<'A'>> is equivalent to Multiply<Var<'A'>, 4> but in practice I haven't found this to be a problem.

Finally, on more complicated operators/functions that compose dimensions from different variables, TypeScript is also very capable, albeit not the most ergonomic. You can check my code for matrix multiplication and Seb's writeup for another example (a zip function).


Out of curiosity, how do you handle cases where the output shape is input dependent (as opposed to only dependent on input shapes)? This ranges from `torch.sum(tensor, dim)` where dim might be nonconstant, to `torch.nonzero(x)`, and of course advanced indexing.


Another thing that TS does nicely is object handling in general: dot access for object attributes, object destructuring, typed objects for function options. In most ML projects I see a bunch of functions that look like:

    def my_fn(x, **kwargs):
       ...
       return y_1, y_2, y_3
Which is a pain because kwargs could be anything, really, and now every call site has to expect exactly 3 return values while knowing their order; there's no way of adding an extra return value without changing every caller. In TypeScript the same function could look like:

    function myFn(x, options = { someOption: 1 }) {
       ...
       return { y_1, y_2, y_3 };
    }
Which is so much nicer because everything is typed with all types inferred automatically! And you don't burden the call sites with values they don't need:

    const { y_1 } = myFn(x, { someOption: 1 });
In Python, everyone mostly passes unbundled arguments through every function, and changing anything involves threading these untyped arguments through a bunch of untyped call sites. It's not the end of the world, but we can do better...


Python also has pattern matching on dicts and typed kwargs these days. It seems that the only thing missing is syntactic sugar for unconditional destructuring.


Yes! It's getting close, but we are still far from things being convenient and widely adopted


I’m of the same opinion. While I think I will keep the standard parameter order from torch, I will include the options overload to give all the benefits you describe.


Awesome :D Really nice project by the way


Without multidimensional array slicing or operator overloading it seems like Typescript could never be anywhere near as ergonomic as Python for ML, despite its other advantages.


What's the advantage of those "ergonomics" if you have to memorize all the quirks? With a language like Typescript, all those operations become explicit instead of implicit, letting you take full advantage of your IDE with autocomplete, documentation, and compile-time warnings. Python sacrifices all of those just to save a few keystrokes.


What is implicit about either feature, and what difference do they make from the IDE perspective assuming equivalent type annotations in both languages?


"Assuming equivalent type annotations" is the problem. Can't do it with Python, full stop. If we could, we wouldn't be having this conversation at all! It can't catch any mistakes because its type system is simply not expressive enough. You have to hold the type information in your head and make sure you slice and multiply correctly.


Those are niceties and can be implemented with some small hacks. Most big nets do very little slicing. Lots of dimension permutations (transpose, reshape, and friends) but less slicing. I personally use a lot of slicing so will do my best to support a clean syntax.
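
Not committing to anything, but a clean slice syntax without index overloading might look something like this (purely a hypothetical sketch, not the current API):

  // Hypothetical sketch only: strings stand in for Python's ':' and objects
  // carry start/end/step.
  interface Tensor { readonly shape: readonly number[]; }
  type SliceSpec = number | ':' | { start?: number; end?: number; step?: number };

  declare function slice(t: Tensor, ...specs: SliceSpec[]): Tensor;

  // Python's x[:, 0, 2:10:2] might become:
  // const y = slice(x, ':', 0, { start: 2, end: 10, step: 2 });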


I've come to believe over the last few years that slicing is one of the most critical parts of a good ML array framework, and I've used it heavily. PyTorch, if I understand correctly, still doesn't get some forms of slice assignment and the handling of slice objects right (please correct me if I'm wrong), though it is leagues better than TensorFlow was.

I've written a lot of dataloader and similar code over the years, and the slicing was probably the most important (and most hair-pulling) part for me. I've really debated writing my own wrapper at some point (if it is indeed worth the effort) just to keep my sanity, even if it is at the expense of some speed.


I disagree with this; slice notation is powerful and I use it quite a bit in DL.

Even just the [:, None] trick replacing unsqueeze is super useful for me.


That’s a good point, but I think python will be much more feasible because of operator overloading:

(x+y)*z/3

vs

x.add(y).mul(z).div(3)

And that’s just a really simple example.

I'm also hopeful that Python's new variadic generic types will make progress here.


It seems that many agree with this. At the risk of getting downvoted I want to share an opposing opinion:

This way of thinking is not just unhelpful but even harmful. If one would often benefit from these checks while coding, then they should not be relying on a type checker. They should be thinking more, and writing comments is a great way to do that.

This is especially true because many operations on ndarrays / tensors can yield perfectly valid shapes with completely unintended consequences. When comments are written reasonably well they help avoid these difficult-to-debug, correct-output-shape-but-unintended-result mistakes. Not to mention the additional clear benefit of helping one quickly re-understand the tensor manipulations when coming back to the code weeks or months later.

And more generally, if one can get in the habit of writing these comments before the code, it can help push them away from the write-quickly-now-debug-later mentality. I have seen this bite folks many times, both while teaching ugrad + grad courses and while working at large tech companies.


Where do you draw the line? Is type checking in any domain harmful because it acts as a crutch for your mental model of how your code works? One could similarly extrapolate this to any static analysis in any language.


I really hope that takes off, because you are correct. Python, though, has such a fluid syntax that I'm not sure TS can match it. For example, when you want to sum two NumPy arrays you just need the + operator, while that sort of thing is notoriously unpredictable in JS.


Three.js works just fine with functions like `.add`, though it sure is ugly. It kind of blows the mind that JavaScript has had so many syntactic additions over the years but still has no operator overloading.


I wonder if you couldn't do some operator overloading on the TS side, rewriting the code to get things like tensor addition on tensor types.

Heck, if you are doing that, maybe convert to webgpu automatically as well.

Someone very enterprising might do this in bun using zig.


I think you are absolutely right. It's easy to think you are supposed to use a [x y z] tensor when it expects a [z y x] and you don't find out until runtime.

It would be even better if tensor dims from loaded models could be inferred ahead of time in the editor.


I don't know if you knew but this is how TensorFlow 1 worked. Unfortunately, that was a widely unpopular design choice because it was hard to overload the same function for tensors of different dimensions, among other things.


Interesting, do you have any references or examples? Some brief googling around hasn't found anything like this. The fact that overloading was an issue makes me think that TF1 was doing something different because Typescript generic type parameters allow you to do "overloading" galore (by only specifying constraints rather than enumerating every possible call format).


I believe there is work in progress to get Python type annotations for array/tensor shapes, but it's not a thing yet, indeed.


If you want to do this today you can also use the torch C++ API! It's what PyTorch binds to under the hood.


? I don't think torch C++ supports this.


Dependent types or it's a toy.


Just a little pushback here: I think you strike on the right theme, where a programming language could fill this gap. However, I wonder if new domain-specific languages will eventually be the more elegant solution. Think Modular's Mojo [1] or Meta's KNYFE [2], mentioned earlier this week.

[1] - https://www.modular.com/mojo [2] - https://ai.facebook.com/blog/meta-training-inference-acceler...


It's a great question. I don't really have a horse in this race as long as whatever wins is maximally ergonomic. I think as long as the DSL is Turing complete, such that you could "compute" on tensor shapes, then we win. That said, it's very easy to build a type system that isn't so flexible (see most other languages), so it'd likely have to be a focus of the DSL from the get-go.


Very impressive work. Would be interesting to do some benchmarks versus PyTorch.

On a side note, I'm not sure if it is because I've looked at so many autograd engines by now, but it is really cool to see that after years of different frameworks being developed, most people seem to agree on the concepts and structure for implementing something like this. It is pretty easy to dive into this, even without being particularly skilled in JS/TS.

Wondering how such frameworks will look in a couple years.


Could there be something like emscripten-forge/requests-wasm-polyfill for PyTorch with WebGPU? https://github.com/emscripten-forge/requests-wasm-polyfill

How does the performance of webgpu-torch compare to compiling PyTorch to WASM with emscripten and WebGPU?

tfjs benchmarks: Environment > backend > {WASM, WebGL, CPU, WebGPU, tflite} https://tensorflow.github.io/tfjs/e2e/benchmarks/local-bench... src: https://github.com/tensorflow/tfjs/tree/master/e2e/benchmark...

tensorflow/tfjs https://github.com/tensorflow/tfjs

tfjs-backend-wasm https://github.com/tensorflow/tfjs/tree/master/tfjs-backend-...

tfjs-backend-webgpu https://github.com/tensorflow/tfjs/tree/master/tfjs-backend-...

([...], tflite-support, tflite-micro)

From facebookresearch/shumai (a JS tensor library) https://github.com/facebookresearch/shumai/issues/122 :

> It doesn't make sense to support anything besides WebGPU at this point. WASM + SIMD is around 15-20x slower on my machine[1]. Although WebGL is more widely supported today, it doesn't have the compute features needed for efficient modern ML (transformers etc) and will likely be a deprecated backend for other frameworks when WebGPU comes online.

tensorflow rust has a struct.Tensor: https://tensorflow.github.io/rust/tensorflow/struct.Tensor.h...

"ONNX Runtime merges WebGPU backend" https://github.com/microsoft/onnxruntime https://news.ycombinator.com/item?id=35696031 ... TIL about wonnx: https://github.com/webonnx/wonnx#in-the-browser-using-webgpu...

microsoft/onnxruntime: https://github.com/microsoft/onnxruntime

Apache/arrow has language-portable Tensors for cpp: https://arrow.apache.org/docs/cpp/api/tensor.html and rust: https://docs.rs/arrow/latest/arrow/tensor/struct.Tensor.html and Python: https://arrow.apache.org/docs/python/api/tables.html#tensors https://arrow.apache.org/docs/python/generated/pyarrow.Tenso...

Fwiw it looks like the llama.cpp Tensor is from ggml, for which there are CUDA and OpenCL implementations (but not yet ROCm, or a WebGPU shim for use with emscripten transpilation to WASM): https://github.com/ggerganov/llama.cpp/blob/master/ggml.h

Are there recommendable ways to cast e.g. Arrow Tensors to PyTorch/TensorFlow?

FWIU, Rust has a better compilation story for WASM, and that's probably faster than already-compiled-to-JS/ES TensorFlow + WebGPU.

What's a fair benchmark?


> What's a fair benchmark?

The absolute gold-standard benchmarks are https://github.com/pytorch/benchmark (a diverse set of userland code taken from GitHub as-is and made into benchmarks).


> What's a fair benchmark?

- /? pytorch tensorflow benchmarks webgpu 2023 site:github.com https://www.google.com/search?q=pytorch+tensorflow+benchmark...

- [tfjs benchmarks]

- huggingface/transformers:src/transformers/benchmark https://github.com/huggingface/transformers/tree/main/src/tr...


This is huge! For me the one thing preventing TypeScript from replacing Python is the lack of CV/ML libraries. WebGPU and libraries like this change everything.


And operator overloading. TS code tends to look like this `c.add(b.add(a))` or `add(add(a, b), c)` instead of `a + b + c` as you might write in Python.

That was my biggest pain-point with using TS for graphics related projects. If operator overloading existed, then TS would be a no brainer for entry level graphics + AI/ML projects.

Edit: This gets more complicated when doing operations that force you to manually respect PEMDAS. For example, `add(div(a, b), multiply(c, d))` in TypeScript would simplify to `a / b + c * d` in Python. The TS version is unreadable.


I actually think that tagged template strings in JS/TS could be a much better version of operator overloading! https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

This would give access to any math notation in a more flexible way, letting you implement a custom DSL in a type-safe but expressive way.

Imagine writing stuff like

const result = math`${a} + ${b} / ${c}`
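
A rough sketch of what the tag itself could look like (the Tensor methods here are assumed, and it applies operators strictly left to right; a real DSL would parse into an AST and respect precedence):

  // Sketch only: assumes a Tensor with add/div methods.
  interface Tensor {
    add(other: Tensor): Tensor;
    div(other: Tensor): Tensor;
  }

  function math(strings: TemplateStringsArray, ...values: Tensor[]): Tensor {
    return values.reduce((acc, next, i) => {
      // strings[i] is the raw text between the (i-1)th and ith interpolation.
      const op = strings[i].trim();
      if (op === '+') return acc.add(next);
      if (op === '/') return acc.div(next);
      throw new Error(`Unsupported operator: ${op}`);
    });
  }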


I got nerdsniped and made a small library to test this concept. Might maintain it.

https://github.com/crubier/opov


Another option that's not quite as good as `a + b + c` but that is possible with TypeScript is a fluent API:

  const sum = a.add(b).add(c);


Indeed. "Object-oriented" fluent notation is basically equivalent to infix notation.

https://mlajtos.mu/posts/new-kind-of-paper-2


Yes that syntax works right now.


or add(a,b,c)


This. Just to riff off an example: a lot of the APIs in common DL frameworks like PyTorch revolve around NumPy or pickle formats. These are Python-first semantics.


There is so much stuff in SciPy and OpenCV alone that it will take forever for another language to catch up. Unfortunately so, because Python is suuuuuuuch a mediocre language in comparison. Type annotations were such a lost opportunity in Python; it's such a horrible implementation.


What's the reason to run pytorch directly on WebGPU vs using ONNX on WebGPU (e.g. with https://github.com/webonnx/wonnx)?


Amazing!

Oddly, two tests fail for me with Brave (Version 1.51.118 / Chromium: 113.0.5672.126 (arm64)) on macOS Ventura 13.3.1

- pow([0], [0]) gradient, with "Expected «-Infinity» to be close to «0» (diff: < 0.0000005)"

- xlogy([0], [0.30000001192092896]) gradient with "Expected «0» to be close to «-1.2039728164672852»"


Yeah so the thing is WebGPU doesn’t correctly support IEEE floating point. Particularly, 0 is often substituted for +-Inf and NaN. See section 14.6 of the spec.

https://www.w3.org/TR/WGSL/#floating-point-evaluation

It's not such a problem for real nets, since you avoid those values like the plague. But the tests catch them, and I need to make the tests more tolerant. Thanks for the results!
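
Something along these lines, probably (a sketch, not the actual test code): accept the indeterminate results WGSL allows wherever the expected value is non-finite.

  // Sketch: WGSL implementations may substitute an indeterminate value (often
  // 0) where IEEE math would give NaN or +-Infinity, so accept those too.
  function closeEnough(actual: number, expected: number, eps = 5e-7): boolean {
    if (Number.isNaN(expected) || !Number.isFinite(expected)) {
      return actual === 0 || Object.is(actual, expected) || Number.isNaN(actual);
    }
    return Math.abs(actual - expected) < eps;
  }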


https://praeclarum.org/webgpu-torch/tests/

This is a dumb question but... are GPUs really that much faster than CPUs specifically at the math functions tested on this page?

xlogy, trunc, tan/tanh, sub, square, sqrt, sin/sinc/silu/sinh, sign, sigmoid, sqrt/rsqrt, round, relu, reciprocal, rad2deg, pow, positive, neg, mul, logaddexp/logaddexp2, log/log1p/log10/log2, ldexp, hypot, frac, floor, expm1, exp2, exp, div, deg2rad, cos/cosh, copysign, ceil, atan/atan2, asinh/asin, add, acosh/acos, abs

Those are the types of math GPUs are good at? I thought they were better at a different kind of math, like matrices or something?


GPUs are about 100 times faster than CPUs for any type of single-precision floating point math operation. The catch is that you have to do roughly similar math operations on 10k+ items in parallel before the parallelism and memory bandwidth advantages of the GPU outweigh the latency and single-threaded performance advantages of the CPU. Of course this is achievable in graphics applications with millions of triangles and millions of pixels, and in machine learning applications with millions or billions of neurons.

IMO almost any application that is bottlenecked by CPU performance can be recast to use GPUs effectively. But it's rarely done because GPUs aren't nearly as standardized as CPUs and the developer tools are much worse, so it's a lot of effort for a faster but much less portable outcome.


Are there any standardised approaches for this? I fail to imagine how one would run branchy CPU code like parsing effectively on GPUs.


It is possible, but you have to do things very differently, for example using monoids. There are a few compilers implemented on GPUs, including Aaron Hsu's co-dfns and Voetter's compiler project [1]. The parentheses-matching problem itself (the core of parsing) has long had known efficient parallel algorithms, and those have been ported to compute shaders [2] (disclosure: blatant self-promotion).

[1]: https://dl.acm.org/doi/pdf/10.1145/3528416.3530249

[2]: https://arxiv.org/pdf/2205.11659.pdf


WebGPU, I think, will help change a lot of this. Finally, portable code that is performant and runs virtually anywhere. It's the same reason web apps have taken off so much, or just the idea of deploying to and from web platforms, e.g. writing for the web and deploying to native.

I think WebGPU will be that universal language everyone speaks, and I think also that this will help get rid of Nvidia's monopoly on GPU compute.


GPUs are usually not faster at doing the operation, but excel at doing the operation in parallel on a gazillion elements. Matrix math is mostly additions and multiplications.


Yeah this is the trick. You need to maximize the use of workgroup parallelism and also lay things out in memory for those kernels to access efficiently. It’s a bit of a balancing act and I’ll be working on benchmarks to test out different strategies.


The main advantage is parallelism, but on top of that, common math operations are hardware accelerated on the GPU, so they should indeed run faster just by being run there.


They are relatively tiny but they run on the GPU to avoid lots of copies back and forth.


Same with Chrome 113.0.5672.92 (arm64) on Ventura 13.2.

Safari 16.3 has 4 failures: "webgpu is supported", "tensor is webgpu", "xlogy([0], [0]) gradient", "xlogy([0], [0.30000001192092896]) gradient"


Sorry Safari does not support WebGPU yet. Please join me in writing to Apple and requesting it.


It seems like there's a developing competitor to the Python ecosystem in the form of webgpu and js/ts. Being able to run anywhere with no native dependencies is a pretty huge advantage, it will be interesting to see if this steals momentum. I wonder how hard it would be to add support for this as an alternate backend to transformers.js.


I used to use Python heavily and favor JS now for these reasons. Portability is huge. I think JS is going to eat Python's lunch.


Imagine: Isomorphic neural nets that can run server or client side.


> This is a perfect scenario to take advantage of code generation. I wrote a code generator that takes a template and generates the optimized kernels for each operation. The code generator is written in TypeScript and generates WebGPU compute shader code. This means that the generated code can be heavily optimized for the given scenario and those optimizations can be shared between operations.

A clever way to implement an AOT variant of the operator fusion methods in the XLA (JIT) compiler.
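
For readers unfamiliar with the approach, the basic shape of such a generator is something like the following (illustrative only; the names and kernel template are assumptions, not the project's actual code):

  // Illustrative sketch: specialize one WGSL compute shader per elementwise
  // op from a shared template string.
  const unaryOps: Record<string, string> = {
    relu: 'max(x, 0.0)',
    neg: '-x',
    exp: 'exp(x)',
  };

  function generateUnaryKernel(expr: string): string {
    return `
      @group(0) @binding(0) var<storage, read> input: array<f32>;
      @group(0) @binding(1) var<storage, read_write> output: array<f32>;
      @compute @workgroup_size(64)
      fn main(@builtin(global_invocation_id) id: vec3<u32>) {
        if (id.x >= arrayLength(&input)) { return; }
        let x = input[id.x];
        output[id.x] = ${expr};
      }`;
  }

  // One specialized shader module per op, e.g. kernels['relu'].
  const kernels = Object.fromEntries(
    Object.entries(unaryOps).map(([name, expr]) => [name, generateUnaryKernel(expr)])
  );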


This is perhaps the most interesting aspect of the project: using a code generator to escape the gravitational pull of CUDA. I wonder how well it would generalize to other targets.


Impressive work.

A number of test failures for me on chromium 113.0.5672.63 (ungoogled chromium) MacOS Ventura 13.3.1: https://pastebin.com/eM6ZA3j2

I'll open a ticket if it helps.


Please do. I have a few test machines but cannot match the variety of hardware out there.


This is really nice! I have been working on getting ANN search working in the browser ([1] demo, [2] WIP repo) and would love to switch out onnx for the embedding generation.

[1] https://anansi.pages.dev/ [2] https://github.com/infrawhispers/anansi/tree/main/embedds/li...

Privacy-focused semantic search / ML at the edge is looking brighter every day.


There goes my weekend!! Thanks!


Learning this as a beginner/novice feels like trying to catch up to a jet at takeoff speed on my kick scooter.


Curious what the potential is for this to then run headless - is the support for this in Chrome built into V8, such that Node and others can simply piggyback on it? Or is it sitting in the browser layer, such that you'd have to end up with a headless browser or similar?


Node doesn't have a GPU backend.

Deno has (or had), but you'd have to use Deno v1.31.3 to get WebGPU support (it was removed afterwards due to startup performance issues).


Loads of tests fail for me (chrome, windows). Mainly trigonometric functions which are way less accurate than they are supposed to be.


Yeah I think I’ll reduce the accuracy requirement for some transcendental functions since GPUs seem all over the place.


I would love to hear what Bram Wasti thinks about this (who has experience in this area and sometimes frequents the HN comments).



