If this actually works (remains to be seen), I can only say: 1) Kudos 2) Finally...

anthonix1 · on July 15, 2024

I just tried it with llm.c ... seems to be missing quite a few key components such as cublaslt, bfloat16 support, nvtx3, compiler flags such as -t

And its linked against an old release of ROCm.

So unclear to me how it is supposed to be an improvement over something like hipify

ckitching · on July 15, 2024

Greetings, I work on SCALE.

It appears we implemented `--threads` but not `-t` for the compiler flag. Oeps. In either case, the flag has no effect at present, since fatbinary support is still in development, and that's the only part of the process that could conceivably be parallelised.

That said: clang (and hence the SCALE compiler) tends to compile CUDA much faster than nvcc does, so this lack of the parallelism feature is less problematic than it might at first seem.

NVTX support (if you want more than just "no-ops to make the code compile") requires cooperation with the authors of profilers etc., which has not so far been available

bfloat16 is not properly supported by AMD anyway: the hardware doesn't do it, and HIP's implementatoin just lies and does the math in `float`. For that reason we haven't prioritised putting together the API.

cublasLt is a fair cop. We've got a ticket :D.

anthonix1 · on July 15, 2024

Hi, why do you believe that bfloat16 is not supported? Can you please provide some references (specifically the part about the hardware "doesn't do it")?

For the hardware you are focussing on (gfx11), the reference manual [2] and the list of LLVM gfx11 instructions supported [1] describe the bfloat16 vdot & WMMA operations, and these are in fact implemented and working in various software such as composable kernels and rocBLAS, which I have used (and can guarantee they are not simply being run as float). I've also used these in the AMD fork of llm.c [3]

Outside of gfx11, I have also used bfloat16 in CDNA2 & 3 devices, and they are working and being supported.

Regarding cublasLt, what is your plan for support there? Pass everything through to hipblasLt (hipify style) or something else?

Cheers, -A

[1] https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX11.html [2] https://www.amd.com/content/dam/amd/en/documents/radeon-tech... [3] http://github.com/anthonix/llm.c

ckitching · on July 16, 2024

> Hi, why do you believe that bfloat16 is not supported?

Apologies, I appear to be talking nonsense. I conflated bfloat16 with nvidia's other wacky floating point formats. This is probably my cue to stop answering reddit/HN comments and go to bed. :D

So: ahem: bfloat16 support is basically just missing the fairly boring header.

> Regarding cublasLt, what is your plan for support there? Pass everything through to hipblasLt (hipify style) or something else?

Prettymuch that, yes. Not much point reimplementing all the math libraries when AMD is doing that part of the legwork already.

anthonix1 · on July 16, 2024

OK, so in the case of llm.c, if you're just including the HIP headers, using hipblasLt, etc, what would be the benefit of using scale instead of hipify?

Straw · on July 16, 2024

Seems like a big benefit would come from not forking the codebase into two versions!

gedy · on July 15, 2024

or: 1) CUDAs