
An array of arrays is not necessarily contiguous in C(++). Indeed, if you allocate on the heap, you end up with a bunch of discontiguous memory that's not necessarily correctly aligned.

A good tensor implementation accounts for strides that are SIMD compatible (e.g., each dimension is a multiple of the SIMD register width).
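To make the stride point concrete, here is a minimal sketch of rounding a row length up to a multiple of the SIMD lane count. The function name `padded_stride` and the `lanes` parameter are made up for illustration; a real tensor library would also have to align the base pointer, which this does not show.

```c
#include <stddef.h>

/* Round a row length n (in elements) up to the next multiple of lanes,
   where lanes is the number of elements per SIMD register
   (e.g. 4 floats for a 128-bit SSE register). Assumes lanes > 0. */
size_t padded_stride(size_t n, size_t lanes)
{
    return (n + lanes - 1) / lanes * lanes;
}
```

With a padded stride, element (i, j) lives at offset `i * padded_stride(cols, lanes) + j`, so every row starts on a lane boundary at the cost of a few unused elements per row.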



An array of arrays is necessarily contiguous in C - this is implied by the type. An array of pointers to arrays will, of course, not be contiguous - and is the only way to get a dynamically sized heap-allocated 2D array in C (VLAs give you stack-allocated arrays, with all the size limits that entails).
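A rough sketch of the two layouts being contrasted, assuming small fixed dimensions. `ROWS`, `COLS` and the function names are hypothetical, chosen only for this example.

```c
#include <stddef.h>
#include <stdlib.h>

enum { ROWS = 3, COLS = 4 };

/* True array of arrays: a single object whose rows sit back to back.
   Returns the byte distance between the starts of row 0 and row 1,
   which the type fixes at COLS * sizeof(double). */
size_t row_gap_array_of_arrays(void)
{
    static double grid[ROWS][COLS];
    return (size_t)((char *)grid[1] - (char *)grid[0]);
}

/* Array of pointers to rows: every row is its own allocation, so nothing
   says where the rows end up relative to each other.
   Returns NULL if any allocation fails. */
double **alloc_row_pointers(size_t rows, size_t cols)
{
    double **m = malloc(rows * sizeof *m);
    if (m == NULL)
        return NULL;
    for (size_t i = 0; i < rows; i++) {
        m[i] = calloc(cols, sizeof **m);
        if (m[i] == NULL) {
            while (i > 0)          /* undo the rows allocated so far */
                free(m[--i]);
            free(m);
            return NULL;
        }
    }
    return m;
}

/* Release a matrix created by alloc_row_pointers. */
void free_row_pointers(double **m, size_t rows)
{
    if (m == NULL)
        return;
    for (size_t i = 0; i < rows; i++)
        free(m[i]);
    free(m);
}
```

Both forms are indexed as `m[i][j]`, which is probably why they get conflated, but only the first is one contiguous block.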

In C++, this all is best handled by a library class.


Heap-allocated dynamically-sized NxM matrix in C99:

    double (*mat)[M] = calloc(N * M, sizeof (double));
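A slightly fuller sketch of the same idea with both dimensions known only at run time. It assumes a compiler with VLA support (mandatory in C99, optional since C11); the function name and the fill pattern are invented for the example.

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocate a rows x cols matrix as one contiguous heap block through a
   pointer to a variably modified row type, fill it with 0, 1, 2, ...
   in row-major order, and return the sum of all elements.
   Assumes cols > 0. Returns -1.0 if the allocation fails. */
double sum_filled_matrix(size_t rows, size_t cols)
{
    /* sizeof *mat is the size of one row, evaluated at run time. */
    double (*mat)[cols] = calloc(rows, sizeof *mat);
    if (mat == NULL)
        return -1.0;

    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            mat[i][j] = (double)(i * cols + j);

    double sum = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += mat[i][j];

    free(mat);   /* one allocation, one free */
    return sum;
}
```

Only the pointer's type is variably modified here; the storage itself comes from the heap, so the usual stack size limits of VLAs do not apply.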


Ah, good point. I always forget that VLA types in C99 are actually types, and so you can use them in these contexts as well.

It's a shame they killed VLAs as a mandatory language feature. They didn't make C into Fortran (which I think was the hope, between them, complex numbers, and "restrict"?), but they did make some things a great deal more pleasant to write.


> and is the only way to get a dynamically sized heap-allocated 2D array in C (VLAs give you stack-allocated arrays, with all the size limits that entails).

Why wouldn't you be able to create a dynamic array of arrays with placement new and a cast?


A cast to what, though? You need to be able to write the type of that array, but you can't do that unless dimensions are compile-time constants.


> A good tensor implementation accounts for strides that are SIMD compatible (eg; each dimension is a multiple of the SIMD register width).

Never implemented any tensors, but in my experience you're sometimes better off doing what GPUs do with some texture formats: switching from a linear layout to a dense block layout. E.g., for a dense float32 matrix and SSE code, a good choice is a 2D array of 4x4 blocks: each block is exactly 1 cache line (64 bytes) and only consumes 4 out of 16 registers while being processed, i.e. you can do a lot within a block without hitting any RAM latency.
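For illustration, one way the index math for such a layout might look. This is a sketch under assumptions of my own: blocks are stored row-major, each block is itself row-major, and the column count is a multiple of 4. `BLOCK` and `blocked_offset` are hypothetical names.

```c
#include <stddef.h>

enum { BLOCK = 4 };   /* 4x4 floats = 64 bytes per block */

/* Offset, in floats, of element (row, col) in a matrix stored as a
   row-major grid of 4x4 blocks. cols is the logical column count and
   is assumed to be a multiple of BLOCK. */
size_t blocked_offset(size_t row, size_t col, size_t cols)
{
    size_t blocks_per_row = cols / BLOCK;
    size_t block_index = (row / BLOCK) * blocks_per_row + (col / BLOCK);
    size_t within_block = (row % BLOCK) * BLOCK + (col % BLOCK);
    return block_index * (BLOCK * BLOCK) + within_block;
}
```

In an 8-column matrix, for example, element (0, 4) is the first float of the second block, at offset 16, rather than offset 4 as in a linear layout.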



