an array of arrays is not necessarily contiguous in C(++). Indeed, if allocating on the heap, you end up with a bunch of discontiguous memory that's not necessarily correctly aligned.
A good tensor implementation accounts for strides that are SIMD compatible (eg; each dimension is a multiple of the SIMD register width).
An array of arrays is necessarily contiguous in C - this is implied by the type. An array of pointers to arrays will, of course, not be contiguous - and is the only way to get a dynamically sized heap-allocated 2D array in C (VLAs give you stack-allocated arrays, with all the size limits that entails).
In C++, this all is best handled by a library class.
Ah, good point. I always forget that VLA types in C99 are actually types, and so you can use them in these contexts as well.
It's a shame they killed VLAs as a mandatory language feature. They didn't make C into Fortran (which I think was the hope, between them, complex numbers, and "restrict"?), but they did make some things a great deal more pleasant to write.
>and is the only way to get a dynamically sized heap-allocated 2D array in C (VLAs give you stack-allocated arrays, with all the size limits that entails).
Why wouldnt you be able to create a dynamic array of arrays with placement new and a cast.
> A good tensor implementation accounts for strides that are SIMD compatible (eg; each dimension is a multiple of the SIMD register width).
Never implemented any tensors, but in my experience, sometimes you better do what GPUs do with some texture formats: switch from linear layout into dense blocks layout. E.g. for dense float32 matrix and SSE code, a good choice is a 2D array of 4x4 blocks: exactly 1 cache line, and only consumes 4 out of 16 registers while being processed i.e. you can do a lot within a block while not hitting any RAM latency.
A good tensor implementation accounts for strides that are SIMD compatible (eg; each dimension is a multiple of the SIMD register width).