> However, although tensors describe the relationship between arbitrary higher-dimensional arrays, in practice the TPU hardware that we will consider is designed to perform calculations associated with one and two-dimensional arrays. Or, more specifically, vector and matrix operations.
I still don’t understand why the term “tensor” is used if it’s only vectors and matrices.
It says:
tensors describe the relationship between high-d arrays
It does not say:
tensors “only” describe the relationship between high-d arrays
The term “tensor” is used because it covers all cases: scalars, vectors, matrices, and higher-dimensional arrays.
Tensors are still a generalization of vectors and matrices.
Note the context: In ML and computer science, they are considered a generalization. From a strict pure math standpoint they can be considered different.
As frustrating as it seems, neither is really more right than the other; context is the decider. There are lots of definitions across STEM fields that change based on the context or field they're applied to.
The word tensor has become more ambiguous over time.
Before 1900, the use of the word tensor was consistent with its etymology, because it was used only for symmetric matrices, which correspond to affine transformations that stretch or compress a body in certain directions.
The square matrix that corresponds to a general affine transformation can be decomposed into the product of a tensor (a symmetric matrix, which stretches) and a versor (a rotation matrix, which is orthogonal and which rotates).
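For the curious, that stretch-times-rotation split is what's now usually called the polar decomposition. Here's a minimal NumPy/SciPy sketch (the example matrix is made up, and `scipy.linalg.polar` is just one convenient way to compute it):

```python
import numpy as np
from scipy.linalg import polar

# An arbitrary invertible matrix standing in for a general affine transformation.
A = np.array([[2.0, 1.0],
              [0.5, 1.5]])

# Polar decomposition: A = R @ P, where R is orthogonal (the "versor", a rotation
# up to a possible reflection) and P is symmetric positive semidefinite
# (the "tensor" in the pre-1900 sense: a pure stretch).
R, P = polar(A, side="right")

print(np.allclose(A, R @ P))            # True
print(np.allclose(R.T @ R, np.eye(2)))  # R is orthogonal
print(np.allclose(P, P.T))              # P is symmetric
```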
When Ricci-Curbastro and Levi-Civita published the first theory of what are now called tensors, they did not coin any new word for the concept of a multidimensional array with certain rules of transformation when the coordinate system is changed, which is what is now called a tensor.
When Einstein published the Theory of General Relativity during WWI, in which he used what is now called tensor theory, he began, for an unknown reason and without any explanation for this choice, to use the word "tensor" with the current meaning, in contrast with all previous physics publications.
Because Einstein became extremely popular immediately after WWI, his usage of the word "tensor" spread everywhere, including in mathematics (and including in the American translations of the works of Ricci and Levi-Civita, where the word tensor was introduced everywhere, despite the fact that it did not exist in the originals).
Nevertheless, for many years the word "tensor" could not be used for arbitrary multi-dimensional arrays, but only for those which obey the tensor transformation rules with respect to coordinate changes.
The use of the word "tensor" as a synonym for the word "array", like in ML/AI, is a recent phenomenon.
Previously, e.g. in all early computer literature, the word "array" (or "table" in COBOL literature) was used to cover all cases, from scalars, vectors and matrices to arrays with an arbitrary number of dimensions, so no new word was necessary.
Famously whether free helium is a molecule or not depends on whether you're talking to a physicist or a chemist.
But yeah, people in different countries speak different languages and the same sound, like "no" can mean a negation in English but a possessive in Japanese. And as different fields establish their jargons they often redefine words in different ways. It's just something you have to be aware of.
(I think) technically, all of these mathematical objects are tensors of different ranks:
0. Scalar numbers are tensors of rank 0.
1. Vectors (eg velocity, acceleration in intro high school physics) are tensors of rank 1.
2. Matrices that you learn in intro linear algebra are tensors of rank 2. Nested arrays 1 level deep, aka a 2D array.
3. Higher-dimensional arrays are tensors of rank 3 or higher. I explain these to people with programming backgrounds as nested arrays of arrays, 3 dimensions deep or more.
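If it helps to see the ranks concretely, here's a minimal NumPy sketch (the arrays are made-up examples; `ndim` is NumPy's name for the rank in this array sense):

```python
import numpy as np

scalar = np.array(3.0)                    # rank 0
vector = np.array([1.0, 2.0, 3.0])        # rank 1
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])           # rank 2 (array of arrays)
tensor = np.zeros((2, 3, 4))              # rank 3 (array of arrays of arrays)

for t in (scalar, vector, matrix, tensor):
    print(t.ndim, t.shape)
# 0 ()
# 1 (3,)
# 2 (2, 2)
# 3 (2, 3, 4)
```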
It's branding (see: TensorFlow); also, pretty much anything (linear) you would do with an arbitrarily ranked tensor can be expressed in terms of vector ops and matmuls
At the end of the day all the arrays are 1 dimensional and thinking of them as 2 dimensional is just an indexing convenience. A matrix multiply is a bunch of vector dot products in a row. Higher tensor contractions can be built out of lower-dimensional ones, so I don't think it's really fair to say the hardware doesn't support it.
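A minimal NumPy sketch of both points, with made-up shapes: a matmul is a grid of dot products, and a rank-3 contraction can be reshaped into an ordinary 2D matmul:

```python
import numpy as np

A = np.random.rand(4, 5)
B = np.random.rand(5, 6)

# A matrix multiply is a bunch of vector dot products arranged in a grid.
C = np.empty((4, 6))
for i in range(4):
    for j in range(6):
        C[i, j] = np.dot(A[i, :], B[:, j])
print(np.allclose(C, A @ B))  # True

# A higher-rank contraction built out of a 2D matmul: contract the last axis
# of a (2, 3, 5) tensor against the first axis of B by flattening to 2D first.
T = np.random.rand(2, 3, 5)
out = (T.reshape(6, 5) @ B).reshape(2, 3, 6)
print(np.allclose(out, np.tensordot(T, B, axes=([2], [0]))))  # True
```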
I’d say it’s more like calling an ALU that can perform unary and binary operations (so 1 or 2 inputs) an “array processing unit” because it’s like it can process 1- and 2-element arrays. ;)
I do not know which is the real origin of the fashion to use the word tensor in the context of AI/ML.
Nevertheless, I have always interpreted it as a reference to the fact that the optimal method of multiplying matrices is to decompose the matrix multiplication into tensor products of vectors.
The other two methods, i.e. decomposing the matrix multiplication into scalar products of vectors or into AXPY operations on pairs of vectors, have a much worse ratio between computation operations and transfer operations.
Unfortunately, most people learn in school the much less useful definition of the matrix multiplication based on scalar products of vectors, instead of its definition based on tensor products of vectors, which is the one needed in practice.
The 3 possible methods for multiplying matrices correspond to the 6 possible orders for the 3 indices of the 3 nested loops that compute a matrix product.
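Here's a rough NumPy sketch of the three formulations (made-up sizes); they differ only in which loop index drives the computation, but they all compute the same product:

```python
import numpy as np

m, k, n = 4, 5, 6
A = np.random.rand(m, k)
B = np.random.rand(k, n)

# 1) Scalar (dot) products: each C[i, j] is a dot product of a row of A and a column of B.
C_dot = np.empty((m, n))
for i in range(m):
    for j in range(n):
        C_dot[i, j] = A[i, :] @ B[:, j]

# 2) Tensor (outer) products: C is a sum of rank-1 matrices, one per inner index p.
C_outer = np.zeros((m, n))
for p in range(k):
    C_outer += np.outer(A[:, p], B[p, :])

# 3) AXPY: each column of C is built up as a linear combination of the columns of A.
C_axpy = np.zeros((m, n))
for j in range(n):
    for p in range(k):
        C_axpy[:, j] += B[p, j] * A[:, p]

print(np.allclose(C_dot, A @ B),
      np.allclose(C_outer, A @ B),
      np.allclose(C_axpy, A @ B))  # True True True
```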
The Einsum notation makes it desirable to formulate your model/layer as multi-dimensional arrays connected by (loosely) named axes, without worrying too much about breaking it down to primitives yourself. Once you get used to it, the terseness is liberating.
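For example, with NumPy's `einsum` (the axis names and shapes here are made up), a batched projection reads almost like the sentence describing it:

```python
import numpy as np

# batch of 8 examples, each a (sequence, features) array
x = np.random.rand(8, 16, 32)   # axes: batch, seq, feature
W = np.random.rand(32, 64)      # axes: feature, hidden

# "for each batch b and position s, contract the feature axis f against W"
y = np.einsum("bsf,fh->bsh", x, W)
print(y.shape)                 # (8, 16, 64)
print(np.allclose(y, x @ W))   # broadcasting matmul gives the same result
```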
I was confused as hell for a long time when I first got into ML, until I figured out how to think about tensors in a visual way.
You're right: fundamentally ML is about vector and matrix operations (1D and 2D). So then why are most ML programs 3D, 4D, and in a transformer sometimes up to 6D (?!)
One reasonable guess is that the third dimension is time. Actually not. It turns out that time is pretty rare in ML, and it's only (relatively) recently that it's been introduced into e.g. video models.
Another guess is that it's to represent "time" in the sense of how transformers work: they generate a token, then another given the previous, then a third given the first two, etc. That's a certain way of describing "time". But it turns out that transformers don't represent this as a third or fourth dimension. It only needs to be 2D, because tokens are 1D -- if you're representing tokens over time, you get a 2D output. So even with a cutting-edge model like transformers, you still only need plain old 2D matrix operations. The attention layer creates a mask, which ends up being 2D.
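Concretely, and without assuming any particular framework's API, the "each token only sees earlier tokens" structure is just a 2D lower-triangular mask over (query position, key position):

```python
import numpy as np

T = 5  # sequence length (number of tokens so far)

# causal mask: position i may attend to position j only if j <= i
mask = np.tril(np.ones((T, T), dtype=bool))
print(mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]

# applied to a (T, T) matrix of attention scores by blanking out the future
scores = np.random.rand(T, T)
masked = np.where(mask, scores, -np.inf)  # -inf so a later softmax ignores masked entries
```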
So then why do models get to 3D and above? Usually batching. You get a certain efficiency boost when you pack a bunch of operations together. And if you pack a bunch of 2D operations together, that third dimension is the batch dimension.
For images, you typically end up with 4D, with the convention N,C,H,W, which stands for "Batch, Channel, Height, Width". It can also be N,H,W,C, which is the same thing but packed in memory as red, green, blue, red, green, blue, etc. instead of all the red pixels first, then all the green pixels, then all the blue pixels. This matters in various subtle ways.
I have no idea why the batch dimension is called N, but it's probably "number of images".
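A small NumPy sketch of the two layouts and the batch dimension (the image batch is made up); the values are the same, only the memory order differs:

```python
import numpy as np

# N, C, H, W: 2 images, 3 channels (RGB), 4x4 pixels
batch_nchw = np.random.rand(2, 3, 4, 4)

# Same data rearranged to N, H, W, C (channels-last, "rgb, rgb, ..." in memory)
batch_nhwc = np.ascontiguousarray(batch_nchw.transpose(0, 2, 3, 1))

print(batch_nchw.shape, batch_nhwc.shape)  # (2, 3, 4, 4) (2, 4, 4, 3)
# In NCHW, all the red values of an image sit together;
# in NHWC, the three channel values of each pixel sit next to each other.
print(batch_nchw[0, 0].ravel()[:4])   # first few red pixels of image 0
print(batch_nhwc[0, 0, 0])            # r, g, b of the first pixel of image 0
```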
"Vector" wouldn't quite cover all of this, and although "tensor" is confusing, it's fine. It's the ham sandwich of naming conventions: flexible, satisfying to some, and you can make them in a bunch of different varieties.
Under the hood, TPUs actually flatten 3D tensors down into 2D matrix multiplications. I was surprised by this, but it makes total sense. The native size for a TPU is 8x128 -- you can think of it a bit like the native width of a CPU, except it's 2D. So if you have a 3x4x256 tensor, it actually gets flattened out to 12x256, then the XLA black box magic figures out how to split that across a certain number of 8x128 vector registers. Note they're called "vector registers" rather than "tensor registers", which is interesting. See https://cloud.google.com/tpu/docs/performance-guide
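The flattening itself is just a reshape; here's a minimal NumPy sketch with the 3x4x256 example above (how XLA then tiles the result across 8x128 vector registers is the black-box part):

```python
import numpy as np

t = np.random.rand(3, 4, 256)     # the 3x4x256 tensor from above
flat = t.reshape(3 * 4, 256)      # flattened to 12x256
print(flat.shape)                 # (12, 256)

# A matmul against the trailing axis gives the same answer either way.
W = np.random.rand(256, 128)
print(np.allclose((t @ W).reshape(12, 128), flat @ W))  # True
```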
You'd hate particle physics then. "Spin" and "action" and so on are terrible names, but scientists live with them, because convention.
Convention dominates most of what we do. I'm not sure there's a good way around this. Most conventions suck, but they were established back before there was a clear idea of what the best long-term convention should be.
At least in physics you can understand how the terms came about historically, where at some point they made sense. But “tensor” here, as noted in sibling comments, seems to have been chosen primarily for marketing reasons.
It comes from the maths, where tensors are generalisations of matrices/vectors. They got cribbed, because the ML stuff directly used a bunch of the underlying maths. It’s a novel term, it sounds cool, not surprised it also then got promoted up into a marketing term.
> tensors are generalisations of matrices/vectors.
Is that what they are though? Because that really is not my understanding. Tensors are mappings, which not all matrices and vectors are. Maybe the matrices in ML layers are all mappings, but a matrix in general is not, nor is a vector always a mapping. So tensors aren’t generalizations of matrices and vectors.
> Tensors are mappings which not all matrices and vectors are.
A tensor in Physics is an object that follows some rules when changing reference frame. Their matrix representation is just one way of writing them. It’s the same with vectors: a list with their components is a representation of a vector, not the vector itself. We can think about it that way: the velocity of an object does not depend on the reference frame. Changing the axes does not make the object change its trajectory, but it does change the numerical values of the components of the velocity vector.
> So tensors aren’t generalizations of matrices and vectors.
Indeed. Tensors in ML have pretty much nothing to do with tensors in Maths or Physics. It is very unfortunate that they settled on the same name just because it sounds cool and sciency.
I think to be a tensor, all the bases should be independent. The way I think of it is you use a tensor to describe the rotation of an asteroid around all its major axes (inertia tensor?)
Just because an image is 2-D doesn’t mean that the model can’t use higher-dimensional representations in subsequent layers.
For an image, you could imagine a network learning to push the image through a filter bank that does oriented local frequency decomposition and turns it into a 4D {height}x{width}x{spatial freq}x{orientation} representation before dealing with color channels or image batches.
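A rough sketch of that idea in NumPy/SciPy, using random kernels as stand-ins for a real oriented filter bank (e.g. Gabor filters):

```python
import numpy as np
from scipy.signal import convolve2d

H, W = 32, 32
image = np.random.rand(H, W)                 # one grayscale image

n_freqs, n_orients = 4, 6
# Stand-in filter bank: one small kernel per (frequency, orientation) pair.
filters = np.random.rand(n_freqs, n_orients, 7, 7)

# Push the image through the bank: the result is 4D, height x width x freq x orientation.
response = np.zeros((H, W, n_freqs, n_orients))
for f in range(n_freqs):
    for o in range(n_orients):
        response[:, :, f, o] = convolve2d(image, filters[f, o], mode="same")

print(response.shape)  # (32, 32, 4, 6)
```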
For whatever reason, I have held a mental image of a tensor as a tesseract/hypercube where the connections are like elastic workout bands with differing tensile resistances, and they pull on one another to create their encapsulated info-cluster - but I have no clue if that's truly an accurate depiction; it works in my head, though....
I'm reluctant to tell people "no, don't think of it that way," especially if it works for you, because I don't know the best way to think of things. I only know what works well for me. But for me, it'd be ~impossible to use your mental model to do anything useful. That doesn't mean it's bad, just that I don't understand what you mean.
The most straightforward mental model I've ever found for ML is: think of it as 2D matrix operations, like high school linear algebra. Matrix-matrix, matrix-vector, vector-matrix, and vector-vector will get you through 95% of what comes up in practice. In fact I'm having trouble thinking of something that doesn't work that way, because even if you have an RGB image that you multiply against a 2D matrix (i.e. HxWxC multiplied by a mask), the matrix is still only going to apply to 2 of the axes (height and width), since that's the only thing that makes sense. That's why there's all kinds of flattening and rearranging everywhere in practice -- everyone is trying to get a format like N,C,H,W down to a 2D matrix representation.
People like to talk up the higher-level maths in ML, but high school linear algebra (or for the gamedevs in the audience, the stuff you'd normally do in a rendering engine) really will carry you most of the way through your ML journey without loss of generality. The higher-level maths usually shows up when you start understanding how differentiation works, which you don't even need to understand until way later, after you're doing useful things already.
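As a concrete version of the HxWxC-times-mask example (NumPy, made-up shapes): the 2D operation only touches height and width, and the channel axis comes along for the ride:

```python
import numpy as np

H, W, C = 4, 4, 3
image = np.random.rand(H, W, C)   # an RGB image
mask = np.random.rand(H, W)       # a plain 2D mask over height and width

# Broadcasting the mask over the channel axis: each channel gets the same 2D treatment.
masked = image * mask[:, :, None]
print(masked.shape)  # (4, 4, 3)

# Equivalently, flatten to 2D, apply, and reshape back -- the usual rearranging dance.
masked2 = (image.reshape(H * W, C) * mask.reshape(H * W, 1)).reshape(H, W, C)
print(np.allclose(masked, masked2))  # True
```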
>One reasonable guess is that the third dimension is time. Actually not. It turns out that time is pretty rare in ML, and it's only (relatively) recently that it's been introduced into e.g. video models.
WRT ML - might time be better thought of as where a thing lives in relation to other things that occurred within the same temporal window?
So "all the shit that happened in 1999 also has an expression within this cluster of events from 1999" - but the same information appears in any location where it is relationally contextual to the other neighbors, such as the SUBJECT of the information? Is it accurate to say that's why it's 'quantum': because the information will show up depending on where the observation (query) for it is occurring?
Something similar happens on Wikipedia, where topics that use math inevitably get explained in the highest level math possible. It makes topics harder to understand than they need to be.
As a helpful Wiki editor just trying to make sure that we don't lead people astray, I've made some small changes to clarify your statement:
In the virtual compendium of Wikipedia, an extensive repository of human knowledge, there is a discernible proclivity for the hermeneutics of mathematically-infused topics to be articulated through the prism of esoteric and sophisticated mathematical constructs, often employing a panoply of arcane lexemes and syntactic structures of Greek and Latin etymology. This phenomenon, redolent of an academic periphrasis, tends to transmute the exegesis of such subjects into a crucible of abstruse and high-order mathematical discourse. Consequently, this modus operandi obfuscates the intrinsic didactic intent, thereby precipitating an epistemological chasm that challenges the layperson's erudition and obviates the pedagogical utility of the exposition.