> A very pretty structure emerges; this might be spurious in that it captures more about the layout algorithm than any "true" structure of numbers.
What would it look like if we used PCA rather than UMAP? PCA is simpler than UMAP, so it's in some sense "less arbitrary". If the image is similar then we know we're seeing something about numbers rather than about our methods.
PCA is dimensionally invalid: it destroys structure rather than preserving it, and it consists of arbitrary linear algebra operations. It is "less arbitrary" the way x86 assembly is "less arbitrary" with respect to C (in fact it just ties you to a certain mode of thinking).
I don't think "arbitrary linear algebra operations" is a valid critique. If you understand PCA as "take the SVD of the data", then the operations seem arbitrary. But if you understand it as, "construct a low-rank approximation in the L2 sense to the data, or its covariance", then it's not.
Also, I don't think that the (very legitimate) "dimensional" critique of PCA applies here. The units on the coordinates of the representation are the same: the presence or absence of that prime factor.
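Not from the thread, but a small numpy sketch of that second reading (random data here, purely illustrative): the truncated SVD of the centered data is exactly the best rank-k approximation in the least-squares sense (Eckart-Young), and its right singular vectors are the principal components.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    Xc = X - X.mean(axis=0)               # center the data

    # PCA via SVD: the right singular vectors are the principal directions,
    # and truncating to rank k gives the best rank-k approximation of Xc
    # in the least-squares (Frobenius / L2) sense.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = 3
    Xc_k = (U[:, :k] * s[:k]) @ Vt[:k]    # rank-k reconstruction of the data
    scores = Xc @ Vt[:k].T                # projection onto the top-k PCs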
To the original question: my suspicion is that PCA might pull out the even numbers (first PC) and the numbers divisible by 3 (second PC), because these two factors may explain the most variability in the underlying vector representation. If it did, that would be pretty intuitive, although not as interesting.
---
Edited to add: Suspicion turned out to be true. For the first 2000 integers, the top 6 PCs turned out to correspond to the first 6 primes (2, 3, 5, 7, 11, 13).
    function [nums, pcs] = pca_prime(nMax, nPC)
        nums = zeros(nMax, nMax);
        for k = 2:nMax
            nums(k, factor(k)) = 1;  % vector representation of "k"
        end
        % 2:end because we don't care about 1 as a "prime"
        pcs = pca(nums(2:end, :), 'NumComponents', nPC);
    end
    --
    [nums, pcs] = pca_prime(2000, 10);  % "svd" would work too
    plot(pcs(:, 1:6));                  % first 6 PCs
Smart observation. Another way to say it is that, for distinct primes p1 and p2, the events “p1 divides n” and “p2 divides n” are approximately statistically independent. So you get a near-diagonal covariance with entries as you wrote.
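Not from the original comments, but a quick numerical check in Python (here the indicator matrix has one column per prime, rather than one column per integer as in the MATLAB snippet above): the variance of the indicator of "p divides n" comes out near (1/p)(1 - 1/p), and the covariances between distinct primes stay close to zero.

    import numpy as np

    N = 2000
    primes = [p for p in range(2, N + 1)
              if all(p % d for d in range(2, int(p**0.5) + 1))]

    # Indicator matrix: row for n = 2..N, column j is 1 iff primes[j] divides n.
    X = np.zeros((N - 1, len(primes)))
    for i, n in enumerate(range(2, N + 1)):
        for j, p in enumerate(primes):
            if n % p == 0:
                X[i, j] = 1

    C = np.cov(X, rowvar=False)
    print(np.diag(C)[:4])                         # ~ (1/p)(1 - 1/p) for p = 2, 3, 5, 7
    print(np.abs(C - np.diag(np.diag(C))).max())  # off-diagonal entries stay small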
> If you understand PCA as "take the SVD of the data", then the operations seem arbitrary. But if you understand it as, "construct a low-rank approximation in the L2 sense to the data, or its covariance", then it's not.
Those are the same thing. Is your first case just a reader who has no idea what the SVD is?
Yes, they are both descriptions of the same thing. I'm trying to say that PCA does have a justification. It's not just an "arbitrary linear algebra operation", although the application of the SVD algorithm to perform PCA can be presented that way.
Yes. See the plot: for example, PC #1 is essentially a 0/1 vector with all its weight (1 in this case) placed on the "2" column, which is the representation of the prime number 2 in the scheme used in the OP.
Without (arbitrary) scaling normalization, PCA gives different results when the dimensional units change; that is, the principal components depend on the choice of units or scaling.
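A quick numpy illustration of that point (synthetic data, just to show the effect): multiplying one column by a large constant, i.e. changing its units, changes which direction the first principal component points in.

    import numpy as np
    from numpy.linalg import svd

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3))
    X -= X.mean(axis=0)

    Xs = X.copy()
    Xs[:, 0] *= 1000.0                    # same data, column 0 in different units

    # First principal direction before and after the unit change.
    pc1 = svd(X, full_matrices=False)[2][0]
    pc1_scaled = svd(Xs, full_matrices=False)[2][0]
    print(pc1)
    print(pc1_scaled)                     # now dominated by the rescaled column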
Oh okay. I did know about that weakness of PCA. But it seems like in this case letting one prime factor correspond to one unit of distance is a nonarbitrary scaling, so I would expect PCA to give sensible results.
The prime factors of a number make for the ultimate high-dimensional space.
Damn. The more I see UMAP, the more I think it is going to be a central and generic tool for high-dimensional analysis. I haven't taken the time to go in depth into it yet, though :/
So far, my understanding of it is: t-SNE on steroids
* t-SNE is great for local proximity, but it 'rips apart' high-dimensional global structure too early. UMAP handles both scales by using transformations to map the overlapping points of the different locally relevant lower-dimensional patches.
* It is faster than t-SNE and scales better.
* t-SNE is about moving the points, whereas UMAP is about finding the transformations that move the points, which means:
a) it yields a model that you can use to create embeddings for unseen data. That makes it possible to share your work by contributing to public model zoos.
b) you can also do supervised dimension reduction as you create your embedding. I.e., you can judge whether the shape looks good for unseen data (a.k.a. whether it generalizes well), and then correct the embedding by choosing which unseen instances to add to the training set. This lets you control the cost of labeling data: you can see where your errors are and back-propagate them to the collection process in a cost-effective manner, even for high-dimensional data.
* You can choose your metric! Specify a distance function and you're good to go. Haversine for a great orange peeling, Levenshtein for visualizing word spellings (and maybe providing an embedding for ML-based spell checking?).
* You can choose the output space to have more than 2 or 3 dimensions, in order to stop the compression at a specified level. (A minimal code sketch of these options follows below.)
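A minimal umap-learn (Python) sketch of the options above; the data, labels, and parameter values here are placeholders for illustration only, not anything from the original comment.

    import numpy as np
    import umap

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1000, 50))    # "seen" high-dimensional data
    y_train = rng.integers(0, 5, size=1000)  # labels for the supervised variant
    X_new = rng.normal(size=(100, 50))       # "unseen" data

    # Supervised reduction (point b): pass labels via y.
    # n_components > 2 stops the compression at 5 dimensions instead of a 2-D plot,
    # and metric can be any supported name ('cosine' here; 'haversine' works for
    # lat/lon data) or a custom numba-compilable distance function.
    reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine",
                        random_state=42)
    embedding = reducer.fit_transform(X_train, y=y_train)

    # Point a): the fitted model can embed unseen data without refitting.
    new_embedding = reducer.transform(X_new)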
I believe it will replace t-SNE in the long term.
Here is a great video of the author presenting his work:
> * You can choose your metric! Specify a distance function and you're good to go. Haversine for a great orange peeling, Levenshtein for visualizing word spellings (and maybe providing an embedding for ML-based spell checking?)
t-SNE, at least the implementation I usually work with (the R Rtsne package), happily accepts any distance matrix as input. I have successfully used all kinds of distance measures with it.
That’s why I love HN: you start reading a post along with the comments, and you see nice links like this one that take you down another (similar) road ;-)
If what you're asking about is the math, the steps are (essentially) as follows:
1. A Riemannian manifold is constructed from the dataset.
2. The manifold is approximately mapped to an n-dimensional topological structure.
3. The reduced embedding is an (n - k)-dimensional projection equivalent to the initial topological structure, where k is the number of dimensions you'd like to reduce by.
I don't know how well that answers your question because it's difficult to simplify the math beyond that. But you can also check out the paper on arXiv. [1]
The underlying idea is to transform the data into a topological representation, analyze its structure, then find a much smaller (dimensionally speaking) topological structure which is either the same thing ("equivalent") or very close to it. You get most of the way there by thinking about how two things which look very different can be topologically the same based on their properties. A pretty accessible demonstration of that idea is the classical donut <-> coffee mug example on the Wikipedia page for homeomorphisms. [2]
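If it helps to see what "very close" means concretely: as I read the arXiv paper, the low-dimensional representation is chosen to minimize the fuzzy set cross entropy between the high-dimensional fuzzy topological representation μ and the low-dimensional one ν, summed over the 1-simplices (edges) e:

    C(\mu, \nu) = \sum_{e} \left[ \mu(e)\,\log\frac{\mu(e)}{\nu(e)}
                  + (1 - \mu(e))\,\log\frac{1 - \mu(e)}{1 - \nu(e)} \right]

The first term attracts points that should be close in the embedding; the second pushes apart points that should be far.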
This is a good summary. For a video that explains some of this (but which still hand waves), see the author's presentation at SciPy 2018:
https://www.youtube.com/watch?v=nq6iPZVUxZU
Is this actually capturing any properties of the original set, or is this a set of operations that will make any input look similar? (i.e. is this just a pretty picture with no real connection to the math.)
You're asking about two somewhat different things.
In the strict sense two things which are equivalent share the same properties, yes. This is the topological analogue of algebraic isomorphisms and analytic bijections. See the example about coffee mugs and donuts both being topological tori.
That being said I can't really comment on the potential artifacting details of this specific algorithm. In theory the overarching idea makes sense because if you find structure preserving maps between sets of varying dimensions you should expect relations within the set to be preserved (i.e. the relational information in the smaller set is equivalent, there's just less of it). But practically speaking not all datasets can be usefully abstracted to a manifold in this way, which means that (efficiently) finding an equivalent lower dimensional projection for the embedding might involve a fair amount of approximation.
With enough approximation you'll introduce spurious artifacts. But that's precisely where all the innovation comes in - finding ways to efficiently find equivalent structures with the representative data in fewer dimensions. This isn't the first proposal for topological dimension reduction (see information geometry); the devil is really in the details here.
It captures more of the spatial (metric/topological) arrangement of the set. The example they give in the paper is the MNIST dataset, where distinct-looking digits like 1 and 0 get separated farther apart and similar ones clump together, whereas t-SNE, while correctly delimiting the individual clusters, clumps them all together in one blob.
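For anyone who wants to eyeball that comparison, here is a rough Python sketch using the small scikit-learn digits set (8x8 images) as a stand-in for MNIST; labels are used only to color the points.

    import matplotlib.pyplot as plt
    import umap
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    digits = load_digits()

    umap_xy = umap.UMAP(random_state=42).fit_transform(digits.data)
    tsne_xy = TSNE(n_components=2, random_state=42).fit_transform(digits.data)

    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    for ax, xy, title in [(axes[0], umap_xy, "UMAP"), (axes[1], tsne_xy, "t-SNE")]:
        ax.scatter(xy[:, 0], xy[:, 1], c=digits.target, s=4, cmap="tab10")
        ax.set_title(title)
    plt.show()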
The last image in the post, which is in section 1.3.2.2, is formed from random numbers - for some definition of random. I imagine you would get different results depending on how the randomness is applied.
The color scale runs from min to max at the time each frame was rendered. Each frame adds 1000 new points to the previous set and halves the brightness of the points carried over from the previous frame. This creates the illusion of movement.
Looking at such visualizations of mathematical axioms, like these and the Ulam Spiral [0], gives me a kind of vague .. intuition? apprehension? feeling/fun-idea-to-muse-about, that maybe Reality began from these.
As in the root of the "What caused the Big Bang" or "but who created God?" questions. Stuff like 1+1=2 shouldn't need a root cause, and would give rise to patterns like these.
My conjecture is that the loops (especially the loops outside the main clump at the center) might correspond to newly-introduced prime factors.
As a new batch of integers is introduced in going from frame N to frame N+1, the prime numbers within that batch would become new points in the 2d projection, because they have not been seen before.
Then, as you go from frame N+1 to frame N+2, etc., the new prime factors from frame N start to re-occur in those successive frames, and new points are added to the loop.
There's a lot of interesting structure. Does this suggest some kind of structure to the space of prime factors or is it just trying to attach meaning where none exists?
Prime factors are defined by a pretty rigid structure; you see a factor of 2 every two numbers; a factor of 3 every three numbers; a factor of 5 every five numbers...
Ulam spirals are super interesting. I developed a visualization (and some explanation) once just to understand the problem a little better. Maybe this sparks some interest in someone :)
The site uses JavaScript to render the Markdown to HTML client-side. Reminds me of when I thought XML with client-side XSLT for rendering was a good idea, lol.
I enjoyed the plain markdown page though, very readable.
I see. I've set my browser to block third-party JavaScript; on inspection, it seems the author has decided to load the Markdown render script from what looks to be an analytics firm.
(https://casual-effects.com/markdeep/latest/markdeep.min.js)