> A very pretty structure emerges; this might be spurious in that it captures more about the layout algorithm than any "true" structure of numbers.
What would it look like if we used PCA rather than UMAP? PCA is simpler than UMAP, so it's in some sense "less arbitrary". If the image is similar then we know we're seeing something about numbers rather than about our methods.
PCA is dimensionally invalid: it destroys structure rather than preserving it, and it consists of arbitrary linear algebra operations. It is "less arbitrary" the way x86 assembly is "less arbitrary" with respect to C (in fact it just ties you to a certain mode of thinking).
I don't think "arbitrary linear algebra operations" is a valid critique. If you understand PCA as "take the SVD of the data", then the operations seem arbitrary. But if you understand it as, "construct a low-rank approximation in the L2 sense to the data, or its covariance", then it's not.
Also, I don't think that the (very legitimate) "dimensional" critique of PCA applies here. The units on the coordinates of the representation are the same: the presence or absence of that prime factor.
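Not from the thread, but a small numpy sketch of that second reading (random data here, purely illustrative): the truncated SVD of the centered data is exactly the best rank-k approximation in the least-squares sense (Eckart-Young), and its right singular vectors are the principal components.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    Xc = X - X.mean(axis=0)               # center the data

    # PCA via SVD: the right singular vectors are the principal directions,
    # and truncating to rank k gives the best rank-k approximation of Xc
    # in the least-squares (Frobenius / L2) sense.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = 3
    Xc_k = (U[:, :k] * s[:k]) @ Vt[:k]    # rank-k reconstruction of the data
    scores = Xc @ Vt[:k].T                # projection onto the top-k PCs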
To the original question: my suspicion is that PCA might pull out the even numbers (first PC) and the numbers divisible by 3 (second PC), because these two factors may explain the most variability in the underlying vector representation. If it did, that would be pretty intuitive, although not as interesting.
---
Edited to add: Suspicion turned out to be true. For the first 2000 integers, the top 6 PCs turned out to correspond to the first 6 primes (2, 3, 5, 7, 11, 13).
    function [nums, pcs] = pca_prime(nMax, nPC)
        nums = zeros(nMax, nMax);
        for k = 2:nMax
            nums(k, factor(k)) = 1;  % vector representation of "k"
        end
        % 2:end because we don't care about 1 as a "prime"
        pcs = pca(nums(2:end, :), 'NumComponents', nPC);
    end
    --
    [nums, pcs] = pca_prime(2000, 10);  % "svd" would work too
    plot(pcs(:, 1:6));                  % first 6 PCs
Smart observation. Another way to say it is that, for distinct primes p1 and p2, the events “p1 divides n” and “p2 divides n” are approximately statistically independent. So you get a near-diagonal covariance with entries as you wrote.
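Not from the original comments, but a quick numerical check in Python (here the indicator matrix has one column per prime, rather than one column per integer as in the MATLAB snippet above): the variance of the indicator of "p divides n" comes out near (1/p)(1 - 1/p), and the covariances between distinct primes stay close to zero.

    import numpy as np

    N = 2000
    primes = [p for p in range(2, N + 1)
              if all(p % d for d in range(2, int(p**0.5) + 1))]

    # Indicator matrix: row for n = 2..N, column j is 1 iff primes[j] divides n.
    X = np.zeros((N - 1, len(primes)))
    for i, n in enumerate(range(2, N + 1)):
        for j, p in enumerate(primes):
            if n % p == 0:
                X[i, j] = 1

    C = np.cov(X, rowvar=False)
    print(np.diag(C)[:4])                         # ~ (1/p)(1 - 1/p) for p = 2, 3, 5, 7
    print(np.abs(C - np.diag(np.diag(C))).max())  # off-diagonal entries stay small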
> If you understand PCA as "take the SVD of the data", then the operations seem arbitrary. But if you understand it as, "construct a low-rank approximation in the L2 sense to the data, or its covariance", then it's not.
Those are the same thing. Is your first case just a reader who has no idea what the SVD is?
Yes, they are both descriptions of the same thing. I'm trying to say that PCA does have a justification. It's not just an "arbitrary linear algebra operation", although the application of the SVD algorithm to perform PCA can be presented that way.
Yes. See the plot: for example, PC #1 is essentially a 0/1 vector with all its weight (1 in this case) placed on the "2" column, which is the representation of the prime number 2 in the scheme used in the OP.
Without (arbitrary) scaling normalization, PCA gives different results when the dimensional units change; that is, the principal components depend on the choice of units or scaling.
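A quick numpy illustration of that point (synthetic data, just to show the effect): multiplying one column by a large constant, i.e. changing its units, changes which direction the first principal component points in.

    import numpy as np
    from numpy.linalg import svd

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3))
    X -= X.mean(axis=0)

    Xs = X.copy()
    Xs[:, 0] *= 1000.0                    # same data, column 0 in different units

    # First principal direction before and after the unit change.
    pc1 = svd(X, full_matrices=False)[2][0]
    pc1_scaled = svd(Xs, full_matrices=False)[2][0]
    print(pc1)
    print(pc1_scaled)                     # now dominated by the rescaled column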
Oh okay. I did know about that weakness of PCA. But it seems like in this case letting one prime factor correspond to one unit of distance is a nonarbitrary scaling, so I would expect PCA to give sensible results.
The prime factors of a number make for the ultimate high-dimensional space.
Damn. The more I see UMAP, the more I think it is going to be a central and generic tool for high-dimensional analysis. I haven't taken the time to go in depth into it yet, though :/
So far, my understanding of it is: t-SNE on steroids
* t-SNE is great for local proximity, but it 'rips apart' high-dimensional global structure too early. UMAP handles both scales by using transformations to map the overlapping points of the different locally relevant lower-dimensional patches.
* It is faster than t-SNE and scales better.
* t-SNE is about moving the points, whereas UMAP is about finding the transformations that move the points, which means:
a) it yields a model that you can use to create embeddings for unseen data. That makes it possible to share your work by contributing to public model zoos.
b) you can also do supervised dimension reduction as you create your embedding. I.e., you can judge whether the shape looks good for unseen data (a.k.a. whether it generalizes well), and then correct the embedding by choosing which unseen instances to add to the training set. This lets you control the cost of labeling data: you can see where your errors are and back-propagate them to the collection process in a cost-effective manner, even for high-dimensional data.
* You can choose your metric! Specify a distance function and you're good to go. Haversine for a great orange peeling, Levenshtein for visualizing word spellings (and maybe providing an embedding for ML-based spell checking?).
* You can choose the output space to have more than 2 or 3 dimensions, in order to stop the compression at a specified level. (A minimal code sketch of these options follows below.)
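A minimal umap-learn (Python) sketch of the options above; the data, labels, and parameter values here are placeholders for illustration only, not anything from the original comment.

    import numpy as np
    import umap

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1000, 50))    # "seen" high-dimensional data
    y_train = rng.integers(0, 5, size=1000)  # labels for the supervised variant
    X_new = rng.normal(size=(100, 50))       # "unseen" data

    # Supervised reduction (point b): pass labels via y.
    # n_components > 2 stops the compression at 5 dimensions instead of a 2-D plot,
    # and metric can be any supported name ('cosine' here; 'haversine' works for
    # lat/lon data) or a custom numba-compilable distance function.
    reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine",
                        random_state=42)
    embedding = reducer.fit_transform(X_train, y=y_train)

    # Point a): the fitted model can embed unseen data without refitting.
    new_embedding = reducer.transform(X_new)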
I believe it will replace t-SNE in the long term.
Here is a great video of the author presenting his work:
> * You can choose your metric! Specify a distance function and you're good to go. Haversine for a great orange peeling, Levenshtein for visualizing word spellings (and maybe providing an embedding for ML-based spell checking?)
t-SNE, at least the implementation I usually work with (the R Rtsne package), happily accepts any distance matrix as input. I have successfully used all kinds of distance measures with it.
That’s why I love HN: you start reading a post along with the comments, and you see nice links like this one that take you down another (similar) road ;-)
If what you're asking about is the math, the steps are (essentially) as follows:
1. A Riemannian manifold is constructed from the dataset.
2. The manifold is approximately mapped to an n-dimensional topological structure.
3. The reduced embedding is an (n - k)-dimensional projection equivalent to the initial topological structure, where k is the number of dimensions you'd like to reduce by.
I don't know how well that answers your question because it's difficult to simplify the math beyond that. But you can also check out the paper on arXiv. [1]
The underlying idea is to transform the data into a topological representation, analyze its structure, then find a much smaller (dimensionally speaking) topological structure which is either the same thing ("equivalent") or very close to it. You get most of the way there by thinking about how two things which look very different can be topologically the same based on their properties. A pretty accessible demonstration of that idea is the classical donut <-> coffee mug example on the Wikipedia page for homeomorphisms. [2]
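If it helps to see what "very close" means concretely: as I read the arXiv paper, the low-dimensional representation is chosen to minimize the fuzzy set cross entropy between the high-dimensional fuzzy topological representation μ and the low-dimensional one ν, summed over the 1-simplices (edges) e:

    C(\mu, \nu) = \sum_{e} \left[ \mu(e)\,\log\frac{\mu(e)}{\nu(e)}
                  + (1 - \mu(e))\,\log\frac{1 - \mu(e)}{1 - \nu(e)} \right]

The first term attracts points that should be close in the embedding; the second pushes apart points that should be far.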
This is a good summary. For a video that explains some of this (but which still hand waves), see the author's presentation at SciPy 2018:
https://www.youtube.com/watch?v=nq6iPZVUxZU
Is this actually capturing any properties of the original set, or is this a set of operations that will make any input look similar? (i.e. is this just a pretty picture with no real connection to the math.)
You're asking about two somewhat different things.
In the strict sense two things which are equivalent share the same properties, yes. This is the topological analogue of algebraic isomorphisms and analytic bijections. See the example about coffee mugs and donuts both being topological tori.
That being said I can't really comment on the potential artifacting details of this specific algorithm. In theory the overarching idea makes sense because if you find structure preserving maps between sets of varying dimensions you should expect relations within the set to be preserved (i.e. the relational information in the smaller set is equivalent, there's just less of it). But practically speaking not all datasets can be usefully abstracted to a manifold in this way, which means that (efficiently) finding an equivalent lower dimensional projection for the embedding might involve a fair amount of approximation.
With enough approximation you'll introduce spurious artifacts. But that's precisely where all the innovation comes in - finding ways to efficiently find equivalent structures with the representative data in fewer dimensions. This isn't the first proposal for topological dimension reduction (see information geometry); the devil is really in the details here.
It captures more of the spatial (metric/topological) arrangement of the set. The example they give in the paper is the MNIST dataset, where distinct-looking digits like 1 and 0 get separated farther apart and similar ones clump together, whereas t-SNE, while correctly delimiting the individual clusters, clumps them all together in one blob.
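For anyone who wants to eyeball that comparison, here is a rough Python sketch using the small scikit-learn digits set (8x8 images) as a stand-in for MNIST; labels are used only to color the points.

    import matplotlib.pyplot as plt
    import umap
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    digits = load_digits()

    umap_xy = umap.UMAP(random_state=42).fit_transform(digits.data)
    tsne_xy = TSNE(n_components=2, random_state=42).fit_transform(digits.data)

    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    for ax, xy, title in [(axes[0], umap_xy, "UMAP"), (axes[1], tsne_xy, "t-SNE")]:
        ax.scatter(xy[:, 0], xy[:, 1], c=digits.target, s=4, cmap="tab10")
        ax.set_title(title)
    plt.show()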
The last image in the post, which is in section 1.3.2.2, is formed from random numbers - for some definition of random. I imagine you would get different results depending on how the randomness is applied.
The color scale runs from min to max at the time each frame was rendered. Each frame adds 1000 new points to the previous set and halves the brightness of the points carried over from the previous frame. This creates the illusion of movement.
Looking at such visualizations of mathematical axioms, like these and the Ulam Spiral [0], gives me a kind of vague .. intuition? apprehension? feeling/fun-idea-to-muse-about, that maybe Reality began from these.
As in the root of the "What caused the Big Bang" or "but who created God?" questions. Stuff like 1+1=2 shouldn't need a root cause, and would give rise to patterns like these.
My conjecture is that the loops (especially the loops outside the main clump at the center) might correspond to newly-introduced prime factors.
As a new batch of integers is introduced in going from frame N to frame N+1, the prime numbers within that batch would become new points in the 2d projection, because they have not been seen before.
Then, as you go from frame N+1 to frame N+2, etc., the new prime factors from frame N start to re-occur in those successive frames, and new points are added to the loop.
There's a lot of interesting structure. Does this suggest some kind of structure to the space of prime factors or is it just trying to attach meaning where none exists?
Prime factors are defined by a pretty rigid structure; you see a factor of 2 every two numbers; a factor of 3 every three numbers; a factor of 5 every five numbers...
Ulam spirals are super interesting. I developed a visualization (and some explanation) once just to understand the problem a little better. Maybe this sparks some interest in someone :)
The site uses JavaScript to render the Markdown to HTML client-side. Reminds me of when I thought XML with client-side XSLT for rendering was a good idea, lol.
I enjoyed the plain markdown page though, very readable.
I see. I've set my browser to block third-party JavaScript; on inspection, it seems the author has decided to load the Markdown render script from what looks to be an analytics firm.
(https://casual-effects.com/markdeep/latest/markdeep.min.js)