This is one of my favorite topics in all of AI. It was the most surprising and mysterious discovery for me.
The answer is that the training process literally has to make the results smooth. That’s how training works.
Imagine you have 100 photos. Your job is to arrange them by color: you can place them however you want, as long as similar colors end up physically closer together.
You can imagine the result would look a lot like a Photoshop RGB color picker, which is smooth.
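To make that concrete, here’s a toy sketch (plain numpy, entirely made up, not anything from a real model): reduce 100 random colors to 2D with PCA, and nearby positions end up with nearby colors. That’s the smoothness.

```python
import numpy as np

# 100 "photos", each reduced to its average RGB color (a toy stand-in).
rng = np.random.default_rng(0)
colors = rng.random((100, 3))          # rows are (r, g, b) in [0, 1]

# PCA down to 2D: center the data, then project onto the top two
# principal directions. Similar colors land near each other.
centered = colors - colors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T           # (100, 2) layout positions

# The photo laid out closest to photo 0 also has a very similar color.
nearest = np.argsort(np.linalg.norm(coords - coords[0], axis=1))[1]
print("layout distance:", np.linalg.norm(coords[0] - coords[nearest]))
print("color distance: ", np.linalg.norm(colors[0] - colors[nearest]))
```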
The surprise is, this works for any kind of input. Even text paired with images.
The key is the loss function (a horrible name). In the color picker example, the loss function would be how similar two colors are. In the text-to-image example, it’s how dissimilar the input examples are from each other (contrastive loss). The brilliance of that is, pushing dissimilar pairs apart is the same thing as pulling similar pairs together, when you train for a long time on millions of examples. Electrons are all trying to push each other apart, but your body is still smooth.
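Here’s a minimal sketch of what that objective looks like (my own toy numpy code, not OpenAI’s): a CLIP-style symmetric cross-entropy over a batch of matched image/text embeddings, where raising the true-pair similarities necessarily pushes the mismatched pairs down.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style loss for a batch of N matched (image, text) embedding pairs."""
    # Normalize so similarity is just a dot product (cosine similarity).
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature    # (N, N); the diagonal holds true pairs
    labels = np.arange(len(logits))

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)               # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Symmetric: image-to-text and text-to-image. Under the softmax, pulling
    # the diagonal (matched pairs) up is the same as pushing the off-diagonal
    # (mismatched pairs) down.
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2

# Toy usage: a random batch of 8 pairs of 512-dim embeddings.
rng = np.random.default_rng(0)
print(contrastive_loss(rng.standard_normal((8, 512)), rng.standard_normal((8, 512))))
```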
The reason it’s brilliant is because it’s far easier to measure dissimilar pairs than to come up with a good way of judging “does this text describe this image?” — you definitely know that it isn’t a bicycle, but you might not know whether a car is a Corvette or a Tesla. But both the Corvette and the Tesla will be pushed away from text that says it’s a bicycle, and toward text that says it’s a car.
That means for a well-trained model, the input is by definition smooth with respect to the output, the same way that a small change in {latitude, longitude} in real life corresponds to a small change in the culture of that region of the world.
It doesn’t exist. The above explanation is the result of me spending almost all of my time immersing myself in ML for the last three years.
gwern helped too. He has an intuition for ML that I’m still jealous of.
Your best bet is to just start building things and worry about explanations later. It’s not far from the truth to say that even the most detailed explanation is still a longform way of saying “we don’t really know.” Some people get upset and refuse to believe that fundamental truth, but I’ve always been along for the ride more than the destination.
It’s never been easier to dive in. I’ve always wanted to write detailed guides on how to start, and how to navigate the AI space, but somehow I wound up writing an ML fanfic instead: https://blog.gpt4.org/jaxtpu
(Fun fact: my blog runs on a TPU.)
I’m increasingly of the belief that all you need is a strong desire to create things, and some resources to play with. If you have both of those, it’s just a matter of time — especially putting in the time.
That link explains how to get the resources. But I can’t help with how to get a desire to create things with ML. Mine was just a fascination with how strange computers can be when you wire them up with a small dose of calculus that I didn’t bother trying to understand until two years after I started.
(If you mean contrastive loss specifically, https://openai.com/blog/clip/ is decent. But it’s just a droplet in the pond of all the wonderful things there are to learn about ML.)
IMO the term "cost function" is much more intuitive than "loss function" - it tells you the cost, which you then attempt to minimize by some iterative process (in this case, training).
I actually completely lost interest once I found this out. Simply taking an ML course, like the old Andrew Ng courses online, is enough for you to get the general idea.
ML is simply curve fitting. It's an applied math problem that's quite common. In fact, I lost a lot of interest in intelligence in general once I realized this was all that was going on. The implication is that all of intelligence is really some form of curve fitting.
The simplest form of this is linear regression, which is used to derive the equation of a line from a set of 2D points. All ML is basically a 10,000-dimensional (or much higher) extension of that. The magic is lost.
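For instance, the 2D case in plain numpy (a toy sketch of the curve-fitting view, nothing more):

```python
import numpy as np

# A set of 2D points that roughly follow y = 2x + 1, plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.shape)

# Least-squares fit of a line: the simplest possible "curve fitting".
slope, intercept = np.polyfit(x, y, deg=1)
print(f"fit: y = {slope:.2f}x + {intercept:.2f}")
```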
Most of ML research is just finding the most efficient way to fit the best curve given the fewest data points. An ML person's knowledge is centered around a bunch of tricks and techniques to achieve that goal with some N-dimensional template equation. And the general template equation is always the same: a neural network. The answer to what intelligence is seems to be quite simple and not that profound at all... which makes sense, given that we're able to create things like DALL-E in such a short time frame.
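And a caricature of that "template equation" (again a hand-rolled toy, not how real systems are trained): a two-layer network fit to points on a sine curve by gradient descent.

```python
import numpy as np

# The template: y = tanh(x @ W1 + b1) @ W2 + b2, with the parameters found
# by gradient descent on a squared-error loss.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(256, 1))
y = np.sin(x)                                  # the curve we want to fit

W1, b1 = rng.normal(size=(1, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

lr = 0.01
for step in range(5000):
    h = np.tanh(x @ W1 + b1)                   # hidden layer
    pred = h @ W2 + b2
    err = pred - y

    # Backpropagation by hand (gradients of the mean squared error).
    gW2 = h.T @ err / len(x)
    gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)
    gW1 = x.T @ dh / len(x)
    gb1 = dh.mean(axis=0)

    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

print("final mean squared error:", float((err ** 2).mean()))
```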
One of the big mysteries of the universe (intelligence), and the thing I always wondered about, was essentially answered within the last two decades, which is pretty cool... but it's like knowing the secret behind an amazing magic trick.
But for me, that means it’s far more interesting than AGI. Everyone has their eye on AGI, and no one seems to be taking ML at face value. That means the first companies to do it will stand to make a fortune.
Why do people use analogies to prove a point? It doesn't prove anything.
What was your point here? That ML is like a guitar? What you said doesn't seem to contradict anything I said, other than that you find curve fitting interesting and I don't.
Not trying to be offensive here, don't take it the wrong way.
On a side note, your music example also basically destroys the question of what music is. The answer is that music is the set of all points on some sort of N-dimensional curve. The profoundness of the question is completely gone.
This is largely a pointless semantic debate, but to risk wasting my time: transformers specifically in LLMs are doing more than curve fitting as there can never be enough training data to naively interpolate between training examples to construct the space of semantically valid text strings. To find meaningful sentences between training examples, the intrinsic regularity of the underlying distribution must be modeled. This is different from merely curve fitting. To drive this point home, some examinations of transformer behavior in LLMs show emergent structures that capture arbitrary patterns in the input and utilize them in constructing the output. This is not merely curve fitting.
Bro. The basic idea of computer logic has always been trivial, even without an understanding of binary. There are tons of mechanisms outside of binary that can mimic logic; there was never anything mysterious here. Understanding boolean logic and architecture is not a far leap from an intuitive understanding of how computers work.
Human thought and human intelligence, on the other hand, were a great and epic concept on the scale of the origin of the universe. It was truly this mysterious, epic thing that seemed like something we would never crack. ML brought it down, reducing and simplifying the concept on a massive scale. The entire field is now an extension of this curve-fitting concept. And the disappointing thing is that the field is correct. That's all intelligence is in the end.
This is all I mean. I'm not saying ML is less interesting or easier than any other STEM field. All I'm saying is the reduction was massive. The progress is amazing, but 99% of the wonder was lost. The gap in our understanding was closed in a single step, and now the average person can understand the basics more easily than they can understand something like quantum mechanics. There's still a lot going on in terms of things to discover and things to engineer, but the fundamentals of what's going on are clearer than ever before.
So I think what happened here is that you mistook what I wrote and took offense as if I were attacking the field. I'm not. I'm writing this to explain to you that you're mistaken.
So dial your aggressive shit back. Is everyone from Romania like you? I certainly hope not.
Usually when people say "ML is just curve fitting" they mean to continue with something like "so it will never be able to compete with humans."
The interesting thing to me about the secret language is that it seems to imply that when DALL-E fit words to concepts, it created extrapolations in its curve fit that are more extreme than the actual training samples, i.e. its fit has out-of-domain extrema. So there are letter sequences that are more "a whale on the moon" than the actual text "a whale on the moon." A linguistic superstimulus.
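One way to picture that (a completely made-up toy, not DALL-E's actual embedding space or how the secret language was found): with invented vectors, extrapolating past a real caption's embedding along its concept direction scores higher on that concept than the real caption does.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64

# Made-up stand-ins: a learned "concept direction" for "a whale on the moon",
# the embedding of the real caption, and an unrelated prompt's embedding.
concept = rng.normal(size=D); concept /= np.linalg.norm(concept)
caption_emb = 0.8 * concept + 0.6 * rng.normal(size=D) / np.sqrt(D)
unrelated_emb = rng.normal(size=D) / np.sqrt(D)

def whale_on_moon_score(e):
    return float(e @ concept)      # how strongly the concept is "seen"

# Extrapolate past the real caption, away from the unrelated prompt.
superstimulus = caption_emb + 2.0 * (caption_emb - unrelated_emb)

print("real caption:  ", whale_on_moon_score(caption_emb))
print("superstimulus: ", whale_on_moon_score(superstimulus))
# A gibberish token string that happens to encode near the extrapolated point
# would be "more whale-on-moon" than the words "a whale on the moon".
```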
Yes, I can confirm that's how I read the "just curve fitting" bit.
Regarding the gibberish-word-to-image issue: CLIP uses a text transformer trained by contrastive matching against images. That makes it different from GPT, which is trained to predict the probability of the next word. GPT would easily tell gibberish words from real words, or spot incorrect syntax, because they would be low-probability sequences. The CLIP text transformer doesn't do that because of the task formulation, not because of an intrinsic limitation. It's not so mysterious once you realise they could have used a different approach that gives both the text embedding and a gibberish filter, if they had wanted to.
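To illustrate (a sketch assuming the Hugging Face transformers library and the public gpt2 checkpoint; the gibberish string below is made up): a causal LM assigns nonsense a much higher perplexity than real text, which is exactly the signal CLIP's contrastive objective never asks its text encoder to produce.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return its own next-token loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("a whale on the moon"))   # plausible English: lower perplexity
print(perplexity("wazzle frond gribbix"))  # made-up gibberish: much higher
# A CLIP text encoder happily embeds both strings; its training objective
# never required it to judge how probable the text is.
```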
A good analogy would be a Rorschach test: show an out-of-distribution image to a human and ask them to caption it. They will still say something about the image, just like DALL-E will draw a fake word. The human is expected to generate a phrase whether or not the image makes sense, and DALL-E is under a similar demand. The task formulation explains the result.
The mapping from a nonsense word to an image is explained by the continuous embedding space of the prompt and the diffusion model's ability to generate images from noise. Any point in the embedding space, even a random one, falls closer to some concepts and further from others. The lucky concept most similar to the random embedding triggers the image generation.
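A toy version of that last point (made-up vectors, not CLIP's real space): even a random embedding is nearest to *some* concept, and that concept is what ends up conditioning the image.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32

# Made-up concept embeddings a model might have learned.
concepts = {name: rng.normal(size=D)
            for name in ["bird", "whale", "bicycle", "moon"]}

def nearest_concept(embedding):
    """Cosine similarity against every concept; the closest one wins."""
    sims = {name: float(embedding @ vec /
                        (np.linalg.norm(embedding) * np.linalg.norm(vec)))
            for name, vec in concepts.items()}
    return max(sims, key=sims.get), sims

# A gibberish prompt still lands *somewhere* in the embedding space...
random_embedding = rng.normal(size=D)
winner, sims = nearest_concept(random_embedding)
print(winner, {k: round(v, 3) for k, v in sims.items()})
# ...so generation is conditioned on whichever concept it happens to fall
# nearest to, and you get a coherent-looking image for a nonsense word.
```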
Usually, except I went on to elaborate that curve fitting is essentially what intelligence is. If Mr. Genius here had read my post more carefully, he wouldn't have had to reveal how immature he is with those comments.
For example, I could write a heuristic algorithm to produce the same thing using a Google image search, but it would look like MS Word clip art.