
Very cool. In spite of all the deep learning hype, I’ve got deep (no pun intended) love in my heart for Kmeans and all of the unsupervised algorithms. I built a similar library in C# once upon a time, but your builder pattern is neat. I also didn’t use multi-threading, which is a VERY nice addition. Interested in your spherical and balanced variants. I don’t have intuition for why something like substituting cosine similarity for Euclidean distance would be meaningful (since you are clustering points and not vectors - although I understand mechanically how it’s carried out). Overall, kudos - looks nice at a glance!
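The mechanical part I mean is roughly this - a sketch of my own in Python (not the library’s actual API), where points are normalized onto the unit sphere so "closest centroid" becomes "most aligned direction":

    import numpy as np

    def spherical_kmeans_assign(points, centroids):
        # Normalize so every point and centroid lies on the unit sphere;
        # cosine similarity then reduces to a plain dot product.
        p = points / np.linalg.norm(points, axis=1, keepdims=True)
        c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        sims = p @ c.T                 # cosine similarities, shape (n_points, k)
        return sims.argmax(axis=1)     # assign each point to its most aligned centroid

I gather the usual motivation is high-dimensional embeddings, where direction tends to carry more signal than magnitude.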

Thanks sota_pop. I am looking for contributors to help me integrate new features. Please come and join me!

My understanding of model distillation is quite different, in that it trains another (typically smaller) model using the error between the new model’s output and that of the existing one - effectively capturing the existing model’s embedded knowledge and encoding it (ideally more densely) into the new one.
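For concreteness, the version I have in mind looks roughly like the classic soft-target setup - a minimal PyTorch-style sketch (the temperature and names are my own illustrative choices):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both output distributions, then penalize the student for
        # diverging from the teacher's distribution over classes/tokens.
        t = temperature
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        student_log_probs = F.log_softmax(student_logits / t, dim=-1)
        # The t*t factor keeps gradient magnitudes comparable across temperatures.
        return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)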


What I was referring to is similar in concept, but I've seen both described in papers as distillation. What I meant was that you take the output of a large model like GPT-4 and use that as training data to fine-tune a smaller model.


Yes, that does sound very similar. To my knowledge, isn’t that (effectively) how the latest DeepSeek breakthroughs were made? (i.e. by leveraging ChatGPT outputs to provide feedback for training the likes of R1)
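In code, that flavor is basically just supervised fine-tuning on teacher-generated pairs - something like this sketch (teacher_generate and the file layout are hypothetical placeholders, not any vendor’s API):

    import json

    def build_distillation_dataset(prompts, teacher_generate, out_path="distill.jsonl"):
        # teacher_generate is a stand-in for whatever client queries the large model.
        with open(out_path, "w") as f:
            for prompt in prompts:
                completion = teacher_generate(prompt)
                f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
        # The resulting (prompt, completion) pairs then serve as ordinary
        # supervised fine-tuning data for the smaller model.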


Not sure what you mean by “not trained to saturation”. Also, I agree with the article: in the literature, the phenomenon it refers to is known as “catastrophic forgetting”. Because no one has specific knowledge about which weights contribute to model performance, updating the weights via fine-tuning modifies the model such that future performance will change in ways that are not understood. I may be showing my age a bit here, but I always thought “fine-tuning” meant performing additional training on the output network (traditionally a fully-connected net) while leaving the initial portion (the “encoder”) weights unchanged - allowing the model to capture features the way it always has, but updating the way it generates outputs from those discovered features.
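To make that older meaning of “fine-tuning” concrete, the recipe I have in mind looks roughly like this (a PyTorch-style sketch; model.encoder and model.head are hypothetical attribute names):

    import torch.nn as nn

    def freeze_encoder(model: nn.Module):
        # Freeze the feature extractor so it keeps capturing features as before...
        for param in model.encoder.parameters():
            param.requires_grad = False
        # ...and leave only the output head trainable, so fine-tuning just updates
        # how outputs are generated from the discovered features.
        for param in model.head.parameters():
            param.requires_grad = True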


OK, so this intuition is actually a bit hard to unpack; I got it from bits and pieces. Start with this post: https://www.fast.ai/posts/2023-09-04-learning-jumps/. Essentially, a single pass over the training data is enough for an LLM to significantly "learn" the material. In fact, if you read the LLM training papers, for the largest models they generally state explicitly that they only did one pass over the training corpus, and sometimes not even the full corpus - only like 80% of it or whatever. The other relevant information is the loss curves: models like Llama 3 are not trained until the loss on the training data is minimized, the way typical ML models are. Rather, they rely on approximate estimates of FLOPS / tokens vs. performance on benchmarks. But it is pretty much guaranteed that if you continued to train on the training data it would continue to improve its fit - one pass is by no means enough to adequately learn all of the patterns. So from a compression standpoint, the paper I linked previously says that an LLM is a great compressor - but it's not even fully tuned, hence "not trained to saturation".

Now as far as how fine-tuning affects model performance, it is pretty simple: it improves fit on the fine-tuning data and decreases fit on the original training corpus. Beyond that, yeah, it is hard to say if fine-tuning will help you solve your problem. My experience has been that it always hurts generalization, so if you aren't getting reasonable results with a base or chat-tuned model, then fine-tuning further will not help, but if you are getting results then fine-tuning will make it more consistent.


Always appreciated the work of Jeremy Howard. Also had a lot of fun using the Fast.ai framework. My experience is similar to your description. When using 2, 3, or more epochs, I felt that overfitting started to emerge. (And I was CERTAINLY not training models anywhere near the size of modern LLMs.) I suppose in this case by “saturation” you meant training “marginally before exhibiting over-fitting” - something akin to “the elbow method” w.r.t. clustering algorithms? I’ll have to chew on your description of overfitting results for a while. It jibes with mine, but in a way that really makes me question my own - thanks for the thought-provoking response!


What a journey. Bravo, I am surely convinced beyond a reasonable doubt.


I haven’t personally tried it, but the high-level demos of “Khanmigo” created by Khan Academy seem really promising. I’ll always have a special place in my heart (and brain) for the work of Sal Khan and the folks at Khan Academy.


I disagree with this wholeheartedly. Sure, there is lots of trial and error, but it’s more an amalgamation of theory from many areas of mathematics including but not limited to: topology, geometry, game theory, calculus, and statistics. The very foundation (i.e. back-propagation) is just the chain rule applied to the weights. The difference is that deep learning has become such an accessible (read: profitable) field that many practitioners have the luxury of learning the subject without having to learn the origins of the formalisms. This ultimately allows them to utilize or “reinvent” theories and techniques, often without knowing they have been around in other fields for much longer.
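To spell out the chain-rule point, here’s a toy single-weight example worked by hand (purely illustrative):

    import numpy as np

    # Toy setup: one weight w, sigmoid activation, squared-error loss.
    #   loss(w) = (sigmoid(w * x) - y)^2
    #   dloss/dw = 2 * (sigmoid(w*x) - y) * sigmoid'(w*x) * x   <- just the chain rule
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grad_w(w, x, y):
        a = sigmoid(w * x)
        return 2 * (a - y) * a * (1 - a) * x

    w = 0.5
    for _ in range(100):                    # plain gradient descent on the single weight
        w -= 0.1 * grad_w(w, x=1.0, y=1.0)

Back-propagation is this same bookkeeping repeated layer by layer over millions of weights.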


None of the major aspects of deep learning came from manifolds though.

It is primarily linear algebra, calculus, probability theory, and statistics; secondarily, you could add something like information theory for ideas like entropy, loss functions, etc.

But really, if "manifolds" had never been invented/conceptualized, we would still have deep learning now; the concept made zero impact on the practical technology we are all using every day.


Loss landscapes can be viewed as manifolds. Adagrad/Adam adjust SGD to better fit the local geometry and are widely used in practice.
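For reference, a stripped-down sketch of the Adam update (standard default hyperparameters; this is the textbook form, not any particular library’s internals):

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # Running estimates of the first and second moments of the gradient.
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad ** 2
        # Bias correction for the zero-initialized moment estimates (t starts at 1).
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        # Per-parameter step size: larger where gradients have been small and steady,
        # smaller where they have been large or noisy - the "local geometry" adjustment.
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v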


Can you give an example where theories and techniques from other fields are reinvented? I would be genuinely interested for concrete examples. Such "reinventions" happen quite often in science, so to some degree this would be expected.


Bethe ansatz is one. It took a tour de force by Yedidia to recognize that loopy belief propagation computes a stationary point of Bethe's approximation to the free energy.

Many statistical thermodynamics ideas were reinvented in ML.

The same is true for mirror descent. It was independently discovered by Warmuth and his students as Bregman divergence proximal minimization, or, as a special case would have it, exponentiated gradient algorithms.

One can keep going.
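To make the mirror descent example concrete, the exponentiated gradient special case is only a few lines (a sketch; the weights live on the probability simplex and the updates are multiplicative rather than additive):

    import numpy as np

    def exponentiated_gradient_step(w, grad, lr=0.1):
        # Mirror descent with the negative-entropy mirror map (KL geometry);
        # ordinary SGD is the same scheme with the Euclidean mirror map.
        w = w * np.exp(-lr * grad)
        return w / w.sum()      # renormalize back onto the simplex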


The connections of deep learning to stat-mech and thermodynamics are really cool.

It's led me to wonder about the origin of the probability distributions in stat-mech. Physical randomness is mostly a fiction (outside maybe quantum mechanics) so probability theory must be a convenient fiction. But objectively speaking, where then do the probabilities in stat-mech come from? So far, I've noticed that the (generalised) Boltzmann distribution serves as the bridge between probability theory and thermodynamics: It lets us take non-probabilistic physics and invent probabilities in a useful way.


In Boltzmann's formulation of stat-mech it comes from the assumption that when a system is in "equilibrium", all the micro-states that are consistent with the macro-state are equally occupied. That's the basis of the theory. A prime mover is thermal agitation.

It can be circular if one defines equilibrium to be that situation when all the micro-states are equally occupied. One way out is to define equilibrium in temporal terms - when the macro-states are not changing with time.
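A toy illustration of the equal-occupation postulate (my own example): take N two-state spins, call the macro-state the number of "up" spins, and weight every compatible micro-state equally.

    from itertools import product
    from collections import Counter

    N = 4
    # Enumerate all 2^N micro-states and group them by macro-state (number of up spins).
    counts = Counter(sum(state) for state in product((0, 1), repeat=N))
    total = 2 ** N
    for macro, n_micro in sorted(counts.items()):
        # Equal occupation of micro-states => P(macro-state) is proportional to the
        # number of micro-states that realize it (here, binomial coefficients).
        print(macro, n_micro, n_micro / total)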


The Bayesian reframing of that would be that when all you have measured is the macrostate, and you have no further information by which to assign a higher probability to any compatible microstate than any other, you follow the principle of indifference and assign a uniform distribution.


Yes indeed, thanks for pointing this out. There are strong relationships between max-ent and Bayesian formulations.

For example, one can use a non-uniform prior over the micro-states. If that prior happens to be in the Darmois-Koopman family, that implicitly means there are some not explicitly stated constraints binding the micro-state statistics.


One might add 8- to 16-bit training and quantization. Also, computing semi-unreliable values with error correction. Such tricks have been used in embedded software development on MCUs for some time.
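As a bare-bones illustration of the low-bit idea, here is symmetric per-tensor int8 quantization (the simplest possible scale choice; real schemes are fancier):

    import numpy as np

    def quantize_int8(weights):
        # One scale for the whole tensor, mapping floats into [-127, 127].
        scale = max(np.max(np.abs(weights)) / 127.0, 1e-12)   # guard against all-zero tensors
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        # Recover an approximation of the original weights (lossy by design).
        return q.astype(np.float32) * scale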


I mean the entire domain of systems control is being reinvented by deep RL: system identification, stability, robustness, etc.


Good one. Slightly different focus but they really are the same topic. Historically, Control Theory has focused on stability and smooth dynamics while RL has traditionally focused on convergence of learning algorithms in discrete spaces.


I’ve always enjoyed this framing of the subject; the idea of mapping anything as hyperplanes existing in a solution space is one of the ideas that really blew my hair back during my academic studies. I would nitpick at your “dots in a circle” example (with the stoner-reference joke): I could be mistaken, but common practice isn’t to “move to a higher dimension” but to use a kernel (i.e. re-parameterize the points into the polar |r, theta> basis). All things considered, nice article.
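To spell out the nitpick with a toy example (my own, not from the article): for concentric “dots in a circle” data, re-expressing each point by its radius already makes the classes separable by a simple threshold, i.e. a linear boundary in the polar basis, no extra dimension required.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.uniform(0, 2 * np.pi, 200)
    # Inner class on a circle of radius 1, outer class on a circle of radius 3.
    inner = np.c_[np.cos(theta[:100]), np.sin(theta[:100])]
    outer = 3.0 * np.c_[np.cos(theta[100:]), np.sin(theta[100:])]

    # Polar re-parameterization: the radius alone separates the two classes.
    r_inner = np.linalg.norm(inner, axis=1)
    r_outer = np.linalg.norm(outer, axis=1)
    print((r_inner < 2).all(), (r_outer > 2).all())   # True True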


I'm pulling directly from Chris Olah's blog post with that example. But I will say that in practice, it's always surprising how increasing the dimensionality of a neural network magically solves all sorts of problems. You could use a kernel if you don't have more computation available, but given more computation, adding a dimension is strictly more flexible (and is capable of separating a much wider range of datasets).


Your explanation of finding a surface to separate good reasoning traces from bad reasoning traces in a high dimensional space worked as a great framing of the problem. It seems though that the surface will be fractal - the distance between a good trace and a bad trace could be arbitrarily small. If so then the work required to find and compute better and better surfaces will grow arbitrarily large. I wonder if there is a rigorous way to determine if the surface is fractal or not.


Always was a fan of Dilbert, but I especially enjoyed a short story he wrote called God’s Debris when I discovered it as a young undergrad. Sad news indeed.


I resonate with some of this. Personally, I am motivated and driven by understanding “why/how” things work. The actual implementations are rarely more valuable to me than an exercise I know is “good for me” - and one that will help me gain insight into “the next thing” more quickly and efficiently. One major drawback is that I have a bad habit of explaining how/why something works to the people in my life… even if they didn’t ask. Definitely something I am learning to filter.

Additionally, one of the most unsettling things I find about LLMs is the (now well-observed) phenomenon of hallucinations. As someone who is terrible at memorization and has gotten through life thus far in large part thanks to mental models - I didn’t realize until their popularization that I may or may not have regularly “hallucinated” things my entire life - especially when forming opinions about things. … makes you think …

Great article!

edit: I also find that the type of abstract thinking reinforced by writing software regularly is addictive. Once you learn how to abstract a thing in order to improve it or increase its efficiency, it starts a cycle of continually abstracting, then abstracting your abstractions, ad infinitum. It’s also a common bug I see in young CS students - they are fantastic problem solvers, but don’t realize that most (all?) of CS isn’t a thing - it’s the thing that gets you to the thing. Which is (I believe) why we have a generation of software engineers who all want to build platforms and marketplaces, with very few applications that ACTUALLY DO SOMETHING. They haven’t taken enough humanities courses, or gained the life experience, or something - to find the “REAL” problem they want to solve.


I have really come to enjoy using LINQ queries (specifically with lambda syntax) in the C# language. Although I’m a little biased because I work mostly in C#.

