
The most interesting part of DeepSeek's R1 release isn't just the performance - it's their pure RL approach without supervised fine-tuning. This is particularly fascinating when you consider the closed vs open system dynamics in AI.

Their model crushes it on closed-system tasks (97.3% on MATH-500, 2029 Codeforces rating) where success criteria are clear. This makes sense - RL thrives when you can define concrete rewards. Clean feedback loops in domains like math and coding make it easier for the model to learn what "good" looks like.

What's counterintuitive is they achieved this without the usual supervised learning step. This hints at a potential shift in how we might train future models for well-defined domains. The MIT license is nice, but the real value is showing you can bootstrap complex reasoning through pure reinforcement.

The challenge will be extending this to open systems (creative writing, cultural analysis, etc.) where "correct" is fuzzy. You can't just throw RL at problems where the reward function itself is subjective.
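
To make that concrete, here is a toy contrast (purely illustrative, not from the paper): for math, the reward is a one-line programmatic check, while for creative writing any scoring rule you write down is already a subjective proxy that can be gamed.

    # Toy contrast between a closed-system reward and an open-system one (illustrative only).
    def math_reward(model_answer: str, gold: str) -> float:
        # Clean feedback loop: correctness is unambiguous.
        return 1.0 if model_answer.strip() == gold.strip() else 0.0

    def creative_writing_reward(story: str) -> float:
        # No ground truth exists; any rule here (length? vocabulary? a judge model?)
        # encodes someone's taste and can be exploited. This stub just flags the problem.
        raise NotImplementedError("subjective: needs a learned or human reward proxy")

    print(math_reward("42", "42"))  # 1.0, no ambiguity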

This feels like a "CPU moment" for AI - just as CPUs got really good at fixed calculations before GPUs tackled parallel processing, we might see AI master closed systems through pure RL before cracking the harder open-ended domains.

The business implications are pretty clear - if you're working in domains with clear success metrics, pure RL approaches might start eating your lunch sooner than you think. If you're in fuzzy human domains, you've probably got more runway.


Interestingly, this point was made by Karpathy last summer: RLHF is barely RL. He said it would be very difficult to apply pure reinforcement learning to open domains. RLHF is a shortcut to fill this gap, but because the reward model is trained on human vibe checks, the LLM can easily game the RM by giving misleading responses or otherwise exploiting the system.
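
A toy sketch of what "gaming the RM" can look like (entirely made up, not any real reward model): a proxy scorer that keys on surface features will happily rank a confident-but-wrong answer above an honest hedge.

    # Hypothetical proxy reward based on surface features (length, confident wording)
    # rather than correctness - the kind of signal a policy can learn to exploit.
    def proxy_reward(response: str) -> float:
        confident_words = {"clearly", "definitely", "obviously"}
        length_bonus = min(len(response.split()), 50) / 50
        confidence_bonus = sum(w in response.lower() for w in confident_words)
        return length_bonus + confidence_bonus  # note: no term for being right

    honest = "I'm not sure, but the answer might be 42."
    gaming = "The answer is clearly and definitely 41, as is obviously well established."

    print(proxy_reward(honest))  # lower score
    print(proxy_reward(gaming))  # higher score, despite being wrong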

Importantly the barrier is that open domains are too complex and too undefined to have a clear reward function. But if someone cracks that — meaning they create a way for AI to self-optimize in these messy, subjective spaces — it'll completely revolutionize LLMs through pure RL.

Here's the link to the tweet: https://x.com/karpathy/status/1821277264996352246


The whole point of RLHF is to make up for the fact that there is no loss function for a good answer in terms of token ids or their order. A good answer can come in many different forms and shapes.

That’s why all those models fine tuned on (instruction, input, answer) tuples are essentially lobotomized. They’ve been told that, for the given input, only the output given in the training data is correct, and any deviation should be “punished”.

In truth, for each given input, there are many examples of output that should be reinforced, many examples of output that should be punished, and a lot in between.

When BF Skinner used to train his pigeons, he’d initially reinforce any tiny movement that at least went in the right direction. For example, instead of waiting for the pigeon to peck the lever directly (which it might not do for many hours), he’d give reinforcement if the pigeon so much as turned its head towards the lever. Over time, he’d raise the bar. Until, eventually, only clear lever pecks would receive reinforcement.

We should be doing the same when taming LLMs from their pretraining as document completers into assistants.
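
As a rough sketch of what that Skinner-style shaping could look like for an LLM (hypothetical criteria and thresholds, not anyone's actual training code): early on, give partial credit for anything that moves toward an answer; later, only the exact, well-formatted answer gets reinforced.

    import re

    def shaped_reward(response: str, training_step: int, gold: str = "4") -> float:
        boxed = re.search(r"\\boxed\{(.+?)\}", response)
        exact = boxed is not None and boxed.group(1).strip() == gold

        if training_step < 1000:            # early phase: reinforce any step in the right direction
            if exact:
                return 1.0
            if gold in response:            # right number, wrong format
                return 0.5
            if re.search(r"\d", response):  # at least produced a number
                return 0.1
            return 0.0
        return 1.0 if exact else 0.0        # late phase: only clear "lever pecks" count

    print(shaped_reward("I think it's 4", training_step=100))   # 0.5
    print(shaped_reward("I think it's 4", training_step=5000))  # 0.0
    print(shaped_reward(r"\boxed{4}", training_step=5000))      # 1.0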


Layman question here since this isn't my field: how do you achieve success on closed-system tasks without supervision? Surely at some point along the way, the system must understand whether its answers and reasoning are correct.


In their paper, they explain that "in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases."

Basically, they have an external source-of-truth that verifies whether the model's answers are correct or not.
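
A minimal sketch of how such rule-based verification might work (my reading of the description, not their code; the `solution` entry-point name and test format are assumed): parse the boxed final answer for math, or run the generated function against predefined test cases for code.

    import re

    def check_math(model_output: str, gold_answer: str) -> bool:
        # Reward only if the final boxed answer matches the ground truth exactly.
        m = re.search(r"\\boxed\{(.+?)\}", model_output)
        return m is not None and m.group(1).strip() == gold_answer.strip()

    def check_code(generated_source: str, test_cases: list) -> bool:
        # Execute the candidate solution and run it against predefined test cases.
        namespace = {}
        try:
            exec(generated_source, namespace)
            solution = namespace["solution"]  # assumed entry-point name
            return all(solution(*args) == expected for args, expected in test_cases)
        except Exception:
            return False

    print(check_math(r"... so the answer is \boxed{42}", "42"))                 # True
    print(check_code("def solution(x): return x * 2", [((3,), 6), ((0,), 0)]))  # True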


You're totally right that there must be supervision; it's just a matter of how the term is used.

"Supervised learning" for LLMs generally means the system sees a full response (eg from a human expert) as supervision.

Reinforcement learning is a much weaker signal: the system has the freedom to construct its own response / reasoning, and only gets feedback at the end whether it was correct. This is a much harder task, especially if you start with a weak model. RL training can potentially struggle in the dark for an exponentially long period before stumbling on any reward at all, which is why you'd often start with a supervised learning phase to at least get the model in the right neighborhood.
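
A toy illustration of the difference in signal density (with a uniform stand-in "policy", nothing like a real LLM): supervised fine-tuning gets a dense per-token loss against a full reference answer, while RL only gets a single scalar reward once the whole response has been generated.

    import math, random

    VOCAB = ["2", "3", "4", "+", "=", "<eos>"]  # toy vocabulary

    def policy_prob(token, prefix):
        # Uniform stand-in for p(token | prefix); a real model would be learned.
        return 1.0 / len(VOCAB)

    def sample_response(max_len=5):
        # Under RL, the model is free to construct its own response.
        return [random.choice(VOCAB) for _ in range(max_len)]

    # Supervised signal: dense, per-token cross-entropy against a full reference answer.
    def sft_loss(reference_tokens):
        return -sum(math.log(policy_prob(t, reference_tokens[:i]))
                    for i, t in enumerate(reference_tokens))

    # RL signal: one scalar at the very end, only saying whether the answer was right.
    def rl_reward(response_tokens, gold_answer="4"):
        return 1.0 if gold_answer in response_tokens else 0.0

    print("SFT loss on reference '2 + 2 = 4':", sft_loss(["2", "+", "2", "=", "4"]))
    print("RL reward on a sampled response:", rl_reward(sample_response()))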


They use other models to judge correctness, and when possible just ask the model to output something that can be directly verified, like math equations that can be checked 1:1 against the correct answer.


> the real value is showing you can bootstrap complex reasoning through pure reinforcement.

This made me smile, as I thought (non-snarkily) that's what living beings do.


This! And the truth is, are there really that many corporate domains without "clear success metrics"?


You also need to be able to test your solution to see how successful it is.

In some domains that is harder than in math and code.


True. I think simulations will help a lot in that direction. Imagine if you could do RL, a bit like DeepSeek did for R1, but on corporate tasks. https://open.substack.com/pub/transitions/p/deepseek-is-comi...


emphasis on corporate


The MIT licence is for code only


> Sarah: Built a fusion reactor at 16. Now? Debugging fintech payment systems.

It's striking to imagine a fully functional fusion reactor that could benefit humanity, yet its creator now focuses on fintech payment systems. This highlights the importance of a strong middle class, which seems to be declining globally. A thriving middle class, with disposable income and free time, creates the conditions for innovation. Without it, even brilliant minds like Einstein might spend their entire careers working on immediate economic needs rather than pursuing breakthrough discoveries.


What you are saying is essentially true. I just don't want people to come away with the notion that building a fusion reactor and yielding net energy from said reactor are equivalent. They are very, very far away from each other in terms of complexity.


Are these real examples?

I was curious, and all I could find is this: https://newsforkids.net/articles/2024/09/04/16-year-old-stud...

They are not working in Fintech AFAIK.


They seem believable to me, as a graduate of the same university.

I have friends from Imperial College who now work at ESA, Los Alamos, quantum computer research etc, but also others working in banks, hedge funds or adtech.

Top of my computer science class is working at a hedge fund, number two is working at a fintech startup.

It's on their website, even specifically for electrical engineers (since that's the topic)

- 22% working in manufacturing

- 16% in IT

- 25% in finance

- 16% professional / scientific / technical

https://www.imperial.ac.uk/media/imperial-college/administra... via https://www.imperial.ac.uk/careers/plan-your-career/destinat...


I work in finance too and see similar, but are these examples real, or a remix of real examples?


Probably what was built was a fusor. There are tons of instructions on how to build one (https://fusor.net/board/), and seemingly there's a lot of focus on how "young" the builders are. Just google: fusion reactor teenager. In some of the stories it becomes apparent the fusor was never actually finished but only partway there.

https://newsforkids.net/articles/2024/09/04/16-year-old-stud... https://online.kidsdiscover.com/quickread/arkansas-teen-buil... https://interestingengineering.com/energy/nuclear-fusion-rea... ...


Yeah, that really stood out to me. Like, obviously she just put papier-mâché on a lamp and said it's a model reactor, so why pretend it's a serious achievement?

Same with using a 3D printer to print a rigid arm, or saying “imagine if I built a swarm, that would be so cool”

There’s definitely a delusion among rich children that they are geniuses. Poor schools can’t afford participation awards.


It's true, especially on social media; for them it's a show of power, as in "I can do this."

They pick these victims from time to time to remind people of their power and to suppress others.


Self-censorship is extremely harmful. When people fear physical harm for expressing themselves, many criticisms go unsaid.


There was a documentary about women's prisons. It's hell on earth.


I wonder what Elon thinks about this. There was one demo from SpaceX about using their rockets for point-to-point trips, which could bring transatlantic flights down to 20-30 minutes (if you have a strong stomach), or the Boring Company focusing on hyperloops.

My humble opinion is that it's an aviation company without huge innovation or disruption of the industry. More like a faster horse than a car.


There are a lot of similarities between CBT and Vipassana meditation.

Like the mind reading and personalisation in CBT, Vipassana meditation suggests we have `sankharas`, loosely translated as formations, which are a kind of poison in our body and mind. With consistent meditation you can examine and make peace with them. I'm not an expert, but it worked for me; it was kinda magical and interesting.

CBT and meditation have a lot to offer our lives in our little concrete jungle.

https://en.wikipedia.org/wiki/Sa%E1%B9%85kh%C4%81ra


It's really interesting that no other comment is talking about the LGBTQ community, because as far as I know the founder is gay and living in Turkey, which means they have a somewhat closed community there; the general public doesn't accept them.

So I think Orkut means a lot to the LGBTQ and activist communities; it would be great to hear their thoughts.

Some ref... https://gay.blog.br/en/geek-en/orkut-social-network-was-foun...


It is funny that you mention it: I don't know the source of this, but a whole urban myth was going around in Orkut's heyday (especially in India, for sure) that Orkut Buyukkokten had envisaged a social network in the hopes that he could reconnect with a childhood flame, a girl he was no longer in touch with [1]. Who would have known back then :)

[1] https://www.theweek.in/leisure/society/2018/04/22/orkut-the-...

Emphasized: "Even the (possibly cooked up) origin story of Orkut is sappy and saccharine—the lore goes that Orkut was created by Buyukkokten after embarking on a search for his missing girlfriend."


Congrats on the launch! I'm also a fan of Glitch, so congrats on that good work too.

Just out of curiosity, I was reading the engineering write-up and came across the passage below about not using websockets. I'm confused, because debounce and throttle are mainly used to avoid sending too many updates over sockets; it's a very well-known problem in reactive programming (a rough sketch follows the quote below).

( https://pketh.org/how-kinopio-is-made.html )

> You might be wondering, why don’t you just update the database with websockets instead of relatively slow API requests?

> The problem with saving data with websockets is that they’re too fast. Authenticating that many messages per second and writing them to disk would be really inefficient. E.g. If you’re moving a card from position x: 20 to x: 420, Kinopio will use websockets to broadcast many updates during the move: moving card x to 21, moving card x to 24, moving card x to 28… potentially hundreds of messages.
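
For what it's worth, a minimal sketch of the debounce idea I had in mind (illustrative Python with made-up names, not Kinopio's actual code): broadcast every intermediate move to peers over the socket, but only persist a card's position once the stream of moves has gone quiet for a short window.

    import asyncio

    DEBOUNCE_SECONDS = 0.3
    _pending = {}  # card_id -> pending save task

    async def save_to_db(card_id: str, x: int) -> None:
        print(f"persisting card {card_id} at x={x}")  # stand-in for the real write

    async def _debounced_save(card_id: str, x: int) -> None:
        await asyncio.sleep(DEBOUNCE_SECONDS)  # restarted by every new move
        await save_to_db(card_id, x)

    def on_card_moved(card_id: str, x: int) -> None:
        # broadcast_to_peers(card_id, x)  # cheap: fan out every update to connected clients
        if card_id in _pending:           # expensive: defer the durable write
            _pending[card_id].cancel()
        _pending[card_id] = asyncio.create_task(_debounced_save(card_id, x))

    async def main() -> None:
        for x in (21, 24, 28, 420):       # many intermediate moves collapse into one write
            on_card_moved("card-1", x)
            await asyncio.sleep(0.05)
        await asyncio.sleep(0.5)

    asyncio.run(main())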


That's a good point, and I should look into that (it's been a while since I touched that part of the codebase). Off-hand, my guess is that the other reason I handle API requests separately is that I can group multiple actions into a single request more easily, which isn't something I need to do with websocket streaming.



I liked the article, and the Wikipedia page for hobbies is very interesting; I've found some gems...

- Magnet Fishing (https://en.wikipedia.org/wiki/Magnet_fishing)

- Binge watching ( didn't know it could be called a hobby - https://en.wikipedia.org/wiki/Binge-watching )

- Constructing languages ( https://en.wikipedia.org/wiki/Constructed_language )

- Tea bag collecting ( https://en.wikipedia.org/wiki/Tea_bag )


+1 for constructed languages. Fascinating hobby. I've dabbled myself.

Bonus: Tolkien would approve.


Well said! I also learned about user-moderation hell the painful way, but for mobile projects, if it's really necessary, there's the possibility of using anonymous logins with Firebase or Amazon AppSync as well.

