
Deepseek-R1 is at the level of GPT 4.1 already; it's open-weight, open-source, and they even open-sourced their inference code.





I don't know why everyone keeps echoing this, my experience with Deepseek-R1, from a coding perspective at least, has been underwhelming at best. Much better experience with GPT 4.1 (and even better with Claude, but that's a different price category).

I'm not arguing about which model is better for your use case. I'm saying that, in general, it's as "powerful" as GPT 4.1 on a lot of benchmarks, and you can peek underneath the hood, even tune it for your particular use case.

Do you mean V3? V3 is 4.1 level or above.

A lot of software (e.g. ollama) has confusingly named DeepSeek's distills/finetunes of other base models "DeepSeek-R1" as well. See e.g. https://www.threads.com/@si.fong/post/DKSdUOHzaBB

I wonder whether you're actually running the proper DeepSeek-R1 model, or one of those lesser finetunes?
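
One quick way to check, if you're using ollama: ask its local API what is actually behind the tag. A minimal sketch, assuming ollama's default endpoint on localhost:11434; the deepseek-r1:8b tag is illustrative, and response field names can vary by ollama version:

  # Minimal sketch: ask a local ollama instance what a "deepseek-r1" tag
  # actually is. Assumes ollama's default REST API on localhost:11434;
  # field names may differ across ollama versions.
  import json
  import urllib.request

  def show_model(tag: str) -> dict:
      req = urllib.request.Request(
          "http://localhost:11434/api/show",
          data=json.dumps({"model": tag}).encode("utf-8"),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          return json.load(resp)

  # "deepseek-r1:8b" is an illustrative tag; anything this small is a
  # Qwen/Llama distill. The real DeepSeek-R1 is a ~671B-parameter MoE.
  details = show_model("deepseek-r1:8b").get("details", {})
  print(details.get("family"), details.get("parameter_size"))

If the reported family is Qwen or Llama, you were never talking to R1 proper.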


In my experience, all reasoning models feel (vibely) worse at structured output like code versus comparable non-reasoning models, but far better at knowledge-based answering.

This is everyone's experience with every model.

People sang praises from the rooftops for Google's Gemini 2.5 models, but for many tasks they can't even beat Deepseek V3 for me.


What would be an example of 2.5 Pro failing against R1 (which is what you'd actually want to compare it to)?

R1 sometimes fails against V3 for me too, so it's not a specific dig against Gemini.

In terms of code and science, Gemini is way, way too verbose in its output, and because of that it ends up confusing itself and hurting output quality over longer context windows.

R1 does this too, but it poisons itself in the reasoning loop. You can see it during the streaming, literally criss-crossing its thoughts and thinking itself into loops before it finally arrives at an answer.

On top of that, both R1 and Gemini Pro / Flash are mediocre at anything creative. I can accept that from R1, since it's mainly meant as more of a "hard sciences" model, but Gemini is meant to be an all-purpose model.

If you pit Gemini, Deepseek R1 and Deepseek V3 against each other in a writing contest, V3 will blow both of them out of the water.


Agreed on the last point, V3 is terrifyingly good at narrative writing. And yes, R1 talks itself out of correct answers almost as often as it talks itself into them.

But in general 2.5 Pro is an extremely strong model. It may lose out in some respects to o3-pro, but o3-pro is so much slower that its utility tends to be limited by my own attention span. I don't think either would have much to fear from V3, though, except possibly in the area of short fiction composition.


I got the impression that o3-mini or o3-mini-high were meant for coding? GPT 4.1 was meant for creative writing, not coding?

It’s good at a lot of things:

  GPT‑4.1 scores 54.6% on SWE-bench Verified, improving by 21.4% (abs) over GPT‑4o and 26.6% (abs) over GPT‑4.5—making it a leading model for coding.
https://openai.com/index/gpt-4-1/

They are trained on these "benchmarks"; that's why they score better.

If they were trained on those benchmarks they would score 100%

That just shows how bad they are: they can't even score 100% on benchmarks they were trained on.




Wasn't it shown recently that the filtering layer operates on the prompt input and the LLM output, and not on the training set or model weights?

https://www.socialscience.international/making-deepseek-spea...
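
If that's right, the censorship is conceptually just a wrapper around the model rather than something in the weights. A hypothetical sketch (the blocklist and the model call are entirely made up, not DeepSeek's actual serving code):

  # Hypothetical sketch of prompt/output-level filtering, as opposed to
  # censorship baked into the weights. Illustrative only.
  from typing import Callable

  BLOCKED = {"tiananmen", "xi jinping"}  # made-up blocklist

  def is_sensitive(text: str) -> bool:
      lowered = text.lower()
      return any(term in lowered for term in BLOCKED)

  def filtered_chat(prompt: str, generate: Callable[[str], str]) -> str:
      # Layer 1: refuse before the model ever sees the prompt.
      if is_sensitive(prompt):
          return "I can't discuss that topic."
      answer = generate(prompt)  # the underlying model call
      # Layer 2: retract after generation -- which is why hosted models
      # sometimes stream a reply and then delete it mid-answer.
      if is_sensitive(answer):
          return "I can't discuss that topic."
      return answer

Both layers vanish when you download the weights and call the model directly.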


It depends on the model, probably, but there are multiple layers of censorship, one of which is the post-facto nuking of replies these hosted models will do online, and that one goes away "for free" when you download the open-weight model.

I don't have a powerful enough system to run DeepSeek, but I've tried this with some of the Qwen3 models. They'll write answers that discuss Xi Jinping (which results in an auto-nuke of the reply from Chinese-hosted models, at least DeepSeek) or other very mildly/nominally sensitive topics.

(This is probably a coarse measure to easily ensure compliance with a recent national security law that requires commercial providers of web services to address sensitive topics "appropriately," or something like that. And since LLMs run non-deterministically, this layer of censorship often comes across as laughably extreme: it's a compliance strategy that exceeds the demands of the law for the sake of guaranteeing legal safety from an unpredictable software system.)

But the same models will altogether refuse to discuss the Tiananmen Square Massacre, even locally.

Some "decensored" versions of the Qwen3 models will discuss the Tiananmen Square Massacre, but in a very concise, formulaic, "official" way. After some chatting about it, it fell into an infinite repetition of one of its short formulaic answers (a behavior I didn't see with the original Qwen3 models with the same settings).


FWIW, I've downloaded the weights for DeepSeek's R1 model (DeepSeek-R1-0528, which was released after your linked article) and ran it locally. I asked it about what happened in Beijing on 1989-06-04, and it basically gave me a stern statement that could have been written by the CCP propaganda department. I asked it to give alternative views besides the CCP perspective, but it simply continued to stonewall me.

So yeah, the model itself is tuned, at least somewhat, to refuse to talk about politically sensitive things. It's not just an external filter.



