
Deepseek-R1 is at the level of GPT 4.1 already; it's open-weight, open-source, and they even open-sourced their inference code.





I don't know why everyone keeps echoing this, my experience with Deepseek-R1, from a coding perspective at least, has been underwhelming at best. Much better experience with GPT 4.1 (and even better with Claude, but that's a different price category).

I'm not arguing about which model is better for your use case. I'm saying that, in general, it's as "powerful" as GPT 4.1 on a lot of benchmarks, and you can peek underneath the hood, even tune it for your particular use case.

Do you mean V3? V3 is 4.1 level or above.

A lot of software (e.g. ollama) has confusingly named DeepSeek's distills/finetunes of other base models "DeepSeek-R1" as well. See e.g. https://www.threads.com/@si.fong/post/DKSdUOHzaBB

I wonder whether you're actually running the proper DeepSeek-R1 model, or one of those lesser finetunes?
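
One quick way to check, if you're using ollama: ask its local API what is actually behind the tag. A minimal sketch, assuming ollama's default endpoint on localhost:11434; the deepseek-r1:8b tag is illustrative, and response field names can vary by ollama version:

  # Minimal sketch: ask a local ollama instance what a "deepseek-r1" tag
  # actually is. Assumes ollama's default REST API on localhost:11434;
  # field names may differ across ollama versions.
  import json
  import urllib.request

  def show_model(tag: str) -> dict:
      req = urllib.request.Request(
          "http://localhost:11434/api/show",
          data=json.dumps({"model": tag}).encode("utf-8"),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          return json.load(resp)

  # "deepseek-r1:8b" is an illustrative tag; anything this small is a
  # Qwen/Llama distill. The real DeepSeek-R1 is a ~671B-parameter MoE.
  details = show_model("deepseek-r1:8b").get("details", {})
  print(details.get("family"), details.get("parameter_size"))

If the reported family is Qwen or Llama, you were never talking to R1 proper.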


In my experience, all reasoning models feel (vibely) worse at structured output like code versus comparable non-reasoning models, but far better at knowledge-based answering.

This is everyone's experience with every model.

People sang praises from the rooftops for Google's Gemini 2.5 models, but for many tasks they can't even beat Deepseek V3 for me.


What would be an example of 2.5 Pro failing against R1 (which is what you'd actually want to compare it to)?

R1 sometimes fails against V3 for me too, so it's not a specific dig against Gemini.

In terms of code and science, Gemini is way, way too verbose in its output, and because of that it ends up confusing itself and hurting output quality over longer context windows.

R1 does this too, but it poisons itself in the reasoning loop. You can see it during the streaming, literally criss-crossing its thoughts and thinking itself into loops before it finally arrives at an answer.

On top of that, both R1 and Gemini Pro / Flash are mediocre at anything creative. I can accept that from R1, since it's mainly meant as more of a "hard sciences" model, but Gemini is meant to be an all-purpose model.

If you pit Gemini, Deepseek R1 and Deepseek V3 against each other in a writing contest, V3 will blow both of them out of the water.


Agreed on the last point, V3 is terrifyingly good at narrative writing. And yes, R1 talks itself out of correct answers almost as often as it talks itself into them.

But in general 2.5 Pro is an extremely strong model. It may lose out in some respects to o3-pro, but o3-pro is so much slower that its utility tends to be limited by my own attention span. I don't think either would have much to fear from V3, though, except possibly in the area of short fiction composition.


I got the impression that o3-mini or o3-mini-high were meant for coding? GPT 4.1 was meant for creative writing, not coding?

It’s good at a lot of things:

  GPT‑4.1 scores 54.6% on SWE-bench Verified, improving by 21.4% (abs) over GPT‑4o and 26.6% (abs) over GPT‑4.5—making it a leading model for coding.
https://openai.com/index/gpt-4-1/

They are trained on these "benchmarks"; that's why they score better.

If they were trained on those benchmarks they would score 100%

That just shows how bad they are: they can't even score 100% on benchmarks they were trained on.




Wasn't it shown recently that the filtering layer operates on the prompt input and the LLM output, and not on the training set or model weights?

https://www.socialscience.international/making-deepseek-spea...
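
If that's right, the censorship is conceptually just a wrapper around the model rather than something in the weights. A hypothetical sketch (the blocklist and the model call are entirely made up, not DeepSeek's actual serving code):

  # Hypothetical sketch of prompt/output-level filtering, as opposed to
  # censorship baked into the weights. Illustrative only.
  from typing import Callable

  BLOCKED = {"tiananmen", "xi jinping"}  # made-up blocklist

  def is_sensitive(text: str) -> bool:
      lowered = text.lower()
      return any(term in lowered for term in BLOCKED)

  def filtered_chat(prompt: str, generate: Callable[[str], str]) -> str:
      # Layer 1: refuse before the model ever sees the prompt.
      if is_sensitive(prompt):
          return "I can't discuss that topic."
      answer = generate(prompt)  # the underlying model call
      # Layer 2: retract after generation -- which is why hosted models
      # sometimes stream a reply and then delete it mid-answer.
      if is_sensitive(answer):
          return "I can't discuss that topic."
      return answer

Both layers vanish when you download the weights and call the model directly.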


It depends on the model, probably, but there are multiple layers of censorship, one of which is the post-facto nuking of replies these hosted models will do online, and that one goes away "for free" when you download the open-weight model.

I don't have a powerful enough system to run DeepSeek, but I've tried this with some of the Qwen3 models. They'll write answers that discuss Xi Jinping (which results in an auto-nuke of the reply from Chinese-hosted models, at least DeepSeek) or other very mildly/nominally sensitive topics.

(This is probably a coarse measure to easily ensure compliance with a recent national security law that requires commercial providers of web services to address sensitive topics "appropriately," or something like that. And since LLMs run non-deterministically, this layer of censorship often comes across as laughably extreme: it's a compliance strategy that exceeds the demands of the law for the sake of guaranteeing legal safety from an unpredictable software system.)

But the same models will altogether refuse to discuss the Tiananmen Square Massacre, even locally.

Some "decensored" versions of the Qwen3 models will discuss the Tiananmen Square Massacre, but in a very concise, formulaic, "official" way. After some chatting about it, it fell into an infinite repetition of one of its short formulaic answers (a behavior I didn't see with the original Qwen3 models with the same settings).


FWIW, I've downloaded the weights for DeepSeek's R1 model (DeepSeek-R1-0528, which was released after your linked article) and ran it locally. I asked it about what happened in Beijing on 1989-06-04, and it basically gave me a stern statement that could have been written by the CCP propaganda department. I asked it to give alternative views besides the CCP perspective, but it simply continued to stonewall me.

So yeah, the model itself is tuned, at least somewhat, to refuse to talk about politically sensitive things. It's not just an external filter.



