
> Mistral Medium destroys the 4.5 preview.

On what metrics? LMSys shows it does well but 4-Turbo is still leading the field by a wide margin.

I am using 8x-7b internally for a lot of things and Mistral-7b fine-tunes for other specific applications. They're both excellent. But neither can touch GPT-4-turbo (preview) for wide-ranging needs or the strongest reasoning requirements.

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

EDIT: Neither does mistral-medium, which I didn't discuss, but is in the leaderboard link.



Keep in mind that modern quantitative approaches to LLM evaluation have been effectively co-designed with the rise of OpenAI, and folks like Ravenwolf routinely disagree with the leaderboards.

There's also very little, if any, credible literature on what constitutes a statistically significant difference on MMLU or similar benchmarks. And there's such a massive vested interest from so many parties (the YC ecosystem is invested in Sam, MSFT is invested in OpenAI, the US is invested in not-France, a bunch of academics are invested in GPT-is-borderline-AGI, Yud is either a Time Magazine cover author or a Harry Potter fanfic guy, etc.) in seeing GPT-4.5 at the top of those rankings, and in taking the bolded entry at < 10% lift as state of the art, that I think everyone should just use a bunch of them and optimize per use case.
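To make the significance point concrete, here's a rough sketch of the kind of test that's usually absent from leaderboard claims: a two-proportion z-statistic on two accuracy scores from the same benchmark. The numbers below are hypothetical (not real model scores), and the independence-of-questions assumption is itself shaky for MMLU.

```python
import math

def accuracy_delta_z(acc_a: float, acc_b: float, n: int) -> float:
    """Two-proportion z-statistic for two models scored on the same
    n-question benchmark, treating questions as independent trials."""
    pooled = (acc_a + acc_b) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return (acc_a - acc_b) / se

# Hypothetical scores: 86.4% vs 85.0% on ~14k questions.
z = accuracy_delta_z(0.864, 0.850, 14042)
print(f"z = {z:.2f}")  # |z| > 1.96 would be significant at the 5% level
```

Note the flip side: on a benchmark this large even a ~1-point gap can clear the significance bar, so "statistically significant" and "meaningfully better for your use case" are different questions.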

I have my own biases as well and freely admit that I love to see OpenAI stumble (no I didn't apply to work there, yes I know knuckleheads who go on about the fact they do).

And once you factor in "mixtral is aligned to the demands of the user and GPT balks at using profanity while happily taking sides on things Ilya has double-spoken on", even e.g. MMLU is nowhere near the whole picture.

It's easy and cheap to just try both these days, don't take my word for which one is better.
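"Just try both" really is a few lines, since both vendors expose a chat-completions-style JSON API. The endpoint paths and payload shape below are my reading of their public docs at the time, so double-check them; the prompt is just a placeholder.

```python
import json
import os
import urllib.request

# Endpoint paths are my assumption of the vendors' public APIs; verify
# against their docs before relying on them.
ENDPOINTS = {
    "gpt-4-turbo-preview": ("https://api.openai.com/v1/chat/completions", "OPENAI_API_KEY"),
    "mistral-medium": ("https://api.mistral.ai/v1/chat/completions", "MISTRAL_API_KEY"),
}

def build_payload(model: str, prompt: str) -> dict:
    """Identical single-turn request body for every model under test."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask(model: str, prompt: str) -> str:
    url, key_var = ENDPOINTS[model]
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ[key_var]}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Only fires when both API keys are actually configured.
if all(os.environ.get(k) for _, k in ENDPOINTS.values()):
    for model in ENDPOINTS:
        print(f"--- {model} ---\n{ask(model, 'Explain TCP slow start briefly.')}")
```

Sending the exact same prompt to both keeps the comparison honest; anything else and you're benchmarking your prompts, not the models.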


> It's easy and cheap to just try both these days, don't take my word for which one is better.

I literally use 8x-7b on my on-prem GPU cluster and have several fine tunes of 7b (which I said in the previous post). I've used mistral-medium.

GPT-4-turbo is better than all of them on every benchmark, on human preference, and on anything that isn't biased vibes. My opinion - such as it is - is that GPT-4-turbo is by far the best.

I have no vested interest in it being the best; I'd actually prefer if it wasn't. But all the objective data points to it being the best, and most unbiased hands-on experience agrees (assuming broad model use and not hyperfocused fine-tunes; I have Mistral-7b fine-tunes beating 4-turbo in very limited domains, but that hardly counts).

The rest of your post I really have no idea what's going on, so good luck with all that I guess.


Mistral Medium beats 4.5 on the censorship benchmark. It doesn't refuse to help with anything that could be vaguely non-PC or could potentially be used to hurt anyone in the wrong hands, including dangerously hot salsa recipes.


That's not a metric.

That's a use case.

Certainly, no one here is disputing that there are things OpenAI refuses to allow, and given that the effectiveness of using GPT-4 on those tasks is literally zero, a sweet potato connected to a spring and a keyboard will "beat" GPT-4 if that's your scoring metric.

If you want a meaningful comparison you need tasks that both tools are capable of doing, and then see how effective they are.

Claiming that Mistral Medium beats it is like me claiming that RenderMan beats DALL-E 2 at rendering 3D models; yes, technically they both generate images, but since it's not possible to use DALL-E 2 to render a 3D model, it's not really a meaningful comparison, is it?


> If you want a meaningful comparison you need tasks that both tools are capable of doing, and then see how effective they are.

The fact that it's incapable of simple requests that an alternative can handle is absolutely part of a worthwhile comparison.


You’re just twisting what “best” means to suit your bias.

That is not a measure of how sophisticated and capable a model is.

GPT-4 is a more sophisticated, more capable model than Mistral.

If that doesn’t make it the “better” for you, that’s fine; but any attempt to argue about the capabilities of the models is misguided.

Restrictions placed on a model are an orthogonal concern to its capabilities.

…but sure, you can invent some benchmarks to score models on other criteria, which is entirely valid.

It’s perfectly fair to say that GPT4 doesn’t top all possible metrics… only the meaningful ones about model capabilities.


Semantics.

Both tools are generative systems that produce text in response to a prompt. If Mistral went mute on random topics for no other reason than that its makers dislike talking about them, would you say that doesn't count?


I'm a big proponent of freedom in this space (and remain one), but Dolphin is fucking scary.

I don't have any use cases for crime in my life at the moment beyond wanting to pirate, like, Adobe Illustrator before signing up for an uncancelable subscription, but it will do arbitrary things within its abilities, and it's Google with a grudge in terms of how to do anything you ask. I stopped wanting to know when it convinced me it could explain how to stage a coup d'etat. I'm back on mixtral-8x7b.


Agree with this. I would say that the rate of progress from Mistral is very encouraging though in terms of having multiple plausible contenders for the crown.


> Keep in mind that modern quantitative approaches to LLM evaluation have been effectively co-designed with the rise of OpenAI, and folks like Ravenwolf routinely disagree with the leaderboards.

Sorry but you're talking complete nonsense here. The benchmark by LMSys (chatbot arena) cannot be gamed, and Ravenwolf is a random-ass poster with no scientific rigor to his benchmarks.
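For context, arena-style leaderboards turn those blind pairwise votes into ratings with an Elo-style update, roughly like this (a simplified sketch; LMSys's actual published methodology differs in detail):

```python
def elo_update(r_a: float, r_b: float, winner: str, k: float = 32.0):
    """One Elo-style rating update from a single blind pairwise vote.
    winner is "a", "b", or "tie". Simplified sketch, not LMSys's exact math."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start even; model A wins one blind matchup.
ra, rb = elo_update(1000, 1000, "a")
print(ra, rb)  # 1016.0 984.0
```

Because voters don't know which model is A and which is B until after voting, stuffing the ballot for a named model isn't straightforward; you'd have to reliably recognize its outputs blind.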


Cannot be gamed? C'mon now... You could pay a bunch of people to vote for your model in the arena.


No you can't, because you actually don't know which model is which when you vote.


Do only the initial votes count? Because after I made an initial choice, I was put in a session where I could see the names of both AIs, and I made subsequent votes in that session with the names visible.



It just feels like "which LLM is better" is becoming the new "which GPU is better" type of talk. It's great to find a clear winner, but in the end the gap between the leaders isn't an order of magnitude.


These days the question is more about which LLM is second best. That race is very tight, while GPT-4 is in its own league.


I think people are missing the context that the prices of even the largest LLMs trend towards $0 in the medium term. Mistral-medium is almost open source, and these are still early days.



