On what metrics? LMSys shows it does well, but 4-Turbo is still leading the field by a wide margin.
I am using 8x7B internally for a lot of things and Mistral-7B fine-tunes for other specific applications. They're both excellent. But neither can touch GPT-4-turbo (preview) for wide-ranging needs or the strongest reasoning requirements.
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
EDIT: Neither can mistral-medium, which I didn't discuss, but which is in the leaderboard link.
Keep in mind that modern quantitative approaches to LLM evaluation have been effectively co-designed with the rise of OpenAI, and folks like Ravenwolf routinely disagree with the leaderboards.
There's also very little if any credible literature on what constitutes a statistically significant difference on MMLU or whatever. There's such a massive vested interest from so many parties (the YC ecosystem is invested in Sam, MSFT is invested in OpenAI, the US is invested in not-France, a bunch of academics are invested in GPT-is-borderline-AGI, Yud is either a Time Magazine cover author or a Harry Potter fanfic guy, etc.) in seeing GPT-4.5 at the top of those rankings, and in treating the bolded number at < 10% lift as state of the art, that I think everyone should just use a bunch of them and optimize per use case.
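To put rough numbers on that: here's a back-of-the-envelope significance check, assuming MMLU's ~14k test questions and treating per-question correctness as independent coin flips (a big simplification, since questions cluster by subject and scores shift with prompt format):

    from math import sqrt

    # z-statistic for the gap between two accuracies measured on the
    # same n questions, treating each as an independent Bernoulli trial.
    def two_prop_z(p1, p2, n=14_042):  # ~14k = MMLU test set size
        pooled = (p1 + p2) / 2
        se = sqrt(2 * pooled * (1 - pooled) / n)
        return (p1 - p2) / se

    print(two_prop_z(0.864, 0.852))  # 86.4% vs 85.2%: z ~ 2.9

A ~1-point gap is nominally significant under those assumptions, but harness and prompt-format differences between runs routinely move scores by more than that, which is exactly the caveat the leaderboards rarely surface.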
I have my own biases as well and freely admit that I love to see OpenAI stumble (no I didn't apply to work there, yes I know knuckleheads who go on about the fact they do).
And once you factor in "mixtral is aligned to the demands of the user, while GPT balks at using profanity yet happily takes sides on things Ilya has double-spoken on", even something like MMLU is nowhere near the whole picture.
It's easy and cheap to just try both these days, don't take my word for which one is better.
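E.g., Mistral's API speaks the OpenAI chat-completions protocol, so one client covers both; a minimal side-by-side sketch, assuming you have keys for each (model names were current as of this writing — check the docs):

    from openai import OpenAI

    clients = {
        "gpt-4-1106-preview": OpenAI(),  # reads OPENAI_API_KEY from the env
        "mistral-medium": OpenAI(
            base_url="https://api.mistral.ai/v1",
            api_key="YOUR_MISTRAL_API_KEY",  # placeholder
        ),
    }

    prompt = "Explain TCP slow start in three sentences."
    for model, client in clients.items():
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"--- {model} ---\n{resp.choices[0].message.content}\n")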
> It's easy and cheap to just try both these days, don't take my word for which one is better.
I literally use 8x7B on my on-prem GPU cluster and have several fine-tunes of 7B (as I said in the previous post). I've also used mistral-medium.
GPT-4-turbo is better than all of them on benchmarks, on human preference, and on anything that isn't biased vibes. My opinion - such as it is - is that GPT-4-turbo is by far the best.
I have no vested interest in it being the best. I'd actually prefer it if it wasn't. But all the objective data points to it being the best, and most unbiased lived experience agrees (assuming broad model use, not hyperfocused fine-tunes; I have Mistral-7B fine-tunes beating 4-turbo in very narrow domains, but that hardly counts).
As for the rest of your post, I really have no idea what's going on there, so good luck with all that, I guess.
Mistral Medium beats 4.5 on the censorship benchmark. It doesn't refuse to help with anything that could be vaguely non-PC or could potentially be used to hurt anyone in the wrong hands, including dangerously hot salsa recipes.
Certainly no one here is disputing that there are things OpenAI refuses to do, and given that the effectiveness of using GPT-4 on those is literally zero, a sweet potato connected to a spring and a keyboard will "beat" GPT-4, if that's your scoring metric.
If you want a meaningful comparison you need tasks that both tools are capable of doing, and then see how effective they are.
Claiming that mistral medium beats it is like me claiming that RenderMan beats DALL-E 2 at rendering 3D models; yes, technically they both generate images, but since it's not possible to use DALL-E 2 to render a 3D model, it's not really a meaningful comparison, is it?
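If you want that kind of apples-to-apples number, the filter is easy to sketch — ask(), is_refusal(), and score() below are hypothetical stand-ins for your own client code, refusal detector, and grader:

    # Score models only on tasks every one of them actually attempts,
    # so refusals don't masquerade as capability differences.
    def compare(models, tasks, ask, is_refusal, score):
        totals = {m: 0.0 for m in models}
        attempted = 0
        for task in tasks:
            answers = {m: ask(m, task) for m in models}
            if any(is_refusal(a) for a in answers.values()):
                continue  # drop tasks where any model refuses outright
            attempted += 1
            for m, a in answers.items():
                totals[m] += score(task, a)
        return {m: t / max(attempted, 1) for m, t in totals.items()}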
Both tools are generative systems that produce text in response to a prompt. If Mistral were mute on random topics for no other reason than that its makers dislike talking about them, would you say it doesn't count?
I'm a big proponent of freedom in this space (and remain one), but Dolphin is fucking scary.
I don't have any use cases for crime in my life at the moment beyond wanting to pirate, like, Adobe Illustrator before signing up for an uncancelable subscription, but it will do arbitrary things within its abilities, and it's Google with a grudge in terms of how to do anything you ask. I stopped wanting to know when it convinced me it could explain how to stage a coup d'état. I'm back on mixtral-8x7b.
Agree with this. I would say that the rate of progress from Mistral is very encouraging though in terms of having multiple plausible contenders for the crown.
> Keep in mind that modern quantitative approaches to LLM evaluation have been effectively co-designed with the rise of OpenAI, and folks like Ravenwolf routinely disagree with the leaderboards.
Sorry, but you're talking complete nonsense here. The benchmark by LMSys (Chatbot Arena) cannot be gamed, and Ravenwolf is a random-ass poster with no scientific rigor to his benchmarks.
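For context on why: arena ratings are aggregated from blind pairwise votes, each of which nudges both models' scores, roughly like this (LMSys has also used a Bradley-Terry fit over all battles, but the intuition is the same):

    # Classic online Elo update from a single head-to-head vote.
    def elo_update(r_winner, r_loser, k=32.0):
        expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
        delta = k * (1 - expected)
        return r_winner + delta, r_loser - delta

    ratings = {"model_a": 1000.0, "model_b": 1000.0}
    ratings["model_a"], ratings["model_b"] = elo_update(
        ratings["model_a"], ratings["model_b"]  # one vote for model_a
    )

Manipulating the ranking would mean outvoting thousands of organic, anonymized comparisons, which is why it's generally considered so hard to game.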
Do only the initial votes count? Because after I made an initial choice, I was put into a session where I could see the names of both AIs, and I cast subsequent votes in that session with the names visible.
It just feels like “what LLM is better” is becoming the new “what GPU is better” type of talk. It's great to find a clear winner, but in the end the gap between the leaders isn't an order of magnitude.
I think people are missing the context that prices for even the largest LLMs trend toward $0 in the medium term. Mistral-medium is almost open source, and these are still early days.