Same conclusion here so far. I've tested various open-source models, maybe once or twice per month, comparing them against GPT-4, and nothing has come close yet. Even closed-source models don't seem to fare very well; Claude has probably come closest to GPT-4, but I've yet to find anything that surpasses it for coding help.
Of course, it could be that I've just gotten used to GPT-4 and my prompting has been optimized for it, so when I apply the same techniques to other models those prompts simply don't work as well.
They won't beat Claude or GPT-4. If you want a model that writes code or answers complex questions, use one of those. But for many simpler tasks (summarization, sentiment analysis, data transformation, text completion, etc.), self-hosted models are perfectly suited.
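For illustration, here's a minimal sketch of that kind of "simpler task" against a self-hosted model, assuming a local Ollama server on its default port with a model already pulled (the model name is just a placeholder):

    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

    def summarize(text, model="llama3.1"):
        # Single non-streaming completion; plenty for summarization / sentiment-style prompts.
        resp = requests.post(OLLAMA_URL, json={
            "model": model,
            "prompt": "Summarize the following text in two sentences:\n\n" + text,
            "stream": False,
        }, timeout=120)
        resp.raise_for_status()
        return resp.json()["response"]

    print(summarize("<paste the text to summarize here>"))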
And if you work on something where the commercial models are trained to refuse answers and lecture the user instead, some of the freely available models are much more pleasant to work with. With 70B models you even get a decent amount of reasoning capability.
I wrote an automated form-filling Firefox extension and tested it with Llama 3.1 running locally via Ollama. Not perfect and quite slow, but better than any other form filler I've tested.
I also tried hooking it up to Claude, and so far it's been flawless (I haven't done a lot of testing, though).
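The extension itself is JavaScript, but the LLM side is just a local HTTP call. A rough Python sketch of the pattern, assuming Ollama's local API and its JSON output mode; the model name and field list are made up for illustration:

    import json
    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"

    def fill_fields(page_text, field_names, model="llama3.1"):
        # Ask the model to map form field names to values it can find in the page text,
        # and request JSON output so the extension can parse the result mechanically.
        prompt = (
            "Given this page text:\n" + page_text + "\n\n"
            "Return a JSON object with values for these form fields: " + ", ".join(field_names)
        )
        resp = requests.post(OLLAMA_URL, json={
            "model": model,
            "prompt": prompt,
            "format": "json",   # Ollama's constrained-JSON output mode
            "stream": False,
        }, timeout=300)
        resp.raise_for_status()
        return json.loads(resp.json()["response"])

    print(fill_fields("Jane Doe, jane@example.com, Berlin", ["name", "email", "city"]))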
Most "well-known-name" open-source ML models, are very much "base models" — they are meant to be flexible and generic, so that they can be fine-tuned with additional training for task-specific purposes.
Mind you, you don't have to do that work yourself. There are open-source fine-tunes as well, for all sorts of specific purposes, that can be easily found on HuggingFace / found linked on applicable subreddits / etc — but these don't "make the news" like the releases of new open-source base models do, so they won't be top-of-mind when doing a search for a model to solve a task. You have to actively look for them.
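To make that concrete: pulling a task-specific fine-tune off HuggingFace is usually a couple of lines with the transformers library. The sentiment model below is one well-known example of such a fine-tune; substitute whichever one fits your task:

    from transformers import pipeline

    # A task-specific fine-tune pulled from the HuggingFace Hub; swap in whichever
    # fine-tune suits your use-case (this one is a DistilBERT sentiment fine-tune).
    clf = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )

    print(clf("The release notes were clear and the upgrade went smoothly."))
    # -> roughly [{'label': 'POSITIVE', 'score': 0.99...}]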
Heck, even focusing on the proprietary-model Inference-as-a-Service space, it's only really OpenAI that purports to have a "general" model that can be set to every task with only prompting. All the other proprietary-model Inf-aaS providers also sell Fine-Tuning-as-a-Service of their models, because they know people will need it.
---
Also, if you're comparing e.g. ChatGPT-4o (~200b) with a local model you can run on your PC (probably 7b, or maybe 13b if you have a 4090), then obviously the latter is going to be "dumber": it has (either literally, or effectively) had 95+% of its connections stripped out!
For production deployment of an open-source model with "smart thinking" requirements (e.g. a customer-support chatbot), the best-practice open-source-model approach would be to pay for dedicated and/or serverless hosting where the instances have direct-attached, dedicated, server-class GPUs and can therefore host the largest-parameter-size variants of the open-source models. Larger-parameter-size open-source models fare much better against the proprietary hosted models.
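As a rough sketch of what that looks like, vLLM is one common way to serve a large open-source model on a dedicated GPU instance; the 70B model and the tensor-parallel setting below are illustrative and assume you actually have the VRAM for them:

    from vllm import LLM, SamplingParams

    # Serve the largest spin your GPUs can hold; a 70B model like this assumes a
    # multi-GPU or high-VRAM server-class instance.
    llm = LLM(model="meta-llama/Meta-Llama-3.1-70B-Instruct", tensor_parallel_size=4)

    params = SamplingParams(temperature=0.2, max_tokens=256)
    outputs = llm.generate(["Summarize our refund policy for a customer."], params)
    print(outputs[0].outputs[0].text)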
IMHO the models in the "hostable on a PC" parameter-size range mainly exist for two use-cases:
• doing local development and testing of LLM-based backend systems (Due to the way pruning+quantizing parameters works, a smaller spin of a larger model will be probabilistically similar in behavior to its larger cousin — giving you the "smart" answer some percentage of the time, and a "dumb" answer the rest of the time. For iterative development, this is no problem — regenerate responses until it works, and if it never does, then you've got the wrong model/prompt.)
• "shrinking" an AI system that doesn't require so much "smart thinking", to decrease its compute requirements and thus OpEx. You start with the largest spin of the model; then you keep taking it down in size until it stops producing acceptable results; and then you take one step back.
The models of this size-range don't exist to "prove out" the applicability of a model family to a given ML task. You can do it with them — especially if there's an existing fine-tuned model perfectly suited to the use-case — but it'll be frustrating, because "the absence of evidence is not evidence of absence." You won't know whether you've chosen a bad model, or your prompt is badly structured, or your prompt is impossible for any model, etc.
When proving out a task, test with the largest spin of each model you can get your hands on, using e.g. a serverless Inf-aaS like Runpod. Once you know the model family can do that task to your satisfaction, then pull a local model spin from that family for development.
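A sketch of that proving-out step, assuming your serverless provider exposes an OpenAI-compatible endpoint for hosted open-source models (many do); the endpoint, key, and model name are placeholders:

    from openai import OpenAI

    # Placeholders: point these at whatever serverless endpoint you spun up.
    client = OpenAI(
        base_url="https://<your-serverless-endpoint>/v1",
        api_key="<provider-api-key>",
    )

    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # largest spin of the family under test
        messages=[{"role": "user", "content": "<a representative prompt from your real task>"}],
    )
    print(resp.choices[0].message.content)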
"There are open-source fine-tunes as well, for all sorts of specific purposes"
Have you had good results from any of these? I've not tried a model that's been fine-tuned for a specific purpose yet; I've only worked with the general-purpose ones.