To be clear, I didn't ask it to write something complex. The prompt was "how do I do X with library Y?", with a bit more detail. The library is fairly popular and in a mainstream language.
I had a suspicion that what I was trying to do was simply not possible with that library, but since LLMs are incapable of saying "that's not possible" or "I don't know", they will rephrase your prompt and hallucinate whatever might plausibly make sense. They have no way to gauge whether what they're outputting is actually correct.
So I can imagine that you sometimes might get something useful from this, but if you want a specific answer about something, you will always have to double-check their work. In the specific case of programming, this could be improved with a simple engineering task: integrate the output with a real programming environment, and evaluate the result of actually running the code. I think there are coding assistant services that do this already, but frankly, I was expecting more from simple chat services.
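Roughly what I have in mind, as a sketch; `ask_llm` here is just a stand-in for whatever chat-completion call you'd actually be using, not a real library function:

```python
import subprocess
import sys
import tempfile

def run_snippet(code: str, timeout: int = 10):
    """Write the generated code to a temp file, run it, and return
    (exit_code, combined stdout/stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return -1, "timed out"

def ask_with_feedback(ask_llm, prompt: str, max_rounds: int = 3) -> str:
    """Re-prompt with the real error output until the snippet runs
    cleanly or we give up. `ask_llm` is a placeholder for whatever
    chat-completion call you actually use."""
    code = ask_llm(prompt)
    for _ in range(max_rounds):
        status, output = run_snippet(code)
        if status == 0:
            return code
        code = ask_llm(
            f"{prompt}\n\nYour previous attempt failed when run:\n"
            f"{output}\nPlease fix the code."
        )
    return code  # still failing, but at least we know it doesn't run
```

Even that only proves the code executes, not that it does the right thing, but it would at least catch the "this method doesn't exist in library Y" class of hallucination.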
Specific is the specific thing that statistical models are not good at :(
> how do I do X with library Y?
Recent research and anecdotal experience have shown that LLMs perform quite poorly with short prompts. Attention just has more data to work with when there are more tokens. Try extending that question to something like “I am using this programming language and am trying to do this task with this library. How do I do this thing with this other library?”
I realize prompt engineering like this is fuzzy and “magic,” but short prompts consistently produce worse results.
> In the specific case of programming, this could be improved with a simple engineering task: integrate the output with a real programming environment, and evaluate the result of actually running the code.
Not as simple as you’d think. You’re letting something run arbitrary code.
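You can reduce the blast radius somewhat, but it's more plumbing than it sounds. A rough sketch of the kind of precautions I mean (hard timeout, scratch directory, stripped-down environment, Python's isolated mode); this is not a real sandbox, so anything you genuinely distrust belongs in a container or VM:

```python
import os
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout: int = 5):
    """Run generated code with a few basic precautions: a scratch
    working directory, a minimal environment, Python's isolated mode
    (-I), and a hard timeout. This is NOT a sandbox -- the snippet can
    still read files and open sockets -- so use a container or VM for
    anything you actually don't trust."""
    workdir = tempfile.mkdtemp(prefix="llm_run_")
    script = os.path.join(workdir, "snippet.py")
    with open(script, "w") as f:
        f.write(code)
    try:
        proc = subprocess.run(
            [sys.executable, "-I", script],
            cwd=workdir,
            env={"PATH": os.environ.get("PATH", "")},
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return proc.returncode, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return -1, "timed out"
```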
Tho you should give aider.chat a try if you want to test out that workflow. I found it very very slow.
> Recent research and anecdotal experience has shown that LLMs perform quite poorly with short prompts.
I'm aware of that. The actual prompt was more elaborate. I was just mentioning the gist of it here.
Besides, you would think that after 30 minutes of prompting and corrections it would arrive at the correct answer. I'm aware that subsequent output is based on the session history, but I would expect that to matter less when the human responses are negative. It just seems like sloppy engineering otherwise.
> Specific is the specific thing that statistical models are not good at
Some models are good at needle-in-a-haystack problems. If the information exists, they're able to find it. What I don't need is for them to hallucinate wrong answers when the information doesn't exist.
This is a core problem of this tech, but I also expected it to improve over time.
> Tho you should give aider.chat a try
Thanks, I'll do that eventually. If it's slow, it can get faster. I'd rather the tool be slow but give correct answers than have it slow me down by wasting my time correcting its errors.
Thankfully, these approaches can work for programming tasks. There isn't much that can be done to verify the output for most other subjects.