State-of-the-art techniques to get GPT to return JSON:
- logic: Responses must match this JSON schema
- demonstration: For example…
- appeal to identity: You are a chatbot that speaks perfect JSON
- cajoling: Remember, always return JSON!
- threat: if you don't I SWEAR I'm gonna–
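For what it's worth, here is roughly what the "logic", "demonstration", "appeal to identity", and "cajoling" lines above look like when pasted into an actual prompt; the schema, example record, and task below are made up for illustration:

```python
import json

# Made-up schema and example record, purely to illustrate the listed techniques.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

prompt = (
    "You are a chatbot that speaks perfect JSON.\n"                     # appeal to identity
    f"Responses must match this JSON schema:\n{json.dumps(schema)}\n"   # logic
    'For example: {"name": "Ada", "age": 36}\n'                         # demonstration
    "Remember, always return JSON!\n"                                   # cajoling
    "Describe the following person as JSON: ..."
)
print(prompt)
```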
One of the techniques I've found for reliably getting JSON back is... ask for multiple responses and then use one of the responses that successfully parses!
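A minimal sketch of that retry-until-it-parses idea, assuming the OpenAI Python client; the model name, the value of n, and the first_json_response helper are placeholders, not from the comment above:

```python
import json
from openai import OpenAI  # assumes the OpenAI Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def first_json_response(prompt: str, n: int = 5) -> dict:
    """Ask for several completions at once and return the first that parses as JSON."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # model name is just an example
        messages=[{"role": "user", "content": prompt}],
        n=n,                    # sample n candidate completions in one call
    )
    for choice in response.choices:
        try:
            return json.loads(choice.message.content)
        except json.JSONDecodeError:
            continue            # skip candidates that aren't valid JSON
    raise ValueError(f"none of the {n} responses parsed as JSON")

# usage:
# data = first_json_response("Return a JSON object describing a cat.")
```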
A glaring problem is the non-determinism of LLMs, which produce different answers to the same prompt. I appreciate your blogging and analysis in this space, so I am interested in your response. The non-determinism implies that prompt engineering is brittle, difficult, and lacking in formal evaluation techniques for correctness.
It is certainly an interesting phenomenon, and I wonder what techniques from neuroscience for brain mapping could be used for model "brain mapping", which could lend itself more to prompt engineering as a science (latent space mapping).
You can use vicuna-7b-1.1. No need for chat prompts. Just slam in your data and end it like so:
Generate a JSON with this and that
{"this": "
Lower the temperature to the minimum for deterministic results and fine-tune the other parameters if needed. Then set a stop token for the JSON closing brace, like so: }.
That usually works perfectly fine for me in most scenarios. Best of all, it runs on an RTX 3080 at 15 tokens/s (quite fast!). Also, vicuna-7b is pretty much as good as GPT-3 was when it came out.
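A minimal sketch of the priming-plus-stop-token trick described above, assuming llama-cpp-python; the model path and quantization in the filename are hypothetical:

```python
from llama_cpp import Llama

llm = Llama(model_path="./vicuna-7b-v1.1.q4_0.gguf")  # hypothetical local weights file

# Prime the completion with the opening of the JSON so the model just continues it,
# run at minimum temperature, and stop at the closing brace.
prompt = 'Generate a JSON with this and that\n{"this": "'
out = llm(
    prompt,
    max_tokens=256,
    temperature=0.0,  # minimum temperature for (near-)deterministic output
    stop=["}"],       # stop sequence: halt generation at the closing brace
)

# Re-attach the primed prefix and the brace that the stop sequence swallowed.
json_text = '{"this": "' + out["choices"][0]["text"] + "}"
print(json_text)
```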
Vicuna-7b is much better than GPT4All, but it still struggles with math. I can't wait until my new work computer comes in; I'll try to run the new StableLM models.
Hi Simon, your blog posts have been invaluable in my ongoing process of refining a document that covers major concepts in prompt engineering and LLM fine-tuning, and I'd love to pick your brain over email or a call if you have any bandwidth!
why not just adjust the decoder / beam search to not emit any tokens that aren't syntactically valid JSON?
i.e. instead of using temperature to sample something from the top-k most likely tokens, first exclude all the tokens that would cause the output to be malformed. the model can only emit {, ", [, or a number for the first token, for example.
if someone would like a fun project to try this right away, one place to start would be to modify llama.cpp's chat example just before the line that samples tokens [1], going through `lctx.logits` to zero out invalid tokens (or rather, since these are logits, set them to -INFINITY). As a smoke test, fix the first token of the model's output to "{" without any other changes and I bet you'd get something approaching JSON out.
i mean, the most principled approach probably requires some theoretical CS knowledge about regular expression derivatives or parsing machine derivatives, but i'm surprised it isn't more common to just hook into the decoder design a little, given how much we want structured data out of these models.
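The llama.cpp hack itself is a few lines of C++, but as an illustration of the same smoke test in Python, here is a sketch using Hugging Face transformers' LogitsProcessor hook; the model and prompt are placeholders, not from the comment. It forces the first generated token to be "{" by setting every other logit to -inf:

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class ForceFirstToken(LogitsProcessor):
    """Smoke test: force the first generated token to a fixed id by
    setting every other logit to -inf; later tokens are left untouched."""
    def __init__(self, prompt_len: int, forced_id: int):
        self.prompt_len = prompt_len
        self.forced_id = forced_id

    def __call__(self, input_ids, scores):
        if input_ids.shape[1] == self.prompt_len:  # first generated position only
            masked = torch.full_like(scores, float("-inf"))
            masked[:, self.forced_id] = scores[:, self.forced_id]
            return masked
        return scores

model_name = "gpt2"  # placeholder model, just to keep the sketch runnable
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Generate a JSON object describing a cat:\n"
inputs = tok(prompt, return_tensors="pt")
brace_id = tok.encode("{", add_special_tokens=False)[0]

out = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    logits_processor=LogitsProcessorList(
        [ForceFirstToken(inputs["input_ids"].shape[1], brace_id)]
    ),
)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:]))
```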
i wish i knew how to voice my ignorant skepticism in a less disparaging way, sorry.... but i feel like a lot of this "legitimization of prompt engineering as a useful trade/practice" thinking assumes that we're trapped in the "magic circle" where the only input we have to the model is picking the prompt and the only possible output is the most likely token. but these are generative models! conditioned on their output, we have our choice about which token to accept, so why not just condition on the distribution of possible JSON output instead of the distribution of possible prose?
i suspect very quickly the most competitive prompt engineers will combine a solid understanding of theoretical machine learning and statistics with a solid understanding of computer science, perhaps with a dash of persuasion / neurolinguistic programming experience thrown in. kinda worries me but it's how it is.