XML is also a great option, but there are a few trade offs: > XML is a many more...

gavindean90 · on June 18, 2024

I think once you have < and / the rest becomes much easier to predict. In a way it “spreads” the prediction over several tokens.

The < indicates that the preceding information is in fact over. The “/” represents that we are closing something and not starting a subtopic. And the “output” defines what we are closing. The final “>” ensures that our “output” string is ended. In JSON all of that semantic meaning gets put into the one token }.

joatmon-snoo · on June 20, 2024

Hmm, that's an interesting way of thinking about it. The way I see it, I trust XML less, because the sparser representation gives it more room to make a mistake: if you think of every token as an opportunity to be correct or wrong, the higher token count needed to represent content in XML gives the model a higher chance to get the output wrong (kinda like the birthday paradox).
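To put rough numbers on that intuition, here is a minimal sketch. It assumes every token fails independently with the same probability, which is a strong simplification, and the error rate `EPS` is a made-up value for illustration only. The token counts are the ones quoted for the JSON and XML examples in this comment.

```python
def p_all_correct(n_tokens: int, per_token_error: float) -> float:
    """Probability that all n_tokens are emitted correctly, assuming
    independent and identical per-token error rates (a simplification)."""
    return (1.0 - per_token_error) ** n_tokens


# Hypothetical per-token error rate, chosen only for illustration.
EPS = 0.001

json_ok = p_all_correct(60, EPS)  # 60 tokens: count quoted for the JSON example
xml_ok = p_all_correct(76, EPS)   # 76 tokens: count quoted for the XML example

print(f"JSON (60 tokens): P(all correct) = {json_ok:.4f}")
print(f"XML  (76 tokens): P(all correct) = {xml_ok:.4f}")
```

Under this model, more tokens always means a lower chance of a fully correct output. The parent comment's point is that the uniform-error assumption may not hold: if the tokens after `<` and `/` are nearly deterministic, their individual error rates could be much lower than average.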

(Plus, more output tokens is more expensive!)

e.g.

Using the cl100k_base tokenizer (what GPT-4 uses), this JSON is 60 tokens:

    {
      "method": "GET",
      "endpoint": "/api/model/details",
      "headers": {
        "Authorization": "Bearer YOUR_ACCESS_TOKEN",
        "Content-Type": "application/json"
      },
      "queryParams": {
        "model_id": "12345"
      }
    }

whereas this XML is 76 tokens:

    <?xml version="1.0" encoding="UTF-8" ?>
    <method>GET</method>
    <endpoint>/api/model/details</endpoint>
    <headers>
        <Authorization>Bearer YOUR_ACCESS_TOKEN</Authorization>
        <Content-Type>application/json</Content-Type>
    </headers>
    <queryParams>
        <model_id>12345</model_id>
    </queryParams>

You can check out the tokenization here by toggling "show tokens": https://www.promptfiddle.com/json-vs-xml-token-count-BtXe3
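If you would rather count locally, something like the sketch below should work. The helper names (`count_tokens`, `cl100k_encoder`) and the shortened `SAMPLES` are my own, not from the linked page, and it assumes the `tiktoken` package is installed. Exact counts depend on whitespace and indentation, so paste in the full snippets above if you want to compare against the quoted numbers.

```python
from typing import Callable, Dict, Sequence

# Any function that turns a string into a sequence of tokens.
Encoder = Callable[[str], Sequence]


def count_tokens(samples: Dict[str, str], encode: Encoder) -> Dict[str, int]:
    """Return how many tokens `encode` produces for each named sample."""
    return {name: len(encode(text)) for name, text in samples.items()}


def cl100k_encoder() -> Encoder:
    """Build an encoder from tiktoken's cl100k_base encoding.

    Requires `pip install tiktoken`; the first call may download the
    vocabulary file, so it is imported lazily here.
    """
    import tiktoken

    return tiktoken.get_encoding("cl100k_base").encode


# Shortened stand-ins for the JSON and XML snippets above (illustrative only).
SAMPLES = {
    "json": '{"model_id": "12345"}',
    "xml": "<model_id>12345</model_id>",
}
```

Calling `count_tokens(SAMPLES, cl100k_encoder())` returns a dict of token counts per format. Passing a different encoder lets you check whether the gap holds for other tokenizers.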

tarasglek · on June 20, 2024

You will love YAML, since it's a similar improvement in token use over JSON.