Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

XML is also a great option, but there are a few trade offs:

> XML is a many more tokens (much slower + $$$ for complex schemas)

> regardless of if you're looking for } or </output> its really a matter of "does your parser work". when you have three tokens that need to be correct "</" "output", ">", the odds of a mistake are higher, instead of when you just need "}".

That said, the parser is much easier to write, we're actually considering supporting XML in BAML. have you found any reductions of accuracy?

Also, not sure if you saw this, but apparently Claude doesn't actually prefer XML, it just happens to work well with it. Was recently new info for myself as well. https://x.com/alexalbert__/status/1778550859598807178 (devrel @ Anthropic)



I think once you have < and / the rest becomes much easier to predict. In a way it “spreads” the prediction over several tokens.

The < indicates that the preceding information is in fact over. The “/“ represents that we are closing something and not starting a subtopic. And the “output” defines what we are closing. The final “>” ensures that our “output” string is ended. In JSON all of that semantic meaning get put into the one token }.


Hmm, that's an interesting way of thinking about it. The way I see it, I trust XML less, because the sparser representation gives it more room to make a mistake: if you think of every token as an opportunity to be correct or wrong, the higher token count needed to represent content in XML gives the model a higher chance to get the output wrong (kinda like the birthday paradox).

(Plus, more output tokens is more expensive!)

e.g.

using the cl_100k tokenizer (what GPT4 uses), this JSON is 60 tokens:

    {
      "method": "GET",
      "endpoint": "/api/model/details",
      "headers": {
        "Authorization": "Bearer YOUR_ACCESS_TOKEN",
        "Content-Type": "application/json"
      },
      "queryParams": {
        "model_id": "12345"
      }
    }
whereas this XML is 76 tokens:

    <?xml version="1.0" encoding="UTF-8" ?>
    <method>GET</method>
    <endpoint>/api/model/details</endpoint>
    <headers>
        <Authorization>Bearer YOUR_ACCESS_TOKEN</Authorization>
        <Content-Type>application/json</Content-Type>
    </headers>
    <queryParams>
        <model_id>12345</model_id>
    </queryParams>
You can check out the tokenization here by toggling "show tokens": https://www.promptfiddle.com/json-vs-xml-token-count-BtXe3


you will love yaml since its a similar improvement in token use over json




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: