I suppose you could say that parsing any text-based protocol in general "Is a Minefield". They look so simple and "readable", which is why they're appealing initially, but parsing text always involves lots of corner cases, and I've always thought it a huge waste of resources to use text-based protocols for data that isn't actually meant for human consumption the vast majority of the time.
Consider something as simple as parsing an integer in a text-based format: there may be whitespace to skip, an optional sign character, and then a loop to accumulate digits and convert them (itself a subtraction, multiply, and add), and there are still the questions of all the invalid cases and what they should do. In contrast, in a binary format, all that's required is to read the data, and the most complex thing which might be required is endianness conversion. Length-prefixed binary formats are almost trivial to parse, on par with reading a field from a structure.
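A rough Python sketch of the contrast (illustrative only: the text-side rules are one choice among many, while the binary side is a single fixed-width read with the standard struct module):

    import struct

    def parse_int_text(s: str) -> int:
        # Text: skip whitespace, handle an optional sign, accumulate digits,
        # and decide what to do with everything that isn't a digit.
        i = 0
        while i < len(s) and s[i].isspace():
            i += 1
        sign = 1
        if i < len(s) and s[i] in "+-":
            sign = -1 if s[i] == "-" else 1
            i += 1
        if i >= len(s) or not s[i].isdigit():
            raise ValueError("invalid integer: %r" % s)
        value = 0
        while i < len(s) and s[i].isdigit():
            value = value * 10 + (ord(s[i]) - ord("0"))
            i += 1
        if i != len(s):
            raise ValueError("trailing junk in %r" % s)
        return sign * value

    def parse_int_binary(buf: bytes) -> int:
        # Binary: a fixed-width read; '<i' pins the width (4 bytes) and endianness.
        (value,) = struct.unpack("<i", buf[:4])
        return value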
> Length-prefixed binary formats are almost trivial to parse
They definitely are not, as demonstrated by the fact that binary lengths are the root cause of a huge number of security flaws. JSON mostly avoids that.
> the most complex thing which might be required is endianness conversion
That is a gross simplification. When you look at the details of binary representations, things get complex, and you end up with corner cases.
Let's look at floating-point numbers: with a binary format you can transmit NaN, Infinity, -Infinity, and -0. You can also create two NaN numbers that do not have the same binary representation. You have to choose single or double precision (maybe a benefit, not always). Etc.
Similarly, in JSON, integers or arrays of integers are nothing special. It is mostly a benefit not to have to specify UInt8Array.
JSON is one of many competitors within an ecology of programming, including binary formats, and yet JSON currently dominates large parts of that ecology. So far a binary format mutation hasn't beaten JSON, which is telling since binary had the early advantage (well: binary definitely wins in parts of the ecology, just as JSON wins in other parts).
> They definitely are not, as displayed by the fact that binary lengths are the root cause of a huge number of security flaws. JSON mostly avoids that.
I assume you're mainly referring to buffer overflows, which are a problem with text-based formats too. See for example the series of overflow vulnerabilities in IIS's HTTP parser, which led to some of the most disruptive worms in history, like Code Red. Really this is more of a problem with memory-unsafe languages than serialization formats.
> Let's look at floating point numbers: with a binary format you can transmit NaN, Infinity, -infinity, and -0
Depending on the use case being able to encode these values may be a requirement, in which case binary is no worse than text.
> You can also create two NaN numbers that do not have the same binary representation.
This is specific to IEEE 754, not all binary representations have this issue. Text based formats also have far more pervasive problems with lacking a canonical representation so it's hard to count this as a point against binary.
> JSON is one of many competitors within an ecology of programming, including binary formats, and yet JSON currently dominates large parts of that ecology.
> I assume you're mainly referring to buffer overflows,
That is the most visible security issue, but there are many others e.g. reading excess data is a security flaw (a la heartbleed).
> Really this is more of a problem with memory-unsafe languages than serialization formats
JSON is often used to communicate with unsafe languages. Are you suggesting a binary format is better for use with a language that is memory-unsafe? Or are you implying we should use Rust so that we can use a binary format?
> This is specific to IEEE 754, not all binary representations have this issue.
So now we use some other (unspecified) binary format for floating point numbers?
> This is just an appeal to popularity fallacy.
Bullshit. JSON didn't become popular because it was popular. Developers have chosen to use JSON because it served a purpose for them within their particular ecology. It became popular into the headwinds of XML and other formats.
There are plenty of unpopular binary formats that developers have had experience with that they choose not to use. I personally have enough experience with a variety of standardized and custom binary formats to know when I would use one.
Binary formats most definitely have their place (embedded, high throughput at scale, severe bandwidth restrictions, strict typing, languages with poor string handling, naturally binary data). But for a large percentage of software projects, JSON works and it works well.
XML became popular because of its human readability compared to binary and allowed disparate systems to cooperate. JSON is even more readable and allowed the whole XSLT thing to be ignored, making a dev's life easier, and really took off with node.
Binary is not gone, but you don't see a lot of it in the 'web world', because everyone in that space uses JSON, almost exclusively.
Having written many parsers, binary is by far and away the easiest, provided your language/environment supports it. That is, binary in JS is a pain, because you don't have native ints and floats. In JS, JSON is the base atomic data structure, so it makes complete sense that it's used... but in C/C++ and co., JSON is harder and introduces a lot of overhead/quirks which simply don't exist in binary (provided you cover the buffer overruns and co.).
> XML became popular because of its human readability compared to binary and allowed disparate systems to cooperate. JSON is even more readable and allowed the whole XSLT thing to be ignored, making a dev's life easier, and really took off with node.
XML was simply never designed to represent structured data. It was meant to represent document markup in a way that was both simpler and more extensible than SGML.
If there's one thing the industry always seems to do, it's embrace some technology hammer as the solution for every problem. Before XML, some people were trying to trade around relational database dumps as data interchange formats, because relational databases were the golden hammer.
JSON is being abused by being stretched beyond its sweet spot as well, but there aren't industry consortiums necessarily pushing bad ideas like there were with SOAP and WS-*.
JSON has JSON Object Signing and Encryption (JOSE), compare that to XML-Dsig and XML-Enc. It has OpenAPI/Swagger, which is close to SOAP/WSDL. It has JSON-Schema, which is close to XML-Schema. It has OpenID Connect, which is literally SAML, but with JSON instead of XML.
If you're missing anything out of WS-* in JSON, you can be sure somebody is working on a spec for it.
Erlang and Elixir are two more memory-safe languages that comfortably handle binary. See Armstrong, Programming Erlang, Ch. 5, "Advanced Bit Syntax Examples", with partial parsers for MPEG, COFF, and IPv4 on pages 83-89.
There is no functional difference between an array of unsigned bytes (binary) and an array of signed bytes (char data). The only difference is that when you send binary, 0 is now a valid value instead of terminating a string. Therefore you must prepend the size, because you can no longer parse till you find a NULL byte. It is always safer to know the size you must allocate ahead of time instead of dynamically growing a buffer until the text stream is terminated.
There are plenty of examples of binary formats where you do not know buffer sizes until you've received all the data, and where assumptions with parsing the data can cause a buffer overflow.
Decompression and PNG libraries, for example, have caused massive security impact across the industry because of reuse in different products. Font handling, compressed bitmap, and Windows cursor parsing have also been sources of issues.
Mozilla didn't just invest in Rust because parsing HTML and JSON is hard. It's all hard.
“It is always safer to know the size you must allocate ahead of time instead of dynamically growing a buffer until the text stream is terminated.”
And then you go on to give examples of exactly those scenarios, showing this is true, while saying I'm wrong? Any time you have an unknown payload you have to make a determination of how long you're going to wait, how much you're going to accept, how much to buffer, etc., before it becomes a drain on the system.
Yes, in fact, this is most common in video streaming formats. These types of streams are more commonly downloaded as opposed to uploaded where the server has to be careful not to exhaust too many resources parsing variable-length messages.
There's no reason strings can't be sent like binary if you do a size header first. The problem is trying to send binary like a string, where your data might have 0s that could be interpreted by the receiver as string terminators. Typically base64 is used to address this issue.
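A minimal sketch of that in Python (the 4-byte big-endian length header and the base64 step are illustrative choices here, not any particular protocol):

    import base64
    import struct

    def frame(payload: bytes) -> bytes:
        # Prepend a 4-byte length so the receiver never scans for a terminator.
        return struct.pack(">I", len(payload)) + payload

    def unframe(buf: bytes) -> bytes:
        (length,) = struct.unpack(">I", buf[:4])
        if length > len(buf) - 4:
            raise ValueError("declared length exceeds available data")
        return buf[4:4 + length]

    # Binary data destined for a text field gets base64'd instead,
    # so embedded zero bytes can't be mistaken for terminators.
    blob = bytes([0, 1, 2, 255])
    as_text = base64.b64encode(blob).decode("ascii")   # 'AAEC/w=='
    assert base64.b64decode(as_text) == blob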
>This is specific to IEEE 754, not all binary representations have this issue.
Also, sometimes it is a feature. The payload of a NaN value can be user-defined, and some programs use it[1]. The string "NaN" drops information that might be useful to some programs; it just doesn't affect as many of them as null -> "null" does.
> Similarly in JSON integers or arrays of integers are nothing special.
JSON is perfectly capable of representing integers which cannot be represented in IEEE-754 double precision floating point. That seems at least a little special to me.
> with a binary format you can transmit NaN, Infinity, -infinity, and -0. You can also create two NaN numbers that do not have the same binary representation. You have to choose single or double precision (maybe a benefit, not always). Etc.
Boy you are really stretching to make this sound complicated. It's not. You transmit 4 bytes or 8 bytes. Serialization is a memcpy().
You don't have to think about NaNs and Infinities because they Just Work -- unlike with textual formats where you need to have special representations for them and you have to worry about whether you are possibly losing data by dropping those NaN bits. If you want to drop the NaN bits in a binary format, it's another one-liner to do so.
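A quick illustration in Python, treating struct.pack as the stand-in for the memcpy (the specific NaN payload is an assumption for the demo; payload preservation holds on typical IEEE platforms):

    import json
    import struct

    # Binary: 8 bytes in, 8 bytes out. A NaN with a non-default payload
    # survives the round trip bit-for-bit.
    wire = bytes.fromhex("010000000000f87f")      # little-endian quiet NaN, payload 1
    (value,) = struct.unpack("<d", wire)
    assert struct.pack("<d", value) == wire

    # Text: the standard library encoder either emits non-standard JSON
    # or refuses outright.
    json.dumps(float("nan"))                      # -> 'NaN' (not valid JSON)
    try:
        json.dumps(float("inf"), allow_nan=False)
    except ValueError:
        pass                                      # strict mode just gives up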
It's funny that you choose to pick on floating-point numbers here, because converting floating-point values to decimal text and back is insanely complicated. One of the best-known implementations of converting FP to text is dtoa(), based on the paper (yes, a whole paper) called "How to Print Floating-Point Numbers Accurately". Here's the code:
> > Length-prefixed binary formats are almost trivial to parse
> They definitely are not, as displayed by the fact that binary lengths are the root cause of a huge number of security flaws. JSON mostly avoids that.
Injection (forgetting to escape embedded text) is the root cause of a huge number of security flaws for text formats. Length-prefixed formats do not suffer from this.
What "huge number of security flaws" are you referring to that affect length-delimited values? Buffer overflows? Those aren't caused by values being length-delimited, they are caused by people deserializing variable-length values into fixed-length buffers without a check. That mistake can be made just as easily with a text format as with a binary format. In fact I've seen it much more often with text.
> JSON currently dominates large parts of that ecology.
JSON wins for one simple reason: it's easy for human developers to think about, because they can see what it looks like. This is very comforting. It's wasteful and full of pitfalls ("oops, my int64 silently lost a bunch of bits because JSON numbers are all floating point"), but comforting. Even I find it comforting.
Ironically, writing a full JSON parser from scratch is much more complicated that writing a full Protobuf parser. But developers are more comfortable with the parser being a black box than with the data format itself being a black box. ¯\_(ツ)_/¯
(Disclosure: I am the author of Protobuf v2 and Cap'n Proto. In addition to binary native formats, both have text-based alternate formats for which I wrote parsers and serializers, and I've also written a few JSON parsers in my time...)
I think people undervalue clean-looking (alphabet-only, few special characters) things, things that don't require people to use the symbol-parsing part of their brain; basically, easily human-parseable things. I suspect this phenomenon can be observed in the relative popularity of JSON, TOML, YAML and Python plus the relative unpopularity of Lisp, Haskell, Rust and XML. And if we look at protobuf in this context, it is not easy for humans to parse, which causes people not to want to use it; developers are not
> more comfortable with the parser being a black box
they're more comfortable with the parser being a black box but the format being relatively easy to parse compared to the parser being easy to understand but the format basically unreadable for a human.
> I think people undervalue clean-looking (alphabet-only, few special character) things, things that don't require people to use the symbol-parsing part of their brain. Basically easily human-parseable things.
The symbol-parsing part of the human brain is what parses letters and numbers, as well as other abstract symbols. The division of symbols into letters, numbers, and others is fairly arbitrary. Most people would call "&" a symbol rather than a letter, but the modern name of that symbol is a smoothing-over of the way it was recited when it was considered part of the alphabet and recited with it.
> I suspect this phenomenon can be observed in the case of relative popularity of JSON, TOML, YAML and Python plus the relative unpopularity of Lisp, Haskell, Rust, XML.
I suspect not: Lisp and Haskell have less use of non-alphanumeric characters than most more-popular general purpose languages, and not significantly more than Python; also, if this was the phenomenon in play, popularity would be TOML > YAML > JSON but in reality it's closer to the reverse.
> The symbol parsing part of the human brain is what parses letters and numbers, as well as other abstract symbols.
I really don't think that's true when you compare someone using the Latin alphabet and words in that alphabet to some other alphabet (e.g. {}():!) and "words" (or meanings) in those. Just as a crude example, parsing "c = a - b", where equals and minus are one symbol each and have been taught for a while, is different from parsing "c := a << b", where ":=" and "<<" basically act as separate meanings someone has to learn to understand. Similar to the difference between the Latin alphabet and, say, simplified Chinese.
> also, if this was the phenomenon in play, popularity would be TOML > YAML > JSON but in reality it's closer to the reverse.
There could be somewhat of a sigmoid response to the effect: a decreased reaction as you go toward either extreme, compared to deviating from the average.
I'm not a linguist so it is my speculation, so don't take it too seriously :D
> Those aren't caused by values being length-delimited, they are caused by people deserializing variable-length values into fixed-length buffers without a check. That mistake can be made just as easily with a text format as with a binary format. In fact I've seen it much more often with text.
Especially since text can be arbitrarily long. From that perspective, length-delimited text (I've seen that before in a few file formats, and more notably HTTP) is probably the worst of both worlds.
> You can also create two NaN numbers that do not have the same binary representation.
Way worse than that. The interpretation of those different values of NaN is software specific. You can also have signalling NaN values - where the recipient can now have their number handling code trap in completely unexpected scenarios.
Sweet summer child, isn't that what people say these days?
You and I can agree that the reasonable thing to do is to accept American encoding.
(The below is somewhat simplified, the way I remember it on a late Saturday night 10+ years later.)
The outsourced team of programmers from our software vendor did not. They (mostly) used the built-in regional settings in Windows (best practice, don't reinvent the wheel), meaning we had to come up with ways to make sure the machines ran with the "wrong" regional settings (since a lot of stuff was already serialized that way and critical parts of the software were hardcoded to use the US standard).
Not sure about CSV files, but every time I paste numbers generated by Python or Java I have to use Notepad or Excel itself to replace dots with commas because of (Windows'!) regional settings.
In a "somebody sent this input to my program" context, designing some means to choose between these three results (or possibly others) is the main concern. Actual parsing is the easy part.
Do you get the user to choose? Great, now the UI is 2x more complex.
Do you just go with what the browser or client locale says? (User lives in Canada but their laptop is set to Turkish locale).
Go with the locale tied to the geo location of the user instead? (User above stubbornly enters all amounts using Norwegian conventions after completing high school in Norway).
If it's important, a confirmation page presented to the user and formatted in their presumed locale can help a lot.
This is a little too generous about the benefits of binary formats vs text formats. Ultimately, any data exchange between disparate systems is going to be a challenging task, no matter what format you choose. Both sides have to implement it in a compatible way. And ultimately, every format is a binary format. Encoding machine-level data structures directly on the wire sounds good, but it quickly gets complicated when you have to deal with multiple architectures and languages. And you don't have the benefit of the gradually accreted de-facto conventions, like using UTF-8 encoding for text-based formats, to fall back on, much less the ability for humans to troubleshoot by being able to read the wire protocol.
With sufficient discipline and rigor, and a good suite of tests, developed over years of practical experience, you can evolve a good general binary wire protocol, but by then it will turn out to be so complicated and heavyweight to use, that some upstart will come up with a NEW FANTASTIC format that doesn't have any of the particular annoyances of your rigorous protocol, and developers will flock to this innovative and efficient new format because it will help them get stuff done much faster, and most of them won't run into the edge cases the new format doesn't cover for years, and then some of them will write articles like this one and comments like yours and we can repeat the cycle every 10-20 years, just like we've been doing.
>Consider something as simple as parsing an integer in a text-based format; there may be whitespace to skip, an optional sign character, and then a loop to accumulate digits and convert them (itself a subtraction, multiply, and add), and there's still the questions of all the invalid cases and what they should do.
^[ ]*-?[0-9][0-9]*[ ]*$
You're welcome. Anything that passes that regex is a valid number. Now using that as the basis of a lexer means that you can store any int in whatever precision you feel like.
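As a rough Python sketch of that lexing idea (int() supplies the arbitrary precision; error handling is the bare minimum):

    import re

    INT_RE = re.compile(r"^[ ]*-?[0-9][0-9]*[ ]*$")

    def lex_int(token: str) -> int:
        if not INT_RE.match(token):
            raise ValueError("not an integer token: %r" % token)
        return int(token)        # Python ints are arbitrary precision

    assert lex_int("  -42 ") == -42
    assert lex_int("9007199254740993") == 9007199254740993   # > 2^53, still exact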
It's unfortunate that the majority of programmers these days are so computer illiterate that they can't write a parser for matching parens and call you an elitist for pointing out this is something anyone with a year of programming should be able to do in their sleep.
Because of this article (which I encountered a year ago) I would say Parsing JSON is no longer a minefield.
I had to write my own JSON parser/formatter a year ago (to support Java 1.2 - don't ask) and this article and its supporting github repo (https://github.com/nst/JSONTestSuite) was an unexpected gift from the heavens.
Good point. I am presuming the test suite is comprehensive. Does it cover 100% of all JSON mines? Probably not. But it surfaced about 30 bugs in my own implementation - things I would have never dreamed of.
So it certainly helped me. And just based on how thorough and insane the test suite is, I think I'm in good hands. Not perfect hands - but definitely a million times better than anything I would have come up with on my own.
The test suite made my parser blow up many times, and for each blow up I got to make a conscious decision in my bugfix: how do I want to handle this?
(I decided to let the 10,000-depth nested {{{{{{{{{{{{{{{{{{"key": "value"}}}}}}}}}}}}}}}}}} guy blow up even though it is legal. Yes, I'm too lazy to implement my own stack.) :-)
This might be throwing a lit match into a gasoline refinery, but why not opt for XML in some circumstances?
Between its strong schema and WSDL support for internet standards like SOAP web services, XML covers a lot of ground that JSON encoding doesn't necessarily have without add-ons.
I say this knowing this is an unfashionable opinion and XML has its own weaknesses, but in the spirit of using web standards and LoC approved "archivable formats", IMO there is still a place for XML in many serialization strategies around the computing landscape.
JSON is perfect for serializing between client and server operations or in progressive web apps running in JavaScript. It is quite serviceable in other places as well, such as microservice REST APIs, but in other areas of the landscape, like middleware, database record excerpts, desktop settings, and data transfer files, JSON is not much better or sometimes even slightly worse than XML.
XML cannot be parsed into nested maps/dictionaries/lists/arrays without guidance from a type or a restricted xml structure.
JSON can do that. It also maps pretty seamlessly to types/classes in most languages without annotations, attributes, or other serialization guides.
It also has explicit indicators for lists vs subdocuments vs values for keys, which xml does not. XML tags can repeat, can have subtags, and then there are tag attributes. A JSON document can also be a list, while XML documents must be a tree with a root document.
XML may be acceptable for documents. But seeing as how XHTML was a complete dud, I doubt it is useful even for that.
And we didn't even need to get into the needless complexity of validation, namespaces, and other junk.
> And we didn't even need to get into the needless complexity of validation, namespaces, and other junk.
So, that's why we're adding all of this "junk" back into JSON? Transformers, XPath for JSON, validation, schemas, namespaces (JSON-LD, JSON prefixes): it's all there.
History repeating itself (and here’s the important part) because this complexity is needed. Not every application will need every complication, but every complication is needed by some application.
No junk has been added into JSON - the specification hasn't changed to accommodate those features.
Unless you need to use the feature, you don't need to know anything about it, which is a huge benefit for the majority. XML almost encourages programmers to use unnecessary features.
When an application domain chooses to add a feature (say JSON-LD) then there are advantages to that mixture over XML. Where XML is better, it is often chosen instead.
That depends entirely on which parser you’re using. People have wanted comments so badly there are parsing libraries (and proposed revisions to JSON) that include comments. And sometimes those comments are used to provide processing directives.
> Suppose you are using JSON to keep configuration files, which you would like to annotate. Go ahead and insert all the comments you like. Then pipe it through JSMin before handing it to your JSON parser.
> XML cannot be parsed into nested maps/dictionaries/lists/arrays without guidance from a type or a restricted xml structure.
? Using XML without a schema is slightly worse than JSON because the content of each node is just "text". XML with schema is far more powerful, also because of a richer type-system. JSON dictionaries are most of the time used to encode structs, but for that you have `complexType` and `sequence` in the XML schema.
I've been using XML with strongly-typed schemas for serialization for the last couple of years and couldn't be happier. I have ~100 classes in the schema, yet I've needed a true dictionary like 2 or 3 times.
> And we didn't even need to get into the needless complexity of validation, namespaces, and other junk.
Validation is junk? Isn't it valuable to know that 1) if your schema requires a certain element, and 2) if the document has passed validation, then navigating to that element and parsing it according to its schema type won't throw a run-time exception?
Namespaces are junk? They serve the same purpose as in programming languages. How else would you put two elements of the same name but of different semantics (coming from different sources) into the same document? You can fake this in JSON "by convention", but in XML it's standardized.
XML is a perfectly serviceable data exchange format. The parsers and serializers work great when used properly. It's nice to have schema.
But I think people just got sick of XML because it was abused so badly with "web services", SOAP, wsdl and all those horrible technologies from the early naughts. Over-complicated balls of mud that made people miserable.
Apple's plist format might be the weirdest abuse of XML as far as I can tell. The SOAP envelopes and shit like that were horrible but plist is plain weird.
Everyone abused XML some way or another. JSON is not that "abusable" I'd say.
XML is a beast to parse. It's slow to parse and verbose, yet it doesn't give you a human-friendly text format. It's got a number of weird features inherited from SGML. Every parser needs a quirks mode, since nobody can write good schemas and schema parsers.
XML is a really bad interchange format. It's OK for a document markup language, and that's where it survives.
When I was doing XML/Java stuff 10 years ago, you take your XSD and generate domain classes as a build step. It was more complicated but it was also 100% reliable because the tools were all rock solid. Written by the guy who made Jenkins.
Many languages have libraries built in that do something reasonable with JSON. Usually you just make a class or struct, instantiate it, and then generate JSON, no need to have a separate compile step. When going the other direction, I usually just format the JSON, copy that into my code, then fix the compile errors.
XML has all that tooling because it needs it. JSON is a lot more straightforward, is more compact, and is faster to parse and (probably) generate.
If you're going to go through the effort of a compile step, you should probably just use a binary protocol, which will get you even better performance and getting documentation out of the box (e.g. protocol buffers schemas are very readable).
I see absolutely no reason to use XML these days as a data format, but it's still a reasonable choice as a markup format (you know, what the M stands for).
> Many languages have libraries built in that do something reasonable with JSON.
What about cross-language? In C# I define a class containing a `DateTime` field, export the schema with xsd, and generate classes for Java with xjc, and get back a field of (an equivalent of) `DateTime` type. Doing what you suggest with JSON, I'd get a "string". Thanks but no thanks.
> If you're going to go through the effort of a compile step, you should probably just use a binary protocol, […] I see absolutely no reason to use XML these days as a data format,
In our product we use a relational db (SQLServer) combined with XML. Each table has a structured part which is put into relational columns, plus an extensions part that is put into a "Data" XML column for semi-structured data. SQLServer supports XQuery so we can query the semi-structured data from SQL when needed.
This wouldn't fly with a binary format.
EDIT: yes, SQLServer also supports JSON, but has special optimizations for XML (e.g., it can understand schema types, it supports XML indexes which "shred" XML to a more efficient binary representation based on schema, etc.)
If you're going to reach for automation, why not just use a binary format like protocol buffers, flat buffers, capn proto, etc? You get the tooling and a ton of performance for free.
JSON is great because you don't need tooling. XML is great because it's expressive. You don't need expressiveness for a data format, but it works great as a markup language.
Syntax aside, I think the original mistake is IDLs, schemas, and other attempts at formalism.
WSDL, SOAP, and all their precursors were attempted in spite of Postel's Law.
Repeating myself:
Back when I was doing electronic medical records, my two-person team ran circles around our (much larger) partners by abandoning the schema tool stack. We were able to detect, debug, correct interchange problems and deploy fixes in near realtime. Whereas our partners would take days.
Just "screen scrap" inbound messages, use templates to generate outbound messages.
I'd dummy up working payloads using tools like SoapUI. Convert those known good "reference" payloads into templates. (At the time, I preferred Velocity.) Version every thing. To troubleshoot, rerun the reference messages, diff the captured results. Massage until working.
Our partners, and everyone I've told since, just couldn't grok this approach. No, no, no, we need schemas, code generators, etc.
There's a separate HN post about Square using DSLs to implement OpenAPI endpoints. That's maybe 1/4th of the way to our own home made solution.
I personally like XML a lot for rich text (I like HTML better than TeX) and layout (like in JSX for React), and it's not horrible if you want a readable representation for a tree, but I can't imagine using it for any other purpose.
JSON is exactly designed for object serialization. XML can be used for that purpose but it's awkward and requires a lot of unnecessary decisions (what becomes a tag? what becomes an attribute? how do you represent null separately from the empty string?) which just have an easy answer in JSON. And I can't think of any advantage XML has to make up for that flaw. Sure, XML can have schemas, but so can JSON.
I will agree that JSON is horrible for config files for humans to edit, but XML is quite possibly even worse at that. I don't really like YAML, either. TOML isn't bad, but I actually rather like JSON5 for config files - it's very readable for everyone who can read JSON, and fixes all the design decisions making it hard for humans to read and edit.
One of the biggest advantages for XML are attributes and namespaces. I miss these in JSON.
As AtlasBarfed mentioned, JSON has a native map and list structure in its syntax, which is sorely missed in XML. You have to rely on an XML Schema to know that some tag is expected to represent a map or list.
JSON with attributes and namespaces would be my ideal world.
Why do you want those? Attributes and namespaces just make in memory representation complicated. They're quite useful for markup, but I don't really know why you'd want them in a data format.
Use JSON or a binary protocol for data, XML for markup.
To be fair, there were a lot of very good ideas for a 2.x XML that solved a lot of the complexity. The problem was that none of the tools would be upgraded to support it.
You'd basically have to create a new independent format to have proper compatibility once you introduce breaking changes.
- by removing DTDs, remove the concepts of notations and all external resource resolution from the core spec. Also, no possibility of entity-expansion attacks.
- by removing DTDs, remove validation from the core spec.
- merge namespaces into the core specification. At the same time, make them mandatory
- merge the concept of qualified names into the core specification
- by making namespaces mandatory, all the variations of how namespaces get exposed can be eliminated
- merge the info-set definition into the core specification
- by describing XML items and how they relate, implementations can understand what data is relevant at a particular point while parsing the document.
- Merge xml:id into the core specification.
You also had some other fun outlier concepts:
- Eliminate prefixes from infoset. This is mostly a breaking change for XPath and XML Schema.
- Add an explicit qualified name token (possibly recycling the entity declaration). This would allow the above specs to have their functionality restored, although likely with a new format.
- Accept qualified names without prefixes, such as via a {uri}:{localName} syntax.
Once you've parsed the first minefield, another crop emerges: interpreting the result. Even the range of values seen in the wild for a supposedly simple boolean attribute is just mind-boggling. Setting aside all the noise from jokers trying it on with fuzzing engines, we'll see all of these presented to various APIs:
That last looks like a doozy, but old lags will guess what's going on right away. It's the octets of the 8-bit string "true", misinterpreted as UCS-2 (16-bit wide character) code points and then spat out as UTF8. Google translates it, quite appropriately, as "Enemy".
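For what it's worth, the mangling is easy to reproduce in Python (assuming the little-endian reading; the big-endian one gives different characters):

    octets = b"true"                      # 74 72 75 65
    as_ucs2 = octets.decode("utf-16-le")  # '牴敵' -- the second character is the "Enemy" one
    on_the_wire = as_ucs2.encode("utf-8") # b'\xe7\x89\xb4\xe6\x95\xb5'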
Oddly though, according to my records, never seen a "NULL".
On the other hand, if you only need 53 bits of your 64 bit numbers, and enjoy blowing CPU on ridiculously inefficient marshaling and unmarshaling steps, hey, it's your funeral.
> some JSON parsers (include JS.eval) parse it to 9223372036854776000 and continue on their merry way
This is correct behavior though...? Every number in JSON is implicitly a double-precision float. JSON doesn’t distinguish other number types.
If you want that big a string of digits in JSON, put it in a string.
Edit: let me make a more precise statement since several people seem to have a problem with the one above:
Every number that you send to a typical JavaScript JSON parser is implicitly a double-precision float, and it is correct behavior for a JavaScript JSON parser to treat a long string of digits as a double-precision float, even if that results in lost precision.
The JSON specification itself punts on the precise semantic meaning of numbers, leaving it up to producers and consumers of the JSON to coordinate their number interpretation.
> This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.

> Note that when such software is used, numbers that are integers and are in the range [-(2^53)+1, (2^53)-1] are interoperable in the sense that implementations will agree exactly on their numeric values.
Or to paraphrase: if you pretend that every JSON number is a JavaScript number (double-precision float), you will generally be fine. If you don't, you are responsible for the inevitable interoperability problems you'll have with most current JSON parsers.
You can write an arbitrary precision number in there, say 4.203984572938457290834572098345787564e+20, but if the vast majority of JSON parsers interpret it as a double-precision float, there’s not much point.
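A small Python demonstration of that point: json.loads keeps the integer exact by default, and the parse_int hook can mimic a double-only parser (the float64 interpretation is the assumption being shown):

    import json

    big = "9007199254740993"                    # 2^53 + 1
    exact = json.loads(big)                     # Python keeps it as an int
    as_double = json.loads(big, parse_int=float)

    print(exact)       # 9007199254740993
    print(as_double)   # 9007199254740992.0 -- the last bit is gone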
If you control both the producer and consumer of the serialized data, you can of course do whatever you like. But I would recommend people who want more extensive data types use something other than JSON.
> If you control both the producer and consumer of the serialized data, you can of course do whatever you like.
Don't forget all the unintended intermediate producers and consumers due to microservices or even otherwise well-written tools that convert to float64 internally.
You're writing the server for a procedural space game where the coordinates are stored as numbers. As the ships get far from (0,0,0) people start to report that the graphics get "jumpy" and gameplay rules break frequently.
Or, your server keeps track of how long it's been running, in nanoseconds. When the server stays up for months, people start noticing weird precision issues and hard to replicate bugs around time.
When the tests 1) fail to experiment with numeric values greater than 2^53 or less than -2^53, and 2) fail to carefully check the results even in the situations where they did experiment with such values.
How often do you encounter situations where your values start off below 2^53 and then later grow above 2^53? How often do you deal with integers greater than 53 bits where the app doesn't immediately fail when the exact value is not correct (i.e. unique IDs or crypto keys)? I feel like these are edge cases which would rarely come into play in most software.
Edge cases which rarely come into play are the ones you should be most afraid of. Things that happen often get caught before they cause trouble. Things that never happen are fine. Things that happen rarely are what get you.
For a made-up example of how this could get you, let's imagine that you have a message service that gives each message a unique ID. Some bright soul decided to give this ID a nice structure and make it a 64-bit value where the top 32 bits are an incrementing integer per user, and the bottom 32 bits are the user's ID, assigned from a global incrementing integer.
Everything works great in testing and you deploy and the VC money is rolling in and then some of your very prolific users go past 2 million messages and suddenly messages are getting mixed up and you’re leaking private info because your access checking code happens to get the real 64-bit value but your message retrieval code puts the ID in a JSON number.
Now you might respond, but that ID scheme is dumb, don’t do that. And you may be right! But dumb things happen. It’s unwise to leave land mines lying around in your software just because they only detonate when someone does something dumb.
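A sketch of that failure in Python, using the hypothetical packing scheme described above:

    user_id = 12345
    for seq in (1_000_000, 3_000_000):           # messages sent by this user
        msg_id = (seq << 32) | user_id           # the "clever" 64-bit ID
        survives_json = int(float(msg_id))       # what a double-only parser sees
        print(seq, msg_id == survives_json)
    # 1000000 True   -- still fits in 53 bits, nobody notices
    # 3000000 False  -- past ~2 million messages the ID silently changes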
Oops, somehow the second half of the comment got truncated.
I meant to say: "I have, but I work mainly with embedded systems, and so there aren't usually POST or PUT APIs, just GET. So JSON is nearly always generated by the device and consumed by whoever is querying the device, not the other way around."
JSON sucks. Maybe half our REST bugs are directly related to JSON parsing.
Is that a long or an int? Boolean or the string "true"? Does my library include undefined properties in the JSON? How should I encode and decode this binary blob?
We tried using OpenApi specs on the server and generators to build the clients. In general, the generators are buggy as hell. We eventually gave up as about 1/4 of the endpoints generated directly from our server code didn't work. One look at a spec doc will tell you the complexity is just too high.
We are moving to gRPC. It just works, and takes all the fiddling out of HTTP. It saves us from dev slap fights over stupid cruft like whether an endpoint should be PUT or POST. And saves us a massive amount of time making all those decisions.
I have had the absolute joy of working with gRPC services recently. Static schemas and built in streaming mechanics are fantastic. It definitely removes a lot of my gripes with REST endpoints by design.
Ah, so that's the source of that Chrome bug that we saw last week. Customers on Chrome for Windows (only that, not Chrome for Linux or macOS) were complaining that the search on our statically-generated documentation site was not working. The search is implemented by a JavaScript file that downloads a JSON containing a search index, and it turns out that this search index had too much nesting for Chrome on Windows's JSON parser. This would reliably produce a stack overflow:
Not really, parsing JSON is probably a recursive algorithm and Windows gives processes way less stack space than Linux (generally 1MB vs 8MB). It could totally be that the code is the same, but the particular JSON string blows up the stack in Windows but not Linux.
Eh... not really if you're referring to call stacks. Rust at one point had growable stacks. That was removed for performance reasons. Haskell with GHC kind of has growable stacks (basically IIRC most function calls occur on the heap) and its stack overflows take a different form. SML I think at one point also had an implementation with a growable call stack.
If by limited you mean limited by the amount of memory your machine has then yes it's limited, but I don't think that's what parent was getting at, since in that sense everything about a computer is limited.
I've been using it to process and maintain giant JSON structures and it's faster than any other parser I've tried. I was able to replace my previous batch job with this as it gives real-time performance.
We haven't used that particular suite, but almost everything in that suite is something we've thought about. In many cases we do the right thing by not innovating and randomly allowing stuff that isn't in the spec.
I see exactly one thing we didn't think about, as our construction of a parse tree is pretty basic and we don't build an associative structure even when building up an object - thus we would not register an error when confronted with the malformed input listed under "2.4 Objects Duplicated Keys", but happily build a parse tree with duplicated keys (which will be built up strictly as a linear structure, not an associative one).
There seems to be leeway on this point as to what an implementation should do. It certainly doesn't fit our usage model very well to build an associative structure right there on the spot; some of our users wouldn't want that much complexity/overhead.
> Abstract Syntax Notation One (ASN.1) is a standard interface description language for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and especially in cryptography.
ASN.1 is incredibly hard to actually implement. There are dozens of cases of security bugs based on bad parsers. Also, there are a dozen different encodings of ASN.1 data, including JSON (JER). Its age also means that it has a bunch of obsolete datatypes.
Protobuf and friends have most of the power without a lot of the drawbacks.
> There are dozens of cases of security bugs based on bad parsers.
You're saying this on a thread called "Parsing JSON Is a Minefield", eh?
In any event, this is not unique to ASN.1. I haven't checked but I don't doubt there are similar cases for Protobuf, etc.
> Also there are a dozen different encodings of asn.1 data including json (JER).
So what? That's the opposite of a problem.
> Its age also means that it has a bunch of obsolete datatypes.
So don't use them.
- - - -
My point is that if the time and effort that was spent on Protobuf and CapnProto and all the others had somehow been spent instead on perfecting ASN.1 then, uh, that would have been good...
> My point is that if the time and effort that was spent on Protobuf and CapnProto and all the others had somehow been spent instead on perfecting ASN.1 then, uh, that would have been good...
I wrote proto2 in 20% time at Google and I developed Cap'n Proto entirely on my own time, unpaid. If you think ASN.1 could be perfected with a similar amount of work then why don't you do it?
It seems like I may have offended you, I didn't mean to, and I apologize.
I'd love to discuss this but don't want to get in a flame war.
In re: ASN.1, if I ever have to de/serialize some messages again (I'm quasi-retired ATM) I would use ASN1SCC "an ASN.1 compiler that was developed for ESA to cover all data modelling needs of space applications."
> The compiler is targetting safe systems and generate either Spark/Ada or C code. Runtime library is minimalistic and open-source. The tool handles custom binary encoding layouts, is fully customizable through a code templating engine, generates ICDs and automatic test cases."
I've used protobufs in the past and I REALLY want to use them now, but the human readable aspect of json has always kept me coming back.
If I want to perform some rough tests of an endpoint during development, all I need to do is compose the json request and fire it off using curl. The response then comes back in a human readable format I can parse straight from the terminal. Boom, simple test conducted in less than 1 minute. I don't even need to think about it.
Compare that to protobufs: I need to create a custom client or unit test that'll compose and fire off the request I want to test, then I need to write a bunch of code that will introspect the contents of the response so I can pick out the details. Huge time loss, concentration ruined since I need to actually think about the process. I'd rather just take the extra latency that using JSON will incur.
This skips past all of the other advantages json has over binary serialization protocols, like quickly being able to parse requests while debugging issues, infinite client language support, ease of sharing breaking requests to help devs reproduce problems, not needing to add an extra compilation step to my deployments and packages, etc.
> If I want to perform some rough tests of an endpoint during development, all I need to do is compose the json request and fire it off using curl. The response then comes back in a human readable format I can parse straight from the terminal. Boom, simple test conducted in less than 1 minute. I don't even need to think about it.
You really don't want to put >2GB in a single protobuf (or JSON object). That would imply that in order to extract any one bit of data in that 2GB, you have to parse the entire 2GB. If you have that much data, you want to break it up into smaller chunks and put them in a database or at least a RecordIO.
Cap'n Proto is different, since it's zero-copy and random-access. You can in fact read one bit of data out of a large file in O(1) time by mmap()ing it and using the data structure in-place.
Hence, it makes sense for Cap'n Proto to support much larger messages, but it never made sense for Protobuf to try.
Incidentally the 32-bit limitation on Protobuf is an implementation issue, not fundamental to the format. It's likely some Protobuf implementations do not have this limitation.
(Disclosure: I'm the author of Protobuf v2 and Cap'n Proto.)
Generally speaking, if you have 2GB of data, why would you want it inside a protobuf, or worse, json? You clearly aren't going to open it in a text editor or send it over ajax - just put the bulk of data as a separate binary blob, and your code won't have to scan 2GB to find the end of a string.
I can understand if protocol buffers had some technical issue that you disagreed with. Or if you had a preference for a different solution because of some reason.
But this comment seems needlessly cynical and doesn't actually offer any rebuttal to the parent's point. I find technical discussions of these sorts of things interesting and a great way for new people to learn about the tradeoffs, maybe you could offer a reasoned opinion on why not use protocol buffers?
I have a love/hate relationship with JSON schema. I hate the documentation, or significant lack thereof. But I do love how business rules about relationships between fields can be expressed in the schema - e.g., if typeId is 7, then the cost field must be populated.
With things like Avro or GPB you still need to validate that the relationship holds true separately.
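For reference, that kind of cross-field rule can be expressed with if/then in JSON Schema; a hedged sketch using the Python jsonschema package, with the typeId/cost names taken from the example above:

    from jsonschema import validate, ValidationError

    schema = {
        "type": "object",
        "properties": {"typeId": {"type": "integer"}, "cost": {"type": "number"}},
        "if": {"properties": {"typeId": {"const": 7}}},
        "then": {"required": ["cost"]},
    }

    validate({"typeId": 7, "cost": 9.99}, schema)    # passes
    try:
        validate({"typeId": 7}, schema)              # cost missing -> rejected
    except ValidationError:
        pass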
A lot of companies still use typed messages like Protocol Buffers, which makes this a significantly smaller issue. Especially since the message format was designed to be portable.
A question: just like JSON's syntax was based on JavaScript, could we create a JSON schema syntax based on TypeScript? So the schema would essentially be a TypeScript type annotation.
Parsing is a minefield. General purpose computing systems are minefields.
Of all human readable formats I've ever worked with, only S-expressions have proven easier and safer to parse. Json.org even has unambiguous railway diagrams!
I wish EDN would catch on. The simplicity of JSON with better number handling, a few more very useful data types like namespaced keywords and sets, arbitrarily complex keys, and an even terser, more readable syntax. [k v k v] beats [k: v, k: v] hands down, and commas are just whitespace, so you can include them if you want.
In case anyone doesn't click the link, it's much simpler than JSON, leaves interpretation up to the reader, supports polymorphic data, and has some tweaks to make it nicer to edit by hand. I've used this in a bunch of personal projects and it has mapped all the models I've come across perfectly. If you need more power in your format, you're better off using Lua than YAML.
I have a Rust Serde implementation 90% complete I could finish up if anyone wants it.
The bigger question is, what is there to be done? What is the road to more uniform handling of JSON? I've handled some JSON before and it's usually fairly easy until you catch one of these strange implementation quirks. But I'm not sure those quirks can be ironed out at this point.
Can you help me understand the problem? These things seem like corner cases that you could just Not Do(TM) and then you don’t have to worry about it.
What am I missing, when do these gotchas become an actual problem for you as a developer?
I’m not sure I’ve used any technology that was free of footguns, and JSON appears to have fewer of them than the average programming language or library.
I call this kind of answer “The C Answer”. “Who cares if this particular combination of code results in undefined behavior? Just don’t do it!”
> when do these gotchas become an actual problem for you as a developer?
Whenever you have to deal with JSON produced by “not you”, or when you have to deal with JSON that may have been corrupted in some fashion along the way.
I’m probably just not used to pure “all behavior is defined” systems, so I appreciate your perspective.
What industry do you work in? I don’t see many systems like that in my industry. I work like hell to push things in that direction, but it’s a best case of “we went from 5% well defined behavior to 50%” after many years of effort.
The product owner perspective on this should be "nothing supports anything unless it is tested".
If you pick up a standard and just assume other products will be able to work with it, you're in for a surprise. I don't care if it's TCP sockets or .ini files; if you didn't test compatibility with the product you expect will interact with yours through the standard, consider it unstable, and don't advertise support for it.
Sometimes you have to support a standard itself, like WPA2, so you implement the standard according to internet engineering best practice: be liberal with what you accept, and conservative with what you transmit (or something to that effect). Then test compatibility with the major products you know will want to use it, and fix the bugs you find.
I've been of the opinion for a while that a lot of issues could be resolved if we agreed on a streamable binary format that had good definitions for data types (including integers and dates).
String formats are great and all for viewing in whatever text viewer, but they're so inefficient, and then you have the whole business of escaping strings inside of strings and string-encoding binary data.
If we all agreed on a binary format then there would be a viewer for it in every debugging tool.
ASN.1, Protobuf, BSON, Ion, MessagePack, whatever. I would prefer a binary format that doesn't repeat keys, for efficiency, where the schema can be sent separately or inlined. But even one that's basically binary JSON with more types would be a step up.
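MessagePack is roughly that last option already; a hedged sketch with the Python msgpack package (it's schemaless, so keys still repeat, but it has real integer and binary types):

    import msgpack

    record = {"id": 42, "blob": b"\x00\x01\x02", "ok": True}
    wire = msgpack.packb(record)          # compact binary, typed ints and bytes
    assert msgpack.unpackb(wire) == record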
I'm not a professional coder. And I mostly work with tabular data, in spreadsheets and SQL. I like to get my data as delimited text files. Ideally, delimited with some character that's 100% guaranteed to never occur in the data. In my experience, "|" is often a good option, but you never know. And CSV, even with quotes, can be a nightmare, especially if the data contains addresses. Or names with quoted nicknames.
Anyway, given the choice, I always pick JSON over XML. Because with JSON, I can always identify the data blocks that I need, and parse them out with bash and spreadsheets. Not with XML, however. Just as not with HTML.
> For example, Xcode itself will crash when opening a .json file made of the character [ repeated 10000 times, most probably because the JSON syntax highlighter does not implement a depth limit.
I certainly would not criticize such a thorough examination for being facile, but I do want to point out that the conclusion "But sometimes, simple specifications just mean hidden complexity" is not supported by the article. Almost all of the edge cases are caused by implementors ignoring or extending the simple specification.
The crashing test cases look scary from a security perspective, specially in the C-based parsers. Does anyone know if these results are still up to date or if the bugs have already been fixed?
> Python, go, rust and other obscure or archaic languages struggle with json. I recommend using a modern language such as js/nodejs, as it is meant for the web - where json is king among formats.
In what ways are the languages you listed obscure or archaic?
None of those mean Python is "obscure or archaic" or "unfit for json" etc.
Regarding what is a valid object, it's about syntax. Python has a different syntax than JS. That doesn't mean Python is worse because JS has {hello: 'world'}.
In fact that's an inconsistency of JS that only holds for _some_ keys (basically, valid identifiers): if the key has a space or a dash or something, you need to quote it: {'hello-1': 'world'}. So Python is more consistent in having only one style of map literal.
>in python you also can't do json.hello like in js, it would be json['hello']
Again, irrelevant. Python has a different syntax. Python also has tons of stuff you can't do in JS. Operator overloading for example, in python myObj + myOtherObj is a valid user defined operation. None of this means JS or Python is superior.
And none of this has anything to do with JSON.
If you mean that JSON seems to be a better fit for JS syntax, that's because it was designed to be closer to JS syntax. That said, it's not JS syntax. {foo: "bar"} is valid JS but not JSON. JSON also doesn't accept classes, closures, and tons of other things that can go in a JS object.
Or you can let json.load do its thing and use a nice module like json-syntax[1] to convert it to well typed values and back. (Yup, plugging my library.)
I don't understand what you're saying. {hello: 'world'} is not valid JSON, and neither Python's nor JavaScript's JSON parser from the standard library parses that.
Again.
Parsing is a minefield in general, but parsing JSON is one of the easiest tasks of all serialization formats.
It's also the only secure format.
Its various spec bugs (by omission) are not that dramatic, and the various "enhancements" only made it worse, i.e. more insecure. Still: bad, but not a minefield. What worries me most is that my JSON module is the de facto Perl standard, passes all these tests, was the very first to add all these tests, is the fastest, and still is not included in that list, just some outdated modules which should not be used at all.
Checking best practices besides maintaining a spec obviously also is a minefield.
I forgot another major JSON minefield problem which is not mentioned nor tested here: stack overflow.
This is in fact the most important problem to test against, because it might lead to exploitable stack ROP.
JSON is usually parsed recursively, and deeply nested structures are mostly not depth-counted. One can trivially construct an array or map nested 500 to 30,000 levels deep, and at some point the parser either fails or crashes with an overflow. This number is fixed, thus trivially exploitable. The test spec should state the max depth for arrays and maps, and whether there's a fixed built-in limit, a compile-time limit, an implicit limit by crash, or none.
Non-recursive parsers are fine.
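Python's standard parser is a handy illustration of the depth problem: it recurses, but leans on the interpreter's recursion limit instead of smashing the native stack (a sketch; C parsers without such a guard are the ones at risk of the exploit scenario above):

    import json

    deep = "[" * 100_000 + "]" * 100_000
    try:
        json.loads(deep)
    except RecursionError:
        print("depth limit hit: RecursionError, not a native stack overflow")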