I suppose you could say that parsing any text-based protocol in general "Is a Minefield". They look so simple and "readable", which is why they're appealing initially, but parsing text always involves lots of corner cases, and I've always thought it a huge waste of resources to use text-based protocols for data that isn't actually meant for human consumption the vast majority of the time.
Consider something as simple as parsing an integer in a text-based format: there may be whitespace to skip, an optional sign character, and then a loop to accumulate digits and convert them (itself a subtraction, multiply, and add), and there are still the questions of all the invalid cases and what they should do. In contrast, in a binary format, all that's required is to read the data, and the most complex thing which might be required is endianness conversion. Length-prefixed binary formats are almost trivial to parse, on par with reading a field from a structure.
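A rough Python sketch of the contrast (illustrative only: the text-side rules are one choice among many, while the binary side is a single fixed-width read with the standard struct module):

    import struct

    def parse_int_text(s: str) -> int:
        # Text: skip whitespace, handle an optional sign, accumulate digits,
        # and decide what to do with everything that isn't a digit.
        i = 0
        while i < len(s) and s[i].isspace():
            i += 1
        sign = 1
        if i < len(s) and s[i] in "+-":
            sign = -1 if s[i] == "-" else 1
            i += 1
        if i >= len(s) or not s[i].isdigit():
            raise ValueError("invalid integer: %r" % s)
        value = 0
        while i < len(s) and s[i].isdigit():
            value = value * 10 + (ord(s[i]) - ord("0"))
            i += 1
        if i != len(s):
            raise ValueError("trailing junk in %r" % s)
        return sign * value

    def parse_int_binary(buf: bytes) -> int:
        # Binary: a fixed-width read; '<i' pins the width (4 bytes) and endianness.
        (value,) = struct.unpack("<i", buf[:4])
        return value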
> Length-prefixed binary formats are almost trivial to parse
They definitely are not, as demonstrated by the fact that binary lengths are the root cause of a huge number of security flaws. JSON mostly avoids that.
> the most complex thing which might be required is endianness conversion
That is a gross simplification. When you look at the details of binary representations, things get complex, and you end up with corner cases.
Let's look at floating-point numbers: with a binary format you can transmit NaN, Infinity, -Infinity, and -0. You can also create two NaN numbers that do not have the same binary representation. You have to choose single or double precision (maybe a benefit, not always). Etc.
Similarly, in JSON, integers or arrays of integers are nothing special. It is mostly a benefit not to have to specify UInt8Array.
JSON is one of many competitors within an ecology of programming, including binary formats, and yet JSON currently dominates large parts of that ecology. So far a binary format mutation hasn't beaten JSON, which is telling since binary had the early advantage (well: binary definitely wins in parts of the ecology, just as JSON wins in other parts).
> They definitely are not, as displayed by the fact that binary lengths are the root cause of a huge number of security flaws. JSON mostly avoids that.
I assume you're mainly referring to buffer overflows, which are a problem with text-based formats too. See for example the series of overflow vulnerabilities in IIS's HTTP parser, which led to some of the most disruptive worms in history, like Code Red. Really this is more of a problem with memory-unsafe languages than serialization formats.
> Let's look at floating point numbers: with a binary format you can transmit NaN, Infinity, -infinity, and -0
Depending on the use case being able to encode these values may be a requirement, in which case binary is no worse than text.
> You can also create two NaN numbers that do not have the same binary representation.
This is specific to IEEE 754, not all binary representations have this issue. Text based formats also have far more pervasive problems with lacking a canonical representation so it's hard to count this as a point against binary.
> JSON is one of many competitors within an ecology of programming, including binary formats, and yet JSON currently dominates large parts of that ecology.
> I assume you're mainly referring to buffer overflows,
That is the most visible security issue, but there are many others e.g. reading excess data is a security flaw (a la heartbleed).
> Really this is more of a problem with memory-unsafe languages than serialization formats
JSON is often used to communicate with unsafe languages. Are you suggesting a binary format is better for use with a language that is memory-unsafe? Or are you implying we should use Rust so that we can use a binary format?
> This is specific to IEEE 754, not all binary representations have this issue.
So now we use some other (unspecified) binary format for floating point numbers?
> This is just an appeal to popularity fallacy.
Bullshit. JSON didn't become popular because it was popular. Developers have chosen to use JSON because it served a purpose for them within their particular ecology. It became popular into the headwinds of XML and other formats.
There are plenty of unpopular binary formats that developers have had experience with that they choose not to use. I personally have enough experience with a variety of standardized and custom binary formats to know when I would use one.
Binary formats most definitely have their place (embedded, high throughput at scale, severe bandwidth restrictions, strict typing, languages with poor string handling, naturally binary data). But for a large percentage of software projects, JSON works and it works well.
XML became popular because of its human readability compared to binary and allowed disparate systems to cooperate. JSON is even more readable and allowed the whole XSLT thing to be ignored, making a dev's life easier, and really took off with node.
Binary is not gone, but you don't see a lot of it in the 'web world', because everyone in that space uses JSON, almost exclusively.
Having written many parsers, binary is by far and away the easiest, provided your language/environment supports it. That is, binary in JS is a pain, because you don't have native ints and floats. In JS, JSON is the base atomic data structure, so it makes complete sense that it's used... but in C/C++ and co., JSON is harder and introduces a lot of overhead/quirks which simply don't exist in binary (provided you cover the buffer overruns and co.).
> XML became popular because of its human readability compared to binary and allowed disparate systems to cooperate. JSON is even more readable and allowed the whole XSLT thing to be ignored, making a dev's life easier, and really took off with node.
XML was simply never designed to represent structured data. It was meant to represent document markup in a way that was both simpler and more extensible than SGML.
If there's one thing the industry always seems to do, it's embrace some technology hammer as the solution for every problem. Before XML, some people were trying to trade around relational database dumps as data interchange formats, because relational databases were the golden hammer.
JSON is being abused by being stretched beyond its sweet spot as well, but there aren't industry consortiums necessarily pushing bad ideas like there were with SOAP and WS-*.
JSON has JSON Object Signing and Encryption (JOSE), compare that to XML-Dsig and XML-Enc. It has OpenAPI/Swagger, which is close to SOAP/WSDL. It has JSON-Schema, which is close to XML-Schema. It has OpenID Connect, which is literally SAML, but with JSON instead of XML.
If you're missing anything out of WS-* in JSON, you can be sure somebody is working on a spec for it.
Erlang and Elixir are two more memory-safe languages that comfortably handle binary. See Armstrong, Programming Erlang, Ch. 5, "Advanced Bit Syntax Examples", with partial parsers for MPEG, COFF, and IPv4 on pages 83-89.
There is no functional difference between an array of unsigned bytes (binary) and an array of signed bytes (char data). The only difference is that when you send binary, 0 is now a valid value instead of terminating a string. Therefore you must prepend the size, because you can no longer parse till you find a NULL byte. It is always safer to know the size you must allocate ahead of time instead of dynamically growing a buffer until the text stream is terminated.
There are plenty of examples of binary formats where you do not know buffer sizes until you've received all the data, and where assumptions with parsing the data can cause a buffer overflow.
Decompression and PNG libraries, for example, have caused massive security impact across the industry because of reuse in different products. Font handling, compressed bitmap, and Windows cursor parsing have also been sources of issues.
Mozilla didn't just invest in Rust because parsing HTML and JSON is hard. It's all hard.
“It is always safer to know the size you must allocate ahead of time instead of dynamically growing a buffer until the text stream is terminated.”
And then you go on to give examples of exactly those scenarios, showing this is true, while saying I'm wrong? Any time you have an unknown payload you have to make a determination of how long you're going to wait, how much you're going to accept, how much to buffer, etc., before it becomes a drain on the system.
Yes, in fact, this is most common in video streaming formats. These types of streams are more commonly downloaded as opposed to uploaded where the server has to be careful not to exhaust too many resources parsing variable-length messages.
There's no reason strings can't be sent like binary if you do a size header first. The problem is trying to send binary like a string, where your data might have 0s that could be interpreted by the receiver as string terminators. Typically base64 is used to address this issue.
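A minimal sketch of that in Python (the 4-byte big-endian length header and the base64 step are illustrative choices here, not any particular protocol):

    import base64
    import struct

    def frame(payload: bytes) -> bytes:
        # Prepend a 4-byte length so the receiver never scans for a terminator.
        return struct.pack(">I", len(payload)) + payload

    def unframe(buf: bytes) -> bytes:
        (length,) = struct.unpack(">I", buf[:4])
        if length > len(buf) - 4:
            raise ValueError("declared length exceeds available data")
        return buf[4:4 + length]

    # Binary data destined for a text field gets base64'd instead,
    # so embedded zero bytes can't be mistaken for terminators.
    blob = bytes([0, 1, 2, 255])
    as_text = base64.b64encode(blob).decode("ascii")   # 'AAEC/w=='
    assert base64.b64decode(as_text) == blob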
>This is specific to IEEE 754, not all binary representations have this issue.
Also, sometimes it is a feature. The payload of a NaN value can be user-defined, and some programs use it[1]. The string "NaN" drops information that might be useful to some programs; it just doesn't affect as many of them as null -> "null" does.
> Similarly in JSON integers or arrays of integers are nothing special.
JSON is perfectly capable of representing integers which cannot be represented in IEEE-754 double precision floating point. That seems at least a little special to me.
> with a binary format you can transmit NaN, Infinity, -infinity, and -0. You can also create two NaN numbers that do not have the same binary representation. You have to choose single or double precision (maybe a benefit, not always). Etc.
Boy you are really stretching to make this sound complicated. It's not. You transmit 4 bytes or 8 bytes. Serialization is a memcpy().
You don't have to think about NaNs and Infinities because they Just Work -- unlike with textual formats where you need to have special representations for them and you have to worry about whether you are possibly losing data by dropping those NaN bits. If you want to drop the NaN bits in a binary format, it's another one-liner to do so.
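A quick illustration in Python, treating struct.pack as the stand-in for the memcpy (the specific NaN payload is an assumption for the demo; payload preservation holds on typical IEEE platforms):

    import json
    import struct

    # Binary: 8 bytes in, 8 bytes out. A NaN with a non-default payload
    # survives the round trip bit-for-bit.
    wire = bytes.fromhex("010000000000f87f")      # little-endian quiet NaN, payload 1
    (value,) = struct.unpack("<d", wire)
    assert struct.pack("<d", value) == wire

    # Text: the standard library encoder either emits non-standard JSON
    # or refuses outright.
    json.dumps(float("nan"))                      # -> 'NaN' (not valid JSON)
    try:
        json.dumps(float("inf"), allow_nan=False)
    except ValueError:
        pass                                      # strict mode just gives up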
It's funny that you choose to pick on floating-point numbers here, because converting floating-point values to decimal text and back is insanely complicated. One of the best-known implementations of converting FP to text is dtoa(), based on the paper (yes, a whole paper) called "How to Print Floating-Point Numbers Accurately". Here's the code:
> > Length-prefixed binary formats are almost trivial to parse
> They definitely are not, as displayed by the fact that binary lengths are the root cause of a huge number of security flaws. JSON mostly avoids that.
Injection (forgetting to escape embedded text) is the root cause of a huge number of security flaws for text formats. Length-prefixed formats do not suffer from this.
What "huge number of security flaws" are you referring to that affect length-delimited values? Buffer overflows? Those aren't caused by values being length-delimited, they are caused by people deserializing variable-length values into fixed-length buffers without a check. That mistake can be made just as easily with a text format as with a binary format. In fact I've seen it much more often with text.
> JSON currently dominates large parts of that ecology.
JSON wins for one simple reason: it's easy for human developers to think about, because they can see what it looks like. This is very comforting. It's wasteful and full of pitfalls ("oops, my int64 silently lost a bunch of bits because JSON numbers are all floating point"), but comforting. Even I find it comforting.
Ironically, writing a full JSON parser from scratch is much more complicated that writing a full Protobuf parser. But developers are more comfortable with the parser being a black box than with the data format itself being a black box. ¯\_(ツ)_/¯
(Disclosure: I am the author of Protobuf v2 and Cap'n Proto. In addition to binary native formats, both have text-based alternate formats for which I wrote parsers and serializers, and I've also written a few JSON parsers in my time...)
I think people undervalue clean-looking (alphabet-only, few special characters) things, things that don't require people to use the symbol-parsing part of their brain; basically, easily human-parseable things. I suspect this phenomenon can be observed in the relative popularity of JSON, TOML, YAML and Python plus the relative unpopularity of Lisp, Haskell, Rust and XML. And if we look at protobuf in this context, it is not easy for humans to parse, which causes people not to want to use it; developers are not
> more comfortable with the parser being a black box
they're more comfortable with the parser being a black box but the format being relatively easy to parse compared to the parser being easy to understand but the format basically unreadable for a human.
> I think people undervalue clean-looking (alphabet-only, few special character) things, things that don't require people to use the symbol-parsing part of their brain. Basically easily human-parseable things.
The symbol-parsing part of the human brain is what parses letters and numbers, as well as other abstract symbols. The division of symbols into letters, numbers, and others is fairly arbitrary. Most people would call "&" a symbol rather than a letter, but the modern name of that symbol is a smoothing-over of the way it was recited when it was considered part of the alphabet and recited with it.
> I suspect this phenomenon can be observed in the case of relative popularity of JSON, TOML, YAML and Python plus the relative unpopularity of Lisp, Haskell, Rust, XML.
I suspect not: Lisp and Haskell have less use of non-alphanumeric characters than most more-popular general purpose languages, and not significantly more than Python; also, if this was the phenomenon in play, popularity would be TOML > YAML > JSON but in reality it's closer to the reverse.
> The symbol parsing part of the human brain is what parses letters and numbers, as well as other abstract symbols.
I really don't think that's true when you compare someone using the Latin alphabet and words in that alphabet to some other alphabet (e.g. {}():!) and "words" (or meanings) in those. Just as a crude example, parsing "c = a - b", where equals and minus are one symbol each and have been taught for a while, is different from parsing "c := a << b", where ":=" and "<<" basically act as separate meanings someone has to learn to understand. Similar to the difference between the Latin alphabet and, say, simplified Chinese.
> also, if this was the phenomenon in play, popularity would be TOML > YAML > JSON but in reality it's closer to the reverse.
There could be somewhat of a sigmoid response to the effect: a decreased reaction as you go toward either extreme, compared to deviating from the average.
I'm not a linguist so it is my speculation, so don't take it too seriously :D
> Those aren't caused by values being length-delimited, they are caused by people deserializing variable-length values into fixed-length buffers without a check. That mistake can be made just as easily with a text format as with a binary format. In fact I've seen it much more often with text.
Especially since text can be arbitrarily long. From that perspective, length-delimited text (I've seen that before in a few file formats, and more notably HTTP) is probably the worst of both worlds.
> You can also create two NaN numbers that do not have the same binary representation.
Way worse than that. The interpretation of those different values of NaN is software specific. You can also have signalling NaN values - where the recipient can now have their number handling code trap in completely unexpected scenarios.
Sweet summer child, isn't that what people say these days?
You and I can agree that the reasonable thing to do is to accept American encoding.
(The below is somewhat simplified, the way I remember it on a late Saturday night 10+ years later.)
The outsourced team of programmers from our software vendor did not. They (mostly) used the built-in regional settings in Windows (best practice, don't reinvent the wheel), meaning we had to come up with ways to make sure the machines ran with the "wrong" regional settings (since a lot of stuff was already serialized that way and critical parts of the software were hardcoded to use the US standard).
Not sure about CSV files, but every time I paste numbers generated by Python or Java I have to use Notepad or Excel itself to replace dots with commas because of (Windows'!) regional settings.
In a "somebody sent this input to my program" context, designing some means to choose between these three results (or possibly others) is the main concern. Actual parsing is the easy part.
Do you get the user to choose? Great, now the UI is 2x more complex.
Do you just go with what the browser or client locale says? (User lives in Canada but their laptop is set to Turkish locale).
Go with the locale tied to the geo location of the user instead? (User above stubbornly enters all amounts using Norwegian conventions after completing high school in Norway).
If it's important, a confirmation page presented to the user and formatted in their presumed locale can help a lot.
This is a little too generous about the benefits of binary formats vs text formats. Ultimately, any data exchange between disparate systems is going to be a challenging task, no matter what format you choose. Both sides have to implement it in a compatible way. And ultimately, every format is a binary format. Encoding machine-level data structures directly on the wire sounds good, but it quickly gets complicated when you have to deal with multiple architectures and languages. And you don't have the benefit of the gradually accreted de-facto conventions, like using UTF-8 encoding for text-based formats, to fall back on, much less the ability for humans to troubleshoot by being able to read the wire protocol.
With sufficient discipline and rigor, and a good suite of tests, developed over years of practical experience, you can evolve a good general binary wire protocol, but by then it will turn out to be so complicated and heavyweight to use, that some upstart will come up with a NEW FANTASTIC format that doesn't have any of the particular annoyances of your rigorous protocol, and developers will flock to this innovative and efficient new format because it will help them get stuff done much faster, and most of them won't run into the edge cases the new format doesn't cover for years, and then some of them will write articles like this one and comments like yours and we can repeat the cycle every 10-20 years, just like we've been doing.
>Consider something as simple as parsing an integer in a text-based format; there may be whitespace to skip, an optional sign character, and then a loop to accumulate digits and convert them (itself a subtraction, multiply, and add), and there's still the questions of all the invalid cases and what they should do.
^[ ]*-?[0-9][0-9]*[ ]*$
You're welcome. Anything that passes that regex is a valid number. Now using that as the basis of a lexer means that you can store any int in whatever precision you feel like.
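As a rough Python sketch of that lexing idea (int() supplies the arbitrary precision; error handling is the bare minimum):

    import re

    INT_RE = re.compile(r"^[ ]*-?[0-9][0-9]*[ ]*$")

    def lex_int(token: str) -> int:
        if not INT_RE.match(token):
            raise ValueError("not an integer token: %r" % token)
        return int(token)        # Python ints are arbitrary precision

    assert lex_int("  -42 ") == -42
    assert lex_int("9007199254740993") == 9007199254740993   # > 2^53, still exact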
It's unfortunate that the majority of programmers these days are so computer illiterate that they can't write a parser for matching parens and call you an elitist for pointing out this is something anyone with a year of programming should be able to do in their sleep.
Because of this article (which I encountered a year ago) I would say Parsing JSON is no longer a minefield.
I had to write my own JSON parser/formatter a year ago (to support Java 1.2 - don't ask) and this article and its supporting github repo (https://github.com/nst/JSONTestSuite) was an unexpected gift from the heavens.
Good point. I am presuming the test suite is comprehensive. Does it cover 100% of all JSON mines? Probably not. But it surfaced about 30 bugs in my own implementation - things I would have never dreamed of.
So it certainly helped me. And just based on how thorough and insane the test suite is, I think I'm in good hands. Not perfect hands - but definitely a million times better than anything I would have come up with on my own.
The test suite made my parser blow up many times, and for each blow up I got to make a conscious decision in my bugfix: how do I want to handle this?
(I decided to let the 10,000-depth nested {{{{{{{{{{{{{{{{{{"key": "value"}}}}}}}}}}}}}}}}}} guy blow up even though it is legal. Yes, I'm too lazy to implement my own stack.) :-)
This might be throwing a lit match into a gasoline refinery, but why not opt for XML in some circumstances?
Between its strong schema and WSDL support for internet standards like SOAP web services, XML covers a lot of ground that JSON encoding doesn't necessarily have without add-ons.
I say this knowing this is an unfashionable opinion and XML has its own weaknesses, but in the spirit of using web standards and LoC approved "archivable formats", IMO there is still a place for XML in many serialization strategies around the computing landscape.
JSON is perfect for serializing between client and server operations or in progressive web apps running in JavaScript. It is quite serviceable in other places as well, such as microservice REST APIs, but in other areas of the landscape, like middleware, database record excerpts, desktop settings, and data transfer files, JSON is not much better or sometimes even slightly worse than XML.
XML cannot be parsed into nested maps/dictionaries/lists/arrays without guidance from a type or a restricted xml structure.
JSON can do that. It also maps pretty seamlessly to types/classes in most languages without annotations, attributes, or other serialization guides.
It also has explicit indicators for lists vs subdocuments vs values for keys, which xml does not. XML tags can repeat, can have subtags, and then there are tag attributes. A JSON document can also be a list, while XML documents must be a tree with a root document.
XML may be acceptable for documents. But seeing as how XHTML was a complete dud, I doubt it is useful even for that.
And we didn't even need to get into the needless complexity of validation, namespaces, and other junk.
> And we didn't even need to get into the needless complexity of validation, namespaces, and other junk.
So, that's why we're adding all of this "junk" back into JSON? Transformers, XPath for JSON, validation, schemas, namespaces (JSON-LD, JSON prefixes): it's all there.
History repeating itself (and here’s the important part) because this complexity is needed. Not every application will need every complication, but every complication is needed by some application.
No junk has been added into JSON - the specification hasn't changed to accommodate those features.
Unless you need to use the feature, you don't need to know anything about it, which is a huge benefit for the majority. XML almost encourages programmers to use unnecessary features.
When an application domain chooses to add a feature (say JSON-LD) then there are advantages to that mixture over XML. Where XML is better, it is often chosen instead.
That depends entirely on which parser you’re using. People have wanted comments so badly there are parsing libraries (and proposed revisions to JSON) that include comments. And sometimes those comments are used to provide processing directives.
> Suppose you are using JSON to keep configuration files, which you would like to annotate. Go ahead and insert all the comments you like. Then pipe it through JSMin before handing it to your JSON parser.
> XML cannot be parsed into nested maps/dictionaries/lists/arrays without guidance from a type or a restricted xml structure.
? Using XML without a schema is slightly worse than JSON because the content of each node is just "text". XML with schema is far more powerful, also because of a richer type-system. JSON dictionaries are most of the time used to encode structs, but for that you have `complexType` and `sequence` in the XML schema.
I've been using XML with strongly-typed schemas for serialization for the last couple of years and couldn't be happier. I have ~100 classes in the schema, yet I've needed a true dictionary like 2 or 3 times.
> And we didn't even need to get into the needless complexity of validation, namespaces, and other junk.
Validation is junk? Isn't it valuable to know that 1) if your schema requires a certain element, and 2) if the document has passed validation, then navigating to that element and parsing it according to its schema type won't throw a run-time exception?
Namespaces are junk? They serve the same purpose as in programming languages. How else would you put two elements of the same name but of different semantics (coming from different sources) into the same document? You can fake this in JSON "by convention", but in XML it's standardized.
XML is a perfectly serviceable data exchange format. The parsers and serializers work great when used properly. It's nice to have schema.
But I think people just got sick of XML because it was abused so badly with "web services", SOAP, wsdl and all those horrible technologies from the early naughts. Over-complicated balls of mud that made people miserable.
Apple's plist format might be the weirdest abuse of XML as far as I can tell. The SOAP envelopes and shit like that were horrible but plist is plain weird.
Everyone abused XML some way or another. JSON is not that "abusable" I'd say.
XML is a beast to parse. It's slow to parse and verbose, yet it doesn't give you a human-friendly text format. It's got a number of weird features inherited from SGML. Every parser needs a quirks mode, since nobody can write good schemas and schema parsers.
XML is a really bad interchange format. It's OK for a document markup language, and that's where it survives.
When I was doing XML/Java stuff 10 years ago, you take your XSD and generate domain classes as a build step. It was more complicated but it was also 100% reliable because the tools were all rock solid. Written by the guy who made Jenkins.
Many languages have libraries built in that do something reasonable with JSON. Usually you just make a class or struct, instantiate it, and then generate JSON, no need to have a separate compile step. When going the other direction, I usually just format the JSON, copy that into my code, then fix the compile errors.
XML has all that tooling because it needs it. JSON is a lot more straightforward, is more compact, and is faster to parse and (probably) generate.
If you're going to go through the effort of a compile step, you should probably just use a binary protocol, which will get you even better performance and getting documentation out of the box (e.g. protocol buffers schemas are very readable).
I see absolutely no reason to use XML these days as a data format, but it's still a reasonable choice as a markup format (you know, what the M stands for).
> Many languages have libraries built in that do something reasonable with JSON.
What about cross-language? In C# I define a class containing a `DateTime` field, export the schema with xsd, and generate classes for Java with xjc, and get back a field of (an equivalent of) `DateTime` type. Doing what you suggest with JSON, I'd get a "string". Thanks but no thanks.
> If you're going to go through the effort of a compile step, you should probably just use a binary protocol, […] I see absolutely no reason to use XML these days as a data format,
In our product we use a relational db (SQLServer) combined with XML. Each table has a structured part which is put into relational columns, plus an extensions part that is put into a "Data" XML column for semi-structured data. SQLServer supports XQuery so we can query the semi-structured data from SQL when needed.
This wouldn't fly with a binary format.
EDIT: yes, SQLServer also supports JSON, but has special optimizations for XML (e.g., it can understand schema types, it supports XML indexes which "shred" XML to a more efficient binary representation based on schema, etc.)
If you're going to reach for automation, why not just use a binary format like protocol buffers, flat buffers, capn proto, etc? You get the tooling and a ton of performance for free.
JSON is great because you don't need tooling. XML is great because it's expressive. You don't need expressiveness for a data format, but it works great as a markup language.
Syntax aside, I think the original mistake is IDLs, schemas, and other attempts at formalism.
WSDL, SOAP, and all their precursors were attempted in spite of Postel's Law.
Repeating myself:
Back when I was doing electronic medical records, my two-person team ran circles around our (much larger) partners by abandoning the schema tool stack. We were able to detect, debug, correct interchange problems and deploy fixes in near realtime. Whereas our partners would take days.
Just "screen scrap" inbound messages, use templates to generate outbound messages.
I'd dummy up working payloads using tools like SoapUI. Convert those known good "reference" payloads into templates. (At the time, I preferred Velocity.) Version every thing. To troubleshoot, rerun the reference messages, diff the captured results. Massage until working.
Our partners, and everyone I've told since, just couldn't grok this approach. No, no, no, we need schemas, code generators, etc.
There's a separate HN post about Square using DSLs to implement OpenAPI endpoints. That's maybe 1/4th of the way to our own home made solution.
I personally like XML a lot for rich text (I like HTML better than TeX) and layout (like in JSX for React), and it's not horrible if you want a readable representation for a tree, but I can't imagine using it for any other purpose.
JSON is exactly designed for object serialization. XML can be used for that purpose but it's awkward and requires a lot of unnecessary decisions (what becomes a tag? what becomes an attribute? how do you represent null separately from the empty string?) which just have an easy answer in JSON. And I can't think of any advantage XML has to make up for that flaw. Sure, XML can have schemas, but so can JSON.
I will agree that JSON is horrible for config files for humans to edit, but XML is quite possibly even worse at that. I don't really like YAML, either. TOML isn't bad, but I actually rather like JSON5 for config files - it's very readable for everyone who can read JSON, and fixes all the design decisions making it hard for humans to read and edit.
One of the biggest advantages for XML are attributes and namespaces. I miss these in JSON.
As AtlasBarfed mentioned, JSON has a native map and list structure in its syntax, which is sorely missed in XML. You have to rely on an XML Schema to know that some tag is expected to represent a map or list.
JSON with attributes and namespaces would be my ideal world.
Why do you want those? Attributes and namespaces just make in memory representation complicated. They're quite useful for markup, but I don't really know why you'd want them in a data format.
Use JSON or a binary protocol for data, XML for markup.
To be fair, there were a lot of very good ideas for a 2.x XML that solved a lot of the complexity. The problem was that none of the tools would be upgraded to support it.
You'd basically have to create a new independent format to have proper compatibility once you introduce breaking changes.
- by removing DTDs, remove the concepts of notations and all external resource resolution from the core spec. Also, no possibility of entity-expansion attacks.
- by removing DTDs, remove validation from the core spec.
- merge namespaces into the core specification. At the same time, make them mandatory
- merge the concept of qualified names into the core specification
- by making namespaces mandatory, all the variations of how namespaces get exposed can be eliminated
- merge the info-set definition into the core specification
- by describing XML items and how they relate, implementations can understand what data is relevant at a particular point while parsing the document.
- Merge xml:id into the core specification.
You also had some other fun outlier concepts:
- Eliminate prefixes from infoset. This is mostly a breaking change for XPath and XML Schema.
- Add an explicit qualified name token (possibly recycling the entity declaration). This would allow the above specs to have their functionality restored, although likely with a new format.
- Accept qualified names without prefixes, such as via a {uri}:{localName} syntax.
Once you've parsed the first minefield, another crop emerges: interpreting the result. Even the range of values seen in the wild for a supposedly simple boolean attribute is just mind-boggling. Setting aside all the noise from jokers trying it on with fuzzing engines, we'll see all of these presented to various APIs:
That last looks like a doozy, but old lags will guess what's going on right away. It's the octets of the 8-bit string "true", misinterpreted as UCS-2 (16-bit wide character) code points and then spat out as UTF8. Google translates it, quite appropriately, as "Enemy".
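For what it's worth, the mangling is easy to reproduce in Python (assuming the little-endian reading; the big-endian one gives different characters):

    octets = b"true"                      # 74 72 75 65
    as_ucs2 = octets.decode("utf-16-le")  # '牴敵' -- the second character is the "Enemy" one
    on_the_wire = as_ucs2.encode("utf-8") # b'\xe7\x89\xb4\xe6\x95\xb5'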
Oddly though, according to my records, never seen a "NULL".
On the other hand, if you only need 53 bits of your 64 bit numbers, and enjoy blowing CPU on ridiculously inefficient marshaling and unmarshaling steps, hey, it's your funeral.
> some JSON parsers (include JS.eval) parse it to 9223372036854776000 and continue on their merry way
This is correct behavior though...? Every number in JSON is implicitly a double-precision float. JSON doesn’t distinguish other number types.
If you want that big a string of digits in JSON, put it in a string.
Edit: let me make a more precise statement since several people seem to have a problem with the one above:
Every number that you send to a typical JavaScript JSON parser is implicitly a double-precision float, and it is correct behavior for a JavaScript JSON parser to treat a long string of digits as a double-precision float, even if that results in lost precision.
The JSON specification itself punts on the precise semantic meaning of numbers, leaving it up to producers and consumers of the JSON to coordinate their number interpretation.
> This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.

> Note that when such software is used, numbers that are integers and are in the range [-(2^53)+1, (2^53)-1] are interoperable in the sense that implementations will agree exactly on their numeric values.
Or to paraphrase: if you pretend that every JSON number is a JavaScript number (double-precision float), you will generally be fine. If you don't, you are responsible for the inevitable interoperability problems you'll have with most current JSON parsers.
You can write an arbitrary precision number in there, say 4.203984572938457290834572098345787564e+20, but if the vast majority of JSON parsers interpret it as a double-precision float, there’s not much point.
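A small Python demonstration of that point: json.loads keeps the integer exact by default, and the parse_int hook can mimic a double-only parser (the float64 interpretation is the assumption being shown):

    import json

    big = "9007199254740993"                    # 2^53 + 1
    exact = json.loads(big)                     # Python keeps it as an int
    as_double = json.loads(big, parse_int=float)

    print(exact)       # 9007199254740993
    print(as_double)   # 9007199254740992.0 -- the last bit is gone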
If you control both the producer and consumer of the serialized data, you can of course do whatever you like. But I would recommend people who want more extensive data types use something other than JSON.
> If you control both the producer and consumer of the serialized data, you can of course do whatever you like.
Don't forget all the unintended intermediate producers and consumers due to microservices or even otherwise well-written tools that convert to float64 internally.
You're writing the server for a procedural space game where the coordinates are stored as numbers. As the ships get far from (0,0,0) people start to report that the graphics get "jumpy" and gameplay rules break frequently.
Or, your server keeps track of how long it's been running, in nanoseconds. When the server stays up for months, people start noticing weird precision issues and hard to replicate bugs around time.
When the tests 1) fail to experiment with numeric values greater than 2^53 or less than -2^53, and 2) fail to carefully check the results even in the situations where they did experiment with such values.
How often do you encounter situations where your values start off below 2^53 and then later grow above 2^53? How often do you deal with integers greater than 53 bits where the app doesn't immediately fail when the exact value is not correct (i.e. unique IDs or crypto keys)? I feel like these are edge cases which would rarely come into play in most software.
Edge cases which rarely come into play are the ones you should be most afraid of. Things that happen often get caught before they cause trouble. Things that never happen are fine. Things that happen rarely are what get you.
For a made-up example of how this could get you, let's imagine that you have a message service that gives each message a unique ID. Some bright soul decided to give this ID a nice structure and make it a 64-bit value where the top 32 bits are an incrementing integer per user, and the bottom 32 bits are the user's ID, assigned from a global incrementing integer.
Everything works great in testing and you deploy and the VC money is rolling in and then some of your very prolific users go past 2 million messages and suddenly messages are getting mixed up and you’re leaking private info because your access checking code happens to get the real 64-bit value but your message retrieval code puts the ID in a JSON number.
Now you might respond, but that ID scheme is dumb, don’t do that. And you may be right! But dumb things happen. It’s unwise to leave land mines lying around in your software just because they only detonate when someone does something dumb.
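A sketch of that failure in Python, using the hypothetical packing scheme described above:

    user_id = 12345
    for seq in (1_000_000, 3_000_000):           # messages sent by this user
        msg_id = (seq << 32) | user_id           # the "clever" 64-bit ID
        survives_json = int(float(msg_id))       # what a double-only parser sees
        print(seq, msg_id == survives_json)
    # 1000000 True   -- still fits in 53 bits, nobody notices
    # 3000000 False  -- past ~2 million messages the ID silently changes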
Oops, somehow the second half of the comment got truncated.
I meant to say: "I have, but I work mainly with embedded systems, and so there aren't usually POST or PUT APIs, just GET. So JSON is nearly always generated by the device and consumed by whoever is querying the device, not the other way around."
JSON sucks. Maybe half our REST bugs are directly related to JSON parsing.
Is that a long or an int? Boolean or the string "true"? Does my library include undefined properties in the JSON? How should I encode and decode this binary blob?
We tried using OpenApi specs on the server and generators to build the clients. In general, the generators are buggy as hell. We eventually gave up as about 1/4 of the endpoints generated directly from our server code didn't work. One look at a spec doc will tell you the complexity is just too high.
We are moving to gRPC. It just works, and takes all the fiddling out of HTTP. It saves us from dev slap fights over stupid cruft like whether an endpoint should be PUT or POST. And saves us a massive amount of time making all those decisions.
I have had the absolute joy of working with gRPC services recently. Static schemas and built in streaming mechanics are fantastic. It definitely removes a lot of my gripes with REST endpoints by design.
Ah, so that's the source of that Chrome bug that we saw last week. Customers on Chrome for Windows (only that, not Chrome for Linux or macOS) were complaining that the search on our statically-generated documentation site was not working. The search is implemented by a JavaScript file that downloads a JSON containing a search index, and it turns out that this search index had too much nesting for Chrome on Windows's JSON parser. This would reliably produce a stack overflow:
Not really, parsing JSON is probably a recursive algorithm and Windows gives processes way less stack space than Linux (generally 1MB vs 8MB). It could totally be that the code is the same, but the particular JSON string blows up the stack in Windows but not Linux.
Eh... not really if you're referring to call stacks. Rust at one point had growable stacks. That was removed for performance reasons. Haskell with GHC kind of has growable stacks (basically IIRC most function calls occur on the heap) and its stack overflows take a different form. SML I think at one point also had an implementation with a growable call stack.
If by limited you mean limited by the amount of memory your machine has then yes it's limited, but I don't think that's what parent was getting at, since in that sense everything about a computer is limited.
I've been using it to process and maintain giant JSON structures and it's faster than any other parser I've tried. I was able to replace my previous batch job with this as it gives real-time performance.
We haven't used that particular suite, but almost everything in that suite is something we've thought about. In many cases we do the right thing by not innovating and randomly allowing stuff that isn't in the spec.
I see exactly one thing we didn't think about, as our construction of a parse tree is pretty basic and we don't build an associative structure even when building up an object - thus we would not register an error when confronted with the malformed input listed under "2.4 Objects Duplicated Keys", but happily build a parse tree with duplicated keys (which will be built up strictly as a linear structure, not an associative one).
There seems to be leeway on this point as to what an implementation should do. It certainly doesn't fit our usage model very well to build an associative structure right there on the spot; some of our users wouldn't want that much complexity/overhead.
> Abstract Syntax Notation One (ASN.1) is a standard interface description language for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and especially in cryptography.
ASN.1 is incredibly hard to actually implement. There are dozens of cases of security bugs based on bad parsers. Also, there are a dozen different encodings of ASN.1 data, including JSON (JER). Its age also means that it has a bunch of obsolete datatypes.
Protobuf and friends have most of the power without a lot of the drawbacks.
> There are dozens of cases of security bugs based on bad parsers.
You're saying this on a thread called "Parsing JSON Is a Minefield", eh?
In any event, this is not unique to ASN.1. I haven't checked but I don't doubt there are similar cases for Protobuf, etc.
> Also there are a dozen different encodings of asn.1 data including json (JER).
So what? That's the opposite of a problem.
> Its age also means that it has a bunch of obsolete datatypes.
So don't use them.
- - - -
My point is that if the time and effort that was spent on Protobuf and CapnProto and all the others had somehow been spent instead on perfecting ASN.1 then, uh, that would have been good...
> My point is that if the time and effort that was spent on Protobuf and CapnProto and all the others had somehow been spent instead on perfecting ASN.1 then, uh, that would have been good...
I wrote proto2 in 20% time at Google and I developed Cap'n Proto entirely on my own time, unpaid. If you think ASN.1 could be perfected with a similar amount of work then why don't you do it?
It seems like I may have offended you, I didn't mean to, and I apologize.
I'd love to discuss this but don't want to get in a flame war.
In re: ASN.1, if I ever have to de/serialize some messages again (I'm quasi-retired ATM) I would use ASN1SCC "an ASN.1 compiler that was developed for ESA to cover all data modelling needs of space applications."
> The compiler is targetting safe systems and generate either Spark/Ada or C code. Runtime library is minimalistic and open-source. The tool handles custom binary encoding layouts, is fully customizable through a code templating engine, generates ICDs and automatic test cases."
I've used protobufs in the past and I REALLY want to use them now, but the human readable aspect of json has always kept me coming back.
If I want to perform some rough tests of an endpoint during development, all I need to do is compose the json request and fire it off using curl. The response then comes back in a human readable format I can parse straight from the terminal. Boom, simple test conducted in less than 1 minute. I don't even need to think about it.
Compare that to protobufs: I need to create a custom client or unit test that'll compose and fire off the request I want to test, then I need to write a bunch of code that will introspect the contents of the response so I can pick out the details. Huge time loss, concentration ruined since I need to actually think about the process. I'd rather just take the extra latency that using JSON will incur.
This skips past all of the other advantages json has over binary serialization protocols, like quickly being able to parse requests while debugging issues, infinite client language support, ease of sharing breaking requests to help devs reproduce problems, not needing to add an extra compilation step to my deployments and packages, etc.
> If I want to perform some rough tests of an endpoint during development, all I need to do is compose the json request and fire it off using curl. The response then comes back in a human readable format I can parse straight from the terminal. Boom, simple test conducted in less than 1 minute. I don't even need to think about it.
You really don't want to put >2GB in a single protobuf (or JSON object). That would imply that in order to extract any one bit of data in that 2GB, you have to parse the entire 2GB. If you have that much data, you want to break it up into smaller chunks and put them in a database or at least a RecordIO.
Cap'n Proto is different, since it's zero-copy and random-access. You can in fact read one bit of data out of a large file in O(1) time by mmap()ing it and using the data structure in-place.
Hence, it makes sense for Cap'n Proto to support much larger messages, but it never made sense for Protobuf to try.
Incidentally the 32-bit limitation on Protobuf is an implementation issue, not fundamental to the format. It's likely some Protobuf implementations do not have this limitation.
(Disclosure: I'm the author of Protobuf v2 and Cap'n Proto.)
Generally speaking, if you have 2GB of data, why would you want it inside a protobuf, or worse, json? You clearly aren't going to open it in a text editor or send it over ajax - just put the bulk of data as a separate binary blob, and your code won't have to scan 2GB to find the end of a string.
I can understand if protocol buffers had some technical issue that you disagreed with. Or if you had a preference for a different solution because of some reason.
But this comment seems needlessly cynical and doesn't actually offer any rebuttal to the parent's point. I find technical discussions of these sorts of things interesting and a great way for new people to learn about the tradeoffs, maybe you could offer a reasoned opinion on why not use protocol buffers?
I have a love/hate relationship with JSON schema. I hate the documentation, or significant lack thereof. But I do love how business rules about relationships between fields can be expressed in the schema - e.g., if typeId is 7, then the cost field must be populated.
With things like Avro or GPB you still need to validate that the relationship holds true separately.
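For reference, that kind of cross-field rule can be expressed with if/then in JSON Schema; a hedged sketch using the Python jsonschema package, with the typeId/cost names taken from the example above:

    from jsonschema import validate, ValidationError

    schema = {
        "type": "object",
        "properties": {"typeId": {"type": "integer"}, "cost": {"type": "number"}},
        "if": {"properties": {"typeId": {"const": 7}}},
        "then": {"required": ["cost"]},
    }

    validate({"typeId": 7, "cost": 9.99}, schema)    # passes
    try:
        validate({"typeId": 7}, schema)              # cost missing -> rejected
    except ValidationError:
        pass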
A lot of companies still use typed messages like Protocol Buffers, which makes this a significantly smaller issue. Especially since the message format was designed to be portable.
A question: just like JSON's syntax was based on JavaScript, could we create a JSON schema syntax based on TypeScript? So the schema would essentially be a TypeScript type annotation.
Parsing is a minefield. General purpose computing systems are minefields.
Of all human readable formats I've ever worked with, only S-expressions have proven easier and safer to parse. Json.org even has unambiguous railway diagrams!
I wish EDN would catch on. The simplicity of JSON with better number handling, a few more very useful data types like namespaced keywords and sets, arbitrarily complex keys, and an even terser, more readable syntax. [k v k v] beats [k: v, k: v] hands down, and commas are just whitespace, so you can include them if you want.
In case anyone doesn't click the link, it's much simpler than JSON, leaves interpretation up to the reader, supports polymorphic data, and has some tweaks to make it nicer to edit by hand. I've used this in a bunch of personal projects and it has mapped all the models I've come across perfectly. If you need more power in your format, you're better off using Lua than YAML.
I have a Rust Serde implementation 90% complete I could finish up if anyone wants it.
The bigger question is, what is there to be done? What is the road to more uniform handling of JSON? I've handled some JSON before and it's usually fairly easy until you catch one of these strange implementation quirks. But I'm not sure those quirks can be ironed out at this point.
Can you help me understand the problem? These things seem like corner cases that you could just Not Do(TM) and then you don’t have to worry about it.
What am I missing, when do these gotchas become an actual problem for you as a developer?
I’m not sure I’ve used any technology that was free of footguns, and JSON appears to have fewer of them than the average programming language or library.
I call this kind of answer “The C Answer”. “Who cares if this particular combination of code results in undefined behavior? Just don’t do it!”
> when do these gotchas become an actual problem for you as a developer?
Whenever you have to deal with JSON produced by “not you”, or when you have to deal with JSON that may have been corrupted in some fashion along the way.
I’m probably just not used to pure “all behavior is defined” systems, so I appreciate your perspective.
What industry do you work in? I don’t see many systems like that in my industry. I work like hell to push things in that direction, but it’s a best case of “we went from 5% well defined behavior to 50%” after many years of effort.
The product owner perspective on this should be "nothing supports anything unless it is tested".
If you pick up a standard and just assume other products will be able to work with it, you're in for a surprise. I don't care if it's TCP sockets or .ini files; if you didn't test compatibility with the product you expect will interact with yours through the standard, consider it unstable, and don't advertise support for it.
Sometimes you have to support a standard itself, like WPA2, so you implement the standard according to internet engineering best practice: be liberal with what you accept, and conservative with what you transmit (or something to that effect). Then test compatibility with the major products you know will want to use it, and fix the bugs you find.
I've been of the opinion for a while that a lot of issues could be resolved if we agreed on a streamable binary format that had good definitions for data types (including integers and dates).
String formats are great and all for viewing in whatever text viewer, but they're so inefficient, and then you have the whole business of escaping strings inside of strings and string-encoding binary data.
If we all agreed on a binary format then there would be a viewer for it in every debugging tool.
ASN.1, Protobuf, BSON, Ion, MessagePack, whatever. I would prefer a binary format that doesn't repeat keys, for efficiency, where the schema can be sent separately or inlined. But even one that's basically binary JSON with more types would be a step up.
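MessagePack is roughly that last option already; a hedged sketch with the Python msgpack package (it's schemaless, so keys still repeat, but it has real integer and binary types):

    import msgpack

    record = {"id": 42, "blob": b"\x00\x01\x02", "ok": True}
    wire = msgpack.packb(record)          # compact binary, typed ints and bytes
    assert msgpack.unpackb(wire) == record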
I'm not a professional coder. And I mostly work with tabular data, in spreadsheets and SQL. I like to get my data as delimited text files. Ideally, delimited with some character that's 100% guaranteed to never occur in the data. In my experience, "|" is often a good option, but you never know. And CSV, even with quotes, can be a nightmare, especially if the data contains addresses. Or names with quoted nicknames.
Anyway, given the choice, I always pick JSON over XML. Because with JSON, I can always identify the data blocks that I need, and parse them out with bash and spreadsheets. Not with XML, however. Just as not with HTML.
> For example, Xcode itself will crash when opening a .json file made of the character [ repeated 10000 times, most probably because the JSON syntax highlighter does not implement a depth limit.
I certainly would not criticize such a thorough examination for being facile, but I do want to point out that the conclusion "But sometimes, simple specifications just mean hidden complexity" is not supported by the article. Almost all of the edge cases are caused by implementors ignoring or extending the simple specification.
The crashing test cases look scary from a security perspective, specially in the C-based parsers. Does anyone know if these results are still up to date or if the bugs have already been fixed?
> Python, go, rust and other obscure or archaic languages struggle with json. I recommend using a modern language such as js/nodejs, as it is meant for the web - where json is king among formats.
In what ways are the languages you listed obscure or archaic?
None of those mean Python is "obscure or archaic" or "unfit for json" etc.
Regarding what is a valid object, it's about syntax. Python has a different syntax than JS. That doesn't mean Python is worse because JS has {hello: 'world'}.
In fact that's an inconsistency of JS that only holds for _some_ keys (basically, valid identifiers): if the key has a space or a dash or something, you need to quote it: {'hello-1': 'world'}. So Python is more consistent in having only one style of map literal.
>in python you also can't do json.hello like in js, it would be json['hello']
Again, irrelevant. Python has a different syntax. Python also has tons of stuff you can't do in JS. Operator overloading for example, in python myObj + myOtherObj is a valid user defined operation. None of this means JS or Python is superior.
And none of this has anything to do with JSON.
If you mean that JSON seems to be a better fit for JS syntax, that's because it was designed to be closer to JS syntax. That said, it's not JS syntax. {foo: "bar"} is valid JS but not JSON. JSON also doesn't accept classes, closures, and tons of other things that can go in a JS object.
Or you can let json.load do its thing and use a nice module like json-syntax[1] to convert it to well typed values and back. (Yup, plugging my library.)
I don't understand what you're saying. {hello: 'world'} is not valid JSON, and neither Python's nor JavaScript's JSON parser from the standard library parses that.
Again.
Parsing is a minefield in general, but parsing JSON is one of the easiest tasks of all serialization formats.
It's also the only secure format.
Its various spec bugs (by omission) are not that dramatic, and the various "enhancements" only made it worse, i.e. more insecure. Still: bad, but not a minefield. What worries me most is that my JSON module is the de facto Perl standard, passes all these tests, was the very first to add all these tests, is the fastest, and still is not included in that list, just some outdated modules which should not be used at all.
Checking best practices besides maintaining a spec obviously also is a minefield.
I forgot another major JSON minefield problem which is not mentioned nor tested here: stack overflow.
This is in fact the most important problem to test against, because it might lead to exploitable stack ROP.
JSON is usually parsed recursively, and deeply nested structures are mostly not depth-counted. One can trivially construct an array or map nested 500 to 30,000 levels deep, and at some point the parser either fails or crashes with an overflow. This number is fixed, thus trivially exploitable. The test spec should state the max depth for arrays and maps, and whether there's a fixed built-in limit, a compile-time limit, an implicit limit by crash, or none.
Non-recursive parsers are fine.
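Python's standard parser is a handy illustration of the depth problem: it recurses, but leans on the interpreter's recursion limit instead of smashing the native stack (a sketch; C parsers without such a guard are the ones at risk of the exploit scenario above):

    import json

    deep = "[" * 100_000 + "]" * 100_000
    try:
        json.loads(deep)
    except RecursionError:
        print("depth limit hit: RecursionError, not a native stack overflow")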