JSON is dangerous (and slow)

Okay, so, I ranted a bit on YouTube. Oops. Here’s an accompanying longread, to waste even more of your life.

Here’s the TLDR(oW) version: JSON is a ubiquitous structured data format, used for both data storage and exchange. But it’s slow, and what’s worse, it’s dangerous. We should use other formats, probably MessagePack.

JSON is a weird accident of history. It grew out of Douglas Crockford’s need to send messages between server and browser. It was defined to be a subset of Javascript, which was probably the first mistake. And although I can’t specifically verify it, the Javascript object.toString() behavior that yielded essentially JSON was probably the origin. This new format had a lot going for it, being simpler and more generic than XML, which was all the rage at the time and is a completely terrible format that should die in a fire. But the specification itself was largely born out of mimicking its origins without getting too much into the details. Accidental success, maybe?

Today JSON is ubiquitous. It’s everywhere. And it certainly has a number of things working in its favor - relatively lightweight, easily readable, comparatively flexible. But there are problems, both serious and not-so-serious. The “no trailing comma at end of list” thing probably accounts for hundreds of hours of wasted developer time globally each month, for instance.

What follows is an attempt to address a few of the more serious technical problems, hopefully constructively, but specifically in the context of how blind reliance on JSON can get you in deep trouble.

For context, this entire line of reasoning came to me when I discovered the hard way that in a system that was relying on numbers (and non-number floating point values) to be correctly reported, they frequently were not because of mismatches in how JSON encoders and decoders worked. The kind of problem that isn’t a problem until it is, but by then you’re already deeply screwed.

It’s Slow!

Let’s talk about the slowness first. There is a widespread assumption in the computer industry that parsing JSON files is practically free, and that shunting billions upon billions of JSON blobs around the Internet every day is fine.

However, this is not the case. Not only is JSON a space-inefficient format in a lot of ways, especially where numbers are concerned, it’s also a format that is quite difficult to correctly parse in an acceptable amount of time.

What makes it “hard” from a computational perspective is that you can’t parse the file without looking at every individual byte in the file. This means that you cannot use divide-and-conquer methods except to a very limited degree, and your best parse speed is limited by your CPU’s data bus throughput and the time it takes the CPU to classify each byte and perform any special operations related to it.

Contrast this with formats like MessagePack, where length fields allow parsers to jump over large chunks of data, so parsing really only has to consider the structure of the file rather than the contents of each individual byte.
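
To make the contrast concrete, here is a minimal sketch in Python, assuming the third-party msgpack package; the exact byte values aren’t the point, only the fact that the string is stored as a length header followed by raw bytes:

import msgpack  # third-party: pip install msgpack

packed = msgpack.packb({"name": "x" * 1000, "ok": True})
# In the packed bytes, the 1000-character string is preceded by a header that
# states its length, so a decoder can hop over the whole payload in one jump.
# A JSON parser has to scan every one of those bytes looking for the closing
# quote (and for escape sequences) before it can move on.
print(packed[:10].hex())
print(len(packed))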

Various JSON parser benchmarks (see the articles linked at the bottom of this post) have shown that most JSON parsers are quite slow and space-inefficient. The best appears to be simdjson, which uses CPU SIMD (Single Instruction, Multiple Data) operations to quickly pattern-match for cases that need special handling, greatly speeding up the process by looking at many bytes at a time. But even then, it’s not free by a long shot.

As Vargas et al note in their paper Characterizing JSON Traffic Patterns on a CDN, “mobile applications account for at least 52% of JSON traffic […] and embedded devices account for another 12% of all JSON traffic”, meaning that it’s often underpowered devices that are expending their time on encoding and decoding JSON - “88% of JSON traffic is non-browser traffic, and only 12% is requested by browsers.”

They also note that “the average size of JSON responses has decreased by around 28% since 2016”, and that “reduced response sizes increase the CPU cost-per-byte of serving JSON traffic”. The argument is that when handling JSON queries over HTTP, the processing overhead of each request is relatively fixed. So if you see systems replacing larger, more complex JSON blobs with more and smaller blobs, there are costs associated with that throughout the system. It’s not just on the HTTP query handling side, though: there’s also a cost to initializing, running and tidying up after the parser, heavily implementation dependent of course. But it’s something to consider. Kind of like a message-processing corollary to Amdahl’s law.

So, there are more messages (35 million in the data set analyzed by Vargas et al, collected over 24 hours from edge servers on an Akamai CDN), they are shorter, and thus proportionally more CPU intensive.

It’s hard to get a good handle on how many JSON messages are sent over the internet per day in total, but it’s probably reasonably conservative to put it in the mid-range hundreds of billions. Say 500 billion? A trillion maybe? Each individual JSON message isn’t necessarily taking a huge amount of time to process, but with a parser that takes, say, 1.6µs to decode a message similar to {"success": true, "amount": 3.141592653589793238462643383279} (which is the average time for the Python json module on my laptop), we’re talking in aggregate around 220 CPU hours per day globally, only counting decoding. Of course in reality it’s going to be a lot more than that, both because I have a pretty beefy laptop and because most messages are more complex and take longer to decode. So maybe it’s not entirely outrageous to suggest that encoding + decoding globally is on the order of a few thousand CPU hours per day. Which is, I suppose, only equivalent to a few hundred always-on computers doing nothing except encoding and decoding JSON all day.
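
For what it’s worth, the per-message figure above comes from a measurement along these lines (your timings will obviously differ), and the aggregate figure is just that multiplied by a guess at global message volume:

import json
import timeit

blob = '{"success": true, "amount": 3.141592653589793238462643383279}'
n = 1_000_000
per_decode = timeit.timeit(lambda: json.loads(blob), number=n) / n
# Back-of-envelope aggregate, assuming 500 billion such messages per day.
cpu_hours_per_day = 500e9 * per_decode / 3600
print(f"{per_decode * 1e6:.2f} µs per decode, ~{cpu_hours_per_day:,.0f} CPU hours/day")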

Given very mid-range power usage, we might be able to argue that JSON is costing the world about 5-7 MWh of power per day.

Of course, the above is based on a lot of handwaving and assumptions, so take this with several shipping containers worth of salt.

The key here is that JSON is a format optimized for human readability. But if the overwhelming majority of messages sent are never read by humans, and are never intended to be read by humans, then perhaps the overhead of that readability is too high a price to pay. It might make sense to architect systems in such a way that they use more efficient formats for message passing and storage, but have a good ecosystem of tools available for those rare occasions when humans actually do need to look at the messages.

There is a silver lining to all of this. As Raphael Luba mentioned to me the other day: “The best thing about JSON is that it almost completely stopped the spread of XML!” This is totally fair, and probably saved the world several hundred MWh per day. But I wonder if there’s a way we could replace JSON with something significantly easier to parse quickly – not to mention correctly – which brings us to the next bit:

It’s Dangerous!

The dangerous bit is a bit trickier. And, let’s be clear – despite the hype above, it’s only dangerous in certain edge cases. All of those edge cases have to do with numbers. But numbers are important!

The JSON standardization document, RFC8259, defines the number part of the format as:

   number = [ minus ] int [ frac ] [ exp ]

Which is to say, a number starts with an optional minus sign, followed by an int, optionally followed by a frac (a fractional part), and optionally an exp (an exponent). int in turn is defined as either a single 0, or a digit in the range 1-9 followed by as many further digits as you like. frac is a decimal point followed by at least one digit, but as many as you’d like.

This means that all of these are valid numbers:

5
4923.2929
-10e7
-42.39929432e-203
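
And, conversely, the grammar rules out things like a leading zero, a leading plus sign, or a bare decimal point. A quick check with Python’s json module, just as a sketch, agrees:

import json

# "0", "-10e7" and "4923.2929" parse; "01", "+1", ".5" and "1." do not.
for literal in ["0", "-10e7", "4923.2929", "01", "+1", ".5", "1."]:
    try:
        json.loads(literal)
        print(f"{literal!r}: valid")
    except json.JSONDecodeError:
        print(f"{literal!r}: rejected")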

So far, so good. It seems at first glance that we can encode any number we might conceivably need. The format doesn’t support fractions as such, or complex numbers, but that’s fine – nobody would expect that anyway. And if you need to send a complex number, you can do it as an array of two numbers. Easy.

So what’s the problem then? Well, before we continue, let’s understand why we are encoding numbers for transmission.

Computers deal with numbers all the time for calculations, and in many use cases the accuracy of those numbers can mean the difference between success or failure. Screw it up, and you can lose millions of units of currency, or human lives, or both. Bridges can fail, servers can crash, economies can topple. So being able to accurately encode and transmit numbers is important.

So when handling numbers in a transmission/storage format, there are two rules of absolute importance:

  1. Numbers get correctly encoded.
  2. Numbers get correctly decoded.

I.e., for any given number, the same number should be understood the same way, as having the same value, on both sides of the transmission (or storage, if that’s the use case).

The most common internal formats for numbers in computers are integers, floating point numbers, and less commonly, fixed point numbers. The sizes of these can vary, but typically modern computers can handle signed or unsigned integers of up to 64 bits easily, and with minimal effort support bigger numbers. Fixed point numbers are typically made from structures of integers. Lastly, floating point numbers are typically 32 or 64 bits, implemented according to the IEEE754 standard.

RFC8259 specifically addresses IEEE754 floating point numbers:

Since software that implements IEEE 754 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision.

Based on this, you’d expect that IEEE754 floats are supported by the JSON format. But here’s where it all falls apart. You see, IEEE754 defines four special values which are NOT valid according to the abovementioned definition of “number”: Inf, -Inf, NaN and -NaN.

Now, one can argue that those values should not be used, but in practice, you might for instance have an API interface to a calculator. What does your API return when the query sent is “5.0/0.0”? What about “0.0/0.0”? Or better still, “-0.0/0.0”? In any of these cases, you cannot correctly encode the resulting IEEE754 value as a number, in violation of rule #1 above.
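
Here’s roughly how that plays out with Python’s json module (the calculator endpoint itself is hypothetical): the encoder either emits something that isn’t valid JSON, or refuses outright.

import json
import math

# The IEEE 754 results for the three queries above (Python itself raises
# ZeroDivisionError rather than producing them, so we write them directly).
results = {"5.0/0.0": math.inf, "0.0/0.0": math.nan, "-0.0/0.0": math.nan}
# By default the encoder emits Infinity/NaN, which is not valid RFC 8259 JSON...
print(json.dumps(results))  # {"5.0/0.0": Infinity, "0.0/0.0": NaN, "-0.0/0.0": NaN}
# ...and in strict mode it refuses to encode these values at all.
try:
    json.dumps(results, allow_nan=False)
except ValueError as err:
    print("strict encoding failed:", err)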

Another example (from my line of work) would be a query endpoint that sends you back the value of a geospatial dataset at a given latitude and longitude. Except, what if the dataset has no value at that point? You could return an error, but this isn’t exactly an error – it’s simply the absence of data. NaN is a very natural and meaningful value in such cases. But that’s not possible, so you have to report the missing value by some other means. Here the inability to use what’s natural leads to a more complex JSON structure, with associated performance implications both when parsing the blob and when handling the resulting data in your application.

If you want to store these values, you are kind of expected to do so by sending them as strings. And many will argue that that’s the correct way to handle this. But that means that during your parsing step, when expecting a number, you need a special rule that accepts certain string values as numbers. Not only does that slow everything down further, you’ve also just created a new class of bugs.
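
For what it’s worth, that workaround tends to look something like this hypothetical sketch, with the receiver translating a few magic strings back into floats:

import json
import math

SPECIALS = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}

def revive_specials(obj):
    # The special rule: certain string values are really numbers in disguise.
    return {k: SPECIALS.get(v, v) if isinstance(v, str) else v for k, v in obj.items()}

data = json.loads('{"value": "NaN", "note": "no data at this point"}', object_hook=revive_specials)
print(data)  # {'value': nan, 'note': 'no data at this point'}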

But wait, there’s more. As the RFC itself states:

A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.

I’ll say! One of the problems with the number format as defined in JSON is that there’s no way, a priori, to know what storage class a particular number belongs to. Is a number a floating point number? An integer? A fixed point number? How many bits does it need?

Of course, the parser can do various tests to guess this ahead of time, but there are a number of situations where guessing correctly might be impossible, or where the guess might cause problems for the parser:

  • If the JSON is {"amount": 16777217.0 }, should it be understood as a 32 bit float, in which case the exact value cannot be represented (it’ll be understood as 16,777,216), or as a 64 bit float, which might seem excessive?
  • If the JSON is {"amount": 9007199254740993.0 }, the number can’t be accurately encoded as a 64 bit floating point number either. How should the parser react?
  • If the JSON is an integer value but the digit sequence is 32 megabytes long, how should the parser react?

These complaints might seem relatively trivial, but they are examples of this format violating rule #2 above.

And let’s look at how this is handled in practice. I’ll give a few examples, always with the same JSON input:

{"f64": 9007199254740993.0, "f32": 16777217.0, "exponentsarefun": 1E400, "pi": 3.141592653589793238462643383279 }

Here’s Python:

print(json.loads(blob))

giving us:

{'f64': 9007199254740992.0, 'f32': 16777217.0, 'exponentsarefun': inf, 'pi': 3.141592653589793}

Now let’s try Javascript (Node):

JSON.parse(blob)

giving us:

{
  f64: 9007199254740992,
  f32: 16777217,
  exponentsarefun: Infinity,
  pi: 3.141592653589793
}

… at least they’re consistent, eh?

Here’s Jai (just for fun), using Raphael Luba’s jason library:

{
    "exponentsarefun": Inf,
    "f32": 16777217,
    "f64": 9007199254740992,
    "pi": 3.141593
}

Here it assumes pi is a float32!

Anyway, you get the idea.

The only way to accurately handle this is to never use floating point numbers in the context of JSON, and further, to always have arbitrary-precision integer and fixed-point arithmetic libraries handy when dealing with this format. Which, let me tell you, almost nobody has, and virtually nobody actually wants.
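
In Python, at least, the standard library’s Decimal type gets you part of the way there, since the json module will hand number literals to it if you ask. A sketch:

import json
from decimal import Decimal

blob = '{"f64": 9007199254740993.0, "pi": 3.141592653589793238462643383279}'
# parse_float hands the literal text of every float to Decimal, so nothing is
# rounded away at parse time (what you then do with the Decimals is up to you).
data = json.loads(blob, parse_float=Decimal)
print(data)  # {'f64': Decimal('9007199254740993.0'), 'pi': Decimal('3.141592653589793238462643383279')}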

What to do?

Honestly, there aren’t a lot of good options within the context of standards compliant JSON. The only simple remedy I can think of that doesn’t involve scrapping JSON is to adopt a mitigation strategy in a few steps:

  1. Generally speaking, whenever you see “JSON”, read it as “I-JSON”, and follow RFC7493, which places some restrictions on what is permissible in JSON documents. Specifically, RFC7493 mandates UTF-8 encoding, bans unpaired surrogates and noncharacters, bans floating point numbers that can’t be represented as 64 bit IEEE754 floats, and bans duplicate names in objects. All of this adds strictness that greatly reduces the likelihood of problems.
  2. Break the standard to allow Inf, -Inf, NaN and -NaN as values in JSON files. Incidentally, a number of JSON parsers do this already, although it’s technically a violation of RFC8259. For example, the Python json module accepts NaN, Infinity and -Infinity with exactly that capitalization as values and handles them correctly, but it rejects -NaN, Inf, -Inf and other capitalizations. There are no current errata for this in RFC8259, but it would be a good thing to add, if for no other reason than to increase the likelihood of convergence in how this is implemented.
  3. Decoding libraries should do a two-step value validation test when parsing: for a given value, the string parsed for the value and the string generated by the JSON library when encoding the decoded numeric value back to a string should be equivalent. If they are not, issue a warning or error condition. This will significantly slow down parsing, but at least allows for detection and notification of problematic values. (A rough sketch of what this might look like follows this list.)
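
As a rough sketch of step 3, here’s one way to hook Python’s parse_float and compare the re-encoded value against the original literal. A real implementation would want to compare canonical forms rather than raw text, since this version also flags harmless differences like 1e7 versus 10000000.0:

import json
import warnings

def checking_parse_float(literal):
    value = float(literal)
    # Re-encode the decoded value and compare it with the literal that was
    # actually in the document; a mismatch means precision or range was lost.
    if json.dumps(value) != literal:
        warnings.warn(f"number {literal!r} decoded lossily as {value!r}")
    return value

blob = '{"f64": 9007199254740993.0, "pi": 3.141592653589793238462643383279}'
data = json.loads(blob, parse_float=checking_parse_float)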

But probably the correct thing to do is to gradually move away from JSON as a data storage and transmission format. It is, however, a great format for displaying structured data to (sufficiently advanced) users. So perhaps we should start treating JSON as a display (and editing?) format, but use something like MessagePack for actual storage and transmission?

For this to make sense, we’d need a much better set of tools for working with MessagePack (or whatever) data, including:

  • Command line tools (mp2json, json2mp, mpcat, mpless, etc – a toy sketch of the first two follows this list)
  • Browser support (for inspecting query payloads and in-memory objects)
  • Editor support (for VSCode, Sublime, (neo)vim, emacs, etc)
  • Graphical tools
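
As a taste of how thin such tools could be, here’s a toy sketch of what the first two might look like in Python, assuming the third-party msgpack package (the real tools would need to handle streams, errors, options, and so on):

import json
import sys
import msgpack  # third-party: pip install msgpack

def json2mp(src, dst):
    # Read a JSON document and write it back out as MessagePack.
    with open(src) as f, open(dst, "wb") as out:
        out.write(msgpack.packb(json.load(f)))

def mp2json(src, dst):
    # Read a MessagePack document and write it back out as pretty-printed JSON.
    with open(src, "rb") as f, open(dst, "w") as out:
        json.dump(msgpack.unpackb(f.read()), out, indent=2)

if __name__ == "__main__":
    command, src, dst = sys.argv[1], sys.argv[2], sys.argv[3]
    {"json2mp": json2mp, "mp2json": mp2json}[command](src, dst)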

Without these, people are going to keep using JSON. Even with these, people are probably going to keep using JSON, but at least then there’s an alternative to point people at.

Appendix: JSON parser benchmarks:

Acknowledgements

Thanks to James Robb and Chris Guess for reviewing a draft of this and giving useful feedback. Errors and omissions are my fault, not theirs.