Storage, Transmission, and Display

A Labeled Property Graph is a collection of nodes and edges, where each node has a label and a value, and a value may be an edge relating the node to another node. It’s a form of multigraph. There are a few different ways to think about this.

Labeled Property Trees are simply Labeled Property Graphs with an added rule that cycles are not allowed.

Labeled Property Trees (LPTs) are one of the most ubiquitous data structures in computing. If you’ve done any programming in the last two decades, you will have encountered an LPT. However, you’d be forgiven for not being familiar with the term itself. You are definitely familiar with one or more of:

  • JSON
  • YAML
  • XML
  • SGML
  • RDF
  • BSON
  • MsgPack

Every one of these is not just structurally equivalent – they all represent the same type of data structure. They do so in very different ways, with very different design goals, but they are all doing the same thing: storing data which is labeled and may be nested in a tree-like structure. I’m going to focus on three of them for the purposes of this article: JSON, XML and MsgPack.

Before we start, it’s worth noting that all of these formats can adopt conventions that allow referencing nodes in the tree by ID, path, location, offset, or whatever, which can be used to make them de-facto Labeled Property Graphs rather than LPTs. However, none of these formats explicitly supports that at the level of its formal specification, so we won’t treat them as such.

For example, this XML:
<?xml version="1.0" ?>
<library>
    <book title="Salmon River Gauntlet" author="William Bless" />
    <book title="Wine Folly">
        <author>Madeline Puckette</author>
        <author>Justin Hammack</author>
    </book>
</library>

is entirely equivalent to this JSON:

{
    "library": {
        "book": [
            { 
                "title": "Salmon River Gauntlet",
                "author": "William Bless"
            },
            {
                "title": "Wine Folly",
                "author": ["Madeline Puckette", "Justin Hammack"]
            }
        ]
    }
}

There are a few “gotchas” here though:

  • XML child nodes are ordered, but XML attributes are not. JSON object keys are, by convention, considered unordered, while list items are ordered. So for equivalency purposes, XML child nodes are effectively a list.
  • XML requires exactly one top-level node, which is labeled and can have properties like any other node, while JSON has either an object or a list at the top level, and it is not labeled. In the example we explicitly make “library” an object key of the root object, but that is technically not necessary.

In both cases, we have “pretty printed” the format.

These small differences may matter in terms of implementation, but they do not actually change the fact that these are fundamentally the same data structure.

The same structure in MsgPack would be:

81 A7 6C 69 62 72 61 72 79 81 A4 62 6F 6F 6B 92 82 A5 74 69 74 6C 65 B5 53 61 6C 6D 6F 6E 20 52 69 76 65 72 20 47 61 75 6E 74 6C 65 74 A6 61 75 74 68 6F 72 AD 57 69 6C 6C 69 61 6D 20 42 6C 65 73 73 82 A5 74 69 74 6C 65 AA 57 69 6E 65 20 46 6F 6C 6C 79 A6 61 75 74 68 6F 72 92 B1 4D 61 64 65 6C 69 6E 65 20 50 75 63 6B 65 74 74 65 AE 4A 75 73 74 69 6E 20 48 61 6D 6D 61 63 6B

or, rendered more readably, with strings shown as strings but non-string bytes in hexadecimal, prefixed with 0x:

0x81 0xA7 library 0x81 0xA4 book 0x92 0x82 0xA5 title 0xB5 Salmon River Gauntlet 0xA6 author 0xAD William Bless 0x82 0xA5 title 0xAA Wine Folly 0xA6 author 0x92 0xB1 Madeline Puckette 0xAE Justin Hammack

While this is quite different, by virtue of being a binary format, it is effectively the same thing as the other two.

Storage

When storing data, we have a few different goals:

  • We would like the stored data to take as little space as possible.
  • We want to be able to recover the exact same data again, with no risk of misinterpretation or inaccuracy.
  • We would like to be able to validate the correctness of the data relative to some schema.
  • We would typically like the serialization (writing out the data) and deserialization (reading it in) to take as little time as possible.

Note that being human readable is not a requirement for storage. A file sitting in storage is not being viewed by a human; a human may retrieve it from storage in order to view it, but that’s a separate step.

With optional whitespace removed, the XML is 198 bytes and the JSON comes out at 153 bytes, while the MsgPack data comes out to 124 bytes. And this ignores the fact that both XML and JSON tend to be stored in “pretty” format for human readability, which brings them to 234 and 309 bytes respectively. In terms of space taken, MsgPack is the clear winner here.
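
As a rough cross-check, here is a minimal sketch (assuming Python’s built-in json module and the third-party msgpack package) that serializes the same structure as minified JSON, pretty-printed JSON and MsgPack and prints the sizes. The exact counts depend on whitespace and serializer choices, so they may differ by a byte or two from the figures above.

import json
import msgpack  # assumes the third-party msgpack package

library = {
    "library": {
        "book": [
            {"title": "Salmon River Gauntlet", "author": "William Bless"},
            {"title": "Wine Folly",
             "author": ["Madeline Puckette", "Justin Hammack"]},
        ]
    }
}

minified = json.dumps(library, separators=(",", ":"))  # whitespace removed
pretty   = json.dumps(library, indent=4)               # "pretty printed"
packed   = msgpack.packb(library)

print(len(minified), len(pretty), len(packed))  # MsgPack comes out smallest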

In terms of exact recovery of data, all the formats are equivalent when it comes to labels and string data. XML has no separate method of indicating type (although type information could be encoded in labels and the like, at a cost). JSON has a notion of numbers, booleans and even null, but it has no way of indicating the storage type beyond that, and it cannot represent all the numbers it claims to support. MsgPack, by comparison, supports booleans, nulls, a range of integer and floating-point types with explicit storage information, and even byte arrays for those situations when those options aren’t enough.
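
To make that concrete, here is a small sketch (assuming the third-party msgpack package): MsgPack has a native encoding for raw bytes and for IEEE special values like NaN, while standard JSON has neither.

import json
import msgpack  # assumes the third-party msgpack package

payload = {"checksum": b"\x00\xff\x10", "ratio": float("nan")}

# MsgPack has a native bin type and stores NaN as an IEEE float,
# so the payload round-trips intact.
packed = msgpack.packb(payload)
print(msgpack.unpackb(packed))

# Standard JSON cannot serialize bytes at all, and rejects NaN/Infinity
# when strict, standards-conforming output is requested.
try:
    json.dumps(payload, allow_nan=False)
except (TypeError, ValueError) as err:
    print("JSON refused:", err)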

Each of these formats can be validated. XML has explicit schema formats and validators. JSON has a few informal schema mechanisms. I haven’t found a schema validator for MsgPack. But because the formats are all equivalent, you could technically run any validator against a deserialized version of any of these.
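
For example (a sketch assuming the third-party jsonschema and msgpack packages), a JSON Schema can be applied to data that arrived as MsgPack, simply because both deserialize to the same structure:

import msgpack                   # assumes the third-party msgpack package
from jsonschema import validate  # assumes the third-party jsonschema package

schema = {
    "type": "object",
    "properties": {
        "title":  {"type": "string"},
        "author": {"type": ["string", "array"]},
    },
    "required": ["title", "author"],
}

packed = msgpack.packb({"title": "Wine Folly",
                        "author": ["Madeline Puckette", "Justin Hammack"]})

# The validator neither knows nor cares that the bytes started out as MsgPack.
validate(instance=msgpack.unpackb(packed), schema=schema)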

The serialization and deserialization stories are quite different. Serialization of any of them is quite straightforward. Deserialization of XML is quite complex (mostly because of the preamble, entities and such, to be fair), while deserialization of JSON is comparatively simple. But both suffer from the same issue, one not present in MsgPack: XML and JSON documents must be parsed in their entirety in order to correctly read any element of them.

To explain this, imagine you wanted to read the files above and just gather up all the author names, ignoring everything else to the extent possible. Let’s assume that you have all the information about the schema, but not how many books there are or how many authors each book has. In the JSON and XML examples, you need to read and process each individual byte, because:

  • You cannot know that the library object has started until after the closing > of the opening XML tag or after the opening { in JSON.
  • You cannot know how many book objects the library object has until you find the </library> in XML or the library’s closing } in JSON.
  • You cannot skip over the title properties that you know you don’t need, because you don’t know how long they’ll be.

Compare this to the MsgPack example.

  • When you read the first byte, 0x81, you know it’s a one-element map. Reading the following 0xA7 tells you that you have a 7-byte string, which we can confirm to be library. After that, we see another 0x81 indicating a map with one element, then 0xA4 indicating a 4-byte string (book), and 0x92 indicating a two-item list. So we’ve already determined that there are only two books.
  • Now we see 0x82, indicating a two-element map. Reading the next byte, we see 0xA5, indicating a 5-character string – we know that can’t be ‘author’ (6 characters), so we can skip 5 bytes ahead. Then we see 0xB5, introducing the 21-character string value we know we don’t need, so we can skip ahead 21 bytes.
  • Next we find 0xA6 – a 6-character key string. Verify it’s ‘author’, and then read 0xAD to get the first author string, 13 characters long.
  • The following bytes 0x82 0xA5 are handled the same way as above, and 0xAA lets us skip over the title again. Then we find the ‘author’ key, and 0x92 indicates a list, so we scoop up the items from that list.

It’s a pretty trivial example, but just by virtue of the format giving us these hints, we could determine the number of items and skip over the unimportant keys and their values quite easily. Beyond parsing time, this saved us a chunk of memory for storage (in practice a bunch of allocations), as well as a lot of unnecessary memory reads and instructions.
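
Here is what that walk looks like in code, as a rough sketch (the msgpack package is assumed, and used only to produce the example bytes). It handles just the fixmap, fixarray and fixstr type codes that appear in this example, and skips over string values it does not need without decoding them:

import msgpack  # used only to produce the example bytes

doc = {"library": {"book": [
    {"title": "Salmon River Gauntlet", "author": "William Bless"},
    {"title": "Wine Folly",
     "author": ["Madeline Puckette", "Justin Hammack"]},
]}}
buf = msgpack.packb(doc)

def collect_authors(buf):
    """Gather 'author' strings by walking the raw bytes.

    Only the fixmap (0x80-0x8F), fixarray (0x90-0x9F) and fixstr
    (0xA0-0xBF) type codes used in this example are handled.
    """
    authors = []

    def walk(pos, wanted):
        code = buf[pos]
        if 0x80 <= code <= 0x8F:                 # fixmap: N key/value pairs
            pos += 1
            for _ in range(code & 0x0F):
                key_len = buf[pos] & 0x1F        # keys here are fixstr
                key = buf[pos + 1:pos + 1 + key_len].decode()
                pos = walk(pos + 1 + key_len, wanted or key == "author")
            return pos
        if 0x90 <= code <= 0x9F:                 # fixarray: N elements
            pos += 1
            for _ in range(code & 0x0F):
                pos = walk(pos, wanted)
            return pos
        if 0xA0 <= code <= 0xBF:                 # fixstr: length in low bits
            length = code & 0x1F
            if wanted:                           # decode only what we need,
                authors.append(buf[pos + 1:pos + 1 + length].decode())
            return pos + 1 + length              # otherwise just skip ahead
        raise ValueError(f"type code 0x{code:02X} not handled in this sketch")

    walk(0, False)
    return authors

print(collect_authors(buf))
# -> ['William Bless', 'Madeline Puckette', 'Justin Hammack']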

It could be argued that it’s a shortcoming of the MsgPack format that there aren’t length indicators higher up in the structure. In a more complex example where there aren’t just books but also, say, magazines and DVDs (I dunno!), and we wanted to skip the books and just look at the magazines, we’d still need to traverse down into the books, because we don’t know the length of each individual key and value until we get to it. Skipping those values is fantastic, but it would be even better to skip entire structures when they’re unnecessary. However, that would incur a small storage cost for unclear gains in most cases, as well as a relatively large processing cost during serialization.

It could also be argued that being able to skip stuff isn’t actually useful for most parsing situations. However, it is unarguable that the presence of length fields speeds up parsing, because any given value can be copied without having to do a comparison on each byte. This alone reduces branching operations by a significant amount. The one caveat is that, for security reasons, it’s important to be sure that the buffer actually contains enough bytes before copying.

Transmission

Transmission follows a similar pattern to storage, except that for messaging you need to know where each message ends. With JSON-type messages this is often done by just sending them one after another:

{"id": 1, "msg": "foo"}
{"id": 2, "msg": "bar"}

This is not technically a valid JSON message, but two valid JSON messages. And as long as that’s what’s expected, it’s not a problem.
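
MsgPack messages, by contrast, carry their own length information, so a stream of them can simply be fed to an incremental unpacker as bytes arrive. A minimal sketch, assuming the third-party msgpack package:

import msgpack  # assumes the third-party msgpack package

# Two messages written back to back, as they might arrive on a socket.
stream = (msgpack.packb({"id": 1, "msg": "foo"}) +
          msgpack.packb({"id": 2, "msg": "bar"}))

unpacker = msgpack.Unpacker()
unpacker.feed(stream)        # in practice, feed whatever bytes have arrived
for message in unpacker:     # yields each message once it is complete
    print(message)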

For transmission purposes, the priorities are again similar:

  • Reducing the amount of data as much as possible matters, in this case to reduce the transmission time.
  • Guaranteeing that the message as sent is correctly received.
  • Serialization and deserialization need to be as fast as possible.

The arguments that apply to storage apply equally here.

Importantly, again, human readability is neither required nor even particularly desirable. As I wrote previously:

“JSON is a format optimized for human readability, but the overwhelming majority of messages sent are never read by humans, and are never intended to be read by humans.”

Display

Which brings us to human readability, and display more generally.

Some of these formats are good as display formats. But as we’ve gone through, they fail on most of the things that make a format desirable for storage and transmission, and in an overwhelmingly high percentage of cases there is zero benefit to them being human readable. In our example, the MsgPack format outperforms JSON by between 23 and 149 percent in storage cost, depending on whether the JSON is “pretty printed” or not, and by quite a lot in parsing cost. Multiply this overhead across the billions of messages passing over the Internet every day, and you’re losing a lot of clock cycles and wasting a lot of bandwidth capacity.

However, it is undeniable that MsgPack is hard to read. Weirdly, after having looked at enough MsgPack messages, I have started to be able to read a lot of the byte sequences quite quickly, but it’s definitely not the preferred display format.

XML and JSON, on the other hand, are both pretty decent display formats. They do lack type information, which stands in the way of clean conversion to MsgPack, but for the purposes of reading and writing, whether it’s for creating the file or for figuring out a problem or whatever, they are entirely reasonable.

I would like a variant on JSON for these purposes, with optional type annotations something like:

{
    "library": {
        "book": [
            { 
                "title":string: "Salmon River Gauntlet",
                "author": "William Bless",
                "year":u16: 2025,
            },
            {
                "title": "Wine Folly",
                "author": ["Madeline Puckette", "Justin Hammack"],
                "year":u16: 2018,
            }
        ]
    }
}

Here we’re using annotations where they clarify storage, and in one place (the :string: on the first title) where they add nothing really, but could still be applied.

Tooling

What’s missing for this kind of approach is a robust tool chain:

  • A command line utility like jq that works on MsgPack files.
  • Command line tools for converting between all of these common formats (a rough stopgap is sketched below).
  • A browser plugin that makes inspecting MsgPack files super easy.
  • Text editor support for MsgPack.
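
In the meantime, the conversion bullet above is only a few lines of Python away. This is a rough sketch assuming the third-party msgpack package; the script name and file names are just hypothetical placeholders.

# json2msgpack.py (hypothetical): python json2msgpack.py in.json out.msgpack
import json
import sys
import msgpack  # assumes the third-party msgpack package

with open(sys.argv[1]) as src, open(sys.argv[2], "wb") as dst:
    dst.write(msgpack.packb(json.load(src)))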

Hashing

It’s also worth noting that none of the above formats, on its own, gives you stable hash values for equivalent structures, which you’d want for things like authenticity verification, deduplication, indexing, and more.

With MsgPack and other binary formats, two house rules are sufficient to gain such stable hashing:

  1. A canonical order of attributes within a map, for example a lexicographically ordered attribute list. You can add house-rule exceptions to that as you want (a sketch of this follows the list).
  2. A rule that the smallest valid storage type is always used. The MsgPack standard explicitly states “If an object can be represented in multiple possible output formats, serializers SHOULD use the format which represents the data in the smallest number of bytes”, so perhaps this rule isn’t even necessary as a house rule, as long as your serializer is standards compliant.
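
As a rough sketch of rule 1 in Python (assuming the third-party msgpack package, and leaving rule 2 to a standards-compliant serializer):

import hashlib
import msgpack  # assumes the third-party msgpack package

def canonicalize(value):
    """Apply house rule 1: order map keys lexicographically, recursively."""
    if isinstance(value, dict):
        return {key: canonicalize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [canonicalize(item) for item in value]
    return value

def stable_hash(value):
    """Hash the canonical MsgPack encoding of a structure."""
    return hashlib.sha256(msgpack.packb(canonicalize(value))).hexdigest()

# Two structurally equal documents with different key order hash identically.
a = {"title": "Wine Folly", "author": ["Madeline Puckette", "Justin Hammack"]}
b = {"author": ["Madeline Puckette", "Justin Hammack"], "title": "Wine Folly"}
print(stable_hash(a) == stable_hash(b))  # True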

However, with JSON and XML, because of whitespace characters and other non-meaningful formatting choices, there is no inherent guarantee that, even with additional house rules, you will arrive at a stable hashing scheme. This issue has killed at least one secure messaging application I’m familiar with.
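
A trivial demonstration of the problem, using only the standard library:

import hashlib
import json

a = '{"library": {"book": []}}'
b = '{"library":{"book":[]}}'    # same data, different whitespace

print(json.loads(a) == json.loads(b))           # True: equivalent structures
print(hashlib.sha256(a.encode()).hexdigest() ==
      hashlib.sha256(b.encode()).hexdigest())   # False: different hashes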

Once you have a stable hashing for the overall structure, there is no reason you couldn’t do more complex things, like Merkle trees or what have you.

Conclusion

Stop using display formats for storage and transmission.