Parsing XML

2024-01-11 1296 words 7 minutes

Contents

XML is, as formats go, terrible.

If we think about what it’s doing, it’s outlining a hierarchy of nested nodes, each of which can have a number of attributes. Nodes can additionally contain CDATA or PCDATA, which practically speaking are typically treated as their own nodes. A few magic nodes exist, mostly for compatability with SGML.

It’s a tree. There are lots of other ways of describing trees textually. Some are bad, some are good. Some are more readable, some are less readable.

In terms of readability, XML is better than, say, JSON. But the price is verbiage.

Anyway. The other day, Tomasino mentioned RSS feeds in a group chat, and I asked what people’s favorite RSS feed readers were these days. I don’t think anybody intended this, but the harm was already done.

My first thought was: How hard is it to make an RSS reader? Next, I thought: Oh, I could make one. After that, I thought: Oh, I’ll need an XML parser. This is the point that I started looking at the available XML parsers.

What exists?

Most languages have a few of these. They tend to be either DOM-style or SAX-style, where the former does actual parsing into an actual Document Object Model, where the latter appears to do opportunistic parsing as needed and “visit” nodes with callbacks, in an effort to lower memory load, at the cost of far greater parsing overhead. I don’t know why anybody would actually want that nowadays; multi-gigabyte XML documents are thankfully rare.

There are lots of XML parsing libraries out there. The obvious favorite is LXML, which is relatively lightweight and works well for everything. But there were others too.

I found a blog in which the author claimed to have written “the fastest” XML parser, but it turned out he was measuring relative to Javascript land, which makes the claim not just way less impressive, but also thoroughly uninteresting.

The fastest and simplest I found was PugiXML, which comes in at about 25000 lines of C++ and appears to be comparatively nice. I quickly broke out the Jai bindings generator and tried to make bindings, but it kept failing to resolve some symbols; I couldn’t figure out if this was a C++ demangling problem or an issue with how I was running the bindings generator, so I gave up and hunted around for other options.

After going through these a bit, I figured it was an early Sunday afternoon in January and I had nothing better to do than go sit down on the beach and try this out for myself.

Thoroughly nerdsniped by this point. Heh.

Building it

I downloaded a bunch of XML conformance testing files and made a bunch of my own test files. I also downloaded a bunch of RSS and Atom feeds. Then I started writing.

To my surprise, it was relatively easy. I made a node handler and an attribute handler, and then special case handlers for special cases like comments, CDATA, PCDATA, Doctype definitions and so on.

By midnight, all the RSS files I could find in the wild passed, as did the examples I made. A lot of the XML conformance tests passed. Here are the test suite results:

1
2
3
4


Runtime: 271858 µs.
Passed: 1072 (0.344473%)
Failed: 2040 (0.655527%)
Total : 3112

Of the tests that failed, a lot of them were tests that are meant to fail (my test suite doesn’t distinguish these at the moment), but the vast majority were tests for support of encodings other than UTF-8. For simplicity, I only implemented UTF-8 support, which by implication means it supports ASCII too. Currently we entirely ignore the initial byte order mark, which is probably bad.

Either way, I was pretty happy with that as the results of a 12 hour coding session.

How fast is it?

On my laptop (XPS13 from 2023) it parses the Gizmodo RSS feed in 18505µs, which is a throughput equivalence of 0.005812 GB/s. This isn’t great, but it’s a good start.

In the test runs above you can see we parsed 3112 arbitrary XML files in 0.272 seconds, but that’s including disk read time. Sure, a lot of them failed, but it’s still an interesting figure.

I don’t think this is going to compare well against PugiXML or any of the other reasonably optimized XML parsers, but I haven’t benchmarked on equivalent computers.

There are a lot of speedups on the table:

The UTF-8 to UTF-32 conversion is really suboptimal.
The Unicode character recognition is pretty slow. I’ve arranged it so it’s likely to early out on false positives, but in the worst case it can be over 200 comparisons per character. There’s a pretty obvious way to use SIMD to speed these comparisons up, but I haven’t done that yet.
Generally we could be using SIMD a lot more, in particular when scanning for opening and closing angled brackets.
I’ve got a fairly efficient SIMD substrng matching algorithm that I “stole” from Iain King’s jai-string library. I considered just depending on the library, but that library is waaay bigger than my XML library. Seemed overkill.

Incidentally, the whole thing runs way faster when the parser is invoked with a pool allocator. That is to say, instead of doing this:

1

result, root, parser := xml_parse(str);

I do this:

1
2
3
4
5


pool: Pool;
set_allocators(*pool);
mypool : Allocator = .{pool_allocator_proc, *pool};

result, root, parser := xml_parse(str,, mypool);

This makes more sense when doing multiple runs.

How much code is this?

The whole library is 1271 lines, currently. That’s including a bunch of code that should be cleaned up . It’s got a reasonably full-featured set of DOM traversal functions. Some of the API should be cleaned up too, but it’s not terrible.

The library does depend on a few modules from the Jai standard library, specifically String, File, Unicode and Basic. It only uses one function from Unicode, the character_utf8_to_utf32 function, which should probably be replaced with something faster as noted above.

What’s the API?

Parsing functions:

xml_parse :: (buffer: string) -> XMLParseResult, *XMLNode, *XMLParser
xml_parse_file :: (file: string) -> XMLParseResult, *XMLNode

Node functions:

node_allocate :: inline () -> *XMLNode
node_create :: (type: XMLNodeType) -> *XMLNode
node_free :: (node: *XMLNode)
node_append :: (child: *XMLNode, node: *XMLNode)
node_append_new :: (node: *XMLNode, type: XMLNodeType) -> *XMLNode
node_prepend :: (child: *XMLNode, node: *XMLNode)
node_insert_after :: (sibling: *XMLNode, node: *XMLNode)
node_insert_before :: (sibling: *XMLNode, node: *XMLNode)
node_remove :: (node: *XMLNode)
node_find_child_by_tag :: (node: *XMLNode, tag: string) -> *XMLNode
node_get_cdata :: (node: *XMLNode) -> string, bool

Attribute functions:

attribute_allocate :: inline () -> *XMLAttribute
attribute_free :: (attrib: *XMLAttribute)
attribute_get :: (node: *XMLNode, name: string) -> *XMLAttribute
attribute_get_value :: (node: *XMLNode, name: string, default := "") -> string, bool
attribute_append :: (attr: *XMLAttribute, node: *XMLNode)
attribute_append :: (node: *XMLNode, key: string, value: string)
attribute_prepend :: (attr: *XMLAttribute, node: *XMLNode)
attribute_insert_after :: (attr: *XMLAttribute, node: *XMLNode)
attribute_insert_before :: (attr: *XMLAttribute, node: *XMLNode)
attribute_remove :: (attr: *XMLAttribute, node: *XMLNode)
attribute_append_new :: (node: *XMLNode) -> *XMLAttribute

Iterators:

xml_walk_depthfirst :: (node: *XMLNode, callback: XMLNodeWalkCallback)
xml_walk_breadthfirst :: (node: *XMLNode, callback: XMLNodeWalkCallback)
for_expansion :: (node: *XMLNode, body: Code, flags: For_Flags)

Releasing

I’ll probably try to clean this up a bit over the weekend, harden some bits, do some speedups, improve the API a bit. There are also a bunch of functions missing, in particular around memory management and such.

With any luck, I’ll put it up by Sunday. We’ll see.

Update: More coverage

The library now passes 90% of the tests.

1
2
3
4


Runtime: 351301 µs.
Passed: 2818 (0.905818%)
Failed: 293 (0.094182%)
Total : 3111

What remains are UTF-16 and ISO-8859 encoded files, which I’m not going to deal with for now.

Release after some cleanups.

Update 2: It’s live

Check it out!