The GRIB file format

I love a good file format. GRIB, however, appears to be… I don’t even know. It’s a file format for meteorological data, and stands for GRIdded Binary.

There are two main versions, and they appear to be binary incompatible. Unsurprisingly for meteorological formats, a lot of the reference code is in Fortran. Sigh.

GDAL has support for the GRIB format, but since GDAL is designed more around managing geographical imagery data, it makes just reading the values out of a GRIB file more complicated than it needs to be. I gave up on that approach pretty quickly.

Looking at other tools yielded more sighs.

So, I guess it’s time to look at the standard.

Writing a parser

I will happily admit that I get way too much joy out of unpacking obscure binary formats. As binary formats go, GRIB is pretty typical. The standard is well written and clear as far as technical details go. Annoyingly, it’s 1-indexed (the spec numbers octets from 1), which means I find myself subtracting 1 all the time. This reflects its Fortranish origins. That aside, making a parser was easy, for the most part… let’s get back to that.
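For a taste of what that subtracting looks like: the spec says the first 16 bytes of a message carry the “GRIB” magic in octets 1-4, the “discipline” in octet 7, and the total message length in octets 9-16, so in a zero-indexed language every offset gets a little adjustment. A minimal sketch in Python (my own names, nothing from a reference implementation):

    import struct

    def parse_indicator(buf: bytes) -> dict:
        # The spec numbers octets from 1, so "octet 7" lives at buf[6], and so on.
        assert buf[0:4] == b"GRIB"                          # octets 1-4: magic
        discipline = buf[7 - 1]                             # octet 7: discipline
        edition = buf[8 - 1]                                # octet 8: GRIB edition number
        total_length, = struct.unpack(">Q", buf[9 - 1:16])  # octets 9-16: total message length
        return {"discipline": discipline, "edition": edition, "length": total_length}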

First off, there are nine types of sections, numbered 0 through 8 (there’s a small code sketch just after the list):

  • 0 Indicator Section - Essentially a header with overall length, version info, and some metadata, including “discipline”, which is a bizarre way of pigeonholing datasets based on which field the person who made the data thinks they work in. Starts with “GRIB”. Totals 16 bytes.
  • 1 Identification Section - More metadata, but more about the data than the file.
  • 2 Local Use Section - A freeform area to put notes in or something.
  • 3 Grid Definition Section - Wherein the grid is defined, but mostly by reference to certain templates.
  • 4 Product Definition Section - This is, confusingly, mostly there to specify the horizontal structure of the data on the grid, mostly by reference to certain pre-specified “Product definition templates”. The discipline field from the header actually implies a lot about the structure from here on out.
  • 5 Data Representation Section - This indicates how the data in the fields are to be represented, again mostly by reference to standard data representation tables.
  • 6 Bit-map Section - An optional bitmap that indicates presence or absence of data for individual locations.
  • 7 Data Section - The actual data, which needs to be interpreted in relation to all of the above.
  • 8 End Section - Literally just the ASCII characters “7777”. Kind of silly, but whatever.
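To make that concrete: every section except 0 and 8 starts with a four-byte section length and a one-byte section number, so a generic walker over one message is short. A rough sketch (again Python, again my own names):

    def iter_sections(message: bytes):
        """Yield (section_number, section_bytes) for a single GRIB2 message."""
        yield 0, message[:16]                      # Indicator Section is a fixed 16 bytes
        pos = 16
        while message[pos:pos + 4] != b"7777":     # End Section is the literal "7777"
            length = int.from_bytes(message[pos:pos + 4], "big")  # octets 1-4: section length
            number = message[pos + 4]                             # octet 5: section number
            yield number, message[pos:pos + length]
            pos += length
        yield 8, message[pos:pos + 4]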

So, while the binary format is technically fairly painless, it encodes a veritable shit ton of culture, and most of that culture seems ridiculous to me as an outsider. More on that later.

However, the point where I became seriously confused was when I came upon the End Section only 45000 or so bytes into a 23 megabyte file. It took me a bit of work, but I finally realized that this is a tape archival format.

Which is to say: It’s designed as a sequence of messages that can be stored linearly on tape storage for efficient retrieval. This makes sense because in the 1980s the world’s meteorologists had literal reams of data in the absence of large random-access hard drives.

The example file I’m working with contains 540 concatenated “messages”, each of which is a “database” of values, but each one contains values for only one given variable. For instance, I get one “message” containing soil temperature at 20cm depth for everywhere in the world during “time steps” 0-24, and then another “message” containing soil temperature at 100cm depth, etc… and then everything gets repeated for time steps 24-48, 48-72, etc.

Which means that the file format is best understood as a sequence of binary encoded messages, where each one may or may not have shared context with other messages in the sequence.
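In practice, that means the outer loop of a reader just scans for the “GRIB” magic, reads the total message length from the header, and skips ahead. Something like this sketch (assuming a GRIB2-only file small enough to hold in memory):

    def iter_messages(data: bytes):
        """Yield each GRIB2 message from a file of concatenated messages."""
        pos = 0
        while True:
            pos = data.find(b"GRIB", pos)
            if pos == -1:
                break
            total = int.from_bytes(data[pos + 8:pos + 16], "big")  # octets 9-16: total length
            yield data[pos:pos + total]
            pos += total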

Oh well.

Refactoring

Obviously, the next step was to refactor everything to allow random access to individual messages. I was happy to load tens or even hundreds of megabytes into memory at once, because computers can do that now, but the risk of ending up in large double-digit-gigabyte territory is real, so I had to be a bit smarter. Sigh.
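The shape I want is an index built in one cheap pass (just offsets and lengths), with message bytes read lazily when they’re actually needed. Roughly this, sketched with mmap (an illustration of the idea, not the real code):

    import mmap

    def index_messages(path: str) -> list[tuple[int, int]]:
        """One pass over the file, recording (offset, length) for each message."""
        index = []
        with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(b"GRIB")
            while pos != -1:
                total = int.from_bytes(mm[pos + 8:pos + 16], "big")
                index.append((pos, total))
                pos = mm.find(b"GRIB", pos + total)
        return index

    def read_message(path: str, offset: int, length: int) -> bytes:
        """Pull a single message off disk on demand."""
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)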

At this point, I’m working my way out of a hole, while also realizing that some of the ways I handled Sections 4, 5 and 7 are kind of wrong-headed, because the format thinks about the data quite differently from how I’ll actually be using it.

Unfortunately, I’ve run out of time for today, but this has been a fun 3-hour project that I will have to return to another time. However, I can’t end this post without first talking about the culture it encodes.

Encoding culture

This format is the quintessential example of Conway’s law:

“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.”

Version 1 was standardized in 1994, but the format appears to have a longer legacy than that. It is a logical format for the task at hand, given the technical limitations of the time, but it’s definitely showing its age. If not by the curious message-based format, then by the “discipline”-based implications of data structure and intent, or by the fact that the format was designed around the assumption that there would only ever be a handful of possible originating organizations, all of them fitting on one clear, finite list.

I do think that encoding culture is unavoidable, but it seems that this might encode culture so strictly that the data ends up suffering for it. It’s not a hard-to-read format, per se, but it’s really hard to see how this format is better, in 2022, than a SQLite database (or something like it) that simply lists the values in a set of tables.
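To make the comparison concrete, this is roughly the boring alternative I have in mind, sketched with Python’s built-in sqlite3 and made-up values (the table name and columns are just illustrative):

    import sqlite3

    conn = sqlite3.connect("weather.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS soil_temperature (
            time_step INTEGER,   -- hours since the reference time
            depth_cm  INTEGER,
            lat       REAL,
            lon       REAL,
            value     REAL
        )
    """)
    conn.execute(
        "INSERT INTO soil_temperature VALUES (?, ?, ?, ?, ?)",
        (0, 20, 59.9, 10.7, 278.4),  # one made-up sample point
    )
    conn.commit()
    conn.close()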