GRIB Woes, Part 2
Yesterday I spent a few hours making a decoder for the GRIB file format, which was a mostly pleasant experience, although the deeper I got the more I realized that there was something very wrong with the state of affairs.
The data I’m trying to work with is the ECMWF’s Seasonal-to-Subseasonal daily averaged data, which is a pretty typical data set as meteorology and climate science go. It’s very well curated, very precisely defined, and deployed to the world in a really difficult-to-use way.
The problem of doing scientific computing is always going to be one of managing and transforming data in sensible ways. Historically, there was more data than there was capacity, and a number of things were simply hard to solve on a budget - data storage being one of them. So formats were designed with those restrictions in mind. They tend to:
- Be binary formats
- Be designed for tape storage (message-based)
- Assume a lot of implicit knowledge or data availability, including the availability and well-formedness of conversion tables
- Be ever so slightly overengineered
- Assume that you’re going to be working on a cluster
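To make the “message based” point concrete: a GRIB file is just messages laid end to end, each announcing its own length up front, which is exactly what you want when you’re streaming off tape. Here’s a minimal Python sketch of walking GRIB2 messages - it only parses the 16-byte Indicator Section and trusts the declared lengths; real decoding needs all those conversion tables:

```python
import struct

def scan_grib2(data: bytes):
    """Yield (offset, edition, total_length) for each GRIB message in a buffer.

    Minimal sketch: only reads Section 0 (the Indicator Section) of
    GRIB edition 2 messages and trusts the declared message length.
    """
    pos = 0
    while True:
        pos = data.find(b"GRIB", pos)
        if pos < 0:
            break
        edition = data[pos + 7]  # octet 8 of Section 0 is the edition number
        if edition == 2:
            # Octets 9-16: total message length, unsigned 64-bit big-endian.
            (length,) = struct.unpack_from(">Q", data, pos + 8)
            yield pos, edition, length
            pos += length  # messages are simply concatenated back to back
        else:
            pos += 4  # GRIB1 stores a 24-bit length instead; not handled here
```

Each message ends with the literal bytes `7777`, so a well-formed file is fully self-delimiting even if you can’t interpret a single field inside it.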
The wgrib2 developers at NOAA issue a fine warning on their page about the risks of “just” converting to CSV:
“It may be tempting to take a grib file, convert it into a CSV file and then deal with the CSV file. After all, everybody can read a CSV file. Sure there is a little overhead of reading a CSV file but who cares. Suppose you want to read some GFS forecast files (20 forecast times, 5 days every 6 hours) at 0.25 x 0.25 degree global resolution. Your CSV file is going to be about 720 GBs. Suppose that our hard drive can write/read at 70 MB/s. Then we are talking about 3 hours to write the CSV file and 3 hours to read the CSV file not including CPU time which will slow down the process. Converting grib into CSV is a viable strategy if the conversion is limited. You need to restrict the number of fields converted and should consider only converting a regional domain. Note, I wrote “viable” and not optimal.”
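Their arithmetic checks out as a back-of-envelope figure - the 720 GB and 70 MB/s numbers are theirs, the math is just division:

```python
# Sanity check of the wgrib2 warning's numbers (both figures are theirs).
csv_bytes = 720e9   # ~720 GB of CSV
disk_rate = 70e6    # 70 MB/s assumed disk throughput
hours = csv_bytes / disk_rate / 3600
print(f"{hours:.1f} hours")  # just under their quoted "3 hours" each way
```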
But a lot has changed. Aside from my hard drive having a nominal write speed of 2.2 GB/s, a mere 30 times faster, there’s the more important fact that many of the assumptions no longer hold. So, I totally agree that a streaming binary format would be preferable to a CSV, but there are a few assumptions we should change.
- Even though storage read/write bandwidth is still a bottleneck, storage space is not.
- Conversion tables can and should be included as preamble; i.e. the data is self-describing.
- Reduce the amount of cultural encoding where possible, mostly by not designing the format around organizational hierarchies or assumptions about who is using the data.
- Design the format in such a way that manipulating data from it can be done in a cache-efficient, data-oriented way.
These assumptions would point at some kind of column-based binary format with self-describing properties and reasonable metadata support. If this sounds like Parquet to you, great. If not, Luminousmen made a useful page comparing the pros and cons of different “big data” formats.
So what now?
Unfortunately, this doesn’t solve my problem, it just moves the goalposts. In particular, I still have piles of data from ECMWF in a format that I need to be able to read. I could just spend more time on making my own reader, or I guess I could use wgrib2, convert it to CSV and from there to Parquet… which is kind of silly.
But let’s take a look at wgrib2 a bit more. First off, it is not very clear from the 90’s-looking website where one downloads it, and once you’ve dug through it you find a fairly messy FTP directory with a .tgz file containing a makefile. Running make promptly fails.
By twiddling some environment variables to use GFortran, we’re getting somewhere. Yay. Okay. Of course, make install failed horribly so I had to install it by hand, but fine. Also, I’ll probably want to revisit my reader at some point so as to cut out the CSV middle piece, but this might suffice for now.
Unfortunately, -csv doesn’t actually create a consolidated CSV with as many columns as there are variables represented in the messages (plus latitude, longitude, time, etc.); instead it gives me a ~800 MB file with over 10 million lines, for an input of 26 MB with 20 “columns” worth of data and 29,040 data points per column. Suddenly the warning on their website makes a lot more sense, given that level of redundancy. I guess I get to have fun writing a script to transpose this data into something halfway useful.
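That transpose script is essentially a pivot from long format to wide format. A sketch with pandas - the column labels are my own guesses at the -csv layout (one row per time/variable/point), and the sample rows are made up; the real dump would come from `pd.read_csv` with `header=None` and these names:

```python
import pandas as pd

# Long format, roughly as wgrib2 -csv emits it: one row per
# (time, variable, point). Labels and values are illustrative.
cols = ["reftime", "valtime", "var", "level", "lon", "lat", "value"]
long_df = pd.DataFrame(
    [
        ["2020-01-01", "2020-01-01", "TMP", "2 m", 0.00, 60.0, 271.3],
        ["2020-01-01", "2020-01-01", "UGRD", "10 m", 0.00, 60.0, 4.2],
        ["2020-01-01", "2020-01-01", "TMP", "2 m", 0.25, 60.0, 271.1],
        ["2020-01-01", "2020-01-01", "UGRD", "10 m", 0.25, 60.0, 3.9],
    ],
    columns=cols,
)

# One row per point and time, one column per variable - the endlessly
# repeated coordinates and timestamps fold away.
wide = long_df.pivot_table(
    index=["valtime", "lat", "lon"], columns="var", values="value"
).reset_index()
```

On the real 10-million-line dump this cuts the redundancy down to one coordinate triple per row, which is most of what that 800 MB was.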
(If the wgrib2 devs ever read this, might I suggest putting a “download” link on your website, possibly providing Debian packages and such, and maybe even moving to a more standardized build scheme?)
How to read/write Parquet?
My next question is of course how to read and write Parquet files. Ubuntu doesn’t appear to have any packages for it, and I find Java to be quite distasteful, so parquet-tools appears to be a reasonable basic toolset. That leaves linking into my own programs. I’m writing mostly Jai these days, but also some Python. Python bindings are easy; parquet-tools covers that.
For Jai, I’ll need to make bindings. The best library appears to be parquet-cpp. Downloading and building that was entirely painless. Just kidding. Apparently it depends on Apache Arrow for “some of the memory management and IO interfaces” (I really just wanted to read and write some files) and you’re actually supposed to build Parquet support from there. Thankfully, aside from the 103 MB source download it wasn’t too bad.
cmake . (plus a few flags, it turned out), make, make install – why have one command if you can have three, amirite?
It turns out that the HEAD on the git has some problems linking, and I’ve run out of time for today. I need to get a few unrelated matters done before I call it a day. Perhaps I’ll come back to this tomorrow with a clear head and more patience for over-the-top libraries that try to do everything under the sun instead of just reading and writing the format I want to read and write.