Meteorological data fun (day 3)

Smári McCarthy included in English Programming

2022-01-12 535 words 3 minutes

Contents

I’m making some progress.

Naturally, I was being a bit dense by the end of the day yesterday. Of course Apache Arrow comes with Debian repositories and good instructions for installation. I mean, it’s not quite as good as just working™, but close enough. Solved.

I also read up on the GLib-bindings mechanism that they provide, which makes a lot of things easier. The Ruby bindings base off it, for instance, and probably the Python bindings too. But more on that later.

Speaking of Python bindings, pip install pyarrow was sufficient. I built a minimal transpose tool that takes the CSV output from wgrib2 and converts it to consolidated CSV, JSON or Parquet as desired.

The CSV version had a slight issue: because there may be multiple readings for one variable, this form of consolidation is a bit dumb. In practice, the data I’m working with has two readings for two variables, soil temperature and soil moisture, both sampled at 20cm and 100cm depth. Since I wasn’t going to be using the CSV output directly I didn’t care about this at first, but ultimately ended up fixing it for the variables I care about, at the cost of making the tool less generic. Now it’s right in CSV, JSON and Parquet.

Parquet schema

Parquet files contain a schema.

The Parquet schema I end up with is:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20


required group schema {
  optional double latitude;
  optional double longitude;
  optional binary date (String);
  optional double icec;
  optional double wtmp;
  optional double cape;
  optional double tcwat;
  optional double tmp;
  optional double dpt;
  optional double skint;
  optional double sdwe;
  optional double tcdc;
  optional double sden;
  optional double salbd;
  optional double soilm20;
  optional double soilm100;
  optional double soilt20;
  optional double soilt100;
}

This could do with some fine tuning, but it’s good enough for now.

Conversion

The process is now:

1
2
3


wgrib2 2020-01-s2s.grib -csv 2020-01-s2s.csv
python s2s_transpose.py 2020-01-s2s.csv 2020-01-s2s.parquet --format parquet
rm 2020-01-s2s.csv

And boom. I have my data in a saneish format, albeit via an intermediary step that generates a lot of unnecessary data in a slow way. But all for the cause, right?

I decided at this point that it would be nice to be able to pull this into PostgreSQL too for some quick and dirty work. Thankfully the CSV version was fine for that (and the Parquet schema is practically identical to the SQL tables).

Some statistics: Starting with a 257MB GRIB file, wgrib2 expanded that to 9.3GB of CSV. s2s_transpose made from that:

A 45MB parquet file
A 244MB “tabular” CSV file

The entire process is dumb and painful, but c’est la vie. I do quite like the compression ratio on the parquet file, and I’m also super amused that the CSV ended up being smaller than the GRIB once all the redundancy was eliminated.

The final thing was to automate the process and pull in a year’s worth of weather averages. Of course it couldn’t be quite that easy, because there is some overlap between the monthly data files. Why January data should contain the 4th of February is beyond me, but fine.