
Schemas and data transformations

Sturgeon was right; he just spelled CRUD wrong. 90% of everything is in fact CRUD – Creating, Reading, Updating, Deleting. Except nowadays we have schemas to at least partially automate the highly efficient creation of an infinity of crap.

My Tech Stack is Janky

The problem arises when you have six operational domains:

  • Data warehouse (Original data)
  • Batch processing data store (Largely Parquet files derived from original data)
  • The programs that do the batch processing (mostly Jai, some Fortran, C and Python), which need their own internal representation of the data they are processing
  • Operational data store (PostgreSQL tables derived from original data or batch outputs)
  • Django ORM (Python schema reflecting the ODS)
  • React frontend schemas (TypeScript schema reflecting the output from the Django REST API)

The closer we get to the user, the more the data will have been transformed into something directly useful. The input data is pretty often janky and worthless on its own. Heck, if it were useful on its own without further processing, I wouldn’t have to write software, would I?

The question is: how do I manage this expanding cone of schemas in a way that won’t make me hate life too much? This problem is ever so slightly compounded by the fact that each of these systems has different inherent opinions about how things should be done. So what I have going into today is:

  • Original data in my “data warehouse” (which at the moment is just a well organized folder on a hard drive somewhere)
  • The SQL schema for each input dataset lives in a file alongside that data. The table name reflects the dataset’s location in the warehouse (a hypothetical example of this layout follows this list).
  • Ditto for a converter that creates Parquet files with the correct schema.
  • The Django ORM definitions live in the appropriate models.py file, which is currently organized on a functional basis rather than on a data-origin basis. There is room for improvement here – perhaps I should separate form and function?
  • The React schemas for anything that arrives from the API mostly live in one schema.ts file, which will probably become unwieldy eventually. At minimum, I should make that “mostly” go away.
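
To make that schema-alongside-data convention concrete, here is a hypothetical example of the layout (the paths and names are invented for illustration, not my actual warehouse):

warehouse/stations/2024/observations.csv
warehouse/stations/2024/schema.sql      (CREATE TABLE stations_2024_observations ...)
warehouse/stations/2024/to_parquet.py   (converter emitting observations.parquet with the matching schema)

The point is that everything needed to interpret or convert a dataset travels with the dataset itself.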

I think I’ll spend some time today on making this process less bad.

In an ideal world, I could get rid of, or at least neutralize, some of these domains. For what I’m doing, I cannot make the UI anything other than the browser, and I cannot get away from doing the processing in a low-level language that can do things fast. Most things between those two points can be modified, simplified or removed. Ways to do this include:

  • Reducing the number of languages I’m using. I’d love to do that. In particular, I’d like C and Fortran to go away (that’ll take a while), and I really wish I could make the JavaScript ecosystem disappear in a puff of logic. If I could limit myself to Jai, that would be amazing, and even just doing Jai + TypeScript would make my life a lot better. But any progress in this direction will take a while. Django could relatively easily go away, since I’m mostly just using it as a thick API – it’s 95% just Django REST Framework, 90% of which I’m not really using.
  • Perhaps I could come up with some mechanism by which I specify the schema in one domain and it generates the schemas for all the other domains. This could work for “base data” but less well for derived/transformed forms (a sketch of what such a generator might look like follows this list). The other problem here is that some of these domains have strong and stable type systems, while others are incurably afflicted by dumb.
  • It is uncertain whether Parquet is actually useful, except insofar as it is not variously GeoTIFF/NetCDF/GRIB/etc. And it cannot even replace all of those. Perhaps kill?
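
To make that single-source-of-truth idea concrete, here is a minimal sketch in Python, assuming a dataclass-based portable schema. Every name, field and type mapping below is invented for illustration, not taken from my actual codebase:

from dataclasses import dataclass

@dataclass
class Field:
    name: str
    kind: str  # portable type name: "float", "int", "str", or "datetime"

# One portable definition of a (hypothetical) observations dataset.
OBSERVATIONS = [
    Field("station_id", "str"),
    Field("latitude", "float"),
    Field("longitude", "float"),
    Field("observed_at", "datetime"),
]

# Per-domain type mappings.
SQL_TYPES = {"float": "double precision", "int": "bigint",
             "str": "text", "datetime": "timestamptz"}
TS_TYPES = {"float": "number", "int": "number",
            "str": "string", "datetime": "string"}  # ISO 8601 on the wire

def to_sql(table: str, fields: list[Field]) -> str:
    cols = ",\n  ".join(f"{f.name} {SQL_TYPES[f.kind]}" for f in fields)
    return f"CREATE TABLE {table} (\n  {cols}\n);"

def to_typescript(name: str, fields: list[Field]) -> str:
    props = "\n".join(f"  {f.name}: {TS_TYPES[f.kind]};" for f in fields)
    return f"export interface {name} {{\n{props}\n}}"

print(to_sql("stations_observations", OBSERVATIONS))
print(to_typescript("Observation", OBSERVATIONS))

Note that the TypeScript side is already lossy (datetime becomes string), which is the “strong type systems vs dumb” problem in miniature: the generator has to pick the least-bad mapping for each target domain.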

I’m pretty sure every company in the world has dealt with this problem, and I’m sure at least five of them have come up with a sane solution. Perhaps it’s time for some of that so-called research.

The value of assert()

Meanwhile, in the code mines, I discovered that some of my input data had flipped latitudes and longitudes. This is after processing; essentially, at some point during the process I decided to read latitude, longitude, not noticing that some godless fool had decided these should be the other way around. Anyway, my bad, but this might be a good time to discuss the merits of assert() as a way of guarding against stupidity that should never happen.

It’s easy to assume that things that should never happen will never happen, but in fact things that should never happen happen all the time. So, as a rule, if x must absolutely be true (or its negation absolutely false), make a point of saying so:

assert(x);

This of course isn’t the correct way to validate user input, but it is a good way to validate things that you believe to be true about your own system. In my case, I have added this to my importer:

assert(-90.0 <= latitude <= 90.0)
assert(-180.0 <= longitude <= 180.0)
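
Two caveats worth noting. First, range checks like these only catch a lat/lon swap when a longitude actually lands outside ±90, so a dataset confined to longitudes within that band would still sneak through. Second, in many toolchains asserts vanish from release builds (C’s assert() compiles out under NDEBUG, and Python drops asserts under -O), so anything that must also hold in production needs a real check rather than an assert.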

It’s just good sense for stuff to break where the problem is, when the problem arises, rather than having to figure it out a week and a few hundred commits later.

Data warehousing rules

The research yielded the following.

There are a bunch of “Data Warehousing” systems and tools, but most of them primarily lock you into a set of assumptions and systems: some complexity goes away, at the cost of absorbing a bunch more. I’m pretty sure somebody has come up with exactly the right set of tools for my purposes, but I have not found it.

For now, I’m just going to keep rolling my own. Simplification time!

My data warehouse setup was too complicated, and so I simplified it and made some rules about how data should be managed in the ETL (extract, transform, load) processes. These rules should eventually be automatically enforced, but since it’s just me working on this side of things it can be manual for now.
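
As a sketch of what that automatic enforcement might eventually look like (the layout convention and file suffixes here are assumptions for illustration, not my actual rules), a checker could walk the warehouse and flag any data file missing its companion schema:

from pathlib import Path

# Hypothetical convention: every data file must have a schema.sql sibling.
WAREHOUSE = Path("warehouse")
DATA_SUFFIXES = {".csv", ".parquet", ".nc", ".tif", ".grib2"}

def missing_schemas(root: Path) -> list[Path]:
    """Return data files that lack a schema.sql in the same directory."""
    return [f for f in root.rglob("*")
            if f.suffix.lower() in DATA_SUFFIXES
            and not (f.parent / "schema.sql").exists()]

for f in missing_schemas(WAREHOUSE):
    print(f"missing schema.sql next to {f}")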

I am eliminating C for now (fairly easy to replace with Jai), and I have a plan to replace Python with Jai on the API side, though Python might continue to be used for ETL purposes for a while. To entirely replace Python/Django/Django REST Framework with Jai, it looks like I will need to make two small add-ons to my jai-simplehttp module, which could be a fun weekend project, and then do a bit of integration work. This is not a priority or on the critical path for now, but it’s a nice-to-have that I’ll be aiming for later.

I’m going to park Parquet for now; I was only using it in one place and I can do without it. It might come back later if it becomes relevant for various data processing purposes.