Database versioning and handling branching

It's no secret to developers of database-driven applications that trying to version a database schema (and seed data) is a royal pain. Propagating changes from your development environment to a live environment is not something that most version control systems are well-equipped to do. This is further complicated by distributed VCSes, like Git -- how do you prevent schema migration chaos when developers can branch on a whim?

I've been mulling this issue over for a few months now. There are several development projects I have where database versioning is necessary. Finally, I've come up with a process that is not only effective, but actually really simple!

Taking a clue from Git, we can use a digest of the current schema definition file (the SQL script that will create the database structure) as the schema identity. The schema requires a meta-table to describe the identity of the current schema -- this will allow for automated schema migration, since the migration script can detect the current schema version. (I usually have a meta-table anyway, for storing site configuration in key/value pairs. The schema version fits in this table fairly well.)

So, let's say we're starting a new project. We design a schema, making sure to include this meta-table as well as any seed data required for the application to function. This excludes the "current database schema identity" row, since adding that row in the schema script will cause the identity to change! Then we write a migration script. This script has two functions: load the initial schema, and migrate between schema versions. When performing the initial load, it should follow this by inserting the database schema identity into the meta-table. The identity is of course obtained by digesting the schema file.

Now we are ready to make changes. Let's say we want to add a column to the table. First, we note what the current schema's identity is. Let's call this $SCHEMA_ORIGINAL. We tweak the schema definition file to include this column, and then we obtain the digest of the schema file, calling this $SCHEMA_NEW. Now, we write two migration functions in the migration script: one that will migrate from $SCHEMA_ORIGINAL to $SCHEMA_NEW (an ALTER TABLE ADD COLUMN query) as well as one that will migrate in the opposite direction (ALTER TABLE DROP COLUMN). This will allow us to roll back the database if necessary.

Now, when you ask the migration script to upgrade the database schema, it only has to fetch the database's current version, digest the schema definition file, and then find a path between those versions using a breadth-first search of the available migrations, starting at the current version.

This technique can even account for merges! When merging branches A and B together, you would resolve any conflicts that arise in the schema definition, and then construct four migration functions: A to merged, merged to A, B to merged, and merged to B. The breadth-first search during migrations means that if you are then switching from branch A prior to the merge to branch B prior to the merge, it may actually be faster to migrate the database through the merge instead of backing up until the point A and B diverged.

It may also be useful to provide a mechanism to tag certain revisions as causing data loss (such as rolling back a column addition). The migration script would then prefer longer migration paths that preserve data over shorter migration paths that destroy it.

There are some downsides to this approach. For one, migration functions will have names that provide little meaning to humans, something like migrate_0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33_to_62cdb7020ff920e5aa642c3d4066950dd1f01f4d(). And another is that the migration script will need to construct an in-memory graph of every migration so that it can perform its search. If the only possible migration path contains a migration that will cause data loss, the script will exhaustively search the rest of the graph looking for an alternative. There's probably some room for optimization there.

I will likely be coding up an MIT-licensed Python script to implement this algorithm in the coming days, so stay tuned.