Database versioning and handling branching

It’s no secret to developers of database-driven applications that trying to version a database schema (and seed data) is a royal pain. Propagating changes from your development environment to a live environment is not something that most version control systems are well-equipped to do. This is further complicated by distributed VCSes, like Git — how do you prevent schema migration chaos when developers can branch on a whim?

I’ve been mulling this issue over for a few months now. Several of my development projects need database versioning. Finally, I’ve come up with a process that is not only effective, but actually really simple!

Taking a cue from Git, we can use a digest of the current schema definition file (the SQL script that will create the database structure) as the schema identity. The schema requires a meta-table to describe the identity of the current schema — this allows for automated schema migration, since the migration script can detect the current schema version. (I usually have a meta-table anyway, for storing site configuration in key/value pairs. The schema version fits in this table fairly well.)
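For concreteness, here’s a rough sketch of the digest step in Python. The schema.sql file name and the choice of SHA-1 (matching Git’s) are assumptions for the example, not requirements of the technique:

```python
import hashlib

def schema_identity(path="schema.sql"):
    """Return the hex digest of the schema definition file.

    The digest is the schema's identity: any edit to the file yields a
    new identity, with no sequential numbering to coordinate across
    branches.
    """
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()
```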

So, let’s say we’re starting a new project. We design a schema, making sure to include this meta-table as well as any seed data required for the application to function. This excludes the “current database schema identity” row, since adding that row to the schema script would cause the identity to change! Then we write a migration script. This script has two jobs: load the initial schema, and migrate between schema versions. After performing the initial load, it should insert the database schema identity into the meta-table. The identity is, of course, obtained by digesting the schema file.
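Here’s what the initial-load half of that script might look like, assuming SQLite and a meta table with key and value columns (all names are illustrative; schema_identity() is the sketch from above):

```python
import sqlite3

def initial_load(db_path, schema_path="schema.sql"):
    """Create a fresh database from the schema file, then record its identity."""
    conn = sqlite3.connect(db_path)
    with open(schema_path) as f:
        conn.executescript(f.read())
    # The identity row is inserted here rather than in schema.sql itself,
    # because adding it to the file would change the file's digest.
    conn.execute(
        "INSERT INTO meta (key, value) VALUES ('schema_version', ?)",
        (schema_identity(schema_path),),
    )
    conn.commit()
    return conn
```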

Now we are ready to make changes. Let’s say we want to add a column to a table. First, we note the current schema’s identity; call this $SCHEMA_ORIGINAL. We tweak the schema definition file to include the new column, and then we digest the schema file, calling the result $SCHEMA_NEW. Now we write two migration functions in the migration script: one that migrates from $SCHEMA_ORIGINAL to $SCHEMA_NEW (an ALTER TABLE ADD COLUMN statement) and one that migrates in the opposite direction (ALTER TABLE DROP COLUMN). This allows us to roll back the database if necessary.
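In the eventual script, that pair of functions might be registered as edges keyed by the digests they connect. The users table, the email column, and the lossy flag (which comes up again below) are invented for this example, and the digests are arbitrary:

```python
MIGRATIONS = {}  # (from_identity, to_identity) -> (function, lossy)

def migration(src, dst, lossy=False):
    """Register a migration function as an edge between two schema identities."""
    def register(fn):
        MIGRATIONS[(src, dst)] = (fn, lossy)
        return fn
    return register

SCHEMA_ORIGINAL = "0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33"
SCHEMA_NEW = "62cdb7020ff920e5aa642c3d4066950dd1f01f4d"

@migration(SCHEMA_ORIGINAL, SCHEMA_NEW)
def add_email_column(conn):
    conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

@migration(SCHEMA_NEW, SCHEMA_ORIGINAL, lossy=True)  # rollback destroys data
def drop_email_column(conn):
    # Note: DROP COLUMN support varies by database engine.
    conn.execute("ALTER TABLE users DROP COLUMN email")
```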

Now, when you ask the migration script to upgrade the database schema, it only has to fetch the database’s current version, digest the schema definition file, and then find a path between those versions using a breadth-first search of the available migrations, starting at the current version.
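The search itself might look something like this, building on the registry and helpers sketched above:

```python
from collections import deque

def find_path(src, dst, allow_lossy=True):
    """Breadth-first search for the shortest migration path from src to dst."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return [MIGRATIONS[(a, b)][0] for a, b in zip(path, path[1:])]
        for (a, b), (_fn, lossy) in MIGRATIONS.items():
            if a == path[-1] and b not in seen and (allow_lossy or not lossy):
                seen.add(b)
                queue.append(path + [b])
    raise ValueError(f"no migration path from {src} to {dst}")

def upgrade(conn, schema_path="schema.sql"):
    """Migrate the connected database to match the schema definition file."""
    current = conn.execute(
        "SELECT value FROM meta WHERE key = 'schema_version'"
    ).fetchone()[0]
    target = schema_identity(schema_path)
    for step in find_path(current, target):
        step(conn)
    conn.execute("UPDATE meta SET value = ? WHERE key = 'schema_version'", (target,))
    conn.commit()
```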

This technique can even account for merges! When merging branches A and B together, you would resolve any conflicts that arise in the schema definition, and then construct four migration functions: A to merged, merged to A, B to merged, and merged to B. Because migrations are found by breadth-first search, if you later switch from branch A as it was before the merge to branch B as it was before the merge, it may actually be faster to migrate the database forward through the merge instead of backing up to the point where A and B diverged.
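The merge then just registers four more edges with the same decorator. Every identity and function body here is a placeholder; they depend entirely on what each branch actually changed:

```python
SCHEMA_A = "<digest of schema.sql at the tip of branch A>"      # placeholder
SCHEMA_B = "<digest of schema.sql at the tip of branch B>"      # placeholder
SCHEMA_MERGED = "<digest of the conflict-resolved schema.sql>"  # placeholder

@migration(SCHEMA_A, SCHEMA_MERGED)
def a_to_merged(conn):
    ...  # apply whatever B contributed that A lacks

@migration(SCHEMA_MERGED, SCHEMA_A)
def merged_to_a(conn):
    ...  # undo B's contribution (mark lossy=True if it destroys data)

@migration(SCHEMA_B, SCHEMA_MERGED)
def b_to_merged(conn):
    ...  # apply whatever A contributed that B lacks

@migration(SCHEMA_MERGED, SCHEMA_B)
def merged_to_b(conn):
    ...  # undo A's contribution (mark lossy=True if it destroys data)
```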

It may also be useful to provide a mechanism to tag certain migrations as causing data loss (such as rolling back a column addition). The migration script would then prefer longer migration paths that preserve data over shorter migration paths that destroy it.
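One simple way to express that preference with the search above is to run it twice: exclude lossy edges on the first pass, and fall back to them only when no data-preserving path exists:

```python
def find_safest_path(src, dst):
    """Prefer any lossless path, however long, over a shorter lossy one."""
    try:
        return find_path(src, dst, allow_lossy=False)
    except ValueError:
        return find_path(src, dst, allow_lossy=True)
```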

There are some downsides to this approach. For one, migration functions will have names that mean little to humans, something like migrate_0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33_to_62cdb7020ff920e5aa642c3d4066950dd1f01f4d(). Another is that the migration script needs to construct an in-memory graph of every migration so that it can perform its search. If the only possible migration path contains a migration that will cause data loss, the script will exhaustively search the rest of the graph looking for an alternative. There’s probably some room for optimization there.

I will likely be coding up an MIT-licensed Python script to implement this algorithm in the coming days, so stay tuned.

7 Replies to “Database versioning and handling branching”

  1. Did you take a look at existing tools with similar approaches, like South (south.aeracode.org)? South is built around Django’s ORM, but it was the first to come to mind. Other similar projects exist; Entity Framework Code First Migrations come to mind second. Admittedly, most are tied primarily to specific ORMs, but the basic concepts apply outside of that. (Migrations as code artifacts seem to make the most sense in realms where code is already the primary database description, which seems to be why most migration tools are ORM-based.) The only non-ORM-specific tool I can think of is Inedo’s BuildMaster, which supports migration scripts in its deployment processes.

  2. Thanks for the suggestions, guys, but both Fluent Migrator and South suffer from the same problem — sequential revision numbering. This is not compatible with the concept of branching (“is the database on version 5 of my branch or your branch?”). The technique I describe in my post is designed specifically to handle frequent branching and merging.

  3. I think a better solution would be to keep different databases (e.g. in ./dbs/) and use something like django-database-url to automatically set up your configuration based on the environment’s DATABASE_URL.

    Finally, manage DATABASE_URL with Git’s post-checkout hook.

  4. Ryan, I’m not clear on how that would handle applying schema changes to an existing production database. It also sounds like you are advising that database storage files be kept not only alongside the application, but in the source code repository. Both of those are bad ideas, in my opinion.
