So you want Database Versioning?
Here at DoltHub, we've had a lot of success with our "So you want..." series of blog posts helping people find Dolt when they are looking for it. Dolt is a lot of things. Dolt is a version controlled database, a Git database, Git for data, data version control, an immutable database, and a decentralized database.
Dolt can also be used for database versioning. This blog will explain what database versioning is, why you should version your database, and finally explore the tool options for versioning your database, including but not exclusively, Dolt.
What is Database Versioning?
It makes sense to tackle the definitions of these two terms "database" and versioning" in reverse. Versioning is doing the heavy lifting here, as is befitting of the Vince McMahon meme below, so I'll work through that definition first. The database concepts explained are only interesting in the context of versioning so those will be explained second.
Versioning
Versioning means version control. Version control is a spectrum of features with the least complex being backup and restore. Critically, backup and restore allows you to roll back changes in the case of an error. Increasing in complexity versioning sometimes includes the ability to create a fork, a copy you intend to modify. The next step in versioning complexity is branching, or creating a fork you intend to merge back into the trunk or main copy. Even more complex is the ability to create a list of differences or diffs between two versions. Finally, full version control allows users to merge two versions together, producing conflicts if the two versions cannot be merged cleanly. Popular source code versioning tools that provide this functionality are Perforce Helix Core and Mercurial.
There is a flavor of all the above features that lets you do those actions in a decentralized way where each copy does not need to know about any other copies unless the copies are to be merged. This decentralized flavor of version control is how the most popular source code version control system, Git, works.
Database versioning is applying forks, branches, diffs, merges, and conflicts to a database, preferably in a decentralized way. But how would you apply versioning to a database?
Database
For the purposes of versioning, a database can be thought of as two separate concepts: schema and data. Schema is all the table definitions, views, constraints, triggers, and stored procedures. Data is the contents of all the tables. Usually data is large scale making it traditionally difficult and costly to version.
Schema
Until recently, database schema was all you could version. I had never heard versioning schema called database versioning or database version control until recently. The term I heard for for version controlling database schema was database migration. You start with a core schema, check that into version control, and then any schema changes you do, you apply as a patch to that schema, also checking that into version control.
There are a bunch of tools that help with various pieces of database migration. For this article, we focused on some of the most popular and interesting tools, Liquibase, Redgate Deploy, and Planetscale. We dive a bit deeper into what they do so you can get a clearer picture of the space.
Data
If you want your application to be aware of data versioning, you can create some semblance of version control with schema. The term of art for this is Slowly Changing Dimension. This is not true versioning. You concoct a schema using columns that indicate inactive or active or "active from" to "active to" dates. Version number or timestamp columns on rows can also be used. Then, your application uses these columns to construct present and historical views of the data. Your application never deletes, it soft deletes, just changes state from active to inactive. I don't think anyone would consider this true versioning as the application or an operator can still make a mistake and do really bad things. But, a number of databases are designed this way to give versioning features to their applications.
If you just want rollback, backups and transaction logs are supported by every database solution. Maybe you take a nightly backup and you keep transaction logs since the last backup. A mistake that requires using these backups is usually corrected in hours at best but most likely days. This Atlassian JIRA outage is case in point. True version control means rollback on the order of seconds.
We have SQL "as of" syntax. This is in the SQL standard as of 2011 (pun intended) and supported by Oracle, Microsoft SQL Server, and MariaDB. With this syntax you can configure a table to be versioned or a "temporal table" and then query a version using as of <timestamp>
. You don't get diffs and merges but you do get instant rollbacks.
We now have the technology to fully version large data! Noms laid the groundwork with a core storage format called a prolly tree, a content addressed binary tree. This data structure allows for fast diff and merge on binary trees, the core data structure in most databases, without compromising too much read or write performance. Combining prolly trees with a Merkle DAG allows data to be structurally shared across versions, preserving disk space. Recently, a couple new databases have emerged using this or similar technology, TerminusDB for graph or document databases and Dolt for relational databases.
Why version your database?
Now that your understand what database versioning is, the question is why would you ever need database versioning? I'll focus on four key reasons to use database versioning:
- Minimize downtime
- Improve developer productivity
- Reproducibility
- Expose Versioning to Your Application
Minimize downtime
Database versioning minimizes downtime caused by database changes. Database versioning allows you to roll back bad changes quickly. After rolling back, database diffs allow for developers to root cause the issue and build preventative measures to ensure the outage does not repeat. A good example of where database versioning would have been useful is this recent month long Atlassian JIRA outage. With database that outage would be measured in minutes, not weeks.
Improve developer productivity
Database versioning improves developer productivity. Database versioning allows you to share your database between development, staging, and production. Developers can make changes locally, test, and merge these changes through to production, much like they do for source code. Database versioning allows to you implement continuous integration and continuous deployment for you database.
Reproducibility
Moreover, if there is a bug in production involving the database, a developer can reproduce the bug locally using database versioning. Restore the database to the version where the error is occurring, attach a debugger to your application, and reproduce the bug. Reproducible bugs are much quicker to fix.
Many databases are used to produce Machine Learning models. Database versioning allows you to note the version of the database in the model training metadata. If you need to rebuild the model with slightly different parameters but the same data, you can use the version used to train the original model.
Expose Versioning to your Application
Finally, database versioning allows you to create versioned applications. Traditionally, as noted above, adding versioning to databases for applications uses a technique called Slowly Changing Dimension. Modern version controlled databases can replace slowly changing dimension, simplify your application code, grant the additional ability to render diffs, and give you the database versioning benefits above.
Database Versioning Options
With that introduction, let's consider our options. We broke the tools down into three categories: Database Migration Tools, Versioned Data Lakes, and Version Controlled Databases.
Database Migration Tools
Liquibase
- Tagline
- Version Control for Databases
- Initial Release
- April 2012
- GitHub
- https://github.com/liquibase/liquibase
Liquibase formalizes your database migration language with a configuration language for database schema and alterations. The examples in their documentation lean into XML but further digging say they support SQL, JSON, and YAML as well. Using something other than SQL makes Liquibase cross platform. You can use the same Liquibase XML descriptors to migrate from say, PostgreSQL to MySQL, which is nice.
The Liquibase magic is how it applies these changes to your database. Liquibase supports rollbacks and preview.
Liquibase has been around for a decade and enjoys high relevance in this category. But there's been a lot of innovation in this space, so keep reading and check out what's new. If you're looking for the leading database migration tool, Liquibase is for you.
Redgate Deploy
- Tagline
- Source Control Your Database
- Initial Release
- Earliest Mention 2014
- GitHub
- Not Open Source
Redgate deploy is for Microsoft SQL Server only. The main selling point is its integration with the Microsoft Developer ecosystem including Visual Studio. The approach is the same as Liquibase, define your schema and alterations and apply those in a principled way.
The one feature that seems cool from a versioning perspective is that Redgate deploy allows for configurable version control of "static table data". From the documentation, this feature is "intended only where the table has fewer than a thousand rows". If you want to version more than 1000 rows, keep reading.
Planetscale
- Tagline
- The database for developers
- Initial Release
- March 2016
- GitHub
- https://github.com/planetscale
Planetscale is awesome and takes database migrations to the next level. Imagine if someone ran the database for you and controlled how schema patches were applied. What would be possible?
Planetscale is run by the good folks who wrote Vitess. Vitess is an open source "database clustering system for horizontal scaling of MySQL". Dolt is a heavy user of Vitess' MySQL dialect parsing code. We wouldn't be here without them.
So, on Planetscale, you get all the schema branch/merge functionality of Liquibase or Redgate. Additionally you also get a world class, modern deployment environment for your changes. On the downside, if this is a downside for you, Planetscale runs your database for you. Planetscale is MySQL only so if you have some aversion to that database format, that's also a downside.
Note, for the release date, I went with the first release of Planetscale's fork of Vitess. I'm not exactly sure when Planetscale in its current form launched and my cursory research didn't turn it up.
Version Controlled Data Lakes
LakeFS
- Tagline
- "Git-like capabilities for your object storage"
- Initial Release
- Aug 3, 2020
- GitHub
- https://github.com/treeverse/lakeFS
LakeFS defines a new category: the data lake versioning. Data lakes are a relatively new term referring usually to unstructured or semi-structured data stored in large cloud storage systems like S3 and GCS. Data Lakes exist in contrast to Data Warehouses which are structured and SQL based.
LakeFS sits in front of your cloud storage and adds data versioning to the data in your lake. You get commits, branches, and rollback. Merge is supported but conflicts are detected at the file level and are "up to the user to resolve". User defined merge strategies are on the roadmap.
A file in this case can be quite large, more like a dataset than a single row of data. Data is shared at the file level between commits, but a new version of the data means a new version of the file so storage can grow quite big with multiple versions.
LakeFS is relatively new, launched in 2020, and the sponsoring company Treeverse is well funded. Expect more development from the company. We like what we see from a versioning perspective.
Version Controlled Databases
Terminus DB
- Tagline
- Making Data Collaboration Easy
- Initial Release
- October 2019
- GitHub
- https://github.com/terminusdb/terminusdb
TerminusDB has full schema and data versioning capability but offers a graph database interface using a custom query language called Web Object Query Language (WOQL). WOQL is schema optional. TerminusDB also has the option to query JSON directly, similar to MongoDB, giving users a more document database style interface.
The versioning syntax is exposed via TerminusDB Console or a command line interface. The versioning metaphors are similar to Git. You branch, push, and pull. See their how to documentation for more information.
TerminusDB is new but we like what we see. The company is very responsive, has an active Discord, and is well funded. If you think your database version control makes more sense in graph or document form, check them out.
Dolt
- Tagline
- It's Git for Data
- Initial Release
- August 2019
- GitHub
- https://github.com/dolthub/dolt
Dolt takes “Database Versioning” rather literally. Dolt implements the Git command line and associated operations on table rows instead of files. Data and schema are modified in the working set using SQL. When you want to permanently store a version of the working set, you make a commit. In SQL, Dolt implements Git read operations (ie. diff, log) as system tables and write operations (ie. commit, merge) as functions or stored procedures. Dolt produces cell-wise diffs and merges, making data debugging between versions tractable. That makes Dolt the only SQL database on the market that has branches and merges. You can run Dolt offline, treating data and schema like source code. Or you can run Dolt online, like you would PostgreSQL or MySQL.
As you can see "Database Versioning" is a vertical with a lot of innovation. We're a little biased but we think adding branch and merge to a SQL database in a decentralized way is true database versioning. If you're interested in discussing the space come hang out on our Discord.