So you Want Database Branches?
Here at DoltHub, we've published a series of "So you want..." articles to help people find Dolt, the world's first version controlled SQL database, when they are looking for it but just don't know it yet.
Today's article focuses on database branches. Many databases claim to support branches or branching. However, upon further inspection, the branching in some databases is just a copy. A copy is not a branch. What are the characteristics that distinguish a branch from a copy? Which databases really support branches? This article is here to help.
What is a Database?
Generally, there are two ways to store data on a computer. You can store data in files. Or, you can store data in databases. It is common for files to contain simple text which can be read by any text editor. Databases use files behind the scenes, but generally you access the data in databases through the database program, like MySQL or Postgres.
Databases offer faster and more scalable access to data via a query mechanism. If you want to find all the values in a 1,000,000,000 line comma separated value file with the third column greater than 3 and the fourth column less than 100, it may take a long time or require custom code. If that data is in a database table, that query will return almost instantly and can be made using the database's query language, usually SQL.
So, for today's article, we will consider only SQL databases. There are other types of databases, but SQL databases are by far the most popular. Within SQL there are Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) databases. We will consider both of those categories because there are databases claiming to have branches in both.
If you're looking for a graph and document database with branches and merges, check out TerminusDB and EdgeDB. Of the two, TerminusDB has more feature-rich branching.
What is a Branch?
Branching is the duplication of an object under version control...Branching generally implies the ability to later merge or integrate changes back onto the parent branch.
A branch is a copy of a set of objects (ie. files, tables) that you can diff against and merge into another set of objects. Branches form the core of asynchronous edit workflows. Two people can change a set of objects in isolation and come back and collaborate later via diff and merge.
Branching is a version control concept. Most version control systems, like Git or Perforce, work with files. Version controlled databases are a new concept. The database branch definition is still being formed.
Copy
To create a branch, a copy is required. A branch is a copy evolving in parallel to another copy.
Most version control systems provide "structural sharing" of underlying objects. Structural sharing means that some or all of the underlying objects in storage that are the same on multiple branches are stored only once. In other words, storage is shared across similar branches. Without structural sharing, branches require a lot of disk space because each branch is a full copy on disk. A branching approach with little structural sharing cannot scale to thousands of branches.
Beyond disk, often the amount of data that is shared between branches affects the time required to create a branch. For very large data, especially data laid out in a format that allows for fast retrieval and querying, creating a copy can take minutes or hours without any structural sharing.
How much data is shared between branches is a good dimension on which to compare databases claiming to have branches.
Merge
A branch without merge is a "fork". Again, from Wikipedia:
A branch not intended to be merged is usually called a fork.
Here at DoltHub, we've written about database forks. Many databases allow for forks or a streamlined copy workflow.
Merge usually involves conflict detection and resolution. If two branches change an object to different values, which value is correct? In most version control systems, this change is flagged as a conflict and a conflict resolution workflow is triggered. Once the conflict is resolved, the change can be merged.
So, in order for a copy to be a branch, merge is required. This becomes important when evaluating databases that claim to have branches because for some the branches cannot be merged. For databases that do support merge, what is the conflict detection and resolution strategy?
Diff
A prerequisite of merge is being able to compute the difference, or diff, between branches. Diffs allow users to identify changes between branches.
The granularity in which a version control system can provide diffs is generally the granularity the version control system can merge. For instance, if your version control system can only tell you which files are different between branches, not the lines in the files like most do today, the system can only merge at the file level. If two users change the same file, the merge will result in a conflict.
Again, the granularity of computed diffs is important when evaluating databases claiming to support branching. Can the database provide diffs of schema? Can the database provide diffs of data? If the version controlled database supports data diffs, can it produce cell-wise (ie, row, column pair) diffs of the data?
What are Database Branches For?
Database branches are used for a number of different use cases. Here at DoltHub, we've had a database in market, Dolt, for over five years with true branch and merge support. We've seen a number of use cases for database branches used in production. I'll walk through them here.
Testing Database Changes in Isolation
Most of the documentation for the other databases listed later in this article focus on testing changes, usually for schema migrations. Here at DoltHub, our customers do this but it's more likely to be part of a broader branching strategy and workflow to achieve a specific use case.
Data Sharing
Dolt was conceived as a data sharing format. Like GitHub, Dolt has a place to share Dolt databases called DoltHub. Users share a number of open databases and collaborate using branches and forks, similar to open source on GitHub. The most popular databases on DoltHub are the US stock market databases maintained by post-no-preference
.
Machine Learning Training Data
Database branches are useful for Machine Learning training data. Branches are used for model collaboration, reproducibility, and explainability. Dolt is used by Flock Safety and Turbine to improve their machine learning model development workflows.
Version Controlling Configuration
In some software applications, most notably games, configuration can become very large. It's not uncommon for game configuration to reach hundreds of gigabytes. This configuration becomes unwieldy to manage in Git files. Companies search for an alternative and decide a version controlled database is a better place to store this volume of configuration. Branches are used similarly to Git for collaboration. Scorewarrior uses Dolt in this way.
Adding Branches to Your Application
Finally, branches might be useful to the users of your application. Traditionally, adding a branching workflow to your application would involve implementing it in your data model using slowly changing dimension. A database with native branch support can be used to add branch support much more quickly. Nautobot and Threekit chose to add branches to their application in this way.
Databases claiming to have Branches
For a database to support branches, it must be able to create a light weight copy, allow the user to make changes, compute the differences caused by those changes, and then merge those changes back into another branch. Which databases support all these features on schema and data? Let's start with the full set of SQL databases claiming to have branches and then filter down the set.
Neon
- Tagline
- Serverless Postgres
- Initial Release
- June 2021
- GitHub
- https://github.com/neondatabase/neon
- SQL Flavor
- Postgres
Neon is a very popular open source Postgres variant. It is an alternative to AWS Aurora Postgres. Neon separates storage and compute and substitutes the PostgreSQL storage layer by redistributing data across a cluster of nodes.
Neon claims to support branches but there is no mention of merges in their branch documentation. All the text and visuals suggest branches are point in time copies and are not designed to merged. Diff between branches is not supported.
Verdict: Neon branches are forks.
Turso
- Tagline
- SQLite for Production
- Initial Release
- Nov 2022
- GitHub
- https://github.com/tursodatabase/libsql
- SQL Flavor
- SQLite
Turso is a fairly new SQLite variant. Turso adds a server component on top of SQLite's file-based engine. This server can be replicated.
Turso claims to support branches in their documentation. However, upon further inspection, diff and merge are not supported.
Database branches are completely separate from the original database. This means that you need to handle merging any schema changes or data manually using a migration tool.
Verdict: Turso branches are forks.
Databases with Branch and Merge of Tables
LakeFS
- Tagline
- Scalable Data Version Control
- Initial Release
- August 2020
- GitHub
- https://github.com/treeverse/lakeFS
- SQL Flavor
- Apache Hive, Spark SQL
LakeFS is a version control extension for data lakes. It sits on top of many open formats like Parquet or CSV and adds version control functionality to those formats. You get SQL via Apache Hive or Spark SQL.
LakeFS claims branches and merges. Upon deeper inspection, this claim is true.
LakeFS branches employ "zero copy branching". LakeFS stores a pointer to a commit and then stores a set of uncommitted changes. It's not clear from the documentation how these changes are stored. Given other documentation, I suspect a new version of any changed file is stored, meaning structural sharing is done at the unchanged file level.
Merge is handled at the file level, which in these formats is more like a table. Conflict resolution is a simple take ours or take theirs at the file level.
Diff between branches is supported via the Iceberg plugin.
Verdict: LakeFS branches and merges tables.
Databases with Branch and Merge of Schema
Planetscale
- Tagline
- The database for developers
- Initial Release
- March 2016
- GitHub
- https://github.com/planetscale
- SQL Flavor
- MySQL
Planetscale is run by the good folks who wrote Vitess. Vitess is an open source "database clustering system for horizontal scaling of MySQL". Dolt is a heavy user of Vitess' MySQL dialect parsing code. We wouldn't be here without them.
Like Neon, Planetscale is an open source competitor to AWS Aurora but instead of competing with the Postgres variant, Planetscale competes with AWS Aurora MySQL. Planetscale separates storage and compute and substitutes the MySQL storage layer by redistributing data across a cluster of nodes. On top of that, Planetscale provides a bevy of developer tools that make managing your database easier.
PlanetScale supports branches and merges of schema. Planetscale also has data branches but these cannot be merged.
PlanetScale does not provide data syncing between a production branch and a development branch.
The Planetscale documentation suggests that branches are a full copy, "only the schema is copied". With the data branching feature you can also copy data but there is no structural sharing between branches.
Planetscale has very sophisticated three-way merge of branches.
Planetscale offers schema diff but I could not find support for data diffing between branches.
Verdict: Planetscale supports schema branch and merge. Planetscale data branches are forks.
Databases with True Branch and Merge
Dolt
- Tagline
- Git for Data
- Initial Release
- August 2019
- GitHub
- https://github.com/dolthub/dolt
- SQL Flavor
- MySQL, [Postgres](https://github.com/dolthub/doltgresql) (Alpha)
Dolt is built from the ground up on top of a novel data structure called a Prolly Tree to add true branch, diff, and three-way merge to the SQL database. Prolly Tress allow Dolt's tables to be content-addressed like files in Git. The architecture that made Git so popular for source code is now available on your database tables.
Dolt's architecture allows structural sharing of data between branches. Prolly trees chunk database tables up into 4 kilobyte sections so any 4 kilobyte section that is the same on any branch or version is only stored once.
Dolt supports merge of both data and schema. Dolt data merges are possible because Prolly Trees produce data diffs fast.
Dolt supports full cell-wise diff of data as well as schema diff. Again, Dolt produces diffs fast because of the Prolly Tree-based storage engine.
Verdict: Dolt supports true database branch and merge
Conclusion
We're a bit biased but we think Dolt is the only database with branches. Curious to learn more? Stop by our Discord and let's chat.