Three-way Merge in a SQL Database

June 19, 2024

24 min read

Dolt implements a Git-style commit graph on top of a Prolly-tree-based storage engine. This architecture makes Dolt the world's first and only version controlled SQL database.

One of the major capabilities a version controlled SQL database provides is three-way merge. What is three-way merge and how does it work in a SQL database? This blog explains.

Note, the first section explains what three-way merge is and how it works for text files. It will be a review for some. If you're confident you understand three-way merge, you can skip to the Three-way Merge in a SQL database section.

What is Three-way Merge?

Merging is a fundamental concept of version control. Merging is the act of reconciling multiple versions' changes into a single version. Brian makes changes to a file. Aaron makes changes to the same file. If Brian and Aaron try to combine their changes, what should the file look like? This is merge.

Two-way Merge

The easiest way to understand merge is through example. Let's merge two lists using two-way merge and three-way merge to show the difference. Until recently, version control was constrained to sets of files so you can imagine these lists as checked into a version control system in the same file, databases.txt.

Version 1:

Oracle
Postgres
SQL Server
MySQL
Dolt

Version 2:

Oracle
PostgreSQL
Microsoft SQL Server
MySQL

Two-way merge looks at a raw set of differences, or diff, of the two files. Two-way merge forces the user to choose which set of changes he or she wants in the resulting merged file.

Oracle
< Postgres
> PostgreSQL
< SQL Server
> Microsoft SQL Server
MySQL
- Dolt

In this case, the user must make three decisions. Should the resulting file have Postgres, PostgreSQL, or both? Should it have SQL Server, Microsoft SQL Server, or both? Should Dolt be included or not?

Three-way Merge

What if we knew the original list in databases.txt that both changes came from? How would that help our merge?

Let's say the Original copy looked like so:

Oracle
Postgres
SQL Server

Thus, the diff between Original and Version 1 (i.e. diff 1) would be:

Oracle
Postgres
SQL Server
+ MySQL
+ Dolt

And the diff between Original and Version 2 (i.e. diff 2) would be:

Oracle
< Postgres
> PostgreSQL
< SQL Server
> Microsoft SQL Server

So, we know one user added MySQL and Dolt and the other user changed Postgres and SQL Server to their more formal names, PostgreSQL and Microsoft SQL Server. The two users didn't modify the same information, in this case lines, so we're pretty sure we need no more information to process the merge! We take the two additions from diff 1 and we combine those with the modifications in diff 2 resulting in the following merge.

Oracle
PostgreSQL
Microsoft SQL Server
MySQL
Dolt

This is the power of three-way merge. Three-way merge reduces the number of decisions the merger must make to process the merge.

Conflicts

In the two-way merge example, the user had to make three "decisions" to process the merge. In version control parlance, a decision a user has to make to process a merge is called "resolving a conflict". Thus, a "conflict" is a pair of changes that cannot be resolved automatically via the merge algorithm.

Conveniently, in the contrived example above the three-way merge could proceed with zero conflicts, but three-way merges can result in conflicts. Let's say the original version of databases.txt was not:

Oracle
Postgres
SQL Server

and instead was a list produced at Oracle HQ:

Oracle
MySQL

The resulting diffs would be:

Oracle
+ Postgres
+ SQL Server
MySQL
+ Dolt

And:

Oracle
+ PostgreSQL
+ Microsoft SQL Server
MySQL

Now we have two conflicts. Should it be Postgres, PostgreSQL, or both? Should it be SQL Server, Microsoft SQL Server, or both? The addition of Dolt can still be processed automatically because it only exists on one side of the merge. Your version control system (ie. Git) will spit out something like:

Oracle
<<<<<<< version1
Postgres
SQL Server
=======
PostgreSQL
Microsoft SQL Server
>>>>>>> version2
MySQL
Dolt

To proceed with this merge, you must delete <<<<<<<, =======, and >>>>>>> and make the file look like you want. A valid merge could be:

Oracle
Postgres
SQL Server
MySQL
Dolt

Notice how this is different than the result created by our first three-way merge? This is because both users edited the file at the same time and the merging user must decide whether they want short names, formal names, or both. I like short names and it seems so does the merger. Merges with conflicts are not deterministic, they require user input to complete.

Branches

You need a user interface to tell your version control system what to merge. This user interface is branches. Branches are parallel, evolving copies of your files. Branches exist to be merged.

In the canonical Git version control branch setup, there is a main branch and users make a branch off main to make changes. When their changes are ready, they merge their changes back into main and delete their branch. There are other branch setups but this is the most popular.

Let's show off the merge example from the Conflicts section using Git and branches so you can see how branches allow you to organize changes for merge. I included some handy pictures to help you visualize what is going on behind the scenes in the Git commit graph.

First, let's set up our Git repository and make an initial commit.

$ mkdir show-off-merge
$ cd show-off-merge
$ git init
Initialized empty Git repository in /Users/timsehn/dolthub/git/show-off-merge/.git/
$ touch databases.txt
$ emacs databases.txt
$ cat databases.txt
Oracle
MySQL
$ git add databases.txt
$ git commit -m "Original databases list"
[main (root-commit) 35d85e5] Original databases list
 1 file changed, 2 insertions(+)
 create mode 100644 databases.txt
$ git branch
* main
$

This makes a very simple, single node commit graph.

Now, we'll make two branches, one for each set of edits we'll make to databases.txt.

$ git branch add-databases-informal
$ git branch add-databases-formal
$ git branch
  add-databases-formal
  add-databases-informal
* main

This makes the following commit graph. Note, branches are pointers to commits. In this case, all three branches point to the same commit.

Now, we'll checkout the add-databases-informal branch and make our changes.

$ git checkout add-databases-informal
Switched to branch 'add-databases-informal'
$ emacs databases.txt
$ cat databases.txt
Oracle
Postgres
SQL Server
MySQL
Dolt
$ git commit -am "Add databases. Informal names."
[add-databases-informal f9f959a] Add databases. Informal names.
 1 file changed, 3 insertions(+)

This makes a new HEAD commit of the add-databases-informal branch.

And then we'll checkout the add-databases-formal branch and make our changes.

$ git checkout add-databases-formal
Switched to branch 'add-databases-formal'
$ emacs databases.txt
$ cat databases.txt
Oracle
PostgreSQL
Microsoft SQL Server
MySQL
$ git commit -am "Add databases. Formal names."
[add-databases-formal adb8b25] Add databases. Formal names.
 1 file changed, 2 insertions(+)

This makes a new HEAD commit of the add-databases-formal branch.

Now we will checkout the main branch and merge all our branches together. First, the add-database-informal branch.

$ git checkout main
Switched to branch 'main'
$ git merge add-databases-informal
Updating 35d85e5..f9f959a
Fast-forward
 databases.txt | 3 +++
 1 file changed, 3 insertions(+)
 $ git branch -d add-databases-informal

Note, the first merge is what is called a "fast-forward merge" because there are no parallel changes to consider. The change is just appended to the end of main.

Now we merge the add-databases-formal branch. This is an actual three-way merge.

$ git merge add-databases-formal
Auto-merging databases.txt
CONFLICT (content): Merge conflict in databases.txt
Automatic merge failed; fix conflicts and then commit the result.
$ cat databases.txt
Oracle
<<<<<<< HEAD
Postgres
SQL Server
=======
PostgreSQL
Microsoft SQL Server
>>>>>>> add-databases-formal
MySQL
Dolt

And voila, we were able to reproduce our conflict we showed off in the conflicts section. Let's resolve the conflict and finish the merge.

$ emacs databases.txt
$ cat databases.txt
Oracle
Postgres
SQL Server
MySQL
Dolt
$ git add databases.txt
$ git commit -am "Finish the merge by choosing the informal database names"
[main e275a4a] Finish the merge by choosing the informal database names
$ git branch -d add-databases-formal

You can see in the log the last commit is denoted as a merge that contains two parents.

$ git log
commit e275a4a01d14c5dbb36f0d8e0f6cc6c44f0dfadf (HEAD -> main)
Merge: f9f959a adb8b25
Author: Tim Sehn <tim@dolthub.com>
Date:   Tue Jun 4 13:45:26 2024 -0700

    Finish the merge by choosing the informal database names

commit adb8b25e37301587c8c73d1b436cfc4cdd2e72db (add-databases-formal)
Author: Tim Sehn <tim@dolthub.com>
Date:   Tue Jun 4 13:41:02 2024 -0700

    Add databases. Formal names.

commit f9f959ac846d4f7ab9276b67ece7abe6652c5123 (add-databases-informal)
Author: Tim Sehn <tim@dolthub.com>
Date:   Tue Jun 4 13:39:41 2024 -0700

    Add databases. Informal names.

commit 35d85e5fb600ff341bd4c31c986aa2bbce999463
Author: Tim Sehn <tim@dolthub.com>
Date:   Tue Jun 4 13:35:34 2024 -0700

    Original databases list

This creates the following commit graph with the merge commit pointing to two parents.

As you can see, branches allow you to organize multiple edits and are the basis for merges.

How Three-way Merge Works

Three-way merge of files is a three step process:

Find the merge base
Perform the merge
Resolve conflicts

There are a couple of core algorithms at work that allow merge to work at scale. First, there is an algorithm to find the merge base in the commit graph. Second, there is the actual merge algorithm. Resolving conflicts is up to the user.

Find the Merge Base

So, how do you identify the original databases.txt that the new versions came from? You find the closest common ancestor or "merge base". There's even a git command to tell you the merge base of two commits.

$ git merge-base -a adb8b25e37301587c8c73d1b436cfc4cdd2e72db f9f959ac846d4f7ab9276b67ece7abe6652c5123
35d85e5fb600ff341bd4c31c986aa2bbce999463

In our example, the merge base is quite easy, it's the initial commit, the parent commit of both our new commits. But finding the merge base can be quite complicated.

With commits laid out in a graph, a "leet code style" algorithm called lowest common ancestor is used to find the merge base. Unless you're willing to pre-process the graph a brute force approach is needed. You walk the graph in breadth-first order from the into and from commits until you find a commit in the other's list. Unless your commit graph is very large and your merge base is far away, finding the merge base is quite fast.

However, if your commit graph is large and your branch histories have diverged substantially, finding the merge base can be a compute intensive operation. Git started pre-processing the graph in 2.18 to speed up merge base calculation. Dolt pre-processes the graph using a commit closure to speed up merge base calculation.

There is a cross merge case where you have two nearest common ancestors in a three-way merge. These are kind of hard to make in practice but can happen. In this case, Git uses a recursive three-way merge algorithm to handle this case, where the two nearest common ancestors are themselves merged to form a nearest common ancestor on which to run the merge algorithm.

`diff3` Merge Algorithm

Now that you have a merge base, you must merge the actual target files. This is done using an algorithm called diff3. diff3 has been in Unix since 1979. The algorithm is fairly complicated and uses longest-possible sequences to determine merge-able chunks.

The fundamental observation of three-way merge is that if we look at all the changes between merge base and into and all the changes between merge base and from, that is the set of changes that should appear in the value that results from the merge of into and from. In other words, apply both sets of changes to the merge base.

Three-way Merge in a SQL Database

Now that you know what a three-way merge is and have seen an example in files, did you know that you can also three-way merge a SQL database?

Transactions

Most SQL databases have had a primitive three-way merge for a long time: transactions. Transactions are used to manage concurrent edits to the same tables. Clients open a transaction using START TRANSACTION, execute SQL, and then close the transaction with COMMIT. If there are any concurrent edits to the same data, the first COMMIT wins and subsequent COMMITs fail and a ROLLBACK is issued. This means conflicting writes in separate transactions fail.

In most SQL databases, transaction concurrency is accomplished using row-level locking. When a transaction modifies a row, it acquires a lock on that row. If a concurrent transaction encounters a locked row, the transaction either queues or fails, depending on the database.

The are some caveats. Some databases like older versions MySQL don't support data description layer (DDL) transactions. MySQL 8.4 did launch support for Atomic Data Definition Statements.

Transactions are meant to be short-lived. With long-lived transactions, the chances of a conflicting write increases. The number of queued transactions can grow beyond your databases ability to hold them all. If you modify a lot of data in a long-lived transaction, you can essentially lock your whole database to reads and writes.

For very long-lived transactions we now have an alternative, Git-style three-way merge. Database version control now exists including three-way merge.

Prerequisites

In order to do three-way merge on a SQL database, your database must:

Be stored in a commit graph of versions
Produce the differences (ie. diff) between two versions fast

Commit graph

You first need to be able to represent your database versions in a commit graph, including branches. Commit graphs rely on the objects stored in them being content-addressed. A content-address is a value that summarizes the data stored within an object into a simple, smaller value, usually a hash. Two databases containing the same data must produce the same content-address.

Content-addressing a database is no small feat. You need the same database to produce the same content-address, no matter what order it was built in, called history independence. In Dolt, content-addressing is achieved using a custom storage engine based on a novel data structure called a Prolly Tree.

Fast diff

Computing diffs on large files can be very compute intensive. In Git, files greater than 100MB are known to cause performance issues. This is why you are prevented from pushing these files to GitHub and why Git LFS exists.

Databases are often much larger than 100MB. In order to compute diffs on large databases, a custom storage engine is required. In traditional B-tree based storage engines, computing diffs scales with the size of the table.

To get around this limitation, Dolt again uses Prolly Trees. Not only do Prolly Trees provide the history independence required to store databases in a commit graph, but they can also be diffed with costs proportional to the size of the differences, rather than to the size of the table. They do this while still providing B-tree-like seek performance. Prolly Trees are a purpose-built data structure for database version control. Dolt is the only SQL database built on Prolly Trees.

Merge Process

Three-way merge in a SQL database is a six step process whereas three-way merge of files is a three step process.

Find the Merge Base
Perform Schema Merge
Resolve any Schema Conflicts
Perform Data Merge
Resolve any Data Conflicts
Resolve any Constraint Violations

Now, let's dive into each step in order.

1. Find the Merge Base

The first step in the merge process is to identify the merge base.

Dolt uses a similar algorithm to the merge base algorithm described above to find the merge base. Dolt uses a pre-computed commit closure to speed up the process. Dolt's use of commit closures is worth a whole blog in itself. Aaron's launch blog does a pretty good job explaining their motivation, how they work, and the merge performance improvements they garner.

Dolt does not support multiple merge bases in the cross merge case. In this event, Dolt will just fail to merge. To get around it, you can merge the two merge bases manually.

2. Perform Schema Merge

Next, we must merge the schemas of both versions. Data merge cannot be performed until the schemas of both databases match.

Schema is generally small compared to data. How large could a table definition possibly be? A megabyte scale table definition seems unreasonable. Thus, schema merge is computationally light weight. The diffs to process are generally small.

However, schema merge is logically complex. There are a bunch of corner cases to handle. Schema is merged according to the following set of rules for tables, columns, indexes, foreign keys, and check constraints. If the "Mergeable" column says "Yes" that means the merge processes without conflict and produces a reasonable result. For instance, if two branches are merged that add two different columns to the same base table, the resulting merged table has both additional columns.

Tables

Left Branch	Right Branch	Caveat	Mergeable
Add t1	Add t1	Same Schema	Yes
Add t1	Add t1	Different Schema	Schema Conflict
Delete t1	Delete t1		Yes
Modify t1	Delete t1		Schema Conflict
Modify t1	Modify t1	Same Schema	Yes
Modify t1	Modify t1	Different Schema	Schema Conflict

Columns

Left Branch	Right Branch	Caveat	Mergeable
Delete c1	Delete c1		Yes
Modify c1	Delete c1		Schema Conflict
Add c1	Add c1	Same Type, Same Constraints, Same Data	Yes
Add c1	Add c1	Different Type	Schema Conflict
Add c1	Add c1	Same Type, Different Constraints	Schema Conflict
Add c1	Add c1	Same Type, Same Constraints, Different Data	Data Conflict
Modify c1	Modify c1	Same Type, Same Constraints, Same Data	Yes
Modify c1	Modify c1	Incompatible Type Change	Schema Conflict
Modify c1	Modify c1	Compatible Type Change	Yes
Modify c1	Modify c1	Same Type, Different Constraints	Schema Conflict
Modify c1	Modify c1	Same Type, Same Constraints, Different Data	Data Conflict

Foreign Keys

Left Branch	Right Branch	Caveat	Mergeable
Add fk1	Add fk1	Same definition	Yes
Add fk1	Add fk1	Different definition	Schema Conflict
Delete t1	Delete t1		Yes
Modify fk1	Delete fk1		Schema Conflict
Modify fk1	Modify fk1	Same definition	Yes
Modify fk1	Modify fk1	Different definition	Schema Conflict

Indexes

Left Branch	Right Branch	Caveat	Mergeable
Add i1	Add i1	Same definition	Yes
Add i1	Add i1	Different definition	Schema Conflict
Delete i1	Delete i1		Yes
Modify i1	Delete i1		Schema Conflict
Modify i1	Modify i1	Same definition	Yes
Modify i1	Modify i1	Different definition	Schema Conflict

Check Constraints

Left Branch	Right Branch	Caveat	Mergeable
Add ck1	Add ck1	Same definition	Yes
Add ck1	Add ck1	Different definition	Schema Conflict
Delete ck1	Delete ck1		Yes
Modify ck1	Delete ck1		Schema Conflict
Modify ck1	Modify ck1	Same definition	Yes
Modify ck1	Modify ck1	Different definition	Schema Conflict

3. Resolve any Schema Conflicts

In the event of a schema conflict, the user must be presented with a user interface to view and resolve conflicts. In Dolt, this interface is influenced heavily from Git but adapted to SQL, the interface you have available in a SQL database.

The dolt_status system table tells you which tables have conflicts and whether they are schema or data conflicts. An example table looks like:

three_way_merge> SELECT * FROM dolt_status;
+------------+--------+-----------------+
| table_name | staged | status          |
+------------+--------+-----------------+
| people     | false  | schema conflict |
+------------+--------+-----------------+

Then there is the dolt_schema_conflicts table that describes the conflict.

three_way_merge> SELECT table_name, description, base_schema, our_schema, their_schema FROM dolt_schema_conflicts;
+------------+----------------------------------------------------------------------+-------------------------------------------------------------------+-------------------------------------------------------------------+-------------------------------------------------------------------+
| table_name | description                                                          | base_schema                                                       | our_schema                                                        | their_schema                                                      |
+------------+----------------------------------------------------------------------+-------------------------------------------------------------------+-------------------------------------------------------------------+-------------------------------------------------------------------+
| people     | different column definitions for our column age and their column age | CREATE TABLE `people` (                                           | CREATE TABLE `people` (                                           | CREATE TABLE `people` (                                           |
|            |                                                                      |   `id` int NOT NULL,                                              |   `id` int NOT NULL,                                              |   `id` int NOT NULL,                                              |
|            |                                                                      |   `last_name` varchar(120),                                       |   `last_name` varchar(120),                                       |   `last_name` varchar(120),                                       |
|            |                                                                      |   `first_name` varchar(120),                                      |   `first_name` varchar(120),                                      |   `first_name` varchar(120),                                      |
|            |                                                                      |   `birthday` datetime(6),                                         |   `birthday` datetime(6),                                         |   `birthday` datetime(6),                                         |
|            |                                                                      |   `age` int DEFAULT '0',                                          |   `age` float,                                                    |   `age` bigint,                                                   |
|            |                                                                      |   PRIMARY KEY (`id`)                                              |   PRIMARY KEY (`id`)                                              |   PRIMARY KEY (`id`)                                              |
|            |                                                                      | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin; | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin; | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin; |
+------------+----------------------------------------------------------------------+-------------------------------------------------------------------+-------------------------------------------------------------------+-------------------------------------------------------------------+

And finally, we have a convenience procedure called dolt_conflicts_resolve to take ours or theirs side of the merge and resolve the conflict. Otherwise, the user must modify the schema to the desired merge state and delete the row from dolt_schema_conflicts to resolve the conflict.

To complete the merge, the user issues a call to the dolt_commit() procedure to commit the merge.

4. Perform Data Merge

After the schemas of the tables are merged, it is time to merge the data in the tables.

Merging large tables is computationally complex. Thus, you need table data stored in a data structure that can be diffed quickly. Prolly Trees fit the bill for Dolt. Prolly Trees are a recent invention, pioneered by the Noms team in 2015-2018. Thus, Dolt is the first SQL database to provide three-way merge on data in tables.

Unlike schema merge, data merge is logically simple. Data is identified by primary key across versions. If two branches modify different keys, the changes can be safely merged. If two branches modify the same key, a further diff is performed. If the two branches modify different columns in the row, the changes can be safely merged. If the two branches modify the same [primary key, column] pair to the same value, that is the merged value. However, if two branches modify the same [primary key, column] pair to a different value a data conflict is thrown.

A special case in the above rule is JSON-typed columns. An additional merge check is performed if two branches modified a JSON column identified by primary key to a different value. JSON-typed columns safely merge if the changes on each branch did not modify the same key in the stored JSON. Thus, two branches can be merged they modify the same JSON object, identified by primary key, but don't modify the same keys in the JSON itself.

Algorithmically, data merge is processed using a slightly simpler algorithm than described in diff3 Merge Algorithm section because primary keys are substituted for longest possible sequences as cross version identifiers.

There is also an additional conflict check if two primary keys are modified to satisfy the different column edits "mergeable" condition. So, you walk through the primary keys of the into and from sets, and process each according to the following rules:

Modified row is the same in into and from

If into.primarykey == from.primarykey and into.value == from.value, merge.primarykey = into.primarykey.

Result: Merged row is the value of into

Modified row is different in into and from

If into.primarykey == from.primarykey and into.value != from.value, inspect columns. Conflict if edit on the same column to different values. Otherwise, merge.primarykey.values = merged.values.

Result: Conflict if the edits are to the same column and different values. Otherwise, merge edits to columns. Merged row is set to the merged values in columns.

New or deleted row in into

If into.primarykey != from.primarykey, merge.primarykey = into.primarykey.

Result: Add or delete into row in merged set

New or deleted row in from

If from.primarykey != into.primarykey, merge.primarykey = from.primarykey

Result: Add or delete from row in merged set

This may seem a bit complicated but the results are intuitive as you will see later in an example.

5. Resolve any Data Conflicts

In the event of a data conflict, the user must be presented with a user interface to view and resolve conflicts. Again, this interface is inspired by Git but adapted to SQL. The interface is very similar to the one described in the Schema Conflicts section.

The dolt_status system table tells you which tables have conflicts and whether they are schema or data conflicts. An example table looks like:

three_way_merge> SELECT * FROM dolt_status;
+------------+--------+-----------------+
| table_name | staged | status          |
+------------+--------+-----------------+
| people     | false  | data conflict   |
+------------+--------+-----------------+

Two additional tables can be used to explore and resolve conflicts. The dolt_conflicts table tells you the number of conflicts in each table. The dolt_conflicts_<tablename> table lists the individual rows in conflict for the table <tablename>.

SELECT * FROM dolt_conflicts;
+--------+---------------+
| table  | num_conflicts |
+--------+---------------+
| people |             3 |
+--------+---------------+
SELECT * FROM dolt_conflicts_people;
+-----------------+----------------+---------+-----------------+----------+----------------+---------------+--------+----------------+---------+------------------+-----------------+----------+------------------+-----------+
| base_occupation | base_last_name | base_id | base_first_name | base_age | our_occupation | our_last_name | our_id | our_first_name | our_age | their_occupation | their_last_name | their_id | their_first_name | their_age |
+-----------------+----------------+---------+-----------------+----------+----------------+---------------+--------+----------------+---------+------------------+-----------------+----------+------------------+-----------+
| Homemaker       | Simpson        |       1 | Marge           |       37 | Homemaker      | Simpson       |      1 | Marge          |      36 | NULL             | NULL            |     NULL | NULL             |      NULL |
| Bartender       | Szyslak        |       2 | Moe             |     NULL | Bartender      | Szyslak       |      2 | Moe            |      62 | Bartender        | Szyslak         |        2 | Moe              |        60 |
| NULL            | NULL           |    NULL | NULL            |     NULL | Student        | Simpson       |      3 | Bart           |      10 | Student          | Simpson         |        3 | Lisa             |         8 |
+-----------------+----------------+---------+-----------------+----------+----------------+---------------+--------+----------------+---------+------------------+-----------------+----------+------------------+-----------+

Again, the convenience procedure called dolt_conflicts_resolve can be used to take the ours or theirs side of the merge and resolve the conflict. Otherwise, the user must modify the conflicting row to the desired merge state and delete the row from dolt_conflicts_<tablename> to resolve the conflict.

To complete the merge, the user issues a call to the dolt_commit() procedure to commit the merge.

6. Resolve any Constraint Violations

Adding even more logical complexity are foreign keys. Data merges that involve foreign keys may not be able to be resolved at merge time because multiple tables are involved. This is best seen through an illustrative example.

Consider the following schema:

mysql> create table employees (
    id int,
    last_name varchar(100),
    first_name varchar(100),
    age int,
    primary key(id),
    constraint over_18 check (age >= 18));
mysql> show create table employees;
+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table     | Create Table                                                                                                                                                                                                                                              |
+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| employees | CREATE TABLE `employees` (
                `id` int NOT NULL,
                `last_name` varchar(100),
                `first_name` varchar(100),
                `age` int,
                PRIMARY KEY (`id`),
                CONSTRAINT `over_18` CHECK ((`age` >= 18))
              ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin |
+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
mysql> create table pay (id int,
    salary int,
    primary key(id),
    foreign key (id) references employees(id));
mysql> show create table pay;
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                                                                                              |
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| pay   | CREATE TABLE `pay` (
           `id` int NOT NULL,
           `salary` int,
           PRIMARY KEY (`id`),
           CONSTRAINT `kfkov1vc` FOREIGN KEY (`id`) REFERENCES `employees` (`id`)
         ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin |
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

We have three tables, two of which reference a parent table via foreign key. We then delete a row in the parent table on one branch and insert a value referencing the deleted row on another branch.

mysql> insert into employees values (0, 'Smith', 'Ella', 34), (1, 'Baker', 'Jack', 27);
mysql> insert into pay values (0, 50000);
mysql> call dolt_commit('-am', "Data for foreign key doc");
+----------------------------------+
| hash                             |
+----------------------------------+
| kgjb1tdbqt3vsn2e3nv06n5a6jdaqtk8 |
+----------------------------------+
mysql> call dolt_checkout('-b', 'delete-parent');
+--------+
| status |
+--------+
| 0      |
+--------+
mysql> delete from employees where id=1;
mysql> call dolt_commit('-am', "Deleted Jack Baker, id=1");
+----------------------------------+
| hash                             |
+----------------------------------+
| pd8r1j7or0aonincnc8iutsdjqnkmtsb |
+----------------------------------+
mysql> call dolt_checkout('main');
+--------+
| status |
+--------+
| 0      |
+--------+
mysql> insert into pay values (1, 48000);
mysql> call dolt_commit('-am', "Added salary for Jack Baker id=1");
+----------------------------------+
| hash                             |
+----------------------------------+
| 44h9p2k59o59rc1lcenkg4dghe052um0 |
+----------------------------------+

Finally, we attempt to merge and because the resulting merge has constraint violations we cannot commit the merge.

mysql> call dolt_merge('delete-parent');
Error 1105: Constraint violation from merge detected, cannot commit transaction. Constraint violations from a merge must be resolved using the dolt_constraint_violations table before committing a transaction. To commit transactions with constraint violations set @@dolt_force_transaction_commit=1

To resolve this error, the constraint violation must be resolved manually before the merge can proceed. This means making sure all foreign key checks satisfy the requisite constraints. Similarly to the conflicts system tables, you can view the number of violated constraints in the dolt_constraint_violations system table and the actual constraint violations in the dolt_constraint_violations_<tablename> system tables. In this case, you must either add back the "Jack Baker" row or delete the salary for the "Jack Baker" reference.

SQL Transactions in Dolt

I'll dive into one more merge nuance before I get into an illustrative example, standard SQL transactions in Dolt. As we discussed earlier, transactions in standard SQL databases implement a basic three-way merge. How are SQL transactions handled in Dolt?

Dolt uses the same logic to perform Commit Graph merges as it does to merge standard SQL transactions. You can think of a SQL transaction in Dolt performing the above merge steps on the proposed working set to the current working set with concurrent transactions resolving in some order. If a conflict is encountered, a ROLLBACK statement is issued for the first transaction that conflicted.

This is a different locking and merge technique than standard SQL databases which use locks to provide concurrency and conflict detection. This can result in Dolt producing different results than you are used to under concurrency.

Because SQL transactions and Dolt merges share the same transaction merge code, interactive Dolt merges must be performed in a single transaction in the SQL context. This means you must turn off AUTOCOMMIT and begin a transaction with START TRANSACTION to perform an interactive merge. You then call the dolt_merge() procedure and you can proceed through the six merge phases. The dolt_commit() call that completes the merge implicitly commits the transaction you started with START TRANSACTION. If you have AUTOCOMMIT enabled, a dolt_merge() call that encounters a conflict is automatically rolled back just like a standard SQL transaction.

Example

Let's walk through a schema and data merge example with no conflicts to illustrate the above concepts and user interface. I'm using the Dolt command line because that ignores SQL context complexity.

First to set up, we create a database table and insert a couple rows. Finally, we make a commit. This will be used as our merge base.

$ mkdir threeway_merge
$ cd threeway_merge
$ dolt init
Successfully initialized dolt data repository.
$ dolt sql -q 'create table `databases`(name varchar(30) primary key)'
$ dolt sql -q 'show tables'
+--------------------------+
| Tables_in_threeway_merge |
+--------------------------+
| databases                |
+--------------------------+

$ dolt sql -q 'insert into `databases` values ("Oracle"), ("MySQL")'
Query OK, 2 rows affected (0.00 sec)
$ dolt add databases
$ dolt commit -m "Databases table with initial rows"
commit t19jjvtjrmddqnp2s1djbbe0j7k2pfra (HEAD -> main)
Author: timsehn <tim@dolthub.com>
Date:  Wed Jun 12 15:30:35 -0700 2024

        Databases table with initial rows

To trigger our commit as our merge base, we create two branches.

$ dolt branch add-release-date
$ dolt branch add-latest-version

Then, we add a release date column and populate it on the add-release-date branch, finishing with a commit.

$ dolt checkout add-release-date
Switched to branch 'add-release-date'
$ dolt sql -q 'alter table `databases` add column release_date date'
$ dolt sql -q 'show create table `databases`'
+-----------+------------------------------------------------------------------+
| Table     | Create Table                                                     |
+-----------+------------------------------------------------------------------+
| databases | CREATE TABLE `databases` (                                       |
|           |   `name` varchar(30) NOT NULL,                                   |
|           |   `release_date` date,                                           |
|           |   PRIMARY KEY (`name`)                                           |
|           | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin |
+-----------+------------------------------------------------------------------+

$ dolt sql -q 'update `databases` set release_date="1979-06-01" where name="Oracle"'
Query OK, 0 rows affected (0.00 sec)
Rows matched: 1  Changed: 0  Warnings: 0
$ dolt sql -q 'update `databases` set release_date="1995-05-23" where name="MySQL"'
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

$ dolt sql -q 'select * from `databases`'
+--------+--------------+
| name   | release_date |
+--------+--------------+
| MySQL  | 1995-05-23   |
| Oracle | 1979-06-01   |
+--------+--------------+
$ dolt diff
diff --dolt a/databases b/databases
--- a/databases
+++ b/databases
 CREATE TABLE `databases` (
   `name` varchar(30) NOT NULL,
+  `release_date` date,
   PRIMARY KEY (`name`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin;
+---+--------+--------------+
|   | name   | release_date |
+---+--------+--------------+
| < | MySQL  | NULL         |
| > | MySQL  | 1995-05-23   |
| < | Oracle | NULL         |
| > | Oracle | 1979-06-01   |
+---+--------+--------------+
$ dolt add databases
$ dolt commit -am "Added release date column and relevant release dates"
commit ape4hm5hicsoci3uscs466cciiv7dis6 (HEAD -> add-release-date)
Author: timsehn <tim@dolthub.com>
Date:  Wed Jun 12 15:42:17 -0700 2024

        Added release date column and relevant release dates

Then, we add a latest version column and populate it on the add-latest-version branch, finishing with a commit.

$ dolt checkout add-latest-version
Switched to branch 'add-latest-version'
$ dolt sql -q 'alter table `databases` add column latest_version varchar(30)'
$ dolt sql -q 'update `databases` set latest_version="19c" where name="Oracle"'
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
$ dolt sql -q 'update `databases` set latest_version="8.4.0" where name="MySQL"'
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
$ dolt sql -q 'select * from `databases`'
+--------+----------------+
| name   | latest_version |
+--------+----------------+
| MySQL  | 8.4.0          |
| Oracle | 19c            |
+--------+----------------+

$ dolt diff
diff --dolt a/databases b/databases
--- a/databases
+++ b/databases
 CREATE TABLE `databases` (
   `name` varchar(30) NOT NULL,
+  `latest_version` varchar(30),
   PRIMARY KEY (`name`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin;
+---+--------+----------------+
|   | name   | latest_version |
+---+--------+----------------+
| < | MySQL  | NULL           |
| > | MySQL  | 8.4.0          |
| < | Oracle | NULL           |
| > | Oracle | 19c            |
+---+--------+----------------+
$ dolt add databases
$ dolt commit -am "Added latest_version column and relevant values"
commit kkdc9381jghbj4obvbvbajd2t7bj7g4f (HEAD -> add-latest-version)
Author: timsehn <tim@dolthub.com>
Date:  Wed Jun 12 15:46:47 -0700 2024

        Added latest_version column and relevant values

Now, the moment of truth, the merge. The first is a fast-forward because there are no parallel changes on main.

$ dolt checkout main
Switched to branch 'main'
$ dolt merge add-release-date
Fast-forward
Updating ape4hm5hicsoci3uscs466cciiv7dis6..ape4hm5hicsoci3uscs466cciiv7dis6
databases | 2 **
1 tables changed, 0 rows added(+), 2 rows modified(*), 0 rows deleted(-)
$ dolt sql -q "select * from `databases`"
+--------+--------------+
| name   | release_date |
+--------+--------------+
| MySQL  | 1995-05-23   |
| Oracle | 1979-06-01   |
+--------+--------------+

Now, the second branch. This is the true three-way merge scenario.

$ dolt merge add-latest-version
Updating eovnrsqrhnvl4nep0ak2k88e4tu2mjs2..kkdc9381jghbj4obvbvbajd2t7bj7g4f
commit eovnrsqrhnvl4nep0ak2k88e4tu2mjs2 (HEAD -> main)
Merge: ape4hm5hicsoci3uscs466cciiv7dis6 kkdc9381jghbj4obvbvbajd2t7bj7g4f
Author: timsehn <tim@dolthub.com>
Date:  Wed Jun 12 15:49:11 -0700 2024

        Merge branch 'add-latest-version' into main

databases | 2 **
1 tables changed, 0 rows added(+), 2 rows modified(*), 0 rows deleted(-)
$ dolt sql -q "select * from `databases`"
+--------+--------------+----------------+
| name   | release_date | latest_version |
+--------+--------------+----------------+
| MySQL  | 1995-05-23   | 8.4.0          |
| Oracle | 1979-06-01   | 19c            |
+--------+--------------+----------------+

And we get the expected result. Two columns added with all cells populated.

This simple example belies all the technical complexity from the storage engine up to make three-way merge work at scale. As part of our discontinued Data Bounties initiative, we've merged millions of rows into tables containing hundreds of millions of rows. Dolt merge works at scale.

This blog is already reaching opus length, I will follow up with a blog that has more merge examples that generate and resolve conflicts.

Conclusion

Dolt is the first SQL database to implement three-way merge of schema and data. This functionality is powered by Dolt's unique storage engine that features a commit graph and Prolly-tree based storage for fast diff. Curious to learn more? Come by our Discord and we'd be happy to chat.

Blog

What is Three-way Merge?

Two-way Merge

Three-way Merge

Conflicts

Branches

How Three-way Merge Works

Find the Merge Base

diff3 Merge Algorithm

Three-way Merge in a SQL Database

Transactions

Prerequisites

Commit graph

Fast diff

Merge Process

1. Find the Merge Base

2. Perform Schema Merge

Tables

Columns

Foreign Keys

Indexes

Check Constraints

3. Resolve any Schema Conflicts

4. Perform Data Merge

5. Resolve any Data Conflicts

6. Resolve any Constraint Violations

SQL Transactions in Dolt

Example

Conclusion

Get started with Dolt

`diff3` Merge Algorithm