Reannouncing Dolt Query Diff
A couple years ago, we announced a version of query-diff
.
The feature always made us nervous.
Dolt's storage engine promises fast diff based on content-addressed prolly trees.
However, query-diff
doesn't take advantage of Prolly trees; rather, it does a brute force comparison of query results.
We were worried people would use it and think diff in Dolt was slow overall, not just query-diff.
Since no one was using query-diff
, we quietly removed it in early 2021.
Dolt has improved greatly since then and customers began asking for query-diff
capability, so we've decided to revive this feature.
The feature comes as a CLI command, dolt query-diff <query1> <query2>
,
and as a table function SELECT * FROM dolt_query_diff(<query1>, <query2>)
.
While query-diff
works similarly to the dolt-diff
system tables, we think that query-diff
offers a more natural way of writing these queries through the use of as of <commit>
clauses.
Additionally, query-diff
is able to compare queries that contain joins, aggregations, where clauses, and other more complicated SQL operations.
Overview
We'll demonstrate some of query-diff
's capabilities through a simple example.
First, let's setup the database:
dolt_db> create table a (i int primary key, j int);
dolt_db> insert into a values (1, 1), (2, 2), (3, 3);
Query OK, 3 rows affected (0.00 sec)
dolt_db> call dolt_commit('-am', 'initializing table a');
+----------------------------------+
| hash |
+----------------------------------+
| f5mkvq207nskkumkeerdtv85jqtn1bns |
+----------------------------------+
1 row in set (0.00 sec)
dolt_db> call dolt_branch('other');
+--------+
| status |
+--------+
| 0 |
+--------+
1 row in set (0.00 sec)
dolt_db> insert into a values (4, 4);
Query OK, 1 row affected (0.00 sec)
dolt_db> delete from a where i = 3;
Query OK, 1 row affected (0.01 sec)
dolt_db> update a set j = 20 where i = 2;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
dolt_db> call dolt_commit('-Am', 'updating table a');
+----------------------------------+
| hash |
+----------------------------------+
| n3m1lmpu3cv99qma7btpeq5jdiba0f34 |
+----------------------------------+
1 row in set (0.00 sec)
Here we have a table a
and two branches main
and other
; the rows in a
are different between the two branches.
We can take advantage of as of <commit>
clauses to view these different versions of a
:
dolt_db> select * from a as of other;
+---+---+
| i | j |
+---+---+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
+---+---+
3 rows in set (0.00 sec)
dolt_db> select * from a as of main;
+---+----+
| i | j |
+---+----+
| 1 | 1 |
| 2 | 20 |
| 4 | 4 |
+---+----+
3 rows in set (0.00 sec)
Using query-diff
, we can compare these two versions and produce an easy to read diff:
dolt_db> select * from dolt_query_diff('select * from a as of other', 'select * from a as of main');
+--------+--------+------+------+-----------+
| from_i | from_j | to_i | to_j | diff_type |
+--------+--------+------+------+-----------+
| 2 | 2 | 2 | 20 | modified |
| 3 | 3 | NULL | NULL | deleted |
| NULL | NULL | 4 | 4 | added |
+--------+--------+------+------+-----------+
3 rows in set (0.00 sec)
Here is the equivalent diff through dolt-diff
:
dolt_db> select from_i, from_j, to_i, to_j, diff_type from dolt_diff('other', 'main', 'a');
+--------+--------+------+------+-----------+
| from_i | from_i | to_i | to_j | diff_type |
+--------+--------+------+------+-----------+
| 2 | 2 | 2 | 20 | modified |
| 3 | 3 | NULL | NULL | removed |
| NULL | NULL | 4 | 4 | added |
+--------+--------+------+------+-----------+
3 rows in set (0.00 sec)
Advantages
In many cases, query-diff
is more natural to use than dolt-diff
.
Suppose we only wanted to see the diff where i = 2
.
Naturally, we can achieve this using a filter on a
:
dolt_db> select * from dolt_query_diff('select * from a as of other where i = 2', 'select * from a as of main where i = 2');
+--------+--------+------+------+-----------+
| from_i | from_j | to_i | to_j | diff_type |
+--------+--------+------+------+-----------+
| 2 | 2 | 2 | 20 | modified |
+--------+--------+------+------+-----------+
1 row in set (0.00 sec)
To accomplish the same thing with dolt-diff
, we'd need to have a filter over the entire diff:
dolt_db> select from_i, from_j, to_i, to_j, diff_type from dolt_diff('other', 'main', 'a') where from_i = 2 and to_i = 2;
+--------+--------+------+------+-----------+
| from_i | from_j | to_i | to_j | diff_type |
+--------+--------+------+------+-----------+
| 2 | 2 | 2 | 20 | modified |
+--------+--------+------+------+-----------+
1 row in set (0.00 sec)
This is a bit more verbose, and it's not immediately clear what the from_i = 2 and to_i = 2
clause is doing.
Through query-diff
, we can compare result sets from more complicated queries.
For example, we can get a diff from an aggregation:
dolt_db> select * from dolt_query_diff('select i, sum(j) from a as of other', 'select i, sum(j) from a as of main');
+--------+-------------+------+-----------+-----------+
| from_i | from_sum(j) | to_i | to_sum(j) | diff_type |
+--------+-------------+------+-----------+-----------+
| 1 | 6 | 1 | 25 | modified |
+--------+-------------+------+-----------+-----------+
1 row in set (0.00 sec)
We can diff joins:
dolt_db> create table b (k int primary key);
dolt_db> insert into b values (10);
Query OK, 1 row affected (0.00 sec)
dolt_db> select * from dolt_query_diff('select * from a as of other, b', 'select * from a as of main, b');
+--------+--------+--------+------+------+------+-----------+
| from_i | from_j | from_k | to_i | to_j | to_k | diff_type |
+--------+--------+--------+------+------+------+-----------+
| 2 | 2 | 10 | 2 | 20 | 10 | modified |
| 3 | 3 | 10 | NULL | NULL | NULL | deleted |
| NULL | NULL | NULL | 4 | 4 | 10 | added |
+--------+--------+--------+------+------+------+-----------+
3 rows in set (0.01 sec)
Limitations
Currently, diffs are only informative when performed against the same table with the same schema.
It is still unclear exactly what should happen between different tables, and we are open to suggestions.
A possibility is to have this operate similarly to NATURAL JOIN
s, where the columns are automatically matched by column name.
Query diff on different tables will work, but it will not be very informative.
dolt_db> create table t1 (i int);
dolt_db> create table t2 (j int);
dolt_db> insert into t1 values (1), (2), (3);
Query OK, 3 rows affected (0.00 sec)
dolt_db> insert into t2 values (2), (4), (6);
Query OK, 3 rows affected (0.00 sec)
dolt_db> select * from dolt_query_diff('select * from t1','select * from t2');
+--------+------+-----------+
| from_i | to_j | diff_type |
+--------+------+-----------+
| 2 | NULL | deleted |
| 1 | NULL | deleted |
| 3 | NULL | deleted |
| NULL | 2 | added |
| NULL | 6 | added |
| NULL | 4 | added |
+--------+------+-----------+
6 rows in set (0.00 sec)
Another limitation is that there must be a PRIMARY KEY
present on the table; this is just so rows can be uniquely identified, and a diff can be calculated.
Query diff on result sets without a PRIMARY KEY
is still possible, but it isn't very informative.
Lastly, the current implementation is not very performant. It is a naive brute force algorithm that compares the entire result set of each query. For large result sets, this is undoubtedly very slow. We are actively working on an improved implementation that takes advantage of some of dolt's internal data structures to make this much faster.
Conclusion
Dolt query-diff
is a useful feature, and a natural extension of dolt's git-like capabilities.
We are excited to have it back, and we hope you give it a try.
Let us know if you find any bugs or have any other feature requests on Discord or GitHub.