Dolt Diff Magic: Part 2 – Viewing data diffs
Welcome back to the second post in this blog series exploring the magic that makes Dolt so unique and powerful. Dolt DB is a MySQL-compliant SQL database that also provides the distributed versioning features of Git. The result is a powerful database that lets you fork, merge, branch, push, pull, and deeply inspect the history of your data.
In the first post in this series,
we explored how the
dolt_diff
system table
enables you to see what tables have changed, when they changed, and who changed them.
In this post we're going to see a new set of Dolt's magic tricks and explore some of the
features in Dolt that show you exactly how your data has changed.
We'll see how to view logs of all the changes to the rows in your tables and explore some interesting ways we
can use that data to create useful queries.
The third post in this series
shows how to diff the full contents of a table at any two revisions, even across branches.
Diving Deeper into Dolt's Diff Magic
Let’s dive deeper and look at the changes to the actual data in our tables.
Dolt provides a few ways to look at diffs of your data.
In this post we're going to be focused on the
dolt_diff_$tablename
system table.
This system table exists for every user table in a Dolt database and gives us a list of all the individual changes
that commits have applied to a table.
The structure of the table consists of our columns, prepended with to_
and from_
, as well as to_commit
and from_commit
columns, to show how the values in our rows have changed at each commit.
Looking at this table gives us a really nice way to see how our data has
changed over the course of the current branch's commit history –
it's a view of how each individual Dolt commit has mutated our data.
The following query shows all the changes on the main branch that we’ve made to a table
called simpletable
in the
DoltDiffMagic database.
This is the same database we'll be using for all the examples in this post, so go ahead and dolt clone
it
so you can try out the queries below, too.
select * from dolt_diff_simpletable;
+-------+------+----------------------------------+-----------------------------------+---------+--------+----------------------------------+-----------------------------------+-----------+
| to_pk | to_c | to_commit | to_commit_date | from_pk | from_c | from_commit | from_commit_date | diff_type |
+-------+------+----------------------------------+-----------------------------------+---------+--------+----------------------------------+-----------------------------------+-----------+
| 1 | 10 | 4fevgeq4t6r55d9epvn28a3rj3csdsfv | 2022-03-31 14:56:19.978 +0000 UTC | 1 | 5 | 4koq08rg8iuuefk7ssckoo7m56gvr2rj | 2022-03-31 14:56:07.131 +0000 UTC | modified |
| 2 | 200 | 4koq08rg8iuuefk7ssckoo7m56gvr2rj | 2022-03-31 14:56:07.131 +0000 UTC | NULL | NULL | taf8r3gh7rv3ei0u6f7jio326llgtb61 | 2022-03-31 14:55:50.706 +0000 UTC | added |
| 1 | 5 | taf8r3gh7rv3ei0u6f7jio326llgtb61 | 2022-03-31 14:55:50.706 +0000 UTC | 1 | 4 | 19pba1s8h7esi2rqrspvarfg61ietsfc | 2022-03-31 14:55:44.64 +0000 UTC | modified |
| 1 | 4 | 19pba1s8h7esi2rqrspvarfg61ietsfc | 2022-03-31 14:55:44.64 +0000 UTC | 1 | 3 | o70kqfvhfr7b1pntp1t8cnn69o9r353j | 2022-03-31 14:55:37.062 +0000 UTC | modified |
| 1 | 3 | o70kqfvhfr7b1pntp1t8cnn69o9r353j | 2022-03-31 14:55:37.062 +0000 UTC | 1 | 2 | f7gjqm5foil1436kk53iv6e47pc3g4j4 | 2022-03-31 14:55:31.51 +0000 UTC | modified |
| 1 | 2 | f7gjqm5foil1436kk53iv6e47pc3g4j4 | 2022-03-31 14:55:31.51 +0000 UTC | 1 | 1 | gkorppmpemtnh91jrhn46ptcqphm2iaj | 2022-03-31 14:55:12.365 +0000 UTC | modified |
| 1 | 1 | gkorppmpemtnh91jrhn46ptcqphm2iaj | 2022-03-31 14:55:12.365 +0000 UTC | NULL | NULL | u8cvv5pm9elftq9f5c7juovcckq7nlqb | 2022-03-31 14:54:47.729 +0000 UTC | added |
+-------+------+----------------------------------+-----------------------------------+---------+--------+----------------------------------+-----------------------------------+-----------+
There are a few important things to notice about the returned data.
In addition to the to_
and from_
columns that show the exact ways the data in a row changed,
there is also a diff_type
column that tells us whether the row was added
, modified
, or removed
.
Notice also that we edited the same row in multiple Dolt commits, so we see each individual delta reported
in the results, not the cumulative diff of all changes across multiple Dolt commits. This makes the
dolt_diff_$tablename
system tables particularly good for seeing an audit log of how your data has changed
over time at each individual Dolt commit.
Diffs in Action
Let’s put our new knowledge of the dolt_diff_$tablename
system tables to use! In the next sections, we'll
work through practical examples of using Dolt's diff metadata. Each example uses the inventory
table from the
DoltDiffMagic database.
Example 1: Find when each item was first introduced
Let's start reporting when each item in our inventory was originally introduced in our shop.
For every row in our inventory
table, how can we report when it was first added to the table?
In a non-versioned database, you would do this by explicitly adding this data to your table and including a timestamp when you insert new items. If you realized you needed this after you started tracking your data, then you'd have to find a way to backfill all the old rows, or make your application aware that not all rows will have this data available, test different scenarios, etc. This is all totally doable of course, but why not let Dolt do the lifting for you here and take advantage of Dolt's built-in history and versioning support to keep your data and your application a little bit simpler?
We can get Dolt to provide this information for us if we take our inventory
table and join it on the
dolt_diff_inventory
system table and apply a few filters:
select inventory.*, date_format(dolt_diff_inventory.to_commit_date, '%Y-%m-%d') as first_created
from inventory, dolt_diff_inventory
where inventory.item_id=dolt_diff_inventory.to_item_id
and dolt_diff_inventory.diff_type='added'
order by first_created asc;
+---------+------------+-------+-------+---------------+
| item_id | name | price | count | first_created |
+---------+------------+-------+-------+---------------+
| 6 | puppy chow | 15 | 8 | 2022-03-18 |
| 2 | nintendo | 100 | 9 | 2022-03-18 |
| 3 | blender | 15 | 99 | 2022-03-18 |
| 4 | camera | 50 | 6 | 2022-03-18 |
| 5 | walkman | 10 | 11 | 2022-03-18 |
| 7 | gum | 1 | 12 | 2022-03-31 |
+---------+------------+-------+-------+---------------+
Example 2: Find items on sale
How can we identify all of the items in our inventory that have had their price reduced and see the commit message from the price change?
We'll join our inventory
table with the dolt_diff_inventory
system table again and this time
we'll filter for price drops that match the current item price. We'll also join on the dolt_commits
system table so that we can pull in the commit message.
select inventory.*, dolt_diff_inventory.from_price, dolt_commits.message
from inventory, dolt_diff_inventory, dolt_commits
where dolt_commits.commit_hash=dolt_diff_inventory.to_commit
and inventory.item_id=dolt_diff_inventory.to_item_id
and to_price < from_price
and to_price = inventory.price
and from_price is not null;
+---------+---------+-------+-------+------------+--------------------------+
| item_id | name | price | count | from_price | message |
+---------+---------+-------+-------+------------+--------------------------+
| 3 | blender | 15 | 99 | 20 | putting blenders on sale |
+---------+---------+-------+-------+------------+--------------------------+
Example 3: Find removed items
How can we find all the items we’ve removed from our inventory?
The query below uses the dolt_diff_inventory
system table to see everything that's been removed.
We could further filter this to removals in specific time periods by adding an additional filter.
The results show us that we’ve only pulled one item from the shelves.
Seems like customers didn't appreciate those extra chunks in our chicken noodle soup? 🍜 🤷
select from_item_id item_id, from_name name, message reason
from dolt_diff_inventory, dolt_commits
where dolt_commits.commit_hash=dolt_diff_inventory.to_commit
and diff_type='removed';
+---------+------+------------------------------------------------------------------------+
| item_id | name | reason |
+---------+------+------------------------------------------------------------------------+
| 1 | soup | removing chunky chicken noodle soup after multiple customer complaints |
+---------+------+------------------------------------------------------------------------+
Example 4: Update from historical data
Now that the big spring blender sale is over, how can we use Dolt's versioned history to reset the price of blenders back to the price right before we put it on sale?
The query below is called an update join query.
Update joins with the dolt_diff_$tablename
system tables are a powerful way to do partial data rollbacks with Dolt.
If you're curious about how update joins are implemented in Dolt, you can read more about
how we implemented support for update joins in Dolt.
update inventory
join dolt_diff_inventory on dolt_diff_inventory.to_name=inventory.name
set price=from_price
where from_price > to_price;
Query OK, 1 row affected
Rows matched: 1 Changed: 1 Warnings: 0
It looks like the query above executed correctly and updated one row, but how can we see what actually changed?
We can use the dolt_diff_inventory
table for this, too. Working set changes are tracked in the dolt_diff_$tablename
system tables, too, but since they don't have a real commit hash yet, we can use the special value WORKING
to see them.
If we query dolt_diff_inventory
and filter on to_commit='WORKING'
, we can see exactly how the query above changed
our data. Sure enough, the blender price was changed from 15
back to 20
, just like we wanted.
select * from dolt_diff_inventory where to_commit='WORKING';
+---------+----------+------------+----------+-----------+----------------+-----------+------------+--------------+------------+----------------------------------+-----------------------------------+-----------+
| to_name | to_price | to_item_id | to_count | to_commit | to_commit_date | from_name | from_price | from_item_id | from_count | from_commit | from_commit_date | diff_type |
+---------+----------+------------+----------+-----------+----------------+-----------+------------+--------------+------------+----------------------------------+-----------------------------------+-----------+
| blender | 20 | 3 | 99 | WORKING | NULL | blender | 15 | 3 | 99 | d4q5m9ks9f8b4eq1rjskicnsaiq343cn | 2022-03-31 16:02:54.613 +0000 UTC | modified |
+---------+----------+------------+----------+-----------+----------------+-----------+------------+--------------+------------+----------------------------------+-----------------------------------+-----------+
When we're satisfied with our changes, we can use the dolt_commit() SQL function to write the changes to Dolt's commit history.
select dolt_commit('-am', 'reverting blenders to their pre-sale price');
Wrap up
That wraps up our tour of the
dolt_diff_$tablename
system table!
We saw how the dolt_diff_$tablename
system table shows you a log of all the changes to the rows in your tables, for each
individual Dolt commit, and we saw a few ways you can use that historical data in your applications.
Dolt has lots more tricks up its sleeves and we'll keep exploring more in future posts in this series.
How could you leverage Dolt's versioning, history, and diff features in your applications? Got more questions on what else you can do with Dolt or how any of these features work? Come join us on Discord and let us know! We’d love to hear from you and see how we can help you and your team use Dolt!