Introducing dolt_column_diff
Dolt is a version controlled relational database that combines the version control features of Git and the relational database features of MySQL. One of the features that leverages the powerful versioned history Dolt tracks is our Database History System Tables. Today I’ll be introducing our newest Database History System Table – dolt_column_diff
. In this blog post I’ll go over the key features of the dolt_column_diff
system table, highlight how it differs from the dolt_diff
system table and dolt_diff_$TABLENAME
system table, and demonstrate some fun queries that our new table facilitates.
dolt_column_diff
is the best of both worlds
Our dolt_diff
system table gives you a high level overview of what’s changed in your database. We can see what tables have been modified in each commit.
select * from dolt_diff limit 5;
+----------------------------------+-----------------------------+--------------------+-------------------------------+-------------------------+------------------------------+-------------+---------------+
| commit_hash | table_name | committer | email | date | message | data_change | schema_change |
+----------------------------------+-----------------------------+--------------------+-------------------------------+-------------------------+------------------------------+-------------+---------------+
| i3f3orlfmbjgqnst90c8r96jps7tdtv9 | inmate_population_snapshots | spacelove | alec@dolthub.com | 2022-06-14 19:11:58.402 | Dolthub User: abmyii | true | false |
| | | | | | Accepted PR: 44 | | |
| ubu61jhc3qp1d28035ee3kd105ao10q1 | inmate_population_snapshots | Abdurrahmaan Iqbal | abdurrahmaaniqbal@hotmail.com | 2022-06-14 06:40:23.19 | Import KY data | true | false |
| gora1aioouji9j3858n928g84en6b17b | inmate_population_snapshots | spacelove | alec@dolthub.com | 2022-06-02 19:25:54.407 | Dolthub User: abmyii | true | false |
| | | | | | Accepted PR: 34 | | |
| gora1aioouji9j3858n928g84en6b17b | jails | spacelove | alec@dolthub.com | 2022-06-02 19:25:54.407 | Dolthub User: abmyii | true | false |
| | | | | | Accepted PR: 34 | | |
| r60ng1cnm4q3tbkkfjcbj6a2b057ba1d | jails | Abdurrahmaan Iqbal | abdurrahmaaniqbal@hotmail.com | 2022-06-01 00:25:13.334 | NULL where in_urban_area = 0 | true | false |
+----------------------------------+-----------------------------+--------------------+-------------------------------+-------------------------+------------------------------+-------------+---------------+
If we want a more detailed view of what’s changed in a specific table, you can use our dolt_diff_$TABLENAME
system table to take a look.
select * from dolt_diff_jails limit 5;
+----------+------------------+---------------------------------+------------------------+------------------+-------------------+-----------------+---------------+------------------+--------------------------------+-------------------------+-------------------------------+----------------------------+--------------------+--------------------------+----------------------------------+-------------------------+---------+-------------+--------------------+-----------------------+--------------------+---------------------+-------------------+-----------------+--------------------+----------------------------------+---------------------------+---------------------------------+------------------------------+----------------------+----------------------------+----------------------------------+------------------------+-----------+
| to_id | to_county | to_facility_name | to_facility_address | to_facility_city | to_facility_state | to_facility_zip | to_is_private | to_in_urban_area | to_holds_greater_than_72_hours | to_holds_less_than_1_yr | to_felonies_greater_than_1_yr | to_hold_less_than_72_hours | to_facility_gender | to_num_inmates_rated_for | to_commit | to_commit_date | from_id | from_county | from_facility_name | from_facility_address | from_facility_city | from_facility_state | from_facility_zip | from_is_private | from_in_urban_area | from_holds_greater_than_72_hours | from_holds_less_than_1_yr | from_felonies_greater_than_1_yr | from_hold_less_than_72_hours | from_facility_gender | from_num_inmates_rated_for | from_commit | from_commit_date | diff_type |
+----------+------------------+---------------------------------+------------------------+------------------+-------------------+-----------------+---------------+------------------+--------------------------------+-------------------------+-------------------------------+----------------------------+--------------------+--------------------------+----------------------------------+-------------------------+---------+-------------+--------------------+-----------------------+--------------------+---------------------+-------------------+-----------------+--------------------+----------------------------------+---------------------------+---------------------------------+------------------------------+----------------------+----------------------------+----------------------------------+------------------------+-----------+
| 042b1a2a | Sussex County | Greensville Correctional Center | 901 Corrections Way | Jarratt | VA | 23870 | 0 | -1 | 1 | 1 | 1 | 1 | 3 | 3007 | i3f3orlfmbjgqnst90c8r96jps7tdtv9 | 2022-06-14 19:11:58.402 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | ubu61jhc3qp1d28035ee3kd105ao10q1 | 2022-06-14 06:40:23.19 | added |
| 043dc9e7 | Maricopa County | ASPC-Perryville | 2105 North Citrus Road | Goodyear | AZ | 85395 | 0 | 0 | 1 | -1 | 1 | -1 | 2 | 4214 | i3f3orlfmbjgqnst90c8r96jps7tdtv9 | 2022-06-14 19:11:58.402 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | ubu61jhc3qp1d28035ee3kd105ao10q1 | 2022-06-14 06:40:23.19 | added |
| 06e8f5e2 | Fairfield County | Garner Correctional Institution | 50 Nunnawauk Road | Newtown | CT | 06470 | 0 | -1 | 1 | -1 | 1 | -1 | -1 | 0 | i3f3orlfmbjgqnst90c8r96jps7tdtv9 | 2022-06-14 19:11:58.402 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | ubu61jhc3qp1d28035ee3kd105ao10q1 | 2022-06-14 06:40:23.19 | added |
| 0898b5f4 | Lawrence County | Lawrence Correctional Center | 10940 Lawrence Road | Sumner | IL | 62466 | 0 | -1 | 1 | -1 | 1 | -1 | 1 | 2380 | i3f3orlfmbjgqnst90c8r96jps7tdtv9 | 2022-06-14 19:11:58.402 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | ubu61jhc3qp1d28035ee3kd105ao10q1 | 2022-06-14 06:40:23.19 | added |
| 11610970 | Pinal County | ASPC-Florence | 1305 E Butte Ave | Florence | AZ | 85132 | 0 | 1 | 1 | -1 | 1 | -1 | 1 | 3284 | i3f3orlfmbjgqnst90c8r96jps7tdtv9 | 2022-06-14 19:11:58.402 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | ubu61jhc3qp1d28035ee3kd105ao10q1 | 2022-06-14 06:40:23.19 | added |
+----------+------------------+---------------------------------+------------------------+------------------+-------------------+-----------------+---------------+------------------+--------------------------------+-------------------------+-------------------------------+----------------------------+--------------------+--------------------------+----------------------------------+-------------------------+---------+-------------+--------------------+-----------------------+--------------------+---------------------+-------------------+-----------------+--------------------+----------------------------------+---------------------------+---------------------------------+------------------------------+----------------------+----------------------------+----------------------------------+------------------------+-----------+
Now let’s say you want to pinpoint which columns have been changing in your tables. While dolt_diff_$TABLENAME
gives you every row with a change, you will have to manually compare each to_
and from_
column yourself to figure out where the change is in each row. Fortunately, with our new dolt_column_diff
system table, we do all that work for you!
dolt_column_diff
has the following schema:
+-------------+----------+
| field | Type |
+-------------+----------+
| commit_hash | text |
| table_name | text |
| column_name | text |
| committer | text |
| email | text |
| date | datetime |
| message | text |
| diff_type | text |
+-------------+----------+
dolt_column_diff
returns the commit history for the currently checked out branch and includes a row for every column that changed and includes a diff type for that change.
select * from dolt_column_diff limit 5;
+----------------------------------+-----------------------------+----------------------------+-----------+------------------+-------------------------+----------------------+-----------+
| commit_hash | table_name | column_name | committer | email | date | message | diff_type |
+----------------------------------+-----------------------------+----------------------------+-----------+------------------+-------------------------+----------------------+-----------+
| i3f3orlfmbjgqnst90c8r96jps7tdtv9 | inmate_population_snapshots | technical_parole_violators | spacelove | alec@dolthub.com | 2022-06-14 19:11:58.402 | Dolthub User: abmyii | modified |
| | | | | | | Accepted PR: 44 | |
| i3f3orlfmbjgqnst90c8r96jps7tdtv9 | inmate_population_snapshots | source_url | spacelove | alec@dolthub.com | 2022-06-14 19:11:58.402 | Dolthub User: abmyii | modified |
| | | | | | | Accepted PR: 44 | |
| i3f3orlfmbjgqnst90c8r96jps7tdtv9 | inmate_population_snapshots | source_url_2 | spacelove | alec@dolthub.com | 2022-06-14 19:11:58.402 | Dolthub User: abmyii | modified |
| | | | | | | Accepted PR: 44 | |
| i3f3orlfmbjgqnst90c8r96jps7tdtv9 | inmate_population_snapshots | federal_offense | spacelove | alec@dolthub.com | 2022-06-14 19:11:58.402 | Dolthub User: abmyii | modified |
| | | | | | | Accepted PR: 44 | |
| i3f3orlfmbjgqnst90c8r96jps7tdtv9 | inmate_population_snapshots | on_probation | spacelove | alec@dolthub.com | 2022-06-14 19:11:58.402 | Dolthub User: abmyii | modified |
| | | | | | | Accepted PR: 44 | |
+----------------------------------+-----------------------------+----------------------------+-----------+------------------+-------------------------+----------------------+-----------+
We have the best of both worlds in dolt_column_diff
– a way to see a high level overview of what changed in each commit without the need to manually compare cells in dolt_diff_$TABLENAME
. From here, we can see how all the columns are changing in our database, or use this system table as a jumping off point to get a more targeted look into dolt_diff_$TABLENAME
.
Let’s see some examples
Now that you have an idea of what makes dolt_column_diff
so powerful, let’s see it in action. If you want to follow along, we’ll be using the US Jails dataset available on DoltHub.
How do I find the history of all updates involving one specific column?
Let’s say you want to see how the values in a certain column have been changing throughout your commit history. Before, you would have had to write a query using dolt_diff_jails
that manually checked for a diff in every column so that you could filter for changes that involve facility_name
. Now, using dolt_column_diff
we can easily filter for a list of commits where there was actually a change to the column of interest which can then be used as a reference point in dolt_diff_jails
to see how the data actually changed.
select ddj.to_commit, ddj.to_facility_name, ddj.from_facility_name, ddj.diff_type
from dolt_diff_jails ddj join dolt_column_diff dcd on dcd.commit_hash=ddj.to_commit
where dcd.table_name='jails' and dcd.column_name='facility_name' and ddj.diff_type='modified'
limit 10;
+----------------------------------+----------------------------------+---------------------------------+-----------+
| to_commit | to_facility_name | from_facility_name | diff_type |
+----------------------------------+----------------------------------+---------------------------------+-----------+
| 4dgdn1ur42cuk18sin7olt8fnaik5d9b | MercedCounty Jail | MercedCountyJail | modified |
| 4dgdn1ur42cuk18sin7olt8fnaik5d9b | Gilchrist County Jail | Gilchrist CountyJail | modified |
| 4dgdn1ur42cuk18sin7olt8fnaik5d9b | Glades County Jail | Glades CountyJail | modified |
| 4dgdn1ur42cuk18sin7olt8fnaik5d9b | Hamilton County Jail | Hamilton CountyJail | modified |
| 4dgdn1ur42cuk18sin7olt8fnaik5d9b | Hardee County Jail | Hardee CountyJail | modified |
| 4dgdn1ur42cuk18sin7olt8fnaik5d9b | Holmes County Jail and Work Camp | Holmes CountyJail and Work Camp | modified |
| 4dgdn1ur42cuk18sin7olt8fnaik5d9b | Indian River County Jail | Indian River CountyJail | modified |
| 4dgdn1ur42cuk18sin7olt8fnaik5d9b | Levy County Jail | Levy CountyJail | modified |
| 4dgdn1ur42cuk18sin7olt8fnaik5d9b | Liberty County Jail | Liberty CountyJail | modified |
| 4dgdn1ur42cuk18sin7olt8fnaik5d9b | Martin County Jail | Martin CountyJail | modified |
+----------------------------------+----------------------------------+---------------------------------+-----------+
Which columns in a table have changed and how many times have they changed?
select table_name, column_name, count(commit_hash) as total_column_changes
from dolt_column_diff
where table_name='inmate_population_snapshots'
group by column_name
order by total_column_changes desc
limit 20;
+-----------------------------+----------------------------+----------------------+
| table_name | column_name | total_column_changes |
+-----------------------------+----------------------------+----------------------+
| inmate_population_snapshots | source_url | 64 |
| inmate_population_snapshots | id | 63 |
| inmate_population_snapshots | snapshot_date | 63 |
| inmate_population_snapshots | total | 62 |
| inmate_population_snapshots | source_url_2 | 34 |
| inmate_population_snapshots | detained_or_awaiting_trial | 33 |
| inmate_population_snapshots | convicted_or_sentenced | 26 |
| inmate_population_snapshots | male | 25 |
| inmate_population_snapshots | female | 25 |
| inmate_population_snapshots | federal_offense | 22 |
| inmate_population_snapshots | felony | 18 |
| inmate_population_snapshots | misdemeanor | 18 |
| inmate_population_snapshots | total_off_site | 16 |
| inmate_population_snapshots | technical_parole_violators | 16 |
| inmate_population_snapshots | asian | 7 |
| inmate_population_snapshots | white | 7 |
| inmate_population_snapshots | american_indian | 7 |
| inmate_population_snapshots | other_race | 7 |
| inmate_population_snapshots | civil_offense | 7 |
| inmate_population_snapshots | other_offense | 6 |
+-----------------------------+----------------------------+----------------------+
In this example we can see that fields describing the reasons an inmate is being held are being updated far more frequently than the fields holding demographic information about inmates.
For each primary key in a table, which columns have changed and how many times have they changed?
select ddj.to_id as id, dcd.column_name, count(dcd.column_name) as times_changed
from dolt_diff_jails ddj join dolt_column_diff dcd on dcd.commit_hash=ddj.to_commit
group by id, dcd.column_name
order by times_changed desc
limit 10;
+-----------------------+-------------+---------------+
| id | column_name | times_changed |
+-----------------------+-------------+---------------+
| NULL | id | 16 |
| STABS11a8a70014ef1fc2 | id | 16 |
| STABS0097e20278529ae2 | id | 16 |
| STABS26e8ff8ab54ea4cc | id | 16 |
| STABS0777bcd710a28622 | id | 16 |
| STABS2d676f2d3256e651 | id | 16 |
| STABS2f6eaaca83d16a08 | id | 16 |
| STABS3605f3081e276271 | id | 16 |
| STABS309ae66dbdcacb63 | id | 16 |
| STABS1b8740d3c791509b | id | 16 |
+-----------------------+-------------+---------------+
How many columns were changed in each commit?
select commit_hash, count(*)
from dolt_column_diff
group by commit_hash
limit 10;
+----------------------------------+----------+
| commit_hash | count(*) |
+----------------------------------+----------+
| i3f3orlfmbjgqnst90c8r96jps7tdtv9 | 11 |
| ubu61jhc3qp1d28035ee3kd105ao10q1 | 11 |
| gora1aioouji9j3858n928g84en6b17b | 22 |
| r60ng1cnm4q3tbkkfjcbj6a2b057ba1d | 1 |
| bg7c1miq9rpbhfhnlebtlmpdvt3u898j | 23 |
| 4dgdn1ur42cuk18sin7olt8fnaik5d9b | 19 |
| 3c2mb901bm3m1erc3k3ojad950v694ad | 4 |
| e3dqndb96jkh5meffb06srehc32iou03 | 15 |
| lme6elkc85boo7urdnc8k29s5ocgh4e3 | 1 |
| r05besu20g6csnbfvebhrbi97n4ahr8q | 1 |
+----------------------------------+----------+
Which columns haven’t changed in the past year?
select column_name
from information_schema.columns
where table_name='jails'
and table_name not in
(select column_name from dolt_column_diff where date > (NOW() - INTERVAL 12 MONTH));
+-----------------------------+
| COLUMN_NAME |
+-----------------------------+
| id |
| county |
| facility_name |
| facility_address |
| facility_city |
| facility_state |
| facility_zip |
| is_private |
| in_urban_area |
| holds_greater_than_72_hours |
| holds_less_than_1_yr |
| felonies_greater_than_1_yr |
| hold_less_than_72_hours |
| facility_gender |
| num_inmates_rated_for |
+-----------------------------+
How many columns has each committer changed?
select committer, count(distinct column_name) column_count
from dolt_column_diff
group by committer
order by column_count desc;
+--------------------+--------------+
| committer | column_count |
+--------------------+--------------+
| spacelove | 65 |
| joeeoj | 41 |
| CaptainStabs | 30 |
| Abdurrahmaan Iqbal | 29 |
| Captain Stabs | 26 |
| ericmock | 19 |
| rl1987 | 17 |
| adam | 15 |
| gmaximus0100 | 11 |
| captainstabs | 1 |
+--------------------+--------------+
That’s it!
Hopefully this blog was helpful in showing you some interesting ways to use our new system table. Were there any use cases or examples I missed? Have opinions on how data versioning and history features are helpful in your work? Come join us on discord and let us know! We’d love to hear from you.