Introducing Cell History Inspection on DoltHub
Dolt and DoltHub are Git and GitHub for data. Having a versioned database makes collaborating on data more fluid and reliable in the same way that Git improves source code collaboration for software engineers. Using both Git and GitHub, engineers are able to inspect every line of code and dive into who made changes and when over time.
Dolt does the same thing for data. You can inspect the changes between versions of every cell in every table. We've exposed this functionality through our command line interface on Dolt, but we wanted to make it easier for users to get familiar with the features that make Dolt unique.
Dolt System Tables
In order to fully take advantage of what version control can do for data, we made the resulting metadata and data available through Dolt system tables. You can query this information using SQL in the same way you can query a table.
Our new cell history feature uses the dolt_diff
system table, so I will use it as an example. You can see a description of all Dolt system tables in our docs.
The dolt_diff system table is a queryable table that shows the changes between versions of a row. Every table in Dolt has a corresponding dolt_diff_[tablename]
system table. For example, we have a repository called dolthub/corona-virus
with a table named places
, which lists all places with Coronavirus cases. This is what the first 15 rows of the dolt_diff_places
table look like using Dolt:
$ dolt sql -q "SELECT * FROM dolt_diff_places LIMIT 15"
+--------------+-------------------+-------------------+-------------+-------------+----------------------------------+-----------------------------------+----------------+---------------------+---------------------+---------------+---------------+----------------------------------+-----------------------------------+-----------+
| to_longitude | to_province_state | to_country_region | to_latitude | to_place_id | to_commit | to_commit_date | from_longitude | from_province_state | from_country_region | from_latitude | from_place_id | from_commit | from_commit_date | diff_type |
+--------------+-------------------+-------------------+-------------+-------------+----------------------------------+-----------------------------------+----------------+---------------------+---------------------+---------------+---------------+----------------------------------+-----------------------------------+-----------+
| <NULL> | <NULL> | <NULL> | <NULL> | <NULL> | 0sieej4vrv3lnh8bu87n35les82piq7m | 2020-05-14 18:59:35.053 +0000 UTC | 0 | Recovered | Canada | 0 | 569 | qp2kccaqt3e7hdc3fcebbm18tq5j9dhm | 2020-05-14 17:38:34.042 +0000 UTC | removed |
| 28.2336 | | Lesotho | -29.61 | 590 | qp2kccaqt3e7hdc3fcebbm18tq5j9dhm | 2020-05-14 17:38:34.042 +0000 UTC | <NULL> | <NULL> | <NULL> | <NULL> | <NULL> | 1u96d6g0h99ai7a8j4jl2a2jarff4dpu | 2020-05-14 16:32:36.568 +0000 UTC | added |
| 43.3333 | | Comoros | -11.6455 | 588 | kg3tr9s7k9832j7it9gbnrafb3n1rmkk | 2020-05-01 20:47:29.907 +0000 UTC | <NULL> | <NULL> | <NULL> | <NULL> | <NULL> | ha2t0c6f5cfm9t4lnlaf4ak979jqiphc | 2020-05-01 20:26:18.34 +0000 UTC | added |
| 71.2761 | | Tajikistan | 38.861 | 589 | kg3tr9s7k9832j7it9gbnrafb3n1rmkk | 2020-05-01 20:47:29.907 +0000 UTC | <NULL> | <NULL> | <NULL> | <NULL> | <NULL> | ha2t0c6f5cfm9t4lnlaf4ak979jqiphc | 2020-05-01 20:26:18.34 +0000 UTC | added |
| 112.2707 | Hubei | China | 30.9756 | 1 | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 112.2707 | Hubei | China | 30.9756 | 1 | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified |
| 113.4244 | Guangdong | China | 23.3417 | 2 | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 113.4244 | Guangdong | China | 23.3417 | 2 | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified |
| 113.614 | Henan | China | 33.882 | 3 | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 113.614 | Henan | China | 33.882 | 3 | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified |
| 111.7088 | Hunan | China | 27.6104 | 4 | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 111.7088 | Hunan | China | 27.6104 | 4 | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified |
| 115.7221 | Jiangxi | China | 27.614 | 5 | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 115.7221 | Jiangxi | China | 27.614 | 5 | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified |
| 117.2264 | Anhui | China | 31.8257 | 6 | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 117.2264 | Anhui | China | 31.8257 | 6 | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified |
| 107.874 | Chongqing | China | 30.0572 | 7 | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 107.874 | Chongqing | China | 30.0572 | 7 | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified |
| 119.455 | Jiangsu | China | 32.9711 | 8 | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 119.455 | Jiangsu | China | 32.9711 | 8 | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified |
| 118.1498 | Shandong | China | 36.3427 | 9 | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 118.1498 | Shandong | China | 36.3427 | 9 | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified |
| 102.7103 | Sichuan | China | 30.6171 | 10 | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 102.7103 | Sichuan | China | 30.6171 | 10 | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified |
| 116.4142 | Beijing | China | 40.1824 | 11 | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 116.4142 | Beijing | China | 40.1824 | 11 | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified |
+--------------+-------------------+-------------------+-------------+-------------+----------------------------------+-----------------------------------+----------------+---------------------+---------------------+---------------+---------------+----------------------------------+-----------------------------------+-----------+
Every dolt_diff
table has these columns:
+------------------+----------+
| field | type |
+------------------+----------+
| from_commit | LONGTEXT |
| from_commit_date | DATETIME |
| to_commit | LONGTEXT |
| to_commit_date | DATETIME |
| diff_type | LONGTEXT |
+------------------+----------+
As well as to_[column]
and from_[column]
columns for every column in the table.
You can view the dolt_diff_places
table on DoltHub as well by using the SQL console to run a query:
Note: Queries are limited to 200 rows on DoltHub. For unlimited query results, you can clone this repository by installing Dolt and running dolt clone dolthub/corona-virus
Using this table, we were able to construct a query to show the history of changes in a row or cell with just a click.
Cell and Row History Inspection on DoltHub
While viewing table changes has been possible on DoltHub through SQL queries, we made it even easier. Introducing cell inspection: you can now click on a table cell and copy the value, get the row or cell history, and filter by that cell value. Clicking on a column heading also gives you the option to sort by that column.
Not only does this make exploring a dataset easier, but it also teaches SQL and brings awareness to system tables.
Our tables use unique primary keys to tell when there are changes to a row. Clicking on a cell in a primary key column will generate a dolt_diff
query that shows you the full row history, since primary keys cannot change.
Clicking on a cell in a non-primary key column generates a dolt_diff
query that shows changes only in that particular cell.
A Cell Inspection Use Case
This is a very useful feature for seeing change over time, and can help users audit data for where and when information changed.
Many of our DoltHub datasets are generated from an ETL job that automatically gets the data from its source when there have been updates, transforms it to match the Dolt table schema, and pushes the newly transformed data to DoltHub. Specifically, our dolthub/corona-virus
repo runs on an ETL job that gets data from various resources on an hourly basis. You can see the import script for this dataset here.
Tim, the creator and maintainer of this dataset, was inspecting the diff for the places
table, when he found a floating point import bug for the latitude
and longitude
columns. Using cell history, he was easily able to identify exactly what commit altered these values by looking at the history of changes for those cells.
You can get even deeper into when and who made the commit that changed a cell by viewing the dolt_history
table by clicking on the link at the bottom of the dolt_diff
table, which gets every commit associated with the row, as well as who authored the commit and when.
Tim was able to go back to his import script and fix the floating point bug. Without Git-like features such as diffs and cell inspection, Tim could have easily missed data inconsistencies created by his import script.
Making Data on DoltHub Interactive
Over the past few months we've been adding more features to DoltHub to better exhibit the benefits of a version-controlled database. Part of that is teaching users about the power of Dolt system tables in a user-friendly way, as well as exposing repository information like filesystem size and when a repo was last updated. Here are some other recent additions to DoltHub that help achieve this goal:
-
LICENSE and README documentation (learn more about the
dolt_docs
table here) -
Saved queries and views (learn more about the
dolt_schemas
table here)
Conclusion
Dolt is a database that uses version control, which makes debugging data between versions easy to trace. Exposing this functionality on DoltHub is an important step to improve the data discovery interface for our users. Data can be audited using our new cell history feature on DoltHub, which shows the history of changes for each table, row, and cell.
This is just the beginning of features like this that we're building for Dolt and DoltHub. If this interests you and you'd like to stay up-to-date with new features like this, please sign up for DoltHub or join our mailing list on the right.