Querying Historical Data with AS OF Queries
Dolt is Git for data. It's a
SQL database that lets you branch, merge, and fork your data just like
you would a Git repository. In previous blog posts we announced how
you can use special system tables to query the history of your
database. Today,
we're excited to announce a simpler syntax for exploring the history
of your datasets: AS OF
queries.
SQL AS OF clauses
SQL introduced the AS OF
syntax in SQL
2011. Its basic form is easy
to understand:
SELECT * FROM myTable AS OF TIMESTAMP("2019-03-01");
As you would expect, this query returns rows from the table as it existed on the date provided.
Like many SQL features, its adoption and implementation has varied significantly across different vendors. The above simple syntax is actually a deviation from the SQL spec, adopted by both Oracle and Microsoft, but not MySQL / MariaDB. The details vary a lot too. In MySQL and SQL Server, you have to use a special syntax when you create your tables to enable versioning. In Oracle, you just turn on versioning on the entire database, and old-enough versions get garbage collected automatically. Postgres doesn't support versioned tables out of the box, requiring you to install a plugin first.
AS OF in Dolt
Dolt's versioning is a graph
of commits, just like Git. So in addition to timestamp queries like in
other database, you can query a
Dolt database with Git-like
refs in your AS OF
clauses. The following are all valid
Dolt SQL queries:
SELECT * FROM myTable AS OF TIMESTAMP("2020-03-01");
SELECT * FROM myTable AS OF 'HEAD~';
SELECT * FROM myTable AS OF 'HEAD~20';
SELECT * FROM myTable AS OF 'feature-branch';
SELECT * FROM myTable AS OF '4gunscn1j6ijtsj121m7bjaj4j8grjmv';
The latter four examples all use dolt refs instead of a timestamp, and work the way you would expect.
SELECT * FROM myTable AS OF 'HEAD~';
This query selects rows from the table as it existed at the commit
prior to the current HEAD
commit. It's equivalent to doing this:
% dolt checkout -b prior-commit HEAD~
% dolt sql -q "SELECT * FROM myTable"
Similarly, for other queries:
SELECT * FROM myTable AS OF 'HEAD~20';
This query selects rows from the table as it existed 20 commits before
the current HEAD
.
SELECT * FROM myTable AS OF 'feature-branch';
This query selects rows from the table as it exists on the current
HEAD
of the feature-branch
branch.
SELECT * FROM myTable AS OF '4gunscn1j6ijtsj121m7bjaj4j8grjmv';
This query selects rows from the table as it exists on the commit with
the hash 4gunscn1j6ijtsj121m7bjaj4j8grjmv
.
A real-world example
To demonstrate this feature, let's run some queries on DoltHub's Coronavirus dataset.
To start, we'll clone the dataset so we can run SQL on it:
% dolt clone dolthub/corona-virus
% cd corona-virus
Let's use the mortality_rate
view we introduced a few weeks
ago
to examine how mortality of the disease has changed over time.
Here's the current mortality rates per country:
doltsql> select * from mortality_rates limit 10;
+----------------+----------------+--------------+-------+--------+-----------+---------------------+
| country | state | last updated | cases | deaths | recovered | mortality_rate |
+----------------+----------------+--------------+-------+--------+-----------+---------------------+
| Sudan | | 2020-03-18 | 2 | 1 | 0 | 0.49751243781094534 |
| United Kingdom | Cayman Islands | 2020-03-19 | 3 | 1 | 0 | 0.33222591362126247 |
| Netherlands | Curacao | 2020-03-19 | 3 | 1 | 0 | 0.33222591362126247 |
| Guyana | | 2020-03-17 | 7 | 1 | 0 | 0.14265335235378032 |
| Ukraine | | 2020-03-19 | 16 | 2 | 0 | 0.12492192379762647 |
| Guatemala | | 2020-03-19 | 9 | 1 | 0 | 0.11098779134295228 |
| Algeria | | 2020-03-19 | 87 | 9 | 32 | 0.10343638662222733 |
| San Marino | | 2020-03-18 | 119 | 11 | 4 | 0.09242920762961096 |
| US | South Dakota | 2020-03-17 | 11 | 1 | 0 | 0.09082652134423251 |
| Cuba | | 2020-03-19 | 11 | 1 | 0 | 0.09082652134423251 |
+----------------+----------------+--------------+-------+--------+-----------+---------------------+
As you can see, the highest mortality rates are currently places with a small number of deaths and inadequate testing. Let's see what the picture looked like a week ago:
doltsql> select * from mortality_rates as of timestamp('2020-03-13') limit 10;
+-----------+-------------------+--------------+-------+--------+-----------+---------------------+
| country | state | last updated | cases | deaths | recovered | mortality_rate |
+-----------+-------------------+--------------+-------+--------+-----------+---------------------+
| Morocco | | 2020-03-11 | 5 | 1 | 0 | 0.1996007984031936 |
| Bulgaria | | 2020-03-11 | 7 | 1 | 0 | 0.14265335235378032 |
| Panama | | 2020-03-11 | 8 | 1 | 0 | 0.12484394506866417 |
| US | South Dakota | 2020-03-11 | 8 | 1 | 0 | 0.12484394506866417 |
| Australia | Western Australia | 2020-03-11 | 9 | 1 | 0 | 0.11098779134295228 |
| Iraq | | 2020-03-11 | 71 | 7 | 15 | 0.09857766511758906 |
| Albania | | 2020-03-11 | 12 | 1 | 0 | 0.08326394671107411 |
| US | Washington | 2020-03-11 | 366 | 29 | 1 | 0.07923280784678015 |
| US | Florida | 2020-03-11 | 28 | 2 | 0 | 0.07140307033202427 |
| Italy | | 2020-03-11 | 12462 | 827 | 1045 | 0.0663616864374206 |
+-----------+-------------------+--------------+-------+--------+-----------+---------------------+
Still topped by places with small death counts and limited testing, but now Italy and Washington State crack the top 10.
For reference, if we limit the list to places with signficant testing, mortality looks like this:
doltsql> select * from mortality_rates as of timestamp('2020-03-13') where cases > 100 limit 10;
+---------+--------------+--------------+-------+--------+-----------+----------------------+
| country | state | last updated | cases | deaths | recovered | mortality_rate |
+---------+--------------+--------------+-------+--------+-----------+----------------------+
| US | Washington | 2020-03-11 | 366 | 29 | 1 | 0.07923280784678015 |
| Italy | | 2020-03-11 | 12462 | 827 | 1045 | 0.0663616864374206 |
| China | Hubei | 2020-03-11 | 67773 | 3046 | 49134 | 0.0449441451692938 |
| Iran | | 2020-03-11 | 9000 | 354 | 2959 | 0.039333289629678185 |
| China | Hainan | 2020-03-08 | 168 | 6 | 159 | 0.035712159990476756 |
| China | Heilongjiang | 2020-03-11 | 482 | 13 | 436 | 0.026970394805087033 |
| China | Hong Kong | 2020-03-11 | 126 | 3 | 65 | 0.023807634314736925 |
| Spain | | 2020-03-11 | 2277 | 54 | 183 | 0.023715310868199963 |
| Japan | | 2020-03-11 | 639 | 15 | 118 | 0.023473811051470242 |
| China | Tianjin | 2020-03-10 | 136 | 3 | 131 | 0.022057201676347327 |
+---------+--------------+--------------+-------+--------+-----------+----------------------+
doltsql> select * from mortality_rates where cases > 100 limit 10;
+----------------+----------------+--------------+-------+--------+-----------+----------------------+
| country | state | last updated | cases | deaths | recovered | mortality_rate |
+----------------+----------------+--------------+-------+--------+-----------+----------------------+
| San Marino | | 2020-03-18 | 119 | 11 | 4 | 0.09242920762961096 |
| Italy | | 2020-03-19 | 41035 | 3405 | 4440 | 0.08297792543489084 |
| Indonesia | | 2020-03-19 | 311 | 25 | 11 | 0.08038326741905405 |
| Philippines | | 2020-03-19 | 217 | 17 | 8 | 0.07833740380627621 |
| Iran | | 2020-03-19 | 18407 | 1284 | 5710 | 0.06975603316345241 |
| Iraq | | 2020-03-19 | 192 | 13 | 43 | 0.06770480704129994 |
| US | Washington | 2020-03-19 | 1376 | 74 | 0 | 0.05377867893401937 |
| United Kingdom | United Kingdom | 2020-03-19 | 2689 | 137 | 65 | 0.05094811845251598 |
| Spain | | 2020-03-19 | 17963 | 830 | 1107 | 0.046206064573810296 |
| China | Hubei | 2020-03-19 | 67800 | 3130 | 57682 | 0.04616518493138866 |
+----------------+----------------+--------------+-------+--------+-----------+----------------------+
Disturbingly, Italy's mortality rate has actually gotten even worse in the last week.
Comparing alternate versions of your data
So far, every query we've demonstrated can be done in any normal SQL database from most vendors (maybe with a little bit more setup work). But now let's examine a use case where Dolt has unique capabilities and really shines.
Earlier this week, we discussed how we were using branches to manage data of questionable but potentially useful quality. In that blog post, to query the alternate version of the data, you had to drop out of the SQL shell and checkout the branch, like so:
% dolt checkout master
Switched to branch 'master'
% dolt sql -q "select * from mortality_rate_by_age_range"
...
% dolt checkout case_details_virological_dot_org
% dolt sql -q "select * from mortality_rate_by_age_range"
But as of this week's Dolt
release, you can
query the alternate data without leaving the SQL shell, just by naming
the branch in the AS OF
clause.
doltsql> select * from mortality_rate_by_age_range AS OF 'case_details_virological_dot_org';
Future work
There's a lot more work to do for this kind of query. SQL 2011 supports several different types of temporal queries that we haven't yet implemented, such as selecting all rows between two revisions:
SELECT * FROM t FOR SYSTEM_TIME BETWEEN (NOW() - INTERVAL 1 YEAR) AND NOW();
If you are interested in Dolt and want to us to prioritize these use cases, please let us know!
We also have a fair amount of bugs to fix. For example, our engine can't currently join a table to itself. This makes it impossible to join rows in a table to earlier revisions of themselves, which is an incredibly useful feature we're working hard to unlock.
Also please note: the examples in this blog entry use features that
are brand new and have not yet been released. If you're reading this
on the publication date, AS OF
timestamp queries are currently only
available by building from source. We'll have a release early next
week that includes timestamp support.
Conclusion
Dolt is a versioned database
that now lets you easily query earlier revisions of your data with no
additional setup work. But unlike other versioned databases,
Dolt lets you keep alternate
or experimental versions of your data on different branches. Querying
those alternate versions or their prior revisions is as simple as
including an AS OF
clause in your SQL. Download
Dolt today to try it out
yourself!