Introducing SQL VIEW Support in Dolt
Dolt is a SQL database with Git-style versioning and distribution. The most recent releases of Dolt introduced support for SQL views that are stored as part of, and versioned along with, a Dolt repository. This provides a great way for data sets to distribute pre-built queries that might be useful to consumers of the data — things like aggregations, pivots and joins often appear in SQL view statements. Views can also be a good way to build up more complicated SQL queries from easy-to-understand and self documenting components.
Let's take a look at how the new view functionality in Dolt SQL works.
A Dataset to Work With
To demonstrate views we'll work with a simple dataset that tracks the outbreak of Coronavirus. The dataset comes from a Google Sheet published by Johns Hopkins University which we mirror in a DoltHub repository. We can checkout the repository and get our bearings with the data easily:
shell$ dolt clone dolthub/corona-virus
cloning https://doltremoteapi.dolthub.com/dolthub/corona-virus
112 of 112 chunks complete. 0 chunks being downloaded currently.
shell$ cd corona-virus
shell$ dolt sql
# Welcome to the DoltSQL shell.
# Statements must be terminated with ';'.
# "exit" or "quit" (or Ctrl-D) to exit.
doltsql> show tables;
+--------------+
| Table |
+--------------+
| cases |
| dolt_schemas |
| places |
+--------------+
doltsql> describe cases;
+------------------+----------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+----------+------+-----+---------+-------+
| observation_time | DATETIME | NO | PRI | | |
| place_id | BIGINT | NO | PRI | | |
| confirmed_count | BIGINT | YES | | | |
| recovered_count | BIGINT | YES | | | |
| death_count | BIGINT | YES | | | |
+------------------+----------+------+-----+---------+-------+
As described in the repository's README, this table mirrors the three
Johns Hopkins sheets linked above. The rows of the sheets have been
mapped to a place_id
and the sheet has been unpivoted into a SQL
table by mapping each observation column into an observation_time
row.
Digging In
This makes the table nice to work with for some types of queries. For example, the following shows how cases have progressed in the province of Hubei, where Wuhan is located:
doltsql> SELECT cases.* FROM cases NATURAL JOIN places
WHERE places.province_state = 'Hubei'
ORDER BY observation_time ASC;
+----------+-------------------------------+-----------------+-----------------+-------------+
| place_id | observation_time | confirmed_count | recovered_count | death_count |
+----------+-------------------------------+-----------------+-----------------+-------------+
| 1 | 2020-01-21 22:00:00 +0000 UTC | 270 | <NULL> | <NULL> |
| 1 | 2020-01-22 12:00:00 +0000 UTC | 444 | <NULL> | <NULL> |
| 1 | 2020-01-23 12:00:00 +0000 UTC | <NULL> | 28 | <NULL> |
| 1 | 2020-01-24 00:00:00 +0000 UTC | 549 | 31 | 24 |
| 1 | 2020-01-25 00:00:00 +0000 UTC | 729 | 32 | 32 |
| 1 | 2020-01-25 12:00:00 +0000 UTC | 761 | <NULL> | 40 |
| 1 | 2020-01-25 22:00:00 +0000 UTC | 1052 | 42 | 52 |
| 1 | 2020-01-26 11:00:00 +0000 UTC | 1058 | <NULL> | <NULL> |
| 1 | 2020-01-26 23:00:00 +0000 UTC | 1423 | 44 | 76 |
| 1 | 2020-01-27 09:00:00 +0000 UTC | <NULL> | 45 | <NULL> |
| 1 | 2020-01-27 20:30:00 +0000 UTC | 2714 | 47 | 100 |
| 1 | 2020-01-28 13:00:00 +0000 UTC | <NULL> | 52 | <NULL> |
| 1 | 2020-01-28 18:00:00 +0000 UTC | 3554 | 80 | 125 |
| 1 | 2020-01-29 14:30:00 +0000 UTC | <NULL> | 88 | <NULL> |
| 1 | 2020-01-29 21:00:00 +0000 UTC | 4586 | 90 | 162 |
| 1 | 2020-01-30 11:00:00 +0000 UTC | 4903 | <NULL> | <NULL> |
| 1 | 2020-01-31 14:00:00 +0000 UTC | 5806 | 141 | 204 |
| 1 | 2020-02-01 10:00:00 +0000 UTC | 7153 | 168 | 249 |
| 1 | 2020-02-02 21:00:00 +0000 UTC | 11177 | 295 | 350 |
| 1 | 2020-02-03 21:00:00 +0000 UTC | 13522 | 386 | 414 |
| 1 | 2020-02-04 09:40:00 +0000 UTC | <NULL> | 396 | <NULL> |
| 1 | 2020-02-04 22:00:00 +0000 UTC | 16678 | 522 | 479 |
| 1 | 2020-02-05 09:00:00 +0000 UTC | <NULL> | 537 | <NULL> |
| 1 | 2020-02-05 23:00:00 +0000 UTC | 19665 | 651 | 549 |
| 1 | 2020-02-06 09:00:00 +0000 UTC | <NULL> | 712 | <NULL> |
| 1 | 2020-02-07 20:13:00 +0000 UTC | 22112 | 867 | 618 |
| 1 | 2020-02-07 22:50:00 +0000 UTC | 24953 | 1115 | 699 |
| 1 | 2020-02-08 10:24:00 +0000 UTC | <NULL> | 1218 | <NULL> |
| 1 | 2020-02-08 23:04:00 +0000 UTC | 27100 | 1440 | 780 |
| 1 | 2020-02-09 10:30:00 +0000 UTC | <NULL> | 1480 | <NULL> |
| 1 | 2020-02-09 23:20:00 +0000 UTC | 29631 | 1795 | 871 |
| 1 | 2020-02-10 11:00:00 +0000 UTC | <NULL> | 1854 | <NULL> |
+----------+-------------------------------+-----------------+-----------------+-------------+
As you can see, some updates for the spreadsheet include observations for certain fields and not others. This makes it slightly inconvenient to answer the question "what is the most recent observation for each count in each location", for example. For that reason, the dataset travels alongside with some views that can easily provide exactly those answers.
Introducing Views
Views in a Dolt repository travel alongside the dataset and are
versioned in exactly the same way. Currently, you can see what views
are available in a dolt repository by selecting from the dolt system
table dolt_schemas
.
doltsql> select type, name from dolt_schemas;
+------+-------------------+
| type | name |
+------+-------------------+
| view | current |
| view | current_cases |
| view | current_deaths |
| view | current_recovered |
+------+-------------------+
As you can see, we store the view definitions in this table. The table
is modified by CREATE VIEW
and DROP VIEW
statements, but it is
versioned alongside the Dolt repository. That means it's available for
Dolt CLI interactions like dolt add
, dolt blame
and dolt reset
. When you create a view, you'll see a new row in the
dolt_schemas
table when you run dolt diff
. You should dolt add
the dolt_schemas
table to your commit if you want the view to
travel alongside your dataset.
Let's dig a little bit more into the views that are included in the
corona-virus
dataset. Selecting from the current
view, we can get a row for
each location in the dataset, and that locations most recent non-NULL
observation for each of the three fields:
doltsql> select * from current limit 10;
+----------------+-----------+-------+--------+-----------+
| country | state | cases | deaths | recovered |
+----------------+-----------+-------+--------+-----------+
| Mainland China | Hubei | 29631 | 871 | 1854 |
| Mainland China | Guangdong | 1159 | 1 | 167 |
| Mainland China | Zhejiang | 1092 | 0 | 242 |
| Mainland China | Henan | 1073 | 6 | 191 |
| Mainland China | Hunan | 879 | 1 | 208 |
| Mainland China | Anhui | 830 | 3 | 88 |
| Mainland China | Jiangxi | 771 | 1 | 105 |
| Mainland China | Jiangsu | 492 | 0 | 81 |
| Mainland China | Chongqing | 473 | 2 | 66 |
| Mainland China | Shandong | 466 | 1 | 66 |
+----------------+-----------+-------+--------+-----------+
We can inspect how the view is defined by selecting its schema
fragment from the dolt_schemas
table:
doltsql> select fragment from dolt_schemas where name = 'current';
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| fragment |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| select current_cases.country, current_cases.state, current_cases.cases, current_deaths.deaths, current_recovered.recovered from current_cases left join current_deaths on current_cases.country=current_deaths.country and current_cases.state=current_deaths.state left join current_recovered on current_cases.country=current_recovered.country and current_cases.state=current_recovered.state order by cases desc |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Which is a little hard to read, but just reformatting gives us:
SELECT
current_cases.country,
current_cases.state,
current_cases.cases,
current_deaths.deaths,
current_recovered.recovered
FROM
current_cases
LEFT JOIN
current_deaths
ON current_cases.country=current_deaths.country
AND current_cases.state=current_deaths.state
LEFT JOIN
current_recovered
ON current_cases.country=current_recovered.country
AND current_cases.state=current_recovered.state
ORDER BY
cases DESC
Each one of current_cases
, current_deaths
and current_recovered
looks similar. They each return the most recent observation time for
each location which has an actual observation for a given field.
This also highlights how views can help build up a complex query from
self-contained, self-documenting pieces. The query plan for select * from current
can give a good overview of what's going on when Dolt
computes the results:
doltsql> explain select * from current;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SubqueryAlias(current) |
| └─ Sort(current_cases.cases DESC) |
| └─ Project(current_cases.country, current_cases.state, current_cases.cases, current_deaths.deaths, current_recovered.recovered) |
| └─ LeftJoin(current_cases.country = current_recovered.country AND current_cases.state = current_recovered.state) |
| ├─ LeftJoin(current_cases.country = current_deaths.country AND current_cases.state = current_deaths.state) |
| │ ├─ SubqueryAlias(current_cases) |
| │ │ └─ Sort(cases ASC) |
| │ │ └─ Project(country, state, MAX(convert(cases.observation_time, datetime)) as last updated, cases) |
| │ │ └─ GroupBy |
| │ │ ├─ Aggregate(places.country_region as country, places.province_state as state, MAX(convert(cases.observation_time, datetime)), cases.confirmed_count as cases) |
| │ │ ├─ Grouping(cases.place_id) |
| │ │ └─ Filter(NOT(cases.confirmed_count IS NULL)) |
| │ │ └─ IndexedJoin(cases.place_id = places.place_id) |
| │ │ ├─ cases |
| │ │ └─ places |
| │ └─ SubqueryAlias(current_deaths) |
| │ └─ Sort(deaths DESC) |
| │ └─ Project(country, state, MAX(convert(cases.observation_time, datetime)) as last updated, deaths) |
| │ └─ GroupBy |
| │ ├─ Aggregate(places.country_region as country, places.province_state as state, MAX(convert(cases.observation_time, datetime)), cases.death_count as deaths) |
| │ ├─ Grouping(cases.place_id) |
| │ └─ Filter(NOT(cases.death_count IS NULL)) |
| │ └─ IndexedJoin(cases.place_id = places.place_id) |
| │ ├─ cases |
| │ └─ places |
| └─ SubqueryAlias(current_recovered) |
| └─ Sort(recovered DESC) |
| └─ Project(country, state, MAX(convert(cases.observation_time, datetime)) as last updated, recovered) |
| └─ GroupBy |
| ├─ Aggregate(places.country_region as country, places.province_state as state, MAX(convert(cases.observation_time, datetime)), cases.recovered_count as recovered) |
| ├─ Grouping(cases.place_id) |
| └─ Filter(NOT(cases.recovered_count IS NULL)) |
| └─ IndexedJoin(cases.place_id = places.place_id) |
| ├─ cases |
| └─ places |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Recap
SQL view support is a new Dolt feature that allows defined views to be
stored inside and versioned alongside a Dolt repository. SQL views can
be included in a repository to provide interesting ways of reframing
the data, including aggregations, joins and pivots. In a Dolt
repository, views definitions are stored in the dolt_schemas
table
and can be managed from there like any other table in a Dolt
repository.
Coming Soon for Dolt View Support
This is just the start of view support in Dolt, and there's still more work to be done. Our roadmap includes the following missing functionality that will be coming shortly:
- First Class DoltHub Support.
- Views Appear In
SHOW FULL TABLES
. SHOW CREATE VIEW
Support.ALTER VIEW
Support.
Further Data Resources on Coronavirus
There are lots of resources across the Web for tracking data related to Coronavirus. Here are some useful links: