Schema Overriding
DoltDB is the world's first fully-versioned relational database. You can branch, merge, diff, push, and pull your relational data in the same ways that Git allows you to work with source code files. If you're curious what sorts of use cases that enables, check out some of the many ways customers are using Dolt.
One of the key features of Dolt is that you can access your data (and your data's schema) at any point along your database's commit graph. You can go back in time to see how your data and schema looked two years ago, and features like branches and tags makes accessing different points in that history very easy. In this blog post, we're taking a look at a new feature, called "schema overriding", that allows you to look at your data through the lens of another schema.
Schema Overriding
Let's start off with a really simple example to introduce this feature. In fact, we'll reuse the same example from GitHub that was originally used to request this feature. In that issue, the customer did a nice job describing the problem succinctly:
"I want to have some assurance or safety in writing queries that will be backwards-compatible as I add or remove columns during table migrations across commits."
As a concrete example, they described a table called people
with the following schema at earlier commits in the database:
+-----------+------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+------+------+-----+---------+-------+
| Id | int | NO | PRI | NULL | |
| Name | text | NO | | NULL | |
+-----------+------+------+-----+---------+-------+
Then, as the database evolved, a Birthdate
field was added to the table in later commits:
+-----------+------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+------+------+-----+---------+-------+
| Id | int | NO | PRI | NULL | |
| Name | text | NO | | NULL | |
| Birthdate | date | YES | | NULL | |
+-----------+------+------+-----+---------+-------+
The customer wanted an easier way to query the data, without having to manually adjust queries for all the different schema versions in the data's history. In this example the customer wants to use this query, which works with the current, latest schema of the people
table:
select Name, Birthdate from people;
However, if he tries to work with older data from earlier commits that don't have the Birthdate
field, that query returns an error about the missing Birthdate
field:
select Name, Birthdate from people;
column "Birthdate" could not be found in any table in scope
Schema overriding solves this problem – it automatically maps your data to an overridden schema, so that you don't have to manually update your queries to run against a different schema. Let's take a look at how we can use schema overriding for this simple example...
Let's assume that we're working on a branch called olderData
and we want to query the older data on this branch using the schema as it currently exists on the main
branch. The first thing we need to do is set the @@dolt_override_schema
variable to a commit that contains the schema we want to use. We could use a branch or tag name, or a specific commit, or even an ancestor reference such as myOtherBranch~~
. In this case, we're just using the name of our branch that contains our most recent schema. Note that when we set a schema override, it's in place for all tables as long as that session variable is set. If you want to remove the schema override, you can set the variable to NULL
.
SET @@dolt_override_schema='main';
At this point, we've got a schema override in place, and if we query any of our data on this olderData
branch, it'll automatically be mapped to the schema present at the tip of the main
branch. We can now run the same query above that was previously giving us an error:
select Name, Birthdate from people;
+-----------+-----------+
| Name | Birthdate |
+-----------+-----------+
| Frank | NULL |
| Columbia | NULL |
| Dr. Scott | NULL |
+-----------+-----------+
3 rows in set (0.00 sec)
Et voila! Instead of returning an error like before, the query runs and simply fills in NULL
values for the Birthdate
column, since that column doesn't actually exist in the data on the olderData
branch.
A Bigger Example
Let's look at a slightly larger example that's more representative of something you might see in a real world system.
In this example, we're running a shop and using Dolt to track our inventory. We have a new "inventory valuation" report that we've put together to better understand how much investment we have sitting around in unsold inventory, and we want to go back in time and run this report to see how our inventory valuation has changed over the past year. However, we've made several changes to our inventory schema recently, so the query won't run in its current form on older commits.
Let's take a look at our current schema, how it has changed over time, and then see how we can use schema overriding to run our report for older data.
Here's what our schema looks like today... We've got a table called Inventory
with the following schema:
dolt/main> describe Inventory;
+--------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| inventory_id | int | NO | PRI | NULL | |
| product_id | int | YES | | NULL | |
| quantity | int | YES | | NULL | |
| checked_by | varchar(255) | YES | | NULL | |
+--------------+--------------+------+-----+---------+-------+
and a table called Products
with the following schema:
dolt/main> describe Products;
+--------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| product_id | int | NO | PRI | NULL | |
| product_name | varchar(255) | YES | | NULL | |
| cost | decimal(6,2) | YES | | NULL | |
| description | text | YES | | NULL | |
+--------------+--------------+------+-----+---------+-------+
Here's the inventory valuation query we're running to see how much we have invested currently in unsold product:
SELECT
P.product_name,
P.description,
I.quantity,
(I.quantity * P.cost) AS total_value
FROM
Inventory I
INNER JOIN
Products P ON I.product_id = P.product_id
ORDER BY
total_value DESC;
Running this on our current data at the tip of main
works great!
dolt/main> SELECT
-> P.product_name,
-> P.description,
-> I.quantity,
-> (I.quantity * P.cost) AS total_value
-> FROM
-> Inventory I
-> INNER JOIN
-> Products P ON I.product_id = P.product_id
-> ORDER BY
-> total_value DESC;
+--------------+-------------------------+----------+-------------+
| product_name | description | quantity | total_value |
+--------------+-------------------------+----------+-------------+
| Water gun | Simulates a rain storm | 64 | 256.00 |
| Confetti | For that special moment | 84 | 126.00 |
| Glow Stick | When you need a light | 101 | 101.00 |
| Rice | Throw it at newlyweds | 21 | 42.00 |
| Toast | May I propose a toast? | 80 | 40.00 |
+--------------+-------------------------+----------+-------------+
But... as soon as we go back to earlier commits, we get errors about columns not being found. Note that we're use
ing a revision database to lock to a specific commit. This is one of the many ways Dolt enables you to work with different commits, tags, and branches in your database's commit history.
use `dolt/c5nhjj20mv1bm3vsbdnjctnevd4sdnap`;
Database changed
dolt/c5nhjj20mv1bm3vsbdnjctnevd4sdnap> SELECT
-> P.product_name,
-> P.description,
-> I.quantity,
-> (I.quantity * P.cost) AS total_value
-> FROM
-> Inventory I
-> INNER JOIN
-> Products P ON I.product_id = P.product_id
-> ORDER BY
-> total_value DESC;
table "p" does not have column "product_id"
Before we get to the punch line and turn on a schema override, let's take a look at what the schema looks like in this older commit and see how the tables have changed.
Here's the Inventory
table:
dolt/c5nhjj20mv1bm3vsbdnjctnevd4sdnap> describe Inventory;
+------------+------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+------+------+-----+---------+-------+
| id | int | NO | PRI | NULL | |
| product_id | int | YES | MUL | NULL | |
| quantity | int | YES | | NULL | |
+------------+------+------+-----+---------+-------+
3 rows in set (0.00 sec)
And here's the Products
table:
dolt/c5nhjj20mv1bm3vsbdnjctnevd4sdnap> describe Products;
+-------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id | int | NO | PRI | NULL | |
| name | varchar(255) | YES | | NULL | |
| cost | float | YES | | NULL | |
+-------+--------------+------+-----+---------+-------+
Looking at those schemas, we see several changes were made:
- The
id
column in theInventory
table was renamed toinventory_id
- A new
checked_by
column was added to theInventory
table - The
id
column in theProducts
table was renamed toproduct_id
- The
name
column in theProducts
table was renamed toproduct_name
- A new
description
column was added to theProducts
table - The
cost
column in theProducts
table was changed fromfloat
todecimal(6,2)
Let's set a schema override so that we lock to the schema as it exists on the main
branch, and then run our query again.
dolt/c5nhjj20mv1bm3vsbdnjctnevd4sdnap> set @@dolt_override_schema='main';
dolt/c5nhjj20mv1bm3vsbdnjctnevd4sdnap> SELECT
-> P.product_name,
-> P.description,
-> I.quantity,
-> (I.quantity * P.cost) AS total_value
-> FROM
-> Inventory I
-> INNER JOIN
-> Products P ON I.product_id = P.product_id
-> ORDER BY
-> total_value DESC;
+--------------+-------------+----------+-------------+
| product_name | description | quantity | total_value |
+--------------+-------------+----------+-------------+
| Water gun | NULL | 80 | 320.00 |
| Confetti | NULL | 119 | 178.50 |
| Glow Stick | NULL | 154 | 154.00 |
| Rice | NULL | 54 | 108.00 |
| Toast | NULL | 0 | 0.00 |
+--------------+-------------+----------+-------------+
It worked! Thanks to @@dolt_override_schema
, we were able to easily run our query on older data without having to manually adjust the query to account for the schema changes.
Limitations
Schema overriding works well with changes such as renaming, adding, and removing columns. However, as the schema changes become more complex, it's more likely that a query will fail when using schema overriding. One, probably obvious, example is when a column has been added and you're using that column in a condition in your query, such as a filter or a join condition. Since the column added in the new version of the schema won't exist in older commits, those values will be NULL
, so your conditions likely won't work correctly, since they depend on that column having real data. However, for many cases where the query doesn't use columns that were added in conditions, schema overriding can help make it easier to run queries on older data without having to rewrite them.
There are a few additional limitations when using schema overriding. If one of these is a total deal-breaker for you, let us know by sending us an issue on GitHub or dropping into our Discord server, and we'll be happy to see what we can do to help you out.
- Read-only – when a schema override is in place, the database becomes read-only in that session. Attempting to write data or execute DDL will result in an error about the database being read-only.
- System tables – Dolt system tables do not honor schema overrides.
- Collation changes – Collations affect how data is sorted, including the order rows are stored on disk. Mapping data from one collation to another collation requires extra processing to ensure the returned results are sorted according to the mapped collation. This extra processing is not supported yet, so collation changes will not appear when overriding a schema.
- Column defaults – If the overridden schema has added new columns with column defaults, those column defaults do not currently get applied when that column is queried.
- Untrackable column changes – Some column changes can't be tracked across commits currently. Dolt is able to track column renames and column type changes, but if a column is both renamed and has its type changed, then Dolt is not able to track that the new column is logically the same column as in the older commit.
Wrap Up
The ability to track all the data and schema changes in your database over its complete history is extremely powerful. Dolt provides many ways to use that history to compute diffs, audit changes, create branches and tags off of older data, and we're excited to add schema overriding to the list of ways to make it easier to access and use your historical data.
Schema overriding is a new feature, so we're eager to hear from you about how you're using it, any snags you hit, or any ideas you have for how we can improve it. Feel free to hit us up on GitHub with an issue or swing by our Discord server and chat with us.