Introducing Workspace Updates

August 29, 2024

6 min read

A couple weeks ago I reviewed the Git Workspace, and I also covered how Dolt has the same feature. If you are unfamiliar with what the Workspace is, I suggest you read the review.

Today, I'm pleased to announce that Dolt, the world's first SQL database which supports branch and merge, has a new mechanism which allows users to better control their Workspace. The dolt_workspace_$tablename system tables provided in Dolt 1.42.16 give you total control to see and modify data on top of HEAD. Read the Documentation

This feature is aimed at the curators of data. It's for people who gather a bunch of data and then need to sort through it, categorize it, clean it up, and possibly throw some of it out. This also a response to Dolt users requesting dolt add --patch capabilities. We'll be updating the CLI in the future, but the ability to do it in SQL is a precursor to that.

Let's get into it with an example!

The Scenario

You are attempting to gather data about the real world from crowdsourced information. The data is of dubious quality, and you boil all of your information down to three columns of information in the following table:

CREATE TABLE ratings (
    id INT NOT NULL AUTO_INCREMENT,
    rating FLOAT CHECK (rating >= 0 AND rating <= 100),
    confidence FLOAT CHECK (confidence >= -1.0 AND confidence <= 1.0),
    PRIMARY KEY (id)
    );

To make it a little more tangible, the ID could refer a real world entity like a store or park in your city. You'd probably make it a foreign key to another table which has more specifics, but you get the point. Each rating is a number between 0 and 100, provided by your crowdsourcing efforts. The confidence is how much you trust the rating (-1 to 1). This is based on your super cool AI thing-a-ma-jig, or whatever. The idea is that updates to ratings will be made, and you are only going to save the updates if your confidence is higher than what you already have.

Seed the table with 10 rows, all with 0.0.

INSERT INTO ratings (rating, confidence) VALUES (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0);

Record this state in a commit. Note this includes the creation of the table and the insertion of the data.

VHS 1

Manipulating the Data

For the purposes of this demo, I'm going to update every row with some random values. Real data could be brought in with dolt table import or some other process you have that gathers data before you commit it. For our purposes we are going to roll the dice on everything by running the following:

UPDATE ratings SET rating = RAND() * 100, confidence = (RAND() * 2 - 1);

Using features Dolt has had for awhile, such as select * from dolt_diff_ratings or the dolt status CLI command, you could see what had changed. The new table being announced, dolt_workspace_$tablename allows you not only to see what has changed in your workspace, but also update it in specific ways.

SELECT * from dolt_workspace_ratings;

+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+
| id | staged | diff_type | to_id | to_rating | to_confidence | from_id | from_rating | from_confidence |
+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+
| 0  | false  | modified  | 1     | 3.0801456 | 0.15217814    | 1       | 0           | 0               |
| 1  | false  | modified  | 2     | 95.40468  | -0.79320604   | 2       | 0           | 0               |
| 2  | false  | modified  | 3     | 42.35977  | 0.16161437    | 3       | 0           | 0               |
| 3  | false  | modified  | 4     | 86.98194  | 0.10231206    | 4       | 0           | 0               |
| 4  | false  | modified  | 5     | 30.11406  | -0.67054313   | 5       | 0           | 0               |
| 5  | false  | modified  | 6     | 1.5826403 | 0.3439357     | 6       | 0           | 0               |
| 6  | false  | modified  | 7     | 73.59878  | -0.35658965   | 7       | 0           | 0               |
| 7  | false  | modified  | 8     | 66.67985  | -0.03970625   | 8       | 0           | 0               |
| 8  | false  | modified  | 9     | 81.421394 | -0.81469285   | 9       | 0           | 0               |
| 9  | false  | modified  | 10    | 55.632282 | -0.43831292   | 10      | 0           | 0               |
+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+

The first three columns of the table are always: 1) an ID, 2) a boolean, true if staged, and 3) a diff type ("added","modified","removed"). The table is a union of the difference between HEAD and STAGED, and STAGED and WORKING. All columns after the first three come in pairs related to the source schema. So there are from_ and to_ columns which correspond to the columns we created in our demo table. Read the Documentation

Starting from confidence of 0.0 in all rows, you can see that the to_confidence column has some rows where the confidence has gone up, and others where it's lower. Select all of those that have gone up:

SELECT * from dolt_workspace_ratings WHERE from_confidence < to_confidence;

+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+
| id | staged | diff_type | to_id | to_rating | to_confidence | from_id | from_rating | from_confidence |
+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+
| 0  | false  | modified  | 1     | 3.0801456 | 0.15217814    | 1       | 0           | 0               |
| 2  | false  | modified  | 3     | 42.35977  | 0.16161437    | 3       | 0           | 0               |
| 3  | false  | modified  | 4     | 86.98194  | 0.10231206    | 4       | 0           | 0               |
| 5  | false  | modified  | 6     | 1.5826403 | 0.3439357     | 6       | 0           | 0               |
+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+

Let's update the staged column for all of rows where our confidence has gone up:

UPDATE dolt_workspace_ratings SET staged = TRUE where from_confidence < to_confidence;
SELECT * FROM dolt_workspace_ratings;

+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+
| id | staged | diff_type | to_id | to_rating | to_confidence | from_id | from_rating | from_confidence |
+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+
| 0  | true   | modified  | 1     | 3.0801456 | 0.15217814    | 1       | 0           | 0               |
| 1  | true   | modified  | 3     | 42.35977  | 0.16161437    | 3       | 0           | 0               |
| 2  | true   | modified  | 4     | 86.98194  | 0.10231206    | 4       | 0           | 0               |
| 3  | true   | modified  | 6     | 1.5826403 | 0.3439357     | 6       | 0           | 0               |
| 4  | false  | modified  | 2     | 95.40468  | -0.79320604   | 2       | 0           | 0               |
| 5  | false  | modified  | 5     | 30.11406  | -0.67054313   | 5       | 0           | 0               |
| 6  | false  | modified  | 7     | 73.59878  | -0.35658965   | 7       | 0           | 0               |
| 7  | false  | modified  | 8     | 66.67985  | -0.03970625   | 8       | 0           | 0               |
| 8  | false  | modified  | 9     | 81.421394 | -0.81469285   | 9       | 0           | 0               |
| 9  | false  | modified  | 10    | 55.632282 | -0.43831292   | 10      | 0           | 0               |
+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+

Now, we can commit those values alone:

VHS 2

As you can see, after the commit, there are still changes in the workspace. All of them have a confidence under 0.0 though, which is why we didn't want to record them in the commit. Now low confidence results are all that remain. Let's roll the dice again!

UPDATE ratings SET rating = RAND() * 100, confidence = (RAND() * 2 - 1);
SELECT * FROM dolt_workspace_ratings;

+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+
| id | staged | diff_type | to_id | to_rating | to_confidence | from_id | from_rating | from_confidence |
+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+
| 0  | false  | modified  | 1     | 35.21392  | 0.78389084    | 1       | 3.0801456   | 0.15217814      |
| 1  | false  | modified  | 2     | 91.48664  | 0.9813288     | 2       | 0           | 0               |
| 2  | false  | modified  | 3     | 3.185655  | 0.013404363   | 3       | 42.35977    | 0.16161437      |
| 3  | false  | modified  | 4     | 5.096769  | 0.99090713    | 4       | 86.98194    | 0.10231206      |
| 4  | false  | modified  | 5     | 36.321125 | -0.039015874  | 5       | 0           | 0               |
| 5  | false  | modified  | 6     | 16.134932 | -0.7760689    | 6       | 1.5826403   | 0.3439357       |
| 6  | false  | modified  | 7     | 2.8987882 | -0.19788638   | 7       | 0           | 0               |
| 7  | false  | modified  | 8     | 81.74978  | 0.90599895    | 8       | 0           | 0               |
| 8  | false  | modified  | 9     | 38.751938 | -0.059418466  | 9       | 0           | 0               |
| 9  | false  | modified  | 10    | 0.9049674 | -0.3466417    | 10      | 0           | 0               |
+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+

You can do this repeatedly if you like. Eventually the confidence numbers will all be pretty high!

VHS 3

Reverting Data

After doing this a few times, you may end up with a workspace which is dirty because all of your confidence numbers are high. To clean all the junk data in a table, until now your only option was to dolt_revert() or edit the rows manually. That would probably be fine in this case, however if you need a more powerful tool, the dolt_workspace_ tables allow you to selectively delete some updates and leave others alone:

select * from dolt_workspace_ratings;

+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+
| id | staged | diff_type | to_id | to_rating | to_confidence | from_id | from_rating | from_confidence |
+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+
| 0  | false  | modified  | 2     | 68.673096 | -0.9563424    | 2       | 27.938786   | 0.7924695       |
| 1  | false  | modified  | 5     | 21.866325 | 0.02032262    | 5       | 76.1308     | 0.6638315       |
| 2  | false  | modified  | 6     | 92.221115 | 0.30626732    | 6       | 1.5826403   | 0.3439357       |
| 3  | false  | modified  | 7     | 68.06012  | -0.6727714    | 7       | 39.26927    | 0.7973392       |
| 4  | false  | modified  | 8     | 59.64711  | -0.04276629   | 8       | 41.079918   | 0.71927875      |
| 5  | false  | modified  | 9     | 3.0771728 | 0.5260062     | 9       | 63.266666   | 0.85197574      |
| 6  | false  | modified  | 10    | 16.337116 | 0.19719383    | 10      | 42.65101    | 0.61795044      |
+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+

Just to demonstrate that you can use whatever query logic you want to, I'll delete all of the updates to rows with even sequence IDs.

delete from dolt_workspace_ratings where from_id % 2 = 0;

+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+
| id | staged | diff_type | to_id | to_rating | to_confidence | from_id | from_rating | from_confidence |
+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+
| 0  | false  | modified  | 5     | 21.866325 | 0.02032262    | 5       | 76.1308     | 0.6638315       |
| 1  | false  | modified  | 7     | 68.06012  | -0.6727714    | 7       | 39.26927    | 0.7973392       |
| 2  | false  | modified  | 9     | 3.0771728 | 0.5260062     | 9       | 63.266666   | 0.85197574      |
+----+--------+-----------+-------+-----------+---------------+---------+-------------+-----------------+

It's important to note that deleting rows from WORKING is the same thing as revert --hard - you can't get the data back. It's a powerful tool to get back to a previous state; use wisely. In fact, this is a capability that git add --patch doesn't have. Power tool indeed!

What's Next

We will bring the --patch flag to the dolt add CLI command. It will use the tables described here to enable users to perform its work. Join us on Discord to tell us what you'd like to build!

Blog

The Scenario

Manipulating the Data

Reverting Data

What's Next

Get started with Dolt