Managing Branches and Releases on DoltHub

DoltHub is the website where you can share, discover, and collaborate on Dolt version-controlled databases. DoltHub lets you edit data directly on the website, using SQL queries or a spreadsheet-like interface, on a separate working branch of your database. You can then create a PR for review and merge your changes.

We've been working hard to make Dolt accessible to all types of users. Today, we added the ability to manage branches and releases (tags) completely through DoltHub's UI.

In a previous blog post from May 2020, we used Dolt and its CLI to create an ML train-test split for the ubiquitous iris flower dataset. A year and a half later, it's now possible to repeat that process completely through the DoltHub UI. So let's try our hand at making a train-test split on DoltHub.

In this post, we'll first create a separate branch to house our split, then create the split using DoltHub's SQL editor, and finally create a release to save the split for future use.

Creating a new branch

I've loaded the iris dataset into this database, so if you're following along you should fork it first.

After forking the database, go to the fork's database tab, click View all branches, and then click Create new branch. I'm calling mine training-test-split.

[Screenshot: View all branches button]

[Screenshot: Training test split]

Now we have a separate branch where we can play with the dataset. We can alter its schema, edit data, or even drop tables without a worry in the world. Our changes are completely isolated from the main branch.

Next, let's add an is_test column using DoltHub's SQL editor and fill it using an 80/20 split.

Creating the split

First, make sure the training-test-split branch is selected in the branch selector. Then use an ALTER TABLE statement to add the column:

ALTER TABLE classified_measurements ADD is_test boolean;

[Screenshot: Alter table]

Hitting Run opens a new workspace. We can run additional SQL statements in the workspace so that all of our changes end up in a single PR.

[Screenshot: Workspace]

We can run another statement to randomly assign a value to is_test:

UPDATE classified_measurements SET is_test = rand() > 0.8;

[Screenshot: Populating is_test column]
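For readers who want to try the two statements locally before running them on DoltHub, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for Dolt, which is an assumption of this example and not how DoltHub executes queries. SQLite has no rand() function, so the sketch registers one backed by Python's random module; the id column and the 150 placeholder rows are also invented for illustration.

```python
import random
import sqlite3


def build_split(n_rows=150, seed=0):
    """Create a stand-in table, add is_test, and fill it with an 80/20 split."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    conn = sqlite3.connect(":memory:")
    # SQLite lacks rand(); register one so the post's UPDATE runs unchanged.
    conn.create_function("rand", 0, rng.random)
    conn.execute("CREATE TABLE classified_measurements (id INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO classified_measurements (id) VALUES (?)",
        [(i,) for i in range(n_rows)],
    )
    # The same two statements used in the DoltHub SQL editor.
    conn.execute("ALTER TABLE classified_measurements ADD is_test boolean")
    conn.execute("UPDATE classified_measurements SET is_test = rand() > 0.8")
    return conn


def split_counts(conn):
    """Return {is_test value: row count} as a quick sanity check."""
    rows = conn.execute(
        "SELECT is_test, COUNT(*) FROM classified_measurements GROUP BY is_test"
    )
    return dict(rows.fetchall())
```

Calling split_counts(build_split()) returns a dictionary keyed by 0 (training) and 1 (test). Because each row is assigned independently, the test share is only approximately 20%; on a table this small it can land several rows either side of that.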

After running the query, we can look at the diff of our entire workspace. We can see both the schema changes:

[Screenshot: Schema diff]

And the data changes:

[Screenshot: Data diff]

All we have to do next is open a PR and merge it:

[Screenshot: PR]

Overall it's pretty effortless to create a train-test split on DoltHub.

Creating a release

Now that we have our train-test split, it's worth tagging the specific Dolt commit so that we can get back to this exact database state in the future. If you're testing a new training strategy, you'll want the same split available to compare the strategy's performance against earlier runs.
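One reason a saved split matters: the assignment is random, so running it again will generally not reproduce the same test set. The small Python sketch below illustrates that effect. It models rand() as independent uniform draws, which is an assumption of this example, and the function name and seeds are hypothetical.

```python
import random


def assign_split(n_rows, rng, threshold=0.8):
    """Mimic `is_test = rand() > 0.8`: mark each row as test independently."""
    return [rng.random() > threshold for _ in range(n_rows)]


# Two independent "runs" of the assignment over the same 150 rows.
first = assign_split(150, random.Random(1))
second = assign_split(150, random.Random(2))

# Rows whose train/test label differs between the two runs.
differing = sum(a != b for a, b in zip(first, second))
print(f"{differing} of 150 rows changed sides between runs")
```

Any model evaluated on the second run would be scored against a different test set, which is why pinning the split to a release makes later comparisons fair.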

Creating a release is easy. Just go to the release page, click Create a release, and select the branch that has the data you need to save. Then, just add a description about what the release contains and save it! Here's mine:

[Screenshot: Releases]

Merging upstream changes

If changes land on the main branch, it can be useful to pull them into your train-test split. In this example, I've added an entirely new column to the data called stem_length_cm:

[Screenshot: New stem length column]

To pull that column into our train-test split, we can open a PR. We set the base branch to training-test-split and the from branch to main. After merging, we can see that both the new column and our is_test column are in the schema!

[Screenshot: New column and old]

Similarly, if new rows are added upstream, we can pull those changes in as well. The is_test column would be null for the new rows, but we can simply randomize those values afterwards.
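The randomizing step can be limited to the rows that are still null so that existing assignments stay put. Here is a minimal sketch, using Python's sqlite3 as a local stand-in for Dolt with a registered rand() helper (both are assumptions of this example); the five rows are placeholders.

```python
import random
import sqlite3


def backfill_is_test(conn):
    """Assign is_test only where it is NULL; return the number of rows updated."""
    cur = conn.execute(
        "UPDATE classified_measurements "
        "SET is_test = rand() > 0.8 "
        "WHERE is_test IS NULL"
    )
    return cur.rowcount


def demo():
    conn = sqlite3.connect(":memory:")
    conn.create_function("rand", 0, random.random)  # stand-in for SQL rand()
    conn.execute(
        "CREATE TABLE classified_measurements "
        "(id INTEGER PRIMARY KEY, is_test boolean)"
    )
    # Rows 1-3 already belong to the split; rows 4-5 arrived from upstream.
    conn.executemany(
        "INSERT INTO classified_measurements (id, is_test) VALUES (?, ?)",
        [(1, 0), (2, 1), (3, 0), (4, None), (5, None)],
    )
    existing = "SELECT id, is_test FROM classified_measurements WHERE id <= 3 ORDER BY id"
    before = conn.execute(existing).fetchall()
    updated = backfill_is_test(conn)
    after = conn.execute(existing).fetchall()
    remaining = conn.execute(
        "SELECT COUNT(*) FROM classified_measurements WHERE is_test IS NULL"
    ).fetchone()[0]
    return before, after, updated, remaining
```

The WHERE is_test IS NULL clause is the design choice that matters here: without it, the UPDATE would reshuffle every row and the split saved in the release would no longer match the branch.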

Conclusion

DoltHub is the place to share your Dolt databases with other people. Without touching the Dolt CLI, we can make arbitrary edits to SQL data on an isolated branch and tag the entire state of the database without fully duplicating it. If changes occur upstream, you can merge them in through a PR, which keeps the process predictable and reversible. Working with data on DoltHub is safe.

In the near future, we'll add support for creating a branch or tag from a specific commit. If you're using Dolt to ingest third-party data, checking out a specific commit on DoltHub can help you debug if the data changes unexpectedly.

If you're worried about placing your data on DoltHub, check out DoltLab, the self-hosted DoltHub.

If you have any feature suggestions or just want to chat, come talk to us on Discord!