Pull Request CI on DoltHub

DOLTHUBFEATURE RELEASE
13 min read

In November, we announced support for continuous integration (CI) testing on DoltHub, which lets users programmatically run checks on their databases. At the time of the announcement, these checks were limited to push events, but today we're excited to announce that CI checks on pull requests are now live!

Today's blog will cover everything you need to know to get started running CI tests on your DoltHub pull requests.

We will continue using the dolthub/options database from the previous blog, and update it to start running checks on pull requests.

Why run CI on databases?

Before we jump into the guide, you might be curious as to why you'd want to run CI checks against a database. To answer this, we can look at one of Dolt's proverbial parents, Git.

Git versions files and enables decentralized collaboration between users. Git users maintain their own local copies of a repository, then push edits to files in the repository to a shared remote, like GitHub or GitLab. Once pushed, these edits are tested via a CI platform and often human-reviewed before they're committed into a repository's main branch. This ensures only the highest quality changes are committed to main.

Since Dolt is "Git for data" and the same versioning power and decentralized collaboration that Git does, instead on databases, it also needs CI to ensure only high quality edits are made to a database's main branch. DoltHub provides a pull-request style workflow for reviewing and accepting changes the same way GitHub and GitLab do, but now also provides CI.

Over the years we've recognized that reviewing file-diffs and reviewing data-diffs are quite different, and data-diffs are much harder for humans to review. Ideally, data-diffs are "reviewed" programmatically, and this is where the power of CI testing on pull-requests comes in.

Users can define checks that run automatically on pull-requests submitted against their database. If these checks pass, maintainers of the database can know with much more certainty that the proposed changes are valid and ready to commit to the main branch. They'll no longer have to hand-sift through potentially large data-diffs in order to verify changes.

We wish we'd had this feature on DoltHub back when we ran data bounties and were reviewing dozens of pull-requests by hand. But it's available now, so let's dive in how you can set this up on your database.

Getting Started with CI on pull requests

First, make sure you have the latest Dolt version installed. At the time of this post, that's Dolt v1.44.2.

% dolt version
dolt version 1.44.2

Next, let's clone the dolthub/options database we used in our initial blog.

% dolt clone dolthub/options
cloning https://doltremoteapi.dolthub.com/dolthub/options
% cd options 
% dolt status
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean

Once cloned, we can use the dolt ci ls command to see what CI workflows this database has defined already.

% dolt ci ls
my first DoltHub workflow

This has just the one we created in our initial blog, called "my first DoltHub workflow". We can export the yaml file for this workflow by running the dolt ci export <workflow name command>.

% dolt ci export "my first dolthub workflow"
Dolt CI Workflow 'my first DoltHub workflow' exported to ./my_first_DoltHub_workflow.yaml.

This command will write the workflow yaml locally for us to view. Here are the contents:

name: "my first DoltHub workflow"
"on":
    push:
        branches:
            - "master"
jobs:
    - name: "validate tables"
      steps:
        - name: "assert expected tables exist"
          saved_query_name: "show tables"
          expected_rows: "== 2"
        - name: "assert table option_chain exists"
          saved_query_name: "option_chain exists"
        - name: "assert table volatility_history"
          saved_query_name: "volatility_history exists"
    - name: "validate schema"
      steps:
        - name: "assert 13 option_chain columns exist"
          saved_query_name: "check option_chain column length"
          expected_columns: "<= 13"
        - name: "assert call_put column exist"
          saved_query_name: "check option_chain.call_put exists"
          expected_columns: "== 1"
        - name: "assert 16 volatility_history columns exist"
          saved_query_name: "check volatility_history column length"
          expected_columns: ">= 16"
        - name: "assert act_symbol column exist"
          saved_query_name: "check volatility_history.act_symbol exists"
          expected_columns: "== 1"
    - name: "check data"
      steps:
        - name: "assert option_chain table has data"
          saved_query_name: "check option_chain data"
          expected_rows: "> 0"
        - name: "assert volatility_history table has data"
          saved_query_name: "check volatility_history data"
          expected_rows: "> 0"

Great! Before continuing, recall from our initial blog that we removed all data from the option_chain table, in order to show the "check data" Job failing. We can see this change is the most recent commit in the history, with the message "remove option_chain data".

% dolt log -n 2
commit gjo41are9nbo4ocq9faapickmqtum26q (HEAD -> master, remotes/origin/master) 
Author: ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ <dustin@dolthub.com>
Date:  Wed Nov 13 19:00:31 -0800 2024

        remove option_chain data

commit 81jeqladt762lan03u9salos8o3mvflo 
Author: ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ <dustin@dolthub.com>
Date:  Wed Nov 13 17:32:29 -0800 2024

        add more saved queries

%

Let's add all the table data back, by disposing of this commit. We can do that be resetting the prior commit.

% dolt reset --hard 81jeqladt762lan03u9salos8o3mvflo
% dolt status
On branch master
Your branch is behind 'origin/master' by 1 commit, and can be fast-forwarded.
  (use "dolt pull" to update your local branch)
%
% dolt sql -q "select count(*) from option_chain;"
+----------+
| count(*) |
+----------+
| 71203454 |
+----------+

Ok, all the option_chain data is back.

Next, since there were some schema changes to Dolt's internal CI tables, we'll need to re-initialize CI in our database using the latest Dolt version. Since we have the latest version installed already, we can do so now.

% dolt ci destroy 
% dolt ci init         
% dolt log -n 2
commit ioi0kj140lckgf07jopv4rn62362iei6 (HEAD -> master) 
Author: ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ <dustin@dolthub.com>
Date:  Wed Dec 11 13:54:14 -0800 2024

        Successfully initialized Dolt CI

commit asu595aoitj2vpbbthdjr5m58oar5gpi 
Author: ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ <dustin@dolthub.com>
Date:  Wed Dec 11 13:50:01 -0800 2024

        Successfully destroyed Dolt CI

%

The dolt ci destroy command removes all Dolt CI configuration in our database and creates a new Dolt commit. We then reinitialize CI with the dolt ci init command, which also creates a new commit. This will ensure we have the latest internal CI table schema.

Then, we can add our first workflow back into the database with the dolt ci import <workflow file> command.

dustin@Dustins-MacBook-Pro-3 options % dolt ci import my_first_DoltHub_workflow.yaml 
dustin@Dustins-MacBook-Pro-3 options % dolt ci ls
my first DoltHub workflow

This also creates a new Dolt commit with the workflow persisted successfully, and we can see it listed with the dolt ci ls command.

Now that we've migrated to the latest Dolt CI table schema, we're ready to add our new pull request workflow.

To do so, let's copy the my_first_DoltHub_workflow.yaml file and call it my_second_DoltHub_workflow.yaml.

cp my_first_DoltHub_workflow.yaml my_second_DoltHub_workflow.yaml

Next, let's edit the contents of our copy to be the following:

name: "my second DoltHub workflow"
"on":
    pull_request:
        branches:
            - "master"
jobs:
    - name: "check options_chain data"
      steps:
        - name: "assert option_chain table has not changed"
          saved_query_name: "check no changes to option_chain"
          expected_rows: "== 0"
    - name: "check volatility_history data"
      steps:
        - name: "assert volatility_history has only increased"
          saved_query_name: "check volatility_history data has only increased"
          expected_rows: "== 1"

Notice we've changed the name of the workflow to be "my second DoltHub workflow", and we've also changed the on block to trigger on a pull_request event instead of a push event. We want this workflow to run whenever pull requests are made against our master branch, so we've left the same branches block.

For simplicity, we've done away with the schema validation checks we had in "my first DoltHub workflow", and we've changed the two jobs that check table data.

The first Job is called "check options_chain data", and has a single Step defined named "assert option_chain table has not changed". Here, we've decided that we don't want any pull requests to change any data in the option_chain table at all. To do this, we'll define a saved query in the database called "check no changes to option_chain", and we'll expect the results of this saved query to have zero rows returned, but more on this later.

The second Job is called "check volatility_history data". This job also has a single Step defined called "assert volatility_history has only increased". Our goal with this check is to make sure any pull request we receive is only adding data to the volatility_history table. To do this, we'll define a saved query called "check volatility_history data has only increased". We'll also expect a single row returned from our saved query in order for the check to pass.

After we save these edits, let's go ahead and create the saved queries we've defined.

First, we'll run the following to create the "check no changes to option_chain" saved query with the following:

% dolt sql --save "check no changes to option_chain" -q "select * from dolt_diff('master', 'HEAD', 'option_chain');"

We create the saved query by using the dolt sql command with the --save option. This will write our query "select * from dolt_diff('master', 'HEAD', 'option_chain');" to the dolt_query_catalog table to be used during our workflow run.

The query itself is reading the dolt_diff system table, to determine if the option_chain table appears in the diff between the tip of the master branch, indicated by the use of "master" and the tip of the current branch, indicated by the use of "HEAD".

This query currently returns zero rows, and we expected all pull requests running this saved query to also return zero rows, which means there are no changes to the option_chain table in the pull request.

Next, we define our second saved query:

% dolt sql --save "check volatility_history data has only increased" -q "select 1 where (select count(*) from volatility_history) >= (select count(*) from volatility_history as of hashof('master'));"
+---+
| 1 |
+---+
| 1 |
+---+

We use the same dolt sql command as before, but our SQL query here is a bit more clever. Essentially, it checks whether the row count of volatility_history on our current branch is greater than or equal to the row count of volatility_history on the tip of master.

Well, normally, you might expect to simply use this query:

% dolt sql -q "select (select count(*) from volatility_history) >= (select count(*) from volatility_history as of hashof('master'));"
+---------------------------------------------------------------------------------------------------------------+
| (select count(*) from volatility_history) >= (select count(*) from volatility_history as of hashof('master')) |
+---------------------------------------------------------------------------------------------------------------+
| true                                                                                                          |
+---------------------------------------------------------------------------------------------------------------+

But, currently, saved query checks on DoltHub's CI can only count returned rows and columns. So, to work around this limitation, we can add the select 1 where part of the query, and get back a single row if the whole query evaluates to true, and get back no rows if the query evaluates to false.

# a `true` query returns a row
% dolt sql -q "select 1 where (select count(*) from volatility_history) >= (select count(*) from volatility_history as of hashof('master'));"
+---+
| 1 |
+---+
| 1 |
+---+

# a `false` query does not
% dolt sql -q "select 1 where (select count(*) from volatility_history) < (select count(*) from volatility_history as of hashof('master'));"

%

This is why we can expect a single row returned in the "assert volatility_history has only increased" Step defined in our workflow.

After adding our saved queries, we need to add and commit them to the database.

% dolt status
On branch master
Your branch is ahead of 'origin/master' by 3 commits.
  (use "dolt push" to publish your local commits)
	
Changes not staged for commit:
  (use "dolt add <table>" to update what will be committed)
  (use "dolt checkout <table>" to discard changes in working directory)
	modified:         dolt_query_catalog
% dolt add .
% dolt commit -m 'add saved queries for second workflow'
commit 2b0vucrama22if38h0k54g3iuoif2i5j (HEAD -> master) 
Author: ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ <dustin@dolthub.com>
Date:  Wed Dec 11 14:01:35 -0800 2024

        add saved queries for second workflow

%

Lastly, we need to import our second workflow into the database using the dolt ci import <workflow file> command, like we did with the first workflow.

% dolt ci import my_second_DoltHub_workflow.yaml 
% dolt log -n 1
commit en29rjnoln06qshgqdn7utqsip44ng8q (HEAD -> master) 
Author: ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ <dustin@dolthub.com>
Date:  Wed Dec 11 14:02:25 -0800 2024

        Successfully stored workflow: my second DoltHub workflow

%

Doing so will automatically create a new Dolt commit with the workflow stored. Okay! Now we just need to push our changes to DoltHub for them to go live. Note, we force push with the -f option since we removed a commit from the master branch earlier.

% dolt push origin master -f
- Uploading...
To https://doltremoteapi.dolthub.com/dolthub/options
 * [new branch]          master -> master

After pushing, we can go to DoltHub.

If we go to the commit log page on the database, we can see the newly added icon indicating that our pushed commit has CI jobs running for it.

Dolt CI Commit Page Running

And if we navigate to the Jobs tab, we can see the three jobs defined in our first workflow running.

Dolt CI Jobs Page Running

After a moment or so, the Jobs finish successfully, as we'd expect.

Dolt CI Jobs Page Succeeded

Dolt CI Commit Page Succeeded

Now let's create some pull requests to test out our second workflow checks.

On our local copy of dolthub/options let's checkout a new branch called dev-1, and pretend to be a developer who adds a new row to option_chain.

% dolt status                  
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean
% dolt checkout -b dev-1
Switched to branch 'dev-1'
% dolt sql -q "insert into option_chain values (now(), 'ABC', '2026-12-12', 10.00, 'Put', 10.00, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0);"
Query OK, 1 row affected (0.00 sec)
% dolt status
On branch dev-1

Changes not staged for commit:
  (use "dolt add <table>" to update what will be committed)
  (use "dolt checkout <table>" to discard changes in working directory)
	modified:         option_chain
% dolt add option_chain
% dolt commit -m 'add row to option_chain'
commit dmif4p3s2takdct9aq2ageaeh3c7lpg7 (HEAD -> dev-1) 
Author: ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ <dustin@dolthub.com>
Date:  Wed Dec 11 14:20:03 -0800 2024

        add row to option_chain

% dolt push origin dev-1
| Uploading...
To https://doltremoteapi.dolthub.com/dolthub/options
 * [new branch]          dev-1 -> dev-1
%

Now, let's go back to master then check out another branch called dev-2. On this branch we'll pretend to be a developer who wants to remove an entry from the volatility_history table.

% dolt checkout master
Switched to branch 'master'
% dolt checkout -b dev-2                  
Switched to branch 'dev-2'
% dolt sql -q "delete from volatility_history where act_symbol = 'ZWS' and date = '2024-11-11';"
Query OK, 1 row affected (0.00 sec)
% dolt status
On branch dev-2

Changes not staged for commit:
  (use "dolt add <table>" to update what will be committed)
  (use "dolt checkout <table>" to discard changes in working directory)
	modified:         volatility_history

% dolt add .
% dolt commit -m 'remove ZWS on Nov 11'
commit jd2n7q5mcpp4dp62dvk93h03v1b2vmt0 (HEAD -> dev-2) 
Author: ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ <dustin@dolthub.com>
Date:  Wed Dec 11 14:26:14 -0800 2024

        remove ZWS on Nov 11

% dolt push origin dev-2
- Uploading...
To https://doltremoteapi.dolthub.com/dolthub/options
 * [new branch]          dev-2 -> dev-2
%

And lastly', let's create one final branch from master called dev-3. Our third developer will actually not change the option_chain table, and will only make an insert into the volatility_history table.

% dolt checkout master
Switched to branch 'master'
% dolt checkout -b dev-3
Switched to branch 'dev-3'
% dolt sql -q "insert into volatility_history values (now(), 'DEF', 1.0, 1.0, 1.0, 1.0, '2024-11-30', 1.0, '2024-01-01', 1.0, 1.0, 1.0, 1.0, '2024-02-15', 1.0, '2024-05-25');"
Query OK, 1 row affected (0.00 sec)
% dolt status
On branch dev-3

Changes not staged for commit:
  (use "dolt add <table>" to update what will be committed)
  (use "dolt checkout <table>" to discard changes in working directory)
	modified:         volatility_history
% dolt add .
% dolt commit -m 'add DEF history'
commit pc4jssv7nl49anjqoi5dsktij57s2j40 (HEAD -> dev-3) 
Author: ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ๐Ÿ˜บ <dustin@dolthub.com>
Date:  Wed Dec 11 14:36:33 -0800 2024

        add DEF history

% dolt push origin dev-3
/ Uploading...
To https://doltremoteapi.dolthub.com/dolthub/options
 * [new branch]          dev-3 -> dev-3
%

Now let's head over and open pull requests from each new branch to master.

After opening the pull request, we can see the running checks we defined on the pull request page.

Dolt CI Pull Request Checks Running Pull Request Page

And these can also be seen in the Jobs tab.

Dolt CI Pull Request Checks Running Jobs Page

After a few moments, the checks complete and we can see that the dev-1 pull request failed the CI check that ensures table option_chain has no changes.

Dolt CI Pull Request Checks Failed Dev 1 Pull Request Page

Clicking the link to the Job detail page shows the error that zero rows were expected, but a row was in fact returned from our registered saved query.

Dolt CI Pull Request Checks Failed Dev 1 Job Detail Page

Now we can repeat this process for our second developer. We'll make a pull request from dev-2 to master , and see if it passes CI.

Dolt CI Pull Request Checks Failed Dev 2 Pull Request Page

And after a few moments we see that this change does not satisfy our CI tests, since they only allow additions to volatility_history and not deletions.

Finally, let's see how our third developer's pull request fairs.

Dolt CI Pull Request Checks Succeeded Dev 3 Pull Request Page

Looks like this developer got her contribution perfect, her pull request is ready to merge!

Conclusion

Although still in alpha CI on DoltHub is an exciting new feature we are rapidly improving. We hope updates like this one get you excited to start running CI on your DoltHub databases today!

If you'd like to discuss any particular use-cases you have for CI on your Dolt databases, or want to request a feature or two, we'd love to hear from you, so swing by our Discord.

Thanks for reading and don't forget to check out each of our cool products below:

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.