Data Version Control and Dolt Reproducibility

April 16, 2021

11 min read

Introduction

Dolt and DVC are often compared because of because of DVC's name, Data Version Control. Dolt also does "data version control". So what's the difference? Well, DVC is a version controlled machine learning workflow manager and Dolt is a SQL database with Git-style versioning. The two can be used together to version machine learning pipelines. This blog will illustrate how.

Background

The machine learning tooling space has seen hundreds of new projects budding over the last few years. Some projects, like Blue River, expand the capabilities of specific verticals, like autonomous tractors. Others like Weights and Biases support specific stages in the ML lifecycle, like training and result tracking.

"Data versioning and reproducibility" sounds simple in theory, but can be operationally complex.

Consider two versioning tools: Dolt and Data Version Control.

Dolt is an SQL database with Git-versioning semantics. Tabular datasets in Dolt gain the powers of production databases and Git. Dolt inherits the scalability, schema reliability and the SQL query language of relational databases. From Git Dolt adopts row-level commits, diffs, and merges, and a full-scope of logging in the form of system tables.

Data Version Control (DVC) uses workflow files to support team collaboration of Git source code and remote object stores.

Because Dolt is an unopinionated data store, and DVC is an opinionated workflow manager, we believe the two support one-another. We built an integration to show how this works in practice.

Tutorial

Overview

We will train an image classifier with model code taken from the Tensorflow docs.

MNIST is a standard image classification dataset composed of handwritten digits.

Fashion-MNIST is a drop-in replacement for MNIST with greyscale images of clothing. The designers of Fashion-MNIST believe MNIST is 1) too easy, 2) overused, and 3) non-representative or modern computer vision tasks.

Fashion-MNIST
sample

We chose this to tutorial to leverage Dolt and DVC’s strengths:

DVC documents workflows by recording commands and the metadata of files synced to remote file-systems.
Dolt is a versioned SQL database with novel features for managing strictly-typed tabular data.

We divide input and output dependencies between Dolt and DVC according to those strengths.

We store tabular data in Dolt:

Training and testing labels
Prediction results (guessed and actual labels)

We store images and binary blobs in DVC:

Training and testing images
Model weights and class pickle

data/labels
data/predictions
data/images
├── t10k-images-idx3-ubyte.gz
├── train-images-idx3-ubyte.gz
data/model
├── assets
├── saved_model.pb
└── variables
    ├── variables.data-00000-of-00001
    └── variables.index

Finally, we will use Dolt to inspect results of several training runs:

> dolt sql -q "select * from summary"
+-----------+--------------+--------------------------------------+--------+
| loss      | label_branch | timestamp                            | acc    |
+-----------+--------------+--------------------------------------+--------+
| 1.6994121 | fashion      | 2021-04-02 10:29:12.840018 +0000 UTC | 0.5352 |
| 1.580542  | master       | 2021-04-02 10:43:14.570946 +0000 UTC | 0.6217 |
+-----------+--------------+--------------------------------------+--------+

If you are interested in digging more into the model training, refer to the training tutorial.

Prediction
sample

Setup

Install dolt

sudo bash -c 'curl -L https://github.com/dolthub/dolt/releases/latest/download/install.sh | sudo bash'

Clone and install DVC's dolt-integration branch:

> git clone git@github.com:dolthub/dvc.git
> cd dvc
> git checkout dolt/integration
> pip install --user .

Initialize a separate demo folder:

> mkdir -p dvc-dolt-demo/{data/images,scratch/mnist}
> cd dvc-dolt-demo
> git init
> dvc init

Finally, let's download training code adapted from Tensorflow documentation.

> dvc get git@github.com:dolthub/dolt-integrations.git dolt_integrations/dvc/examples/fashion-mnist/train.py

Please reference the Tensorflow docs for tensorflow, keras and other training dependencies.

Access Data

DVC

Downloading images with DVC is similar to using cURL to download a file:

> dvc get git@github.com:zalandoresearch/fashion-mnist.git data/fashion --out scratch/
> dvc get-url http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz scratch/mnist/
> dvc get-url http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz scratch/mnist/

"Adding" a file to DVC records metadata and replaces that filename with a symlink to a DVC cache elsewhere on our system:

> cp scratch/mnist/* data/images/
> dvc add data/images
100% Add|█████████████████████|1/1 [00:01,  1.45s/file]

To track the changes with git, run:

	git add data/.gitignore data/images.dvc

If we pulled files from an existing DVC repo the add step would be performed automatically.

Dolt

Getting a Dolt table is as simple as running a clone command:

> dolt clone max-hoffman/mnist data/labels

If you are not starting with an existing database, you would need to write a small program to import the data, like this:

In [1]: import os
   ...: import gzip
   ...: import numpy as np
   ...: import doltcli as dolt

In [2]: with gzip.open("scratch/mnist/train-labels-idx1-ubyte.gz","r") as f:
   ...:     f.read(8) # first two bytes are padded zeros
   ...:     buf = f.read()
   ...:     labels = np.frombuffer(buf, dtype=np.uint8).astype(np.int64)

In [2]: os.makedirs("mnist", exist_ok=True)
   ...: dolt.Dolt.init("mnist")
   ...: db = dolt.Dolt("mnist")
   ...: n = len(labels)
   ...: dolt.write_columns(db, "labels", dict(row_id=range(n), train=[True]*n, label=labels), primary_key=["row_id"])
   ...: db.sql("select DOLT_COMMIT('-am', 'Add MNIST labels')")

DVC "adding" a Dolt database creates a .dvc metadata file, but bypasses the cache because Dolt has its own storage format:

dvc add data/labels
100% Add|█████████████████████|1/1 [00:01,  1.63s/file]

To track the changes with git, run:

	git add data/.gitignore data/labels.dvc

DVC

We will push files to a local remote folder (this works similarly with an S3 URL):

> mkdir -p /tmp/dvc-remote
> dvc remote add dvc_remote /tmp/dvc-remote
> dvc push data/images -r dvc_remote

If a co-worker added new data, we would first use Git to reference new DVC metadata:

> git pull

Then download remotes into our cache:

> dvc pull -r dvc_remote

and finally checkout the new data according to the updated data/images.dvc spec:

> dvc checkout data/images

Dolt

The Dolt-DVC integration channels remote commands through dolt equivalents. Pushing and pulling Dolt databases with DVC commands works the same as regular files:

> mkdir -p /tmp/dolt-remote
> dvc remote add dolt_remote /tmp/dolt-remote
> dvc push data/labels -r dolt_remote
1 file pushed

Dolt restricts which branches are pushed by default. If you want the second fashion branch pushed:

> cd data/labels
> dolt checkout fashion
> dolt push -u dolt_remote fashion

Remove and restore the database with both branches:

> rm -rf data/labels
> dvc pull data/labels -r dolt_remote
1 file fetched

If you accidentally switch up the MNIST images and labels, and fall back to random accuracy of prediction:

dolt sql -q "select * from summary"
+--------------------------------------+--------------+----------+--------+----------------------------------+
| timestamp                            | label_branch | loss     | acc    | label_commit                     |
+--------------------------------------+--------------+----------+--------+----------------------------------+
| 2021-04-04 10:09:03.023317 +0000 UTC | fashion      | 2.363709 | 0.1022 | pgubqbqgtvtmchilmb2btih47869dgt8 |
+--------------------------------------+--------------+----------+--------+----------------------------------+

you can use checkout to re-synchronize the appropriate labels:

> dvc checkout

Lineage

Before running our workflow, lets setup one more database for the prediction results:

> mkdir -p data/predictions
> cd data/predictions
> dolt init

And we are ready to run a training example:

> dvc run \
    -n train \
    -d data/images \
    -d data/labels \
    -d train.py \
    -o data/model \
    -o data/predictions \
    python train.py
Epoch 1/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.6889 - accuracy: 0.5248
Epoch 2/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.5610 - accuracy: 0.5818
Epoch 3/10
1875/1875 [==============================] - 2s 986us/step - loss: 1.5356 - accuracy: 0.5898
Epoch 4/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.5066 - accuracy: 0.6006
Epoch 5/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.4952 - accuracy: 0.6038
Epoch 6/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.4926 - accuracy: 0.6029
Epoch 7/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.4769 - accuracy: 0.6067
Epoch 8/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.4548 - accuracy: 0.6127
Epoch 9/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.4657 - accuracy: 0.6100
Epoch 10/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.4414 - accuracy: 0.6138
313/313 - 0s - loss: 1.5174 - accuracy: 0.6027

A dvc run command translates to a stage in our dvc.yaml workflow file:

> cat dvc.yml
schema: '2.0'
stages:
  train:
    cmd: python -d train.py train.py
    deps:
    - path: data/images
      md5: 9619f1bb9b50d0195a0af00656fb8bf4.dir
      size: 11594788
      nfiles: 5
    - path: data/labels
      md5: pgubqbqgtvtmchilmb2btih47869dgt8-gl6c87hhtsnl3ktqd2u9r780c1hh2a38.dolt
      size: 1465570
    outs:
    - path: data/model
      md5: 4c251d267625d254b2f7edde800de85b.dir
      size: 1303212
      nfiles: 3
    - path: data/predictions
      md5: 6v00se9d2betcpadruk68trel1lmps0q-idanfgens9epcrqg5hdgnhq0bpif0icn.dolt
      size: 260528

DVC includes helper commands to visualize the pipeline:

> dvc dag
+-----------------+         +-----------------+
| data/images.dvc |         | data/labels.dvc |
+-----------------+         +-----------------+
                **            **
                  **        **
                    **    **
                   +-------+
                   | train |
                   +-------+

Reproducibility

As a last step we will swap MNIST for the structurally-identical Fashion-MNIST.

First, we swap the DVC image files:

> cp scratch/fashion/* data/images

On the Dolt-side we checkout the fashion branch:

> cd fashion-mnist
> dolt checkout fashion
> cd ..

DVC picks up on these changes:

> dvc status
train:
	changed deps:
		modified:           data/images
		modified:           data/labels
data/labels.dvc:
	changed outs:
		modified:           data/labels
data/images.dvc:
	changed outs:
		modified:           data/images

Which we can fix:

> dvc add data/images --quiet
> dvc add fashion-mnist --quiet

DVC re-triggers our workflow to generate new outputs to match the updated fashion images and labels:

> dvc repro
'data/images.dvc' didn't change, skipping
'data/labels.dvc' didn't change, skipping
> python train.py
Epoch 1/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.5923 - accuracy: 0.5852
Epoch 2/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.4209 - accuracy: 0.6550
Epoch 3/10
1875/1875 [==============================] - 2s 979us/step - loss: 1.3852 - accuracy: 0.6608
Epoch 4/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.3795 - accuracy: 0.6590
Epoch 5/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.3458 - accuracy: 0.6659
Epoch 6/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.3354 - accuracy: 0.6681
Epoch 7/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.3177 - accuracy: 0.6686
Epoch 8/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.3071 - accuracy: 0.6689
Epoch 9/10
1875/1875 [==============================] - 2s 1ms/step - loss: 1.2764 - accuracy: 0.6741
Epoch 10/10
1875/1875 [==============================] - 2s 998us/step - loss: 1.2752 - accuracy: 0.6725
313/313 - 0s - loss: 1.7931 - accuracy: 0.5620
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock
Use `dvc push` to send your updates to remote storage.

Technical Discussion

After Training

Data science iterations usually benefit from tracking differences between input and output datasets.

MINST and Fashion-MNIST are different sets of images, but random chance leaves about 1/N labels the same between the two that Dolt opportunistically deltas:

> dolt diff fashion --summary
diff --dolt a/labels b/labels
--- a/labels @ h6a5mj4rle852jj2133dtl0ofi6hdsqf
+++ b/labels @ bcvu38tkg94vmvua2qi1c9rpsg4tktpn
7,038 Rows Unmodified (10.05%)
0 Rows Added (0.00%)
0 Rows Deleted (0.00%)
62,962 Rows Modified (89.95%)
62,962 Cells Modified (29.98%)
(70,000 Entries vs 70,000 Entries)

The motivation behind Fashion-MNIST, that models overfit digits and shirts are more difficult to categorize, can be seen by comparing accuracies between training runs:

> dolt sql -q "with t as (
  select
    case
      when p1.pred = p2.actual then 1
      else 0
    end as correct,
    p1.actual,
    p1.commit_hash
  from dolt_history_predictions p1
  join dolt_history_predictions p2
    on p1.row_id = p2.row_id and
       p1.commit_hash = p2.commit_hash
  )
  select
    sum(correct)/count(*),
    t.commit_hash,
    count(*) as row_number
  from t
  group by commit_hash"
+-----------------------+------------+----------------------------------+
| sum(correct)/count(*) | row_number | commit_hash                      |
+-----------------------+------------+----------------------------------+
| 0.6217                | 10000      | vq01vp8o1fsdq95eo4sutmppfhovejfv |
| 0.5352                | 10000      | foo013e1hgej8oa0282bv1ecb9280mlh |
+-----------------------+------------+----------------------------------+

We use the dolt_history table above to access every version of the prediction table, compare results, and calculate overall accuracy. This query can be broken down into 1) a with block that adds a correct column that's either 0 or 1, and 2) a select statement that performs accuracy arithmetic, and 3) a group by to bucket accuracy by training run. Computing every training accuracy requires the history table because predictions are replaced every iteration.

We also added a summary row each training run, an easier way to incrementally generate logs:

> > dolt sql -q "select * from summary"
+-----------+--------------+--------------------------------------+--------+
| loss      | label_branch | timestamp                            | acc    |
+-----------+--------------+--------------------------------------+--------+
| 1.6994121 | fashion      | 2021-04-02 10:29:12.840018 +0000 UTC | 0.5352 |
| 1.580542  | master       | 2021-04-02 10:43:14.570946 +0000 UTC | 0.6217 |
+-----------+--------------+--------------------------------------+--------+

One instance where we might want to access the commit history is dolt blame’ing a data changes. For example, identifying who added certain rows to the fashion branch:

dolt blame fashion labels | head -n 5
+--------+----------------------------+-----------------+------------------------------+----------------------------------+
| ROW_ID | COMMIT MSG                 | AUTHOR          | TIME                         | COMMIT                           |
+--------+----------------------------+-----------------+------------------------------+----------------------------------+
| 14648  | Add Fashion labels         | Bojack Horseman | Wed Mar 31 16:20:50 PDT 2021 | 5fblpjp5neurvsfp989s6ea9lt01vd2q |
| 27051  | Add Fashion labels         | Bojack Horseman | Wed Mar 31 16:20:50 PDT 2021 | 5fblpjp5neurvsfp989s6ea9lt01vd2q |

You can learn more about how to use Dolt’s Git and SQL features here.

How Dolt Integrates With DVC

Adding

DVC’s main metadata is the md5-hash used to sync local and remote filesystems. Usually, DVC hashes files and directories the same way as Git. To avoid duplicating what Dolt does internally, we substituted Dolt’s commits in-lieu of an md5 hash.

> cat data/labels.dvc
outs:
- md5: tkcao72e0upe2umb2kokartgivd9keqc-pinisl8m4tfdecb8iikqhicta61nou9s.dolt
  size: 1465033
  path: labels
  cache: false

Dolt's three heads tell us 1) the last commit, 2) the hash if all rows in our working database were committed now , 3) the ongoing index/staging commit hash. The first two, HEAD and working, let us monitor whether the database has changed since the most recent DVC-add. If necessary, the HEAD commit can be used to checkout the appropriate database version.

Commits

In DVC, output lineage is captured as Git-committed YAML files. Pre-defined output paths are saved as-is when a workflow completes.

Dolt users are responsible for committing changes. If a new database state is committed within a workflow, DVC will track the new commit. If a tracked database is changed but not committed by the end of a workflow, then we have an uncommitted transaction -- a state that Dolt cannot reproduce.

> dolt status
On branch master
Changes not staged for commit:
  (use "dolt add <table>" to update what will be committed)
  (use "dolt checkout <table>" to discard changes in working directory)
	modified:       summary

> dvc add data/predictions
Adding...
ERROR: unexpected error - Dolt status is not clean; commit a reproducible state before adding.

Fixing this error requires committing changes within the workflow:

preddb = dolt.Dolt("data/predictions")
...
preddb.sql(f"select dolt_commit('-am', 'New workflow run at {ts}')")

Dolt's benefits flow downstream of Dolt's versioning format, the commit.

Remotes

There are several differences between Dolt and DVC remotes

A DVC remote is a content-addressed-store, while the Dolt-remote maintains a commit history and database.
A Dolt remote could be hosted on DoltHub, in the same way Git remotes are often hosted on GitHub.
DVC and Dolt will store internal references to remotes created by DVC commands.

Conclusion

We walked through a Fashion-MNIST tutorial to show how Dolt and DVC can collaborate to offer more features to users. Dolt excels at reproducibility for tabular datasets. DVC creates a process by which data scientists can better organize their work and collaborate.

A more astute observer of this blog might ask, why not merge these changes into DVC? Well, we had to make a bunch of hacks to make this work. The proper way to do this is for DVC to support some sort of extensible storage architecture. If you like what you see and want Dolt and DVC to work together, let either of us know. If there is a market for this, we'll build it collectively together.

Blog

Introduction

Background

Tutorial

Overview

Setup

Access Data

DVC

Dolt

Remotes and Sharing

DVC

Dolt

Lineage

Reproducibility

Technical Discussion

After Training

How Dolt Integrates With DVC

Adding

Commits

Remotes

Conclusion

Get started with Dolt