Comparing Dolt Backups with Remotes
Whether due to clumsiness, physical damage, or malicious actors, the worst thing that can happen to your data is irretrievable loss. But this is a feature release, not a postmortem, and we are excited to announce Dolt backups!
The CLI now includes dolt backup add to create, dolt backup sync to save, and dolt backup restore to recover the contents of your database.
Why back up a Dolt database?
Dolt is a versioned database with several layers of fault tolerance, just like Git. You can reverse a drop table command with dolt reset --hard HEAD, undoing changes and restoring the previous state. Individual clones can even sudo rm -rf a repo safely, as long as the most recent data was saved to a remote with dolt push origin main. dolt clone <remote-url> safely restores your previous state.
With all these recovery features, why would you also need Dolt backups?
Clones aren't copies! There are important differences between rsyncing your .dolt folder and pushing main. In this blog we will dig into when you would use each, and how to use the new backups feature in Dolt to complement remotes.
Remotes are not backups
Remotes and backups are similar at first glance. They both copy data incrementally between remote address spaces. The storage and transmit format is the same. And you can push every branch, head, and tag to a remote. Remotes are familiar and convenient, guard against most varieties of data loss, and are designed for flexibility. Most users will be comfortable sticking to remotes.
But remotes aren't always enough. Regulatory and governance requirements can compel the use of backups. Remotes can't and shouldn't replicate uncommitted data. Backups are also useful as checkpoints before rewriting the history of your database with rebases or migrations.
In each of these cases, a brute force snapshot of the whole database is more useful than a customizable push. A backup can either be a single snapshot or rolling, but is always private to a single writer. A backup therefore copies state from a .dolt repo otherwise hidden from remotes, including staging and working sets.
In summary, backups complement remotes when we want a heightened level of protection against faults and data loss.
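As a rough sketch of the checkpoint use case, a small shell helper could register a backup and sync it right before a risky rewrite. The function name checkpoint_backup, the backup name, and the directory are hypothetical; only dolt backup add and dolt backup sync come from this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: snapshot the current database before a risky change.
# Run it from inside a dolt repo, for example:
#   checkpoint_backup pre_migration /home/test/backups/pre_migration
checkpoint_backup() {
  local name="$1"
  local dir="$2"
  mkdir -p "$dir" || return 1
  # Register the backup location, then copy the full database state into it.
  # dir is assumed to be an absolute path, so the URL becomes file:///...
  dolt backup add "$name" "file://${dir}" || return 1
  dolt backup sync "$name"
}
```

If the rewrite goes wrong, dolt backup restore on the same URL recreates the database as it was at the checkpoint, uncommitted changes included.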
Backups Tutorial
Create A Database Snapshot with the CLI
We will show how to use the new dolt backup command in this tutorial.
The only install needed to start is the dolt binary:
$ sudo bash -c 'curl -L https://github.com/dolthub/dolt/releases/latest/download/install.sh | bash'
We will start with two directories, one for backups and one for an initial dolt repo:
$ mkdir -p repo1 backups/backup1
$ cd repo1
$ dolt init
Successfully initialized dolt data repository.
Adding a backup looks similar to adding a remote:
$ dolt backup add backup1 file://../backups/backup1
/ Tree Level: 1, Percent Buffered: 0.00% Files Written: 0, Files Uploaded: 1
And syncing a backup looks similar to pushing a remote:
$ dolt backup sync backup1
In this simplified example, where we only created a single main branch, restoring the database will look similar to a clone:
$ cd ..
$ dolt backup restore file://./backups/backup1 repo2
$ cd repo2
$ dolt branch -a
* main
$ dolt status
But under the hood, backups and remotes do different things. A reference, or ref, in Dolt and Git is a commit hash, branch, or tag. A client interacting with a remote can only push one ref per command. In addition to copying every pushable ref, backups also copy remote tracking refs and working sets. Remote tracking refs are usually privately namespaced within a database, and working sets are copies of rows that transactions collect before committing.
We will create one of each and start a new backup cycle to highlight this behavior:
$ cd ../repo1
$ dolt branch feature
$ mkdir ../rem1
$ dolt remote add origin file://../rem1
$ dolt push origin main
$ dolt tag v1 HEAD
$ dolt sql -q "create table not_committed (a int primary key)"
$ dolt backup sync backup1
If we restore the database again, we see all of our new changes:
$ cd ..
$ dolt backup restore file://backups/backup1 repo3
$ cd repo3
$ noms ds .dolt/noms
feature
* main
remotes/origin/main
$ dolt status
On branch main
Untracked files:
(use "dolt add <table|doc>" to include in what will be committed)
new table: not_committed
We can get almost the same thing with remotes. But backups copy everything and save uncommitted data. Hopefully this example makes the comparison concrete, and provides a little inspiration for your own apps!
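The restore-and-inspect steps from this tutorial could be wrapped in a helper for spot-checking a backup. This is a minimal sketch: the function name verify_backup and the directory name restored are made up for illustration, and it assumes dolt backup restore creates the target directory in the current working directory, as it does in the examples above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restore a backup into a scratch directory and print
# its status, so you can confirm that uncommitted tables came along.
# Usage: verify_backup file:///home/test/backups/backup1
verify_backup() {
  local url="$1"
  local scratch
  scratch="$(mktemp -d)" || return 1
  # Run in a subshell so the caller's working directory is untouched.
  (
    cd "$scratch" || exit 1
    dolt backup restore "$url" restored || exit 1
    cd restored || exit 1
    dolt status
  )
}
```

Pass an absolute file:// URL, since the restore runs from the scratch directory rather than from where you called the function. The scratch directory is left in place afterwards so the restored copy can be inspected further.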
Daily Backups
In this second tutorial, we will make a systemd script that backs up our database on a timer. Different operating systems have different cron managers, and we will use a Linux setup with systemd timers here.
Our systemd script requires three files:
- our run_backup.sh script
- a "unit file" that executes the backup script within the systemd interface (backup.service)
- and a timer that executes our unit periodically (backup.timer)
First, we will write a script that creates a database backup in the same manner as the previous tutorial:
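#!/usr/bin/bash
# Create a new timestamped backup of the database in DB_DIR.
BACKUP_DIR=/home/test/backups
DB_DIR=/home/test/repo1
cd "$DB_DIR" || exit 1
# Seconds since the epoch give each backup a unique name.
backup_id=$(date '+%s')
mkdir -p "${BACKUP_DIR}/${backup_id}"
# BACKUP_DIR is absolute, so file:// plus the path yields file:///home/...
dolt backup add "${backup_id}" "file://${BACKUP_DIR}/${backup_id}"
dolt backup sync "${backup_id}"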
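I saved this file to /home/test/run_backup.sh and hardcoded my local test folder here. You would want to edit these paths accordingly if following along at home. The script also needs to be executable (chmod +x /home/test/run_backup.sh) for systemd to run it.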
Next, our "unit file" written to /usr/lib/systemd/system/backup.service references our backup script:
[Unit]
Description=Runs db backup
[Service]
Type=oneshot
ExecStart=/home/test/run_backup.sh
[Install]
WantedBy=multi-user.target
And finally a timer backup.timer coupled to backup.service periodically executes the script:
[Unit]
Description=Db backup timer
[Timer]
OnCalendar=*-*-* *:*:0/5
AccuracySec=1s
[Install]
WantedBy=timers.target
We enable and start the timer to kick off backups, which are configured to run every five seconds:
$ systemctl enable backup.timer
$ systemctl restart backup.timer
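The five-second interval is convenient for a demo. For an actual daily schedule, a timer along the following lines might be used instead. This is a sketch: the helper name write_daily_timer and the output path are hypothetical, while OnCalendar=daily and Persistent=true are standard systemd timer settings.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a timer unit that fires once a day (at midnight)
# instead of every five seconds. The caller chooses the output path.
write_daily_timer() {
  local path="$1"
  cat > "$path" <<'EOF'
[Unit]
Description=Db backup timer (daily)

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

Persistent=true asks systemd to run a missed backup at the next boot if the machine was off at midnight. Written to backup.timer next to backup.service, the unit would be enabled the same way as the five-second version.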
After waiting a bit, we can view our growing list of backups:
$ tree backups
backups
├── 1633451635
│ ├── LOCK
│ ├── abmbvta6lclqj7dgrvon4kkgs4lf8ol3
│ ├── ajgrseim4flkk7bprt1jvec5dgpga6ag
│ ├── manifest
│ └── oldgen
├── 1633451640
│ ├── LOCK
│ ├── abmbvta6lclqj7dgrvon4kkgs4lf8ol3
│ ├── ajgrseim4flkk7bprt1jvec5dgpga6ag
│ ├── manifest
│ └── oldgen
├── backup1
│ ├── LOCK
│ ├── abmbvta6lclqj7dgrvon4kkgs4lf8ol3
│ ├── ajgrseim4flkk7bprt1jvec5dgpga6ag
│ ├── manifest
│ └── oldgen
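Because every run creates a new timestamped directory, the list grows without bound. One way to turn this into a rolling backup is to prune older snapshots after each sync. The helper below is a sketch under that assumption; prune_backups is a hypothetical name and is not part of Dolt.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the newest N timestamp-named backup
# directories (e.g. backups/1633451640) and delete the rest.
# Directories with non-numeric names, such as backup1, are left alone.
# Usage: prune_backups /home/test/backups 7
prune_backups() {
  local root="$1"
  local keep="$2"
  local name
  local count=0
  # List entries, keep purely numeric names, newest (largest) first.
  for name in $(ls -1 "$root" | grep -E '^[0-9]+$' | sort -rn); do
    count=$((count + 1))
    if [ "$count" -gt "$keep" ] && [ -d "$root/$name" ]; then
      rm -rf -- "${root:?}/$name"
    fi
  done
}
```

This only deletes directories on disk. The backup names that run_backup.sh registered with dolt backup add are not touched by it.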
What's Next?
Guarantees for single writers, encryption, and access control will make backups more secure. Users can currently implement these manually, for example, by adding additional steps to our systemd script. We will include more of these features at the Dolt layer in the future.
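As one example of such a manual step, overlapping backup runs on the same machine could be prevented with a simple lock directory. This is only a sketch of the idea, not a Dolt feature: with_backup_lock and the lock path are hypothetical, and it does nothing about writers on other machines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command only if no other backup run holds the
# lock. mkdir is atomic, so at most one caller can create the lock directory.
# Usage: with_backup_lock /home/test/backups/.lock /home/test/run_backup.sh
with_backup_lock() {
  local lock="$1"
  local rc
  shift
  if ! mkdir "$lock" 2>/dev/null; then
    echo "another backup is already running" >&2
    return 1
  fi
  # Run the wrapped command, then release the lock and pass on its status.
  "$@"
  rc=$?
  rmdir "$lock"
  return "$rc"
}
```

If a run is killed before it releases the lock, the directory has to be removed by hand before the next backup can start.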
Extending DoltHub to automatically provision backups alongside managed servers is another useful feature we are developing. The option to custom provision remote endpoints will always exist, but we think a convenient hosted option is also useful.
Unlike MySQL, Dolt backups do not double as a format for read replication. We are currently developing other features to provide read replicas and automatic failover for Dolt SQL servers.
Conclusion
You can now back up your Dolt database separately from shared remotes. Backups and remotes are similar, but backups add an extra layer of fault tolerance and facilitate easy database restores.
A quick summary of the technical differences:
- Backups capture the entire internal state of your database, whereas remotes synchronize specific branches or tags.
- Backups are private snapshots, while remotes expose internal state for sharing.
We summarized examples of when you might want the flexibility of remotes, and where you need the fault tolerance of backups. We also walked through two tutorials using the new dolt backup CLI commands. The first creates a static backup manually, and the second configures a background process that automatically backs up our database on a timer.
If you are interested in learning more about Dolt, backups, or relational databases, reach out to us on Discord!