Availability With Dolt Read Replicas
TL;DR:
Today we're excited to introduce read replication for Dolt! Read
replication is a common way to scale read-heavy applications by copying
data from a source server to replicas. Setting DOLT_REPLICATE_TO_REMOTE
and
DOLT_READ_REPLICA_REMOTE
in your Dolt configuration lets replica
databases automatically synchronize with a source. Updates are shared
through Dolt remotes using "push on write" and "pull on read" commit
hooks.
Why Do We Use Read Replicas?
Accommodating growth and gracefully onboarding new customers is important! Web applications bursting with traffic can be scaled either vertically or horizontally. Vertical scaling beefs up small servers. Horizontal scaling adds companion servers to share work. Most modern databases do both, but horizontal scaling is more complex.
The first step in horizontally scaling high read throughput databases is read replication. One server handles all writes (the source) and reads are spread between servers (the replicas) that pull updates from the source. Replicas ease contention on the source node by handling reads.
Dividing responsibilities between sources and replicas leads to availability and consistency tradeoffs. Read replication errs towards availability at the expense of replica delay, but customization and additional designs can be used to satisfy business objectives.
How Dolt Read Replicas are Different
Most relational databases replicate individual transactions. Although SQL transactions in Dolt are persistent, they are not versioned into the commit graph. A Dolt commit, as in Git, marks the shareable unit of work. The first version of Dolt read replication sends updates at the frequency and granularity of Dolt commits.
Next, most databases stream information directly from source to replica. Since remotes are the common communication point between Dolt databases, the alpha version of Dolt replication uses remotes as a middleman. sources push, replicas pull. Remotes used in this way should be protected from external use. Conflicting processes can create merge conflicts and disrupt the flow communication between source and replicas.
Read replicas get even more interesting when you consider pairing them with other Dolt features:
-
Backups naturally complement replication by snapshotting the state of your database. Backups and replication give you durability and availability, respectively.
-
DoltHub can be the replication middleman, providing a UI to monitor the state of your database live as it changes.
-
We are investigating DoltHub continuous integration (CI), which would allow Dolt users to trigger arbitrary workflows within DoltHub. Read replication would automate CI even further, triggering upstream jobs after every commit. One instance where this might be useful is running data validation scripts at the source of data changes, instead of relying on external scheduling (like Airflow) as is common today.
Read here to learn more about how Dolt is a tailwind for simpler data lifecycle management!
Read Replica Tutorial
Ex 1: Use docker compose — one replica, one source
We will walk through a read replication demo with one source and one replica. Writes issued to the source are automatically pushed to a common remote. Reads steered towards replicas synchronize with the remote prior to returning results. The source code for this example can be found here.
First, we download Dolt and configure the required user metadata:
$ sudo bash -c 'curl -L https://github.com/dolthub/dolt/releases/latest/download/install.sh | sudo bash'
$ dolt config --local --add user.name "Max Hoffman"
Config successfully updated.
$ dolt config --local --add user.email "max@dolthub.com"
Config successfully updated.
Next, we create source and replica repositories with a common ancestry. This step is important, because two repos without a common ancestory cannot communicate through a shared remote. You can use the helper script to expedite this locally.
$ mkdir -p dbs/{source,remote}
$ remotesrv --http-port 5000 --dir ./dbs/remote &
2021/10/13 09:41:33 cwd set to ./dbs/remote
2021/10/13 09:41:33 'grpc-port' parameter not provided. Using default port 50051
2021/10/13 09:41:33 Starting http server on port 5000
2021/10/13 09:41:33 Starting grpc server on port 50051
$ cd source
$ dolt init
Successfully initialized dolt data repository.
$ dolt remote add origin http://localhost:50051/test-org/test-repo
$ dolt push -u origin main
$ cd ..
$ dolt clone http://localhost:50051/test-org/test-repo replica
cloning http://localhost:50051/test-org/test-repo
3 of 3 chunks complete. 0 chunks being downloaded currently.
We used
remotesrv
to create our remote replication middleman.
Localhost worked for initialization, but containers in docker-compose cannot communicate
through localhost. So we will add a second remote referencing a
networking bridge remote
namespace specifically for Docker:
$ cd ../source
$ dolt remote add docker_origin http://remote:50051/test-org/test-repo
$ cd ../replica
$ dolt remote add docker_origin http://remote:50051/test-org/test-repo
As a final setup step, we attach the source and replica to the new replication middleman:
$ cd ../source
$ dolt config --local --add DOLT_REPLICATE_TO_REMOTE docker_origin
Config successfully updated.
$ cd ../replica
$ dolt config --local --add DOLT_READ_REPLICA_REMOTE docker_origin
Config successfully updated.
The docker-compose.yml
below includes three servers:
-
source on port 3308
-
replica on port 3307
-
Remote server on ports 5000, 50051
The Dolt servers are started with dolt sql-server
, and the remote
server uses the same remotesrv
command as our setup step.
We run the replica in --read-only
mode to enforce unidirectional writes:
version: "3.9"
services:
remote:
build: ./remotesrv
ports:
- "50051:50051"
- "5000:5000"
expose:
- "50051"
- "5000"
volumes:
- type: bind
source: ./dbs/remote
target: /var/remotes
networks:
- backend
replica:
depends_on:
- "remote"
build: .
command: ["sql-server", '--read-only', "-l", "trace", "--host", "0.0.0.0"]
ports:
- "3307:3306"
expose:
- "3306"
working_dir: /var/dbs/replica
volumes:
- type: bind
source: ./dbs/replica/.dolt
target: /var/dbs/replica/.dolt
networks:
- backend
source:
depends_on:
- "remote"
build: .
ports:
- "3308:3306"
expose:
- "3306"
working_dir: /var/dbs/source
volumes:
- type: bind
source: ./dbs/source/.dolt
target: /var/dbs/source/.dolt
networks:
- backend
networks:
backend:
driver: bridge
Finally, we run docker-compose to start the containers:
$ docker-compose up
Starting read-replica_remote_1 ... done
Creating read-replica_source_1 ... done
Creating read-replica_replica_1 ... done
Attaching to read-replica_remote_1, read-replica_replica_1, read-replica_source_1
remote_1 | 2021/10/13 16:41:40 cwd set to .
remote_1 | 2021/10/13 16:41:40 'grpc-port' parameter not provided. Using default port 50051
remote_1 | 2021/10/13 16:41:40 Starting http server on port 5000
remote_1 | 2021/10/13 16:41:40 Starting grpc server on port 50051
replica_1 | Starting server with Config HP="0.0.0.0:3306"|U="root"|P=""|T="28800000"|R="false"|L="trace"
remote_1 | 2021/10/13 16:41:42 GRPC(00002) new request for: GetRepoMetadata
remote_1 | 2021/10/13 16:41:42 GRPC(00001) new request for: GetRepoMetadata
remote_1 | 2021/10/13 16:41:42 GRPC(00002) - finished
remote_1 | 2021/10/13 16:41:42 GRPC(00001) - finished
source_1 | Starting server with Config HP="0.0.0.0:3306"|U="root"|P=""|T="28800000"|R="false"|L="trace"
Which we can view running locally:
$ docker container ps
432402f1e067 read-replica_replica "dolt sql-server -l …" 3 minutes ago Up 3 minutes 0.0.0.0:3307->3306/tcp, :::3307->3306/tcp read-replica_replica_1
2db89c1486a4 read-replica_source "dolt sql-server -l …" 3 minutes ago Up 3 minutes 0.0.0.0:3308->3306/tcp, :::3308->3306/tcp read-replica_source_1
0ca1b1d6cce8 read-replica_remote "remotesrv --http-po…" 3 minutes ago Up 3 minutes 0.0.0.0:5000->5000/tcp, :::5000->5000/tcp, 0.0.0.0:50051->50051/tcp, :::50051->50051/tcp read-replica_remote_1
Now connect to the source and create a table:
$ mysql --user root --host=0.0.0.0 -P 3308 source
mysql> create table t1 (a int primary key);
Empty set (0.09 sec)
mysql> select dolt_commit('-am', 'cm');
+----------------------------------+
| dolt_commit('-am', 'cm') |
+----------------------------------+
| 0qoslhbbfqcerrfg8osf08qi8iapjp70 |
+----------------------------------+
1 row in set (0.26 sec)
And to conclude, observe the table on the replica:
$ mysql --user root --host=0.0.0.0 -P 3307 replica
mysql> show tables:
+-------+
| Table |
+-------+
| t1 |
+-------+
1 row in set (0.05 sec)
In the background, our source pushed the commit to our remote, and our
replica synchronously pulled the new commit before returning the show
tables
query result.
What to Look Forward to
Replication could be faster and more direct than simply pushing after every dolt commit. Replication could be asynchronous, and amortize queued commits by ancestory to reduce the number of round trips. Replicating new chunk files, rather than commits, is another possible mode of replication. Lastly, streaming writes between source and replica through TCP connections would minimize replica latency.
At the macro level, there are many exciting design considerations related to availability, partitioning, and multisource systems that overlap with read replication. We build features based on customer priority, and if any of these features interest you please reach out on Discord!
Summary
The alpha version of read replication for Dolt is now available. Read
replication horizontally scales a database by drawing traffic
away from a source node towards replicas that and track updates. Setting
DOLT_REPLICATE_TO_REMOTE
and DOLT_READ_REPLICA_REMOTE
in Dolt
configuration kicks off "push on write" and "pull on read" behavior,
respectively. Updates are triggered on Dolt commits, not SQL transaction
commits. And information is passed through a remote middleman.
Try using Dolt read replication yourself, and tell us what you think!
If you are interested in learning more about Dolt, read replication, or relational databases reach out to us on Discord!