Part I: Dolt API Surface Area
Background
When DoltHub was founded it was called Liquidata. The goal was to bring liquidity to the data market. The founders realized that the pipes were broken: sending around CSV, JSON, and other formats was broken. The requirement to translate to and from formats without formal type systems at both ends of the "pipe" was an unacceptable degree of friction:
Searching around for an example of a good user experience for data transfer, the one we liked most was Git remotes:
$ git clone path/to/remote
Or if you already have the repository you can just as easily grab the latest changes:
$ git pull path/to/remote
But "Git for data" did not solve the friction we had identified. For one, CSV and JSON files can distributed via Git remotes. Furthermore, that doesn't solve for a primary source of friction identified above, which is translating data to and from formats such as CSV and JSON, and internal data storage formats.
SQL is by far the most ubiquitous way to describe and query data. Even where SQL is ultimately not a suitable interface for the end use of data, it is often used to define data pipelines that feed data into more specialized storage layouts. We concluded that if data arrived in a SQL queryable format, we could deliver a significantly streamlined user experience for data distribution:
We chose to combine concepts from Git and SQL to create a version controlled database motivated by our desire to deliver frictionless data transfer. We imagined users cloning datasets, and spinning up a SQL server, all with a single tool. We called it Dolt.
API Surface Area
This decision gave our database API two key requirements:
- a SQL query interface
- Git-like version control semantics
These two goals are somewhat in tension. A relational database server uses SQL as an interface, often to a remote process. By contrast, Git version control uses a CLI acting mostly on local files, often with complex multi-step workflows. We also imagined that a core use-case for Dolt would be as a collaboration tool for building databases, meaning porting Git concepts to SQL was not an option.
Before digging into how we attempting to strike this balance, it's worth briefly investigating the API surface area of some popular relational databases.
Other Databases
Most users interact with relational databases via issuing SQL statements to a
remote server process. However, not all tasks for operating traditional
relational databases can be described in SQL. For example, tab completing the
prefix pg
on my Mac shows the following list of Postgres tools:
$ pg
pg_archivecleanup pg_controldata pg_isready pg_restore pg_test_timing
pg_basebackup pg_ctl pg_receivewal pg_rewind pg_upgrade
pg_checksums pg_dump pg_recvlogical pg_standby pg_waldump
pg_config pg_dumpall pg_resetwal pg_test_fsync pgbench
And likewise for the prefix mysql
:
mysql
mysql mysql_upgrade mysqldumpslow mysqlshow
mysql.server mysqladmin mysqlimport mysqlslap
mysql_client_test mysqlbinlog mysqlpump mysqltest
mysql_config mysqlcheck mysqlrouter mysqltest_safe_process
mysql_config_editor mysqld mysqlrouter_keyring mysqlxtest
mysql_secure_installation mysqld_multi mysqlrouter_passwd
mysql_ssl_rsa_setup mysqld_safe mysqlrouter_plugin_info
mysql_tzinfo_to_sql mysqldump mysqlsh
By splitting API surface area across a SQL query interface and set of shell scripts for automating common administrative tasks, these SQL databases implicitly identify two concerns:
- querying data
- administering the instance
The first bucket can be thought of as "online concerns," in the sense that the querying data concerns the contents of the database, and thus assumes the database is online and available. The second bucket can be thought of us "offline concerns," in the sense that they address the existence and availability of the database process.
Since Dolt is a database, it has to address both of these concerns, as well as exposing version control semantics.
Dolt
A Dolt instance is not just a server process running in a data center, but a full version history that can synced with a remote. Dolt's offline interface must allow users to administer their local database, while also exposing the git-style versioning tools that make that collaboration and sharing possible.
Once users have obtained data from a remote, however, they want to have a familiar SQL experience. That means the offline Dolt API must make it easy to transition from "administration", obtaining data and standing up the server, to a familiar experience of executing SQL queries via a connector.
To achieve these goals, we decided the offline experience would be best served by a user experience closely modeled on the Git CLI, with additional commands related to managing a running server. We also chose to implement the MySQL dialect, which means that Dolt can be used as a drop-in replacement for any existing application backed by MySQL.
CLI
Let's dive into the CLI. Running dolt
dumps out the index of available commands:
$ dolt
Valid commands for dolt are
init - Create an empty Dolt data repository.
status - Show the working tree status.
add - Add table changes to the list of staged table changes.
reset - Remove table changes from the list of staged table changes.
commit - Record changes to the repository.
sql - Run a SQL query against tables in repository.
sql-server - Start a MySQL-compatible server.
log - Show commit logs.
diff - Diff a table.
blame - Show what revision and author last modified each row of a table.
merge - Merge a branch.
branch - Create, list, edit, delete branches.
tag - Create, list, delete tags.
checkout - Checkout a branch or overwrite a table from HEAD.
remote - Manage set of tracked repositories.
push - Push to a dolt remote.
pull - Fetch from a dolt remote data repository and merge.
fetch - Update the database from a remote data repository.
clone - Clone from a remote data repository.
creds - Commands for managing credentials.
login - Login to a dolt remote host.
version - Displays the current Dolt cli version.
config - Dolt configuration.
ls - List tables in the working set.
schema - Commands for showing and importing table schemas.
table - Commands for copying, renaming, deleting, and exporting tables.
conflicts - Commands for viewing and resolving merge conflicts.
migrate - Executes a repository migration to update to the latest format.
read-tables - Fetch table(s) at a specific commit into a new dolt repo
gc - Cleans up unreferenced data from the repository.
filter-branch - Edits the commit history using the provided query.
Let's use the dolt clone
command modeled off of Git to clone a database:
$ dolt clone dolthub/ip-to-country
cloning https://doltremoteapi.dolthub.com/dolthub/ip-to-country
32,832 of 32,832 chunks complete. 0 chunks being downloaded currently.
$ cd ip-to-country
That was quite familiar from Git, even the printing out of chunks. Now to query our data, we'll start the server:
$ dolt checkout -b test
Switched to branch 'test'
$ dolt sql-server
Starting server with Config HP="localhost:3306"|U="root"|P=""|T="28800000"|R="false"|L="info"
A Dolt SQL Server instance is now running on port 3306, using the
ip-to-country
database we just clone from DoltHub.
SQL
We chose SQL as a query interface because it is widely adopted, and MySQL in particular because of a rich ecosystem of connectors. We also wanted users to have the familiar Git-like experience of managing their data, and seamlessly transition to executing queries in a familiar language.
We can now see the magic of "just working" as we use the off-the-shelf MySQL connector to connect to Dolt and have a familiar SQL query interface:
~|>> mysql --host=127.0.0.1 --user=root
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.7.9-Vitess
Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> use ip_to_country;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> select * from IPv4ToCountry limit 5;
+-----------+-----------+----------+--------------+--------------------+--------------------+-----------+
| IPFrom | IpTo | Registry | AssignedDate | CountryCode2Letter | CountryCode3Letter | Country |
+-----------+-----------+----------+--------------+--------------------+--------------------+-----------+
| 763029248 | 763029503 | apnic | 1438300800 | AU | AUS | Australia |
| 763029504 | 763030527 | apnic | 1436486400 | VN | VNM | Viet Nam |
| 763030528 | 763031551 | apnic | 1436486400 | VN | VNM | Viet Nam |
| 763031552 | 763032575 | apnic | 1436486400 | VN | VNM | Viet Nam |
| 763032576 | 763033599 | apnic | 1436486400 | VN | VNM | Viet Nam |
+-----------+-----------+----------+--------------+--------------------+--------------------+-----------+
10 rows in set (0.02 sec)
Bringing it Together
We set out to build a database that provided version control features to solve for modern data infrastructure needs. This entailed combining two models. So far we showed a lightly augmented Git interface for the offline experience, and a generic SQL query interface for when Dolt is running a server process.
But to fully realize the value of our creation requires exposing version control in SQL. This where the real API challenge arises, as we try and figure out where to augment existing SQL functionality to work with an underlying commit graph, and where to implement functions that take care of concerns that don't fit in a declarative syntax.
A good example is AS OF
syntax. This is a SQL standard feature that
is there to expose "system versioning." MariaDB supports
this
when turned on, for timestamps. But since Dolt stores its version
history in a commit graph, it understands not only timestamps, but
also commit hashes, branch names, and refs such as HEAD~
:
mysql> SELECT * FROM IPv4ToCountry AS OF 'mh8o6k5bja6o3bmj8gflc4f9rdjclhcb' WHERE IPFrom = 521020416 AND IpTo = 521021439;
+-----------+-----------+----------+--------------+--------------------+--------------------+---------+
| IPFrom | IpTo | Registry | AssignedDate | CountryCode2Letter | CountryCode3Letter | Country |
+-----------+-----------+----------+--------------+--------------------+--------------------+---------+
| 521020416 | 521021439 | ripencc | 1303084800 | RO | ROU | Romania |
+-----------+-----------+----------+--------------+--------------------+--------------------+---------+
1 row in set (0.00 sec)
Other version control features are exposed via special system tables
that can be incoporated into any SQL query. For example, to examine
the commit log I can select from dolt_log
:
mysql> select commit_hash, date, message from dolt_log limit 5;
+----------------------------------+-------------------------+----------------------------------------------------------+
| commit_hash | date | message |
+----------------------------------+-------------------------+----------------------------------------------------------+
| 2v3vrf8ta478a5e1h4aihaj73q7kh96a | 2020-12-13 01:30:27.421 | Update IP to Country for date 2020-12-13 01:29:50.478038 |
| mh8o6k5bja6o3bmj8gflc4f9rdjclhcb | 2020-12-12 01:31:21.972 | Update IP to Country for date 2020-12-12 01:30:45.344102 |
| bqtrbbrsgie7u4fsgph0t12j94ofefml | 2020-12-11 01:30:36.075 | Update IP to Country for date 2020-12-11 01:29:53.647746 |
| u8pnf1o0mus2d8mok0083t5lpj190j5f | 2020-12-10 01:30:29.088 | Update IP to Country for date 2020-12-10 01:29:52.085252 |
| kqjdel14k130as6jdjbrurtg9j8e93ie | 2020-12-09 20:02:43.082 | Update IP to Country for date 2020-12-09 20:02:07.846972 |
+----------------------------------+-------------------------+----------------------------------------------------------+
5 rows in set (0.02 sec)
For convenience, the entire set of CLI functionality is available in
SQL with special functions.
For example, you can check out a branch with a SELECT
statement like so:
mysql> select dolt_checkout('some-branch')
This concludes our tour of Dolt's native APIs.
Higher Level APIs
Other relational database solutions do not generally provide higher level APIs outside of SQL. There exist many such APIs for SQL databases, but they are generally not part of the database itself. For example, SQL Alchemy is a library that provides Python native functions for manipulating various flavors of SQL database.
Dolt is a bit more complicated than existing SQL databases, merely because it has a slew of additional features that just make the possibility space larger. For example, there are combinations of common SQL queries and version control operations that users might regularly combine. We believed there existed a sufficient number of these to justify a higher level Python API designed to make it easier to integrate Dolt into existing data infrastructure.
In the next part of this post, tomorrow, we will do a deep dive on the latest iteration of Doltpy, our Python API for interacting with Dolt.
Conclusion
We reviewed how Dolt combines concepts from Git and MySQL, and where necessary innovates, to bring a relational database with novel features to market. If those features sound exciting to you, give Dolt a try, or get in touch with us. Our next post will show to seamlessly script data pipelines that incorporate Dolt using the latest version of Doltpy.