Three Ways to Import Data into Dolt
Dolt is the first database that versions data like Git versions code. We focused on a few key areas for improving Dolt this year: 1) sharpening MySQL feature parity, 2) smoothing the first-hour experience, and 3) chasing MySQL's performance. Today we will talk about importing data into Dolt, which for some users is both a performance bottleneck and a negative first-hour experience. This blog discusses what we learned building a new system for measuring import performance. We share improved intuitions about which imports are fast, which are slow, and why.
We start with a survey of different import options before digging into the results of our benchmarks.
Bulk Loading Data into Dolt
An "import" loads data from an external format into a database. We bulk insert data when restoring databases from dumps, converting a MySQL database to a Dolt database, converting a set of CSV files into a database, or simply adding a large set of changes into a preexisting database.
There are three ways to achieve the same result: SQL dumps (i.e. dolt sql < dump.sql), dolt table import, and LOAD DATA INFILE.
SQL Dump
The first way to import data simply runs a SQL script:
$ dolt sql < dump.sql
dolt sql runs every line of dump.sql when fed through standard input. Queries in a SQL script generally do not have to be inserts, but a dump.sql used for import will look something like this:
create table xy (
x int primary key,
y varchar(32)
);
insert into xy values
(1, '1'),
(2, '2'),
(3, '3'),
(4, '4');
insert into xy values
(5, '5'),
(6, '6'),
(7, '7'),
(8, '8');
A dump defines tables and then inserts data into them. Queries are separated by semicolons for parsing and by newlines for convenience.
All together, the import looks something like this:
$ dolt init
$ dolt sql < dump.sql
Query OK, 4 rows affected (0.00 sec)
Query OK, 4 rows affected (0.00 sec)
$ dolt sql -q "select * from xy"
+---+---+
| x | y |
+---+---+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
+---+---+
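If your data currently lives in MySQL, you can produce a dump like this with mysqldump and feed it straight to Dolt. A minimal sketch, assuming a MySQL database named mydb (adjust user and host flags to your setup; some dumps may need additional mysqldump flags for triggers or routines):
$ mysqldump -u root -p mydb > dump.sql
$ mkdir mydb && cd mydb
$ dolt init
$ dolt sql < dump.sql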
If you prefer strings rather than files, the batch (-b) flag uses a slightly different syntax for the same effect:
$ dolt sql -b -q "
insert into xy values (9, '9');
insert into xy values (10, '10');
"
Table import
The second way to bulk load data uses a Dolt command line function: dolt table import. I usually divide this path into two steps. First, I create the table schema:
dolt sql -q "create table xy (x int primary key, y varchar(32));"
dolt table import can also infer schemas (an example follows at the end of this section), but I prefer the SQL-native format for setting a schema's columns, primary key, and any indexes.
The second step either creates (-c), updates (-u), or replaces (-r) the table. These three have the same result on our empty table, but a replace is also repeatable on a non-empty table:
cat data.csv
x,y
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
dolt table import -r xy data.csv
Rows Processed: 8, Additions: 8, Modifications: 0, Had No Effect: 0
Import completed successfully.
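If you would rather skip the CREATE TABLE step, dolt table import can infer the column types from the file during a create-style import. A sketch, where the --pk flag names the primary key column (check dolt table import --help for the exact options in your Dolt version):
$ dolt table import -c --pk x xy data.csv
Inference is convenient for quick experiments, but a hand-written schema gives you explicit control over types and indexes.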
LOAD DATA INFILE
The previous methods import data using Dolt's command line interface.
The final method requires a server and a client connection. The LOAD DATA command works on both Dolt and MySQL provided the server has been configured to allow it, which isn't the default in either (see the docs for details).
First, we start a server in a new directory that holds our data file:
$ mkdir mydb
$ cd mydb
$ cat <<EOF > data.csv
x,y
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
EOF
$ dolt sql-server
Starting server with Config HP="localhost:3306"|T="28800000"|R="false"|L="info"|S="/tmp/mysql.sock"
Next, we connect to this database. We whitelist the local directory /Users/max-hoffman/mydb for access when establishing the client connection.
$ mysql -u root -h localhost --port 3306 --load-data-local-dir /Users/max-hoffman/mydb mydb
Inside the SQL shell, we reconfigure the server to permit LOAD DATA INFILE commands, which requires root authorization:
mysql> SET GLOBAL local_infile=1;
Query OK, 1 row affected (0.00 sec)
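Before loading, you can double-check that the setting took effect. This is standard MySQL syntax for inspecting a system variable:
mysql> SHOW VARIABLES LIKE 'local_infile';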
Finally, we can create our table and import data.csv:
mysql> create table xy (x int primary key, y varchar(32));
Query OK, 0 rows affected (0.02 sec)
mysql> LOAD DATA LOCAL INFILE '/Users/max-hoffman/mydb/data.csv'
INTO TABLE xy
FIELDS TERMINATED BY ',' ENCLOSED BY ''
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
Query OK, 8 rows affected (0.01 sec)
My file is located at /Users/max-hoffman/mydb/data.csv, and the additional configuration indicates that 1) fields are separated by commas, 2) rows are separated by line breaks, and 3) we should not import the first (header) row.
MySQL has a wealth of options to configure the permissions and execution of LOAD DATA.
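For example, standard MySQL LOAD DATA syntax can quote fields, map file columns onto table columns, and transform values on the way in. A sketch of those options (the @raw_y user variable and the TRIM call are illustrative; Dolt's coverage of each option may vary, so consult the docs above):
mysql> LOAD DATA LOCAL INFILE '/Users/max-hoffman/mydb/data.csv'
INTO TABLE xy
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(x, @raw_y)
SET y = TRIM(@raw_y);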
If you find Dolt missing a specific configuration option or pathway, let
us know and we will be happy to add it.
Benchmarking Performance
The Good
The good news is that Dolt trails MySQL's import performance by roughly the same margin as the rest of our benchmarks, about 2x. That multiple will improve naturally as we make all of Dolt faster. But there is also room to tighten imports independently if the 2x ends up bottlenecking users!
test_name | detail | row_cnt | sorted | mysql_time (s) | sql_mult | cli_mult |
---|---|---|---|---|---|---|
col type | datetime | 200000 | 1 | 0.87 | 2.21 | 2.31 |
col type | varchar | 200000 | 1 | 0.75 | 2.4 | 2.77 |
config width | 2 cols | 200000 | 1 | 0.81 | 1.73 | 1.8 |
config width | 32 cols | 200000 | 1 | 2.07 | 1.69 | 2.43 |
config width | 8 cols | 200000 | 1 | 1.02 | 1.78 | 2.47 |
pk type | float | 200000 | 1 | 1.02 | 1.77 | 1.49 |
pk type | int | 200000 | 1 | 0.83 | 1.76 | 1.75 |
pk type | varchar | 200000 | 1 | 1.72 | 1.92 | 2.03 |
row count | 1.6mm | 1600000 | 1 | 6 | 2 | 2.1 |
row count | 400k | 400000 | 1 | 1.54 | 1.9 | 1.99 |
row count | 800k | 800000 | 1 | 3.02 | 1.95 | 2.06 |
secondary index | four index | 200000 | 1 | 4.52 | 1.43 | 1.51 |
secondary index | no secondary | 200000 | 1 | 0.97 | 1.72 | 1.96 |
secondary index | one index | 200000 | 1 | 1.31 | 1.76 | 1.89 |
secondary index | two index | 200000 | 1 | 2.44 | 1.52 | 1.59 |
sorting | shuffled 1mm | 1000000 | 0 | 6.65 | 1.9 | 1.94 |
sorting | sorted 1mm | 1000000 | 1 | 6.26 | 2.03 | 2.04 |
Each row is a particular import configuration, described by the test name, detail, row count, and a sorting flag. The next column is the MySQL time in seconds to perform the import. The last two columns are Dolt time multipliers for LOAD DATA INFILE from SQL and for dolt table import from the command line. For example, the 1.6mm row count test takes 6.0 seconds in MySQL, 12.0 seconds in Dolt via SQL, and about 12.6 seconds on the Dolt CLI.
The Bad
Both MySQL and Dolt fall short importing a certain kind of dump file. Consider a "batched" import:
insert into xy values
(1, '1'),
(2, '2'),
(3, '3'),
(4, '4');
versus an "unbatched" import:
insert into xy values (1, '1');
insert into xy values (2, '2');
insert into xy values (3, '3');
insert into xy values (4, '4');
Batching runs about 30x faster in MySQL and about 130x faster in Dolt, even though the files themselves are almost identical. Writing a row to disk is fast, but Dolt makes strict isolation and durability guarantees between queries, so every statement pays transaction setup and teardown costs. Improving that lifecycle machinery to streamline transaction throughput is hard.
But don't fret! We created a helper tool as a workaround. This insert batcher reduces the number of transaction setups and teardowns by combining insert queries.
After installing Go
we can create a batched file:
$ go version
go version go1.19 darwin/amd64
$ go install github.com/max-hoffman/insert-batcher@latest
$ cat dump.sql
create table xy (x int primary key, y int);
insert into xy values (1, '1');
insert into xy values (2, '2');
insert into xy values (3, '3');
insert into xy values (4, '4');
insert into xy values (5, '5');
insert into xy values (6, '6');
insert into xy values (7, '7');
insert into xy values (8, '8');
$ go run github.com/max-hoffman/insert-batcher@latest \
-in dump.sql \
-out batch.sql \
-b 4
$ cat batch.sql
create table xy (x int primary key, y int);
insert into xy values (1, '1'), (2, '2'), (3, '3'), (4, '4');
insert into xy values (5, '5'), (6, '6'), (7, '7'), (8, '8');
The default batching factor (-b) of 500 should amortize the transaction overhead for most imports.
server | detail | row_cnt | time (s) |
---|---|---|---|
dolt | batch sql | 10000 | 0.14 |
dolt | by line sql | 10000 | 17.96 |
mysql | batch sql | 10000 | 0.1 |
mysql | by line sql | 10000 | 3.58 |
The Ugly
Blob stands for "binary large object", and in MySQL corresponds to the TEXT, JSON, and BINARY column types. We distinguish between blobs and VARCHAR for sizing reasons. A block of memory typically contains a range of rows, but a blob typically uses several blocks' worth of memory on its own. Instead of disrupting the compactness of the regular tree, we write blobs to separate storage. Data blocks contain rows of mostly inlined values, and reference blob files written elsewhere.
This is an example of a schema that will be written inline. All fields will be laid out next to each other in row-oriented storage:
create table xy (
x int primary key,
y varchar(30),
z varchar(30),
w varchar(30)
);
This schema writes every y field as a separate file:
create table xy (
x int primary key,
y text,
z varchar(30),
w varchar(30)
);
When we started this work, the blob schema was more than 10x slower on import!
detail | row_cnt | time (s) |
---|---|---|
no blob | 200000 | 1.45 |
1 blob | 200000 | 18.43 |
2 blobs | 200000 | 61.77 |
There are some good reasons why blobs take longer to write than varchar. We write blobs as a tree, not a linear sequence of bytes. The blob tree nodes are pooled with the rest of the table's tree nodes, so blobs usually expand the total node count by a factor of 10-100x. If your TEXT fields fit in a VARCHAR, you are always better off sticking with VARCHAR!
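If a dump declares TEXT but your values are short, you can convert the column to an inline VARCHAR after importing. A sketch using standard MySQL DDL (the 500-character length is only an example; pick one that covers your data, since longer values will make the conversion fail or truncate depending on sql_mode):
$ dolt sql -q "alter table xy modify column y varchar(500)"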
Even then, many MySQL dumps include TEXT columns that we need to support for compatibility. We discovered some not-so-good reasons why blobs were slow, and managed to streamline this edge case in a recent PR:
detail | row_cnt | time (s) |
---|---|---|
no blob | 200000 | 1.25 |
1 blob | 200000 | 2.23 |
2 blobs | 200000 | 2.64 |
Stay tuned for a follow up blog about how we optimized blob import throughput by 10x.
Summary
We surveyed the ways to import data into Dolt. We discussed how Dolt's import speed compares to MySQL's, which paths are fast, which are prone to foot-guns, and which are slow and require more consideration.
Next time, we will follow up with a performance-focused blog detailing how we streamlined a particularly hairy edge case: TEXT column imports.
We are tracking import performance changes on each release. If anyone finds gaps in our testing, or has particular edge cases they would like us to address, we encourage you to reach out on Discord, GitHub, or Twitter!