JSON Showdown: Dolt vs MySQL
UPDATE: The original version of this post accidentally used an outdated SQL client when benchmarking Dolt, which introduced additional latency. With the updated metrics, Dolt is even faster. The previous version of this post concluded that Dolt consistently outperformed MySQL when the test docs exceeded ~1MB in size. The actual threshold is closer to ~64KB.
Dolt is a SQL database designed as a drop-in replacement for MySQL, but with extra features. Dolt's biggest feature is its Git-like version control semantics, which allow users to clone, branch, diff, and merge their databases, just like Git does for files.
But version control isn't the only place where Dolt excels: the same storage strategies that we developed for storing branches turned out to also be a great fit for storing JSON. As a result, we think there's a compelling argument for Dolt in JSON-heavy workflows, even if you don't care about version controlling your data. (You should care about version controlling your data, though.)
Rather than asking you to take my word for it, I wanted to put my money where my mouth is by pitting Dolt head-to-head against the other big players in the open-source SQL engine space. The goal: to see whether Dolt can truly offer the best JSON experience. Since Dolt is billed as a MySQL replacement, it makes sense to start there. (Postgres support is coming soon!)
Dolt's performance compared to MySQL has always been important to us, and we've been tracking it and improving it for several years now. You could consider this post a continuation of that series.
In Part 1, I'll introduce the syntax for working with JSON in MySQL. Dolt uses the same dialect, so these examples will also work for Dolt.
In Part 2, we'll look at how MySQL represents JSON internally. I've already talked about how Dolt represents JSON, which you can read, so I won't repeat that here.
In Part 3, I'll introduce the different operations that we're going to benchmark, and see how Dolt and MySQL compare at handling a variety of document sizes.
I'm not aware of any standard benchmark that specifically tests operations on large JSON documents, so we're going to roll our own. Our data will be documents generated by this method for representing the von Neumann ordinals:
CREATE TABLE vn(pk int primary key, j json);

INSERT INTO vn (
    with recursive cte (pk, j) as (
        select 0, JSON_OBJECT("l", 0)
        union all
        select pk+1, JSON_SET(JSON_INSERT(j, CONCAT("$.k", pk), j), "$.l", pk+1)
        from cte where pk <= 20
    )
    select * from cte
);
Why not use realistic data? Wouldn't that be more meaningful?
We chose this over a more realistic data set because it helps ensure that we're not accidentally relying on the test data having some specific shape. The documents generated by this process cover a range of sizes and include deeply nested objects, shallow objects with many fields, and everything in between. Any engine that can handle this should be able to handle any other shape of data.
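To make that shape concrete, here is roughly what the first few documents produced by the CTE above look like (worked out by hand from the recursive step; key ordering in the actual output may differ between engines):

```sql
-- Inspect the first few generated documents.
SELECT pk, j FROM vn WHERE pk <= 2;

-- pk=0: {"l": 0}
-- pk=1: {"l": 1, "k0": {"l": 0}}
-- pk=2: {"l": 2, "k0": {"l": 0}, "k1": {"l": 1, "k0": {"l": 0}}}
-- Each document embeds a copy of every earlier document, so document sizes
-- roughly double with each step.
```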
Part 1: What does using JSON in MySQL look like?
In MySQL, JSON is stored in table columns with the JSON type, and is typically imported and exported in a human-readable format:
> CREATE TABLE users(userId INT PRIMARY KEY, metadata JSON);
> INSERT INTO users VALUES (1, '{"name": "Tim", "views": 124, "subscribers": [2, 3]}'),
(2, '{"name": "Nick", "views": 61, "subscribers": [3]}'),
(3, '{"name": "James", "views": 27, "subscribers": []}');
> SELECT metadata FROM users;
+------------------------------------------------------+
| metadata                                              |
+------------------------------------------------------+
| {"name": "Tim", "views": 124, "subscribers": [2, 3]}  |
| {"name": "Nick", "views": 61, "subscribers": [3]}     |
| {"name": "James", "views": 27, "subscribers": []}     |
+------------------------------------------------------+
If you only need to export part of a document, you can use the ->> expression:
> SELECT metadata->>"$.name" FROM users;
+---------------------+
| metadata->>"$.name" |
+---------------------+
| Tim                 |
| Nick                |
| James               |
+---------------------+
->> can be used in WHERE and ORDER BY clauses in order to filter or order the results, respectively.
> SELECT count(*) FROM users WHERE metadata->>"$.views" > 50;
+----------+
| count(*) |
+----------+
|        2 |
+----------+
> SELECT metadata->>"$.name" FROM users ORDER BY metadata->>"$.views" DESC;
+---------------------+
| metadata->>"$.name" |
+---------------------+
| Nick                |
| James               |
| Tim                 |
+---------------------+
How does MySQL actually evaluate these queries? We can use DESCRIBE statements to get a clue.
> DESCRIBE SELECT count(*) FROM users WHERE metadata->>"$.views" > 50;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | users | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |   100.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
> DESCRIBE SELECT metadata->>"$.name" FROM users ORDER BY metadata->>"$.views" DESC;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra          |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------+
|  1 | SIMPLE      | users | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |   100.00 | Using filesort |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------+
This output indicates that the engine is accomplishing both the filtering and the ordering via a brute force approach, without using any indexes or other optimizations.
One way to improve this is adding an index. MySQL provides two equivalent ways to index JSON: virtual column indexes and functional indexes.
A virtual column index looks like this:
ALTER TABLE users ADD COLUMN views INT AS (metadata->>"$.views"), ADD INDEX viewsIdx(views);
SELECT metadata->>"$.name" FROM users ORDER BY views DESC;
This creates a virtual generated column on the table, and an index on that column.
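Whether the optimizer actually uses the new index depends on table size and its cost estimates, but you can check by re-running DESCRIBE against the rewritten query (a quick sanity check, reusing the statements above):

```sql
-- Look at the key and Extra columns to see whether viewsIdx is chosen
-- instead of a filesort.
DESCRIBE SELECT metadata->>"$.name" FROM users ORDER BY views DESC;
```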
A functional index looks like this:
ALTER TABLE users ADD INDEX viewsIdx((CAST(metadata->>"$.views" as SIGNED)));
SELECT metadata->>"$.name" FROM users ORDER BY metadata->>"$.views" DESC;
Because there is no explicit views column to reference, the query duplicates the metadata->>"$.views" expression, and the engine detects that the viewsIdx index can be used to optimize the query.
There are pros and cons for each.
- Virtual column index:
  - Pro: The virtual column can be referenced directly by queries.
  - Con: The virtual column complicates the table schema.
  - Con: The query must reference the virtual column, otherwise MySQL may not use the index.
- Functional index:
  - Con: It's very easy to write a query that looks like it can use the index, but doesn't.
That last point is especially easy to stumble over when the values being indexed are strings. This is because, in order for the index to be chosen, the collation used in the index expression must match the collation of the expression in the query.
The following looks like it should use the index, but doesn't:
ALTER TABLE users ADD INDEX nameIdx((CAST(metadata->>"$.name" as CHAR(256))));
SELECT metadata->>"$.views" FROM users where metadata->>"$.name" = "Nick";
This is because the CAST in the index definition creates a string with the system default collation (most likely utf8mb4_0900_ai_ci), while the expression in the query has a collation of utf8mb4_bin (the collation of all strings contained in JSON documents).
The most common way to avoid this is to specify the collation explicitly in the index expression:
ALTER TABLE users ADD INDEX nameIdx((CAST(metadata->>"$.name" as CHAR(256)) COLLATE utf8mb4_bin));
If you don't do this, the index will likely never be used.
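With the collation pinned in the index definition, the lookup from before becomes a candidate for that index. It's worth re-checking the plan rather than assuming (a sketch, reusing the earlier query):

```sql
-- The string extracted by ->> has collation utf8mb4_bin, so it can now match
-- the index expression. Check the key column for nameIdx.
DESCRIBE SELECT metadata->>"$.views" FROM users WHERE metadata->>"$.name" = "Nick";
```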
Even setting this aside, it's much harder to reason about functional indexes than virtual column indexes. For instance, while setting up these benchmarks I was never able to write an ORDER BY query that used a JSON functional index. For these reasons, I'll be using virtual column indexes for the rest of this experiment.
So this is how MySQL queries JSON columns, but how are those columns actually represented on disk?
Part 2: MySQL Under the Hood
The storage format for MySQL JSON objects is well documented. It's a pretty straightforward custom serialization format similar to BSON or protocol buffers.
Here's the relevant bit:
If the value is a JSON object, its binary representation will have a header that contains:
- the member count
- the size of the binary value in bytes
- a list of pointers to each key
- a list of pointers to each value
- The actual keys and values will come after the header, in the same order as in the header.
Basically, every object is stored as a single contiguous chunk, starting with a header that tells the engine where to find all of that object's children.
Except for this header, the binary format is a 1:1 representation of the JSON object, with no additional structures to aid in indexing. In theory, this header allows for quick lookups into the document. In practice, however, this is stymied by the fact that MySQL loads the entire document into memory, even for operations that don't need all of it.
In order to manipulate documents, MySQL also has a second, in-memory representation. In this format, the binary form is converted into a Document Object Model (DOM). A description of this format can be found in this C++ header file. The header file is quite large, but most of it is boilerplate code for maintaining the DOM and providing methods for traversing and iterating over it. There's nothing surprising about this implementation.
As a consequence of this design, operations on JSON documents are fast, but at a cost (the sketch after this list illustrates both points):
- In order to mutate a document, the entire document needs to be loaded into memory and converted into a DOM. Then, the DOM must be re-serialized into the binary format.
- There is no structural sharing between documents. Every document gets its own storage, even if two documents are completely identical.
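Here's a minimal sketch of what that cost model means in practice, using the users table from Part 1 and MySQL's JSON_STORAGE_SIZE function (exact byte counts will vary by server version):

```sql
-- Each row stores its own complete binary serialization of the document;
-- identical documents in different rows do not share any storage.
SELECT userId, JSON_STORAGE_SIZE(metadata) AS doc_bytes FROM users;

-- Per the cost model above, even a one-field mutation loads the document,
-- converts it to a DOM, applies the change, and re-serializes the result.
UPDATE users SET metadata = JSON_SET(metadata, '$.views', 125) WHERE userId = 1;
```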
Part 3: The Benchmarks
All of the tests were conducted on an Apple M2 Pro with 32 GB of RAM.
Here we explore how both MySQL and Dolt perform across a range of document sizes. In each test, the total database size remains the same, but the size of the individual documents (and therefore the total number of documents) varies.
Test 1: Exporting JSON
We want to see how quickly the engine can load, format, and return large JSON documents. For workflows where all the document manipulation is done by a client, this is the most common use case, so it needs to be fast.
Both MySQL and Dolt have functionality for dumping tables into .sql script files, so we'll use those tools and benchmark them:
- MySQL:
time mysqldump jsonTest > ~/json.sql
- Dolt:
time dolt dump -fn json.sql
|       | 64 KB * 1024 | 256 KB * 256 | 1 MB * 64 | 4 MB * 16 | 16 MB * 4 |
|-------|--------------|--------------|-----------|-----------|-----------|
| Dolt  | 0.326±0.005 | 0.314±0.009 | 0.314±0.005 | 0.326±0.009 | 0.354±0.005 |
| MySQL | 0.402±0.008 | 0.412±0.015 | 0.398±0.013 | 0.464±0.056 | 0.732±0.028 |
Test 2: Importing JSON
We want to see how quickly the engine can parse JSON and store it in a table. For workflows where all the document manipulation is done by a client, this is the second most common use case, so it needs to be fast.
- MySQL:
time mysql < ~/json.sql
- Dolt:
time dolt sql < json.sql
|       | 64 KB * 1024 | 256 KB * 256 | 1 MB * 64 | 4 MB * 16 | 16 MB * 4 |
|-------|--------------|--------------|-----------|-----------|-----------|
| Dolt  | 3.986±0.047 | 3.996±0.038 | 4.028±0.125 | 3.992±0.051 | 4.134±0.096 |
| MySQL | 3.296±0.334 | 2.968±0.287 | 3.686±1.084 | 3.328±0.453 | 3.146±0.232 |
Test 3: Copying a document with minor changes.
Copying documents and making minor changes to them is useful if you're trying to track the history of a document over time. If this can be done quickly, it's also a strong indicator that the database is using structural sharing, where the redundant parts of the different document versions are being reused instead of copied.
- MySQL:
time mysql -e 'create table jsonTable2 as select pk, JSON_SET(j, "$.level", 1) from jsonTable;'
- Dolt:
time dolt sql -q 'create table jsonTable2 as select pk, JSON_SET(j, "$.level", 1) from jsonTable;'
|       | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
|-------|--------------|---------------|------------|------------|------------|
| Dolt  | 0.966±0.032 | 0.224±0.009 | 0.078±0.018 | 0.03±0.001 | 0.024±0.09 |
| MySQL | 16.668±0.225 | 14.862±0.443 | 13.972±0.172 | 15.418±0.154 | 16.202±0.292 |
Test 4: Extracting a sub-value from a JSON column (no index)
In theory, this should be fast for both engines, but MySQL is held back by the need to load the document from disk. In smaller tables, where the operating system can hold the entire database in memory, the first lookup is slow but subsequent lookups are fast.
- MySQL:
time mysql -e 'select j->>"$.l" from test;'
- Dolt:
time dolt sql -q 'select j->>"$.l" from test;'
|       | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
|-------|--------------|---------------|------------|------------|------------|
| Dolt  | 0.020±0.001 | 0.012±0.001 | 0.011±0.001 | 0.011±0.001 | 0.01±0.001 |
| MySQL | 0.2±0.007 | 0.206±0.005 | 0.184±0.011 | 0.224±0.021 | 0.236±0.011 |
Test 5: Filtering on a JSON key (no index)
- MySQL:
time mysql -e 'select pk from test where j->>"$.l" = 10;'
- Dolt:
time dolt sql -q 'select pk from test where j->>"$.l" = 10;'
|       | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
|-------|--------------|---------------|------------|------------|------------|
| Dolt  | 0.262±0.004 | 0.07±0.001 | 0.022±0.004 | 0.012±0.001 | 0.01±0.004 |
| MySQL | 0.200±0.010 | 0.198±0.008 | 0.19±0.010 | 0.234±0.005 | 0.24±0.010 |
Test 6: Ordering by a JSON key (no index)
- MySQL:
time mysql -e 'select pk from test order by j->>"$.l";'
- Dolt:
time dolt sql -q 'select pk from test order by j->>"$.l";'
|       | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
|-------|--------------|---------------|------------|------------|------------|
| Dolt  | 0.268±0.004 | 0.076±0.005 | 0.030±0.001 | 0.012±0.004 | 0.010±0.001 |
| MySQL | Out of Memory | Out of Memory | Out of Memory | Out of Memory | Out of Memory |
The MySQL command fails with an error:
ERROR 1038 (HY001) at line 1: Out of sort memory, consider increasing server sort buffer size
MySQL appears to be trying to load every JSON document into memory at once, even though the documents themselves aren't being selected.
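If you just want the query to complete, the error message points at the server's sort buffer. A workaround would look something like the following sketch (the 256 MB figure is purely illustrative, and may still not be enough for the larger documents):

```sql
-- Raise the per-session sort buffer before retrying the ORDER BY.
SET SESSION sort_buffer_size = 268435456;  -- 256 MB, illustrative only
SELECT pk FROM test ORDER BY j->>"$.l";
```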
Test 7: Creating a virtual column index
- MySQL:
time mysql -e 'alter table test add column level int as (j->>"$.l"), add index idx(level);'
- Dolt:
time dolt sql -q 'alter table test add column level int as (j->>"$.l"), add index idx(level);'
|       | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
|-------|--------------|---------------|------------|------------|------------|
| Dolt  | 0.634±0.011 | 0.168±0.004 | 0.054±0.009 | 0.022±0.004 | 0.024±0.001 |
| MySQL | 0.224±0.009 | 0.218±0.018 | 0.208±0.008 | 0.244±0.015 | 0.294±0.027 |
After creating a virtual column index, the measured times for the previous extraction, filter, and order operations became indistinguishable from the engine's normal startup time: with an index, these operations are essentially free on both MySQL and Dolt, so there isn't much left worth comparing.
Exception: Even with an index, the ordering test still runs out of memory on MySQL.
Conclusion
We can draw the following conclusions about how Dolt and MySQL compare when it comes to handling large JSON documents:
- MySQL allows fast lookups in theory, but is held back by the fact that it needs to load entire documents into memory to do any operations on them. This slows down many operations unless the operating system has already cached the document in memory, and it makes other operations like ordering impossible due to memory limits.
- When using virtual column indexes, MySQL becomes as fast as Dolt, although creating these indexes may be slower depending on the document size.
- Above a certain size (approximately 256 KB for the documents used in the test), Dolt outperforms MySQL on every single metric except for import speed.
- Additionally, MySQL's 1 GB limit on JSON documents prevents documents from scaling indefinitely. Dolt does not have this limitation.
For smaller document sizes, MySQL still beats out Dolt on several of these benchmarks. We're still making performance improvements to Dolt, and we believe that this performance gap for small documents will continue to shrink over time.
But for larger documents, Dolt soundly wins out. This is because Dolt is able to do lots of operations without loading the entire document into memory. Thanks to Prolly Trees, even mutation operations only need to load the relevant portions of a document. And Dolt's structural sharing means that copying documents is basically free, since any part of the copy that hasn't changed gets to share storage with the original.
We can confidently say that if you're using MySQL with a JSON-heavy workflow, you should consider switching to Dolt.
Still not sure if Dolt is right for you? Follow us on Twitter, or join our Discord and we'll be happy to figure out how Dolt can help your use case.