The Exciting Future of JSON on Dolt

tl;dr: Querying JSON documents in Dolt is about to get a huge performance boost. Beginning with Dolt 1.40, we're changing the way that Dolt writes JSON documents to disk. This will allow us to greatly improve performance when working with JSON. This is not a breaking change: prior versions of Dolt can still read documents written by newer versions and vice versa.

JSON is everywhere. And it's easy to see why: unstructured data is flexible. Storing your data as JSON means you don't have to worry about data normalization up-front, and you can handle data of different shapes. Schema-ed tables are great when all your data is structured the same and you already know how you're going to access it. But sometimes you just want to throw all your documents in a database and figure out the schema later.

SQL is historically pretty crummy at working without schemas. But every major SQL dialect has supported JSON columns for about a decade now, so one would hope that by this point, using JSON in SQL would be pretty polished, right?

Eh, not really. I've never been terribly impressed with how most SQL engines handle JSON, especially for large documents. Manipulating JSON in MySQL is slow and may require 3x storage. Oracle Database didn't even have a JSON type before 2021, just functions that operate on text columns. Even Postgres, SQL-land's golden child, has two different JSON data types, forcing you to give up fast read and write speeds if you want indexes.
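
To make the Postgres split concrete, here's roughly what the trade-off looks like (this is Postgres syntax, not something you'd run in Dolt, and the table and column names are made up):

CREATE TABLE docs_text (doc json);     -- raw text: fast to write and to read back whole,
                                       -- but re-parsed from scratch on every query
CREATE TABLE docs_binary (doc jsonb);  -- pre-parsed binary: indexable and fast to query,
                                       -- but slower to ingest and to serialize back out
CREATE INDEX docs_binary_idx ON docs_binary USING GIN (doc);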

Since Dolt is a drop-in replacement for MySQL, Dolt supports JSON too. Although frankly, we weren't doing any better on the JSON front: we were even slower than MySQL.

But that's about to change.

The Status Quo

Why is JSON on SQL in such a sorry state? I think it's a chicken-and-egg problem. Users aren't manipulating large JSON documents in SQL because it's slow. SQL engines aren't prioritizing improving JSON manipulation because it's not how people are using the database.

Over the past couple of months, we've been figuring out how to make the JSON experience better in Dolt. We discussed following in Postgres's footsteps and creating an alternative storage format that would allow for faster queries at the cost of slower reads and writes. We even had a working internal implementation. But forcing users to think about storage implementation details is a pretty poor experience for someone who just wants to, you know, put their data in a database. And if the main way that people use JSON is just selecting it, then anything that makes reading JSON slower is a non-starter.

So we went back to the drawing board. If we were going to do this, we were going to do this right.

The New Hotness

Today, we're announcing the results of this initiative. Beginning with today's release (Dolt 1.40), we're tweaking how Dolt stores JSON documents in a way that will greatly improve query execution speed without impacting read performance.

Over the next few weeks I'll be delving into more detail about what exactly this means and how we achieved it, but here are the most important parts:

  • Starting with Dolt 1.40, newly created JSON documents will contain additional metadata that allows for efficient indexing and even mutation of stored documents.
  • No action is required to get these benefits other than updating Dolt. No database migration is necessary.
  • Dolt servers running prior to this change can still read newer documents and vice versa, just without gaining the new performance benefits.
  • We are actively rewriting Dolt's SQL engine to take advantage of these changes. So far, we have rewritten the JSON_EXTRACT, JSON_INSERT, and JSON_CONTAINS_PATH system functions (examples below), with more to come.
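
For reference, here's what those three functions look like in plain MySQL syntax (the table t and column j here are hypothetical):

SELECT JSON_EXTRACT(j, '$.name') FROM t;               -- read the value at a path
SELECT JSON_INSERT(j, '$.nickname', 'Tim') FROM t;     -- add a value at a path that doesn't exist yet
SELECT JSON_CONTAINS_PATH(j, 'one', '$.name') FROM t;  -- check whether a path exists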

Our endgame? Extending Dolt's powerful version control features to JSON documents. We're really proud of how Dolt's prolly-tree-based storage system allows for version-controlled tables with unparalleled diffing and merging capabilities. In particular, Dolt allows you to perform lightning-fast three-way merges on your tables. And while we support three-way merges on JSON documents too, they're not nearly as fast. But there's no reason they can't be. And soon, they will be.

Even setting aside Dolt's incredible version control capabilities, we believe that Dolt's storage layer can make it the best SQL server for JSON. Dolt was created as a drop-in replacement for MySQL, but thanks to these changes, we're now faster than MySQL at handling large JSON documents.

Want some proof? Here I used a recursive common table expression to create JSON documents that represent the von Neumann ordinals:

CREATE TABLE jsonTable(pk int primary key, j json);
INSERT INTO jsonTable
  WITH RECURSIVE cte (pk, j) AS (
    SELECT 0, JSON_OBJECT("K", "V")
    UNION ALL
    SELECT pk+1, JSON_INSERT(j, CONCAT("$.k", pk), j) FROM cte WHERE pk < 27
  )
  SELECT * FROM cte;
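
To make the construction concrete, here's what the first few documents work out to (derived by hand from the CTE above; each row embeds all of the previous rows under new keys):

SELECT pk, j FROM jsonTable WHERE pk < 3;
-- pk = 0: {"K": "V"}
-- pk = 1: {"K": "V", "k0": {"K": "V"}}
-- pk = 2: {"K": "V", "k0": {"K": "V"}, "k1": {"K": "V", "k0": {"K": "V"}}}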

Each row is twice the size of the previous row, and the last row of this table holds an 8-gigabyte JSON document:

dolt sql -q "select * from jsonTable where pk = 27" | wc -c
8053588004

So how fast can we query it?

> SELECT pk, JSON_EXTRACT(j, '$.k7.k5.k3.k2.k0') FROM jsonTable WHERE pk = 27;
+----+-------------------------------------+
| pk | JSON_EXTRACT(j, '$.k7.k5.k3.k2.k0') |
+----+-------------------------------------+
| 27 | {"K": "V"}                          |
+----+-------------------------------------+
1 row in set (0.00 sec)

Faster than the server can count, apparently.

But reading a big document isn't that impressive: that just means we have some kind of index on it. We get fast lookups, but we pay an upfront cost in building the index.

What would be really impressive is modifying a document and inserting it into a new table.

CREATE TABLE jsonTable2 AS SELECT pk, JSON_INSERT(j, '$.k7.k5.k3.k2.foo', 'bar') FROM jsonTable;

This ran instantly too. I just copied and modified an 8 GB JSON document instantly. (And a 4 GB document, and a 2 GB...)

How does this compare to MySQL? Turns out MySQL doesn't even allow documents larger than 1 GB. And when I limited the table to 21 rows (about 90 MB), the JSON_EXTRACT query took 50 ms, and making the new table took three whole seconds. It's not even a close comparison.
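
For anyone reproducing the comparison: the MySQL side used the same table definition and the same recursive CTE, just with a smaller cap so the table tops out at 21 rows. A sketch of what that looks like:

INSERT INTO jsonTable
  WITH RECURSIVE cte (pk, j) AS (
    SELECT 0, JSON_OBJECT("K", "V")
    UNION ALL
    SELECT pk+1, JSON_INSERT(j, CONCAT("$.k", pk), j) FROM cte WHERE pk < 20
  )
  SELECT * FROM cte;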

"Stop! Stop! He's already dead!"

And what about storage space? Each table contains a total of 16 GB of JSON, and there are two of them. So how much storage space did Dolt need for storing 32 GB of JSON?

% du -sh .
1.8G	.

That's nearly 20x compression. JSON is compressible, but not that compressible: a cursory Google search suggests typical ratios are between 5x and 10x. The extra factor we see here comes from a combination of structural sharing and compressed archives, both of which are a product of how Dolt uses content-addressed storage. You can read more about Dolt and content-addressed storage here. While we took this approach for its usefulness in storing the database's entire history, it turns out to also be quite effective at deduplicating large documents with incremental changes.

I could keep going on about how powerful this all is, but I'll stop here. JSON on Dolt is now officially fast.

As always, we want to hear from you so that we can make Dolt the best version-controlled SQL database possible! Feel free to join our Discord or follow us on Twitter and let us know what you think.
