JSON Showdown: Dolt vs SQLite


Dolt was created to be the world's first truly version controlled database. We think it does a great job at that. And in the process of making Dolt, we realized that the same data structures that make Dolt fast and efficient at version control also make it fast and efficient at working with JSON.

I'm talking about Prolly Trees. We've already talked about how Prolly Trees work in Dolt and how we can use them to efficiently represent JSON.

But we don't want you to take our word for it. Previously, I showed how Dolt measures up against Postgres and MySQL when interacting with large JSON documents. In both cases, we concluded that once JSON documents exceed a certain size (around 64 KB for MySQL and 256 KB for Postgres), Dolt outperforms the competition at pretty much every relevant operation.

To round things out, let's compare Dolt with the third and final of the big name SQL engines: SQLite.

SQLite is a bit different from Postgres and MySQL: it's not a server, it's a library that your program includes in order to interact directly with the database files. This has its tradeoffs:

Pros:

  • SQLite is more lightweight than running your DB engine in a separate process.
  • SQLite doesn't require inter-process communication or server management.

Cons:

  • Using SQLite increases the size of your program.
  • A SQLite database has fewer options for concurrent access by multiple clients.
  • SQLite has fewer access controls, since the client always has access to the entire database file.

These tradeoffs mean that you're less likely to see SQLite used in a production server where things like concurrency and access control are essential, but it's the de facto choice for local-first programs that use a local SQL DB for storage. Since we believe that local-first applications are a compelling use case for Dolt, we wanted to see how it measures up.

Using Dolt here even has benefits over SQLite: A common problem in local-first applications is syncing across devices, or syncing with a server when a device makes changes while not connected to the internet. Dolt's built-in conflict resolution abilities offer a solution. It's also common for these types of applications to need to store arbitrary JSON data, so the fact that Dolt can merge concurrent changes to JSON documents is very useful.

So how does the performance compare? Just like our previous head-to-head matchups, we'll start by introducing the syntax that SQLite uses for indexing JSON documents. Then, we'll take a look at SQLite's code to understand how it represents documents and their indexes. And finally, we'll run benchmarks to see how it performs for common operations.

Part 1: What does using JSON in SQLite look like?

For backwards compatibility purposes, SQLite doesn't have a JSON data type. Instead, JSON values are either valid JSON strings stored as a text type, or a custom binary representation stored as a blob type.

Note: JSONB requires SQLite version 3.45.0 or newer, released on 2024-01-15. If your SQLite shell is older than this, you'll need to update it first.

For example, here we create a table with both text and blob columns for storing JSON, and show different ways to create values for those columns, all of which are equivalent:

> CREATE TABLE usersJson(userId INT PRIMARY KEY, metadata TEXT);
> INSERT INTO usersJson(metadata) VALUES
    (json('{"name": "Tim", "numbers": [2, 3]}')), 
    (json_object('name', 'Tim', 'numbers', json_array(2, 3)));
> CREATE TABLE usersJsonB(userId INT PRIMARY KEY, metadata BLOB);
> INSERT INTO usersJsonB(metadata) VALUES 
    (jsonb('{"name": "Tim", "numbers": [2, 3]}')),    
    (jsonb_object('name', 'Tim', 'numbers', jsonb_array(2, 3)));

The json function here is used to validate and minify/canonicalize the JSON. It is not strictly necessary, but validation and canonicalization are both good habits. The jsonb function converts the input JSON string into the binary format.
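For a quick illustration of the difference (the results shown in comments are what we'd expect, based on the SQLite documentation):

SELECT json('  { "name" : "Tim" }  ');    -- '{"name":"Tim"}'
SELECT typeof(jsonb('{"name": "Tim"}'));  -- 'blob'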

This is similar to how Postgres has both the JSON and JSONB (JSON Binary) data types. SQLite even takes a page from Postgres in calling their binary representation JSONB.

However, while Postgres optimizes the two data types for different use cases, SQLite does not. The SQLite documentation explains:

The "JSONB" name is inspired by PostgreSQL, but the on-disk format for SQLite's JSONB is not the same as PostgreSQL's. The two formats have the same name, but are not binary compatible. The PostgreSQL JSONB format claims to offer O(1) lookup of elements in objects and arrays. SQLite's JSONB format makes no such claim. SQLite's JSONB has O(N) time complexity for most operations in SQLite, just like text JSON. The advantage of JSONB in SQLite is that it is smaller and faster than text JSON - potentially several times faster. There is space in the on-disk JSONB format to add enhancements and future versions of SQLite might include options to provide O(1) lookup of elements in JSONB, but no such capability is currently available.

Efficient path lookups within a document can be accomplished by creating an expression index:

CREATE INDEX name_idx ON usersJson((metadata->>'$.name'));

Or via an index on a virtual generated column:

ALTER TABLE usersJson ADD COLUMN username TEXT GENERATED ALWAYS AS (metadata->>'$.name') VIRTUAL;
CREATE INDEX username_idx ON usersJson(username);

There are pros and cons for each.

  • Virtual column index:
    • Pro: The virtual column can be referenced directly by queries.
    • Con: The virtual column complicates the table schema.
    • Con: The query must reference the virtual column, otherwise SQLite may not use the index.
  • Functional index:
    • Pro: It doesn't add anything to the table schema.
    • Con: It's very easy to write a query that looks like it can use the index, but doesn't (see the sketch below).

I prefer virtual columns because it's easier to reason about when a query can use the index.
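SQLite's EXPLAIN QUERY PLAN makes it easy to check whether a given query can use a given index. Here's a minimal sketch, assuming the usersJson table from earlier, with either the name_idx expression index or the username virtual column and its index defined above:

-- With the virtual column and its index, queries reference the column directly:
EXPLAIN QUERY PLAN SELECT userId FROM usersJson WHERE username = 'Tim';

-- With the expression index, the WHERE clause must repeat the indexed expression
-- exactly as it appears in the CREATE INDEX statement:
EXPLAIN QUERY PLAN SELECT userId FROM usersJson WHERE metadata->>'$.name' = 'Tim';

-- A superficially similar query written with json_extract() is a different
-- expression, so the index may go unused and SQLite may scan the whole table:
EXPLAIN QUERY PLAN SELECT userId FROM usersJson WHERE json_extract(metadata, '$.name') = 'Tim';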

Part 2: SQLite Under the Hood

SQLite is written entirely in C. The code is extensively documented and seems to be written with readability in mind.

The source code for the JSONB format and other json handling logic is here. Most of the logic is for parsing and generating JSON strings, but it also contains implementations for the various JSON functions.

Of particular note, all mutation operations are performed on the JSONB blob format. Thus, we expect that using JSONB will have better performance because the engine will only need to convert to and from the text format when the data is inserted and when it's output to the user, instead of during every operation.
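As a rough sketch of what that means in practice (using the usersJson and usersJsonB tables from Part 1):

-- Text JSON: the engine parses the text into its internal blob form, applies the
-- edit, then serializes the result back to text before storing it.
UPDATE usersJson SET metadata = json_set(metadata, '$.name', 'Aaron');

-- JSONB: the stored blob is edited directly and written back as a blob, so the
-- text conversions are skipped.
UPDATE usersJsonB SET metadata = jsonb_set(metadata, '$.name', 'Aaron');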

We also see that mutation operations are performed by the jsonBlobEdit function. If the modification inserts or deletes bytes from the document, then all data following the modification point must be moved, akin to inserting into the middle of an array. We also see that jsonBlobEdit must be supplied with the offset into the document where the modification occurs, which must be computed by walking the document tree. We expect walking the document tree to be faster than parsing the entire document, since the blob format includes precomputed sizes for child objects, allowing the parser to skip over them.

Part 3: The Benchmarks

All of the tests were conducted on an Apple M2 Pro with 32 GB of RAM.

Here we explore how both SQLite and Dolt perform when working with multiple different document sizes. In each case, the total database size remains the same, but the size of the documents (and the total number of documents) varies. Each document differs only in a single field, which is used by the test queries.

I'm not aware of any standard benchmark that specifically tests operations on large JSON documents, so we're going to roll our own. Our data will be documents generated by this method for representing the von Neumann ordinals:

CREATE TABLE vn(pk int primary key, j blob);
with recursive cte (pk, j) as (
    select 0, JSONB_OBJECT('l', 0)
    union all
    select pk+1, JSONB_SET(JSONB_INSERT(j, '$.k' || pk, j), '$.l', pk+1) from cte where pk <= 4 )
    insert into vn select * from cte;

Why not use realistic data? Wouldn't that be more meaningful?

We chose this over a more realistic data set because it helps ensure that we're not accidentally relying on the test data having some specific shape. The documents generated by this process cover a range of sizes and include both deeply nested objects and shallow objects with many fields, and everything in between. Any engine that can handle this should be able to handle any other shape of data.
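For a sense of what these documents look like, here are the first few that the query above generates (formatting approximate; each document embeds all of the previous ones):

SELECT pk, json(j) FROM vn ORDER BY pk LIMIT 3;
-- 0 | {"l":0}
-- 1 | {"l":1,"k0":{"l":0}}
-- 2 | {"l":2,"k0":{"l":0},"k1":{"l":1,"k0":{"l":0}}}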

Test 1: Exporting JSON

We want to see how quickly the engine can load, format, and return large JSON documents. For workflows where all the document manipulation is done by a client, this is the most common use case, so it needs to be fast.

Both SQLite and Dolt have functionality for dumping tables into .sql script files, but if we try to export JSONB columns that way, we'll get the bytestream instead of valid JSON. We're benchmarking the ability of the engine to emit valid exported JSON, so we have to use a SELECT query that converts it:

  • SQLite JSON: time sqlite3 json.db "select j from json_table;"
  • SQLite JSONB: time sqlite3 json.db "select json(j) from jsonb_table;"
  • Dolt: time dolt dump -fn json.sql

Exporting JSON Graph

Test 2: Importing JSON

We want to see how quickly the engine can parse JSON and store it in a table. For workflows where all the document manipulation is done by a client, this is the second most common use case, so it needs to be fast.

  • Dolt: time dolt sql < json.sql
  • SQLite JSON: time sqlite3 json.db < json.sql
  • SQLite JSONB: time sqlite3 json.db < json.sql

For Dolt, we reused the dump file generated by the previous test. For SQLite, we wrote a Python script to generate an equivalent dump file for us.
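That script isn't anything special: it just writes out a schema statement followed by one INSERT per document. A minimal sketch of what the generated dump file contains (the column names here are assumptions chosen to match the queries below):

CREATE TABLE json_table(pk INT PRIMARY KEY, j TEXT);
INSERT INTO json_table VALUES (0, '{"l":0}');
INSERT INTO json_table VALUES (1, '{"l":1,"k0":{"l":0}}');
-- For the JSONB variant, each value is wrapped so it's stored as a blob:
-- INSERT INTO jsonb_table VALUES (0, jsonb('{"l":0}'));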

Importing JSON Graph

These first two graphs don't paint a flattering picture for Dolt. What's going on here?

What we're actually seeing here is the advantage of avoiding inter-process communication and skipping the overhead of the network stack. Dolt is designed as a drop-in replacement for MySQL, and thus implements a standard MySQL server. In these tests, a MySQL client was connecting to the Dolt server over a loopback interface. Meanwhile, the SQLite shell was interacting with the database file directly.

In hindsight, we could have modified the test so that neither engine is relying on a network interface: when a Dolt server is running on the same device as the client, the client can also connect over a unix socket, which eliminates the network overhead. If we wanted to get rid of IPC entirely, the Dolt client also has a command line interface that can access a local database and spin up an execution engine in the same process, just like SQLite. We opted not to use that interface in our benchmarks because it's not the recommended way for users to interact with Dolt.
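For example, here's a hedged sketch of the unix socket setup (treat the exact flag names as assumptions; they may vary between versions):

# Start the Dolt server listening on a unix socket instead of a TCP port:
dolt sql-server --socket /tmp/dolt.sock
# Connect with any MySQL-compatible client via the socket, bypassing the network stack:
mysql -u root --socket=/tmp/dolt.sock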

In contrast, with SQLite there's no official way to connect to a remote database even if you want to: it's embedded or nothing.

Let's see how Dolt compares on the remaining benchmarks.

Test 3: Copying a document with minor changes.

Copying documents and making minor changes to them is useful if you're trying to track the history of a document over time. If this can be done quickly, it's also a strong indicator that the database is using structural sharing, where the redundant parts of the different document versions are being reused instead of copied.

  • SQLite: time sqlite3 json.db "create table json_table_2 as select pk, JSON_SET(j, '$.level', 1) from json_table;"
  • Dolt: time dolt sql -q 'create table json_table_2 as select pk, JSON_SET(j, "$.level", 1) from json_table;'

Copying JSON Graph

Time (s)       64 KB * 8192   256 KB * 2048   1 MB * 512    4 MB * 128    16 MB * 32
Dolt           0.966±0.032    0.224±0.009     0.078±0.018   0.03±0.001    0.024±0.09
SQLite JSON    2.557±0.058    2.853±0.060     2.997±0.195   2.870±0.168   3.614±0.596
SQLite JSONB   0.312±0.031    0.572±0.068     0.601±0.133   0.411±0.074   0.732±0.303

Test 4: Extracting a sub-value from a JSON column (no index)

In theory, this should be fast for both engines, but SQLite is held back by the need to load the document from disk. In smaller tables, where the operating system can hold the entire db in memory, the first lookup is slow but subsequent lookups are fast.

  • SQLite JSON: time sqlite3 json.db "select j->>'$.level' from json_table;"
  • SQLite JSONB: time sqlite3 json.db "select j->>'$.level' from jsonb_table;"
  • Dolt: time dolt sql -q 'select j->>"$.level" from json_table;'

Extracting JSON Graph

Time (s)       64 KB * 8192   256 KB * 2048   1 MB * 512    4 MB * 128    16 MB * 32
Dolt           0.020±0.001    0.012±0.001     0.011±0.001   0.011±0.001   0.01±0.001
SQLite JSON    1.354±0.121    1.316±0.045     1.412±0.062   1.390±0.068   1.958±0.067
SQLite JSONB   0.064±0.007    0.055±0.004     0.056±0.003   0.067±0.004   0.180±0.023

Why is Dolt faster for large documents? That seems weird.

Remember that the total size of the database is kept constant: we're comparing the engine's performance on many smaller documents vs. fewer larger ones. Since Dolt only needs to load and parse the necessary parts of each document, the size of an individual document doesn't matter for most operations, only the number of documents. Larger documents for the same amount of data mean fewer rows, and less work for the engine. SQLite, by comparison, appears to scale in runtime with the total size of the data set, regardless of how that data is distributed.

Test 5: Filtering on a JSON key (no index)

  • SQLite JSON: time sqlite3 json.db "select pk from json_table where j->>'$.level' = 10;"
  • SQLite JSONB: time sqlite3 json.db "select pk from jsonb_table where j->>'$.level' = 10;"
  • Dolt: time dolt sql -q 'select pk from json_table where j->>"$.level" = 10;'

Filtering JSON Graph

Time (s)       64 KB * 8192   256 KB * 2048   1 MB * 512    4 MB * 128    16 MB * 32
Dolt           0.262±0.004    0.07±0.001      0.022±0.004   0.012±0.001   0.01±0.004
SQLite JSON    1.277±0.036    1.326±0.076     1.353±0.036   1.394±0.061   1.936±0.067
SQLite JSONB   0.059±0.001    0.053±0.002     0.055±0.001   0.070±0.001   0.189±0.037

Test 6: Ordering by a JSON key (no index)

  • SQLite JSON: time sqlite3 json.db "select pk from json_table order by j->>'$.level';"
  • SQLite JSONB: time sqlite3 json.db "select pk from jsonb_table order by j->>'$.level';"
  • Dolt: time dolt sql -q 'select pk from json_table order by j->>"$.level";'

Ordering JSON Graph

Time (s)       64 KB * 8192   256 KB * 2048   1 MB * 512    4 MB * 128    16 MB * 32
Dolt           0.268±0.004    0.076±0.005     0.030±0.001   0.012±0.004   0.010±0.001
SQLite JSON    1.291±0.024    1.283±0.038     1.312±0.063   1.381±0.055   1.933±0.033
SQLite JSONB   0.062±0.002    0.054±0.002     0.053±0.001   0.067±0.004   0.210±0.09

Test 7: Creating an index

  • SQLite JSON: time sqlite3 json.db "alter table json_table add column level int as (j->>'$.level'); create index idx on json_table(level);"
  • SQLite JSONB: time sqlite3 json.db "alter table jsonb_table add column level int as (j->>'$.level'); create index bidx on jsonb_table(level);"
  • Dolt: time dolt sql -q 'alter table json_table add column level int as (j->>"$.level"), add index idx(level);'

Virtual Column Index Graph

Time (s)       64 KB * 8192   256 KB * 2048   1 MB * 512    4 MB * 128    16 MB * 32
Dolt           0.634±0.011    0.168±0.004     0.054±0.009   0.022±0.004   0.024±0.001
SQLite JSON    1.396±0.092    1.300±0.053     1.361±0.076   1.380±0.042   1.923±0.048
SQLite JSONB   0.082±0.002    0.060±0.001     0.062±0.008   0.074±0.005   0.183±0.010

After creating a virtual column index, we used the EXPLAIN command to see which queries would use the index:

[Table: for each of Dolt, SQLite JSON, and SQLite JSONB, whether the extract, filter, and order queries were able to use the virtual column index]

In cases where the index was unused, the execution time was unsurprisingly the same as if there was no index. In cases where the index was used, the execution time became indistinguishable from the engine's normal startup time: with an index, these operations are essentially free on both SQLite and Dolt. The graphs aren't very useful.

Test 8: Required Disk Space

Required Disk Space

Size (MB)      64 KB * 8192   256 KB * 2048   1 MB * 512    4 MB * 128    16 MB * 32
Dolt           14             2.6             2.3           2.2           4
SQLite JSON    436            436             437           437           435
SQLite JSONB   286            286             287           287           299

These numbers suggest that not only does SQLite not do any form of structural sharing for text or blobs, but also that text and blobs are stored entirely uncompressed. This is likely done for speed, but given how compressible JSON is as a format, it can make SQLite databases larger than they need to be by an order of magnitude.

Conclusion

We can draw the following conclusions about how Dolt and SQLite compare when it comes to handling large JSON documents:

  • SQLite in its standard workflow is faster at importing and exporting JSON documents than Dolt in its standard workflow. This is not very surprising, since SQLite doesn't rely on inter-process communication. Dolt is capable of accessing database files directly just like SQLite does, and we can improve support for that workflow if there's demand to use Dolt as a replacement for SQLite.
  • For a given format in SQLite (JSON vs JSONB), many operations all take approximately the same amount of time for the same total database size, regardless of whether the database contains many small documents or fewer larger ones. For instance, in our benchmarks, a simple filter, a simple ordering, a simple document lookup, and creating an index all take approximately 1.4 seconds for JSON and approximately 0.07 seconds for JSONB. This suggests that the runtime is being dominated by loading and parsing the documents, and the actual query evaluation is negligible.
  • Even accounting for the above, Dolt is noticeably slower than SQLite at importing JSON documents. This is because of how Dolt builds a prolly tree to partially index the document as it's imported. We believe there's a lot of room to optimize this operation if it ever turns out to be a bottleneck for users.
  • SQLite's claims about the difference between its JSON and JSONB formats are accurate: JSONB is several times faster than JSON for most operations, but it doesn't support optimized lookups into large documents. (Dolt, meanwhile, supports O(log n) lookups in a document, even into arbitrarily nested objects.)
  • When storing multiple similar documents, Dolt uses significantly less storage space than SQLite due to its ability to deduplicate fragments of documents, as well as its use of compression.

Additionally, SQLite has a max document size of 2 GB, while Dolt has no limit on document size.

Overall, I was really impressed with how optimized SQLite is. Although it comes with restrictions (local-only, no access control and limited concurrency), it's able to leverage those restrictions into a real powerhouse. It's clear that Dolt, as the younger and newer database, has room to grow, especially when it comes to importing and exporting data. However, Dolt still shines in its ability to compactly store similar documents and efficiently copy documents and make changes to them. It's a great candidate for any workflow that involves storing and manipulating multiple versions of a document, such as history tracking or version control.

If you're using SQLite with a JSON-heavy workflow, we believe that Dolt has a lot to offer as an alternative. And if you have specific concerns or needs, we prioritize user needs highly when deciding where to focus our attention.

Still not sure if Dolt is right for you? Follow us on Twitter, or join our Discord and we'll be happy to figure out how Dolt can help your use case.
