JSON Showdown: Dolt vs MongoDB

This is the fourth in a series of tests putting Dolt head-to-head against other database engines to see how they handle large JSON documents, comparing their design, performance, and implementation choices.

Up until now, we've only been comparing Dolt to SQL databases: previously we looked at MySQL, Postgres, and SQLite. That's because Dolt is also a SQL database, designed to be a drop-in replacement for MySQL. And it turns out that for large enough documents, Dolt outperforms all of them in most categories. It's not that surprising: none of these existing engines were originally designed with JSON in mind, while Dolt uses novel prolly-tree-based data structures that naturally lend themselves to fast operations on structured data.

But all this raises the question: how does Dolt measure up against something that was built from the ground up for JSON?

Enter MongoDB.

  • Tagline: A document database designed for ease of application development and scaling.
  • Initial Release: February 2009
  • GitHub: https://github.com/mongodb/mongo

MongoDB is a common choice for Node.js servers because it's all JSON and JavaScript. In fact, the MongoDB interactive shell is literally a Node REPL with MongoDB's JavaScript library pre-loaded. For anyone who's worked with Node, there's no new query language to learn.

What does using MongoDB look like?

Because MongoDB isn't a relational database, its queries won't map 1:1 onto the queries you'd write for a SQL engine like Dolt. But as we'll see, most SQL concepts map pretty cleanly onto MongoDB. This actually surprised me at first: NoSQL databases often leverage their structure to enable queries that would be difficult to write in SQL, such as the pattern-matching expressions found in SPARQL and other query languages. But as far as I could tell, MongoDB doesn't have this functionality, instead preferring operations that will be familiar to most SQL users.

MongoDB organizes data into collections, where each collection is a set of documents with the same schema. This is roughly equivalent to SQL databases, which organize data into tables, where each table is a set of rows with the same columns.

Unlike SQL databases, which have a special column type for JSON data, all data in MongoDB is JSON. There are no explicit schemas either: documents in a collection can contain whatever fields they like, including nested ("embedded") JSON objects as well as references to other documents in the database. Queries that require data from multiple collections can use MongoDB's $lookup operation, which functions similarly to a join in a relational database.
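
To make that concrete, here's a rough mongosh sketch of embedding versus referencing; the customers and orders collections are hypothetical, but $lookup and the calls shown are standard MongoDB API:

// Hypothetical collections, purely to illustrate the two approaches.

// Embedding: the address lives inside the customer document itself.
db.customers.insertOne({
  _id: 1,
  name: "Ada",
  address: { city: "London", zip: "NW1" },
});

// Referencing: an order stores only the customer's _id...
db.orders.insertOne({ _id: 100, customer_id: 1, total: 42 });

// ...and a $lookup stage resolves the reference at query time,
// much like a join in a relational database.
db.orders.aggregate([
  { $lookup: {
      from: "customers",
      localField: "customer_id",
      foreignField: "_id",
      as: "customer",
  } },
]);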

The MongoDB development blog's best practices doc says to prefer embedding over references whenever possible, because lookups are slow. However, it also suggests trying to keep documents small, since the entire document may be loaded even if only part of it is needed for a query.

Additionally, MongoDB imposes some limits on documents:

  • A single document, including all of its embedded sub-documents, can't be larger than 16 MB (which is presumably why the 16 MB benchmarks below show N/A for MongoDB).
  • Documents can only be nested up to 100 levels deep.

It's unclear how likely users are to hit these limits in practice. But the possibility means users have to keep them in mind when designing their schema.

Indexes in MongoDB are likewise very similar to indexes in SQL databases:

  • Indexes improve read performance by reducing the number of documents that the engine needs to process.
  • However, indexes can impede write performance, because every edit to the collection also requires updating its indexes.
  • Indexes must specify which document fields are being indexed; an index can cover a single field or several, as in the sketch below.
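
For instance, creating indexes in mongosh looks something like this; the users collection and its fields are hypothetical, but createIndex and getIndexes are the real API:

// Hypothetical "users" collection, just to show the API shape.

// Index a single top-level field (1 = ascending order).
db.users.createIndex({ email: 1 });

// A compound index covering two fields.
db.users.createIndex({ country: 1, signup_date: -1 });

// Inspect the indexes defined on the collection.
db.users.getIndexes();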

Queries in MongoDB are expressed as a pipeline of operations. For example, in one of our benchmarks we select a single field from a document collection, ordered by the value of another field. We express that as a pipeline: a $sort operation followed by a $project operation:

db.test.aggregate([
  { $sort: { level: 1 } },
  { $project: { id_: { $getField: "id_" } } },
]);

This syntax is expressive and powerful, and frankly more intuitive than the way SQL queries are structured.

The Benchmarks

All of the tests were conducted on an Apple M2 Pro with 32 GB of RAM.

Here we explore how both MongoDB and Dolt perform when working with different document sizes. In each case, the total database size remains the same, but the size of the documents (and the total number of documents) varies. Each document differs only in a single field, which is used by the test queries.

I'm not aware of any standard benchmark that specifically tests operations on large JSON documents, so we're going to roll our own. Our data will be documents generated by this method for representing the von Neumann ordinals:

CREATE TABLE vn(pk int primary key, j blob);
with recursive cte (pk, j) as (
    select 0, JSONB_OBJECT('l', 0)
    union all
    select pk+1, JSONB_SET(JSONB_INSERT(j, '$.k' || pk, j), '$.l', pk+1) from cte where pk <= 4
)
insert into vn select * from cte;

Why not use realistic data? Wouldn't that be more meaningful?

We chose this over a more realistic data set because it helps ensure that we're not accidentally relying on the test data having some specific shape. The generated documents cover a range of sizes and shapes, from deeply nested objects to shallow objects with many fields, and everything in between. Any engine that can handle this should be able to handle any other shape of data.

For benchmarking MongoDB, we execute the queries in mongosh. However, since mongosh is an entire Node.js REPL, it has a long startup time that I didn't want to include in the benchmark. To get around that, I wrote a helper function to time the queries and stuck it in my .mongoshrc.js file:

// Times a query (passed as a thunk returning a cursor) and
// returns the elapsed wall-clock time in seconds.
function time(query) {
  const t1 = new Date();
  const result = query().toArray();
  const t2 = new Date();
  return (t2 - t1) / 1000;
}

The call to toArray() is necessary because MongoDB queries return cursors that don't necessarily fetch every result immediately. Calling toArray on the cursor forces the cursor to exhaust itself, which is included in the measured time.
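
With the helper in place, a benchmark run is a one-liner in mongosh; for example, something like the filter query used later in Test 5:

// Returns the elapsed time in seconds, including draining the cursor.
time(() => db.test.find({ "level": 1 }, { "id_": true }));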

The current version of MongoDB always creates indexes in the background, which means I couldn't find a good way to measure the cost of building an index in MongoDB. But we can still see the effect that indexes have on query times.
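
For reference, an index like the ones used in the "index" rows below can be created with a single call; my assumption here is that the indexed field is level, the field the test queries project, filter, and sort on:

// Presumed index for the indexed benchmark runs: a single-field
// index on "level". MongoDB builds it in the background, which is
// why the build itself isn't timed here.
db.test.createIndex({ level: 1 });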

Test 1: Exporting JSON

We want to see how quickly the engine can load, format, and return large JSON documents. For workflows where all the document manipulation is done by a client, this is the most common use case, so it needs to be fast.

  • MongoDB: time mongoexport -d=database_name -c=collection_name > export.jsonl
  • Dolt: time dolt dump -fn json.sql

Exporting JSON Graph

| | 64 KB * 1024 | 256 KB * 256 | 1 MB * 64 | 4 MB * 16 | 16 MB * 4 |
|---|---|---|---|---|---|
| Dolt | 0.326s±0.005 | 0.314s±0.009 | 0.314s±0.005 | 0.326s±0.009 | 0.354s±0.005 |
| MongoDB | 2.713s±0.062 | 2.716s±0.033 | 2.830s±0.010 | 2.878s±0.029 | N/A |

While many databases use a custom binary encoding for documents, Dolt chooses to store the document as valid JSON, alongside additional metadata derived by partially indexing the document. This allows us to optimize for returning JSON to the client, while still supporting other operations efficiently. We see the impact of this here.

Test 2: Importing JSON

We want to see how quickly the engine can parse JSON and store it in a table. For workflows where all the document manipulation is done by a client, this is the second most common use case, so it needs to be fast.

  • Dolt: time dolt sql < json.sql
  • MongoDB: time mongoimport -d=test -c=test import.jsonl

Importing JSON Graph

| | 64 KB * 1024 | 256 KB * 256 | 1 MB * 64 | 4 MB * 16 | 16 MB * 4 |
|---|---|---|---|---|---|
| Dolt | 3.986s±0.047 | 3.996s±0.038 | 4.028s±0.125 | 3.992s±0.051 | 4.134s±0.096 |
| MongoDB | 2.396s±0.027 | 2.478s±0.039 | 2.709s±0.030 | 2.688s±0.017 | N/A |

As we mentioned in our PostgreSQL comparison, there's a tradeoff to how we store documents: importing JSON into Dolt introduces a bit of overhead while we partially index it. We believe there's room to improve this.

Test 3: Copying a document with minor changes

Copying documents and making minor changes to them is useful if you're trying to track the history of a document over time. If this can be done quickly, it's also a strong indicator that the database is using structural sharing, where the redundant parts of the different document versions are being reused instead of copied.

  • MongoDB:
mongosh <<'MONGO'
time(() => db.test.aggregate([
    { "$replaceWith":
        { "$setField":
            { "field":"level", "input":"$$ROOT", "value":1}
        }
    },
    { "$out": "othercollection"}
]))
MONGO
  • Dolt: time dolt sql -q 'create table json_table_2 as select pk, JSON_SET(j, "$.level", 1) from json_table;'

Copying JSON Graph

| | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
|---|---|---|---|---|---|
| Dolt | 0.966s±0.032 | 0.224s±0.009 | 0.078s±0.018 | 0.03s±0.001 | 0.024s±0.09 |
| MongoDB | 0.603s±0.016 | 0.581s±0.025 | 0.571s±0.034 | 0.755s±0.013 | N/A |

The fact that MongoDB's runtime remains roughly constant across document sizes suggests that MongoDB is not leveraging structural sharing: instead, the time is spent copying the entire dataset.

Test 4: Extracting a sub-value from a JSON column (no index)

  • MongoDB:
mongosh <<'MONGO'
time(() => db.test.aggregate([
    {"$project":
        { "level": { "$getField": "level" }}
    }
]))
MONGO
  • Dolt: time mysql -e 'select j->>"$.level" from json_table;'

Extracting JSON Graph

| | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
|---|---|---|---|---|---|
| Dolt (no index) | 0.020s±0.001 | 0.012s±0.001 | 0.011s±0.001 | 0.011s±0.001 | 0.01s±0.001 |
| Dolt (index) | 0.034s±0.002 | 0.025s±0.002 | 0.022s±0.01 | 0.023s±0.001 | 0.021s±0.001 |
| MongoDB (no index) | 0.044s±0.011 | 0.0224s±0.002 | 0.019s±0.003 | 0.016s±0.002 | N/A |
| MongoDB (index) | 0.109s±0.010 | 0.097s±0.002 | 0.099s±0.003 | 0.103s±0.004 | N/A |

That's not a typo: adding an index to MongoDB made projecting the indexed field slower. This was consistent and repeatable. Looking at the generated plan, the query doesn't seem to be using the index at all.

However, the performance for non-indexed documents is a strong indicator that MongoDB doesn't need to load the entire document into memory in order to read a single field.
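
For anyone who wants to check this themselves, the chosen plan can be inspected with explain(); something along these lines, though the exact output shape varies by MongoDB version:

// Ask the planner how it would execute the projection pipeline.
// If the index were used, the winning plan would typically show an
// IXSCAN stage rather than a full collection scan (COLLSCAN).
db.test.explain("queryPlanner").aggregate([
  { $project: { level: { $getField: "level" } } },
]);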

Test 5: Filtering on a JSON key (no index)

  • MongoDB:
mongosh <<'MONGO'
time(() => db.test.find({ "level": 1 }, { "id_": true }))
MONGO
  • Dolt: time mysql -e 'select pk from json_table where j->>"$.level" = 10;'

Filtering JSON Graph

| | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
|---|---|---|---|---|---|
| Dolt (no index) | 0.262s±0.004 | 0.07s±0.001 | 0.022s±0.004 | 0.012s±0.001 | 0.01s±0.004 |
| Dolt (index) | 0.011s±0.001 | 0.012s±0.001 | 0.013s±0.01 | 0.013s±0.001 | 0.012s±0.001 |
| MongoDB (no index) | 0.013s±0.002 | 0.024s±0.003 | 0.087s±0.009 | 0.284s±0.009 | N/A |
| MongoDB (index) | 0.004s±0.001 | 0.003s±0.001 | 0.002s±0.001 | 0.002s±0.001 | N/A |

I was surprised to see that non-indexed MongoDB became slower when the data was organized into fewer documents. This suggests that the engine fully loads every document that matches the filter before extracting the projected field. But we just saw in the previous benchmark that MongoDB can extract fields from large documents efficiently, so why would adding a filter hurt performance like this?

My best theory is that this command results in a two-stage pipeline, a filter followed by a projection, and that MongoDB fully materializes the intermediate result of the filter before passing it to the projection. This shouldn't be necessary, and I wonder if the engine is assuming smaller document sizes and generating plans accordingly.

Test 6: Ordering by a JSON key (no index)

  • MongoDB:
mongosh <<'MONGO'
time(() => db.test.aggregate([
    {"$sort": { "level": 1}},
    {"$project":
        { "id_": { "$getField": "id_" }}
    }
]))
MONGO
  • Dolt: time dolt sql -q 'select pk from json_table order by j->>"$.level";'

Ordering JSON Graph

| | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
|---|---|---|---|---|---|
| Dolt (no index) | 0.268s±0.004 | 0.076s±0.005 | 0.030s±0.001 | 0.012s±0.004 | 0.010s±0.001 |
| Dolt (index) | 0.407s±0.014 | 0.116s±0.007 | 0.040s±0.001 | 0.018s±0.001 | 0.013s±0.001 |
| MongoDB (no index) | 0.931s±0.064 | 0.840s±0.028 | 0.823s±0.030 | 0.845s±0.014 | N/A |
| MongoDB (index) | 0.081s±0.022 | 0.040s±0.005 | 0.018s±0.002 | 0.015s±0.001 | N/A |

These results are the most straightforward: MongoDB is consistently faster than Dolt when both are indexed, and consistently slower than Dolt when neither is indexed.

Test 7: Required Disk Space

Aside from runtime performance, another important metric is storage space. Database engines can typically reduce storage space in two ways: compression and structural sharing.

The results for MongoDB were computed using the db.stats() function in mongosh.
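
For reference, db.stats() accepts an optional scale factor, which makes it easy to read the sizes off directly in megabytes:

// Report database statistics scaled to megabytes; the output
// includes fields such as dataSize, storageSize, and totalSize.
db.stats(1024 * 1024);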

Required Disk Space

| | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
|---|---|---|---|---|---|
| Dolt | 14.0 MB | 2.6 MB | 2.3 MB | 2.2 MB | 4.0 MB |
| MongoDB | 32.3 MB | 29.1 MB | 26.1 MB | 25.6 MB | N/A |

MongoDB is clearly compressing the documents, since each test dataset is 512 MB uncompressed. But given that the amount of compression (roughly 16-20x) is in line with typical JSON compression ratios, we can reasonably assume that MongoDB is not doing anything to leverage similarities between the documents and reuse space between them.

Conclusion

Looking at the results, we can draw some conclusions about MongoDB's performance and design:

  • Exporting large documents is slow because they need to be converted from MongoDB's internal binary format.
  • Creating a new collection by modifying an existing collection seems to take time proportional to the total data size, regardless of the number of documents or individual document size. Given that MongoDB does not appear to reuse storage space for similar documents, this makes sense.
  • In the event that a MongoDB user knows that documents will contain identical sub-documents, they could attempt to save space by moving these sub-documents into another collection and referencing them. But this complicates the database design and potentially introduces a performance penalty when the referenced documents are used in lookups.
  • Ignoring the surprising behavior where adding an index appears to make projections slower, operations on indexed fields are faster on MongoDB than Dolt, at all tested document sizes.
  • For non-indexed documents:
    • Some operations (like looking up individual fields) are faster when the data is organized into fewer, larger documents, which suggests that MongoDB does not need to load an entire document into memory if only certain fields are needed.
    • Other operations (like sorting a collection) perform roughly the same whether there are fewer larger documents or more smaller ones. This suggests that MongoDB is loading entire documents into memory in this case.
    • Still other operations (like filtering a collection) perform worse when the same amount of data is organized into fewer documents, perhaps due to an issue with plan generation.

MongoDB excels at efficiently running queries on both indexed and non-indexed data, at the cost of slower export times and more storage space for redundant documents. MongoDB also has restrictions on document size and depth that prevent some documents from being stored and require forethought about your database structure. This somewhat limits the usual flexibility that schemaless document stores provide.

By comparison, Dolt has theoretically no limit on document size or depth, and we've already demonstrated how Dolt can do lookups, copies, and in-place modifications of multi-GB documents near-instantly.

Overall, I was largely impressed by MongoDB, even though it raised my eyebrows in a couple of places. It and Dolt each have clear advantages and disadvantages, because they're optimized for different use cases.

Dolt is still first and foremost designed to leverage its powerful version control capabilities: we think being able to branch and merge databases is a killer feature. But I'm impressed with how it's been able to hold its own against more established, more general purpose DB engines. Even if you don't think that your database should be version controlled (and let's face it, your database should be version controlled), we think that Dolt still has a lot to offer.

And we prioritize user needs highly, so if you think Dolt might be right for you but you have questions, or there are specific features you want us to prioritize, join our Discord and we'll be happy to figure out how Dolt can help your use case.
