In 2019, we released Dolt, the world’s first version-controlled SQL database. In the seven years since, we’ve seen a gradual but steady influx of attention as people become aware we exist.

You can clearly see the first time we got mentioned on HackerNews in 2021.
More and more, this attention is taking the form of people not just noticing us, but talking about us to others. We love that. We think that word-of-mouth buzz is a strong signal for the genuine usefulness of a tool.
Last month, Daroc Alden over at LWN wrote about us.1 It was really cool to see an outside perspective. I think they did a great job of explaining what Dolt is and how it can be used. They also have a great explanation of Prolly Trees, the B-tree inspired data structure that makes Dolt possible. We’re kind of obsessed with Prolly Trees, so it’s cool to see someone else understand them and even see them talk about some of the improvements that we made to Prolly Trees to improve node size distribution.
Overall, I think the LWN article did a great job introducing Dolt to a new audience and I’m happy to see it. But there were a couple of misconceptions in the article, so for the sake of completeness I wanted to clear them up.
Project Architecture#
Two related misconceptions we see pop up on occasion is the claim that:
- Dolt and its siblings (Doltgres and DoltLite) are plugins/forks of common SQL engines.
- Each project is a different wrapper around an otherwise identical storage layer.
Combined, these two ideas paint the picture that Dolt / Doltgres / DoltLite each take the same underlying storage format and bridge it to MySQL / PostgreSQL / SQLite, respectively.
LWN repeats both of these ideas in their article. To quote:
[Dolt, Doltgres, and Doltlite] have separate frontends for the different SQL dialects, but the projects share a common storage backend that supports version-control operations.
That competitive performance is possible because Dolt only changes the storage layer of the database. The query planner, index maintenance, and so on all reuse MySQL, PostgreSQL, or SQLite’s existing implementations.
The reality is slightly more nuanced. We maintain our own SQL engine, go-mysql-server. Like the name suggests, it was originally made for the MySQL dialect, but recently we’ve also been adding support for Postgres.
We also have a common storage backend, designed for use with go-mysql-server. Both Dolt and Doltgres use this backend. They aren’t plugins, and they don’t depend on the MySQL or Postgres code bases at all.
On the other hand, DoltLite is a custom storage backend for SQLite, and as such, it doesn’t use the common storage backend, but rather its own format and implementation. It’s based on the same design as the original, but the formats are not compatible.
The Power of Git#
There’s a lot of reasons why applying the Git model to databases makes sense. The LWN article points out the biggest one:
The real utility of Dolt comes from the ability to restore old commits, fork historical states of the database, and merge in changes after review.
This is 100% true, and it’s the most common way we see users take advantage of Git-style version control. But there’s another big benefit to Git-style version control too: remotes.
Just like Git, Dolt allows you to clone a database from a remote and then sync incrementally. When you dolt pull or dolt push, only the new changes get sent. This lets you get a local, up-to-date copy by only downloading the parts that actually changed. That turns out to be a pretty powerful feature for replicas.
How Git and Dolt Store Snapshots#
Git and Dolt take a “snapshot” approach to version control, where each commit represents is an independent snapshot of the data at a specific point in time.2 But in the event that not much has changed between commits, what stops all the different snapshots from taking up lots of extra storage space?
A common Git misconception is that Git reduces storage by storing commits internally as deltas. The article repeats that misconception and suggests that this is the main reason why Git (and by extension Dolt) require data structures that can be diffed efficiently. To quote:
Each commit is logically independent, and could in theory just be stored as-is. In practice, most Git repositories do not completely change between commits, and it’s more efficient to store the differences between subsequent snapshots, rather than the snapshots themselves.
Git, which works with entire directory trees, relies on being able to quickly check whether a particular sub-tree has changed in order to produce compact diffs that include only the actual changes.
There’s some truth to this: Git does have the ability to produce compact diffs (such as with the git diff command). And Git’s storage format does have the ability to store deltas on individual files. But it doesn’t store the whole commit as a delta, and deltas aren’t the primary method for achieving efficient storage.
The actual method by which Git and Dolt achieve compact storage is much simpler: when two commits trees contain identical sub-trees, Git and Dolt simply only store that identical tree once. And indeed further down, the LWN article explains exactly that:
Content-addressed storage is a scheme that many version-control systems use to deduplicate data: a particular diff or other object is stored in a location based on the hash of its contents. This ensures that duplicate items will reuse the same storage space.
This trick of reusing the unchanged parts of the tree is called structural sharing. In order to make this possible, the data being stored requires that two trees containing the same items are exactly identical, regardless of how those items were inserted. This property is called history independence, and its the property that Prolly Trees provide that B-trees don’t. History independence is also how both Git and Dolt can produce efficient diffs.
So Git and Dolt’s efficient storage isn’t because of their diffing capabilities, but rather the same property (history independence) allows for both fast diffs and compact storage.
Dolt’s Dynamic Node Size Thresholds#
It was really cool to see this get a shout-out.
A known shortcoming of the original prolly tree design is that it produces trees where the number of elements per node follows a geometric distribution. This is a pretty big problem, because it both means that there’s no theoretical upper bound on node sizes, and it leads to imbalanced trees with worse performance.
Think about it this way: if a small number of nodes in your tree are much larger than others, then those nodes are not only going to take longer to process, but they’re also going to get visited more frequently because they have more children. This makes tree operations more likely to hit slow paths.
Dolt has a pretty clever solution to this problem, and not only did the article talk about it, it spawned some good discussion in the comments. Basically, Dolt uses a dynamic threshold to decide how to draw boundaries between tree nodes: as a tree node fills up, the probability of the next element causing a split increases.
One small correction though: the dynamic threshold isn’t based on the number of node elements like the example suggests, but rather the number of bytes written. This matters when the elements are variable-length, and gives us better control over the exact distribution of node sizes.
That’s All, Folks#
This was a great read. It was really cool to see an outsider’s perspective on Dolt. The readers of LWT are lively and did a lot of engaging in the comments of the original article.
If you’d like to engage with us more, consider joining our Discord. We’re always down to chat and answer questions.
Footnotes#
-
What does LWN stand for? Apparently nothing anymore, but it used to stand for Linux Weekly News. ↩
-
This is in comparison to delta-based version control systems, where each commit represents a diff on the previous commit. We’ve written in the past about the different types of version control schemes. ↩