Dolt Archives Status Update


The Dolt database is the first SQL Database with Git-style branching and merging. One natural consequence of this is the need to store data which grows proportionally to the number of writes over the database's lifetime. For our users to be successful with the Dolt database as they grow, we need to make sure that we use their disk space efficiently.

We have two primary paths to reduce the amount of disk space used by Dolt databases in relation to the amount of history stored:

  1. Store everything as efficiently as possible. Use modern compression and delta encoding to reduce the amount of space used.
  2. Provide the ability to cull history from your primary database and put it into long term storage for later use.

The first path is the one I am currently focused on. The second path is a feature we will be adding in the future, but it does require us to plan for it in our storage format. Today I'll be discussing the progress being made on the first path, and what is necessary to make it the default storage format for Dolt 2.0.

Background

Since Dolt started, it has used the Snappy compression library to compress chunks (what are chunks?). Last year I wrote about the potential of using Dictionary Compression to further reduce the amount of disk space used by Dolt databases. Dictionary compression is a good fit because many Dolt chunks are similar to one another, and the zStd compression library can build custom dictionaries tuned to small batches of data.

Our initial findings showed we could reduce Dolt database disk space usage by about 50% with this approach. A new storage format was announced last year, and honestly we don't think anyone has been using it. It turns out that while storing all of your history sounds expensive, storage is cheap. We still think it's a better format, for reasons that go beyond the space saved, so for the last few months I've been working on all the pieces needed to make it the default storage format.

We have quietly added features such as the ability to push databases with the Archive format to DoltHub, and the ability to pull incremental changes from a remote server when Archives are in the mix. The path to making this the default storage format is straightforward from here. We just wanted to do some testing of our dictionary creation strategies before pulling the trigger.

Dictionary Creation

zStd compression is greatly improved with the use of dictionaries. A dictionary is created from a representative sample of the data you want to compress, and the approach works well when you have many small (under a megabyte) data blocks to compress. Dolt's chunks are typically 4K in size, so they are a great fit for this approach.
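To make that concrete, here's a minimal sketch of dictionary training and compression using the valyala/gozstd Go bindings. This is an illustration of the technique, not Dolt's actual storage code, and the sample data is made up:

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/valyala/gozstd"
)

func main() {
	// Stand-in samples that mimic ~4K chunks sharing a lot of structure.
	// In practice these would be real chunks read from the storage layer.
	samples := make([][]byte, 1000)
	for i := range samples {
		s := bytes.Repeat([]byte("id,name,email,created_at;"), 160) // ~4KB
		samples[i] = append(s, []byte(fmt.Sprintf("tail-%d", i))...)
	}

	// Train a 16KB dictionary from the samples.
	dict := gozstd.BuildDict(samples, 16*1024)
	if len(dict) == 0 {
		panic("dictionary training failed")
	}

	cd, err := gozstd.NewCDict(dict)
	if err != nil {
		panic(err)
	}
	defer cd.Release()

	// Compress a single chunk against the shared dictionary.
	chunk := samples[42]
	compressed := gozstd.CompressDict(nil, chunk, cd)
	fmt.Printf("chunk: %d bytes, compressed: %d bytes\n", len(chunk), len(compressed))
}
```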

A critical piece in the process is creating a good dictionary. When we first announced the Archive format, we implemented two different dictionary creation strategies:

  1. Take a random sample of 1000 chunks in the dataset and use them to create a dictionary. This approach produces a decent dictionary, resulting in approximately 30% improvement over Snappy compression in a small set of repositories. This was the default behavior for the dolt archive command.
  2. Using the Dolt history graph, we group related chunks together and create a dictionary specifically for each group. This produced thousands of groups, each with its own dictionary, and the resulting compression was even better - on the order of a ~50% improvement over Snappy compression. To be clear, that's storing the same amount of information in half the space. Not too shabby! This was achieved by running the dolt archive --group-chunks command.
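A rough sketch of that second, per-group strategy is below, again using gozstd as a stand-in for whatever Dolt does internally. The grouping itself (walking the history graph) is assumed to have already happened; groups is just a hypothetical map of group IDs to related chunks:

```go
package sketch

import "github.com/valyala/gozstd"

// compressGroups builds a dedicated zStd dictionary for each group of related
// chunks and compresses every chunk in the group against that dictionary.
// It mirrors the idea behind `dolt archive --group-chunks`, not its
// implementation. Very small groups can fail dictionary training; a real
// implementation would need to fall back to a shared dictionary in that case.
func compressGroups(groups map[string][][]byte) (map[string][]byte, map[string][][]byte, error) {
	dicts := make(map[string][]byte)        // group id -> dictionary bytes
	compressed := make(map[string][][]byte) // group id -> compressed chunks

	for id, chunks := range groups {
		dict := gozstd.BuildDict(chunks, 16*1024)
		cd, err := gozstd.NewCDict(dict)
		if err != nil {
			return nil, nil, err
		}
		for _, c := range chunks {
			compressed[id] = append(compressed[id], gozstd.CompressDict(nil, c, cd))
		}
		cd.Release()
		dicts[id] = dict
	}
	return dicts, compressed, nil
}
```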

Recently we added a third option, which is available, though invisible, to the user. When fetching changes from a server or during garbage collection, we always create some type of storage file for the data. When writing Snappy objects, we simply stream chunks to disk, since each chunk is compressed into a single byte string. For zStd it's more complex, as each chunk requires two byte arrays: a dictionary and the compressed data. When streaming, though, we don't start with a dictionary, and we need one before we can write anything to disk. To address this, we create a dictionary from the first 1000 chunks we see. Once that dictionary exists, we can stream chunks to disk just like we do for Snappy. This feature isn't documented yet, but it can be enabled via an environment variable. Once we're confident in its stability, we'll make it more discoverable and easier to enable.
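Sketched out, that streaming flow looks roughly like the following. It assumes a gozstd-style API and a plain io.Writer; the real implementation also has to record offsets, chunk metadata, and the dictionary itself in the storage file, which is elided here:

```go
package sketch

import (
	"io"

	"github.com/valyala/gozstd"
)

// streamWithFirstNDict buffers the first n chunks from the input, trains a
// zStd dictionary on them, then compresses the buffered chunks and every
// subsequent chunk with that dictionary as they stream out.
// A sketch of the "first 1000 chunks" strategy, not Dolt's actual code.
func streamWithFirstNDict(chunks <-chan []byte, w io.Writer, n int) error {
	buffered := make([][]byte, 0, n)
	for c := range chunks {
		buffered = append(buffered, c)
		if len(buffered) == n {
			break
		}
	}

	dict := gozstd.BuildDict(buffered, 16*1024)
	cd, err := gozstd.NewCDict(dict)
	if err != nil {
		return err
	}
	defer cd.Release()

	write := func(c []byte) error {
		_, err := w.Write(gozstd.CompressDict(nil, c, cd))
		return err
	}

	// Flush the chunks we held back while training the dictionary.
	for _, c := range buffered {
		if err := write(c); err != nil {
			return err
		}
	}
	// Everything after the first n streams straight to disk.
	for c := range chunks {
		if err := write(c); err != nil {
			return err
		}
	}
	return nil
}
```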

To recap, we have three different dictionary creation strategies:

  1. Random sample of 1000 chunks
  2. Multiple dictionaries with grouped chunks
  3. First 1000 chunks

Let's Test

In brief testing, the new third option gave better compression than Snappy, but I didn't know how much better. I tried it on a couple of repositories, and it wasn't worse. For a long time I've wanted to test the compression improvements of Archives on a larger set of databases. So that's what I did.

Thankfully DoltHub happens to have a lot of databases. I grabbed the list of public databases which weren't forks of other databases and were more than 100MB in size. I ended up with 170 databases, and I set out to test them in the following way:

  1. Run dolt gc (baseline)
  2. Run dolt gc with the new "First 1000 Chunks" dictionary creation strategy (in development)
  3. Run dolt archive to test the Random Sample dictionary creation strategy.
  4. Run dolt archive --group-chunks to test the History Graph for Chunk Grouping dictionary creation strategy.

Runs (2), (3), and (4) were all compared to run (1), resulting in the following, where the percentage is the percentage reduction in size compared to the baseline:

|        | First 1000 Chunks | 1000 Random Chunks | Grouped Chunks |
|--------|-------------------|--------------------|----------------|
| Avg    | 24.73%            | 35.44%             | 38.87%         |
| StdDev | 10.32%            | 9.05%              | 11.55%         |
| Max    | 47.44%            | 62.46%             | 74.80%         |
| Min    | -3.11%            | -0.28%             | -0.27%         |

Or in a chart, if you're into that sort of thing:

[Chart: compression results for the three dictionary strategies]

What Does This Mean?

First, it indicates that my claim that we would get 50% compression on most repositories was wrong. Frankly, that was selection bias on my part. It's worth calling out that the three databases I used to justify that claim (and used during development) are all at the high end of the compression spectrum. I selected them because they have long histories, and I could test the grouping of chunks easily. As I mentioned above, testing Archives on a larger group of databases was always the plan, but time hadn't allowed for it. It wouldn't be a legitimate experiment if I knew how it would turn out. When I look at some of the worst performing databases, they all have very short histories - under 10 commits. Some were dumps of large databases into a single commit.

On the other side of the coin, the goal was to ensure that we could store a lot of history efficiently. Maybe it's the case that we can't compress flat histories much at all; we are at the mercy of the compression libraries at that point. To back this up, it's interesting to note that grouping chunks was, on average, not much better than the random sample. Of the 170 databases I tested, 79 of them had 0 groups - so they fell back to the random sample strategy. Not all is lost though, as the databases with the best compression were using the grouped chunks strategy.

Second, it indicates that the "First 1000 Chunks" strategy is... meh. It's not terrible, but it's not great. The reason for this is pretty well understood. Since we are using the first 1000 chunks to create a dictionary, we are creating a biased dictionary. The garbage collection process operates by walking the database history and copying chunks as it discovers them. This means that the first 1000 chunks are mostly related to walking references, and the table data comes later in the process. So the dictionary we use to compress every chunk is created from a specific set of chunks which are not representative of the entire database. Look Ma, there it is again! Sample Bias!

[Image: Sample Bias]

If you're paying very close attention, you'll notice that a small number of databases actually do worse with Archives. This was a surprise, honestly. It's a pretty small set where this is the case, and I haven't dug into it yet. Suffice it to say, the new format has a few more bookkeeping bits than the old format, so if zStd and Snappy produce comparably sized outputs, that extra bookkeeping is going to cost us. Fortunately, zStd is generally much better, so the extra bookkeeping is worth it.

What's Next?

The goal is to make the new format the default in Dolt 2.0. I'm not convinced that ~25% is the best we can do, though, and I have a couple of things I'm going to try first:

  1. Get a baseline with zStd using no dictionary at all. I suspect it may actually be comparable to the First 1000 Chunks strategy. If that is the case, we will simplify the storage pipeline by not creating a dictionary at all (see the sketch after this list).
  2. Experiment with a static default dictionary built from a sample across many repositories.
  3. Experiment with raising the compression level in zStd. We've avoided this thus far because the CPU utilization and time to compress go up significantly. As it stands, garbage collection takes twice as long when compressing with zStd at the default compression level. There may be options to enable users to specify the compression level based on the amount of time they are willing to spend doing garbage collection.
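Items (1) and (3) are easy to prototype outside of Dolt. A minimal sketch, again assuming the gozstd bindings used above rather than Dolt's real benchmark harness:

```go
package sketch

import (
	"fmt"

	"github.com/valyala/gozstd"
)

// compareStrategies compresses the same chunk three ways: plain zStd at the
// default level, zStd at a higher level, and zStd with a pre-built dictionary.
// A sketch for the experiments described above, not Dolt's actual benchmark.
func compareStrategies(chunk []byte, cd *gozstd.CDict) {
	noDict := gozstd.Compress(nil, chunk)           // default level, no dictionary
	level19 := gozstd.CompressLevel(nil, chunk, 19) // smaller output, much more CPU
	withDict := gozstd.CompressDict(nil, chunk, cd) // dictionary-based

	fmt.Printf("no dict: %d bytes, level 19: %d bytes, with dict: %d bytes\n",
		len(noDict), len(level19), len(withDict))
}
```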

The ability to group chunks more intelligently is also on the table. I believe we are just scratching the surface of what is possible here. That is not going to be part of the Dolt 2.0 behavior though. Grouping chunks will likely always be an offline operation - performed when you really need the compression benefit. We'll let you know when we get better at the grouping piece.

That's it for now. More testing and experimentation will get Dolt databases even smaller. If you have a specific repository you'd like to see tested, join us on our Discord server and DM me (@macneale)!
