Introducing the `dolt archive` command

July 31, 2024

Here at Dolt, the first database to provide version control features like Git, we've been lucky to leverage existing code extensively while building our product. One piece we rely on very heavily is Noms, and the part that matters for today's discussion, the "Noms Block Store", has been virtually untouched for years. In my previous review on storage, I outlined the basic approach of having content-addressed blocks as the foundation of our database. This approach allows structural sharing of data across versions, and has served us very well.

Way back in April I discussed our plans to support a new storage format: Archives. The main motivation was the ability to save as much as 50% on our disk footprint, and also to lay the foundation for history archival, which will allow users to put little-used history in cold storage.

Today those plans are a reality - with some caveats. Let's Go!

Thar Be Dragons

Before we jump in on the new command and what it does, I think it's worth calling out the dangers of any new storage format. The files which get persisted by any database, Dolt included, need to be given special care. Getting it wrong could mean the corruption of valuable data with no recourse. The number of ways that computers can mess up is too many to count, but in many contexts you just restart something and you are good to go. If we write bad data to disk though, there isn't really anything you can do about it. You have most likely lost the critical information you needed.

For a distributed database, this problem is more extreme because not only are you moving data files around a lot, but you also need to account for there being many different versions of the software which consume and produce those files.

Git is fairly remarkable in this regard. The on-disk format of Git has essentially had only one version in the wild:

$ git config core.repositoryformatversion
0

This has been extremely valuable for Git users who want to push their code to GitHub or pull in someone else's code from GitLab. You don't need to compare notes and ensure that you are on the same version of Git.

Dolt is effectively on version 0 of its on-disk storage format, and the introduction of Archives will bring a new format into play. We wanted to make extra certain that the format was correct and stable before unleashing it on the world. For this reason, we are treading very lightly in how we encourage users to leverage Archives now. It's going to be most beneficial to users who run large databases and do not rely on a push/pull model.

Introducing `dolt archive`

Introduced in version 1.42.7 of Dolt, the new subcommand `dolt archive` will take your data and rewrite it into Archive files which are 40%-50% smaller than the existing format:

$ dolt gc
$ dolt archive --group-chunks
Building Chunk Group Dictionaries: Done
Materializing Chunk Groups: Done
Writing Ungrouped Chunks: Done
Verifying Chunks: Done

Archived f85qeb1f6voer8s62sb9ojhmgm29hq75 (523816541 -> 231815184 bytes, 55.74% reduction)

A couple things to note here:

`dolt gc` performs garbage collection, and must be run before `dolt archive`. Archives are built using what we call "Old Generation" data, or oldgen. Oldgen data is considered more stable and long lasting, which makes it a good place for us to start with archives.
The `--group-chunks` flag was used to do extra work that makes the Archive even smaller by using dictionary compression. This is not the default, due to performance problems and race conditions in the zstd library. If you use the `--group-chunks` flag and it takes a very long time with no update in the progress, please let us know.
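
If you want to script the two steps above, here is a minimal sketch. `archive_db` is a hypothetical helper name, not something Dolt ships. It only wraps the `dolt gc` and `dolt archive` commands shown in this post, and assumes `dolt` is on your PATH.

```shell
# Sketch only: archive_db is a hypothetical helper, not part of dolt.
# It runs garbage collection first, then builds archives, inside one
# database directory.
archive_db() {
  db_dir=$1        # path to the Dolt database directory
  group=${2:-no}   # pass "group" to opt in to --group-chunks
  (
    cd "$db_dir" || exit 1
    # dolt gc has to run first so that data lands in oldgen
    dolt gc || exit 1
    if [ "$group" = "group" ]; then
      dolt archive --group-chunks
    else
      dolt archive
    fi
  )
}
```

Calling `archive_db /path/to/db` builds archives without grouping, while `archive_db /path/to/db group` opts in to the slower dictionary compression.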

In the event that you determine something is not working correctly with an archive, you can revert the change with the following flag:

$ dolt archive --revert

This will return the database to using the format we've been using for years. If you do find the need to revert, please let us know why!
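
One way to make that decision mechanical is to run a check of your own after archiving and revert only if it fails. The sketch below assumes a hypothetical helper, `check_or_revert`, which takes whatever command you trust to validate your data. The only Dolt command it runs itself is the `dolt archive --revert` shown above.

```shell
# Sketch only: check_or_revert is a hypothetical helper, not part of dolt.
# It runs the command given as arguments and reverts the archive if that
# command exits with a non-zero status.
check_or_revert() {
  if "$@"; then
    echo "archive check passed"
  else
    echo "archive check failed, reverting" >&2
    dolt archive --revert
  fi
}
```

For example, `check_or_revert dolt sql -q "SELECT count(*) FROM my_table"` would revert if the query fails, where `my_table` is a placeholder for one of your own tables.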

Caveats

I mentioned some caveats above, and here they are:

Archives are currently not supported for any data replication use cases. You cannot start a push/pull enabled server on top of a database with archives. You can't back them up.
`--group-chunks` can be prohibitively slow. Creating thousands of dictionary files seems to lead to pathological memory allocation problems that make grouping chunks untenable for some databases.
Building an Archive doesn't currently remove old data files. This keeps `--revert` fast, since the original file is still present. Running `dolt gc` will clean up dead files and save your disk space.
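
Because the old files stick around until the next `dolt gc`, the savings only show up on disk after that cleanup. A rough way to measure them is sketched below. `reduction_pct` and `dir_kb` are hypothetical helper names, and the only tools assumed are `awk`, `du` and `cut`.

```shell
# Sketch only: both helpers are hypothetical, not part of dolt.

# Percentage saved when going from one size to a smaller one.
reduction_pct() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.2f\n", (before - after) / before * 100 }'
}

# Size of a directory in kilobytes, as reported by du.
dir_kb() {
  du -sk "$1" | cut -f1
}

# The byte counts from the archive output earlier in this post.
reduction_pct 523816541 231815184   # prints 55.74
```

To compare a whole database, you could record `dir_kb .dolt` before archiving and again after the follow-up `dolt gc`, then pass both numbers to `reduction_pct`.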

Future Plans

At this stage, we really want to ensure that the Archive format is stable. With time we will make this format the default, but before we do that we need to ensure that all the different ways you can push/pull from a Dolt database are covered with additional testing. This includes working with DoltHub and Hosted instances. Grouping similar chunks will also become the default in the future when we address the performance issues.

Once we have this format really tied in to the rest of the database, we can start working on history truncation and archival, which we believe is going to become increasingly important to our users as they grow their histories.

More than anything else, we want to get your hands on Archives so you can tell us what you want. Do you want faster archive builds? Support on DoltHub.com for them? Truncating history? Come join us on our Discord and tell us!
