Announcing `dolt fsck`

Dolt is the world's first SQL database that lets users branch and merge. Today, we're adding support for dolt fsck, continuing a tradition of fsck tools guarding against data corruption that goes back more than 40 years!

A Brief History of fsck

Git’s ubiquity often overshadows its roots as a tool for file system aficionados, originally created by Linus Torvalds for the Linux kernel. Among Unix veterans, fsck is a familiar utility that scans file systems for corruption. The fsck command itself originated in BSD 4.0 in 1980 and remains a critical tool for system administrators worldwide.

While some debate how to pronounce fsck, those present when it was created do not. We can all smile and nod that File System ChecK is a backronym that is safe for work.

File systems come in many types, but at their core they are about having a name (path) which resolves to a location on disk (offset and length). This is achieved through internal nodes that form a tree structure containing metadata about each file. As a result, file system paths can be updated with very little data transfer. If you rename a file, the contents of the file stay where they were while the metadata is updated.

fsck was introduced to allow system maintainers to determine if there was corruption in any of the file system data (metadata or otherwise). This could depend on a variety of checks such as validating that the path resolved to anything, or verifying the data's parity bits. RAID file systems made this even more interesting because not only could you find the corruption in your file system, but you could also repair it. In each case, what was possible for fsck to validate depended a lot on the design of the file system.

OpenZFS is a great example of a design that accounts for the fact that corruption happens. It uses a Merkle Tree to ensure that every reference has integrity.

Review of git fsck

What's that you say?!? Merkle Tree? We love Ralph Merkle's invention here at Dolt, and it's been the key piece which makes Git the tool it is today.

git fsck is Git’s File System ChecK. It lets users scan their commit history and ensure that each object is properly linked to its content-addressed reference. By calculating a SHA-1 checksum, git fsck can verify that the data is both referenced correctly and intact.

Merkle Trees and Directed Acyclic Graphs (DAGs) are essential to Git and Dolt alike. With these structures, each object’s address is derived from its content’s checksum, creating a powerful data structure that is inherently self-verifying.
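To make content addressing concrete, here is a minimal Python sketch of how Git derives the address of a blob: it hashes a small header (the word blob, the byte length, and a NUL byte) followed by the file contents with SHA-1. The function names `git_blob_address` and `is_intact` are ours, chosen for illustration only.

```python
import hashlib


def git_blob_address(content: bytes) -> str:
    """Return the SHA-1 address Git assigns to a blob with this content."""
    # Git hashes a header ("blob <length>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def is_intact(address: str, content: bytes) -> bool:
    """A stored object is intact when its content still hashes to its address."""
    return git_blob_address(content) == address


address = git_blob_address(b"hello\n")
print(is_intact(address, b"hello\n"))  # True
print(is_intact(address, b"hellp\n"))  # False
```

The address printed for a given input should match what `git hash-object` reports for a file with the same bytes. Because the address is a function of the content, any change to the bytes shows up as a mismatch.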

git fsck has a lot of options, most of which speed up the process by skipping work that isn't needed. For example, the --connectivity-only flag checks only that the objects reachable from the commit history are present, without reading and validating blob contents, or you can provide a specific object, like a commit id or a content tree, and validate just that. Here is the header from the help docs:

NAME
       git-fsck - Verifies the connectivity and validity of the objects in the database

SYNOPSIS
       git fsck [--tags] [--root] [--unreachable] [--cache] [--no-reflogs]
                [--[no-]full] [--strict] [--verbose] [--lost-found]
                [--[no-]dangling] [--[no-]progress] [--connectivity-only]
                [--[no-]name-objects] [<object>...]
...

Take a look at the help to see all the things which git fsck can do. Interesting to note that they call the object store "the database." I don't think I've seen them use that term anywhere else!

Introducing dolt fsck

dolt fsck has been released in version 1.43.2, and it performs the most basic (and most complete) validation of your Dolt database. We don't support all those options to pare down the scan size, but we'll add them if you ask for them.

SYNOPSIS
        dolt fsck [--quiet]

DESCRIPTION
        Verifies the contents of the database are not corrupted.

OPTIONS
        --quiet
          Don't show progress. Just print final report.

dolt fsck will iterate through every object in your database (we call them chunks) and calculate its cryptographic checksum to verify that it is properly addressed.
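The scan described above can be sketched in a few lines of Python. This is an illustration of the idea rather than Dolt's implementation: the in-memory `store` dictionary, the `address_of` helper, and the choice of SHA-256 are all assumptions made for the example, and Dolt's real chunk format and hash function are not shown here.

```python
import hashlib
from typing import Dict, List


def address_of(data: bytes) -> str:
    """Hypothetical address function: hex SHA-256 of the chunk bytes."""
    return hashlib.sha256(data).hexdigest()


def find_corrupt_chunks(store: Dict[str, bytes]) -> List[str]:
    """Return the addresses whose stored bytes no longer hash to that address."""
    return [addr for addr, data in store.items() if address_of(data) != addr]


# Build a tiny store where every chunk is filed under its own checksum.
chunks = [b"row data", b"index node", b"commit metadata"]
store = {address_of(c): c for c in chunks}
print(find_corrupt_chunks(store) == [])  # True

# Simulate corruption: the bytes change but the address they are filed under does not.
victim = address_of(b"index node")
store[victim] = b"index n0de"
print(find_corrupt_chunks(store) == [victim])  # True
```

The check needs nothing beyond the store itself, which is what makes content-addressed storage self-verifying.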

All objects are scanned, whether they are in the Journal, Table files, or Archives (our three on-disk formats). The Journal in particular can contain a lot of unreachable chunks, so you may want to run dolt gc before dolt fsck to clear them out first.

Finally, the exit status of the command will be 1 if any corruption is found.
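Because the command reports corruption through its exit status, it is straightforward to call from a script or a scheduled job. The Python sketch below is one way to do that. The `run_fsck` helper and its `command` parameter are hypothetical names, and the sketch assumes that an exit status of 0 means no corruption was found, which the description above implies but does not state. Other failures (for example, running outside a Dolt database) may also produce a non-zero status.

```python
import subprocess
import sys
from typing import Sequence


def run_fsck(command: Sequence[str] = ("dolt", "fsck", "--quiet"), cwd: str = ".") -> bool:
    """Run the check in `cwd`; return True when it exits with status 0."""
    result = subprocess.run(list(command), cwd=cwd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the report so whoever reads the job log can see what was found.
        sys.stderr.write(result.stdout)
        sys.stderr.write(result.stderr)
    return result.returncode == 0


# Example use from inside a Dolt database directory (assumes dolt is on PATH):
#     if not run_fsck():
#         sys.exit(1)
```

Taking the command as a parameter keeps the helper easy to exercise without a Dolt installation, since any program that exits with 0 or 1 can stand in for it.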

Breaking Dolt

Probably in the future I'll write another post about how we tested this. It took me more time to create corrupted databases than it did to write dolt fsck itself. We have A LOT of checks in place already to ensure that the data we read off of disk is in good condition.

To break a Dolt database, I first tried using Golang reflection, with no luck. Then I decided to fork the code and remove some of the existing checks so that I could get a chunk into storage with an incorrect address.

I also manually changed some bytes in a Journal file using xxd to dump and import binary data. This proved to be very tricky as well, and further convinced me that it's pretty hard to introduce corruption into a Dolt database.
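The general idea of patching bytes by hand can be illustrated without xxd. The Python sketch below flips a single bit in a scratch file and shows that the file's checksum changes. The file name, the offset, and the `flip_bit` and `sha256_of` helpers are made up for the example, and it says nothing about the Journal's actual layout. Point something like this only at a throwaway copy, never at a real database.

```python
import hashlib
import os
import tempfile


def flip_bit(path: str, offset: int, bit: int = 0) -> None:
    """Flip one bit of the byte at `offset` in the file at `path`, in place."""
    with open(path, "r+b") as f:
        f.seek(offset)
        original = f.read(1)
        if not original:
            raise ValueError("offset is past the end of the file")
        f.seek(offset)
        f.write(bytes([original[0] ^ (1 << bit)]))


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of the whole file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    scratch = os.path.join(tmp, "scratch.bin")
    with open(scratch, "wb") as f:
        f.write(b"some chunk bytes")

    before = sha256_of(scratch)
    flip_bit(scratch, offset=5)
    after = sha256_of(scratch)
    print(before != after)  # True

    # Flipping the same bit again restores the original bytes.
    flip_bit(scratch, offset=5)
    restored = sha256_of(scratch)
    print(restored == before)  # True
```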

Regardless of how hard it was to break Dolt, we've preserved the corrupted databases so that we will have regression tests going forward.

Wrapping Up

If you have a Dolt database handy, go ahead and run dolt fsck and see if it finds any problems. If it does, we want to hear about it! Join us on our Discord server!!!
