We Have Google Drive at Home: Musings on Merkle-Tree Based File Sharing
Suppose you have a directory of files that you want to sync with your friends. When the files change, you want your friends to be able to download just the changes without needing to re-download the entire directory again. And you want this to scale, no matter how many or how large the files are. What's the best way to do this?
This is the kind of thing I think about all the time. I love distributed systems so much that I joined DoltHub to help make Dolt, the first version-controlled SQL database. And I'm exactly the kind of data hoarder who ends up trying to index and archive terabytes of data, and puzzling over the best way to do it.
I can already hear you screaming: "Just use Git! This is literally what Git is for!"
Don't get me wrong: Git is a fantastic protocol. There's a reason why Git is the preferred version control system of over 90% of developers.
But it's not that simple. For all its strengths, Git scales poorly:
- Git's storage layer treats large files as atomic objects, often requiring the entire file to be loaded into memory or transmitted over the network even when only a small part of it has changed. And structural sharing between multiple similar files is not always possible.
- Staging lots of small files at once creates a temporary file for each new "object" in Git's datastore, one per staged file, which can bottleneck large commits.
- Looking up objects in Git's datastore gets slower once the objects no longer fit in a single packfile.
For all its usefulness, Git simply isn't the solution. But it's so close to the solution. And the reason it's so close is that it leverages one of my favorite data structures of all time: Merkle trees.
Merkle Trees
Merkle trees are awesome, and here at DoltHub we talk about them all the time. We've talked about Merkle trees before and how they're used for content-addressed storage. And we've talked about how Dolt uses them in our data model and how they allow for efficient push and pull operations on remotes.
The concept is dead simple too: index objects by the hash of their contents, and create "pointers" to data by storing that hash. It's one simple trick, and it opens up so many doors.
The problem of identifying only parts of a data set that have been added or changed is called set reconciliation. And Merkle trees are a very good tool for set reconciliation because if two nodes of two Merkle trees have the same hash, then we know that those nodes are identical, as are all their children. Organizing data as Merkle trees allows us to quickly prune large swaths of unchanged data, isolating only the parts that did change.
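To make that concrete, here's a minimal sketch in Go (invented types, not any real system's code) of both ideas: nodes are addressed by the hash of their contents, and comparing two trees prunes any subtree whose hashes already match.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Node is a simplified Merkle tree node: a leaf holds file data,
// an interior node holds named children (like a directory).
type Node struct {
	Data     []byte
	Children map[string]*Node
}

// Hash is the content address of a node. Interior nodes hash the names
// and hashes of their children, so two subtrees with identical contents
// always produce identical hashes.
func (n *Node) Hash() string {
	h := sha256.New()
	if n.Children == nil {
		h.Write(n.Data)
	} else {
		names := make([]string, 0, len(n.Children))
		for name := range n.Children {
			names = append(names, name)
		}
		sort.Strings(names)
		for _, name := range names {
			h.Write([]byte(name))
			h.Write([]byte(n.Children[name].Hash()))
		}
	}
	return hex.EncodeToString(h.Sum(nil))
}

// Changed reports the paths whose subtrees differ between two trees,
// skipping an entire subtree as soon as its hashes match.
func Changed(path string, a, b *Node) []string {
	if a != nil && b != nil && a.Hash() == b.Hash() {
		return nil // identical subtrees: nothing below needs syncing
	}
	if a == nil || b == nil || a.Children == nil || b.Children == nil {
		return []string{path}
	}
	var diffs []string
	seen := map[string]bool{}
	for name, child := range a.Children {
		seen[name] = true
		diffs = append(diffs, Changed(path+"/"+name, child, b.Children[name])...)
	}
	for name, child := range b.Children {
		if !seen[name] {
			diffs = append(diffs, Changed(path+"/"+name, nil, child)...)
		}
	}
	return diffs
}

func main() {
	oldTree := &Node{Children: map[string]*Node{
		"a.txt": {Data: []byte("hello")},
		"b.txt": {Data: []byte("world")},
	}}
	newTree := &Node{Children: map[string]*Node{
		"a.txt": {Data: []byte("hello")},
		"b.txt": {Data: []byte("world, edited")},
	}}
	fmt.Println(Changed("", oldTree, newTree)) // [/b.txt]
}
```

A real implementation would cache hashes rather than recompute them on every comparison, but the pruning is the whole trick: matching hashes mean matching data, no matter how big the subtree is.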
Merkle trees make Git possible. But Git isn't the only software system that uses Merkle trees for set reconciliation. For instance, S3 allegedly uses Merkle trees internally, although there aren't any public docs about it.
And that makes me wonder whether there's another decentralized system built on Merkle trees that can solve our file-sharing problem. Is there another protocol with a different set of trade-offs that can help us cross the finish line?
Last year I made a big list of every decentralized database I could find and broke down their feature sets. You could consider this a follow-up to that. Most of the tools mentioned in this article were mentioned there too because, to no one's surprise, Merkle trees are a great way to build a decentralized database.
Bup
- Initial Release: 2010
- Website: https://bup.github.io/
- Source Code: https://github.com/bup/bup
Bup is a backup utility built on top of the Git storage layer, but with special handling for the things that make ordinary Git scale poorly:
- Handling large files / large directories: Instead of representing each large file or directory as a single large object, Bup splits them into a balanced tree of objects, drawing object boundaries with a rolling hash (see the sketch after this list). It's the exact same technique that we call Prolly Trees, although Bup doesn't use that term itself.
- Handling many files: Bup writes Git packfiles directly, whereas Git creates an individual loose object file for each new object and only consolidates them into packfiles during garbage collection. Additionally, if a repo contains multiple packfiles, Bup creates "multi-index" files that can be used to efficiently identify which packfile contains a given object.
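Here's what that first point looks like in practice: a toy content-defined chunker in Go (a sketch of the general rolling-hash idea, not Bup's actual hashsplit code). Because chunk boundaries are chosen from the bytes themselves rather than from fixed offsets, inserting a few bytes near the start of a file only disturbs the chunks around the edit.

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	windowSize = 64     // bytes of context the rolling hash covers
	boundary   = 0x0FFF // split when the low 12 bits are zero (~4 KiB average chunks)
)

// pow31 is 31^windowSize modulo 2^32, used to roll the oldest byte out of the hash.
var pow31 = func() uint32 {
	p := uint32(1)
	for i := 0; i < windowSize; i++ {
		p *= 31
	}
	return p
}()

// split breaks data into content-defined chunks using a simple polynomial
// rolling hash over a sliding window of windowSize bytes.
func split(data []byte) [][]byte {
	var chunks [][]byte
	var hash uint32
	start := 0
	for i, b := range data {
		hash = hash*31 + uint32(b)
		if i >= windowSize {
			hash -= uint32(data[i-windowSize]) * pow31 // drop the byte leaving the window
		}
		if hash&boundary == 0 {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	data := make([]byte, 1<<20)
	rand.New(rand.NewSource(1)).Read(data)

	// Insert a few bytes near the front of the file.
	edited := append([]byte{}, data[:100]...)
	edited = append(edited, []byte("a small edit")...)
	edited = append(edited, data[100:]...)

	before, after := split(data), split(edited)
	fmt.Printf("chunks before: %d, after: %d\n", len(before), len(after))
	// Most chunks are byte-for-byte identical, so a backup only stores the few new ones.
}
```

Bup applies the same boundary trick recursively to the list of chunk hashes itself, so the shape of the whole tree depends only on the data, which is exactly the property that makes it a Prolly Tree.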
This sounds like it solves all our problems. But it still has one major shortcoming: creating an incremental backup still requires scanning the entire directory to identify which files have changed, even when the changes are small relative to the total repository size.
|  | Bup |
| --- | --- |
| Version controlled | ✅ |
| Efficiently store history of large files | ✅ |
| Efficiently diff large files | ✅ |
| Efficiently create incremental backup | ❌ |
| Handles large number of files | ✅ |
| Deduplicate files | ✅ |
| Directory can be mounted / checked out | ✅ |
Mercurial
- Initial Release: April 2005
- Source Code: https://repo.mercurial-scm.org/hg/
Mercurial, like Git, is a version control system that uses Merkle trees to track file history. But unlike Git, Mercurial uses one Merkle tree to track changesets and a separate Merkle tree for each file to track that file's history. This makes it unable to deduplicate multiple large files that are identical or nearly identical. The per-file history does mean that Mercurial can always store file revisions as deltas, but diffing and computing those deltas still requires that the entire file fit in memory, making it unsuitable for extremely large files.
|  | Mercurial |
| --- | --- |
| Version controlled | ✅ |
| Efficiently store history of large files | ✅ |
| Efficiently diff large files | ❌ |
| Efficiently create incremental backup | ✅ |
| Handles large number of files | ✅ |
| Deduplicate files | ❌ |
| Directory can be mounted / checked out | ✅ |
IPLD
- Initial Release: February 2015
- Source Code: https://github.com/ipld/ipld
InterPlanetary Linked Data (IPLD) is the data format underlying both IPFS and Bluesky.
IPLD has a standard way of representing files and directories within a Merkle tree, called UnixFS. Large files are split into chunks via a rolling hash, and large directories are split using a Hash Array Mapped Trie (HAMT).
Unfortunately, IPLD doesn't provide any method for version control. Datastores are static, identified by their content hash. You can approximate pulling from a remote by having the remote distribute up-to-date hashes (such as with IPNS) and running `ipfs pin update`, but it's unclear how well this works in practice.
Since IPLD is meant to be immutable, there are also challenges in creating incremental backups. IPFS has a feature called the Mutable File System (MFS), which is supposed to let you efficiently apply changes to a file system and compute a new root hash. However, in practice it performs poorly when modifying large files and directories.
|  | IPLD |
| --- | --- |
| Version controlled | ❌ |
| Efficiently store history of large files | ❌ |
| Efficiently diff large files | ✅ |
| Efficiently create incremental backup | ❌ |
| Handles large number of files | ✅ |
| Deduplicate files | ✅ |
| Directory can be mounted / checked out | ✅ |
Dolt
It's not a DoltHub blog post if I don't plug Dolt at least once.
Trying to use Dolt for file sharing has one huge obvious glaring problem: Dolt is designed for storing relational data, not files.
But it's possible to represent a file system in relational data: ReFS, a file system designed by Microsoft, is a relational database under the hood, and SQLite Archive Files are an archive format that represents a directory structure within a SQLite database.
It's not hard to imagine using Dolt as the basis of a similar approach, with file contents stored in a `BLOB` column of a relational table with a predefined schema like SQLAR. Dolt already supports blobs of arbitrarily large size via a rolling hash chunker.
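As a rough sketch of what that might look like, here's a SQLAR-style table on a running Dolt SQL server, populated from Go over the MySQL wire protocol. The schema, database name, and connection string below are hypothetical, modeled on SQLite Archive Files rather than anything Dolt ships today.

```go
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/go-sql-driver/mysql" // a running dolt sql-server speaks the MySQL wire protocol
)

// A SQLAR-style table: one row per file, contents in a BLOB column.
// This schema is a sketch modeled on SQLite Archive Files, not a format Dolt defines.
const createTable = `
CREATE TABLE IF NOT EXISTS sqlar (
    name  VARCHAR(512) PRIMARY KEY, -- path within the shared directory
    mode  INT,                      -- unix permission bits
    mtime BIGINT,                   -- last-modified time, unix seconds
    sz    BIGINT,                   -- original file size in bytes
    data  LONGBLOB                  -- file contents; Dolt chunks large blobs internally
)`

func main() {
	// Hypothetical connection string for a local dolt sql-server with a "files" database.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:3306)/files")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(createTable); err != nil {
		log.Fatal(err)
	}

	// Store a single file. A real tool would walk the whole directory and then
	// commit the result, e.g. with CALL DOLT_COMMIT('-A', '-m', 'sync').
	path := "notes/todo.txt"
	info, err := os.Stat(path)
	if err != nil {
		log.Fatal(err)
	}
	contents, err := os.ReadFile(path)
	if err != nil {
		log.Fatal(err)
	}
	_, err = db.Exec(
		`REPLACE INTO sqlar (name, mode, mtime, sz, data) VALUES (?, ?, ?, ?, ?)`,
		path, uint32(info.Mode()), info.ModTime().Unix(), info.Size(), contents,
	)
	if err != nil {
		log.Fatal(err)
	}
}
```

Because every version of the table is itself a Prolly Tree, Dolt would give us diff, history, and efficient push/pull of this archive for free.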
But even if we could store a file system within Dolt, there's not currently a way to mount that filesystem. Someone would have to make a FUSE, NFS or similar implementation before such a file system would be usable.
|  | Dolt |
| --- | --- |
| Version controlled | ✅ |
| Efficiently store history of large files | ✅ |
| Efficiently diff large files | ✅ |
| Efficiently create incremental backup | ✅ |
| Handles large number of files | ✅ |
| Deduplicate files | ✅ |
| Directory can be mounted / checked out | ❌ |
ZFS and BTRFS
ZFS and BTRFS are file systems that, among other features, use Merkle trees to store file and directory checksums, allowing for verifying file integrity and checkpointing. Given that neither of these is a decentralized application, they don't really make sense for this specific problem. But they're still worth mentioning as a novel use of Merkle trees.
The Final Score
Let's put everything together:
|  | Git | Bup | Mercurial | IPLD | Dolt |
| --- | --- | --- | --- | --- | --- |
| Version controlled | ✅ | ✅ | ✅ | ❌ | ✅ |
| Efficiently store history of large files | ❌ | ✅ | ✅ | ❌ | ✅ |
| Efficiently diff large files | ❌ | ✅ | ❌ | ✅ | ✅ |
| Efficiently create incremental backup | ✅ | ❌ | ✅ | ❌ | ✅ |
| Handles large number of files | ❌ | ✅ | ✅ | ✅ | ✅ |
| Deduplicate files | ✅ | ✅ | ❌ | ✅ | ✅ |
| Directory can be mounted / checked out | ✅ | ✅ | ✅ | ✅ | ❌ |
There is something deeply tragic about seeing every criterion met by some system, but no system meeting all of them.
Still, seeing this reaffirms my belief that this perfect file-sharing tool exists in theory. Maybe one day Git will add file chunking, or Bup will add incremental indexing. Maybe IPLD will add version control and incremental backups, or Dolt will become mountable as a file system. Or maybe an entirely new system will show up one day and be the true Swiss Army knife of content-addressed data.
Honestly, it's really cool to see all these different designs, each with their own trade-offs and priorities, slowly converging. It makes me very excited for the future of distributed data sharing on the web.
If you're as excited as I am about the future of file sharing, join our Discord! If you have some really cool use case for Dolt, file an issue on GitHub! We're always looking to meet people with cool new ideas.