Anatomy of a Dolt Database

October 28, 2024

While writing about how to tune Dolt for large databases, it dawned on me that we don't have documentation on the internal file structure of a Dolt database. What does Dolt look like on the inside? Let's dissect a database and find out.

DISCLAIMER

My engineers threatened to quit if I didn't add a disclaimer. Everything described in this article is an internal implementation detail and is subject to change. If you use Dolt internals as an API, it can, and likely will, break.

Create a Dolt Database

There are two ways to create a new Dolt database: dolt init and the SQL create database statement. They create slightly different artifacts on disk so we'll describe both.

If you want to follow along, I'm going to start with a fresh directory in my home directory to show this off.

$ mkdir anatomy
$ cd anatomy

`dolt init`

dolt init is the command line interface to create a new Dolt database in your current directory. dolt init is analogous to git init.

$ dolt init
Successfully initialized dolt data repository.
$

This command seemingly does nothing on the surface but with careful digging you'll see a hidden .dolt directory.

$ ls -al
total 0
drwxr-xr-x   3 timsehn  staff    96 Oct 23 13:48 .
drwxr-x---+ 69 timsehn  staff  2208 Oct 23 13:47 ..
drwxr-xr-x   6 timsehn  staff   192 Oct 23 13:48 .dolt

Let's look inside the .dolt directory using tree. tree didn't ship with my Mac. To get it via Homebrew, run brew install tree.

$ tree .dolt
.dolt
├── config.json
├── noms
│   ├── LOCK
│   ├── journal.idx
│   ├── manifest
│   ├── oldgen
│   └── vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
├── repo_state.json
└── temptf

4 directories, 6 files

You have some configuration and state management at the top level. In the noms directory, you have the database data. Dolt is a direct descendant of Noms, a now-defunct open source decentralized database. The spirit of Noms lives on in every Dolt database. We wouldn't be here without Noms.

We'll cover each of these files and directories in more detail in their own section.

`CREATE DATABASE`

Now let's try to create a database using Dolt's SQL interface.

$ pwd
/Users/timsehn/anatomy
$ dolt sql -q "create database anatomy"
error on line 1 for query create database anatomy: can't create database anatomy; database exists

What gives? When you dolt init, Dolt uses the name of the current directory as an implicit database name. So, the .dolt directory you created above using dolt init corresponds to a database named anatomy, the name of its parent directory. To create another database using CREATE DATABASE, we must name it something different.

$ dolt sql -q "create database create_database_anatomy"
Query OK, 1 row affected (0.07 sec)
$ ls -al
total 0
drwxr-xr-x   4 timsehn  staff   128 Oct 23 13:49 .
drwxr-x---+ 69 timsehn  staff  2208 Oct 23 13:47 ..
drwxr-xr-x   7 timsehn  staff   224 Oct 23 13:49 .dolt
drwxr-xr-x   3 timsehn  staff    96 Oct 23 13:49 create_database_anatomy
$ ls -al create_database_anatomy
total 0
drwxr-xr-x  3 timsehn  staff   96 Oct 23 13:49 .
drwxr-xr-x  4 timsehn  staff  128 Oct 23 13:49 ..
drwxr-xr-x  6 timsehn  staff  192 Oct 23 13:49 .dolt

And you'll see you now have two databases:

$ dolt sql -q "show databases"
+-------------------------+
| Database                |
+-------------------------+
| anatomy                 |
| create_database_anatomy |
| information_schema      |
| mysql                   |
+-------------------------+

If you start a dolt sql-server in this directory, the server will serve both databases. This is how Dolt is able to serve multiple databases from a single server.

The basic rules are as follows. The name of the directory that houses the .dolt directory is the name of the Dolt database. The Dolt database accessed via the Dolt command line interface is the database indicated by the .dolt directory in your current working directory. If you start a dolt sql-server, it serves all databases indicated by .dolt directories in your current working directory and one layer below.
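
The discovery rule can be sketched as a small shell function. This only illustrates the rule as stated here; it is not Dolt's actual lookup code. list_dolt_databases is a hypothetical helper name, and it assumes the database name is exactly the directory name.

```shell
# list_dolt_databases DIR
# Hypothetical helper: prints the names a server started in DIR would pick
# up under the rule above, i.e. DIR itself if it holds a .dolt directory,
# plus every immediate subdirectory that holds one.
list_dolt_databases() {
  local root="$1" sub
  if [ -d "$root/.dolt" ]; then
    # The database is named after the directory that houses .dolt.
    basename "$(cd "$root" && pwd)"
  fi
  # The glob skips hidden directories such as .dolt and .doltcfg.
  for sub in "$root"/*/; do
    [ -d "${sub}.dolt" ] && basename "$sub"
  done
  return 0
}
```

Run as list_dolt_databases . from the anatomy directory at this point in the walkthrough, it would print anatomy and create_database_anatomy.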

This is a little confusing, so let's delete create_database_anatomy so that all we're dealing with is a single database named anatomy for the rest of this example. I could do the same thing with drop database but, as you'll see a little later, Dolt does something special there.

$ rm -r create_database_anatomy

Configuration and State Management

Dolt creates a few configuration and state management artifacts when you create a database. These are config.json and repo_state.json. By starting a dolt sql-server, you can also create a .doltcfg and sql-server.info. I'll explain each of these in this section.

config.json

Configuration stored in config.json is akin to Git configuration. The config.json file starts empty for a new Dolt database.

$ cat .dolt/config.json
{}

You can add configuration values to it like user.name using the dolt config command. Valid values to set are listed in the Dolt documentation. Values you set are then persisted in this file.

$ dolt config --local --set user.name anatomy
Config successfully updated.
$ cat .dolt/config.json
{"user.name":"anatomy"}
$ dolt config --local --list
user.name = anatomy

You also have a global configuration for your machine in your home directory ~/.dolt/config.json that can be viewed and modified using the same dolt config command with the --global option.

Persisted SQL variables are also stored in config.json. Dolt supports many, but not all, MySQL system variables and a few Dolt-specific ones.

$ dolt sql -q "set @@PERSIST.dolt_stats_auto_refresh_enabled = 0;"
$ cat .dolt/config.json
{"sqlserver.global.dolt_stats_auto_refresh_enabled":"0","user.name":"anatomy"}

repo_state.json

repo_state.json stores state information about your current HEAD, branches, remotes, and backups. In a fresh Dolt database, it starts with only the HEAD set.

$ cat .dolt/repo_state.json
{
  "head": "refs/heads/main",
  "remotes": {},
  "backups": {},
  "branches": {}
}

The state stored in repo_state.json is configuration for Git-style features like the checked out HEAD for the command line interface, configured remotes, and tracking branches. Dolt also has the concept of a backup, which is very similar to a remote. Unlike a remote, which syncs only the committed state, a backup also syncs the working and staged sets to the remote location.

If you create a remote or backup, it gets added to repo_state.json. Here is the file after adding a remote named origin.

$ cat .dolt/repo_state.json
{
  "head": "refs/heads/main",
  "remotes": {
    "origin": {
      "name": "origin",
      "url": "https://doltremoteapi.dolthub.com/timsehn/anatomy",
      "fetch_specs": [
        "refs/heads/*:refs/remotes/origin/*"
      ],
      "params": {}
    }
  },
  "backups": {},
  "branches": {}
}

The head field stores the value of the current checked out HEAD.

$ dolt branch b1
$ dolt checkout b1
Switched to branch 'b1'
$ cat .dolt/repo_state.json
{
  "head": "refs/heads/b1",
  "remotes": {
    "origin": {
      "name": "origin",
      "url": "https://doltremoteapi.dolthub.com/timsehn/anatomy",
      "fetch_specs": [
        "refs/heads/*:refs/remotes/origin/*"
      ],
      "params": {}
    }
  },
  "backups": {},
  "branches": {}
}

The branches field is used for tracking upstream branches on remotes. These are a bit tricky to create without a fully functioning remote so I'll skip the example. But if you have a remote branch and you run dolt branch --track origin/b1 this field will be populated.
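
Returning to the head field: because it is a plain string, a script can read the checked out branch straight from the file. This is a minimal sketch, assuming the pretty-printed layout shown above. current_branch is a hypothetical helper, and it uses text matching rather than a real JSON parser. Given the disclaimer at the top, treat it as a curiosity, not a supported interface.

```shell
# current_branch FILE
# Hypothetical helper: prints the branch name stored in the "head" field of
# a repo_state.json file. Assumes one field per line, as in the listings
# above, and a head of the form refs/heads/<branch>.
current_branch() {
  sed -n 's|^[[:space:]]*"head":[[:space:]]*"refs/heads/\([^"]*\)".*$|\1|p' "$1"
}
```

With the file shown above, current_branch .dolt/repo_state.json prints b1.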

.doltcfg

To generate the next batch of Dolt related state, we're going to start a dolt sql-server. Dolt can be run in Git for Data mode or it can be run in Version Controlled Database mode. Running a dolt sql-server starts a MySQL-compatible database server, by default on port 3306.

To do this, we'll need a new terminal window. So, we open one and navigate to the root of our database where the .dolt directory lives. Once we're there, we run dolt sql-server. We'll just leave that window open and inspect the files it generates in the original terminal.

$ dolt sql-server
Starting server with Config HP="localhost:3306"|T="28800000"|R="false"|L="info"|S="/tmp/mysql.sock"

We now have a new directory called .doltcfg. This directory stores all the information used to configure the dolt sql-server, which is mostly permissions (i.e., users and grants) information. Looking inside the directory, we find a branch_control.db file that is used to store Dolt branch permissions. This branch_control.db file is automatically created if you have a branch that is not main and you start the server.

$ ls -al .doltcfg
total 16
drwxr-xr-x  4 timsehn  staff  128 Oct 23 13:59 .
drwxr-xr-x  4 timsehn  staff  128 Oct 23 13:59 ..
-rw-rw----  1 timsehn  staff  300 Oct 23 13:59 branch_control.db

If I create a new user, I get an additional file, privileges.db.

$ dolt sql -q "create user timsehn"
$ ls -al .doltcfg
total 16
drwxr-xr-x  4 timsehn  staff  128 Oct 23 13:59 .
drwxr-xr-x  4 timsehn  staff  128 Oct 23 13:59 ..
-rw-rw----  1 timsehn  staff  300 Oct 23 13:59 branch_control.db
-rw-------  1 timsehn  staff  840 Oct 23 13:59 privileges.db

This privileges.db file can be copied and moved to other databases if you want to maintain the same user profiles and permissions on other databases.

sql-server.info

I now have an additional file in my .dolt directory that contains metadata about the dolt sql-server process. This file is used by the Dolt CLI and other Dolt processes to infer state about the running dolt sql-server.

$ cat .dolt/sql-server.info
54409:3306:b68c9466-d85a-40a8-8c8e-2d0a2d9561d6
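
The three colon-separated fields appear to be a process ID, a port, and a per-server token. Those meanings are a reading of the sample above, not a documented format, and parse_server_info is a hypothetical helper. A minimal sketch of splitting the line:

```shell
# parse_server_info FILE
# Hypothetical helper: splits the single line in sql-server.info into three
# fields. The labels pid, port, and token are assumptions about what the
# fields mean.
parse_server_info() {
  local pid port token
  # read returns non-zero when the file has no trailing newline, but it
  # still fills the variables, so fall back to checking pid.
  IFS=: read -r pid port token < "$1" || [ -n "$pid" ]
  printf 'pid=%s port=%s token=%s\n' "$pid" "$port" "$token"
}
```

For the file above, this prints pid=54409 port=3306 token=b68c9466-d85a-40a8-8c8e-2d0a2d9561d6.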

When you stop the server, Dolt attempts to clean it up. I stopped the dolt sql-server process in the other terminal and lo and behold, it is gonzo.

$ ls -al .dolt
total 16
drwxr-xr-x  7 timsehn  staff  224 Oct 23 14:01 .
drwxr-xr-x  4 timsehn  staff  128 Oct 23 13:59 ..
-rwxrwxrwx  1 timsehn  staff    2 Oct 23 13:48 config.json
drwxr-xr-x  7 timsehn  staff  224 Oct 23 14:01 noms
-rwxrwxrwx  1 timsehn  staff   83 Oct 23 13:48 repo_state.json
drwxr-xr-x  3 timsehn  staff   96 Oct 23 13:49 stats
drwxr-xr-x  2 timsehn  staff   64 Oct 23 13:48 temptf

Noms

Now that we have configuration out of the way, we get on to the real data in your Dolt database. This is all housed in the noms directory.

$ ls -al .dolt/noms
total 2064
drwxr-xr-x  7 timsehn  staff      224 Oct 23 14:01 .
drwxr-xr-x  7 timsehn  staff      224 Oct 23 14:01 ..
-rw-------  1 timsehn  staff        0 Oct 23 13:48 LOCK
-rw-r--r--  1 timsehn  staff      406 Oct 23 14:01 journal.idx
-rw-------  1 timsehn  staff      145 Oct 23 14:01 manifest
drwxr-xr-x  2 timsehn  staff       64 Oct 23 13:48 oldgen
-rw-r--r--  1 timsehn  staff  1048576 Oct 23 13:59 vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

Let's walk through what each of these files and directories does.

Manifest

The manifest is the control center for Dolt's data. It tells Dolt where to look for data on the filesystem. It starts out looking something like this:

$ cat .dolt/noms/manifest
5:__DOLT__:8a4r3pjhao0gq2cnkrj6tpr1n0d971dp:uc8dnffqe97329u8piol5qh15dn6rhut:00000000000000000000000000000000:vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv:14

Dolt uses a unique content-addressed storage engine based on Prolly Trees to provide you the performance of a SQL database with all the Git features you know and love. Those hash values in the manifest correspond to real internal hash values of the schema and data in your database.
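
The manifest is a single line of colon-separated fields, and the sketch below splits it apart. The field names (version, format, lock, root, gc_gen, then file and chunk-count pairs) are assumptions based on the Noms block store layout Dolt inherited, so read them as labels for exploration and not as a specification. parse_manifest is a hypothetical helper.

```shell
# parse_manifest FILE
# Hypothetical helper: prints the fields of a manifest file one per line.
# All field names are assumed labels, not an official format.
parse_manifest() {
  local version format lock root gc_gen specs old_ifs
  # The last variable (specs) receives everything after the fifth colon.
  IFS=: read -r version format lock root gc_gen specs < "$1" || [ -n "$version" ]
  printf 'version=%s format=%s\n' "$version" "$format"
  printf 'lock=%s\nroot=%s\ngc_gen=%s\n' "$lock" "$root" "$gc_gen"
  # Split the remainder on colons into positional parameters, with
  # globbing disabled so the values are never expanded as patterns.
  old_ifs="$IFS"
  IFS=:
  set -f
  set -- $specs
  set +f
  IFS="$old_ifs"
  # The remaining fields are read as name:count pairs.
  while [ "$#" -ge 2 ]; do
    printf 'file=%s chunks=%s\n' "$1" "$2"
    shift 2
  done
}
```

For the manifest above, the last line of output is file=vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv chunks=14.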

Journal

Let's examine the set of files related to the "chunk journal". The actual journal file is named vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv and there is a journal index named journal.idx that makes lookups into the journal faster. The journal file is named vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv because of Dolt's hash naming system. In the code, it's a binary hash value of all 1s. Written out as text the way Dolt writes every hash, that value becomes vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv.
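
To see why all 1s turns into all v's, here is a sketch of the encoding. It assumes a 20-byte hash rendered in base32 with the alphabet 0-9 followed by a-v, five bits per character, which matches the 32-character hashes in the manifest. dolt_style_base32 is a hypothetical helper, not part of Dolt.

```shell
# dolt_style_base32 HEX
# Hypothetical helper: encodes a hex string as base32 using the alphabet
# 0-9a-v. HEX must have a length that is a multiple of 10, since 5 bytes
# (40 bits) map to exactly 8 characters.
dolt_style_base32() {
  local hex="$1" alphabet="0123456789abcdefghijklmnopqrstuv" out="" group bits
  while [ -n "$hex" ]; do
    group=$((16#${hex:0:10}))   # next 40 bits as an integer
    hex="${hex:10}"
    # Peel off eight 5-bit values, most significant first.
    for bits in 35 30 25 20 15 10 5 0; do
      out="$out${alphabet:$(( (group >> bits) & 31 )):1}"
    done
  done
  printf '%s\n' "$out"
}
```

Calling dolt_style_base32 "$(printf '%040d' 0 | tr 0 f)" encodes twenty 0xff bytes and prints 32 v characters, because every 5-bit group is 31 and v is the last letter of the alphabet. Forty zeros give 32 zeros, the same all-zero value that appears in the manifest.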

Dolt didn't always have a journal. The journal was added in January 2023 to make Dolt support ACID transactions, and it became the default when Dolt went 1.0 in May 2023.

All database writes are initially made to the chunk journal. When garbage collection is run, the chunk journal is broken down into table files and stored in the noms and oldgen directories, with garbage chunks being discarded. In the listing that follows, notice how the journal file's modification time changes after a write.

$ ls -al .dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
-rw-r--r--  1 timsehn  staff  1048576 Oct 23 13:59 .dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
$ dolt sql -q "create table t (id int primary key, other varchar(100))"
$ dolt sql -q "insert into t values (0,'something for the journal')"
Query OK, 1 row affected (0.01 sec)
$ ls -al .dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
-rw-r--r--  1 timsehn  staff  1048576 Oct 23 14:03 .dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

Also note that the manifest has been updated with new mapping information.

$ cat .dolt/noms/manifest
5:__DOLT__:3a507o3ua5fhao5erdrq59ans5pg5i0p:qf751ggb9cttulqqrcqeg6hiua95aqrg:00000000000000000000000000000000:vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv:30

oldgen

oldgen stands for "old generation". Dolt uses a generational garbage collection approach. Garbage collection throws away the chunks in the journal that are not referenced in the commit graph. After orphaned chunks are garbage collected, the chunks that remain are structured into table files and permanently stored in the noms (i.e., new generation) and oldgen (i.e., old generation) directories.

The chunk journal will grow unbounded until the Dolt garbage collection process is invoked. Garbage collection is currently manual but will hopefully become automatic before the end of 2024.
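
While collection is manual, a script can watch the journal and tell you when it is worth running. A minimal sketch: journal_needs_gc is a hypothetical helper, the threshold is an arbitrary policy knob of your choosing, and the journal path is the internal one described above, which may change.

```shell
# journal_needs_gc DB_ROOT THRESHOLD_BYTES
# Hypothetical helper: succeeds when the chunk journal under DB_ROOT is
# larger than THRESHOLD_BYTES. Fails when it is smaller or absent, for
# example right after a garbage collection.
journal_needs_gc() {
  local journal="$1/.dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv" size
  [ -f "$journal" ] || return 1
  # Arithmetic expansion strips the padding some wc versions print.
  size=$(( $(wc -c < "$journal") ))
  [ "$size" -gt "$2" ]
}
```

For example, journal_needs_gc . 536870912 && dolt gc would collect once the journal passes 512 MB. The listings above show a nearly empty journal already occupying 1048576 bytes, so a threshold below that would always trigger.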

Let's garbage collect our tiny database. We first make a commit, which serves as a permanent marker for storage.

$ dolt commit -Am "Initial table and value"
commit htdv4naseic7lrdulsruf21b4jqe5g3d (HEAD -> main)
Author: timsehn <tim@dolthub.com>
Date:  Wed Oct 23 14:31:43 -0700 2024

        Initial table and value

$ dolt gc

Now our journal is gone.

$ ls -al .dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
ls: .dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv: No such file or directory

And the chunks have been stored in oldgen.

$ ls -al .dolt/noms/oldgen
total 16
drwxr-xr-x  5 timsehn  staff   160 Oct 23 14:31 .
drwxr-xr-x  6 timsehn  staff   192 Oct 23 14:31 ..
-rw-------  1 timsehn  staff     0 Oct 23 14:31 LOCK
-rw-------  1 timsehn  staff  1455 Oct 23 14:31 g8b0lhkn5rv6en5e230v969etn5klflo
-rw-------  1 timsehn  staff   144 Oct 23 14:31 manifest

Our manifest reflects this fact. Notice no mention of the vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv file in the manifest.

$ cat .dolt/noms/manifest
5:__DOLT__:1lqug7pvf62jh5k04k8bc4auo1pun4et:lp9gdupthookjlop7fthrfibo2huvbrv:1lqug7pvf62jh5k04k8bc4auo1pun4et:j25a2obgeft41l03s9e4ba2epnv40lug:3

I also have a table file in the noms directory. This is "new generation".

$ ls -al .dolt/noms
total 16
drwxr-xr-x  6 timsehn  staff  192 Oct 23 14:31 .
drwxr-xr-x  7 timsehn  staff  224 Oct 23 14:01 ..
-rw-------  1 timsehn  staff    0 Oct 23 13:48 LOCK
-rw-------  1 timsehn  staff  629 Oct 23 14:31 j25a2obgeft41l03s9e4ba2epnv40lug
-rw-------  1 timsehn  staff  144 Oct 23 14:31 manifest
drwxr-xr-x  5 timsehn  staff  160 Oct 23 14:31 oldgen

If I make another write, it is done to a new journal file.

$ dolt sql -q "insert into t values (1,'another for the journal')"
Query OK, 1 row affected (0.00 sec)
$ ls -al .dolt/noms
total 2072
drwxr-xr-x  8 timsehn  staff      256 Oct 23 14:43 .
drwxr-xr-x  7 timsehn  staff      224 Oct 23 14:01 ..
-rw-------  1 timsehn  staff        0 Oct 23 13:48 LOCK
-rw-------  1 timsehn  staff      629 Oct 23 14:31 j25a2obgeft41l03s9e4ba2epnv40lug
-rw-r--r--  1 timsehn  staff      261 Oct 23 14:43 journal.idx
-rw-------  1 timsehn  staff      179 Oct 23 14:43 manifest
drwxr-xr-x  5 timsehn  staff      160 Oct 23 14:31 oldgen
-rw-r--r--  1 timsehn  staff  1048576 Oct 23 14:43 vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

Temporary Storage and Lock Files

Dolt is written in Golang. In Golang, many file operations are operating system dependent. These operations often need a temporary space in the file system or a pre-allocated lock file. We cannot count on OS-dependent solutions like /tmp or /var/tmp. Thus, Dolt allocates a temptf directory in order to ensure a temporary space exists for operations that need it. temptf should be empty unless there was an error of some sort.

$ ls -al .dolt/temptf
total 0
drwxr-xr-x  2 timsehn  staff   64 Oct 21 13:00 .
drwxr-xr-x  8 timsehn  staff  256 Oct 22 14:40 ..

In various directories you will also see LOCK files which are used to coordinate disk-based locks.

$ ls -al .dolt/noms
total 2072
drwxr-xr-x  8 timsehn  staff      256 Oct 23 14:43 .
drwxr-xr-x  7 timsehn  staff      224 Oct 23 14:01 ..
-rw-------  1 timsehn  staff        0 Oct 23 13:48 LOCK
-rw-------  1 timsehn  staff      629 Oct 23 14:31 j25a2obgeft41l03s9e4ba2epnv40lug
-rw-r--r--  1 timsehn  staff      261 Oct 23 14:43 journal.idx
-rw-------  1 timsehn  staff      179 Oct 23 14:43 manifest
drwxr-xr-x  5 timsehn  staff      160 Oct 23 14:31 oldgen
-rw-r--r--  1 timsehn  staff  1048576 Oct 23 14:43 vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

Table Statistics

Table statistics are stored in Dolt format in the stats directory. The stats directory looks like a Dolt database within your Dolt database. Table statistics are used to optimize queries and are now collected by default.

$ dolt sql -q "analyze table t"
+-------+---------+----------+----------+
| Table | Op      | Msg_type | Msg_text |
+-------+---------+----------+----------+
| t     | analyze | status   | OK       |
+-------+---------+----------+----------+

$ ls -al .dolt/stats/.dolt
total 16
drwxr-xr-x  6 timsehn  staff  192 Oct 23 13:49 .
drwxr-xr-x  3 timsehn  staff   96 Oct 23 13:49 ..
-rwxrwxrwx  1 timsehn  staff    2 Oct 23 13:49 config.json
drwxr-xr-x  7 timsehn  staff  224 Oct 23 15:35 noms
-rwxrwxrwx  1 timsehn  staff   83 Oct 23 13:49 repo_state.json
drwxr-xr-x  2 timsehn  staff   64 Oct 23 13:49 temptf

You can inspect table statistics using the dolt_statistics system table.

$ dolt sql -q "select * from dolt_statistics"
+---------------+------------+------------+-----------+----------------+------------+---------+-------+-------------+-----------------+---------------------+------+------+------+------+-----------+
| database_name | table_name | index_name | row_count | distinct_count | null_count | columns | types | upper_bound | upper_bound_cnt | created_at          | mcv1 | mcv2 | mcv3 | mcv4 | mcvCounts |
+---------------+------------+------------+-----------+----------------+------------+---------+-------+-------------+-----------------+---------------------+------+------+------+------+-----------+
| anatomy       | t          | primary    | 2         | 2              | 0          | id      | int   | 1           | 1               | 2024-10-23 22:36:04 |      |      |      |      |           |
+---------------+------------+------------+-----------+----------------+------------+---------+-------+-------------+-----------------+---------------------+------+------+------+------+-----------+

Dropped Databases

As teased earlier, Dolt has a unique feature called dolt_undrop. Dropped databases aren't permanently removed. Instead they are stored in a hidden directory until they are purged with another command. Let's clone another database from DoltHub and drop it.

$ dolt clone dolthub/nba-players
cloning https://doltremoteapi.dolthub.com/dolthub/nba-players
0 of 19,805 chunks complete. 19,805 chunks being downloaded currently.
$ dolt sql -q "drop database \`nba-players\`"
$ ls -al
total 0
drwxr-xr-x   5 timsehn  staff   160 Oct 24 13:44 .
drwxr-x---+ 69 timsehn  staff  2208 Oct 24 09:27 ..
drwxr-xr-x   7 timsehn  staff   224 Oct 24 13:29 .dolt
drwxr-xr-x   3 timsehn  staff    96 Oct 24 13:44 .dolt_dropped_databases
drwxr-xr-x   4 timsehn  staff   128 Oct 23 13:59 .doltcfg
$ ls -al .dolt_dropped_databases
total 0
drwxr-xr-x  3 timsehn  staff   96 Oct 24 13:44 .
drwxr-xr-x  5 timsehn  staff  160 Oct 24 13:44 ..
drwxr-xr-x  3 timsehn  staff   96 Oct 24 13:42 nba-players
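
Since each dropped database is just a directory under .dolt_dropped_databases, listing what can still be restored is a short loop. This is a sketch under that assumption, and list_dropped_databases is a hypothetical helper. The restore itself goes through dolt_undrop, not through moving directories by hand.

```shell
# list_dropped_databases ROOT
# Hypothetical helper: prints the name of every database currently parked
# in ROOT/.dolt_dropped_databases, one per line. Prints nothing if the
# directory does not exist.
list_dropped_databases() {
  local dir="$1/.dolt_dropped_databases" entry
  [ -d "$dir" ] || return 0
  for entry in "$dir"/*/; do
    [ -d "$entry" ] && basename "$entry"
  done
  return 0
}
```

In the directory above, list_dropped_databases . prints nba-players.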

Conclusion

Now that you've seen how the sausage is made, we hope that it doesn't ruin your appetite for Dolt. We'll be writing a few more of these "anatomy" or "under the hood" style posts in the near future. If you have a Dolt topic you want explained in detail, come by our Discord and tell us.
