Anatomy of a Dolt Database
While writing about how to tune Dolt for large databases, it dawned on me that we don't have documentation on the internal file structure of a Dolt database. What does Dolt look like on the inside? Let's dissect a database and find out.
DISCLAIMER
My engineers threatened to quit if I didn't add a disclaimer. Everything described in this article is an internal implementation detail and is subject to change. If you use Dolt internals as an API, it can, and likely will, break.
Create a Dolt Database
There are two ways to create a new Dolt database: dolt init and the SQL create database statement. They create slightly different artifacts on disk, so we'll describe both.
If you want to follow along, I'm going to start with a fresh directory in my home directory to show this off.
$ mkdir anatomy
$ cd anatomy
dolt init
dolt init is the command line interface for creating a new Dolt database in your current directory. It is analogous to git init.
$ dolt init
Successfully initialized dolt data repository.
$
This command seemingly does nothing on the surface, but with careful digging you'll see a hidden .dolt directory.
$ ls -al
total 0
drwxr-xr-x 3 timsehn staff 96 Oct 23 13:48 .
drwxr-x---+ 69 timsehn staff 2208 Oct 23 13:47 ..
drwxr-xr-x 6 timsehn staff 192 Oct 23 13:48 .dolt
Let's look inside the .dolt directory using tree. tree didn't ship with my Mac. To get it via Homebrew, run brew install tree.
$ tree .dolt
.dolt
├── config.json
├── noms
│ ├── LOCK
│ ├── journal.idx
│ ├── manifest
│ ├── oldgen
│ └── vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
├── repo_state.json
└── temptf
4 directories, 6 files
You have some configuration and state management at the top level. In the noms directory, you have the database data. Dolt is a direct descendant of Noms, a now-defunct open source decentralized database. The spirit of Noms lives on in every Dolt database. We wouldn't be here without Noms.
We'll cover each of these files and directories in more detail in their own section.
CREATE DATABASE
Now let's try to create a database using Dolt's SQL interface.
$ pwd
/Users/timsehn/anatomy
$ dolt sql -q "create database anatomy"
error on line 1 for query create database anatomy: can't create database anatomy; database exists
What gives? When you dolt init, Dolt uses the name of the current directory as an implicit database name. So, the .dolt directory you created above using dolt init corresponds to a database named anatomy, the name of its parent directory. To create another database using CREATE DATABASE, we must name it something different.
$ dolt sql -q "create database create_database_anatomy"
Query OK, 1 row affected (0.07 sec)
$ ls -al
total 0
drwxr-xr-x 4 timsehn staff 128 Oct 23 13:49 .
drwxr-x---+ 69 timsehn staff 2208 Oct 23 13:47 ..
drwxr-xr-x 7 timsehn staff 224 Oct 23 13:49 .dolt
drwxr-xr-x 3 timsehn staff 96 Oct 23 13:49 create_database_anatomy
$ ls -al create_database_anatomy
total 0
drwxr-xr-x 3 timsehn staff 96 Oct 23 13:49 .
drwxr-xr-x 4 timsehn staff 128 Oct 23 13:49 ..
drwxr-xr-x 6 timsehn staff 192 Oct 23 13:49 .dolt
And you'll see you now have two databases:
$ dolt sql -q "show databases"
+-------------------------+
| Database |
+-------------------------+
| anatomy |
| create_database_anatomy |
| information_schema |
| mysql |
+-------------------------+
If you start a dolt sql-server in this directory, the server will serve both databases. This is how Dolt is able to serve multiple databases from a single server.
The basic rules are as follows. The name of the directory that houses the .dolt directory is the name of the Dolt database. The Dolt database accessed via the Dolt command line interface is the database indicated by the .dolt directory in your current working directory. If you start a dolt sql-server, it serves all databases indicated by .dolt directories in your current working directory and one layer below.
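The discovery rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this article, not Dolt's actual lookup code, and discover_databases is a hypothetical helper name. It checks the given directory and its immediate children for a .dolt directory.

```python
import os
import tempfile


def discover_databases(root):
    """Return database names visible from `root`: root itself plus one layer below.

    Hypothetical helper that mirrors the rule described in the article.
    """
    names = []
    if os.path.isdir(os.path.join(root, ".dolt")):
        # The directory that houses .dolt gives the database its name.
        names.append(os.path.basename(os.path.abspath(root)))
    for entry in sorted(os.listdir(root)):
        child = os.path.join(root, entry)
        if os.path.isdir(os.path.join(child, ".dolt")):
            names.append(entry)
    return names


# Rebuild the layout from the example in a scratch directory.
scratch = tempfile.mkdtemp()
anatomy = os.path.join(scratch, "anatomy")
os.makedirs(os.path.join(anatomy, ".dolt"))
os.makedirs(os.path.join(anatomy, "create_database_anatomy", ".dolt"))

print(discover_databases(anatomy))
```

Run against the two-database layout from this section, the sketch reports anatomy and create_database_anatomy, matching the user databases listed by show databases.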
This is a little confusing, so let's delete create_database_anatomy so all we're dealing with is a single database named anatomy for the rest of this example. I could do the same thing with drop database but, as you'll see a little later, Dolt does something special there.
$ rm -r create_database_anatomy
Configuration and State Management
Dolt creates a few configuration and state management artifacts when you create a database. These are config.json and repo_state.json. By starting a dolt sql-server, you can also create a .doltcfg directory and a sql-server.info file. I'll explain each of these in this section.
config.json
Configuration stored in config.json is akin to Git configuration. The config.json file starts empty for a new Dolt database.
$ cat .dolt/config.json
{}
You can add configuration values to it, like user.name, using the dolt config command. Valid values to set are listed in the Dolt documentation. Values you set are then persisted in this file.
$ dolt config --local --set user.name anatomy
Config successfully updated.
$ cat .dolt/config.json
{"user.name":"anatomy"}
$ dolt config --local --list
user.name = anatomy
You also have a global configuration for your machine in your home directory, ~/.dolt/config.json, that can be viewed and modified using the same dolt config command with the --global option.
Persisted SQL variables are also stored in config.json. Dolt supports many, but not all, MySQL system variables and a few Dolt-specific ones.
$ dolt sql -q "set @@PERSIST.dolt_stats_auto_refresh_enabled = 0;"
$ cat .dolt/config.json
{"sqlserver.global.dolt_stats_auto_refresh_enabled":"0","user.name":"anatomy"}
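Since config.json is plain JSON, it is easy to read from a script. Here is a minimal sketch that separates persisted SQL variables from other settings. It assumes the sqlserver.global. prefix seen in the output above is what marks a persisted SQL variable. That is an inference from this one example, and split_config is a hypothetical helper name.

```python
import json

# Prefix observed in the config.json output above; treating it as the
# marker for persisted SQL variables is an assumption.
SQL_PREFIX = "sqlserver.global."


def split_config(text):
    """Split config.json contents into (persisted SQL variables, other settings)."""
    config = json.loads(text)
    sql_vars, other = {}, {}
    for key, value in config.items():
        if key.startswith(SQL_PREFIX):
            sql_vars[key[len(SQL_PREFIX):]] = value
        else:
            other[key] = value
    return sql_vars, other


sample = '{"sqlserver.global.dolt_stats_auto_refresh_enabled":"0","user.name":"anatomy"}'
sql_vars, other = split_config(sample)
print(sql_vars)
print(other)
```

Note that the values are stored as strings in the file, so the persisted 0 comes back as "0".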
repo_state.json
repo_state.json stores state information about your current HEAD, branches, remotes, and backups. In a fresh Dolt database, it starts with only the HEAD set.
$ cat .dolt/repo_state.json
{
"head": "refs/heads/main",
"remotes": {},
"backups": {},
"branches": {}
}
The state stored in repo_state.json is configuration for Git-style features like the checked out HEAD for the command line interface, configured remotes, and tracking branches. Dolt also has the concept of a backup, which is very similar to a remote but also syncs the working and staged sets to a remote location, not just the committed state like a remote.
If you create a remote or backup, it will get added to repo_state.json.
$ cat .dolt/repo_state.json
{
"head": "refs/heads/main",
"remotes": {
"origin": {
"name": "origin",
"url": "https://doltremoteapi.dolthub.com/timsehn/anatomy",
"fetch_specs": [
"refs/heads/*:refs/remotes/origin/*"
],
"params": {}
}
},
"backups": {},
"branches": {}
}
The head field stores the value of the current checked out HEAD.
$ dolt branch b1
$ dolt checkout b1
Switched to branch 'b1'
$ cat .dolt/repo_state.json
{
"head": "refs/heads/b1",
"remotes": {
"origin": {
"name": "origin",
"url": "https://doltremoteapi.dolthub.com/timsehn/anatomy",
"fetch_specs": [
"refs/heads/*:refs/remotes/origin/*"
],
"params": {}
}
},
"backups": {},
"branches": {}
}
The branches field is used for tracking upstream branches on remotes. These are a bit tricky to create without a fully functioning remote so I'll skip the example. But if you have a remote branch and you run dolt branch --track origin/b1, this field will be populated.
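As a small illustration of how the fields in repo_state.json fit together, here is a sketch that pulls the checked out branch and the remote URLs out of the file contents shown above. The helper names current_branch and remote_urls are hypothetical, and the sketch only relies on the head and remotes keys visible in this article. Remember the disclaimer: the layout can change.

```python
import json

HEADS_PREFIX = "refs/heads/"


def current_branch(repo_state_text):
    """Return the branch name that the head ref points at."""
    head = json.loads(repo_state_text)["head"]
    if not head.startswith(HEADS_PREFIX):
        raise ValueError("head is not a branch ref: " + head)
    return head[len(HEADS_PREFIX):]


def remote_urls(repo_state_text):
    """Map each configured remote name to its URL."""
    remotes = json.loads(repo_state_text)["remotes"]
    return {name: remote["url"] for name, remote in remotes.items()}


sample = """
{
  "head": "refs/heads/b1",
  "remotes": {
    "origin": {
      "name": "origin",
      "url": "https://doltremoteapi.dolthub.com/timsehn/anatomy",
      "fetch_specs": ["refs/heads/*:refs/remotes/origin/*"],
      "params": {}
    }
  },
  "backups": {},
  "branches": {}
}
"""

print(current_branch(sample))
print(remote_urls(sample))
```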
.doltcfg
To generate the next batch of Dolt related state, we're going to start a dolt sql-server. Dolt can be run in Git for Data mode or it can be run in Version Controlled Database mode. Running a dolt sql-server starts a MySQL-compatible database server, by default on port 3306.
To do this, we'll need a new terminal window. So, we open one and navigate to the root of our database where the .dolt directory lives. Once we're there, we run dolt sql-server. We'll just leave that window open and inspect the files it generates in the original terminal.
$ dolt sql-server
Starting server with Config HP="localhost:3306"|T="28800000"|R="false"|L="info"|S="/tmp/mysql.sock"
We now have a new directory called .doltcfg. This directory stores all the information used to configure the dolt sql-server, which is mostly permissions (i.e., users and grants) information. Looking inside the directory, it contains a branch_control.db file that is used to store Dolt branch permissions. This branch_control.db file is automatically created if you have a branch that is not main and you start the server.
$ ls -al .doltcfg
total 16
drwxr-xr-x 4 timsehn staff 128 Oct 23 13:59 .
drwxr-xr-x 4 timsehn staff 128 Oct 23 13:59 ..
-rw-rw---- 1 timsehn staff 300 Oct 23 13:59 branch_control.db
If I create a new user, I get an additional file, privileges.db.
$ dolt sql -q "create user timsehn"
$ ls -al .doltcfg
total 16
drwxr-xr-x 4 timsehn staff 128 Oct 23 13:59 .
drwxr-xr-x 4 timsehn staff 128 Oct 23 13:59 ..
-rw-rw---- 1 timsehn staff 300 Oct 23 13:59 branch_control.db
-rw------- 1 timsehn staff 840 Oct 23 13:59 privileges.db
This privileges.db file can be copied and moved to other databases if you want to maintain the same user profiles and permissions on other databases.
sql-server.info
I now have an additional file in my .dolt directory that contains metadata about the dolt sql-server process. This file is used by the Dolt CLI and other Dolt processes to infer state about the running dolt sql-server.
$ cat .dolt/sql-server.info
54409:3306:b68c9466-d85a-40a8-8c8e-2d0a2d9561d6
When you stop the server, Dolt attempts to clean up this file. I stopped the dolt sql-server process in the other terminal and, lo and behold, it is gonzo.
$ ls -al .dolt
total 16
drwxr-xr-x 7 timsehn staff 224 Oct 23 14:01 .
drwxr-xr-x 4 timsehn staff 128 Oct 23 13:59 ..
-rwxrwxrwx 1 timsehn staff 2 Oct 23 13:48 config.json
drwxr-xr-x 7 timsehn staff 224 Oct 23 14:01 noms
-rwxrwxrwx 1 timsehn staff 83 Oct 23 13:48 repo_state.json
drwxr-xr-x 3 timsehn staff 96 Oct 23 13:49 stats
drwxr-xr-x 2 timsehn staff 64 Oct 23 13:48 temptf
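Going back to the sql-server.info contents shown above: the three colon-separated fields look like a process ID, the server port, and a UUID. That reading is a guess from the single sample in this article, not a documented format, so the field names in this sketch (pid, port, token) and the parse_server_info helper are assumptions for illustration.

```python
from typing import NamedTuple


class ServerInfo(NamedTuple):
    # Field names are guesses based on the sample line, not Dolt's own names.
    pid: int
    port: int
    token: str


def parse_server_info(line):
    """Split a sql-server.info line into its three colon-separated fields."""
    pid, port, token = line.strip().split(":", 2)
    return ServerInfo(int(pid), int(port), token)


info = parse_server_info("54409:3306:b68c9466-d85a-40a8-8c8e-2d0a2d9561d6")
print(info.pid, info.port, info.token)
```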
Noms
Now that we have configuration out of the way, we get on to the real data in your Dolt database. This is all housed in the noms directory.
$ ls -al .dolt/noms
total 2064
drwxr-xr-x 7 timsehn staff 224 Oct 23 14:01 .
drwxr-xr-x 7 timsehn staff 224 Oct 23 14:01 ..
-rw------- 1 timsehn staff 0 Oct 23 13:48 LOCK
-rw-r--r-- 1 timsehn staff 406 Oct 23 14:01 journal.idx
-rw------- 1 timsehn staff 145 Oct 23 14:01 manifest
drwxr-xr-x 2 timsehn staff 64 Oct 23 13:48 oldgen
-rw-r--r-- 1 timsehn staff 1048576 Oct 23 13:59 vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Let's walk through what each of these files and directories does.
Manifest
The manifest is the control center for Dolt's data. It tells Dolt where to look for data on the filesystem. It starts out looking something like this:
$ cat .dolt/noms/manifest
5:__DOLT__:8a4r3pjhao0gq2cnkrj6tpr1n0d971dp:uc8dnffqe97329u8piol5qh15dn6rhut:00000000000000000000000000000000:vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv:14
Dolt uses a unique content-addressed storage engine based on Prolly Trees to provide you the performance of a SQL database with all the Git features you know and love. Those hash values in the manifest correspond to real internal hash values of the schema and data in your database.
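To make the manifest line above a little less opaque, here is a sketch that splits it into fields. The field names are my reading of the sample and are assumptions, not something this article confirms: a storage version, a format name, a lock hash, a root hash, a garbage collection generation hash, and then pairs of table file name and chunk count. parse_manifest is a hypothetical helper.

```python
from typing import List, NamedTuple, Tuple


class Manifest(NamedTuple):
    # All field names below are assumed for illustration.
    storage_version: str
    format_name: str
    lock: str
    root: str
    gc_generation: str
    table_files: List[Tuple[str, int]]


def parse_manifest(line):
    """Split a manifest line on ':' into five header fields plus (name, count) pairs."""
    fields = line.strip().split(":")
    if len(fields) < 5 or (len(fields) - 5) % 2 != 0:
        raise ValueError("unexpected manifest layout")
    rest = fields[5:]
    tables = [(rest[i], int(rest[i + 1])) for i in range(0, len(rest), 2)]
    return Manifest(fields[0], fields[1], fields[2], fields[3], fields[4], tables)


line = (
    "5:__DOLT__:8a4r3pjhao0gq2cnkrj6tpr1n0d971dp:"
    "uc8dnffqe97329u8piol5qh15dn6rhut:"
    "00000000000000000000000000000000:"
    "vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv:14"
)
manifest = parse_manifest(line)
print(manifest.root)
print(manifest.table_files)
```

Under this reading, the fresh database has a single storage file, the one named with 32 v characters, holding 14 chunks.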
Journal
Let's examine the set of files related to the "chunk journal". The actual journal file is named vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv and there is a journal index named journal.idx that makes lookups into the journal faster. The journal file is named vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv because of Dolt's hash naming system. In the code, it's a binary hash value with every bit set to 1. Printed in the same text encoding Dolt uses for its other hashes, that comes out as vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv.
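To see how a hash of all 1 bits could turn into a string of v characters, here is a small sketch. It assumes 20-byte hashes rendered five bits at a time using the alphabet 0-9 followed by a-v. That is consistent with every hash printed in this article, but the alphabet, the hash length, and the encode_hash helper are assumptions for illustration rather than a copy of Dolt's code.

```python
# Assumed 32-character alphabet: digits, then the letters a through v.
ALPHABET = "0123456789abcdefghijklmnopqrstuv"


def encode_hash(raw):
    """Render a 20-byte hash as 32 characters, five bits per character."""
    if len(raw) != 20:
        raise ValueError("expected a 20-byte hash")
    number = int.from_bytes(raw, "big")
    chars = []
    for _ in range(32):  # 160 bits / 5 bits per character
        chars.append(ALPHABET[number & 31])
        number >>= 5
    return "".join(reversed(chars))


print(encode_hash(b"\xff" * 20))
print(encode_hash(b"\x00" * 20))
```

Every five-bit group of an all-ones hash has the value 31, which maps to v, the last character of the alphabet. An all-zero hash maps to 32 zeros, the same shape as the all-zero field in the manifest.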
Dolt didn't always have a journal. The journal was added in January 2023 so that Dolt could support ACID transactions, and it became the default when Dolt went 1.0 in May 2023.
All database writes are initially made to the chunk journal. When garbage collection is run, the chunk journal is broken down into table files and stored in the noms and oldgen directories, with garbage chunks being discarded. In the example below, notice how the journal's write timestamp changes after an insert.
$ ls -al .dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
-rw-r--r-- 1 timsehn staff 1048576 Oct 23 13:59 .dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
$ dolt sql -q "create table t (id int primary key, other varchar(100))"
$ dolt sql -q "insert into t values (0,'something for the journal')"
Query OK, 1 row affected (0.01 sec)
$ ls -al .dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
-rw-r--r-- 1 timsehn staff 1048576 Oct 23 14:03 .dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Also note that the manifest has been updated with new mapping information.
$ cat .dolt/noms/manifest
5:__DOLT__:3a507o3ua5fhao5erdrq59ans5pg5i0p:qf751ggb9cttulqqrcqeg6hiua95aqrg:00000000000000000000000000000000:vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv:30
oldgen
oldgen stands for "old generation". Dolt uses a generational garbage collection approach. Garbage collection throws away the chunks in the journal that are not referenced in the commit graph. After orphaned chunks are garbage collected, the chunks that remain are structured into table files and permanently stored in the noms (i.e., new generation) and oldgen (i.e., old generation) directories.
The chunk journal will grow unbounded until the Dolt garbage collection process is invoked. Garbage collection is currently manual but will hopefully become automatic before the end of 2024.
Let's garbage collect our tiny database. We first make a commit to indicate a permanent marker for storage.
$ dolt commit -Am "Initial table and value"
commit htdv4naseic7lrdulsruf21b4jqe5g3d (HEAD -> main)
Author: timsehn <tim@dolthub.com>
Date: Wed Oct 23 14:31:43 -0700 2024
Initial table and value
$ dolt gc
Now our journal is gone.
$ ls -al .dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
ls: .dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv: No such file or directory
And the chunks have been stored in oldgen.
$ ls -al .dolt/noms/oldgen
total 16
drwxr-xr-x 5 timsehn staff 160 Oct 23 14:31 .
drwxr-xr-x 6 timsehn staff 192 Oct 23 14:31 ..
-rw------- 1 timsehn staff 0 Oct 23 14:31 LOCK
-rw------- 1 timsehn staff 1455 Oct 23 14:31 g8b0lhkn5rv6en5e230v969etn5klflo
-rw------- 1 timsehn staff 144 Oct 23 14:31 manifest
Our manifest reflects this fact. Notice no mention of the vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv file in the manifest.
$ cat .dolt/noms/manifest
5:__DOLT__:1lqug7pvf62jh5k04k8bc4auo1pun4et:lp9gdupthookjlop7fthrfibo2huvbrv:1lqug7pvf62jh5k04k8bc4auo1pun4et:j25a2obgeft41l03s9e4ba2epnv40lug:3
I also have a table file in the noms directory. This is the "new generation".
$ ls -al .dolt/noms
total 16
drwxr-xr-x 6 timsehn staff 192 Oct 23 14:31 .
drwxr-xr-x 7 timsehn staff 224 Oct 23 14:01 ..
-rw------- 1 timsehn staff 0 Oct 23 13:48 LOCK
-rw------- 1 timsehn staff 629 Oct 23 14:31 j25a2obgeft41l03s9e4ba2epnv40lug
-rw------- 1 timsehn staff 144 Oct 23 14:31 manifest
drwxr-xr-x 5 timsehn staff 160 Oct 23 14:31 oldgen
If I make another write, it is done to a new journal file.
$ dolt sql -q "insert into t values (1,'another for the journal')"
Query OK, 1 row affected (0.00 sec)
$ ls -al .dolt/noms
total 2072
drwxr-xr-x 8 timsehn staff 256 Oct 23 14:43 .
drwxr-xr-x 7 timsehn staff 224 Oct 23 14:01 ..
-rw------- 1 timsehn staff 0 Oct 23 13:48 LOCK
-rw------- 1 timsehn staff 629 Oct 23 14:31 j25a2obgeft41l03s9e4ba2epnv40lug
-rw-r--r-- 1 timsehn staff 261 Oct 23 14:43 journal.idx
-rw------- 1 timsehn staff 179 Oct 23 14:43 manifest
drwxr-xr-x 5 timsehn staff 160 Oct 23 14:31 oldgen
-rw-r--r-- 1 timsehn staff 1048576 Oct 23 14:43 vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
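To recap what lives in the noms directory at this point, here is a sketch that labels each entry from the listing above. The naming pattern for table files (32 characters drawn from 0-9 and a-v) is inferred from the file names in this article, and classify is a hypothetical helper, so treat it as a reading aid rather than a specification.

```python
import re

JOURNAL_NAME = "v" * 32
# Assumed pattern for table file names, inferred from the listings above.
TABLE_FILE = re.compile(r"[0-9a-v]{32}")


def classify(name):
    """Label an entry of the noms directory according to this article."""
    if name == JOURNAL_NAME:
        # Checked first because the journal name also matches the table file pattern.
        return "chunk journal"
    if name == "journal.idx":
        return "journal index"
    if name == "manifest":
        return "manifest"
    if name == "LOCK":
        return "lock file"
    if name == "oldgen":
        return "old generation directory"
    if TABLE_FILE.fullmatch(name):
        return "table file"
    return "unknown"


listing = [
    "LOCK",
    "j25a2obgeft41l03s9e4ba2epnv40lug",
    "journal.idx",
    "manifest",
    "oldgen",
    "vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv",
]
for name in listing:
    print(name + ": " + classify(name))
```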
Temporary Storage and Lock Files
Dolt is written in Golang. In Golang, many file operations are operating system dependent. These operations often need a temporary space in the file system or a pre-allocated lock file. We cannot count on OS-dependent solutions like /tmp or /var/tmp. Thus, Dolt allocates a temptf directory in order to ensure a temporary space exists for operations that need it. temptf should be empty unless there was an error of some sort.
$ ls -al .dolt/temptf
total 0
drwxr-xr-x 2 timsehn staff 64 Oct 21 13:00 .
drwxr-xr-x 8 timsehn staff 256 Oct 22 14:40 ..
In various directories you will also see LOCK files, which are used to coordinate disk-based locks.
$ ls -al .dolt/noms
total 2072
drwxr-xr-x 8 timsehn staff 256 Oct 23 14:43 .
drwxr-xr-x 7 timsehn staff 224 Oct 23 14:01 ..
-rw------- 1 timsehn staff 0 Oct 23 13:48 LOCK
-rw------- 1 timsehn staff 629 Oct 23 14:31 j25a2obgeft41l03s9e4ba2epnv40lug
-rw-r--r-- 1 timsehn staff 261 Oct 23 14:43 journal.idx
-rw------- 1 timsehn staff 179 Oct 23 14:43 manifest
drwxr-xr-x 5 timsehn staff 160 Oct 23 14:31 oldgen
-rw-r--r-- 1 timsehn staff 1048576 Oct 23 14:43 vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Table Statistics
Table statistics are stored in Dolt format in the stats directory. The stats directory looks like a Dolt database within your Dolt database. Table statistics are used to optimize queries and are now collected by default.
$ dolt sql -q "analyze table t"
+-------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+-------+---------+----------+----------+
| t | analyze | status | OK |
+-------+---------+----------+----------+
$ ls -al .dolt/stats/.dolt
total 16
drwxr-xr-x 6 timsehn staff 192 Oct 23 13:49 .
drwxr-xr-x 3 timsehn staff 96 Oct 23 13:49 ..
-rwxrwxrwx 1 timsehn staff 2 Oct 23 13:49 config.json
drwxr-xr-x 7 timsehn staff 224 Oct 23 15:35 noms
-rwxrwxrwx 1 timsehn staff 83 Oct 23 13:49 repo_state.json
drwxr-xr-x 2 timsehn staff 64 Oct 23 13:49 temptf
You can inspect table statistics using the dolt_statistics system table.
$ dolt sql -q "select * from dolt_statistics"
+---------------+------------+------------+-----------+----------------+------------+---------+-------+-------------+-----------------+---------------------+------+------+------+------+-----------+
| database_name | table_name | index_name | row_count | distinct_count | null_count | columns | types | upper_bound | upper_bound_cnt | created_at | mcv1 | mcv2 | mcv3 | mcv4 | mcvCounts |
+---------------+------------+------------+-----------+----------------+------------+---------+-------+-------------+-----------------+---------------------+------+------+------+------+-----------+
| anatomy | t | primary | 2 | 2 | 0 | id | int | 1 | 1 | 2024-10-23 22:36:04 | | | | | |
+---------------+------------+------------+-----------+----------------+------------+---------+-------+-------------+-----------------+---------------------+------+------+------+------+-----------+
Dropped Databases
As teased earlier, Dolt has a unique feature called dolt_undrop. Dropped databases aren't permanently removed. Instead, they are stored in a hidden directory until they are purged with another command. Let's clone another database from DoltHub and drop it.
$ dolt clone dolthub/nba-players
cloning https://doltremoteapi.dolthub.com/dolthub/nba-players
0 of 19,805 chunks complete. 19,805 chunks being downloaded currently.
$ dolt sql -q "drop database \`nba-players\`"
$ ls -al
total 0
drwxr-xr-x 5 timsehn staff 160 Oct 24 13:44 .
drwxr-x---+ 69 timsehn staff 2208 Oct 24 09:27 ..
drwxr-xr-x 7 timsehn staff 224 Oct 24 13:29 .dolt
drwxr-xr-x 3 timsehn staff 96 Oct 24 13:44 .dolt_dropped_databases
drwxr-xr-x 4 timsehn staff 128 Oct 23 13:59 .doltcfg
$ ls -al .dolt_dropped_databases
total 0
drwxr-xr-x 3 timsehn staff 96 Oct 24 13:44 .
drwxr-xr-x 5 timsehn staff 160 Oct 24 13:44 ..
drwxr-xr-x 3 timsehn staff 96 Oct 24 13:42 nba-players
Conclusion
Now that you've seen how the sausage is made, we hope that it doesn't ruin your appetite for Dolt. We'll be writing a few more of these "anatomy" or "under the hood" style posts in the near future. If you have a Dolt topic you want explained in detail, come by our Discord and tell us.