Announcing automatic garbage collection in Dolt sql-server
At DoltHub, we are building Dolt, the world's first version-controlled
SQL database, supporting operations like branch, diff, merge, rebase,
and blame as well as interactions with remotes such as clone, push and
fetch. Today, we're happy to announce that the latest version of Dolt
now supports optional automatic garbage collection of the on-disk
databases when running in sql-server mode. In this short blog post
we'll talk about what the new functionality is, why you might want it,
and how to enable it if you do.
Garbage in a Dolt Database
Dolt databases are based on a novel content-addressed sorted index
structure we call a Prolly Tree. Prolly Trees are copy-on-write
structures, so to perform a write to a Dolt database, Dolt creates new
records on disk, as opposed to replacing existing ones. Because Dolt
is versioned, it keeps the old versions of the data around. But not
all versions of the data on disk remain reachable and relevant to the
database over time. If no dolt commits are created which reference a
particular version of the data in the database, that version can end
up unreachable and irrelevant to the contents of the versioned Dolt
database, with its contents still stored on disk.
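To make the idea concrete, here is a minimal sketch of how copy-on-write writes to a content-addressed store leave garbage behind. It is a toy model, not Dolt's storage code: the `ChunkStore` class, the JSON encoding, and the SHA-256 addressing are all assumptions made for illustration, and a real Prolly Tree splits data into many chunks rather than one value per version.

```python
import hashlib
import json


class ChunkStore:
    """Toy content-addressed store: each chunk is keyed by the hash of its bytes."""

    def __init__(self):
        self.chunks = {}

    def put(self, value):
        # Serialize deterministically so equal values get equal addresses.
        data = json.dumps(value, sort_keys=True).encode()
        addr = hashlib.sha256(data).hexdigest()
        # A write adds a new chunk; existing chunks are never modified.
        self.chunks[addr] = data
        return addr


store = ChunkStore()

# Each write produces a new version instead of replacing the old one.
root_v1 = store.put({"rows": [[1, "a"]]})
root_v2 = store.put({"rows": [[1, "a"], [2, "b"]]})  # written but never committed
root_v3 = store.put({"rows": [[1, "a"], [2, "c"]]})

# Suppose only v1 and v3 were ever referenced by a commit.
committed_roots = {root_v1, root_v3}

# Everything else is still on "disk" but no longer reachable: garbage.
garbage = set(store.chunks) - committed_roots
```

In this sketch the intermediate version `root_v2` is the only garbage: it was written, superseded, and never referenced by a commit.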
Dolt has long supported a CLI and server operation which allows for
this unreferenced data to be cleaned up and removed from storage. When
you run dolt gc or, from SQL, call dolt_gc(), the Dolt process does
the work of walking all reachable data from the roots of the current
database and ensuring that all of it is retained, while removing
anything that is no longer reachable. You can read more about garbage
in Dolt and how dolt gc generally works in our previous blog posts
about it.
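The walk-and-retain process described above has the shape of a classic mark-and-sweep collection. The following is a minimal sketch under simplifying assumptions: chunks live in an in-memory dictionary, each chunk lists the addresses it references under a hypothetical `refs` key, and the function names `mark` and `sweep` are ours, not Dolt's.

```python
def mark(chunks, roots):
    """Return the set of chunk addresses reachable from the given roots."""
    reachable = set()
    stack = list(roots)
    while stack:
        addr = stack.pop()
        if addr in reachable or addr not in chunks:
            continue
        reachable.add(addr)
        # Follow every reference held by this chunk.
        stack.extend(chunks[addr]["refs"])
    return reachable


def sweep(chunks, reachable):
    """Return a new store that keeps only the reachable chunks."""
    return {addr: chunk for addr, chunk in chunks.items() if addr in reachable}


# Hypothetical store: one committed root plus a superseded, uncommitted tree.
chunks = {
    "commit2": {"refs": ["tree2"]},
    "tree2": {"refs": ["leafA", "leafC"]},
    "tree1": {"refs": ["leafA", "leafB"]},  # superseded, never committed
    "leafA": {"refs": []},
    "leafB": {"refs": []},
    "leafC": {"refs": []},
}

live = mark(chunks, ["commit2"])
collected = sweep(chunks, live)
```

Note that `leafA` survives even though the unreachable `tree1` also points at it, because it is still referenced by the live `tree2`. Sharing of this kind is what makes content-addressed storage compact, and it is why reachability has to be computed from the roots rather than per chunk.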
Up until now, in order to make call dolt_gc() safe, the implementation
was disruptive to in-progress reads and writes which were occurring
concurrently with the GC operation. Running dolt_gc would forcefully
disconnect existing SQL connections, interrupting in-progress
operations and transactions, and it would leave the connection which
ran the GC itself in a permanently invalid state, relying on the
client application to close it. This behavior interacted badly with
connection pooling and with client applications in general.
Automatic Garbage Collection
Automatic GC is a feature where Dolt will periodically run a GC in the background. In order to make it compelling and usable, we needed to reduce the impact of running a garbage collection concurrently with other work. The new solution is able to run a garbage collection without breaking in-flight transactions or connections and with only short blocking operations for any in-progress reads or writes.
Auto GC is currently only enabled when running dolt sql-server with a
config.yaml file. To enable it, add the following configuration
fragment to the config file:
behavior:
  auto_gc_behavior:
    enable: true
and make sure to run the sql-server process, passing the file path as
the config parameter:
dolt sql-server --config config.yaml
Once it is enabled, as you perform writes against your Dolt database, you should eventually see log messages at info level indicating that GCs are being performed. They will look similar to:
time="2025-02-25T10:31:20-08:00" level=info msg="sqle/auto_gc: Successfully completed auto GC of database auto_gc_test in 144.550125ms"
time="2025-02-25T10:31:22-08:00" level=info msg="sqle/auto_gc: Successfully completed auto GC of database auto_gc_test in 219.788375ms"
time="2025-02-25T10:31:23-08:00" level=info msg="sqle/auto_gc: Successfully completed auto GC of database auto_gc_test in 151.752417ms"
time="2025-02-25T10:31:25-08:00" level=info msg="sqle/auto_gc: Successfully completed auto GC of database auto_gc_test in 302.215625ms"
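If you want to keep an eye on how long these collections take, you can pull the durations out of the log. The sketch below assumes the message format shown above and only handles durations printed in `ms` or `s`; other units would need extra cases. The function name `parse_gc_durations` is hypothetical.

```python
import re

# Matches the auto GC completion message and captures database and duration.
LOG_PATTERN = re.compile(
    r"Successfully completed auto GC of database (?P<db>\S+) "
    r'in (?P<value>[\d.]+)(?P<unit>ms|s)"'
)


def parse_gc_durations(lines):
    """Return a list of (database, duration in milliseconds) tuples."""
    results = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # not an auto GC completion line
        millis = float(match.group("value"))
        if match.group("unit") == "s":
            millis *= 1000.0
        results.append((match.group("db"), millis))
    return results


sample = [
    'time="2025-02-25T10:31:20-08:00" level=info msg="sqle/auto_gc: '
    'Successfully completed auto GC of database auto_gc_test in 144.550125ms"',
    'time="2025-02-25T10:31:22-08:00" level=info msg="sqle/auto_gc: '
    'Successfully completed auto GC of database auto_gc_test in 219.788375ms"',
    'time="2025-02-25T10:31:22-08:00" level=info msg="unrelated message"',
]

durations = parse_gc_durations(sample)
```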
The implementation will currently run an automatic GC anytime the database has grown in size by 125MB. The heuristics used to kick off the GC are not currently tunable. There is a memory, CPU and disk I/O overhead associated with running a garbage collection, and the current implementation does not pace the GC work relative to client-initiated work which needs to be done. As a result, it's possible to experience a performance impact from running with Auto GC enabled.
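A growth-based trigger like the one described can be sketched in a few lines. This is an illustration only and not Dolt's scheduler: the `GrowthTrigger` class is hypothetical, we assume 125MB means 125 * 1024 * 1024 bytes, and we assume growth is measured against the store size recorded after the previous collection, none of which is spelled out above.

```python
# Assumed interpretation of "125MB"; the real constant may differ.
GROWTH_THRESHOLD_BYTES = 125 * 1024 * 1024


class GrowthTrigger:
    """Signals a collection once the store has grown by a fixed amount."""

    def __init__(self, initial_size, threshold=GROWTH_THRESHOLD_BYTES):
        self.baseline = initial_size
        self.threshold = threshold

    def should_collect(self, current_size):
        # Compare growth since the last collection against the threshold.
        return current_size - self.baseline >= self.threshold

    def record_collection(self, size_after_gc):
        # After a GC, measure future growth from the new, smaller size.
        self.baseline = size_after_gc


MB = 1024 * 1024
trigger = GrowthTrigger(initial_size=0)

before_threshold = trigger.should_collect(100 * MB)
at_threshold = trigger.should_collect(125 * MB)

# Pretend a GC ran and shrank the store to 40MB.
trigger.record_collection(40 * MB)
after_gc_small_growth = trigger.should_collect(100 * MB)
after_gc_large_growth = trigger.should_collect(165 * MB)
```

A trigger like this decides only when to start a collection. It does nothing to pace the GC work against client traffic, which is the source of the performance impact mentioned above.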
When Auto GC is enabled, call dolt_gc() can still be used to initiate
a GC manually. Its behavior will now be the less-disruptive
implementation, where existing in-flight work does not fail. However,
the Auto GC implementation is currently experimental. For now, the
default behavior of call dolt_gc() remains the implementation that
will terminate in-flight work and invalidate the calling connection.
Future Work
In the future, Auto GC will be enabled by default and the disruptive
implementation of call dolt_gc() will be removed. We will also enable
it for certain operations outside of sql-server mode, such as dolt sql
when reading an import file. We hope to continue to optimize the
implementation, such that running it concurrently with user-facing
work will have less impact on the performance of the sql-server. We
will continue to tune the heuristics which are used to decide when we
should start an Auto GC, and if our users need it, we will add tunable
parameters so that they can better control the behavior of the
scheduler and pacer themselves.
Do you have a Dolt database which generates a lot of garbage? Are you interested in garbage collection or excited to try out Auto GC? Drop by our Discord or file a GitHub issue to reach out and start a discussion.