Announcing automatic garbage collection in Dolt sql-server

At DoltHub, we are building Dolt, the world's first version-controlled SQL database. It supports operations like branch, diff, merge, rebase, and blame, as well as interactions with remotes such as clone, push, and fetch. Today, we're happy to announce that the latest version of Dolt supports optional automatic garbage collection of on-disk databases when running in sql-server mode. In this short blog post we'll talk about what the new functionality is, why you might want it, and how to enable it.

Garbage in a Dolt Database

Dolt databases are based on a novel content-addressed sorted index structure we call a Prolly Tree. Prolly Trees are copy-on-write structures, so to perform a write to a Dolt database, Dolt creates new records on disk rather than replacing existing ones. Because Dolt is versioned, it keeps the old versions of the data around. But not all versions of the data on disk remain reachable and relevant to the database over time. If no Dolt commit is ever created which references a particular version of the data, that version can end up unreachable and irrelevant to the contents of the versioned Dolt database while its contents are still stored on disk.
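To make the source of garbage concrete, here is a minimal sketch of a content-addressed, copy-on-write store. It is a toy model rather than Dolt's storage format: the ChunkStore class, the use of SHA-256, and the row encoding are all assumptions made for illustration.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store (hypothetical, not Dolt's real format)."""

    def __init__(self):
        self.chunks = {}  # address -> bytes

    def put(self, data: bytes) -> str:
        # The address is derived from the content, so writing a new value
        # creates a new chunk instead of replacing an existing one.
        addr = hashlib.sha256(data).hexdigest()
        self.chunks[addr] = data
        return addr


store = ChunkStore()

v1 = store.put(b"id=1,name=alice")  # this version gets committed
v2 = store.put(b"id=1,name=bob")    # intermediate write, never committed
v3 = store.put(b"id=1,name=carol")  # this version gets committed

# Only versions referenced by a commit stay reachable.
committed = {v1, v3}
garbage = set(store.chunks) - committed
```

All three versions occupy storage, but only two are referenced by commits. The intermediate version is garbage: it is still on disk, yet nothing in the versioned database points to it.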

Dolt has long supported a CLI command and a server operation which allow this unreferenced data to be cleaned up and removed from storage. When you run dolt gc or, from SQL, call dolt_gc(), the Dolt process walks all data reachable from the roots of the current database, ensures that all of it is retained, and removes anything that is no longer reachable. You can read more about garbage in Dolt and how dolt gc generally works in our previous blog posts about it.
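The walk-and-retain approach described above is the classic mark-and-sweep pattern. The following is a simplified sketch of that pattern over an in-memory graph, not Dolt's actual implementation. The mark_and_sweep function and the chunk names are hypothetical.

```python
def mark_and_sweep(chunks, roots):
    """Keep everything reachable from roots and delete the rest.

    chunks maps a chunk address to the list of addresses it references.
    Returns (reachable, removed) as sets of addresses.
    """
    # Mark: walk the reference graph starting at the roots.
    reachable = set()
    stack = list(roots)
    while stack:
        addr = stack.pop()
        if addr in reachable:
            continue
        reachable.add(addr)
        stack.extend(chunks[addr])

    # Sweep: drop every chunk the walk never visited.
    removed = set(chunks) - reachable
    for addr in removed:
        del chunks[addr]
    return reachable, removed


# Hypothetical graph: two commits sharing a leaf, plus an orphaned tree
# left behind by a write that was never committed.
chunks = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": ["leafA", "leafC"],
    "tree1": ["leafA", "leafB"],
    "orphan_tree": ["leafA", "leafD"],
    "leafA": [], "leafB": [], "leafC": [], "leafD": [],
}

reachable, removed = mark_and_sweep(chunks, roots=["commit2"])
```

Here leafA survives because committed trees still reference it, while orphan_tree and leafD are removed because nothing reachable from the root points to them.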

Up until now, in order to make call dolt_gc() safe, the implementation was disruptive to in-progress reads and writes which were occurring concurrently with the GC operation. Running call dolt_gc() would forcefully disconnect existing SQL connections, interrupting in-progress operations and transactions, and it would leave the connection which ran the GC itself in a permanently invalid state, relying on the client application to close it. This behavior interacted badly with connection pooling and with client applications in general.

Automatic Garbage Collection

Automatic GC is a feature where Dolt will periodically run a GC in the background. In order to make it compelling and usable, we needed to reduce the impact of running a garbage collection concurrently with other work. The new solution is able to run a garbage collection without breaking in-flight transactions or connections and with only short blocking operations for any in-progress reads or writes.

Auto GC is currently only enabled when running dolt sql-server with a config.yaml file. To enable it, add the following configuration fragment to the config file:

behavior:
  auto_gc_behavior:
    enable: true

and make sure to run the sql-server process passing along the file path as the config parameter:

dolt sql-server --config config.yaml

Once it is enabled, as you perform writes against your Dolt database, you should eventually see log messages at info level indicating that GCs are being performed. They will look similar to:

time="2025-02-25T10:31:20-08:00" level=info msg="sqle/auto_gc: Successfully completed auto GC of database auto_gc_test in 144.550125ms"
time="2025-02-25T10:31:22-08:00" level=info msg="sqle/auto_gc: Successfully completed auto GC of database auto_gc_test in 219.788375ms"
time="2025-02-25T10:31:23-08:00" level=info msg="sqle/auto_gc: Successfully completed auto GC of database auto_gc_test in 151.752417ms"
time="2025-02-25T10:31:25-08:00" level=info msg="sqle/auto_gc: Successfully completed auto GC of database auto_gc_test in 302.215625ms"
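If you want to keep an eye on how long these collections take, one option is to extract the durations from the server log. The sketch below assumes the message format shown above stays the same and only handles durations printed in ms or s. The parse_auto_gc_durations helper is hypothetical and not part of Dolt.

```python
import re

# Matches the success message shown above. Assumes the format is stable.
LINE_RE = re.compile(
    r'msg="sqle/auto_gc: Successfully completed auto GC of database (?P<db>\S+) '
    r'in (?P<value>[0-9.]+)(?P<unit>ms|s)"'
)
UNIT_TO_MS = {"ms": 1.0, "s": 1000.0}


def parse_auto_gc_durations(lines):
    """Return a list of (database, duration in milliseconds) pairs."""
    results = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not an auto GC completion message
        ms = float(match.group("value")) * UNIT_TO_MS[match.group("unit")]
        results.append((match.group("db"), ms))
    return results


sample = [
    'time="2025-02-25T10:31:20-08:00" level=info msg="sqle/auto_gc: Successfully completed auto GC of database auto_gc_test in 144.550125ms"',
    'time="2025-02-25T10:31:22-08:00" level=info msg="sqle/auto_gc: Successfully completed auto GC of database auto_gc_test in 219.788375ms"',
    'level=info msg="placeholder for an unrelated log line"',  # hypothetical
]

durations = parse_auto_gc_durations(sample)
slowest_ms = max(ms for _, ms in durations)
```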

The implementation will currently run an automatic GC anytime the database has grown in size by 125MB. The heuristics used to kick off the GC are not currently tunable. There is a memory, CPU and disk I/O overhead associated with running a garbage collection, and the current implementation does not pace the GC work relative to client-initiated work which needs to be done. As a result, it's possible to experience a performance impact from running with Auto GC enabled.
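A growth-based trigger like the one described above can be sketched as follows. This illustrates the general idea only and is not the scheduler Dolt uses. The GrowthTrigger class is hypothetical, and treating 125MB as 125,000,000 bytes is an assumption.

```python
# Assumption: 125MB is interpreted as 125 * 1000 * 1000 bytes.
GROWTH_THRESHOLD_BYTES = 125 * 1000 * 1000


class GrowthTrigger:
    """Signals a GC once the store has grown by a fixed amount (hypothetical)."""

    def __init__(self, size_after_last_gc, threshold=GROWTH_THRESHOLD_BYTES):
        self.baseline = size_after_last_gc
        self.threshold = threshold

    def should_gc(self, current_size):
        # Growth is measured from the size recorded after the previous GC.
        return current_size - self.baseline >= self.threshold

    def gc_finished(self, size_after_gc):
        # Reset the baseline so the next GC waits for fresh growth.
        self.baseline = size_after_gc


trigger = GrowthTrigger(size_after_last_gc=0)
below = trigger.should_gc(124_999_999)   # just under the threshold
at = trigger.should_gc(125_000_000)      # threshold reached

trigger.gc_finished(size_after_gc=40_000_000)
after_reset = trigger.should_gc(100_000_000)  # only 60MB of new growth
```

Measuring growth from the post-GC size, rather than from zero, means a database with a large amount of live data does not trigger a collection on every write.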

When Auto GC is enabled, call dolt_gc() can still be used to initiate a GC manually, and it will use the new, less disruptive implementation, where existing in-flight work does not fail. However, the Auto GC implementation is currently experimental. For now, when Auto GC is not enabled, the default behavior of call dolt_gc() remains the implementation that terminates in-flight work and invalidates the calling connection.

Future Work

In the future, Auto GC will be enabled by default and the disruptive implementation of call dolt_gc() will be removed. We will also enable it for certain operations outside of sql-server mode, such as dolt sql when reading an import file. We hope to continue optimizing the implementation so that running it concurrently with user-facing work has less impact on the performance of the sql-server. We will continue to tune the heuristics which decide when to start an Auto GC, and if our users need it, we will add tunable parameters so that they can better control the behavior of the scheduler and pacer themselves.

Do you have a Dolt database which generates a lot of garbage? Are you interested in garbage collection or excited to try out Auto GC? Drop by our Discord or file a GitHub issue to reach out and start a discussion.
