Doltgres System Tables Update

DOLTGRES
8 min read

We've been hard at work getting Doltgres, a Postgres-flavored version of Dolt, ready for its beta release in Q1. This includes making sure all the version control features you know and love from Dolt also work properly in a Postgres environment. System tables are our SQL solution for Git-like CLI commands that display version control metadata, like logs, diffs, conflicts, etc., and they got a bit of a makeover in Doltgres.

Why our existing system tables didn't work with Doltgres

Doltgres works by emulating a Postgres server, and converting received commands into an AST that is given to an underlying Dolt server. This lets us leverage the functionality the Dolt already provides, including version control features like system tables. However, Dolt is MySQL-flavored, so certain Postgres features, such as schemas, needed a bespoke implementation in Dolt.

There were a few Postgres-specific challenges we needed to address to get Dolt system tables working properly in Doltgres:

  1. Tables belong to schemas
  2. Global system tables need a way to reference tables in different schemas
  3. Differences exist between MySQL and Postgres column types

This blog will walk through how we addressed each of these problems and show an example of how system tables interact with different schemas in Doltgres.

Tables belong to schemas

In Postgres, every table belongs to a schema, which is essentially a collection of database objects like tables, views, indexes, etc. MySQL doesn't have schemas, so to distinguish our Dolt system tables we prefixed all tables with dolt_ and disallowed users from creating their own tables with the dolt_ prefix.

Postgres has its own system tables, which belong to the pg_catalog schema and are prefixed with pg_ (i.e. pg_catalog.pg_class). We decided the Postgres-y thing to do was to follow suit and move our system tables to a dolt schema, removing the dolt_ prefix to reduce redundancy. So dolt_log became dolt.log, dolt_status became dolt.status, etc.

But then as we were working through our system table list, we realized some of our tables need to be able to reference different user schemas. For example, we have some system tables that have a user table in the system table name, like dolt_diff_$tablename, which returns a list of diffs showing how rows in $tablename have changed over time on the current branch. So if I create a table named employees, I can review changed rows using the dolt_diff_employees system table.

But in Postgres, you can have two schemas that both have a table named employees, but look different otherwise. There needed to be a way to reference each of these and get the diff for the correct table.

We came up with two proposed solutions for these kinds of user table specific system tables:

  1. Add a schema_name column to the system table so that you can filter for rows in the correct schema (this made more sense for system tables like dolt_schemas)
  2. Keep the dolt_ prefix for these tables and put them in the user schema rather than the dolt schema

(2) was the more universally desirable solution. So we ended up creating two categories of system tables - global system tables in the dolt schema and user schema specific system tables that are prefixed with dolt_. We could now get diffs for our two different employees tables in different schemas by specifying the schema name:

select * from public.dolt_diff_employees;
select * from other_schema.dolt_diff_employees;

Here's an overview of which system table falls into each category. Our Doltgres documentation was also updated to reflect these categories.

Dolt Schema

Old usage (still usable) New usage
dolt_branches dolt.branches
dolt_column_diff dolt.column_diff
dolt_commit_ancestors dolt.commit_ancestors
dolt_commits dolt.commits
dolt_conflicts dolt.conflicts
dolt_constraint_violations dolt.constraint_violations
dolt_diff dolt.diff
dolt_docs dolt.docs
dolt_log dolt.log
dolt_merge_status dolt.merge_status
dolt_rebase dolt.rebase
dolt_remote_branches dolt.remote_branches
dolt_remotes dolt.remotes
dolt_schema_conflicts dolt.schema_conflicts
dolt_status dolt.status
dolt_tags dolt.tags

User-Specific Schema

Old usage (still usable) New usage
dolt_commit_diff_$tablename public.dolt_commit_diff_$tablename
dolt_conflicts_$tablename public.dolt_conflicts_$tablename
dolt_constraint_violations_$tablename public.dolt_constraint_violations_$tablename
dolt_diff_$tablename public.dolt_diff_$tablename
dolt_history_$tablename public.dolt_history_$tablename
dolt_ignore public.dolt_ignore
dolt_schemas public.dolt_schemas
dolt_statistics public.dolt_statistics
dolt_workspace_$tablename public.dolt_workspace_$tablename

Note that public can be replaced with any schema name that exists in your database.

Global system tables need a way to reference tables in different schemas

This was a simpler change, but worth noting. Some of our system tables in the dolt schema return table names. For example, dolt.status returns a list of tables that have been changed in the working set of the current branch. So if you add a row to table employees on branch main, you'd get this output when checking dolt.status.

postgres=> SELECT * FROM dolt.status;
 table_name  | staged |  status
-------------+--------+-----------
 employees   | f      | modified
(1 row)

But what if like above you have two schemas, both with an employees table with different changes? You need to know which table changed. So we changed the output to look like this:

postgres=> SELECT * FROM dolt.status;
       table_name       | staged |  status
------------------------+--------+-----------
 public.employees       | f      | modified
 other_schema.employees | t      | new table
(2 rows)

Differences exist between MySQL and Postgres column types

Some of the types used by the Dolt system tables don't translate to Postgres. For example, Postgres doesn't have a built-in enum type, which is used by the dolt_constraint_violations_$TABLENAME and dolt_rebase system tables in Dolt. Also, Postgres has a boolean type, but in MySQL boolean is an alias for tinyint, which has different functionality in Postgres.

While the Doltgres system tables mostly worked with the MySQL types, we ran into issues when filtering or updating cells with these column types. So we used pointers to replace the table schemas with the correct types for Doltgres.

System tables and schemas in practice

Here is a more practical example to illustrate why these schema-related changes to system tables were necessary in Doltgres.

First, create two schemas, each with a table named employees. Notice that the tables have different columns.

CREATE SCHEMA sch1;
CREATE SCHEMA sch2;
CREATE TABLE sch1.employees (id int primary key, name text);
CREATE TABLE sch2.employees (id int primary key, first_name text, last_name text);

We want to see our working set changes before committing, so we use dolt.status.

example=> select * from dolt.status;
   table_name   | staged |   status
----------------+--------+------------
 sch1.employees | f      | new table
 sch2.employees | f      | new table
 sch2           | f      | new schema
 sch1           | f      | new schema
(4 rows)

Since the dolt.status table exists in the dolt schema and is a global system table, I can see changes to all of my schemas. I create a commit and can see it in the logs.

example=> select dolt_commit('-Am', 'add tables and schemas');
            dolt_commit
------------------------------------
 {0f4la1vc1b9ttcg0bumnv9kto3f2j95g}
(1 row)

example=> select * from dolt_log;
           commit_hash            | committer |       email        |        date         |          message
----------------------------------+-----------+--------------------+---------------------+----------------------------
 0f4la1vc1b9ttcg0bumnv9kto3f2j95g | postgres  | postgres@127.0.0.1 | 2024-12-06 00:40:22 | add tables and schemas
 72drrn2lhk8bmbkhnebn919ng95imlm9 | postgres  | postgres@127.0.0.1 | 2024-12-06 00:38:42 | CREATE DATABASE
 nc0e3rst26bnb3ha2orbg35o6b4h9tj9 | tbantle   | taylor@dolthub.com | 2024-12-06 00:38:42 | Initialize data repository
(3 rows)

Now I use branches to create isolated changes to each table.

example=> select dolt_checkout('-b', 'change1');
           dolt_checkout
------------------------------------
 {0,"Switched to branch 'change1'"}
(1 row)

example=> insert into sch2.employees values (1, 'Tim', 'Sehn');
INSERT 0 1
example=> select dolt_commit('-am', 'add employee');
            dolt_commit
------------------------------------
 {9br37rvf24elvolccgju8st6b4k70303}
(1 row)

example=> select dolt_checkout('main');
          dolt_checkout
---------------------------------
 {0,"Switched to branch 'main'"}
(1 row)

example=> select dolt_checkout('-b', 'change2');
           dolt_checkout
------------------------------------
 {0,"Switched to branch 'change2'"}
(1 row)

example=> insert into sch1.employees values (1, 'Tim Sehn');
INSERT 0 1
example=> select dolt_commit('-am', 'add employee');
            dolt_commit
------------------------------------
 {grf7lddqcstsm402khnmh9q9uhkmm8un}
(1 row)

Another employee connects to the database and looks at my branches to review my changes.

example=> select name, hash, latest_committer, latest_commit_message from dolt.branches;
  name   |               hash               | latest_committer | latest_commit_message
---------+----------------------------------+------------------+------------------------
 change1 | 9br37rvf24elvolccgju8st6b4k70303 | postgres         | add employee
 change2 | grf7lddqcstsm402khnmh9q9uhkmm8un | postgres         | add employee
 main    | 0f4la1vc1b9ttcg0bumnv9kto3f2j95g | postgres         | add tables and schemas
(3 rows)

Which employees table was updated on each branch? The commit messages I wrote are not clear. The dolt.diff table will let us know.

example=> select * from dolt.diff as of 'change1' limit 1;
           commit_hash            |   table_name   | committer |       email        |          date           |   message    | data_change | schema_change
----------------------------------+----------------+-----------+--------------------+-------------------------+--------------+-------------+---------------
 9br37rvf24elvolccgju8st6b4k70303 | sch2.employees | postgres  | postgres@127.0.0.1 | 2024-12-05 16:42:48.416 | add employee | t           | f
(1 row)

example=> select * from dolt.diff as of 'change2' limit 1;
           commit_hash            |   table_name   | committer |       email        |          date           |   message    | data_change | schema_change
----------------------------------+----------------+-----------+--------------------+-------------------------+--------------+-------------+---------------
 grf7lddqcstsm402khnmh9q9uhkmm8un | sch1.employees | postgres  | postgres@127.0.0.1 | 2024-12-05 16:43:48.061 | add employee | t           | f
(1 row)

Now that I know which table was changed on each branch, I can use the dolt_diff_$tablename table to see how the rows changed on between main and each branch. We must specify the schema for this table to ensure we're looking at the correct employees table.

example=> select to_id, to_first_name, to_last_name, to_commit, diff_type from sch2.dolt_diff_employees where to_commit=hashof('change1') and from_commit=dolt_merge_base('main', 'change1');
 to_id | to_first_name | to_last_name |            to_commit             | diff_type
-------+---------------+--------------+----------------------------------+-----------
     1 | Tim           | Sehn         | 9br37rvf24elvolccgju8st6b4k70303 | added
(1 row)

example=> select to_id, to_name, to_commit, diff_type from sch1.dolt_diff_employees where to_commit=hashof('change2') and from_commit=dolt_merge_base('main', 'change2');
 to_id | to_name  |            to_commit             | diff_type
-------+----------+----------------------------------+-----------
     1 | Tim Sehn | grf7lddqcstsm402khnmh9q9uhkmm8un | added
(1 row)

Before any of the changes we made above, it would have been impossible to distinguish between the two employees tables when using system tables.

Conclusion

Check out our new and improved system tables in Doltgres version >=0.15.0. We're very excited about Doltgres and have multiple employees dedicated to working on it full time. Version control can add a lot of value to your Postgres database, and it's important we get all these features you know and love in Dolt working correctly in Doltgres.

Have any questions or Doltgres features you want us to prioritize? Stop by our Discord and speak someone on the DoltHub team.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.