Doltgres System Tables Update
We've been hard at work getting Doltgres, a Postgres-flavored version of Dolt, ready for its beta release in Q1. This includes making sure all the version control features you know and love from Dolt also work properly in a Postgres environment. System tables are our SQL solution for Git-like CLI commands that display version control metadata, like logs, diffs, conflicts, etc., and they got a bit of a makeover in Doltgres.
Why our existing system tables didn't work with Doltgres
Doltgres works by emulating a Postgres server, and converting received commands into an AST that is given to an underlying Dolt server. This lets us leverage the functionality the Dolt already provides, including version control features like system tables. However, Dolt is MySQL-flavored, so certain Postgres features, such as schemas, needed a bespoke implementation in Dolt.
There were a few Postgres-specific challenges we needed to address to get Dolt system tables working properly in Doltgres:
- Tables belong to schemas
- Global system tables need a way to reference tables in different schemas
- Differences exist between MySQL and Postgres column types
This blog will walk through how we addressed each of these problems and show an example of how system tables interact with different schemas in Doltgres.
Tables belong to schemas
In Postgres, every table belongs to a schema, which is essentially a collection of database objects like tables, views, indexes, etc. MySQL doesn't have schemas, so to distinguish our Dolt system tables we prefixed all tables with dolt_
and disallowed users from creating their own tables with the dolt_
prefix.
Postgres has its own system tables, which belong to the pg_catalog
schema and are prefixed with pg_
(i.e. pg_catalog.pg_class
). We decided the Postgres-y thing to do was to follow suit and move our system tables to a dolt
schema, removing the dolt_
prefix to reduce redundancy. So dolt_log
became dolt.log
, dolt_status
became dolt.status
, etc.
But then as we were working through our system table list, we realized some of our tables need to be able to reference different user schemas. For example, we have some system tables that have a user table in the system table name, like dolt_diff_$tablename
, which returns a list of diffs showing how rows in $tablename
have changed over time on the current branch. So if I create a table named employees
, I can review changed rows using the dolt_diff_employees
system table.
But in Postgres, you can have two schemas that both have a table named employees
, but look different otherwise. There needed to be a way to reference each of these and get the diff for the correct table.
We came up with two proposed solutions for these kinds of user table specific system tables:
- Add a
schema_name
column to the system table so that you can filter for rows in the correct schema (this made more sense for system tables likedolt_schemas
) - Keep the
dolt_
prefix for these tables and put them in the user schema rather than thedolt
schema
(2) was the more universally desirable solution. So we ended up creating two categories of system tables - global system tables in the dolt
schema and user schema specific system tables that are prefixed with dolt_
. We could now get diffs for our two different employees
tables in different schemas by specifying the schema name:
select * from public.dolt_diff_employees;
select * from other_schema.dolt_diff_employees;
Here's an overview of which system table falls into each category. Our Doltgres documentation was also updated to reflect these categories.
Dolt Schema
Old usage (still usable) | New usage |
---|---|
dolt_branches |
dolt.branches |
dolt_column_diff |
dolt.column_diff |
dolt_commit_ancestors |
dolt.commit_ancestors |
dolt_commits |
dolt.commits |
dolt_conflicts |
dolt.conflicts |
dolt_constraint_violations |
dolt.constraint_violations |
dolt_diff |
dolt.diff |
dolt_docs |
dolt.docs |
dolt_log |
dolt.log |
dolt_merge_status |
dolt.merge_status |
dolt_rebase |
dolt.rebase |
dolt_remote_branches |
dolt.remote_branches |
dolt_remotes |
dolt.remotes |
dolt_schema_conflicts |
dolt.schema_conflicts |
dolt_status |
dolt.status |
dolt_tags |
dolt.tags |
User-Specific Schema
Old usage (still usable) | New usage |
---|---|
dolt_commit_diff_$tablename |
public.dolt_commit_diff_$tablename |
dolt_conflicts_$tablename |
public.dolt_conflicts_$tablename |
dolt_constraint_violations_$tablename |
public.dolt_constraint_violations_$tablename |
dolt_diff_$tablename |
public.dolt_diff_$tablename |
dolt_history_$tablename |
public.dolt_history_$tablename |
dolt_ignore |
public.dolt_ignore |
dolt_schemas |
public.dolt_schemas |
dolt_statistics |
public.dolt_statistics |
dolt_workspace_$tablename |
public.dolt_workspace_$tablename |
Note that
public
can be replaced with any schema name that exists in your database.
Global system tables need a way to reference tables in different schemas
This was a simpler change, but worth noting. Some of our system tables in the dolt
schema return table names. For example, dolt.status
returns a list of tables that have been changed in the working set of the current branch. So if you add a row to table employees
on branch main
, you'd get this output when checking dolt.status
.
postgres=> SELECT * FROM dolt.status;
table_name | staged | status
-------------+--------+-----------
employees | f | modified
(1 row)
But what if like above you have two schemas, both with an employees
table with different changes? You need to know which table changed. So we changed the output to look like this:
postgres=> SELECT * FROM dolt.status;
table_name | staged | status
------------------------+--------+-----------
public.employees | f | modified
other_schema.employees | t | new table
(2 rows)
Differences exist between MySQL and Postgres column types
Some of the types used by the Dolt system tables don't translate to Postgres. For example, Postgres doesn't have a built-in enum
type, which is used by the dolt_constraint_violations_$TABLENAME
and dolt_rebase
system tables in Dolt. Also, Postgres has a boolean
type, but in MySQL boolean
is an alias for tinyint
, which has different functionality in Postgres.
While the Doltgres system tables mostly worked with the MySQL types, we ran into issues when filtering or updating cells with these column types. So we used pointers to replace the table schemas with the correct types for Doltgres.
System tables and schemas in practice
Here is a more practical example to illustrate why these schema-related changes to system tables were necessary in Doltgres.
First, create two schemas, each with a table named employees
. Notice that the tables have different columns.
CREATE SCHEMA sch1;
CREATE SCHEMA sch2;
CREATE TABLE sch1.employees (id int primary key, name text);
CREATE TABLE sch2.employees (id int primary key, first_name text, last_name text);
We want to see our working set changes before committing, so we use dolt.status
.
example=> select * from dolt.status;
table_name | staged | status
----------------+--------+------------
sch1.employees | f | new table
sch2.employees | f | new table
sch2 | f | new schema
sch1 | f | new schema
(4 rows)
Since the dolt.status
table exists in the dolt
schema and is a global system table, I can see changes to all of my schemas. I create a commit and can see it in the logs.
example=> select dolt_commit('-Am', 'add tables and schemas');
dolt_commit
------------------------------------
{0f4la1vc1b9ttcg0bumnv9kto3f2j95g}
(1 row)
example=> select * from dolt_log;
commit_hash | committer | email | date | message
----------------------------------+-----------+--------------------+---------------------+----------------------------
0f4la1vc1b9ttcg0bumnv9kto3f2j95g | postgres | postgres@127.0.0.1 | 2024-12-06 00:40:22 | add tables and schemas
72drrn2lhk8bmbkhnebn919ng95imlm9 | postgres | postgres@127.0.0.1 | 2024-12-06 00:38:42 | CREATE DATABASE
nc0e3rst26bnb3ha2orbg35o6b4h9tj9 | tbantle | taylor@dolthub.com | 2024-12-06 00:38:42 | Initialize data repository
(3 rows)
Now I use branches to create isolated changes to each table.
example=> select dolt_checkout('-b', 'change1');
dolt_checkout
------------------------------------
{0,"Switched to branch 'change1'"}
(1 row)
example=> insert into sch2.employees values (1, 'Tim', 'Sehn');
INSERT 0 1
example=> select dolt_commit('-am', 'add employee');
dolt_commit
------------------------------------
{9br37rvf24elvolccgju8st6b4k70303}
(1 row)
example=> select dolt_checkout('main');
dolt_checkout
---------------------------------
{0,"Switched to branch 'main'"}
(1 row)
example=> select dolt_checkout('-b', 'change2');
dolt_checkout
------------------------------------
{0,"Switched to branch 'change2'"}
(1 row)
example=> insert into sch1.employees values (1, 'Tim Sehn');
INSERT 0 1
example=> select dolt_commit('-am', 'add employee');
dolt_commit
------------------------------------
{grf7lddqcstsm402khnmh9q9uhkmm8un}
(1 row)
Another employee connects to the database and looks at my branches to review my changes.
example=> select name, hash, latest_committer, latest_commit_message from dolt.branches;
name | hash | latest_committer | latest_commit_message
---------+----------------------------------+------------------+------------------------
change1 | 9br37rvf24elvolccgju8st6b4k70303 | postgres | add employee
change2 | grf7lddqcstsm402khnmh9q9uhkmm8un | postgres | add employee
main | 0f4la1vc1b9ttcg0bumnv9kto3f2j95g | postgres | add tables and schemas
(3 rows)
Which employees
table was updated on each branch? The commit messages I wrote are not clear. The dolt.diff
table will let us know.
example=> select * from dolt.diff as of 'change1' limit 1;
commit_hash | table_name | committer | email | date | message | data_change | schema_change
----------------------------------+----------------+-----------+--------------------+-------------------------+--------------+-------------+---------------
9br37rvf24elvolccgju8st6b4k70303 | sch2.employees | postgres | postgres@127.0.0.1 | 2024-12-05 16:42:48.416 | add employee | t | f
(1 row)
example=> select * from dolt.diff as of 'change2' limit 1;
commit_hash | table_name | committer | email | date | message | data_change | schema_change
----------------------------------+----------------+-----------+--------------------+-------------------------+--------------+-------------+---------------
grf7lddqcstsm402khnmh9q9uhkmm8un | sch1.employees | postgres | postgres@127.0.0.1 | 2024-12-05 16:43:48.061 | add employee | t | f
(1 row)
Now that I know which table was changed on each branch, I can use the dolt_diff_$tablename
table to see how the rows changed on between main
and each branch. We must specify the schema for this table to ensure we're looking at the correct employees
table.
example=> select to_id, to_first_name, to_last_name, to_commit, diff_type from sch2.dolt_diff_employees where to_commit=hashof('change1') and from_commit=dolt_merge_base('main', 'change1');
to_id | to_first_name | to_last_name | to_commit | diff_type
-------+---------------+--------------+----------------------------------+-----------
1 | Tim | Sehn | 9br37rvf24elvolccgju8st6b4k70303 | added
(1 row)
example=> select to_id, to_name, to_commit, diff_type from sch1.dolt_diff_employees where to_commit=hashof('change2') and from_commit=dolt_merge_base('main', 'change2');
to_id | to_name | to_commit | diff_type
-------+----------+----------------------------------+-----------
1 | Tim Sehn | grf7lddqcstsm402khnmh9q9uhkmm8un | added
(1 row)
Before any of the changes we made above, it would have been impossible to distinguish between the two employees
tables when using system tables.
Conclusion
Check out our new and improved system tables in Doltgres version >=0.15.0. We're very excited about Doltgres and have multiple employees dedicated to working on it full time. Version control can add a lot of value to your Postgres database, and it's important we get all these features you know and love in Dolt working correctly in Doltgres.
Have any questions or Doltgres features you want us to prioritize? Stop by our Discord and speak someone on the DoltHub team.