Doltgres Data Imports
Doltgres has been improving by leaps and bounds lately, and in case you missed it... we just hit our beta milestone last week! 🎉 Dolt is the world's first version-controlled SQL database engine, supporting the MySQL dialect, and Doltgres brings those same version-controlled database features with a Postgres dialect and compatibility with Postgres tooling and libraries.
As we discussed in our last blog post on importing data dumps to Doltgres, one of our main areas of focus right now is getting dumps from Postgres databases to load cleanly into Doltgres. When we started analyzing Postgres dumps, we were pretty far from getting any dumps to load cleanly into Doltgres. Since then, we've been methodically cataloging gaps, implementing new features, and supporting more syntax to get dumps to load cleanly. We're happy to report that we're continuing to make progress on our sample dumps and now have five more Postgres dumps loading cleanly into Doltgres. In this blog post, we're going to review some of the features we've added to Doltgres since our last update and how those have helped us get more coverage on cleanly importing Postgres dumps into Doltgres.
Recent Improvements
Adaptive Encoding
One of the features I've been excited about is Doltgres' support for "adaptive encoding" with TEXT data, also known in Postgres as TOAST (The Oversized-Attribute Storage Technique). This feature allows TEXT values to be stored "inline" in the row that logically contains the value, or "out-of-band" in a separate storage area. Our query engine, go-mysql-server, was originally developed to be compliant with MySQL, which always stores these (usually) larger values out of band. Because of the extra latency required to perform an additional lookup per-row for each of these out-of-band values, TEXT columns are not valid for use in primary keys, secondary indexes, or foreign keys in MySQL. Postgres doesn't have these restrictions, and the TOAST optimization allows Postgres to often avoid that extra latency for the additional lookup in a secondary storage area. Long story short, this feature allows us to support TEXT columns being used in primary keys, secondary indexes, and foreign keys, and still have good performance.
It's a common practice in Postgres schemas to use TEXT columns (or VARCHAR columns without a length defined, which is essentially equivalent) in primary keys, secondary indexes, and foreign key relationships, so enabling this support helped us cleanly import several additional Postgres dumps in our sample set. If you're curious to dig deeper into the details of how adaptive encoding lays out your bits, check out Nick's blog introducing adaptive encoding.
Here's an example of some SQL that Doltgres is now able to support thanks to the new adaptive encoding feature:
CREATE TABLE example_text_table (
id TEXT PRIMARY KEY,
description TEXT
);
CREATE INDEX idx_example_description ON example_text_table(description);
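Adaptive encoding also lets TEXT columns participate in foreign key relationships, which MySQL's rules disallow. Here's a small sketch (the table and column names are hypothetical) of the kind of schema that now loads cleanly:

```sql
-- A TEXT primary key referenced by a TEXT foreign key column;
-- MySQL forbids this, but Doltgres supports it via adaptive encoding
CREATE TABLE authors (
    name TEXT PRIMARY KEY
);
CREATE TABLE posts (
    id INT PRIMARY KEY,
    author_name TEXT REFERENCES authors(name)
);
```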
Foreign Key Support
Foreign keys are a foundational feature of relational databases, so we knew we needed to prioritize any foreign key issues we identified in testing our sample dumps. We found and fixed many small gaps with our foreign key implementation when testing our sample dumps.
SET DEFAULT Referential Action
Foreign key referential actions control how rows are updated when changes are made to the rows they reference. More concretely, if we have a table named child that has a foreign key reference on one of its columns, pointing to a column in a table named parent, a referential action controls how the row in child should be modified if the referenced row in parent is updated or deleted. Because of our existing query engine support for MySQL's behavior, we had almost all of Postgres' referential actions supported already (e.g. SET NULL, CASCADE, RESTRICT), but lacked support for SET DEFAULT, since it's not supported in MySQL. Let's take a quick look at a concrete example of how this referential action is used:
CREATE TABLE stories (title text NOT NULL, website_url text NOT NULL DEFAULT 'http://www.dolthub.com');
CREATE TABLE websites (url text primary key, title text);
INSERT into websites VALUES ('http://www.dolthub.com/blog', 'DoltHub Blog');
INSERT into stories VALUES ('Doltgres Goes Beta!', 'http://www.dolthub.com/blog');
ALTER TABLE ONLY stories
ADD CONSTRAINT stories_website_url_fkey FOREIGN KEY (website_url) REFERENCES websites(url)
ON UPDATE CASCADE ON DELETE SET DEFAULT;
If we select from the stories table, we'll see the story we just inserted:
SELECT * FROM stories;
title | website_url
---------------------+-----------------------------
Doltgres Goes Beta! | http://www.dolthub.com/blog
(1 row)
Now, if we delete the websites row that the stories row references, the SET DEFAULT referential action will kick in and set the website_url column in the stories row to the column's default value of http://www.dolthub.com:
DELETE FROM websites WHERE title = 'DoltHub Blog';
DELETE 1
SELECT * FROM stories;
title | website_url
---------------------+------------------------
Doltgres Goes Beta! | http://www.dolthub.com
(1 row)
Foreign Key Naming
The patterns for default naming of foreign key constraints differ between MySQL and Postgres. In MySQL, there is no guarantee that generated constraint names will be the same across versions or MySQL implementations, but in MySQL 8.4 (currently the LTS version) foreign key names follow a pattern like <table>_ibfk_<N>, where <table> is the name of the child table in the foreign key relationship and <N> is an incrementing integer, starting at 1, that deduplicates the generated foreign key names. In Postgres, the format for a generated foreign key constraint name is <table>_<column>_fkey, where <table> is the name of the child table in the foreign key relationship, and <column> is the column name from the child table. In our testing, we found that some dumps referenced these generated foreign key constraint names, and when Doltgres wasn't following the same naming pattern as Postgres, it prevented the dumps from loading cleanly into Doltgres. This work also included support for handling collisions with existing constraint names, again following Postgres' standard naming rules.
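For example, adding a foreign key without an explicit constraint name now produces the Postgres-style generated name. This is a sketch with hypothetical table names:

```sql
CREATE TABLE parent (id INT PRIMARY KEY);
CREATE TABLE child (id INT PRIMARY KEY, parent_id INT);
-- No constraint name is given, so one is generated following
-- Postgres' pattern: <table>_<column>_fkey
ALTER TABLE child ADD FOREIGN KEY (parent_id) REFERENCES parent(id);
```

The generated constraint here is named child_parent_id_fkey, so a dump that later runs ALTER TABLE child DROP CONSTRAINT child_parent_id_fkey; will find the constraint it expects.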
IF EXISTS Support for Dropping Constraints
Postgres allows the IF EXISTS clause to be used in more places than MySQL does, such as when dropping a constraint from a table using an ALTER TABLE statement. MySQL allows IF EXISTS to conditionally execute the statement based on whether the table exists, but Postgres takes this a step further and also allows IF EXISTS to conditionally drop a constraint based on whether the named constraint exists. For example, the following statement will succeed in Postgres, even if the doesnotexist constraint doesn't exist:
ALTER TABLE t1 DROP CONSTRAINT IF EXISTS doesnotexist;
Inline Foreign Key Definition
While testing how well Doltgres could load sample dumps, we found that Doltgres wasn't properly handling inline foreign key constraints in a column definition. For example, the statements below create a parent and child table for the foreign key relationship, then use an ALTER TABLE statement to add a column along with an inline definition of a foreign key constraint:
CREATE TABLE parent (
id INT PRIMARY KEY,
name TEXT
);
CREATE TABLE child (
id INT PRIMARY KEY
);
ALTER TABLE child ADD COLUMN parent_id INT REFERENCES parent(id);
And to test that the constraint is working correctly, we can try to insert a row into child with an ID that doesn't exist in parent and see that we get an error message:
INSERT INTO child VALUES (1, 123);
ERROR: cannot add or update a child row - Foreign key violation on fk: `child_parent_id_fkey`, table: `child`, referenced table: `parent`, key: `[123]` (errno 1452) (sqlstate HY000)
Miscellaneous Quickies
Lots of gaps we find are quick fixes that take just a couple of hours to implement. They may not be major features, but they're important for customers who need them to get their dump to load cleanly. This section reviews a few notable quickies we've implemented for Doltgres in the past month:
Support for analyzing all tables – In Postgres, the ANALYZE; statement, without any tables specified, will analyze all tables. It's common to run ANALYZE; as the last statement at the end of a dump, to force the query engine to collect statistics on tables in order to optimize join planning for queries. Our query engine already supported analyzing tables to collect statistics for query optimization, but it required that the table names were specified by the user, so it was a small but important step to support analyzing all tables. Doltgres customers can now run ANALYZE; to collect statistics for all their tables.
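As a quick sketch, both forms now work (example_text_table here stands in for any table in your database):

```sql
-- Analyze every table in the database to refresh planner statistics;
-- no table names are required
ANALYZE;
-- Analyzing a single named table still works as before
ANALYZE example_text_table;
```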
Advisory Locks – Postgres supports advisory locks, which allow users to create application-level locks. Our query engine already supported user-level locking, so wiring it up to the pg_try_advisory_lock(), pg_advisory_lock(), and pg_advisory_unlock() functions was another quickie, and enough to unblock Postgres SQL statements generated by Rails. There is a caveat here though... Postgres advisory locks are reentrant, meaning that if a session already holds the lock, it can acquire the lock again without blocking. This is not how user-level locks work in MySQL, so we'll be following up to augment the locking framework in our query engine to support lock reentrancy for Doltgres. If that's something you need, let us know in a GitHub issue, and we'll be more than happy to prioritize it for you.
For this example, we need two separate SQL sessions, so that we can show one session holding a lock that the second session is unable to grab until it is released:
-- SESSION 1:
-- grab lock 42 and sleep for 60s to simulate some work being done
-- note that pg_advisory_lock() will block until it is able to acquire the lock
SELECT pg_advisory_lock(42);
SELECT 'Acquired lock 42 in session 1';
SELECT sleep(60);
-- Release the lock
SELECT 'Releasing lock 42 in session 1';
SELECT pg_advisory_unlock(42);
Now, in a second session, while the first session is sleeping, we run the following SQL:
-- SESSION 2:
SELECT pg_advisory_lock(42);
SELECT 'Acquired lock 42 in session 2';
-- Release the lock
SELECT 'Releasing lock 42 in session 2';
SELECT pg_advisory_unlock(42);
When we run these SQL statements in two separate sessions, we see that session 2 blocks while session 1 is sleeping for 60s, because session 1 has already grabbed lock 42. As soon as session 1 releases that lock, session 2 is able to acquire it and continue executing.
default_with_oids Configuration Parameter – this is a legacy Postgres configuration parameter that, when enabled, caused tables created without an explicit WITHOUT OIDS clause to automatically include an OID (object identifier) column. OIDs were once used as unique row identifiers, but Postgres removed support for them in user tables as of version 12 due to limited utility and potential performance issues. Doltgres doesn't actually support including OID columns, but it will allow the default_with_oids configuration parameter to be set to false in order to support loading dumps that reference this parameter.
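In practice, a dump generated from an older Postgres installation may contain a line like the following, which Doltgres now accepts:

```sql
-- Legacy parameter; Doltgres accepts setting it to false so that old
-- dumps load cleanly, even though it doesn't support OID columns
SET default_with_oids = false;
```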
What's Next?
We're making swift progress on Postgres compatibility, but there is still a lot more Postgres syntax, and many more features, that we need to implement in Doltgres before we can get all of our sample dumps loading cleanly. Here are some of the features we've identified in our sample dumps that we still need to implement:
- Triggers – support for database triggers is well underway and will be launching very soon. Triggers are commonly used in database schemas and are a powerful feature for performing custom logic and validation whenever data is changed. Now that we have basic support for user-defined procedures, we'll enable those procedures to be called when rows are inserted, updated, or deleted from a table.
- TSVECTOR support – last year, we added support for vectors and vector indexes to DoltDB. Vectors store multi-dimensional floating-point arrays and support approximate nearest neighbor searches. When data is converted into vectors that capture its meaning (a process called embedding), a database can quickly find other data with similar meaning. A small number of our sample testing dumps use vector features, and eventually we'd like to add vector support to Doltgres. We'll likely focus on more widely-used features first, but if this is one that's important for you, feel free to let us know by commenting or upvoting the tracking issue in our backlog.
- AGGREGATE support – Postgres' AGGREGATE support allows users to define custom aggregate functions, enabling advanced data processing beyond built-in aggregates like SUM or AVG. A user-defined aggregate combines a state transition function and a final function, processing input rows step-by-step to produce a single result.
- Collation support – collations allow users to control how data is sorted and compared. On the surface it sounds like a simple concept, but the implementation turned out to be surprisingly tricky for DoltDB, since data is compared and sorted in many places in a relational database. We'll add support for collations to Doltgres, but we'll likely be looking for customer prioritization feedback to help us make sure we're implementing it at the right time.
Conclusion
We're excited for the features above, but we're honestly even more excited to build what our customers tell us they need to get their Postgres dumps to load into Doltgres. It really motivates us to know that we're working on something a customer needs and will be using. So, if you have a Postgres dump and hit any problems loading it into Doltgres, please don't hesitate to let us know! 🙏 You can send us an issue on GitHub, or you can join our Discord server and chat directly with our team, as well as with other customers.