Dolt Now Supports Check Constraints
Dolt is an SQL database with Git versioning. We have come a long way since initially committing to 100% MySQL compatibility, and today we introduce our latest step in that journey: check constraints.
What Are Check Constraints?
Check constraints are a data validation mechanism for table rows introduced formally to MySQL in version 8.0.16.
Like triggers, check constraints execute logic on row inserts and updates. And like foreign keys, checks verify the correctness of rows and can short-circuit persistence if a constraint fails.
Check constraints are different and useful because they contain arbitrary expression logic, whereas foreign keys verify only presence in another table.
Simple Examples
Positive Values
Consider a data source of blood pressure readings. We can use check constraints to limit the input range of values:
$ dolt sql -q "
CREATE TABLE bp (
patient_id int,
systole int,
diastole int,
CHECK (systole > 0),
CHECK (diastole > 0)
)"
Adding normal values succeeds:
$ dolt sql -q "INSERT INTO bp VALUES (1, 120, 80)"
Query OK, 1 row affected
Negative values trigger a check constraint:
$ dolt sql -q "INSERT INTO bp VALUES (2, -120, 80)"
Check constraint "chk_62j1d75m" violated
Only valid rows remain after these two inserts:
$ dolt sql -q "SELECT * from bp"
+------------+---------+----------+
| patient_id | systole | diastole |
+------------+---------+----------+
| 1 | 120 | 80 |
+------------+---------+----------+
Categorization
Checks are equally useful in scenarios with a distinct set of categories.
Consider a table recording the day of week a patient's blood pressure was most-recently taken (we will intentionally initialize the table without checks):
$ dolt sql -q "
CREATE TABLE bp_day (
patient_id int,
day int
)"
Our analytics team assumes day
is a categorical
feature. They use array indexing to convert between numeric and string
representations in a data pipeline:
num_to_day = ["Sun, "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"]
A physician trying to record "Sunday" with day=7
will create a
bug for our data team:
$ dolt sql -q "INSERT INTO bp_day VALUES (1, 7)"
Query OK, 1 row affected
num_to_day[7]
will IndexOutOfRangeException
and crash our code.
Instead of running a batch job to backfill our database, or changing the processing code to check for data inconsistencies, we can alter our table with a check constraint:
$ dolt sql -q "DELETE FROM bp_day WHERE patient_id = 1"
Query OK, 1 row affected
$ dolt sql -q "ALTER TABLE bp_day ADD CHECK (day in (0, 1, 2, 3, 4, 5, 6))"
$ dolt sql -q "INSERT INTO bp_day VALUES (1, 7)"
Check constraint "chk_ko21oh4v" violated
Our database can only store valid rows now, and our workflows will be more reliable in the future.
Limitations
Most simple expressions are valid check constraints, with two notable limitations:
- Checks can only reference columns in the table they are defined in:
$ dolt sql -q "
CREATE TABLE bp_week (
patient_id int,
week int,
bp.systole < 200
)"
table "bp" does not have column "systole"
- Checks cannot include subexpressions or SQL functions:
$ dolt sql -q "ALTER TABLE bp ADD CHECK (AVG(systole) < 150)"
Invalid constraint expression, functions not supported: AVG(bp.systole)
There are a host of implementation details in the MySQL docs. Multiple ways to declare constraints, options for ignoring errors, and other comments on how to use checks are included. .
Implementation Challenges
Adding check constraints required changes to Dolt's data storage layer, SQL query engine and SQL parser.
One interesting technical detail involves bridging the persistence layer, where constraints are serialized, and the query engine, where constraints are executed.
The first implementation converted all check constraints to the ALTER TABLE
format used above in the category example. The alter
string was stored in next to the associated table in Dolt, and
unwound when the query engine needed to validate rows.
An improved implementation saves only the expression string. Here we embed and parse the expression from a select statement (code simplified):
func convertCheckDefToConstraint(ctx *sql.Context, check *sql.CheckDefinition) (*sql.CheckConstraint, error) {
parseStr := fmt.Sprintf("select %s", check.CheckExpression)
parsed, err := sqlparser.Parse(parseStr)
selectStmt, ok := parsed.(*sqlparser.Select)
expr := selectStmt.SelectExprs[0]
c, err := parse.ExprToExpression(ctx, expr)
return &sql.CheckConstraint{
Name: check.Name,
Expr: c,
Enforced: check.Enforced,
}, nil
}
Versioning Concerns
Dolt versions data in addition to being an SQL database. The complexity of combining these two paradigms often complicates the story of 1-to-1 porting MySQL features, and check constraints are no exception.
First, dolt merge
does not execute check constraints for every row.
Second, dolt verify-constraints
does not yet invoke check
constraints.
Future Work
Alter check statements but do not currently
update information_schema
and SHOW CREATE TABLE
.
This affects constraint visibility in multiple ways,
including dolt diff
not showing which version of a check statement
is active. dolt merge
also chooses the most recent between
two constraints with the same name. When schema metadata is updated
to reflect check constraints, checks will behave the same way
as other table schemas fields in versioning commands.
Fixes are currently in progress, and the current behavior does not
negatively affect constraint checks themselves.
Summary
It took MySQL 15 years, and now Dolt too supports check constraints! It took us about one month to bring this customer feature request into production.
Check constraints will expand Dolt's ability to natively validate data, an area of growing interest in the data engineering community.
If you are interested in learning more about Dolt reach out to us on our discord! We would love to hear about your experiences with data versioning.