Getting to one 9 of SQL correctness in Dolt
A few months ago we finally settled on a good way to measure the correctness of Dolt's SQL engine: the sqllogictest package, first developed for SQLite and since used as a benchmark for lots of other database implementations. SQLite hit upon the idea of generating a large number of random tests by creating templated queries, then running those tests against MySQL or another "already correct" database implementation to get the expected results. This means that sqllogictest has far, far more queries than any other SQL correctness benchmark (over 5.9 million queries that run on MySQL, and 6.7 million queries total).
When we first ran Dolt against this benchmark two months ago, we had only a 23.9% pass rate, which is not as bad as it sounds but is still pretty bad. Today we are proud to announce an 89.9% pass rate, which we are rounding up to "one 9 of correctness" for the purposes of this blog post. Getting there was quite a challenge, and we are excited to share our journey.
Schema checking in sqllogictest
The biggest issue with sqllogictest is that its support for databases other than SQLite is effectively abandonware: the ODBC driver has been broken for years. What's worse, we found that MySQL 8.0 was no longer fully compatible with the test queries. When run with our golang test driver, it only produced an 89% pass rate. Something was obviously wrong, but we didn't know what, and it took some digging to find out.
The first issue was that our test runner compared schemas more strictly than the original sqllogictest harness. Here's the original sqllogictest code for checking that the schema of a result set is correct. Notice anything?
/* Verify that the type string consists of one or more characters
** from the set "TIR". */
for(k=0; (c = sScript.azToken[1][k])!=0; k++){
if( c!='T' && c!='I' && c!='R' ){
fprintf(stderr, "%s:%d: unknown type character '%c' in type string\n",
zScriptFile, sScript.startLine, c);
nErr++;
break;
}
}
Astute readers will note that this code doesn't compare the result schema to the expected one, just enforces that it contains the letters "T", "I", and "R". Because of this, it's common for tests in sqllogictest to play very fast and loose with their expected result set schemas: floating point results are often declared as type "I" (integer). This problem alone broke a very large portion of tests on MySQL.
We played around a bit with trying to fix this problem in isolation, e.g. by adopting more flexible rules for type checking in the test runner, but this only helped a little. Eventually we decided that the fact that MySQL itself was under 90% accurate on a benchmark designed to be MySQL compliant was a major problem in itself, so we bit the bullet and regenerated all the test files to conform to the results produced by MySQL 8.0. Note that this change is so large that GitHub won't even tell you how many lines is it. It's a lot of lines.
A total of 906 of the tests still don't work correctly on MySQL, because the MySQL syntax has changed since anyone bothered to run these tests against MySQL. Specifically, all the tests involving triggers are broken, as well as some tests of view functionality. This gives us a total pass rate of 99.98% for MySQL, or 3 nines of correctness. You can explore these results yourself in our DoltHub repository.
% dolt clone dolthub/dolt-sqllogictest-results
% dolt sql -q "select result, count(*) from mysql_results group by 1;"
+---------+----------+
| result | COUNT(*) |
+---------+----------+
| skipped | 1487748 |
| ok | 5931639 |
| not ok | 906 |
+---------+----------+
Because we have 1) a test runner that will talk to MySQL out of the box, and 2) test files that conform to what the modern version of MySQL actually produces, we now consider ourselves to have the most accurate version of sqllogictest on GitHub.
What is the type of SELECT - - -1?
Even after fixing the test results to conform to modern MySQL, we
still found that Dolt had
trouble with the types of some expressions. For example, in MySQL the
query SELECT - - -9
returns a result set with type DECIMAL
(arbitrary precision floating point). Why does it do this? We sure
don't know, and it seemed like a poor use of time to try to reverse
engineer these bizarre type semantics and recreate them in our SQL
engine. Instead, we
wrote some fuzzy logic to convert integer results to floating
point
when MySQL expected the latter. This bought us another few percentage
points of passing tests.
What does the ALL
keyword do?
One of the interesting consequences of generating random queries from
templates to get better coverage is that you end up with pretty
bizarre constructs in some queries, including things you would
probably never write yourself. For example, do you know what the ALL
keyword does, as in SELECT ALL foo FROM bar?
We didn't either, but
it turns out that it's the opposite of the DISTINCT
keyword, and can
be used anywhere DISTINCT
can. Ditto the unary plus
operator,
which doesn't make a negative number positive. These constructs are
used all over the place in sqllogictest's randomly generated queries,
and unfortunately our parser didn't support them. Adding support to
the
parser
and the
engine
gave us a surprisingly big boost to our pass rate, in the neighborhood
of 30%.
Is 2.0 in (NULL, NULL)
?
SQL has interesting semantics for NULL
values. They're not equal or
comparable to anything, even themselves. It's pretty important to get
right, and the sqllogictest package does a great job testing this
exhaustively with its randomly generated expressions. Our engine
needed to be taught not to
panic when
encountering some of these expressions. Fixing this gave us another
few percentage increase in correctness.
The final 40%, and a little cheating
After getting all the above fixes in place, we were able to achieve
almost 89.95% correctness against the built-in in-memory tables of our
SQL engine, which we were pretty happy with. But the
Dolt implementation still
lagged behind, at just over 50% correctness. Why? Because a lot of the
random tests in sqllogictest create tables without primary keys, which
aren't currently supported in
Dolt. But we had a hunch that
these tests didn't actually rely on there being duplicate values of
any particular row -- they weren't actually testing the semantics of
tables with no keys. So we built in a small hack to the test harness:
rewrite CREATE TABLE
statements without a
key
to include all columns in the table's primary key, and therefore make
Dolt accept them. When we did
this, we discovered that our hunch was correct, and the keyless
semantics aren't important to the random sqllogictest files. This
brought Dolt up to parity with the in-memory engine tables, at just
under one 9 of correctness.
By the way, tables without primary keys is on our roadmap. If you need this feature, please let us know and we'll prioritize the work!
Getting to two 9's, and what's next
We have a lot of work left to get to the next order of magnitude of
correctness, most of which is already on our roadmap. We need to
support UNIQUE
constraints besides the primary key, which is almost
more important for performance than it is for correctness. And we are
already hard at work implementing support for VIEW
s. We think these
two changes together should push us past two 9's of correctness. At
that point, the remaining failing tests should actually be the result
of bugs in query execution, as opposed to unimplemented
functionality. From there, it will be a long slog to find and hammer
out the remaining bugs to get us that third 9, bringing us even with
MySQL and other real databases.
Now that we have a good baseline for correctness of query execution, we will begin focusing more concretely on performance, and we'll be able to more confidently state that optimizations for speed don't result in regressions on correctness. Secondary indexes and join performance are first on our docket.
It's an exciting time to be the world's first SQL database with git branch / merge semantics. Try it out yourself today!