Getting to one 9 of SQL correctness in Dolt

December 17, 2019

5 min read

A few months ago we finally settled on a good way to measure the correctness of Dolt's SQL engine: the sqllogictest package, first developed for SQLite and since used as a benchmark for lots of other database implementations. SQLite hit upon the idea of generating a large number of random tests by creating templated queries, then running those tests against MySQL or another "already correct" database implementation to get the expected results. This means that sqllogictest has far, far more queries than any other SQL correctness benchmark (over 5.9 million queries that run on MySQL, and 6.7 million queries total).

When we first ran Dolt against this benchmark two months ago, we had only a 23.9% pass rate, which is not as bad as it sounds but is still pretty bad. Today we are proud to announce an 89.9% pass rate, which we are rounding up to "one 9 of correctness" for the purposes of this blog post. Getting there was quite a challenge, and we are excited to share our journey.

Schema checking in sqllogictest

The biggest issue with sqllogictest is that its support for databases other than SQLite is effectively abandonware: the ODBC driver has been broken for years. What's worse, we found that MySQL 8.0 was no longer fully compatible with the test queries. When run with our golang test driver, it only produced an 89% pass rate. Something was obviously wrong, but we didn't know what, and it took some digging to find out.

The first issue was that our test runner compared schemas more strictly than the original sqllogictest harness. Here's the original sqllogictest code for checking that the schema of a result set is correct. Notice anything?

/* Verify that the type string consists of one or more characters
** from the set "TIR". */
for(k=0; (c = sScript.azToken[1][k])!=0; k++){
  if( c!='T' && c!='I' && c!='R' ){
    fprintf(stderr, "%s:%d: unknown type character '%c' in type string\n",
        zScriptFile, sScript.startLine, c);
    nErr++;
    break;
  }
}

Astute readers will note that this code doesn't compare the result schema to the expected one, just enforces that it contains the letters "T", "I", and "R". Because of this, it's common for tests in sqllogictest to play very fast and loose with their expected result set schemas: floating point results are often declared as type "I" (integer). This problem alone broke a very large portion of tests on MySQL.

We played around a bit with trying to fix this problem in isolation, e.g. by adopting more flexible rules for type checking in the test runner, but this only helped a little. Eventually we decided that the fact that MySQL itself was under 90% accurate on a benchmark designed to be MySQL compliant was a major problem in itself, so we bit the bullet and regenerated all the test files to conform to the results produced by MySQL 8.0. Note that this change is so large that GitHub won't even tell you how many lines is it. It's a lot of lines.

A total of 906 of the tests still don't work correctly on MySQL, because the MySQL syntax has changed since anyone bothered to run these tests against MySQL. Specifically, all the tests involving triggers are broken, as well as some tests of view functionality. This gives us a total pass rate of 99.98% for MySQL, or 3 nines of correctness. You can explore these results yourself in our DoltHub repository.

% dolt clone dolthub/dolt-sqllogictest-results
% dolt sql -q "select result, count(*) from mysql_results group by 1;"
+---------+----------+
| result  | COUNT(*) |
+---------+----------+
| skipped | 1487748  |
| ok      | 5931639  |
| not ok  | 906      |
+---------+----------+

Because we have 1) a test runner that will talk to MySQL out of the box, and 2) test files that conform to what the modern version of MySQL actually produces, we now consider ourselves to have the most accurate version of sqllogictest on GitHub.

What is the type of `SELECT - - -1?`

Even after fixing the test results to conform to modern MySQL, we still found that Dolt had trouble with the types of some expressions. For example, in MySQL the query SELECT - - -9 returns a result set with type DECIMAL (arbitrary precision floating point). Why does it do this? We sure don't know, and it seemed like a poor use of time to try to reverse engineer these bizarre type semantics and recreate them in our SQL engine. Instead, we wrote some fuzzy logic to convert integer results to floating point when MySQL expected the latter. This bought us another few percentage points of passing tests.

What does the `ALL` keyword do?

One of the interesting consequences of generating random queries from templates to get better coverage is that you end up with pretty bizarre constructs in some queries, including things you would probably never write yourself. For example, do you know what the ALL keyword does, as in SELECT ALL foo FROM bar? We didn't either, but it turns out that it's the opposite of the DISTINCT keyword, and can be used anywhere DISTINCT can. Ditto the unary plus operator, which doesn't make a negative number positive. These constructs are used all over the place in sqllogictest's randomly generated queries, and unfortunately our parser didn't support them. Adding support to the parser and the engine gave us a surprisingly big boost to our pass rate, in the neighborhood of 30%.

Is `2.0 in (NULL, NULL)`?

SQL has interesting semantics for NULL values. They're not equal or comparable to anything, even themselves. It's pretty important to get right, and the sqllogictest package does a great job testing this exhaustively with its randomly generated expressions. Our engine needed to be taught not to panic when encountering some of these expressions. Fixing this gave us another few percentage increase in correctness.

The final 40%, and a little cheating

After getting all the above fixes in place, we were able to achieve almost 89.95% correctness against the built-in in-memory tables of our SQL engine, which we were pretty happy with. But the Dolt implementation still lagged behind, at just over 50% correctness. Why? Because a lot of the random tests in sqllogictest create tables without primary keys, which aren't currently supported in Dolt. But we had a hunch that these tests didn't actually rely on there being duplicate values of any particular row -- they weren't actually testing the semantics of tables with no keys. So we built in a small hack to the test harness: rewrite CREATE TABLE statements without a key to include all columns in the table's primary key, and therefore make Dolt accept them. When we did this, we discovered that our hunch was correct, and the keyless semantics aren't important to the random sqllogictest files. This brought Dolt up to parity with the in-memory engine tables, at just under one 9 of correctness.

By the way, tables without primary keys is on our roadmap. If you need this feature, please let us know and we'll prioritize the work!

Getting to two 9's, and what's next

We have a lot of work left to get to the next order of magnitude of correctness, most of which is already on our roadmap. We need to support UNIQUE constraints besides the primary key, which is almost more important for performance than it is for correctness. And we are already hard at work implementing support for VIEWs. We think these two changes together should push us past two 9's of correctness. At that point, the remaining failing tests should actually be the result of bugs in query execution, as opposed to unimplemented functionality. From there, it will be a long slog to find and hammer out the remaining bugs to get us that third 9, bringing us even with MySQL and other real databases.

Now that we have a good baseline for correctness of query execution, we will begin focusing more concretely on performance, and we'll be able to more confidently state that optimizations for speed don't result in regressions on correctness. Secondary indexes and join performance are first on our docket.

It's an exciting time to be the world's first SQL database with git branch / merge semantics. Try it out yourself today!

Blog

Schema checking in sqllogictest

What is the type of SELECT - - -1?

What does the ALL keyword do?

Is 2.0 in (NULL, NULL)?

The final 40%, and a little cheating

Getting to two 9's, and what's next

Get started with Dolt

What is the type of `SELECT - - -1?`

What does the `ALL` keyword do?

Is `2.0 in (NULL, NULL)`?