We take user issues very seriously at DoltHub. We have a pledge to fix bugs in under 24 hours. This pledge is possible because a version-controlled database inherently lends itself to easy reproducibility. And once we can reproduce an issue and attach a debugger, most issues are easy to fix.
But sometimes the issue runs deeper than it seems, and what looked like a short diversion can become an entire journey.
This is the story of one of those times.
The Slow Query#
Back in September, a customer came to use with an innocuous performance issue: their SQL query was taking several minutes to run on Dolt while the same query ran in under a second on MySQL. We immediately clocked this as an issue with Dolt’s execution planner: most likely we were doing a full table scan instead of an index. Dolt has great tooling for understanding execution plans, so we said we’d take a look.
They showed us the query: a straightforward join of five tables. It looked like this:
SELECT
COUNT(DISTINCT t1.id1) AS id1_count
FROM table_one AS t1
WHERE EXISTS (
SELECT 1 FROM table_two AS t2 WHERE t2.id1 = t1.id1 AND EXISTS (
SELECT 1 FROM table_three AS t3 WHERE t3.id2 = t2.id2 AND EXISTS (
SELECT 1 FROM table_four AS t4 WHERE t4.id3 = t3.id3 AND EXISTS (
SELECT 1 FROM table_five AS t5 WHERE t5.id4 = t4.id4 AND LOWER(t5.name) LIKE '%foo%'))));"
This is a pretty conventional query. While it contains several correlated subqueries (subqueries that reference their outer scopes), it does so in a very standard way. Any SQL engine worth its salt would transform this into a single tree of table joins. Provided that each table has an index that matches the filter expressions, each join will get implemented as a table lookup.
A join of many tables can be intimidating because of the potential for an exponential explosion in runtime. But for the vast majority of simple queries, even a join of many tables can get optimized to require only a single table scan, regardless of how many tables are being joined.
So given all that, we were surprised that Dolt wasn’t optimizing this query, especially if MySQL was. Dolt is on average faster than MySQL, and when it comes to multi-table joins, Dolt is typically better than MySQL at identifying optimal execution plans.
Whatever the cause, we figured that dolt debug would quickly reveal it and that it would be a simple fix we could knock out in less than a day.
The fact that we’re writing this article now and not back in September should tell you how wrong we were. We had no idea the scope of the rabbit hole we were about to stumble into.
As soon as we began our investigation, we learned that innocuous-looking query was hiding something just underneath the surface. The five tables being joined weren’t actually tables but views. Views are predefined expressions that can be treated like tables in queries and are only executed when the query that references them is executed. And each of these views was itself a multi-table join with nineteen tables each.
So in total, the query wasn’t joining five tables but ninety-five tables.
I won’t reproduce the view defintions here because they’re not that interesting to look at, just a giant soup of INNER JOINs and LEFT JOINs wrapped in a SELECT that extracted and renamed a dozen columns. I’m sure you can picture it.
When it comes to joining nearly a hundred tables, there’s a lot more than can go wrong:
-
Joining tables is a binary operation: when joining more than two tables, the engine needs to join them a pair at a time. The most efficient order to join tables is not always the same order that they appear in the query, so a good engine should reorder the tables to achieve the best results.
-
However, not every reordering will produce the same results. If some of the joins are outer joins (typically identified via the
LEFT JOINorFULL OUTER JOINsyntax, which were indeed present in the view definitions), then reordering the tables may change the output of the query. A good engine should not pick an ordering that produce incorrect results. -
Additionally, finding the optimal join order is an NP-hard problem. This means that while you can use heuristics to find the best order for specific cases, solving the general case will always scale exponentially with the number of tables, and essentially requires trying every possible combination.
The fact that MySQL could execute this query quickly meant that it was using a heuristic to find the best join plan, or at least a join plan that was good enough. But Dolt was failing to do the same.
Let’s Focus on Indexes#
There’s a lot to dig into regarding how Dolt determines join ordering. But for now, we’ll focus on a single important but easily-understood concept: a plan that uses indexes is typically better than a plan that doesn’t use indexes.
This means that in order to pick the best join order, Dolt needs to identify which indexes will actually speed up the query, and which join orders will leverage those indexes. The logic for this is straightforward in concept, but implementation details vary based on the exact nature of the query. SQL has a lot of language features for writing increasingly sophisticated queries, which various implications that Dolt needs to understand in order to optimize queries correctly.
During the investigation, we realized that we looking at not one but several issues in the join planner, situations where because of some rarely used SQL feature, Dolt was failing to correctly identify when an index would improve a query, or when a specific join would allow an index to be used.
In some ways, this meant the query was an excellent stress test for Dolt’s analyzer. While it appears simple on the surface, it actually makes use of a number of different less-common SQL language features, such as:
- Table aliases
- Column aliases
- Outer joins
- Correlated subqueries
- Views
It uses all of these features together, with multiple levels of nesting. And the join it’s attempting to optimize is large enough that failing to pick the correct plan results in a noticeable performance degradation.
If there was a query out there that would expose bugs or limitations in the execution planner, it would be this one. Investigating this query helped us uncover and fix many issues in our SQL engine.
Slaying the Hydra#
I started calling this query “The Hydra”. In Greek mythology, the hydra was a beast with many heads, and cutting off one head would cause it to grow two more. Our hydra was a join with many tables, and fixing one blocker would reveal two more in its place.
Each time we wrote a fix, we created a minimal reproduction test and verified that Dolt was now generating an optimal plan for that test case. But each time, the original query resisted our attempts to tame it. Let’s look at some of the issues that we fixed in our quest to slay the beast. This is just the issues that had to do with index selection. There’s even more work that we did beyond this, but we’ll save that for a future blog post.
PR 3380: Generate index lookups for filters that appear in lateral joins#
SQL is a declarative language: queries don’t tell the engine what to do, only what the expected output should be. This means that there’s no way to tell a SQL engine to use an index. Instead, the engine needs to be able to recognize when an index can be used.
In the simplest case, if a filter restricts a column to a constant value, the engine can use an index on that column to get only the matching rows:
-- This can be optimized into an index lookup if |name| is indexed.
DESCRIBE PLAN SELECT * FROM test_table WHERE name = "Tim";
But if the value isn’t constant, an index might not help:
-- Fetch all rows where name is in all caps
-- An index won't help here
DESCRIBE PLAN SELECT * FROM test_table WHERE name = UPPER(name);
Previously, when identifying indexes, Dolt would only use filters that compared a column to a constant, and would ignore filters that compared a constant to a non-constant value. But this was overly cautious, because in the case of subqueries, some non-constant values can still be used in index lookups.
This is because when a query contains subqueries, the subquery may get evaluated multiple times. If the subquery contains references to columns or expressions in the outer query, those values might not be constant for the full duration of the main query. But as long as the value stays the same within each subquery evaluation, it’s okay if it changes between subquery evaluations.
Lateral joins are a language feature that allows expressions on one side of a join to reference columns from the other half of the join. Previously, Dolt would not use filters to guide index selection if the filter appeared in a lateral join subquery but contained references to the outer query:
-- Dolt v1.75.0
DESCRIBE PLAN SELECT * FROM t1 JOIN LATERAL (SELECT * FROM t2 WHERE t1.id = t2.id) query_alias;
+--------------------------------+
| plan |
+--------------------------------+
| LateralCrossJoin |
| ├─ Table |
| │ └─ name: t1 |
| └─ SubqueryAlias |
| ├─ name: query_alias |
| └─ Filter |
| ├─ (t1.id = t2.id) |
| └─ Table |
| ├─ name: t2 |
| └─ columns: [id] |
+--------------------------------+
Dolt now correctly treats these references as effectively constant and optimizes them:
-- Dolt v1.82.6
DESCRIBE PLAN SELECT * FROM t1 JOIN LATERAL (SELECT * FROM t2 WHERE t1.id = t2.id) query_alias;
+--------------------------------+
| plan |
+--------------------------------+
| LateralCrossJoin |
| ├─ Table |
| │ └─ name: t1 |
| └─ SubqueryAlias |
| ├─ name: query_alias |
| └─ IndexedTableAccess(t2) |
| ├─ index: [t2.id] |
| ├─ columns: [id] |
| └─ keys: t1.id |
+--------------------------------+
Users don’t often write lateral joins, but the engine will transform certain subquery expressions into lateral joins, so optimizing lateral joins helps us optimize those subqueries too.
PR 3386: Push filters that contain references to outer scopes.#
The above example showed a query where a filter appears immediately next to the table it references. But sometimes, the filter expression and the table are separated by a join, a view, or a subquery.
In situations like these, we want to rewrite the query to move the filter closer to the table. Often this allows Dolt to use an index it otherwise couldn’t. Even if we don’t end up being able to use an index, applying the filter deeper in the plan means that we reduce the number of intermediate rows by eliminating non-matching rows early.
Just like the above example, Dolt wasn’t pushing filters if they appeared in subqueries and referenced the outer query, because it couldn’t determine that this was safe to do. This resulted in plans with filters that were evaluated later than they could have been, inflating the size of intermediate result sets:
-- Dolt v1.75.0
DESCRIBE PLAN SELECT * FROM t1 JOIN LATERAL
(SELECT * FROM t2 JOIN LATERAL
(SELECT * FROM t3) AS t3_alias
WHERE t3_alias.id = t1.id) AS t2_alias;
+------------------------------------+
| plan |
+------------------------------------+
| LateralCrossJoin |
| ├─ Table |
| │ └─ name: t1 |
| └─ SubqueryAlias |
| ├─ name: t2_alias |
| └─ LateralCrossJoin |
| ├─ (t3_alias.id = t1.id) |
| ├─ Table |
| │ ├─ name: t2 |
| │ └─ columns: [id] |
| └─ TableAlias(t3_alias) |
| └─ Table |
| ├─ name: t3 |
| └─ columns: [id] |
+------------------------------------+
Dolt now correctly rewrites these queries to move the filter directly above the correct subquery table, which can then get transformed into an index lookup:
-- Dolt v1.82.6
DESCRIBE PLAN SELECT * FROM t1 JOIN LATERAL
(SELECT * FROM t2 JOIN LATERAL
(SELECT * FROM t3) AS t3_alias
WHERE t3_alias.id = t1.id) AS t2_alias;
+----------------------------------------+
| plan |
+----------------------------------------+
| LateralCrossJoin |
| ├─ Table |
| │ └─ name: t1 |
| └─ SubqueryAlias |
| ├─ name: t2_alias |
| └─ LateralCrossJoin |
| ├─ Table |
| │ ├─ name: t2 |
| │ └─ columns: [id] |
| └─ TableAlias(t3_alias) |
| └─ IndexedTableAccess(t3) |
| ├─ index: [t3.id] |
| ├─ columns: [id] |
| └─ keys: t1.id |
+----------------------------------------+
PR 3379: Allow introspection into Views#
Views are essentially templates for subqueries. The following two queries should be equivalent, but in older versions of Dolt they produced different plans:
-- Dolt v1.75.0
-- Query without view
DESCRIBE PLAN SELECT * FROM (SELECT * FROM example_table) query_alias WHERE col = 5;
+---------------------------------------+
| plan |
+---------------------------------------+
| TableAlias(query_alias) |
| └─ IndexedTableAccess(example_table) |
| ├─ index: [example_table.col] |
| ├─ filters: [{[5, 5]}] |
| └─ columns: [col] |
+---------------------------------------+
-- Query with view
CREATE VIEW query_alias AS SELECT col FROM example_table;
DESCRIBE PLAN SELECT * FROM query_alias WHERE col = 5;
+---------------------------------+
| plan |
+---------------------------------+
| Filter |
| ├─ (query_alias.col = 5) |
| └─ SubqueryAlias |
| ├─ name: query_alias |
| └─ Table |
| ├─ name: example_table |
| └─ columns: [col] |
+---------------------------------+
When parsing queries, Dolt creates data structures that help it match references in the outer query to the underlying tables in the inner query. However, as a result of the way we were parsing and evaluating views, these data structures were getting discarded, and the analyzer was forced to treat views as opaque objects. This meant that we weren’t able to match filters outside of the view to tables inside the view.
Now, both queries use an index:
-- Dolt v1.82.6
-- Query with view
CREATE VIEW query_alias AS SELECT * FROM example_table;
DESCRIBE PLAN SELECT * FROM query_alias WHERE col = 5;
+---------------------------------------+
| plan |
+---------------------------------------+
| SubqueryAlias |
| ├─ name: query_alias |
| └─ IndexedTableAccess(example_table) |
| ├─ index: [example_table.col] |
| ├─ filters: [{[5, 5]}] |
| └─ columns: [col] |
+---------------------------------------+
PR 3383: When applying indexes from outer scopes, resolve references to table aliases#
Sometimes the filter uses a different name for the table than where the table is used in the query. But Dolt wasn’t always considering table aliases when trying to match indexes to their tables. We added an additional analysis step to consider table aliases when attempting to match a filter in an outer scope to a table in an inner scope:
-- Dolt v1.75.0
DESCRIBE PLAN SELECT * FROM t1 AS t1_alias JOIN LATERAL (SELECT * FROM t2 WHERE t1_alias.id = t2.id) AS inner_query;
+-----------------------------------+
| plan |
+-----------------------------------+
| LateralCrossJoin |
| ├─ TableAlias(t1_alias) |
| │ └─ Table |
| │ └─ name: t1 |
| └─ SubqueryAlias |
| ├─ name: inner_query |
| ├─ outerVisibility: false |
| ├─ isLateral: true |
| ├─ cacheable: false |
| └─ Filter |
| ├─ (t1_alias.id = t2.id) |
| └─ Table |
| ├─ name: t2 |
| └─ columns: [id] |
+-----------------------------------+
-- Dolt v1.82.6
DESCRIBE PLAN SELECT * FROM t1 AS t1_alias JOIN LATERAL (SELECT * FROM t2 WHERE t1_alias.id = t2.id) AS inner_query;
+------------------------------------+
| plan |
+------------------------------------+
| LateralCrossJoin |
| ├─ TableAlias(t1_alias) |
| │ └─ Table |
| │ └─ name: t1 |
| └─ SubqueryAlias |
| ├─ name: inner_query |
| └─ Filter |
| ├─ (t1_alias.id = t2.id) |
| └─ IndexedTableAccess(t2) |
| ├─ index: [t2.id] |
| ├─ columns: [id] |
| └─ keys: t1_alias.id |
+------------------------------------+
PR 3400: Push Filters inside Projections#
Other times, a subquery renames an expression from the inner SELECT. When Dolt encountered a filter on an aliased expression, it wasn’t unwrapping the alias to see if it referred to an indexed column. This meant that we were missing opportunities to push filters indexes on those tables.
-- Dolt v1.75.0
DESCRIBE PLAN SELECT * FROM
(SELECT pk AS pk_alias FROM example_table) AS example_alias
WHERE pk_alias = 1;
+-----------------------------------------------------+
| plan |
+-----------------------------------------------------+
| SubqueryAlias |
| ├─ name: example_alias |
| └─ Filter |
| ├─ (pk_alias = 1) |
| └─ Project |
| ├─ columns: [example_table.pk as pk_alias] |
| └─ Table |
| ├─ name: example_table |
| └─ columns: [pk] |
+-----------------------------------------------------+
Now, Dolt can move the filter into the subquery by rewriting the filter, replacing the alias name with the original expression. This allows Dolt to identify indexes when it couldn’t before:
-- Dolt v1.82.6
DESCRIBE PLAN SELECT * FROM
(SELECT pk AS pk_alias FROM example_table) AS example_alias
WHERE pk_alias = 1;
+-------------------------------------------------+
| plan |
+-------------------------------------------------+
| SubqueryAlias |
| ├─ name: example_alias |
| └─ Project |
| ├─ columns: [example_table.pk as pk_alias] |
| └─ IndexedTableAccess(example_table) |
| ├─ index: [example_table.pk] |
| ├─ filters: [{[1, 1]}] |
| └─ columns: [pk] |
+-------------------------------------------------+
And More!#
Some of the other fixes are complicated enough to warrant their own blog posts. We’ll save those for another day.
Until then, I hope this provided some valuable insight into how SQL engines work to optimize queries, and how Dolt in specific does join planning. If you want to learn more about how a version-controlled database can help manage your data, you can always join our Discord and drop us a line.