Memoizing Joins
Dolt is a relational database with Git versioning primitives. Dolt versions data where Git versions code. Most of our users are interested in transactional workloads, but we also have customers with heavier analytical queries. Joins have been the bottleneck for success here, and we have undergone two join search rewrites in response to increasing customer demand. Zach wrote the first version of join planning almost two years ago, Aaron followed up with ordering improvements and hints in 2021, and today we discuss the third iteration.
The new join search:

- Doubles our join table count capacity.
- Enumerates associative, commutative, right-deep, and bushy join plans to more aggressively search for an optimum.
- Uses costing to globally optimize plans, as opposed to heuristics that only target pre-engineered workflows.
- Adds `HashJoin`, `SemiJoin`, `AntiJoin`, and `FullJoin` operators.
We spend much of this blog talking about the data structure used to absorb the complexity, a memo, which categorizes and groups join trees based on common inputs and outputs. Grouping plans by logical properties simplifies the way we represent joins, and is a path towards making go-mysql-server's analyzer faster and more efficient in the future.
Background
A "relational operator" is a fancy word for a simple SQL query with:
- One source of input rows.
- Internal logic for manipulating or filtering those rows.
- An output "schema" that captures the shape of result rows.
For example, consider a table `xy` with columns `x` and `y`. The query `SELECT * from xy where x > 0` has one input relation, `xy`, and we expect output rows of shape `(x, y)`. Most queries follow this pattern, even aggregations like `SELECT sum(x) from xy where x > 0` that pack more logic between input and output rows.
Joins are also relational operators, but they have an arbitrary number of input sources! On top of the table count, we have to consider varying types of filter conditions, join types (ex: LEFT_JOIN, SEMI_JOIN), and competing physical operators (ex: NESTED_LOOP_JOIN vs HASH_JOIN). Finding the lowest "cost" join tree bottom-up sounds simple, but balloons into an `O(n!)` problem. In practice, we make tradeoffs to find a reasonable plan quickly.
Our latest improvement to Dolt's join search is an expansion of our "intermediate representation" of join trees. Like a regular relational operator, we assume equivalent join orders output the same rows and columns, not just for the root node, but for every possible join subtree. These subtrees are divided into "expression groups", which represent a collection of equivalent plans from the outside looking in. We call the collection of expression groups a "memo".
This all sounds really complicated, and it is. The concepts are best explored through example, so without further ado we will dive into one to show how we use this data structure to divide join search into exploration and costing phases.
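To make the structure concrete, here is a minimal sketch of what a memo could look like in Go. The names and fields are hypothetical, for illustration only, and not go-mysql-server's actual types:

```go
package memo

// relExpr is one concrete way to produce a group's output: either a table
// scan or a join whose children are other groups, referenced by id.
type relExpr struct {
	op          string // "tableScan", "innerJoin", "lookupJoin", "hashJoin", ...
	table       string // table name, only set for table scans
	left, right int    // child group ids, only set for joins
}

// exprGroup collects every plan that returns the same rows and columns.
type exprGroup struct {
	id    int
	plans []relExpr // logically equivalent alternatives, e.g. (innerJoin 1 2), (innerJoin 2 1)
	best  relExpr   // lowest-cost plan, filled in during costing
	cost  float64
}

// memo groups join subtrees by their logical output; the root group
// produces the final query result.
type memo struct {
	groups []*exprGroup
	root   int
}
```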
Building a memo
The first step of join planning enumerates a series of simple tree rearrangements. Consider this example query:
select * from ab
inner join uv on a = u
inner join xy on x = u;
There are three tables: `ab`, `uv`, and `xy`; and two join "edges": `a = u` and `x = u`. The edge join types are important, but INNER_JOIN is not subject to special rearrangement rules, so we will ignore that detail for now.
Let's convert the original join plan to a memo:
memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (innerJoin 1 2)
├── G4: (tableScan: xy)
└── G5: (innerJoin 3 4)
Our three tables have their own memo groups: `G1`, `G2`, and `G4`. Two groups represent our join edges: `G3` represents `[ab,uv]`, and `G5` is a group for `[ab,uv,xy]`. `G5` is the root of our join tree, joining all three tables and producing the final query output.
In the original plan, `G3` only contains `(innerJoin 1 2)`: an inner join between `G1` and `G2`, or more simply, `INNER_JOIN(ab, uv)`. Similarly, there is only one plan in `G5`, the inner join between `G3` (`[ab,uv]`) and `G4` (`xy`). Joins always default to "left deep" trees: the right-relation terminates in a tablescan, and the left-relation expands to accommodate the rest of the tree.
The magic happens when we start populating join groups with more ways of executing those relations. For example, we can switch the join order of the seed plan in `G3` because `INNER_JOIN(uv, ab)` is valid. We note this in the memo:
memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (innerJoin 1 2) (innerJoin 2 1)
├── G4: (tableScan: xy)
└── G5: (innerJoin 3 4)
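In code, this exploration step boils down to appending alternative plans to an existing group. A hedged sketch, reusing the hypothetical `exprGroup` and `relExpr` types from above (not go-mysql-server's actual rule):

```go
// exploreCommute adds the commuted alternative for every inner join already
// in a group. Illustrative only; real exploration also deduplicates plans
// and applies many more transformation rules.
func exploreCommute(g *exprGroup) {
	existing := g.plans // snapshot so newly added plans are not re-commuted
	for _, p := range existing {
		if p.op != "innerJoin" {
			continue
		}
		g.plans = append(g.plans, relExpr{op: "innerJoin", left: p.right, right: p.left})
	}
}
```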
We can also add new expression groups. For example, the edge `INNER_JOIN(?, ?, 'x = u')` can be used to join `xy` and `uv`, even though it joined `xy` and `[ab,uv]` in the original plan. In memo-speak, the group `G6: (innerJoin 4 2)` is valid. Not a valid root, but a valid subtree of the join:
memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (innerJoin 2 1) (innerJoin 1 2)
├── G4: (tableScan: xy)
├── G5: (innerJoin 3 4)
└── G6: (innerJoin 4 2)
We can connect our new group `G6` to the tree root (the complete result set) with a join to `ab` (`G1`): `(innerJoin 6 1)`:
memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (innerJoin 2 1) (innerJoin 1 2)
├── G4: (tableScan: xy)
├── G5: (innerJoin 3 4) (innerJoin 6 1)
└── G6: (innerJoin 4 2)
`G5` now has two distinct paths for the full result set: `[ab,uv] x xy` and `[xy,uv] x ab`.
Exhausting the search reveals the full "core enumeration space", a maximally expanded set of join groups¹, commutations², and associations³:
memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (innerJoin 2 1) (innerJoin 1 2) (innerJoin 1 2)
├── G4: (tableScan: xy)
├── G5: (innerJoin 4 3) (innerJoin 3 4) (innerJoin 6 1) (innerJoin 1 6) (innerJoin 7 2) (innerJoin 2 7)
├── G6: (innerJoin 4 2) (innerJoin 2 4)
└── G7: (innerJoin 4 1) (innerJoin 1 4)
We simplified and omitted many details for this example⁴. Refer to the paper here for a more thorough specification of the core search space.
Exploring Physical Operators
Join exploration continues by opportunistically adding physical (execution) operators to each plan.
Nested loop joins
NESTED_LOOP_JOIN is the default execution operator. Small table joins are best executed as two "for loops" comparing all rows between each relation. The downside is an `O(n^2)` runtime; every row in the left relation scans the entire right relation. But who cares about n-squared if the tables have two rows? The overhead of constructing a better execution strategy will overwhelm the benefit.
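For intuition, a nested loop join over in-memory rows is nothing more than two loops; a rough sketch (not Dolt's execution code):

```go
// nestedLoopJoin compares every left row against every right row and keeps
// the pairs that satisfy the join condition: O(n*m) comparisons.
func nestedLoopJoin(left, right [][]any, cond func(l, r []any) bool) [][]any {
	var out [][]any
	for _, l := range left {
		for _, r := range right { // every left row re-scans the entire right relation
			if cond(l, r) {
				joined := append(append([]any{}, l...), r...)
				out = append(out, joined)
			}
		}
	}
	return out
}
```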
Lookup (indexed) joins
A LOOKUP_JOIN is the preferred operator for queries that return a small set of rows using a table index.
We apply LOOKUP_JOINs when 1) the right-side of a join is a tablescan, and 2) the join condition aligns with an index on that right-hand relation. In simpler words, a lookup join is best when there is an index to instantly satisfy the join filter.
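Sketched in Go, with hypothetical inputs (a plan operator name, the scanned table, the join column, and a map of each table's indexes), the check looks roughly like this:

```go
// lookupJoinApplies mirrors the two conditions above: the right side must be
// a plain table scan, and an index on that table must lead with the join
// column. Illustrative only, not go-mysql-server's index matching logic.
func lookupJoinApplies(rightOp, rightTable, joinCol string, indexes map[string][][]string) bool {
	if rightOp != "tableScan" {
		return false // 1) right side is not a simple tablescan
	}
	for _, idx := range indexes[rightTable] {
		if len(idx) > 0 && idx[0] == joinCol {
			return true // 2) index prefix satisfies the join filter
		}
	}
	return false
}
```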
For example, `G3` provides LOOKUP_JOIN alternatives for every plan from the core search space. `G3` joins `ab` and `uv`, so both commutations have a tablescan on the right-relation. And both `ab` and `uv` have indexes aligning with the join condition, `a = u` (`(a)` and `(u)`, respectively).
memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (lookupJoin 1 2) (lookupJoin 2 1) (innerJoin 2 1) (innerJoin 1 2)
├── G4: (tableScan: xy)
├── G5: (lookupJoin 3 4) (lookupJoin 7 1) (lookupJoin 6 2) (innerJoin 4 3) (innerJoin 3 4) (innerJoin 6 2) (innerJoin 2 6) (innerJoin 7 1) (innerJoin 1 7)
├── G6: (lookupJoin 1 4) (lookupJoin 4 1) (innerJoin 4 1) (innerJoin 1 4)
└── G7: (lookupJoin 2 4) (lookupJoin 4 2) (innerJoin 4 2) (innerJoin 2 4)
Hash joins
We use hash joins when one relation is small enough to materialize in memory, letting us use the materialized index like a lookup join. The main difference between LOOKUP_JOIN and HASH_JOIN is that we have to build the in-memory index for every query. HASH_JOIN can also hash nested join trees, not just table scans.
With a HASH_JOIN we only read each relation from disk once. HASH_JOIN is asymptotically faster than NESTED_LOOP_JOIN, but the overhead of building the hash map can offset that benefit. For example, a HASH_JOIN will be strictly more expensive than a NESTED_LOOP_JOIN if the left-relation only has one row. In that scenario, the overhead of hashing never leads to a runtime benefit.
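A rough sketch of the idea over in-memory rows (again, not Dolt's execution code): hash the right relation once, then probe the hash table for each left row.

```go
// hashJoin builds an in-memory hash table over the right relation's join key,
// then probes it once per left row. Each input is read exactly once, at the
// cost of materializing the right side in memory.
func hashJoin(left, right [][]any, leftKey, rightKey int) [][]any {
	table := make(map[any][][]any, len(right)) // build phase
	for _, r := range right {
		table[r[rightKey]] = append(table[r[rightKey]], r)
	}
	var out [][]any
	for _, l := range left { // probe phase: one hash lookup per left row
		for _, r := range table[l[leftKey]] {
			out = append(out, append(append([]any{}, l...), r...))
		}
	}
	return out
}
```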
HASH_JOINs work for arbitrary subtrees, so every default NESTED_LOOP_JOIN in our example has a corresponding HASH_JOIN⁵:
memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (hashJoin 1 2) (hashJoin 1 2) (hashJoin 2 1) (lookupJoin 1 2) (lookupJoin 2 1) (innerJoin 2 1) (innerJoin 1 2)
├── G4: (tableScan: xy)
├── G5: (hashJoin 3 4) (hashJoin 1 7) (hashJoin 7 1) (hashJoin 2 6) (hashJoin 6 2) (hashJoin 3 4) (hashJoin 4 3) (lookupJoin 3 4) (lookupJoin 7 1) (lookupJoin 6 2) (innerJoin 4 3) (innerJoin 3 4) (innerJoin 6 2) (innerJoin 2 6) (innerJoin 7 1) (innerJoin 1 7)
├── G6: (hashJoin 1 4) (hashJoin 4 1) (lookupJoin 1 4) (lookupJoin 4 1) (innerJoin 4 1) (innerJoin 1 4)
└── G7: (hashJoin 2 4) (hashJoin 4 2) (lookupJoin 2 4) (lookupJoin 4 2) (innerJoin 4 2) (innerJoin 2 4)
Costing
Now we have a full "forest" of join plans rooted at `G5`. Every plan in the root group produces the same result set, but we can only choose one to execute.
We recursively optimize the root group from the tablescan leaves upwards. Consider optimizing `G3`, which will be invoked after trying to cost `(hashJoin 3 4)` in the root group. We must choose between NESTED_LOOP_JOIN, HASH_JOIN, and LOOKUP_JOIN. We compare the three physical operators between `ab (rows=n)` and `uv (rows=m)` as follows:
- NESTED_LOOP_JOIN: `n x m`, a for loop for each table reads rows from disk for comparison.
- HASH_JOIN: `n + m`, we read `m` once, load the rows into a hash table, and probe while scanning `n`.
- LOOKUP_JOIN (unique): `n`, scan `n`, and probe the on-disk index to find matches.
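The list above translates almost directly into a cost function. A minimal sketch with hypothetical names, where `n` and `m` are the estimated row counts of the left and right inputs:

```go
// estimateJoinCost returns a rough cost for each physical operator; real
// costing also accounts for filters, cardinality estimates, and index shape.
func estimateJoinCost(op string, n, m float64) float64 {
	switch op {
	case "nestedLoopJoin":
		return n * m // every left row scans the whole right relation
	case "hashJoin":
		return n + m // build a hash table over m, probe once per row of n
	case "lookupJoin":
		return n // probe an existing on-disk index once per row of n
	default:
		return n * m
	}
}
```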
There is no strict hierarchy that enforces LOOKUP_JOIN < HASH_JOIN < NESTED_LOOP_JOIN, unfortunately. The cardinality of tables, presence of filter conditions, and the lookup index all affect cost. But our rough estimates steer us away from more expensive plans.
In this case, LOOKUP_JOIN is both available and preferred for both edges. Selecting the order now comes down to table size. The cost of each lookup will be the size of the left table, so we want to make every join's left side as small as possible. `ab` < `uv` < `xy`, so the cheapest plan ends up being `(lookupJoin 3 4)` with a cost of `size(ab)`:
memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (lookupJoin 1 2)
├── G4: (tableScan: xy)
└── G5: (lookupJoin 3 4)
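Conceptually, the recursion that pruned the memo down to these winners looks something like the sketch below, reusing the hypothetical types from earlier rather than Dolt's real implementation:

```go
// costGroup picks the cheapest plan for a group, memoizing the winner so
// shared subtrees are only costed once. Illustrative only.
func (m *memo) costGroup(id int) float64 {
	g := m.groups[id]
	if g.best.op != "" {
		return g.cost // already costed: reuse the memoized winner
	}
	for i, p := range g.plans {
		c := m.costPlan(p)
		if i == 0 || c < g.cost {
			g.cost, g.best = c, p // keep the cheapest alternative seen so far
		}
	}
	return g.cost
}

// costPlan charges a plan for its children plus a stand-in operator cost.
func (m *memo) costPlan(p relExpr) float64 {
	if p.op == "tableScan" {
		return 1 // placeholder for an estimated row count
	}
	return m.costGroup(p.left) + m.costGroup(p.right) + 1
}
```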
The final plan gets converted into an equivalent execution tree:
LookupJoin((xy.x = uv.u) AND (ab.a = xy.x))
├─ LookupJoin(ab.a = uv.u)
│ ├─ Exchange
│ │ └─ Table(ab)
│ │ └─ columns: [a b]
│ └─ IndexedTableAccess(uv)
│ ├─ index: [uv.u]
│ └─ columns: [u v]
└─ IndexedTableAccess(xy)
├─ index: [xy.x]
└─ columns: [x y]
Summary
In this blog, we tracked join search as it populates memo groups with alternative plans. After exhausting order transformations and execution operators, groups are "costed" to recursively estimate the lowest-cost plan.
We followed a simple example to focus on high-level concepts, but the reordering rules give interesting emergent properties for more complicated queries. Deep joins, subqueries, non-INNER_JOIN operators, and compound filters introduce edge cases that affect plan generation and optimal join selection.
There are still a lot of improvements left to add, but we encourage you to give it a try and let us know what you find! Reach out to us on Discord, GitHub, or Twitter if you would like to chat!
1. For example, an expression group `[ABCD]` might include: `[AB]x[CD]`, `Cx[ABD]`, and `[ABC]xD`. Sub-definitions in brackets, like `[AB]`, are themselves memo groups that can encapsulate multiple join orders (`AxB` and `BxA`). Enumerating plans for `[ABCD]`'s parents higher in the tree defers worrying about `[ABCD]`'s subtrees. A join operator, like INNER_JOIN(A=E), can connect every pair of subtrees as long as `A` and `E` are not on the same side. So this filter can join `[ABCD]xE`, but it can also join `[ABC]x[DE]`, assuming there was a join condition to produce `[ABC]` and `[DE]`.
2. A commutative pair reverses table order. The commutative pair of `select * from xy join uv on x = u` is `select * from uv join xy on x = u`.
3. An associative pair reverses edge order. The associative pair of `select * from xy join (select * from uv join ab on a = u) on x = u` is `select * from (select * from xy join uv on x = u) join ab on a = u`.
4. Two other transformations reorder edges: left- and right-asscom, which apply commutativity plus associativity. Different edge types also have limited valid transformations. LEFT_JOIN, for example, cannot commute its relations.
5. Subquery join leaves that reference columns from an outer scope cannot be cached, and therefore cannot be used as the hashed table in a HASH_JOIN.