A customer recently experienced performance issues with a query similar to this:
SELECT a.*, b.x, c.y from c
INNER JOIN b ON b.id = c.id
INNER JOIN a ON a.id = c.id
WHERE
a.x in {str_iterable_tosql_list(ids)}
The performance culprit and focus of this blog is the last line, an IN
expression. IN expressions are a convenient way to combine a
series of equality checks, like WHERE a.x = 1 OR a.x = 2 OR ..., into
a compact form: WHERE a.x IN (1, 2, ...). We only had one runtime
option for computing this check, and it worked fine for a while. But the IN
expression above included 600,000 literal values in
{str_iterable_tosql_list(ids)}, so we added a second strategy to
handle large value sets.
We will show how simple data structures from CS 101 let us shave 40 hours off this query's runtime, a full 60% reduction. Sometimes simple algorithms are effective!
Default InTuple Operator
Every logical SQL operator has one or more corresponding physical
operators capable of executing it in a query plan. We originally had only one
physical operator for our logical InTuple, which used a nested
loop to compare each left expression result against every right tuple value.
To clarify, here is the InTuple’s default comparison logic:
func (in *InTuple) Eval(ctx *sql.Context, row sql.Row) (interface{}, error) {
    // Promote the left type so both sides are coerced consistently before comparing.
    typ := in.Left().Type().Promote()
    left, err := in.Left().Eval(ctx, row)
    if err != nil {
        return nil, err
    }
    // A NULL left side can never match; return SQL NULL.
    if left == nil {
        return nil, nil
    }
    right, ok := in.Right().(Tuple)
    if !ok {
        return nil, ErrUnsupportedInOperand.New(right)
    }
    // Walk every tuple element until we find a match or exhaust the list.
    for _, el := range right {
        rightVal, err := el.Eval(ctx, row)
        if err != nil {
            return nil, err
        }
        cmp, err := typ.Compare(left, rightVal)
        if err != nil {
            return nil, err
        }
        if cmp == 0 {
            return true, nil
        }
    }
    return false, nil
}
The Eval function is called for each row in our table. Every
invocation iterates through the set on the right side of the InTuple
until we either find a match or exhaust the list and return false. If
there are n rows and k tuple values, the query will execute in O(k*n)
time.
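To make that cost concrete, here is a simplified sketch of the per-row driving
loop. The engine actually streams rows through iterators, and filterRows is a
hypothetical stand-in rather than our real operator, but the shape of the work
is the same:
// filterRows is a hypothetical stand-in for the engine's filter iterator.
// Each of the n scanned rows triggers one Eval call, and each Eval above
// walks up to k tuple values, which is where the O(k*n) comes from.
func filterRows(ctx *sql.Context, rows []sql.Row, in *InTuple) ([]sql.Row, error) {
    var matched []sql.Row
    for _, row := range rows { // n rows
        res, err := in.Eval(ctx, row) // up to k comparisons per call
        if err != nil {
            return nil, err
        }
        if res == true { // Eval returns nil (not false) for NULL left values
            matched = append(matched, row)
        }
    }
    return matched, nil
}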
Beating O(k*n)
At a high level, executing an InTuple with a huge tuple is similar to
join planning. We are matching two sets of tuples and want to find the
best strategy to make that happen. The cost of each physical (execution)
operator depends on the data in the table and context of the plan, so it
is not always clear upfront which strategy is better.
For InTuple, hashing is the clear winner among our join (read IN)
operator contenders:
- O(1) lookups beat O(k), especially when k = 600,000.
- The likelihood of k being so large that we cannot create a hash map is low, presumably because the query string would have to be many gigabytes worth of tuple values.
- The inefficiency of building hash maps for small tuple sets is acceptable. Converting InTuple expressions with one or two values to their WHERE equivalents is easy for us and users.
- There are fun edge cases where range and sorted merge joins might be more efficient than hash maps, but those cases are rare and esoteric! Consider a in (1000, 999, ..., 2, 1), assuming a primary key index on a. A hash map will be fast, but maybe sorting the right tuple and performing a range scan on a < 1000 would beat HashInTuple. But maybe just doing direct index lookups on the right tuple values would be faster! Without knowing column statistics, it is hard to tell which of these physical plans is faster.
Overall, hashing will usually improve the InTuple execution time after
paying the cost of building the map during the planning phase.
InTuple with Hashing
Our new execution operator, HashInTuple, creates a hash map in the
query planning phase that is passed to a companion Eval method. We
will describe this lifecycle briefly, and more details can be found in
the source
code.
First, during analysis we check whether the transform InTuple ->
HashInTuple is valid. For example, non-constant expressions on the
right side of the IN might be valid but unhashable. In x in (1, 2, 3, y), x and y are unknown until runtime, and we fall back to the
standard InTuple. This check summarizes the transform validator:
if e, ok := expr.(*expression.InTuple); ok &&
    hasSingleResult(e.Left()) &&
    isStatic(e.Right()) {
    return expression.NewHashInTuple(e.Left(), e.Right())
}
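For a rough idea of what "static" means here, the sketch below is a
hypothetical, deliberately narrow version of such a check (the engine's real
check is more general): the right side qualifies when it is a tuple made
entirely of literal constants, so every value is known at plan time.
// isStaticSketch is a hypothetical, narrower stand-in for the check above:
// a right-hand side is hashable when it is a tuple made entirely of literal
// constants, so every value can be evaluated once during planning.
func isStaticSketch(e sql.Expression) bool {
    tup, ok := e.(expression.Tuple)
    if !ok {
        return false
    }
    for _, el := range tup {
        if _, isLiteral := el.(*expression.Literal); !isLiteral {
            return false
        }
    }
    return true
}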
Proceeding with the transform, we build a hash map of the literal values
on the right side of the IN expression (simplified):
func newInMap(right Tuple, lType sql.Type) (map[uint64]sql.Expression, error) {
    elements := make(map[uint64]sql.Expression)
    for _, el := range right {
        // Tuple elements are constant, so they can be evaluated with an empty row at plan time.
        i, err := el.Eval(sql.NewEmptyContext(), sql.Row{})
        if err != nil {
            return nil, err
        }
        key, err := hashOfSimple(i, lType)
        if err != nil {
            return nil, sql.ErrInvalidOperandColumns.New(el, sql.NumColumns(lType))
        }
        elements[key] = el
    }
    return elements, nil
}
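The hashOfSimple helper lives in the engine's source; the sketch below
(hashKeySketch, using fmt and hash/fnv from the standard library) is only a
hypothetical illustration of deriving a uint64 key. hashOfSimple also receives
lType, presumably so that values that compare equal under the left
expression's type hash to the same key; we skip that normalization to keep the
sketch self-contained.
// hashKeySketch is a hypothetical illustration of producing a uint64 key for
// the elements map. It hashes the value's printed form with FNV-1a; the
// engine's hashOfSimple also uses lType, presumably to normalize the value
// before hashing so that equal values always produce equal keys.
func hashKeySketch(v interface{}, lType sql.Type) (uint64, error) {
    h := fnv.New64a()
    if _, err := fmt.Fprintf(h, "%v", v); err != nil {
        return 0, err
    }
    return h.Sum64(), nil
}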
Our transform replaces the InTuple operator with a logically
equivalent but physically distinct HashInTuple operator that computes
the same result in a different way (the Filter(a.x HASH IN ...) node below):
> EXPLAIN
SELECT a.*, b.x, c.y from c
INNER JOIN b ON b.id = c.id
INNER JOIN a ON a.id = c.id
WHERE
a.x in (1,2,3,4,5,6);
+--------------------------------------------------------+
| plan |
+--------------------------------------------------------+
| Project(a.id, a.x, a.y, b.x, c.y) |
| └─ IndexedJoin(b.id = c.id) |
| ├─ Exchange(parallelism=8) |
| │ └─ Table(b) |
| └─ IndexedJoin(a.id = c.id) |
| ├─ Exchange(parallelism=8) |
| │ └─ Filter(a.x HASH IN (1, 2, 3, 4, 5, 6)) |
| │ └─ Table(a) |
| └─ IndexedTableAccess(c on [c.id]) |
+--------------------------------------------------------+
Execution passes scanned rows to our HashInTuple operator. The new
Eval evaluates the tuple's left expression against each incoming row to
produce a value (leftVal). That value is hashed and checked against the
rightLookup map for set inclusion:
func (hit *HashInTuple) Eval(ctx *sql.Context, row sql.Row) (interface{}, error) {
    leftVal, err := hit.Left().Eval(ctx, row)
    if err != nil {
        return nil, err
    }
    // A NULL left side can never match; return SQL NULL.
    if leftVal == nil {
        return nil, nil
    }
    key, err := hashOf(leftVal, hit.Left().Type())
    if err != nil {
        return nil, err
    }
    // O(1) membership check against the map built during planning.
    if _, ok := hit.rightLookup[key]; !ok {
        return false, nil
    }
    return true, nil
}
In exchange for the time and memory spent building a hash map during
query planning, we receive an execution operator that performs O(1)
lookups. InTuple and HashInTuple return the same result, but this
simple tradeoff reduced the execution time for our customer’s target
query by 60%. Sometimes simple is effective!
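As a back-of-the-envelope illustration of that tradeoff (a toy program,
separate from the engine code and from the customer's query), compare a
per-probe linear scan with a set built once up front:
package main

import "fmt"

// containsScan mirrors the original InTuple: walk the whole tuple per probe,
// so n probes over k values cost O(k*n).
func containsScan(values []int, probe int) bool {
    for _, v := range values {
        if v == probe {
            return true
        }
    }
    return false
}

// containsHash mirrors HashInTuple: pay O(k) once to build the set, then each
// probe is an O(1) map lookup.
func containsHash(set map[int]struct{}, probe int) bool {
    _, ok := set[probe]
    return ok
}

func main() {
    const k = 600000 // tuple values, the same order of magnitude as the customer's query
    values := make([]int, k)
    set := make(map[int]struct{}, k) // built once, analogous to plan time
    for i := range values {
        values[i] = i
        set[i] = struct{}{}
    }
    // Each scanned row is a probe: the scan walks up to k values, the map does not.
    fmt.Println(containsScan(values, k-1), containsHash(set, k-1))
}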
Future
We have a lot of work to do adding normalization rules for query plans, building physical operators for each logical operator, and choosing the best execution plan depending on the statistics of individual databases. Preferably before customers find slow queries! But we also appreciate bug reports, and prioritize optimizations that add value immediately.
In this blog, we summarized one such customer query that was
bottlenecked by a slow InTuple execution operator. Using a hash map
for lookups adds to the planning time, but dramatically speeds up a
variety of queries at runtime.
If you find slow queries or want to talk to us about databases, SQL engines, or optimizers, reach out to us on our Discord!
