PRODUCTS

KEYWORDS

BranchBench Performance Update

Dolt is a database with version control capabilities, meaning you can branch, merge, and revert changes just like Git. Consequently, this makes Dolt uniquely suitable for development in an agentic workflow. Researchers at the DAP Lab of Columbia University recently presented a benchmark for databases in this specific workload called BranchBench. A couple of weeks ago, we provided a brief overview of BranchBench. While Dolt and Doltgres have excellent branching capabilities, there were two specific queries where we underperformed. In this blog, we’ll reproduce their findings against Postgres and do our best to address the performance concerns.

RangeRead#

RangeRead is a microbenchmark where a range scan of size 100 is performed over a table with a composite primary key. Table Schema:

                            Table "public.orders"
    Column    |            Type             | Collation | Nullable | Default 
--------------+-----------------------------+-----------+----------+---------
 o_id         | integer                     |           | not null | 
 o_d_id       | smallint                    |           | not null | 
 o_w_id       | smallint                    |           | not null | 
 o_c_id       | integer                     |           |          | 
 o_entry_d    | timestamp without time zone |           |          | 
 o_carrier_id | smallint                    |           |          | 
 o_ol_cnt     | smallint                    |           |          | 
 o_all_local  | smallint                    |           |          | 
Indexes:
    "orders_pkey" PRIMARY KEY, btree (o_w_id, o_d_id, o_id)
Foreign-key constraints:
    "orders_o_w_id_o_d_id_o_c_id_fkey" FOREIGN KEY (o_w_id, o_d_id, o_c_id) REFERENCES customer(c_w_id, c_d_id, c_id) ON DELETE CASCADE
Referenced by:
    TABLE "new_order" CONSTRAINT "new_order_no_w_id_no_d_id_no_o_id_fkey" FOREIGN KEY (no_w_id, no_d_id, no_o_id) REFERENCES orders(o_w_id, o_d_id, o_id) ON DELETE CASCADE
    TABLE "order_line" CONSTRAINT "order_line_ol_w_id_ol_d_id_ol_o_id_fkey" FOREIGN KEY (ol_w_id, ol_d_id, ol_o_id) REFERENCES orders(o_w_id, o_d_id, o_id) ON DELETE CASCADE

Table Statistics:

COUNT(*): 150000
MIN(o_w_id, o_d_id, o_id): (1, 1, 1)
MAX(o_w_id, o_d_id, o_id): (5, 10, 3000)

Query:

SELECT
    *
FROM
    orders
WHERE
    (o_w_id, o_d_id, o_id) >= (x, y, z) AND
    (o_w_id, o_d_id, o_id) <= (x, y, z + 100)

Postgres Results:

Latency (ms):
    min:  0.37
    avg:  1.48
    max:  3.39
    50th: 1.50

Doltgres Results:

Latency (ms):
    min:   951.14
    avg:  1031.38
    max:  1097.01
    50th: 1032.01

Focusing on just the 50th percentile latency, we are ~688x slower than Postgres. Yikes!

Upon further investigation, it appears that we are not using the Primary Key at all when filtering with Record types.

                                                         plan                                                          
-----------------------------------------------------------------------------------------------------------------------
 Filter
  ├─ (RECORD EXPR >= [{1 integer} {1 integer} {1 integer}] AND RECORD EXPR <= [{1 integer} {1 integer} {101 integer}])
  └─ Table
      ├─ name: orders
      └─ columns: [o_id o_d_id o_w_id o_c_id o_entry_d o_carrier_id o_ol_cnt o_all_local]
(5 rows)

This plan here indicates that Doltgres is performing a full table scan over the orders table (150,000 rows) and checks the Filter condition for each row.

The problem is that Record comparisons are not a simple comparison between each column against the respective literal. For example, (i, j, k) > (1, 1, 1) is not logically equivalent to i > 1 and j > 1 and k > 1. Record comparisons happen in order of significance from left to right. The expression (i, j, k) > (1, 1, 1) is logically equivalent to (i == 1 and j == 1 and k > 1) or (i == 1 and j > 1) or (i > 1). You can read the full decomposition logic here: https://github.com/dolthub/doltgresql/pull/2860.

With this decomposition in, our analyzer can use the appropriate index and drop the filter, which results in this following plan:

                                         plan                                         
--------------------------------------------------------------------------------------
 IndexedTableAccess(orders)
  ├─ index: [orders.o_w_id,orders.o_d_id,orders.o_id]
  ├─ filters: [{[1, 1], [1, 1], [1, 101]}]
  └─ columns: [o_id o_d_id o_w_id o_c_id o_entry_d o_carrier_id o_ol_cnt o_all_local]

Rerunning the benchmark on Doltgres we get:

Latency (ms):
    min:  1.37
    avg:  1.61
    max:  4.80
    50th: 1.58

This is a 653.17x speed up, putting us within 5% of Postgres’s performance in this benchmark.

Simulation#

Simulation is a macrobenchmark meant to simulate a typical agentic workload with data mutation, evaluation, and comparison. Here, we specifically struggled with the evaluation query:

SELECT
    SUM(CASE WHEN s_quantity = 0 THEN 1 ELSE 0 END) AS stockouts, 
    SUM(ol.ol_amount) AS total_cost
FROM 
    stock s
JOIN
    order_line ol 
ON 
    s.s_i_id = ol.ol_i_id
WHERE 
    (s.s_w_id = 0 or s.s_w_id != 0) and
    (s.s_i_id <= 7500);

Postgres Results:

Latency (ms):
    min:  213.22
    avg:  215.27
    max:  223.53
    50th: 215.44

Doltgres Results:

Latency (ms):
    min:  44383.34
    avg:  44875.51
    max:  45790.65
    50th: 44472.74

Focusing on the 50th percentile, we are ~206.43x slower than Postgres. We can investigate further using the EXPLAIN syntax:

                                                                             plan                                                                              
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Project
  ├─ columns: [sum(case  when s.s_quantity = 0 then 1 else 0 end as case when s_quantity = 0 then 1 else 0 end) as stockouts, sum(ol.ol_amount) as total_cost]
  └─ GroupBy
      ├─ select: SUM(CASE  WHEN s.s_quantity = 0 THEN 1 ELSE 0 END as CASE WHEN s_quantity = 0 THEN 1 ELSE 0 END), SUM(ol.ol_amount)
      ├─ group: 
      └─ LookupJoin
          ├─ TableAlias(ol)
          │   └─ Table
          │       ├─ name: order_line
          │       └─ columns: [ol_i_id ol_amount]
          └─ Filter
              ├─ ((s.s_w_id = 0 OR s.s_w_id <> 0) AND s.s_i_id <= 7500)
              └─ TableAlias(s)
                  └─ IndexedTableAccess(stock)
                      ├─ index: [stock.s_i_id]
                      ├─ columns: [s_i_id s_w_id s_quantity]
                      └─ keys: ol.ol_i_id

On the other hand, the Postgres analyzer picks a HashJoin:

                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=55226.10..55226.11 rows=1 width=40)
   ->  Gather  (cost=55225.87..55226.08 rows=2 width=40)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=54225.87..54225.88 rows=1 width=40)
               ->  Parallel Hash Join  (cost=25147.66..52517.67 rows=227760 width=6)
                     Hash Cond: (ol.ol_i_id = s.s_i_id)
                     ->  Parallel Seq Scan on order_line ol  (cost=0.00..23296.00 rows=625000 width=8)
                     ->  Parallel Hash  (cost=24957.83..24957.83 rows=15186 width=6)
                           ->  Parallel Seq Scan on stock s  (cost=0.00..24957.83 rows=15186 width=6)
                                 Filter: ((s_i_id <= 7500) AND ((s_w_id = 0) OR (s_w_id <> 0)))

The coster in our analyzer determines which join strategy to utilize. Evidently, we should adjust our cost estimates for LookupJoins. Fortunately, there was already a TODO guiding our change:

// TODO added overhead for right lookups
switch n := n.(type) {
case *LookupJoin:
    ...
    return lBest*seqIOCostFactor + selfJoinCard*(randIOCostFactor+seqIOCostFactor), nil

After some trial and error, we settled on this approximation for the right overhead:

switch n := n.(type) {
case *LookupJoin:
    ...
    lCost := float64(lBest * seqIOCostFactor)
    rCost := float64(1 + float64(math.Log(rBest)/indexLookupScalarFactor))
    return float64(lCost*rCost) + float64(selfJoinCard*float64(randIOCostFactor+seqIOCostFactor)), nil

The reasoning here is that Prolly Tree lookups scale logarithmically and indexLookupScalarFactor = 200 is here to account for fanout. This is a somewhat rough estimation. We need better benchmarks, coster tests, and more information to have more accurate cost estimation. However, this seems to work decently for now.

With these improvements, this is the new plan:

                                                                             plan                                                                              
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Project
  ├─ columns: [sum(case  when s.s_quantity = 0 then 1 else 0 end as case when s_quantity = 0 then 1 else 0 end) as stockouts, sum(ol.ol_amount) as total_cost]
  └─ GroupBy
      ├─ select: SUM(CASE  WHEN s.s_quantity = 0 THEN 1 ELSE 0 END as CASE WHEN s_quantity = 0 THEN 1 ELSE 0 END), SUM(ol.ol_amount)
      ├─ group: 
      └─ HashJoin
          ├─ s.s_i_id = ol.ol_i_id
          ├─ TableAlias(ol)
          │   └─ Table
          │       ├─ name: order_line
          │       └─ columns: [ol_i_id ol_amount]
          └─ HashLookup
              ├─ left-key: (ol.ol_i_id)
              ├─ right-key: (s.s_i_id)
              └─ TableAlias(s)
                  └─ IndexedTableAccess(stock)
                      ├─ index: [stock.s_w_id,stock.s_i_id]
                      ├─ filters: [{(NULL, ∞), (NULL, 7500]}]
                      └─ columns: [s_i_id s_w_id s_quantity]

And the new results:

Latency (ms):
    min:  4650.61
    avg:  6512.50
    max:  6944.69
    50th: 6476.48

Way better!… but still not amazing. We’re seeing a ~6.88x speedup over LookupJoins, but we are still 30x slower than Postgres.

Looking at the pprof, we noticed that good chunk of CPU time was spent in the types.GeneralizeTypes function. This is used to determine the output type for CASE statements. We were reevaluating the output type for each row despite it remaining the same throughout the query, so the simple fix is to just cache it. Additionally, there were some improvements to how we hashed keys.

All together, these optimizations gave us another 39% boost in performance:

Latency (ms):
    min:  3993.03
    avg:  4028.93
    max:  4457.89
    50th: 3982.86

This leaves us around 18x slower than Postgres. Looking at pprof any significant improvements here will likely revolve around reading data off disk in a more efficient manner. Notably, Postgres’s explain output states they are using a “parallel hash join” with “parallel full table scan”. Currently, there isn’t any equivalent parallelism on the Dolt/Doltgres side for joins or table scans.

Conclusion#

BranchBench is a database benchmark for agentic workflows that has exposed some query performance bottlenecks in Dolt and Doltgres. We have made some improvements to queries that were severely underperforming. There are still many areas where we can improve in terms of query performance, and the Dolt team is hard at work finding and addressing them. Stay tuned to hear more about our performance journey. If you would like to learn more about Dolt’s agentic database capabilities or want to chat about performance, chat with us on Discord or file a Github issue.