Introducing the Dolt Debug Command
DoltDB is a version controlled relational database built from the ground up. The storage format is novel. The execution engine is novel. Both are increasingly Postgres and MySQL compatible. And it is all open source and written in GoLang.
All of our users have the same issues any database user has. SQL queries occasionally behave in surprising ways. And it's not always easy to tell the difference between user error, performance issues, and buggy database behavior. Our documentation and blogs help users through many issues, but individual queries are too variable for a small set of rules to be universally helpful.
One advantage of owning our stack is that we've accumulated battle
tested ways of debugging SQL queries. Today we will talk about a
command that packages years of debugging strategy into a CLI command.
dolt debug
lets adventurous users better understand what Dolt is doing
under the hood. And in many cases it helps our internal team debug
issues on servers running in sensitive contexts.
Currently dolt debug
outputs (1) verbose optimizer
logs and (2) broad performance metrics. The debug output includes
catalog entity names in the target query, but otherwise keeps table
data private. The command only runs on a database without an active
running server. The output format is a tar.gz
that is easy to share with
our team.
All of these constraints could be relaxed over time: debug
could run on an
active server, obfuscate plan names, and include table statistics
histograms. But as always, we would rather ship a v0 with 80% of
the features and iterate based on user feedback.
Bug Report
We will test dolt debug
with an error discovered this week:
select
distinct week(created_at) as week
from paperinfo02_null_corrected
where
year(created_at)=2024 and
reconciliation = 0 and
llm=0
order by week;
error: unable to sort: unable to find field with index 57 in row of 1 columns
The user noted that dropping the sort node ran predictably:
select
distinct week(created_at) as week
from paperinfo02_null_corrected
where
year(created_at)=2024 and
reconciliation = 0 and
llm=0;
+------+
| week |
+------+
| 32 |
| 33 |
...
We will run again with the debug
command to try to understand the error better:
> time dolt debug -t sort_debug -q <<EOF
select
distinct week(created_at) as week
from paperinfo02_null_corrected
where
year(created_at)=2024 and
reconciliation = 0 and
llm=0
order by week
EOF
starting cpu profile...
starting mem profile...
starting trace profile...
debug results in: sort_debug
zipped results in: sort_debug.tar.gz
________________________________________________________
Executed in 32.21 secs fish external
usr time 33.10 secs 0.17 millis 33.10 secs
sys time 0.69 secs 1.33 millis 0.69 secs
The debug command took about 30 seconds and produces the following files:
> ls -l sort_debug/
-rw-r--r--@ 1 maxhoffman staff 3885 Mar 10 08:39 analysis.txt
-rw-r--r--@ 1 maxhoffman staff 28956 Mar 10 08:39 cpu.pprof
-rw-r--r--@ 1 maxhoffman staff 0 Mar 10 08:39 exec.txt
-rw-r--r--@ 1 maxhoffman staff 149 Mar 10 08:39 input.sql
-rw-r--r--@ 1 maxhoffman staff 69801 Mar 10 08:40 mem.pprof
-rw-r--r--@ 1 maxhoffman staff 9508 Mar 10 08:39 plan.txt
-rw-r--r--@ 1 maxhoffman staff 28693088 Mar 10 08:40 trace.out
We will consider these files in the order I used to debug our error.
input.sql
This is just a copy of our query!
plan.txt
The plan describes how our sql engine has decided to execute this query.
This file includes both the default plan, which users can access with
explain plan <query>
, and the verbose plan, which prints internal
information crucial to execution but can be very long and hard to read.
Here is the regular plan:
Sort(WEEK(paperinfo02_null_corrected.created_at, 0) as week ASC)
└─ Distinct
└─ Project
├─ columns: [WEEK(paperinfo02_null_corrected.created_at, 0) as week]
└─ Project
├─ columns: [paperinfo02_null_corrected.id_paperinfo, paperinfo02_null_corrected.id_coderinfo, paperinfo02_null_corrected.doi, paperinfo02_null_corrected.coder, paperinfo02_null_corrected.llm, paperinfo02_null_corrected.out_of_scope, paperinfo02_null_corrected.claim_abstract_nonnull, paperinfo02_null_corrected.claim_abstract_null, paperinfo02_null_corrected.claim_abstract_precise_null, paperinfo02_null_corrected.claim_abstract_reject_meaningful_effect, paperinfo02_null_corrected.claim_nonnull, paperinfo02_null_corrected.claim_null, paperinfo02_null_corrected.claim_precise_null, paperinfo02_null_corrected.claim_reject_meaningful_effect, paperinfo02_null_corrected.countries_universe, paperinfo02_null_corrected.countries_complete, paperinfo02_null_corrected.dag, paperinfo02_null_corrected.method_agent_based, paperinfo02_null_corrected.method_archive, paperinfo02_null_corrected.method_bayesian, paperinfo02_null_corrected.method_did, paperinfo02_null_corrected.method_comparative_case_study, paperinfo02_null_corrected.method_discontinuity, paperinfo02_null_corrected.method_ethnography, paperinfo02_null_corrected.method_experiment, paperinfo02_null_corrected.method_experiment_field, paperinfo02_null_corrected.method_experiment_audit, paperinfo02_null_corrected.method_experiment_lab, paperinfo02_null_corrected.method_experiment_natural, paperinfo02_null_corrected.method_experiment_survey, paperinfo02_null_corrected.method_formal_theory, paperinfo02_null_corrected.method_interview, paperinfo02_null_corrected.method_historical_narrative, paperinfo02_null_corrected.method_iv, paperinfo02_null_corrected.method_matching, paperinfo02_null_corrected.method_network, paperinfo02_null_corrected.method_prediction, paperinfo02_null_corrected.method_process_tracing, paperinfo02_null_corrected.method_regression, paperinfo02_null_corrected.method_equivalence, paperinfo02_null_corrected.method_mediation, paperinfo02_null_corrected.method_survey, paperinfo02_null_corrected.method_panel, paperinfo02_null_corrected.method_synthetic_control, paperinfo02_null_corrected.method_text_analysis, paperinfo02_null_corrected.method_twfe, paperinfo02_null_corrected.preregistration, paperinfo02_null_corrected.preregistration_url, paperinfo02_null_corrected.replication_available, paperinfo02_null_corrected.replication_url, paperinfo02_null_corrected.statistical_article, paperinfo02_null_corrected.subfield, paperinfo02_null_corrected.year_end, paperinfo02_null_corrected.year_start, paperinfo02_null_corrected.coder_article_comments, paperinfo02_null_corrected.reconciliation, paperinfo02_null_corrected.created_at, WEEK(paperinfo02_null_corrected.created_at, 0) as week]
└─ Filter
├─ (((year(paperinfo02_null_corrected.created_at) = 2024) AND (paperinfo02_null_corrected.reconciliation = 0)) AND (paperinfo02_null_corrected.llm = 0))
└─ Table
├─ name: paperinfo02_null_corrected
└─ columns: [id_paperinfo id_coderinfo doi coder llm out_of_scope claim_abstract_nonnull claim_abstract_null claim_abstract_precise_null claim_abstract_reject_meaningful_effect claim_nonnull claim_null claim_precise_null claim_reject_meaningful_effect countries_universe countries_complete dag method_agent_based method_archive method_bayesian method_did method_comparative_case_study method_discontinuity method_ethnography method_experiment method_experiment_field method_experiment_audit method_experiment_lab method_experiment_natural method_experiment_survey method_formal_theory method_interview method_historical_narrative method_iv method_matching method_network method_prediction method_process_tracing method_regression method_equivalence method_mediation method_survey method_panel method_synthetic_control method_text_analysis method_twfe preregistration preregistration_url replication_available replication_url statistical_article subfield year_end year_start coder_article_comments reconciliation created_at]
In this case the verbose plan has more details for each column reference:
Sort(WEEK(paperinfo02_null_corrected.created_at:57, 0 (bigint)) as week ASC nullsFirst)
└─ Distinct
└─ Project
├─ columns: [WEEK(paperinfo02_null_corrected.created_at:56, 0 (bigint)) as week]
...
sort: unable to find field with index 57 in row of 1 columns
has to
refer to the topmost expression because it is the only Sort operator. It appears that
we are pruning 57 columns down to 1 column in the project operator
below, but failing to register that change in the Sort
expression. If that
doesn't make sense yet that is OK, we will look specifically at analysis
next to see exactly what's happening.
analysis.txt
The analysis file describes the stepwise process transforming the initial query plan into its executable form. We are interested in this section:
INFO: validation/0: Evaluating rule validateSubqueryColumns
INFO: validation/0: Evaluating rule validateUnionSchemasMatch
INFO: validation/0: Evaluating rule validateAggregations
INFO: after-all/0: Evaluating rule assignExecIndexes
INFO: after-all/0/assignExecIndexes:
Sort(WEEK(paperinfo02_null_corrected.created_at:57, 0 (bigint)) as week ASC nullsFirst)
└─ Distinct
└─ Project
├─ columns: [WEEK(paperinfo02_null_corrected.created_at:56, 0 (bigint)) as week]
...
INFO: after-all/0/assignExecIndexes: --- Prev
+++ Next
@@ -3,5 +3,5 @@
└─ Project
- ├─ columns: [WEEK(paperinfo02_null_corrected.created_at:57, 0 (bigint)) as week]
+ ├─ columns: [WEEK(paperinfo02_null_corrected.created_at:56, 0 (bigint)) as week]
The logical query plan initially tags every column with a unique id.
created_at
's column id is 57
, and the week
alias is 58
.
Semantic validation and logical transformations are easier
when we identify columns with ids rather than via strings or index
position. Strings might not be unique, and position is flexible while
we're rearranging nodes in the tree.
But the output of assignExecIndexes
needs
to be executable. The column created_at
should be assigned index 0
in this case, because the column is the first field returned by the
child relation.
Here is where we start to index the sort node inside assignExecIndexes
:
We've walked upwards through all of Sort's children, and are expecting to index the alias with id 58
from the children set with id 58
. Looking good so far:
And here is the problem. The sort expression is really alias(week(created_at: 57) as week): 58
. We do not have an assignment target for the column expression nested inside the alias is 57
.
If the child returned a standalone created
at we will fill the
expected dependency. But it doesn't really make sense for the single
sort expression to call week(created_at)
a second time in the call
stack. We would also have to add another third projection at the top
to return only the requested set of rows. The easiest fix makes the sort
alias a standalone reference to week:58
:
Sort(week:58 ASC)
Here is the PR for that fix.
cpu.pprof
The fix above has one clear downside, which is the number of projections we are performing.
This plan reads 57 columns from the table:
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Distinct |
| └─ Project |
| ├─ columns: [WEEK(paperinfo02_null_corrected.created_at, 0) as week] |
| └─ Sort(week ASC) |
| └─ Project |
| ├─ columns: [paperinfo02_null_corrected.id_paperinfo, paperinfo02_null_corrected.id_coderinfo, paperinfo02_null_corrected.doi, paperinfo02_null_corrected.coder, paperinfo02_null_corrected.llm, paperinfo02_null_corrected.out_of_scope, paperinfo02_null_corrected.claim_abstract_nonnull, paperinfo02_null_corrected.claim_abstract_null, paperinfo02_null_corrected.claim_abstract_precise_null, paperinfo02_null_corrected.claim_abstract_reject_meaningful_effect, paperinfo02_null_corrected.claim_nonnull, paperinfo02_null_corrected.claim_null, paperinfo02_null_corrected.claim_precise_null, paperinfo02_null_corrected.claim_reject_meaningful_effect, paperinfo02_null_corrected.countries_universe, paperinfo02_null_corrected.countries_complete, paperinfo02_null_corrected.dag, paperinfo02_null_corrected.method_agent_based, paperinfo02_null_corrected.method_archive, paperinfo02_null_corrected.method_bayesian, paperinfo02_null_corrected.method_did, paperinfo02_null_corrected.method_comparative_case_study, paperinfo02_null_corrected.method_discontinuity, paperinfo02_null_corrected.method_ethnography, paperinfo02_null_corrected.method_experiment, paperinfo02_null_corrected.method_experiment_field, paperinfo02_null_corrected.method_experiment_audit, paperinfo02_null_corrected.method_experiment_lab, paperinfo02_null_corrected.method_experiment_natural, paperinfo02_null_corrected.method_experiment_survey, paperinfo02_null_corrected.method_formal_theory, paperinfo02_null_corrected.method_interview, paperinfo02_null_corrected.method_historical_narrative, paperinfo02_null_corrected.method_iv, paperinfo02_null_corrected.method_matching, paperinfo02_null_corrected.method_network, paperinfo02_null_corrected.method_prediction, paperinfo02_null_corrected.method_process_tracing, paperinfo02_null_corrected.method_regression, paperinfo02_null_corrected.method_equivalence, paperinfo02_null_corrected.method_mediation, paperinfo02_null_corrected.method_survey, paperinfo02_null_corrected.method_panel, paperinfo02_null_corrected.method_synthetic_control, paperinfo02_null_corrected.method_text_analysis, paperinfo02_null_corrected.method_twfe, paperinfo02_null_corrected.preregistration, paperinfo02_null_corrected.preregistration_url, paperinfo02_null_corrected.replication_available, paperinfo02_null_corrected.replication_url, paperinfo02_null_corrected.statistical_article, paperinfo02_null_corrected.subfield, paperinfo02_null_corrected.year_end, paperinfo02_null_corrected.year_start, paperinfo02_null_corrected.coder_article_comments, paperinfo02_null_corrected.reconciliation, paperinfo02_null_corrected.created_at, WEEK(paperinfo02_null_corrected.created_at, 0) as week] |
| └─ Filter |
| ├─ (((year(paperinfo02_null_corrected.created_at) = 2024) AND (paperinfo02_null_corrected.reconciliation = 0)) AND (paperinfo02_null_corrected.llm = 0)) |
| └─ Table |
| ├─ name: paperinfo02_null_corrected |
| └─ columns: [id_paperinfo id_coderinfo doi coder llm out_of_scope claim_abstract_nonnull claim_abstract_null claim_abstract_precise_null claim_abstract_reject_meaningful_effect claim_nonnull claim_null claim_precise_null claim_reject_meaningful_effect countries_universe countries_complete dag method_agent_based method_archive method_bayesian method_did method_comparative_case_study method_discontinuity method_ethnography method_experiment method_experiment_field method_experiment_audit method_experiment_lab method_experiment_natural method_experiment_survey method_formal_theory method_interview method_historical_narrative method_iv method_matching method_network method_prediction method_process_tracing method_regression method_equivalence method_mediation method_survey method_panel method_synthetic_control method_text_analysis method_twfe preregistration preregistration_url replication_available replication_url statistical_article subfield year_end year_start coder_article_comments reconciliation created_at] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
An alternative plan that looks better at face value swaps the order by week(created_by)
for order by week
. The plan below prunes all unnecessary columns:
explain plan select
distinct week(created_at) as week
from paperinfo02_null_corrected
where
year(created_at)=2024 and
reconciliation = 0 and
llm=0
order by week;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Distinct |
| └─ Project |
| ├─ columns: [WEEK(paperinfo02_null_corrected.created_at, 0) as week] |
| └─ Sort(WEEK(paperinfo02_null_corrected.created_at, 0) ASC) |
| └─ Filter |
| ├─ (((year(paperinfo02_null_corrected.created_at) = 2024) AND (paperinfo02_null_corrected.reconciliation = 0)) AND (paperinfo02_null_corrected.llm = 0)) |
| └─ Table |
| ├─ name: paperinfo02_null_corrected |
| └─ columns: [llm reconciliation created_at] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
But just because the second plan looks better doesn't mean it is faster. We can compare the debug output CPU profiles between the two to understand the tradeoff.
Here is how we access the HTML form of our CPU profile (as opposed to the cli form, without the -http :8080
argument):
go tool pprof -http :8080 sort_debug/cpu.pprof
The first plan spends most of its time deserializing unnecessary fields. This doesn't come as a surprise. But here is the profile for the second:
Now we are bottlenecked on sorting rather than field deserialization.
Note that the second is not necessarily faster. We want to both read few
fields and avoid repeatedly calling the week
function while sorting.
We are working on combining these two approaches, pushing function calls
below sort and pruning table columns in more cases.
mem.pprof
The memory profile tooling is similar to the CPU tooling. But while the CPU profile samples stack traces and records the estimated time spent in different functions, the memory profile tracks malloc volume by function call. We use the same command as before to inspect memory profiles:
go tool pprof -http :8080 sort_debug/mem.pprof
The tablescan was responsible for CPU latency, but the projection higher in the tree doubles our memory usage.
In comparison, the version that prunes columns but recomputes week
values uses a third as much memory:
trace.out
Golang has a built-in trace functionality that we've used to time specific sections of execution. The trace tool is separate from the profiler:
go tool trace -http :8080 sort_debug/trace.out
The tracer does not sample executions, but rather executes programmed actions for every query. For example, one value that we trace is the amount of time spent executing individual function calls. We can use this to compare the number and distribution of week(created_at)
calls between the two versions of our query.
The pruned version executes 3x more function calls (3 million vs 10
million). Interestingly, both have bimodal latency distributions
centered at 250 nanoseconds and 2.5 microseconds. My takeaway is that
the week
function takes about 2.5 microseconds, and the row filters
take about 2.5ns. Because both calls are scaled 3x, it seems that
the pruned version runs about 3 times as many iterations in the 10
seconds we were tracing this function! All evidence points to projection
management being a large component of performance.
Summary
The dolt debug
command only works as a CLI function for stopped
servers at the moment. But it organizes much of the manual debugging
process I've developed over the last 4 years in an automated package
easily accessible from the command line. Users can inspect verbose query plans and
analyzer outputs. It automatically runs three types of profiling. And
its output format is convenient and often safe to send to the Dolt team
if you want help debugging your workloads.
If you have any questions about Dolt, databases, or Golang performance reach out to us on Twitter, Discord, and GitHub!