Pruning 90% of Dolt's SQL server code
Dolt is Git for data. Git versions files, Dolt versions tables. Dolt comes with a SQL engine built in, which lets you run SQL queries against any version of your data you've committed. Dolt's SQL engine is go-mysql-server, which we forked and then adopted about a year ago. Today we're going to be discussing a layer even deeper in the stack: the SQL parser and server, which is implemented by vitess, and how we got rid of the 90% of it we weren't using.
Vitess: a powerful tool we didn't need most of
Vitess is "a database clustering system for horizontal scaling of MySQL through generalized sharding." It powers sharding in YouTube, which might be the world's largest single deployment of MySQL. It can do things like cut over atomically after a shard fails, split and merge shards, and a ton more.
It's a really cool project, but it's not our project, and its goals aren't our goals. For example, it doesn't care about things like DDL operations (CREATE TABLE and friends), which are a core part of our product. We forked it long ago to add these features, and then a few months ago took the additional step of hard-forking it (changing the package namespaces) in order to make life easier for our customers. This had the negative side effect of making collaboration with the original fork much more difficult.
All we're using vitess for is to 1) parse SQL statements, and 2) run a TCP server that implements the MySQL binary protocol. That's it. The rest of its staggering complexity wasn't just a waste; it was actively getting in our way. Some of its many, many tests would periodically break during CI, usually in parts of the codebase totally unrelated to what we were changing. And customers had noticed vitess bringing in lots of other dependencies. Dependency management in golang leaves a lot to be desired, so vitess's size was actually a stumbling block for a lot of people who wanted to use go-mysql-server as a test library.
Pruning unused code in golang
I knew there was a lot of code to prune from vitess. But which parts? It's not enough to just see which packages are imported by the packages we need. I also need to know which packages they import, and so on. I need the transitive dependency closure so I can snip off the rest of the graph.
So how do we get a dependency closure for a package in golang? The best answer I could find is a project called godepgraph. You feed it some package names, and it spits out a dotfile with all the transitive dependencies of those packages. This looks like:
% godepgraph github.com/dolthub/vitess/go/mysql github.com/dolthub/vitess/go/vt/sqlparser > deps.dot
% cat deps.dot
...
"github.com/dolthub/vitess/go/mysql" -> "github.com/dolthub/vitess/go/bucketpool";
"github.com/dolthub/vitess/go/mysql" -> "github.com/dolthub/vitess/go/netutil";
"github.com/dolthub/vitess/go/mysql" -> "github.com/dolthub/vitess/go/sqltypes";
"github.com/dolthub/vitess/go/mysql" -> "github.com/dolthub/vitess/go/stats";
...
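To make "transitive closure" concrete, here is a minimal sketch that walks the edges of a dotfile itself, starting from one or more root packages. It assumes every edge is written exactly as in the sample above ("a" -> "b";), and the function name closure is my own invention, not part of godepgraph. godepgraph has already done this walk for you, so treat this as an illustration, or as a way to ask what a single package pulls in.

```shell
# closure ROOT...   (reads dot edges of the form  "a" -> "b";  on stdin)
# Prints every package reachable from the given roots, one per line, sorted.
closure() {
  awk -v roots="$*" '
    # Record each edge, stripping the quotes and the trailing semicolon.
    $2 == "->" {
      from = $1; to = $3
      gsub(/[";]/, "", from); gsub(/[";]/, "", to)
      edges[from] = edges[from] " " to
    }
    END {
      n = split(roots, queue, " ")
      for (i = 1; i <= n; i++) seen[queue[i]] = 1
      # Breadth-first walk: the queue grows as new packages are discovered.
      for (i = 1; i <= n; i++) {
        m = split(edges[queue[i]], next_pkgs, " ")
        for (j = 1; j <= m; j++) {
          if (!(next_pkgs[j] in seen)) {
            seen[next_pkgs[j]] = 1
            queue[++n] = next_pkgs[j]
          }
        }
      }
      for (i = 1; i <= n; i++) print queue[i]
    }
  ' | sort
}

# Usage (hypothetical): closure github.com/dolthub/vitess/go/mysql < deps.dot
```

The seen table is what keeps the walk from looping forever if two packages ever appear to reference each other in the input.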
The dotfile can render a nice messy spaghetti tangle of a chart if you like, but text is good enough for what we're doing. The syntax is easy to parse to get a list of transitive dependencies for my input packages. I only care about golang packages in the same github project, since that's what I'm pruning. To get this, I run a little bit of deep neckbeardy unix command line magic:
% grep 'dolthub/vitess' deps.dot | cut -d' ' -f3 | grep dolthub | sort | uniq
"github.com/dolthub/vitess/go/bucketpool";
"github.com/dolthub/vitess/go/bytes2";
"github.com/dolthub/vitess/go/cache";
"github.com/dolthub/vitess/go/event";
"github.com/dolthub/vitess/go/hack";
"github.com/dolthub/vitess/go/netutil";
"github.com/dolthub/vitess/go/sqltypes";
"github.com/dolthub/vitess/go/stats";
"github.com/dolthub/vitess/go/sync2";
"github.com/dolthub/vitess/go/tb";
"github.com/dolthub/vitess/go/vt/log";
"github.com/dolthub/vitess/go/vt/logutil";
"github.com/dolthub/vitess/go/vt/proto/binlogdata";
"github.com/dolthub/vitess/go/vt/proto/logutil";
"github.com/dolthub/vitess/go/vt/proto/query";
"github.com/dolthub/vitess/go/vt/proto/replicationdata";
"github.com/dolthub/vitess/go/vt/proto/topodata";
"github.com/dolthub/vitess/go/vt/proto/vtgate";
"github.com/dolthub/vitess/go/vt/proto/vtrpc";
"github.com/dolthub/vitess/go/vt/proto/vttime";
"github.com/dolthub/vitess/go/vt/sqlparser";
"github.com/dolthub/vitess/go/vt/vterrors";
"github.com/dolthub/vitess/go/vt/vttls";
The cut gets me the third space-delimited field of each line, and then the grep pares that list down to only dolthub packages. This gives me the complete list of packages in the project that I need to keep for the mysql package to work. Great! Now I just need to clean that list up to remove the " and ; characters and it's ready for the next stage. (If I were even more of a neckbeard I would have thought of using tr -d '";' to do this for me, but I just used find and replace in my text editor.)
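For the record, the whole extraction, cleanup included, can be sketched as one small function. This is an assumption-laden sketch rather than what I actually ran: clean_pkgs is a made-up name, and it expects edges formatted like the sample above. One thing it handles that the pipeline above does not: a root package such as go/mysql only shows up on the left-hand side of an edge, so it appears to be missing from the list above. The sketch takes the roots as arguments and adds them back.

```shell
# clean_pkgs [ROOT...]   (reads dot edges on stdin)
# Prints the bare in-project package paths, de-duplicated, one per line.
clean_pkgs() {
  {
    # Same filter as the pipeline above, plus tr to strip the quotes and semicolons.
    grep 'dolthub/vitess' | cut -d' ' -f3 | grep dolthub | tr -d '";'
    # Roots never appear as the target of an edge, so add them explicitly.
    if [ "$#" -gt 0 ]; then printf '%s\n' "$@"; fi
  } | sort -u
}

# Usage (hypothetical):
#   clean_pkgs github.com/dolthub/vitess/go/mysql < deps.dot > keep.txt
```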
Using my list of packages to keep, I can then find the inverse: which packages in the project are safe to remove? This command finds every directory in the project, then prints its name, followed by the number of times that name appears in the keep list. Then it filters that list for packages that aren't used.
% find . -type d -exec echo -n {} \; -exec echo -n ' ' \; -exec grep -c {} keep.txt \; | grep ' 0$' > unused.txt
% cut -d' ' -f1 < unused.txt > unused2.txt
% awk '{ print length(), $0 | "sort -rn" }' unused2.txt | cut -d' ' -f2 > torm.txt
Finally, a little more command line magic cleans up that input and sorts it by length in descending order, putting the final result in torm.txt. Now I can process this list of directories and call git rm -r on each one. For this trick, I find it easiest to use good old perl one liners, thereby dating myself to at least Elder Millennial if not full Gen X.
% perl -ne 'chomp; system("git rm -r $_");' < torm.txt
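If you would rather preview the damage before doing it, the steps above can be sketched as two small functions. This is a sketch under stated assumptions, not the commands I ran: unused_dirs and prune_dirs are hypothetical names, keep.txt is assumed to hold bare package paths (no quotes or semicolons), and the module prefix is passed in by hand. Unlike the grep -c approach, it matches directories exactly and keeps every parent of a kept package. Like that approach, it still flags non-package subdirectories (test data, for instance) as unused.

```shell
# unused_dirs KEEP_FILE MODULE_PREFIX   (directory list on stdin, e.g. from find)
# Prints directories that are neither a kept package nor a parent of one,
# longest path first.
unused_dirs() {
  awk -v keepfile="$1" -v prefix="$2/" '
    BEGIN {
      # Mark each kept package directory and all of its parents.
      while ((getline pkg < keepfile) > 0) {
        if (index(pkg, prefix) != 1) continue
        path = "./" substr(pkg, length(prefix) + 1)
        while (path != ".") {
          keep[path] = 1
          sub(/\/[^\/]*$/, "", path)
        }
      }
    }
    $0 != "." && !($0 in keep) { print length($0), $0 }
  ' | sort -rn | cut -d' ' -f2-
}

# prune_dirs [--dry-run]   (directory list on stdin)
# Runs git rm -r on each directory, or only prints the command with --dry-run.
prune_dirs() {
  while IFS= read -r dir; do
    if [ "${1:-}" = "--dry-run" ]; then
      echo "git rm -r $dir"
    else
      git rm -r -q --ignore-unmatch -- "$dir"
    fi
  done
}

# Usage (hypothetical):
#   find . -type d ! -path './.git*' \
#     | unused_dirs keep.txt github.com/dolthub/vitess \
#     | prune_dirs --dry-run
```

The --ignore-unmatch flag keeps git rm from failing on a directory whose tracked files were already removed along with one of its children.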
And that's it! It all works! Well, not quite. After I did this, I realized that godepgraph doesn't consider test dependencies, so I ended up pruning a bunch of test packages, which broke the tests. So I just added them back one by one using git reset HEAD file and git checkout -- file, which I have aliased to gu (for "git unstage") and gc (for "git checkout"):
% gu vt/tlstest
% gc -- vt/tlstest
% gu vt/sqlparser/test_queries
% gc -- vt/sqlparser/test_queries
...
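In hindsight, some of this whack-a-mole might be avoidable by asking the go tool itself for the closure, test imports included. What follows is a sketch of that idea, not something I ran for this change. It relies on the -deps and -test flags of go list. The function name test_closure and the cleanup of the " [pkg.test]" variants that -test emits are my own assumptions about how to tidy the output. It only finds packages, so plain data directories like vt/sqlparser/test_queries would still need restoring by hand.

```shell
# test_closure MODULE_PREFIX PKG...
# Lists the in-module packages needed to build and test PKG..., one per line.
test_closure() {
  tc_prefix=$1
  shift
  go list -deps -test "$@" |
    sed -e 's/ \[.*\]$//' |   # drop the " [pkg.test]" suffix on test variants
    grep -v '\.test$' |       # drop the synthesized test binaries
    sed -e 's/_test$//' |     # fold external test packages into their directory
    grep "^$tc_prefix/" |     # keep only packages inside this module
    sort -u
}

# Usage (hypothetical), run from the module root:
#   test_closure github.com/dolthub/vitess ./go/mysql ./go/vt/sqlparser > keep.txt
```

Folding the _test suffix would mangle a real package whose name ends in _test, so check the output before trusting it.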
The final pruning
After all this was said and done, I submitted a giant PR with the changes. It's not a pure delete -- some of what motivated this change is that other changes I made to the parser caused tests we didn't care about to start failing. But that's 300,000 lines of code we'll never worry about again.
The future
This change obviously makes it more difficult to pull changes from upstream, but not that much more difficult than the hard fork made things. And in practice, we find that the vast majority of development work on the main fork of vitess is unrelated to our own goals and can be safely ignored. That said, ask us again how good an idea this was when we do decide to merge from upstream at some point in the future.
Conclusion
We hope you enjoyed this tutorial on paring down a large golang dependency! If Dolt sounds like an interesting product to you -- a SQL database that you can branch, merge, fork, clone, push and pull like a Git repository -- then join us and let us know how you're using it!