Pruning test dependencies from Go binaries
We're building Dolt, the world's
first SQL database with Git-like version control. Recently, a customer
contacted us to let us know that test symbols were making it into
their binary
when they took a dependency on our go-mysql-server
library, which
provides the SQL parsing and query execution functionality for
Dolt. This made their binary significantly larger than it needed to
be.
Investigating why this was happening turned into a multi-day rabbit hole into how the Go compiler decides which code makes it into a binary, and taught me some very interesting tips on how to remedy this sort of issue that I'd like to share.
Diagnosing the problem
The customer identified this issue by using the Unix utility strings
to output all the character data that made it into his binary, then
searching for the testify
library, which we use for test
assertions, in the output. This is a reliable way to detect if a
library is getting pulled into a binary and doesn't require installing
anything new in a Unix-like environment.
When we ran the command on the dolt
binary, we found the following:
% strings `which dolt` | grep testify
"github.com/stretchr/testify/assert
dep github.com/stretchr/testify v1.7.1 h1:5TQK59W5E3v0r2duFAb7P95B6hEeOyEnHRa8MjYSMTY=
github.com/stretchr/testify/assert.(*Assertions).Fail
github.com/stretchr/testify/assert.(*Assertions).NoError
github.com/stretchr/testify/assert.(*Assertions).True
github.com/stretchr/testify/assert.CallerInfo
github.com/stretchr/testify/assert.isTest
github.com/stretchr/testify/assert.messageFromMsgAndArgs
github.com/stretchr/testify/assert.indentMessageLines
github.com/stretchr/testify/assert.Fail
github.com/stretchr/testify/assert.labeledOutput
github.com/stretchr/testify/assert.True
github.com/stretchr/testify/assert.NoError
github.com/stretchr/testify/assert.init
type..eq.github.com/stretchr/testify/assert.labeledContent
github.com/stretchr/testify/assert.New
/c/Users/zachmu/liquidata/go-workspace/pkg/mod/github.com/stretchr/testify@v1.7.1/assert/assertion_forward.go
/c/Users/zachmu/liquidata/go-workspace/pkg/mod/github.com/stretchr/testify@v1.7.1/assert/assertions.go
/c/Users/zachmu/liquidata/go-workspace/pkg/mod/github.com/stretchr/testify@v1.7.1/assert/assertion_compare.go
/c/Users/zachmu/liquidata/go-workspace/pkg/mod/github.com/stretchr/testify@v1.7.1/assert/errors.go
/c/Users/zachmu/liquidata/go-workspace/pkg/mod/github.com/stretchr/testify@v1.7.1/assert/forward_assertions.go
dep github.com/stretchr/testify v1.7.1 h1:5TQK59W5E3v0r2duFAb7P95B6hEeOyEnHRa8MjYSMTY=
github.com/stretchr/testify/assert.(*Assertions).Fail
github.com/stretchr/testify/assert.(*Assertions).NoError
github.com/stretchr/testify/assert.(*Assertions).True
github.com/stretchr/testify/assert.CallerInfo
github.com/stretchr/testify/assert.isTest
github.com/stretchr/testify/assert.messageFromMsgAndArgs
github.com/stretchr/testify/assert.indentMessageLines
github.com/stretchr/testify/assert.Fail
github.com/stretchr/testify/assert.labeledOutput
github.com/stretchr/testify/assert.True
github.com/stretchr/testify/assert.NoError
github.com/stretchr/testify/assert.init
type..eq.github.com/stretchr/testify/assert.labeledContent
github.com/stretchr/testify/require..inittask
github.com/stretchr/testify/assert..inittask
github.com/stretchr/testify/assert.intType
github.com/stretchr/testify/assert.int8Type
github.com/stretchr/testify/assert.int16Type
github.com/stretchr/testify/assert.int32Type
github.com/stretchr/testify/assert.int64Type
github.com/stretchr/testify/assert.uintType
github.com/stretchr/testify/assert.uint8Type
github.com/stretchr/testify/assert.uint16Type
github.com/stretchr/testify/assert.uint32Type
github.com/stretchr/testify/assert.uint64Type
github.com/stretchr/testify/assert.float32Type
github.com/stretchr/testify/assert.float64Type
github.com/stretchr/testify/assert.stringType
github.com/stretchr/testify/assert.timeType
github.com/stretchr/testify/assert.AnError
go.itab.*github.com/dolthub/dolt/go/store/d.stackTracer,github.com/stretchr/testify/assert.TestingT
go.itab.*github.com/dolthub/dolt/go/store/d.panicker,github.com/stretchr/testify/assert.TestingT
So the customer was obviously correct: test libraries were making it
into the binaries. In addition to testify itself, many of testify's
dependencies, like go-cmp, were also making it into the binary:
% strings `which dolt` | grep go-cmp | wc
163 169 8646
Initially I thought this must be because we vend a set of exported
acceptance tests that let integrators verify their database backend
implementation works correctly. My theory was that, because these tests
are exported and therefore do not end in _test.go, this could be
confusing the Go compiler and pulling test libraries into the binary.
But a very cursory examination demonstrated that this wasn't the
case. For one thing, the customer reported that renaming all the
(non-acceptance) test files in go-mysql-server to end in _test.go
removed the testify symbols from his binary. For another, you can
inspect the binary with the same technique as above and demonstrate
that the package in question, enginetest, wasn't making it into the
build.
% strings `which dolt` | grep enginetest | wc
0 0 0
This was very confusing to me, because I thought that the Go compiler did a pretty decent job at removing dead code (anything not reachable from a program's main method) from binaries. But as I found out, there are quite a few gotchas to this process.
Getting a dependency graph
Once I'd verified that the customer was correct and that test symbols were getting pulled into our binary, the next step was to figure out what was including them.
Go ships with a bunch of tooling that helps you understand your
dependency graph. The easiest one is go list -json, which gives you
a summary about your dependency closure. Running it on the main
package of dolt produces something like this (edited down for
brevity):
{
"Dir": "C:\\Users\\zachmu\\liquidata\\go-workspace\\src\\github.com\\dolthub\\dolt\\go\\cmd\\dolt",
"ImportPath": "github.com/dolthub/dolt/go/cmd/dolt",
"Name": "main",
"GoFiles": ["doc.go", "dolt.go", "fileno_check.go", "system_checks.go"],
"Imports": [
"context",
"crypto/rand",
"encoding/binary",
"fmt",
"github.com/dolthub/dolt/go/cmd/dolt/cli",
"github.com/dolthub/dolt/go/cmd/dolt/commands",
"github.com/dolthub/dolt/go/cmd/dolt/commands/admin",
"sync",
"time"
],
"Deps": [
"archive/zip",
"bufio",
"bytes",
"cloud.google.com/go/compute/metadata",
"cloud.google.com/go/iam",
"cloud.google.com/go/internal",
"cloud.google.com/go/internal/optional",
"cloud.google.com/go/internal/trace",
"cloud.google.com/go/internal/version",
"cloud.google.com/go/storage",
"github.com/stretchr/testify/assert",
"github.com/stretchr/testify/require"
]
}
The two sections relevant to this analysis are "Imports" and "Deps",
which are the lists of all packages imported by the files in
"GoFiles", and their transitive closure, respectively. Just like
running strings on the binary, this analysis clearly demonstrates
that testify and its dependencies are considered dependencies by the
Go compiler.
The next logical question is: why? What code is pulling these
dependencies in? To answer this next question, we need a little help
from a third-party tool, godepgraph. Unlike go list, godepgraph
provides a full listing of which packages depend on which other
packages in an application. We previously used it to trim away the 90%
of vitess source code that we weren't using.
Note: the README directions to install godepgraph didn't work for me,
as go get no longer installs modules in more recent versions of Go. I
had to install it like this:
go install github.com/kisielk/godepgraph@latest
This puts the godepgraph binary in $GOPATH/bin, which is on my
$PATH. Then I can invoke it on my main package like so:
godepgraph ./cmd/dolt > deps.dot
This generates a dot file that you can render into a graph image with
the dot program if you want. For dolt, this wasn't very useful. It
was either tiny and unreadable as a PNG, or a giant circuit board of
criss-crossing dependency lines as an SVG.
If your program is much smaller, your graph image might be more
useful. But the good news is that dot files are easy-to-parse text
files that you can inspect with standard Unix utilities like
grep. Each line in the file is one dependency edge: one package that
imports another. For example, here's how I found all the packages that
depend directly on testify:
% grep testify deps.dot
"github.com/dolthub/dolt/go/libraries/doltcore/dtestutils" -> "github.com/stretchr/testify/require";
"github.com/dolthub/dolt/go/libraries/doltcore/sqle" -> "github.com/stretchr/testify/require";
"github.com/dolthub/dolt/go/store/chunks" -> "github.com/stretchr/testify/assert";
"github.com/dolthub/dolt/go/store/d" -> "github.com/stretchr/testify/assert";
"github.com/dolthub/dolt/go/store/prolly/tree" -> "github.com/stretchr/testify/require";
"github.com/dolthub/go-mysql-server/server" -> "github.com/stretchr/testify/require";
These are all the packages in the binary that have a dependency on
testify. Now to fix them, one at a time.
Removing the test dependencies
It took a lot of trial and error to understand why each of these
packages was pulling in the testify library. At a high level, this
is how Go determines what to compile into a binary:
- Collect source files to compile. Files ending in _test.go are not considered and cannot end up in a binary built with go build or go run (but do get included for go test).
- Analyze the call graph reachable by the main method, and remove unreachable code (dead code detection).
The first rule is very simple to understand, and goes back to our
customer's observation: when you rename all the test files in
go-mysql-server to end in _test.go, it gets rid of test symbols in
the binary.
The second rule is much, much more subtle, and I couldn't find
documentation that lays out precisely how this step works (and I
wasn't willing to read the Go compiler source carefully enough to
understand). In principle, no test code should be reachable from our
main
package and so should all be getting pruned. But obviously,
this wasn't happening.
As it turns out, there are many gotchas to the current implementation of dead code detection that can end up including lots of code that will never be run in your binary, inflating the size on disk.
Dead code detection happens at the package level
Let's start with the simplest dependency on testify from above:
"github.com/dolthub/go-mysql-server/server" -> "github.com/stretchr/testify/require";
Some file in the server package depends on testify. But which
one? Go's structure makes this easy to answer. cd into the source
directory in question and run ag (like grep but a bit nicer for
many tasks):
go-mysql-server/server% ag -c testify * | grep -v _test.go
handler_test_common.go:1
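A text search can also match comments or string literals, so as an alternative sketch, the same question can be asked of the parsed import declarations using the standard go/parser package. The function names below are made up for this example, and the scan ignores build constraints, so treat it as a rough first pass rather than an exact model of what go build selects.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// importsPackage reports whether the Go source in src imports any
// package whose import path starts with prefix.
func importsPackage(src, prefix string) (bool, error) {
	// ImportsOnly stops parsing after the import declarations.
	file, err := parser.ParseFile(token.NewFileSet(), "src.go", src, parser.ImportsOnly)
	if err != nil {
		return false, err
	}
	for _, imp := range file.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return false, err
		}
		if strings.HasPrefix(path, prefix) {
			return true, nil
		}
	}
	return false, nil
}

// nonTestImporters lists the .go files in dir that do not end in
// _test.go but still import a package starting with prefix.
func nonTestImporters(dir, prefix string) ([]string, error) {
	names, err := filepath.Glob(filepath.Join(dir, "*.go"))
	if err != nil {
		return nil, err
	}
	var found []string
	for _, name := range names {
		if strings.HasSuffix(name, "_test.go") {
			continue
		}
		src, err := os.ReadFile(name)
		if err != nil {
			return nil, err
		}
		ok, err := importsPackage(string(src), prefix)
		if err != nil {
			return nil, err
		}
		if ok {
			found = append(found, filepath.Base(name))
		}
	}
	return found, nil
}

func main() {
	dir := "." // directory to scan; pass another as the first argument
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	files, err := nonTestImporters(dir, "github.com/stretchr/testify")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, f := range files {
		fmt.Println(f)
	}
}
```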
Sure enough, we have a test-only file that doesn't end in _test.go,
and it imports the testify package. Rather than rename it, we opted
to merge its contents into the only _test.go file that actually
used it, which accomplished the same thing. After this change,
godepgraph stopped reporting the testify dependency from the server
package.
This is the first gotcha with dead code detection in the Go compiler:
dependencies are not reliably pruned if they come from a package
which is used elsewhere in the binary. This was surprising to me, but
I double checked this result and kept getting the same answer. The
dolt code base does import the server package, which was in turn
importing testify. Even though this code wasn't reachable from
Dolt's main method, its presence in a package imported by reachable
dolt code caused it and its dependencies to get included in the
binary.
Using testify in production code
This was surprising to me, but there's nothing especially tricky about
it: dolt had production code packages that imported and used the
testify library directly. These were reachable from main and were
therefore included in the build.
Why would production code use a test assertion library? Good question.
This dependency came from code that we inherited. It was intended to
provide an equivalent of C-style debug assertions for invariants,
except that the Go compiler doesn't have a mode to exclude them. For
convenience, the original authors decided to build these checks on
testify's assert package.
// d.Chk.<Method>() -- used in test cases and as assertions
var (
Chk = assert.New(&panicker{})
)
type panicker struct {
}
func (s panicker) Errorf(format string, args ...interface{}) {
panic(fmt.Sprintf(format, args...))
}
This is nifty and lets you write assertion code like this:
d.Chk.True(to >= from)
d.Chk.Equal(oldHash, newHash)
d.Chk.NotEmpty(fields)
But it does have the penalty of including the entire testify library
and its dependencies (which are pretty big) in your binary. This
helps explain why we never noticed this problem before a customer
brought it to our attention: dolt binaries have always shipped with
testify, and we probably just thought it was normal.
The fix is quite simple: define your own interface to do these checks instead of relying on a testing library.
// Assertion is an interface that provides convenient methods for asserting invariants.
// Methods panic if the invariant isn't met.
type Assertion interface {
NoError(err error)
True(b bool)
}
// Chk will panic if an assertion made on it fails
var (
Chk Assertion = &panicker{}
)
type panicker struct {}
func (s *panicker) NoError(err error) {
PanicIfError(err)
}
func (s *panicker) True(b bool) {
PanicIfFalse(b)
}
Testing libraries
Dolt has a lot of tests, and their setup can be quite complex. So we
often share test setup code in library packages that are then imported
as needed by tests in the rest of the code base. Because they're
consumed outside the package, these methods need to be exported, and
for that reason the test files can't end in _test.go. But even so:
as long as you don't mix test and production code into such a package,
and it's only imported from files ending in _test.go, it won't make
it into the build, right? In principle, yes. In practice, this can be
hard to enforce.
We have one package in particular where a lot of common test setup
code lives: dtestutils.
"github.com/dolthub/dolt/go/libraries/doltcore/dtestutils" -> "github.com/stretchr/testify/require";
But no non-test file imports that package, right? Wrong again.
% grep dtestutils deps.dot
"github.com/dolthub/dolt/go/libraries/doltcore/sqle" -> "github.com/dolthub/dolt/go/libraries/doltcore/dtestutils";
What's going on here?
libraries/doltcore/sqle% ag -c dtestutils | grep -v test.go
testdata.go:10
testutil.go:6
Two test-only files that don't end in _test.go and are therefore
getting compiled into our binary. Unlike with go-mysql-server, these
files were themselves testing utility libraries, which meant that I
had to track down their consumers using the above process (also helped
by my IDE). In many cases I found that these test library methods only
had consumers in a single package, often a single file, so I just
moved those methods there. For the small remaining kernel of test
functionality that actually needed to be exported for use in tests in
other packages, I had to take a couple of steps:
- Remove all dependencies on dtestutils from the sqle package. I knew the dtestutils package was problematic, so I eliminated the multi-level test-library dependency by duplicating or simplifying test setup code where necessary. Now the exported files in sqle no longer depended on dtestutils. So far so good.
- Remove all references to testify from testutil.go and testdata.go. Because sqle is a "mixed use" package containing both normal production code and these exported test library files, I had to assume that any dependencies I included would be picked up by any Go source file that imported the sqle package, as I learned when fixing the same problem in go-mysql-server's server package.
Fortunately, the dependency on testify was easily expunged with the
help of my IDE. The test library wasn't actually running any test case
assertions -- it was only using the testify library to assert that
no errors occurred during setup. So code that looked like this:
func NewTestEngine(t *testing.T, dEnv *env.DoltEnv, db Database, root doltdb.RootValue) (*sqle.Engine, *sql.Context, error) {
b := env.GetDefaultInitBranch(dEnv.Config)
pro, err := NewDoltDatabaseProviderWithDatabase(b, dEnv.FS, db, dEnv.FS)
require.NoError(t, err)
}
Became standard Go error handling code like this:
func NewTestEngine(dEnv *env.DoltEnv, db Database, root doltdb.RootValue) (*sqle.Engine, *sql.Context, error) {
b := env.GetDefaultInitBranch(dEnv.Config)
pro, err := NewDoltDatabaseProviderWithDatabase(b, dEnv.FS, db, dEnv.FS)
if err != nil {
return nil, nil, err
}
}
Another option: I could have moved these exported test setup methods to their own package so they weren't intermingled with non-test code. That was a bigger change which I'll circle back to pick up later.
The treachery of init blocks
But even after getting rid of the sqle package's dependency on the
dtestutils package, I could see from examining the output of
strings that the dtestutils package, as well as testify, were
still getting compiled into my binary, and I couldn't figure out
why. After banging my head against the wall for a while, I found a
StackOverflow post hinting that init() methods in a package can break
the Go compiler's dead code detection, since it can't know ahead of
time whether the code executed in an init() block might have important
side effects for the rest of the program. And sure enough, dtestutils
had an init() method.
// modified by init()
var TypedSchema = schema.MustSchemaFromCols(typedColColl)
// modified by init()
var TypedRows []row.Row
func init() {
for i := 0; i < len(UUIDS); i++ {
married := types.Int(0)
if MaritalStatus[i] {
married = types.Int(1)
}
taggedVals := row.TaggedValues{
IdTag: types.String(UUIDS[i].String()),
NameTag: types.String(Names[i]),
AgeTag: types.Uint(Ages[i]),
TitleTag: types.String(Titles[i]),
IsMarriedTag: married,
}
r, err := row.New(types.Format_Default, TypedSchema, taggedVals)
if err != nil {
panic(err)
}
TypedRows = append(TypedRows, r)
}
_, err := TypedSchema.Checks().AddCheck("test-check", "age < 123", true)
if err != nil {
panic(err)
}
}
We wrote it this way (I think, it was a long time ago) so that tests
consuming the TypedRows and TypedSchema exported variables could
do so with a simple reference as needed, and this test data would get
set up a single time to be re-used in every test that needed it. But
this decision came at a cost: the init() method meant that this
test-only code was always considered live by the Go compiler, even
when it wasn't referenced by any non-test file.
The fix was to get rid of the init() block and just create these
shared objects on demand as necessary.
// Schema returns the schema for the `people` test table.
func Schema() (schema.Schema, error) {
var typedColColl = schema.NewColCollection(
schema.NewColumn("id", IdTag, types.StringKind, true, schema.NotNullConstraint{}),
schema.NewColumn("name", NameTag, types.StringKind, false, schema.NotNullConstraint{}),
schema.NewColumn("age", AgeTag, types.UintKind, false, schema.NotNullConstraint{}),
schema.NewColumn("is_married", IsMarriedTag, types.IntKind, false, schema.NotNullConstraint{}),
schema.NewColumn("title", TitleTag, types.StringKind, false),
)
sch := schema.MustSchemaFromCols(typedColColl)
_, err := sch.Checks().AddCheck("test-check", "age < 123", true)
if err != nil {
return nil, err
}
return sch, nil
}
This lets the Go compiler correctly infer that this entire package is
unused by the main()
method despite having exported methods, and
therefore exclude it from the binary.
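Creating the objects on demand gives up the "set up a single time" property that the init() block provided. If that matters for your tests, one option (not what we did here, just a sketch) is to build the data lazily behind sync.Once, so the work happens on first use rather than at program start. The testRow type, the sample data, and the buildCount counter below are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// testRow is a hypothetical stand-in for a real row type.
type testRow struct {
	Name string
	Age  uint
}

var (
	rowsOnce   sync.Once
	rows       []testRow
	buildCount int // counts how often the fixture was built (demo only)
)

// TypedRows builds the shared fixture on first call and returns the
// same slice on every later call. Nothing runs at package init time.
func TypedRows() []testRow {
	rowsOnce.Do(func() {
		buildCount++
		rows = []testRow{
			{Name: "alice", Age: 32},
			{Name: "bob", Age: 25},
		}
	})
	return rows
}

func main() {
	first := TypedRows()
	second := TypedRows()
	fmt.Println(len(first), len(second), buildCount)
}
```

Callers that share the returned slice should treat it as read-only, since every test sees the same backing data.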
Results
This was a lot of work! What did we accomplish?
After all these changes, we finally see that testify
is no longer
included in the dolt
binary. Nor is our internal testing library.
% strings `which dolt` | grep testify | wc
0 0 0
% strings `which dolt` | grep dtestutils | wc
0 0 0
We can also see that the output of go list -json
has shrunk
considerably, reflecting that there are fewer libraries making it into
our binary.
% wc deps-before.json
815 851 32231 deps-before.json
% wc deps-after.json
735 771 29193 deps-after.json
When you diff these two files, you can see that the reduction is due to many packages no longer being included (pruned for brevity):
% diff -u deps-before.json deps-after.json
...
"github.com/golang/protobuf/ptypes/timestamp",
"github.com/golang/snappy",
"github.com/google/flatbuffers/go",
- "github.com/google/go-cmp/cmp",
- "github.com/google/go-cmp/cmp/internal/diff",
- "github.com/google/go-cmp/cmp/internal/flags",
- "github.com/google/go-cmp/cmp/internal/function",
- "github.com/google/go-cmp/cmp/internal/value",
"github.com/google/uuid",
"github.com/googleapis/gax-go/v2",
"github.com/hashicorp/golang-lru",
@@ -450,8 +415,6 @@
"github.com/silvasur/buzhash",
"github.com/sirupsen/logrus",
"github.com/skratchdot/open-golang/open",
- "github.com/stretchr/testify/assert",
- "github.com/stretchr/testify/require",
"github.com/tealeg/xlsx",
"github.com/vbauerster/mpb/v8/cwriter",
"github.com/xitongsys/parquet-go-source/local"
...
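If you'd rather see only the removed packages without the surrounding diff context, the comparison is a simple set difference over the two "Deps" lists. The sketch below uses a hypothetical removedDeps helper and a handful of package names taken from the diff above as sample input.

```go
package main

import (
	"fmt"
	"sort"
)

// removedDeps returns the packages present in before but absent from
// after, in sorted order.
func removedDeps(before, after []string) []string {
	kept := make(map[string]bool, len(after))
	for _, dep := range after {
		kept[dep] = true
	}
	var removed []string
	for _, dep := range before {
		if !kept[dep] {
			removed = append(removed, dep)
		}
	}
	sort.Strings(removed)
	return removed
}

func main() {
	// A few entries from the diff above, standing in for the full lists.
	before := []string{
		"github.com/google/go-cmp/cmp",
		"github.com/google/uuid",
		"github.com/stretchr/testify/assert",
		"github.com/stretchr/testify/require",
	}
	after := []string{"github.com/google/uuid"}

	for _, dep := range removedDeps(before, after) {
		fmt.Println("-", dep)
	}
}
```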
Finally, examining the size of the binary before and after shows that
this dead code makes a real difference in size: dolt shrank from
75MB to 68MB, a savings of roughly 9%.
Not all of this savings came from test dependencies: in the process of hunting down these dependencies, I also fixed other code, like benchmarks, that was getting included in the binary inappropriately. The same lessons apply there.
Take-aways
The main take-away from this exercise is that the Go compiler's dead code detection is not that great, and it's pretty easy to end up with dead code and its dependencies in your binaries. Whether this matters to you or not depends on how much you care about binary sizes, build times, your S3 egress bill, etc. But it's a real concern that Go developers should be aware of.
Here are some high-level pieces of advice I would pass on to anyone concerned about this problem in their binaries.
- The most foolproof way to make sure test code doesn't make it into your binary is to give the file a name ending in _test.go. These files aren't even passed to the compiler when you run go build or go run.
- Do not mix testing libraries or benchmark code with production code in the same package. For common testing libraries that need to be exported for use in tests in other packages, make very sure that the entire package only contains test code.
- Assume that the Go compiler will include all the code in any required package. In some cases, the Go compiler will bring in all of a package's code when only a small fraction is actually reachable from main(). Be defensive: keep packages small and with a narrow dedicated purpose. Think about code liveness at the package boundary, rather than the method or file boundary.
- Avoid init() methods, as they will cause an entire package to be included in a binary. init() blocks have a lot of issues separate from dead code detection, and should be used very sparingly.
- And though it should go without saying, don't purposefully take test dependencies in your production code like we did. We did name our company DoltHub, what do you expect.
Conclusion
Dead code detection in the Go compiler currently comes with a lot of gotchas and leaves a lot to be desired, but there are some simple rules you can implement to minimize the impact of this problem. Hopefully you find these tips useful when trying to prune unwanted code from your own Go binaries.
This article is part of our blog series about the Go programming language, where we discuss things about the language we learn while writing our database. Like what you see? Have a comment? Want to learn more about Dolt? Come talk to our engineering team on Discord.