Error handling with errgroups
Here at DoltHub, we write a lot of software in Go. We've previously blogged a little bit about error handling and errgroup in particular, but today I wanted to share two patterns regarding errgroups that came up recently in some code changes.
Be Careful with Nested Errgroups
We love errgroups so much that sometimes we nest them! We recently came across the following pattern in our code:
type MapBuilder interface {
// Call after all entries have been added with |AddEntry|. Returns a
// non-nil error if any error occurs.
Wait() (Map, error)
// Add an entry to the Map being built. Thread-safe.
AddEntry(MapEntry) error
}
func NewMapBuilder(ctx context.Context) MapBuilder {
eg, ctx := errgroup.WithContext(ctx)
ch := make(chan MapEntry)
mb := &AsyncMapBuilder{eg, ch}
eg.Go(func() error {
for {
select {
case e, ok := <-ch:
if !ok {
return mb.finalizeMap()
}
mb.addEntry(e)
case <-ctx.Done():
return ctx.Err()
}
}
})
return mb
...
}
func (b *AsyncMapBuilder) Wait() (Map, error) {
close(b.ch)
err := b.eg.Wait()
if err != nil {
return nil, err
}
return b.map, nil
}
Internally, the MapBuilder was using an *errgroup.Group
to coordinate
background goroutines responsible for I/O work, in-memory sorting and
buffering, etc.
MapBuilder was being used by a higher level process which was implementing a
core piece of the dolt merge
logic. That process, in turn, was using an
*errgroup.Group
:
func Merge(ctx context.Context, base, left, right Map) (value Map, conflicts Map, err error) {
eg, ctx := errgroup.WithContext(ctx)
leftDiff, rightDiff := make(chan DiffEntry), make(chan DiffEntry)
eg.Go(func() error {
defer close(leftDiff)
// Sends diff entries to leftDiff
return diff.DiffMaps(ctx, base, left, leftDiff)
})
eg.Go(func() error {
defer close(rightDiff)
// Sends diff entries to rightDiff
return diff.DiffMaps(ctx, base, right, rightDiff)
})
conflictsBuilder := NewMapBuilder(ctx)
eg.Go(func() error {
var err error
// Builds result map by applying diffs of left and right to base.
// Returns when |leftDiff| and |rightDiff| are closed.
// Sends conflicts to conflictsEntries
value, err = buildMergeMapFromDiffs(ctx, base, leftDiff, rightDiff, conflictsBuilder)
return err
})
err = eg.Wait()
if err != nil {
return
}
conflicts, err = conflictsBuilder.Wait()
return
}
When I first came across a failing test associated with this code, it didn't appear glaringly wrong. But there's definitely a problem.
The code as presented has a race condition because it calls Wait()
on eg
before it calls Wait()
on conflictsBuilder
. When we Wait()
on the
errgroup, it will cancel its context even when every spawned go routine has
returned nil
. That means the subcontext that conflictsBuilder
is using
also gets canceled.
We need to always performs Waits on sub-work within goroutines that belong to the errgroup itself. That way the context will not get canceled until all the work associated with the subcontexts have had an opportunity to complete successfully. A simple change is just:
eg.Go(func() (err error) {
// Builds result map by applying diffs of left and right to base.
// Sends conflicts to conflictsEntries
value, err = buildMergeMapFromDiffs(base, leftDiff, rightDiff, conflictsEntries)
+ if err == nil {
+ conflicts, err = conflictsBuilder.Wait()
+ }
return
})
err = eg.Wait()
if err != nil {
return
}
- conflicts, err = conflictsBuilder.Wait()
return
}
A good rule of thumb for an errgroup confined to a lexical context is: anything that can error within that context should return its error from within an errgroup go routine. That way there is one single place for an error to be signaled and propagated.
Canceling an errgroup
Let's say you have an iterator interface for a SQL engine:
type RowIter interface {
Next() (Row, error)
Close() error
}
And you want to build an asynchronous buffering implementation that will allow for some read ahead. We can do it just with errgroups with something like:
type AsyncReadAheadRowIter struct {
iter RowIter
ch <-Row
eg *errgroup.Group
cancel func()
}
func (i *AsyncReadAheadRowIter) Next() (Row, error) {
r, ok := <-i.ch
if !ok {
return nil, i.eg.Wait()
}
return r, nil
}
func (i *AsyncReadAheadRowIter) Close() error {
i.cancel()
// We do not return the errgroup error here.
// Read-ahead is only calling Next(), and if it encountered an error,
// it should be returned at the correct place in calls to Next().
i.eg.Wait()
return i.iter.Close()
}
func NewAsyncReadAheadRowIter(ctx context.Context, iter RowIter) *AsyncReadAheadRowIter {
ctx, cancel := context.WithCancel(ctx)
eg, ctx := errgroup.WithContext(ctx)
ch := make(chan Row, ReadAheadBufferSize)
eg.Go(func() error {
defer close(ch)
for {
r, err := iter.Next()
if err != nil {
return err
}
select {
case ch <- r:
case <-ctx.Done():
return ctx.Err()
}
}
})
return &AsyncReadAheadRowIter{iter, ch, eg, cancel}
}
Special care has been taken to cancel the read ahead context from within
Close()
so that we don't leak the goroutine attempting to write on the
results channel.
If we're familiar with the errgroup implementation, we can see that we've
created two subcontexts with two cancel functions one after the other in the
implementation of NewAsyncReadAheadRowIter
. Having a separate context for
cancellation vs the errgroup context itself can be useful in some cases, but in
this case it's unnecessary. We can achieve the same result by using the
errgroup itself:
type AsyncReadAheadRowIter struct {
iter RowIter
ch <-Row
eg *errgroup.Group
- cancel func()
}
func (i *AsyncReadAheadRowIter) Close() error {
- i.cancel()
+ eg.Go(func() error {
+ return context.Canceled
+ })
// We do not return the errgroup error here.
// Read-ahead is only calling Next(), and if it encountered an error,
// it should be returned at the correct place in calls to Next().
i.eg.Wait()
return i.iter.Close()
}
func NewAsyncReadAheadRowIter(ctx context.Context, iter RowIter) *AsyncReadAheadRowIter {
- ctx, cancel := context.WithCancel(ctx)
eg, ctx := errgroup.WithContext(ctx)
ch := make(chan Row, ReadAheadBufferSize)
eg.Go(func() error {
defer close(ch)
for {
r, err := iter.Next()
if err != nil {
return err
}
select {
case ch <- r:
case <-ctx.Done():
return ctx.Err()
}
}
})
- return &AsyncReadAheadRowIter{iter, ch, eg, cancel}
+ return &AsyncReadAheadRowIter{iter, ch, eg}
}
In this example, this may very well not be right tradeoff — our Close()
function took on a fair amount of overhead to avoid carrying around the
cancel
func. But the pattern of injecting an error returning function into an
errgroup to achieve cancellation is a tool that can be useful sometimes.
One particular case it has come up in our code is when you have an errgroup
that the code needs to be able to cancel explicitly. On the surface, just using
context.WithCancel
seems to achieve that, but it can be quite subtle to
structure the code and all dependent goroutines in such a way that they can
distinguish between the cancellation context that is actually of interest being
canceled, vs. a parent of that context being canceled.
If we are using errgroups, and we actually want fast finalization and the ability to distinguish between an explicit shutdown at this layer of the software, vs. a canceled context or a deadline exceeded from above us, we can explicitly inject an error of a distinguishable value into our errgroup:
var ErrShutdownRequested = errors.New("shutdown requested")
func Shutdown(eg *errgroup.Group) {
eg.Go(func() error {
return ErrShutdownRequested
})
}
func WaitIgnoringShutdown(eg *errgroup.Group) error {
err := eg.Wait()
if err == ErrShutdownRequested {
return nil
}
return err
}
The end result is that if the ErrShutdownRequested
error is the first error
the errgroup sees, that's the error the errgroup returns. And the errgroup gets
to see that error before it cancels its associated context, so there is no
racing on something like checking a parent context's Err()
and then canceling
a child context. On the balance, we find this to be a reasonable way to achieve
quick shutdown and distinguish between various error cases at certain places in
our software.
Conclusion
We looked at two handling patterns involving errgroups in Go. One is a caveat to be on the lookout for, where asynchronous work captures or takes a dependency on the errgroup context but the result of that work is not actually finalized within the errgroup itself. Such cases should be avoided because they are inherently racey. The second was a simple pattern for achieving recognizable shutdown of an errgroup and its Context.
We hope that seeing these patterns in the example contexts above helps you recognize them and use them in the future in your go code.
Questions/comments on how to best achieve robust error handling in your go code? Join us on Discord.