Optimizing varint Decoding
Introduction
Dolt stores data in a content-addressable prolly tree in order to get efficient merges and diffs. In designing the table data format, one of our goals was to make table column additions and deletions fast operations: they should not require a full table rewrite, the diff of rows with different schemas should show the data changes rather than being obscured by the schema changes, and diffs and merges should be efficient across states with different schemas.
With those goals in mind, we store table data within the prolly tree as a map from a key tuple to a value tuple. Every value within a tuple in Dolt has a column identifier which we call a "tag": a simple unsigned integer. When the schema for a table changes, none of the tuples are touched. If a column is deleted, the tag and value associated with that column are removed the next time the row is written, and when a row containing a value associated with a deleted column is read, that data is ignored.
We achieved all of our goals with this design, but one consequence is that, at a bare minimum, half the values we deserialize from the prolly tree are unsigned integers serialized using the Google varint format.
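To make the layout concrete, here is a minimal sketch of the idea, assuming for simplicity that the values are also unsigned integers. The helper below is illustrative only, not Dolt's actual tuple serialization code.
// Hypothetical sketch: encode a tuple as alternating (tag, value) pairs,
// both written as varints. Illustrative only; not Dolt's real format.
func encodeTaggedTuple(tags, vals []uint64) []byte {
	var out []byte
	tmp := make([]byte, binary.MaxVarintLen64)
	for i := range tags {
		n := binary.PutUvarint(tmp, tags[i]) // column tag
		out = append(out, tmp[:n]...)
		n = binary.PutUvarint(tmp, vals[i]) // value
		out = append(out, tmp[:n]...)
	}
	return out
}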
binary.Uvarint()
In working on performance, I have profiled a lot of different operations. Though never a huge portion of execution time, binary.Uvarint() kept showing up taking a much larger percentage of execution time than I would have expected.
// Uvarint decodes a uint64 from buf and returns that value and the
// number of bytes read (> 0). If an error occurred, the value is 0
// and the number of bytes n is <= 0 meaning:
//
//	n == 0: buf too small
//	n < 0: value larger than 64 bits (overflow)
//	       and -n is the number of bytes read
//
func Uvarint(buf []byte) (uint64, int) {
	var x uint64
	var s uint
	for i, b := range buf {
		if b < 0x80 {
			if i > 9 || i == 9 && b > 1 {
				return 0, -(i + 1) // overflow
			}
			return x | uint64(b)<<s, i + 1
		}
		x |= uint64(b&0x7f) << s
		s += 7
	}
	return 0, 0
}
The function is pretty simple. For every byte, take the low 7 bits, shift them into the next position, and bitwise-or them onto the result being accumulated. If the current byte has its highest-order bit set, continue on to the next byte; if not, return the accumulated result and the number of bytes used in decoding.
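As a concrete example, the value 300 encodes to the two bytes 0xAC 0x02, and decoding walks them like this:
// 0xAC = 1010 1100: low 7 bits are 0x2C (44), high bit set, so continue
// 0x02 = 0000 0010: low 7 bits are 0x02, shifted left by 7 gives 256, high bit clear, so stop
val, n := binary.Uvarint([]byte{0xAC, 0x02})
// val == 44 | 256 == 300, n == 2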
So why is it slow? It is very "branchy". For every iteration there are 2 branching conditions, and a 3rd branching condition on the final byte. This can wreak havoc on the instruction execution pipeline.
Optimizing
In trying to find a faster solution I tried searching for faster existing implementations, unrolling the loop of the existing implementation, and writing a branchless implementation. What follows is a discussion of each optimization, how I benchmarked them against each other, and an analysis of the results.
Existing Implementations
The first thing I did was search for existing implementations that purported to be faster than binary.Uvarint. The only one I found written in golang was https://github.com/dennwc/varint. It advertises that it is 30% to 50% faster depending on the number of bytes being decoded.
Unrolling the Loop
Loop unrolling eliminates branching conditions by duplicating logic. Modern compilers are very good and will often unroll simple loops to eliminate branching conditions; however, the Go compiler chooses not to. Take the code:
func LoopUnrollTest() int {
	var count int
	for i := 0; i < 1; i++ {
		count++
	}
	return count
}
This for loop runs the code inside exactly once. Yet the generated assembly does not optimize away the comparison and jump instructions.
0x0000 00000 (temp.go:3) TEXT "".LoopUnrollTest(SB), NOSPLIT|ABIInternal, $0-8
0x0000 00000 (temp.go:3) FUNCDATA $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (temp.go:3) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (temp.go:3) XORL AX, AX
0x0002 00002 (temp.go:5) JMP 7
0x0004 00004 (temp.go:6) INCQ AX
0x0007 00007 (temp.go:5) CMPQ AX, $1
0x000b 00011 (temp.go:5) JLT 4
0x000d 00013 (temp.go:9) MOVQ AX, "".~r0+8(SP)
0x0012 00018 (temp.go:9) RET
I tried many variations, and as far as I can tell the Go compiler never unrolls loops.
As written, binary.Uvarint cannot be unrolled, as it checks against the length of buf at the end of every loop iteration and only exits when it reaches the end of buf or when it finds a byte that does not have the most significant bit set. I find this behavior odd. If buf were 64K bytes all set to 0xFF, it would iterate over all 64K values before returning 0, 0. Any 64-bit number can be encoded in 10 bytes, and anything more than that will overflow. The goals that led to this implementation may not match my goals exactly, and I can improve performance by diverging from them here. Because Dolt only ever decodes values that it wrote, I don't need to worry that somebody encoded a 512-bit varint which I need to skip gracefully. I'm fine failing after 10 bytes saying the value is too big, and having the caller handle that case as corruption. I'm also fine handling out-of-bounds errors externally as data corruption.
func unrolledDecodeUVarint(buf []byte) (uint64, int) {
	b := uint64(buf[0])
	if b < 0x80 {
		return b, 1
	}

	x := b & 0x7f
	b = uint64(buf[1])
	if b < 0x80 {
		return x | (b << 7), 2
	}

	x |= (b & 0x7f) << 7
	b = uint64(buf[2])
	if b < 0x80 {
		return x | (b << 14), 3
	}

	x |= (b & 0x7f) << 14
	b = uint64(buf[3])
	if b < 0x80 {
		return x | (b << 21), 4
	}

	x |= (b & 0x7f) << 21
	b = uint64(buf[4])
	if b < 0x80 {
		return x | (b << 28), 5
	}

	x |= (b & 0x7f) << 28
	b = uint64(buf[5])
	if b < 0x80 {
		return x | (b << 35), 6
	}

	x |= (b & 0x7f) << 35
	b = uint64(buf[6])
	if b < 0x80 {
		return x | (b << 42), 7
	}

	x |= (b & 0x7f) << 42
	b = uint64(buf[7])
	if b < 0x80 {
		return x | (b << 49), 8
	}

	x |= (b & 0x7f) << 49
	b = uint64(buf[8])
	if b < 0x80 {
		return x | (b << 56), 9
	}

	x |= (b & 0x7f) << 56
	b = uint64(buf[9])
	if b < 0x80 {
		return x | (b << 63), 10
	}

	return 0, -10
}
It's not pretty, but it eliminates a branch condition for every byte, it eliminates the need to accumulate a count of decoded bytes, and it eliminates the need to track the number of bits to shift by. As soon as it encounters a value that can't be encoded in 64 bits it returns -10 (because it looked at 10 bytes and could see it would overflow at that point, and the spec expects a negative number in that case).
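Since the unrolled version only diverges from the standard library on exceptional inputs, it is cheap to cross-check it against binary.Uvarint on valid encodings before benchmarking it. A sketch (the test name and iteration count here are arbitrary):
func TestUnrolledMatchesStdlib(t *testing.T) {
	r := rand.New(rand.NewSource(0))
	buf := make([]byte, binary.MaxVarintLen64)
	for i := 0; i < 1000000; i++ {
		want := r.Uint64()
		wantSize := binary.PutUvarint(buf, want)
		got, gotSize := unrolledDecodeUVarint(buf)
		if got != want || gotSize != wantSize {
			t.Fatalf("decoding %x: got (%d, %d), want (%d, %d)",
				buf[:wantSize], got, gotSize, want, wantSize)
		}
	}
}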
Branchless Implementation
I worked in console gaming when the Xbox 360 and PS3 were released. Those machines had exceptionally poor branching performance, and the pixel/vertex shaders may not have even supported branching (if they did, it was unusable because of how slow it was). Rewriting code to be branchless often resulted in huge gains. Branching is still slow today, though not nearly as slow as it used to be, and I wanted to see how a branchless implementation would perform out of pure curiosity.
func varuintNoBranch(buf []byte) (uint64, int) {
	count := uint64(1)
	more := uint64(1)

	b := uint64(buf[0])
	x := more * (b & 0x7f)
	more &= (b & 0x80) >> 7
	count += more

	b = uint64(buf[1])
	x |= more * ((b & 0x7f) << 7)
	more &= (b & 0x80) >> 7
	count += more

	b = uint64(buf[2])
	x |= more * ((b & 0x7f) << 14)
	more &= (b & 0x80) >> 7
	count += more

	b = uint64(buf[3])
	x |= more * ((b & 0x7f) << 21)
	more &= (b & 0x80) >> 7
	count += more

	b = uint64(buf[4])
	x |= more * ((b & 0x7f) << 28)
	more &= (b & 0x80) >> 7
	count += more

	b = uint64(buf[5])
	x |= more * ((b & 0x7f) << 35)
	more &= (b & 0x80) >> 7
	count += more

	b = uint64(buf[6])
	x |= more * ((b & 0x7f) << 42)
	more &= (b & 0x80) >> 7
	count += more

	b = uint64(buf[7])
	x |= more * ((b & 0x7f) << 49)
	more &= (b & 0x80) >> 7
	count += more

	b = uint64(buf[8])
	x |= more * ((b & 0x7f) << 56)
	more &= (b & 0x80) >> 7
	count += more

	b = uint64(buf[9])
	x |= (more & b & 0x1) << 63
	more &= (b & 0x80) >> 7

	retCount := int(count) * (-2*int(more) + 1)
	return x, retCount
}
The trick here is to keep a boolean more stored as a uint where 1 is true and 0 is false. It is initially 1 and gets cleared when the top bit is not set. By adding it to an int at every step, you get a count of the number of bytes read in decoding. You can then use that bool as part of a mathematical operation instead of using an if statement:
if more == 1 {
	x |= (b & 0x7f) << shift
}
is equivalent to:
x |= more * ((b & 0x7f) << shift)
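Tracing the two-byte encoding of 300 (0xAC 0x02) through the first two steps shows more doing the work the if statement used to:
b = 0xAC                      // high bit set
x = 1 * (0xAC & 0x7f)         // x = 44
more &= (0xAC & 0x80) >> 7    // more stays 1
count += more                 // count = 2

b = 0x02                      // high bit clear
x |= 1 * ((0x02 & 0x7f) << 7) // x = 44 | 256 = 300
more &= (0x02 & 0x80) >> 7    // more becomes 0
count += more                 // count stays 2
// every later step multiplies by more == 0, so x and count never change again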
Benchmarking Data Setup
When generating the test data, I wanted an even distribution of values that would decode from buffers of 1 to 10 bytes. Taking a random uint64 wouldn't give me a good distribution, so I create the data like so:
// ve pairs a value with its varint encoding.
type ve struct {
	val      uint64
	encoding []byte
}

func initToDecode(b *testing.B, numItems int) []ve {
	toDecode := make([]ve, b.N*numItems)
	r := rand.New(rand.NewSource(0))
	for i := 0; i < b.N*numItems; i++ {
		desiredSize := (i % 10) + 1

		min := uint64(0)
		max := uint64(0x80)
		if desiredSize < 10 {
			for j := 0; j < desiredSize-1; j++ {
				min = max
				max <<= 7
			}
		} else {
			min = 0x8000000000000000
			max = 0xffffffffffffffff
		}

		val := min + (r.Uint64() % (max - min))
		buf := make([]byte, 10)
		size := binary.PutUvarint(buf, val)
		require.Equal(b, desiredSize, size, "%d. min: %x, val: %x, expected_size: %d, size: %d", i, min, val, desiredSize, size)
		toDecode[i] = ve{val, buf}
	}

	return toDecode
}
This function generates an array of test data where the nth element decodes from (n%10) + 1 bytes. After generating a number in the correct value range, binary.PutUvarint is used to encode it, and its encoded size is validated against what we expected.
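Working the min/max loop through by hand, the ranges the values are drawn from look like this:
// desiredSize == 1:  [0x0,                0x80)
// desiredSize == 2:  [0x80,               0x4000)
// desiredSize == 3:  [0x4000,             0x200000)
// ...each additional byte multiplies both bounds by 128...
// desiredSize == 9:  [0x100000000000000,  0x8000000000000000)
// desiredSize == 10: [0x8000000000000000, 0xffffffffffffffff]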
Benchmarks
Finally, our benchmark calls the initialization code and allocates storage for the results before kicking off the individual decoding implementation benchmarks. After all the benchmarks run, the results are validated.
func BenchmarkUnrolledDecodeUVarint(b *testing.B) {
	const DecodesPerTest = 10000000
	toDecode := initToDecode(b, DecodesPerTest)

	type result struct {
		size int
		val  uint64
	}
	decodeBenchmark := []struct {
		name       string
		decodeFunc func([]byte) (uint64, int)
		results    []result
	}{
		{"binary.UVarint", binary.Uvarint, make([]result, len(toDecode))},
		{"noBranch", varuintNoBranch, make([]result, len(toDecode))},
		{"dennwc.varint.UVarint", varint.Uvarint, make([]result, len(toDecode))},
		{"unrolled", unrolledDecodeUVarint, make([]result, len(toDecode))},
	}

	b.ResetTimer()
	for _, decodeBench := range decodeBenchmark {
		b.Run(decodeBench.name, func(b *testing.B) {
			for i, valAndEnc := range toDecode {
				val, size := decodeBench.decodeFunc(valAndEnc.encoding)
				decodeBench.results[i] = result{size, val}
			}
		})
	}
	b.StopTimer()

	for _, decodeBench := range decodeBenchmark {
		for i, valAndEnc := range toDecode {
			assert.Equal(b, valAndEnc.val, decodeBench.results[i].val)
			assert.Equal(b, i%10+1, decodeBench.results[i].size)
		}
	}
}
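With all four implementations in the same package, the comparison runs with the standard Go tooling; for example:
go test -run '^$' -bench BenchmarkUnrolledDecodeUVarint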
The Results
binary.UVarint 0.439 ns/op
noBranch 0.539 ns/op
dennwc.varint.UVarint 0.341 ns/op
unrolled 0.287 ns/op
As you can see, the existing binary.Uvarint implementation is faster than the non-branching implementation. This doesn't surprise me: even though it doesn't need to branch, the branchless version does twice as much work on average (binary.Uvarint processes 5 bytes on average, whereas the no-branch solution always processes 10). As advertised, dennwc's varint implementation is faster than binary.Uvarint. It does some things very similar to our unrolled version, but it chooses to keep the same behavior as binary.Uvarint, and we also use constant shift amounts rather than keeping a variable to track how much to shift by. In the end, it is these differences that make our implementation faster. By unrolling the loop and making slight changes to how we handle exceptional conditions, we are roughly 16% faster than dennwc's varint implementation and about 35% faster than binary.Uvarint.
Conclusion
Improving performance is something we are focused on at the moment. This optimization has the greatest impact when working with wide rows, and it isn't that impactful compared with many of the other changes we've been making, but it was valuable to understand modern branching performance and to take a deeper look at how the Go compiler handles loops. We are building better tooling internally for measuring performance changes and for profiling queries run on DoltHub and on local Dolt instances.