A nasty bit of undefined timezone behavior in Golang
Go is a great language. Really, it is! We complain about the rough edges, but on the whole it's been a great choice for us, and we're not sorry we picked it.
But just for fun, let's talk about some of those rough edges at length.
That was a pot shot. There are some actual foot-guns hidden in the runtime:
- Reads from a
nil
map are fine, writes panic - Reads from a closed channel block, sends on a closed channel panic
nil
sometimes does not== nil
- Any panic in any goroutine kills the entire process
- Can't safely store references to loop variables
Honestly, you get used to all these things. No language is without its quirks, and terrible design choices are no barrier to massive popularity. Just ask Rasmus Lerdorf, creator of PHP, by any measure one of the world's most successful languages.
But we found a new bug in golang, one that stumped an entire room of seasoned industry veterans for a solid hour. I actually hesitate to even call it a bug, because it's so obscure, and the trigger mechanism so specific to our unit test that it broke, that most people will probably never run into it. So, maybe surprising undefined behavior is a better description?
Discovering the bug
A new customer came to us with a problem: his R program writing rows to Dolt wasn't actually writing any rows. This was pretty surprising, since at this point we've integrated customers on a variety of platforms and languages, and they all manage to get basic read / write functionality working. That's the benefit of being MySQL compliant: if it works for one language or connector, that's pretty decent evidence it will work for others. But he sent us a repro case that demonstrated the problem. Something wasn't working.
statement <- enc2utf8(new("SQL", .Data = "INSERT INTO `mtcars`\n (`mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`)\nVALUES\n (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"))
params <- unname(as.list(mtcars))
rs <- new("MariaDBResult",
sql = statement,
ptr = RMariaDB:::result_create(conn@ptr, statement, is_statement = TRUE),
bigint = conn@bigint,
conn = conn)
# No rows affected both before and after binding data to the the result
rs
#> <MariaDBResult>
#> SQL INSERT INTO `mtcars`
#> (`mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`)
#> VALUES
#> (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
#> ROWS Fetched: 0 [incomplete]
#> Changed: NA
dbGetRowsAffected(rs)
#> [1] NA
if (!is.null(params)) {
dbBind(rs, params)
}
After a little digging into the intricacies of the R MySQL client, Max
figured out what the problem was: the result schema from a
COM_STMT_PREPARE
should return a response indicating 0 columns for
INSERT
, UPDATE
, and DELETE
statements, since these statements
don't have a return schema, just status metadata like rows edited. We
were returning a column with our internal OKResult
schema. Most
clients don't care about this difference, but R's does.
The fix was easy: just don't return a schema.
But that's not the bug. That's just the bug fix that made us discover the real bug, the one that had us scratching our heads. Because when we went to release it to unblock our customer, we also picked up another change, an innocuous change to how our SQL server's log messages were formatted.
Broken unit test
When we went to release the above fix, our CI runs started failing with a suspicious timezone error.
--- FAIL: TestSelect/select_*_from_branches_system_table (0.03s)
sqlselect_test.go:1558:
Error Trace: sqlselect_test.go:1558
sqlselect_test.go:1423
Error: Not equal:
expected: []sql.Row{sql.Row{"master", "so275enkvulb96mkckbun1kjo9seg7c9", "billy bob", "bigbillieb@fake.horse", time.Time{wall:0x0, ext:62135596800, loc:(*time.Location)(0xc0008fa310)}, "Initialize data repository"}}
actual : []sql.Row{sql.Row{"master", "so275enkvulb96mkckbun1kjo9seg7c9", "billy bob", "bigbillieb@fake.horse", time.Time{wall:0x0, ext:62135596800, loc:(*time.Location)(0x3245200)}, "Initialize data repository"}}
I knew that our newest hire had recently made some timezone related changes in the SQL server, so I demonstrated outstanding leadership and blamed him without any further evidence.
But this didn't hold up under scrutiny once we actually looked at the nature of the breakage. Something else was going on.
The test that was breaking was asserting deep equality with a
time.Time
struct constructed like this:
{
Name: "select * from branches system table",
Query: "select * from dolt_branches",
ExpectedRows: []sql.Row{
{
"master",
"so275enkvulb96mkckbun1kjo9seg7c9",
"billy bob", "bigbillieb@fake.horse",
time.Date(1970, 1, 1, 0, 0, 0, 0, &time.Location{}),
"Initialize data repository",
},
},
Deep equality was failing because the time
struct returned by the
query had a different timezone than the expected value. Specifically,
we weren't giving an initialized time.Location
for the expected time
object, but the server was returning one.
Diff:
--- Expected
+++ Actual
@@ -10,9 +10,26 @@
loc: (*time.Location)({
- name: (string) "",
- zone: ([]time.zone) <nil>,
- tx: ([]time.zoneTrans) <nil>,
- extend: (string) "",
- cacheStart: (int64) 0,
- cacheEnd: (int64) 0,
- cacheZone: (*time.zone)(<nil>)
+ name: (string) (len=5) "Local",
+ zone: ([]time.zone) (len=1) {
+ (time.zone) {
+ name: (string) (len=3) "UTC",
+ offset: (int) 0,
+ isDST: (bool) false
+ }
+ },
+ tx: ([]time.zoneTrans) (len=1) {
+ (time.zoneTrans) {
+ when: (int64) -9223372036854775808,
+ index: (uint8) 0,
+ isstd: (bool) false,
+ isutc: (bool) false
+ }
+ },
+ extend: (string) (len=4) "UTC0",
+ cacheStart: (int64) 9223372036854775807,
+ cacheEnd: (int64) 9223372036854775807,
+ cacheZone: (*time.zone)({
+ name: (string) (len=3) "UTC",
+ offset: (int) 0,
+ isDST: (bool) false
+ })
After a lot of excited and confused shouting as we argued about how to ship our crucial bug fix when CI was failing for unrelated reasons, somebody figured out that the test started passing again when you commented out another test case, one that failed and wrote a log message. And it wrote a log message using a new format I had just merged, one that did this:
msg := fmt.Sprintf("%s %s [conn %d] %s {%s}\n", entry.Time.Format(time.RFC3339), lvl, connectionId, entry.Message, dataFormat.String())
But that's crazy. Surely formatting another, unrelated time.Time
struct in another part of the code base doesn't change the default
timezone value in unrelated code.
How is this possible?
As it turns out, our SQL server was using the time.Local
location,
which has some interesting behavior. golang doesn't try to determine
what your timezone info is without asking you. That would be rude!
Instead, it waits for you to tell it to load this info explicitly,
after which further calls to fetch the Local
location will include
this cached, detailed information. But calling Format
on a time
object, for some format strings (such as time.RFC3339
), needs to
load this information, so will do so implicitly as a side
effect. After such a call to time.Format
, all time.Time
structs
with the time.Local
location will include this info, when they
didn't before.
This is the kind of surprising, spooky action-at-a-distance nonsense that well-designed libraries tend to avoid. Or they at least try!
If Rob Pike happens to be reading this, know that I criticize because I care.
The fix was pretty easy: load the info ourselves, making explicit what was once implicit.
// We are doing structural equality tests on time.Time values in some of these
// tests. The SQL value layer works with times in time.Local location, but
// go standard library will return different values (which will have the same
// behavior) depending on whether detailed timezone information has been loaded
// for time.Local already. Here, we always load the detailed information so
// that our structural equality tests will be reliable.
var loadedLocalLocation *time.Location
func LoadedLocalLocation() *time.Location {
var err error
loadedLocalLocation, err = time.LoadLocation(time.Local.String())
if err != nil {
panic(err)
}
if loadedLocalLocation == nil {
panic("nil LoadedLocalLocation " + time.Local.String())
}
return loadedLocalLocation
}
...
time.Date(1970, 1, 1, 0, 0, 0, 0, time.UTC).In(LoadedLocalLocation())
Conclusion
All of us with a few years of experience under our belts have seen worse bugs than this. This one was kind of fun, because it only set us back a few hours, never hit production or impacted customers, and was just so wacky in its root cause that it crossed over into entertaining. Thanks for the memes, Rob. But still: even an entertaining foot-gun is still a foot-gun. Learn from our mistake, kids. Never trust a datetime library will do what you want, especially any defaults.
Like the article? Interested in Dolt? Come join us on Discord to say hi and let us know what you think!