Why I Threw Away Perfectly Good Code

This blog post is inspired by the video essay This Problem Changes Your Perspective On Game Dev by Jonas Tyroller.

As the title of the video suggests, Jonas is drawing from his own experience as a game developer, and so he approaches the topic from that perspective. But the advice he gives is broadly applicable outside the world of games. If you haven't watched the video before, I recommend checking it out before continuing.

Unlike Jonas, I don't make games. I work on Dolt, a SQL database with Git-like version control semantics. It's a database you can clone, branch, and merge.

One of my recent responsibilities was improving how Dolt stores JSON documents internally, in order to speed up complicated queries. I've talked in detail about the final implementation. But before we arrived at that design, we had an entirely different implementation, one that almost made it into production. We had a working prototype, the result of a significant period of research, design, review, and implementation.

And then we scrapped the whole thing. The PR never got merged. We replaced it instead with a different design that shared a lot of the same high-level ideas, but had almost no code reuse. We spent a lot of engineer-hours on the original design, just to throw it all out. Where did we go wrong?

Except... I don't think we did go wrong. I believe that designing and implementing the original prototype was the right call.

I believe that scrapping it was also the right call.

It's not obvious why this might be the case, especially if you're prone to sunk-cost thinking about design. But it's true. And that's because design isn't just a thought experiment. Like Jonas says in his video...

Design is a Search Algorithm

His main thesis is this: design is an iterative process of exploration followed by practical measurement, used to navigate an infinite space of design choices. In that sense, it's akin to a search algorithm attempting to find the maximum value in some search space.

Search algorithms in an infinite search space can't possibly visit every possible option. But it's also necessarily true that not every option that gets evaluated is a step in the right direction. Navigating a search space requires journeying into the unknown and making experimental measurements, and some of these journeys won't bear fruit.

Does that make the journey a mistake?

Absolutely not. The journey is essential. And while it's tempting to salvage as much of the prototype as you can, at the end of the day the scrapping is also essential. As Jonas puts it at the end of the video:

If you never scrap any of your work, that means you're not taking full advantage of the search space available to you, and you're likely missing out on a lot of great opportunities to improve.

The journey I took designing Dolt's new JSON format is a great example of why this is true. So let's look at the actual prototype that we had, why it was vital that we made it, and why in the end we had to let it go.

How To Not Implement A Document Store

In my previous blog post, I talked about the high-level concept that guided our design. It's important enough that I'm going to repeat it here:

From the get-go, we had one key insight that governed our approach: Dolt is really good at diffing and merging tables, because a Prolly Tree is very good at representing a flat data structure. If we could represent a JSON document as a flat table, then we could store it in a Prolly Tree and apply the same optimizations to it that we already do for tables.

For example, if you had a JSON document that looked like this:

{
  "a": "value1",
  "b": {
    "c": "value2",
    "d": "value3"
  },
  "e": ["value4", "value5"]
}

Then you could visualize that document as a table, like this:

+--------+----------+
| key    | value    |
+--------+----------+
| $.a    | "value1" |
| $.b.c  | "value2" |
| $.b.d  | "value3" |
| $.e[0] | "value4" |
| $.e[1] | "value5" |
+--------+----------+
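
To make that flattening concrete, here's a minimal sketch in Go of how a parsed document might be turned into those path/value rows. This is illustrative only, not Dolt's actual code; the flatten helper and its output format are invented for this post.

package main

import (
    "encoding/json"
    "fmt"
    "sort"
)

const doc = `{"a": "value1", "b": {"c": "value2", "d": "value3"}, "e": ["value4", "value5"]}`

// flatten walks a decoded JSON value and emits one (path, value) pair per
// primitive leaf, mirroring the table above.
func flatten(path string, v interface{}, out map[string]interface{}) {
    switch val := v.(type) {
    case map[string]interface{}:
        for k, child := range val {
            flatten(path+"."+k, child, out)
        }
    case []interface{}:
        for i, child := range val {
            flatten(fmt.Sprintf("%s[%d]", path, i), child, out)
        }
    default:
        out[path] = val
    }
}

func main() {
    var parsed interface{}
    if err := json.Unmarshal([]byte(doc), &parsed); err != nil {
        panic(err)
    }
    rows := map[string]interface{}{}
    flatten("$", parsed, rows)

    // Print the rows in sorted path order, the way a table scan would return them.
    paths := make([]string, 0, len(rows))
    for p := range rows {
        paths = append(paths, p)
    }
    sort.Strings(paths)
    for _, p := range paths {
        fmt.Printf("%-7s %v\n", p, rows[p])
    }
}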

Our first approach was the simplest, most literal interpretation of this idea: Every single JSON object in your database would actually be an anonymous table in disguise. If a table had a column of the JSON type, then each row in that table actually held the content-addressed hash of one of these anonymous tables. JSON operations were implemented as table operations: inserting a value into the document became inserting rows into the table, looking up a path in the document became a table lookup, etc.

Doing it this way had a pretty nice bonus: since we'd already implemented fast three-way merging for tables, we got optimized three-way merging of JSON documents for essentially no extra work, using the exact same algorithm we'd already written.
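
To give a rough sense of what "JSON operations as table operations" means in practice, here's a hedged sketch over a sorted list of flattened rows. The docTable type and its methods are invented for illustration; the real implementation operates on Prolly Tree nodes, not in-memory slices.

package main

import (
    "fmt"
    "sort"
    "strings"
)

// row is one entry of the flattened document. The rows are kept sorted by
// path, the way a clustered index would store them.
type row struct{ path, value string }

type docTable []row // always sorted by path

// lookup of an exact path is a point lookup in the sorted rows.
func (t docTable) lookup(path string) (string, bool) {
    i := sort.Search(len(t), func(i int) bool { return t[i].path >= path })
    if i < len(t) && t[i].path == path {
        return t[i].value, true
    }
    return "", false
}

// subtree of a path is a range scan over every row sharing that prefix.
// (A real implementation would match whole path segments, not raw prefixes.)
func (t docTable) subtree(prefix string) []row {
    i := sort.Search(len(t), func(i int) bool { return t[i].path >= prefix })
    var out []row
    for ; i < len(t) && strings.HasPrefix(t[i].path, prefix); i++ {
        out = append(out, t[i])
    }
    return out
}

// insert adds a new leaf to the document, which is just a row insertion.
func (t docTable) insert(path, value string) docTable {
    t = append(t, row{path, value})
    sort.Slice(t, func(i, j int) bool { return t[i].path < t[j].path })
    return t
}

func main() {
    t := docTable{
        {"$.a", `"value1"`},
        {"$.b.c", `"value2"`},
        {"$.b.d", `"value3"`},
        {"$.e[0]", `"value4"`},
        {"$.e[1]", `"value5"`},
    }
    fmt.Println(t.lookup("$.b.c"))  // path lookup = point lookup
    fmt.Println(t.subtree("$.b"))   // nested object = range scan
    t = t.insert("$.f", `"value6"`) // document mutation = row insertion
    fmt.Println(len(t))
}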

We were optimistic about the design, but we had a couple of concerns:

  • Documents get both written to and read from the database in standard JSON format. In order to store documents as tables like this, we would need the ability to convert between the two representations, and that conversion would add latency to both reads and writes. However, we expected that in practice this wouldn't be the bottleneck: the overhead would increase latency by a constant factor, one we expected to be small compared to the disk IO involved.
  • Listing the path of every value in the document introduces redundant information. Note how in the example above, the b and e keys appear multiple times in the rendered table. The more levels the document has, the worse this redundancy gets. In the worst case, the space used to store these paths could be O(n²) in the size of the JSON document (see the sketch after this list).
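
To see where that worst case comes from, picture a document nested n levels deep: the flattened table stores paths of length 1, 2, ..., n, so the bytes spent on paths grow quadratically even though the document itself grows only linearly. A tiny back-of-the-envelope sketch, with an invented helper:

package main

import "fmt"

// pathBytes estimates the total bytes spent on flattened paths for a document
// nested n levels deep, like {"k": {"k": {"k": ... "leaf"}}}. The paths are
// $.k, $.k.k, $.k.k.k, ..., so their combined length grows quadratically.
func pathBytes(n int) int {
    total, pathLen := 0, len("$")
    for i := 0; i < n; i++ {
        pathLen += len(".k")
        total += pathLen
    }
    return total
}

func main() {
    for _, n := range []int{10, 100, 1000} {
        // The document grows roughly linearly in n; the paths do not.
        fmt.Printf("depth %4d -> ~%d bytes of paths\n", n, pathBytes(n))
    }
}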

We knew that the only way to understand the magnitude of these issues was to build a prototype and make measurements. But we wanted to get to a place where we could profile this approach as quickly as possible. This meant cutting lots of corners and leaving features unimplemented: a prototype doesn't need to be polished, it just needs to be written quickly. For instance:

  • Real JSON documents support a variety of value types, both primitive (strings, numbers, etc.) and composite (objects and arrays). The prototype only supported strings and objects.
  • Dolt takes a mark-and-sweep approach to garbage collection, where it scans tables to find new content hashes. Since the garbage collector wasn't aware that JSON columns could contain addresses, running the GC would drop all of your JSON documents from the database.

Despite these shortcuts, we still had to clear a few hurdles just to reach this minimal state:

  • JSON documents can contain a primitive value instead of an object. These primitive values were stored inline in the table that contained them. A column that could contain a mix of both content addresses and inline values was something new, which caused some issues.
  • The values in JSON documents are polymorphic, but table columns aren't. Our solution was to make the value column of these anonymous tables a byte array, and to add a third column containing an enum that tells the engine whether the value is a string, a number, and so on (sketched below).
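
For illustration, a row in one of these anonymous tables ended up looking roughly like the sketch below. The names and the exact set of kinds are invented for this post; only the shape (path key, opaque bytes, kind tag) reflects the design described above.

package main

import "fmt"

// valueKind is the enum stored in the third column, telling the engine how to
// interpret the raw bytes in the value column.
type valueKind uint8

const (
    kindString valueKind = iota
    kindNumber
    kindBool
    kindNull
)

// jsonRow sketches one row of an anonymous JSON table: a path key, an opaque
// byte payload, and a kind tag that says how to decode the payload.
type jsonRow struct {
    key   string    // e.g. "$.b.c"
    value []byte    // raw bytes: a UTF-8 string, an encoded number, etc.
    kind  valueKind // how to interpret value
}

func main() {
    r := jsonRow{key: "$.b.c", value: []byte("value2"), kind: kindString}
    fmt.Printf("%s -> %q (kind %d)\n", r.key, r.value, r.kind)
}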

Once we got the prototype working (the "exploration"), we had to benchmark it (the "measurement"). As we predicted, working with documents that were already in the database was near-instant. Lookups, mutations, and copies of existing documents were a snap. But we observed two things that surprised us:

  • Reading documents from the database was about 6x slower, regardless of document size. This was technically in line with our predictions: we predicted a constant factor slowdown, and we were right. But the magnitude of that constant factor was larger than we had anticipated.
  • The size of the new format on disk was almost identical to that of the old format. This was very unexpected, since we were storing every path into the document, which can blow up very quickly for deeply nested documents. So why weren't we seeing that blowup? Because chunks are themselves compressed when written out to disk. We hadn't been thinking about this, because in most cases compressing individual chunks doesn't do much: database chunks don't typically contain enough redundant data to compress well on their own. That's why my good colleague Neil has been working on alternatives like dictionary compression. But these new chunks, full of repeated path segments, were very compressible, to the extent that the compression essentially cancelled out the added redundancy (see the sketch after this list).
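
If you want to convince yourself of that last point, here's a throwaway sketch that compresses a synthetic chunk full of flattened paths with long shared prefixes. It uses Go's standard compress/flate as a stand-in; it is not Dolt's actual chunk compression, and the path shapes are made up.

package main

import (
    "bytes"
    "compress/flate"
    "fmt"
)

// compressedSize reports how many bytes the input occupies after DEFLATE.
func compressedSize(data []byte) int {
    var buf bytes.Buffer
    w, err := flate.NewWriter(&buf, flate.DefaultCompression)
    if err != nil {
        panic(err)
    }
    w.Write(data)
    w.Close()
    return buf.Len()
}

func main() {
    // Simulate a chunk full of flattened paths sharing long prefixes.
    var chunk bytes.Buffer
    for i := 0; i < 1000; i++ {
        fmt.Fprintf(&chunk, "$.orders.items[%d].shipping.address.zip\n", i)
    }
    fmt.Printf("raw: %d bytes, compressed: %d bytes\n",
        chunk.Len(), compressedSize(chunk.Bytes()))
}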

By taking these measurements, we discovered that the design fell short of our expectations in one metric, while simultaneously exceeding them in another. This gave us a new bearing, telling us where to lean in and where to correct course.

The Next Exploration

Thanks to our previous venture, we knew where to focus our efforts. We didn't immediately abandon the implementation. We knew that we could likely improve the performance of exporting documents from the DB. In the best case, we would reduce the new overhead to negligible levels. So that became our next step.

Our original implementation was very fast at exporting JSON documents because the documents were stored as JSON. Larger documents were split up into chunks, but exporting was as simple as concatenating those chunks together. But this same strategy that made exports fast also made everything else slow: in order to modify any part of the document we had to parse the whole thing every time.

The prototype fixed this by dividing the document into more pieces that could be understood and updated independently. But this meant that getting the original document back had many more pieces to glue together, and each of these pieces had to be converted.

We wondered if a hybrid approach could split the difference. In other words, it was completely fine if we had to parse a chunk of JSON in order to update any part of it, provided that we didn't have to parse any other chunks to understand it, and provided that the size of these chunks stayed below some threshold.

Basically, what if we represented the previous document like this?

+------+--------------------------------+
| key  | value                          |
+------+--------------------------------+
| $.a  | "value1"                       |
| $.b  | {"c": "value2", "d": "value3"} |
| $.e  | ["value4", "value5"]           |
+------+--------------------------------+

Each value is now a "snippet" of JSON, possibly including composite values. Done this way, assembling the original document for export would require significantly fewer operations. In exchange, update operations would now require parsing a small amount of JSON.
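
Here's a hedged sketch of that hybrid flattening, assuming an invented sizeThreshold: composite values whose serialized form fits under the threshold are stored as literal JSON snippets, and only larger values are split further. With the threshold shown, it produces the same three rows as the table above (modulo whitespace).

package main

import (
    "encoding/json"
    "fmt"
)

const sizeThreshold = 64 // hypothetical cutoff, in bytes of serialized JSON

// flattenHybrid stops descending as soon as a value's serialized form fits
// under the threshold, storing it as a literal JSON snippet instead.
func flattenHybrid(path string, v interface{}, out map[string]string) {
    raw, _ := json.Marshal(v)
    if len(raw) <= sizeThreshold {
        out[path] = string(raw)
        return
    }
    switch val := v.(type) {
    case map[string]interface{}:
        for k, child := range val {
            flattenHybrid(path+"."+k, child, out)
        }
    case []interface{}:
        for i, child := range val {
            flattenHybrid(fmt.Sprintf("%s[%d]", path, i), child, out)
        }
    default:
        // A single primitive larger than the threshold still gets its own row.
        out[path] = string(raw)
    }
}

func main() {
    const doc = `{"a": "value1", "b": {"c": "value2", "d": "value3"}, "e": ["value4", "value5"]}`
    var parsed interface{}
    if err := json.Unmarshal([]byte(doc), &parsed); err != nil {
        panic(err)
    }
    rows := map[string]string{}
    flattenHybrid("$", parsed, rows)
    for path, snippet := range rows {
        fmt.Printf("%-6s %s\n", path, snippet)
    }
}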

We experimented with various cutoff values for how large an individual table row could be. As we increased this number, the time to export the document shrank, while the time for update operations remained roughly constant. We became convinced that this hybrid approach could achieve the best of both worlds in terms of performance.

But we quickly realized that implementing it was non-trivial. Inserting into one of these "snippets" could cause it to overflow the threshold and get split across multiple rows. By itself, this wasn't a big problem. The problem was the reverse case, where removing part of a document meant that a previously split value could now fit in a single row. In order to preserve Dolt's history independence (the same document contents must map to the same on-disk structure, no matter which sequence of edits produced them), we needed to detect these situations and "un-split" the value, and it wasn't obvious how to do that.

Fortunately, we already knew how to split large documents into pieces in a history independent way using rolling hashes. Following this line of reasoning eventually led us to our final design. Unfortunately, this new design wouldn't reuse most of the code that we'd already written. It was similar enough conceptually to our previous explorations that we had confidence it would work, but the actual implementation would need to be redone.
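
For readers unfamiliar with the idea, content-defined chunking decides where to split based only on the bytes themselves, typically with a rolling hash over a small sliding window, so the same content always splits at the same places regardless of edit history. Below is a toy sketch of the idea; it is not Dolt's chunker, and the window size and boundary mask are arbitrary demo values.

package main

import (
    "fmt"
    "hash/fnv"
)

const (
    windowSize   = 16   // bytes of trailing context the boundary test looks at
    boundaryMask = 0x0F // on average, one boundary every 16 positions (tiny, for demo)
)

// split breaks data into chunks at positions chosen purely by the bytes in a
// small trailing window, so identical content always splits the same way, no
// matter what sequence of edits produced it. A real implementation would use
// a true rolling hash that updates in O(1) per byte and much larger chunks.
func split(data []byte) [][]byte {
    var chunks [][]byte
    start := 0
    for i := windowSize; i < len(data); i++ {
        h := fnv.New32a()
        h.Write(data[i-windowSize : i])
        if h.Sum32()&boundaryMask == 0 {
            chunks = append(chunks, data[start:i])
            start = i
        }
    }
    return append(chunks, data[start:])
}

func main() {
    doc := []byte(`{"a": "value1", "b": {"c": "value2", "d": "value3"}, "e": ["value4", "value5"]}`)
    for i, c := range split(doc) {
        fmt.Printf("chunk %d: %q\n", i, c)
    }
}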

We Broke Some Eggs To Make An Omelette

In summary, designing and implementing the prototype was the right call because it allowed us to make measurements (in this case, performance measurements) that helped us identify what was bottlenecking performance and what wasn't. This revealed two misconceptions:

  • We initially believed that serialization and deserialization wouldn't be a bottleneck because they run in linear time in the size of the document, which needed to be loaded from disk anyway. The prototype helped us realize that we couldn't ignore that cost.
  • We initially believed that storage space would be a limiting factor because the uncompressed table size scaled super-linearly with the total document size. The prototype helped us see that in practice, per-chunk compression was very good at eliminating this redundancy.

Throwing out the prototype was also the right call. If we had decided that we needed to keep the original design because we'd already spent so much time on it, then we would have been stuck with a sub-optimal solution. It would have worked, but we would have had to spend even more time trying to shore up its performance issues in ways that may not have been tractable.

Scrapping the prototype code and building the final implementation from scratch freed us from any constraints that the old code imposed and kept us from being locked into undesirable trade-offs.

And sure, it's easy to say that we should have avoided this whole situation by getting the design right on the first try. But I'd question the notion that getting it right on the first try is possible. We got real value out of the prototype, value that wouldn't have been possible if we'd kept it as a thought experiment. We needed the implementation in order to make the right measurements.

Prototypes exist for a single purpose: to gather measurements that inform the next iteration of your design. That's why we make them as simple as possible and avoid over-engineering them: because there's a good chance that once you make those measurements, you've gotten all the value out of it. At that point, it's like an anchor: a practical tool, but not something you want to be dragging around once you're ready to set sail.

Sometimes, you write code that doesn't make it into production, and that's a good thing. I think that's important to keep in mind when designing anything.

But what do you think? Is a good designer someone who never has to throw out code, or someone who knows when code has served its purpose? Were you ever in a situation where you had to make that call? If so, come hang out on Discord and tell me about it. And while you're there, let's chat about whether or not you version control your database, because if you don't, you should. And Dolt can help.
