Fetching Tags: Faster than Ever!

Dolt, the world's first SQL database which supports branching and merging, is an open source product built by a startup. As with all startups, we need to move fast and get features into the hands of our users as quickly as possible. Occasionally this results in suboptimal behavior of the code, but generally those are things we know we can fix later when it becomes a problem.

Recently I discovered one such rough edge that was being exacerbated by a particular usage pattern. Let's jump in!

Tags

Tags in Dolt are just like in Git: they are a little bit of metadata on top of a commit which includes an author and date. Similar to commits, they are content addressable, so when we store a tag it is a chunk with an address. That chunk contains the metadata, but most importantly, if the address of the chunk is the same between two servers, you know the objects contain identical information. This is why content addressable storage is so wonderful.
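
To make the content addressing idea concrete, here is a minimal sketch. It assumes SHA-256 as the hash purely for illustration (it is not a statement about which hash Dolt uses), and addrOf is a hypothetical helper, not a Dolt function. The point is only that identical bytes always produce an identical address, so comparing addresses is enough to know whether two stores hold the same object.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// addrOf returns a content address for a chunk: the hex-encoded hash of its bytes.
// Assumption: SHA-256 is used here only for illustration.
func addrOf(chunk []byte) string {
	sum := sha256.Sum256(chunk)
	return hex.EncodeToString(sum[:])
}

func main() {
	// Two copies of the same (made-up) tag payload, one per "server".
	local := []byte("tag=v3.23.6 tagger=someone date=2024-01-01")
	remote := []byte("tag=v3.23.6 tagger=someone date=2024-01-01")
	// A payload that differs by a single field.
	other := []byte("tag=v3.23.7 tagger=someone date=2024-01-01")

	fmt.Println(addrOf(local) == addrOf(remote)) // same bytes, same address
	fmt.Println(addrOf(local) == addrOf(other))  // different bytes, different address
}
```

If the destination already has a chunk at that address, there is nothing to transfer, and nothing inside the chunk needs to be read to know that.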

Typically tags don't change. They are useful for things like giving a release number, like v3.23.6, to a given commit. In Dolt, we have users who tag their data to indicate where an ML training run was performed, when a report was generated, and so forth. Tags are basically lightweight labels which allow you to annotate a commit after it's been made.

Lots of Tags

Last week I was working with one of our users, and they mentioned that dolt pull was really slow. I asked how slow, and they said it took 2 hours. 2 hours is a long time, and it piqued my interest.

I cloned their database, and verified that dolt pull was insanely slow. So slow, in fact, that I never actually witnessed it complete. Curious, I dug deeper to discover that they had a lot of tags in their database:

database$ dolt pull
zsh: terminated dolt pull
database$ dolt tag | wc -l
   16743

Common sense told me that most of those tags had not changed, and for this reason there wasn't very much information that needed to be transferred. How could this possibly take multiple hours?!?!

Careful of Round Trips

This code was being invoked as part of the dolt pull operation:

func FetchFollowTags(ctx context.Context, tempTableDir string, srcDB, destDB *doltdb.DoltDB, progStarter ProgStarter, progStopper ProgStopper) error {
	err := IterResolvedTags(ctx, srcDB, func(tag *doltdb.Tag) (stop bool, err error) {
		tagHash, err := tag.GetAddr()
		if err != nil { return true, err }

		has, err := destDB.Has(ctx, tagHash)
		if err != nil { return true, err }

		if has {
			// tag is already fetched
			return false, nil
		}
...

This looks pretty innocuous. We iterate through all of the tags, and if we already have tagHash that means we've already fetched the content in question. Seems like we're short circuiting early so as not to do unnecessary work.

That method named IterResolvedTags says "Resolved" though. What is tag resolution anyway? Remember when I said that the tag address is the address of a chunk? Well, resolving a tag means you load that chunk and get all that useful metadata out of it. Which is exactly what the code for IterResolvedTags does:

func IterResolvedTags(ctx context.Context, ddb *doltdb.DoltDB, cb func(tag *doltdb.Tag) (stop bool, err error)) error {
	tagRefs, err := ddb.GetTags(ctx)
	if err != nil { return err }

	var resolved []*doltdb.Tag
	for _, r := range tagRefs {
		tr, ok := r.(ref.TagRef)
		if !ok {
			return fmt.Errorf("DoltDB.GetTags() returned non-tag DoltRef")
		}

		tag, err := ddb.ResolveTag(ctx, tr)
		if err != nil { return err }

		resolved = append(resolved, tag)
	}
...

What's deceptive about the code above is that ddb.ResolveTag(ctx, tr) performs 2 service calls to remote servers to get the information in question.

So, for every single one of those 16,743 tags, we were making 2 service calls. All of those add up to a very slow dolt pull.
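
A quick back-of-the-envelope calculation shows how that adds up. The tag count and the 2 calls per tag come from above; the 200ms per round trip is an assumed figure for illustration, not something I measured, and totalLatency is a hypothetical helper that assumes the calls happen one after another.

```go
package main

import (
	"fmt"
	"time"
)

// totalLatency estimates the wall-clock time spent waiting on sequential
// round trips: tags * callsPerTag calls, each costing perCall.
func totalLatency(tags, callsPerTag int, perCall time.Duration) time.Duration {
	return time.Duration(tags*callsPerTag) * perCall
}

func main() {
	const tags = 16743      // from `dolt tag | wc -l`
	const callsPerTag = 2   // service calls made by each tag resolution
	perCall := 200 * time.Millisecond // ASSUMED round-trip latency

	fmt.Println("round trips:", tags*callsPerTag)
	fmt.Println("time waiting:", totalLatency(tags, callsPerTag, perCall))
}
```

That is 33,486 round trips, and at the assumed latency it comes to a little under two hours of doing nothing but waiting on the network, which is in the same ballpark as what the user reported.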

Lazy Resolution

Recall that tags don't change often (if ever), so why are we spending all of this time resolving them fully when we probably just need to look at the chunk address and short circuit early? We don't have to!

To fix this, I made a pretty small change to lazily resolve the full information of the Tag. Using the uncreatively named new function, IterUnresolvedTags, we now pass a callback that lets the calling code decide whether it needs to fully resolve the tag. If it does, we follow the same slow code path; if it doesn't, we short circuit. FetchFollowTags as it stands today:

func FetchFollowTags(ctx context.Context, tempTableDir string, srcDB, destDB *doltdb.DoltDB, progStarter ProgStarter, progStopper ProgStopper) error {
	err := IterUnresolvedTags(ctx, srcDB, func(tag *doltdb.TagResolver) (stop bool, err error) {
		tagHash := tag.Addr()

		has, err := destDB.Has(ctx, tagHash)
		if err != nil { return true, err }
		if has {
			// tag is already fetched
			return false, nil
		}

		t, err := tag.Resolve(ctx)
		if err != nil {
			return true, err
		}
...

By changing from a *doltdb.Tag to a *doltdb.TagResolver we defer the need to resolve the tag until we've determined we don't have it.
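
The pattern itself is easy to sketch outside of Dolt. The example below is a simplified stand-in, not the real doltdb.TagResolver: Tag, LazyTag, and fetchMissing are hypothetical names, the map plays the role of destDB.Has, and the load function simulates the expensive remote calls. It counts how many times the expensive path actually runs.

```go
package main

import (
	"errors"
	"fmt"
)

// Tag is a stand-in for fully resolved tag metadata.
type Tag struct {
	Name string
}

// LazyTag is a hypothetical resolver: the address is known up front,
// and the expensive load is deferred until Resolve is called.
type LazyTag struct {
	addr string
	load func() (*Tag, error) // simulates the remote service calls
}

// Addr is cheap: it never touches the network.
func (l *LazyTag) Addr() string { return l.addr }

// Resolve performs the expensive load on demand.
func (l *LazyTag) Resolve() (*Tag, error) {
	if l.load == nil {
		return nil, errors.New("lazy tag has no loader")
	}
	return l.load()
}

// fetchMissing resolves only the tags whose address is not already in have.
func fetchMissing(tags []*LazyTag, have map[string]bool) ([]*Tag, error) {
	var fetched []*Tag
	for _, lt := range tags {
		if have[lt.Addr()] {
			continue // already present, skip the expensive resolve
		}
		t, err := lt.Resolve()
		if err != nil {
			return nil, err
		}
		fetched = append(fetched, t)
	}
	return fetched, nil
}

func main() {
	loads := 0
	mk := func(addr, name string) *LazyTag {
		return &LazyTag{addr: addr, load: func() (*Tag, error) {
			loads++
			return &Tag{Name: name}, nil
		}}
	}

	tags := []*LazyTag{mk("aaa", "v1"), mk("bbb", "v2"), mk("ccc", "v3")}
	have := map[string]bool{"aaa": true, "bbb": true}

	fetched, err := fetchMissing(tags, have)
	if err != nil {
		panic(err)
	}
	fmt.Println("fetched:", len(fetched), "expensive loads:", loads)
}
```

With two of the three addresses already present, only one expensive load happens. When nearly all of the tags are unchanged, the expensive path almost never runs.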

Results

dolt pull and dolt fetch are much faster now on this particular database:

database$ time dolt fetch
dolt fetch  2.92s user 1.30s system 14% cpu 29.896 total

Down from unknown infinite time (I lack patience) to 30 seconds. Not too shabby.

Lessons of the Day

  1. If something in Dolt takes 2 hours, definitely tell us. It's probably not right.
  2. In this age of cloud, don't believe the hype. Inter-computer communication is slow, and should be used sparingly.
  3. Your users will always stress your code in ways you don't expect. You can't plan for every scenario, and some optimizations only happen when you have users that show you how your product is really used.

What are you going to build with Dolt? Break our assumptions, and come tell us on Discord!
