Dolt Case Studies

May 7, 2021

Background

Dolt is a SQL database with Git-like version control features for both data and schema. This novel feature set makes Dolt useful in a wide variety of applications. This blog post zeroes in on some specific examples of how our users get value from those features.

We are publishing these case studies in our documentation. We hope they act as a resource for individuals and teams evaluating whether Dolt is a good fit for their particular needs.

NLP with Kalido

Kalido creates tools, powered by proprietary AI, that match people, based on their skills, to connections, teams, projects, and jobs. Their goal is to create value for individuals and organizations by better matching individuals with opportunities. Much of the data is natural language, in the form of resumes, project descriptions, and anything else that would help match people with projects and teams.

In practice, this means that Kalido manages a variety of different models that need to be fanned out to customers. Doing this at scale requires a high degree of model delivery hygiene. With this in mind, Kalido wanted their data pipelines to produce versioned and reproducible models. An initial implementation used Git LFS, but the team found it clunky to work with and ran into performance issues. It was at this point that they discovered Dolt as a solution for managing their tabular data.

We discovered Kalido when they submitted a bug complaining about GCP remote cloning speeds. They were absolutely right! A prompt fix kicked off a lively dialogue about their use-case. Excited by teams building NLP infrastructure on top of Dolt, we proposed publishing a case study. The Kalido team were happy to collaborate, and extremely generous in sharing a good deal of detail on how they build their model pipelines on top of Dolt.

Medical Research with Turbine

Turbine is a drug discovery startup that creates computer simulations of cancer cells as a mechanism for drug discovery and for pinpointing which kinds of cells would be most sensitive to those discoveries. Turbine’s technology revolves around the Simulated Cell, a continuously evolving computer simulation of human cancer cells. A Simulated Cell is assembled from many different types of data, including, but not limited to, protein-protein interactions, drug sensitivity data, and genomics. These data are provided by a large team of researchers and are continuously updated.

In addition to being assembled from a multitude of sources contributed by a large team of researchers, the research driven by the Simulated Cell also needs to be versioned. As with all simulation-based research, Turbine's goal is to short-circuit the costly and time-consuming process of doing lab work, reaching novel results faster and more cost-effectively. These results must eventually be verified in a lab, which in turn necessitates the ability to maintain historical versions of their Simulated Cell model in production.

Turbine needed a database for the Simulated Cell model that allowed researchers to propose and test changes, and that also gave current lab trials the ability to run simulations against historical versions of the Simulated Cell model. In other words, they needed to associate every proposed version of the Simulated Cell with some sort of identifier or tag. After struggling with the performance of a MongoDB-based solution, they came across Dolt. Dolt's SQL model and underlying commit graph allowed them to architect a researcher portal on top of a Dolt database, and allowed the Spark simulation infrastructure to point to any version of that model that had ever been committed.
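To make "pointing to any committed version" concrete, here is a minimal sketch of how a client might build a query against a tagged version of a table using Dolt's `AS OF` clause. This is not Turbine's code: the helper `versioned_select`, the table name `protein_interactions`, and the tag `cell-model-v1` are all hypothetical names chosen for illustration.

```python
import re

# Conservative allow-list for table names and revision names
# (a revision here is assumed to be a branch, tag, or commit hash).
_SAFE_NAME = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_.\-/]*$")


def versioned_select(table: str, revision: str) -> str:
    """Build a query that reads `table` as it existed at `revision`.

    Both arguments are validated because they are interpolated into
    the SQL text rather than passed as bound parameters.
    """
    for value in (table, revision):
        if not _SAFE_NAME.match(value):
            raise ValueError(f"unsafe name: {value!r}")
    return f"SELECT * FROM `{table}` AS OF '{revision}'"


if __name__ == "__main__":
    # Hypothetical table and tag names, for illustration only.
    print(versioned_select("protein_interactions", "cell-model-v1"))
```

The resulting string could be sent over any MySQL-compatible connection to a running Dolt SQL server. A simulation job that records the tag it ran against can later be rerun against exactly the same data.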

Application Database with Nautobot

Nautobot is an open source "Network Source of Truth and Network Automation Platform." The project is sponsored by Network to Code, a provider of network automation solutions. Network to Code hopes to improve the quality and efficiency of their clients' network operations.

A member of the Network to Code team reached out to the DoltHub team about using Dolt in network automation solutions they were delivering to their clients. After some brainstorming, Network to Code suggested that we make a proof of concept connecting Dolt and NetBox, the "Network Source of Truth" application that Nautobot is forked from. The idea was to show that it was possible to version underlying network configurations and test them out. You can find a presentation demonstrating the proof of concept here.

The talk was well received by the networking community, but NetBox only works with Postgres as a backend. Delivering the full potential of data version control in a network source of truth and automation application would require integrating Dolt as the application database, rather than just versioning the data at various states. Concurrently, Network to Code had made the decision to create a fork of NetBox, called Nautobot.

After consulting the Dolt team about the possibility of integrating Dolt and Nautobot, in order to provide a branch and pull request workflow for network source of truth management, we reached an agreement with the NetBox team to implement the necessary changes on Nautobot to make it work with Dolt. This required a combination of product engineering, to define and expose the necessary UX for managing branches and pull requests, and work on the Dolt side to fill in features that Django requires but that were missing from Dolt.

Conclusion

Dolt is a general purpose SQL database with Git-like version control features. This post highlighted some specific examples of what that means in practice across three different use cases. We hope this will inspire prospective users by showing how to get business value from Dolt. If you'd like to discuss how to make use of Dolt, send us an email, or join us on Discord.
