Dolt and SQLAlchemy: A tale of growth
Introduction
We have been working hard towards making Dolt a production-grade, MySQL compatible database. Last year, we discussed MySQL compatibility in the form of SQLLogictests where we hit 99% correctness. Now, we are very focused on compatibility metrics with tools like ORMs. In the long term, we want to make Dolt work out of the box with any established MySQL compatible ORM. In this blog post, we'll cover some of the progress we've made with SQLAlchemy, one of the most popular Python ORMs, as a way to highlight our progress.
SQLAlchemy Testbench
SQLAlchemy is a Python ORM that converts declarative code into generated MySQL statements. Developers usually start with the concept of a model that captures the properties of a related MySQL table. Here's an example model from the SQLAlchemy documentation.
class User(Base):
__tablename__ = "user_account"
id = Column(Integer, primary_key=True)
name = Column(String(30))
fullname = Column(String)
addresses = relationship(
"Address", back_populates="user", cascade="all, delete-orphan"
)
def __repr__(self):
return f"User(id={self.id!r}, name={self.name!r}, fullname={self.fullname!r})"
This model serves as the base for a wide variety of features including
- Easy JOIN queries with one to two lines of code
- Relationship management including one to many and many to many.
- Built in transaction support with COMMIT and ROLLBACK.
With all of these features, the SQLAlchemy MySQL test bench turned out to be quite the behemoth that leveraged all sorts of MySQL drivers.
Dolt's Progress
Last week, we decided it to run it against the core MySQL backend tests with the MySQLDB driver. Here are the results from over 16,000 tests:
497 failed, 13518 passed, 2083 skipped, 10 warnings, 218 errors in 544.94s (0:09:04) =
Excluding the skipped tests, we were pleasantly surprised to be passing about 95% of that test bench! I was curious and decided to run old releases of Dolt against the same version of this testbench to see how we were doing. About a year ago, we couldn't even run this test bench against Dolt!
We saw some solid jumps over the end of 2021 before some pretty steady performance for the rest of the year. What are a few features that we shipped that's made our compatibility so much better?
Transactions
Dolt's transactional model was the biggest feature that enabled us to get up and running against SQLAlchemy. Just take a look at the size of this July PR that turned transactions on my default. The hardest part of this was merging two very difficult concepts 1. Git branching model 2. MySQL's strict REPEATABLE READ properties. One of our engineers wrote about some of this in a large blog here but broadly we had to convert the MySQL COMMIT semantics into a MERGE between two different branches. We've found it to be working pretty well.
Analyzer Revamps
One of the most difficult parts of a database is the query analyzer which takes in a raw query string and converts it into an efficient execution plan. We've found ourselves struggling a lot with JOINs and their relevant Filter optimizations in our analyzer before. This is because JOINs require very precise query plans with a little to no margin for error. Back in September, we working with one our customers to improve join performance substantially. It looks like we also introduced some bugs with a wide variety of JOIN queries that started failing in the test bench. Clearly, our local test coverage was not good enough in September.
This touches on an important point: Great testing means product quality never drops. Had we invested in more ORM testing before we would've caught our correctness issues before some of our customers did. You can notice that in December we couldn't run the test bench at all because of silly mistakes with type conversions! Regardless, over the past couple of months we've invested a ton in making JOIN queries great in all sorts of use cases: including joins with Primary Keys, Indexes, and Foreign Keys.
Foreign Keys
One of SQLAlchemy's killer features is eager loading. Let's say Model A has a foreign key relationship with a parameter on Model B. Typically, when we query Model A we only get Model A's row data with Model B's ids stored in some of the row. We would then have to query Model B again for the rest of the data. Eager loading queries both the Model A objects and its related Model B objects by wrapping a JOIN query. Eager loading is particularly useful with Foreign Key relationships. Back in July, we launched Foreign Keys which has enabled Eager Loading, and a ton of cool ORM features as a result.
The Work To Do
There's still a good amount of work to do to hit 100% compatibility. Some features we don't support yet include
- Row Level Locks
- Complete Information Schema compatibility
- Custom Functions
- Full compatibility with additional drivers (asyncmy, mariadb, etc.)
We'll get them shipped!
Conclusion
It is extremely exciting to see Dolt mature as a product. We are becoming faster, more robust, and fully integrated into the greater database ecosystem in a remarkably short time period. In the next couple of months, we are aiming to keep pushing Dolt's compatibility with all sorts of ORMs and tools.