Enhancing Index Support
We're continuing to make progress on DoltgreSQL, which is a version of Dolt built to be a drop-in replacement for PostgreSQL. For those that may not know about Dolt, it's built as a drop-in replacement for MySQL that is built, from the ground up, with Git-influenced versioning features in mind. This means that you can use branch, merge, diff, and more with your data and schema. Doltgres is the Postgres-flavored version of Dolt, and we are actively developing it. We're inching ever closer to making Doltgres a beta product, and enhancing our index support is a major step toward that goal.
Building up Doltgres
To explain a little bit about how we've enhanced our index support, it's important to understand how we're developing Doltgres. When the project was first started, we wrote a server that emulated Postgres' wire protocol, and converted the statements that it received into their equivalent MySQL statements. These statements were then fed directly to Dolt through its SQL engine—go-mysql-server (GMS). This allowed us to get a proof-of-concept up quickly, but it was limited to the statements that were syntactically similar between Postgres and MySQL. This, of course, also meant that we were limited to the functionality that is present in both.
For a more permanent solution, we decided to parse Postgres statements into an abstract syntax tree (AST), and convert those into the Vitess AST that GMS uses. GMS treats Vitess's AST as a kind of intermediate representation (IR), which means that we're able to express far more functionality by combining AST primitives. The AST is not a true IR though, as the primitives were still created with the intention of representing MySQL functionality, rather than all possible SQL functionality. In addition, the analyzer turns the AST into an execution plan that is carried out against the Dolt storage layer, and the analyzer has been specifically tuned for MySQL. This means that, in order to have proper support for some Postgres concepts, we must either change portions of the analyzer to apply in a more general fashion, or implement new routines specific to Postgres functionality. For indexes, we chose to implement a new routine, as the existing index analyzer routine makes a critical assumption that does not hold for Postgres.
Critical Difference
There are numerous differences between MySQL and Postgres indexes that translate to differences that would require fairly large changes to GMS' index analyzer routine, but one is absolutely critical. Postgres is quite thorough with its documentation, and within it states requirements that must be upheld between operator families. One such requirement is the transitive law, which Postgres shares with the mathematical concept. With this requirement, the existing analyzer routine in GMS fails, as MySQL does not have as strict of a requirement, and therefore GMS will freely convert between different types to homogenize the types involved in an index lookup. This leads to a number of optimizations that are only possible due to the homogenization, which further creates a set of assumptions based on the output of the optimizations. All in all, while generating a fast execution plan, the existing analyzer routine is fundamentally incorrect for Postgres, and will cause incorrect results in some circumstances (which can be disastrous for a database).
To fix this, a new analyzer routine was created that adheres to all of Postgres' requirements, which allows us to return the correct results. Primarily, it does not allow for type homogenization by converting between all the types implicitly. In addition to correctness, it also enables indexes to be used in joins, which has a massive influence on overall performance. For example, by adding index support, the following simple join sped up by a factor of 290x:
SELECT a.id, a.small_int_col FROM sbtest1 a, sbtest1 b WHERE a.id = b.id limit 500;
Not only is it significantly faster, but the above query is recognized as a join, even though it does not explicitly use the JOIN
keyword.
What's Missing?
Although the new index analyzer routine enables a lot more index functionality, we have not yet implemented all of the functionality that is expected of indexes. A few of the major ones are:
- Null Ordering: This controls whether NULL values are sorted before or after regular values.
- Null Distinction: NULLs may be considered distinct or non-distinct. When distinct, the index allows multiple NULL values. In MySQL, NULL values are always considered distinct in indexes, however Postgres allows this to be changed.
- Value Layout: Types are able to control how their values are ordered. All built-in types come with a pre-defined function, however we have hardcoded them all in for now.
Conclusion
DoltgreSQL is improving at a rapid pace, and we are excited with both the reception and its future. With proper index support, it's much easier to recommend downloading DoltgreSQL today to see if it suits your needs. If there are any features that are missing that are critical to your workflow, then let us know! You can find us on Twitter/X and chat with us on Discord. We also welcome all issue reports! Thank you for reading the blog post!