The Plan for Vectors in Dolt

September 26, 2024

Lately, I keep hearing about vector databases. Even databases that weren't originally made for vectors are getting vector capabilities added via plugins:

MariaDB Vector, announced in July, is in preview.
SQLite has sqlite-vec, which officially left alpha this month.
Postgres has had pgvector for a while.

It seems like vectors are the new hotness, but is it a killer feature or just a passing fad? Should Dolt pursue vector functionality too?

Dolt's not a vector database. It's a version controlled SQL database with Git-like branch and merge semantics, accomplished by leveraging prolly trees. But we were curious whether or not prolly trees could also offer unique value in a vector database.

This wouldn't be the first time. Dolt wasn't a JSON-focused database either, and then it turned out that prolly trees also lent themselves to efficient JSON manipulation, and now Dolt is a competitive option for working with large-scale JSON documents.

It turns out, prolly trees add a lot of value in unexpected places. Does that include vector databases?

I think it does. I believe that Dolt's unique design (data structures built on prolly trees) can lead to a novel version-controlled vector database.

What is a vector database?

A vector database is a database that stores data as multi-dimensional floating point arrays (called vectors) and allows for Approximate Nearest Neighbor searches. If data objects can be represented as vectors in a way that captures each object's semantic meaning (a process called vector embedding), then a vector database can quickly find other objects with similar semantic meaning.
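To make the idea of a nearest neighbor search concrete, here is a minimal sketch in Python. It is an exact, brute-force search that scans every vector, which is the baseline that approximate indexes try to beat on speed. The function names and the tiny 2-dimensional "embeddings" are made up for illustration; real embeddings typically have hundreds of dimensions and come from a model.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(target, vectors, k):
    """Return the ids of the k vectors closest to target, nearest first.

    This scans every stored vector, so it is exact but slow on large
    datasets. A vector index trades some exactness for speed.
    """
    ranked = sorted(vectors, key=lambda vid: euclidean_distance(target, vectors[vid]))
    return ranked[:k]

# Toy embeddings, invented for this example (not produced by a real model).
embeddings = {
    "cat": [0.9, 0.1],
    "kitten": [0.8, 0.2],
    "car": [0.1, 0.9],
    "truck": [0.2, 0.8],
}

closest = nearest_neighbors([0.9, 0.05], embeddings, k=2)
print(closest)
```

With these toy values, the two animal vectors sit close together and far from the two vehicle vectors, so a search near "cat" returns the animals first.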

Why would I want to use one?

Vector databases are seeing a huge spike in popularity because of their use in Retrieval Augmented Generation (RAG). Essentially, RAG is a technique for improving the usefulness of Large Language Models without retraining them, by identifying documents that are relevant to the query and including them in the query context. This can be accomplished by embedding the documents as vectors and storing them in a vector database.
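The retrieval step of RAG can be sketched in a few lines of Python. This is only an illustration under heavy assumptions: the document texts and 3-dimensional vectors are hand-written stand-ins for what an embedding model would produce, `build_prompt` is a hypothetical helper, and a real system would query a vector index rather than sort every document.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two vectors by angle: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def build_prompt(question, question_vector, documents, k):
    """Pick the k most similar documents and place them before the question.

    documents is a list of (text, vector) pairs. The prompt layout here is
    an arbitrary choice for illustration.
    """
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(question_vector, doc[1]),
        reverse=True,
    )
    context = "\n".join(text for text, _ in ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {question}"

# Hand-written toy data; a real pipeline would get these vectors from a model.
documents = [
    ("Dolt supports branches and merges.", [0.9, 0.1, 0.0]),
    ("Prolly trees are content-addressed.", [0.1, 0.9, 0.0]),
    ("The office is closed on Fridays.", [0.0, 0.1, 0.9]),
]

prompt = build_prompt("How do I merge a branch?", [0.8, 0.2, 0.1], documents, k=1)
print(prompt)
```

The resulting prompt is what would be sent to the language model: the question plus only the documents that sit closest to it in the embedding space.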

Vector databases have similar use cases in areas like sentiment analysis, semantic search, recommendation algorithms, and image classification. Basically, anywhere that user-submitted data needs to be semantically compared to items in a search space, you can bet these items are being represented as vectors in a vector database.

Vector databases are also a useful tool for reducing hallucinations in LLM output by semantically comparing the generated output to existing data.

What makes Dolt good for vectors?

A set of vector embeddings may grow over time as additional information is added to the knowledge base. Dolt's version control functionality allows for snapshots of the knowledge base at different points in time to co-exist, reusing storage space for the embeddings that exist in both versions.
Dolt branches can store diverging datasets, and later merge and cherry-pick changes from the different branches.
Dolt's Git-inspired clone command makes it easy to share embeddings online through DoltHub and efficiently fetch updates to an evolving database with dolt pull.
If the result of a search changes unexpectedly, Dolt makes it easy to roll back the database to restore the original behavior, and then diff the branches to find out exactly what changed and why.

The Plan for Vectors

Dolt, like MariaDB, uses the MySQL dialect for SQL. Thus, it makes the most sense for us to use MariaDB's syntax for vectors. This includes a new VECTOR type, and indexes on vector columns that look like this:

ALTER TABLE v ADD COLUMN embedding VECTOR(100);
CREATE VECTOR INDEX vectorIndex ON v (embedding);
SELECT * FROM v ORDER BY VEC_DISTANCE(embedding, 'target_vector') LIMIT 20;

In this example, the SELECT query would find the 20 vectors in the dataset that are (approximately) closest to the target. Vector indexes often favor speed over exact results: it's okay if the query misses some of the closest vectors, as long as it runs quickly. The fraction of the true nearest neighbors that an index actually returns is called its "recall".
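One common way to put a number on recall is to compare the rows an approximate search returned against the rows an exact search would have returned. The sketch below assumes that definition; `recall_at_k` and the row ids are hypothetical and are not taken from Dolt or MariaDB.

```python
def recall_at_k(approximate_ids, exact_ids):
    """Fraction of the true nearest neighbors that the approximate search found."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    found = set(approximate_ids) & set(exact_ids)
    return len(found) / len(exact_ids)

# Made-up row ids: the index missed one true neighbor (18) and
# returned a slightly farther row (40) in its place.
exact = [3, 7, 12, 18, 25]
approximate = [3, 7, 12, 40, 25]

recall = recall_at_k(approximate, exact)
print(recall)
```

A recall of 1.0 means the index returned exactly what a full scan would have. Lower values mean some true neighbors were skipped, which is the price paid for a faster query.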

There are many different algorithms for vector indexes, each of which makes different tradeoffs between speed and recall. Assessing how these algorithms would work on top of Dolt's existing data model turned out to be a really cool technical problem, which I'll be talking about more as we continue to develop them.

That's all we have to say for now. We're currently working on adding support for vectors and vector indexes in Dolt. We'll have more to say as we get closer to launching.

Until then, if you have any questions, you can always join us on Discord and chat, or send us a message on Twitter.
