Announcing Vector Indexes
Back in September, we announced that we were looking at adding support for Vectors and Vector Indexes to Dolt. It seems like every SQL database these days is going all in on vectors, and here at DoltHub we wanted to not just jump on that bandwagon, but show that Dolt's unique data model meant we could offer the novel version-control capabilities to vector datasets.
Then in October, we outlined that challenges that came with impelementing Vector Indexes in a history-independent way, but the potential benefits if we could crack it. This would be a bleeding edge approach, and we were determined to see how far we could go.
Now, we're ready to share what we've accomplished.
As of today's release (v1.47.0), Dolt now allows users to create Vector Indexes using the VECTOR INDEX
keyword.
This is still in alpha, so you probably shouldn't use it for your critical production needs. There shouldn't be any risk of data loss or corruption: nothing that happens to the index can affect the underlying data in your tables. But using the index isn't yet as fast as it could be. And if there are bugs, they could cause Approximate Nearest Neighbor lookups might return inaccurate results.
But if you're looking for a space-efficient, version controlled vector database, we encourage you to give it a try.
Here's a simple example:
CREATE TABLE vectors(pk INT PRIMARY KEY, v JSON, VECTOR INDEX vidx(v));
INSERT INTO vectors VALUES ...;
-- Get the five closest rows to some query vector
SELECT * FROM vectors ORDER BY VEC_DISTANCE(v, "<Query Vector>") LIMIT 5;
This syntax is inspired by MariaDB's vector syntax, which predates MySQL's newly released vector functions.There are some slight differences in fucntion names between the two.
This will return the five rows whose vectors are closest to the query vector, in order of distance. Like all Approximate Nearest Neighbors searches, a small fraction of false negatives is tolerated: some close vectors may not appear in the results. But there are no false positives: faraway vectors will never appear in the results.
You can confirm that a query is actually using the index by running a DESCRIBE PLAN
query. If the index is being used, the result will look something like this:
> DESCRIBE PLAN SELECT * FROM vectors ORDER BY VEC_DISTANCE(v, "<Query Vector>") LIMIT 5;
+-----------------------------------------------------------------------------+
| plan |
+-----------------------------------------------------------------------------+
| Limit(5) |
| └─ IndexedTableAccess(vectors) |
| ├─ index: [vectors.v] |
| ├─ order: VEC_DISTANCE_L2_SQUARED(vectors.v, '...') LIMIT 5 (bigint) |
| └─ columns: [pk v] |
+-----------------------------------------------------------------------------+
One important feature of Dolt's vector indexes is that you can add or remove records from the table without needing to rebuild the index. Unlike some index designs where the dataset either can't be changed after the index is built, or becomes less accurate if additional vectors are added, Dolt's vector indexes are fully history-independent. This means that your vector index will behave exactly the same whether you build it before or after you populate the table. This allows us to leverage Dolt's powerful version control features to efficiently store the history of the index and incrementally pull and merge changes from other branches and remotes. To the best of our knowledge, no other vector database can do this.
Note that if your data is unlikely to change, we still recommend adding all rows to the table before building the index, which should result in a faster overall build time.
Given that this is an alpha feature, it has some limitations:
- The vector index must have a single column whose type is JSON. Each value in the JSON column must be an array with the same number of dimensions. Vector indexes on other data types are not currently allowed.
- While most incremental operations are efficient (
O(log(N))
for you computer science nerds), a small fraction of inserts and removals can trigger a partial rebuild of the index. The average runtime of many incremental operations is still efficient (O(Nlog(N))
), but this time is not evenly distributed, and a small number of operations will be responsible for most of the runtime. - We believe that there's a lot of room for improved performance in general.
Tips for importing vector data
You can find lots of vector datasets on HuggingFace, which allows downloading all their datasets in Parquet format. Dolt has partial support for Parquet: currently, it can import Parquet files so long as they don't contain structs or multidimensional arrays.
In my experience, vector data on HuggingFace typically used one of the following schemas:
- A different column for each dimension.
- A single column with an array type.
- A single column containing a JSON array.
All three of these can be imported into Dolt, although option (1) will need a virtual column before it can be used in a vector index.
If you're having trouble importing your data, give us an example of your data format and we'll add support for it.
Taking Vector Indexes for a Spin
Let's walk through an example. Here's a vector embedding of English-language Wikipedia, where each paragraph is assigned a vector. HuggingFace splits the dataset into multiple files, so we're going to just download the first one to keep the dataset small. You can download it here
That file is 120MB and contains 139k vectors in 768 dimensions. Small potatoes compared to some datasets out there, but enonugh to get our feet wet.
Once you download it, create a Dolt table with a compatible schema and import it:
dolt sql <<'SQL
CREATE TABLE `embeddings` (
`id` int NOT NULL PRIMARY KEY,
`title` VARCHAR(80),
`text` LONGTEXT,
`url` VARCHAR(50),
`wiki_id` INT,
`views` FLOAT,
`paragraph_id` INT,
`langs` INT,
`emb` JSON);
SQL
dolt table import -r embeddings ~/Downloads/0000.parquet
dolt table sql -q "CREATE VECTOR INDEX vidx ON embeddings(emb);"
Then go get a coffee. Or a lunch break. Building the index on my machine took around 2.5 hours, although we expect that time to drop in the future as we implement performance improvements.
Once it's done, we can take it for a spin. Let's pick an arbitrary embedding from the dataset, and then see which embeddings are closest to it. What's the first embedding in the set?
dolt sql -q "select title, text from embeddings where id = 0"
+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| title | text |
+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Deaths in 2022 | The following notable deaths occurred in 2022. Names are reported under the date of death, in alphabetical order. A typical entry reports information in the following sequence: |
+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
That's delightfully morbid. Let's see which other paragraphs in the dataset are most semantically similar.
time dolt sql -q "set @query = (select emb from test where id = 0); select id, title, vec_distance(emb, @query), text from test order by vec_distance(emb, @query) limit 10;"
+-------+------------------------------+---------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | title | vec_distance(emb, @query) | text |
+-------+------------------------------+---------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 0 | Deaths in 2022 | 0 | The following notable deaths occurred in 2022. Names are reported under the date of death, in alphabetical order. A typical entry reports information in the following sequence: |
| 40819 | COVID-19 | 28.03495436767425 | , "Reuters" reported that it had estimated the worldwide total number of deaths due to COVID‑19 to have exceeded five million. |
| 39200 | 2022 United States elections | 28.222566254855977 | Two special elections took place in 2022 to replace senators who resigned during the 117th U.S. Congress: |
| 40821 | COVID-19 | 28.796787135491318 | In September 2020, the US Centers for Disease Control and Prevention (CDC) published preliminary estimates of the risk of death by age groups in the United States, but those estimates were widely misreported and misunderstood. |
| 4183 | 2022 in film | 28.83804286523066 | 2022 in film is an overview of events, including award ceremonies, festivals, a list of country-specific lists of films released, and notable deaths. |
| 2783 | World War II | 29.11468721407523 | Many of the civilians died because of deliberate genocide, massacres, mass bombings, disease, and starvation. |
| 12542 | COVID-19 pandemic | 29.23722639611648 | Based on Johns Hopkins University statistics, the global CFR is ( deaths for cases) as of . The number varies by region and has generally declined over time. |
| 81443 | Iraq War | 30.255731798022758 | There have been several attempts by the media, coalition governments and others to estimate the Iraqi casualties. The table below summarizes some of these estimates and methods. |
| 4184 | 2022 in film | 30.521236505294887 | List of some of the film festivals for 2022 that have been accredited by the Inoternational Federation of Film Producers Associations (FIAPF). |
| 83452 | Wikimedia Foundation | 30.743062429297062 | The following have donated or more each (2008–2019, not including gifts to the Wikimedia Endowment; list may be incomplete): |
+-------+------------------------------+---------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
dolt sql -q 4.90s user 0.27s system 110% cpu 4.688 total
This is really fascinating. The query vector is a specific combination of two ideas (death and 2022), and we see both ideas in the results. And also the Wikimedia Foundation for some reason; maybe these embeddings know something we don't.
To Be Continued
This isn't the end. We have our eyes on additional features for vector indexes, including:
- Applying the vector index to more queries.
- A dedicated
VECTOR
datatype optimized for use in Vector Indexes. - Vectors defined by multiple
FLOAT
columns. - Importing a larger variety of formats.
- Reducing storage by allowing for configurable precision levels in stored vectors.
- Additional distance functions other than Euclidean.
- Multi-indexing, a technique for improving performance when indexing vectors with a high number of dimensions.
- Smarter handling of partial index rebuilds.
As always, the most important factor that dictates our priorities is YOU. If you're considering using Dolt for indexing your vectors, but you have specific requirements, please drop us a line on Discord and let us know. We strive to make Dolt a competitive option for real-world use cases, and we can't do that without your help. We take user feedback very seriously.