Learning About RAG
Dolt recently added support for vector indexes, a type of index commonly used in RAG (Retrieval-Augmented Generation) applications. These applications are all the rage in the AI world right now, and for good reason: they enable developers to build applications that generate text based on a given context, which can then power chatbots, search, and more.
To lean further into Dolt's support for this new and exciting feature, we decided it would be fun to build our own RAG application, backed by Dolt, and generate new blog posts for our company. We'd call this application our "robot blogger," and we plan to start publishing AI generated blog posts created by our robot blogger (for better or worse) on Saturdays. Currently, we are still working out the kinks, but we'll open source our robot blogger soon and encourage you to check it out and give us feedback.
Building the robot blogger, even in a simple form, required me to learn more about vector indexes, RAG, current AI tooling, and LLMs more generally—admittedly, this is a space where I began as a complete noob.
In today's post, I'll share my process for delving into these topics and learning by doing. I'll highlight some of the resources I found coolest and most helpful.
I Started with Blogs!
To start my journey into the world of AI generation, I started by reading Tim's blog post on creating and using vector indexes in Dolt. This post was a great starting point because it provided a succinct, high-level overview of vector indexes, RAG, and how they can be stored in a Dolt database.
After reading Tim's post, I wanted to start building a simple "Hello World" version of a RAG application right away.
To do this, I opted to look for an example using a programming language that I'm comfortable with and a well-established SQL database with vector support. To Tim's chagrin, I often get something working for the first time with MySQL or Postgres before I swap the backend for Dolt. But I do this when testing novel features because it helps me distinguish between bugs in my application code and bugs in the backing database.
With this in mind, I decided to look for a Golang example of RAG using PGVector. I figured if I could get this working, I'd have a solid idea of how to build our robot blogger on top of Dolt.
This notion led me to a really great blog post titled RAG With PostgreSQL, which provided an excellent entry point for my journey into the world of RAG.
It contains a really great example repository you can run locally to get an understanding of the process. Not only that, but it introduced me to a really cool technology called Ollama.
Ollama
Ollama is a really cool engine written natively in Go that allows you to run LLMs on your local machine. I tend to think of it as Docker for LLMs. It supports a wide variety of models, including the Llama 3 series, Meta's open source LLMs.
Just like in the RAG With PostgreSQL example, I've continued to use Ollama ever since learning about it, since it's a really handy way to experiment with different LLMs locally.
After you download it, you can run the engine with the command `ollama serve`. Once running, it's really easy to get started from the command line. For example, you can list all locally available models with the command `ollama list`:
```
% ollama list
NAME             ID              SIZE      MODIFIED
llama3:latest    365c0bd3c000    4.7 GB    12 days ago
```
Here you can see that I have a local copy of the `llama3` model, which I've been using to generate some vector embeddings. Let's say we wanted to run a different model locally. We could do that with the command `ollama run deepseek-r1:latest`:
```
% ollama run deepseek-r1:latest
pulling manifest
pulling 96c415656d37... 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 4.7 GB
pulling 369ca498f347... 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 387 B
pulling 6e4c38e1172f... 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 1.1 KB
pulling f4d24e9138dd... 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 148 B
pulling 40fb844194b2... 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 487 B
verifying sha256 digest
writing manifest
success
>>> Send a message (/? for help)
```
First, `ollama` pulls the model to my local machine (this is what reminds me of Docker), and then it runs it. Once it's running, I can interact with it immediately by sending it a prompt.
```
>>> what is it like on the surface of venus?
<think>
Okay, so I need to figure out what life would be like on Venus's surface. Let me start by recalling what I know about Venus. It's often called Earth's sister planet because they are
similar in size and have a lot of landmasses. But I remember hearing that Venus is really different from Earth, especially when it comes to the environment.

First off, I think the temperature on Venus is extremely high. It must be super hot since it has such thick clouds made of sulfuric acid. If I recall correctly, temperatures go up
to 462 degrees Celsius or something like that. That's way hotter than Earth's surface temperatures, so any life there would have to survive extreme heat.

Then there are the atmospheric conditions. Venus has a much thicker atmosphere composed mostly of carbon dioxide and nitrogen. Because of this, it doesn't retain as much heat as you
might expect. Wait, but since the atmosphere is so dense, maybe it traps heat in a way similar to Earth's. I'm not sure about that part. Also, the lack of liquid water on Venus
would be a big issue because we need water for drinking, eating, and all the other biological processes.

I remember reading somewhere that there are tiny, fragile water droplets floating around in the atmosphere on Venus. These are called "water snow," I think. They're super small and
can't sustain life by themselves. So even though there's water vapor, it doesn't exist in a form that can support living organisms.

The lack of liquid water is probably why Venus doesn't have any moons or rings like we do on Earth. Without water, it's hard to get celestial bodies to orbit or for rings to form
around the planet.

Looking at the landscape, I think Venus has mountains and valleys made by wind erosion over millions of years. These geological features must be quite different from anything on
Earth because there are no oceans to erode them. The thick clouds might cover these features, making it hard to see them from space or even from a spacecraft.

I also wonder about the presence of any atmosphere traps like ozone layers. Venus's atmosphere doesn't have ozone, so there's probably no protection from UV radiation except what
the sulfuric acid clouds provide. That would make life on Venus very dangerous in terms of radiation exposure.

Considering all this, it seems extremely unlikely that Venus has any form of liquid water or living organisms. The environment is just too harsh for anything to survive. Maybe some
microorganisms could live there, but they'd have to be incredibly resilient and probably wouldn't last long without constant reinforcement from the planet's surface.
</think>

Life on Venus' surface would be inhospitable due to its extreme conditions. Here's a detailed summary of why:

1. **Temperature**: The surface temperature on Venus reaches up to 462°C, far beyond what life can survive on Earth.
2. **Atmospheric Conditions**: The atmosphere is composed mainly of carbon dioxide and nitrogen, with minimal trace amounts of water vapor. Despite its density, it does not retain
significant heat, and the lack of liquid water makes sustaining life impossible.
3. **Water Presence**: While there are water droplets in the atmosphere (known as "water snow"), these are too small to support living organisms.
4. **Geological Features**: The planet has mountainous regions formed by wind erosion over millions of years, but the thick sulfuric acid clouds obscure these features from view.
5. **Lack of Celestial Bodies**: Without water, Venus cannot have moons or rings, as seen on Earth.
6. **Atmospheric Traps**: There is no ozone layer, leaving Venus exposed to harmful UV radiation, which poses a significant threat to any life form.
7. **Inhabitable Extremes**: The combination of extreme heat and lack of water makes it impossible for any known form of life to exist on Venus' surface.

While there might be microscopic organisms, their ability to survive such harsh conditions is doubtful without constant reinforcement from the planet's surface, making Venus
inhospitable to life as we know it.
```
Pretty cool, right?
To interact programmatically with an LLM of my choice, just like in the RAG With PostgreSQL example, I can use the `ollama` Go client to send prompts to the local `ollama` server running on port 11434.
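Here's a minimal sketch of what that looks like, assuming the `github.com/ollama/ollama/api` package; the model and prompt are just examples:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ollama/ollama/api"
)

func main() {
	// ClientFromEnvironment reads OLLAMA_HOST if set, and otherwise
	// defaults to http://localhost:11434, where `ollama serve` listens.
	client, err := api.ClientFromEnvironment()
	if err != nil {
		log.Fatal(err)
	}

	req := &api.GenerateRequest{
		Model:  "llama3",
		Prompt: "What is it like on the surface of Venus?",
	}

	// Generate streams the response; the callback fires once per chunk.
	err = client.Generate(context.Background(), req, func(resp api.GenerateResponse) error {
		fmt.Print(resp.Response)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```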
Once I had some fun playing with Ollama, I decided to start hacking away on my own version of the RAG demo application with the intention of eventually making it the robot blogger.
To do this, I came up with a schema similar to the one in the RAG example, but made it so that my local Postgres database would store DoltHub blogs as documents, alongside their vector embeddings. Once I had a prototype working, I tried generating a couple of blogs with a simple prompt, just to see what my RAG implementation would come up with.
I used the prompt: "Write a blog post that details a particular use case for Dolt."
I also used the system prompt from the RAG With PostgreSQL demo application as-is, without augmenting it:
```
Using the given reference text, succinctly answer the question that follows:

reference text is:
%s

end of reference text. The question is:
%s
```
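To make that concrete, here's a rough sketch (my own, not code from the demo application) of how the template gets filled in Go, with hypothetical retrieved chunks:

```go
package main

import (
	"fmt"
	"strings"
)

// systemPrompt mirrors the template above; the %s placeholders receive
// the retrieved reference text and the user's question.
const systemPrompt = `Using the given reference text, succinctly answer the question that follows:

reference text is:
%s

end of reference text. The question is:
%s`

func main() {
	// Hypothetical chunks retrieved from the vector store.
	chunks := []string{
		"Dolt is a SQL database you can branch, diff, and merge.",
	}

	prompt := fmt.Sprintf(systemPrompt,
		strings.Join(chunks, "\n\n"),
		"Write a blog post that details a particular use case for Dolt.",
	)
	fmt.Println(prompt)
}
```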
And, although my application did work, the resulting generated blogs were... uninspiring.
At this point, I needed some tips on how to improve the content my application was generating, and I wasn't quite sure where to tune this thing to get better results.
Some questions that came to mind were:
- Am I using the right LLM model for generating good blog content?
- Am I retrieving the right (or best) documents from the database?
- Am I supplying the right number of context documents to the LLM?
- How can I benchmark my results to know if changing a parameter is actually improving generated results?
Pondering these questions, I figured I needed to know more about how others are using RAG in production. It turns out I could just ask an AI chatbot to explain the process to me.
For starters, I learned that I should not be storing entire DoltHub blogs as documents for submission to the LLM. Instead, I need to chunk blogs into smaller pieces. Doing this would allow me to retrieve more relevant documents from the database and supply the LLM with a more focused context, which should result in better generated content.
Boom 💥.
This made a ton of sense to me, so I started working on chunking DoltHub blogs. These blogs are written in markdown, so I needed to figure out a reasonable way to chunk them, while leaving the content sectioned as logically as possible.
I figured I'd need to chunk based on the markdown headers, indicated by the `#` symbol. I started writing a function to do this, but then decided to search the internet for an existing solution.
I happened upon a package called TextSplitter, which does exactly what I wanted. Little did I know that this package is part of the open source LangChain framework, which I had not heard of before, and which lets you easily build LLM applications!
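For example, here's a minimal sketch of chunking a markdown blog post with langchaingo's `textsplitter` package; the chunk sizes here are illustrative, not tuned values:

```go
package main

import (
	"fmt"
	"log"

	"github.com/tmc/langchaingo/textsplitter"
)

func main() {
	blog := "# Title\n\nIntro paragraph...\n\n## Section One\n\nBody text..."

	// The markdown splitter aims to keep header-delimited sections
	// together where possible, so chunks stay logically grouped.
	splitter := textsplitter.NewMarkdownTextSplitter(
		textsplitter.WithChunkSize(512),
		textsplitter.WithChunkOverlap(64),
	)

	chunks, err := splitter.SplitText(blog)
	if err != nil {
		log.Fatal(err)
	}
	for i, chunk := range chunks {
		fmt.Printf("chunk %d: %q\n", i, chunk)
	}
}
```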
LangChain
Once I learned about LangChain, I realized that I could use it to build my robot blogger RAG application much faster, and with a lot less code. LangChain is a composable framework for building LLM applications, and has an extensive, multi-language ecosystem, including a Go library 🤠.
I immediately scrapped the approach I had taken with my previous RAG application and started over, this time referencing `langchaingo`'s extensive examples directory.
Within the examples, I found a great one on how to use Postgres and PGVector as a `VectorStore`. `VectorStore` is an interface that LangChain uses to store and retrieve vector embeddings.
Storing embeddings is as easy as calling `AddDocuments` with a list of documents, and retrieving them is as easy as calling `SimilaritySearch`.
This was the interface I wanted on top of Dolt for use in the robot blogger application.
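Here's a hedged sketch of that workflow against the PGVector `VectorStore`, using a local Ollama model as the embedder; the connection string and documents are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/tmc/langchaingo/embeddings"
	"github.com/tmc/langchaingo/llms/ollama"
	"github.com/tmc/langchaingo/schema"
	"github.com/tmc/langchaingo/vectorstores/pgvector"
)

func main() {
	ctx := context.Background()

	// Use the local llama3 model to compute embeddings.
	llm, err := ollama.New(ollama.WithModel("llama3"))
	if err != nil {
		log.Fatal(err)
	}
	embedder, err := embeddings.NewEmbedder(llm)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder connection string; assumes a local Postgres with the
	// pgvector extension installed.
	store, err := pgvector.New(ctx,
		pgvector.WithConnectionURL("postgres://postgres:password@localhost:5432/blogs?sslmode=disable"),
		pgvector.WithEmbedder(embedder),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Storing: each document is embedded and written to the store.
	_, err = store.AddDocuments(ctx, []schema.Document{
		{PageContent: "Dolt is a SQL database you can branch, diff, and merge."},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Retrieving: embed the query and return the 3 nearest documents.
	docs, err := store.SimilaritySearch(ctx, "What makes Dolt unique?", 3)
	if err != nil {
		log.Fatal(err)
	}
	for _, doc := range docs {
		fmt.Println(doc.PageContent)
	}
}
```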
The next thing I did was refactor my RAG application to use LangChain's Go library and get everything working again with the PGVector `VectorStore`. Next, I forked `tmc/langchaingo` so I could implement a Dolt `VectorStore`. I was able to get this working, but when I implemented some tests for my Dolt `VectorStore`, I got unexpected results.
I had copied the tests from the `vectorstores/postgres` package and expected Dolt to return the same results as Postgres. But I wasn't getting the expected results for every test. I suspected I might have found a Dolt bug, but wanted to be sure.
Typically, to verify that Dolt has a bug, I simply execute the same test against MySQL. Dolt is a one-to-one, drop-in replacement for MySQL, so if the test passes on MySQL but fails on Dolt, I know it's a Dolt issue and I can file a bug report.
So, I started looking up how to add vectors to my local MySQL instance. As it turns out, vector support in MySQL is not free! You only get access to their vector distance function in Oracle HeatWave. I shook my fist at the heavens.
I hadn't realized that, for this exact reason, Dolt actually aligns with MariaDB on vector index and function support. I also hadn't realized that, like Dolt's, MariaDB's vector support launched just recently too!
With this knowledge, I started to get a MariaDB VectorStore implementation working in my `langchaingo` fork. Once I did, and after talking with our resident Dolt vector expert Nick, I was able to figure out why my Dolt implementation produced different results than both Postgres and MariaDB.
Importantly, both the Postgres and MariaDB VectorStores use the cosine distance function to retrieve embeddings. As Tim told me from his Bayesian days, and as confirmed by ChatGPT, cosine distance is the preferred distance function for vector retrieval. In contrast, Dolt today supports only the Euclidean squared distance function, and this difference is why my tests returned different results than expected. The tests as written assumed cosine distance.
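To see why the metric matters, here's a small illustration (my own sketch, not langchaingo code) showing that the two distance functions can rank the same vectors differently:

```go
package main

import (
	"fmt"
	"math"
)

// euclideanSquared is sensitive to vector magnitude.
func euclideanSquared(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// cosineDistance compares only direction, ignoring magnitude.
func cosineDistance(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	return 1 - dot/(math.Sqrt(normA)*math.Sqrt(normB))
}

func main() {
	query := []float64{1, 1}
	u := []float64{2, 2}   // same direction as query, larger magnitude
	v := []float64{1, 0.5} // closer in space, different direction

	// Cosine distance ranks u first (identical direction), while
	// Euclidean squared distance ranks v first (closer point). Tests
	// that assume one metric can fail under the other.
	fmt.Println("cosine:", cosineDistance(query, u), cosineDistance(query, v))
	fmt.Println("l2sq:  ", euclideanSquared(query, u), euclideanSquared(query, v))
}
```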
This revelation helped me get my fork of `langchaingo` working successfully with both Dolt and MariaDB, and I'm happy to say that my robot blogger application is working great! I'm still ironing out some kinks, but I'll update you on my progress in future posts, and hopefully generate some interesting blog posts along the way.
Conclusion
I hope this post inspires others to dive head-first into the world of AI generation and RAG. We are excited to continue working with these growing technologies and would love to chat with you if you have any suggestions or feedback, or just want to share your own experiences. Maybe you know the answer to one of my questions above! Just come by our Discord and give us a shout.
I've also submitted a PR to `langchaingo` to officially add support for Dolt and MariaDB VectorStores, so we'll see if they decide to merge it. Until next time, thanks for reading and don't forget to check out each of our cool products below:
- Dolt—it's Git for data.
- Doltgres—it's Dolt + PostgreSQL.
- DoltHub—it's GitHub for data.
- DoltLab—it's GitLab for data.
- Hosted Dolt—it's RDS for Dolt databases.
- Dolt Workbench—it's a SQL workbench for Dolt databases.