Announcing DoltHub Robot Blogger

Today we're excited to share our open-source Robot Blogger tool 🤖!

If you've not been following along, I wrote about my experience learning about RAG to get a robot blogger prototype working. Our goal was to build something that we could iterate on to improve the quality of its generated blog posts, while sharing our learnings with the community.

The tool still has some sharp edges, and plenty of room for improvement, but we're excited to share it with the community and see what you all think!

Getting Started with Robot Blogger

To get started with Robot Blogger, you need to install the dependencies to run it. First, install Go if you don't already have it.

Next, you'll either need to install Ollama locally or grab your OpenAI API key.

If you're using Ollama, make sure it's running locally with ollama serve. In addition, be sure to pull the model you'd like to use with ollama pull <model-name>.

ollama serve
ollama pull llama3

Once you have those, you can clone the robot-blogger repo and install it locally with the go install command:

git clone https://github.com/dolthub/robot-blogger.git
cd robot-blogger
go install .

This will install the robot-blogger command line tool in your Go bin directory ($GOBIN if set, otherwise $GOPATH/bin). You can then run it with:

robot-blogger --help

After that, you're ready to use robot-blogger.

Robot Blogger has two modes:

  • Store documents mode
  • Generate blog post mode

Store documents mode

Store documents mode will iterate over a directory, split the files it contains into chunks, and create vector embeddings of those chunks using the model of your choice. The tool will then write these embeddings to your chosen vector store.
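To make the chunking step concrete, here is a minimal sketch of splitting a document into overlapping, fixed-size character chunks. This is not Robot Blogger's actual splitter: the function name split_into_chunks, the chunk_size and overlap parameters, and the character-based strategy are all assumptions made for illustration.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into character chunks of at most chunk_size.

    Hypothetical helper for illustration only. Consecutive chunks share
    `overlap` characters so a sentence cut at a boundary still appears
    whole in at least one chunk's neighborhood.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    doc = "Dolt is a SQL database you can fork, clone, branch, and merge."
    for i, chunk in enumerate(split_into_chunks(doc, chunk_size=30, overlap=5)):
        print(i, repr(chunk))
```

Each chunk would then be passed to the embedding model, and the resulting vector stored alongside the chunk text.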

Before you run this mode, you'll need to ensure your chosen vector store is running and reachable on a port. Currently supported stores are Dolt, MariaDB, and Postgres.

When you're ready to store documents, you can run the following command:

export VECTOR_STORE_PASSWORD=mydbpass
export DOCS_DIR=/path/to/docs

./robot-blogger \
--ollama \
--model=llama3 \
--dolt \
--user=root \
--host=0.0.0.0 \
--port=3306 \
--store-name=robot_blogger_llama3_v1 \
--doc-type=blog_post \
--include-file-ext=".md"

When this completes, you will have a vector store with the embeddings for each of the chunks in your documents.

Gotchas

If you're not using OpenAI, be sure to run this on a host that has access to a GPU. Generating embeddings on a CPU takes a long time, and you may run out of patience, like I did.

Additionally, if you're using Dolt v1.49.3 or earlier, you may need to disable stats refreshing to avoid server crashes. This can be done by running set @@PERSIST.dolt_stats_auto_refresh_enabled = 0;, and then restarting the server.

Generate blog post mode

Once you have your vector store populated, you're ready to generate a blog post. This time, run robot-blogger but use the --prompt flag to provide a prompt for the blog post. You will also need to provide a --topic and --length for the blog post.

export VECTOR_STORE_PASSWORD=mydbpass

./robot-blogger \
--ollama \
--model=llama3 \
--dolt \
--user=root \
--host=0.0.0.0 \
--port=3306 \
--topic="Dolt and DoltHub" \
--length=100 \
--store-name=robot_blogger_llama3_v1 \
--prompt="What are Dolt and DoltHub?"

Be sure to use the same model you used to store the documents, since the vector embeddings are specific to the model. If everything works correctly, your generated blog post will be written to stdout.

...

**What is Dolt?**
--------------

Dolt is an open-source, distributed relational database designed specifically for data engineering and scientific computing workloads. It's a lightweight, high-performance database that provides a scalable and efficient way to manage large datasets. With Dolt, you can easily handle complex queries, data analytics, and machine learning tasks.


**What is DoltHub?**
-------------------

DoltHub is a cloud-based platform that allows you to create, share, and collaborate on Dolt databases, making it an ideal solution for data scientists, engineers, and analysts. It provides a GitHub-like experience for database management, enabling you to fork, merge, and manage multiple databases easily.

...

It tries so hard 😭!
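On the point above about using the same model for both modes: retrieval works by comparing the embedding of your prompt against the stored chunk embeddings, so both have to live in the same vector space. The sketch below shows a toy version of that lookup. It is an illustration under assumptions, not Robot Blogger's code: the names cosine_similarity and top_k are hypothetical, the vectors are made up, and in practice the vector store performs this search for you.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; higher means more similar."""
    if len(a) != len(b):
        # Different models often produce different dimensions, so a
        # mismatch here is a strong hint that two models were mixed.
        raise ValueError("dimension mismatch: were both vectors produced by the same model?")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          stored: Sequence[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[Tuple[float, str]]:
    """Return the k stored chunks most similar to the query embedding."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Made-up 2-dimensional "embeddings" purely for demonstration.
    stored = [
        ("chunk about Dolt", [1.0, 0.0]),
        ("chunk about cooking", [0.0, 1.0]),
        ("chunk about DoltHub", [1.0, 1.0]),
    ]
    for score, text in top_k([1.0, 0.0], stored, k=2):
        print(round(score, 3), text)
```

Note that matching dimensions alone do not make two models compatible. Even when the sizes agree, vectors from different models are not meaningfully comparable.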

Gotchas

Not all models are created equal. During our testing we found that llama3 produced terrible content, whereas gpt-4o-latest produced pretty good content, at least good enough to publish.

Upon further investigation, it seems like the embeddings generated by llama3 were really bad. Or at least, made no sense to me. For example, during RAG search on identical prompts, distance scores for four retrieved documents vectorized with llama3 were:

-26446.89
-26925.535
-27795.71
-28001.463

And their corresponding content was terrible, irrelevant stuff.

On the other hand, the gpt-4o-latest embeddings produced the following distance scores:

0.83733505
0.8358216
0.8358216
0.8348657

And their corresponding content was good, relevant stuff.

I'm not sure why that is. Both runs used the same process for chunking and vectorizing the content, and the exact same prompt.
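One general property that may be worth keeping in mind when comparing scores like these, offered as background and not as a diagnosis of the numbers above: cosine similarity is always between -1 and 1 regardless of vector length, while a raw dot product scales with the magnitude of the vectors. The sketch below demonstrates that with made-up vectors. Whether either set of scores above was computed this way is an assumption I have not verified.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Raw dot product; grows with the magnitude of the inputs."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the normalized vectors; always within [-1, 1]."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot_product(a, b) / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up vectors; the second pair is the first pair scaled by 100.
    a, b = [3.0, 4.0], [4.0, 3.0]
    big_a = [x * 100 for x in a]
    big_b = [x * 100 for x in b]

    print("dot (small):   ", dot_product(a, b))
    print("dot (scaled):  ", dot_product(big_a, big_b))
    print("cosine (small):", cosine_similarity(a, b))
    print("cosine (scaled):", cosine_similarity(big_a, big_b))
```

Scaling both vectors leaves the cosine similarity unchanged but multiplies the dot product by the square of the scale factor, which is one way scores can end up on very different orders of magnitude.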

We plan to investigate this further to find out why some models perform better than others and, more importantly, which models are best for this use case.

In the same vein, not all prompts are created equal. The prompt you choose will have a big impact on the quality of the generated content.

So far in our experiments, we haven't found a silver bullet that can one-shot-prompt the model into generating a great blog post. But we will continue to experiment with different prompting strategies and share our learnings on this too.

Our first generated blog!

On Saturday, we published our first generated blog post written by Robot Blogger! You can read it here.

Here is the prompt we used to generate the blog post:

Dolt is like MySQL and Git had a baby. It is an OLTP SQL database like MySQL but was built from the ground up with versioning, branching, and merging.
These first-class versioning features use Git-like semantics, making Dolt immediately familiar to Git users.

We are launching a **blog series** comparing **Git** and **Dolt**, focusing on specific **commands and features**.
Each post will highlight:
- **How Git and Dolt commands work**
- **Their similarities and key differences**
- **Why these comparisons matter to users**

### **This Week’s Topic: `clone`, `pull`, and `push`**
For this post, **compare and contrast** how these three commands work in **Git vs. Dolt**:
- Explain **each command’s purpose**
- Provide **clear terminal-based examples** for **both tools** (assume a Unix-based terminal)
- Address **key differences** in behavior, output, or workflow
- Highlight **why Dolt is a transformative technology for data versioning**, similar to how Git changed software versioning

💡 **Target Audience:**
- Assume readers have **some MySQL experience** but **may not be familiar with version control**
- Avoid jargon-heavy explanations—focus on clarity and usability

### **Ending & Next Steps**
At the end of the post, **include a teaser** for the next blog entry.
⚡ **Choose a relevant topic** that builds on today’s comparison.
Example: "Next week, we'll explore how `dolt checkout` compares to `git checkout`—and why versioning data is just as powerful as versioning code!"

Stay within a **concise, engaging tone**, keeping the blog **educational yet accessible**.

We also released the Dolt database we used to store embeddings for all our content. It has all of our past blog posts, our documentation, and our weekly emails embedded.

You can find it at dolthub/robot_blogger_v1. The database has an empty main branch, and all embeddings are stored on branches named after the model that was used to generate them.

This project has been exciting since it's allowed us to start using Dolt's VECTOR INDEXES in-house and find some good bugs already! We hope you'll give them a try as well.

Conclusion

I hope this post inspires others to dive head-first into the world of AI generation and RAG. We are excited to continue working with these growing technologies and would love to chat with you if you have any suggestions or feedback, or just want to share your own experiences. Maybe you know the answer to one of the mysteries I've identified above! Just come by our Discord and give us a shout. Don't forget to check out each of our cool products below:

  • Dolt—it's Git for data.
  • Doltgres—it's Dolt + PostgreSQL.
  • DoltHub—it's GitHub for data.
  • DoltLab—it's GitLab for data.
  • Hosted Dolt—it's RDS for Dolt databases.
  • Dolt Workbench—it's a SQL workbench for Dolt databases.
