2 billion primes in a Dolt table
Since releasing Dolt, we have often been asked how it scales. How many rows and how many gigs can you get into a Dolt dataset before things start breaking badly? Answering this question in practice is tricky, simply because human-created datasets that are truly large are hard to find. It would be easy enough to import a machine learning dataset of hundreds of gigs, or to randomly generate lots of data, but we wanted a dataset that would be interesting to query and meaningful to humans, something that would actually showcase what Dolt is for. That's when I hit on the idea of using Dolt to store and query the first 2 billion prime numbers.
I was surprised to find that there's nowhere online to query prime numbers. There are a couple of sites that will quickly tell you the Nth prime up to a trillion, but there's no way to, say, query all the primes between an arbitrary lower and upper bound. The best you can do is download and run a program to generate the primes yourself, or download, extract, and parse hundreds of text files. Here's a prime (haha) example of the latter approach, with 200 compressed text files of 10 million primes each. They come out looking like this (this listing picks up just after the 100,000,000th prime):
2038074751 2038074761 2038074769 2038074793 2038074803 2038074817 2038074853 2038074923 2038074931 2038074947
2038074949 2038075001 2038075019 2038075037 2038075049 2038075069 2038075099 2038075121 2038075163 2038075183
2038075187 2038075189 2038075199 2038075219 2038075229 2038075231 2038075271 2038075289 2038075301 2038075307
2038075339 2038075349 2038075363 2038075373 2038075387 2038075397 2038075423 2038075433 2038075463 2038075469
2038075511 2038075531 2038075547 2038075579 2038075597 2038075619 2038075639 2038075643 2038075649 2038075691
To download the text files, I wrote a quick little bash script to fetch them all:
#!/bin/bash
counter=1
while [ $counter -le 200 ]
do
    wget "http://www.primos.mat.br/dados/2T_part$counter.7z"
    ((counter++))
done
Then I extracted the files with 7zip and wrote a quick Go program to parse them and output the primes, along with their ordinals, in CSV format:
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"sort"
	"strconv"
	"strings"
	"sync"
)

var r = regexp.MustCompile(`2T_part(\d+)\.txt`)

func main() {
	inputDir := os.Args[1]

	var primeFiles []string
	abs, err := filepath.Abs(inputDir)
	if err != nil {
		panic(err)
	}

	stat, err := os.Stat(abs)
	if err != nil {
		panic(err)
	}

	if stat.IsDir() {
		err = filepath.Walk(inputDir, func(path string, info os.FileInfo, err error) error {
			if err != nil {
				return err
			}
			if info.IsDir() {
				return nil
			}
			if strings.HasSuffix(path, ".txt") {
				primeFiles = append(primeFiles, path)
			}
			return nil
		})
		if err != nil {
			panic(err)
		}
	} else {
		primeFiles = append(primeFiles, abs)
	}

	// Sort the files by their part number so they get handed out in order.
	sort.Slice(primeFiles, func(a, b int) bool {
		partNumStrA := r.ReplaceAllString(filepath.Base(primeFiles[a]), "$1")
		partNumStrB := r.ReplaceAllString(filepath.Base(primeFiles[b]), "$1")
		aint, err := strconv.Atoi(partNumStrA)
		if err != nil {
			panic(err)
		}
		bint, err := strconv.Atoi(partNumStrB)
		if err != nil {
			panic(err)
		}
		return aint < bint
	})

	// Process 10 files at a time. Add to the WaitGroup before starting each
	// worker so Wait() can't return before all the work is finished.
	wg := &sync.WaitGroup{}
	workChan := make(chan string, 10)
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for f := range workChan {
				processFile(f)
			}
		}()
	}

	for _, f := range primeFiles {
		workChan <- f
	}

	close(workChan)
	wg.Wait()
}

func processFile(f string) {
	fmt.Printf("Processing file %s\n", f)

	file, err := os.Open(f)
	if err != nil {
		panic(err)
	}
	defer file.Close()

	numStr := r.ReplaceAllString(filepath.Base(f), "$1")
	i, err := strconv.Atoi(numStr)
	if err != nil {
		panic(err)
	}

	// Each file contains 10M primes, so part N starts at ordinal (N-1)*10M + 1.
	ordinal := uint64(i-1)*uint64(10*1000*1000) + uint64(1)

	out, err := os.Create(f + ".csv")
	if err != nil {
		panic(err)
	}

	wr := bufio.NewWriterSize(out, 1024*1024)
	fmt.Fprintf(wr, "ordinal,prime\n")

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Text()
		nums := strings.Fields(line)
		for _, num := range nums {
			_, err := fmt.Fprintf(wr, "%d,%s\n", ordinal, num)
			if err != nil {
				panic(err)
			}
			ordinal++
		}
	}

	if scanner.Err() != nil {
		panic(scanner.Err())
	}

	err = wr.Flush()
	if err != nil {
		panic(err)
	}

	err = out.Close()
	if err != nil {
		panic(err)
	}
}
My first draft didn't process files in parallel, but I screwed this up enough times, and it took long enough to run, that I eventually decided it was worthwhile to parallelize. The program generates one CSV per input file, with each prime keyed by its ordinal, which looks like this:
ordinal,prime
1,2
2,3
3,5
4,7
5,11
...
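Since the parser writes one CSV per input file, there's one small glue step before the import: stitching the parts into a single primes.csv. Here's a minimal sketch of that step, assuming the .txt.csv file names produced above (row order doesn't matter, since every row carries its ordinal):
# combine the per-part CSVs, keeping a single header row
echo "ordinal,prime" > primes.csv
for f in 2T_part*.txt.csv; do
    tail -n +2 "$f" >> primes.csv
done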
Once I had a single CSV, importing it into Dolt was easy:
% dolt table import -u primes primes.csv
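One note on the -u flag: it updates an existing table, so the primes table has to exist before the import. Creating it would look something like this (just a sketch; I'm assuming ordinal as the primary key here):
% dolt sql -q 'create table primes (ordinal bigint unsigned not null, prime bigint unsigned not null, primary key (ordinal));'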
And just like that, I had a way to query the primes with SQL! Here's getting the Nth prime:
% dolt sql -q 'select * from primes where ordinal = 1000000;'
+---------+----------+
| ordinal | prime    |
+---------+----------+
| 1000000 | 15485863 |
+---------+----------+
And here's getting all the primes between two numbers:
% dolt sql -q 'select * from primes where prime > 100000 and prime < 110000 limit 10;'
+---------+--------+
| ordinal | prime  |
+---------+--------+
| 9593    | 100003 |
| 9594    | 100019 |
| 9595    | 100043 |
| 9596    | 100049 |
| 9597    | 100057 |
| 9598    | 100069 |
| 9599    | 100103 |
| 9600    | 100109 |
| 9601    | 100129 |
| 9602    | 100151 |
+---------+--------+
And here's a quick look at the ratio of each prime to its ordinal:
% dolt sql -q 'select cast(prime as decimal) / cast(ordinal as decimal) as ratio from primes where ordinal in (1,10,100,1000,10000,100000,1000000,10000000) limit 8;'
+------------+
| ratio      |
+------------+
| 2          |
| 2.9        |
| 5.41       |
| 7.919      |
| 10.4729    |
| 12.99709   |
| 15.485863  |
| 17.9424673 |
+------------+
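That slow growth is the prime number theorem in action: the Nth prime is roughly N ln N for large N, so the prime-to-ordinal ratio should track the natural log of the ordinal, plus lower-order terms. Here's a throwaway snippet, separate from the import pipeline, to eyeball ln(N) against the ratios above:
package main

import (
	"fmt"
	"math"
)

func main() {
	// ln(n) for the same ordinals queried above; by the prime number theorem,
	// the prime/ordinal ratio should grow roughly like this for large n.
	for _, n := range []float64{1, 10, 100, 1000, 10000, 100000, 1000000, 10000000} {
		fmt.Printf("n=%.0f ln(n)=%.4f\n", n, math.Log(n))
	}
}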
You can find the first 2 billion primes dataset on DoltHub here, and clone it with Dolt like so:
% dolt clone dolthub/primes
You might ask: why would you use Dolt, the database with branches, to store immutable facts about the universe? It's not like the primes are going to change.
Actually, we disagree! Not about the primes; you're probably right that they won't change. But it's easy to imagine someone cloning this dataset and filling in the next 3 trillion primes (hopefully on their own branch, so that not everyone has to pull them every time). Or adding another table of interesting numbers, and writing queries that join it to the primes for other interesting analyses.
Secondly, we think Dolt is the best way to distribute data, even for datasets that don't change. It really shines with the branch and merge semantics for collaboration, but even without those features it's still the best data distribution format we know of. Consider all the work I had to do above to massage something as simple as the list of the first 2 billion primes into a format where I could query it. When you clone a Dolt dataset, you don't have to worry about writing programs to transform the data into the shape you need. You just run a dolt table export command to get the data in CSV, JSON, SQL, or another of the growing number of supported formats. Or just start a local SQL server and start querying the data directly, no transformation necessary!
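For example, after cloning the dataset, getting the data back out or starting a server is a one-liner each (check dolt table export --help and dolt sql-server --help for the full set of options in your version):
% cd primes
% dolt table export primes primes.csv
% dolt sql-server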
Finally, please note that we're working hard to get SQL performance where we want it, but we have a long way to go. We don't have secondary indexes working yet, so queries that don't have a WHERE clause on the primary keys are going to be slow for large tables. And even primary key lookups are still experimental -- you'll need to export DOLT_USE_INDEXES=1 to turn them on.