2 billion primes in a Dolt table
Since releasing Dolt, we have often been asked how it scales. How many rows and how many gigs can you get into a Dolt dataset before things start breaking badly? Answering this question in practice is tricky, simply because human-created datasets that are truly large are hard to find. It would be easy enough to import a machine learning dataset of hundreds of gigs, or to randomly generate lots of data, but we wanted a dataset that would be interesting to query and meaningful to humans, something that would actually showcase what Dolt is for. That's when I hit on the idea of using Dolt to store and query the first 2 billion prime numbers.
I was surprised to find that there's nowhere online to query prime numbers. There are a couple of sites that will quickly tell you the Nth prime up to a trillion, but there's no way to, say, query all the primes between an arbitrary lower and upper bound. The best you can do is download and run a program to generate the primes yourself, or download, extract, and parse hundreds of text files. Here's a prime (haha) example of the latter approach, with 200 compressed text files of 10 million primes each. They come out looking like this (this listing picks up just after the 100,000,000th prime):
2038074751 2038074761 2038074769 2038074793 2038074803 2038074817 2038074853 2038074923 2038074931 2038074947
2038074949 2038075001 2038075019 2038075037 2038075049 2038075069 2038075099 2038075121 2038075163 2038075183
2038075187 2038075189 2038075199 2038075219 2038075229 2038075231 2038075271 2038075289 2038075301 2038075307
2038075339 2038075349 2038075363 2038075373 2038075387 2038075397 2038075423 2038075433 2038075463 2038075469
2038075511 2038075531 2038075547 2038075579 2038075597 2038075619 2038075639 2038075643 2038075649 2038075691
To download the text files, I wrote a quick little bash script to fetch them all:
#!/bin/bash
counter=1
while [ $counter -le 200 ]
do
    wget "http://www.primos.mat.br/dados/2T_part$counter.7z"
    ((counter++))
done
Then I extracted the files with 7zip and wrote a quick Go program to parse them and output the primes, along with their ordinals, in CSV format:
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"sort"
	"strconv"
	"strings"
	"sync"
)

var r = regexp.MustCompile(`2T_part(\d+)\.txt`)

func main() {
	inputDir := os.Args[1]

	var primeFiles []string
	abs, err := filepath.Abs(inputDir)
	if err != nil {
		panic(err)
	}

	stat, err := os.Stat(abs)
	if err != nil {
		panic(err)
	}

	if stat.IsDir() {
		err = filepath.Walk(inputDir, func(path string, info os.FileInfo, err error) error {
			if err != nil {
				return err
			}
			if info.IsDir() {
				return nil
			}
			if strings.HasSuffix(path, ".txt") {
				primeFiles = append(primeFiles, path)
			}
			return nil
		})
		if err != nil {
			panic(err)
		}
	} else {
		primeFiles = append(primeFiles, abs)
	}

	// Sort the files by their part number so they get handed out in order.
	sort.Slice(primeFiles, func(a, b int) bool {
		partNumStrA := r.ReplaceAllString(filepath.Base(primeFiles[a]), "$1")
		partNumStrB := r.ReplaceAllString(filepath.Base(primeFiles[b]), "$1")
		aint, err := strconv.Atoi(partNumStrA)
		if err != nil {
			panic(err)
		}
		bint, err := strconv.Atoi(partNumStrB)
		if err != nil {
			panic(err)
		}
		return aint < bint
	})

	// Process 10 files at a time. Add to the WaitGroup before starting each
	// worker so Wait() can't return before all the work is finished.
	wg := &sync.WaitGroup{}
	workChan := make(chan string, 10)
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for f := range workChan {
				processFile(f)
			}
		}()
	}

	for _, f := range primeFiles {
		workChan <- f
	}

	close(workChan)
	wg.Wait()
}

func processFile(f string) {
	fmt.Printf("Processing file %s\n", f)

	file, err := os.Open(f)
	if err != nil {
		panic(err)
	}
	defer file.Close()

	numStr := r.ReplaceAllString(filepath.Base(f), "$1")
	i, err := strconv.Atoi(numStr)
	if err != nil {
		panic(err)
	}

	// Each file contains 10M primes, so part N starts at ordinal (N-1)*10M + 1.
	ordinal := uint64(i-1)*uint64(10*1000*1000) + uint64(1)

	out, err := os.Create(f + ".csv")
	if err != nil {
		panic(err)
	}

	wr := bufio.NewWriterSize(out, 1024*1024)
	fmt.Fprintf(wr, "ordinal,prime\n")

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Text()
		nums := strings.Fields(line)
		for _, num := range nums {
			_, err := fmt.Fprintf(wr, "%d,%s\n", ordinal, num)
			if err != nil {
				panic(err)
			}
			ordinal++
		}
	}

	if scanner.Err() != nil {
		panic(scanner.Err())
	}

	err = wr.Flush()
	if err != nil {
		panic(err)
	}

	err = out.Close()
	if err != nil {
		panic(err)
	}
}
My first draft didn't process files in parallel, but I screwed this up enough times, and it took long enough to run, that I eventually decided it was worthwhile to parallelize. The program generates one CSV per input file, with each prime keyed by its ordinal, which looks like this:
ordinal,prime
1,2
2,3
3,5
4,7
5,11
...
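Since the parser writes one CSV per input file, there's one small glue step before the import: stitching the parts into a single primes.csv. Here's a minimal sketch of that step, assuming the .txt.csv file names produced above (row order doesn't matter, since every row carries its ordinal):
# combine the per-part CSVs, keeping a single header row
echo "ordinal,prime" > primes.csv
for f in 2T_part*.txt.csv; do
    tail -n +2 "$f" >> primes.csv
done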
Once I had a single CSV, importing it into Dolt was easy:
% dolt table import -u primes primes.csv
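One note on the -u flag: it updates an existing table, so the primes table has to exist before the import. Creating it would look something like this (just a sketch; I'm assuming ordinal as the primary key here):
% dolt sql -q 'create table primes (ordinal bigint unsigned not null, prime bigint unsigned not null, primary key (ordinal));'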
And just like that, I had a way to query the primes with SQL! Here's getting the Nth prime:
% dolt sql -q 'select * from primes where ordinal = 1000000;'
+---------+----------+
| ordinal | prime    |
+---------+----------+
| 1000000 | 15485863 |
+---------+----------+
And here's getting all the primes between two numbers:
% dolt sql -q 'select * from primes where prime > 100000 and prime < 110000 limit 10;'
+---------+--------+
| ordinal | prime  |
+---------+--------+
| 9593    | 100003 |
| 9594    | 100019 |
| 9595    | 100043 |
| 9596    | 100049 |
| 9597    | 100057 |
| 9598    | 100069 |
| 9599    | 100103 |
| 9600    | 100109 |
| 9601    | 100129 |
| 9602    | 100151 |
+---------+--------+
And here's a quick look at the ratio of each prime to its ordinal:
% dolt sql -q 'select cast(prime as decimal) / cast(ordinal as decimal) as ratio from primes where ordinal in (1,10,100,1000,10000,100000,1000000,10000000) limit 8;'
+------------+
| ratio      |
+------------+
| 2          |
| 2.9        |
| 5.41       |
| 7.919      |
| 10.4729    |
| 12.99709   |
| 15.485863  |
| 17.9424673 |
+------------+
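That slow growth is the prime number theorem in action: the Nth prime is roughly N ln N for large N, so the prime-to-ordinal ratio should track the natural log of the ordinal, plus lower-order terms. Here's a throwaway snippet, separate from the import pipeline, to eyeball ln(N) against the ratios above:
package main

import (
	"fmt"
	"math"
)

func main() {
	// ln(n) for the same ordinals queried above; by the prime number theorem,
	// the prime/ordinal ratio should grow roughly like this for large n.
	for _, n := range []float64{1, 10, 100, 1000, 10000, 100000, 1000000, 10000000} {
		fmt.Printf("n=%.0f ln(n)=%.4f\n", n, math.Log(n))
	}
}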
You can find the first 2 billion primes dataset on DoltHub here, and clone it with Dolt like so:
% dolt clone dolthub/primes
You might ask: why would you use Dolt, the database with branches, to store immutable facts about the universe? It's not like the primes are going to change.
Actually, we disagree! Not about the primes; you're probably right that they won't change. But it's easy to imagine someone cloning this dataset and filling in the next 3 trillion primes (hopefully on their own branch, so that not everyone has to pull them every time). Or adding another table of interesting numbers, and writing queries that join it to the primes for other interesting analyses.
Secondly, we think Dolt is the best way to distribute data, even for datasets that don't change. It really shines with the branch and merge semantics for collaboration, but even without those features it's still the best data distribution format we know of. Consider all the work I had to do above to massage something as simple as the list of the first 2 billion primes into a format where I could query it. When you clone a Dolt dataset, you don't have to worry about writing programs to transform the data into the shape you need. You just run a dolt table export command to get the data in CSV, JSON, SQL, or another of the growing number of supported formats. Or just start a local SQL server and start querying the data directly, no transformation necessary!
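For example, after cloning the dataset, getting the data back out or starting a server is a one-liner each (check dolt table export --help and dolt sql-server --help for the full set of options in your version):
% cd primes
% dolt table export primes primes.csv
% dolt sql-server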
Finally, please note that we're working hard to get SQL performance where we want it, but we have a long way to go. We don't have secondary indexes working yet, so queries that don't have a WHERE clause on the primary keys are going to be slow for large tables. And even primary key lookups are still experimental -- you'll need to export DOLT_USE_INDEXES=1 to turn them on.