Updating FBI Crime Data on DoltHub
A little over a year ago I blogged about publishing the FBI NIBRS Crime Data on DoltHub. In that blog, I described the dataset the FBI publishes and discussed some of the challenges of working with large amounts of data in the standard CSV format. I also highlighted why Dolt was (and is) the best format for data, allowing Git-like workflows for collaboration and providing data versioning with branching, merging, and a SQL interface.
Interestingly, since I uploaded the dolthub/fbi-nibrs data to DoltHub last year, it has remained the largest dataset on DoltHub, weighing in at ~212 GBs when first published, and it has consistently been one of our most popular.
Today, I'm sharing that I'm actively updating the DoltHub database with the latest 2019 and 2020 crime data and am also adding the missing states I didn't finish importing last year.
FBI NIBRS Data -- DoltHub's White Whale
This dataset in particular is super interesting to our team, not just because it's the best, most comprehensive national crime dataset online, but also because for the last year it has represented the known limits of both Dolt and DoltHub.
When I set out to create the database originally, my plan was to import the data for all available states and years from 1991 to 2018 on two branches: the master branch, which has the outdated schema the FBI used until 2016, and the latest branch, which has the current schema.
I successfully imported the data for all but about seven states on each branch. However, once the database grew to around 212 GBs, I started hitting limits in both the Dolt client and DoltHub's servers that made completing the dataset too costly to continue.
On the client side, at that time, Dolt was generating lots of garbage on import as it wrote temporary data files to disk (this was before dolt gc). If I recall correctly, attempts to import just a few more GBs into the 212 GB database were eating up most of the 2 TB host I was using.
On the DoltHub server side, attempts to push the growing database to DoltHub began crashing our servers due to memory constraints, our in-band tablefile conjoin process, and our lack of server-side garbage collection.
As a result, and because no customers were really pressing for DoltHub scalability, this seemed like work we could kick down the road a bit, so we decided to hit pause on crime data imports. Instead, we focused on improving DoltHub's other features and on the OLTP use case for Dolt.
We did, and still do, have customers who clone the dolthub/fbi-nibrs database, though, so we've always made sure cloning succeeds, even though other DoltHub operations on this database, like forking, currently time out.
Now, a year later and after many Dolt and DoltHub improvements, I wanted to once again push the limits of Dolt and DoltHub with this dataset by attempting to complete the imports.
My goal is three-fold. First, contrast last year's importing experience with this year's to test out the large number of improvements we've made to our tools, while identifying areas of future work for scaling Dolt and DoltHub. Second, provide some insight into my data importing process for those curious souls brave enough to try big data imports themselves. And third, finally finish importing all the damn data!
The Importing Process
I was able to pick up importing more or less where I left off a year ago, although I did need to clean the data again since I didn't back up the cleaned data last time. Luckily, even after a year, my collection of janky tools is still able to get the job done, which has saved me a lot of time.
To start, I provisioned a Linux host with 6 TBs of disk to make sure I'd have enough space to work with, then set up the host environment with my tools repository, the latest Dolt release, and the latest golang version.
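For anyone provisioning a similar host, getting Dolt in place is just the install one-liner from the Dolt README, plus a quick check that the toolchain is ready (the exact install command may change over time, so double-check the docs):
$ sudo bash -c 'curl -L https://github.com/dolthub/dolt/releases/latest/download/install.sh | bash'
$ dolt version
$ go version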
Next, I downloaded all the available crime data to the host using the py/download-data tool, which scrapes the FBI's S3 buckets where the data is hosted, then downloads and unzips the files containing the CSVs for each state and year. The total data size in CSV is about 71 GBs, but many of these CSVs are duplicate reference tables.
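As a rough sketch of what py/download-data does for a single state and year, it boils down to fetching an archive and unzipping it into a per-state, per-year folder; the bucket URL and file naming below are placeholders, since the real values live in the tool:
$ curl -sSL -o DE-2019.zip "https://<fbi-cde-bucket>.s3.amazonaws.com/<path>/DE-2019.zip"
$ unzip -d <data-dir>/delaware/2019 DE-2019.zip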
Then, I use the go/transform tool to transform the data so it's ready to import into Dolt. The CSVs prior to 2016 all contain lowercase headers and the subsequent CSVs all contain uppercase headers, so to make them consistent for import I run:
$ go run . --migrateLegacyHeaders --dir=<data-dir>
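Conceptually, this step just normalizes the header row of every legacy CSV. A hypothetical equivalent with GNU sed would be to uppercase the first line of each file in place (whether the tool uppercases the old headers or lowercases the new ones, the idea is the same):
$ find <data-dir> -name '*.csv' -exec sed -i '1s/.*/\U&/' {} +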
In addition, the date formats in the CSVs don't match Dolt's accepted date formats, so I have to transform those as well. I also need to generate keys for some of the tables, since when this database was first created, Dolt did not yet support keyless tables. Fortunately, both of these steps are performed by running a single command:
$ go run . --transformForDolt --dir=<data-dir>
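The target date format here is the MySQL-style YYYY-MM-DD literal that Dolt's DATE type accepts, so a raw value along the lines of 28-JAN-18 (my recollection of the raw format, so treat it as an assumption) has to become 2018-01-28 before import. From inside any Dolt database you can confirm the target format parses cleanly:
$ dolt sql -q "SELECT CAST('2018-01-28' AS DATE) AS d;"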
Once these transformation steps complete, the transformForDolt step writes its output to <data-dir>/<state>/<year>/<state abbreviation>/transformed, which houses the cleaned CSVs that are ready for import into Dolt. Lastly, we can save some disk by cleaning up the CSVs we no longer need by running:
$ go run . --removeOldFiles --dir=<data-dir>
With the prerequisite steps out of the way, I can start importing the actual data into the Dolt database. To do so, I clone the latest database by running:
$ dolt clone dolthub/fbi-nibrs
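As an aside, dolt clone also accepts a branch argument, so if you only care about one of the two schemas you can check out just that branch from the start:
$ dolt clone -b latest dolthub/fbi-nibrs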
After the clone finishes, I can cd into the database directory and start importing the CSVs for the years that are missing, using the py/import tool.
This tool is a bit weird. It isn't actually used to import the data; instead, it writes bash scripts which themselves can be used to import the data. I typically hack around on this script on the host when I'm importing the data so I can break imports up into logical state or year combinations. For example, to use the tool to import Delaware data on the latest branch, I run:
$ cd py/import
$ python3 main.py -importScriptPath /path/to/cloned/fbi-nibrs -dataPath /path/to/fbi/csvs
$ cd /path/to/cloned/fbi-nibrs
$ chmod +x legacy/import-delaware-latest.sh
$ nohup ./import-delaware-latest.sh &
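For a sense of what those generated scripts contain, they presumably boil down to a series of dolt table import calls followed by a commit. A hypothetical excerpt, with illustrative table names and paths rather than the tool's actual output, looks something like:
#!/bin/bash
set -e
# Append each cleaned CSV to its existing table on the latest branch, then commit.
dolt checkout latest
dolt table import -u nibrs_incident /path/to/fbi/csvs/delaware/2019/DE/transformed/nibrs_incident.csv
dolt table import -u nibrs_offense /path/to/fbi/csvs/delaware/2019/DE/transformed/nibrs_offense.csv
dolt add -A
dolt commit -m "Import Delaware 2019 crime data"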
This basic but effective process was how I imported the crime data into Dolt over a year ago, and it still works well today. I'm currently in the process of importing the remaining states into the dolthub/fbi-nibrs database, and I'm experiencing first-hand how much Dolt has improved over the last year.
The Good, The Not So Good, And Future Work
Some large state imports against this now very large database still take a while. For example, importing around 3 GBs of CSV data for the state of Texas took a few days, but it did complete, which is great!
We've experienced some challenges with Dolt's import performance on databases with foreign key constraints or a large number of indexes created before an import runs, so the slower import performance here is a known issue we are working to resolve.
In the case of dolthub/fbi-nibrs, the database contains about 40 tables with many indexes and foreign keys, so the imports for the remaining states will take a bit of time. I'm planning on running them on my provisioned host over the next few weeks. To help mitigate some of this performance overhead, I've run dolt sql -q 'set foreign_key_checks=0' against the database, which disables foreign key checks during the table import process.
Also, once an import completes, Dolt's generational garbage collection works really well to clean up the temporary data files on disk, which has made this process much easier than it was before. Despite running out of space on a 2 TB host a year ago, I'm still well under a single TB of disk utilization, and the current size of the database on my host is 350 GBs and counting! I've been tremendously impressed with how much Dolt has improved and scaled so far, but there is still work to do on the DoltHub side to support large databases.
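For reference, the garbage collection here is the dolt gc command mentioned earlier; run from inside the database directory after a big import, it reclaims the space held by unreferenced table files:
$ dolt gc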
Currently, attempts to push this 350 GB behemoth still fail, and fixing that will likely require server-side garbage collection and out-of-band tablefile conjoining, work we still need to put on our roadmap.
However, we are currently in the process of rewriting Dolt's storage layer to be faster and more memory efficient, which we anticipate might actually enable larger pushes to DoltHub as a consequence. In the meantime, my first step is to finish importing all of the crime data on my host machine, after which I can work toward eradicating DoltHub scaling issues. Stay tuned for updates!
If you found this interesting and useful, I encourage you to clone your own copy of this dataset and star the database on DoltHub to let us know you checked it out!
Curious about Dolt, DoltHub and the versioned data format of the future? There's no better place to get started than DoltHub.com where you can download Dolt, host your own public and private databases, or just clone some amazing public databases you won't find anywhere else.
Questions, comments, or looking to start publishing your data in Dolt? Get in touch with our team here or come chat with us on Discord!