Dolt for Beginners: Importing a CSV
The Dolt for Beginners series focuses on topics people getting started with Dolt will likely be interested in. Most of our other blog articles go pretty deep so we want a space to talk about topics that experts may find boring.
After you get Dolt installed, you'll face a fork in the road. Are you creating a new database with new tables and new rows? Or do you already have data that you want to import into Dolt to take advantage of Dolt's unique version control features? Is the data already in a MySQL database? Or is it in another format like a Comma Separated Value (CSV) file? If you have data as a CSV, this article will show you how to get it into Dolt.
Install Dolt
The first thing you need to do to follow along with this article is install Dolt. Dolt is not complicated software. There is no complicated install process. You don't need a Docker container or multiple dependencies installed. You download the single Dolt program and run it.
We have even more convenient ways to install Dolt for every platform.
Once you have Dolt installed, you need to open a terminal like "Terminal" on Mac or Powershell on Windows. Make sure Dolt is on your PATH
by typing dolt
. If everything is working you'll be greeted by a help message with all the valid Dolt commands. For those familiar with the Git command line, Dolt should look pretty similar.
$ dolt
Valid commands for dolt are
init - Create an empty Dolt data repository.
status - Show the working tree status.
add - Add table changes to the list of staged table changes.
diff - Diff a table.
... <trimmed for length>
... <trimmed for length>
... <trimmed for length>
reflog - Show history of named refs.
rebase - Reapplies commits on top of another base tip
ci - Commands for working with Dolt continuous integration configuration.
Import a CSV
CSV is the most common format used to share data. You can save a spreadsheet as a CSV. Most online data repositories post data in CSV format.
Before we import a CSV, we must first create a Dolt database using dolt init
. Dolt stores the database in the current working directory where you run dolt init
. So, I make a new directory called csv-import
and cd
to it before I run dolt init
.
$ mkdir csv-import
$ cd csv-import
$ dolt init
Successfully initialized dolt data repository.
You can import CSVs into Dolt using the Dolt command line. Dolt has a handy command called dolt table import
that is used to import CSVs and files of other formats. This command creates or writes to tables in Dolt.
For my CSV file I head over to Kaggle, a machine learning website with a lot of datasets in CSV form. I browse around a bit and find the New York City Payroll dataset. After signing in, I click the Download link and the file named Citywide_Payroll_Data__Fiscal_Year_.csv.zip
shows up in my Downloads folder. I double click on it to unzip it and now I have /Citywide_Payroll_Data__Fiscal_Year_.csv
ready to import.
dolt table import
has a bunch of options which you can explore in the documentation but at its core it takes two arguments - a table name and a file. The only other option I need is --create-table
because this is a brand new table. Dolt will infer the schema for me.
$ dolt table import --create-table payroll ~/Downloads/Citywide_Payroll_Data__Fiscal_Year_.csv
Rows Processed: 2,194,488, Additions: 2,194,488, Modifications: 0, Had No Effect: 0
Import completed successfully.
That was a bit more data than I expected. Let's inspect what I created using dolt sql
. Let's first look at the schema.
$ dolt sql -q "describe payroll"
+----------------------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------------------+---------------+------+-----+---------+-------+
| Fiscal Year | int | YES | | NULL | |
| Agency Name | varchar(1023) | YES | | NULL | |
| Last Name | varchar(1023) | YES | | NULL | |
| First Name | varchar(1023) | YES | | NULL | |
| Mid Init | varchar(1023) | YES | | NULL | |
| Agency Start Date | varchar(1023) | YES | | NULL | |
| Work Location Borough | varchar(1023) | YES | | NULL | |
| Title Description | varchar(1023) | YES | | NULL | |
| Leave Status as of June 30 | varchar(1023) | YES | | NULL | |
| Base Salary | varchar(1023) | YES | | NULL | |
| Pay Basis | varchar(1023) | YES | | NULL | |
| Regular Hours | float | YES | | NULL | |
| Regular Gross Paid | varchar(1023) | YES | | NULL | |
| OT Hours | float | YES | | NULL | |
| Total OT Paid | varchar(1023) | YES | | NULL | |
| Total Other Pay | varchar(1023) | YES | | NULL | |
+----------------------------+---------------+------+-----+---------+-------+
And now the first three records, in vertical format so it's easier to read in this article.
$ dolt sql --r vertical -q "select * from payroll limit 3"
*************************** 1. row ***************************
Fiscal Year: 2017
Agency Name: DEPT OF ED PER SESSION TEACHER
Last Name: NARISSI
First Name: BARBARA
Mid Init: A
Agency Start Date: 09/04/2001
Work Location Borough: MANHATTAN
Title Description: TEACHER- PER SESSION
Leave Status as of June 30: ACTIVE
Base Salary: $33.18
Pay Basis: per Day
Regular Hours: 0
Regular Gross Paid: $415.80
OT Hours: 0
Total OT Paid: $0.00
Total Other Pay: $0.00
*************************** 2. row ***************************
Fiscal Year: 2017
Agency Name: DEPT OF ED PER SESSION TEACHER
Last Name: MCGRAW
First Name: SUSAN
Mid Init: NULL
Agency Start Date: 10/26/1992
Work Location Borough: MANHATTAN
Title Description: TEACHER- PER SESSION
Leave Status as of June 30: ACTIVE
Base Salary: $33.18
Pay Basis: per Day
Regular Hours: 0
Regular Gross Paid: $2190.04
OT Hours: 0
Total OT Paid: $0.00
Total Other Pay: $0.46
*************************** 3. row ***************************
Fiscal Year: 2014
Agency Name: DEPT OF ED PEDAGOGICAL
Last Name: ROSSITTO MELEND
First Name: ROMY
Mid Init: B
Agency Start Date: 09/03/2002
Work Location Borough: NULL
Title Description: TEACHER
Leave Status as of June 30: ACTIVE
Base Salary: $80987.00
Pay Basis: per Annum
Regular Hours: 0
Regular Gross Paid: $80636.72
OT Hours: 0
Total OT Paid: $0.00
Total Other Pay: $1000.00
The data is in Dolt. Importing a CSV is a quick way to get started experimenting with Dolt.
Conclusion
As you can see importing a CSV to Dolt can be done with a few commands. If you have an existing CSV, importing it to Dolt is a fast way to get started with your own data. Questions? Stop by our Discord and just ask. We love helping beginners get started with Dolt.