Dolt for Beginners: Importing a CSV

REFERENCE
4 min read

The Dolt for Beginners series focuses on topics people getting started with Dolt will likely be interested in. Most of our other blog articles go pretty deep so we want a space to talk about topics that experts may find boring.

Dolt for Beginners

After you get Dolt installed, you'll face a fork in the road. Are you creating a new database with new tables and new rows? Or do you already have data that you want to import into Dolt to take advantage of Dolt's unique version control features? Is the data already in a MySQL database? Or is it in another format like a Comma Separated Value (CSV) file? If you have data as a CSV, this article will show you how to get it into Dolt.

Install Dolt

The first thing you need to do to follow along with this article is install Dolt. Dolt is not complicated software. There is no complicated install process. You don't need a Docker container or multiple dependencies installed. You download the single Dolt program and run it.

We have even more convenient ways to install Dolt for every platform.

Once you have Dolt installed, you need to open a terminal like "Terminal" on Mac or Powershell on Windows. Make sure Dolt is on your PATH by typing dolt. If everything is working you'll be greeted by a help message with all the valid Dolt commands. For those familiar with the Git command line, Dolt should look pretty similar.

$ dolt
Valid commands for dolt are
                init - Create an empty Dolt data repository.
              status - Show the working tree status.
                 add - Add table changes to the list of staged table changes.
                diff - Diff a table.
... <trimmed for length>
... <trimmed for length>
... <trimmed for length>
              reflog - Show history of named refs.
              rebase - Reapplies commits on top of another base tip
                  ci - Commands for working with Dolt continuous integration configuration.

Import a CSV

CSV is the most common format used to share data. You can save a spreadsheet as a CSV. Most online data repositories post data in CSV format.

Before we import a CSV, we must first create a Dolt database using dolt init. Dolt stores the database in the current working directory where you run dolt init. So, I make a new directory called csv-import and cd to it before I run dolt init.

$ mkdir csv-import 
$ cd csv-import 
$ dolt init
Successfully initialized dolt data repository.

You can import CSVs into Dolt using the Dolt command line. Dolt has a handy command called dolt table import that is used to import CSVs and files of other formats. This command creates or writes to tables in Dolt.

For my CSV file I head over to Kaggle, a machine learning website with a lot of datasets in CSV form. I browse around a bit and find the New York City Payroll dataset. After signing in, I click the Download link and the file named Citywide_Payroll_Data__Fiscal_Year_.csv.zip shows up in my Downloads folder. I double click on it to unzip it and now I have /Citywide_Payroll_Data__Fiscal_Year_.csv ready to import.

dolt table import has a bunch of options which you can explore in the documentation but at its core it takes two arguments - a table name and a file. The only other option I need is --create-table because this is a brand new table. Dolt will infer the schema for me.

$ dolt table import --create-table payroll ~/Downloads/Citywide_Payroll_Data__Fiscal_Year_.csv
Rows Processed: 2,194,488, Additions: 2,194,488, Modifications: 0, Had No Effect: 0
Import completed successfully.

That was a bit more data than I expected. Let's inspect what I created using dolt sql. Let's first look at the schema.

$ dolt sql -q "describe payroll"
+----------------------------+---------------+------+-----+---------+-------+
| Field                      | Type          | Null | Key | Default | Extra |
+----------------------------+---------------+------+-----+---------+-------+
| Fiscal Year                | int           | YES  |     | NULL    |       |
| Agency Name                | varchar(1023) | YES  |     | NULL    |       |
| Last Name                  | varchar(1023) | YES  |     | NULL    |       |
| First Name                 | varchar(1023) | YES  |     | NULL    |       |
| Mid Init                   | varchar(1023) | YES  |     | NULL    |       |
| Agency Start Date          | varchar(1023) | YES  |     | NULL    |       |
| Work Location Borough      | varchar(1023) | YES  |     | NULL    |       |
| Title Description          | varchar(1023) | YES  |     | NULL    |       |
| Leave Status as of June 30 | varchar(1023) | YES  |     | NULL    |       |
| Base Salary                | varchar(1023) | YES  |     | NULL    |       |
| Pay Basis                  | varchar(1023) | YES  |     | NULL    |       |
| Regular Hours              | float         | YES  |     | NULL    |       |
| Regular Gross Paid         | varchar(1023) | YES  |     | NULL    |       |
| OT Hours                   | float         | YES  |     | NULL    |       |
| Total OT Paid              | varchar(1023) | YES  |     | NULL    |       |
| Total Other Pay            | varchar(1023) | YES  |     | NULL    |       |
+----------------------------+---------------+------+-----+---------+-------+

And now the first three records, in vertical format so it's easier to read in this article.

$ dolt sql --r vertical -q "select * from payroll limit 3" 
*************************** 1. row ***************************
               Fiscal Year: 2017
               Agency Name: DEPT OF ED PER SESSION TEACHER
                 Last Name: NARISSI
                First Name: BARBARA
                  Mid Init: A
         Agency Start Date: 09/04/2001
     Work Location Borough: MANHATTAN    
         Title Description: TEACHER- PER SESSION                                        
Leave Status as of June 30: ACTIVE
               Base Salary: $33.18
                 Pay Basis: per Day
             Regular Hours: 0
        Regular Gross Paid: $415.80
                  OT Hours: 0
             Total OT Paid: $0.00
           Total Other Pay: $0.00

*************************** 2. row ***************************
               Fiscal Year: 2017
               Agency Name: DEPT OF ED PER SESSION TEACHER
                 Last Name: MCGRAW
                First Name: SUSAN
                  Mid Init: NULL
         Agency Start Date: 10/26/1992
     Work Location Borough: MANHATTAN    
         Title Description: TEACHER- PER SESSION                                        
Leave Status as of June 30: ACTIVE
               Base Salary: $33.18
                 Pay Basis: per Day
             Regular Hours: 0
        Regular Gross Paid: $2190.04
                  OT Hours: 0
             Total OT Paid: $0.00
           Total Other Pay: $0.46

*************************** 3. row ***************************
               Fiscal Year: 2014
               Agency Name: DEPT OF ED PEDAGOGICAL        
                 Last Name: ROSSITTO MELEND
                First Name: ROMY
                  Mid Init: B
         Agency Start Date: 09/03/2002
     Work Location Borough: NULL
         Title Description: TEACHER                                                     
Leave Status as of June 30: ACTIVE
               Base Salary: $80987.00
                 Pay Basis: per Annum
             Regular Hours: 0
        Regular Gross Paid: $80636.72
                  OT Hours: 0
             Total OT Paid: $0.00
           Total Other Pay: $1000.00

The data is in Dolt. Importing a CSV is a quick way to get started experimenting with Dolt.

Conclusion

As you can see importing a CSV to Dolt can be done with a few commands. If you have an existing CSV, importing it to Dolt is a fast way to get started with your own data. Questions? Stop by our Discord and just ask. We love helping beginners get started with Dolt.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.