2020 Census
Having accurate information on where people live within a country is of critical importance to the functioning of that country's government. The United States Constitution mandates that Congress conduct a census every 10 years. It is a labor-intensive and expensive operation that determines representation in the US Congress, drives federal and state redistricting, has implications for federal funding, and has numerous other uses both within and outside the government.
On August 12, 2021, the US Census Bureau released the 2020 census redistricting data. The raw data provided by the US Census Bureau is very difficult to work with. You can spend many hours reading the 247-page technical manual that documents the data, and then import it into a tool of your choosing. Or you can clone it and start working with the data immediately using Dolt:
dolt clone dolthub/census2020
cd census2020
dolt sql
You can even take a look and query the data on the web at https://www.dolthub.com/repositories/dolthub/census2020.
In this blog I'll go through the process of importing this data into Dolt, and the challenges I ran into. I'll talk about how I made the data more usable, and then cover some examples.
The Source Data
The raw data is available in an FTP directory containing 52 zip files (fifty state zips, one for Washington DC, and one for Puerto Rico). Extracting each zip yielded four pipe-delimited files: text files with one row of data per line, where each column is separated by the pipe character '|'. Some files had hundreds of columns on each line, and the first row of the file did not contain the column headers, so what each field represented wasn't obvious. To understand the data I needed the 2020 Census State Redistricting Data Technical Documentation. The document is massive, and it took a fair amount of time to understand the data well enough to put together a schema that would provide the raw, unfiltered data in a usable way.
The .pl Files
The four files extracted from each zip were named xx000012020.pl, xx000022020.pl, xx000032020.pl, and xxgeo2020.pl, where xx is the State/US Abbreviation or "STUSAB", which includes all two-letter US state abbreviations plus abbreviations for Washington DC and Puerto Rico. The documentation describes the structure of this data as:
• Geographic header file
• File01 (Tables P1 and P2)
• File02 (Tables P3, P4, and H1)
• File03 (Table P5)
To get the complete dataset for the 2020 Census State Redistricting Data (Public Law 94-171) Summary File users must download all four files.
It is easiest to think of the file set as a single file that has been broken into four physical parts: the geographic header file, file01, file02, and file03...
A unique logical record number (LOGRECNO in the geographic header) is assigned to all files for a specific geographic entity. This field is the key that links records across all four files.
Summary Levels
A core concept of the census data is summary level or "SUMLEV". The census data is provided at multiple geographical levels. For example, summary level "040" provides state level data. If you were to open ca000012020.pl and find the line where "SUMLEV" is "040" you would be looking at the counts for the entire state of California. In that same file there are 58 rows with the SUMLEV "050". Each of these rows corresponds to one of California's 58 counties.
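Once the data is in a queryable table (the schema described later in this post stores these rows in a table named geo), you can count rows per summary level to see how a state is partitioned. A quick sketch, assuming that schema:

```sql
-- Number of geographic rows per summary level for California;
-- SUMLEV '050' should show 58 rows, one per county
SELECT SUMLEV, COUNT(*) AS num_rows
FROM geo
WHERE STUSAB = 'CA'
GROUP BY SUMLEV
ORDER BY SUMLEV;
```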
Summary levels are arranged in a hierarchy where each child represents a different way of splitting up the data of its parent. The hierarchy looks like this:
040 State
  500 State-Congressional District
    510 State-Congressional District-County
      511 State-Congressional District-County-Census Tract
      521 State-Congressional District-County-County Subdivision
    531 State-Congressional District-Place/Remainder
    541 State-Congressional District-Consolidated City
    550 State-Congressional District-American Indian Area/Alaska Native Area/Hawaiian Home Land
      553 State-Congressional District-American Indian Area-Tribal Subdivision/Remainder
    570 State-Congressional District-School District (Elementary)/Remainder
    571 State-Congressional District-School District (Secondary)/Remainder
    572 State-Congressional District-School District (Unified)/Remainder
  610 State-State Legislative District (Upper Chamber)
    612 State-State Legislative District (Upper Chamber)-County
      613 State-State Legislative District (Upper Chamber)-County-County Subdivision
        632 State-State Legislative District (Upper Chamber)-County-County Subdivision-Subminor Civil Division
      630 State-State Legislative District (Upper Chamber)-County-Voting District/Remainder
      631 State-State Legislative District (Upper Chamber)-County-Census Tract
    614 State-State Legislative District (Upper Chamber)-Place/Remainder
    615 State-State Legislative District (Upper Chamber)-Consolidated City
    616 State-State Legislative District (Upper Chamber)-American Indian Area/Alaska Native Area/Hawaiian Home Land
      633 State-State Legislative District (Upper Chamber)-American Indian Area-Tribal Subdivision/Remainder
    634 State-State Legislative District (Upper Chamber)-Alaska Native Regional Corporation
    617 State-State Legislative District (Upper Chamber)-School District (Elementary)/Remainder
    618 State-State Legislative District (Upper Chamber)-School District (Secondary)/Remainder
    619 State-State Legislative District (Upper Chamber)-School District (Unified)/Remainder
  620 State-State Legislative District (Lower Chamber)
    622 State-State Legislative District (Lower Chamber)-County
      623 State-State Legislative District (Lower Chamber)-County-County Subdivision
        637 State-State Legislative District (Lower Chamber)-County-County Subdivision-Subminor Civil Division
      635 State-State Legislative District (Lower Chamber)-County-Voting District/Remainder
      636 State-State Legislative District (Lower Chamber)-County-Census Tract
    624 State-State Legislative District (Lower Chamber)-Place/Remainder
    625 State-State Legislative District (Lower Chamber)-Consolidated City
    626 State-State Legislative District (Lower Chamber)-American Indian Area/Alaska Native Area/Hawaiian Home Land
      638 State-State Legislative District (Lower Chamber)-American Indian Area-Tribal Subdivision/Remainder
    639 State-State Legislative District (Lower Chamber)-Alaska Native Regional Corporation
    627 State-State Legislative District (Lower Chamber)-School District (Elementary)/Remainder
    628 State-State Legislative District (Lower Chamber)-School District (Secondary)/Remainder
    629 State-State Legislative District (Lower Chamber)-School District (Unified)/Remainder
  050 County
    060 State-County-County Subdivision
      067 State-County-County Subdivision-Subminor Civil Division
    512 State-County-Congressional District
    640 State-County-State Legislative District (Upper Chamber)
    641 State-County-State Legislative District (Lower Chamber)
    140 State-County-Census Tract
      150 State-County-Census Tract-Block Group
    700 State-County-Voting District/Remainder
      701 State-County-Voting District/Remainder-Place/Remainder
      702 State-County-Voting District/Remainder-Consolidated City
At its coarsest the data is provided at the state level. A state can be broken up based on its counties, congressional districts, or its upper- and lower-chamber state legislative districts. The sum of the counts in any summary level will be equal to the count of its parent. Back to our California example: the sum of the populations of the 58 counties equals the population of the state, and the sum of the populations of the 20 Los Angeles County subdivisions equals the population of Los Angeles County.
Defining a Schema
With a basic understanding of the files and the way the row data is organized, it was time to create some tables. The documentation states that the data can be thought of as a single logical record split over four files. Rather than sticking everything into a single table, I put the data from the xxgeo2020.pl files into one table called geo and combined the counts from the other three files into a single table called census.
To create the tables I needed to know what data each column held and its type. This was a tiresome process, as I had to slowly go through the documentation and record each field's name, its type, and its maximum length.
I wanted users to be able to refer back to the source documentation and have it still be valid, so field names were maintained, but the notes on each field and what it represents were pulled out of the documentation and put into a table named columns. This makes it easier for users to find the data they are looking for. Here is a sample of the data returned by running SELECT * FROM columns:
+------------+-------------+--------------------------------------------------+------------+------------------+
| table_name | column_name | description | data_type | universe |
+------------+-------------+--------------------------------------------------+------------+------------------+
| census | CHARITER | Characteristic Iteration | varchar(3) | NULL |
| census | CIFSN | Characteristic Iteration File Sequence Number | varchar(2) | NULL |
| census | FILEID | File Identification | varchar(6) | NULL |
| census | H0010001 | Total | BIGINT | Housing Units |
| census | H0010002 | Occupied | BIGINT | Housing Units |
| census | H0010003 | Vacant | BIGINT | Housing Units |
| census | LOGRECNO | Logical Record Number | BIGINT | NULL |
| census | P0010001 | Total | BIGINT | Total Population |
| census | P0010002 | Population of one race | BIGINT | Total Population |
| census | P0010003 | White alone | BIGINT | Total Population |
| census | P0010004 | Black or African American alone | BIGINT | Total Population |
| census | P0010005 | American Indian and Alaska Native alone | BIGINT | Total Population |
| census | P0010006 | Asian alone | BIGINT | Total Population |
| census | P0010007 | Native Hawaiian and Other Pacific Islander alone | BIGINT | Total Population |
| census | P0010008 | Some Other Race alone | BIGINT | Total Population |
...
So using the columns P0010001 through P0010008 you can get the percentage of the population in each race category, as well as the percentage of the population of mixed race. We'll use these fields later in our examples.
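As a quick sketch of how those columns combine (using the geo and census tables described above), the share of a state's population reporting more than one race is simply the total minus the one-race count, divided by the total:

```sql
-- Mixed-race share for the whole state of California (SUMLEV '040')
SELECT (census.P0010001 - census.P0010002) / CAST(census.P0010001 AS decimal) AS pct_mixed
FROM census
JOIN geo
  ON census.STUSAB = geo.STUSAB AND census.LOGRECNO = geo.LOGRECNO
WHERE geo.SUMLEV = '040' AND geo.STUSAB = 'CA';
```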
Importing the Data
Once I had created my tables it was time to import the data. The only challenge was that the .pl files did not have column headers. I created a separate .pl file containing just the headers for each file type and ran:
cat file1_headers.pl xx2020.pl/ak000012020.pl | dolt table import -u --file-type psv census
cat file2_headers.pl xx2020.pl/ak000022020.pl | dolt table import -u --file-type psv census
cat file3_headers.pl xx2020.pl/ak000032020.pl | dolt table import -u --file-type psv census
cat geo_headers.pl xx2020.pl/akgeo2020.pl | dolt table import -u --file-type psv geo
Indexes
When you query this data, you first need to decide which summary level you want to work at. Any useful query against this dataset needs to provide a value for SUMLEV, so an index was added on SUMLEV. Some summary levels have a lot of data, and often you will be querying a specific area rather than the entire United States. Adding a composite index on STUSAB and SUMLEV together lets you limit the data read to the state(s) you are interested in at a specific summary level. Here are the index creation statements:
CREATE INDEX geo_sumlev ON geo(SUMLEV);
CREATE INDEX geo_stusab_sumlev ON geo(STUSAB,SUMLEV);
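To confirm that a query can actually use these indexes, you can ask for a plan with a MySQL-style EXPLAIN. Treat this as a sketch; the plan output format can differ between Dolt versions:

```sql
-- The plan should reference the geo_stusab_sumlev index rather than a full table scan
EXPLAIN SELECT LOGRECNO
FROM geo
WHERE STUSAB = 'CA' AND SUMLEV = '050';
```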
Querying the Data
If you haven't done so already, now is a good time to install Dolt and clone the data:
dolt clone dolthub/census2020
With the data cloned, you can open a SQL shell to query it:
cd census2020
dolt sql
Now that we have installed Dolt and cloned the census data, we are equipped to find the records we are looking for. The geo table needs to be joined with the census table in order to get useful data. To make your query performant, it is important to provide both the STUSAB and the LOGRECNO when joining with the census table. A simple query to get all the state populations would be:
SELECT census.STUSAB as state, census.P0010001 as population
FROM census
JOIN (
SELECT STUSAB, LOGRECNO
FROM GEO
WHERE SUMLEV = '040'
) as x
ON census.STUSAB = x.STUSAB and census.LOGRECNO = x.LOGRECNO
ORDER BY population DESC;
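This same join can be used to sanity-check the parent/child invariant from earlier: summing a state's county rows (SUMLEV '050') should reproduce its state row (SUMLEV '040'). A sketch for California:

```sql
-- Both result rows should report the same population for California
SELECT x.SUMLEV, SUM(census.P0010001) AS population
FROM census
JOIN (
  SELECT STUSAB, LOGRECNO, SUMLEV
  FROM geo
  WHERE STUSAB = 'CA' AND SUMLEV IN ('040', '050')
) AS x
  ON census.STUSAB = x.STUSAB AND census.LOGRECNO = x.LOGRECNO
GROUP BY x.SUMLEV;
```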
Making the Data More Usable
Querying at the state level is very straightforward, but getting useful results at other summary levels requires a bit more work. Say you wanted to look at the data for Los Angeles County. You can see county-level data is summary level "050", but how do you find Los Angeles County? There is a COUNTY field in the geo data. We can query the columns table to get the info from the docs:
SELECT * FROM columns WHERE column_name = 'COUNTY';
+------------+-------------+---------------+------------+----------+
| table_name | column_name | description | data_type | universe |
+------------+-------------+---------------+------------+----------+
| geo | COUNTY | County (FIPS) | VARCHAR(3) | NULL |
+------------+-------------+---------------+------------+----------+
So COUNTY is a 3-character FIPS code. After some digging I was able to find a document containing the counties for every state. At the same time I also found documents covering the FIPS codes for county subdivisions and places. I've imported that data as well, into the tables counties, county_subdivisions, and places.
Now we can easily find the FIPS code for Los Angeles County:
SELECT * FROM counties WHERE STUSAB = 'CA' and COUNTYNAME LIKE 'L%';
+--------+-------+--------+--------------------+----------+
| STUSAB | STATE | COUNTY | COUNTYNAME | COUNTYCC |
+--------+-------+--------+--------------------+----------+
| CA | 06 | 033 | Lake County | H1 |
| CA | 06 | 035 | Lassen County | H1 |
| CA | 06 | 037 | Los Angeles County | H1 |
+--------+-------+--------+--------------------+----------+
Now that we know the FIPS code is "037" we can get the population:
SELECT census.P0010001 as population
FROM census
JOIN (
SELECT STUSAB, LOGRECNO
FROM GEO
WHERE SUMLEV = '050' and STUSAB = 'CA' and COUNTY = '037'
) as x
ON census.STUSAB = x.STUSAB and census.LOGRECNO = x.LOGRECNO
ORDER BY population DESC;
+------------+
| population |
+------------+
| 10014009 |
+------------+
Demographic Data for Texas
It's time for a more complicated example. We'll be pulling demographic data for all Texas counties.
First, in order to get the right census records, we will query the geo table, setting the SUMLEV and STUSAB:
SELECT stusab, logrecno, county
FROM geo
WHERE stusab = 'TX' and sumlev = '050';
We can look at the columns table to help us find the correct columns:
+------------+-------------+--------------------------------------------------+------------------+
| table_name | column_name | description | universe |
+------------+-------------+--------------------------------------------------+------------------+
| census | P0010001 | Total | Total Population |
| census | P0010002 | Population of one race | Total Population |
| census | P0010003 | White alone | Total Population |
| census | P0010004 | Black or African American alone | Total Population |
| census | P0010005 | American Indian and Alaska Native alone | Total Population |
| census | P0010006 | Asian alone | Total Population |
| census | P0010007 | Native Hawaiian and Other Pacific Islander alone | Total Population |
| census     | P0010008    | Some Other Race alone                            | Total Population |
+------------+-------------+--------------------------------------------------+------------------+
We select them and give them names that are easier to understand for our query:
SELECT P0010001 as total,
P0010001 - P0010002 as mixed_race,
P0010003 as white,
P0010004 as black,
P0010005 as aa_and_an,
P0010006 as asian,
P0010007 as nh_and_pi,
P0010008 as other,
COUNTY,
census.STUSAB
FROM census
JOIN (
SELECT stusab, logrecno, county
FROM geo
WHERE stusab = 'TX' and sumlev = '050'
) as x
ON census.stusab = x.stusab and census.logrecno = x.logrecno
ORDER by total DESC;
Lastly we'll join this with the counties table to get county names and convert the counts to percentages:
SELECT countyname,
total,
mixed_race / totalf as mixed_race,
white / totalf as white,
black / totalf as black,
aa_and_an / totalf as aa_and_an,
asian / totalf as asian,
nh_and_pi / totalf as nh_and_pi,
other / totalf as other
FROM (
SELECT P0010001 as total,
CAST(P0010001 as decimal) as totalf,
P0010001 - P0010002 as mixed_race,
P0010003 as white,
P0010004 as black,
P0010005 as aa_and_an,
P0010006 as asian,
P0010007 as nh_and_pi,
P0010008 as other,
COUNTY,
census.STUSAB
FROM census
JOIN (
SELECT stusab, logrecno, county
FROM geo
WHERE stusab = 'TX'
and sumlev = '050'
) as x
ON census.stusab = x.stusab and census.logrecno = x.logrecno
) as y
JOIN counties
ON counties.county = y.county and counties.stusab = y.stusab
ORDER BY total DESC;
+----------------------+---------+----------------------+---------------------+-----------------------+-----------------------+------------------------+------------------------+----------------------+
| COUNTYNAME | total | mixed_race | white | black | aa_and_an | asian | nh_and_pi | other |
+----------------------+---------+----------------------+---------------------+-----------------------+-----------------------+------------------------+------------------------+----------------------+
| Harris County | 4731145 | 0.16912502153284248 | 0.36362360485675244 | 0.19172166568557927 | 0.011627417887213349 | 0.07382314429170951 | 0.0008752638103461213 | 0.18920388193555682 |
| Dallas County | 2613539 | 0.16113017636239596 | 0.35365188734509034 | 0.21969559283408435 | 0.012034639620835962 | 0.07006744494725352 | 0.0006565809808080155 | 0.18276367790953185 |
| Tarrant County | 2110640 | 0.14033421142402305 | 0.49489680855096085 | 0.17375156350680362 | 0.008855133987795171 | 0.061325948527460865 | 0.002156218019179017 | 0.11868011598377744 |
| Bexar County | 2009324 | 0.2610534687287864 | 0.4577305601286801 | 0.08027923819155099 | 0.01154268798859716 | 0.03417816141149959 | 0.0017662656694490287 | 0.15344961788143674 |
| Travis County | 1290188 | 0.163197146462376 | 0.5498787773564783 | 0.07849011151863139 | 0.009659832520531892 | 0.07831261800605803 | 0.0007898073769094117 | 0.11967170675901496 |
| Collin County | 1064465 | 0.111036999807415 | 0.543382826114527 | 0.10360509739634464 | 0.007036398566415993 | 0.17775032528077486 | 0.0006754566848134979 | 0.05651289614970901 |
| Denton County | 906422 | 0.12808603498149868 | 0.5807041311883427 | 0.10794861554551853 | 0.008243400976587064 | 0.10311201625732827 | 0.0008318421221020672 | 0.07107395892862264 |
| Hidalgo County | 870781 | 0.405530207939769 | 0.3451545222047794 | 0.005452576480194217 | 0.007068367362172578 | 0.010390672281549551 | 0.00020096901517143806 | 0.22620268471636382 |
| El Paso County | 865657 | 0.3584179415172522 | 0.3624310783601357 | 0.03356294698708611 | 0.011941219212690478 | 0.013946632442179755 | 0.0021775368304074246 | 0.21752264465024831 |
...
Run the query to get the full result set.
Conclusion
The United States government spent an estimated $15.6 billion collecting the 2020 census. It is used by the government for numerous purposes, but it also has innumerable business and research use cases... but it is kind of a pain to work with. Many companies pay to get this information distilled into a usable format, but the data is available on DoltHub now. I have provided some resources for making the data more usable, and I will continue to improve this dataset by adding tables such as school and congressional districts. If you want to help make this dataset even better, go to DoltHub, create a free account, and fork the dataset. Make modifications to your fork and submit pull requests. Help us make this the most complete, easiest-to-use source of census data.