2020 Census
Having accurate information on where people live within a country is of critical importance to the functioning of that country's government. The United States Constitution mandates that Congress conduct a census every 10 years. It is a labor-intensive and expensive operation that determines representation in the US Congress, drives federal and state redistricting, has implications for federal funding, and has numerous other uses both within and outside the government.
On August 12, 2021, the US Census Bureau released the 2020 census redistricting data. The raw data provided by the US Census Bureau is very difficult to work with. You can spend many hours reading the 247-page technical manual that documents the data, and then import it into a tool of your choosing. Or you can clone it and start working with the data immediately using Dolt:
dolt clone dolthub/census2020
cd census2020
dolt sql
You can even take a look and query the data on the web at https://www.dolthub.com/repositories/dolthub/census2020.
In this blog I'll go through the process of importing this data into Dolt, and the challenges I ran into. I'll talk about how I made the data more usable, and then cover some examples.
The Source Data
The raw data is available in an FTP directory containing 52 zip files (fifty state zips, one for Washington DC, and one for Puerto Rico). Extracting each zip yielded four pipe-delimited files: text files with one row of data per line, where each column is separated by the pipe character '|'. Some files had hundreds of columns on each line, and the first row of the file did not contain the column headers, so what each field represented wasn't obvious. To understand the data I needed the 2020 Census State Redistricting Data Technical Documentation. The document is massive, and it took a fair amount of time to understand the data well enough to put together a schema that would provide the raw, unfiltered data in a usable way.
The .pl Files
The four files extracted from each zip were named xx000012020.pl, xx000022020.pl, xx000032020.pl, and xxgeo2020.pl, where xx is the State/US Abbreviation or "STUSAB", which includes all two-letter US state abbreviations plus abbreviations for Washington DC and Puerto Rico. The documentation describes the structure of this data as:
• Geographic header file
• File01 (Tables P1 and P2)
• File02 (Tables P3, P4, and H1)
• File03 (Table P5)
To get the complete dataset for the 2020 Census State Redistricting Data (Public Law 94-171) Summary File users must download all four files.
It is easiest to think of the file set as a single file that has been broken into four physical parts: the geographic header file, file01, file02, and file03...
A unique logical record number (LOGRECNO in the geographic header) is assigned to all files for a specific geographic entity. This field is the key that links records across all four files.
Summary Levels
A core concept of the census data is summary level or "SUMLEV". The census data is provided at multiple geographical levels. For example, summary level "040" provides state level data. If you were to open ca000012020.pl and find the line where "SUMLEV" is "040" you would be looking at the counts for the entire state of California. In that same file there are 58 rows with the SUMLEV "050". Each of these rows corresponds to one of California's 58 counties.
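Once the data is in a queryable table (the schema described later in this post stores these rows in a table named geo), you can count rows per summary level to see how a state is partitioned. A quick sketch, assuming that schema:

```sql
-- Number of geographic rows per summary level for California;
-- SUMLEV '050' should show 58 rows, one per county
SELECT SUMLEV, COUNT(*) AS num_rows
FROM geo
WHERE STUSAB = 'CA'
GROUP BY SUMLEV
ORDER BY SUMLEV;
```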
Summary levels are arranged in a hierarchy where each child represents a different way of splitting up the data of its parent. The hierarchy looks like this:
040 State
  500 State-Congressional District
    510 State-Congressional District-County
      511 State-Congressional District-County-Census Tract
      521 State-Congressional District-County-County Subdivision
    531 State-Congressional District-Place/Remainder
    541 State-Congressional District-Consolidated City
    550 State-Congressional District-American Indian Area/Alaska Native Area/Hawaiian Home Land
      553 State-Congressional District-American Indian Area-Tribal Subdivision/Remainder
    570 State-Congressional District-School District (Elementary)/Remainder
    571 State-Congressional District-School District (Secondary)/Remainder
    572 State-Congressional District-School District (Unified)/Remainder
  610 State-State Legislative District (Upper Chamber)
    612 State-State Legislative District (Upper Chamber)-County
      613 State-State Legislative District (Upper Chamber)-County-County Subdivision
        632 State-State Legislative District (Upper Chamber)-County-County Subdivision-Subminor Civil Division
      630 State-State Legislative District (Upper Chamber)-County-Voting District/Remainder
      631 State-State Legislative District (Upper Chamber)-County-Census Tract
    614 State-State Legislative District (Upper Chamber)-Place/Remainder
    615 State-State Legislative District (Upper Chamber)-Consolidated City
    616 State-State Legislative District (Upper Chamber)-American Indian Area/Alaska Native Area/Hawaiian Home Land
      633 State-State Legislative District (Upper Chamber)-American Indian Area-Tribal Subdivision/Remainder
    634 State-State Legislative District (Upper Chamber)-Alaska Native Regional Corporation
    617 State-State Legislative District (Upper Chamber)-School District (Elementary)/Remainder
    618 State-State Legislative District (Upper Chamber)-School District (Secondary)/Remainder
    619 State-State Legislative District (Upper Chamber)-School District (Unified)/Remainder
  620 State-State Legislative District (Lower Chamber)
    622 State-State Legislative District (Lower Chamber)-County
      623 State-State Legislative District (Lower Chamber)-County-County Subdivision
        637 State-State Legislative District (Lower Chamber)-County-County Subdivision-Subminor Civil Division
      635 State-State Legislative District (Lower Chamber)-County-Voting District/Remainder
      636 State-State Legislative District (Lower Chamber)-County-Census Tract
    624 State-State Legislative District (Lower Chamber)-Place/Remainder
    625 State-State Legislative District (Lower Chamber)-Consolidated City
    626 State-State Legislative District (Lower Chamber)-American Indian Area/Alaska Native Area/Hawaiian Home Land
      638 State-State Legislative District (Lower Chamber)-American Indian Area-Tribal Subdivision/Remainder
    639 State-State Legislative District (Lower Chamber)-Alaska Native Regional Corporation
    627 State-State Legislative District (Lower Chamber)-School District (Elementary)/Remainder
    628 State-State Legislative District (Lower Chamber)-School District (Secondary)/Remainder
    629 State-State Legislative District (Lower Chamber)-School District (Unified)/Remainder
  050 County
    060 State-County-County Subdivision
      067 State-County-County Subdivision-Subminor Civil Division
    512 State-County-Congressional District
    640 State-County-State Legislative District (Upper Chamber)
    641 State-County-State Legislative District (Lower Chamber)
    140 State-County-Census Tract
      150 State-County-Census Tract-Block Group
    700 State-County-Voting District/Remainder
      701 State-County-Voting District/Remainder-Place/Remainder
      702 State-County-Voting District/Remainder-Consolidated City
At its coarsest the data is provided at the state level. A state can be broken up based on its counties, congressional districts, or its upper- and lower-chamber state legislative districts. The sum of the counts in any summary level will be equal to the count of its parent. Back to our California example: the sum of the populations of the 58 counties equals the population of the state, and the sum of the populations of the 20 Los Angeles County subdivisions equals the population of Los Angeles County.
Defining a Schema
With a basic understanding of the files and the way the row data is organized, it was time to create some tables. The documentation states that the data can be thought of as a single logical record split over four files. Rather than sticking everything into a single table, I put the data from the xxgeo2020.pl files into one table called geo and combined the counts from the other three files into a single table called census.
To create the tables I needed to know what data each column held and its type. This was a tiresome process, as I had to slowly go through the documentation and record each field's name, its type, and its maximum length.
I wanted users to be able to refer back to the source documentation and have it still be valid, so field names were maintained, but the notes on each field and what it represents were pulled out of the documentation and put into a table named columns. This makes it easier for users to find the data they are looking for. Here is a sample of the data returned by running SELECT * FROM columns:
+------------+-------------+--------------------------------------------------+------------+------------------+
| table_name | column_name | description | data_type | universe |
+------------+-------------+--------------------------------------------------+------------+------------------+
| census | CHARITER | Characteristic Iteration | varchar(3) | NULL |
| census | CIFSN | Characteristic Iteration File Sequence Number | varchar(2) | NULL |
| census | FILEID | File Identification | varchar(6) | NULL |
| census | H0010001 | Total | BIGINT | Housing Units |
| census | H0010002 | Occupied | BIGINT | Housing Units |
| census | H0010003 | Vacant | BIGINT | Housing Units |
| census | LOGRECNO | Logical Record Number | BIGINT | NULL |
| census | P0010001 | Total | BIGINT | Total Population |
| census | P0010002 | Population of one race | BIGINT | Total Population |
| census | P0010003 | White alone | BIGINT | Total Population |
| census | P0010004 | Black or African American alone | BIGINT | Total Population |
| census | P0010005 | American Indian and Alaska Native alone | BIGINT | Total Population |
| census | P0010006 | Asian alone | BIGINT | Total Population |
| census | P0010007 | Native Hawaiian and Other Pacific Islander alone | BIGINT | Total Population |
| census | P0010008 | Some Other Race alone | BIGINT | Total Population |
...
So using the columns P0010001 through P0010008 you can get the percentage of the population in each race category, as well as the percentage of the population of mixed race. We'll use these fields later in our examples.
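As a quick sketch of how those columns combine (using the geo and census tables described above), the share of a state's population reporting more than one race is simply the total minus the one-race count, divided by the total:

```sql
-- Mixed-race share for the whole state of California (SUMLEV '040')
SELECT (census.P0010001 - census.P0010002) / CAST(census.P0010001 AS decimal) AS pct_mixed
FROM census
JOIN geo
  ON census.STUSAB = geo.STUSAB AND census.LOGRECNO = geo.LOGRECNO
WHERE geo.SUMLEV = '040' AND geo.STUSAB = 'CA';
```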
Importing the Data
Once I had created my tables it was time to import the data. The only challenge was that the .pl files did not have column headers. I created a separate .pl file containing just the headers for each file type and ran:
cat file1_headers.pl xx2020.pl/ak000012020.pl | dolt table import -u --file-type psv census
cat file2_headers.pl xx2020.pl/ak000022020.pl | dolt table import -u --file-type psv census
cat file3_headers.pl xx2020.pl/ak000032020.pl | dolt table import -u --file-type psv census
cat geo_headers.pl xx2020.pl/akgeo2020.pl | dolt table import -u --file-type psv geo
Indexes
When you query this data, you first need to decide which summary level you want to work at. Any useful query against this dataset needs to provide a value for SUMLEV, so an index was added on SUMLEV. Some summary levels have a lot of data, and often you will be querying a specific area rather than the entire United States. Adding a composite index on STUSAB and SUMLEV together lets you limit the data read to the state(s) you are interested in at a specific summary level. Here are the index creation statements:
CREATE INDEX geo_sumlev ON geo(SUMLEV);
CREATE INDEX geo_stusab_sumlev ON geo(STUSAB,SUMLEV);
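To confirm that a query can actually use these indexes, you can ask for a plan with a MySQL-style EXPLAIN. Treat this as a sketch; the plan output format can differ between Dolt versions:

```sql
-- The plan should reference the geo_stusab_sumlev index rather than a full table scan
EXPLAIN SELECT LOGRECNO
FROM geo
WHERE STUSAB = 'CA' AND SUMLEV = '050';
```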
Querying the Data
If you haven't done so already, now is a good time to install Dolt and clone the data:
dolt clone dolthub/census2020
With the data cloned, you can open a SQL shell to query it:
cd census2020
dolt sql
Now that we have installed Dolt and cloned the census data, we are equipped to find the records we are looking for. The geo table needs to be joined with the census table in order to get useful data. To make your query performant, it is important to provide both the STUSAB and the LOGRECNO when joining with the census table. A simple query to get all the state populations would be:
SELECT census.STUSAB as state, census.P0010001 as population
FROM census
JOIN (
SELECT STUSAB, LOGRECNO
FROM GEO
WHERE SUMLEV = '040'
) as x
ON census.STUSAB = x.STUSAB and census.LOGRECNO = x.LOGRECNO
ORDER BY population DESC;
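This same join can be used to sanity-check the parent/child invariant from earlier: summing a state's county rows (SUMLEV '050') should reproduce its state row (SUMLEV '040'). A sketch for California:

```sql
-- Both result rows should report the same population for California
SELECT x.SUMLEV, SUM(census.P0010001) AS population
FROM census
JOIN (
  SELECT STUSAB, LOGRECNO, SUMLEV
  FROM geo
  WHERE STUSAB = 'CA' AND SUMLEV IN ('040', '050')
) AS x
  ON census.STUSAB = x.STUSAB AND census.LOGRECNO = x.LOGRECNO
GROUP BY x.SUMLEV;
```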
Making the Data More Usable
Querying at the state level is very straightforward, but getting useful results at other summary levels requires a bit more work. Say you wanted to look at the data for Los Angeles County. You can see county-level data is summary level "050", but how do you find Los Angeles County? There is a COUNTY field in the geo data. We can query the columns table to get the info from the docs:
SELECT * FROM columns WHERE column_name = 'COUNTY';
+------------+-------------+---------------+------------+----------+
| table_name | column_name | description | data_type | universe |
+------------+-------------+---------------+------------+----------+
| geo | COUNTY | County (FIPS) | VARCHAR(3) | NULL |
+------------+-------------+---------------+------------+----------+
So COUNTY is a 3-character FIPS code. After some digging I was able to find a document containing the counties for every state. At the same time I also found documents covering the FIPS codes for county subdivisions and places. I've imported that data as well, into the tables counties, county_subdivisions, and places.
Now we can easily find the FIPS code for Los Angeles County:
SELECT * FROM counties WHERE STUSAB = 'CA' and COUNTYNAME LIKE 'L%';
+--------+-------+--------+--------------------+----------+
| STUSAB | STATE | COUNTY | COUNTYNAME | COUNTYCC |
+--------+-------+--------+--------------------+----------+
| CA | 06 | 033 | Lake County | H1 |
| CA | 06 | 035 | Lassen County | H1 |
| CA | 06 | 037 | Los Angeles County | H1 |
+--------+-------+--------+--------------------+----------+
Now that we know the FIPS code is "037" we can get the population:
SELECT census.P0010001 as population
FROM census
JOIN (
SELECT STUSAB, LOGRECNO
FROM GEO
WHERE SUMLEV = '050' and STUSAB = 'CA' and COUNTY = '037'
) as x
ON census.STUSAB = x.STUSAB and census.LOGRECNO = x.LOGRECNO
ORDER BY population DESC;
+------------+
| population |
+------------+
| 10014009 |
+------------+
Demographic Data for Texas
It's time for a more complicated example. We'll be pulling demographic data for all Texas counties.
First, in order to get the right census records, we will query the geo table, setting the SUMLEV and STUSAB:
SELECT stusab, logrecno, county
FROM geo
WHERE stusab = 'TX' and sumlev = '050';
We can look at the columns table to help us find the correct columns:
+------------+-------------+--------------------------------------------------+------------------+
| table_name | column_name | description | universe |
+------------+-------------+--------------------------------------------------+------------------+
| census | P0010001 | Total | Total Population |
| census | P0010002 | Population of one race | Total Population |
| census | P0010003 | White alone | Total Population |
| census | P0010004 | Black or African American alone | Total Population |
| census | P0010005 | American Indian and Alaska Native alone | Total Population |
| census | P0010006 | Asian alone | Total Population |
| census | P0010007 | Native Hawaiian and Other Pacific Islander alone | Total Population |
| census     | P0010008    | Some Other Race alone                            | Total Population |
+------------+-------------+--------------------------------------------------+------------------+
We select them and give them names that are easier to understand for our query:
SELECT P0010001 as total,
P0010001 - P0010002 as mixed_race,
P0010003 as white,
P0010004 as black,
P0010005 as aa_and_an,
P0010006 as asian,
P0010007 as nh_and_pi,
P0010008 as other,
COUNTY,
census.STUSAB
FROM census
JOIN (
SELECT stusab, logrecno, county
FROM geo
WHERE stusab = 'TX' and sumlev = '050'
) as x
ON census.stusab = x.stusab and census.logrecno = x.logrecno
ORDER by total DESC;
Lastly we'll join this with the counties table to get county names and convert the counts to percentages:
SELECT countyname,
total,
mixed_race / totalf as mixed_race,
white / totalf as white,
black / totalf as black,
aa_and_an / totalf as aa_and_an,
asian / totalf as asian,
nh_and_pi / totalf as nh_and_pi,
other / totalf as other
FROM (
SELECT P0010001 as total,
CAST(P0010001 as decimal) as totalf,
P0010001 - P0010002 as mixed_race,
P0010003 as white,
P0010004 as black,
P0010005 as aa_and_an,
P0010006 as asian,
P0010007 as nh_and_pi,
P0010008 as other,
COUNTY,
census.STUSAB
FROM census
JOIN (
SELECT stusab, logrecno, county
FROM geo
WHERE stusab = 'TX'
and sumlev = '050'
) as x
ON census.stusab = x.stusab and census.logrecno = x.logrecno
) as y
JOIN counties
ON counties.county = y.county and counties.stusab = y.stusab
ORDER BY total DESC;
+----------------------+---------+----------------------+---------------------+-----------------------+-----------------------+------------------------+------------------------+----------------------+
| COUNTYNAME | total | mixed_race | white | black | aa_and_an | asian | nh_and_pi | other |
+----------------------+---------+----------------------+---------------------+-----------------------+-----------------------+------------------------+------------------------+----------------------+
| Harris County | 4731145 | 0.16912502153284248 | 0.36362360485675244 | 0.19172166568557927 | 0.011627417887213349 | 0.07382314429170951 | 0.0008752638103461213 | 0.18920388193555682 |
| Dallas County | 2613539 | 0.16113017636239596 | 0.35365188734509034 | 0.21969559283408435 | 0.012034639620835962 | 0.07006744494725352 | 0.0006565809808080155 | 0.18276367790953185 |
| Tarrant County | 2110640 | 0.14033421142402305 | 0.49489680855096085 | 0.17375156350680362 | 0.008855133987795171 | 0.061325948527460865 | 0.002156218019179017 | 0.11868011598377744 |
| Bexar County | 2009324 | 0.2610534687287864 | 0.4577305601286801 | 0.08027923819155099 | 0.01154268798859716 | 0.03417816141149959 | 0.0017662656694490287 | 0.15344961788143674 |
| Travis County | 1290188 | 0.163197146462376 | 0.5498787773564783 | 0.07849011151863139 | 0.009659832520531892 | 0.07831261800605803 | 0.0007898073769094117 | 0.11967170675901496 |
| Collin County | 1064465 | 0.111036999807415 | 0.543382826114527 | 0.10360509739634464 | 0.007036398566415993 | 0.17775032528077486 | 0.0006754566848134979 | 0.05651289614970901 |
| Denton County | 906422 | 0.12808603498149868 | 0.5807041311883427 | 0.10794861554551853 | 0.008243400976587064 | 0.10311201625732827 | 0.0008318421221020672 | 0.07107395892862264 |
| Hidalgo County | 870781 | 0.405530207939769 | 0.3451545222047794 | 0.005452576480194217 | 0.007068367362172578 | 0.010390672281549551 | 0.00020096901517143806 | 0.22620268471636382 |
| El Paso County | 865657 | 0.3584179415172522 | 0.3624310783601357 | 0.03356294698708611 | 0.011941219212690478 | 0.013946632442179755 | 0.0021775368304074246 | 0.21752264465024831 |
...
Run the query to get the full result set.
Conclusion
The United States government spent an estimated $15.6 billion collecting the 2020 census. It is used by the government for numerous purposes, but it also has innumerable business and research use cases... but it is kind of a pain to work with. Many companies pay to get this information distilled into a usable format, but the data is available on DoltHub now. I have provided some resources for making the data more usable, and I will continue to improve this dataset by adding tables such as school and congressional districts. If you want to help make this dataset even better, go to DoltHub, create a free account, and fork the dataset. Make modifications to your fork and submit pull requests. Help us make this the most complete, easiest-to-use source of census data.