IRS Sources Of Income Dataset
Every year the IRS publishes a treasure trove of data. It contains over a hundred different metrics that provide insight into the finances of American taxpayers. Even more compelling, the data is provided at ZIP code granularity, so it can easily be joined with other datasets and drawn on maps. It can be used to build services that estimate the likely income levels of users in different ZIP codes. This data has an enormous number of use cases, it's free to use, and now Dolt makes it much more accessible.
The Raw Data and Its Challenges
The home page for the official data is here. For each tax year since 2011*, the IRS publishes the data as a .zip file that includes an MS Word file with documentation on the data, an MS Excel file for each state, and csv files containing all the data. The documentation is critical: the field names are codes (e.g. Adjusted Gross Income is in the column "A00100"), and it is impossible to understand the data without it. For some of the fields the documentation also notes which tax form and line the data comes from. Examples of the fields include:
- The number of tax returns filed.
- The total amount of adjusted gross incomes for those returns in thousands of dollars.
- The number of tax returns with salaries and wages.
- The total amount of the salaries and wages for those returns in thousands of dollars.
- The number of tax returns with itemized deductions.
- The total amount of the itemized deductions for those returns in thousands of dollars.
Each tax year's data comes in 2 groupings, in 2 csv files. allagi.csv contains data for all 50 states plus Washington DC, where each row is a state, ZIP** code, and AGI category combination. The 6 AGI categories are
+----------+-------------------------+
| Category | Range                   |
+----------+-------------------------+
| 1        | $1 under $25,000        |
| 2        | $25,000 under $50,000   |
| 3        | $50,000 under $75,000   |
| 4        | $75,000 under $100,000  |
| 5        | $100,000 under $200,000 |
| 6        | $200,000 or more        |
+----------+-------------------------+
The second csv file is allnoagi.csv, and it includes all the states plus Washington DC, where each row is a state and ZIP** code pair that combines all the AGI category statistics together, with the AGI category value set to 0.
* For the 1998 through 2010 tax years, the IRS published the data in a different format, with just 18 columns in total and different AGI categories. The Dolt dataset currently only includes data in the new format.
** The special ZIP code "00000" gives the values for the entire state as a whole. The special ZIP code "99999" represents "other": an aggregation of ZIP codes that individually have too few returns to be anonymized properly.
*** All amounts are in thousands of dollars.
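Putting the groupings and special ZIP codes together, separating statewide rollups from real ZIP rows might look like the following sketch. The sample data and the exact column names here are illustrative assumptions in the style of the IRS code names; the real files have over a hundred columns.

```python
import csv
import io

# Illustrative sample in the shape of allnoagi.csv (ZIPCODE and N1 stand in
# for the real IRS header codes; values are made up).
sample = """STATE,ZIPCODE,N1
CA,00000,17000000
CA,90210,20000
CA,99999,150000
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# "00000" is the statewide rollup and "99999" aggregates small ZIPs,
# so both must be excluded when treating rows as actual ZIP codes.
state_totals = [r for r in rows if r["ZIPCODE"] == "00000"]
real_zips = [r for r in rows if r["ZIPCODE"] not in ("00000", "99999")]

print(len(state_totals), len(real_zips))  # 1 1
```

Forgetting to filter out these two special ZIP codes is an easy way to double-count an entire state.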
Getting the Data into Dolt
Mapping IRS Codes to Descriptive Column Names
When importing the data to Dolt, I wanted to make it easier for people to come in and make sense of the data without having to dive too deep into the documentation. The descriptions provided in the documentation are typically very short, so I concisely abbreviated each description in snake case for the field names in Dolt, letting you look at the data and understand it more easily. I started by creating a json file mapping the codes used in the original IRS dataset to their easy-to-understand names.
{
"N1": "return_count",
"NUMDEP": "number_of_dependents",
"A00100": "adjusted_gross_income",
"A00101": "itemized_deductions_agi_amount",
"N00101": "itemized_deductions_count",
"A00200": "salaries_and_wages_amount",
"N00200": "salaries_and_wages_count",
...
}
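To see how such a mapping gets used, here is a minimal sketch that renames a raw IRS header row with a slice of the mapping above; codes without a mapping entry simply pass through unchanged.

```python
import json

# A slice of the mapping file shown above
mapping_json = """{
  "N1": "return_count",
  "A00100": "adjusted_gross_income",
  "N00200": "salaries_and_wages_count"
}"""
mapping = json.loads(mapping_json)

# Rename a raw IRS header row; unmapped columns keep their original names
raw_header = ["STATE", "N1", "A00100", "N00200"]
renamed = [mapping.get(col, col) for col in raw_header]
print(renamed)
# ['STATE', 'return_count', 'adjusted_gross_income', 'salaries_and_wages_count']
```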
To preserve the mappings between the original data and the Dolt dataset, I decided to create a table to hold them. To get the mappings into that table, I wrote a little python script to convert the json mappings to a csv file.
# core portion of the script that converts a json mapping to a csv file to be imported
import json

with open(in_filename) as inf:
    with open(out_filename, "w+") as outf:
        outf.write("original,renamed\n")
        data = inf.read()
        mapping = json.loads(data)
        for k, v in mapping.items():
            outf.write(','.join([k, v]) + "\n")
Once that was done I checked out a branch for the year of IRS data I was importing, and created the column_mappings table by simply importing the data.
dolt checkout -b 2011
dolt table import -c --pk original column_mappings ../irs-new/2011_mapping.csv
Once imported, I added column_mappings, and committed the change.
dolt add column_mappings
dolt commit -m "import 2011 column mappings"
Importing the Data
One thing to note about the csv files provided by the IRS: for tax years 2011 through 2013 all numbers are printed as floating point values, but starting in 2014 they are all integers. In the 2011 allagi data almost every value ends in .0000. A few occurrences of 0.0001 appear throughout, including in columns that are whole-number counts, so those can be assumed to be errors. Lastly, the column A01000, or net_cap_gains_amount, has a few values between 0 and -1. Seeing as this is measured in thousands of dollars, and all values from 2014 on are whole numbers greater than 0, I decided to import these as 0 values as well.
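If you were preprocessing the csv yourself, the three cleanup rules above all boil down to truncating toward zero; a small sketch:

```python
def clean_value(raw: str) -> int:
    """Apply the cleanup rules described above to one csv cell.

    Truncating toward zero handles all three cases at once:
    - "1234.0000" -> 1234 (the pre-2014 float formatting)
    - "0.0001"    -> 0    (spurious values, even in whole-number count columns)
    - "-0.5"      -> 0    (net_cap_gains_amount values between 0 and -1)
    """
    return int(float(raw))

print(clean_value("1234.0000"), clean_value("0.0001"), clean_value("-0.5"))  # 1234 0 0
```

In practice the schema import below achieves the same result at import time, so no separate preprocessing pass was needed.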
dolt schema import --create --float-threshold 1 --map 2011_mapping.json --pks state,zip,agi_category allagi zipcode2011/allagi.csv
Running dolt schema import with the --create flag will create a new table. The parameter --float-threshold 1 tells the inferrer the value the fractional component must exceed in order for a column to be considered a float. Setting it to 1 ignores all floating point values and instead infers them to be whole number values. --map 2011_mapping.json gives a mapping file to use when naming the columns, and --pks state,zip,agi_category says to key each row off of the state, zip, and agi_category fields.
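To make the --float-threshold behavior concrete, here is a toy version of the inference rule as described above. This is a sketch of the idea, not Dolt's actual inferrer.

```python
def infer_column_type(values, float_threshold=1.0):
    """Toy sketch of the --float-threshold rule (not Dolt's real code):
    infer float only when some value's fractional component exceeds the
    threshold; otherwise treat the column as whole numbers."""
    for v in values:
        frac = abs(float(v)) % 1
        if frac > float_threshold:
            return "float"
    return "int"

# With the threshold at 1, fractional parts like .0000 or .0001 can never
# qualify, so the 2011 columns are inferred as whole numbers.
print(infer_column_type(["1234.0000", "0.0001"], float_threshold=1))    # int
print(infer_column_type(["1234.5"], float_threshold=0.25))              # float
```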
After creating the table and validating the schema, the data can be imported. The import is given the same mapping file, and the -r parameter is used to replace all rows (this doesn't really matter for the first year of data, but it does matter for the subsequent year imports).
dolt table import -m 2011_mapping.json -r allagi zipcode2011/allagi.csv
Next I added and committed the table.
dolt add allagi
dolt commit -m "import 2011 allagi data"
I followed the same process for importing the 2011 allnoagi.csv into Dolt.
Then, once all the 2011 data had been imported, I repeated the process for the 2012 to 2017 tax years with a new branch for every tax year. One notable difference is that in subsequent years dolt schema import was run with the --update flag instead of --create. This keeps existing column names and types as-is (though it may remove a not null constraint).
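The per-year flow can be sketched as a dry run that just builds the command strings. The paths and branch names follow the examples above; whether every flag is repeated on the --update runs is an assumption here, and the linked script is the authoritative version.

```python
def year_commands(year: int, first_year: int = 2011):
    """Build the Dolt command sequence for one tax year's allagi import.
    A sketch for illustration, not the full script the post links to."""
    schema_flag = "--create" if year == first_year else "--update"
    return [
        f"dolt checkout -b {year}",
        f"dolt schema import {schema_flag} --float-threshold 1 "
        f"--map {year}_mapping.json --pks state,zip,agi_category "
        f"allagi zipcode{year}/allagi.csv",
        f"dolt table import -m {year}_mapping.json -r allagi zipcode{year}/allagi.csv",
        "dolt add allagi",
        f'dolt commit -m "import {year} allagi data"',
    ]

print(year_commands(2012)[1].split()[3])  # --update
```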
A full script showing the commands that were run can be seen here.
Shifts in AGI in California (An Example)
In my previous blog I showed how diffs can be queried using the dolt diff tables. Because of how we have structured our data, it becomes very easy to look at how things have changed over the course of the 7 annual IRS reports. Let's take a look at how the percentage of California's tax returns in each AGI category has changed over this timeframe.
SELECT agi.state AS state,
agi.agi_category AS agi_category,
CAST(agi.from_return_count AS DECIMAL(48,16))/CAST(noagi.from_return_count AS DECIMAL(48,16)) * 100.0 AS from_percent,
CAST(agi.to_return_count AS DECIMAL(48,16))/CAST(noagi.to_return_count AS DECIMAL(48,16)) * 100.0 AS to_percent
FROM (
SELECT to_state AS state, to_agi_category AS agi_category, from_return_count, to_return_count
FROM dolt_diff_allagi
WHERE state = "CA" and to_zip = "00000" and from_commit="2011" and to_commit="2017"
) AS agi INNER JOIN (
SELECT to_state AS state, from_return_count, to_return_count
FROM dolt_diff_allnoagi
WHERE state = "CA" and to_zip = "00000" and from_commit="2011" and to_commit="2017"
) AS noagi ON agi.state = noagi.state;
+-------+--------------+--------------------+--------------------+
| state | agi_category | from_percent | to_percent |
+-------+--------------+--------------------+--------------------+
| CA | 1 | 39.74696869254931 | 32.67328843277763 |
| CA | 2 | 23.46164666487439 | 23.749387133219567 |
| CA | 3 | 12.922207638811924 | 13.498383826824323 |
| CA | 4 | 8.07491339507515 | 8.676230193323708 |
| CA | 5 | 11.487635526756423 | 14.29065087854523 |
| CA | 6 | 4.3066280819328036 | 7.112059535309541 |
+-------+--------------+--------------------+--------------------+
You can see that California had a very large decrease, about 7 percentage points, in the share of returns in the lowest income category, along with significant increases in the highest 2 income categories (about 3 percentage points each).
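Those shifts can be read straight off the query output above with a bit of arithmetic:

```python
# from_percent / to_percent values for CA, copied from the query output above
from_pct = {1: 39.74696869254931, 5: 11.487635526756423, 6: 4.3066280819328036}
to_pct   = {1: 32.67328843277763, 5: 14.29065087854523,  6: 7.112059535309541}

# Change in each category's share, in percentage points
shifts = {cat: to_pct[cat] - from_pct[cat] for cat in from_pct}

# Category 1 falls about 7 points; categories 5 and 6 each gain about 3.
print({cat: round(s, 2) for cat, s in shifts.items()})  # {1: -7.07, 5: 2.8, 6: 2.81}
```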
Tip of the Iceberg
Check out part 2 of this blog which uses doltpy along with this dataset to look at income inequality across the US.