Earn your share of $25,000 building US Presidential Election Database
Today, we're launching Bounties, a way to make money building Dolt databases. We'll have a follow-on blog post Wednesday explaining the motivations for the Bounties feature. But today, we're going to cut right to the chase and explain how you can make money wrangling data on DoltHub. Help us build a database of US Presidential Election precinct results and get paid for your contributions.
The Bounty
In the United States, we recently had a Presidential Election. Election data should be free and open. We've worked with the Open Elections project in the past. We also admire the MIT Election Data Lab. Inspired by these two data projects, for our first bounty, we decided to offer a $25,000 prize pool for building precinct-level election data for the 2016 and 2020 US presidential elections.
We started with a cleaned version of the MIT Election Lab database. The cleaned results are in dolthub/us-president-precinct-results. We are accepting any inserted or updated rows for either the 2016 or 2020 presidential elections. The $25,000 bounty will be divided based on the percentage of cells each participant modified (last write wins) in any table in the database over the three-month duration of the bounty. The bounty ends February 14, 2021, and payouts will be calculated and processed then.
The Database
The database contains four tables: `candidates`, `counties`, `precincts`, and `vote_tallies`. There are foreign key constraints, so make sure `candidates`, `counties`, and `precincts` are defined before adding the vote data to `vote_tallies`. We think structured data of this form will be immensely more useful for analysis than the patchwork of CSVs that exists today.
In the `candidates` table, we have `name`, `fec`, and `fec_name`. `fec` stands for Federal Election Commission, and presidential candidate data can be sourced from their website.
In the `counties` table we have `name`, `state`, `fips`, `ansi`, `lat`, `lon`, `state_postal`, `state_fips`, and `state_icpsr`. `name` and `state` are the keys. The MIT Elections data came with the `ansi` column, but we're unsure where to source that data. It looks like the Census Bureau uses `fips` for `ansi`?
In the `precincts` table we have `precinct`, `county`, `state`, and `jurisdiction`. All four columns are keys. There are some precincts where the `county` and the `jurisdiction` are different, but it's pretty rare. So for now, drop the 'COUNTY' or 'PARISH' from the `county` and call it the `jurisdiction` if they are the same.
Finally, in the `vote_tallies` table we have `election_year`, `stage`, `precinct`, `county`, `state`, `jurisdiction`, `candidate`, `party`, `writein`, `office`, `vote_mode`, and `votes`. Make sure `candidates`, `counties`, and `precincts` are populated with the key columns before you insert your vote data into `vote_tallies`, as there are foreign key constraints into those tables.
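Because of those foreign key constraints, inserts have to happen in dependency order. A minimal sketch of that ordering (table names from the schema above; the helper itself is illustrative):

```python
# Insert order implied by the foreign key constraints: vote_tallies
# references candidates, counties, and precincts, so it must come last.
INSERT_ORDER = ["candidates", "counties", "precincts", "vote_tallies"]

def sort_tables(tables):
    """Return the given tables sorted into a safe insert order."""
    rank = {t: i for i, t in enumerate(INSERT_ORDER)}
    return sorted(tables, key=lambda t: rank[t])
```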
Contributing
We wanted to build an example of what a bounty contribution would entail to help get people started. I went to DoltHub and figured out which states already had data by running a query on the web. This gave me an idea of the states MIT imported for 2016.
I started by googling to see if Florida had published results for 2020. First link is the Florida Department of State. Seemed promising. Denied. Results for 2020 not up yet.
Not to be deterred, I modified my Google search a bit to see if some 2016 data existed for a state that wasn't imported yet. Having spent 12 years in Seattle, I have a soft spot for Washington State. Bingo. Both 2016 and 2020 precinct-level presidential election data are published. It may be incomplete, but it's something. It looks like the Washington data is published in downloadable CSVs for 2016 and 2020.
Now, I need to coerce this CSV into the schema Dolt wants. Based on my experience wrangling data, the best way to do this is to write a Python script to parse the CSV and output SQL. We have doltpy to make this easier. But if you are an Excel or Google Sheets expert, creating CSVs that match the schema of the Dolt tables and then using `dolt table import -u` or Upload a CSV on DoltHub works as well.
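If you go the spreadsheet route, the CSV just needs a header matching the target table's columns. A sketch for the `candidates` table (the sample row is made up, not from the real table):

```python
import csv

# Columns of the candidates table, per the schema described above.
CANDIDATE_COLUMNS = ["name", "fec", "fec_name"]

def write_candidates_csv(path, rows):
    """Write candidate rows to a CSV that `dolt table import -u` can consume."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=CANDIDATE_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)

# Illustrative row -- the name and FEC id here are placeholders.
write_candidates_csv("candidates.csv", [
    {"name": "EXAMPLE CANDIDATE", "fec": "P00000000", "fec_name": "CANDIDATE, EXAMPLE"},
])
```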
Here's the Python script I used to input the data into Dolt. I downloaded the 2016 CSV, parsed it, and inserted the results into Dolt using `doltpy` and `mysql.connector`.
The first challenge is that the data is distributed with a two-letter "county code", so I had to manually build a `county_map` with the additional county information the Dolt database requires. I sourced Washington State county information from Wikipedia. I ignored `lat` and `lon` in the interest of expedience. Filling these columns will get someone a piece of the bounty! I ignored `ansi` because I couldn't figure out what `ansi` was. It looks like the Census Bureau uses `fips` for `ansi`? We started to debate internally whether `fips` was sufficient in the `counties` table. Entering the county data, I noticed that King County, where Seattle is located, does not have precinct-level data. Another way for a bounty participant to grab their share of the bounty!
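The `county_map` looked roughly like this. The two-letter code and `fips` value below are illustrative placeholders, not the real Washington data; only the state-level fields are ones I'd expect to be stable:

```python
# Maps the two-letter county codes in the Washington CSV to the columns the
# counties table needs. The "XX" code and county-level values are placeholders.
county_map = {
    "XX": {
        "name": "EXAMPLE COUNTY",
        "state": "WASHINGTON",
        "fips": "53000",        # county FIPS, sourced from Wikipedia (placeholder)
        "ansi": None,           # skipped -- unclear where to source this
        "lat": None,            # skipped in the interest of expedience
        "lon": None,
        "state_postal": "WA",
        "state_fips": "53",
        "state_icpsr": "73",
    },
}

def county_row(code):
    """Look up the full county record for a two-letter CSV code."""
    return county_map[code]
```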
Next, I needed to add the precincts. `jurisdiction` is a primary key in this database because in some states, some precincts are in the same county but different jurisdictions. In Washington, that doesn't seem to be the case, so by convention I just stripped 'COUNTY' off the end of `county` and made that the `jurisdiction`.
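That convention is a one-liner; a sketch of what stripping 'COUNTY' means in practice:

```python
def jurisdiction_from_county(county):
    """Derive the jurisdiction by stripping a trailing ' COUNTY' from the county name."""
    suffix = " COUNTY"
    name = county.upper().strip()
    if name.endswith(suffix):
        return name[: -len(suffix)]
    return name
```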
Then, I looked at the candidates. All the candidates were already in the `candidates` table, but I needed to map them to the names they existed under in that table. There were only seven unique candidates, so this was not much trouble.
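The mapping was just a small dictionary from the CSV's spellings to the table's. The spellings below are assumptions for illustration; check the actual `candidates` table for the canonical forms:

```python
# Maps candidate names as they appear in the Washington CSV to the names
# already present in the candidates table. Both spellings are illustrative.
candidate_map = {
    "Donald J. Trump / Michael R. Pence": "DONALD J TRUMP",
    "Hillary Clinton / Tim Kaine": "HILLARY CLINTON",
    # ...five more candidates in the real script
}

def canonical_candidate(csv_name):
    """Translate a CSV candidate name to the name used in the candidates table."""
    return candidate_map[csv_name]
```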
Finally, it was a matter of putting it all together and inserting into the `vote_tallies` table. I chose to set `writein` to false and `vote_mode` to 'ELECTION DAY' because those variables were not represented in the CSV data I downloaded.
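A sketch of that row construction. The `stage` and `office` values here are assumptions about the labels the table uses, not confirmed values:

```python
def vote_tally_row(precinct, county, state, jurisdiction, candidate, party, votes):
    """Build one vote_tallies row, filling in the defaults described above."""
    return {
        "election_year": 2016,
        "stage": "GEN",               # assumption: general-election stage label
        "precinct": precinct,
        "county": county,
        "state": state,
        "jurisdiction": jurisdiction,
        "candidate": candidate,
        "party": party,
        "writein": False,             # not in the CSV, so defaulted to false
        "office": "US PRESIDENT",     # assumption: office label used by the table
        "vote_mode": "ELECTION DAY",  # not in the CSV, so defaulted
        "votes": votes,
    }
```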
Throughout, I used `dolt diff` to check my work after I made my writes. When I made a mistake, I ran `dolt reset --hard` and started fresh from the tip of my branch. Dolt makes writing to databases really forgiving.
There were a couple of gotchas. First, the standard MySQL connector in Python turns off autocommit by default. This does not play well with Dolt, so you need to run `SET @@autocommit = 1` after you open a connection. Second, I wanted to make the script re-runnable. Dolt does not support `ON DUPLICATE KEY`, so I had to do a `SELECT COUNT(*)` on the primary keys of a table before I inserted. `REPLACE` queries threw foreign key constraint errors for every table but `vote_tallies`. `ON DUPLICATE KEY` will be supported in Dolt shortly.
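The re-runnable insert works like this sketch, where `cursor` is a `mysql.connector` cursor on a connection with autocommit enabled as above (the helper names are mine, not doltpy's):

```python
def exists_query(table, key_values):
    """Build the SELECT COUNT(*) existence check used to keep the script re-runnable."""
    where = " AND ".join(f"{col} = %s" for col in key_values)
    return f"SELECT COUNT(*) FROM {table} WHERE {where}", tuple(key_values.values())

def insert_if_absent(cursor, table, key_values, row):
    """Insert row into table only if no row with the same primary key exists."""
    query, params = exists_query(table, key_values)
    cursor.execute(query, params)
    if cursor.fetchone()[0] == 0:
        cols = ", ".join(row)
        placeholders = ", ".join(["%s"] * len(row))
        cursor.execute(
            f"INSERT INTO {table} ({cols}) VALUES ({placeholders})",
            tuple(row.values()),
        )
```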
Next, I made my own fork of the dolthub/us-president-precinct-results repo on DoltHub. I set `personal-origin` to my fork using `dolt remote add personal-origin https://doltremoteapi.dolthub.com/timsehn/us-president-precinct-results` and pushed to `personal-origin`. Then I went to DoltHub and made a Pull Request from my fork to the dolthub/us-president-precinct-results repository. In the Pull Request, I included the information relevant to getting my work accepted, like the source of the data and the assumptions I made.
In the Bounties case, once my Pull Request was accepted, I could check the current state of my bounty payout and watch it change as more Pull Requests were accepted. I could either be happy with the payout or do more work to gain a greater share of the bounty.
The Payout
All told, it took me about four hours to insert 622,848 cells. If there are 10,000,000 cells inserted over the duration of the bounty, I would make $1,557.12 when the bounty closes on February 14, 2021. That's $389.28 per hour. If more cells are committed by other bounty participants, I would make less. If fewer cells are committed, I would make more. Not bad at all for a part-time data wrangling gig.
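For the curious, the arithmetic behind those numbers (the 10,000,000 total is the hypothetical from above):

```python
# Bounty payout is proportional to your share of the cells modified.
POOL = 25_000.00
cells_inserted = 622_848
total_cells = 10_000_000   # hypothetical total at bounty close
hours_worked = 4

payout = POOL * cells_inserted / total_cells
per_hour = payout / hours_worked
```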
I left the 2020 CSV for the first enterprising bounty participant who reads this. With a couple small changes to my script, you can insert a bunch of cells for the bounty and be on your way to getting your piece of that $25,000 bounty. First Pull Request wins. Get data wrangling right now!
Start Contributing Today
We are really excited to launch Bounties but we'll be even more excited to start seeing contributions come in. Come hang out with us on our Discord channel. We have a Bounties room to help get you started. Let's build the best precinct level US presidential election database out there and get paid doing it.