US Housing Prices Bounty Retrospective
A year ago we launched DoltHub data bounties. If you're unfamiliar, we use Dolt to create open databases that don't exist anywhere else on the internet. We use Dolt's cell ancestry capability to figure out who contributed what, then we pay contributors in proportion to how much data they add.
This time we decided to run a bounty for US housing prices. We wanted to collect as many sales records as we could from official government sources. So how did we do?
It was a home run
It's hard to comprehend the scale of the US housing market. Around 5 million homes are sold every year. If contracts were signed continuously during 9-5 workdays, roughly 40 would be signed every minute. Each of these transactions is recorded with dubious fidelity in a set of unstandardized tables in some county database, which makes for a certain kind of fun for the poor soul who has to analyze it.
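Back-of-the-envelope, assuming roughly 250 workdays a year and an 8-hour signing window:

```sh
# 250 workdays x 480 minutes = 120,000 signing minutes a year.
echo $((5000000 / (250 * 480)))   # prints 41, i.e. ~40 contracts per minute
```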
This is where DoltHub bounties shine. Our creative participants pulled a billion cells' worth of historical sales data. We were able to identify major historical trends, like the 2008 housing crisis and the seasonality of home-buying, and we even identified the cheapest month to buy a house (more on that in our next post).
What we learned
- Government data is messy and disorganized
- Data with addresses is messy and disorganized
- Sometimes, messy and disorganized is good enough
Government housing records don't come in a universal format or style. They come in GIS databases, SQL databases, CSV files, and HTML tables. For someone working alone, this means lots of digging through search results and writing scrapers to handle all the different contexts. But with our bounty participants in competition, each gravitated toward their niche strengths and interests, which solved the problem of locating and scraping the data. The most motivated and creative bounty hunters pulled in the biggest awards, and all of them had fun.
Even with the messy data, collecting just the addresses, sale dates, and sale prices was enough to make sense of long-term trends in the housing market. Spurious rows, with placeholder addresses like
- 0 MARKET AVE
- 0
- NOT AVAILABLE
or with sale prices in the $0-1000 range, are easy enough to filter out from the final table if needed. We took all the data as-is, and will do any cleaning after the fact when we do our analysis. In earlier bounties we had participants clean the data going in, but it's difficult to say what counts as "clean" when you don't have a bird's-eye view of the data.
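A pass like the one below is the kind of filter we mean. The table and column names here (`sales`, `address`, `sale_price`) are illustrative, not the bounty's exact schema, so check the repository for the real ones:

```sh
# Drop rows with placeholder addresses or nominal sale prices.
# Table and column names are a sketch, not the repo's actual schema.
dolt sql -q "
SELECT *
FROM sales
WHERE sale_price > 1000
  AND address NOT IN ('0', '0 MARKET AVE', 'NOT AVAILABLE');"
```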
Another thing we learned was that it's good to keep it simple. We had just one table with plain English column names. This made it easy for just about anyone to contribute, and came only at the cost of a little bit of space. If need be, we can normalize the database later.
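To give a flavor of what that single flat table looks like, here's a sketch based on the fields mentioned above; the column names and key are our own illustration, not the bounty's exact schema:

```sh
# One denormalized table with plain English column names.
# Columns and primary key are assumptions for illustration.
dolt sql -q "
CREATE TABLE sales (
  state      VARCHAR(2),
  county     VARCHAR(64),
  address    VARCHAR(256),
  sale_date  DATE,
  sale_price BIGINT,
  PRIMARY KEY (state, county, address, sale_date)
);"
```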
Scoreboard
Seven participants contributed a combined 1 billion (!) cell edits. @abmyii took home the top prize.
Stay tuned for more
We have more posts coming that take a closer look at the housing data. In the meantime, if you have questions or want to participate in our current bounty, let us know. We're active on Discord and happy to answer all your questions.
If you want to investigate the data for yourself, you can: visit the repository and click the clone button for instructions on installing Dolt and making a local copy.
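If you already have Dolt installed, cloning looks something like this; the repository path below is our guess, so copy the exact one from the repository page:

```sh
# Clone the database locally and poke around with SQL.
# The repository path is an assumption; use the path shown on DoltHub.
dolt clone dolthub/us-housing-prices
cd us-housing-prices
dolt sql -q "SHOW TABLES;"
```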