$10,000 US Housing Prices Bounty
We just finished SHAQ, one of our most competitive bounties ever. And off the heels of that, comes an even more ambitious project: tracking the US housing market. Every. Single. Sale.
Let's track housing sales across the US
After a lull in 2009, housing prices are at an all-time high. At the same time, so is housing insecurity. Nearly 3 apartments remain vacant for every homeless person in NYC. Help us put a microscope on what's happening with home, apartment, building, and land prices across the US. What's housing selling for, what's selling, and where? If we aggregate all the data into one place, maybe we can get some insight into it. Help us create that database.
What we want
We're looking for current and historical data on houses that have actually been sold. That means not current "for sale" prices on sites like Zillow, but actual data from official sources. As a distributed database by design, Dolt is uniquely suited for the challenge. The sum-total creativity of our bounty participants if higher than that of any lone worker.
Plus, part of the appeal in scraping primary sources is that we learn how to continuously update our data, allowing the database to become more valuable over time instead of less.
The sales
table
There's just one important table in the bounty: the sales
table. Here you can see what we're collecting.
+------------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+---------------+------+-----+---------+-------+
| state | char(2) | NO | PRI | | |
| zip5 | char(5) | YES | | | |
| physical_address | varchar(1024) | NO | PRI | | |
| city | varchar(255) | YES | | | |
| county | varchar(255) | YES | | | |
| property_id | varchar(255) | YES | | | |
| sale_date | datetime | NO | PRI | | |
| property_type | varchar(255) | YES | | | |
| sale_price | bigint | NO | | | |
| seller_name | varchar(1024) | YES | | | |
| buyer_name | varchar(1024) | YES | | | |
| num_units | int | YES | | | |
| year_built | int | YES | | | |
| source_url | varchar(2048) | YES | | | |
| book | int | YES | | | |
| page | int | YES | | | |
+------------------+---------------+------+-----+---------+-------+
Primary keys
It's tough to structure any data that relies on addresses for its keys. We made the decision to group zip5
and physical_address
together, with uniqueness on the entire address/sale tuple:
primary key (zip5, physical_address, sale_date)
This is hopefully enough to disambiguate sales, without constraining our bounty participants too much.
SQL functions
Due to the sheer amount of data we collect it can be hard to moderate incoming submissions. Plus, asking participants to clean data they've already submitted can be taxing on them. This time around we get to show off our newly-implemented SQL CHECK CONSTRAINT
functions, giving instant feedback to our submitters and getting a more consistent database.
constraint valid_year_built check (
year_built <= 2022 and
year_built >= 1492
),
constraint valid_sale_date check (
sale_date >= '1492-1-1' and
sale_date <= '2022-2-28'
),
...
Prize structure
You take home money proportional to how many cell edits you create, with a max prize of 10,000 USD. So if you create 45% of all the cell edits, you take home 4500 USD.
Visit here to learn more.
What next?
The bounty runs until Friday Feb. 18, 2022. Fork the database and make your first pull request. We'll be waiting for you on Discord if you have any questions on how to get started. Our #data-bounties channel would love to meet you.