Hospital Price Transparency $10,000 Database Bounty
On January 1, 2021, a US law was passed requiring hospitals to publish their prices in human and machine readable format. We would like to assemble the best open dataset of hospital prices in the US to aid researchers. To this end, we’re launching our second bounty! The $10,000 bounty will be divided based on the percentage of cells modified, last write wins, in any table in the database during the two month duration of the bounty. The bounty ends March 1, 2021 and payouts will be calculated and processed then.
Bounties
Bounties on DoltHub are a way to pay users to wrangle data. Because Dolt has Git-style versioning, interesting payout methods like percentage of cells edited are possible. We will be publishing new bounties approximately monthly so join our Discord to stay apprised!
The Database
The database consists of three tables:cpt_hcpcs
, hospitals
, and prices
.
The cpt_hcpcs
table consists of three columns, code
, short_description
, and long_description
. code
is the CPT or HCPCS code for the service. The CPT code is a standard maintained by the American Medical Association. HCPCS is the Medicare standard. We believe most hospitals will publish using the CPT code but HCPCS codes will be used instead if the pricing data was published using HCPCS. CPT codes are numerical while HCPCS codes start with a letter followed by 4 numbers
The hospitals
table has the columns npi_number
, name
, url
, street_address
, city
, state
, and zip_code
. The url
column needs to be populated with the url where the pricing data can be found. The npi_number
is the 10 digit national provider identifier for the corresponding hospital and name
is the name of the hospital.
The prices
table consists of four columns:
code
, with a foreign key constraint on thecpt_hcpcs
tablenpi_number
, with a foreign key constraint on thehospital
tablepayer
which is the insurer and takes on the value ‘CASH’ in the case where the payer is the individual, andprice
, the price of the medical shoppable service in USD.
Contributing
I started by picking a hospital, Ben Taub hospital in Houston, TX and googled Ben Taub Hospital shoppable services
, the second result had a reference to Hospital Price Transparency. Following the link, I found a zip download for the hospital’s Charge Description Master, which contains the descriptions, billing codes, and charge amounts for each service.
Unzipping the file yields an excel spreadsheet. In order to read it programmatically in python I rely on openpyxl. Next I google the NPI number for the hospital, which nets me the hospital’s NPI number and address, and I add them to the hospitals
table. Now that’s complete, I can rev up a dolt mysql server dolt sql-server
and run my script to add each of the entries in the Charge Description Master to the database.
You can look at the script here to help you get started.
Update on the US Election Bounty
The results for the US Election bounty so far have been fantastic, and there’s still one month left to vie for a piece of the prize! 6 people had 43 Pull Requests accepted. We’ve had over 10.3M cells edited in the database. All 50 states from 2016 are covered. 25 states from 2020 are covered.
If the bounty ended today, our top contributor would earn almost $12,000 and every participant would earn over $1,000. You can follow current payouts on the live scoreboard . Data wrangling for bounties can be lucrative work.
Conclusion
Feel free to clone the data and start analyzing it even if you don’t want to contribute. As updates are published just run dolt pull to get the latest results. Run dolt diff to see what changed. This is the power of using a version controlled database. If you have questions, stop by our Discord to talk it over with us and other bounty participants.