The State of Hospital Price Transparency Data
DoltHub has been cranking away in the version controlled database space for six and a half years now. Those of you that have been following us since the beginning know that Dolt and DoltHub started as a data sharing tool. In service of this use case, we ran data bounties where we created a schema and then paid volunteers collected and submitted data via Pull Requests on DoltHub. We stopped doing data bounties in September 2023 to focus on making Dolt, already the world's first version controlled database, the world's best version controlled database.
We ran bounties in a number of different categories: US Election Results, Museum Collections, US Housing Prices, and many more.
But by far our most popular data bounty theme was US hospital prices. We launched bounties in the fall of 2020 and the US government passed a hospital price transparency law that went into effect January 2021, perfect timing for our data bounty program to be first movers on this novel open data. Over the next three years, we ran four hospital price transparency bounties and one insurance price transparency bounty. We became experts in this data, with our bounty program coordinator, spacelove
(aka Alec Stein), working with law makers to improve hospital price transparency laws.
These databases are still very popular and as far as we know are the best collection of free hospital price data out there. This blog will fill you in on the state of the data and where DoltHub will go from here.
The Data
There are two types of hospital price data named after the laws governing their release:
- Transparency In Pricing - Data from the hospitals themselves.
- Transparency in Coverage - Data from insurance company deals with hospitals.
Here at DoltHub, we collected four different versions of Transparency in Pricing data and one version of Transparency in Coverage data.
Transparency in Pricing
- hospital-price-transparency - V1. Data from Q1 2021.
- hospital-price-transparency-v2 - V2. Data from Q2 2021.
- hospital-price-transparency-v3 - V3. Data from Q2 2022.
- transparency-in-pricing - V4. Data from Q3 2023.
We improved the schema over time and were able to collect more data. We recommend using the transparency-in-pricing data for all but historical comparisons. transparency-in-pricing contains data from over 1,000 hospitals.
Moreover, we also collected a list of price list URLs for each hospital called standard-charge-files. This database was used when collecting the transparency-in-pricing database and could be used for automated scraping by a motivated individual or team.
As you can see, we have not collected data for over a year. We don't plan on collecting data again using a bounty-style approach. We have some ideas that I will discuss later. We would happily host newly collected data for free if an intrepid DoltHub user or community wants to do the collection work.
Transparency in Coverage
In January 2023, the Transparency in Coverage law took effect requiring insurance companies to publish their payment data in a standard machine readable format. The data turned out to be petabyte-sized which lead many people to think this was malicious compliance.
We worked on building tools to extract the meaningful bits realizing in the process that the de-normalized JSON format was ballooning the data with a lot of duplicate information. We used those to build a meaningful snapshot database called transparency-in-coverage.
This database is small. If you want more data, you can use our mrfutils Python utility to parse and extract meaningful information from files you source yourself.
How To Use It
spacelove
wrote a number of blog articles with detailed examples of how to understand and use this data to do analysis. The best article to use to get started is this one.
The data is free so the easiest way to use it is to install Dolt, clone the database you want, and start running SQL queries. Dolt and DoltHub give you the full power of a SQL database combined with the decentralized sharing capabilities of Git and GitHub.
$ dolt clone dolthub/transparency-in-coverage
$ cd transparency-in-coverage
$ dolt sql -q "show tables"
+------------------------------------+
| Tables_in_transparency-in-coverage |
+------------------------------------+
| in_network |
| provider_groups |
+------------------------------------+
$ dolt sql -r vertical -q "select * from in_network limit 3"
*************************** 1. row ***************************
reporting_entity_name: United HealthCare Services, Inc.
reporting_entity_type: Third Party Administrator
plan_name: NULL
plan_id: NULL
plan_market_type: NULL
last_updated_on: 2022-08-01 00:00:00
version: 1.0.0
payer_or_issuer_name: Ascent-Hospitality-Management_PS1-50_C2
negotiation_arrangement: ffs
provider_reference: 26927
billing_code_type: CPT
billing_code: 70553
service_code: 11
negotiated_rate: 95.99
negotiated_type: negotiated
expiration_date: 9999-12-31
billing_code_type_version: 2020
*************************** 2. row ***************************
reporting_entity_name: United HealthCare Services, Inc.
reporting_entity_type: Third Party Administrator
plan_name: NULL
plan_id: NULL
plan_market_type: NULL
last_updated_on: 2022-08-01 00:00:00
version: 1.0.0
payer_or_issuer_name: Ascent-Hospitality-Management_PS1-50_C2
negotiation_arrangement: ffs
provider_reference: 23697
billing_code_type: CPT
billing_code: 73721
service_code: 11
negotiated_rate: 244.93
negotiated_type: negotiated
expiration_date: 9999-12-31
billing_code_type_version: 2020
*************************** 3. row ***************************
reporting_entity_name: United HealthCare Services, Inc.
reporting_entity_type: Third Party Administrator
plan_name: NULL
plan_id: NULL
plan_market_type: NULL
last_updated_on: 2022-08-01 00:00:00
version: 1.0.0
payer_or_issuer_name: Ascent-Hospitality-Management_PS1-50_C2
negotiation_arrangement: ffs
provider_reference: 21466
billing_code_type: CPT
billing_code: 76700
service_code: 11
negotiated_rate: 72.88
negotiated_type: negotiated
expiration_date: 9999-12-31
billing_code_type_version: 2020
Known Issues
As people have shown up on the DoltHub discord asking questions about this data, we've collected a list of known issues.
hospital-price-transparency-v3 is in an old Dolt format
Something about hospital-price-transparency-v3's indexes won't allow it to be migrated to Dolt's new format. In order to read it, you must download a Dolt version pre-1.0. We recommend ignoring it and just using transparency-in-pricing which is newer, has a better schema, and has more data than V3.
The prices make no sense
Here in lies the main issue with this data. It is not uncommon to find massive disparities in the prices for certain procedures like the linked article does for C-section births. Even when calling the providers, confirming the prices in their posted data is difficult. We know from working with this data that the services included in any given CPT code, the standard procedure identifier, can vary wildly by provider. The bill you get from a hospital is often a collection of codes that also varies by provider. Tying this data to what you would see on a bill from any given provider is difficult without having a sample of the bills themselves. We worked on this problem for over three years and ran into this problem regularly in analysis. We're not sure how to solve it.
What's Next
As I said, we are out of the bounty business and do not intend to collect more US hospital price data using that method. However, we've made some important improvements to DoltHub that could facilitate a more automated approach to collecting US Hospital Price data.
One of the problems with running automated data collection is ensuring new data collected is accurate. Source data file formats change. A bad row is present in the source and breaks the whole job. Anyone who has maintained data pipelines knows this process is expensive and time consuming. It's hard to do data quality control in an automated collection environment.
Here at DoltHub, we understood this problem from company inception. We thought a critical piece of the puzzle was data version control. Modern version control enables human review and automated testing on changes. Recently, on DoltHub, we released Continuous Integration testing, the last piece of our data quality control stack that enables you to manage data like code.
Just look at those checks on the Pull Request page! Just like your code changes on GitHub.
We are considering leveraging Continuous Integration testing on DoltHub in an automated hospital price collection pipeline. This will be the demonstration use case for the full DoltHub data quality control stack. Interested in a continuous stream of hospital price data on DoltHub QA'ed by DoltHub continuous integration? Come by our Discord and let us know. With sufficient interest, we will fund the project.