June Dataset Spotlight
Every month we highlight some interesting datasets on DoltHub. The focus is on new or updated datasets but sometimes we shed fresh light on a classic.
For those new to Dolt and DoltHub, Dolt is Git for data. Git versions files. Dolt versions SQL tables. DoltHub is a place on the internet to share Dolt repositories.
We think the way we share data with each other is broken and we think Dolt is the fix. Whenever you see a link to a CSV, JSON, or XML file, you should think of Dolt. Whenever you see an API but want all the data, not just a few entries, you should think of Dolt. We are working hard to move data shared in these formats to Dolt. This series of blogs will update you on our progress.
National Vulnerability Database
Link: dolthub/NVD
Contributor: dolthub
First Published: April 21, 2020
The National Vulnerability Database (NVD) is the authoritative source for the publication of Common Vulnerabilities and Exposures (CVEs). The vulnerabilities cataloged in the NVD represent the most severe and most impactful cyber-security events. The data is published through a hard-to-scrape JSON API, so we import that JSON feed into Dolt hourly. Dolt gives you a much easier interface for querying what you need, and you can use diffs to see what has changed between imports. That's a lot of Dolt value add. We think security data is a promising Dolt use case and we hope to add more in the future.
US Census Response Rates
Link: dolthub/us-census-response-rates
Contributor: dolthub
First Published: June 4, 2020
The US Census Bureau is publishing daily snapshots of census response rates. We started importing these into Dolt on June 4. So, you can use Dolt's history and diff features to build a time series of the response rates starting then. Plus, you get SQL on the data.
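As a rough sketch of how that history can be explored from the command line: the commands below clone the repository named above, list its tables (table names are not given in this post, so the sketch does not assume any), and compare the two most recent commits. It assumes the dolt CLI is installed and that the repository has at least two commits. The `repo` and `dir` variables are just illustrative names.

```shell
# Repository name as listed in this post. dolt clone creates a directory
# named after the part that follows the slash.
repo="dolthub/us-census-response-rates"
dir="${repo#*/}"

if command -v dolt >/dev/null 2>&1 && dolt clone "$repo"; then
  (
    cd "$dir" || exit 1

    # Table names are not stated in the post, so list them first.
    dolt sql -q "show tables"

    # Each import is a commit, so the log doubles as a timeline of snapshots.
    dolt log -n 5

    # Show what changed between the previous snapshot and the latest one.
    dolt diff HEAD~1 HEAD
  )
else
  echo "dolt is not installed or the clone failed; nothing to show" >&2
fi
```

Walking further back through the log and diffing each pair of adjacent commits is one way to assemble the day-by-day series.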
US Supreme Court Case Transcripts
Link: dolthub/us-supreme-court-cases
Contributor: dolthub
First Published: April 28, 2020
This is a really cool dataset of US Supreme Court transcripts. It contains transcripts for over 1000 Supreme Court cases. There's even a judges table, so you can see if certain characteristics of a judge predict the language they use. You can even look up famous quotes:
$ dolt sql -q "select * from transcripts where speaker='Sandra Day O\'Connor' and text like '%pornography%'"
+--------------------------------------------+-----------------------------------+----------------------------------------------------------------------+---------------------+----------+----------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| case_name | title | link | speaker | start | stop | duration | text |
+--------------------------------------------+-----------------------------------+----------------------------------------------------------------------+---------------------+----------+----------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Ashcroft v. American Civil Liberties Union | Oral Argument - March 02, 2004 | https://apps.oyez.org/player/#/rehnquist10/oral_argument_audio/22093 | Sandra Day O'Connor | 136.198 | 154.982 | 18.78 | Mr. Olson, part of the problem is that the pornography laws that would apply to adult viewers don't seem to be enforced very well, the obscenity laws. |
| Ashcroft v. Free Speech Coalition | Oral Argument - October 30, 2001 | https://apps.oyez.org/player/#/rehnquist10/oral_argument_audio/21372 | Sandra Day O'Connor | 483.178 | 492.276 | 9.1 | Mr. Clement, may I ask you a question again relating to the affirmative defenses or youthful adult pornography. |
| Hunter v. Underwood | Oral Argument - February 26, 1985 | https://apps.oyez.org/player/#/burger8/oral_argument_audio/19281 | Sandra Day O'Connor | 1538.688 | 1559.452 | 20.76 | The Court of Appeals also indicated, I think in a footnote, that the statute was under-inclusive because sometimes that apparently it would be characterized at least by the Court of Appeals as crimes of moral turpitude are not included such as mailing pornography and so forth. |
| United States v. X-Citement Video, Inc. | Oral Argument - October 05, 1994 | https://apps.oyez.org/player/#/rehnquist10/oral_argument_audio/20030 | Sandra Day O'Connor | 1643.659 | 1654.034 | 10.38 | Well, General Days, I thought we had already agreed that it doesn't require obscenity or pornography, but just a visual depiction of sexually explicit conduct. |
| Wal-Mart Stores Inc. v. Samara Bros. Inc. | Oral Argument - January 19, 2000 | https://apps.oyez.org/player/#/rehnquist10/oral_argument_audio/20162 | Sandra Day O'Connor | 253.073 | 257.584 | 4.51 | It's... it's sort of like pornography: I know it when I see it. |
+--------------------------------------------+-----------------------------------+----------------------------------------------------------------------+---------------------+----------+----------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
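The same table lends itself to aggregate queries too. Here is a small sketch, assuming the dolt CLI is installed: it relies only on the `transcripts` columns visible in the output above (`speaker`, `duration`) and on the `judges` table mentioned earlier, whose columns are not listed in this post and are therefore inspected with `describe` rather than guessed. The `duration` values look like seconds (stop minus start), but that is an assumption.

```shell
repo="dolthub/us-supreme-court-cases"
dir="${repo#*/}"

# Ten speakers with the most recorded turns, plus their total speaking time.
# duration is presumably in seconds; that is an assumption, not documented here.
query="select speaker, count(*) as turns, sum(duration) as total_duration
from transcripts
group by speaker
order by turns desc
limit 10"

if command -v dolt >/dev/null 2>&1 && dolt clone "$repo"; then
  (
    cd "$dir" || exit 1

    # Inspect the judges table before joining against it.
    dolt sql -q "describe judges"

    # Run the aggregate over the transcripts table.
    dolt sql -q "$query"
  )
else
  echo "dolt is not installed or the clone failed; nothing to show" >&2
fi
```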
Bad Words
Link: dolthub/bad-words
Contributor: dolthub
First Published: April 9, 2020
We believe we have compiled the most comprehensive list of bad words on the internet. We showed how we did it and then built an example application using the data. This is a dataset we would love to start seeing contributions to.
Pokemon
Link: dolthub/pokemon
Contributor: dolthub
First Published: June 26, 2020
Pokemon is the highest-grossing media franchise in the world, with an enormous international fanbase. This database aims to be the most complete and accurate dataset of Pokemon, and welcomes contributions. We would love to expand beyond just having data on each Pokemon: Pokemon Go statistics, data for the Pokemon card game, data for the Pokemon animated series, and so on.
Conclusion
That's it for this month. As you can see, most of the datasets are published by us. For Dolt and DoltHub to continue to exist, we need a community of data publishers to emerge. Please help us build a community by publishing. We published a blog on how to publish with SQL and another on how to publish CSVs.
That said, if you want data in Dolt format but don't have the time or expertise to import and maintain it, send us a note. We're happy to be an open data provider for your projects.