Joining Multiple Repositories with SQL Queries
In our blogs we have shown over and over again how easy it is to clone data from DoltHub and immediately start querying it with SQL. We are constantly working on improving our data catalog. As we do, there emerge more occasions where you can derive interesting information by joining two or more of these datasets. In this blog I'll be walking you through how I took data from the dolthub/us-congress repository and the dolthub/irs-soi repository to find the congressional districts representing the largest portion of US tax dollars.
The Setup
To begin we'll create a directory to store our repositories, and then clone data into them.
mkdir datasets
cd datasets
dolt clone dolthub/irs-soi
dolt clone dolthub/us-congress
At this point we are immediately ready to begin running SQL against our data. We will start a Dolt SQL shell session
with the --multi-db-dir
parameter, which tells Dolt to look at every subdirectory of the provided directory as a separate
database (version 0.16.4+ is required to try this yourself). We run the command from our newly created datasets directory like so:
dolt sql --multi-db-dir ./
# Welcome to the DoltSQL shell.
# Statements must be terminated with ';'.
# "exit" or "quit" (or Ctrl-D) to exit.
> SHOW DATABASES;
+--------------------+
| Database |
+--------------------+
| information_schema |
| irs_soi |
| us_congress |
+--------------------+
As you can see, the shell shows our two cloned repositories, but it's worth noting that the names of the repositories
have been converted to valid SQL identifiers (in this case dashes replace underscores). Additionally, the
shell shows the information_schema
database. This is a standard MySQL system database which provides information about all
the databases, but it is not important to our work here.
One thing to note about this data is that congressional districts are not cut along zipcode boundaries, and as a result a zipcode that is not wholly contained within a congressional district will appear in multiple rows, once for each district that some of its residents are a part of.
Joining Data from Multiple Databases
Now that we have cloned our data and opened a shell which provides access to both databases, we will create a table and insert data joined from both databases. Our table will be every state, district, and zip joined with the adjusted gross incomes and tax return count for each zip.
> USE us_congress;
Database changed
us_congress> CREATE TABLE irs_district_info (
-> state VARCHAR(2) NOT NULL,
-> district VARCHAR(2) NOT NULL,
-> zip VARCHAR(5) NOT NULL,
-> return_count BIGINT NOT NULL,
-> agi BIGINT NOT NULL,
-> PRIMARY KEY(zip,state,district)
-> );
us_congress> INSERT INTO irs_district_info
-> SELECT districts.state, districts.district, districts.zip, allnoagi.return_count, allnoagi.adjusted_gross_income
-> FROM districts
-> INNER JOIN irs_soi.allnoagi AS allnoagi ON districts.zip = allnoagi.zip;
Query OK, 34943 rows affected
Now that we have a table that contains districts and data from the IRS we will create a view that makes it easy to see aggregations at the district level. As mentioned earlier, due to congressional districts not being cut along zipcode boundaries this is only an approximation. If zipcode Z has some residents that are in district 1 and other residents that are in district 2 then that zipcode's tax dollars are counted twice.
us_congress> CREATE VIEW district_agi AS
-> SELECT district, state, SUM(agi) AS total_agi, SUM(return_count) AS return_count
-> FROM irs_district_info
-> GROUP BY state,district;
Now we are all set now to look at the congressional districts representing the most and fewest tax dollars. We'll start by looking at the districts representing the most tax dollars.
us_congress> SELECT *
-> FROM house
-> INNER JOIN district_agi ON house.district = district_agi.district and house.state = district_agi.state
-> ORDER BY total_agi DESC
-> LIMIT 10;
+-------+----------+------------+----------------+------------+---------------------+------------------------+------------+-----+----------+-------+-----------+--------------+
| state | district | last_name | first_name | party | year_assumed_office | residence | birth_year | sex | district | state | total_agi | return_count |
+-------+----------+------------+----------------+------------+---------------------+------------------------+------------+-----+----------+-------+-----------+--------------+
| NY | 10 | Nadler | Jerry | Democratic | 1992 | Chelsea, Manhattan | 1947 | M | 10 | NY | 180290441 | 920080 |
| NY | 12 | Maloney | Carolyn | Democratic | 1993 | East Harlem, Manhattan | 1946 | F | 12 | NY | 159256865 | 666310 |
| CA | 18 | Eshoo | Anna | Democratic | 1993 | Atherton, California | 1942 | F | 18 | CA | 139138057 | 671370 |
| CA | 33 | Lieu | Ted | Democratic | 2015 | Torrance, California | 1969 | M | 33 | CA | 125883759 | 715860 |
| TX | 2 | Crenshaw | Dan | Republican | 2019 | Spring, Texas | 1984 | M | 2 | TX | 94084261 | 856060 |
| CA | 17 | Khanna | Ro | Democratic | 2017 | Fremont, California | 1976 | M | 17 | CA | 84393712 | 594210 |
| TX | 21 | Roy | Chip | Republican | 2019 | Austin, Texas | 1972 | M | 21 | TX | 82842753 | 865730 |
| IL | 7 | Davis | Danny K. | Democratic | 1997 | Austin, Chicago | 1941 | M | 7 | IL | 81989015 | 980360 |
| TX | 7 | Fletcher | Lizzie Pannill | Democratic | 2019 | Houston | 1975 | F | 7 | TX | 81768777 | 659310 |
| NJ | 7 | Malinowski | Tom | Democratic | 2019 | Rocky Hill, New Jersey | 1965 | M | 7 | NJ | 81167355 | 616610 |
+-------+----------+------------+----------------+------------+---------------------+------------------------+------------+-----+----------+-------+-----------+--------------+
The adjusted gross income data from the IRS is in thousands of dollars, so you can see that our approximation of adjusted gross income for each congressional district has residents of New York's 10th district accounting for $180 Billion in taxable income in 2017.
Now the districts representing the fewest tax dollars.
us_congress> SELECT *
-> FROM house
-> INNER JOIN district_agi ON house.district = district_agi.district and house.state = district_agi.state
-> ORDER BY total_agi ASC
-> LIMIT 10;
+-------+----------+-----------+-------------+------------+---------------------+-----------------------------+------------+-----+----------+-------+-----------+--------------+
| state | district | last_name | first_name | party | year_assumed_office | residence | birth_year | sex | district | state | total_agi | return_count |
+-------+----------+-----------+-------------+------------+---------------------+-----------------------------+------------+-----+----------+-------+-----------+--------------+
| WV | 3 | Miller | Carol | Republican | 2019 | Crab Orchard, West Virginia | 1950 | F | 3 | WV | 9727228 | 205960 |
| KY | 5 | Rogers | Hal | Republican | 1981 | Mount Vernon, Kentucky | 1937 | M | 5 | KY | 12905404 | 293990 |
| NY | 15 | Serrano | José E. | Democratic | 1990 | Concourse, Bronx | 1943 | M | 15 | NY | 13459856 | 428160 |
| TX | 16 | Escobar | Veronica | Democratic | 2019 | El Paso, Texas | 1969 | F | 16 | TX | 14951512 | 332000 |
| VA | 9 | Griffith | Morgan | Republican | 2011 | Salem, Virginia | 1958 | M | 9 | VA | 15066790 | 304630 |
| MO | 8 | Smith | Jason | Republican | 2013 | Salem, Missouri | 1980 | M | 8 | MO | 15691063 | 328130 |
| TX | 34 | Vela | Jr. Filemon | Democratic | 2013 | Brownsville, Texas | 1963 | M | 34 | TX | 15788552 | 389710 |
| WV | 1 | McKinley | David | Republican | 2011 | Wheeling, West Virginia | 1947 | M | 1 | WV | 16015683 | 289520 |
| WV | 2 | Mooney | Alex | Republican | 2015 | Charles Town, West Virginia | 1971 | M | 2 | WV | 16108444 | 293460 |
| AR | 1 | Crawford | Rick | Republican | 2011 | Jonesboro, Arkansas | 1966 | M | 1 | AR | 16163652 | 339220 |
+-------+----------+-----------+-------------+------------+---------------------+-----------------------------+------------+-----+----------+-------+-----------+--------------+
Running dolt sql-server with Multiple Databases
Just like dolt sql
, dolt sql-server
supports the --multi-db-dir
parameter. It will allow you to query
multiple Dolt repositories each through a SQL database. Additionally, dolt sql-server
now supports a YAML configuration
file using the --config <file>
parameter which has greater support for configuring your databases. When using a YAML
configuration file you may provide the paths to the Dolt repositories on your local machine, and the database name that
they will be accessible through. This is a lot less restrictive than requiring all datasets be subdirectories of the
directory that the command runs in. The YAML configuration file also provides configuration for things that are not
available via command line configuration. See the dolt sql-server documentation
for details.
Conclusion
Historically, acquiring the data you need can be a painful process. It may involve scraping websites, downloading CSVs, or data files in proprietary formats. Once you have the raw data you may need to write scripts to make it usable so that you can import it to your analysis tools. You may often need to repeat this process multiple times to be able to build the dataset you really need by combining data from different datasets. Dolt has been designed to solve these problems. Dolt lets you clone datasets and immediately start combining them using SQL.
The vision for DoltHub is to be a place where our users can come together and collaborate on data in a way that hasn't been possible before. As we execute against that vision we hope to become the place to go for well-maintained data. We feel that being able to easily acquire data from different sources and combine it to build new and exciting datasets is an extremely compelling reason why you should give Dolt a try, and why you should put your data on DoltHub.
Multiple database support is new to Dolt. We think it provides a hint at where Dolt could go in the future as a distributed, online data catalog. Imagine being able to search DoltHub for all the datasets with similar columns to join on, and using this new feature to analyze the combined data.