library(tidyverse)
library(echor)
<-
echo_air_ca echoAirGetFacilityInfo(p_st = "CA",
qcolumns = "1,4,5,8,48,99,100,101,102") |>
select(-SourceID)
<-
echo_water_ca echoWaterGetFacilityInfo(p_st = "CA",
qcolumns = "1,4,5,9,184,185,186,187") |>
select(-SourceID)
Lab 6: Joining Datasets
Introduction
In terms of data analysis, this lab has one goal: to determine the number of industrial facilities that are currently in violation of both the Clean Air Act and the Clean Water Act in California. To achieve this goal, we’re going to have to do some data wrangling and join together some datasets published by the EPA. We’re going to practice applying different types of joins to this data and consider what we learn with each.
Learning Goals
- Identify join keys
- Discern the differences between types of joins
- Discern what happens when we perform many-to-many joins
- Apply filtering joins to identify missing data
- Practice data wrangling
Review of Key Terms
- Join Key
-
A variable shared across two datasets that identifies the observations that will be merged
EPA’s ECHO
The U.S. Environmental Protection Agency (EPA) is responsible for monitoring and regulating over 800,000 industrial facilities in the United States. This involves regularly inspecting facilities for their compliance with different environmental laws, issuing notices and taking enforcement actions when facilities are out of compliance, and issuing penalties for failures to address violations. A record of all of this activity is maintained online in a database known as ECHO - Enforcement and Compliance History Online. Specifically, ECHO maintains information about enforcement and compliance actions taken in relation to the Clean Air Act, the Clean Water Act, the Safe Drinking Water Act, and the Resource Conservation and Recovery Act.
There is so much information available to us via ECHO, and there is even an R
package that makes it really simple to access the data in ECHO. However, because ECHO documents compliance with a number of different environmental laws, when we pull data from ECHO about regulated facilities, it is usually organized into separate tables for each law. Today we are going to pull a dataset documenting facilities’ compliance with the Clean Air Act and then pull a second dataset documenting facilities’ compliance with the Clean Water Act.
The Clean Air Act was first enacted in 1963 to regulate emissions from both stationary and mobile sources of air pollution. The Clean Water Act was enacted in 1972 regulate the polluting of U.S. waterways. When a facility violates one of these Acts, it typically means that they have released an excess amount of pollutants, neglected to implement the proper control technologies or standards, or failed to submit reports.
Setting Up Your Environment
Install the
echor
package in your Console.Run the code below to load today’s data frames into your environment.
- Run the code below to load the data dictionary for
echo_air_ca
into your environment.
<- echoAirGetMeta() |>
echo_air_dd filter(ColumnID %in% c(1,4,5,8,99,100,101,102)) |>
select(ColumnID, ObjectName, Description)
<- echoWaterGetMeta() |>
echo_water_dd filter(ColumnID %in% c(1,4,5,9,184,185,186,187)) |>
select(ColumnID, ObjectName, Description)
- Open both
echo_air_dd
andecho_water_dd
to review the data dictionaries for these datasets.
Data Analysis
Join Keys
When joining data frames, we need to determine a join key. This will be a variable that exists in both data frames to uniquely identify the data frames’ units of observation. This key needs to be shared across the two data frames in order to be able to join the data frames together. This is because when joining data frames, R
will check the value in that variable for one data frame and look for the corresponding value in the key variable for the other data frame in order to determine which units of observation match across the data frames.
For instance, let’s say we have two data frames documenting the same group of students - one documents their current courses, and another documents their living arrangements. For both data frames, the student ID might serve as a join key. This ID uniquely identifies each student, and we expect that that key will be the same in both data frames (i.e. the 99- that represents you in one data frame will be the same as the 99- number that represents you in another data frame.) Because that number matches across the two data frames, we can use that variable to determine which rows (i.e. students) in the first data frame are associated with which rows in the second data frame.
Question
Reference the data dictionaries for these two data frames to determine the names of the variables that we will join on. Remember that these should be variables that we can use to uniquely identify each unit of observation across the data frames. Write code below to determine the number of unique values for the variables you identify in each of these data frames. (Hint: Remember how we counted unique values in a variable in lab 1?)
# Write code here for echo_air_ca
# Write code here for echo_water_ca
If you’ve done this correctly, you’ll learn that there are 2602 unique values echo_air_ca
’s key variable. You’ll also learn that there are 28262 unique values in echo_water_ca
’s key variable.
…but if you look in your environment, there are 2794 rows in echo_air_ca
, and there are 30829 rows in echo_water_ca
. This means that the key variable must repeat, and that certain facilities are represented more than once in each of these data frames.
This happens because even the EPA defines a facility in different ways depending on what laws it is regulated by. What counts as one facility in the Clean Air Act may count as two or three facilities in the Clean Water Act, but there is only one ID to document the relationships between them. This means that the same ID may show up a few times for different parts of the same facility. Check out some of the duplicating rows via my code below. Notice how multiple facilities - sometimes with entirely different names can be associated with the same RegistryID?
|>
echo_air_ca group_by(RegistryID) |>
filter(n() > 1) |>
arrange(RegistryID) |>
head()
Later we will talk about what happens when we join two data frames that both have repeating keys, but for now, we’re going aggregate both of these data frames so that each Registry ID only appears once. Basically, we are going to write code to say: if any of the rows pertaining to this Registry ID document certain kinds of violations, set the violation flag for that facility to 1.
Question
In the code below, for each of the data frames, you should group the data by RegistryID
and then summarize by returning the max()
value in CurrSvFlag
, ViolFlag
, CurrVioFlag
, Insp5yrFlag
for each group. Store the results in echo_air_ca_agg
and echo_water_ca_agg
respectively. Note how many rows in are in echo_air_ca_agg
and echo_water_ca_agg
. It should match the number of unique values in RegistryID
that you calculated above.
<-
echo_air_ca_agg |>
echo_air_ca |> #group by RegistryID
______ _____( #Summarize the following:
CurrSvFlag = _____, #Calculate max value in CurrSvFlag
ViolFlag = _____, #Calculate max value in ViolFlag
CurrVioFlag = _____,#Calculate max value in CurrVioFlag
Insp5yrFlag = _____ #Calculate max value in Insp5yrFlag
)
<- #Copy and adjust code above to create echo_water_ca_agg echo_water_ca_agg
Joins
So now we have two data frames with unique registry IDs and variables indicating whether the facility has a current significant violation, whether it has had a violation in the past three years, whether it has a current violation, and whether it has been inspected in the past 5 years.
Now we want to join these two data frames together, so that we can check which facilities have violations to both the Clean Air Act and the Clean Water Act.
Question
Write code below to perform four kinds of joins - a left join, a right join, and inner join, and a full join. echo_water_ca_agg
should be in the first position, and echo_air_ca_agg
should be joined onto it. I’ve started that for you in the commented code below.
#joined_left <- echo_water_ca_agg |>
#joined_right <- echo_water_ca_agg |>
#joined_inner <- echo_water_ca_agg |>
#joined_full <- echo_water_ca_agg |>
Check out the first six rows of joined_full
below and note what happens when a RegistryID
is present in one data frame but not in the other. We see data values in the columns associated with the data frame where the RegistryID
was present, and NA
values in the columns associated with the data frame where the RegistryID
was not present. This is how we come to have so many extra rows in joined_full
.
|> head() joined_full
With joined_inner
, any rows associated with a RegistryID
that doesn’t appear in both data frames get dropped. Check out the number of rows in joined_inner
.
nrow(joined_inner)
[1] 739
This represents the number of Registry IDs that were present in both echo_water_ca_agg
and echo_air_ca_agg
.
Question
The three statements below are incorrect. Correct my statements below about the remaining joins.
paste(nrow(joined_full), "represents the number of Registry IDs present in echo_air_ca_agg")
paste(nrow(joined_left), "represents the number of Registry IDs present in either echo_air_ca_agg or echo_water_ca_agg")
paste(nrow(joined_right), "represents the number of Registry IDs present in echo_water_ca_agg")
From here on out, since we are only interested in the facilities with violations to both Acts, we are going to focus on joined_inner
. Let’s remove the rest of the joined data frames from our environment.
rm(joined_full, joined_left, joined_right)
You might note at this point that it can be difficult to tell which columns are associated with which environmental laws. This is because originally in both data frames, the variables we are most interested in (CurrSvFlag
, ViolFlag
, CurrVioFlag
, Insp5yrFlag
) shared the same names. When we joined these two data frames together, R
opted to distinguish between them by tacking a .x
onto the variable names for the data frame in the first position in the join and a .y
onto the variable names for the data frame in the second position in the join. Let’s give each of these variables more meaningful names. To do so, we can use the rename()
function, which is included in the dplyr
package in the tidyverse
. When piped to a dataframe, the formula for rename()
is simple: rename(new_name = old_name)
Question
In my code below rename the remaining columns to differentiate between the column names from the two original data frames.
<-
joined_inner_renamed |>
joined_inner rename(
CurrSvFlag_water = CurrSvFlag.x,
ViolFlag_water = ViolFlag.x
#Rename remaining columns here. Be sure to separate by comma. There are six more to go!
)
Question
Now that we have a cleaned up data frame, use a data wrangling verb to subset the data frame to the rows where a RegistryID
has a current violation to both the Clean Water Act and Clean Air Act. Repeat these steps to determine which RegistryID
s have a significant violation to both Acts. I recommend running this on joined_inner_renamed
.
Once you’ve run these codes, open both echo_air_ca
and echo_water_ca
by clicking on the data frames in your environment. Search for a few of the RegistryID
s that appeared in your analysis to identify the names and locations of the facilities.
# Write wrangling code here for current violations.
# Write wrangling code here for significant violations.
Many-to-Many Joins
We don’t know much about these facilities because we lost a lot of critical information (e.g. the facility’s name and location) when aggregating our data by RegistryID
above. At the time, we couldn’t aggregate by name because certain facilities sharing the same RegistryID
had different names! Aggregating by name would have meant that we’d still have repeating RegistryId
s in the data frame - one for each different version of the facility’s name.
The truth is though that we didn’t technically have to ensure that the RegistryID
didn’t repeat. We can still join data frames in instances where the join key repeats in both data frames. This is called performing a many-to-many join because we are joining many of the same key in one data frame to many of the same key in another data frame. It’s important to pay attention to what happens when we make this join. Let’s look at an example: the facility with Registry ID 110001181186.
|>
echo_air_ca filter(RegistryID == 110001181186)
|>
echo_water_ca filter(RegistryID == 110001181186)
In echo_air_ca
, this RegistryID
appears three times in the data frame, and in echo_water_ca
, this RegistryID
appears twice in the data frame. In each instance, the facility has a different name. What happens to facility 110001181186 when we perform a many-to-many join?
|>
echo_water_ca inner_join(echo_air_ca, by = "RegistryID") |>
filter(RegistryID == 110001181186) |>
select(CWPName, AIRName, RegistryID)
This inner join created six rows for facility 110001181186. This is because R
matched each of the three rows for facility 110001181186 in echo_air_ca
with the two rows for facility 110001181186 in echo_water_ca
(and 3 rows * 2 rows = 6 rows).
This is one of the reasons why it is important to understand the context of datasets and their units of observation. Here we’ve joined these rows together as if they represented matching facilities. …but just by looking at this output we know that PLATFORM A is going to be physically different than DOS CUADRAS/SOUTH COUNTY/PLATFORM B, and PLATFORM B is going to be physically different than DOS CUADRAS/SOUTH COUNTY/PLATFORM C. The only thing that they all share is that they are sub-parts of the facility with the RegistryID
110001181186. Ultimately the units of observation don’t match across these two data frames because Clean Air Act and Clean Water Act delineate facilities differently. By aggregating to RegistryID
above - something shared across the two Acts - we standardized the unit of observation, which made it possible to perform a more meaningful join.
Question
Using my code as an example, determine how many times facility 110000483619 appears in echo_air_ca
and how many times it appears in echo_water_ca
. How many times would this facility show up if we were to perform a many-to-many inner join for these two data frames? Write your response as a comment in the code chunk.
#Write code here for echo_air_ca!
#Write code here for echo_water_ca!
#Write comment here!
Filtering Joins
Sometimes we want to know which observations that appear in one data frame don’t appear in another data frame. This might tell us where we have missing data. In this case, it will tell us which facilities are regulated by one environmental law and not the other.
Question
Write code below to perform two anti_join()
s. The first should tell me which facilities are regulated by the Clean Air Act and not the Clean Water Act, and the second should tell me which facilities are regulated by the Clean Water Act and not the Clean Air Act.
#not_clean_water <- Write code here!
#not_clean_air <- Write code here!