Lab 6: Joining Datasets

Author

Lindsay Poirier

Introduction

In terms of data analysis, this lab has one goal: to determine the number of industrial facilities that are currently in violation of both the Clean Air Act and the Clean Water Act in California. To achieve this goal, we’re going to have to do some data wrangling and join together some datasets published by the EPA. We’re going to practice applying different types of joins to this data and consider what we learn with each.

Learning Goals

Identify join keys
Discern the differences between types of joins
Discern what happens when we perform many-to-many joins
Apply filtering joins to identify missing data
Practice data wrangling

Review of Key Terms

Tip

You may wish to reference this Data Wrangling Cheatsheet when completing this lab.

Join Key: A variable shared across two datasets that identifies the observations that will be merged

EPA’s ECHO

The U.S. Environmental Protection Agency (EPA) is responsible for monitoring and regulating over 800,000 industrial facilities in the United States. This involves regularly inspecting facilities for their compliance with different environmental laws, issuing notices and taking enforcement actions when facilities are out of compliance, and issuing penalties for failures to address violations. A record of all of this activity is maintained online in a database known as ECHO - Enforcement and Compliance History Online. Specifically, ECHO maintains information about enforcement and compliance actions taken in relation to the Clean Air Act, the Clean Water Act, the Safe Drinking Water Act, and the Resource Conservation and Recovery Act.

There is so much information available to us via ECHO, and there is even an R package that makes it really simple to access the data in ECHO. However, because ECHO documents compliance with a number of different environmental laws, when we pull data from ECHO about regulated facilities, it is usually organized into separate tables for each law. Today we are going to pull a dataset documenting facilities’ compliance with the Clean Air Act and then pull a second dataset documenting facilities’ compliance with the Clean Water Act.

The Clean Air Act was first enacted in 1963 to regulate emissions from both stationary and mobile sources of air pollution. The Clean Water Act was enacted in 1972 regulate the polluting of U.S. waterways. When a facility violates one of these Acts, it typically means that they have released an excess amount of pollutants, neglected to implement the proper control technologies or standards, or failed to submit reports.

Setting Up Your Environment

Install the echor package in your Console.
Run the code below to load today’s data frames into your environment.

library(tidyverse)
library(echor)

echo_air_ca <- 
  echoAirGetFacilityInfo(p_st = "CA", 
                         qcolumns = "1,4,5,8,48,99,100,101,102") |>
  select(-SourceID)

echo_water_ca <- 
  echoWaterGetFacilityInfo(p_st = "CA", 
                           qcolumns = "1,4,5,9,184,185,186,187") |>
  select(-SourceID)

Run the code below to load the data dictionary for echo_air_ca into your environment.

echo_air_dd <- echoAirGetMeta() |>
  filter(ColumnID %in% c(1,4,5,8,99,100,101,102)) |>
  select(ColumnID, ObjectName, Description)

echo_water_dd <- echoWaterGetMeta() |>
  filter(ColumnID %in% c(1,4,5,9,184,185,186,187)) |>
  select(ColumnID, ObjectName, Description)

Open both echo_air_dd and echo_water_dd to review the data dictionaries for these datasets.

Data Analysis

Join Keys

When joining data frames, we need to determine a join key. This will be a variable that exists in both data frames to uniquely identify the data frames’ units of observation. This key needs to be shared across the two data frames in order to be able to join the data frames together. This is because when joining data frames, R will check the value in that variable for one data frame and look for the corresponding value in the key variable for the other data frame in order to determine which units of observation match across the data frames.

For instance, let’s say we have two data frames documenting the same group of students - one documents their current courses, and another documents their living arrangements. For both data frames, the student ID might serve as a join key. This ID uniquely identifies each student, and we expect that that key will be the same in both data frames (i.e. the 99- that represents you in one data frame will be the same as the 99- number that represents you in another data frame.) Because that number matches across the two data frames, we can use that variable to determine which rows (i.e. students) in the first data frame are associated with which rows in the second data frame.

Question

Reference the data dictionaries for these two data frames to determine the names of the variables that we will join on. Remember that these should be variables that we can use to uniquely identify each unit of observation across the data frames. Write code below to determine the number of unique values for the variables you identify in each of these data frames. (Hint: Remember how we counted unique values in a variable in lab 1?)

# Write code here for echo_air_ca
# Write code here for echo_water_ca

If you’ve done this correctly, you’ll learn that there are 2602 unique values echo_air_ca’s key variable. You’ll also learn that there are 28262 unique values in echo_water_ca’s key variable.

…but if you look in your environment, there are 2794 rows in echo_air_ca, and there are 30829 rows in echo_water_ca. This means that the key variable must repeat, and that certain facilities are represented more than once in each of these data frames.

This happens because even the EPA defines a facility in different ways depending on what laws it is regulated by. What counts as one facility in the Clean Air Act may count as two or three facilities in the Clean Water Act, but there is only one ID to document the relationships between them. This means that the same ID may show up a few times for different parts of the same facility. Check out some of the duplicating rows via my code below. Notice how multiple facilities - sometimes with entirely different names can be associated with the same RegistryID?

echo_air_ca |>
  group_by(RegistryID) |>
  filter(n() > 1) |>
  arrange(RegistryID) |>
  head()

Later we will talk about what happens when we join two data frames that both have repeating keys, but for now, we’re going aggregate both of these data frames so that each Registry ID only appears once. Basically, we are going to write code to say: if any of the rows pertaining to this Registry ID document certain kinds of violations, set the violation flag for that facility to 1.

Question

In the code below, for each of the data frames, you should group the data by RegistryID and then summarize by returning the max() value in CurrSvFlag, ViolFlag, CurrVioFlag, Insp5yrFlag for each group. Store the results in echo_air_ca_agg and echo_water_ca_agg respectively. Note how many rows in are in echo_air_ca_agg and echo_water_ca_agg. It should match the number of unique values in RegistryID that you calculated above.

echo_air_ca_agg <-
  echo_air_ca |>
  ______ |>             #group by RegistryID
  _____(                #Summarize the following:
    CurrSvFlag = _____, #Calculate max value in CurrSvFlag
    ViolFlag = _____,   #Calculate max value in ViolFlag
    CurrVioFlag = _____,#Calculate max value in CurrVioFlag
    Insp5yrFlag = _____ #Calculate max value in Insp5yrFlag
  )

echo_water_ca_agg <- #Copy and adjust code above to create echo_water_ca_agg

Joins

So now we have two data frames with unique registry IDs and variables indicating whether the facility has a current significant violation, whether it has had a violation in the past three years, whether it has a current violation, and whether it has been inspected in the past 5 years.

Now we want to join these two data frames together, so that we can check which facilities have violations to both the Clean Air Act and the Clean Water Act.

Question

Write code below to perform four kinds of joins - a left join, a right join, and inner join, and a full join. echo_water_ca_agg should be in the first position, and echo_air_ca_agg should be joined onto it. I’ve started that for you in the commented code below.

#joined_left <- echo_water_ca_agg |> 

#joined_right <- echo_water_ca_agg |> 

#joined_inner <- echo_water_ca_agg |> 

#joined_full <- echo_water_ca_agg |>

Check out the first six rows of joined_full below and note what happens when a RegistryID is present in one data frame but not in the other. We see data values in the columns associated with the data frame where the RegistryID was present, and NA values in the columns associated with the data frame where the RegistryID was not present. This is how we come to have so many extra rows in joined_full.

joined_full |> head()

With joined_inner, any rows associated with a RegistryID that doesn’t appear in both data frames get dropped. Check out the number of rows in joined_inner.

nrow(joined_inner)

[1] 739

This represents the number of Registry IDs that were present in both echo_water_ca_agg and echo_air_ca_agg.

Question

The three statements below are incorrect. Correct my statements below about the remaining joins.

paste(nrow(joined_full), "represents the number of Registry IDs present in echo_air_ca_agg")

paste(nrow(joined_left), "represents the number of Registry IDs present in either echo_air_ca_agg or echo_water_ca_agg")

paste(nrow(joined_right), "represents the number of Registry IDs present in echo_water_ca_agg")

From here on out, since we are only interested in the facilities with violations to both Acts, we are going to focus on joined_inner. Let’s remove the rest of the joined data frames from our environment.

rm(joined_full, joined_left, joined_right)

You might note at this point that it can be difficult to tell which columns are associated with which environmental laws. This is because originally in both data frames, the variables we are most interested in (CurrSvFlag, ViolFlag, CurrVioFlag, Insp5yrFlag) shared the same names. When we joined these two data frames together, R opted to distinguish between them by tacking a .x onto the variable names for the data frame in the first position in the join and a .y onto the variable names for the data frame in the second position in the join. Let’s give each of these variables more meaningful names. To do so, we can use the rename() function, which is included in the dplyr package in the tidyverse. When piped to a dataframe, the formula for rename() is simple: rename(new_name = old_name)

Question

In my code below rename the remaining columns to differentiate between the column names from the two original data frames.

joined_inner_renamed <-
  joined_inner |>
  rename(
    CurrSvFlag_water = CurrSvFlag.x,
    ViolFlag_water = ViolFlag.x 
    #Rename remaining columns here. Be sure to separate by comma. There are six more to go!
  )

Question

Now that we have a cleaned up data frame, use a data wrangling verb to subset the data frame to the rows where a RegistryID has a current violation to both the Clean Water Act and Clean Air Act. Repeat these steps to determine which RegistryIDs have a significant violation to both Acts. I recommend running this on joined_inner_renamed.

Once you’ve run these codes, open both echo_air_ca and echo_water_ca by clicking on the data frames in your environment. Search for a few of the RegistryIDs that appeared in your analysis to identify the names and locations of the facilities.

# Write wrangling code here for current violations. 

# Write wrangling code here for significant violations.

Many-to-Many Joins

We don’t know much about these facilities because we lost a lot of critical information (e.g. the facility’s name and location) when aggregating our data by RegistryID above. At the time, we couldn’t aggregate by name because certain facilities sharing the same RegistryID had different names! Aggregating by name would have meant that we’d still have repeating RegistryIds in the data frame - one for each different version of the facility’s name.

The truth is though that we didn’t technically have to ensure that the RegistryID didn’t repeat. We can still join data frames in instances where the join key repeats in both data frames. This is called performing a many-to-many join because we are joining many of the same key in one data frame to many of the same key in another data frame. It’s important to pay attention to what happens when we make this join. Let’s look at an example: the facility with Registry ID 110001181186.

echo_air_ca |>
  filter(RegistryID == 110001181186)

echo_water_ca |>
  filter(RegistryID == 110001181186)

In echo_air_ca, this RegistryID appears three times in the data frame, and in echo_water_ca, this RegistryID appears twice in the data frame. In each instance, the facility has a different name. What happens to facility 110001181186 when we perform a many-to-many join?

echo_water_ca |> 
  inner_join(echo_air_ca, by = "RegistryID") |> 
  filter(RegistryID == 110001181186) |>
  select(CWPName, AIRName, RegistryID)

This inner join created six rows for facility 110001181186. This is because R matched each of the three rows for facility 110001181186 in echo_air_ca with the two rows for facility 110001181186 in echo_water_ca (and 3 rows * 2 rows = 6 rows).

This is one of the reasons why it is important to understand the context of datasets and their units of observation. Here we’ve joined these rows together as if they represented matching facilities. …but just by looking at this output we know that PLATFORM A is going to be physically different than DOS CUADRAS/SOUTH COUNTY/PLATFORM B, and PLATFORM B is going to be physically different than DOS CUADRAS/SOUTH COUNTY/PLATFORM C. The only thing that they all share is that they are sub-parts of the facility with the RegistryID 110001181186. Ultimately the units of observation don’t match across these two data frames because Clean Air Act and Clean Water Act delineate facilities differently. By aggregating to RegistryID above - something shared across the two Acts - we standardized the unit of observation, which made it possible to perform a more meaningful join.

Question

Using my code as an example, determine how many times facility 110000483619 appears in echo_air_ca and how many times it appears in echo_water_ca. How many times would this facility show up if we were to perform a many-to-many inner join for these two data frames? Write your response as a comment in the code chunk.

#Write code here for echo_air_ca!

#Write code here for echo_water_ca!

#Write comment here!

Filtering Joins

Sometimes we want to know which observations that appear in one data frame don’t appear in another data frame. This might tell us where we have missing data. In this case, it will tell us which facilities are regulated by one environmental law and not the other.

Question

Write code below to perform two anti_join()s. The first should tell me which facilities are regulated by the Clean Air Act and not the Clean Water Act, and the second should tell me which facilities are regulated by the Clean Water Act and not the Clean Air Act.

#not_clean_water <- Write code here!

#not_clean_air <- Write code here!

Ethical Considerations

In March 2020, the Environmental Protection Agency published a memo that permitted industrial facilities impacted by the Covid-19 pandemic to temporarily suspend mandated pollution monitoring. Research published by the Environmental Data Governance Initiative (EDGI) (Nost et al. 2020) has indicated that, during this time, reported violations to the Clean Air Act and the Clean Water Act dropped considerably. However, EDGI’s report goes on to argue that violations were likely being under-counted at this time. …first, because facilities were not being required to monitor and report data to the same degree during Covid-19, and second, because the EPA was conducting fewer inspections during Covid-19. In sum, as they say, “the absence of data should not be taken as the absence of pollution.” What social harms emerge in the wake of these data absences? Who benefits from these policies, and who faces the greatest risks? How should we as data scientists think and act as we anaylze and present this data?

References

Nost, Eric, Kelsey Breseman, Steve Hansen, Leslie Alanis, Megan Raisle, Lourdes Vera, Maalvika Bhat, Sara Wylie, Sung-Gheel (Gil) Jang, and EDGI. 2020. “More Permission to Pollute: The Decline of EPA Enforcement and Industry Compliance During COVID.” Ke. https://envirodatagov.org/more-permission-to-pollute-the-decline-of-epa-enforcement-and-industry-compliance-during-covid/.