Joining Datasets

SDS 192: Introduction to Data Science

Lindsay Poirier
Statistical & Data Sciences, Smith College

Fall 2022

For Today

  • Reminder: Quiz 1 and Group Projects!
  • MSA Recap
  • Joining Datasets
  • Mosaic Effect

Load the following two data frames into RStudio.

What do you notice about them? Why might they be useful to consider together?

library(tidyverse)
ct_school_attendance <- read.csv("https://data.ct.gov/resource/t4hx-jd4c.csv?$limit=3000") %>%
  filter(studentgroup == "All Students" & reportingdistrictname != "Connecticut")
ct_school_learning_model <-
  read.csv("https://data.ct.gov/resource/5q7h-u2ac.csv?$where=update_date=%272020-09-17%27") |>
  rename(fully_remote = percent_students_fully_remote) 

Joining Datasets

  • Up until this point, we have been working with data in single tables.
  • Sometimes related data can be spread across multiple tables that we wish to bring together for different kinds of analysis.

Keys

  • A column shared across the tables that we can join on.
  • What might be an issue with joining on this key?

Keys

Join

ct_school_attendance |>
  inner_join(ct_school_learning_model, 
             by = c("reportingdistrictcode" = "district_code")) |>
  select(reportingdistrictcode, attrate_202021, fully_remote) |>
  head()

Different Kinds of Joins

  • inner_join()
  • left_join()
  • right_join()
  • full_join()
  • anti_join()

Inner Join

Outer Joins

One-to-many relationships

Many-to-many relationships

Anti-joins

ct_school_attendance %>%
  anti_join(ct_school_learning_model, by = c("reportingdistrictcode" = "district_code")) %>%
  select(reportingdistrictcode)