Tidying Datasets

SDS 192: Introduction to Data Science

Lindsay Poirier
Statistical & Data Sciences, Smith College

Fall 2022

For Today

  • Formatting up Columns and Values
  • Parsing Dates
  • Conditionals
  • Pivoting
  • Separating Columns

Whenever formatting columns, we will use mutate to overwrite a variable with a new cleaned up variable.

library(tidyverse)
prisons <- read_csv("../data/Prison_Boundaries.csv")

Converting Types

  • as.character(), as.numeric(), as.logical() all convert a variable from an original type to a new type
typeof(prisons$COUNTYFIPS)
[1] "character"
prisons <- 
  prisons |> 
  mutate(COUNTYFIPS = as.numeric(COUNTYFIPS))
typeof(prisons$COUNTYFIPS)
[1] "double"

Parsing Dates

  • Dates can be converted to a date format using the lubridate package
    • Step 1: Check how dates are formatted
    • Step 2: Find corresponding conversion code on lubridate cheatsheet

Setting Dates

  • ymd_hms() will take a date formatted as year, month, day, hour, minute, second and convert it to a date time format
prisons |> 
  select(NAME, SOURCEDATE) |> 
  head(3)
library(lubridate)

prisons <- 
  prisons |> 
  mutate(SOURCEDATE = ymd_hms(SOURCEDATE))

Setting NA values

  • na_if() will take a variable and set specified values to NA
prisons |> 
  select(NAME, POPULATION) |> 
  head(3)
sum(is.na(prisons$POPULATION))
[1] 0
prisons <- 
  prisons |> 
  mutate(POPULATION = na_if(POPULATION, -999))
prisons |> 
  select(NAME, POPULATION) |> 
  head(3)
sum(is.na(prisons$POPULATION))
[1] 132

Replacing Strings

  • str_replace() will take a variable and replace an existing string with a new string
prisons |> 
  select(NAME, ADDRESS) |> 
  head(3)
prisons <- 
  prisons |> 
  mutate(ADDRESS = str_replace(ADDRESS, 
                               "AVENUE", 
                               "AVE"))

Removing Strings

  • str_replace() will take a variable and replace an existing string with a new string
prisons |> select(NAME, Creator) |> head(3)
prisons <- 
  prisons |> 
  mutate(Creator = str_replace(Creator, 
                               "Hostedby", 
                               ""))

Conditionals

  • case_when() allows us to set values when conditions are met
prisons |> 
  select(SECURELVL) |> 
  distinct()
prisons <- 
  prisons |> 
  mutate(JUVENILE = 
           case_when(
             SECURELVL == "JUVENILE" ~ "Juvenile",
             TRUE ~ "Not Juvenile")) 

What is tidy data?

  • Every observation has its own row.
  • Every variable has its own columns.
  • Every value has its own cell.

Is this tidy?

What variables are displayed on this plot?

What will it look like when tidy?

df |> pivot_longer(-Date, 
                    names_to = "City", 
                    values_to = "AQI")

Learning Check: What function would I use to remove “_AQI” from the City column on the previous slide?

Pivoting Longer

  • We use pivot_longer() to pivot a datasets from wider to longer format:
  • pivot_longer() takes the following arguments:
  1. cols =: Identify a series of columns to pivot - The names of those columns will become repeated rows in the pivoted data frame, and the values in those columns will be stored in a new column.
  2. names_to =: Identify a name for the column where the column names will be store
  3. values_to =: Identify a name for the column were the values associated with those names will be stored
  4. Various arguments to support transformations to names

Example

df |> pivot_longer(cols = ends_with("AQI"), 
                    names_to = "City", 
                    values_to = "AQI") |>
  mutate(City = str_replace(City, "_AQI", ""))

Learning Check: How would I convert the dates on the previous slide to a date-time format?

Pivoting Wider

Note: I use this far less often than pivot_longer()

  • We use pivot_wider() to pivot a datasets from longer to wider format:
  • pivot_wider() takes the following arguments:
  1. names_from =: Identify the column to get the new column names from
  2. values_from =: Identify the column to get the cell values from
  3. Various arguments to support transformations to names

Example

df |> pivot_wider(names_from = "Date", 
                   values_from = "AQI", 
                   names_repair = make.names)

Separating Columns

  • We use separate() to split a column into multiple columns:
  • separate() takes the following arguments:
  1. col: Identify the existing column to separate
  2. into = c(): Identify the names of the new columns
  3. sep =: Identify the characters or numeric position that indicate where to separate columns

Example

df |> 
  pivot_longer(
    cols = everything(), 
    names_to = "Measure",
    values_to = "AQI") |>
  separate(Measure, into = c("City", "Month", "Date"), sep = "_")

Learning Check: What would we need to do to convert AQI on the previous slide into a numeric variable?