Tidying Datasets

SDS 192: Introduction to Data Science

Lindsay Poirier
Statistical & Data Sciences, Smith College

Fall 2022

For Today

Formatting up Columns and Values
Parsing Dates
Conditionals
Pivoting
Separating Columns

Whenever formatting columns, we will use `mutate` to overwrite a variable with a new cleaned up variable.

library(tidyverse)
prisons <- read_csv("../data/Prison_Boundaries.csv")

Converting Types

as.character(), as.numeric(), as.logical() all convert a variable from an original type to a new type

Before
Code
After

typeof(prisons$COUNTYFIPS)

[1] "character"

prisons <- 
  prisons |> 
  mutate(COUNTYFIPS = as.numeric(COUNTYFIPS))

typeof(prisons$COUNTYFIPS)

[1] "double"

Parsing Dates

Dates can be converted to a date format using the lubridate package
- Step 1: Check how dates are formatted
- Step 2: Find corresponding conversion code on lubridate cheatsheet

Setting Dates

ymd_hms() will take a date formatted as year, month, day, hour, minute, second and convert it to a date time format

Before
Code
After

prisons |> 
  select(NAME, SOURCEDATE) |> 
  head(3)

library(lubridate)

prisons <- 
  prisons |> 
  mutate(SOURCEDATE = ymd_hms(SOURCEDATE))

Setting `NA` values

na_if() will take a variable and set specified values to NA

Before
Code
After

prisons |> 
  select(NAME, POPULATION) |> 
  head(3)

sum(is.na(prisons$POPULATION))

[1] 0

prisons <- 
  prisons |> 
  mutate(POPULATION = na_if(POPULATION, -999))

prisons |> 
  select(NAME, POPULATION) |> 
  head(3)

sum(is.na(prisons$POPULATION))

[1] 132

Replacing Strings

str_replace() will take a variable and replace an existing string with a new string

Before
Code
After

prisons |> 
  select(NAME, ADDRESS) |> 
  head(3)

prisons <- 
  prisons |> 
  mutate(ADDRESS = str_replace(ADDRESS, 
                               "AVENUE", 
                               "AVE"))

Removing Strings

str_replace() will take a variable and replace an existing string with a new string

Before
Code
After

prisons |> select(NAME, Creator) |> head(3)

prisons <- 
  prisons |> 
  mutate(Creator = str_replace(Creator, 
                               "Hostedby", 
                               ""))

Conditionals

case_when() allows us to set values when conditions are met

Before
Code
After

prisons |> 
  select(SECURELVL) |> 
  distinct()

prisons <- 
  prisons |> 
  mutate(JUVENILE = 
           case_when(
             SECURELVL == "JUVENILE" ~ "Juvenile",
             TRUE ~ "Not Juvenile"))

What is tidy data?

Every observation has its own row.
Every variable has its own columns.
Every value has its own cell.

Is this tidy?

What variables are displayed on this plot?

What will it look like when tidy?

df |> pivot_longer(-Date, 
                    names_to = "City", 
                    values_to = "AQI")

Learning Check: What function would I use to remove “_AQI” from the `City` column on the previous slide?

Pivoting Longer

We use pivot_longer() to pivot a datasets from wider to longer format:
pivot_longer() takes the following arguments:

cols =: Identify a series of columns to pivot - The names of those columns will become repeated rows in the pivoted data frame, and the values in those columns will be stored in a new column.
names_to =: Identify a name for the column where the column names will be store
values_to =: Identify a name for the column were the values associated with those names will be stored
Various arguments to support transformations to names

df |> pivot_longer(cols = ends_with("AQI"), 
                    names_to = "City", 
                    values_to = "AQI") |>
  mutate(City = str_replace(City, "_AQI", ""))

Learning Check: How would I convert the dates on the previous slide to a date-time format?

Pivoting Wider

Note: I use this far less often than pivot_longer()

We use pivot_wider() to pivot a datasets from longer to wider format:
pivot_wider() takes the following arguments:

names_from =: Identify the column to get the new column names from
values_from =: Identify the column to get the cell values from
Various arguments to support transformations to names

Example

Before
Code
After

df |> pivot_wider(names_from = "Date", 
                   values_from = "AQI", 
                   names_repair = make.names)

Separating Columns

We use separate() to split a column into multiple columns:
separate() takes the following arguments:

col: Identify the existing column to separate
into = c(): Identify the names of the new columns
sep =: Identify the characters or numeric position that indicate where to separate columns

Example

Before
Code
After

df |> 
  pivot_longer(
    cols = everything(), 
    names_to = "Measure",
    values_to = "AQI") |>
  separate(Measure, into = c("City", "Month", "Date"), sep = "_")

Learning Check: What would we need to do to convert `AQI` on the previous slide into a numeric variable?

Tidying Datasets

For Today

Whenever formatting columns, we will use mutate to overwrite a variable with a new cleaned up variable.

Converting Types

Parsing Dates

Setting Dates

Setting NA values

Replacing Strings

Removing Strings

Conditionals

What is tidy data?

Is this tidy?

What will it look like when tidy?

Learning Check: What function would I use to remove “_AQI” from the City column on the previous slide?

Pivoting Longer

Example

Learning Check: How would I convert the dates on the previous slide to a date-time format?

Pivoting Wider

Example

Separating Columns

Example

Learning Check: What would we need to do to convert AQI on the previous slide into a numeric variable?

Whenever formatting columns, we will use `mutate` to overwrite a variable with a new cleaned up variable.

Setting `NA` values

Learning Check: What function would I use to remove “_AQI” from the `City` column on the previous slide?

Learning Check: What would we need to do to convert `AQI` on the previous slide into a numeric variable?