Data Wrangling

SDS 192: Introduction to Data Science

Lindsay Poirier
Statistical & Data Sciences, Smith College

Fall 2022

For Today

Quiz 1
Subsetting Data
Aggregating Data
Activity

Data wrangling is a process for transforming a dataset from its original form into a more relevant form.

Six “verbs” for data wrangling

arrange()
select()
filter()
mutate()
summarize()
group_by()

`arrange()`

arrange() sorts rows according to values in a column
Defaults to sorting from smallest to largest (numeric) or first character to last character (character).

spotify_playlists |>
  arrange(desc(valence)) |>
  select(playlist_name, track.name, track.album.name) |>
  head()

Subsetting data involves selecting relevant variables and observations for analysis. Aggregating data involves compiling and summarizing data.

Subsetting Data

`select()`

select() enables us to select variables (columns) of interest.

spotify_playlists |>
  select(playlist_name, valence, speechiness:instrumentalness , key_mode) |>
  head() #Remember that head() shows the first six rows of data!

`filter()`

filter() subsets observations (rows) according to a certain criteria that we provide.

spotify_playlists |>
  filter(track.album.name == "Led Zeppelin IV") |>
  select(playlist_name, track.name, track.album.name)

spotify_playlists |>
  filter(playlist_name == "Classic Rock Drive" & 
           track.popularity >= 80  & 
           is.na(track.preview_url)) |>
  select(track.name, track.album.name, track.popularity)

`mutate()`

mutate() creates a new variable (column) in a data frame and fills values according to criteria we provide

spotify_playlists |>
  mutate(track.duration_seconds = track.duration_ms/1000) |>
  filter(track.album.name == "The Wall") |>
  select(playlist_name, track.name, track.duration_seconds) |>
  head()

Aggregating Data

Summary functions

`summarize()`

summarize() computes a value across a vector of values and stores it in a new data frame

How is this different than only applying a summary function to a vector?

spotify_playlists |>
  summarize(NUM_SONGS = n(),
            MEAN_VALENCE = mean(valence),
            MEDIAN_VALENCE = median(valence))

`group_by()`

group_by() groups observations with a shared value in a variable
Grouping only changes the metadata of a data frame; we combine group_by() with other functions to transform the data frame
Values remain in groups unless we ungroup() it. This is important if we intend to run further operations on the resulting data.

spotify_playlists |>
  group_by(playlist_name) |>
  select(playlist_name, track.name, track.album.name) |>
  head()

`group_by() |> summarize()`

group_by() groups observations with a shared value in a variable
When we combine group_by() and summarize() we can perform operations within groups

spotify_playlists |>
  group_by(playlist_name) |>
  summarize(NUM_SONGS = n(),
            MEAN_VALENCE = mean(valence),
            MEDIAN_VALENCE = median(valence)) |>
  ungroup()

`group_by() |> filter()`

group_by() groups observations with a shared value in a variable
When we combine group_by() and filter() we can filter within groups

`group_by() |> filter()`

spotify_playlists |>
  group_by(playlist_name) |>
  filter(valence == max(valence)) |>
  select(playlist_name, track.name, valence) |>
  head()

`group_by() |> mutate()`

group_by() groups observations with a shared value in a variable
When we combine group_by() and mutate() we can perform operations within groups and add the resulting variable to the data frame

spotify_playlists |>
  group_by(playlist_name) |>
  mutate(TOTAL_DURATION = sum(track.duration_ms),
         PERCENT_DURATION = track.duration_ms/TOTAL_DURATION * 100) |>
  select(playlist_name, track.name, track.duration_ms, TOTAL_DURATION, PERCENT_DURATION) |>
  head()

What’s wrong with this code?

Which song has the duration that takes up the greatest percentage of time on any playlist?

spotify_playlists |>
  group_by(playlist_name) |>
  mutate(TOTAL_DURATION = sum(track.duration_ms),
         PERCENT_DURATION = track.duration_ms/TOTAL_DURATION * 100) |>
  filter(PERCENT_DURATION == max(PERCENT_DURATION)) |>
  select(playlist_name, track.name, track.duration_ms, TOTAL_DURATION, PERCENT_DURATION) |>
  head()

`ungroup()`

spotify_playlists |>
  group_by(playlist_name) |>
  mutate(TOTAL_DURATION = sum(track.duration_ms),
         PERCENT_DURATION = track.duration_ms/TOTAL_DURATION * 100) |>
  ungroup() |>
  filter(PERCENT_DURATION == max(PERCENT_DURATION)) |>
  select(playlist_name, track.name, track.duration_ms, TOTAL_DURATION, PERCENT_DURATION) |>
  head()