SDS 192: Introduction to Data Science
Lindsay Poirier
Statistical & Data Sciences, Smith College
Fall 2022
arrange()
select()
filter()
mutate()
summarize()
group_by()
arrange()
arrange() sorts rows according to values in a column.
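A minimal sketch of sorting, assuming dplyr is installed. The songs data frame is made-up toy data (not the course dataset); only its column names mirror the ones used later in these slides.

```r
library(dplyr)

# Hypothetical toy data for illustration only
songs <- data.frame(
  playlist_name = c("A", "A", "B", "B"),
  track.name = c("w", "x", "y", "z"),
  track.duration_ms = c(200000, 180000, 240000, 210000)
)

# Sort rows from longest to shortest track
sorted <- songs |>
  arrange(desc(track.duration_ms))
```

Dropping desc() would sort from shortest to longest instead.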
select()
select() enables us to select variables (columns) of interest.
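A minimal sketch of column selection, assuming dplyr is installed. The songs data frame is made-up toy data, not the course dataset.

```r
library(dplyr)

# Hypothetical toy data for illustration only
songs <- data.frame(
  playlist_name = c("A", "A", "B", "B"),
  track.name = c("w", "x", "y", "z"),
  track.duration_ms = c(200000, 180000, 240000, 210000)
)

# Keep only the two columns we care about; all rows are retained
selected <- songs |>
  select(track.name, track.duration_ms)
```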
filter()
filter() subsets observations (rows) according to criteria that we provide.
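A minimal sketch of row filtering, assuming dplyr is installed. The songs data frame and the 200000 ms cutoff are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data for illustration only
songs <- data.frame(
  playlist_name = c("A", "A", "B", "B"),
  track.name = c("w", "x", "y", "z"),
  track.duration_ms = c(200000, 180000, 240000, 210000)
)

# Keep only rows where the track is strictly longer than 200000 ms
long_tracks <- songs |>
  filter(track.duration_ms > 200000)
```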
mutate()
mutate() creates a new variable (column) in a data frame and fills its values according to criteria we provide.
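A minimal sketch of adding a column, assuming dplyr is installed. The songs data frame and the new column name duration_min are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data for illustration only
songs <- data.frame(
  playlist_name = c("A", "A", "B", "B"),
  track.name = c("w", "x", "y", "z"),
  track.duration_ms = c(200000, 180000, 240000, 210000)
)

# Add a column with the duration in minutes; existing columns are kept
with_minutes <- songs |>
  mutate(duration_min = track.duration_ms / 60000)
```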
summarize()
summarize() computes a value across a vector of values and stores it in a new data frame.
How is this different from only applying a summary function to a vector?
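One way to explore the question above is to run both versions side by side. This is a minimal sketch, assuming dplyr is installed; the songs data frame and the column name mean_duration are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data for illustration only
songs <- data.frame(
  playlist_name = c("A", "A", "B", "B"),
  track.name = c("w", "x", "y", "z"),
  track.duration_ms = c(200000, 180000, 240000, 210000)
)

# Applying a summary function to a vector returns a bare number
mean_as_number <- mean(songs$track.duration_ms)

# summarize() returns a data frame with one row and a named column
mean_as_df <- songs |>
  summarize(mean_duration = mean(track.duration_ms))
```

Because the second result is still a data frame, it can be piped into further dplyr verbs.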
group_by()
group_by() groups observations with a shared value in a variable.
We use group_by() with other functions to transform the data frame.
When we are done, we ungroup() it. This is important if we intend to run further operations on the resulting data.
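A minimal sketch of how grouping is attached to and removed from a data frame, assuming dplyr is installed. The songs data frame is made-up toy data; group_vars() is used here only to inspect the grouping.

```r
library(dplyr)

# Hypothetical toy data for illustration only
songs <- data.frame(
  playlist_name = c("A", "A", "B", "B"),
  track.name = c("w", "x", "y", "z"),
  track.duration_ms = c(200000, 180000, 240000, 210000)
)

# Grouping does not change the rows; it records which variable defines the groups
grouped <- songs |> group_by(playlist_name)
vars_while_grouped <- group_vars(grouped)

# ungroup() removes that record so later steps operate on the whole data frame
vars_after_ungroup <- group_vars(ungroup(grouped))
```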
group_by() |> summarize()
group_by() groups observations with a shared value in a variable.
When we combine group_by() and summarize(), we can perform operations within groups.
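A minimal sketch of a per-group summary, assuming dplyr is installed. The songs data frame and the column name total_ms are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data for illustration only
songs <- data.frame(
  playlist_name = c("A", "A", "B", "B"),
  track.name = c("w", "x", "y", "z"),
  track.duration_ms = c(200000, 180000, 240000, 210000)
)

# One row per playlist, holding the total duration of its tracks
totals <- songs |>
  group_by(playlist_name) |>
  summarize(total_ms = sum(track.duration_ms))
```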
group_by() |> filter()
group_by() groups observations with a shared value in a variable.
When we combine group_by() and filter(), we can filter within groups.
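A minimal sketch of filtering within groups, assuming dplyr is installed. The songs data frame is made-up toy data, not the course dataset.

```r
library(dplyr)

# Hypothetical toy data for illustration only
songs <- data.frame(
  playlist_name = c("A", "A", "B", "B"),
  track.name = c("w", "x", "y", "z"),
  track.duration_ms = c(200000, 180000, 240000, 210000)
)

# max() is evaluated separately for each playlist,
# so we keep the longest track in every playlist
longest_per_playlist <- songs |>
  group_by(playlist_name) |>
  filter(track.duration_ms == max(track.duration_ms)) |>
  ungroup()
```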
group_by() |> mutate()
group_by() groups observations with a shared value in a variable.
When we combine group_by() and mutate(), we can perform operations within groups and add the resulting variable to the data frame.
Which song has the duration that takes up the greatest percentage of time on any playlist?
spotify_playlists |>
  group_by(playlist_name) |>
  mutate(TOTAL_DURATION = sum(track.duration_ms),
         PERCENT_DURATION = track.duration_ms / TOTAL_DURATION * 100) |>
  # The data is still grouped here, so max() is computed within each playlist
  filter(PERCENT_DURATION == max(PERCENT_DURATION)) |>
  select(playlist_name, track.name, track.duration_ms, TOTAL_DURATION, PERCENT_DURATION) |>
  head()
ungroup()
spotify_playlists |>
  group_by(playlist_name) |>
  mutate(TOTAL_DURATION = sum(track.duration_ms),
         PERCENT_DURATION = track.duration_ms / TOTAL_DURATION * 100) |>
  # After ungroup(), max() is computed across all playlists at once
  ungroup() |>
  filter(PERCENT_DURATION == max(PERCENT_DURATION)) |>
  select(playlist_name, track.name, track.duration_ms, TOTAL_DURATION, PERCENT_DURATION) |>
  head()