SDS 192: Introduction to Data Science
Lindsay Poirier
Statistical & Data Sciences, Smith College
Fall 2022
Use mutate() to overwrite a variable with a new, cleaned-up variable.
as.character(), as.numeric(), and as.logical() all convert a variable from its original type to a new type.
For dates, see the lubridate package and the lubridate cheatsheet.
ymd_hms() will take a date formatted as year, month, day, hour, minute, second and convert it to a date-time format.
NAME <chr> | SOURCEDATE <dttm> |
---|---|
BALDWIN PARK JAIL | 2018-05-16 |
CLAREMONT JAIL | 2018-05-25 |
CONCORD JAIL | 2018-05-25 |
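As a minimal sketch of the type conversions described above, the following overwrites two character columns inside mutate(). The POPULATION column, the facility names, and the raw date strings are assumptions made up for illustration; only the SOURCEDATE column name comes from the table.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Hypothetical data: POPULATION and the raw date strings are assumed
facilities <- tibble(
  NAME = c("FACILITY A", "FACILITY B"),
  POPULATION = c("120", "45"),
  SOURCEDATE = c("2018-05-16 00:00:00", "2018-05-25 00:00:00")
)

facilities_clean <- facilities %>%
  mutate(
    POPULATION = as.numeric(POPULATION),  # character -> double
    SOURCEDATE = ymd_hms(SOURCEDATE)      # character -> date-time (UTC by default)
  )
```

Because each new variable reuses the name of an existing one, mutate() replaces the original column rather than adding a new one.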
NA values
na_if() will take a variable and set specified values to NA.
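A small sketch of na_if() inside mutate(), assuming that "NOT AVAILABLE" is the placeholder that should count as missing. The facility names and the other SECURELVL values are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical data: facility names and security levels are made up
facilities <- tibble(
  NAME = c("FACILITY A", "FACILITY B", "FACILITY C"),
  SECURELVL = c("NOT AVAILABLE", "JUVENILE", "NOT AVAILABLE")
)

facilities_clean <- facilities %>%
  # Every "NOT AVAILABLE" becomes NA; all other values are left alone
  mutate(SECURELVL = na_if(SECURELVL, "NOT AVAILABLE"))
```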
str_replace() will take a variable and replace an existing string with a new string.
NAME <chr> | ADDRESS <chr> |
---|---|
BALDWIN PARK JAIL | 14403 EAST PACIFIC AVE |
CLAREMONT JAIL | 570 W BONITA AVE |
CONCORD JAIL | 1350 GALINDO STREET |
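A sketch of str_replace() on an address column. The choice to abbreviate "STREET" to "ST" is an assumption for illustration, not necessarily the replacement used on the slide.

```r
library(dplyr)
library(stringr)
library(tibble)

facilities <- tibble(
  NAME = c("CLAREMONT JAIL", "CONCORD JAIL"),
  ADDRESS = c("570 W BONITA AVE", "1350 GALINDO STREET")
)

facilities_clean <- facilities %>%
  # Assumed cleaning rule: abbreviate STREET to ST
  mutate(ADDRESS = str_replace(ADDRESS, "STREET", "ST"))
```

Note that str_replace() only replaces the first match in each string; str_replace_all() replaces every match.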
str_replace() will take a variable and replace an existing string with a new string.
NAME <chr> | Creator <chr> |
---|---|
BALDWIN PARK JAIL | HostedByHIFLD |
CLAREMONT JAIL | HostedByHIFLD |
CONCORD JAIL | HostedByHIFLD |
case_when() allows us to set values when conditions are met.
NAME <chr> | SECURELVL <chr> | JUVENILE <chr> |
---|---|---|
BALDWIN PARK JAIL | NOT AVAILABLE | Not Juvenile |
CLAREMONT JAIL | NOT AVAILABLE | Not Juvenile |
CONCORD JAIL | NOT AVAILABLE | Not Juvenile |
MONROVIA JAIL | NOT AVAILABLE | Not Juvenile |
SIGNAL HILL CITY JAIL | NOT AVAILABLE | Not Juvenile |
MIRA LOMA DETENTION CENTER | Not Juvenile | Not Juvenile |
CULVER CITY JAIL | NOT AVAILABLE | Not Juvenile |
SANTA RITA JAIL | Not Juvenile | Not Juvenile |
YOLO COUNTY JUVENILE DETENTION FACILITY | Juvenile | Juvenile |
YOLO COUNTY MONROE DETENTION CENTER | Not Juvenile | Not Juvenile |
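A sketch of how a JUVENILE column like the one above might be built with case_when(). The condition and the raw SECURELVL values are assumptions, and the facility names are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical data: names and raw security levels are made up
facilities <- tibble(
  NAME = c("FACILITY A", "FACILITY B", "FACILITY C"),
  SECURELVL = c("NOT AVAILABLE", "JUVENILE", "MINIMUM")
)

facilities_flagged <- facilities %>%
  mutate(JUVENILE = case_when(
    SECURELVL == "JUVENILE" ~ "Juvenile",  # condition met
    TRUE ~ "Not Juvenile"                  # fallback for every other row
  ))
```

Conditions are checked in order, so the TRUE line at the end acts as a catch-all for rows that matched nothing earlier.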
What variables are displayed on this plot?
Date <chr> | Nrthmptn_AQI <dbl> | NYC_AQI <dbl> | Bstn_AQI <dbl> |
---|---|---|---|
03/19/2022 | 70 | 72 | 43 |
03/18/2022 | 69 | 60 | 59 |
City column on the previous slide?
Use pivot_longer() to pivot a dataset from wider to longer format.
pivot_longer() takes the following arguments:
cols = : Identify a series of columns to pivot. The names of those columns will become repeated values in one new column, and the values in those columns will be stored in another new column.
names_to = : Identify a name for the column where the column names will be stored.
values_to = : Identify a name for the column where the values associated with those names will be stored.
Date <chr> | Nrthmptn_AQI <dbl> | NYC_AQI <dbl> | Bstn_AQI <dbl> |
---|---|---|---|
03/19/2022 | 70 | 72 | 43 |
03/18/2022 | 69 | 60 | 59 |
Date <chr> | City <chr> | AQI <dbl> |
---|---|---|
03/19/2022 | Nrthmptn | 70 |
03/19/2022 | NYC | 72 |
03/19/2022 | Bstn | 43 |
03/18/2022 | Nrthmptn | 69 |
03/18/2022 | NYC | 60 |
03/18/2022 | Bstn | 59 |
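The before and after tables above can be reproduced with a sketch like the following. Stripping the "_AQI" suffix with str_replace() is an assumption about how the City values were cleaned; the slide may have done this differently.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

aqi_wide <- tibble(
  Date = c("03/19/2022", "03/18/2022"),
  Nrthmptn_AQI = c(70, 69),
  NYC_AQI = c(72, 60),
  Bstn_AQI = c(43, 59)
)

aqi_long <- aqi_wide %>%
  pivot_longer(
    cols = c(Nrthmptn_AQI, NYC_AQI, Bstn_AQI),  # columns to pivot
    names_to = "City",                          # old column names go here
    values_to = "AQI"                           # old cell values go here
  ) %>%
  # Assumed cleanup step: drop the "_AQI" suffix from the city names
  mutate(City = str_replace(City, "_AQI", ""))
```

Each date now appears once per city, so the two-row wide table becomes a six-row long table.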
Note: I use this far less often than pivot_longer().
Use pivot_wider() to pivot a dataset from longer to wider format.
pivot_wider() takes the following arguments:
names_from = : Identify the column to get the new column names from.
values_from = : Identify the column to get the cell values from.
Date <chr> | City <chr> | AQI <dbl> |
---|---|---|
03/19/2022 | Nrthmptn | 70 |
03/19/2022 | NYC | 72 |
03/19/2022 | Bstn | 43 |
03/18/2022 | Nrthmptn | 69 |
03/18/2022 | NYC | 60 |
03/18/2022 | Bstn | 59 |
City <chr> | X03.19.2022 <dbl> | X03.18.2022 <dbl> |
---|---|---|
Nrthmptn | 70 | 69 |
NYC | 72 | 60 |
Bstn | 43 | 59 |
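A sketch of the reverse pivot, using the long table shown above as input. In this sketch the new column names come out as the raw date strings, so they need backticks when referenced in later code.

```r
library(dplyr)
library(tidyr)
library(tibble)

aqi_long <- tibble(
  Date = c("03/19/2022", "03/19/2022", "03/19/2022",
           "03/18/2022", "03/18/2022", "03/18/2022"),
  City = c("Nrthmptn", "NYC", "Bstn", "Nrthmptn", "NYC", "Bstn"),
  AQI = c(70, 72, 43, 69, 60, 59)
)

aqi_wide <- aqi_long %>%
  pivot_wider(
    names_from = Date,  # one new column per distinct date
    values_from = AQI   # cells are filled with the AQI values
  )

# Non-syntactic column names are referenced with backticks
aqi_wide %>% select(City, `03/19/2022`)
```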
Use separate() to split a column into multiple columns.
separate() takes the following arguments:
col: Identify the existing column to separate.
into = c(): Identify the names of the new columns.
sep = : Identify the characters or numeric position that indicate where to separate columns.
Nrthmptn_3_19 <chr> | Nrthamtn_3_18 <chr> | Bstn_3_19 <chr> | Bstn_3_18 <chr> |
---|---|---|---|
70 | 69 | Unrecorded | 59 |
City <chr> | Date <chr> | AQI <chr> |
---|---|---|
Nrthmptn | 3_19 | 70 |
Nrthamtn | 3_18 | 69 |
Bstn | 3_19 | Unrecorded |
Bstn | 3_18 | 59 |
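A sketch of separate() on a combined city and date column. The column name City_Date is hypothetical, and the data is assumed to have already been pivoted to long format. Because the values contain two underscores, extra = "merge" is used so that only the first underscore triggers a split and the date stays together as "3_19".

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical long-format input; City_Date is an assumed column name
aqi_long <- tibble(
  City_Date = c("Nrthmptn_3_19", "Bstn_3_19", "Bstn_3_18"),
  AQI = c("70", "Unrecorded", "59")
)

aqi_split <- aqi_long %>%
  separate(
    col = City_Date,
    into = c("City", "Date"),
    sep = "_",
    extra = "merge"  # split at the first "_" only; keep the rest in Date
  )
```

Without extra = "merge", separate() would split at every underscore and drop the leftover piece with a warning, leaving only "3" in the Date column.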
How would we convert AQI on the previous slide into a numeric variable?
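One possible approach, sketched under the assumption that "Unrecorded" is the only non-numeric value in the column, combines na_if() and as.numeric():

```r
library(dplyr)
library(tibble)

aqi <- tibble(
  City = c("Nrthmptn", "Bstn", "Bstn"),
  Date = c("3_19", "3_19", "3_18"),
  AQI = c("70", "Unrecorded", "59")
)

aqi_numeric <- aqi %>%
  mutate(
    AQI = na_if(AQI, "Unrecorded"),  # "Unrecorded" -> NA
    AQI = as.numeric(AQI)            # character -> double
  )
```

Setting the placeholder to NA first avoids the coercion warning that as.numeric() raises when it meets a string it cannot parse.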