Boxplots

SDS 192: Introduction to Data Science

Lindsay Poirier
Statistical & Data Sciences, Smith College

Fall 2022

For Today

  • Measures of Central Tendency and Dispersion
  • Boxplots
  • Project 1 Assigned

A measure of central tendency is a single numeric quantity describing data by identifying a central position.

Let’s create the following data frame to motivate today’s lecture.

library(tidyverse)
counties <- read_csv("https://raw.githubusercontent.com/sds-192-intro-fall22/sds-192-public-website-quarto/a8b64e3070ca2543b904d4d92780b09e6062ced6/website/data/nbi_counties.csv")
route_prefixes <- read_csv("https://raw.githubusercontent.com/sds-192-intro-fall22/sds-192-public-website-quarto/a8b64e3070ca2543b904d4d92780b09e6062ced6/website/data/nbi_route_pre.csv")
maintenance <- read_csv("https://raw.githubusercontent.com/sds-192-intro-fall22/sds-192-public-website-quarto/a8b64e3070ca2543b904d4d92780b09e6062ced6/website/data/nbi_maintenance.csv")
kinds <- read_csv("https://raw.githubusercontent.com/sds-192-intro-fall22/sds-192-public-website-quarto/a8b64e3070ca2543b904d4d92780b09e6062ced6/website/data/nbi_kind.csv")

nbi_ma <- read.delim("https://www.fhwa.dot.gov/bridge/nbi/2022/delimited/MA22.txt", sep = ",") |>
  left_join(counties) |>
  left_join(route_prefixes) |>
  left_join(maintenance) |>
  left_join(kinds) |>
  filter(SERVICE_ON_042A == 1) |>
  select(STRUCTURE_NUMBER_008, COUNTY_CODE_003_L, ROUTE_PREFIX_005B_L, MAINTENANCE_021_L, YEAR_BUILT_027, ADT_029, STRUCTURE_KIND_043A_L, STRUCTURAL_EVAL_067, BRIDGE_IMP_COST_094) |>
  mutate(STRUCTURE_KIND_043A_L = 
           case_when(
             STRUCTURE_KIND_043A_L == "Concrete continuous" ~ "Concrete",
             STRUCTURE_KIND_043A_L == "Steel continuous" ~ "Steel",
             STRUCTURE_KIND_043A_L == "Prestressed concrete continuous" ~ "Prestressed concrete",
             TRUE ~ STRUCTURE_KIND_043A_L)) |>
  mutate(BRIDGE_IMP_COST_094 = BRIDGE_IMP_COST_094 * 1000)

nbi_hampshire <- nbi_ma |> filter(COUNTY_CODE_003_L == "Hampshire")

rm(counties, kinds, maintenance, route_prefixes)

Mean

  • Sum of values divided by number of values summed
  • Takes every value into consideration
  • Model of entire dataset
  • Heavily influenced by outliers

Median

  • Middle value(s) of the dataset when all values are lined from smallest to largest
  • Does not model entire dataset
  • Limited influence from outliers

Learning Check: How many variables from the dataset are represented on the previous plot?

Normal Distributions

  • More values huddle around some center line and taper off as we move away from center
  • Histogram is symmetrical with a perfectly normal distribution
  • Median and mean should be about the same; mean is a good measure of central tendency

Skew

  • Histogram is non-symmetrical when there is skew
  • Long trail to the right of center indicates a right skew
  • Median becomes more representative measure of central tendency than mean

Summarizing Data

  • Measures of central tendency summarize swaths of information into single value
  • Can be reductionist
    • Example: Measures of central tendency related to wealth in the US only tell us about those in the middle
    • Hide the experiences of the most impoverished communities.
  • Degree of spread or dispersion is just as important as center

Range

  • Maximum value minus the minimum value
  • Evaluates the spread of the entire dataset

Interquartile Range

  • 1st quartile is middle value between minimum and median
  • 3rd quartile is middle value between median and maximum
  • IQR is the difference between the 1st and 3rd quartile
  • Represents the middle 50% of values

Boxplot

Grouped Boxplots

ggplot(nbi_hampshire, aes(x = ADT_029, y = ROUTE_PREFIX_005B_L)) +
  geom_boxplot() +
  labs(title = "Distribution in the Average Daily Traffic of Hampshire County, MA Bridges, 2021", 
       x = "Average Daily Traffic",
       y = "Route Prefix") +
  theme_minimal()

Interpreting Boxplots Step 1: Check for Outliers

How many are there? What do they indicate? Do you assume they are errors in teh data? Or do they represent extremes that are important for us to take into consideration?

Interpreting Boxplots Step 2: Compare Medians

Do the medians line up? If not, in which groups are the medians higher and in which are they lower?

Interpreting Boxplots Step 3: Compare Range

Do certain groups have a wider range of values represented than others? In other words, are the values more distributed for certain groups than for others? This might indicate a greater degree of disparity in some groups than others.

Interpreting Boxplots Step 4: Compare IQR

In which groups do the middle 50% of values tend to huddle around a central value? In which are they more spread out from the center?

Interpreting Boxplots Step 5: Compare Symmetry

Does the median appear to be in the center of the range and IQR? Is the median closer to the minimum – or the bottom whisker? Or the top whisker?