Labs

Tip

These labs were written and rendered correctly in Fall 2022. They pull in public data that might have changed in content and format since then. I can’t guarantee the labs will render in perpetuity, but if you run into an issue, please feel free to record it via this GitHub repo.

Problem Solving

This lab will introduce you to resources and techniques for problem solving in R. You should reference this lab often throughout the semester for reminders on best practices for addressing errors and getting help.

Lab 1: Understanding Datasets

This lab is all about learning to understand the context and parts of a dataset by referencing and interpreting data dictionaries and technical data documentation. We will get to know the U.S. Department of Education’s College Scorecard data, which includes over 3000 variables characterizing colleges in the U.S.

Lab 2: Visualization Aesthetics

This week we will practice mapping variables in the U.S. National Bridge Inventory onto different plot aesthetics in order to tell different stories with the data. We’re going to look at what kinds of variables might contribute to poor bridge conditions, where there are poor bridge conditions, and which entities are responsible for maintaining them. We will only be creating one type of plot today - a scatterplot. However, we are going to show how we can use different visual cues to plot a number of different variables onto a scatterplot.

Lab 3: Plotting Freqencies

In this lab, you will practice plotting both frequencies and distributions by analyzing data about Spotify playlists representing your favorite music genre. Specifically, we are going to produce visualizatations that allow us to consider: How joyful are popular Spotify playlists in my favorite music genre?

Lab 4: GitHub

This lab is designed to help you get acquainted with the concepts behind Git and GitHub, suggested workflows for collaborating on projects in this course, and error resolution strategies.

Lab 4 Instructions

Lab 5: Data Wrangling

In this lab, you will apply 6 data wrangling verbs in order to analyze data regarding NYPD stop, question, and frisk. Specifically, we will replicate data analysis performed by the NYCLU in 2011 to demonstrate how the practice was being carried out unconstitutionally in New York.

Lab 6: Joining Datasets (broken due to API change as of 02-22-24)

In terms of data analysis, this lab has one goal: to determine the number of industrial facilities that are currently in violation of both the Clean Air Act and the Clean Water Act in California. To achieve this goal, we’re going to have to do some data wrangling and join together some datasets published by the EPA. We’re going to practice applying different types of joins to this data and consider what we learn with each.

Lab 7: Tidying Datasets

In this lab, we will create a few data visualizations documenting point-in-time counts of homelessness in the United States. Specifically, we are going visualize data collected in 2020 through various Continuums of Care (CoCs) programs. In order to produce these data visualizations, you will need to join homelessness data with census population data and develop and execute a plan for how to wrangle the dataset into a “tidy” format.

Lab 8: Programming with Data

In this lab, we will program some custom R functions that allow us to analyze data related to medical conflicts of interest. Specifically, we will determine which ten Massachusetts-based doctors received the most money from pharmaceutical or medical device manufacturers in 2021. Then we will leverage our custom functions to produce a number of tables and plots documenting information about the payments made to each of these doctors. In doing so, we will update a similar analysis produced by ProPublica in 2018 called Dollars for Docs.

Lab 9: Mapping with Leaflet

In this lab, we will build a map that visualizes unaddressed housing violations in NYC in order to identify some of the city’s worst landlords. In doing so, we will gain practice in producing point maps in Leaflet. This analysis is based off of a similar analysis conducted by the NYC Public Advocate’s Office.

Lab 10: Working with APIs

In this lab, we will write queries to access subsets of a very large dataset on the NYC Open Data Portal. We will practice all of the standards we have learned in the course so far in visualizing and wrangling the resulting data.