This week we’ll do some data gymnastics to refresh and review what we learned over the past few weeks through three shorter case studies.

Learning goals

Discovering Simpson’s paradox via visualizations
More practice data wrangling and visualizing

Getting started

Each member of the team should:

Go to the course GitHub organization (or your team page) and locate your lab repo, which should be named lab-04-review-YOUR_TEAM_NAME.
Grab the URL of the repo, and clone it in RStudio by creating a new project from Version Control with Git.
Open the R Markdown document lab-04.Rmd and Knit it. Make sure it compiles without errors. The output will be in the Markdown (.md) file with the same name.

Warm up

Before we introduce the data, let’s warm up with some simple exercises. Update the YAML of your R Markdown file with your team information, knit, commit, and push your changes. Make sure to commit with a meaningful commit message. Then, go to your repo on GitHub and confirm that your changes are visible in your Rmd and md files. If anything is missing, commit and push again.

Packages

We’ll use the tidyverse package for much of the data wrangling and visualisation and data from the mosaicData and dsbox packages. You can load them by running the following in your Console:

library(tidyverse) 
library(mosaicData)  # You may have to install this package
library(dsbox)  # This should be installed from Homework 2

This code also appears at the beginning of your lab-04.Rmd file.

Reminders

Take turns answering the exercises. Make sure each team member gets to commit to the repo by the time you submit your work. And make sure that the person taking the lead for an exercise is sharing their screen.

Intro stat review: Decision tree for determining the appropriate type of plot

You may want to get your Data transformation (dplyr) and Data visualization (ggplot2) R cheat sheets handy!

Part I: Smokers in Whickham

A study conducted in Whickham, England recorded participants’ age, smoking status at baseline, and then 20 years later recorded their health outcome. In this part of the lab, we analyse the relationships between these variables, first two at a time, and then controlling for the third.

Data

The dataset we’ll use is called Whickham from the mosaicData package. You can find out more about the dataset by inspecting its documentation, which you can access by running ?Whickham in the Console or using the Help menu in RStudio to search for Whickham.

Exercises

Intro stat review: Study design

What type of study do you think these data come from: observational or experiment? Why?
How many observations are in this dataset? Instead of hard coding the number in your answer, use inline code. What does each observation represent?
How many variables are in this dataset? What type of variable is each (quantitative or categorical)? Display each variable using an appropriate visualization.
What would you expect the relationship between smoking status and health outcome to be?
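
For the questions about observations and variables, here is a minimal sketch on a made-up data frame (toy and its columns are hypothetical and are not the lab data). It shows base R functions that report the dimensions and variable types of a data frame. In the text of an R Markdown document, an expression such as nrow(toy) can be embedded as inline code, written as `` `r nrow(toy)` ``, so the number updates if the data change.

```r
# Made-up data for illustration only; not the Whickham data
toy <- data.frame(
  id    = 1:5,
  group = c("a", "b", "a", "b", "a"),
  score = c(2.5, 3.1, 4.0, 1.8, 3.3)
)

n_obs  <- nrow(toy)  # number of observations (rows)
n_vars <- ncol(toy)  # number of variables (columns)

str(toy)             # prints each variable's name and storage type
```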

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Categorical variables can be stored in a data frame, like Whickham, but can also be stored in a contingency table, or table of frequencies. The geom_bar() function makes the height of the bar proportional to the number of cases in each group (appropriate when your data is stored in a data frame with one case per row); geom_col() uses the values in the data as the heights of the bars (appropriate when your data is stored as a contingency table).
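
To make the difference between the two geoms concrete, here is a small sketch on made-up data (the data frames cases and freq and the column colour are hypothetical). The same information is plotted twice: once from one-row-per-case data with geom_bar(), and once from a table of frequencies with geom_col().

```r
library(dplyr)
library(ggplot2)

# Made-up data, one case per row
cases <- data.frame(colour = c("red", "blue", "red", "green", "red", "blue"))

# geom_bar() counts the rows in each group itself
p_bar <- ggplot(cases, aes(x = colour)) +
  geom_bar()

# The same information stored as a table of frequencies
freq <- cases %>%
  count(colour)

# geom_col() takes the bar heights from the n column
p_col <- ggplot(freq, aes(x = colour, y = n)) +
  geom_col()

p_bar
p_col
```

Both plots should show the same bar heights; only the shape of the input data differs.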

Create a visualization depicting the relationship between smoking status and health outcome. In your answer, don’t forget to label your R chunk as well (where it says label-me-1). Your label should be short, informative, shouldn’t include spaces, and shouldn’t repeat a previous label.

Hint: After grouping on the appropriate variable using group_by(), you can calculate the conditional proportions in each group with mutate(prop = n / sum(n)).

Briefly describe the relationship between smoking status and health outcome displayed in your plot from the previous exercise, and evaluate whether this meets your expectations. Additionally, calculate the relevant conditional probabilities to help your narrative. Here is some code to get you started:

Whickham %>%
  count(smoker, outcome)
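
The hint about group_by() and mutate() can be sketched on made-up data as follows. The data frame toy and the columns group and result are hypothetical; which variable to condition on in the lab is left to you.

```r
library(dplyr)

# Made-up data, one case per row
toy <- data.frame(
  group  = c("a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
  result = c("yes", "yes", "yes", "no", "yes", "no", "no", "no", "no", "no")
)

props <- toy %>%
  count(group, result) %>%       # one row per group/result combination
  group_by(group) %>%            # condition on group
  mutate(prop = n / sum(n)) %>%  # proportions sum to 1 within each group
  ungroup()

props
```

Grouping on a different variable changes which conditional proportions you get, so it is worth checking that the proportions sum to 1 within the groups you intended.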

Hint: Use the case_when() function inside the mutate() function, then assign the result to Whickham.

Create a new variable called age_cat using the following scheme:

age <= 44 ~ "18-44"
age > 44 & age <= 64 ~ "45-64"
age > 64 ~ "65+"

and add this variable to the Whickham dataset.
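
As a sketch of the general pattern, here is case_when() inside mutate() on a made-up data frame, with the result assigned back to the same name. The data frame toy, the column temp_c, and the category labels are all hypothetical; the lab's own scheme is the one listed above.

```r
library(dplyr)

# Made-up data for illustration only
toy <- data.frame(temp_c = c(-3, 10, 11, 20, 27))

# Conditions are checked in order; the first TRUE one sets the label
toy <- toy %>%
  mutate(
    temp_cat = case_when(
      temp_c <= 10               ~ "cold",
      temp_c > 10 & temp_c <= 20 ~ "mild",
      temp_c > 20                ~ "warm"
    )
  )

toy
```

Without the assignment (toy <- ...), the new column is printed but not saved.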

Re-create the visualization depicting the relationship between smoking status and health outcome, faceted by age_cat. In your answer, don’t forget to label your R chunk as well (where it says label-me-2).
What changed in the relationship between smoking status and health outcome when we looked at the relationship within a specific age category? What might explain this change? Extend the contingency table from earlier by breaking it down by age category and use it to help your narrative. Here is some code to get you started:

Whickham %>%
  count(smoker, age_cat, outcome)
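
The following sketch uses invented counts (not the Whickham data) to show how a relationship can look different overall and within groups, and how conditioning on two variables and faceting reveal this. All names here (counts, cases, stratum, arm, result) are hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Invented counts for illustration only
counts <- data.frame(
  stratum = rep(c("s1", "s2"), each = 4),
  arm     = rep(c("a", "a", "b", "b"), times = 2),
  result  = rep(c("success", "failure"), times = 4),
  n_cases = c(9, 1, 80, 20, 30, 70, 2, 8)
)

# Expand to one row per case, the same shape as a data frame of individuals
cases <- uncount(counts, n_cases)

# Proportions conditional on arm only
overall <- cases %>%
  count(arm, result) %>%
  group_by(arm) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Proportions conditional on arm within each stratum
by_stratum <- cases %>%
  count(stratum, arm, result) %>%
  group_by(stratum, arm) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# One panel per stratum; position = "fill" shows proportions, not counts
p <- ggplot(cases, aes(x = arm, fill = result)) +
  geom_bar(position = "fill") +
  facet_wrap(~ stratum) +
  labs(y = "Proportion")

overall
by_stratum
p
```

In these invented numbers, arm b has the higher success proportion overall, yet arm a has the higher success proportion within each stratum, because the two arms are spread very unevenly across the strata.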

Part II: Road traffic accidents

Photo by Clark Van Der Beken on Unsplash

In this part we’ll look at traffic accidents in Edinburgh. The data are made available online by the UK Government. They cover all recorded accidents in Edinburgh in 2018, and some of the variables were modified for the purposes of this assignment.

Data

The data can be found in the dsbox package, and it’s called accidents. You can find out more about the dataset by inspecting its documentation, which you can access by running ?accidents in the Console or using the Help menu in RStudio to search for accidents.

Exercises

Run View(accidents) in your Console to view the data in the data viewer. (Do not include this code in your lab-04.Rmd file!) What does each row in the dataset represent?

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Recreate the following plot. Write a few sentences describing the plot in context of the data.

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Create another data visualization based on these data and interpret it. You can choose any variables and any type of visualization you like, but it must have at least three variables, e.g. a scatterplot of x vs. y isn’t enough, but if points are colored by z, that’s fine.
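
As an illustration of what counts as three variables, here is a sketch that uses the mpg data that ships with ggplot2 rather than accidents. The x and y positions carry two variables and the point colour carries a third.

```r
library(ggplot2)

# mpg is a dataset included with ggplot2; it stands in for the lab data here
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  labs(
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    colour = "Type of car"
  )

p
```

Other ways to bring in a third variable include fill, shape, size, or faceting; which one works best depends on the types of the variables you choose.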

Part III: Legos

Photo by Daniel Cheung on Unsplash

In this part, we’ll practice our data wrangling skills using (simulated) data from Lego sales in 2018 for a sample of customers who bought Legos in the US.

Data

The data can be found in the dsbox package, and it’s called lego_sales. You can find out more about the dataset by inspecting its documentation, which you can access by running ?lego_sales in the Console or using the Help menu in RStudio to search for lego_sales.

Exercises

Answer the following exercises using pipelines. For each question, include the code and output used, and state your answer in a sentence, e.g. “In this sample, the three most common first names of purchasers are …”. Note that the answers to all questions are within the context of this particular sample of sales, i.e., you shouldn’t make inferences about the population of all Lego sales based on this sample.

Hint: Look at the examples at the bottom of the lego_sales help file.

What are the three most common first names of purchasers?
How many distinct themes are there in the dataset?
What are the three most common themes of Lego sets purchased?
Among the most common theme of Lego sets purchased, what is the most common subtheme?
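
Questions of this kind ("most common", "how many distinct", "most common within a subset") tend to follow a few pipeline patterns. Here is a sketch on a made-up data frame; orders and its columns customer, category, and subtype are hypothetical and do not match the lego_sales column names.

```r
library(dplyr)

# Made-up data for illustration only
orders <- data.frame(
  customer = c("Ana", "Ben", "Ana", "Cy", "Ana", "Ben", "Dee"),
  category = c("x", "y", "x", "z", "y", "x", "x"),
  subtype  = c("x1", "y1", "x2", "z1", "y1", "x1", "x1")
)

# Most common values: count, sort in descending order, keep the top rows
top_customers <- orders %>%
  count(customer, sort = TRUE) %>%
  slice_head(n = 2)

# Number of distinct values of a variable
n_categories <- orders %>%
  summarise(n = n_distinct(category)) %>%
  pull(n)

# Most common value within a subset: filter first, then count
top_subtype <- orders %>%
  filter(category == "x") %>%
  count(subtype, sort = TRUE) %>%
  slice_head(n = 1)

top_customers
n_categories
top_subtype
```

If several values tie on count, slice_head() keeps whichever rows happen to come first, so it is worth looking at the full sorted table before reporting a top three.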

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Hint: You will need to consider quantity of purchases as well as price of Lego sets.

Which Lego theme has made the most money for Lego?
Come up with a question you want to answer using these data, and write it down. Then, create a data visualization that answers the question, and explain how your visualization answers the question.
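
The hint about combining quantity and price can be sketched on made-up data as follows. The data frame sales and the columns category, unit_price, and units are hypothetical names chosen for illustration.

```r
library(dplyr)

# Made-up data for illustration only
sales <- data.frame(
  category   = c("x", "x", "y", "y", "z"),
  unit_price = c(10, 20, 50, 5, 30),
  units      = c(3, 1, 1, 2, 3)
)

revenue <- sales %>%
  mutate(amount = unit_price * units) %>%  # money taken on each row
  group_by(category) %>%
  summarise(total = sum(amount)) %>%       # total per category
  arrange(desc(total))                     # largest total first

revenue
```

In this toy data, category y has the largest summed unit price, but category z has the largest total once the number of units is taken into account, which is why both columns matter.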

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards. Then review the md document on GitHub to make sure you’re happy with the final state of your work.

Lab 04 - Refresh and Review
