This week we’ll do some data gymnastics to refresh and review what we learned over the past few weeks through three shorter case studies.
Each member of the team should find the lab-04-review-YOUR_TEAM_NAME repo, open lab-04.Rmd, and Knit it. Make sure it compiles without errors. The output will be a markdown (.md) file with the same name.
Before we introduce the data, let’s warm up with some simple exercises. Update the YAML of your R Markdown file with your team information, knit, commit, and push your changes. Make sure to commit with a meaningful commit message. Then, go to your repo on GitHub and confirm that your changes are visible in your Rmd and md files. If anything is missing, commit and push again.
We’ll use the tidyverse package for much of the data wrangling and visualisation and data from the mosaicData and dsbox packages. You can load them by running the following in your Console:
library(tidyverse)
library(mosaicData) # You may have to install this package
library(dsbox) # This should be installed from Homework 2
This code also appears at the beginning of your lab-04.Rmd file.
Take turns answering the exercises. Make sure each team member gets to commit to the repo by the time you submit your work. And make sure that the person taking the lead for an exercise is sharing their screen.
Intro stat review: Decision tree for determining the appropriate type of plot
You may want to get your Data transformation (dplyr) and Data visualization (ggplot2) R cheat sheets handy!
A study conducted in Whickham, England recorded participants’ age, smoking status at baseline, and then 20 years later recorded their health outcome. In this part of the lab, we analyse the relationships between these variables, first two at a time, and then controlling for the third.
The dataset we’ll use is called Whickham, from the mosaicData package. You can find out more about the dataset by inspecting its documentation, which you can access by running ?Whickham in the Console or using the Help menu in RStudio to search for Whickham.
Intro stat review: Study design
What type of study do you think these data come from: observational or experiment? Why?
How many observations are in this dataset? Instead of hard coding the number in your answer, use inline code. What does each observation represent?
How many variables are in this dataset? What type of variable is each (quantitative or categorical)? Display each variable using an appropriate visualization.
What would you expect the relationship between smoking status and health outcome to be?
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
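As a small illustration of the inline-code idea in the exercises above, the sketch below counts the rows and columns of a made-up data frame. The object toy_df and its columns are hypothetical and are not part of the lab data; the same functions apply to any data frame.

```r
library(dplyr)

# Hypothetical toy data, standing in for a real dataset
toy_df <- tibble(
  id    = 1:5,
  group = c("a", "b", "a", "b", "a"),
  score = c(3.2, 4.1, 2.8, 5.0, 3.9)
)

n_obs  <- nrow(toy_df)  # number of observations (rows)
n_vars <- ncol(toy_df)  # number of variables (columns)

# glimpse() lists each variable together with its type
glimpse(toy_df)
```

In the text of an R Markdown file, writing `r nrow(toy_df)` (a backtick, the letter r, the expression, and a closing backtick) inserts the computed value when the document is knitted, so the number never has to be typed by hand.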
Categorical variables can be stored in a data frame, like Whickham, but can also be stored in a contingency table, or table of frequencies. The geom_bar() function makes the height of the bar proportional to the number of cases in each group (appropriate when your data is stored in a data frame with one case per row); geom_col() uses the values in the data as the heights of the bars (appropriate when your data is stored as a contingency table).
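To make the difference between the two geoms concrete, here is a minimal sketch on made-up data. The objects cases and freq and the column status are hypothetical; the example only assumes the dplyr and ggplot2 packages, which are loaded with the tidyverse.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data with one case per row
cases <- tibble(
  status = c("yes", "no", "no", "yes", "no", "no")
)

# geom_bar() counts the rows in each group for you
p_bar <- ggplot(cases, aes(x = status)) +
  geom_bar()

# The same information stored as a table of frequencies
freq <- cases %>%
  count(status)

# geom_col() takes the bar heights from a column (here, n)
p_col <- ggplot(freq, aes(x = status, y = n)) +
  geom_col()
```

Printing p_bar and p_col (for example, by typing their names) should draw two plots with the same bar heights, because both describe the same counts.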
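Create a visualization depicting the relationship between smoking status and health outcome. In your answer, don’t forget to label your R chunk as well (where it says label-me-1). Your label should be short, informative, shouldn’t include spaces, and shouldn’t repeat a previous label.
Hint: After grouping on the appropriate variable using group_by(), you can calculate the conditional proportions in each group with mutate(prop = n / sum(n)).
Here is some code to get you started on the contingency table:
Whickham %>%
  count(smoker, outcome)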
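The hint about group_by() and mutate(prop = n / sum(n)) can be sketched on a toy data frame as follows. The data and the column names group and result are hypothetical; which variable to group on in the lab is left to you.

```r
library(dplyr)

# Hypothetical toy data: two categorical variables, one case per row
toy <- tibble(
  group  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  result = c("pass", "pass", "pass", "fail", "pass", "fail", "fail", "fail")
)

toy_props <- toy %>%
  count(group, result) %>%       # frequency of each combination
  group_by(group) %>%            # condition on group
  mutate(prop = n / sum(n)) %>%  # proportions add up to 1 within each group
  ungroup()

toy_props
```

Because the proportions are computed after group_by(group), each one is conditional on the group: it answers "of the cases in this group, what share had this result?"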
Create a new variable called age_cat and add it to the Whickham dataset, using the following scheme:
age <= 44 ~ "18-44"
age > 44 & age <= 64 ~ "45-64"
age > 64 ~ "65+"
Hint: Use the case_when() function inside the mutate() function, then assign the result to Whickham.
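As a sketch of how case_when() works inside mutate(), the example below applies the scheme above to a handful of made-up ages. The data frame toy_ages is hypothetical and is used here only so that the boundary values (44, 45, 64, 65) are easy to check.

```r
library(dplyr)

# Hypothetical ages, not taken from Whickham
toy_ages <- tibble(age = c(18, 44, 45, 64, 65, 80))

toy_ages <- toy_ages %>%
  mutate(
    age_cat = case_when(
      age <= 44            ~ "18-44",
      age > 44 & age <= 64 ~ "45-64",
      age > 64             ~ "65+"
    )
  )

toy_ages
```

case_when() checks the conditions in order and uses the first one that is TRUE for each row. Assigning the result back to the same name is what makes the new column persist.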
Re-create the visualization depicting the relationship between smoking status and health outcome, faceted by age_cat. In your answer, don’t forget to label your R chunk as well (where it says label-me-2).
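Faceting splits one plot into a panel per level of a third variable. The sketch below shows one possible approach on made-up data; the data frame toy and the columns group, result, and level are hypothetical, and a filled bar chart is only one of several reasonable plot types.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with three categorical variables
toy <- tibble(
  group  = c("a", "a", "b", "b", "a", "b", "a", "b"),
  result = c("pass", "fail", "pass", "fail", "pass", "pass", "fail", "fail"),
  level  = c("low", "low", "low", "low", "high", "high", "high", "high")
)

p_facet <- ggplot(toy, aes(x = group, fill = result)) +
  geom_bar(position = "fill") +  # bars show proportions within each group
  facet_wrap(~ level) +          # one panel per value of level
  labs(y = "Proportion")
```

Printing p_facet draws the panels side by side, which makes it easier to compare the relationship between the first two variables within each level of the third.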
What changed in the relationship between smoking status and health outcome when we looked at the relationship within a specific age category? What might explain this change? Extend the contingency table from earlier by breaking it down by age category and use it to help your narrative. Here is some code to get you started:
Whickham %>%
  count(smoker, age_cat, outcome)
In this part we’ll look at traffic accidents in Edinburgh. The data are made available online by the UK Government. They cover all recorded accidents in Edinburgh in 2018, and some of the variables were modified for the purposes of this assignment.
The data can be found in the dsbox package, and it’s called accidents. You can find out more about the dataset by inspecting its documentation, which you can access by running ?accidents in the Console or using the Help menu in RStudio to search for accidents.
Run View(accidents) in your Console to view the data in the data viewer. (Do not include this code in your lab-04.Rmd file!) What does each row in the dataset represent?
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
In this part, we’ll practice our data wrangling skills using (simulated) data from Lego sales in 2018 for a sample of customers who bought Legos in the US.
The data can be found in the dsbox package, and it’s called lego_sales. You can find out more about the dataset by inspecting its documentation, which you can access by running ?lego_sales in the Console or using the Help menu in RStudio to search for lego_sales.
Answer the following exercises using pipelines. For each question, include the code and output used, and state your answer in a sentence, e.g. “In this sample, the three most common first names of purchasers are …”. Note that the answers to all questions are within the context of this particular sample of sales, i.e., you shouldn’t make inferences about the population of all Lego sales based on this sample.
Hint: Look at the examples at the bottom of the lego_sales help file.
What are the three most common first names of purchasers?
How many distinct themes are there in the dataset?
What are the three most common themes of Lego sets purchased?
Among the most common theme of Lego sets purchased, what is the most common subtheme?
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
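Questions of the "most common" and "how many distinct" kind usually come down to a short pipeline. The sketch below shows the pattern on made-up data; toy_sales and its columns buyer and category are hypothetical and are not the column names of lego_sales.

```r
library(dplyr)

# Hypothetical toy purchases, one row per purchase
toy_sales <- tibble(
  buyer    = c("Ana", "Ben", "Ana", "Cy", "Ben", "Ana"),
  category = c("cars", "city", "cars", "space", "cars", "city")
)

# Most common values: count, sort from most to least frequent, keep the top rows
top_buyers <- toy_sales %>%
  count(buyer, sort = TRUE) %>%
  slice_head(n = 2)

# Number of distinct values of a variable
n_categories <- toy_sales %>%
  distinct(category) %>%
  nrow()
```

Note that slice_head() keeps a fixed number of rows and does not account for ties, so it is worth looking at the full sorted counts before reporting a "top three".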
Hint: You will need to consider the quantity of purchases as well as the price of Lego sets.
Which Lego theme has made the most money for Lego?
Come up with a question you want to answer using these data, and write it down. Then, create a data visualization that answers the question, and explain how your visualization answers the question.
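The hint about combining quantity and price can be sketched as follows: compute the money made on each row first, then add it up per group. The data frame toy_sales and the columns category, price, and quantity are hypothetical; check the lego_sales help file for the actual column names.

```r
library(dplyr)

# Hypothetical toy purchases with a unit price and a quantity
toy_sales <- tibble(
  category = c("cars", "city", "cars", "space"),
  price    = c(10, 25, 15, 40),
  quantity = c(3, 1, 2, 1)
)

revenue_by_category <- toy_sales %>%
  mutate(revenue = price * quantity) %>%        # money made on each row
  group_by(category) %>%
  summarise(total_revenue = sum(revenue)) %>%   # total per category
  arrange(desc(total_revenue))                  # largest total first

revenue_by_category
```

Summing price alone would ignore purchases of more than one item, which is why the multiplication happens before the grouping.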
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards. Then review the md document on GitHub to make sure you’re happy with the final state of your work.