In January 2017, Buzzfeed published an article on why Nobel laureates show immigration is so important for American science. You can read the article here. In the article they show that while most living Nobel laureates in the sciences are based in the US, many of them were born in other countries. This is one reason why scientific leaders say that immigration is vital for progress. In this lab we will work with the data from this article to recreate some of their visualizations as well as explore new questions.
You have three tasks you should complete before the lab:
This is the first week you’re working in teams, which means all of you make changes and push those changes to your team repository. Sometimes things will go swimmingly, and sometimes you’ll run into merge conflicts. So our first task today is to walk you through a merge conflict!
Git will put conflict markers in your code that look like:
<<<<<<< HEAD
See also: [dplyr documentation](https://dplyr.tidyverse.org/)
=======
See also [ggplot2 documentation](https://ggplot2.tidyverse.org/)
>>>>>>> some1alpha2numeric3string4
The ======= line separates your changes (top) from their changes (bottom).
Note that on top you see the word HEAD, which indicates that these are your changes.
And at the bottom you see some1alpha2numeric3string4 (well, it probably looks more like 28e7b2ceb39972085a0860892062810fb812a08f).
This is the hash (a unique identifier) of the commit your collaborator made with the conflicting change.
Your job is to reconcile the changes: edit the file so that it incorporates the best of both versions and delete the <<<<<<<, =======, and >>>>>>> lines. Then you can stage and commit the result.
This is the first week you’re working in teams. When you sign in to GitHub, you will see your team listed in the bottom left under “Your teams”. Click on your team.
Discussions: Post discussions directly with your team. Use @GITHUB_USERNAME to mention specific team members.
Members: See the other members of your team.
Repositories: View all repositories on which your team is collaborating. You will have a team repository for each lab that each member of the team has access to. You can all push to this repository.
Each member of the team should clone the team repository, lab-03-nobel-laureates-YOUR_TEAM_NAME, and open lab-03.Rmd.
Our goal is to see two different types of merges: first we’ll see a type of merge that git can’t figure out how to do on its own (a merge conflict) and that requires human intervention, then another type of merge that git can figure out how to do without human intervention.
Doing this will require some tight choreography, so pay attention!
Take turns in completing the exercise, only one member at a time. Others should just watch, not doing anything on their own projects (this includes not even pulling changes!) until they are instructed to. If you feel like you won’t be able to resist the urge to touch your computer when it’s not your turn, we recommend putting your hands in your pockets or sitting on them!
Before starting: everyone should have the repo cloned and know which role number(s) they are.
Role 1:
🛑 Make sure the previous role has finished before moving on to the next step.
Role 2:
🛑 Make sure the previous role has finished before moving on to the next step.
Role 3:
🛑 Make sure the previous role has finished before moving on to the next step.
Role 4:
🛑 Make sure the previous role has finished before moving on to the next step.
Everyone: Pull, and observe the changes in your document.
Open the R Markdown document lab-03.Rmd (if it’s not already open) and Knit it. Make sure it compiles without errors. The output will be in the Markdown (.md) file with the same name.
Before we introduce the data, let’s warm up with some simple exercises.
Have one team member:
Now, all other team members need to pull before making any new changes!
We’ll use the tidyverse package for much of the data wrangling and visualization. You can load it by running the following in your Console:
library(tidyverse)
This code also appears at the beginning of your lab-03.Rmd file.
The dataset for this assignment can be found as a CSV (comma separated values) file in the data folder of your repository. You can read it in using the following.
nobel <- read_csv("data/nobel.csv")
This is also done for you in the lab-03.Rmd file.
The variable descriptions are as follows:
id: ID number
firstname: First name of laureate
surname: Surname
year: Year prize won
category: Category of prize
affiliation: Affiliation of laureate
city: City of laureate in prize year
country: Country of laureate in prize year
born_date: Birth date of laureate
died_date: Death date of laureate
gender: Gender of laureate
born_city: City where laureate was born
born_country: Country where laureate was born
born_country_code: Code of country where laureate was born
died_city: City where laureate died
died_country: Country where laureate died
died_country_code: Code of country where laureate died
overall_motivation: Overall motivation for recognition
share: Number of other winners award is shared with
motivation: Motivation for recognition
In a few cases the name of the city/country changed after the prize was given (e.g. in 1975 Bosnia and Herzegovina was called the Socialist Federative Republic of Yugoslavia). In these cases the variables below reflect a different name than their counterparts without the suffix `_original`.
born_country_original: Original country where laureate was born
born_city_original: Original city where laureate was born
died_country_original: Original country where laureate died
died_city_original: Original city where laureate died
city_original: Original city where laureate lived at the time of winning the award
country_original: Original country where laureate lived at the time of winning the award
Take turns answering the exercises. Make sure each team member gets to commit to the repo by the time you submit your work. And make sure that the person taking the lead for an exercise is sharing their screen. You don’t have to switch at each exercise; you can find a cadence that works for your team and stick to it.
You may want to get your Data transformation (dplyr) and Data visualization (ggplot2) R cheat sheets handy!
There are some observations in this dataset that we will exclude from our analysis to match the Buzzfeed results.
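Create a new data frame called nobel_living that filters for laureates for whom country is available, laureates who are people as opposed to organizations (organizations are denoted with "org" as their gender), and laureates who are still alive (their died_date is NA).
Confirm that once you have filtered for these characteristics you are left with a data frame with 228 observations, once again using inline code.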
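The three filtering conditions can be tried out on a small made-up table before applying them to the real data. The sketch below is only an illustration: the `toy` data frame and its four rows are hypothetical, not taken from nobel.csv, and only the column names match the dataset.

```r
library(tidyverse)

# Hypothetical toy rows, NOT the real nobel data
toy <- tibble(
  firstname = c("A", "B", "C", "D"),
  country   = c("USA", NA, "France", "Japan"),
  gender    = c("female", "male", "org", "male"),
  died_date = as.Date(c(NA, NA, NA, "2001-05-01"))
)

toy_living <- toy %>%
  filter(
    !is.na(country),  # country is available
    gender != "org",  # people, not organizations
    is.na(died_date)  # still alive
  )

# Number of rows left after filtering
nrow(toy_living)
```

In an R Markdown document, the same count can be reported in the text with inline code such as `r nrow(toy_living)`, which is replaced by the computed value when the document is knit.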
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
… says the Buzzfeed article. Let’s see if that’s true.
First, we’ll create a new variable to identify whether the laureate was in the US when they won their prize. We’ll use the mutate() function for this. The following pipeline mutates the nobel_living data frame by adding a new variable called country_us. We use an if statement to create this variable. The first argument in the if_else() function we’re using to write this if statement is the condition we’re testing for. If country is equal to "USA", we set country_us to "USA". If not, we set country_us to "Other". An alternative way to achieve a similar result (as a factor rather than a character vector) is to use the fct_other() function to create the new variable: country_us = fct_other(country, "USA").
nobel_living <- nobel_living %>%
  mutate(
    country_us = if_else(country == "USA", "USA", "Other")
  )
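To see how the two approaches compare, here is a minimal sketch on a hypothetical four-row table (the `toy` data frame and the `country_us_fct` column name are made up for illustration). It assumes `country` has no missing values, as is the case after the earlier filtering step; with an NA in `country`, if_else() would return NA.

```r
library(tidyverse)

# Hypothetical toy rows, NOT the real nobel data
toy <- tibble(country = c("USA", "Germany", "USA", "Japan"))

toy <- toy %>%
  mutate(
    # character result: "USA" where the condition holds, "Other" elsewhere
    country_us = if_else(country == "USA", "USA", "Other"),
    # factor result: keeps the "USA" level and lumps the rest into "Other"
    country_us_fct = fct_other(country, keep = "USA")
  )

toy
```

The two columns carry the same labels; the difference is that fct_other() returns a factor, which matters mainly for how levels are ordered in plots.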
Next, we will limit our analysis to only the following categories: Physics, Medicine, Chemistry, and Economics.
nobel_living_science <- nobel_living %>%
  filter(category %in% c("Physics", "Medicine", "Chemistry", "Economics"))
For the next exercise, work with the nobel_living_science data frame you created above. This means you’ll need to define this data frame in your R Markdown document, even though the next exercise doesn’t explicitly ask you to do so.
Create a faceted bar plot visualizing the relationship between the category of prize and whether the laureate was in the US when they won the Nobel Prize. Interpret your visualization, and say a few words about whether the Buzzfeed headline is supported by the data.
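One possible shape for such a plot is sketched below on hypothetical data. The `toy_science` rows and the axis labels are made up for illustration, and faceting by category with facet_wrap() is just one reasonable layout, not necessarily the one the article used.

```r
library(tidyverse)

# Hypothetical toy rows, NOT the real nobel data
toy_science <- tibble(
  category   = c("Physics", "Physics", "Chemistry", "Medicine", "Economics", "Economics"),
  country_us = c("USA", "Other", "USA", "USA", "Other", "USA")
)

# One bar per country_us value, one panel per prize category
p <- ggplot(toy_science, aes(x = country_us)) +
  geom_bar() +
  facet_wrap(~category) +
  labs(x = "Country when prize was won", y = "Number of laureates")

p
```

Swapping `toy_science` for nobel_living_science should give the real version of the plot, since the column names are the same.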
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
Hint: You should be able to borrow from code you used earlier to create the country_us variable.
Create a new variable called born_country_us that has the value "USA" if the laureate is born in the US, and "Other" otherwise. How many of the winners are born in the US?
Add a second variable to your visualization from Exercise 3 based on whether the laureate was born in the US or not. Based on your visualization, do the data appear to support Buzzfeed’s claim? Explain your reasoning in 1-2 sentences.
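A second variable is often added to a bar plot through the fill aesthetic. The sketch below shows that idea on hypothetical rows; the `toy_science` data and the legend title are made up, and it assumes US-born laureates are recorded with born_country equal to "USA".

```r
library(tidyverse)

# Hypothetical toy rows, NOT the real nobel data
toy_science <- tibble(
  category     = c("Physics", "Physics", "Chemistry", "Medicine"),
  country_us   = c("USA", "USA", "Other", "USA"),
  born_country = c("USA", "Germany", "France", "China")
) %>%
  # same if_else() pattern as for country_us
  mutate(born_country_us = if_else(born_country == "USA", "USA", "Other"))

# Bars are split by birthplace via the fill aesthetic
p_born <- ggplot(toy_science, aes(x = country_us, fill = born_country_us)) +
  geom_bar() +
  facet_wrap(~category) +
  labs(
    x = "Country when prize was won",
    y = "Number of laureates",
    fill = "Born in"
  )

p_born
```

With stacked bars like these, the share of each bar taken up by "Other" shows how many laureates based in a given place were born elsewhere.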
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
Note that your bar plot won’t exactly match the one from the Buzzfeed article. This is likely because the data has been updated since the article was published.
In a single pipeline, filter for laureates who won their prize in the US but were born outside of the US, and then create a frequency table (with the count() function) for their birth country (born_country) and arrange the resulting data frame in descending order of number of observations for each country. Which country is the most common?
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards, and review the md document on GitHub to make sure you’re happy with the final state of your work.
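The frequency-table pipeline from the last exercise can be sketched on a small made-up table. The `toy` rows below are hypothetical, and the sketch assumes the country_us and born_country_us variables have already been created as described earlier.

```r
library(tidyverse)

# Hypothetical toy rows, NOT the real nobel data
toy <- tibble(
  country_us      = c("USA", "USA", "USA", "USA", "Other"),
  born_country_us = c("Other", "Other", "Other", "USA", "Other"),
  born_country    = c("Germany", "United Kingdom", "Germany", "USA", "France")
)

toy_counts <- toy %>%
  # won the prize in the US, but born elsewhere
  filter(country_us == "USA", born_country_us == "Other") %>%
  # one row per birth country, with the number of laureates in column n
  count(born_country) %>%
  # most common birth country first
  arrange(desc(n))

toy_counts
```

Writing count(born_country, sort = TRUE) is a shortcut that combines the last two steps.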
Now go back through your write up to make sure you’ve answered all questions and all of your R chunks are properly labelled. Once you decide as a team that you’re done with this lab, all members of the team should pull the changes and knit the R Markdown document to confirm that they can reproduce the report.
Teams work better when members have a common understanding of the team’s goals and expectations for collaboration. These can be set with a team agreement or contract. As a team, discuss and fill out the team-agreement.Rmd file located in your team’s Lab 03 repo, then knit, commit, and push the team agreement to GitHub.
The purpose of this exercise is to help your team make a plan for working together during lab and outside of the scheduled lab time. Each team member will have some ideas about how a team should operate. These ideas may be very different. This is your opportunity to share your thoughts and ideas to promote optimal team function and prevent misunderstandings in the future.
The plots in the Buzzfeed article are called waffle plots. You can find the code used for making these plots in Buzzfeed’s GitHub repo (yes, they have one!) here. You’re not expected to recreate them as part of your assignment, but you’re welcome to do so for fun!
This lab is adapted from material in the Data Science in a Box course by Mine Çetinkaya-Rundel, licensed under a Creative Commons Attribution-ShareAlike 4.0 International license. Visit here for more information about the license.