HW 03 - What should I major in?

Photo by Marleena Garris on Unsplash

The first step in the process of turning information into knowledge is to summarize and describe the raw information – the data. In this assignment we explore data on college majors and earnings, specifically the data used in the FiveThirtyEight story “The Economic Guide To Picking A College Major”.

These data originally come from the American Community Survey (ACS) 2010-2012 Public Use Microdata Series. While this is outside the scope of this assignment, if you are curious about how raw data from the ACS were cleaned and prepared, see the code FiveThirtyEight authors used.

We should also note that there are many considerations that go into picking a major. Earnings potential and employment prospects are two of them, and they are important, but they don’t tell the whole story. Keep this in mind as you analyze the data.

Getting started

Packages

Reminder: If you get the error Error in library(xyz) : there is no package called ‘xyz’, then you need to install that package, either by using the install.packages() function in the console or by going to the Packages tab and clicking Install. You only need to install packages once, but you need to load them using the library() function in each R session.

We’ll use the tidyverse package for much of the data wrangling and visualization, the scales package for better formatting of labels on visualizations, and the fivethirtyeight package for the data. The following code to load these packages is included at the beginning of hw-03.Rmd:

library(tidyverse)
library(scales)
library(fivethirtyeight)

Data

The data can be found in the fivethirtyeight package, and are called college_recent_grads. Since the dataset is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package. You can find out more about the dataset by inspecting its documentation, which you can access by running ?college_recent_grads in the Console or using the Help menu in RStudio to search for college_recent_grads. You can also find this information here.

It’s always important to understand the observational unit (one case, or one row) in a dataset. In this dataset, one observational unit is one college major (not one college graduate).

You can also take a quick peek at your data frame and view its dimensions with the glimpse function.

glimpse(college_recent_grads)

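The college_recent_grads data frame is a trove of information. Let’s think about some questions we might want to answer with these data: Which major has the lowest unemployment rate? Which major has the highest percentage of women? How do the distributions of median income compare across major categories? What types of majors do women tend to major in?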

In the next sections we aim to answer these questions.

Warm up

Which major has the lowest unemployment rate?

In order to answer this question all we need to do is sort the data. We use the arrange function to do this, and sort it by the unemployment_rate variable. By default arrange sorts in ascending order, which is what we want here – we’re interested in the major with the lowest unemployment rate.

college_recent_grads %>%
  arrange(unemployment_rate)

This gives us what we wanted, but not in an ideal form. First, the name of the major barely fits on the page. Second, some of the variables are not that useful (e.g. major_code, major_category) and some we might want front and center are not easily viewed (e.g. unemployment_rate).

We can use the select function to choose which variables to display, and in which order:

college_recent_grads %>%
  arrange(unemployment_rate) %>%
  select(rank, major, unemployment_rate)

Note how easily we expanded our code by adding another step to our pipeline with the pipe operator: %>%.

Ok, this is looking better, but do we really need to display all those decimal places in the unemployment_rate variable? Not really!

We can use the percent() function from the scales package to clean up the display a bit.

college_recent_grads %>%
  arrange(unemployment_rate) %>%
  select(rank, major, unemployment_rate) %>%
  mutate(unemployment_rate = percent(unemployment_rate))
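One detail worth knowing: percent() returns character strings rather than numbers, so it is safest to sort first and format last, as the pipeline above does. The sketch below uses a small made-up vector (rates is a hypothetical name, not part of the dataset) and the accuracy argument of percent() to control the number of decimal places.

```r
library(scales)

# Hypothetical unemployment rates, stored as proportions
rates <- c(0.0534, 0.1, 0.0061)

# accuracy = 0.1 rounds to one decimal place on the percent scale
formatted <- percent(rates, accuracy = 0.1)
formatted

# The result is text, not numbers, so it no longer sorts numerically
class(formatted)
```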

Exercises

Which major has the highest percentage of women?

Hint: Use the desc() function as an argument in the arrange() function.
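As a small illustration of the hint, here is desc() applied to a made-up data frame. The name toy and its columns are hypothetical and unrelated to the assignment data.

```r
library(dplyr)

# A made-up data frame for illustration only
toy <- tibble(
  item  = c("a", "b", "c", "d"),
  score = c(0.2, 0.9, 0.5, 0.7)
)

# Wrapping a variable in desc() reverses the default ascending order
sorted <- toy %>%
  arrange(desc(score))

sorted
```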

  1. Using what you’ve learned so far, arrange the data in descending order with respect to the proportion of women in a major, and display only the major, the total number of people with that major, and the proportion of women. Show only the top 3 majors by adding top_n(3) at the end of the pipeline.

How do the distributions of median income compare across major categories?

A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value below which 20% of the observations may be found. (Source: Wikipedia)

There are three types of incomes reported in this data frame: p25th, median, and p75th. These correspond to the 25th, 50th, and 75th percentiles of the income distribution of sampled individuals for a given major.
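To make the idea of percentiles concrete, the sketch below computes the 25th, 50th, and 75th percentiles of a small made-up set of incomes with base R's quantile() function. The vector incomes is hypothetical and is not taken from the dataset.

```r
# Made-up incomes (in dollars) for seven hypothetical graduates
incomes <- c(20000, 25000, 30000, 35000, 40000, 60000, 120000)

# quantile() returns the requested percentiles;
# by default it interpolates between neighboring values
income_quartiles <- quantile(incomes, probs = c(0.25, 0.50, 0.75))
income_quartiles
```

For these seven values the 25th, 50th, and 75th percentiles work out to $27,500, $35,000, and $50,000.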

Intro stat review: Properties of the mean and median

  1. Why do we often choose the median, rather than the mean, to describe the typical income of a group of people?

Note: Since each observational unit is one college major, “median income” is one of the variables measured on each major. Don’t confuse the name of the variable (“median income”) with the “median” summary statistic of that variable (the “median median income” across all college majors).

The question we want to answer is “How do the distributions of median income compare across major categories?”. We need to do a few things to answer this question: First, we need to group the data by major_category. Then, we need a way to summarize the distributions of median income within these groups. This decision will depend on the shapes of these distributions. So first, we need to visualize the data.

It’s good practice to always think in the context of the data and try out a few binwidths before settling on one. You might ask yourself: “What would be a meaningful difference in median incomes?” $1 is obviously too little; $10000 might be too high.

  1. Create a histogram of the distribution of all median incomes (without considering the major categories). Try binwidths of $1000 and $5000 and choose one. Explain your reasoning for your choice. Note that the binwidth is an argument for the geom_histogram function. So to specify a binwidth of $1000, you would use geom_histogram(binwidth = 1000).
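To see how much the binwidth argument changes a histogram, here is a sketch on a different dataset: mpg, which ships with ggplot2, using its hwy (highway miles per gallon) variable. The binwidths 1 and 5 are arbitrary choices for that variable and say nothing about which binwidth suits the income data.

```r
library(ggplot2)

# mpg ships with ggplot2; hwy is highway miles per gallon
p_narrow <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 1)

p_wide <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 5)

# Print each plot to compare how much detail the bins reveal or hide
p_narrow
p_wide
```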

We can also calculate summary statistics for this distribution using the summarise function:

college_recent_grads %>%
  summarise(min = min(median), max = max(median),
            mean = mean(median), med = median(median),
            sd = sd(median), 
            q1 = quantile(median, probs = 0.25),
            q3 = quantile(median, probs = 0.75))

Intro stat review: How to describe the shape of the distribution of a quantitative variable

  1. Based on the shape of the histogram you created in the previous exercise, determine which of these summary statistics is useful for describing the distribution. Write up your description of the distribution (remember shape, center, spread, any unusual observations) and include the summary statistic output as well.

  2. Again plot the distribution of median income using a histogram, but now faceted by major_category. Use the binwidth you chose in the earlier exercise.

Now that we’ve seen the shapes of the distributions of median incomes for each major category, we should have a better idea of which summary statistic to use to quantify the typical median income.

  1. Which major category has the highest typical (you’ll need to decide what this means) median income? Use the partial code below, filling it in with the appropriate statistic and function. Also note that we are looking for the highest statistic, so make sure to arrange in the correct direction.

college_recent_grads %>%
  group_by(major_category) %>%
  summarise(___ = ___(median)) %>%
  arrange(___)

Hint: Use the count() and arrange() functions.

  1. Which major category is the least popular in this sample?

All STEM fields aren’t the same

One of the sections of the FiveThirtyEight story is “All STEM fields aren’t the same”. Let’s see if this is true.

First, let’s create a new vector called stem_categories that lists the major categories that are considered STEM fields.

stem_categories <- c("Biology & Life Science",
                     "Computers & Mathematics",
                     "Engineering",
                     "Physical Sciences")

Then, we can use this to create a new variable in our data frame indicating whether a major is STEM or not.

college_recent_grads <- college_recent_grads %>%
  mutate(major_type = ifelse(major_category %in% stem_categories, "stem", "not stem"))

Let’s unpack this: with mutate we create a new variable called major_type, which is defined as "stem" if the major_category is in the vector called stem_categories we created earlier, and as "not stem" otherwise.
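The two pieces, %in% and ifelse(), can be tried on their own outside of mutate. The sketch below uses three example category labels in a hypothetical vector called example_categories.

```r
stem_categories <- c("Biology & Life Science",
                     "Computers & Mathematics",
                     "Engineering",
                     "Physical Sciences")

# Three example labels; only the first appears in stem_categories
example_categories <- c("Engineering", "Arts", "Education")

# %in% checks each element on the left for membership in the vector on the right
is_stem <- example_categories %in% stem_categories
is_stem

# ifelse() returns its second argument where the test is TRUE, its third where FALSE
major_type <- ifelse(is_stem, "stem", "not stem")
major_type
```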

%in% is a logical operator. Other commonly used logical operators are:

Operator     Operation
x < y        less than
x > y        greater than
x <= y       less than or equal to
x >= y       greater than or equal to
x != y       not equal to
x == y       equal to
x %in% y     is in (x is an element of y)
x | y        or
x & y        and
!x           not
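These operators work element by element on vectors and return TRUE or FALSE for each element. A minimal sketch, using a made-up vector of incomes (median_incomes is a hypothetical name, not a variable in the dataset):

```r
# Hypothetical median incomes for four majors
median_incomes <- c(28000, 36000, 45000, 110000)

median_incomes < 36000                             # strictly less than
median_incomes <= 36000                            # less than or equal to
median_incomes >= 30000 & median_incomes <= 50000  # both conditions must hold
median_incomes < 30000 | median_incomes > 100000   # at least one condition holds
!(median_incomes == 36000)                         # negation; same as median_incomes != 36000
```

Inside filter(), conditions separated by commas are combined as if with &.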

We can also use logical operators to filter our data for STEM majors whose median earnings are less than the median of all majors’ median earnings, which we found earlier to be $36,000.

college_recent_grads %>%
  filter(
    major_type == "stem",
    median < 36000
  )

  1. Which STEM majors have median salaries equal to or less than the median of all majors’ median earnings? Your output should show only the major name and the median, 25th percentile, and 75th percentile earnings for that major, and should be sorted so that the major with the highest median earnings is on top.

What types of majors do women tend to major in?

  1. Create a scatterplot of median income vs. proportion of women in that major, colored by whether the major is in a STEM field or not. Describe the association between these three variables.

Further exploration

  1. Ask a question of interest to you, and answer it using both summary statistic(s) and visualization(s).

Cite sources

Write the sources you used to complete this assignment at the end of your .Rmd document, adhering to the “Guidance on Citing Sources” bullet points in the collaboration policy section on our course syllabus.

Attribution

This homework is adapted from material in the Data Science in a Box course by Mine Çetinkaya-Rundel, licensed under a Creative Commons Attribution Share Alike 4.0 International license. Visit here for more information about the license.