HW 05 - Data Wrangling

Getting started

Part I: La Quinta is Spanish for next to Denny’s, Pt. 2

Let’s revisit the Denny’s and La Quinta Inn and Suites data we visualized in the previous lab. Remember that the datasets we’ll use are called dennys and laquinta from the dsbox package.

  1. Filter the Denny’s dataframe for Alaska (AK) and save the result as dn_ak. How many Denny’s locations are there in Alaska?

  2. Filter the La Quinta dataframe for Alaska (AK) and save the result as lq_ak. How many La Quinta locations are there in Alaska?

Next we’ll calculate the distance between all Denny’s and all La Quinta locations in Alaska. Let’s take this step by step:

Step 1: There are 3 Denny’s and 2 La Quinta locations in Alaska. (If you answered differently above, you might want to recheck your answers.)

Step 2: Let’s focus on the first Denny’s location. We’ll need to calculate two distances for it: (1) distance between Denny’s 1 and La Quinta 1 and (2) distance between Denny’s 1 and La Quinta (2).

Step 3: Now let’s consider all Denny’s locations.

  1. How many pairings are there between all Denny’s and all La Quinta locations in Alaska, i.e., how many distances do we need to calculate between the locations of these establishments in Alaska?

In order to calculate these distances we need to first restructure our data to pair the Denny’s and La Quinta locations. To do so, we will join the two data frames.

  1. Join the two data sets dn_ak and lq_ak into a single data set named dn_lq_ak. Keep all rows and columns from both dn_ak and lq_ak data frames.

  2. How many observations are in the joined dn_lq_ak data frame? What are the names of the variables in this data frame.

Now that we have the data in the format we wanted, all that is left is to calculate the distances between the pairs. One way of calculating the distance between any two points on the earth is to use the Haversine distance formula. This formula takes into account the fact that the earth is not flat, but instead spherical.

This function is not available in R, but we have it saved in a file called haversine.R that we can load and then use:

haversine <- function(long1, lat1, long2, lat2, round = 3) {
  # convert to radians
  long1 = long1 * pi / 180
  lat1  = lat1  * pi / 180
  long2 = long2 * pi / 180
  lat2  = lat2  * pi / 180
  
  R = 6371 # Earth mean radius in km
  
  a = sin((lat2 - lat1)/2)^2 + cos(lat1) * cos(lat2) * sin((long2 - long1)/2)^2
  d = R * 2 * asin(sqrt(a))
  
  return( round(d,round) ) # distance in km
}

This function takes five arguments:

  1. The code for the haversine function is included in the hw-05.Rmd template. Adding to the code, document this function with (1) a description, (2) summary of input(s), and (3) summary of outputs.

  2. Calculate the distances between all pairs of Denny’s and La Quinta locations and save this variable as distance. Make sure to save this variable in THE dn_lq_ak data frame so that you can use it later.

  3. Calculate the minimum distance between a Denny’s and La Quinta for each Denny’s location. To do so, you will need to group by Denny’s locations (addresses) and calculate a new variable called closets that stores the information for the minimum distance.

  4. Describe the distribution of the distances between Denny’s and the nearest La Quinta locations in Alaska. Also include an appropriate visualization and relevant summary statistics.

  5. Repeat the same analysis for a state of your choosing, different than the ones we covered so far: (i) filter Denny’s and La Quinta Data Frames for your chosen state, (ii) join these data frames to get a complete list of all possible pairings, (iii) calculate the distances between all possible pairings of Denny’s and La Quinta in your chosen state, (iv) find the minimum distance between each Denny’s and La Quinta location, (v) visualize and describe the distribution of these shortest distances using appropriate summary statistics.

Part II: History of Baseball

For this question, a subset of the tables contained in the History of Baseball database are available. Additional details are available here: https://www.kaggle.com/seanlahman/the-history-of-baseball. The following tables will be used for these questions:

player <- read_csv("https://math.montana.edu/shancock/data/player.csv")
all_star <- read_csv("https://math.montana.edu/shancock/data/all_star.csv")
salary <- read_csv("https://math.montana.edu/shancock/data/salary.csv")
  1. How many players were born in Montana?

  2. Print a table that contains each player born in Montana. The table should contain the player_id as well as given name and their total salary across all years (i.e., the sum of all the salaries across years). If salary is not available (pre-1985), include the player but have an NA for salary.

  3. Create a long dataset for that contains the yearly salaries of David Ortiz, Derek Jeter, and Troy Tulowitzki.

  4. Create a wide dataset for that contains the yearly salaries of David Ortiz, Derek Jeter, and Troy Tulowitzki.