Have you ever taken a road trip in the US and thought to yourself, “I wonder what La Quinta means”? Well, the late comedian Mitch Hedberg joked that it’s Spanish for “next to Denny’s.”
If you’re not familiar with these two establishments, Denny’s is a casual diner chain that is open 24 hours and La Quinta Inn and Suites is a hotel chain.
These two establishments tend to be clustered together, or at least this observation is a joke made famous by Mitch Hedberg. In this lab we explore the validity of this joke and along the way learn some more data wrangling and tips for visualizing spatial data.
The inspiration for this lab comes from a blog post by John Reiser on his New Jersey Geographer blog. You can read that analysis here. Reiser’s blog post focuses on scraping data from the Denny’s and La Quinta Inn and Suites websites using Python. In this lab we focus on visualization and analysis of these data. However, note that the data scraping was also done in R, and we will discuss web scraping using R later in the course. For now, we focus on data that have already been scraped and tidied for you.
Each member of the team should open the lab-05.Rmd file in the lab-05-wrangling-YOUR_TEAM_NAME repository and Knit it. Make sure it compiles without errors. The output will be in a markdown (.md) file with the same name.

Before we introduce the data, let’s warm up with some simple exercises.
We’ll use the tidyverse package for much of the data wrangling and visualization and the data live in the dsbox package.
Additional references for spatial data in R:
- Using Spatial Data with R by Claudia A. Engel, 2019
- Spatial Data Science with Applications in R by Edzer Pebesma and Roger Bivand, 2022
- ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham, Chapter 6: Maps
- “Drawing beautiful maps programmatically with R, sf and ggplot2” by Mel Moreno and Mathieu Basille, 2018
To take advantage of additional spatial data information, we will use three packages that you will most likely need to install: ggmap, maps, and sf.
You can load them by running the following in your Console:
library(tidyverse)
library(dsbox)
library(ggmap)
library(maps)
library(sf)
This code also appears at the beginning of your lab-05.Rmd file.
The datasets we’ll use are called `dennys` and `laquinta`, and they come from the dsbox package. Note that these data were scraped from here and here, respectively. Since the datasets are distributed with the package, we don’t need to load them separately; they become available to us when we load the package. You can find out more about the datasets by inspecting their documentation, which you can access by running `?dennys` and `?laquinta` in the Console, or by using the Help menu in RStudio to search for `dennys` or `laquinta`. You can also find this information here and here.
To help with our analysis we will also use a dataset on US states, which is located in your repository’s `data` folder.
states <- read_csv("data/states.csv")
Each observation in this dataset represents a state, including DC. Along with the name of the state we have the two-letter abbreviation and we have the geographic area of the state (in square miles).
What are the dimensions of the Denny’s dataset? Use inline R code and functions like `nrow` and `ncol` to compose your answer. What does each row in the dataset represent? What are the variables?
What are the dimensions of the La Quinta dataset? Use inline R code and functions like `nrow` and `ncol` to compose your answer. What does each row in the dataset represent? What are the variables?
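As a reminder, inline R code in an R Markdown document is written with backtick-`r` syntax inside your narrative text, so the reported numbers update automatically if the data change. A minimal sketch of what such a sentence might look like (assuming the `dennys` data frame from dsbox is loaded):

```
The Denny's dataset has `r nrow(dennys)` rows and `r ncol(dennys)` columns.
```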
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
We would like to limit our analysis to Denny’s and La Quinta locations in the United States.
Take a look at the websites that the data come from (linked above). Are there any La Quinta locations outside of the US? If so, in which countries? What about Denny’s?
Now take a look at the data. What would be some ways of determining whether or not either establishment has any locations outside the US using just the data (and not the websites). Don’t worry about whether you know how to implement this, just brainstorm some ideas. Write down at least one as your answer, but you’re welcome to write down a few options too.
We will determine whether or not the establishment has a location outside the US using the `state` variable in the `dennys` and `laquinta` datasets. We know exactly which states are in the US, and we have this information in the `states` data frame we loaded. So we can filter for observations where `state` is not in `states$abbreviation`. The code for this is given below. Note that the `%in%` operator matches the states listed in the `state` variable to those listed in `states$abbreviation`, and the `!` operator means not. Are there any Denny’s locations outside the US?

dennys %>%
  filter(!(state %in% states$abbreviation))
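As a quick illustration of how these two operators work together (a toy example, not part of the lab data):

```r
# %in% returns TRUE for each element on the left that appears on the right
c("NC", "ON") %in% c("NC", "TX")    # TRUE FALSE

# wrapping in ! flips the result, keeping elements that do NOT match
!(c("NC", "ON") %in% c("NC", "TX")) # FALSE TRUE
```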
Add a `country` variable to the Denny’s dataset and set its value to `"United States"` for all observations. Remember, you can use the `mutate` function for adding a variable. Make sure to save the result of this as `dennys` again so that the stored data frame contains the new variable going forward.

Hint: You can use the `mutate` function with the argument `country = "United States"`. We don’t need to tell R how many times to repeat the character string “United States” to fill in the data for all observations; R takes care of that automatically.
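A minimal sketch of the pattern the hint describes, using `mutate` to add a constant-valued column and overwrite the stored data frame (this assumes the `dennys` data frame from dsbox is loaded):

```r
dennys <- dennys %>%
  mutate(country = "United States")
```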
Find the La Quinta locations that are outside the US, and figure out which country they are in. (This will require some googling. Make sure to cite your source!) Take notes, you will need to use this information in the next exercise.
Add a `country` variable to the La Quinta dataset. Use the `case_when` function to populate this variable. You’ll need to refer to your notes from Exercise 7 about which country the non-US locations are in. Here is some starter code to get you going:
laquinta <- laquinta %>%
  mutate(country = case_when(
    state %in% state.abb ~ "United States",
    state %in% c("ON", "BC") ~ "Canada",
    state == "ANT" ~ "Colombia",
    ...  # fill in the rest
  ))
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
Going forward we will work with the data from the United States only. All Denny’s locations are in the United States, so we don’t need to worry about them. However we do need to filter the La Quinta dataset for locations in United States.
laquinta <- laquinta %>%
  filter(country == "United States")
Next, let’s calculate which states have the most Denny’s locations per thousand square miles. This requires joining information from the frequency tables you created in Exercise 8 with information from the `states` data frame.
First, we count how many observations are in each state, which will give us a data frame with two variables: `state` and `n`. Then, we join this data frame with the `states` data frame. However, note that the variable in the `states` data frame that has the two-letter abbreviations is called `abbreviation`. So when we’re joining the two data frames, we specify that the `state` variable from the Denny’s data should be matched to the `abbreviation` variable from the `states` data:
dennys %>%
  count(state) %>%
  inner_join(states, by = c("state" = "abbreviation"))
Before you move on to the next question, run the code above and take a look at the output. In the next exercise you will need to build on this pipe.
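One way to build on this pipe is to add a `mutate` step that divides each state’s count by its area in thousands of square miles. This is a sketch, not the full exercise answer; it assumes the area variable in `states.csv` is called `area` and is measured in square miles:

```r
dennys %>%
  count(state) %>%
  inner_join(states, by = c("state" = "abbreviation")) %>%
  # n locations per (area / 1000) thousand square miles
  mutate(per_1000_sqmi = n / (area / 1000)) %>%
  arrange(desc(per_1000_sqmi))
```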
Next, we put the two datasets together into a single data frame. However, before we do so, we need to add an identifier variable. We’ll call this `establishment` and set its value to `"Denny's"` and `"La Quinta"` for the `dennys` and `laquinta` data frames, respectively.
dennys <- dennys %>%
  mutate(establishment = "Denny's")

laquinta <- laquinta %>%
  mutate(establishment = "La Quinta")
Since the two data frames have the same columns, we can easily bind them with the `bind_rows` function:
dn_lq <- bind_rows(dennys, laquinta)
We can plot the locations of the two establishments using a scatterplot, and color the points by the establishment type. Note that longitude is plotted on the x-axis and latitude on the y-axis.
ggplot(dn_lq,
mapping = aes(x = longitude,
y = latitude,
color = establishment)) +
geom_point()
The following two questions ask you to create visualizations. These should follow best practices, such as informative titles, axis labels, etc. See http://ggplot2.tidyverse.org/reference/labs.html for help with the syntax. You can also choose different themes to change the overall look of your plots, see http://ggplot2.tidyverse.org/reference/ggtheme.html for help with these.
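For example, labels and a theme can be layered onto the scatterplot from above like this (a sketch using the `dn_lq` data frame created earlier; the exact titles and theme choice are up to you):

```r
ggplot(dn_lq, aes(x = longitude, y = latitude, color = establishment)) +
  geom_point() +
  labs(
    title = "Denny's and La Quinta locations",
    x = "Longitude", y = "Latitude", color = "Establishment"
  ) +
  theme_minimal()
```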
Filter the data for observations in North Carolina only, and recreate the plot. You should also adjust the transparency of the points by setting the `alpha` level, so that it’s easier to see the overplotted ones. Visually, does Mitch Hedberg’s joke appear to hold here?
Now filter the data for observations in Texas only, and recreate the plot, with an appropriate `alpha` level. Visually, does Mitch Hedberg’s joke appear to hold here?
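As a reminder, `alpha` goes inside `geom_point()` but outside `aes()`, since it applies the same transparency to every point rather than mapping it to a variable. A minimal sketch (0.5 is just a starting value to experiment with):

```r
ggplot(dn_lq, aes(x = longitude, y = latitude, color = establishment)) +
  geom_point(alpha = 0.5)
```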
The plots above only plotted the latitude and longitude points, without overlaying those points on a map. Here, we will look at two approaches to plotting points on a map.
Google recently changed its API requirements, so if you would like to use Google Maps with the `qmplot()` function, you will need to register with Google.

The first approach is to use the `qmplot()` function (“quick map plot”) in the ggmap package to add a basemap to a plot. The basemaps are queried from either Google Maps, OpenStreetMap, Stamen Maps, or Naver Map, with the default being Stamen Maps.
qmplot(longitude, latitude,
data = dn_lq,
maptype = "toner-lite") +
geom_point(aes(color = establishment))
A second approach is to use the maps and sf packages to create “simple feature” map objects, then use the `geom_sf()` function to add this layer to a ggplot. The “simple features” standard produced by the Open Geospatial Consortium is a common way of encoding vector data for maps.
The code below will produce a map of Denny’s and La Quinta locations in North Carolina.
dn_lq_nc <- filter(dn_lq, state == "NC")

nc <- map('state',
          region = "north carolina",
          fill = TRUE,
          plot = FALSE) %>%
  st_as_sf()
ggplot() +
geom_sf(data = nc) +
geom_point(data = dn_lq_nc,
aes(x = longitude, y = latitude,
color = establishment)) +
coord_sf() # Ensures lat and long on same scale
That’s it for now! You will revisit this data set and practice more joins on your homework for this week.
🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards. Then review the md document on GitHub to make sure you’re happy with the final state of your work.