STAT 408: Week 10

3/29/2022

Data Viz Resources

Edward Tufte

Tufte: Visual Display of Quantitative Information

William Cleveland

Cleveland: The Elements of Graphing Data

Nathan Yau (FlowingData)

Yau: Visualize This

Telling Stories with Data

One of the best ways to explore and understand a dataset is with visualization.

Telling Stories with Data

What is Statistics?
- hypothesis tests
- pattern finding
- predictive modeling
- storytelling with data can help you solve real-world problems (predicting unrest, decreasing crime) or it can help you stay more informed

Data viz is more than numbers

Journalism

Art

Starry night for the color blind

Entertainment

Compelling - Hans Rosling

Hans Rosling’s 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four

Data Viz: What to look for

Patterns

Why so many births around Sept. 25?

Relationships

Age vs. hospital visits

Design Principles

Explain Encodings

What is gray?

Label Axes

Calories for menu items

Keep Geometry in Check

Proper scaling

Include Sources

Source your data

Spotting Visualization Lies

FlowingData Guide for Spotting Visualization Lies

Types of Graphs

Why use Graphics

Why do you use, or why have you used, data graphics?

Exploratory Graphics
Publication Graphics
Presentation Graphics

Basic Plots

Source: https://mtstateintrostats.github.io/IntroStatTextbook/index.html

Interactive Graphics: R Shiny

Overview of R Shiny

R Shiny provides a way to create interactive visualizations and web applets.

There are two key components of an R Shiny Script:

The ui (user interface) provides a way for the user to interact with the visualization and for the program to capture input values.
The server piece takes those input values and applies them to R code.

Server Code

Under the hood: Server Code

# Define server logic required to draw a histogram
server <- function(input, output) {
  
  output$distPlot <- renderPlot({
    # generate bins based on input$bins from ui.R
    x    <- faithful[, 2] 
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    
    # draw the histogram with the specified number of bins
    hist(x, breaks = bins, col = 'darkgray', border = 'white')
  })
}

Under the hood: Server Code - This is just R Code

input.bins <- 30 # number of bins (stands in for input$bins)
x <- faithful[, 2]
bins <- seq(min(x), max(x), length.out = input.bins + 1)
# draw the histogram with the specified number of bins
hist(x, breaks = bins, col = 'darkgray', border = 'white')

Under the hood: Server Code

Under the hood: Server Code - Change # of Bins
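
To see what changing the slider does without launching the app, the bin calculation can be run for two different bin counts. This is a minimal sketch: `bin_breaks` is a hypothetical helper name (not part of Shiny), and `plot = FALSE` is used only so the bin counts can be inspected.

```r
# Waiting times from the built-in Old Faithful data (same column as the app)
x <- faithful[, 2]

# Hypothetical helper: n_bins equal-width bins need n_bins + 1 break points
bin_breaks <- function(x, n_bins) {
  seq(min(x), max(x), length.out = n_bins + 1)
}

# Compute the histograms without drawing them
h10 <- hist(x, breaks = bin_breaks(x, 10), plot = FALSE)
h30 <- hist(x, breaks = bin_breaks(x, 30), plot = FALSE)

length(h10$counts)  # number of bins when the slider is at 10
length(h30$counts)  # number of bins when the slider is at 30
```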

UI Code

Under the Hood: UI Code

library(shiny)

# Define UI for application that draws a histogram
ui <- fluidPage(
   # Application title
   titlePanel("Old Faithful Geyser Data"),
   # Sidebar with a slider input for number of bins 
   sidebarLayout(
      sidebarPanel(
         sliderInput("bins", 
                     "Number of bins:",
                     min = 1,
                     max = 50,
                     value = 30)
      ),
      # Show a plot of the generated distribution
      mainPanel(
         plotOutput("distPlot")
      )
   )
)

Shiny Code: Running the Application

# Define server logic required to draw a histogram
server <- function(input, output) {
   
   output$distPlot <- renderPlot({
      # generate bins based on input$bins from ui.R
      x    <- faithful[, 2] 
      bins <- seq(min(x), max(x), length.out = input$bins + 1)
      
      # draw the histogram with the specified number of bins
      hist(x, breaks = bins, col = 'darkgray', border = 'white')
   })
}

# Run the application 
shinyApp(ui = ui, server = server)

Shiny Output

Demos

Shiny Web Applet
Shiny in RMarkdown HTML file
shinyapps.io

Exercises

Open the default Shiny Web App in R that contains the interactive histogram of Old Faithful eruptions. Change the color of the histogram and reload the app.
Open the default Shiny document R Markdown file and change the system.file() call so that it embeds a different Shiny example application than “06_tabsets”.
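
For the second exercise, it may help to see which example apps ship with the shiny package. A minimal sketch, assuming shiny is installed; `example_dir` and `example_names` are illustrative variable names.

```r
# Locate the folder of example apps bundled with the shiny package
# (system.file() returns "" if the package is not installed)
example_dir <- system.file("examples", package = "shiny")

# Each subfolder is one example app, e.g. "06_tabsets"
example_names <- list.files(example_dir)
print(example_names)
```

Any of the listed names can be passed to `shiny::runExample()`, for example `shiny::runExample("01_hello")`, or substituted into the `system.file()` path in the R Markdown document.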

Shiny Apps to Demonstrate Shiny

Bonus Exercise

The babynames package contains a (giant) dataset called babynames, which holds the full baby name data provided by the Social Security Administration (SSA). Type the following to explore the dataset and its variables:

library(babynames)  # You will most likely have to install this
data(babynames)  # Load data into environment
?babynames
head(babynames)

Using the babynames data set, implement your own version of this app from the SSA.
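
One possible starting point for the data step, sketched under assumptions: the app needs the yearly counts for a single name and sex. `name_trend` is a hypothetical helper, and the small `toy` data frame is made-up illustration data that only mimics the `year`, `sex`, `name`, and `n` columns of babynames; swap in the real babynames data once the package is installed.

```r
# Hypothetical helper: rows for one name and sex, ordered by year
name_trend <- function(data, target_name, target_sex) {
  keep <- data$name == target_name & data$sex == target_sex
  trend <- data[keep, c("year", "n")]
  trend[order(trend$year), ]
}

# Made-up toy data with the same column names as babynames
toy <- data.frame(
  year = c(2001, 2000, 2000, 2001),
  sex  = c("F", "F", "M", "M"),
  name = c("Avery", "Avery", "Avery", "Avery"),
  n    = c(12, 10, 7, 9)
)

# Yearly counts for one name and sex, drawn as connected points
avery_f <- name_trend(toy, "Avery", "F")
plot(n ~ year, data = avery_f, type = "b",
     xlab = "Year", ylab = "Number of babies")
```

Inside a Shiny server function, the same call could sit in `renderPlot()`, with values captured by the ui in place of the fixed name and sex.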

Graphics in R: Additional Slides

Visualizing Patterns Over Time

What are we looking for with data over time?
- Trends (increasing/decreasing)
- Are seasonal cycles present?
Identifying these patterns requires looking beyond single points.
We are also interested in looking at the data in more detail:
- Are there outliers?
- Do any time periods look out of place?
- Are there spikes or dips?
- What causes any of these irregularities?

Capital Bikeshare Data

library(tidyverse)  # provides read_csv(), the dplyr verbs, and ggplot2
bike.data <- read_csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/Bike.csv')

Capital Bikeshare Data

library(lubridate)  # provides year() and month()
bike.data <- bike.data %>% 
  mutate(year = as.factor(year(datetime)), month = as.factor(month(datetime)))
monthly.counts <- bike.data %>% group_by(month) %>% 
  summarize(num_bikes = sum(count), .groups = 'drop') %>% arrange(month)
monthly.counts

## # A tibble: 12 × 2
##    month num_bikes
##    <fct>     <dbl>
##  1 1         79884
##  2 2         99113
##  3 3        133501
##  4 4        167402
##  5 5        200147
##  6 6        220733
##  7 7        214617
##  8 8        213516
##  9 9        212529
## 10 10       207434
## 11 11       176440
## 12 12       160160

Discrete Points: Bar Charts

Discrete Points: Bar Charts - Code

monthly.counts %>% 
  ggplot(aes(y = num_bikes, x = month)) + 
  geom_bar(stat = 'identity') + xlab('Month') + 
  ylab('Bike Rentals') + 
  labs(title = 'Bike Rentals per Month in 2011-2012 \n Capital Bikeshare in Washington, DC', 
       caption = 'Source: www.capitalbikeshare.com')

Discrete Points: Stacked Bar

Discrete Points in Time: Stacked Bar - Code

# total casual and registered rentals for each month
bike.counts <- aggregate(cbind(bike.data$casual, bike.data$registered),
                         by = list(bike.data$month), sum)
# each row of the transposed matrix (casual, registered) becomes a stacked segment
barplot(t(as.matrix(bike.counts[, -1])), 
        names.arg = as.character(bike.counts[, 1]), 
        xlab = 'Month', sub = 'Source: www.capitalbikeshare.com', 
        ylab = 'Bike Rentals', 
        main = 'Bike Rentals per Month in 2011 - 2012 \n Capital Bikeshare in Washington, DC',
        col = c("darkblue", "red"), legend.text = c("Casual", "Registered"),
        args.legend = list(x = "topleft"))

Discrete Points in Time: Points

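# month is stored as a factor; convert it to numeric so plot() draws points
# (a factor on the right-hand side of the formula would produce boxplots)
month.num <- as.numeric(as.character(bike.counts[,1]))
plot(rowSums(bike.counts[,-1])~month.num,xlab='Month',
     sub ='Source: www.capitalbikeshare.com', ylab='Bike Rentals', 
     main='Bike Rentals per Month \n Capital Bikeshare in Washington, DC',
     col=c("darkblue"),pch=16,axes=FALSE,
     ylim=c(0,max(rowSums(bike.counts[,-1]))))
axis(2)
axis(1,at=1:12)
box()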

Connect the Dots

mean_temp <- bike.data %>% group_by(month) %>%
  summarize(mean_temp = mean(temp),.groups = 'drop') %>% 
  mutate(month = as.numeric(month))

ggplot(aes(y=temp, x= month), data = bike.data) +
  geom_jitter(alpha = .1) + 
  geom_line(inherit.aes = F, aes(y = mean_temp, x = month),
            data = mean_temp, color = 'red', lwd = 2) +
  ylab('Average Temp (C)') + xlab('Month') + 
  labs(title = 'Average Temperature in Washington, DC', 
                       caption = 'Source: www.capitalbikeshare.com')

Visualizing Proportions

What to look for in proportions?
- Generally looking for maximum, minimum, and overall distribution.
Many of the figures we have discussed are useful here as well: for example, stacked bar charts or points to look at changes in proportions over time.
Another possibility, which we will not cover, is plotting with rectangles, known as a treemap.
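
As a sketch of the stacked-bar approach to proportions, `prop.table()` converts counts to within-column shares before plotting. The counts below are made-up illustration values, not the bikeshare data, and `rider_counts` and `rider_props` are hypothetical names.

```r
# Made-up counts: rows are rider types, columns are months
rider_counts <- matrix(c(20, 80,
                         30, 90,
                         50, 100),
                       nrow = 2,
                       dimnames = list(c("Casual", "Registered"),
                                       c("Jan", "Feb", "Mar")))

# Convert each column to proportions that sum to 1
rider_props <- prop.table(rider_counts, margin = 2)

# Stacked bars now show the share of each rider type per month
barplot(rider_props, col = c("darkblue", "red"),
        legend.text = rownames(rider_props),
        xlab = "Month", ylab = "Proportion of rentals")
```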

Visualizing Relationships

When considering relationships between variables, what are we looking for?
- If one variable goes up, do other variables show a positive relationship, a negative relationship, or no relationship?
- What is the distribution of your data? (both univariate and multivariate)
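
A correlation coefficient is one quick numeric check on the direction of a relationship. A minimal sketch using the built-in `faithful` data from the Shiny example (not the bikeshare data); correlation only captures linear association, so it complements a scatterplot rather than replacing it.

```r
# Built-in Old Faithful data: eruption length and waiting time (both in minutes)
r <- cor(faithful$eruptions, faithful$waiting)

# The sign of r indicates direction: positive, negative, or near zero
print(r)

# Univariate summaries of each variable
summary(faithful)
```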

Relationships: Scatterplots

Visualizing Relationships: Scatterplots - code

# convert temperature from Celsius to Fahrenheit
bike.data$tempF <- bike.data$temp * 1.8 + 32
# semi-transparent points reduce overplotting
plot(bike.data$count~bike.data$tempF,pch=16,
     col=rgb(100,0,0,10,maxColorValue=255),ylab='Hourly Bike Rentals',
     xlab='Temp (F)',sub ='Source: www.capitalbikeshare.com',
     main='Hourly Bike Rentals by Temperature')
# overlay a loess smooth of rentals against temperature
bike.fit <- loess(count~tempF,bike.data)
temp.seq <- seq(min(bike.data$tempF),max(bike.data$tempF))
lines(predict(bike.fit,temp.seq)~temp.seq,lwd=2)

Visualizing Relationships: Multivariate Scatterplots

# columns 12, 15, and 8 are assumed to be count, tempF, and humidity
pairs(bike.data[,c('count','tempF','humidity')])

Multivariate Scatterplots

Relationships: Multivariate Scatterplots

par(mfcol=c(2,2),oma = c(1,0,0,0))
bike.data$tempF <- bike.data$temp * 1.8 + 32
plot(bike.data$count~bike.data$tempF,pch=16,col=rgb(100,0,0,10,max=255),
     ylab='Hourly Bike Rentals',xlab='Temp (F)',
     main='Hourly Bike Rentals by Temperature')
bike.fit <- loess(count~tempF,bike.data)
temp.seq <- seq(min(bike.data$tempF),max(bike.data$tempF))
lines(predict(bike.fit,temp.seq)~temp.seq,lwd=2)

plot(bike.data$count~bike.data$humidity,pch=16,
     col=rgb(100,0,100,10,max=255),
     ylab='Hourly Bike Rentals',xlab='Humidity (%)',
     main='Hourly Bike Rentals by Humidity')
bike.fit <- loess(count~humidity,bike.data)
humidity.seq <- seq(min(bike.data$humidity),max(bike.data$humidity))
lines(predict(bike.fit,humidity.seq)~humidity.seq,lwd=2)

plot(bike.data$count~bike.data$windspeed,pch=16,col=rgb(0,0,100,10,max=255),
     ylab='Hourly Bike Rentals',xlab='Windspeed (MPH)',main='Hourly Bike Rentals by Windspeed')
bike.fit <- loess(count~windspeed,bike.data)
windspeed.seq <- seq(min(bike.data$windspeed),max(bike.data$windspeed))
lines(predict(bike.fit,windspeed.seq)~windspeed.seq,lwd=2)

plot(bike.data$count~as.factor(bike.data$weather),col=rgb(0,100,0,255,max=255),
     ylab='Hourly Bike Rentals',xlab='Weather Conditions',main='Hourly Bike Rentals by Weather')

mtext('Source: www.capitalbikeshare.com', outer = TRUE, cex = .9, side=1)
par(mfcol=c(1,1),oma = c(0,0,0,0))

Relationships: Histograms

hist(bike.data$tempF,prob=T, main='Temperature (F)',col='red',xlab='')

Multiple Histograms

Visualizing Relationships: Multiple Histograms - Code

par(mfrow=c(2,1))
bike.data$reltempF <- bike.data$atemp * 1.8 + 32

hist(bike.data$tempF,prob=T,breaks='FD',
     main='Temperature (F)',col='red',xlab='',
     xlim=c(0,max(c(bike.data$reltempF,bike.data$tempF))))
hist(bike.data$reltempF,prob=T,breaks='FD',
     main='Relative Temperature (F)',col='orange',xlab='',
     xlim=c(0,max(c(bike.data$reltempF,bike.data$tempF))))