2/15/2022
While there is not universal agreement on programming style, there are two good examples:
Hadley Wickham’s Style Guide: http://adv-r.had.co.nz/Style.html
Google R Style Guide: https://google.github.io/styleguide/Rguide.xml, which presents different options from Wickham’s guide.
File Names: File names should end in .R (script file) or .Rmd (R Markdown file) and be concise yet meaningful.
Identifiers: Don’t use hyphens or spaces in identifiers (or dots when naming functions).
Examples: variable_name or VariableName
x <- 1:10
mean <- sum(x)  # oh no! this reuses the name of the built-in mean() function
Use <- not = for assignment.
Put spaces around operators (==, +, ...) and assignment (<-).
Use %>% to emphasize a sequence of actions, rather than the object that the actions are being performed on.
Avoid using the pipe when:
You need to manipulate more than one object at a time. Reserve pipes for a sequence of steps applied to one primary object.
There are meaningful intermediate objects that could be given informative names.
%>% should always have a space before and after.
The + connecting ggplot() commands is similar.
library(tidyverse)  # provides read_csv(), %>%, and the dplyr verbs used below
surveys <- read_csv("https://math.montana.edu/shancock/data/animal_survey.csv")
# Clean up this code
surveys%>%filter(!is.na(weight) & !is.na(hindfoot_length)) %>% select(sex, species, hindfoot_length, weight) %>% group_by(sex) %>% summarize(mean_hindfoot_length=mean(hindfoot_length),mean_weight=mean(weight),n_species=n_distinct(species))
# your solutions here
surveys %>%
  filter(!is.na(weight) & !is.na(hindfoot_length)) %>%
  select(sex, species, hindfoot_length, weight) %>%
  group_by(sex) %>%
  summarize(
    mean_hindfoot_length = mean(hindfoot_length),
    mean_weight = mean(weight),
    n_species = n_distinct(species)
  )
Most mathematical operators are self-explanatory, but here are a few other important operators.
== will test for equality. For example, evaluating pi == 3 in R will return FALSE. Note this operator returns a logical value.
& is the AND operator, so TRUE & FALSE will return FALSE.
| is the OR operator, so TRUE | FALSE will return TRUE.
! is the NOT operator, so ! TRUE will return FALSE.
^ permits power terms, so 4 ^ 2 returns 16 and 4 ^ .5 returns 2.
Always type out TRUE and FALSE rather than T and F.
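One reason to spell out TRUE and FALSE is that T and F are ordinary variables that can be reassigned, whereas TRUE and FALSE are reserved words. A minimal sketch; the reassignment below is contrived purely for illustration, and the variable names are hypothetical:

```r
# T is just a variable bound to TRUE by default, so it can be overwritten.
T <- 0
t_still_true <- isTRUE(T)  # T now holds 0, so this is FALSE

# TRUE is a reserved word, so assigning to it raises an error.
# The assignment is parsed from text so the error can be caught
# instead of halting the script.
assign_true_result <- tryCatch(
  eval(parse(text = "TRUE <- 0")),
  error = function(e) "error"
)

rm(T)  # remove our variable so T refers to the built-in value again
```

Code that relies on T or F can silently change behavior if either name is reassigned elsewhere in a script.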
Note that order of operations is important in writing R code.
4 - 2 ^ 2
(4 - 2) ^ 2
5 * 2 - 3 ^ 2
pi == 3
! TRUE & pi == 3
! (TRUE | FALSE)
Evaluate all expressions. Note ! is R’s “not” operator.
The results of the R code are:
4 - 2 ^ 2
## [1] 0
(4 - 2) ^ 2
## [1] 4
5 * 2 - 3 ^ 2
## [1] 1
The results of the R code are:
pi == 3
## [1] FALSE
! TRUE & pi == 3
## [1] FALSE
! (TRUE | FALSE)
## [1] FALSE
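When precedence is not obvious, explicit parentheses make the intended grouping visible. The sketch below pairs some of the expressions above with fully parenthesized equivalents; the variable names are hypothetical and chosen only for this illustration.

```r
# Exponentiation binds tighter than multiplication,
# which binds tighter than subtraction.
implicit_arith <- 5 * 2 - 3 ^ 2
explicit_arith <- (5 * 2) - (3 ^ 2)

# Exponentiation also binds tighter than a leading minus sign.
neg_square <- -2 ^ 2  # evaluated as -(2 ^ 2)

# Comparison (==) is evaluated before !, and ! before &.
implicit_logic <- ! TRUE & pi == 3
explicit_logic <- (! TRUE) & (pi == 3)

# Each pair gives the same result.
same_arith <- identical(implicit_arith, explicit_arith)
same_logic <- identical(implicit_logic, explicit_logic)
```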
The general layout of an R script (.R) should be as follows:
source() and library() statements
General guidelines for a reproducible R Markdown file (.Rmd):
Code comments should be included in R chunks
R chunks should always be named: {r chunk_name, options}
Print out all code in documents
R output should be integrated into text, using “r mean(x)” (using back ticks in place of quotes). DO NOT hard code results in written text.
Look at output to verify results look how you intended. Knit often!
Full-line and inline comments both begin with # and then one space.
# create plot of housing price by zipcode
plot(Seattle$Price ~ Seattle$Zip,
     col = rgb(.5, 0, 0, .7),  # set transparency for points
     xlab = 'zipcode')
# New section title ---------
Note that output from R can often be hard to read. Luckily there are several options for creating nicely formatted tables. One, which we will use, is the kable() function.
library(knitr)
kable(
  aggregate(Loblolly$height, by = list(Loblolly$age), mean),
  digits = 3,
  caption = 'Average height of loblolly pine by age',
  col.names = c('Tree Age', 'Height (ft)')
)
Tree Age | Height (ft) |
---|---|
3 | 4.238 |
5 | 10.205 |
10 | 27.442 |
15 | 40.544 |
20 | 51.469 |
25 | 60.289 |
Where does your analysis “live”?
Use your R/Rmd files to recreate your environment. Reproducible research!
RStudio > Preferences
Open RStudio and type getwd() to see your current working directory.
Projects allow you to set your working directory and operate using relative paths rather than absolute paths in your code.
# bad
read_csv("/Users/staceyhancock/Documents/stat408/data/nobel.csv")
# good
read_csv("data/nobel.csv")
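A relative path can also be assembled from its pieces with file.path(). The sketch below is only an illustration: it assumes a data folder inside the project, as in the example above, and the variable names are hypothetical.

```r
# Build a relative path from pieces; file.path() inserts the separator.
nobel_path <- file.path("data", "nobel.csv")

# A relative path is resolved against the working directory,
# which is the project folder when working inside an RStudio project.
full_path <- file.path(getwd(), nobel_path)

# read_csv(nobel_path)  # would read the file if it exists in the project
```

Because only the relative part is written in the script, the same code can run on another computer where the project folder sits in a different location.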
With this class, we cannot cover every possible situation that you will encounter. The overall course goals are to:
When writing code (and conducting statistical analyses), an iterative approach is a good strategy.
Finding your bug is a process of confirming the many things that you believe are true – until you find one which is not true.
– Norm Matloff
We will first focus on debugging when an error or warning is tripped.
R will flag a problem and print out a message in two cases: warnings and errors.
Errors are created by stop() and force all execution of code to stop.
Warnings are created by warning() and display potential problems. Warnings do not stop code from executing.
There are also messages, created by message(), which pass along information.
In other cases, we will have bugs in our code that don’t necessarily give a warning or an error.
Note: NA values often return a warning message, but not always.
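The three kinds of conditions can be seen side by side in a small function. This is a sketch only: CheckValue is a hypothetical helper written for this illustration, not part of the course materials.

```r
# Hypothetical helper that signals each type of condition.
CheckValue <- function(x) {
  # checks a numeric vector and returns its mean
  # ARGS: x - numeric vector
  # RETURNS: mean of x (NA if x contains NA values)
  if (!is.numeric(x)) {
    stop('x must be numeric')                  # error: execution halts here
  }
  if (anyNA(x)) {
    warning('x contains NA values')            # warning: execution continues
  }
  message('checked ', length(x), ' value(s)')  # message: informational only
  mean(x)
}

# tryCatch() captures the error text instead of stopping the script.
error_text <- tryCatch(CheckValue('a'), error = function(e) conditionMessage(e))

# NA can propagate silently: base mean() raises no warning or error here.
silent_na <- mean(c(1, NA, 3))

# Coercion that creates an NA does warn (suppressed here to keep output clean).
coerced <- suppressWarnings(as.numeric('abc'))
```

The silent NA case is the kind of bug that gives no warning or error, so checking intermediate results is still necessary.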
surveys <- read_csv("https://math.montana.edu/shancock/data/animal_survey.csv")
Debug the following code:
surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex) %>%
  summarize(
    mean-wgt = mean(weight),
    sd_wgt = sd(weight),
    max_wgt = max(weight)
  ) %>%
  select(weight, species)
# your solution here
surveys %>%
  filter(!is.na(weight)) %>%
  group_by(sex) %>%
  select(weight, species) %>%
  summarize(
    mean_wgt = mean(weight),
    sd_wgt = sd(weight),
    max_wgt = max(weight)
  )
## # A tibble: 3 × 4
##   sex   mean_wgt sd_wgt max_wgt
##   <chr>    <dbl>  <dbl>   <dbl>
## 1 F         42.2   36.8     274
## 2 M         43.0   36.2     280
## 3 <NA>      64.7   62.2     243
To get more details in R, type ?FunctionName. This will open up a help window that displays essential characteristics of the function. For example, with the mean function the following information is shown:
Description: function for the (trimmed) arithmetic mean.
Usage: mean(x, trim = 0, na.rm = FALSE, …)
x: An R object.
trim: the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.
na.rm: a logical value indicating whether NA values should be stripped before the computation proceeds.
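The effect of the trim and na.rm arguments can be checked on a small vector. The values below are made up for illustration.

```r
x <- c(1, 2, 3, 100, NA)

# With the default na.rm = FALSE, a single NA makes the result NA.
mean_default <- mean(x)

# na.rm = TRUE drops the NA before averaging the remaining four values.
mean_no_na <- mean(x, na.rm = TRUE)

# trim = 0.25 also drops the lowest and highest quarter of the remaining
# values (one value from each end here) before averaging.
mean_trimmed <- mean(x, trim = 0.25, na.rm = TRUE)
```

Here the trimmed mean averages only 2 and 3, so the outlying value 100 no longer affects the result.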
Functions are a way to save elements of code to be used repeatedly.
name_of_function <- function(arguments) {
  # Documentation
  body of function...
}
RollDice <- function(num.rolls) {
  #
  # ARGS:
  # RETURNS:
  sample(6, num.rolls, replace = TRUE)
}
RollDice(2)
## [1] 1 5
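A closing curly brace should go on its own line, unless it is followed by else.
When function arguments span multiple lines, either align them with the opening ( of function OR double-indent (four spaces).
Keep the closing ) of function arguments and start of function { on the same line.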
Functions should contain a comments section immediately below the function definition line. These comments should consist of a one-sentence description of the function; a list of the function's arguments, denoted by Args:, with a description of each; and a description of the return value, denoted by Returns:.
The comments should be descriptive enough that the function can be used without reading the function code.
Document this function:
RollDice <- function(num.rolls) {
  #
  # ARGS:
  # RETURNS:
  return(sample(6, num.rolls, replace = TRUE))
}
RollDice <- function(num.rolls) {
  # function that returns rolls of dice
  # ARGS: num.rolls - number of rolls
  # RETURNS: vector of num.rolls of a die
  return(sample(6, num.rolls, replace = TRUE))
}
RollDice(2)
## [1] 3 6
Here is an example (trivial) R function.
SquareRoot <- function(value.in) {
  # function takes square root of value.
  # Args: value.in - numeric value
  # Returns: the square root of value.in
  value.in ^ .5
}
Now consider running the function for a few values.
SquareRoot(9)
## [1] 3
SquareRoot(25)
## [1] 5
Now what happens with SquareRoot(-1)?
SquareRoot(-1)
## [1] NaN
What should happen?
Here is the same function, revised to stop with an error on negative input.
SquareRoot <- function(value.in) {
  # function takes square root of value.
  # Args: value.in - numeric value
  # Returns: the square root of value.in
  if (value.in < 0) {
    stop('argument less than zero')
  }
  value.in ^ .5
}
SquareRoot(-1)
This returns:
> SquareRoot(-1)
Error in SquareRoot(-1) : argument less than zero
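The same idea extends to other kinds of bad input. The sketch below uses a hypothetical variant, SafeSquareRoot, that adds a type check (an assumption for illustration, not part of the original function), and uses tryCatch() to capture the error messages without halting the script.

```r
SafeSquareRoot <- function(value.in) {
  # function takes square root of value, with input checks.
  # Args: value.in - numeric value
  # Returns: the square root of value.in
  if (!is.numeric(value.in)) {
    stop('argument must be numeric')  # extra check added for this sketch
  }
  if (any(value.in < 0)) {
    stop('argument less than zero')
  }
  value.in ^ .5
}

# Capture each error message rather than stopping execution.
negative_result <- tryCatch(SafeSquareRoot(-1),
                            error = function(e) conditionMessage(e))
text_result <- tryCatch(SafeSquareRoot('nine'),
                        error = function(e) conditionMessage(e))
valid_result <- SafeSquareRoot(9)
```

Checking the type first matters here: comparing a character string with 0 does not raise an error by itself, so without the first check the function would fail later with a less helpful message.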
Use the defined style guidelines to create an R script that:
Verify your function works by running it twice using “MT” and “NE” as inputs.
SummarizeHousingCosts <- function(state) {
  # computes average sales price in a state
  # ARGS: state abbr, such as 'MT' or 'CA'
  # RETURNS: average sales price for the state
  housing.data <- read.csv(
    'http://math.montana.edu/ahoegh/teaching/stat408/datasets/HousingSales.csv')
  location <- subset(housing.data, State == state)
  mean(location$Closing_Price)
}
SummarizeHousingCosts('MT')
## [1] 164608
SummarizeHousingCosts('NE')
## [1] 152050
SummarizeHousingCosts <- function(
    state,
    path
) {
  # computes average sales price in a state
  # ARGS:
  #   state - abbr, such as 'MT' or 'CA'
  #   path - character pathname to data
  # RETURNS: average sales price for the state
  housing.data <- read.csv(path)
  location <- subset(housing.data, State == state)
  mean(location$Closing_Price)
}
SummarizeHousingCosts(
  'MT',
  path = 'http://math.montana.edu/ahoegh/teaching/stat408/datasets/HousingSales.csv'
)
## [1] 164608
Now what will happen if we try this code?
SummarizeHousingCosts('MT')
SummarizeHousingCosts <- function(
    state,
    path = 'http://math.montana.edu/ahoegh/teaching/stat408/datasets/HousingSales.csv'
) {
  # computes average sales price in a state
  # ARGS:
  #   state - abbr, such as 'MT' or 'CA'
  #   path - character pathname to data
  # RETURNS: average sales price for the state
  housing.data <- read.csv(path)
  location <- subset(housing.data, State == state)
  mean(location$Closing_Price)
}
SummarizeHousingCosts('MT')
## [1] 164608
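The behavior of default arguments can be seen without downloading any data. In this sketch, ScaleValue is a hypothetical function written only for illustration: one argument has a default and the other does not.

```r
ScaleValue <- function(value, scale.factor = 2) {
  # multiplies a value by a scale factor
  # ARGS:
  #   value - numeric value (required)
  #   scale.factor - numeric multiplier, defaults to 2
  # RETURNS: value multiplied by scale.factor
  value * scale.factor
}

default_used <- ScaleValue(5)                          # scale.factor = 2
default_overridden <- ScaleValue(5, scale.factor = 10)

# Omitting an argument that has no default is an error once it is used.
missing_arg <- tryCatch(ScaleValue(scale.factor = 3),
                        error = function(e) 'error')
```

A default keeps the common call short while still letting the caller supply a different value when needed.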
Now write a function that:
Also include the stop() function for errors. Test this function with two settings:
ToSki <- function(snowfall, day) {
  # determines whether to ski or stay home
  # ARGS: snowfall in inches, day as three letter
  #   abbr with first letter capitalized
  # RETURNS: string stating whether to ski or not
  if (snowfall < 0) {
    stop('snowfall should be greater than or equal to zero inches')
  }
  if (day == 'Sat') {
    print('Go Ski')
  } else if (snowfall > 5) {
    print('Go Ski')
  } else {
    print('Stay Home')
  }
}
ToSki(snowfall = 15, day = "Sat")
## [1] "Go Ski"
ToSki(-1, 'Mon')
## Error in ToSki(-1, "Mon"): snowfall should be greater
## than or equal to zero inches
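One possible variation is to return the decision string instead of calling print(), so the result can be stored or tested. The sketch below uses the hypothetical name ToSkiValue and combines the two "Go Ski" conditions into one test; it is an alternative for comparison, not a replacement for the solution above.

```r
ToSkiValue <- function(snowfall, day) {
  # determines whether to ski or stay home
  # ARGS: snowfall in inches, day as three letter
  #   abbr with first letter capitalized
  # RETURNS: string stating whether to ski or not
  if (snowfall < 0) {
    stop('snowfall should be greater than or equal to zero inches')
  }
  if (day == 'Sat' || snowfall > 5) {
    'Go Ski'
  } else {
    'Stay Home'
  }
}

weekday_decision <- ToSkiValue(snowfall = 2, day = 'Mon')
saturday_decision <- ToSkiValue(snowfall = 0, day = 'Sat')
```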