3/1/2022
For a statistician, or more generally a data scientist, the ability to manipulate, process, clean, and merge datasets is an essential skill.
The concept of tidy data is attributed to Hadley Wickham and rests on three principles for organizing data: each variable forms a column, each observation forms a row, and each type of observational unit forms a table. Tidy Data Reference
Tidy datasets are all alike, but every messy dataset is messy in its own way. - Hadley Wickham
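To make this concrete, here is a minimal sketch (with invented toy data) of the same information stored in a messy layout and in a tidy layout:

library(tibble)

# messy layout: the year variable is spread across the column names
messy <- tibble(country = c("A", "B"),
                yr2019  = c(10, 20),
                yr2020  = c(12, 25))

# tidy layout: each variable (country, year, value) is a column and
# each observation (one country in one year) is a row
tidy <- tibble(country = rep(c("A", "B"), each = 2),
               year    = rep(c(2019, 2020), times = 2),
               value   = c(10, 12, 20, 25))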
dplyr and ggplot2 require tidy data.

Source: US Census Fact Finder, General Economic Characteristics, ACS 2017
Raw data or summary data?
## # A tibble: 173 × 3
## Major ShareWomen Unemployment_rate
## <chr> <dbl> <dbl>
## 1 PETROLEUM ENGINEERING 0.121 0.0184
## 2 MINING AND MINERAL ENGINEERING 0.102 0.117
## 3 METALLURGICAL ENGINEERING 0.153 0.0241
## 4 NAVAL ARCHITECTURE AND MARINE ENGINEERING 0.107 0.0501
## 5 CHEMICAL ENGINEERING 0.342 0.0611
## 6 NUCLEAR ENGINEERING 0.145 0.177
## 7 ACTUARIAL SCIENCE 0.441 0.0957
## 8 ASTRONOMY AND ASTROPHYSICS 0.536 0.0212
## 9 MECHANICAL ENGINEERING 0.120 0.0573
## 10 ELECTRICAL ENGINEERING 0.196 0.0592
## # … with 163 more rows
Raw data or summary data?
## # A tibble: 16 × 2
## Major_category ave_med_salary
## <chr> <dbl>
## 1 Agriculture & Natural Resources 36900
## 2 Arts 33062.
## 3 Biology & Life Science 36421.
## 4 Business 43538.
## 5 Communications & Journalism 34500
## 6 Computers & Mathematics 42745.
## 7 Education 32350
## 8 Engineering 57383.
## 9 Health 36825
## 10 Humanities & Liberal Arts 31913.
## 11 Industrial Arts & Consumer Services 36343.
## 12 Interdisciplinary 35000
## 13 Law & Public Policy 42200
## 14 Physical Sciences 41890
## 15 Psychology & Social Work 30100
## 16 Social Science 37344.
Consider the two data frames below. How can we merge them, and what should the dimensions of the merged data frame be?
## school state
## 1 MSU MT
## 2 VT VA
## 3 Mines CO
## school enrollment
## 1 Mines 5794
## 2 MSU 15688
## 3 VT 30598
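The objects df1 and df2 are created off-slide; the following is an assumed reconstruction from the printouts above, so that the code that follows can be run as written.

df1 <- data.frame(school = c("MSU", "VT", "Mines"),
                  state  = c("MT", "VA", "CO"))
df2 <- data.frame(school = c("Mines", "MSU", "VT"),
                  enrollment = c(5794, 15688, 30598))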
One possibility is to sort (arrange) the data frames first.
df1 <- df1[order(df1$school),]
df2 <- df2[order(df2$school),]
df1
## school state
## 3 Mines CO
## 1 MSU MT
## 2 VT VA
df2
## school enrollment
## 1 Mines 5794
## 2 MSU 15688
## 3 VT 30598
Now, given that the data frames are both sorted the same way, we can bind the columns together.
comb_df <- cbind(df1, df2)
comb_df
## school state school enrollment
## 3 Mines CO Mines 5794
## 1 MSU MT MSU 15688
## 2 VT VA VT 30598
comb_df <- comb_df[,-3]  # drop the duplicated school column
Now assume we want to add another school to the data frame.
new.school <- c('Luther', 'IA', 2337)
rbind(comb_df, new.school)
## school state enrollment
## 3 Mines CO 5794
## 1 MSU MT 15688
## 2 VT VA 30598
## 4 Luther IA 2337
Note: If your strings are saved as factors, this chunk of code will give you an error.
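A minimal sketch of one workaround (assuming comb_df was built with factor columns, as data.frame() did by default before R 4.0): convert the factors to character before row-binding.

# convert factor columns to character so the new values are accepted
comb_df$school <- as.character(comb_df$school)
comb_df$state  <- as.character(comb_df$state)
rbind(comb_df, new.school)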
dplyr also contains functions for

- binding rows: bind_rows() (see the sketch below)
- binding columns: bind_cols()
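A minimal sketch using bind_rows() on the comb_df built above: columns are matched by name, so the new school can be supplied as a one-row data frame and column order (or factor coding) does not matter.

library(dplyr)

comb_df %>%
  bind_rows(data.frame(school = "Luther", state = "IA", enrollment = 2337))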
dplyr
Information on 10 women in science who changed the world.
name |
---|
Ada Lovelace |
Marie Curie |
Janaki Ammal |
Chien-Shiung Wu |
Katherine Johnson |
Rosalind Franklin |
Vera Rubin |
Gladys West |
Flossie Wong-Staal |
Jennifer Doudna |
Source: Discover Magazine
professions
## # A tibble: 10 × 2
## name profession
## <chr> <chr>
## 1 Ada Lovelace Mathematician
## 2 Marie Curie Physicist and Chemist
## 3 Janaki Ammal Botanist
## 4 Chien-Shiung Wu Physicist
## 5 Katherine Johnson Mathematician
## 6 Rosalind Franklin Chemist
## 7 Vera Rubin Astronomer
## 8 Gladys West Mathematician
## 9 Flossie Wong-Staal Virologist and Molecular Biologist
## 10 Jennifer Doudna Biochemist
dates
## # A tibble: 8 × 3
## name birth_year death_year
## <chr> <dbl> <dbl>
## 1 Janaki Ammal 1897 1984
## 2 Chien-Shiung Wu 1912 1997
## 3 Katherine Johnson 1918 2020
## 4 Rosalind Franklin 1920 1958
## 5 Vera Rubin 1928 2016
## 6 Gladys West 1930 NA
## 7 Flossie Wong-Staal 1947 NA
## 8 Jennifer Doudna 1964 NA
works
## # A tibble: 9 × 2
## name known_for
## <chr> <chr>
## 1 Ada Lovelace first computer algorithm
## 2 Marie Curie theory of radioactivity, discovery of elements polonium a…
## 3 Janaki Ammal hybrid species, biodiversity protection
## 4 Chien-Shiung Wu confim and refine theory of radioactive beta decy, Wu expe…
## 5 Katherine Johnson calculations of orbital mechanics critical to sending the …
## 6 Vera Rubin existence of dark matter
## 7 Gladys West mathematical modeling of the shape of the Earth which serv…
## 8 Flossie Wong-Staal first scientist to clone HIV and create a map of its genes…
## 9 Jennifer Doudna one of the primary developers of CRISPR, a ground-breaking…
## # A tibble: 10 × 5
## name profession birth_year death_year known_for
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Ada Lovelace Mathematician NA NA first co…
## 2 Marie Curie Physicist and Chemist NA NA theory o…
## 3 Janaki Ammal Botanist 1897 1984 hybrid s…
## 4 Chien-Shiung Wu Physicist 1912 1997 confim a…
## 5 Katherine Johnson Mathematician 1918 2020 calculat…
## 6 Rosalind Franklin Chemist 1920 1958 <NA>
## 7 Vera Rubin Astronomer 1928 2016 existenc…
## 8 Gladys West Mathematician 1930 NA mathemat…
## 9 Flossie Wong-Staal Virologist and Molecular … 1947 NA first sc…
## 10 Jennifer Doudna Biochemist 1964 NA one of t…
names(professions)
## [1] "name" "profession"
names(dates)
## [1] "name" "birth_year" "death_year"
names(works)
## [1] "name" "known_for"
nrow(professions)
## [1] 10
nrow(dates)
## [1] 8
nrow(works)
## [1] 9
something_join(x, y)
- left_join(): all rows from x
- right_join(): all rows from y
- full_join(): all rows from both x and y
- semi_join(): all rows from x where there are matching values in y, keeping just the columns from x
- inner_join(): all rows from x where there are matching values in y, returning all combinations in the case of multiple matches
- anti_join(): all rows from x where there are no matching values in y, never duplicating rows of x

For the next few slides…
x
## # A tibble: 3 × 2
## id value_x
## <dbl> <chr>
## 1 1 x1
## 2 2 x2
## 3 3 x3
y
## # A tibble: 3 × 2
## id value_y
## <dbl> <chr>
## 1 1 y1
## 2 2 y2
## 3 4 y4
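A minimal sketch constructing x and y as shown above (definitions assumed from the printouts) and trying a few of the join verbs on them:

library(dplyr)
library(tibble)

x <- tibble(id = c(1, 2, 3), value_x = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), value_y = c("y1", "y2", "y4"))

left_join(x, y, by = "id")   # 3 rows; value_y is NA for id 3
inner_join(x, y, by = "id")  # 2 rows; only ids 1 and 2 match
anti_join(x, y, by = "id")   # 1 row; id 3 has no match in y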
left_join()
professions %>% left_join(dates)
## # A tibble: 10 × 4
## name profession birth_year death_year
## <chr> <chr> <dbl> <dbl>
## 1 Ada Lovelace Mathematician NA NA
## 2 Marie Curie Physicist and Chemist NA NA
## 3 Janaki Ammal Botanist 1897 1984
## 4 Chien-Shiung Wu Physicist 1912 1997
## 5 Katherine Johnson Mathematician 1918 2020
## 6 Rosalind Franklin Chemist 1920 1958
## 7 Vera Rubin Astronomer 1928 2016
## 8 Gladys West Mathematician 1930 NA
## 9 Flossie Wong-Staal Virologist and Molecular Biologist 1947 NA
## 10 Jennifer Doudna Biochemist 1964 NA
right_join()
professions %>% right_join(dates)
## # A tibble: 8 × 4
## name profession birth_year death_year
## <chr> <chr> <dbl> <dbl>
## 1 Janaki Ammal Botanist 1897 1984
## 2 Chien-Shiung Wu Physicist 1912 1997
## 3 Katherine Johnson Mathematician 1918 2020
## 4 Rosalind Franklin Chemist 1920 1958
## 5 Vera Rubin Astronomer 1928 2016
## 6 Gladys West Mathematician 1930 NA
## 7 Flossie Wong-Staal Virologist and Molecular Biologist 1947 NA
## 8 Jennifer Doudna Biochemist 1964 NA
full_join()
dates %>% full_join(works)
## # A tibble: 10 × 4
## name birth_year death_year known_for
## <chr> <dbl> <dbl> <chr>
## 1 Janaki Ammal 1897 1984 hybrid species, biodiversity protec…
## 2 Chien-Shiung Wu 1912 1997 confim and refine theory of radioac…
## 3 Katherine Johnson 1918 2020 calculations of orbital mechanics c…
## 4 Rosalind Franklin 1920 1958 <NA>
## 5 Vera Rubin 1928 2016 existence of dark matter
## 6 Gladys West 1930 NA mathematical modeling of the shape …
## 7 Flossie Wong-Staal 1947 NA first scientist to clone HIV and cr…
## 8 Jennifer Doudna 1964 NA one of the primary developers of CR…
## 9 Ada Lovelace NA NA first computer algorithm
## 10 Marie Curie NA NA theory of radioactivity, discovery…
inner_join()
dates %>% inner_join(works)
## # A tibble: 7 × 4
## name birth_year death_year known_for
## <chr> <dbl> <dbl> <chr>
## 1 Janaki Ammal 1897 1984 hybrid species, biodiversity protect…
## 2 Chien-Shiung Wu 1912 1997 confim and refine theory of radioact…
## 3 Katherine Johnson 1918 2020 calculations of orbital mechanics cr…
## 4 Vera Rubin 1928 2016 existence of dark matter
## 5 Gladys West 1930 NA mathematical modeling of the shape o…
## 6 Flossie Wong-Staal 1947 NA first scientist to clone HIV and cre…
## 7 Jennifer Doudna 1964 NA one of the primary developers of CRI…
semi_join()
dates %>% semi_join(works)
## # A tibble: 7 × 3
## name birth_year death_year
## <chr> <dbl> <dbl>
## 1 Janaki Ammal 1897 1984
## 2 Chien-Shiung Wu 1912 1997
## 3 Katherine Johnson 1918 2020
## 4 Vera Rubin 1928 2016
## 5 Gladys West 1930 NA
## 6 Flossie Wong-Staal 1947 NA
## 7 Jennifer Doudna 1964 NA
anti_join()
dates %>% anti_join(works)
## # A tibble: 1 × 3
## name birth_year death_year
## <chr> <dbl> <dbl>
## 1 Rosalind Franklin 1920 1958
professions %>%
  left_join(dates) %>%
  left_join(works)
## # A tibble: 10 × 5
## name profession birth_year death_year known_for
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Ada Lovelace Mathematician NA NA first co…
## 2 Marie Curie Physicist and Chemist NA NA theory o…
## 3 Janaki Ammal Botanist 1897 1984 hybrid s…
## 4 Chien-Shiung Wu Physicist 1912 1997 confim a…
## 5 Katherine Johnson Mathematician 1918 2020 calculat…
## 6 Rosalind Franklin Chemist 1920 1958 <NA>
## 7 Vera Rubin Astronomer 1928 2016 existenc…
## 8 Gladys West Mathematician 1930 NA mathemat…
## 9 Flossie Wong-Staal Virologist and Molecular … 1947 NA first sc…
## 10 Jennifer Doudna Biochemist 1964 NA one of t…
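Because no by argument is given, dplyr guesses the join column and prints a message such as Joining, by = "name" (the exact wording depends on the dplyr version). Being explicit is a good habit; a sketch of the same pipeline with the key spelled out:

professions %>%
  left_join(dates, by = "name") %>%
  left_join(works, by = "name")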
enrollment
## # A tibble: 3 × 2
## id name
## <dbl> <chr>
## 1 1 Dave Friday
## 2 2 Hermine
## 3 3 Sura Selvarajah
survey
## # A tibble: 4 × 3
## id name username
## <dbl> <chr> <chr>
## 1 2 Hermine bakealongwithhermine
## 2 3 Sura surasbakes
## 3 4 Peter peter_bakes
## 4 5 Mark thebakingbuddha
enrollment %>% left_join(survey, by = "id")
## # A tibble: 3 × 4
## id name.x name.y username
## <dbl> <chr> <chr> <chr>
## 1 1 Dave Friday <NA> <NA>
## 2 2 Hermine Hermine bakealongwithhermine
## 3 3 Sura Selvarajah Sura surasbakes
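Both tables contain a name column, but we only join on id, so dplyr keeps both copies as name.x and name.y. One possible fix (a sketch that keeps the enrollment version of name) is to drop the survey copy before joining:

enrollment %>%
  left_join(survey %>% select(-name), by = "id")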
enrollment %>% anti_join(survey, by = "id")
## # A tibble: 1 × 2
## id name
## <dbl> <chr>
## 1 1 Dave Friday
survey %>% anti_join(enrollment, by = "id")
## # A tibble: 2 × 3
## id name username
## <dbl> <chr> <chr>
## 1 4 Peter peter_bakes
## 2 5 Mark thebakingbuddha
purchases
## # A tibble: 5 × 2
## customer_id item
## <dbl> <chr>
## 1 1 bread
## 2 1 milk
## 3 1 banana
## 4 2 milk
## 5 2 toilet paper
prices
## # A tibble: 5 × 2
## item price
## <chr> <dbl>
## 1 avocado 0.5
## 2 banana 0.15
## 3 bread 1
## 4 milk 0.8
## 5 toilet paper 3
purchases %>% left_join(prices)
## # A tibble: 5 × 3
## customer_id item price
## <dbl> <chr> <dbl>
## 1 1 bread 1
## 2 1 milk 0.8
## 3 1 banana 0.15
## 4 2 milk 0.8
## 5 2 toilet paper 3
purchases %>%
  left_join(prices) %>%
  summarise(total_revenue = sum(price))
## # A tibble: 1 × 1
## total_revenue
## <dbl>
## 1 5.75
purchases %>% left_join(prices)
## # A tibble: 5 × 3
## customer_id item price
## <dbl> <chr> <dbl>
## 1 1 bread 1
## 2 1 milk 0.8
## 3 1 banana 0.15
## 4 2 milk 0.8
## 5 2 toilet paper 3
purchases %>%
  left_join(prices) %>%
  group_by(customer_id) %>%
  summarise(total_revenue = sum(price))
## # A tibble: 2 × 2
## customer_id total_revenue
## <dbl> <dbl>
## 1 1 1.95
## 2 2 3.8
Combine the following information into a single table sorted alphabetically by the name of the ski hill.
ski_acres
## ski.hill skiable.acres
## 1 Big Sky 5800
## 2 Bridger Bowl 2000
## 3 Jackson 2500+
## 4 Steamboat 2965
df_cost
## ski.resort ticket.cost
## 1 Bridger Bowl 60
## 2 Big Sky depends
## 3 Steamboat 145
## 4 Jackson 130
disco
## [1] "Discovery" "2200" "20"
df_comb <- ski_acres %>%
  full_join(df_cost, by = c("ski.hill" = "ski.resort")) %>%
  rbind(disco) %>% #<<<
  arrange(ski.hill)
str(df_comb)
## 'data.frame': 5 obs. of 3 variables:
## $ ski.hill : chr "Big Sky" "Bridger Bowl" "Discovery" "Jackson" ...
## $ skiable.acres: chr "5800" "2000" "2200" "2500+" ...
## $ ticket.cost : chr "depends" "60" "20" "130" ...
disco_df <- data.frame(matrix(disco, nrow = 1))
names(disco_df) <- c("ski.hill", "skiable.acres", "ticket.cost")
df_comb <- ski_acres %>%
  full_join(df_cost, by = c("ski.hill" = "ski.resort")) %>%
  full_join(disco_df) %>% #<<<
  arrange(ski.hill)
str(df_comb)
## 'data.frame': 5 obs. of 3 variables:
## $ ski.hill : chr "Big Sky" "Bridger Bowl" "Discovery" "Jackson" ...
## $ skiable.acres: chr "5800" "2000" "2200" "2500+" ...
## $ ticket.cost : chr "depends" "60" "20" "130" ...
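Every column ends up as character because of entries like "2500+" and "depends". This isn't part of the original exercise, but if numeric columns are needed later, one possible cleanup (a sketch assuming the readr package) is parse_number(), which keeps the digits in "2500+" and turns "depends" into NA with a warning:

library(readr)

# parse_number("2500+") returns 2500; parse_number("depends") returns NA
df_comb$skiable.acres <- parse_number(df_comb$skiable.acres)
df_comb$ticket.cost   <- suppressWarnings(parse_number(df_comb$ticket.cost))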
We have data organised in a way that is not ideal for our analysis.
We want to reorganise the data so that we can carry on with our analysis.
We have…
## # A tibble: 2 × 4
## customer_id item_1 item_2 item_3
## <dbl> <chr> <chr> <chr>
## 1 1 bread milk banana
## 2 2 milk toilet paper <NA>
We want…
## # A tibble: 6 × 3
## customer_id item_no item
## <dbl> <chr> <chr>
## 1 1 item_1 bread
## 2 1 item_2 milk
## 3 1 item_3 banana
## 4 2 item_1 milk
## 5 2 item_2 toilet paper
## 6 2 item_3 <NA>
The goal of tidyr is to help you tidy your data, for example by pivoting between wider and longer formats and by making explicit how NAs should be treated.

Wider = more columns
## # A tibble: 2 × 4
## customer_id item_1 item_2 item_3
## <dbl> <chr> <chr> <chr>
## 1 1 bread milk banana
## 2 2 milk toilet paper <NA>
Longer = more rows
## # A tibble: 6 × 3
## customer_id item_no item
## <dbl> <chr> <chr>
## 1 1 item_1 bread
## 2 1 item_2 milk
## 3 1 item_3 banana
## 4 2 item_1 milk
## 5 2 item_2 toilet paper
## 6 2 item_3 <NA>
pivot_longer()

- data (as usual)
- cols: columns to pivot into longer format
- names_to: name of the column where the column names of the pivoted variables go (a character string)
- values_to: name of the column where the data in the pivoted variables go (a character string)

pivot_longer(
  data,
  cols,
  names_to = "name",
  values_to = "value"
)
purchases <- customers %>%
  pivot_longer(
    cols = item_1:item_3,   # variables item_1 to item_3
    names_to = "item_no",   # column names -> new column called item_no
    values_to = "item"      # values in columns -> new column called item
  )
purchases
## # A tibble: 6 × 3
## customer_id item_no item
## <dbl> <chr> <chr>
## 1 1 item_1 bread
## 2 1 item_2 milk
## 3 1 item_3 banana
## 4 2 item_1 milk
## 5 2 item_2 toilet paper
## 6 2 item_3 <NA>
Most likely, because the next step of your analysis needs it
prices
## # A tibble: 5 × 2
## item price
## <chr> <dbl>
## 1 avocado 0.5
## 2 banana 0.15
## 3 bread 1
## 4 milk 0.8
## 5 toilet paper 3
purchases %>% left_join(prices)
## # A tibble: 6 × 4
## customer_id item_no item price
## <dbl> <chr> <chr> <dbl>
## 1 1 item_1 bread 1
## 2 1 item_2 milk 0.8
## 3 1 item_3 banana 0.15
## 4 2 item_1 milk 0.8
## 5 2 item_2 toilet paper 3
## 6 2 item_3 <NA> NA
pivot_wider()

- data (as usual)
- names_from: which column in the long format contains what should be the column names in the wide format
- values_from: which column in the long format contains what should be the values in the new columns in the wide format

purchases %>%
  pivot_wider(
    names_from = item_no,
    values_from = item
  )
## # A tibble: 2 × 4
## customer_id item_1 item_2 item_3
## <dbl> <chr> <chr> <chr>
## 1 1 bread milk banana
## 2 2 milk toilet paper <NA>
Consider the billboard dataset (contained in the tidyr package), which contains the rank of each song (in 2000) for each week after it first entered the list.
billboard
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # … with 307 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
## # wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## # wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## # wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## # wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## # wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## # wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, …
billboard data

If we want to identify songs that reach number 1 quickly, the data need to be wrangled.
billboard %>%
  select(artist, track, date.entered, wk1, wk2) %>%
  pivot_longer(
    cols = c('wk1', 'wk2'),
    names_to = 'week',
    values_to = 'rank',
    values_drop_na = TRUE
  )
## # A tibble: 629 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 4 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 5 3 Doors Down Kryptonite 2000-04-08 wk1 81
## 6 3 Doors Down Kryptonite 2000-04-08 wk2 70
## 7 3 Doors Down Loser 2000-10-21 wk1 76
## 8 3 Doors Down Loser 2000-10-21 wk2 76
## 9 504 Boyz Wobble Wobble 2000-04-15 wk1 57
## 10 504 Boyz Wobble Wobble 2000-04-15 wk2 34
## # … with 619 more rows
billboard %>%
  pivot_longer(
    cols = starts_with('wk'),
    names_to = 'week',
    values_to = 'rank',
    values_drop_na = TRUE) %>%
  mutate(week_numb = as.numeric(str_replace(week, 'wk', ''))) %>%
  filter(rank == 1) %>%
  arrange(week_numb) %>%
  slice(1) %>%
  kable()
artist | track | date.entered | week | rank | week_numb |
---|---|---|---|---|---|
Madonna | Music | 2000-08-12 | wk6 | 1 | 6 |
Determine which song in this dataset spent the most time at #1.
billboard_long <- billboard %>%
  pivot_longer(
    cols = starts_with('wk'),
    names_to = 'week',
    values_to = 'rank',
    values_drop_na = TRUE) %>%
  mutate(week_numb = as.numeric(str_replace(week, 'wk', ''))) %>%
  filter(rank == 1) %>%
  group_by(track) %>%
  tally() %>%
  arrange(desc(n))
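The chunk above only stores the summary; the table below was presumably printed with something along the lines of:

knitr::kable(billboard_long)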
track | n |
---|---|
Independent Women Pa… | 11 |
Maria, Maria | 10 |
Come On Over Baby (A… | 4 |
I Knew I Loved You | 4 |
Music | 4 |
Be With You | 3 |
Doesn’t Really Matte… | 3 |
Say My Name | 3 |
Amazed | 2 |
Incomplete | 2 |
It’s Gonna Be Me | 2 |
What A Girl Wants | 2 |
Bent | 1 |