3/1/2022

Data Wrangling

Data Wrangling

As a statistician or more generally a data scientist the ability to manipulate, process, clean, and merge datasets is an essential skill.

  • These skills are generally referred to as data wrangling or munging.
  • In a data analysis or visualization setting, they will undoubtedly require a majority of your time.
  • Wrangling data can be a painful process.
  • This lecture will provide some tools and examples of organizing data.

Data Wrangling Concepts

  • Wide and thin datasets
  • Merging and joining relational data
  • Dealing with strings
  • Dealing with date/time objects

Tidy Data

Rules for Tidy Data

The concept of tidy data can be attributed to Hadley Wickham and has three principles for organizing data. Tidy Data Reference

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table (with a single value in each cell).

Rules for Tidy Data

Visual Representation of Tidy Data. Source: R4DS

Visual Representation of Tidy Data. Source: R4DS

Why use Tidy Data

Tidy datasets are all alike, but every messy dataset is messy in its own way. - Hadley Wickham

  • Storing data in a consistent way gives familiarity with methods for manipulating data.
  • Tidy data structure takes advantage of vectorised operations in R.
  • Many useful packages: such as dplyr and ggplot2 require tidy data.

What are messy data?

What are messy data?

Displaying vs summarizing data

  • Summary data might look “tidy”, but its rows are not the observational units
  • Raw data can produce summary data, but you can’t go back to raw data from summary data

Displaying vs summarizing data

Raw data or summary data?

## # A tibble: 173 × 3
##    Major                                     ShareWomen Unemployment_rate
##    <chr>                                          <dbl>             <dbl>
##  1 PETROLEUM ENGINEERING                          0.121            0.0184
##  2 MINING AND MINERAL ENGINEERING                 0.102            0.117 
##  3 METALLURGICAL ENGINEERING                      0.153            0.0241
##  4 NAVAL ARCHITECTURE AND MARINE ENGINEERING      0.107            0.0501
##  5 CHEMICAL ENGINEERING                           0.342            0.0611
##  6 NUCLEAR ENGINEERING                            0.145            0.177 
##  7 ACTUARIAL SCIENCE                              0.441            0.0957
##  8 ASTRONOMY AND ASTROPHYSICS                     0.536            0.0212
##  9 MECHANICAL ENGINEERING                         0.120            0.0573
## 10 ELECTRICAL ENGINEERING                         0.196            0.0592
## # … with 163 more rows

Displaying vs summarizing data

Raw data or summary data?

## # A tibble: 16 × 2
##    Major_category                      ave_med_salary
##    <chr>                                        <dbl>
##  1 Agriculture & Natural Resources             36900 
##  2 Arts                                        33062.
##  3 Biology & Life Science                      36421.
##  4 Business                                    43538.
##  5 Communications & Journalism                 34500 
##  6 Computers & Mathematics                     42745.
##  7 Education                                   32350 
##  8 Engineering                                 57383.
##  9 Health                                      36825 
## 10 Humanities & Liberal Arts                   31913.
## 11 Industrial Arts & Consumer Services         36343.
## 12 Interdisciplinary                           35000 
## 13 Law & Public Policy                         42200 
## 14 Physical Sciences                           41890 
## 15 Psychology & Social Work                    30100 
## 16 Social Science                              37344.

Merge / Join

Merging in Base R

Consider the two data frames, how can we merge them and what should be the dimensions of the merged data frame.

##   school state
## 1    MSU    MT
## 2     VT    VA
## 3  Mines    CO
##   school enrollment
## 1  Mines       5794
## 2    MSU      15688
## 3     VT      30598

pre-sort

One possibility is to use the arrange the data frames first.

df1 <- df1[order(df1$school),]
df2 <- df2[order(df2$school),]

pre-sort

One possibility is to use the arrange the data frames first.

df1 
##   school state
## 3  Mines    CO
## 1    MSU    MT
## 2     VT    VA
df2 
##   school enrollment
## 1  Mines       5794
## 2    MSU      15688
## 3     VT      30598

rbind() and cbind()

Now, given that the data frames are both sorted the same way, we can bind the columns together.

comb_df <- cbind(df1,df2)
comb_df
##   school state school enrollment
## 3  Mines    CO  Mines       5794
## 1    MSU    MT    MSU      15688
## 2     VT    VA     VT      30598
comb_df <- comb_df[,-3]

rbind() and cbind()

Now assume we want to add another school to the data frame.

new.school <- c('Luther', 'IA',2337)
rbind(comb_df, new.school)
##   school state enrollment
## 3  Mines    CO       5794
## 1    MSU    MT      15688
## 2     VT    VA      30598
## 4 Luther    IA       2337

Note: If your strings are saved as factors, this chunk of code will give you an error.

bind_rows() / bind_cols()

dplyr also contains functions for - binding rows: bind_rows() - binding columns: bind_cols()

Joins in dplyr

Data: Women in science

Information on 10 women in science who changed the world.

name
Ada Lovelace
Marie Curie
Janaki Ammal
Chien-Shiung Wu
Katherine Johnson
Rosalind Franklin
Vera Rubin
Gladys West
Flossie Wong-Staal
Jennifer Doudna

Source: Discover Magazine

Inputs

professions
## # A tibble: 10 × 2
##    name               profession                        
##    <chr>              <chr>                             
##  1 Ada Lovelace       Mathematician                     
##  2 Marie Curie        Physicist and Chemist             
##  3 Janaki Ammal       Botanist                          
##  4 Chien-Shiung Wu    Physicist                         
##  5 Katherine Johnson  Mathematician                     
##  6 Rosalind Franklin  Chemist                           
##  7 Vera Rubin         Astronomer                        
##  8 Gladys West        Mathematician                     
##  9 Flossie Wong-Staal Virologist and Molecular Biologist
## 10 Jennifer Doudna    Biochemist

Inputs

dates
## # A tibble: 8 × 3
##   name               birth_year death_year
##   <chr>                   <dbl>      <dbl>
## 1 Janaki Ammal             1897       1984
## 2 Chien-Shiung Wu          1912       1997
## 3 Katherine Johnson        1918       2020
## 4 Rosalind Franklin        1920       1958
## 5 Vera Rubin               1928       2016
## 6 Gladys West              1930         NA
## 7 Flossie Wong-Staal       1947         NA
## 8 Jennifer Doudna          1964         NA

Inputs

works
## # A tibble: 9 × 2
##   name               known_for                                                  
##   <chr>              <chr>                                                      
## 1 Ada Lovelace       first computer algorithm                                   
## 2 Marie Curie        theory of radioactivity,  discovery of elements polonium a…
## 3 Janaki Ammal       hybrid species, biodiversity protection                    
## 4 Chien-Shiung Wu    confim and refine theory of radioactive beta decy, Wu expe…
## 5 Katherine Johnson  calculations of orbital mechanics critical to sending the …
## 6 Vera Rubin         existence of dark matter                                   
## 7 Gladys West        mathematical modeling of the shape of the Earth which serv…
## 8 Flossie Wong-Staal first scientist to clone HIV and create a map of its genes…
## 9 Jennifer Doudna    one of the primary developers of CRISPR, a ground-breaking…

Desired output

## # A tibble: 10 × 5
##    name               profession                 birth_year death_year known_for
##    <chr>              <chr>                           <dbl>      <dbl> <chr>    
##  1 Ada Lovelace       Mathematician                      NA         NA first co…
##  2 Marie Curie        Physicist and Chemist              NA         NA theory o…
##  3 Janaki Ammal       Botanist                         1897       1984 hybrid s…
##  4 Chien-Shiung Wu    Physicist                        1912       1997 confim a…
##  5 Katherine Johnson  Mathematician                    1918       2020 calculat…
##  6 Rosalind Franklin  Chemist                          1920       1958 <NA>     
##  7 Vera Rubin         Astronomer                       1928       2016 existenc…
##  8 Gladys West        Mathematician                    1930         NA mathemat…
##  9 Flossie Wong-Staal Virologist and Molecular …       1947         NA first sc…
## 10 Jennifer Doudna    Biochemist                       1964         NA one of t…

Inputs, reminder

names(professions)
## [1] "name"       "profession"
names(dates)
## [1] "name"       "birth_year" "death_year"
names(works)
## [1] "name"      "known_for"
nrow(professions)
## [1] 10
nrow(dates)
## [1] 8
nrow(works)
## [1] 9

Joining data frames

something_join(x, y)
  • left_join(): all rows from x
  • right_join(): all rows from y
  • full_join(): all rows from both x and y
  • semi_join(): all rows from x where there are matching values in y, keeping just columns from x
  • inner_join(): all rows from x where there are matching values in y, return all combination of multiple matches in the case of multiple matches
  • anti_join(): return all rows from x where there are not matching values in y, never duplicate rows of x

Setup

For the next few slides…

x
## # A tibble: 3 × 2
##      id value_x
##   <dbl> <chr>  
## 1     1 x1     
## 2     2 x2     
## 3     3 x3
y
## # A tibble: 3 × 2
##      id value_y
##   <dbl> <chr>  
## 1     1 y1     
## 2     2 y2     
## 3     4 y4

left_join()



professions %>%
  left_join(dates) 
## # A tibble: 10 × 4
##    name               profession                         birth_year death_year
##    <chr>              <chr>                                   <dbl>      <dbl>
##  1 Ada Lovelace       Mathematician                              NA         NA
##  2 Marie Curie        Physicist and Chemist                      NA         NA
##  3 Janaki Ammal       Botanist                                 1897       1984
##  4 Chien-Shiung Wu    Physicist                                1912       1997
##  5 Katherine Johnson  Mathematician                            1918       2020
##  6 Rosalind Franklin  Chemist                                  1920       1958
##  7 Vera Rubin         Astronomer                               1928       2016
##  8 Gladys West        Mathematician                            1930         NA
##  9 Flossie Wong-Staal Virologist and Molecular Biologist       1947         NA
## 10 Jennifer Doudna    Biochemist                               1964         NA

right_join()



professions %>%
  right_join(dates) 
## # A tibble: 8 × 4
##   name               profession                         birth_year death_year
##   <chr>              <chr>                                   <dbl>      <dbl>
## 1 Janaki Ammal       Botanist                                 1897       1984
## 2 Chien-Shiung Wu    Physicist                                1912       1997
## 3 Katherine Johnson  Mathematician                            1918       2020
## 4 Rosalind Franklin  Chemist                                  1920       1958
## 5 Vera Rubin         Astronomer                               1928       2016
## 6 Gladys West        Mathematician                            1930         NA
## 7 Flossie Wong-Staal Virologist and Molecular Biologist       1947         NA
## 8 Jennifer Doudna    Biochemist                               1964         NA

full_join()



dates %>%
  full_join(works) 
## # A tibble: 10 × 4
##    name               birth_year death_year known_for                           
##    <chr>                   <dbl>      <dbl> <chr>                               
##  1 Janaki Ammal             1897       1984 hybrid species, biodiversity protec…
##  2 Chien-Shiung Wu          1912       1997 confim and refine theory of radioac…
##  3 Katherine Johnson        1918       2020 calculations of orbital mechanics c…
##  4 Rosalind Franklin        1920       1958 <NA>                                
##  5 Vera Rubin               1928       2016 existence of dark matter            
##  6 Gladys West              1930         NA mathematical modeling of the shape …
##  7 Flossie Wong-Staal       1947         NA first scientist to clone HIV and cr…
##  8 Jennifer Doudna          1964         NA one of the primary developers of CR…
##  9 Ada Lovelace               NA         NA first computer algorithm            
## 10 Marie Curie                NA         NA theory of radioactivity,  discovery…

inner_join()



dates %>%
  inner_join(works) 
## # A tibble: 7 × 4
##   name               birth_year death_year known_for                            
##   <chr>                   <dbl>      <dbl> <chr>                                
## 1 Janaki Ammal             1897       1984 hybrid species, biodiversity protect…
## 2 Chien-Shiung Wu          1912       1997 confim and refine theory of radioact…
## 3 Katherine Johnson        1918       2020 calculations of orbital mechanics cr…
## 4 Vera Rubin               1928       2016 existence of dark matter             
## 5 Gladys West              1930         NA mathematical modeling of the shape o…
## 6 Flossie Wong-Staal       1947         NA first scientist to clone HIV and cre…
## 7 Jennifer Doudna          1964         NA one of the primary developers of CRI…

semi_join()



dates %>%
  semi_join(works) 
## # A tibble: 7 × 3
##   name               birth_year death_year
##   <chr>                   <dbl>      <dbl>
## 1 Janaki Ammal             1897       1984
## 2 Chien-Shiung Wu          1912       1997
## 3 Katherine Johnson        1918       2020
## 4 Vera Rubin               1928       2016
## 5 Gladys West              1930         NA
## 6 Flossie Wong-Staal       1947         NA
## 7 Jennifer Doudna          1964         NA

anti_join()



dates %>%
  anti_join(works) 
## # A tibble: 1 × 3
##   name              birth_year death_year
##   <chr>                  <dbl>      <dbl>
## 1 Rosalind Franklin       1920       1958

Putting it altogether

professions %>%
  left_join(dates) %>%
  left_join(works)
## # A tibble: 10 × 5
##    name               profession                 birth_year death_year known_for
##    <chr>              <chr>                           <dbl>      <dbl> <chr>    
##  1 Ada Lovelace       Mathematician                      NA         NA first co…
##  2 Marie Curie        Physicist and Chemist              NA         NA theory o…
##  3 Janaki Ammal       Botanist                         1897       1984 hybrid s…
##  4 Chien-Shiung Wu    Physicist                        1912       1997 confim a…
##  5 Katherine Johnson  Mathematician                    1918       2020 calculat…
##  6 Rosalind Franklin  Chemist                          1920       1958 <NA>     
##  7 Vera Rubin         Astronomer                       1928       2016 existenc…
##  8 Gladys West        Mathematician                    1930         NA mathemat…
##  9 Flossie Wong-Staal Virologist and Molecular …       1947         NA first sc…
## 10 Jennifer Doudna    Biochemist                       1964         NA one of t…

Exercise: Student records

Student records

  • Have:
    • Enrollment: official university enrollment records
    • Survey: Student provided info missing students who never filled it out and including students who filled it out but dropped the class
  • Want:
    • Survey info for all enrolled in class
    • Students who are enrolled in class but missing survey
    • Students who took the survey but are no longer enrolled

enrollment
## # A tibble: 3 × 2
##      id name           
##   <dbl> <chr>          
## 1     1 Dave Friday    
## 2     2 Hermine        
## 3     3 Sura Selvarajah
survey
## # A tibble: 4 × 3
##      id name    username            
##   <dbl> <chr>   <chr>               
## 1     2 Hermine bakealongwithhermine
## 2     3 Sura    surasbakes          
## 3     4 Peter   peter_bakes         
## 4     5 Mark    thebakingbuddha

Student records: Solution

In class

enrollment %>% 
  left_join(survey, by = "id")  
## # A tibble: 3 × 4
##      id name.x          name.y  username            
##   <dbl> <chr>           <chr>   <chr>               
## 1     1 Dave Friday     <NA>    <NA>                
## 2     2 Hermine         Hermine bakealongwithhermine
## 3     3 Sura Selvarajah Sura    surasbakes

Survey missing

enrollment %>% 
  anti_join(survey, by = "id")  
## # A tibble: 1 × 2
##      id name       
##   <dbl> <chr>      
## 1     1 Dave Friday

Dropped

survey %>% 
  anti_join(enrollment, by = "id")  
## # A tibble: 2 × 3
##      id name  username       
##   <dbl> <chr> <chr>          
## 1     4 Peter peter_bakes    
## 2     5 Mark  thebakingbuddha

Exercise: Grocery sales

Grocery sales

  • Have:
    • Purchases: One row per customer per item, listing purchases they made
    • Prices: One row per item in the store, listing their prices
  • Want:
    • Total revenue (over all customers)
    • Revenue per customer

purchases
## # A tibble: 5 × 2
##   customer_id item        
##         <dbl> <chr>       
## 1           1 bread       
## 2           1 milk        
## 3           1 banana      
## 4           2 milk        
## 5           2 toilet paper


prices
## # A tibble: 5 × 2
##   item         price
##   <chr>        <dbl>
## 1 avocado       0.5 
## 2 banana        0.15
## 3 bread         1   
## 4 milk          0.8 
## 5 toilet paper  3

Grocery sales: Solution

Total revenue

purchases %>% 
  left_join(prices)  
## # A tibble: 5 × 3
##   customer_id item         price
##         <dbl> <chr>        <dbl>
## 1           1 bread         1   
## 2           1 milk          0.8 
## 3           1 banana        0.15
## 4           2 milk          0.8 
## 5           2 toilet paper  3
purchases %>% 
  left_join(prices) %>%
  summarise(total_revenue = sum(price))  
## # A tibble: 1 × 1
##   total_revenue
##           <dbl>
## 1          5.75

Revenue by customer

purchases %>% 
  left_join(prices)
## # A tibble: 5 × 3
##   customer_id item         price
##         <dbl> <chr>        <dbl>
## 1           1 bread         1   
## 2           1 milk          0.8 
## 3           1 banana        0.15
## 4           2 milk          0.8 
## 5           2 toilet paper  3
purchases %>% 
  left_join(prices) %>%
  group_by(customer_id) %>%  
  summarise(total_revenue = sum(price))
## # A tibble: 2 × 2
##   customer_id total_revenue
##         <dbl>         <dbl>
## 1           1          1.95
## 2           2          3.8

Exercise: Ski hills

Ski hills

Combine the following information into a single table sorted alphabetically by the name of the ski hill.

ski_acres
##       ski.hill skiable.acres
## 1      Big Sky          5800
## 2 Bridger Bowl          2000
## 3      Jackson         2500+
## 4    Steamboat          2965





df_cost
##     ski.resort ticket.cost
## 1 Bridger Bowl          60
## 2      Big Sky     depends
## 3    Steamboat         145
## 4      Jackson         130
disco
## [1] "Discovery" "2200"      "20"

Solution option 1

df_comb <- ski_acres %>% 
  full_join(df_cost, by = c("ski.hill" = "ski.resort")) %>%
  rbind(disco) %>% #<<<
  arrange(ski.hill)
str(df_comb)
## 'data.frame':    5 obs. of  3 variables:
##  $ ski.hill     : chr  "Big Sky" "Bridger Bowl" "Discovery" "Jackson" ...
##  $ skiable.acres: chr  "5800" "2000" "2200" "2500+" ...
##  $ ticket.cost  : chr  "depends" "60" "20" "130" ...

Solution option 2

disco_df <- data.frame(matrix(disco, nrow = 1))
names(disco_df) <- c("ski.hill", "skiable.acres", "ticket.cost")
  
df_comb <- ski_acres %>% 
  full_join(df_cost, by = c("ski.hill"= "ski.resort")) %>%
  full_join(disco_df) %>% #<<<
  arrange(ski.hill)
str(df_comb)
## 'data.frame':    5 obs. of  3 variables:
##  $ ski.hill     : chr  "Big Sky" "Bridger Bowl" "Discovery" "Jackson" ...
##  $ skiable.acres: chr  "5800" "2000" "2200" "2500+" ...
##  $ ticket.cost  : chr  "depends" "60" "20" "130" ...

wide / long data

wide(r) / long(er) data

We have data organised in an unideal way for our analysis.

We want to reorganise the data to carry on with our analysis.

Data: Grocery sales

We have…

## # A tibble: 2 × 4
##   customer_id item_1 item_2       item_3
##         <dbl> <chr>  <chr>        <chr> 
## 1           1 bread  milk         banana
## 2           2 milk   toilet paper <NA>







We want…

## # A tibble: 6 × 3
##   customer_id item_no item        
##         <dbl> <chr>   <chr>       
## 1           1 item_1  bread       
## 2           1 item_2  milk        
## 3           1 item_3  banana      
## 4           2 item_1  milk        
## 5           2 item_2  toilet paper
## 6           2 item_3  <NA>

A grammar of data tidying

The goal of tidyr is to help you tidy your data via

  • pivoting for going between wide and long data
  • splitting and combining character columns
  • nesting and unnesting columns
  • clarifying how NAs should be treated

Not this…

but this!

Wider vs. longer

Wider = more columns

## # A tibble: 2 × 4
##   customer_id item_1 item_2       item_3
##         <dbl> <chr>  <chr>        <chr> 
## 1           1 bread  milk         banana
## 2           2 milk   toilet paper <NA>







Longer = more rows

## # A tibble: 6 × 3
##   customer_id item_no item        
##         <dbl> <chr>   <chr>       
## 1           1 item_1  bread       
## 2           1 item_2  milk        
## 3           1 item_3  banana      
## 4           2 item_1  milk        
## 5           2 item_2  toilet paper
## 6           2 item_3  <NA>

pivot_longer()

  • data (as usual)
  • cols: columns to pivot into longer format
  • names_to: name of the column where column names of pivoted variables go (character string)
  • values_to: name of the column where data in pivoted variables go (character string)
pivot_longer(
  data,
  cols, 
  names_to = "name", 
  values_to = "value"
  )

Customers \(\rightarrow\) purchases

purchases <- customers %>%
  pivot_longer(
    cols = item_1:item_3,  # variables item_1 to item_3
    names_to = "item_no",  # column names -> new column called item_no
    values_to = "item"     # values in columns -> new column called item
    )

purchases
## # A tibble: 6 × 3
##   customer_id item_no item        
##         <dbl> <chr>   <chr>       
## 1           1 item_1  bread       
## 2           1 item_2  milk        
## 3           1 item_3  banana      
## 4           2 item_1  milk        
## 5           2 item_2  toilet paper
## 6           2 item_3  <NA>

Why pivot?

Most likely, because the next step of your analysis needs it

prices
## # A tibble: 5 × 2
##   item         price
##   <chr>        <dbl>
## 1 avocado       0.5 
## 2 banana        0.15
## 3 bread         1   
## 4 milk          0.8 
## 5 toilet paper  3




purchases %>%
  left_join(prices)
## # A tibble: 6 × 4
##   customer_id item_no item         price
##         <dbl> <chr>   <chr>        <dbl>
## 1           1 item_1  bread         1   
## 2           1 item_2  milk          0.8 
## 3           1 item_3  banana        0.15
## 4           2 item_1  milk          0.8 
## 5           2 item_2  toilet paper  3   
## 6           2 item_3  <NA>         NA

Purchases \(\rightarrow\) customers

  • data (as usual)
  • names_from: which column in the long format contains the what should be column names in the wide format
  • values_from: which column in the long format contains the what should be values in the new columns in the wide format
purchases %>%
  pivot_wider( 
    names_from = item_no, 
    values_from = item 
  )
## # A tibble: 2 × 4
##   customer_id item_1 item_2       item_3
##         <dbl> <chr>  <chr>        <chr> 
## 1           1 bread  milk         banana
## 2           2 milk   toilet paper <NA>

Example: Billboard

Example: Billboard

Consider the billboard dataset (contained in the tidyr package) which contains the rank of the song (in 2000) for each week after it first entered the list.

billboard
## # A tibble: 317 × 79
##    artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
## # … with 307 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
## #   wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## #   wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## #   wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## #   wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## #   wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## #   wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, …

billboard data

If we want to identify songs that reach number 1 quickly, the data need to wrangled.

billboard %>% 
  select(artist, track, date.entered, wk1, wk2) %>%
  pivot_longer(
    cols = c('wk1', 'wk2'),
    names_to = 'week',
    values_to = 'rank',
    values_drop_na = TRUE
    )
## # A tibble: 629 × 5
##    artist       track                   date.entered week   rank
##    <chr>        <chr>                   <date>       <chr> <dbl>
##  1 2 Pac        Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac        Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2Ge+her      The Hardest Part Of ... 2000-09-02   wk1      91
##  4 2Ge+her      The Hardest Part Of ... 2000-09-02   wk2      87
##  5 3 Doors Down Kryptonite              2000-04-08   wk1      81
##  6 3 Doors Down Kryptonite              2000-04-08   wk2      70
##  7 3 Doors Down Loser                   2000-10-21   wk1      76
##  8 3 Doors Down Loser                   2000-10-21   wk2      76
##  9 504 Boyz     Wobble Wobble           2000-04-15   wk1      57
## 10 504 Boyz     Wobble Wobble           2000-04-15   wk2      34
## # … with 619 more rows

Billboard Data Analysis

billboard %>% 
  pivot_longer(
    cols= starts_with('wk'),
    names_to = 'week',
    values_to = 'rank',
    values_drop_na = TRUE) %>%
  mutate(
    week_numb = as.numeric(str_replace(week, 'wk',''))
    ) %>%
  filter(rank == 1) %>%
  arrange(week_numb) %>% 
  slice(1) %>%
  kable()
artist track date.entered week rank week_numb
Madonna Music 2000-08-12 wk6 1 6

Exercise

Determine which song in this dataset spent the most time at #1.

Solution: Code

billboard_long <- billboard %>% 
  pivot_longer(
    cols= starts_with('wk'),
    names_to = 'week',
    values_to = 'rank',
    values_drop_na = TRUE) %>%
  mutate(
    week_numb = 
    as.numeric(str_replace(week, 'wk',''))) %>%
  filter(rank == 1) %>%
  group_by(track) %>% 
  tally() %>%
  arrange(desc(n))

Solution: Result

track n
Independent Women Pa… 11
Maria, Maria 10
Come On Over Baby (A… 4
I Knew I Loved You 4
Music 4
Be With You 3
Doesn’t Really Matte… 3
Say My Name 3
Amazed 2
Incomplete 2
It’s Gonna Be Me 2
What A Girl Wants 2
Bent 1