3/21/2022

Data Scraping

Web Scraping

  • An increasing amount of data is available on the web

  • These data are provided in an unstructured format: you can always copy and paste, but it’s time-consuming and prone to errors

  • Web scraping is the process of extracting this information automatically and transforming it into a structured dataset

Web Scraping

We could spend multiple weeks on this, so this will be a basic introduction that will allow you to:

  • extract text and numbers from webpages and
  • extract tables from webpages.

Hypertext Markup Language

  • Most of the data on the web is still largely available as HTML
  • It is structured (hierarchical / tree based), but it’s often not available in a form useful for analysis (flat / tidy).
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>

A bit about HTML

HTML elements are written with a start tag, an end tag, and the content in between: <tagname>content</tagname>. The textual content we wish to scrape typically lies between these tags. Some tags include:

  • <h1>, <h2>, …: Headings
  • <p>: Paragraph elements
  • <ul>: Unordered bulleted list
  • <ol>: Ordered list
  • <li>: Individual list item
  • <div>: Division or section
  • <table>: Table

HTML Example

Scraping with rvest

  • The rvest package makes basic processing and manipulation of HTML data straightforward
  • It’s designed to work with pipelines built with %>%

Core rvest functions

  • read_html - Read HTML data from a URL or character string
  • html_element - Select the first element (node) matching a selector in an HTML document
  • html_elements - Select all elements (nodes) matching a selector in an HTML document
  • html_table - Parse an HTML table into a data frame
  • html_text - Extract the text content between tag pairs
  • html_name - Extract tags’ names
  • html_attrs - Extract all of each tag’s attributes
  • html_attr - Extract tags’ attribute value by name
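
As a minimal sketch of how these functions fit together, the block below parses the small "Hello world" page shown earlier from a character string, so it runs without a network connection. The object names page and p_node are illustrative choices, not part of rvest.

```r
library(rvest)

# Parse an in-memory document instead of a URL (read_html accepts literal HTML)
page <- read_html('<html>
  <head><title>This is a title</title></head>
  <body><p align="center">Hello world!</p></body>
</html>')

# html_element() returns the first node that matches the selector
p_node <- page %>% html_element("p")

html_name(p_node)            # the tag name
html_text(p_node)            # the text between <p> and </p>
html_attr(p_node, "align")   # a single attribute, looked up by name
html_attrs(p_node)           # all attributes as a named character vector

# The same pattern works for any tag, for example the page title
page_title <- page %>% html_element("title") %>% html_text()
page_title
```

Swapping the string for a URL gives the same workflow on a live page.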

Scraping with rvest

library(rvest)
library(stringr)
msu.math <- read_html("http://math.montana.edu/")
msu.math
## {html_document}
## <html lang="en-US">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="responsive">\n      <div class="mus-topalert" id="mus-topale ...

Scraping with rvest

msu.math %>% html_elements('h1')
## {xml_nodeset (1)}
## [1] <h1>Department of Mathematical Sciences</h1>
msu.math %>% html_elements('h1') %>% html_text()
## [1] "Department of Mathematical Sciences"

Scraping h3

msu.math %>% html_elements('h3') %>% html_text()
##  [1] "\n                                             About the Department\n                                          "
##  [2] "\n                                             Events\n                                          "              
##  [3] "\n                                             Students\n                                          "            
##  [4] "\n                                             Research\n                                          "            
##  [5] "\n                                             Department Resources\n                                          "
##  [6] "Announcements:"                                                                                                 
##  [7] "Math and Stat Center - Hours and Information"                                                                   
##  [8] "Prospective Graduate Students:  Explore Here"                                                                   
##  [9] "More Information"                                                                                               
## [10] "Resources"                                                                                                      
## [11] "Follow Us"

Tidying Up

msu.math %>% html_elements('h3') %>% html_text() %>% 
  str_replace_all("\\s+", " ") %>%
  str_replace_all(pattern = "\n", replacement = "") %>% 
  str_replace_all(pattern = "\t", replacement = "")
##  [1] " About the Department "                      
##  [2] " Events "                                    
##  [3] " Students "                                  
##  [4] " Research "                                  
##  [5] " Department Resources "                      
##  [6] "Announcements:"                              
##  [7] "Math and Stat Center - Hours and Information"
##  [8] "Prospective Graduate Students: Explore Here" 
##  [9] "More Information"                            
## [10] "Resources"                                   
## [11] "Follow Us"
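
One possible alternative, sketched here on two strings copied from the output above rather than on the live page: stringr's str_squish() collapses internal runs of whitespace and also trims the leading and trailing spaces that remain after str_replace_all(). The names raw and clean are illustrative.

```r
library(stringr)

# Two headings as they come back from html_text(), with stray whitespace
raw <- c(
  "\n      About the Department\n      ",
  "Prospective Graduate Students:  Explore Here"
)

# str_squish() trims both ends and reduces repeated whitespace to one space
clean <- str_squish(raw)
clean
```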

SelectorGadget

Scraping Demo

Scraping Demo: IMDB Top 250 Movies

Take a look at the page source and look for the table tag:
http://www.imdb.com/chart/top

Steps

  1. Read the whole page

  2. Scrape movie titles and save as titles

  3. Scrape the years the movies were made and save as years

  4. Scrape IMDB ratings and save as ratings

  5. Create a data frame called imdb_top_250 with variables title, year, and rating
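
The steps above can be sketched on a small stand-in page. Everything about the markup below is an assumption made for illustration: the class names title, year, and rating and the placeholder movies are hypothetical, and the real IMDB page uses different selectors that change over time (SelectorGadget or the page source will show the current ones).

```r
library(rvest)
library(stringr)
library(tibble)

# Step 1: read the whole page (a hypothetical inline table stands in for the URL)
page <- read_html('
<table>
  <tr><td class="title">Movie A</td><td class="year">(1994)</td><td class="rating">9.2</td></tr>
  <tr><td class="title">Movie B</td><td class="year">(1972)</td><td class="rating">9.1</td></tr>
</table>')

# Step 2: scrape titles
titles <- page %>% html_elements("td.title") %>% html_text()

# Step 3: scrape years, keep the four digits, and convert to integer
years <- page %>% html_elements("td.year") %>% html_text() %>%
  str_extract("\\d{4}") %>% as.integer()

# Step 4: scrape ratings and convert to numeric
ratings <- page %>% html_elements("td.rating") %>% html_text() %>% as.numeric()

# Step 5: combine into one data frame
imdb_top_250 <- tibble(title = titles, year = years, rating = ratings)
imdb_top_250
```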

Selecting Tables

Selecting Tables: baseball data

Scraping Tables

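library(tibble)  # as_tibble()
library(knitr)   # kable()
# Read the page, parse every <table> into a data frame, and keep the first one
batting <- read_html("https://www.baseball-reference.com/leagues/MLB/2017-standard-batting.shtml")
batting.list <- batting %>% html_elements('table') %>% html_table()
batting.df <- as_tibble(batting.list[[1]])
kable(batting.df)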
| Tm | #Bat | BatAge | R/G | G | PA | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | BA | OBP | SLG | OPS | OPS+ | TB | GDP | HBP | SH | SF | IBB | LOB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Arizona Diamondbacks | 45 | 28.3 | 5.01 | 162 | 6224 | 5525 | 812 | 1405 | 314 | 39 | 220 | 776 | 103 | 30 | 578 | 1456 | .254 | .329 | .445 | .774 | 94 | 2457 | 106 | 54 | 39 | 27 | 44 | 1118 |
| Atlanta Braves | 49 | 28.7 | 4.52 | 162 | 6216 | 5584 | 732 | 1467 | 289 | 26 | 165 | 706 | 77 | 31 | 474 | 1184 | .263 | .326 | .412 | .738 | 92 | 2303 | 137 | 66 | 59 | 32 | 57 | 1127 |
| Baltimore Orioles | 50 | 28.6 | 4.59 | 162 | 6140 | 5650 | 743 | 1469 | 269 | 12 | 232 | 713 | 32 | 13 | 392 | 1412 | .260 | .312 | .435 | .747 | 100 | 2458 | 138 | 50 | 10 | 37 | 12 | 1041 |
| Boston Red Sox | 49 | 27.3 | 4.85 | 162 | 6338 | 5669 | 785 | 1461 | 302 | 19 | 168 | 735 | 106 | 31 | 571 | 1224 | .258 | .329 | .407 | .736 | 92 | 2305 | 141 | 53 | 9 | 36 | 48 | 1134 |
| Chicago Cubs | 47 | 27.1 | 5.07 | 162 | 6283 | 5496 | 822 | 1402 | 274 | 29 | 223 | 785 | 62 | 31 | 622 | 1401 | .255 | .338 | .437 | .775 | 99 | 2403 | 134 | 82 | 48 | 32 | 54 | 1147 |
| Chicago White Sox | 51 | 26.7 | 4.36 | 162 | 6059 | 5513 | 706 | 1412 | 256 | 37 | 186 | 670 | 71 | 31 | 401 | 1397 | .256 | .314 | .417 | .731 | 96 | 2300 | 124 | 76 | 35 | 33 | 17 | 1055 |
| Cincinnati Reds | 47 | 27.1 | 4.65 | 162 | 6213 | 5484 | 753 | 1390 | 249 | 38 | 219 | 715 | 120 | 39 | 565 | 1329 | .253 | .329 | .433 | .761 | 97 | 2372 | 116 | 72 | 50 | 42 | 41 | 1135 |
| Cleveland Indians | 41 | 28.0 | 5.05 | 162 | 6234 | 5511 | 818 | 1449 | 333 | 29 | 212 | 780 | 88 | 23 | 604 | 1153 | .263 | .339 | .449 | .788 | 104 | 2476 | 125 | 50 | 23 | 45 | 30 | 1158 |
| Colorado Rockies | 41 | 28.3 | 5.09 | 162 | 6201 | 5534 | 824 | 1510 | 293 | 38 | 192 | 793 | 59 | 34 | 519 | 1408 | .273 | .338 | .444 | .781 | 90 | 2455 | 143 | 44 | 62 | 41 | 46 | 1088 |
| Detroit Tigers | 49 | 29.6 | 4.54 | 162 | 6150 | 5556 | 735 | 1435 | 289 | 35 | 187 | 699 | 65 | 34 | 503 | 1313 | .258 | .324 | .424 | .748 | 98 | 2355 | 128 | 52 | 11 | 27 | 21 | 1104 |
| Houston Astros | 46 | 28.8 | 5.53 | 162 | 6271 | 5611 | 896 | 1581 | 346 | 20 | 238 | 854 | 98 | 42 | 509 | 1087 | .282 | .346 | .478 | .823 | 123 | 2681 | 139 | 70 | 11 | 61 | 27 | 1094 |
| Kansas City Royals | 49 | 28.9 | 4.33 | 162 | 6027 | 5536 | 702 | 1436 | 260 | 24 | 193 | 660 | 91 | 31 | 390 | 1166 | .259 | .311 | .420 | .731 | 93 | 2323 | 160 | 45 | 17 | 37 | 19 | 1005 |
| Los Angeles Angels | 55 | 30.0 | 4.38 | 162 | 6073 | 5415 | 710 | 1314 | 251 | 14 | 186 | 678 | 136 | 44 | 523 | 1198 | .243 | .315 | .397 | .712 | 92 | 2151 | 141 | 70 | 17 | 46 | 30 | 1033 |
| Los Angeles Dodgers | 52 | 27.9 | 4.75 | 162 | 6191 | 5408 | 770 | 1347 | 312 | 20 | 221 | 730 | 77 | 28 | 649 | 1380 | .249 | .334 | .437 | .771 | 104 | 2362 | 119 | 64 | 31 | 38 | 41 | 1146 |
| Miami Marlins | 43 | 28.4 | 4.80 | 162 | 6248 | 5602 | 778 | 1497 | 271 | 31 | 194 | 743 | 91 | 30 | 486 | 1282 | .267 | .331 | .431 | .761 | 107 | 2412 | 119 | 67 | 50 | 41 | 48 | 1130 |
| Milwaukee Brewers | 50 | 27.4 | 4.52 | 162 | 6135 | 5467 | 732 | 1363 | 267 | 22 | 224 | 695 | 128 | 41 | 547 | 1571 | .249 | .322 | .429 | .751 | 94 | 2346 | 116 | 53 | 42 | 26 | 34 | 1088 |
| Minnesota Twins | 52 | 27.0 | 5.03 | 162 | 6261 | 5557 | 815 | 1444 | 286 | 31 | 206 | 781 | 95 | 28 | 593 | 1342 | .260 | .334 | .434 | .768 | 104 | 2410 | 105 | 46 | 26 | 39 | 26 | 1147 |
| New York Mets | 52 | 28.8 | 4.54 | 162 | 6169 | 5510 | 735 | 1379 | 286 | 28 | 224 | 713 | 58 | 23 | 529 | 1291 | .250 | .320 | .434 | .755 | 101 | 2393 | 118 | 57 | 36 | 37 | 31 | 1099 |
| New York Yankees | 51 | 28.6 | 5.30 | 162 | 6354 | 5594 | 858 | 1463 | 266 | 23 | 241 | 821 | 90 | 22 | 616 | 1386 | .262 | .339 | .447 | .785 | 105 | 2498 | 119 | 64 | 18 | 56 | 22 | 1184 |
| Oakland Athletics | 54 | 28.7 | 4.56 | 162 | 6126 | 5464 | 739 | 1344 | 305 | 15 | 234 | 708 | 57 | 22 | 565 | 1491 | .246 | .319 | .436 | .755 | 104 | 2381 | 129 | 43 | 13 | 40 | 15 | 1075 |
| Philadelphia Phillies | 51 | 26.6 | 4.26 | 162 | 6133 | 5535 | 690 | 1382 | 287 | 36 | 174 | 654 | 59 | 25 | 494 | 1417 | .250 | .315 | .409 | .723 | 89 | 2263 | 128 | 47 | 21 | 36 | 25 | 1079 |
| Pittsburgh Pirates | 47 | 28.2 | 4.12 | 162 | 6136 | 5458 | 668 | 1331 | 249 | 36 | 151 | 635 | 67 | 36 | 519 | 1213 | .244 | .318 | .386 | .704 | 86 | 2105 | 120 | 88 | 42 | 28 | 39 | 1129 |
| San Diego Padres | 52 | 26.2 | 3.73 | 162 | 5954 | 5356 | 604 | 1251 | 227 | 31 | 189 | 576 | 89 | 33 | 460 | 1499 | .234 | .299 | .393 | .692 | 83 | 2107 | 99 | 53 | 52 | 33 | 20 | 1037 |
| Seattle Mariners | 61 | 29.5 | 4.63 | 162 | 6166 | 5551 | 750 | 1436 | 281 | 17 | 200 | 714 | 89 | 35 | 487 | 1267 | .259 | .325 | .424 | .749 | 103 | 2351 | 131 | 78 | 14 | 35 | 31 | 1084 |
| San Francisco Giants | 49 | 29.5 | 3.94 | 162 | 6137 | 5551 | 639 | 1382 | 290 | 28 | 128 | 612 | 76 | 34 | 467 | 1204 | .249 | .309 | .380 | .689 | 81 | 2112 | 136 | 36 | 31 | 52 | 37 | 1093 |
| St. Louis Cardinals | 48 | 28.0 | 4.70 | 162 | 6219 | 5470 | 761 | 1402 | 284 | 28 | 196 | 728 | 81 | 31 | 593 | 1348 | .256 | .334 | .426 | .760 | 99 | 2330 | 139 | 65 | 47 | 44 | 36 | 1118 |
| Tampa Bay Rays | 53 | 28.3 | 4.28 | 162 | 6147 | 5478 | 694 | 1340 | 226 | 32 | 228 | 671 | 88 | 34 | 545 | 1538 | .245 | .317 | .422 | .739 | 99 | 2314 | 115 | 55 | 16 | 48 | 33 | 1114 |
| Texas Rangers | 51 | 28.3 | 4.93 | 162 | 6122 | 5430 | 799 | 1326 | 255 | 21 | 237 | 756 | 113 | 44 | 544 | 1493 | .244 | .320 | .430 | .750 | 91 | 2334 | 110 | 81 | 27 | 39 | 18 | 1015 |
| Toronto Blue Jays | 60 | 30.9 | 4.28 | 162 | 6154 | 5499 | 693 | 1320 | 269 | 5 | 222 | 661 | 53 | 24 | 542 | 1327 | .240 | .312 | .412 | .724 | 91 | 2265 | 153 | 51 | 25 | 35 | 12 | 1064 |
| Washington Nationals | 49 | 29.2 | 5.06 | 162 | 6214 | 5553 | 819 | 1477 | 311 | 31 | 215 | 796 | 108 | 30 | 542 | 1327 | .266 | .332 | .449 | .782 | 99 | 2495 | 116 | 31 | 43 | 45 | 56 | 1101 |
| League Average | 45 | 28.3 | 4.65 | 162 | 6176 | 5519 | 753 | 1407 | 280 | 26 | 204 | 719 | 84 | 31 | 528 | 1337 | .255 | .324 | .426 | .750 | 97 | 2351 | 127 | 59 | 31 | 39 | 32 | 1098 |
|  | 1358 | 28.3 | 4.65 | 4860 | 185295 | 165567 | 22582 | 42215 | 8397 | 795 | 6105 | 21558 | 2527 | 934 | 15829 | 40104 | .255 | .324 | .426 | .750 | 97 | 70517 | 3804 | 1763 | 925 | 1168 | 970 | 32942 |
| Tm | #Bat | BatAge | R/G | G | PA | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | BA | OBP | SLG | OPS | OPS+ | TB | GDP | HBP | SH | SF | IBB | LOB |
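
To show why the code indexes with [[1]], here is a small offline sketch: html_table() applied to a set of table nodes returns a list with one data frame per table. The inline table is a three-column excerpt of the values above, and the names page, tables, and first are illustrative.

```r
library(rvest)

# A tiny table copied from the batting data, parsed from a string
page <- read_html('
<table>
  <tr><th>Tm</th><th>G</th><th>R</th></tr>
  <tr><td>Arizona Diamondbacks</td><td>162</td><td>812</td></tr>
  <tr><td>Atlanta Braves</td><td>162</td><td>732</td></tr>
</table>')

# html_elements() finds every <table>; html_table() parses each one
tables <- page %>% html_elements("table") %>% html_table()

length(tables)        # number of tables found on the page
first <- tables[[1]]  # pull out the first table as a data frame
first
```

On a real page with several tables, inspecting length(tables) and the column names of each element helps identify which index holds the table of interest.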

Scraping Exercise: Get Team Info

Scraping Solution: Get Team Info

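library(tibble)  # as_tibble()
# Same approach for one team's page: html_table() returns one data frame per table
batting.CO <- read_html("https://www.baseball-reference.com/teams/COL/2017.shtml")
tables.CO <- batting.CO %>% html_elements('table') %>% html_table()
as_tibble(tables.CO[[1]])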
## # A tibble: 48 × 28
##    Rk    Pos   Name  Age   G     PA    AB    R     H     `2B`  `3B`  HR    RBI  
##    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 1     C     Tony… 25    83    266   229   30    55    8     1     0     16   
##  2 2     1B    Mark… 33    148   593   520   82    139   22    1     30    97   
##  3 3     2B    DJ L… 28    155   682   609   95    189   28    4     8     64   
##  4 4     SS    Trev… 24    145   555   503   68    120   32    3     24    82   
##  5 5     3B    Nola… 26    159   680   606   100   187   43    7     37    130  
##  6 6     LF    Gera… 30    115   425   392   56    121   24    1     10    71   
##  7 7     CF    Char… 30    159   725   644   137   213   35    14    37    104  
##  8 8     RF    Carl… 31    136   534   470   72    123   34    0     14    57   
##  9 Rk    Pos   Name  Age   G     PA    AB    R     H     2B    3B    HR    RBI  
## 10 9     UT    Ian … 31    95    373   339   47    93    11    1     7     40   
## # … with 38 more rows, and 15 more variables: SB <chr>, CS <chr>, BB <chr>,
## #   SO <chr>, BA <chr>, OBP <chr>, SLG <chr>, OPS <chr>, `OPS+` <chr>,
## #   TB <chr>, GDP <chr>, HBP <chr>, SH <chr>, SF <chr>, IBB <chr>
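
In the output above every column is character and row 9 repeats the header. A minimal cleanup sketch follows, using a four-column excerpt of the rows shown rather than the live page. The names raw and clean are illustrative, and dropping rows where Rk equals "Rk" is one possible approach, not necessarily the one used in the course.

```r
library(tibble)

# A miniature of the scraped table: all character, header repeated as a data row
raw <- tibble(
  Rk  = c("1", "2", "Rk", "9"),
  Pos = c("C", "1B", "Pos", "UT"),
  G   = c("83", "148", "G", "95"),
  PA  = c("266", "593", "PA", "373")
)

# Drop the rows that merely repeat the column names
clean <- raw[raw$Rk != "Rk", ]

# Let R re-guess each column's type; as.is = TRUE keeps text as character
clean <- type.convert(clean, as.is = TRUE)
clean
```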

Additional Slides

Web scraping considerations

Using functions with web scraping

Using iteration with web scraping
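
As a hedged sketch of the last two topics, the block below wraps a scraping step in a function and applies it to several pages. The helper name get_h1 is hypothetical, and inline HTML strings stand in for URLs so the example runs offline.

```r
library(rvest)

# Hypothetical helper: return the text of the first <h1> in a page
get_h1 <- function(src) {
  read_html(src) %>% html_element("h1") %>% html_text()
}

# Stand-ins for a vector of URLs
pages <- c(
  "<html><body><h1>Page one</h1></body></html>",
  "<html><body><h1>Page two</h1></body></html>"
)

# Iterate over the pages; vapply() checks that each result is one string
headings <- vapply(pages, get_h1, character(1), USE.NAMES = FALSE)
headings
```

With real URLs, pausing between requests (for example with Sys.sleep()) keeps the load on the server low.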