3/21/2022
An increasing amount of data is available on the web
These data are often provided in an unstructured format: you can always copy and paste, but that is time-consuming and error-prone
Web scraping is the process of extracting this information automatically and transforming it into a structured dataset
We could spend multiple weeks on this, so this will be a basic introduction that will allow you to:
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
HTML elements are written with a start tag, an end tag, and the content in between: <tagname>content</tagname>
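As a small illustration of the start tag / content / end tag structure, the sketch below parses the example document from a character string and pulls out the pieces of the `<p>` element. It assumes the rvest package is installed; the variable names are only illustrative.

```r
library(rvest)

# Parse the example document from a character string (no network access needed)
doc <- read_html('<html>
  <head> <title>This is a title</title> </head>
  <body> <p align="center">Hello world!</p> </body>
</html>')

# Select the single <p> element
p <- doc %>% html_element("p")

tag_name   <- html_name(p)            # the name inside the start and end tags
content    <- html_text(p)            # the content between the two tags
alignment  <- html_attr(p, "align")   # an attribute set in the start tag
page_title <- doc %>% html_element("title") %>% html_text()
```

Here `tag_name` holds the tag name, `content` the text between the tags, and `alignment` the value of the `align` attribute.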
rvest
Core functions, typically chained with the %>% pipe:
- read_html: Read HTML data from a URL or character string
- html_element: Select a specified element (node) from an HTML document
- html_elements: Select specified elements (nodes) from an HTML document
- html_table: Parse an HTML table into a data frame
- html_text: Extract tag pairs’ content
- html_name: Extract tags’ names
- html_attrs: Extract all of each tag’s attributes
- html_attr: Extract tags’ attribute value by name
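To make the differences between these functions concrete, here is a minimal sketch that runs them on a tiny hand-written page. The markup, ids, and values are made up for illustration and do not come from any real site.

```r
library(rvest)

# A small, made-up page used only for illustration
page <- read_html('
  <body>
    <h2 id="first" class="section">Alpha</h2>
    <h2 id="second" class="section">Beta</h2>
    <table>
      <tr><th>team</th><th>wins</th></tr>
      <tr><td>A</td><td>10</td></tr>
      <tr><td>B</td><td>7</td></tr>
    </table>
  </body>')

first_h2 <- page %>% html_element("h2")    # single node: the first match only
all_h2   <- page %>% html_elements("h2")   # node set: every match

h2_text  <- html_text(all_h2)              # content of each <h2>
h2_names <- html_name(all_h2)              # tag name of each node
h2_ids   <- html_attr(all_h2, "id")        # one attribute, looked up by name
h2_attrs <- html_attrs(first_h2)           # all attributes of the first <h2>

# Parse the <table> into a data frame; the <th> row becomes the column names
wins <- page %>% html_element("table") %>% html_table()
```

Note that `html_element()` returns at most one node per input, while `html_elements()` returns all matches, which matters when a page has several headings or tables.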
library(rvest)
library(stringr)
msu.math <- read_html("http://math.montana.edu/")
msu.math
## {html_document}
## <html lang="en-US">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="responsive">\n <div class="mus-topalert" id="mus-topale ...
msu.math %>% html_elements('h1')
## {xml_nodeset (1)}
## [1] <h1>Department of Mathematical Sciences</h1>
msu.math %>% html_elements('h1') %>% html_text()
## [1] "Department of Mathematical Sciences"
msu.math %>% html_elements('h3') %>% html_text()
##  [1] "\n About the Department\n "
##  [2] "\n Events\n "
##  [3] "\n Students\n "
##  [4] "\n Research\n "
##  [5] "\n Department Resources\n "
##  [6] "Announcements:"
##  [7] "Math and Stat Center - Hours and Information"
##  [8] "Prospective Graduate Students: Explore Here"
##  [9] "More Information"
## [10] "Resources"
## [11] "Follow Us"
msu.math %>%
  html_elements('h3') %>%
  html_text() %>%
  str_replace_all("\\s+", " ") %>%
  str_replace_all(pattern = "\n", replacement = "") %>%
  str_replace_all(pattern = "\t", replacement = "")
##  [1] " About the Department "
##  [2] " Events "
##  [3] " Students "
##  [4] " Research "
##  [5] " Department Resources "
##  [6] "Announcements:"
##  [7] "Math and Stat Center - Hours and Information"
##  [8] "Prospective Graduate Students: Explore Here"
##  [9] "More Information"
## [10] "Resources"
## [11] "Follow Us"
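The cleaned headings above still carry a leading and trailing space. As a hedged alternative, the sketch below compares two options on a made-up snippet of HTML: `html_text(trim = TRUE)`, which strips whitespace at the ends only, and `stringr::str_squish()`, which also collapses runs of whitespace inside the string. The HTML string is an assumption standing in for the real page.

```r
library(rvest)
library(stringr)

# Made-up headings with the same kind of stray whitespace as the real page
page <- read_html("<h3>\n\t About   the Department\n </h3><h3>\n Events\n </h3>")

# Raw text keeps every newline, tab, and repeated space
raw <- page %>% html_elements("h3") %>% html_text()

# trim = TRUE removes leading and trailing whitespace only
trimmed <- page %>% html_elements("h3") %>% html_text(trim = TRUE)

# str_squish() trims both ends and collapses internal whitespace to one space
squished <- str_squish(raw)
```

With `str_squish()` a single call replaces the three chained `str_replace_all()` steps and also removes the padding at both ends.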
Take a look at the source code and look for the table tag:
http://www.imdb.com/chart/top
Read the whole page
Scrape movie titles and save as titles
Scrape the years the movies were made and save as years
Scrape IMDB ratings and save as ratings
Create a data frame called imdb_top_250 with variables title, year, and rating
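One possible shape for this exercise is sketched below. It runs on a small made-up HTML string rather than the live page, and the class names (`titleColumn`, `secondaryInfo`, `imdbRating`) as well as the movie rows are assumptions for illustration. The real IMDb markup may differ, so inspect the page source and adjust the selectors before pointing `read_html()` at the URL.

```r
library(rvest)
library(stringr)

# Hypothetical stand-in for the chart page; rows and class names are made up
page <- read_html('
  <table>
    <tr>
      <td class="titleColumn"><a>Movie A</a> <span class="secondaryInfo">(1994)</span></td>
      <td class="imdbRating"><strong>9.2</strong></td>
    </tr>
    <tr>
      <td class="titleColumn"><a>Movie B</a> <span class="secondaryInfo">(1972)</span></td>
      <td class="imdbRating"><strong>9.1</strong></td>
    </tr>
  </table>')

# Titles: text of the link inside each title cell
titles <- page %>% html_elements(".titleColumn a") %>% html_text()

# Years: strip the parentheses, then convert to numbers
years <- page %>%
  html_elements(".secondaryInfo") %>%
  html_text() %>%
  str_remove_all("[()]") %>%
  as.numeric()

# Ratings: text of the <strong> tag inside each rating cell
ratings <- page %>%
  html_elements(".imdbRating strong") %>%
  html_text() %>%
  as.numeric()

imdb_top_250 <- data.frame(title = titles, year = years, rating = ratings)
```

Scraping each variable separately only works if every selector returns one value per movie, so it is worth checking that `titles`, `years`, and `ratings` have the same length before building the data frame.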
https://www.baseball-reference.com/leagues/MLB/2017-standard-batting.shtml
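library(tibble)  # as_tibble()
library(knitr)   # kable()
# Read the page and parse every <table> element into a list of data frames
batting <- read_html("https://www.baseball-reference.com/leagues/MLB/2017-standard-batting.shtml")
batting.list <- batting %>%
  html_elements('table') %>%
  html_table()
# Keep the first table and print it
batting.df <- as_tibble(batting.list[[1]])
kable(batting.df)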
Tm | #Bat | BatAge | R/G | G | PA | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | BA | OBP | SLG | OPS | OPS+ | TB | GDP | HBP | SH | SF | IBB | LOB |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Arizona Diamondbacks | 45 | 28.3 | 5.01 | 162 | 6224 | 5525 | 812 | 1405 | 314 | 39 | 220 | 776 | 103 | 30 | 578 | 1456 | .254 | .329 | .445 | .774 | 94 | 2457 | 106 | 54 | 39 | 27 | 44 | 1118 |
Atlanta Braves | 49 | 28.7 | 4.52 | 162 | 6216 | 5584 | 732 | 1467 | 289 | 26 | 165 | 706 | 77 | 31 | 474 | 1184 | .263 | .326 | .412 | .738 | 92 | 2303 | 137 | 66 | 59 | 32 | 57 | 1127 |
Baltimore Orioles | 50 | 28.6 | 4.59 | 162 | 6140 | 5650 | 743 | 1469 | 269 | 12 | 232 | 713 | 32 | 13 | 392 | 1412 | .260 | .312 | .435 | .747 | 100 | 2458 | 138 | 50 | 10 | 37 | 12 | 1041 |
Boston Red Sox | 49 | 27.3 | 4.85 | 162 | 6338 | 5669 | 785 | 1461 | 302 | 19 | 168 | 735 | 106 | 31 | 571 | 1224 | .258 | .329 | .407 | .736 | 92 | 2305 | 141 | 53 | 9 | 36 | 48 | 1134 |
Chicago Cubs | 47 | 27.1 | 5.07 | 162 | 6283 | 5496 | 822 | 1402 | 274 | 29 | 223 | 785 | 62 | 31 | 622 | 1401 | .255 | .338 | .437 | .775 | 99 | 2403 | 134 | 82 | 48 | 32 | 54 | 1147 |
Chicago White Sox | 51 | 26.7 | 4.36 | 162 | 6059 | 5513 | 706 | 1412 | 256 | 37 | 186 | 670 | 71 | 31 | 401 | 1397 | .256 | .314 | .417 | .731 | 96 | 2300 | 124 | 76 | 35 | 33 | 17 | 1055 |
Cincinnati Reds | 47 | 27.1 | 4.65 | 162 | 6213 | 5484 | 753 | 1390 | 249 | 38 | 219 | 715 | 120 | 39 | 565 | 1329 | .253 | .329 | .433 | .761 | 97 | 2372 | 116 | 72 | 50 | 42 | 41 | 1135 |
Cleveland Indians | 41 | 28.0 | 5.05 | 162 | 6234 | 5511 | 818 | 1449 | 333 | 29 | 212 | 780 | 88 | 23 | 604 | 1153 | .263 | .339 | .449 | .788 | 104 | 2476 | 125 | 50 | 23 | 45 | 30 | 1158 |
Colorado Rockies | 41 | 28.3 | 5.09 | 162 | 6201 | 5534 | 824 | 1510 | 293 | 38 | 192 | 793 | 59 | 34 | 519 | 1408 | .273 | .338 | .444 | .781 | 90 | 2455 | 143 | 44 | 62 | 41 | 46 | 1088 |
Detroit Tigers | 49 | 29.6 | 4.54 | 162 | 6150 | 5556 | 735 | 1435 | 289 | 35 | 187 | 699 | 65 | 34 | 503 | 1313 | .258 | .324 | .424 | .748 | 98 | 2355 | 128 | 52 | 11 | 27 | 21 | 1104 |
Houston Astros | 46 | 28.8 | 5.53 | 162 | 6271 | 5611 | 896 | 1581 | 346 | 20 | 238 | 854 | 98 | 42 | 509 | 1087 | .282 | .346 | .478 | .823 | 123 | 2681 | 139 | 70 | 11 | 61 | 27 | 1094 |
Kansas City Royals | 49 | 28.9 | 4.33 | 162 | 6027 | 5536 | 702 | 1436 | 260 | 24 | 193 | 660 | 91 | 31 | 390 | 1166 | .259 | .311 | .420 | .731 | 93 | 2323 | 160 | 45 | 17 | 37 | 19 | 1005 |
Los Angeles Angels | 55 | 30.0 | 4.38 | 162 | 6073 | 5415 | 710 | 1314 | 251 | 14 | 186 | 678 | 136 | 44 | 523 | 1198 | .243 | .315 | .397 | .712 | 92 | 2151 | 141 | 70 | 17 | 46 | 30 | 1033 |
Los Angeles Dodgers | 52 | 27.9 | 4.75 | 162 | 6191 | 5408 | 770 | 1347 | 312 | 20 | 221 | 730 | 77 | 28 | 649 | 1380 | .249 | .334 | .437 | .771 | 104 | 2362 | 119 | 64 | 31 | 38 | 41 | 1146 |
Miami Marlins | 43 | 28.4 | 4.80 | 162 | 6248 | 5602 | 778 | 1497 | 271 | 31 | 194 | 743 | 91 | 30 | 486 | 1282 | .267 | .331 | .431 | .761 | 107 | 2412 | 119 | 67 | 50 | 41 | 48 | 1130 |
Milwaukee Brewers | 50 | 27.4 | 4.52 | 162 | 6135 | 5467 | 732 | 1363 | 267 | 22 | 224 | 695 | 128 | 41 | 547 | 1571 | .249 | .322 | .429 | .751 | 94 | 2346 | 116 | 53 | 42 | 26 | 34 | 1088 |
Minnesota Twins | 52 | 27.0 | 5.03 | 162 | 6261 | 5557 | 815 | 1444 | 286 | 31 | 206 | 781 | 95 | 28 | 593 | 1342 | .260 | .334 | .434 | .768 | 104 | 2410 | 105 | 46 | 26 | 39 | 26 | 1147 |
New York Mets | 52 | 28.8 | 4.54 | 162 | 6169 | 5510 | 735 | 1379 | 286 | 28 | 224 | 713 | 58 | 23 | 529 | 1291 | .250 | .320 | .434 | .755 | 101 | 2393 | 118 | 57 | 36 | 37 | 31 | 1099 |
New York Yankees | 51 | 28.6 | 5.30 | 162 | 6354 | 5594 | 858 | 1463 | 266 | 23 | 241 | 821 | 90 | 22 | 616 | 1386 | .262 | .339 | .447 | .785 | 105 | 2498 | 119 | 64 | 18 | 56 | 22 | 1184 |
Oakland Athletics | 54 | 28.7 | 4.56 | 162 | 6126 | 5464 | 739 | 1344 | 305 | 15 | 234 | 708 | 57 | 22 | 565 | 1491 | .246 | .319 | .436 | .755 | 104 | 2381 | 129 | 43 | 13 | 40 | 15 | 1075 |
Philadelphia Phillies | 51 | 26.6 | 4.26 | 162 | 6133 | 5535 | 690 | 1382 | 287 | 36 | 174 | 654 | 59 | 25 | 494 | 1417 | .250 | .315 | .409 | .723 | 89 | 2263 | 128 | 47 | 21 | 36 | 25 | 1079 |
Pittsburgh Pirates | 47 | 28.2 | 4.12 | 162 | 6136 | 5458 | 668 | 1331 | 249 | 36 | 151 | 635 | 67 | 36 | 519 | 1213 | .244 | .318 | .386 | .704 | 86 | 2105 | 120 | 88 | 42 | 28 | 39 | 1129 |
San Diego Padres | 52 | 26.2 | 3.73 | 162 | 5954 | 5356 | 604 | 1251 | 227 | 31 | 189 | 576 | 89 | 33 | 460 | 1499 | .234 | .299 | .393 | .692 | 83 | 2107 | 99 | 53 | 52 | 33 | 20 | 1037 |
Seattle Mariners | 61 | 29.5 | 4.63 | 162 | 6166 | 5551 | 750 | 1436 | 281 | 17 | 200 | 714 | 89 | 35 | 487 | 1267 | .259 | .325 | .424 | .749 | 103 | 2351 | 131 | 78 | 14 | 35 | 31 | 1084 |
San Francisco Giants | 49 | 29.5 | 3.94 | 162 | 6137 | 5551 | 639 | 1382 | 290 | 28 | 128 | 612 | 76 | 34 | 467 | 1204 | .249 | .309 | .380 | .689 | 81 | 2112 | 136 | 36 | 31 | 52 | 37 | 1093 |
St. Louis Cardinals | 48 | 28.0 | 4.70 | 162 | 6219 | 5470 | 761 | 1402 | 284 | 28 | 196 | 728 | 81 | 31 | 593 | 1348 | .256 | .334 | .426 | .760 | 99 | 2330 | 139 | 65 | 47 | 44 | 36 | 1118 |
Tampa Bay Rays | 53 | 28.3 | 4.28 | 162 | 6147 | 5478 | 694 | 1340 | 226 | 32 | 228 | 671 | 88 | 34 | 545 | 1538 | .245 | .317 | .422 | .739 | 99 | 2314 | 115 | 55 | 16 | 48 | 33 | 1114 |
Texas Rangers | 51 | 28.3 | 4.93 | 162 | 6122 | 5430 | 799 | 1326 | 255 | 21 | 237 | 756 | 113 | 44 | 544 | 1493 | .244 | .320 | .430 | .750 | 91 | 2334 | 110 | 81 | 27 | 39 | 18 | 1015 |
Toronto Blue Jays | 60 | 30.9 | 4.28 | 162 | 6154 | 5499 | 693 | 1320 | 269 | 5 | 222 | 661 | 53 | 24 | 542 | 1327 | .240 | .312 | .412 | .724 | 91 | 2265 | 153 | 51 | 25 | 35 | 12 | 1064 |
Washington Nationals | 49 | 29.2 | 5.06 | 162 | 6214 | 5553 | 819 | 1477 | 311 | 31 | 215 | 796 | 108 | 30 | 542 | 1327 | .266 | .332 | .449 | .782 | 99 | 2495 | 116 | 31 | 43 | 45 | 56 | 1101 |
League Average | 45 | 28.3 | 4.65 | 162 | 6176 | 5519 | 753 | 1407 | 280 | 26 | 204 | 719 | 84 | 31 | 528 | 1337 | .255 | .324 | .426 | .750 | 97 | 2351 | 127 | 59 | 31 | 39 | 32 | 1098 |
| | 1358 | 28.3 | 4.65 | 4860 | 185295 | 165567 | 22582 | 42215 | 8397 | 795 | 6105 | 21558 | 2527 | 934 | 15829 | 40104 | .255 | .324 | .426 | .750 | 97 | 70517 | 3804 | 1763 | 925 | 1168 | 970 | 32942 |
Tm | #Bat | BatAge | R/G | G | PA | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | BA | OBP | SLG | OPS | OPS+ | TB | GDP | HBP | SH | SF | IBB | LOB |
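
Picking a table by position, as in `batting.list[[1]]`, depends on the order of tables on the page. A hedged alternative is to select the table by its `id` attribute. The sketch below uses a made-up two-table page; the ids `standings` and `batting` and all values are hypothetical, so check the real page source for the actual id.

```r
library(rvest)

# Made-up page with two tables; ids and values are for illustration only
page <- read_html('
  <table id="standings">
    <tr><th>Tm</th><th>W</th></tr>
    <tr><td>X</td><td>1</td></tr>
  </table>
  <table id="batting">
    <tr><th>Tm</th><th>HR</th></tr>
    <tr><td>X</td><td>220</td></tr>
    <tr><td>Y</td><td>165</td></tr>
  </table>')

# List the ids of all tables to see what is available
tables    <- page %>% html_elements("table")
table_ids <- html_attr(tables, "id")

# Select one table with a CSS id selector and parse only that table
batting_by_id <- page %>% html_element("table#batting") %>% html_table()
```

Selecting by id keeps the code working if the site adds or reorders tables, as long as the id itself stays the same.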
Visit the baseball reference website for the Colorado Rockies https://www.baseball-reference.com/teams/COL/2017.shtml and scrape a table or text.
batting.CO <- read_html("https://www.baseball-reference.com/teams/COL/2017.shtml")
tables.CO <- batting.CO %>%
  html_elements('table') %>%
  html_table()
as_tibble(tables.CO[[1]])
## # A tibble: 48 × 28
##    Rk    Pos   Name  Age   G     PA    AB    R     H     `2B`  `3B`  HR    RBI
##    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 1     C     Tony… 25    83    266   229   30    55    8     1     0     16
##  2 2     1B    Mark… 33    148   593   520   82    139   22    1     30    97
##  3 3     2B    DJ L… 28    155   682   609   95    189   28    4     8     64
##  4 4     SS    Trev… 24    145   555   503   68    120   32    3     24    82
##  5 5     3B    Nola… 26    159   680   606   100   187   43    7     37    130
##  6 6     LF    Gera… 30    115   425   392   56    121   24    1     10    71
##  7 7     CF    Char… 30    159   725   644   137   213   35    14    37    104
##  8 8     RF    Carl… 31    136   534   470   72    123   34    0     14    57
##  9 Rk    Pos   Name  Age   G     PA    AB    R     H     2B    3B    HR    RBI
## 10 9     UT    Ian … 31    95    373   339   47    93    11    1     7     40
## # … with 38 more rows, and 15 more variables: SB <chr>, CS <chr>, BB <chr>,
## #   SO <chr>, BA <chr>, OBP <chr>, SLG <chr>, OPS <chr>, `OPS+` <chr>,
## #   TB <chr>, GDP <chr>, HBP <chr>, SH <chr>, SF <chr>, IBB <chr>
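
In the output above, row 9 repeats the column headers, which is why every column is read as character. A minimal cleanup sketch, run on a made-up table with the same problem (player names and values are hypothetical), drops the repeated header rows and then re-infers the column types with base R's `type.convert()`.

```r
library(rvest)

# Made-up table with a header row repeated in the body, as on the real page
page <- read_html('
  <table>
    <tr><th>Rk</th><th>Name</th><th>HR</th></tr>
    <tr><td>1</td><td>Player A</td><td>30</td></tr>
    <tr><td>Rk</td><td>Name</td><td>HR</td></tr>
    <tr><td>2</td><td>Player B</td><td>7</td></tr>
  </table>')

# Because of the repeated header row, Rk and HR are parsed as character
raw <- page %>% html_element("table") %>% html_table()

# Drop rows that merely repeat the header, then work with a plain data frame
clean <- as.data.frame(raw[raw$Rk != "Rk", ])

# Re-infer column types; as.is = TRUE keeps text columns as character
clean[] <- lapply(clean, type.convert, as.is = TRUE)
```

After this step `Rk` and `HR` are numeric columns, so sorting and summarising behave as expected.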