---
title: 'STAT 408: Week 9'
subtitle: Web Scraping
date: "3/21/2022"
output:
  ioslides_presentation:
    css: https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css
    widescreen: yes
  beamer_presentation:
    theme: Berkeley
    colortheme: seahorse
    slide_level: 2
  revealjs::revealjs_presentation:
    transition: none
    incremental: no
  slidy_presentation:
    incremental: no
urlcolor: blue
---

```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
library(rvest)
library(robotstxt)
library(DBI)
library(RSQLite)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, error=FALSE, message=FALSE)
```

# Data Scraping

## Web Scraping

- An increasing amount of data is available on the web
- These data are provided in an unstructured format: you can always copy and paste, but it's time-consuming and prone to errors
- Web scraping is the process of extracting this information automatically and transforming it into a structured dataset

## Web Scraping

We could spend multiple weeks on this, so this will be a basic introduction that will allow you to:

- extract text and numbers from webpages and
- extract tables from webpages.

## Hypertext Markup Language

- Most of the data on the web is still largely available as HTML
- It is structured (hierarchical / tree based), but it's often not available in a form useful for analysis (flat / tidy).

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>
```

## A bit about HTML

HTML elements are written with a start tag, an end tag, and the content in between: `<tag>content</tag>`. The textual content we wish to scrape typically lies between these tags.

Some tags include:

- $<h1>$, $<h2>$,…,$<h6>$: for headings
- $<p>$: Paragraph elements
- $