How to parse a website with R
Parsing, or web scraping, refers to extracting the required data from websites. The rvest library in R provides this functionality.
Steps to parse a webpage
We can parse a webpage with R in the following three steps:
1. Import the rvest library.
2. Read the HTML code.
3. Scrape the required data from the HTML code.
Example
Here is R code that scrapes data from a Wikipedia page.
library(rvest)

# Read the HTML
webpage = read_html("https://en.wikipedia.org/wiki/Web_scraping")

# Scrape data with CSS selector
data = html_node(webpage, '.mw-page-title-main')

# Convert the data to text
text = html_text(data)
print(text)
Explanation
Line 1: We import the rvest library.
Line 4: We use the read_html() function to download and parse the HTML from the Wikipedia URL provided as a parameter.
Line 7: We scrape the page's title from the HTML code stored in webpage. In this case, the CSS selector for the title is .mw-page-title-main.
Line 10: We convert the value stored in data to a readable form, i.e., text.
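Try changing the CSS selector at line 7 to 'p'. Because html_node() returns only the first match, this scrapes the first paragraph; to scrape all the paragraph sections, use html_nodes() instead.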
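The difference between selecting one element and selecting many can be sketched as follows. This is a minimal illustration, not part of the original example: the inline HTML string is made up so that the code runs without network access, and the variable names are arbitrary.

```r
library(rvest)

# Hypothetical inline HTML used in place of a live page
page = read_html("<html><body><h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p></body></html>")

# html_node() returns only the first element matching the selector
first_p = html_text(html_node(page, "p"))

# html_nodes() returns every element matching the selector
all_p = html_text(html_nodes(page, "p"))

print(first_p)
print(all_p)
```

Here first_p holds a single string, while all_p is a character vector with one entry per paragraph.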
Note: If the CSS selector used in the example doesn't work, inspect the element in your browser and verify the selector.
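One way to notice a selector that no longer matches is to check the result for NA before using it. The sketch below assumes a made-up inline HTML string and a deliberately wrong class name (.no-such-class); both are hypothetical and only serve the illustration.

```r
library(rvest)

# Hypothetical inline HTML used in place of a live page
page = read_html("<html><body><h1 class='title'>Web scraping</h1></body></html>")

# A selector that matches nothing yields a missing node, whose text is NA
text = html_text(html_node(page, ".no-such-class"))

if (is.na(text)) {
  message("Selector matched nothing; inspect the page and check the selector.")
} else {
  print(text)
}
```

Checking for NA keeps a changed page layout from silently producing empty results later in a script.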