The Basics of Web Scraping

Learn how to get data from HTML pages.

Web scraping is the practice of extracting data from web pages. It differs from polling a web service in one key way: a web service returns data structured for machines, while a web page is designed to be read by humans.

How, then, can we teach machines to read data meant for humans?

The need for hooks

Apart from some applications of artificial intelligence, machines have to be guided to retrieve data that was meant to be displayed on a page. We need tricks and hooks that let a program navigate the page and recognize the data it contains.

When we create web pages, we typically build them from repeated parts that are all formatted the same way. To keep that formatting consistent across the whole website, we usually assign CSS classes to those parts, so elements that hold the same kind of data tend to share a class.

Tip: Our first hook is to look for CSS classes.
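To make the first hook concrete, here is a minimal sketch that collects the text of every element carrying a given CSS class. It uses only Python's standard-library html.parser module. The sample markup, the class name "product", and the helper names ClassTextCollector and texts_by_class are assumptions made up for this illustration. The sketch also assumes tidy markup with no unclosed tags (such as a bare <br>) inside the matching elements.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = """
<ul>
  <li class="product featured">Laptop</li>
  <li class="product">Phone</li>
  <li class="note">Free shipping</li>
</ul>
"""


class ClassTextCollector(HTMLParser):
    """Collect the text of every element that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.results = []   # one text entry per matching element

    def handle_starttag(self, tag, attrs):
        # The class attribute may list several space-separated classes.
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1              # nested tag inside a match
        elif self.css_class in classes:
            self.depth = 1               # entering a new matching element
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def texts_by_class(html, css_class):
    """Return the stripped text of each element with the given class."""
    parser = ClassTextCollector(css_class)
    parser.feed(html)
    return [text.strip() for text in parser.results]


products = texts_by_class(SAMPLE_HTML, "product")
print(products)  # prints ['Laptop', 'Phone']
```

Note that the first item matches even though it has two classes, because the class attribute is split on whitespace before the comparison.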

Sometimes web developers assign an ID to an element on the page. An ID is supposed to be unique within the page, so when one is present, it is a powerful hook that points to exactly one element.

Tip: The second hook to consider is the element ID.
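The second hook can be sketched in the same way. The example below looks for the first element with a given ID and returns its text, again using only the standard-library html.parser module. The sample markup, the ID "total-price", and the names IdTextFinder and text_by_id are hypothetical, chosen only for this illustration, and the same assumption about tidy markup applies.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = """
<div id="header">My Shop</div>
<div id="total-price">42.50 EUR</div>
"""


class IdTextFinder(HTMLParser):
    """Capture the text of the first element with a given id attribute."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0       # greater than 0 while inside the matching element
        self.found = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1  # nested tag inside the match
        elif not self.found and dict(attrs).get("id") == self.element_id:
            self.depth = 1
            self.found = True

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text += data


def text_by_id(html, element_id):
    """Return the stripped text of the element, or None if the id is absent."""
    finder = IdTextFinder(element_id)
    finder.feed(html)
    return finder.text.strip() if finder.found else None


total = text_by_id(SAMPLE_HTML, "total-price")
print(total)  # prints 42.50 EUR
```

Because an ID should identify a single element, the function returns one value rather than a list, and it returns None when the ID does not appear on the page.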

With these two hooks, we can already scrape a lot of different pages.
