Design Improvements of Web Crawler

Learn how to identify the shortcomings and challenges in our design and improve it accordingly.

Introduction

This lesson gives details about the design improvements needed to enhance the functionality, performance, and security of our web crawler design. We have divided the lesson into two sections:

  1. Improvements to functionality and performance – extensibility and multi-worker architecture.
  2. Improvements to security – crawler traps.

Let’s dive deep into each one of them.

Design improvements

Our current design is simplistic and has some inherent shortcomings and challenges. Let us highlight them one by one and make some adjustments to our design along the way.

  • Shortcoming: Currently, our design supports only the HTTP protocol and extracts only textual content. How can we extend our crawler to support multiple communication protocols and extract various file types?

    Adjustment: Since we have two separate components for communication handling and content extraction – HTML Fetcher and Extractor, respectively – let us discuss the modifications to each one by one.

    1. HTML Fetcher – So far, we have only discussed the HTTP module in this component because HTTP is the most widely used URL scheme. We can extend our design to incorporate other communication protocols, such as FTP (File Transfer Protocol). The workflow then gains an intermediary step in which the crawler invokes the appropriate communication module based on the URL’s scheme; the subsequent steps remain the same.
    2. Extractor – Currently, we only extract the textual content from the downloaded document placed in the DIS (Document Input Stream). The document may contain other media types as well, e.g., images and videos. If we wish to extract this other content from the stored document, we need to add new modules that can process those media types. Since we are using a blob store for content storage, storing the newly extracted content, comprising text, images, and videos, will not be a problem.
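Both adjustments follow the same pattern: a registry of pluggable modules with a lookup step that picks one. The sketch below is a minimal illustration of that idea, not the lesson's actual implementation. All names in it (`FETCHERS`, `EXTRACTORS`, `fetch`, `extract`, and the stub modules) are hypothetical, and the fetch modules are stubs that perform no network I/O. It also assumes the extractor is chosen by the document's media type, which the lesson does not specify.

```python
from typing import Callable, Dict, Optional, Union
from urllib.parse import urlsplit

# --- HTML Fetcher: one communication module per URL scheme (hypothetical stubs) ---

def fetch_http(url: str) -> bytes:
    # Stub: a real module would issue an HTTP(S) request here.
    return b"http-module:" + url.encode()

def fetch_ftp(url: str) -> bytes:
    # Stub: a real module would open an FTP session here.
    return b"ftp-module:" + url.encode()

# Supporting a new protocol means adding one entry to this registry.
FETCHERS: Dict[str, Callable[[str], bytes]] = {
    "http": fetch_http,
    "https": fetch_http,
    "ftp": fetch_ftp,
}

def fetch(url: str) -> bytes:
    """Intermediary step: pick the communication module from the URL's scheme."""
    scheme = urlsplit(url).scheme.lower()
    module = FETCHERS.get(scheme)
    if module is None:
        raise ValueError(f"no communication module for scheme {scheme!r}")
    return module(url)

# --- Extractor: one processing module per media type (hypothetical stubs) ---

def extract_text(document: bytes) -> str:
    # Stub: decode the bytes; a real module would also parse the markup.
    return document.decode("utf-8", errors="replace")

def extract_binary(document: bytes) -> bytes:
    # Stub for images and videos: pass the bytes through unchanged.
    return document

# Assumption: modules are keyed by the top-level media type ("text", "image", ...).
EXTRACTORS: Dict[str, Callable[[bytes], Union[str, bytes]]] = {
    "text": extract_text,
    "image": extract_binary,
    "video": extract_binary,
}

def extract(media_type: str, document: bytes) -> Optional[Union[str, bytes]]:
    """Run the module registered for the media type; return None if there is none."""
    top_level = media_type.split("/", 1)[0].strip().lower()
    module = EXTRACTORS.get(top_level)
    return module(document) if module is not None else None
```

With this layout, the rest of the crawler workflow calls only `fetch` and `extract`, so adding a protocol or a media type does not change the subsequent steps.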
