The scraper project

The scraper project's data-processing pipeline consists of the following components (a supervision-tree sketch follows the list):

  • one PageProducer process of the :producer type;

  • two OnlinePageConsumerProducer processes of the :producer_consumer type;

  • one PageConsumerSupervisor process of the :consumer type;

  • up to two PageConsumer processes, started on demand by PageConsumerSupervisor.
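
To make this topology concrete, here is a minimal sketch of how these components might be started under the application's supervisor. Only the module names come from the project; the child ids, arguments, and supervisor options are assumptions:

  children = [
    PageProducer,
    # Two children of the same stage module need distinct child ids;
    # the :opc_* ids here are hypothetical.
    Supervisor.child_spec({OnlinePageConsumerProducer, []}, id: :opc_1),
    Supervisor.child_spec({OnlinePageConsumerProducer, []}, id: :opc_2),
    PageConsumerSupervisor
  ]

  Supervisor.start_link(children, strategy: :one_for_one)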

Flow with GenStage

To demonstrate how Flow works with GenStage, we’ll rewrite the original OnlinePageConsumerProducer implementation using Flow. Flow provides two groups of functions for integrating with GenStage. The first group works with stages that are already running:

  • from_stages/2 receives events from :producer stages.

  • through_stages/3 sends events to :producer_consumer stages and receives what they emit in turn.

  • into_stages/3 sends events to :consumer or :producer_consumer stages.

All functions in this group take a list of process ids (PIDs) identifying the already-running processes to connect to, as the sketch below shows.
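
Here is a minimal sketch of all three functions working together. The stage modules and registered names (SomeProducer and so on) are assumptions for illustration, not part of the scraper project:

  producer = Process.whereis(SomeProducer)
  prod_cons = Process.whereis(SomeProducerConsumer)
  consumer = Process.whereis(SomeConsumer)

  {:ok, _flow_pid} =
    Flow.from_stages([producer])         # receive events from a :producer
    |> Flow.through_stages([prod_cons])  # round-trip through a :producer_consumer
    |> Flow.into_stages([consumer])      # deliver events to a :consumer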

The second group of functions is useful when we want Flow to start the GenStage processes for us. These are the functions:

  • from_specs/2

  • through_specs/3

  • into_specs/3

They work the same way as those in the previous group, except they take a list of tuples instead of a list of PIDs. Each tuple is a child specification, similar to the child specifications we use with supervisors; for example, {Module, args}. Flow uses the child specification to start the processes for us.
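
As a sketch, assuming hypothetical SomeProducer and SomeConsumer stage modules, letting Flow start the stages itself might look like this:

  {:ok, _flow_pid} =
    Flow.from_specs([{SomeProducer, []}])
    # Per the Flow docs, each consumer spec is paired with its
    # subscription options, here an empty list.
    |> Flow.into_specs([{{SomeConsumer, []}, []}])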

We plan to use from_stages/2 and into_stages/3 to connect to the already running PageProducer and PageConsumerSupervisor. We receive new page events, filter out pages whose websites are offline, and send the rest to PageConsumerSupervisor, just like before. Let’s get started.
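
In outline, the rewritten pipeline could look like the following sketch. The registered process names come from the project, but the subscription options and the Scraper.online?/1 helper are assumptions we'll flesh out as we go:

  producer = Process.whereis(PageProducer)
  consumer_sup = Process.whereis(PageConsumerSupervisor)

  {:ok, _flow_pid} =
    Flow.from_stages([producer], max_demand: 1)
    |> Flow.filter(&Scraper.online?/1)  # assumed existing online-check helper
    |> Flow.into_stages([consumer_sup])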

Add flow to the scraper project

First, let’s add flow to the scraper project:

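In mix.exs, add flow to the dependency list (the version requirements below are assumptions; check hex.pm for the latest releases):

  defp deps do
    [
      {:gen_stage, "~> 1.0"},  # already used by the scraper project
      {:flow, "~> 1.0"}        # new dependency
    ]
  end

Then run mix deps.get to fetch it.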