Create Multi-stage Data Pipelines

Learn when multi-stage data pipelines are needed, then implement one by adding a producer-consumer stage to our scraper project.

We’ve already demonstrated how :producer and :consumer stages work in practice. The only type of stage we haven’t seen in action yet is the :producer_consumer stage. Producer-consumer stages are the key to building arbitrarily long data processing pipelines. The good news is that if we understand how producers and consumers work, we already know how producer-consumers work: a producer-consumer receives events like a consumer and emits events like a producer.
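To make the idea concrete, here is a minimal three-stage pipeline sketch. It is not part of the lesson's scraper project; the module names `Counter`, `Doubler`, and `Printer` are illustrative, and it assumes the `gen_stage` dependency is available. The middle stage implements `handle_events/3` like a consumer but returns the transformed events so they flow downstream like a producer.

```elixir
defmodule Counter do
  use GenStage

  def start_link(initial), do: GenStage.start_link(__MODULE__, initial, name: __MODULE__)

  # A :producer emits events in response to demand.
  def init(counter), do: {:producer, counter}

  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Doubler do
  use GenStage

  def start_link(_), do: GenStage.start_link(__MODULE__, :ok, name: __MODULE__)

  # A :producer_consumer subscribes upstream and emits events downstream.
  def init(:ok), do: {:producer_consumer, :ok, subscribe_to: [Counter]}

  # Received events are transformed and re-emitted, not discarded.
  def handle_events(events, _from, state) do
    {:noreply, Enum.map(events, &(&1 * 2)), state}
  end
end

defmodule Printer do
  use GenStage

  def start_link(_), do: GenStage.start_link(__MODULE__, :ok)

  # A :consumer sits at the end of the pipeline and emits no events.
  def init(:ok), do: {:consumer, :ok, subscribe_to: [{Doubler, max_demand: 10}]}

  def handle_events(events, _from, state) do
    Enum.each(events, &IO.inspect/1)
    {:noreply, [], state}
  end
end
```

Notice that `Doubler` looks almost identical to a consumer; the only differences are the `:producer_consumer` tag in `init/1` and the non-empty event list in the `{:noreply, events, state}` return value.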

A word of caution

When we learn to add stages and extend our data pipelines, we may be tempted to organize our business logic using stages rather than plain Elixir modules and functions. As the GenStage documentation warns us, this is an anti-pattern:

“If our domain has to process the data in multiple steps, we should write that logic in separate modules and not directly in a GenStage. We only add stages according to the runtime needs, typically when we need to provide back-pressure or leverage concurrency.”

Use plain functions

A good rule of thumb is to always start with plain functions. When we recognize the need to use back-pressure, we create a two-stage data pipeline first. As we will see in a moment, adding more stages is easy, so we can extend it gradually when we spot an opportunity to improve.

Business logic

First, we need to add some business logic that justifies adding another stage. We open scraper.ex and add the following function:
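The actual function from the project is not shown in this excerpt. As a stand-in, a hypothetical helper of this shape would justify a new stage: it is slow and I/O-like, so running many of these checks concurrently with back-pressure is worthwhile. The name `online?/1` and its behavior are assumptions, not the lesson's code.

```elixir
defmodule Scraper do
  # Hypothetical helper: pretend to check whether a page is reachable.
  # Sleeps to simulate network latency, then returns a boolean,
  # succeeding roughly two times out of three.
  def online?(_url) do
    Process.sleep(Enum.random(100..300))
    Enum.random([false, true, true])
  end
end
```

Because the function is slow but pure from the caller's point of view, it can later be moved behind a producer-consumer stage without changing its interface.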
