Configure the Scraping Pipeline
Explore how to configure a web scraping pipeline using Broadway in Elixir. Learn to set up processors for concurrent message handling, filter out offline websites, batch scrape requests, and maintain performance with dynamic batching. This lesson shows how to integrate custom GenStage producers and manage batch processors for efficient data ingestion pipelines.
Pipeline configuration
We’ll use Broadway's processors to refactor the logic that checks each website. To do this, we define :processors in start_link/1 and implement the handle_message/3 callback:
def start_link(_args) do
  options = [
    name: ScrapingPipeline,
    producer: [
      module: {PageProducer, []},
      transformer: {ScrapingPipeline, :transform, []}
    ],
    processors: [
      default: [max_demand: 1, concurrency: 2]
    ]
  ]

  Broadway.start_link(__MODULE__, options)
end
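
To make the two processor settings easier to read, here is the same :processors value pulled out and annotated. This is only a sketch: the variable name processor_options is hypothetical and used for illustration, and the comments reflect our reading of Broadway's documented :concurrency and :max_demand options rather than anything specific to this project.

```elixir
# Hypothetical variable name; in the pipeline this keyword list is the
# value of the :processors key passed to Broadway.start_link/2.
processor_options = [
  default: [
    # Each processor requests at most one message from the producer at a
    # time, so pages are not buffered behind a slow website check.
    max_demand: 1,
    # Start two processor processes, so up to two messages can be
    # handled by handle_message/3 concurrently.
    concurrency: 2
  ]
]
```

With these values, at most two websites are checked at any moment. Raising concurrency is the usual way to check more sites in parallel, at the cost of more simultaneous outgoing requests.
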
def handle_message(_processor, message, _context) do
  if Scraper.online?(message.data) do
    # To do...
  else
    ...