System Design: Designing a Streaming Data Processing Pipeline

Question

Design a streaming data processing pipeline that can reliably process data in near real-time.

Background

Streaming systems have applications across multiple domains, such as ads, abuse and fraud detection, and real-time analytics. While this is a slightly more domain-specific and knowledge-based question, it's still useful to work through given its broad applicability and the interesting edge and error cases it requires you to reason through.

It is highly recommended that you familiarize yourself with data processing systems: they come up often, and their design decisions and trade-offs generalize to many other problems.

Solution approach

We will follow a similar approach to previous problems, with one additional section: an edge case discussion. There are many potential issues when dealing with real-time data, and this section covers some of them along with mitigation options.

  • Define system requirements:

    We’ll start by clarifying the requirements and nature of the streaming data.

  • System breakdown:

    We’ll then diagram the individual components of the system needed to address the key requirements. We’ll also discuss the role of each component in the system.

  • Dataflow discussion:

    We’ll then discuss how the data flows through each system component.

  • Edge case discussion:

    We’ll then talk about how to handle edge cases and potential issues when it comes to handling streaming data.

  • Scaling the design:

    We’ll then talk about design choices and trade-offs to ensure the system scales.

  • Capacity modeling:

    We’ll finally estimate how many storage and processing machines we’ll need, assuming a high input data volume.

Sample answer

System requirements

We’ll start by clarifying the requirements of the system. This is very important since we need to understand both what the system needs to do and the nature of the streaming data.

We will assume the following requirements:

  1. Data volume:

    At peak load, we need to support on the order of terabytes per second (TB/s) of streaming data.

  2. Multiple data producers and consumers:

    We will need to support multiple Data Producers and Data Consumers.

    • Data Producers are components that produce the input data for our system.
    • Data Consumers are the components that consume the output of our processing system.

    We also need a way for data consumers to specify the data processing transforms to apply to specific streams from data producers (a minimal interface for this is sketched after this list).

  3. Processors:

    We will need to handle basic data transforms, such as pattern matches and filtering. We will also need to handle more complex and stateful processing, such as aggregations and joins (the sketch after this list includes a windowed aggregation as a stateful example).

  4. Data issues:

    We will need to handle the following potential data issues (simple mitigations for the first three are sketched after this list):

    • Data loss: Data may be lost in transit.
    • Data corruption: Data may not arrive in the correct state.
    • Out-of-order data: Data may not arrive in order, and we may need to update previously processed data.
    • Schema changes: The schema of the streaming data may change.
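
To make requirements 2 and 3 concrete, here is a minimal Python sketch of what a consumer-facing transform interface could look like. All names here (`Pipeline`, `register_transform`, the `"logs"` stream) are illustrative assumptions, not any specific framework's API. It shows a stateless transform (a pattern-match filter) alongside a stateful one (a tumbling-window count):

```python
from collections import defaultdict

class Pipeline:
    """Routes events from named producer streams through consumer-registered transforms."""

    def __init__(self):
        # stream name -> list of (transform, consumer callback) routes
        self.routes = defaultdict(list)

    def register_transform(self, stream, transform, consumer):
        # A consumer asks for `transform` to be applied to `stream`'s events.
        self.routes[stream].append((transform, consumer))

    def publish(self, stream, event):
        # Called by a producer; fans the event out to every registered route.
        for transform, consumer in self.routes[stream]:
            for output in transform(event):  # a transform may emit 0..n records
                consumer(output)

def filter_errors(event):
    """Stateless transform: a simple pattern match / filter."""
    if event.get("level") == "ERROR":
        yield event

class WindowedCount:
    """Stateful transform: counts events per tumbling window.

    For brevity this emits a running count on every event; a production
    system would typically emit once, when a watermark closes the window.
    """

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.counts = defaultdict(int)  # window start -> running count

    def __call__(self, event):
        start = event["ts"] - event["ts"] % self.window
        self.counts[start] += 1
        yield {"window_start": start, "count": self.counts[start]}

pipeline = Pipeline()
pipeline.register_transform("logs", filter_errors, lambda e: print("alert:", e))
pipeline.register_transform("logs", WindowedCount(60), lambda e: print("count:", e))

pipeline.publish("logs", {"ts": 120, "level": "ERROR", "msg": "disk full"})
pipeline.publish("logs", {"ts": 130, "level": "INFO", "msg": "heartbeat"})
```

Note the fan-out in `publish`: supporting multiple consumers per stream falls out naturally from keeping a list of routes per stream name.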

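The data issues in requirement 4 each have standard mitigations: checksums catch corruption, per-producer sequence numbers expose loss and duplicates, and a watermark with a bounded lateness allowance handles out-of-order arrival. The sketch below assumes a hypothetical wire format (`seq`, `ts`, and `checksum` fields) purely for illustration; schema changes are usually handled separately, e.g., with a schema registry, and are not shown:

```python
import hashlib
import json

class StreamGuard:
    """Per-stream safeguards against corruption, duplicates, loss, and out-of-order data."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness  # max seconds an event may lag
        self.seen_seq = set()   # sequence numbers already processed (dedup)
        self.max_seq = -1       # highest sequence number seen so far
        self.gaps = []          # missing sequence numbers -> possible data loss
        self.buffer = []        # out-of-order events awaiting the watermark
        self.watermark = 0      # events at or before this timestamp are emitted

    def checksum_ok(self, event):
        payload = json.dumps(event["data"], sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest() == event["checksum"]

    def ingest(self, event):
        """Returns the events now safe to process, in timestamp order."""
        if not self.checksum_ok(event):
            return []  # corrupted in transit: drop (or route to a dead-letter queue)
        if event["seq"] in self.seen_seq:
            return []  # duplicate, e.g., from an at-least-once retry
        self.seen_seq.add(event["seq"])
        if event["seq"] > self.max_seq + 1:
            # A gap in sequence numbers signals possible loss; a real system
            # might request a resend from the producer here.
            self.gaps.extend(range(self.max_seq + 1, event["seq"]))
        self.max_seq = max(self.max_seq, event["seq"])

        # Hold events until the watermark passes them, so late arrivals within
        # `allowed_lateness` can still be emitted in order.
        self.buffer.append(event)
        self.watermark = max(self.watermark, event["ts"] - self.allowed_lateness)
        ready = sorted((e for e in self.buffer if e["ts"] <= self.watermark),
                       key=lambda e: e["ts"])
        self.buffer = [e for e in self.buffer if e["ts"] > self.watermark]
        return ready

def make_event(seq, ts, data):
    payload = json.dumps(data, sort_keys=True).encode()
    return {"seq": seq, "ts": ts, "data": data,
            "checksum": hashlib.sha256(payload).hexdigest()}

guard = StreamGuard(allowed_lateness=10)
print(guard.ingest(make_event(0, ts=100, data={"v": 1})))  # [] -- buffered
print(guard.ingest(make_event(1, ts=95,  data={"v": 2})))  # [] -- late, buffered
print(guard.ingest(make_event(2, ts=120, data={"v": 3})))  # emits ts=95 and ts=100 in order
```

In practice, messaging systems such as Apache Kafka and stream processors such as Flink or Beam provide offsets, watermarks, and delivery guarantees as built-in primitives, so these mechanisms rarely need to be hand-rolled; the sketch just makes the underlying trade-offs visible.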