Data Engineer System Design Interview Questions

Data engineer system design interviews assess your ability to design end-to-end data platforms, not just ETL. You’ll be tested on ingestion, large-scale processing, storage, streaming and batch systems, data quality, and governance.

Mar 10, 2026

Data engineer system design interview questions test your ability to architect end-to-end data platforms that handle ingestion, transformation, storage, and serving at scale. Unlike backend system design, these interviews prioritize throughput, data correctness, and analytical access patterns over request-response latency and transactional APIs.

Key takeaways

  • Trade-off reasoning matters more than tool names: Interviewers evaluate your decision-making framework across competing priorities like latency vs. correctness and cost vs. performance.
  • End-to-end thinking is non-negotiable: Strong answers cover ingestion through governance, not just the compute layer in isolation.
  • Failure modes separate senior from mid-level candidates: Discussing backpressure, schema drift, data skew, and replay strategies signals production experience.
  • Data quality and lineage are core concerns: Systems that lack trust fail regardless of how well they scale.
  • Hybrid architectures dominate modern pipelines: Knowing when to unify batch and streaming logic and when to keep them separate is a core skill interviewers probe.


Most engineers walk into a data engineering system design interview expecting it to feel like a backend interview with bigger datasets. It does not. The mental model is fundamentally different, the failure modes are subtler, and the systems you design must serve analysts, ML engineers, and compliance teams simultaneously. Getting this wrong means you spend 45 minutes talking about API gateways when the interviewer wanted to hear about partitioning strategies and exactly-once delivery.

This guide breaks down what interviewers actually test, the recurring question categories you should prepare for, and how to structure answers that demonstrate senior-level thinking. Whether you are targeting a role at a FAANG company or a fast-growing startup, the underlying design principles remain the same.

Why data engineering system design feels different from backend design

Backend system design revolves around request-response cycles, user-facing APIs, and transactional consistency. You optimize for p99 latency, connection pooling, and cache hit rates. Data engineering system design operates on an entirely different axis.

Data engineers design systems optimized for throughput over latency, append-heavy writes over random updates, and analytical access patterns over point lookups. The “user” is often another system, a dashboard, or an ML training job rather than a human clicking buttons. This shift changes nearly every architectural decision.

Real-world context: At companies like Netflix or Uber, data platforms ingest millions of events per second from devices, services, and partner feeds. The interview is testing whether you can reason about systems at that scale, not whether you have memorized Kafka configuration flags.

Interviewers evaluate how data flows through an organization from ingestion to transformation, storage, and consumption. They probe whether you understand backpressure (a flow-control mechanism where a downstream system signals an upstream producer to slow down when it cannot keep pace with incoming data) and how it propagates, how late-arriving data corrupts aggregations, and why a poorly chosen partition key can silently cripple a pipeline.

The next sections walk through each stage of the data life cycle, starting with the first and often most underestimated challenge: getting data into the system reliably.

High-throughput, append-heavy ingestion

Ingestion is the front door of every data platform, and it is where most production incidents begin. Unlike user-facing APIs where a failed request can be retried by the client, ingestion pipelines must absorb sudden traffic spikes, tolerate regional failures, and handle delayed or duplicated records without operator intervention.

Strong interview answers start by clarifying the ingestion sources:

  • Streaming events from applications, IoT devices, or clickstream collectors
  • CDC feeds from transactional databases capturing row-level changes
  • Batch file drops from external partners landing in object storage
  • API-based pulls from third-party SaaS platforms

Each source has different reliability characteristics. Streaming events can arrive out of order. CDC feeds depend on database replication slots that can fall behind. Batch files may arrive late or with corrupt schemas. Designing a single ingestion layer that normalizes these differences is a core interview challenge.

The following diagram illustrates how multiple source types converge into a unified ingestion layer before reaching downstream processing.

[Diagram: Multi-source data ingestion architecture with unified buffering]

Reliability mechanisms that interviewers expect

Interviewers want to hear specific reliability strategies, not vague references to “making it reliable.” Three mechanisms consistently appear in strong answers.

First, idempotent writes (operations that produce the same result regardless of how many times they are executed) ensure that retries from producers do not create duplicate records in the pipeline. Kafka’s transactional producer API, for example, provides exactly-once semantics within a single Kafka cluster.

Second, dead-letter queues isolate poison messages (records that consistently fail parsing or validation) so they do not block healthy data from flowing through the pipeline. Third, replay strategies enabled by Kafka’s configurable retention or compacted topics allow operators to reprocess historical data when bugs are discovered downstream.
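These mechanisms can be sketched in a few lines of plain Python. The sink below is a hypothetical illustration (not a Kafka API): it dedupes on a producer-assigned `event_id` so retries are idempotent, and quarantines unparseable payloads in a dead-letter list instead of blocking the stream.

```python
import json

class IdempotentSink:
    """Hypothetical sink illustrating idempotent writes plus a dead-letter queue."""

    def __init__(self):
        self.records = {}      # event_id -> record; keyed store makes retries idempotent
        self.dead_letter = []  # quarantined poison messages, kept for inspection/replay

    def write(self, raw_payload: str) -> None:
        try:
            record = json.loads(raw_payload)
            event_id = record["event_id"]
        except (json.JSONDecodeError, KeyError):
            # Poison message: isolate it so healthy data keeps flowing
            self.dead_letter.append(raw_payload)
            return
        # Idempotent write: a producer retry overwrites with identical data
        self.records[event_id] = record

sink = IdempotentSink()
sink.write('{"event_id": "e1", "value": 10}')
sink.write('{"event_id": "e1", "value": 10}')  # retried delivery: no duplicate row
sink.write('not-json')                         # poison message: dead-lettered
```

The same keyed-overwrite idea is what upsert sinks and compacted topics rely on in production systems.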

Attention: Many candidates mention Kafka partitions but forget to discuss partition key selection. A poorly chosen key (like a high-cardinality user ID with skewed distribution) creates hot partitions that bottleneck throughput. Always explain how you would detect and mitigate data skew (an uneven distribution of data across partitions or nodes that forces some workers to process disproportionately more data than others) at the ingestion layer.

Autoscaling consumers to match load completes the picture. Whether you use Kafka consumer groups or Kinesis shard splitting, the interviewer wants to see that you understand operational elasticity rather than static provisioning.

Ingestion without structure leads to chaos, which is why the next critical challenge is transforming raw data into something usable at scale.

Transformations at scale

Once data lands in the pipeline, transformation becomes the dominant cost and complexity driver. At scale, every transformation is a distributed computing problem, and interviewers expect you to reason about how computation distributes across a cluster rather than simply listing transformation steps.

The transformation categories themselves are familiar: cleaning, joining, aggregating, enriching, and deduplicating. What matters in an interview is how you execute them efficiently. A narrow map operation that filters invalid records is cheap. A wide shuffle that redistributes data across every node in the cluster to perform a join is expensive. Understanding this distinction and designing around it is what separates mid-level from senior answers.

Handling skew, shuffles, and state

Skewed joins are a silent killer in production pipelines. When one join key (say, a popular merchant ID) holds orders of magnitude more records than others, a single executor gets overwhelmed while the rest sit idle. Techniques like salting (appending a random suffix to skewed join keys to spread data more evenly across partitions, followed by a fan-out/fan-in join) or broadcast joins for small dimension tables demonstrate that you have dealt with real workloads.
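A minimal sketch of salting, assuming a fan-out factor of 4 (`NUM_SALTS` is a tuning knob, not a standard constant): the fact side scatters each hot key across salted variants, while the dimension side replicates its row once per salt so every variant still finds a match.

```python
import random

NUM_SALTS = 4  # fan-out factor; a tuning assumption, not a standard constant

def salt_fact_key(key: str) -> str:
    # Fact side: rows for a hot key are scattered across NUM_SALTS partitions
    return f"{key}#{random.randrange(NUM_SALTS)}"

def explode_dim_key(key: str) -> list[str]:
    # Dimension side: replicate the row once per salt so every salted
    # fact partition still finds its join match (fan-out/fan-in)
    return [f"{key}#{i}" for i in range(NUM_SALTS)]

# A hot merchant ID no longer lands on a single partition:
fact_keys = {salt_fact_key("merchant_42") for _ in range(1000)}
dim_keys = explode_dim_key("merchant_42")
```

The cost is the dimension-side replication, which is why salting pairs naturally with small dimension tables.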

The following table compares common approaches for handling skewed joins in distributed processing frameworks.

Comparison of Skew-Handling Techniques

| Technique | Memory Requirements | Implementation Complexity | Best-Fit Scenarios |
|---|---|---|---|
| Broadcast Join | High — entire smaller dataset must fit in memory on each worker node | Low — built-in support in most frameworks (e.g., Spark) | Small dataset joins; star schema (large fact tables + small dimension tables); minimizing shuffles |
| Salting | Low — slight increase due to additional salted keys | Medium — requires modifying join logic to add random key components | Heavy data skew causing partition overload; balancing workload across partitions |
| Adaptive Query Execution (AQE) | Moderate — slight overhead from collecting and analyzing runtime statistics | Medium — requires engine support and parameter tuning | Unpredictable or time-varying data distributions; dynamic skew optimization at runtime |
| Bucketed Join | Low — pre-partitioned/sorted data reduces shuffle overhead | Medium-High — requires preprocessing to create buckets during data preparation | Large datasets with frequent joins on the same keys; consistent and repeated join patterns |

Pro tip: When discussing Spark, mention Adaptive Query Execution (AQE), which dynamically coalesces shuffle partitions and optimizes skewed joins at runtime. It signals awareness of modern Spark internals without needing to recite configuration parameters.

Interviewers also probe whether batch and streaming transformations should be unified or kept separate. The Kappa architecture (which routes all data through a single streaming pipeline and replays the stream for reprocessing, eliminating a separate batch layer) unifies everything through streaming, while the Lambda architecture maintains parallel batch and streaming paths. Frameworks like Apache Flink or Spark Structured Streaming allow shared transformation logic across both modes, but they introduce trade-offs around state management, checkpointing overhead, and operational complexity.

A strong answer acknowledges these trade-offs explicitly. Streaming provides freshness but adds complexity in state management and recovery. Batch provides simplicity and cost efficiency but introduces latency. Many production systems use a hybrid approach where streaming handles time-sensitive metrics and batch handles historical backfills and heavy aggregations.

Transformations produce data that must land somewhere useful, and how you store that data determines everything about query performance, cost, and governance.

Storage optimized for analytical access

Storage design in data engineering is not a single decision. It is a cascade of decisions that affect query latency, data freshness, operational cost, and compliance posture. Interviewers use storage questions to gauge whether you think about the full data life cycle or just the query layer.

The first decision is format. Columnar formats like Apache Parquet excel at analytical queries that scan specific columns across millions of rows. Row-oriented formats are better for transactional workloads with full-row reads and writes. Most data engineering interviews assume columnar storage, but you should be prepared to justify the choice.

The second decision is table layout. Partitioning by date is the most common strategy, but it is not always sufficient. Clustering (sorting data within partitions by frequently filtered columns) and Z-ordering (interleaving sort keys to optimize multi-dimensional queries) can dramatically reduce scan times. These techniques matter because storage systems charge by data scanned, not just data stored.
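As a toy illustration of why layout matters, partition pruning can be reduced to filtering file paths before any bytes are read. The Hive-style `dt=` path convention below is an assumption for the sketch, not a property of any specific engine.

```python
from datetime import date

# Hive-style partition layout assumed: <table>/dt=YYYY-MM-DD/<file>.parquet
files = [
    "events/dt=2024-01-01/part-0.parquet",
    "events/dt=2024-01-02/part-0.parquet",
    "events/dt=2024-01-03/part-0.parquet",
]

def prune_partitions(paths: list[str], start: date, end: date) -> list[str]:
    # Partition pruning: skip files whose partition value falls outside the
    # query's date range, so the engine never scans their bytes.
    kept = []
    for path in paths:
        dt = date.fromisoformat(path.split("dt=")[1].split("/")[0])
        if start <= dt <= end:
            kept.append(path)
    return kept

scanned = prune_partitions(files, date(2024, 1, 2), date(2024, 1, 3))
```

Clustering and Z-ordering extend the same idea inside each partition: metadata about value ranges lets the engine skip row groups it can prove irrelevant.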

Historical note: Before lakehouse formats existed, data lakes were essentially “data swamps” with no ACID guarantees. The introduction of Apache Iceberg, Delta Lake, and Hudi brought transactional semantics, time travel, and schema evolution to object storage, bridging the gap between warehouses and lakes.

Lakehouse vs. warehouse trade-offs

This comparison appears in nearly every data engineering system design interview. A strong answer frames the decision around workload characteristics rather than brand loyalty.

Lakehouse vs. Traditional Data Warehouse: Feature Comparison

| Dimension | Lakehouse (Iceberg / Delta Lake / Hudi) | Traditional DW (Snowflake / BigQuery / Redshift) |
|---|---|---|
| Query Latency | Moderate to good; Hudi excels at real-time, Iceberg/Delta optimized for large-scale analytics | Generally low; built-in auto-tuning and columnar storage deliver consistent fast queries |
| Concurrency Support | Snapshot isolation (Iceberg), optimistic concurrency (Delta), high-throughput upserts (Hudi) | High concurrency via multi-cluster (Snowflake), distributed infra (BigQuery), WLM (Redshift) |
| ACID Guarantees | Full ACID on all three; achieved via transaction logs, snapshot isolation, and atomic operations | Full ACID compliance natively built into the engine across all three platforms |
| Cost Model | Lower cost; decoupled compute and storage on cheap object storage (S3, ADLS) | Higher cost; tightly coupled storage and compute with usage-based or reserved pricing |
| Format Openness | Open formats (Parquet, ORC) with open table specs; minimal vendor lock-in | Proprietary storage formats; higher vendor dependency and limited portability |
| ML Workload Compatibility | Strong; direct access to raw/processed data enables feature engineering without duplication | Limited; ML integration typically requires data export to external platforms, adding latency |
Lakehouse formats introduce update strategies like copy-on-write vs. merge-on-read that affect read and write performance differently. Copy-on-write rewrites entire data files on every update, which optimizes read performance but slows ingestion. Merge-on-read stores deltas separately and reconciles them at query time, which speeds up writes at the cost of read latency.

The small-file problem is another operational concern that interviewers love. Streaming pipelines that write frequently produce thousands of tiny files, which degrade query performance because metadata overhead per file becomes significant. Compaction jobs that periodically merge small files into larger ones are essential. Explaining this unprompted signals operational maturity.
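A compaction planner can be sketched as simple bin packing: group small files into batches near a target output size, then rewrite each batch as one larger file. The 128 MB target below is illustrative; real table formats expose it as a table property.

```python
TARGET_FILE_BYTES = 128 * 1024 * 1024  # illustrative 128 MB compaction target

def plan_compaction(file_sizes: list[int]) -> list[list[int]]:
    # Greedily group files into batches that approach the target size;
    # each batch becomes one rewrite task producing a single larger file.
    batches, current, current_bytes = [], [], 0
    for size in sorted(file_sizes):
        if current and current_bytes + size > TARGET_FILE_BYTES:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

mb = 1024 * 1024
plan = plan_compaction([4 * mb] * 40 + [100 * mb])  # many tiny files + one medium file
```

Forty-one input files collapse into three rewrite tasks, each bounded by the target size, which is the metadata-overhead win the paragraph above describes.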

The visual below shows how a tiered storage architecture organizes data from raw through curated zones with compaction and retention policies.

[Diagram: Tiered lakehouse storage architecture with compaction and archival flows]

Storage without trust is just an expensive disk, which brings us to the most frequently underestimated pillar of data engineering interviews: data quality and lineage.

Data quality, lineage, and governance

If downstream consumers do not trust the data, every upstream investment in ingestion, transformation, and storage is wasted. Interviewers increasingly treat data quality and lineage as core architectural concerns rather than optional enhancements.

Automated data quality checks should run at pipeline boundaries: between ingestion and transformation and again between transformation and serving. These checks detect three categories of problems:

  • Freshness violations where data arrives later than the SLA threshold
  • Schema drift where upstream producers change column types or add fields without notice
  • Anomalous values where statistical distributions shift unexpectedly, indicating either bugs or real changes in the source system
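A minimal sketch of the first two check categories, with a hypothetical schema contract and SLA (`EXPECTED_SCHEMA` and `FRESHNESS_SLA` are illustrative, not a specific tool's API):

```python
from datetime import datetime, timedelta, timezone

EXPECTED_SCHEMA = {"user_id": int, "amount": float}  # illustrative data contract
FRESHNESS_SLA = timedelta(hours=1)

def check_freshness(latest_event_time: datetime, now: datetime) -> bool:
    # Freshness gate: data older than the SLA blocks the downstream run
    return now - latest_event_time <= FRESHNESS_SLA

def check_schema(record: dict) -> list[str]:
    # Schema-drift gate: flag missing columns and type changes
    violations = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if column not in record:
            violations.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            violations.append(f"type drift on {column}")
    return violations

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = check_freshness(now - timedelta(minutes=30), now)
drift = check_schema({"user_id": "u-123", "amount": 9.99})  # user_id drifted to str
```

In a real pipeline these gates run at zone boundaries and their violations feed alerting, not just a return value.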
Real-world context: Companies like Airbnb and Spotify have invested heavily in internal data quality platforms that block bad data from propagating downstream. Mentioning tools like Great Expectations, Soda, or Monte Carlo helps, but what interviewers really want is an explanation of where checks run, what they block, and how alerts reach the right team.

Lineage is equally critical. Column-level lineage tracks which upstream columns contribute to each downstream metric, enabling impact analysis when a source schema changes and compliance audits when regulators ask how a number was computed. Data contracts between producing and consuming teams formalize expectations around schema, freshness, and volume, reducing the number of surprise breakages.

Pro tip: When discussing governance, mention role-based access control at the column level, not just the table level. GDPR and CCPA often require masking PII fields for some consumers while exposing them to others. This granularity demonstrates enterprise-scale thinking.

Backfill procedures round out the governance story. When a bug is discovered in transformation logic, how do you reprocess historical data without disrupting live consumers? Strong answers describe atomic swap strategies where new data replaces old data in a single metadata operation, preventing downstream dashboards from ever seeing partial results.
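The atomic swap pattern can be sketched with a toy catalog whose table pointer is the single source of truth for readers (`TableCatalog` is hypothetical, standing in for a metastore or an Iceberg-style catalog):

```python
class TableCatalog:
    # Minimal stand-in for a metastore: consumers always resolve the table
    # through this pointer, so a backfill becomes one atomic metadata update.
    def __init__(self):
        self.pointer = {}  # table name -> physical location

    def publish(self, table: str, location: str) -> None:
        self.pointer[table] = location  # the single atomic swap

    def resolve(self, table: str) -> str:
        return self.pointer[table]

catalog = TableCatalog()
catalog.publish("orders_daily", "s3://warehouse/orders_daily/v1/")

# Backfill: rewrite history into a new location, validate it there,
# then swap the pointer. Readers never observe a half-written v2.
catalog.publish("orders_daily", "s3://warehouse/orders_daily/v2/")
location = catalog.resolve("orders_daily")
```

Lakehouse table formats implement the same idea with an atomic snapshot-pointer commit rather than file-level overwrites.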

Quality and governance operate within a broader ecosystem of tools. Understanding why those tools exist and how to choose between them is the next challenge interviewers present.

Ecosystem complexity and tool trade-offs

Data engineering interviews are littered with “why did you pick X over Y” questions. The goal is never tool memorization. It is demonstrating a decision-making framework rooted in constraints, not preferences.

The following table captures the most common tool comparisons that appear in data engineering system design interviews.

Streaming, Processing & Orchestration Platform Comparison

| Category | Criteria | Apache Kafka / Spark / Airflow | Amazon Kinesis / Apache Beam / Managed Orchestration | Apache Pulsar / Apache Flink / Dagster |
|---|---|---|---|---|
| Kafka vs. Kinesis vs. Pulsar | Throughput | 100,000+ msg/sec per partition | Per-shard: 1 MB/s write, 2 MB/s read | 1M+ msg/sec via decoupled architecture |
| | Operational Complexity | High — requires managing brokers & ZooKeeper/KRaft | Low — fully managed by AWS | Medium-High — brokers + BookKeeper nodes |
| | Managed Options | Amazon MSK, Confluent Cloud | Fully managed (native AWS) | Limited; some hosted solutions available |
| Spark vs. Flink vs. Beam | Batch Strength | Strong — originally designed for batch | Unified batch/stream abstraction layer | Primarily streaming; batch APIs available |
| | Streaming Strength | Micro-batching; higher latency | Unified model; relies on Flink/Spark runners | True streaming; low latency, real-time ready |
| | State Management | State store + RocksDB backend support | Runner-dependent state management | Advanced stateful compute with RocksDB support |
| Airflow vs. Dagster vs. Managed Orchestration | DAG Authoring Model | Python scripts; flexible but manual | GUI-based (e.g., Step Functions, Cloud Composer) | Python API with type safety and modularity |
| | Observability | Web UI for DAG runs, tasks, and logs | Integrated dashboards with alerting | Real-time monitoring, logs, and execution history |
| | Cloud-Native Integration | Requires manual cloud configuration | Native integration with cloud ecosystems | Built-in integrations with cloud services |
A strong framework evaluates each tool across four dimensions. Operational complexity measures how much infrastructure you must manage. Ecosystem maturity captures community support, connector availability, and production track record. Scaling characteristics determine whether the tool handles your projected throughput. And managed-service availability affects whether your team can focus on pipeline logic rather than cluster management.

Attention: Avoid saying “I would use Kafka because it is the industry standard.” That answer reveals nothing about your reasoning. Instead, explain that Kafka’s configurable retention enables replay for backfills, its partitioned log model supports horizontal scaling, and its schema registry integration reduces serialization bugs. Then acknowledge that Kinesis offers lower operational overhead for teams without dedicated infrastructure engineers.

When discussing schema-on-read vs. schema-on-write trade-offs, connect the choice to the pipeline’s maturity and consumer expectations. Schema-on-write validates data against a predefined schema before storing it, catching errors early; schema-on-read stores data in its raw form and applies structure at query time, offering flexibility but deferring error detection. Early-stage data exploration benefits from schema-on-read flexibility. Production dashboards require schema-on-write enforcement to prevent broken metrics.

With tool trade-offs established, the next section covers the specific question categories you will encounter and how to structure answers for each one.

Core questions to expect in interviews

Most data engineer system design interview questions cluster around six recurring scenarios. Each one probes different aspects of your architectural thinking, but they all share a common thread: the interviewer wants to see how you handle scale, failure, and trade-offs simultaneously.

Designing a real-time streaming pipeline

Real-time pipeline questions test your understanding of event-time processing, late data handling, and stateful computation. A strong answer begins by clarifying latency requirements (sub-second vs. sub-minute changes the entire architecture) and delivery semantics.

Key design elements include the streaming platform (Kafka, Kinesis, or Pulsar), a partitioning strategy that avoids hotspots, watermark-based windowing for handling out-of-order events, and low-latency sinks for real-time serving. What elevates the answer is explaining how backpressure propagates when a sink slows down, how state checkpointing enables recovery after failures, and how schema evolution is managed across long-running streaming jobs.

Pro tip: Discuss sessionization logic (grouping events into user sessions based on inactivity gaps) as a concrete example of stateful stream processing. It forces you to reason about state growth, session timeouts, and late arrivals simultaneously, which is exactly what interviewers want to see.
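A bare-bones sessionizer, assuming a single user's events and a 30-minute inactivity gap (real stream processors maintain this state per key with timers and watermarks rather than a sorted list):

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # inactivity gap that closes a session

def sessionize(event_times: list[datetime]) -> list[list[datetime]]:
    # Group one user's events into sessions: a new session starts whenever
    # the gap since the previous event exceeds SESSION_GAP.
    sessions: list[list[datetime]] = []
    for ts in sorted(event_times):  # sorting stands in for event-time ordering
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

t0 = datetime(2024, 1, 1, 9, 0)
events = [t0, t0 + timedelta(minutes=10), t0 + timedelta(hours=2)]
sessions = sessionize(events)
```

Even this toy version surfaces the interview talking points: per-key state growth, when a session can be finalized, and what happens when a late event would reopen a closed session.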

The following visual illustrates a real-time streaming pipeline with watermarking and checkpointing.

[Diagram: Real-time streaming pipeline with fault tolerance]

Designing a batch ETL pipeline

Batch pipelines may seem straightforward, but senior-level answers focus on operational reality rather than just the happy path. The interviewer wants to hear about retry strategies for failed tasks, partial reprocessing that avoids recomputing an entire day’s data, and atomic publishing so downstream consumers never read incomplete partitions.

Partition design drives both performance and reprocessing granularity. If you partition by date and hour, you can reprocess a single hour without touching the rest of the day. Schema enforcement through external catalogs (like the Hive Metastore or AWS Glue Catalog) prevents silent type coercion bugs. Cost optimization through spot instances and partition pruning shows awareness of cloud economics.

A typical discussion should touch on how malformed records are handled (quarantine vs. fail-fast), how backfills are orchestrated without duplicating data, and how metadata catalogs track which partitions have been validated.

Designing a data warehouse or lakehouse

This question tests analytical storage trade-offs at depth. Interviewers expect you to compare query latency under concurrent workloads, ACID guarantee strength, cost models (compute-storage separation vs. tightly coupled), and format openness.

Discuss compaction strategies, partition evolution (changing partition schemes without rewriting history), and the role of catalog services like Iceberg’s REST catalog in enabling multi-engine access. Mentioning star schema vs. snowflake schema (star denormalizes dimension tables for simpler, faster queries; snowflake normalizes dimensions into sub-dimensions, trading extra joins for less storage) demonstrates familiarity with analytical data modeling, which many candidates overlook in favor of infrastructure discussion.

Real-world context: Companies like Apple and Netflix have publicly discussed their migrations from traditional warehouses to Iceberg-based lakehouses. These migrations were driven by the need for open formats that multiple compute engines (Spark, Trino, Flink) could access without vendor lock-in.

Building a feature store for machine learning

Feature store questions expose subtle correctness challenges. The central concern is training-serving skew, where the feature values used during training diverge from those served during inference (because of differences in computation logic, data freshness, or pipeline bugs), silently degrading model accuracy.

Strong answers separate the offline store (used for historical feature retrieval during training) from the online store (used for low-latency serving during inference). Explain how point-in-time correctness prevents data leakage by ensuring that training examples only use features that were available at the time of the original event. Feature versioning, backfill procedures, and freshness SLAs round out the discussion.

Python

import pandas as pd

def get_point_in_time_features(
    events: pd.DataFrame,        # columns: entity_id, event_timestamp
    feature_store_client,        # client with .get_feature_history(entity_ids) -> DataFrame
    feature_names: list[str],
) -> pd.DataFrame:
    entity_ids = events["entity_id"].unique().tolist()
    # Fetch full materialization history for all relevant entities.
    # Expected columns: entity_id, feature_name, feature_value, materialized_at
    feature_history: pd.DataFrame = feature_store_client.get_feature_history(
        entity_ids=entity_ids,
        feature_names=feature_names,
    )
    # Cross-join events with feature history on entity_id
    merged = events.merge(feature_history, on="entity_id", how="left")
    # Point-in-time filter: keep only features materialized at or before the event timestamp
    pit_correct = merged[merged["materialized_at"] <= merged["event_timestamp"]]
    # For each (entity, event_timestamp, feature), keep the most recent materialization
    pit_correct = (
        pit_correct
        .sort_values("materialized_at", ascending=False)
        .groupby(["entity_id", "event_timestamp", "feature_name"], as_index=False)
        .first()  # latest materialization that is still <= event_timestamp
    )
    # Pivot so each feature becomes its own column in the training dataset
    training_dataset = pit_correct.pivot_table(
        index=["entity_id", "event_timestamp"],
        columns="feature_name",
        values="feature_value",
        aggfunc="first",
    ).reset_index()
    training_dataset.columns.name = None  # flatten the pivoted column index name
    return training_dataset

Designing a metrics or analytics platform

Analytics platforms introduce challenges around aggregation granularity, high-cardinality dimensions, and query latency under concurrent BI tool access. A strong answer explains the pre-aggregation strategy: raw events are rolled up into hourly and daily aggregates, and high-cardinality dimensions (columns with a very large number of distinct values, such as user IDs or session IDs, whose aggregation state makes GROUP BY queries expensive) are managed by either limiting dimensions in pre-aggregated tables or using approximate algorithms like HyperLogLog for distinct counts.

Discuss tiered storage where hot data (recent days) lives in a fast OLAP engine like ClickHouse or Druid, warm data lives in a warehouse, and cold data is archived to object storage. Materialized views and pre-computed rollups serve the most common dashboard queries without scanning raw data.
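The pre-aggregation idea in miniature: collapse raw events into hourly (hour, page) counts so dashboards hit a small rollup table instead of scanning raw data. A plain-Python sketch, not a specific OLAP engine's API:

```python
from collections import defaultdict
from datetime import datetime

def rollup_hourly(events: list[tuple[datetime, str]]) -> dict:
    # Pre-aggregation: bucket each event into its hour and count per
    # (hour, page) pair; dashboards query this rollup, not raw events.
    counts: dict = defaultdict(int)
    for ts, page in events:
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        counts[(bucket, page)] += 1
    return dict(counts)

raw = [
    (datetime(2024, 1, 1, 9, 5), "/home"),
    (datetime(2024, 1, 1, 9, 45), "/home"),
    (datetime(2024, 1, 1, 10, 2), "/home"),
]
hourly = rollup_hourly(raw)
```

Note what this sketch cannot do: counting distinct users per hour would require keeping per-bucket sets, which is exactly where approximate structures like HyperLogLog earn their place.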

Building a CDC pipeline

CDC pipeline questions test correctness under operational pressure. Start by distinguishing log-based CDC (reading the database’s write-ahead log) from query-based CDC (polling for changes), and explain why log-based approaches are preferred for correctness and reduced source database load.

Address snapshotting for initial loads, incremental consumption for ongoing changes, ordering guarantees across partitions, and schema evolution when the source database adds columns. Deduplication strategies and cross-region replication concerns round out a strong answer.
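A toy replay loop showing how LSN (log sequence number) ordering makes CDC application deterministic and duplicate-safe (the field names `lsn`, `pk`, and `op` are illustrative, not a specific connector's payload format):

```python
def apply_cdc(changes: list[dict]) -> dict:
    # Replay log-based CDC changes in LSN order; a duplicate or
    # out-of-order delivery never overwrites a newer row version.
    table: dict = {}        # primary key -> latest row
    applied_lsn: dict = {}  # primary key -> highest LSN applied so far
    for change in sorted(changes, key=lambda c: c["lsn"]):
        pk = change["pk"]
        if change["lsn"] <= applied_lsn.get(pk, -1):
            continue  # duplicate delivery: already applied
        if change["op"] == "delete":
            table.pop(pk, None)
        else:  # insert / update
            table[pk] = change["row"]
        applied_lsn[pk] = change["lsn"]
    return table

state = apply_cdc([
    {"lsn": 2, "pk": 1, "op": "update", "row": {"status": "paid"}},
    {"lsn": 1, "pk": 1, "op": "insert", "row": {"status": "new"}},
    {"lsn": 2, "pk": 1, "op": "update", "row": {"status": "paid"}},  # duplicate delivery
])
```

Real pipelines cannot globally sort an unbounded stream, so the per-key high-water mark is the part that carries over to production designs.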

Attention: Many candidates forget to discuss how transactional boundaries are preserved in CDC. If a source transaction updates three tables atomically, the CDC pipeline must either maintain that atomicity or clearly document that downstream consumers may see partial transactions temporarily.

Each of these question types tests different skills, but the structure of your answer matters as much as the content. The next section covers how to organize your thinking like a senior engineer.

How to structure your answers like a senior engineer

Beyond any single question, interviewers evaluate your reasoning process. Senior candidates exhibit a consistent structure that makes complex designs easy to follow and easy to challenge constructively.

Clear separation of system stages. Explicitly decompose your design into ingestion, storage, transformation, serving, orchestration, and governance. This separation supports technical scalability (each layer scales independently) and organizational ownership (different teams own different layers). Data contracts between layers formalize expectations.

Trade-off awareness at every decision point. Depth is demonstrated by reasoning about competing priorities. Latency vs. correctness. Cost vs. performance. Simplicity vs. flexibility. The goal is not to find the “right” answer but to show that you understand the consequences of each choice.

Pro tip: When the interviewer asks “why not X?”, treat it as a gift, not a challenge. It is an invitation to demonstrate trade-off reasoning. Respond with “X would work if our constraint were Y, but given Z, I chose this approach because…”

Proactive failure mode discussion. Great candidates do not wait for the interviewer to ask “what happens when this fails?” They volunteer failure scenarios and mitigations. Skewed partitions, corrupt files, late events, schema drift, stuck consumers, and state backend corruption are all fair game. Explaining both detection (monitoring, alerting, anomaly detection) and mitigation (circuit breakers, dead-letter queues, automated backfills) builds credibility.

Metadata and lineage as architectural pillars. Senior answers consistently include cataloging, column-level lineage, data quality dashboards, access controls, and audit trails. These are not afterthoughts. They are load-bearing components that determine whether the system earns organizational trust.

The following diagram shows how these structural elements connect in a unified analytics platform.

[Diagram: Unified analytics platform with governance overlay]

Sample question walk-through: design a unified analytics platform

Bringing everything together, here is how a senior engineer might structure a response to “Design a unified analytics platform that serves BI, ML, and real-time dashboards.”

Ingestion layer. Multi-source connectors feed into Kafka with a Confluent Schema Registry enforcing Avro schemas. Batch file drops land in an S3 raw zone and trigger orchestrated ingestion jobs. Dead-letter queues capture unparseable records.

Storage layer. Three zones built on Iceberg tables over S3. The raw zone preserves source data as-is. The standardized zone applies deduplication, type casting, and schema enforcement. The curated zone contains business-level aggregations, feature tables, and dimensional models optimized for BI access.

Compute layer. Flink handles streaming transformations for sub-minute freshness metrics. Spark handles batch backfills, heavy joins, and historical aggregations. An interactive engine like Trino serves ad-hoc exploration queries directly against the lakehouse.
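To make the streaming side concrete, a tumbling-window count can be sketched in pure Python as a stand-in for what Flink's windowing does at scale; the window size and event shape are assumptions for the example:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Assign each event to a fixed, non-overlapping (tumbling) window
    and count events per window. `events` are (epoch_seconds, value) pairs.
    """
    counts = defaultdict(int)
    for ts, _value in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

events = [(0, "a"), (30, "b"), (61, "c"), (119, "d"), (120, "e")]
print(tumbling_window_counts(events))  # {0: 2, 60: 2, 120: 1}
```

What this sketch deliberately omits, and what the interviewer will probe, is event time vs. processing time and late-event handling via watermarks.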

Serving layer. A warehouse (Snowflake or BigQuery) powers BI dashboards with high concurrency. The curated Iceberg tables serve ML training jobs directly. Materialized views in an OLAP engine like ClickHouse handle real-time dashboard queries.
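The materialized-view idea reduces to precomputing aggregates once so dashboards do cheap lookups instead of per-query scans. A toy sketch standing in for a ClickHouse materialized view (the schema is illustrative):

```python
def refresh_materialized_view(curated_rows):
    """Precompute a daily-revenue 'view' that dashboard queries read
    with O(1) lookups instead of scanning the curated table each time.
    """
    view = {}
    for row in curated_rows:
        view[row["day"]] = view.get(row["day"], 0) + row["revenue"]
    return view

rows = [
    {"day": "2024-01-01", "revenue": 100},
    {"day": "2024-01-01", "revenue": 50},
    {"day": "2024-01-02", "revenue": 75},
]
mv = refresh_materialized_view(rows)
print(mv["2024-01-01"])  # 150
```

The design question hiding here is refresh strategy: full recompute is simple but expensive, while incremental refresh is cheap but must handle late-arriving and updated rows.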

Governance layer. Column-level lineage tracked through a metadata catalog. Data quality gates at each zone boundary enforce freshness, schema, and statistical checks. Role-based access control masks PII fields for non-privileged consumers. Audit trails log every query and transformation for compliance.
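Role-based PII masking can be illustrated with a small sketch; the field names and role names are assumptions for the example:

```python
PII_FIELDS = {"email", "phone"}            # illustrative PII columns
PRIVILEGED_ROLES = {"compliance", "admin"}  # illustrative roles

def mask_row(row, role):
    """Return a copy of `row` with PII fields masked for non-privileged roles."""
    if role in PRIVILEGED_ROLES:
        return dict(row)
    return {k: ("***" if k in PII_FIELDS else v) for k, v in row.items()}

row = {"user_id": 42, "email": "a@b.com", "country": "DE"}
print(mask_row(row, "analyst"))  # {'user_id': 42, 'email': '***', 'country': 'DE'}
print(mask_row(row, "admin"))    # {'user_id': 42, 'email': 'a@b.com', 'country': 'DE'}
```

In production this policy lives in the query engine or catalog (dynamic view policies, tag-based masking) rather than application code, so it applies uniformly to every consumer.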

The batch path of this design maps naturally onto an orchestration DAG. The Airflow sketch below wires the pieces together: a sensor gates on raw ingestion, each zone transition is a task, and a combined failure callback routes every failure to both alerting and a dead-letter queue.

Python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor


# --- Failure callback handlers ---
def alert_on_failure(context):
    # Route failure notification to alerting system (e.g., PagerDuty, Slack)
    task_id = context["task_instance"].task_id
    dag_id = context["task_instance"].dag_id
    print(f"[ALERT] Task '{task_id}' in DAG '{dag_id}' failed. Sending alert.")
    # alert_client.send(dag_id=dag_id, task_id=task_id, run_id=context["run_id"])


def dead_letter_on_failure(context):
    # Push failed task metadata to dead-letter queue for reprocessing
    task_id = context["task_instance"].task_id
    run_id = context["run_id"]
    print(f"[DEAD-LETTER] Routing failed task '{task_id}' (run: {run_id}) to dead-letter queue.")
    # dead_letter_queue.publish({"task_id": task_id, "run_id": run_id})


def combined_failure_callback(context):
    # Invoke both alert and dead-letter handlers on any task failure
    alert_on_failure(context)
    dead_letter_on_failure(context)


# --- Task business logic stubs ---
def standardize_raw_data(**kwargs):
    # Normalize and validate schema of raw ingested records
    print("Standardizing raw data...")
    # standardizer.run(source="raw_zone", target="standardized_zone")


def run_quality_gate(**kwargs):
    # Assert data quality rules; raise exception to trigger failure path if checks fail
    print("Running quality gate checks...")
    # results = quality_checker.evaluate(dataset="standardized_zone")
    # if not results.passed: raise ValueError(f"Quality gate failed: {results.summary}")


def aggregate_to_curated(**kwargs):
    # Apply business aggregations and write to curated layer
    print("Aggregating to curated layer...")
    # aggregator.run(source="standardized_zone", target="curated_zone")


def refresh_materialized_view(**kwargs):
    # Refresh downstream materialized views for BI/reporting consumers
    print("Refreshing materialized views...")
    # db.execute("REFRESH MATERIALIZED VIEW curated_summary_mv")


# --- Default task arguments applied to all tasks ---
default_args = {
    "owner": "data-platform",
    "depends_on_past": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": combined_failure_callback,  # Global failure routing
}

with DAG(
    dag_id="batch_pipeline_orchestration",
    default_args=default_args,
    description="Batch path: ingest → standardize → quality gate → curate → materialize",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["batch", "ingestion", "quality", "curated"],
) as dag:
    # Sensor waits for upstream raw ingestion DAG/task to complete before proceeding
    raw_ingestion_sensor = ExternalTaskSensor(
        task_id="wait_for_raw_ingestion",
        external_dag_id="raw_ingestion_dag",    # DAG that writes to raw zone
        external_task_id="ingest_raw_records",  # Specific task to await
        timeout=3600,
        poke_interval=60,
        mode="reschedule",  # Free worker slot while waiting
    )

    standardize_task = PythonOperator(
        task_id="standardize_raw_data",
        python_callable=standardize_raw_data,
    )

    quality_gate_task = PythonOperator(
        task_id="run_quality_gate",
        python_callable=run_quality_gate,
    )

    curated_aggregation_task = PythonOperator(
        task_id="aggregate_to_curated",
        python_callable=aggregate_to_curated,
    )

    materialized_view_refresh_task = PythonOperator(
        task_id="refresh_materialized_view",
        python_callable=refresh_materialized_view,
    )

    # Define linear dependency chain: sensor → standardize → quality → curate → materialize
    (
        raw_ingestion_sensor
        >> standardize_task
        >> quality_gate_task
        >> curated_aggregation_task
        >> materialized_view_refresh_task
    )

Real-world context: Uber’s data platform team has published extensively about their unified analytics architecture, which uses Apache Hudi on HDFS/S3 with Flink for incremental processing and Presto/Trino for interactive queries. Their architecture handles petabytes of daily data across thousands of internal consumers.

The platform’s strength is cohesion. Transformation logic is shared where possible between batch and streaming paths. Feature tables in the curated zone serve both BI and ML workloads. Governance is not an afterthought bolted on at the end but a structural layer that runs through every zone boundary.
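The shared-logic point is worth making concrete: when one pure function defines the transformation, the streaming path applies it per event and the batch path maps it over historical partitions, so the two can never drift apart. A minimal sketch with illustrative field names:

```python
def standardize_event(event):
    """Single transformation shared by both execution paths, so business
    logic is defined exactly once."""
    return {
        "event_id": str(event["id"]),
        "amount_cents": round(float(event["amount"]) * 100),
    }

# Streaming path: applied per record as events arrive.
stream_out = standardize_event({"id": 7, "amount": "12.34"})

# Batch path: the same function mapped over a historical partition.
batch_out = [standardize_event(e) for e in [{"id": 1, "amount": "1.00"}]]

print(stream_out)  # {'event_id': '7', 'amount_cents': 1234}
```

In practice the shared function is registered with both the Flink job and the Spark backfill, which is the lightweight version of the idea behind unifying the two code paths.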

Conclusion

Data engineering system design interviews test three things above all else: your ability to reason about trade-offs across the full data life cycle, your awareness of how systems fail and recover in production, and your understanding that data quality and governance are architectural requirements rather than optional polish. If your answers consistently demonstrate end-to-end thinking, operational maturity, and principled decision-making, you communicate something more valuable than technical knowledge: engineering judgment.

The field is moving toward tighter integration between streaming and batch processing, more sophisticated metadata-driven automation, and wider adoption of open table formats that decouple storage from compute engines. Engineers who understand these trends and can articulate why they matter will be well-positioned as data platforms continue to grow in organizational importance.

Prepare by designing systems on paper, not by memorizing tools. The tools will change. The thinking endures.


Written By:
Zarish Khalid