Table of Contents
  • Why ad click aggregation is a system design favorite
  • Clarifying requirements and choosing the right guarantees
      • Splitting outcomes into two tiers
      • Scale and burstiness
      • Semantic honesty about exactly-once
  • High-level architecture and why dual pipelines exist
      • The event log as system of record
      • Storage layout
  • Ingestion edge and event integrity
      • Minimal, fast validation
      • Schema evolution for long-lived SDKs
      • Event integrity without full fraud prevention
  • Streaming aggregation with state, windows, and watermarks
      • Event-time windows vs. processing time
      • Late events and watermark configuration
      • Skew and hot key mitigation
  • Effective exactly-once without magical guarantees
      • The three-part recipe
      • Deduplication in practice
      • Batch as the ultimate correction layer
  • Multi-dimensional analytics without storage explosion
      • Primary cubes and selective materialization
      • Approximate distinct counts with HyperLogLog
  • Storage layout and query routing
      • Hot tier and cold tier
      • Query routing as the interview move
      • Handling corrections
  • Walk-through 1: a click becomes a real-time dashboard update
  • Walk-through 2: a click becomes a billing line item
  • Fraud detection as a core pipeline component
      • Where fraud detection fits
  • Degradation and cost controls
      • Graceful degradation under load
      • Cost controls and tiered retention
  • Reconciliation and audit for billing correctness
      • Reconciliation as a continuous process
      • Auditability through lineage
  • Observability metrics that keep you honest
      • Metrics by pipeline stage
      • SLAs and numeric targets
  • What a strong interview answer sounds like
      • The 60-second outline
      • The checklist interviewers are scoring against
  • Conclusion
Ad Click Aggregator System Design

This blog explains how to ace an Ad Click Aggregator System Design interview by walking through a dual streaming + batch pipeline, showing how to handle deduplication, late events, hot keys, cost-aware analytics, and billing-grade reconciliation.

Mar 11, 2026

Every ad you click triggers a measurement event that someone, somewhere, is counting toward a billion-dollar invoice. An ad click aggregator is the system that ingests raw click events at extreme scale, produces near-real-time dashboards for campaign optimization, and generates audited billing reports that advertisers trust enough to pay. The core architectural challenge is that no single pipeline can optimize for both freshness and correctness, so production systems split into a streaming path for fast estimates and a batch path for billing truth, unified by a raw event log as the shared source of truth.

Key takeaways

  • Dual-pipeline design: A streaming path delivers near-real-time dashboard metrics while a separate batch path produces authoritative, auditable billing reports.
  • Effective exactly-once: Rather than claiming global exactly-once delivery, production systems combine at-least-once ingestion with idempotent sinks, deduplication, and checkpointing to achieve correct final outputs.
  • Late events and watermarking: Event-time windows paired with configurable watermarks handle delayed arrivals in streaming, while batch recomputation catches everything the watermark misses.
  • Hot key mitigation: Techniques like key salting and hierarchical aggregation prevent a single dominant campaign from overloading partitions and stream processors.
  • Reconciliation and lineage: Comparing streaming estimates against batch truth with drift thresholds, combined with raw-log traceability, is what makes billing disputes resolvable.


Every second you spend online, invisible infrastructure is counting, attributing, and billing for the ads you see and click. When that infrastructure breaks, advertisers either overpay or underpay, publishers lose revenue, and trust collapses. Designing an ad click aggregator is one of the highest-stakes system design problems because it sits at the intersection of real-time analytics and financial auditability, and getting only one right means the system fails.

Why ad click aggregation is a system design favorite#

Advertising measurement is deceptively simple on the surface. You count clicks. But beneath that simplicity lies a tangle of competing requirements that makes this problem a staple of Staff-level system design interviews. Real-time dashboards demand freshness and low latency so campaign managers can adjust spend mid-flight. Billing reports demand correctness, completeness, and traceability so finance teams can close the books.

The inputs themselves are hostile. Mobile SDKs retry aggressively. Devices go offline and flush buffered events hours later. Clock skew makes timestamps unreliable. Traffic concentrates on a handful of hot campaigns during live events or product launches. Some publishers may even tamper with payloads to inflate counts. A strong design makes these realities explicit and uses architecture to contain them rather than hoping they won’t happen.

Real-world context: At companies like Google and Meta, ad measurement pipelines process tens of billions of events daily. Even a 0.1% error rate in click counting can translate to millions of dollars in billing discrepancies per quarter.

This question tests whether you can hold two truths simultaneously. You need speed for decisions and correctness for money. If you design only one pipeline, you’ll either be too slow for optimization or too wrong for billing. The rest of this post walks through a practical dual-pipeline architecture that respects both constraints.

Before we can design anything, we need to nail down what exactly we’re promising and to whom.

Clarifying requirements and choosing the right guarantees#

In interviews, you’re not graded on listing every metric. You’re graded on defining contracts. An ad click aggregator is fundamentally a counting and attribution system, so you must state what you can guarantee, what you approximate, and where the source of truth lives.

Splitting outcomes into two tiers#

Start by separating your outputs into two categories:

  • Fast estimates (dashboard tier): Near-real-time click counts, spend pacing, geo and device breakdowns, anomaly alerts. Latency target of seconds to low minutes. Slight inaccuracy is acceptable.
  • Authoritative truth (billing tier): Deduplicated, fraud-adjusted, fully reconciled aggregates partitioned by day and campaign. Correctness and auditability are non-negotiable, even if reports lag by hours.

Then define which dimensions matter. Advertisers typically slice by campaign, ad creative, publisher, geo, device type, and time bucket. Dashboard queries need sub-second response on recent windows. Billing queries tolerate higher latency but demand exact numbers.

Scale and burstiness#

“Billions of events per day” is the baseline. At peak, during live sports or breaking news, that translates to 1 to 10 million events per second with heavy skew where a few campaigns dominate traffic. This pushes you toward an append-only log for ingestion, streaming compute for incremental aggregates, and a batch warehouse for correctness.
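A quick back-of-envelope conversion makes those numbers concrete (the daily volume and peak-to-average ratio below are illustrative assumptions, not measurements):

```python
# Back-of-envelope capacity math. DAILY_EVENTS and PEAK_MULTIPLIER are
# illustrative assumptions chosen to match the "billions per day" baseline.
DAILY_EVENTS = 10_000_000_000      # 10B events/day
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 10               # assumed peak-to-average ratio

avg_eps = DAILY_EVENTS / SECONDS_PER_DAY
peak_eps = avg_eps * PEAK_MULTIPLIER

print(f"average: ~{avg_eps:,.0f} events/sec")   # average: ~115,741 events/sec
print(f"peak:    ~{peak_eps:,.0f} events/sec")  # peak:    ~1,157,407 events/sec
```

Even the average rate rules out per-event synchronous database writes at the edge, which is why the design leans on an append-only log and batched ingestion.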

The following table compares the two tiers and their engineering contracts:

Dashboard Tier vs Billing Tier: Key Dimension Comparison

| Dimension | Dashboard Tier | Billing Tier |
| --- | --- | --- |
| Latency Target | Ultra-low (sub-50ms) for real-time responsiveness | Higher tolerance (seconds to minutes); batch-oriented |
| Accuracy Guarantee | Speed prioritized; slight accuracy trade-offs acceptable | High accuracy mandatory due to financial implications |
| Data Freshness | Near-instant (within seconds); 99% of events queryable within ~10s | Less critical; hourly or daily batch processing sufficient |
| Query Routing Destination | High-performance in-memory or hot storage systems | Cost-effective cold storage or data warehouses |
| Acceptable Error Tolerance | Up to ~1% tolerance acceptable to reduce latency | Near-zero tolerance (~0.01% or less); rigorous accuracy required |

Semantic honesty about exactly-once#

Here is the most important contract to get right. You should not promise global exactly-once across the entire distributed pipeline. Instead, promise at-least-once ingestion combined with effective exactly-once: a correctness guarantee achieved not through a single magical delivery semantic but through the combination of idempotent sinks, event-level deduplication, and deterministic checkpointing, ensuring that final aggregated outputs reflect each event exactly once even when the underlying transport may deliver duplicates. That’s the language interviewers expect from someone who has operated streaming systems at scale.

Attention: Saying “exactly-once” as a blanket statement is a red flag in interviews. Strong answers explain where you get idempotency, how you handle replays, and what happens when deduplication windows expire.

Be explicit about dedupe keys early. A composite key like click_id + source + timestamp gives you a stable identifier for deduplication. Define your late-event policy (how late is too late for streaming) and clarify that batch will reconcile anything streaming misses.
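A minimal sketch of that composite dedupe key, assuming the field names above. Hashing the composite keeps the key fixed-width, which is convenient for state stores and Bloom filters:

```python
import hashlib

def dedupe_key(click_id: str, source: str, event_ts_ms: int) -> str:
    """Derive a stable deduplication key from the composite
    click_id + source + timestamp. Field names are illustrative;
    use whatever your event schema actually defines."""
    raw = f"{click_id}|{source}|{event_ts_ms}"
    return hashlib.sha256(raw.encode()).hexdigest()

# The same logical event always maps to the same key, so retries collapse.
k1 = dedupe_key("c-123", "ios-sdk", 1718000000123)
k2 = dedupe_key("c-123", "ios-sdk", 1718000000123)
assert k1 == k2
```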

With requirements and guarantees locked down, the next step is understanding why two pipelines are structurally necessary rather than just convenient.

High-level architecture and why dual pipelines exist#

A dual-pipeline architecture exists because a single pipeline cannot simultaneously optimize for freshness and correctness at internet scale. This is not a theoretical preference. It’s an operational reality proven out at every major ad platform.

The streaming pipeline is optimized for speed. It consumes events as they arrive, maintains windowed state, and writes incremental aggregates to a low-latency OLAP store. The batch pipeline is optimized for completeness. It reads archived raw events, applies finalized fraud decisions, includes late arrivals, and produces audited outputs that overwrite prior results.

The following diagram illustrates how both pipelines share a common backbone:

Diagram: Dual-pipeline event processing architecture with lambda pattern

The event log as system of record#

The backbone of both paths is the raw event log, typically Apache Kafka or a managed equivalent. In interviews, treat the log as the system of record for events before long-term archival. This gives you three critical capabilities:

  • Buffering: The log absorbs backpressure during traffic spikes without dropping events.
  • Replayability: When bugs are found or business rules change, you can reprocess historical events from the log.
  • Clean contract: Ingestion writes to the log. Downstream consumers read from it. There’s no coupling between producers and consumers.

From that log, two consumers diverge. The stream processor (Flink, Spark Structured Streaming, or Beam) handles real-time aggregation. In parallel, a log archiver continuously lands raw events into object storage as partitioned files for batch processing.

Storage layout#

A practical storage layout uses a hot OLAP store (a columnar, real-time analytics database such as ClickHouse, Apache Druid, or Apache Pinot, optimized for low-latency aggregation queries over recent time windows) for fast dashboard queries and a warehouse or lakehouse (BigQuery, Snowflake, or Iceberg on Trino) for authoritative billing queries. A query routing layer inspects each request and dispatches it to the appropriate store based on time range, dimensions, and accuracy mode.

Pro tip: When you say “raw log is the source of truth” in an interview, you unlock replay, backfills, and audits in a single sentence. That’s the difference between a demo and revenue infrastructure.

With the architecture established, the next critical layer to get right is ingestion, where measurement systems quietly succeed or fail.

Ingestion edge and event integrity#

Ingestion is the front door of your measurement system, and it’s facing the open internet. The inputs are adversarial and unreliable. SDKs retry. Devices go offline for hours. Clocks are wrong. Some actors may tamper with payloads to inflate counts. Your ingestion edge must be optimized for fast acceptance while enforcing just enough integrity to prevent garbage from poisoning everything downstream.

Minimal, fast validation#

A well-designed ingestion layer uses globally distributed endpoints, either CDN edge POPs or regional API gateways, that accept click events via HTTP. These endpoints do not perform heavy aggregation or complex logic. Their job is narrow:

  1. Schema validation: Confirm required fields exist and conform to the expected type and version.
  2. Source authentication: Verify that the calling SDK or publisher is authorized.
  3. Rate limiting: Apply per-source and per-region rate limits to reject clearly abusive traffic.
  4. Metadata stamping: Attach server-side ingest_time, region, and sampler flags.

After validation, the edge batches and compresses events before writing them to Kafka. Batching improves throughput and lowers cost. Compression (typically LZ4 or Zstandard) reduces bandwidth and Kafka storage.

Real-world context: Google’s ad measurement infrastructure uses edge servers in over 100 countries to minimize ingestion latency. Events are accepted within milliseconds and forwarded to regional Kafka clusters for downstream processing.

Schema evolution for long-lived SDKs#

Mobile SDK versions roll out slowly. You cannot force all clients to upgrade simultaneously. This means your ingestion layer must handle multiple schema versions gracefully. The pattern is a versioned envelope: a message wrapper that includes a schema_version field and event_type alongside the payload, enabling parsers to apply version-specific deserialization logic and maintain forward compatibility as schemas evolve.

In practice, you validate required fields, drop or quarantine malformed payloads, and ship unknown optional fields into a flexible map (like a JSON extras field) for potential later use. Consider integrating a schema registry (such as Confluent Schema Registry) to enforce backward and forward compatibility rules at the serialization boundary. This prevents hard failures when the schema changes and gives you a deprecation path for retiring old fields.

Event integrity without full fraud prevention#

Event integrity at the edge is not a complete fraud solution. It’s a first line of defense that makes spoofing harder and keeps obviously invalid events out of the pipeline. Effective techniques include:

  • Signed click URLs or tokens: The ad server issues an HMAC-signed token with each impression. The click event must carry this token, and the edge validates the signature.
  • Issuer validation: Confirm that the click_id in the event corresponds to an impression actually issued by the ad serving system.
  • Replay protection: Use nonce windows to reject events whose tokens have expired or been seen before.

The key principle is that the edge does minimal, fast validation and integrity checks, then writes to the log. Heavy logic like fraud scoring and attribution belongs downstream where you can replay and correct.
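A minimal sketch of the signed-token scheme, assuming an HMAC over click_id plus issue time; the token format, secret handling, and TTL are illustrative choices:

```python
import hashlib
import hmac
import time

SECRET = b"shared-ad-server-secret"   # in production: rotated key from a KMS

def issue_token(click_id: str, issued_at: int) -> str:
    """Ad server side: sign the click_id and issue time at impression time."""
    msg = f"{click_id}.{issued_at}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{issued_at}.{sig}"

def validate_token(click_id: str, token: str, ttl_s: int = 3600) -> bool:
    """Ingestion edge side: constant-time signature check plus an
    expiry window that doubles as coarse replay protection."""
    try:
        issued_str, sig = token.split(".")
        issued_at = int(issued_str)
    except ValueError:
        return False
    if time.time() - issued_at > ttl_s:
        return False   # token expired: reject stale replays
    msg = f"{click_id}.{issued_at}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

A nonce cache keyed on the token would tighten replay protection further; the expiry check alone only bounds the replay window.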

Attention: Never drop raw events during overload. Instead, apply backpressure or sampling to non-critical paths. If the raw log is preserved, you can always recompute aggregates later.

With events safely in the log, the streaming pipeline takes over to produce the fast estimates that power dashboards and alerting.

Streaming aggregation with state, windows, and watermarks#

The streaming pipeline produces the near-real-time aggregates that campaign managers rely on: clicks per campaign per minute, spend pacing against budget, geo and device breakdowns, and anomaly alerts. This requires stateful processing with careful windowing, and it’s where the subtleties of event time, late arrivals, and skewed traffic become critical.

Event-time windows vs. processing time#

Most production systems use event-time windows, bucketing events based on click_time (when the user actually clicked) rather than when the event arrived at the processor. This ensures metrics align to real-world timing even when network delays cause events to arrive out of order. A typical configuration uses tumbling windows of 1 minute for high-granularity dashboards, with coarser rollups (5 minutes, 1 hour) for historical trending.

The challenge is that event-time processing requires the system to decide when a window is “complete enough” to emit results. This is where watermarking comes in: a heuristic mechanism in stream processing that tracks the progress of event time, signaling when the system believes all events up to a certain timestamp have arrived, allowing windows to close and emit results while tolerating bounded lateness.

Late events and watermark configuration#

Late events are unavoidable. Mobile devices buffer events during subway rides. Edge retries arrive minutes after the original. Network partitions delay entire batches. You handle this by defining a lateness tolerance in your watermark configuration, typically 5 to 30 minutes depending on the product’s freshness requirements.

Within the watermark window, late events update the windowed aggregates and the dashboard self-corrects. After the watermark passes, one of two things happens:

  • Drop from streaming: The event is ignored by the streaming path, and the batch pipeline will include it in the authoritative recomputation.
  • Side output: The event is routed to a “late update” side output that triggers targeted backfills in the hot OLAP store.

The first approach is simpler and is sufficient for most systems. The second is useful when dashboards need to remain accurate over longer windows.
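The routing decision reduces to a few lines. This sketch assumes 1-minute tumbling windows and a 10-minute lateness tolerance, both illustrative values:

```python
WINDOW_S = 60        # 1-minute tumbling windows (assumed)
LATENESS_S = 600     # 10-minute watermark tolerance (assumed)

def route_event(click_time: int, watermark: int) -> tuple[str, int]:
    """Assign an event to its event-time bucket, then decide whether it
    can still update the streaming window or is too late and must go to
    a side output (or simply wait for batch recomputation)."""
    bucket = click_time - (click_time % WINDOW_S)   # event-time bucket start
    if click_time >= watermark - LATENESS_S:
        return ("update_window", bucket)
    return ("late_side_output", bucket)
```

Note that the bucket is always computed from click_time, never from arrival time: that is what keeps dashboards aligned to real-world timing even when delivery is delayed.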

Diagram: Late event handling with watermark-based routing

Skew and hot key mitigation#

Some campaigns dominate traffic. During a Super Bowl ad, a single campaign_id might account for 30% of all click events. If you partition strictly by campaign_id, that hot key overloads one Kafka partition and one stream processor task while others sit idle.

The mitigation is key salting: a technique for distributing hot keys across multiple partitions by appending a random suffix (salt) to the key before partitioning, then merging partial aggregates downstream to reconstruct the correct total. For a hot campaign, you split its key into N salted variants (campaign_123_salt_0, campaign_123_salt_1, etc.), distribute them across partitions, compute partial aggregates per salt, and then merge them in a second stage.

This two-stage approach, sometimes called hierarchical aggregation, keeps merges deterministic and idempotent. The trade-off is added complexity in the merge step and slightly higher end-to-end latency for hot keys.
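A minimal two-stage sketch, with N_SALTS as an assumed fan-out factor:

```python
import random
from collections import defaultdict

N_SALTS = 8   # assumed fan-out factor for hot keys

def salted_key(campaign_id: str) -> str:
    """Stage 1: spread a hot campaign's events across N partitions
    by appending a random salt to the partition key."""
    return f"{campaign_id}_salt_{random.randrange(N_SALTS)}"

def merge_partials(partials: dict[str, int]) -> dict[str, int]:
    """Stage 2: strip the salt and sum partial counts per campaign."""
    totals: dict[str, int] = defaultdict(int)
    for key, count in partials.items():
        campaign_id = key.rsplit("_salt_", 1)[0]
        totals[campaign_id] += count
    return dict(totals)
```

Because addition is associative, merging the partials in any order yields the same total, which is what keeps the second stage deterministic.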

Pro tip: In interviews, when asked “What happens when one campaign gets 30% of traffic?”, describe key salting and partial aggregates. Simply saying “add more machines” is not a sufficient answer because the bottleneck is the partition, not the cluster size.

Correctness in streaming is about more than just window mechanics. The next section addresses how to make outputs trustworthy enough that billing can eventually depend on them.

Effective exactly-once without magical guarantees#

For billing and trusted metrics, you need correctness that behaves like exactly-once even when the underlying delivery is at-least-once. This is the section where you explain the practical recipe rather than hand-waving about framework guarantees.

The three-part recipe#

Effective exactly-once is achieved through the combination of three mechanisms:

  1. Stable event identifiers: Every event carries an event_id (or a composite key like click_id + source + timestamp) that uniquely identifies it. This key supports deduplication at every stage.
  2. Checkpointing: The stream processor periodically snapshots its state and Kafka offsets to durable storage. On failure, it restarts from the last consistent checkpoint and replays only the events since that point. Apache Flink’s checkpoint mechanism is the canonical example.
  3. Idempotent sinks: The output store handles duplicate writes gracefully. This can be achieved through upserts keyed by (time_bucket, dimension_set) with deterministic aggregation, or through transactional writes (two-phase commit) between the processor and the sink.

The following table compares idempotent sink strategies:

Upsert-Based Idempotent Sinks vs. Transactional Two-Phase Commit (2PC) Sinks

| Dimension | Upsert-Based Idempotent Sinks | Transactional 2PC Sinks |
| --- | --- | --- |
| Implementation Complexity | Low — uses simple upsert logic (e.g., `ON CONFLICT`) with no coordination overhead | High — requires a coordinator to manage prepare/commit/rollback phases across participants |
| Latency Overhead | Low — minimal coordination needed; latency depends on upsert efficiency | High — multiple communication rounds and synchronization between participants increase latency |
| Failure Recovery | Simple — reapplying operations is safe due to inherent idempotency; no duplicate data risk | Complex — coordinator failure can leave participants in an uncertain state, risking blocking until recovery |
| OLAP Store Suitability | Best for stores supporting efficient upserts and tolerating eventual consistency | Best for stores requiring strong consistency, atomicity, and native transactional support |
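The upsert strategy can be made concrete with SQLite's ON CONFLICT as a stand-in for whatever upsert mechanism your hot OLAP store provides (the table and key layout are illustrative):

```python
import sqlite3

# Rows are keyed by (time_bucket, campaign_id), so replaying the same
# window output overwrites the row rather than adding a duplicate.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE clicks_1m (
        time_bucket INTEGER,
        campaign_id TEXT,
        click_count INTEGER,
        PRIMARY KEY (time_bucket, campaign_id)
    )
""")

def upsert_window(bucket: int, campaign_id: str, count: int) -> None:
    """Idempotent sink write: the final aggregate for a window is
    deterministic, so re-writing it after a replay is harmless."""
    db.execute(
        """INSERT INTO clicks_1m VALUES (?, ?, ?)
           ON CONFLICT(time_bucket, campaign_id)
           DO UPDATE SET click_count = excluded.click_count""",
        (bucket, campaign_id, count),
    )

upsert_window(1700000040, "campaign_123", 42)
upsert_window(1700000040, "campaign_123", 42)   # replayed write, same result
```

The crucial precondition is that the processor writes full window aggregates, not increments; an `UPDATE ... SET count = count + ?` style sink would lose idempotency.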

Deduplication in practice#

Deduplication windows need to be bounded. Maintaining a set of all seen event_id values forever is impractical at scale. Instead, you keep a dedupe window (e.g., 1 to 6 hours) that covers the expected range of retries and late arrivals. Within this window, you can use a combination of in-memory hash sets (for the stream processor’s local state) and Bloom filters: a space-efficient probabilistic data structure that tests whether an element is a member of a set, with a tunable false-positive rate but zero false negatives, making it useful for fast deduplication checks where occasional duplicate acceptance is tolerable.

Some aggregates are naturally idempotent. A set-union metric (distinct users who clicked) won’t double-count a user if the same event is processed twice. But counters are not idempotent by nature. For count-based metrics, you need either the dedupe window or transactional writes to prevent inflation.

Real-world context: LinkedIn’s ad analytics pipeline uses Kafka consumer offsets aligned with Flink’s checkpoint barriers to achieve effective exactly-once. When a processor fails and restarts, it replays from the checkpoint offset, and idempotent writes to Pinot ensure no double-counting in dashboards.

Batch as the ultimate correction layer#

Even with all these safeguards, streaming correctness has limits. Dedupe windows expire. Watermarks make cutoff decisions. Fraud models aren’t finalized in real time. This is exactly why the batch pipeline exists. It reprocesses all raw events without time pressure, applies finalized rules, and produces authoritative partitions that overwrite prior results.

The mental model is simple. Streaming produces “draft” aggregates. Batch produces “published” aggregates. If the drafts are close to the published version, your streaming pipeline is healthy. If they diverge beyond tolerance, you investigate.

With correctness covered, the next challenge is supporting the flexible, multi-dimensional analytics that advertisers demand without blowing up storage costs.

Multi-dimensional analytics without storage explosion#

Advertisers want to slice data by campaign, geo, device, publisher, creative, and hour, in any combination. The naive approach of precomputing every possible combination of dimensions is a combinatorial nightmare that explodes storage and compute. Your job is to support flexible analytics while keeping the system economical.

Primary cubes and selective materialization#

Instead of precomputing all dimension combinations, define a small set of “primary cubes” that cover the majority of dashboard queries. For example:

  • Campaign Ă— ad Ă— time (1-minute buckets)
  • Campaign Ă— geo Ă— time (5-minute buckets)
  • Campaign Ă— device Ă— time (5-minute buckets)
  • Publisher Ă— campaign Ă— time (hourly buckets)

These cubes are materialized in the hot OLAP store for the recent window (e.g., last 7 days) and in the warehouse for historical queries. For rarer dimension combinations (e.g., campaign Ă— geo Ă— device Ă— publisher), you route queries to the warehouse where columnar scans are cheap and latency tolerance is higher.

Attention: Promising “arbitrary slicing in real time” without cost controls is a trap. A strong answer says which slices are fast (materialized cubes), which are possible but slower (warehouse scans), and which require pre-approval (full cross-product materializations for specific high-spend advertisers).

Approximate distinct counts with HyperLogLog#

Exact distinct counts (unique users who clicked) across high-cardinality dimensions are expensive in streaming. The standard interview answer is HyperLogLog (HLL): a probabilistic cardinality estimation algorithm that uses fixed memory (typically 12 to 16 KB) regardless of set size and provides distinct count estimates within approximately 2% error, supporting efficient merging of sketches across partitions and time windows.

You maintain HLL sketches per bucket and merge them at query time. This is compact, fast, and accurate enough for dashboards. For billing, you compute exact distinct counts in the batch pipeline for a limited set of officially invoiced metrics.
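For intuition, here is a compressed HyperLogLog sketch following the standard register/rank construction. It is illustrative only; production systems use the OLAP store's built-in HLL types rather than hand-rolled code:

```python
import hashlib
import math

class HLL:
    """Minimal HyperLogLog with 2**p registers (p=14 gives ~0.8% error)."""

    def __init__(self, p: int = 14):
        self.p, self.m = p, 1 << p
        self.reg = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                       # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.reg[idx] = max(self.reg[idx], rank)

    def merge(self, other: "HLL") -> None:
        """Elementwise max of registers = sketch of the set union."""
        self.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m**2 / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if est <= 2.5 * self.m and zeros:              # small-range correction
            return self.m * math.log(self.m / zeros)
        return est
```

The merge operation is what makes per-bucket sketches composable: union across partitions, time windows, or salted keys is just a register-wise max.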

Diagram: HyperLogLog streaming approximation vs batch exact counting

The choice of what to materialize and what to approximate directly affects storage costs and query latency, which brings us to how the storage layer is organized and how queries are routed.

Storage layout and query routing#

Storage choices should reflect the operational reality that recent dashboards require sub-second query latency while historical billing analytics require cheap scans and correctness. This typically yields a two-tier architecture with intelligent query routing as the connective tissue.

Hot tier and cold tier#

The hot OLAP store (ClickHouse, Apache Druid, or Apache Pinot) holds recent rollups, typically the last 7 to 30 days of materialized cubes. These stores are optimized for columnar scans, pre-aggregation, and low-latency queries. They’re expensive per byte but fast.

The cold tier is a lakehouse or warehouse (BigQuery, Snowflake, or Apache Iceberg on Trino) holding authoritative partitions for the full retention window, often 1 to 3 years. Raw event archives live in object storage (S3, GCS) in partitioned Parquet or ORC files, organized by event_date and optionally by campaign_id for scan efficiency.

Query routing as the interview move#

A reporting API inspects each incoming query and decides which store to hit. The routing logic considers:

  • Time range: “Last 30 minutes by campaign” goes to the hot OLAP store. “Last quarter by campaign + geo + device” goes to the warehouse.
  • Accuracy mode: If the user or API explicitly requests “billing mode,” the query routes to batch truth even if it’s slower.
  • Dimension complexity: Simple cubes that match materialized rollups go to OLAP. Rare combinations go to the warehouse.

This routing is what prevents the perennial argument about why dashboard numbers don’t exactly match billing numbers. Dashboard mode is fast and approximate. Billing mode is authoritative and auditable. They’re different products served from different stores.

Handling corrections#

The batch pipeline may rewrite partitions, for example overwriting a day-level partition when late events or fraud reclassifications arrive. The reporting API must handle this through either versioned partitions (snapshot isolation) or an explicit refresh mechanism that propagates corrected data to downstream consumers.

Historical note: The Lambda Architecture, proposed by Nathan Marz in 2011, formalized this dual-path idea with a “speed layer” and “batch layer.” Modern implementations have evolved significantly (particularly with Kappa Architecture alternatives), but the core insight that freshness and correctness require separate optimization paths remains valid.

With the architecture components in place, let’s trace a single click through both pipelines to make the design concrete.

Walk-through 1: a click becomes a real-time dashboard update#

A user taps an ad in a mobile app. The SDK emits a click event carrying a stable event_id, click_time, campaign_id, ad_id, device metadata, geo coordinates, and an HMAC integrity token. The network is spotty, so the SDK retries twice before getting a 200 response.

The edge ingestion service in the nearest region accepts the event, validates the schema and token, stamps ingest_time and region, and appends it to a Kafka topic in a compressed batch. The duplicate retries also land in Kafka because ingestion is at-least-once by design. The edge does not try to deduplicate. That’s downstream’s job.

The Flink stream processor consumes from the Kafka topic. It checks the integrity token, then performs dedupe using event_id within a bounded state window (e.g., 2 hours). It assigns the event to a 1-minute event-time window using its watermark configuration and updates stateful aggregates keyed by (bucket, campaign_id, ad_id, geo, device_type). Within seconds, incremental results are upserted into ClickHouse.

The advertiser opens their dashboard, which queries ClickHouse for the last hour’s click counts and pacing metrics. The numbers update in near real time. If a late event arrives within the watermark tolerance (say, 10 minutes), the earlier bucket is updated and the dashboard self-corrects on the next refresh. If it arrives after the watermark, the streaming view doesn’t change, but the batch truth will include it.

Diagram: Real-time ad click processing pipeline with deduplication

Pro tip: A great interview soundbite is: “Real-time numbers are fast estimates from the streaming path. Late events correct within a bounded window. Beyond that, batch truth fixes it.” This shows you understand the contract between the two pipelines.

The same click event also has a parallel life in the batch pipeline, which is where billing truth is born.

Walk-through 2: a click becomes a billing line item#

That same click event, sitting in Kafka alongside its duplicate retries, is also consumed by a log archiver that writes raw events to object storage. Events land in hourly partitions organized by event_date/hour/ in compressed Parquet format. This archive is the durable, replayable copy of everything that entered the system.

Overnight (or on an hourly cadence for faster billing cycles), the batch pipeline kicks off. A Spark job reads the raw partitions for the billing period, applies global deduplication using the full event_id keyspace (not bounded by a streaming window), joins against campaign configuration tables, and applies finalized fraud decisions. Fraud models often lag real time because they rely on aggregated behavioral signals that aren’t available at the moment of the click.

The batch job produces authoritative aggregate partitions, for example daily totals by campaign_id, ad_id, and publisher_id. These partitions overwrite prior results for the same period. This overwrite mechanism is how the system incorporates late events, corrected attribution rules, and fraud reclassifications. Billing reports read exclusively from these truth tables, never from streaming rollups.
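In miniature, the batch job's core logic (global dedupe, fraud filtering, daily rollup) looks like this. The field names mirror the walk-through and stand in for the real Spark job:

```python
from collections import defaultdict

def batch_aggregate(raw_events: list[dict]) -> dict:
    """Deduplicate by event_id over the full keyspace (no streaming
    window bound), drop events carrying a finalized fraud flag, then
    roll up daily totals per (event_date, campaign_id, ad_id)."""
    seen: set[str] = set()
    totals: dict[tuple, int] = defaultdict(int)
    for ev in raw_events:
        if ev["event_id"] in seen or ev.get("is_fraud"):
            continue
        seen.add(ev["event_id"])
        totals[(ev["event_date"], ev["campaign_id"], ev["ad_id"])] += 1
    return dict(totals)
```

Because the job always recomputes a whole partition from raw events, overwriting the prior output is safe and repeatable, which is what makes corrections and backfills routine rather than scary.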

If an advertiser disputes a charge, the audit path traces from the billing line item to the underlying aggregate partition, then from that partition to the raw event ranges (object paths and offsets) that produced it. For high-stakes disputes, a trace index mapping event_id to storage location accelerates the investigation.

Real-world context: At major ad platforms, billing reconciliation runs are a contractual obligation. Advertisers receive finalized reports within 24 to 72 hours of the billing period closing, and discrepancies beyond agreed thresholds (often 1 to 2%) trigger automatic credits.

Both pipelines are now clear, but neither operates in a vacuum. The system needs a dedicated fraud detection component to protect billing integrity.

Fraud detection as a core pipeline component#

Mature systems in this space treat fraud detection as a first-class module, not an afterthought, and for good reason. Click fraud costs the digital advertising industry an estimated $35 billion annually. Your ad click aggregator must include a fraud detection component that feeds decisions into both the streaming and batch pipelines.

Where fraud detection fits#

Fraud detection operates at two speeds. A lightweight, real-time scoring layer runs inline with the streaming pipeline, flagging events that exhibit obvious bot signatures: impossible click velocities, known malicious IP ranges, device fingerprints associated with click farms, and geographic impossibilities (a user clicking from two countries within seconds).

A heavier, offline fraud engine runs as part of or alongside the batch pipeline. It applies machine learning models trained on historical patterns, cross-references IP reputation databases, analyzes click-to-conversion ratios, and identifies coordinated fraud rings that are invisible at the individual-event level.

The key architectural decision is that real-time fraud scores are advisory. They can suppress events from dashboard rollups to avoid inflating estimates, but they do not permanently delete events from the raw log. Final fraud decisions are applied by the batch pipeline, which has the benefit of hindsight and richer features.

Python
import time
from dataclasses import dataclass, field
from collections import defaultdict

# In-memory stores; replace with Redis or a database in production
_click_timestamps: dict[str, list[float]] = defaultdict(list)
_device_history: dict[str, list[str]] = defaultdict(list)
_last_location: dict[str, tuple[float, float, float]] = {}

VELOCITY_WINDOW_SECONDS = 60
MAX_CLEAN_CLICKS_PER_WINDOW = 10
MAX_SPEED_KMH = 900  # roughly commercial aircraft speed


@dataclass
class ClickEvent:
    user_id: str
    ip_address: str
    device_fingerprint: str
    latitude: float
    longitude: float
    timestamp: float = field(default_factory=time.time)


@dataclass
class FraudSignals:
    velocity_score: float  # 0.0 (clean) to 1.0 (fraudulent)
    ip_reputation_score: float
    device_anomaly_score: float
    geo_feasibility_score: float


def _compute_velocity_score(user_id: str, now: float) -> float:
    """High click rate within the window raises suspicion."""
    # Retain only clicks within the rolling window
    recent = [t for t in _click_timestamps[user_id]
              if now - t <= VELOCITY_WINDOW_SECONDS]
    recent.append(now)
    _click_timestamps[user_id] = recent
    # Normalize: score saturates at 1.0 beyond the clean threshold
    return min(len(recent) / MAX_CLEAN_CLICKS_PER_WINDOW, 1.0)


def _lookup_ip_reputation(ip_address: str) -> float:
    """
    Query an IP reputation service.
    Returns 0.0 for trusted IPs, 1.0 for known malicious IPs.
    Replace with a real API call (e.g., IPQualityScore, MaxMind).
    """
    known_bad_ips = {"192.168.1.100", "10.0.0.1"}  # placeholder blocklist
    if ip_address in known_bad_ips:
        return 1.0
    return 0.2  # placeholder: assume moderate risk for unknown IPs


def _compute_device_anomaly_score(user_id: str, device_fingerprint: str) -> float:
    """Flag if the device fingerprint has never been seen for this user."""
    history = _device_history[user_id]
    if not history:
        # First-ever device: mildly suspicious but not conclusive
        history.append(device_fingerprint)
        return 0.3
    if device_fingerprint not in history:
        # New device for an existing user: higher risk
        history.append(device_fingerprint)
        return 0.8
    return 0.0  # known device


def _geo_feasibility_score(
    user_id: str,
    latitude: float,
    longitude: float,
    now: float,
) -> float:
    """
    Detect impossible travel: compare the current location with the last
    known location and time to check whether the implied speed is plausible.
    Requires a persistent last-location store; a module-level dict here.
    """
    prev = _last_location.get(user_id)
    _last_location[user_id] = (latitude, longitude, now)
    if prev is None:
        return 0.0  # no prior location to compare
    prev_lat, prev_lon, prev_time = prev
    elapsed_hours = max((now - prev_time) / 3600, 1e-6)  # avoid division by zero
    # Equirectangular approximation (degrees to km, rough but fast)
    delta_lat = abs(latitude - prev_lat) * 111.0
    delta_lon = abs(longitude - prev_lon) * 111.0 * 0.85  # mid-latitude correction
    distance_km = (delta_lat ** 2 + delta_lon ** 2) ** 0.5
    implied_speed = distance_km / elapsed_hours
    if implied_speed > MAX_SPEED_KMH:
        return 1.0  # impossible travel detected
    # Linearly scale suspicion up to the speed limit
    return min(implied_speed / MAX_SPEED_KMH, 1.0)


def score_click_event(event: ClickEvent) -> float:
    """
    Compute a fraud probability score in [0.0, 1.0] for a click event.
    Weights reflect relative signal reliability; tune via A/B testing.
    """
    signals = FraudSignals(
        velocity_score=_compute_velocity_score(event.user_id, event.timestamp),
        ip_reputation_score=_lookup_ip_reputation(event.ip_address),
        device_anomaly_score=_compute_device_anomaly_score(
            event.user_id, event.device_fingerprint
        ),
        geo_feasibility_score=_geo_feasibility_score(
            event.user_id, event.latitude, event.longitude, event.timestamp
        ),
    )
    # Weighted linear combination; weights sum to 1.0
    weights = {
        "velocity": 0.35,
        "ip_reputation": 0.30,
        "device_anomaly": 0.20,
        "geo_feasibility": 0.15,
    }
    fraud_score = (
        weights["velocity"] * signals.velocity_score
        + weights["ip_reputation"] * signals.ip_reputation_score
        + weights["device_anomaly"] * signals.device_anomaly_score
        + weights["geo_feasibility"] * signals.geo_feasibility_score
    )
    return round(min(max(fraud_score, 0.0), 1.0), 4)  # clamp to [0, 1]

Attention: Never delete raw events based on real-time fraud signals alone. Fraud models evolve, and today’s “fraudulent” click may be reclassified tomorrow. Keep the raw log intact and let batch apply finalized decisions.

Fraud detection protects billing accuracy, but the system also needs explicit strategies for surviving traffic spikes and controlling costs at scale.

Degradation and cost controls#

At this scale, cost and reliability cannot be afterthoughts. You need explicit controls for overload (burst traffic, Kafka lag, hot campaigns) and explicit controls for spend (storage growth, compute time, cardinality explosion). Interviewers often probe what you do when everything is on fire.

Graceful degradation under load#

Kafka consumer lag and processor backpressure (a flow-control mechanism where a downstream component signals an upstream producer to slow down or buffer, preventing the downstream system from being overwhelmed during traffic spikes) are your early warning signals. When lag grows beyond thresholds, the system should shed non-critical workload:

  • Reduce rollup granularity: Temporarily switch from 1-minute to 5-minute windows.
  • Pause expensive computations: Suspend HLL sketch maintenance for low-priority dimensions.
  • Sample non-critical dimensions: Maintain full accuracy for billing-relevant dimensions while sampling others.
  • Adaptive rate limiting: At the ingestion edge, apply stricter per-source rate limits and reject clearly abusive traffic.

The critical invariant is that the raw log must be preserved. You can sacrifice real-time fidelity temporarily because you can always recompute from raw events once the storm passes.
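The escalation can be sketched as a lag-driven policy function. The thresholds and action names below are illustrative assumptions, not values from the article:

```python
def degradation_actions(consumer_lag_seconds: float) -> list[str]:
    """Map Kafka consumer lag to progressively aggressive load-shedding steps."""
    actions: list[str] = []
    if consumer_lag_seconds > 30:
        actions.append("widen_windows_1m_to_5m")      # coarser rollups
    if consumer_lag_seconds > 120:
        actions.append("pause_hll_low_priority")      # suspend HLL sketches
        actions.append("sample_non_billing_dims")     # keep billing dims exact
    if consumer_lag_seconds > 300:
        actions.append("tighten_edge_rate_limits")    # shed abusive sources
    # Never on the list: dropping writes to the raw log
    return actions

print(degradation_actions(45))    # ['widen_windows_1m_to_5m']
print(degradation_actions(600))   # all four actions
```

Encoding the policy as data rather than ad hoc operator judgment makes degradation predictable and reviewable, and keeps the raw-log invariant out of the negotiable set.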

Cost controls and tiered retention#

Storage costs grow relentlessly at this scale. Tiered retention is your primary lever:

  • Kafka retention: Hours to days (enough for consumer replay and backfill restarts).
  • Hot OLAP store: 7 to 30 days of materialized rollups.
  • Object storage archives: 1 to 3 years of raw events in compressed columnar formats.
  • Warehouse: Authoritative aggregates retained per regulatory and contractual requirements.

Track “cost per billion events” as a real operational metric. Design rollups to match common queries rather than theoretical flexibility. Every materialized cube that nobody queries is wasted money.

Comparison of Storage Tiers

| Storage Tier | Retention Period | Cost per TB/Month | Query Latency | Primary Use Case |
| --- | --- | --- | --- | --- |
| Kafka | ~7 days (configurable) | Varies (SSD-based) | Milliseconds | Real-time data streaming & event processing |
| Hot OLAP | 7–30 days | ~$1,000–$2,000 | Sub-millisecond to sub-second | Real-time analytics & dashboarding |
| Object storage | Months to years | ~$23 | Milliseconds to seconds | Archival, backup & infrequently accessed data |
| Data warehouse | Years | ~$46 (compressed) | Seconds to minutes | Complex queries, BI & reporting |

With degradation and cost handled, the final critical piece is ensuring the two pipelines stay in agreement and that billing can withstand scrutiny.

Reconciliation and audit for billing correctness#

Billing is where measurement systems get tested by lawyers and finance teams, not just engineers. Even with careful design, numbers drift. Streaming and batch will disagree. Dedupe rules evolve. Fraud models reclassify events after the fact. Late arrivals shift totals. Your system needs a reconciliation strategy that detects drift, explains it, and corrects it with traceability.

Reconciliation as a continuous process#

The core principle is that raw logs are the source of truth. Everything else (streaming rollups, batch aggregates, billing tables) is a derived view. Reconciliation compares derived outputs across paths and flags discrepancies beyond tolerance.

Not all differences are errors. Some are expected:

  • Watermark cutoffs mean streaming excludes events that batch includes.
  • Fraud reclassification shifts totals between runs.
  • Late arrivals increase batch totals relative to streaming snapshots.

You need explicit thresholds and categories for drift. A 0.5% difference due to late events is normal. A 5% difference in the same campaign warrants investigation. Automated jobs compare streaming snapshots against batch truth for each billing period and generate drift reports.
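A minimal sketch of that classification, using the 0.5% and 5% cut-offs from the text (the category names and the zero-batch handling are assumptions):

```python
def classify_drift(stream_total: float, batch_total: float) -> str:
    """Categorize streaming-vs-batch drift for one campaign's billing period."""
    if batch_total == 0:
        # No batch truth to compare against; any streaming count is suspect
        return "investigate" if stream_total else "ok"
    drift = abs(stream_total - batch_total) / batch_total
    if drift <= 0.005:
        return "ok"            # within normal late-event tolerance
    if drift <= 0.05:
        return "monitor"       # elevated; watch across runs
    return "investigate"       # beyond tolerance; drift report / backfill

print(classify_drift(1004, 1000))   # 0.4% -> ok
print(classify_drift(1080, 1000))   # 8%   -> investigate
```

Automated reconciliation jobs would apply this per campaign and period, emitting the drift report described above.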

Auditability through lineage#

A billing line item should be traceable to a set of aggregate partitions, and those partitions should trace back to raw event ranges. You don’t need to store every raw event_id in the billing table, but you should be able to sample and prove correctness.

For high-stakes disputes, maintain a trace index: a time-bounded lookup structure that maps individual event_ids to their physical storage locations (object path and byte offset), enabling rapid retrieval of specific raw events during billing investigations and audits. This index is expensive to maintain indefinitely, so scope it to a limited window (e.g., the current billing quarter plus one prior quarter).
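A sketch of such an index as a simple mapping (paths and offsets are illustrative; in production this would live in a key-value store scoped to the audit window):

```python
from typing import NamedTuple, Optional

class TraceLocation(NamedTuple):
    object_path: str
    byte_offset: int

# Time-bounded index: entries outside the audit window are pruned
_trace_index: dict[str, TraceLocation] = {}

def record_event_location(event_id: str, object_path: str, byte_offset: int) -> None:
    """Called by the archiver as each raw event is written."""
    _trace_index[event_id] = TraceLocation(object_path, byte_offset)

def locate_event(event_id: str) -> Optional[TraceLocation]:
    """Resolve an event_id to its raw-archive location during a dispute."""
    return _trace_index.get(event_id)

record_event_location(
    "e-123",
    "s3://ads-raw-events/raw_clicks/event_date=2024-06-01/hour=12/part-0007.parquet",
    18432,
)
print(locate_event("e-123"))
```

The lookup turns a dispute investigation from a partition scan into a point read, which is why it is worth the maintenance cost for recent billing periods.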

Pro tip: “Raw log as source of truth + reconciliation + lineage” is the trifecta for billing correctness. In an interview, stating this shows you understand disputes and regulatory scrutiny, not just data plumbing.

When drift is unacceptable, you trigger a backfill: the batch pipeline recomputes the affected partitions from raw events, and the billing tables are updated with versioned results. The audit trail records what changed, when, and why.

Reconciliation keeps the system honest, but you also need real-time visibility into the system’s health to catch problems before they become billing disputes.

Observability metrics that keep you honest#

The right metrics map directly to the failure modes of the system. In interviews, naming concrete metrics signals you’ve operated these systems and know what breaks first.

Metrics by pipeline stage#

At each stage, you want specific indicators:

  • Ingestion: QPS by region, error rates, schema validation failure rates, and per-source rate limit triggers.
  • Kafka: Consumer lag per topic and partition (the single most important streaming health metric), producer throughput, and partition balance.
  • Stream processor: Checkpoint success rate and duration, end-to-end freshness (time from click to dashboard visibility), and state size per operator.
  • Data quality: Late-arrival rate by source, dedupe drop rate (what percentage of events are duplicates), and fraud flag rate.
  • Billing health: Reconciliation drift percentage, backfill frequency, and batch job completion against SLA.
  • Cost: Cost per billion events processed, storage growth rate, and query cost by tier.

Historical note: The concept of “end-to-end freshness” as a primary metric was popularized by LinkedIn’s Unified Streaming Platform team, who found that monitoring throughput alone masked growing latency problems. A system can have high throughput while its lag steadily increases, which is failure in slow motion.

[Image: Observability dashboard for ad click aggregator health, with panels for Kafka consumer lag (with alert thresholds), end-to-end freshness, dedupe drop rate by source, reconciliation drift, and cost per billion events.]

SLAs and numeric targets#

In production and in interviews, stating specific targets adds credibility:

  • Dashboard freshness: less than 60 seconds p99 during normal operations, degrading to less than 5 minutes under burst load.
  • Billing accuracy: less than 0.5% drift from raw-log recomputation.
  • System availability: 99.95% for ingestion endpoints, 99.9% for dashboard queries.
  • Kafka lag alert threshold: greater than 30 seconds of consumer lag triggers investigation, greater than 5 minutes triggers incident.
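Those lag thresholds translate directly into an alerting rule (the severity names are assumptions):

```python
def lag_alert_level(consumer_lag_seconds: float) -> str:
    """Map Kafka consumer lag to an alert severity per the SLA targets."""
    if consumer_lag_seconds > 300:
        return "incident"       # more than 5 minutes: page the on-call
    if consumer_lag_seconds > 30:
        return "investigate"    # more than 30 seconds: open an investigation
    return "ok"

print(lag_alert_level(12))    # ok
print(lag_alert_level(90))    # investigate
print(lag_alert_level(600))   # incident
```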

These numbers anchor the architecture in operational reality and give interviewers confidence that you’ve thought beyond diagrams.

With observability in place, let’s crystallize what a strong interview answer sounds like from start to finish.

What a strong interview answer sounds like#

A strong answer is structured, honest about contracts, and connects architecture to business outcomes. You don’t drown the interviewer in tool names. You use tools to support invariants.

The 60-second outline#

“I’ll build a dual-pipeline click aggregator: a streaming path for near-real-time estimates and a batch path for billing truth. Clicks arrive at an edge ingestion layer that does minimal validation, schema-versioned parsing, basic integrity checks, and writes batched, compressed events into Kafka as the raw source of truth. A stream processor consumes Kafka with event-time windows and watermarks, dedupes by stable event_id, mitigates hot keys via salting and partial aggregates, and writes incremental rollups to a hot OLAP store for dashboards. In parallel, raw events are archived to object storage. Batch jobs recompute authoritative partitions with full dedupe, late-event inclusion, and finalized fraud decisions, overwriting daily partitions for billing. A fraud detection component operates at both speeds: lightweight scoring in streaming and full model inference in batch, with raw events always preserved. Queries route to hot estimates for recent windows and to batch truth for billing. I monitor ingestion QPS, Kafka lag, freshness, dedupe and late-arrival rates, reconciliation drift, and cost per billion events.”

The checklist interviewers are scoring against#

  • Dual pipelines: Real-time estimates for dashboards, batch truth for billing.
  • Raw log as source of truth with replay and backfill capability.
  • Effective exactly-once via dedupe, checkpointing, and idempotent sinks.
  • Watermarks and late-event handling with batch as the correction layer.
  • Skew and hot key mitigation through salting and hierarchical aggregation.
  • Fraud detection as a dual-speed component integrated into both pipelines.
  • Reconciliation, lineage, and cost controls with concrete observability metrics.

Real-world context: Staff and Principal engineer interviews at companies like Google, Meta, and Amazon specifically probe whether candidates understand that “real-time” and “correct” are different contracts. The dual-pipeline pattern with reconciliation is the expected answer structure for any billing-adjacent system.

Sql
-- Configurable drift threshold (fractional); adjust as needed
WITH params AS (
    SELECT 0.05 AS drift_threshold  -- 5% drift tolerance
),

-- Streaming aggregate totals for the target date and campaign
streaming_totals AS (
    SELECT
        campaign_id,
        report_date,
        SUM(impressions) AS stream_impressions,
        SUM(clicks)      AS stream_clicks,
        SUM(spend)       AS stream_spend
    FROM streaming_aggregates
    WHERE report_date = '2024-06-01'  -- parameterize for your scheduler
    GROUP BY campaign_id, report_date
),

-- Batch (ground-truth) totals for the same date and campaign
batch_totals AS (
    SELECT
        campaign_id,
        report_date,
        SUM(impressions) AS batch_impressions,
        SUM(clicks)      AS batch_clicks,
        SUM(spend)       AS batch_spend
    FROM batch_truth
    WHERE report_date = '2024-06-01'
    GROUP BY campaign_id, report_date
),

-- Join streaming vs. batch and compute per-metric drift fractions
reconciliation AS (
    SELECT
        COALESCE(s.campaign_id, b.campaign_id) AS campaign_id,
        COALESCE(s.report_date, b.report_date) AS report_date,
        -- Raw values
        s.stream_impressions,
        b.batch_impressions,
        s.stream_clicks,
        b.batch_clicks,
        s.stream_spend,
        b.batch_spend,
        -- Drift = (stream - batch) / batch; multiply by 1.0 to force
        -- floating-point division on integer columns, NULL-safe on zero
        CASE WHEN b.batch_impressions <> 0
             THEN 1.0 * (s.stream_impressions - b.batch_impressions) / b.batch_impressions
             ELSE NULL END AS impressions_drift_pct,
        CASE WHEN b.batch_clicks <> 0
             THEN 1.0 * (s.stream_clicks - b.batch_clicks) / b.batch_clicks
             ELSE NULL END AS clicks_drift_pct,
        CASE WHEN b.batch_spend <> 0
             THEN 1.0 * (s.stream_spend - b.batch_spend) / b.batch_spend
             ELSE NULL END AS spend_drift_pct
    FROM streaming_totals s
    FULL OUTER JOIN batch_totals b
        ON  s.campaign_id = b.campaign_id
        AND s.report_date = b.report_date
)

-- Final output: flag any campaign where ANY metric exceeds the threshold
SELECT
    r.campaign_id,
    r.report_date,
    r.stream_impressions,
    r.batch_impressions,
    ROUND(r.impressions_drift_pct * 100, 4) AS impressions_drift_pct,
    r.stream_clicks,
    r.batch_clicks,
    ROUND(r.clicks_drift_pct * 100, 4) AS clicks_drift_pct,
    r.stream_spend,
    r.batch_spend,
    ROUND(r.spend_drift_pct * 100, 4) AS spend_drift_pct,
    -- COALESCE treats NULL drift (zero batch total, or a campaign present
    -- on only one side) as zero for flagging and ordering purposes
    CASE
        WHEN ABS(COALESCE(r.impressions_drift_pct, 0)) > p.drift_threshold
          OR ABS(COALESCE(r.clicks_drift_pct, 0))      > p.drift_threshold
          OR ABS(COALESCE(r.spend_drift_pct, 0))       > p.drift_threshold
        THEN TRUE
        ELSE FALSE
    END AS exceeds_drift_threshold
FROM reconciliation r
CROSS JOIN params p  -- single-row config joined to every campaign
ORDER BY
    exceeds_drift_threshold DESC,              -- flagged campaigns first
    ABS(COALESCE(r.spend_drift_pct, 0)) DESC   -- highest spend drift at top
;

Conclusion#

The core insight of ad click aggregator design is that measurement for decisions and measurement for money are fundamentally different engineering problems that share the same input stream. The streaming path gives you the speed that campaign managers need to optimize in real time, while the batch path gives you the correctness and auditability that billing and finance require. The raw event log, sitting between ingestion and both consumers, is what makes the entire architecture replayable, auditable, and resilient to the inevitable bugs and rule changes that production systems face. Everything else (the OLAP stores, the warehouse, the fraud engine, the reconciliation jobs) is a derived view that can be recomputed from that log.

Looking ahead, the boundary between streaming and batch continues to blur. Technologies like Apache Flink’s unified batch-and-stream processing and lakehouse formats like Apache Iceberg with incremental compaction are making it possible to converge toward a single processing framework that can serve both tiers. Fraud detection is also evolving rapidly, with federated learning approaches that detect coordinated fraud across publishers without sharing raw data. The dual-pipeline pattern will persist as a logical separation, but the physical infrastructure underneath is steadily simplifying.

If you can explain correctness and billing auditability without claiming magical exactly-once, you’re answering at Staff level. That clarity about what you guarantee, what you approximate, and where the truth lives is what separates data plumbing from revenue infrastructure.


Written By:
Khayyam Hashmi