Live Comments System Design
Build live comments as a durable broadcast pipeline: server-assign a per-stream seq, persist and publish to a log, fan out to gateways, dedupe on clients, catch up via last_seen_seq, make moderation "must win," and degrade hot streams with sampling and slow mode while never sampling or dropping moderation delivery.
Live comments systems are real-time broadcast architectures designed to deliver user-generated messages to massive concurrent audiences with low latency, strict per-stream ordering, and instant moderation enforcement. The core engineering challenge is not accepting comments but safely and cheaply fanning them out to hundreds of thousands of viewers under unpredictable burst load, while guaranteeing that moderation actions always override the data plane.
Key takeaways
- Broadcast over chat: The defining constraint is one-to-many fan-out at scale, not peer-to-peer messaging, making fan-out bandwidth and gateway pressure the primary bottlenecks.
- Server-assigned sequence numbers: Timestamps fail as ordering keys due to clock skew and retries, so a per-stream monotonic sequence is the only reliable way to guarantee correct rendering and catch-up.
- Moderation as a control plane: Moderation events must propagate on a separate, higher-priority path so that retractions always win over comment delivery, even during hot-stream degradation.
- At-least-once delivery with client dedupe: Accepting duplicates and making them harmless through stable identifiers is the only scalable guarantee that avoids silent message loss during crashes and reconnects.
- Graceful degradation by design: Hot streams are inevitable, so the system must detect per-stream overload early and apply policy-driven mitigations like slow mode, comment sampling, and reaction aggregation rather than failing unpredictably.
Every time a last-second goal lights up a stadium or a celebrity goes live to millions, the comment feed beside the video becomes the collective nervous system of the audience. Building that feed is deceptively simple on a whiteboard and ruthlessly hard in production. This blog walks through a Staff-level system design for live comments, covering guarantees, ordering, fan-out, moderation, failure recovery, geo-scaling, and the metrics that keep everything honest.
Why live comments are a broadcast problem, not a chat problem#
Most candidates instinctively reach for a chat architecture when they hear “live comments.” Chat assumes a small group where every participant reads and writes roughly equally. Live comments flip that ratio on its head. A stream with 200,000 concurrent viewers might see only 500 comments per second. The read-to-write ratio can exceed 400:1.
That asymmetry means the hardest engineering is on the read path. Accepting and validating a comment is cheap. Delivering it to 200,000 open connections without melting your gateway fleet is where the design either shines or collapses. Every architectural choice, from protocol selection to database schema, must be evaluated through the lens of fan-out cost.
Real-world context: Twitch chat during large esports finals routinely exceeds 100,000 concurrent viewers per channel. YouTube Live and TikTok Live face similar ratios. The “hot stream” scenario is not an edge case. It is the defining workload.
This broadcast framing also changes how you think about ordering, moderation, and failure. In a small chat room, you can afford strong consistency. In a broadcast to hundreds of thousands of clients, you must accept weaker delivery guarantees and compensate with client-side intelligence. The next section locks down exactly which guarantees you can defend and which you intentionally relax.
Clarify requirements and state the guarantees you can defend#
A strong design begins by separating what the system promises from what it attempts on a best-effort basis. Interviewers pay close attention here because overcommitting on guarantees (like claiming exactly-once delivery) signals inexperience with real-time pipelines.
Functional scope#
The core features are straightforward:
- Post a comment: Authenticated users submit text to a specific stream.
- Receive comments in real time: All viewers on a stream see new comments with minimal delay.
- Moderation actions: Delete, hold, shadow ban, slow mode, and pin.
- Reactions: Lightweight signals (likes, emojis) at high volume.
- Catch-up and history: Viewers who join late or reconnect see recent comments without gaps.
Non-functional guarantees#
The delivery contract is the most important decision you make early. Here is how the guarantees break down:
Comparison of Message Delivery Guarantees
| Guarantee Type | Delivery Assurance | Failure Behavior | Client Complexity | Suitability for Live Broadcast |
|---|---|---|---|---|
| At-Most-Once | Zero or one delivery; possible message loss, no duplicates | Messages lost on failure; no redelivery attempted | Low: no duplicate handling required | Suitable only when occasional loss is acceptable (e.g., non-critical metrics) |
| At-Least-Once | One or more deliveries; no loss, but duplicates possible | Redelivery attempted on failure; may cause duplicates | Medium: requires idempotent processing | Appropriate when loss is unacceptable and duplicates can be tolerated or managed |
| Exactly-Once | Precisely one delivery; no loss and no duplicates | System must coordinate to ensure single processing even during failures | High: requires transactional coordination and state management | Theoretically attractive, but impractical to guarantee end-to-end at broadcast scale |
The safest and most scalable contract is at-least-once delivery with client-side deduplication: the server may redeliver events after failures, and clients make duplicates harmless using stable identifiers.
Ordering is the second critical guarantee. You cannot rely on timestamps alone. Clock skew between servers, variable network delays, and replayed events will reorder comments in ways that make the feed unreadable. The stable approach is per-stream ordering using server-assigned sequence numbers. That sequence becomes the canonical key for rendering, storage, and catch-up.
Attention: Saying “exactly once” or “we’ll use timestamps for ordering” in an interview without acknowledging clock skew and retry semantics signals that you have not operated real-time pipelines at scale. Interviewers treat this as a red flag.
With guarantees locked, the next step is to lay out the end-to-end pipeline that honors them, from ingest through fan-out to the client.
High-level architecture: a durable event pipeline with a fast broadcast edge#
The system is easiest to reason about as a linear pipeline with a parallel control plane:
Ingest → Validate → Sequence → Persist → Publish → Fan-out → Client
Each stage has a single job, and the boundaries between stages are where you insert durability, ordering, and priority.
The following diagram shows how these stages connect and where the ordering point, durable log, and moderation control plane sit relative to the data path.
At the edge, you use persistent connections to push events to clients. The two practical choices are WebSockets and Server-Sent Events (SSE), with long polling as a last-resort fallback.
Behind the gateways, an ingestion service authenticates users, enforces per-user and per-stream rate limits, runs lightweight safety checks, and forwards accepted comments into a durable log like Apache Kafka. Fan-out workers then consume the log and broadcast new events to the correct stream’s connected gateways.
The “single source of truth” is the durable log plus the database that stores comment history. The log supports replay and recovery. The database supports cold-start history loads and post-stream playback. Caches are used for stream membership and connection mapping, not as truth.
Here is how the protocol options compare for the broadcast edge:
Comparison of WebSockets, SSE, and Long Polling
| Dimension | WebSockets | Server-Sent Events (SSE) | Long Polling |
|---|---|---|---|
| Directionality | Full-duplex (bidirectional) | Unidirectional (server → client) | Simulated bidirectional via repeated requests |
| Connection Overhead | Low: single persistent connection after handshake | Low: persistent HTTP connection for streaming | High: repeated HTTP requests per interaction |
| Browser Support | All modern browsers | Most modern browsers (no IE support) | All browsers (standard HTTP) |
| Scalability for Broadcast | Resource-intensive with many persistent connections | Limited by one TCP connection per client | High server resource consumption at scale |
| Reconnect Complexity | Manual: requires custom reconnection logic | Automatic: built-in reconnection support | Inherent per request, but frequent polling adds complexity |
Pro tip: In an interview, explicitly state where the “ordering point” lives. It is your system’s most important invariant. If you cannot point to a single place where sequence numbers are assigned, your design has an ordering gap.
With the pipeline structure in place, the next section dives into why the ordering point matters so much and how to implement it without expensive global coordination.
Ordering and sequencing: why timestamps fail and how sequence numbers work#
Ordering is not a nice-to-have in live comments. It is what makes the feed readable. If two viewers see comments in different orders, the conversation becomes incoherent. In distributed systems, timestamps are unreliable ordering keys for three reasons:
- Clock skew: Even with NTP, servers can differ by tens of milliseconds, and clients can differ by seconds.
- Network jitter: A comment sent first can arrive at the server second.
- Retries and replays: A retransmitted comment carries its original timestamp but arrives much later.
The practical solution is to assign a monotonically increasing sequence number at the moment the comment is accepted into the pipeline. That sequence becomes the canonical ordering key used everywhere: clients render in sequence order, the database clusters comments by sequence, and catch-up queries are “give me everything after seq N.”
Generating sequence numbers at scale#
There are two defensible patterns for generating per-stream sequences:
- Dedicated sequencer partitioned by stream_id: A lightweight service that maintains an in-memory counter per stream, incremented atomically on each accepted comment. Streams are assigned to sequencer instances via consistent hashing. This gives you explicit control but introduces a single point of coordination per stream.
- Log-offset-as-sequence: If all events for a stream are routed to the same Kafka partition, the partition offset is a natural monotonic sequence. This piggybacks on Kafka’s ordering guarantee and avoids a separate service, but ties your sequence numbering to your log infrastructure.
Both approaches require stable stream-to-partition (or stream-to-sequencer) mapping. If a stream bounces between partitions during rebalancing, you can create ordering gaps or duplicates. Sticky partitioning and graceful rebalance protocols mitigate this.
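The first pattern can be sketched in a few lines. This is a minimal, single-process illustration (the class names, virtual-node count, and hash choice are assumptions, not a production design): a consistent-hash ring maps each stream to a sequencer instance, and each instance keeps an atomic in-memory counter per stream.

```python
import bisect
import hashlib
import itertools
import threading

class Sequencer:
    """Hypothetical sketch: one monotonic in-memory counter per stream."""
    def __init__(self) -> None:
        self._counters: dict[str, itertools.count] = {}
        self._lock = threading.Lock()

    def next_seq(self, stream_id: str) -> int:
        # Atomically allocate the next sequence number for this stream.
        with self._lock:
            counter = self._counters.setdefault(stream_id, itertools.count(1))
            return next(counter)

class SequencerRing:
    """Consistent-hash ring: stream_id -> sequencer node, stable under lookup."""
    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        # Each node appears vnodes times on the ring to smooth the distribution.
        self._ring: list[tuple[int, str]] = sorted(
            (self._hash(f"{node}-{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, stream_id: str) -> str:
        # First ring position at or after the stream's hash (wrapping around).
        idx = bisect.bisect(self._keys, self._hash(stream_id)) % len(self._ring)
        return self._ring[idx][1]
```

The ring gives every stream a stable home, and the counter gives every accepted comment a gap-free per-stream seq; a real deployment would add counter persistence and handoff during rebalancing.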
Historical note: Early real-time systems like IRC relied on server timestamps for ordering, which led to notorious “message reordering” bugs during netsplits. The shift to explicit sequence numbers in modern systems like Kafka and Slack’s message infrastructure reflects decades of hard-won operational lessons.
Once you have a reliable sequence, you need a data model that makes sequential reads fast and moderation reversible. That is where the storage layer comes in.
Data model: store once, read sequentially, make moderation reversible#
The dominant access pattern for live comments is sequential reads. When a user joins, you load the most recent window. When catching up, you fetch everything after a sequence number. That access pattern maps perfectly to a storage model where stream_id is the partition key and seq is the clustering key.
A wide-column store like Apache Cassandra or a sorted key-value store fits naturally. Each row represents a single comment, and reads for a stream are sequential scans within a partition. This avoids the random I/O patterns that would make a traditional relational database struggle at broadcast scale.
The comment record should include:
- `stream_id`, `seq` (partition and clustering keys)
- `comment_id` (globally unique, used for deduplication and moderation references)
- `user_id`, `content`, `created_at`
- `state` (VISIBLE, HELD, DELETED, SHADOW_BANNED)
- `moderation_metadata` (who acted, when, reason, policy reference)
Deletions must be soft deletes: a delete flips the comment's state to DELETED and records moderation metadata, while the row itself stays in place so the sequence remains dense and auditable.
Attention: Hard-deleting comments is a common design pitfall. It creates confusing sequence gaps and destroys evidence needed for safety reviews. Always soft-delete and filter at query time.
For reactions at massive scale, writing one record per individual reaction is prohibitively expensive. A better approach is to log reaction events to a lightweight stream and maintain aggregated counters per comment or per stream with eventual consistency. Clients display "12K 🔥" rather than rendering 12,000 individual reaction records.
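The aggregation idea can be sketched as a small buffer that absorbs raw reaction events and periodically flushes per-stream counts (class and method names are illustrative; in production the flush would run on a timer and the deltas would be broadcast as a single event):

```python
from collections import Counter

class ReactionAggregator:
    """Buffers raw reaction events; flush() returns per-(stream, emoji) deltas."""
    def __init__(self) -> None:
        # (stream_id, emoji) -> count accumulated since the last flush
        self._pending: Counter = Counter()

    def record(self, stream_id: str, emoji: str) -> None:
        self._pending[(stream_id, emoji)] += 1

    def flush(self) -> dict:
        # Swap out the buffer atomically and return the accumulated deltas.
        deltas, self._pending = dict(self._pending), Counter()
        return deltas
```

One flushed delta replaces thousands of individual fan-out events, which is exactly the trade the text describes: slightly stale counts in exchange for an enormous reduction in fan-out cost.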
Back-of-envelope storage estimation#
For a platform handling 10,000 concurrent streams with an average of 100 comments per minute per stream:
- Comments per day: $10{,}000 \times 100 \times 60 \times 24 = 1.44 \times 10^9$
- Average comment size (with metadata): ~500 bytes
- Daily raw storage: $1.44 \times 10^9 \times 500 = 720\text{ GB/day}$
- Annual storage (before compression): ~263 TB
With a 90-day hot retention window (archiving older data to cold storage), the hot footprint is roughly 65 TB before compression, and compression reduces it further. These numbers justify a distributed wide-column store but also highlight the need for a retention and archival policy.
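The arithmetic above can be checked directly:

```python
streams = 10_000            # concurrent streams
comments_per_min = 100      # per stream
bytes_per_comment = 500     # average size with metadata

comments_per_day = streams * comments_per_min * 60 * 24
daily_bytes = comments_per_day * bytes_per_comment

daily_gb = daily_bytes / 1e9          # ~720 GB/day raw
annual_tb = daily_bytes * 365 / 1e12  # ~263 TB/year before compression
hot_tb = daily_bytes * 90 / 1e12      # ~65 TB for a 90-day hot window
```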
With the data model defined, the next question is: what happens to a comment between the moment a user hits “send” and the moment it appears on 100,000 screens? The answer is a state machine.
Comment life cycle state machine#
A broadcast system becomes interview-grade when you can describe the comment life cycle as a state machine with explicit transitions. Comments are not just “sent.” They pass through validation, sequencing, persistence, publication, visibility, and potential moderation override. Making these states explicit helps you reason about retries, duplicates, and the “moderation must win” invariant.
The key design rule is: persist before publish. A comment must be durable in storage before its event is emitted to the fan-out log. This guarantees that any event a client receives corresponds to a comment that exists in the history store, making catch-up and replay consistent.
Visibility on clients is derived from two inputs: receiving the comment event and the current moderation state. If a comment is later deleted, clients retract it by applying a delete event that references the original comment_id and seq. The state machine makes this retraction a standard transition rather than an ad-hoc patch.
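The life cycle described above can be expressed as an explicit transition table. This is a minimal sketch (the intermediate pipeline states are inferred from the stages named in the text; the visibility states match the `state` column of the data model), with DELETED as the terminal state so that moderation always wins:

```python
from enum import Enum, auto

class CommentState(Enum):
    RECEIVED = auto()
    VALIDATED = auto()
    SEQUENCED = auto()
    PERSISTED = auto()
    PUBLISHED = auto()
    VISIBLE = auto()
    HELD = auto()
    DELETED = auto()
    SHADOW_BANNED = auto()

# Allowed transitions. Note PERSISTED precedes PUBLISHED ("persist before
# publish"), and DELETED has no outgoing edges ("moderation must win").
TRANSITIONS = {
    CommentState.RECEIVED: {CommentState.VALIDATED, CommentState.HELD},
    CommentState.VALIDATED: {CommentState.SEQUENCED},
    CommentState.SEQUENCED: {CommentState.PERSISTED},
    CommentState.PERSISTED: {CommentState.PUBLISHED},
    CommentState.PUBLISHED: {CommentState.VISIBLE},
    CommentState.VISIBLE: {CommentState.DELETED, CommentState.SHADOW_BANNED},
    CommentState.HELD: {CommentState.VISIBLE, CommentState.DELETED},
    CommentState.SHADOW_BANNED: {CommentState.DELETED},
    CommentState.DELETED: set(),
}

def transition(state: CommentState, nxt: CommentState) -> CommentState:
    if nxt not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state.name} -> {nxt.name}")
    return nxt
```

Encoding the transitions as data makes retries and retractions easy to reason about: a replayed event that would produce an illegal transition is simply rejected.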
Pro tip: In an interview, drawing this state machine on the whiteboard immediately elevates your answer. It proves you have thought about retries, replays, and moderation retractions as part of the design rather than as afterthoughts.
Let us now walk through the happy path end to end, from a user typing a comment to every viewer seeing it.
Walk-through 1: posting a comment and delivering it to all viewers#
A user types a comment and hits send. Here is the exact sequence of events through the pipeline:
Step 1: Ingestion. The client sends the comment payload to the ingestion service (routed through a gateway). Ingestion authenticates the user, validates the payload (length, encoding, attachment policies), and applies rate limits. Per-user limits prevent spam. Per-stream limits provide a coarse form of overload protection.
Step 2: Lightweight safety check. Before sequencing, the ingestion service runs a fast toxicity or blocklist check. Comments flagged by this check can be routed to a HELD state for human review rather than entering the visible pipeline. This pre-sequence check adds minimal latency (target: <50ms) but catches the most obvious violations.
Step 3: Sequencing and persistence. The ingestion service requests the next sequence number for this stream and writes the comment to the history store with state=VISIBLE. The comment is now durable.
Step 4: Event publication. A “comment created” event containing (stream_id, seq, comment_id, content, user_id) is published to the durable log partition for this stream.
Step 5: Fan-out. Fan-out workers consume the event from the log and look up which gateway instances hold connections for this stream. They forward the event to those gateways.
Step 6: Client delivery. Gateways push the event down WebSocket or SSE connections to all subscribed viewers. Each client deduplicates by (stream_id, seq), inserts the comment into the UI in sequence order, and updates its last_seen_seq.
If a viewer’s network jitters and comments arrive out of order, the sequence number lets the client buffer briefly and re-sort. If a comment is missing entirely, the client’s catch-up mechanism (covered next) fills the gap from the history store.
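The client-side reorder buffer can be sketched as follows (a simplified illustration; a real client would also time out on persistent gaps and invoke the catch-up path, which is omitted here):

```python
import heapq

class OrderedFeed:
    """Releases comments strictly in seq order, dropping duplicates."""
    def __init__(self, last_seen_seq: int = 0) -> None:
        self.last_seen_seq = last_seen_seq
        self._buffer: list[tuple[int, str]] = []  # min-heap keyed by seq

    def on_event(self, seq: int, text: str) -> list[tuple[int, str]]:
        if seq <= self.last_seen_seq:
            return []  # duplicate of an already-rendered comment
        heapq.heappush(self._buffer, (seq, text))
        released = []
        # Release the longest contiguous run starting at last_seen_seq + 1.
        while self._buffer and self._buffer[0][0] == self.last_seen_seq + 1:
            released.append(heapq.heappop(self._buffer))
            self.last_seen_seq += 1
        return released
```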
Real-world context: YouTube’s live chat infrastructure uses a similar pipeline with an explicit sequencing step before fan-out. The sequence number is what allows millions of clients to see a consistent comment order despite being served by different gateway instances across different data centers.
But what happens when a viewer joins a stream that is already in progress, or reconnects after a network drop? That is the cold-start and catch-up problem.
Cold start and catch-up: where correctness is proven#
Cold start and catch-up are where real-time broadcast systems prove they actually work. Relying solely on “whatever arrives over the WebSocket” guarantees gaps, especially on mobile networks with flaky connectivity.
Join flow#
When a viewer opens a stream for the first time:
- The client calls a history API: `GET /streams/{stream_id}/comments?last=200`.
- The server returns the most recent 200 comments ordered by `seq`, along with `latest_seq` (the highest sequence the server has seen).
- The client renders these comments, sets `last_seen_seq = latest_seq`, and opens a persistent connection for live events.
- New events arriving via the live connection are appended after `last_seen_seq`.
Reconnect and gap detection#
After a disconnect (which is routine on mobile), the client reconnects and provides last_seen_seq to the server. Two things can happen:
- Small gap: The server streams the missing range `(last_seen_seq, current_seq]` inline during the reconnect handshake. This is fast and avoids a separate API call.
- Large gap: If the client has been disconnected for a long time, the server returns a redirect to the history API with a sequence range. The client fetches the gap, merges it with the live stream, and resumes.
This design means the persistent push connection is a latency optimization, but the catch-up path is the correctness mechanism. Even if the real-time channel drops every other event, the system eventually converges because the client can always recover via sequence-based history queries.
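The small-gap/large-gap decision can be sketched as a single server-side function (the threshold value and dict shape are illustrative assumptions):

```python
SMALL_GAP_LIMIT = 500  # hypothetical cutoff for inline backfill

def plan_catch_up(last_seen_seq: int, current_seq: int) -> dict:
    """Decide how a reconnecting client fills (last_seen_seq, current_seq]."""
    gap = current_seq - last_seen_seq
    if gap <= 0:
        return {"action": "none"}  # client is already up to date
    if gap <= SMALL_GAP_LIMIT:
        # Stream the missing range inline during the reconnect handshake.
        return {"action": "inline", "range": (last_seen_seq + 1, current_seq)}
    # Too large to stream inline: redirect to the paginated history API.
    return {"action": "history_api", "range": (last_seen_seq + 1, current_seq)}
```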
Pro tip: In your interview answer, explicitly say: “Push is an optimization. Catch-up from last_seen_seq is how I guarantee correctness.” This one sentence signals that you understand the fundamental reliability model of real-time systems.

With the data path solid, it is time to address the most critical non-functional concern: moderation. In a live broadcast, a harmful comment that stays visible for even 10 seconds can cause real damage.
Moderation as a control plane#
Moderation is not “just another event” in the comment stream. It is a control plane: a separate, higher-priority path with its own audit trail, whose actions must always override the data plane.
Moderation event types#
A robust moderation system supports more than just “delete”:
- Delete: Removes a specific comment from all clients immediately.
- Hold: Intercepts a comment before it becomes visible, pending human review.
- Shadow ban: A user’s comments appear to themselves but are hidden from all other viewers. This avoids alerting bad actors that they have been flagged.
- User mute: Temporarily or permanently prevents a user from posting on a specific stream or globally.
- Slow mode: Limits the posting rate for all users on a stream (e.g., one comment per 30 seconds).
- Pin: Elevates a specific comment to a fixed position at the top of the feed.
Each moderation action is recorded as an auditable event with the actor (human moderator, automated system, or policy engine), the target (comment_id, user_id, or stream_id), the reason, and a timestamp. This audit trail is essential for appeals, safety reviews, and training future moderation models.
Delivery priority#
Moderation events travel on a separate channel or are tagged with strict priority in the fan-out layer. During periods of high fan-out pressure, the system may sample or batch comment events, but moderation events are never sampled, never batched, and never deprioritized. This is the “moderation must win” invariant.
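The priority rule can be sketched with a two-level priority queue (the event-type naming convention `MOD_*` is an assumption for illustration):

```python
import heapq
import itertools

MODERATION, COMMENT = 0, 1  # lower value = dequeued first

class FanOutQueue:
    """Queue where moderation events always dequeue before comment events."""
    def __init__(self) -> None:
        self._heap: list[tuple[int, int, dict]] = []
        self._tiebreak = itertools.count()  # preserves FIFO within a priority

    def publish(self, event: dict) -> None:
        prio = MODERATION if event["type"].startswith("MOD_") else COMMENT
        heapq.heappush(self._heap, (prio, next(self._tiebreak), event))

    def next_event(self) -> dict:
        return heapq.heappop(self._heap)[2]
```

In production this would more likely be two separate channels or topics with weighted consumption, but the invariant is the same: no amount of queued comment traffic can delay a retraction.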
Attention: A common pitfall is only deleting comments in the database without emitting a real-time retract event. In live systems, viewers need immediate visual retraction. Database-only deletion is eventual consistency at best and harmful content exposure at worst.
Let us trace the moderation path end to end in a concrete walk-through.
Walk-through 2: moderator deletes a comment and all clients retract it#
A comment with seq=4072 is visible to 150,000 viewers. A moderator reviews it and clicks “Delete” in the moderation dashboard.
Step 1: The moderation service receives the delete action referencing comment_id and stream_id.
Step 2: It writes an audit record (who, what, when, why) and updates the comment’s state to DELETED in the history store.
Step 3: It publishes a high-priority moderation event: {type: DELETE, stream_id, comment_id, seq: 4072} to the moderation event channel.
Step 4: Fan-out workers prioritize moderation events. They immediately forward the delete event to all gateways holding connections for this stream.
Step 5: Gateways push the retract event to all connected clients. Clients remove seq=4072 from the UI.
Step 6: For viewers who join later and load history, the server either excludes deleted comments from the response or includes a tombstone record that instructs the client not to render it.
The critical property is that even if a client is behind on comment events and has not yet processed seq=4072, the moderation event applies immediately. If the client later receives the original comment event for seq=4072, it checks against its moderation state and discards it. Moderation always wins.
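The “moderation always wins” check on the client can be sketched as a retraction set consulted on every comment event (class and method names are illustrative):

```python
class ClientModerationState:
    """Tracks retractions so a late-arriving comment event is discarded."""
    def __init__(self) -> None:
        self._deleted: set[int] = set()   # seqs retracted by moderation
        self.rendered: dict[int, str] = {}

    def on_delete(self, seq: int) -> None:
        self._deleted.add(seq)
        self.rendered.pop(seq, None)      # retract if already on screen

    def on_comment(self, seq: int, text: str) -> bool:
        if seq in self._deleted:
            return False                  # moderation already retracted it
        self.rendered[seq] = text
        return True
```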
Real-world context: Twitch’s AutoMod combines automated detection with human review queues. Comments flagged by the ML model are held before publication, and moderator actions propagate to all connected clients in near real time. This two-tier approach (pre-publication hold + post-publication retraction) covers both automated and human moderation workflows.
With the data plane and control plane defined, the next challenge is what happens when a single stream becomes so popular that it saturates the fan-out layer.
Fan-out strategies and the hot-stream problem#
Fan-out is the most expensive operation in a live comments system. Every comment must be delivered to every connected viewer on that stream. There are two fundamental strategies, and the right choice depends on the workload:
- Fan-out-on-write: When a comment is accepted, immediately push it to every subscriber. This is the default for live comments because latency matters and the write rate is low relative to the read rate. The cost scales with
comments_per_second Γ viewers_per_stream. - Fan-out-on-read: Subscribers poll or pull from a shared log at their own pace. This is cheaper for the server but adds latency and shifts complexity to the client. It works better for low-urgency feeds (like social media timelines) but is generally too slow for live broadcast.
For live comments, fan-out-on-write is the standard choice, but it creates a scaling cliff for hot streams. If a stream has 500,000 viewers and 200 comments per second, the fan-out layer must deliver $200 \times 500{,}000 = 10^8$ events per second for that single stream. That number can overwhelm gateway fleets and network bandwidth.
Detecting hot streams#
Detection must be proactive, not reactive. You instrument per-stream metrics:
- Publish rate: Comments per second per stream.
- Queue lag: How far behind fan-out workers are on the stream’s log partition.
- Gateway outbound pressure: Bytes queued for push per stream.
- Client drop rate: Percentage of events that gateways fail to deliver.
- Fan-out success rate: Percentage of connected clients that receive each event within the latency target.
Threshold-based alerts (e.g., “if queue lag > 500ms or publish rate > 300/s, flag as hot”) trigger automatic mitigation.
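The alerting rule can be sketched as a per-stream predicate over the metrics listed above (the threshold constants follow the example in the text; the metric names are illustrative):

```python
HOT_QUEUE_LAG_MS = 500       # "queue lag > 500ms" from the alert rule
HOT_PUBLISH_RATE = 300       # "publish rate > 300/s" from the alert rule
MIN_FANOUT_SUCCESS = 0.99    # assumed floor for delivery success

def is_hot(metrics: dict) -> bool:
    """Flag a stream as hot when any per-stream signal crosses its threshold."""
    return (
        metrics.get("queue_lag_ms", 0) > HOT_QUEUE_LAG_MS
        or metrics.get("publish_rate", 0) > HOT_PUBLISH_RATE
        or metrics.get("fanout_success_rate", 1.0) < MIN_FANOUT_SUCCESS
    )
```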
Degradation tactics#
Degradation is not failure. It is a designed mode. When a stream is flagged as hot, the system applies stream-scoped mitigations:
- Slow mode: Limit posting frequency to reduce comment volume at the source.
- Comment sampling: Show only a random subset (e.g., 1 in 5) of comments to viewers. The feed still feels lively, but fan-out cost drops by 80%.
- Reaction aggregation: Instead of delivering individual reaction events, batch them into periodic aggregate counters (e.g., every 2 seconds: “1,200 new 🔥 reactions”).
- Priority tiering: Drop low-value events (typing indicators, minor UI signals) while preserving comment and moderation delivery.
- Moderation always flows: Moderation events are never sampled or dropped, even during maximum degradation.
Degradation Tactics Overview
| Tactic | Trigger Condition | User Impact | Fan-Out Cost Reduction | Moderation Affected |
|---|---|---|---|---|
| Slow mode | Publish rate exceeds per-stream threshold (e.g., > 300/s) | Users can post less frequently | Reduces comment volume at the source | No |
| Comment sampling | Queue lag exceeds threshold (e.g., > 500ms) | Viewers see a random subset of comments | ~80% at a 1-in-5 sample rate | No |
| Reaction aggregation | High reaction volume on a hot stream | Periodic aggregate counters instead of individual reactions | Large: thousands of events collapse into one | No |
| Priority tiering | Gateway outbound pressure or rising client drop rate | Typing indicators and minor UI signals disappear | Moderate: low-value events dropped | No |
Pro tip: In your interview, explicitly map triggers to mitigations. Saying “if queue lag exceeds 500ms, we enable slow mode and comment sampling for that stream” is much stronger than vaguely saying “we scale up.”
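The sampling tactic, with its moderation exemption, can be sketched in one function (passing the RNG explicitly is a testability choice, not a requirement):

```python
import random

def should_deliver(event: dict, sample_rate: float, rng: random.Random) -> bool:
    """Under degradation, deliver a random fraction of comment events.

    Moderation events (here assumed to use a MOD_* type prefix) are exempt:
    they are delivered unconditionally, even at sample_rate 0.
    """
    if event["type"].startswith("MOD_"):
        return True  # moderation is never sampled
    return rng.random() < sample_rate
```

At `sample_rate=0.2`, roughly 1 in 5 comments survives, matching the ~80% fan-out reduction cited above.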
The hot-stream problem also motivates thinking about geographic distribution. If viewers are spread across continents, fan-out from a single region adds latency. Let us look at how geo-scaling addresses this.
Geo-redundancy and multi-region deployment#
For a global audience, a single-region deployment means that viewers far from the data center experience high latency. Comments posted by a user in Tokyo and fanned out from a US-East cluster arrive 150–200ms later for viewers in Asia. In live broadcast, that delay is noticeable and compounds with processing time.
A multi-region architecture deploys gateway fleets in each major region (US-East, EU-West, AP-Southeast, etc.) so that viewers connect to the nearest edge. The ingestion and sequencing layer can remain centralized (or use a primary region per stream) to preserve per-stream ordering without cross-region coordination overhead.
The flow becomes:
- A user in SΓ£o Paulo posts a comment to the nearest ingestion endpoint (US-East, for a stream homed there).
- The comment is sequenced and persisted in the primary region.
- The event is published to the durable log.
- Regional fan-out workers in each region consume the log (via cross-region replication or a multi-region Kafka setup like Confluent’s multi-region clusters) and deliver to local gateways.
- Viewers in SΓ£o Paulo, London, and Tokyo all receive the event from their local gateway, minimizing last-mile latency.
The trade-off is clear: centralized sequencing preserves strict ordering but adds one cross-region round trip for the posting user. For live comments, this is acceptable because most users are viewers (read path), and only a small fraction are posters (write path). Optimizing the read path by placing gateways close to viewers is the higher-leverage decision.
Historical note: YouTube’s live infrastructure uses a similar model where comment sequencing happens in a primary region, but delivery is distributed to edge points of presence worldwide. This design keeps ordering simple while minimizing viewer-perceived latency.
Even with geo-distribution, components crash. The next section examines how the system heals after failures without losing or corrupting the comment stream.
Reliability, duplicates, and replay: how the system heals after crashes#
Distributed broadcast systems crash. Gateways restart, fan-out workers roll during deployments, Kafka partitions rebalance. If your design requires perfect delivery over a single persistent connection, it will lose events. The architecture must assume failure and build recovery into every layer.
Why duplicates happen#
Duplicates arise from a fundamental tension: a worker sends an event to a gateway and then crashes before committing its consumer offset. On restart, it resumes from the last committed offset and re-sends the event. Committing before sending instead would trade duplicates for silent loss, which is worse.
Making duplicates harmless#
Deduplication on the client is straightforward because every event carries a stable identifier. For comments, the pair (stream_id, seq) is sufficient. For moderation events, a moderation_id or moderation_seq serves the same purpose. Clients maintain a small sliding window of recently seen identifiers (the last 500–1000 events) and silently discard repeats.
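A bounded seen-window can be sketched with an ordered dict that evicts its oldest entry once full (the class name and capacity default are illustrative):

```python
from collections import OrderedDict

class DedupeWindow:
    """Sliding window of recently seen (stream_id, seq) identifiers."""
    def __init__(self, capacity: int = 1000) -> None:
        self._seen: OrderedDict = OrderedDict()
        self._capacity = capacity

    def first_time(self, stream_id: str, seq: int) -> bool:
        key = (stream_id, seq)
        if key in self._seen:
            return False          # repeat: caller should discard the event
        self._seen[key] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest identifier
        return True
```

The window is deliberately bounded: anything old enough to fall out of it is also old enough that a redelivery would be caught by the seq-ordered rendering path instead.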
Replay from the durable log#
If a fan-out worker crashes, it resumes consuming from the last committed offset. Events between the last commit and the crash point are reprocessed and redelivered. This is the “spine” of reliability. The durable log guarantees that no accepted comment is permanently lost, even if fan-out infrastructure is temporarily unavailable.
If gateways lose their stream subscription mappings temporarily (e.g., during a rolling restart), clients reconnect and catch up using last_seen_seq. The history store provides the “truth” for any missed ranges that the log may have already expired.
This brings us to a concrete crash scenario to illustrate the healing process.
Walk-through 3: crash, duplicate delivery, and recovery#
A fan-out worker consumes comment seq=9001 from the log partition for stream S and broadcasts it to gateways G1, G2, and G3. Clients connected to G1 and G2 receive seq=9001. Before the worker commits its offset, it crashes.
What happens next:
The worker restarts and resumes from the last committed offset (which is before seq=9001). It reprocesses seq=9001 and broadcasts it again to G1, G2, and G3.
- Clients on G1 and G2 already have `seq=9001` in their seen-set. They discard the duplicate. No visual glitch occurs.
- Clients on G3 missed the first broadcast (perhaps G3 was briefly unreachable). They receive `seq=9001` for the first time. At-least-once delivery has improved reliability here, not degraded it.
- A mobile client that was disconnected during both broadcasts reconnects 30 seconds later with `last_seen_seq=8995`. It requests catch-up and receives `8996..current` from the history store, which includes `seq=9001`. No event is lost.
Real-world context: Apache Kafka’s consumer group protocol explicitly supports this offset-commit model. Production systems tune auto.commit.interval.ms and use manual offset commits to control exactly when progress is recorded, balancing duplicate risk against processing latency.

The net result is convergence: every client eventually renders every comment exactly once in the correct order, despite crashes, duplicates, and reconnects. The combination of at-least-once delivery, client dedupe, and sequence-based catch-up makes this possible.
With reliability covered, the final critical piece is making all of this observable. You cannot operate what you cannot measure.
Observability: metrics that prove “live” and “safe”#
In real-time broadcast, system health is measured in latency and lag, not just throughput. A system that processes 10 million events per second but delivers them 5 seconds late has failed its core promise. You need metrics that capture timeliness, completeness, and safety.
Core metrics#
- Submit-to-visible latency (p50, p95, p99): The time from when a user sends a comment to when it appears on viewers’ screens. Target: p95 < 500ms.
- Queue lag per stream partition: How far behind fan-out workers are in processing the durable log. Sustained lag > 1s indicates a hot stream or an under-provisioned worker pool.
- Fan-out success rate: The percentage of connected clients that receive each event within the latency target. A drop below 99% signals gateway pressure.
- Moderation propagation latency: The time from a moderator clicking “Delete” to the retract event arriving on clients. This is a top-tier SLO, not a secondary metric. Target: p95 < 300ms.
- Reconnect rate: How often clients are dropping and reconnecting. A spike indicates network issues or gateway instability.
- Drop/sampling rate per stream: When degradation is active, this metric tracks what fraction of events are being sampled or dropped. It makes degradation visible and auditable.
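As a rough sketch, per-stream latency tracking with a nearest-rank p95 might look like the following. A production system would use a metrics library (for example, Prometheus histograms); the sample values here are invented:

```python
import math
from collections import defaultdict

# Per-stream submit-to-visible latency samples, in milliseconds.
latencies = defaultdict(list)

def record(stream_id, submit_ms, visible_ms):
    latencies[stream_id].append(visible_ms - submit_ms)

def p95(stream_id):
    """Nearest-rank 95th percentile over this stream's samples."""
    samples = sorted(latencies[stream_id])
    idx = min(len(samples) - 1, math.ceil(0.95 * len(samples)) - 1)
    return samples[idx]

# One hot stream blows its budget while another stays healthy;
# a global aggregate would average the problem away.
for ms in [120, 140, 160, 180, 900]:
    record("hot-stream", 0, ms)
for ms in [90, 100, 110, 120, 130]:
    record("quiet-stream", 0, ms)
```

Here p95("hot-stream") is dominated by the 900ms outlier while p95("quiet-stream") stays well inside the 500ms target, which is precisely the signal a global percentile would hide.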
Per-stream breakdowns#
Global aggregates hide localized problems. A single hot stream can melt fan-out workers while thousands of other streams operate normally. Every metric above must be available as a per-stream breakdown, enabling operators to identify and mitigate problems at the stream level rather than applying global throttling.
Attention: If you cannot tell the interviewer how you will detect a hot stream before it saturates the system, you have not finished the design. Metrics are not an afterthought. They are a core component of the architecture.
With observability in place, the design is complete. Let us close with the trade-offs that separate a good answer from a great one.
Trade-offs you should say out loud#
Every architectural decision in this system involves a trade-off. A Staff-level answer acknowledges these explicitly and ties them to product and safety choices rather than pretending that perfect solutions exist.
Fidelity vs. stability. Sampling comments during hot-stream degradation reduces fidelity (not every viewer sees every comment) but preserves the feeling of liveness and prevents system-wide collapse. This is a product decision: users accept a curated subset of a 10,000-comment-per-second firehose more gracefully than a frozen feed.
Strictness vs. cost. Exactly-once delivery would require heavyweight coordination (distributed transactions or idempotency keys at every hop). At-least-once with client dedupe is dramatically cheaper and simpler, at the cost of occasional redundant processing. For live comments, this trade-off is unambiguous.
Safety vs. throughput. During overload, you may drop or sample comment events, but you never delay moderation retractions. This is one of the most important “senior” signals in an interview: prioritizing safety and correctness under stress over raw throughput.
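The safety-vs-throughput rule can be made concrete with a small sketch: moderation events bypass sampling entirely, and comment sampling is deterministic on the sequence number so every gateway makes the same keep/drop decision. The event shape and the 25% sample rate are assumptions for illustration:

```python
import hashlib

SAMPLE_RATE = 0.25   # fraction of comment events kept under degradation

def should_deliver(event, degraded):
    """Moderation events always pass; comments are hash-sampled when degraded."""
    if event["type"] == "moderation":
        return True                # control plane always wins
    if not degraded:
        return True
    # Deterministic sampling keyed on seq: the same comment is kept or
    # dropped everywhere, so gateways stay mutually consistent.
    h = hashlib.sha256(str(event["seq"]).encode()).digest()
    return (h[0] / 255.0) < SAMPLE_RATE

events = [{"type": "comment", "seq": s} for s in range(100)]
events.append({"type": "moderation", "seq": 42})

kept = [e for e in events if should_deliver(e, degraded=True)]
```

Under degradation roughly a quarter of the comments survive, but the moderation retraction is always in the kept set.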
Centralized sequencing vs. write latency. A single sequencing point per stream guarantees ordering but adds latency for geographically distant posters. The trade-off favors read-path optimization because the viewer-to-poster ratio is enormous.
Pro tip: In an interview, explicitly name two or three trade-offs and explain which side you chose and why. This signals that you have operated real systems where perfect solutions do not exist, and you are comfortable making and defending engineering judgments.
What a strong interview answer sounds like#
A complete answer fits in 5 to 7 minutes and follows a predictable structure: frame the broadcast nature, state guarantees, describe the pipeline, zoom into ordering, cover catch-up, explain the moderation control plane, discuss degradation, and close with metrics.
Here is a sample 60-second summary:
“Live comments are a broadcast system where one stream can have 100k+ viewers. I use WebSocket/SSE gateways for persistent connections, an ingestion service that authenticates, rate-limits, and assigns a per-stream sequence number for ordering, then persist the comment and publish it to a durable log like Kafka. Fan-out workers consume the log and broadcast to gateways. Clients render by sequence and dedupe for at-least-once delivery. Join and reconnect use cold-start history plus catch-up from last_seen_seq to fill gaps. Moderation is a separate control plane: deletes, holds, and shadow bans emit high-priority events that trigger immediate retractions. For hot streams, I detect queue lag and gateway pressure and degrade with slow mode, sampling, and reaction aggregation while always prioritizing moderation. I track p95 submit-to-visible latency, queue lag, fan-out success rate, moderation propagation latency, reconnect rate, and drop/sampling rate as my core SLOs.”

Final checklist#
- Broadcast framing with explicit guarantees (at-least-once, per-stream ordering).
- Server sequence numbers and why timestamps fail.
- Durable log plus replay and a history store keyed by (stream_id, seq).
- Cold start plus catch-up via last_seen_seq.
- Moderation as a separate, high-priority control plane.
- Hot-stream degradation with concrete triggers, mitigations, and metrics.
Conclusion#
The three ideas that matter most in live comments system design are the broadcast framing (one-to-many fan-out, not peer-to-peer chat), the per-stream sequence number as the single source of ordering truth, and the separation of moderation into a control plane that always overrides the data plane. Every other decision, from protocol selection to geo-distribution to degradation policy, flows from these foundational choices.
Looking ahead, the frontier for live comments is moving toward ML-powered real-time moderation that can hold or shadow-ban comments before they ever reach the fan-out layer, reducing the need for post-publication retractions. Edge computing and WebTransport (the successor to WebSockets built on QUIC) will further reduce delivery latency, and advances in conflict-free replicated data types (CRDTs) may eventually enable decentralized ordering without centralized sequencers.
Design for the broadcast. Protect the sequence. Let moderation win. The rest is engineering.