Live Comments System Design
Build live comments as a durable broadcast pipeline: server-assign a per-stream seq, persist and publish to a log, fan out to gateways, dedupe on clients, catch up via last_seen_seq, make moderation "must win," and degrade hot streams with sampling and slow mode while never sampling or dropping moderation delivery.
Live comments systems are real-time broadcast architectures designed to deliver user-generated messages to massive concurrent audiences with low latency, strict per-stream ordering, and instant moderation enforcement. The core engineering challenge is not accepting comments but safely and cheaply fanning them out to hundreds of thousands of viewers under unpredictable burst load, while guaranteeing that moderation actions always override the data plane.
Key takeaways
- Broadcast over chat: The defining constraint is one-to-many fan-out at scale, not peer-to-peer messaging, making fan-out bandwidth and gateway pressure the primary bottlenecks.
- Server-assigned sequence numbers: Timestamps fail as ordering keys due to clock skew and retries, so a per-stream monotonic sequence is the only reliable way to guarantee correct rendering and catch-up.
- Moderation as a control plane: Moderation events must propagate on a separate, higher-priority path so that retractions always win over comment delivery, even during hot-stream degradation.
- At-least-once delivery with client dedupe: Accepting duplicates and making them harmless through stable identifiers is the only scalable guarantee that avoids silent message loss during crashes and reconnects.
- Graceful degradation by design: Hot streams are inevitable, so the system must detect per-stream overload early and apply policy-driven mitigations like slow mode, comment sampling, and reaction aggregation rather than failing unpredictably.
Every time a last-second goal lights up a stadium or a celebrity goes live to millions, the comment feed beside the video becomes the collective nervous system of the audience. Building that feed is deceptively simple on a whiteboard and ruthlessly hard in production. This blog walks through a Staff-level system design for live comments, covering guarantees, ordering, fan-out, moderation, failure recovery, geo-scaling, and the metrics that keep everything honest.
Why live comments are a broadcast problem, not a chat problem#
Most candidates instinctively reach for a chat architecture when they hear “live comments.” Chat assumes a small group where every participant reads and writes roughly equally. Live comments flip that ratio on its head. A stream with 200,000 concurrent viewers might see only 500 comments per second. The read-to-write ratio can exceed 400:1.
That asymmetry means the hardest engineering is on the read path. Accepting and validating a comment is cheap. Delivering it to 200,000 open connections without melting your gateway fleet is where the design either shines or collapses. Every architectural choice, from protocol selection to database schema, must be evaluated through the lens of fan-out cost.
Real-world context: Twitch chat during large esports finals routinely exceeds 100,000 concurrent viewers per channel. YouTube Live and TikTok Live face similar ratios. The “hot stream” scenario is not an edge case. It is the defining workload.
This broadcast framing also changes how you think about ordering, moderation, and failure. In a small chat room, you can afford strong consistency. In a broadcast to hundreds of thousands of clients, you must accept weaker delivery guarantees and compensate with client-side intelligence. The next section locks down exactly which guarantees you can defend and which you intentionally relax.
Clarify requirements and state the guarantees you can defend#
A strong design begins by separating what the system promises from what it attempts on a best-effort basis. Interviewers pay close attention here because overcommitting on guarantees (like claiming exactly-once delivery) signals inexperience with real-time pipelines.
Functional scope#
The core features are straightforward:
- Post a comment: Authenticated users submit text to a specific stream.
- Receive comments in real time: All viewers on a stream see new comments with minimal delay.
- Moderation actions: Delete, hold, shadow ban, slow mode, and pin.
- Reactions: Lightweight signals (likes, emojis) at high volume.
- Catch-up and history: Viewers who join late or reconnect see recent comments without gaps.
Non-functional guarantees#
The delivery contract is the most important decision you make early. Here is how the guarantees break down:
Comparison of Message Delivery Guarantees
| Guarantee Type | Delivery Assurance | Failure Behavior | Client Complexity | Suitability for Live Broadcast |
|---|---|---|---|---|
| At-Most-Once | Zero or one delivery; possible message loss, no duplicates | Messages lost on failure; no redelivery attempted | Low: no duplicate handling required | Suitable only when occasional loss is acceptable (e.g., non-critical metrics) |
| At-Least-Once | One or more deliveries; no loss, but duplicates possible | Redelivery attempted on failure; may cause duplicates | Medium: requires idempotent processing | Appropriate when loss is unacceptable and duplicates can be tolerated or managed |
| Exactly-Once | Precisely one delivery; no loss and no duplicates | System must coordinate to ensure single processing even during failures | High: requires transactional coordination and state management | Theoretically attractive, but impractical to guarantee end-to-end at broadcast scale |
The safest and most scalable contract is at-least-once delivery with client-side deduplication: the server may redeliver events after failures, and clients make duplicates harmless using stable identifiers.
Ordering is the second critical guarantee. You cannot rely on timestamps alone. Clock skew between servers, variable network delays, and replayed events will reorder comments in ways that make the feed unreadable. The stable approach is per-stream ordering using server-assigned sequence numbers. That sequence becomes the canonical key for rendering, storage, and catch-up.
Attention: Saying “exactly once” or “we’ll use timestamps for ordering” in an interview without acknowledging clock skew and retry semantics signals that you have not operated real-time pipelines at scale. Interviewers treat this as a red flag.
With guarantees locked, the next step is to lay out the end-to-end pipeline that honors them, from ingest through fan-out to the client.
High-level architecture: a durable event pipeline with a fast broadcast edge#
The system is easiest to reason about as a linear pipeline with a parallel control plane:
Ingest → Validate → Sequence → Persist → Publish → Fan-out → Client
Each stage has a single job, and the boundaries between stages are where you insert durability, ordering, and priority.
The following diagram shows how these stages connect and where the ordering point, durable log, and moderation control plane sit relative to the data path.
At the edge, you use persistent connections to push events to clients. The two practical choices are WebSockets and Server-Sent Events (SSE), with long polling as a last-resort fallback.
Behind the gateways, an ingestion service authenticates users, enforces per-user and per-stream rate limits, runs lightweight safety checks, and forwards accepted comments into a durable log like Apache Kafka. Fan-out workers then consume the log and broadcast new events to the correct stream’s connected gateways.
The “single source of truth” is the durable log plus the database that stores comment history. The log supports replay and recovery. The database supports cold-start history loads and post-stream playback. Caches are used for stream membership and connection mapping, not as truth.
Here is how the protocol options compare for the broadcast edge:
Comparison of WebSockets, SSE, and Long Polling
| Dimension | WebSockets | Server-Sent Events (SSE) | Long Polling |
|---|---|---|---|
| Directionality | Full-duplex (bidirectional) | Unidirectional (server → client) | Simulated bidirectional via repeated requests |
| Connection Overhead | Low: single persistent connection after handshake | Low: persistent HTTP connection for streaming | High: repeated HTTP requests per interaction |
| Browser Support | All modern browsers | Most modern browsers (no IE support) | All browsers (standard HTTP) |
| Scalability for Broadcast | Resource-intensive with many persistent connections | Limited by one TCP connection per client | High server resource consumption at scale |
| Reconnect Complexity | Manual: requires custom reconnection logic | Automatic: built-in reconnection support | Inherent per request, but frequent polling adds complexity |
Pro tip: In an interview, explicitly state where the “ordering point” lives. It is your system’s most important invariant. If you cannot point to a single place where sequence numbers are assigned, your design has an ordering gap.
With the pipeline structure in place, the next section dives into why the ordering point matters so much and how to implement it without expensive global coordination.
Ordering and sequencing: why timestamps fail and how sequence numbers work#
Ordering is not a nice-to-have in live comments. It is what makes the feed readable. If two viewers see comments in different orders, the conversation becomes incoherent. In distributed systems, timestamps are unreliable ordering keys for three reasons:
- Clock skew: Even with NTP, servers can differ by tens of milliseconds, and clients can differ by seconds.
- Network jitter: A comment sent first can arrive at the server second.
- Retries and replays: A retransmitted comment carries its original timestamp but arrives much later.
The practical solution is to assign a monotonically increasing sequence number at the moment the comment is accepted into the pipeline. That sequence becomes the canonical ordering key used everywhere: clients render in sequence order, the database clusters comments by sequence, and catch-up queries are “give me everything after seq N.”
Generating sequence numbers at scale#
There are two defensible patterns for generating per-stream sequences:
- Dedicated sequencer partitioned by stream_id: A lightweight service that maintains an in-memory counter per stream, incremented atomically on each accepted comment. Streams are assigned to sequencer instances via consistent hashing. This gives you explicit control but introduces a single point of coordination per stream.
- Log-offset-as-sequence: If all events for a stream are routed to the same Kafka partition, the partition offset is a natural monotonic sequence. This piggybacks on Kafka’s ordering guarantee and avoids a separate service, but ties your sequence numbering to your log infrastructure.
Both approaches require stable stream-to-partition (or stream-to-sequencer) mapping. If a stream bounces between partitions during rebalancing, you can create ordering gaps or duplicates. Sticky partitioning and graceful rebalance protocols mitigate this.
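The first pattern can be sketched in a few lines. This is a minimal, single-process illustration (the class names, virtual-node count, and hash choice are assumptions, not a production design): a consistent-hash ring maps each stream to a sequencer instance, and each instance keeps an atomic in-memory counter per stream.

```python
import bisect
import hashlib
import itertools
import threading

class Sequencer:
    """Hypothetical sketch: one monotonic in-memory counter per stream."""
    def __init__(self) -> None:
        self._counters: dict[str, itertools.count] = {}
        self._lock = threading.Lock()

    def next_seq(self, stream_id: str) -> int:
        # Atomically allocate the next sequence number for this stream.
        with self._lock:
            counter = self._counters.setdefault(stream_id, itertools.count(1))
            return next(counter)

class SequencerRing:
    """Consistent-hash ring: stream_id -> sequencer node, stable under lookup."""
    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        # Each node appears vnodes times on the ring to smooth the distribution.
        self._ring: list[tuple[int, str]] = sorted(
            (self._hash(f"{node}-{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, stream_id: str) -> str:
        # First ring position at or after the stream's hash (wrapping around).
        idx = bisect.bisect(self._keys, self._hash(stream_id)) % len(self._ring)
        return self._ring[idx][1]
```

The ring gives every stream a stable home, and the counter gives every accepted comment a gap-free per-stream seq; a real deployment would add counter persistence and handoff during rebalancing.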
Historical note: Early real-time systems like IRC relied on server timestamps for ordering, which led to notorious “message reordering” bugs during netsplits. The shift to explicit sequence numbers in modern systems like Kafka and Slack’s message infrastructure reflects decades of hard-won operational lessons.
Once you have a reliable sequence, you need a data model that makes sequential reads fast and moderation reversible. That is where the storage layer comes in.
Data model: store once, read sequentially, make moderation reversible#
The dominant access pattern for live comments is sequential reads. When a user joins, you load the most recent window. When catching up, you fetch everything after a sequence number. That access pattern maps perfectly to a storage model where stream_id is the partition key and seq is the clustering key.
A wide-column store like Apache Cassandra or a sorted key-value store fits naturally. Each row represents a single comment, and reads for a stream are sequential scans within a partition. This avoids the random I/O patterns that would make a traditional relational database struggle at broadcast scale.
The comment record should include:
- `stream_id`, `seq` (partition and clustering keys)
- `comment_id` (globally unique, used for deduplication and moderation references)
- `user_id`, `content`, `created_at`
- `state` (VISIBLE, HELD, DELETED, SHADOW_BANNED)
- `moderation_metadata` (who acted, when, reason, policy reference)
Deletions must be soft deletes: a delete flips the comment's state to DELETED and records moderation metadata, while the row itself stays in place so the sequence remains dense and auditable.
Attention: Hard-deleting comments is a common design pitfall. It creates confusing sequence gaps and destroys evidence needed for safety reviews. Always soft-delete and filter at query time.
For reactions at massive scale, writing one record per individual reaction is prohibitively expensive. A better approach is to log reaction events to a lightweight stream and maintain aggregated counters per comment or per stream with eventual consistency. Clients display "12K 🔥" rather than rendering 12,000 individual reaction records.
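The aggregation idea can be sketched as a small buffer that absorbs raw reaction events and periodically flushes per-stream counts (class and method names are illustrative; in production the flush would run on a timer and the deltas would be broadcast as a single event):

```python
from collections import Counter

class ReactionAggregator:
    """Buffers raw reaction events; flush() returns per-(stream, emoji) deltas."""
    def __init__(self) -> None:
        # (stream_id, emoji) -> count accumulated since the last flush
        self._pending: Counter = Counter()

    def record(self, stream_id: str, emoji: str) -> None:
        self._pending[(stream_id, emoji)] += 1

    def flush(self) -> dict:
        # Swap out the buffer atomically and return the accumulated deltas.
        deltas, self._pending = dict(self._pending), Counter()
        return deltas
```

One flushed delta replaces thousands of individual fan-out events, which is exactly the trade the text describes: slightly stale counts in exchange for an enormous reduction in fan-out cost.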
Back-of-envelope storage estimation#
For a platform handling 10,000 concurrent streams with an average of 100 comments per minute per stream:
- Comments per day: $10{,}000 \times 100 \times 60 \times 24 = 1.44 \times 10^9$
- Average comment size (with metadata): ~500 bytes
- Daily raw storage: $1.44 \times 10^9 \times 500 = 720\text{ GB/day}$
- Annual storage (before compression): ~263 TB
With a 90-day hot retention window (archiving older data to cold storage), the hot footprint is roughly 65 TB before compression, and compression reduces it further. These numbers justify a distributed wide-column store but also highlight the need for a retention and archival policy.
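The arithmetic above can be checked directly:

```python
streams = 10_000            # concurrent streams
comments_per_min = 100      # per stream
bytes_per_comment = 500     # average size with metadata

comments_per_day = streams * comments_per_min * 60 * 24
daily_bytes = comments_per_day * bytes_per_comment

daily_gb = daily_bytes / 1e9          # ~720 GB/day raw
annual_tb = daily_bytes * 365 / 1e12  # ~263 TB/year before compression
hot_tb = daily_bytes * 90 / 1e12      # ~65 TB for a 90-day hot window
```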
With the data model defined, the next question is: what happens to a comment between the moment a user hits “send” and the moment it appears on 100,000 screens? The answer is a state machine.
Comment life cycle state machine#
A broadcast system becomes interview-grade when you can describe the comment life cycle as a state machine with explicit transitions. Comments are not just “sent.” They pass through validation, sequencing, persistence, publication, visibility, and potential moderation override. Making these states explicit helps you reason about retries, duplicates, and the “moderation must win” invariant.
The key design rule is: persist before publish. A comment must be durable in storage before its event is emitted to the fan-out log. This guarantees that any event a client receives corresponds to a comment that exists in the history store, making catch-up and replay consistent.
Visibility on clients is derived from two inputs: receiving the comment event and the current moderation state. If a comment is later deleted, clients retract it by applying a delete event that references the original comment_id and seq. The state machine makes this retraction a standard transition rather than an ad-hoc patch.
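The life cycle described above can be expressed as an explicit transition table. This is a minimal sketch (the intermediate pipeline states are inferred from the stages named in the text; the visibility states match the `state` column of the data model), with DELETED as the terminal state so that moderation always wins:

```python
from enum import Enum, auto

class CommentState(Enum):
    RECEIVED = auto()
    VALIDATED = auto()
    SEQUENCED = auto()
    PERSISTED = auto()
    PUBLISHED = auto()
    VISIBLE = auto()
    HELD = auto()
    DELETED = auto()
    SHADOW_BANNED = auto()

# Allowed transitions. Note PERSISTED precedes PUBLISHED ("persist before
# publish"), and DELETED has no outgoing edges ("moderation must win").
TRANSITIONS = {
    CommentState.RECEIVED: {CommentState.VALIDATED, CommentState.HELD},
    CommentState.VALIDATED: {CommentState.SEQUENCED},
    CommentState.SEQUENCED: {CommentState.PERSISTED},
    CommentState.PERSISTED: {CommentState.PUBLISHED},
    CommentState.PUBLISHED: {CommentState.VISIBLE},
    CommentState.VISIBLE: {CommentState.DELETED, CommentState.SHADOW_BANNED},
    CommentState.HELD: {CommentState.VISIBLE, CommentState.DELETED},
    CommentState.SHADOW_BANNED: {CommentState.DELETED},
    CommentState.DELETED: set(),
}

def transition(state: CommentState, nxt: CommentState) -> CommentState:
    if nxt not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state.name} -> {nxt.name}")
    return nxt
```

Encoding the transitions as data makes retries and retractions easy to reason about: a replayed event that would produce an illegal transition is simply rejected.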
Pro tip: In an interview, drawing this state machine on the whiteboard immediately elevates your answer. It proves you have thought about retries, replays, and moderation retractions as part of the design rather than as afterthoughts.
Let us now walk through the happy path end to end, from a user typing a comment to every viewer seeing it.
Walk-through 1: posting a comment and delivering it to all viewers#
A user types a comment and hits send. Here is the exact sequence of events through the pipeline:
Step 1: Ingestion. The client sends the comment payload to the ingestion service (routed through a gateway). Ingestion authenticates the user, validates the payload (length, encoding, attachment policies), and applies rate limits. Per-user limits prevent spam. Per-stream limits provide a coarse form of overload protection.
Step 2: Lightweight safety check. Before sequencing, the ingestion service runs a fast toxicity or blocklist check. Comments flagged by this check can be routed to a HELD state for human review rather than entering the visible pipeline. This pre-sequence check adds minimal latency (target: <50ms) but catches the most obvious violations.
Step 3: Sequencing and persistence. The ingestion service requests the next sequence number for this stream and writes the comment to the history store with state=VISIBLE. The comment is now durable.
Step 4: Event publication. A “comment created” event containing (stream_id, seq, comment_id, content, user_id) is published to the durable log partition for this stream.
Step 5: Fan-out. Fan-out workers consume the event from the log and look up which gateway instances hold connections for this stream. They forward the event to those gateways.
Step 6: Client delivery. Gateways push the event down WebSocket or SSE connections to all subscribed viewers. Each client deduplicates by (stream_id, seq), inserts the comment into the UI in sequence order, and updates its last_seen_seq.
If a viewer’s network jitters and comments arrive out of order, the sequence number lets the client buffer briefly and re-sort. If a comment is missing entirely, the client’s catch-up mechanism (covered next) fills the gap from the history store.
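The client-side reorder buffer can be sketched as follows (a simplified illustration; a real client would also time out on persistent gaps and invoke the catch-up path, which is omitted here):

```python
import heapq

class OrderedFeed:
    """Releases comments strictly in seq order, dropping duplicates."""
    def __init__(self, last_seen_seq: int = 0) -> None:
        self.last_seen_seq = last_seen_seq
        self._buffer: list[tuple[int, str]] = []  # min-heap keyed by seq

    def on_event(self, seq: int, text: str) -> list[tuple[int, str]]:
        if seq <= self.last_seen_seq:
            return []  # duplicate of an already-rendered comment
        heapq.heappush(self._buffer, (seq, text))
        released = []
        # Release the longest contiguous run starting at last_seen_seq + 1.
        while self._buffer and self._buffer[0][0] == self.last_seen_seq + 1:
            released.append(heapq.heappop(self._buffer))
            self.last_seen_seq += 1
        return released
```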
Real-world context: YouTube’s live chat infrastructure uses a similar pipeline with an explicit sequencing step before fan-out. The sequence number is what allows millions of clients to see a consistent comment order despite being served by different gateway instances across different data centers.
But what happens when a viewer joins a stream that is already in progress, or reconnects after a network drop? That is the cold-start and catch-up problem.
Cold start and catch-up: where correctness is proven#
Cold start and catch-up are where real-time broadcast systems prove they actually work. Relying solely on “whatever arrives over the WebSocket” guarantees gaps, especially on mobile networks with flaky connectivity.
Join flow#
When a viewer opens a stream for the first time:
- The client calls a history API: `GET /streams/{stream_id}/comments?last=200`.
- The server returns the most recent 200 comments ordered by `seq`, along with `latest_seq` (the highest sequence the server has seen).
- The client renders these comments, sets `last_seen_seq = latest_seq`, and opens a persistent connection for live events.
- New events arriving via the live connection are appended after `last_seen_seq`.
Reconnect and gap detection#
After a disconnect (which is routine on mobile), the client reconnects and provides last_seen_seq to the server. Two things can happen:
- Small gap: The server streams the missing range `(last_seen_seq, current_seq]` inline during the reconnect handshake. This is fast and avoids a separate API call.
- Large gap: If the client has been disconnected for a long time, the server returns a redirect to the history API with a sequence range. The client fetches the gap, merges it with the live stream, and resumes.
This design means the persistent push connection is a latency optimization, but the catch-up path is the correctness mechanism. Even if the real-time channel drops every other event, the system eventually converges because the client can always recover via sequence-based history queries.
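The small-gap/large-gap decision can be sketched as a single server-side function (the threshold value and dict shape are illustrative assumptions):

```python
SMALL_GAP_LIMIT = 500  # hypothetical cutoff for inline backfill

def plan_catch_up(last_seen_seq: int, current_seq: int) -> dict:
    """Decide how a reconnecting client fills (last_seen_seq, current_seq]."""
    gap = current_seq - last_seen_seq
    if gap <= 0:
        return {"action": "none"}  # client is already up to date
    if gap <= SMALL_GAP_LIMIT:
        # Stream the missing range inline during the reconnect handshake.
        return {"action": "inline", "range": (last_seen_seq + 1, current_seq)}
    # Too large to stream inline: redirect to the paginated history API.
    return {"action": "history_api", "range": (last_seen_seq + 1, current_seq)}
```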
Pro tip: In your interview answer, explicitly say: “Push is an optimization. Catch-up from last_seen_seq is how I guarantee correctness.” This one sentence signals that you understand the fundamental reliability model of real-time systems.

With the data path solid, it is time to address the most critical non-functional concern: moderation. In a live broadcast, a harmful comment that stays visible for even 10 seconds can cause real damage.
Moderation as a control plane#
Moderation is not “just another event” in the comment stream. It is a control plane: a separate, higher-priority path with its own audit trail, whose actions must always override the data plane.
Moderation event types#
A robust moderation system supports more than just “delete”:
- Delete: Removes a specific comment from all clients immediately.
- Hold: Intercepts a comment before it becomes visible, pending human review.
- Shadow ban: A user’s comments appear to themselves but are hidden from all other viewers. This avoids alerting bad actors that they have been flagged.
- User mute: Temporarily or permanently prevents a user from posting on a specific stream or globally.
- Slow mode: Limits the posting rate for all users on a stream (e.g., one comment per 30 seconds).
- Pin: Elevates a specific comment to a fixed position at the top of the feed.
Each moderation action is recorded as an auditable event with the actor (human moderator, automated system, or policy engine), the target (comment_id, user_id, or stream_id), the reason, and a timestamp. This audit trail is essential for appeals, safety reviews, and training future moderation models.
Delivery priority#
Moderation events travel on a separate channel or are tagged with strict priority in the fan-out layer. During periods of high fan-out pressure, the system may sample or batch comment events, but moderation events are never sampled, never batched, and never deprioritized. This is the “moderation must win” invariant.
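The priority rule can be sketched with a two-level priority queue (the event-type naming convention `MOD_*` is an assumption for illustration):

```python
import heapq
import itertools

MODERATION, COMMENT = 0, 1  # lower value = dequeued first

class FanOutQueue:
    """Queue where moderation events always dequeue before comment events."""
    def __init__(self) -> None:
        self._heap: list[tuple[int, int, dict]] = []
        self._tiebreak = itertools.count()  # preserves FIFO within a priority

    def publish(self, event: dict) -> None:
        prio = MODERATION if event["type"].startswith("MOD_") else COMMENT
        heapq.heappush(self._heap, (prio, next(self._tiebreak), event))

    def next_event(self) -> dict:
        return heapq.heappop(self._heap)[2]
```

In production this would more likely be two separate channels or topics with weighted consumption, but the invariant is the same: no amount of queued comment traffic can delay a retraction.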
Attention: A common pitfall is only deleting comments in the database without emitting a real-time retract event. In live systems, viewers need immediate visual retraction. Database-only deletion is eventual consistency at best and harmful content exposure at worst.
Let us trace the moderation path end to end in a concrete walk-through.
Walk-through 2: moderator deletes a comment and all clients retract it#
A comment with seq=4072 is visible to 150,000 viewers. A moderator reviews it and clicks “Delete” in the moderation dashboard.
Step 1: The moderation service receives the delete action referencing comment_id and stream_id.
Step 2: It writes an audit record (who, what, when, why) and updates the comment’s state to DELETED in the history store.
Step 3: It publishes a high-priority moderation event: {type: DELETE, stream_id, comment_id, seq: 4072} to the moderation event channel.
Step 4: Fan-out workers prioritize moderation events. They immediately forward the delete event to all gateways holding connections for this stream.
Step 5: Gateways push the retract event to all connected clients. Clients remove seq=4072 from the UI.
Step 6: For viewers who join later and load history, the server either excludes deleted comments from the response or includes a tombstone record that instructs the client not to render it.
The critical property is that even if a client is behind on comment events and has not yet processed seq=4072, the moderation event applies immediately. If the client later receives the original comment event for seq=4072, it checks against its moderation state and discards it. Moderation always wins.
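The “moderation always wins” check on the client can be sketched as a retraction set consulted on every comment event (class and method names are illustrative):

```python
class ClientModerationState:
    """Tracks retractions so a late-arriving comment event is discarded."""
    def __init__(self) -> None:
        self._deleted: set[int] = set()   # seqs retracted by moderation
        self.rendered: dict[int, str] = {}

    def on_delete(self, seq: int) -> None:
        self._deleted.add(seq)
        self.rendered.pop(seq, None)      # retract if already on screen

    def on_comment(self, seq: int, text: str) -> bool:
        if seq in self._deleted:
            return False                  # moderation already retracted it
        self.rendered[seq] = text
        return True
```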
Real-world context: Twitch’s AutoMod combines automated detection with human review queues. Comments flagged by the ML model are held before publication, and moderator actions propagate to all connected clients in near real time. This two-tier approach (pre-publication hold + post-publication retraction) covers both automated and human moderation workflows.
With the data plane and control plane defined, the next challenge is what happens when a single stream becomes so popular that it saturates the fan-out layer.
Fan-out strategies and the hot-stream problem#
Fan-out is the most expensive operation in a live comments system. Every comment must be delivered to every connected viewer on that stream. There are two fundamental strategies, and the right choice depends on the workload:
- Fan-out-on-write: When a comment is accepted, immediately push it to every subscriber. This is the default for live comments because latency matters and the write rate is low relative to the read rate. The cost scales with
comments_per_second Γ viewers_per_stream. - Fan-out-on-read: Subscribers poll or pull from a shared log at their own pace. This is cheaper for the server but adds latency and shifts complexity to the client. It works better for low-urgency feeds (like social media timelines) but is generally too slow for live broadcast.
For live comments, fan-out-on-write is the standard choice, but it creates a scaling cliff for hot streams. If a stream has 500,000 viewers and 200 comments per second, the fan-out layer must deliver $200 \times 500{,}000 = 10^8$ events per second for that single stream. That number can overwhelm gateway fleets and network bandwidth.
Detecting hot streams#
Detection must be proactive, not reactive. You instrument per-stream metrics:
- Publish rate: Comments per second per stream.
- Queue lag: How far behind fan-out workers are on the stream’s log partition.
- Gateway outbound pressure: Bytes queued for push per stream.
- Client drop rate: Percentage of events that gateways fail to deliver.
- Fan-out success rate: Percentage of connected clients that receive each event within the latency target.
Threshold-based alerts (e.g., “if queue lag > 500ms or publish rate > 300/s, flag as hot”) trigger automatic mitigation.
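The alerting rule can be sketched as a per-stream predicate over the metrics listed above (the threshold constants follow the example in the text; the metric names are illustrative):

```python
HOT_QUEUE_LAG_MS = 500       # "queue lag > 500ms" from the alert rule
HOT_PUBLISH_RATE = 300       # "publish rate > 300/s" from the alert rule
MIN_FANOUT_SUCCESS = 0.99    # assumed floor for delivery success

def is_hot(metrics: dict) -> bool:
    """Flag a stream as hot when any per-stream signal crosses its threshold."""
    return (
        metrics.get("queue_lag_ms", 0) > HOT_QUEUE_LAG_MS
        or metrics.get("publish_rate", 0) > HOT_PUBLISH_RATE
        or metrics.get("fanout_success_rate", 1.0) < MIN_FANOUT_SUCCESS
    )
```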
Degradation tactics#
Degradation is not failure. It is a designed mode. When a stream is flagged as hot, the system applies stream-scoped mitigations:
- Slow mode: Limit posting frequency to reduce comment volume at the source.
- Comment sampling: Show only a random subset (e.g., 1 in 5) of comments to viewers. The feed still feels lively, but fan-out cost drops by 80%.
- Reaction aggregation: Instead of delivering individual reaction events, batch them into periodic aggregate counters (e.g., every 2 seconds: “1,200 new 🔥 reactions”).
- Priority tiering: Drop low-value events (typing indicators, minor UI signals) while preserving comment and moderation delivery.
- Moderation always flows: Moderation events are never sampled or dropped, even during maximum degradation.
Degradation Tactics Overview
| Tactic | Trigger Condition | User Impact | Fan-Out Cost Reduction | Moderation Affected |
|---|---|---|---|---|
| Slow mode | Publish rate exceeds per-stream threshold (e.g., > 300/s) | Users can post less frequently | Reduces comment volume at the source | No |
| Comment sampling | Queue lag exceeds threshold (e.g., > 500ms) | Viewers see a random subset of comments | ~80% at a 1-in-5 sample rate | No |
| Reaction aggregation | High reaction volume on a hot stream | Periodic aggregate counters instead of individual reactions | Large: thousands of events collapse into one | No |
| Priority tiering | Gateway outbound pressure or rising client drop rate | Typing indicators and minor UI signals disappear | Moderate: low-value events dropped | No |
Pro tip: In your interview, explicitly map triggers to mitigations. Saying “if queue lag exceeds 500ms, we enable slow mode and comment sampling for that stream” is much stronger than vaguely saying “we scale up.”
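The sampling tactic, with its moderation exemption, can be sketched in one function (passing the RNG explicitly is a testability choice, not a requirement):

```python
import random

def should_deliver(event: dict, sample_rate: float, rng: random.Random) -> bool:
    """Under degradation, deliver a random fraction of comment events.

    Moderation events (here assumed to use a MOD_* type prefix) are exempt:
    they are delivered unconditionally, even at sample_rate 0.
    """
    if event["type"].startswith("MOD_"):
        return True  # moderation is never sampled
    return rng.random() < sample_rate
```

At `sample_rate=0.2`, roughly 1 in 5 comments survives, matching the ~80% fan-out reduction cited above.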
The hot-stream problem also motivates thinking about geographic distribution. If viewers are spread across continents, fan-out from a single region adds latency. Let us look at how geo-scaling addresses this.
Geo-redundancy and multi-region deployment#
For a global audience, a single-region deployment means that viewers far from the data center experience high latency. Comments posted by a user in Tokyo and fanned out from a US-East cluster arrive 150–200ms later for viewers in Asia. In live broadcast, that delay is noticeable and compounds with processing time.
A multi-region architecture deploys gateway fleets in each major region (US-East, EU-West, AP-Southeast, etc.) so that viewers connect to the nearest edge. The ingestion and sequencing layer can remain centralized (or use a primary region per stream) to preserve per-stream ordering without cross-region coordination overhead.
The flow becomes:
- A user in SΓ£o Paulo posts a comment to the nearest ingestion endpoint (US-East, for a stream homed there).
- The comment is sequenced and persisted in the primary region.
- The event is published to the durable log.
- Regional fan-out workers in each region consume the log (via cross-region replication or a multi-region Kafka setup like Confluent’s multi-region clusters) and deliver to local gateways.
- Viewers in SΓ£o Paulo, London, and Tokyo all receive the event from their local gateway, minimizing last-mile latency.
The trade-off is clear: centralized sequencing preserves strict ordering but adds one cross-region round trip for the posting user. For live comments, this is acceptable because most users are viewers (read path), and only a small fraction are posters (write path). Optimizing the read path by placing gateways close to viewers is the higher-leverage decision.
Historical note: YouTube’s live infrastructure uses a similar model where comment sequencing happens in a primary region, but delivery is distributed to edge points of presence worldwide. This design keeps ordering simple while minimizing viewer-perceived latency.
Even with geo-distribution, components crash. The next section examines how the system heals after failures without losing or corrupting the comment stream.
Reliability, duplicates, and replay: how the system heals after crashes#
Distributed broadcast systems crash. Gateways restart, fan-out workers roll during deployments, Kafka partitions rebalance. If your design requires perfect delivery over a single persistent connection, it will lose events. The architecture must assume failure and build recovery into every layer.
Why duplicates happen#
Duplicates arise from a fundamental tension: a worker sends an event to a gateway and then crashes before committing its consumer offset. On restart, it resumes from the last committed offset and re-sends the event. Committing before sending instead would trade duplicates for silent loss, which is worse.
Making duplicates harmless#
Deduplication on the client is straightforward because every event carries a stable identifier. For comments, the pair (stream_id, seq) is sufficient. For moderation events, a moderation_id or moderation_seq serves the same purpose. Clients maintain a small sliding window of recently seen identifiers (the last 500–1000 events) and silently discard repeats.
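A bounded seen-window can be sketched with an ordered dict that evicts its oldest entry once full (the class name and capacity default are illustrative):

```python
from collections import OrderedDict

class DedupeWindow:
    """Sliding window of recently seen (stream_id, seq) identifiers."""
    def __init__(self, capacity: int = 1000) -> None:
        self._seen: OrderedDict = OrderedDict()
        self._capacity = capacity

    def first_time(self, stream_id: str, seq: int) -> bool:
        key = (stream_id, seq)
        if key in self._seen:
            return False          # repeat: caller should discard the event
        self._seen[key] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest identifier
        return True
```

The window is deliberately bounded: anything old enough to fall out of it is also old enough that a redelivery would be caught by the seq-ordered rendering path instead.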
Replay from the durable log#
If a fan-out worker crashes, it resumes consuming from the last committed offset. Events between the last commit and the crash point are reprocessed and redelivered. This is the “spine” of reliability. The durable log guarantees that no accepted comment is permanently lost, even if fan-out infrastructure is temporarily unavailable.
If gateways lose their stream subscription mappings temporarily (e.g., during a rolling restart), clients reconnect and catch up using last_seen_seq. The history store provides the “truth” for any missed ranges that the log may have already expired.
This brings us to a concrete crash scenario to illustrate the healing process.
Walk-through 3: crash, duplicate delivery, and recovery#
A fan-out worker consumes comment seq=9001 from the log partition for stream S and broadcasts it to gateways G1, G2, and G3. Clients connected to G1 and G2 receive seq=9001. Before the worker commits its offset, it crashes.
What happens next:
The worker restarts and resumes from the last committed offset (which is before seq=9001). It reprocesses seq=9001 and broadcasts it again to G1, G2, and G3.
- Clients on G1 and G2 already have `seq=9001` in their seen-set. They discard the duplicate. No visual glitch occurs.
- Clients on G3 missed the first broadcast (perhaps G3 was briefly unreachable). They receive `seq=9001` for the first time. At-least-once delivery has improved reliability here, not degraded it.
- A mobile client that was disconnected during both broadcasts reconnects 30 seconds later with `last_seen_seq=8995`. It requests catch-up and receives `8996..current` from the history store, which includes `seq=9001`. No event is lost.
Real-world context: Apache Kafka’s consumer group protocol explicitly supports this offset-commit model. Production systems tune auto.commit.interval.ms and use manual offset commits to control exactly when progress is recorded, balancing duplicate risk against processing latency.

The net result is convergence: every client eventually renders every comment exactly once in the correct order, despite crashes, duplicates, and reconnects. The combination of at-least-once delivery, client dedupe, and sequence-based catch-up makes this possible.
With reliability covered, the final critical piece is making all of this observable. You cannot operate what you cannot measure.
Observability: metrics that prove “live” and “safe”#
In real-time broadcast, system health is measured in latency and lag, not just throughput. A system that processes 10 million events per second but delivers them 5 seconds late has failed its core promise. You need metrics that capture timeliness, completeness, and safety.
Core metrics#
- Submit-to-visible latency (p50, p95, p99): The time from when a user sends a comment to when it appears on viewers’ screens. Target: p95 < 500ms.
- Queue lag per stream partition: How far behind fan-out workers are in processing the durable log. Sustained lag > 1s indicates a hot stream or an under-provisioned worker pool.
- Fan-out success rate: The percentage of connected clients that receive each event within the latency target. A drop below 99% signals gateway pressure.
- Moderation propagation latency: The time from a moderator clicking “Delete” to the retract event arriving on clients. This is a top-tier SLO, not a secondary metric. Target: p95 < 300ms.
- Reconnect rate: How often clients are dropping and reconnecting. A spike indicates network issues or gateway instability.
- Drop/sampling rate per stream: When degradation is active, this metric tracks what fraction of events are being sampled or dropped. It makes degradation visible and auditable.
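As a rough sketch, per-stream latency tracking with a nearest-rank p95 might look like the following. A production system would use a metrics library (for example, Prometheus histograms); the sample values here are invented:

```python
import math
from collections import defaultdict

# Per-stream submit-to-visible latency samples, in milliseconds.
latencies = defaultdict(list)

def record(stream_id, submit_ms, visible_ms):
    latencies[stream_id].append(visible_ms - submit_ms)

def p95(stream_id):
    """Nearest-rank 95th percentile over this stream's samples."""
    samples = sorted(latencies[stream_id])
    idx = min(len(samples) - 1, math.ceil(0.95 * len(samples)) - 1)
    return samples[idx]

# One hot stream blows its budget while another stays healthy;
# a global aggregate would average the problem away.
for ms in [120, 140, 160, 180, 900]:
    record("hot-stream", 0, ms)
for ms in [90, 100, 110, 120, 130]:
    record("quiet-stream", 0, ms)
```

Here p95("hot-stream") is dominated by the 900ms outlier while p95("quiet-stream") stays well inside the 500ms target, which is precisely the signal a global percentile would hide.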
Per-stream breakdowns#
Global aggregates hide localized problems. A single hot stream can melt fan-out workers while thousands of other streams operate normally. Every metric above must be available as a per-stream breakdown, enabling operators to identify and mitigate problems at the stream level rather than applying global throttling.
Attention: If you cannot tell the interviewer how you will detect a hot stream before it saturates the system, you have not finished the design. Metrics are not an afterthought. They are a core component of the architecture.
With observability in place, the design is complete. Let us close with the trade-offs that separate a good answer from a great one.
Trade-offs you should say out loud#
Every architectural decision in this system involves a trade-off. A Staff-level answer acknowledges these explicitly and ties them to product and safety choices rather than pretending that perfect solutions exist.
Fidelity vs. stability. Sampling comments during hot-stream degradation reduces fidelity (not every viewer sees every comment) but preserves the feeling of liveness and prevents system-wide collapse. This is a product decision: users accept a curated subset of a 10,000-comment-per-second firehose more gracefully than a frozen feed.
Strictness vs. cost. Exactly-once delivery would require heavyweight coordination (distributed transactions or idempotency keys at every hop). At-least-once with client dedupe is dramatically cheaper and simpler, at the cost of occasional redundant processing. For live comments, this trade-off is unambiguous.
Safety vs. throughput. During overload, you may drop or sample comment events, but you never delay moderation retractions. This is one of the most important “senior” signals in an interview: prioritizing safety and correctness under stress over raw throughput.
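The safety-vs-throughput rule can be made concrete with a small sketch: moderation events bypass sampling entirely, and comment sampling is deterministic on the sequence number so every gateway makes the same keep/drop decision. The event shape and the 25% sample rate are assumptions for illustration:

```python
import hashlib

SAMPLE_RATE = 0.25   # fraction of comment events kept under degradation

def should_deliver(event, degraded):
    """Moderation events always pass; comments are hash-sampled when degraded."""
    if event["type"] == "moderation":
        return True                # control plane always wins
    if not degraded:
        return True
    # Deterministic sampling keyed on seq: the same comment is kept or
    # dropped everywhere, so gateways stay mutually consistent.
    h = hashlib.sha256(str(event["seq"]).encode()).digest()
    return (h[0] / 255.0) < SAMPLE_RATE

events = [{"type": "comment", "seq": s} for s in range(100)]
events.append({"type": "moderation", "seq": 42})

kept = [e for e in events if should_deliver(e, degraded=True)]
```

Under degradation roughly a quarter of the comments survive, but the moderation retraction is always in the kept set.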
Centralized sequencing vs. write latency. A single sequencing point per stream guarantees ordering but adds latency for geographically distant posters. The trade-off favors read-path optimization because the viewer-to-poster ratio is enormous.
Pro tip: In an interview, explicitly name two or three trade-offs and explain which side you chose and why. This signals that you have operated real systems where perfect solutions do not exist, and you are comfortable making and defending engineering judgments.
What a strong interview answer sounds like#
A complete answer fits in 5 to 7 minutes and follows a predictable structure: frame the broadcast nature, state guarantees, describe the pipeline, zoom into ordering, cover catch-up, explain the moderation control plane, discuss degradation, and close with metrics.
Here is a sample 60-second summary:
“Live comments are a broadcast system where one stream can have 100k+ viewers. I use WebSocket/SSE gateways for persistent connections, an ingestion service that authenticates, rate-limits, and assigns a per-stream sequence number for ordering, then persist the comment and publish it to a durable log like Kafka. Fan-out workers consume the log and broadcast to gateways. Clients render by sequence and dedupe for at-least-once delivery. Join and reconnect use cold-start history plus catch-up from last_seen_seq to fill gaps. Moderation is a separate control plane: deletes, holds, and shadow bans emit high-priority events that trigger immediate retractions. For hot streams, I detect queue lag and gateway pressure and degrade with slow mode, sampling, and reaction aggregation while always prioritizing moderation. I track p95 submit-to-visible latency, queue lag, fan-out success rate, moderation propagation latency, reconnect rate, and drop/sampling rate as my core SLOs.”

Final checklist#
- Broadcast framing with explicit guarantees (at-least-once, per-stream ordering).
- Server sequence numbers and why timestamps fail.
- Durable log plus replay and a history store keyed by (stream_id, seq).
- Cold start plus catch-up via last_seen_seq.
- Moderation as a separate, high-priority control plane.
- Hot-stream degradation with concrete triggers, mitigations, and metrics.
Conclusion#
The three ideas that matter most in live comments system design are the broadcast framing (one-to-many fan-out, not peer-to-peer chat), the per-stream sequence number as the single source of ordering truth, and the separation of moderation into a control plane that always overrides the data plane. Every other decision, from protocol selection to geo-distribution to degradation policy, flows from these foundational choices.
Looking ahead, the frontier for live comments is moving toward ML-powered real-time moderation that can hold or shadow-ban comments before they ever reach the fan-out layer, reducing the need for post-publication retractions. Edge computing and WebTransport (the successor to WebSockets built on QUIC) will further reduce delivery latency, and advances in conflict-free replicated data types (CRDTs) may eventually enable decentralized ordering without centralized sequencers.
Design for the broadcast. Protect the sequence. Let moderation win. The rest is engineering.