Chat System Design
Build chat as store + sync: persist ordered messages, deliver at-least-once with dedupe, reconcile devices with per-device cursors, pick the right group fan-out strategy, treat push as a wake-up hint, and measure latency, lag, and sync convergence.
A chat system design is one of the most concentrated system design interview problems because it forces you to solve persistent connections, low-latency fan-out, ordered and durable delivery, multi-device synchronization, and soft real-time features like presence and typing all in a single coherent architecture. The core challenge is building a delivery and sync system, not merely sending messages over a WebSocket, which means your design must reconcile state after disconnects, handle duplicates gracefully, and prove reliability through measurable guarantees.
Key takeaways
- At-least-once delivery with dedup: Promising exactly-once is impractical over unreliable networks, so you design for at-least-once semantics and layer in stable message IDs for idempotent deduplication on both server and client.
- Per-conversation ordering via sequence numbers: A server-side sequencer partitioned by conversation_id assigns monotonic sequence numbers that define the authoritative message order.
- Cursor-based multi-device sync: Each device tracks a per-conversation delivered cursor so that reconnect sync pulls only missing messages, making offline delivery reliable without depending on push notifications.
- Group fan-out strategy depends on group size: Fan-out-on-write works for small groups, but hot groups with tens of thousands of members require fan-out-on-read or hybrid approaches to avoid catastrophic write amplification.
- Observability proves correctness: Metrics like p95 send-to-delivered latency, offline sync lag, and queue backlog are what separate a paper design from one you could actually operate in production.
Most engineers open a chat app dozens of times a day and never think about the machinery underneath. Messages appear instantly, read receipts tick into place, and typing bubbles animate on cue. But behind that effortless surface sits one of the hardest real-time distributed systems you can design. If your architecture cannot survive a gateway crash, reconcile state across four devices, and deliver to a 50,000-member group without melting your database, it is not production-grade. This guide walks through a Staff-level, interview-ready chat system design that treats delivery as a durable pipeline, sync as the correctness mechanism, and every other feature as a carefully scoped layer on top.
Clarify requirements and set the right guarantees#
The fastest way to derail a chat design interview is to start listing features without defining what is durable, what is ordered, and what is best-effort. Interviewers notice immediately when a candidate skips the guarantee conversation, because the rest of the design will lack a foundation. Your opening should frame the system around messaging reliability and explicitly draw lines between hard guarantees and soft-state features.
For functional scope, assume direct messages and group chats (up to tens of thousands of members), message history with paginated reads, delivery and read receipts, presence indicators, typing indicators, and basic attachment metadata. Keep the payload as “text + metadata” for the first pass, with media offloaded to an object store like Amazon S3 or a similar blob service.
The guarantees that win interviews are concise and defensible:
- Delivery: at-least-once with deduplication
- Ordering: per-conversation, assigned server-side
- Presence and typing: eventually consistent
These commitments are realistic at global scale and set up clean trade-off discussions later. Promising exactly-once delivery sounds appealing but is impractical over unreliable networks and forces complexity that rarely survives production. Promising global total ordering across conversations is unnecessary and expensive.
Attention: Promising “real-time” without defining a sync mechanism is a common pitfall. WebSockets improve latency, but cursor-based sync defines correctness. If your system cannot recover after a disconnect, it is not reliable.
Treating reconnect sync as part of delivery rather than an optional add-on is what separates a robust design from a fragile one. With guarantees locked in, the next step is laying out the high-level architecture that enforces them.
High-level architecture and separation of concerns#
A scalable chat architecture separates the connection-heavy edge from the durable messaging core. This separation lets you optimize the gateway tier for millions of long-lived connections while keeping your messaging logic stateless and horizontally scalable. Conflating these two concerns is a recipe for systems that cannot scale one dimension without breaking the other.
The following diagram illustrates how the major components connect.
The durable queue (or append-only log, such as Apache Kafka) between “message accepted” and “message delivered to devices” is the linchpin of reliability. It lets you survive gateway crashes, worker restarts, and recipient offline periods without losing messages. Think of it as the internal delivery pipeline you can replay.
Presence and typing live in a separate soft-state path backed by ephemeral storage with TTLs. Notification services interface with platform push systems (APNS, FCM) but are never the source of truth for delivery. The source of truth is always message history plus per-device cursors.
Pro tip: Name your “source of truth” explicitly in interviews. Gateways are not your source of truth. Message history and per-device cursors define correctness.
With the architecture laid out, the next question is how to model the data so that the two dominant access patterns, appending messages and reading conversation history, stay fast at scale.
Core data model for conversation history and delivery tracking#
Chat workloads are dominated by two access patterns: “append a message to a conversation” and “read messages in a conversation by order.” That pushes the schema toward a partition key of conversation_id and a sort key that preserves order. This is why distributed NoSQL stores like Apache Cassandra, Amazon DynamoDB, or HBase are common choices: they excel at append-heavy, partitioned reads with predictable latency.
A minimal message row, sketched here as a Python dataclass, might look like the following (a representative sketch: the real table would live in a store like Cassandra or DynamoDB, and the `attachment_url` field is an illustrative assumption):
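```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MessageRow:
    conversation_id: str            # partition key: all messages of one conversation
    seq: int                        # sort key: server-assigned, monotonic per conversation
    message_id: str                 # stable client-generated ID used for deduplication
    sender_id: str
    sent_at_ms: int                 # server receive time; informational, not the ordering key
    body: str                       # text payload; media itself lives in object storage
    attachment_url: Optional[str] = None   # pointer into the blob store (metadata only)
```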
Beyond message storage, you need to model membership and per-user/per-device state. Delivery and read receipts at scale are deceptively expensive. Storing per-message, per-recipient status works for 1:1 chats, but in a 10,000-member group, every message would require 10,000 status writes. The scalable approach is a hybrid:
- Small conversations (1:1 and small groups): Optionally store per-message delivery state for richer UI.
- Large groups: Track per-user `last_read_seq` pointers so the UI derives read states relative to that pointer.
You also need a consistent way to deduplicate retries. A `message_id` generated client-side (or assigned by the server) stays stable across retries, and a dedup lookup keyed by (conversation_id, message_id) or (sender_id, client_msg_id) prevents the same message from being stored twice.
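For the membership and cursor state, a comparable sketch might be (row and field names such as `DeviceCursorRow` are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass

@dataclass
class MembershipRow:
    conversation_id: str
    user_id: str
    last_read_seq: int = 0          # per-user read cursor, shared across devices

@dataclass
class DeviceCursorRow:
    conversation_id: str
    user_id: str
    device_id: str
    last_delivered_seq: int = 0     # per-device delivery cursor

# Dedup lookup: keyed by (conversation_id, message_id) or (sender_id, client_msg_id);
# the row's existence means "this send was already accepted and stored".
DedupKey = tuple[str, str]
```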
Attention: Modeling read receipts as “status per message per user” for all group sizes will explode in storage and write load. Use cursor-based receipts as the scalable baseline.
The data model is only useful if the delivery flow that writes to it is designed for durability and ordering. Let’s trace through that flow next.
Message delivery flow with ordering and durability#
A strong chat answer describes the send path as a series of durable steps, not a single WebSocket write. Each step has a clear purpose and a clear persistence boundary. The flow begins on the client and ends when every recipient device has acknowledged receipt.
The send path step by step#
- The client sends a message containing a stable, client-generated `message_id` (or idempotency key: a unique identifier attached to a request that allows the server to recognize retries and return the same result without re-executing the operation, ensuring safe retries under network failures) along with the `conversation_id` and payload.
- The gateway forwards the request to the Messaging Service.
- The Messaging Service authenticates the sender, verifies conversation membership, and deduplicates the message by `message_id`.
- A per-conversation sequencer (a component or storage operation partitioned by `conversation_id` that assigns monotonically increasing sequence numbers, defining the authoritative order within that conversation) assigns the next `seq` for the conversation. This is the ordering point.
- The message is persisted to the message store.
- Delivery tasks are published to the durable queue for each recipient's devices (the full path is sketched in code below).
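A compressed sketch of this send path, with in-memory dicts standing in for the real store, dedup index, sequencer, and durable queue (names and structure are illustrative; auth and membership checks are omitted):

```python
import time
from collections import defaultdict

# In-memory stand-ins for the durable components.
message_store = {}                      # (conversation_id, seq) -> message dict
dedup_index = {}                        # (conversation_id, message_id) -> seq
seq_counters = defaultdict(int)         # conversation_id -> last assigned seq
delivery_queue = []                     # stand-in for a durable queue such as Kafka

def handle_send(conversation_id, message_id, sender_id, body, recipient_devices):
    """Accept a message: dedupe, assign seq, persist, then enqueue delivery tasks."""
    key = (conversation_id, message_id)
    if key in dedup_index:                          # client retry: return the prior result
        return dedup_index[key]

    seq_counters[conversation_id] += 1              # ordering point (per conversation)
    seq = seq_counters[conversation_id]

    message_store[(conversation_id, seq)] = {       # persist before any fan-out
        "message_id": message_id,
        "sender_id": sender_id,
        "body": body,
        "sent_at_ms": int(time.time() * 1000),
    }
    dedup_index[key] = seq

    for device_id in recipient_devices:             # one retryable task per device
        delivery_queue.append(
            {"conversation_id": conversation_id, "seq": seq, "device_id": device_id}
        )
    return seq

# A retry with the same message_id returns the same seq (idempotent send).
first = handle_send("conv-1", "msg-X", "alice", "hi", ["bob-phone", "bob-laptop"])
retry = handle_send("conv-1", "msg-X", "alice", "hi", ["bob-phone", "bob-laptop"])
assert first == retry == 1
```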
Ordering deserves extra attention. In most designs, ordering is assigned server-side at acceptance time. You can implement sequencing as a per-conversation counter in a strongly consistent store, a lightweight sequencer service partitioned by conversation_id, or optimistic assignment with conflict resolution (though this is harder to get right). In interviews, saying “a sequencer partitioned by conversation_id assigns seq” is usually sufficient.
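As one possible sequencer sketch, a per-conversation counter whose atomicity would come from a strongly consistent store's conditional write (the lock below only simulates that atomicity locally):

```python
import threading

_lock = threading.Lock()            # stands in for the store's atomic conditional write
_counters: dict[str, int] = {}      # conversation_id -> last assigned seq

def next_seq(conversation_id: str) -> int:
    """Assign the next per-conversation sequence number.

    In production this would be an atomic increment or conditional write in a
    strongly consistent store, or a sequencer service partitioned by
    conversation_id; the lock here only mimics that behavior in one process.
    """
    with _lock:
        seq = _counters.get(conversation_id, 0) + 1
        _counters[conversation_id] = seq
        return seq
```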
Real-world context: Systems like Slack and Discord use server-assigned ordering rather than client timestamps because client clocks are unreliable and can drift by seconds or more. Server-side sequencing is the industry standard for chat ordering.
At-least-once delivery means devices may see duplicates. You prevent user-visible duplicates by having devices deduplicate by message_id or seq and by having the server deduplicate repeated sends. The delivery pipeline itself is retryable: if pushing to a gateway fails, a worker retries later. The message is never lost because it was persisted before fan-out began.
The delivery flow creates implicit state transitions for every message. Making those transitions explicit through a state machine is what elevates a design from “it probably works” to “I can reason about correctness.”
Message delivery state machine#
A message system becomes interview-grade when you can describe its state transitions. Chat delivery is not one state. It is a life cycle with persisted evidence at each step so the system can recover after crashes.
The following diagram shows the state machine from the system’s perspective.
The key insight is separating message state (durable in the message store) from per-recipient/per-device progress (durable as cursors). The message is stored exactly once. Delivery receipts become cursor updates, not per-message status writes. For small 1:1 chats you may also store per-message status for richer UI, but the cursor model is the scalable baseline.
Persisted evidence at each step means:
- Accepted → Persisted: Message row written to store.
- Persisted → Enqueued: Delivery task written to durable queue.
- Enqueued → Delivered: Per-device `last_delivered_seq` cursor advanced (see the sketch below).
- Delivered → Read: Per-user `last_read_seq` cursor advanced.
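The Enqueued → Delivered transition, for example, reduces to a monotonic cursor update on ack (a sketch; `device_cursors` stands in for the real cursor table):

```python
device_cursors = {}   # (user_id, device_id, conversation_id) -> last_delivered_seq

def ack_delivered(user_id, device_id, conversation_id, acked_seq):
    """Advance the per-device delivered cursor; never move it backwards."""
    key = (user_id, device_id, conversation_id)
    current = device_cursors.get(key, 0)
    device_cursors[key] = max(current, acked_seq)   # monotonic: retries are no-ops
```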
Pro tip: Receipts are state reconciliation. If you cannot explain how cursors converge after disconnects, your “read/delivered” indicators will be wrong under real network conditions. Practice walking through the crash scenarios.
With the state machine defined, let’s see it in action with a concrete walk-through of the simplest case: both sender and recipient are online.
Walk-through 1: online-to-online message with receipts#
Consider a 1:1 chat where both users are online, each on two devices. The sender’s client submits message_id and payload to the gateway, which forwards it to the Messaging Service. The service deduplicates, assigns the next sequence number (say seq=1042), persists the message, and emits delivery tasks for each of the recipient’s connected devices.
Delivery workers push the message over the recipient devices’ gateway connections. Each device deduplicates by message_id or seq, renders the message, and sends an acknowledgment back. The acknowledgment carries the highest sequence number delivered for that conversation. The server advances each device’s last_delivered_seq cursor to 1042.
When the recipient opens the conversation view and scrolls past the message, the client sends a read receipt as last_read_seq=1042. The server updates the membership state. The sender’s UI derives receipt states by comparing:
- Delivered: recipient’s `last_delivered_seq` ≥ message `seq`
- Read: recipient’s `last_read_seq` ≥ message `seq`
This approach avoids per-message status writes and scales cleanly across conversation sizes.
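In code, the sender-side derivation is just two comparisons against the recipient's cursors (a sketch under the cursor model described here):

```python
def receipt_state(message_seq: int, last_delivered_seq: int, last_read_seq: int) -> str:
    """Derive the receipt shown next to a sent message from the recipient's cursors."""
    if last_read_seq >= message_seq:
        return "read"
    if last_delivered_seq >= message_seq:
        return "delivered"
    return "sent"

assert receipt_state(1042, last_delivered_seq=1042, last_read_seq=1042) == "read"
assert receipt_state(1042, last_delivered_seq=1042, last_read_seq=1000) == "delivered"
```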
Attention: Read receipts are best-effort and may arrive out of order. Using monotonic cursors makes reconciliation simpler because a cursor that says “I’ve read up to 1042” implicitly covers all messages with seq ≤ 1042.
The online-to-online case is the happy path. The design gets interesting when one side is offline, which is the most common scenario for mobile users.
Walk-through 2: online-to-offline message with notification and reconnect sync#
Now consider the recipient has no active gateway connection. The send path through persistence is identical: the message is deduped, sequenced at seq=1042, and stored. The difference is delivery. The server detects that the recipient’s devices have no live connections, so it cannot push over WebSocket.
The server records that the recipient’s devices have not advanced their delivered cursors and triggers a push notification request. The notification payload is minimal: conversation_id, sender display name, and a hint like “new message.” Stuffing the full message body into the push is risky because of platform payload size limits (4 KB for APNS), because it undermines end-to-end encryption designs, and because push delivery itself is unreliable.
When the recipient comes online (by tapping the notification, by app background refresh, or simply by opening the app later), the device establishes a WebSocket connection and sends its last_delivered_seq per conversation. The server streams missing messages in order. Only after the device acknowledges receipt does the server advance the delivered cursor and propagate delivery receipts back to the sender.
Real-world context: Apple’s APNS and Google’s FCM can delay, collapse, or silently drop notifications based on battery optimization, Doze mode, or user settings. WhatsApp and Signal treat push as a wake-up signal and rely entirely on cursor-based sync for actual message delivery.
This offline flow is also why multi-device sync needs its own deep treatment, because the real complexity is in tracking state per device across arbitrary connection patterns.
Multi-device sync and state tracking#
Multi-device sync is where chat designs either become robust or collapse into hand-wavy promises. Users log in from a phone, a tablet, and a laptop. They disconnect frequently. They expect history, receipts, and unread counts to converge correctly on every device. The mechanism that makes this work is per-device, cursor-based sync.
Cursor-based pull with server push as optimization#
The sync model is straightforward. When a device connects or reconnects, it sends its last_delivered_seq per conversation. The server responds with all messages having seq > last_delivered_seq in order. The device processes them, updates its local cursor, and acknowledges. This ensures offline delivery even if push notifications fail, gateways crash, or connections flap.
Server push (delivering messages in real time over the WebSocket) is an optimization for latency, not the correctness mechanism. If the push fails or the device was offline, the sync path catches it.
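A minimal sketch of that sync path, with a small in-memory dict standing in for the message store:

```python
# message_store: (conversation_id, seq) -> message dict; stand-in for the real store.
message_store = {
    ("conv-1", 1040): {"body": "already seen"},
    ("conv-1", 1041): {"body": "missed while offline"},
    ("conv-1", 1042): {"body": "also missed"},
}

def sync_since(conversation_id: str, last_delivered_seq: int, page_size: int = 100):
    """Return missing messages in order: everything with seq > last_delivered_seq."""
    missing = [
        (seq, msg)
        for (conv, seq), msg in message_store.items()
        if conv == conversation_id and seq > last_delivered_seq
    ]
    missing.sort(key=lambda pair: pair[0])   # stream in seq order
    return missing[:page_size]               # paginate; client acks, then asks for more

# Device reconnects having delivered through 1040; it pulls 1041 and 1042.
print([seq for seq, _ in sync_since("conv-1", last_delivered_seq=1040)])
```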
Reconciling delivered vs. read across devices#
A user-level last_read_seq is typically shared across devices. Reading a message on the phone marks it as read on the desktop too. But device-level last_delivered_seq cursors remain per device, because the laptop might not have synced yet. This separation lets your UI represent “delivered to at least one of the user’s devices” vs. “delivered to all devices,” depending on product requirements.
The following table summarizes the cursor semantics.
Cursor Types in Chat System Design

| Cursor Type | Scope | Purpose |
| --- | --- | --- |
| `last_delivered_seq` | Per device | Tracks the highest sequence number each device has acknowledged receiving |
| `last_read_seq` | Per user | Tracks the highest sequence number the user has actively viewed, shared across all devices |
| Unread count derivation | Per user per conversation | Computed as `conversation_max_seq` minus the user's `last_read_seq` |
Pro tip: Interviewers want to hear three things in your sync answer: per-device cursors, a reconnect sync flow that does not rely on notifications, and a clear reconciliation strategy for delivered vs. read across devices.
With individual delivery and sync handled, the next challenge is group chat, where a single message can fan out to thousands of recipients and naive approaches quickly become catastrophically expensive.
Group chat fan-out strategies#
Group chat changes the economics of delivery. In 1:1 conversations, you deliver to a handful of devices. In a 50,000-member group, a single message requires tens of thousands of deliveries. Naive per-recipient writes will crush your database, your queue, or both. This is where interviewers want you to talk about fan-out strategy explicitly.
The three primary strategies are fan-out-on-write, fan-out-on-read, and hybrid.
Message Delivery Strategy Comparison

| Strategy | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Fan-out-on-write | Expands the message into per-recipient inbox entries at send time | Fast reads; simple unread counts | Massive write amplification for large groups; storage cost scales with members × messages |
| Fan-out-on-read | Stores the message once in a central log; users fetch and compile their feed on demand | Cheap writes; storage-efficient | Complex unread count derivation; read path requires joining membership with message log |
| Hybrid | Stores once, pushes to online users, maintains per-user cursors, and selectively precomputes inboxes for small groups | Balances write and read costs; adapts to group size | More complex implementation; requires group-size-based routing logic |
There is no single correct choice. A strong answer ties the decision to product realities: group size distributions, online/offline ratios, unread count UX requirements, and infrastructure cost constraints.
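If you want to show the routing explicitly, a hedged sketch might look like this (the 256-member threshold is purely illustrative and would be tuned from real group-size data):

```python
SMALL_GROUP_LIMIT = 256   # illustrative threshold; tune from real group-size distributions

def choose_fanout_strategy(member_count: int) -> str:
    """Pick a delivery strategy based on conversation size."""
    if member_count <= 2:
        return "fan-out-on-write"          # 1:1: per-recipient state is cheap
    if member_count <= SMALL_GROUP_LIMIT:
        return "hybrid"                    # precompute inboxes, push to online devices
    return "fan-out-on-read"               # hot groups: store once, pull via cursors

assert choose_fanout_strategy(2) == "fan-out-on-write"
assert choose_fanout_strategy(50_000) == "fan-out-on-read"
```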
Walk-through: a “hot group” scenario and mitigations#
Imagine a group with 50,000 members and rapid message volume. With fan-out-on-write, every message becomes 50,000 inbox writes plus delivery tasks. Even if workers can push to online users quickly, the durable per-recipient writes dominate cost and latency. A burst of activity in this group creates a flood of writes that backs up the delivery queue and hammers the storage partitions.
With fan-out-on-read or hybrid, you store each message once in the conversation log and update a compact set of state: per-user last_read_seq and per-device delivered cursors. Online members receive pushed messages without permanently writing per-recipient inbox rows. Offline members rely on sync when they reconnect. Additional mitigations include:
- Splitting delivery from inbox computation so they can scale independently.
- Rate limiting message sends for extreme groups to prevent queue overload.
- Using tiered infrastructure for large channels (e.g., topic-based partitioning within a group).
- Caching recent log segments so reads avoid hitting cold storage.
Real-world context: Discord handles servers with hundreds of thousands of members using a channel-based fan-out-on-read model. Messages are stored once per channel, and clients pull history on demand. This is why Discord’s storage costs scale with messages, not with members × messages.
Fan-out complexity is closely tied to how notifications behave in group settings, which brings us to the critical distinction between push notifications and actual delivery.
Push notifications are a hint, not delivery#
Push notifications through APNS and FCM are important for user experience, but they are not a delivery mechanism you control. They can be delayed by minutes, collapsed into a single badge update, throttled by the OS, or disabled entirely by user settings. In interviews, treating push as “delivery” is a correctness flaw.
The correct model is: durable storage plus cursor-based sync is delivery. Push is a wake-up hint that reduces perceived latency.
You should mention platform constraints explicitly. APNS payloads are capped at 4 KB. FCM has similar limits and its own throttling behavior. You often want to use minimal, collapsible pushes that act as a sync trigger rather than trying to carry message content.
This framing also aligns with security. If you later discuss end-to-end encryption, you do not want servers placing plaintext message bodies in push payloads routed through third-party infrastructure. Even without E2EE, minimal pushes reduce sensitive data exposure and keep the authoritative data path within your own infrastructure.
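For illustration, a minimal push payload might carry only routing hints (field names are assumptions, not a platform API):

```python
import json

def build_push_payload(conversation_id: str, sender_display_name: str) -> bytes:
    """Build a minimal wake-up push: routing hints only, no message body."""
    payload = {
        "type": "new_message",            # tells the app to run cursor-based sync
        "conversation_id": conversation_id,
        "sender": sender_display_name,    # enough for a generic "New message" banner
    }
    body = json.dumps(payload).encode("utf-8")
    assert len(body) < 4096               # stay well under the 4 KB APNS cap
    return body
```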
Attention: If your offline delivery story stops working when notifications are delayed, dropped, or disabled, you have built a notification system, not a chat system. Always verify your design works with push completely absent.
With delivery mechanics fully covered, it is time to address the transient features that make chat feel alive: presence and typing indicators.
Presence and typing indicators as eventually consistent soft state#
Presence (“Alice is online”) and typing indicators (“Bob is typing…”) feel real-time, but they are fundamentally soft state. Networks flap, mobile devices enter sleep mode, and reconnects happen constantly. Building these features with strong consistency would require coordination that adds latency and cost far out of proportion to the informational value of the feature.
Presence systems typically rely on ephemeral storage like Redis with TTLs. Gateways update presence on connect and disconnect events and with periodic heartbeats (e.g., every 30 seconds). If a heartbeat is missed, the server marks the user “offline” after a timeout (e.g., 60 seconds). This means presence can be stale by up to a minute, which is acceptable for an informational signal.
Typing indicators are even more ephemeral. They are sent over the same real-time channel but are never persisted. A typing event has a short validity window (typically 3–5 seconds), and if a follow-up event does not arrive, the indicator disappears. Dropping a typing event harms nothing.
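A sketch of TTL-based presence, using an in-memory dict to mirror what a Redis key with a 60-second TTL would do (constants match the illustrative heartbeat and timeout above):

```python
import time

PRESENCE_TTL_SECONDS = 60                 # miss two 30 s heartbeats -> treated as offline
_presence: dict[str, float] = {}          # user_id -> expiry timestamp

def heartbeat(user_id: str) -> None:
    """Refresh presence on connect and on every periodic heartbeat."""
    _presence[user_id] = time.time() + PRESENCE_TTL_SECONDS

def is_online(user_id: str) -> bool:
    """Presence is soft state: stale entries simply expire, no cleanup coordination."""
    return _presence.get(user_id, 0.0) > time.time()

heartbeat("alice")
assert is_online("alice")
```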
The interview point is to articulate what you do when presence is wrong: you favor availability and low latency, and you accept brief inaccuracies because the feature is informational, not transactional.
Historical note: Early instant messaging systems like AIM and ICQ attempted near-real-time presence with centralized servers. As user bases grew to hundreds of millions, the industry shifted to eventually consistent presence with heartbeat-based TTLs, a pattern that persists across WhatsApp, Telegram, and Slack today.
Presence and typing round out the feature set, but a production system must also handle the inevitable: things break. The next section addresses failure modes and why designing for duplicates is not a compromise but a requirement.
Failure modes and crash safety#
A Staff-level chat design is honest about ambiguity. If a gateway pushes a message to a device and crashes before recording the acknowledgment, your system cannot know whether the device received it. The only safe action under at-least-once semantics is to retry. Retries create duplicates. Duplicates are not a bug. They are a consequence of building reliable systems over unreliable networks.
The solution is idempotency and deduplication at multiple layers:
- Server-side dedup: The messaging service rejects repeated sends from the client using the stable `message_id`.
- Device-side dedup: Devices ignore messages whose `message_id` or `seq` already exists in local storage.
- Monotonic cursors: `last_delivered_seq` and `last_read_seq` only move forward. Retrying a cursor update with the same or lower value is a no-op, making retries inherently safe (see the device-side sketch below).
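On the device side, the dedup-plus-monotonic-cursor combination can be sketched in a few lines (local storage is simulated with a set and a dict):

```python
seen_ids: set[str] = set()               # stand-in for the device's local message index
local_cursor = {"last_delivered_seq": 0}

def on_incoming(message_id: str, seq: int) -> bool:
    """Return True if the message is new and should be rendered."""
    is_new = message_id not in seen_ids
    if is_new:
        seen_ids.add(message_id)
    # The cursor only moves forward, so a replayed delivery is a harmless no-op.
    local_cursor["last_delivered_seq"] = max(local_cursor["last_delivered_seq"], seq)
    return is_new

assert on_incoming("msg-X", 1042) is True
assert on_incoming("msg-X", 1042) is False   # duplicate after a retry: ignored
```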
This is also where the durable delivery queue earns its keep: a crashed worker or gateway just means the task is retried, and the dedup layers absorb the repeat.
Walk-through 3: crash after delivery causes duplicate, solved by dedup#
A recipient device is online. A delivery worker pushes message (conversation_id=C, seq=1042, message_id=X) through the gateway. The device renders it. Before the device’s ack reaches the server, the gateway process crashes and the connection drops. The delivery pipeline times out waiting for the ack and schedules a retry.
The recipient reconnects and runs sync with last_delivered_seq=1041 (the ack for 1042 never made it). The server streams message 1042 again. The device sees message_id=X already in local storage, ignores the duplicate, updates its cursor to 1042, and sends the ack. The system converges: the server marks the device as delivered, and the sender sees “delivered” correctly.
Pro tip: In interviews, say the quiet part out loud: “I assume duplicates happen. I use stable IDs, deduplicate on both server and client, and I make all state updates monotonic so retries are always safe.” This signals production experience.
Crash safety ensures correctness, but you also need to prove the system is working in production. That requires observability and well-chosen SLOs.
Observability and SLOs#
Great designs are measurable. In interviews, naming specific metrics signals operational maturity and shows you think beyond the happy path. You want latency metrics for the send path, backlog metrics for delivery pipelines, and correctness indicators for sync and presence.
The most important user-perceived metric is p95 send-to-delivered latency for online recipients. This measures the time from when the sender taps “send” to when the recipient’s device acknowledges delivery. A typical SLO target might be p95 < 300ms for same-region delivery. A separate SLO for offline delivery, measured as reconnect sync lag (time from device reconnection to full catch-up), captures the other critical path.
Key metrics to instrument, broken down by region, gateway cluster, and conversation type (1:1 vs. group):
- Send-to-persisted latency (p50, p95, p99): How fast the server accepts and stores messages.
- Persisted-to-delivered latency (p50, p95): How fast the delivery pipeline reaches online devices.
- Delivery queue depth and consumer lag: Early warning for backpressure or stuck consumers.
- Offline sync lag: Time and message count between reconnection and full cursor convergence.
- Reconnect rate and connection churn: Indicates network health and gateway stability.
- Presence staleness percentage: Fraction of presence records that exceed the expected TTL, measured via sampling.
[Image: Observability dashboard for a chat system showing delivery latency, queue health, sync lag, and reconnect metrics. Panels: send-to-delivered latency histogram (p50/p95/p99), delivery queue depth with an alert threshold, offline sync lag gauge, and a heatmap of reconnect rates by region, each labeled with its SLO threshold.]
Attention: If you only measure “messages sent per second,” you will not catch the outages users actually feel. Throughput is a capacity metric. Latency, lag, and sync convergence are the reliability signals that matter.
Metrics tell you the system is healthy, but security ensures it is trustworthy. The final design layer covers authentication, authorization, and encryption posture.
Security and privacy#
Security in chat starts at the connection. Every gateway connection must be authenticated (typically via JWT tokens validated at connection establishment). Every message send must be authorized against conversation membership. Rate limiting on sends protects both infrastructure from abuse and users from spam. For group chats, membership changes must be enforced consistently so that removed users cannot send or receive messages after removal.
End-to-end encryption (E2EE) is frequently raised in interviews. You do not need to implement the Signal Protocol in your answer, but you should place E2EE correctly in the architecture:
- The server stores opaque ciphertext. It cannot read message bodies.
- Metadata remains visible to the server: `conversation_id`, `sender_id`, timestamps, message sizes.
- Push notification payloads stay minimal (no plaintext bodies).
- Key management and device enrollment shift complexity to clients.
E2EE does not fundamentally change the delivery pipeline design. Messages still flow through the same sequencer, the same durable queue, and the same sync path. What changes is what the server can inspect, which affects content moderation, search, and abuse detection.
Historical note: WhatsApp rolled out E2EE to over a billion users in 2016 using the Signal Protocol. The delivery infrastructure (store-and-forward with cursor-based sync) remained largely unchanged. The encryption layer was added as a transformation step on the client side, validating the principle that delivery architecture and encryption posture are separable concerns.
TLS in Transit Only vs. End-to-End Encryption (E2EE)

| Criteria | TLS in Transit Only | End-to-End Encryption (E2EE) |
| --- | --- | --- |
| Who Can Read Message Content | Server can decrypt and read content | Only sender and recipient hold decryption keys |
| Server-Side Search & Moderation | Possible; server accesses decrypted data | Not possible without client cooperation |
| Key Management Complexity | Minimal; automatically handled by the TLS protocol | Significant; requires per-device key exchange and rotation |
| Push Notification Content | Can include message previews or snippets | Limited to metadata only (e.g., sender name, generic alert) |
| Impact on Delivery Pipeline | None; server manages encryption/decryption | None; encryption is a client-side transform |
With security addressed, you have a complete design. The final step is knowing how to present it concisely and confidently.
What a strong interview answer sounds like#
A strong answer is structured and decisive. You start with guarantees, present the architecture, and then zoom into the hard parts: ordering, durability, multi-device sync, group fan-out, and failure recovery. You keep presence and typing in their lane as eventually consistent. You finish with metrics and trade-offs.
Here is a 60-second outline that interviewers recognize as coming from someone who has built messaging systems:
“I’ll build chat around durable message history plus cursor-based sync. Clients connect via WebSockets to stateless gateways, which route to a messaging service that authenticates, assigns per-conversation sequence numbers for ordering, persists messages, and publishes delivery tasks. Delivery is at-least-once, so I include stable message IDs for server and client dedup. Online devices get pushed immediately. Offline devices get a minimal push notification as a wake-up hint, then catch up via sync from per-device cursors. Read receipts are cursor updates on last_read_seq. Presence and typing are eventually consistent via TTL caches. For groups, I pick fan-out strategy based on size: hybrid or fan-out-on-read for hot groups to avoid write amplification. I’ll instrument p95 send-to-delivered, queue lag, offline sync lag, reconnect rates, and presence staleness to prove reliability.”
The checklist to hit every major point:
- Define guarantees: at-least-once delivery, per-conversation ordering, eventual consistency for presence
- Persist before fan-out: delivery is a retryable pipeline, not a single socket write
- Stable IDs and monotonic cursors: dedup absorbs retries, cursors guarantee convergence
- Multi-device sync: per-device cursors with reconnect flow independent of push
- Group fan-out: compare strategies, call out hot group mitigations
- Metrics: p95 latency, queue lag, sync convergence, reconnect rates
Back-of-envelope estimation#
Interviewers sometimes ask for scale numbers to verify you understand the infrastructure implications. Here is a quick estimation for a large-scale chat system with 500 million DAUs.
Assume each DAU sends an average of 40 messages per day. That gives:
$$\text{Messages per second} = \frac{500 \times 10^6 \times 40}{86400} \approx 231{,}000 \text{ msg/s}$$
If each message averages 200 bytes of body plus 300 bytes of metadata (500 bytes total), daily storage is:
$$\text{Daily storage} = 500 \times 10^6 \times 40 \times 500 \text{ bytes} = 10 \text{ TB/day}$$
With a 5-year retention policy, you are looking at roughly 18 PB of message data, which is well within the range where distributed stores like Cassandra or DynamoDB with tiered storage are necessary. These numbers also justify why fan-out-on-write for large groups is dangerous: a single message in a 50,000-member group at 500 bytes per inbox entry costs 25 MB of writes, multiplied by message volume.
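The same estimates, checked in a few lines of Python with round numbers:

```python
dau = 500e6
messages_per_user_per_day = 40
bytes_per_message = 500                       # ~200 B body + ~300 B metadata

messages_per_second = dau * messages_per_user_per_day / 86_400
daily_storage_tb = dau * messages_per_user_per_day * bytes_per_message / 1e12
five_year_pb = daily_storage_tb * 365 * 5 / 1000
hot_group_write_mb = 50_000 * bytes_per_message / 1e6

print(f"{messages_per_second:,.0f} msg/s")    # ~231,000 msg/s
print(f"{daily_storage_tb:.0f} TB/day")       # ~10 TB/day
print(f"{five_year_pb:.1f} PB over 5 years")  # ~18 PB
print(f"{hot_group_write_mb:.0f} MB of inbox writes per hot-group message")  # ~25 MB
```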
Real-world context: WhatsApp reportedly handles over 100 billion messages per day across 2+ billion users. At that scale, every byte of per-message overhead and every unnecessary per-recipient write translates directly into infrastructure cost measured in millions of dollars annually.
These estimates ground your architectural decisions in real numbers rather than abstract arguments.
Conclusion#
Chat system design is fundamentally a delivery and sync problem wrapped in a real-time interface. The three pillars that hold the design together are durable, ordered message storage as the single source of truth, cursor-based multi-device sync that works independently of push notifications or gateway availability, and a clear fan-out strategy that adapts to group size rather than assuming one model fits all. Every other feature, from presence to typing indicators to read receipts, layers on top of these foundations with intentionally weaker consistency guarantees.
The future of chat architecture is evolving in several directions. Edge computing is pushing gateway logic closer to users for sub-100ms delivery. AI-powered moderation and summarization require new metadata pipelines alongside E2EE constraints. Interoperability mandates like the EU’s Digital Markets Act are forcing large platforms to define standardized message exchange protocols, which will reshape how federation and cross-platform delivery work.
Build your chat system like something you could operate at 3 AM during an incident: clear guarantees, explicit state, failure recovery by design, and dashboards that tell you exactly where the problem is.