Webhooks are one of the core mechanisms behind the modern, event-driven internet. They let one service (the producer) notify another service (the subscriber) when something changes, without the subscriber continuously asking, “Has anything changed?” A payment processor notifying your app that an invoice is paid, or a Git provider notifying CI that a new commit landed—these are classic webhook stories.
In interviews, webhooks are deceptive: the HTTP POST is the easy part. The hard part is designing a system that pushes events to thousands (or hundreds of thousands) of endpoints you don’t control, where slow responses, intermittent failures, and ambiguous delivery outcomes are normal. If you can reason clearly about at-least-once delivery, retries, idempotency, ordering, and multi-tenant fairness, you’ll signal strong distributed systems instincts.
This blog is structured like a high-quality System Design interview answer: we’ll clarify requirements, propose an architecture, then go deep on correctness, delivery semantics, retry taxonomies, state machines, bottlenecks, and operational realities. Along the way, you’ll see walkthroughs that sound like how strong candidates narrate systems under pressure.
Interviewer tip: A webhook system is a reliability product. Treat “delivery” as a lifecycle with persisted state, not a single network call.
## Clarify requirements the interviewer actually cares about
A webhook design conversation starts with requirements, but interviews reward the “why” behind them. You want to show you understand which decisions are forced by the environment: untrusted networks, untrusted endpoints, bursty fan-out, and the impossibility of exactly-once delivery without extreme complexity.
Functionally, you need a way to register subscriptions, publish events, and deliver payloads. But that’s not the differentiator. The differentiator is what you promise: at-least-once delivery, retry windows, ordering scope, and operational visibility for subscribers. You also need to define what “success” means: does a 2xx response mean “received” or “processed”? Most webhook platforms treat it as “accepted by the subscriber endpoint,” which is the only thing you can measure.
Non-functionally, webhooks are dominated by correctness (don’t lose events), scalability (fan-out), and isolation (one noisy subscriber must not harm others). You should also bring security into the requirements early, because it changes payload format, headers, and the subscriber contract.
| Requirement dimension | What to clarify | Typical, interview-friendly choice |
| --- | --- | --- |
| Delivery semantics | At-least-once vs exactly-once | At-least-once with subscriber idempotency |
| Success signal | What counts as “delivered” | 2xx from subscriber within timeout |
| Retry window | How long you retry | 24–72 hours with exponential backoff + jitter |
| Ordering | Global? Per subscriber? Per entity? | Per subscriber + per key/partition where needed |
| Payload security | Integrity + authenticity | HMAC signature + timestamp + replay window |
| Observability | What users can see | Delivery logs, attempts, DLQ, replay tooling |
Common pitfall: Saying “exactly once” without describing the coordination cost. In webhooks, the realistic move is at-least-once plus idempotency keys.
Summary:

- Prefer at-least-once delivery and design the subscriber contract around idempotency.
- Define delivery success as a 2xx response within a strict timeout.
- Be explicit about ordering scope; global ordering is a trap.
- Treat observability and security as first-class requirements, not add-ons.
## High-level architecture: build a delivery pipeline, not a request handler
The most interview-robust webhook architecture is a decoupled pipeline: producers emit events, a dispatcher fans them out into delivery tasks, a durable queue buffers those tasks, and workers perform HTTP delivery with retries and persistence. The queue is not “just a queue”; it’s the shock absorber that turns unpredictable subscriber behavior into a system you can scale and reason about.
A clean way to present this is to split responsibilities into services with single, defensible jobs. The dispatcher is responsible for turning “an event happened” into “these subscribers should receive it,” and for creating delivery tasks that can be retried safely. Workers are responsible for execution, timeouts, classification of responses (2xx vs 4xx vs 5xx vs 429), and transitioning persisted delivery state. Subscription management is its own surface area, because it involves validation, secrets, and per-tenant policies.
If you’re interviewing at Staff level, add the “control plane vs data plane” framing. Subscription creation, secret rotation, quota configuration, and replay tooling live in the control plane. High-throughput delivery lives in the data plane. This lets you scale and secure them differently.
| Component | Responsibility | Key scaling concerns |
| --- | --- | --- |
| Producer integration | Emit event + metadata reliably | Dual-write correctness, backpressure to producer |
| Dispatcher | Lookup subscriptions and generate delivery tasks | Fan-out throughput, caching, partitioning |
| Durable queue/stream | Buffer tasks, support retries and delayed delivery | Partition hot spots, storage cost, retention |
| Delivery workers | Execute HTTP POST with timeout, classify, persist results | Concurrency control, jitter, idempotency |
| Subscription store | Subscriber endpoints, filters, secrets, policies | Sharding, caching, consistency model |
| Delivery store | Delivery records + attempts + state machine | Write amplification, indexes for lookup/replay |
Interviewer tip: Say out loud what is persisted and why. “We persist events and delivery tasks so we can survive crashes, retries, and ambiguous outcomes.”
Summary:

- Use a queue/stream as the backbone that decouples producers from delivery.
- Separate dispatcher fan-out from worker execution.
- Treat subscription management as a control-plane surface.
- Persist delivery state and attempt history to make retries and replay safe.
## Transactional outbox and event publishing correctness
Correct event publishing is where many webhook designs quietly fail. The common mistake is dual-writing: a producer service updates its database and separately publishes an event to the webhook system (or directly to a queue). If the service crashes after committing the DB write but before publishing the event, you lose the webhook forever. If it publishes the event but the DB transaction later rolls back, subscribers receive a webhook for something that never happened. In interviews, this is a high-signal moment: recognizing dual-write danger is one of the cleanest correctness markers.
The fix is to make event creation part of the same transactional boundary as the business state change. The most common approach is the transactional outbox pattern: write the domain change and an “outbox row” in the same database transaction, then have a separate publisher read outbox rows and publish them to your queue/stream. This gives you atomicity at the database boundary and pushes the “publish” action into a retryable, idempotent process.
A close cousin is CDC (Change Data Capture): instead of an outbox table, you stream DB changes from the transaction log into a messaging system. CDC can reduce application complexity, but it adds operational complexity and can be harder to evolve or filter. In interviews, you don’t need to pick one “forever”; you need to show you understand the trade-offs and failure modes.
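To make the outbox concrete, here is a minimal sketch with sqlite3 standing in for the production database; the table and column names are illustrative, and a real publisher would enqueue to your stream rather than print.

```python
# Minimal transactional-outbox sketch (illustrative names; sqlite3 stands in
# for the production database). The domain change and the outbox row commit
# in the same transaction; a separate publisher drains the outbox.
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoices (id TEXT PRIMARY KEY, status TEXT);
CREATE TABLE outbox (
    event_id TEXT PRIMARY KEY,
    event_type TEXT,
    payload TEXT,
    published INTEGER DEFAULT 0
);
""")

def mark_invoice_paid(invoice_id: str) -> None:
    """Commit the state change and the event intent atomically."""
    with conn:  # one transaction: both rows commit, or neither does
        conn.execute(
            "INSERT OR REPLACE INTO invoices (id, status) VALUES (?, 'paid')",
            (invoice_id,),
        )
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "invoice.paid", json.dumps({"invoice_id": invoice_id})),
        )

def publish_pending(publish) -> None:
    """Idempotent drain loop: publish, then mark published. Re-running is safe
    as long as the downstream publish dedupes on event_id."""
    rows = conn.execute(
        "SELECT event_id, event_type, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_id, event_type, json.loads(payload))  # e.g., enqueue to a stream
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))

mark_invoice_paid("inv_123")
publish_pending(lambda eid, etype, body: print("publish", eid, etype, body))
```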
| Approach | How it works | Pros | Cons | Best for |
| --- | --- | --- | --- | --- |
| Direct publish (dual-write) | App writes DB, then publishes event | Simple to build | Loses events or emits phantom events under crashes | Prototypes only |
| Transactional outbox | App writes DB + outbox in one txn; publisher drains outbox | Strong correctness; clear audit trail | Extra table + publisher; can add write load | Most webhook systems |
| CDC | DB log → stream → consumers | Low app coupling; high throughput | Operational complexity; schema evolution nuances | Mature infra teams |
What interviewers listen for on correctness: “I don’t trust dual-write. I need the event emission to be derived from the committed state change, via an outbox or CDC, and publishing must be idempotent.”
Summary:

- Dual-writing is dangerous because crashes create lost or phantom events.
- Transactional outbox makes event intent part of the same commit as state change.
- CDC is powerful but operationally heavier.
- Publishing must be idempotent and replayable regardless of approach.
## Data model and APIs: define the subscriber contract
A webhook platform is partially an API product: subscribers need a stable contract for how deliveries behave, how they authenticate messages, and how they can debug failures. In interviews, showing a crisp subscriber contract is more valuable than listing endpoints. Your contract should cover: subscription registration, event schemas/versioning, signature verification, idempotency keys, and what response codes mean.
On the data model side, you typically have four core records: subscriptions, events (or event envelopes), delivery tasks, and delivery attempts. The delivery task is the unit of work that gets retried. Attempts are the history. Keeping attempts separate from tasks prevents you from overwriting what happened during retries and gives you a clean audit trail.
A strong design also includes versioned event types and stable payload envelopes. The envelope should carry metadata like event_id, event_type, created_at, tenant_id, and a delivery_id or idempotency_key. That idempotency key is what lets subscribers safely handle duplicates—which will happen under at-least-once semantics.
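As a purely illustrative example, an envelope under these assumptions might look like this; the field names mirror the data model summarized below.

```python
# Illustrative event envelope (field names are an assumption consistent with
# the data model described here). The idempotency key is stable per
# (event, subscription), so redeliveries carry the same value.
import json
import uuid
from datetime import datetime, timezone

envelope = {
    "event_id": str(uuid.uuid4()),
    "event_type": "invoice.paid",
    "schema_version": "2024-06-01",
    "tenant_id": "acct_42",
    "created_at": datetime.now(timezone.utc).isoformat(),
    "idempotency_key": "evt_9f2c:sub_7a1d",  # stable per (event, subscription)
    "payload": {"invoice_id": "inv_123", "amount_cents": 4999, "currency": "USD"},
}
print(json.dumps(envelope, indent=2))
```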
| Record | Key fields | Purpose |
| --- | --- | --- |
| Subscription | subscription_id, tenant_id, url, event_types, secret_id, status, policy limits | Control plane source of truth |
| Event envelope | event_id, event_type, tenant_id, payload, created_at, schema_version | Stable metadata + payload |
| Delivery task | delivery_id, event_id, subscription_id, state, next_attempt_at, attempt_count | Retryable unit of work |
| Delivery attempt | attempt_id, delivery_id, started_at, latency_ms, status_code, error, response_hash | Audit + debugging + analytics |
Common pitfall: Forgetting the subscriber experience. If the user can’t see “why delivery failed” and “how to replay,” your system will generate support tickets forever.
Summary:

- Define a payload envelope with stable metadata and versioning.
- Model deliveries as tasks with attempt history, not a single “send.”
- Include an idempotency key in headers or payload for duplicate handling.
- Build replay and debugging into the contract, not as a later feature.
## Delivery semantics: at-least-once, idempotency, and ordering boundaries
The core interview move is to accept reality: you cannot rely on the network for exactly-once delivery. Even if your worker makes a successful POST, it can crash before persisting the success. From your system’s perspective, the delivery outcome is ambiguous, so you must retry. That is at-least-once delivery, and it is the correct baseline for webhooks.
Once you say “at-least-once,” you must immediately say “idempotency.” The system should provide a stable idempotency key per (event, subscription). The subscriber stores processed keys (or derives idempotency from event_id + subscription_id) and rejects duplicates safely. In a Staff-level answer, you also mention that subscriber-side dedupe storage needs a retention policy: keep keys for at least the maximum retry window plus clock skew.
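A minimal sketch of the subscriber side of that contract, assuming an in-memory dict stands in for a durable dedupe store whose retention exceeds the retry window:

```python
# Subscriber-side dedupe sketch: an in-memory dict stands in for a durable
# store whose retention must exceed the producer's retry window.
import time

RETENTION_SECONDS = 72 * 3600 + 3600   # retry window plus generous clock skew
_seen: dict[str, float] = {}           # idempotency_key -> first-seen timestamp

def handle_delivery(idempotency_key: str, payload: dict) -> int:
    """Return the HTTP status the subscriber should send back."""
    now = time.time()
    # Evict expired keys so the dedupe store stays bounded.
    for key, seen_at in list(_seen.items()):
        if now - seen_at > RETENTION_SECONDS:
            del _seen[key]
    if idempotency_key in _seen:
        return 200                     # duplicate: acknowledge, apply no side effects
    apply_side_effects(payload)        # the real business logic
    _seen[idempotency_key] = now
    return 200

def apply_side_effects(payload: dict) -> None:
    print("processing", payload)

print(handle_delivery("evt_9f2c:sub_7a1d", {"invoice_id": "inv_123"}))  # processes
print(handle_delivery("evt_9f2c:sub_7a1d", {"invoice_id": "inv_123"}))  # deduped
```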
Ordering is the next boundary to get right. Global ordering is not realistic, and most subscribers don’t need it. What you can offer is ordering per key: route events for the same entity (order_id, user_id) to the same partition and process with a per-key concurrency of one. This is a design choice with throughput consequences, so frame it as “only enforce ordering where needed,” often as an optional subscription setting.
| Concern | Guarantee | Mechanism |
| --- | --- | --- |
| Delivery | At-least-once | Persist tasks; retry on ambiguous outcomes |
| Deduplication | Subscriber responsibility (with help) | Provide idempotency key in headers/payload |
| Ordering | Per key / per subscription lane | Partition by (subscription_id, ordering_key) |
| Exactly-once | Not offered | Too costly; requires heavy coordination |
What great answers sound like: “I will guarantee at-least-once delivery and make duplicates safe through an idempotency key. If ordering is required, I’ll scope it to an ordering key and serialize deliveries per key.”
Summary:

- At-least-once is the practical standard; ambiguous outcomes force retries.
- Provide a stable idempotency key per event per subscriber.
- Offer ordering only within a defined scope (key/partition), not globally.
- Make ordering a deliberate, opt-in trade-off when possible.
## Delivery state machine and delivery receipts
If you want your webhook system to be debuggable and correct under retries, you need an explicit delivery state machine. “We retry on failure” is not enough, because you need to define what’s persisted at each step, what transitions are allowed, and how you avoid double-sending under concurrency. A state machine is also how you talk clearly about edge cases like worker crashes, long subscriber outages, and manual replay.
A helpful mental model: each delivery is a persisted object with a next_attempt_at and attempt_count. Workers claim a delivery for an attempt, transition it into an in-flight state, send the POST, then transition to DELIVERED or schedule a retry. If your system supports “delivery receipts,” you can also record a subscriber-provided receipt token or a processing confirmation, but most webhook systems treat 2xx as the only observable receipt.
The key is to persist state transitions atomically with attempt records. If you persist “ATTEMPTING” and then crash, you need a reaper to move stuck attempts back to RETRY after a lease timeout. This is the difference between “works in happy path” and “survives production.”
| State | Persisted data | Transitions |
| --- | --- | --- |
| PENDING | delivery_id, next_attempt_at=now, attempt_count=0 | → ATTEMPTING |
| ATTEMPTING | attempt_id, worker_id, lease_expires_at | → DELIVERED, RETRY, DLQ |
| RETRY | next_attempt_at (backoff+jitter), attempt_count++ | → ATTEMPTING |
| DELIVERED | delivered_at, last_status_code, response metadata | terminal (or → REPLAYING) |
| DLQ | failure_reason, last_error, exhausted_at | → REPLAYING (manual/automated) |
| REPLAYING | replay_source, operator_id/system_id | → ATTEMPTING |
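A compact sketch of how a worker might claim, transition, and reap deliveries under this state machine; in-memory dicts stand in for the delivery store, and the lease length and backoff values are assumptions.

```python
# Lease-based claiming and state transitions (in-memory dicts stand in for the
# delivery store; state and field names mirror the table above).
import time
import uuid

LEASE_SECONDS = 30

deliveries = {
    "dlv_1": {"state": "PENDING", "next_attempt_at": 0.0, "attempt_count": 0,
              "lease_expires_at": None},
}

def claim(worker_id: str):
    """Move one due PENDING/RETRY delivery to ATTEMPTING and take a lease."""
    now = time.time()
    for delivery_id, d in deliveries.items():
        if d["state"] in ("PENDING", "RETRY") and d["next_attempt_at"] <= now:
            d.update(state="ATTEMPTING", worker_id=worker_id,
                     lease_expires_at=now + LEASE_SECONDS)
            return delivery_id
    return None

def finish(delivery_id: str, status_code: int) -> None:
    """Persist the outcome: DELIVERED on 2xx, otherwise schedule a retry."""
    d = deliveries[delivery_id]
    d["attempt_count"] += 1
    if 200 <= status_code < 300:
        d.update(state="DELIVERED", delivered_at=time.time())
    else:
        d.update(state="RETRY", next_attempt_at=time.time() + 2 ** d["attempt_count"])

def reap() -> None:
    """Return stuck ATTEMPTING deliveries to RETRY after the lease expires."""
    now = time.time()
    for d in deliveries.values():
        if d["state"] == "ATTEMPTING" and d["lease_expires_at"] < now:
            d.update(state="RETRY", next_attempt_at=now)

claimed = claim("worker-" + uuid.uuid4().hex[:6])
if claimed:
    finish(claimed, 204)
print(deliveries)
```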
Interviewer tip: The state machine is not paperwork. It’s how you prevent duplicate work under concurrency, and how you explain crash recovery and replay.
### Walkthrough: subscriber times out → retries with jitter → eventual success
A payment event is created and a delivery task starts in PENDING with next_attempt_at=now. A worker claims it, transitions to ATTEMPTING, and starts an HTTP POST with a strict timeout (say five seconds). The subscriber endpoint doesn’t respond in time, so the worker classifies it as a transient failure (timeout is usually treated like 5xx).
The worker writes an attempt record with the timeout error and moves the delivery to RETRY, computing next_attempt_at using exponential backoff plus jitter. After several attempts, the subscriber recovers and responds with a 204 within the timeout. The worker persists DELIVERED with delivered_at and the final status code. Observability shows the full attempt chain, which is exactly what subscriber support needs.
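A sketch of how that retry schedule might be computed, assuming full jitter with an illustrative base, cap, and 72-hour window:

```python
# Backoff schedule sketch: exponential growth with "full jitter" and a cap,
# bounded by a retry window (base/cap/window values here are assumptions).
import random

BASE_SECONDS = 30
CAP_SECONDS = 6 * 3600           # never wait more than 6 hours between attempts
MAX_WINDOW_SECONDS = 72 * 3600   # give up (move to DLQ) after 72 hours

def next_delay(attempt_count: int) -> float:
    """Full jitter: pick uniformly in [0, min(cap, base * 2^attempts)]."""
    ceiling = min(CAP_SECONDS, BASE_SECONDS * (2 ** attempt_count))
    return random.uniform(0, ceiling)

elapsed = 0.0
for attempt in range(20):
    delay = next_delay(attempt)
    elapsed += delay
    if elapsed > MAX_WINDOW_SECONDS:
        print(f"attempt {attempt}: retry window exhausted -> DLQ")
        break
    print(f"attempt {attempt}: retry in {delay / 60:.1f} min")
```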
Common pitfall: Retrying immediately after timeouts. Without backoff and jitter, you can amplify an outage and create a thundering herd against already-degraded subscriber infrastructure.
Summary:

- Persist explicit states and attempt history for correctness and debuggability.
- Use leases for in-flight attempts and a reaper for stuck ATTEMPTING tasks.
- Treat 2xx as delivered; schedule retries for transient failures.
- DLQ is a state, not a trash can—design for replay.
## Retry taxonomy: 4xx vs 5xx vs 429, and what you do about each
Retries are not “on/off.” A strong interview answer classifies failures by what they mean, and maps them to policy. Timeouts and 5xx imply transient server or network issues, so you retry with exponential backoff and jitter. 429 implies rate limiting, so you retry but respect Retry-After if present, and you should consider per-subscriber token-bucket limits to avoid repeated throttling.
4xx is nuanced. Some 4xx are permanent (404, 410, many 401/403 depending on your auth model), suggesting a disabled subscription or a terminal failure. Others can be transient because of subscriber deployments (e.g., a temporary 404 during a rollout), but that’s a product decision: most systems treat 4xx as non-retryable except 408/409/425/429, or allow subscriber-configurable policies.
The interview-level insight: retry policies are part of multi-tenant safety. If you keep hammering a broken endpoint forever, you waste capacity and harm everyone. If you give up too early, you break reliability expectations. A good design has a bounded retry window, a DLQ, and clear subscriber-facing visibility.
| Response | Classification | Action |
| --- | --- | --- |
| Timeout / connection error | Transient | Retry with exponential backoff + jitter |
| 5xx | Transient | Retry; consider circuit breaker after repeated failures |
| 429 | Backpressure from subscriber | Retry; respect Retry-After; reduce per-subscriber concurrency |
| 401/403 | Often permanent | Move to DLQ or disable subscription; alert subscriber |
| 404/410 | Likely permanent | Disable subscription or DLQ with reason |
| 2xx | Success | Mark DELIVERED |
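A sketch of that classification as code; the exact set of retryable 4xx codes and the Retry-After handling are product decisions rather than a standard.

```python
# Classification sketch mapping response outcomes to retry policy; treating
# 408/409/425/429 as the only retryable 4xx codes is an assumption.
RETRYABLE_4XX = {408, 409, 425, 429}

def classify(status_code, retry_after=None):
    """Return (action, delay_hint_seconds). status_code=None means a timeout
    or connection error."""
    if status_code is None or status_code >= 500:
        return ("RETRY", None)                   # transient: backoff + jitter
    if status_code == 429:
        hint = float(retry_after) if retry_after and retry_after.isdigit() else None
        return ("RETRY", hint)                   # respect subscriber backpressure
    if 200 <= status_code < 300:
        return ("DELIVERED", None)
    if status_code in RETRYABLE_4XX:
        return ("RETRY", None)
    return ("DLQ", None)                         # permanent 4xx: stop, record reason

assert classify(204) == ("DELIVERED", None)
assert classify(None) == ("RETRY", None)
assert classify(429, "120") == ("RETRY", 120.0)
assert classify(410) == ("DLQ", None)
```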
What interviewers listen for: “I treat retries as a policy problem. I distinguish transient failures from permanent ones, respect subscriber backpressure, and I bound retry cost with a DLQ and a retry window.”
Summary:

- Use exponential backoff + jitter for transient failures.
- Respect 429 and Retry-After; reduce concurrency to match subscriber capacity.
- Treat most 4xx as terminal unless you have a clear reason not to.
- Bound retries by time and count, then move to DLQ with replay support.
## Multi-tenant fairness and backpressure per subscriber
In production, the hardest part of webhooks isn’t throughput—it’s fairness. A single subscriber can be slow, flaky, or misconfigured, and your system must prevent that subscriber from consuming all workers, filling partitions with retries, or causing dispatcher backlog. This is a classic noisy-neighbor problem, and interviews love it because it’s where “scales” becomes “scales safely.”
A robust strategy is to isolate work by subscriber and enforce per-subscriber limits. That can include per-subscriber concurrency caps (e.g., at most N in-flight deliveries), quotas (deliveries per minute), and distinct isolation lanes (separate queues or partitions per tenant tier). When a subscriber returns 429 or times out, you should reduce their concurrency dynamically (adaptive backoff) rather than just retrying more.
Where you enforce limits matters. You can enforce at the API layer for subscription creation and configuration. You can enforce in the dispatcher by controlling how many tasks you enqueue per subscriber per time window. You can enforce in the queue topology via per-subscriber partitions or per-tenant queues. And you can enforce in the worker by using per-subscriber semaphores and token buckets.
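A sketch of worker-side enforcement, assuming a per-subscriber token bucket for rate and a simple in-flight counter for concurrency; the limits are illustrative.

```python
# Worker-side fairness sketch: a token bucket for rate and an in-flight counter
# for concurrency, both per subscriber (the limit values are assumptions).
import time
from collections import defaultdict

RATE_PER_SECOND = 50      # refill rate per subscriber
BURST = 100               # bucket capacity
MAX_IN_FLIGHT = 10        # concurrency cap per subscriber

_buckets = defaultdict(lambda: {"tokens": float(BURST), "updated": time.time()})
_in_flight = defaultdict(int)

def try_acquire(subscriber_id: str) -> bool:
    """Return True if this delivery may start now; otherwise reschedule it."""
    bucket = _buckets[subscriber_id]
    now = time.time()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["updated"]) * RATE_PER_SECOND)
    bucket["updated"] = now
    if bucket["tokens"] < 1 or _in_flight[subscriber_id] >= MAX_IN_FLIGHT:
        return False
    bucket["tokens"] -= 1
    _in_flight[subscriber_id] += 1
    return True

def release(subscriber_id: str) -> None:
    _in_flight[subscriber_id] -= 1

if try_acquire("sub_7a1d"):
    try:
        pass  # perform the HTTP POST here
    finally:
        release("sub_7a1d")
```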
| Control | API layer | Dispatcher | Queue | Worker |
| --- | --- | --- | --- | --- |
| Concurrency cap | Configure defaults + tiered caps | Optional | Indirect (via partitioning) | Primary enforcement (semaphores) |
| Rate quota | Configure | Can throttle enqueue | Can shape via scheduled messages | Token bucket per subscriber |
| Isolation lanes | Configure tenant tier | Route to lane | Separate queues/partitions | Separate worker pools |
| Backpressure reaction | Expose policy knobs | Slow fan-out | Protect partitions | Adaptive concurrency + circuit breaker |
What great answers sound like on noisy neighbors: “I isolate by subscriber, cap concurrency, and apply backpressure where it’s cheapest. If one subscriber is failing, their retries should not steal capacity from healthy subscribers.”
Summary:

- Enforce per-subscriber concurrency caps and rate quotas.
- Use isolation lanes (tiers) to protect premium/critical traffic.
- Apply backpressure early (dispatcher) and precisely (worker semaphores).
- Treat 429 and timeouts as signals to reduce pressure, not increase it.
## Walkthrough: worker crash after sending → duplicate delivery → idempotency handling
This scenario is where at-least-once becomes real. A worker claims a delivery, transitions it to ATTEMPTING, sends the POST, and the subscriber processes it successfully. But right after the network call returns, the worker crashes before persisting DELIVERED. When the lease expires, another worker reclaims the same delivery and sends it again.
From the subscriber’s perspective, the same event arrived twice. This is not a bug; it’s the expected outcome of an at-least-once system under crashes. The way you make this safe is by sending a stable idempotency key that the subscriber can dedupe on. The subscriber stores the key and returns 2xx for duplicates without reapplying side effects.
Your system should also behave well here. When the second worker gets a 2xx, it marks DELIVERED. Attempt history now shows two attempts, both with 2xx. That’s fine. In fact, this trace is useful: it proves your system is crash-tolerant.
| Mechanism | Why it matters |
| --- | --- |
| Lease-based claiming | Prevents two workers from sending concurrently (most of the time) |
| Persist attempt before/after send | Lets you diagnose ambiguous outcomes |
| Stable idempotency key | Makes duplicates safe for subscribers |
| Subscriber dedupe retention | Must exceed retry window to be effective |
What great answers sound like: “Duplicates are inevitable under crashes. I embrace at-least-once and make duplicates safe with an idempotency key and a clear subscriber dedupe contract.”
Summary:

- A crash between “send” and “persist success” causes duplicate delivery.
- Lease-based claiming reduces concurrency issues but doesn’t remove ambiguity.
- Idempotency keys and subscriber dedupe are the correct solution.
- Attempt history should record duplicates, not hide them.
## Scaling fan-out: 100k subscribers and the dispatcher bottleneck
Fan-out is the place where webhook systems can melt. If an event type has 100k subscribers and you naively “lookup then enqueue 100k tasks” synchronously, your dispatcher becomes your bottleneck, and your queue can become a write bottleneck. Strong candidates describe how they decouple fan-out, batch work, and prevent hot partitions.
One approach is to treat fan-out itself as a distributed job. The dispatcher writes a single “fanout job” record for an event and then a fleet of fanout workers expands it into delivery tasks in batches. You can store subscriber lists in a way that supports efficient iteration (e.g., by event_type + tenant_id), and you can shard the expansion work by consistent hashing. This prevents one dispatcher instance from needing to do all expansion work for massive events.
Queue design matters too. If you key partitions by subscriber_id, you isolate subscribers from each other and keep per-subscriber order, but a massive blast can still concentrate load into hot partitions. If you spread tasks evenly across partitions without keying, distribution is good, but you lose strict per-subscriber ordering unless you add worker-side serialization. A common compromise: partition by subscriber_id for ordering and isolation, make the partition count P large enough to absorb bursts, then add batching and compression to reduce enqueue overhead.
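A sketch of chunked fan-out expansion and partition selection under these assumptions; the partition count, chunk size, and payload reference scheme are illustrative.

```python
# Fan-out sketch: expand a fanout job in chunks and pick a queue partition per
# subscriber; tasks carry a payload reference, not the payload itself.
import hashlib

NUM_PARTITIONS = 512
CHUNK_SIZE = 1000

def partition_for(subscription_id: str, ordering_key: str = "") -> int:
    """Same subscriber (and optional ordering key) always lands on one partition."""
    digest = hashlib.sha256(f"{subscription_id}:{ordering_key}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def expand_fanout(event_id: str, payload_ref: str, subscription_ids, enqueue) -> None:
    """Iterate subscribers in chunks so no single dispatcher call does 100k enqueues."""
    for start in range(0, len(subscription_ids), CHUNK_SIZE):
        chunk = subscription_ids[start:start + CHUNK_SIZE]
        tasks = [
            {
                "delivery_id": f"{event_id}:{sub_id}",
                "event_id": event_id,
                "subscription_id": sub_id,
                "payload_ref": payload_ref,   # stored once, referenced many times
                "partition": partition_for(sub_id),
            }
            for sub_id in chunk
        ]
        enqueue(tasks)  # batch produce to the queue/stream

subs = [f"sub_{i}" for i in range(100_000)]
expand_fanout("evt_9f2c", "s3://webhook-payloads/evt_9f2c", subs,
              lambda batch: None)  # replace with a real batch producer
```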
| Stage | Bottleneck | Mitigation |
| --- | --- | --- |
| Subscription lookup | DB hot reads | Cache subscriptions; shard by tenant; precompute lists per event type |
| Task creation | Dispatcher CPU/memory | Batch expansion; fanout workers; streaming iteration |
| Queue writes | Throughput limits | Batch produce; compress payload; store payload once and reference it |
| Worker capacity | Too many outbound requests | Auto-scale workers; per-subscriber caps; tiered lanes |
Interviewer tip: When you describe 100k fan-out, quantify costs. “100k tasks per event” turns into “millions per minute” quickly, and your design should show where the system absorbs bursts.
### Walkthrough: 100k subscribers fan-out event → dispatcher and queue bottlenecks + mitigation
An event arrives for a popular event type. Instead of the dispatcher immediately enqueuing 100k delivery tasks, it persists the event envelope and creates a fanout job that points to the subscription segment(s) to expand. Fanout workers iterate subscribers in chunks—say 1,000 at a time—creating delivery tasks with references to the stored payload rather than embedding the entire payload in each queue message.
As tasks are enqueued, per-subscriber limits shape execution: subscribers with low caps don’t dominate outbound concurrency. Queue partitions distribute tasks broadly. Observability shows fanout lag separately from delivery lag, which is crucial: it tells you whether you’re falling behind in expansion or in outbound HTTP.
Common pitfall: Embedding the full payload into every task for massive fan-out. You multiply storage and network cost by subscriber count. Store once, reference many.
Summary:

- Treat fan-out as a batchable, scalable job.
- Store payload once and reference it from delivery tasks.
- Use caching and sharding to protect the subscription store.
- Separate fanout lag from delivery lag in monitoring.
## Security: HMAC signatures, secret rotation, and replay prevention
Webhook security is not optional because webhooks often trigger real side effects: granting access, provisioning resources, updating payment state. The baseline is authenticity and integrity: the subscriber must be able to verify the webhook came from you and wasn’t modified. HMAC signatures are the standard approach: compute an HMAC over a canonical representation of the payload plus metadata (often a timestamp), and send the signature in a header.
Replay prevention is the next piece. Without it, an attacker who captures a valid webhook can resend it later. The typical defense is to include a timestamp and require the subscriber to reject messages outside a narrow time window (for example, five minutes). For higher assurance, include a nonce or event_id in a dedupe store on the subscriber side. Secret rotation rounds out the story: subscribers should be able to rotate shared secrets, and your system should support multiple active secrets during a transition window.
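A sketch of signing and verification under these assumptions; the signed string, header layout, and secret names are illustrative rather than any specific vendor's scheme.

```python
# Signing/verification sketch: HMAC over "timestamp.body", a replay window,
# and support for multiple active secrets during rotation (names illustrative).
import hashlib
import hmac
import time

REPLAY_WINDOW_SECONDS = 300

def sign(body: bytes, secret: bytes, timestamp: int) -> str:
    message = f"{timestamp}.".encode() + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify(body: bytes, timestamp: int, signature: str, active_secrets) -> bool:
    if abs(time.time() - timestamp) > REPLAY_WINDOW_SECONDS:
        return False  # too old or too far in the future: possible replay
    return any(
        hmac.compare_digest(signature, sign(body, secret, timestamp))
        for secret in active_secrets  # "current" and "next" secret during rotation
    )

now = int(time.time())
body = b'{"event_id": "evt_9f2c", "event_type": "invoice.paid"}'
sig = sign(body, b"whsec_current", now)
print(verify(body, now, sig, [b"whsec_current", b"whsec_next"]))  # True
```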
Finally, enforce HTTPS endpoints and validate URLs at subscription time. In a mature platform, you also consider SSRF protections (don’t allow internal IP ranges), but at minimum you should mention endpoint validation and allowlisting patterns.
| Control | Mechanism | Purpose |
| --- | --- | --- |
| HMAC signature | Header includes HMAC(payload + timestamp) | Authenticity + integrity |
| Timestamp window | Reject if timestamp outside allowed skew | Prevent replay |
| Secret rotation | Support active+next secret IDs | Safe rotation without downtime |
| HTTPS enforcement | Require https:// URLs | Encrypt in transit |
| Endpoint validation | Block private ranges, validate DNS | Reduce SSRF risk |
What interviewers listen for: “I sign payloads with HMAC, include a timestamp to prevent replay, rotate secrets safely, and I validate endpoints to reduce abuse.”
Summary:

- Use HMAC signatures for authenticity and integrity.
- Prevent replay with timestamps (and optionally nonces/event_id dedupe).
- Support secret rotation with overlapping validity.
- Enforce HTTPS and validate endpoints.
## Observability, DLQ, and replay: make reliability operable
A webhook system is only as good as its ability to explain itself. Subscribers will ask, “Did you send it? When? What response did you get? Can you resend?” Your system must answer these questions with precise data from persisted attempts and delivery states. That means structured logs, metrics, and a delivery timeline view per event/subscription.
DLQ is not just where messages go to die. It’s a deliberate product feature: a place where terminal failures are stored with reasons, and from which replays can be safely initiated. Replays should preserve the original event_id and idempotency key so subscribers can dedupe. You also need to decide what changes during replay: do you reuse the same payload, or regenerate with the latest schema? Most platforms replay the original payload to preserve causality.
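A sketch of what a replay might construct, assuming the DLQ entry retains the original identifiers; the field names are illustrative.

```python
# Replay sketch: a DLQ entry is turned back into a delivery task that keeps
# the original event_id and idempotency key so subscriber dedupe still works.
import time

def replay_from_dlq(dlq_entry: dict, operator_id: str) -> dict:
    """Build a REPLAYING task from a dead-lettered delivery (fields illustrative)."""
    return {
        "delivery_id": dlq_entry["delivery_id"],          # same delivery identity
        "event_id": dlq_entry["event_id"],                # same event identity
        "idempotency_key": dlq_entry["idempotency_key"],  # duplicates stay safe
        "payload_ref": dlq_entry["payload_ref"],          # original payload, not regenerated
        "state": "REPLAYING",
        "attempt_count": 0,                               # fresh retry budget
        "next_attempt_at": time.time(),
        "replay_source": "DLQ",
        "operator_id": operator_id,
    }

dead = {"delivery_id": "dlv_1", "event_id": "evt_9f2c",
        "idempotency_key": "evt_9f2c:sub_7a1d",
        "payload_ref": "s3://webhook-payloads/evt_9f2c"}
print(replay_from_dlq(dead, operator_id="oncall_jane"))
```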
Operationally, add SLO-style metrics: time-to-first-attempt, delivery success rate, attempt distribution, queue lag, fanout lag, and per-subscriber error rates. These metrics are also how you detect noisy neighbors and system-wide incidents.
| Metric | Why it matters | Alert condition |
| --- | --- | --- |
| Queue lag | Indicates backlog and capacity mismatch | Lag > threshold for N minutes |
| Time-to-first-attempt | Measures “near real-time” promise | p95 exceeds target |
| Error rate by subscriber | Detects failing endpoints/noisy neighbors | Sustained 5xx/timeout rate |
| DLQ growth | Indicates terminal delivery issues | DLQ rate spike |
| Retry depth | Shows systemic or tenant outages | Attempts per delivery rising |
Common pitfall: Only tracking “success rate.” You need lag, latency distributions, and per-subscriber breakdowns, or you can’t debug real incidents.
Summary:

- Persist delivery timelines and attempt history for subscriber debugging.
- Treat DLQ as a replayable state with strong metadata.
- Monitor lag, latency, retry depth, and per-subscriber error rates.
- Keep replay idempotent by preserving event identity and keys.
## Common trade-offs you should say out loud
Interviewers expect trade-offs because webhooks sit at the intersection of reliability, cost, and latency. The key is to tie trade-offs to explicit promises. If you promise a 72-hour retry window, you must pay for storage, attempt history, and replay tooling. If you promise ordering, you pay in reduced concurrency. If you embed payloads into tasks, you pay in queue cost.
The best way to cover trade-offs is to state what you optimize for and what you intentionally don’t. For example: “I optimize for durability and correctness over minimal latency, because webhooks are asynchronous and correctness is the product.” That framing is senior-level because it treats the system as a product with expectations, not just infrastructure.
| Trade-off | Option A | Option B | Recommendation |
| --- | --- | --- | --- |
| Latency vs durability | In-memory buffers | Durable queue/stream | Durable queue: correctness dominates |
| Cost vs retry window | Short window | Long window | Long window for enterprise; tiered by plan |
| Ordering vs throughput | Serialize per key | Max parallelism | Ordering only where needed (opt-in) |
| Payload duplication | Copy per task | Store once + reference | Store once for fan-out efficiency |
Interviewer tip: Trade-offs sound stronger when you attach them to an SLO or product promise (“72-hour retries,” “p95 first attempt < X seconds,” “per-subscriber cap N”).
Summary:

- Optimize for correctness and durability; webhooks are asynchronous by nature.
- Use tiering to balance cost and reliability across tenants.
- Make ordering explicit and scoped; don’t promise global ordering.
- Avoid payload duplication for large fan-out.
## What a strong interview answer sounds like
In an interview, clarity beats completeness. A strong answer feels like a guided tour: you state the contract, propose the architecture, then drill into correctness, retries, idempotency, ordering, fairness, and observability with concrete failure narratives. You don’t need every detail; you need the right invariants and the right “why.”
Use this as your mental template: “I’ll guarantee at-least-once delivery with explicit persisted delivery state, bounded retries with backoff and jitter, and subscriber-safe dedupe via idempotency keys. I’ll ensure correctness in publishing using an outbox or CDC, and I’ll protect multi-tenant fairness with per-subscriber caps and isolation.”
What a strong answer sounds like: “I treat delivery as a state machine. Every event is persisted, every delivery is a retryable task, and ambiguous outcomes lead to duplicates that are made safe with idempotency keys. I classify failures (5xx vs 4xx vs 429), bound retries with a DLQ, and isolate tenants to prevent noisy neighbors. Correctness starts at publishing with an outbox or CDC.”
| Dimension | Stance |
| --- | --- |
| Semantics | At-least-once + idempotency keys |
| Correctness | Outbox/CDC to avoid dual-write |
| State machine | Persisted states + leases + reaper |
| Retry taxonomy | 5xx/timeout vs 4xx vs 429 with policy |
| Ordering | Scoped ordering per key/partition |
| Fairness | Per-subscriber caps + quotas + isolation lanes |
| Operability | Metrics, logs, DLQ, replay |
Summary:

- Lead with explicit semantics and correctness boundaries.
- Use walkthroughs to prove you understand failure modes.
- Treat fairness and backpressure as first-class concerns.
- Close with observability and replay as product features.
## Final takeaway
Webhook System Design is an interview favorite because it forces you to confront distributed systems reality: unreliable networks, ambiguous outcomes, massive fan-out, and the need for correctness without overpromising exactly-once. The winning approach is to design a delivery pipeline with durable persistence, a clear state machine, disciplined retry policies, and a subscriber contract that makes duplicates safe.
If you can narrate the three walkthroughs—timeouts and jittered retries, worker crashes causing duplicates, and 100k fan-out bottlenecks—while grounding your choices in correctness (outbox/CDC), fairness (per-subscriber backpressure), and operability (DLQ + replay), you’ll give the interviewer what they’re looking for: mature, production-grade reasoning.
Happy learning!