Webhook system design is the practice of building a reliable, event-driven delivery pipeline that pushes HTTP notifications from a producer service to external subscriber endpoints, handling retries, failures, and scale gracefully. The core challenge is not sending a single POST request but rather guaranteeing at-least-once delivery across thousands of endpoints you do not control, where network failures, slow responses, and crashes make every outcome ambiguous.
Key takeaways
- At-least-once delivery is the realistic baseline: Exactly-once semantics are impractical without extreme coordination cost, so design around idempotency keys and subscriber-side deduplication instead.
- Transactional outbox prevents lost events: Writing the event and the business state change in a single database transaction eliminates the dual-write problem that silently drops webhooks.
- Retry policies must classify failures precisely: Treating 4xx, 5xx, 429, and timeouts identically leads to wasted capacity, noisy-neighbor amplification, and broken subscriber trust.
- Multi-tenant fairness is a core design concern: Per-subscriber concurrency caps, token buckets, and isolation lanes prevent one slow endpoint from starving the entire delivery fleet.
- Observability and replay are product features, not afterthoughts: Subscribers need delivery timelines, attempt history, and self-service replay tools or your system will generate support tickets indefinitely.
Every time a payment processor confirms a charge, a Git provider triggers a CI build, or a CRM fires an automation, a webhook is doing the work. The HTTP POST itself takes milliseconds. Designing the system that reliably delivers millions of those POSTs to endpoints you have never tested, across networks you do not own, against servers that may be down, misconfigured, or deliberately slow is the real engineering problem. That gap between “send one request” and “operate a delivery platform” is exactly what makes webhook system design one of the most revealing topics in distributed systems interviews and one of the most underestimated challenges in production.
Clarify requirements the interviewer actually cares about
A webhook design conversation opens with requirements, but the interview reward comes from explaining why specific requirements exist. The environment dictates most of your decisions: untrusted networks, untrusted endpoints, bursty fan-out patterns, and the mathematical impossibility of exactly-once delivery without heavyweight distributed coordination. Stating these constraints early signals that you understand the problem space, not just the feature list.
Functionally, you need three capabilities: register subscriptions, publish events, and deliver payloads. But those are table stakes. The real differentiator is the set of promises your system makes. You must define delivery semantics (at-least-once), retry windows (for example, 72 hours), ordering scope (per entity, not global), and what “success” means. Most webhook platforms treat a 2xx response as “accepted by the subscriber endpoint,” because that is the only outcome you can actually observe from the sender’s side.
Non-functional requirements break into three pillars:
- Correctness: Never silently lose an event after it has been committed.
- Scalability: Handle fan-out to hundreds of thousands of subscribers per event type without choking the pipeline.
- Isolation: One noisy or failing subscriber must not degrade delivery for anyone else.
Security belongs in the requirements conversation from the start, because it changes your payload format, header contract, and subscriber onboarding flow. If you defer it, you end up retrofitting HMAC signatures and secret rotation into a system that was not designed for them.
Attention: Saying “exactly once” without describing the coordination cost is a red flag. In webhooks, the realistic move is at-least-once delivery plus idempotency keys. Subscribers handle duplicates, not the platform.
The requirements phase is also where you establish observability as a core concern. Subscribers will ask “did you send it?” and “why did it fail?” If those questions are unanswerable, your system is operationally broken regardless of throughput. With clear requirements in hand, you can now sketch the architecture that fulfills them.
High-level architecture: build a delivery pipeline, not a request handler
The most interview-robust webhook architecture is not a service that “sends webhooks.” It is a durable, multi-stage delivery pipeline in which every event and delivery task is persisted, queued, and tracked through an explicit life cycle.
A clean way to present the architecture is to split it into services with single, defensible jobs:
- Event ingestion: Accepts events from producer services and persists them durably.
- Dispatcher (fan-out): Turns “an event happened” into “these N subscribers should receive it” by creating individual delivery tasks.
- Durable queue or stream: Buffers delivery tasks and absorbs the mismatch between production rate and consumption rate.
- Delivery workers: Execute HTTP POSTs, enforce timeouts, classify responses, and transition persisted delivery state.
- Subscription management: Handles registration, validation, secret management, and per-tenant policy configuration.
The queue is the shock absorber of the entire system. It turns the unpredictable behavior of external subscriber endpoints into a buffered, retryable, and observable workload. Without it, a single slow subscriber can back-pressure your entire event pipeline.
The following diagram shows how events flow from producers through the dispatcher and queue into delivery workers, with state persisted at every stage.
If you are interviewing at Staff level, add the “control plane vs. data plane” framing. Subscription creation, secret rotation, quota configuration, and replay tooling live in the control plane. High-throughput delivery lives in the data plane. This lets you scale, secure, and deploy them independently.
Pro tip: Say out loud what is persisted and why. “We persist events and delivery tasks so we can survive crashes, retries, and ambiguous outcomes” is the kind of sentence that earns trust in a design interview.
The backbone of this pipeline is only as trustworthy as the mechanism that gets events into it. That mechanism is where many systems quietly fail.
Transactional outbox and event publishing correctness
Correct event publishing is the first place a webhook system can lose data without anyone noticing. The common mistake is the dual-write: the service commits a business state change to its database and then, as a separate operation, publishes the event to a queue. If the process crashes between those two steps, the state change survives but the webhook silently never fires.
The fix is to make event creation atomic with the business state change. The most widely adopted approach is the transactional outbox: within the same database transaction that commits the business change, the service inserts an event row into an outbox table, and a background relay (poller) reads committed outbox rows and publishes them to the delivery pipeline.
A close cousin is Change Data Capture (CDC): instead of writing an explicit outbox row, a tool such as Debezium tails the database's transaction log and derives events from committed changes.
The following table compares the two approaches across key dimensions.
Transactional Outbox vs. Change Data Capture (CDC): A Comparative Overview
| Dimension | Transactional Outbox | Change Data Capture (CDC) |
| --- | --- | --- |
| Implementation Complexity | Requires app code changes, outbox table management, and a background relay/poller process | Minimal app code changes; complexity lies in configuring and maintaining CDC tools (e.g., Debezium) |
| Operational Overhead | Extra DB writes per transaction; outbox table must be monitored and pruned regularly | Low app-level overhead; requires dedicated CDC infrastructure and message broker maintenance |
| Schema Coupling | Tightly coupled; outbox table and transactions are schema-dependent | Loosely coupled; reads from DB logs, though schema changes may still affect CDC configuration |
| Filtering Flexibility | Full control over event generation, filtering, and metadata enrichment | Captures all DB changes; filtering requires additional processing in the CDC tool or consumers |
| Failure Modes | Strong consistency within a single transaction; relay/poller failures may cause delays | Eventual consistency; CDC or broker failures can result in delayed or lost events |
| Team Size Suitability | Better suited for smaller teams with direct control over code and schema | Better suited for larger teams or organizations with dedicated infrastructure and ops support |
In an interview, you do not need to pick one approach forever. You need to show you understand the trade-offs and failure modes of each. The critical invariant is the same in both cases: the event must be derived from committed state, and publishing must be idempotent and replayable.
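A minimal sketch of the outbox pattern, using SQLite and invented table and column names, makes the invariant concrete: the business write and the event row commit in one transaction, and the relay publishes only committed rows. This is illustrative, not a production implementation.

```python
import json
import sqlite3
import time
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox (event_id TEXT PRIMARY KEY, event_type TEXT,"
    " payload TEXT, created_at REAL, published INTEGER DEFAULT 0)"
)

def mark_order_paid(order_id):
    """Commit the business change and the event row in ONE transaction."""
    event_id = str(uuid.uuid4())
    with conn:  # sqlite3 context manager: a single atomic transaction
        conn.execute("INSERT OR REPLACE INTO orders VALUES (?, 'paid')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload, created_at)"
            " VALUES (?, ?, ?, ?)",
            (event_id, "order.paid", json.dumps({"order_id": order_id}), time.time()),
        )
    return event_id

def relay_pending_events():
    """Background relay: read committed outbox rows and publish them.
    Publishing must be idempotent, since the relay may re-read on crash."""
    rows = conn.execute(
        "SELECT event_id, event_type, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for event_id, event_type, payload in rows:
        # publish(event_id, event_type, payload)  # hypothetical queue call
        conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))
    conn.commit()
    return rows
```

Because the event row only exists if the business transaction committed, a crash at any point either loses both writes (safe) or keeps both (safe); the relay simply retries unpublished rows.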
Real-world context: Shopify’s webhook infrastructure relies on transactional outbox patterns to guarantee that merchant events are never silently dropped, even during high-traffic events like flash sales. This is a well-documented pattern in event-driven e-commerce systems.
With publishing correctness established, the next question is what exactly you are publishing and what contract you offer to subscribers.
Data model and APIs: define the subscriber contract
A webhook platform is partially an API product. Subscribers are external developers who need a stable, predictable contract for how deliveries behave, how they authenticate messages, and how they debug failures. In interviews, showing a crisp subscriber contract is worth more than listing API endpoints.
Your contract should cover five areas:
- Subscription registration: How subscribers specify their endpoint URL, event types, optional ordering keys, and authentication secrets.
- Event schemas and versioning: How payloads are structured, how versions evolve, and what backward compatibility guarantees exist.
- Signature verification: How subscribers validate that a delivery is authentic and unmodified.
- Idempotency keys: How subscribers detect and safely handle duplicate deliveries.
- Response semantics: What response codes mean to your system (2xx is success, everything else triggers retry or terminal handling).
On the data model side, four core records anchor the system. Subscriptions define who wants what events delivered where. Events (or event envelopes) are the immutable records of what happened. Delivery tasks are the unit of retryable work, one per (event, subscription) pair. Delivery attempts are the history of each HTTP call made for a delivery task.
Keeping attempts separate from tasks is a deliberate design choice. It prevents overwriting what happened during retries and creates a clean audit trail that both your operations team and your subscribers can inspect.
The payload envelope should carry stable metadata: event_id, event_type, a version field for schema evolution, created_at, tenant_id, and a delivery_id or idempotency_key. That idempotency key is what lets subscribers safely deduplicate under at-least-once semantics.
Schema versioning deserves explicit attention. When you add fields, subscribers using older versions should not break. When you deprecate fields, you need a communication window. A common approach is to embed a version identifier in the event type (for example, invoice.paid.v2) and maintain backward compatibility within a major version. Optional fields should be documented as optional. Required fields should never be removed without a version bump.
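As an illustration, a hypothetical envelope builder might assemble the fields above; the function name and exact field layout are assumptions, not a prescribed schema.

```python
import time
import uuid

def build_envelope(event_type, version, tenant_id, data):
    """Assemble a stable event envelope. Field names mirror the contract
    described above; the structure itself is illustrative."""
    event_id = str(uuid.uuid4())
    return {
        "event_id": event_id,
        "event_type": f"{event_type}.{version}",  # e.g. "invoice.paid.v2"
        "created_at": time.time(),
        "tenant_id": tenant_id,
        # Stable per event; a per-delivery key would combine this
        # with the subscription_id.
        "idempotency_key": event_id,
        "data": data,
    }
```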
Attention: Forgetting the subscriber experience is a common design blind spot. If subscribers cannot see why a delivery failed and how to replay it, your system will generate support tickets indefinitely. Build replay and debugging into the contract from day one.
With the contract defined, the next step is to reason precisely about what delivery guarantees that contract can actually provide.
Delivery semantics: at-least-once, idempotency, and ordering boundaries
The core design move is accepting reality. You cannot achieve exactly-once delivery over an unreliable network without extreme coordination cost. Even if your worker makes a successful POST and the subscriber processes it, the worker can crash before persisting the success. From your system’s perspective, the outcome is ambiguous, so it must retry. That is at-least-once delivery, and it is the correct baseline for any webhook platform.
Once you commit to at-least-once, you must immediately commit to idempotency. Your system should provide a stable idempotency key per (event, subscription) pair. The subscriber stores processed keys (or derives idempotency from event_id combined with subscription_id) and safely rejects duplicates by returning 2xx without reapplying side effects.
Subscriber-side deduplication storage needs a retention policy. You should keep idempotency keys for at least the maximum retry window plus a buffer for clock skew. If your retry window is 72 hours, retaining keys for 96 hours is a reasonable default.
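A subscriber-side deduplication store can be sketched as a key set with TTL-based pruning. The class shape and the 96-hour default below are illustrative, matching the retention guidance above; a real subscriber would likely back this with a database or cache rather than process memory.

```python
import time

class IdempotencyStore:
    """Subscriber-side dedup: remember processed keys for the retry
    window plus a clock-skew buffer (96 hours by default)."""

    def __init__(self, ttl_seconds=96 * 3600):
        self.ttl = ttl_seconds
        self._seen = {}  # idempotency_key -> expiry timestamp

    def seen_before(self, key, now=None):
        """Return True if the key was already processed; otherwise record
        it and return False. Expired keys are pruned lazily."""
        now = time.time() if now is None else now
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        if key in self._seen:
            return True
        self._seen[key] = now + self.ttl
        return False
```

On a duplicate, the subscriber returns 2xx without reapplying side effects, which is exactly what at-least-once semantics require.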
Ordering is the next boundary to define clearly. Global ordering across all subscribers and all event types is not realistic and most subscribers do not need it. What you can offer is ordering per key: route events for the same entity (such as order_id or user_id) to the same queue partition and process them with a per-key concurrency of one.
This is a throughput trade-off. Strict ordering within a key means serialized delivery for that key, which reduces parallelism. The right framing is “only enforce ordering where correctness requires it,” and offer it as an optional subscription setting rather than a global default.
Pro tip: A strong interview sentence sounds like this: “I guarantee at-least-once delivery and make duplicates safe through a stable idempotency key. If ordering is required, I scope it to an ordering key and serialize deliveries per key, accepting the throughput trade-off.”
Delivery semantics define what you promise. The next concern is how you track and enforce those promises through every stage of a delivery’s life cycle.
Delivery state machine and life cycle tracking
If you want your webhook system to be debuggable and correct under retries, crashes, and concurrency, you need an explicit delivery state machine: a small set of persisted states (PENDING, ATTEMPTING, RETRY, DELIVERED, FAILED, DLQ) with well-defined transitions between them.
A helpful mental model treats each delivery as a persisted object carrying a next_attempt_at timestamp and an attempt_count. The life cycle works as follows:
- A delivery task is created in PENDING with next_attempt_at = now.
- A worker claims the task using a lease, transitioning it to ATTEMPTING.
- The worker sends the HTTP POST with a strict timeout (for example, 5 seconds).
- On 2xx, the worker transitions to DELIVERED and records delivered_at.
- On transient failure (timeout, 5xx), the worker writes an attempt record and transitions to RETRY, computing the next attempt time using exponential backoff plus jitter.
- On permanent failure (most 4xx) or after exhausting the retry window, the worker transitions to FAILED or DLQ.
The following diagram illustrates the full state machine with transitions and the conditions that trigger each.
The key correctness mechanism is persisting state transitions atomically with attempt records. If you persist “ATTEMPTING” and then the worker crashes, you need a reaper: a background process that scans for deliveries whose lease has expired and moves them back to RETRY so another worker can claim them.
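A minimal sketch of lease claiming and reaping, using SQLite and invented column names (`lease_expires_at`, `next_attempt_at`). A production system would use proper row locking (for example, `SELECT ... FOR UPDATE SKIP LOCKED` in PostgreSQL) rather than this simplified select-then-update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deliveries (id TEXT PRIMARY KEY, state TEXT,"
    " lease_expires_at REAL, next_attempt_at REAL)"
)

LEASE_SECONDS = 30

def claim_one(now):
    """Claim a due task: select it and take a lease in one transaction."""
    with conn:
        row = conn.execute(
            "SELECT id FROM deliveries WHERE state IN ('PENDING','RETRY')"
            " AND next_attempt_at <= ? LIMIT 1", (now,)
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE deliveries SET state='ATTEMPTING', lease_expires_at=?"
            " WHERE id=?",
            (now + LEASE_SECONDS, row[0]),
        )
        return row[0]

def reap_expired(now):
    """Reaper: return crashed workers' tasks to RETRY once their lease
    expires, making them claimable again."""
    with conn:
        cur = conn.execute(
            "UPDATE deliveries SET state='RETRY', next_attempt_at=?"
            " WHERE state='ATTEMPTING' AND lease_expires_at < ?",
            (now, now),
        )
        return cur.rowcount
```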
Walk-through: subscriber timeout, jittered retries, and eventual success
A payment event is created and a delivery task starts in PENDING. A worker claims it, transitions to ATTEMPTING, and fires an HTTP POST with a 5-second timeout. The subscriber endpoint does not respond in time.
The worker classifies the outcome as a transient failure (timeouts are treated like 5xx). It writes an attempt record documenting the timeout error, transitions the delivery to RETRY, and computes next_attempt_at using the formula:
$$\text{next\_attempt\_at} = \text{now} + \min\left(\text{base} \times 2^{\text{attempt}},\ \text{max\_interval}\right) + \text{jitter}$$
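The formula translates directly into code. The defaults below (5-second base, 1-hour cap, up to 20% jitter) are illustrative choices, not prescribed values.

```python
import random

def next_attempt_delay(attempt, base=5.0, max_interval=3600.0, jitter_fraction=0.2):
    """Exponential backoff capped at max_interval, plus random jitter.
    Jitter spreads retries out so that failures from the same outage
    do not all retry at the same instant."""
    delay = min(base * (2 ** attempt), max_interval)
    jitter = random.uniform(0, jitter_fraction * delay)
    return delay + jitter
```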
After several retries, the subscriber recovers and responds with 204 within the timeout window. The worker persists DELIVERED with the final status code and timestamp. The full attempt chain is visible in the delivery timeline, which is exactly what subscriber support needs to resolve questions without escalation.
Attention: Retrying immediately after timeouts without backoff and jitter can amplify an outage. If 10,000 deliveries all retry at the same instant, you create a thundering herd against already-degraded subscriber infrastructure.
The state machine tells you when to retry. The next question is how to retry differently based on what kind of failure occurred.
Retry taxonomy: classifying failures and matching them to policy
Retries are not a binary “on or off” decision. A strong design classifies failures by what they mean and maps each class to a distinct policy. Treating all non-2xx responses the same way wastes capacity, hammers broken endpoints, and violates subscriber trust.
The following table shows how to classify HTTP responses and the appropriate retry behavior for each.
HTTP Response Classification and Retry Policy
| Response Category | Examples | Classification | Retry Policy |
| --- | --- | --- | --- |
| 2xx | 200 (OK), 201 (Created), 204 (No Content) | Success | No retry needed |
| 408 / 429 | Request Timeout, Too Many Requests | Transient / Rate-limited | Retry with backoff; respect `Retry-After` header |
| 5xx | 500, 502, 503, 504 | Transient server error | Retry with exponential backoff + jitter |
| Connection Timeout / DNS Failure | N/A | Transient network error | Retry with exponential backoff + jitter |
| 400 / 401 / 403 / 404 / 410 | Bad Request, Unauthorized, Forbidden, Not Found, Gone | Permanent client error | Do not retry; move to FAILED or DLQ |
| 409 / 425 | Conflict, Too Early | Context-dependent | Retry with caution or treat as transient per subscriber documentation |
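This classification can be sketched as a small function. The `Outcome` enum and the choice to default 409/425 to cautious retry are illustrative assumptions; a real platform would make those per-subscriber policy decisions.

```python
from enum import Enum

class Outcome(Enum):
    SUCCESS = "success"
    RETRY = "retry"                      # backoff + jitter
    RATE_LIMITED = "rate_limited"        # backoff; honor Retry-After
    PERMANENT_FAILURE = "permanent"      # FAILED or DLQ, no retry

def classify(status_code, timed_out=False):
    """Map a delivery result to a retry policy class."""
    if timed_out or status_code is None:  # network timeout / DNS failure
        return Outcome.RETRY
    if 200 <= status_code < 300:
        return Outcome.SUCCESS
    if status_code in (408, 429):         # subscriber is asking us to slow down
        return Outcome.RATE_LIMITED
    if status_code >= 500:                # transient server error
        return Outcome.RETRY
    if status_code in (409, 425):         # context-dependent; retry cautiously
        return Outcome.RETRY
    return Outcome.PERMANENT_FAILURE      # remaining 4xx: do not retry
```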
The 429 case deserves special attention. When a subscriber returns 429, they are explicitly asking you to slow down. If the response includes a Retry-After header, respect it. Beyond that, consider implementing a per-subscriber adaptive rate limit that shrinks the subscriber's delivery rate and concurrency while 429s persist and restores them as the endpoint recovers.
For 4xx responses, the default should be “do not retry” unless you have a specific, documented reason. A 404 or 410 strongly suggests the endpoint is gone. A 401 or 403 usually means credentials are invalid, and retrying will not fix that. Some teams allow a brief grace period for 404 during subscriber deployments, but this is a product decision that should be explicit, not implicit.
Real-world context: Stripe’s webhook documentation specifies a retry schedule with exponential backoff over 72 hours, ultimately disabling the endpoint after repeated failures. This bounded retry window is a deliberate product choice that balances reliability against resource cost.
The retry taxonomy protects individual deliveries. But when you have thousands of subscribers and some of them are failing, the system-wide concern shifts from “how do I retry this one delivery” to “how do I prevent one bad subscriber from hurting everyone else.”
Multi-tenant fairness and backpressure per subscriber
In production, the hardest part of webhook delivery is not raw throughput. It is fairness. A single subscriber can be slow, flaky, rate-limited, or completely down, and your system must prevent that subscriber from consuming all delivery workers, filling queue partitions with retries, or causing dispatcher backlog for healthy subscribers. This is the classic noisy-neighbor problem, and it is where “scales” becomes “scales safely.”
A robust fairness strategy operates at multiple layers:
- Per-subscriber concurrency caps: Limit the number of in-flight deliveries for any single subscriber. If subscriber A has a cap of 50, no more than 50 workers will be attempting deliveries to A simultaneously, regardless of how many tasks are queued.
- Per-subscriber rate quotas: Limit deliveries per time window (for example, 1,000 per minute) to prevent overwhelming subscriber infrastructure.
- Isolation lanes: Separate queues or priority tiers for different subscriber classes. A free-tier subscriber’s failures should never compete for capacity with an enterprise subscriber’s deliveries.
- Adaptive backoff: When a subscriber returns 429 or times out repeatedly, dynamically reduce their concurrency allocation instead of continuing to retry at full rate.
Where you enforce limits matters as much as what limits you enforce. At the dispatcher layer, you control how many tasks are enqueued per subscriber per time window. At the queue layer, you use per-subscriber partitions or per-tenant queues to isolate backlogs. At the worker layer, you use per-subscriber semaphores to cap concurrency.
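The worker-layer cap can be sketched with per-subscriber asyncio semaphores. The class name and default cap are illustrative; the point is that one subscriber's slow endpoint blocks only that subscriber's in-flight slots.

```python
import asyncio

class PerSubscriberLimiter:
    """Worker-layer concurrency cap: at most `default_cap` in-flight
    deliveries per subscriber, independent of queue depth."""

    def __init__(self, default_cap=50):
        self.default_cap = default_cap
        self._sems = {}  # subscriber_id -> asyncio.Semaphore

    def _sem(self, subscriber_id):
        if subscriber_id not in self._sems:
            self._sems[subscriber_id] = asyncio.Semaphore(self.default_cap)
        return self._sems[subscriber_id]

    async def deliver(self, subscriber_id, send):
        # Blocks only tasks for this subscriber; others proceed freely.
        async with self._sem(subscriber_id):
            return await send()
```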
A particularly effective pattern is the per-subscriber token bucket: each subscriber's bucket refills at their configured rate, a worker must take a token before each attempt, and a failing subscriber's retries are therefore throttled to their refill rate without consuming anyone else's tokens.
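A per-subscriber token bucket is a compact way to implement this kind of rate shaping. This minimal sketch takes explicit timestamps for testability; the rate and burst parameters are illustrative.

```python
class TokenBucket:
    """One bucket per subscriber: refills continuously at `rate_per_sec`,
    capped at `burst`. A worker must take a token before each attempt."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = burst
        self.last = 0.0

    def try_take(self, now):
        """Return True and consume a token if one is available."""
        # Refill proportionally to elapsed time, capped at burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

When a delivery attempt cannot take a token, the task is simply rescheduled rather than dropped, so rate shaping composes cleanly with the retry state machine.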
Pro tip: Frame fairness as a capacity allocation problem. “If one subscriber is failing, their retries should not steal capacity from healthy subscribers” is the kind of sentence that demonstrates production-grade thinking in an interview.
The following diagram shows how per-subscriber isolation works across the dispatcher, queue, and worker layers.
Fairness and backpressure protect the system under degraded conditions. But even under normal conditions, you need to reason about what happens when a worker crash creates an ambiguous delivery outcome.
Walk-through: worker crash, duplicate delivery, and idempotency handling
This scenario is where at-least-once delivery becomes tangible. A worker claims a delivery task, transitions it to ATTEMPTING, sends the HTTP POST, and the subscriber processes it successfully, returning 200. But immediately after the network call returns, the worker process crashes before it can persist the DELIVERED state.
From the system’s perspective, the delivery is stuck in ATTEMPTING with no recorded success. The reaper detects the expired lease and moves the delivery back to RETRY. Another worker claims it and sends the same POST again. The subscriber receives the same event a second time.
This is not a bug. It is the expected and correct behavior of an at-least-once system under crashes. The safety mechanism is the stable idempotency key your system included in the delivery headers or payload. The subscriber stores processed idempotency keys (in a set, cache, or database with TTL) and, upon seeing a duplicate, returns 2xx without reapplying side effects.
Your system also behaves correctly here. When the second worker gets a 2xx, it marks the delivery as DELIVERED. The attempt history now shows two attempts, both with 2xx. That trace is valuable, not embarrassing. It proves your system is crash-tolerant and that the subscriber correctly handles duplicates.
Historical note: The idempotency key pattern was popularized in payment systems, where duplicate charges are catastrophic. Stripe’s idempotency key documentation is a well-known reference for how to design safe duplicate handling in API-driven systems.
Worker crashes are one source of system stress. The other major source is sheer scale, specifically the fan-out explosion that occurs when a single event needs to reach tens or hundreds of thousands of subscribers.
Scaling fan-out: 100k subscribers and the dispatcher bottleneck
Fan-out is where webhook systems can melt. If an event type has 100,000 subscribers and your dispatcher naively looks up all subscribers and enqueues 100,000 delivery tasks synchronously, the dispatcher becomes a single-threaded bottleneck and the queue faces a write spike that can saturate network or storage I/O. Strong designs treat fan-out itself as a distributed, batchable job.
The approach is to decompose fan-out into stages. When an event arrives, the dispatcher persists the event envelope and creates a fanout job that references the event and the subscriber segment to expand. A fleet of fanout workers picks up these jobs and iterates subscribers in chunks (for example, 1,000 at a time), creating delivery tasks in batches. Each delivery task references the stored payload by event ID rather than embedding the full payload.
This “store once, reference many” pattern is critical. If your event payload is 2 KB and you have 100,000 subscribers, embedding the payload in every task means 200 MB of duplicated data per event. Referencing the payload by ID keeps task messages small and reduces queue storage, network transfer, and serialization costs by orders of magnitude.
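A sketch of chunked fan-out expansion under the “store once, reference many” approach. Here `enqueue_batch` is a hypothetical queue call, replaced by list collection so the sketch is self-contained; each task carries only the event ID, not the payload.

```python
def expand_fanout(event_id, subscriber_ids, chunk_size=1000):
    """Fan-out as a batched background job: create small delivery tasks
    that reference the stored payload by event_id instead of embedding it."""
    batches = []
    batch = []
    for sub_id in subscriber_ids:
        batch.append({
            "event_id": event_id,       # payload fetched by reference at delivery time
            "subscriber_id": sub_id,
            "state": "PENDING",
        })
        if len(batch) >= chunk_size:
            batches.append(batch)       # enqueue_batch(batch) in a real system
            batch = []
    if batch:
        batches.append(batch)
    return batches
```

Multiple fanout workers can each take a subscriber segment, so expansion parallelizes independently of delivery throughput.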
Queue design also matters for fan-out. Partitioning by subscriber_id provides natural isolation (one subscriber’s backlog does not block another’s) and supports per-subscriber ordering if needed. The partition count must be large enough to distribute load evenly. If you partition by subscriber_id mod P, the distribution is good, but you may lose strict per-subscriber ordering unless workers enforce serialization per key.
When you describe 100k fan-out in an interview, quantify the costs. “100k tasks per event” turns into “millions of tasks per minute” if events arrive at 10 per second. Your design must show where the system absorbs those bursts: the queue buffers the spike, fanout workers parallelize the expansion, and per-subscriber caps throttle the delivery rate to match what endpoints can handle.
Real-world context: Large webhook providers like Twilio and Svix handle massive fan-out by treating expansion as a background job with its own scaling knobs, separate from the delivery workers. This separation lets them independently scale “how fast we create tasks” and “how fast we deliver them.”
Monitoring fan-out requires two distinct metrics: fanout lag (time between event creation and completion of task expansion) and delivery lag (time between task creation and successful delivery). Conflating these two metrics makes it impossible to diagnose whether you are falling behind in expansion or in outbound HTTP.
At this scale, the events you are delivering carry real business value, which means the security of those deliveries is not optional.
Security: HMAC signatures, secret rotation, and replay prevention
Webhooks often trigger consequential side effects: granting access, provisioning resources, updating payment records, or starting deployments. Without authentication, any attacker who discovers a subscriber’s endpoint URL can forge webhook payloads and trigger those side effects. Security is not a “nice to have” layer on top of your delivery system. It is a fundamental part of the subscriber contract.
The baseline is authenticity and integrity via HMAC signatures. Your system computes an HMAC (typically HMAC-SHA256) over a canonical representation of the payload plus metadata (usually including a timestamp), using a shared secret known only to your platform and the subscriber. The signature is sent in a header (for example, X-Webhook-Signature). The subscriber recomputes the HMAC using the same secret and rejects the delivery if the signatures do not match.
Replay prevention is the next piece. Without it, an attacker who captures a valid webhook (for example, through a compromised log or network tap) can resend it later. The defense is to include a timestamp in the signed payload and require the subscriber to reject messages where the timestamp falls outside a narrow window, typically five minutes. For higher assurance, the subscriber can also deduplicate on event_id, which serves double duty as both an idempotency key and a replay prevention mechanism.
Secret rotation requires careful handling. Subscribers must be able to rotate their shared secret without downtime. Your system should support multiple active secrets during a transition window: sign with the new secret, but allow the subscriber to verify against either the old or new secret until the old one expires. This overlapping validity window prevents delivery failures during rotation.
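A sketch of rotation-aware verification tying the last three paragraphs together: sign over the timestamp plus payload, reject stale timestamps, and accept any secret in the active set. The message format and five-minute tolerance are illustrative assumptions; real platforms define their own canonical scheme and header names.

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 300  # five-minute replay window

def sign(payload, secret, timestamp):
    """Sign timestamp + payload so the timestamp itself cannot be forged."""
    message = str(timestamp).encode() + b"." + payload
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify(payload, timestamp, signature, active_secrets, now=None):
    """Accept the delivery if the timestamp is fresh and ANY active secret
    matches, which allows zero-downtime secret rotation."""
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # replay prevention: stale or future-dated message
    return any(
        hmac.compare_digest(sign(payload, secret, timestamp), signature)
        for secret in active_secrets
    )
```

Note the use of `hmac.compare_digest`, which compares in constant time and avoids leaking signature prefixes through timing differences.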
Finally, enforce HTTPS for all subscriber endpoints and validate URLs at subscription time. In a mature platform, you also implement SSRF (Server-Side Request Forgery) protections by rejecting internal IP ranges, private network addresses, and localhost. The OWASP SSRF prevention cheat sheet is a useful reference for URL validation patterns.
Attention: Allowing HTTP (non-TLS) endpoints means your signed payloads travel in plaintext, making signature capture and replay trivial. Always require HTTPS and validate the certificate chain.
Security protects individual deliveries. But when deliveries fail, subscribers need to understand what happened and what to do about it, which brings us to observability.
Observability, DLQ, and replay: make reliability operable
A webhook system is only as reliable as its ability to explain itself. Subscribers will ask three questions: “Did you send it?”, “Why did it fail?”, and “Can you resend it?” Your system must answer all three with precise data from persisted delivery state and attempt history.
Delivery timelines are the core debugging tool. For each (event, subscription) pair, subscribers should be able to see every attempt: the timestamp, the HTTP status code or error classification, the response latency, and the next scheduled retry. This is not just useful for subscribers. Your own operations team needs the same data to diagnose system-wide issues and identify patterns like a sudden spike in timeouts for a specific subscriber region.
Structured metrics power the operational side. The essential metrics for a webhook platform include:
- Time-to-first-attempt: How quickly after event creation does the first delivery attempt fire? This measures pipeline latency.
- Delivery success rate: The percentage of delivery tasks that reach DELIVERED within the retry window.
- Attempt distribution: How many attempts does the average delivery require? A spike here indicates subscriber health degradation.
- Queue lag and fanout lag: Separate metrics that tell you whether you are falling behind in task creation or task execution.
- Per-subscriber error rates: The most important metric for detecting noisy neighbors before they become system-wide problems.
The Dead Letter Queue (DLQ) is not where messages go to die. It is a deliberate product feature: a persisted state where terminal failures are stored with full context (the event payload, the subscription details, the attempt history, and the failure reason). From the DLQ, authorized subscribers or operators can initiate replays.
Replays must preserve the original event_id and idempotency key so that subscribers can safely deduplicate. You also need a clear policy on what a replay sends: the original payload (preserving causality) or a regenerated payload (reflecting current state). Most platforms replay the original payload, because the subscriber’s logic may depend on the state at the time the event occurred.
Pro tip: Track SLO-style metrics like “99% of deliveries receive their first attempt within 30 seconds of event creation” and “95% of deliveries succeed within 3 attempts.” These give your team and your subscribers a shared language for reliability expectations.
With observability in place, you have all the building blocks of a production-grade webhook system. The final challenge is knowing which trade-offs to make explicit.
Common trade-offs you should say out loud
Webhook system design sits at the intersection of reliability, cost, latency, and fairness. Every design choice involves a trade-off, and the strongest interview answers (and the strongest production systems) make those trade-offs explicit rather than implicit.
Correctness vs. latency. The transactional outbox adds a small delay (the relay polling interval or CDC propagation lag) compared to a direct “publish on commit” approach. But the direct approach risks dual-write failures. Optimizing for correctness over minimal latency is the right call for webhooks because they are asynchronous by nature. A 200ms delay in first attempt is invisible to subscribers; a lost event is not.
Ordering vs. throughput. Per-key ordering requires serialized delivery within a partition, which caps parallelism for that key. If a subscriber needs ordering for order_id, deliveries for that order are sequential. The trade-off is worth it when misordered events cause subscriber-side bugs (for example, “order cancelled” arriving before “order created”). The right framing is to offer ordering as an opt-in subscription setting with documented throughput implications.
Retry window vs. resource cost. A 72-hour retry window means you must store delivery tasks, attempt history, and event payloads for at least that duration. You must also budget worker capacity for retries that may spike during subscriber recovery. A shorter window is cheaper but breaks reliability promises. Tiering helps: premium subscribers get 72-hour windows with aggressive retries, while free-tier subscribers get 24 hours with fewer attempts.
Payload embedding vs. referencing. Embedding the full payload in every delivery task is simple but multiplies storage cost by the fan-out factor. Referencing the payload by event ID is more efficient but adds a lookup step during delivery. For small fan-out, embedding is fine. For 100k+ subscribers, referencing is mandatory.
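The storage math makes the crossover obvious. Assuming an 8 KB average payload and roughly 200 bytes of per-task metadata (both hypothetical figures), embedding multiplies the payload by the fan-out factor while referencing stores it once:

```python
SUBSCRIBERS = 100_000      # fan-out factor from the 100k+ case
PAYLOAD_BYTES = 8 * 1024   # assumed average event payload size
TASK_OVERHEAD_BYTES = 200  # assumed per-task metadata, paid either way

# Embedding: every delivery task carries its own copy of the payload.
embed_cost = SUBSCRIBERS * (PAYLOAD_BYTES + TASK_OVERHEAD_BYTES)

# Referencing: one stored payload plus lightweight tasks holding an event ID.
reference_cost = PAYLOAD_BYTES + SUBSCRIBERS * TASK_OVERHEAD_BYTES

print(f"embed:     {embed_cost / 1e6:.0f} MB per event")
print(f"reference: {reference_cost / 1e6:.0f} MB per event")
```

Under these assumptions a single high-fan-out event costs hundreds of megabytes embedded versus tens of megabytes referenced, and the payload lookup the reference adds is a single read amortized across all deliveries of that event.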
Historical note: The tension between exactly-once semantics and system complexity has been a recurring theme in distributed systems since the early work on the Two Generals Problem. Webhooks are a modern, practical instance of this fundamental impossibility, which is why at-least-once with idempotency is the industry standard.
The following table summarizes the key trade-offs and their recommended resolutions for a production webhook platform.
Event-Driven Architecture Trade-Off Comparison
| Trade-off | Option A | Option B | Recommended Resolution |
| --- | --- | --- | --- |
| Correctness vs. Latency | Transactional outbox (adds relay lag) | Direct publish (risk of dual-write loss) | Outbox for correctness |
| Ordering vs. Throughput | Per-key serialization (lower parallelism) | Unordered delivery (higher parallelism) | Opt-in ordering per subscription |
| Retry Window vs. Cost | 72-hour window (higher storage and compute) | 6-hour window (lower cost but lower reliability) | Tiered windows by subscriber plan |
| Payload Strategy | Embed in every task (simple, expensive at scale) | Reference by event ID (efficient, adds lookup) | Reference for large fan-out, embed for small |
| Global vs. Scoped Ordering | Global ordering (impractical at scale) | Per-key ordering (practical with throughput cost) | Per-key only |
These trade-offs are not academic. They map directly to SLOs, product promises, and infrastructure budgets. The strongest design answers attach each trade-off to a concrete number: “72-hour retries,” “p95 first attempt under 5 seconds,” “per-subscriber cap of 100 concurrent deliveries.”
What a strong interview answer sounds like#
In an interview, clarity beats completeness. A strong answer feels like a guided tour through increasing levels of detail: you state the contract, propose the architecture, then drill into correctness, retries, idempotency, ordering, fairness, and observability with concrete failure narratives. You do not need every detail. You need the right invariants and the right “why” behind each decision.
A mental template for the opening statement:
“I will guarantee at-least-once delivery with explicit persisted delivery state, bounded retries with exponential backoff and jitter, and subscriber-safe deduplication via stable idempotency keys. I will ensure correctness in event publishing using a transactional outbox. I will protect multi-tenant fairness with per-subscriber concurrency caps, circuit breakers, and isolation lanes.”
From there, the walk-throughs do the heavy lifting. Narrating the timeout-and-retry scenario proves you understand backoff mechanics. Narrating the worker-crash scenario proves you understand why duplicates are inevitable and how idempotency makes them safe. Narrating the 100k fan-out scenario proves you can reason about scale and bottleneck mitigation.
Real-world context: Companies like Svix (an open-source webhook delivery platform) have built entire products around the primitives discussed in this post: delivery state machines, retry taxonomies, subscriber dashboards, and replay tooling. The fact that webhook delivery is a standalone product category validates the complexity of the problem.
Close your answer with observability and replay as product features, not infrastructure afterthoughts. The ability to say “every delivery has a full attempt timeline, failures land in a replayable DLQ, and subscribers have self-service debugging tools” shows that you think about systems from the operator’s and the customer’s perspective, not just the architect’s.
Conclusion#
Webhook system design is an interview favorite and a production reality because it forces you to confront the full spectrum of distributed systems challenges: unreliable networks, ambiguous delivery outcomes, massive fan-out, multi-tenant fairness, and the need for correctness without the luxury of exactly-once semantics. The winning approach is not a clever trick. It is a disciplined delivery pipeline with durable persistence at every stage, a clear state machine that survives crashes and concurrency, a retry taxonomy that distinguishes transient failures from permanent ones, and a subscriber contract that makes duplicates safe through idempotency keys.
The future of webhook infrastructure is moving toward programmable delivery policies (where subscribers define their own retry curves and circuit breaker thresholds), streaming delivery over persistent connections (such as WebSockets or server-sent events) as a complement to HTTP push, and deeper integration with observability platforms that let subscribers set their own SLOs on delivery latency and success rate. As event-driven architectures become the default integration pattern, the webhook delivery platform will increasingly be treated as critical infrastructure rather than a feature checkbox.
The system that delivers a single HTTP POST is trivial. The system that delivers millions of them reliably, fairly, and observably, while surviving every failure mode the internet can produce, is a distributed systems problem worth mastering.