E-Commerce System Design
How to ace an e-commerce System Design interview: design checkout for correctness with inventory reservations, idempotency, and a state machine, and scale for spikes with rate limits, queues, and graceful degradation.
Designing an e-commerce platform is one of the most common System Design interview prompts because it looks straightforward and still exposes nearly every skill an interviewer wants to test: traffic shaping, data modeling, correctness, reliability, and the ability to reason about trade-offs under pressure.
The trap is familiarity. Everyone has browsed a catalog, added items to a cart, and checked out. In an interview, that familiarity often produces vague answers like “we’ll use microservices” or “we’ll scale horizontally.” That is not what gets you hired.
This blog is a practical, interview-focused blueprint for designing an e-commerce system that is correct under failure, scalable under spikes, and explainable under follow-up questions.
Interview signal: Strong candidates don’t just draw boxes. They explain why each box exists, what it owns, and what guarantees it provides.
How to structure your interview answer (the winning workflow)#
Acing this interview is mostly about sequencing. Interviewers reward engineers who stay structured and build the system in layers, instead of jumping straight into implementation details.
Start with the following workflow and narrate it as you go:
Clarify requirements and constraints (business + technical)
Identify core user flows and read/write patterns
Propose a high-level architecture (services + responsibilities)
Design data models and consistency boundaries
Deep dive into checkout correctness (inventory + payments + idempotency)
Handle failure scenarios and retries
Scale for spikes and flash sales with graceful degradation
Add observability and operational signals
This structure makes your answer interview-proof because it creates obvious “checkpoints” where the interviewer can interrupt you, and you can resume cleanly.
What to say in the interview: “I’ll start by locking requirements and traffic patterns, then propose a high-level architecture, and finally deep dive on checkout correctness and failure handling, since that’s the hardest part.”
Clarify the problem space like a Staff engineer#
Before you design anything, explicitly define what you’re building. E-commerce can mean a small marketplace app or a global Amazon-scale platform. You do not need every feature; you need the right feature set for the interview.
Anchor your scope around the most interview-relevant flows:
Browsing and product detail pages (read-heavy, cache-friendly)
Search (read-heavy, eventually consistent)
Cart management (session-like state, recoverable)
Checkout (write-heavy, correctness-critical)
Orders and payments (durable, auditable, immutable history)
Inventory (high contention, race conditions, overselling risk)
Then state the constraints you’re assuming. This is where you gain control of the problem.
Examples of high-value assumptions:
“We have one region initially, then we can discuss multi-region.”
“We support card payments via an external payment provider.”
“We treat checkout as correctness-first; we tolerate some staleness in catalog/search.”
“We need to survive retries and partial failures without double-charging or double-ordering.”
Interview signal: You win points by explicitly separating “fast and eventually consistent” paths (browse/search) from “durable and auditable” paths (orders/payments).
Core user flows and what they imply#
User flows are not just features. They determine read/write patterns, hotspots, and what breaks under load.
Here’s a crisp way to present that:
Core user flow | Read/write pattern | Design challenge |
Browse catalog / product details | Mostly reads, cacheable | Cache invalidation, stale tolerance, fast p99 latency |
Search products | Reads + indexing writes | Eventual consistency, ranking, query fan-out |
Add to cart / update cart | Writes per user session | Low latency, cart expiration, merge across devices |
View cart | Reads + derived totals | Pricing accuracy, promotions, tax/shipping estimation |
Checkout | Burst writes | Idempotency, inventory contention, payment correctness |
View orders | Reads, durable history | Auditable state, consistent status, pagination |
Common pitfall: Treating all flows as equal. Interviews reward prioritization: checkout correctness beats perfect search freshness.
High-level architecture (services you actually need)#
Once the scope is clear, define the system as a set of responsibilities. Avoid over-indexing on microservices vocabulary. In interviews, “microservices” is not an architecture; it’s a deployment style.
You want to describe a design where:
Reads are fast via caching and replicas
Writes are safe via transactions, idempotency, and durable logs
Cross-service workflows are resilient via async messaging and reconciliation
A practical decomposition looks like this:
Service | Responsibility | Data store |
API Gateway / BFF | Auth, routing, rate limits, aggregation | None (stateless) |
Catalog service | Product details, categories, metadata | Document store or relational DB + cache |
Search service | Query parsing, ranking, indexing pipeline | Search index (e.g., Elasticsearch/OpenSearch) |
Pricing service | Base price, discounts, promo rules | Relational DB + cache |
Cart service | Cart CRUD, cart TTL, merge carts | Key-value store (Redis/DynamoDB) |
Inventory service | Stock counts, reservations, allocation | Relational DB (strong consistency) |
Checkout service | Orchestrates checkout workflow | Minimal state + idempotency store |
Order service | Order creation, state transitions | Relational DB (ACID) |
Payment service | Payment intents, authorization/capture, webhooks | Relational DB + outbox |
Notification service | Email/SMS confirmations | Queue + provider integration |
Event bus | Async workflows, fan-out updates | Kafka/PubSub/SQS-like queue |
This table is interview-friendly because it tells a story: reads live in catalog/search; correctness lives in inventory/orders/payments; workflows connect them.
What to say in the interview: “I’m separating browse/search from checkout/order because they need different guarantees. Browse favors latency and caching; checkout favors correctness, durability, and idempotency.”
Where caching belongs (and where it does not)#
Caching is essential, but careless caching causes correctness bugs.
Use caching aggressively for:
Product details (with TTL and invalidation)
Category listings
Popular search queries
Pricing lookups (short TTL, careful with promotions)
Avoid caching as the source of truth for:
Orders
Payments
Inventory availability during checkout
A strong interview statement is:
Interview signal: Cache improves performance, but correctness-sensitive state must remain authoritative in a durable store.
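The caching guidance above can be sketched as a small cache-aside helper with a TTL and explicit invalidation. This is a minimal in-memory illustration, not a production cache: the `TTLCache` class, its method names, and the injectable `clock` are assumptions made for the example, and a real deployment would more likely use a shared cache such as Redis.

```python
import time


class TTLCache:
    """Cache-aside helper: read from cache, fall back to the source of truth."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so tests can control time
        self._entries = {}          # key -> (value, expires_at)

    def get_or_load(self, key, loader):
        entry = self._entries.get(key)
        now = self.clock()
        if entry is not None and entry[1] > now:
            return entry[0]         # fresh cache hit
        value = loader(key)         # miss or stale: go to the durable store
        self._entries[key] = (value, now + self.ttl)
        return value

    def invalidate(self, key):
        """Call this when the product changes so readers do not wait for TTL."""
        self._entries.pop(key, None)
```

Note that the cache only ever holds a copy. The loader (the catalog database in this sketch) stays authoritative, which is why the same helper should not sit in front of inventory checks during checkout.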
Data consistency: what can be stale vs what must be correct#
E-commerce systems contain multiple “classes” of data. Treating them the same creates either unnecessary complexity or dangerous oversimplification.
Eventually consistent data (acceptable staleness)#
These domains can be eventually consistent because the user impact is limited and recoverable:
Catalog content (product name, description, images)
Search index results
“Recommended products”
Analytics events
View counters, trending lists
If the catalog says a product is available but it is actually out of stock, that is annoying. If the system charges someone for an order that does not exist, that is unacceptable.
Strongly consistent, durable, auditable data#
These domains must be correct, durable, and traceable:
Orders (must not disappear)
Payments (must be reconciled)
Inventory allocation (must prevent oversell)
Refunds and chargebacks
Order state transitions (must be auditable)
A clean interview framing is:
What to say in the interview: “I’ll allow eventual consistency for catalog and search, but orders, payments, and inventory allocation require durable writes and auditability.”
Inventory reservation and overselling prevention#
Inventory is where many “good” designs fail under real-world concurrency. Overselling happens when multiple checkouts race to claim the same units.
The interviewer expects you to handle:
Contention (flash sales)
Abandoned carts
Reservation expiration (TTL)
Correctness under retries
Inventory is not just “a number.” It’s a contract.
The core problem: inventory is a shared resource under race#
During normal traffic, inventory updates are manageable. During flash sales, a single SKU becomes a hotspot and turns into a write bottleneck. If you decrement stock at the wrong time, you either oversell or block legitimate buyers.
You need an explicit strategy. Here are the common ones:
Strategy | How it works | Pros | Cons | Best for |
Reserve on add-to-cart | Create reservation immediately | Prevents oversell early | Blocks inventory for abandoned carts | High-demand limited stock items |
Decrement on checkout | Decrement only at order commit | Less inventory locking | Oversell risk unless strongly synchronized | Normal retail with ample stock |
Queue-based allocation | Queue requests, allocate sequentially | Strong oversell prevention under spikes | Adds latency, complex UX | Flash sales, drops, ticketing |
Common pitfall: “We decrement inventory at checkout” without explaining concurrency control. That answer collapses under flash sale follow-ups.
A practical design: TTL-based reservations#
A strong interview design is reservation-based inventory with TTL:
When the user begins checkout (or clicks “Place order”), create an inventory reservation for each SKU.
Each reservation has:
reservation_id
sku_id
quantity
expires_at (TTL)
user_id / cart_id
status (ACTIVE, COMMITTED, EXPIRED, RELEASED)
This gives you a deterministic way to prevent oversell while allowing abandoned carts to self-heal.
The inventory service then exposes:
ReserveItems(cart_id, items, ttl)
CommitReservation(reservation_id)
ReleaseReservation(reservation_id)
GetAvailableStock(sku_id) (computed as on_hand - reserved_active)
Handling abandoned carts cleanly#
Abandoned carts are not an edge case; they are the default behavior. Most users do not complete checkout.
With TTL reservations:
The system automatically releases stock when expires_at passes.
A background sweeper or TTL index removes expired reservations.
The available stock recovers without manual intervention.
This is exactly the kind of operationally-safe mechanism interviewers want.
Interview signal: Reservations with TTL turn abandoned carts from a correctness bug into a routine cleanup job.
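To make the reservation lifecycle concrete, here is a minimal in-memory sketch of TTL-based reservations. It is an illustration under stated assumptions rather than a reference implementation: the `InventoryService` class, its method names, and the injectable `clock` are hypothetical, it reserves one SKU per call for brevity, and a real service would keep this state in a strongly consistent database.

```python
import time
import uuid
from dataclasses import dataclass


@dataclass
class Reservation:
    reservation_id: str
    sku_id: str
    quantity: int
    expires_at: float
    status: str = "ACTIVE"  # ACTIVE, COMMITTED, EXPIRED, RELEASED


class InventoryService:
    """In-memory stand-in for the inventory store (illustration only)."""

    def __init__(self, on_hand, clock=time.time):
        self.on_hand = dict(on_hand)   # sku_id -> physical units
        self.reservations = {}         # reservation_id -> Reservation
        self.clock = clock

    def sweep_expired(self):
        """Background-sweeper logic: expire ACTIVE reservations past their TTL."""
        now = self.clock()
        for r in self.reservations.values():
            if r.status == "ACTIVE" and r.expires_at <= now:
                r.status = "EXPIRED"

    def get_available_stock(self, sku_id):
        """available = on_hand - reserved_active"""
        self.sweep_expired()
        reserved = sum(r.quantity for r in self.reservations.values()
                       if r.sku_id == sku_id and r.status == "ACTIVE")
        return self.on_hand.get(sku_id, 0) - reserved

    def reserve(self, sku_id, quantity, ttl_seconds):
        if self.get_available_stock(sku_id) < quantity:
            return None                # not enough free units
        r = Reservation(str(uuid.uuid4()), sku_id, quantity,
                        self.clock() + ttl_seconds)
        self.reservations[r.reservation_id] = r
        return r

    def commit(self, reservation_id):
        """Turn a still-ACTIVE reservation into a real stock decrement."""
        self.sweep_expired()
        r = self.reservations[reservation_id]
        if r.status != "ACTIVE":
            return False               # expired or released: checkout must fail
        r.status = "COMMITTED"
        self.on_hand[r.sku_id] -= r.quantity
        return True

    def release(self, reservation_id):
        r = self.reservations[reservation_id]
        if r.status == "ACTIVE":
            r.status = "RELEASED"
```

In this sketch an abandoned checkout needs no special handling: once `expires_at` passes, the next sweep marks the reservation EXPIRED and the units count as available again.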
Contention during flash sales#
When one SKU becomes hot, you need to avoid “thundering herd” behavior where thousands of requests hammer the same database row.
Techniques that work in practice:
Keep inventory operations in a single service with a tight API.
Use conditional updates (compare-and-set) or row-level locking.
Use a queue allocator for extreme spikes (discussed later).
Protect inventory storage with aggressive rate limiting at the edge.
If the interviewer pushes back: “If the SKU becomes a write hotspot, I switch allocation to a queue-based approach for that SKU so requests serialize and we never oversell.”
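The conditional-update technique mentioned above can be shown with a single guarded `UPDATE`. The sketch below uses SQLite from the Python standard library purely for illustration; the `inventory` table layout and the `try_reserve` helper are assumptions, and the same statement shape applies to other relational databases.

```python
import sqlite3


def try_reserve(conn, sku_id, quantity):
    """Atomically reserve stock; True only if enough units were free.

    The WHERE clause is the compare-and-set: the row is updated only when
    on_hand - reserved still covers the requested quantity.
    """
    cur = conn.execute(
        "UPDATE inventory SET reserved = reserved + ? "
        "WHERE sku_id = ? AND on_hand - reserved >= ?",
        (quantity, sku_id, quantity),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory ("
    " sku_id TEXT PRIMARY KEY,"
    " on_hand INTEGER NOT NULL,"
    " reserved INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO inventory VALUES ('sku-1', 3, 0)")
conn.commit()
```

Because the check and the increment happen in one statement, two racing checkouts cannot both claim the last unit: one of them sees zero rows affected and is told the item is out of stock.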
Checkout as a state machine#
Checkout is the highest-signal portion of this interview. This is where you prove you can build systems that remain correct when the world is unreliable: networks fail, retries happen, and external providers behave asynchronously.
A checkout design without a state machine is not a design. It’s wishful thinking.
Why checkout must be modeled explicitly#
Checkout has multiple steps with different failure modes:
inventory reservation
pricing validation
payment authorization
order creation
payment capture
confirmation and fulfillment triggers
Some of these steps can be retried safely. Some cannot. Some are synchronous; some are async.
The clean way to manage this is a checkout/order state machine with explicit transitions.
State | Meaning | Valid next transitions |
INITIATED | Checkout request created | → RESERVING_INVENTORY, → FAILED |
RESERVING_INVENTORY | Reserving stock with TTL | → INVENTORY_RESERVED, → OUT_OF_STOCK |
INVENTORY_RESERVED | Reservation exists | → PAYMENT_AUTH_PENDING, → EXPIRED |
PAYMENT_AUTH_PENDING | Calling payment provider | → PAYMENT_AUTHORIZED, → PAYMENT_FAILED |
PAYMENT_AUTHORIZED | Authorization succeeded | → ORDER_CREATING, → AUTH_EXPIRED |
ORDER_CREATING | Writing durable order record | → ORDER_CREATED, → ORDER_CREATE_FAILED |
ORDER_CREATED | Order persisted | → PAYMENT_CAPTURE_PENDING, → CANCELLED |
PAYMENT_CAPTURE_PENDING | Capturing payment | → COMPLETED, → CAPTURE_FAILED |
COMPLETED | Final success | (terminal) |
FAILED | Terminal failure | (terminal) |
Interview signal: State machines make correctness explainable. Interviewers trust designs they can reason about step-by-step.
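The transition table above maps directly onto a lookup structure. The following sketch encodes exactly those transitions; the `advance` and `is_terminal` helpers are hypothetical names, and treating every state with no outgoing transitions (for example OUT_OF_STOCK or CAPTURE_FAILED) as terminal is a simplifying assumption of this example.

```python
TRANSITIONS = {
    "INITIATED": {"RESERVING_INVENTORY", "FAILED"},
    "RESERVING_INVENTORY": {"INVENTORY_RESERVED", "OUT_OF_STOCK"},
    "INVENTORY_RESERVED": {"PAYMENT_AUTH_PENDING", "EXPIRED"},
    "PAYMENT_AUTH_PENDING": {"PAYMENT_AUTHORIZED", "PAYMENT_FAILED"},
    "PAYMENT_AUTHORIZED": {"ORDER_CREATING", "AUTH_EXPIRED"},
    "ORDER_CREATING": {"ORDER_CREATED", "ORDER_CREATE_FAILED"},
    "ORDER_CREATED": {"PAYMENT_CAPTURE_PENDING", "CANCELLED"},
    "PAYMENT_CAPTURE_PENDING": {"COMPLETED", "CAPTURE_FAILED"},
}


class InvalidTransition(Exception):
    """Raised when code tries to skip or reverse a checkout step."""


def advance(current, target):
    """Return the new state, or raise if the table does not allow the move."""
    if target not in TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target


def is_terminal(state):
    # Assumption: a state with no outgoing transitions is terminal.
    return state not in TRANSITIONS
```

Rejecting illegal moves in one place means a retried or out-of-order message cannot, for instance, jump a checkout from INITIATED straight to COMPLETED.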
Idempotency keys are mandatory#
Retries happen everywhere:
mobile networks drop
users double-click “Place order”
load balancers retry on timeouts
clients retry after 5xx
Your system must treat duplicate requests as the same operation.
Use idempotency keys at the checkout boundary:
client sends Idempotency-Key (UUID)
checkout service stores (idempotency_key → checkout_id/result)
repeated requests return the same outcome
Also use idempotency internally:
payment authorization call includes a unique payment intent key
order creation uses a unique constraint like (user_id, idempotency_key) or (checkout_id)
What to say in the interview: “Checkout is idempotent. Every request includes an idempotency key so retries never create duplicate orders or double charges.”
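A minimal sketch of the idempotency-key lookup at the checkout boundary follows. The `CheckoutService` class and its fields are hypothetical, and the in-memory dictionary stands in for a durable idempotency store; in a real service the check-and-insert would need to be atomic (for example via a unique constraint) so that two concurrent retries cannot both miss the lookup.

```python
import uuid


class CheckoutService:
    """Illustrative only: maps idempotency_key -> stored checkout result."""

    def __init__(self):
        self._results = {}         # stand-in for a durable idempotency store
        self.checkouts_started = 0

    def place_order(self, idempotency_key, cart_id):
        # A repeated key means a retry: return the stored outcome unchanged.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        self.checkouts_started += 1
        result = {
            "checkout_id": str(uuid.uuid4()),
            "cart_id": cart_id,
            "status": "INITIATED",
        }
        self._results[idempotency_key] = result
        return result
```

A double-click or a client retry with the same key returns the same `checkout_id`, while a genuinely new purchase uses a fresh key and starts a new checkout.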
Checkout flow walkthrough (narrative sequence)#
In interviews, the best way to demonstrate mastery is to walk through the checkout as a story. Do it end-to-end, including what gets written where.
Here is a narrative walkthrough you can deliver in five to seven minutes.
The user clicks “Checkout.” The client sends a request to the Checkout service with the cart ID and an idempotency key. The Checkout service first loads the cart contents from the Cart service and validates that each item is still purchasable. This includes verifying SKU availability, ensuring the product is active, and re-checking pricing rules to avoid stale totals from the cart page.
Next, the Checkout service calls the Inventory service to reserve inventory for the cart items. The Inventory service creates TTL reservations and returns a reservation ID. This step must be strongly consistent because it protects against overselling. At this point, the checkout has a hard expiration window: if the user does not finish in time, the reservation expires and the user must re-checkout.
Once inventory is reserved, the Checkout service initiates payment authorization. This is not a capture yet; it is an authorization hold that confirms the payment method is valid and the funds are available. The Payment service creates a payment intent record in its database and calls the external payment provider with its own idempotency token so repeated calls do not create multiple authorizations.
When authorization succeeds, the system creates the order in the Order service. This is a durable write that must succeed exactly once. The Order service persists the order record with an initial state like ORDER_CREATED and stores the reservation ID and payment intent ID as part of the order’s audit trail. This is the moment the business commits to the purchase.
After the order exists, the system captures the payment. Capture can be synchronous or asynchronous depending on the provider and latency goals, but it must be reconciled. If capture succeeds, the order transitions to COMPLETED (or PAID) and downstream processes trigger fulfillment and notifications. If capture fails, the order transitions to a failure-handling path such as CAPTURE_FAILED, and the system releases inventory or keeps it reserved depending on your business rules.
Finally, the user receives a confirmation response. Importantly, the confirmation is not “payment succeeded.” The confirmation is “order created and payment captured,” or it is an explicit “pending” status if capture is asynchronous.
Interview signal: The best walkthroughs say exactly what gets written to durable storage and when the system becomes committed.
Failure scenarios (the ones interviewers always ask)#
You do not need to list every possible failure. You need to cover the ones that reveal correctness maturity.
Failure: payment succeeds but order creation fails#
This is the classic distributed systems failure: an external side effect succeeds, but your internal state write fails.
Correct handling requires reconciliation:
Payment service stores payment intent and authorization result durably.
Order creation is retried with idempotency.
If order creation cannot succeed after retries, the system triggers a compensating action:
void authorization (if not captured)
refund (if captured)
A reconciliation job continuously compares:
authorized/captured payments without orders
orders without successful payment capture
What to say in the interview: “I never rely on the synchronous response alone. Payments are reconciled. If authorization succeeded but order creation failed, I retry order creation idempotently and reconcile orphaned payments with void/refund workflows.”
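The retry-then-compensate logic described above can be sketched in a few lines. Everything named here is an assumption for illustration: `OrderStoreError`, the `create_order` callable, the payment dictionary shape, and the action labels are hypothetical, and the order write is assumed to be idempotent on `checkout_id`.

```python
class OrderStoreError(Exception):
    """Hypothetical error raised when the order write fails."""


def create_order_or_compensate(payment, create_order, max_attempts=3):
    """Retry an idempotent order write; fall back to a compensating action.

    Returns (action, order_id). The caller is expected to execute the
    VOID_AUTHORIZATION or REFUND action against the payment provider.
    """
    for _ in range(max_attempts):
        try:
            order_id = create_order(payment["checkout_id"])
            return ("ORDER_CREATED", order_id)
        except OrderStoreError:
            continue  # safe to retry because create_order is idempotent

    # Order could not be created: undo the external side effect.
    if payment["status"] == "CAPTURED":
        return ("REFUND", None)
    return ("VOID_AUTHORIZATION", None)
```

The choice between void and refund depends only on whether money has already moved, which matches the two compensating actions listed above.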
Failure: inventory reservation expires mid-checkout#
This happens when:
user takes too long
payment authorization is slow
the system is under heavy load
Correct handling:
reservations have expires_at
order creation validates reservation is still ACTIVE
if expired, checkout fails with a clear “items no longer available” response
user re-checks out with refreshed inventory
This is a correctness requirement, not a UX preference.
Common pitfall: Allowing checkout to proceed with expired reservations “because it’s rare.” Under flash sales, it becomes common.
Failure: duplicate checkout requests due to retries#
This is guaranteed to happen. Your design must treat it as normal behavior.
Correct handling:
idempotency key at the API boundary
unique constraint on order creation (checkout_id must be unique)
payment provider calls are idempotent via payment intent key
return the same order ID for repeated calls
If the interviewer pushes back: “Even if the client retries five times, I return the same order ID because the idempotency key maps to a single checkout attempt.”
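The unique constraint on order creation can act as the last line of defense even when an upstream check is bypassed. This sketch uses SQLite from the Python standard library; the `orders` schema and the `create_order` helper are illustrative assumptions.

```python
import sqlite3


def create_order(conn, checkout_id, user_id):
    """Insert the order once; on a duplicate, return the existing order_id."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (checkout_id, user_id) VALUES (?, ?)",
                (checkout_id, user_id),
            )
    except sqlite3.IntegrityError:
        pass  # UNIQUE(checkout_id) rejected the retry: the order already exists
    row = conn.execute(
        "SELECT order_id FROM orders WHERE checkout_id = ?", (checkout_id,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " checkout_id TEXT NOT NULL UNIQUE,"
    " user_id TEXT NOT NULL)"
)
```

Calling `create_order` five times with the same `checkout_id` leaves one row in the table and returns the same order ID each time.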
Handling flash sales and traffic spikes#
Flash sales are where e-commerce designs get exposed. Traffic shifts from “read-heavy with occasional writes” to “massive concurrent writes on a tiny set of SKUs.”
A strong answer describes two things:
how you protect correctness (no oversell, no double-charge)
how you degrade gracefully (some features sacrificed to protect the core)
Backpressure, queueing, and rate limiting#
Under extreme load, your job is not to keep every feature alive. Your job is to keep checkout safe and the system stable.
Use these tools deliberately:
Rate limiting at the API gateway (per IP, per user, per endpoint)
Queueing for checkout attempts on hot SKUs
Backpressure by rejecting early when downstream is overloaded
Circuit breakers when payment providers or inventory DB are failing
Load shedding for non-critical endpoints
Interview signal: “Graceful degradation” means you can name what you protect and what you sacrifice.
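One common way to implement rate limiting at the gateway is a token bucket, sketched below. The source does not prescribe a specific algorithm, so the token bucket is an assumption of this example, as are the `TokenBucket` class and its parameters; a real gateway would keep one bucket per user, IP, or endpoint.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock          # injectable so tests can control time
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # reject early: this is the backpressure
```

Returning `False` here is what lets the gateway shed load before requests reach the inventory database or the payment provider.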
What to protect vs what to sacrifice#
During a flash sale, you protect correctness and revenue flows. You sacrifice convenience features.
Degradation action | What it protects | What it sacrifices |
Serve cached catalog pages | Site availability | Perfect freshness |
Disable personalized recommendations | Checkout capacity | Personalization |
Queue checkout attempts | No oversell | Immediate response time |
Limit cart modifications | Inventory integrity | Flexible cart UX |
Reduce search features (filters/sorting) | Core browse | Advanced search UX |
Switch to async order confirmation | System stability | Instant finality |
What to say in the interview: “During a flash sale, I keep checkout safe by applying rate limits, queueing allocation for hot SKUs, and degrading non-critical features like recommendations and advanced search filters.”
A flash sale allocation pattern that works#
For extremely limited inventory, reservation-based locking can still overload the database. The queue-based approach is cleaner:
Requests enter a queue per SKU (or per product group).
A small number of workers allocate inventory sequentially.
If inventory is available, the worker issues a reservation token.
The client proceeds to payment only with a valid token.
This trades latency for correctness and system stability, which is the right trade during a flash sale.
If the interviewer pushes back: “If the database becomes the bottleneck, I move allocation to a queue-backed worker model for hot SKUs so I serialize inventory decisions and keep the system stable.”
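The queue-based allocation steps above can be sketched with a single worker draining a per-SKU queue. This is a simplified illustration: `allocate_from_queue` is a hypothetical helper, the standard-library `queue.Queue` stands in for a durable message queue, and the reservation token is just a random UUID.

```python
import queue
import threading
import uuid


def allocate_from_queue(requests, stock):
    """Single worker: drain the queue and allocate units one at a time.

    Returns user_id -> reservation token, or None when stock ran out.
    Only this function touches `stock`, so decisions are serialized.
    """
    results = {}
    while True:
        try:
            user_id = requests.get_nowait()
        except queue.Empty:
            break
        if stock > 0:
            stock -= 1
            results[user_id] = str(uuid.uuid4())  # token required for payment
        else:
            results[user_id] = None               # sold out for this request
    return results


# Many concurrent buyers enqueue; one worker decides.
sku_queue = queue.Queue()
buyers = [threading.Thread(target=sku_queue.put, args=(f"user-{i}",))
          for i in range(50)]
for t in buyers:
    t.start()
for t in buyers:
    t.join()
allocation = allocate_from_queue(sku_queue, stock=5)
```

However many buyers arrive at once, exactly as many tokens are issued as there are units, because a single consumer makes every allocation decision.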
Reliability patterns that make your design credible#
At Staff level, you are expected to name the patterns that keep distributed systems correct.
Outbox pattern for durable event publishing#
If you create an order and then publish an event (“OrderCreated”), you must avoid losing the event when the process crashes.
Use an outbox table:
write order + outbox event in the same DB transaction
a background publisher reads outbox and publishes to the event bus
mark outbox event as published
This makes workflows reliable without requiring distributed transactions across services.
Interview signal: Outbox is a concrete reliability mechanism that demonstrates real production experience.
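The three outbox steps above can be sketched with SQLite from the Python standard library. The table layout and the `init_schema`, `create_order_with_outbox`, and `publish_pending` helpers are assumptions for illustration; `publish` is any callable that hands the event to the event bus.

```python
import json
import sqlite3


def init_schema(conn):
    conn.executescript(
        """
        CREATE TABLE orders (order_id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def create_order_with_outbox(conn, order_id, total_cents):
    """Write the order and its event in ONE transaction: both or neither."""
    payload = json.dumps({"order_id": order_id, "total_cents": total_cents})
    with conn:  # commits on success, rolls back if either insert fails
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", payload),
        )


def publish_pending(conn, publish):
    """Background publisher: send unpublished events, then mark them."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY id"
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?",
                         (event_id,))
    return len(rows)
```

If the publisher crashes after `publish` but before the row is marked, the event is sent again on the next run. The pattern therefore gives at-least-once delivery, and consumers should deduplicate on a stable event or order ID.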
Reconciliation jobs are part of correctness#
You cannot rely on synchronous calls alone. External systems (payments) are asynchronous, and failures create drift.
A good design includes periodic reconciliation:
payments without orders
orders without captured payments
expired reservations still marked active
mismatch between inventory reserved vs allocated
Reconciliation is not a patch. It is the safety net that keeps trust intact.
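The first two reconciliation checks above reduce to set differences between payment and order records. The sketch below assumes both record types carry a shared `checkout_id` and that payments have a `status` of AUTHORIZED or CAPTURED; the `reconcile` function and record shapes are hypothetical.

```python
def reconcile(payments, orders):
    """Find drift between the payment ledger and the order store."""
    paid = {p["checkout_id"] for p in payments
            if p["status"] in {"AUTHORIZED", "CAPTURED"}}
    captured = {p["checkout_id"] for p in payments
                if p["status"] == "CAPTURED"}
    ordered = {o["checkout_id"] for o in orders}
    return {
        # Money was held or taken, but no order exists: void or refund.
        "payments_without_orders": sorted(paid - ordered),
        # Order exists, but the payment was never captured: retry or cancel.
        "orders_without_capture": sorted(ordered - captured),
    }
```

A periodic job would run this over a recent time window and feed each bucket into the matching repair workflow.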
Observability: the metrics interviewers respect#
An e-commerce system without observability is incomplete. In interviews, you do not need to list dozens of dashboards. You need the right few metrics that map to correctness and revenue.
Start by stating that you monitor:
user experience
system health
correctness invariants
Then name the metrics that prove those are under control:
Oversell rate: orders created beyond available stock, as a share of all orders (the target is zero)
Reservation expiration rate: % of reservations expiring before commit
Checkout success rate: successful orders / checkout attempts
Checkout failure rate by reason: out-of-stock vs payment failure vs timeout
Payment authorization latency (p95/p99): external dependency health
Duplicate checkout suppression count: how often idempotency prevents duplication
What to say in the interview: “I track oversell rate, reservation expiry rate, checkout failure rate by reason, and payment latency percentiles because they directly map to revenue loss and user trust.”
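Two of the metrics above, checkout success rate and failure rate by reason, can be derived from a stream of checkout outcomes. This is a small sketch under assumptions: the `checkout_metrics` function and the `outcome` field with a `SUCCESS` value are hypothetical, and a real system would emit counters to a metrics backend rather than compute them in process.

```python
from collections import Counter


def checkout_metrics(attempts):
    """Summarize checkout attempts into a success rate and failure reasons."""
    outcomes = Counter(a["outcome"] for a in attempts)
    total = len(attempts)
    failures = {reason: count for reason, count in outcomes.items()
                if reason != "SUCCESS"}
    return {
        "success_rate": outcomes["SUCCESS"] / total if total else 0.0,
        "failures_by_reason": failures,
    }
```

Breaking failures down by reason is what lets you tell a payment-provider incident apart from a sold-out SKU.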
Common mistakes that cost candidates offers#
Most candidates fail this interview in predictable ways. Avoid these and you immediately move into the top tier.
Common pitfall: Designing checkout as a single synchronous call with no idempotency and no state machine. That design breaks the moment retries and failures appear.
Other frequent mistakes:
treating inventory as “just a field in the product table”
using eventual consistency for orders or payments
hand-waving flash sales with “we’ll autoscale”
skipping failure scenarios and reconciliation
designing ten microservices without ownership boundaries
Final interview checklist (quick recap)#
You don’t need a perfect design. You need a design that is coherent, correct, and explainable under pressure.
State assumptions early and clearly
Separate read-heavy flows from correctness-critical flows
Treat inventory as a concurrency problem
Model checkout as a state machine
Use idempotency keys everywhere retries exist
Plan for failures and reconciliation
Design flash sale behavior with graceful degradation
Show observability metrics tied to trust and revenue
Interview signal: The best answer is the one the interviewer can stress-test without it collapsing.