E-Commerce System Design

How to ace an e-commerce System Design interview: design checkout for correctness with inventory reservations + idempotency + a state machine, and scale for spikes with rate limits, queues, and graceful degradation.

14 mins read
Feb 02, 2026

Designing an e-commerce platform is one of the most common System Design interview prompts because it looks straightforward and still exposes nearly every skill an interviewer wants to test: traffic shaping, data modeling, correctness, reliability, and the ability to reason about trade-offs under pressure.

The trap is familiarity. Everyone has browsed a catalog, added items to a cart, and checked out. In an interview, that familiarity often produces vague answers like “we’ll use microservices” or “we’ll scale horizontally.” That is not what gets you hired.

This blog is a practical, interview-focused blueprint for designing an e-commerce system that is correct under failure, scalable under spikes, and explainable under follow-up questions.

Interview signal: Strong candidates don’t just draw boxes. They explain why each box exists, what it owns, and what guarantees it provides.

How to structure your interview answer (the winning workflow)#

Acing this interview is mostly about sequencing. Interviewers reward engineers who stay structured and build the system in layers, instead of jumping straight into implementation details.

Start with the following workflow and narrate it as you go:

  1. Clarify requirements and constraints (business + technical)

  2. Identify core user flows and read/write patterns

  3. Propose a high-level architecture (services + responsibilities)

  4. Design data models and consistency boundaries

  5. Deep dive into checkout correctness (inventory + payments + idempotency)

  6. Handle failure scenarios and retries

  7. Scale for spikes and flash sales with graceful degradation

  8. Add observability and operational signals

This structure makes your answer interview-proof because it creates obvious “checkpoints” where the interviewer can interrupt you, and you can resume cleanly.

What to say in the interview: “I’ll start by locking requirements and traffic patterns, then propose a high-level architecture, and finally deep dive on checkout correctness and failure handling, since that’s the hardest part.”

Clarify the problem space like a Staff engineer#

Before you design anything, explicitly define what you’re building. E-commerce can mean a small marketplace app or a global Amazon-scale platform. You do not need every feature; you need the right feature set for the interview.

Anchor your scope around the most interview-relevant flows:

  • Browsing and product detail pages (read-heavy, cache-friendly)

  • Search (read-heavy, eventually consistent)

  • Cart management (session-like state, recoverable)

  • Checkout (write-heavy, correctness-critical)

  • Orders and payments (durable, auditable, immutable history)

  • Inventory (high contention, race conditions, overselling risk)

Then state the constraints you’re assuming. This is where you gain control of the problem.

Examples of high-value assumptions:

  • “We have one region initially, then we can discuss multi-region.”

  • “We support card payments via an external payment provider.”

  • “We treat checkout as correctness-first; we tolerate some staleness in catalog/search.”

  • “We need to survive retries and partial failures without double-charging or double-ordering.”

Interview signal: You win points by explicitly separating “fast and eventually consistent” paths (browse/search) from “durable and auditable” paths (orders/payments).

Core user flows and what they imply#

User flows are not just features. They determine read/write patterns, hotspots, and what breaks under load.

Here’s a crisp way to present that:

| Core user flow | Read/write pattern | Design challenge |
| --- | --- | --- |
| Browse catalog / product details | Mostly reads, cacheable | Cache invalidation, stale tolerance, fast p99 latency |
| Search products | Reads + indexing writes | Eventual consistency, ranking, query fan-out |
| Add to cart / update cart | Writes per user session | Low latency, cart expiration, merge across devices |
| View cart | Reads + derived totals | Pricing accuracy, promotions, tax/shipping estimation |
| Checkout | Burst writes | Idempotency, inventory contention, payment correctness |
| View orders | Reads, durable history | Auditable state, consistent status, pagination |

Common pitfall: Treating all flows as equal. Interviews reward prioritization: checkout correctness beats perfect search freshness.

High-level architecture (services you actually need)#

Once the scope is clear, define the system as a set of responsibilities. Avoid over-indexing on microservices vocabulary. In interviews, “microservices” is not an architecture; it’s a deployment style.

You want to describe a design where:

  • Reads are fast via caching and replicas

  • Writes are safe via transactions, idempotency, and durable logs

  • Cross-service workflows are resilient via async messaging and reconciliation

A practical decomposition looks like this:

| Service | Responsibility | State / storage |
| --- | --- | --- |
| API Gateway / BFF | Auth, routing, rate limits, aggregation | None (stateless) |
| Catalog service | Product details, categories, metadata | Document store or relational DB + cache |
| Search service | Query parsing, ranking, indexing pipeline | Search index (e.g., Elasticsearch/OpenSearch) |
| Pricing service | Base price, discounts, promo rules | Relational DB + cache |
| Cart service | Cart CRUD, cart TTL, merge carts | Key-value store (Redis/DynamoDB) |
| Inventory service | Stock counts, reservations, allocation | Relational DB (strong consistency) |
| Checkout service | Orchestrates checkout workflow | Minimal state + idempotency store |
| Order service | Order creation, state transitions | Relational DB (ACID) |
| Payment service | Payment intents, authorization/capture, webhooks | Relational DB + outbox |
| Notification service | Email/SMS confirmations | Queue + provider integration |
| Event bus | Async workflows, fan-out updates | Kafka/PubSub/SQS-like queue |

This table is interview-friendly because it tells a story: reads live in catalog/search; correctness lives in inventory/orders/payments; workflows connect them.

What to say in the interview: “I’m separating browse/search from checkout/order because they need different guarantees. Browse favors latency and caching; checkout favors correctness, durability, and idempotency.”

Where caching belongs (and where it does not)#

Caching is essential, but careless caching causes correctness bugs.

Use caching aggressively for:

  • Product details (with TTL and invalidation)

  • Category listings

  • Popular search queries

  • Pricing lookups (short TTL, careful with promotions)

Avoid caching as the source of truth for:

  • Orders

  • Payments

  • Inventory availability during checkout

A strong interview statement is:

Interview signal: Cache improves performance, but correctness-sensitive state must remain authoritative in a durable store.
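
To make that boundary concrete, here is a minimal cache-aside sketch for product detail reads. It assumes a Redis client via redis-py and a hypothetical `fetch_product_from_db` helper; the key format and TTL are illustrative choices, not prescriptions.

```python
import json
import redis  # any key-value cache works the same way; Redis is just an example

cache = redis.Redis(host="localhost", port=6379)
PRODUCT_TTL_SECONDS = 300  # short TTL keeps staleness bounded


def fetch_product_from_db(product_id: str) -> dict:
    """Placeholder for the authoritative catalog read (hypothetical helper)."""
    raise NotImplementedError


def get_product(product_id: str) -> dict:
    """Cache-aside read: serve from cache, fall back to the catalog DB."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    product = fetch_product_from_db(product_id)  # authoritative source
    cache.set(key, json.dumps(product), ex=PRODUCT_TTL_SECONDS)
    return product


def invalidate_product(product_id: str) -> None:
    """Call on catalog updates so readers stop seeing stale data before the TTL."""
    cache.delete(f"product:{product_id}")
```

Note that nothing in checkout reads this cache: availability decisions during checkout come from the inventory service's durable store, exactly as the callout above says.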

Data consistency: what can be stale vs what must be correct#

E-commerce systems contain multiple “classes” of data. Treating them the same creates either unnecessary complexity or dangerous oversimplification.

Eventually consistent data (acceptable staleness)#

These domains can be eventually consistent because the user impact is limited and recoverable:

  • Catalog content (product name, description, images)

  • Search index results

  • “Recommended products”

  • Analytics events

  • View counters, trending lists

If the catalog says a product is available but it is actually out of stock, that is annoying. If the system charges someone for an order that was never created, that is unacceptable.

Strongly consistent, durable, auditable data#

These domains must be correct, durable, and traceable:

  • Orders (must not disappear)

  • Payments (must be reconciled)

  • Inventory allocation (must prevent oversell)

  • Refunds and chargebacks

  • Order state transitions (must be auditable)

A clean interview framing is:

What to say in the interview: “I’ll allow eventual consistency for catalog and search, but orders, payments, and inventory allocation require durable writes and auditability.”

Inventory reservation and overselling prevention#

Inventory is where many “good” designs fail under real-world concurrency. Overselling happens when multiple checkouts race to claim the same units.

The interviewer expects you to handle:

  • Contention (flash sales)

  • Abandoned carts

  • Reservation expiration (TTL)

  • Correctness under retries

Inventory is not just “a number.” It’s a contract.

The core problem: inventory is a shared resource under race#

During normal traffic, inventory updates are manageable. During flash sales, a single SKU becomes a hotspot and turns into a write bottleneck. If you decrement stock at the wrong time, you either oversell or block legitimate buyers.

You need an explicit strategy. Here are the common ones:

| Strategy | Mechanism | Pros | Cons | Best for |
| --- | --- | --- | --- | --- |
| Reserve on add-to-cart | Create reservation immediately | Prevents oversell early | Blocks inventory for abandoned carts | High-demand, limited-stock items |
| Decrement on checkout | Decrement only at order commit | Less inventory locking | Oversell risk unless strongly synchronized | Normal retail with ample stock |
| Queue-based allocation | Queue requests, allocate sequentially | Strong oversell prevention under spikes | Adds latency, complex UX | Flash sales, drops, ticketing |

Common pitfall: “We decrement inventory at checkout” without explaining concurrency control. That answer collapses under flash sale follow-ups.

A practical design: TTL-based reservations#

A strong interview design is reservation-based inventory with TTL:

  • When the user begins checkout (or clicks “Place order”), create an inventory reservation for each SKU.

  • Each reservation has:

    • reservation_id

    • sku_id

    • quantity

    • expires_at (TTL)

    • user_id / cart_id

    • status (ACTIVE, COMMITTED, EXPIRED, RELEASED)

This gives you a deterministic way to prevent oversell while allowing abandoned carts to self-heal.

The inventory service then exposes:

  • ReserveItems(cart_id, items, ttl)

  • CommitReservation(reservation_id)

  • ReleaseReservation(reservation_id)

  • GetAvailableStock(sku_id) (computed as on_hand - reserved_active)
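
A minimal sketch of the reserve path follows, assuming a relational `inventory` table with `on_hand` and `reserved` columns, a `reservations` table, and DB-API style `%s` placeholders; the schema and names are illustrative. The core idea is the conditional update: the reservation only succeeds if enough unreserved stock remains.

```python
import uuid
from datetime import datetime, timedelta, timezone

RESERVATION_TTL = timedelta(minutes=10)


def reserve_items(conn, cart_id: str, items: list[tuple[str, int]]) -> str | None:
    """Reserve each (sku_id, quantity) atomically; return reservation_id or None."""
    reservation_id = str(uuid.uuid4())
    expires_at = datetime.now(timezone.utc) + RESERVATION_TTL
    cur = conn.cursor()
    try:
        for sku_id, qty in items:
            # Conditional update: only succeeds if unreserved stock covers qty.
            cur.execute(
                "UPDATE inventory SET reserved = reserved + %s "
                "WHERE sku_id = %s AND on_hand - reserved >= %s",
                (qty, sku_id, qty),
            )
            if cur.rowcount == 0:   # insufficient stock for this SKU
                conn.rollback()     # release anything reserved earlier in the loop
                return None
            cur.execute(
                "INSERT INTO reservations "
                "(reservation_id, sku_id, quantity, cart_id, status, expires_at) "
                "VALUES (%s, %s, %s, %s, 'ACTIVE', %s)",
                (reservation_id, sku_id, qty, cart_id, expires_at),
            )
        conn.commit()               # all-or-nothing reservation
        return reservation_id
    except Exception:
        conn.rollback()
        raise
```

In this sketch, CommitReservation would decrement both `on_hand` and `reserved` in one transaction, while ReleaseReservation (or the TTL sweeper) only decrements `reserved`, which is how abandoned carts self-heal.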

Handling abandoned carts cleanly#

Abandoned carts are not an edge case; they are the default behavior. Most users do not complete checkout.

With TTL reservations:

  • The system automatically releases stock when expires_at passes.

  • A background sweeper or TTL index removes expired reservations.

  • The available stock recovers without manual intervention.

This is exactly the kind of operationally-safe mechanism interviewers want.

Interview signal: Reservations with TTL turn abandoned carts from a correctness bug into a routine cleanup job.

Contention during flash sales#

When one SKU becomes hot, you need to avoid “thundering herd” behavior where thousands of requests hammer the same database row.

Techniques that work in practice:

  • Keep inventory operations in a single service with a tight API.

  • Use conditional updates (compare-and-set) or row-level locking.

  • Use a queue allocator for extreme spikes (discussed later).

  • Protect inventory storage with aggressive rate limiting at the edge.

If the interviewer pushes back: "If the SKU becomes a write hotspot, I switch allocation to a queue-based approach for that SKU so requests serialize and we never oversell."

Checkout as a state machine#

Checkout is the highest-signal portion of this interview. This is where you prove you can build systems that remain correct when the world is unreliable: networks fail, retries happen, and external providers behave asynchronously.

A checkout design without a state machine is not a design. It’s wishful thinking.

Why checkout must be modeled explicitly#

Checkout has multiple steps with different failure modes:

  • inventory reservation

  • pricing validation

  • payment authorization

  • order creation

  • payment capture

  • confirmation and fulfillment triggers

Some of these steps can be retried safely. Some cannot. Some are synchronous; some are async.

The clean way to manage this is a checkout/order state machine with explicit transitions.

| State | Description | Allowed transitions |
| --- | --- | --- |
| INITIATED | Checkout request created | → RESERVING_INVENTORY, → FAILED |
| RESERVING_INVENTORY | Reserving stock with TTL | → INVENTORY_RESERVED, → OUT_OF_STOCK |
| INVENTORY_RESERVED | Reservation exists | → PAYMENT_AUTH_PENDING, → EXPIRED |
| PAYMENT_AUTH_PENDING | Calling payment provider | → PAYMENT_AUTHORIZED, → PAYMENT_FAILED |
| PAYMENT_AUTHORIZED | Authorization succeeded | → ORDER_CREATING, → AUTH_EXPIRED |
| ORDER_CREATING | Writing durable order record | → ORDER_CREATED, → ORDER_CREATE_FAILED |
| ORDER_CREATED | Order persisted | → PAYMENT_CAPTURE_PENDING, → CANCELLED |
| PAYMENT_CAPTURE_PENDING | Capturing payment | → COMPLETED, → CAPTURE_FAILED |
| COMPLETED | Final success | (terminal) |
| FAILED | Terminal failure | (terminal) |
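
The table translates directly into code. Below is a minimal sketch of the transition guard using the state names above; rejecting illegal transitions at one choke point is what makes the workflow auditable.

```python
from enum import Enum


class CheckoutState(str, Enum):
    INITIATED = "INITIATED"
    RESERVING_INVENTORY = "RESERVING_INVENTORY"
    INVENTORY_RESERVED = "INVENTORY_RESERVED"
    OUT_OF_STOCK = "OUT_OF_STOCK"
    PAYMENT_AUTH_PENDING = "PAYMENT_AUTH_PENDING"
    PAYMENT_AUTHORIZED = "PAYMENT_AUTHORIZED"
    PAYMENT_FAILED = "PAYMENT_FAILED"
    ORDER_CREATING = "ORDER_CREATING"
    ORDER_CREATED = "ORDER_CREATED"
    ORDER_CREATE_FAILED = "ORDER_CREATE_FAILED"
    PAYMENT_CAPTURE_PENDING = "PAYMENT_CAPTURE_PENDING"
    COMPLETED = "COMPLETED"
    EXPIRED = "EXPIRED"
    AUTH_EXPIRED = "AUTH_EXPIRED"
    CANCELLED = "CANCELLED"
    CAPTURE_FAILED = "CAPTURE_FAILED"
    FAILED = "FAILED"


# Legal transitions, mirroring the table above.
TRANSITIONS: dict[CheckoutState, set[CheckoutState]] = {
    CheckoutState.INITIATED: {CheckoutState.RESERVING_INVENTORY, CheckoutState.FAILED},
    CheckoutState.RESERVING_INVENTORY: {CheckoutState.INVENTORY_RESERVED, CheckoutState.OUT_OF_STOCK},
    CheckoutState.INVENTORY_RESERVED: {CheckoutState.PAYMENT_AUTH_PENDING, CheckoutState.EXPIRED},
    CheckoutState.PAYMENT_AUTH_PENDING: {CheckoutState.PAYMENT_AUTHORIZED, CheckoutState.PAYMENT_FAILED},
    CheckoutState.PAYMENT_AUTHORIZED: {CheckoutState.ORDER_CREATING, CheckoutState.AUTH_EXPIRED},
    CheckoutState.ORDER_CREATING: {CheckoutState.ORDER_CREATED, CheckoutState.ORDER_CREATE_FAILED},
    CheckoutState.ORDER_CREATED: {CheckoutState.PAYMENT_CAPTURE_PENDING, CheckoutState.CANCELLED},
    CheckoutState.PAYMENT_CAPTURE_PENDING: {CheckoutState.COMPLETED, CheckoutState.CAPTURE_FAILED},
}


def transition(current: CheckoutState, target: CheckoutState) -> CheckoutState:
    """Apply a transition only if the table allows it; otherwise fail loudly."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition {current.value} -> {target.value}")
    return target
```

Persist the current state with every checkout/order row so retries and crash recovery always resume from a known point.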

Interview signal: State machines make correctness explainable. Interviewers trust designs they can reason about step-by-step.

Idempotency keys are mandatory#

Retries happen everywhere:

  • mobile networks drop

  • users double-click “Place order”

  • load balancers retry on timeouts

  • clients retry after 5xx

Your system must treat duplicate requests as the same operation.

Use idempotency keys at the checkout boundary:

  • client sends Idempotency-Key (UUID)

  • checkout service stores (idempotency_key → checkout_id/result)

  • repeated requests return the same outcome

Also use idempotency internally:

  • payment authorization call includes a unique payment intent key

  • order creation uses a unique constraint like (user_id, idempotency_key) or (checkout_id)
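
A minimal sketch of the boundary check follows, assuming an `idempotency_keys` table with a unique key column and PostgreSQL-style `ON CONFLICT`; the table, columns, and the `run_checkout` callable are illustrative. The first writer claims the key; every retry gets the stored result back.

```python
import json


def handle_checkout(conn, idempotency_key: str, cart_id: str, run_checkout) -> dict:
    """Return the stored result for a repeated key; otherwise run checkout once."""
    cur = conn.cursor()
    # Try to claim the key. The UNIQUE constraint makes concurrent retries safe.
    cur.execute(
        "INSERT INTO idempotency_keys (idempotency_key, status) "
        "VALUES (%s, 'IN_PROGRESS') ON CONFLICT (idempotency_key) DO NOTHING",
        (idempotency_key,),
    )
    if cur.rowcount == 0:  # key already claimed: return the prior outcome
        cur.execute(
            "SELECT status, result FROM idempotency_keys WHERE idempotency_key = %s",
            (idempotency_key,),
        )
        status, result = cur.fetchone()
        return json.loads(result) if result else {"status": status}
    conn.commit()

    result = run_checkout(cart_id)  # the actual checkout workflow (not shown here)
    cur.execute(
        "UPDATE idempotency_keys SET status = 'DONE', result = %s "
        "WHERE idempotency_key = %s",
        (json.dumps(result), idempotency_key),
    )
    conn.commit()
    return result
```

The same pattern repeats one level down: the payment intent key sent to the provider and the unique constraint on order creation make internal retries safe as well.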

What to say in the interview: “Checkout is idempotent. Every request includes an idempotency key so retries never create duplicate orders or double charges.”

Checkout flow walkthrough (narrative sequence)#

In interviews, the best way to demonstrate mastery is to walk through the checkout as a story. Do it end-to-end, including what gets written where.

Here is a narrative walkthrough you can deliver in five to seven minutes.

The user clicks “Checkout.” The client sends a request to the Checkout service with the cart ID and an idempotency key. The Checkout service first loads the cart contents from the Cart service and validates that each item is still purchasable. This includes verifying SKU availability, ensuring the product is active, and re-checking pricing rules to avoid stale totals from the cart page.

Next, the Checkout service calls the Inventory service to reserve inventory for the cart items. The Inventory service creates TTL reservations and returns a reservation ID. This step must be strongly consistent because it protects against overselling. At this point, the checkout has a hard expiration window: if the user does not finish in time, the reservation expires and the user must re-checkout.

Once inventory is reserved, the Checkout service initiates payment authorization. This is not a capture yet; it is an authorization hold that confirms the payment method is valid and the funds are available. The Payment service creates a payment intent record in its database and calls the external payment provider with its own idempotency token so repeated calls do not create multiple authorizations.

When authorization succeeds, the system creates the order in the Order service. This is a durable write that must succeed exactly once. The Order service persists the order record with an initial state like ORDER_CREATED and stores the reservation ID and payment intent ID as part of the order’s audit trail. This is the moment the business commits to the purchase.

After the order exists, the system captures the payment. Capture can be synchronous or asynchronous depending on the provider and latency goals, but it must be reconciled. If capture succeeds, the order transitions to COMPLETED (or PAID) and downstream processes trigger fulfillment and notifications. If capture fails, the order transitions to a failure-handling path such as PAYMENT_CAPTURE_FAILED, and the system releases inventory or keeps it reserved depending on your business rules.

Finally, the user receives a confirmation response. Importantly, the confirmation is not merely "payment succeeded." It is "order created and payment captured," or an explicit "pending" status if capture is asynchronous.
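
The same walkthrough, compressed into an orchestration sketch. It reuses the `CheckoutState`/`transition` helpers sketched earlier; the service clients (`ctx.inventory`, `ctx.payments`, `ctx.orders`) and their method names are assumptions for illustration, not a fixed API. The point is the ordering of durable writes and the explicit state at every step.

```python
def execute_checkout(ctx, cart, idempotency_key: str) -> dict:
    """Happy-path orchestration; every failure maps to an explicit state."""
    state = CheckoutState.INITIATED

    state = transition(state, CheckoutState.RESERVING_INVENTORY)
    reservation_id = ctx.inventory.reserve_items(cart.id, cart.items)
    if reservation_id is None:
        return {"state": transition(state, CheckoutState.OUT_OF_STOCK).value}
    state = transition(state, CheckoutState.INVENTORY_RESERVED)

    # Authorization hold only; capture happens after the order is durable.
    state = transition(state, CheckoutState.PAYMENT_AUTH_PENDING)
    auth = ctx.payments.authorize(cart.total, payment_intent_key=idempotency_key)
    if not auth.ok:
        ctx.inventory.release_reservation(reservation_id)
        return {"state": transition(state, CheckoutState.PAYMENT_FAILED).value}
    state = transition(state, CheckoutState.PAYMENT_AUTHORIZED)

    # Durable order write: the moment the business commits to the purchase.
    state = transition(state, CheckoutState.ORDER_CREATING)
    order_id = ctx.orders.create(cart, reservation_id, auth.intent_id, idempotency_key)
    state = transition(state, CheckoutState.ORDER_CREATED)
    ctx.inventory.commit_reservation(reservation_id)

    state = transition(state, CheckoutState.PAYMENT_CAPTURE_PENDING)
    ctx.payments.capture(auth.intent_id)  # may be async; reconciled either way
    return {"state": transition(state, CheckoutState.COMPLETED).value, "order_id": order_id}
```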

Interview signal: The best walkthroughs say exactly what gets written to durable storage and when the system becomes committed.

Failure scenarios (the ones interviewers always ask)#

You do not need to list every possible failure. You need to cover the ones that reveal correctness maturity.

Failure: payment succeeds but order creation fails#

This is the classic distributed systems failure: an external side effect succeeds, but your internal state write fails.

Correct handling requires reconciliation:

  • Payment service stores payment intent and authorization result durably.

  • Order creation is retried with idempotency.

  • If order creation cannot succeed after retries, the system triggers a compensating action:

    • void authorization (if not captured)

    • refund (if captured)

  • A reconciliation job continuously compares:

    • authorized/captured payments without orders

    • orders without successful payment capture

What to say in the interview: “I never rely on the synchronous response alone. Payments are reconciled. If authorization succeeded but order creation failed, I retry order creation idempotently and reconcile orphaned payments with void/refund workflows.”

Failure: inventory reservation expires mid-checkout#

This happens when:

  • user takes too long

  • payment authorization is slow

  • the system is under heavy load

Correct handling:

  • reservations have expires_at

  • order creation validates reservation is still ACTIVE

  • if expired, checkout fails with a clear “items no longer available” response

  • user re-checks out with refreshed inventory

This is a correctness requirement, not a UX preference.

Common pitfall: Allowing checkout to proceed with expired reservations “because it’s rare.” Under flash sales, it becomes common.

Failure: duplicate checkout requests due to retries#

This is guaranteed to happen. Your design must treat it as normal behavior.

Correct handling:

  • idempotency key at the API boundary

  • unique constraint on order creation (checkout_id must be unique)

  • payment provider calls are idempotent via payment intent key

  • return the same order ID for repeated calls

If the interviewer pushes back: "Even if the client retries five times, I return the same order ID because the idempotency key maps to a single checkout attempt."

Handling flash sales and traffic spikes#

Flash sales are where e-commerce designs get exposed. Traffic shifts from “read-heavy with occasional writes” to “massive concurrent writes on a tiny set of SKUs.”

A strong answer describes two things:

  1. how you protect correctness (no oversell, no double-charge)

  2. how you degrade gracefully (some features sacrificed to protect the core)

Backpressure, queueing, and rate limiting#

Under extreme load, your job is not to keep every feature alive. Your job is to keep checkout safe and the system stable.

Use these tools deliberately:

  • Rate limiting at the API gateway (per IP, per user, per endpoint)

  • Queueing for checkout attempts on hot SKUs

  • Backpressure by rejecting early when downstream is overloaded

  • Circuit breakers when payment providers or inventory DB are failing

  • Load shedding for non-critical endpoints
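
As a concrete example of the first tool, here is a minimal in-process token-bucket rate limiter sketch; in production this usually lives at the gateway or in a shared store like Redis, but the mechanics are the same. The rate and burst numbers are illustrative.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject early (shed load / return 429)


# One bucket per user or per endpoint, e.g. 5 checkout attempts/sec with a burst of 10.
checkout_limiter = TokenBucket(rate=5, capacity=10)
```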

Interview signal: “Graceful degradation” means you can name what you protect and what you sacrifice.

What to protect vs what to sacrifice#

During a flash sale, you protect correctness and revenue flows. You sacrifice convenience features.

| Degradation action | What it protects | What it sacrifices |
| --- | --- | --- |
| Serve cached catalog pages | Site availability | Perfect freshness |
| Disable personalized recommendations | Checkout capacity | Personalization |
| Queue checkout attempts | No oversell | Immediate response time |
| Limit cart modifications | Inventory integrity | Flexible cart UX |
| Reduce search features (filters/sorting) | Core browse | Advanced search UX |
| Switch to async order confirmation | System stability | Instant finality |

What to say in the interview: “During a flash sale, I keep checkout safe by applying rate limits, queueing allocation for hot SKUs, and degrading non-critical features like recommendations and advanced search filters.”

A flash sale allocation pattern that works#

For extremely limited inventory, reservation-based locking can still overload the database. The queue-based approach is cleaner:

  • Requests enter a queue per SKU (or per product group).

  • A small number of workers allocate inventory sequentially.

  • If inventory is available, the worker issues a reservation token.

  • The client proceeds to payment only with a valid token.

This trades latency for correctness and system stability, which is the right trade during a flash sale.
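
A minimal sketch of that allocator under assumed names: a per-SKU queue of checkout requests and a single worker that serializes inventory decisions. In production the queue would be Kafka/SQS and the worker a consumer, and `grant_token` is a hypothetical callback that notifies the client; the serialization idea is the point.

```python
import queue
import uuid


def allocation_worker(sku_id: str, requests: "queue.Queue", stock: int, grant_token):
    """Drain checkout requests for one hot SKU, allocating strictly in arrival order."""
    while stock > 0:
        try:
            user_id, qty = requests.get(timeout=1)
        except queue.Empty:
            continue  # keep waiting for more requests while stock remains
        if qty <= stock:
            stock -= qty
            token = str(uuid.uuid4())           # reservation token to proceed to payment
            grant_token(user_id, sku_id, token)
        else:
            grant_token(user_id, sku_id, None)  # cannot satisfy this quantity
    # Stock exhausted: remaining queued users get a clean "sold out" response.
    while not requests.empty():
        user_id, _ = requests.get()
        grant_token(user_id, sku_id, None)
```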

If the interviewer pushes back: "If the database becomes the bottleneck, I move allocation to a queue-backed worker model for hot SKUs so I serialize inventory decisions and keep the system stable."

Reliability patterns that make your design credible#

At Staff level, you are expected to name the patterns that keep distributed systems correct.

Outbox pattern for durable event publishing#

If you create an order and then publish an event (“OrderCreated”), you must avoid losing the event when the process crashes.

Use an outbox table:

  • write order + outbox event in the same DB transaction

  • a background publisher reads outbox and publishes to the event bus

  • mark outbox event as published

This makes workflows reliable without requiring distributed transactions across services.
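
A minimal sketch of the write side, assuming an `outbox` table next to the `orders` table (names and columns are illustrative): the order row and its event land in one transaction, and a separate publisher process drains the table.

```python
import json
import uuid


def create_order_with_outbox(conn, order: dict) -> str:
    """Write the order and its OrderCreated event atomically."""
    order_id = str(uuid.uuid4())
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO orders (order_id, user_id, total, status) "
        "VALUES (%s, %s, %s, 'ORDER_CREATED')",
        (order_id, order["user_id"], order["total"]),
    )
    cur.execute(
        "INSERT INTO outbox (event_id, event_type, payload, published) "
        "VALUES (%s, 'OrderCreated', %s, FALSE)",
        (str(uuid.uuid4()), json.dumps({"order_id": order_id})),
    )
    conn.commit()  # both rows or neither: the event can never be lost
    return order_id


def publish_outbox(conn, publish):
    """Background publisher: push unpublished events to the event bus."""
    cur = conn.cursor()
    cur.execute("SELECT event_id, event_type, payload FROM outbox WHERE published = FALSE")
    for event_id, event_type, payload in cur.fetchall():
        publish(event_type, payload)  # at-least-once delivery; consumers must dedupe
        cur.execute("UPDATE outbox SET published = TRUE WHERE event_id = %s", (event_id,))
    conn.commit()
```

Because the publisher can deliver an event more than once, downstream consumers should be idempotent as well.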

Interview signal: Outbox is a concrete reliability mechanism that demonstrates real production experience.

Reconciliation jobs are part of correctness#

You cannot rely on synchronous calls alone. External systems (payments) are asynchronous, and failures create drift.

A good design includes periodic reconciliation:

  • payments without orders

  • orders without captured payments

  • expired reservations still marked active

  • mismatch between inventory reserved vs allocated

Reconciliation is not a patch. It is the safety net that keeps trust intact.
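
A minimal sketch of one reconciliation pass under an assumed schema and PostgreSQL-style SQL: it finds payments that were authorized or captured but never produced an order, and hands them to a compensating workflow. The grace period and `refund_or_void` callback are illustrative.

```python
def reconcile_orphaned_payments(conn, refund_or_void):
    """Find authorized/captured payments with no matching order and compensate."""
    cur = conn.cursor()
    cur.execute(
        "SELECT p.payment_intent_id FROM payments p "
        "LEFT JOIN orders o ON o.payment_intent_id = p.payment_intent_id "
        "WHERE p.status IN ('AUTHORIZED', 'CAPTURED') "
        "AND o.order_id IS NULL "
        "AND p.created_at < now() - interval '30 minutes'"  # give retries time to finish
    )
    for (payment_intent_id,) in cur.fetchall():
        refund_or_void(payment_intent_id)  # void if only authorized, refund if captured
```

The mirror-image query (orders without a captured payment) drives the other half of the drift: retry the capture or cancel the order.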

Observability: the metrics interviewers respect#

An e-commerce system without observability is incomplete. In interviews, you do not need to list dozens of dashboards. You need the right few metrics that map to correctness and revenue.

Start by stating that you monitor:

  • user experience

  • system health

  • correctness invariants

Then name the metrics that prove those are under control:

  • Oversell rate: number of orders created beyond available stock

  • Reservation expiration rate: % of reservations expiring before commit

  • Checkout success rate: successful orders / checkout attempts

  • Checkout failure rate by reason: out-of-stock vs payment failure vs timeout

  • Payment authorization latency (p95/p99): external dependency health

  • Duplicate checkout suppression count: how often idempotency prevents duplication

What to say in the interview: "I track oversell rate, reservation expiry rate, checkout failure rate by reason, and payment latency percentiles because they directly map to revenue loss and user trust."

Common mistakes that cost candidates offers#

Most candidates fail this interview in predictable ways. Avoid these and you immediately move into the top tier.

Common pitfall: Designing checkout as a single synchronous call with no idempotency and no state machine. That design breaks the moment retries and failures appear.

Other frequent mistakes:

  • treating inventory as “just a field in the product table”

  • using eventual consistency for orders or payments

  • hand-waving flash sales with “we’ll autoscale”

  • skipping failure scenarios and reconciliation

  • designing ten microservices without ownership boundaries

Final interview checklist (quick recap)#

You don’t need a perfect design. You need a design that is coherent, correct, and explainable under pressure.

  • State assumptions early and clearly

  • Separate read-heavy flows from correctness-critical flows

  • Treat inventory as a concurrency problem

  • Model checkout as a state machine

  • Use idempotency keys everywhere retries exist

  • Plan for failures and reconciliation

  • Design flash sale behavior with graceful degradation

  • Show observability metrics tied to trust and revenue

Interview signal: The best answer is the one the interviewer can stress-test without it collapsing.

Happy learning!


Written By:
Zarish Khalid