E-Commerce System Design

How to ace an e-commerce System Design interview: design checkout for correctness with inventory reservations + idempotency + a state machine, and scale for spikes with rate limits, queues, and graceful degradation.

14 mins read
Feb 02, 2026

Designing an e-commerce platform is one of the most common System Design interview prompts because it looks straightforward and still exposes nearly every skill an interviewer wants to test: traffic shaping, data modeling, correctness, reliability, and the ability to reason about trade-offs under pressure.

The trap is familiarity. Everyone has browsed a catalog, added items to a cart, and checked out. In an interview, that familiarity often produces vague answers like “we’ll use microservices” or “we’ll scale horizontally.” That is not what gets you hired.

This blog is a practical, interview-focused blueprint for designing an e-commerce system that is correct under failure, scalable under spikes, and explainable under follow-up questions.

Interview signal: Strong candidates don’t just draw boxes. They explain why each box exists, what it owns, and what guarantees it provides.

How to structure your interview answer (the winning workflow)#

Acing this interview is mostly about sequencing. Interviewers reward engineers who stay structured and build the system in layers, instead of jumping straight into implementation details.

Start with the following workflow and narrate it as you go:

  1. Clarify requirements and constraints (business + technical)

  2. Identify core user flows and read/write patterns

  3. Propose a high-level architecture (services + responsibilities)

  4. Design data models and consistency boundaries

  5. Deep dive into checkout correctness (inventory + payments + idempotency)

  6. Handle failure scenarios and retries

  7. Scale for spikes and flash sales with graceful degradation

  8. Add observability and operational signals

This structure makes your answer interview-proof because it creates obvious “checkpoints” where the interviewer can interrupt you, and you can resume cleanly.

What to say in the interview: “I’ll start by locking requirements and traffic patterns, then propose a high-level architecture, and finally deep dive on checkout correctness and failure handling, since that’s the hardest part.”

Clarify the problem space like a Staff engineer#

Before you design anything, explicitly define what you’re building. E-commerce can mean a small marketplace app or a global Amazon-scale platform. You do not need every feature; you need the right feature set for the interview.

Anchor your scope around the most interview-relevant flows:

  • Browsing and product detail pages (read-heavy, cache-friendly)

  • Search (read-heavy, eventually consistent)

  • Cart management (session-like state, recoverable)

  • Checkout (write-heavy, correctness-critical)

  • Orders and payments (durable, auditable, immutable history)

  • Inventory (high contention, race conditions, overselling risk)

Then state the constraints you’re assuming. This is where you gain control of the problem.

Examples of high-value assumptions:

  • “We have one region initially, then we can discuss multi-region.”

  • “We support card payments via an external payment provider.”

  • “We treat checkout as correctness-first; we tolerate some staleness in catalog/search.”

  • “We need to survive retries and partial failures without double-charging or double-ordering.”

Interview signal: You win points by explicitly separating “fast and eventually consistent” paths (browse/search) from “durable and auditable” paths (orders/payments).

Core user flows and what they imply#

User flows are not just features. They determine read/write patterns, hotspots, and what breaks under load.

Here’s a crisp way to present that:

| Core user flow | Read/write pattern | Design challenge |
| --- | --- | --- |
| Browse catalog / product details | Mostly reads, cacheable | Cache invalidation, stale tolerance, fast p99 latency |
| Search products | Reads + indexing writes | Eventual consistency, ranking, query fan-out |
| Add to cart / update cart | Writes per user session | Low latency, cart expiration, merge across devices |
| View cart | Reads + derived totals | Pricing accuracy, promotions, tax/shipping estimation |
| Checkout | Burst writes | Idempotency, inventory contention, payment correctness |
| View orders | Reads, durable history | Auditable state, consistent status, pagination |

Common pitfall: Treating all flows as equal. Interviews reward prioritization: checkout correctness beats perfect search freshness.

High-level architecture (services you actually need)#

Once the scope is clear, define the system as a set of responsibilities. Avoid over-indexing on microservices vocabulary. In interviews, “microservices” is not an architecture; it’s a deployment style.

You want to describe a design where:

  • Reads are fast via caching and replicas

  • Writes are safe via transactions, idempotency, and durable logs

  • Cross-service workflows are resilient via async messaging and reconciliation

A practical decomposition looks like this:

| Service | Responsibility | State / storage |
| --- | --- | --- |
| API Gateway / BFF | Auth, routing, rate limits, aggregation | None (stateless) |
| Catalog service | Product details, categories, metadata | Document store or relational DB + cache |
| Search service | Query parsing, ranking, indexing pipeline | Search index (e.g., Elasticsearch/OpenSearch) |
| Pricing service | Base price, discounts, promo rules | Relational DB + cache |
| Cart service | Cart CRUD, cart TTL, merge carts | Key-value store (Redis/DynamoDB) |
| Inventory service | Stock counts, reservations, allocation | Relational DB (strong consistency) |
| Checkout service | Orchestrates checkout workflow | Minimal state + idempotency store |
| Order service | Order creation, state transitions | Relational DB (ACID) |
| Payment service | Payment intents, authorization/capture, webhooks | Relational DB + outbox |
| Notification service | Email/SMS confirmations | Queue + provider integration |
| Event bus | Async workflows, fan-out updates | Kafka/PubSub/SQS-like queue |

This table is interview-friendly because it tells a story: reads live in catalog/search; correctness lives in inventory/orders/payments; workflows connect them.

What to say in the interview: “I’m separating browse/search from checkout/order because they need different guarantees. Browse favors latency and caching; checkout favors correctness, durability, and idempotency.”

Where caching belongs (and where it does not)#

Caching is essential, but careless caching causes correctness bugs.

Use caching aggressively for:

  • Product details (with TTL and invalidation)

  • Category listings

  • Popular search queries

  • Pricing lookups (short TTL, careful with promotions)

Avoid caching as the source of truth for:

  • Orders

  • Payments

  • Inventory availability during checkout

A strong interview statement is:

Interview signal: Cache improves performance, but correctness-sensitive state must remain authoritative in a durable store.
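
To make that boundary concrete, here is a minimal cache-aside sketch for product detail reads. It assumes a Redis client via redis-py and a hypothetical `fetch_product_from_db` helper; the key format and TTL are illustrative choices, not prescriptions.

```python
import json
import redis  # any key-value cache works the same way; Redis is just an example

cache = redis.Redis(host="localhost", port=6379)
PRODUCT_TTL_SECONDS = 300  # short TTL keeps staleness bounded


def fetch_product_from_db(product_id: str) -> dict:
    """Placeholder for the authoritative catalog read (hypothetical helper)."""
    raise NotImplementedError


def get_product(product_id: str) -> dict:
    """Cache-aside read: serve from cache, fall back to the catalog DB."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    product = fetch_product_from_db(product_id)  # authoritative source
    cache.set(key, json.dumps(product), ex=PRODUCT_TTL_SECONDS)
    return product


def invalidate_product(product_id: str) -> None:
    """Call on catalog updates so readers stop seeing stale data before the TTL."""
    cache.delete(f"product:{product_id}")
```

Note that nothing in checkout reads this cache: availability decisions during checkout come from the inventory service's durable store, exactly as the callout above says.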

Data consistency: what can be stale vs what must be correct#

E-commerce systems contain multiple “classes” of data. Treating them the same creates either unnecessary complexity or dangerous oversimplification.

Eventually consistent data (acceptable staleness)#

These domains can be eventually consistent because the user impact is limited and recoverable:

  • Catalog content (product name, description, images)

  • Search index results

  • “Recommended products”

  • Analytics events

  • View counters, trending lists

If the catalog says a product is available but it is actually out of stock, that is annoying. If the system charges someone for an order that was never created, that is unacceptable.

Strongly consistent, durable, auditable data#

These domains must be correct, durable, and traceable:

  • Orders (must not disappear)

  • Payments (must be reconciled)

  • Inventory allocation (must prevent oversell)

  • Refunds and chargebacks

  • Order state transitions (must be auditable)

A clean interview framing is:

What to say in the interview: “I’ll allow eventual consistency for catalog and search, but orders, payments, and inventory allocation require durable writes and auditability.”

Inventory reservation and overselling prevention#

Inventory is where many “good” designs fail under real-world concurrency. Overselling happens when multiple checkouts race to claim the same units.

The interviewer expects you to handle:

  • Contention (flash sales)

  • Abandoned carts

  • Reservation expiration (TTL)

  • Correctness under retries

Inventory is not just “a number.” It’s a contract.

The core problem: inventory is a shared resource under race#

During normal traffic, inventory updates are manageable. During flash sales, a single SKU becomes a hotspot and turns into a write bottleneck. If you decrement stock at the wrong time, you either oversell or block legitimate buyers.

You need an explicit strategy. Here are the common ones:

| Strategy | Mechanism | Pros | Cons | Best for |
| --- | --- | --- | --- | --- |
| Reserve on add-to-cart | Create reservation immediately | Prevents oversell early | Blocks inventory for abandoned carts | High-demand, limited-stock items |
| Decrement on checkout | Decrement only at order commit | Less inventory locking | Oversell risk unless strongly synchronized | Normal retail with ample stock |
| Queue-based allocation | Queue requests, allocate sequentially | Strong oversell prevention under spikes | Adds latency, complex UX | Flash sales, drops, ticketing |

Common pitfall: “We decrement inventory at checkout” without explaining concurrency control. That answer collapses under flash sale follow-ups.

A practical design: TTL-based reservations#

A strong interview design is reservation-based inventory with TTL:

  • When the user begins checkout (or clicks “Place order”), create an inventory reservation for each SKU.

  • Each reservation has:

    • reservation_id

    • sku_id

    • quantity

    • expires_at (TTL)

    • user_id / cart_id

    • status (ACTIVE, COMMITTED, EXPIRED, RELEASED)

This gives you a deterministic way to prevent oversell while allowing abandoned carts to self-heal.

The inventory service then exposes:

  • ReserveItems(cart_id, items, ttl)

  • CommitReservation(reservation_id)

  • ReleaseReservation(reservation_id)

  • GetAvailableStock(sku_id) (computed as on_hand - reserved_active)
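
A minimal sketch of the reserve path follows, assuming a relational `inventory` table with `on_hand` and `reserved` columns, a `reservations` table, and DB-API style `%s` placeholders; the schema and names are illustrative. The core idea is the conditional update: the reservation only succeeds if enough unreserved stock remains.

```python
import uuid
from datetime import datetime, timedelta, timezone

RESERVATION_TTL = timedelta(minutes=10)


def reserve_items(conn, cart_id: str, items: list[tuple[str, int]]) -> str | None:
    """Reserve each (sku_id, quantity) atomically; return reservation_id or None."""
    reservation_id = str(uuid.uuid4())
    expires_at = datetime.now(timezone.utc) + RESERVATION_TTL
    cur = conn.cursor()
    try:
        for sku_id, qty in items:
            # Conditional update: only succeeds if unreserved stock covers qty.
            cur.execute(
                "UPDATE inventory SET reserved = reserved + %s "
                "WHERE sku_id = %s AND on_hand - reserved >= %s",
                (qty, sku_id, qty),
            )
            if cur.rowcount == 0:   # insufficient stock for this SKU
                conn.rollback()     # release anything reserved earlier in the loop
                return None
            cur.execute(
                "INSERT INTO reservations "
                "(reservation_id, sku_id, quantity, cart_id, status, expires_at) "
                "VALUES (%s, %s, %s, %s, 'ACTIVE', %s)",
                (reservation_id, sku_id, qty, cart_id, expires_at),
            )
        conn.commit()               # all-or-nothing reservation
        return reservation_id
    except Exception:
        conn.rollback()
        raise
```

In this sketch, CommitReservation would decrement both `on_hand` and `reserved` in one transaction, while ReleaseReservation (or the TTL sweeper) only decrements `reserved`, which is how abandoned carts self-heal.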

Handling abandoned carts cleanly#

Abandoned carts are not an edge case; they are the default behavior. Most users do not complete checkout.

With TTL reservations:

  • The system automatically releases stock when expires_at passes.

  • A background sweeper or TTL index removes expired reservations.

  • The available stock recovers without manual intervention.

This is exactly the kind of operationally-safe mechanism interviewers want.

Interview signal: Reservations with TTL turn abandoned carts from a correctness bug into a routine cleanup job.

Contention during flash sales#

When one SKU becomes hot, you need to avoid “thundering herd” behavior where thousands of requests hammer the same database row.

Techniques that work in practice:

  • Keep inventory operations in a single service with a tight API.

  • Use conditional updates (compare-and-set) or row-level locking.

  • Use a queue allocator for extreme spikes (discussed later).

  • Protect inventory storage with aggressive rate limiting at the edge.

If the interviewer pushes back: "If the SKU becomes a write hotspot, I switch allocation to a queue-based approach for that SKU so requests serialize and we never oversell."

Checkout as a state machine#

Checkout is the highest-signal portion of this interview. This is where you prove you can build systems that remain correct when the world is unreliable: networks fail, retries happen, and external providers behave asynchronously.

A checkout design without a state machine is not a design. It’s wishful thinking.

Why checkout must be modeled explicitly#

Checkout has multiple steps with different failure modes:

  • inventory reservation

  • pricing validation

  • payment authorization

  • order creation

  • payment capture

  • confirmation and fulfillment triggers

Some of these steps can be retried safely. Some cannot. Some are synchronous; some are async.

The clean way to manage this is a checkout/order state machine with explicit transitions.

| State | Description | Allowed transitions |
| --- | --- | --- |
| INITIATED | Checkout request created | → RESERVING_INVENTORY, → FAILED |
| RESERVING_INVENTORY | Reserving stock with TTL | → INVENTORY_RESERVED, → OUT_OF_STOCK |
| INVENTORY_RESERVED | Reservation exists | → PAYMENT_AUTH_PENDING, → EXPIRED |
| PAYMENT_AUTH_PENDING | Calling payment provider | → PAYMENT_AUTHORIZED, → PAYMENT_FAILED |
| PAYMENT_AUTHORIZED | Authorization succeeded | → ORDER_CREATING, → AUTH_EXPIRED |
| ORDER_CREATING | Writing durable order record | → ORDER_CREATED, → ORDER_CREATE_FAILED |
| ORDER_CREATED | Order persisted | → PAYMENT_CAPTURE_PENDING, → CANCELLED |
| PAYMENT_CAPTURE_PENDING | Capturing payment | → COMPLETED, → CAPTURE_FAILED |
| COMPLETED | Final success | (terminal) |
| FAILED | Terminal failure | (terminal) |
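
The table translates directly into code. Below is a minimal sketch of the transition guard using the state names above; rejecting illegal transitions at one choke point is what makes the workflow auditable.

```python
from enum import Enum


class CheckoutState(str, Enum):
    INITIATED = "INITIATED"
    RESERVING_INVENTORY = "RESERVING_INVENTORY"
    INVENTORY_RESERVED = "INVENTORY_RESERVED"
    OUT_OF_STOCK = "OUT_OF_STOCK"
    PAYMENT_AUTH_PENDING = "PAYMENT_AUTH_PENDING"
    PAYMENT_AUTHORIZED = "PAYMENT_AUTHORIZED"
    PAYMENT_FAILED = "PAYMENT_FAILED"
    ORDER_CREATING = "ORDER_CREATING"
    ORDER_CREATED = "ORDER_CREATED"
    ORDER_CREATE_FAILED = "ORDER_CREATE_FAILED"
    PAYMENT_CAPTURE_PENDING = "PAYMENT_CAPTURE_PENDING"
    COMPLETED = "COMPLETED"
    EXPIRED = "EXPIRED"
    AUTH_EXPIRED = "AUTH_EXPIRED"
    CANCELLED = "CANCELLED"
    CAPTURE_FAILED = "CAPTURE_FAILED"
    FAILED = "FAILED"


# Legal transitions, mirroring the table above.
TRANSITIONS: dict[CheckoutState, set[CheckoutState]] = {
    CheckoutState.INITIATED: {CheckoutState.RESERVING_INVENTORY, CheckoutState.FAILED},
    CheckoutState.RESERVING_INVENTORY: {CheckoutState.INVENTORY_RESERVED, CheckoutState.OUT_OF_STOCK},
    CheckoutState.INVENTORY_RESERVED: {CheckoutState.PAYMENT_AUTH_PENDING, CheckoutState.EXPIRED},
    CheckoutState.PAYMENT_AUTH_PENDING: {CheckoutState.PAYMENT_AUTHORIZED, CheckoutState.PAYMENT_FAILED},
    CheckoutState.PAYMENT_AUTHORIZED: {CheckoutState.ORDER_CREATING, CheckoutState.AUTH_EXPIRED},
    CheckoutState.ORDER_CREATING: {CheckoutState.ORDER_CREATED, CheckoutState.ORDER_CREATE_FAILED},
    CheckoutState.ORDER_CREATED: {CheckoutState.PAYMENT_CAPTURE_PENDING, CheckoutState.CANCELLED},
    CheckoutState.PAYMENT_CAPTURE_PENDING: {CheckoutState.COMPLETED, CheckoutState.CAPTURE_FAILED},
}


def transition(current: CheckoutState, target: CheckoutState) -> CheckoutState:
    """Apply a transition only if the table allows it; otherwise fail loudly."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition {current.value} -> {target.value}")
    return target
```

Persist the current state with every checkout/order row so retries and crash recovery always resume from a known point.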

Interview signal: State machines make correctness explainable. Interviewers trust designs they can reason about step-by-step.

Idempotency keys are mandatory#

Retries happen everywhere:

  • mobile networks drop

  • users double-click “Place order”

  • load balancers retry on timeouts

  • clients retry after 5xx

Your system must treat duplicate requests as the same operation.

Use idempotency keys at the checkout boundary:

  • client sends Idempotency-Key (UUID)

  • checkout service stores (idempotency_key → checkout_id/result)

  • repeated requests return the same outcome

Also use idempotency internally:

  • payment authorization call includes a unique payment intent key

  • order creation uses a unique constraint like (user_id, idempotency_key) or (checkout_id)
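
A minimal sketch of the boundary check follows, assuming an `idempotency_keys` table with a unique key column and PostgreSQL-style `ON CONFLICT`; the table, columns, and the `run_checkout` callable are illustrative. The first writer claims the key; every retry gets the stored result back.

```python
import json


def handle_checkout(conn, idempotency_key: str, cart_id: str, run_checkout) -> dict:
    """Return the stored result for a repeated key; otherwise run checkout once."""
    cur = conn.cursor()
    # Try to claim the key. The UNIQUE constraint makes concurrent retries safe.
    cur.execute(
        "INSERT INTO idempotency_keys (idempotency_key, status) "
        "VALUES (%s, 'IN_PROGRESS') ON CONFLICT (idempotency_key) DO NOTHING",
        (idempotency_key,),
    )
    if cur.rowcount == 0:  # key already claimed: return the prior outcome
        cur.execute(
            "SELECT status, result FROM idempotency_keys WHERE idempotency_key = %s",
            (idempotency_key,),
        )
        status, result = cur.fetchone()
        return json.loads(result) if result else {"status": status}
    conn.commit()

    result = run_checkout(cart_id)  # the actual checkout workflow (not shown here)
    cur.execute(
        "UPDATE idempotency_keys SET status = 'DONE', result = %s "
        "WHERE idempotency_key = %s",
        (json.dumps(result), idempotency_key),
    )
    conn.commit()
    return result
```

The same pattern repeats one level down: the payment intent key sent to the provider and the unique constraint on order creation make internal retries safe as well.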

What to say in the interview: “Checkout is idempotent. Every request includes an idempotency key so retries never create duplicate orders or double charges.”

Checkout flow walkthrough (narrative sequence)#

In interviews, the best way to demonstrate mastery is to walk through the checkout as a story. Do it end-to-end, including what gets written where.

Here is a narrative walkthrough you can deliver in five to seven minutes.

The user clicks “Checkout.” The client sends a request to the Checkout service with the cart ID and an idempotency key. The Checkout service first loads the cart contents from the Cart service and validates that each item is still purchasable. This includes verifying SKU availability, ensuring the product is active, and re-checking pricing rules to avoid stale totals from the cart page.

Next, the Checkout service calls the Inventory service to reserve inventory for the cart items. The Inventory service creates TTL reservations and returns a reservation ID. This step must be strongly consistent because it protects against overselling. At this point, the checkout has a hard expiration window: if the user does not finish in time, the reservation expires and the user must re-checkout.

Once inventory is reserved, the Checkout service initiates payment authorization. This is not a capture yet; it is an authorization hold that confirms the payment method is valid and the funds are available. The Payment service creates a payment intent record in its database and calls the external payment provider with its own idempotency token so repeated calls do not create multiple authorizations.

When authorization succeeds, the system creates the order in the Order service. This is a durable write that must succeed exactly once. The Order service persists the order record with an initial state like ORDER_CREATED and stores the reservation ID and payment intent ID as part of the order’s audit trail. This is the moment the business commits to the purchase.

After the order exists, the system captures the payment. Capture can be synchronous or asynchronous depending on the provider and latency goals, but it must be reconciled. If capture succeeds, the order transitions to COMPLETED (or PAID) and downstream processes trigger fulfillment and notifications. If capture fails, the order transitions to a failure-handling path such as PAYMENT_CAPTURE_FAILED, and the system releases inventory or keeps it reserved depending on your business rules.

Finally, the user receives a confirmation response. Importantly, the confirmation is not merely "payment succeeded." It is "order created and payment captured," or an explicit "pending" status if capture is asynchronous.
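
The same walkthrough, compressed into an orchestration sketch. It reuses the `CheckoutState`/`transition` helpers sketched earlier; the service clients (`ctx.inventory`, `ctx.payments`, `ctx.orders`) and their method names are assumptions for illustration, not a fixed API. The point is the ordering of durable writes and the explicit state at every step.

```python
def execute_checkout(ctx, cart, idempotency_key: str) -> dict:
    """Happy-path orchestration; every failure maps to an explicit state."""
    state = CheckoutState.INITIATED

    state = transition(state, CheckoutState.RESERVING_INVENTORY)
    reservation_id = ctx.inventory.reserve_items(cart.id, cart.items)
    if reservation_id is None:
        return {"state": transition(state, CheckoutState.OUT_OF_STOCK).value}
    state = transition(state, CheckoutState.INVENTORY_RESERVED)

    # Authorization hold only; capture happens after the order is durable.
    state = transition(state, CheckoutState.PAYMENT_AUTH_PENDING)
    auth = ctx.payments.authorize(cart.total, payment_intent_key=idempotency_key)
    if not auth.ok:
        ctx.inventory.release_reservation(reservation_id)
        return {"state": transition(state, CheckoutState.PAYMENT_FAILED).value}
    state = transition(state, CheckoutState.PAYMENT_AUTHORIZED)

    # Durable order write: the moment the business commits to the purchase.
    state = transition(state, CheckoutState.ORDER_CREATING)
    order_id = ctx.orders.create(cart, reservation_id, auth.intent_id, idempotency_key)
    state = transition(state, CheckoutState.ORDER_CREATED)
    ctx.inventory.commit_reservation(reservation_id)

    state = transition(state, CheckoutState.PAYMENT_CAPTURE_PENDING)
    ctx.payments.capture(auth.intent_id)  # may be async; reconciled either way
    return {"state": transition(state, CheckoutState.COMPLETED).value, "order_id": order_id}
```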

Interview signal: The best walkthroughs say exactly what gets written to durable storage and when the system becomes committed.

Failure scenarios (the ones interviewers always ask)#

You do not need to list every possible failure. You need to cover the ones that reveal correctness maturity.

Failure: payment succeeds but order creation fails#

This is the classic distributed systems failure: an external side effect succeeds, but your internal state write fails.

Correct handling requires reconciliation:

  • Payment service stores payment intent and authorization result durably.

  • Order creation is retried with idempotency.

  • If order creation cannot succeed after retries, the system triggers a compensating action:

    • void authorization (if not captured)

    • refund (if captured)

  • A reconciliation job continuously compares:

    • authorized/captured payments without orders

    • orders without successful payment capture

What to say in the interview: “I never rely on the synchronous response alone. Payments are reconciled. If authorization succeeded but order creation failed, I retry order creation idempotently and reconcile orphaned payments with void/refund workflows.”

Failure: inventory reservation expires mid-checkout#

This happens when:

  • user takes too long

  • payment authorization is slow

  • the system is under heavy load

Correct handling:

  • reservations have expires_at

  • order creation validates reservation is still ACTIVE

  • if expired, checkout fails with a clear “items no longer available” response

  • user re-checks out with refreshed inventory

This is a correctness requirement, not a UX preference.

Common pitfall: Allowing checkout to proceed with expired reservations “because it’s rare.” Under flash sales, it becomes common.

Failure: duplicate checkout requests due to retries#

This is guaranteed to happen. Your design must treat it as normal behavior.

Correct handling:

  • idempotency key at the API boundary

  • unique constraint on order creation (checkout_id must be unique)

  • payment provider calls are idempotent via payment intent key

  • return the same order ID for repeated calls

If the interviewer pushes back: "Even if the client retries five times, I return the same order ID because the idempotency key maps to a single checkout attempt."

Handling flash sales and traffic spikes#

Flash sales are where e-commerce designs get exposed. Traffic shifts from “read-heavy with occasional writes” to “massive concurrent writes on a tiny set of SKUs.”

A strong answer describes two things:

  1. how you protect correctness (no oversell, no double-charge)

  2. how you degrade gracefully (some features sacrificed to protect the core)

Backpressure, queueing, and rate limiting#

Under extreme load, your job is not to keep every feature alive. Your job is to keep checkout safe and the system stable.

Use these tools deliberately:

  • Rate limiting at the API gateway (per IP, per user, per endpoint)

  • Queueing for checkout attempts on hot SKUs

  • Backpressure by rejecting early when downstream is overloaded

  • Circuit breakers when payment providers or inventory DB are failing

  • Load shedding for non-critical endpoints
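
As a concrete example of the first tool, here is a minimal in-process token-bucket rate limiter sketch; in production this usually lives at the gateway or in a shared store like Redis, but the mechanics are the same. The rate and burst numbers are illustrative.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject early (shed load / return 429)


# One bucket per user or per endpoint, e.g. 5 checkout attempts/sec with a burst of 10.
checkout_limiter = TokenBucket(rate=5, capacity=10)
```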

Interview signal: “Graceful degradation” means you can name what you protect and what you sacrifice.

What to protect vs what to sacrifice#

During a flash sale, you protect correctness and revenue flows. You sacrifice convenience features.

| Degradation action | What it protects | What it sacrifices |
| --- | --- | --- |
| Serve cached catalog pages | Site availability | Perfect freshness |
| Disable personalized recommendations | Checkout capacity | Personalization |
| Queue checkout attempts | No oversell | Immediate response time |
| Limit cart modifications | Inventory integrity | Flexible cart UX |
| Reduce search features (filters/sorting) | Core browse | Advanced search UX |
| Switch to async order confirmation | System stability | Instant finality |

What to say in the interview: “During a flash sale, I keep checkout safe by applying rate limits, queueing allocation for hot SKUs, and degrading non-critical features like recommendations and advanced search filters.”

A flash sale allocation pattern that works#

For extremely limited inventory, reservation-based locking can still overload the database. The queue-based approach is cleaner:

  • Requests enter a queue per SKU (or per product group).

  • A small number of workers allocate inventory sequentially.

  • If inventory is available, the worker issues a reservation token.

  • The client proceeds to payment only with a valid token.

This trades latency for correctness and system stability, which is the right trade during a flash sale.
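
A minimal sketch of that allocator under assumed names: a per-SKU queue of checkout requests and a single worker that serializes inventory decisions. In production the queue would be Kafka/SQS and the worker a consumer, and `grant_token` is a hypothetical callback that notifies the client; the serialization idea is the point.

```python
import queue
import uuid


def allocation_worker(sku_id: str, requests: "queue.Queue", stock: int, grant_token):
    """Drain checkout requests for one hot SKU, allocating strictly in arrival order."""
    while stock > 0:
        try:
            user_id, qty = requests.get(timeout=1)
        except queue.Empty:
            continue  # keep waiting for more requests while stock remains
        if qty <= stock:
            stock -= qty
            token = str(uuid.uuid4())           # reservation token to proceed to payment
            grant_token(user_id, sku_id, token)
        else:
            grant_token(user_id, sku_id, None)  # cannot satisfy this quantity
    # Stock exhausted: remaining queued users get a clean "sold out" response.
    while not requests.empty():
        user_id, _ = requests.get()
        grant_token(user_id, sku_id, None)
```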

If the interviewer pushes back: "If the database becomes the bottleneck, I move allocation to a queue-backed worker model for hot SKUs so I serialize inventory decisions and keep the system stable."

Reliability patterns that make your design credible#

At Staff level, you are expected to name the patterns that keep distributed systems correct.

Outbox pattern for durable event publishing#

If you create an order and then publish an event (“OrderCreated”), you must avoid losing the event when the process crashes.

Use an outbox table:

  • write order + outbox event in the same DB transaction

  • a background publisher reads outbox and publishes to the event bus

  • mark outbox event as published

This makes workflows reliable without requiring distributed transactions across services.
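
A minimal sketch of the write side, assuming an `outbox` table next to the `orders` table (names and columns are illustrative): the order row and its event land in one transaction, and a separate publisher process drains the table.

```python
import json
import uuid


def create_order_with_outbox(conn, order: dict) -> str:
    """Write the order and its OrderCreated event atomically."""
    order_id = str(uuid.uuid4())
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO orders (order_id, user_id, total, status) "
        "VALUES (%s, %s, %s, 'ORDER_CREATED')",
        (order_id, order["user_id"], order["total"]),
    )
    cur.execute(
        "INSERT INTO outbox (event_id, event_type, payload, published) "
        "VALUES (%s, 'OrderCreated', %s, FALSE)",
        (str(uuid.uuid4()), json.dumps({"order_id": order_id})),
    )
    conn.commit()  # both rows or neither: the event can never be lost
    return order_id


def publish_outbox(conn, publish):
    """Background publisher: push unpublished events to the event bus."""
    cur = conn.cursor()
    cur.execute("SELECT event_id, event_type, payload FROM outbox WHERE published = FALSE")
    for event_id, event_type, payload in cur.fetchall():
        publish(event_type, payload)  # at-least-once delivery; consumers must dedupe
        cur.execute("UPDATE outbox SET published = TRUE WHERE event_id = %s", (event_id,))
    conn.commit()
```

Because the publisher can deliver an event more than once, downstream consumers should be idempotent as well.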

Interview signal: Outbox is a concrete reliability mechanism that demonstrates real production experience.

Reconciliation jobs are part of correctness#

You cannot rely on synchronous calls alone. External systems (payments) are asynchronous, and failures create drift.

A good design includes periodic reconciliation:

  • payments without orders

  • orders without captured payments

  • expired reservations still marked active

  • mismatch between inventory reserved vs allocated

Reconciliation is not a patch. It is the safety net that keeps trust intact.
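
A minimal sketch of one reconciliation pass under an assumed schema and PostgreSQL-style SQL: it finds payments that were authorized or captured but never produced an order, and hands them to a compensating workflow. The grace period and `refund_or_void` callback are illustrative.

```python
def reconcile_orphaned_payments(conn, refund_or_void):
    """Find authorized/captured payments with no matching order and compensate."""
    cur = conn.cursor()
    cur.execute(
        "SELECT p.payment_intent_id FROM payments p "
        "LEFT JOIN orders o ON o.payment_intent_id = p.payment_intent_id "
        "WHERE p.status IN ('AUTHORIZED', 'CAPTURED') "
        "AND o.order_id IS NULL "
        "AND p.created_at < now() - interval '30 minutes'"  # give retries time to finish
    )
    for (payment_intent_id,) in cur.fetchall():
        refund_or_void(payment_intent_id)  # void if only authorized, refund if captured
```

The mirror-image query (orders without a captured payment) drives the other half of the drift: retry the capture or cancel the order.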

Observability: the metrics interviewers respect#

An e-commerce system without observability is incomplete. In interviews, you do not need to list dozens of dashboards. You need the right few metrics that map to correctness and revenue.

Start by stating that you monitor:

  • user experience

  • system health

  • correctness invariants

Then name the metrics that prove those are under control:

  • Oversell rate: number of orders created beyond available stock

  • Reservation expiration rate: % of reservations expiring before commit

  • Checkout success rate: successful orders / checkout attempts

  • Checkout failure rate by reason: out-of-stock vs payment failure vs timeout

  • Payment authorization latency (p95/p99): external dependency health

  • Duplicate checkout suppression count: how often idempotency prevents duplication

What to say in the interview: "I track oversell rate, reservation expiry rate, checkout failure rate by reason, and payment latency percentiles because they directly map to revenue loss and user trust."

Common mistakes that cost candidates offers#

Most candidates fail this interview in predictable ways. Avoid these and you immediately move into the top tier.

Common pitfall: Designing checkout as a single synchronous call with no idempotency and no state machine. That design breaks the moment retries and failures appear.

Other frequent mistakes:

  • treating inventory as “just a field in the product table”

  • using eventual consistency for orders or payments

  • hand-waving flash sales with “we’ll autoscale”

  • skipping failure scenarios and reconciliation

  • designing ten microservices without ownership boundaries

Final interview checklist (quick recap)#

You don’t need a perfect design. You need a design that is coherent, correct, and explainable under pressure.

  • State assumptions early and clearly

  • Separate read-heavy flows from correctness-critical flows

  • Treat inventory as a concurrency problem

  • Model checkout as a state machine

  • Use idempotency keys everywhere retries exist

  • Plan for failures and reconciliation

  • Design flash sale behavior with graceful degradation

  • Show observability metrics tied to trust and revenue

Interview signal: The best answer is the one the interviewer can stress-test without it collapsing.

Happy learning!


Written By:
Zarish Khalid