We label a service a distributed system if multiple computing nodes are involved. Each node is physically distinct, does not share any computing resources, and can communicate only via network messages. Google Search, Netflix, and Amazon are examples of distributed systems.
Let's see some of the primary reasons we need a distributed system.
Making large-scale services possible: Many tasks are beyond the capacity of any single computer. For example, periodically crawling the World Wide Web and serving keyword searches over that index to billions of concurrent users is a task that involves thousands of servers.
Note: Grace Hopper, a mathematician and pioneering computer scientist who was one of the first programmers of the Harvard Mark I computer, had great farsightedness, as is evident from one of her famous quotes from the 1970s: “In pioneer days, they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers but for more systems of computers.”
Some argue that today's powerful servers make a distributed system unnecessary for many workloads. However, that argument has several gaps:
Due to the distance between a centrally hosted service and its dispersed client base, clients far from the service will incur longer latency. Replicating the service near the customers requires a distributed system.
The failure cases are not adequately addressed, such as the outage of a full data center or the replacement of a failed central server.
Moreover, multiple cores in a server and NUMA-based memory access mean that a single server these days already shares many traits with a distributed system.
Our takeaway from this discussion is that while modern servers can help reduce the overall fleet size (hence the cost benefits), we can't entirely do away with the need for a distributed system.
Cost efficiency: As we try to make computational equipment bigger and more capable, the associated dollar cost does not increase linearly (it often grows superlinearly), which can make such an endeavor prohibitively costly. For example, we might wish for a network switch that enables any-to-any, high-bandwidth communication among the hundreds of thousands of servers residing in a data center. No such switch exists; instead, we use many smaller switches arranged in a multi-stage topology (such as a Clos or fat-tree network) to approximate that connectivity.
Fault tolerance: A computer can fail due to a myriad of hardware or software glitches. A distributed system enables a service to hide most of the component level (or even data center level) failures from the end customers, giving the operators the necessary time to repair the failed components. For example, our data might be geo-replicated at three places. Losing any one of them will trigger the service to reroute our read/write requests to a secondary copy with little to no hiccups for us. Over time, the service will either restore the failed copy, or a new server might be assigned to take over the failed server.
Note: You can dive into understanding how data replication works in a fault-tolerant manner by studying Google's GFS file system design.
Reduced latency for the customers: If a service is hosted only, say, on the West Coast of the United States, customers on the East Coast will experience larger delays, and customers in Europe and Asia may suffer much longer ones. Physical limitations, such as the speed of light, make it impossible to communicate any faster. Therefore, it becomes necessary to bring the service near the customers. Common examples are Netflix and other time-sensitive media services, which are often served from near the customer via content delivery networks (CDNs).
The following illustration shows how major cloud computing players distribute their data centers and connect them via specialized networks to provide low-latency services to their customers.
Hopefully, we are convinced by now that modern services need distributed computational infrastructure to fulfill their goals. And now that we have opened this Pandora's box, let's see what challenges await us.
We’ll discuss some of the most difficult aspects of distributed systems:
Partial failures are common.
The network can lose, delay, duplicate, or reorder messages.
Failures can be hard to detect.
Node clocks are unreliable.
A single server/computer is designed so that when an unexpected fault happens, the system declares itself failed (instead of producing wrong results). Examples are a kernel panic in the Linux operating system and the “blue screen of death” in the Windows operating system—on unexpected faults such as memory corruption, these systems halt instead of working unreliably and possibly producing wrong results. Therefore, such a system mainly works in a binary fashion: either it is working correctly, or it has failed and stopped working. This binary behavior makes such systems easier to reason about.
On the contrary, in a distributed system of many nodes, one or some nodes might have failed while the rest are working. Therefore we don't have that binary state (fully working or failed and stopped); instead, partial failures are possible. Such partial failures add non-determinism to the system. In fact, in a large distributed system, it is often assumed that something is always broken. Naturally, detecting and managing such failures is necessary.
Note: Large supercomputers, such as Frontier (an HPE Cray EX system built from AMD EPYC CPUs and AMD Instinct MI250X GPUs), solve many scientific problems, and often, on a fault, the whole system is halted, the fault is corrected, and the system resumes by loading from the last checkpoint. However, a large distributed service such as Google Search can't operate like that.
Transient network problems are also common. Such problems include dropped or delayed packets, temporary congestion, flapping links, and failing or misconfigured switches and routers.
Networks are often modeled as synchronous or asynchronous. A synchronous network is one where message delivery has a known upper bound, and we can figure out which messages were lost. On the contrary, in an asynchronous network, a message can take arbitrarily long to reach its recipient. This means in an asynchronous model, timeouts are not a good detector of lost messages.
A real network, such as the Internet, is neither purely synchronous nor asynchronous. Most of the time, the Internet remains in a good state and behaves like a synchronous network, whereas under heavy congestion it can behave more like an asynchronous one. That partial synchrony is helpful, but we need to deal with occasional asynchrony—protocols should be able to detect and manage the periods when the network becomes asynchronous.
A typical network delay has many components, such as transmission, queuing, and propagation delays. Queuing delays are usually responsible for the high variability (jitter) in network latency. The Internet is often dubbed a network of queues, where a packet gets through the queue at one router only to get queued at the next. Like a busy traffic intersection during rush hour, these queues can build up, adding substantial delay to packets.
Packets can be dropped at a queue if there is no space to queue them. Senders might send multiple copies of the same data if they don't receive acknowledgment of a previous message in a reasonable time. It might be the case that an earlier copy is sitting in some queue, waiting for its turn, and ultimately more than one copy of data reaches the recipient.
It is often hard to detect a node failure. One of the reasons is message loss over the network. Not getting a response from a node is not a sure signal that the node has failed: the message might have been dropped midway, or the acknowledgment might not have reached the sender. This is the classic Two Generals’ problem.
Note: Timeouts are not a reliable detector of node failure in an asynchronous system.
A node pausing at unexpected times (for example, due to a long garbage-collection pause) can make programming challenging. Consider a node that acquires a lease and, before doing some protected work, checks that it has sufficient time left on the current lease. If the node then pauses for, say, 20 seconds right after that check, on resumption the code still assumes it holds a valid lease, while it does not; a minimal sketch of this trap follows. This scenario is a lot like concurrent code, where we can't rely on timing. (We will see a possible solution to this situation a little later in this blog.)
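Since the original snippet is not reproduced here, the following is a minimal Go sketch of the same trap under stated assumptions: the names Lease, ExpiryTime, and doProtectedWork are hypothetical, and the point is that the lease check and the protected work are separated in time, so a long pause between them silently invalidates the check.

```go
package main

import "time"

// Lease is a hypothetical lease handle; ExpiryTime would come from the lease service.
type Lease struct {
	ExpiryTime time.Time
}

func (l *Lease) processWithLease(req string) {
	// Check that at least 10 seconds remain on the lease.
	if time.Until(l.ExpiryTime) > 10*time.Second {
		// DANGER: if the process pauses here (e.g., a 20-second GC pause),
		// the lease may already have expired by the time the next line runs,
		// yet we proceed as if we still hold it.
		doProtectedWork(req)
	}
}

func doProtectedWork(req string) { /* write to shared storage, etc. */ }

func main() {
	l := &Lease{ExpiryTime: time.Now().Add(15 * time.Second)}
	l.processWithLease("example request")
}
```

The fencing-token technique discussed later in this blog addresses exactly this gap by re-checking validity at the resource being protected.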
The following illustration pictorially shows the challenges due to unexpected node pauses. Such pauses are not hypothetical; they have been observed in production systems, caused, for example, by garbage-collection pauses, virtual machines being suspended and resumed, or the operating system swapping and scheduling delays.
We need some notion of time for many tasks, such as detecting timeout on a network message so that it may be retransmitted, logging the amount of time a client spends on different pages of our website, and tracking when a DNS cache entry will expire and will need to be ejected from the local cache. Unfortunately, it is impossible to perfectly synchronize clocks on different nodes: despite using specific protocols such as NTP, many challenges persist.
Note: Time-of-day clocks and monotonic clocks are different and are used for different purposes. Using time-of-day clocks instead of monotonic clocks can lead to serious errors. Monotonic clocks are designed in such a way that they can't go back in time. However, if need be, they can be made a bit faster or slower to adjust the skew in clock value.
We will see two examples where unsynchronized clocks can cause damage.
The first example is from Cloudflare. Their code computed rtt, the duration between the current time now and an earlier time start. At one point, the clock was stepped backward, so now - start produced a negative value. The code assumed that rtt would always be positive, and the unexpected negative value had a cascading impact, failing the service for many of their customers. Cloudflare published a detailed postmortem of this incident.
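Cloudflare's actual code is not shown here; the sketch below is a hypothetical Go reconstruction of the pitfall. Round(0) strips Go's monotonic clock reading, so the first subtraction behaves like a pure wall-clock difference and can go negative if the clock is stepped backward, while time.Since uses the monotonic clock and cannot.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Wall-clock readings: Round(0) strips Go's monotonic component, so this
	// subtraction is a pure wall-clock difference, which can go negative if
	// the system clock is stepped backward (e.g., a leap second or an NTP
	// correction).
	start := time.Now().Round(0)
	// ... send a request, wait for the reply ...
	rtt := time.Now().Round(0).Sub(start) // may be negative after a clock step
	if rtt < 0 {
		// Code that assumes rtt >= 0 will misbehave here; a negative duration
		// like this is essentially what triggered Cloudflare's outage.
		fmt.Println("negative rtt:", rtt)
	}

	// Monotonic reading: time.Since uses the monotonic clock (Go 1.9+),
	// so the measured duration cannot go backward.
	begin := time.Now()
	elapsed := time.Since(begin)
	fmt.Println("elapsed:", elapsed)
}
```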
Note: We often assume that if two processes are on two different nodes, their failures will be independent—meaning that the failure of one node won’t impact the other processes. However, as illustrated by the Cloudflare example where the same bug impacted many of their nodes in a data center, that is not always the case: a bug in one node may impact multiple nodes.
Our second example is about a write-conflict resolution mechanism used in replicated data storage known as last write wins (LWW). If nodes use wall-clock time (which we are unable to synchronize perfectly), even clock skew at the millisecond (ms) level can produce data consistency issues.
In the following illustration, it is clear that although x=2 happened after x=1, node 2 sets x=1 as the final value because the timestamp of x=1 (assigned by a clock running ahead) is larger than the timestamp of x=2, so the later write is silently discarded.
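Here is a minimal Go sketch of a last-write-wins register, assuming (hypothetically) that each node stamps writes with its own wall clock in milliseconds: if the first writer's clock runs a few milliseconds ahead, its earlier write x=1 carries a larger timestamp than the later write x=2 and wrongly wins.

```go
package main

import "fmt"

// write represents a replicated write stamped with the sender's wall clock (ms).
type write struct {
	value       int
	timestampMs int64
}

// lwwMerge keeps the write with the larger timestamp, regardless of real order.
func lwwMerge(a, b write) write {
	if b.timestampMs > a.timestampMs {
		return b
	}
	return a
}

func main() {
	// Real order: x=1 first, then x=2. But the first node's clock is 5 ms ahead.
	w1 := write{value: 1, timestampMs: 1_000_005} // stamped by the fast clock
	w2 := write{value: 2, timestampMs: 1_000_002} // stamped later, by an accurate clock

	final := lwwMerge(w1, w2)
	fmt.Println("final value:", final.value) // prints 1: the later write was silently lost
}
```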
By now, we should be convinced that strange, unexpected things can happen in distributed settings that make designing software for distributed systems a challenging task. In the following section, we will see how researchers and system designers deal with the issues.
In this section, we survey some of the ways the above-mentioned challenges are managed. Researchers have devised rigorous formalisms to thoroughly understand the problem and to provide possible solutions. Let’s now discuss system models, failure models, and the correctness properties of systems.
Note: Challenging problems, especially those whose solution has substantial rewards, attract human curiosity and intellect. Over the last few decades, researchers and practitioners have worked hard to make today’s household services like Google, Facebook, Netflix, etc., possible. You can learn how YouTube works here.
With all the problems we explained above, a designer must have a strategy to determine the correctness of the system. We use safety and liveness conditions for this purpose. Safety means that nothing bad ever happens, and liveness means that something good eventually happens. More formal definitions of safety and liveness are available in the research literature.
Consensus algorithms, such as Paxos and Raft, provide good examples of safety and liveness properties. The safety condition could be that a minority never changes a value chosen by a majority. The liveness condition could be that consensus makes progress as long as a simple majority of participating nodes is available.
There is often a tradeoff between safety and liveness properties. Another reason to analyze the state of the system through two separate lenses, safety and liveness, is that for many use cases, it is acceptable to sacrifice liveness (at least for a while) as long as we never compromise on safety. In the context of consensus algorithms, the Paxos and Raft systems are always consistent (a safety condition) but might not make progress (a liveness condition) when a majority is unavailable.
It turns out that the execution environment in which a distributed system operates has a major impact on what can (and cannot) be done. Researchers have devised two kinds of computational models—synchronous and asynchronous.
A synchronous model is one where:
Nodes take a known bounded time to process any requests.
Network messages take a known bounded time to reach their recipients.
The clocks on different nodes have a known bound on the skew that is never violated.
This is a good model where everyone behaves nicely.
An asynchronous model is one where:
Nodes can take arbitrarily long to process requests.
Messages can take arbitrarily long to be delivered over the network.
The clock skew is also not bounded.
This is a harsh model of a computational environment with multiple sources of uncertainty and incomplete information.
The question is how closely our real computational environment matches these theoretical models. The reality is neither as pretty as in the synchronous model nor probably as bad as in the asynchronous model. In some sense, synchronous and asynchronous models are two ends of a spectrum. They help us prove exciting things such as:
X is impossible in a synchronous model.
Y can be done in an asynchronous model.
Implicit here is the notion that if X is impossible in a nice synchronous model, it is definitely impossible for anything farther along the spectrum, and if something is possible with a tough model, it is definitely possible with a model with less uncertainty.
Once again, we define a spectrum, this time of failures: the easiest to manage are called crash-stop failures, and the hardest to manage are called Byzantine failures. Failure models serve a similar purpose as the computational models above: they let us prove properties.
Crash-stop failures are those where, when a fault occurs, the affected node stops working and does not participate in any further processing. Byzantine failures are arbitrary failures, where participants can behave in an unexpected (possibly malicious) manner.
Note: We recently added a new section on consensus protocols in our course on Grokking the Principles and Practices of Advanced System Design. You may read that section to see safety, liveness, and computational and fault models in action.
TCP can hide message losses, message reordering, and message duplication. However, hiding message latency is not always possible. System designers go to great lengths to reduce latency for their customers. For example, they cache and serve content from edge locations close to the users (as the CDN example earlier showed) and carefully tune their protocols and infrastructure.
Because a misbehaving network is at the root of many challenges, hyperscalers such as Google, Amazon, and Microsoft invest heavily in their own data center fabrics and private wide-area backbones to make network behavior more predictable.
Modern services rely on extensive monitoring to deal with failed nodes. Still, it is possible that monitoring declares a node failed when it has not actually failed. To deal with such issues, systems are built in such a way that safety conditions are still met (for example, in the context of consensus, there is never more than one legitimate leader at a time). Similarly, mechanisms such as fencing tokens (discussed below) ensure that a node wrongly declared dead cannot corrupt the system's state when it resumes.
We saw that synchronizing physical clocks on different nodes using network protocols like NTP is challenging. Leslie Lamport invented logical clocks that can be implemented using simple counters and suggested tagging each network message with these counters. A desirable property of such clocks is that they should support causality. For example, if Event A happens before Event B at a single node, the local logical clock value should be assigned such that the clock value of Event B is larger than the clock value of Event A. When a node sends a message to another node, the clock value at reception should be higher than the one when the message was sent.
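Here is a minimal Go sketch of a Lamport logical clock under the rules just described; the type and method names are illustrative, not from any particular library.

```go
package main

import (
	"fmt"
	"sync"
)

// LamportClock is a simple logical clock: a counter plus the two rules below.
type LamportClock struct {
	mu      sync.Mutex
	counter uint64
}

// Tick is called on a local event or just before sending a message;
// the returned value is attached to the outgoing message.
func (c *LamportClock) Tick() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counter++
	return c.counter
}

// Observe is called on receiving a message carrying timestamp ts;
// the local clock jumps past ts so the receive event is ordered after the send.
func (c *LamportClock) Observe(ts uint64) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if ts > c.counter {
		c.counter = ts
	}
	c.counter++
	return c.counter
}

func main() {
	var a, b LamportClock
	sendTs := a.Tick()          // node A sends a message stamped with sendTs
	recvTs := b.Observe(sendTs) // node B's clock is now guaranteed to be > sendTs
	fmt.Println(sendTs, recvTs) // e.g., 1 2
}
```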
Note: See our chapter on sequencers to know how to generate monotonically increasing, causality bearing counters.
Let's see how we can solve our earlier LWW issue with unsynchronized clock values. The following illustration shows that sending a monotonically increasing fencing token in each request helps us avoid the pitfalls we saw earlier. When a request arrives carrying an older token than one the node has already seen, the node rejects that request.
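Here is a minimal Go sketch of fencing on the storage side, assuming the lock or lease service hands out monotonically increasing token numbers: the store remembers the highest token it has seen and rejects any write carrying an older one.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fencedStore rejects writes whose fencing token is older than one it has already seen.
type fencedStore struct {
	mu           sync.Mutex
	highestToken uint64
	value        string
}

var errStaleToken = errors.New("stale fencing token: writer no longer holds the lease")

func (s *fencedStore) Write(token uint64, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.highestToken {
		return errStaleToken // a newer lease holder has already written
	}
	s.highestToken = token
	s.value = value
	return nil
}

func main() {
	store := &fencedStore{}
	fmt.Println(store.Write(33, "from old lease holder")) // <nil>
	fmt.Println(store.Write(34, "from new lease holder")) // <nil>
	// The old holder resumes after a long pause and retries with its stale token.
	fmt.Println(store.Write(33, "late write")) // stale fencing token error
}
```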
The next innovation came with Google's Spanner, which introduced the TrueTime API. The key observation was this: while the clocks on the nodes of a distributed system cannot be perfectly synchronized, we can explicitly expose the uncertainty in the clock value to applications while working hard to keep that uncertainty low.
Spanner uses atomic clocks and GPS clocks to keep the clock uncertainty around 6 ms. When the TrueTime API returns a clock value, it also reports the current uncertainty bound.
Time with controlled uncertainty might not seem like much on the surface, but it is a significant feat. Spanner's ability to be a strongly consistent database at global scale owes quite a bit to the TrueTime API.
The following illustration shows that a TrueTime timestamp is not a single point in time but an interval [earliest, latest] that is guaranteed to contain the true current time.
The following illustration shows that the uncertainty in the clock increases over time, but the Spanner system strives hard to keep it below 6 ms on average. The slanted lines in the illustration show a gradually increasing skew in the clock value. Occasionally, the uncertainty can go beyond this average value, and Spanner and its applications have special protocols to deal with such situations (for example, waiting out the uncertainty when they want to be sure that a time point has definitely passed).
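Here is a minimal Go sketch of the TrueTime idea (not Google's actual API): the clock answer is an interval [earliest, latest] guaranteed to contain the true time, and a caller that must be certain a timestamp has passed simply waits out the uncertainty, which is essentially Spanner's commit wait. The epsilon value is a hypothetical constant standing in for the bound reported by the time infrastructure.

```go
package main

import (
	"fmt"
	"time"
)

// Interval is a TrueTime-style answer: the true time lies somewhere inside it.
type Interval struct {
	Earliest, Latest time.Time
}

// now returns the local clock reading widened by the current uncertainty bound.
func now(epsilon time.Duration) Interval {
	t := time.Now()
	return Interval{Earliest: t.Add(-epsilon), Latest: t.Add(epsilon)}
}

// waitUntilAfter blocks until the true time is definitely later than ts
// (the same idea as Spanner's "commit wait").
func waitUntilAfter(ts time.Time, epsilon time.Duration) {
	for now(epsilon).Earliest.Before(ts) {
		time.Sleep(time.Millisecond)
	}
}

func main() {
	const epsilon = 6 * time.Millisecond // illustrative average bound
	commitTs := now(epsilon).Latest      // a timestamp no earlier than the true time
	waitUntilAfter(commitTs, epsilon)    // after this, commitTs is definitely in the past
	fmt.Println("commit timestamp is safely in the past:", commitTs)
}
```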
Distributed systems often make tradeoffs among different consistency guarantees. It’s important to understand not just safety and liveness, but which kind of consistency your system provides. Modern distributed databases and storage systems rarely stick to a single consistency model — instead, they balance tradeoffs based on latency requirements, data criticality, and system scale.
Strong consistency / linearizability: Every operation appears instantaneous and globally visible. This is essential for systems that require correctness guarantees (e.g., banking transactions, inventory systems). However, it comes at the cost of latency and availability during partitions.
Sequential consistency: Operations are observed in a total order that respects the order issued by each client. It’s easier to achieve than linearizability but still ensures predictable behavior.
Causal consistency / eventual consistency: Operations preserve causal relationships but may diverge temporarily. Most high-availability systems (e.g., social networks, shopping carts) rely on eventual consistency to scale.
Tunable consistency / quorum systems: Systems like Cassandra, DynamoDB, and Cosmos DB let engineers choose tradeoffs between latency, consistency, and availability through quorum-based read/write configurations (a minimal sketch of the quorum rule appears below).
Understanding where your system lies in the “consistency spectrum” — and how that choice affects user experience and performance — is key to designing scalable distributed architectures.
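As promised above, here is a minimal Go sketch of the quorum rule behind tunable consistency, assuming N replicas, W write acknowledgements, and R read responses: when R + W > N, every read quorum overlaps every write quorum, so a read is guaranteed to see the most recent acknowledged write.

```go
package main

import "fmt"

// quorumConfig captures a Dynamo/Cassandra-style replication setting.
type quorumConfig struct {
	N int // replicas per key
	W int // acks required for a write
	R int // replicas consulted for a read
}

// strong reports whether read and write quorums must overlap (R + W > N),
// which is what yields read-your-latest-write behavior.
func (q quorumConfig) strong() bool { return q.R+q.W > q.N }

func main() {
	fmt.Println(quorumConfig{N: 3, W: 2, R: 2}.strong()) // true: overlapping quorums
	fmt.Println(quorumConfig{N: 3, W: 1, R: 1}.strong()) // false: fast, but only eventually consistent
}
```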
Modern distributed systems often span geographies for low latency, compliance, availability, and disaster recovery. Designing for geographical distribution introduces entirely new failure modes and operational considerations.
Geo-replication & multi-leader vs single-leader: Multi-leader replication allows writes in multiple regions but introduces conflict resolution challenges. Single-leader setups simplify consistency but can increase latency for remote users.
Partitioning / sharding: Breaking data into shards reduces load and improves scalability but complicates cross-shard queries, joins, and transactions. Many global systems use “geo-sharding” — storing data closest to users — to reduce latency (a small routing sketch appears after this list).
Failover & region isolation: Automatic failover and blast radius containment strategies ensure services remain available even during regional outages. Handling split-brain scenarios gracefully is a critical design challenge.
Service mesh & traffic routing: Technologies like Envoy, Istio, and API gateways provide traffic management across regions, enabling intelligent routing, canary testing, and load distribution.
These architectural choices directly affect a system’s latency, resilience, and fault tolerance.
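Here is a small Go sketch of geo-sharded routing as mentioned above, with hypothetical region names and an in-memory lookup table standing in for the real user-to-region mapping.

```go
package main

import "fmt"

// userHomeRegion would normally live alongside the user profile; here it is a
// hypothetical in-memory map.
var userHomeRegion = map[string]string{
	"alice": "eu-west",
	"bob":   "us-east",
}

const defaultRegion = "us-east"

// shardFor routes a request for a user to the region holding their data.
func shardFor(userID string) string {
	if region, ok := userHomeRegion[userID]; ok {
		return region
	}
	return defaultRegion
}

func main() {
	fmt.Println(shardFor("alice"))   // eu-west
	fmt.Println(shardFor("unknown")) // us-east (fallback)
}
```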
Beyond theoretical failure models, real distributed systems require practical resilience strategies to remain reliable under unpredictable conditions. These strategies are fundamental to building systems that degrade gracefully instead of catastrophically failing.
Retries with exponential backoff + jitter: Smooths retry storms and prevents additional load during outages (a minimal sketch appears after this list).
Circuit breakers and bulkheads: Automatically cut off failing dependencies and isolate components to prevent cascading failures.
Idempotent operations and replay safety: Ensure that retries don’t cause duplicate state changes.
Health checks, leader elections, and quorum rules: Enable automated recovery, self-healing clusters, and predictable behavior under partial failure.
Distributed tracing, metrics, observability & SLOs / error budgets: Make failure states visible and measurable, allowing teams to react proactively.
Chaos engineering: Intentionally injecting faults uncovers hidden dependencies and prepares systems for real-world failure modes.
Combining these practices creates self-healing, resilient distributed systems that stay functional even during severe outages.
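As referenced in the retry item above, here is a minimal Go sketch of retries with exponential backoff and full jitter; the flaky operation and the parameter values are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op with exponentially growing delays and full jitter,
// which spreads retries out and avoids synchronized retry storms.
func retryWithBackoff(op func() error, maxAttempts int, base, maxDelay time.Duration) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		backoff := base << attempt // base * 2^attempt
		if backoff > maxDelay {
			backoff = maxDelay
		}
		// Full jitter: sleep a random duration in [0, backoff).
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
	}
	return err
}

func main() {
	calls := 0
	flaky := func() error {
		calls++
		if calls < 3 {
			return errors.New("temporary failure")
		}
		return nil
	}
	err := retryWithBackoff(flaky, 5, 100*time.Millisecond, 2*time.Second)
	fmt.Println("attempts:", calls, "err:", err)
}
```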
In modern architectures, messaging and event-driven design are foundational to scalability and decoupling. They enable microservices to communicate asynchronously and handle large-scale data flows.
Message queues / brokers (Kafka, Pulsar, RabbitMQ): Provide durability, backpressure management, and partitioned throughput but introduce ordering and offset management complexities.
Exactly-once semantics / idempotency: Achieving exactly-once delivery requires combining broker guarantees, idempotent consumers, and transaction semantics — a hard but essential problem in financial or audit systems (a minimal idempotent-consumer sketch appears after this list).
Stream processing / stateful processors (Flink, Kafka Streams): Enable real-time analytics, alerting, and event correlation. Handling out-of-order data, watermarking, and checkpointing are key challenges.
Schema evolution and versioning: Ensuring producers and consumers remain compatible over time is critical for evolving distributed event systems.
Event-driven designs introduce a new layer of complexity but provide low-latency, scalable, and fault-tolerant communication across services.
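As referenced in the idempotency item above, here is a minimal Go sketch of an idempotent consumer: because brokers typically deliver at least once, the consumer deduplicates by event ID so redeliveries have no additional effect. In a real system, the processed-ID set would be persisted atomically with the state change.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a message as it might arrive from a broker; the ID lets consumers
// deduplicate redeliveries (at-least-once delivery implies duplicates).
type event struct {
	ID      string
	Payload string
}

// idempotentConsumer remembers which event IDs it has already applied, so
// processing the same event twice has no additional effect.
type idempotentConsumer struct {
	mu        sync.Mutex
	processed map[string]bool
}

func (c *idempotentConsumer) handle(e event) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.processed[e.ID] {
		return // duplicate delivery: already applied
	}
	c.processed[e.ID] = true
	fmt.Println("applying", e.ID, e.Payload) // real code would update state here
}

func main() {
	c := &idempotentConsumer{processed: make(map[string]bool)}
	e := event{ID: "order-42", Payload: "charge $10"}
	c.handle(e)
	c.handle(e) // redelivered by the broker; applied only once
}
```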
Security is foundational — not an afterthought — in modern distributed design. With services communicating across networks, often over public clouds, maintaining end-to-end trust is non-negotiable.
mTLS / mutual TLS ensures both client and server identity verification (a minimal server-side configuration sketch appears after this list).
Zero-trust design: No service or network segment is trusted by default. Every request must authenticate and authorize itself.
Key management and secret rotation: Automate credential management and rotation using tools like HashiCorp Vault or AWS KMS.
Authentication & token propagation (JWT, OAuth, OIDC): Use federated identity to manage user and service access across distributed environments.
Access control, audit trails, and multi-tenancy: Implement fine-grained access control and logging to meet compliance requirements and track system activity.
Security and trust boundaries shape how distributed systems are architected, deployed, and operated — neglecting them leads to vulnerabilities that scale with the system.
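As referenced in the mTLS item above, here is a minimal Go sketch of a server configured for mutual TLS; the certificate file paths are hypothetical, and in production the certificates would typically come from an internal CA or a service mesh.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

// A minimal mTLS-enabled HTTPS server: it presents its own certificate and
// also requires (and verifies) a client certificate signed by a trusted CA.
func main() {
	caPEM, err := os.ReadFile("ca.pem") // hypothetical CA certificate path
	if err != nil {
		log.Fatal(err)
	}
	clientCAs := x509.NewCertPool()
	clientCAs.AppendCertsFromPEM(caPEM)

	server := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			ClientCAs:  clientCAs,
			ClientAuth: tls.RequireAndVerifyClientCert, // this is what makes it *mutual* TLS
			MinVersion: tls.VersionTLS12,
		},
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("hello, authenticated client\n"))
		}),
	}
	// server.pem / server-key.pem are the server's own certificate and key (hypothetical paths).
	log.Fatal(server.ListenAndServeTLS("server.pem", "server-key.pem"))
}
```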
Modern distributed systems often run inside container orchestration platforms like Kubernetes, where additional layers of complexity — and power — emerge.
Control plane / data plane separation: Understanding how control components (e.g., API server, etcd) differ from the data plane is key to designing resilient clusters.
Service mesh (e.g., Istio, Linkerd): Adds reliability features like retries, circuit breaking, and security enforcement transparently at the network layer.
Canary deployments, traffic shaping, blue-green upgrades: Enable safer rollouts and progressive delivery of new features.
Operator patterns and autoscaling: Automate lifecycle management and resource allocation dynamically based on workload demand.
Cloud-native environments blur the line between infrastructure and application logic — understanding these patterns is critical for building reliable, scalable services.
Many production systems blend multiple consistency and storage models to achieve optimal tradeoffs.
Hybrid models: Use strong consistency for mission-critical paths (e.g., payment processing) and eventual consistency for less critical paths (e.g., analytics).
Polyglot persistence: Combine relational, document, graph, and key-value stores within a single architecture to optimize query performance and scalability.
Transactions across stores: Patterns like two-phase commit, saga orchestration, and eventual reconciliation handle distributed transactions that span multiple data systems (a minimal saga sketch appears after this list).
Blending consistency and persistence approaches is often the only way to meet diverse application requirements without compromising scalability or performance.
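As referenced in the transactions item above, here is a minimal Go sketch of saga orchestration: each step pairs a forward action with a compensating action, and a failure triggers the compensations of all previously completed steps in reverse order. The step names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// step pairs a forward action with its compensating action.
type step struct {
	name       string
	action     func() error
	compensate func()
}

// runSaga executes steps in order; on failure it undoes completed steps in reverse.
func runSaga(steps []step) error {
	for i, s := range steps {
		if err := s.action(); err != nil {
			for j := i - 1; j >= 0; j-- {
				steps[j].compensate()
			}
			return fmt.Errorf("saga aborted at %q: %w", s.name, err)
		}
	}
	return nil
}

func main() {
	err := runSaga([]step{
		{"reserve inventory",
			func() error { fmt.Println("reserved"); return nil },
			func() { fmt.Println("released reservation") }},
		{"charge payment",
			func() error { return errors.New("card declined") },
			func() { fmt.Println("refunded payment") }},
	})
	fmt.Println(err)
}
```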
Examining production-grade distributed systems offers practical lessons in tradeoffs, evolution, and operational realities:
Spanner / F1 (Google): Combines strong consistency with global replication using TrueTime. It illustrates how physical clock guarantees can enable distributed transactions across continents.
Cassandra / Dynamo / Cosmos DB: Showcase how eventual consistency, partitioning, and tunable quorum settings power web-scale availability.
Kafka + Kafka Streams: Demonstrate scalable, exactly-once event processing pipelines that underpin modern data platforms.
CockroachDB / YugabyteDB: Bring distributed SQL to the forefront, combining ACID guarantees with horizontal scalability.
Studying these real-world examples highlights the tradeoffs between latency, consistency, fault tolerance, and developer complexity — the core design decisions behind every distributed system.
Distributed systems are fundamentally different from writing software on a single computer. A distributed system can fail in a myriad of new, unexpected ways. A service built on a distributed system aims to shield its customers from all possible failures and yet serve its purpose well. In this blog, we surveyed some of the primary challenges in building distributed services and how we can manage those challenges. However, we were only able to scratch the surface, and our purpose was to stimulate your interest. To go deeper into these topics, we recommend our following courses: Distributed Systems for Practitioners and Grokking the Principles and Practices of Advanced System Design.