Characteristics of Distributed Systems

This lesson discusses the various cornerstones of distributed systems: reliability, scalability, availability, consistency, and maintainability.

Whenever we deal with distributed systems, there are five attributes particular to such systems that we must be aware of and understand.

Reliability

Reliability is a system’s ability to continue to work correctly in spite of faults. A distributed system is usually made up of several smaller sub-components that work together to deliver a service. A reliable system can be counted on to continue working without degradation of service when a part of the overall system fails. The concept of reliability can be extended to include a system’s ability to continue to perform with the expected functionality, to tolerate human errors and unexpected uses of the system, to maintain performance under high data-volume load, and to prevent any unauthorized use or abuse of the system when failures do happen.

Fault vs. failure

The terms “fault” and “failure” are often used interchangeably, but they mean different things. When a part of a system experiences a fault but the system as a whole can still continue to operate reliably, we label the system fault-tolerant or resilient: it can tolerate components deviating from their spec and still function correctly. A failure occurs when the system as a whole stops working. No system can be made tolerant of every possible fault, so any system can still potentially fail as a whole.

Netflix’s Chaos Monkey is an example of a tool designed to test the resiliency of services in the Netflix ecosystem. The tool randomly terminates service instances to uncover service failures.

Scalability

A system is said to be scalable if it can continue to work correctly as the load on the system increases.

Load

Load can be measured by a variety of metrics, including the number of requests per second received by a web server, the ratio of reads to writes in a cache, the frequency of data backups, the number of concurrent users supported, and more. When considering scalability, we generally ask whether the system can continue to perform well when one of these load parameters increases.
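As a concrete illustration, one of these load parameters, requests per second, can be tracked with a sliding time window. Below is a minimal sketch; the class and method names are illustrative, not from any particular library:

```python
import time
from collections import deque

class RequestRateTracker:
    """Tracks requests per second over a sliding time window."""

    def __init__(self, window_seconds=1.0):
        self.window = window_seconds
        self.timestamps = deque()

    def record_request(self, now=None):
        self.timestamps.append(time.monotonic() if now is None else now)

    def current_rate(self, now=None):
        """Requests observed within the last `window` seconds, per second."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have fallen out of the window.
        while self.timestamps and self.timestamps[0] < now - self.window:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window

tracker = RequestRateTracker()
for t in (0.1, 0.2, 0.3, 1.5):   # simulated request arrival times (seconds)
    tracker.record_request(now=t)
print(tracker.current_rate(now=1.6))  # only the request at t=1.5 is in the window: 1.0
```

Real servers export metrics like this to a monitoring system rather than computing them in-process, but the sliding-window idea is the same.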

Examples

Let’s look at some well-known products and examples of their load parameters:

  • Microsoft Word: The number of geographically distributed users who can concurrently modify a document in collaboration mode without affecting usability for each user.

  • Facebook: The number of users able to view a live broadcast from, say, a celebrity. It becomes harder to broadcast a live feed to all users if there are many users viewing a particular live broadcast.

  • Google: Storage costs for indexing the web as the web grows.

  • Twitter: Adding a tweet by a popular user instantly to the timeline of all the followers. A tweet by an account with several million followers, if stored in a database, will be fetched millions of times when computing the timelines for each of the account followers and is a potential bottleneck.
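The Twitter bottleneck above is often addressed with fan-out on write: push a new tweet into each follower's precomputed timeline at write time so that reading a timeline later is a cheap lookup. A minimal in-memory sketch, with illustrative names:

```python
from collections import defaultdict

followers = defaultdict(set)   # user -> set of followers
timelines = defaultdict(list)  # user -> list of tweets, newest first

def follow(follower, followee):
    followers[followee].add(follower)

def post_tweet(author, text):
    """Fan-out on write: deliver the tweet to every follower's timeline now,
    so computing a timeline at read time requires no database scan."""
    for follower in followers[author]:
        timelines[follower].insert(0, (author, text))

follow("alice", "celebrity")
follow("bob", "celebrity")
post_tweet("celebrity", "hello world")
print(timelines["alice"])  # [('celebrity', 'hello world')]
```

The tradeoff is visible in the loop: an account with millions of followers turns one write into millions of timeline updates, which is why real systems mix fan-out on write with compute-on-read for very popular accounts.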

Measuring performance

When the load parameters increase, a scalable system is expected to maintain its expected performance. Amazon’s S3 service level agreement (SLA), for example, promises a certain performance benchmark to its users: customers are entitled to service credits if S3’s monthly uptime falls below 99.9%.
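An uptime percentage like 99.9% translates directly into an allowed downtime budget. A quick back-of-the-envelope sketch:

```python
def monthly_downtime_budget(uptime_percent, days=30):
    """Minutes of downtime allowed per month at a given uptime percentage."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100 - uptime_percent) / 100

print(monthly_downtime_budget(99.9))   # ~43.2 minutes per month
print(monthly_downtime_budget(99.99))  # ~4.3 minutes per month
```

Each extra "nine" shrinks the budget tenfold, which is one reason very high availability targets get expensive quickly.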

Let’s consider another example. Suppose Instagram has an internal SLA to load a user’s home page in less than X milliseconds; say, 5 milliseconds. However, the team can’t guarantee that every home page will load in exactly 5 ms, so performance is measured in percentiles. Instagram engineers can set standards such as: 99.9% of users will have their home pages loaded in under 10 ms, 70% in under 7 ms, and the median load time will be 5 ms. The median implies that half of Instagram’s users have their home screens loaded in under 5 ms. There could still be a very thin minority that sees load times greater than 10 ms, due to random events outside the developers’ control. Generally, optimizing for very high percentiles, e.g., the 99.99th percentile (1 in 10,000), comes with diminishing returns and is usually not worth the effort.
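Percentile targets like these can be computed directly from a list of observed latencies. A minimal sketch using the nearest-rank method (the sample numbers are made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value that is >= p percent of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical home-page load times in milliseconds.
latencies = [3, 4, 4, 5, 5, 5, 6, 7, 9, 12]
print(percentile(latencies, 50))  # median: 5
print(percentile(latencies, 90))  # 9
print(percentile(latencies, 99))  # 12
```

Note how the 99th percentile is dominated by the single slowest sample; this is why tail percentiles are both informative and noisy.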

Horizontal vs. vertical scaling

Once the load on a system starts to increase, it may be necessary to re-architect the system to handle the new load. There are generally two ways to scale a system that demands more resources:

  1. Vertical scaling: Add a more expensive or beefier machine than the one on which the current system runs. Consider a MySQL server that runs slowly on the current machine. If we replace it with a new machine with ten times the memory and computing power of the current one, the database server will be able to process a much larger number of queries.

  2. Horizontal scaling: Horizontal scaling refers to distributing load across several smaller machines. There’s a ceiling to vertical scaling: machines can only become so powerful before they run into the limits of physics, assuming you don’t hit your company’s budget limits first. Scaling horizontally brings complexity with it, especially for stateful services; a distributed database is not a trivial service to maintain when spread across several machines. Stateless services, however, are much easier to scale horizontally.

In practice, a hybrid approach is usually chosen to address scaling issues: use reasonably powerful machines while also spreading load horizontally across them. Note that there is no one-size-fits-all architecture for distributed applications operating at scale; each application can have unique load parameters, different access patterns, and different SLA response-time requirements.
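A common building block for scaling a stateful service horizontally is partitioning (sharding) its keys across machines. Below is a naive hash-mod sketch; real systems typically use consistent hashing so that adding a machine doesn't remap most keys. The key format is illustrative:

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key deterministically to one of num_shards machines."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Every request for the same user is routed to the same shard,
# so each machine only holds a fraction of the total state.
shards = [shard_for(f"user:{i}", num_shards=4) for i in range(8)]
print(shards)  # each key lands on a stable shard in the range [0, 4)
```

The weakness of hash-mod sharding is also easy to see: changing num_shards from 4 to 5 reassigns most keys, which is exactly the problem consistent hashing was designed to avoid.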

Maintainability

One of the cornerstones of good software design is maintainability. There are three major aspects to maintainability:

  • Ongoing operations: Software should be simple for operators to run and support.

  • Developer ease: Folks working on the code base should easily be able to grasp the logic and functionality of the software.

  • Extensibility: Software should be easily extensible to address new, related, or unanticipated use cases.

Availability

Availability is another key characteristic of distributed systems. Though availability is quantified mathematically, for our purposes we can think of it as the fraction of a specified period of time during which a system is responsive and functioning. We’ll say the system is 100% available if it is always responsive and functioning as intended within a given period. As a crude example, consider an API server that reports weather data. Whenever the server’s API is unavailable for any reason and the server doesn’t respond, its availability metrics suffer. Imagine the web server isn’t responding because the backing database holding the weather data is down. The designer of the service could choose to respond to requests in such a scenario with stale data from the server’s cache, thus improving availability. However, failing to include the latest writes from the database means the server is no longer consistent. Generally, there’s a tradeoff between availability and consistency. In the industry, you’ll often see cloud products advertised as being 99% available or higher.
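The weather-server tradeoff can be sketched as a fallback path: when the database is down, answer from the cache and accept possibly stale data in exchange for availability. The class names, error type, and data below are all illustrative:

```python
class DatabaseDown(Exception):
    pass

class WeatherService:
    """Favors availability: serves stale cached data when the database fails."""

    def __init__(self, db):
        self.db = db
        self.cache = {}

    def get_weather(self, city):
        try:
            reading = self.db.read(city)  # may raise DatabaseDown
            self.cache[city] = reading    # keep the cache warm on success
            return reading
        except DatabaseDown:
            if city in self.cache:
                return self.cache[city]   # stale, but the service stays up
            raise                         # nothing cached: we must fail

class FlakyDB:
    """Toy database whose availability we can toggle for the demo."""
    def __init__(self):
        self.up = True
        self.data = {"paris": "18C"}
    def read(self, city):
        if not self.up:
            raise DatabaseDown()
        return self.data[city]

db = FlakyDB()
svc = WeatherService(db)
print(svc.get_weather("paris"))  # '18C', fresh from the database
db.up = False
print(svc.get_weather("paris"))  # still '18C', served stale from the cache
```

The second response keeps the service available, but a client that just wrote a newer reading to the database would not see it, which is precisely the consistency cost described above.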

Consistency

In our web server example, if we designed the service to always return the latest data from the database, the service would not always be available when the database is down, but it would still be considered consistent. In distributed systems, there are generally several instances of a server, and a client typically talks to only one such instance. In that context, consistency implies that the client receives the same response no matter which instance it sends its request to. The CAP (consistency, availability, and partition tolerance) theorem captures the relationship between consistency and availability. The crux of the theorem is that intermittent network failures (partitions) among the components of a distributed system can’t be ruled out, and under such a scenario the system can choose either to remain available all the time at the cost of serving inconsistent data or to always serve consistent data to its customers at the risk of losing availability. We’ll explore each part of this theorem further as we progress through the course.
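A consistency-first version of the weather service makes the opposite choice: if the database can't be reached, it refuses to answer rather than risk serving stale data. As before, all names and data in this sketch are illustrative:

```python
class DatabaseDown(Exception):
    pass

class ConsistentWeatherService:
    """Favors consistency: never serves data it can't confirm is current."""

    def __init__(self, db):
        self.db = db

    def get_weather(self, city):
        # No cache fallback: either the latest value, or an error.
        return self.db.read(city)  # raises DatabaseDown when the DB is unreachable

class FlakyDB:
    """Toy database whose availability we can toggle for the demo."""
    def __init__(self):
        self.up = True
        self.data = {"paris": "18C"}
    def read(self, city):
        if not self.up:
            raise DatabaseDown()
        return self.data[city]

db = FlakyDB()
svc = ConsistentWeatherService(db)
print(svc.get_weather("paris"))  # '18C' while the database is up
db.up = False
# svc.get_weather("paris") would now raise DatabaseDown:
# the service stays consistent at the cost of availability.
```

Put side by side with the cache-fallback design, this is the CAP choice in miniature: when a partition separates the service from its database, you either answer with possibly stale data or you don't answer at all.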