Reliability in the Cloud

The reliability pillar includes the ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.

Design Principles: The five design principles for reliability in the cloud are:

Test recovery procedures:

In an on-premises environment, testing is often conducted to prove the system works in a particular scenario. Testing is not typically used to validate recovery strategies. In the cloud, you can test how your system fails, and you can validate your recovery procedures. You can use automation to simulate different failures or to recreate scenarios that led to failures before. This exposes failure pathways that you can test and rectify before a real failure scenario, reducing the risk of components failing that have not been tested before.
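As a concrete illustration, the sketch below uses boto3 to simulate a failure by terminating a randomly chosen instance, in the spirit of chaos testing. This is a minimal sketch, not a production fault-injection tool: the tag key `chaos-test` and the region are hypothetical, and you would only run this against instances that have explicitly opted in.

```python
# A minimal fault-injection sketch using boto3. The tag key/value
# ("chaos-test": "enabled") and the region are assumptions for
# illustration; adapt them to your environment.
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find running instances that have opted in to failure testing.
response = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-test", "Values": ["enabled"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)
instances = [
    inst["InstanceId"]
    for reservation in response["Reservations"]
    for inst in reservation["Instances"]
]

if instances:
    # Terminate one instance at random and let your recovery
    # automation (e.g., an Auto Scaling group) replace it.
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    print(f"Injected failure: terminated {victim}")
```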

Automatically recover from failure:

By monitoring a system for key performance indicators (KPIs), you can trigger automation when a threshold is breached. This allows for automatic notification and tracking of failures, and for automated recovery processes that work around or repair the failure. With more sophisticated automation, it’s possible to anticipate and remediate failures before they occur.
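For example, the following sketch creates a CloudWatch alarm that fires when a KPI (average CPU utilization) breaches a threshold. The Auto Scaling group name and the SNS topic ARN are placeholders; in practice the alarm action could notify operators or trigger automated recovery, such as a Lambda function or a scaling policy.

```python
# A hedged sketch: alarm when average CPU utilization stays above 80%
# for two consecutive 5-minute periods. Names and ARNs are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-utilization",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
    Statistic="Average",
    Period=300,                  # evaluate over 5-minute windows
    EvaluationPeriods=2,         # two consecutive breaches
    Threshold=80.0,              # KPI threshold (percent)
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```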

Scale horizontally to increase aggregate system availability:

Replace one large resource with multiple small resources to reduce the impact of a single failure on the overall system. Distribute requests across multiple, smaller resources to ensure that they don’t share a common point of failure.
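To make the idea concrete, here is a minimal sketch of round-robin request distribution across several small replicas: if one replica fails, capacity degrades rather than disappears. In AWS this job is usually handled by a load balancer such as Elastic Load Balancing; the endpoint addresses below are hypothetical.

```python
# Illustration of the principle only: spread requests across several
# small replicas instead of one large server. Endpoints are made up.
import itertools

replicas = [
    "http://10.0.1.10:8080",
    "http://10.0.2.10:8080",
    "http://10.0.3.10:8080",
]
round_robin = itertools.cycle(replicas)

def next_endpoint() -> str:
    """Return the next replica in round-robin order."""
    return next(round_robin)

for _ in range(5):
    print(next_endpoint())
```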

Stop guessing capacity:

A common cause of failure in on-premises systems is resource saturation, when the demands placed on a system exceed its capacity (this is often the objective of denial-of-service attacks). In the cloud, you can monitor demand and system utilization, and automate the addition or removal of resources to maintain the optimal level to satisfy demand without over- or under-provisioning.
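One way to stop guessing capacity on AWS is a target-tracking scaling policy, sketched below with boto3: the Auto Scaling group adds or removes instances to hold average CPU near a target value. The group name is a placeholder, and the 50% target is an illustrative choice, not a recommendation.

```python
# A sketch of "stop guessing capacity": a target-tracking policy that
# keeps the group's average CPU near 50%. The group name is assumed.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="target-50-percent-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```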

Manage change in automation:

Changes to your infrastructure should be done using automation. The changes that need to be managed are changes to the automation.
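A common way to apply this principle on AWS is to describe infrastructure in a version-controlled CloudFormation template and deploy changes through it, never by hand-editing resources. The sketch below shows the idea; the template is deliberately simplified and the stack name is a placeholder.

```python
# A sketch of managing change through automation: infrastructure is
# changed by updating a template and redeploying it, so the template,
# not the console, is the source of truth. Names are placeholders.
import boto3

TEMPLATE = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  BackupBucket:
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled
"""

cloudformation = boto3.client("cloudformation", region_name="us-east-1")
cloudformation.create_stack(
    StackName="reliability-demo",
    TemplateBody=TEMPLATE,
)
```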

Definition

There are three best practice areas for reliability in the cloud:

  1. Foundations
  2. Change Management
  3. Failure Management

To achieve reliability, a system must have a well-planned foundation and monitoring in place, with mechanisms for handling changes in demand or requirements. The system should be designed to detect failures and automatically heal itself.

Best Practices Foundations

Before architecting any system, foundational requirements that influence reliability should be in place. For example, you must have sufficient network bandwidth to your data center. These requirements are sometimes neglected (because they are beyond a single project’s scope).

This neglect can have a significant impact on the ability to deliver a reliable system. In an on-premises environment, these requirements can cause long lead times due to dependencies and therefore must be incorporated during initial planning.

The cloud is designed to be essentially limitless, so it is the responsibility of the cloud provider to satisfy the requirement for sufficient networking and compute capacity, while you are free to change resource size and allocation on demand, such as the size of storage devices. The following questions focus on foundational considerations for reliability:

REL 1: How are you managing service limits for your accounts?

REL 2: How are you planning your network topology?

Service limits (an upper bound on the number of each resource your team can request) exist to protect you from accidentally over-provisioning resources. You will need to have governance and processes in place to monitor and change these limits to meet your business needs. As you adopt the cloud, you may need to plan integration with existing on-premises resources (a hybrid approach). A hybrid model enables the gradual transition to an all-in cloud approach over time. Therefore, it's important to have a design for how your cloud and on-premises resources will interact as a network topology.
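On AWS, one way to keep an eye on service limits programmatically is the Service Quotas API, sketched below. The `ec2` service code is just an example; in practice you would track the quotas that matter for your workload and alert when usage approaches them.

```python
# A sketch of monitoring service limits with the Service Quotas API.
# The service code below is an example; alerting on approaching limits
# would be layered on top of this.
import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")

# List the applied quotas for a service and print each name and value.
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        print(f'{quota["QuotaName"]}: {quota["Value"]}')
```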

Change Management

Being aware of how change affects a system allows you to plan proactively, and monitoring allows you to quickly identify trends that could lead to capacity issues or SLA breaches. In traditional environments, change-control processes are often manual and must be carefully coordinated with auditing to effectively control who makes changes and when they are made. In the cloud, you can monitor the behavior of a system and automate the response to KPIs, for example, by adding additional servers as a system gains more users. You can control who has permission to make system changes and audit the history of these changes. The following questions focus on change management considerations for reliability:

REL 3: How does your system adapt to changes in demand?

REL 4: How are you monitoring AWS resources?

REL 5: How are you executing change?

When you architect a system to automatically add and remove resources in response to changes in demand, this not only increases reliability but also ensures that business success doesn’t become a burden. With monitoring in place, your team will be automatically alerted when KPIs deviate from expected norms.

Automatic logging of changes to your environment allows you to audit and quickly identify actions that might have impacted reliability. Controls on change management ensure that you can enforce the rules that deliver the reliability you need.
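As an illustration of auditing changes, the sketch below queries CloudTrail for recent instance terminations. The event name and the 24-hour window are illustrative choices; any management event recorded by CloudTrail can be looked up this way.

```python
# A sketch of auditing recent changes with CloudTrail: look up who
# terminated EC2 instances over the last 24 hours.
from datetime import datetime, timedelta, timezone
import boto3

cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventName", "AttributeValue": "TerminateInstances"}
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=24),
    EndTime=datetime.now(timezone.utc),
)

for event in events["Events"]:
    print(event["EventTime"], event.get("Username"), event["EventName"])
```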

Failure Management

In any system of reasonable complexity, failures are expected to occur. You need to know how to become aware of these failures, respond to them, and prevent them from happening again. Most cloud providers let you take advantage of automation to react to monitoring data. For example, when a particular metric crosses a threshold, you can trigger an automated action to remedy the problem.

Also, rather than trying to diagnose and fix a failed resource that is part of your production environment, you can replace it with a new one and carry out the analysis on the failed resource out of band. Since the cloud enables you to stand up temporary versions of a whole system at low cost, you can use automated testing to verify full recovery processes.

The following questions focus on failure management considerations for reliability:

REL 6: How are you backing up your data?

REL 7: How does your system withstand component failures?

REL 8: How are you testing your resiliency?

REL 9: How are you planning for disaster recovery?

Regularly back up your data and test your backup files to ensure you can recover from both logical and physical errors. A key to managing failure is the frequent and automated testing of systems to cause failure, and then observe how they recover. Do this on a regular schedule and ensure that such testing is also triggered after significant system changes. Actively track KPIs, such as the recovery time objective (RTO) and recovery point objective (RPO), to assess a system’s resiliency (especially under failure-testing scenarios).

Tracking KPIs will help you identify and mitigate single points of failure. The objective is to thoroughly test your system-recovery processes so that you are confident that you can recover all your data and continue to serve your customers, even in the face of sustained problems. Your recovery processes should be as well exercised as your normal production processes.
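A minimal backup-and-verify step might look like the sketch below: upload a backup to S3, then confirm the object exists and its size matches before trusting it. The bucket, key, and file names are placeholders, and a real recovery test should go further and actually restore the data and validate its contents.

```python
# A sketch of backup-and-verify: upload a backup to S3, then check the
# stored object's size as a first sanity test. Names are placeholders.
import os
import boto3

s3 = boto3.client("s3")
bucket, key, path = "my-backup-bucket", "db/backup-2024-01-01.dump", "backup.dump"

s3.upload_file(path, bucket, key)

# Verify the backup landed intact (size check only; a full recovery
# test would restore and validate the data).
head = s3.head_object(Bucket=bucket, Key=key)
assert head["ContentLength"] == os.path.getsize(path), "backup size mismatch"
print("Backup uploaded and verified")
```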

Key Services

The key service that supports reliability is Amazon CloudWatch, which monitors runtime metrics.

The following services and features support the three areas in reliability:

Foundations:

IAM enables you to securely control access to AWS services and resources. Amazon VPC lets you provision a private, isolated section of the AWS Cloud where you can launch AWS resources in a virtual network. AWS Trusted Advisor provides visibility into service limits. AWS Shield is a managed Distributed Denial of Service (DDoS) protection service that safeguards web applications running on AWS.

Change Management:

AWS CloudTrail records AWS API calls for your account and delivers log files to you for auditing. AWS Config provides a detailed inventory of your AWS resources and configuration, and continuously records configuration changes. Auto Scaling provides automated demand management for a deployed workload. Amazon CloudWatch provides the ability to alert on metrics, including custom metrics.
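For instance, an application can publish its own KPIs as custom metrics that CloudWatch can then alarm on, as in the sketch below. The namespace and metric name are hypothetical.

```python
# A sketch of publishing a custom metric for CloudWatch to alarm on.
# Namespace and metric name are made up for illustration.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_data(
    Namespace="MyApp/Reliability",
    MetricData=[
        {
            "MetricName": "FailedLoginAttempts",
            "Value": 3,
            "Unit": "Count",
        }
    ],
)
```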

Failure Management:

AWS CloudFormation provides templates for the creation of AWS resources and provisions them in an orderly and predictable fashion. Amazon S3 provides a highly durable service to keep backups. Amazon Glacier provides highly durable archives. AWS KMS provides a reliable key management system that integrates with many AWS services.
