Well-Architected Framework: Operational Excellence

Operational Excellence

  1. Prescribes operational practices and procedures used to manage production workloads.
  2. This covers both how planned changes are executed and how the team responds to unexpected operational events.
  3. Change execution and response should be automated.

Best Practices

There are three best practice areas for operational excellence in the cloud:

  1. Prepare
  2. Operate
  3. Evolve

Operations teams need to understand their business and customer needs so they can effectively and efficiently support business outcomes. They create and use procedures to respond to operational events, and validate that those procedures are effective in supporting business needs. They also collect metrics to measure whether the desired business outcomes are being achieved.

Prepare

Effective preparation is required to drive operational excellence. Business success is enabled by shared goals and understanding across the business, development, and operations. Common standards simplify workload design and management, enabling operational success. Design workloads with mechanisms to monitor and gain insight into the application, platform, and infrastructure components, as well as customer experience and behavior.

Operate

Successful operation of a workload is measured by the achievement of business and customer outcomes. Define expected outcomes, determine how success will be measured, and identify the workload and operations metrics that will be used in those calculations to determine if operations are successful.

Consider that operational health includes both the health of the workload and the health and success of the operations acting upon the workload (for example, deployment and incident response). Establish baselines from which improvement or degradation of operations will be identified, collect and analyze your metrics, and then validate your understanding of operations success and how it changes over time.
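As a rough illustration of establishing a baseline and spotting improvement or degradation against it, here is a minimal Python sketch. The metric (deployment duration in minutes), the sample values, the function names, and the two-standard-deviation tolerance are all assumptions made for the example, not something the framework prescribes.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize historical samples of an operations metric."""
    return {"mean": mean(samples), "stdev": stdev(samples)}

def classify(baseline, value, tolerance=2.0, higher_is_better=False):
    """Compare a new observation with the baseline.

    Values within `tolerance` standard deviations of the mean count as
    normal variation; anything outside is improvement or degradation,
    depending on which direction is desirable for the metric.
    """
    delta = value - baseline["mean"]
    band = tolerance * baseline["stdev"]
    if abs(delta) <= band:
        return "within baseline"
    improved = delta > 0 if higher_is_better else delta < 0
    return "improved" if improved else "degraded"

# Hypothetical history: duration of recent deployments, in minutes.
history = [12.0, 11.5, 12.5, 13.0, 11.0]
baseline = build_baseline(history)

# Shorter deployments are better, so higher_is_better stays False.
for observed in (12.5, 18.0, 9.0):
    print(observed, classify(baseline, observed))
```

In practice the baseline would be recomputed periodically, so that the comparison also reflects how operations success changes over time.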

Evolve

Evolution of operations is required to sustain operational excellence. Dedicate work cycles to make continuous incremental improvements. Regularly evaluate and prioritize opportunities for improvement (for example, feature requests, issue remediation, and compliance requirements), including both the workload and operations procedures.

Include feedback loops within your procedures to rapidly identify areas for improvement and capture learnings from the execution of operations. Share lessons learned across teams so that everyone benefits from them.

Services supporting operational excellence

The following services and features support the three areas of operational excellence:

Prepare

AWS Config and AWS Config rules can be used to create standards for workloads and to determine if environments are compliant with those standards before being put into production.
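One way such a standard might be expressed is as an AWS managed Config rule. The sketch below builds the request for the managed rule S3_BUCKET_VERSIONING_ENABLED and hands it to a boto3 Config client through put_config_rule. The rule name and the helper function names are hypothetical, and the choice of standard (S3 bucket versioning) is only an example.

```python
def managed_rule_request(rule_name, source_identifier, resource_types):
    """Build the ConfigRule structure for an AWS managed rule."""
    return {
        "ConfigRuleName": rule_name,
        # Owner "AWS" selects a managed rule rather than a custom one.
        "Source": {"Owner": "AWS", "SourceIdentifier": source_identifier},
        # Limit evaluation to the resource types the standard applies to.
        "Scope": {"ComplianceResourceTypes": list(resource_types)},
    }

def register_rule(config_client, request):
    """Create or update the rule; config_client is a boto3 'config' client."""
    config_client.put_config_rule(ConfigRule=request)

# Hypothetical standard: every S3 bucket must have versioning enabled.
request = managed_rule_request(
    rule_name="s3-versioning-enabled",
    source_identifier="S3_BUCKET_VERSIONING_ENABLED",
    resource_types=["AWS::S3::Bucket"],
)
print(request)
```

To apply it, one would call register_rule(boto3.client("config"), request). This assumes AWS credentials are configured and a configuration recorder is already running in the account.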

Operate

Amazon CloudWatch allows you to monitor the operational health of a workload.
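As an example of what monitoring operational health can look like, the following sketch builds the parameters for a CloudWatch alarm on EC2 CPU utilization and passes them to put_metric_alarm on a boto3 CloudWatch client. The instance ID, the 80 percent threshold, the evaluation window, and the helper function names are assumptions chosen for illustration.

```python
def cpu_alarm_request(instance_id, threshold_percent=80.0):
    """Build put_metric_alarm parameters for sustained high CPU."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 2,   # alarm after two consecutive breaches
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }

def create_alarm(cloudwatch_client, request):
    """Create or update the alarm; cloudwatch_client is a boto3 'cloudwatch' client."""
    cloudwatch_client.put_metric_alarm(**request)

# Hypothetical instance ID, used only to show the resulting request.
request = cpu_alarm_request("i-0123456789abcdef0")
print(request["AlarmName"])
```

Calling create_alarm(boto3.client("cloudwatch"), request) would register the alarm, assuming AWS credentials are configured. A real alarm would usually also set AlarmActions so that someone is notified when it fires.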

Evolve

Amazon Elasticsearch Service (Amazon ES), since renamed Amazon OpenSearch Service, allows you to analyze your log data to gain actionable insights quickly and securely.
