Best Practices for the Cloud
Explore how to design fault-tolerant cloud applications by anticipating hardware and software failures, implementing automation, and ensuring scalability. Gain insights into leveraging elasticity, decoupling components, and parallel processing to build efficient, resilient cloud architectures optimized for high availability and performance.
We'll cover the following...
- Design for failure, and nothing will fail
- Questions that you need to ask yourself
- Questions that you need to ask
- Tactics for implementing the above best practice:
- Decouple your components
- Questions you need to ask:
- Implement elasticity
- To automate the deployment process:
- Benefits of bootstrapping your instances:
- AWS-specific tactics to automate your infrastructure
- Think parallel
- Tactics for parallelization:
- Tactics for implementing this best practice:
In this section, you will learn best practices that will help you design and build an application in the cloud.
Design for failure, and nothing will fail
Best practice: when designing cloud architectures, assume components will fail. Always design, implement, and deploy systems with automated failure recovery.
In particular, assume that your hardware will fail. Assume that outages will occur. Assume that some disaster will strike your application. Assume you will be slammed with more requests per second than expected at some point.
If you accept that things will fail over time, incorporate that thinking into your architecture, build mechanisms to handle failure before disaster strikes, and adopt a scalable infrastructure, you will end up with a fault-tolerant architecture optimized for the cloud.
Questions that you need to ask yourself
What happens if a node in your system fails? How do you recognize that failure? How do you replace that node? What kinds of scenarios do you have to plan for? What are your single points of failure? If a load balancer sits in front of an array of application servers, what happens if that load balancer fails? If your architecture has master and secondary nodes, what happens if the master node fails? How does the failover occur, and how is a new secondary instantiated and brought into sync with the master? Just as you design for hardware failure, you also have to design for software failure.
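One common way to answer "How do you recognize that failure?" is a heartbeat check: each node reports in periodically, and a node that stays silent past a timeout is treated as failed and replaced. The following is a minimal sketch of that idea, not a production failure detector. The `HeartbeatMonitor` class, its method names, and the `launch_replacement` callback are hypothetical names introduced for illustration; in a real deployment the callback would call whatever provisioning mechanism you use.

```python
import time
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that went silent."""

    timeout_seconds: float
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, node_id, now=None):
        # Nodes call this periodically; `now` is injectable to make testing easy.
        self.last_seen[node_id] = time.monotonic() if now is None else now

    def failed_nodes(self, now=None):
        # A node counts as failed once it has been silent longer than the timeout.
        now = time.monotonic() if now is None else now
        return sorted(
            node_id
            for node_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )

    def replace(self, node_id, launch_replacement, now=None):
        # `launch_replacement` is a hypothetical hook that starts a new node
        # and returns its identifier.
        new_id = launch_replacement(node_id)
        del self.last_seen[node_id]
        self.record_heartbeat(new_id, now)
        return new_id


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat("app-1", now=0.0)
monitor.record_heartbeat("app-2", now=0.0)
monitor.record_heartbeat("app-1", now=8.0)  # app-2 never reports again

for dead in monitor.failed_nodes(now=10.0):
    monitor.replace(dead, lambda old: old + "-replacement", now=10.0)
```

Note that the monitor itself is a single point of failure in this sketch, which is exactly the kind of question the paragraph above asks you to keep raising.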
Questions that you need to ask
What happens to your application if the services it depends on change their interface? What if a downstream service times out or returns an exception? What if the cache keys grow beyond the memory limit of an instance? Build mechanisms to handle these failures. For example, the following strategies can help in the event of failure:
- Have a coherent backup and restore strategy for your data and automate it.
- Build process threads that resume on reboot.
- Allow the state of the system to re-sync by reloading messages from queues.
- Keep pre-configured and pre-optimized virtual images to support on launch/boot.
- Avoid in-memory sessions or stateful user context; move that to data stores.
Good cloud architectures should be impervious to reboots and re-launches. You can do this using a combination of Amazon SQS and Amazon SimpleDB; the overall controller architecture is very resilient to the types of failures listed in this section.
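The queue-and-data-store idea can be sketched as follows. This is an illustration under stated assumptions, not the Amazon SQS or Amazon SimpleDB API: `InMemoryQueue` and `InMemoryStore` are hypothetical stand-ins for a durable queue and an external data store, and `run_worker` with its `crash_after` parameter exists only to simulate an instance failing mid-run. The point it shows is that a worker holding no state of its own can be relaunched and re-sync simply by reading the queue again.

```python
class InMemoryQueue:
    """Stand-in for a durable queue: a message stays until explicitly deleted."""

    def __init__(self, messages):
        self._messages = dict(messages)  # message_id -> body

    def receive(self):
        return list(self._messages.items())

    def delete(self, message_id):
        self._messages.pop(message_id, None)


class InMemoryStore:
    """Stand-in for an external data store that outlives any single worker."""

    def __init__(self):
        self.done = {}  # message_id -> result


def run_worker(queue, store, handler, crash_after=None):
    """Process every queued message; safe to re-run after a crash."""
    processed = 0
    for message_id, body in queue.receive():
        if crash_after is not None and processed == crash_after:
            raise RuntimeError("simulated instance failure")
        # Results live in the store, not in worker memory, so a relaunched
        # worker can tell which messages were already handled.
        if message_id not in store.done:
            store.done[message_id] = handler(body)
        # Delete only after the result is safely stored.
        queue.delete(message_id)
        processed += 1


queue = InMemoryQueue({"m1": "a", "m2": "b", "m3": "c"})
store = InMemoryStore()

try:
    run_worker(queue, store, str.upper, crash_after=1)  # dies after one message
except RuntimeError:
    pass

run_worker(queue, store, str.upper)  # relaunched worker picks up the rest
```

Deleting a message only after its result is stored means a crash can cause a message to be seen twice but never lost, which is why the handler's effect is recorded under the message ID and checked before reprocessing.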