Azure VM Fault and Update Domains
Redundancy, high availability (HA), uptime, and service-level agreement (SLA) are all terms that revolve around one important concept: application (app) availability. You can build the best application in the world with the fanciest whiz-bang features that can cure cancer, but if that app isn’t available to users, it’s worthless. Apps and the infrastructure they are built upon must stay online and available as much as humanly possible.
One of the most common ways an app can lose a few points off of its 99.999% SLA is infrastructure downtime. Stuff happens: power goes out, rack cabling gets “accidentally” disconnected, a rogue VM goes nuts on a shared hypervisor, and more. If an app is running on a set of VMs, one way to mitigate the risk of app downtime is by spreading the workload out as far as possible.
You can spread out VM resources in many ways:
Across physical servers in the same rack
Across hypervisors in different racks in the same datacenter
Across different datacenters altogether
The more baskets you can spread your eggs across, the better. Spreading VMs out reduces the risk that a single failure takes down too much of an app’s functionality. When creating VMs in Azure for an app, one way to mitigate the risk of downtime from physical hardware failure is to spread VMs across multiple fault and update domains.
Azure VM fault and update domains
When you build a VM on-prem in a datacenter, that VM is part of a much larger, holistic infrastructure that provides services to users. The VM is only a small component of an organization’s overall technology presence.
VM -> Hypervisor -> Physical Server -> Rack -> Datacenter
That single VM relies on the hypervisor to be operational. That hypervisor relies on a physical server to be up, and so on. If, say, the power to a datacenter is cut, all the VMs in that datacenter die because all of the servers in all of the racks die.
Think of a few Azure VMs running on a hypervisor on a server in a rack. Perhaps network connectivity or power to that rack is cut. All VMs in that rack would lose connectivity to the world, and services running on those VMs would go dark. When Azure VMs all depend on the same source of network connectivity and power, they are in the same fault domain.
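To make the dependency chain concrete, here is a minimal Python sketch. The component names (vm-1, hypervisor-a, rack-1, and so on) are made up for illustration and do not correspond to anything in Azure; the point is only that a failure at any level takes down every VM above it.

```python
# Hypothetical inventory: each component maps to the component it depends on.
PARENT = {
    "vm-1": "hypervisor-a",
    "vm-2": "hypervisor-a",
    "vm-3": "hypervisor-b",
    "hypervisor-a": "server-1",
    "hypervisor-b": "server-2",
    "server-1": "rack-1",
    "server-2": "rack-2",
    "rack-1": "datacenter-1",
    "rack-2": "datacenter-1",
}


def depends_on(component, failed):
    """Return True if `component` is the failed component or relies on it."""
    while component is not None:
        if component == failed:
            return True
        # Walk one level down the chain; top-level components have no parent.
        component = PARENT.get(component)
    return False


def affected_vms(failed):
    """List every VM that goes down when `failed` goes down."""
    return sorted(
        name for name in PARENT
        if name.startswith("vm-") and depends_on(name, failed)
    )
```

In this toy inventory, affected_vms("rack-1") returns ["vm-1", "vm-2"], while affected_vms("datacenter-1") returns all three VMs.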
Azure defines a fault domain as a logical group of the underlying hardware that shares a common power source and network switch, similar to a rack within an on-premises datacenter.
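The effect of spreading VMs across fault domains can be sketched in a few lines of Python. This is an illustration, not Azure's actual placement logic: the round-robin rule, the function names, and the VM names are all assumptions made for the example.

```python
def assign_fault_domains(vm_names, fault_domain_count):
    """Place VMs round-robin: VM i lands in fault domain i % count.

    Round-robin is an assumed, simplified placement rule for illustration.
    """
    return {name: i % fault_domain_count for i, name in enumerate(vm_names)}


def survivors(placement, failed_domain):
    """Return the VMs still running after one fault domain loses power or network."""
    return sorted(name for name, fd in placement.items() if fd != failed_domain)


web_vms = ["web-0", "web-1", "web-2", "web-3"]

# Two fault domains: losing one still leaves half of the web farm online.
placement = assign_fault_domains(web_vms, fault_domain_count=2)
```

With two fault domains, survivors(placement, 0) returns ["web-1", "web-3"]. If every VM sat in a single fault domain instead, the same failure would leave no survivors at all.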
Now, consider another way VMs can go down: intentional human intervention. VMs, unfortunately, need to be patched and rebooted at times. If your app is built to tolerate one or more VMs briefly going down, the app users will be none the wiser. If, for example, you have a few database servers and web servers behind a load balancer, it’s OK to temporarily lose a database or web server every now and then. Other VMs will pick up the load.
Having multiple VMs serve the same purpose behind a load balancer is an excellent way to maintain app uptime, but not if every VM of that type is brought down at the same time. If you have three database servers serving up data to a web farm and all three database servers are rebooted at once, you’re going to have a problem.
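To ensure all VMs of a particular type (database, web, etc.) are not patched or rebooted at the same time, Azure places VMs in update domains. An update domain is a logical group much like a fault domain, but it governs planned maintenance rather than hardware failure: VMs in the same update domain can be patched or rebooted together, and Azure reboots only one update domain at a time. Spreading VMs of the same type across different update domains therefore keeps some of them online while the others are being updated.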
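A rolling reboot that takes down only one update domain at a time can be sketched as follows. As before, this is a simplified model and not Azure's implementation: the round-robin assignment, the function names, and the VM names are assumptions for illustration.

```python
def assign_update_domains(vm_names, update_domain_count):
    """Place VMs round-robin across update domains (an assumed, simplified rule)."""
    return {name: i % update_domain_count for i, name in enumerate(vm_names)}


def rolling_update(placement):
    """Yield (update_domain, rebooting, still_serving) for each maintenance step.

    Only one update domain is rebooted per step; every VM outside that
    domain keeps serving traffic.
    """
    for ud in sorted(set(placement.values())):
        rebooting = sorted(n for n, d in placement.items() if d == ud)
        serving = sorted(n for n, d in placement.items() if d != ud)
        yield ud, rebooting, serving


db_vms = ["db-0", "db-1", "db-2"]
steps = list(rolling_update(assign_update_domains(db_vms, update_domain_count=3)))

for ud, rebooting, serving in steps:
    print(f"update domain {ud}: rebooting {rebooting}, still serving {serving}")
```

With three database servers in three update domains, each step reboots exactly one server while the other two stay up. Put all three in one update domain and the single step would reboot every database server at once, which is the scenario to avoid.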