The Zero-Downtime Playbook: 3 Key Strategies to Consider

Blue/green, Canary, GitOps — what deployment strategy is right for you?
14 mins read
Jun 06, 2025

Every minute your users experience downtime or bugs, you lose trust and revenue. Rushing changes without safety checks can cause outages, but being too cautious can slow down innovation. How can cloud-native teams deploy fast and stay safe?

Today, we’ll tackle that tension head-on by comparing three zero-downtime strategies—blue/green, canary, and GitOps-driven rollouts—through the lens of AWS services and Infrastructure as Code. Along the way, we’ll:

  • Help you decide which of the three strategies fits your needs.

  • Highlight the trade-offs in cost, complexity, and rollback speed.

  • Provide three real IaC examples (Terraform, Helm charts, Argo CD manifests) so you can get started right away.

  • Share five essential AWS-aligned tool categories that enable zero-downtime rollouts.

Ready? Let’s get to it!

1. Blue/green deployments#

Blue/green deployment is essentially the “double your environment” strategy. You create two identical environments:

  • Blue: The live environment currently serving users.

  • Green: The environment where the next version is deployed and tested.

Initially, all user traffic is directed to blue, ensuring that the green environment can be safely used for deploying and validating the new release, without exposing users to untested code. Once you’re confident that green is stable and production-ready, you switch traffic over to it. Green becomes the new live environment; blue can be retired or kept on standby.
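The mechanics above can be sketched without any AWS machinery. The following Python sketch models the cutover as a single pointer flip, with rollback as the same flip in reverse; the class and version names are purely illustrative, not an AWS API:

```python
# A minimal, framework-free sketch of the blue/green switch: a router holds a
# pointer to the live environment, cutover is one atomic reassignment, and
# rollback is the same operation in reverse. All names are illustrative.

class BlueGreenRouter:
    def __init__(self):
        self.environments = {"blue": "v1.0", "green": None}
        self.live = "blue"  # all traffic starts on blue

    def deploy_to_idle(self, version):
        """Deploy the new version to whichever environment is not live."""
        idle = "green" if self.live == "blue" else "blue"
        self.environments[idle] = version
        return idle

    def cut_over(self):
        """Flip all traffic to the other environment in one step."""
        self.live = "green" if self.live == "blue" else "blue"

    def serving_version(self):
        return self.environments[self.live]

router = BlueGreenRouter()
router.deploy_to_idle("v2.0")    # green now runs v2.0, users still see v1.0
router.cut_over()                # all traffic shifts to green
print(router.serving_version())  # v2.0
router.cut_over()                # rollback: flip straight back to blue
print(router.serving_version())  # v1.0
```

Note that rollback costs exactly as much as cutover: one reassignment, which is why blue/green rollbacks feel instant.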

Blue/green deployment on EKS with ALB shifting traffic from blue to green after testing

In AWS terms, this is called “shifting traffic between two identical environments that are running different versions of your application” to eliminate downtime and make rollbacks painless.

AWS CodeDeploy has built-in support for blue/green deployments on multiple platforms, taking much of the manual work off your plate.

  • With ECS, CodeDeploy automatically launches a brand-new “green” task set alongside your existing “blue” tasks. Once the green tasks pass health checks, CodeDeploy flips your ALB target groups so all traffic streams to the new version—no manual load-balancer tweaks required.

  • With Lambda, the process is just as smooth. You deploy a new function version, then use an alias (like prod) to point traffic to it. When you’re ready, CodeDeploy updates the alias to send all invocations to the new version, instantly draining traffic from the old one.

You can codify all of this in Infrastructure as Code. For instance, a Terraform script can define your CodeDeploy application, deployment group, and the blue/green setup in just a few lines, ensuring every environment stays consistent and repeatable. Let’s take a look at a sample Terraform snippet:

resource "aws_codedeploy_app" "my_app" {
  compute_platform = "ECS"
  name             = "my-ecs-service"
}

resource "aws_codedeploy_deployment_group" "my_group" {
  app_name              = aws_codedeploy_app.my_app.name
  deployment_group_name = "blue-green-group"
  service_role_arn      = aws_iam_role.codedeploy.arn

  # Use an "all-at-once" or default config for switching
  deployment_config_name = "CodeDeployDefault.ECSAllAtOnce"

  blue_green_deployment_config {
    # Continue deployment even if a hook times out
    deployment_ready_option {
      action_on_timeout = "CONTINUE_DEPLOYMENT"
    }

    # Discover existing target groups
    green_fleet_provisioning_option {
      action = "DISCOVER_EXISTING"
    }

    # Terminate blue tasks after success
    terminate_blue_instances_on_deployment_success {
      action                           = "TERMINATE"
      termination_wait_time_in_minutes = 5
    }
  }

  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE"]
  }
}
Sample Terraform snippet for setting up a blue/green deployment

Practical tip:

With AWS CDK, you can use the codedeploy.EcsDeploymentGroup construct to set up blue/green for ECS. The secret is defining two ALB target groups—one for “blue” and one for “green”—and then letting CodeDeploy swap them when the new version is ready.

Trade-offs#

The biggest advantage of blue/green deployment is safety. You always have a full, working copy of your production environment to fall back on. If the new release misbehaves, rolling back is as simple as rerouting traffic back to your blue environment—no lengthy restores or complex scripts required.

On the flip side, you’re running two complete environments at the same time, which doubles your resource costs (servers, databases, caches, and so on). Keeping both sides perfectly in sync—especially things like database schemas or cache state—adds another layer of operational complexity.

Key takeaway: Blue/green deployments shine for major upgrades (think big feature rewrites or database migrations) where downtime isn’t an option, instant full-environment fallback is priceless, and you can afford the extra capacity.

2. Canary deployment#

Canary deployment lets you roll out changes bit by bit instead of flipping the switch all at once. It’s like sending in a small “canary” group of users to test the new code before everyone else sees it.

When deploying, you closely monitor key metrics (errors, response times, log outputs) and only proceed if everything looks healthy. If you spot a problem, you hit the brakes and roll back right away. This approach reduces your blast radius, catching issues on a small scale instead of impacting your entire user base.

A typical canary deployment pattern might involve routing 10% of traffic to the new pods while 90% stays on the old ones, then shifting to 50/50 and finally 100% on the new version once you’re confident.
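The staged ramp can be sketched in a few lines of Python. This is a conceptual model of weighted routing and stage promotion, not how ALB or CodeDeploy are implemented internally; the stage list and function names are assumptions for illustration:

```python
import random

# Illustrative sketch of the staged canary ramp described above: a router
# sends a weighted share of requests to the new version, and the weight is
# promoted through 10% -> 50% -> 100% only while health checks pass.

CANARY_STAGES = [10, 50, 100]  # percent of traffic on the new version

def route_request(canary_percent, rng=random.random):
    """Pick a backend for one request based on the current canary weight."""
    return "new" if rng() * 100 < canary_percent else "old"

def advance_canary(stage_index, healthy):
    """Move to the next stage if healthy; otherwise abort (rollback to old)."""
    if not healthy:
        return None  # abort: all traffic goes back to the old version
    return min(stage_index + 1, len(CANARY_STAGES) - 1)

# Simulate the rollout: every stage stays healthy, so we end at 100%.
stage = 0
while stage is not None and CANARY_STAGES[stage] < 100:
    stage = advance_canary(stage, healthy=True)
print(CANARY_STAGES[stage])  # 100
```

The key property is that an unhealthy check at any stage sends all traffic back to the old version, so the worst case only ever affects the canary slice.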

Canary setup with 90% traffic to blue pods, 10% traffic to gray canaries pods

AWS CodeDeploy offers built-in deployment strategies like canary and linear, making traffic shifting easier than ever. Whether you’re deploying to ECS or Lambda, you can pick from ready-made configs, like:

  • CodeDeployDefault.ECSCanary10Percent5Minutes

  • LambdaCanary10Percent5Minutes

These automatically send 10% of traffic to the new version, wait five minutes to check health, then flip the rest if everything looks good.

Under the hood, ECS canaries use ALB weighted target groups to route that 10% of traffic, while Lambda canaries rely on alias weight adjustments. It’s all handled for you—no manual routing tweaks required.

Here is a sample AWS CLI command that creates a weighted Lambda alias routing 90% of traffic to version 1 and 10% to version 2.

aws lambda create-alias \
  --name routing-alias \
  --function-name my-function \
  --function-version 1 \
  --routing-config '{"AdditionalVersionWeights": {"2": 0.10}}'
AWS CLI command to create a weighted Lambda alias

AWS also lets you automate rollback on error spikes by attaching CloudWatch alarms to your deployment. If error rates climb beyond your threshold during the canary window, CodeDeploy will automatically roll back the traffic shift, keeping your users safe from a bad release.
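The rollback decision itself boils down to a threshold check. Here is a hedged Python sketch of that logic; the 5% threshold and function name are illustrative stand-ins, not CodeDeploy’s actual internals:

```python
# A sketch of the alarm-driven rollback decision: compare the error rate
# observed during the canary window against a threshold, much like a
# CloudWatch alarm wired into CodeDeploy would. The threshold value and
# function name are illustrative.

def should_roll_back(errors, requests, threshold=0.05):
    """Return True if the canary window's error rate breaches the threshold."""
    if requests == 0:
        return False  # no traffic yet; nothing to judge
    return errors / requests > threshold

# 3 errors out of 200 requests (1.5%) stays under a 5% threshold.
print(should_roll_back(errors=3, requests=200))   # False
# 15 errors out of 200 requests (7.5%) trips the rollback.
print(should_roll_back(errors=15, requests=200))  # True
```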

AWS App Mesh also supports canaries. You define two virtual nodes (old and new) and a weighted route on a virtual router, then shift the weights over time—for example, from 100% old to 90/10 to 50/50—using Terraform.

# Weighted App Mesh route for a canary: 90% of traffic stays on the old
# virtual node, 10% goes to the new one. The mesh, router, and virtual
# node resources are assumed to be defined elsewhere in the configuration.
resource "aws_appmesh_route" "canary" {
  name                = "canary-route"
  mesh_name           = aws_appmesh_mesh.main.id
  virtual_router_name = aws_appmesh_virtual_router.main.name

  spec {
    http_route {
      match {
        prefix = "/"
      }
      action {
        weighted_target {
          virtual_node = aws_appmesh_virtual_node.old.name
          weight       = 90
        }
        weighted_target {
          virtual_node = aws_appmesh_virtual_node.new.name
          weight       = 10
        }
      }
    }
  }
}
Sample Terraform snippet showing weighted routing for canary

Practical tip:

In Kubernetes on EKS, you can automate this using Flagger or Argo Rollouts with the App Mesh controller.

Trade-offs#

Canary releases are all about lowering risk and capturing real user feedback. By sending a small slice of real production traffic to your new code, you get live testing with minimal user impact. That early exposure makes it much easier to spot regressions before they impact everyone.

Those benefits come with their own trade-offs. You’ll need solid monitoring and traffic management in place to make smart go/no-go decisions. Observability—think metrics, logs, and dashboards—is your canary in the coal mine for deciding when to move forward.

Canaries usually cost less than blue/green deployments—you only launch a small new fleet while keeping the old one live. Just make sure you have proper routing controls in place.

Key takeaway: Canary deployments provide fast, continuous delivery without the full price tag of duplicate environments. They’re perfect for rolling out new features bit by bit, tuning as you go, and rolling back at the very first sign of trouble.

3. GitOps deployment#

GitOps is not a specific rollout pattern, but a deployment automation model built around Git as your “single source of truth” for both infrastructure and applications.

In a GitOps deployment model, the entire system state is stored declaratively in a Git repository. This includes Kubernetes manifests, Terraform configurations, Helm charts, or any other infrastructure and application definitions. A GitOps controller, such as Argo CD or Flux, continuously monitors the repository and pulls changes from Git. When a change is detected, it automatically synchronizes the live environment to match the declared state in the repo, ensuring consistency and version control.

Want to deploy a new version? Just commit your update to Git. The controller will pick it up and roll it out. It even corrects any drift if someone accidentally tweaks production.
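The reconcile loop at the heart of this model can be shown in miniature. This toy Python version diffs a desired-state map against a live one and converges the live side, including drift someone introduced by hand; real controllers like Argo CD diff full Kubernetes objects, so everything here is illustrative:

```python
# A toy reconciliation loop in the spirit of Argo CD or Flux: compare the
# desired state declared in Git against the live cluster state and converge
# the live side, pruning anything that was removed from the repo.

def reconcile(desired, live):
    """Mutate `live` until it matches `desired`; return the changes applied."""
    changes = []
    for name, spec in desired.items():
        if live.get(name) != spec:
            live[name] = spec          # create or correct the resource
            changes.append(("apply", name))
    for name in list(live):
        if name not in desired:
            del live[name]             # prune resources removed from Git
            changes.append(("prune", name))
    return changes

desired = {"myapp": {"image": "myapp:v2", "replicas": 3}}
live = {"myapp": {"image": "myapp:v2", "replicas": 5},  # manual drift
        "debug-pod": {"image": "busybox"}}              # not in Git
print(reconcile(desired, live))  # [('apply', 'myapp'), ('prune', 'debug-pod')]
print(live == desired)           # True
```

Running the loop again immediately afterward applies no changes, which is the steady state a GitOps controller sits in between commits.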

GitOps deployment automation model

Because every change is tracked in Git, you get fully versioned, auditable deployments. Deployments happen with a simple Git commit instead of manual CLI commands. That makes multi-cluster rollouts a breeze and lets you roll back just by doing a git revert.

GitOps rollback automation for the live environment

On AWS, GitOps typically means running Argo CD or Flux in an Amazon EKS cluster. For example, using Terraform or AWS CDK, you might install Argo CD and then define an Application CRD that points to your Git repo.

Here’s a sample snippet of an Argo CD Application manifest to get you started:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/myapp.git
    targetRevision: HEAD
    path: charts/myapp
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

At its heart, GitOps is all about keeping your live environment continuously in sync with your Git repo and having a full audit trail of every change.

AWS offers GitOps-friendly tools too—think EKS Blueprints, CodePipeline integrations, and AWS Controllers for Kubernetes—but most teams gravitate toward Argo CD or Flux.

Practical tip:

Even with GitOps at the core, you can still run blue/green or canary rollouts—let Argo CD handle the continuous Git sync while Argo Rollouts manages the progressive delivery.

Trade-offs#

GitOps excels at consistency and developer productivity. Every change undergoes code review, lives in version control, and is fully reproducible, so you never have to wrestle with “configuration drift” again.

Rolling back is as simple as reverting a Git commit. Your live environment is always being compared against your repo, so if someone makes a manual tweak, the system automatically brings things back in line.

On the flip side, getting started with GitOps can feel heavy. You’ll need Kubernetes controllers, well-defined RBAC, and CI/CD pipelines. And because it’s built around declarative configs, it really shines in cloud-native, Kubernetes-heavy stacks.

Also, GitOps by itself doesn’t handle traffic shifts the way blue/green or canary strategies do—it’s focused on config management. In practice, teams often marry GitOps with progressive delivery: commit a new YAML, let Argo Rollouts or a similar tool pull it in, and run a canary on EKS.

Key takeaway: If you care deeply about infrastructure as code rigor, multi-cluster consistency, and a full audit trail in Git, GitOps is hard to beat.

3 real-world, zero-downtime examples#

These case studies illustrate the practical benefits of adopting zero-downtime deployment strategies.

1. Ada accelerated upgrades with blue/green deployments#

The challenge: Ada, a leader in AI-powered customer service automation, faced prolonged Kubernetes upgrade cycles that spanned months. These extended timelines delayed feature rollouts and risked service disruptions during updates.

The solution: To address these challenges, Ada implemented a blue-green deployment strategy using Amazon EKS in conjunction with AWS Global Accelerator. This approach allowed them to run parallel environments, enabling seamless traffic switching between old and new versions without downtime.

The outcome: By adopting this strategy, Ada achieved near-zero downtime during major Kubernetes upgrades. The upgrade process was significantly streamlined, reducing the time from months to days. Additionally, Ada experienced a 70% increase in deployment velocity, a 30% boost in compute efficiency, and a 15% reduction in compute costs.

2. Perry Street Software (PSS) implemented resilient deployments with canary strategies#

The challenge: PSS relied on Capistrano scripts for deployments, which had become increasingly complex and challenging for new developers to manage. The absence of a CI/CD pipeline further hindered their ability to deploy code efficiently and reliably.

The solution: PSS overhauled its deployment process by adopting AWS CodePipeline and Amazon ECS. It established a canary deployment strategy, initially directing a small percentage of traffic (e.g., 10%) to new ECS service versions. This approach allowed it to monitor performance and ensure stability before a full rollout.

The outcome: The implementation of canary deployments enabled PSS to detect issues early in the deployment process, minimizing the risk of widespread disruptions. This strategy facilitated a more resilient and manageable deployment workflow, aligning with modern CI/CD practices.

3. Landbay streamlined deployments with GitOps and Flux#

The challenge: Landbay, a digital mortgage platform, grappled with deployment inefficiencies and security concerns. Their existing processes were time-consuming and lacked the agility needed for rapid development cycles.

The solution: Landbay transitioned to a GitOps approach by integrating Flux with Amazon EKS. This shift allowed them to manage infrastructure and application deployments declaratively through Git repositories, ensuring consistency and traceability.

The outcome: The adoption of GitOps streamlined Landbay’s operations, leading to faster deployments and reduced waiting times. The integration enhanced their security posture and provided engineering efficiencies across the board, revolutionizing their development processes.

5 AWS considerations#

To enable zero-downtime rollouts, here are five essential AWS-aligned tool categories—each supporting one or more deployment strategies:

  1. AWS CodeDeploy: Core tool for blue/green and canary deployments across ECS, Lambda, EC2, and on-prem. It handles automated traffic shifting and integrates with CloudWatch Alarms for rollback on failure.

  2. Application Load Balancer (ALB): Supports blue/green and canary by routing traffic between target groups. Weighted routing is key for ECS and Lambda rollouts and integrates directly with CodeDeploy.

  3. Amazon EKS: Runs Argo CD for GitOps with declarative, Git-driven sync, while Argo Rollouts adds canary and blue/green delivery in Kubernetes using progressive rollout strategies and metrics.

  4. AWS App Mesh with Flagger: Best for canary deployments in microservice architectures. It enables fine-grained traffic shaping between service versions, and Flagger automates rollout decisions based on Prometheus metrics.

  5. CloudWatch and Prometheus: Critical across all strategies. CloudWatch Alarms trigger auto-rollbacks in CodeDeploy. In Kubernetes, Prometheus powers Argo Rollouts or Flagger decisions using latency, error rate, and health checks.

Choosing the right strategy#

Each deployment strategy offers trade-offs in terms of downtime, rollback safety, infrastructure cost, and operational complexity. To help you decide what fits your use case best, the table below breaks down key differences across several dimensions.

Comparative Overview of Modern Deployment Strategies

| Aspect | Blue/Green | Canary | GitOps |
| --- | --- | --- | --- |
| Downtime risk | ~0% (instant switch) | Very low (gradual ramp) | Depends on strategy |
| Rollback | Instant (flip back to old env) | Pause or revert rollout | git revert and sync to the last good state |
| Infra cost | High (duplicate environments) | Medium (one full + partial new) | Varies (usually K8s clusters, CI/CD tools) |
| Complexity | Moderate (two envs to manage) | High (traffic routing and monitoring) | High (requires Git workflow + controllers) |
| Ideal case | Full-stack or DB migrations; instant cutover | Feature releases, backend updates, and detecting issues early | Multi-cluster CI/CD; teams needing a full audit trail |

By understanding the trade-offs—risk vs. cost vs. complexity—you can choose the right pattern for your scenario.

TL;DR#

Choosing the right zero-downtime deployment strategy depends on your team’s priorities—speed, safety, cost, and control:

  • Blue/green offers instant rollback with higher infra cost

  • Canary minimizes risk through gradual rollout

  • GitOps brings versioned, auditable delivery powered by Git

When combined with AWS-native tools like CodeDeploy, App Mesh, and EKS, each approach becomes easier to implement with Infrastructure as Code. Start small, automate smartly, and scale your rollout strategies with confidence.

The best way to master these patterns is hands-on practice!

Apply deployment strategies with Educative Cloud Labs#

Spin up a Cloud Lab — a setup-free, hands-on way for learners to interact with cloud services — to run some tests and watch your traffic shift with zero disruption.


Written By:
Fahim ul Haq