CI/CD for AI Apps

Explore how to implement CI/CD pipelines for generative AI applications on Amazon Bedrock. Learn to manage multi-dimensional changes including code, prompts, models, and configurations. Understand prompt version control, automated regression testing, defining infrastructure as code, and executing staged deployments with canary releases. Gain insights into model change management and quality gating to ensure reliable, scalable AI app deployments in production environments.

We'll cover the following...

Prompt version control and management
- Pull request reviews for prompts
- Bedrock Prompt Management as the production store
Automated prompt regression testing
- The evaluation pipeline
Infrastructure as code for Bedrock
- Key resources to codify
  - Environment promotion pattern
Staged deployment and model change management
- Deployment for production
- Model change management
Conclusion

After setting up the observability stack in the previous lesson, including Amazon CloudWatch metrics, invocation logging, AWS CloudTrail, AWS X-Ray, LLM-as-a-judge evaluations, and cost attribution dashboards, you have the operational and quality signals needed to answer an important deployment question: “Is this change safe to ship?” Those signals can feed the automated quality gates used in the CI/CD pipeline covered in this lesson. Without those signals, AI deployment pipelines cannot reliably detect regressions before rollout.

Deploying changes to AI applications often involves more moving parts than deploying a conventional code-only application. In a conventional application, code is typically the primary artifact that flows through the deployment pipeline. Applications built on Amazon Bedrock can introduce at least five separate types of change, each of which can alter application behavior in ways standard unit tests may not catch: application code can change, prompt templates can be reworded, the foundation model version can be upgraded, knowledge base content can be refreshed, and agent configurations can be restructured. A code change might pass every unit test while silently degrading prompt quality. A model version upgrade might improve latency but shift output tone in ways that violate brand guidelines.
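One way to make the five dimensions visible to a pipeline is to pin them together in a single release record. The sketch below is only an illustration under assumptions: the `ReleaseManifest` class, its field names, and all example values are hypothetical and are not part of any Bedrock API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ReleaseManifest:
    """Pins one version per change dimension (hypothetical structure)."""

    code_commit: str              # Git SHA of the application code
    prompt_version: str           # version of the prompt template
    model_id: str                 # foundation model identifier
    knowledge_base_snapshot: str  # label for the knowledge base content
    agent_config_version: str     # version of the agent configuration

    def fingerprint(self) -> str:
        """Short, stable hash that changes when any dimension changes."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def changed_dimensions(old: ReleaseManifest, new: ReleaseManifest) -> list[str]:
    """Return the names of the dimensions that differ between two releases."""
    old_fields, new_fields = asdict(old), asdict(new)
    return [name for name in old_fields if old_fields[name] != new_fields[name]]


# Placeholder values for illustration only.
baseline = ReleaseManifest("a1b2c3d", "7", "example-model-v1", "kb-snapshot-1", "3")
candidate = replace(baseline, model_id="example-model-v2")

print(changed_dimensions(baseline, candidate))  # ['model_id']
```

Because the fingerprint covers every field, a model upgrade with no code change still produces a new release identity, which gives the pipeline something concrete to test and promote.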

Traditional CI/CD pipelines assume a single artifact flows through the build, test, and deploy stages. AI applications shatter that assumption. Each of the five dimensions requires its own versioning strategy, test suite, and promotion criteria. The services and tools that address this challenge span the full pipeline:

- Bedrock Prompt Management handles prompt versioning and deployment.
- AWS CodePipeline and GitHub Actions orchestrate multi-stage validation.
- AWS CDK and CloudFormation define Bedrock resources as reproducible infrastructure.
- Bedrock Model Evaluation provides structured quality gating for model and prompt changes.
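The idea that each dimension needs its own test suite can be sketched as a small routing step that a pipeline might run before validation. Everything here is an assumption made for illustration: the directory names (other than a prompts directory), the suite names, and the `required_suites` helper are hypothetical, and a real pipeline would wire the result into its own stages.

```python
from pathlib import PurePosixPath

# Hypothetical repository layout: top-level directory -> change dimension.
DIMENSION_BY_DIR = {
    "src": "code",
    "prompts": "prompt",
    "models": "model",
    "knowledge_base": "knowledge_base",
    "agents": "agent_config",
}

# Hypothetical test suites required for each dimension.
SUITES_BY_DIMENSION = {
    "code": ["unit"],
    "prompt": ["prompt_regression"],
    "model": ["prompt_regression", "model_evaluation"],
    "knowledge_base": ["retrieval_evaluation"],
    "agent_config": ["agent_end_to_end"],
}

ALL_SUITES = sorted({s for suites in SUITES_BY_DIMENSION.values() for s in suites})


def required_suites(changed_files: list[str]) -> list[str]:
    """Map changed file paths to the test suites that must pass."""
    suites: set[str] = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        dimension = DIMENSION_BY_DIR.get(parts[0]) if parts else None
        if dimension is None:
            # Unrecognized location: be conservative and run everything.
            return ALL_SUITES
        suites.update(SUITES_BY_DIMENSION[dimension])
    return sorted(suites)


print(required_suites(["prompts/support/system.json"]))  # ['prompt_regression']
```

Falling back to every suite for unrecognized paths is a deliberate choice in this sketch: it trades pipeline time for the guarantee that an unclassified change never skips validation.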

The following diagram illustrates how these five change dimensions converge into a unified deployment pipeline:

With the challenge framed, the first dimension to formalize is prompt management, where undisciplined changes are the most frequent cause of production regressions.

Prompt version control and management

Prompts must be treated as code. Store prompt templates in a dedicated directory within the Git repository (e.g., /prompts/), with each file containing the template text, input variables, and metadata such as the model ID, temperature, and max tokens. Commit messages should explain why the change was made, not only what changed. For example: “Narrowed the system prompt to reduce unsupported ...
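A prompt file of the kind described above (template text, input variables, and model metadata) could be checked automatically on every commit. The following is a minimal sketch under assumptions: the JSON layout, the field names, the `{{name}}` placeholder syntax, the example values, and the `validate_prompt` helper are all hypothetical choices for illustration, not a prescribed format.

```python
import json
import re

# Assumed placeholder syntax: {{variable_name}}
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")
REQUIRED_METADATA = {"model_id", "temperature", "max_tokens"}


def validate_prompt(raw: str) -> dict:
    """Parse a prompt file and check that it is internally consistent."""
    prompt = json.loads(raw)

    missing = REQUIRED_METADATA - set(prompt.get("metadata", {}))
    if missing:
        raise ValueError(f"missing metadata: {sorted(missing)}")

    used = set(PLACEHOLDER.findall(prompt["template"]))
    declared = set(prompt["input_variables"])
    if used != declared:
        raise ValueError(
            f"placeholder mismatch: used={sorted(used)}, declared={sorted(declared)}"
        )
    return prompt


# Hypothetical file contents, e.g. a file stored under the prompts directory.
EXAMPLE = json.dumps(
    {
        "template": "Answer the question using only this context: {{context}}\n\nQuestion: {{question}}",
        "input_variables": ["context", "question"],
        "metadata": {"model_id": "example-model-v1", "temperature": 0.2, "max_tokens": 512},
    }
)

prompt = validate_prompt(EXAMPLE)
print(prompt["metadata"]["max_tokens"])  # 512
```

Running a check like this in CI catches a common class of prompt errors, such as a renamed variable that is still referenced in the template, before the change reaches review.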
