Orchestration and APIs

Explore how to connect and orchestrate AWS services such as API Gateway, Lambda, Step Functions, EventBridge, and SQS to build scalable and reliable generative AI applications using Amazon Bedrock. Understand API management, streaming options, event-driven automation, queuing for burst traffic, and quota monitoring essential for production AI systems.

We'll cover the following...

WebSocket APIs for streaming
Step Functions for multi-service workflows
- Designing a document processing pipeline
  - Choosing between “Standard” and “Express” workflows
Event-driven automation with EventBridge
SQS for decoupled Bedrock invocations
- The SQS-triggered Lambda pattern
  - Handling failures and backpressure
Rate limiting and quota management
- Exponential backoff with jitter
- Monitoring quota utilization
Conclusion

Building generative AI applications for production requires more than selecting an appropriate foundation model and writing effective prompts. The previous lesson explored how Amazon SageMaker and Amazon Bedrock integration patterns support model access and custom ML workflows, but exposing those capabilities to users requires an orchestration layer. This lesson focuses on the services that connect client applications to Amazon Bedrock, routing requests, coordinating workflows, reacting to events, and helping prevent overload through throttling, queues, retries, and scaling controls. This layer works like traffic control for your AI application: without it, even a strong model can be difficult to use reliably in production.

Amazon API Gateway serves as the managed entry point that sits between client applications and Lambda functions that invoke Bedrock. Rather than exposing Lambda functions directly, API Gateway provides a managed API layer that handles authentication, validation, throttling, and caching before any request reaches your inference logic.

Several REST API design considerations matter when fronting Bedrock workloads. Request models and validators enforce JSON schema on incoming prompts, rejecting malformed payloads before they consume Lambda execution time. Usage plansA configuration in API Gateway that pairs API keys with throttling limits and quota caps, controlling how many requests each client can make per second and per month. prevent abuse by capping request rates per consumer. Stage-level response caching can reduce latency and Bedrock costs for deterministic or highly repetitive requests such as FAQ lookups, policy summaries, or static product information. CORS configuration enables browser-based frontends to call your API directly.

Practical tip: Enable response caching for endpoints that serve predictable queries, such as product descriptions or policy summaries. A 5-minute TTL can dramatically reduce Bedrock invocation costs without noticeably degrading answer freshness.

Amazon Bedrock exposes multiple inference APIs, each of which maps naturally to different API Gateway integration patterns. The InvokeModel API provides direct synchronous model inference for standard request-response workloads and pairs well with REST proxy integrations. InvokeModelWithResponseStream enables token-by-token streaming for low-latency responses, making it suitable for real-time conversational interfaces. The Converse and ConverseStream APIs are designed for conversational interactions, with the client sending the conversation history with each request to maintain context across turns. ConverseStream extends this pattern by delivering tokens in a streaming fashion for chat-style user experiences.

For latency-sensitive workloads, the performanceConfig parameter allows applications to choose between standard and optimized inference modes at invocation time, providing a tuning lever for balancing responsiveness and cost without changing the underlying model selection.

The following diagram illustrates how REST and WebSocket patterns differ when used to front Bedrock:

WebSocket APIs for streaming

Standard REST request-response cycles force the client to wait until Bedrock generates the entire response before displaying anything. For conversational AI interfaces, this creates an unacceptable delay. Users expect to see tokens appear progressively, much like watching someone type a reply in a chat application.

WebSocket APIs in API Gateway maintain persistent bidirectional connections that enable token-by-token streaming. The connection life cycle follows a predictable sequence:

Connect route: Authenticates the user and registers the connectionId in a DynamoDB table, establishing the session record that subsequent messages reference.
Disconnect route: Cleans up the connection record from DynamoDB when the client disconnects, or the idle timeout expires.

Graceful disconnection handling matters because clients can drop unexpectedly. Configure the idle timeout to match your application’s expected interaction cadence, and ensure the $disconnect route reliably removes stale records.

Attention: WebSocket APIs do not support REST-style usage plans or API keys. Applications typically enforce tenant-level throttling through Lambda concurrency controls, custom authorization logic, or application-side rate limiting.

Streaming responses introduce a trade-off between user experience and architectural complexity. REST APIs are simpler to build, cache, and monitor, while WebSocket APIs require connection state management but deliver a significantly better conversational experience.

Step Functions for multi-service workflows

When a Bedrock-powered application involves more than a single inference call, you need an orchestration engine that coordinates multiple services reliably. AWS Step Functions provides exactly this capability through visual, serverless state machines.

Designing a document processing pipeline

Consider a concrete workflow triggered when a document arrives in S3. An AWS Step Functions state machine coordinates these stages: an AWS Lambda function extracts the document text and splits it into chunks, a Bedrock StartIngestionJob call initiates ingestion and indexing into the knowledge base, the workflow waits for ingestion completion, and a subsequent Bedrock InvokeAgent call analyzes the newly indexed content, a DynamoDB PutItem stores the analysis results, and an SNS Publish sends a notification to downstream consumers.

Choosing between “Standard” and “Express” workflows

Step Functions offers two workflow types, and the choice directly impacts your Bedrock architecture.

Standard workflows support long-running executions with exactly-once semantics and a full audit trail. They suit multi-step document processing pipelines where each execution may take minutes, and you need guaranteed delivery.
Express workflows handle high-volume, short-duration executions with at-most-once or at-least-once semantics. They are the right choice for high-throughput Bedrock invocation patterns where individual executions complete in seconds.

Error handling is built into the state machine definition itself. Retry blocks with exponential backoff handle transient ThrottlingException errors from Bedrock without any custom code. Catch blocks route persistent failures to cleanup or notification states. This declarative retry logic is a significant advantage over implementing backoff manually in Lambda.

The following table compares all the orchestration and messaging services covered in this lesson.

1.Introduction

2.Prompt Engineering and Model Selection

3.Customizing Models and Knowledge Retrieval

4.Building AI Agents with Amazon Bedrock

5.Integrating Bedrock with the AWS Ecosystem

6.Amazon Bedrock AgentCore and Production Agent Pipelines

7.Security and Responsible AI in Bedrock

8.Conclusion

Orchestration and APIs

WebSocket APIs for streaming

Step Functions for multi-service workflows

Designing a document processing pipeline

Choosing between “Standard” and “Express” workflows