Operational Efficiency and Optimization for GenAI Applications I
Explore techniques to optimize generative AI applications on AWS by reducing token usage through prompt compression, improving throughput with parallel processing, applying semantic caching for repeated queries, and minimizing latency with response streaming. Understand how these approaches enhance operational efficiency without compromising model quality or requiring major architecture changes.
Question 51
A company operates a customer-facing GenAI chatbot built on Amazon Bedrock. After reviewing monthly cost reports, the team discovers that token usage has increased significantly. Analysis shows that repeated system instructions and verbose prompts are contributing to unnecessary token consumption. The company wants to reduce overall token costs by at least 40% without changing the underlying foundation model or degrading response quality.
Which approach will most effectively reduce token usage?
A. Increase the temperature parameter to encourage shorter responses.
B. Apply prompt compression and context pruning to remove redundant instructions and unused conversation history.
C. Enable Amazon Bedrock provisioned throughput to stabilize inference costs.
D. Replace the existing ...
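To make the technique in option B concrete, here is a minimal sketch of context pruning: keep the system prompt once and drop the oldest conversation turns until the prompt fits a token budget. Everything here is illustrative, not part of any AWS API; the ~4-characters-per-token estimate is a common rough heuristic, and the function and variable names are assumptions for this sketch.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def prune_context(system_prompt: str, history: list, budget: int) -> list:
    """Keep the system prompt once; drop the oldest turns until the
    estimated total token count fits within the budget."""
    kept = list(history)

    def total() -> int:
        return estimate_tokens(system_prompt) + sum(
            estimate_tokens(m["content"]) for m in kept
        )

    while kept and total() > budget:
        kept.pop(0)  # drop the oldest turn first; recent turns matter most
    return kept

# Hypothetical conversation history for a support chatbot.
history = [
    {"role": "user", "content": "First question about billing. " * 10},
    {"role": "assistant", "content": "Answer to the first question. " * 10},
    {"role": "user", "content": "Follow-up about my latest invoice."},
]
pruned = prune_context("You are a helpful support bot.", history, budget=60)
```

In a real deployment the pruned history would be passed to the model invocation (for example, the `messages` list of a Bedrock Converse request), and a production version would use the model's actual tokenizer rather than a character-count heuristic, but the cost lever is the same: fewer input tokens per request without touching the foundation model.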