Model Capabilities and Evaluation
Explore how to effectively select foundation models by defining relevant capability dimensions, configuring evaluation jobs in Amazon Bedrock, interpreting key automatic and human metrics, and analyzing cost-performance trade-offs to optimize AI model choice for production workloads.
Selecting the right foundation model for a production workload is one of the most consequential decisions an engineer makes when building on Amazon Bedrock. Marketing benchmarks and leaderboard rankings create a false sense of certainty. A model that tops an academic benchmark may still underperform on your specific summarization task, on your proprietary data, or against your latency requirements. Teams frequently default to the largest, most expensive model available without evidence that it outperforms smaller alternatives for their use case. This habit inflates costs and introduces unnecessary latency.
Amazon Bedrock addresses this challenge with a managed Model Evaluation feature that enables systematic, reproducible comparison of foundation models using automatic metrics, LLM-as-a-judge scoring, and human review. Instead of relying on intuition, engineers can run structured evaluation jobs against their own data and make evidence-based decisions.
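To make "structured evaluation job" concrete, the following is a minimal sketch of how an automatic evaluation request could be assembled for boto3's `bedrock` client and its `create_evaluation_job` operation. The job name, dataset name, S3 URIs, role ARN, and the shape of the `inferenceParams` string are illustrative assumptions, not values from this lesson; confirm the request fields against the current API reference before relying on them.

```python
import json


def build_evaluation_job_request(job_name, role_arn, model_id,
                                 dataset_s3_uri, output_s3_uri):
    """Assemble keyword arguments for an automatic summarization evaluation job."""
    return {
        "jobName": job_name,
        "roleArn": role_arn,
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [
                    {
                        "taskType": "Summarization",
                        "dataset": {
                            # Hypothetical dataset label and location.
                            "name": "my-summarization-prompts",
                            "datasetLocation": {"s3Uri": dataset_s3_uri},
                        },
                        "metricNames": [
                            "Builtin.Accuracy",
                            "Builtin.Robustness",
                            "Builtin.Toxicity",
                        ],
                    }
                ]
            }
        },
        "inferenceConfig": {
            "models": [
                {
                    "bedrockModel": {
                        "modelIdentifier": model_id,
                        # Assumed parameter shape; it is model dependent.
                        "inferenceParams": json.dumps(
                            {"inferenceConfig": {"maxTokens": 512, "temperature": 0.0}}
                        ),
                    }
                }
            ]
        },
        "outputDataConfig": {"s3Uri": output_s3_uri},
    }


def submit_evaluation_job(request, region="us-east-1"):
    """Submit the request; needs AWS credentials and incurs inference charges."""
    import boto3  # imported lazily so the builder works without AWS access

    client = boto3.client("bedrock", region_name=region)
    return client.create_evaluation_job(**request)["jobArn"]


if __name__ == "__main__":
    request = build_evaluation_job_request(
        job_name="summarization-eval-haiku",
        role_arn="arn:aws:iam::123456789012:role/ExampleBedrockEvalRole",  # placeholder
        model_id="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
        dataset_s3_uri="s3://example-bucket/eval/prompts.jsonl",  # placeholder
        output_s3_uri="s3://example-bucket/eval/results/",  # placeholder
    )
    print(json.dumps(request, indent=2))
```

Keeping the request builder separate from the submit call lets you review or version-control the job definition before anything is sent to AWS, and rerun the same definition against a different `model_id` for a like-for-like comparison.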
This lesson covers five objectives that form a complete evaluation framework. You will learn to define the capability dimensions relevant to your application, configure and run evaluation jobs in Bedrock, interpret automated metrics such as ROUGE and BERTScore, design human-evaluation rubrics for subjective quality, and apply cost-performance trade-off analysis to select the optimal model tier. The evaluation rigor you build here feeds directly into the next lesson on inference strategies and optimization, where the selected model must be tuned for production latency and throughput.
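Note: Automatic Bedrock model evaluation jobs incur no additional cost beyond the standard inference charges for the models you select, including any judge model used for LLM-as-a-judge scoring. Human-based evaluation adds a per-task charge, so check the current Bedrock pricing before scheduling human review. This makes automated experimentation low risk.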
Model capability dimensions
Before running any evaluation, you need to know what you are measuring. A common mistake is evaluating models on generic benchmarks that do not reflect the actual tasks your application performs. The first step is identifying which capability dimensions matter for your specific workload.
Think of capability dimensions as the different skills you would test when hiring for a role. You would not give a software engineering test to a candidate applying for a copywriting position. The same logic applies to foundation models.
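One way to make the hiring analogy operational is a weighted capability profile per workload. The sketch below is illustrative only: the workload names, weights, and per-dimension scores are made-up assumptions, and the dimension keys mirror the taxonomy in this lesson. It shows how the same two models can rank differently once the dimensions are weighted for the job at hand.

```python
# Hypothetical weights: how much each capability dimension matters per workload.
WORKLOAD_PROFILES = {
    "support_summarization": {
        "instruction_following": 0.3,
        "factual_accuracy": 0.5,
        "reasoning_depth": 0.2,
        "code_generation": 0.0,
    },
    "code_assistant": {
        "instruction_following": 0.3,
        "factual_accuracy": 0.1,
        "reasoning_depth": 0.2,
        "code_generation": 0.4,
    },
}


def weighted_score(profile, scores):
    """Combine per-dimension scores (0 to 1) using the workload's weights."""
    total_weight = sum(profile.values())
    if total_weight <= 0:
        raise ValueError("profile must have a positive total weight")
    missing = [dim for dim, w in profile.items() if w > 0 and dim not in scores]
    if missing:
        raise KeyError(f"missing scores for: {missing}")
    weighted = sum(w * scores.get(dim, 0.0) for dim, w in profile.items())
    return weighted / total_weight


# Made-up scores for two hypothetical models, for illustration only.
MODEL_SCORES = {
    "large_model": {
        "instruction_following": 0.90,
        "factual_accuracy": 0.75,
        "reasoning_depth": 0.90,
        "code_generation": 0.90,
    },
    "small_model": {
        "instruction_following": 0.85,
        "factual_accuracy": 0.90,
        "reasoning_depth": 0.70,
        "code_generation": 0.50,
    },
}


def rank_models(workload):
    """Return model names sorted from best to worst for the given workload."""
    profile = WORKLOAD_PROFILES[workload]
    return sorted(
        MODEL_SCORES,
        key=lambda name: weighted_score(profile, MODEL_SCORES[name]),
        reverse=True,
    )


if __name__ == "__main__":
    for workload in WORKLOAD_PROFILES:
        print(workload, rank_models(workload))
```

With these invented numbers the smaller model ranks first for the summarization profile and the larger one for the coding profile, which is the point of choosing dimensions before choosing a model. In practice the scores would come from evaluation results rather than hand-entered values.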
The following dimensions form a practical taxonomy for engineering decisions:
Instruction following measures whether the model adheres to complex, multi-step prompts without drifting or ignoring constraints.
Factual accuracy captures the degree to which the model avoids hallucination in its responses. A hallucination occurs when a model generates information that sounds plausible but is factually incorrect or fabricated, not grounded in the provided context or real-world knowledge.
Reasoning depth evaluates the model’s ability to handle multi-hop logic, where answering a question requires chaining several pieces of information together.
Code generation tests both syntax correctness and ...