Task-Specific Quality Metrics

Explore how to evaluate AI output quality beyond aggregate accuracy by stratifying results by document type and field, measuring confidence calibration, and tracking workflow stage metrics. Understand how these metrics reveal failure patterns and guide targeted improvements for production-ready Claude AI systems.

We'll cover the following...

Why aggregate scores are insufficient
Stratifying by document type
Stratifying by field
Confidence calibration
Workflow stage metrics
Complete code
Connecting metrics to improvement decisions
Exercise: Read the metrics
What’s next?

An aggregate accuracy score answers the question “How often is the output correct overall?” It does not answer the questions that matter for a production system: which document types are failing, which fields are unreliable, whether high-confidence outputs are actually trustworthy, and at what stage of the pipeline failures are occurring. A system that is 92% accurate overall may be 65% accurate on handwritten invoices, 40% accurate on the tax_id field, and return inflated confidence scores that downstream systems trust incorrectly.

This lesson covers how to stratify evaluation to make those failures visible. By the end of this lesson, we will be able to:

Explain why aggregate accuracy scores mask the failure patterns that matter in production
Stratify extraction accuracy by document type and by field
Measure confidence calibration and identify when Claude’s stated confidence does not match actual accuracy
Track workflow stage metrics: retry rate, second-pass success rate, and escalation rate

Why aggregate scores are insufficient

An aggregate accuracy score averages across all documents, all fields, and all confidence levels. This average is dominated by the most common cases, which are usually the easiest ones. Hard documents and rare fields become statistical noise. Consider a pipeline that extracts five fields from three document types: standard digital invoices (80% of volume, 97% accuracy), scanned paper invoices (15% of volume, 71% accuracy), and handwritten receipts (5% of volume, 44% accuracy).

The aggregate accuracy is (0.80 × 0.97) + (0.15 × 0.71) + (0.05 × 0.44) = 0.776 + 0.107 + 0.022 = 0.905, or 90.5%. A 90.5% headline number sounds good, but the 44% accuracy on handwritten receipts is invisible inside it.

Stratifying by document type reveals that handwritten receipts need a different extraction approach: more examples in the system prompt, a different model, or human review by default. The aggregate number gives no signal about where to invest improvement effort.
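The weighted-average arithmetic above can be reproduced in a few lines. This is a minimal sketch: the shares and accuracies are the ones from the calculation above, while the dictionary layout and the names `DOC_TYPES`, `aggregate_accuracy`, and `weakest_type` are illustrative assumptions, not part of any library.

```python
# Volume shares and per-type accuracies from the worked example above.
# The data layout and all names here are hypothetical, chosen for illustration.
DOC_TYPES = {
    "standard_digital_invoice": {"share": 0.80, "accuracy": 0.97},
    "scanned_paper_invoice": {"share": 0.15, "accuracy": 0.71},
    "handwritten_receipt": {"share": 0.05, "accuracy": 0.44},
}


def aggregate_accuracy(doc_types):
    """Return the volume-weighted mean of the per-type accuracies."""
    return sum(d["share"] * d["accuracy"] for d in doc_types.values())


def weakest_type(doc_types):
    """Return the name of the document type with the lowest accuracy."""
    return min(doc_types, key=lambda name: doc_types[name]["accuracy"])


aggregate = aggregate_accuracy(DOC_TYPES)
worst = weakest_type(DOC_TYPES)

print(f"aggregate accuracy: {aggregate:.3f}")
print(f"weakest type: {worst} at {DOC_TYPES[worst]['accuracy']:.0%}")
```

Because the weakest type carries only 5% of the weight, even dropping its accuracy to zero would lower the aggregate by about two percentage points, which is why the headline number cannot be used to find it.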

Stratifying by document type

Tracking accuracy per document type requires attaching a document type label to each extraction result and computing accuracy separately for each label. The label can come ...

Document Type	Share of Volume	Accuracy
Standard digital invoices	80%	97%
Scanned paper invoices	15%	71%
Handwritten receipts	5%	44%
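Computing accuracy separately for each label might look like the sketch below. It assumes each extraction result has already been scored against ground truth and carries a document type label, however that label was obtained. The `ExtractionResult` record, its field names, and the sample data are hypothetical and exist only to make the example runnable.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    """Hypothetical record: one scored extraction with its document type label."""
    doc_type: str
    correct: bool


def accuracy_by_doc_type(results):
    """Group results by doc_type and compute count and accuracy per group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for result in results:
        totals[result.doc_type] += 1
        hits[result.doc_type] += int(result.correct)
    return {
        doc_type: {"n": totals[doc_type], "accuracy": hits[doc_type] / totals[doc_type]}
        for doc_type in totals
    }


# Toy data for illustration only; these are not real evaluation results.
sample_results = [
    ExtractionResult("standard_digital_invoice", True),
    ExtractionResult("standard_digital_invoice", True),
    ExtractionResult("standard_digital_invoice", True),
    ExtractionResult("standard_digital_invoice", True),
    ExtractionResult("scanned_paper_invoice", True),
    ExtractionResult("scanned_paper_invoice", False),
    ExtractionResult("handwritten_receipt", False),
    ExtractionResult("handwritten_receipt", False),
]

by_type = accuracy_by_doc_type(sample_results)
for doc_type, stats in by_type.items():
    print(f"{doc_type}: n={stats['n']}, accuracy={stats['accuracy']:.0%}")
```

The sketch reports the sample count `n` next to each accuracy because low-volume strata, such as a document type that makes up 5% of traffic, produce noisy estimates when the evaluation set is small.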
