Task-Specific Quality Metrics
Explore how to evaluate AI output quality beyond aggregate accuracy by stratifying results by document type and field, measuring confidence calibration, and tracking workflow stage metrics. Understand how these metrics reveal failure patterns and guide targeted improvements for production-ready Claude AI systems.
An aggregate accuracy score answers the question “How often is the output correct overall?” It does not answer the questions that matter for a production system: which document types are failing, which fields are unreliable, whether high-confidence outputs are actually trustworthy, and at what stage of the pipeline failures occur. A system that is 92% accurate overall may be 65% accurate on handwritten invoices and 40% accurate on the tax_id field, and it may return inflated confidence scores that downstream systems wrongly trust. This lesson covers how to stratify evaluation to make those failures visible.
By the end of this lesson, we will be able to:
Explain why aggregate accuracy scores mask the failure patterns that matter in production
Stratify extraction accuracy by document type and by field
Measure confidence calibration and identify when Claude’s stated confidence does not match actual accuracy
Track workflow stage metrics: retry rate, second-pass success rate, and escalation rate
Why aggregate scores are insufficient
An aggregate accuracy score averages across all documents, all fields, and all confidence levels. This average is dominated by the most common cases, which are usually the easiest ones. Hard documents and rare fields become statistical noise. Consider a pipeline that extracts five fields from three document types:
| Document Type | Share of Volume | Accuracy |
| --- | --- | --- |
| Standard digital invoices | 80% | 97% |
| Scanned paper invoices | 15% | 71% |
| Handwritten receipts | 5% | 44% |
The aggregate accuracy is the volume-weighted average of the per-type accuracies: (0.80 × 0.97) + (0.15 × 0.71) + (0.05 × 0.44) = 0.776 + 0.107 + 0.022 ≈ 0.905, or about 90.5%. A 90.5% headline number sounds good, but the 44% accuracy on handwritten receipts is invisible inside it.
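The weighted average can be reproduced in a few lines. The following is a minimal sketch that uses the shares and accuracies from the table; the dictionary keys and the `weighted_accuracy` helper are illustrative names, not part of any library.

```python
# Volume share and accuracy per document type, copied from the table.
# The key names are hypothetical labels chosen for this sketch.
segments = {
    "standard_digital_invoice": (0.80, 0.97),
    "scanned_paper_invoice": (0.15, 0.71),
    "handwritten_receipt": (0.05, 0.44),
}


def weighted_accuracy(segments):
    """Return the volume-weighted average accuracy across segments."""
    total_share = sum(share for share, _ in segments.values())
    weighted_sum = sum(share * acc for share, acc in segments.values())
    return weighted_sum / total_share


aggregate = weighted_accuracy(segments)

# The weakest segment is the one the headline number hides.
worst_type, (worst_share, worst_acc) = min(
    segments.items(), key=lambda item: item[1][1]
)

print(f"aggregate accuracy: {aggregate:.3f}")
print(f"weakest segment: {worst_type} ({worst_acc:.0%} on {worst_share:.0%} of volume)")
```

Because the weakest segment carries only 5% of the volume, even a large change in its accuracy moves the aggregate by a couple of points at most, which is why the headline number is a poor alarm for it.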
Stratifying by document type reveals that handwritten receipts need a different extraction approach: more examples in the system prompt, a different model, or human review by default. The aggregate number gives no signal about where to invest improvement effort.
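In practice the per-type numbers come from grouping individual evaluation results by their document type label. The sketch below assumes each result is a dictionary with a `doc_type` label and a boolean `correct` flag; both field names and the sample records are hypothetical and stand in for whatever the evaluation harness actually records.

```python
from collections import defaultdict

# Hypothetical evaluation records, one per processed document.
results = [
    {"doc_type": "digital_invoice", "correct": True},
    {"doc_type": "digital_invoice", "correct": True},
    {"doc_type": "digital_invoice", "correct": True},
    {"doc_type": "digital_invoice", "correct": True},
    {"doc_type": "handwritten_receipt", "correct": True},
    {"doc_type": "handwritten_receipt", "correct": False},
]


def accuracy_by_type(results):
    """Group results by doc_type and return the accuracy for each group."""
    counts = defaultdict(lambda: [0, 0])  # doc_type -> [correct, total]
    for result in results:
        bucket = counts[result["doc_type"]]
        bucket[0] += int(result["correct"])
        bucket[1] += 1
    return {doc_type: c / n for doc_type, (c, n) in counts.items()}


per_type = accuracy_by_type(results)
overall = sum(r["correct"] for r in results) / len(results)

print(f"overall: {overall:.2f}")
for doc_type, acc in sorted(per_type.items(), key=lambda item: item[1]):
    print(f"{doc_type}: {acc:.2f}")
```

Sorting the output from lowest to highest accuracy puts the segments that most need attention at the top of the report.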
Stratifying by document type
Tracking accuracy per document type requires attaching a document type label to each extraction result and computing accuracy separately for each label. The label can come ...