Safety by Measurement: Capabilities, Propensities, and Control
Explore how to assess AI safety through the safety-by-measurement framework. Understand how to measure a model's maximum harmful capabilities, its default behavioral propensities, and the effectiveness of control mechanisms. This lesson equips you to evaluate AI risks systematically, proving safety to stakeholders before deployment.
The last five lessons established our core toolkit:
We exposed a model’s fragility (robustness) using FGSM and PGD.
We audited a model’s fairness (interpretability) using LIME and SHAP.
We built an RLHF pipeline to align the model’s behavior with human intent.
Now, we face the most difficult question in AI Safety: How do we prove to a manager, a regulator, or the public that a complex, highly capable AI is safe to deploy?
The crisis: The evaluation gap
Benchmarks (standard tests like MMLU and GPQA, and even the adversarial tests we ran) have three major flaws when it comes to guaranteeing the safety of frontier AI:
They become obsolete almost instantly: The pace of AI progress means that tests designed to measure capability lose their usefulness as new models surpass them (e.g., the o3 model achieving breakthrough scores shortly after benchmarks were released).
They only measure average performance: Benchmarks report aggregate scores without assessing the maximum potential for harm. We need to know what the model can do when pushed to its absolute limit, not just its average grade; the short simulation below makes this gap concrete.
The combinatorial challenge: The real danger lies in how capabilities combine. A model that is situationally aware (can tell it’s being tested) and has high coding ability (can exploit system weaknesses) creates an emergent risk that is fundamentally different from, and more dangerous than, either capability alone.
We cannot afford to wait and discover the full extent of an AI’s capabilities through its emergent real-world impacts. We need a systematic, rigorous framework for evaluation that goes beyond simple test scores.
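To make the second flaw concrete, here is a minimal, self-contained simulation. It is our own sketch with made-up numbers, not results from any real model; it contrasts a benchmark-style average score with the success rate an adversary gets by simply resampling the model many times per task (best-of-k elicitation).

```python
# Sketch: why average benchmark scores understate maximum potential.
# We simulate a model that completes a dangerous task only 5% of the time
# per attempt. Its single-attempt "average" looks low, but an adversary who
# resamples repeatedly succeeds on almost every task.
import random

random.seed(0)

PER_ATTEMPT_SUCCESS = 0.05   # hypothetical per-attempt success rate
NUM_TASKS = 1_000            # simulated evaluation tasks
ATTEMPTS_PER_TASK = 50       # how hard an adversary pushes (best-of-k)

def attempt_succeeds() -> bool:
    """One sampled completion of the task, stubbed as a biased coin flip."""
    return random.random() < PER_ATTEMPT_SUCCESS

# Average-performance view: one attempt per task, as a benchmark would score it.
average_score = sum(attempt_succeeds() for _ in range(NUM_TASKS)) / NUM_TASKS

# Maximum-potential view: a task counts as solved if ANY of k attempts works.
elicited_score = sum(
    any(attempt_succeeds() for _ in range(ATTEMPTS_PER_TASK))
    for _ in range(NUM_TASKS)
) / NUM_TASKS

print(f"Average score (1 attempt per task): {average_score:.1%}")
print(f"Elicited max (best of {ATTEMPTS_PER_TASK} attempts): {elicited_score:.1%}")
```

With these invented numbers, the single-attempt score sits near 5% while the best-of-50 success rate lands above 90%. Both numbers describe the same model; only the second tells us how much harm a determined actor could extract.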
The solution: Safety by measurement
The solution is the safety-by-measurement framework, which requires us to evaluate three distinct properties of the AI system. Think of them as the three questions we must answer before deploying any high-stakes AI.
We must evaluate:
Capabilities: What is the absolute most harm this system could cause if we gave it all the resources and tools possible?
Propensities: What behaviors does the system tend to exhibit by default? Is it naturally helpful, or is it prone to cheating or being deceptive?
Control: Do our existing safety checks (guardrails, circuit breakers, emergency stops) actually work when the AI is trying its hardest to bypass them?
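One way to see how the three questions fit together is to imagine the record a team assembles before a deployment decision. The sketch below is purely illustrative: the class names, fields, and the toy decision rule are our own assumptions, not a standard API or an agreed evaluation format.

```python
# Sketch: organizing the three safety-by-measurement questions into one
# pre-deployment record. All names and thresholds here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class CapabilityEval:
    """Maximum potential: worst outcome when the model gets every tool and trick."""
    task: str                      # e.g. "generate working exploit code"
    max_harm_observed: str         # worst result across all elicitation attempts
    elicitation_methods: list[str] = field(default_factory=list)  # tools, fine-tuning, best-of-k...

@dataclass
class PropensityEval:
    """Default behavior: what the model tends to do without adversarial pressure."""
    behavior: str                  # e.g. "deceives the grader"
    frequency: float               # fraction of unprompted trials showing the behavior

@dataclass
class ControlEval:
    """Safeguards: does the check hold when the model actively tries to bypass it?"""
    safeguard: str                 # e.g. "output filter", "sandboxed execution"
    bypassed: bool                 # did any red-team strategy defeat it?

@dataclass
class SafetyCase:
    """One record answering all three questions for a single system."""
    capabilities: list[CapabilityEval]
    propensities: list[PropensityEval]
    controls: list[ControlEval]

    def deployable(self) -> bool:
        # Toy decision rule for illustration only: block deployment if any
        # safeguard was bypassed or a harmful propensity shows up often.
        no_bypassed_controls = not any(c.bypassed for c in self.controls)
        rare_bad_propensities = all(p.frequency < 0.01 for p in self.propensities)
        return no_bypassed_controls and rare_bad_propensities
```

Laying the record out this way also shows why no single question suffices: capability results set the stakes, propensity results describe what happens by default, and control results tell us whether the first two can actually be contained.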
Let’s define each of these properties in detail and see why they are crucial for safety assessment.
Capabilities (measuring maximum potential)
A capability evaluation aims to establish the upper bound of the system’s risk. We are not interested in the model’s average performance; we are interested in the maximum harm it could achieve when pushed to its absolute limit, because that upper bound sets our safety defense budget. This is crucial because if a model can perform a dangerous task (like generating exploit code), a malicious actor will find a way to elicit that capability, even if the model refuses the request by default.