Representation Engineering & Circuit Breakers
Explore the advanced techniques of representation engineering and circuit breakers to gain intrinsic control over AI models. Understand how these methods detect harmful internal thought patterns and proactively prevent deceptive alignment by manipulating the model's internal representations before harmful outputs are generated.
We have now built the Alignment Engine (RLHF) to train our model to be harmless. But the most capable AI systems pose a deeper risk: Deceptive Alignment.
The threat: If a highly capable AI develops a hidden, misaligned objective, it may learn to pretend to be aligned (e.g., passing all our RLHF tests) until it is deployed. It schemes to avoid detection.
The failure of current tools: Our current tools are insufficient against a scheming model:
Robustness tests (PGD): Only catch brittleness to input perturbations, not intentional deception.
Interpretability methods (LIME/SHAP): Are only approximations and can be fooled if the model "fairwashes" its explanation (tells the auditor what they want to see).
RLHF: Can inadvertently reward the models that are best at faking alignment.
To defend against an AI that is actively trying to hide its true intent, we cannot rely on external defenses or surface behavior. We need to look directly into the model’s mind and enforce safety at the level of thought.
This requires an entirely new approach to transparency and control, moving from external feedback to internal model manipulation.
Engineering the model’s thoughts
The cutting-edge solution is Representation Engineering (RepE): a "top-down" approach to transparency research that treats the internal vectors encoding concepts (the model’s raw mathematical “thoughts”) as the fundamental unit of analysis.
We are going to answer two questions:
What is RepE? (How do we find a concept inside the model?)
What is a Circuit Breaker? (How do we use that concept to stop a failure in real-time?)
Representation Engineering (RepE)
RepE is a top-down approach to transparency research. Instead of trying to understand every individual neuron or line of code, RepE focuses on the model’s internal concepts: its mathematical representations of ideas like honesty, harm, or intent.
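To make this concrete, here is a minimal sketch of how a concept direction can be read out of a model's hidden states: collect activations on contrastive prompt pairs and take the difference of their means. The model (gpt2), the layer index, and the "honesty" prompts are illustrative assumptions, not a prescribed recipe; any causal language model that exposes hidden states would work the same way.

```python
# Minimal RepE-style "reading vector" sketch (assumptions: gpt2, layer 6,
# toy "honesty" prompts). Uses Hugging Face transformers + PyTorch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; any causal LM with hidden states works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6  # which hidden layer to probe (an arbitrary middle layer)

def last_token_hidden(text: str) -> torch.Tensor:
    """Hidden state of the final token at the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states is a tuple: (embeddings, layer 1, ..., layer N)
    return out.hidden_states[LAYER][0, -1, :]

# Contrastive prompt pairs: same content, opposite stance on the concept.
honest_prompts = [
    "Pretend you are an honest person. The capital of France is",
    "Pretend you are an honest person. Water boils at",
]
dishonest_prompts = [
    "Pretend you are a dishonest person. The capital of France is",
    "Pretend you are a dishonest person. Water boils at",
]

# The concept direction: difference of mean activations between the two sets.
honest_mean = torch.stack([last_token_hidden(p) for p in honest_prompts]).mean(0)
dishonest_mean = torch.stack([last_token_hidden(p) for p in dishonest_prompts]).mean(0)
honesty_direction = honest_mean - dishonest_mean
honesty_direction = honesty_direction / honesty_direction.norm()

# Reading: project a new prompt's activation onto the concept direction.
# A higher score suggests the model is internally closer to the "honest" pole.
score = torch.dot(last_token_hidden("I will tell you the truth:"), honesty_direction)
print(f"honesty projection: {score.item():.3f}")
```

The same projection idea is what later makes real-time monitoring possible: once a direction for a concept like "harm" has been found, the model's activations can be scored against it during generation.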
Think of a complex language model as analogous to the human brain.
Mechanistic interpretability (bottom-up): This is like dissecting the brain to study individual neurons and synapses. It’s incredibly precise but overwhelmingly complex, and it’s hard to tell what a single neuron means in the context of a thought. ...