Representation Engineering & Circuit Breakers
Explore the advanced techniques of representation engineering and circuit breakers to gain intrinsic control over AI models. Understand how these methods detect harmful internal thought patterns and proactively prevent deceptive alignment by manipulating the model's internal representations before harmful outputs are generated.
We have now built the Alignment Engine (RLHF) to train our model to be harmless. But the most capable AI systems pose a deeper risk: Deceptive Alignment.
The threat: If a highly capable AI develops a hidden, misaligned objective, it may learn to pretend to be aligned (e.g., passing all our RLHF tests) until it is deployed. It schemes to avoid detection.
The failure of current tools: Our current tools are insufficient against a scheming model:
Robustness tests (PGD): Only catch brittleness to input perturbations, not intentional deception.
Interpretability methods (LIME/SHAP): Are only approximations and can be fooled if the model "fairwashes" its explanation (tells the auditor what they want to see).
RLHF: Can inadvertently reward the models that are best at faking alignment.
To defend against an AI that is actively trying to hide its true intent, we cannot rely on external defenses or surface behavior. We need to look directly into the model’s mind and enforce safety at the level of thought.
This requires an entirely new approach to transparency and control, moving from external feedback to internal model manipulation.
Engineering the model’s thoughts
The cutting-edge solution is Representation Engineering (RepE): a "top-down" approach to transparency research that treats the internal vectors encoding concepts (the model’s raw mathematical “thoughts”) as the fundamental unit of analysis.
We are going to answer two questions:
What is RepE? (How do we find a concept inside the model?)
What is a Circuit Breaker? (How do we use that concept to stop a failure in real-time?)
Representation Engineering (RepE)
RepE is a top-down approach to transparency research. Instead of trying to understand every individual neuron or line of code, RepE focuses on the model’s internal concepts: its mathematical representations of ideas like honesty, harm, or intent.
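To make this concrete, here is a minimal sketch of how a concept direction can be read out of a model's hidden states: collect activations on contrastive prompt pairs and take the difference of their means. The model (gpt2), the layer index, and the "honesty" prompts are illustrative assumptions, not a prescribed recipe; any causal language model that exposes hidden states would work the same way.

```python
# Minimal RepE-style "reading vector" sketch (assumptions: gpt2, layer 6,
# toy "honesty" prompts). Uses Hugging Face transformers + PyTorch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; any causal LM with hidden states works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6  # which hidden layer to probe (an arbitrary middle layer)

def last_token_hidden(text: str) -> torch.Tensor:
    """Hidden state of the final token at the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states is a tuple: (embeddings, layer 1, ..., layer N)
    return out.hidden_states[LAYER][0, -1, :]

# Contrastive prompt pairs: same content, opposite stance on the concept.
honest_prompts = [
    "Pretend you are an honest person. The capital of France is",
    "Pretend you are an honest person. Water boils at",
]
dishonest_prompts = [
    "Pretend you are a dishonest person. The capital of France is",
    "Pretend you are a dishonest person. Water boils at",
]

# The concept direction: difference of mean activations between the two sets.
honest_mean = torch.stack([last_token_hidden(p) for p in honest_prompts]).mean(0)
dishonest_mean = torch.stack([last_token_hidden(p) for p in dishonest_prompts]).mean(0)
honesty_direction = honest_mean - dishonest_mean
honesty_direction = honesty_direction / honesty_direction.norm()

# Reading: project a new prompt's activation onto the concept direction.
# A higher score suggests the model is internally closer to the "honest" pole.
score = torch.dot(last_token_hidden("I will tell you the truth:"), honesty_direction)
print(f"honesty projection: {score.item():.3f}")
```

The same projection idea is what later makes real-time monitoring possible: once a direction for a concept like "harm" has been found, the model's activations can be scored against it during generation.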
Think of a complex language model as analogous to the human brain.
Mechanistic interpretability (bottom-up): This is like dissecting the brain to study individual neurons and synapses. It’s incredibly precise but overwhelmingly complex, and it’s hard to tell what a single neuron means in the context of a thought. ...