Red Teaming with PyRIT and HarmBench
Explore how to automate red teaming for AI models using PyRIT and HarmBench. Learn to design a scalable testing loop, build custom adapters for different model endpoints, and measure safety with Attack Success Rate metrics, enabling you to identify vulnerabilities and improve AI safety evaluation.
To verify control, we must assume the worst-case scenario: an intelligent adversary (or a strategically scheming AI) is actively attempting to bypass safety guardrails.
Relying on a human to manually test a model creates the "lazy attacker" dilemma:
If our red team only tests 50 manual prompts and finds 0 failures, we haven’t proved the model is safe. We have only proved that our red team ran out of time or creativity.
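To make this concrete, here is a rough back-of-the-envelope check using the statistical "rule of three" (an illustration added here, not part of the original material): observing zero failures in a small batch of prompts still leaves a surprisingly large upper bound on the true failure rate.

```python
# If 0 failures are observed in n independent attack attempts, the "rule of
# three" gives an approximate 95% upper confidence bound of 3 / n on the true
# jailbreak rate.
def asr_upper_bound(n_attacks: int) -> float:
    return 3 / n_attacks

print(f"{asr_upper_bound(50):.1%}")      # 6.0%  -> the model could still fail ~1 in 17 prompts
print(f"{asr_upper_bound(10_000):.2%}")  # 0.03% -> a far tighter guarantee
```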
Three critical failures of manual testing
For modern LLMs and Agents, manual testing fails in three specific engineering dimensions:
Scale: An automated attacker can generate thousands of prompt variations in minutes. A human tester can only check a few dozen per hour. We need to measure safety at the scale of 10k+ attacks to achieve statistical significance.
Sophistication: The most dangerous attacks today are not written by humans; they are optimized mathematically (e.g., Greedy Coordinate Gradient (GCG) or other gradient-based attacks). A human cannot manually write a gradient-optimized adversarial string.
Agentic misuse: Testing advanced AI agents, which use tools and execute multi-step plans, is too complex for simple Q&A testing. We need an automated system to track the agent’s intermediate steps and tool usage.
The engineering goal: Attack Success Rate (ASR)
To build safe systems, we use probabilistic metrics to measure how safe the system actually is.
Our goal in this lesson is to build a pipeline that calculates the Attack Success Rate (ASR):
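$$
\text{ASR} = \frac{\text{Number of successful attacks}}{\text{Total number of attack attempts}}
$$

Here a "successful" attack is one that elicits the prohibited behavior rather than a refusal. For example, if 120 of 1,000 automated attacks produce the harmful behavior, the ASR is 12%.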
To calculate this reliably, we cannot depend on manual human effort. The solution is to automate the adversarial process with standardized frameworks that integrate sophisticated attack algorithms with real-world threat models.
Automated Red Teaming
To replace the human tester, we build a closed-loop system known as the Red Teaming Loop. This architecture lets us quantitatively assess a model's safety by autonomously running thousands of attacks.
The automated workflow consists of four distinct components that interact in a continuous cycle. Understanding this data flow is critical before writing any code. The cycle proceeds as follows (a minimal code sketch follows the list):
Attack strategy (the source): The system selects a specific attack vector (e.g., a prompt from a database).
Orchestrator (the manager): The central engine that formats the prompt, handles the network request, and sends it to the target.
Target (the victim): The AI model being stress-tested (e.g., Mistral, Llama, or a custom agent).
Scorer (the judge): A deterministic or LLM-based system that grades the Target’s response to determine if the attack was successful (Jailbroken) or failed (Refused).
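The sketch below makes this data flow concrete in plain Python. Every name in it (red_team_loop, AttackResult, the stand-in target and scorer) is a hypothetical placeholder for illustration, not a PyRIT API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical placeholder types -- illustrative only, not PyRIT classes.
SendFn = Callable[[str], str]         # sends a prompt to the Target model, returns its reply
ScoreFn = Callable[[str, str], bool]  # judges whether the reply constitutes a jailbreak

@dataclass
class AttackResult:
    prompt: str
    response: str
    jailbroken: bool

def red_team_loop(attack_prompts: List[str], send: SendFn, score: ScoreFn) -> Tuple[List[AttackResult], float]:
    """Run every attack prompt through the Target and grade each reply with the Scorer."""
    results = []
    for prompt in attack_prompts:                  # 1. Attack strategy supplies the prompt
        response = send(prompt)                    # 2-3. Orchestrator delivers it to the Target
        results.append(AttackResult(prompt, response, score(prompt, response)))  # 4. Scorer grades it
    asr = sum(r.jailbroken for r in results) / len(results)
    return results, asr

# Trivial stand-ins: a target that always refuses, and a keyword-based refusal scorer.
fake_send: SendFn = lambda p: "I'm sorry, I can't help with that."
refusal_scorer: ScoreFn = lambda p, r: not any(k in r.lower() for k in ("i'm sorry", "i cannot", "i can't"))

_, asr = red_team_loop(["<attack prompt 1>", "<attack prompt 2>"], fake_send, refusal_scorer)
print(f"ASR: {asr:.0%}")  # 0% -- every response was scored as a refusal
```

In the real pipeline, the stand-in `send` and `score` functions are replaced by the Orchestrator's network layer and an LLM-based judge; the loop structure stays the same.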
The orchestrator (PyRIT)
PyRIT (Python Risk Identification Tool) acts as the backbone of this architecture. In manual testing, we are the orchestrator: we copy and paste prompts and read responses. In automated testing, PyRIT handles this logic. It is responsible for the following (see the usage sketch after this list):
Memory: Logging every interaction to a database (DuckDB) for analysis.
Concurrency: Sending 50+ attacks in parallel to maximize throughput.
Resilience: Handling API rate limits and connection errors automatically.
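As a rough sketch of driving PyRIT: the names below (initialize_pyrit, IN_MEMORY, OpenAIChatTarget, PromptSendingOrchestrator, send_prompts_async) follow PyRIT's published examples, but the API has shifted between releases, so treat every import, class, and parameter here as an assumption to verify against your installed version.

```python
import asyncio

# These imports follow PyRIT's published examples; exact module paths and
# signatures vary by release -- verify against your installed version.
from pyrit.common import IN_MEMORY, initialize_pyrit
from pyrit.orchestrator import PromptSendingOrchestrator
from pyrit.prompt_target import OpenAIChatTarget

async def main() -> None:
    # Initialize PyRIT's memory. IN_MEMORY is convenient for experiments;
    # a DuckDB-backed database persists every interaction for later analysis.
    initialize_pyrit(memory_db_type=IN_MEMORY)

    # The Target: endpoint and API key are typically read from environment variables.
    target = OpenAIChatTarget()

    # The Orchestrator: formats prompts, batches requests, handles retries,
    # and logs every prompt/response pair to memory.
    orchestrator = PromptSendingOrchestrator(objective_target=target)

    attack_prompts = ["<attack prompt 1>", "<attack prompt 2>"]
    await orchestrator.send_prompts_async(prompt_list=attack_prompts)

asyncio.run(main())
```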
The threat library (HarmBench)
We need a high-quality source of attacks. Instead of inventing prompts from scratch, we rely on standardized threat libraries such as HarmBench.
HarmBench is distinct because it separates the harmful behavior from the test case.
Behavior: The abstract goal (e.g., Create a keylogger). ...