
Breaking Models with Adversarial Attacks

Explore adversarial attacks like the Fast Gradient Sign Method and Projected Gradient Descent to test AI model robustness. Understand how these techniques reveal model brittleness by generating subtle input perturbations that cause misclassifications. Learn to implement these attacks on image classifiers and interpret results to build evidence for improving AI safety.

In the previous chapter, we operated in the world of theory. We discussed alignment, goals, and risks. Now, we are moving into the world of practice.

In this lesson, we are going to adopt the mindset of a Red Team (safety engineers who proactively test systems for flaws).

We are shifting our focus from alignment (does the AI want to do the right thing?) to robustness (can the AI survive difficult conditions?).  

A robust model is one that remains stable and accurate even when it encounters noisy, corrupted, or maliciously crafted data. A brittle model might work perfectly on clean data, but if you change a few pixels (an adversarial perturbation), it might confidently misclassify a stop sign as a speed limit sign.

To fix this brittleness, we must first be able to measure it. We do this using adversarial evaluation tools.

The diagnostic tools we will build

We are going to implement two industry-standard methods for stress-testing neural networks. In the research literature, these are often called attacks, but for a safety engineer, they are tests.

  1. Fast Gradient Sign Method (FGSM): Think of this as a worst-case sensitivity check. It looks at the gradient of the model’s loss with respect to the input and nudges the input slightly in the direction that increases the loss the most. In other words, it asks: If I change this image by just a tiny amount in the most damaging way, does the model get confused?

  2. Projected Gradient Descent (PGD): Think of this as a worst-case scenario test. It is an iterative process, essentially FGSM applied repeatedly, with each step kept inside a small allowed perturbation budget, that searches for the absolute hardest input for the model to handle. It asks: What is the single most confusing input this model could possibly see? (A minimal sketch follows this list.)
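To make the iterative search concrete, here is a minimal sketch of a PGD-style loop written against a generic PyTorch classifier. This is not the implementation we build later in the lesson: the function name and the epsilon, alpha, and num_steps values are illustrative assumptions, and x and y stand for a preprocessed input tensor and its label.

import torch
import torch.nn.functional as F

def pgd_sketch(model, x, y, epsilon=0.03, alpha=0.007, num_steps=10):
    """Iterated FGSM: repeatedly nudge the image uphill on the loss, but never
    let any pixel drift more than epsilon away from the original."""
    x_orig = x.detach()
    x_adv = x_orig.clone()
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # One small FGSM-style step in the direction that increases the loss
        x_adv = x_adv.detach() + alpha * grad.sign()
        # The "projection": clamp back into the allowed epsilon-ball around the original
        x_adv = torch.min(torch.max(x_adv, x_orig - epsilon), x_orig + epsilon)
    return x_adv.detach()

Each pass takes a small uphill step on the loss, then clamps the result so the total change never exceeds epsilon per pixel. That clamping is the "projection" in the name.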

Why do we do this?

We don’t do this to break the model for fun. We do it because adversarial training, the process of showing the model these hard examples during training, is currently our best defense against this kind of failure. You cannot train a robust model if you cannot generate the hard examples it needs to learn from.

Let’s break down the logic of the sensitivity check, FGSM, and see how we use the model’s own mathematics to find its blind spots.

Fast Gradient Sign Method (FGSM)

Think of this as a sensitivity check. It asks a simple question: “If I give this image a tiny nudge in the worst possible direction, does the model fall apart?”

The intuition: Using the map against itself

To train a neural network, we use gradient descent.

  1. The model looks at a picture of a Panda.

  2. It calculates the loss (the error).

  3. It calculates the gradient (the direction "downhill" to reduce the error).

  4. It updates its weights to move "downhill" so it gets the answer right next time.

FGSM does the exact opposite. Instead of updating the weights to reduce the error, we update the image to maximize the error. We calculate the gradient (the direction of highest error) and add a tiny layer of noise in that exact direction.

Mathematically, it looks like this:

$x_{\text{adv}} = x + \epsilon \cdot \text{sign}(\nabla_x \text{Loss})$

  • Original image ($x$): The clean picture (e.g., a Panda).

  • Gradient ($\nabla_x \text{Loss}$): The direction that increases the error the most.

  • The sign() (signum) function: This function takes any number and turns it into +1 (if positive), -1 (if negative), or 0 (if exactly zero).

    • In FGSM, we don’t care how big the gradient is; we only care about its direction.

    • By applying sign(), we ensure that we push every pixel by the exact same amount (epsilon) in the direction that maximizes error, regardless of the gradient's size.

  • Epsilon ($\epsilon$): The volume knob. This limits how much noise we add (e.g., 0.01) so the change remains invisible to humans.

  • Adversarial image ($x_{\text{adv}}$): The perturbed picture we feed back to the model.
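In PyTorch, that formula is only a few lines. The sketch below is a minimal version of the idea, not the full implementation we build in the next steps; it assumes a classifier model, a preprocessed input tensor x, its correct label y, and an illustrative epsilon, and the function name is hypothetical.

import torch
import torch.nn.functional as F

def fgsm_sketch(model, x, y, epsilon=0.01):
    # Track operations on the image so PyTorch can give us d(loss)/d(pixels)
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # how wrong is the model on this image?
    loss.backward()                       # gradient of the loss w.r.t. every pixel
    # Push every pixel by exactly +/- epsilon in the direction that raises the loss
    return (x + epsilon * x.grad.sign()).detach()

Notice that the weights never change. Only the image moves, and it moves uphill on the loss instead of downhill.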

Why Images and not LLMs?

We start with images because adversarial attacks rely on gradients (calculus).

  • Images are continuous: We can change a pixel from 0.50 to 0.51. The math is perfect for teaching the concept of following the gradient to maximize error.

  • LLMs are discrete: We cannot change a word by 1%. Text requires much more complex optimization (like the GCG attack we will cover in this lesson).

  • The lesson: Once we master the concept of gradient-based attacks here, applying it to LLMs later becomes much easier.
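A tiny illustration of the continuous-versus-discrete difference, with made-up numbers:

import torch

# A pixel lives on a continuous scale: a small nudge is still a valid pixel.
pixel = torch.tensor(0.50)
nudged_pixel = pixel + 0.01      # 0.51 is a perfectly valid image value

# A token is a discrete ID in a vocabulary: "ID 4587 + 0.01" is not a token.
token_id = torch.tensor(4587)    # a made-up token ID
# There is no token between two IDs, so we cannot slide smoothly along a
# gradient through the input the way we can with pixels.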

Let’s implement the Fast Gradient Sign Method (FGSM).

Step 1: Setting up the model and initial check

To begin this exercise, we need three things: a model to attack, an image to test, and a way to look up the model’s class labels. We will use a standard ResNet-50 pre-trained on ImageNet. We must explicitly enable gradient tracking on the input image, because the attack relies on the gradient of the loss with respect to that raw input.

First, we load the necessary libraries, initialize the model in evaluation mode (model.eval()), load the image, and check the model’s performance on the clean input to establish a baseline.

Python 3.10.4
# --- Setup and Model Load ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).to(device)
model.eval()

# --- Image Preparation (Assumes image path is valid) ---
image_path = "../Giant_Panda_in_Beijing_Zoo_1.jpeg"
#... (Image loading and preprocessing code omitted for brevity)

# Create the input tensor, enable gradient tracking for the attack
input_tensor = preprocess(img).unsqueeze(0).to(device).requires_grad_(True)
labels = get_imagenet_labels() # Load the class names

# --- Initial Prediction (Before Attack) ---
pred_initial = model(input_tensor)
probabilities_initial = F.softmax(pred_initial, dim=1)
top_prob_initial, top_catid_initial = torch.max(probabilities_initial, 1)

original_confidence = top_prob_initial.item()
target_id_for_attack = top_catid_initial.item()
original_label = labels.get(target_id_for_attack, f"ID {target_id_for_attack} (Unknown)")

print("\n--- Model Prediction: BEFORE ATTACK ---")
print(f"Top Prediction: {original_label} (ID: {target_id_for_attack})")
print(f"Confidence: {original_confidence*100:.2f}%")
  • Line 2: We select the hardware accelerator. It checks if a GPU (cuda) is available for faster calculation; otherwise, it defaults to the CPU.

  • Line 3: We load the ResNet-50 architecture with pre-trained weights (IMAGENET1K_V1). The model is already trained: it has learned to recognize 1,000 different object categories (including pandas) from the ImageNet dataset.

  • Line 4: We switch the model to evaluation mode (eval()). This is critical because it ensures layers like Dropout and Batch Normalization behave consistently during our test, preventing random fluctuations.

  • Line 7: We define image_path to point to our sample image (a Giant Panda). This is the input we will attempt to break by adding invisible noise.

  • Line 11: This is the most important line for the attack:

    • preprocess(img): Converts the raw image into the tensor format the model expects (a sketch of a typical preprocessing pipeline appears after this walkthrough).

    • unsqueeze(0): Adds a batch dimension (turning the shape from [3, 224, 224] to [1, 3, 224, 224]), as PyTorch models expect batches.

    • requires_grad_(True): This is a critical step. It tells PyTorch to track every mathematical operation performed on this image. We need this history to calculate the gradient (the direction to push the pixels) to create the attack.

  • Line 15–17: We run the forward pass (model(input_tensor)), convert the raw output scores (logits) into probabilities using softmax, and find the single highest probability (torch.max).

  • Line 19–25: We extract the data from the tensors and print the baseline result. This confirms what the model sees before we tamper with it.
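For reference, here is what the omitted loading and preprocessing typically looks like for an ImageNet-trained ResNet-50. This is a hedged sketch of a standard pipeline, not the lesson's exact code; in particular, get_imagenet_labels as written here is just one common way to recover the class names.

from PIL import Image
from torchvision import models, transforms

# Standard ImageNet preprocessing: resize, center-crop, convert to tensor, normalize
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open(image_path).convert("RGB")   # image_path as defined above

def get_imagenet_labels():
    # The pre-trained weights ship with their class names in the metadata
    categories = models.ResNet50_Weights.IMAGENET1K_V1.meta["categories"]
    return dict(enumerate(categories))

In recent torchvision versions, models.ResNet50_Weights.IMAGENET1K_V1.transforms() also returns the canonical preprocessing pipeline bundled with these weights, which avoids hard-coding the normalization constants.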

Result: ...