Building a Simple RLHF Loop
Explore the fundamentals of Reinforcement Learning from Human Feedback by building a complete three-phase RLHF loop. Understand how to fine-tune a language model to behave as a helpful assistant, train a reward model to score outputs based on human preference, and apply Proximal Policy Optimization to align the model’s behavior with ethical values. This lesson teaches practical steps to internally steer AI models towards safety and alignment.
So far, our practical focus has been external:
In the lesson on robustness, we learned to break the model externally with Projected Gradient Descent (PGD) attacks to find flaws.
In the lesson on interpretability, we learned to observe the model externally (LIME/SHAP) to diagnose flaws.
Now we shift focus to the internal mechanisms. In this lesson, we examine the mechanics of the alignment engine, specifically Reinforcement Learning from Human Feedback (RLHF). RLHF is the primary technique used to guide a raw next-token–predicting LLM toward behavior that is helpful, honest, and harmless.
The goal: Aligning intent
Our goal in this lesson is to implement the entire three-phase RLHF pipeline. We will train a model to prefer certain outputs over others.
As we learned earlier, RLHF requires three phases:
Supervised Fine-Tuning (SFT): Teaching the model how to be a good assistant.
Reward Model (RM) training: Training a separate model to score the quality of the outputs based on human preference.
Reinforcement learning (RL/PPO): Using the Reward Model’s score to train the original model to maximize that score (the objective optimized in this phase is sketched below).
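To make the third phase concrete, here is the objective commonly optimized in RLHF (a standard formulation written out for reference; the notation is ours, not something the lesson's libraries expose verbatim):

$$\max_{\theta}\;\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big]\;-\;\beta\,\mathrm{KL}\big(\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{SFT}}(y \mid x)\big)$$

Here $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{SFT}}$ is the frozen Phase 1 model, $r_\phi$ is the Reward Model from Phase 2, and $\beta$ controls how strongly we penalize drift away from the SFT behavior.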
The challenge: Scaling and tooling
Implementing RLHF from scratch is notoriously complex and computationally expensive. In a production environment (like those used by OpenAI or Anthropic), it requires a massive compute cluster and thousands of human labelers.
To make this practical and executable, we will use industry-standard libraries designed for research and simplicity, specifically the Hugging Face ecosystem (which includes the popular transformers library and its RL extension, TRL); a short sketch of the assumed environment follows the list below.
We will use Mistral-7B as our base model.
We will use a synthetic dataset to simulate the high-quality human preference data (instead of crowdsourcing feedback).
We will use the Proximal Policy Optimization (PPO) algorithm, which is the standard RL algorithm for this task.
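As a rough sketch of the environment these choices assume (the package list below is our assumption of the standard stack for quantized LoRA training plus TRL; exact names and versions may differ in your setup):

# Quick sanity check for the stack assumed in this lesson.
# Install sketch (an assumption; adjust to your environment):
#   pip install torch transformers peft bitsandbytes accelerate trl
import torch
import transformers
import peft
import trl

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("trl:", trl.__version__)
print("CUDA available:", torch.cuda.is_available())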
We now begin the first phase of our RLHF pipeline: Supervised Fine-Tuning (SFT).
Phase 1: Supervised Fine-Tuning (SFT)
The goal of SFT is not yet to make the model ethically aligned; the goal is simpler: to teach the model the job description.
A base LLM is powerful, but it doesn’t know how to be a polite or structured assistant. SFT fixes this by training the model on a small, curated dataset of human-written examples of desired behavior (e.g., responses that are concise, structured, or directly address the prompt).
The process: We fine-tune a pre-trained model on expert examples of desired outputs for various prompts.
The outcome: The model learns the format, style, and persona of a helpful assistant. It becomes our initial Policy Model that we will later align.
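Concretely, a single SFT training example is nothing more than a prompt and a human-written response joined into one string that the model learns to continue. The minimal template below (a plain space-join plus an end-of-sequence token) mirrors the data prep used later in this lesson; real projects often use richer chat templates.

# One SFT training example: prompt + desired response in a single string.
# "</s>" stands in for the tokenizer's EOS token (an assumption for illustration;
# in practice, use tokenizer.eos_token as the training code below does).
prompt = "What is the capital of France?"
response = "The capital of France is Paris."
eos_token = "</s>"

training_example = prompt + " " + response + eos_token
print(training_example)
# -> What is the capital of France? The capital of France is Paris.</s>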
Step 1: Setting up the open model and data
We will use a popular open-weight model, such as a variant of the Mistral 7B family, which is known for its efficiency. We will use the Hugging Face ecosystem, the standard for open-source LLM research, as our tooling framework.
This code loads our open-source base model and simulates the curated dataset of high-quality examples that a development team typically produces.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 1. Setup Quantization Config (The "Compressor")
# This tells standard hardware to load the model in 4-bit format (using ~5GB VRAM instead of 15GB)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

MODEL_NAME = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# 2. Load Model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto"  # Automatically puts model on GPU
)

# Enable gradient checkpointing (saves even more memory at cost of small speed dip)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

# 3. Apply LoRA (Low-Rank Adaptation)
# Instead of training 7B params, we freeze them and add tiny trainable layers
config = LoraConfig(
    r=8,                                  # Rank: low rank results in fewer parameters to train
    lora_alpha=32,                        # Alpha: scaling factor for LoRA weights
    target_modules=["q_proj", "v_proj"],  # Only target attention layers (saves memory)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)

# 4. Print Trainable Parameters
# This confirms we are only training <1% of the model!
model.print_trainable_parameters()

# --- Continue with Data Prep ---
prompts = ["What is the capital of France?", "Who wrote Macbeth?"]
responses = ["The capital of France is Paris.", "Macbeth was written by William Shakespeare."]
inputs = [p + " " + r + tokenizer.eos_token for p, r in zip(prompts, responses)]
tokenized_inputs = tokenizer(inputs, return_tensors='pt', padding=True, truncation=True).to("cuda")

print(f"✅ Model Loaded with LoRA. Ready for training.")
Lines 1–3: We import the essential libraries: PyTorch for tensor operations and GPU support, Transformers (Hugging Face) to download and manage the pre-trained Mistral model and tokenizer, and PEFT to enable efficient fine-tuning by injecting tiny, trainable LoRA adapters into the frozen model.
Lines 7–12: This configures the “Compressor”. We use BitsAndBytesConfig to load the massive 7B-parameter model in 4-bit precision (nf4). This trick cuts memory usage by roughly two-thirds, so the model fits in about 5 GB of VRAM instead of roughly 15 GB.