Building a Simple RLHF Loop
Explore the fundamentals of Reinforcement Learning from Human Feedback by building a complete three-phase RLHF loop. Understand how to fine-tune a language model to behave as a helpful assistant, train a reward model to score outputs based on human preference, and apply Proximal Policy Optimization to align the model’s behavior with ethical values. This lesson teaches practical steps to internally steer AI models towards safety and alignment.
So far, our practical focus has been external:
In the lesson on robustness, we learned to break the model externally with Projected Gradient Descent (PGD) attacks to find flaws.
In the lesson on interpretability, we learned to observe the model externally (LIME/SHAP) to diagnose flaws.
Now we shift focus to the internal mechanisms. In this lesson, we examine the mechanics of the alignment engine, specifically Reinforcement Learning from Human Feedback (RLHF). RLHF is the primary technique used to guide a raw next-token–predicting LLM toward behavior that is helpful, honest, and harmless.
The goal: Aligning intent
Our goal in this lesson is to implement the entire three-phase RLHF pipeline. We will train a model to prefer certain outputs over others.
As we learned earlier, RLHF requires three phases:
Supervised Fine-Tuning (SFT): Teaching the model how to be a good assistant.
Reward Model (RM) training: Training a separate model to score the quality of the outputs based on human preference.
Reinforcement learning (RL/PPO): Using the Reward Model’s score to train the original model to maximize that score (the objective optimized in this phase is sketched below).
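To make the third phase concrete, here is the objective commonly optimized in RLHF (a standard formulation written out for reference; the notation is ours, not something the lesson's libraries expose verbatim):

$$\max_{\theta}\;\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big]\;-\;\beta\,\mathrm{KL}\big(\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{SFT}}(y \mid x)\big)$$

Here $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{SFT}}$ is the frozen Phase 1 model, $r_\phi$ is the Reward Model from Phase 2, and $\beta$ controls how strongly we penalize drift away from the SFT behavior.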
The challenge: Scaling and tooling
Implementing RLHF from scratch is notoriously complex and computationally expensive. In a production environment (like those used by OpenAI or Anthropic), it requires a massive compute cluster and thousands of human labelers.
To make this practical and executable, we will use industry-standard libraries designed for research and simplicity, specifically the Hugging Face ecosystem (which includes the popular transformers library and its RL extension, TRL); a short sketch of the assumed environment follows the list below.
We will use Mistral-7B as our base model.
We will use a synthetic dataset to simulate the high-quality human preference data (instead of crowdsourcing feedback).
We will use the Proximal Policy Optimization (PPO) algorithm, which is the standard RL algorithm for this task.
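As a rough sketch of the environment these choices assume (the package list below is our assumption of the standard stack for quantized LoRA training plus TRL; exact names and versions may differ in your setup):

# Quick sanity check for the stack assumed in this lesson.
# Install sketch (an assumption; adjust to your environment):
#   pip install torch transformers peft bitsandbytes accelerate trl
import torch
import transformers
import peft
import trl

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("trl:", trl.__version__)
print("CUDA available:", torch.cuda.is_available())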
We now begin the first phase of our RLHF pipeline: Supervised Fine-Tuning (SFT).
Phase 1: Supervised Fine-Tuning (SFT)
The goal of SFT is not yet to make the model ethically aligned; the goal is simpler: to teach the model the job description.
A base LLM is powerful, but it doesn’t know how to be a polite or structured assistant. SFT fixes this by training the model on a small, curated dataset of human-written examples of desired behavior (e.g., responses that are concise, structured, or directly address the prompt).
The process: We fine-tune a pre-trained model on expert examples of desired outputs for various prompts.
The outcome: The model learns the format, style, and persona of a helpful assistant. It becomes our initial Policy Model that we will later align.
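Concretely, a single SFT training example is nothing more than a prompt and a human-written response joined into one string that the model learns to continue. The minimal template below (a plain space-join plus an end-of-sequence token) mirrors the data prep used later in this lesson; real projects often use richer chat templates.

# One SFT training example: prompt + desired response in a single string.
# "</s>" stands in for the tokenizer's EOS token (an assumption for illustration;
# in practice, use tokenizer.eos_token as the training code below does).
prompt = "What is the capital of France?"
response = "The capital of France is Paris."
eos_token = "</s>"

training_example = prompt + " " + response + eos_token
print(training_example)
# -> What is the capital of France? The capital of France is Paris.</s>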
Step 1: Setting up the open model and data
We will use a popular open-weight model, such as a variant of the Mistral 7B family, which is known for its efficiency. We will use the Hugging Face ecosystem, the standard for open-source LLM research, as our tooling framework.
This code loads our open-source base model and simulates the curated dataset of high-quality examples that a development team typically produces.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 1. Setup Quantization Config (The "Compressor")
# This tells standard hardware to load the model in 4-bit format (using ~5GB VRAM instead of 15GB)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

MODEL_NAME = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# 2. Load Model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto"  # Automatically puts model on GPU
)

# Enable gradient checkpointing (saves even more memory at cost of small speed dip)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

# 3. Apply LoRA (Low-Rank Adaptation)
# Instead of training 7B params, we freeze them and add tiny trainable layers
config = LoraConfig(
    r=8,                                  # Rank: low rank results in fewer parameters to train
    lora_alpha=32,                        # Alpha: scaling factor for LoRA weights
    target_modules=["q_proj", "v_proj"],  # Only target attention layers (saves memory)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)

# 4. Print Trainable Parameters
# This confirms we are only training <1% of the model!
model.print_trainable_parameters()

# --- Continue with Data Prep ---
prompts = ["What is the capital of France?", "Who wrote Macbeth?"]
responses = ["The capital of France is Paris.", "Macbeth was written by William Shakespeare."]
inputs = [p + " " + r + tokenizer.eos_token for p, r in zip(prompts, responses)]
tokenized_inputs = tokenizer(inputs, return_tensors='pt', padding=True, truncation=True).to("cuda")

print(f"✅ Model Loaded with LoRA. Ready for training.")
Lines 1–3: We import the essential libraries: PyTorch for tensor operations and GPU support, Transformers (Hugging Face) to download and manage the pre-trained Mistral model and tokenizer, and PEFT to enable efficient fine-tuning by injecting tiny, trainable LoRA adapters into the frozen model.
Lines 7–12: This configures the “Compressor”. We use BitsAndBytesConfig to load the massive 7B-parameter model in 4-bit precision (nf4). This trick cuts memory usage by roughly two-thirds, so the model fits in about 5 GB of VRAM instead of roughly 15 GB.