Ever asked a large language model (LLM) a complex question ... only to receive a baffling response?
Maybe it handles basic queries just fine, but stumbles when you need insights into your company’s internal tools or niche processes.
That’s because most LLMs are built for general use. Out-of-the-box options like Mistral-7B are powerful, but not tailored for your unique use cases.
That's why we have fine-tuning.
LLMs like Mistral-7B are powered by billions of parameters—the "knobs and dials" that shape how they process and generate text. Training all of these from scratch is astronomically expensive. But with fine-tuning—and parameter-efficient methods like LoRA—you can teach an LLM to master your domain, cost-effectively.
In this newsletter, we’ll cover:
What fine-tuning is and why it’s a game-changer for customizing LLMs
How to avoid pitfalls like "catastrophic forgetting"
When to use LoRA, QLoRA, or even RAG (retrieval-augmented generation)
A step-by-step guide to fine-tuning Mistral-7B efficiently (with code samples!)
You’ll also learn how techniques like 4-bit quantization reduce memory usage, making fine-tuning accessible even on smaller hardware setups.
By the end, you’ll know how to adapt LLMs to your specific needs—without the high cost of training from scratch.
Let’s dive in!
We know that fine-tuning is a process for adapting an LLM to your domain. But what does that mean—especially if you’re brand new to LLMs?
Imagine you have a general model that knows how to speak English decently. You want it to handle medical terminology at a deeper level. By fine-tuning it on a set of carefully curated medical texts—like patient records, research articles, or hospital FAQs—the model begins to internalize the domain’s language and context.
Once it’s seen enough examples of technical vocabulary (e.g., drug names, diagnostic procedures, or symptom descriptions), it starts applying that knowledge to medical questions more accurately.
Fine-tuning offers significant advantages by enhancing a model’s domain fluency and efficiency.
Instead of relearning everything from scratch, the model concentrates on the new patterns in your specialized dataset, allowing it to better understand and generate domain-specific language. Additionally, since the model already possesses a strong grasp of general English and common patterns, it can swiftly integrate your specific domain knowledge on top of its existing capabilities. This approach is faster and more cost-effective than developing an entirely new model from the ground up.
However, fine-tuning comes with its own set of trade-offs.
Emphasizing new patterns can inadvertently lead to the model pushing aside some prior knowledge, a phenomenon known as catastrophic forgetting.
If a medical dataset introduces information that conflicts with general nutritional advice, the model might start conflating everyday dietary tips with strict clinical guidelines, producing confusing or contradictory responses.
Moreover, fine-tuning doesn’t universally improve all tasks; it hones in on the tasks or domains reflected in your curated data. If that data is too narrow or inconsistent, the model might lose some versatility in broader topics. Poorly curated medical text, for instance, could lead it to mix prescription drug names with over-the-counter medications, producing inaccurate advice even in non-medical contexts.
High-quality data is key! If the data you feed into the model is relevant, accurate, and well-curated, the performance in that domain can soar. But if your dataset is too narrow, inconsistent, or riddled with errors, the model might develop weird quirks or lose some of its broad, general-purpose abilities.
Partial fine-tuning strategies are essential for balancing specialized domain knowledge with the model’s general competencies.
Instead of solely focusing on niche data, incorporating broader content during the fine-tuning process ensures that the model simultaneously preserves its foundational skills while learning new domain-specific patterns.
Additionally, employing regularization techniques can further safeguard the model’s existing knowledge. Parameter-efficient methods like LoRA and QLoRA update only a small fraction of the model’s weights (QLoRA adds 4-bit quantization on top). Because the original weights stay frozen, these techniques minimize the risk of overwriting previously learned information, letting the model retain its general understanding while effectively integrating specialized expertise.
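To make the idea concrete, here’s a toy, plain-Python sketch of the core LoRA trick (the real implementation lives in the peft library): the pretrained weight matrix is left untouched, and a low-rank product B @ A is added on top, with only A and B trained.

```python
# Toy sketch of the LoRA update rule, not the actual peft implementation.
# The frozen weight matrix W is augmented with a low-rank product B @ A;
# only A and B receive gradients, so pretrained knowledge in W is never
# overwritten -- which is why LoRA reduces catastrophic-forgetting risk.

def matmul(X, Y):
    """Plain-Python matrix multiply, enough for this illustration."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r = 4, 2                           # tiny illustrative sizes
W = [[1.0] * d for _ in range(d)]     # pretrained weights (frozen)
A = [[0.1] * d for _ in range(r)]     # trainable low-rank factor
B = [[0.0] * r for _ in range(d)]     # trainable, zero-initialized

delta = matmul(B, A)                  # the low-rank update, zero at init
W_eff = [[w + dw for w, dw in zip(w_row, d_row)]
         for w_row, d_row in zip(W, delta)]

# Because B starts at zero, fine-tuning begins from the unmodified model:
print(W_eff == W)  # True
```

At realistic sizes (e.g., a 4096-dimensional projection with r = 16), the two adapter matrices hold well under 1% of the projection’s parameters, which is the source of LoRA’s efficiency.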
In other words, fine-tuning is a powerful tool for crafting large language models (LLMs) that excel at specific tasks. However, balancing the new information taught is crucial to avoid inadvertently overshadowing or erasing important prior knowledge. Adopting a thoughtful approach and leveraging robust data makes achieving the best of both worlds possible: developing specialized models that maintain their ability to handle general tasks effectively while demonstrating enhanced performance in targeted domains.
Now that we’ve covered why fine-tuning matters and how catastrophic forgetting can throw a wrench in the works, let’s look at which fine-tuning method might best suit your needs—whether full, partial, or parameter-efficient.
When you need to adapt a pretrained LLM to a new domain or specialized tasks, the method you choose can profoundly impact your model’s performance, training cost, and risk of forgetting important knowledge. Let’s explore three common strategies—full fine-tuning, partial fine-tuning, and parameter-efficient approaches (like LoRA and QLoRA)—and see how each fits different use cases.
Full fine-tuning involves unfreezing every layer of a pretrained LLM and retraining all its parameters on your new dataset, allowing a deep and holistic adaptation to even the most radically different tasks.
Because every internal weight is adjustable, the model can align itself more comprehensively with unfamiliar domains or highly specialized knowledge. However, this complete flexibility comes at a steep cost in time and compute resources, as working through billions of parameters can be prohibitively expensive for smaller teams.
Additionally, because every weight shifts toward the new data, the model runs a heightened risk of catastrophic forgetting.
Full fine-tuning makes the most sense when you have sufficient GPU power, a unique dataset (where simpler approaches won’t suffice), and a pressing need for maximum accuracy in that specialized domain.
Partial fine-tuning keeps most of the model’s parameters frozen, typically focusing on only certain layers—often the higher-level ones more directly responsible for task-specific features.
This approach preserves much of the base model’s general knowledge and reduces the computational overhead relative to full fine-tuning. At the same time, because you only adapt a subset of the weights, you usually achieve a moderate level of specialization rather than a complete overhaul.
This balance is a welcome trade-off for many projects: it’s more resource-friendly, poses less risk of catastrophic forgetting, and still grants sufficient room to tailor the model to moderately different tasks. Partial fine-tuning works particularly well when you want your LLM to remain versatile enough to handle general queries while improving on a specific cluster of domain requirements.
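As an illustration of the pattern (a toy sketch, not tied to any specific framework), partial fine-tuning amounts to switching off gradient updates for the lower layers, much like setting requires_grad = False on parameters in PyTorch:

```python
# Toy sketch of partial fine-tuning: freeze the lower layers and train
# only the top few. Real code would iterate over model.parameters();
# plain dicts stand in for layers here.
layers = [{"name": f"layer_{i}", "requires_grad": True} for i in range(32)]

n_trainable = 4                        # tune only the top 4 layers
for layer in layers[:-n_trainable]:
    layer["requires_grad"] = False     # frozen: preserves general knowledge

trainable = [l["name"] for l in layers if l["requires_grad"]]
print(trainable)  # ['layer_28', 'layer_29', 'layer_30', 'layer_31']
```

Freezing the lower layers keeps the model’s broad language features intact while the top layers specialize on your data.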
Parameter-efficient fine-tuning (PEFT), including methods like LoRA and QLoRA, sidesteps the need to retrain the entire model by adding a small set of adapter parameters while keeping the core weights frozen.
This drastically lowers GPU usage and training time, making it a cost-effective path to embedding specialized knowledge. Because the original weights remain untouched, there’s less risk of undermining the LLM’s broad skill set, which is especially helpful if you still rely on the model’s general competencies.
However, if your dataset is extremely large or diverges drastically from the model’s pre-existing knowledge, these lightweight adapters might not deliver the same depth of specialization as full or partial fine-tuning. Still, for many applications—like customizing a general-purpose LLM to answer industry-specific queries—PEFT can strike an ideal balance between efficiency, domain accuracy, and preserving overall language ability.
Customizing an LLM isn’t one-size-fits-all. Some tasks require deep expertise in a specialized domain, while others depend on staying current with real-time information.
Fine-tuning and retrieval-augmented generation (RAG) are two distinct approaches to meet these needs. The choice depends on your specific requirements:
Do you need your model to excel in a specialized field where accuracy and nuance are critical?
Or does it need real-time access to external knowledge to handle dynamic queries?
Here’s how these approaches compare:
| Feature | Fine-Tuning | Retrieval-Augmented Generation (RAG) |
| --- | --- | --- |
| Consistency | Ideal for tasks requiring consistent performance in specialized areas. | Less consistent for specialized tasks, as it relies on external data retrieval. |
| Domain-Specific Knowledge | Excels where deep understanding and generation within a specific domain are needed. | Can handle domain-specific queries by retrieving relevant information, but depth depends on the quality and relevance of the retrieved data. |
| External Data Dependence | Minimal reliance on external data sources; maximizes the model’s existing capabilities. | Highly reliant on external data sources for up-to-date and dynamic information. |
| Up-to-Date Information | Requires retraining to incorporate new information; not inherently dynamic. | Excels at providing dynamic, up-to-date information by retrieving data in real time. |
| Knowledge Base Breadth | May lose some versatility if fine-tuned on very narrow or specialized data. | Offers a broad knowledge base by accessing extensive external sources. |
| Fact-Checking and Accuracy | Depends on the quality and recency of the fine-tuning dataset, which may require frequent updates to stay accurate. | Enhances factual accuracy by cross-referencing responses with relevant, current external data sources. |
| Training Resources | Requires significant computational resources, especially for large models. | Requires efficient retrieval infrastructure but far less compute for training. |
| Scalability | Less scalable, since each new dataset or domain requires retraining. | More scalable for applications needing real-time data access without retraining the model. |
| Use Cases | Specialized customer support, legal document generation, industry-specific chatbots. | Real-time information retrieval, dynamic content generation, and applications requiring up-to-date data like news or live events. |
While both approaches have their strengths, they serve different purposes and aren’t directly comparable—even with this handy chart.
Fine-tuning is best for tasks that demand consistent, high-quality performance in specialized domains where understanding nuances and terminology is crucial.
RAG shines when up-to-date, dynamic information is needed, such as responding to real-time queries or accessing external knowledge sources.
Ultimately, the right choice depends on your use case.
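To ground the distinction, here is a deliberately minimal sketch of the RAG pattern: retrieve the most relevant document, then stuff it into the prompt. Production systems replace the naive keyword overlap below with embedding-based vector search, but the shape of the pipeline is the same.

```python
import re

# Minimal RAG sketch: retrieve relevant text, then prepend it to the
# prompt before generation. Keyword overlap stands in for real vector
# search (embeddings + a vector database).
docs = [
    "Mistral-7B was released in September 2023.",
    "LoRA adds low-rank adapters on top of frozen model weights.",
    "Paris is the capital of France.",
]

def tokens(text):
    return set(re.findall(r"[a-z0-9-]+", text.lower()))

def retrieve(query, k=1):
    """Rank documents by naive keyword overlap with the query."""
    q = tokens(query)
    return sorted(docs, key=lambda d: -len(q & tokens(d)))[:k]

query = "How does LoRA use adapters?"
context = retrieve(query)[0]
prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```

Notice that no model weights change here: freshness comes from swapping documents in and out of the store, which is exactly why RAG handles dynamic information without retraining.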
Fine-tuning a language model isn’t just about pressing a “train” button and hoping for the best. It’s a systematic process where each step is crucial in adapting the model to your specific task or domain. Here’s the high-level breakdown:
Your first step is to choose the language model that best meets your requirements. Consider the nature of your task, the complexity of the data you’ll be using, and how much computational power you have.
For instance, smaller models (like a 7B-parameter LLM) can handle simpler tasks more efficiently, whereas larger models might be necessary for advanced domains requiring deep specialization.
Tip: If you’re aiming for quick experiments or have limited GPU resources, a smaller model can be a great starting point—especially if you use parameter-efficient methods like LoRA or QLoRA.
Next, gather a well-structured dataset relevant to your task. You can look for public datasets on platforms like Hugging Face or create your own if nothing suitable exists.
The key is ensuring that your dataset represents the problem you’re trying to solve. For example, if you’re fine-tuning a model to understand legal contracts, you’ll want a broad sampling of contractual documents, including different clauses, styles, and domain nuances.
Once you’ve collected your data, clean and preprocess it. This often means removing duplicates, fixing formatting issues, and splitting the dataset into training and validation/testing sets. Good preprocessing prevents the model from overfitting on noise or learning incorrect patterns.
Checklist:
Remove or label any sensitive information.
Normalize text (e.g., consistent casing, spacing).
Filter out spam or irrelevant entries.
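A minimal cleaning pass covering that checklist might look like the following sketch (toy data and deliberately naive heuristics, purely illustrative):

```python
import random

# Sketch of a minimal cleaning pass: dedupe, normalize whitespace,
# drop junk entries, then split into train/validation sets.
raw = [
    "  What is LoRA?  ",
    "What is LoRA?",          # duplicate after normalization
    "BUY CHEAP PILLS!!!",     # spam
    "Explain 4-bit quantization.",
    "How does QLoRA differ from LoRA?",
]

def normalize(text):
    return " ".join(text.split())           # collapse stray whitespace

def is_spam(text):
    return "!!!" in text or text.isupper()  # naive spam heuristic

seen, cleaned = set(), []
for sample in map(normalize, raw):
    if sample.lower() in seen or is_spam(sample):
        continue
    seen.add(sample.lower())
    cleaned.append(sample)

random.seed(0)                              # reproducible split
random.shuffle(cleaned)
split = int(0.8 * len(cleaned))
train, valid = cleaned[:split], cleaned[split:]
print(len(cleaned), len(train), len(valid))  # 3 2 1
```

Real pipelines add PII scrubbing and language filtering on top, but the dedupe–filter–split skeleton stays the same.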
Before you kick off training, you must configure crucial hyperparameters—like the learning rate, batch size, and number of epochs. These values dictate how quickly your model adjusts to the new data and how well it generalizes beyond the training set. Training too aggressively (e.g., too many epochs) can cause overfitting (the model memorizes the training data), while training too conservatively might lead to underfitting (the model never fully learns your domain).
Rule of Thumb: Always start with a moderate learning rate and adjust based on validation performance. Watch for signs of either plateauing (underfit) or spiking errors (overfit).
Now for the main event: fine-tuning on your curated dataset. Depending on your chosen approach—full, partial, or parameter-efficient—you’ll adjust the entire model or just a fraction of it. During this phase, the model adapts to the specifics of your domain, incrementally updating its internal weights or adapters to better handle tasks it couldn’t fully address before.
Using LoRA, you’d focus on small rank-decomposition matrices that overlay the model’s key layers, drastically reducing compute needs while still learning new domain signals.
After you’ve trained your model, it’s time to put it to the test. Use a validation or test set to measure performance metrics—like accuracy, F1 score, or perplexity—depending on your task. If the results fall short of your goals, tweak your hyperparameters or refine the dataset. You might find that adding more general data can reduce catastrophic forgetting, or you need to increase the learning rate for better convergence.
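For language-modeling tasks, perplexity is the most common of these metrics, and it’s simply the exponential of the average cross-entropy loss, as this small sketch (with made-up loss values) shows:

```python
import math

# Perplexity = exp(average cross-entropy loss) over the validation set.
# Lower is better; a rising validation perplexity while training loss
# keeps falling is a classic sign of overfitting.
val_losses = [2.1, 1.9, 2.0, 1.8]   # per-batch losses (illustrative)
mean_loss = sum(val_losses) / len(val_losses)
perplexity = math.exp(mean_loss)
print(round(mean_loss, 2), round(perplexity, 2))  # 1.95 7.03
```

Tracking this number across epochs gives you an early signal for when to stop training or adjust hyperparameters.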
Now that you’ve seen the end-to-end process for fine-tuning a language model—choosing the right LLM, assembling and preprocessing your data, tuning hyperparameters, and evaluating your results—you’re in a solid position to dive deeper into LoRA (and QLoRA).
These parameter-efficient techniques can dramatically reduce training costs and memory usage, all while helping your model adapt to even the most specialized tasks. In the next section, we’ll walk through how to apply LoRA to Mistral-7B, showing you the code, configuration steps, and best practices for integrating custom domain expertise into your LLM.
Having explored why fine-tuning matters and how to choose the right approach, let’s walk through a hands-on example using Mistral-7B.
In this demo, we’ll combine LoRA with 4-bit quantization to keep compute requirements modest while tailoring the model to a specific dataset.
Before we begin, install the latest versions of each necessary library. These handle everything from model quantization to dataset loading and fine-tuning itself:
pip3 install transformers==4.44.1
pip3 install accelerate
pip3 install bitsandbytes==0.43.3
pip3 install datasets==2.21.0
pip3 install trl==0.9.6
pip3 install peft==0.12.0
pip install -U "huggingface_hub[cli]"
In the code above:
Lines 1–6: Install the main Hugging Face libraries—transformers and datasets—essential for loading models and managing data. bitsandbytes enables 4-bit quantization, reducing memory usage. trl and peft streamline LoRA-based fine-tuning, making the process more straightforward and accessible.
Line 7: Upgrade huggingface_hub to ensure you have the latest CLI tools for seamless interaction with the Hugging Face Hub.
Each library plays a specialized role, which is why we install several. transformers and datasets from Hugging Face are the core tools for loading models and managing data; bitsandbytes enables the 4-bit quantization that significantly reduces memory usage without substantially compromising performance; trl and peft handle the LoRA-specific machinery; and accelerate optimizes the training setup, whether you’re working with a single GPU or several, ensuring the process stays efficient and scalable.
Logging in via the CLI is essential if you’re using private models or need seamless access to datasets from the Hugging Face Hub. This step ensures that your environment is authenticated and authorized.
!huggingface-cli login --token "Enter your token" --add-to-git-credential
The above command runs the Hugging Face CLI login command in a notebook or script. Replace "Enter your token" with your actual Hugging Face access token to authenticate for model and dataset downloads. The --add-to-git-credential flag stores your credentials for Git-based interactions with the Hugging Face Hub.
Let’s bring in all the modules we’ll use. These modules collectively provide the necessary tools for loading the model, preparing the dataset, configuring LoRA, and managing the training process. Ensuring all these imports are in place is crucial for a seamless fine-tuning workflow.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer
import transformers
import peft
import torch
import os
In the code above:
Line 1: Import model and tokenizer classes (AutoModelForCausalLM, AutoTokenizer) along with BitsAndBytesConfig for quantization and TrainingArguments for setting up training parameters.
Line 2: Imports load_dataset from datasets to manage and preprocess data.
Line 3: Brings in LoraConfig from peft to configure LoRA settings.
Line 4: Imports SFTTrainer from trl, which simplifies supervised fine-tuning.
Lines 5–8: Import essential Python modules (transformers, peft, torch, os) required for model manipulation, training, and system operations.
We’ll leverage the bitsandbytes library to load Mistral-7B in 4-bit precision. This is crucial for cutting down memory usage while still fine-tuning effectively.
bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_use_double_quant = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype = torch.bfloat16)

model_name = "mistralai/Mistral-7B-v0.1"

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config = bnb_config,
    device_map = "auto"
)
In the code above:
Lines 1–5: Create a BitsAndBytesConfig object with the following settings:
load_in_4bit=True: Enables 4-bit quantization.
bnb_4bit_use_double_quant=True: Applies double quantization for improved accuracy.
bnb_4bit_quant_type="nf4": Sets the quantization type to NF4, a specific quantization scheme.
bnb_4bit_compute_dtype=torch.bfloat16: Defines the compute data type as bfloat16 for efficient processing.
Line 7: Specifies the model name from Hugging Face.
Lines 8–13: Loads the Mistral-7B model using the quantization configuration, automatically mapping it to the available GPU(s) via device_map="auto".
With 4-bit quantization, each weight is stored using only 4 bits instead of the standard 16 bits. This significantly reduces GPU memory requirements, allowing larger models to fit into available hardware. While this introduces slightly noisier calculations, many applications still perform excellently with this approach.
Quantization is crucial for reducing the memory footprint of large models, making it feasible to fine-tune them on hardware with limited resources. By leveraging bitsandbytes, we ensure the model remains efficient without sacrificing too much accuracy.
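The savings are easy to estimate with back-of-the-envelope math (weights only; activations, optimizer state, and quantization overhead add to the real footprint):

```python
# Rough weight-memory math for a 7B-parameter model. Figures cover
# weight storage only, so treat them as lower bounds on real usage.
params = 7e9

fp16_gb = params * 2 / 1e9      # 16 bits = 2 bytes per weight
int4_gb = params * 0.5 / 1e9    # 4 bits = 0.5 bytes per weight

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
# fp16: 14.0 GB, 4-bit: 3.5 GB -- a 4x reduction in weight storage
```

That 4x reduction is what brings a 7B model within reach of a single consumer-grade GPU.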
Let’s see how the pretrained model performs on a simple prompt before fine-tuning. This baseline inference tells you what the un-fine-tuned model can do. By observing its initial performance, you have a reference point to evaluate the effectiveness of the fine-tuning process later.
tokenizer = AutoTokenizer.from_pretrained(model_name)
input = tokenizer("Act as a travel guide", return_tensors="pt").to('cuda')

response = quantized_model.generate(**input, max_new_tokens = 100)
print(tokenizer.batch_decode(response, skip_special_tokens=True))
In the code above:
Line 1: Loads the same tokenizer used by Mistral-7B.
Line 2: Encodes the prompt “Act as a travel guide” into token IDs and moves them to the GPU with .to('cuda').
Line 4: Generates up to 100 new tokens in response to the input prompt.
Line 5: Decodes the generated token IDs into readable text, skipping any special tokens.
You’ll compare these results to the fine-tuned version later. A sample output might be:
['Act as a travel guide\n\nThe best way to get to know a city is to have a local show you around. If you’re a local, you can make money showing visitors around your city.\n\nYou can do this by creating a tour of your city and selling tickets to visitors. You can also offer your services as a guide for free and ask for tips.\n\nThis is a great way to make money if you’re passionate about your city and know it well.\n\nYou can also']
This baseline response is decent, but we can do better by training the model on role-specific instructions.
We’ll use the fka/awesome-chatgpt-prompts dataset for our demo, which provides columns like “act” and “prompt” to train on various roles.
Let’s load it and tokenize it:
dataset = "fka/awesome-chatgpt-prompts"
data = load_dataset(dataset)

tokenizer.pad_token = tokenizer.eos_token
data = data.map(lambda samples: tokenizer(samples["act"], samples["prompt"]), batched=True)
train_sample = data["train"].select(range(100))
In the code above:
Line 1: Specifies the dataset name from Hugging Face.
Line 2: Loads the dataset into a DatasetDict object.
Line 4: Sets the tokenizer’s pad_token to the eos_token to ensure consistent padding across inputs.
Line 5: Maps over each entry in the dataset, combining the “act” and “prompt” columns into tokenized input suitable for training. The batched=True parameter allows processing multiple samples at once for efficiency.
Line 6: Selects a smaller subset (the first 100 examples) from the training set for quicker training demonstrations. In a real-world scenario, you’d use the entire dataset or a larger subset to ensure comprehensive training.
In practice, you’d replace the fka/awesome-chatgpt-prompts dataset with your domain-specific dataset, such as internal company documents, specialized knowledge bases, or proprietary data relevant to your application.
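A swapped-in dataset only needs to expose the same two columns the demo tokenizes. The records below are invented examples of that shape (field names mirror fka/awesome-chatgpt-prompts; your own schema may differ):

```python
# Hypothetical domain records in the same "act"/"prompt" shape the demo
# expects. Replace with rows from your own internal data source.
records = [
    {"act": "Compliance assistant",
     "prompt": "Summarize the data-retention policy for EU customers."},
    {"act": "Internal IT helpdesk",
     "prompt": "Walk me through resetting a YubiKey for SSO."},
]

# Combine both fields into one training string, mirroring what the demo's
# tokenizer(samples["act"], samples["prompt"]) call pairs together.
texts = [f'{r["act"]}: {r["prompt"]}' for r in records]
print(texts[0])
```

Keeping your data in this simple two-column shape means the rest of the fine-tuning pipeline works unchanged.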
LoRA inserts small, trainable matrices (adapters) into specific parts of the model’s architecture. By focusing only on these adapters, we drastically reduce the number of parameters that need to be updated during training, making the fine-tuning process more efficient and less resource-intensive.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM")
In the code above:
Line 2: Sets r=16, which defines the rank of the low-rank matrices being added. This controls the size and capacity of the LoRA adapters.
Line 3: Sets lora_alpha=16, which scales the updates from these matrices. This parameter influences the magnitude of the modifications during training.
Line 4: Specifies target_modules, listing the attention-related modules ("q_proj", "k_proj", "v_proj", "o_proj", "gate_proj") where LoRA adapters will be inserted.
Line 5: Sets lora_dropout=0.1, introducing dropout within the LoRA adapters to aid in generalization and prevent overfitting.
Line 7: Defines task_type="CAUSAL_LM", indicating that the task involves causal language modeling (typical for models like GPT).
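To see why these settings keep training cheap, here’s a rough parameter count for a single attention projection, assuming a 4096×4096 weight matrix (Mistral-7B’s hidden size is 4096; exact per-module shapes vary):

```python
# Rough LoRA parameter count for one attention projection. Each adapter
# pair adds r * (d_in + d_out) trainable weights alongside the frozen
# d_in * d_out weights of the original projection.
d, r = 4096, 16

full = d * d            # parameters in the frozen projection
lora = r * (d + d)      # parameters in the A and B adapter matrices

print(full, lora, f"{lora / full:.2%}")  # 16777216 131072 0.78%
```

Under 1% of the projection’s weights are trainable, which is why LoRA fine-tuning fits on hardware that full fine-tuning never could.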
Next, we configure the training arguments. These hyperparameters are crucial for controlling the training dynamics: auto_find_batch_size ensures efficient use of GPU memory, while the learning rate and number of epochs directly influence how well the model adapts to the new data without overfitting or underfitting.
working_dir = './'
output_directory = os.path.join(working_dir, "finetuning")

training_args = TrainingArguments(
    output_dir = output_directory,
    auto_find_batch_size = True,
    learning_rate = 2e-4,
    num_train_epochs=1)
In the code above:
Lines 1–2: Define the working directory and create an output directory (./finetuning) to store logs, checkpoints, and other training artifacts.
Lines 4–8: Instantiate a TrainingArguments object with the following settings:
output_dir=output_directory: Specifies where to save the training outputs.
auto_find_batch_size=True: Automatically adjusts the batch size to fit within the available GPU memory, preventing out-of-memory errors.
learning_rate=2e-4: Sets the learning rate for the optimizer, a typical starting value for LoRA-based fine-tuning.
num_train_epochs=1: Defines the number of training epochs. For demonstration purposes, one epoch is sufficient, but you can increase this based on your dataset size and desired performance.
Next, we initialize the SFTTrainer (from the trl library), simplifying supervised fine-tuning. We pass in the model, training arguments, dataset, and LoRA configuration:
trainer = SFTTrainer(
    model = quantized_model,
    args = training_args,
    train_dataset = train_sample,
    peft_config = lora_config,
    tokenizer = tokenizer,
    data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False))
In the code above:
Line 2: Assigns the quantized model to the trainer.
Line 3: Passes in your TrainingArguments from the previous step.
Line 4: Uses train_sample (100 examples) as the training dataset.
Line 5: Supplies the lora_config so the trainer knows to apply LoRA.
Line 6: Indicates which tokenizer to use for tokenizing each batch.
Line 7: Provides a default language-modeling collator, ensuring the correct shaping of input data.
What is SFTTrainer? It’s part of the trl library (by Hugging Face), which streamlines supervised fine-tuning for tasks like causal language modeling. It abstracts away much of the boilerplate code, letting you focus on configuring your training process.
We’re ready to train! This step updates only the LoRA adapters, leaving the rest of Mistral-7B’s parameters frozen:
trainer.train()
This kicks off the training loop. Only the LoRA parameters will be updated; the rest of Mistral-7B remains frozen. Training can take a few minutes to several hours, depending on your hardware and dataset size. During this process, you’ll observe logs detailing metrics such as loss, learning rate, and epoch progress. These logs help monitor the training’s progress and diagnose potential issues like overfitting or underfitting.
Note: By updating only the LoRA adapters, we significantly reduce the number of parameters being trained, making the process faster and less resource-intensive. This approach retains the original model's general knowledge while integrating domain-specific expertise through the adapters.
After training, let’s ask the same question—“Act as a travel guide”—and see if the model’s response is now more polished:
input = tokenizer("Act as a travel guide", return_tensors="pt").to('cuda')
response = trainer.model.generate(**input, max_new_tokens = 100)
print(tokenizer.batch_decode(response, skip_special_tokens=True))
In the above code:
Line 1: Encodes the prompt, but this time, uses the fine-tuned version of the model (trainer.model).
Line 2: Generates up to 100 tokens.
Line 3: Prints the new output, which should reflect your specialized training data.
You’ll likely see a more role-aware or domain-specific response, especially if your dataset was heavily focused on a particular style, topic, or function (like “Act as a travel guide”).
['Act as a travel guide. I want you to act as a travel guide. You will create itineraries for people looking to explore new places, research the best places to stay and eat, and provide helpful tips on how to get around the city. My first suggestion request is "I need help planning an exciting trip to Paris." My first suggestion request is "I need help planning an exciting trip to Paris." My first suggestion request is "I need help planning an exciting trip to Paris." My first suggestion request is "I']
By evaluating the model’s responses before and after fine-tuning, you can assess the effectiveness of your fine-tuning process. Improved responses indicate successful integration of domain-specific knowledge, while persistent issues might suggest the need for further fine-tuning or dataset adjustments.
Fine-tuning large language models (LLMs) like Mistral-7B unlocks powerful possibilities—custom chatbots, enterprise data solutions, or healthcare diagnostics. But with complexity comes higher resource and time demands.
Curious about optimizing fine-tuning? Learn how techniques like quantization, LoRA, and QLoRA make customization faster and more cost-effective.
Our course, Fine-Tuning LLMs Using LoRA and QLoRA, equips you with hands-on skills to adapt models like Llama 3—even on limited compute. If you're ready to optimize AI for your unique challenges, dive in today.