
Limitations and When NOT to Fine-Tune

Explore the significant costs and risks involved in fine-tuning large language models, including high data and compute expenses and catastrophic forgetting. Understand when prompt engineering or retrieval-augmented generation (RAG) are more effective and cost-efficient. This lesson guides you in making informed decisions about model customization strategies before committing to fine-tuning, helping you optimize resources while maintaining model performance.

Fine-tuning a large language model can feel like the obvious next step once you understand supervised fine-tuning, instruction tuning, and RLHF. But consider this scenario. A team fine-tunes a 7B-parameter model on 500 customer-support examples. They spend over $2,000 on GPU compute, wait several days for training to complete, and then discover that a well-crafted system prompt with a handful of few-shot examples achieves comparable accuracy at near-zero cost. This happens more often than you might expect.
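To make the scenario concrete, here is a minimal sketch of the kind of few-shot prompt that team could have tried first. The task, labels, and example tickets are hypothetical, invented for illustration; the point is that a handful of in-context examples often stands in for an expensive fine-tuning run on small classification-style tasks.

```python
# Hypothetical few-shot prompt for a customer-support ticket classifier.
# All ticket texts and label names below are illustrative assumptions.

FEW_SHOT_EXAMPLES = [
    ("My card was charged twice for one order.", "billing"),
    ("The app crashes when I open settings.", "bug_report"),
    ("How do I change my shipping address?", "account_help"),
]

def build_prompt(user_message: str) -> str:
    """Assemble an instruction, the few-shot examples, and the new ticket."""
    lines = [
        "You are a support-ticket classifier.",
        "Reply with exactly one label: billing, bug_report, or account_help.",
        "",
    ]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Ticket: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Ticket: {user_message}")
    lines.append("Label:")
    return "\n".join(lines)

prompt = build_prompt("I was billed for a subscription I cancelled.")
```

The resulting string is sent as-is to any chat or completion endpoint; swapping the examples is a text edit, not a retraining run, which is exactly the cost asymmetry the scenario above describes.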

Fine-tuning is a powerful tool, but it carries significant costs across three dimensions: data curation, compute infrastructure, and ongoing maintenance. Beyond cost, it introduces a technical risk called catastrophic forgetting, where the model loses previously learned capabilities. This lesson serves as a decision-making guide. Before you commit to fine-tuning, you need to understand when simpler alternatives like prompt engineering or retrieval-augmented generation (RAG) will get the job done at a fraction of the expense. That understanding also sets the stage for why parameter-efficient methods like LoRA exist.

The true cost of fine-tuning

The expense of fine-tuning extends well beyond the GPU bill. It spans three compounding dimensions that teams frequently underestimate.

Data curation costs

High-quality labeled datasets do not appear out of thin air. Building a fine-tuning dataset requires domain experts who can write or validate examples, an annotation pipeline to ensure consistency, and a quality assurance process to catch errors. Even a modest dataset of a few thousand examples can take weeks to assemble and cost thousands of dollars in expert labor. Unlike prompt engineering, where you write a few examples directly into the prompt, fine-tuning demands a structured, cleaned, and deduplicated corpus before training even begins.
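The cleaning and deduplication work mentioned above is itself a pipeline step. The sketch below shows one minimal, assumed form of it: field names (`prompt`, `completion`) and the normalization rules are illustrative choices, not a prescribed schema.

```python
# Hypothetical sketch of a pre-training cleanup pass for a fine-tuning
# corpus: drops incomplete rows and exact duplicates after normalization.
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical rows hash alike."""
    return " ".join(text.lower().split())

def dedupe(examples: list[dict]) -> list[dict]:
    """Keep the first occurrence of each normalized (prompt, completion) pair."""
    seen, kept = set(), []
    for ex in examples:
        prompt = ex.get("prompt", "")
        completion = ex.get("completion", "")
        if not prompt.strip() or not completion.strip():
            continue  # discard rows missing either side of the pair
        key = hashlib.sha256(
            normalize(prompt + " " + completion).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

rows = [
    {"prompt": "Reset my password", "completion": "Go to Settings > Security."},
    {"prompt": "reset my  password", "completion": "Go to Settings > Security."},
    {"prompt": "", "completion": "orphan answer"},
]
clean = dedupe(rows)  # keeps only the first row
```

Even this toy version hints at why the work adds up: real corpora also need near-duplicate detection, format validation, and human review of label quality, none of which prompt engineering requires.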

Compute and infrastructure costs

...