System Design: In-Context Learning and RAG

Explore techniques for customizing large language models at inference time using in-context learning, few-shot prompting, and retrieval-augmented generation. Understand when to apply each method, how to reduce hallucinations with RAG, and design cost-effective, scalable systems without the need for fine-tuning.

In the previous lesson, you saw how fine-tuning adapts a model’s weights through gradient updates, requiring labeled data, GPU compute, and a training pipeline. But many production use cases never need any of that. Consider a support team that wants their LLM to answer questions about internal company policies. They have no labeled training data, no GPU budget, and their documents change every month. Fine-tuning would be expensive, slow, and outdated almost immediately. The alternative is to customize the model’s behavior entirely at inference time, by changing what goes into the prompt rather than what lives inside the model’s parameters.

This is the core idea behind in-context learning (ICL), the ability of LLMs to adapt their outputs based solely on information provided in the prompt. ICL, few-shot prompting, and retrieval-augmented generation represent a spectrum of inference-time customization strategies that are faster to deploy, cheaper to operate, and easier to iterate on than fine-tuning. Amazon Bedrock provides managed infrastructure for all three approaches, making them the recommended starting point before considering weight updates. This lesson walks through the definitions and mechanics of each technique, compares their trade-offs, and provides guidance on when each is appropriate.

In-context learning and few-shot prompting

In-context learning works by conditioning the model’s output on the full contents of the prompt, which can include instructions, examples, and the user’s query, all concatenated into a single input. No parameter updates occur. The model reads everything in the prompt, identifies patterns, and generates a response that follows those patterns. Think of it like handing someone a filled-out form as a reference before asking them to fill out a blank one. They do not need retraining; they just need a good example.
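As a minimal sketch of this idea, the prompt assembly can be done with plain string concatenation; the helper name, the `Input:`/`Output:` formatting, and the ticket-classification example below are illustrative choices, not part of any particular API:

```python
def build_icl_prompt(instruction, examples, query):
    """Concatenate an instruction, worked examples, and a new query into
    one prompt string. No parameter updates occur; the model simply
    conditions on this text at inference time."""
    parts = [instruction.strip(), ""]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")
    parts.append(f"Input: {query}")
    parts.append("Output:")  # the model continues from here
    return "\n".join(parts)

# One filled-out "form" serves as the reference pattern (one-shot).
prompt = build_icl_prompt(
    instruction="Classify the support ticket as 'billing' or 'technical'.",
    examples=[("I was charged twice this month.", "billing")],
    query="The app crashes when I upload a file.",
)
print(prompt)
```

The resulting string would then be sent as the user message to whatever model endpoint you use (for example, via Amazon Bedrock's runtime API); adding more `(input, output)` pairs to `examples` moves you from one-shot toward few-shot prompting without changing any other code.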

Prompting regimes

Three prompting regimes fall under the ICL umbrella, each differing in how much demonstration context you provide.

  • Zero-shot prompting: The prompt contains only the task instruction and the query itself, with no demonstration examples. The model relies entirely on patterns learned during pretraining to interpret and perform the task.