Quantized Low-Rank Adaptation (QLoRA)
Explore how QLoRA combines low-rank adaptation with quantization to fine-tune large language models efficiently. Understand key components like 4-bit NormalFloat quantization, double quantization, and paged optimizers. This lesson helps you grasp how QLoRA enables memory-efficient fine-tuning of neural networks, making it practical even on hardware with limited resources.
Quantized Low-Rank Adaptation (QLoRA), as the name suggests, combines two widely used efficiency techniques: LoRA and quantization. Where LoRA uses low-rank matrices to reduce the number of trainable parameters, QLoRA extends it by quantizing the frozen base model's weights, further reducing the memory footprint.
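The idea above can be illustrated with a minimal NumPy sketch (assumed illustrative dimensions, not QLoRA's actual implementation): the pretrained weight W stays frozen, and only two small low-rank factors A and B are trained. In QLoRA, W would additionally be stored in a quantized 4-bit format.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 8, 8, 2  # layer dimensions and LoRA rank (r << d)

# Frozen pretrained weight; in QLoRA this is stored quantized (e.g., 4-bit NF4).
W = rng.normal(size=(d_out, d_in))

# Trainable low-rank factors. B starts at zero, so the initial update B @ A is zero
# and the adapted model begins identical to the pretrained one.
A = rng.normal(size=(r, d_in)) * 0.01
B = np.zeros((d_out, r))

x = rng.normal(size=(d_in,))

# Forward pass: frozen weight plus the low-rank update.
y = W @ x + B @ (A @ x)

# Only A and B are trained, not W.
lora_params = A.size + B.size  # 2*8 + 8*2 = 32
full_params = W.size           # 8*8 = 64
print(lora_params, full_params)
```

With rank r = 2, the adapter trains 32 parameters instead of the full 64, and the saving grows dramatically at realistic layer sizes.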
Components of QLoRA
The following are the three main components of QLoRA:
4-bit NormalFloat quantization
Double quantization
Paged optimizers
Let’s dive into the details of each component.
4-bit NormalFloat quantization
The ...