Model compression refers to the process of reducing the size of a trained model while maintaining its predictive performance. The goal is to create a more compact representation of the model that requires fewer resources for storage, memory, and computation. Large language models (LLMs) like ChatGPT are compressed using several methods to meet the deployment requirements on various platforms.
Point to Ponder
Have you ever wondered what the scale of ChatGPT is?
There are several techniques for model compression, as given below:
Quantization
Pruning
Knowledge distillation
Tensor decomposition (factorization)
Quantization reduces the precision, or bit width, used to represent numerical values within a model. It involves representing the model’s parameters or activations with fewer bits than the standard 32-bit floating-point format. This reduction in precision decreases the memory footprint and computational requirements of the model, making it more efficient to deploy on resource-constrained devices.
There are two main types of quantization:
Weight quantization: This involves reducing the precision of the weights (parameters) of the model. For example, instead of using 32-bit floating-point numbers, weights might be quantized to 8-bit integers. This reduces the model size and accelerates computations during inference (a short sketch follows this list).
Activation quantization: Activation quantization focuses on reducing the precision of the intermediate values (activations) during the forward pass of the model. Similar to weight quantization, this process helps in reducing memory requirements and speeding up inference.
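To make the idea concrete, here is a minimal sketch of symmetric per-tensor weight quantization in Python with NumPy. The function names and the toy weight matrix are assumptions made for illustration, not part of any particular framework.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map float32 weights to int8."""
    scale = np.max(np.abs(weights)) / 127.0              # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor for use at inference time."""
    return q.astype(np.float32) * scale

# A toy 32-bit weight matrix shrinks to a quarter of its original size.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
print(w.nbytes, q.nbytes)   # 262144 bytes vs. 65536 bytes
```

Production quantization schemes often use per-channel scales and calibration data, but the memory saving, from 32 bits down to 8 bits per weight, is the same.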
Pruning involves removing weights or connections that contribute little to the model’s overall performance. This reduces the number of parameters, leading to a smaller model. Pruning can be unstructured or structured.
Unstructured pruning: Unstructured pruning entails eliminating irrelevant parameters without taking the model’s structure into account. In essence, unstructured pruning sets parameters below a specified threshold to zero, effectively neutralizing their influence. This leads to a sparse model characterized by a random distribution of zero and non-zero weights (a minimal code sketch follows this discussion).
Structured pruning: Structured pruning encompasses removing complete segments of a model, including neurons, channels, or layers. The benefit of structured pruning lies in streamlining model compression and enhancing hardware efficiency. For example, removing a complete layer or some neurons from the neural network can reduce the computational complexity of the model without introducing any irregularities in its structure.
However, structured pruning demands an understanding of the model’s architecture and of how its various components contribute to overall performance. There is also an elevated risk of substantially affecting the model’s accuracy when removing complete neurons or layers, since crucial learned features may be eliminated.
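As a rough illustration of unstructured, magnitude-based pruning, the sketch below zeroes out a chosen fraction of the smallest-magnitude weights. The sparsity level and the random weight matrix are arbitrary assumptions for demonstration.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)   # magnitude cutoff
    mask = np.abs(weights) >= threshold                  # True where weights are kept
    return weights * mask, mask

w = np.random.randn(512, 512).astype(np.float32)
pruned, mask = magnitude_prune(w, sparsity=0.7)
print(f"Fraction of weights set to zero: {1 - mask.mean():.2f}")   # ~0.70
```

Structured pruning would instead drop whole rows, columns, or layers so that the remaining computation stays dense and hardware-friendly.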
Knowledge distillation is a methodology for transferring knowledge from a large model to a more compact one without significantly compromising accuracy. This approach is beneficial because larger models have a greater knowledge capacity than smaller counterparts, yet that capacity might not be fully utilized, even in a model as large as ChatGPT. Transferring knowledge from a large model to a smaller one makes the latter deployable on less powerful hardware, including mobile devices.
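A common way to implement this is the distillation loss of Hinton et al., which blends the usual cross-entropy on hard labels with a temperature-softened match to the teacher’s output distribution. The PyTorch sketch below uses made-up logits and hyperparameters (T, alpha) purely for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-label loss from the teacher."""
    # Soft targets: teacher probabilities smoothed by temperature T.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale to keep gradient magnitudes comparable
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy batch: 4 examples, 10 classes, random teacher and student outputs.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```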
Tensor decomposition (or factorization) approximates a large tensor or matrix as a product of smaller ones. The best-known case is matrix factorization, a technique also used in collaborative filtering and recommendation systems, which decomposes a matrix into the product of two or more smaller matrices. Applied to a model’s weight matrices, this low-rank approximation reduces the parameter count while still capturing the underlying patterns or latent factors in the data.
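One concrete form of this idea, sketched below under the assumption of a plain dense weight matrix, is a truncated SVD: the matrix is replaced by two thin factors whose product approximates it. The matrix size and rank are arbitrary choices for illustration.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) as A @ B with A (m x rank) and B (rank x n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]     # absorb the singular values into the left factor
    B = Vt[:rank, :]
    return A, B

W = np.random.randn(1024, 1024).astype(np.float32)
A, B = low_rank_factorize(W, rank=64)
# Parameter count drops from 1024*1024 to 2 * 1024 * 64.
print(W.size, A.size + B.size)
```

A layer that used W can then apply B followed by A, trading one large matrix multiplication for two much smaller ones.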
To conclude, improving the efficiency of LLMs involves techniques such as pruning, quantization, knowledge distillation, and tensor decomposition. Together, these strategies reduce the model’s size and improve its performance when deployed on various edge devices.