How ChatGPT models are compressed to increase efficiency

Model compression refers to the process of reducing the size of a trained model while maintaining its predictive performance. The goal is to create a more compact representation of the model that requires fewer resources for storage, memory, and computation. Large language models (LLMs) like ChatGPT are compressed using several methods to meet the deployment requirements on various platforms.


Model compression techniques

There are several techniques for model compression:

  • Quantization

  • Pruning

  • Knowledge distillation

  • Tensor decomposition (factorization)

Quantization

Quantization reduces the precision, or bit count, used to represent numerical values in a model. Instead of the standard 32-bit floating-point representation, the model's parameters or activations are stored with fewer bits. This reduction in precision shrinks the memory footprint and lowers the computational requirements of the model, making it more efficient to deploy on resource-constrained devices.

Quantization of a Float 32 matrix to an Int 8 matrix

There are two main types of quantization:

  • Weight quantization: This involves reducing the precision of the weights (parameters) of the model. For example, instead of using 32-bit floating-point numbers, weights might be quantized to 8-bit integers. This reduces the model size and accelerates computations during inference.

  • Activation quantization: Activation quantization focuses on reducing the precision of the intermediate values (activations) during the forward pass of the model. Similar to weight quantization, this process helps in reducing memory requirements and speeding up inference.
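
As a concrete illustration, the sketch below applies symmetric, per-tensor weight quantization to a small matrix using NumPy. The helper names (`quantize_int8`, `dequantize_int8`) and the single-scale scheme are illustrative assumptions, not the exact procedure used for ChatGPT.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    # One scale for the whole tensor: map the largest absolute value to 127.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values from the int8 representation."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
print("storage per weight: 32 bits -> 8 bits")
print("max absolute rounding error:", np.abs(weights - restored).max())
```

Production systems often use per-channel scales and calibration data to choose the quantization ranges, but the core idea is the same: store compact integers plus a small scale factor instead of full 32-bit floats.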

Pruning

Pruning removes weights or connections that contribute little to the model's overall performance. This reduces the number of parameters, leading to a smaller model. Pruning can be unstructured or structured.

  • Unstructured pruning: Unstructured pruning entails eliminating irrelevant parameters without taking the model’s structure into account. In essence, unstructured pruning sets parameters below a specified threshold to zero, effectively neutralizing their influence. This leads to a sparse model characterized by a random distribution of zero and non-zero weights.

An example of unstructured pruning by deactivating some links in the neural network
  • Structured pruning: Structured pruning encompasses removing complete segments of a model, including neurons, channels, or layers. The benefit of structured pruning lies in streamlining model compression and enhancing hardware efficiency. For example, removing a complete layer or some neurons from the neural network can reduce the computational complexity of the model without introducing any irregularities in its structure.

An example of structured pruning by removing some neurons from the neural network


However, structured pruning requires an understanding of the model's architecture and of how its various components contribute to overall performance. It also carries a higher risk of substantially hurting the model's accuracy, since removing entire neurons or layers can eliminate crucial learned features.
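
The sketch below demonstrates both flavors on a single weight matrix with NumPy: unstructured pruning zeroes out the smallest-magnitude entries, while structured pruning drops entire output neurons (rows). The function names and the L2-norm criterion for ranking neurons are illustrative choices, not a prescription.

```python
import numpy as np

def unstructured_prune(W: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights so roughly `sparsity` fraction become zero."""
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= threshold     # keep only weights above the magnitude threshold
    return W * mask                   # same shape, but now sparse

def structured_prune(W: np.ndarray, keep: int) -> np.ndarray:
    """Keep only the `keep` output neurons (rows) with the largest L2 norm; drop the rest."""
    norms = np.linalg.norm(W, axis=1)
    kept_rows = np.sort(np.argsort(norms)[-keep:])   # indices of the strongest rows, in order
    return W[kept_rows, :]            # genuinely smaller matrix

W = np.random.randn(8, 16).astype(np.float32)
sparse_W = unstructured_prune(W, sparsity=0.5)   # same shape, half the entries zeroed
small_W = structured_prune(W, keep=4)            # shrinks from 8x16 to 4x16
print("fraction zeroed:", np.mean(sparse_W == 0), "| new shape:", small_W.shape)
```

Note the practical difference: the unstructured result needs sparse-aware hardware or libraries to realize any speedup, whereas the structured result is a smaller dense matrix that runs faster on ordinary hardware.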

Knowledge distillation

Knowledge distillation is a method for transferring knowledge from a large model (the teacher) to a more compact one (the student) without a significant loss in accuracy. This approach is useful because larger models have a greater knowledge capacity than smaller ones, yet that capacity might not be fully utilized, as in the case of ChatGPT. Transferring knowledge from a large model to a smaller one makes the latter deployable on less powerful hardware, including mobile devices.

An example of knowledge distillation
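
Here is a minimal sketch of the core training signal, assuming the common setup from Hinton et al.: the student is trained to match the teacher's temperature-softened output distribution via a KL-divergence loss. The function names and the temperature value are illustrative assumptions.

```python
import numpy as np

def softmax(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax; higher T produces softer probability distributions."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits: np.ndarray, teacher_logits: np.ndarray, T: float = 2.0) -> float:
    """KL divergence between softened teacher and student distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)))
    return float(kl * T ** 2)

teacher = np.array([2.0, 1.0, 0.1])   # toy logits from the large teacher model
student = np.array([1.5, 0.8, 0.3])   # toy logits from the small student model
print("distillation loss:", distillation_loss(student, teacher))
```

In practice, this soft-target loss is usually combined with the ordinary cross-entropy loss on the true labels, so the student learns both the ground truth and the teacher's "dark knowledge" about how classes relate.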

Tensor decomposition (factorization)

In the context of model compression, tensor decomposition (or factorization) approximates a large weight matrix or tensor as the product of two or more smaller matrices. The same idea underlies matrix factorization in collaborative filtering and recommendation systems: the factors capture the underlying patterns, or latent factors, in the data while requiring far fewer parameters to store.

Tensor decomposition of a matrix into 3 different matrices
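
The sketch below uses a truncated SVD in NumPy to factor a weight matrix into two smaller matrices (folding the singular values into the first factor); the rank and matrix sizes are arbitrary illustrative choices, not values taken from any real model.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Approximate W (m x n) as A @ B, with A (m x rank) and B (rank x n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # scale the leading left singular vectors
    B = Vt[:rank, :]             # leading right singular vectors
    return A, B

W = np.random.randn(256, 256).astype(np.float32)
A, B = low_rank_factorize(W, rank=32)
params_before = W.size
params_after = A.size + B.size
print("compression ratio:", params_before / params_after)
print("relative reconstruction error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```

Replacing one large weight matrix with two thin factors cuts both storage and the cost of the corresponding matrix multiplication, at the price of a controlled approximation error that grows as the rank shrinks.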

To conclude, improving the efficiency of LLMs involves techniques such as pruning, quantization, knowledge distillation, and tensor decomposition. Together, these strategies reduce a model's size and resource requirements, making it practical to deploy on a variety of edge devices.
