Ultimate Guide to LLM Quantization for Faster, Leaner AI Models

Learn how LLM quantization transforms AI models into faster, leaner, and more efficient tools in this ultimate guide.


The rapid growth of generative AI models is both exciting and daunting. While organizations are eager to adopt these advanced technologies, they often struggle with the massive amounts of data required and the hefty computational costs involved. As multimodal LLMs gain popularity, their ability to process and generate complex outputs across different data types like images, videos, and text can create real business efficiencies. The less convenient truth is that these models often underperform on specific tasks and require extensive fine-tuning. To make things worse, large language models (LLMs) and their multimodal counterparts are incredibly resource-intensive, making them costly and difficult to deploy in real-world settings. Fortunately, LLM quantization can help alleviate many of these challenges. This article explores how quantization can improve the efficiency of generative AI models so they consume fewer resources, perform better, and maintain accuracy.

Lamatic’s generative AI tech stack can help businesses implement LLM quantization to solve their problems faster and with less effort. Our solutions can enhance the performance of multimodal LLMs and their quantized versions, so organizations can deploy AI tools that work well for their needs.

What is LLM Quantization and Its Importance


Large language models (LLMs) have taken the artificial intelligence world by storm. They have become the foundation of numerous applications, powering everything from chatbots to code generation tools. As LLMs have evolved, their complexity has grown exponentially, significantly increasing their number of parameters. 

The first GPT model, launched in 2018, had roughly 0.12 billion parameters (117 million). By late 2019, GPT-2 expanded this to 1.5 billion, and GPT-3, released in 2020, skyrocketed to 175 billion parameters. GPT-4 is widely estimated to exceed 1 trillion parameters, though OpenAI has not disclosed the figure. This rapid growth presents a challenge: 

  • As models grow, so do their memory requirements, often surpassing the capacity of advanced hardware accelerators such as GPUs. 
  • This growing demand for memory limits both the training and hosting of the models for inference, consequently restricting the accessibility and adoption of LLM-based solutions. 

Why Quantization Matters

Reducing the size of large language models to make them more accessible for deployment on less powerful hardware can be achieved through various means, including quantization. By changing the precision of some model components, quantization reduces the model’s memory footprint while maintaining similar performance levels.

The Basics of Quantization

Quantization is a model compression technique that converts the weights and activations within a large language model from high-precision values to lower-precision ones. This means changing data from a type that can hold more information to one that holds less. 

A typical example is converting data from a 32-bit floating-point number to an 8-bit integer. Reducing the number of bits required for each model’s weights or activations significantly decreases its overall size. 

Understanding Quantization and Its Impact on LLM Performance and Efficiency

Quantization shrinks LLMs so they consume less memory, require less storage space, and are more energy-efficient. An effective analogy for understanding quantization is image compression. High-resolution images are often compressed for use on websites. 

This involves reducing the size of the image by removing some data or bits of information. While this somewhat lowers the image quality, it also decreases the image dimensions and file size, making web pages load faster while providing a satisfactory visual experience.

Making LLMs More Efficient Through Quantization

Quantizing an LLM reduces its computational requirements, allowing it to run on less powerful hardware while delivering adequate performance. Just as compressed images are easier to handle, quantized models are easier to deploy across various platforms, though there is a slight trade-off in detail or precision. As we will see, the quantization process also introduces some noise. 

Understanding the Theory of Quantization

Quantization is normally applied to the weights of a large language model, although it can also be applied to the activations. Model weights are parameters in a neural network that determine the strength of connections between neurons across different layers. Weights are the learned coefficients that transform input data through the network. Weights are initially set to random, meaningless values and adjusted during training based on the error between the predicted output and the actual targets. This adjustment process is guided by optimization algorithms such as gradient descent.

One option for quantizing a model is to reduce the precision of its weights. To illustrate this, consider a 3x3 matrix of weights stored with four-decimal precision. Its quantized version is computed by rounding each element of the original matrix to one decimal place. 

Explaining Quantization Error and Memory Efficiency in LLMs

The original and quantized matrices are not identical, but they are very similar. The value-by-value difference is known as the quantization error, which can also be represented in matrix form. In this simple example, we are just rounding the matrix elements. In practice, quantization is performed by converting numerical values to a different data type, i.e., from a higher-precision type to a lower-precision one. For example, the default data type for storing most models' weights is float32, which allocates 4 bytes (32 bits) per parameter. 

For a 3x3 matrix like the one in the example, that adds up to a total memory footprint of 36 bytes. By downcasting the data type to int8, we would only need one byte per parameter, shrinking the matrix's total memory footprint to 9 bytes. 
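
Here is a minimal NumPy sketch of this example (the weight values are illustrative, not from a real model): it rounds a small float32 matrix, computes the quantization error, and compares the memory footprint of float32 against int8.

import numpy as np

# A 3x3 weight matrix with four-decimal precision (illustrative values)
weights = np.array([[ 0.4172, -0.7203,  0.0001],
                    [ 0.3023,  0.1468,  0.0923],
                    [ 0.1863,  0.3456,  0.3968]], dtype=np.float32)

# "Quantize" by rounding each element to one decimal place
quantized = np.round(weights, 1)

# The value-by-value difference is the quantization error
print(weights - quantized)

# float32 stores 4 bytes per parameter; int8 stores 1 byte per parameter
print(weights.nbytes)                                # 36 bytes
print(np.empty_like(weights, dtype=np.int8).nbytes)  # 9 bytes
# Note: a naive cast to int8 would truncate these small values to zero;
# real quantization maps floats to integers via a scale and zero-point,
# as described in the linear quantization section below.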

Brain Floating Point - BF16

The selected data type for the model’s weights determines how much we can reduce the model. Traditional floating-point types, like:

  • Float32
  • Float16

have been the standard in many machine-learning applications, balancing accuracy and computational efficiency. While float32 offers high precision and a wide dynamic range, it requires more memory and computational power. 

Float16 offers reduced precision and range but significantly speeds up computations. In 2018, Google recognized the need for a floating-point format that offered a midpoint between float32's wide dynamic range and float16's efficiency. This led to the creation of the Brain Floating Point format (bfloat16), which retains the dynamic range of float32 but with reduced precision.
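
The trade-off is easy to inspect directly. A quick PyTorch sketch (any framework that exposes these dtypes would work) compares the bit width, dynamic range, and precision of the three formats:

import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # bits: storage width; max: largest representable value (dynamic range);
    # eps: gap between 1.0 and the next representable value (precision)
    print(f"{str(dtype):16s} bits={info.bits:2d} max={info.max:.3e} eps={info.eps:.3e}")

bfloat16 keeps float32's 8-bit exponent, so its maximum value is nearly identical to float32's, but it has far fewer mantissa bits, trading precision for range at half the memory cost of float32.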

Different Types of Quantization

There are several types of quantization, and we’ve described each in detail below: 

Linear Quantization

Linear quantization is one of the most popular quantization schemes for LLMs. It involves evenly mapping the range of floating-point values of the original weights onto a range of fixed-point (integer) values; at inference time, the stored integers are dequantized back to a higher-precision data type for computation. Let's review the steps required to apply linear quantization to a model, keeping things as simple as possible (a worked code sketch follows the Dequantize step below): 

Calculate the Minimum and Maximum Values

For each tensor, we need the minimum and maximum values to define the range of the floating-point values to quantize. The data type we want to convert to gives the minimum and maximum of the quantized range. For example, for an unsigned 8-bit integer, the range is 0 to 255. 

Compute the Scale (S) and the Zero-Point (Z) Values

The scale adjusts the range of floating-point values to fit within the integer range. The zero-point ensures that zero in the floating-point range is accurately represented by an integer, maintaining numerical accuracy and stability, especially for values close to zero.

Quantize the Values (Q)

This step involves mapping floating-point values to a lower-precision integer range using the scale factor S and zero-point Z computed in the previous step. The rounding operation ensures that the final result is a discrete integer suitable for storage and computation in lower-precision formats. 

Dequantize

During inference, the dequantized values are used for calculations to achieve higher precision, although only the quantized weights are stored. This step will also allow you to compute the quantization error. 
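
Putting the four steps together, here is a minimal NumPy sketch of asymmetric linear quantization to unsigned 8-bit integers (the helper names are ours, not from any particular library):

import numpy as np

def linear_quantize(x, q_min=0, q_max=255):
    # Step 1: range of the floating-point values
    r_min, r_max = float(x.min()), float(x.max())
    # Step 2: scale and zero-point
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    # Step 3: quantize (rescale, shift, round, and clip to the integer range)
    q = np.clip(np.round(x / scale + zero_point), q_min, q_max).astype(np.uint8)
    return q, scale, zero_point

def linear_dequantize(q, scale, zero_point):
    # Step 4: dequantize back to floats for computation at inference time
    return scale * (q.astype(np.float32) - zero_point)

weights = np.random.randn(3, 3).astype(np.float32)
q, scale, zp = linear_quantize(weights)
reconstructed = linear_dequantize(q, scale, zp)
print("quantization error:\n", weights - reconstructed)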

Blockwise Quantization 

Linear quantization is a popular option due to its simplicity, but there are multiple ways of building a mapping. Another method that is quite popular nowadays is Blockwise quantization, which is more accurate than linear quantization for models with non-uniform weight distributions. Blockwise quantization is a more sophisticated method that involves quantizing weights in smaller blocks rather than across the entire range. 

This method relies on two key concepts (a simplified code sketch follows this list):

  • Blockwise quantization: Weights are divided into smaller blocks, and quantization is applied to each block separately. This allows for better handling of variations within different parts of the model. 
  • Distribution-aware blocks: The quantization process considers the relative frequency of the weights within each block, creating blocks that are aware of the distribution of the weights. This results in a more efficient mapping of values.
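
The sketch below illustrates the blockwise idea under simple assumptions (a block size of 64 and plain asymmetric linear quantization per block; real implementations differ in the details): each block gets its own scale and zero-point, so an outlier in one block does not stretch the quantization range of every other block.

import numpy as np

def quantize_block(x, q_min=0, q_max=255):
    # Plain asymmetric linear quantization applied to a single block
    scale = (x.max() - x.min()) / (q_max - q_min)
    zero_point = round(q_min - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), q_min, q_max).astype(np.uint8)
    return q, scale, zero_point

weights = np.random.randn(1024).astype(np.float32)
block_size = 64  # illustrative block size

quantized_blocks = []
for start in range(0, weights.size, block_size):
    block = weights[start:start + block_size]
    quantized_blocks.append(quantize_block(block))
# Each block stores its own scale and zero-point, which is extra metadata;
# that per-block overhead is exactly what QLoRA's double quantization
# (discussed below) compresses further.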

Weight Quantization vs. Activation Quantization

During our matrix examples, we have mainly focused on quantizing a model's weights. While weight quantization is a crucial step for model optimization, it is also important to consider that the activations of a model can also be quantized. Activation quantization refers to reducing the precision of the intermediate outputs of each layer in the network. Unlike weights, which are static (constant) once the model is trained, activations are dynamic. This means that activations change with each input to the network, making their range harder to predict. 

Activation quantization is harder to implement than weight quantization. It requires careful calibration to ensure that the dynamic range of activations is well captured. Weight quantization and activation quantization are complementary techniques. By applying both techniques, we can significantly improve model size without compromising performance too much.

Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT) 

Quantization can also be performed at different points in a model's lifecycle. The two main approaches are: 

  • Post-Training Quantization (PTQ): This method does not involve any changes to the training process itself. The dynamic range of parameters is recalculated at runtime, similar to how we worked with the example matrices. 
  • Quantization-Aware Training (QAT): This approach involves modifying the training process to simulate the effects of quantization during training. The model is trained to be robust to quantization noise, resulting in better accuracy. During QAT, the intermediate states of the training hold both a quantized version of the weights and the original unquantized weights (also in memory!). 

We use the quantized version of the model for inference, but the unquantized version of model weights will be updated during backpropagation. As expected, although more complex and time-consuming, QAT generally results in higher accuracy than PTQ. 
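
As a concrete PTQ example, PyTorch ships dynamic quantization, which converts the weights of selected layer types to int8 after training and quantizes activations on the fly at inference; a minimal sketch, where the model is a small stand-in rather than a real LLM:

import torch
import torch.nn as nn

# A stand-in model; in practice this would be your pre-trained network
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
model.eval()

# Post-training dynamic quantization: Linear weights become int8, while
# activations are quantized dynamically at inference time
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)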

Different Techniques for LLM Quantization 

Now that we have covered quantization and its benefits, let's focus on different quantization methods and how they work. 

QLoRA 

Low-Rank Adaptation (LoRA) is a Parameter-Efficient Fine-Tuning (PEFT) technique that reduces the memory requirements of further training a base LLM by freezing its weights and fine-tuning a small set of additional weights, called adapters. 

Quantized Low-Rank Adaptation (QLoRA) takes this a step further by quantizing the original weights within the base LLM to 4-bit, reducing the memory requirements of an LLM to make it feasible to run on a single GPU. QLoRA carries out quantization through two key mechanisms: 

  • 4-bit NormalFloat (NF4) data type
  • Double Quantization

NF4

A 4-bit data type used in machine learning that normalizes each weight to a value between -1 and 1, giving a more accurate representation of lower-precision weight values than a conventional 4-bit float. While NF4 stores the quantized weights, QLoRA uses an additional data type, bfloat16 (BF16), specially designed for machine learning purposes, to carry out calculations during forward and backward propagation. 

Double Quantization (DQ)

A process of quantizing the quantization constants themselves for additional memory savings. QLoRA quantizes weights in blocks of 64, and while this enables precise 4-bit quantization, it also requires storing a scaling factor for each block, which increases the amount of memory required. DQ addresses this issue by performing a second round of quantization on the per-block scaling factors: the 32-bit scaling factors are grouped into blocks of 256 and quantized to 8-bit. 

Where a 32-bit scaling factor per block of 64 weights previously added 0.5 bits per weight, DQ brings this down to only about 0.127 bits per weight. Though seemingly insignificant, across a 65B-parameter LLM this saves roughly 3 GB of memory. 
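
In practice, QLoRA-style loading is usually done through the Hugging Face transformers integration with bitsandbytes; a hedged sketch, where the model name is a placeholder and option names may vary slightly across library versions:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # use the 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # BF16 for forward/backward computation
)

model = AutoModelForCausalLM.from_pretrained(
    "your-base-model",                      # placeholder model id
    quantization_config=bnb_config,
)

LoRA adapters can then be attached on top of the frozen 4-bit base model (for example, with the peft library) to complete the QLoRA fine-tuning recipe.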

PRILoRA

Pruned and Rank-Increasing Low-Rank Adaptation (PRILoRA) is a fine-tuning technique recently proposed by researchers that aims to increase LoRA efficiency through the introduction of two additional mechanisms: 

  • The linear distribution of ranks
  • Ongoing importance-based A-weight pruning

Returning to the concept of low-rank decomposition, LoRA achieves fine-tuning by combining two matrices: 

  • W contains the entire model’s weights
  • AB represents all changes made to the model by training the additional weights, i.e., adapters. 

The AB matrix can be decomposed into smaller matrices of lower rank – A and B – hence the term low-rank decomposition. While the low-rank r is the same across all the LLM layers in LoRA, PRILoRA linearly increases the rank for each layer. For example, the researchers who developed PRILoRA started with r = 4 and increased the rank until r = 12 for the final layer – producing an average rank of 8 across all layers.

Optimizing LLMs with PRILoRA for Efficient Fine-Tuning

PRILoRA prunes the A matrix, eliminating the lowest, i.e., least significant weights every 40 steps throughout the fine-tuning process. The lowest weights are determined through an importance matrix, which stores both the temporary magnitude of weights and the collected statistics related to the input for each layer. 

Pruning the A matrix in this way reduces the number of weights that must be processed, reducing the time required to fine-tune an LLM and the memory requirements of the fine-tuned model. Although still a work in progress, PRILoRA showed very encouraging results on benchmark tests conducted by researchers. This included outperforming full fine-tuning methods on 6 out of 8 evaluation datasets while achieving better results than LoRA on all datasets. 

GPTQ 

GPTQ (General Pre-Trained Transformer Quantization) is a quantization technique designed to reduce models' size so they can run on a single GPU. GPTQ works through a form of layer-wise quantization: 

  • An approach that quantizes the model one layer at a time
  • For each layer, it finds quantized weights that minimize the output error, measured as the mean squared error (MSE) between the outputs of the full-precision layer and the quantized layer

All the model’s weights are converted into a matrix, which is worked through in batches of 128 columns at a time through lazy batch updating. This involves quantizing the weights in batch, calculating the MSE, and updating the weights to values that diminish it. After processing the calibration batch, all the remaining weights in the matrix are updated following the MSE of the initial batch – and then all the individual layers are re-combined to produce a quantized model. 

GPTQ employs a mixed INT4/FP16 quantization method in which a 4-bit integer quantizes weights, and activations remain in a higher precision float16 data type. Subsequently, during inference, the model’s weights are dequantized in real-time so computations are carried out in float16. 
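
The transformers library (with the optimum and auto-gptq backends installed) exposes GPTQ through a configuration object; a hedged sketch, with placeholder model names and the calibration dataset left as an illustrative choice:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "your-base-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ with a built-in calibration dataset; group_size sets how many
# weights share a set of quantization parameters
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer, group_size=128)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)
quantized_model.save_pretrained("your-model-gptq")  # placeholder output path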

GGML/GGUF 

GGML (which stands for Georgi Gerganov Machine Learning, after its creator, or GPT-Generated Model Language) is a C-based machine learning library designed for quantizing Llama models so they can run on a CPU. The library allows you to save quantized models in the GGML binary format, which can be executed on a broader range of hardware. 

GGML quantizes models through the k-quant system, which uses value representations of different bit widths depending on the chosen quant method. The model's weights are divided into blocks of 32, each with a scaling factor derived from the largest weight value in the block. Depending on the selected method, the most important weights are quantized to a higher-precision data type, while the rest are assigned to a lower-precision type. For example, the q2_k quant method converts the largest weights to 4-bit integers and the remaining weights to 2-bit. 

Exploring GGML and GGUF: Efficient Quantization for LLMs

The q5_0 and q8_0 quant methods convert all weights to 5-bit and 8-bit integer representations, respectively. You can view GGML’s full range of quant methods by looking at the model cards in this code repo.  GGUF (GPT-Generated Unified Format) is a successor to GGML and is designed to address its limitations – most notably, enabling the quantization of non-Llama models. 

GGUF is also extensible, integrating new features while retaining compatibility with older LLMs. To run GGML or GGUF models, however, you must use llama.cpp, a C/C++ library developed by GGML's creator, Georgi Gerganov. llama.cpp reads models saved in the .GGML or .GGUF format and enables them to run on CPU devices instead of requiring GPUs.
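
Once a model has been converted and quantized to GGUF with llama.cpp's tooling, it can also be loaded from Python through the llama-cpp-python bindings; a hedged sketch, with a placeholder file path:

from llama_cpp import Llama

# Load a 4-bit GGUF model on CPU; the path is a placeholder
llm = Llama(model_path="models/your-model.Q4_K_M.gguf", n_ctx=2048)

output = llm("Explain quantization in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])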

AWQ 

Conventionally, a model’s weights are quantized irrespective of the data they process during inference. Activation-Aware Weight Quantization (AWQ) accounts for the activations of the model, i.e., the most significant features of the input data and how it is distributed during inference. By tailoring the precision of the model’s weights to the particular input characteristic, you can minimize the loss of accuracy caused by quantization. 

The first stage of AWQ is using a calibration data subset to collect activation statistics from the model, i.e., which weights are activated during inference. These are known as salient weights, typically comprising less than 1% of the total weights. The salient weights are skipped over for quantization to increase accuracy, remaining as an FP16 data type. Meanwhile, the rest of the weights are quantized into INT3 or INT4 to reduce memory requirements across the rest of the LLM. 
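
The AutoAWQ library packages this workflow; the sketch below follows its documented usage, but the model names are placeholders and the exact quant_config keys are assumptions that may differ across versions:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "your-base-model"   # placeholder
quant_path = "your-model-awq"    # placeholder output directory

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 4-bit AWQ with groups of 128 weights; calibration runs internally
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)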

Calibration Techniques

Some quantization methods require a calibration step. For example, we must determine a model's original activation range before quantization. General calibration usually involves running inference on a representative dataset to optimize the quantization parameters and minimize quantization error. 

During this calibration process, the quantization algorithm collects statistics about the distribution and range of the model’s activations and weights. These statistics help determine the best quantization parameters. Computing the scale and the zero-point when quantizing the weights is also a sort of calibration, but there are other types: 

  • Percentile Calibration: Focuses on a specified percentile range of the weights, ignoring extreme outliers, leading to a more robust quantization. 
  • Mean and Standard Deviation Calibration: Defines the quantization range based on the statistical measures of the mean and standard deviation of the weights. 

However, quantization methods like QLoRA can be used without any calibration step. These methods typically replace all linear layers in the model with quantized linear layers (QLinear). QLinear layers are designed to handle quantization internally, eliminating the need for an additional calibration step. This makes the quantization process more straightforward while maintaining the model's performance.
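
To make percentile calibration concrete, here is a small NumPy sketch: instead of using the absolute min/max of the observed activations, the range is clipped at chosen percentiles so that a handful of outliers does not stretch the quantization range (the 0.1/99.9 percentiles are an illustrative choice):

import numpy as np

# Pretend these are activations collected while running a calibration dataset
activations = np.random.randn(100_000).astype(np.float32)
activations[:10] *= 50.0  # inject a few extreme outliers

# Min/max calibration: the outliers dominate the range
naive_min, naive_max = activations.min(), activations.max()

# Percentile calibration: ignore the extreme tails (0.1% on each side here)
p_min, p_max = np.percentile(activations, [0.1, 99.9])

print(f"min/max range:    [{naive_min:.2f}, {naive_max:.2f}]")
print(f"percentile range: [{p_min:.2f}, {p_max:.2f}]")
# The tighter range gives smaller integer steps, so most values are
# represented more precisely; the clipped outliers simply saturate.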

Why Do We Need to Compress LLMs?

Latency: Getting a response from an LLM is a computationally intensive task, and execution times can be quite long. This can be problematic in real-time applications where immediate responses are required. By compressing LLMs, we can significantly reduce their execution time, allowing for faster responses and the ability to handle more load.

Model Size: LLMs are incredibly large, with billions or even hundreds of billions of parameters. Many of these models require multiple GPUs to run efficiently (some will not even load on a single GPU).

The Importance of Compressing LLMs for Cost-Effective and Accessible Deployment

Compressing LLMs can significantly reduce their size, making them deployable on less powerful devices, even mobile phones!

Memory Consumption: Serving LLMs requires multiple GPUs, leading to communication bottlenecks between them. By compressing LLMs, we can reduce the memory required to execute these models and alleviate those bottlenecks.

Costs: By optimizing our models so that they can be deployed on less powerful machines (which in this context can still mean machines with a GPU), we can reduce deployment costs, a major bottleneck for LLM products as they grow in complexity, and make it possible to use these models in a wider range of contexts.

A Brief Practical Guide to LLM Quantization


Numerous steps comprise the overall quantization process. Below is a breakdown of these steps to help you understand what quantization looks like in practice.

1. Model Training: Train your model as usual with full precision (typically 32-bit floating-point).

2. Calibration: Collect a representative dataset and run it through the trained model to gather statistics about the activation ranges. This step is crucial for determining the appropriate scaling factors for quantization.

3. Quantization: Convert the model’s weights and activations to lower precision. This can be done using different quantization techniques such as: 

  • Post-training Quantization: Apply quantization after the model has been trained.
  • Quantization-aware Training (QAT): Integrate quantization into the training process to improve the final quantized model’s accuracy.

4. Fine-Tuning: Optionally, fine-tune the quantized model with a small learning rate to recover any lost accuracy.

5. Deployment: Deploy the quantized model on the target hardware, ensuring compatibility with the lower-precision format. 

TensorFlow and PyTorch Code Snippets to Help You Get Started

TensorFlow Post-Training Quantization:

import tensorflow as tf

# Load your Keras model
model = tf.keras.models.load_model('path_to_your_model')

# Convert the model to a quantized version
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_model = converter.convert()

# Save the quantized model
with open('quantized_model.tflite', 'wb') as f:
    f.write(quantized_model)
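
The snippet above applies dynamic-range quantization, which only quantizes the weights. To also quantize activations to int8, TensorFlow Lite needs a representative dataset for calibration; a hedged sketch, where the generator and input shape are placeholders you would adapt to your model:

import tensorflow as tf

model = tf.keras.models.load_model('path_to_your_model')

def representative_data_gen():
    # Yield a small number of representative input batches for calibration
    for _ in range(100):
        yield [tf.random.normal([1, 224, 224, 3])]  # placeholder input shape

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Force full integer quantization of weights and activations
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

int8_model = converter.convert()
with open('quantized_model_int8.tflite', 'wb') as f:
    f.write(int8_model)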

PyTorch Quantization-aware Training:

import torch
import torch.quantization

# Load your model (MyModel is a placeholder for a model that defines fuse_model())
model = MyModel()
model.load_state_dict(torch.load('path_to_your_model.pth'))

# Fuse layers (e.g., Conv + BatchNorm + ReLU) before quantization; fuse_model()
# is a method you define on your model using torch.quantization.fuse_modules
model.eval()
model.fuse_model()

# Prepare for quantization-aware training; the model must be in training mode
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)

# Fine-tune the model with fake-quantization observers inserted
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, train_loader, device, epoch)
    evaluate(model, val_loader, device)

# Switch to eval mode and convert to a truly quantized model
model.eval()
torch.quantization.convert(model, inplace=True)

# Save the quantized model
torch.save(model.state_dict(), 'quantized_model.pth')

Why Do We Need Quantization?

Quantization is essential for several reasons: 

  • Resource Efficiency: It reduces the model size and memory footprint, enabling deployment on devices with limited resources such as smartphones and IoT devices. 
  • Speed: Lower precision arithmetic operations are faster, which can significantly enhance the inference speed of the model. 
  • Energy Consumption: Reduced computational requirements lead to lower power consumption, which is critical for battery-operated devices.
  • Cost: Smaller and faster models reduce the computational costs associated with running AI/ML services, making them more economically viable. 

Efficiency Gains from Quantization

Quantization can improve efficiency in several ways

  • Storage: Quantized models occupy less disk space, which is crucial for devices with limited storage capacity. 
  • Latency: Faster arithmetic operations reduce the inference latency, leading to quicker response times in real-time applications. 
  • Throughput: Enhanced computational efficiency allows for processing more data in the same amount of time, increasing the model’s throughput. 

Key Parameters to Consider When Quantizing a Model

When quantizing a model, several parameters must be carefully managed: 

  • Precision Level: Choosing the right precision level (e.g., 16-bit, 8-bit, 4-bit) is crucial. Lower precision levels offer greater efficiency gains but can lead to more significant accuracy losses. 
  • Scaling Factors: Proper scaling of weights and activations is necessary to maintain the model’s performance. This involves finding the right balance between range and precision. 
  • Quantization Scheme: Different schemes, such as symmetric or asymmetric quantization and per-channel or per-tensor quantization, have trade-offs in terms of complexity and performance (see the sketch after this list). 
  • Hardware Compatibility: The target hardware must support the chosen quantization format. This includes ensuring the inference engine or library can effectively utilize the quantized model.
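
A small NumPy sketch of two of those choices (shapes and values are illustrative): symmetric quantization uses a single scale with the zero-point fixed at 0, and per-channel quantization computes one scale per output channel instead of one for the whole tensor, which usually reduces the error.

import numpy as np

weights = np.random.randn(4, 8).astype(np.float32)  # 4 output channels, 8 inputs

# Symmetric, per-tensor: one scale for the whole tensor, zero-point fixed at 0
scale_tensor = np.abs(weights).max() / 127.0
q_tensor = np.clip(np.round(weights / scale_tensor), -128, 127).astype(np.int8)

# Symmetric, per-channel: one scale per output channel (row), which tracks
# each channel's range more tightly
scale_channel = np.abs(weights).max(axis=1, keepdims=True) / 127.0
q_channel = np.clip(np.round(weights / scale_channel), -128, 127).astype(np.int8)

def mean_error(q, scale):
    return np.abs(weights - q.astype(np.float32) * scale).mean()

print("per-tensor error: ", mean_error(q_tensor, scale_tensor))
print("per-channel error:", mean_error(q_channel, scale_channel))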

Navigating the Complexities of LLM Quantization

Compressing LLMs with quantization can have several benefits. However, this process also has some challenges. 

Large LLMs vs. Small LLMs 

The biggest hurdle is the unintuitive results we get when comparing weight quantization (WQ) with weight-and-activation quantization (WAQ); the two have different impacts on model accuracy. INT8 WQ leads to very little loss in accuracy, particularly for large models, while compressing to INT4 WQ affects smaller models disproportionately.

WAQ can result in a larger performance degradation for larger models. In some cases, the accuracy degradation may be higher than the advantage of using a larger model! Therefore, it is essential to carefully evaluate the trade-offs between these two techniques when compressing LLMs. You can read more about this here. 

Quantization Aware Training 

In TinyML, another field where model compression is an important part of deploying applications, Quantization-Aware Training (QAT) is a common technique used to recover any drop in accuracy. QAT simulates the loss of precision caused by quantization during the training process. However, this process can be difficult, expensive, and time-consuming to perform, especially for very large LLMs. 

Bias 

Compressing LLMs can increase bias. When we remove parameters from the model, we risk losing some of the diversity in the data. This can result in a biased model that may perform poorly in real-world scenarios. To the best of our knowledge, no study has been conducted on the effects of quantization on LLM bias. 

Previous works have shown that compressed models amplify existing algorithmic bias and disproportionately impact performance on underrepresented features. Another work on BERT-based models has shown similar results. They also suggest methods to mitigate these issues. For generative LLMs, compression can lead to a loss of vocabulary and richness in the output. As we reduce the number of parameters in the model, we risk losing some of the nuances and details in the data. This can result in less expressive and less accurate output. 

Quantization Time and Latency 

LLMs are so large that it can take a few hours to quantize some of them. Even though quantization is a one-time activity, it is still computationally intensive and may need access to GPUs to run quickly. The results below show the time it took to quantize models using GPTQ on an Nvidia A100 GPU. Even the smallest model took nearly 3 minutes to quantize!

(Figure: GPTQ quantization times on an Nvidia A100 GPU, from the GPTQ paper.)

Finally, it is essential to note that in some cases, quantizing a model can lead to no latency decrease, or even an increase in latency! Inference times can vary based on the model's architecture and your hardware. 

Quantizing Large LLM vs. Finetuning or Training a Smaller LLM 

Given these counterintuitive compression effects, you might wonder whether compressing a 10B+ parameter LLM makes sense at all. More work is needed to reduce quantization issues and improve hardware support. Other compression methods, like knowledge distillation (as used for Alpaca), show more promising results than quantization.

While recent methods like RPTQ have shown ways to reduce the drop in perplexity of compressed 10B+ LLMs, the effects of compression on bias are still unknown. A better approach may be to combine finetuning with quantization. Some practitioners quantize a medium-sized LLM (<10B parameters) and then finetune it with LoRA. This approach should help with the issues mentioned earlier.

Start Building GenAI Apps for Free Today with Our Managed Generative AI Tech Stack

Lamatic offers a managed Generative AI tech stack that includes:

  • Managed GenAI Middleware
  • Custom GenAI API (GraphQL)
  • Low-Code Agent Builder
  • Automated GenAI Workflow (CI/CD)
  • GenOps (DevOps for GenAI)
  • Edge Deployment via Cloudflare Workers
  • Integrated Vector Database (Weaviate)

Lamatic empowers teams to rapidly implement GenAI solutions without accruing tech debt. Our platform automates workflows and ensures production-grade deployment on edge, enabling fast, efficient GenAI integration for products needing swift AI capabilities. 

Start building GenAI apps for free today with our managed generative AI tech stack.