LLM compression is a hugely important area of focus right now. Fitting increasingly large open-source LLMs into consumer GPU memory for cheaper, more efficient training is often critical to production use cases. There are a few reasons why you might want to compress an LLM:
Increase inference speed
Decrease model size for storage
Decrease training costs
Can act as a regularizer, which could in principle improve generalization (in practice this is rarely the case; compression usually comes with some drop in accuracy)
There’s a recently published paper titled SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression, with an associated GitHub repository, that has been able to run a 33B-parameter LLM on a single 24 GB GPU with a 15% inference speedup and no performance degradation. Before we cover how they achieved this, let’s do a quick review of quantization.
Model quantization reduces a deep learning model’s memory and compute requirements without significant performance loss. Instead of higher-precision floating-point numbers, model parameters and activations are represented with lower-precision values such as integers or fixed-point numbers, which shrinks memory usage and speeds up computation. The process covers both weight and activation quantization, mapping weights and intermediate values to smaller bit representations, and retraining or fine-tuning is often applied afterward to mitigate accuracy loss. Quantization enables efficient deployment on resource-constrained devices while maintaining acceptable performance.
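To make that concrete, here is a minimal, self-contained sketch (plain NumPy, not taken from the SpQR paper) of uniform quantization of a weight matrix to a low-bit integer representation with a single scale and zero point:

import numpy as np

def quantize(w, bits=4):
    """Map float weights to unsigned integers in [0, 2**bits - 1]."""
    qmax = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / qmax            # step size between adjacent levels
    zero_point = np.round(-w_min / scale)     # integer code that maps back to 0.0
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Approximately reconstruct the original float weights."""
    return scale * (q.astype(np.float32) - zero_point)

# Quantize a random "weight matrix" to 4 bits and measure the error it introduces.
w = np.random.randn(256, 256).astype(np.float32)
q, scale, zp = quantize(w, bits=4)
print("mean absolute quantization error:", np.abs(w - dequantize(q, scale, zp)).mean())

Real quantization schemes are more elaborate (per-channel or per-group statistics, calibration data, and so on), but the core idea is exactly this rounding onto a small grid of values.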
SpQR
SpQR is the first quantization method that can reach the compression ratios of other quantization methods (up to 4x) while being near-lossless. There are just a few steps to the algorithm (a rough sketch in code follows the list):
Iterate through the layers of the model and quantize the weights by converting them to a lower-bit representation.
Measure the inputs and outputs of the quantized model and the uncompressed model for each layer.
Identify the weights whose direct quantization results in an outsized impact on layer output behavior. These weights are considered outliers.
Finally, convert most of the weights (at least 99%) to a low-bitwidth representation and store the outliers identified in the previous step separately.
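As a rough illustration of steps 1–3, here is a simplified per-layer sketch with hypothetical helper names; it reuses the quantize/dequantize helpers from the earlier snippet and is not the authors’ implementation:

def layer_sensitivity(w, x_calib, bits=4):
    """Estimate how much quantizing each weight perturbs this layer's output.

    w       : (out_features, in_features) weight matrix of a linear layer
    x_calib : (n_samples, in_features) calibration inputs captured for this layer
    """
    q, scale, zp = quantize(w, bits=bits)
    w_hat = dequantize(q, scale, zp)

    # Step 2: compare the layer's outputs before and after quantization.
    y_ref = x_calib @ w.T
    y_quant = x_calib @ w_hat.T
    layer_error = np.mean((y_ref - y_quant) ** 2)

    # Step 3: a per-weight proxy for impact on the output -- each weight's own
    # quantization error, weighted by how strongly its input feature fires on
    # the calibration data. The largest entries are the outlier candidates.
    input_scale = np.sqrt((x_calib ** 2).mean(axis=0))       # (in_features,)
    sensitivity = np.abs(w - w_hat) * input_scale[None, :]   # (out, in)
    return layer_error, sensitivity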
Why the special treatment of outliers? Well, you may have guessed it: the authors found that in some cases 1% of the “outlier weights” account for over 75% of the overall error introduced by quantization. Because these weights lead to high, irreducible error, they are simply kept in their high-precision (16-bit) representation. And since there are so few outliers, the result is a similar level of compression with a much smaller hit to accuracy.
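Here is what that mixed representation can look like in simplified form. This is a sketch under assumptions, not the paper’s exact format (SpQR additionally uses small quantization groups and compresses the quantization statistics themselves):

def split_outliers(w, sensitivity, outlier_frac=0.01, bits=4):
    """Keep the most sensitive ~1% of weights in fp16 and quantize the rest."""
    k = max(1, int(outlier_frac * w.size))
    threshold = np.partition(sensitivity.ravel(), -k)[-k]
    outlier_mask = sensitivity >= threshold

    # Dense low-bit part: zero out outliers first so their extreme values
    # don't stretch the quantization range for all the other weights.
    q, scale, zp = quantize(np.where(outlier_mask, 0.0, w), bits=bits)

    # Sparse high-precision part: store only (row, col, value) for outliers.
    rows, cols = np.nonzero(outlier_mask)
    outliers = (rows, cols, w[rows, cols].astype(np.float16))
    return q, scale, zp, outliers

def reconstruct(q, scale, zp, outliers):
    """Dequantize the dense low-bit part, then add the fp16 outliers back in."""
    w_hat = dequantize(q, scale, zp)
    rows, cols, vals = outliers
    w_hat[rows, cols] = vals.astype(np.float32)
    return w_hat

Because the outliers are so sparse, the extra fp16 storage is negligible, while the bulk of the matrix stays in a very low-bit dense format.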
As mentioned, you can check out the GitHub repo here and apply SpQR to your own LLM. Here’s a code snippet showing how you could run the near-lossless quantization algorithm on a LLaMA model:
export MODEL_PATH=<INSERT PATH_TO_MODEL_DIR>
export PAJAMAS_PATH=<INSERT PATH TO PAJAMAS DIR>
python main.py $MODEL_PATH custom \
--load_from_saved=$PAJAMAS_PATH \
--wbits 4 \
--groupsize 16 \
--perchannel \
--qq_scale_bits 3 \
--qq_zero_bits 3 \
--qq_groupsize 16 \
--outlier_threshold=0.2 \
--permutation_order act_order \
--percdamp 1e0 \
--nsamples 128
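Roughly what these flags mean (see the repo’s README for the authoritative descriptions): --wbits is the base bit-width for the quantized weights, --groupsize is how many weights share quantization statistics, the --qq_* flags control the second-level quantization of those statistics (scales and zero points) themselves, --outlier_threshold governs how aggressively weights are flagged as 16-bit outliers, and --nsamples is the number of calibration sequences, drawn here from the data that PAJAMAS_PATH points to.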