LLM Compressor recently released version 0.7.0, which introduces a range of enhancements that improve the quantization and deployment of large language models. This release features three notable additions:
- QuIP and SpinQuant-style transforms
- Mixed-precision support and FP4 enhancements
- DeepSeek v3-style block quantization
1. QuIP and SpinQuant-style transforms
This release introduces two new modifiers, QuIPModifier and SpinQuantModifier.
These modifiers facilitate the injection of Hadamard-based rotations into the model’s computational graph, thereby rotating weights and activations to mitigate quantization sensitivity. Applying these transforms can minimize the effect of quantization error and enhance accuracy, particularly in cases involving low-bit weight and activation quantization.
Rotating the weight space helps even out outliers, which can improve the fidelity of post-training quantization. To accomplish this, QuIP rotates inputs into a rotated space, applies quantization, and then rotates the outputs back into the original output space to preserve correctness.
Example: Using QuIPModifier for QuIP-style transforms
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier
# Select model and load it.
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
# Configure the quantization algorithm to run.
# * apply quip-style transforms to model in order to make quantization easier
# * quantize the weights to 4 bit with a group size 128
recipe = [
    QuIPModifier(transform_type="random-hadamard", targets="Linear"),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
# Apply algorithms.
oneshot(model=model, recipe=recipe, pipeline="datafree")
For a sample model produced from the QuIPModifier example above, you can see where and how the rotations are applied to the model by looking at the transform_config in the model's config.json.
"u": {
"apply": [
{
"ignore": [
"lm_head"
],
"inverse": false,
"location": "weight_output",
"targets": [
"Linear"
]
},
{
"ignore": [
"lm_head"
],
"inverse": true,
"location": "output",
"targets": [
"Linear"
]
}
]
}
In this case, the Hadamard transform (denoted by the config group u) is applied to the layer weights, and the inverse matrix is applied to each of the layer's outputs.
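To see why this round trip is lossless, note that a normalized Hadamard matrix is orthogonal, so rotating the weight's output space and then applying the inverse rotation to the layer output composes to the identity. Below is a minimal PyTorch sketch of that algebra (an illustration only, not the compressed-tensors implementation; the tiny Sylvester-construction Hadamard is purely for demonstration):

import torch

# Build a small normalized Hadamard matrix H (orthogonal: H @ H.T == I).
h2 = torch.tensor([[1.0, 1.0], [1.0, -1.0]])
H = torch.kron(h2, h2) / 2.0  # 4x4, Sylvester construction

d = 4
W = torch.randn(d, d)   # linear weight, PyTorch convention: y = x @ W.T
x = torch.randn(1, d)

# Rotate the weight's output space ("weight_output"), then undo it on the
# layer output ("output", inverse=true). The composition is a no-op, so
# quantization can happen in the rotated space without changing the layer.
W_rot = H @ W
y_rot = x @ W_rot.T   # what the rotated (and later quantized) layer computes
y = y_rot @ H         # inverse rotation applied to the layer output
print(torch.allclose(y, x @ W.T, atol=1e-6))  # True

Because the two rotations cancel exactly, quantization error is the only difference introduced, and it is incurred in the rotated space where outliers are spread more evenly.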
Example: Using SpinQuantModifier
SpinQuant and QuaRot build upon the ideas of QuIP, applying rotations that also span activations, which allows for more efficient weight and activation quantization. In addition, many of the added rotations are "offline" rotations (known as R1 and R2), meaning they are fused directly into the model's weights prior to quantization, providing rotation without additional runtime cost. See Figure 1.

from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import SpinQuantModifier
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
# Configure the quantization algorithm to run.
# * apply spinquant transforms to model to reduce quantization loss
# * quantize the weights to 4 bit with group size 128
recipe = [
    SpinQuantModifier(rotations=["R1", "R2"], transform_type="hadamard"),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
# Apply algorithms.
oneshot(model=model, recipe=recipe, pipeline="datafree")
A similar transform_config is created for models produced using the SpinQuantModifier.
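To illustrate what fusing an "offline" rotation into the weights means, here is a minimal sketch of the underlying algebra (an illustration only, not what SpinQuantModifier literally does): an orthogonal rotation is baked once into a layer that writes to the residual stream and into a layer that reads from it, so the network's output is unchanged and no extra operation runs at inference time. The write/read weights and the QR-derived rotation below are illustrative stand-ins.

import torch

torch.manual_seed(0)
d, k = 8, 8
R, _ = torch.linalg.qr(torch.randn(d, d))  # random orthogonal matrix standing in for a Hadamard rotation

W_write = torch.randn(d, k)  # e.g., a projection that writes into the residual stream
W_read = torch.randn(k, d)   # e.g., a projection that reads from the residual stream
z = torch.randn(1, k)

# Original computation: write to the residual stream, then read from it.
h = z @ W_write.T
out = h @ W_read.T

# "Offline" fusion: bake R into both weights once, before quantization.
W_write_fused = R.T @ W_write  # the residual stream is now produced in the rotated basis
W_read_fused = W_read @ R      # and consumed from the rotated basis

h_rot = z @ W_write_fused.T
out_fused = h_rot @ W_read_fused.T
print(torch.allclose(out, out_fused, atol=1e-5))  # True: same output, no runtime rotation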
2. Mixed-precision support and FP4 enhancements
LLM Compressor v0.7.0 also brings robust mixed-precision capabilities. FP4 quantization (specifically NVFP4) for both weights and activations has now been integrated with MoEs (such as Llama 4) and non-uniform quantization schemes.
With non-uniform quantization, you can combine NVFP4 and FP8, selectively applying each scheme to specific layers for improved accuracy. This is enabled by supporting multiple compressors within a single model.
Example: Non-uniform quantization
From a sample model with both NVFP4 and FP8 quantization (FP8 targeting down_proj weights, NVFP4 targeting all other attention and MLP linear layer weights), the quantization_config of the compressed model looks like this:
"quantization_config": {
"config_groups": {
"group_0": {
"format": "nvfp4-pack-quantized",
"input_activations": {
"actorder": null,
"block_structure": null,
"dynamic": "local",
"group_size": 16,
"num_bits": 4,
"observer": "minmax",
"observer_kwargs": {},
"strategy": "tensor_group",
"symmetric": true,
"type": "float"
},
"output_activations": null,
"targets": [
"re:.*mlp.gate_proj.*",
"re:.*mlp.up_proj.*",
"re:.*self_attn.k_proj.*",
"re:.*self_attn.o_proj.*",
"re:.*self_attn.q_proj.*",
"re:.*self_attn.v_proj.*"
],
"weights": {
"actorder": null,
"block_structure": null,
"dynamic": false,
"group_size": 16,
"num_bits": 4,
"observer": "minmax",
"observer_kwargs": {},
"strategy": "tensor_group",
"symmetric": true,
"type": "float"
}
},
"group_1": {
"format": "float-quantized",
"input_activations": {
"actorder": null,
"block_structure": null,
"dynamic": true,
"group_size": null,
"num_bits": 8,
"observer": null,
"observer_kwargs": {},
"strategy": "token",
"symmetric": true,
"type": "float"
},
"output_activations": null,
"targets": [
"re:.*mlp.down_proj.*"
],
"weights": {
"actorder": null,
"block_structure": null,
"dynamic": false,
"group_size": null,
"num_bits": 8,
"observer": "minmax",
"observer_kwargs": {},
"strategy": "channel",
"symmetric": true,
"type": "float"
}
}
},
"format": "mixed-precision"
This configuration shows how you can assign different quantization formats (e.g., nvfp4-pack-quantized and float-quantized, each handled by a separate compressor in compressed-tensors) per layer group, mixing NVFP4 with FP8. This provides finer control over per-layer quantization, allowing more precise handling of layers that are especially sensitive to certain quantization types.
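For reference, a recipe that produces a configuration along these lines can be expressed with config_groups. The sketch below mirrors the field values in the quantization_config above, but it assumes the QuantizationScheme and QuantizationArgs classes from compressed-tensors, so treat the exact class and argument names as illustrative rather than authoritative:

from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme
from llmcompressor.modifiers.quantization import QuantizationModifier

# NVFP4 (4-bit float, tensor_group strategy, group size 16) for most projections...
nvfp4 = QuantizationScheme(
    targets=[
        "re:.*mlp.gate_proj.*", "re:.*mlp.up_proj.*",
        "re:.*self_attn.k_proj.*", "re:.*self_attn.o_proj.*",
        "re:.*self_attn.q_proj.*", "re:.*self_attn.v_proj.*",
    ],
    weights=QuantizationArgs(
        num_bits=4, type="float", strategy="tensor_group", group_size=16,
        symmetric=True, dynamic=False,
    ),
    input_activations=QuantizationArgs(
        num_bits=4, type="float", strategy="tensor_group", group_size=16,
        symmetric=True, dynamic="local",
    ),
)

# ...and FP8 (per-channel weights, dynamic per-token activations) for down_proj.
fp8 = QuantizationScheme(
    targets=["re:.*mlp.down_proj.*"],
    weights=QuantizationArgs(
        num_bits=8, type="float", strategy="channel", symmetric=True, dynamic=False,
    ),
    input_activations=QuantizationArgs(
        num_bits=8, type="float", strategy="token", symmetric=True, dynamic=True,
    ),
)

recipe = QuantizationModifier(
    config_groups={"group_0": nvfp4, "group_1": fp8},
    ignore=["lm_head"],
)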
As of vLLM v0.10.1, models with multiple compressors are directly runnable in vLLM. We can run sample evaluations using lm-evaluation-harness on the mixed-precision model above, comparing it with its NVFP4-only counterpart.
Using the following lm-eval command on a single B200 GPU for each model, we get the following results:
lm_eval \
--model vllm \
--model_args pretrained=model_path,dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True,enforce_eager=True \
--tasks gsm8k_llama \
--apply_chat_template \
--fewshot_as_multiturn \
--batch_size auto
NVFP4 only:
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_llama| 3|flexible_extract| 8|exact_match|↑ |0.7278|± |0.0123|
| | |strict_match | 8|exact_match|↑ |0.6285|± |0.0133|
NVFP4 with FP8 down_proj:
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_llama| 3|flexible_extract| 8|exact_match|↑ |0.7536|± |0.0119|
| | |strict_match | 8|exact_match|↑ |0.6914|± |0.0127|
3. DeepSeek v3-style block quantization
Another notable addition is block-wise quantization inspired by DeepSeek v3. This method enables more efficient model compression without needing a calibration dataset. Block quantization partitions weights into blocks and quantizes each independently, minimizing the influence of outliers while preserving accuracy.
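Conceptually, each block of the weight matrix gets its own quantization scale, so an outlier only perturbs the block it lives in rather than an entire tensor or channel. The following toy sketch illustrates the idea in PyTorch, assuming 128 x 128 blocks (the DeepSeek v3 convention) and the FP8 E4M3 range; it is not the FP8_BLOCK implementation itself:

import torch

BLOCK = 128        # assumed block size, following the DeepSeek v3 convention
FP8_MAX = 448.0    # largest magnitude representable in FP8 E4M3

W = torch.randn(1024, 1024)
W_dequant = torch.empty_like(W)

for i in range(0, W.shape[0], BLOCK):
    for j in range(0, W.shape[1], BLOCK):
        block = W[i:i + BLOCK, j:j + BLOCK]
        scale = block.abs().max() / FP8_MAX              # one scale per block
        q = (block / scale).to(torch.float8_e4m3fn)      # quantize the block to FP8
        W_dequant[i:i + BLOCK, j:j + BLOCK] = q.to(W.dtype) * scale

print((W - W_dequant).abs().max())  # per-block quantization error stays small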
Example: Specify a recipe with FP8 block quantization for Qwen/Qwen3-30B-A3B
from llmcompressor.modifiers.quantization import QuantizationModifier
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",
    ignore=["lm_head", "re:.*mlp.gate$"],
)
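The recipe can then be applied in the same way as the earlier examples; the snippet below simply mirrors the data-free oneshot pattern shown above:

from transformers import AutoModelForCausalLM
from llmcompressor import oneshot

MODEL_ID = "Qwen/Qwen3-30B-A3B"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# Block quantization needs no calibration data, so the data-free pipeline applies.
oneshot(model=model, recipe=recipe, pipeline="datafree")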
Summary of key new features in LLM Compressor 0.7.0
- Transforms: QuIPModifier and SpinQuantModifier apply Hadamard rotations to reduce quantization error.
- Mixed-precision support: FP4 quantization with MoE and non-uniform quantization support.
- Block quantization: DeepSeek v3-style block-wise quantization for calibration-free, efficient compression.
Conclusion
LLM Compressor bridges the gap between fine-tuning and production with robust support for quantization, sparsity, calibration, and seamless integration with vLLM. Whether you’re optimizing for cost, latency, or innovation, LLM Compressor is the foundation for next-generation AI inference.
To get started, explore our Hugging Face collection, contribute to the GitHub repo, and join the community of developers and researchers advancing efficient AI.