Generative AI (gen AI) models are top of mind for many organizations and individuals looking to increase productivity, gain new insights, and build new kinds of interactive services. Many organizations have delegated running these models to third-party cloud services, seeking to get a head start by building on top of the capabilities those services offer. Some organizations face regulatory pressure that makes that difficult, though, and many others are finding that the costs of those hosted generative AI models are not always in line with the value they get from them.
At the same time, the open source gen AI model landscape continues to evolve at a blistering pace. Model architectures come into and fall out of vogue, training methodologies are being implemented faster than the papers can be published, and the relentless march toward higher parameter counts continues in pursuit of every last bit of performance. Organizations that repatriate their generative workloads onto hardware accelerators they’ve invested in are responsible for deploying and running these new models on their own platforms.
It wasn’t long after the Kimi K2 large language model (LLM) started making waves that Qwen3-235B-A22B-Instruct-2507 erupted onto the scene with major improvements over the original Qwen3 line for non-thinking tasks, convincingly beating K2 in nearly every benchmark. K2 was released with FP8 parameters (8-bit floating point numbers), while the Qwen3 release uses FP16, the more standard 16 bits per parameter. So even though it has fewer parameters, the Qwen3 model is larger.
On top of that, its context window is twice the size. That affects how much memory it takes to run the model efficiently with a KV cache accelerating inference, which most use cases want. It can be difficult to take a fixed pool of hardware in the data center and produce the same quantity of output with a larger model, because more of that hardware goes to the larger weights and the (useful!) longer context window.
This two-part blog series dives into what it takes to optimize these models on enterprise-grade hardware and get them to perform how we need them to, without sacrificing the accuracy gains they offer. In this post, we’re going to explore using the tools at their lowest level, with code, open source tools, and the CLI on a single server. In a future post, we’ll look at doing this work at scale in a repeatable way using Red Hat’s AI portfolio.
What is quantization?
Quantization is the act of taking the parameters that make up a model and reinterpreting some or all of them using a different data type, typically one that takes up less VRAM on your GPUs. The FP16 data type most commonly used for LLM releases takes up two bytes per parameter. An 8-billion parameter (8B) model therefore takes around 15 GiB of VRAM (8 billion times 2 bytes) just to load the weights. If we quantize that model down to FP8 parameters, it would only take around 7.5 GiB of VRAM to load, leaving an extra 7.5 GiB for the aforementioned KV cache and allowing queries against the model to be better parallelized at the same time.
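As a quick sanity check, here’s a minimal sketch of that weights-only math in Python (ignoring KV cache and runtime overhead):

# Weights-only VRAM footprint for an 8-billion-parameter model.
params = 8e9

for dtype, bytes_per_param in [("FP16", 2), ("FP8", 1), ("INT8", 1)]:
    gib = params * bytes_per_param / 2**30
    print(f"{dtype}: ~{gib:.1f} GiB just for the weights")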
Quantization therefore has two main benefits:
- Reduced memory footprint required to load the model.
- Improved performance (Time-to-First-Token, or TTFT, and overall throughput Tokens-Per-Second, or TPS, mostly for output tokens).
It’s important to note, though, that not all accelerators have the appropriate hardware to fully accelerate working with certain data types. You shouldn’t use a quantized model of a data type that your accelerator doesn’t natively support. For example, the FP8 data type for Kimi K2 requires a relatively new GPU, such as the Hopper generation from NVIDIA.
Quantizing the weights necessarily reduces their precision, and that extra precision was likely leveraged during training to give the model its accuracy and quality. Nuance is important!
The tradeoff, then, is to get the performance we need, using a data type we can accelerate, without sacrificing too much of the accuracy of the original model. For example, all quantized models in the Red Hat AI Hugging Face repo have been quantized using the LLM Compressor library and recover to greater than 99% of their baseline accuracy.
Introducing LLM Compressor
LLM Compressor, now part of the vLLM project, is an easy-to-use open source library for optimizing models, specifically designed to ensure that the models can be served with vLLM using the compressed-tensors file format. It includes a comprehensive set of quantization algorithms for both weight-and-activation and weight-only quantization, it’s seamlessly integrated with Hugging Face models and repositories, and it works with all models, including the very large ones.
Every product in the Red Hat AI portfolio includes an enterprise-grade, supported distribution of vLLM via Red Hat AI Inference Server. Red Hat AI Inference Server is tested against our validated third-party generative AI model repository. Many Red Hat AI customers like using the Granite family of models, which have been fully tested for certain use cases, such as model customization with InstructLab.
Additionally, Red Hat’s Open Source Assurance policy ensures that the use of the models included with our AI products provides indemnification from Red Hat against intellectual property infringement. As of the time of writing, this means using IBM Granite 3.1.
Working with the new Granite 3.3 model
Red Hat AI’s validated third-party generative AI model repository includes a number of popular AI models in their originally released format, but it also includes a number of quantized models in different data types with published information on the benchmark accuracy loss against the original model. These models, including the INT8 (8-bit integer) quantized version of Granite 3.1 8B, were all created with LLM Compressor. The LLM Compressor recipes used to create these models are published alongside the model in the Hugging Face Repository.
Granite 3.3 was formally released by IBM around the beginning of April. It’s not yet part of the validated third-party model repository, nor has Red Hat yet published a product that claims official support for Granite 3.3. That doesn’t mean you can’t use it with Red Hat’s AI portfolio, simply that it’s not (yet) given the same assurances that Granite 3.1 has enjoyed, nor is it available in any semi-official capacity in pre-quantized form.
Granite 3.3 does not feature a radically different architecture from Granite 3.1, but it does offer significantly higher performance in a number of benchmarks. It also offers first-class support for Fill-in-the-Middle (FIM) and other features that make it desirable for several use cases that Granite 3.1 wasn’t as capable in. You can compare Granite 3.3 8B Instruct to the earlier variant using the published benchmark numbers here.
So, given Red Hat’s published recipe for quantizing Granite 3.1 8B, the open source and capable Granite 3.3 8B, and open source tools like LLM Compressor (which have productized versions making their way into the Red Hat AI portfolio), let’s dig in and make Granite 3.3 perform better.
Optimization environment
The environment we’ll be using for screenshots and benchmarking is an EC2 instance on the AWS cloud. We’re using the g6.2xlarge instance type, which has an NVIDIA L4 GPU with 24 GiB of VRAM and comparable FP8 and INT8 throughput, both scaling roughly linearly from its FP16 performance (not including any overhead for conversion between types). I’ve spun my instance up in us-west-2, the Oregon region. It comes with 8 vCPUs and 32 GiB of memory and a base on-demand hourly rate of $0.9776 at the time of writing.
Because this is an economical instance, I’m constrained on resources a bit. I’m using the latest generally available version of RHEL 9, 9.6, and have installed the NVIDIA GPU drivers by following the NVIDIA documentation for RHEL 9 using the compute-only Open Kernel Modules instructions.
All of the commands we’re going to run from this point have been scripted out, and those scripts have been checked into a GitHub repository that you can view and follow along with here.
Running a model
First, we need a Python virtual environment with the libraries we are going to be using. I’m going to be using the uv package manager, but you should be able to modify the commands to use other Python packaging tools. Ensure a requirements.txt file exists with the following contents:
--extra-index-url https://download.pytorch.org/whl/cu128
datasets==4.0.0
llmcompressor==0.6.0
transformers==4.52.0
vllm==0.9.2
lm-eval[vllm]==0.4.9
guidellm==0.2.1
Next, create a virtual environment with Python 3.12 and install these Python packages:
uv venv --system-site-packages --python 3.12 --seed .venv
source .venv/bin/activate
uv pip install --index-strategy unsafe-best-match -r requirements.txt
https://asciinema.org/a/ceOglvP9T4NuZGDjLhqDMA9id
I’m including system-site-packages to make it simpler to access certain NVIDIA libraries for my GPU that came with the drivers.
Next, let’s go ahead and serve the Granite 3.3 8B Instruct model. In my case, I’m overriding the HF_HOME variable to keep the model downloads local to my directory, instead of my user’s home directory, and to have good control and visibility over which token I’m using:
export HF_HOME="${PWD}/huggingface"
vllm serve ibm-granite/granite-3.3-8b-instruct
https://asciinema.org/a/OYn8oYXJVzTD6QPp5Jj8ut6uN
After downloading the model weights, vLLM tries to load the model and fails, saying that I don’t have enough memory to populate the KV Cache:
ValueError: To serve at least one request with the models's max seq len (131072), (20.00 GiB KV cache is needed, which is larger than the available KV cache memory (4.06 GiB). Based on the available memory, the estimated maximum model length is 26592. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
This is because the KV cache is sized based on the maximum context window, and the default context window for Granite 3.3 is 128K tokens (which is pretty big). A quick back-of-the-envelope calculation, sketched below, shows where that 20 GiB figure comes from.
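This is a minimal sketch, assuming the architecture values from the model’s configuration: 40 transformer layers, 8 KV heads (grouped-query attention), a head dimension of 128, and 2 bytes per cached value at FP16:

# Rough KV cache sizing: a key and a value for every layer, KV head, and head
# dimension, for every token in the context window.
layers, kv_heads, head_dim, dtype_bytes = 40, 8, 128, 2
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes   # ~160 KiB per token

for context in (131_072, 16_384):
    print(f"{context:>7} tokens -> {context * bytes_per_token / 2**30:.1f} GiB of KV cache")

At the full 131,072-token window, that works out to the 20 GiB vLLM asked for; capping it at 16,384 tokens needs only about 2.5 GiB. Let’s add an argument to vLLM to reduce the context window size: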
vllm serve --max-model-len=16384 ibm-granite/granite-3.3-8b-instruct
https://asciinema.org/a/qa3XKhy7ZjohzJHbqMSAcULYY
We don’t have to download the model a second time, and now there’s enough VRAM to load the model on my NVIDIA L4 GPU. Let’s use another terminal window to make sure that it works like we expect:
curl http://localhost:8000/v1/models | jq .
curl -H 'Content-Type: application/json' http://localhost:8000/v1/chat/completions -d '{"model": "ibm-granite/granite-3.3-8b-instruct", "messages": [{"role": "system", "content": "You are a helpful assistant named Steve."},{"role": "user", "content": "Hey! My name is James. What is yours?"}]}' | jq .
https://asciinema.org/a/Xv1J2KB6TMSIsUHrzHoz5SS0g
And it looks good to me! Let’s get started quantizing this model with LLM Compressor. Stop the model server with Ctrl+C.
Quantizing Granite 3.3 8B Instruct
As mentioned, we’re going to take the script that was used to quantize Granite 3.1 to an INT8 data type and modify it a bit. You aren’t required to change the included script at all; you could just call it with different arguments, but for simplicity I’ve modified it to be more or less hard-coded to this purpose.
Ensure that a file named quantize.py exists with the same content as in the repository, and let’s let it run. The advanced quantization algorithm will take a while to quantize the model, so feel free to let it run for about an hour while you work on something else.
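For a rough idea of what that script does, here’s a condensed, hypothetical sketch in the style of LLM Compressor’s published W8A8 recipes; the real script in the repository prepares a proper calibration dataset and handles more options, and exact import paths and arguments can differ between llmcompressor versions:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "ibm-granite/granite-3.3-8b-instruct"
SAVE_DIR = "quantized/granite-3.3-8b-instruct-quantized.w8a8"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# SmoothQuant migrates activation outliers into the weights so both quantize
# cleanly to INT8; GPTQ then quantizes the Linear layers, leaving lm_head alone.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# Calibrate on a small sample set and apply the recipe in one pass. The dataset
# name here is a stand-in; the published recipe uses its own calibration data.
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

When the script is in place, kick it off: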
python ./quantize.py
Start: https://asciinema.org/a/eZwEXVRWHrthUBypD4jgxdcKI
End: https://asciinema.org/a/H3bh3PkHhCTgvJUV7VlmLfK38
When this is complete, the quantized directory should have your new model in it! If you look at the relative sizes of the original model and this quantized one, you’ll find that it’s about half the size—which makes sense, considering the data type of the parameters.
du -sh huggingface/hub/models--ibm-granite--granite-3.3-8b-instruct/
du -sh quantized/granite-3.3-8b-instruct-quantized.w8a8
Now, we need to make sure that our compressed model is able to perform up to par with the original model. If we’ve significantly affected its performance, it won’t be very useful to us.
Evaluating LLM accuracy loss
lm-evaluation-harness, or lm_eval, is a popular open source tool for performing standardized accuracy evaluations of large language models. There is a whole suite of benchmarks available, but to keep things simple today we’re only going to focus on GSM8K.
GSM8K is a dataset of over 8,000 grade-school math questions and answers, designed to test a model’s ability to solve basic mathematical problems that require multistep reasoning. We’re not using a reasoning model for this test, so it probably won’t excel, but it’s still useful for checking whether any ability to perform this task was lost in our quantization process.
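To give a feel for the task, here is an illustrative record in GSM8K’s question/answer format (made up for this post, not an actual row from the dataset):

example = {
    "question": "A bakery sells 12 muffins per tray. It bakes 5 trays and sells "
                "all but 7 muffins. How many muffins does it sell?",
    # GSM8K answers contain the worked reasoning, with the final numeric answer
    # after a '####' marker that lm_eval's exact_match filters look for.
    "answer": "5 trays * 12 muffins = 60 muffins. 60 - 7 = 53 muffins sold. #### 53",
}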
lm_eval expects to manage the model server on its own, so remember to stop any running copies of vLLM, as we’ll need the GPU for this evaluation. Let’s get it started!
lm_eval --model vllm --model_args pretrained=ibm-granite/granite-3.3-8b-instruct,max_model_len=16384 --tasks gsm8k --num_fewshot 5 --batch_size auto
This command spins up the original FP16 model using the same adjustment to the context window we provided before. It requests that the GSM8K benchmark be run, uses a small five-question few-shot prompt as examples, and puts lm_eval in the driver’s seat for deciding how large the batch size for the evaluation questions should be.
vllm (pretrained=ibm-granite/granite-3.3-8b-instruct,max_model_len=16384), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
Looking over our results (warning: due to randomness, your mileage may vary), we see that our non-reasoning model did, in fact, not perform especially well on these reasoning problems, with a strict-match score of 67.1%. It’s not bad, though! That’s a useful score for measuring accuracy loss: it sits near the middle of the grading scale, well away from 0% and 100%, so we should get a good signal on how well or poorly our quantized version does.
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.7233|± |0.0123|
| | |strict-match | 5|exact_match|↑ |0.6710|± |0.0129|
So, with that baseline in hand, let’s test the quantized model:
lm_eval --model vllm --model_args pretrained=./quantized/granite-3.3-8b-instruct-quantized.w8a8,max_model_len=16384 --tasks gsm8k --num_fewshot 5 --batch_size auto
Note that we’re using the same context window size for this smaller model. We probably don’t have to (more on that in a bit), but for the accuracy loss evaluation it’s important to use exactly the same settings. If our smaller model performs better when given different settings, we haven’t really isolated the loss due to the changes in the model weights. Keeping the comparison apples-to-apples will help us make rational decisions about these models later.
vllm (pretrained=./quantized/granite-3.3-8b-instruct-quantized.w8a8,max_model_len=16384), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.7195|± |0.0124|
| | |strict-match | 5|exact_match|↑ |0.6672|± |0.0130|
As a bit of foreshadowing, it looks like this run finished faster than the FP16 model’s did. It’s not unusual for a single evaluation pass to show higher scores for a quantized model than for the original; here we landed just within the reported standard error at 66.72%, for an overall accuracy recovery of 99.43%. When we quantize models, some evaluation results will go up and others will go down. We have meaningfully changed the way that computation works in this model, and it will therefore behave differently than the original.
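For reference, the recovery number is just the ratio of the quantized model’s strict-match score to the baseline’s:

baseline_strict, quantized_strict = 0.6710, 0.6672
print(f"Recovery: {quantized_strict / baseline_strict:.2%}")   # ~99.43%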
It’s important to get a more holistic view of the model’s performance than a single evaluation provides, which we’ll make more approachable in part 2 of this series. However, you can see an example of Red Hat doing just that in the accuracy evaluations we publish alongside the models we have quantized here.
Quantized model performance
There are a number of reasons to consider quantization. Maybe you need the same amount of hardware to serve a given model faster, or for more concurrent users. Maybe your use case demands a larger context window than you can serve efficiently on a set of given hardware. Maybe you can’t run the model at all without quantizing it.
Let’s explore the first of those reasons using the GuideLLM open source benchmarking tool. GuideLLM runs a host of demanding benchmarks against an OpenAI-compatible API endpoint at different concurrencies and measures important performance statistics, such as Time To First Token (TTFT) and maximum overall throughput in Tokens Per Second (TPS).
Given a streaming client, TTFT affects how responsive a model feels. TPS affects how fast an LLM feels once it’s started, but depending on the concurrency at which you measure it, it might not reflect the speed for a single user. There are changes you can make to vLLM, such as the processing batch size, that tune these numbers up or down at different concurrency levels, but we’re not going to dig that deep today.
GuideLLM is designed not just to see how fast a model endpoint can serve one request, but how well it holds up while serving many requests at once, and it reports those numbers at various rate steps until the endpoint tops out. We’re mostly going to use default settings on the model server and GuideLLM, and look at the highest overall token throughput as well as the lowest TTFT.
In one terminal, make sure you have the virtual environment activated and start the FP16 model:
vllm serve --max-model-len=16384 ibm-granite/granite-3.3-8b-instruct
In another terminal, again make sure that the virtual environment is activated and make sure that the model server has finished loading and is serving before running a GuideLLM benchmark:
while ! curl -L http://localhost:8000/v1/models >/dev/null 2>&1; do sleep 1; done
guidellm benchmark --target http://localhost:8000 \
--rate-type sweep --max-seconds 30 \
--output-path fp16.json \
--data prompt_tokens=256,output_tokens=128
When the benchmarking has finished, the detailed results are saved to the JSON file we specified on the command line:
Benchmarks Info:
===================================================================================================================================================
Metadata |||| Requests Made ||| Prompt Tok/Req ||| Output Tok/Req ||| Prompt Tok Total||| Output Tok Total ||
Benchmark| Start Time| End Time| Duration (s)| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err
-------------|-----------|---------|-------------|------|-----|-----|------|------|----|------|------|----|------|-------|----|------|-------|-----
synchronous| 20:46:21| 20:46:51| 30.0| 3| 1| 0| 256.0| 256.0| 0.0| 128.0| 72.0| 0.0| 768| 256| 0| 384| 72| 0
throughput| 20:46:51| 20:47:21| 30.0| 74| 512| 0| 256.2| 256.0| 0.0| 128.0| 112.9| 0.0| 18961| 131072| 0| 9472| 57808| 0
constant@0.41| 20:47:28| 20:47:56| 27.6| 8| 4| 0| 256.1| 256.0| 0.0| 128.0| 65.0| 0.0| 2049| 1024| 0| 1024| 260| 0
constant@0.71| 20:47:57| 20:48:26| 28.6| 14| 7| 0| 256.4| 256.0| 0.0| 128.0| 63.6| 0.0| 3589| 1792| 0| 1792| 445| 0
constant@1.00| 20:48:26| 20:48:56| 30.0| 21| 10| 0| 256.2| 256.0| 0.0| 128.0| 71.1| 0.0| 5381| 2560| 0| 2688| 711| 0
constant@1.30| 20:48:57| 20:49:27| 30.0| 26| 13| 0| 256.3| 256.0| 0.0| 128.0| 65.6| 0.0| 6664| 3328| 0| 3328| 853| 0
constant@1.59| 20:49:27| 20:49:57| 30.0| 30| 18| 0| 256.1| 256.0| 0.0| 128.0| 62.7| 0.0| 7683| 4608| 0| 3840| 1128| 0
constant@1.89| 20:49:57| 20:50:27| 30.0| 34| 23| 0| 256.4| 256.0| 0.0| 128.0| 63.5| 0.0| 8717| 5888| 0| 4352| 1461| 0
constant@2.18| 20:50:27| 20:50:57| 30.0| 40| 27| 0| 256.2| 256.0| 0.0| 128.0| 66.8| 0.0| 10250| 6912| 0| 5120| 1804| 0
constant@2.47| 20:50:58| 20:51:28| 30.0| 43| 33| 0| 256.3| 256.0| 0.0| 128.0| 65.7| 0.0| 11020| 8448| 0| 5504| 2169| 0
===================================================================================================================================================
You’re also presented with a summary of the tests in a table:
Benchmarks Stats:
==========================================================================================================================================================
Metadata | Request Stats || Out Tok/sec| Tot Tok/sec| Req Latency (ms) ||| TTFT (ms) ||| ITL (ms) ||| TPOT (ms) ||
Benchmark| Per Second| Concurrency| mean| mean| mean| median| p99| mean| median| p99| mean| median| p99| mean| median| p99
-------------|-----------|------------|------------|------------|------|-------|------|-------|-------|-------|------|-------|------|------|-------|------
synchronous| 0.12| 1.00| 15.2| 75.9| 8.41| 8.42| 8.42| 104.8| 106.2| 106.7| 65.4| 65.4| 65.4| 64.9| 64.9| 64.9
throughput| 2.47| 54.73| 316.6| 1582.0| 22.12| 21.68| 29.80| 3236.5| 3047.1| 6169.8| 148.7| 146.7| 186.1| 147.5| 145.6| 184.7
constant@0.41| 0.31| 2.76| 39.6| 197.7| 8.91| 8.93| 8.99| 120.3| 114.7| 164.8| 69.2| 69.2| 69.8| 68.6| 68.7| 69.2
constant@0.71| 0.51| 4.62| 65.1| 325.4| 9.08| 9.13| 9.18| 121.8| 120.3| 193.4| 70.5| 71.0| 71.5| 70.0| 70.4| 71.0
constant@1.00| 0.70| 6.75| 89.8| 448.9| 9.61| 9.62| 9.96| 149.8| 153.4| 216.0| 74.5| 74.6| 76.7| 73.9| 74.0| 76.1
constant@1.30| 0.88| 9.00| 112.3| 561.2| 10.26| 10.30| 10.35| 190.9| 194.2| 218.4| 79.3| 79.7| 79.8| 78.6| 79.1| 79.1
constant@1.59| 1.00| 11.31| 128.1| 639.8| 11.29| 11.49| 11.74| 203.9| 211.9| 237.4| 87.3| 89.1| 90.6| 86.6| 88.4| 89.9
constant@1.89| 1.15| 13.62| 146.7| 733.4| 11.88| 12.10| 12.21| 196.4| 192.1| 241.5| 92.0| 93.8| 94.3| 91.3| 93.1| 93.5
constant@2.18| 1.34| 16.25| 171.6| 857.6| 12.12| 12.33| 12.43| 192.1| 185.4| 264.1| 93.9| 95.6| 96.2| 93.2| 94.8| 95.5
constant@2.47| 1.44| 18.49| 183.8| 918.7| 12.87| 13.01| 13.36| 208.2| 214.9| 264.6| 99.7| 101.0| 103.3| 98.9| 100.3| 102.5
==========================================================================================================================================================
When that wraps up, use Ctrl+C to stop the running vLLM instance. Now let’s start up a model server with the quantized model:
vllm serve --max-model-len=16384 --served-model-name=granite-3.3-8b-instruct-quantized.w8a8 ./quantized/granite-3.3-8b-instruct-quantized.w8a8
Then, back to the other terminal, let’s run the same benchmark with a different output file:
while ! curl -L http://localhost:8000/v1/models >/dev/null 2>&1; do sleep 1; done
guidellm benchmark --target http://localhost:8000 \
--rate-type sweep --max-seconds 30 \
--output-path int8.json \
--processor ./quantized/granite-3.3-8b-instruct-quantized.w8a8 \
--data prompt_tokens=256,output_tokens=128
And our results for this model look pretty good:
Benchmarks Info:
===================================================================================================================================================
Metadata |||| Requests Made ||| Prompt Tok/Req ||| Output Tok/Req ||| Prompt Tok Total||| Output Tok Total ||
Benchmark| Start Time| End Time| Duration (s)| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err| Comp| Inc| Err
-------------|-----------|---------|-------------|------|-----|-----|------|------|----|------|------|----|------|-------|----|------|-------|-----
synchronous| 20:57:15| 20:57:45| 30.0| 5| 1| 0| 256.0| 256.0| 0.0| 128.0| 115.0| 0.0| 1280| 256| 0| 640| 115| 0
throughput| 20:57:46| 20:58:16| 30.0| 64| 512| 0| 256.2| 256.0| 0.0| 128.0| 117.1| 0.0| 16397| 131072| 0| 8192| 59969| 0
constant@0.44| 20:58:22| 20:58:50| 27.7| 10| 3| 0| 256.1| 256.0| 0.0| 128.0| 68.0| 0.0| 2561| 768| 0| 1280| 204| 0
constant@0.68| 20:58:52| 20:59:20| 28.5| 16| 4| 0| 256.4| 256.0| 0.0| 128.0| 70.8| 0.0| 4103| 1024| 0| 2048| 283| 0
constant@0.93| 20:59:21| 20:59:50| 28.9| 22| 5| 0| 256.3| 256.0| 0.0| 128.0| 72.8| 0.0| 5638| 1280| 0| 2816| 364| 0
constant@1.17| 20:59:51| 21:00:21| 30.0| 29| 7| 0| 256.2| 256.0| 0.0| 128.0| 63.0| 0.0| 7431| 1792| 0| 3712| 441| 0
constant@1.42| 21:00:21| 21:00:51| 30.0| 35| 8| 0| 256.1| 256.0| 0.0| 128.0| 65.0| 0.0| 8964| 2048| 0| 4480| 520| 0
constant@1.66| 21:00:51| 21:01:21| 30.0| 40| 10| 0| 256.4| 256.0| 0.0| 128.0| 67.0| 0.0| 10256| 2560| 0| 5120| 670| 0
constant@1.90| 21:01:21| 21:01:51| 30.0| 46| 12| 0| 256.2| 256.0| 0.0| 128.0| 70.4| 0.0| 11786| 3072| 0| 5888| 845| 0
constant@2.15| 21:01:52| 21:02:22| 30.0| 52| 14| 0| 256.3| 256.0| 0.0| 128.0| 63.6| 0.0| 13327| 3584| 0| 6656| 890| 0
===================================================================================================================================================
Summary table:
Benchmarks Stats:
==========================================================================================================================================================
Metadata | Request Stats || Out Tok/sec| Tot Tok/sec| Req Latency (ms) ||| TTFT (ms) ||| ITL (ms) ||| TPOT (ms) ||
Benchmark| Per Second| Concurrency| mean| mean| mean| median| p99| mean| median| p99| mean| median| p99| mean| median| p99
-------------|-----------|------------|------------|------------|------|-------|------|-------|-------|-------|------|-------|------|------|-------|------
synchronous| 0.20| 1.00| 25.2| 125.8| 5.08| 5.08| 5.08| 62.4| 62.1| 64.6| 39.5| 39.5| 39.5| 39.2| 39.2| 39.2
throughput| 2.15| 60.21| 274.8| 1372.9| 28.04| 27.81| 29.75| 1820.7| 1622.6| 3227.8| 206.5| 206.2| 208.9| 204.8| 204.6| 207.2
constant@0.44| 0.39| 2.03| 49.9| 249.4| 5.19| 5.19| 5.22| 69.1| 70.3| 84.5| 40.3| 40.3| 40.5| 40.0| 40.0| 40.2
constant@0.68| 0.59| 3.10| 75.3| 376.6| 5.26| 5.27| 5.30| 67.5| 65.9| 86.3| 40.9| 40.9| 41.0| 40.6| 40.6| 40.7
constant@0.93| 0.79| 4.18| 100.7| 503.5| 5.31| 5.31| 5.32| 52.9| 50.6| 78.4| 41.4| 41.4| 41.5| 41.0| 41.1| 41.2
constant@1.17| 0.99| 5.33| 126.7| 633.0| 5.38| 5.39| 5.41| 68.6| 67.3| 88.9| 41.8| 41.9| 42.0| 41.5| 41.6| 41.6
constant@1.42| 1.19| 6.46| 152.0| 759.3| 5.44| 5.45| 5.48| 71.5| 74.4| 90.5| 42.3| 42.4| 42.4| 41.9| 42.0| 42.1
constant@1.66| 1.36| 8.10| 173.4| 867.0| 5.97| 6.04| 6.15| 108.2| 110.5| 140.8| 46.2| 46.7| 47.3| 45.8| 46.3| 46.9
constant@1.90| 1.54| 9.53| 197.1| 984.5| 6.19| 6.21| 6.25| 121.2| 125.3| 141.8| 47.8| 47.9| 48.1| 47.4| 47.5| 47.7
constant@2.15| 1.75| 11.10| 224.3| 1121.0| 6.33| 6.36| 6.38| 118.8| 120.8| 142.1| 48.9| 49.1| 49.2| 48.6| 48.7| 48.8
==========================================================================================================================================================
Looking at the mean synchronous TTFT for each model, we can see that the INT8 model responded with its first token about 40% quicker, and it reached a higher output throughput of about 224 TPS at around 11 concurrent requests, handling more requests per second than the FP16 model did at a similar concurrency. That means the backlog of work on the FP16 variant grew deeper, so each request took longer. The inter-token latency (ITL) and Time Per Output Token (TPOT) numbers show that to be the case here for sure.
That’s a pretty great performance benefit for a model that’s still nearly as accurate as the slower one. If it’s a reasonable way to measure some element of our workload, we should also consider warming these model servers up by running a benchmark, throwing the results away, and then running it again. The INT8 variant has more GPU memory available for KV cache, so we might see an even better result.
Expanding our model’s use cases
The second reason mentioned above for maybe wanting to quantize a model was to enable a longer context window on the same hardware. If we’re running a model for a code assistant use case, for example, it might be nice to have a longer context window to be able to include more of our code base so that autocompletion can see more function signatures or global state tracking structures.
The GPU we’re using in this environment still can’t handle the full Granite 3.3 context with the default KV cache settings and GPU utilization target, but with the INT8 model it can dedicate a lot more VRAM to KV cache than it could with the original FP16 one.
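Here’s a rough sketch of that headroom, reusing the per-token KV cache cost from earlier and treating the non-weight overhead as a loose assumption rather than a measured value:

GIB = 2**30
kv_bytes_per_token = 2 * 40 * 8 * 128 * 2        # ~160 KiB/token; the cache stays FP16
budget = 24 * GIB * 0.90                         # default gpu_memory_utilization=0.9
overhead = 2.5 * GIB                             # rough allowance for activations, etc. (assumption)

for name, weight_bytes in [("FP16", 8e9 * 2), ("INT8", 8e9 * 1)]:
    cache = budget - weight_bytes - overhead
    print(f"{name}: ~{cache / GIB:.1f} GiB of KV cache -> ~{cache / kv_bytes_per_token:,.0f} tokens")

With roughly 11 to 12 GiB of cache available at about 160 KiB per token, a 65,536-token window fits comfortably.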
You can serve the model using vLLM with this command line to see it load with 4 times the context window that we could achieve with the FP16 model:
vllm serve --max-model-len=65536 --served-model-name=granite-3.3-8b-instruct-quantized.w8a8 ./quantized/granite-3.3-8b-instruct-quantized.w8a8
Recap
Quantizing models doesn’t have to be hard or scary. Open source tools exist to not only help you do custom quantizations, but also measure their effectiveness. Red Hat and the upstream LLM Compressor community publish many recipes to make it even more approachable.
There are massive benefits to quantization, and it might unlock more performance, cheaper inference costs, or even totally new use cases that you couldn’t address before. Try it out on your own systems, and run more context through your inference engine, faster!
Ensuring you’re getting it right requires checking the results of your work, though. Join us for part 2, coming soon, to see how Red Hat AI, built upon the foundations of these free and open source tools, can enable you to perform these steps repeatedly and at scale, using clusters of accelerated compute instead of just one server, to ensure you’re getting the best model performance and capabilities you can.