LLM Quantization (GPTQ, GGUF) in Practice, with Measured Quality and Inference Performance

Author: ninehills
Labels: blog
Created: 2025-09-22T16:31:39Z
Link and comments: https://github.com/ninehills/blog/issues/143

The code used in this post is at: https://github.com/ninehills/llm-speedup

1. Environment setup

Hardware:

RTX 4090 24GB x 1
Windows 11 + WSL2
Driver Version: 581.29

Install the software environment (requires conda: https://conda-forge.org/download/)

# For users in mainland China, set a mirror first: export HF_ENDPOINT=https://hf-mirror.com
conda create -n llm-speedup python=3.12
conda activate llm-speedup

pip install "vllm==0.10.2" "sglang==0.5.2" "evalscope[perf]==1.0.1" langdetect immutabledict
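# Install llm-compressor in editable mode (assumes its source tree is already checked out at ./llm-compressor)
cd llm-compressor
pip install -e ./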

pip install "datasets<4.0.0" # pin datasets below 4.0.0 to work around an evalscope dataset loading failure

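2. Quantization

2.1 GPTQ quantization with llm-compressor

As an example, we quantize Qwen/Qwen3-4B-Instruct-2507 with GPTQ w4a16g128 (4-bit weights, 16-bit activations, group size 128). For other quantization methods (AWQ, etc.), see the llm-compressor documentation.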

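# Generate the calibration dataset from high-quality Chinese and English SFT data
python calib_data.py
# Run GPTQ quantization
python qwen3_dense_instruct_w4a16.py
# Quantization proceeds layer by layer and takes roughly 10 to 20 minutes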
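To make the scheme name concrete, here is a back-of-the-envelope sketch of how much storage w4a16g128 needs per quantized weight. It assumes symmetric quantization with one 16-bit scale per group of 128 weights and no zero point; that is an assumption about the scheme, not something read from the quantization script. The function names are made up for illustration.

```python
def effective_bits_per_weight(weight_bits: int, group_size: int,
                              scale_bits: int = 16, zero_point_bits: int = 0) -> float:
    """Average storage per weight for group-wise quantization.

    Every group of `group_size` weights shares one scale (and optionally one
    zero point), so that metadata is amortized over the group.
    """
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def compression_ratio(weight_bits: int, group_size: int, baseline_bits: int = 16) -> float:
    """Size reduction of the quantized layers relative to a 16-bit baseline."""
    return baseline_bits / effective_bits_per_weight(weight_bits, group_size)


# w4a16g128: 4-bit weights, group size 128 (assumed symmetric, fp16 scales)
print(effective_bits_per_weight(4, 128))       # 4.125
print(round(compression_ratio(4, 128), 2))     # 3.88
# w8a16 with the same group size, for comparison
print(effective_bits_per_weight(8, 128))       # 8.125
```

This only covers the layers that are actually quantized; anything left in 16-bit (for example an ignored lm_head) makes the whole-model ratio lower.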

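The calibration set consists of 1024 high-quality conversational SFT samples in mixed Chinese and English.
Based on various evaluations and our own experience, we recommend GPTQ w8a16/w4a16 quantization, which loses the least quality.
For MoE models, additionally exclude the gate (router) layers from quantization to avoid excessive quantization error.
If the quantization loss is too large, you can also exclude the first N layers from quantization.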
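The last two notes can be sketched as a helper that builds an ignore list. This is a minimal illustration, not the script used in this post: the "re:" prefix for regex entries and the module names (model.layers.N.mlp.gate and so on) are assumptions based on common llm-compressor recipes and Hugging Face Qwen-style naming, and build_ignore_patterns and is_ignored are hypothetical helpers.

```python
import re


def build_ignore_patterns(skip_first_n_layers: int = 0, is_moe: bool = False) -> list[str]:
    """Build a list of module names / regex patterns to leave unquantized."""
    patterns = ["lm_head"]  # the output head is commonly kept in 16-bit
    if is_moe:
        # Router ("gate") layers of MoE blocks; must not match expert gate_proj
        patterns.append(r"re:.*mlp\.gate$")
    if skip_first_n_layers > 0:
        layer_ids = "|".join(str(i) for i in range(skip_first_n_layers))
        patterns.append(rf"re:.*\.layers\.({layer_ids})\..*")
    return patterns


def is_ignored(module_name: str, patterns: list[str]) -> bool:
    """Local matcher to check what the patterns cover (illustration only)."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            return True
    return False


patterns = build_ignore_patterns(skip_first_n_layers=2, is_moe=True)
for name in ["model.layers.0.self_attn.q_proj",
             "model.layers.10.self_attn.q_proj",
             "model.layers.10.mlp.gate",
             "model.layers.10.mlp.experts.0.gate_proj"]:
    print(name, is_ignored(name, patterns))
```

Checking the patterns against a few module names before a 10 to 20 minute quantization run is cheap insurance against a regex that silently skips too much or too little.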

2.2 Quality before and after GPTQ quantization

# Start the bf16 inference server
vllm serve Qwen/Qwen3-4B-Instruct-2507 --max-model-len 8192 --served-model-name Qwen3-4B-Instruct-2507 --port 8080
# Evaluate on Math500 (math), IFEval (instruction following), and IQuiz (Chinese comprehension)
evalscope eval \
 --model Qwen3-4B-Instruct-2507 \
 --api-url http://127.0.0.1:8080/v1 \
 --api-key EMPTY \
 --eval-type openai_api \
 --datasets math_500 ifeval iquiz \
 --eval-batch-size 100
+------------------------+-----------+--------------------------+----------+-------+---------+---------+
| Model                  | Dataset   | Metric                   | Subset   |   Num |   Score | Cat.0   |
+========================+===========+==========================+==========+=======+=========+=========+
| Qwen3-4B-Instruct-2507 | ifeval    | mean_prompt_level_strict | default  |   541 |  0.8299 | default |
| Qwen3-4B-Instruct-2507 | ifeval    | mean_inst_level_strict   | default  |   541 |  0.8882 | default |
| Qwen3-4B-Instruct-2507 | iquiz     | mean_acc                 | OVERALL  |   120 |  0.525  | -       |
| Qwen3-4B-Instruct-2507 | math_500  | mean_acc                 | OVERALL  |   500 |  0.776  | -       |
+------------------------+-----------+--------------------------+----------+-------+---------+---------+ 

# Start the w4a16 inference server
vllm serve Qwen3-4B-Instruct-2507-W4A16-G128 --max-model-len 8192 --served-model-name Qwen3-4B-Instruct-2507-W4A16-G128 --port 8080
# Run the same evaluation
evalscope eval \
 --model Qwen3-4B-Instruct-2507-W4A16-G128 \
 --api-url http://127.0.0.1:8080/v1 \
 --api-key EMPTY \
 --eval-type openai_api \
 --datasets math_500 ifeval iquiz \
 --eval-batch-size 100
+-----------------------------------+-----------+--------------------------+----------+-------+---------+---------+
| Model                             | Dataset   | Metric                   | Subset   |   Num |   Score | Cat.0   |
+===================================+===========+==========================+==========+=======+=========+=========+
| Qwen3-4B-Instruct-2507-W4A16-G128 | ifeval    | mean_prompt_level_strict | default  |   541 |  0.8355 | default |
| Qwen3-4B-Instruct-2507-W4A16-G128 | ifeval    | mean_inst_level_strict   | default  |   541 |  0.8879 | default |
| Qwen3-4B-Instruct-2507-W4A16-G128 | iquiz     | mean_acc                 | OVERALL  |   120 |  0.5333 | -       |
| Qwen3-4B-Instruct-2507-W4A16-G128 | math_500  | mean_acc                 | OVERALL  |   500 |  0.782  | -       |
+-----------------------------------+-----------+--------------------------+----------+-------+---------+---------+ 

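Finding: after quantization, three of the four metrics are slightly higher than for the unquantized model; only ifeval mean_inst_level_strict dips marginally (0.8882 to 0.8879). We attribute this to the calibration set being high-quality SFT data and consider it normal, although gaps this small (at most 0.0083) on 120 to 541 samples may also be within run-to-run noise.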
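To put the size of these differences in perspective, the sketch below compares each score change with a rough binomial standard error. It assumes every score is the mean of n independent pass/fail samples, with n taken from the Num column; that is a simplification, especially for the instruction-level IFEval metric.

```python
import math

# metric: (num samples, bf16 score, w4a16 score), copied from the two tables above
SCORES = {
    "ifeval/prompt_level_strict": (541, 0.8299, 0.8355),
    "ifeval/inst_level_strict": (541, 0.8882, 0.8879),
    "iquiz": (120, 0.525, 0.5333),
    "math_500": (500, 0.776, 0.782),
}


def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1 - p) / n)


report = {}
for name, (n, bf16, w4a16) in SCORES.items():
    delta = w4a16 - bf16
    se = standard_error(bf16, n)
    report[name] = (delta, se)
    print(f"{name:28s} delta={delta:+.4f}  SE={se:.4f}  within 1 SE: {abs(delta) < se}")
```

Under this assumption every change is smaller than one standard error, so the numbers are consistent with GPTQ w4a16 being roughly lossless on these benchmarks, but they do not show that it is better.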

2.3 vLLM inference performance before and after GPTQ quantization

vllm serve Qwen/Qwen3-4B-Instruct-2507 --max-model-len 8192 --served-model-name Qwen3-4B-Instruct-2507 --port 8080
evalscope perf \
  --parallel 1 10 20 50 100 \
  --number 10 30 50 100 200 \
  --model Qwen3-4B-Instruct-2507 \
  --url http://127.0.0.1:8080/v1/chat/completions \
  --api openai \
  --dataset random \
  --max-tokens 1024 \
  --min-tokens 1024 \
  --prefix-length 0 \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --tokenizer-path Qwen3-4B-Instruct-2507 \
  --extra-args '{"ignore_eos": true}'

┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│    1 │ 0.09 │   11.530 │   11.588 │   88.81 │    0.050 │   0.065 │    0.011 │   0.011 │    100.0%│
│   10 │ 0.65 │   15.284 │   15.711 │  669.34 │    0.288 │   0.628 │    0.015 │   0.015 │    100.0%│
│   20 │ 0.93 │   18.492 │   20.202 │  954.49 │    0.467 │   1.304 │    0.018 │   0.019 │    100.0%│
│   50 │ 1.52 │   30.359 │   38.295 │ 1555.54 │    1.214 │   3.216 │    0.029 │   0.034 │    100.0%│
│  100 │ 1.54 │   54.048 │   75.195 │ 1579.02 │   13.821 │  39.359 │    0.039 │   0.066 │    100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘
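As a sanity check on how the columns of the table above relate to each other, the sketch below recomputes two of them. It assumes every request produces exactly 1024 output tokens (min-tokens = max-tokens = 1024 with ignore_eos), that Gen. toks/s is roughly RPS times output tokens, and that average latency is roughly TTFT plus TPOT times the remaining tokens. These relations are assumptions about how the metrics are defined, not taken from the evalscope documentation.

```python
OUTPUT_TOKENS = 1024  # fixed by --min-tokens/--max-tokens plus ignore_eos

# (concurrency, RPS, avg latency s, gen toks/s, avg TTFT s, avg TPOT s) for bf16
BF16_ROWS = [
    (1, 0.09, 11.530, 88.81, 0.050, 0.011),
    (10, 0.65, 15.284, 669.34, 0.288, 0.015),
    (20, 0.93, 18.492, 954.49, 0.467, 0.018),
    (50, 1.52, 30.359, 1555.54, 1.214, 0.029),
    (100, 1.54, 54.048, 1579.02, 13.821, 0.039),
]


def relative_error(estimate: float, reported: float) -> float:
    return abs(estimate - reported) / reported


checks = []
for conc, rps, latency, toks_per_s, ttft, tpot in BF16_ROWS:
    est_toks = rps * OUTPUT_TOKENS                 # throughput from request rate
    est_latency = ttft + tpot * (OUTPUT_TOKENS - 1)  # first token + remaining tokens
    toks_err = relative_error(est_toks, toks_per_s)
    lat_err = relative_error(est_latency, latency)
    checks.append((conc, toks_err, lat_err))
    print(f"conc={conc:3d}  toks/s err={toks_err:.1%}  latency err={lat_err:.1%}")
```

Both estimates land within a few percent of the reported values; the remaining gap is explained by RPS and TPOT being rounded to two and three decimals in the table.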

vllm serve Qwen3-4B-Instruct-2507-W4A16-G128 --max-model-len 8192 --served-model-name Qwen3-4B-Instruct-2507-W4A16-G128 --port 8080
evalscope perf \
  --parallel 1 10 20 50 100 \
  --number 10 30 50 100 200 \
  --model Qwen3-4B-Instruct-2507-W4A16-G128 \
  --url http://127.0.0.1:8080/v1/chat/completions \
  --api openai \
  --dataset random \
  --max-tokens 1024 \
  --min-tokens 1024 \
  --prefix-length 0 \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --tokenizer-path Qwen3-4B-Instruct-2507-W4A16-G128 \
  --extra-args '{"ignore_eos": true}'
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│    1 │ 0.16 │    6.150 │    9.323 │  166.50 │    0.059 │   0.068 │    0.006 │   0.009 │    100.0%│
│   10 │ 1.03 │    9.666 │   10.177 │ 1058.72 │    0.386 │   0.807 │    0.009 │   0.009 │    100.0%│
│   20 │ 1.29 │   13.762 │   15.793 │ 1316.59 │    0.528 │   1.476 │    0.013 │   0.014 │    100.0%│
│   50 │ 1.77 │   28.100 │   31.295 │ 1816.37 │    1.165 │   3.533 │    0.026 │   0.027 │    100.0%│
│  100 │ 1.76 │   50.314 │   83.056 │ 1805.55 │    7.330 │  28.528 │    0.042 │   0.074 │    100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘

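Finding: after quantization, single-user OTPS (output tokens per second, the "Gen. toks/s" column) rises by about 87% (88.81 to 166.50), but peak OTPS rises by only about 15% (1579.02 to 1816.37).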
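The speedup at each concurrency level can be computed directly from the two tables. The sketch below does only that arithmetic on the reported Gen. toks/s values; the variable names are made up for illustration.

```python
CONCURRENCY = [1, 10, 20, 50, 100]
BF16_TOKS = [88.81, 669.34, 954.49, 1555.54, 1579.02]
W4A16_TOKS = [166.50, 1058.72, 1316.59, 1816.37, 1805.55]

# Per-concurrency speedup of the GPTQ w4a16 model over bf16
speedups = {c: q / b for c, b, q in zip(CONCURRENCY, BF16_TOKS, W4A16_TOKS)}
for c, s in speedups.items():
    print(f"concurrency {c:3d}: {s:.2f}x")

# Peak-to-peak comparison, regardless of which concurrency level achieved it
peak_speedup = max(W4A16_TOKS) / max(BF16_TOKS)
print(f"peak: {peak_speedup:.2f}x")
```

The gain shrinks steadily as concurrency grows. A common explanation is that single-stream decoding is limited by reading weights from GPU memory, which 4-bit weights cut sharply, while large batches become compute-bound, where weight-only quantization helps much less.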

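2.4 GGUF imatrix quantization

For an overview of the GGUF quantization types, see: https://huggingface.co/docs/hub/en/gguf

We use IQ4_XS, a 4-bit quantization type, together with an importance matrix (imatrix). Like GPTQ, this approach relies on calibration data.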

git clone https://github.com/ggml-org/llama.cpp.git
# Install the CUDA Toolkit: https://developer.nvidia.com/cuda-toolkit-archive
# Install build dependencies
sudo apt-get install cmake curl libssl-dev libcurl4-openssl-dev
# Put CUDA on the search paths; adjust the directory to match your CUDA version
export PATH=/usr/local/cuda-12.6/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
# Build llama.cpp with CUDA (GPU) support; the binaries end up in llama.cpp/build/bin
cmake -S llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j16
# Download the model to a local directory
hf download "Qwen/Qwen3-4B-Instruct-2507" --local-dir "Qwen3-4B-Instruct-2507"
# Convert to an fp16 GGUF file
python llama.cpp/convert_hf_to_gguf.py "Qwen3-4B-Instruct-2507" --outtype f16 --outfile Qwen3-4B-Instruct-2507-f16.gguf
# Generate the importance matrix (imatrix.dat) from the calibration text
./llama.cpp/build/bin/llama-imatrix -m Qwen3-4B-Instruct-2507-f16.gguf -f calibration.txt -ngl 99 --output-frequency 10 -o imatrix.dat --parse-special
# Quantize with calibration (uses the imatrix)
./llama.cpp/build/bin/llama-quantize --leave-output-tensor --imatrix imatrix.dat Qwen3-4B-Instruct-2507-f16.gguf Qwen3-4B-Instruct-2507-iq4_xs.gguf IQ4_XS
# Quantize without calibration (no imatrix)
./llama.cpp/build/bin/llama-quantize --leave-output-tensor Qwen3-4B-Instruct-2507-f16.gguf Qwen3-4B-Instruct-2507-q4_k_m.gguf Q4_K_M

GGUF quantization quality evaluation

Measure each model's perplexity (PPL) on the wikitext-2 wiki.test set; lower is better.

# Perplexity (PPL)
./llama.cpp/scripts/get-wikitext-2.sh
./llama.cpp/build/bin/llama-perplexity -m Qwen3-4B-Instruct-2507-f16.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99
PPL = 10.5498 +/- 0.08436
./llama.cpp/build/bin/llama-perplexity -m Qwen3-4B-Instruct-2507-iq4_xs.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99
PPL = 10.7011 +/- 0.08542
./llama.cpp/build/bin/llama-perplexity -m Qwen3-4B-Instruct-2507-q4_k_m.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99
PPL = 10.7434 +/- 0.08562

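IQ4_XS is not only the smaller of the two 4-bit files, it also scores a slightly better PPL than Q4_K_M (10.7011 vs 10.7434), about 1.4% above the f16 baseline of 10.5498. Note that the gap between the two quantized models is smaller than the reported +/- error.

Next, evaluate the model's quality on real inference tasks.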
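The PPL figures above can be compared more systematically with a few lines of arithmetic. This sketch only reuses the values printed by llama-perplexity; treating the +/- figure as a rough error bar when comparing two models is an assumption.

```python
# model: (PPL, reported +/- error), copied from the llama-perplexity output above
PPL = {
    "f16": (10.5498, 0.08436),
    "iq4_xs (imatrix)": (10.7011, 0.08542),
    "q4_k_m (no imatrix)": (10.7434, 0.08562),
}

baseline, _ = PPL["f16"]
relative_increase = {}
for name, (ppl, err) in PPL.items():
    relative_increase[name] = (ppl - baseline) / baseline
    print(f"{name:20s} PPL={ppl:.4f}  vs f16: {relative_increase[name]:+.2%}")

# Is the gap between the two 4-bit variants larger than their error bars?
gap = PPL["q4_k_m (no imatrix)"][0] - PPL["iq4_xs (imatrix)"][0]
min_err = min(PPL["q4_k_m (no imatrix)"][1], PPL["iq4_xs (imatrix)"][1])
print(f"gap between 4-bit variants: {gap:.4f} (error bar about {min_err:.4f})")
```

Both 4-bit variants stay within about 2% of the f16 perplexity, and the difference between them is roughly half an error bar, so the PPL result alone is weak evidence for preferring one over the other.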

# Serve with vLLM, whose concurrent throughput is better than llama.cpp's (see the performance tests below)
vllm serve ./Qwen3-4B-Instruct-2507-iq4_xs.gguf --served-model-name Qwen3-4B-Instruct-2507-iq4_xs --max-model-len 8192 --port 8080 --tokenizer Qwen3-4B-Instruct-2507

evalscope eval \
 --model Qwen3-4B-Instruct-2507-iq4_xs \
 --api-url http://127.0.0.1:8080/v1 \
 --api-key EMPTY \
 --eval-type openai_api \
 --datasets math_500 ifeval iquiz \
 --eval-batch-size 100

+-------------------------------+-----------+--------------------------+----------+-------+---------+---------+
| Model                         | Dataset   | Metric                   | Subset   |   Num |   Score | Cat.0   |
+===============================+===========+==========================+==========+=======+=========+=========+
| Qwen3-4B-Instruct-2507-iq4_xs | ifeval    | mean_prompt_level_strict | default  |   541 |  0.8262 | default |
| Qwen3-4B-Instruct-2507-iq4_xs | ifeval    | mean_inst_level_strict   | default  |   541 |  0.8851 | default |
| Qwen3-4B-Instruct-2507-iq4_xs | iquiz     | mean_acc                 | OVERALL  |   120 |  0.5    | -       |
| Qwen3-4B-Instruct-2507-iq4_xs | math_500  | mean_acc                 | OVERALL  |   500 |  0.758  | -       |
+-------------------------------+-----------+--------------------------+----------+-------+---------+---------+

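Finding: quality is slightly below the GPTQ model on every metric (for example math_500 0.758 vs 0.782), but the overall degradation relative to bf16 is small.

GGUF quantization performance evaluation

Inference with vLLM + GGUF IQ4_XS.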

vllm serve ./Qwen3-4B-Instruct-2507-iq4_xs.gguf --served-model-name Qwen3-4B-Instruct-2507-iq4_xs --max-model-len 8192 --port 8080 --tokenizer Qwen3-4B-Instruct-2507
evalscope perf \
  --parallel 1 10 20 50 100 \
  --number 10 30 50 100 200 \
  --model Qwen3-4B-Instruct-2507-iq4_xs \
  --url http://127.0.0.1:8080/v1/chat/completions \
  --api openai \
  --dataset random \
  --max-tokens 1024 \
  --min-tokens 1024 \
  --prefix-length 0 \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --tokenizer-path Qwen3-4B-Instruct-2507/ \
  --extra-args '{"ignore_eos": true}'

┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│    1 │ 0.17 │    5.884 │    5.945 │  174.02 │    0.044 │   0.087 │    0.006 │   0.006 │    100.0%│
│   10 │ 0.40 │   24.839 │   25.406 │  412.00 │    0.449 │   1.034 │    0.024 │   0.024 │    100.0%│
│   20 │ 0.66 │   25.413 │   26.805 │  677.62 │    0.658 │   1.838 │    0.024 │   0.025 │    100.0%│
│   50 │ 1.17 │   42.447 │   46.481 │ 1201.77 │    1.444 │   4.483 │    0.040 │   0.041 │    100.0%│
│  100 │ 1.20 │   72.823 │  118.206 │ 1225.47 │    8.692 │  37.972 │    0.063 │   0.106 │    100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘

Inference with llama.cpp + GGUF IQ4_XS.

# -c 4096: context size (prompt plus generated tokens); -n 4096: maximum number of tokens to generate
./llama.cpp/build/bin/llama-server -m Qwen3-4B-Instruct-2507-iq4_xs.gguf -c 4096 -n 4096 -ngl 99
# Smoke test
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Instruct-2507-iq4_xs",
    "messages": [
      {"role": "user", "content": "你好"}
    ], "stream": true
  }'
# Note: on the first run, let it go for a short while and then press Ctrl+C; this serves as a warmup
evalscope perf \
  --parallel 1 10 20 50 100 \
  --number 10 30 50 100 200 \
  --model Qwen3-4B-Instruct-2507-iq4_xs \
  --url http://127.0.0.1:8080/v1/chat/completions \
  --api openai \
  --dataset random \
  --max-tokens 1024 \
  --min-tokens 1024 \
  --prefix-length 0 \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --tokenizer-path Qwen3-4B-Instruct-2507 \
  --extra-args '{"ignore_eos": true}'

┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│    1 │ 0.21 │    4.812 │    4.816 │  212.76 │    0.061 │   0.063 │    0.005 │   0.005 │    100.0%│
│   10 │ 0.20 │   41.531 │   48.982 │  209.89 │   36.711 │  44.152 │    0.005 │   0.005 │    100.0%│
│   20 │ 0.20 │   80.076 │   99.156 │  207.84 │   75.205 │  94.257 │    0.005 │   0.005 │    100.0%│
│   50 │ 0.20 │  189.758 │  251.990 │  204.79 │  184.814 │ 247.020 │    0.005 │   0.005 │    100.0%│
│  100 │ 0.20 │  378.942 │  504.018 │  204.04 │  373.980 │ 499.034 │    0.005 │   0.005 │    100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘

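Conclusion: by OTPS, llama.cpp has the best single-user performance (212.76 tok/s, vs 174.02 for vLLM + GGUF and 166.50 for vLLM + GPTQ). Under high concurrency, vLLM + GPTQ > vLLM + GGUF (about 1806 vs 1225 tok/s at 100 concurrent requests). llama.cpp stays flat at roughly 205 tok/s in this setup: the llama-server command above sets no parallel-slot option (such as -np), and the linearly growing TTFT suggests that requests were queued and served one at a time.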
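The ranking can be reproduced from the four performance tables. The sketch below ranks the setups by Gen. toks/s at 1 and at 100 concurrent requests and shows how well each one scales; the labels are just shorthand for the configurations tested above.

```python
# Gen. toks/s at concurrency 1 and 100, copied from the four tables above
RESULTS = {
    "vLLM + bf16": {1: 88.81, 100: 1579.02},
    "vLLM + GPTQ W4A16": {1: 166.50, 100: 1805.55},
    "vLLM + GGUF IQ4_XS": {1: 174.02, 100: 1225.47},
    "llama.cpp + GGUF IQ4_XS": {1: 212.76, 100: 204.04},
}


def rank(concurrency: int) -> list[str]:
    """Setups ordered from highest to lowest throughput at one concurrency level."""
    return sorted(RESULTS, key=lambda name: RESULTS[name][concurrency], reverse=True)


# Throughput at 100 concurrent requests relative to a single request
scaling = {name: r[100] / r[1] for name, r in RESULTS.items()}

print("single user:", rank(1))
print("100 users:  ", rank(100))
for name, factor in scaling.items():
    print(f"{name:26s} scales {factor:.1f}x from 1 to 100 concurrent requests")
```

One detail the ranking makes visible: at 100 concurrent requests, unquantized bf16 on vLLM (1579 tok/s) is also ahead of vLLM + GGUF (1225 tok/s), so in these measurements the GGUF path in vLLM trades throughput for the smaller file.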