AWQ Quantization Speed Benchmark: 16 Models Tested, INT4 vs BF16, and the MoE Reversal
How much faster does AWQ INT4 quantization actually make LLM inference? We benchmarked 16 models on an RTX PRO 6000 (96GB) with SGLang. The results: AWQ delivers a 1.88x speedup at 4B parameters and scales to 2.94x at 32B. Larger models benefit more because inference is memory-bandwidth-bound. The biggest surprise: the MoE model Qwen3-30B-A3B runs at 168.9 tok/s, faster than the 14B Dense AWQ model at 136.1 tok/s. Thirty billion parameters, faster than fourteen billion.
- 32B AWQ speedup: 2.94x
- 30B MoE throughput: 168.9 tok/s
- Models tested: 16
- Peak throughput (4B AWQ): 391.2 tok/s
Complete 16-Model Speed Rankings
All models were tested on the same hardware (RTX PRO 6000, SGLang 0.5.8, single request) to ensure apples-to-apples comparison. The table below shows token generation speed for Dense models in both AWQ and BF16 formats, plus two MoE models that break the expected size-to-speed relationship.
| # | Model | Params | Type | tok/s |
|---|---|---|---|---|
| 1 | Qwen3-4B-AWQ | 4B | Dense AWQ | 391.2 |
| 2 | Qwen3-4B | 4B | Dense BF16 | 208.1 |
| 3 | Qwen3-8B-AWQ | 8B | Dense AWQ | 208.0 |
| 4 | Qwen3-30B-A3B | 30B (3B active) | MoE | 168.9 |
| 5 | Qwen3-14B-AWQ | 14B | Dense AWQ | 136.1 |
| 6 | Gemma-3-12B-it-AWQ | 12B | Dense AWQ | 128.5 |
| 7 | GLM-4-9B-Chat-1M | 9B (1.2B active) | MoE | 107.2 |
| 8 | EXAONE-3.5-7.8B | 7.8B | Dense BF16 | 95.2 |
| 9 | Qwen3-8B | 8B | Dense BF16 | 90.8 |
| 10 | Qwen3-32B-AWQ | 32B | Dense AWQ | 70.2 |
| 11 | Qwen3-14B | 14B | Dense BF16 | 53.2 |
| 12 | Phi-4-AWQ | 14B | Dense AWQ | 51.8 |
| 13 | EXAONE-3.5-32B-AWQ | 32B | Dense AWQ | 24.2 |
| 14 | DeepSeek-R1-Distill-Qwen-32B | 32B | Dense BF16 | 24.0 |
| 15 | Qwen3-32B | 32B | Dense BF16 | 23.9 |
| 16 | EXAONE-3.5-32B | 32B | Dense BF16 | 19.3 |
Three key patterns emerge: (1) AWQ always outperforms same-size BF16 — no exceptions from 4B to 32B. (2) MoE models appear at unexpected positions — the 30B model ranks 4th, above all 14B models. (3) Same parameter count, wildly different speeds — Phi-4-AWQ (14B) at 51.8 tok/s is 2.6x slower than Qwen3-14B-AWQ at 136.1 tok/s.
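Throughput here is plain wall-clock math: count generated tokens, divide by elapsed decode time. A minimal sketch of that measurement, assuming any iterable of streamed tokens (in our runs the stream came from SGLang's OpenAI-compatible streaming endpoint; `fake_stream` below is a stand-in for illustration):

```python
import time

def measure_tok_s(token_stream):
    """Count tokens from a stream and return decode throughput in tok/s."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in token_stream)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

def fake_stream(n=200, delay=0.001):
    """Stand-in for a real streaming response: n tokens, ~1 ms apart."""
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

rate = measure_tok_s(fake_stream())
print(f"{rate:.1f} tok/s")  # upper-bounded by 1/delay = 1000 tok/s
```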
AWQ Quantization Effect: Larger Models Benefit More
To isolate the pure quantization effect, we compared the Qwen3 series at 4B, 8B, 14B, and 32B in both AWQ (INT4) and BF16 formats. The speedup ratio increases monotonically with model size.
| Model Size | AWQ (tok/s) | BF16 (tok/s) | Speedup |
|---|---|---|---|
| 4B | 391.2 | 208.1 | 1.88x |
| 8B | 208.0 | 90.8 | 2.29x |
| 14B | 136.1 | 53.2 | 2.56x |
| 32B | 70.2 | 23.9 | 2.94x |
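The speedup column falls straight out of the measured rates; a quick sanity check that the ratio grows monotonically with model size:

```python
# Measured single-request rates from the table above (tok/s).
rates = {
    "4B":  {"awq": 391.2, "bf16": 208.1},
    "8B":  {"awq": 208.0, "bf16": 90.8},
    "14B": {"awq": 136.1, "bf16": 53.2},
    "32B": {"awq": 70.2,  "bf16": 23.9},
}

speedups = {size: r["awq"] / r["bf16"] for size, r in rates.items()}
for size, s in speedups.items():
    print(f"{size}: {s:.2f}x")

# The AWQ advantage increases monotonically with model size.
values = list(speedups.values())
assert all(a < b for a, b in zip(values, values[1:]))
```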
The reason is straightforward: LLM token generation is memory-bandwidth-bound. The GPU must read every weight from VRAM for each generated token. Larger models have more weights to read, making the memory read the dominant bottleneck. AWQ reduces weight size by 4x (16-bit to 4-bit), directly relieving this bottleneck. At 32B, the weight size drops from ~64GB (BF16) to ~18GB (AWQ), and the speedup reaches 2.94x.
At 4B, the speedup is "only" 1.88x because the model is small enough that compute (matrix multiplication) becomes a meaningful fraction of total time. Shrinking memory reads yields diminishing returns once compute time sets the floor.
The MoE Reversal: 30B Faster Than 14B
Normally, bigger models are slower. 32B is slower than 14B, which is slower than 8B. MoE (Mixture of Experts) architecture breaks this rule entirely. Qwen3-30B-A3B has 30 billion total parameters but activates only 3 billion per token — 8 out of 128 experts per layer. The result: 30B total knowledge at 3B compute cost.
| Model | Total Params | Active Params | tok/s | Note |
|---|---|---|---|---|
| Qwen3-30B-A3B | 30B | 3B | 168.9 | 8B-speed, 14B-quality |
| Qwen3-14B-AWQ | 14B | 14B | 136.1 | Dense — all params active |
| GLM-4-9B-Chat-1M | 9B | ~1.2B | 107.2 | 7B-speed, 1M context |
| Qwen3-8B-AWQ | 8B | 8B | 208.0 | Fastest practical Dense |
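The "8 of 128 experts per token" routing described above can be sketched as a top-k gate: score every expert, keep the best k, renormalize their weights so only those experts run. The expert count matches Qwen3-30B-A3B's config, but the gating convention and dimensions here are an illustrative simplification, not the model's actual implementation:

```python
import math, random

def route_top_k(gate_logits, k=8):
    """Pick the top-k experts for one token and renormalize their weights."""
    topk = sorted(range(len(gate_logits)),
                  key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (one common MoE convention).
    exps = [math.exp(gate_logits[i]) for i in topk]
    total = sum(exps)
    return {i: e / total for i, e in zip(topk, exps)}

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(128)]  # one gate score per expert
weights = route_top_k(logits, k=8)

print(len(weights))  # 8 experts run for this token; the other 120 stay idle
```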
The trade-off: MoE models consume VRAM proportional to total parameters, not active parameters. Qwen3-30B-A3B runs at 3B compute speed but requires 30B worth of VRAM. On GPUs with limited memory (16GB), this makes MoE impractical without aggressive quantization. The speed advantage only materializes when VRAM is sufficient.
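The asymmetry is easy to quantify: VRAM scales with total parameters, speed with active ones. A back-of-envelope estimate (weights only, ignoring KV cache and activation memory):

```python
def weight_gb(params_b, bits):
    """Approximate weight footprint in GB for params_b billion parameters."""
    return params_b * 1e9 * bits / 8 / 1e9  # simplifies to params_b * bits / 8

# Qwen3-30B-A3B pays VRAM for all 30B params but compute for only 3B.
print(weight_gb(30, 16))  # BF16: 60.0 GB of weights -- exceeds a 48GB card
print(weight_gb(30, 4))   # INT4: 15.0 GB -- fits comfortably
print(weight_gb(3, 16))   # per-token reads touch only the ~3B active params
```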
Why Memory Bandwidth Is Everything
There is a simple formula that predicts LLM inference speed with remarkable accuracy for large models:
Expected tok/s = Memory Bandwidth (GB/s) / Model Weight Size (GB)
The RTX PRO 6000 has 1,536 GB/s bandwidth. For Qwen3-32B BF16 (~64GB weights): 1,536 / 64 = 24 tok/s (predicted) vs 23.9 tok/s (measured). For Qwen3-32B AWQ (~18GB weights): 1,536 / 18 = 85 tok/s (predicted) vs 70.2 tok/s (measured, with overhead). AWQ's fundamental mechanism is reducing the denominator by 4x.
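The formula in code, checked against the measured 32B numbers from the table:

```python
def expected_tok_s(bandwidth_gbs, weight_gb):
    """Upper-bound decode speed: every weight byte is read once per token."""
    return bandwidth_gbs / weight_gb

BW = 1536  # RTX PRO 6000 memory bandwidth, GB/s

print(expected_tok_s(BW, 64))  # Qwen3-32B BF16: 24.0 predicted vs 23.9 measured
print(expected_tok_s(BW, 18))  # Qwen3-32B AWQ: ~85 predicted vs 70.2 measured
```

The AWQ prediction overshoots because dequantization and kernel overhead eat into the bandwidth saving; the BF16 prediction is nearly exact because the workload is purely memory-bound.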
Test hardware: 1,536 GB/s memory bandwidth, 96GB GDDR7 VRAM, 350W power cap.
Practical Model Selection Guide
Based on the 16-model benchmark, here are the optimal choices by available VRAM.
VRAM 48GB+ (Quality-First)
Primary: Qwen3-30B-A3B (168.9 tok/s) — 14B-quality at 8B-speed through MoE. High quality: Qwen3-32B-AWQ (70.2 tok/s) — best quality, streaming recommended.
VRAM 16–24GB (Balanced)
Primary: Qwen3-14B-AWQ (136.1 tok/s) — top Korean benchmark score, fast speed. Lightweight: Qwen3-8B-AWQ (208.0 tok/s) — optimal for FAQ/classification tasks.
VRAM 8–12GB (Speed-First)
Primary: Qwen3-8B-AWQ (208.0 tok/s) — only ~5GB VRAM needed. Ultra-light: Qwen3-4B-AWQ (391.2 tok/s) — ~2.5GB VRAM, routing/classification only.
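The guide above condenses into a small lookup. The tiers and picks mirror the recommendations; the function name and the `primary`/`alt` keys are illustrative, not an API:

```python
def pick_model(vram_gb, priority="primary"):
    """Return the recommended model for a VRAM budget (per the guide above)."""
    tiers = [
        (48, {"primary": "Qwen3-30B-A3B", "alt": "Qwen3-32B-AWQ"}),
        (16, {"primary": "Qwen3-14B-AWQ", "alt": "Qwen3-8B-AWQ"}),
        (8,  {"primary": "Qwen3-8B-AWQ",  "alt": "Qwen3-4B-AWQ"}),
    ]
    for min_vram, picks in tiers:
        if vram_gb >= min_vram:
            return picks[priority]
    return "Qwen3-4B-AWQ"  # below 8GB, the ~2.5GB model is the only fit

print(pick_model(96))         # Qwen3-30B-A3B
print(pick_model(24))         # Qwen3-14B-AWQ
print(pick_model(12, "alt"))  # Qwen3-4B-AWQ
```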
Conclusion: AWQ + MoE Changes the Game
Two technologies are transforming local LLM deployment economics:
- AWQ quantization is non-negotiable. At 32B, it delivers 2.94x speedup with no perceptible quality loss. There is no reason to run BF16 when VRAM allows AWQ.
- MoE delivers speed and quality together. Qwen3-30B-A3B runs 24% faster than the 14B AWQ model while drawing on 30B parameters' worth of knowledge.
- Architecture-engine compatibility matters. Same 14B parameter count, but Qwen3-14B-AWQ (136.1) vs Phi-4-AWQ (51.8) shows a 2.6x gap. SGLang + Qwen3 is the current optimal combination.
The simple rule: use AWQ everywhere, use MoE where VRAM permits, and always benchmark your specific model-engine combination before deploying to production.