treeru.com
AI · February 26, 2026

AWQ Quantization Speed Benchmark: 16 Models Tested, INT4 vs BF16, and the MoE Reversal

How much faster does AWQ INT4 quantization actually make LLM inference? We benchmarked 16 models on an RTX PRO 6000 (96GB) with SGLang. The results: AWQ delivers 1.88x speedup at 4B parameters and scales to 2.94x at 32B. Larger models benefit more because inference is memory-bandwidth-bound. The biggest surprise: MoE model Qwen3-30B-A3B runs at 168.9 tok/s — faster than the 14B Dense AWQ model at 136.1 tok/s. 30 billion parameters, faster than 14 billion.

32B AWQ speedup: 2.94x
30B MoE: 168.9 tok/s
Models tested: 16
Peak tok/s (4B AWQ): 391.2

Complete 16-Model Speed Rankings

All models were tested on the same hardware (RTX PRO 6000, SGLang 0.5.8, single request) to ensure apples-to-apples comparison. The table below shows token generation speed for Dense models in both AWQ and BF16 formats, plus two MoE models that break the expected size-to-speed relationship.

| #  | Model                        | Params           | Type       | tok/s |
|----|------------------------------|------------------|------------|-------|
| 1  | Qwen3-4B-AWQ                 | 4B               | Dense AWQ  | 391.2 |
| 2  | Qwen3-4B                     | 4B               | Dense BF16 | 208.1 |
| 3  | Qwen3-8B-AWQ                 | 8B               | Dense AWQ  | 208.0 |
| 4  | Qwen3-30B-A3B                | 30B (3B active)  | MoE        | 168.9 |
| 5  | Qwen3-14B-AWQ                | 14B              | Dense AWQ  | 136.1 |
| 6  | Gemma-3-12B-it-AWQ           | 12B              | Dense AWQ  | 128.5 |
| 7  | GLM-4-9B-Chat-1M             | 9B (1.2B active) | MoE        | 107.2 |
| 8  | EXAONE-3.5-7.8B              | 7.8B             | Dense BF16 | 95.2  |
| 9  | Qwen3-8B                     | 8B               | Dense BF16 | 90.8  |
| 10 | Qwen3-32B-AWQ                | 32B              | Dense AWQ  | 70.2  |
| 11 | Qwen3-14B                    | 14B              | Dense BF16 | 53.2  |
| 12 | Phi-4-AWQ                    | 14B              | Dense AWQ  | 51.8  |
| 13 | EXAONE-3.5-32B-AWQ           | 32B              | Dense AWQ  | 24.2  |
| 14 | DeepSeek-R1-Distill-Qwen-32B | 32B              | Dense BF16 | 24.0  |
| 15 | Qwen3-32B                    | 32B              | Dense BF16 | 23.9  |
| 16 | EXAONE-3.5-32B               | 32B              | Dense BF16 | 19.3  |

Three key patterns emerge: (1) AWQ always outperforms same-size BF16 — no exceptions from 4B to 32B. (2) MoE models appear at unexpected positions — the 30B model ranks 4th, above all 14B models. (3) Same parameter count, wildly different speeds — Phi-4-AWQ (14B) at 51.8 tok/s is 2.6x slower than Qwen3-14B-AWQ at 136.1 tok/s.

AWQ Quantization Effect: Larger Models Benefit More

To isolate the pure quantization effect, we compared the Qwen3 series at 4B, 8B, 14B, and 32B in both AWQ (INT4) and BF16 formats. The speedup ratio increases monotonically with model size.

| Model Size | AWQ (tok/s) | BF16 (tok/s) | Speedup |
|------------|-------------|--------------|---------|
| 4B         | 391.2       | 208.1        | 1.88x   |
| 8B         | 208.0       | 90.8         | 2.29x   |
| 14B        | 136.1       | 53.2         | 2.56x   |
| 32B        | 70.2        | 23.9         | 2.94x   |

The reason is straightforward: LLM token generation is memory-bandwidth-bound. The GPU must read every weight from VRAM for each generated token. Larger models have more weights to read, making the memory read the dominant bottleneck. AWQ reduces weight size by 4x (16-bit to 4-bit), directly relieving this bottleneck. At 32B, the weight size drops from ~64GB (BF16) to ~18GB (AWQ), and the speedup reaches 2.94x.
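The weight-size arithmetic can be checked in a few lines of Python. This is a rough sketch: it ignores AWQ's per-group scales and zeros and the layers typically kept in 16-bit (embeddings, lm_head), which is why a real 32B AWQ checkpoint lands near 18GB rather than the naive 16GB.

```python
def weight_size_gb(params_b: float, bits: int) -> float:
    """Approximate dense-weight footprint: params x bits / 8 bytes.

    Ignores AWQ per-group scales/zeros and unquantized layers,
    which add a few GB to real checkpoints.
    """
    return params_b * 1e9 * bits / 8 / 1e9

for size_b in (4, 8, 14, 32):
    bf16 = weight_size_gb(size_b, 16)
    int4 = weight_size_gb(size_b, 4)
    print(f"{size_b}B: BF16 ~{bf16:.0f} GB -> INT4 ~{int4:.0f} GB")
```

The 4x reduction in bytes read per token is the entire mechanism; everything else in the speedup table follows from it.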

At 4B, the speedup is "only" 1.88x because the model is small enough that compute (matrix multiplication) becomes a meaningful fraction of total time. Quantization shrinks only the memory-read portion, so the remaining compute time caps the overall gain.

The MoE Reversal: 30B Faster Than 14B

Normally, bigger models are slower. 32B is slower than 14B, which is slower than 8B. MoE (Mixture of Experts) architecture breaks this rule entirely. Qwen3-30B-A3B has 30 billion total parameters but activates only 3 billion per token — 8 out of 128 experts per layer. The result: 30B total knowledge at 3B compute cost.

| Model            | Total Params | Active Params | tok/s | Note                      |
|------------------|--------------|---------------|-------|---------------------------|
| Qwen3-30B-A3B    | 30B          | 3B            | 168.9 | 8B-speed, 14B-quality     |
| Qwen3-14B-AWQ    | 14B          | 14B           | 136.1 | Dense — all params active |
| GLM-4-9B-Chat-1M | 9B           | ~1.2B         | 107.2 | 7B-speed, 1M context      |
| Qwen3-8B-AWQ     | 8B           | 8B            | 208.0 | Fastest practical Dense   |

The trade-off: MoE models consume VRAM proportional to total parameters, not active parameters. Qwen3-30B-A3B runs at 3B compute speed but requires 30B worth of VRAM. On GPUs with limited memory (16GB), this makes MoE impractical without aggressive quantization. The speed advantage only materializes when VRAM is sufficient.
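The trade-off can be made concrete with two toy estimators (an assumption-laden sketch: it ignores KV cache, activations, and quantization, and treats per-token memory traffic as exactly the active-expert weights):

```python
def vram_gb(total_params_b: float, bits: int) -> float:
    """All experts must stay resident, so VRAM tracks TOTAL parameters."""
    return total_params_b * 1e9 * bits / 8 / 1e9

def per_token_read_gb(active_params_b: float, bits: int) -> float:
    """Only the routed experts are read each step, so per-token
    memory traffic tracks ACTIVE parameters."""
    return active_params_b * 1e9 * bits / 8 / 1e9

# Qwen3-30B-A3B in BF16: ~60 GB resident, but only ~6 GB streamed per token
print(vram_gb(30, 16), per_token_read_gb(3, 16))
# A dense 14B in BF16 streams ~28 GB per token, over 4x the MoE's traffic
print(per_token_read_gb(14, 16))
```

This asymmetry is the whole MoE story: you pay for total parameters in VRAM and for active parameters in bandwidth, and bandwidth is what sets tok/s.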

Why Memory Bandwidth Is Everything

There is a simple formula that predicts LLM inference speed with remarkable accuracy for large models:

Expected tok/s = Memory Bandwidth (GB/s) / Model Weight Size (GB)

The RTX PRO 6000 has 1,536 GB/s bandwidth. For Qwen3-32B BF16 (~64GB weights): 1,536 / 64 = 24 tok/s (predicted) vs 23.9 tok/s (measured). For Qwen3-32B AWQ (~18GB weights): 1,536 / 18 = 85 tok/s (predicted) vs 70.2 tok/s (measured, with overhead). AWQ's fundamental mechanism is reducing the denominator by 4x.
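The formula above can be wrapped in a small sanity-check helper (hypothetical names; the bandwidth constant is the card's spec figure quoted in the article, and real decoding adds KV-cache reads and kernel overhead, which is why the measured AWQ number falls short of the prediction):

```python
BANDWIDTH_GBPS = 1536.0  # RTX PRO 6000 memory bandwidth (spec figure)

def predicted_toks(weight_gb: float) -> float:
    """Roofline estimate: every weight is streamed from VRAM once per
    generated token, so bandwidth / weight size bounds tok/s."""
    return BANDWIDTH_GBPS / weight_gb

# Qwen3-32B BF16 (~64 GB): predicted 24.0 tok/s vs 23.9 measured
# Qwen3-32B AWQ  (~18 GB): predicted ~85 tok/s vs 70.2 measured
print(round(predicted_toks(64), 1), round(predicted_toks(18), 1))
```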

Bandwidth: 1,536 GB/s
VRAM: 96 GB GDDR7
Power cap used: 350W

Practical Model Selection Guide

Based on the 16-model benchmark, here are the optimal choices by available VRAM.

VRAM 48GB+ (Quality-First)

Primary: Qwen3-30B-A3B (168.9 tok/s) — 14B-quality at 8B-speed through MoE.
High quality: Qwen3-32B-AWQ (70.2 tok/s) — best quality, streaming recommended.

VRAM 16–24GB (Balanced)

Primary: Qwen3-14B-AWQ (136.1 tok/s) — top Korean benchmark score, fast speed.
Lightweight: Qwen3-8B-AWQ (208.0 tok/s) — optimal for FAQ/classification tasks.

VRAM 8–12GB (Speed-First)

Primary: Qwen3-8B-AWQ (208.0 tok/s) — only ~5GB VRAM needed.
Ultra-light: Qwen3-4B-AWQ (391.2 tok/s) — ~2.5GB VRAM, routing/classification only.
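The tiers above can be collapsed into a simple lookup (a hypothetical helper; the thresholds and picks mirror this article's guide, not a universal rule):

```python
def pick_model(vram_gb: float) -> str:
    """Map available VRAM to the primary pick from the guide above."""
    if vram_gb >= 48:
        return "Qwen3-30B-A3B"   # 168.9 tok/s, MoE, quality-first
    if vram_gb >= 16:
        return "Qwen3-14B-AWQ"   # 136.1 tok/s, balanced
    if vram_gb >= 8:
        return "Qwen3-8B-AWQ"    # 208.0 tok/s, ~5 GB weights
    return "Qwen3-4B-AWQ"        # 391.2 tok/s, ~2.5 GB, routing only

print(pick_model(96))  # the benchmark card's 96 GB lands in the top tier
```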

Conclusion: AWQ + MoE Changes the Game

Two technologies, plus one compatibility caveat, are transforming local LLM deployment economics:

  1. AWQ quantization is non-negotiable. At 32B, it delivers 2.94x speedup with no perceptible quality loss. There is no reason to run BF16 when VRAM allows AWQ.
  2. MoE unlocks speed-quality simultaneity. Qwen3-30B-A3B runs 24% faster than 14B AWQ while accessing 30B worth of knowledge.
  3. Architecture-engine compatibility matters. Same 14B parameter count, but Qwen3-14B-AWQ (136.1) vs Phi-4-AWQ (51.8) shows a 2.6x gap. SGLang + Qwen3 is the current optimal combination.

The simple rule: use AWQ everywhere, use MoE where VRAM permits, and always benchmark your specific model-engine combination before deploying to production.