AWQ Quantization Speed Benchmark: 16 Models Tested, INT4 vs BF16, and the MoE Reversal
How much faster does AWQ INT4 quantization actually make LLM inference? We benchmarked 16 models on an RTX PRO 6000 (96GB) with SGLang. The results: AWQ delivers a 1.88x speedup at 4B parameters and scales to 2.94x at 32B. Larger models benefit more because inference is memory-bandwidth-bound. The biggest surprise: the MoE model Qwen3-30B-A3B runs at 168.9 tok/s, faster than the 14B Dense AWQ model at 136.1 tok/s. Thirty billion parameters, faster than fourteen billion.
- 32B AWQ speedup: 2.94x
- 30B MoE throughput: 168.9 tok/s
- Models tested: 16
- Peak throughput (4B AWQ): 391.2 tok/s
Complete 16-Model Speed Rankings
All models were tested on the same hardware (RTX PRO 6000, SGLang 0.5.8, single request) to ensure apples-to-apples comparison. The table below shows token generation speed for Dense models in both AWQ and BF16 formats, plus two MoE models that break the expected size-to-speed relationship.
| # | Model | Params | Type | tok/s |
|---|---|---|---|---|
| 1 | Qwen3-4B-AWQ | 4B | Dense AWQ | 391.2 |
| 2 | Qwen3-4B | 4B | Dense BF16 | 208.1 |
| 3 | Qwen3-8B-AWQ | 8B | Dense AWQ | 208.0 |
| 4 | Qwen3-30B-A3B | 30B (3B active) | MoE | 168.9 |
| 5 | Qwen3-14B-AWQ | 14B | Dense AWQ | 136.1 |
| 6 | Gemma-3-12B-it-AWQ | 12B | Dense AWQ | 128.5 |
| 7 | GLM-4-9B-Chat-1M | 9B (1.2B active) | MoE | 107.2 |
| 8 | EXAONE-3.5-7.8B | 7.8B | Dense BF16 | 95.2 |
| 9 | Qwen3-8B | 8B | Dense BF16 | 90.8 |
| 10 | Qwen3-32B-AWQ | 32B | Dense AWQ | 70.2 |
| 11 | Qwen3-14B | 14B | Dense BF16 | 53.2 |
| 12 | Phi-4-AWQ | 14B | Dense AWQ | 51.8 |
| 13 | EXAONE-3.5-32B-AWQ | 32B | Dense AWQ | 24.2 |
| 14 | DeepSeek-R1-Distill-Qwen-32B | 32B | Dense BF16 | 24.0 |
| 15 | Qwen3-32B | 32B | Dense BF16 | 23.9 |
| 16 | EXAONE-3.5-32B | 32B | Dense BF16 | 19.3 |
Three key patterns emerge: (1) AWQ always outperforms same-size BF16 — no exceptions from 4B to 32B. (2) MoE models appear at unexpected positions — the 30B model ranks 4th, above all 14B models. (3) Same parameter count, wildly different speeds — Phi-4-AWQ (14B) at 51.8 tok/s is 2.6x slower than Qwen3-14B-AWQ at 136.1 tok/s.
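Throughput here is plain wall-clock math: count generated tokens, divide by elapsed decode time. A minimal sketch of that measurement, assuming any iterable of streamed tokens (in our runs the stream came from SGLang's OpenAI-compatible streaming endpoint; `fake_stream` below is a stand-in for illustration):

```python
import time

def measure_tok_s(token_stream):
    """Count tokens from a stream and return decode throughput in tok/s."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in token_stream)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

def fake_stream(n=200, delay=0.001):
    """Stand-in for a real streaming response: n tokens, ~1 ms apart."""
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

rate = measure_tok_s(fake_stream())
print(f"{rate:.1f} tok/s")  # upper-bounded by 1/delay = 1000 tok/s
```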
AWQ Quantization Effect: Larger Models Benefit More
To isolate the pure quantization effect, we compared the Qwen3 series at 4B, 8B, 14B, and 32B in both AWQ (INT4) and BF16 formats. The speedup ratio increases monotonically with model size.
| Model Size | AWQ (tok/s) | BF16 (tok/s) | Speedup |
|---|---|---|---|
| 4B | 391.2 | 208.1 | 1.88x |
| 8B | 208.0 | 90.8 | 2.29x |
| 14B | 136.1 | 53.2 | 2.56x |
| 32B | 70.2 | 23.9 | 2.94x |
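The speedup column falls straight out of the measured rates; a quick sanity check that the ratio grows monotonically with model size:

```python
# Measured single-request rates from the table above (tok/s).
rates = {
    "4B":  {"awq": 391.2, "bf16": 208.1},
    "8B":  {"awq": 208.0, "bf16": 90.8},
    "14B": {"awq": 136.1, "bf16": 53.2},
    "32B": {"awq": 70.2,  "bf16": 23.9},
}

speedups = {size: r["awq"] / r["bf16"] for size, r in rates.items()}
for size, s in speedups.items():
    print(f"{size}: {s:.2f}x")

# The AWQ advantage increases monotonically with model size.
values = list(speedups.values())
assert all(a < b for a, b in zip(values, values[1:]))
```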
The reason is straightforward: LLM token generation is memory-bandwidth-bound. The GPU must read every weight from VRAM for each generated token. Larger models have more weights to read, making the memory read the dominant bottleneck. AWQ reduces weight size by 4x (16-bit to 4-bit), directly relieving this bottleneck. At 32B, the weight size drops from ~64GB (BF16) to ~18GB (AWQ), and the speedup reaches 2.94x.
At 4B, the speedup is "only" 1.88x because the model is small enough that compute (matrix multiplication) becomes a meaningful fraction of total time. Shrinking memory reads yields diminishing returns once compute time sets the floor.
The MoE Reversal: 30B Faster Than 14B
Normally, bigger models are slower. 32B is slower than 14B, which is slower than 8B. MoE (Mixture of Experts) architecture breaks this rule entirely. Qwen3-30B-A3B has 30 billion total parameters but activates only 3 billion per token — 8 out of 128 experts per layer. The result: 30B total knowledge at 3B compute cost.
| Model | Total Params | Active Params | tok/s | Note |
|---|---|---|---|---|
| Qwen3-30B-A3B | 30B | 3B | 168.9 | 8B-speed, 14B-quality |
| Qwen3-14B-AWQ | 14B | 14B | 136.1 | Dense — all params active |
| GLM-4-9B-Chat-1M | 9B | ~1.2B | 107.2 | 7B-speed, 1M context |
| Qwen3-8B-AWQ | 8B | 8B | 208.0 | Fastest practical Dense |
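The "8 of 128 experts per token" routing described above can be sketched as a top-k gate: score every expert, keep the best k, renormalize their weights so only those experts run. The expert count matches Qwen3-30B-A3B's config, but the gating convention and dimensions here are an illustrative simplification, not the model's actual implementation:

```python
import math, random

def route_top_k(gate_logits, k=8):
    """Pick the top-k experts for one token and renormalize their weights."""
    topk = sorted(range(len(gate_logits)),
                  key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (one common MoE convention).
    exps = [math.exp(gate_logits[i]) for i in topk]
    total = sum(exps)
    return {i: e / total for i, e in zip(topk, exps)}

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(128)]  # one gate score per expert
weights = route_top_k(logits, k=8)

print(len(weights))  # 8 experts run for this token; the other 120 stay idle
```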
The trade-off: MoE models consume VRAM proportional to total parameters, not active parameters. Qwen3-30B-A3B runs at 3B compute speed but requires 30B worth of VRAM. On GPUs with limited memory (16GB), this makes MoE impractical without aggressive quantization. The speed advantage only materializes when VRAM is sufficient.
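The asymmetry is easy to quantify: VRAM scales with total parameters, speed with active ones. A back-of-envelope estimate (weights only, ignoring KV cache and activation memory):

```python
def weight_gb(params_b, bits):
    """Approximate weight footprint in GB for params_b billion parameters."""
    return params_b * 1e9 * bits / 8 / 1e9  # simplifies to params_b * bits / 8

# Qwen3-30B-A3B pays VRAM for all 30B params but compute for only 3B.
print(weight_gb(30, 16))  # BF16: 60.0 GB of weights -- exceeds a 48GB card
print(weight_gb(30, 4))   # INT4: 15.0 GB -- fits comfortably
print(weight_gb(3, 16))   # per-token reads touch only the ~3B active params
```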
Why Memory Bandwidth Is Everything
There is a simple formula that predicts LLM inference speed with remarkable accuracy for large models:
Expected tok/s = Memory Bandwidth (GB/s) / Model Weight Size (GB)
The RTX PRO 6000 has 1,536 GB/s bandwidth. For Qwen3-32B BF16 (~64GB weights): 1,536 / 64 = 24 tok/s (predicted) vs 23.9 tok/s (measured). For Qwen3-32B AWQ (~18GB weights): 1,536 / 18 = 85 tok/s (predicted) vs 70.2 tok/s (measured, with overhead). AWQ's fundamental mechanism is reducing the denominator by 4x.
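The formula in code, checked against the measured 32B numbers from the table:

```python
def expected_tok_s(bandwidth_gbs, weight_gb):
    """Upper-bound decode speed: every weight byte is read once per token."""
    return bandwidth_gbs / weight_gb

BW = 1536  # RTX PRO 6000 memory bandwidth, GB/s

print(expected_tok_s(BW, 64))  # Qwen3-32B BF16: 24.0 predicted vs 23.9 measured
print(expected_tok_s(BW, 18))  # Qwen3-32B AWQ: ~85 predicted vs 70.2 measured
```

The AWQ prediction overshoots because dequantization and kernel overhead eat into the bandwidth saving; the BF16 prediction is nearly exact because the workload is purely memory-bound.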
Test hardware: 1,536 GB/s memory bandwidth, 96GB GDDR7 VRAM, 350W power cap.
Practical Model Selection Guide
Based on the 16-model benchmark, here are the optimal choices by available VRAM.
VRAM 48GB+ (Quality-First)
Primary: Qwen3-30B-A3B (168.9 tok/s) — 14B-quality at 8B-speed through MoE. High quality: Qwen3-32B-AWQ (70.2 tok/s) — best quality, streaming recommended.
VRAM 16–24GB (Balanced)
Primary: Qwen3-14B-AWQ (136.1 tok/s) — top Korean benchmark score, fast speed. Lightweight: Qwen3-8B-AWQ (208.0 tok/s) — optimal for FAQ/classification tasks.
VRAM 8–12GB (Speed-First)
Primary: Qwen3-8B-AWQ (208.0 tok/s) — only ~5GB VRAM needed. Ultra-light: Qwen3-4B-AWQ (391.2 tok/s) — ~2.5GB VRAM, routing/classification only.
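The guide above condenses into a small lookup. The tiers and picks mirror the recommendations; the function name and the `primary`/`alt` keys are illustrative, not an API:

```python
def pick_model(vram_gb, priority="primary"):
    """Return the recommended model for a VRAM budget (per the guide above)."""
    tiers = [
        (48, {"primary": "Qwen3-30B-A3B", "alt": "Qwen3-32B-AWQ"}),
        (16, {"primary": "Qwen3-14B-AWQ", "alt": "Qwen3-8B-AWQ"}),
        (8,  {"primary": "Qwen3-8B-AWQ",  "alt": "Qwen3-4B-AWQ"}),
    ]
    for min_vram, picks in tiers:
        if vram_gb >= min_vram:
            return picks[priority]
    return "Qwen3-4B-AWQ"  # below 8GB, the ~2.5GB model is the only fit

print(pick_model(96))         # Qwen3-30B-A3B
print(pick_model(24))         # Qwen3-14B-AWQ
print(pick_model(12, "alt"))  # Qwen3-4B-AWQ
```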
Conclusion: AWQ + MoE Changes the Game
Two technologies are transforming local LLM deployment economics:
- AWQ quantization is non-negotiable. At 32B, it delivers 2.94x speedup with no perceptible quality loss. There is no reason to run BF16 when VRAM allows AWQ.
- MoE delivers speed and quality together. Qwen3-30B-A3B runs 24% faster than the 14B AWQ model while drawing on 30B parameters' worth of knowledge.
- Architecture-engine compatibility matters. Same 14B parameter count, but Qwen3-14B-AWQ (136.1) vs Phi-4-AWQ (51.8) shows a 2.6x gap. SGLang + Qwen3 is the current optimal combination.
The simple rule: use AWQ everywhere, use MoE where VRAM permits, and always benchmark your specific model-engine combination before deploying to production.