Gemma 4 Benchmark on RTX PRO 6000 — 8 Attempts, 3 Models, and a Qwen3 Showdown
Google announced Gemma 4 on April 3, 2026. Two days later, we ran it on an RTX PRO 6000 Blackwell (96 GB VRAM, SM 12.0). The short version: 8 attempts, 7 failures. That failure log is the main value of this post. Gemma 4 is brand new, and the ecosystem — particularly vLLM and SGLang — hadn't fully caught up yet. The combination with Blackwell's SM 12.0 architecture made it harder than expected.
Test Environment
Hardware: RTX PRO 6000 Blackwell (96 GB VRAM, SM 12.0), AMD Ryzen 9 9950X3D 16-core, 96 GB DDR5, power limit set to 350 W (stock TDP 600 W), Ubuntu 24.04 LTS, CUDA 13.1. Three models were compared after surviving the setup gauntlet:
| Model | Engine | Quantization | Size | Active Params |
|---|---|---|---|---|
| Qwen3-32B-AWQ | SGLang 0.5.9 | AWQ 4-bit | 19 GB | 32B (all) |
| Gemma4-31B-AWQ | vLLM 0.19.0 | AWQ 4-bit | 20 GB | 31B (all) |
| Gemma4-26B-MoE-AWQ | vLLM 0.19.0 | compressed-tensors 4-bit | 17 GB | 3.8B only |
Qwen3-32B served as the baseline — it has been running in production with SGLang + EAGLE-3 for months. Both Gemma 4 models were served with vLLM in enforce-eager mode. The reason for this unusual configuration is documented below.
The 8-Attempt Log
This is the most important section. If you are trying to run Gemma 4 on a Blackwell GPU right now, this may save you several hours.
Attempt 1 — SGLang 0.5.9 + BF16 → FAIL. The first try was the obvious one: use the engine already running in production. Error: KeyError: 'gemma4' — transformers doesn't recognize the Gemma4ForCausalLM architecture. Gemma 4 was too new for the pip release. Fix attempted: install transformers from GitHub main branch directly.
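For reference, this is the standard way to pull an architecture that has landed on transformers main but not yet shipped in a pip release (the exact commit needed for Gemma 4 isn't pinned here):

```bash
# Install transformers from GitHub main so Gemma4ForCausalLM is registered.
# (Unpinned: at the time of writing, no pip release included Gemma 4.)
pip install --upgrade git+https://github.com/huggingface/transformers.git
```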
Attempt 2 — SGLang 0.5.10rc0 + BF16 → FAIL. After replacing transformers with GitHub main and upgrading SGLang to 0.5.10rc0: RuntimeError: CUDA graph capture failed — Expected head_dim=256, got 512 in layer 12. Gemma 4 uses mixed head dimensions (256/512 per layer). SGLang's CUDA graph doesn't handle this. Fix attempted: add `--disable-cuda-graph`.
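A sketch of the launch that Attempt 3 then used, assuming SGLang's standard server entry point; the model path is illustrative, not a confirmed Hugging Face repo id:

```bash
# Attempt 3 configuration: SGLang server with CUDA graph capture disabled
# to sidestep the mixed 256/512 head_dim issue. Model path is illustrative.
python -m sglang.launch_server \
  --model-path google/gemma4-31b \
  --disable-cuda-graph \
  --port 30000
```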
Attempt 3 — SGLang + CUDA graph disabled → FAIL. New error after disabling CUDA graph: AttributeError: module 'sglang.srt.modalities' has no attribute 'MULTI_IMAGES'. The SGLang PR for Gemma 4 multimodal support had not been merged yet (as of 2026-04-05). Conclusion: SGLang cannot run Gemma 4 at this time. Switch to vLLM.
Attempt 4 — vLLM 0.19.0 + BF16 → FAIL. Fresh vLLM 0.19.0 install, BF16 serving. Error: triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 98304, Hardware limit: 65536 (SM 12.0 / Blackwell). The triton kernels exceed shared memory limits on SM 12.0. RTX 4090 (SM 8.9) and H100 (SM 9.0) don't have this problem. Fix attempted: `--enforce-eager` to disable CUDA graph capture.
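The corresponding launch, sketched with an illustrative model id (`--enforce-eager` is a standard vLLM flag that disables CUDA graph capture and runs every step in eager mode):

```bash
# Attempt 5 configuration: vLLM in eager mode, skipping CUDA graph capture
# and the Triton path that blows the 64 KB shared-memory limit on SM 12.0.
vllm serve google/gemma4-31b --enforce-eager
```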
Attempt 5 — vLLM + enforce-eager + BF16 → FAIL. The triton issue was gone, but the attention layer crashed: ImportError: cannot import name 'flash_attn_func' from 'flash_attn.ops'. The pip-installed flash-attn wasn't compiled for SM 12.0. A source build was needed, which in turn surfaced a CUDA version mismatch.
Attempt 6 — flash-attn source build → CUDA version mismatch. Error: System CUDA: 13.1 / PyTorch CUDA: 12.8 — flash-attn requires matching CUDA versions. The system had CUDA 13.1 but PyTorch was installed with cu128. Solution: (1) replace PyTorch with the cu130 build, (2) remove nvidia-nccl-cu12 (the source of the conflict), (3) build flash-attn from source (~20 minutes). After this, vLLM started successfully in BF16 mode for the first time.
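The three steps as shell commands — a sketch under two assumptions: that PyTorch publishes a cu130 wheel index following its usual URL scheme, and that flash-attn's standard source-build path (`--no-build-isolation`) is the one used:

```bash
# (1) Replace PyTorch with the cu130 build to match system CUDA 13.1.
#     Index URL assumes PyTorch's usual wheel-index naming scheme.
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu130
# (2) Remove the cu12 NCCL package that conflicts with the cu13 stack.
pip uninstall -y nvidia-nccl-cu12
# (3) Build flash-attn from source for SM 12.0 (took ~20 minutes here).
MAX_JOBS=8 pip install flash-attn --no-build-isolation
```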
Attempt 7 — vLLM + BF16 → partial success, context problem. Server was up. But during actual testing: Warning: max_model_len reduced to 4096. The model weights consumed 85 GB, leaving insufficient KV cache space for longer contexts. With context 4,096, 9 out of 25 test cases failed due to context limit errors (success rate: 16/25).
Attempt 8 — AWQ quantization → SUCCESS. Switched to AWQ 4-bit version. Model weights dropped to 20 GB, freeing 70+ GB for KV cache. Context length: 16,384. Success rate: 25/25.
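Putting it together, the final working configuration looks roughly like this (model id and port are illustrative; the flags are standard vLLM options):

```bash
# Attempt 8: AWQ 4-bit weights (~20 GB), explicit 16K context, eager mode.
vllm serve google/gemma4-31b-awq \
  --quantization awq \
  --max-model-len 16384 \
  --enforce-eager \
  --port 8000
```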
Why AWQ Was Necessary: BF16 vs AWQ
| Metric | BF16 | AWQ 4-bit |
|---|---|---|
| VRAM (model weights) | 85 GB | 20 GB |
| Context length | 4,096 | 16,384 |
| Success rate (25 tests) | 16/25 | 25/25 |
| Avg response (successful only) | 1.02 s | 2.76 s |
BF16 is faster when it succeeds — 1.02 s average vs 2.76 s for AWQ. But the 4,096 context cap caused 9 failures. A model that randomly fails on long inputs isn't usable in production. AWQ sacrifices raw speed in exchange for reliability. That's not a trade-off; it's a requirement.
3-Model Comparison: Overall Results
All three models were tested on the same 25 questions (5 categories × 5 questions) connected to a RAG pipeline, run sequentially on the same hardware. A minimal sketch of the measurement loop follows the accuracy notes below.
| Metric | Qwen3-32B | Gemma4-31B | Gemma4-26B MoE |
|---|---|---|---|
| Success rate | 25/25 | 25/25 | 25/25 |
| Avg response time | 2.50 s | 2.76 s | 1.58 s |
| LLM processing time | 2.46 s | 2.71 s | 1.54 s |
| Foreign language contamination | 0 chars | 0 chars | 0 chars |
| Model size | 19 GB | 20 GB | 17 GB |
| Active parameters | 32B | 31B | 3.8B |
| Acceleration | EAGLE-3 | enforce-eager | enforce-eager |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
Gemma4-26B MoE uses only 3.8B active parameters at inference time, yet matches Gemma4-31B in accuracy across all test categories. It is 37% faster than Qwen3-32B and 43% faster than Gemma4-31B. The MoE architecture's practical advantage is confirmed here with real numbers.
Accuracy by category:
- Simple fact retrieval — all three perfect.
- Classification/aggregation — all three accurate; Gemma4-31B tends toward longer outputs.
- Comparison/reasoning — Gemma4-31B occasionally deflects with "cannot determine" rather than committing to an answer.
- Follow-up questions (multi-turn context) — all three maintain context well.
- Edge cases — all three produced zero hallucinations, correctly responding "no data" when queried for non-existent records.
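As promised above, here is a minimal sketch of how per-question timings like these can be collected — a reconstruction, not the exact harness behind this post. Both vLLM and SGLang expose an OpenAI-compatible HTTP API, so a plain shell loop over the question set is enough. The endpoint, model name, and questions.txt (one prompt per line) are illustrative:

```bash
# Sequential benchmark loop: one request at a time, wall-clock timed.
# Prompt interpolation into JSON is naive; fine for simple one-line prompts.
while IFS= read -r prompt; do
  start=$(date +%s.%N)
  curl -s http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"gemma4-26b-moe-awq\",
         \"messages\": [{\"role\": \"user\", \"content\": \"${prompt}\"}]}" \
    -o /dev/null
  printf '%s: %.2f s\n' "$prompt" "$(echo "$(date +%s.%N) - $start" | bc)"
done < questions.txt
```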
Speed by Category
| Category | Qwen3-32B | Gemma4-31B | Gemma4-26B MoE |
|---|---|---|---|
| Simple fact retrieval | 0.99 s | 0.95 s | 0.66 s |
| Classification / aggregation | 4.18 s | 5.54 s | 2.95 s |
| Comparison / reasoning | 1.70 s | 1.52 s | 1.31 s |
| Follow-up questions | 3.27 s | 3.51 s | 1.20 s |
| Edge cases | 2.37 s | 2.30 s | 1.78 s |
The follow-up category is the most striking: Qwen3 at 3.27 s, Gemma4-31B at 3.51 s, but Gemma4-26B MoE at just 1.20 s. As conversation history grows, the KV cache and per-token attention work grow with it — but with only 3.8B active parameters processing each token, the MoE model absorbs the accumulated context far more cheaply.
GPU Temperature and Power
| State | Temperature | Power | VRAM | Note |
|---|---|---|---|---|
| Idle | 22–23°C | 8–11 W | 6 GB | No model loaded, GPU clock at 0 |
| Model serving (idle) | 24–26°C | 12–15 W | 89–90 GB | Model loaded in VRAM, no inference requests (GPU clock idle) |
| Benchmark peak | 30°C | 16 W | 90 GB | During actual inference |
Note: The low power draw during “model serving (idle)” is not because of exceptional cooling — it's because the GPU clock is idle when no inference requests are active. The model sits in VRAM but the compute units aren't running. Even the benchmark peak of 16 W is low because this was a sequential single-request test with low GPU compute density. Concurrent load tests would show significantly higher power and temperature.
Server room conditions: 18°C ambient, 40% humidity. The RTX PRO 6000 uses a blower-style workstation cooler suitable for enclosed server racks.
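Readings like those in the table can be captured with a standard nvidia-smi poll — our suggestion for reproducing the measurement, not necessarily how the numbers above were logged:

```bash
# Poll temperature, power draw, and VRAM use once per second during a run.
nvidia-smi --query-gpu=temperature.gpu,power.draw,memory.used \
           --format=csv,noheader -l 1
```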
Conclusion: Model Selection Guide
Speed winner: Gemma4-26B MoE AWQ. 1.58 s average, 37% faster than Qwen3-32B, 43% faster than Gemma4-31B. The MoE architecture delivers 31B-class accuracy at 3.8B active parameter inference cost.
Production choice: Qwen3-32B (for now). The SGLang + EAGLE-3 combination has been validated in production. Gemma 4 is running vLLM enforce-eager only — no CUDA graph optimization. Once SGLang's Gemma 4 PR is merged, a full re-comparison is planned.
Recommendations by use case: if speed is the top priority for real-time RAG, choose Gemma4-26B MoE AWQ. For a proven production setup, Qwen3-32B AWQ + SGLang + EAGLE-3 remains the stable choice. For Korean-language naturalness, Gemma4-26B MoE is noticeably better than both Qwen3 and Gemma4-31B. All three models are Apache 2.0 licensed — commercial use is permitted without additional approval.
One caveat: all Gemma 4 results here are from vLLM enforce-eager mode. CUDA graph optimization (when available via SGLang) could push the MoE model below 1 second average. That test is next.