
Gemma 4 Benchmark on RTX PRO 6000 — 8 Attempts, 3 Models, and a Qwen3 Showdown

2026-04-05
Treeru

Google announced Gemma 4 on April 3, 2026. Two days later, we ran it on an RTX PRO 6000 Blackwell (96 GB VRAM, SM 12.0). The short version: 8 attempts, 7 failures. That failure log is the main value of this post. Gemma 4 is brand new, and the ecosystem — particularly vLLM and SGLang — hadn't fully caught up yet. The combination with Blackwell's SM 12.0 architecture made it harder than expected.

Test Environment

Hardware: RTX PRO 6000 Blackwell (96 GB VRAM, SM 12.0), AMD Ryzen 9 9950X3D 16-core, 96 GB DDR5, 350 W power limit (TDP 600 W), Ubuntu 24.04 LTS, CUDA 13.1. Three models were compared after surviving the setup gauntlet:

Model               Engine        Quantization              Size   Active Params
Qwen3-32B-AWQ       SGLang 0.5.9  AWQ 4-bit                 19 GB  32B (all)
Gemma4-31B-AWQ      vLLM 0.19.0   AWQ 4-bit                 20 GB  31B (all)
Gemma4-26B-MoE-AWQ  vLLM 0.19.0   compressed-tensors 4-bit  17 GB  3.8B only

Qwen3-32B served as the baseline — it has been running in production with SGLang + EAGLE-3 for months. Both Gemma 4 models were served with vLLM in enforce-eager mode. The reason for this unusual configuration is documented below.

The 8-Attempt Log

This is the most important section. If you are trying to run Gemma 4 on a Blackwell GPU right now, this may save you several hours.

Attempt 1 — SGLang 0.5.9 + BF16 → FAIL. The first try was the obvious one: use the engine already running in production. Error: KeyError: 'gemma4' — transformers doesn't recognize the Gemma4ForCausalLM architecture. Gemma 4 was too new for the pip release. Fix attempted: install transformers from GitHub main branch directly.
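This failure mode is generic: a model released days ago almost always postdates the newest pip wheels. A tiny pre-flight version gate can turn the runtime KeyError into an actionable decision before the server even starts. This is a sketch; the cutoff version below is a placeholder, not Gemma 4's actual minimum transformers release.

```python
# Sketch: fail fast when the installed transformers release predates a newly
# announced architecture. The required version here is a placeholder, not
# Gemma 4's real minimum.

def needs_source_install(installed: str, required: str) -> bool:
    """True if `installed` is older than `required` (dotted numeric versions)."""
    def parse(version: str) -> tuple:
        return tuple(int(part) for part in version.split("."))
    return parse(installed) < parse(required)

# An older pip wheel raises KeyError: 'gemma4' at load time; checking up
# front turns that into a clear "install from main" decision.
print(needs_source_install("4.57.0", "4.58.0"))  # → True
```

When the check returns True, the fallback used in this log applies: pip install git+https://github.com/huggingface/transformers.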

Attempt 2 — SGLang 0.5.10rc0 + BF16 → FAIL. After replacing transformers with GitHub main and upgrading SGLang to 0.5.10rc0: RuntimeError: CUDA graph capture failed — Expected head_dim=256, got 512 in layer 12. Gemma 4 uses mixed head dimensions (256/512 per layer), and SGLang's CUDA graph capture doesn't handle this. Fix attempted: add --disable-cuda-graph.

Attempt 3 — SGLang + CUDA graph disabled → FAIL. New error after disabling CUDA graph: AttributeError: module 'sglang.srt.modalities' has no attribute 'MULTI_IMAGES'. The SGLang PR for Gemma 4 multimodal support had not been merged yet (as of 2026-04-05). Conclusion: SGLang cannot run Gemma 4 at this time. Switch to vLLM.

Attempt 4 — vLLM 0.19.0 + BF16 → FAIL. Fresh vLLM 0.19.0 install, BF16 serving. Error: triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 98304, Hardware limit: 65536 (SM 12.0 / Blackwell). The Triton kernels exceed the per-block shared memory limit on SM 12.0. RTX 4090 (SM 8.9) and H100 (SM 9.0) don't have this problem. Fix attempted: --enforce-eager to disable CUDA graph.
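The numbers in the error are consistent with a software-pipelined fp16 GEMM tile. The tile shape below is a guess that happens to reproduce the reported 98,304 bytes, not vLLM's actual kernel configuration; it is only meant to show how a kernel arrives at a shared-memory requirement above the hardware limit.

```python
# Back-of-envelope reconstruction of the Triton shared-memory error. The
# tile shape is an illustrative guess, not vLLM's real kernel config.

def tile_smem_bytes(block_m: int, block_n: int, block_k: int,
                    dtype_bytes: int = 2, stages: int = 2) -> int:
    # Software pipelining keeps `stages` copies of the A and B input tiles
    # resident in shared memory at the same time.
    a_tile = block_m * block_k * dtype_bytes
    b_tile = block_k * block_n * dtype_bytes
    return stages * (a_tile + b_tile)

required = tile_smem_bytes(block_m=128, block_n=256, block_k=64)
print(required)  # → 98304, over the 65,536-byte limit reported on SM 12.0
```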

Attempt 5 — vLLM + enforce-eager + BF16 → FAIL. The triton issue was gone, but the attention layer crashed: ImportError: cannot import name 'flash_attn_func' from 'flash_attn.ops'. The pip-installed flash-attn wasn't compiled for SM 12.0. A source build was needed. But then CUDA version mismatch appeared.

Attempt 6 — flash-attn source build → CUDA version mismatch. Error: System CUDA: 13.1 / PyTorch CUDA: 12.8 — flash-attn requires matching CUDA versions. The system had CUDA 13.1 but PyTorch was installed with cu128. Solution: (1) replace PyTorch with the cu130 build, (2) remove nvidia-nccl-cu12 (conflict source), (3) build flash-attn from source (~20 minutes). After this, vLLM started successfully in BF16 mode for the first time.
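The check that attempt 6 tripped over is trivial, but it cost the most wall-clock time in this log. A two-line sketch, using the versions from this environment; major-version agreement is the usual bar for source-building flash-attn against PyTorch's bundled CUDA.

```python
# Sketch of the toolkit-vs-wheel CUDA check, with versions from this post.

def cuda_major(version: str) -> int:
    return int(version.split(".")[0])

def builds_will_match(system_cuda: str, torch_cuda: str) -> bool:
    # flash-attn source builds need nvcc and PyTorch's CUDA to agree,
    # at minimum on the major version.
    return cuda_major(system_cuda) == cuda_major(torch_cuda)

print(builds_will_match("13.1", "12.8"))  # → False: cu128 wheel, CUDA 13.1 toolkit
print(builds_will_match("13.1", "13.0"))  # → True: the cu130 PyTorch build fixes it
```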

Attempt 7 — vLLM + BF16 → partial success, context problem. Server was up. But during actual testing: Warning: max_model_len reduced to 4096. The model weights consumed 85 GB, leaving insufficient KV cache space for longer contexts. With the context capped at 4,096 tokens, 9 of the 25 test cases failed with context-limit errors (success rate: 16/25).

Attempt 8 — AWQ quantization → SUCCESS. Switched to AWQ 4-bit version. Model weights dropped to 20 GB, freeing 70+ GB for KV cache. Context length: 16,384. Success rate: 25/25.

Why AWQ Was Necessary: BF16 vs AWQ

Metric                          BF16    AWQ 4-bit
VRAM (model weights)            85 GB   20 GB
Context length                  4,096   16,384
Success rate (25 tests)         16/25   25/25
Avg response (successful only)  1.02 s  2.76 s

BF16 is faster when it succeeds — 1.02 s average vs 2.76 s for AWQ. But the 4,096 context cap caused 9 failures. A model that randomly fails on long inputs isn't usable in production. AWQ sacrifices raw speed in exchange for reliability. That's not a trade-off; it's a requirement.
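The context cap follows from straightforward arithmetic, not from any vLLM quirk. A rough budget model reproduces the same order of magnitude; note that the layer count, KV-head count, and head dimension below are illustrative assumptions, not Gemma 4's published config.

```python
# Rough KV-cache budget behind the table above. Layer/head dimensions are
# illustrative guesses; the point is the shape of the trade-off.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # x2 for the key and value tensors stored at every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_context(total_vram_gb: float, weights_gb: float,
                per_token_bytes: int, utilization: float = 0.9) -> int:
    # The serving engine budgets total * utilization; weights come out of
    # that, and the remainder is available for KV cache.
    free_bytes = (total_vram_gb * utilization - weights_gb) * 1024**3
    return max(int(free_bytes // per_token_bytes), 0)

per_tok = kv_bytes_per_token(layers=48, kv_heads=8, head_dim=256)  # ~384 KB
print(max_context(96, 85, per_tok))  # BF16: a few thousand tokens at best
print(max_context(96, 20, per_tok))  # AWQ: 16,384 fits with huge headroom
```

Under these assumptions the BF16 figure lands near the observed 4,096 cap, while the AWQ configuration leaves room for far more than 16,384 tokens.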

3-Model Comparison: Overall Results

All three models were tested on the same 25 questions (5 categories × 5 questions) connected to a RAG pipeline, run sequentially on the same hardware.

Metric                          Qwen3-32B   Gemma4-31B     Gemma4-26B MoE
Success rate                    25/25       25/25          25/25
Avg response time               2.50 s      2.76 s         1.58 s
LLM processing time             2.46 s      2.71 s         1.54 s
Foreign language contamination  0 chars     0 chars        0 chars
Model size                      19 GB       20 GB          17 GB
Active parameters               32B         31B            3.8B
Acceleration                    EAGLE-3     enforce-eager  enforce-eager
License                         Apache 2.0  Apache 2.0     Apache 2.0

Gemma4-26B MoE uses only 3.8B active parameters at inference time, yet matches Gemma4-31B in accuracy across all test categories. It is 37% faster than Qwen3-32B and 43% faster than Gemma4-31B. The MoE architecture's practical advantage is confirmed here with real numbers.
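Those percentages come straight from the table's average response times:

```python
# Sanity check on the speedup claims, from the averages in the table above.

def speedup_pct(baseline_s: float, candidate_s: float) -> int:
    return round((baseline_s - candidate_s) / baseline_s * 100)

print(speedup_pct(2.50, 1.58))  # → 37 (MoE vs Qwen3-32B)
print(speedup_pct(2.76, 1.58))  # → 43 (MoE vs Gemma4-31B)
```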

Accuracy by category:
- Simple fact retrieval: all three perfect.
- Classification / aggregation: all three accurate; Gemma4-31B tends toward longer outputs.
- Comparison / reasoning: Gemma4-31B occasionally deflects with "cannot determine" rather than committing to an answer.
- Follow-up questions (multi-turn context): all three maintain context well.
- Edge cases: all three produced zero hallucinations, correctly responding "no data" when queried for non-existent records.

Speed by Category

Category                      Qwen3-32B  Gemma4-31B  Gemma4-26B MoE
Simple fact retrieval         0.99 s     0.95 s      0.66 s
Classification / aggregation  4.18 s     5.54 s      2.95 s
Comparison / reasoning        1.70 s     1.52 s      1.31 s
Follow-up questions           3.27 s     3.51 s      1.20 s
Edge cases                    2.37 s     2.30 s      1.78 s

The follow-up category is the most striking: Qwen3 at 3.27 s, Gemma4-31B at 3.51 s, but Gemma4-26B MoE at just 1.20 s. As conversation history grows, KV cache computation increases — but with only 3.8B active parameters handling each token, the MoE model handles accumulated context much more efficiently.
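A rough way to see why active parameters matter: decoder inference costs about 2 FLOPs per active parameter per generated token, so the compute ratio tracks active parameters rather than total model size. The first-order estimate below ignores attention over the KV cache (which scales with context and is similar for both models), which is why measured speedups in the tables are smaller than this raw FLOP ratio.

```python
# First-order cost model: ~2 FLOPs per active parameter per generated token.
# This ignores attention over the KV cache, so real latency ratios in the
# tables above are much smaller than this raw FLOP ratio.

def flops_per_token(active_params_billions: float) -> float:
    return 2.0 * active_params_billions * 1e9

dense = flops_per_token(32.0)  # Qwen3-32B: all parameters active
moe = flops_per_token(3.8)     # Gemma4-26B MoE: routed experts only
print(round(dense / moe, 1))   # → 8.4
```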

GPU Temperature and Power

State                 Temperature  Power    VRAM      Note
Idle                  22–23°C      8–11 W   6 GB      No model loaded, GPU clock at 0
Model serving (idle)  24–26°C      12–15 W  89–90 GB  Model loaded in VRAM, no inference requests (GPU clock idle)
Benchmark peak        30°C         16 W     90 GB     During actual inference

Note: The low power draw during “model serving (idle)” is not because of exceptional cooling — it's because the GPU clock is idle when no inference requests are active. The model sits in VRAM but the compute units aren't running. Even the benchmark peak of 16 W is low because this was a sequential single-request test with low GPU compute density. Concurrent load tests would show significantly higher power and temperature.

Server room conditions: 18°C ambient, 40% humidity. The RTX PRO 6000 uses a blower-style workstation cooler suitable for enclosed server racks.

Conclusion: Model Selection Guide

Speed winner: Gemma4-26B MoE AWQ. 1.58 s average, 37% faster than Qwen3-32B, 43% faster than Gemma4-31B. The MoE architecture delivers 31B-class accuracy at 3.8B active parameter inference cost.

Production choice: Qwen3-32B (for now). The SGLang + EAGLE-3 combination has been validated in production. Gemma 4 is running vLLM enforce-eager only — no CUDA graph optimization. Once SGLang's Gemma 4 PR is merged, a full re-comparison is planned.

Recommendations by use case: if speed is the top priority for real-time RAG, choose Gemma4-26B MoE AWQ. For a proven production setup, Qwen3-32B AWQ + SGLang + EAGLE-3 remains the stable choice. For Korean language naturalness, Gemma 4 26B MoE is noticeably better than both Qwen3 and Gemma4-31B. All three models are Apache 2.0 licensed — commercial use permitted without additional approval.

One caveat: all Gemma 4 results here are from vLLM enforce-eager mode. CUDA graph optimization (when available via SGLang) could push the MoE model below 1 second average. That test is next.



© 2026 TreeRU. All rights reserved.

All content is copyrighted by TreeRU. Unauthorized reproduction without attribution is prohibited.