Gemma 4 Benchmark on RTX PRO 6000 — 8 Attempts, 3 Models, and a Qwen3 Showdown
Google announced Gemma 4 on April 3, 2026. Two days later, we ran it on an RTX PRO 6000 Blackwell (96 GB VRAM, SM 12.0). The short version: 8 attempts, 7 failures. That failure log is the main value of this post. Gemma 4 is brand new, and the ecosystem — particularly vLLM and SGLang — hadn't fully caught up yet. The combination with Blackwell's SM 12.0 architecture made it harder than expected.
Test Environment
Hardware: RTX PRO 6000 Blackwell (96 GB VRAM, SM 12.0), AMD Ryzen 9 9950X3D 16-core, 96 GB DDR5, power limit set to 350 W (stock TDP 600 W), Ubuntu 24.04 LTS, CUDA 13.1. Three models were compared after surviving the setup gauntlet:
| Model | Engine | Quantization | Size | Active Params |
|---|---|---|---|---|
| Qwen3-32B-AWQ | SGLang 0.5.9 | AWQ 4-bit | 19 GB | 32B (all) |
| Gemma4-31B-AWQ | vLLM 0.19.0 | AWQ 4-bit | 20 GB | 31B (all) |
| Gemma4-26B-MoE-AWQ | vLLM 0.19.0 | compressed-tensors 4-bit | 17 GB | 3.8B only |
Qwen3-32B served as the baseline — it has been running in production with SGLang + EAGLE-3 for months. Both Gemma 4 models were served with vLLM in enforce-eager mode. The reason for this unusual configuration is documented below.
The 8-Attempt Log
This is the most important section. If you are trying to run Gemma 4 on a Blackwell GPU right now, this may save you several hours.
Attempt 1 — SGLang 0.5.9 + BF16 → FAIL. The first try was the obvious one: use the engine already running in production. Error: KeyError: 'gemma4' — transformers doesn't recognize the Gemma4ForCausalLM architecture. Gemma 4 was too new for the pip release. Fix attempted: install transformers from GitHub main branch directly.
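For reference, this is the standard way to pull an architecture that has landed on transformers main but not yet shipped in a pip release (the exact commit needed for Gemma 4 isn't pinned here):

```bash
# Install transformers from GitHub main so Gemma4ForCausalLM is registered.
# (Unpinned: at the time of writing, no pip release included Gemma 4.)
pip install --upgrade git+https://github.com/huggingface/transformers.git
```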
Attempt 2 — SGLang 0.5.10rc0 + BF16 → FAIL. After replacing transformers with GitHub main and upgrading SGLang to 0.5.10rc0: RuntimeError: CUDA graph capture failed — Expected head_dim=256, got 512 in layer 12. Gemma 4 uses mixed head dimensions (256/512 per layer). SGLang's CUDA graph doesn't handle this. Fix attempted: add `--disable-cuda-graph`.
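A sketch of the launch that Attempt 3 then used, assuming SGLang's standard server entry point; the model path is illustrative, not a confirmed Hugging Face repo id:

```bash
# Attempt 3 configuration: SGLang server with CUDA graph capture disabled
# to sidestep the mixed 256/512 head_dim issue. Model path is illustrative.
python -m sglang.launch_server \
  --model-path google/gemma4-31b \
  --disable-cuda-graph \
  --port 30000
```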
Attempt 3 — SGLang + CUDA graph disabled → FAIL. New error after disabling CUDA graph: AttributeError: module 'sglang.srt.modalities' has no attribute 'MULTI_IMAGES'. The SGLang PR for Gemma 4 multimodal support had not been merged yet (as of 2026-04-05). Conclusion: SGLang cannot run Gemma 4 at this time. Switch to vLLM.
Attempt 4 — vLLM 0.19.0 + BF16 → FAIL. Fresh vLLM 0.19.0 install, BF16 serving. Error: triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 98304, Hardware limit: 65536 (SM 12.0 / Blackwell). The triton kernels exceed shared memory limits on SM 12.0. RTX 4090 (SM 8.9) and H100 (SM 9.0) don't have this problem. Fix attempted: `--enforce-eager` to disable CUDA graph capture.
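The corresponding launch, sketched with an illustrative model id (`--enforce-eager` is a standard vLLM flag that disables CUDA graph capture and runs every step in eager mode):

```bash
# Attempt 5 configuration: vLLM in eager mode, skipping CUDA graph capture
# and the Triton path that blows the 64 KB shared-memory limit on SM 12.0.
vllm serve google/gemma4-31b --enforce-eager
```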
Attempt 5 — vLLM + enforce-eager + BF16 → FAIL. The triton issue was gone, but the attention layer crashed: ImportError: cannot import name 'flash_attn_func' from 'flash_attn.ops'. The pip-installed flash-attn wasn't compiled for SM 12.0. A source build was needed, which in turn surfaced a CUDA version mismatch.
Attempt 6 — flash-attn source build → CUDA version mismatch. Error: System CUDA: 13.1 / PyTorch CUDA: 12.8 — flash-attn requires matching CUDA versions. The system had CUDA 13.1 but PyTorch was installed with cu128. Solution: (1) replace PyTorch with the cu130 build, (2) remove nvidia-nccl-cu12 (the source of the conflict), (3) build flash-attn from source (~20 minutes). After this, vLLM started successfully in BF16 mode for the first time.
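The three steps as shell commands — a sketch under two assumptions: that PyTorch publishes a cu130 wheel index following its usual URL scheme, and that flash-attn's standard source-build path (`--no-build-isolation`) is the one used:

```bash
# (1) Replace PyTorch with the cu130 build to match system CUDA 13.1.
#     Index URL assumes PyTorch's usual wheel-index naming scheme.
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu130
# (2) Remove the cu12 NCCL package that conflicts with the cu13 stack.
pip uninstall -y nvidia-nccl-cu12
# (3) Build flash-attn from source for SM 12.0 (took ~20 minutes here).
MAX_JOBS=8 pip install flash-attn --no-build-isolation
```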
Attempt 7 — vLLM + BF16 → partial success, context problem. Server was up. But during actual testing: Warning: max_model_len reduced to 4096. The model weights consumed 85 GB, leaving insufficient KV cache space for longer contexts. With context 4,096, 9 out of 25 test cases failed due to context limit errors (success rate: 16/25).
Attempt 8 — AWQ quantization → SUCCESS. Switched to AWQ 4-bit version. Model weights dropped to 20 GB, freeing 70+ GB for KV cache. Context length: 16,384. Success rate: 25/25.
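Putting it together, the final working configuration looks roughly like this (model id and port are illustrative; the flags are standard vLLM options):

```bash
# Attempt 8: AWQ 4-bit weights (~20 GB), explicit 16K context, eager mode.
vllm serve google/gemma4-31b-awq \
  --quantization awq \
  --max-model-len 16384 \
  --enforce-eager \
  --port 8000
```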
Why AWQ Was Necessary: BF16 vs AWQ
| Metric | BF16 | AWQ 4-bit |
|---|---|---|
| VRAM (model weights) | 85 GB | 20 GB |
| Context length | 4,096 | 16,384 |
| Success rate (25 tests) | 16/25 | 25/25 |
| Avg response (successful only) | 1.02 s | 2.76 s |
BF16 is faster when it succeeds — 1.02 s average vs 2.76 s for AWQ. But the 4,096 context cap caused 9 failures. A model that randomly fails on long inputs isn't usable in production. AWQ sacrifices raw speed in exchange for reliability. That's not a trade-off; it's a requirement.
3-Model Comparison: Overall Results
All three models were tested on the same 25 questions (5 categories × 5 questions) connected to a RAG pipeline, run sequentially on the same hardware. A minimal sketch of the measurement loop follows the accuracy notes below.
| Metric | Qwen3-32B | Gemma4-31B | Gemma4-26B MoE |
|---|---|---|---|
| Success rate | 25/25 | 25/25 | 25/25 |
| Avg response time | 2.50 s | 2.76 s | 1.58 s |
| LLM processing time | 2.46 s | 2.71 s | 1.54 s |
| Foreign language contamination | 0 chars | 0 chars | 0 chars |
| Model size | 19 GB | 20 GB | 17 GB |
| Active parameters | 32B | 31B | 3.8B |
| Acceleration | EAGLE-3 | enforce-eager | enforce-eager |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
Gemma4-26B MoE uses only 3.8B active parameters at inference time, yet matches Gemma4-31B in accuracy across all test categories. It is 37% faster than Qwen3-32B and 43% faster than Gemma4-31B. The MoE architecture's practical advantage is confirmed here with real numbers.
Accuracy by category:
- Simple fact retrieval — all three perfect.
- Classification/aggregation — all three accurate; Gemma4-31B tends toward longer outputs.
- Comparison/reasoning — Gemma4-31B occasionally deflects with "cannot determine" rather than committing to an answer.
- Follow-up questions (multi-turn context) — all three maintain context well.
- Edge cases — all three produced zero hallucinations, correctly responding "no data" when queried for non-existent records.
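As promised above, here is a minimal sketch of how per-question timings like these can be collected — a reconstruction, not the exact harness behind this post. Both vLLM and SGLang expose an OpenAI-compatible HTTP API, so a plain shell loop over the question set is enough. The endpoint, model name, and questions.txt (one prompt per line) are illustrative:

```bash
# Sequential benchmark loop: one request at a time, wall-clock timed.
# Prompt interpolation into JSON is naive; fine for simple one-line prompts.
while IFS= read -r prompt; do
  start=$(date +%s.%N)
  curl -s http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"gemma4-26b-moe-awq\",
         \"messages\": [{\"role\": \"user\", \"content\": \"${prompt}\"}]}" \
    -o /dev/null
  printf '%s: %.2f s\n' "$prompt" "$(echo "$(date +%s.%N) - $start" | bc)"
done < questions.txt
```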
Speed by Category
| Category | Qwen3-32B | Gemma4-31B | Gemma4-26B MoE |
|---|---|---|---|
| Simple fact retrieval | 0.99 s | 0.95 s | 0.66 s |
| Classification / aggregation | 4.18 s | 5.54 s | 2.95 s |
| Comparison / reasoning | 1.70 s | 1.52 s | 1.31 s |
| Follow-up questions | 3.27 s | 3.51 s | 1.20 s |
| Edge cases | 2.37 s | 2.30 s | 1.78 s |
The follow-up category is the most striking: Qwen3 at 3.27 s, Gemma4-31B at 3.51 s, but Gemma4-26B MoE at just 1.20 s. As conversation history grows, the KV cache and per-token attention work grow with it — but with only 3.8B active parameters processing each token, the MoE model absorbs the accumulated context far more cheaply.
GPU Temperature and Power
| State | Temperature | Power | VRAM | Note |
|---|---|---|---|---|
| Idle | 22–23°C | 8–11 W | 6 GB | No model loaded, GPU clock at 0 |
| Model serving (idle) | 24–26°C | 12–15 W | 89–90 GB | Model loaded in VRAM, no inference requests (GPU clock idle) |
| Benchmark peak | 30°C | 16 W | 90 GB | During actual inference |
Note: The low power draw during “model serving (idle)” is not because of exceptional cooling — it's because the GPU clock is idle when no inference requests are active. The model sits in VRAM but the compute units aren't running. Even the benchmark peak of 16 W is low because this was a sequential single-request test with low GPU compute density. Concurrent load tests would show significantly higher power and temperature.
Server room conditions: 18°C ambient, 40% humidity. The RTX PRO 6000 uses a blower-style workstation cooler suitable for enclosed server racks.
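Readings like those in the table can be captured with a standard nvidia-smi poll — our suggestion for reproducing the measurement, not necessarily how the numbers above were logged:

```bash
# Poll temperature, power draw, and VRAM use once per second during a run.
nvidia-smi --query-gpu=temperature.gpu,power.draw,memory.used \
           --format=csv,noheader -l 1
```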
Conclusion: Model Selection Guide
Speed winner: Gemma4-26B MoE AWQ. 1.58 s average, 37% faster than Qwen3-32B, 43% faster than Gemma4-31B. The MoE architecture delivers 31B-class accuracy at 3.8B active parameter inference cost.
Production choice: Qwen3-32B (for now). The SGLang + EAGLE-3 combination has been validated in production. Gemma 4 is running vLLM enforce-eager only — no CUDA graph optimization. Once SGLang's Gemma 4 PR is merged, a full re-comparison is planned.
Recommendations by use case: if speed is the top priority for real-time RAG, choose Gemma4-26B MoE AWQ. For a proven production setup, Qwen3-32B AWQ + SGLang + EAGLE-3 remains the stable choice. For Korean-language naturalness, Gemma4-26B MoE is noticeably better than both Qwen3 and Gemma4-31B. All three models are Apache 2.0 licensed — commercial use is permitted without additional approval.
One caveat: all Gemma 4 results here are from vLLM enforce-eager mode. CUDA graph optimization (when available via SGLang) could push the MoE model below 1 second average. That test is next.