
RTX PRO 6000 Token Speed — 6 LLMs Benchmarked at 350W

When deploying a local LLM into production, the first metric that matters is token generation speed (tok/s). We measured 6 models on an NVIDIA RTX PRO 6000 (96GB, 350W power limit) using the SGLang serving engine. The range spans from 218 tok/s down to 60 tok/s — but the fastest model isn't necessarily the best choice.

At a glance: 218 tok/s peak · 60 tok/s lowest · 350W power limit · 96GB VRAM

1. Test Setup

All models were tested under identical conditions to isolate pure speed differences. SGLang provides an OpenAI-compatible API, so existing code works without modification.

Test Configuration

GPU: RTX PRO 6000 (96GB VRAM)
Power limit: 350W (nvidia-smi -pl 350)
Serving engine: SGLang
Quantization: AWQ 4-bit
Temperature: 0.3 (fixed)
Test set: 60 questions, sequential execution
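Since SGLang exposes an OpenAI-compatible endpoint, the measurement loop needs nothing beyond the standard openai client. A minimal sketch of such a harness, assuming SGLang's default port 30000; the model ID and question list here are illustrative placeholders:

```python
import time
from openai import OpenAI  # the standard OpenAI client works unchanged

# SGLang's default server port is 30000; no real API key is needed locally.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

questions = ["Explain AWQ quantization in two sentences."]  # placeholder test set

total_tokens = 0
start = time.time()
for q in questions:
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-14B-AWQ",  # placeholder model ID
        messages=[{"role": "user", "content": q}],
        temperature=0.3,  # fixed across all runs, as in the benchmark
    )
    total_tokens += resp.usage.completion_tokens

elapsed = time.time() - start
print(f"{total_tokens} tokens in {elapsed:.1f}s -> {total_tokens / elapsed:.1f} tok/s")
```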

What is AWQ quantization?

Activation-aware Weight Quantization compresses model weights to 4 bits, reducing VRAM usage by ~75% while minimizing quality loss. With 96GB VRAM, even 14B models run comfortably with headroom to spare.
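The ~75% figure follows directly from the bit widths: 4-bit weights occupy a quarter of the space of 16-bit ones. A back-of-the-envelope estimate (weights only; KV cache, activations, and quantization scales are ignored):

```python
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """Approximate VRAM taken by model weights alone, in GiB."""
    return params_billion * 1e9 * bits / 8 / 1024**3

for name, size_b in [("Llama-3.1-8B", 8), ("Qwen3-14B", 14)]:
    fp16 = weight_vram_gb(size_b, 16)
    awq4 = weight_vram_gb(size_b, 4)
    print(f"{name}: FP16 ~{fp16:.1f} GiB, AWQ 4-bit ~{awq4:.1f} GiB")
# Even the 14B model's 4-bit weights (~6.5 GiB) leave huge headroom in 96GB.
```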

2. Speed Benchmark Results

Total execution time, total tokens generated, and average tok/s across 60 questions.

| Model | Total Time | Total Tokens | Avg tok/s | Avg Response |
|---|---|---|---|---|
| Llama-3.1-8B | 97s | 21,165 | 218 tok/s | 353 tokens |
| Qwen3-8B | 199s | 41,400 | 208 tok/s | 690 tokens |
| Phi-4 | 263s | 36,989 | 141 tok/s | 616 tokens |
| Qwen3-14B | 297s | 40,289 | 135 tok/s | 671 tokens |
| Gemma-3-12B | 258s | 22,088 | 86 tok/s | 368 tokens |
| KORMo-10B | 434s | 25,938 | 60 tok/s | 432 tokens |
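The two derived columns follow mechanically from the raw measurements. A quick sanity check over the table's values:

```python
# (model, total_seconds, total_tokens) copied from the benchmark table
runs = [
    ("Llama-3.1-8B", 97, 21_165),
    ("Qwen3-8B", 199, 41_400),
    ("Phi-4", 263, 36_989),
    ("Qwen3-14B", 297, 40_289),
    ("Gemma-3-12B", 258, 22_088),
    ("KORMo-10B", 434, 25_938),
]
N_QUESTIONS = 60
for name, secs, toks in runs:
    print(f"{name}: {toks / secs:.0f} tok/s, {toks / N_QUESTIONS:.0f} tokens/answer")
```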

Llama-3.1-8B — Fastest at 218 tok/s

Light 8B weights and short responses (avg 353 tokens) make it the speed champion. But fewer tokens mean thinner answers: its quality score of 2.67 ranks dead last.

Qwen3-8B — Runner-up at 208 tok/s

Similar parameter count to Llama but far richer responses (avg 690 tokens); its 41,400 total tokens are the highest of any model. At nearly Llama's throughput, that works out to roughly twice the content per answer. Quality score 3.47, solidly upper-tier.

Qwen3-14B — Best Balance at 135 tok/s

Slower than 8B models due to 14B parameters, but quality score of 3.86 takes overall first place. Rich responses (avg 671 tokens) make it the most satisfying choice for real-world deployment.

KORMo-10B — Slowest at 60 tok/s

A Korean-specialized model with unique strengths, but at 60 tok/s (less than a third of the top models' speed) and 434s total runtime, it's better suited to batch processing than real-time services.

3. Response Length by Scenario

The same model produces dramatically different response lengths depending on the scenario. Manufacturing questions draw the longest answers; legal queries the shortest.

| Model | A Mfg | B SaaS | C Medical | D Retail | E Legal | F Automation | G Korean |
|---|---|---|---|---|---|---|---|
| Qwen3-8B | 892 | 745 | 628 | 710 | 548 | 720 | 587 |
| Qwen3-14B | 865 | 712 | 605 | 685 | 540 | 698 | 592 |
| Gemma-3-12B | 445 | 398 | 320 | 375 | 285 | 362 | 392 |
| Phi-4 | 780 | 668 | 548 | 635 | 478 | 610 | 593 |
| Llama-3.1-8B | 432 | 385 | 298 | 362 | 265 | 348 | 381 |
| KORMo-10B | 548 | 465 | 392 | 445 | 348 | 418 | 408 |

* Values in tokens (average of 10 questions per scenario)
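Averaging each scenario column across all six models makes the pattern explicit. A short sketch over the table's values:

```python
scenarios = ["A Mfg", "B SaaS", "C Medical", "D Retail",
             "E Legal", "F Automation", "G Korean"]
rows = {  # per-scenario average response length, from the table above
    "Qwen3-8B":     [892, 745, 628, 710, 548, 720, 587],
    "Qwen3-14B":    [865, 712, 605, 685, 540, 698, 592],
    "Gemma-3-12B":  [445, 398, 320, 375, 285, 362, 392],
    "Phi-4":        [780, 668, 548, 635, 478, 610, 593],
    "Llama-3.1-8B": [432, 385, 298, 362, 265, 348, 381],
    "KORMo-10B":    [548, 465, 392, 445, 348, 418, 408],
}
means = {s: sum(r[i] for r in rows.values()) / len(rows)
         for i, s in enumerate(scenarios)}
for s, m in sorted(means.items(), key=lambda kv: -kv[1]):
    print(f"{s}: {m:.0f} tokens")  # A Mfg is longest (~660), E Legal shortest (~411)
```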

Longest: Manufacturing Scenario (A)

Complex queries about manufacturing processes require detailed, structured explanations with specifications and safety notes. Qwen3-8B averages 892 tokens — the longest across all scenarios.

Shortest: Legal Scenario (E)

Medical and legal queries often trigger safety refusals — "consult a professional" responses keep answers short. This is actually appropriate boundary awareness.

Response length ≠ Quality

Qwen3-8B generates the most tokens, but Qwen3-14B scores higher on quality. Gemma writes concisely but hits the key points. Llama writes briefly and inaccurately.

4. Speed vs Quality Trade-Off

Is the fastest model the best model? Plotting speed against quality reveals a clear optimal balance point.

| Model | Speed (tok/s) | Quality (/5) | Assessment |
|---|---|---|---|
| Llama-3.1-8B | 218 | 2.67 | Fastest speed, lowest quality |
| Qwen3-8B | 208 | 3.47 | Fast with decent quality |
| Phi-4 | 141 | 3.10 | Mid speed, weak on Korean |
| Qwen3-14B | 135 | 3.86 | Optimal balance (recommended) |
| Gemma-3-12B | 86 | 3.72 | Slow but best Korean |
| KORMo-10B | 60 | 3.46 | Slowest, Korean-specialized |

[Bar chart: Speed Comparison (tok/s): Llama-3.1-8B 218, Qwen3-8B 208, Phi-4 141, Qwen3-14B 135, Gemma-3-12B 86, KORMo-10B 60]

[Bar chart: Quality Comparison (/5): Qwen3-14B 3.86, Gemma-3-12B 3.72, Qwen3-8B 3.47, KORMo-10B 3.46, Phi-4 3.10, Llama-3.1-8B 2.67]
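One way to act on "speed × quality, not raw tok/s": treat speed as a hard floor for the use case and maximize quality among the models that clear it. A minimal sketch; the 100 tok/s real-time floor is our own illustrative assumption, not a number from the benchmark:

```python
models = {  # name: (tok/s, quality /5), from the trade-off table
    "Llama-3.1-8B": (218, 2.67),
    "Qwen3-8B": (208, 3.47),
    "Phi-4": (141, 3.10),
    "Qwen3-14B": (135, 3.86),
    "Gemma-3-12B": (86, 3.72),
    "KORMo-10B": (60, 3.46),
}

def pick(min_tok_s: float) -> str:
    """Highest-quality model among those meeting the speed floor."""
    ok = {m: q for m, (s, q) in models.items() if s >= min_tok_s}
    return max(ok, key=ok.get)

print(pick(100))  # illustrative real-time floor -> Qwen3-14B
print(pick(0))    # no speed constraint -> Qwen3-14B (overall quality leader)
```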

Trade-Off Recommendations

Real-time service → Qwen3-14B

At 135 tok/s, generation far outpaces reading speed, so streamed responses feel instant. Highest overall quality score.

Batch processing → Qwen3-8B

208 tok/s with solid quality. Best for processing large document volumes fast.

Korean quality priority → Gemma-3-12B

86 tok/s is slower, but Korean quality score of 4.28 is far above the rest. For services where accuracy is the competitive edge.

Key Takeaways

  • Llama-3.1-8B hits 218 tok/s peak but ranks last in quality (2.67/5)
  • Qwen3-14B at 135 tok/s is the optimal speed-quality sweet spot (3.86/5)
  • AWQ 4-bit quantization runs 14B models comfortably on 96GB VRAM
  • Manufacturing scenarios produce longest responses; medical/legal produce shortest
  • Choose models by 'speed × quality' efficiency, not raw tok/s alone

Conclusion

The RTX PRO 6000's 96GB VRAM and 350W power limit provide more than enough headroom for local LLM serving. But faster isn't always better. The key is finding the speed-quality balance that fits your use case. Overall, Qwen3-14B delivers the best all-around combination of speed, quality, and response richness.