RTX PRO 6000 Token Speed — 6 LLMs Benchmarked at 350W
When deploying a local LLM into production, the first metric that matters is token generation speed (tok/s). We measured 6 models on an NVIDIA RTX PRO 6000 (96GB, 350W power limit) using the SGLang serving engine. The range spans from 218 tok/s down to 60 tok/s — but the fastest model isn't necessarily the best choice.
Peak: 218 tok/s | Lowest: 60 tok/s | Power limit: 350W | VRAM: 96GB
1. Test Setup
All models were tested under identical conditions to isolate pure speed differences. SGLang provides an OpenAI-compatible API, so existing code works without modification.
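Because the API is OpenAI-compatible, a client needs nothing SGLang-specific. A minimal sketch using only the standard library, assuming a server on SGLang's default port 30000; the model identifier below is illustrative, and the request is only constructed here, not sent:

```python
import json
from urllib import request

# Assumed endpoint: SGLang's default port is 30000; adjust to your deployment.
BASE_URL = "http://localhost:30000/v1/chat/completions"

def build_request(prompt: str, model: str = "Qwen/Qwen3-14B-AWQ") -> request.Request:
    """Build an OpenAI-style chat completion request (constructed, not sent)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # stream tokens back as they are generated
    }
    return request.Request(
        BASE_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("Explain AWQ quantization in one paragraph.")
print(req.full_url)  # http://localhost:30000/v1/chat/completions
```

Sending it with `urllib.request.urlopen(req)` (or pointing any OpenAI SDK at the base URL) works against a running server without code changes.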
Test Configuration
What is AWQ quantization?
Activation-aware Weight Quantization compresses model weights to 4 bits, reducing VRAM usage by ~75% while minimizing quality loss. With 96GB VRAM, even 14B models run comfortably with headroom to spare.
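The ~75% figure is straightforward arithmetic on bytes per parameter (weights only; the KV cache and activations are extra):

```python
# Back-of-envelope weight memory for a 14B-parameter model.
params = 14e9
fp16_gb = params * 2 / 1e9    # FP16: 2 bytes per parameter
awq_gb  = params * 0.5 / 1e9  # AWQ 4-bit: 0.5 bytes per parameter

print(f"FP16: {fp16_gb:.0f} GB, AWQ 4-bit: {awq_gb:.0f} GB "
      f"({1 - awq_gb / fp16_gb:.0%} smaller)")
# → FP16: 28 GB, AWQ 4-bit: 7 GB (75% smaller)
```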
2. Speed Benchmark Results
Total execution time, total tokens generated, and average tok/s across 60 questions.
| Model | Total Time | Total Tokens | Avg tok/s | Avg Response |
|---|---|---|---|---|
| Llama-3.1-8B | 97s | 21,165 | 218 tok/s | 353 tokens |
| Qwen3-8B | 199s | 41,400 | 208 tok/s | 690 tokens |
| Phi-4 | 263s | 36,989 | 141 tok/s | 616 tokens |
| Qwen3-14B | 297s | 40,289 | 135 tok/s | 671 tokens |
| Gemma-3-12B | 258s | 22,088 | 86 tok/s | 368 tokens |
| KORMo-10B | 434s | 25,938 | 60 tok/s | 432 tokens |
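The derived columns follow directly from the raw totals (tok/s = total tokens / total time; average response = total tokens / 60 questions). A quick sanity check over the table:

```python
# Reported (total seconds, total tokens) per model from the benchmark table.
results = {
    "Llama-3.1-8B": (97, 21165),
    "Qwen3-8B":     (199, 41400),
    "Phi-4":        (263, 36989),
    "Qwen3-14B":    (297, 40289),
    "Gemma-3-12B":  (258, 22088),
    "KORMo-10B":    (434, 25938),
}
N_QUESTIONS = 60

for model, (secs, tokens) in results.items():
    tok_s = tokens / secs            # average generation speed
    avg_resp = tokens / N_QUESTIONS  # average response length
    print(f"{model:<13} {tok_s:6.1f} tok/s  avg {avg_resp:5.0f} tokens")
```

The recomputed values land within a token per second of the table's rounded figures.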
Llama-3.1-8B — Fastest at 218 tok/s
The lightest model in the lineup, paired with the shortest responses (avg 353 tokens), makes it the speed champion. But fewer total tokens mean thinner answers: its quality score of 2.67 ranks dead last.
Qwen3-8B — Runner-up at 208 tok/s
Similar parameter count to Llama but vastly richer responses (avg 690 tokens). 41,400 total tokens — highest across all models. More content generated per second. Quality score 3.47, solidly upper-tier.
Qwen3-14B — Best Balance at 135 tok/s
Slower than 8B models due to 14B parameters, but quality score of 3.86 takes overall first place. Rich responses (avg 671 tokens) make it the most satisfying choice for real-world deployment.
KORMo-10B — Slowest at 60 tok/s
A Korean-specialized model with unique strengths, but at barely a quarter of the fastest model's speed (60 vs 218 tok/s) and 434s total time, it's better suited to batch processing than real-time services.
3. Response Length by Scenario
The same model produces dramatically different response lengths depending on the scenario. Manufacturing questions draw the longest answers; legal queries the shortest.
| Model | A (Mfg) | B (SaaS) | C (Medical) | D (Retail) | E (Legal) | F (Automation) | G (Korean) |
|---|---|---|---|---|---|---|---|
| Qwen3-8B | 892 | 745 | 628 | 710 | 548 | 720 | 587 |
| Qwen3-14B | 865 | 712 | 605 | 685 | 540 | 698 | 592 |
| Gemma-3-12B | 445 | 398 | 320 | 375 | 285 | 362 | 392 |
| Phi-4 | 780 | 668 | 548 | 635 | 478 | 610 | 593 |
| Llama-3.1-8B | 432 | 385 | 298 | 362 | 265 | 348 | 381 |
| KORMo-10B | 548 | 465 | 392 | 445 | 348 | 418 | 408 |
* Values in tokens (average of 10 questions per scenario)
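The pattern holds for every row, not just Qwen3-8B: a quick check that each model's longest scenario is A (Manufacturing) and its shortest is E (Legal):

```python
# Per-scenario average response lengths (tokens) from the table above.
scenarios = ["A Mfg", "B SaaS", "C Medical", "D Retail", "E Legal",
             "F Automation", "G Korean"]
lengths = {
    "Qwen3-8B":     [892, 745, 628, 710, 548, 720, 587],
    "Qwen3-14B":    [865, 712, 605, 685, 540, 698, 592],
    "Gemma-3-12B":  [445, 398, 320, 375, 285, 362, 392],
    "Phi-4":        [780, 668, 548, 635, 478, 610, 593],
    "Llama-3.1-8B": [432, 385, 298, 362, 265, 348, 381],
    "KORMo-10B":    [548, 465, 392, 445, 348, 418, 408],
}

for model, row in lengths.items():
    longest = scenarios[row.index(max(row))]
    shortest = scenarios[row.index(min(row))]
    print(f"{model:<13} longest: {longest:<7} shortest: {shortest}")
```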
Longest: Manufacturing Scenario (A)
Complex queries about manufacturing processes require detailed, structured explanations with specifications and safety notes. Qwen3-8B averages 892 tokens — the longest across all scenarios.
Shortest: Legal Scenario (E)
Medical and legal queries often trigger safety refusals — "consult a professional" responses keep answers short. This is actually appropriate boundary awareness.
Response length ≠ Quality
Qwen3-8B generates the most tokens, but Qwen3-14B scores higher on quality. Gemma writes concisely but hits the key points. Llama writes briefly and inaccurately.
4. Speed vs Quality Trade-Off
Is the fastest model the best model? Plotting speed against quality reveals a clear optimal balance point.
| Model | Speed (tok/s) | Quality (/5) | Assessment |
|---|---|---|---|
| Llama-3.1-8B | 218 | 2.67 | Fastest speed, lowest quality |
| Qwen3-8B | 208 | 3.47 | Fast with decent quality |
| Phi-4 | 141 | 3.10 | Mid speed, weak on Korean |
| Qwen3-14B | 135 | 3.86 | Optimal balance (recommended) |
| Gemma-3-12B | 86 | 3.72 | Slow but best Korean |
| KORMo-10B | 60 | 3.46 | Slowest, Korean-specialized |
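The takeaway "choose by speed × quality" can be made concrete. A sketch using the table's numbers; the plain product is one possible weighting, not a formula from the article. Note that it rewards raw throughput, so it ranks Qwen3-8B first; the article's pick of Qwen3-14B reflects weighting answer quality above tok/s for user-facing services:

```python
# (speed tok/s, quality /5) pairs from the trade-off table.
models = {
    "Llama-3.1-8B": (218, 2.67),
    "Qwen3-8B":     (208, 3.47),
    "Phi-4":        (141, 3.10),
    "Qwen3-14B":    (135, 3.86),
    "Gemma-3-12B":  (86,  3.72),
    "KORMo-10B":    (60,  3.46),
}

# Naive efficiency: speed times quality, sorted descending.
ranking = sorted(models.items(), key=lambda kv: kv[1][0] * kv[1][1], reverse=True)
for name, (speed, quality) in ranking:
    print(f"{name:<13} {speed * quality:7.1f}")
```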
[Charts: Speed Comparison (tok/s) and Quality Comparison (/5); data as in the table above]
Trade-Off Recommendations
Real-time service → Qwen3-14B
At 135 tok/s, a full 671-token answer streams in about five seconds, far faster than anyone reads. Highest overall quality score.
Batch processing → Qwen3-8B
208 tok/s with solid quality. Best for processing large document volumes fast.
Korean quality priority → Gemma-3-12B
86 tok/s is slower, but Korean quality score of 4.28 is far above the rest. For services where accuracy is the competitive edge.
Key Takeaways
- ✓ Llama-3.1-8B hits 218 tok/s peak but ranks last in quality (2.67/5)
- ✓ Qwen3-14B at 135 tok/s is the optimal speed-quality sweet spot (3.86/5)
- ✓ AWQ 4-bit quantization runs 14B models comfortably on 96GB VRAM
- ✓ Manufacturing scenarios produce the longest responses; medical/legal the shortest
- ✓ Choose models by 'speed × quality' efficiency, not raw tok/s alone
Conclusion
The RTX PRO 6000's 96GB VRAM and 350W power limit provide more than enough headroom for local LLM serving. But faster isn't always better. The key is finding the speed-quality balance that fits your use case. Overall, Qwen3-14B delivers the best all-around combination of speed, quality, and response richness.