
RTX PRO 6000 Token Speed — 6 LLMs Benchmarked at 350W

When deploying a local LLM into production, the first metric that matters is token generation speed (tok/s). We measured 6 models on an NVIDIA RTX PRO 6000 (96GB, 350W power limit) using the SGLang serving engine. The range spans from 218 tok/s down to 60 tok/s — but the fastest model isn't necessarily the best choice.

At a glance: 218 tok/s peak · 60 tok/s lowest · 350W power limit · 96GB VRAM

1. Test Setup

All models were tested under identical conditions to isolate pure speed differences. SGLang provides an OpenAI-compatible API, so existing code works without modification.

Test Configuration

GPU: RTX PRO 6000 (96GB VRAM)
Power limit: 350W (nvidia-smi -pl 350)
Serving engine: SGLang
Quantization: AWQ 4-bit
Temperature: 0.3 (fixed)
Test set: 60 questions, sequential execution
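Since SGLang exposes an OpenAI-compatible endpoint, the measurement loop needs nothing beyond the standard openai client. A minimal sketch of such a harness, assuming SGLang's default port 30000; the model ID and question list here are illustrative placeholders:

```python
import time
from openai import OpenAI  # the standard OpenAI client works unchanged

# SGLang's default server port is 30000; no real API key is needed locally.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

questions = ["Explain AWQ quantization in two sentences."]  # placeholder test set

total_tokens = 0
start = time.time()
for q in questions:
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-14B-AWQ",  # placeholder model ID
        messages=[{"role": "user", "content": q}],
        temperature=0.3,  # fixed across all runs, as in the benchmark
    )
    total_tokens += resp.usage.completion_tokens

elapsed = time.time() - start
print(f"{total_tokens} tokens in {elapsed:.1f}s -> {total_tokens / elapsed:.1f} tok/s")
```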

What is AWQ quantization?

Activation-aware Weight Quantization compresses model weights to 4 bits, reducing VRAM usage by ~75% while minimizing quality loss. With 96GB VRAM, even 14B models run comfortably with headroom to spare.
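The ~75% figure follows directly from the bit widths: 4-bit weights occupy a quarter of the space of 16-bit ones. A back-of-the-envelope estimate (weights only; KV cache, activations, and quantization scales are ignored):

```python
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """Approximate VRAM taken by model weights alone, in GiB."""
    return params_billion * 1e9 * bits / 8 / 1024**3

for name, size_b in [("Llama-3.1-8B", 8), ("Qwen3-14B", 14)]:
    fp16 = weight_vram_gb(size_b, 16)
    awq4 = weight_vram_gb(size_b, 4)
    print(f"{name}: FP16 ~{fp16:.1f} GiB, AWQ 4-bit ~{awq4:.1f} GiB")
# Even the 14B model's 4-bit weights (~6.5 GiB) leave huge headroom in 96GB.
```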

2. Speed Benchmark Results

Total execution time, total tokens generated, and average tok/s across 60 questions.

| Model | Total Time | Total Tokens | Avg tok/s | Avg Response |
|---|---|---|---|---|
| Llama-3.1-8B | 97s | 21,165 | 218 tok/s | 353 tokens |
| Qwen3-8B | 199s | 41,400 | 208 tok/s | 690 tokens |
| Phi-4 | 263s | 36,989 | 141 tok/s | 616 tokens |
| Qwen3-14B | 297s | 40,289 | 135 tok/s | 671 tokens |
| Gemma-3-12B | 258s | 22,088 | 86 tok/s | 368 tokens |
| KORMo-10B | 434s | 25,938 | 60 tok/s | 432 tokens |
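The two derived columns follow mechanically from the raw measurements. A quick sanity check over the table's values:

```python
# (model, total_seconds, total_tokens) copied from the benchmark table
runs = [
    ("Llama-3.1-8B", 97, 21_165),
    ("Qwen3-8B", 199, 41_400),
    ("Phi-4", 263, 36_989),
    ("Qwen3-14B", 297, 40_289),
    ("Gemma-3-12B", 258, 22_088),
    ("KORMo-10B", 434, 25_938),
]
N_QUESTIONS = 60
for name, secs, toks in runs:
    print(f"{name}: {toks / secs:.0f} tok/s, {toks / N_QUESTIONS:.0f} tokens/answer")
```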

Llama-3.1-8B — Fastest at 218 tok/s

Light 8B weights and short responses (avg 353 tokens) make it the speed champion. But fewer tokens mean thinner answers: its quality score of 2.67 ranks dead last.

Qwen3-8B — Runner-up at 208 tok/s

Similar parameter count to Llama but far richer responses (avg 690 tokens); its 41,400 total tokens are the highest of any model. At nearly Llama's throughput, that works out to roughly twice the content per answer. Quality score 3.47, solidly upper-tier.

Qwen3-14B — Best Balance at 135 tok/s

Slower than 8B models due to 14B parameters, but quality score of 3.86 takes overall first place. Rich responses (avg 671 tokens) make it the most satisfying choice for real-world deployment.

KORMo-10B — Slowest at 60 tok/s

A Korean-specialized model with unique strengths, but at 60 tok/s (less than a third of the top models' speed) and 434s total runtime, it's better suited to batch processing than real-time services.

3. Response Length by Scenario

The same model produces dramatically different response lengths depending on the scenario. Manufacturing questions draw the longest answers; legal queries the shortest.

| Model | A Mfg | B SaaS | C Medical | D Retail | E Legal | F Automation | G Korean |
|---|---|---|---|---|---|---|---|
| Qwen3-8B | 892 | 745 | 628 | 710 | 548 | 720 | 587 |
| Qwen3-14B | 865 | 712 | 605 | 685 | 540 | 698 | 592 |
| Gemma-3-12B | 445 | 398 | 320 | 375 | 285 | 362 | 392 |
| Phi-4 | 780 | 668 | 548 | 635 | 478 | 610 | 593 |
| Llama-3.1-8B | 432 | 385 | 298 | 362 | 265 | 348 | 381 |
| KORMo-10B | 548 | 465 | 392 | 445 | 348 | 418 | 408 |

* Values in tokens (average of 10 questions per scenario)
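Averaging each scenario column across all six models makes the pattern explicit. A short sketch over the table's values:

```python
scenarios = ["A Mfg", "B SaaS", "C Medical", "D Retail",
             "E Legal", "F Automation", "G Korean"]
rows = {  # per-scenario average response length, from the table above
    "Qwen3-8B":     [892, 745, 628, 710, 548, 720, 587],
    "Qwen3-14B":    [865, 712, 605, 685, 540, 698, 592],
    "Gemma-3-12B":  [445, 398, 320, 375, 285, 362, 392],
    "Phi-4":        [780, 668, 548, 635, 478, 610, 593],
    "Llama-3.1-8B": [432, 385, 298, 362, 265, 348, 381],
    "KORMo-10B":    [548, 465, 392, 445, 348, 418, 408],
}
means = {s: sum(r[i] for r in rows.values()) / len(rows)
         for i, s in enumerate(scenarios)}
for s, m in sorted(means.items(), key=lambda kv: -kv[1]):
    print(f"{s}: {m:.0f} tokens")  # A Mfg is longest (~660), E Legal shortest (~411)
```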

Longest: Manufacturing Scenario (A)

Complex queries about manufacturing processes require detailed, structured explanations with specifications and safety notes. Qwen3-8B averages 892 tokens — the longest across all scenarios.

Shortest: Legal Scenario (E)

Medical and legal queries often trigger safety refusals — "consult a professional" responses keep answers short. This is actually appropriate boundary awareness.

Response length ≠ Quality

Qwen3-8B generates the most tokens, but Qwen3-14B scores higher on quality. Gemma writes concisely but hits the key points. Llama writes briefly and inaccurately.

4. Speed vs Quality Trade-Off

Is the fastest model the best model? Plotting speed against quality reveals a clear optimal balance point.

| Model | Speed (tok/s) | Quality (/5) | Assessment |
|---|---|---|---|
| Llama-3.1-8B | 218 | 2.67 | Fastest speed, lowest quality |
| Qwen3-8B | 208 | 3.47 | Fast with decent quality |
| Phi-4 | 141 | 3.10 | Mid speed, weak on Korean |
| Qwen3-14B | 135 | 3.86 | Optimal balance (recommended) |
| Gemma-3-12B | 86 | 3.72 | Slow but best Korean |
| KORMo-10B | 60 | 3.46 | Slowest, Korean-specialized |

[Bar chart: Speed Comparison (tok/s): Llama-3.1-8B 218, Qwen3-8B 208, Phi-4 141, Qwen3-14B 135, Gemma-3-12B 86, KORMo-10B 60]

[Bar chart: Quality Comparison (/5): Qwen3-14B 3.86, Gemma-3-12B 3.72, Qwen3-8B 3.47, KORMo-10B 3.46, Phi-4 3.10, Llama-3.1-8B 2.67]
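One way to act on "speed × quality, not raw tok/s": treat speed as a hard floor for the use case and maximize quality among the models that clear it. A minimal sketch; the 100 tok/s real-time floor is our own illustrative assumption, not a number from the benchmark:

```python
models = {  # name: (tok/s, quality /5), from the trade-off table
    "Llama-3.1-8B": (218, 2.67),
    "Qwen3-8B": (208, 3.47),
    "Phi-4": (141, 3.10),
    "Qwen3-14B": (135, 3.86),
    "Gemma-3-12B": (86, 3.72),
    "KORMo-10B": (60, 3.46),
}

def pick(min_tok_s: float) -> str:
    """Highest-quality model among those meeting the speed floor."""
    ok = {m: q for m, (s, q) in models.items() if s >= min_tok_s}
    return max(ok, key=ok.get)

print(pick(100))  # illustrative real-time floor -> Qwen3-14B
print(pick(0))    # no speed constraint -> Qwen3-14B (overall quality leader)
```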

Trade-Off Recommendations

Real-time service → Qwen3-14B

At 135 tok/s, generation far outpaces reading speed, so streamed responses feel instant. Highest overall quality score.

Batch processing → Qwen3-8B

208 tok/s with solid quality. Best for processing large document volumes fast.

Korean quality priority → Gemma-3-12B

86 tok/s is slower, but Korean quality score of 4.28 is far above the rest. For services where accuracy is the competitive edge.

Key Takeaways

  • Llama-3.1-8B hits 218 tok/s peak but ranks last in quality (2.67/5)
  • Qwen3-14B at 135 tok/s is the optimal speed-quality sweet spot (3.86/5)
  • AWQ 4-bit quantization runs 14B models comfortably on 96GB VRAM
  • Manufacturing scenarios produce longest responses; medical/legal produce shortest
  • Choose models by 'speed × quality' efficiency, not raw tok/s alone

Conclusion

The RTX PRO 6000's 96GB VRAM and 350W power limit provide more than enough headroom for local LLM serving. But faster isn't always better. The key is finding the speed-quality balance that fits your use case. Overall, Qwen3-14B delivers the best all-around combination of speed, quality, and response richness.