RTX Pro 6000 Local LLM Benchmark — 6 Models, 360 Questions, Complete Ranking
We loaded 6 local LLMs onto an NVIDIA RTX Pro 6000 (96 GB VRAM) running SGLang with AWQ 4-bit quantization at a 350W power limit, then put each model through 60 questions across 7 business scenarios — 360 total responses. Below is the complete comparison: speed, quality, hallucination resistance, and practical deployment recommendations.
Test Environment
Every model was tested under identical conditions: RTX Pro 6000, SGLang serving engine, AWQ quantization, temperature 0.3, 60 questions per model across 7 scenarios (manufacturing, SaaS, healthcare, e-commerce, legal, internal automation, Korean language). Quality was scored on a 5-point scale covering Korean fluency (25%), instruction following (25%), factual accuracy (25%), response structure (15%), and refusal/limitation awareness (10%).
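The rubric above is a straightforward weighted mean. As a minimal sketch of how a per-response score rolls up (the weights are from the methodology; the example category scores are illustrative placeholders, not real data):

```python
# Weighted quality score on the article's 5-point rubric.
# Weights come from the test methodology; the example scores below
# are placeholders, not actual benchmark results.
WEIGHTS = {
    "korean_fluency": 0.25,
    "instruction_following": 0.25,
    "factual_accuracy": 0.25,
    "response_structure": 0.15,
    "refusal_awareness": 0.10,
}

def weighted_score(scores: dict) -> float:
    """Combine per-category 1-5 scores into one overall score."""
    assert set(scores) == set(WEIGHTS), "every category must be scored"
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 2)

example = {
    "korean_fluency": 4.0,
    "instruction_following": 4.5,
    "factual_accuracy": 3.5,
    "response_structure": 4.0,
    "refusal_awareness": 3.0,
}
print(weighted_score(example))  # 3.9 — still on the 1-5 scale
```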
Speed Comparison
| Model | Parameters | tok/s | Total Time (60Q) |
|---|---|---|---|
| Llama-3.1-8B-AWQ | 8B | 218 | 97 s |
| Qwen3-8B-AWQ | 8B | 208 | 199 s |
| Phi-4-AWQ | 14B | 141 | 263 s |
| Qwen3-14B-AWQ | 14B | 135 | 297 s |
| Gemma-3-12B-AWQ | 12B | 86 | 258 s |
| KORMo-10B-sft | 10B | 60 | 434 s |
Speed ranges from 218 tok/s (Llama-3.1-8B) to 60 tok/s (KORMo-10B), a 3.6x spread. However, speed alone is misleading — the fastest model also produces the lowest quality output.
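Note that tok/s and total wall time measure different things. Multiplying the two columns (a rough back-of-envelope that assumes the 60-question run is dominated by generation time) shows the models emitted very different output volumes — Llama's short total time partly reflects terser answers, not just raw speed:

```python
# Rough estimate: tokens emitted ≈ tok/s × total seconds.
# Assumes wall time is dominated by generation, which is an
# approximation (prefill and scheduling overhead are ignored).
models = {
    "Llama-3.1-8B-AWQ": (218, 97),
    "Qwen3-8B-AWQ": (208, 199),
    "Phi-4-AWQ": (141, 263),
    "Qwen3-14B-AWQ": (135, 297),
    "Gemma-3-12B-AWQ": (86, 258),
    "KORMo-10B-sft": (60, 434),
}
for name, (tok_s, seconds) in models.items():
    print(f"{name}: ~{tok_s * seconds:,} tokens over 60 questions")
```

By this estimate Llama-3.1-8B produced roughly half the output volume of Qwen3-8B despite similar tok/s, which is why its total time looks so much better than its throughput advantage alone would explain.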
Quality Rankings: Overall Scores
| Rank | Model | Overall | Automation | Korean | Hallucination Defense |
|---|---|---|---|---|---|
| 1 | Qwen3-14B-AWQ | 3.86 | 4.66 | 4.19 | 4/6 |
| 2 | Gemma-3-12B-AWQ | 3.72 | 4.15 | 4.28 | 2/6 |
| 3 | Qwen3-8B-AWQ | 3.47 | 3.95 | 3.33 | 2/6 |
| 4 | KORMo-10B-sft | 3.46 | 3.60 | 3.83 | 4/6 |
| 5 | Phi-4-AWQ | 3.10 | 3.18 | 2.33 | 1/6 |
| 6 | Llama-3.1-8B-AWQ | 2.67 | 3.00 | 2.67 | 3/6 |
Model-by-Model Assessment
#1 Qwen3-14B (3.86) — The All-Rounder
Consistent high scores across all 7 scenarios. Dominates automation tasks at 4.66 — email drafts, meeting notes, and reports are production-ready. Korean fluency at 4.19 is second only to Gemma. Hallucination defense at 4/6 ties for best. At 135 tok/s, streaming output has no perceptible latency. This is the default recommendation for production deployment.
#2 Gemma-3-12B (3.72) — Korean Quality Champion
Highest Korean language score at 4.28 with natural honorifics and nuanced expression. Strong in healthcare scenarios. However, speed at 86 tok/s is about 36% slower than Qwen3-14B, and hallucination defense at 2/6 is concerning. Best for services where Korean fluency quality is the absolute top priority and speed requirements are relaxed.
#3 Qwen3-8B (3.47) — Best Value
Nearly as fast as Llama at 208 tok/s but with dramatically better quality (3.47 vs 2.67). Strong in legal scenarios and well-structured responses. Occasional Chinese character contamination is its main weakness. Ideal for high-throughput batch processing where moderate quality suffices.
#4 KORMo-10B (3.46) — Korean Specialist, Speed Limited
Natural business Korean and good refusal awareness (4/6 hallucination defense). But at 60 tok/s — the slowest model tested — it is better suited for batch processing than real-time chat. Choose this only when Korean-specialized model architecture is a requirement and speed is not a constraint.
#5 Phi-4 (3.10) — English-Centric
Reasonable logical reasoning but the worst Korean score at 2.33 with frequent English switching. Hallucination defense at 1/6 is the lowest. Not recommended for Korean-language services.
#6 Llama-3.1-8B (2.67) — Speed Only
Fastest at 218 tok/s but last in every quality metric. Severe multilingual contamination and the most hallucinations. Not viable for any Korean business application.
Hallucination Trap Summary
We tested 6 adversarial questions designed to trigger fabricated information. The pass rate (correctly refusing or admitting uncertainty) varied dramatically:
| Model | Pass Rate | Notable Failures |
|---|---|---|
| Qwen3-14B | 4/6 | Fabricates non-existent SaaS features (structural defect) |
| KORMo-10B | 4/6 | Some legal citation errors |
| Llama-3.1-8B | 3/6 | Frequent hallucination across domains |
| Gemma-3-12B | 2/6 | Creates plausible but fictional case law |
| Qwen3-8B | 2/6 | Medical and product information fabrication |
| Phi-4 | 1/6 | Almost no hallucination defense |
No model achieves 100% hallucination defense. Even the best performers (Qwen3-14B and KORMo-10B at 4/6) fail on specific trap types. RAG is mandatory for any production deployment that handles factual queries.
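The pass criterion in this table — a response passes if it refuses or signals uncertainty rather than answering a trap question confidently — can be approximated with a simple marker scan. This is only an illustrative sketch (the marker list is invented here, not the grading procedure actually used), but it shows the shape of an automated first-pass filter before human review:

```python
# Sketch of the trap-question pass criterion: a response "passes" if it
# refuses or admits uncertainty instead of confidently fabricating.
# The marker list is illustrative only, not the rubric used in the test.
REFUSAL_MARKERS = [
    "i don't know",
    "i'm not certain",
    "cannot verify",
    "does not exist",
    "no such",
    "could not find",
]

def passes_trap(response: str) -> bool:
    """True if the response hedges or refuses rather than fabricating."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

print(passes_trap("I cannot verify that this SaaS feature exists."))   # True
print(passes_trap("Yes, the Enterprise plan includes that feature."))  # False
```

A marker scan like this produces false negatives (a model can hedge in other words), so it works as a pre-filter, not a replacement for reading the responses.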
Final Recommendations
- Production real-time service → Qwen3-14B. Best overall quality (3.86), strong automation (4.66), adequate speed (135 tok/s), and best hallucination defense. The default choice.
- Korean quality above all → Gemma-3-12B. Highest Korean score (4.28) but slower (86 tok/s) and weaker hallucination defense (2/6).
- High-throughput batch → Qwen3-8B. Fast (208 tok/s) with decent quality (3.47). Best for document processing at scale.
- Avoid for Korean services → Phi-4, Llama-3.1-8B. Both score below 3.10 with critical Korean language deficiencies.
The key insight from 360 test responses: model selection should balance speed and quality rather than maximize either metric alone. A raw speed × quality product would favor Qwen3-8B's throughput, but Qwen3-14B remains the recommendation: it is not the fastest and not the most fluent in Korean, yet it is the only model that scores above average in every quality category while staying fast enough for real-time use.
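That closing claim can be checked directly against the article's own tables — compute the cross-model average for each quality column and see which model clears every bar:

```python
# Verify the closing claim using the article's own numbers:
# which model is above the cross-model average in every quality category?
scores = {
    #                  overall, automation, korean, hallucination (/6)
    "Qwen3-14B-AWQ":    (3.86, 4.66, 4.19, 4),
    "Gemma-3-12B-AWQ":  (3.72, 4.15, 4.28, 2),
    "Qwen3-8B-AWQ":     (3.47, 3.95, 3.33, 2),
    "KORMo-10B-sft":    (3.46, 3.60, 3.83, 4),
    "Phi-4-AWQ":        (3.10, 3.18, 2.33, 1),
    "Llama-3.1-8B-AWQ": (2.67, 3.00, 2.67, 3),
}
n = len(scores)
averages = [sum(row[i] for row in scores.values()) / n for i in range(4)]
above_avg_everywhere = [
    name for name, row in scores.items()
    if all(value > avg for value, avg in zip(row, averages))
]
print(above_avg_everywhere)  # only Qwen3-14B clears every bar
```

Gemma falls below average on hallucination defense, Qwen3-8B on Korean, and KORMo on automation, so Qwen3-14B is the only model left standing.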