
RTX Pro 6000 Local LLM Benchmark — 6 Models, 360 Questions, Complete Ranking

We loaded 6 local LLMs onto an NVIDIA RTX Pro 6000 (96 GB VRAM) running SGLang with AWQ 4-bit quantization at a 350W power limit, then put each model through 60 questions across 7 business scenarios — 360 total responses. This is the comprehensive comparison covering speed, quality, hallucination resistance, and practical deployment recommendations.
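For readers who want to reproduce the setup, a typical launch looks like the following. The article does not list the exact commands or flags used, so the model path, port, and power-limit invocation here are illustrative assumptions:

```shell
# Cap the GPU at 350 W (persistence mode recommended; requires root).
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 350

# Serve an AWQ 4-bit model with SGLang.
# Model path and port are illustrative, not from the article.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-14B-AWQ \
  --quantization awq \
  --port 30000
```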

Test Environment

Every model was tested under identical conditions: RTX Pro 6000, SGLang serving engine, AWQ quantization, temperature 0.3, 60 questions per model across 7 scenarios (manufacturing, SaaS, healthcare, e-commerce, legal, internal automation, Korean language). Quality was scored on a 5-point scale covering Korean fluency (25%), instruction following (25%), factual accuracy (25%), response structure (15%), and refusal/limitation awareness (10%).
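The weighted rubric above can be expressed as a small scoring function. The weights come from the article; the sub-score keys and the example response are our own illustration:

```python
# Rubric weights from the article; each sub-score is on a 1-5 scale.
WEIGHTS = {
    "korean_fluency": 0.25,
    "instruction_following": 0.25,
    "factual_accuracy": 0.25,
    "response_structure": 0.15,
    "refusal_awareness": 0.10,
}

def overall_score(subscores: dict[str, float]) -> float:
    """Weighted average of the five rubric sub-scores."""
    missing = WEIGHTS.keys() - subscores.keys()
    if missing:
        raise ValueError(f"missing sub-scores: {missing}")
    return round(sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS), 2)

# A response rated 4 everywhere except a weak structure score:
print(overall_score({
    "korean_fluency": 4, "instruction_following": 4,
    "factual_accuracy": 4, "response_structure": 2,
    "refusal_awareness": 4,
}))  # → 3.7
```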

Speed Comparison

| Model | Parameters | tok/s | Total Time (60Q) |
|---|---|---|---|
| Llama-3.1-8B-AWQ | 8B | 218 | 97 s |
| Qwen3-8B-AWQ | 8B | 208 | 199 s |
| Phi-4-AWQ | 14B | 141 | 263 s |
| Qwen3-14B-AWQ | 14B | 135 | 297 s |
| Gemma-3-12B-AWQ | 12B | 86 | 258 s |
| KORMo-10B-sft | 10B | 60 | 434 s |

Speed ranges from 218 tok/s (Llama-3.1-8B) to 60 tok/s (KORMo-10B), a 3.6x spread. However, speed alone is misleading — the fastest model also produces the lowest quality output.
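Throughput and total time together also reveal how many tokens each model generated. Assuming total time is dominated by generation (i.e. ignoring prompt processing), implied output length ≈ tok/s × seconds:

```python
# (tok/s, total seconds) from the speed table above.
runs = {
    "Llama-3.1-8B": (218, 97),
    "Qwen3-8B":     (208, 199),
    "Phi-4":        (141, 263),
    "Qwen3-14B":    (135, 297),
    "Gemma-3-12B":  (86, 258),
    "KORMo-10B":    (60, 434),
}

# Implied generated tokens ~= throughput * wall time.
# An estimate only: prompt processing time is not separated out.
for model, (tps, secs) in runs.items():
    print(f"{model:13s} ~{tps * secs:6d} tokens over 60 answers")
```

This is why total time and tok/s rank differently: Llama-3.1-8B's 97 s owes as much to short answers (~21k tokens) as to raw speed, and Gemma-3-12B finished before Qwen3-14B despite far lower throughput because it generated roughly 18k fewer tokens.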

Quality Rankings: Overall Scores

| Rank | Model | Overall | Automation | Korean | Hallucination Defense |
|---|---|---|---|---|---|
| 1 | Qwen3-14B-AWQ | 3.86 | 4.66 | 4.19 | 4/6 |
| 2 | Gemma-3-12B-AWQ | 3.72 | 4.15 | 4.28 | 2/6 |
| 3 | Qwen3-8B-AWQ | 3.47 | 3.95 | 3.33 | 2/6 |
| 4 | KORMo-10B-sft | 3.46 | 3.60 | 3.83 | 4/6 |
| 5 | Phi-4-AWQ | 3.10 | 3.18 | 2.33 | 1/6 |
| 6 | Llama-3.1-8B-AWQ | 2.67 | 3.00 | 2.67 | 3/6 |

Model-by-Model Assessment

#1 Qwen3-14B (3.86) — The All-Rounder

Consistent high scores across all 7 scenarios. Dominates automation tasks at 4.66 — email drafts, meeting notes, and reports are production-ready. Korean fluency at 4.19 is second only to Gemma. Hallucination defense at 4/6 ties for best. At 135 tok/s, streaming output has no perceptible latency. This is the default recommendation for production deployment.

#2 Gemma-3-12B (3.72) — Korean Quality Champion

Highest Korean language score at 4.28 with natural honorifics and nuanced expression. Strong in healthcare scenarios. However, speed at 86 tok/s is roughly 36% slower than Qwen3-14B, and hallucination defense at 2/6 is concerning. Best for services where Korean fluency is the absolute top priority and speed requirements are relaxed.

#3 Qwen3-8B (3.47) — Best Value

Nearly as fast as Llama at 208 tok/s but with dramatically better quality (3.47 vs 2.67). Strong in legal scenarios and well-structured responses. Occasional Chinese character contamination is its main weakness. Ideal for high-throughput batch processing where moderate quality suffices.

#4 KORMo-10B (3.46) — Korean Specialist, Speed Limited

Natural business Korean and good refusal awareness (4/6 hallucination defense). But at 60 tok/s — the slowest model tested — it is better suited for batch processing than real-time chat. Choose this only when Korean-specialized model architecture is a requirement and speed is not a constraint.

#5 Phi-4 (3.10) — English-Centric

Reasonable logical reasoning but the worst Korean score at 2.33 with frequent English switching. Hallucination defense at 1/6 is the lowest. Not recommended for Korean-language services.

#6 Llama-3.1-8B (2.67) — Speed Only

Fastest at 218 tok/s but last in every quality metric. Severe multilingual contamination and the most hallucinations. Not viable for any Korean business application.

Hallucination Trap Summary

We tested 6 adversarial questions designed to trigger fabricated information. The pass rate (correctly refusing or admitting uncertainty) varied dramatically:

| Model | Pass Rate | Notable Failures |
|---|---|---|
| Qwen3-14B | 4/6 | Fabricates non-existent SaaS features (structural defect) |
| KORMo-10B | 4/6 | Some legal citation errors |
| Llama-3.1-8B | 3/6 | Frequent hallucination across domains |
| Gemma-3-12B | 2/6 | Creates plausible but fictional case law |
| Qwen3-8B | 2/6 | Medical and product information fabrication |
| Phi-4 | 1/6 | Almost no hallucination defense |

No model achieves 100% hallucination defense. Even the best performers (Qwen3-14B and KORMo-10B at 4/6) fail on specific trap types. RAG is mandatory for any production deployment that handles factual queries.
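The pass criterion — correctly refusing or admitting uncertainty — can be roughed out as a marker-phrase check. This is only a sketch: the article does not describe its grading mechanics, the marker phrases below are our own, and a real harness would still need human review of borderline answers:

```python
# Hypothetical marker phrases signalling refusal / admitted uncertainty.
UNCERTAINTY_MARKERS = [
    "i don't have information",
    "i cannot confirm",
    "no such",
    "does not exist",
    "i'm not aware of",
]

def passes_trap(response: str) -> bool:
    """True if the response refuses or admits uncertainty
    instead of answering the trap question directly."""
    text = response.lower()
    return any(marker in text for marker in UNCERTAINTY_MARKERS)

print(passes_trap("That feature does not exist in the current release."))  # True
print(passes_trap("Yes, enable it under Settings > Advanced."))            # False
```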

Final Recommendations

  • Production real-time service → Qwen3-14B. Best overall quality (3.86), strong automation (4.66), adequate speed (135 tok/s), and best hallucination defense. The default choice.
  • Korean quality above all → Gemma-3-12B. Highest Korean score (4.28) but slower (86 tok/s) and weaker hallucination defense (2/6).
  • High-throughput batch → Qwen3-8B. Fast (208 tok/s) with decent quality (3.47). Best for document processing at scale.
  • Avoid for Korean services → Phi-4, Llama-3.1-8B. Both score below 3.10 with critical Korean language deficiencies.

The key insight from 360 test responses: model selection should weigh speed and quality together, not optimize either metric alone. Qwen3-14B wins that trade-off decisively: it is not the fastest, and not the most fluent in Korean, but it is the only model that scores above average in every quality category.
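That "above average in every category" claim can be checked directly against the ranking table, treating its four scored columns as the categories:

```python
# Columns from the quality-ranking table: overall, automation, Korean,
# and hallucination-defense passes (out of 6).
scores = {
    "Qwen3-14B":    (3.86, 4.66, 4.19, 4),
    "Gemma-3-12B":  (3.72, 4.15, 4.28, 2),
    "Qwen3-8B":     (3.47, 3.95, 3.33, 2),
    "KORMo-10B":    (3.46, 3.60, 3.83, 4),
    "Phi-4":        (3.10, 3.18, 2.33, 1),
    "Llama-3.1-8B": (2.67, 3.00, 2.67, 3),
}

# Per-column mean across the six models.
means = [sum(col) / len(scores) for col in zip(*scores.values())]

# Models beating the mean in every column.
above_avg_everywhere = [
    model for model, row in scores.items()
    if all(v > m for v, m in zip(row, means))
]
print(above_avg_everywhere)  # → ['Qwen3-14B']
```

Gemma-3-12B falls below the mean on hallucination defense, Qwen3-8B on Korean, and KORMo-10B on automation, leaving Qwen3-14B as the only model above average across the board.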