
RTX Pro 6000 Local LLM Benchmark — 6 Models, 360 Questions, Complete Ranking

We loaded 6 local LLMs onto an NVIDIA RTX Pro 6000 (96 GB VRAM) running SGLang with AWQ 4-bit quantization at a 350W power limit, then put each model through 60 questions across 7 business scenarios — 360 total responses. This is the comprehensive comparison covering speed, quality, hallucination resistance, and practical deployment recommendations.
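For readers who want to reproduce the setup, a typical launch looks like the following. The article does not list the exact commands or flags used, so the model path, port, and power-limit invocation here are illustrative assumptions:

```shell
# Cap the GPU at 350 W (persistence mode recommended; requires root).
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 350

# Serve an AWQ 4-bit model with SGLang.
# Model path and port are illustrative, not from the article.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-14B-AWQ \
  --quantization awq \
  --port 30000
```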

Test Environment

Every model was tested under identical conditions: RTX Pro 6000, SGLang serving engine, AWQ quantization, temperature 0.3, 60 questions per model across 7 scenarios (manufacturing, SaaS, healthcare, e-commerce, legal, internal automation, Korean language). Quality was scored on a 5-point scale covering Korean fluency (25%), instruction following (25%), factual accuracy (25%), response structure (15%), and refusal/limitation awareness (10%).
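The weighted rubric above can be expressed as a small scoring function. The weights come from the article; the sub-score keys and the example response are our own illustration:

```python
# Rubric weights from the article; each sub-score is on a 1-5 scale.
WEIGHTS = {
    "korean_fluency": 0.25,
    "instruction_following": 0.25,
    "factual_accuracy": 0.25,
    "response_structure": 0.15,
    "refusal_awareness": 0.10,
}

def overall_score(subscores: dict[str, float]) -> float:
    """Weighted average of the five rubric sub-scores."""
    missing = WEIGHTS.keys() - subscores.keys()
    if missing:
        raise ValueError(f"missing sub-scores: {missing}")
    return round(sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS), 2)

# A response rated 4 everywhere except a weak structure score:
print(overall_score({
    "korean_fluency": 4, "instruction_following": 4,
    "factual_accuracy": 4, "response_structure": 2,
    "refusal_awareness": 4,
}))  # → 3.7
```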

Speed Comparison

| Model | Parameters | tok/s | Total Time (60Q) |
|---|---|---|---|
| Llama-3.1-8B-AWQ | 8B | 218 | 97 s |
| Qwen3-8B-AWQ | 8B | 208 | 199 s |
| Phi-4-AWQ | 14B | 141 | 263 s |
| Qwen3-14B-AWQ | 14B | 135 | 297 s |
| Gemma-3-12B-AWQ | 12B | 86 | 258 s |
| KORMo-10B-sft | 10B | 60 | 434 s |

Speed ranges from 218 tok/s (Llama-3.1-8B) to 60 tok/s (KORMo-10B), a 3.6x spread. However, speed alone is misleading — the fastest model also produces the lowest quality output.
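Throughput and total time together also reveal how many tokens each model generated. Assuming total time is dominated by generation (i.e. ignoring prompt processing), implied output length ≈ tok/s × seconds:

```python
# (tok/s, total seconds) from the speed table above.
runs = {
    "Llama-3.1-8B": (218, 97),
    "Qwen3-8B":     (208, 199),
    "Phi-4":        (141, 263),
    "Qwen3-14B":    (135, 297),
    "Gemma-3-12B":  (86, 258),
    "KORMo-10B":    (60, 434),
}

# Implied generated tokens ~= throughput * wall time.
# An estimate only: prompt processing time is not separated out.
for model, (tps, secs) in runs.items():
    print(f"{model:13s} ~{tps * secs:6d} tokens over 60 answers")
```

This is why total time and tok/s rank differently: Llama-3.1-8B's 97 s owes as much to short answers (~21k tokens) as to raw speed, and Gemma-3-12B finished before Qwen3-14B despite far lower throughput because it generated roughly 18k fewer tokens.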

Quality Rankings: Overall Scores

| Rank | Model | Overall | Automation | Korean | Hallucination Defense |
|---|---|---|---|---|---|
| 1 | Qwen3-14B-AWQ | 3.86 | 4.66 | 4.19 | 4/6 |
| 2 | Gemma-3-12B-AWQ | 3.72 | 4.15 | 4.28 | 2/6 |
| 3 | Qwen3-8B-AWQ | 3.47 | 3.95 | 3.33 | 2/6 |
| 4 | KORMo-10B-sft | 3.46 | 3.60 | 3.83 | 4/6 |
| 5 | Phi-4-AWQ | 3.10 | 3.18 | 2.33 | 1/6 |
| 6 | Llama-3.1-8B-AWQ | 2.67 | 3.00 | 2.67 | 3/6 |

Model-by-Model Assessment

#1 Qwen3-14B (3.86) — The All-Rounder

Consistent high scores across all 7 scenarios. Dominates automation tasks at 4.66 — email drafts, meeting notes, and reports are production-ready. Korean fluency at 4.19 is second only to Gemma. Hallucination defense at 4/6 ties for best. At 135 tok/s, streaming output has no perceptible latency. This is the default recommendation for production deployment.

#2 Gemma-3-12B (3.72) — Korean Quality Champion

Highest Korean language score at 4.28 with natural honorifics and nuanced expression. Strong in healthcare scenarios. However, speed at 86 tok/s is roughly 36% slower than Qwen3-14B, and hallucination defense at 2/6 is concerning. Best for services where Korean fluency is the absolute top priority and speed requirements are relaxed.

#3 Qwen3-8B (3.47) — Best Value

Nearly as fast as Llama at 208 tok/s but with dramatically better quality (3.47 vs 2.67). Strong in legal scenarios and well-structured responses. Occasional Chinese character contamination is its main weakness. Ideal for high-throughput batch processing where moderate quality suffices.

#4 KORMo-10B (3.46) — Korean Specialist, Speed Limited

Natural business Korean and good refusal awareness (4/6 hallucination defense). But at 60 tok/s — the slowest model tested — it is better suited for batch processing than real-time chat. Choose this only when Korean-specialized model architecture is a requirement and speed is not a constraint.

#5 Phi-4 (3.10) — English-Centric

Reasonable logical reasoning but the worst Korean score at 2.33 with frequent English switching. Hallucination defense at 1/6 is the lowest. Not recommended for Korean-language services.

#6 Llama-3.1-8B (2.67) — Speed Only

Fastest at 218 tok/s but last in every quality metric. Severe multilingual contamination and the most hallucinations. Not viable for any Korean business application.

Hallucination Trap Summary

We tested 6 adversarial questions designed to trigger fabricated information. The pass rate (correctly refusing or admitting uncertainty) varied dramatically:

| Model | Pass Rate | Notable Failures |
|---|---|---|
| Qwen3-14B | 4/6 | Fabricates non-existent SaaS features (structural defect) |
| KORMo-10B | 4/6 | Some legal citation errors |
| Llama-3.1-8B | 3/6 | Frequent hallucination across domains |
| Gemma-3-12B | 2/6 | Creates plausible but fictional case law |
| Qwen3-8B | 2/6 | Medical and product information fabrication |
| Phi-4 | 1/6 | Almost no hallucination defense |

No model achieves 100% hallucination defense. Even the best performers (Qwen3-14B and KORMo-10B at 4/6) fail on specific trap types. RAG is mandatory for any production deployment that handles factual queries.
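The pass criterion — correctly refusing or admitting uncertainty — can be roughed out as a marker-phrase check. This is only a sketch: the article does not describe its grading mechanics, the marker phrases below are our own, and a real harness would still need human review of borderline answers:

```python
# Hypothetical marker phrases signalling refusal / admitted uncertainty.
UNCERTAINTY_MARKERS = [
    "i don't have information",
    "i cannot confirm",
    "no such",
    "does not exist",
    "i'm not aware of",
]

def passes_trap(response: str) -> bool:
    """True if the response refuses or admits uncertainty
    instead of answering the trap question directly."""
    text = response.lower()
    return any(marker in text for marker in UNCERTAINTY_MARKERS)

print(passes_trap("That feature does not exist in the current release."))  # True
print(passes_trap("Yes, enable it under Settings > Advanced."))            # False
```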

Final Recommendations

  • Production real-time service → Qwen3-14B. Best overall quality (3.86), strong automation (4.66), adequate speed (135 tok/s), and best hallucination defense. The default choice.
  • Korean quality above all → Gemma-3-12B. Highest Korean score (4.28) but slower (86 tok/s) and weaker hallucination defense (2/6).
  • High-throughput batch → Qwen3-8B. Fast (208 tok/s) with decent quality (3.47). Best for document processing at scale.
  • Avoid for Korean services → Phi-4, Llama-3.1-8B. Both score below 3.10 with critical Korean language deficiencies.

The key insight from 360 test responses: model selection should weigh speed and quality together, not optimize either metric alone. Qwen3-14B wins that trade-off decisively: it is not the fastest, and not the most fluent in Korean, but it is the only model that scores above average in every quality category.
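That "above average in every category" claim can be checked directly against the ranking table, treating its four scored columns as the categories:

```python
# Columns from the quality-ranking table: overall, automation, Korean,
# and hallucination-defense passes (out of 6).
scores = {
    "Qwen3-14B":    (3.86, 4.66, 4.19, 4),
    "Gemma-3-12B":  (3.72, 4.15, 4.28, 2),
    "Qwen3-8B":     (3.47, 3.95, 3.33, 2),
    "KORMo-10B":    (3.46, 3.60, 3.83, 4),
    "Phi-4":        (3.10, 3.18, 2.33, 1),
    "Llama-3.1-8B": (2.67, 3.00, 2.67, 3),
}

# Per-column mean across the six models.
means = [sum(col) / len(scores) for col in zip(*scores.values())]

# Models beating the mean in every column.
above_avg_everywhere = [
    model for model, row in scores.items()
    if all(v > m for v, m in zip(row, means))
]
print(above_avg_everywhere)  # → ['Qwen3-14B']
```

Gemma-3-12B falls below the mean on hallucination defense, Qwen3-8B on Korean, and KORMo-10B on automation, leaving Qwen3-14B as the only model above average across the board.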