treeru.com
AI · Mar 4, 2026

Local LLM Benchmark: 6 Models Tested Across 60 Questions and 7 Business Scenarios

Choosing a local LLM for production is hard. Marketing benchmarks rarely reflect real-world performance, and every model claims to be "state of the art." We ran 6 models through 60 questions spanning 7 business scenarios — manufacturing, SaaS, medical, retail, legal, automation, and language quality — for a total of 360 evaluations. We also included hallucination trap questions to test each model's ability to say "I don't know." Here is the complete scorecard.

At a glance:

  • 6 models compared
  • 60 questions × 7 scenarios (360 evaluations)
  • 3.86 — top overall score (Qwen3-14B)
  • 4/6 — best hallucination trap pass rate

Test Design and Methodology

All 6 models ran on the same GPU (RTX PRO 6000), the same serving engine (SGLang v0.4), and the same sampling settings (temperature 0.3, top_p fixed at 0.9). Each model processed all 60 questions sequentially. Responses were scored on a 5-point scale using five weighted criteria:
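The fixed sampling setup can be sketched as a request builder for an OpenAI-compatible chat endpoint (SGLang serves one); the model identifier below is a placeholder assumption, not a confirmed deployment name:

```python
def make_request(model: str, question: str) -> dict:
    """Build a chat payload with the benchmark's fixed sampling settings.

    Follows the OpenAI-compatible chat schema that SGLang exposes;
    the model string is whatever name the server was launched with.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.3,  # fixed across all 6 models
        "top_p": 0.9,        # fixed across all 6 models
    }
```

Each of the 60 questions would be sent through a builder like this once per model, typically via `POST /v1/chat/completions`.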

| Criterion | Weight | What It Measures |
|---|---|---|
| Language naturalness | 25% | Grammar, word order, honorifics usage |
| Instruction following | 25% | Responds in the requested format and addresses the actual question |
| Domain accuracy | 25% | Correctness of facts, numbers, and procedures |
| Response structure | 15% | Use of lists, step-by-step explanations, logical organization |
| Refusal capability | 10% | Ability to admit uncertainty rather than hallucinate |
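The five criteria combine into the single 5-point score; a minimal sketch of that weighted aggregation, assuming each criterion is rated 1–5:

```python
# Weights from the methodology table; they sum to 1.0.
WEIGHTS = {
    "language_naturalness": 0.25,
    "instruction_following": 0.25,
    "domain_accuracy": 0.25,
    "response_structure": 0.15,
    "refusal_capability": 0.10,
}

def weighted_score(ratings: dict) -> float:
    """Collapse per-criterion 1-5 ratings into one weighted score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[c] * ratings[c] for c in WEIGHTS)
```

For example, ratings of 4, 3, 4, 2, 5 (in the table's order) yield 0.25·4 + 0.25·3 + 0.25·4 + 0.15·2 + 0.10·5 = 3.55.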

The 6 Models

| Model | Parameters | Quantization | VRAM |
|---|---|---|---|
| Qwen3-14B-AWQ | 14B | INT4 (AWQ) | 9.4 GB |
| Gemma-3-12B-AWQ | 12B | INT4 (AWQ) | 8.1 GB |
| KORMo-10B-sft | 10B | BF16 | 20.0 GB |
| Qwen3-8B-AWQ | 8B | INT4 (AWQ) | 5.2 GB |
| Phi-4-AWQ | 14B | INT4 (AWQ) | 8.8 GB |
| Llama-3.1-8B-AWQ | 8B | INT4 (AWQ) | 5.0 GB |

Overall Rankings

| Rank | Model | Overall | Mfg | SaaS | Medical | Retail | Legal | Auto | Korean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3-14B | 3.86 | 4.1 | 3.9 | 3.8 | 3.7 | 3.5 | 4.2 | 4.0 |
| 2 | Gemma-3-12B | 3.70 | 3.8 | 3.7 | 3.9 | 3.6 | 3.2 | 3.8 | 3.9 |
| 2 | KORMo-10B | 3.70 | 3.6 | 3.5 | 3.7 | 3.8 | 3.4 | 3.5 | 4.2 |
| 4 | Qwen3-8B | 3.38 | 3.5 | 3.4 | 3.3 | 3.3 | 3.0 | 3.6 | 3.5 |
| 5 | Phi-4 | 2.64 | 2.8 | 2.9 | 2.5 | 2.4 | 2.2 | 3.2 | 2.3 |
| 6 | Llama-3.1-8B | 2.58 | 2.7 | 2.8 | 2.5 | 2.3 | 2.1 | 3.0 | 2.2 |

Qwen3-14B takes the top spot at 3.86/5.0, winning 4 of the 7 categories (manufacturing, SaaS, legal, and automation). It excels particularly in automation (4.2) and manufacturing (4.1), where structured, accurate responses matter most. Its code generation and step-by-step reasoning give it a clear edge over the competition.

Gemma-3-12B and KORMo-10B tie at 3.70. Gemma leads in medical scenarios (3.9) with more careful disclaimers and attention to edge cases. KORMo leads in Korean language quality (4.2) and retail (3.8) — impressive for a 10B model, demonstrating the impact of Korean-specialized training data.

Phi-4 (2.64) and Llama-3.1-8B (2.58) scored below 3.0, making them unsuitable for non-English production services. Their English-centric training results in awkward grammar, missed honorifics, and frequent language switching mid-response.

Hallucination Defense

Six of the 60 questions are deliberately designed hallucination traps: non-existent products, fake legal precedents, and fabricated medical conditions. A correct response means the model refuses to answer or explicitly states it lacks the information.

| Model | Passed | Failed | Notes |
|---|---|---|---|
| Qwen3-14B | 4/6 | 2/6 | Failed on fake legal precedent and non-existent product |
| KORMo-10B | 4/6 | 2/6 | Natural refusal phrasing in Korean |
| Llama-3.1-8B | 3/6 | 3/6 | Sometimes refuses in English instead of Korean |
| Gemma-3-12B | 2/6 | 4/6 | Provided dangerous medical diagnoses; fabricated prices |
| Qwen3-8B | 2/6 | 4/6 | Significant refusal degradation vs the 14B variant |
| Phi-4 | 1/6 | 5/6 | Confidently fabricated detailed false information; most dangerous |

The best hallucination pass rate is only 4 out of 6 (Qwen3-14B and KORMo-10B). No model passed all six traps. Every single model fabricated at least one legal citation — inventing statute numbers, case references, and article clauses that don't exist. This means no local LLM should be deployed for legal or compliance tasks without RAG or database verification.
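Grading a trap question comes down to detecting an explicit refusal. A crude illustrative sketch follows; the marker phrases are English stand-ins (the actual benchmark judged Korean responses, and the real grading rubric is not something this article specifies):

```python
# Illustrative refusal markers only -- not the benchmark's actual rubric.
REFUSAL_MARKERS = (
    "i don't know",
    "i could not find",
    "no such",
    "does not exist",
    "i'm not aware of",
)

def passes_trap(response: str) -> bool:
    """A trap is passed only if the model refuses or flags missing info."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```

Any confident answer about the non-existent entity counts as a fail, which is exactly the failure mode every model showed on legal citations.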

Critical Defects Across All Models

Beyond scores, certain defects would cause real-world incidents if deployed without safeguards:

  • Legal citation fabrication (all 6 models): Every model invented non-existent laws, court rulings, or article numbers at least once. Legal domains require mandatory RAG + source database integration.
  • Chinese text contamination (Qwen3-14B, Qwen3-8B): Occasionally inserts Chinese sentences mid-response, particularly for medical and legal terminology. Mitigated by system prompt enforcement and post-processing filters.
  • Repetition loops (Phi-4, Llama-3.1-8B): Endlessly repeats the same sentence until hitting max_tokens. Fix with repetition_penalty of 1.1–1.2.
  • Language switching (Phi-4, Llama-3.1-8B): Responds in English or mixes English mid-answer despite Korean prompts. Particularly frequent in technical scenarios.
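For the Chinese-text contamination above, a post-processing filter can be as simple as dropping sentences that contain Han characters. This is a crude sketch, not the authors' actual filter; it assumes the response text rarely contains legitimate hanja, which holds for most modern Korean prose:

```python
import re

# CJK Unified Ideographs block (Chinese characters / hanja).
HAN_RE = re.compile(r"[\u4e00-\u9fff]")

def strip_han_sentences(text: str) -> str:
    """Drop sentences containing Han characters as a crude contamination filter."""
    sentences = re.split(r"(?<=[.!?。])\s*", text)
    return " ".join(s for s in sentences if s and not HAN_RE.search(s))
```

A system-prompt instruction ("respond only in Korean") reduces how often the filter fires, but the filter catches the residue that slips through.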

Speed vs Quality Tradeoff

Faster models are not always better. Here is how generation speed maps against quality scores, all measured on an RTX PRO 6000 with SGLang:

| Model | Size | Single tok/s | 20-User tok/s | Score | Score per tok/s |
|---|---|---|---|---|---|
| Qwen3-14B-AWQ | 14B | 135 | 850 | 3.86 | 0.029 |
| Gemma-3-12B-AWQ | 12B | 148 | 920 | 3.70 | 0.025 |
| KORMo-10B-sft | 10B | 112 | 680 | 3.70 | 0.033 |
| Qwen3-8B-AWQ | 8B | 208 | 1,582 | 3.38 | 0.016 |
| Phi-4-AWQ | 14B | 130 | 810 | 2.64 | 0.020 |
| Llama-3.1-8B-AWQ | 8B | 215 | 1,640 | 2.58 | 0.012 |

The 8B models (Qwen3-8B, Llama-3.1-8B) are roughly 1.5x faster, but quality drops by nearly half a point (Qwen3-8B vs Qwen3-14B) to more than a full point (Llama-3.1-8B vs the top three). Serving a low-quality model faster does not make it better: Qwen3-14B at 850 tok/s across 20 concurrent users still delivers 42+ tok/s per user, which is more than sufficient for interactive chatbot experiences.

KORMo-10B has the highest score-per-tok/s ratio (0.033), but runs in BF16 at 20 GB VRAM and only 112 tok/s. If an AWQ-quantized version becomes available, it could become a serious contender for VRAM-constrained deployments.
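The Score-per-tok/s column is simply quality divided by single-stream throughput; a quick sketch reproducing the table's ratios and the per-user arithmetic:

```python
# Figures copied from the speed-vs-quality table above.
SINGLE_TOK_S = {"Qwen3-14B": 135, "Gemma-3-12B": 148, "KORMo-10B": 112,
                "Qwen3-8B": 208, "Phi-4": 130, "Llama-3.1-8B": 215}
SCORES = {"Qwen3-14B": 3.86, "Gemma-3-12B": 3.70, "KORMo-10B": 3.70,
          "Qwen3-8B": 3.38, "Phi-4": 2.64, "Llama-3.1-8B": 2.58}

def value_ratio(model: str) -> float:
    """Quality points per single-stream token/second (higher = better value)."""
    return round(SCORES[model] / SINGLE_TOK_S[model], 3)

# Per-user throughput under load: 850 tok/s shared by 20 concurrent users.
per_user_tok_s = 850 / 20  # 42.5 tok/s each
```

KORMo-10B's 0.033 tops the ratio because its score nearly matches the 12B/14B leaders while its BF16 throughput is the lowest.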

Recommended Models by Scenario

| Scenario | Best Pick | Runner-Up | Why |
|---|---|---|---|
| General purpose | Qwen3-14B | Gemma-3-12B | Highest overall score, stable across all scenarios |
| Korean-first services | KORMo-10B | Qwen3-14B | Best Korean score (4.2), natural honorifics |
| Medical consultation | Gemma-3-12B | Qwen3-14B | Highest medical score (3.9), careful disclaimers |
| Code / automation | Qwen3-14B | Qwen3-8B | Top automation score (4.2), excellent code structure |
| High throughput | Qwen3-8B | Qwen3-14B | 208 tok/s single-stream, 1,582 tok/s at 20 users |
| VRAM-constrained | Qwen3-8B | Llama-3.1-8B | 5.2 GB VRAM, fits comfortably on 16 GB GPUs |

Conclusion

For non-English production services, Qwen3-14B-AWQ is the default recommendation. At 14B parameters with AWQ quantization (9.4 GB VRAM), it offers the best quality-to-cost ratio and wins 4 of the 7 scenario categories. For Korean-first services where language naturalness is the top priority, KORMo-10B is worth considering despite its higher VRAM footprint. For medical domains specifically, Gemma-3-12B edges ahead with more cautious and detailed responses.

Regardless of which model you choose, remember that every model fabricated legal citations. For domains where factual accuracy is critical (legal, medical, financial), you must pair any local LLM with RAG or hybrid database search, and keep temperature at 0.3 or below. These benchmarks were conducted in March 2026; as models evolve, re-running the hallucination traps with each update is strongly recommended.