Local LLM Benchmark: 6 Models Tested Across 60 Questions and 7 Business Scenarios
Choosing a local LLM for production is hard. Marketing benchmarks rarely reflect real-world performance, and every model claims to be "state of the art." We ran 6 models through 60 questions spanning 7 business scenarios — manufacturing, SaaS, medical, retail, legal, automation, and language quality — for a total of 360 evaluations. We also included hallucination trap questions to test each model's ability to say "I don't know." Here is the complete scorecard.
- Models compared: 6
- Questions: 60, across 7 scenarios
- Top overall score: 3.86 (Qwen3-14B)
- Best hallucination trap pass rate: 4/6
Test Design and Methodology
All 6 models ran on the same GPU (RTX PRO 6000), the same serving engine (SGLang v0.4), and the same sampling settings: temperature 0.3 with top_p fixed at 0.9. Each model processed all 60 questions sequentially, and responses were scored on a 5-point scale using five weighted criteria:
| Criterion | Weight | What It Measures |
|---|---|---|
| Language naturalness | 25% | Grammar, word order, honorifics usage |
| Instruction following | 25% | Responds in the requested format and addresses the actual question |
| Domain accuracy | 25% | Correctness of facts, numbers, and procedures |
| Response structure | 15% | Use of lists, step-by-step explanations, logical organization |
| Refusal capability | 10% | Ability to admit uncertainty rather than hallucinate |
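As a sketch, the five weighted criteria combine into the overall score as a simple weighted average. The weights below come from the table above; the function name and example per-criterion scores are illustrative, not taken from the benchmark data:

```python
# Sketch of the benchmark's weighted 5-point scoring.
# Weights are from the criteria table; the example scores are made up.
WEIGHTS = {
    "language_naturalness": 0.25,
    "instruction_following": 0.25,
    "domain_accuracy": 0.25,
    "response_structure": 0.15,
    "refusal_capability": 0.10,
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion 0-5 scores into one overall 0-5 score."""
    return round(sum(WEIGHTS[c] * scores[c] for c in WEIGHTS), 2)

# A hypothetical response scored per criterion.
example = {
    "language_naturalness": 4.0,
    "instruction_following": 4.0,
    "domain_accuracy": 3.5,
    "response_structure": 4.0,
    "refusal_capability": 3.0,
}
print(weighted_score(example))  # weighted average of the five criteria
```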
The 6 Models
| Model | Parameters | Quantization | VRAM |
|---|---|---|---|
| Qwen3-14B-AWQ | 14B | INT4 (AWQ) | 9.4 GB |
| Gemma-3-12B-AWQ | 12B | INT4 (AWQ) | 8.1 GB |
| KORMo-10B-sft | 10B | BF16 | 20.0 GB |
| Qwen3-8B-AWQ | 8B | INT4 (AWQ) | 5.2 GB |
| Phi-4-AWQ | 14B | INT4 (AWQ) | 8.8 GB |
| Llama-3.1-8B-AWQ | 8B | INT4 (AWQ) | 5.0 GB |
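Each model was served the same way. A launch command roughly equivalent to the setup above, using SGLang's server launcher (the model path and port are examples; AWQ quantization is picked up from the checkpoint config in recent SGLang releases):

```shell
# Serve one benchmarked model locally with SGLang.
# Swap --model-path for any of the six checkpoints in the table.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-14B-AWQ \
  --port 30000
```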
Overall Rankings
| Rank | Model | Overall | Mfg | SaaS | Medical | Retail | Legal | Auto | Korean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3-14B | 3.86 | 4.1 | 3.9 | 3.8 | 3.7 | 3.5 | 4.2 | 4.0 |
| 2 | Gemma-3-12B | 3.70 | 3.8 | 3.7 | 3.9 | 3.6 | 3.2 | 3.8 | 3.9 |
| 2 | KORMo-10B | 3.70 | 3.6 | 3.5 | 3.7 | 3.8 | 3.4 | 3.5 | 4.2 |
| 4 | Qwen3-8B | 3.38 | 3.5 | 3.4 | 3.3 | 3.3 | 3.0 | 3.6 | 3.5 |
| 5 | Phi-4 | 2.64 | 2.8 | 2.9 | 2.5 | 2.4 | 2.2 | 3.2 | 2.3 |
| 6 | Llama-3.1-8B | 2.58 | 2.7 | 2.8 | 2.5 | 2.3 | 2.1 | 3.0 | 2.2 |
Qwen3-14B takes the top spot at 3.86/5.0, winning 4 of the 7 categories (manufacturing, SaaS, legal, and automation). It excels particularly in automation (4.2) and manufacturing (4.1), where structured, accurate responses matter most. Its code generation and step-by-step reasoning give it a clear edge over the competition.
Gemma-3-12B and KORMo-10B tie at 3.70. Gemma leads in medical scenarios (3.9) with more careful disclaimers and attention to edge cases. KORMo leads in Korean language quality (4.2) and retail (3.8) — impressive for a 10B model, demonstrating the impact of Korean-specialized training data.
Phi-4 (2.64) and Llama-3.1-8B (2.58) scored below 3.0, making them unsuitable for non-English production services. Their English-centric training results in awkward grammar, missed honorifics, and frequent language switching mid-response.
Hallucination Defense
Six of the 60 questions are deliberately designed hallucination traps: non-existent products, fake legal precedents, and fabricated medical conditions. A correct response means the model refuses to answer or explicitly states it lacks the information.
| Model | Passed | Failed | Notes |
|---|---|---|---|
| Qwen3-14B | 4/6 | 2/6 | Failed on fake legal precedent and non-existent product |
| KORMo-10B | 4/6 | 2/6 | Natural refusal phrasing in Korean |
| Llama-3.1-8B | 3/6 | 3/6 | Sometimes refuses in English instead of Korean |
| Gemma-3-12B | 2/6 | 4/6 | Provided dangerous medical diagnoses; fabricated prices |
| Qwen3-8B | 2/6 | 4/6 | Significant refusal degradation vs 14B variant |
| Phi-4 | 1/6 | 5/6 | Confidently fabricated detailed false information; most dangerous |
The best hallucination pass rate is only 4 out of 6 (Qwen3-14B and KORMo-10B). No model passed all six traps. Every single model fabricated at least one legal citation — inventing statute numbers, case references, and article clauses that don't exist. This means no local LLM should be deployed for legal or compliance tasks without RAG or database verification.
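A trap question counts as passed when the model refuses or explicitly admits it lacks the information. A minimal automated check might look like the sketch below; the refusal phrases are illustrative English placeholders (a Korean-first benchmark would also need Korean markers), and the benchmark itself graded refusals as part of the scoring rubric rather than with string matching:

```python
# Minimal trap-pass checker: a response passes a hallucination trap
# when it contains an explicit refusal or uncertainty marker.
# The marker list is an illustrative placeholder, not the actual rubric.
REFUSAL_MARKERS = [
    "i don't know",
    "i do not know",
    "no information",
    "does not exist",
    "cannot verify",
    "not able to confirm",
]

def passes_trap(response: str) -> bool:
    """True if the response refuses instead of fabricating an answer."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

print(passes_trap("I cannot verify that this product exists."))  # True
print(passes_trap("The XJ-900 retails for $499."))               # False
```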
Critical Defects Across All Models
Beyond scores, certain defects would cause real-world incidents if deployed without safeguards:
- Legal citation fabrication (all 6 models): Every model invented non-existent laws, court rulings, or article numbers at least once. Legal domains require mandatory RAG + source database integration.
- Chinese text contamination (Qwen3-14B, Qwen3-8B): both Qwen models occasionally insert Chinese sentences mid-response, particularly around medical and legal terminology. This can be mitigated with system-prompt enforcement and post-processing filters.
- Repetition loops (Phi-4, Llama-3.1-8B): these models can repeat the same sentence endlessly until hitting max_tokens. A repetition_penalty of 1.1–1.2 fixes this.
- Language switching (Phi-4, Llama-3.1-8B): despite Korean prompts, these models respond in English or mix English mid-answer, particularly in technical scenarios.
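The sampling-side mitigations above can be bundled into a request for SGLang's native `/generate` endpoint. This is a sketch: the endpoint URL assumes a local server, and `build_generate_payload` is an illustrative helper, not part of SGLang:

```python
def build_generate_payload(prompt: str, repetition_penalty: float = 1.15) -> dict:
    """Request body for SGLang's /generate endpoint, using the benchmark's
    sampling settings plus the repetition-loop mitigation (sketch)."""
    return {
        "text": prompt,
        "sampling_params": {
            "temperature": 0.3,           # benchmark setting
            "top_p": 0.9,                 # benchmark setting
            "repetition_penalty": repetition_penalty,  # 1.1-1.2 breaks loops
            "max_new_tokens": 512,
        },
    }

# Usage against a running server (assumed URL):
# import requests
# resp = requests.post("http://localhost:30000/generate",
#                      json=build_generate_payload("..."), timeout=120)
# print(resp.json()["text"])
```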
Speed vs Quality Tradeoff
Faster models are not always better. Here is how generation speed maps against quality scores, all measured on an RTX PRO 6000 with SGLang:
| Model | Size | Single tok/s | 20-User tok/s | Score | Score/tok/s |
|---|---|---|---|---|---|
| Qwen3-14B-AWQ | 14B | 135 | 850 | 3.86 | 0.029 |
| Gemma-3-12B-AWQ | 12B | 148 | 920 | 3.70 | 0.025 |
| KORMo-10B-sft | 10B | 112 | 680 | 3.70 | 0.033 |
| Qwen3-8B-AWQ | 8B | 208 | 1,582 | 3.38 | 0.016 |
| Phi-4-AWQ | 14B | 130 | 810 | 2.64 | 0.020 |
| Llama-3.1-8B-AWQ | 8B | 215 | 1,640 | 2.58 | 0.012 |
The 8B models (Qwen3-8B, Llama-3.1-8B) are roughly 1.5x faster, but quality drops by about half a point for Qwen3-8B and more than a full point for Llama-3.1-8B. Serving a low-quality model faster does not make it better: Qwen3-14B at 850 tok/s across 20 concurrent users still delivers 42.5 tok/s per user, which is more than sufficient for interactive chatbot experiences.
KORMo-10B has the highest score-per-tok/s ratio (0.033), but runs in BF16 at 20 GB VRAM and only 112 tok/s. If an AWQ-quantized version becomes available, it could become a serious contender for VRAM-constrained deployments.
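The derived figures above (per-user throughput at 20 concurrent users, and score per single-stream tok/s) can be recomputed directly from the measured columns of the table:

```python
# Recompute the derived columns from the measured benchmark numbers:
# (overall score, single-stream tok/s, aggregate tok/s at 20 users).
models = {
    "Qwen3-14B-AWQ":    (3.86, 135, 850),
    "Gemma-3-12B-AWQ":  (3.70, 148, 920),
    "KORMo-10B-sft":    (3.70, 112, 680),
    "Qwen3-8B-AWQ":     (3.38, 208, 1582),
    "Phi-4-AWQ":        (2.64, 130, 810),
    "Llama-3.1-8B-AWQ": (2.58, 215, 1640),
}

for name, (score, single, at_20) in models.items():
    per_user = at_20 / 20            # tok/s each of 20 users sees
    ratio = score / single           # quality per unit of speed
    print(f"{name}: {per_user:.1f} tok/s per user, score/tok/s = {ratio:.3f}")
```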
Recommended Models by Scenario
| Scenario | Best Pick | Runner-Up | Why |
|---|---|---|---|
| General purpose | Qwen3-14B | Gemma-3-12B | Highest overall score, stable across all scenarios |
| Korean-first services | KORMo-10B | Qwen3-14B | Best Korean score (4.2), natural honorifics |
| Medical consultation | Gemma-3-12B | Qwen3-14B | Highest medical score (3.9), careful disclaimers |
| Code / automation | Qwen3-14B | Qwen3-8B | Top automation score (4.2), excellent code structure |
| High throughput | Qwen3-8B | Qwen3-14B | 208 tok/s single, 1,582 tok/s at 20 users |
| VRAM-constrained | Qwen3-8B | Llama-3.1-8B | 5.2 GB VRAM, fits comfortably on 16 GB GPUs |
Conclusion
For non-English production services, Qwen3-14B-AWQ is the default recommendation. At 14B parameters with AWQ quantization (9.4 GB VRAM), it offers the best quality-to-cost ratio and wins 4 of 7 scenario categories. For Korean-first services where language naturalness is the top priority, KORMo-10B is worth considering despite its higher VRAM footprint. For medical domains specifically, Gemma-3-12B edges ahead with more cautious and detailed responses.
Regardless of which model you choose, every model fabricates legal citations. For domains where factual accuracy is critical (legal, medical, financial), you must pair any local LLM with RAG or hybrid database search, and keep temperature at 0.3 or below. These benchmarks were conducted in March 2026; as models evolve, re-running the hallucination traps with each update is strongly recommended.
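One concrete shape for the "verify citations against a source database" guard: flag any citation in the model's answer that the trusted database does not contain, before the answer reaches a user. The regex and the known-citation set below are illustrative placeholders, not a real legal database schema:

```python
import re

# Placeholder for a trusted source database lookup; in practice this
# would query a statute/case-law index, not an in-memory set.
KNOWN_CITATIONS = {"Act No. 12345 Art. 3", "2020Da11111"}

# Illustrative pattern for two citation shapes (statute article, case number).
CITATION_RE = re.compile(r"(Act No\. \d+ Art\. \d+|\d{4}Da\d+)")

def unverified_citations(answer: str) -> list:
    """Return citations in the answer that are absent from the source DB."""
    return [c for c in CITATION_RE.findall(answer) if c not in KNOWN_CITATIONS]

answer = "Under Act No. 12345 Art. 3 and Act No. 99999 Art. 1, the claim fails."
print(unverified_citations(answer))  # ['Act No. 99999 Art. 1']
```

Any non-empty result means the answer should be blocked or routed through RAG with the real source text before delivery.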