Local LLM Benchmark: 6 Models Tested Across 60 Questions and 7 Business Scenarios
Choosing a local LLM for production is hard. Marketing benchmarks rarely reflect real-world performance, and every model claims to be "state of the art." We ran 6 models through 60 questions spanning 7 business scenarios — manufacturing, SaaS, medical, retail, legal, automation, and language quality — for a total of 360 evaluations. We also included hallucination trap questions to test each model's ability to say "I don't know." Here is the complete scorecard.
- Models compared: 6
- Questions: 60, across 7 scenarios
- Top overall score: 3.86 (Qwen3-14B)
- Best hallucination trap pass rate: 4/6
Test Design and Methodology
All 6 models ran on the same GPU (RTX PRO 6000), the same serving engine (SGLang v0.4), and the same sampling settings: temperature 0.3 with top_p fixed at 0.9. Each model processed all 60 questions sequentially, and responses were scored on a 5-point scale using five weighted criteria:
| Criterion | Weight | What It Measures |
|---|---|---|
| Language naturalness | 25% | Grammar, word order, honorifics usage |
| Instruction following | 25% | Responds in the requested format and addresses the actual question |
| Domain accuracy | 25% | Correctness of facts, numbers, and procedures |
| Response structure | 15% | Use of lists, step-by-step explanations, logical organization |
| Refusal capability | 10% | Ability to admit uncertainty rather than hallucinate |
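As a sketch, the five weighted criteria combine into the overall score as a simple weighted average. The weights below come from the table above; the function name and example per-criterion scores are illustrative, not taken from the benchmark data:

```python
# Sketch of the benchmark's weighted 5-point scoring.
# Weights are from the criteria table; the example scores are made up.
WEIGHTS = {
    "language_naturalness": 0.25,
    "instruction_following": 0.25,
    "domain_accuracy": 0.25,
    "response_structure": 0.15,
    "refusal_capability": 0.10,
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion 0-5 scores into one overall 0-5 score."""
    return round(sum(WEIGHTS[c] * scores[c] for c in WEIGHTS), 2)

# A hypothetical response scored per criterion.
example = {
    "language_naturalness": 4.0,
    "instruction_following": 4.0,
    "domain_accuracy": 3.5,
    "response_structure": 4.0,
    "refusal_capability": 3.0,
}
print(weighted_score(example))  # weighted average of the five criteria
```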
The 6 Models
| Model | Parameters | Quantization | VRAM |
|---|---|---|---|
| Qwen3-14B-AWQ | 14B | INT4 (AWQ) | 9.4 GB |
| Gemma-3-12B-AWQ | 12B | INT4 (AWQ) | 8.1 GB |
| KORMo-10B-sft | 10B | BF16 | 20.0 GB |
| Qwen3-8B-AWQ | 8B | INT4 (AWQ) | 5.2 GB |
| Phi-4-AWQ | 14B | INT4 (AWQ) | 8.8 GB |
| Llama-3.1-8B-AWQ | 8B | INT4 (AWQ) | 5.0 GB |
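Each model was served the same way. A launch command roughly equivalent to the setup above, using SGLang's server launcher (the model path and port are examples; AWQ quantization is picked up from the checkpoint config in recent SGLang releases):

```shell
# Serve one benchmarked model locally with SGLang.
# Swap --model-path for any of the six checkpoints in the table.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-14B-AWQ \
  --port 30000
```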
Overall Rankings
| Rank | Model | Overall | Mfg | SaaS | Medical | Retail | Legal | Auto | Korean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3-14B | 3.86 | 4.1 | 3.9 | 3.8 | 3.7 | 3.5 | 4.2 | 4.0 |
| 2 | Gemma-3-12B | 3.70 | 3.8 | 3.7 | 3.9 | 3.6 | 3.2 | 3.8 | 3.9 |
| 2 | KORMo-10B | 3.70 | 3.6 | 3.5 | 3.7 | 3.8 | 3.4 | 3.5 | 4.2 |
| 4 | Qwen3-8B | 3.38 | 3.5 | 3.4 | 3.3 | 3.3 | 3.0 | 3.6 | 3.5 |
| 5 | Phi-4 | 2.64 | 2.8 | 2.9 | 2.5 | 2.4 | 2.2 | 3.2 | 2.3 |
| 6 | Llama-3.1-8B | 2.58 | 2.7 | 2.8 | 2.5 | 2.3 | 2.1 | 3.0 | 2.2 |
Qwen3-14B takes the top spot at 3.86/5.0, winning 4 of the 7 categories (manufacturing, SaaS, legal, and automation). It excels particularly in automation (4.2) and manufacturing (4.1), where structured, accurate responses matter most. Its code generation and step-by-step reasoning give it a clear edge over the competition.
Gemma-3-12B and KORMo-10B tie at 3.70. Gemma leads in medical scenarios (3.9) with more careful disclaimers and attention to edge cases. KORMo leads in Korean language quality (4.2) and retail (3.8) — impressive for a 10B model, demonstrating the impact of Korean-specialized training data.
Phi-4 (2.64) and Llama-3.1-8B (2.58) scored below 3.0, making them unsuitable for non-English production services. Their English-centric training results in awkward grammar, missed honorifics, and frequent language switching mid-response.
Hallucination Defense
Six of the 60 questions are deliberately designed hallucination traps: non-existent products, fake legal precedents, and fabricated medical conditions. A correct response means the model refuses to answer or explicitly states it lacks the information.
| Model | Passed | Failed | Notes |
|---|---|---|---|
| Qwen3-14B | 4/6 | 2/6 | Failed on fake legal precedent and non-existent product |
| KORMo-10B | 4/6 | 2/6 | Natural refusal phrasing in Korean |
| Llama-3.1-8B | 3/6 | 3/6 | Sometimes refuses in English instead of Korean |
| Gemma-3-12B | 2/6 | 4/6 | Provided dangerous medical diagnoses; fabricated prices |
| Qwen3-8B | 2/6 | 4/6 | Significant refusal degradation vs 14B variant |
| Phi-4 | 1/6 | 5/6 | Confidently fabricated detailed false information; most dangerous |
The best hallucination pass rate is only 4 out of 6 (Qwen3-14B and KORMo-10B). No model passed all six traps. Every single model fabricated at least one legal citation — inventing statute numbers, case references, and article clauses that don't exist. This means no local LLM should be deployed for legal or compliance tasks without RAG or database verification.
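A trap question counts as passed when the model refuses or explicitly admits it lacks the information. A minimal automated check might look like the sketch below; the refusal phrases are illustrative English placeholders (a Korean-first benchmark would also need Korean markers), and the benchmark itself graded refusals as part of the scoring rubric rather than with string matching:

```python
# Minimal trap-pass checker: a response passes a hallucination trap
# when it contains an explicit refusal or uncertainty marker.
# The marker list is an illustrative placeholder, not the actual rubric.
REFUSAL_MARKERS = [
    "i don't know",
    "i do not know",
    "no information",
    "does not exist",
    "cannot verify",
    "not able to confirm",
]

def passes_trap(response: str) -> bool:
    """True if the response refuses instead of fabricating an answer."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

print(passes_trap("I cannot verify that this product exists."))  # True
print(passes_trap("The XJ-900 retails for $499."))               # False
```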
Critical Defects Across All Models
Beyond scores, certain defects would cause real-world incidents if deployed without safeguards:
- Legal citation fabrication (all 6 models): Every model invented non-existent laws, court rulings, or article numbers at least once. Legal domains require mandatory RAG + source database integration.
- Chinese text contamination (Qwen3-14B, Qwen3-8B): both Qwen models occasionally insert Chinese sentences mid-response, particularly around medical and legal terminology. This can be mitigated with system-prompt enforcement and post-processing filters.
- Repetition loops (Phi-4, Llama-3.1-8B): these models can repeat the same sentence endlessly until hitting max_tokens. A repetition_penalty of 1.1–1.2 fixes this.
- Language switching (Phi-4, Llama-3.1-8B): despite Korean prompts, these models respond in English or mix English mid-answer, particularly in technical scenarios.
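The sampling-side mitigations above can be bundled into a request for SGLang's native `/generate` endpoint. This is a sketch: the endpoint URL assumes a local server, and `build_generate_payload` is an illustrative helper, not part of SGLang:

```python
def build_generate_payload(prompt: str, repetition_penalty: float = 1.15) -> dict:
    """Request body for SGLang's /generate endpoint, using the benchmark's
    sampling settings plus the repetition-loop mitigation (sketch)."""
    return {
        "text": prompt,
        "sampling_params": {
            "temperature": 0.3,           # benchmark setting
            "top_p": 0.9,                 # benchmark setting
            "repetition_penalty": repetition_penalty,  # 1.1-1.2 breaks loops
            "max_new_tokens": 512,
        },
    }

# Usage against a running server (assumed URL):
# import requests
# resp = requests.post("http://localhost:30000/generate",
#                      json=build_generate_payload("..."), timeout=120)
# print(resp.json()["text"])
```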
Speed vs Quality Tradeoff
Faster models are not always better. Here is how generation speed maps against quality scores, all measured on an RTX PRO 6000 with SGLang:
| Model | Size | Single tok/s | 20-User tok/s | Score | Score/tok/s |
|---|---|---|---|---|---|
| Qwen3-14B-AWQ | 14B | 135 | 850 | 3.86 | 0.029 |
| Gemma-3-12B-AWQ | 12B | 148 | 920 | 3.70 | 0.025 |
| KORMo-10B-sft | 10B | 112 | 680 | 3.70 | 0.033 |
| Qwen3-8B-AWQ | 8B | 208 | 1,582 | 3.38 | 0.016 |
| Phi-4-AWQ | 14B | 130 | 810 | 2.64 | 0.020 |
| Llama-3.1-8B-AWQ | 8B | 215 | 1,640 | 2.58 | 0.012 |
The 8B models (Qwen3-8B, Llama-3.1-8B) are roughly 1.5x faster, but quality drops by about half a point for Qwen3-8B and more than a full point for Llama-3.1-8B. Serving a low-quality model faster does not make it better: Qwen3-14B at 850 tok/s across 20 concurrent users still delivers 42.5 tok/s per user, which is more than sufficient for interactive chatbot experiences.
KORMo-10B has the highest score-per-tok/s ratio (0.033), but runs in BF16 at 20 GB VRAM and only 112 tok/s. If an AWQ-quantized version becomes available, it could become a serious contender for VRAM-constrained deployments.
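The derived figures above (per-user throughput at 20 concurrent users, and score per single-stream tok/s) can be recomputed directly from the measured columns of the table:

```python
# Recompute the derived columns from the measured benchmark numbers:
# (overall score, single-stream tok/s, aggregate tok/s at 20 users).
models = {
    "Qwen3-14B-AWQ":    (3.86, 135, 850),
    "Gemma-3-12B-AWQ":  (3.70, 148, 920),
    "KORMo-10B-sft":    (3.70, 112, 680),
    "Qwen3-8B-AWQ":     (3.38, 208, 1582),
    "Phi-4-AWQ":        (2.64, 130, 810),
    "Llama-3.1-8B-AWQ": (2.58, 215, 1640),
}

for name, (score, single, at_20) in models.items():
    per_user = at_20 / 20            # tok/s each of 20 users sees
    ratio = score / single           # quality per unit of speed
    print(f"{name}: {per_user:.1f} tok/s per user, score/tok/s = {ratio:.3f}")
```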
Recommended Models by Scenario
| Scenario | Best Pick | Runner-Up | Why |
|---|---|---|---|
| General purpose | Qwen3-14B | Gemma-3-12B | Highest overall score, stable across all scenarios |
| Korean-first services | KORMo-10B | Qwen3-14B | Best Korean score (4.2), natural honorifics |
| Medical consultation | Gemma-3-12B | Qwen3-14B | Highest medical score (3.9), careful disclaimers |
| Code / automation | Qwen3-14B | Qwen3-8B | Top automation score (4.2), excellent code structure |
| High throughput | Qwen3-8B | Qwen3-14B | 208 tok/s single, 1,582 tok/s at 20 users |
| VRAM-constrained | Qwen3-8B | Llama-3.1-8B | 5.2 GB VRAM, fits comfortably on 16 GB GPUs |
Conclusion
For non-English production services, Qwen3-14B-AWQ is the default recommendation. At 14B parameters with AWQ quantization (9.4 GB VRAM), it offers the best quality-to-cost ratio and wins 4 of 7 scenario categories. For Korean-first services where language naturalness is the top priority, KORMo-10B is worth considering despite its higher VRAM footprint. For medical domains specifically, Gemma-3-12B edges ahead with more cautious and detailed responses.
Regardless of which model you choose, every model fabricates legal citations. For domains where factual accuracy is critical (legal, medical, financial), you must pair any local LLM with RAG or hybrid database search, and keep temperature at 0.3 or below. These benchmarks were conducted in March 2026; as models evolve, re-running the hallucination traps with each update is strongly recommended.
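One concrete shape for the "verify citations against a source database" guard: flag any citation in the model's answer that the trusted database does not contain, before the answer reaches a user. The regex and the known-citation set below are illustrative placeholders, not a real legal database schema:

```python
import re

# Placeholder for a trusted source database lookup; in practice this
# would query a statute/case-law index, not an in-memory set.
KNOWN_CITATIONS = {"Act No. 12345 Art. 3", "2020Da11111"}

# Illustrative pattern for two citation shapes (statute article, case number).
CITATION_RE = re.compile(r"(Act No\. \d+ Art\. \d+|\d{4}Da\d+)")

def unverified_citations(answer: str) -> list:
    """Return citations in the answer that are absent from the source DB."""
    return [c for c in CITATION_RE.findall(answer) if c not in KNOWN_CITATIONS]

answer = "Under Act No. 12345 Art. 3 and Act No. 99999 Art. 1, the claim fails."
print(unverified_citations(answer))  # ['Act No. 99999 Art. 1']
```

Any non-empty result means the answer should be blocked or routed through RAG with the real source text before delivery.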