Local LLM Korean Language Comparison — 6 Models Tested with 10 Real Questions

2026-01-20

Treeru

When building Korean-language services with local LLMs, the most critical factor is how naturally the model speaks Korean. We tested 6 models across 10 real-world questions, evaluating honorific consistency, business register, language contamination (Chinese/English mixing), and naturalness of expression.

Top score: 4.28 (Gemma)

Bottom score: 2.33 (Phi-4)

Korean questions: 10

Models tested: 6

1. Korean Language Score Comparison

Scores are from Scenario G (Korean proficiency), evaluated across 10 questions on a 5-point scale. Criteria included naturalness, honorific consistency, terminology accuracy, and language contamination.

Rank	Model	Korean Score	Notes
1	Gemma-3-12B	4.28	#1 Korean, natural expressions
2	Qwen3-14B	4.19	Stable business Korean
3	KORMo-10B	3.83	Korean-specialized, excellent honorifics
4	Qwen3-8B	3.33	Adequate but Chinese contamination
5	Llama-3.1-8B	2.67	Severe multilingual contamination
6	Phi-4	2.33	Frequent English switching, lowest

The Gap Between #1 and #6

Gemma-3-12B (4.28) and Phi-4 (2.33) are 1.95 points apart. Even though both are "LLMs," the Korean language experience is worlds apart. For Korean services, model selection can make or break your product.
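The ranking and the 1.95-point gap can be reproduced from the table with a few lines of Python. This is only a minimal sketch of the arithmetic: the scores are copied from the table above, and the helper name `gap_to_leader` is a hypothetical name introduced for illustration.

```python
# Korean scores copied from the ranking table above (5-point scale).
scores = {
    "Gemma-3-12B": 4.28,
    "Qwen3-14B": 4.19,
    "KORMo-10B": 3.83,
    "Qwen3-8B": 3.33,
    "Llama-3.1-8B": 2.67,
    "Phi-4": 2.33,
}

# Sort from highest to lowest score to recover the ranking.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_model, best_score = ranked[0]
worst_model, worst_score = ranked[-1]

# Gap between the top and bottom model, rounded to two decimals.
gap = round(best_score - worst_score, 2)


def gap_to_leader(model: str) -> float:
    """Hypothetical helper: how far a model trails the top score."""
    return round(best_score - scores[model], 2)


if __name__ == "__main__":
    for rank, (model, score) in enumerate(ranked, start=1):
        print(f"{rank}. {model}: {score:.2f} (behind leader by {gap_to_leader(model):.2f})")
    print(f"Gap between {best_model} and {worst_model}: {gap:.2f}")
```

The same lookup shows that Qwen3-14B trails the leader by only 0.09 points, while the drop from KORMo-10B to Qwen3-8B (0.50) is the largest step between adjacent ranks.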

2. Honorifics and Business Register

In Korean, maintaining honorifics (formal speech levels) is the most basic requirement for any service. Korean has multiple speech levels, and business communication demands consistent use of formal endings like "-습니다" and "-하세요." We evaluated whether models could maintain formal register throughout their responses.

Good Honorific Usage

"I will guide you through this matter. According to Article 60 of the Labor Standards Act, annual paid leave is granted to workers who have attended 80% or more over one year."

- KORMo-10B (natural formal Korean)

Poor Honorific Usage

"This problem can be solved as follows. First, the employment contract should... ah, let me explain this part in Korean..."

- Phi-4 (switched to English mid-sentence)

Model	Honorific Consistency	Business Tone	Assessment
KORMo-10B	★★★★★	★★★★★	Near-perfect formal register
Gemma-3-12B	★★★★★	★★★★☆	Natural casual/formal switching
Qwen3-14B	★★★★☆	★★★★☆	Stable but occasional translationese
Qwen3-8B	★★★☆☆	★★★☆☆	Maintains formality but sounds stiff
Llama-3.1-8B	★★☆☆☆	★★☆☆☆	Inconsistent honorific maintenance
Phi-4	★★☆☆☆	★☆☆☆☆	Drops from formal to casual mid-response
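A rough automated check for register consistency can complement human rating. The sketch below is not the evaluation method used for the table above, which relied on evaluator judgment. It is a simple heuristic that assumes each sentence in a business response should end in one of a small, hypothetical list of formal or polite endings. The names `FORMAL_ENDINGS`, `is_formal`, and `formal_ratio` are introduced here for illustration only.

```python
import re

# Hypothetical, deliberately small list of formal/polite sentence endings.
# "니다" covers "-습니다" / "-ㅂ니다"; "세요" covers "-하세요".
FORMAL_ENDINGS = ("니다", "니까", "세요", "십시오")


def split_sentences(text: str) -> list:
    """Split on whitespace that follows sentence-final punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def is_formal(sentence: str) -> bool:
    """Heuristic: does the sentence end with one of the formal endings?"""
    core = sentence.rstrip(" .!?\"'")
    return core.endswith(FORMAL_ENDINGS)


def formal_ratio(text: str) -> float:
    """Share of sentences that end formally (0.0 for empty input)."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(is_formal(s) for s in sentences) / len(sentences)


if __name__ == "__main__":
    consistent = "안내해 드리겠습니다. 연차 유급휴가는 근로자에게 부여됩니다."
    mixed = "확인해 드리겠습니다. 이건 이렇게 하면 돼."
    print(formal_ratio(consistent))  # every sentence ends formally
    print(formal_ratio(mixed))       # second sentence drops to casual speech
```

A ratio below 1.0 flags a response for manual review. A suffix check like this cannot judge whether the register is appropriate or natural, so it is best treated as a pre-filter rather than a score.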

3. Language Contamination

When you ask a question in Korean and get Chinese, English, Russian, or Japanese mixed into the response, that’s "language contamination." It’s a service-quality killer.

Model	Incidents	Contamination Language	Severity
Gemma-3-12B	0–1	English words only	Minor
Qwen3-14B	0–1	English technical terms	Minor
KORMo-10B	1	English technical terms	Minor
Qwen3-8B	3	Chinese (characters, Chinese-style phrasing)	Moderate
Phi-4	3+	English (full sentence switching)	Severe
Llama-3.1-8B	5+	Russian, Chinese, Japanese, English	Critical

Qwen3-8B: Chinese Contamination Example

Korean response with sudden Chinese characters: "劳动者" appearing where the Korean word for "worker" should be used

Chinese from training data leaks into Korean responses, especially in legal/administrative contexts.

Llama-3.1-8B: Multi-Language Contamination

Mid-Korean response, Russian ("следующим образом") and Japanese ("具体的には") appear randomly

Random switching to Russian, Japanese, Chinese during Korean responses. Multiple languages per response.
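Contamination of this kind can be flagged automatically by counting characters per Unicode script. The following is a minimal sketch rather than the procedure used to produce the incident counts above: the code point ranges cover only the most common blocks, and the names `SCRIPT_RANGES`, `script_counts`, and `contamination` are assumptions made for this example.

```python
from collections import Counter
from typing import Optional

# Common Unicode blocks per script (not exhaustive; an assumption of this sketch).
SCRIPT_RANGES = {
    "hangul": [(0xAC00, 0xD7A3), (0x1100, 0x11FF), (0x3130, 0x318F)],
    "han": [(0x4E00, 0x9FFF), (0x3400, 0x4DBF)],
    "kana": [(0x3040, 0x30FF)],
    "cyrillic": [(0x0400, 0x04FF)],
    "latin": [(0x0041, 0x005A), (0x0061, 0x007A)],
}


def script_of(ch: str) -> Optional[str]:
    """Return the script name for a character, or None for digits, spaces, etc."""
    cp = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return script
    return None


def script_counts(text: str) -> Counter:
    """Count characters per script, ignoring characters with no known script."""
    return Counter(s for s in map(script_of, text) if s is not None)


def contamination(text: str, allowed=("hangul",)) -> dict:
    """Character counts for every script outside the allowed set."""
    return {s: n for s, n in script_counts(text).items() if s not in allowed}


if __name__ == "__main__":
    print(contamination("근로자는 劳动者 입니다"))
    print(contamination("다음과 같습니다 следующим образом"))
    print(contamination("구체적으로는 具体的には"))
    # English technical terms were rated minor above, so Latin can be allowed.
    print(contamination("API 키를 발급합니다", allowed=("hangul", "latin")))
```

Two caveats apply. Han characters alone cannot distinguish Chinese from Japanese kanji or legitimate Korean Hanja, so a hit in the `han` bucket still needs a human look. Chinese-style phrasing written in Hangul is also invisible to a script-level check.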

Language Contamination Is a Service Killer

If your customer support chatbot suddenly outputs Russian, trust evaporates instantly. For Korean services, choose only from Gemma-3-12B, Qwen3-14B, or KORMo-10B.

4. Natural Korean Expression

We evaluated whether responses read like a native Korean speaker wrote them, rather than machine-translated text. The biggest differences appear in conjunctions, particles, and sentence endings.

Naturalness Summary by Model

KORMo-10B: Reads like a native Korean writer. Most natural business writing style with proper conjunctions.

Gemma-3-12B: Concise and accurate. Flexible tone switching without unnecessary verbosity.

Qwen3-14B: Generally stable, but occasionally lapses into translationese patterns.

Qwen3-8B: Functional but sounds bureaucratic. Quality drops in longer responses.

Llama-3.1-8B: Frequent direct translations from English patterns. Lacks Korean pragmatics.

Phi-4: Classic translationese throughout. Clearly more comfortable in English than Korean.

5. Comprehensive Korean Evaluation by Model

Combined assessment of Korean scores, honorifics, language contamination, and naturalness.

Gemma-3-12B

Korean: 4.28 / 5.0

Ranked #1 in Korean. Excels at honorifics, business tone, and natural expression.

✓ Seamless switching between casual and formal registers
✓ Near-zero language contamination (1 English word at most)
✓ Accurate use of medical/legal Korean terminology
✓ Concise responses that hit the key points

Qwen3-14B

Korean: 4.19 / 5.0

Consistently stable. Strong business Korean.

✓ High-level business Korean proficiency
✓ Extremely rare language contamination (0–1 incidents, English technical terms only)
✓ Maintains Korean quality even in long responses
✓ Occasional translationese expressions

KORMo-10B

Korean: 3.83 / 5.0

Korean-specialized model with the most natural expressions.

✓ Most natural business Korean of all models
✓ Near-perfect honorific and formal register usage
✓ Slow speed limits real-time service deployment
✓ Some gaps in specialized domain terminology

Qwen3-8B

Korean: 3.33 / 5.0

Basic Korean works, but with occasional Chinese contamination.

✓ Adequate basic Korean proficiency
✓ 3 Chinese contamination incidents (characters, Chinese-style phrasing)
✓ Korean quality degrades in longer responses
✓ Maintains honorifics but tone is stiff

Llama-3.1-8B

Korean: 2.67 / 5.0

Severe multilingual contamination. Unsuitable for Korean services.

✓ Multiple contaminations: Russian, Chinese, Japanese, English
✓ Foreign languages appear mid-sentence during Korean responses
✓ Inconsistent honorific maintenance
✓ Frequent translationese expressions

Phi-4

Korean: 2.33 / 5.0

Lowest Korean score. Frequent English switching makes it unusable.

✓ Very frequent mid-response switches to English (3+ instances)
✓ Frequent Korean grammar errors
✓ Drops from formal to casual register mid-response
✓ Tendency to use English-only for technical terms

Key Takeaways

✓ Gemma-3-12B leads with 4.28, thanks to natural expression and flexible tone switching
✓ KORMo-10B excels in business Korean and honorific consistency
✓ Phi-4 (2.33) and Llama-3.1-8B (2.67) are unsuitable for Korean-language services
✓ Llama-3.1-8B has the worst contamination: 5+ incidents across 4 languages
✓ For Korean services, choose from Gemma-3-12B, Qwen3-14B, or KORMo-10B

Conclusion

Korean language ability varies drastically across local LLMs. The 1.95-point gap between Gemma-3-12B and Phi-4 means completely different user experiences from the same "AI chatbot" label. For Korean services, we recommend Gemma-3-12B (quality) or Qwen3-14B (balance). If business-grade formal Korean is a top priority, KORMo-10B is also worth considering despite its slower speed.

Tests conducted on February 21, 2026. Speed, token counts, and raw responses are actual measurements. Model rankings and scores include subjective evaluator judgment and may vary with different test environments and prompts. Non-commercial sharing is welcome; for commercial use, please contact us via the contact page.

