When building Korean-language services with local LLMs, the most critical factor is how naturally the model speaks Korean. We tested six models on 10 real-world questions, evaluating honorific consistency, business register, language contamination (Chinese, English, or other languages mixed into Korean output), and naturalness of expression.
- Top score (Gemma-3-12B): 4.28
- Bottom score (Phi-4): 2.33
- Korean questions: 10
- Models tested: 6
1. Korean Language Score Comparison
Scores come from Scenario G (Korean proficiency), evaluated across 10 questions on a 5-point scale. Criteria were naturalness, honorific consistency, terminology accuracy, and language contamination.
| Rank | Model | Korean Score | Notes |
|---|---|---|---|
| 1 | Gemma-3-12B | 4.28 | #1 Korean, natural expressions |
| 2 | Qwen3-14B | 4.19 | Stable business Korean |
| 3 | KORMo-10B | 3.83 | Korean-specialized, excellent honorifics |
| 4 | Qwen3-8B | 3.33 | Adequate but Chinese contamination |
| 5 | Llama-3.1-8B | 2.67 | Severe multilingual contamination |
| 6 | Phi-4 | 2.33 | Frequent English switching, lowest |
The Gap Between #1 and #6
Gemma-3-12B (4.28) and Phi-4 (2.33) are 1.95 points apart. Even though both are "LLMs," the Korean language experience is worlds apart. For Korean services, model selection can make or break your product.
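The ranking and the gap quoted above can be reproduced from the table with a few lines of Python. This is a minimal sketch: the scores are copied from the table, while the `shortlist` helper and its 3.5 cut-off are illustrative assumptions, not part of the original evaluation.

```python
# Korean scores as reported in the table above (5-point scale).
KOREAN_SCORES = {
    "Gemma-3-12B": 4.28,
    "Qwen3-14B": 4.19,
    "KORMo-10B": 3.83,
    "Qwen3-8B": 3.33,
    "Llama-3.1-8B": 2.67,
    "Phi-4": 2.33,
}


def rank_models(scores):
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def score_gap(scores):
    """Return the gap between the best and worst score, rounded to 2 places."""
    return round(max(scores.values()) - min(scores.values()), 2)


def shortlist(scores, threshold):
    """Hypothetical helper: keep models at or above a chosen cut-off."""
    return [model for model, score in rank_models(scores) if score >= threshold]


for rank, (model, score) in enumerate(rank_models(KOREAN_SCORES), start=1):
    print(f"{rank}. {model}: {score:.2f}")
print(f"Gap between #1 and #6: {score_gap(KOREAN_SCORES):.2f}")
print("Shortlist at 3.5:", shortlist(KOREAN_SCORES, 3.5))
```

With the assumed cut-off of 3.5, the shortlist contains the three highest-scoring models. A different threshold would change which models pass.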
2. Honorifics and Business Register
In Korean, maintaining honorifics (formal speech levels) is the most basic requirement for any service. Korean has multiple speech levels, and business communication demands consistent use of polite, formal endings such as "-습니다" and "-하세요." We evaluated whether models could maintain formal register throughout their responses.
Good Honorific Usage
"I will guide you through this matter. According to Article 60 of the Labor Standards Act, annual paid leave is granted to workers who have attended 80% or more over one year."
- KORMo-10B (natural formal Korean)
Poor Honorific Usage
"This problem can be solved as follows. First, the employment contract should... ah, let me explain this part in Korean..."
- Phi-4 (switched to English mid-sentence)
| Model | Honorific Consistency | Business Tone | Assessment |
|---|---|---|---|
| KORMo-10B | ★★★★★ | ★★★★★ | Near-perfect formal register |
| Gemma-3-12B | ★★★★★ | ★★★★☆ | Natural casual/formal switching |
| Qwen3-14B | ★★★★☆ | ★★★★☆ | Stable but occasional translationese |
| Qwen3-8B | ★★★☆☆ | ★★★☆☆ | Maintains formality but sounds stiff |
| Llama-3.1-8B | ★★☆☆☆ | ★★☆☆☆ | Inconsistent honorific maintenance |
| Phi-4 | ★★☆☆☆ | ★☆☆☆☆ | Drops from formal to casual mid-response |
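The star ratings above come from evaluator judgment. For a rough automated pre-check, one option is to measure how many sentences in a response end with a polite or formal ending. The sketch below is only a heuristic under stated assumptions: the `POLITE_ENDINGS` list is a small, non-exhaustive set chosen for illustration, sentence splitting is punctuation-based, and the sample strings are hypothetical text, not actual model outputs.

```python
import re

# Assumed, non-exhaustive sentence endings treated as polite/formal.
POLITE_ENDINGS = ("니다", "니까", "세요", "십시오", "요")


def split_sentences(text):
    """Split on sentence-final punctuation; a rough heuristic, not a tokenizer."""
    parts = re.split(r"[.!?]+\s*", text)
    return [part.strip() for part in parts if part.strip()]


def is_polite(sentence):
    """Check whether a sentence ends with one of the assumed polite endings."""
    cleaned = sentence.rstrip("\"'”’)] ")
    return cleaned.endswith(POLITE_ENDINGS)


def honorific_consistency(text):
    """Return the fraction of sentences that end politely (0.0 if no sentences)."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    polite_count = sum(is_polite(sentence) for sentence in sentences)
    return polite_count / len(sentences)


# Hypothetical sample responses written for this sketch.
consistent = "안내해 드리겠습니다. 연차 유급휴가는 1년간 80% 이상 출근한 근로자에게 부여됩니다."
mixed = "안내해 드리겠습니다. 근로계약서를 먼저 확인해."

print(honorific_consistency(consistent))
print(honorific_consistency(mixed))
```

A ratio below 1.0 flags a response for human review. It cannot judge whether the tone sounds natural or stiff, so it complements rather than replaces the manual rating.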
3. Language Contamination
When you ask a question in Korean and get Chinese, English, Russian, or Japanese mixed into the response, that’s "language contamination." It’s a service-quality killer.
| Model | Incidents | Contamination Language | Severity |
|---|---|---|---|
| Gemma-3-12B | 0–1 | English words only | Minor |
| Qwen3-14B | 0–1 | English technical terms | Minor |
| KORMo-10B | 1 | English technical terms | Minor |
| Qwen3-8B | 3 | Chinese (characters, Chinese-style phrasing) | Moderate |
| Phi-4 | 3+ | English (full sentence switching) | Severe |
| Llama-3.1-8B | 5+ | Russian, Chinese, Japanese, English | Critical |
Qwen3-8B: Chinese Contamination Example
Korean response with sudden Chinese characters: "劳动者" appearing where the Korean word for "worker" should be used
Chinese from training data leaks into Korean responses, especially in legal/administrative contexts.
Llama-3.1-8B: Multi-Language Contamination
Partway through a Korean response, Russian ("следующим образом") and Japanese ("具体的には") appear at random
Random switching to Russian, Japanese, and Chinese during Korean responses, with multiple languages per response.
Language Contamination Is a Service Killer
If your customer support chatbot suddenly outputs Russian, trust evaporates instantly. For Korean services, choose only from Gemma-3-12B, Qwen3-14B, or KORMo-10B.
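Contamination of the kind described in this section can be flagged automatically by checking which Unicode scripts appear in a response. The sketch below is one possible approach, not the method used in this evaluation: the `SCRIPT_RANGES` table covers only the main Unicode blocks, and the function names are illustrative. Han characters alone cannot distinguish Chinese from Japanese kanji or Korean Hanja, so a "han" hit only signals that a human should look.

```python
from itertools import groupby

# Main Unicode blocks per script (assumed subset, not exhaustive).
SCRIPT_RANGES = {
    "hangul": [(0xAC00, 0xD7A3), (0x1100, 0x11FF), (0x3130, 0x318F)],
    "han": [(0x4E00, 0x9FFF), (0x3400, 0x4DBF)],
    "kana": [(0x3040, 0x309F), (0x30A0, 0x30FF)],
    "cyrillic": [(0x0400, 0x04FF)],
    "latin": [(0x0041, 0x005A), (0x0061, 0x007A)],
}


def script_of(ch):
    """Return the script name for a character, or None if unclassified."""
    code = ord(ch)
    for name, ranges in SCRIPT_RANGES.items():
        if any(low <= code <= high for low, high in ranges):
            return name
    return None  # digits, punctuation, whitespace, other scripts


def find_contamination(text, allowed=("hangul",)):
    """Map each non-allowed script to the contiguous runs found in the text."""
    found = {}
    for script, group in groupby(text, key=script_of):
        if script is None or script in allowed:
            continue
        found.setdefault(script, []).append("".join(group))
    return found


# The foreign fragments are the ones quoted in this section; the Korean
# text around them is hypothetical.
print(find_contamination("劳动者의 권리를 설명드립니다."))
print(find_contamination("근로계약서는 следующим образом 작성합니다."))
print(find_contamination("API 키를 발급합니다.", allowed=("hangul", "latin")))
```

Passing `allowed=("hangul", "latin")` tolerates English technical terms, which the table above rates as minor, while still catching Han, kana, and Cyrillic runs.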
4. Natural Korean Expression
We evaluated whether responses read like a native Korean speaker wrote them, rather than machine-translated text. The biggest differences appear in conjunctions, particles, and sentence endings.
Naturalness Summary by Model
5. Comprehensive Korean Evaluation by Model
Combined assessment of Korean scores, honorifics, language contamination, and naturalness.
Gemma-3-12B
Korean: 4.28 / 5.0
Korean #1. Excels at honorifics, business tone, and natural expression.
- ✓Seamless switching between casual and formal registers
- ✓Near-zero language contamination (1 English word at most)
- ✓Accurate use of medical/legal Korean terminology
- ✓Concise responses that hit the key points
Qwen3-14B
Korean: 4.19 / 5.0
Consistently stable. Strong business Korean.
- ✓High-level business Korean proficiency
- ✓Extremely rare language contamination (0–1 incidents)
- ✓Maintains Korean quality even in long responses
- ✓Occasional translationese expressions
KORMo-10B
Korean: 3.83 / 5.0
Korean-specialized model with the most natural expressions.
- ✓Most natural business Korean of all models
- ✓Near-perfect honorific and formal register usage
- ✓Slow speed limits real-time service deployment
- ✓Some gaps in specialized domain terminology
Qwen3-8B
Korean: 3.33 / 5.0
Basic Korean works, but occasional Chinese contamination.
- ✓Adequate basic Korean proficiency
- ✓3 Chinese contamination incidents (characters, Chinese-style phrasing)
- ✓Korean quality degrades in longer responses
- ✓Maintains honorifics but tone is stiff
Llama-3.1-8B
Korean: 2.67 / 5.0
Severe multilingual contamination. Unsuitable for Korean services.
- ✓Multiple contaminations: Russian, Chinese, Japanese, English
- ✓Foreign languages appear mid-sentence during Korean responses
- ✓Inconsistent honorific maintenance
- ✓Frequent translationese expressions
Phi-4
Korean: 2.33 / 5.0
Lowest Korean score. Frequent English switching makes it unusable.
- ✓Very frequent mid-response switches to English (3+ instances)
- ✓Frequent Korean grammar errors
- ✓Drops from formal to casual register mid-response
- ✓Tendency to use English-only for technical terms
Key Takeaways
- ✓Gemma-3-12B leads with 4.28 — natural expression and flexible tone switching
- ✓KORMo-10B excels in business Korean and honorific consistency
- ✓Phi-4 (2.33) and Llama-3.1-8B (2.67) are unsuitable for Korean-language services
- ✓Llama-3.1-8B has the worst contamination: 5+ incidents across 4 languages
- ✓For Korean services, choose from Gemma-3-12B, Qwen3-14B, or KORMo-10B
Conclusion
Korean language ability varies drastically across local LLMs. The 1.95-point gap between Gemma-3-12B and Phi-4 means completely different user experiences from the same "AI chatbot" label. For Korean services, we recommend Gemma-3-12B (quality) or Qwen3-14B (balance). If business-grade formal Korean is a top priority, KORMo-10B is also worth considering despite its slower speed.
Tests conducted on February 21, 2026. Speed, token counts, and raw responses are actual measurements. Model rankings and scores include subjective evaluator judgment and may vary with different test environments and prompts. Non-commercial sharing is welcome; for commercial use, please contact us via the contact page.