treeru.com

When building Korean-language services with local LLMs, the most critical factor is how naturally the model speaks Korean. We tested 6 models across 10 real-world questions, evaluating honorific consistency, business register, language contamination (Chinese/English mixing), and naturalness of expression.

  • Top score (Gemma): 4.28
  • Bottom score (Phi-4): 2.33
  • Korean questions: 10
  • Models tested: 6

1. Korean Language Score Comparison

Scores from Scenario G (Korean proficiency) evaluated across 10 questions. Criteria included naturalness, honorific consistency, terminology accuracy, and language contamination.

| Rank | Model | Korean Score | Notes |
| --- | --- | --- | --- |
| 1 | Gemma-3-12B | 4.28 | #1 Korean, natural expressions |
| 2 | Qwen3-14B | 4.19 | Stable business Korean |
| 3 | KORMo-10B | 3.83 | Korean-specialized, excellent honorifics |
| 4 | Qwen3-8B | 3.33 | Adequate but Chinese contamination |
| 5 | Llama-3.1-8B | 2.67 | Severe multilingual contamination |
| 6 | Phi-4 | 2.33 | Frequent English switching, lowest |

The Gap Between #1 and #6

Gemma-3-12B (4.28) and Phi-4 (2.33) are 1.95 points apart on a 5-point scale. Both are "LLMs," yet the Korean-language experience they deliver is worlds apart. For Korean services, model selection can make or break your product.
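The ranking and the gap can be reproduced from the published scores with a few lines of Python. This is only a sketch: the function names `rank_models` and `score_gap` are our own for illustration, and the numbers are the scores reported in the comparison above.

```python
# Korean scores as reported in the comparison above (5-point scale).
korean_scores = {
    "Gemma-3-12B": 4.28,
    "Qwen3-14B": 4.19,
    "KORMo-10B": 3.83,
    "Qwen3-8B": 3.33,
    "Llama-3.1-8B": 2.67,
    "Phi-4": 2.33,
}


def rank_models(scores):
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def score_gap(scores):
    """Return the best score minus the worst score, rounded to 2 decimals."""
    return round(max(scores.values()) - min(scores.values()), 2)


ranking = rank_models(korean_scores)
for position, (model, score) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {score:.2f}")
print(f"Gap between #1 and #{len(ranking)}: {score_gap(korean_scores):.2f}")
```

Rounding inside `score_gap` keeps floating-point noise out of the reported difference.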

2. Honorifics and Business Register

In Korean, maintaining honorifics (formal speech levels) is the most basic requirement for any service. Korean has multiple speech levels, and business communication demands consistent use of formal and polite endings such as "-습니다" and "-세요." We evaluated whether models could maintain a formal register throughout their responses.

Good Honorific Usage

"I will guide you through this matter. According to Article 60 of the Labor Standards Act, annual paid leave is granted to workers who have attended 80% or more over one year."

- KORMo-10B (natural formal Korean)

Poor Honorific Usage

"This problem can be solved as follows. First, the employment contract should... ah, let me explain this part in Korean..."

- Phi-4 (switched to English mid-sentence)

| Model | Honorific Consistency | Business Tone | Assessment |
| --- | --- | --- | --- |
| KORMo-10B | ★★★★★ | ★★★★★ | Near-perfect formal register |
| Gemma-3-12B | ★★★★★ | ★★★★☆ | Natural casual/formal switching |
| Qwen3-14B | ★★★★☆ | ★★★★☆ | Stable but occasional translationese |
| Qwen3-8B | ★★★☆☆ | ★★★☆☆ | Maintains formality but sounds stiff |
| Llama-3.1-8B | ★★☆☆☆ | ★★☆☆☆ | Inconsistent honorific maintenance |
| Phi-4 | ★★☆☆☆ | ★☆☆☆☆ | Drops from formal to casual mid-response |
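A rough automated pre-check for register drift can complement human review. The sketch below is not the rubric used in these tests: it assumes that a sentence counts as formal when it ends in one of a small, hand-picked set of endings (`FORMAL_ENDINGS`, a hypothetical and non-exhaustive list), and the two sample strings are sentences we wrote for illustration, not model outputs.

```python
import re

# Hypothetical, non-exhaustive list of sentence-final endings treated as
# formal or polite. This is an assumption for the sketch, not the test rubric.
FORMAL_ENDINGS = ("니다", "니까", "세요", "십시오")


def split_sentences(text):
    """Split on sentence-final punctuation and drop empty pieces."""
    parts = re.split(r"[.!?]+\s*", text)
    return [part.strip() for part in parts if part.strip()]


def formal_ratio(text):
    """Return the share of sentences that end in one of FORMAL_ENDINGS."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    formal = sum(1 for sentence in sentences if sentence.endswith(FORMAL_ENDINGS))
    return formal / len(sentences)


# Illustrative sentences written for this sketch (not actual model responses).
consistent = "안내해 드리겠습니다. 연차 유급휴가는 1년간 80% 이상 출근한 근로자에게 부여됩니다."
mixed = "안내해 드리겠습니다. 계약서는 이렇게 쓰면 돼."

print(formal_ratio(consistent))
print(formal_ratio(mixed))
```

A ratio below 1.0 only flags a response for closer inspection. Quoted speech, lists, and headings will trip a suffix check like this, so it cannot replace a human judgment of tone.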

3. Language Contamination

When you ask a question in Korean and get Chinese, English, Russian, or Japanese mixed into the response, that’s "language contamination." It’s a service-quality killer.

| Model | Incidents | Contamination Language | Severity |
| --- | --- | --- | --- |
| Gemma-3-12B | 0–1 | English words only | Minor |
| Qwen3-14B | 0–1 | English technical terms | Minor |
| KORMo-10B | 1 | English technical terms | Minor |
| Qwen3-8B | 3 | Chinese (characters, Chinese-style phrasing) | Moderate |
| Phi-4 | 3+ | English (full sentence switching) | Severe |
| Llama-3.1-8B | 5+ | Russian, Chinese, Japanese, English | Critical |

Qwen3-8B: Chinese Contamination Example

Korean response with sudden Chinese characters: "劳动者" appearing where the Korean word for "worker" should be used

Chinese from training data leaks into Korean responses, especially in legal/administrative contexts.

Llama-3.1-8B: Multi-Language Contamination

Mid-Korean response, Russian ("следующим образом") and Japanese ("具体的には") appear randomly

Random switching to Russian, Japanese, Chinese during Korean responses. Multiple languages per response.
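Contamination of the kind shown in these examples can be flagged mechanically by looking at Unicode code points. The following is a minimal sketch, assuming a simplified set of ranges (basic CJK ideographs, kana, basic Cyrillic, ASCII letters); the names `SCRIPT_RANGES`, `script_of`, and `foreign_scripts` are ours and are not part of the test setup described here.

```python
# Simplified Unicode ranges for scripts that are unexpected in a Korean answer.
# Assumption: only the most common blocks are covered; Latin is limited to
# ASCII letters and reported separately, since English terms may be acceptable.
SCRIPT_RANGES = {
    "han": [(0x4E00, 0x9FFF), (0x3400, 0x4DBF)],
    "kana": [(0x3040, 0x309F), (0x30A0, 0x30FF)],
    "cyrillic": [(0x0400, 0x04FF)],
    "latin": [(0x0041, 0x005A), (0x0061, 0x007A)],
}


def script_of(char):
    """Return the script name for a character, or None if it is not tracked."""
    code = ord(char)
    for script, ranges in SCRIPT_RANGES.items():
        if any(low <= code <= high for low, high in ranges):
            return script
    return None


def foreign_scripts(text):
    """Count characters per tracked non-Korean script in the given text."""
    counts = {}
    for char in text:
        script = script_of(char)
        if script is not None:
            counts[script] = counts.get(script, 0) + 1
    return counts


# The fragments quoted in the examples above.
print(foreign_scripts("劳动者"))
print(foreign_scripts("следующим образом"))
print(foreign_scripts("具体的には"))
```

Code points alone cannot tell Chinese from Japanese kanji or Korean hanja, so a "han" hit needs context; kana or Cyrillic hits are unambiguous. Chinese-style phrasing written in Hangul is invisible to this check and still requires a human reader.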

Language Contamination Is a Service Killer

If your customer support chatbot suddenly outputs Russian, trust evaporates instantly. For Korean services, choose from Gemma, Qwen3-14B, or KORMo only.
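The shortlist follows directly from the severity column of the contamination table. As a sketch, assuming that only "Minor" severity is acceptable for a customer-facing service (our reading of the recommendation, with a hypothetical `shortlist` helper):

```python
# Rows copied from the contamination table above.
contamination = [
    {"model": "Gemma-3-12B", "incidents": "0–1", "severity": "Minor"},
    {"model": "Qwen3-14B", "incidents": "0–1", "severity": "Minor"},
    {"model": "KORMo-10B", "incidents": "1", "severity": "Minor"},
    {"model": "Qwen3-8B", "incidents": "3", "severity": "Moderate"},
    {"model": "Phi-4", "incidents": "3+", "severity": "Severe"},
    {"model": "Llama-3.1-8B", "incidents": "5+", "severity": "Critical"},
]

# Assumption: only "Minor" contamination is tolerable in production.
ACCEPTABLE_SEVERITIES = {"Minor"}


def shortlist(rows, acceptable=ACCEPTABLE_SEVERITIES):
    """Return the models whose contamination severity is acceptable."""
    return [row["model"] for row in rows if row["severity"] in acceptable]


print(shortlist(contamination))
```

Widening `ACCEPTABLE_SEVERITIES` to include "Moderate" would bring Qwen3-8B back in, which may be reasonable for internal tools where an occasional Chinese character is tolerable.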

4. Natural Korean Expression

We evaluated whether responses read like a native Korean speaker wrote them, rather than machine-translated text. The biggest differences appear in conjunctions, particles, and sentence endings.

Naturalness Summary by Model

  • KORMo-10B: Reads like a native Korean writer. Most natural business writing style with proper conjunctions.
  • Gemma-3-12B: Concise and accurate. Flexible tone switching without unnecessary verbosity.
  • Qwen3-14B: Generally stable, but occasionally lapses into translationese patterns.
  • Qwen3-8B: Functional but sounds bureaucratic. Quality drops in longer responses.
  • Llama-3.1-8B: Frequent direct translations from English patterns. Lacks Korean pragmatics.
  • Phi-4: Classic translationese throughout. Clearly more comfortable in English than Korean.

5. Comprehensive Korean Evaluation by Model

Combined assessment of Korean scores, honorifics, language contamination, and naturalness.

#1 Gemma-3-12B (Korean: 4.28 / 5.0)

Korean #1. Excels at honorifics, business tone, and natural expression.

  • Seamless switching between casual and formal registers
  • Near-zero language contamination (1 English word at most)
  • Accurate use of medical/legal Korean terminology
  • Concise responses that hit the key points

#2 Qwen3-14B (Korean: 4.19 / 5.0)

Consistently stable. Strong business Korean.

  • High-level business Korean proficiency
  • Extremely rare language contamination (0–1 incidents, English technical terms only)
  • Maintains Korean quality even in long responses
  • Occasional translationese expressions

#3 KORMo-10B (Korean: 3.83 / 5.0)

Korean-specialized model with the most natural expressions.

  • Most natural business Korean of all models
  • Near-perfect honorific and formal register usage
  • Slow speed limits real-time service deployment
  • Some gaps in specialized domain terminology

#4 Qwen3-8B (Korean: 3.33 / 5.0)

Basic Korean works, but occasional Chinese contamination.

  • Adequate basic Korean proficiency
  • 3 Chinese contamination incidents (characters, Chinese-style phrasing)
  • Korean quality degrades in longer responses
  • Maintains honorifics but tone is stiff

#5 Llama-3.1-8B (Korean: 2.67 / 5.0)

Severe multilingual contamination. Unsuitable for Korean services.

  • Multiple contaminations: Russian, Chinese, Japanese, English
  • Foreign languages appear mid-sentence during Korean responses
  • Inconsistent honorific maintenance
  • Frequent translationese expressions

#6 Phi-4 (Korean: 2.33 / 5.0)

Lowest Korean score. Frequent English switching makes it unusable.

  • Very frequent mid-response switches to English (3+ instances)
  • Frequent Korean grammar errors
  • Drops from formal to casual register mid-response
  • Tendency to use English-only for technical terms

Key Takeaways

  • Gemma-3-12B leads with 4.28, thanks to natural expression and flexible tone switching
  • KORMo-10B excels in business Korean and honorific consistency
  • Phi-4 (2.33) and Llama-3.1-8B (2.67) are unsuitable for Korean-language services
  • Llama-3.1-8B has the worst contamination: 5+ incidents across 4 languages
  • For Korean services, choose from Gemma-3-12B, Qwen3-14B, or KORMo-10B

Conclusion

Korean language ability varies drastically across local LLMs. The 1.95-point gap between Gemma-3-12B and Phi-4 means completely different user experiences from the same "AI chatbot" label. For Korean services, we recommend Gemma-3-12B (quality) or Qwen3-14B (balance). If business-grade formal Korean is a top priority, KORMo-10B is also worth considering despite its slower speed.

Tests conducted on February 21, 2026. Speed, token counts, and raw responses are actual measurements. Model rankings and scores include subjective evaluator judgment and may vary with different test environments and prompts. Non-commercial sharing is welcome; for commercial use, please contact us via the contact page.
