LLM Temperature 0.1 to 0.9: What 300 Experiments Reveal About Hallucination, Response Quality, and Optimal Settings
Everyone says lower temperature means more deterministic and higher means more creative. But how much does it actually matter — and where exactly does hallucination become a real problem? We ran 300 controlled experiments across 7 business scenarios on Qwen3-14B to find out.
**300** total experiments · **5** temperature levels · **7** business scenarios · **2** GPUs compared
What Temperature Actually Controls
Temperature is the parameter that adjusts the probability distribution when an LLM selects the next token. It scales the logits before the softmax function: lower values sharpen the distribution toward high-probability tokens, while higher values flatten it, giving lower-probability tokens a better chance of being selected.
| Setting | Character | Behavior |
|---|---|---|
| T = 0.1 | Deterministic | Almost always picks the highest-probability token. Same question produces nearly identical answers. Stable but monotone. |
| T = 0.5 | Balanced | Favors top tokens but allows some variation. Common default for production deployments. |
| T = 0.9 | Creative | Frequently selects lower-probability tokens. Diverse output but higher risk of illogical or fabricated responses. |
The theory is simple, but in practice, deciding between T=0.3 and T=0.5 for a customer service chatbot requires empirical testing. The optimal value varies by scenario, and hallucination behavior changes dramatically across the temperature range.
Experiment Design: 60 Questions × 5 Temperature Levels
To isolate the effect of temperature, we held every other variable constant: same model, same GPU, same serving engine, same top_p. Only temperature changed between runs.
Test Environment

All runs used the same Qwen3-14B-AWQ model, served with identical engine settings and a fixed top_p, on a single RTX PRO 6000 (the RTX 5060 Ti re-run is covered in the hardware comparison below).
The 60 questions span 7 real-world business scenarios: Manufacturing (10), SaaS (10), Medical (10), E-commerce (10), Legal (10), Automation (5), and Korean language quality (5). Six of these questions are deliberate hallucination traps — asking about nonexistent products, fabricated court cases, and misleading medical claims — designed to reveal how temperature affects the model's tendency to fabricate information.
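In practice, each run reduces to one request per question against the serving engine, with temperature as the only varying field. A sketch of the request payload, assuming an OpenAI-compatible chat endpoint; the model id, top_p value, and token cap here are illustrative assumptions, not the experiment's exact configuration:

```python
def build_request(question, temperature, top_p=0.9):
    """Build one chat-completion payload. Only `temperature` varies
    between runs; every other field is held constant."""
    return {
        "model": "Qwen/Qwen3-14B-AWQ",   # assumed model id
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,
        "top_p": top_p,                   # held constant across all runs
        "max_tokens": 2048,               # assumed cap
    }

payload = build_request("What is your return policy?", 0.3)
```

Holding everything except `temperature` fixed is what lets the later tables attribute length, speed, and hallucination differences to temperature alone.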
How Temperature Affects Response Length
Does raising the temperature produce longer responses? The data shows a clear trend: average response length increases by approximately 18% from T=0.1 to T=0.9. But the pattern varies significantly by scenario.
| Temperature | Avg Tokens | tok/s | Total Time |
|---|---|---|---|
| T = 0.1 | 685 | 136.2 | 302s |
| T = 0.3 | 712 | 135.8 | 315s |
| T = 0.5 | 738 | 135.1 | 328s |
| T = 0.7 | 776 | 134.5 | 346s |
| T = 0.9 | 811 | 133.8 | 364s |
Token generation speed (tok/s) is almost unaffected by temperature — dropping just 1.8% from 136.2 to 133.8 across the full range. The increase in total processing time comes entirely from the model generating more tokens, not generating them more slowly. Higher temperature means the model "talks more," not that it thinks slower.
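A quick sanity check confirms this reading of the table: total time is fully explained by (average tokens per answer × 60 questions) ÷ throughput, i.e. higher temperature costs time only through longer answers.

```python
# (avg tokens, tok/s) per temperature, from the table above.
rows = {0.1: (685, 136.2), 0.3: (712, 135.8), 0.5: (738, 135.1),
        0.7: (776, 134.5), 0.9: (811, 133.8)}

# Predicted wall time for 60 sequential questions at each temperature.
predicted_seconds = {t: round(tokens * 60 / tps)
                     for t, (tokens, tps) in rows.items()}
# Matches the measured "Total Time" column: 302, 315, 328, 346, 364
```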
*Figure: Response length change by scenario (T=0.1 → T=0.9)*
The Hallucination Cliff: Where Temperature Becomes Dangerous
This is the most critical finding of the entire experiment. Six hallucination trap questions — asking about nonexistent products, fabricated legal precedents, and misleading medical diagnoses — were tested at each temperature level, with 3 repetitions per question for a total of 90 runs.
| Temperature | Correct Refusal | Hallucinated | Hallucination Rate | Behavior |
|---|---|---|---|---|
| T = 0.1 | 15/18 | 3/18 | 16.7% | Consistent refusal phrasing |
| T = 0.3 | 14/18 | 4/18 | 22.2% | Refuses but adds 1-2 speculative sentences |
| T = 0.5 | 12/18 | 6/18 | 33.3% | Hedged answers like "it might be..." increase |
| T = 0.7 | 8/18 | 10/18 | 55.6% | Fabricates specific prices and model names |
| T = 0.9 | 5/18 | 13/18 | 72.2% | Confidently presents fully fabricated details |
Key Finding
Hallucination rate spikes dramatically starting at T=0.5. At T=0.3, the rate is a manageable 22.2%. At T=0.5, it jumps to 33.3%. By T=0.7, more than half of all trap questions produce fabricated answers (55.6%), and at T=0.9, nearly three out of four responses contain hallucinated information (72.2%). For any customer-facing production service, T=0.3 or lower is the safe zone.
The character of the hallucinations also changes with temperature. At T=0.1, when the model does hallucinate, it hedges ("this might be," "it is possible that"). At T=0.9, it invents specific product names, prices, and technical specifications with complete confidence. Higher-temperature hallucinations are therefore far more dangerous: they are harder for users to distinguish from genuine information.
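Scoring a temperature level comes down to counting labels over its 18 trap runs (6 questions × 3 repetitions). A minimal sketch of that bookkeeping; the label strings are our own, not the experiment's exact annotation scheme:

```python
from collections import Counter

def hallucination_rate(labels):
    """Fraction of 'hallucinated' labels among the trap runs
    at one temperature level."""
    counts = Counter(labels)
    return counts["hallucinated"] / len(labels)

# Reproducing the T=0.3 row of the table: 4 hallucinated, 14 refused.
labels_t03 = ["hallucinated"] * 4 + ["refused"] * 14
rate_t03 = hallucination_rate(labels_t03)   # 4/18, i.e. 22.2%
```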
Optimal Temperature by Business Scenario
For each scenario, we identified the temperature that best balances accuracy with natural-sounding responses. Scoring used a 5-point scale weighted across instruction following (25%), factual accuracy (25%), language naturalness (25%), response structure (15%), and refusal capability (10%).
| Scenario | Optimal T | T=0.1 Score | Best Score | T=0.9 Score | Rationale |
|---|---|---|---|---|---|
| Manufacturing | 0.3 | 3.8 | 4.0 | 3.2 | Procedural accuracy is critical. Slight variation improves readability. |
| SaaS | 0.3 | 3.7 | 3.9 | 3.4 | Feature descriptions need accuracy with example diversity. |
| Medical | 0.3 | 3.6 | 3.9 | 2.8 | Richer caveats at 0.3, but dangerous misinformation starts at 0.5. |
| E-commerce | 0.5 | 3.4 | 3.8 | 3.1 | Recommendation diversity matters. Emotional language is effective. |
| Legal | 0.1 | 3.9 | 3.9 | 2.4 | Legal text must be precise. Higher T causes fabricated citations. |
| Automation | 0.3 | 3.7 | 3.9 | 3.0 | Code accuracy is essential. Comment diversity is a minor benefit. |
| Korean | 0.5 | 3.5 | 3.8 | 3.3 | Natural expression is key. Quality improves up to 0.5. |
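The weighted score behind these tables reduces to a dot product over the five criteria. A sketch using the stated weights; the per-criterion scores below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Criterion weights from the scoring rubric above (sum to 1.0).
WEIGHTS = {"instruction_following": 0.25, "factual_accuracy": 0.25,
           "naturalness": 0.25, "structure": 0.15, "refusal": 0.10}

def weighted_score(scores):
    """Combine per-criterion 5-point scores into one weighted total."""
    assert set(scores) == set(WEIGHTS)
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Hypothetical scores for one scenario/temperature pair:
example = {"instruction_following": 4.0, "factual_accuracy": 4.5,
           "naturalness": 3.5, "structure": 4.0, "refusal": 3.0}
total = weighted_score(example)   # 3.9 on the 5-point scale
```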
Production Quick Reference

- **T = 0.1**: Factual accuracy above all. Expression variety is unnecessary.
- **T = 0.3**: Maintains accuracy with natural expression. Suitable for most B2B scenarios.
- **T = 0.5**: Diversity helps, but the hallucination rate starts climbing; monitor closely.
- **T = 0.7+**: Hallucination rate exceeds 55%. Never use for customer-facing services.
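The per-scenario recommendations can be encoded directly as a lookup, with values taken from the optimal-temperature table above; the function name, keys, and fallback behavior are our own sketch:

```python
# Recommended temperature per scenario, from the optimal-T table above.
OPTIMAL_T = {"manufacturing": 0.3, "saas": 0.3, "medical": 0.3,
             "e-commerce": 0.5, "legal": 0.1, "automation": 0.3,
             "korean": 0.5}

def temperature_for(scenario, default=0.3):
    """Look up the recommended setting; unknown scenarios fall back
    to the T=0.3 production default."""
    return OPTIMAL_T.get(scenario.lower(), default)
```

Defaulting to 0.3 rather than a library default like 0.7 bakes the article's safe zone into the fallback path.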
GPU Comparison: Does Hardware Affect Optimal Temperature?
We re-ran the full experiment on an RTX 5060 Ti (16 GB) to check whether GPU hardware influences the optimal temperature setting or response quality.
| Metric | RTX PRO 6000 | RTX 5060 Ti | Difference |
|---|---|---|---|
| Avg tok/s (T=0.3) | 135.8 | 42.3 | 3.2× |
| Total time (60 questions) | 315s | 1,012s | 3.2× |
| Avg response length (T=0.3) | 712 tok | 708 tok | ≈ identical |
| Hallucination rate (T=0.3) | 22.2% | 22.2% | Identical |
| Optimal T per scenario | 0.3 | 0.3 | Identical |
The answer is definitive: GPU affects speed, not quality. The RTX PRO 6000 delivers 3.2× the throughput of the RTX 5060 Ti, but response length, hallucination rate, and optimal temperature settings are identical across both cards. This means temperature tuning guidelines are hardware-independent — you can develop your settings on a smaller GPU and deploy to production hardware with confidence.
Conclusion
The single most important takeaway from 300 experiments: T=0.3 is the safest production default for most business scenarios, and T=0.5 is the hallucination cliff. Below T=0.5, hallucination rates stay manageable (16–22%). Above it, the rate doubles and triples, with the model generating increasingly confident fabrications.
Temperature is not a simple creativity-vs-accuracy slider. Up to T=0.3, raising the temperature genuinely improves response naturalness with minimal cost to accuracy. Beyond T=0.5, a qualitatively different failure mode kicks in — the model does not just become "less accurate," it starts confidently inventing information that looks indistinguishable from real data.
For production deployments: use T=0.1 for domains where fabrication is unacceptable (legal, medical), T=0.3 as the general default, and reserve T=0.5+ for non-critical applications with human review. Never deploy T=0.7 or above in customer-facing systems. These findings are model-specific to Qwen3-14B-AWQ, but the "hallucination cliff around T=0.5" pattern is widely observed across LLM families.