treeru.com
AI · March 4, 2026

LLM Temperature 0.1 to 0.9: What 300 Experiments Reveal About Hallucination, Response Quality, and Optimal Settings

Everyone says lower temperature means more deterministic and higher means more creative. But how much does it actually matter — and where exactly does hallucination become a real problem? We ran 300 controlled experiments across 7 business scenarios on Qwen3-14B to find out.

300 Total Experiments · 5 Temperature Levels · 7 Business Scenarios · 2 GPUs Compared

What Temperature Actually Controls

Temperature is the parameter that adjusts the probability distribution when an LLM selects the next token. It scales the logits before the softmax function: lower values sharpen the distribution toward high-probability tokens, while higher values flatten it, giving lower-probability tokens a better chance of being selected.
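The scaling can be sketched in a few lines of Python. The toy logits below are illustrative, not values from the experiment:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply softmax.

    Lower T sharpens the distribution toward the top token;
    higher T flattens it, giving lower-probability tokens a chance.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical logits for three candidate tokens
cold = softmax_with_temperature(logits, 0.1)
hot = softmax_with_temperature(logits, 0.9)
# At T=0.1 the top token takes nearly all the probability mass;
# at T=0.9 the mass spreads across all three candidates.
```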

T = 0.1 — Deterministic: Almost always picks the highest-probability token. Same question produces nearly identical answers. Stable but monotone.

T = 0.5 — Balanced: Favors top tokens but allows some variation. Common default for production deployments.

T = 0.9 — Creative: Frequently selects lower-probability tokens. Diverse output but higher risk of illogical or fabricated responses.

The theory is simple, but in practice, deciding between T=0.3 and T=0.5 for a customer service chatbot requires empirical testing. The optimal value varies by scenario, and hallucination behavior changes dramatically across the temperature range.

Experiment Design: 60 Questions × 5 Temperature Levels

To isolate the effect of temperature, we held every other variable constant: same model, same GPU, same serving engine, same top_p. Only temperature changed between runs.

Test Environment

Model: Qwen3-14B-AWQ (INT4)
Serving Engine: SGLang v0.4
Primary GPU: RTX PRO 6000 (96 GB)
Secondary GPU: RTX 5060 Ti (16 GB)
Temperature: 0.1 / 0.3 / 0.5 / 0.7 / 0.9
top_p: 0.9 (fixed)
max_tokens: 2048 (fixed)
Total Runs: 60 questions × 5 levels = 300

The 60 questions span 7 real-world business scenarios: Manufacturing (10), SaaS (10), Medical (10), E-commerce (10), Legal (10), Automation (5), and Korean language quality (5). Six of these questions are deliberate hallucination traps — asking about nonexistent products, fabricated court cases, and misleading medical claims — designed to reveal how temperature affects the model's tendency to fabricate information.
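A sweep like this reduces to varying a single sampling parameter per run. The sketch below builds the request payloads for an OpenAI-compatible endpoint such as the one SGLang exposes; the model identifier and the sample question are illustrative assumptions, and the actual HTTP call is omitted:

```python
TEMPERATURES = [0.1, 0.3, 0.5, 0.7, 0.9]

def build_run(question, temperature):
    """Build one chat-completion payload for the temperature sweep.

    Only temperature varies between runs; top_p and max_tokens stay
    fixed, matching the experiment design. The model name is an
    assumption for an SGLang server's OpenAI-compatible endpoint.
    """
    return {
        "model": "Qwen/Qwen3-14B-AWQ",
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,
        "top_p": 0.9,        # fixed across all runs
        "max_tokens": 2048,  # fixed across all runs
    }

# One hypothetical question shown here; the real sweep used 60.
questions = ["Summarize the maintenance procedure for a CNC spindle."]
runs = [build_run(q, t) for q in questions for t in TEMPERATURES]
```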

How Temperature Affects Response Length

Does raising the temperature produce longer responses? The data shows a clear trend: average response length increases by approximately 18% from T=0.1 to T=0.9. But the pattern varies significantly by scenario.

| Temperature | Avg Tokens | tok/s | Total Time |
|---|---|---|---|
| T = 0.1 | 685 | 136.2 | 302 s |
| T = 0.3 | 712 | 135.8 | 315 s |
| T = 0.5 | 738 | 135.1 | 328 s |
| T = 0.7 | 776 | 134.5 | 346 s |
| T = 0.9 | 811 | 133.8 | 364 s |

Token generation speed (tok/s) is almost unaffected by temperature — dropping just 1.8% from 136.2 to 133.8 across the full range. The increase in total processing time comes entirely from the model generating more tokens, not generating them more slowly. Higher temperature means the model "talks more," not that it thinks slower.

Response Length Change by Scenario (T=0.1 → T=0.9)

| Scenario | Length Change | Observation |
|---|---|---|
| Manufacturing | +12% | Procedural explanations get slightly more detailed. Structure stays intact. |
| SaaS | +15% | Feature descriptions gain additional examples. |
| Medical | +22% | More caveats and edge cases mentioned, but accuracy drops at higher T. |
| E-commerce | +25% | Product recommendations become more emotional. Hallucination also increases. |
| Legal | +8% | Least affected. Legal terminology constrains token selection naturally. |
| Automation | +20% | Code blocks grow longer. More comments added. |
| Korean | +16% | Conjunctions and filler words increase. Sounds more natural but drifts off-topic. |

The Hallucination Cliff: Where Temperature Becomes Dangerous

This is the most critical finding of the entire experiment. Six hallucination trap questions — asking about nonexistent products, fabricated legal precedents, and misleading medical diagnoses — were tested at each temperature level, with 3 repetitions per question for a total of 90 runs.

| Temperature | Correct Refusal | Hallucinated | Hallucination Rate | Behavior |
|---|---|---|---|---|
| T = 0.1 | 15/18 | 3/18 | 16.7% | Consistent refusal phrasing |
| T = 0.3 | 14/18 | 4/18 | 22.2% | Refuses but adds 1-2 speculative sentences |
| T = 0.5 | 12/18 | 6/18 | 33.3% | Hedged answers like "it might be..." increase |
| T = 0.7 | 8/18 | 10/18 | 55.6% | Fabricates specific prices and model names |
| T = 0.9 | 5/18 | 13/18 | 72.2% | Confidently presents fully fabricated details |

Key Finding

Hallucination rate spikes dramatically starting at T=0.5. At T=0.3, the rate is a manageable 22.2%. At T=0.5, it jumps to 33.3%. By T=0.7, more than half of all trap questions produce fabricated answers (55.6%), and at T=0.9, nearly three out of four responses contain hallucinated information (72.2%). For any customer-facing production service, T=0.3 or lower is the safe zone.

The quality of hallucination also changes with temperature. At T=0.1, when the model does hallucinate, it uses hedging language — "this might be" or "it is possible that." At T=0.9, the model invents specific product names, prices, and technical specifications with complete confidence. The higher-temperature hallucinations are far more dangerous because they are harder for users to distinguish from genuine information.
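The rates in the table above follow directly from the counts: 6 trap questions × 3 repetitions gives 18 runs per temperature level. However each run was judged, the aggregation itself is a one-liner:

```python
def hallucination_rate(hallucinated: int, total: int) -> float:
    """Fraction of trap-question runs that produced fabricated answers."""
    return hallucinated / total

# Hallucinated counts per temperature level, out of 18 runs each
# (6 trap questions x 3 repetitions), taken from the table above.
counts = {0.1: 3, 0.3: 4, 0.5: 6, 0.7: 10, 0.9: 13}
rates = {t: round(hallucination_rate(n, 18) * 100, 1) for t, n in counts.items()}
# rates == {0.1: 16.7, 0.3: 22.2, 0.5: 33.3, 0.7: 55.6, 0.9: 72.2}
```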

Optimal Temperature by Business Scenario

For each scenario, we identified the temperature that best balances accuracy with natural-sounding responses. Scoring used a 5-point scale weighted across instruction following (25%), factual accuracy (25%), language naturalness (25%), response structure (15%), and refusal capability (10%).

| Scenario | Optimal T | T=0.1 Score | Best Score | T=0.9 Score | Rationale |
|---|---|---|---|---|---|
| Manufacturing | 0.3 | 3.8 | 4.0 | 3.2 | Procedural accuracy is critical. Slight variation improves readability. |
| SaaS | 0.3 | 3.7 | 3.9 | 3.4 | Feature descriptions need accuracy with example diversity. |
| Medical | 0.3 | 3.6 | 3.9 | 2.8 | Richer caveats at 0.3, but dangerous misinformation starts at 0.5. |
| E-commerce | 0.5 | 3.4 | 3.8 | 3.1 | Recommendation diversity matters. Emotional language is effective. |
| Legal | 0.1 | 3.9 | 3.9 | 2.4 | Legal text must be precise. Higher T causes fabricated citations. |
| Automation | 0.3 | 3.7 | 3.9 | 3.0 | Code accuracy is essential. Comment diversity is a minor benefit. |
| Korean | 0.5 | 3.5 | 3.8 | 3.3 | Natural expression is key. Quality improves up to 0.5. |

Production Quick Reference

T = 0.1
Legal, medical diagnosis, technical spec lookups

Factual accuracy above all. Expression variety is unnecessary.

T = 0.3
Customer service, manufacturing, SaaS, automation (recommended default)

Maintains accuracy with natural expression. Suitable for most B2B scenarios.

T = 0.5
E-commerce recommendations, conversational AI, content drafts

Diversity helps. Hallucination rate starts climbing — monitor closely.

T ≥ 0.7
Brainstorming, creative writing, idea generation (non-production only)

Hallucination rate exceeds 55%. Never use for customer-facing services.
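In a deployment, the quick reference above can be captured as a small lookup with a safe fallback. The scenario keys here are illustrative names, not part of any library API:

```python
# Optimal temperature per scenario, following the table above. Unknown
# scenarios fall back to the recommended 0.3 default, keeping them in
# the safe zone below the hallucination cliff.
OPTIMAL_TEMPERATURE = {
    "legal": 0.1,           # fabricated citations appear at higher T
    "manufacturing": 0.3,
    "saas": 0.3,
    "medical": 0.3,         # use 0.1 for diagnosis-style lookups
    "automation": 0.3,
    "ecommerce": 0.5,
    "korean": 0.5,
}

def temperature_for(scenario: str) -> float:
    """Return the tuned temperature, defaulting to the safe 0.3."""
    return OPTIMAL_TEMPERATURE.get(scenario.lower(), 0.3)
```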

GPU Comparison: Does Hardware Affect Optimal Temperature?

We re-ran the full experiment on an RTX 5060 Ti (16 GB) to check whether GPU hardware influences the optimal temperature setting or response quality.

| Metric | RTX PRO 6000 | RTX 5060 Ti | Difference |
|---|---|---|---|
| Avg tok/s (T=0.3) | 135.8 | 42.3 | 3.2× |
| Total time (60 questions) | 315 s | 1,012 s | 3.2× |
| Avg response length (T=0.3) | 712 tok | 708 tok | ≈ identical |
| Hallucination rate (T=0.3) | 22.2% | 22.2% | Identical |
| Optimal T per scenario | 0.3 | 0.3 | Identical |

The answer is definitive: GPU affects speed, not quality. The RTX PRO 6000 delivers 3.2× the throughput of the RTX 5060 Ti, but response length, hallucination rate, and optimal temperature settings are identical across both cards. This means temperature tuning guidelines are hardware-independent — you can develop your settings on a smaller GPU and deploy to production hardware with confidence.

Conclusion

The single most important takeaway from 300 experiments: T=0.3 is the safest production default for most business scenarios, and T=0.5 is the hallucination cliff. Below T=0.5, hallucination rates stay manageable (16–22%). Above it, the rate doubles and triples, with the model generating increasingly confident fabrications.

Temperature is not a simple creativity-vs-accuracy slider. Up to T=0.3, raising the temperature genuinely improves response naturalness with minimal cost to accuracy. Beyond T=0.5, a qualitatively different failure mode kicks in — the model does not just become "less accurate," it starts confidently inventing information that looks indistinguishable from real data.

For production deployments: use T=0.1 for domains where fabrication is unacceptable (legal, medical), T=0.3 as the general default, and reserve T=0.5+ for non-critical applications with human review. Never deploy T=0.7 or above in customer-facing systems.

This experiment was conducted on Qwen3-14B-AWQ (INT4) with SGLang v0.4. Optimal temperature values may differ across model families, but the hallucination acceleration pattern above T=0.5 is consistent across most tested LLMs. We recommend running hallucination trap tests whenever switching models.