LLM Temperature 0.1 to 0.9: What 300 Experiments Reveal About Hallucination, Response Quality, and Optimal Settings
Everyone says lower temperature means more deterministic and higher means more creative. But how much does it actually matter — and where exactly does hallucination become a real problem? We ran 300 controlled experiments across 7 business scenarios on Qwen3-14B to find out.
**300** total experiments · **5** temperature levels · **7** business scenarios · **2** GPUs compared
What Temperature Actually Controls
Temperature is the parameter that adjusts the probability distribution when an LLM selects the next token. It scales the logits before the softmax function: lower values sharpen the distribution toward high-probability tokens, while higher values flatten it, giving lower-probability tokens a better chance of being selected.
| Setting | Character | Behavior |
|---|---|---|
| T = 0.1 | Deterministic | Almost always picks the highest-probability token. Same question produces nearly identical answers. Stable but monotone. |
| T = 0.5 | Balanced | Favors top tokens but allows some variation. Common default for production deployments. |
| T = 0.9 | Creative | Frequently selects lower-probability tokens. Diverse output but higher risk of illogical or fabricated responses. |
The theory is simple, but in practice, deciding between T=0.3 and T=0.5 for a customer service chatbot requires empirical testing. The optimal value varies by scenario, and hallucination behavior changes dramatically across the temperature range.
Experiment Design: 60 Questions × 5 Temperature Levels
To isolate the effect of temperature, we held every other variable constant: same model, same GPU, same serving engine, same top_p. Only temperature changed between runs.
Test Environment

All runs used the same Qwen3-14B-AWQ model, served with identical engine settings and a fixed top_p, on a single RTX PRO 6000 (the RTX 5060 Ti re-run is covered in the hardware comparison below).
The 60 questions span 7 real-world business scenarios: Manufacturing (10), SaaS (10), Medical (10), E-commerce (10), Legal (10), Automation (5), and Korean language quality (5). Six of these questions are deliberate hallucination traps — asking about nonexistent products, fabricated court cases, and misleading medical claims — designed to reveal how temperature affects the model's tendency to fabricate information.
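In practice, each run reduces to one request per question against the serving engine, with temperature as the only varying field. A sketch of the request payload, assuming an OpenAI-compatible chat endpoint; the model id, top_p value, and token cap here are illustrative assumptions, not the experiment's exact configuration:

```python
def build_request(question, temperature, top_p=0.9):
    """Build one chat-completion payload. Only `temperature` varies
    between runs; every other field is held constant."""
    return {
        "model": "Qwen/Qwen3-14B-AWQ",   # assumed model id
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,
        "top_p": top_p,                   # held constant across all runs
        "max_tokens": 2048,               # assumed cap
    }

payload = build_request("What is your return policy?", 0.3)
```

Holding everything except `temperature` fixed is what lets the later tables attribute length, speed, and hallucination differences to temperature alone.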
How Temperature Affects Response Length
Does raising the temperature produce longer responses? The data shows a clear trend: average response length increases by approximately 18% from T=0.1 to T=0.9. But the pattern varies significantly by scenario.
| Temperature | Avg Tokens | tok/s | Total Time |
|---|---|---|---|
| T = 0.1 | 685 | 136.2 | 302s |
| T = 0.3 | 712 | 135.8 | 315s |
| T = 0.5 | 738 | 135.1 | 328s |
| T = 0.7 | 776 | 134.5 | 346s |
| T = 0.9 | 811 | 133.8 | 364s |
Token generation speed (tok/s) is almost unaffected by temperature — dropping just 1.8% from 136.2 to 133.8 across the full range. The increase in total processing time comes entirely from the model generating more tokens, not generating them more slowly. Higher temperature means the model "talks more," not that it thinks slower.
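A quick sanity check confirms this reading of the table: total time is fully explained by (average tokens per answer × 60 questions) ÷ throughput, i.e. higher temperature costs time only through longer answers.

```python
# (avg tokens, tok/s) per temperature, from the table above.
rows = {0.1: (685, 136.2), 0.3: (712, 135.8), 0.5: (738, 135.1),
        0.7: (776, 134.5), 0.9: (811, 133.8)}

# Predicted wall time for 60 sequential questions at each temperature.
predicted_seconds = {t: round(tokens * 60 / tps)
                     for t, (tokens, tps) in rows.items()}
# Matches the measured "Total Time" column: 302, 315, 328, 346, 364
```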
*Figure: Response length change by scenario (T=0.1 → T=0.9)*
The Hallucination Cliff: Where Temperature Becomes Dangerous
This is the most critical finding of the entire experiment. Six hallucination trap questions — asking about nonexistent products, fabricated legal precedents, and misleading medical diagnoses — were tested at each temperature level, with 3 repetitions per question for a total of 90 runs.
| Temperature | Correct Refusal | Hallucinated | Hallucination Rate | Behavior |
|---|---|---|---|---|
| T = 0.1 | 15/18 | 3/18 | 16.7% | Consistent refusal phrasing |
| T = 0.3 | 14/18 | 4/18 | 22.2% | Refuses but adds 1-2 speculative sentences |
| T = 0.5 | 12/18 | 6/18 | 33.3% | Hedged answers like "it might be..." increase |
| T = 0.7 | 8/18 | 10/18 | 55.6% | Fabricates specific prices and model names |
| T = 0.9 | 5/18 | 13/18 | 72.2% | Confidently presents fully fabricated details |
Key Finding
Hallucination rate spikes dramatically starting at T=0.5. At T=0.3, the rate is a manageable 22.2%. At T=0.5, it jumps to 33.3%. By T=0.7, more than half of all trap questions produce fabricated answers (55.6%), and at T=0.9, nearly three out of four responses contain hallucinated information (72.2%). For any customer-facing production service, T=0.3 or lower is the safe zone.
The character of the hallucinations also changes with temperature. At T=0.1, when the model does hallucinate, it hedges ("this might be," "it is possible that"). At T=0.9, it invents specific product names, prices, and technical specifications with complete confidence. Higher-temperature hallucinations are therefore far more dangerous: they are harder for users to distinguish from genuine information.
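Scoring a temperature level comes down to counting labels over its 18 trap runs (6 questions × 3 repetitions). A minimal sketch of that bookkeeping; the label strings are our own, not the experiment's exact annotation scheme:

```python
from collections import Counter

def hallucination_rate(labels):
    """Fraction of 'hallucinated' labels among the trap runs
    at one temperature level."""
    counts = Counter(labels)
    return counts["hallucinated"] / len(labels)

# Reproducing the T=0.3 row of the table: 4 hallucinated, 14 refused.
labels_t03 = ["hallucinated"] * 4 + ["refused"] * 14
rate_t03 = hallucination_rate(labels_t03)   # 4/18, i.e. 22.2%
```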
Optimal Temperature by Business Scenario
For each scenario, we identified the temperature that best balances accuracy with natural-sounding responses. Scoring used a 5-point scale weighted across instruction following (25%), factual accuracy (25%), language naturalness (25%), response structure (15%), and refusal capability (10%).
| Scenario | Optimal T | T=0.1 Score | Best Score | T=0.9 Score | Rationale |
|---|---|---|---|---|---|
| Manufacturing | 0.3 | 3.8 | 4.0 | 3.2 | Procedural accuracy is critical. Slight variation improves readability. |
| SaaS | 0.3 | 3.7 | 3.9 | 3.4 | Feature descriptions need accuracy with example diversity. |
| Medical | 0.3 | 3.6 | 3.9 | 2.8 | Richer caveats at 0.3, but dangerous misinformation starts at 0.5. |
| E-commerce | 0.5 | 3.4 | 3.8 | 3.1 | Recommendation diversity matters. Emotional language is effective. |
| Legal | 0.1 | 3.9 | 3.9 | 2.4 | Legal text must be precise. Higher T causes fabricated citations. |
| Automation | 0.3 | 3.7 | 3.9 | 3.0 | Code accuracy is essential. Comment diversity is a minor benefit. |
| Korean | 0.5 | 3.5 | 3.8 | 3.3 | Natural expression is key. Quality improves up to 0.5. |
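The weighted score behind these tables reduces to a dot product over the five criteria. A sketch using the stated weights; the per-criterion scores below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Criterion weights from the scoring rubric above (sum to 1.0).
WEIGHTS = {"instruction_following": 0.25, "factual_accuracy": 0.25,
           "naturalness": 0.25, "structure": 0.15, "refusal": 0.10}

def weighted_score(scores):
    """Combine per-criterion 5-point scores into one weighted total."""
    assert set(scores) == set(WEIGHTS)
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Hypothetical scores for one scenario/temperature pair:
example = {"instruction_following": 4.0, "factual_accuracy": 4.5,
           "naturalness": 3.5, "structure": 4.0, "refusal": 3.0}
total = weighted_score(example)   # 3.9 on the 5-point scale
```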
Production Quick Reference

- **T = 0.1**: Factual accuracy above all. Expression variety is unnecessary.
- **T = 0.3**: Maintains accuracy with natural expression. Suitable for most B2B scenarios.
- **T = 0.5**: Diversity helps, but the hallucination rate starts climbing; monitor closely.
- **T = 0.7+**: Hallucination rate exceeds 55%. Never use for customer-facing services.
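The per-scenario recommendations can be encoded directly as a lookup, with values taken from the optimal-temperature table above; the function name, keys, and fallback behavior are our own sketch:

```python
# Recommended temperature per scenario, from the optimal-T table above.
OPTIMAL_T = {"manufacturing": 0.3, "saas": 0.3, "medical": 0.3,
             "e-commerce": 0.5, "legal": 0.1, "automation": 0.3,
             "korean": 0.5}

def temperature_for(scenario, default=0.3):
    """Look up the recommended setting; unknown scenarios fall back
    to the T=0.3 production default."""
    return OPTIMAL_T.get(scenario.lower(), default)
```

Defaulting to 0.3 rather than a library default like 0.7 bakes the article's safe zone into the fallback path.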
GPU Comparison: Does Hardware Affect Optimal Temperature?
We re-ran the full experiment on an RTX 5060 Ti (16 GB) to check whether GPU hardware influences the optimal temperature setting or response quality.
| Metric | RTX PRO 6000 | RTX 5060 Ti | Difference |
|---|---|---|---|
| Avg tok/s (T=0.3) | 135.8 | 42.3 | 3.2× |
| Total time (60 questions) | 315s | 1,012s | 3.2× |
| Avg response length (T=0.3) | 712 tok | 708 tok | ≈ identical |
| Hallucination rate (T=0.3) | 22.2% | 22.2% | Identical |
| Optimal T per scenario | 0.3 | 0.3 | Identical |
The answer is definitive: GPU affects speed, not quality. The RTX PRO 6000 delivers 3.2× the throughput of the RTX 5060 Ti, but response length, hallucination rate, and optimal temperature settings are identical across both cards. This means temperature tuning guidelines are hardware-independent — you can develop your settings on a smaller GPU and deploy to production hardware with confidence.
Conclusion
The single most important takeaway from 300 experiments: T=0.3 is the safest production default for most business scenarios, and T=0.5 is the hallucination cliff. Below T=0.5, hallucination rates stay manageable (16–22%). Above it, the rate doubles and triples, with the model generating increasingly confident fabrications.
Temperature is not a simple creativity-vs-accuracy slider. Up to T=0.3, raising the temperature genuinely improves response naturalness with minimal cost to accuracy. Beyond T=0.5, a qualitatively different failure mode kicks in — the model does not just become "less accurate," it starts confidently inventing information that looks indistinguishable from real data.
For production deployments: use T=0.1 for domains where fabrication is unacceptable (legal, medical), T=0.3 as the general default, and reserve T=0.5+ for non-critical applications with human review. Never deploy T=0.7 or above in customer-facing systems. These findings are model-specific to Qwen3-14B-AWQ, but the "hallucination cliff around T=0.5" pattern is widely observed across LLM families.