SGLang vs vLLM — Why the 3x Throughput Gap Matters More Than Median Latency
At median latency, vLLM looks faster — 14% quicker at 20 users, 61% at 100 users. So why did we choose SGLang? The answer is P95 tail latency. At 20 concurrent users, SGLang P95 is 13.7 seconds while vLLM P95 is 89.8 seconds — a 6.5x gap. We tested both engines with the same GPU, same model (Qwen3-32B-AWQ), same 5 LoRA adapters, scaling from 20 to 200 concurrent users.
- 3.0x throughput gap (20 users)
- 6.5x P95 latency gap (20 users)
- 0 SGLang errors (all load levels)
- 15/15 isolation tests passed (both engines)
Test Setup: Same Everything, Engine Swapped
To isolate the engine-level differences, we held every variable constant except the serving engine itself. Both SGLang (v0.5.8.post1) and vLLM (v0.15.1) were tested on the same RTX PRO 6000 (96GB GDDR7, 350W power limit) running Qwen3-32B-AWQ with 5 LoRA adapters for multi-tenant serving.
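The load pattern is simple to reproduce: each tier fires all of its users' requests at once and records per-request latency. A minimal harness in that shape, with the actual HTTP call replaced by a simulated delay so the sketch is self-contained (the real version would POST to each engine's OpenAI-compatible endpoint):

```python
import asyncio
import random
import statistics
import time


async def send_request(user_id: int) -> float:
    """Stand-in for one chat completion call. The real benchmark would
    POST to the engine's /v1/chat/completions endpoint; here the latency
    is simulated so the sketch runs anywhere."""
    start = time.perf_counter()
    await asyncio.sleep(random.uniform(0.01, 0.03))  # simulated inference
    return time.perf_counter() - start


async def run_tier(concurrent_users: int) -> dict:
    # Fire all users concurrently, mirroring the 20/50/100/200-user tiers.
    latencies = sorted(await asyncio.gather(
        *(send_request(i) for i in range(concurrent_users))
    ))
    return {
        "users": concurrent_users,
        "median_s": statistics.median(latencies),
        "p95_s": latencies[max(0, int(0.95 * len(latencies)) - 1)],
    }


if __name__ == "__main__":
    for users in (20, 50, 100, 200):
        print(asyncio.run(run_tier(users)))
```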
Both engines expose OpenAI-compatible APIs. Switching between them requires only changing the base_url and LoRA parameter name in client code. We migrated 5 production projects without any code changes beyond this.
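Concretely, the divergence is one field in the request body. A hypothetical sketch (the adapter name is a placeholder, and the SGLang `lora_path` field name is an assumption about its engine-specific LoRA selector):

```python
def build_request(engine: str, prompt: str, adapter: str) -> dict:
    """JSON body for POST /v1/chat/completions on each engine.
    The adapter name is a placeholder, not a real deployment's config."""
    body = {"messages": [{"role": "user", "content": prompt}]}
    if engine == "vllm":
        # vLLM's OpenAI-compatible server routes to a served LoRA
        # adapter by its registered name in the `model` field.
        body["model"] = adapter
    elif engine == "sglang":
        # SGLang keeps the base model in `model`; the adapter goes in
        # an engine-specific field (shown here as `lora_path`).
        body["model"] = "Qwen/Qwen3-32B-AWQ"
        body["lora_path"] = adapter
    else:
        raise ValueError(f"unknown engine: {engine}")
    return body


# Same prompt, same adapter; only the base_url and one field differ.
print(build_request("vllm", "Hello", "tenant-a"))
print(build_request("sglang", "Hello", "tenant-a"))
```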
Median Latency: Where vLLM Looks Better
Let's be honest: if you only look at median response time (the 50th percentile, what the typical user experiences), vLLM wins at low-to-moderate concurrency. This is the metric that makes vLLM look like the obvious choice, and it's exactly why looking at the median alone is dangerous.
| Concurrent Users | SGLang Median | vLLM Median | Faster Engine |
|---|---|---|---|
| 20 | 11.6s | 10.0s | vLLM (14% faster) |
| 50 | 18.5s | 11.3s | vLLM (39% faster) |
| 100 | 38.0s | 14.8s | vLLM (61% faster) |
| 200 | 71.4s | 101.3s | SGLang (29% faster) |
Median tells you what half your users experience. It says nothing about the other half. In production, the metric that determines user satisfaction — and support ticket volume — is P95: the response time experienced by the slowest 5% of users.
P95 Tail Latency: The Decisive Metric
This is where the story reverses completely. From 20 to 100 users, vLLM's P95 latency explodes to 9–12x its own median, meaning a small percentage of users wait dramatically longer than everyone else. SGLang's P95 stays within 1.2–1.4x of its median at every load level: every user gets a similar experience.
| Concurrent Users | SGLang P95 | vLLM P95 | Gap |
|---|---|---|---|
| 20 | 13.7s | 89.8s | 6.5x |
| 50 | 24.1s | 137.2s | 5.7x |
| 100 | 46.2s | 167.5s | 3.6x |
| 200 | 97.3s | 118.5s | 1.2x |
SGLang P95/median ratio: 1.2–1.4x. All users experience similar latency.
vLLM P95/median ratio: 9–12x at 20–100 users. Some users wait 9–12x longer than others.
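Both ratio figures follow directly from the two latency tables; recomputing them as a sanity check:

```python
# (median, P95) pairs in seconds, copied from the two tables above.
results = {
    "SGLang": {20: (11.6, 13.7), 50: (18.5, 24.1),
               100: (38.0, 46.2), 200: (71.4, 97.3)},
    "vLLM": {20: (10.0, 89.8), 50: (11.3, 137.2),
             100: (14.8, 167.5), 200: (101.3, 118.5)},
}

for engine, rows in results.items():
    for users, (median, p95) in sorted(rows.items()):
        # SGLang lands at 1.2-1.4x; vLLM at 9.0-12.1x below 200 users.
        print(f"{engine:6} {users:3} users: P95/median = {p95 / median:.1f}x")
```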
Consider 20 concurrent users. With SGLang, every user gets a response between 11 and 14 seconds. With vLLM, half get responses in 10 seconds, but the slowest 5% wait 90 seconds. Those are the users who file support tickets saying "the chatbot is broken."
Throughput: 1.5–3x at Every Concurrency Level
Total tokens processed per second tells the capacity story. SGLang's RadixAttention manages the KV cache in a radix tree (a compressed prefix tree), enabling efficient cache sharing when multiple tenants use different system prompts with common prefixes. This is the primary mechanism behind the throughput gap.
| Concurrent Users | SGLang | vLLM | Ratio |
|---|---|---|---|
| 20 | 565 tok/s | 187 tok/s | 3.0x |
| 50 | 905 tok/s | 350 tok/s | 2.6x |
| 100 | 1,059 tok/s | 547 tok/s | 1.9x |
| 200 | 1,093 tok/s | 708 tok/s | 1.5x |
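RadixAttention itself lives inside the engine, but the sharing idea can be shown with a toy prefix tree: a second tenant whose prompt shares a system-prompt prefix reuses the cached nodes instead of recomputing them. This is a conceptual sketch of the mechanism, not SGLang's implementation:

```python
class PrefixNode:
    """One cached token; children keyed by the next token."""
    def __init__(self):
        self.children = {}


class ToyPrefixCache:
    """Counts how many tokens a new request can reuse from the tree."""
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        """Insert a prompt, returning (reused, computed) token counts."""
        node, reused = self.root, 0
        for i, tok in enumerate(tokens):
            if tok in node.children:
                node = node.children[tok]
                reused += 1
            else:
                # Cache miss: the rest of the prompt must be prefilled.
                for t in tokens[i:]:
                    child = PrefixNode()
                    node.children[t] = child
                    node = child
                return reused, len(tokens) - reused
        return reused, 0


cache = ToyPrefixCache()
shared = ["sys"] * 50                      # shared system-prompt prefix
a = cache.insert(shared + ["a1", "a2"])    # first tenant: all computed
b = cache.insert(shared + ["b1", "b2"])    # second tenant: prefix reused
print(a, b)  # (0, 52) (50, 2)
```

The second request prefills only 2 tokens instead of 52; with many tenants sharing long system prompts, that saving compounds into the throughput gap above.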
Stability and Multi-Tenant Isolation
Error rates reveal another critical difference. SGLang produced zero errors across all concurrency levels. vLLM generated 16 timeout errors at 100 concurrent users — caused by its extreme P95 tail latency pushing some requests past the timeout threshold.
| Concurrent Users | SGLang Errors | vLLM Errors |
|---|---|---|
| 20 | 0 | 0 |
| 50 | 0 | 0 |
| 100 | 0 | 16 |
| 200 | 0 | 0 |
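The failure mode is mechanical: a fixed client timeout plus a heavy tail clips the slowest requests. A toy model makes the point; the 120-second timeout and the linear latency ramp are assumptions, since the benchmark's actual timeout threshold isn't stated above:

```python
def simulate(median: float, p95: float, n: int = 100,
             timeout_s: float = 120.0) -> int:
    """Crude deterministic latency model: interpolate a distribution
    through the (median, P95) pair and count requests over the timeout.
    The 120 s threshold is an assumption, not the benchmark's setting."""
    errors = 0
    for i in range(n):
        q = (i + 0.5) / n  # this request's latency quantile
        if q <= 0.5:
            latency = median * (q / 0.5)  # ramp up to the median
        else:
            # linear ramp from the median at q=0.5 to P95 at q=0.95
            latency = median + (p95 - median) * (q - 0.5) / 0.45
        if latency > timeout_s:
            errors += 1
    return errors


# Numbers from the 100-user row of the P95 table.
print("SGLang @100 users:", simulate(38.0, 46.2))   # 0 timeouts
print("vLLM   @100 users:", simulate(14.8, 167.5))  # 19 in this toy model
```

Even this crude model lands near the observed 16 timeouts: a tight P95 keeps every request under the threshold, while a 167-second tail pushes a double-digit percentage over it.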
For multi-tenant isolation (ensuring Company A's data never leaks to Company B), both engines scored 15/15 — perfect isolation across KV cache, LoRA separation, concurrent persona testing, and rapid switching. Security is not a differentiator here.
Conclusion: Why SGLang Wins for Production
The verdict depends entirely on which metric you optimize for.
| Category | Winner | Why |
|---|---|---|
| Throughput | SGLang | 1.5–3x advantage at all levels |
| P95 Stability | SGLang | 1.2–1.4x of median vs 9–12x |
| Error Rate | SGLang | 0% everywhere vs 16 errors at 100 |
| Median Latency | vLLM | 14–61% faster at low-mid load |
| Isolation | Tie | Both 15/15 PASS |
| API Compat | Tie | Both OpenAI-compatible |
| Monitoring | vLLM | Prometheus enabled by default |
When vLLM Is Still the Right Choice
- Low-concurrency deployments (5–10 users): P95 problems don't surface at this scale. vLLM's median latency advantage is real and meaningful here.
- Prometheus-native monitoring: vLLM exposes Prometheus metrics by default; SGLang requires the --enable-metrics flag.
- Community ecosystem: vLLM has a larger community and broader third-party tool support.
The Core Insight
In production serving, P95 matters more than median. A service where half the users get 10-second responses but 5% wait 90 seconds is a "slow service" in the eyes of those 5% — and they are the ones who churn, complain, and file tickets.
SGLang delivers 1.5–3x throughput, 6.5x better P95 stability, and zero errors across all concurrency levels. The switching cost is negligible — just change your base_url. For any deployment serving more than 10 concurrent users, SGLang is the clear production choice.