treeru.com
AI · February 5, 2026

SGLang vs vLLM — Why the 3x Throughput Gap Matters More Than Median Latency

At median latency, vLLM looks faster — 14% quicker at 20 users, 61% at 100 users. So why did we choose SGLang? The answer is P95 tail latency. At 20 concurrent users, SGLang P95 is 13.7 seconds while vLLM P95 is 89.8 seconds — a 6.5x gap. We tested both engines with the same GPU, same model (Qwen3-32B-AWQ), same 5 LoRA adapters, scaling from 20 to 200 concurrent users.

- 3.0x throughput gap at 20 users
- 6.5x P95 latency gap at 20 users
- 0 SGLang errors across all loads
- 15/15 isolation tests passed by both engines

Test Setup: Same Everything, Engine Swapped

To isolate the engine-level differences, we held every variable constant except the serving engine itself. Both SGLang (v0.5.8.post1) and vLLM (v0.15.1) were tested on the same RTX PRO 6000 (96GB GDDR7, 350W power limit) running Qwen3-32B-AWQ with 5 LoRA adapters for multi-tenant serving.

Hardware & Model

GPU: RTX PRO 6000 (96GB, 350W cap)
Model: Qwen3-32B-AWQ
LoRA Adapters: 5 (multi-tenant)
VRAM Usage: 85.7–87.6GB

Test Parameters

Workload: 2–4 turn multi-turn chat
Max Tokens: 500
Concurrency Levels: 20 / 50 / 100 / 200
API: OpenAI-compatible (both)

Both engines expose OpenAI-compatible APIs. Switching between them requires only changing the base_url and LoRA parameter name in client code. We migrated 5 production projects with no code changes beyond this.
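To make the switch concrete, here is a minimal sketch of the request both engines accept. The ports, adapter name "tenant-a", and helper function are placeholders for illustration, not our production values; the one engine-specific tweak (the LoRA parameter name) is version-dependent and omitted here.

```python
import json

def chat_request(base_url: str, adapter: str, user_msg: str) -> tuple[str, bytes]:
    """Build an OpenAI-compatible /chat/completions request.

    Swapping engines only changes base_url; the payload shape below
    is identical for both. Ports and adapter name are placeholders.
    """
    url = f"{base_url}/chat/completions"
    payload = {
        "model": adapter,  # LoRA adapter selected by name
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": 500,  # matches the benchmark setting
    }
    return url, json.dumps(payload).encode()

# The same call, pointed at either engine:
sglang_req = chat_request("http://localhost:30000/v1", "tenant-a", "Hello")
vllm_req = chat_request("http://localhost:8000/v1", "tenant-a", "Hello")
```

In practice the body is POSTed with any HTTP client or the official OpenAI SDK; only base_url differs per engine.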

Median Latency: Where vLLM Looks Better

Let's be honest: if you only look at median response time (the 50th percentile, i.e. what the typical user experiences), vLLM wins at low-to-moderate concurrency. This is the metric that makes vLLM look like the obvious choice — and it's exactly why looking at median alone is dangerous.

| Concurrent Users | SGLang | vLLM | Difference |
|---|---|---|---|
| 20 | 11.6s | 10.0s | vLLM -14% |
| 50 | 18.5s | 11.3s | vLLM -39% |
| 100 | 38.0s | 14.8s | vLLM -61% |
| 200 | 71.4s | 101.3s | SGLang -29% |

Median tells you what half your users experience. It says nothing about the other half. In production, the metric that determines user satisfaction — and support ticket volume — is P95: the response time experienced by the slowest 5% of users.
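The difference between the two metrics is easy to see with synthetic numbers (these are illustrative, not our benchmark data), using a simple nearest-rank percentile:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with p% of samples <= it."""
    s = sorted(samples)
    rank = math.ceil(p / 100 * len(s))
    return s[rank - 1]

# 94 requests at 10s and 6 stragglers at 90s: the median doesn't move,
# but P95 lands squarely on the stragglers.
latencies = [10.0] * 94 + [90.0] * 6
median = percentile(latencies, 50)  # 10.0 — looks healthy
tail = percentile(latencies, 95)    # 90.0 — the users who file tickets
```

A dashboard showing only the first number would report this service as fast.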

P95 Tail Latency: The Decisive Metric

This is where the story reverses completely. vLLM's P95 latency explodes to 6–12x its own median, meaning a small percentage of users wait dramatically longer than everyone else. SGLang's P95 stays within 1.2–1.4x of its median — every user gets a similar experience.

| Concurrent Users | SGLang P95 | vLLM P95 | Gap |
|---|---|---|---|
| 20 | 13.7s | 89.8s | 6.5x |
| 50 | 24.1s | 137.2s | 5.7x |
| 100 | 46.2s | 167.5s | 3.6x |
| 200 | 97.3s | 118.5s | 1.2x |

SGLang: P95/Median Ratio

- 20 users: 1.2x
- 50 users: 1.3x
- 100 users: 1.2x
- 200 users: 1.4x

All users experience similar latency.

vLLM: P95/Median Ratio

- 20 users: 9.0x
- 50 users: 12.1x
- 100 users: 11.3x
- 200 users: 1.2x

Some users wait 9–12x longer than others.

Consider 20 concurrent users. With SGLang, every user gets a response between 11 and 14 seconds. With vLLM, half get responses in 10 seconds, but the slowest 5% wait 90 seconds. Those are the users who file support tickets saying "the chatbot is broken."

Throughput: 1.5–3x at Every Concurrency Level

Total tokens processed per second tells the capacity story. SGLang's RadixAttention manages KV cache in a prefix tree structure, enabling efficient cache sharing when multiple tenants use different system prompts with common prefixes. This is the primary mechanism behind the throughput gap.
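A toy model makes the prefix-sharing idea concrete. This is not SGLang's implementation — just a sketch of a token trie that counts how many prompt tokens a new request can skip because an earlier request already cached them:

```python
class RadixNode:
    """One node per token in the toy prefix tree."""
    def __init__(self):
        self.children = {}

class PrefixCache:
    """Toy illustration of prefix reuse, not SGLang's actual RadixAttention."""
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        """Insert a prompt; return how many leading tokens were already cached."""
        node, hits = self.root, 0
        for t in tokens:
            if t in node.children:
                hits += 1
            else:
                node.children[t] = RadixNode()
            node = node.children[t]
        return hits

cache = PrefixCache()
system = list(range(100))            # shared 100-token system prompt
cache.insert(system + [500, 501])    # first tenant request: nothing cached yet
reused = cache.insert(system + [600, 601])  # second request reuses 100 tokens
```

In the real engine, each reused prefix token is KV-cache state that never has to be recomputed — which is where the throughput gap comes from when tenants share common prompt prefixes.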

| Concurrent Users | SGLang | vLLM | Ratio |
|---|---|---|---|
| 20 | 565 tok/s | 187 tok/s | 3.0x |
| 50 | 905 tok/s | 350 tok/s | 2.6x |
| 100 | 1,059 tok/s | 547 tok/s | 1.9x |
| 200 | 1,093 tok/s | 708 tok/s | 1.5x |

Stability and Multi-Tenant Isolation

Error rates reveal another critical difference. SGLang produced zero errors across all concurrency levels. vLLM generated 16 timeout errors at 100 concurrent users — caused by its extreme P95 tail latency pushing some requests past the timeout threshold.

| Concurrent Users | SGLang Errors | vLLM Errors |
|---|---|---|
| 20 | 0 | 0 |
| 50 | 0 | 0 |
| 100 | 0 | 16 |
| 200 | 0 | 0 |
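The mechanism is simple enough to state as code: with a fixed client timeout, every request slower than the threshold becomes an error, so error count is just the share of the latency distribution past the cutoff. The numbers below are illustrative, not our benchmark data:

```python
def count_timeouts(latencies, timeout_s):
    """Count requests exceeding a fixed client timeout threshold."""
    return sum(1 for t in latencies if t > timeout_s)

# An engine whose P95 tail sits near 167s against a 120s client timeout
# fails roughly its slowest few percent of requests outright:
latencies = [15.0] * 95 + [167.0] * 5   # 5% tail at 167s (illustrative)
errors = count_timeouts(latencies, timeout_s=120.0)
```

An engine with a tight P95/median ratio stays entirely under the same threshold and produces zero errors.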

For multi-tenant isolation (ensuring Company A's data never leaks to Company B), both engines scored 15/15 — perfect isolation across KV cache, LoRA separation, concurrent persona testing, and rapid switching. Security is not a differentiator here.

Conclusion: Why SGLang Wins for Production

The verdict depends entirely on which metric you optimize for.

| Category | Winner | Why |
|---|---|---|
| Throughput | SGLang | 1.5–3x advantage at all levels |
| P95 Stability | SGLang | 1.2–1.4x of median vs 9–12x |
| Error Rate | SGLang | 0% everywhere vs 16 errors at 100 users |
| Median Latency | vLLM | 14–61% faster at low-to-mid load |
| Isolation | Tie | Both 15/15 PASS |
| API Compat | Tie | Both OpenAI-compatible |
| Monitoring | vLLM | Prometheus enabled by default |

When vLLM Is Still the Right Choice

Low-concurrency deployments (5–10 users): P95 problems don't surface at this scale; vLLM's median latency advantage is real and meaningful here.

Prometheus-native monitoring: vLLM exposes Prometheus metrics by default; SGLang requires the --enable-metrics flag.

Community ecosystem: vLLM has a larger community and broader third-party tool support.
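The monitoring difference amounts to one flag at launch. A sketch of the two setups — model path and ports are placeholders for your deployment:

```shell
# vLLM: Prometheus metrics are served on /metrics by default.
vllm serve Qwen/Qwen3-32B-AWQ
curl -s http://localhost:8000/metrics | head

# SGLang: metrics must be opted into with --enable-metrics, as noted above.
python -m sglang.launch_server --model-path Qwen/Qwen3-32B-AWQ --enable-metrics
curl -s http://localhost:30000/metrics | head
```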

The Core Insight

In production serving, P95 matters more than median. A service where half the users get 10-second responses but 5% wait 90 seconds is a "slow service" in the eyes of those 5% — and they are the ones who churn, complain, and file tickets.

SGLang delivers 1.5–3x throughput, 6.5x better P95 stability, and zero errors across all concurrency levels. The switching cost is negligible — just change your base_url. For any deployment serving more than 10 concurrent users, SGLang is the clear production choice.