SGLang vs vLLM — Why the 3x Throughput Gap Matters More Than Median Latency
At median latency, vLLM looks faster — 14% quicker at 20 users, 61% at 100 users. So why did we choose SGLang? The answer is P95 tail latency. At 20 concurrent users, SGLang P95 is 13.7 seconds while vLLM P95 is 89.8 seconds — a 6.5x gap. We tested both engines with the same GPU, same model (Qwen3-32B-AWQ), same 5 LoRA adapters, scaling from 20 to 200 concurrent users.
- 3.0x throughput gap (20 users)
- 6.5x P95 latency gap (20 users)
- 0 SGLang errors (all load levels)
- 15/15 isolation tests passed (both engines)
Test Setup: Same Everything, Engine Swapped
To isolate the engine-level differences, we held every variable constant except the serving engine itself. Both SGLang (v0.5.8.post1) and vLLM (v0.15.1) were tested on the same RTX PRO 6000 (96GB GDDR7, 350W power limit) running Qwen3-32B-AWQ with 5 LoRA adapters for multi-tenant serving.
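The load pattern is simple to reproduce: each tier fires all of its users' requests at once and records per-request latency. A minimal harness in that shape, with the actual HTTP call replaced by a simulated delay so the sketch is self-contained (the real version would POST to each engine's OpenAI-compatible endpoint):

```python
import asyncio
import random
import statistics
import time


async def send_request(user_id: int) -> float:
    """Stand-in for one chat completion call. The real benchmark would
    POST to the engine's /v1/chat/completions endpoint; here the latency
    is simulated so the sketch runs anywhere."""
    start = time.perf_counter()
    await asyncio.sleep(random.uniform(0.01, 0.03))  # simulated inference
    return time.perf_counter() - start


async def run_tier(concurrent_users: int) -> dict:
    # Fire all users concurrently, mirroring the 20/50/100/200-user tiers.
    latencies = sorted(await asyncio.gather(
        *(send_request(i) for i in range(concurrent_users))
    ))
    return {
        "users": concurrent_users,
        "median_s": statistics.median(latencies),
        "p95_s": latencies[max(0, int(0.95 * len(latencies)) - 1)],
    }


if __name__ == "__main__":
    for users in (20, 50, 100, 200):
        print(asyncio.run(run_tier(users)))
```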
Both engines expose OpenAI-compatible APIs. Switching between them requires only changing the base_url and LoRA parameter name in client code. We migrated 5 production projects without any code changes beyond this.
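Concretely, the divergence is one field in the request body. A hypothetical sketch (the adapter name is a placeholder, and the SGLang `lora_path` field name is an assumption about its engine-specific LoRA selector):

```python
def build_request(engine: str, prompt: str, adapter: str) -> dict:
    """JSON body for POST /v1/chat/completions on each engine.
    The adapter name is a placeholder, not a real deployment's config."""
    body = {"messages": [{"role": "user", "content": prompt}]}
    if engine == "vllm":
        # vLLM's OpenAI-compatible server routes to a served LoRA
        # adapter by its registered name in the `model` field.
        body["model"] = adapter
    elif engine == "sglang":
        # SGLang keeps the base model in `model`; the adapter goes in
        # an engine-specific field (shown here as `lora_path`).
        body["model"] = "Qwen/Qwen3-32B-AWQ"
        body["lora_path"] = adapter
    else:
        raise ValueError(f"unknown engine: {engine}")
    return body


# Same prompt, same adapter; only the base_url and one field differ.
print(build_request("vllm", "Hello", "tenant-a"))
print(build_request("sglang", "Hello", "tenant-a"))
```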
Median Latency: Where vLLM Looks Better
Let's be honest: if you only look at median response time (the 50th percentile, what the typical user experiences), vLLM wins at low-to-moderate concurrency. This is the metric that makes vLLM look like the obvious choice, and it's exactly why looking at the median alone is dangerous.
| Concurrent Users | SGLang Median | vLLM Median | Faster Engine |
|---|---|---|---|
| 20 | 11.6s | 10.0s | vLLM (14% faster) |
| 50 | 18.5s | 11.3s | vLLM (39% faster) |
| 100 | 38.0s | 14.8s | vLLM (61% faster) |
| 200 | 71.4s | 101.3s | SGLang (29% faster) |
Median tells you what half your users experience. It says nothing about the other half. In production, the metric that determines user satisfaction — and support ticket volume — is P95: the response time experienced by the slowest 5% of users.
P95 Tail Latency: The Decisive Metric
This is where the story reverses completely. From 20 to 100 users, vLLM's P95 latency explodes to 9–12x its own median, meaning a small percentage of users wait dramatically longer than everyone else. SGLang's P95 stays within 1.2–1.4x of its median at every load level: every user gets a similar experience.
| Concurrent Users | SGLang P95 | vLLM P95 | Gap |
|---|---|---|---|
| 20 | 13.7s | 89.8s | 6.5x |
| 50 | 24.1s | 137.2s | 5.7x |
| 100 | 46.2s | 167.5s | 3.6x |
| 200 | 97.3s | 118.5s | 1.2x |
SGLang P95/median ratio: 1.2–1.4x. All users experience similar latency.
vLLM P95/median ratio: 9–12x at 20–100 users. Some users wait 9–12x longer than others.
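Both ratio figures follow directly from the two latency tables; recomputing them as a sanity check:

```python
# (median, P95) pairs in seconds, copied from the two tables above.
results = {
    "SGLang": {20: (11.6, 13.7), 50: (18.5, 24.1),
               100: (38.0, 46.2), 200: (71.4, 97.3)},
    "vLLM": {20: (10.0, 89.8), 50: (11.3, 137.2),
             100: (14.8, 167.5), 200: (101.3, 118.5)},
}

for engine, rows in results.items():
    for users, (median, p95) in sorted(rows.items()):
        # SGLang lands at 1.2-1.4x; vLLM at 9.0-12.1x below 200 users.
        print(f"{engine:6} {users:3} users: P95/median = {p95 / median:.1f}x")
```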
Consider 20 concurrent users. With SGLang, every user gets a response between 11 and 14 seconds. With vLLM, half get responses in 10 seconds, but the slowest 5% wait 90 seconds. Those are the users who file support tickets saying "the chatbot is broken."
Throughput: 1.5–3x at Every Concurrency Level
Total tokens processed per second tells the capacity story. SGLang's RadixAttention manages the KV cache in a radix tree (a compressed prefix tree), enabling efficient cache sharing when multiple tenants use different system prompts with common prefixes. This is the primary mechanism behind the throughput gap.
| Concurrent Users | SGLang | vLLM | Ratio |
|---|---|---|---|
| 20 | 565 tok/s | 187 tok/s | 3.0x |
| 50 | 905 tok/s | 350 tok/s | 2.6x |
| 100 | 1,059 tok/s | 547 tok/s | 1.9x |
| 200 | 1,093 tok/s | 708 tok/s | 1.5x |
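RadixAttention itself lives inside the engine, but the sharing idea can be shown with a toy prefix tree: a second tenant whose prompt shares a system-prompt prefix reuses the cached nodes instead of recomputing them. This is a conceptual sketch of the mechanism, not SGLang's implementation:

```python
class PrefixNode:
    """One cached token; children keyed by the next token."""
    def __init__(self):
        self.children = {}


class ToyPrefixCache:
    """Counts how many tokens a new request can reuse from the tree."""
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        """Insert a prompt, returning (reused, computed) token counts."""
        node, reused = self.root, 0
        for i, tok in enumerate(tokens):
            if tok in node.children:
                node = node.children[tok]
                reused += 1
            else:
                # Cache miss: the rest of the prompt must be prefilled.
                for t in tokens[i:]:
                    child = PrefixNode()
                    node.children[t] = child
                    node = child
                return reused, len(tokens) - reused
        return reused, 0


cache = ToyPrefixCache()
shared = ["sys"] * 50                      # shared system-prompt prefix
a = cache.insert(shared + ["a1", "a2"])    # first tenant: all computed
b = cache.insert(shared + ["b1", "b2"])    # second tenant: prefix reused
print(a, b)  # (0, 52) (50, 2)
```

The second request prefills only 2 tokens instead of 52; with many tenants sharing long system prompts, that saving compounds into the throughput gap above.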
Stability and Multi-Tenant Isolation
Error rates reveal another critical difference. SGLang produced zero errors across all concurrency levels. vLLM generated 16 timeout errors at 100 concurrent users — caused by its extreme P95 tail latency pushing some requests past the timeout threshold.
| Concurrent Users | SGLang Errors | vLLM Errors |
|---|---|---|
| 20 | 0 | 0 |
| 50 | 0 | 0 |
| 100 | 0 | 16 |
| 200 | 0 | 0 |
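The failure mode is mechanical: a fixed client timeout plus a heavy tail clips the slowest requests. A toy model makes the point; the 120-second timeout and the linear latency ramp are assumptions, since the benchmark's actual timeout threshold isn't stated above:

```python
def simulate(median: float, p95: float, n: int = 100,
             timeout_s: float = 120.0) -> int:
    """Crude deterministic latency model: interpolate a distribution
    through the (median, P95) pair and count requests over the timeout.
    The 120 s threshold is an assumption, not the benchmark's setting."""
    errors = 0
    for i in range(n):
        q = (i + 0.5) / n  # this request's latency quantile
        if q <= 0.5:
            latency = median * (q / 0.5)  # ramp up to the median
        else:
            # linear ramp from the median at q=0.5 to P95 at q=0.95
            latency = median + (p95 - median) * (q - 0.5) / 0.45
        if latency > timeout_s:
            errors += 1
    return errors


# Numbers from the 100-user row of the P95 table.
print("SGLang @100 users:", simulate(38.0, 46.2))   # 0 timeouts
print("vLLM   @100 users:", simulate(14.8, 167.5))  # 19 in this toy model
```

Even this crude model lands near the observed 16 timeouts: a tight P95 keeps every request under the threshold, while a 167-second tail pushes a double-digit percentage over it.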
For multi-tenant isolation (ensuring Company A's data never leaks to Company B), both engines scored 15/15 — perfect isolation across KV cache, LoRA separation, concurrent persona testing, and rapid switching. Security is not a differentiator here.
Conclusion: Why SGLang Wins for Production
The verdict depends entirely on which metric you optimize for.
| Category | Winner | Why |
|---|---|---|
| Throughput | SGLang | 1.5–3x advantage at all levels |
| P95 Stability | SGLang | 1.2–1.4x of median vs 9–12x |
| Error Rate | SGLang | 0% everywhere vs 16 errors at 100 |
| Median Latency | vLLM | 14–61% faster at low-mid load |
| Isolation | Tie | Both 15/15 PASS |
| API Compat | Tie | Both OpenAI-compatible |
| Monitoring | vLLM | Prometheus enabled by default |
When vLLM Is Still the Right Choice
- Low-concurrency deployments (5–10 users): P95 problems don't surface at this scale. vLLM's median latency advantage is real and meaningful here.
- Prometheus-native monitoring: vLLM exposes Prometheus metrics by default; SGLang requires the --enable-metrics flag.
- Community ecosystem: vLLM has a larger community and broader third-party tool support.
The Core Insight
In production serving, P95 matters more than median. A service where half the users get 10-second responses but 5% wait 90 seconds is a "slow service" in the eyes of those 5% — and they are the ones who churn, complain, and file tickets.
SGLang delivers 1.5–3x throughput, 6.5x better P95 stability, and zero errors across all concurrency levels. The switching cost is negligible — just change your base_url. For any deployment serving more than 10 concurrent users, SGLang is the clear production choice.