treeru.com
AI · February 3, 2026

Local LLM Concurrent User Load Test — How Many Users Can a Single RTX PRO 6000 Handle?

If you're planning to self-host an LLM for internal use, the first question is always: how many concurrent users can one GPU actually serve? We loaded an RTX PRO 6000 (96 GB VRAM) with both an 8B and a 32B model, then ran real multi-turn chat simulations from 1 to 200 simultaneous users. Zero errors across every test — but response times tell a very different story depending on model size.

  • Max concurrent users tested: 200
  • Error rate (all tests): 0%
  • 8B vs 32B speed gap: 3.1x
  • Stable VRAM usage: 82 GB

Test Environment

The goal was to simulate a realistic production workload — not just single-shot completions, but multi-turn conversations with LoRA adapter hot-swapping between requests. This is what an actual enterprise chatbot deployment looks like.

Hardware & Software

GPU: NVIDIA RTX PRO 6000 (96 GB VRAM)
Serving engine: SGLang (OpenAI-compatible API)
Quantization: AWQ 4-bit
LoRA adapters: 5 task-specific adapters (hot-swapped)
Context length: 4,096 tokens
Memory allocation: 85% (mem-fraction-static)

Models Under Test

Qwen3-8B-AWQ

VRAM: ~82 GB (model + 5 LoRA adapters + KV cache)

Qwen3-32B-AWQ

VRAM: ~84.5 GB (model + 5 LoRA adapters + KV cache)

Each simulated user ran a 2–4 turn conversation with max_tokens=500, maintaining context across turns. Think time between turns was randomized between 0.3–1 second to approximate real usage. All measurements are non-streaming (full response completion before delivery) — the worst-case scenario for perceived latency.
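The per-user loop described above can be sketched against any OpenAI-compatible endpoint. This is an illustrative Python sketch rather than the exact harness used in the test; the endpoint URL, port, and model name are placeholders, and only the stdlib is used.

```python
import json
import random
import time
import urllib.request

# Placeholder endpoint: SGLang exposes an OpenAI-compatible API, but the
# host, port, and served model name here are assumptions for illustration.
API_URL = "http://localhost:30000/v1/chat/completions"

def build_payload(history, model="qwen3-8b-awq"):
    # Settings from the article: max_tokens=500, non-streaming,
    # full conversation history resent each turn.
    return {"model": model, "messages": history, "max_tokens": 500, "stream": False}

def run_user_session(prompts, url=API_URL):
    """Simulate one user: 2-4 turns, context carried across turns,
    randomized 0.3-1 s think time between turns."""
    history, latencies = [], []
    for prompt in prompts[: random.randint(2, 4)]:
        history.append({"role": "user", "content": prompt})
        body = json.dumps(build_payload(history)).encode()
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        t0 = time.perf_counter()
        with urllib.request.urlopen(req) as resp:
            reply = json.loads(resp.read())["choices"][0]["message"]["content"]
        latencies.append(time.perf_counter() - t0)  # full non-streaming latency
        history.append({"role": "assistant", "content": reply})
        time.sleep(random.uniform(0.3, 1.0))  # think time between turns
    return latencies
```

Running many of these sessions in parallel (one task per simulated user) reproduces the workload shape described above.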

8B Model Load Test Results

Short Response Benchmark (max_tokens=200)

Before running the full chat simulation, we tested raw concurrent request handling with short responses — the kind you'd see for FAQ lookups or classification tasks.

Concurrent Users | Total Requests | Median   | P95      | GPU Temp
1                | 5              | 428 ms   | 1,006 ms | 30°C
5                | 25             | 694 ms   | 1,116 ms | 31°C
10               | 50             | 803 ms   | 1,214 ms | 33°C
20               | 60             | 901 ms   | 1,596 ms | 35°C
50               | 100            | 1,172 ms | 2,063 ms | 36°C
100              | 100            | 1,285 ms | 1,967 ms | 39°C

Even at 100 concurrent users, the 8B model kept median response time under 1.3 seconds with zero errors. For short-response workloads like FAQ bots or intent classification, a single GPU handles the load comfortably.
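A minimal version of one concurrency step in such a benchmark looks like the following: a thread pool sized to the user count, with the median and a nearest-rank P95 computed over the collected latencies. `request_fn` is a stand-in for one timed API call; this is a sketch of the method, not the exact tool used here.

```python
import statistics
from concurrent.futures import ThreadPoolExecutor

def load_step(request_fn, n_users, n_requests):
    """Fire n_requests through a pool of n_users workers and summarize latency.

    request_fn is any callable returning a latency in seconds, e.g. one
    timed HTTP request against the serving endpoint."""
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        latencies = sorted(pool.map(lambda _: request_fn(), range(n_requests)))
    return {
        "median": statistics.median(latencies),
        # nearest-rank P95 over the sorted sample
        "p95": latencies[max(0, int(len(latencies) * 0.95) - 1)],
    }
```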

Multi-Turn Chat Simulation (Production Workload)

This is the real test — sustained multi-turn conversations with 500-token responses, which mirrors how an actual chat service operates under load.

Scenario         | Users | Median | P95   | GPU Temp | Throughput
Normal afternoon | 20    | 3.5s   | 4.2s  | 43°C     | 1,582 tok/s
Lunch peak       | 50    | 5.4s   | 6.5s  | 48°C     | 2,590 tok/s
Event surge      | 100   | 8.6s   | 10.9s | 62°C     | 3,469 tok/s
Extreme stress   | 200   | 16.9s  | 24.1s | 70°C     | 3,890 tok/s

  • 20 users: 3.5s (comfortable)
  • 50 users: 5.4s (slightly slow)
  • 100 users: 8.6s (frustrating)
  • 200 users: 16.9s (unusable)

The sweet spot for the 8B model is 20 concurrent users — a 3.5-second median is perfectly acceptable for a chat interface. At 50 users you start feeling the lag, and beyond 100 the non-streaming wait becomes frustrating. However, with SSE streaming enabled, the first token arrives within 1 second even at high concurrency, which dramatically improves perceived latency. And critically, the error rate stayed at 0% across every concurrency level, including the 200-user extreme test.
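Measuring that perceived latency means timing the first content delta over SSE rather than the whole response. Below is a hedged sketch of a time-to-first-token probe against an OpenAI-compatible streaming endpoint; the exact wire format can vary slightly between servers, and the URL is a placeholder.

```python
import json
import time
import urllib.request

def parse_sse_chunk(line):
    """Return the text delta from one 'data: {...}' SSE line, or None
    if the line carries no visible content (keep-alives, [DONE], etc.)."""
    line = line.strip()
    if not line.startswith("data:") or line.endswith("[DONE]"):
        return None
    delta = json.loads(line[len("data:"):])["choices"][0]["delta"]
    return delta.get("content") or None

def time_to_first_token(url, payload):
    """Time until the first visible token arrives with stream=True --
    the latency a user actually perceives, unlike the non-streaming medians."""
    req = urllib.request.Request(
        url,
        data=json.dumps(dict(payload, stream=True)).encode(),
        headers={"Content-Type": "application/json"},
    )
    t0 = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            if parse_sse_chunk(raw.decode()) is not None:
                return time.perf_counter() - t0
    return None
```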

32B Model Load Test Results

Larger models produce better responses for complex queries, but the throughput penalty is steep. Here's what happens when you run Qwen3-32B-AWQ under the same conditions.

Scenario         | Users | Median | P95   | GPU Temp | Power | Throughput
Normal afternoon | 20    | 10.4s  | 11.5s | 61°C     | 566W  | 650 tok/s
Lunch peak       | 50    | 16.8s  | 18.5s | 74°C     | 600W  | 1,122 tok/s
Event surge      | 100   | 26.6s  | 34.8s | 80°C     | 606W  | 1,385 tok/s
Extreme stress   | 200   | 52.2s  | 72.8s | 83°C     | 606W  | 1,429 tok/s

The numbers speak clearly: even at just 20 concurrent users, the 32B model takes 10.4 seconds per response — already uncomfortable for interactive chat. At 200 users, the median climbs to 52.2 seconds with P95 at 72.8 seconds, making it effectively unusable without streaming. GPU temperature hits 83°C, dangerously close to the 85°C thermal limit, and power draw plateaus at the card's ~600W TDP (606W measured at peak).

GPU Stability (32B Model)

Metric      | Normal (20 users) | Extreme (200 users) | Safety Limit
Temperature | 61°C              | 83°C                | 85°C
Power draw  | 566W              | 606W                | ~600W TDP
VRAM        | 84.5 GB           | 84.5 GB             | 95.6 GB
Error rate  | 0%                | 0%                  | n/a

8B vs 32B: Direct Comparison

Same GPU, same test methodology, same LoRA configuration. The only variable is model size. The performance gap is remarkably consistent.

Metric               | Qwen3-8B    | Qwen3-32B   | Ratio
VRAM usage           | 82.3 GB     | 84.5 GB     | 1.03x
20 users median      | 3.5s        | 10.4s       | 3.0x slower
50 users median      | 5.4s        | 16.8s       | 3.1x slower
100 users median     | 8.6s        | 26.6s       | 3.1x slower
200 users median     | 16.9s       | 52.2s       | 3.1x slower
20 users throughput  | 1,582 tok/s | 650 tok/s   | 0.41x
200 users throughput | 3,890 tok/s | 1,429 tok/s | 0.37x
200 users GPU temp   | 70°C        | 83°C        | +13°C
200 users power      | 532W        | 606W        | +74W
Error rate           | 0%          | 0%          | Same

The standout finding is the consistent 3.1x speed difference across all concurrency levels. Regardless of whether you have 20 or 200 users, the 32B model is almost exactly 3.1 times slower than the 8B. VRAM usage is nearly identical (only 2 GB difference), but thermals and power consumption tell a different story — the 32B model runs 13°C hotter and draws 74W more power at peak load.

Production Architecture Recommendations

Based on 12 months of operating these configurations in production, here's the architecture that works:

Recommended Architecture

1. Default tier: 8B model

Route the majority of traffic to Qwen3-8B-AWQ. Ideal for FAQ, simple guidance, and classification tasks. Comfortably serves 50 concurrent users per GPU.

2. Premium tier: 32B model (with routing)

Route only complex queries — detailed analysis, multi-step reasoning — to the 32B model. Keep concurrent load under 20 users. SSE streaming is mandatory.

3. Always enable streaming

With SSE streaming, the first token arrives within 1–2 seconds even under high load. Non-streaming numbers represent the absolute worst case for user experience.
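A routing tier can start out very simple. The heuristic below (query length plus keyword hints) and the served model names are illustrative placeholders, not part of the benchmark; a production router would more likely use a small classifier model.

```python
# Hypothetical routing sketch. The hint list, word-count threshold, and
# model identifiers are all assumptions made for illustration.
COMPLEX_HINTS = ("analyze", "compare", "step by step", "explain in detail")

def pick_tier(query: str, premium_load: int, premium_cap: int = 20) -> str:
    """Send long or analysis-style queries to the 32B tier while it has
    headroom; everything else, and any overflow, goes to the 8B default."""
    wants_premium = len(query.split()) > 60 or any(
        hint in query.lower() for hint in COMPLEX_HINTS
    )
    if wants_premium and premium_load < premium_cap:
        return "qwen3-32b-awq"
    return "qwen3-8b-awq"
```

Note the cap check: once the premium tier reaches its 20-user comfort limit, overflow spills back to the 8B tier instead of queueing behind a 10+ second median.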

Capacity Planning per GPU

8B Model (per GPU)

  • 20 users: comfortable (3.5s)
  • 50 users: acceptable with streaming
  • 100+ users: add GPUs or scale horizontally

32B Model (per GPU)

  • 20 users: streaming required (10.4s)
  • 50+ users: add GPUs or switch to 8B
  • Thermals: 83°C at peak — close to limit
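These per-GPU limits translate into a back-of-envelope capacity formula. The thresholds below are taken from the figures above; treat the result as a starting point for planning, not a guarantee.

```python
import math

# Comfort thresholds (concurrent users per GPU) from the load tests above;
# "comfort" assumes streaming is enabled at the upper end of each range.
COMFORT_LIMIT = {"8b": 50, "32b": 20}

def gpus_needed(expected_concurrent_users: int, model: str = "8b") -> int:
    """Back-of-envelope GPU count: ceiling of expected peak concurrency
    over the per-GPU comfort limit for that model size."""
    return math.ceil(expected_concurrent_users / COMFORT_LIMIT[model])
```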

Operational Notes

  • VRAM is independent of concurrency — KV cache is pre-allocated, so VRAM usage stays flat whether you have 1 or 200 users.
  • Throughput increases with concurrency — batching efficiency improves under load. The 8B model jumps from 1,582 tok/s at 20 users to 3,890 tok/s at 200.
  • LoRA hot-swapping incurs zero performance penalty — all 5 adapters were swapped between requests with no measurable degradation. This makes multi-tenant deployments viable on a single GPU.
  • Power limiting can reduce thermals significantly — setting a 350W power cap drops GPU temperature by roughly 22°C with only ~10% performance loss at low concurrency.
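The thermal and power figures reported above can be logged during a run with `nvidia-smi` (the power cap in the last bullet is set with `nvidia-smi -pl 350`, which requires admin privileges). A small sampling helper along these lines:

```python
import subprocess

def parse_smi_line(line: str) -> dict:
    """Parse one 'temp, power, mem' CSV line from nvidia-smi's
    csv,noheader,nounits output format."""
    temp, power, mem = (v.strip() for v in line.split(","))
    return {"temp_c": int(temp), "power_w": float(power), "vram_mib": int(mem)}

def gpu_stats(index: int = 0) -> dict:
    """Sample temperature (C), power draw (W), and VRAM use (MiB) for one
    GPU; call this on a timer alongside the load test to log thermals."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={index}",
         "--query-gpu=temperature.gpu,power.draw,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return parse_smi_line(out)
```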

Conclusion

A single RTX PRO 6000 with an 8B model can realistically serve 20–50 concurrent users in a production chat application — no errors, reasonable latency, manageable thermals. The 32B model offers better response quality but requires streaming and strict concurrency limits. The consistent 3.1x speed penalty means you're trading raw throughput for smarter answers, so a routing strategy that sends only complex queries to the larger model is the practical approach. For most enterprise self-hosted LLM deployments, start with 8B, add routing when you need it, and scale GPUs when concurrency demands exceed what a single card can handle.