treeru.com

Serving 7 Companies on 1 GPU — Multi-Tenant Isolation Testing in Practice

What happens when Company A's customer asks a question and receives Company B's confidential data in the response? In multi-tenant AI serving — where multiple organizations share a single GPU — cross-tenant data leakage is the most critical failure mode. We deployed 5 LoRA adapters on one RTX PRO 6000, served 7 companies simultaneously, and ran 15 rigorous isolation tests. Every single test passed with zero data leakage. The complete methodology and results follow.

Multi-Tenant Architecture Overview

Our serving stack runs on a single RTX PRO 6000 (96 GB VRAM, 350W TDP) using SGLang as the inference engine with LoRA hot-swapping enabled. The base model is Qwen3-8B-AWQ (4-bit quantized), loaded once into VRAM, with 5 LoRA adapters mounted on top — one per business persona.

Total VRAM consumption sits at approximately 82 GB (base model + 5 adapters + KV cache pool), leaving headroom within the 96 GB envelope. Each adapter adds only ~74 MB, so the marginal cost of adding a new tenant is negligible. Context length is fixed at 4,096 tokens per request.
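The budget above can be sanity-checked with back-of-the-envelope arithmetic. Only the ~74 MB adapter size and the ~82 GB total come from our measurements; the base-model and KV-cache-pool figures below are rough assumptions chosen to match that total:

```python
# Back-of-the-envelope VRAM budget for the serving stack described above.
# base_model_gb and kv_cache_pool_gb are assumptions, not measured values.
GPU_VRAM_GB = 96.0

base_model_gb = 6.5        # Qwen3-8B at 4-bit (AWQ), rough estimate
adapter_gb = 0.074         # ~74 MB per LoRA adapter (measured)
num_adapters = 5
kv_cache_pool_gb = 75.0    # remainder of the ~82 GB total (assumed)

total_gb = base_model_gb + num_adapters * adapter_gb + kv_cache_pool_gb
headroom_gb = GPU_VRAM_GB - total_gb
print(f"total: {total_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```

The takeaway is in the second line of the sum: five tenants' worth of adapters cost less than half a gigabyte combined.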

The 7 Tenants

| Tenant | Industry | Tone | LoRA |
|---|---|---|---|
| Moonlight Cafe | Cafe / Beverages | Casual, friendly | Yes |
| Future Clinic | Medical / Healthcare | Formal, polite | Yes |
| Today Market | E-commerce | Energetic, promotional | Yes |
| Justice Law Firm | Legal | Formal, authoritative | Yes |
| Sky Academy | Education / Tutoring | Bright, encouraging | Yes |
| Recipe Bot | Cooking / Recipes | Casual guide | No |
| Alpha Log | Marketing Solutions | Business professional | No |

Two tenants (Recipe Bot and Alpha Log) run without LoRA adapters, relying solely on system prompts for persona enforcement. This tests whether system-prompt-only isolation holds up under the same concurrent workload.
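In practice this split comes down to how each request is routed. The sketch below builds a per-tenant request payload in the shape of SGLang's native `/generate` API; the `lora_path` field (paired with the server's `--lora-paths` launch flag) is our assumption about how SGLang's multi-LoRA routing is addressed, and the tenant configs are illustrative:

```python
# Per-tenant routing sketch: LoRA tenants attach a lora_path, prompt-only
# tenants rely on the system prompt alone. Configs here are illustrative.
TENANTS = {
    "moonlight_cafe": {"lora": "cafe", "system": "You are Moonlight Cafe's friendly barista bot."},
    "recipe_bot":     {"lora": None,   "system": "You are a casual cooking guide."},
}

def build_payload(tenant_id: str, user_msg: str) -> dict:
    cfg = TENANTS[tenant_id]
    payload = {
        "text": f"{cfg['system']}\nUser: {user_msg}\nAssistant:",
        "sampling_params": {"max_new_tokens": 256, "temperature": 0.7},
    }
    if cfg["lora"] is not None:
        payload["lora_path"] = cfg["lora"]  # route to this tenant's adapter
    return payload
```

A prompt-only tenant's payload simply omits `lora_path`, so the base model answers in whatever persona the system prompt establishes.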

Test 1: KV Cache Leak Detection

The most dangerous failure scenario in multi-tenant GPU serving is KV cache contamination. If the GPU's key-value cache retains fragments from Tenant A's conversation and injects them into Tenant B's response, sensitive business data crosses tenant boundaries. This is not a theoretical concern — poorly implemented inference engines have exhibited exactly this behavior.

Our test methodology was deliberately adversarial: we first generated a long, detail-rich response from Tenant A (requesting full menu listings with prices) to maximize KV cache utilization, then immediately sent a generic query to Tenant B ("Tell me about this place") and scanned the response for Tenant A's keywords.
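The leak check itself reduces to a keyword scan over the second tenant's response. A minimal sketch (the keyword lists here are illustrative, not our actual test set):

```python
def find_leaked_keywords(response: str, foreign_keywords: list[str]) -> list[str]:
    """Return any of Tenant A's keywords that appear in Tenant B's response."""
    text = response.lower()
    return [kw for kw in foreign_keywords if kw.lower() in text]

# Illustrative: cafe-domain keywords must never surface in a clinic reply.
cafe_keywords = ["americano", "latte", "espresso", "menu price"]
clinic_reply = "Our clinic offers same-day appointments with certified physicians."
assert find_leaked_keywords(clinic_reply, cafe_keywords) == []  # PASS: no leakage
```

An empty list means the pair passes; any hit is logged as a leak and fails the run immediately.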

| Cross Pair | Verification | Result |
|---|---|---|
| Cafe → Clinic | No cafe menu keywords in clinic response | PASS |
| Law Firm → Academy | No legal keywords in academy response | PASS |
| E-commerce → Cafe | No shipping/order keywords in cafe response | PASS |

All three cross-tenant pairs showed zero keyword leakage. SGLang allocates independent KV cache slots per request, ensuring that one tenant's conversation state never bleeds into another tenant's generation context — even when cache utilization is intentionally pushed to its limits.

Tests 2-3: LoRA Adapter Isolation and Concurrent Personas

Test 2 — Sequential adapter isolation: We sent the identical query ("Tell me about this place") to each of the 5 LoRA-equipped tenants in sequence. For each response, we checked two conditions: (1) the response contained keywords appropriate to that tenant's industry, and (2) the response contained zero keywords from any other tenant's domain.

| Adapter | Own Keywords Present | Cross-Tenant Leakage | Result |
|---|---|---|---|
| Cafe | Yes | 0 instances | PASS |
| Clinic | Yes | 0 instances | PASS |
| E-commerce | Yes | 0 instances | PASS |
| Law Firm | Yes | 0 instances | PASS |
| Academy | Yes | 0 instances | PASS |

Test 3 — Simultaneous 5-persona concurrency: Instead of sequential requests, we fired the same query to all 5 tenants simultaneously. Under concurrent execution, a faulty adapter routing mechanism could mix up which LoRA weights are applied to which request. All 5 responses returned the correct persona with zero cross-contamination.
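The concurrent variant can be driven with a simple thread pool; `query_tenant` below is a stand-in for the real HTTP call to the server:

```python
# Fire the same query at all 5 LoRA tenants simultaneously.
from concurrent.futures import ThreadPoolExecutor

def query_tenant(tenant_id: str, prompt: str) -> str:
    """Stand-in for the real HTTP request to the SGLang server."""
    return f"[{tenant_id}] response to: {prompt}"

TENANT_IDS = ["cafe", "clinic", "ecommerce", "law_firm", "academy"]
PROMPT = "Tell me about this place"

with ThreadPoolExecutor(max_workers=len(TENANT_IDS)) as pool:
    futures = {t: pool.submit(query_tenant, t, PROMPT) for t in TENANT_IDS}
    responses = {t: f.result() for t, f in futures.items()}
```

Each response is then run through the same keyword verification as the sequential test, so a mis-routed adapter would show up as a persona mismatch.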

The verification method uses bidirectional keyword matching: each industry defines a set of expected keywords (e.g., "coffee" and "beverage" for the cafe) and a set of anti-keywords (e.g., "diagnosis" and "lawsuit" must never appear in a cafe response). The automated test script checks both directions for every response.
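That bidirectional check is only a few lines of Python; the keyword sets below are illustrative stand-ins for the real per-industry lists:

```python
def check_persona(response: str, expected: set[str], forbidden: set[str]) -> bool:
    """Bidirectional keyword match: at least one expected keyword must appear,
    and no anti-keyword may appear."""
    text = response.lower()
    return any(kw in text for kw in expected) and not any(kw in text for kw in forbidden)

CAFE_KEYWORDS = {"coffee", "beverage", "latte"}    # illustrative
CAFE_ANTI = {"diagnosis", "lawsuit", "shipping"}   # illustrative

assert check_persona("Try our new coffee and beverage specials!", CAFE_KEYWORDS, CAFE_ANTI)
assert not check_persona("We can file your lawsuit today.", CAFE_KEYWORDS, CAFE_ANTI)
```

Requiring both directions matters: a response with no foreign keywords but also no domain keywords would indicate the adapter failed to load at all, not just that isolation held.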

Tests 4-5: Multi-User Independence and Rapid Switching

Test 4 — Same-tenant multi-user independence: Two users connected to the same cafe tenant simultaneously. User A asked about menu prices while User B asked about location and parking. Each user received a response strictly relevant to their own query — User A got pricing data, User B got location details, with no cross-contamination between the two sessions.

Test 5 — Rapid tenant switching: We cycled through all 5 LoRA tenants in sequence (Cafe → Clinic → E-commerce → Law Firm → Academy), then repeated the entire cycle — 10 switches total. Every single switch maintained the correct persona without any residual behavior from the previous tenant. The adapter hot-swap mechanism in SGLang handles transitions cleanly even under rapid sequential load.

Result: 10/10 switches passed. No persona bleeding was observed at any point in the rotation, confirming that LoRA adapter loading is fully stateless between requests.
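The rotation harness is just two passes over the tenant cycle, with a persona check after every swap. The sketch below stubs out the client call and uses illustrative keywords:

```python
# Rapid-switching harness: 2 full passes over the cycle = 10 swaps.
CYCLE = ["cafe", "clinic", "ecommerce", "law_firm", "academy"]
EXPECTED = {"cafe": "coffee", "clinic": "appointment", "ecommerce": "order",
            "law_firm": "legal", "academy": "lesson"}

def ask_tenant(tenant: str) -> str:
    """Stand-in for the real HTTP call; a live run hits the server instead."""
    return f"As the {tenant} assistant, I can help with any {EXPECTED[tenant]} question."

passed = 0
for tenant in CYCLE * 2:                  # 10 switches total
    reply = ask_tenant(tenant).lower()
    if EXPECTED[tenant] in reply:         # persona check after every swap
        passed += 1
print(f"{passed}/10 switches passed")
```

A live run replaces `ask_tenant` with the real request and the stub keywords with the per-industry lists; any residual persona from the previous tenant fails that switch.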

Fair Queuing Under Burst Load

Perfect isolation means nothing if one tenant's traffic spike starves the others. We tested this by having the cafe tenant fire 20 concurrent requests (simulating a traffic burst) while the other 4 tenants each sent 1 normal request. The key metric was how much the burst delayed the other tenants' responses compared to their baseline (single-tenant, no contention) performance.

| Scenario | Latency During Burst | Baseline | Delay Factor |
|---|---|---|---|
| Other 4 tenants (median) | ~1.5x baseline | Reference (no contention) | 1.5x (acceptable) |

A 1.5x delay factor under burst conditions is well within acceptable bounds. SGLang's continuous batching scheduler distributes GPU compute fairly across queued requests rather than letting one tenant monopolize the pipeline. For context, anything below 2x is generally imperceptible to end users, while 5x or above would indicate a scheduling problem.
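The metric itself is a median of per-tenant latency ratios. The raw latencies below are hypothetical (the post reports only the resulting ~1.5x factor), but they show the computation:

```python
from statistics import median

# Hypothetical per-tenant response latencies in seconds; only the resulting
# ~1.5x median delay factor is a measured figure from our test.
baseline     = {"clinic": 3.4, "ecommerce": 3.6, "law_firm": 3.5, "academy": 3.3}
during_burst = {"clinic": 5.1, "ecommerce": 5.4, "law_firm": 5.3, "academy": 4.9}

delay_factor = median(during_burst[t] / baseline[t] for t in baseline)
print(f"median delay factor: {delay_factor:.2f}x")
```

Using the median rather than the mean keeps a single outlier tenant from skewing the fairness verdict.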

Concurrent Load Benchmark (8B Model + 5 LoRA Adapters)

| Concurrent Users | Median Latency | Throughput | GPU Temp | Error Rate |
|---|---|---|---|---|
| 20 | 3.5 s | 1,582 tok/s | 43°C | 0% |
| 50 | 5.4 s | 2,590 tok/s | 48°C | 0% |
| 100 | 8.6 s | 3,469 tok/s | 62°C | 0% |
| 200 | 16.9 s | 3,890 tok/s | 70°C | 0% |

With 5 LoRA adapters active simultaneously, total additional VRAM consumption is only ~370 MB, less than 0.4% of the 96 GB budget. Throughput climbs from 1,582 tok/s at 20 users to 3,890 tok/s at 200 users (sub-linear scaling, as per-request latency grows once batches fill the GPU), with zero errors across the entire range. GPU temperature stays well below throttling thresholds even at peak load.
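The end-to-end scaling from the table works out as follows:

```python
# Benchmark figures from the concurrent-load table above.
users      = [20,   50,   100,  200]
throughput = [1582, 2590, 3469, 3890]   # tok/s

user_scale = users[-1] / users[0]             # 10x more concurrent users
tput_scale = throughput[-1] / throughput[0]   # resulting throughput gain
print(f"{user_scale:.0f}x users -> {tput_scale:.2f}x throughput")
```

In other words, 10x the concurrency buys roughly 2.5x the aggregate token rate; the rest of the headroom is absorbed as the latency growth visible in the table.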

Conclusion: 15 out of 15 Tests Passed

| # | Test | Count | Result |
|---|---|---|---|
| 1 | KV cache leak detection (3 cross pairs) | 3 | 3/3 PASS |
| 2 | LoRA adapter isolation (5 industries) | 5 | 5/5 PASS |
| 3 | Concurrent 5-persona maintenance | 5 | 5/5 PASS |
| 4 | Same-tenant multi-user independence | 1 | PASS |
| 5 | Rapid switching cycle (10 rotations) | 1 | 10/10 PASS |

1 GPU. 7 companies. Zero data leakage. KV cache isolation held under adversarial conditions. LoRA adapter routing remained correct under both sequential and concurrent loads. Rapid tenant switching showed no persona bleeding across 10 consecutive rotations. And when one tenant fired a burst of 20 simultaneous requests, the other tenants experienced only a 1.5x latency increase — barely noticeable to end users.

The economics are compelling: a single RTX PRO 6000 can serve 7 distinct business personas with complete data isolation, handling up to 200 concurrent users at 3,890 tokens per second with zero errors. Each additional LoRA adapter costs less than 75 MB of VRAM. For small-to-medium B2B SaaS deployments, this architecture eliminates the need for dedicated GPU instances per tenant — cutting infrastructure costs by up to 7x while maintaining enterprise-grade isolation guarantees.