treeru.com

Serving 7 Companies on 1 GPU — Multi-Tenant Isolation Testing in Practice

What happens when Company A's customer asks a question and receives Company B's confidential data in the response? In multi-tenant AI serving — where multiple organizations share a single GPU — cross-tenant data leakage is the most critical failure mode. We deployed 5 LoRA adapters on one RTX PRO 6000, served 7 companies simultaneously, and ran 15 rigorous isolation tests. Every single test passed with zero data leakage. The complete methodology and results follow.

Multi-Tenant Architecture Overview

Our serving stack runs on a single RTX PRO 6000 (96 GB VRAM, 350W TDP) using SGLang as the inference engine with LoRA hot-swapping enabled. The base model is Qwen3-8B-AWQ (4-bit quantized), loaded once into VRAM, with 5 LoRA adapters mounted on top — one per business persona.

Total VRAM consumption sits at approximately 82 GB (base model + 5 adapters + KV cache pool), leaving headroom within the 96 GB envelope. Each adapter adds only ~74 MB, so the marginal cost of adding a new tenant is negligible. Context length is fixed at 4,096 tokens per request.
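The budget above can be sanity-checked with back-of-the-envelope arithmetic. Only the ~74 MB adapter size and the ~82 GB total come from our measurements; the base-model and KV-cache-pool figures below are rough assumptions chosen to match that total:

```python
# Back-of-the-envelope VRAM budget for the serving stack described above.
# base_model_gb and kv_cache_pool_gb are assumptions, not measured values.
GPU_VRAM_GB = 96.0

base_model_gb = 6.5        # Qwen3-8B at 4-bit (AWQ), rough estimate
adapter_gb = 0.074         # ~74 MB per LoRA adapter (measured)
num_adapters = 5
kv_cache_pool_gb = 75.0    # remainder of the ~82 GB total (assumed)

total_gb = base_model_gb + num_adapters * adapter_gb + kv_cache_pool_gb
headroom_gb = GPU_VRAM_GB - total_gb
print(f"total: {total_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```

The takeaway is in the second line of the sum: five tenants' worth of adapters cost less than half a gigabyte combined.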

The 7 Tenants

| Tenant | Industry | Tone | LoRA |
|---|---|---|---|
| Moonlight Cafe | Cafe / Beverages | Casual, friendly | Yes |
| Future Clinic | Medical / Healthcare | Formal, polite | Yes |
| Today Market | E-commerce | Energetic, promotional | Yes |
| Justice Law Firm | Legal | Formal, authoritative | Yes |
| Sky Academy | Education / Tutoring | Bright, encouraging | Yes |
| Recipe Bot | Cooking / Recipes | Casual guide | No |
| Alpha Log | Marketing Solutions | Business professional | No |

Two tenants (Recipe Bot and Alpha Log) run without LoRA adapters, relying solely on system prompts for persona enforcement. This tests whether system-prompt-only isolation holds up under the same concurrent workload.
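In practice this split comes down to how each request is routed. The sketch below builds a per-tenant request payload in the shape of SGLang's native `/generate` API; the `lora_path` field (paired with the server's `--lora-paths` launch flag) is our assumption about how SGLang's multi-LoRA routing is addressed, and the tenant configs are illustrative:

```python
# Per-tenant routing sketch: LoRA tenants attach a lora_path, prompt-only
# tenants rely on the system prompt alone. Configs here are illustrative.
TENANTS = {
    "moonlight_cafe": {"lora": "cafe", "system": "You are Moonlight Cafe's friendly barista bot."},
    "recipe_bot":     {"lora": None,   "system": "You are a casual cooking guide."},
}

def build_payload(tenant_id: str, user_msg: str) -> dict:
    cfg = TENANTS[tenant_id]
    payload = {
        "text": f"{cfg['system']}\nUser: {user_msg}\nAssistant:",
        "sampling_params": {"max_new_tokens": 256, "temperature": 0.7},
    }
    if cfg["lora"] is not None:
        payload["lora_path"] = cfg["lora"]  # route to this tenant's adapter
    return payload
```

A prompt-only tenant's payload simply omits `lora_path`, so the base model answers in whatever persona the system prompt establishes.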

Test 1: KV Cache Leak Detection

The most dangerous failure scenario in multi-tenant GPU serving is KV cache contamination. If the GPU's key-value cache retains fragments from Tenant A's conversation and injects them into Tenant B's response, sensitive business data crosses tenant boundaries. This is not a theoretical concern — poorly implemented inference engines have exhibited exactly this behavior.

Our test methodology was deliberately adversarial: we first generated a long, detail-rich response from Tenant A (requesting full menu listings with prices) to maximize KV cache utilization, then immediately sent a generic query to Tenant B ("Tell me about this place") and scanned the response for Tenant A's keywords.
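The leak check itself reduces to a keyword scan over the second tenant's response. A minimal sketch (the keyword lists here are illustrative, not our actual test set):

```python
def find_leaked_keywords(response: str, foreign_keywords: list[str]) -> list[str]:
    """Return any of Tenant A's keywords that appear in Tenant B's response."""
    text = response.lower()
    return [kw for kw in foreign_keywords if kw.lower() in text]

# Illustrative: cafe-domain keywords must never surface in a clinic reply.
cafe_keywords = ["americano", "latte", "espresso", "menu price"]
clinic_reply = "Our clinic offers same-day appointments with certified physicians."
assert find_leaked_keywords(clinic_reply, cafe_keywords) == []  # PASS: no leakage
```

An empty list means the pair passes; any hit is logged as a leak and fails the run immediately.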

| Cross Pair | Verification | Result |
|---|---|---|
| Cafe → Clinic | No cafe menu keywords in clinic response | PASS |
| Law Firm → Academy | No legal keywords in academy response | PASS |
| E-commerce → Cafe | No shipping/order keywords in cafe response | PASS |

All three cross-tenant pairs showed zero keyword leakage. SGLang allocates independent KV cache slots per request, ensuring that one tenant's conversation state never bleeds into another tenant's generation context — even when cache utilization is intentionally pushed to its limits.

Tests 2-3: LoRA Adapter Isolation and Concurrent Personas

Test 2 — Sequential adapter isolation: We sent the identical query ("Tell me about this place") to each of the 5 LoRA-equipped tenants in sequence. For each response, we checked two conditions: (1) the response contained keywords appropriate to that tenant's industry, and (2) the response contained zero keywords from any other tenant's domain.

| Adapter | Own Keywords Present | Cross-Tenant Leakage | Result |
|---|---|---|---|
| Cafe | Yes | 0 instances | PASS |
| Clinic | Yes | 0 instances | PASS |
| E-commerce | Yes | 0 instances | PASS |
| Law Firm | Yes | 0 instances | PASS |
| Academy | Yes | 0 instances | PASS |

Test 3 — Simultaneous 5-persona concurrency: Instead of sequential requests, we fired the same query to all 5 tenants simultaneously. Under concurrent execution, a faulty adapter routing mechanism could mix up which LoRA weights are applied to which request. All 5 responses returned the correct persona with zero cross-contamination.
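The concurrent variant can be driven with a simple thread pool; `query_tenant` below is a stand-in for the real HTTP call to the server:

```python
# Fire the same query at all 5 LoRA tenants simultaneously.
from concurrent.futures import ThreadPoolExecutor

def query_tenant(tenant_id: str, prompt: str) -> str:
    """Stand-in for the real HTTP request to the SGLang server."""
    return f"[{tenant_id}] response to: {prompt}"

TENANT_IDS = ["cafe", "clinic", "ecommerce", "law_firm", "academy"]
PROMPT = "Tell me about this place"

with ThreadPoolExecutor(max_workers=len(TENANT_IDS)) as pool:
    futures = {t: pool.submit(query_tenant, t, PROMPT) for t in TENANT_IDS}
    responses = {t: f.result() for t, f in futures.items()}
```

Each response is then run through the same keyword verification as the sequential test, so a mis-routed adapter would show up as a persona mismatch.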

The verification method uses bidirectional keyword matching: each industry defines a set of expected keywords (e.g., "coffee" and "beverage" for the cafe) and a set of anti-keywords (e.g., "diagnosis" and "lawsuit" must never appear in a cafe response). The automated test script checks both directions for every response.
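That bidirectional check is only a few lines of Python; the keyword sets below are illustrative stand-ins for the real per-industry lists:

```python
def check_persona(response: str, expected: set[str], forbidden: set[str]) -> bool:
    """Bidirectional keyword match: at least one expected keyword must appear,
    and no anti-keyword may appear."""
    text = response.lower()
    return any(kw in text for kw in expected) and not any(kw in text for kw in forbidden)

CAFE_KEYWORDS = {"coffee", "beverage", "latte"}    # illustrative
CAFE_ANTI = {"diagnosis", "lawsuit", "shipping"}   # illustrative

assert check_persona("Try our new coffee and beverage specials!", CAFE_KEYWORDS, CAFE_ANTI)
assert not check_persona("We can file your lawsuit today.", CAFE_KEYWORDS, CAFE_ANTI)
```

Requiring both directions matters: a response with no foreign keywords but also no domain keywords would indicate the adapter failed to load at all, not just that isolation held.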

Tests 4-5: Multi-User Independence and Rapid Switching

Test 4 — Same-tenant multi-user independence: Two users connected to the same cafe tenant simultaneously. User A asked about menu prices while User B asked about location and parking. Each user received a response strictly relevant to their own query — User A got pricing data, User B got location details, with no cross-contamination between the two sessions.

Test 5 — Rapid tenant switching: We cycled through all 5 LoRA tenants in sequence (Cafe → Clinic → E-commerce → Law Firm → Academy), then repeated the entire cycle — 10 switches total. Every single switch maintained the correct persona without any residual behavior from the previous tenant. The adapter hot-swap mechanism in SGLang handles transitions cleanly even under rapid sequential load.

Result: 10/10 switches passed. No persona bleeding was observed at any point in the rotation, confirming that LoRA adapter loading is fully stateless between requests.
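The rotation harness is just two passes over the tenant cycle, with a persona check after every swap. The sketch below stubs out the client call and uses illustrative keywords:

```python
# Rapid-switching harness: 2 full passes over the cycle = 10 swaps.
CYCLE = ["cafe", "clinic", "ecommerce", "law_firm", "academy"]
EXPECTED = {"cafe": "coffee", "clinic": "appointment", "ecommerce": "order",
            "law_firm": "legal", "academy": "lesson"}

def ask_tenant(tenant: str) -> str:
    """Stand-in for the real HTTP call; a live run hits the server instead."""
    return f"As the {tenant} assistant, I can help with any {EXPECTED[tenant]} question."

passed = 0
for tenant in CYCLE * 2:                  # 10 switches total
    reply = ask_tenant(tenant).lower()
    if EXPECTED[tenant] in reply:         # persona check after every swap
        passed += 1
print(f"{passed}/10 switches passed")
```

A live run replaces `ask_tenant` with the real request and the stub keywords with the per-industry lists; any residual persona from the previous tenant fails that switch.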

Fair Queuing Under Burst Load

Perfect isolation means nothing if one tenant's traffic spike starves the others. We tested this by having the cafe tenant fire 20 concurrent requests (simulating a traffic burst) while the other 4 tenants each sent 1 normal request. The key metric was how much the burst delayed the other tenants' responses compared to their baseline (single-tenant, no contention) performance.

| Scenario | Latency During Burst | Baseline | Delay Factor |
|---|---|---|---|
| Other 4 tenants (median) | ~1.5x baseline | Reference (no contention) | 1.5x (acceptable) |

A 1.5x delay factor under burst conditions is well within acceptable bounds. SGLang's continuous batching scheduler distributes GPU compute fairly across queued requests rather than letting one tenant monopolize the pipeline. For context, anything below 2x is generally imperceptible to end users, while 5x or above would indicate a scheduling problem.
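The metric itself is a median of per-tenant latency ratios. The raw latencies below are hypothetical (the post reports only the resulting ~1.5x factor), but they show the computation:

```python
from statistics import median

# Hypothetical per-tenant response latencies in seconds; only the resulting
# ~1.5x median delay factor is a measured figure from our test.
baseline     = {"clinic": 3.4, "ecommerce": 3.6, "law_firm": 3.5, "academy": 3.3}
during_burst = {"clinic": 5.1, "ecommerce": 5.4, "law_firm": 5.3, "academy": 4.9}

delay_factor = median(during_burst[t] / baseline[t] for t in baseline)
print(f"median delay factor: {delay_factor:.2f}x")
```

Using the median rather than the mean keeps a single outlier tenant from skewing the fairness verdict.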

Concurrent Load Benchmark (8B Model + 5 LoRA Adapters)

| Concurrent Users | Median Latency | Throughput | GPU Temp | Error Rate |
|---|---|---|---|---|
| 20 | 3.5 s | 1,582 tok/s | 43°C | 0% |
| 50 | 5.4 s | 2,590 tok/s | 48°C | 0% |
| 100 | 8.6 s | 3,469 tok/s | 62°C | 0% |
| 200 | 16.9 s | 3,890 tok/s | 70°C | 0% |

With 5 LoRA adapters active simultaneously, total additional VRAM consumption is only ~370 MB, less than 0.4% of the 96 GB budget. Throughput climbs from 1,582 tok/s at 20 users to 3,890 tok/s at 200 users (sub-linear scaling, as per-request latency grows once batches fill the GPU), with zero errors across the entire range. GPU temperature stays well below throttling thresholds even at peak load.
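The end-to-end scaling from the table works out as follows:

```python
# Benchmark figures from the concurrent-load table above.
users      = [20,   50,   100,  200]
throughput = [1582, 2590, 3469, 3890]   # tok/s

user_scale = users[-1] / users[0]             # 10x more concurrent users
tput_scale = throughput[-1] / throughput[0]   # resulting throughput gain
print(f"{user_scale:.0f}x users -> {tput_scale:.2f}x throughput")
```

In other words, 10x the concurrency buys roughly 2.5x the aggregate token rate; the rest of the headroom is absorbed as the latency growth visible in the table.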

Conclusion: 15 out of 15 Tests Passed

| # | Test | Count | Result |
|---|---|---|---|
| 1 | KV cache leak detection (3 cross pairs) | 3 | 3/3 PASS |
| 2 | LoRA adapter isolation (5 industries) | 5 | 5/5 PASS |
| 3 | Concurrent 5-persona maintenance | 5 | 5/5 PASS |
| 4 | Same-tenant multi-user independence | 1 | PASS |
| 5 | Rapid switching cycle (10 rotations) | 1 | 10/10 PASS |

1 GPU. 7 companies. Zero data leakage. KV cache isolation held under adversarial conditions. LoRA adapter routing remained correct under both sequential and concurrent loads. Rapid tenant switching showed no persona bleeding across 10 consecutive rotations. And when one tenant fired a burst of 20 simultaneous requests, the other tenants experienced only a 1.5x latency increase — barely noticeable to end users.

The economics are compelling: a single RTX PRO 6000 can serve 7 distinct business personas with complete data isolation, handling up to 200 concurrent users at 3,890 tokens per second with zero errors. Each additional LoRA adapter costs less than 75 MB of VRAM. For small-to-medium B2B SaaS deployments, this architecture eliminates the need for dedicated GPU instances per tenant — cutting infrastructure costs by up to 7x while maintaining enterprise-grade isolation guarantees.