LoRA Fine-Tuning for Custom AI Chatbots — 10 Training Pairs, 6 Seconds, Multi-Tenant Serving
Ten conversation pairs. Six seconds of training. A 73MB adapter file. With these minimal inputs, we built custom AI chatbots for five different businesses — a cafe, a clinic, an e-commerce store, a law firm, and an academy — all served simultaneously from a single GPU using SGLang hot-swap serving.
- Training pairs per company: 10
- Time to train 5 adapters (8B): 6.4s
- Adapter size (8B): 73.7MB
- Concurrent businesses: 5
What LoRA Is and Why It Changes the Economics of Custom AI
Traditional fine-tuning updates every parameter in the model. For an 8B-parameter model, that means modifying all 8 billion weights, which demands significant GPU time and VRAM and produces a full-sized model copy for each variant. If you need five company-specific chatbots, you end up with five separate 16GB models.
LoRA (Low-Rank Adaptation) takes a fundamentally different approach. The base model stays frozen. Instead, LoRA represents each weight update as the product of two small low-rank matrices and trains only those. With rank=16, this means training roughly 0.6% of the original parameter count: about 50 million parameters instead of 8 billion.
| Metric | Full Fine-Tuning | LoRA |
|---|---|---|
| Trained Parameters | 8B (all) | ~50M (0.6%) |
| Training Time (10 pairs) | Minutes to hours | 1.2–1.5 seconds |
| Output Size | ~16GB (full model) | 73.7MB (adapter) |
| Multi-Tenant | N separate models | 1 base + N adapters |
| VRAM for 5 Companies | ~80GB (5 models) | ~5GB (1 model + 5 adapters) |
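As a rough sketch of where these trainable-parameter counts come from: each adapted projection of shape (d_in, d_out) gains an A matrix of d_in×r and a B matrix of r×d_out. The layer shapes and layer count below are illustrative placeholders, not Qwen3-8B's actual architecture:

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A (d_in x r) plus B (r x d_out)."""
    return rank * (d_in + d_out)

# Hypothetical shapes for the four attention projections of one layer.
rank = 16
shapes = [
    (4096, 4096),   # q_proj (placeholder dims)
    (4096, 1024),   # k_proj
    (4096, 1024),   # v_proj
    (4096, 4096),   # o_proj
]
per_layer = sum(lora_param_count(i, o, rank) for i, o in shapes)
total = per_layer * 36            # assuming 36 transformer layers
print(f"{total:,} trainable parameters vs. 8B frozen")
```

Plugging in a real model's shapes lands the count in the millions to tens of millions, a tiny fraction of the frozen base model.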
The practical implication is that adapters are interchangeable at runtime. You can load a cafe adapter, a clinic adapter, and a law firm adapter into the same serving engine and switch between them per-request with zero server restarts. This is what makes true multi-tenant AI serving possible on a single GPU.
Training Data: Why 10 Pairs Is Enough
A common misconception is that fine-tuning requires thousands of examples. With LoRA, the goal is not injecting new knowledge into the model — it is teaching a communication style. The persona, tone, sentence structure, and response format are what the adapter learns. Actual business knowledge (menu items, prices, operating hours) is handled by the system prompt.
We created 10 Q&A pairs for each of 5 business types. The cafe adapter learned a friendly, casual tone with emoji usage. The clinic adapter learned formal, measured language with appropriate medical disclaimers. The law firm adapter learned strict formal speech with legal caveats. Ten pairs is enough because the patterns are consistent — LoRA picks up on the style signals rapidly.
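A minimal sketch of assembling such pairs into the chat-format JSONL used for training (the pairs and file name here are illustrative):

```python
import json

# Illustrative style-teaching pairs for one business (cafe).
cafe_pairs = [
    ("How much is an Americano?",
     "An Americano is $4.50! Same price for iced or hot~ ☕"),
    ("Do you have decaf options?",
     "Yes! Decaf Americano $5.00, Decaf Latte $5.50~"),
]

def to_jsonl(pairs):
    """One chat-format JSON object per line."""
    lines = []
    for question, answer in pairs:
        record = {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)

with open("cafe_train.jsonl", "w", encoding="utf-8") as f:
    f.write(to_jsonl(cafe_pairs))
```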
Training data example — Cafe adapter (translated)
{"messages": [
{"role": "user", "content": "How much is an Americano?"},
{"role": "assistant", "content": "An Americano is $4.50! Same price for iced or hot~ ☕"}
]}
{"messages": [
{"role": "user", "content": "Do you have decaf options?"},
{"role": "assistant", "content": "Yes! Decaf Americano $5.00, Decaf Latte $5.50~ The taste is basically the same!"}
]}
Training data example — Clinic adapter (translated)
{"messages": [
{"role": "user", "content": "I'd like to book an appointment."},
{"role": "assistant", "content": "Thank you for contacting our clinic. Appointments can be made by phone (02-555-1234) or in person. Could you please let us know your preferred department and date?"}
]}
System Prompt Design: The Knowledge Layer
The LoRA adapter handles how the model speaks. The system prompt handles what it knows. This separation is critical: when a cafe updates its menu or prices, you update the system prompt — no retraining needed. The adapter remains the same because the communication style has not changed.
A well-structured system prompt follows a consistent pattern: role declaration, structured business information (address, hours, phone), product/service catalog with prices, current promotions, tone rules, and a forbidden-topics list. The forbidden-topics list is particularly important for preventing the chatbot from recommending competitors or disclosing proprietary information.
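That pattern can be expressed as a simple template function; the schema and every business detail below are placeholders, not taken from the deployed prompts:

```python
def build_system_prompt(biz: dict) -> str:
    """Assemble a system prompt from structured business info (hypothetical schema)."""
    lines = [
        f"You are the customer-service assistant for {biz['name']}.",  # role declaration
        f"Address: {biz['address']} | Hours: {biz['hours']} | Phone: {biz['phone']}",
        "Menu and prices:",
        *[f"- {item}: {price}" for item, price in biz["catalog"].items()],
        f"Current promotions: {biz['promotions']}",
        f"Tone: {biz['tone']}",
        "Never discuss: " + ", ".join(biz["forbidden"]),  # forbidden-topics list
    ]
    return "\n".join(lines)

cafe = {
    "name": "Example Cafe",
    "address": "123 Main St", "hours": "08:00-20:00", "phone": "02-555-0000",
    "catalog": {"Americano": "$4.50", "Latte": "$5.00"},
    "promotions": "10% off before 10am",
    "tone": "friendly and casual, light emoji use",
    "forbidden": ["competitor recommendations", "internal pricing policy"],
}
print(build_system_prompt(cafe))
```

Updating the menu is now a one-line change to the dict, with no retraining.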
Decoding Parameters by Business Type
| Parameter | Cafe / E-commerce | Clinic / Legal | Rationale |
|---|---|---|---|
| temperature | 0.3 | 0.1 | Clinic/legal require maximum consistency |
| top_p | 0.9 | 0.9 | Shared default |
| max_tokens | 2,048 | 2,048 | Accommodates thinking-model behavior |
| repetition_penalty | 1.1 | 1.1 | Prevents repetitive loops |
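These defaults can be wired into request construction per business. The mapping below is a sketch; note that repetition_penalty is an extension beyond the standard OpenAI schema, so whether the server accepts it depends on the serving engine:

```python
# Per-business decoding defaults from the table above (sketch).
RELAXED = {"temperature": 0.3, "top_p": 0.9, "max_tokens": 2048, "repetition_penalty": 1.1}
STRICT  = {"temperature": 0.1, "top_p": 0.9, "max_tokens": 2048, "repetition_penalty": 1.1}

DECODING = {
    "cafe": RELAXED, "shop": RELAXED,   # conversational tone can vary a bit
    "clinic": STRICT, "law": STRICT,    # maximum consistency for regulated domains
}

def chat_payload(company: str, system_prompt: str, user_msg: str) -> dict:
    """Build an OpenAI-style /v1/chat/completions body routed by adapter name."""
    return {
        "model": company,               # adapter name registered at serving time
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
        **DECODING[company],
    }
```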
Training Results: 6 Seconds for 5 Adapters
Training was performed on an RTX PRO 6000 (96GB VRAM). The 8B model (Qwen3-8B) produced each adapter in roughly 1.2–1.5 seconds. All 5 company adapters were trained sequentially in 6.4 seconds total. The 32B model (Qwen3-32B) took 15.9 seconds for the same 5 adapters.
Qwen3-8B Results
| Company | Time | Size |
|---|---|---|
| Cafe | 1.5s | 73.7MB |
| Clinic | 1.2s | 73.7MB |
| E-commerce | 1.2s | 73.7MB |
| Law Firm | 1.2s | 73.7MB |
| Academy | 1.3s | 73.7MB |
| Total | 6.4s | 368.5MB |
Qwen3-32B Results
| Company | Time | Size |
|---|---|---|
| Cafe | 3.2s | 167.2MB |
| Clinic | 3.1s | 167.2MB |
| E-commerce | 3.1s | 167.2MB |
| Law Firm | 3.2s | 167.2MB |
| Academy | 3.3s | 167.2MB |
| Total | 15.9s | 836MB |
LoRA Hyperparameters
from peft import LoraConfig

lora_config = LoraConfig(
r=16, # rank — higher = more expressive, larger file
lora_alpha=32, # scaling factor (typically 2× rank)
lora_dropout=0.05, # overfitting prevention
target_modules=[ # attention layers only
"q_proj", "k_proj",
"v_proj", "o_proj"
],
task_type="CAUSAL_LM",
)
# Training config
epochs=3, batch_size=2, gradient_accumulation=2
learning_rate=2e-4, bf16=True

An important detail: training runs on the full-precision model (Qwen3-8B, ~16GB), while serving uses the AWQ-quantized version (Qwen3-8B-AWQ, ~5GB). LoRA adapters trained on the full model are directly compatible with the quantized serving model — SGLang handles this compatibility automatically.
Hot-Swap Serving with SGLang
Once the adapters are trained, SGLang can load all of them simultaneously and route requests to the correct adapter based on the model parameter in the API call. No server restart is needed to switch between companies.
SGLang launch command
python -m sglang.launch_server \
--model Qwen3-8B-AWQ \
--quantization awq_marlin \
--enable-lora \
--lora-paths \
cafe=/path/to/adapters/cafe \
clinic=/path/to/adapters/clinic \
shop=/path/to/adapters/shop \
law=/path/to/adapters/law \
edu=/path/to/adapters/edu \
--context-length 4096 \
--mem-fraction-static 0.85 \
--port 30000
API calls — switch company by changing "model"
# Route to cafe adapter
curl -X POST http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "cafe", "messages": [
{"role": "system", "content": "(cafe system prompt)"},
{"role": "user", "content": "How much is an Americano?"}
]}'
# Route to clinic adapter — same server, just change "model"
curl -X POST http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "clinic", "messages": [
{"role": "system", "content": "(clinic system prompt)"},
{"role": "user", "content": "I'd like to book an appointment."}
]}'
Hot-swap verification confirmed that all 5 adapters maintain correct persona during rapid sequential switching (A→B→C→D→E→A) and under concurrent multi-company requests. VRAM overhead for 5 loaded adapters is just ~370MB for the 8B model or ~836MB for 32B — negligible compared to the base model footprint.
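A client-side sketch of that switching test, assuming the server URL and adapter names from the launch command above (this is an illustrative harness, not the project's actual test code):

```python
import json
import urllib.request

BASE = "http://localhost:30000/v1/chat/completions"
ADAPTERS = ["cafe", "clinic", "shop", "law", "edu"]

def build_body(company: str, prompt: str) -> dict:
    """Request body routed to `company`'s adapter via the model field."""
    return {"model": company,
            "messages": [{"role": "user", "content": prompt}]}

def ask(company: str, prompt: str) -> str:
    """POST one chat request to the live server and return the reply text."""
    data = json.dumps(build_body(company, prompt)).encode("utf-8")
    req = urllib.request.Request(
        BASE, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

def verify_hot_swap() -> None:
    """Cycle A→B→C→D→E→A against a running server; no restart between switches."""
    for company in ADAPTERS + [ADAPTERS[0]]:
        print(company, "→", ask(company, "Hello, who am I talking to?")[:60])
```

Calling verify_hot_swap() against a running SGLang instance prints one persona-flavored reply per adapter, confirming per-request routing.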
Conclusion: The Production Deployment Playbook
The end-to-end workflow is straightforward and repeatable for any new business:
1. Write the system prompt: structure business info, catalog, prices, operating rules, and forbidden topics.
2. Prepare 10 training pairs: Q&A pairs that demonstrate the target communication style. Consistency matters more than quantity.
3. Run LoRA training: 1–3 seconds per adapter on a single GPU. Output: a 73MB adapter file.
4. Start SGLang serving: register all adapters with --enable-lora --lora-paths. All businesses served simultaneously.
5. Integrate via API: OpenAI-compatible API; switch companies by changing the model parameter. Minimal code changes.
| Metric | 8B Model | 32B Model |
|---|---|---|
| Training (5 adapters) | 6.4s | 15.9s |
| Adapter size (each) | 73.7MB | 167.2MB |
| VRAM overhead (5 adapters) | ~370MB | ~836MB |
| Response time (20 concurrent users) | 3.5s | 10.4s |
| Error rate | 0% | 0% |
LoRA makes custom AI economically viable at a scale that was previously impossible. The adapter handles tone and personality, the system prompt handles knowledge, and SGLang handles multi-tenant serving — all on a single GPU. Adding a new business client takes minutes, not days, and the marginal cost is a 73MB file and a system prompt update.