treeru.com
AI · February 7, 2026

LoRA Fine-Tuning for Custom AI Chatbots — 10 Training Pairs, 6 Seconds, Multi-Tenant Serving

Ten conversation pairs. Six seconds of training. A 73MB adapter file. With these minimal inputs, we built custom AI chatbots for five different businesses — a cafe, a clinic, an e-commerce store, a law firm, and an academy — all served simultaneously from a single GPU using SGLang hot-swap serving.

- 10 training pairs per company
- 6.4s to train 5 adapters (8B)
- 73.7MB adapter size (8B)
- 5 concurrent businesses

What LoRA Is and Why It Changes the Economics of Custom AI

Traditional fine-tuning updates every parameter in the model. For an 8B-parameter model, that means modifying all 8 billion weights, which requires significant GPU time and VRAM and produces a full-sized model copy for each variant. If you need five company-specific chatbots, you end up with five separate 16GB models.

LoRA (Low-Rank Adaptation) takes a fundamentally different approach. The base model stays frozen. Instead, LoRA decomposes the weight update matrices into two small low-rank matrices and trains only those. With rank=16, this means training roughly 0.6% of the original parameters — about 50 million instead of 8 billion.
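The parameter savings follow directly from the factorization: a frozen weight W of shape (d_out, d_in) gets an update ΔW = BA, where B is (d_out, r) and A is (r, d_in), so only r·(d_in + d_out) values are trained per target module instead of d_in·d_out. A quick sketch of that arithmetic; the 4096-wide projections and 36-layer depth are illustrative assumptions, not the exact Qwen3-8B architecture:

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable values for one LoRA-adapted linear layer:
    A is (rank, d_in), B is (d_out, rank); the base weight stays frozen."""
    return rank * d_in + d_out * rank

# Illustrative numbers: a square 4096x4096 attention projection at rank 16.
per_module = lora_trainable_params(4096, 4096, 16)  # 131,072 values

# Four attention projections (q/k/v/o) across an assumed 36 layers.
total = per_module * 4 * 36  # 18,874,368 values under these assumptions
```

Exact totals depend on the model's real projection shapes, layer count, and which modules are targeted; the point is that the trainable count scales with r·(d_in + d_out) rather than d_in·d_out.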

| Metric | Full Fine-Tuning | LoRA |
| --- | --- | --- |
| Trained Parameters | 8B (all) | ~50M (0.6%) |
| Training Time (10 pairs) | Minutes to hours | 1.2–1.5 seconds |
| Output Size | ~16GB (full model) | 73.7MB (adapter) |
| Multi-Tenant | N separate models | 1 base + N adapters |
| VRAM for 5 Companies | ~80GB (5 models) | ~5GB (1 model + 5 adapters) |

The practical implication is that adapters are interchangeable at runtime. You can load a cafe adapter, a clinic adapter, and a law firm adapter into the same serving engine and switch between them per-request with zero server restarts. This is what makes true multi-tenant AI serving possible on a single GPU.

Training Data: Why 10 Pairs Is Enough

A common misconception is that fine-tuning requires thousands of examples. With LoRA, the goal is not injecting new knowledge into the model — it is teaching a communication style. The persona, tone, sentence structure, and response format are what the adapter learns. Actual business knowledge (menu items, prices, operating hours) is handled by the system prompt.

We created 10 Q&A pairs for each of 5 business types. The cafe adapter learned a friendly, casual tone with emoji usage. The clinic adapter learned formal, measured language with appropriate medical disclaimers. The law firm adapter learned strict formal speech with legal caveats. Ten pairs is enough because the patterns are consistent — LoRA picks up on the style signals rapidly.

Training data example — Cafe adapter (translated)

{"messages": [
  {"role": "user", "content": "How much is an Americano?"},
  {"role": "assistant", "content": "An Americano is $4.50! Same price for iced or hot~ ☕"}
]}

{"messages": [
  {"role": "user", "content": "Do you have decaf options?"},
  {"role": "assistant", "content": "Yes! Decaf Americano $5.00, Decaf Latte $5.50~ The taste is basically the same!"}
]}

Training data example — Clinic adapter (translated)

{"messages": [
  {"role": "user", "content": "I'd like to book an appointment."},
  {"role": "assistant", "content": "Thank you for contacting our clinic. Appointments can be made by phone (02-555-1234) or in person. Could you please let us know your preferred department and date?"}
]}
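Training frameworks typically expect these pairs as one JSON object per line (JSONL). A minimal sketch of assembling the cafe file in that shape; the helper name and output path are ours, not tied to any specific toolchain:

```python
import json

def write_jsonl(pairs, path):
    """Write (question, answer) tuples as chat-format JSONL records."""
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in pairs:
            record = {"messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

cafe_pairs = [
    ("How much is an Americano?",
     "An Americano is $4.50! Same price for iced or hot~ ☕"),
    ("Do you have decaf options?",
     "Yes! Decaf Americano $5.00, Decaf Latte $5.50~ The taste is basically the same!"),
]
write_jsonl(cafe_pairs, "cafe_train.jsonl")
```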

System Prompt Design: The Knowledge Layer

The LoRA adapter handles how the model speaks. The system prompt handles what it knows. This separation is critical: when a cafe updates its menu or prices, you update the system prompt — no retraining needed. The adapter remains the same because the communication style has not changed.

A well-structured system prompt follows a consistent pattern: role declaration, structured business information (address, hours, phone), product/service catalog with prices, current promotions, tone rules, and a forbidden-topics list. The forbidden-topics list is particularly important for preventing the chatbot from recommending competitors or disclosing proprietary information.
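That structure can be templated so onboarding a new business is just a data entry. A sketch with field names of our own choosing, not a fixed schema:

```python
def build_system_prompt(biz: dict) -> str:
    """Assemble a system prompt from structured business data.
    The field names here are illustrative assumptions."""
    menu = "\n".join(f"- {item}: {price}" for item, price in biz["catalog"].items())
    rules = "\n".join(f"- Never discuss: {t}" for t in biz["forbidden_topics"])
    return (
        f"You are the assistant for {biz['name']}.\n"
        f"Address: {biz['address']} | Hours: {biz['hours']} | Phone: {biz['phone']}\n"
        f"Menu and prices:\n{menu}\n"
        f"Current promotion: {biz['promotion']}\n"
        f"Tone: {biz['tone']}\n"
        f"{rules}"
    )

cafe = {
    "name": "Sunny Cafe",  # hypothetical business
    "address": "123 Main St", "hours": "8:00-20:00", "phone": "02-555-0000",
    "catalog": {"Americano": "$4.50", "Latte": "$5.00"},
    "promotion": "10% off before 10am",
    "tone": "friendly and casual, emoji welcome",
    "forbidden_topics": ["competitor cafes", "supplier contracts"],
}
prompt = build_system_prompt(cafe)
```

When the menu or prices change, only the data dictionary changes; the adapter and the template stay untouched.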

Decoding Parameters by Business Type

| Parameter | Cafe / E-commerce | Clinic / Legal | Rationale |
| --- | --- | --- | --- |
| temperature | 0.3 | 0.1 | Clinic/legal require maximum consistency |
| top_p | 0.9 | 0.9 | Shared default |
| max_tokens | 2,048 | 2,048 | Accommodates thinking-model behavior |
| repetition_penalty | 1.1 | 1.1 | Prevents repetitive loops |
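In code, these settings reduce to a per-business lookup merged into each request body. A sketch assuming an OpenAI-compatible chat completions payload (which SGLang exposes); the table and company keys mirror the adapter names used later:

```python
# Per-business decoding settings; clinic/legal run colder for consistency.
DECODING = {
    "cafe":   {"temperature": 0.3, "top_p": 0.9, "max_tokens": 2048, "repetition_penalty": 1.1},
    "shop":   {"temperature": 0.3, "top_p": 0.9, "max_tokens": 2048, "repetition_penalty": 1.1},
    "clinic": {"temperature": 0.1, "top_p": 0.9, "max_tokens": 2048, "repetition_penalty": 1.1},
    "law":    {"temperature": 0.1, "top_p": 0.9, "max_tokens": 2048, "repetition_penalty": 1.1},
}

def chat_payload(company: str, messages: list) -> dict:
    """Merge the company's decoding settings into a chat completion request.
    The 'model' field selects the LoRA adapter on the serving side."""
    return {"model": company, "messages": messages, **DECODING[company]}

payload = chat_payload("clinic", [{"role": "user", "content": "I'd like to book an appointment."}])
```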

Training Results: 6 Seconds for 5 Adapters

Training was performed on an RTX PRO 6000 (96GB VRAM). The 8B model (Qwen3-8B) produced each adapter in roughly 1.2–1.5 seconds. All 5 company adapters were trained sequentially in 6.4 seconds total. The 32B model (Qwen3-32B) took 15.9 seconds for the same 5 adapters.

Qwen3-8B Results

| Company | Time | Size |
| --- | --- | --- |
| Cafe | 1.5s | 73.7MB |
| Clinic | 1.2s | 73.7MB |
| E-commerce | 1.2s | 73.7MB |
| Law Firm | 1.2s | 73.7MB |
| Academy | 1.3s | 73.7MB |
| Total | 6.4s | 368.5MB |

Qwen3-32B Results

| Company | Time | Size |
| --- | --- | --- |
| Cafe | 3.2s | 167.2MB |
| Clinic | 3.1s | 167.2MB |
| E-commerce | 3.1s | 167.2MB |
| Law Firm | 3.2s | 167.2MB |
| Academy | 3.3s | 167.2MB |
| Total | 15.9s | 836MB |

LoRA Hyperparameters

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # rank — higher = more expressive, larger file
    lora_alpha=32,           # scaling factor (typically 2× rank)
    lora_dropout=0.05,       # overfitting prevention
    target_modules=[         # attention layers only
        "q_proj", "k_proj",
        "v_proj", "o_proj",
    ],
    task_type="CAUSAL_LM",
)

# Training config
training_config = dict(
    epochs=3, batch_size=2, gradient_accumulation=2,
    learning_rate=2e-4, bf16=True,
)

An important detail: training runs on the full-precision model (Qwen3-8B, ~16GB), while serving uses the AWQ-quantized version (Qwen3-8B-AWQ, ~5GB). LoRA adapters trained on the full model are directly compatible with the quantized serving model — SGLang handles this compatibility automatically.

Hot-Swap Serving with SGLang

Once the adapters are trained, SGLang can load all of them simultaneously and route requests to the correct adapter based on the model parameter in the API call. No server restart is needed to switch between companies.

SGLang launch command

python -m sglang.launch_server \
  --model Qwen3-8B-AWQ \
  --quantization awq_marlin \
  --enable-lora \
  --lora-paths \
    cafe=/path/to/adapters/cafe \
    clinic=/path/to/adapters/clinic \
    shop=/path/to/adapters/shop \
    law=/path/to/adapters/law \
    edu=/path/to/adapters/edu \
  --context-length 4096 \
  --mem-fraction-static 0.85 \
  --port 30000

API calls — switch company by changing "model"

# Route to cafe adapter
curl -X POST http://localhost:30000/v1/chat/completions \
  -d '{"model": "cafe", "messages": [
    {"role": "system", "content": "(cafe system prompt)"},
    {"role": "user", "content": "How much is an Americano?"}
  ]}'

# Route to clinic adapter — same server, just change "model"
curl -X POST http://localhost:30000/v1/chat/completions \
  -d '{"model": "clinic", "messages": [
    {"role": "system", "content": "(clinic system prompt)"},
    {"role": "user", "content": "I'd like to book an appointment."}
  ]}'
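The same two calls can be made from any OpenAI-compatible client. A stdlib-only Python sketch that builds the requests without sending them; the server URL matches the launch command above, and the helper name is ours:

```python
import json
import urllib.request

SERVER = "http://localhost:30000/v1/chat/completions"

def make_request(company: str, system_prompt: str, user_msg: str) -> urllib.request.Request:
    """Build the HTTP request; the 'model' field picks the LoRA adapter."""
    body = {"model": company, "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_msg},
    ]}
    return urllib.request.Request(
        SERVER, data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST",
    )

# Same server, different adapter: only the company string changes.
req_cafe = make_request("cafe", "(cafe system prompt)", "How much is an Americano?")
req_clinic = make_request("clinic", "(clinic system prompt)", "I'd like to book an appointment.")
# urllib.request.urlopen(req_cafe) would send it; omitted here.
```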

Hot-swap verification confirmed that all 5 adapters maintain correct persona during rapid sequential switching (A→B→C→D→E→A) and under concurrent multi-company requests. VRAM overhead for 5 loaded adapters is just ~370MB for the 8B model or ~836MB for 32B — negligible compared to the base model footprint.

Conclusion: The Production Deployment Playbook

The end-to-end workflow is straightforward and repeatable for any new business:

1. Write the system prompt: structure business info, catalog, prices, operating rules, and forbidden topics.

2. Prepare 10 training pairs: Q&A pairs that demonstrate the target communication style. Consistency matters more than quantity.

3. Run LoRA training: 1–3 seconds per adapter on a single GPU. Output: a 73MB adapter file.

4. Start SGLang serving: register all adapters with --enable-lora --lora-paths. All businesses served simultaneously.

5. Integrate via API: OpenAI-compatible API. Switch companies by changing the model parameter. Minimal code changes.

| Metric | 8B Model | 32B Model |
| --- | --- | --- |
| Training (5 adapters) | 6.4s | 15.9s |
| Adapter size (each) | 73.7MB | 167.2MB |
| VRAM overhead (5 adapters) | ~370MB | ~836MB |
| 20-user concurrent response | 3.5s | 10.4s |
| Error rate | 0% | 0% |

LoRA makes custom AI economically viable at a scale that was previously impossible. The adapter handles tone and personality, the system prompt handles knowledge, and SGLang handles multi-tenant serving — all on a single GPU. Adding a new business client takes minutes, not days, and the marginal cost is a 73MB file and a system prompt update.

Training was performed on RTX PRO 6000 (96GB) with SGLang v0.4. LoRA adapters trained on full-precision models are compatible with AWQ-quantized serving models. Results may vary with different base models and hardware configurations.