LoRA Fine-Tuning for Custom AI Chatbots — 10 Training Pairs, 6 Seconds, Multi-Tenant Serving
Ten conversation pairs. Six seconds of training. A 73MB adapter file. With these minimal inputs, we built custom AI chatbots for five different businesses — a cafe, a clinic, an e-commerce store, a law firm, and an academy — all served simultaneously from a single GPU using SGLang hot-swap serving.
- Training pairs per company: 10
- Time to train 5 adapters (8B): 6.4s
- Adapter size (8B): 73.7MB
- Concurrent businesses: 5
What LoRA Is and Why It Changes the Economics of Custom AI
Traditional fine-tuning updates every parameter in the model. For an 8B-parameter model, that means modifying all 8 billion weights, which demands significant GPU time and VRAM and produces a full-sized model copy for each variant. If you need five company-specific chatbots, you end up with five separate 16GB models.
LoRA (Low-Rank Adaptation) takes a fundamentally different approach. The base model stays frozen. Instead, LoRA represents each weight update as the product of two small low-rank matrices and trains only those. With rank=16, this means training roughly 0.6% of the original parameter count: about 50 million parameters instead of 8 billion.
| Metric | Full Fine-Tuning | LoRA |
|---|---|---|
| Trained Parameters | 8B (all) | ~50M (0.6%) |
| Training Time (10 pairs) | Minutes to hours | 1.2–1.5 seconds |
| Output Size | ~16GB (full model) | 73.7MB (adapter) |
| Multi-Tenant | N separate models | 1 base + N adapters |
| VRAM for 5 Companies | ~80GB (5 models) | ~5GB (1 model + 5 adapters) |
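As a rough sketch of where these trainable-parameter counts come from: each adapted projection of shape (d_in, d_out) gains an A matrix of d_in×r and a B matrix of r×d_out. The layer shapes and layer count below are illustrative placeholders, not Qwen3-8B's actual architecture:

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A (d_in x r) plus B (r x d_out)."""
    return rank * (d_in + d_out)

# Hypothetical shapes for the four attention projections of one layer.
rank = 16
shapes = [
    (4096, 4096),   # q_proj (placeholder dims)
    (4096, 1024),   # k_proj
    (4096, 1024),   # v_proj
    (4096, 4096),   # o_proj
]
per_layer = sum(lora_param_count(i, o, rank) for i, o in shapes)
total = per_layer * 36            # assuming 36 transformer layers
print(f"{total:,} trainable parameters vs. 8B frozen")
```

Plugging in a real model's shapes lands the count in the millions to tens of millions, a tiny fraction of the frozen base model.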
The practical implication is that adapters are interchangeable at runtime. You can load a cafe adapter, a clinic adapter, and a law firm adapter into the same serving engine and switch between them per-request with zero server restarts. This is what makes true multi-tenant AI serving possible on a single GPU.
Training Data: Why 10 Pairs Is Enough
A common misconception is that fine-tuning requires thousands of examples. With LoRA, the goal is not injecting new knowledge into the model — it is teaching a communication style. The persona, tone, sentence structure, and response format are what the adapter learns. Actual business knowledge (menu items, prices, operating hours) is handled by the system prompt.
We created 10 Q&A pairs for each of 5 business types. The cafe adapter learned a friendly, casual tone with emoji usage. The clinic adapter learned formal, measured language with appropriate medical disclaimers. The law firm adapter learned strict formal speech with legal caveats. Ten pairs is enough because the patterns are consistent — LoRA picks up on the style signals rapidly.
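A minimal sketch of assembling such pairs into the chat-format JSONL used for training (the pairs and file name here are illustrative):

```python
import json

# Illustrative style-teaching pairs for one business (cafe).
cafe_pairs = [
    ("How much is an Americano?",
     "An Americano is $4.50! Same price for iced or hot~ ☕"),
    ("Do you have decaf options?",
     "Yes! Decaf Americano $5.00, Decaf Latte $5.50~"),
]

def to_jsonl(pairs):
    """One chat-format JSON object per line."""
    lines = []
    for question, answer in pairs:
        record = {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)

with open("cafe_train.jsonl", "w", encoding="utf-8") as f:
    f.write(to_jsonl(cafe_pairs))
```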
Training data example — Cafe adapter (translated)
{"messages": [
{"role": "user", "content": "How much is an Americano?"},
{"role": "assistant", "content": "An Americano is $4.50! Same price for iced or hot~ ☕"}
]}
{"messages": [
{"role": "user", "content": "Do you have decaf options?"},
{"role": "assistant", "content": "Yes! Decaf Americano $5.00, Decaf Latte $5.50~ The taste is basically the same!"}
]}
Training data example — Clinic adapter (translated)
{"messages": [
{"role": "user", "content": "I'd like to book an appointment."},
{"role": "assistant", "content": "Thank you for contacting our clinic. Appointments can be made by phone (02-555-1234) or in person. Could you please let us know your preferred department and date?"}
]}
System Prompt Design: The Knowledge Layer
The LoRA adapter handles how the model speaks. The system prompt handles what it knows. This separation is critical: when a cafe updates its menu or prices, you update the system prompt — no retraining needed. The adapter remains the same because the communication style has not changed.
A well-structured system prompt follows a consistent pattern: role declaration, structured business information (address, hours, phone), product/service catalog with prices, current promotions, tone rules, and a forbidden-topics list. The forbidden-topics list is particularly important for preventing the chatbot from recommending competitors or disclosing proprietary information.
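That pattern can be expressed as a simple template function; the schema and every business detail below are placeholders, not taken from the deployed prompts:

```python
def build_system_prompt(biz: dict) -> str:
    """Assemble a system prompt from structured business info (hypothetical schema)."""
    lines = [
        f"You are the customer-service assistant for {biz['name']}.",  # role declaration
        f"Address: {biz['address']} | Hours: {biz['hours']} | Phone: {biz['phone']}",
        "Menu and prices:",
        *[f"- {item}: {price}" for item, price in biz["catalog"].items()],
        f"Current promotions: {biz['promotions']}",
        f"Tone: {biz['tone']}",
        "Never discuss: " + ", ".join(biz["forbidden"]),  # forbidden-topics list
    ]
    return "\n".join(lines)

cafe = {
    "name": "Example Cafe",
    "address": "123 Main St", "hours": "08:00-20:00", "phone": "02-555-0000",
    "catalog": {"Americano": "$4.50", "Latte": "$5.00"},
    "promotions": "10% off before 10am",
    "tone": "friendly and casual, light emoji use",
    "forbidden": ["competitor recommendations", "internal pricing policy"],
}
print(build_system_prompt(cafe))
```

Updating the menu is now a one-line change to the dict, with no retraining.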
Decoding Parameters by Business Type
| Parameter | Cafe / E-commerce | Clinic / Legal | Rationale |
|---|---|---|---|
| temperature | 0.3 | 0.1 | Clinic/legal require maximum consistency |
| top_p | 0.9 | 0.9 | Shared default |
| max_tokens | 2,048 | 2,048 | Accommodates thinking-model behavior |
| repetition_penalty | 1.1 | 1.1 | Prevents repetitive loops |
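These defaults can be wired into request construction per business. The mapping below is a sketch; note that repetition_penalty is an extension beyond the standard OpenAI schema, so whether the server accepts it depends on the serving engine:

```python
# Per-business decoding defaults from the table above (sketch).
RELAXED = {"temperature": 0.3, "top_p": 0.9, "max_tokens": 2048, "repetition_penalty": 1.1}
STRICT  = {"temperature": 0.1, "top_p": 0.9, "max_tokens": 2048, "repetition_penalty": 1.1}

DECODING = {
    "cafe": RELAXED, "shop": RELAXED,   # conversational tone can vary a bit
    "clinic": STRICT, "law": STRICT,    # maximum consistency for regulated domains
}

def chat_payload(company: str, system_prompt: str, user_msg: str) -> dict:
    """Build an OpenAI-style /v1/chat/completions body routed by adapter name."""
    return {
        "model": company,               # adapter name registered at serving time
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
        **DECODING[company],
    }
```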
Training Results: 6 Seconds for 5 Adapters
Training was performed on an RTX PRO 6000 (96GB VRAM). The 8B model (Qwen3-8B) produced each adapter in roughly 1.2–1.5 seconds. All 5 company adapters were trained sequentially in 6.4 seconds total. The 32B model (Qwen3-32B) took 15.9 seconds for the same 5 adapters.
Qwen3-8B Results
| Company | Time | Size |
|---|---|---|
| Cafe | 1.5s | 73.7MB |
| Clinic | 1.2s | 73.7MB |
| E-commerce | 1.2s | 73.7MB |
| Law Firm | 1.2s | 73.7MB |
| Academy | 1.3s | 73.7MB |
| Total | 6.4s | 368.5MB |
Qwen3-32B Results
| Company | Time | Size |
|---|---|---|
| Cafe | 3.2s | 167.2MB |
| Clinic | 3.1s | 167.2MB |
| E-commerce | 3.1s | 167.2MB |
| Law Firm | 3.2s | 167.2MB |
| Academy | 3.3s | 167.2MB |
| Total | 15.9s | 836MB |
LoRA Hyperparameters
from peft import LoraConfig

lora_config = LoraConfig(
r=16, # rank — higher = more expressive, larger file
lora_alpha=32, # scaling factor (typically 2× rank)
lora_dropout=0.05, # overfitting prevention
target_modules=[ # attention layers only
"q_proj", "k_proj",
"v_proj", "o_proj"
],
task_type="CAUSAL_LM",
)
# Training config
epochs=3, batch_size=2, gradient_accumulation=2
learning_rate=2e-4, bf16=True

An important detail: training runs on the full-precision model (Qwen3-8B, ~16GB), while serving uses the AWQ-quantized version (Qwen3-8B-AWQ, ~5GB). LoRA adapters trained on the full model are directly compatible with the quantized serving model — SGLang handles this compatibility automatically.
Hot-Swap Serving with SGLang
Once the adapters are trained, SGLang can load all of them simultaneously and route requests to the correct adapter based on the model parameter in the API call. No server restart is needed to switch between companies.
SGLang launch command
python -m sglang.launch_server \
--model Qwen3-8B-AWQ \
--quantization awq_marlin \
--enable-lora \
--lora-paths \
cafe=/path/to/adapters/cafe \
clinic=/path/to/adapters/clinic \
shop=/path/to/adapters/shop \
law=/path/to/adapters/law \
edu=/path/to/adapters/edu \
--context-length 4096 \
--mem-fraction-static 0.85 \
--port 30000
API calls — switch company by changing "model"
# Route to cafe adapter
curl -X POST http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "cafe", "messages": [
{"role": "system", "content": "(cafe system prompt)"},
{"role": "user", "content": "How much is an Americano?"}
]}'
# Route to clinic adapter — same server, just change "model"
curl -X POST http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "clinic", "messages": [
{"role": "system", "content": "(clinic system prompt)"},
{"role": "user", "content": "I'd like to book an appointment."}
]}'
Hot-swap verification confirmed that all 5 adapters maintain correct persona during rapid sequential switching (A→B→C→D→E→A) and under concurrent multi-company requests. VRAM overhead for 5 loaded adapters is just ~370MB for the 8B model or ~836MB for 32B — negligible compared to the base model footprint.
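A client-side sketch of that switching test, assuming the server URL and adapter names from the launch command above (this is an illustrative harness, not the project's actual test code):

```python
import json
import urllib.request

BASE = "http://localhost:30000/v1/chat/completions"
ADAPTERS = ["cafe", "clinic", "shop", "law", "edu"]

def build_body(company: str, prompt: str) -> dict:
    """Request body routed to `company`'s adapter via the model field."""
    return {"model": company,
            "messages": [{"role": "user", "content": prompt}]}

def ask(company: str, prompt: str) -> str:
    """POST one chat request to the live server and return the reply text."""
    data = json.dumps(build_body(company, prompt)).encode("utf-8")
    req = urllib.request.Request(
        BASE, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

def verify_hot_swap() -> None:
    """Cycle A→B→C→D→E→A against a running server; no restart between switches."""
    for company in ADAPTERS + [ADAPTERS[0]]:
        print(company, "→", ask(company, "Hello, who am I talking to?")[:60])
```

Calling verify_hot_swap() against a running SGLang instance prints one persona-flavored reply per adapter, confirming per-request routing.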
Conclusion: The Production Deployment Playbook
The end-to-end workflow is straightforward and repeatable for any new business:
1. Write the system prompt: structure business info, catalog, prices, operating rules, and forbidden topics.
2. Prepare 10 training pairs: Q&A pairs that demonstrate the target communication style. Consistency matters more than quantity.
3. Run LoRA training: 1–3 seconds per adapter on a single GPU. Output: a 73MB adapter file.
4. Start SGLang serving: register all adapters with --enable-lora --lora-paths. All businesses served simultaneously.
5. Integrate via API: OpenAI-compatible API; switch companies by changing the model parameter. Minimal code changes.
| Metric | 8B Model | 32B Model |
|---|---|---|
| Training (5 adapters) | 6.4s | 15.9s |
| Adapter size (each) | 73.7MB | 167.2MB |
| VRAM overhead (5 adapters) | ~370MB | ~836MB |
| Response time (20 concurrent users) | 3.5s | 10.4s |
| Error rate | 0% | 0% |
LoRA makes custom AI economically viable at a scale that was previously impossible. The adapter handles tone and personality, the system prompt handles knowledge, and SGLang handles multi-tenant serving — all on a single GPU. Adding a new business client takes minutes, not days, and the marginal cost is a 73MB file and a system prompt update.