SGLang 23-Model Serving Guide — Optimal Configuration for Every Model
"Swapping the model path is not enough." When you serve LLMs with SGLang, each model demands its own quantization method, attention backend, context length, and VRAM budget. One wrong flag can trigger an OOM crash or silent inference failure. We loaded 23 models onto SGLang in production and documented the exact configuration that works for each — plus the troubleshooting playbook we built along the way.
Why Every Model Needs Different Settings
SGLang launches models via `python -m sglang.launch_server`, but a single command template cannot serve all 23 models. Four variables change from model to model:
- Quantization — AWQ (INT4), BF16 (16-bit), and FP8 (8-bit) differ by 2–4x in VRAM usage and throughput. AWQ is auto-detected; FP8 requires an explicit `--quantization fp8` flag.
- Attention backend — MoE (Mixture of Experts) models must use `flashinfer`. Running them on the `triton` backend causes silent garbage output or outright crashes.
- VRAM footprint — ranges from ~1.5 GB (Qwen3-0.6B BF16) to ~64 GB (Qwen3-32B BF16). Picking a model that exceeds your GPU's capacity guarantees an OOM on the first request.
- Context length — 8K to 1M tokens. Longer contexts consume proportionally more KV-cache memory on top of the model weights.
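Put together, these four variables surface as launch flags. A minimal sketch — the flag names are real `sglang.launch_server` options, while the model path and port are example values taken from the table in this guide:

```shell
# Sketch only: values are examples, pick yours from the configuration table.
# --quantization is omitted because AWQ checkpoints are auto-detected.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-14B-AWQ \
  --attention-backend flashinfer \
  --context-length 32768 \
  --mem-fraction-static 0.85 \
  --port 30141
```

Swapping any one of these flags without checking the other three is exactly how the OOM and garbled-output failures described later happen.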
Complete 23-Model Configuration Table
Every model below was successfully served in production. Port numbers follow a deterministic rule: 30000 + (parameter count in billions × 10) + quantization offset (BF16 = 0, AWQ = 1, FP8 = 2). A few entries deviate where the rule would collide or produce a fractional port (e.g. Gemma-3-4B takes 30050 because Qwen3-4B holds 30040, and sub-1B Qwen3 models are assigned 30010/30011), but the convention covers most of the fleet and eliminates the need to memorize port assignments.
| Model | Quant | VRAM | Backend | CTX | Port | Notes |
|---|---|---|---|---|---|---|
| Qwen3-0.6B | BF16 | ~1.5 GB | flashinfer | 32K | 30010 | Testing / embedding |
| Qwen3-1.7B | BF16 | ~3.5 GB | flashinfer | 32K | 30011 | Lightweight routing |
| Qwen3-4B | BF16 | ~8 GB | flashinfer | 32K | 30040 | |
| Qwen3-4B-AWQ | AWQ | ~2.5 GB | flashinfer | 32K | 30041 | VRAM saver |
| Qwen3-8B | BF16 | ~16 GB | flashinfer | 32K | 30080 | |
| Qwen3-8B-AWQ | AWQ | ~5 GB | flashinfer | 32K | 30081 | Fits 16 GB GPU |
| Qwen3-14B | BF16 | ~28 GB | flashinfer | 32K | 30140 | |
| Qwen3-14B-AWQ | AWQ | ~8 GB | flashinfer | 32K | 30141 | Recommended |
| Qwen3-14B-FP8 | FP8 | ~15 GB | flashinfer | 32K | 30142 | Higher quality than AWQ |
| Qwen3-32B | BF16 | ~64 GB | flashinfer | 32K | 30320 | 96 GB GPU only |
| Qwen3-32B-AWQ | AWQ | ~18 GB | flashinfer | 32K | 30321 | Top quality |
| Qwen3-30B-A3B (MoE) | BF16 | ~60 GB | flashinfer | 32K | 30300 | MoE — fast inference |
| GLM-4-9B-Chat-1M (MoE) | BF16 | ~18 GB | flashinfer | 1M | 30900 | MoE — long context |
| Gemma-3-4B-it | BF16 | ~8 GB | flashinfer | 32K | 30050 | |
| Gemma-3-12B-it-AWQ | AWQ | ~7 GB | flashinfer | 32K | 30121 | |
| Phi-4 | BF16 | ~28 GB | flashinfer | 16K | 30200 | |
| Phi-4-AWQ | AWQ | ~8 GB | flashinfer | 16K | 30201 | Poor SGLang optimization |
| EXAONE-3.5-7.8B | BF16 | ~16 GB | flashinfer | 32K | 30070 | Korean specialist |
| EXAONE-3.5-32B | BF16 | ~64 GB | flashinfer | 32K | 30330 | |
| EXAONE-3.5-32B-AWQ | AWQ | ~18 GB | flashinfer | 32K | 30331 | |
| Llama-3.1-8B-AWQ | AWQ | ~5 GB | flashinfer | 8K | 30085 | English only |
| KORMo-10B-sft | BF16 | ~20 GB | flashinfer | 8K | 30100 | Korean domain |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | ~64 GB | flashinfer | 32K | 30340 | Reasoning specialist |
Port management example: Qwen3-14B-AWQ = 30000 + (14 × 10) + 1 = 30141. During hot-swap, the same port is reused so clients need zero configuration changes.
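The port rule is simple enough to encode as a helper. A sketch — the function name is ours, not part of SGLang, and it implements only the base rule, not the collision exceptions noted above:

```shell
# compute_port <params_in_B> <quant>   (quant: bf16 | awq | fp8)
# Implements: 30000 + (params * 10) + offset (BF16=0, AWQ=1, FP8=2)
compute_port() {
  local params=$1 quant=$2 offset
  case $quant in
    bf16) offset=0 ;;
    awq)  offset=1 ;;
    fp8)  offset=2 ;;
    *)    echo "unknown quant: $quant" >&2; return 1 ;;
  esac
  echo $(( 30000 + params * 10 + offset ))
}

compute_port 14 awq   # Qwen3-14B-AWQ -> 30141
```

Using the same function in both the serving script and the client configuration keeps the two from drifting apart.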
Quantization Guide — AWQ vs BF16 vs FP8
Choosing the right quantization is the single highest-leverage decision in model serving. Here is how the three options compare:
| Metric | AWQ (INT4) | FP8 | BF16 |
|---|---|---|---|
| VRAM usage | Minimum (1/4) | Medium (1/2) | Maximum (baseline) |
| Inference speed | Fastest | Fast | Baseline |
| Quality loss | Negligible | Near zero | None (original) |
| SGLang support | Auto-detected | --quantization fp8 | Default |
| Recommended for | VRAM ≤ 24 GB | VRAM 24–48 GB | VRAM ≥ 48 GB |
AWQ (INT4) — Speed First
AWQ is the only viable option when VRAM is 16 GB or less. It compresses a 14B model to just 8 GB, delivers up to 2.94x speed improvement on 32B models, and the quality difference is imperceptible in Korean or English. SGLang auto-detects AWQ weights — no extra flags needed.
FP8 — The Balanced Choice
FP8 halves VRAM compared to BF16 while preserving slightly higher fidelity than AWQ. It requires the --quantization fp8 flag and is best suited for GPUs in the 24–48 GB range. Not all models ship with FP8 checkpoints, so availability must be verified first.
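Unlike AWQ, FP8 is not auto-detected, so the flag must be passed explicitly. A hedged example — the model path is an illustration, and you should confirm the FP8 checkpoint exists before launching:

```shell
# FP8 is the one quantization that needs an explicit flag.
# Model path is an example; verify the FP8 checkpoint is published first.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-14B-FP8 \
  --quantization fp8 \
  --attention-backend flashinfer \
  --port 30142
```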
BF16 — Maximum Quality
BF16 serves the model at its original precision with zero quality loss. Use it only when VRAM is abundant (48 GB+) or when you need a benchmark baseline. On a 96 GB GPU, even 32B BF16 (64 GB) fits — but switching to AWQ (18 GB) frees 46 GB for longer contexts and higher concurrency.
Production rule of thumb: even when VRAM is plentiful, prefer AWQ. The freed memory translates directly into more concurrent requests or longer context windows. Reserve BF16 for quality benchmarking only.
MoE Models Require Special Configuration
Mixture of Experts (MoE) models activate only a fraction of their total parameters per token, which makes them faster than their parameter count suggests. However, they must use the flashinfer attention backend. Running a MoE model on triton produces either silent garbage output or an outright crash — with no clear error message.
| MoE Model | Total Params | Active Params | Required Backend | Notes |
|---|---|---|---|---|
| Qwen3-30B-A3B | 30B | 3B | flashinfer | VRAM of 30B, speed of 8B |
| GLM-4-9B-Chat-1M | 9B | ~1.2B | flashinfer | VRAM spikes at 1M context |
Dense models (Qwen3-8B, Gemma-3-12B, etc.) work on both backends, but flashinfer is generally faster. Universal rule: set --attention-backend flashinfer for every model to eliminate the MoE vs Dense distinction entirely.
VRAM Management and Model Sizing
The model's stated VRAM is not the whole story. KV-cache — intermediate data generated during inference — adds memory on top of the model weights. Use `--mem-fraction-static 0.85` to cap the static allocation (model weights plus the KV-cache pool) at 85% of VRAM, leaving the remaining 15% for activations and other runtime overhead.
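As a rough sanity check, the VRAM left for KV cache is approximately GPU VRAM × mem fraction − model weights. A sketch of that arithmetic — the helper is our own, and the result is an estimate, not something SGLang reports:

```shell
# kv_budget <gpu_gb> <mem_fraction> <weights_gb>
# Rough estimate of VRAM left for the KV-cache pool; actual overhead
# varies with batch size, context length, and model architecture.
kv_budget() {
  awk -v g="$1" -v f="$2" -v w="$3" 'BEGIN { printf "%.1f\n", g * f - w }'
}

kv_budget 24 0.85 8   # 24 GB GPU, 14B-AWQ weights -> ~12.4 GB for KV cache
```

A result near zero (or negative) means the model will load but OOM on the first real request — lower the context length or pick a smaller quantization.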
| GPU VRAM | Servable Models (AWQ) | Servable Models (BF16) |
|---|---|---|
| 8 GB | Up to 4B, 8B models | 4B and below only |
| 12 GB | Up to 14B AWQ (8 GB) | 4B comfortable |
| 16 GB | 14B AWQ comfortable | Up to 8B models |
| 24 GB | 32B AWQ (18 GB) possible | Up to 14B models |
| 48 GB | All AWQ models + headroom | 32B impossible, 14B comfortable |
| 96 GB | All models + high concurrency | 32B BF16 (64 GB) possible |
Quick viability formula: Model VRAM × 1.3 < GPU VRAM → servable with 30% headroom. Example: Qwen3-14B-AWQ (8 GB) × 1.3 = 10.4 GB → fits a 12 GB GPU. Qwen3-32B-AWQ (18 GB) × 1.3 = 23.4 GB → tight but possible on a 24 GB GPU.
Concurrent requests multiply KV-cache allocations. A model that runs fine for single users may OOM at 10 concurrent sessions. For high-traffic services, maintain at least 30% VRAM headroom beyond model weights.
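The viability formula above is easy to automate. A sketch — the function name is ours; it uses integer math (×13 vs ×10) so it needs no external calculator:

```shell
# fits <model_vram_gb> <gpu_vram_gb>
# Applies the guide's rule: model VRAM x 1.3 must fit within GPU VRAM.
fits() {
  if [ $(( $1 * 13 )) -lt $(( $2 * 10 )) ]; then
    echo "servable"
  else
    echo "too tight"
  fi
}

fits 8 12    # Qwen3-14B-AWQ (8 GB) on a 12 GB GPU -> servable
fits 18 24   # Qwen3-32B-AWQ (18 GB) on a 24 GB GPU -> servable (barely)
```

Dropping a check like this into the serving script turns a runtime OOM into an up-front error message.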
Production Troubleshooting Playbook
These are the four most common issues encountered while serving 23 models, along with their fixes:
OOM (Out of Memory) Crash
Symptom: CUDA OOM on the first request after model loading.
- Lower `--mem-fraction-static` from 0.85 to 0.75
- Reduce `--context-length` from 32K to 16K or 8K
- Switch to a smaller quantization (BF16 → AWQ)
- Check `nvidia-smi` for other processes consuming VRAM
MoE Model Garbled Output
Symptom: Qwen3-30B-A3B produces broken or looping responses.
- Verify `--attention-backend flashinfer` (`triton` causes this)
- Confirm SGLang version ≥ 0.5.8 (older versions lack MoE support)
- Validate the flashinfer installation with `pip show flashinfer`
Port Conflict — Serving Fails to Start
Symptom: A zombie process from a previous serving session holds the port.
- Identify the occupying process: `lsof -i :PORT`
- Kill the zombie: `kill -9 PID`
- Add a pre-launch kill step to your serving script
Hot-Swap — Zero-Downtime Model Replacement
Goal: Replace a running model without service interruption.
- Launch the new model on a temporary port (e.g., 30141 → 30142)
- Run a health check, then switch the load-balancer upstream to the new port
- Terminate the old model process — zero downtime achieved
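The three steps can be sketched as a script. Everything environment-specific here is an assumption: the `/health` endpoint is what current SGLang servers expose, but the load-balancer switch is left as a placeholder because it depends entirely on your proxy:

```shell
#!/bin/bash
# hotswap.sh — sketch of zero-downtime model replacement.
# Assumptions: SGLang's /health endpoint, ports per this guide's convention.
OLD_PORT=30141
TMP_PORT=30142
NEW_MODEL=$1

# 1. Launch the replacement model on a temporary port.
python -m sglang.launch_server --model-path "$NEW_MODEL" --port "$TMP_PORT" &

# 2. Poll the health endpoint until the new server answers.
until curl -sf "http://localhost:$TMP_PORT/health" >/dev/null; do
  sleep 5
done

# 3. Switch traffic (placeholder — nginx/haproxy config change), then
#    terminate the old process. Clients keep talking to the balancer.
echo "TODO: point the load-balancer upstream at port $TMP_PORT"
lsof -ti :"$OLD_PORT" | xargs -r kill -9
```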
Known Model-Specific Issues
| Model | Issue | Mitigation |
|---|---|---|
| Phi-4-AWQ | Poor SGLang optimization, slow | Replace with Qwen3-14B-AWQ |
| GLM-4-9B-Chat-1M | VRAM explosion at 1M context | Cap context at 32K–128K |
| KORMo-10B-sft | No AWQ available, BF16 only | Requires 20 GB VRAM minimum |
| Llama-3.1-8B-AWQ | Very poor Korean quality | Do not use for Korean services |
| DeepSeek-R1-Distill-Qwen-32B | BF16 64 GB — 96 GB GPU only | Wait for AWQ release or try FP8 |
Automation Script and Core Principles
Managing 23 models manually is impractical. We use a serving automation script that determines quantization, backend, port, and context length from the model name alone:
```bash
#!/bin/bash
# serve.sh — auto-apply optimal settings from the model name
MODEL=$1
PORT=$2
BACKEND="flashinfer"
MEM_FRAC="0.85"
CTX="32768"
EXTRA_ARGS=""

case $MODEL in
  *"FP8"*)   EXTRA_ARGS="--quantization fp8" ;;
  *"1M"*)    CTX="131072" ;;  # cap at 128K
  *"Phi"*)   CTX="16384" ;;
  *"Llama"*) CTX="8192" ;;
esac

# Clear any zombie process holding the port before launch
lsof -ti :"$PORT" | xargs -r kill -9 2>/dev/null
sleep 2

python -m sglang.launch_server \
  --model-path "$MODEL" \
  --port "$PORT" \
  --attention-backend "$BACKEND" \
  --context-length "$CTX" \
  --mem-fraction-static "$MEM_FRAC" \
  $EXTRA_ARGS  # unquoted on purpose so the extra flags word-split
```

The script uses flashinfer as the universal default, eliminating MoE/Dense branching. Model-path keywords (FP8, 1M, Phi, Llama) trigger special overrides automatically.
Five Core Principles
- Unify on flashinfer — works for both MoE and Dense models, removing a common source of configuration errors.
- Default to AWQ — even when VRAM is abundant, AWQ frees memory for longer contexts and higher concurrency.
- Deterministic port numbering — 30000 + (params × 10) + quant offset. No memorization needed.
- 30% VRAM headroom — model weight × 1.3 must fit within GPU VRAM to handle concurrent users safely.
- Hot-swap by default — launch the new model first, verify, then switch. Zero-downtime model replacement becomes routine.
These five rules were distilled from serving 23 models across two GPUs (RTX Pro 6000 96 GB and RTX 5060 Ti 16 GB) with SGLang 0.5.8.post1 and CUDA 12.8. They apply universally regardless of which specific models you deploy — the principles transfer to any SGLang serving environment. As MoE support improves with each SGLang release, check the latest release notes, but the flashinfer-first strategy will remain the safest default.