
SGLang 23-Model Serving Guide — Optimal Configuration for Every Model

"Swapping the model path is not enough." When you serve LLMs with SGLang, each model demands its own quantization method, attention backend, context length, and VRAM budget. One wrong flag can trigger an OOM crash or silent inference failure. We loaded 23 models onto SGLang in production and documented the exact configuration that works for each — plus the troubleshooting playbook we built along the way.

Why Every Model Needs Different Settings

SGLang launches models via python -m sglang.launch_server, but a single command template cannot serve all 23 models. Four variables change from model to model:

  • Quantization — AWQ (INT4), BF16 (16-bit), and FP8 (8-bit) differ by 2–4x in VRAM usage and throughput. AWQ is auto-detected; FP8 requires an explicit --quantization fp8 flag.
  • Attention backend — MoE (Mixture of Experts) models must use flashinfer. Running them on the triton backend causes silent garbage output or outright crashes.
  • VRAM footprint — ranges from 1.5 GB (Qwen3-0.6B BF16) to 64 GB (Qwen3-32B BF16). Picking a model that exceeds your GPU's capacity guarantees an OOM on the first request.
  • Context length — 8K to 1M tokens. Longer contexts consume proportionally more KV-cache memory on top of the model weights.

Complete 23-Model Configuration Table

Every model below was successfully served in production. Port numbers follow a deterministic rule: 30000 + (parameter count × 10) + quantization offset (BF16 = 0, AWQ = 1, FP8 = 2). This convention eliminates the need to memorize port assignments.

| Model | Quant | VRAM | Backend | CTX | Port | Notes |
|---|---|---|---|---|---|---|
| Qwen3-0.6B | BF16 | ~1.5 GB | flashinfer | 32K | 30010 | Testing / embedding |
| Qwen3-1.7B | BF16 | ~3.5 GB | flashinfer | 32K | 30011 | Lightweight routing |
| Qwen3-4B | BF16 | ~8 GB | flashinfer | 32K | 30040 | |
| Qwen3-4B-AWQ | AWQ | ~2.5 GB | flashinfer | 32K | 30041 | VRAM saver |
| Qwen3-8B | BF16 | ~16 GB | flashinfer | 32K | 30080 | |
| Qwen3-8B-AWQ | AWQ | ~5 GB | flashinfer | 32K | 30081 | Fits 16 GB GPU |
| Qwen3-14B | BF16 | ~28 GB | flashinfer | 32K | 30140 | |
| Qwen3-14B-AWQ | AWQ | ~8 GB | flashinfer | 32K | 30141 | Recommended |
| Qwen3-14B-FP8 | FP8 | ~15 GB | flashinfer | 32K | 30142 | Higher quality than AWQ |
| Qwen3-32B | BF16 | ~64 GB | flashinfer | 32K | 30320 | 96 GB GPU only |
| Qwen3-32B-AWQ | AWQ | ~18 GB | flashinfer | 32K | 30321 | Top quality |
| Qwen3-30B-A3B (MoE) | BF16 | ~60 GB | flashinfer | 32K | 30300 | MoE — fast inference |
| GLM-4-9B-Chat-1M (MoE) | BF16 | ~18 GB | flashinfer | 1M | 30900 | MoE — long context |
| Gemma-3-4B-it | BF16 | ~8 GB | flashinfer | 32K | 30050 | |
| Gemma-3-12B-it-AWQ | AWQ | ~7 GB | flashinfer | 32K | 30121 | |
| Phi-4 | BF16 | ~28 GB | flashinfer | 16K | 30200 | |
| Phi-4-AWQ | AWQ | ~8 GB | flashinfer | 16K | 30201 | Poor SGLang optimization |
| EXAONE-3.5-7.8B | BF16 | ~16 GB | flashinfer | 32K | 30070 | Korean specialist |
| EXAONE-3.5-32B | BF16 | ~64 GB | flashinfer | 32K | 30330 | |
| EXAONE-3.5-32B-AWQ | AWQ | ~18 GB | flashinfer | 32K | 30331 | |
| Llama-3.1-8B-AWQ | AWQ | ~5 GB | flashinfer | 8K | 30085 | English only |
| KORMo-10B-sft | BF16 | ~20 GB | flashinfer | 8K | 30100 | Korean domain |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | ~64 GB | flashinfer | 32K | 30340 | Reasoning specialist |

Port management example: Qwen3-14B-AWQ = 30000 + (14 × 10) + 1 = 30141. During hot-swap, the same port is reused so clients need zero configuration changes.
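The port rule can be captured in a small helper. This is a sketch: `model_port` is a hypothetical function name, and it assumes integer parameter counts, so the sub-1B models with their hand-assigned ports are out of scope.

```shell
# Port rule from the guide: 30000 + (params * 10) + quant offset
# (BF16 = 0, AWQ = 1, FP8 = 2). `model_port` is a hypothetical helper.
model_port() {
  local params=$1 quant=$2 offset=0
  case "$quant" in
    AWQ) offset=1 ;;
    FP8) offset=2 ;;
  esac
  echo $((30000 + params * 10 + offset))
}
```

For example, `model_port 14 AWQ` reproduces the 30141 assignment worked out above.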

Quantization Guide — AWQ vs BF16 vs FP8

Choosing the right quantization is the single highest-leverage decision in model serving. Here is how the three options compare:

| Metric | AWQ (INT4) | FP8 | BF16 |
|---|---|---|---|
| VRAM usage | Minimum (1/4) | Medium (1/2) | Maximum (baseline) |
| Inference speed | Fastest | Fast | Baseline |
| Quality loss | Negligible | Near zero | None (original) |
| SGLang support | Auto-detected | --quantization fp8 | Default |
| Recommended for | VRAM ≤ 24 GB | VRAM 24–48 GB | VRAM ≥ 48 GB |

AWQ (INT4) — Speed First

AWQ is the only viable option when VRAM is 16 GB or less. It compresses a 14B model to just 8 GB, delivers up to 2.94x speed improvement on 32B models, and the quality difference is imperceptible in Korean or English. SGLang auto-detects AWQ weights — no extra flags needed.

FP8 — The Balanced Choice

FP8 halves VRAM compared to BF16 while preserving slightly higher fidelity than AWQ. It requires the --quantization fp8 flag and is best suited for GPUs in the 24–48 GB range. Not all models ship with FP8 checkpoints, so availability must be verified first.

BF16 — Maximum Quality

BF16 serves the model at its original precision with zero quality loss. Use it only when VRAM is abundant (48 GB+) or when you need a benchmark baseline. On a 96 GB GPU, even 32B BF16 (64 GB) fits — but switching to AWQ (18 GB) frees 46 GB for longer contexts and higher concurrency.

Production rule of thumb: even when VRAM is plentiful, prefer AWQ. The freed memory translates directly into more concurrent requests or longer context windows. Reserve BF16 for quality benchmarking only.

MoE Models Require Special Configuration

Mixture of Experts (MoE) models activate only a fraction of their total parameters per token, which makes them faster than their parameter count suggests. However, they must use the flashinfer attention backend. Running a MoE model on triton produces either silent garbage output or an outright crash — with no clear error message.

| MoE Model | Total Params | Active Params | Required Backend | Notes |
|---|---|---|---|---|
| Qwen3-30B-A3B | 30B | 3B | flashinfer | VRAM of 30B, speed of 8B |
| GLM-4-9B-Chat-1M | 9B | ~1.2B | flashinfer | VRAM spikes at 1M context |

Dense models (Qwen3-8B, Gemma-3-12B, etc.) work on both backends, but flashinfer is generally faster. Universal rule: set --attention-backend flashinfer for every model to eliminate the MoE vs Dense distinction entirely.

VRAM Management and Model Sizing

The model's stated VRAM is not the whole story. The KV-cache (intermediate attention state generated during inference) grows with context length and concurrency, on top of the model weights. Use --mem-fraction-static 0.85 to let SGLang reserve 85% of VRAM for its static pool of model weights plus KV-cache, leaving 15% for activations, CUDA graphs, and anything else running on the GPU.

| GPU VRAM | Servable Models (AWQ) | Servable Models (BF16) |
|---|---|---|
| 8 GB | Up to 4B, 8B models | 4B and below only |
| 12 GB | Up to 14B AWQ (8 GB) | 4B comfortable |
| 16 GB | 14B AWQ comfortable | Up to 8B models |
| 24 GB | 32B AWQ (18 GB) possible | Up to 14B models |
| 48 GB | All AWQ models + headroom | 32B impossible, 14B comfortable |
| 96 GB | All models + high concurrency | 32B BF16 (64 GB) possible |

Quick viability formula: Model VRAM × 1.3 < GPU VRAM → servable with 30% headroom. Example: Qwen3-14B-AWQ (8 GB) × 1.3 = 10.4 GB → fits a 12 GB GPU. Qwen3-32B-AWQ (18 GB) × 1.3 = 23.4 GB → tight but possible on a 24 GB GPU.
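The viability formula translates into a one-line shell check. This is a sketch: `fits_gpu` is a hypothetical helper, and it uses integer arithmetic in tenths of a GB to avoid floating point in the shell.

```shell
# 30% headroom check from the guide: model VRAM * 1.3 < GPU VRAM.
# Integer math: compare model_gb * 13 against gpu_gb * 10 (tenths of a GB).
# `fits_gpu` is a hypothetical helper; returns success if the model fits.
fits_gpu() {
  local model_gb=$1 gpu_gb=$2
  [ $((model_gb * 13)) -lt $((gpu_gb * 10)) ]
}
```

`fits_gpu 8 12` succeeds (the 14B-AWQ-on-12 GB case above), while `fits_gpu 18 16` fails, matching the table.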

Concurrent requests multiply KV-cache allocations. A model that runs fine for a single user may OOM at 10 concurrent sessions. For high-traffic services, maintain at least 30% VRAM headroom beyond model weights.
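To see why concurrency multiplies KV-cache cost, a back-of-envelope estimator helps. The per-token formula below (2 tensors, K and V, times layers times KV heads times head dimension times bytes per value) is the standard BF16 KV-cache estimate; the layer and head counts you pass in are illustrative, not taken from any specific model card.

```shell
# Back-of-envelope KV-cache size in MB for BF16 (2 bytes per value):
# bytes = 2 (K and V) * layers * kv_heads * head_dim * 2 * tokens * sessions
# All arguments are illustrative; check the model's config for real values.
kv_cache_mb() {
  local layers=$1 kv_heads=$2 head_dim=$3 tokens=$4 sessions=$5
  echo $(( 2 * layers * kv_heads * head_dim * 2 * tokens * sessions / 1048576 ))
}
```

With 40 layers, 8 KV heads, and head dimension 128 (plausible mid-size GQA numbers), a single 32K-token session already costs about 5 GB, so ten concurrent sessions dwarf an 8 GB weight footprint.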

Production Troubleshooting Playbook

These are the four most common issues encountered while serving 23 models, along with their fixes:

OOM (Out of Memory) Crash

Symptom: CUDA OOM on the first request after model loading.

  • Lower mem-fraction-static from 0.85 to 0.75
  • Reduce context-length from 32K to 16K or 8K
  • Switch to a smaller quantization (BF16 → AWQ)
  • Check nvidia-smi for other processes consuming VRAM

MoE Model Garbled Output

Symptom: Qwen3-30B-A3B produces broken or looping responses.

  • Verify --attention-backend flashinfer (triton causes this)
  • Confirm SGLang version ≥ 0.5.8 (older versions lack MoE support)
  • Validate flashinfer installation with pip show flashinfer

Port Conflict — Serving Fails to Start

Symptom: A zombie process from a previous serving session holds the port.

  • Identify the occupying process: lsof -i :PORT
  • Kill the zombie: kill -9 PID
  • Add a pre-launch kill step to your serving script

Hot-Swap — Zero-Downtime Model Replacement

Goal: Replace a running model without service interruption.

  • Launch the new model on a temporary port (e.g., 30141 → 30142)
  • Run a health check, then switch the load-balancer upstream to the new port
  • Terminate the old model process — zero downtime achieved
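The three steps above can be sketched as a function. Assumptions worth flagging: SGLang does expose a /health endpoint, but the nginx upstream file path is a hypothetical example, and `hot_swap` is an invented helper name; adapt both to your load balancer.

```shell
#!/bin/bash
# Sketch of the zero-downtime swap described above. The nginx upstream
# file path is a hypothetical example; `hot_swap` is an invented name.
hot_swap() {
  local model=$1 old_port=$2 new_port=$3
  # 1. Launch the replacement on the temporary port
  python -m sglang.launch_server --model-path "$model" --port "$new_port" &
  # 2. Poll SGLang's /health endpoint until the new server is ready
  until curl -sf "http://localhost:$new_port/health" >/dev/null; do sleep 5; done
  # 3. Switch the load-balancer upstream to the new port (nginx example)
  sed -i "s/:$old_port/:$new_port/" /etc/nginx/conf.d/llm_upstream.conf
  nginx -s reload
  # 4. Retire the old model process once traffic has switched
  lsof -ti ":$old_port" | xargs -r kill
}
```

A call such as `hot_swap Qwen/Qwen3-14B-AWQ 30141 30142` mirrors the 30141 → 30142 example above.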

Known Model-Specific Issues

| Model | Issue | Mitigation |
|---|---|---|
| Phi-4-AWQ | Poor SGLang optimization, slow | Replace with Qwen3-14B-AWQ |
| GLM-4-9B-Chat-1M | VRAM explosion at 1M context | Cap context at 32K–128K |
| KORMo-10B-sft | No AWQ available, BF16 only | Requires 20 GB VRAM minimum |
| Llama-3.1-8B-AWQ | Very poor Korean quality | Do not use for Korean services |
| DeepSeek-R1-Distill-Qwen-32B | BF16 64 GB — 96 GB GPU only | Wait for AWQ release or try FP8 |

Automation Script and Core Principles

Managing 23 models manually is impractical. We use a serving automation script that determines quantization, backend, port, and context length from the model name alone:

```bash
#!/bin/bash
# serve.sh — auto-apply optimal settings from model name

MODEL=$1
PORT=$2

BACKEND="flashinfer"
MEM_FRAC="0.85"
CTX="32768"
EXTRA_ARGS=""

case "$MODEL" in
  *"FP8"*)   EXTRA_ARGS="--quantization fp8" ;;
  *"1M"*)    CTX="131072" ;;   # cap at 128K
  *"Phi"*)   CTX="16384" ;;
  *"Llama"*) CTX="8192" ;;
esac

# Free the port from any zombie process before launching
lsof -ti ":$PORT" | xargs -r kill -9 2>/dev/null
sleep 2

python -m sglang.launch_server \
  --model-path "$MODEL" \
  --port "$PORT" \
  --attention-backend "$BACKEND" \
  --context-length "$CTX" \
  --mem-fraction-static "$MEM_FRAC" \
  $EXTRA_ARGS   # intentionally unquoted so multiple flags word-split
```

The script uses flashinfer as the universal default, eliminating MoE/Dense branching. Model-path keywords (FP8, 1M, Phi, Llama) trigger special overrides automatically.

Five Core Principles

  • Unify on flashinfer — works for both MoE and Dense models, removing a common source of configuration errors.
  • Default to AWQ — even when VRAM is abundant, AWQ frees memory for longer contexts and higher concurrency.
  • Deterministic port numbering — 30000 + (params × 10) + quant offset. No memorization needed.
  • 30% VRAM headroom — model weight × 1.3 must fit within GPU VRAM to handle concurrent users safely.
  • Hot-swap by default — launch the new model first, verify, then switch. Zero-downtime model replacement becomes routine.

These five rules were distilled from serving 23 models across two GPUs (RTX Pro 6000 96 GB and RTX 5060 Ti 16 GB) with SGLang 0.5.8.post1 and CUDA 12.8. They apply universally regardless of which specific models you deploy — the principles transfer to any SGLang serving environment. As MoE support improves with each SGLang release, check the latest release notes, but the flashinfer-first strategy will remain the safest default.