
SGLang 23-Model Serving Guide — Optimal Configuration for Every Model

"Swapping the model path is not enough." When you serve LLMs with SGLang, each model demands its own quantization method, attention backend, context length, and VRAM budget. One wrong flag can trigger an OOM crash or silent inference failure. We loaded 23 models onto SGLang in production and documented the exact configuration that works for each — plus the troubleshooting playbook we built along the way.

Why Every Model Needs Different Settings

SGLang launches models via python -m sglang.launch_server, but a single command template cannot serve all 23 models. Four variables change from model to model:

  • Quantization — AWQ (INT4), BF16 (16-bit), and FP8 (8-bit) differ by 2–4x in VRAM usage and throughput. AWQ is auto-detected; FP8 requires an explicit --quantization fp8 flag.
  • Attention backend — MoE (Mixture of Experts) models must use flashinfer. Running them on the triton backend causes silent garbage output or outright crashes.
  • VRAM footprint — ranges from 1.5 GB (Qwen3-0.6B BF16) to 64 GB (Qwen3-32B BF16). Picking a model that exceeds your GPU's capacity guarantees an OOM on the first request.
  • Context length — 8K to 1M tokens. Longer contexts consume proportionally more KV-cache memory on top of the model weights.

Complete 23-Model Configuration Table

Every model below was successfully served in production. Port numbers follow a deterministic rule: 30000 + (parameter count × 10) + quantization offset (BF16 = 0, AWQ = 1, FP8 = 2). This convention eliminates the need to memorize port assignments.

| Model | Quant | VRAM | Backend | CTX | Port | Notes |
|---|---|---|---|---|---|---|
| Qwen3-0.6B | BF16 | ~1.5 GB | flashinfer | 32K | 30010 | Testing / embedding |
| Qwen3-1.7B | BF16 | ~3.5 GB | flashinfer | 32K | 30011 | Lightweight routing |
| Qwen3-4B | BF16 | ~8 GB | flashinfer | 32K | 30040 | |
| Qwen3-4B-AWQ | AWQ | ~2.5 GB | flashinfer | 32K | 30041 | VRAM saver |
| Qwen3-8B | BF16 | ~16 GB | flashinfer | 32K | 30080 | |
| Qwen3-8B-AWQ | AWQ | ~5 GB | flashinfer | 32K | 30081 | Fits 16 GB GPU |
| Qwen3-14B | BF16 | ~28 GB | flashinfer | 32K | 30140 | |
| Qwen3-14B-AWQ | AWQ | ~8 GB | flashinfer | 32K | 30141 | Recommended |
| Qwen3-14B-FP8 | FP8 | ~15 GB | flashinfer | 32K | 30142 | Higher quality than AWQ |
| Qwen3-32B | BF16 | ~64 GB | flashinfer | 32K | 30320 | 96 GB GPU only |
| Qwen3-32B-AWQ | AWQ | ~18 GB | flashinfer | 32K | 30321 | Top quality |
| Qwen3-30B-A3B (MoE) | BF16 | ~60 GB | flashinfer | 32K | 30300 | MoE — fast inference |
| GLM-4-9B-Chat-1M (MoE) | BF16 | ~18 GB | flashinfer | 1M | 30900 | MoE — long context |
| Gemma-3-4B-it | BF16 | ~8 GB | flashinfer | 32K | 30050 | |
| Gemma-3-12B-it-AWQ | AWQ | ~7 GB | flashinfer | 32K | 30121 | |
| Phi-4 | BF16 | ~28 GB | flashinfer | 16K | 30200 | |
| Phi-4-AWQ | AWQ | ~8 GB | flashinfer | 16K | 30201 | Poor SGLang optimization |
| EXAONE-3.5-7.8B | BF16 | ~16 GB | flashinfer | 32K | 30070 | Korean specialist |
| EXAONE-3.5-32B | BF16 | ~64 GB | flashinfer | 32K | 30330 | |
| EXAONE-3.5-32B-AWQ | AWQ | ~18 GB | flashinfer | 32K | 30331 | |
| Llama-3.1-8B-AWQ | AWQ | ~5 GB | flashinfer | 8K | 30085 | English only |
| KORMo-10B-sft | BF16 | ~20 GB | flashinfer | 8K | 30100 | Korean domain |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | ~64 GB | flashinfer | 32K | 30340 | Reasoning specialist |

Port management example: Qwen3-14B-AWQ = 30000 + (14 × 10) + 1 = 30141. During hot-swap, the same port is reused so clients need zero configuration changes.
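The port rule can be captured in a small helper. This is a sketch: `model_port` is a hypothetical function name, and it assumes integer parameter counts, so the sub-1B models with their hand-assigned ports are out of scope.

```shell
# Port rule from the guide: 30000 + (params * 10) + quant offset
# (BF16 = 0, AWQ = 1, FP8 = 2). `model_port` is a hypothetical helper.
model_port() {
  local params=$1 quant=$2 offset=0
  case "$quant" in
    AWQ) offset=1 ;;
    FP8) offset=2 ;;
  esac
  echo $((30000 + params * 10 + offset))
}
```

For example, `model_port 14 AWQ` reproduces the 30141 assignment worked out above.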

Quantization Guide — AWQ vs BF16 vs FP8

Choosing the right quantization is the single highest-leverage decision in model serving. Here is how the three options compare:

| Metric | AWQ (INT4) | FP8 | BF16 |
|---|---|---|---|
| VRAM usage | Minimum (1/4) | Medium (1/2) | Maximum (baseline) |
| Inference speed | Fastest | Fast | Baseline |
| Quality loss | Negligible | Near zero | None (original) |
| SGLang support | Auto-detected | --quantization fp8 | Default |
| Recommended for | VRAM ≤ 24 GB | VRAM 24–48 GB | VRAM ≥ 48 GB |

AWQ (INT4) — Speed First

AWQ is the only viable option when VRAM is 16 GB or less. It compresses a 14B model to just 8 GB, delivers up to 2.94x speed improvement on 32B models, and the quality difference is imperceptible in Korean or English. SGLang auto-detects AWQ weights — no extra flags needed.

FP8 — The Balanced Choice

FP8 halves VRAM compared to BF16 while preserving slightly higher fidelity than AWQ. It requires the --quantization fp8 flag and is best suited for GPUs in the 24–48 GB range. Not all models ship with FP8 checkpoints, so availability must be verified first.

BF16 — Maximum Quality

BF16 serves the model at its original precision with zero quality loss. Use it only when VRAM is abundant (48 GB+) or when you need a benchmark baseline. On a 96 GB GPU, even 32B BF16 (64 GB) fits — but switching to AWQ (18 GB) frees 46 GB for longer contexts and higher concurrency.

Production rule of thumb: even when VRAM is plentiful, prefer AWQ. The freed memory translates directly into more concurrent requests or longer context windows. Reserve BF16 for quality benchmarking only.

MoE Models Require Special Configuration

Mixture of Experts (MoE) models activate only a fraction of their total parameters per token, which makes them faster than their parameter count suggests. However, they must use the flashinfer attention backend. Running a MoE model on triton produces either silent garbage output or an outright crash — with no clear error message.

| MoE Model | Total Params | Active Params | Required Backend | Notes |
|---|---|---|---|---|
| Qwen3-30B-A3B | 30B | 3B | flashinfer | VRAM of 30B, speed of 8B |
| GLM-4-9B-Chat-1M | 9B | ~1.2B | flashinfer | VRAM spikes at 1M context |

Dense models (Qwen3-8B, Gemma-3-12B, etc.) work on both backends, but flashinfer is generally faster. Universal rule: set --attention-backend flashinfer for every model to eliminate the MoE vs Dense distinction entirely.

VRAM Management and Model Sizing

The model's stated VRAM is not the whole story. The KV-cache (intermediate attention state generated during inference) grows with context length and concurrency, on top of the model weights. Use --mem-fraction-static 0.85 to let SGLang reserve 85% of VRAM for its static pool of model weights plus KV-cache, leaving 15% for activations, CUDA graphs, and anything else running on the GPU.

| GPU VRAM | Servable Models (AWQ) | Servable Models (BF16) |
|---|---|---|
| 8 GB | Up to 4B, 8B models | 4B and below only |
| 12 GB | Up to 14B AWQ (8 GB) | 4B comfortable |
| 16 GB | 14B AWQ comfortable | Up to 8B models |
| 24 GB | 32B AWQ (18 GB) possible | Up to 14B models |
| 48 GB | All AWQ models + headroom | 32B impossible, 14B comfortable |
| 96 GB | All models + high concurrency | 32B BF16 (64 GB) possible |

Quick viability formula: Model VRAM × 1.3 < GPU VRAM → servable with 30% headroom. Example: Qwen3-14B-AWQ (8 GB) × 1.3 = 10.4 GB → fits a 12 GB GPU. Qwen3-32B-AWQ (18 GB) × 1.3 = 23.4 GB → tight but possible on a 24 GB GPU.
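The viability formula translates into a one-line shell check. This is a sketch: `fits_gpu` is a hypothetical helper, and it uses integer arithmetic in tenths of a GB to avoid floating point in the shell.

```shell
# 30% headroom check from the guide: model VRAM * 1.3 < GPU VRAM.
# Integer math: compare model_gb * 13 against gpu_gb * 10 (tenths of a GB).
# `fits_gpu` is a hypothetical helper; returns success if the model fits.
fits_gpu() {
  local model_gb=$1 gpu_gb=$2
  [ $((model_gb * 13)) -lt $((gpu_gb * 10)) ]
}
```

`fits_gpu 8 12` succeeds (the 14B-AWQ-on-12 GB case above), while `fits_gpu 18 16` fails, matching the table.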

Concurrent requests multiply KV-cache allocations. A model that runs fine for a single user may OOM at 10 concurrent sessions. For high-traffic services, maintain at least 30% VRAM headroom beyond model weights.
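To see why concurrency multiplies KV-cache cost, a back-of-envelope estimator helps. The per-token formula below (2 tensors, K and V, times layers times KV heads times head dimension times bytes per value) is the standard BF16 KV-cache estimate; the layer and head counts you pass in are illustrative, not taken from any specific model card.

```shell
# Back-of-envelope KV-cache size in MB for BF16 (2 bytes per value):
# bytes = 2 (K and V) * layers * kv_heads * head_dim * 2 * tokens * sessions
# All arguments are illustrative; check the model's config for real values.
kv_cache_mb() {
  local layers=$1 kv_heads=$2 head_dim=$3 tokens=$4 sessions=$5
  echo $(( 2 * layers * kv_heads * head_dim * 2 * tokens * sessions / 1048576 ))
}
```

With 40 layers, 8 KV heads, and head dimension 128 (plausible mid-size GQA numbers), a single 32K-token session already costs about 5 GB, so ten concurrent sessions dwarf an 8 GB weight footprint.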

Production Troubleshooting Playbook

These are the four most common issues encountered while serving 23 models, along with their fixes:

OOM (Out of Memory) Crash

Symptom: CUDA OOM on the first request after model loading.

  • Lower mem-fraction-static from 0.85 to 0.75
  • Reduce context-length from 32K to 16K or 8K
  • Switch to a smaller quantization (BF16 → AWQ)
  • Check nvidia-smi for other processes consuming VRAM

MoE Model Garbled Output

Symptom: Qwen3-30B-A3B produces broken or looping responses.

  • Verify --attention-backend flashinfer (triton causes this)
  • Confirm SGLang version ≥ 0.5.8 (older versions lack MoE support)
  • Validate flashinfer installation with pip show flashinfer

Port Conflict — Serving Fails to Start

Symptom: A zombie process from a previous serving session holds the port.

  • Identify the occupying process: lsof -i :PORT
  • Kill the zombie: kill -9 PID
  • Add a pre-launch kill step to your serving script

Hot-Swap — Zero-Downtime Model Replacement

Goal: Replace a running model without service interruption.

  • Launch the new model on a temporary port (e.g., 30141 → 30142)
  • Run a health check, then switch the load-balancer upstream to the new port
  • Terminate the old model process — zero downtime achieved
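The three steps above can be sketched as a function. Assumptions worth flagging: SGLang does expose a /health endpoint, but the nginx upstream file path is a hypothetical example, and `hot_swap` is an invented helper name; adapt both to your load balancer.

```shell
#!/bin/bash
# Sketch of the zero-downtime swap described above. The nginx upstream
# file path is a hypothetical example; `hot_swap` is an invented name.
hot_swap() {
  local model=$1 old_port=$2 new_port=$3
  # 1. Launch the replacement on the temporary port
  python -m sglang.launch_server --model-path "$model" --port "$new_port" &
  # 2. Poll SGLang's /health endpoint until the new server is ready
  until curl -sf "http://localhost:$new_port/health" >/dev/null; do sleep 5; done
  # 3. Switch the load-balancer upstream to the new port (nginx example)
  sed -i "s/:$old_port/:$new_port/" /etc/nginx/conf.d/llm_upstream.conf
  nginx -s reload
  # 4. Retire the old model process once traffic has switched
  lsof -ti ":$old_port" | xargs -r kill
}
```

A call such as `hot_swap Qwen/Qwen3-14B-AWQ 30141 30142` mirrors the 30141 → 30142 example above.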

Known Model-Specific Issues

| Model | Issue | Mitigation |
|---|---|---|
| Phi-4-AWQ | Poor SGLang optimization, slow | Replace with Qwen3-14B-AWQ |
| GLM-4-9B-Chat-1M | VRAM explosion at 1M context | Cap context at 32K–128K |
| KORMo-10B-sft | No AWQ available, BF16 only | Requires 20 GB VRAM minimum |
| Llama-3.1-8B-AWQ | Very poor Korean quality | Do not use for Korean services |
| DeepSeek-R1-Distill-Qwen-32B | BF16 64 GB — 96 GB GPU only | Wait for AWQ release or try FP8 |

Automation Script and Core Principles

Managing 23 models manually is impractical. We use a serving automation script that determines quantization, backend, port, and context length from the model name alone:

```bash
#!/bin/bash
# serve.sh — auto-apply optimal settings from model name

MODEL=$1
PORT=$2

BACKEND="flashinfer"
MEM_FRAC="0.85"
CTX="32768"
EXTRA_ARGS=""

case "$MODEL" in
  *"FP8"*)   EXTRA_ARGS="--quantization fp8" ;;
  *"1M"*)    CTX="131072" ;;   # cap at 128K
  *"Phi"*)   CTX="16384" ;;
  *"Llama"*) CTX="8192" ;;
esac

# Free the port from any zombie process before launching
lsof -ti ":$PORT" | xargs -r kill -9 2>/dev/null
sleep 2

python -m sglang.launch_server \
  --model-path "$MODEL" \
  --port "$PORT" \
  --attention-backend "$BACKEND" \
  --context-length "$CTX" \
  --mem-fraction-static "$MEM_FRAC" \
  $EXTRA_ARGS   # intentionally unquoted so multiple flags word-split
```

The script uses flashinfer as the universal default, eliminating MoE/Dense branching. Model-path keywords (FP8, 1M, Phi, Llama) trigger special overrides automatically.

Five Core Principles

  • Unify on flashinfer — works for both MoE and Dense models, removing a common source of configuration errors.
  • Default to AWQ — even when VRAM is abundant, AWQ frees memory for longer contexts and higher concurrency.
  • Deterministic port numbering — 30000 + (params × 10) + quant offset. No memorization needed.
  • 30% VRAM headroom — model weight × 1.3 must fit within GPU VRAM to handle concurrent users safely.
  • Hot-swap by default — launch the new model first, verify, then switch. Zero-downtime model replacement becomes routine.

These five rules were distilled from serving 23 models across two GPUs (RTX Pro 6000 96 GB and RTX 5060 Ti 16 GB) with SGLang 0.5.8.post1 and CUDA 12.8. They apply universally regardless of which specific models you deploy — the principles transfer to any SGLang serving environment. As MoE support improves with each SGLang release, check the latest release notes, but the flashinfer-first strategy will remain the safest default.