treeru.com
AI · March 4, 2026

MoE vs Dense: Why Qwen3-30B-A3B Is Slower Than 14B — and Why KTransformers Hybrid Inference Failed

Mixture of Experts promises 30B-level knowledge at 3B compute cost. In practice, Qwen3-30B-A3B uses 2.1x more VRAM and runs 1.9x slower than the 14B Dense model on the same GPU. When we tried KTransformers for CPU+GPU hybrid inference, the output was complete garbage. We traced the root cause by inspecting all 18,432 expert weights one by one.

  • 2.1x: VRAM gap (30B vs 14B)
  • 1.9x: speed gap (14B wins)
  • 18,432: weights inspected
  • 0.78: error-layer cosine similarity

What MoE Actually Means for Local Deployment

Mixture of Experts (MoE) is an architecture where each transformer layer contains multiple independent "expert" sub-networks. A router selects a small subset of experts for each input token. Qwen3-30B-A3B has 30 billion total parameters but activates only 3 billion per token — spread across 8 of its 128 experts per layer, across 48 layers.
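The routing step can be sketched in a few lines of NumPy. This is an illustrative toy, not Qwen3's actual implementation: the expert count (128) and top-k (8) match the article, but `HIDDEN`, the random parameters, and the single-matrix experts are placeholder assumptions (real Qwen3 experts are three-matrix FFNs).

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, HIDDEN = 128, 8, 64  # 128 experts, 8 active (per the article)

# Placeholder parameters: one router matrix plus one square matrix per expert.
router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.1
expert_w = rng.standard_normal((NUM_EXPERTS, HIDDEN, HIDDEN)) * 0.05

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w                          # score all 128 experts
    top = np.argsort(logits, axis=-1)[:, -TOP_K:]  # keep only the top 8
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                               # softmax over the chosen 8
        for wi, e in zip(w, top[t]):
            out[t] += wi * (x[t] @ expert_w[e])    # only 8/128 experts execute
    return out

x = rng.standard_normal((4, HIDDEN))
print(moe_forward(x).shape)  # (4, 64)
```

Note that all 128 `expert_w` matrices must already be in memory before the router can pick 8 of them, which is exactly the memory-residency catch described next.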

The theoretical appeal is obvious: 30B worth of learned knowledge with only 3B of compute per token. But there is a critical catch that MoE proponents often gloss over: all 30 billion parameters must reside in memory, because the router needs access to every expert to decide which ones to activate. Even though only 3B parameters fire per token, the other 27B sit idle in VRAM, and the full model still occupies 56.9GB.
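The 56.9GB figure is close to what a back-of-the-envelope weight count predicts. A quick check of the arithmetic (weights only; the measured numbers also include the CUDA context and runtime buffers):

```python
# BF16 stores each parameter in 2 bytes; VRAM tools report GiB (1024^3 bytes).
GIB = 1024 ** 3
for name, params in [("Qwen3-30B-A3B", 30e9), ("Qwen3-14B", 14e9)]:
    print(f"{name}: {params * 2 / GIB:.1f} GiB for weights alone")
# Qwen3-30B-A3B: 55.9 GiB for weights alone
# Qwen3-14B: 26.1 GiB for weights alone
```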

Architecture Comparison

MoE: Qwen3-30B-A3B

  • Total Parameters: 30B
  • Active Parameters: 3B per token
  • Layers: 48
  • Experts per Layer: 128
  • Active Experts: 8 / 128

Dense: Qwen3-14B

  • Total Parameters: 14B
  • Active Parameters: 14B (all)
  • Layers: 40
  • Experts per Layer: None (single FFN)
  • Active Experts: N/A

Benchmark: VRAM, Speed, and KV Cache

Both models were loaded in BF16 full precision on the same RTX PRO 6000 (96GB GDDR7) and tested with identical prompts.

| Metric | 30B-A3B (MoE) | 14B (Dense) | Difference |
|---|---|---|---|
| VRAM Usage | 56.9GB | 27.5GB | MoE 2.1x more |
| Generation Speed | 22.0 tok/s | 41.4 tok/s | Dense 1.9x faster |
| KV Cache Headroom | 39.1GB | 68.5GB | Dense 1.8x more |
| Est. Max Concurrent Users | ~8 | ~20 | Dense 2.5x more |
| Active Parameters | 3B | 14B | Dense 4.7x more compute |

The 14B Dense model wins every metric. The MoE model's fundamental problem is memory bandwidth: the GPU must read 30B parameters from VRAM per forward pass but only computes with 3B of them. The 14B Dense model reads 14B and uses all 14B — a much higher compute-to-memory ratio.
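A rough roofline estimate shows both measured speeds sitting under a bandwidth ceiling. This sketch assumes single-stream decoding streams every resident parameter per token, and takes ~1.79 TB/s as the card's memory bandwidth (an assumption; substitute your GPU's spec):

```python
# tok/s ceiling ≈ memory_bandwidth / bytes_streamed_per_token (BF16 = 2 B/param)
BW = 1.792e12  # assumed GDDR7 bandwidth, bytes/s
for name, params in [("30B MoE", 30e9), ("14B Dense", 14e9)]:
    print(f"{name}: <= {BW / (params * 2):.0f} tok/s")
# 30B MoE: <= 30 tok/s
# 14B Dense: <= 64 tok/s
```

The measured 22.0 and 41.4 tok/s both land below their ceilings, consistent with bandwidth-bound decoding. The MoE ceiling uses all 30B parameters; with batched serving that is close to reality, since different tokens in a batch activate different experts.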

With AWQ quantization applied, the gap becomes even more extreme: 14B-AWQ runs at 135 tok/s in just 9.4GB VRAM. Compared to the 30B MoE in BF16 (56.9GB, 22 tok/s), that is 6x less VRAM and 6x faster.
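For the record, the ratios behind that 6x claim, computed from the measured numbers:

```python
# BF16 30B MoE (56.9GB, 22 tok/s) vs AWQ 14B Dense (9.4GB, 135 tok/s).
print(f"VRAM:  {56.9 / 9.4:.1f}x smaller")  # 6.1x
print(f"Speed: {135 / 22.0:.1f}x faster")   # 6.1x
```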

KTransformers Hybrid Inference: Complete Failure

To address the MoE VRAM problem, we tried KTransformers — a framework that keeps expert weights in CPU RAM and streams only the activated experts to GPU on demand. In theory, this lets you run a 30B MoE model on a 16GB GPU.

We tested three GGUF quantization formats: Q4_K_M, Q6_K, and Q8_0. All three produced completely meaningless output — broken tokens, repeated special characters, and context-free word salads. The same GGUF files loaded through llama-cpp-python produced perfectly normal output, confirming the issue was not in the model files themselves.

Debugging: Inspecting All 18,432 Expert Weights

An MoE model with 48 layers, 128 experts per layer, and 3 projection matrices per expert (gate_proj, up_proj, down_proj) has 48 × 128 × 3 = 18,432 individual weight matrices. We loaded both the original BF16 weights and the GGUF-quantized weights, dequantized the GGUF values back to BF16 (matching KTransformers' internal process), and computed the cosine similarity between each original and dequantized weight matrix.
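The sweep itself is simple: flatten each weight pair and take the cosine of the angle between them. The sketch below uses a naive symmetric per-tensor 4-bit quantizer as a stand-in for the GGUF round-trip (real Q4_K uses per-block scales), plus a synthetic outlier-heavy tensor to show how a bad value distribution alone can drag similarity far below the healthy band:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fake_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    # Naive stand-in for a GGUF quantize/dequantize round-trip.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w_good = rng.standard_normal((64, 64)).astype(np.float32)
good = cosine_sim(w_good, fake_quantize(w_good))

w_bad = w_good.copy()
w_bad[0, 0] = 50.0  # one extreme outlier inflates the scale for everything
bad = cosine_sim(w_bad, fake_quantize(w_bad))

print(f"well-behaved tensor: {good:.3f}")  # typically near the 0.98 band
print(f"outlier tensor:      {bad:.3f}")   # collapses well below it
```

With the inflated scale, nearly all of the outlier tensor's values round to zero, which is the kind of grid mismatch the root-cause analysis below describes.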

Most weights showed cosine similarity above 0.98 — acceptable quantization quality. But three weights fell significantly below threshold:

| Layer | Location | Cosine Sim | Impact |
|---|---|---|---|
| Layer 2 | Expert 92, down_proj | 0.78 | Origin of hidden state distortion |
| Layer 5 | Expert 41, gate_proj | 0.91 | Router misrouting |
| Layer 12 | Expert 103, up_proj | 0.93 | Error accumulation accelerator |
| Remaining | 18,429 weights | ≥ 0.98 | Normal range |

The critical finding: a single weight matrix at Layer 2, Expert 92 (cosine similarity 0.78) was enough to corrupt the entire model output. The distortion introduced at layer 2 propagated through the remaining 46 layers, accumulating exponentially during the forward pass until the final output was complete nonsense.

Root Cause: The GGUF + BF16 + MoE Triple Threat

The failure chain has four links:

  1. GGUF Quantization Error: Certain expert weights have value distributions that don't align well with the quantization grid, producing larger-than-average rounding errors.
  2. BF16 Dequantization: KTransformers converts GGUF values back to BF16 (7 effective mantissa bits) for GPU computation. This limited precision amplifies the quantization errors that were already present.
  3. MoE Expert Isolation: In a Dense model, quantization errors are distributed across the entire weight matrix and tend to cancel out. In MoE, each expert is an isolated sub-network — an error in one expert cannot be compensated by another.
  4. Forward Pass Accumulation: A small error at Layer 2 grows exponentially through 48 layers of computation, eventually producing completely incoherent output.
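Link 4 can be illustrated with a one-variable toy model. The per-layer gain `g = 1.3` is an arbitrary assumption (real amplification varies by layer and input); the point is only the shape of the growth: any sustained gain above 1.0 turns a small mid-network perturbation into a signal-sized error long before layer 48.

```python
# Toy model: a relative error injected at layer 2 is amplified by a
# constant (assumed, illustrative) factor g at every subsequent layer.
g = 1.3
err = 0.05  # 5% relative hidden-state error at layer 2
layer = 2
while err < 1.0 and layer < 48:
    layer += 1
    err *= g
print(f"error reaches signal magnitude at layer {layer}")  # layer 14
```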

Why llama-cpp-python Works Fine

llama-cpp-python uses inline dequantization with FP32 accumulation for matrix multiplication. FP32 has 23 effective mantissa bits — over 3x the precision of BF16's 7 bits. The same quantization errors exist in the weights, but the higher numerical precision during computation prevents them from amplifying through the forward pass. The trade-off is speed: FP32 accumulation is significantly slower than BF16 matmul on GPU tensor cores.
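The precision gap is easy to demonstrate with a toy dot product. The sketch emulates BF16 by truncating the low 16 bits of a float32 accumulator (real hardware rounds to nearest, but the effect is the same order) and compares both accumulation modes against a float64 reference:

```python
import numpy as np

def bf16_trunc(x) -> np.float32:
    # Emulate BF16 storage: keep float32's sign, exponent, and 7 mantissa bits.
    v = np.array([x], dtype=np.float32).view(np.uint32)
    return (v & np.uint32(0xFFFF0000)).view(np.float32)[0]

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float32)
b = rng.standard_normal(4096).astype(np.float32)

acc32 = np.float32(0.0)  # FP32 accumulation (the llama.cpp-style path)
acc16 = np.float32(0.0)  # BF16 accumulation (the failing hybrid path)
for ai, bi in zip(a, b):
    p = ai * bi                    # products computed in float32 either way
    acc32 = np.float32(acc32 + p)
    acc16 = bf16_trunc(acc16 + p)

ref = float(a.astype(np.float64) @ b.astype(np.float64))
print(f"fp32 accumulation error: {abs(float(acc32) - ref):.5f}")
print(f"bf16 accumulation error: {abs(float(acc16) - ref):.5f}")
```

The same inputs, the same rounding in the weights: only the accumulator precision differs, and the BF16 error comes out orders of magnitude larger.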

The root cause is not GGUF quantization alone — it is the specific combination of GGUF quantization + BF16 dequantization + MoE expert isolation. Dense models with the same GGUF + BF16 combination work fine because errors distribute across the entire weight matrix rather than being trapped in isolated expert pathways.

Conclusion: Dense Wins for Local Deployment

The verdict is clear for single-GPU local deployments in 2026:

  • If you have enough VRAM (24GB+): Use Dense + AWQ. The 14B-AWQ model at 9.4GB VRAM and 135 tok/s outperforms the 30B MoE on every metric.
  • If VRAM is limited (16GB): Use a smaller Dense model (8B-AWQ at 5.2GB). It is faster and more stable than any MoE hybrid inference setup.
  • If you need maximum quality: Dense 14B-AWQ scores 3.86/5 overall (#1 in our benchmark) — higher quality and faster speed than the 30B MoE.

MoE architectures shine in cloud environments with hundreds of GPUs for distributed inference. On a single local GPU, the structural overhead — massive VRAM footprint, memory bandwidth bottleneck, and quantization fragility — makes MoE strictly worse than a well-optimized Dense model.

This conclusion is based on current quantization technology and inference frameworks. If KTransformers adds FP32 dequantization support or MoE-specific quantization formats emerge, the picture could change. Until then, Dense + AWQ remains the optimal choice for local LLM deployment.