
Local AI Video Generation with ComfyUI + Wan 2.2 — An RTX PRO 6000 96GB Build Log

2026-07-02
Treeru

Video generation APIs are convenient, but per-second billing adds up fast and commercial-use restrictions become a constraint when you produce content continuously. So we built a fully local video generation pipeline on a single workstation with an RTX PRO 6000 (96GB VRAM), running three open-source models whose licenses allow commercial use: Wan 2.2 for photorealism, LTX 2.3 for 4K and native audio, and MOVA for lip sync. This is the full build log: model selection, storage design, ComfyUI setup, and one-line shell automation.

GPU VRAM (RTX PRO 6000): 96GB
Models: 3 (347GB total)
Per 480p draft clip: ~195s
API cost per clip: $0

Why Local Video Generation

The goal was simple: ask a conversational AI agent (Claude Code) to “make this video,” and have a local GPU produce it with open-source models. Three advantages over cloud APIs drove the decision. Marginal cost is effectively zero, so you can regenerate a draft 100 times without worrying about a bill. Output from the Apache 2.0 licensed models can be used in products and client work without restrictions. And an HTTP API server integrates into scripts, agents, and batch jobs.

The architecture: ComfyUI runs as a persistent API server on localhost, three models sit behind it for different jobs, and FFmpeg handles post-processing.

Model Selection: Licensing Is Half the Battle

Benchmark scores were only half the evaluation; the other half was licensing. Several strong candidates were eliminated on license terms alone: HunyuanVideo 1.5 has regional restrictions that prohibit commercial use in South Korea, and CogVideoX 5B caps usage at 1 million monthly visitors. CogVideoX 2B carries no such restriction but falls short on quality (VBench 79.6).

Aspect | Wan 2.2 (14B) | LTX 2.3 (19B) | MOVA (32B MoE)
Photorealism | ★★★★★★★★★★★★★
Motion quality | ★★★★½ | ★★★★★★★★
Max resolution | 1080p | 4K | 720p
Audio generation |
Lip sync |
License | Apache 2.0 | Open Weights | Apache 2.0

The final trio covers non-overlapping roles. Wan 2.2 (Alibaba, Apache 2.0) is the general-purpose workhorse: best-in-class open-source photorealism (VBench 82.8), particularly strong on faces, skin, and hair, and fully compatible with existing Wan 2.1 LoRAs. LTX 2.3 (Lightricks) is the only open-source model with 4K/50fps output, and its dual-stream architecture generates video and audio in a single pass; its Open Weights license is free for use below $10M in annual revenue. MOVA (OpenMOSS, Apache 2.0) generates video and speech simultaneously with multilingual lip sync, including Korean.

Hardware and Storage Strategy

The workstation: RTX PRO 6000 Blackwell (96GB VRAM), Ryzen 9 9950X3D, 96GB DDR5. The real design work was storage, not GPU. A video model set takes far more disk space than a typical local LLM deployment: the three models total 347GB. The per-model figures (Wan 2.2: 125GB, LTX 2.3: 141GB, MOVA: 73GB) account for 339GB of that.

Model weights, the ComfyUI engine, and caches live on an Intel Optane NVMe; generated videos and project files live on a regular NVMe (Samsung 980 PRO). Sequential reads are nearly identical (2,568 vs 2,613 MB/s), but Optane delivers roughly 4x the random-read IOPS (81,954 vs 19,687) at a quarter of the latency (390µs vs 1,624µs) — and model swapping plus inference temp files are dominated by random I/O.

ComfyUI Setup: Four Custom Node Packs

ComfyUI won as the engine because of its API server mode and its dominant video-generation ecosystem. Model weights stay outside the ComfyUI install directory and are linked via extra_model_paths.yaml, so reinstalling ComfyUI never means re-downloading hundreds of gigabytes of weights. We installed four custom node packs: WanVideoWrapper (130+ advanced Wan 2.2 nodes), VideoHelperSuite (video I/O, essential), SeedVR2 (2x AI upscaling with temporal consistency), and RIFE Frame Interpolation (16fps → 32/48fps). LTX 2.3 is natively supported; MOVA uses its own dedicated node pack. Environment: Python 3.12, PyTorch 2.10 (cu128), CUDA 13.1.
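As a rough illustration of the external-weights setup, the sketch below renders a minimal extra_model_paths.yaml. The mount point /mnt/optane/models, the section name video_models, and the subdirectory names are assumptions for illustration, not the paths used in this build.

```python
from pathlib import PurePosixPath

# Hypothetical location: the article only says the weights live on the Optane
# drive, outside the ComfyUI install directory.
MODEL_ROOT = PurePosixPath("/mnt/optane/models")

# ComfyUI folder types mapped to subdirectories under MODEL_ROOT.
# The subdirectory names are assumptions.
MODEL_FOLDERS = {
    "diffusion_models": "diffusion_models",
    "vae": "vae",
    "text_encoders": "text_encoders",
    "loras": "loras",
}


def render_extra_model_paths(root, folders, section="video_models"):
    """Return the text of an extra_model_paths.yaml with a single section."""
    lines = [f"{section}:", f"    base_path: {root.as_posix()}/"]
    for kind, subdir in folders.items():
        lines.append(f"    {kind}: {subdir}/")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the YAML; redirect it into the ComfyUI directory to use it.
    print(render_extra_model_paths(MODEL_ROOT, MODEL_FOLDERS), end="")
```

Keeping the mapping in one small script makes it easy to regenerate the file after a clean ComfyUI reinstall.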

Understanding Wan 2.2’s MoE Architecture

Wan 2.2 14B uses an unusual MoE design: 27B total parameters with 14B active. Instead of routing per token as LLM MoEs do, it swaps the entire expert model according to the denoising stage. A high-noise model handles the early steps (overall composition and motion), then a low-noise model refines details in the later steps. In ComfyUI this means two KSamplerAdvanced nodes chained in series, switching models at the halfway point. That is why T2V and I2V each need a high/low pair: 4 diffusion files × 27GB = 108GB.
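The two-sampler chain can be sketched as a fragment of an API-format workflow. This is a minimal sketch rather than the workflow used here: the node ids "10" and "11", the cfg value, and the euler/simple sampler settings are illustrative assumptions, and the model, conditioning, and latent links are expected to come from nodes defined elsewhere in the workflow.

```python
def two_stage_samplers(high_model, low_model, positive, negative, latent,
                       steps=20, cfg=3.5, seed=0):
    """Build two chained KSamplerAdvanced nodes in ComfyUI API format.

    Each link argument is a [node_id, output_index] pair pointing at a node
    defined elsewhere in the workflow. cfg, sampler and scheduler are
    placeholder values, not settings taken from the article.
    """
    switch = steps // 2  # hand over to the low-noise model at the halfway point
    shared = {
        "noise_seed": seed, "steps": steps, "cfg": cfg,
        "sampler_name": "euler", "scheduler": "simple",
        "positive": positive, "negative": negative,
    }
    high = {"class_type": "KSamplerAdvanced", "inputs": {
        **shared, "model": high_model, "latent_image": latent,
        "add_noise": "enable", "start_at_step": 0, "end_at_step": switch,
        # Keep the remaining noise so the second sampler can continue.
        "return_with_leftover_noise": "enable"}}
    low = {"class_type": "KSamplerAdvanced", "inputs": {
        **shared, "model": low_model, "latent_image": ["10", 0],
        # Noise was already added by the first sampler.
        "add_noise": "disable", "start_at_step": switch, "end_at_step": steps,
        "return_with_leftover_noise": "disable"}}
    return {"10": high, "11": low}
```

With steps=20, the high-noise model covers steps 0 to 10 and the low-noise model covers steps 10 to 20, reading its latent from the first sampler's output.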

Two details worth documenting. First, T2V workflows must use wan_2.1_vae: the similarly named wan2.2_vae is a 48-channel I2V-only VAE and will break T2V. Second, the fp8 text encoder (6.3GB) is indistinguishable in quality from fp16 (10.6GB), which saves about 4.3GB of VRAM for free.

Presets and Measured Performance

Preset | Resolution | Frames | Steps | Length | Use
fast | 854x480 | 57 | 12 | ~3.5s | Quick drafts / previews
default | 1280x720 | 81 | 20 | ~5s | General purpose
hq | 1280x720 | 81 | 30 | ~5s | Final quality
long | 1280x720 | 121 | 25 | ~7.5s | Longer clips
portrait | 720x1280 | 81 | 20 | ~5s | Vertical (social media)

Measured: the fast preset (854x480, 57 frames, 12 steps) takes about 195 seconds per clip; the 720p default runs 5–8 minutes. Generation uses roughly 67GB of VRAM (70% of 96GB) — models that would require quantization or offloading on a 24GB card run comfortably. For even faster drafts, the LightX2V 4-step LoRA generates ~5x faster than 20 steps with a modest quality tradeoff — but the high-noise and low-noise LoRAs must be applied as a pair, or output collapses.
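The presets are easy to keep as data in the wrapper script. The sketch below is one possible layout, not the script used here; the Preset class and PRESETS name are hypothetical, and clip length is derived from the 16fps generation rate mentioned elsewhere in this post.

```python
from dataclasses import dataclass

BASE_FPS = 16  # generation frame rate before interpolation


@dataclass(frozen=True)
class Preset:
    width: int
    height: int
    frames: int
    steps: int

    @property
    def seconds(self) -> float:
        """Clip length at the base frame rate."""
        return self.frames / BASE_FPS


# Values copied from the preset table.
PRESETS = {
    "fast": Preset(854, 480, 57, 12),
    "default": Preset(1280, 720, 81, 20),
    "hq": Preset(1280, 720, 81, 30),
    "long": Preset(1280, 720, 121, 25),
    "portrait": Preset(720, 1280, 81, 20),
}

for name, p in PRESETS.items():
    print(f"{name:9s} {p.width}x{p.height} {p.frames} frames, "
          f"{p.steps} steps, {p.seconds:.2f}s")
```

The computed lengths (about 3.56s, 5.06s, and 7.56s) line up with the rounded values in the table.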

The production pipeline: generate at 720p/16fps → SeedVR2 upscale (2x) → RIFE interpolation (to 32–48fps) → FFmpeg encode (H.264, CRF 18). Order matters: upscale while frame count is low, then interpolate.
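A minimal sketch of the final encode step and of why the order matters. The function names are hypothetical, the frame-count estimate ignores off-by-one effects at clip boundaries, and the -pix_fmt yuv420p flag is an addition for player compatibility that the post does not mention.

```python
import subprocess


def frames_after_interpolation(frames: int, factor: int) -> int:
    """Approximate frame count after interpolating by an integer factor."""
    return frames * factor


def ffmpeg_encode_cmd(src: str, dst: str, crf: int = 18) -> list:
    """Build an FFmpeg command for an H.264 encode at the given CRF."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-c:v", "libx264", "-crf", str(crf),
        "-pix_fmt", "yuv420p",  # assumption: added for broad player support
        dst,
    ]


def encode(src: str, dst: str) -> None:
    """Run the encode; raises CalledProcessError if FFmpeg fails."""
    subprocess.run(ffmpeg_encode_cmd(src, dst), check=True)


# Upscaling first touches 81 frames; interpolating first would hand the
# upscaler roughly 162 (32fps) or 243 (48fps) frames instead.
print(frames_after_interpolation(81, 2), frames_after_interpolation(81, 3))
```

Because the upscaler works frame by frame, running it before interpolation means it processes about half to a third as many frames.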

One-Line Shell Automation via the API

ComfyUI’s real strength is the API, not the web UI. Save any workflow in API format (JSON), POST it to /prompt, and the job enters the queue; a WebSocket at /ws streams step-by-step progress. A thin Python wrapper turns this into a one-liner:

python wan22_generate.py "A cat sleeping on a sunny windowsill" --preset hq
python generate_video.py --model wan22 --mode i2v --image ref.png --prompt "..."
python generate_video.py --model wan22 --batch prompts.txt
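The queueing step behind such a wrapper can be sketched with the standard library alone. This is not the script used here: the function names are hypothetical, the server address assumes ComfyUI's default port 8188 on localhost, and the progress parser assumes /ws messages of the form {"type": "progress", "data": {"value": ..., "max": ...}}.

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # assumed address (ComfyUI default port)


def build_prompt_request(workflow, client_id, base_url=COMFY_URL):
    """Build the POST /prompt request for an API-format workflow dict."""
    body = json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(workflow, client_id):
    """Send the workflow to the queue and return the prompt_id from the reply."""
    with urllib.request.urlopen(build_prompt_request(workflow, client_id)) as resp:
        return json.load(resp)["prompt_id"]


def progress_fraction(message):
    """Return value/max for a 'progress' WebSocket message, else None."""
    event = json.loads(message)
    if event.get("type") != "progress":
        return None
    data = event["data"]
    return data["value"] / data["max"]
```

A caller would load the saved API-format JSON, pass it to queue_workflow with a client id of its choosing, then feed text messages from the /ws connection into progress_fraction to drive a progress display.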

The end result: a conversational agent can call video generation as a tool. “Draft three product intro clips” becomes an agent refining prompts and running a batch — this flow works today.

Prompt-Writing Notes

Video prompts follow a different grammar than image prompts. The structure that works: [subject + appearance] + [action] + [setting] + [lighting] + [camera work] + [style]. Always specify camera movement (slow dolly in, pan left, gentle orbit, tracking shot, static, handheld) — its presence changes results dramatically. Keep one main action per scene; subtle motion is far more stable than ambitious choreography. Bake a standard negative prompt (blurry, distorted, watermark, bad anatomy, extra fingers, JPEG artifacts) into the script defaults so you never think about it.
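The prompt structure above can be turned into a small helper so that no slot is forgotten. This is a sketch under assumptions: the function name, the comma-joined output format, and the hard requirement on the camera field are illustrative choices, not the defaults of the actual script.

```python
# Standard negative prompt from the notes above, baked in as a default.
DEFAULT_NEGATIVE = (
    "blurry, distorted, watermark, bad anatomy, extra fingers, JPEG artifacts"
)


def build_prompt(subject, action, setting, lighting, camera, style):
    """Join the six prompt slots in the recommended order.

    Camera movement is required because leaving it out changes results a lot.
    Other empty slots are skipped.
    """
    if not camera.strip():
        raise ValueError("specify camera movement, e.g. 'slow dolly in' or 'static'")
    parts = [subject, action, setting, lighting, camera, style]
    return ", ".join(p.strip() for p in parts if p.strip())


prompt = build_prompt(
    subject="a ginger cat with soft fur",
    action="sleeping",
    setting="on a sunny windowsill",
    lighting="warm morning light",
    camera="slow dolly in",
    style="photorealistic",
)
print(prompt)
print(DEFAULT_NEGATIVE)
```

Passing one action per call mirrors the advice to keep a single main action per scene.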

Conclusion: Is Local Video Generation Worth It Now?

  1. Check licenses before benchmarks. Our top quality candidate was eliminated because its license prohibits commercial use in South Korea.
  2. Storage design comes first. Three models consume 347GB — far beyond typical LLM deployments. Separate model storage (random-I/O optimized) from output storage.
  3. ComfyUI shines as an API server, not a UI. That is what turns video generation into a tool an AI agent can call.
  4. Document the traps. VAE version mismatches and unpaired LoRAs will bite again if they only live in your memory.

With a 96GB-class GPU, this is the right moment: with the Wan 2.2 generation, open-source video quality has crossed into commercially usable territory. Next up are the LTX 2.3 4K+audio pipeline and MOVA lip sync in production, with measured results to follow.

