Local AI Video Generation: ComfyUI + Wan 2.2 on RTX PRO 6000 96GB, 3 Open-Source Models

Why Local Video Generation

The goal was simple: ask a conversational AI agent (Claude Code) to “make this video,” and have a local GPU produce it with open-source models. Three advantages over cloud APIs drove the decision: marginal cost is effectively zero (regenerate a draft 100 times without worrying about a bill), output from Apache 2.0 licensed models can be used in products and client work without restrictions, and an HTTP API server integrates into scripts, agents, and batch jobs.

The architecture: ComfyUI runs as a persistent API server on localhost, three models sit behind it for different jobs, and FFmpeg handles post-processing.

Model Selection: Licensing Is Half the Battle

Benchmark scores were only half the evaluation; the other half was licensing. Several strong candidates were eliminated on license terms alone: HunyuanVideo 1.5 has regional restrictions that prohibit commercial use in South Korea, and CogVideoX 5B caps usage at 1 million monthly visitors. CogVideoX 2B carries no such license restrictions but falls short on quality (VBench 79.6).

Aspect	Wan 2.2 (14B)	LTX 2.3 (19B)	MOVA (32B MoE)
Photorealism	★★★★★	★★★★	★★★★
Motion quality	★★★★½	★★★★	★★★★
Max resolution	1080p	4K	720p
Audio generation	✗	✓	✓
Lip sync	✗	✗	✓
License	Apache 2.0	Open Weights	Apache 2.0

The final trio covers non-overlapping roles. Wan 2.2 (Alibaba, Apache 2.0) is the general-purpose workhorse — best-in-class open-source photorealism, VBench 82.8, particularly strong on faces, skin, and hair, and 100% compatible with existing Wan 2.1 LoRAs. LTX 2.3 (Lightricks) is the only open-source model with 4K/50fps output and generates video plus audio in a single pass via a dual-stream architecture; its Open Weights license is free below $10M annual revenue. MOVA (OpenMOSS, Apache 2.0) generates video and speech simultaneously with multilingual lip sync, including Korean.

Hardware and Storage Strategy

The workstation: RTX PRO 6000 Blackwell (96GB VRAM), Ryzen 9 9950X3D, 96GB DDR5. The real design work was storage, not GPU. Video models are far larger than LLMs — the three models total 347GB (Wan 2.2: 125GB, LTX 2.3: 141GB, MOVA: 73GB).

Model weights, the ComfyUI engine, and caches live on an Intel Optane NVMe; generated videos and project files live on a regular NVMe (Samsung 980 PRO). Sequential reads are nearly identical (2,568 vs 2,613 MB/s), but Optane delivers roughly 4x the random-read IOPS (81,954 vs 19,687) at a quarter of the latency (390µs vs 1,624µs) — and model swapping plus inference temp files are dominated by random I/O.

ComfyUI Setup: Four Custom Node Packs

ComfyUI won as the engine because of its API server mode and its dominant video-generation ecosystem. Model weights stay outside the ComfyUI install directory, linked via extra_model_paths.yaml, so reinstalling ComfyUI never means re-downloading 160GB of weights. Four custom node packs: WanVideoWrapper (130+ advanced Wan 2.2 nodes), VideoHelperSuite (video I/O, essential), SeedVR2 (2x AI upscaling with temporal consistency), and RIFE Frame Interpolation (16fps → 32/48fps). LTX 2.3 is natively supported; MOVA uses its own dedicated node pack. Environment: Python 3.12, PyTorch 2.10 (cu128), CUDA 13.1.
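As a rough sketch of the external-weights layout described above, the snippet below renders an extra_model_paths.yaml that points ComfyUI at a separate model drive. The mount point /mnt/optane/models, the section name video_models, and the subfolder names are assumptions for illustration; check the example file shipped with your ComfyUI version for the category keys it actually recognizes.

```python
from pathlib import PurePosixPath

# Hypothetical location of the model drive; adjust to your own mount point.
MODEL_ROOT = PurePosixPath("/mnt/optane/models")

# Category key -> subfolder under MODEL_ROOT (folder names are assumptions).
CATEGORIES = {
    "diffusion_models": "diffusion_models",
    "text_encoders": "text_encoders",
    "vae": "vae",
    "loras": "loras",
}


def render_extra_model_paths(root, categories, section="video_models"):
    """Return YAML text: one section, a base_path, and one line per category."""
    lines = [f"{section}:", f"  base_path: {root}"]
    for key, subdir in categories.items():
        lines.append(f"  {key}: {subdir}/")
    return "\n".join(lines) + "\n"


yaml_text = render_extra_model_paths(MODEL_ROOT, CATEGORIES)
print(yaml_text)
```

Writing yaml_text to extra_model_paths.yaml in the ComfyUI directory keeps the install disposable: the engine can be wiped and reinstalled while the weights stay where they are.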

Understanding Wan 2.2’s MoE Architecture

Wan 2.2 14B uses an unusual MoE design: 27B total parameters with 14B active, but instead of per-token routing like LLMs, the entire model swaps based on the denoising stage. A high-noise model handles early steps (overall composition and motion), then a low-noise model refines details in later steps. In ComfyUI this means two KSamplerAdvanced nodes chained in series, switching models at the halfway point. That is why T2V and I2V each need a high/low pair — 4 diffusion files × 27GB = 108GB.
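The two-sampler handoff can be sketched as plain data. The function below builds the step-related inputs for two chained KSamplerAdvanced nodes; it is a minimal illustration rather than a full workflow, and it leaves out cfg, sampler_name, scheduler, and the model/latent connections. The function name and the switch_ratio parameter are hypothetical.

```python
def split_sampler_inputs(total_steps, seed, switch_ratio=0.5):
    """Build step inputs for two chained KSamplerAdvanced nodes.

    The first node (high-noise model) runs steps [0, switch) and passes on
    its latent with leftover noise. The second node (low-noise model)
    resumes at `switch` without adding fresh noise and finishes the schedule.
    """
    switch = int(total_steps * switch_ratio)
    high = {
        "add_noise": "enable",
        "noise_seed": seed,
        "steps": total_steps,
        "start_at_step": 0,
        "end_at_step": switch,
        "return_with_leftover_noise": "enable",
    }
    low = {
        "add_noise": "disable",
        "noise_seed": seed,
        "steps": total_steps,
        "start_at_step": switch,
        "end_at_step": total_steps,
        "return_with_leftover_noise": "disable",
    }
    return high, low


high, low = split_sampler_inputs(total_steps=20, seed=42)
print(high["end_at_step"], low["start_at_step"])

# Disk footprint from the text: T2V and I2V each need a high/low pair.
diffusion_files = 2 * 2
print(diffusion_files * 27, "GB")
```

Both nodes share the same total step count so they agree on the noise schedule; only the start and end indices differ.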

Two things worth documenting. First, a trap: T2V workflows must use wan_2.1_vae; the similarly named wan2.2_vae is a 48-channel I2V-only VAE and will break T2V. Second, a free win: the fp8 text encoder (6.3GB) is indistinguishable in quality from fp16 (10.6GB), saving about 4GB of VRAM.
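One way to keep the VAE trap from recurring is a guard in the wrapper script that fails before a job is queued. The sketch below encodes only the rule stated above; the function name, the mode strings, and the .safetensors file names are assumptions.

```python
# VAE expected per mode, following the rule stated in the text.
# The ".safetensors" suffix and exact file names are assumptions.
EXPECTED_VAE = {
    "t2v": "wan_2.1_vae.safetensors",
}


def check_vae(mode, vae_name):
    """Raise ValueError if a job is about to load a VAE its mode cannot use."""
    expected = EXPECTED_VAE.get(mode)
    if expected is not None and vae_name != expected:
        raise ValueError(f"{mode} expects {expected}, got {vae_name}")
    return vae_name


print(check_vae("t2v", "wan_2.1_vae.safetensors"))
```

Modes without an entry in EXPECTED_VAE pass through unchecked, so the table can grow as more rules are confirmed.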

Presets and Measured Performance

Preset	Resolution	Frames	Steps	Length	Use
fast	854x480	57	12	~3.5s	Quick drafts / previews
default	1280x720	81	20	~5s	General purpose
hq	1280x720	81	30	~5s	Final quality
long	1280x720	121	25	~7.5s	Longer clips
portrait	720x1280	81	20	~5s	Vertical (social media)

Measured: the fast preset (854x480, 57 frames, 12 steps) takes about 195 seconds per clip; the 720p default runs 5–8 minutes. Generation uses roughly 67GB of VRAM (70% of 96GB), so models that would require quantization or offloading on a 24GB card run comfortably. For even faster drafts, the LightX2V 4-step LoRA generates about 5x faster than a 20-step run with a modest quality tradeoff, but the high-noise and low-noise LoRAs must be applied as a pair, or output collapses.
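The preset table can be captured as data so a wrapper script and its documentation never drift apart. This is a minimal sketch: the Preset class and its field names are hypothetical, and the clip length is derived from the 16fps native frame rate used in this setup.

```python
from dataclasses import dataclass

FPS = 16  # native generation frame rate used in this setup


@dataclass(frozen=True)
class Preset:
    width: int
    height: int
    frames: int
    steps: int

    @property
    def seconds(self):
        """Clip length at the native frame rate."""
        return self.frames / FPS


PRESETS = {
    "fast": Preset(854, 480, 57, 12),
    "default": Preset(1280, 720, 81, 20),
    "hq": Preset(1280, 720, 81, 30),
    "long": Preset(1280, 720, 121, 25),
    "portrait": Preset(720, 1280, 81, 20),
}

for name, p in PRESETS.items():
    print(f"{name:9s} {p.width}x{p.height} {p.frames} frames, "
          f"{p.steps} steps, {p.seconds:.2f}s")
```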

The production pipeline: generate at 720p/16fps → SeedVR2 upscale (2x) → RIFE interpolation (to 32–48fps) → FFmpeg encode (H.264, CRF 18). Order matters: upscale while frame count is low, then interpolate.
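The final encode and the ordering argument can be sketched in a few lines. The function below only builds the FFmpeg command and does not run it; the file names are hypothetical, and the yuv420p pixel format is an added assumption for player compatibility. The second helper approximates why upscaling comes first: the upscaler touches fewer frames when it runs before interpolation.

```python
import shlex


def ffmpeg_encode_cmd(src, dst, crf=18):
    """Build (not run) an FFmpeg command for the final H.264 encode."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-c:v", "libx264",
        "-crf", str(crf),
        "-pix_fmt", "yuv420p",  # assumption: broad player compatibility
        dst,
    ]


def frames_upscaled(frames, interp_factor, upscale_first):
    """Approximate number of frames the upscaler processes for each order."""
    return frames if upscale_first else frames * interp_factor


cmd = ffmpeg_encode_cmd("interpolated.mp4", "final.mp4")
print(shlex.join(cmd))
print(frames_upscaled(81, 2, upscale_first=True),
      frames_upscaled(81, 2, upscale_first=False))
```

Passing the list to subprocess.run(cmd, check=True) would execute the encode once FFmpeg is on the PATH.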

One-Line Shell Automation via the API

ComfyUI’s real strength is the API, not the web UI. Save any workflow in API format (JSON), POST it to /prompt, and the job enters the queue; a WebSocket at /ws streams step-by-step progress. A thin Python wrapper turns this into a one-liner:

python wan22_generate.py "A cat sleeping on a sunny windowsill" --preset hq
python generate_video.py --model wan22 --mode i2v --image ref.png --prompt "..."
python generate_video.py --model wan22 --batch prompts.txt
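Under the hood, a wrapper like this mostly needs to POST an API-format workflow to /prompt. The sketch below uses only the standard library; the server address (ComfyUI's usual local default), the function names, and the example file name in the comment are assumptions, and progress tracking over the /ws WebSocket is left out.

```python
import json
import urllib.request
import uuid

SERVER = "http://127.0.0.1:8188"  # assumption: default local ComfyUI address


def build_prompt_request(workflow, client_id, server=SERVER):
    """Wrap an API-format workflow dict in a POST request for /prompt."""
    body = json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(workflow, server=SERVER):
    """Queue the workflow and return the server's JSON reply."""
    client_id = uuid.uuid4().hex
    req = build_prompt_request(workflow, client_id, server)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage, with a hypothetical workflow file exported in API format:
#   with open("wan22_t2v_api.json", encoding="utf-8") as f:
#       reply = submit(json.load(f))
#   print(reply)
```

Keeping request construction separate from sending makes the wrapper easy to test without a running server, and a batch mode reduces to calling submit once per prompt.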

The end result: a conversational agent can call video generation as a tool. “Draft three product intro clips” becomes an agent refining prompts and running a batch — this flow works today.

Prompt-Writing Notes

Video prompts follow a different grammar than image prompts. The structure that works: [subject + appearance] + [action] + [setting] + [lighting] + [camera work] + [style]. Always specify camera movement (slow dolly in, pan left, gentle orbit, tracking shot, static, handheld) — its presence changes results dramatically. Keep one main action per scene; subtle motion is far more stable than ambitious choreography. Bake a standard negative prompt (blurry, distorted, watermark, bad anatomy, extra fingers, JPEG artifacts) into the script defaults so you never think about it.
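The prompt structure above lends itself to a small helper that forces every slot to be considered, camera work included. This is a sketch only: the function name is hypothetical, and joining the parts with commas is one reasonable choice rather than a requirement of the model.

```python
DEFAULT_NEGATIVE = (
    "blurry, distorted, watermark, bad anatomy, extra fingers, JPEG artifacts"
)


def build_prompt(subject, action, setting, lighting, camera, style):
    """Join the six slots in order, skipping any that are empty."""
    parts = [subject, action, setting, lighting, camera, style]
    return ", ".join(p.strip() for p in parts if p and p.strip())


prompt = build_prompt(
    subject="A gray tabby cat with soft fur",
    action="sleeping",
    setting="on a sunny windowsill",
    lighting="warm afternoon light",
    camera="slow dolly in",
    style="photorealistic",
)
print(prompt)
print(DEFAULT_NEGATIVE)
```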

Conclusion: Is Local Video Generation Worth It Now?

Check licenses before benchmarks. Our top quality candidate was eliminated because its license prohibits commercial use in South Korea.
Storage design comes first. Three models consume 347GB — far beyond typical LLM deployments. Separate model storage (random-I/O optimized) from output storage.
ComfyUI shines as an API server, not a UI. That is what turns video generation into a tool an AI agent can call.
Document the traps. VAE version mismatches and unpaired LoRAs will bite again if they only live in your memory.

With a 96GB-class GPU, this is the right moment: since the Wan 2.2 generation, open-source video quality has crossed into commercially usable territory. Next up: the LTX 2.3 4K+audio pipeline and MOVA lip sync in production, with measured results to follow.
