Intel Optane 905P + NVMe 3-Tier Storage — AI Server Disk Strategy

How should you partition disks on an AI server? GPU and VRAM get all the attention, but real-world AI service bottlenecks often originate in storage. RAG vector searches demand random reads, 21 models totaling 808 GB need a permanent home, and production databases require durability. Each workload has fundamentally different I/O characteristics — so we separated them across three storage tiers built around the Intel Optane 905P.

Why a 3-Tier Architecture

AI server disk I/O falls into three distinct patterns. Forcing all three onto a single drive causes contention and degrades every workload:

  • Random-read intensive — RAG vector search traverses thousands of non-sequential vectors per query. Standard NVMe latency of ~100 μs accumulates fast; Optane's ~10 μs keeps search responsive. Assigned to Tier 0: Optane 905P (960 GB).
  • Sequential read/write — Model loading (30–40 GB sequential reads), training data writes, OS and application code. Standard NVMe excels here. Assigned to Tier 1: Samsung 980 PRO (2 TB).
  • Bulk archival — 21 AI model checkpoints (808 GB) that are accessed rarely but take hours to re-download. Assigned to Tier 2: Biwin NVMe (1 TB).
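The three patterns above reduce to a simple routing table. A minimal sketch — the mount points and workload names are hypothetical placeholders, not paths from this build:

```python
# Hypothetical mount points for the three tiers (an assumption for illustration).
TIER_MOUNTS = {
    0: "/mnt/optane",   # Intel Optane 905P, 960 GB
    1: "/mnt/nvme",     # Samsung 980 PRO, 2 TB
    2: "/mnt/archive",  # Biwin NVMe, 1 TB
}

# Route each data class to a tier by its dominant I/O pattern.
WORKLOAD_TIER = {
    "vector_db": 0,       # random-read heavy
    "chat_log_db": 0,     # write heavy
    "lora_adapters": 0,   # read at model load
    "serving_model": 1,   # sequential read at startup
    "poc_artifacts": 1,   # mixed, disposable
    "model_archive": 2,   # low-frequency bulk
}

def storage_path(workload: str) -> str:
    """Return the mount point a given workload should live on."""
    return TIER_MOUNTS[WORKLOAD_TIER[workload]]
```

Keeping the mapping in one place makes the "never mix workloads" rule enforceable in code rather than by convention.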

3-Tier Configuration Overview

| Tier | Drive | Capacity | Characteristics | Purpose |
|------|-------|----------|-----------------|---------|
| Tier 0 | Intel Optane 905P | 960 GB | Ultra-low latency (~10 μs), 17.5 PBW endurance | Vector DB, RAG, chat logs, cache |
| Tier 1 | Samsung 980 PRO | 2 TB | General-purpose NVMe | OS, AI workspace, serving models |
| Tier 2 | Biwin NVMe | 1 TB | General-purpose NVMe | Model archive (21 models, ~808 GB) |

Tier 0 — Optane 905P 960 GB (Hot Data)

Stores everything that directly affects service quality: RAG vector databases, conversation log databases, active LoRA adapters, and inference caches. The Optane's 10 μs random-read latency is 10x faster than standard NVMe for these workloads, and its 17.5 PBW write endurance makes it ideal for database-heavy write patterns. At 960 GB, the drive accommodates years of operational data.

Tier 1 — Samsung 980 PRO 2 TB (Workspace)

Holds the OS, Python environment, the currently-serving base model, PoC/test artifacts, and customer data. Once a model is loaded into VRAM for serving, disk I/O drops to near zero — so sequential NVMe speed is sufficient. With only ~130 GB used out of 2 TB, over 1.5 TB remains for experiments.

Tier 2 — Biwin NVMe 1 TB (Model Archive)

Stores all 21 candidate AI models (~808 GB), past LoRA adapter versions, and training checkpoints for rollback. As an internal NVMe, it is far faster than USB-attached storage when a model needs to be pulled from archive to the workspace tier. Since re-downloading models from HuggingFace takes hours, local archival is cost-effective.

Why Optane Matters for RAG

The sole reason for choosing the Optane 905P is random-read latency. RAG vector search is inherently non-sequential — HNSW graph traversal accesses thousands of scattered vectors per query. Here is the performance comparison:

| Metric | Optane 905P | Standard NVMe | Difference |
|--------|-------------|---------------|------------|
| 4K random read latency | ~10 μs | ~100 μs | 10x faster |
| 4K random write latency | ~10 μs | ~20 μs | 2x faster |
| Endurance (TBW) | 17,520 TB | ~1,200 TB | 14.6x |
| Sequential read | 2,600 MB/s | 7,000 MB/s | NVMe wins |
| Sequential write | 2,200 MB/s | 5,000 MB/s | NVMe wins |

The latency accumulation is dramatic in practice. Searching 10,000 vectors at 100 μs each takes 1 second on standard NVMe; at 10 μs each, Optane completes the same search in 0.1 seconds. With 10 concurrent users, the gap widens to 10 seconds versus 1 second. This directly impacts the perceived response quality of the AI service.
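That arithmetic is easy to sanity-check with a back-of-envelope model. It assumes fully serialized reads — real engines overlap I/O, so treat these numbers as an upper bound:

```python
def search_time_s(n_vectors: int, latency_us: float, users: int = 1) -> float:
    """Total serialized read time for a vector search, in seconds.

    Worst-case model: every vector read waits for the previous one,
    and concurrent users queue behind each other.
    """
    return n_vectors * latency_us * users / 1_000_000

# 10,000 vector reads per query:
nvme_1u   = search_time_s(10_000, 100)            # 1.0 s on standard NVMe
optane_1u = search_time_s(10_000, 10)             # 0.1 s on Optane
# 10 concurrent users:
nvme_10u   = search_time_s(10_000, 100, users=10)  # 10.0 s
optane_10u = search_time_s(10_000, 10, users=10)   # 1.0 s
```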

The trade-off: Optane's sequential throughput is 2–3x slower than modern NVMe. That is exactly why model loading (sequential reads) stays on the 980 PRO while vector search (random reads) goes to Optane. Note that the Optane 905P is a discontinued product (Intel exited the Optane business), but 960 GB units are available on the secondary market at reasonable prices.

Data Flow Between Tiers

AI Model Flow

Models follow a defined path through the tiers:

  1. Download — Pull from HuggingFace to Tier 2 (Biwin archive).
  2. Test — Copy to Tier 1 (980 PRO), run SGLang serving tests.
  3. Deploy — If the test passes, the model stays on Tier 1 for production serving.
  4. Archive — Failed candidates remain on Tier 2 for potential future use.

Core rule: only one serving model lives on Tier 1 at a time. Once a base model is confirmed, it is not swapped — changing it requires retraining all LoRA adapters.

Service Data Flow

| Data | Storage Location | I/O Pattern | Impact if Lost |
|------|------------------|-------------|----------------|
| RAG vector DB | Tier 0 (Optane) | Random-read heavy | Re-indexing required |
| Chat log DB | Tier 0 (Optane) | Write heavy | Service data lost |
| Active LoRA adapters | Tier 0 (Optane) | Read at model load | Retraining required |
| Serving base model | Tier 1 (980 PRO) | Sequential read at startup | Re-downloadable |
| PoC / test artifacts | Tier 1 (980 PRO) | Mixed | Safely deletable |
| Model archive | Tier 2 (Biwin) | Low frequency | Re-download (hours) |

Cold Backup Strategy

The three internal drives (2 TB + 960 GB + 1 TB) are self-sufficient for full service operation. Cold backup runs on a separate server using NFS pull-mode — the backup server fetches data from the AI server via cron, rather than the AI server pushing to the backup.

Backup Server Specifications

  • CPU: AMD 5825U / RAM: 32 GB / OS: NVMe 256 GB
  • Backup storage: Seagate IronWolf 12 TB x 2 (SATA, 7200 RPM)
  • Dual 1 Gbps NICs for redundancy
  • NFS server with subnet-only access, UFW firewall (SSH + NFS only)
  • SMART monitoring via 30-minute cron with 60 °C alert threshold

Why Pull-Mode Backup

If the AI server is compromised, a push-mode backup could propagate corrupted or encrypted data to the backup server. In pull mode, the backup server initiates all transfers — the AI server has no write access to the backup volume. Backup targets include configuration files, Optane snapshots, LoRA adapters, reverse-proxy configs, website databases, and certificates. OS and packages are excluded — they can be reinstalled.
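In practice, a pull-mode cycle reduces to the backup server running rsync against a read-only mount of the AI server's NFS export. A sketch that only builds the command lines — the paths and target names are hypothetical, not the actual configuration:

```python
# Hypothetical paths on the backup server (assumptions for illustration).
AI_SERVER_EXPORT = "/mnt/ai-server"    # AI server's NFS export, mounted read-only
BACKUP_ROOT = "/backup/ai-server"      # local IronWolf volume

# Mirrors the backup targets listed above; OS and packages are excluded.
BACKUP_TARGETS = ["configs", "optane-snapshots", "lora-adapters",
                  "proxy-configs", "site-db", "certs"]

def rsync_command(target: str) -> list[str]:
    """Build the pull-mode rsync invocation the backup server's cron would run.

    The backup server initiates the transfer; the AI server never
    writes to the backup volume.
    """
    return ["rsync", "-a", "--delete",
            f"{AI_SERVER_EXPORT}/{target}/",
            f"{BACKUP_ROOT}/{target}/"]

commands = [rsync_command(t) for t in BACKUP_TARGETS]
```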

Storage Design Principles

  • Separate by I/O pattern — Random reads (Optane), sequential reads (NVMe), bulk archival (dedicated NVMe). Never mix all workloads on a single drive.
  • Place service-critical data on the fastest tier — RAG search, chat logs, and active LoRA adapters go on Optane because they directly determine user-facing response quality.
  • Self-sufficient internal drives — The server operates fully without the backup server. Independence equals resilience.
  • Pull-mode backup for security — The backup server fetches; the AI server never writes to backup. This protects backups even if the production server is compromised.

Storage architecture is not as glamorous as GPU specs, but it determines the perceived quality of every AI service interaction. Optane's 10 μs random-read advantage over standard NVMe is measurable in RAG search latency — and that latency is exactly what end users feel when they wait for an answer.