Intel Optane 905P + NVMe 3-Tier Storage — AI Server Disk Strategy
How should you partition disks on an AI server? GPU and VRAM get all the attention, but real-world AI service bottlenecks often originate in storage. RAG vector searches demand random reads, 21 models totaling 808 GB need a permanent home, and production databases require durability. Each workload has fundamentally different I/O characteristics — so we separated them across three storage tiers built around the Intel Optane 905P.
Why a 3-Tier Architecture
AI server disk I/O falls into three distinct patterns. Forcing all three onto a single drive causes contention and degrades every workload:
- Random-read intensive — RAG vector search traverses thousands of non-sequential vectors per query. Standard NVMe latency of ~100 μs accumulates fast; Optane's ~10 μs keeps search responsive. Assigned to Tier 0: Optane 905P (960 GB).
- Sequential read/write — Model loading (30–40 GB sequential reads), training data writes, OS and application code. Standard NVMe excels here. Assigned to Tier 1: Samsung 980 PRO (2 TB).
- Bulk archival — 21 AI model checkpoints (808 GB) that are accessed rarely but take hours to re-download. Assigned to Tier 2: Biwin NVMe (1 TB).
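The routing rule behind the three tiers can be sketched as a small lookup table. This is an illustrative sketch only — the mount points (`/mnt/optane`, `/mnt/nvme`, `/mnt/archive`) are hypothetical and not stated in the article:

```python
# Map each workload's dominant I/O pattern to a storage tier.
# Mount points are illustrative assumptions, not actual paths.
TIER_MOUNTS = {
    "tier0": "/mnt/optane",   # random-read heavy (Optane 905P)
    "tier1": "/mnt/nvme",     # sequential read/write (980 PRO)
    "tier2": "/mnt/archive",  # bulk, rarely accessed (Biwin)
}

IO_PATTERN_TO_TIER = {
    "random": "tier0",
    "sequential": "tier1",
    "archival": "tier2",
}

def mount_for(io_pattern: str) -> str:
    """Return the mount point for a workload's dominant I/O pattern."""
    return TIER_MOUNTS[IO_PATTERN_TO_TIER[io_pattern]]
```

The point of the lookup is the design rule itself: a workload is placed by its I/O pattern, never by free space.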
3-Tier Configuration Overview
| Tier | Drive | Capacity | Characteristics | Purpose |
|---|---|---|---|---|
| Tier 0 | Intel Optane 905P | 960 GB | Ultra-low latency 10 μs, 17.5 PBW endurance | Vector DB, RAG, chat logs, cache |
| Tier 1 | Samsung 980 PRO | 2 TB | General-purpose NVMe | OS, AI workspace, serving models |
| Tier 2 | Biwin NVMe | 1 TB | General-purpose NVMe | Model archive (21 models, ~808 GB) |
Tier 0 — Optane 905P 960 GB (Hot Data)
Stores everything that directly affects service quality: RAG vector databases, conversation log databases, active LoRA adapters, and inference caches. The Optane's 10 μs random-read latency is 10x faster than standard NVMe for these workloads, and its 17.5 PBW write endurance makes it ideal for database-heavy write patterns. At 960 GB, the drive accommodates years of operational data.
Tier 1 — Samsung 980 PRO 2 TB (Workspace)
Holds the OS, Python environment, the currently-serving base model, PoC/test artifacts, and customer data. Once a model is loaded into VRAM for serving, disk I/O drops to near zero — so sequential NVMe speed is sufficient. With only ~130 GB used out of 2 TB, over 1.5 TB remains for experiments.
Tier 2 — Biwin NVMe 1 TB (Model Archive)
Stores all 21 candidate AI models (~808 GB), past LoRA adapter versions, and training checkpoints for rollback. As an internal NVMe, it is far faster than USB-attached storage when a model needs to be pulled from archive to the workspace tier. Since re-downloading models from HuggingFace takes hours, local archival is cost-effective.
Why Optane Matters for RAG
The sole reason for choosing the Optane 905P is random-read latency. RAG vector search is inherently non-sequential — HNSW graph traversal accesses thousands of scattered vectors per query. Here is the performance comparison:
| Metric | Optane 905P | Standard NVMe | Difference |
|---|---|---|---|
| 4K random read latency | ~10 μs | ~100 μs | 10x faster |
| 4K random write latency | ~10 μs | ~20 μs | 2x faster |
| Endurance (TBW) | 17,520 TB | ~1,200 TB | 14.6x |
| Sequential read | 2,600 MB/s | 7,000 MB/s | NVMe wins |
| Sequential write | 2,200 MB/s | 5,000 MB/s | NVMe wins |
The latency accumulation is dramatic in practice. Searching 10,000 vectors at 100 μs each takes 1 second on standard NVMe; at 10 μs each, Optane completes the same search in 0.1 seconds. With 10 concurrent users, the gap widens to 10 seconds versus 1 second. This directly impacts the perceived response quality of the AI service.
The trade-off: Optane's sequential throughput is 2–3x slower than modern NVMe. That is exactly why model loading (sequential reads) stays on the 980 PRO while vector search (random reads) goes to Optane. Note that the Optane 905P is a discontinued product (Intel exited the Optane business), but 960 GB units are available on the secondary market at reasonable prices.
Data Flow Between Tiers
AI Model Flow
Models follow a defined path through the tiers:
- Download — Pull from HuggingFace to Tier 2 (Biwin archive).
- Test — Copy to Tier 1 (980 PRO), run SGLang serving tests.
- Deploy — If the test passes, the model stays on Tier 1 for production serving.
- Archive — Failed candidates remain on Tier 2 for potential future use.
Core rule: only one serving model lives on Tier 1 at a time. Once a base model is confirmed, it is not swapped — changing it requires retraining all LoRA adapters.
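The download → test → deploy/archive path can be sketched as a single promotion function. Directory layout and the test hook are illustrative assumptions, not the author's actual tooling:

```python
import shutil
from pathlib import Path

def promote_model(archive: Path, workspace: Path, name: str, passes_test) -> str:
    """Copy a model Tier 2 -> Tier 1; keep it only if the serving test passes."""
    src, dst = archive / name, workspace / name
    shutil.copytree(src, dst)          # pull candidate from archive to workspace
    if passes_test(dst):               # e.g. an SGLang serving smoke test
        return "deployed"              # stays on Tier 1 as the serving model
    shutil.rmtree(dst)                 # failed candidate: drop the Tier 1 copy;
    return "archived"                  # the original remains on Tier 2
```

Note the asymmetry: promotion copies rather than moves, so a failed candidate costs nothing — the archive copy on Tier 2 is never touched.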
Service Data Flow
| Data | Storage Location | I/O Pattern | Impact if Lost |
|---|---|---|---|
| RAG vector DB | Tier 0 (Optane) | Random-read heavy | Re-indexing required |
| Chat log DB | Tier 0 (Optane) | Write heavy | Service data lost |
| Active LoRA adapters | Tier 0 (Optane) | Read at model load | Retraining required |
| Serving base model | Tier 1 (980 PRO) | Sequential read at startup | Re-downloadable |
| PoC / test artifacts | Tier 1 (980 PRO) | Mixed | Safely deletable |
| Model archive | Tier 2 (Biwin) | Low frequency | Re-download (hours) |
Cold Backup Strategy
The three internal drives (2 TB + 960 GB + 1 TB) are self-sufficient for full service operation. Cold backup runs on a separate server using NFS pull-mode — the backup server fetches data from the AI server via cron, rather than the AI server pushing to the backup.
Backup Server Specifications
- CPU: AMD 5825U / RAM: 32 GB / OS: NVMe 256 GB
- Backup storage: Seagate IronWolf 12 TB x 2 (SATA, 7200 RPM)
- Dual 1 Gbps NICs for redundancy
- NFS server with subnet-only access, UFW firewall (SSH + NFS only)
- SMART monitoring via 30-minute cron with 60 °C alert threshold
Why Pull-Mode Backup
If the AI server is compromised, a push-mode backup could propagate corrupted or encrypted data to the backup server. In pull mode, the backup server initiates all transfers — the AI server has no write access to the backup volume. Backup targets include configuration files, Optane snapshots, LoRA adapters, reverse-proxy configs, website databases, and certificates. OS and packages are excluded — they can be reinstalled.
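The pull-mode idea reduces to one rule: the backup server builds and runs the transfer itself, reading from an NFS mount of the AI server. A minimal sketch, assuming rsync over an NFS mount (paths and flags are illustrative, not the author's actual job):

```python
def pull_backup_cmd(nfs_src: str, backup_dst: str) -> list[str]:
    """rsync command the backup server runs from cron (pull direction only)."""
    return [
        "rsync", "-a", "--delete",
        nfs_src.rstrip("/") + "/",   # NFS mount of the AI server, read-only
        backup_dst,                  # local IronWolf volume; the AI server
    ]                                # has no write path to this destination
```

Because the AI server only exports data and never initiates or receives writes on the backup volume, a compromised production host cannot overwrite the backups — it can at worst serve corrupted data that the next pull picks up, which retention of older snapshots mitigates.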
Storage Design Principles
- Separate by I/O pattern — Random reads (Optane), sequential reads (NVMe), bulk archival (dedicated NVMe). Never mix all workloads on a single drive.
- Place service-critical data on the fastest tier — RAG search, chat logs, and active LoRA adapters go on Optane because they directly determine user-facing response quality.
- Self-sufficient internal drives — The server operates fully without the backup server. Independence equals resilience.
- Pull-mode backup for security — The backup server fetches; the AI server never writes to backup. This protects backups even if the production server is compromised.
Storage architecture is not as glamorous as GPU specs, but it determines the perceived quality of every AI service interaction. Optane's 10 μs random-read advantage over standard NVMe is measurable in RAG search latency — and that latency is exactly what end users feel when they wait for an answer.