Intel Optane 905P + NVMe 3-Tier Storage — AI Server Disk Strategy

How should you partition disks on an AI server? GPU and VRAM get all the attention, but real-world AI service bottlenecks often originate in storage. RAG vector searches demand random reads, 21 models totaling 808 GB need a permanent home, and production databases require durability. Each workload has fundamentally different I/O characteristics — so we separated them across three storage tiers built around the Intel Optane 905P.

Why a 3-Tier Architecture

AI server disk I/O falls into three distinct patterns. Forcing all three onto a single drive causes contention and degrades every workload:

  • Random-read intensive — RAG vector search traverses thousands of non-sequential vectors per query. Standard NVMe latency of ~100 μs accumulates fast; Optane's ~10 μs keeps search responsive. Assigned to Tier 0: Optane 905P (960 GB).
  • Sequential read/write — Model loading (30–40 GB sequential reads), training data writes, OS and application code. Standard NVMe excels here. Assigned to Tier 1: Samsung 980 PRO (2 TB).
  • Bulk archival — 21 AI model checkpoints (808 GB) that are accessed rarely but take hours to re-download. Assigned to Tier 2: Biwin NVMe (1 TB).
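The three patterns above reduce to a simple routing table. A minimal sketch — the mount points and workload names are hypothetical placeholders, not paths from this build:

```python
# Hypothetical mount points for the three tiers (an assumption for illustration).
TIER_MOUNTS = {
    0: "/mnt/optane",   # Intel Optane 905P, 960 GB
    1: "/mnt/nvme",     # Samsung 980 PRO, 2 TB
    2: "/mnt/archive",  # Biwin NVMe, 1 TB
}

# Route each data class to a tier by its dominant I/O pattern.
WORKLOAD_TIER = {
    "vector_db": 0,       # random-read heavy
    "chat_log_db": 0,     # write heavy
    "lora_adapters": 0,   # read at model load
    "serving_model": 1,   # sequential read at startup
    "poc_artifacts": 1,   # mixed, disposable
    "model_archive": 2,   # low-frequency bulk
}

def storage_path(workload: str) -> str:
    """Return the mount point a given workload should live on."""
    return TIER_MOUNTS[WORKLOAD_TIER[workload]]
```

Keeping the mapping in one place makes the "never mix workloads" rule enforceable in code rather than by convention.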

3-Tier Configuration Overview

| Tier | Drive | Capacity | Characteristics | Purpose |
|------|-------|----------|-----------------|---------|
| Tier 0 | Intel Optane 905P | 960 GB | Ultra-low latency (~10 μs), 17.5 PBW endurance | Vector DB, RAG, chat logs, cache |
| Tier 1 | Samsung 980 PRO | 2 TB | General-purpose NVMe | OS, AI workspace, serving models |
| Tier 2 | Biwin NVMe | 1 TB | General-purpose NVMe | Model archive (21 models, ~808 GB) |

Tier 0 — Optane 905P 960 GB (Hot Data)

Stores everything that directly affects service quality: RAG vector databases, conversation log databases, active LoRA adapters, and inference caches. The Optane's 10 μs random-read latency is 10x faster than standard NVMe for these workloads, and its 17.5 PBW write endurance makes it ideal for database-heavy write patterns. At 960 GB, the drive accommodates years of operational data.

Tier 1 — Samsung 980 PRO 2 TB (Workspace)

Holds the OS, Python environment, the currently-serving base model, PoC/test artifacts, and customer data. Once a model is loaded into VRAM for serving, disk I/O drops to near zero — so sequential NVMe speed is sufficient. With only ~130 GB used out of 2 TB, over 1.5 TB remains for experiments.

Tier 2 — Biwin NVMe 1 TB (Model Archive)

Stores all 21 candidate AI models (~808 GB), past LoRA adapter versions, and training checkpoints for rollback. As an internal NVMe, it is far faster than USB-attached storage when a model needs to be pulled from archive to the workspace tier. Since re-downloading models from HuggingFace takes hours, local archival is cost-effective.

Why Optane Matters for RAG

The sole reason for choosing the Optane 905P is random-read latency. RAG vector search is inherently non-sequential — HNSW graph traversal accesses thousands of scattered vectors per query. Here is the performance comparison:

| Metric | Optane 905P | Standard NVMe | Difference |
|--------|-------------|---------------|------------|
| 4K random read latency | ~10 μs | ~100 μs | 10x faster |
| 4K random write latency | ~10 μs | ~20 μs | 2x faster |
| Endurance (TBW) | 17,520 TB | ~1,200 TB | 14.6x |
| Sequential read | 2,600 MB/s | 7,000 MB/s | NVMe wins |
| Sequential write | 2,200 MB/s | 5,000 MB/s | NVMe wins |

The latency accumulation is dramatic in practice. Searching 10,000 vectors at 100 μs each takes 1 second on standard NVMe; at 10 μs each, Optane completes the same search in 0.1 seconds. With 10 concurrent users, the gap widens to 10 seconds versus 1 second. This directly impacts the perceived response quality of the AI service.
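That arithmetic is easy to sanity-check with a back-of-envelope model. It assumes fully serialized reads — real engines overlap I/O, so treat these numbers as an upper bound:

```python
def search_time_s(n_vectors: int, latency_us: float, users: int = 1) -> float:
    """Total serialized read time for a vector search, in seconds.

    Worst-case model: every vector read waits for the previous one,
    and concurrent users queue behind each other.
    """
    return n_vectors * latency_us * users / 1_000_000

# 10,000 vector reads per query:
nvme_1u   = search_time_s(10_000, 100)            # 1.0 s on standard NVMe
optane_1u = search_time_s(10_000, 10)             # 0.1 s on Optane
# 10 concurrent users:
nvme_10u   = search_time_s(10_000, 100, users=10)  # 10.0 s
optane_10u = search_time_s(10_000, 10, users=10)   # 1.0 s
```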

The trade-off: Optane's sequential throughput is 2–3x slower than modern NVMe. That is exactly why model loading (sequential reads) stays on the 980 PRO while vector search (random reads) goes to Optane. Note that the Optane 905P is a discontinued product (Intel exited the Optane business), but 960 GB units are available on the secondary market at reasonable prices.

Data Flow Between Tiers

AI Model Flow

Models follow a defined path through the tiers:

  1. Download — Pull from HuggingFace to Tier 2 (Biwin archive).
  2. Test — Copy to Tier 1 (980 PRO), run SGLang serving tests.
  3. Deploy — If the test passes, the model stays on Tier 1 for production serving.
  4. Archive — Failed candidates remain on Tier 2 for potential future use.

Core rule: only one serving model lives on Tier 1 at a time. Once a base model is confirmed, it is not swapped — changing it requires retraining all LoRA adapters.

Service Data Flow

| Data | Storage Location | I/O Pattern | Impact if Lost |
|------|------------------|-------------|----------------|
| RAG vector DB | Tier 0 (Optane) | Random-read heavy | Re-indexing required |
| Chat log DB | Tier 0 (Optane) | Write heavy | Service data lost |
| Active LoRA adapters | Tier 0 (Optane) | Read at model load | Retraining required |
| Serving base model | Tier 1 (980 PRO) | Sequential read at startup | Re-downloadable |
| PoC / test artifacts | Tier 1 (980 PRO) | Mixed | Safely deletable |
| Model archive | Tier 2 (Biwin) | Low frequency | Re-download (hours) |

Cold Backup Strategy

The three internal drives (2 TB + 960 GB + 1 TB) are self-sufficient for full service operation. Cold backup runs on a separate server using NFS pull-mode — the backup server fetches data from the AI server via cron, rather than the AI server pushing to the backup.

Backup Server Specifications

  • CPU: AMD 5825U / RAM: 32 GB / OS: NVMe 256 GB
  • Backup storage: Seagate IronWolf 12 TB x 2 (SATA, 7200 RPM)
  • Dual 1 Gbps NICs for redundancy
  • NFS server with subnet-only access, UFW firewall (SSH + NFS only)
  • SMART monitoring via 30-minute cron with 60 °C alert threshold

Why Pull-Mode Backup

If the AI server is compromised, a push-mode backup could propagate corrupted or encrypted data to the backup server. In pull mode, the backup server initiates all transfers — the AI server has no write access to the backup volume. Backup targets include configuration files, Optane snapshots, LoRA adapters, reverse-proxy configs, website databases, and certificates. OS and packages are excluded — they can be reinstalled.
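In practice, a pull-mode cycle reduces to the backup server running rsync against a read-only mount of the AI server's NFS export. A sketch that only builds the command lines — the paths and target names are hypothetical, not the actual configuration:

```python
# Hypothetical paths on the backup server (assumptions for illustration).
AI_SERVER_EXPORT = "/mnt/ai-server"    # AI server's NFS export, mounted read-only
BACKUP_ROOT = "/backup/ai-server"      # local IronWolf volume

# Mirrors the backup targets listed above; OS and packages are excluded.
BACKUP_TARGETS = ["configs", "optane-snapshots", "lora-adapters",
                  "proxy-configs", "site-db", "certs"]

def rsync_command(target: str) -> list[str]:
    """Build the pull-mode rsync invocation the backup server's cron would run.

    The backup server initiates the transfer; the AI server never
    writes to the backup volume.
    """
    return ["rsync", "-a", "--delete",
            f"{AI_SERVER_EXPORT}/{target}/",
            f"{BACKUP_ROOT}/{target}/"]

commands = [rsync_command(t) for t in BACKUP_TARGETS]
```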

Storage Design Principles

  • Separate by I/O pattern — Random reads (Optane), sequential reads (NVMe), bulk archival (dedicated NVMe). Never mix all workloads on a single drive.
  • Place service-critical data on the fastest tier — RAG search, chat logs, and active LoRA adapters go on Optane because they directly determine user-facing response quality.
  • Self-sufficient internal drives — The server operates fully without the backup server. Independence equals resilience.
  • Pull-mode backup for security — The backup server fetches; the AI server never writes to backup. This protects backups even if the production server is compromised.

Storage architecture is not as glamorous as GPU specs, but it determines the perceived quality of every AI service interaction. Optane's 10 μs random-read advantage over standard NVMe is measurable in RAG search latency — and that latency is exactly what end users feel when they wait for an answer.