Embedding Infrastructure Across 15 Servers — CPU Selection to Architecture Tiers
As a RAG system scales, embedding computation becomes the bottleneck. Adding a GPU is costly, but assigning embedding to an arbitrary server left us without a clear selection criterion. We benchmarked embedding throughput on all 15 servers in the cluster and derived CPU selection criteria, confirmed the irrelevance of RAM and NVMe, and established a server placement priority. This is an infrastructure design record, not a tool introduction.
- 15 servers benchmarked
- 7 CPU architecture types
- 7.4x maximum performance gap
- < 6% embedding share in the pipeline
Measurement environment
- Benchmark date: 2026-02-24
- Models: BAAI/bge-m3 (568M params), intfloat/multilingual-e5-large (560M params)
- Iterations: 30 (excluding 5 warmup runs); statistics: p50/p95/p99
- Threads: auto-ranged from 1T to logical-core count
- GPU disabled (CUDA_VISIBLE_DEVICES=''), pure CPU inference
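For reference, the shape of the measurement loop is roughly the sketch below: warmup runs are discarded, 30 timed iterations are collected, and p50/p95/p99 are computed from the sample. It assumes the sentence-transformers package can load BAAI/bge-m3 on CPU; the sample text and output format are illustrative, and the full script is covered in the benchmark tool post linked at the end.

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # force pure CPU inference before torch loads

import time
import statistics
from sentence_transformers import SentenceTransformer

WARMUP = 5
ITERATIONS = 30
TEXT = "sample query of roughly thirty characters"  # short-text case

model = SentenceTransformer("BAAI/bge-m3", device="cpu")

for _ in range(WARMUP):  # warmup runs are excluded from statistics
    model.encode(TEXT)

latencies_ms = []
for _ in range(ITERATIONS):
    start = time.perf_counter()
    model.encode(TEXT)
    latencies_ms.append((time.perf_counter() - start) * 1000)

q = statistics.quantiles(latencies_ms, n=100)  # percentile cut points
print(f"p50={statistics.median(latencies_ms):.1f}ms p95={q[94]:.1f}ms p99={q[98]:.1f}ms")
```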
Why Design an Embedding Infrastructure
Once local LLM services stabilized, the next bottleneck was embedding. Converting user queries to vectors, pre-indexing documents, and supporting real-time search all require the embedding server to respond reliably.
Initially, embedding ran on the same GPU servers as LLM inference. But the GPUs were saturated by LLM workloads, so embedding requests queued behind them, while several CPU servers sat idle. Switching to CPU embedding keeps the GPU dedicated to LLM inference and distributes embedding across those idle CPU servers. The prerequisite was measuring which CPU handles embedding well, and by how much.
GPU Is Not Needed for Embedding
The first question was whether CPU embedding is practical. We calculated the fraction of total response time that embedding occupies relative to LLM inference.
Embedding share analysis within the LLM pipeline
| CPU | Embedding latency | LLM inference time | Embedding share |
|---|---|---|---|
| 9950X3D | 65ms | 2,000–5,000ms | 1.3–3.2% |
| H255 | 70ms | 2,000–5,000ms | 1.4–3.5% |
| 5700G | 82ms | 2,000–5,000ms | 1.6–4.1% |
| 5825U | 118ms | 2,000–5,000ms | 2.4–5.9% |
| N100 | 481ms | 2,000–5,000ms | 9.6–24.1% |
* Excluding N100, embedding accounts for less than 6% of the total pipeline on all servers
Excluding the N100, every server kept embedding latency below 6% of LLM inference time. The bottleneck is LLM inference, not embedding. Assigning GPU resources to embedding is wasteful. CPU embedding is sufficient; the GPU should stay dedicated to LLM inference.
The N100 is the exception. At 481ms, embedding amounts to as much as 24% of the LLM inference time, a perceptible bottleneck. The N100 is reserved for proxy and lightweight services; it is not placed as an embedding server.
CPU Architecture Performance Comparison
The 15 servers span 7 CPU types, ranked here by BGE-M3 short-text (30 chars) p50 latency and sustained throughput.
| Rank | CPU | Cores | BGE-M3 p50 | Throughput | Count |
|---|---|---|---|---|---|
| 1 | 9950X3D | 16C/32T | 65.0ms | 148.7/s | 1 unit |
| 2 | H255 | 8C/16T | 70.9ms | 99.7/s | 4 units |
| 3 | 5700G | 8C/16T | 82.0ms | 74.7/s | 1 unit |
| 4 | 5800U | 8C/16T | 88.9ms | 55.0/s | 1 unit |
| 5 | 7500F | 6C/12T | 92.0ms | 51.1/s | 1 unit |
| 6 | 5825U | 8C/16T | 118.4ms | 45.6/s | 6 units |
| 7 | N100 | 4C/4T | 481.3ms | 5.8/s | 1 unit |
9950X3D — Undisputed #1
The 16-core Zen 5 9950X3D leads at 65.0ms and 148.7 items/s. The gap to #2 H255 (70.9ms) is only 8.3% in latency, but throughput is 49% higher. The core count advantage grows more pronounced in batched parallel workloads. It currently shares an AI GPU server and handles both embedding and LLM inference simultaneously.
H255 — Balanced #2
Four H255 units average 70.9ms — close to the 9950X3D despite being housed in mini PCs. Throughput approaches 100 items/s on BGE-M3, making it the top candidate for dedicated embedding servers after the 9950X3D.
5825U — Workhorse (6 units)
The Zen 3-based 5825U is the most numerous at 6 units. At 118ms it is 66% slower than the H255, but 46 items/s is sufficient for standard request loads. When one unit falls short, traffic can be distributed across multiple units.
N100 — Not Recommended for Embedding
The N100 (4-core low-power) recorded 481ms and 5.8 items/s, 6.8x slower than the #2 H255. Suitable for lightweight services and reverse proxying, but not viable as an embedding server.
RAM and NVMe Have No Impact
Each server varies in RAM capacity, brand, and NVMe type. Does this affect embedding performance? We grouped servers by CPU and compared within each group.
5825U group (6 units)
| Server | RAM | NVMe | BGE-M3 p50 |
|---|---|---|---|
| Server A | 64GB DDR4 | Hynix 512GB | 117.8ms |
| Server B | 32GB DDR4 | Samsung 256GB | 118.1ms |
| Server C | 64GB DDR4 | ShiJi 256GB | 118.1ms |
| Server D | 64GB DDR4 | Samsung 500GB | 118.2ms |
| Server E | Generic 32GB DDR4 | 128GB | 118.8ms |
| Server F | 64GB DDR4 | Samsung 500GB | 119.2ms |
Variance: 1.2%. RAM capacity (32GB vs. 64GB), brand (Samsung / Hynix / generic), and NVMe brand are all irrelevant. The difference between the lowest spec (generic 32GB DDR4 + 128GB NVMe) and highest spec (64GB DDR4 + Samsung 500GB NVMe) is about 1ms.
H255 group (4 units)
| Server | RAM | NVMe | BGE-M3 p50 |
|---|---|---|---|
| Server G | Mark 16GB DDR5 | Hynix 512GB | 70.5ms |
| Server H | Samsung 16GB DDR5 | PM9A1 1TB | 70.8ms |
| Server I | Samsung 16GB DDR5 | PM9A1 512GB | 70.8ms |
| Server J | Mark 16GB DDR5 | Hynix 512GB | 71.6ms |
Variance: 1.7%. DDR5 memory brand (Samsung vs. Mark) and NVMe tier (consumer vs. PM9A1 enterprise) have no impact.
The reason is straightforward. Embedding inference is compute-bound: the hot working set of each matrix multiplication largely stays in the CPU caches, so the memory bus and NVMe are barely stressed during inference, and RAM and NVMe specs do not affect throughput. NVMe only matters during the initial model load. Once the server is up, it has no bearing on inference speed.
Practical conclusion: when provisioning a new embedding server, do not spend budget on RAM brand or NVMe tier. Focus the budget on CPU selection.
Latency by Text Length
Embedding model latency scales nonlinearly with input text length. Three ranges were measured: short (30 chars), medium (150 chars), and long (500 chars).
| CPU | short (30 chars) | medium (150 chars) | long (500 chars) | long/short |
|---|---|---|---|---|
| 9950X3D | 65.0ms | 68.9ms | 109.8ms | 1.7x |
| H255 | 70.5ms | 92.7ms | 157.9ms | 2.2x |
| 5700G | 82.0ms | 114.3ms | 210.4ms | 2.6x |
| 5800U | 88.9ms | 145.8ms | 283.3ms | 3.2x |
| 7500F | 92.0ms | 147.1ms | 307.4ms | 3.3x |
| 5825U | 117.8ms | 174.9ms | 343.0ms | 2.9x |
| N100 | 481.3ms | 906.9ms | 2,650.9ms | 5.5x |
The 9950X3D shows the most linear scaling — only 1.7x increase from short to long text. The N100, by contrast, balloons to 2,650ms at 500 chars, a 5.5x increase. For document indexing workloads that process large volumes of long text, the CPU gap widens further.
RAG system design note: It is advisable to separate real-time user query embedding (short, latency-sensitive) from document indexing (long, batch). Assign low-latency servers for real-time queries and high-throughput servers for batch indexing.
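A minimal routing sketch of that split might look like the following; the pool hostnames, port, /embed endpoint, and the 200-character threshold are illustrative assumptions rather than values taken from the benchmark.

```python
import requests

# Low-latency pool for interactive queries, high-throughput pool for indexing.
# Hostnames, port, and endpoint are assumed for illustration.
REALTIME_POOL = ["http://h255-1:8001", "http://h255-2:8001"]
BATCH_POOL = ["http://5825u-1:8001", "http://5825u-2:8001"]
LONG_TEXT_THRESHOLD = 200  # chars; illustrative cut-off between query and document

def pick_pool(text: str, is_indexing: bool) -> list[str]:
    """Send indexing jobs and long texts to the batch pool,
    short interactive queries to the low-latency pool."""
    if is_indexing or len(text) > LONG_TEXT_THRESHOLD:
        return BATCH_POOL
    return REALTIME_POOL

def embed(text: str, is_indexing: bool = False) -> list[float]:
    for node in pick_pool(text, is_indexing):
        try:
            resp = requests.post(f"{node}/embed", json={"text": text}, timeout=5)
            resp.raise_for_status()
            return resp.json()["embedding"]
        except requests.RequestException:
            continue  # fall through to the next node in the pool
    raise RuntimeError("no embedding node available")
```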
The Hyperthreading Trap
What happens when you enable HT (hyperthreading) in multithreaded embedding? Increasing threads to the logical core count seems like it should improve performance — in practice, the results were the opposite.
| CPU | Physical cores | Optimal threads | HT effect | 1T → optimal speedup |
|---|---|---|---|---|
| 9950X3D | 16 cores | 16T | Destructive (22x degradation) | 5.7x |
| H255 | 8 cores | 8T | −12–25% degradation | 4.4x |
| 5700G | 8 cores | 8T | −19% degradation | 3.9x |
| 5825U | 8 cores | 8T | −10–12% degradation | 5.6x |
| N100 | 4 cores | 4T | No HT | 3.0x |
On the 9950X3D, pushing threads to the logical core count (32T) causes a 22x throughput degradation versus physical cores (16T). Embedding computation is cache- and ALU-intensive per thread; HT threads sharing a physical core compete with each other and reduce total throughput.
Configuration rule: threads = physical core count. When deploying an embedding server, explicitly set OMP_NUM_THREADS or torch.set_num_threads() to the physical core count. Relying on defaults — which typically use the logical core count — causes the degradation shown above.
```bash
# Check physical core count (excluding HT)
lscpu | grep "Core(s) per socket"

# Set thread count when starting the embedding server
OMP_NUM_THREADS=8 python embedding_server.py
```

```python
# Set within Python code
import torch
torch.set_num_threads(8)  # physical core count
```
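If hard-coding the count per machine is inconvenient, the physical core count can also be detected at startup; this is a small sketch assuming the psutil package is installed.

```python
import psutil
import torch

# psutil reports physical cores directly, excluding HT siblings
physical_cores = psutil.cpu_count(logical=False)
torch.set_num_threads(physical_cores)
print(f"threads={physical_cores}, logical cores={psutil.cpu_count(logical=True)}")
```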
Temperature and Cooling — Safe Under Sustained Load?
Embedding servers run under 24/7 sustained load. Exceeding thermal limits triggers throttling and degrades performance. We measured temperature changes over 30 seconds of continuous embedding load.
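One way to capture such readings is to poll the sysfs thermal zones while the load runs in another process; the sketch below assumes a Linux host and that the hottest reported zone tracks the CPU package, which varies by machine.

```python
import glob
import time

def read_cpu_temps() -> dict[str, float]:
    """Read all sysfs thermal zones (reported in millidegrees Celsius)."""
    temps = {}
    for zone in glob.glob("/sys/class/thermal/thermal_zone*"):
        with open(f"{zone}/type") as f:
            name = f.read().strip()
        with open(f"{zone}/temp") as f:
            temps[name] = int(f.read()) / 1000.0
    return temps

# Sample once per second over the 30-second load window
samples = []
for _ in range(30):
    samples.append(max(read_cpu_temps().values()))
    time.sleep(1)

print(f"start: {samples[0]:.0f}°C, peak under load: {max(samples):.0f}°C")
```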
| CPU | Idle temp | Load temp | Rise | Cooling |
|---|---|---|---|---|
| 9950X3D | 58°C | 73°C | +15°C | Tower cooler |
| H255 | 34°C | 67°C | +33°C | Mini PC built-in |
| 5700G | 27°C | 48°C | +21°C | Stock cooler |
| 5825U | 51°C | 56°C | +5°C | Mini PC built-in |
| N100 | 37°C | 41°C | +4°C | Mini PC built-in |
Every server remained well below the thermal throttling threshold (typically 90–100°C) after 30 seconds of sustained load. Notably, the 5825U and N100 — running mini PC built-in cooling — rose only +5°C and +4°C respectively. Their low TDPs are well within the capacity of compact cooling solutions.
The 9950X3D runs at 58°C at idle because it shares a chassis with a GPU server and ambient temperature is elevated. Under load it reaches 73°C, leaving comfortable headroom. That said, in a hot server room, a tower cooler or better is recommended.
Server Tier Classification — Tier 1/2/3
Based on the benchmark data, servers were classified into three tiers by embedding workload suitability.
Tier 1
9950X3D (1 unit)
- 65ms / 148.7 items/s
- 1.7x increase for long text (most linear)
- On AI GPU server — can handle embedding alongside LLM
H255 (4 units)
- 70ms / 99.7 items/s
- 1.7% variance across 4 units — consistent
- Top candidate for dedicated embedding servers
Tier 2
5700G (1 unit)
- 82ms / 74.7 items/s
5800U (1 unit)
- 89ms / 55.0 items/s
7500F (1 unit)
- 92ms / 51.1 items/s
5825U (6 units)
- 118ms / 45.6 items/s (average)
- 1.2% variance across 6 units — uniform enough for load balancing
- Insufficient alone, but adequate in a load-balanced configuration
Tier 3
N100 (1 unit)
- 481ms / 5.8 items/s — 7.4x slower than Tier 1
- Up to 24.1% of LLM pipeline share
- Reserved for reverse proxy and gateway roles only
Embedding Infrastructure Design Checklist
Design principles derived from this benchmark.
CPU is everything
RAM brand/capacity and NVMe tier have no effect on embedding inference speed. Concentrate budget on CPU selection.
Threads = physical core count
Pushing the thread count up to the logical cores degrades performance. Explicitly set OMP_NUM_THREADS to the physical core count.
GPU is unnecessary
Excluding N100, embedding is under 6% of the LLM pipeline. Keep GPU dedicated to LLM inference.
N100 is not for embedding
481ms, up to 24% pipeline share. Limit to proxy and gateway roles.
Separate real-time from batch
Assign low-latency servers for real-time queries (short text) and high-throughput servers for batch indexing (long text).
Same-CPU servers are uniform
6x 5825U variance 1.2%, 4x H255 variance 1.7%. Equal load-balancing weights are fine (a dispatch sketch follows this checklist).
Thermal headroom is not a concern
No server approached thermal throttling under 30 seconds of sustained load. Standard server room ventilation is sufficient.
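Pulling the thread rule and the equal-weight rule together, dispatching across the six 5825U units can be as simple as plain round-robin; the hostnames, port, and /embed endpoint below are illustrative assumptions.

```python
import itertools
import requests

# Six 5825U nodes perform within 1.2% of each other, so equal weights
# (plain round-robin) are sufficient. Hostnames and endpoint are assumed.
EMBED_NODES = [f"http://5825u-{i}:8001" for i in range(1, 7)]
_rotation = itertools.cycle(EMBED_NODES)

def embed_round_robin(text: str) -> list[float]:
    """Send one embedding request to the next node in the rotation."""
    node = next(_rotation)
    resp = requests.post(f"{node}/embed", json={"text": text}, timeout=5)
    resp.raise_for_status()
    return resp.json()["embedding"]
```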
Designing an embedding infrastructure is simpler than it looks. Pick the right CPU and the rest of the variables do not matter. Avoid CPUs at the extreme low end like the N100, and most servers in the fleet can handle embedding without becoming a bottleneck in the LLM pipeline.
The benchmark tool construction and script details are in the CPU Embedding Benchmark Tool post. The overall server monitoring setup is covered in the Grafana + Prometheus Monitoring post.
Comments (4)
The Tier classification is practical. I can use this directly when deciding which servers to assign embedding services to.
The N100 not being recommended for embedding makes sense. 481ms would definitely become a pipeline bottleneck. Good as a proxy-only node though.
The key insight is that embedding accounts for under 6% of the LLM pipeline. The conclusion is clear: don't waste GPU on embedding — keep it focused on LLM inference.