Embedding Infrastructure Across 15 Servers — CPU Selection to Architecture Tiers
As a RAG system scales, embedding computation becomes the bottleneck. Adding a GPU is costly, but assigning embedding to an arbitrary server left us without a clear selection criterion. We benchmarked embedding throughput on all 15 servers in the cluster and derived CPU selection criteria, confirmed the irrelevance of RAM and NVMe, and established a server placement priority. This is an infrastructure design record, not a tool introduction.
- 15 servers benchmarked
- 7 CPU architecture types
- 7.4x maximum performance gap
- < 6% embedding share in the pipeline
Measurement environment
- Benchmark date: 2026-02-24
- Models: BAAI/bge-m3 (568M params), intfloat/multilingual-e5-large (560M params)
- Iterations: 30 (excluding 5 warmup runs); statistics: p50/p95/p99
- Threads: auto-ranged from 1T to logical-core count
- GPU disabled (CUDA_VISIBLE_DEVICES=''), pure CPU inference
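For reference, the shape of the measurement loop is roughly the sketch below: warmup runs are discarded, 30 timed iterations are collected, and p50/p95/p99 are computed from the sample. It assumes the sentence-transformers package can load BAAI/bge-m3 on CPU; the sample text and output format are illustrative, and the full script is covered in the benchmark tool post linked at the end.

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # force pure CPU inference before torch loads

import time
import statistics
from sentence_transformers import SentenceTransformer

WARMUP = 5
ITERATIONS = 30
TEXT = "sample query of roughly thirty characters"  # short-text case

model = SentenceTransformer("BAAI/bge-m3", device="cpu")

for _ in range(WARMUP):  # warmup runs are excluded from statistics
    model.encode(TEXT)

latencies_ms = []
for _ in range(ITERATIONS):
    start = time.perf_counter()
    model.encode(TEXT)
    latencies_ms.append((time.perf_counter() - start) * 1000)

q = statistics.quantiles(latencies_ms, n=100)  # percentile cut points
print(f"p50={statistics.median(latencies_ms):.1f}ms p95={q[94]:.1f}ms p99={q[98]:.1f}ms")
```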
Why Design an Embedding Infrastructure
Once local LLM services stabilized, the next bottleneck was embedding. Converting user queries to vectors, pre-indexing documents, and supporting real-time search all require the embedding server to respond reliably.
Initially, embedding ran on the same GPU servers as LLM inference. But the GPUs were saturated by LLM workloads, so embedding requests queued behind them, while several CPU servers sat idle. Switching to CPU embedding keeps the GPU dedicated to LLM inference and distributes embedding across those idle CPU servers. The prerequisite was measuring which CPU handles embedding well, and by how much.
GPU Is Not Needed for Embedding
The first question was whether CPU embedding is practical. We calculated the fraction of total response time that embedding occupies relative to LLM inference.
Embedding share analysis within the LLM pipeline
| CPU | Embedding latency | LLM inference time | Embedding share |
|---|---|---|---|
| 9950X3D | 65ms | 2,000–5,000ms | 1.3–3.2% |
| H255 | 70ms | 2,000–5,000ms | 1.4–3.5% |
| 5700G | 82ms | 2,000–5,000ms | 1.6–4.1% |
| 5825U | 118ms | 2,000–5,000ms | 2.4–5.9% |
| N100 | 481ms | 2,000–5,000ms | 9.6–24.1% |
* Excluding N100, embedding accounts for less than 6% of the total pipeline on all servers
Excluding the N100, every server kept embedding latency below 6% of LLM inference time. The bottleneck is LLM inference, not embedding. Assigning GPU resources to embedding is wasteful. CPU embedding is sufficient; the GPU should stay dedicated to LLM inference.
The N100 is the exception. At 481ms, embedding amounts to as much as 24% of the LLM inference time, a perceptible bottleneck. The N100 is reserved for proxy and lightweight services; it is not placed as an embedding server.
CPU Architecture Performance Comparison
The 15 servers span 7 CPU types, ranked here by BGE-M3 short-text (30 chars) p50 latency and sustained throughput.
| Rank | CPU | Cores | BGE-M3 p50 | Throughput | Count |
|---|---|---|---|---|---|
| 1 | 9950X3D | 16C/32T | 65.0ms | 148.7/s | 1 unit |
| 2 | H255 | 8C/16T | 70.9ms | 99.7/s | 4 units |
| 3 | 5700G | 8C/16T | 82.0ms | 74.7/s | 1 unit |
| 4 | 5800U | 8C/16T | 88.9ms | 55.0/s | 1 unit |
| 5 | 7500F | 6C/12T | 92.0ms | 51.1/s | 1 unit |
| 6 | 5825U | 8C/16T | 118.4ms | 45.6/s | 6 units |
| 7 | N100 | 4C/4T | 481.3ms | 5.8/s | 1 unit |
9950X3D — Undisputed #1
The 16-core Zen 5 9950X3D leads at 65.0ms and 148.7 items/s. The gap to #2 H255 (70.9ms) is only 8.3% in latency, but throughput is 49% higher. The core count advantage grows more pronounced in batched parallel workloads. It currently shares an AI GPU server and handles both embedding and LLM inference simultaneously.
H255 — Balanced #2
Four H255 units average 70.9ms — close to the 9950X3D despite being housed in mini PCs. Throughput approaches 100 items/s on BGE-M3, making it the top candidate for dedicated embedding servers after the 9950X3D.
5825U — Workhorse (6 units)
The Zen 3-based 5825U is the most numerous at 6 units. At 118ms it is 66% slower than the H255, but 46 items/s is sufficient for standard request loads. When one unit falls short, traffic can be distributed across multiple units.
N100 — Not Recommended for Embedding
The N100 (4-core low-power) recorded 481ms and 5.8 items/s, 6.8x slower than the #2 H255. Suitable for lightweight services and reverse proxying, but not viable as an embedding server.
RAM and NVMe Have No Impact
Each server varies in RAM capacity, brand, and NVMe type. Does this affect embedding performance? We grouped servers by CPU and compared within each group.
5825U group (6 units)
| Server | RAM | NVMe | BGE-M3 p50 |
|---|---|---|---|
| Server A | 64GB DDR4 | Hynix 512GB | 117.8ms |
| Server B | 32GB DDR4 | Samsung 256GB | 118.1ms |
| Server C | 64GB DDR4 | ShiJi 256GB | 118.1ms |
| Server D | 64GB DDR4 | Samsung 500GB | 118.2ms |
| Server E | Generic 32GB DDR4 | 128GB | 118.8ms |
| Server F | 64GB DDR4 | Samsung 500GB | 119.2ms |
Variance: 1.2%. RAM capacity (32GB vs. 64GB), brand (Samsung / Hynix / generic), and NVMe brand are all irrelevant. The difference between the lowest spec (generic 32GB DDR4 + 128GB NVMe) and highest spec (64GB DDR4 + Samsung 500GB NVMe) is about 1ms.
H255 group (4 units)
| Server | RAM | NVMe | BGE-M3 p50 |
|---|---|---|---|
| Server G | Mark 16GB DDR5 | Hynix 512GB | 70.5ms |
| Server H | Samsung 16GB DDR5 | PM9A1 1TB | 70.8ms |
| Server I | Samsung 16GB DDR5 | PM9A1 512GB | 70.8ms |
| Server J | Mark 16GB DDR5 | Hynix 512GB | 71.6ms |
Variance: 1.7%. DDR5 memory brand (Samsung vs. Mark) and NVMe tier (consumer vs. PM9A1 enterprise) have no impact.
The reason is straightforward. Embedding inference is compute-bound: the hot working set of each matrix multiplication largely stays in the CPU caches, so the memory bus and NVMe are barely stressed during inference, and RAM and NVMe specs do not affect throughput. NVMe only matters during the initial model load. Once the server is up, it has no bearing on inference speed.
Practical conclusion: when provisioning a new embedding server, do not spend budget on RAM brand or NVMe tier. Focus the budget on CPU selection.
Latency by Text Length
Embedding model latency scales nonlinearly with input text length. Three ranges were measured: short (30 chars), medium (150 chars), and long (500 chars).
| CPU | short (30 chars) | medium (150 chars) | long (500 chars) | long/short |
|---|---|---|---|---|
| 9950X3D | 65.0ms | 68.9ms | 109.8ms | 1.7x |
| H255 | 70.5ms | 92.7ms | 157.9ms | 2.2x |
| 5700G | 82.0ms | 114.3ms | 210.4ms | 2.6x |
| 5800U | 88.9ms | 145.8ms | 283.3ms | 3.2x |
| 7500F | 92.0ms | 147.1ms | 307.4ms | 3.3x |
| 5825U | 117.8ms | 174.9ms | 343.0ms | 2.9x |
| N100 | 481.3ms | 906.9ms | 2,650.9ms | 5.5x |
The 9950X3D shows the most linear scaling — only 1.7x increase from short to long text. The N100, by contrast, balloons to 2,650ms at 500 chars, a 5.5x increase. For document indexing workloads that process large volumes of long text, the CPU gap widens further.
RAG system design note: It is advisable to separate real-time user query embedding (short, latency-sensitive) from document indexing (long, batch). Assign low-latency servers for real-time queries and high-throughput servers for batch indexing.
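A minimal routing sketch of that split might look like the following; the pool hostnames, port, /embed endpoint, and the 200-character threshold are illustrative assumptions rather than values taken from the benchmark.

```python
import requests

# Low-latency pool for interactive queries, high-throughput pool for indexing.
# Hostnames, port, and endpoint are assumed for illustration.
REALTIME_POOL = ["http://h255-1:8001", "http://h255-2:8001"]
BATCH_POOL = ["http://5825u-1:8001", "http://5825u-2:8001"]
LONG_TEXT_THRESHOLD = 200  # chars; illustrative cut-off between query and document

def pick_pool(text: str, is_indexing: bool) -> list[str]:
    """Send indexing jobs and long texts to the batch pool,
    short interactive queries to the low-latency pool."""
    if is_indexing or len(text) > LONG_TEXT_THRESHOLD:
        return BATCH_POOL
    return REALTIME_POOL

def embed(text: str, is_indexing: bool = False) -> list[float]:
    for node in pick_pool(text, is_indexing):
        try:
            resp = requests.post(f"{node}/embed", json={"text": text}, timeout=5)
            resp.raise_for_status()
            return resp.json()["embedding"]
        except requests.RequestException:
            continue  # fall through to the next node in the pool
    raise RuntimeError("no embedding node available")
```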
The Hyperthreading Trap
What happens when you enable HT (hyperthreading) in multithreaded embedding? Increasing threads to the logical core count seems like it should improve performance — in practice, the results were the opposite.
| CPU | Physical cores | Optimal threads | HT effect | 1T → optimal speedup |
|---|---|---|---|---|
| 9950X3D | 16 cores | 16T | Destructive (22x degradation) | 5.7x |
| H255 | 8 cores | 8T | −12–25% degradation | 4.4x |
| 5700G | 8 cores | 8T | −19% degradation | 3.9x |
| 5825U | 8 cores | 8T | −10–12% degradation | 5.6x |
| N100 | 4 cores | 4T | No HT | 3.0x |
On the 9950X3D, pushing threads to the logical core count (32T) causes a 22x throughput degradation versus physical cores (16T). Embedding computation is cache- and ALU-intensive per thread; HT threads sharing a physical core compete with each other and reduce total throughput.
Configuration rule: threads = physical core count. When deploying an embedding server, explicitly set OMP_NUM_THREADS or torch.set_num_threads() to the physical core count. Relying on defaults — which typically use the logical core count — causes the degradation shown above.
```bash
# Check physical core count (excluding HT)
lscpu | grep "Core(s) per socket"

# Set thread count when starting the embedding server
OMP_NUM_THREADS=8 python embedding_server.py
```

```python
# Set within Python code
import torch
torch.set_num_threads(8)  # physical core count
```
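If hard-coding the count per machine is inconvenient, the physical core count can also be detected at startup; this is a small sketch assuming the psutil package is installed.

```python
import psutil
import torch

# psutil reports physical cores directly, excluding HT siblings
physical_cores = psutil.cpu_count(logical=False)
torch.set_num_threads(physical_cores)
print(f"threads={physical_cores}, logical cores={psutil.cpu_count(logical=True)}")
```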
Temperature and Cooling — Safe Under Sustained Load?
Embedding servers run under 24/7 sustained load. Exceeding thermal limits triggers throttling and degrades performance. We measured temperature changes over 30 seconds of continuous embedding load.
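One way to capture such readings is to poll the sysfs thermal zones while the load runs in another process; the sketch below assumes a Linux host and that the hottest reported zone tracks the CPU package, which varies by machine.

```python
import glob
import time

def read_cpu_temps() -> dict[str, float]:
    """Read all sysfs thermal zones (reported in millidegrees Celsius)."""
    temps = {}
    for zone in glob.glob("/sys/class/thermal/thermal_zone*"):
        with open(f"{zone}/type") as f:
            name = f.read().strip()
        with open(f"{zone}/temp") as f:
            temps[name] = int(f.read()) / 1000.0
    return temps

# Sample once per second over the 30-second load window
samples = []
for _ in range(30):
    samples.append(max(read_cpu_temps().values()))
    time.sleep(1)

print(f"start: {samples[0]:.0f}°C, peak under load: {max(samples):.0f}°C")
```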
| CPU | Idle temp | Load temp | Rise | Cooling |
|---|---|---|---|---|
| 9950X3D | 58°C | 73°C | +15°C | Tower cooler |
| H255 | 34°C | 67°C | +33°C | Mini PC built-in |
| 5700G | 27°C | 48°C | +21°C | Stock cooler |
| 5825U | 51°C | 56°C | +5°C | Mini PC built-in |
| N100 | 37°C | 41°C | +4°C | Mini PC built-in |
Every server remained well below the thermal throttling threshold (typically 90–100°C) after 30 seconds of sustained load. Notably, the 5825U and N100 — running mini PC built-in cooling — rose only +5°C and +4°C respectively. Their low TDPs are well within the capacity of compact cooling solutions.
The 9950X3D runs at 58°C at idle because it shares a chassis with a GPU server and ambient temperature is elevated. Under load it reaches 73°C, leaving comfortable headroom. That said, in a hot server room, a tower cooler or better is recommended.
Server Tier Classification — Tier 1/2/3
Based on the benchmark data, servers were classified into three tiers by embedding workload suitability.
Tier 1
9950X3D (1 unit)
- 65ms / 148.7 items/s
- 1.7x increase for long text (most linear)
- On AI GPU server — can handle embedding alongside LLM
H255 (4 units)
- 70ms / 99.7 items/s
- 1.7% variance across 4 units — consistent
- Top candidate for dedicated embedding servers
Tier 2
5700G (1 unit)
- 82ms / 74.7 items/s
5800U (1 unit)
- 89ms / 55.0 items/s
7500F (1 unit)
- 92ms / 51.1 items/s
5825U (6 units)
- 118ms / 45.6 items/s (average)
- 1.2% variance across 6 units — uniform enough for load balancing
- Insufficient alone, but adequate in a load-balanced configuration
Tier 3
N100 (1 unit)
- 481ms / 5.8 items/s — 7.4x slower than Tier 1
- Up to 24.1% of LLM pipeline share
- Reserved for reverse proxy and gateway roles only
Embedding Infrastructure Design Checklist
Design principles derived from this benchmark.
CPU is everything
RAM brand/capacity and NVMe tier have no effect on embedding inference speed. Concentrate budget on CPU selection.
Threads = physical core count
Pushing the thread count up to the logical cores degrades performance. Explicitly set OMP_NUM_THREADS to the physical core count.
GPU is unnecessary
Excluding N100, embedding is under 6% of the LLM pipeline. Keep GPU dedicated to LLM inference.
N100 is not for embedding
481ms, up to 24% pipeline share. Limit to proxy and gateway roles.
Separate real-time from batch
Assign low-latency servers for real-time queries (short text) and high-throughput servers for batch indexing (long text).
Same-CPU servers are uniform
6x 5825U variance 1.2%, 4x H255 variance 1.7%. Equal load-balancing weights are fine (a dispatch sketch follows this checklist).
Thermal headroom is not a concern
No server approached thermal throttling under 30 seconds of sustained load. Standard server room ventilation is sufficient.
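Pulling the thread rule and the equal-weight rule together, dispatching across the six 5825U units can be as simple as plain round-robin; the hostnames, port, and /embed endpoint below are illustrative assumptions.

```python
import itertools
import requests

# Six 5825U nodes perform within 1.2% of each other, so equal weights
# (plain round-robin) are sufficient. Hostnames and endpoint are assumed.
EMBED_NODES = [f"http://5825u-{i}:8001" for i in range(1, 7)]
_rotation = itertools.cycle(EMBED_NODES)

def embed_round_robin(text: str) -> list[float]:
    """Send one embedding request to the next node in the rotation."""
    node = next(_rotation)
    resp = requests.post(f"{node}/embed", json={"text": text}, timeout=5)
    resp.raise_for_status()
    return resp.json()["embedding"]
```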
Designing an embedding infrastructure is simpler than it looks. Pick the right CPU and the rest of the variables do not matter. Avoid CPUs at the extreme low end like the N100, and most servers in the fleet can handle embedding without becoming a bottleneck in the LLM pipeline.
The benchmark tool construction and script details are in the CPU Embedding Benchmark Tool post. The overall server monitoring setup is covered in the Grafana + Prometheus Monitoring post.
Comments (4)
The Tier classification is practical. I can use this directly when deciding which servers to assign embedding services to.
The N100 not being recommended for embedding makes sense. 481ms would definitely become a pipeline bottleneck. Good as a proxy-only node though.
The key insight is that embedding accounts for under 6% of the LLM pipeline. The conclusion is clear: don't waste GPU on embedding — keep it focused on LLM inference.