
Embedding Infrastructure Across 15 Servers — CPU Selection to Architecture Tiers

2026-03-29
Treeru

As a RAG system scales, embedding computation becomes the bottleneck. Adding a GPU is costly, but assigning embedding to an arbitrary CPU server gave us no clear selection criterion. We benchmarked embedding throughput on all 15 servers in the cluster and derived CPU selection criteria, confirmed the irrelevance of RAM and NVMe, and established server placement priorities. This is an infrastructure design record, not a tool introduction.

  • Servers benchmarked: 15
  • CPU architectures: 7 types
  • Max performance gap: 7.4x
  • Embedding share in pipeline: < 6%

Measurement environment

  • Benchmark date: 2026-02-24
  • Models: BAAI/bge-m3 (568M params), intfloat/multilingual-e5-large (560M params)
  • Iterations: 30 (excluding 5 warmup runs); statistics: p50/p95/p99
  • Threads: auto-ranged from 1T to logical-core count
  • GPU disabled (CUDA_VISIBLE_DEVICES=''), pure CPU inference
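The measurement loop above (warmup runs discarded, then p50/p95/p99 over 30 timed iterations) can be sketched as follows. `embed_once` is a hypothetical stand-in for a single model call; the percentile method is an assumption (nearest-rank), since the post does not specify one.

```python
import time
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile over a sorted copy of samples."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

def benchmark(embed_once, warmup=5, iterations=30):
    """Discard warmup calls, then time `iterations` calls; report ms percentiles."""
    for _ in range(warmup):
        embed_once()
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        embed_once()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return {
        "p50": statistics.median(latencies),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
    }
```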

Why Design an Embedding Infrastructure

Once local LLM services stabilized, the next bottleneck was embedding. Converting user queries to vectors, pre-indexing documents, and supporting real-time search all require the embedding server to respond reliably.

Initially, embedding ran on the same GPU servers as LLM inference. But GPUs were saturated by LLM workloads, causing embedding requests to queue. At the same time, several CPU servers sat idle. Switching to CPU embedding frees the GPU exclusively for LLM inference and distributes embedding across idle CPU servers. The prerequisite was measuring which CPU handles embedding well, and by how much.

GPU Is Not Needed for Embedding

The first question was whether CPU embedding is practical. We calculated the fraction of total response time that embedding occupies relative to LLM inference.

Embedding share analysis within the LLM pipeline

| CPU | Embedding latency | LLM inference time | Embedding share |
| --- | --- | --- | --- |
| 9950X3D | 65ms | 2,000–5,000ms | 1.3–3.2% |
| H255 | 70ms | 2,000–5,000ms | 1.4–3.5% |
| 5700G | 82ms | 2,000–5,000ms | 1.6–4.1% |
| 5825U | 118ms | 2,000–5,000ms | 2.4–5.9% |
| N100 | 481ms | 2,000–5,000ms | 9.6–24.1% |

* Excluding N100, embedding accounts for less than 6% of the total pipeline on all servers

Excluding the N100, every server kept embedding latency below 6% of LLM inference time. The bottleneck is LLM inference, not embedding, and assigning GPU resources to embedding is wasteful. CPU embedding is sufficient; the GPU should stay dedicated to LLM inference.

The N100 is the exception. At 481ms, embedding adds up to 24% on top of LLM inference time, which becomes a perceptible wait for users. The N100 is reserved for proxy and lightweight services and is not deployed as an embedding server.
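The shares in the table are simple ratios of embedding p50 latency to the LLM inference window. A quick check, using the p50 values measured in this benchmark:

```python
# Embedding p50 latency (ms) per CPU, from the benchmark
embed_ms = {"9950X3D": 65.0, "H255": 70.9, "5700G": 82.0,
            "5825U": 118.4, "N100": 481.3}

# Observed LLM inference window (ms)
llm_fast, llm_slow = 2000.0, 5000.0

# Share range: small against a slow LLM call, large against a fast one
shares = {cpu: (ms / llm_slow * 100, ms / llm_fast * 100)
          for cpu, ms in embed_ms.items()}

for cpu, (lo, hi) in shares.items():
    print(f"{cpu:8s} {lo:.1f}-{hi:.1f}%")
```

The printed ranges reproduce the table: everything except the N100 stays under 6% even in the worst case.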

CPU Architecture Performance Comparison

The 15 servers span 7 CPU types. The ranking below is by BGE-M3 short-text (30 chars) p50 latency and sustained throughput.

| Rank | CPU | Cores | BGE-M3 p50 | Throughput | Count |
| --- | --- | --- | --- | --- | --- |
| 1 | 9950X3D | 16C/32T | 65.0ms | 148.7/s | 1 unit |
| 2 | H255 | 8C/16T | 70.9ms | 99.7/s | 4 units |
| 3 | 5700G | 8C/16T | 82.0ms | 74.7/s | 1 unit |
| 4 | 5800U | 8C/16T | 88.9ms | 55.0/s | 1 unit |
| 5 | 7500F | 6C/12T | 92.0ms | 51.1/s | 1 unit |
| 6 | 5825U | 8C/16T | 118.4ms | 45.6/s | 6 units |
| 7 | N100 | 4C/4T | 481.3ms | 5.8/s | 1 unit |

9950X3D — Undisputed #1

The 16-core Zen 5 9950X3D leads at 65.0ms and 148.7 items/s. The gap to #2 H255 (70.9ms) is only 8.3% in latency, but throughput is 49% higher. The core count advantage grows more pronounced in batched parallel workloads. It currently shares an AI GPU server and handles both embedding and LLM inference simultaneously.

H255 — Balanced #2

Four H255 units average 70.9ms — close to the 9950X3D despite being embedded in mini PCs. Throughput approaches 100 items/s on BGE-M3, making it the top candidate for dedicated embedding servers after the 9950X3D.

5825U — Workhorse (6 units)

The Zen 3-based 5825U is the most numerous at 6 units. At 118ms it is 66% slower than the H255, but 46 items/s is sufficient for standard request loads. When one unit falls short, traffic can be distributed across multiple units.

N100 — Not Recommended for Embedding

The N100 (4-core low-power) recorded 481ms and 5.8 items/s, 6.8x higher latency than the #2 H255. It is suitable for lightweight services and reverse proxying, but not viable as an embedding server.

RAM and NVMe Have No Impact

Each server varies in RAM capacity, brand, and NVMe type. Does this affect embedding performance? We grouped servers by CPU and compared within each group.

5825U group (6 units)

| Server | RAM | NVMe | BGE-M3 p50 |
| --- | --- | --- | --- |
| Server A | 64GB DDR4 | Hynix 512GB | 117.8ms |
| Server B | 32GB DDR4 | Samsung 256GB | 118.1ms |
| Server C | 64GB DDR4 | ShiJi 256GB | 118.1ms |
| Server D | 64GB DDR4 | Samsung 500GB | 118.2ms |
| Server E | Generic 32GB DDR4 | 128GB | 118.8ms |
| Server F | 64GB DDR4 | Samsung 500GB | 119.2ms |

Variance: 1.2%. RAM capacity (32GB vs. 64GB), brand (Samsung / Hynix / generic), and NVMe brand are all irrelevant. The difference between the lowest spec (generic 32GB DDR4 + 128GB NVMe) and highest spec (64GB DDR4 + Samsung 500GB NVMe) is about 1ms.
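Assuming the 1.2% figure is the max-to-min spread of the six p50 latencies (the post does not state the formula), the arithmetic checks out:

```python
# BGE-M3 p50 latencies (ms) for the six 5825U units, Servers A-F
p50_5825u = [117.8, 118.1, 118.1, 118.2, 118.8, 119.2]

# Spread defined as (max - min) / min, in percent
spread = (max(p50_5825u) - min(p50_5825u)) / min(p50_5825u) * 100
print(f"spread: {spread:.1f}%")
```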

H255 group (4 units)

| Server | RAM | NVMe | BGE-M3 p50 |
| --- | --- | --- | --- |
| Server G | Mark 16GB DDR5 | Hynix 512GB | 70.5ms |
| Server H | Samsung 16GB DDR5 | PM9A1 1TB | 70.8ms |
| Server I | Samsung 16GB DDR5 | PM9A1 512GB | 70.8ms |
| Server J | Mark 16GB DDR5 | Hynix 512GB | 71.6ms |

Variance: 1.7%. DDR5 memory brand (Samsung vs. Mark) and NVMe tier (consumer vs. PM9A1 enterprise) have no impact.

The reason is straightforward. Embedding inference is compute-bound: the hot working set of model weights sits in the CPU cache hierarchy, and within the DDR4/DDR5 configurations tested, neither the memory bus nor NVMe is the limiting factor, so RAM and NVMe specs do not affect throughput. NVMe only matters during the initial model load; once the server is up, it has no bearing on inference speed.

Practical conclusion: when provisioning a new embedding server, do not spend budget on RAM brand or NVMe tier. Focus the budget on CPU selection.

Latency by Text Length

Embedding model latency scales nonlinearly with input text length. Three ranges were measured: short (30 chars), medium (150 chars), and long (500 chars).

| CPU | Short (30 chars) | Medium (150 chars) | Long (500 chars) | Long/short |
| --- | --- | --- | --- | --- |
| 9950X3D | 65.0ms | 68.9ms | 109.8ms | 1.7x |
| H255 | 70.5ms | 92.7ms | 157.9ms | 2.2x |
| 5700G | 82.0ms | 114.3ms | 210.4ms | 2.6x |
| 5800U | 88.9ms | 145.8ms | 283.3ms | 3.2x |
| 7500F | 92.0ms | 147.1ms | 307.4ms | 3.3x |
| 5825U | 117.8ms | 174.9ms | 343.0ms | 2.9x |
| N100 | 481.3ms | 906.9ms | 2,650.9ms | 5.5x |

The 9950X3D shows the most linear scaling — only 1.7x increase from short to long text. The N100, by contrast, balloons to 2,650ms at 500 chars, a 5.5x increase. For document indexing workloads that process large volumes of long text, the CPU gap widens further.

RAG system design note: It is advisable to separate real-time user query embedding (short, latency-sensitive) from document indexing (long, batch). Assign low-latency servers for real-time queries and high-throughput servers for batch indexing.
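The split suggested in the design note can be expressed as a simple router. The pool assignments and the 150-character cutoff below are illustrative assumptions (the cutoff borrows the "medium" bucket from the latency table), not a configuration from the post:

```python
# Illustrative pools based on the tier results: low-latency CPUs serve
# interactive queries; high-throughput CPUs absorb batch indexing.
REALTIME_POOL = ["9950X3D", "H255-1", "H255-2"]
BATCH_POOL = ["H255-3", "H255-4", "5700G"]

# Assumed cutoff between "real-time" and "batch-sized" inputs
LONG_TEXT_THRESHOLD = 150

def route(text: str, batch_job: bool = False) -> str:
    """Send batch jobs and long documents to the batch pool, else real-time."""
    pool = BATCH_POOL if batch_job or len(text) > LONG_TEXT_THRESHOLD else REALTIME_POOL
    return pool[hash(text) % len(pool)]  # stable spread within one process
```

A production setup would use consistent hashing or the load balancer's own weighting instead of `hash()`, but the routing decision itself stays this simple.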

The Hyperthreading Trap

What happens when you enable HT (hyperthreading) in multithreaded embedding? Increasing threads to the logical core count seems like it should improve performance — in practice, the results were the opposite.

| CPU | Physical cores | Optimal threads | HT effect | 1T → optimal speedup |
| --- | --- | --- | --- | --- |
| 9950X3D | 16 cores | 16T | Destructive (22x degradation) | 5.7x |
| H255 | 8 cores | 8T | −12–25% degradation | 4.4x |
| 5700G | 8 cores | 8T | −19% degradation | 3.9x |
| 5825U | 8 cores | 8T | −10–12% degradation | 5.6x |
| N100 | 4 cores | 4T | No HT | 3.0x |

On the 9950X3D, pushing threads to the logical core count (32T) causes a 22x throughput degradation versus physical cores (16T). Embedding computation is cache- and ALU-intensive per thread; HT threads sharing a physical core compete with each other and reduce total throughput.

Configuration rule: threads = physical core count. When deploying an embedding server, explicitly set OMP_NUM_THREADS or torch.set_num_threads() to the physical core count. Relying on defaults, which typically use the logical core count, inverts performance.

```shell
# Check physical core count (excluding HT)
lscpu | grep "Core(s) per socket"

# Set thread count when starting the embedding server
OMP_NUM_THREADS=8 python embedding_server.py
```

```python
# Set within Python code
import torch
torch.set_num_threads(8)  # physical core count
```
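Hardcoding 8 threads works per machine, but one deployment script serving all seven CPU types can detect the physical core count at startup. A minimal stdlib sketch, assuming Linux (it parses /proc/cpuinfo; this helper is not from the post):

```python
import os

def physical_core_count() -> int:
    """Count unique (physical id, core id) pairs in /proc/cpuinfo (Linux only)."""
    cores = set()
    phys = None
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("physical id"):
                    phys = line.split(":")[1].strip()
                elif line.startswith("core id"):
                    cores.add((phys, line.split(":")[1].strip()))
    except OSError:
        pass  # non-Linux: fall back to the logical count below
    return len(cores) or os.cpu_count() or 1

# Pin the OpenMP pool before torch/numpy initialize their threads
os.environ["OMP_NUM_THREADS"] = str(physical_core_count())
```

Setting the environment variable before any heavy import matters: thread pools are sized once at library initialization.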

Temperature and Cooling — Safe Under Sustained Load?

Embedding servers run under 24/7 sustained load. Exceeding thermal limits triggers throttling and degrades performance. Temperature changes under 30 seconds of continuous embedding load were measured.

| CPU | Idle temp | Load temp | Rise | Cooling |
| --- | --- | --- | --- | --- |
| 9950X3D | 58°C | 73°C | +15°C | Tower cooler |
| H255 | 34°C | 67°C | +33°C | Mini PC built-in |
| 5700G | 27°C | 48°C | +21°C | Stock cooler |
| 5825U | 51°C | 56°C | +5°C | Mini PC built-in |
| N100 | 37°C | 41°C | +4°C | Mini PC built-in |

Every server remained well below the thermal throttling threshold (typically 90–100°C) after 30 seconds of sustained load. Notably, the 5825U and N100 — running mini PC built-in cooling — rose only +5°C and +4°C respectively. Their low TDPs are well within the capacity of compact cooling solutions.

The 9950X3D runs at 58°C at idle because it shares a chassis with a GPU server and ambient temperature is elevated. Under load it reaches 73°C, leaving comfortable headroom. That said, in a hot server room, a tower cooler or better is recommended.

Server Tier Classification — Tier 1/2/3

Based on the benchmark data, servers were classified into three tiers by embedding workload suitability.

Tier 1: High-performance, suitable for both real-time and batch

9950X3D (1 unit)

  • 65ms / 148.7 items/s
  • 1.7x increase for long text (most linear)
  • On AI GPU server — can handle embedding alongside LLM

H255 (4 units)

  • 70ms / 99.7 items/s
  • 1.7% variance across 4 units — consistent
  • Top candidate for dedicated embedding servers

Tier 2: General-purpose, suitable for real-time, supplemental for batch

5700G (1 unit)

  • 82ms / 74.7 items/s

5800U (1 unit)

  • 89ms / 55.0 items/s

7500F (1 unit)

  • 92ms / 51.1 items/s

Tier 3: Lightweight, for low-traffic deployments or as a distributed supplement

5825U (6 units)

  • 118ms / 45.6 items/s (average)
  • 1.2% variance across 6 units — uniform enough for load balancing
  • Insufficient alone, but adequate in a load-balanced configuration

Excluded: Not recommended for embedding

N100 (1 unit)

  • 481ms / 5.8 items/s — 7.4x slower than Tier 1
  • Up to 24.1% of LLM pipeline share
  • Reserved for reverse proxy and gateway roles only
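Given the uniformity within each CPU model, a load balancer can weight servers by measured throughput alone. A sketch using the throughput figures from the ranking table; the server names are illustrative:

```python
# Sustained BGE-M3 throughput (items/s) per server, from the benchmark.
# Names like "H255-1" are illustrative labels, not real hostnames.
FLEET = {
    "9950X3D-1": 148.7,
    "H255-1": 99.7, "H255-2": 99.7, "H255-3": 99.7, "H255-4": 99.7,
    "5825U-1": 45.6, "5825U-2": 45.6, "5825U-3": 45.6,
    "5825U-4": 45.6, "5825U-5": 45.6, "5825U-6": 45.6,
}

def weights(fleet):
    """Normalize throughput into load-balancer weights that sum to 1."""
    total = sum(fleet.values())
    return {name: tput / total for name, tput in fleet.items()}

w = weights(FLEET)
```

Within a model the weights come out equal, matching the "same-CPU servers are uniform" finding; across models they track the 148.7 : 99.7 : 45.6 throughput ratio.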

Embedding Infrastructure Design Checklist

Design principles derived from this benchmark.

CPU is everything

RAM brand/capacity and NVMe tier have no effect on embedding inference speed. Concentrate budget on CPU selection.

Threads = physical core count

Enabling HT reverses performance. Explicitly set OMP_NUM_THREADS to the physical core count.

GPU is unnecessary

Excluding N100, embedding is under 6% of the LLM pipeline. Keep GPU dedicated to LLM inference.

N100 is not for embedding

481ms, up to 24% pipeline share. Limit to proxy and gateway roles.

Separate real-time from batch

Assign low-latency servers for real-time queries (short text) and high-throughput servers for batch indexing (long text).

Same-CPU servers are uniform

The six 5825U units vary by 1.2% and the four H255 units by 1.7%. Equal load-balancing weights within a CPU model are fine.

Thermal headroom is not a concern

No server approached thermal throttling under 30 seconds of sustained load. Standard server room ventilation is sufficient.

Designing an embedding infrastructure is simpler than it looks. Pick the right CPU and the rest of the variables do not matter. Avoid CPUs at the extreme low end like the N100, and most servers in the fleet can handle embedding without becoming a bottleneck in the LLM pipeline.

The benchmark tool construction and script details are in the CPU Embedding Benchmark Tool post. The overall server monitoring setup is covered in the Grafana + Prometheus Monitoring post.




© 2026 TreeRU. All rights reserved.

All content is copyrighted by TreeRU. Unauthorized reproduction without attribution is prohibited.