treeru.com
AI · Feb 18, 2026

Cross-Server AI Inference — How a $450 Secondary GPU Boosted Throughput by 70%

When your main inference server hits its limits, the conventional move is buying another expensive GPU. We took a different approach: adding a $450 RTX 5060 Ti on a separate machine connected over standard 1GbE Ethernet. The result was a 70% throughput increase at just 9% of the main GPU's cost. Here are the real numbers.

Key results:

  • +70% combined throughput gain
  • $450 secondary GPU cost
  • 5–18% network overhead
  • 0% error rate

The Two-Server Architecture

Our setup is deliberately simple. The main server runs an RTX PRO 6000 with 96 GB of VRAM, handling 8B, 14B, and 32B AWQ models plus LoRA multi-tenant inference via SGLang. The secondary server is a budget machine with an RTX 5060 Ti (16 GB VRAM) running 8B and 14B AWQ models. The two boxes sit on the same 1 GbE office LAN — no NVLink, no InfiniBand, no special hardware. The main server dispatches HTTP requests to the secondary server's SGLang API endpoint.
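The dispatch step is plain HTTP. As a minimal stdlib-Python sketch: SGLang's OpenAI-compatible `/v1/chat/completions` route is real, but the LAN addresses, port, and the exact model identifiers for the 14B/32B variants are placeholder assumptions, not our production config.

```python
import json
import urllib.request

# Placeholder LAN addresses and port — adjust to your own deployment.
MAIN_URL = "http://10.0.0.10:30000/v1/chat/completions"
SECONDARY_URL = "http://10.0.0.11:30000/v1/chat/completions"
SECONDARY_MODELS = {"Qwen3-8B-AWQ", "Qwen3-14B-AWQ"}  # models mirrored on the 5060 Ti

def build_request(prompt: str, model: str, max_tokens: int = 200) -> urllib.request.Request:
    """Build a chat-completion request, routed to whichever box serves the model."""
    url = SECONDARY_URL if model in SECONDARY_MODELS else MAIN_URL
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(url, data=body,
                                  headers={"Content-Type": "application/json"})

# To actually send it:
# urllib.request.urlopen(build_request("What are your hours?", "Qwen3-8B-AWQ"))
```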

At 20 concurrent users on the Qwen3-8B-AWQ model, the main server alone produces 1,582 tok/s while the secondary server alone produces 760 tok/s. The critical question was: how much throughput would we actually retain after adding network overhead?

Network Overhead: The Real Numbers

We measured the latency difference between calling the secondary server directly versus routing through the main server. This delta represents the pure network overhead of the cross-server hop.

| Response Length | Direct Call | Via Main Server | Overhead | Ratio |
|---|---|---|---|---|
| 50 tokens (short) | 678 ms | 748 ms | +70 ms | +10% |
| 200 tokens (medium) | 2,630 ms | 2,767 ms | +137 ms | +5% |
| 500 tokens (long) | 6,552 ms | 7,728 ms | +1,176 ms | +18% |

For short responses (50–200 tokens) — the kind you get from FAQ bots, classification tasks, and intent detection — network overhead ranges from 70 to 137 ms (5–10%). Users won't notice a 70 ms delay on a response that already takes 678 ms. Even for long 500-token responses, the 18% overhead is largely masked by streaming: the first token arrives at roughly the same time, and subsequent tokens trickle in with barely perceptible delay.
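The overhead figures reduce to simple deltas between the two latency measurements. A minimal sketch of the computation, with `timed_ms` as a generic wrapper rather than our actual benchmark harness:

```python
import time

def timed_ms(call) -> float:
    """Wall-clock latency of a single request callable, in milliseconds."""
    t0 = time.perf_counter()
    call()
    return (time.perf_counter() - t0) * 1000.0

def hop_overhead(direct_ms: float, via_main_ms: float) -> tuple[float, float]:
    """Absolute (ms) and relative overhead of the extra cross-server hop."""
    delta = via_main_ms - direct_ms
    return delta, delta / direct_ms

# The 500-token row from the table above:
delta_ms, ratio = hop_overhead(6552, 7728)  # 1,176 ms, ~18%
```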

Upgrading to 10 GbE would slash these numbers further, but our testing confirms that standard 1 GbE is perfectly viable for production cross-server inference — particularly for the short-response workloads we offload to the secondary machine.

Combined Throughput: From 1,582 to ~2,700 tok/s

When both servers process requests simultaneously, their throughput stacks. With batching optimization and intelligent routing, the combined system reaches approximately 2,700 tok/s on the 8B model with 20 concurrent users — a 70% improvement over the main server alone.

| Configuration | Throughput (8B, 20 users) | vs. Main Only |
|---|---|---|
| Main server only | 1,582 tok/s | Baseline |
| Secondary server only | 760 tok/s | 48% |
| Main + Secondary combined | ~2,700 tok/s | +70% |

For the larger 14B model, throughput gains are still significant. The main server produces 1,049 tok/s and the secondary produces 326 tok/s (at 80% VRAM utilization), combining to roughly 1,375 tok/s — a 31% gain. The secondary server can't run 32B models due to VRAM constraints, so those requests stay exclusively on the main machine.

Stability Under Multi-Client Load

We also tested with five project servers simultaneously calling the main server. Response time variance between servers was minimal, and the error rate remained at 0% across all test runs. SGLang's continuous batching handled the concurrent load without any stability issues.

Cost Efficiency: 5.4x Better Per-Token Value

The cost math tells the real story. The RTX PRO 6000 costs approximately $5,000 and delivers 1,582 tok/s, which works out to $3.16 per tok/s. The RTX 5060 Ti costs $450 and delivers 760 tok/s, or $0.59 per tok/s. That makes the secondary GPU 5.4 times more cost-efficient per unit of throughput.

| Metric | Main (PRO 6000) | Secondary (5060 Ti) | Ratio |
|---|---|---|---|
| GPU price | ~$5,000 | ~$450 | 9% |
| TDP | 350 W (limited) | 180 W | 51% |
| 8B throughput (20 users) | 1,582 tok/s | 760 tok/s | 48% |
| Cost per tok/s | $3.16 | $0.59 | 5.4x better |
| Peak temperature (20 users) | 43°C | 42°C | Similar |
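The per-token cost math checks out in a few lines of illustrative Python, using the prices and throughputs from the table:

```python
def cost_per_toks(price_usd: float, throughput_toks: float) -> float:
    """Dollars of GPU per unit of sustained throughput (tok/s)."""
    return price_usd / throughput_toks

main = cost_per_toks(5000, 1582)     # ≈ $3.16 per tok/s
secondary = cost_per_toks(450, 760)  # ≈ $0.59 per tok/s
advantage = main / secondary         # ≈ 5.3x raw; ≈ 5.4x from the rounded figures
```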

Operating Costs

The secondary server draws roughly 200W total (120W GPU + 80W system). Running 24/7, that adds up to about 144 kWh per month, or approximately $12/month in electricity. For context, renting a single cloud A100 GPU costs around $3/hour — meaning the secondary server's entire annual electricity bill is less than what you'd spend on three days of cloud GPU time.
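The electricity estimate is straightforward arithmetic; the $/kWh rate below is an assumption back-derived from the ~$12/month figure, so substitute your local rate:

```python
GPU_W, SYSTEM_W = 120, 80                 # measured draw from the text above
HOURS_PER_MONTH = 24 * 30
RATE_USD_PER_KWH = 0.083                  # assumed rate implied by ~$12/month

kwh = (GPU_W + SYSTEM_W) / 1000 * HOURS_PER_MONTH  # 144 kWh
monthly_cost = kwh * RATE_USD_PER_KWH              # ≈ $12
```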

Routing Strategies for Production

Adding a secondary server introduces a routing question: which requests go where? Blindly round-robining all traffic is suboptimal. The most effective approach routes by request type and complexity.

| Request Type | Route To | Reason |
|---|---|---|
| FAQ / simple queries | Secondary (8B) | Fast response is key; 76 tok/s per user is sufficient |
| Classification / tagging | Secondary (8B) | Short outputs; network overhead under 10% |
| Customer support / email | Main (14B) | Quality matters; leverages LoRA multi-tenant |
| Reports / document drafting | Main (32B) | Highest quality needed; only main can run 32B |
| Peak overflow | Secondary (fallback) | Auto-redirect when main queue exceeds threshold |

Implementation Options

Simple: URL-based routing at the proxy level. Send /api/faq to the secondary and /api/chat to the main server. Caddy or nginx handles this natively.
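As a rough illustration of the simple option, an nginx server block might contain something like the following — the addresses and paths are placeholders, not our production config:

```nginx
# Placeholder addresses — illustrative sketch only.
location /api/faq {
    proxy_pass http://10.0.0.11:30000;  # secondary (RTX 5060 Ti, 8B)
}
location /api/chat {
    proxy_pass http://10.0.0.10:30000;  # main (RTX PRO 6000)
}
```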

Intermediate: Route by input token count. Requests under 50 tokens (likely FAQ) go to the secondary; longer requests go to the main server.

Advanced: Dynamic queue-based routing. Monitor the main server's queue depth via SGLang's /get_server_info endpoint. When the queue exceeds a threshold, automatically redirect new requests to the secondary server. This provides graceful degradation under load spikes.
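The advanced option can be sketched in a few lines of stdlib Python. The server addresses, the threshold, and the `num_waiting_reqs` field name are assumptions — inspect what your SGLang version's `/get_server_info` payload actually returns:

```python
import json
import urllib.request

MAIN_BASE = "http://10.0.0.10:30000"       # assumed addresses
SECONDARY_BASE = "http://10.0.0.11:30000"
QUEUE_THRESHOLD = 16                       # tune to your workload

def main_queue_depth() -> int:
    """Poll the main server's queue via /get_server_info.
    The field name below is an assumption; adjust to your SGLang version."""
    with urllib.request.urlopen(MAIN_BASE + "/get_server_info", timeout=2) as resp:
        info = json.load(resp)
    return int(info.get("num_waiting_reqs", 0))

def pick_backend(queue_depth: int) -> str:
    """Send new requests to the secondary once the main queue is saturated."""
    return SECONDARY_BASE if queue_depth > QUEUE_THRESHOLD else MAIN_BASE
```

In production this check would run per-request (or on a short polling interval), which is what gives the graceful degradation under load spikes.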

When Should You Add a Secondary GPU?

Based on our testing, adding a secondary server makes sense when:

  • You regularly hit 50+ concurrent users and the main server queues requests
  • More than 50% of your workload is lightweight tasks (FAQ, classification, tagging)
  • You need redundancy — if the main server goes down, the secondary can handle basic inference
  • Your cloud GPU bill exceeds $200/month — the secondary server pays for itself in weeks

It's probably not worth it if your main server comfortably handles all traffic at under 20 concurrent users, if nearly all requests require 32B-quality output (the secondary can only run up to 14B), or if your network is below 100 Mbps.

Conclusion

Cross-server inference over commodity Ethernet is not just viable — it's remarkably cost-effective. For $450 (9% of the main GPU's cost), we added 760 tok/s of throughput, bringing the total system from 1,582 to approximately 2,700 tok/s. Network overhead is 5–18% over 1 GbE, with zero errors across all test runs.

The key insight is that you don't need to match your main server's capability. A budget GPU handling FAQ and classification frees the main server to focus on complex, quality-sensitive tasks. The cost per additional tok/s is 5.4 times cheaper with the secondary GPU. For self-hosted AI inference at scale, this asymmetric scaling strategy delivers far better ROI than simply doubling your primary hardware.