treeru.com
AI · Feb 18, 2026

Cross-Server AI Inference — How a $450 Secondary GPU Boosted Throughput by 70%

When your main inference server hits its limits, the conventional move is buying another expensive GPU. We took a different approach: adding a $450 RTX 5060 Ti on a separate machine connected over standard 1GbE Ethernet. The result was a 70% throughput increase at just 9% of the main GPU's cost. Here are the real numbers.

Key results:

  • +70% combined throughput gain
  • $450 secondary GPU cost
  • 5–18% network overhead
  • 0% error rate

The Two-Server Architecture

Our setup is deliberately simple. The main server runs an RTX PRO 6000 with 96 GB of VRAM, handling 8B, 14B, and 32B AWQ models plus LoRA multi-tenant inference via SGLang. The secondary server is a budget machine with an RTX 5060 Ti (16 GB VRAM) running 8B and 14B AWQ models. The two boxes sit on the same 1 GbE office LAN — no NVLink, no InfiniBand, no special hardware. The main server dispatches HTTP requests to the secondary server's SGLang API endpoint.
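The dispatch step is plain HTTP. As a minimal stdlib-Python sketch: SGLang's OpenAI-compatible `/v1/chat/completions` route is real, but the LAN addresses, port, and the exact model identifiers for the 14B/32B variants are placeholder assumptions, not our production config.

```python
import json
import urllib.request

# Placeholder LAN addresses and port — adjust to your own deployment.
MAIN_URL = "http://10.0.0.10:30000/v1/chat/completions"
SECONDARY_URL = "http://10.0.0.11:30000/v1/chat/completions"
SECONDARY_MODELS = {"Qwen3-8B-AWQ", "Qwen3-14B-AWQ"}  # models mirrored on the 5060 Ti

def build_request(prompt: str, model: str, max_tokens: int = 200) -> urllib.request.Request:
    """Build a chat-completion request, routed to whichever box serves the model."""
    url = SECONDARY_URL if model in SECONDARY_MODELS else MAIN_URL
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(url, data=body,
                                  headers={"Content-Type": "application/json"})

# To actually send it:
# urllib.request.urlopen(build_request("What are your hours?", "Qwen3-8B-AWQ"))
```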

At 20 concurrent users on the Qwen3-8B-AWQ model, the main server alone produces 1,582 tok/s while the secondary server alone produces 760 tok/s. The critical question was: how much throughput would we actually retain after adding network overhead?

Network Overhead: The Real Numbers

We measured the latency difference between calling the secondary server directly versus routing through the main server. This delta represents the pure network overhead of the cross-server hop.

| Response Length | Direct Call | Via Main Server | Overhead | Ratio |
|---|---|---|---|---|
| 50 tokens (short) | 678 ms | 748 ms | +70 ms | +10% |
| 200 tokens (medium) | 2,630 ms | 2,767 ms | +137 ms | +5% |
| 500 tokens (long) | 6,552 ms | 7,728 ms | +1,176 ms | +18% |

For short responses (50–200 tokens) — the kind you get from FAQ bots, classification tasks, and intent detection — network overhead ranges from 70 to 137 ms (5–10%). Users won't notice a 70 ms delay on a response that already takes 678 ms. Even for long 500-token responses, the 18% overhead is largely masked by streaming: the first token arrives at roughly the same time, and subsequent tokens trickle in with barely perceptible delay.
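The overhead figures reduce to simple deltas between the two latency measurements. A minimal sketch of the computation, with `timed_ms` as a generic wrapper rather than our actual benchmark harness:

```python
import time

def timed_ms(call) -> float:
    """Wall-clock latency of a single request callable, in milliseconds."""
    t0 = time.perf_counter()
    call()
    return (time.perf_counter() - t0) * 1000.0

def hop_overhead(direct_ms: float, via_main_ms: float) -> tuple[float, float]:
    """Absolute (ms) and relative overhead of the extra cross-server hop."""
    delta = via_main_ms - direct_ms
    return delta, delta / direct_ms

# The 500-token row from the table above:
delta_ms, ratio = hop_overhead(6552, 7728)  # 1,176 ms, ~18%
```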

Upgrading to 10 GbE would slash these numbers further, but our testing confirms that standard 1 GbE is perfectly viable for production cross-server inference — particularly for the short-response workloads we offload to the secondary machine.

Combined Throughput: From 1,582 to ~2,700 tok/s

When both servers process requests simultaneously, their throughput stacks. With batching optimization and intelligent routing, the combined system reaches approximately 2,700 tok/s on the 8B model with 20 concurrent users — a 70% improvement over the main server alone.

| Configuration | Throughput (8B, 20 users) | vs. Main Only |
|---|---|---|
| Main server only | 1,582 tok/s | Baseline |
| Secondary server only | 760 tok/s | 48% |
| Main + Secondary combined | ~2,700 tok/s | +70% |

For the larger 14B model, throughput gains are still significant. The main server produces 1,049 tok/s and the secondary produces 326 tok/s (at 80% VRAM utilization), combining to roughly 1,375 tok/s — a 31% gain. The secondary server can't run 32B models due to VRAM constraints, so those requests stay exclusively on the main machine.

Stability Under Multi-Client Load

We also tested with five project servers simultaneously calling the main server. Response time variance between servers was minimal, and the error rate remained at 0% across all test runs. SGLang's continuous batching handled the concurrent load without any stability issues.

Cost Efficiency: 5.4x Better Per-Token Value

The cost math tells the real story. The RTX PRO 6000 costs approximately $5,000 and delivers 1,582 tok/s, which works out to $3.16 per tok/s. The RTX 5060 Ti costs $450 and delivers 760 tok/s, or $0.59 per tok/s. That makes the secondary GPU 5.4 times more cost-efficient per unit of throughput.

| Metric | Main (PRO 6000) | Secondary (5060 Ti) | Ratio |
|---|---|---|---|
| GPU price | ~$5,000 | ~$450 | 9% |
| TDP | 350 W (limited) | 180 W | 51% |
| 8B throughput (20 users) | 1,582 tok/s | 760 tok/s | 48% |
| Cost per tok/s | $3.16 | $0.59 | 5.4x better |
| Peak temperature (20 users) | 43°C | 42°C | Similar |
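The per-token cost math checks out in a few lines of illustrative Python, using the prices and throughputs from the table:

```python
def cost_per_toks(price_usd: float, throughput_toks: float) -> float:
    """Dollars of GPU per unit of sustained throughput (tok/s)."""
    return price_usd / throughput_toks

main = cost_per_toks(5000, 1582)     # ≈ $3.16 per tok/s
secondary = cost_per_toks(450, 760)  # ≈ $0.59 per tok/s
advantage = main / secondary         # ≈ 5.3x raw; ≈ 5.4x from the rounded figures
```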

Operating Costs

The secondary server draws roughly 200W total (120W GPU + 80W system). Running 24/7, that adds up to about 144 kWh per month, or approximately $12/month in electricity. For context, renting a single cloud A100 GPU costs around $3/hour — meaning the secondary server's entire annual electricity bill is less than what you'd spend on three days of cloud GPU time.
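The electricity estimate is straightforward arithmetic; the $/kWh rate below is an assumption back-derived from the ~$12/month figure, so substitute your local rate:

```python
GPU_W, SYSTEM_W = 120, 80                 # measured draw from the text above
HOURS_PER_MONTH = 24 * 30
RATE_USD_PER_KWH = 0.083                  # assumed rate implied by ~$12/month

kwh = (GPU_W + SYSTEM_W) / 1000 * HOURS_PER_MONTH  # 144 kWh
monthly_cost = kwh * RATE_USD_PER_KWH              # ≈ $12
```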

Routing Strategies for Production

Adding a secondary server introduces a routing question: which requests go where? Blindly round-robining all traffic is suboptimal. The most effective approach routes by request type and complexity.

| Request Type | Route To | Reason |
|---|---|---|
| FAQ / simple queries | Secondary (8B) | Fast response is key; 76 tok/s per user is sufficient |
| Classification / tagging | Secondary (8B) | Short outputs; network overhead under 10% |
| Customer support / email | Main (14B) | Quality matters; leverages LoRA multi-tenant |
| Reports / document drafting | Main (32B) | Highest quality needed; only main can run 32B |
| Peak overflow | Secondary (fallback) | Auto-redirect when main queue exceeds threshold |

Implementation Options

Simple: URL-based routing at the proxy level. Send /api/faq to the secondary and /api/chat to the main server. Caddy or nginx handles this natively.
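As a rough illustration of the simple option, an nginx server block might contain something like the following — the addresses and paths are placeholders, not our production config:

```nginx
# Placeholder addresses — illustrative sketch only.
location /api/faq {
    proxy_pass http://10.0.0.11:30000;  # secondary (RTX 5060 Ti, 8B)
}
location /api/chat {
    proxy_pass http://10.0.0.10:30000;  # main (RTX PRO 6000)
}
```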

Intermediate: Route by input token count. Requests under 50 tokens (likely FAQ) go to the secondary; longer requests go to the main server.

Advanced: Dynamic queue-based routing. Monitor the main server's queue depth via SGLang's /get_server_info endpoint. When the queue exceeds a threshold, automatically redirect new requests to the secondary server. This provides graceful degradation under load spikes.
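The advanced option can be sketched in a few lines of stdlib Python. The server addresses, the threshold, and the `num_waiting_reqs` field name are assumptions — inspect what your SGLang version's `/get_server_info` payload actually returns:

```python
import json
import urllib.request

MAIN_BASE = "http://10.0.0.10:30000"       # assumed addresses
SECONDARY_BASE = "http://10.0.0.11:30000"
QUEUE_THRESHOLD = 16                       # tune to your workload

def main_queue_depth() -> int:
    """Poll the main server's queue via /get_server_info.
    The field name below is an assumption; adjust to your SGLang version."""
    with urllib.request.urlopen(MAIN_BASE + "/get_server_info", timeout=2) as resp:
        info = json.load(resp)
    return int(info.get("num_waiting_reqs", 0))

def pick_backend(queue_depth: int) -> str:
    """Send new requests to the secondary once the main queue is saturated."""
    return SECONDARY_BASE if queue_depth > QUEUE_THRESHOLD else MAIN_BASE
```

In production this check would run per-request (or on a short polling interval), which is what gives the graceful degradation under load spikes.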

When Should You Add a Secondary GPU?

Based on our testing, adding a secondary server makes sense when:

  • You regularly hit 50+ concurrent users and the main server queues requests
  • More than 50% of your workload is lightweight tasks (FAQ, classification, tagging)
  • You need redundancy — if the main server goes down, the secondary can handle basic inference
  • Your cloud GPU bill exceeds $200/month — the secondary server pays for itself in weeks

It's probably not worth it if your main server comfortably handles all traffic at under 20 concurrent users, if nearly all requests require 32B-quality output (the secondary can only run up to 14B), or if your network is below 100 Mbps.

Conclusion

Cross-server inference over commodity Ethernet is not just viable — it's remarkably cost-effective. For $450 (9% of the main GPU's cost), we added 760 tok/s of throughput, bringing the total system from 1,582 to approximately 2,700 tok/s. Network overhead is 5–18% over 1 GbE, with zero errors across all test runs.

The key insight is that you don't need to match your main server's capability. A budget GPU handling FAQ and classification frees the main server to focus on complex, quality-sensitive tasks. The cost per additional tok/s is 5.4 times cheaper with the secondary GPU. For self-hosted AI inference at scale, this asymmetric scaling strategy delivers far better ROI than simply doubling your primary hardware.