Building AI Server Infrastructure in the Office — A 16-Server Setup Guide
We built a full AI infrastructure in the office — no cloud required. Starting from a single GPU-equipped AI server, we expanded to 16 machines with dedicated roles: reverse proxy, project servers, lightweight monitoring nodes, and cold backup storage. Every server is connected via an SSH key mesh network, bandwidth is measured and tiered, and external servers are fully isolated. This is our practical guide to on-premise AI infrastructure.
Server Role Separation
We organized 16 servers into 5 role groups. Each group operates independently — a failure in one group never propagates to another. This isolation is the foundation of our entire infrastructure.
AI Brain Server (1 unit)
The core of all AI inference. Equipped with an AMD Ryzen 9950X3D, 96GB DDR5 RAM, and an NVIDIA RTX PRO 6000 with 96GB VRAM. Storage uses a 3-tier architecture: Intel Optane 905P for hot data, Samsung 980 PRO for warm data, and standard NVMe for cold storage.
AI Auxiliary Server (1 unit)
Handles FAQ and simple queries as a lightweight AI node. Runs on a Ryzen 7500F with an RTX 5060 Ti (16GB VRAM). Through cross-server inference, this server increases the main AI server's throughput by 70%.
Reverse Proxy (1 unit)
The single entry point for all external traffic. An Intel N100 mini PC running Caddy handles SSL termination, domain routing, and load balancing. Low-power and purpose-built — no GPU, no unnecessary services.
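The routing described above could be sketched in a Caddyfile like the following (domains, upstream IPs, and ports are illustrative, not our actual configuration). Caddy provisions and renews TLS certificates automatically for each named site:

```
# Caddyfile sketch — domains and upstreams are placeholders
app.example.com {
    reverse_proxy 10.0.10.12:8080
}

api.example.com {
    # Load-balance across two project servers
    reverse_proxy 10.0.10.13:3000 10.0.10.14:3000 {
        lb_policy least_conn
    }
}
```

A single low-power N100 is more than enough here, since Caddy only terminates TLS and forwards requests; the heavy lifting happens on the project servers behind it.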
Project Servers (7 units)
Each project server runs a dedicated web service, API, or database. CPUs range from Ryzen 5700G to 7840HS with 32–64GB RAM. One project per server means resource conflicts are impossible and deployments are independent.
Lightweight Servers (5 units)
Intel H255-based mini PCs with 16GB RAM handling monitoring, log collection, and lightweight APIs. Low power draw and silent operation make them ideal for auxiliary tasks that run 24/7.
Cold Backup (1 unit)
An NFS-based network storage server with two Seagate IronWolf 12TB drives. Regularly backs up critical data from all servers. Dual NICs provide network redundancy.
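A push-style backup from an internal server to the NFS store might look like the sketch below. The export path, host name, and service directory are illustrative assumptions, not our exact layout:

```shell
#!/usr/bin/env bash
# Sketch of a push-style NFS backup from an internal server.
# backup.internal, /srv/backup, and /var/lib/myservice are illustrative names.
set -euo pipefail

# Build a dated destination path like /mnt/backup/<host>/<YYYY-MM-DD>/
backup_dest() {
  local root="$1" host="$2" day="$3"
  printf '%s/%s/%s/' "$root" "$host" "$day"
}

# On the backup server (one-time, illustrative /etc/exports entry):
#   /srv/backup 10.0.10.0/24(rw,sync,no_subtree_check)
#   sudo exportfs -ra

# On an internal server: mount the export, then push a dated snapshot
# sudo mount -t nfs backup.internal:/srv/backup /mnt/backup
# rsync -a --delete /var/lib/myservice/ \
#   "$(backup_dest /mnt/backup "$(hostname)" "$(date +%F)")"
```

Because internal servers push and the backup server never pulls, this flow stays compatible with the one-way isolation described later.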
Why Separate Roles?
| Principle | Benefit |
|---|---|
| Fault isolation | A project server crash never affects AI inference |
| Resource independence | GPU/VRAM usage does not compete with web service CPU/RAM |
| Horizontal scaling | Add a project server for new services; add a GPU for more AI capacity |
Network Bandwidth Tiers
All servers share the same subnet (10.0.10.0/24), but actual bandwidth varies by NIC capability. We measured every link with iperf3 --bidir and organized servers into two tiers. After cable and port replacements, we eliminated all 100Mbps bottlenecks — every server now achieves 1Gbps or higher.
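A minimal sketch of the measurement loop, assuming an iperf3 server (`iperf3 -s`) is already running on each target; the host addresses and tier thresholds are our own conventions, not iperf3 defaults:

```shell
#!/usr/bin/env bash
# Sketch: measure each server with iperf3, then classify it into a tier.
set -euo pipefail

# Classify a measured throughput (in Mbps) into our two tiers
tier_for() {
  local mbps="$1"
  if [ "$mbps" -ge 2000 ]; then
    echo "tier1-2.5G"
  elif [ "$mbps" -ge 900 ]; then
    echo "tier2-1G"
  else
    echo "below-rated"   # flag for cable/port diagnosis
  fi
}

# Bidirectional measurement against each server (illustrative addresses):
# for host in 10.0.10.11 10.0.10.12; do
#   iperf3 -c "$host" --bidir -t 10
# done
```

Anything classified `below-rated` gets its cable and switch port checked first; in our experience, that accounted for every anomaly we found.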
Tier 1: 2.5Gbps Connections (5 servers)
| Server Role | CPU | Measured TX | Measured RX |
|---|---|---|---|
| Project Server A | Ryzen 5700G | 2.35 Gbps | 2.18 Gbps |
| Web Server | Ryzen 7840HS | 2.33 Gbps | 2.34 Gbps |
| Project Server B | Ryzen 5800U | 2.32 Gbps | 2.33 Gbps |
| Lightweight Server D | Intel H255 | 2.34 Gbps | 2.25 Gbps |
| AI Auxiliary | Ryzen 7500F | 2.31 Gbps | 2.24 Gbps |
These 2.5Gbps NICs connect directly to the AI brain server's 10Gbps NIC. The AI auxiliary server was upgraded from 100Mbps to 2.5Gbps — a 24x improvement — simply by replacing the cable and switch port.
Tier 2: 1Gbps Connections (10 servers; representative measurements below)
| Server Role | CPU | Measured TX | Measured RX |
|---|---|---|---|
| Lightweight Server A | Intel H255 | 923 Mbps | 860 Mbps |
| Project Server C | Ryzen 5825U | 922 Mbps | 883 Mbps |
| Project Server D | Ryzen 5825U | 921 Mbps | 925 Mbps |
| Project Server E | Ryzen 5825U | 921 Mbps | 919 Mbps |
| Proxy Server | Intel N100 | 921 Mbps | 938 Mbps |
| Backup Server | Ryzen 5825U | 920 Mbps | 838 Mbps |
All 1Gbps-tier servers achieve 920+ Mbps TX. Previously, four servers ran at 456–731 Mbps due to faulty cables or mismatched port speeds. After diagnosing and replacing hardware, every server reached its rated speed.
Network Optimization Results
| Server | Before | After | Action |
|---|---|---|---|
| AI Auxiliary | 95 Mbps | 2.31 Gbps | Cable/port swap — upgraded to 2.5Gbps (24x) |
| Project Server D | 456 Mbps | 921 Mbps | Reached full 1Gbps (2x) |
| Project Server C | 610 Mbps | 922 Mbps | Normalized (51% improvement) |
| Project Server E | 731 Mbps | 921 Mbps | Normalized (26% improvement) |
SSH Mesh Security
All 16 servers have 17 SSH public keys cross-registered, enabling bidirectional key authentication between any pair of servers. Password authentication is completely disabled across the internal network.
Key Structure
The 17 keys include: 1 external workstation key (access to all servers), 1 AI brain server key, 1 proxy server key, 1 web server key, 12 project/lightweight server keys, and 1 backup server key. All keys use the Ed25519 algorithm. Every server's authorized_keys file contains all 17 public keys, so any server can SSH into any other server instantly.
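Building and distributing the mesh can be sketched as follows. The key-collection directory and host range are illustrative assumptions; the `ssh-keygen` and `ssh-copy-id`-style push are standard OpenSSH usage:

```shell
#!/usr/bin/env bash
# Sketch: merge all collected public keys into one authorized_keys payload,
# then push it to every server. Paths and the host range are illustrative.
set -euo pipefail

# Merge every collected .pub file into a single authorized_keys body
merge_keys() {
  local keydir="$1"
  cat "$keydir"/*.pub
}

# Per-server key generation (run once on each machine):
# ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N '' -C "$(hostname)"

# Distribution to all servers (illustrative address range):
# for host in 10.0.10.{10..25}; do
#   merge_keys ./collected-keys | \
#     ssh "$host" 'cat > ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'
# done
```

Overwriting `authorized_keys` wholesale (rather than appending) keeps the file identical on every server, which makes key rotation a single re-run of the loop.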
Security Configuration
| Setting | Internal (16 servers) | External Backup | Purpose |
|---|---|---|---|
| PasswordAuthentication | OFF | ON | Block password login (key-only) |
| PermitRootLogin | OFF | Default | Block direct root access |
| MaxAuthTries | 3 | 6 | Limit attempts against brute force |
| sudo NOPASSWD | Enabled | Enabled | Automation-friendly sudo |
| DenyUsers | Enabled | — | Reject logins originating from the external backup server |
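The internal-server column of the table maps to an sshd drop-in roughly like the one below. The drop-in filename, the denied account name, and its host pattern are illustrative assumptions, not our literal config:

```
# /etc/ssh/sshd_config.d/10-internal.conf (illustrative drop-in)
PasswordAuthentication no
PermitRootLogin no
MaxAuthTries 3
# Reject the external backup server's account even if it reaches us over the network
DenyUsers backupuser@10.0.20.*
```

Validate with `sshd -t` and reload sshd before closing your current session, so a typo cannot lock you out of a key-only server.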
Why Mesh Instead of a Bastion Host?
Traditional bastion (jump host) architectures create a single point of failure. In our setup, every server needs to communicate with every other server for deployments, monitoring, and backups. A bastion host would bottleneck all inter-server traffic. With a mesh topology, we distribute 17 keys to all servers — any server can reach any other directly. Security is maintained through disabled password auth and a MaxAuthTries limit of 3.
External Server Isolation
The backup server sits on an external network and is completely blocked from accessing internal servers. Internal servers can reach the backup server (for pushing backups), but the reverse direction is denied at two layers.
Dual-Layer Blocking
| Layer | Method | Effect |
|---|---|---|
| Network | Remove internal subnet routing from WireGuard VPN config | External server cannot ping or reach internal IPs |
| SSH | DenyUsers directive on all 16 internal servers | Even if VPN is bypassed, SSH authentication is rejected |
Network-level blocking alone is insufficient — a VPN configuration change could bypass it. Adding SSH-level denial means that even if the network layer is compromised, authentication still fails. If the external backup server is ever breached, the attacker has zero access to internal infrastructure.
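On the network layer, the block amounts to stripping the internal subnet from the backup server's WireGuard peer configuration. A hedged sketch, where the key placeholder and the backup-side endpoint address are illustrative:

```
# Backup server's wg0.conf peer section (sketch; addresses illustrative)
[Peer]
PublicKey = <gateway-public-key>
# Only the backup exchange endpoint is routed through the tunnel.
# Deliberately absent: 10.0.10.0/24 (the internal subnet).
AllowedIPs = 10.0.20.1/32
```

Because AllowedIPs acts as both a routing table and an inbound filter in WireGuard, omitting the internal subnet means the backup server can neither route to it nor receive traffic claiming to come from it.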
CPU Turbo Boost Control
The biggest enemy of 24/7 server operation is heat. Disabling CPU turbo boost caps the maximum clock speed, but stabilizes temperature and power consumption — critical for long-term reliability.
Which Servers Get Boost Disabled?
| Server Group | CPU | Boost Disabled | Reason |
|---|---|---|---|
| AI Brain | Ryzen 9950X3D | No | Maximum inference performance required |
| AI Auxiliary | Ryzen 7500F | No | Inference performance needed |
| Proxy | Intel N100 | Yes | Reverse proxy needs minimal CPU |
| Web Server | Ryzen 7840HS | Yes | Web serving does not need boost |
| Project Servers (5) | Ryzen 5825U x5 | Yes | Long-term stability over peak speed |
| Backup Server | Ryzen 5825U | Yes | NFS serving does not need boost |
| Lightweight (4) | Intel H255 x4 | No | Low-power CPUs — boost impact is minimal |
Implementation: AMD vs Intel
On AMD systems using amd-pstate-epp, writing 0 to the /sys/.../boost file disables turbo boost. On Intel systems using intel_pstate, writing 1 to /sys/.../no_turbo achieves the same effect — note the inverted logic. Both are implemented as systemd services that apply at boot and can be toggled on demand:
```
# AMD: echo 0 to disable boost
[Service]
ExecStart=/bin/bash -c 'echo 0 > /sys/.../boost'
ExecStop=/bin/bash -c 'echo 1 > /sys/.../boost'
RemainAfterExit=yes

# Intel: echo 1 to disable (inverted)
[Service]
ExecStart=/bin/bash -c 'echo 1 > /sys/.../no_turbo'
ExecStop=/bin/bash -c 'echo 0 > /sys/.../no_turbo'
RemainAfterExit=yes
```

One critical caveat: power-profiles-daemon may re-enable boost on startup. Set After=power-profiles-daemon.service in your systemd unit to ensure correct ordering. AI servers keep boost enabled — instead, we manage GPU temperatures through power limit tuning.
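The driver-specific knobs can be wrapped in one helper that picks the right sysfs file and value. The concrete paths below are the usual locations on recent kernels, but they can vary by kernel version, so treat them as an assumption to verify on your hardware:

```shell
#!/usr/bin/env bash
# Sketch: choose the sysfs knob and value that disables turbo boost,
# based on which cpufreq driver is active. Paths may vary by kernel.
set -euo pipefail

# Given the active driver name, print "<sysfs-file> <value-to-write>"
boost_off_knob() {
  case "$1" in
    amd-pstate-epp|amd-pstate) echo "/sys/devices/system/cpu/cpufreq/boost 0" ;;
    intel_pstate)              echo "/sys/devices/system/cpu/intel_pstate/no_turbo 1" ;;  # inverted
    *) echo "unsupported driver: $1" >&2; return 1 ;;
  esac
}

# Apply on a live system (requires root):
# driver=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver)
# read -r file value < <(boost_off_knob "$driver")
# echo "$value" | sudo tee "$file"
```

Detecting the driver at runtime instead of hardcoding it lets the same systemd unit ship to every non-AI server in the fleet, AMD and Intel alike.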
Operational Principles — Summary
After months of building and operating this 16-server infrastructure, we follow five principles that keep everything running reliably:
- Separate servers by role. AI, proxy, project, and backup servers are physically isolated. A crash in one role group never affects another.
- Key-only SSH, passwords off. 17 SSH keys cross-registered across all servers with password authentication disabled. Brute force attacks are eliminated.
- Dual-layer external isolation. Network (VPN routing) plus SSH (DenyUsers) blocks external servers from reaching internal infrastructure — even if one layer fails.
- Disable CPU boost for stability. All non-AI servers run without turbo boost. For 24/7 operation, thermal and power stability equals reliability.
- Measure bandwidth before placement. Use iperf3 to measure actual throughput, then place critical services on the fastest links. Data-driven decisions, not guesses.
Infrastructure Summary
| Category | Status |
|---|---|
| Server count | 16 internal + 1 external backup |
| Network | 10.0.10.0/24 — 5 servers at 2.5Gbps, 10 at 1Gbps (all above 920Mbps) |
| SSH security | 17-key mesh, passwords OFF, MaxAuthTries 3 |
| External isolation | WireGuard + DenyUsers dual blocking |
| CPU boost control | Disabled on 8 servers (project/proxy/backup) |
| AI GPUs | RTX PRO 6000 + RTX 5060 Ti (cross-server inference) |
| Backup | IronWolf 12TB x2, NFS network storage |
Running AI infrastructure on-premise without cloud services is entirely feasible. The key is not the number of servers — it is the discipline of role separation, security, and stability. As servers multiply, adding Grafana + Prometheus monitoring becomes essential for real-time visibility. These operational principles are what make advanced capabilities like cross-server inference, concurrent load testing, and GPU power optimization actually work in production.