Grafana + Prometheus for 16 Servers — From Installation to Alerting
One server? htop is enough. But when you are running 16 servers simultaneously, switching between 16 terminal windows is impractical and problems get discovered too late. With Grafana + Prometheus, a single dashboard monitors CPU, memory, temperature, GPU, and disk across every server in real time — and alerts you the moment something goes wrong.
Why Centralized Monitoring
We previously used shell-script-based monitoring that collected GPU logs into CSV files. It worked for a few servers, but at 16 machines the limitations became obvious:
| Capability | Shell Scripts | Grafana + Prometheus |
|---|---|---|
| Real-time monitoring | No — requires log file inspection | Yes — 15-second auto-refresh |
| Multi-server | No — run individually per server | Yes — centralized collection, single view |
| Alerting | Manual (custom mail scripts) | Automatic condition-based alerts |
| Visualization | No — needs external tools | Graphs, gauges, heatmaps built in |
| Historical queries | Manual CSV parsing | Drag to select any time range |
| Cost | Free | Free (open source) |
Shell scripts are not inferior — for 1–2 servers they are lighter and more efficient. But at scale, centralized monitoring becomes mandatory.
Architecture Design
The monitoring stack has four layers. Prometheus sits at the center pulling metrics from exporters on every server, and Grafana provides the visualization layer:
| Component | Role | Target |
|---|---|---|
| Prometheus 3.x | Metrics collection engine, alert evaluation | 1 monitoring server |
| Grafana 12.x | Visualization dashboards | 1 monitoring server |
| node_exporter 1.x | System metrics (CPU, RAM, disk, temperature) | All 16 servers |
| nvidia_gpu_exporter | GPU temperature, VRAM, power, utilization | 2 GPU servers |
| blackbox_exporter | TCP port liveness checks | Web service ports |
| textfile collector | Custom metrics (HDD SMART) | 1 NAS server |
Prometheus Installation and Configuration
Prometheus installs as a single binary. Download it, place it in /usr/local/bin/, and register a systemd service. Key configuration points:
- Bind to `127.0.0.1:9090` only — there is no reason to expose Prometheus externally.
- Set `--storage.tsdb.retention.time=30d` — 16 servers for 30 days uses only ~600 MB.
- Reference an `alert_rules.yml` file for alert definitions.
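A minimal systemd unit tying these settings together might look like the sketch below (the service user, data paths, and file locations are assumptions; adjust them to your layout):

```ini
# /etc/systemd/system/prometheus.service — minimal sketch
[Unit]
Description=Prometheus
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.listen-address=127.0.0.1:9090
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After placing the file, `systemctl daemon-reload && systemctl enable --now prometheus` activates the service.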
The `prometheus.yml` core structure defines scrape targets:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'server-1:9100'
          - 'server-2:9100'
          # ... all 16 servers
    # Map IP addresses to readable server names
    relabel_configs:
      - source_labels: [__address__]
        regex: '10.0.10.100:.*'
        target_label: instance
        replacement: 'main-server'

  - job_name: 'nvidia_gpu'
    scrape_interval: 5s  # GPU needs faster polling
    static_configs:
      - targets: ['localhost:9835']
        labels:
          instance: 'gpu-server-1'

  - job_name: 'blackbox_tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - 'proxy-server:80'
          - 'proxy-server:443'
    # Standard blackbox relabeling: pass the listed target as the ?target=
    # parameter and point the actual scrape at the exporter itself.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: '127.0.0.1:9115'  # blackbox_exporter address
```

The `relabel_configs` section is critical: it replaces IP addresses with human-readable server names in dashboards. Distinguishing 16 servers by IP alone is impractical.
node_exporter — Collecting from 16 Servers
node_exporter is a lightweight agent installed on every server. It automatically exposes hundreds of metrics covering CPU, memory, disk, network, and hardware temperatures. Install it as a systemd service with the --collector.hwmon flag to enable temperature sensors.
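A sketch of the matching systemd unit (the binary path and service user are assumptions; note that the hwmon collector is enabled by default in recent node_exporter releases, so the flag mainly documents intent):

```ini
# /etc/systemd/system/node_exporter.service — sketch
[Unit]
Description=Prometheus node_exporter
After=network-online.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.hwmon \
  --collector.textfile.directory=/var/lib/node_exporter/textfile
Restart=on-failure

[Install]
WantedBy=multi-user.target
```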
| Metric | Description | Use Case |
|---|---|---|
| `node_cpu_seconds_total` | CPU time by mode | CPU usage calculation |
| `node_memory_MemAvailable_bytes` | Available memory | Memory usage % |
| `node_filesystem_avail_bytes` | Free disk space | Disk full alerts |
| `node_hwmon_temp_celsius` | Hardware temperature sensors | CPU, NVMe, DDR5 temp |
| `node_network_receive_bytes_total` | Network bytes received | Traffic monitoring |
| `node_boot_time_seconds` | Boot timestamp | Uptime calculation |
For bulk deployment across 16 servers, use SSH key authentication with a simple `for host in server1 server2 ...` loop to install node_exporter on all machines simultaneously.
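A dry-run sketch of that loop (hostnames and the node_exporter version are placeholders; the actual ssh rollout is commented out so the script can be run safely anywhere):

```shell
#!/bin/sh
# deploy_node_exporter: run the install steps on every host in the list.
# Hostnames and version are placeholders; the ssh line is commented out so
# the loop dry-runs without touching any machine.
deploy_node_exporter() {
  version="$1"; shift
  for host in "$@"; do
    echo "deploying node_exporter v${version} to ${host}"
    # ssh "$host" "curl -sL https://github.com/prometheus/node_exporter/releases/download/v${version}/node_exporter-${version}.linux-amd64.tar.gz | tar xz \
    #   && sudo mv node_exporter-${version}.linux-amd64/node_exporter /usr/local/bin/ \
    #   && sudo systemctl restart node_exporter"
  done
}

# Extend the host list to all 16 servers.
deploy_node_exporter "1.8.2" server-1 server-2 server-3
```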
GPU Monitoring
Servers with NVIDIA GPUs get an additional nvidia_gpu_exporter — a lightweight exporter that converts nvidia-smi data into Prometheus format on port 9835. We set the GPU scrape interval to 5 seconds (vs 15 seconds for system metrics) because GPU temperature and VRAM can change dramatically within seconds during AI inference.
| Metric | Description | Alert Threshold |
|---|---|---|
| `nvidia_smi_temperature_gpu` | GPU core temperature | > 80 °C warning, > 85 °C critical |
| `nvidia_smi_memory_used_bytes` | VRAM usage | > 90% warning |
| `nvidia_smi_utilization_gpu_ratio` | GPU utilization | Inference load tracking |
| `nvidia_smi_power_draw_watts` | Real-time power consumption | Power limit monitoring |
Service Liveness with blackbox_exporter
A server can be alive while its services are dead. The blackbox_exporter performs external TCP connection attempts to verify that specific service ports are responding. When `probe_success == 0`, the service is flagged as down and an alert fires. Typical targets include HTTP/HTTPS ports on the reverse proxy and API ports for AI inference services.
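On the exporter side, the `tcp_connect` module that Prometheus references in its probe parameters is defined in the exporter's own `blackbox.yml`; a minimal definition looks like this:

```yaml
# blackbox.yml — module definition for plain TCP connect probes
modules:
  tcp_connect:
    prober: tcp
    timeout: 5s
```

With this module loaded, `probe_success` is exposed per target and can feed a service-down alert.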
Grafana Dashboard Design
After installing Grafana and connecting Prometheus as a data source, we built 6 purpose-specific dashboards:
| # | Dashboard | Key Panels | Purpose |
|---|---|---|---|
| 1 | All Servers Overview | CPU, RAM, disk usage (16 servers at a glance) | Daily checks |
| 2 | Temperature Management | CPU, GPU, NVMe, DDR5 temperature trends | Thermal monitoring |
| 3 | GPU Real-time | GPU utilization, VRAM, temperature, power | AI inference monitoring |
| 4 | Single Server Detail | Deep-dive analysis of one server | Debugging |
| 5 | Version Management | OS, kernel, Node.js, Python versions, pending apt updates | Patch management |
| 6 | Service Status | TCP port liveness, response times | Service availability |
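Rather than clicking through the UI, the Prometheus data source can also be provisioned declaratively, so a rebuilt Grafana comes up preconfigured (the file path follows Grafana's standard provisioning layout; the data source name is arbitrary):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://127.0.0.1:9090
    isDefault: true
```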
Essential PromQL Queries
```promql
# CPU usage (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage (%)
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)

# Server uptime (days)
(time() - node_boot_time_seconds) / 86400

# GPU temperature
nvidia_smi_temperature_gpu

# GPU VRAM usage (%)
nvidia_smi_memory_used_bytes / nvidia_smi_memory_total_bytes * 100
```

Alert Rules
Dashboards alone cannot provide 24/7 monitoring. Alert rules trigger automatically when thresholds are exceeded. The `for` clause requires the condition to persist for a specified duration, preventing false positives from momentary spikes.
| Alert | Condition | Duration | Severity |
|---|---|---|---|
| GPU Temperature Warning | > 80 °C | 1 min | warning |
| GPU Temperature Critical | > 85 °C | 30 sec | critical |
| GPU VRAM High | > 90% | 5 min | warning |
| Server Down | up == 0 | 5 min | critical |
| Disk Space Warning | > 90% | 5 min | warning |
| High CPU | > 95% | 10 min | warning |
| High Memory | > 95% | 5 min | warning |
| Service Down | probe_success == 0 | 2 min | critical |
Notice the different duration thresholds: GPU critical temperature fires after only 30 seconds (hardware damage risk) while CPU overload waits 10 minutes (transient spikes are normal under load).
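As a concrete illustration, the two GPU temperature rows of the table translate into `alert_rules.yml` entries roughly like this (alert names and annotation text are illustrative):

```yaml
groups:
  - name: gpu_alerts
    rules:
      - alert: GPUTemperatureWarning
        expr: nvidia_smi_temperature_gpu > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "GPU on {{ $labels.instance }} above 80 °C"
      - alert: GPUTemperatureCritical
        expr: nvidia_smi_temperature_gpu > 85
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "GPU on {{ $labels.instance }} above 85 °C"
```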
Operational Tips
- Bind Prometheus locally — `127.0.0.1:9090` prevents external access. Grafana communicates via localhost. An exposed Prometheus endpoint leaks infrastructure details.
- Use `relabel_configs` — Map IPs to meaningful names like "main-server" or "gpu-server-1". Identifying 16 servers by IP address is unrealistic.
- 30-day retention is cheap — 16 servers for 30 days consumes only ~600 MB. Extend to 90 or 180 days with minimal disk impact.
- GPU scraping at 5-second intervals — System metrics are fine at 15 seconds, but GPU temperature during AI inference can spike in seconds.
- Back up dashboards as JSON — Export Grafana dashboards to JSON files and version them in Git. Server rebuilds restore dashboards automatically.
- textfile collector for custom metrics — Write `.prom` files for anything node_exporter does not cover natively: HDD SMART data, pending apt updates, custom service health checks.
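The textfile-collector tip can be sketched as a small script: write the metric to a temp file, then rename it atomically so node_exporter never scrapes a half-written file. The metric name, device, and demo value are illustrative; a real cron job would parse `smartctl -A` output instead:

```shell
#!/bin/sh
# write_smart_metric: emit one SMART temperature gauge in textfile-collector
# format. Directory defaults to /tmp for the demo; in production point it at
# the directory passed to --collector.textfile.directory.
TEXTFILE_DIR="${TEXTFILE_DIR:-/tmp}"

write_smart_metric() {
  device="$1"; temp="$2"
  out="${TEXTFILE_DIR}/smart_${device}.prom"
  tmp="${out}.tmp"
  {
    echo "# HELP smart_device_temperature_celsius Drive temperature from SMART"
    echo "# TYPE smart_device_temperature_celsius gauge"
    echo "smart_device_temperature_celsius{device=\"${device}\"} ${temp}"
  } > "$tmp"
  mv "$tmp" "$out"   # atomic rename: node_exporter never sees a partial file
}

# Demo value; a real job would read the temperature from smartctl.
write_smart_metric sda 38
```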
The key insight from monitoring 16 servers: the purpose of monitoring is not to discover that a server has died, but to detect problems before they cause downtime. Grafana + Prometheus — both free, both open source — deliver enterprise-grade observability for any server fleet, from a handful of machines to hundreds.