
Grafana + Prometheus for 16 Servers — From Installation to Alerting

One server? htop is enough. But when you are running 16 servers simultaneously, switching between 16 terminal windows is impractical and problems get discovered too late. With Grafana + Prometheus, a single dashboard monitors CPU, memory, temperature, GPU, and disk across every server in real time — and alerts you the moment something goes wrong.

Why Centralized Monitoring

We previously used shell-script-based monitoring that collected GPU logs into CSV files. It worked for a few servers, but at 16 machines the limitations became obvious:

| Capability | Shell Scripts | Grafana + Prometheus |
|---|---|---|
| Real-time monitoring | No — requires log file inspection | Yes — 15-second auto-refresh |
| Multi-server | No — run individually per server | Yes — centralized collection, single view |
| Alerting | Manual (custom mail scripts) | Automatic condition-based alerts |
| Visualization | No — needs external tools | Graphs, gauges, heatmaps built in |
| Historical queries | Manual CSV parsing | Drag to select any time range |
| Cost | Free | Free (open source) |

Shell scripts are not inferior — for 1–2 servers they are lighter and more efficient. But at scale, centralized monitoring becomes mandatory.

Architecture Design

The monitoring stack has four layers. Prometheus sits at the center pulling metrics from exporters on every server, and Grafana provides the visualization layer:

| Component | Role | Target |
|---|---|---|
| Prometheus 3.x | Metrics collection engine, alert evaluation | 1 monitoring server |
| Grafana 12.x | Visualization dashboards | 1 monitoring server |
| node_exporter 1.x | System metrics (CPU, RAM, disk, temperature) | All 16 servers |
| nvidia_gpu_exporter | GPU temperature, VRAM, power, utilization | 2 GPU servers |
| blackbox_exporter | TCP port liveness checks | Web service ports |
| textfile collector | Custom metrics (HDD SMART) | 1 NAS server |

Prometheus Installation and Configuration

Prometheus installs as a single binary. Download it, place it in /usr/local/bin/, and register a systemd service. Key configuration points:

  • Bind to 127.0.0.1:9090 only — there is no reason to expose Prometheus externally.
  • Set --storage.tsdb.retention.time=30d — 16 servers for 30 days uses only ~600 MB.
  • Reference an alert_rules.yml file for alert definitions.
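A minimal systemd unit covering the points above might look like this (the dedicated prometheus user and the config/data paths are assumptions; adjust to your layout):

```ini
# /etc/systemd/system/prometheus.service (sketch)
[Unit]
Description=Prometheus
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.listen-address=127.0.0.1:9090
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After `systemctl daemon-reload && systemctl enable --now prometheus`, the UI is reachable only from the monitoring host itself, which is exactly the intent.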

The prometheus.yml core structure defines scrape targets:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
        - 'server-1:9100'
        - 'server-2:9100'
        # ... all 16 servers

    # Map IP addresses to readable server names
    relabel_configs:
      - source_labels: [__address__]
        regex: '10.0.10.100:.*'
        target_label: instance
        replacement: 'main-server'

  - job_name: 'nvidia_gpu'
    scrape_interval: 5s    # GPU needs faster polling
    static_configs:
      - targets: ['localhost:9835']
        labels:
          instance: 'gpu-server-1'

  - job_name: 'blackbox_tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
        - 'proxy-server:80'
        - 'proxy-server:443'
    # Route each probe through the blackbox_exporter (default port 9115);
    # without this relabeling, Prometheus would scrape the targets directly.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: '127.0.0.1:9115'

The relabel_configs section is critical: it replaces IP addresses with human-readable server names in dashboards. Distinguishing 16 servers by IP alone is impractical.

node_exporter — Collecting from 16 Servers

node_exporter is a lightweight agent installed on every server. It automatically exposes hundreds of metrics covering CPU, memory, disk, network, and hardware temperatures. Install it as a systemd service with the --collector.hwmon flag to enable temperature sensors.
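A minimal unit along those lines (the service user and the textfile directory are assumptions; note that the hwmon collector is enabled by default in recent releases, so the flag mainly documents intent):

```ini
# /etc/systemd/system/node_exporter.service (sketch)
[Unit]
Description=Prometheus node_exporter
After=network-online.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.hwmon \
  --collector.textfile.directory=/var/lib/node_exporter/textfile
Restart=on-failure

[Install]
WantedBy=multi-user.target
```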

| Metric | Description | Use Case |
|---|---|---|
| node_cpu_seconds_total | CPU time by mode | CPU usage calculation |
| node_memory_MemAvailable_bytes | Available memory | Memory usage % |
| node_filesystem_avail_bytes | Free disk space | Disk full alerts |
| node_hwmon_temp_celsius | Hardware temperature sensors | CPU, NVMe, DDR5 temp |
| node_network_receive_bytes_total | Network bytes received | Traffic monitoring |
| node_boot_time_seconds | Boot timestamp | Uptime calculation |

For bulk deployment across 16 servers, set up SSH key authentication and use a simple shell loop (for host in server1 server2 ...;) to install node_exporter on all machines in one pass.
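The loop can be sketched like this. Hostnames, the tarball name, and the remote commands are placeholders; the real SSH/scp steps are commented out so the sketch is safe to run as-is:

```shell
#!/bin/sh
# Bulk-install node_exporter over SSH (sketch; assumes SSH keys are
# already distributed to every host).
HOSTS="server-1 server-2 server-3"   # ... extend to all 16 servers

deploy() {
  host="$1"
  # Real steps would look roughly like:
  #   scp node_exporter-*.linux-amd64.tar.gz "$host:/tmp/"
  #   ssh "$host" 'sudo tar -C /usr/local/bin --strip-components=1 \
  #       -xzf /tmp/node_exporter-*.tar.gz \
  #       && sudo systemctl enable --now node_exporter'
  echo "deploy: $host"
}

for host in $HOSTS; do
  deploy "$host"
done
```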

GPU Monitoring

Servers with NVIDIA GPUs get an additional nvidia_gpu_exporter — a lightweight exporter that converts nvidia-smi data into Prometheus format on port 9835. We set the GPU scrape interval to 5 seconds (vs 15 seconds for system metrics) because GPU temperature and VRAM can change dramatically within seconds during AI inference.

| Metric | Description | Alert Threshold |
|---|---|---|
| nvidia_smi_temperature_gpu | GPU core temperature | > 80 C warning, > 85 C critical |
| nvidia_smi_memory_used_bytes | VRAM usage | > 90% warning |
| nvidia_smi_utilization_gpu_ratio | GPU utilization | Inference load tracking |
| nvidia_smi_power_draw_watts | Real-time power consumption | Power limit monitoring |

Service Liveness with blackbox_exporter

A server can be alive while its services are dead. The blackbox_exporter performs external TCP connection attempts to verify that specific service ports are responding. When probe_success == 0, the service is flagged as down and an alert fires. Typical targets include HTTP/HTTPS ports on the reverse proxy and API ports for AI inference services.
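On the monitoring server, blackbox_exporter needs a matching module definition. A minimal blackbox.yml for the tcp_connect module referenced from prometheus.yml:

```yaml
modules:
  tcp_connect:
    prober: tcp
    timeout: 5s
```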

Grafana Dashboard Design

After installing Grafana and connecting Prometheus as a data source, we built 6 purpose-specific dashboards:

| # | Dashboard | Key Panels | Purpose |
|---|---|---|---|
| 1 | All Servers Overview | CPU, RAM, disk usage (16 servers at a glance) | Daily checks |
| 2 | Temperature Management | CPU, GPU, NVMe, DDR5 temperature trends | Thermal monitoring |
| 3 | GPU Real-time | GPU utilization, VRAM, temperature, power | AI inference monitoring |
| 4 | Single Server Detail | Deep-dive analysis of one server | Debugging |
| 5 | Version Management | OS, kernel, Node.js, Python versions, pending apt updates | Patch management |
| 6 | Service Status | TCP port liveness, response times | Service availability |

Essential PromQL Queries

# CPU usage (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage (%)
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)

# Server uptime (days)
(time() - node_boot_time_seconds) / 86400

# GPU temperature
nvidia_smi_temperature_gpu

# GPU VRAM usage (%)
nvidia_smi_memory_used_bytes / nvidia_smi_memory_total_bytes * 100

Alert Rules

Dashboards alone cannot provide 24/7 monitoring. Alert rules trigger automatically when thresholds are exceeded. The for clause requires the condition to persist for a specified duration, preventing false positives from momentary spikes.

| Alert | Condition | Duration | Severity |
|---|---|---|---|
| GPU Temperature Warning | > 80 C | 1 min | warning |
| GPU Temperature Critical | > 85 C | 30 sec | critical |
| GPU VRAM High | > 90% | 5 min | warning |
| Server Down | up == 0 | 5 min | critical |
| Disk Space Warning | > 90% | 5 min | warning |
| High CPU | > 95% | 10 min | warning |
| High Memory | > 95% | 5 min | warning |
| Service Down | probe_success == 0 | 2 min | critical |

Notice the different duration thresholds: GPU critical temperature fires after only 30 seconds (hardware damage risk) while CPU overload waits 10 minutes (transient spikes are normal under load).
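Two of these rules, written out in alert_rules.yml form (annotation wording is illustrative):

```yaml
groups:
  - name: hardware
    rules:
      - alert: GPUTemperatureCritical
        expr: nvidia_smi_temperature_gpu > 85
        for: 30s          # hardware damage risk, so fire fast
        labels:
          severity: critical
        annotations:
          summary: "GPU on {{ $labels.instance }} above 85 C"

      - alert: ServerDown
        expr: up == 0
        for: 5m           # tolerate brief scrape hiccups
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} unreachable for 5 minutes"
```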

Operational Tips

  • Bind Prometheus locally — 127.0.0.1:9090 prevents external access. Grafana communicates via localhost. An exposed Prometheus endpoint leaks infrastructure details.
  • Use relabel_configs — Map IPs to meaningful names like "main-server" or "gpu-server-1". Identifying 16 servers by IP address is unrealistic.
  • 30-day retention is cheap — 16 servers for 30 days consumes only ~600 MB. Extend to 90 or 180 days with minimal disk impact.
  • GPU scraping at 5-second intervals — System metrics are fine at 15 seconds, but GPU temperature during AI inference can spike in seconds.
  • Back up dashboards as JSON — Export Grafana dashboards to JSON files and version them in Git. Server rebuilds restore dashboards automatically.
  • textfile collector for custom metrics — Write .prom files for anything node_exporter does not cover natively: HDD SMART data, pending apt updates, custom service health checks.
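The textfile-collector pattern from the last bullet can be sketched as a small script that emits Prometheus text exposition format. The metric name (apt_upgrades_pending) and the target directory are assumptions; in production you would cron this and redirect the output into the textfile directory via an atomic rename (write to a .tmp file, then mv), so node_exporter never reads a half-written file:

```shell
#!/bin/sh
# Emit a pending-apt-upgrades gauge in Prometheus text exposition format.
emit_apt_metric() {
  # Count 'Inst' lines from a simulated upgrade; falls back to 0
  # where apt is unavailable.
  pending=$( (apt-get -s upgrade 2>/dev/null || true) | grep -c '^Inst ' )
  printf '# HELP apt_upgrades_pending Pending apt package upgrades\n'
  printf '# TYPE apt_upgrades_pending gauge\n'
  printf 'apt_upgrades_pending %s\n' "$pending"
}

# Example production use (paths assumed):
#   emit_apt_metric > /var/lib/node_exporter/textfile/apt.prom.tmp \
#     && mv /var/lib/node_exporter/textfile/apt.prom.tmp \
#           /var/lib/node_exporter/textfile/apt.prom
emit_apt_metric
```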

The key insight from monitoring 16 servers: the purpose of monitoring is not to discover that a server has died, but to detect problems before they cause downtime. Grafana + Prometheus — both free, both open source — deliver enterprise-grade observability for any server fleet, from a handful of machines to hundreds.