
Grafana + Prometheus for 16 Servers — From Installation to Alerting

One server? htop is enough. But when you are running 16 servers simultaneously, switching between 16 terminal windows is impractical and problems get discovered too late. With Grafana + Prometheus, a single dashboard monitors CPU, memory, temperature, GPU, and disk across every server in real time — and alerts you the moment something goes wrong.

Why Centralized Monitoring

We previously used shell-script-based monitoring that collected GPU logs into CSV files. It worked for a few servers, but at 16 machines the limitations became obvious:

| Capability | Shell Scripts | Grafana + Prometheus |
|---|---|---|
| Real-time monitoring | No — requires log file inspection | Yes — 15-second auto-refresh |
| Multi-server | No — run individually per server | Yes — centralized collection, single view |
| Alerting | Manual (custom mail scripts) | Automatic condition-based alerts |
| Visualization | No — needs external tools | Graphs, gauges, heatmaps built in |
| Historical queries | Manual CSV parsing | Drag to select any time range |
| Cost | Free | Free (open source) |

Shell scripts are not inferior — for 1–2 servers they are lighter and more efficient. But at scale, centralized monitoring becomes mandatory.

Architecture Design

The monitoring stack has four layers. Prometheus sits at the center pulling metrics from exporters on every server, and Grafana provides the visualization layer:

| Component | Role | Target |
|---|---|---|
| Prometheus 3.x | Metrics collection engine, alert evaluation | 1 monitoring server |
| Grafana 12.x | Visualization dashboards | 1 monitoring server |
| node_exporter 1.x | System metrics (CPU, RAM, disk, temperature) | All 16 servers |
| nvidia_gpu_exporter | GPU temperature, VRAM, power, utilization | 2 GPU servers |
| blackbox_exporter | TCP port liveness checks | Web service ports |
| textfile collector | Custom metrics (HDD SMART) | 1 NAS server |

Prometheus Installation and Configuration

Prometheus installs as a single binary. Download it, place it in /usr/local/bin/, and register a systemd service. Key configuration points:

  • Bind to 127.0.0.1:9090 only — there is no reason to expose Prometheus externally.
  • Set --storage.tsdb.retention.time=30d — 16 servers for 30 days uses only ~600 MB.
  • Reference an alert_rules.yml file for alert definitions.
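A minimal systemd unit covering the points above might look like this (the dedicated prometheus user and the config/data paths are assumptions; adjust to your layout):

```ini
# /etc/systemd/system/prometheus.service (sketch)
[Unit]
Description=Prometheus
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.listen-address=127.0.0.1:9090
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After `systemctl daemon-reload && systemctl enable --now prometheus`, the UI is reachable only from the monitoring host itself, which is exactly the intent.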

The prometheus.yml core structure defines scrape targets:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
        - 'server-1:9100'
        - 'server-2:9100'
        # ... all 16 servers

    # Map IP addresses to readable server names
    relabel_configs:
      - source_labels: [__address__]
        regex: '10.0.10.100:.*'
        target_label: instance
        replacement: 'main-server'

  - job_name: 'nvidia_gpu'
    scrape_interval: 5s    # GPU needs faster polling
    static_configs:
      - targets: ['localhost:9835']
        labels:
          instance: 'gpu-server-1'

  - job_name: 'blackbox_tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
        - 'proxy-server:80'
        - 'proxy-server:443'
    # Route each probe through the blackbox_exporter (default port 9115);
    # without this relabeling, Prometheus would scrape the targets directly.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: '127.0.0.1:9115'

The relabel_configs section is critical: it replaces IP addresses with human-readable server names in dashboards. Distinguishing 16 servers by IP alone is impractical.

node_exporter — Collecting from 16 Servers

node_exporter is a lightweight agent installed on every server. It automatically exposes hundreds of metrics covering CPU, memory, disk, network, and hardware temperatures. Install it as a systemd service with the --collector.hwmon flag to enable temperature sensors.
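A minimal unit along those lines (the service user and the textfile directory are assumptions; note that the hwmon collector is enabled by default in recent releases, so the flag mainly documents intent):

```ini
# /etc/systemd/system/node_exporter.service (sketch)
[Unit]
Description=Prometheus node_exporter
After=network-online.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.hwmon \
  --collector.textfile.directory=/var/lib/node_exporter/textfile
Restart=on-failure

[Install]
WantedBy=multi-user.target
```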

| Metric | Description | Use Case |
|---|---|---|
| node_cpu_seconds_total | CPU time by mode | CPU usage calculation |
| node_memory_MemAvailable_bytes | Available memory | Memory usage % |
| node_filesystem_avail_bytes | Free disk space | Disk full alerts |
| node_hwmon_temp_celsius | Hardware temperature sensors | CPU, NVMe, DDR5 temp |
| node_network_receive_bytes_total | Network bytes received | Traffic monitoring |
| node_boot_time_seconds | Boot timestamp | Uptime calculation |

For bulk deployment across 16 servers, set up SSH key authentication and use a simple shell loop (for host in server1 server2 ...;) to install node_exporter on all machines in one pass.
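The loop can be sketched like this. Hostnames, the tarball name, and the remote commands are placeholders; the real SSH/scp steps are commented out so the sketch is safe to run as-is:

```shell
#!/bin/sh
# Bulk-install node_exporter over SSH (sketch; assumes SSH keys are
# already distributed to every host).
HOSTS="server-1 server-2 server-3"   # ... extend to all 16 servers

deploy() {
  host="$1"
  # Real steps would look roughly like:
  #   scp node_exporter-*.linux-amd64.tar.gz "$host:/tmp/"
  #   ssh "$host" 'sudo tar -C /usr/local/bin --strip-components=1 \
  #       -xzf /tmp/node_exporter-*.tar.gz \
  #       && sudo systemctl enable --now node_exporter'
  echo "deploy: $host"
}

for host in $HOSTS; do
  deploy "$host"
done
```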

GPU Monitoring

Servers with NVIDIA GPUs get an additional nvidia_gpu_exporter — a lightweight exporter that converts nvidia-smi data into Prometheus format on port 9835. We set the GPU scrape interval to 5 seconds (vs 15 seconds for system metrics) because GPU temperature and VRAM can change dramatically within seconds during AI inference.

| Metric | Description | Alert Threshold |
|---|---|---|
| nvidia_smi_temperature_gpu | GPU core temperature | > 80 C warning, > 85 C critical |
| nvidia_smi_memory_used_bytes | VRAM usage | > 90% warning |
| nvidia_smi_utilization_gpu_ratio | GPU utilization | Inference load tracking |
| nvidia_smi_power_draw_watts | Real-time power consumption | Power limit monitoring |

Service Liveness with blackbox_exporter

A server can be alive while its services are dead. The blackbox_exporter performs external TCP connection attempts to verify that specific service ports are responding. When probe_success == 0, the service is flagged as down and an alert fires. Typical targets include HTTP/HTTPS ports on the reverse proxy and API ports for AI inference services.
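On the monitoring server, blackbox_exporter needs a matching module definition. A minimal blackbox.yml for the tcp_connect module referenced from prometheus.yml:

```yaml
modules:
  tcp_connect:
    prober: tcp
    timeout: 5s
```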

Grafana Dashboard Design

After installing Grafana and connecting Prometheus as a data source, we built 6 purpose-specific dashboards:

| # | Dashboard | Key Panels | Purpose |
|---|---|---|---|
| 1 | All Servers Overview | CPU, RAM, disk usage (16 servers at a glance) | Daily checks |
| 2 | Temperature Management | CPU, GPU, NVMe, DDR5 temperature trends | Thermal monitoring |
| 3 | GPU Real-time | GPU utilization, VRAM, temperature, power | AI inference monitoring |
| 4 | Single Server Detail | Deep-dive analysis of one server | Debugging |
| 5 | Version Management | OS, kernel, Node.js, Python versions, pending apt updates | Patch management |
| 6 | Service Status | TCP port liveness, response times | Service availability |

Essential PromQL Queries

# CPU usage (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage (%)
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)

# Server uptime (days)
(time() - node_boot_time_seconds) / 86400

# GPU temperature
nvidia_smi_temperature_gpu

# GPU VRAM usage (%)
nvidia_smi_memory_used_bytes / nvidia_smi_memory_total_bytes * 100

Alert Rules

Dashboards alone cannot provide 24/7 monitoring. Alert rules trigger automatically when thresholds are exceeded. The for clause requires the condition to persist for a specified duration, preventing false positives from momentary spikes.

| Alert | Condition | Duration | Severity |
|---|---|---|---|
| GPU Temperature Warning | > 80 C | 1 min | warning |
| GPU Temperature Critical | > 85 C | 30 sec | critical |
| GPU VRAM High | > 90% | 5 min | warning |
| Server Down | up == 0 | 5 min | critical |
| Disk Space Warning | > 90% | 5 min | warning |
| High CPU | > 95% | 10 min | warning |
| High Memory | > 95% | 5 min | warning |
| Service Down | probe_success == 0 | 2 min | critical |

Notice the different duration thresholds: GPU critical temperature fires after only 30 seconds (hardware damage risk) while CPU overload waits 10 minutes (transient spikes are normal under load).
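Two of these rules, written out in alert_rules.yml form (annotation wording is illustrative):

```yaml
groups:
  - name: hardware
    rules:
      - alert: GPUTemperatureCritical
        expr: nvidia_smi_temperature_gpu > 85
        for: 30s          # hardware damage risk, so fire fast
        labels:
          severity: critical
        annotations:
          summary: "GPU on {{ $labels.instance }} above 85 C"

      - alert: ServerDown
        expr: up == 0
        for: 5m           # tolerate brief scrape hiccups
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} unreachable for 5 minutes"
```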

Operational Tips

  • Bind Prometheus locally — 127.0.0.1:9090 prevents external access. Grafana communicates via localhost. An exposed Prometheus endpoint leaks infrastructure details.
  • Use relabel_configs — Map IPs to meaningful names like "main-server" or "gpu-server-1". Identifying 16 servers by IP address is unrealistic.
  • 30-day retention is cheap — 16 servers for 30 days consumes only ~600 MB. Extend to 90 or 180 days with minimal disk impact.
  • GPU scraping at 5-second intervals — System metrics are fine at 15 seconds, but GPU temperature during AI inference can spike in seconds.
  • Back up dashboards as JSON — Export Grafana dashboards to JSON files and version them in Git. Server rebuilds restore dashboards automatically.
  • textfile collector for custom metrics — Write .prom files for anything node_exporter does not cover natively: HDD SMART data, pending apt updates, custom service health checks.
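The textfile-collector pattern from the last bullet can be sketched as a small script that emits Prometheus text exposition format. The metric name (apt_upgrades_pending) and the target directory are assumptions; in production you would cron this and redirect the output into the textfile directory via an atomic rename (write to a .tmp file, then mv), so node_exporter never reads a half-written file:

```shell
#!/bin/sh
# Emit a pending-apt-upgrades gauge in Prometheus text exposition format.
emit_apt_metric() {
  # Count 'Inst' lines from a simulated upgrade; falls back to 0
  # where apt is unavailable.
  pending=$( (apt-get -s upgrade 2>/dev/null || true) | grep -c '^Inst ' )
  printf '# HELP apt_upgrades_pending Pending apt package upgrades\n'
  printf '# TYPE apt_upgrades_pending gauge\n'
  printf 'apt_upgrades_pending %s\n' "$pending"
}

# Example production use (paths assumed):
#   emit_apt_metric > /var/lib/node_exporter/textfile/apt.prom.tmp \
#     && mv /var/lib/node_exporter/textfile/apt.prom.tmp \
#           /var/lib/node_exporter/textfile/apt.prom
emit_apt_metric
```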

The key insight from monitoring 16 servers: the purpose of monitoring is not to discover that a server has died, but to detect problems before they cause downtime. Grafana + Prometheus — both free, both open source — deliver enterprise-grade observability for any server fleet, from a handful of machines to hundreds.