
Ever experienced a GPU server slowing down for no apparent reason? With nvidia-smi and cron, a single shell script can log GPU status around the clock and catch anomalies the moment they appear. No need for heavyweight tools like Prometheus — just start collecting data right away. When your fleet grows, consider upgrading to Grafana + Prometheus monitoring.

  • Collection Interval: 5 min
  • Log Format: CSV
  • Unattended Operation: 24/7
  • Script Size: ~50 lines

1. nvidia-smi Essentials

nvidia-smi ships with the NVIDIA driver and provides instant GPU status. The default output is useful for quick checks, but query mode is what makes automated monitoring possible.

# Basic commands

# Full GPU status
nvidia-smi

# Extract specific fields as CSV (the key command)
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,\
utilization.gpu,utilization.memory,memory.used,\
memory.total,power.draw --format=csv,noheader,nounits

# Example output:
# 2026/02/24 14:30:05, NVIDIA RTX PRO 6000, 62, 87, 45, 38400, 98304, 285.32
| Query Field        | Description                   | Unit                |
|--------------------|-------------------------------|---------------------|
| timestamp          | Recording time                | YYYY/MM/DD HH:MM:SS |
| name               | GPU model name                | string              |
| temperature.gpu    | GPU core temperature          | °C                  |
| utilization.gpu    | GPU compute utilization       | %                   |
| utilization.memory | Memory controller utilization | %                   |
| memory.used        | VRAM in use                   | MiB                 |
| memory.total       | Total VRAM                    | MiB                 |
| power.draw         | Current power consumption     | W                   |

The --format flag is critical

The csv,noheader,nounits combination produces clean CSV that's trivial to parse. Stripping units (°C, MiB, etc.) is essential for downstream processing.
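With units stripped, the output drops straight into awk. A minimal sketch, using an illustrative sample line in place of a live nvidia-smi call, that derives VRAM usage as a percentage:

```shell
#!/bin/bash
# Sample line in csv,noheader,nounits format (illustrative values,
# standing in for live nvidia-smi output)
LINE="2026/02/24 14:30:05, NVIDIA RTX PRO 6000, 62, 87, 45, 38400, 98304, 285.32"

# Field 6 = memory.used (MiB), field 7 = memory.total (MiB)
VRAM_PCT=$(echo "$LINE" | awk -F', ' '{printf "%.1f", $6 / $7 * 100}')

echo "VRAM usage: ${VRAM_PCT}%"
```

The same one-liner works unchanged on a live query because nounits guarantees the fields are bare numbers.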

2. Writing the Monitoring Script

This shell script captures nvidia-smi output to a file, splitting logs by date and auto-generating headers for new files.

# gpu-monitor.sh

#!/bin/bash

LOG_DIR="/var/log/gpu-monitor"
DATE=$(date +%Y-%m-%d)
LOG_FILE="$LOG_DIR/gpu_$DATE.csv"

# Create log directory
mkdir -p "$LOG_DIR"

# Add header if file doesn't exist
if [ ! -f "$LOG_FILE" ]; then
  echo "timestamp,gpu_name,temp_c,gpu_util,mem_util,\
mem_used_mib,mem_total_mib,power_w" > "$LOG_FILE"
fi

# Collect and append GPU data
nvidia-smi --query-gpu=timestamp,name,\
temperature.gpu,utilization.gpu,\
utilization.memory,memory.used,\
memory.total,power.draw \
--format=csv,noheader,nounits >> "$LOG_FILE"

echo "[$(date)] GPU log recorded"

# Setup and test

# Make executable
chmod +x gpu-monitor.sh

# Test run
./gpu-monitor.sh

# Verify output
cat /var/log/gpu-monitor/gpu_2026-02-24.csv

Multi-GPU environments

With multiple GPUs, nvidia-smi outputs one line per GPU automatically. Use the -i flag to monitor a specific GPU (e.g., nvidia-smi -i 0).

3. CSV Format Design

A well-designed CSV format pays dividends when you analyze the data later. Plan your columns and data types upfront.

| Column        | Type     | Description        | Analysis Use                |
|---------------|----------|--------------------|-----------------------------|
| timestamp     | datetime | Collection time    | Time-based pattern analysis |
| gpu_name      | string   | GPU model          | Multi-GPU differentiation   |
| temp_c        | int      | GPU temperature    | Thermal trend detection     |
| gpu_util      | int      | GPU utilization    | Load pattern analysis       |
| mem_util      | int      | Memory utilization | Memory bottleneck detection |
| mem_used_mib  | int      | VRAM used          | Model size tracking         |
| mem_total_mib | int      | Total VRAM         | Headroom calculation        |
| power_w       | float    | Power draw         | Electricity cost estimation |

# Sample CSV output

timestamp,gpu_name,temp_c,gpu_util,mem_util,mem_used_mib,mem_total_mib,power_w
2026/02/24 14:30:05, NVIDIA RTX PRO 6000, 62, 87, 45, 38400, 98304, 285.32
2026/02/24 14:35:05, NVIDIA RTX PRO 6000, 64, 92, 48, 38400, 98304, 298.15
2026/02/24 14:40:05, NVIDIA RTX PRO 6000, 63, 85, 44, 38400, 98304, 280.47

Split logs by date

Appending everything to a single file makes analysis painful as it grows. Date-based splitting lets you quickly query specific days and simplifies cleanup of old data.
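Date-split files also make per-day queries one-liners. A sketch, with inline sample rows standing in for a real day's log, that extracts the day's peak temperature:

```shell
#!/bin/bash
# Create a sample day's log to query (stand-in for a real
# /var/log/gpu-monitor/gpu_YYYY-MM-DD.csv file)
LOG_FILE=$(mktemp)
cat > "$LOG_FILE" <<'EOF'
timestamp,gpu_name,temp_c,gpu_util,mem_util,mem_used_mib,mem_total_mib,power_w
2026/02/24 14:30:05, NVIDIA RTX PRO 6000, 62, 87, 45, 38400, 98304, 285.32
2026/02/24 14:35:05, NVIDIA RTX PRO 6000, 64, 92, 48, 38400, 98304, 298.15
2026/02/24 14:40:05, NVIDIA RTX PRO 6000, 63, 85, 44, 38400, 98304, 280.47
EOF

# Peak temperature for the day: skip the header, track the max of column 3
MAX_TEMP=$(awk -F', ' 'NR > 1 && $3 > max { max = $3 } END { print max }' "$LOG_FILE")

echo "Peak temperature: ${MAX_TEMP}°C"
rm -f "$LOG_FILE"
```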

4. cron Scheduling

Running the script manually defeats the purpose. Register it with cron for automatic execution every 5 minutes.

# cron setup

# Edit crontab
crontab -e

# Run GPU monitoring every 5 minutes
*/5 * * * * /opt/scripts/gpu-monitor.sh >> /var/log/gpu-monitor/cron.log 2>&1

# Every 1 minute (for more precise monitoring)
*/1 * * * * /opt/scripts/gpu-monitor.sh >> /var/log/gpu-monitor/cron.log 2>&1

# Verify registration
crontab -l

| Interval | Daily Records | Monthly CSV Size | Best For                         |
|----------|---------------|------------------|----------------------------------|
| 1 min    | 1,440         | ~5 MB            | Active debugging sessions        |
| 5 min    | 288           | ~1 MB            | General operations (recommended) |
| 15 min   | 96            | ~350 KB          | Long-term trend analysis         |
| 1 hour   | 24            | ~90 KB           | Minimal monitoring               |

5-minute intervals hit the sweet spot

1-minute intervals add disk overhead; 15-minute gaps can miss sudden spikes. 5-minute intervals catch most anomaly patterns while keeping storage under 1 MB/month per GPU.
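The record counts in the table follow directly from the interval; a quick sanity check, assuming roughly 120 bytes per CSV row:

```shell
#!/bin/bash
INTERVAL_MIN=5
BYTES_PER_ROW=120   # rough size of one CSV line (assumption)

ROWS_PER_DAY=$(( 24 * 60 / INTERVAL_MIN ))
MONTHLY_KB=$(( ROWS_PER_DAY * 30 * BYTES_PER_ROW / 1024 ))

echo "${ROWS_PER_DAY} rows/day, ~${MONTHLY_KB} KB/month"
```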

5. Threshold Alerts

Collecting data without acting on it is pointless. Add alert logic that fires when values exceed safe thresholds.

# gpu-alert.sh — threshold alert script

#!/bin/bash

TEMP_THRESHOLD=85
UTIL_THRESHOLD=95
MEM_THRESHOLD=90
ALERT_LOG="/var/log/gpu-monitor/alerts.log"

# Collect GPU data
DATA=$(nvidia-smi --query-gpu=temperature.gpu,\
utilization.gpu,utilization.memory \
--format=csv,noheader,nounits)

TEMP=$(echo "$DATA" | awk -F', ' '{print $1}')
GPU_UTIL=$(echo "$DATA" | awk -F', ' '{print $2}')
MEM_UTIL=$(echo "$DATA" | awk -F', ' '{print $3}')

ALERT=""

if [ "$TEMP" -ge "$TEMP_THRESHOLD" ]; then
  ALERT+="[WARN] GPU temp ${TEMP}°C (threshold: ${TEMP_THRESHOLD}°C)\n"
fi

if [ "$GPU_UTIL" -ge "$UTIL_THRESHOLD" ]; then
  ALERT+="[WARN] GPU util ${GPU_UTIL}% (threshold: ${UTIL_THRESHOLD}%)\n"
fi

if [ "$MEM_UTIL" -ge "$MEM_THRESHOLD" ]; then
  ALERT+="[WARN] VRAM util ${MEM_UTIL}% (threshold: ${MEM_THRESHOLD}%)\n"
fi

if [ -n "$ALERT" ]; then
  echo -e "[$(date)] $ALERT" >> "$ALERT_LOG"
  # Add Slack/email notification here
fi

| Metric          | Warning    | Critical       | Action                           |
|-----------------|------------|----------------|----------------------------------|
| GPU Temperature | 80°C       | 90°C           | Check cooling, apply power limit |
| GPU Utilization | 95%        | 100% sustained | Distribute workload, adjust queue|
| VRAM Usage      | 90%        | 98%+           | Reduce model size or batch size  |
| Power Draw      | 90% of TDP | Exceeds TDP    | Set power limit via nvidia-smi   |

Multi-GPU alert scripts

The alert script above is for single-GPU setups. With multiple GPUs, nvidia-smi outputs one line per GPU — use a while read loop to iterate over each line, or target a specific GPU by index with nvidia-smi -i 0, nvidia-smi -i 1, etc. Run unmodified on a multi-GPU server, the awk extraction returns one value per line, so the integer comparisons fail and only single-line (single-GPU) output is handled correctly.

90°C+ requires immediate action

Above 90°C, the GPU begins thermal throttling, drastically reducing performance. Sustained temperatures above 95°C shorten hardware lifespan. When you receive a temperature alert, inspect cooling immediately.
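If cooling can't be improved right away, capping power draw with nvidia-smi is the usual stopgap. A sketch (the 300 W cap is an illustrative value; check your card's supported range first, and note the command needs root):

```shell
# Check the supported power-limit range for GPU 0
nvidia-smi -i 0 -q -d POWER

# Cap GPU 0 at 300 W (illustrative value — pick one inside the reported range)
sudo nvidia-smi -i 0 -pl 300
```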

6. Log Rotation

Without log rotation, continuous logging will eventually fill your disk. Use logrotate to automatically compress and purge old logs.

# /etc/logrotate.d/gpu-monitor

/var/log/gpu-monitor/gpu_*.csv {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 644 root root
}

# Explanation:
# daily         — rotate daily
# rotate 30     — keep 30 days of history
# compress      — gzip old files
# delaycompress — skip compression for today's file
# missingok     — no error if file is absent

# Alternative: shell-based cleanup

#!/bin/bash
# cleanup-gpu-logs.sh — delete logs older than 30 days

LOG_DIR="/var/log/gpu-monitor"
RETENTION_DAYS=30

find "$LOG_DIR" -name "gpu_*.csv" \
  -mtime +$RETENTION_DAYS -delete

echo "[$(date)] Deleted logs older than $RETENTION_DAYS days"

Storage Requirements (per GPU)

5-min intervals: 288 rows/day, ~35 KB/day
30-day retention: ~1 MB (uncompressed)
With gzip: ~200 KB/month
1-year archive: ~2.5 MB (compressed)

A full disk kills the entire server

When logs fill the disk, the OS itself can freeze — especially on systems where /var isn't a separate partition. Log rotation is more important than the monitoring itself.
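As an extra guard, the collection script can check disk usage before appending and skip logging when the volume is nearly full. A minimal sketch; the 90% threshold is an assumption:

```shell
#!/bin/bash
LOG_DIR="/var/log/gpu-monitor"
MAX_USE_PCT=90   # refuse to log above this disk usage (assumed threshold)

# df -P prints one portable line per filesystem; column 5 is "Use%".
# Query the parent directory so this works even before the log dir exists.
USE_PCT=$(df -P "$(dirname "$LOG_DIR")" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')

if [ "$USE_PCT" -ge "$MAX_USE_PCT" ]; then
  echo "[$(date)] Disk ${USE_PCT}% full, skipping GPU log" >&2
  exit 1
fi

echo "Disk usage ${USE_PCT}%, OK to log"
```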

Key Takeaways

  • nvidia-smi --query-gpu extracts specific fields in clean CSV format
  • ~50 lines of shell script automates date-separated log collection
  • 5-minute cron intervals are optimal for general operations — ~1 MB/month per GPU
  • Threshold alerts catch temperature >85°C and utilization >95% instantly
  • logrotate with 30-day retention prevents disk-full disasters