Ever experienced a GPU server slowing down for no apparent reason? With nvidia-smi and cron, a single shell script can log GPU status around the clock and catch anomalies the moment they appear. No need for heavyweight tools like Prometheus — just start collecting data right away. When your fleet grows, consider upgrading to Grafana + Prometheus monitoring.
- Collection interval: 5 min
- Log format: CSV
- Unattended operation: 24/7
- Script size: ~50 lines
1. nvidia-smi Essentials
nvidia-smi ships with the NVIDIA driver and provides instant GPU status. The default output is useful for quick checks, but query mode is what makes automated monitoring possible.
# Basic commands
# Full GPU status
nvidia-smi

# Extract specific fields as CSV (the key command)
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,\
utilization.gpu,utilization.memory,memory.used,\
memory.total,power.draw --format=csv,noheader,nounits

# Example output:
# 2026/02/24 14:30:05, NVIDIA RTX PRO 6000, 62, 87, 45, 38400, 98304, 285.32
| Query Field | Description | Unit |
|---|---|---|
| timestamp | Recording time | YYYY/MM/DD HH:MM:SS |
| name | GPU model name | string |
| temperature.gpu | GPU core temperature | °C |
| utilization.gpu | GPU compute utilization | % |
| utilization.memory | Memory controller utilization | % |
| memory.used | VRAM in use | MiB |
| memory.total | Total VRAM | MiB |
| power.draw | Current power consumption | W |
The --format flag is critical
The csv,noheader,nounits combination produces clean CSV that's trivial to parse. Stripping units (°C, MiB, etc.) is essential for downstream processing.
2. Writing the Monitoring Script
This shell script captures nvidia-smi output to a file, splitting logs by date and auto-generating headers for new files.
# gpu-monitor.sh
#!/bin/bash
LOG_DIR="/var/log/gpu-monitor"
DATE=$(date +%Y-%m-%d)
LOG_FILE="$LOG_DIR/gpu_$DATE.csv"

# Create log directory
mkdir -p "$LOG_DIR"

# Add header if file doesn't exist
if [ ! -f "$LOG_FILE" ]; then
  echo "timestamp,gpu_name,temp_c,gpu_util,mem_util,mem_used_mib,mem_total_mib,power_w" > "$LOG_FILE"
fi

# Collect and append GPU data
nvidia-smi --query-gpu=timestamp,name,\
temperature.gpu,utilization.gpu,\
utilization.memory,memory.used,\
memory.total,power.draw \
--format=csv,noheader,nounits >> "$LOG_FILE"

echo "[$(date)] GPU log recorded"
# Setup and test
# Make executable
chmod +x gpu-monitor.sh

# Test run
./gpu-monitor.sh

# Verify output
cat /var/log/gpu-monitor/gpu_2026-02-24.csv
Multi-GPU environments
With multiple GPUs, nvidia-smi outputs one line per GPU automatically. Use the -i flag to monitor a specific GPU (e.g., nvidia-smi -i 0).
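One practical way to keep multi-GPU rows attributable is to add the index field to the query. The flags below are standard nvidia-smi options; the exact output depends on your hardware:

```shell
# One row per GPU, with the index column first so each row names its card
nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu \
    --format=csv,noheader,nounits

# Or restrict collection to a single card, e.g. GPU 0
nvidia-smi -i 0 --query-gpu=temperature.gpu --format=csv,noheader,nounits
```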
3. CSV Format Design
A well-designed CSV format pays dividends when you analyze the data later. Plan your columns and data types upfront.
| Column | Type | Description | Analysis Use |
|---|---|---|---|
| timestamp | datetime | Collection time | Time-based pattern analysis |
| gpu_name | string | GPU model | Multi-GPU differentiation |
| temp_c | int | GPU temperature | Thermal trend detection |
| gpu_util | int | GPU utilization | Load pattern analysis |
| mem_util | int | Memory utilization | Memory bottleneck detection |
| mem_used_mib | int | VRAM used | Model size tracking |
| mem_total_mib | int | Total VRAM | Headroom calculation |
| power_w | float | Power draw | Electricity cost estimation |
# Sample CSV output
timestamp,gpu_name,temp_c,gpu_util,mem_util,mem_used_mib,mem_total_mib,power_w
2026/02/24 14:30:05, NVIDIA RTX PRO 6000, 62, 87, 45, 38400, 98304, 285.32
2026/02/24 14:35:05, NVIDIA RTX PRO 6000, 64, 92, 48, 38400, 98304, 298.15
2026/02/24 14:40:05, NVIDIA RTX PRO 6000, 63, 85, 44, 38400, 98304, 280.47
Split logs by date
Appending everything to a single file makes analysis painful as it grows. Date-based splitting lets you quickly query specific days and simplifies cleanup of old data.
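A date-split file can be summarized directly with awk. The snippet below is a sketch that assumes the column layout from the table above (after splitting on ", ", field 3 is temp_c and field 4 is gpu_util); the filename is illustrative:

```shell
# One-day summary: peak temperature and mean GPU utilization.
# nvidia-smi separates fields with ", ", so split on that; NR > 1 skips the header.
awk -F', ' 'NR > 1 { if ($3 > max) max = $3; sum += $4; n++ }
            END { printf "max_temp=%s avg_util=%.1f\n", max, sum / n }' \
    /var/log/gpu-monitor/gpu_2026-02-24.csv
```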
4. cron Scheduling
Running the script manually defeats the purpose. Register it with cron for automatic execution every 5 minutes.
# cron setup
# Edit crontab
crontab -e

# Run GPU monitoring every 5 minutes
*/5 * * * * /opt/scripts/gpu-monitor.sh >> /var/log/gpu-monitor/cron.log 2>&1

# Every 1 minute (for more precise monitoring)
*/1 * * * * /opt/scripts/gpu-monitor.sh >> /var/log/gpu-monitor/cron.log 2>&1

# Verify registration
crontab -l
| Interval | Daily Records | Monthly CSV Size | Best For |
|---|---|---|---|
| 1 min | 1,440 | ~5 MB | Active debugging sessions |
| 5 min | 288 | ~1 MB | General operations (recommended) |
| 15 min | 96 | ~350 KB | Long-term trend analysis |
| 1 hour | 24 | ~90 KB | Minimal monitoring |
5-minute intervals hit the sweet spot
1-minute intervals add disk overhead; 15-minute gaps can miss sudden spikes. 5-minute intervals catch most anomaly patterns while keeping storage under 1 MB/month per GPU.
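The ~1 MB/month figure is easy to sanity-check with shell arithmetic; the ~100-byte row width is an assumption based on the sample CSV above:

```shell
# Back-of-envelope check of the monthly-size column in the table above
ROWS_PER_DAY=288      # one row every 5 minutes
BYTES_PER_ROW=100     # rough width of one CSV row (assumption)
echo "$(( ROWS_PER_DAY * BYTES_PER_ROW * 30 / 1024 )) KiB/month"
```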
5. Threshold Alerts
Collecting data without acting on it is pointless. Add alert logic that fires when values exceed safe thresholds.
# gpu-alert.sh — threshold alert script
#!/bin/bash
TEMP_THRESHOLD=85
UTIL_THRESHOLD=95
MEM_THRESHOLD=90
ALERT_LOG="/var/log/gpu-monitor/alerts.log"
# Collect GPU data
DATA=$(nvidia-smi --query-gpu=temperature.gpu,\
utilization.gpu,utilization.memory \
--format=csv,noheader,nounits)
TEMP=$(echo "$DATA" | awk -F', ' '{print $1}')
GPU_UTIL=$(echo "$DATA" | awk -F', ' '{print $2}')
MEM_UTIL=$(echo "$DATA" | awk -F', ' '{print $3}')
ALERT=""
if [ "$TEMP" -ge "$TEMP_THRESHOLD" ]; then
ALERT+="[WARN] GPU temp ${TEMP}°C (threshold: ${TEMP_THRESHOLD}°C)\n"
fi
if [ "$GPU_UTIL" -ge "$UTIL_THRESHOLD" ]; then
ALERT+="[WARN] GPU util ${GPU_UTIL}% (threshold: ${UTIL_THRESHOLD}%)\n"
fi
if [ "$MEM_UTIL" -ge "$MEM_THRESHOLD" ]; then
ALERT+="[WARN] VRAM util ${MEM_UTIL}% (threshold: ${MEM_THRESHOLD}%)\n"
fi
if [ -n "$ALERT" ]; then
echo -e "[$(date)] $ALERT" >> "$ALERT_LOG"
# Add Slack/email notification here
fi

| Metric | Warning | Critical | Action |
|---|---|---|---|
| GPU Temperature | 80°C | 90°C | Check cooling, apply power limit |
| GPU Utilization | 95% | 100% sustained | Distribute workload, adjust queue |
| VRAM Usage | 90% | 98%+ | Reduce model size or batch size |
| Power Draw | 90% TDP | Exceeds TDP | Set power limit via nvidia-smi |
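The "set power limit" action in the last row can be done with nvidia-smi itself. The 300 W value below is illustrative; check your card's supported range first, and note that the limit resets on reboot unless persistence is configured:

```shell
# Inspect current, default, and maximum power limits
nvidia-smi -q -d POWER

# Cap board power at 300 W (requires root)
sudo nvidia-smi -pl 300
```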
Multi-GPU alert scripts
The alert script above is for single-GPU setups. With multiple GPUs, nvidia-smi outputs one line per GPU — use a while read loop to iterate over each line, or target a specific card with the index flag (nvidia-smi -i 0, nvidia-smi -i 1, etc.). Running the single-GPU script on a multi-GPU server will only monitor the first line of output, i.e. GPU 0.
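A minimal per-GPU variant of the temperature check can look like this sketch (same 85 °C threshold as above); the parsing works because nvidia-smi separates fields with ", ":

```shell
#!/bin/bash
# Check every GPU's temperature, not just GPU 0
TEMP_THRESHOLD=85

nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits |
while IFS=', ' read -r IDX TEMP; do
    if [ "$TEMP" -ge "$TEMP_THRESHOLD" ]; then
        echo "[WARN] GPU $IDX temp ${TEMP}C (threshold: ${TEMP_THRESHOLD}C)"
    fi
done
```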
90°C+ requires immediate action
Above 90°C, the GPU begins thermal throttling, drastically reducing performance. Sustained temperatures above 95°C shorten hardware lifespan. When you receive a temperature alert, inspect cooling immediately.
6. Log Rotation
Without log rotation, continuous logging will eventually fill your disk. Use logrotate to automatically compress and purge old logs.
# /etc/logrotate.d/gpu-monitor
/var/log/gpu-monitor/gpu_*.csv {
daily
rotate 30
compress
delaycompress
missingok
notifempty
create 644 root root
}
# Explanation:
# daily — rotate daily
# rotate 30 — keep 30 days of history
# compress — gzip old files
# delaycompress — skip compression for today's file
# missingok — no error if file is absent
# notifempty — skip rotation when the file is empty
# create 644 root root — recreate the log file with these permissions

# Alternative: shell-based cleanup
#!/bin/bash
# cleanup-gpu-logs.sh — delete logs older than 30 days
LOG_DIR="/var/log/gpu-monitor"
RETENTION_DAYS=30

find "$LOG_DIR" -name "gpu_*.csv" \
  -mtime +$RETENTION_DAYS -delete

echo "[$(date)] Deleted logs older than $RETENTION_DAYS days"
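If you use the shell-based cleanup instead of logrotate, it also needs a cron entry; the path and schedule below are assumptions:

```shell
# Hypothetical crontab entry: run the cleanup nightly at 03:30
30 3 * * * /opt/scripts/cleanup-gpu-logs.sh >> /var/log/gpu-monitor/cron.log 2>&1
```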
A full disk kills the entire server
When logs fill the disk, the OS itself can freeze — especially on systems where /var isn't a separate partition. Log rotation is more important than the monitoring itself.
Key Takeaways
- ✓nvidia-smi --query-gpu extracts specific fields in clean CSV format
- ✓~50 lines of shell script automates date-separated log collection
- ✓5-minute cron intervals are optimal for general operations — ~1 MB/month per GPU
- ✓Threshold alerts catch temperature >85°C and utilization >95% instantly
- ✓logrotate with 30-day retention prevents disk-full disasters