
Ever experienced a GPU server slowing down for no apparent reason? With nvidia-smi and cron, a single shell script can log GPU status around the clock and catch anomalies the moment they appear. No need for heavyweight tools like Prometheus — just start collecting data right away. When your fleet grows, consider upgrading to Grafana + Prometheus monitoring.

  • Collection Interval: 5 min
  • Log Format: CSV
  • Unattended Operation: 24/7
  • Script Size: ~50 lines

1. nvidia-smi Essentials

nvidia-smi ships with the NVIDIA driver and provides instant GPU status. The default output is useful for quick checks, but query mode is what makes automated monitoring possible.

# Basic commands

# Full GPU status
nvidia-smi

# Extract specific fields as CSV (the key command)
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,\
utilization.gpu,utilization.memory,memory.used,\
memory.total,power.draw --format=csv,noheader,nounits

# Example output:
# 2026/02/24 14:30:05, NVIDIA RTX PRO 6000, 62, 87, 45, 38400, 98304, 285.32
| Query Field        | Description                   | Unit                |
|--------------------|-------------------------------|---------------------|
| timestamp          | Recording time                | YYYY/MM/DD HH:MM:SS |
| name               | GPU model name                | string              |
| temperature.gpu    | GPU core temperature          | °C                  |
| utilization.gpu    | GPU compute utilization       | %                   |
| utilization.memory | Memory controller utilization | %                   |
| memory.used        | VRAM in use                   | MiB                 |
| memory.total       | Total VRAM                    | MiB                 |
| power.draw         | Current power consumption     | W                   |

The --format flag is critical

The csv,noheader,nounits combination produces clean CSV that's trivial to parse. Stripping units (°C, MiB, etc.) is essential for downstream processing.
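With units stripped, the output drops straight into awk. A minimal sketch, using an illustrative sample line in place of a live nvidia-smi call, that derives VRAM usage as a percentage:

```shell
#!/bin/bash
# Sample line in csv,noheader,nounits format (illustrative values,
# standing in for live nvidia-smi output)
LINE="2026/02/24 14:30:05, NVIDIA RTX PRO 6000, 62, 87, 45, 38400, 98304, 285.32"

# Field 6 = memory.used (MiB), field 7 = memory.total (MiB)
VRAM_PCT=$(echo "$LINE" | awk -F', ' '{printf "%.1f", $6 / $7 * 100}')

echo "VRAM usage: ${VRAM_PCT}%"
```

The same one-liner works unchanged on a live query because nounits guarantees the fields are bare numbers.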

2. Writing the Monitoring Script

This shell script captures nvidia-smi output to a file, splitting logs by date and auto-generating headers for new files.

# gpu-monitor.sh

#!/bin/bash

LOG_DIR="/var/log/gpu-monitor"
DATE=$(date +%Y-%m-%d)
LOG_FILE="$LOG_DIR/gpu_$DATE.csv"

# Create log directory
mkdir -p "$LOG_DIR"

# Add header if file doesn't exist
if [ ! -f "$LOG_FILE" ]; then
  echo "timestamp,gpu_name,temp_c,gpu_util,mem_util,\
mem_used_mib,mem_total_mib,power_w" > "$LOG_FILE"
fi

# Collect and append GPU data
nvidia-smi --query-gpu=timestamp,name,\
temperature.gpu,utilization.gpu,\
utilization.memory,memory.used,\
memory.total,power.draw \
--format=csv,noheader,nounits >> "$LOG_FILE"

echo "[$(date)] GPU log recorded"

# Setup and test

# Make executable
chmod +x gpu-monitor.sh

# Test run
./gpu-monitor.sh

# Verify output
cat /var/log/gpu-monitor/gpu_2026-02-24.csv

Multi-GPU environments

With multiple GPUs, nvidia-smi outputs one line per GPU automatically. Use the -i flag to monitor a specific GPU (e.g., nvidia-smi -i 0).

3. CSV Format Design

A well-designed CSV format pays dividends when you analyze the data later. Plan your columns and data types upfront.

| Column        | Type     | Description        | Analysis Use                |
|---------------|----------|--------------------|-----------------------------|
| timestamp     | datetime | Collection time    | Time-based pattern analysis |
| gpu_name      | string   | GPU model          | Multi-GPU differentiation   |
| temp_c        | int      | GPU temperature    | Thermal trend detection     |
| gpu_util      | int      | GPU utilization    | Load pattern analysis       |
| mem_util      | int      | Memory utilization | Memory bottleneck detection |
| mem_used_mib  | int      | VRAM used          | Model size tracking         |
| mem_total_mib | int      | Total VRAM         | Headroom calculation        |
| power_w       | float    | Power draw         | Electricity cost estimation |

# Sample CSV output

timestamp,gpu_name,temp_c,gpu_util,mem_util,mem_used_mib,mem_total_mib,power_w
2026/02/24 14:30:05, NVIDIA RTX PRO 6000, 62, 87, 45, 38400, 98304, 285.32
2026/02/24 14:35:05, NVIDIA RTX PRO 6000, 64, 92, 48, 38400, 98304, 298.15
2026/02/24 14:40:05, NVIDIA RTX PRO 6000, 63, 85, 44, 38400, 98304, 280.47

Split logs by date

Appending everything to a single file makes analysis painful as it grows. Date-based splitting lets you quickly query specific days and simplifies cleanup of old data.
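Date-split files also make per-day queries one-liners. A sketch, with inline sample rows standing in for a real day's log, that extracts the day's peak temperature:

```shell
#!/bin/bash
# Create a sample day's log to query (stand-in for a real
# /var/log/gpu-monitor/gpu_YYYY-MM-DD.csv file)
LOG_FILE=$(mktemp)
cat > "$LOG_FILE" <<'EOF'
timestamp,gpu_name,temp_c,gpu_util,mem_util,mem_used_mib,mem_total_mib,power_w
2026/02/24 14:30:05, NVIDIA RTX PRO 6000, 62, 87, 45, 38400, 98304, 285.32
2026/02/24 14:35:05, NVIDIA RTX PRO 6000, 64, 92, 48, 38400, 98304, 298.15
2026/02/24 14:40:05, NVIDIA RTX PRO 6000, 63, 85, 44, 38400, 98304, 280.47
EOF

# Peak temperature for the day: skip the header, track the max of column 3
MAX_TEMP=$(awk -F', ' 'NR > 1 && $3 > max { max = $3 } END { print max }' "$LOG_FILE")

echo "Peak temperature: ${MAX_TEMP}°C"
rm -f "$LOG_FILE"
```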

4. cron Scheduling

Running the script manually defeats the purpose. Register it with cron for automatic execution every 5 minutes.

# cron setup

# Edit crontab
crontab -e

# Run GPU monitoring every 5 minutes
*/5 * * * * /opt/scripts/gpu-monitor.sh >> /var/log/gpu-monitor/cron.log 2>&1

# Every 1 minute (for more precise monitoring)
*/1 * * * * /opt/scripts/gpu-monitor.sh >> /var/log/gpu-monitor/cron.log 2>&1

# Verify registration
crontab -l

| Interval | Daily Records | Monthly CSV Size | Best For                         |
|----------|---------------|------------------|----------------------------------|
| 1 min    | 1,440         | ~5 MB            | Active debugging sessions        |
| 5 min    | 288           | ~1 MB            | General operations (recommended) |
| 15 min   | 96            | ~350 KB          | Long-term trend analysis         |
| 1 hour   | 24            | ~90 KB           | Minimal monitoring               |

5-minute intervals hit the sweet spot

1-minute intervals add disk overhead; 15-minute gaps can miss sudden spikes. 5-minute intervals catch most anomaly patterns while keeping storage under 1 MB/month per GPU.
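The record counts in the table follow directly from the interval; a quick sanity check, assuming roughly 120 bytes per CSV row:

```shell
#!/bin/bash
INTERVAL_MIN=5
BYTES_PER_ROW=120   # rough size of one CSV line (assumption)

ROWS_PER_DAY=$(( 24 * 60 / INTERVAL_MIN ))
MONTHLY_KB=$(( ROWS_PER_DAY * 30 * BYTES_PER_ROW / 1024 ))

echo "${ROWS_PER_DAY} rows/day, ~${MONTHLY_KB} KB/month"
```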

5. Threshold Alerts

Collecting data without acting on it is pointless. Add alert logic that fires when values exceed safe thresholds.

# gpu-alert.sh — threshold alert script

#!/bin/bash

TEMP_THRESHOLD=85
UTIL_THRESHOLD=95
MEM_THRESHOLD=90
ALERT_LOG="/var/log/gpu-monitor/alerts.log"

# Collect GPU data
DATA=$(nvidia-smi --query-gpu=temperature.gpu,\
utilization.gpu,utilization.memory \
--format=csv,noheader,nounits)

TEMP=$(echo "$DATA" | awk -F', ' '{print $1}')
GPU_UTIL=$(echo "$DATA" | awk -F', ' '{print $2}')
MEM_UTIL=$(echo "$DATA" | awk -F', ' '{print $3}')

ALERT=""

if [ "$TEMP" -ge "$TEMP_THRESHOLD" ]; then
  ALERT+="[WARN] GPU temp ${TEMP}°C (threshold: ${TEMP_THRESHOLD}°C)\n"
fi

if [ "$GPU_UTIL" -ge "$UTIL_THRESHOLD" ]; then
  ALERT+="[WARN] GPU util ${GPU_UTIL}% (threshold: ${UTIL_THRESHOLD}%)\n"
fi

if [ "$MEM_UTIL" -ge "$MEM_THRESHOLD" ]; then
  ALERT+="[WARN] VRAM util ${MEM_UTIL}% (threshold: ${MEM_THRESHOLD}%)\n"
fi

if [ -n "$ALERT" ]; then
  echo -e "[$(date)] $ALERT" >> "$ALERT_LOG"
  # Add Slack/email notification here
fi

| Metric          | Warning    | Critical       | Action                           |
|-----------------|------------|----------------|----------------------------------|
| GPU Temperature | 80°C       | 90°C           | Check cooling, apply power limit |
| GPU Utilization | 95%        | 100% sustained | Distribute workload, adjust queue|
| VRAM Usage      | 90%        | 98%+           | Reduce model size or batch size  |
| Power Draw      | 90% of TDP | Exceeds TDP    | Set power limit via nvidia-smi   |

Multi-GPU alert scripts

The alert script above is for single-GPU setups. With multiple GPUs, nvidia-smi outputs one line per GPU — use a while read loop to iterate over each line, or target a specific GPU by index with nvidia-smi -i 0, nvidia-smi -i 1, etc. Run unmodified on a multi-GPU server, the awk extraction returns one value per line, so the integer comparisons fail and only single-line (single-GPU) output is handled correctly.

90°C+ requires immediate action

Above 90°C, the GPU begins thermal throttling, drastically reducing performance. Sustained temperatures above 95°C shorten hardware lifespan. When you receive a temperature alert, inspect cooling immediately.
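If cooling can't be improved right away, capping power draw with nvidia-smi is the usual stopgap. A sketch (the 300 W cap is an illustrative value; check your card's supported range first, and note the command needs root):

```shell
# Check the supported power-limit range for GPU 0
nvidia-smi -i 0 -q -d POWER

# Cap GPU 0 at 300 W (illustrative value — pick one inside the reported range)
sudo nvidia-smi -i 0 -pl 300
```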

6. Log Rotation

Without log rotation, continuous logging will eventually fill your disk. Use logrotate to automatically compress and purge old logs.

# /etc/logrotate.d/gpu-monitor

/var/log/gpu-monitor/gpu_*.csv {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 644 root root
}

# Explanation:
# daily         — rotate daily
# rotate 30     — keep 30 days of history
# compress      — gzip old files
# delaycompress — skip compression for today's file
# missingok     — no error if file is absent

# Alternative: shell-based cleanup

#!/bin/bash
# cleanup-gpu-logs.sh — delete logs older than 30 days

LOG_DIR="/var/log/gpu-monitor"
RETENTION_DAYS=30

find "$LOG_DIR" -name "gpu_*.csv" \
  -mtime +$RETENTION_DAYS -delete

echo "[$(date)] Deleted logs older than $RETENTION_DAYS days"

Storage Requirements (per GPU)

5-min intervals: 288 rows/day, ~35 KB/day
30-day retention: ~1 MB (uncompressed)
With gzip: ~200 KB/month
1-year archive: ~2.5 MB (compressed)

A full disk kills the entire server

When logs fill the disk, the OS itself can freeze — especially on systems where /var isn't a separate partition. Log rotation is more important than the monitoring itself.
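As an extra guard, the collection script can check disk usage before appending and skip logging when the volume is nearly full. A minimal sketch; the 90% threshold is an assumption:

```shell
#!/bin/bash
LOG_DIR="/var/log/gpu-monitor"
MAX_USE_PCT=90   # refuse to log above this disk usage (assumed threshold)

# df -P prints one portable line per filesystem; column 5 is "Use%".
# Query the parent directory so this works even before the log dir exists.
USE_PCT=$(df -P "$(dirname "$LOG_DIR")" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')

if [ "$USE_PCT" -ge "$MAX_USE_PCT" ]; then
  echo "[$(date)] Disk ${USE_PCT}% full, skipping GPU log" >&2
  exit 1
fi

echo "Disk usage ${USE_PCT}%, OK to log"
```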

Key Takeaways

  • nvidia-smi --query-gpu extracts specific fields in clean CSV format
  • ~50 lines of shell script automates date-separated log collection
  • 5-minute cron intervals are optimal for general operations — ~1 MB/month per GPU
  • Threshold alerts catch temperature >85°C and utilization >95% instantly
  • logrotate with 30-day retention prevents disk-full disasters