Unified Monitoring of 16 Servers with Grafana + Prometheus: From Installation to Alerting

2026-02-26

Treeru

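With a single server, htop is all you need. Running 16 servers at once is a different story: keeping 16 terminals open and switching between them is impractical, and problems get noticed late. With Grafana + Prometheus, a single dashboard watches CPU, memory, temperature, GPU, and disk on every server in real time and raises an alert as soon as something goes wrong.

16 monitored servers · 21 scrape targets · 6 dashboards · 15-second scrape interval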

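1. Why Unified Monitoring

Previously we collected GPU logs with shell-script-based monitoring. Accumulating CSV files and analyzing them afterwards worked well at small scale, but the limits became obvious once the fleet grew to 16 servers.

Item	Shell scripts	Grafana + Prometheus
Real-time monitoring	❌ Log files must be checked	✅ Auto-refresh every 15 seconds
Multiple servers	❌ Run separately on each server	✅ Central collection, one screen
Alerting	⚠️ Manual (mail script)	✅ Automatic, condition-based alerts
Visualization	❌ Separate tool required	✅ Graphs, gauges, heatmaps
History lookup	⚠️ CSV parsing required	✅ Drag to select a time range
Cost	✅ Free	✅ Free (OSS)

💡 Shell scripts are not bad. On one or two servers they are actually lighter and more efficient. Once the fleet grows, however, centralized monitoring becomes essential.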

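2. Architecture Design

The monitoring stack has Grafana on top, Prometheus in the middle, and four kinds of collectors at the bottom. Prometheus collects (pulls) metrics centrally, and Grafana handles visualization.

Monitoring architecture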

┌──────────────────────────────────────────────────┐
│             Grafana (visualization)              │
│   6 dashboards · alert rules · history queries   │
└──────────────────┬───────────────────────────────┘
                   │ PromQL 쿼리
┌──────────────────┴───────────────────────────────┐
│          Prometheus (collection engine)          │
│ 15s scrape · 30-day retention · alert evaluation │
└────┬──────────┬──────────┬──────────┬────────────┘
     │          │          │          │
┌────┴────┐ ┌──┴───┐ ┌───┴───┐ ┌───┴────────┐
│  node   │ │ GPU  │ │ black │ │  textfile  │
│exporter │ │export│ │  box  │ │ collector  │
│  (x16)  │ │ (x2) │ │ (TCP) │ │  (SMART)   │
└─────────┘ └──────┘ └───────┘ └────────────┘

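Component	Role	Runs on / target
Prometheus 3.x	Metric collection engine, alert evaluation	1 monitoring server
Grafana 12.x	Visualization dashboards	1 monitoring server
node_exporter 1.x	System metrics (CPU, RAM, disk, temperature)	All 16 servers
nvidia_gpu_exporter	GPU temperature, VRAM, power, utilization	2 GPU servers
blackbox_exporter	TCP port liveness checks	Web service ports
textfile collector	Custom metrics (HDD SMART)	1 NAS server

3. Installing and Configuring Prometheus

Prometheus installs as a single binary. apt and Docker also work, but installing the official binary directly makes it easier to control the version.

Installation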

# Install the Prometheus binary

# 1. Download (check the latest version: github.com/prometheus/prometheus/releases)
wget https://github.com/prometheus/prometheus/releases/download/v3.x.x/prometheus-3.x.x.linux-amd64.tar.gz
tar xvf prometheus-*.tar.gz

# 2. Place the binaries, then create the service account and directories
sudo cp prometheus-*/prometheus prometheus-*/promtool /usr/local/bin/
sudo useradd -rs /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

# 3. Register the service
sudo tee /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.listen-address=127.0.0.1:9090
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# 4. Start the service (it needs /etc/prometheus/prometheus.yml, described below, to exist)
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus

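⚠️ Security note: --web.listen-address=127.0.0.1:9090 binds Prometheus to localhost only. Nothing outside this host needs direct access to Prometheus.

Core structure of prometheus.yml

The heart of the configuration file is scrape_configs. Register the address of each exporter and Prometheus fetches its metrics periodically.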

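# /etc/prometheus/prometheus.yml

global:
  scrape_interval: 15s      # scrape metrics every 15 seconds
  evaluation_interval: 15s  # evaluate alert rules every 15 seconds

rule_files:
  - "alert_rules.yml"       # alert rule file

scrape_configs:
  # System metrics (16 servers)
  - job_name: 'node'
    static_configs:
      - targets:
        - '<server-1>:9100'
        - '<server-2>:9100'
        - '<server-3>:9100'
        # ... register all 16 servers

    # Map IP to server name (dashboard readability)
    relabel_configs:
      - source_labels: [__address__]
        regex: '<server-1-ip>:.*'
        target_label: instance
        replacement: 'main-server'
      # ... one mapping per server

  # GPU metrics (servers with GPUs only)
  - job_name: 'nvidia_gpu'
    scrape_interval: 5s      # GPUs are scraped every 5 seconds
    static_configs:
      - targets: ['localhost:9835']
        labels:
          instance: 'gpu-server-1'
      - targets: ['<gpu-server-2>:9835']
        labels:
          instance: 'gpu-server-2'

  # Service port checks
  - job_name: 'blackbox_tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
        - '<proxy-server>:80'   # HTTP
        - '<proxy-server>:443'  # HTTPS
    # Pass each target to blackbox_exporter as ?target=... instead of scraping it directly
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: '127.0.0.1:9115'  # blackbox_exporter address (default port)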

💡 The point of relabel_configs: dashboards show server names instead of IP addresses. Rewriting the instance label to a meaningful name such as "main-server" or "gpu-server-1" makes it far easier to tell 16 servers apart.
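Writing one relabel rule per server by hand gets error-prone at 16 hosts. The following is a minimal sketch, not the configuration we actually run, that generates the rule entries from an inventory. The SERVER_NAMES mapping and its documentation-range IP addresses are assumptions made up for illustration. It also escapes the dots in each IP, because the relabel regex would otherwise treat "." as "any character".

```python
import re

# Hypothetical IP -> display-name inventory; replace with your own servers.
SERVER_NAMES = {
    "192.0.2.10": "main-server",
    "192.0.2.11": "gpu-server-1",
}


def build_relabel_rules(names: dict) -> list:
    """Return one relabel rule per server that rewrites `instance` to a readable name."""
    rules = []
    for ip, name in names.items():
        rules.append({
            "source_labels": ["__address__"],
            # Escape the dots so the pattern matches this exact IP only.
            "regex": re.escape(ip) + ":.*",
            "target_label": "instance",
            "replacement": name,
        })
    return rules


def to_yaml(rules: list) -> str:
    """Render the rules as a YAML fragment to paste under relabel_configs."""
    lines = []
    for rule in rules:
        lines.append(f"- source_labels: [{', '.join(rule['source_labels'])}]")
        lines.append(f"  regex: '{rule['regex']}'")
        lines.append(f"  target_label: {rule['target_label']}")
        lines.append(f"  replacement: '{rule['replacement']}'")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_yaml(build_relabel_rules(SERVER_NAMES)))
```

The printed fragment still has to be indented to match the job it is pasted into, and promtool check config is a sensible last step before reloading Prometheus.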

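4. Collecting from 16 Servers with node_exporter

node_exporter is a lightweight agent installed on each server. It automatically exposes hundreds of metrics covering CPU, memory, disk, network, and hardware temperature.

Installation (repeat on every server)

# Download the binary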
wget https://github.com/prometheus/node_exporter/releases/download/v1.x.x/node_exporter-1.x.x.linux-amd64.tar.gz
tar xvf node_exporter-*.tar.gz
sudo cp node_exporter-*/node_exporter /usr/local/bin/

# Register the systemd service
sudo tee /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.hwmon \
  --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo useradd -rs /bin/false node_exporter
sudo mkdir -p /var/lib/node_exporter/textfile_collector
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

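Key metrics collected

Metric	Description	Used for
node_cpu_seconds_total	CPU time (per mode)	Computing CPU usage
node_memory_MemAvailable_bytes	Available memory	Memory usage
node_filesystem_avail_bytes	Free disk space	Low-disk alerts
node_hwmon_temp_celsius	Hardware temperature sensors	CPU, NVMe, DDR5 temperature
node_network_receive_bytes_total	Network bytes received	Traffic monitoring
node_boot_time_seconds	Boot time	Computing uptime

💡 Tip for installing on all 16 servers: if SSH key authentication is already set up, the repetition can be scripted with multi-server management automation. A for host in server1 server2 ...; loop deploys to every server in one pass.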
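The loop in the tip above can be sketched as follows. This is an illustration under assumptions, not our actual deployment script: the host aliases and the install_node_exporter.sh file name are hypothetical, and the remote account is assumed to have passwordless sudo. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

# Hypothetical host aliases and script name; both are placeholders.
HOSTS = ["server1", "server2"]
INSTALL_SCRIPT = "install_node_exporter.sh"


def build_commands(host: str, script: str) -> list:
    """Return the scp and ssh argument lists that deploy `script` to one host."""
    remote_path = f"/tmp/{script}"
    return [
        ["scp", script, f"{host}:{remote_path}"],
        ["ssh", host, f"sudo bash {shlex.quote(remote_path)}"],
    ]


def deploy(hosts: list, script: str, dry_run: bool = True) -> list:
    """Deploy to every host in turn and return the hosts where a step failed."""
    failed = []
    for host in hosts:
        for argv in build_commands(host, script):
            if dry_run:
                print(shlex.join(argv))  # show the command instead of running it
                continue
            if subprocess.run(argv).returncode != 0:
                failed.append(host)
                break  # skip the remaining steps for this host
    return failed


if __name__ == "__main__":
    deploy(HOSTS, INSTALL_SCRIPT)  # dry run: prints two commands per host
```

Calling deploy(HOSTS, INSTALL_SCRIPT, dry_run=False) runs the commands for real and returns the list of hosts that need a second look.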

5. Adding GPU Monitoring

Servers with GPUs additionally run nvidia_gpu_exporter, a lightweight exporter that converts nvidia-smi data into the Prometheus format.

# Install nvidia_gpu_exporter

# Download the binary (GitHub Releases)
wget https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v1.x.x/nvidia_gpu_exporter_1.x.x_linux_x86_64.tar.gz
tar xvf nvidia_gpu_exporter_*.tar.gz
sudo cp nvidia_gpu_exporter /usr/local/bin/

# Register with systemd (default port: 9835)
sudo tee /etc/systemd/system/nvidia_gpu_exporter.service << 'EOF'
[Unit]
Description=NVIDIA GPU Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/nvidia_gpu_exporter
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now nvidia_gpu_exporter

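GPU metrics collected

Metric	Description	Alert threshold
nvidia_smi_temperature_gpu	GPU core temperature	> 80°C warning, > 85°C critical
nvidia_smi_memory_used_bytes	VRAM in use	> 90% warning
nvidia_smi_utilization_gpu_ratio	GPU utilization	Checking inference load
nvidia_smi_power_draw_watts	Live power draw	Power-limit monitoring

In the Prometheus configuration the GPU targets are scraped every 5 seconds. GPU temperature can swing within a few seconds during AI inference, so it needs a shorter interval than the system metrics (15 seconds).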
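To make "converts nvidia-smi data into the Prometheus format" concrete, here is a simplified sketch of that conversion. It is not the exporter's actual code. It assumes the input is one line produced by nvidia-smi --query-gpu=temperature.gpu,memory.used,memory.total,utilization.gpu,power.draw --format=csv,noheader,nounits, and the sample line is made up for illustration.

```python
# Field order must match the --query-gpu list given in the text above.
FIELDS = ["temperature.gpu", "memory.used", "memory.total", "utilization.gpu", "power.draw"]
MIB = 1024 * 1024  # nvidia-smi reports memory in MiB when units are suppressed


def parse_gpu_line(line: str) -> dict:
    """Convert one CSV line of nvidia-smi output into exporter-style metric values."""
    raw = dict(zip(FIELDS, (float(x) for x in line.split(","))))
    return {
        "nvidia_smi_temperature_gpu": raw["temperature.gpu"],
        "nvidia_smi_memory_used_bytes": raw["memory.used"] * MIB,
        "nvidia_smi_memory_total_bytes": raw["memory.total"] * MIB,
        "nvidia_smi_utilization_gpu_ratio": raw["utilization.gpu"] / 100,
        "nvidia_smi_power_draw_watts": raw["power.draw"],
    }


if __name__ == "__main__":
    sample = "72, 20480, 24576, 93, 285.50"  # illustrative values, not a measurement
    for name, value in parse_gpu_line(sample).items():
        print(f"{name} {value}")
```

The unit conversions (MiB to bytes, percent to ratio) are why the metric names end in _bytes and _ratio, and why the VRAM alert divides used bytes by total bytes.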

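6. Monitoring Service Health

A server can be up while the service on it is dead. blackbox_exporter attempts a TCP connection to the port from outside the service to check whether it is alive.

# Install blackbox_exporter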

wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.x.x/blackbox_exporter-0.x.x.linux-amd64.tar.gz
tar xvf blackbox_exporter-*.tar.gz
sudo cp blackbox_exporter-*/blackbox_exporter /usr/local/bin/

# Configure blackbox.yml
sudo tee /etc/prometheus/blackbox.yml << 'EOF'
modules:
  tcp_connect:
    prober: tcp
    timeout: 5s
EOF

# Register with systemd
sudo tee /etc/systemd/system/blackbox_exporter.service << 'EOF'
[Unit]
Description=Blackbox Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/blackbox_exporter \
  --config.file=/etc/prometheus/blackbox.yml
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now blackbox_exporter

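Once the service ports are registered in the blackbox_tcp job of prometheus.yml, Prometheus periodically has blackbox_exporter attempt a TCP connection to each of them. When probe_success == 0, the service is considered down and an alert fires.

Example targets

• Reverse proxy server: HTTP (80) and HTTPS (443) ports
• AI inference service: API port
• Web application: service port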
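What the tcp_connect module checks can be pictured with a few lines of Python. This is a rough stand-in for illustration, not blackbox_exporter's implementation: it returns 1 or 0 in the same sense as probe_success, using the same 5-second timeout as the blackbox.yml above. The host and port in the example call are placeholders.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 5.0) -> int:
    """Return 1 if a TCP connection succeeds within `timeout` seconds, else 0."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 1
    except OSError:  # refused, unreachable, DNS failure, or timed out
        return 0


if __name__ == "__main__":
    # Placeholder target; replace with a real service host and port.
    print(tcp_probe("127.0.0.1", 80))
```

A successful connect only proves that something is listening on the port. It says nothing about whether the application behind it answers correctly, which is why this check complements the metric-based alerts and does not replace them.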

7. Building the Grafana Dashboards

After installing Grafana, connecting Prometheus as a data source is all the setup it needs. We built six dashboards, one per purpose.

Installing Grafana

# Ubuntu/Debian
sudo apt install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | \
  gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null

echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] \
  https://apt.grafana.com stable main" | \
  sudo tee /etc/apt/sources.list.d/grafana.list

sudo apt update
sudo apt install grafana
sudo systemctl enable --now grafana-server

Automatic data source provisioning

The data source can be added by hand in the Grafana UI, but provisioning it from a YAML file means the setting survives a server rebuild.

# /etc/grafana/provisioning/datasources/prometheus.yml

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://127.0.0.1:9090
    isDefault: true
    editable: false

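The six dashboards

#	Dashboard	Main panels	Purpose
1	All-server status	CPU, RAM, disk usage (all 16 at a glance)	Daily checks
2	Temperature	CPU, GPU, NVMe, DDR5 temperature trends	Thermal management
3	GPU real-time	GPU utilization, VRAM, temperature, power	AI inference monitoring
4	Main server detail	In-depth view of a single server	Debugging
5	Version management	OS, kernel, Node.js, Python, PM2, apt updates	Patch management
6	Service status	TCP port liveness, response time	Service availability

Dashboard in practice: version management

This dashboard compares the Ubuntu, kernel, Node.js, Python, and PM2 versions of all 16 servers at a glance. It also shows the number of pending apt package updates and the uptime, so it is immediately clear which server to patch first.

▲ [Screenshot] Version management dashboard: a table of Ubuntu/kernel/Node.js/Python/PM2 versions, pending apt updates, and uptime for all 16 servers

Dashboard in practice: temperature

Running 16 servers in an office environment makes temperature management critical. The dashboard tracks CPU, GPU, NVMe SSD, and DDR5 memory temperatures in real time and uses time-series graphs to monitor cooling performance.

▲ [Screenshot] Temperature dashboard: real-time CPU/GPU/NVMe/DDR5 temperature gauges and trend graphs over time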

PromQL examples: frequently used queries

# CPU usage (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage (%)
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)

# Server uptime (days)
(time() - node_boot_time_seconds) / 86400

# GPU temperature (°C)
nvidia_smi_temperature_gpu

# GPU VRAM usage (%)
nvidia_smi_memory_used_bytes / nvidia_smi_memory_total_bytes * 100
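The CPU query is the least obvious of these, so here is the arithmetic behind it as a small sketch. It is a simplification: Prometheus's rate() also handles counter resets and extrapolates to the window edges, which this version ignores. The counter values are made up for illustration.

```python
def cpu_usage_percent(idle_start: list, idle_end: list, window_seconds: float) -> float:
    """CPU usage (%) from per-core idle-second counters sampled `window_seconds` apart."""
    # rate(...{mode="idle"}[5m]): idle seconds gained per second of wall time, per core
    idle_rates = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    # avg by(instance): average the per-core idle fractions for the machine
    avg_idle = sum(idle_rates) / len(idle_rates)
    # 100 - idle%: whatever was not idle counts as used
    return 100 - avg_idle * 100


if __name__ == "__main__":
    # Two cores over a 300 s window: core 0 idled 240 s, core 1 idled 270 s.
    usage = cpu_usage_percent([1000.0, 2000.0], [1240.0, 2270.0], 300)
    print(f"{usage:.1f}")  # prints 15.0
```

Each core can gain at most one idle second per second, so the averaged rate is a fraction between 0 and 1, and subtracting it from 100% yields the busy share.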

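8. Configuring Alert Rules

Nobody can watch dashboards 24 hours a day. With alert rules in place, Prometheus raises an alert automatically whenever a threshold is exceeded. (Delivering those alerts to e-mail or chat needs a notification path such as Alertmanager, which is not shown here.)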

# /etc/prometheus/alert_rules.yml

groups:
  # GPU alerts
  - name: gpu_alerts
    rules:
      - alert: GPUTemperatureWarning
        expr: nvidia_smi_temperature_gpu > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature warning ({{ $value }}°C)"

      - alert: GPUTemperatureCritical
        expr: nvidia_smi_temperature_gpu > 85
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature critical ({{ $value }}°C)"

      - alert: GPUVRAMHigh
        expr: >
          nvidia_smi_memory_used_bytes
          / nvidia_smi_memory_total_bytes * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU VRAM above 90%"

  # Server alerts
  - name: server_alerts
    rules:
      - alert: ServerDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Server down: {{ $labels.instance }}"

      - alert: DiskSpaceWarning
        expr: >
          100 - (node_filesystem_avail_bytes{mountpoint="/"}
          / node_filesystem_size_bytes{mountpoint="/"} * 100) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk above 90%: {{ $labels.instance }}"

      - alert: HighCPU
        expr: >
          100 - (avg by(instance)
          (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 95%: {{ $labels.instance }}"

      - alert: HighMemory
        expr: >
          (1 - node_memory_MemAvailable_bytes
          / node_memory_MemTotal_bytes) * 100 > 95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RAM above 95%: {{ $labels.instance }}"

  # Service alerts
  - name: service_alerts
    rules:
      - alert: ServiceDown
        expr: probe_success{job="blackbox_tcp"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service down: {{ $labels.instance }}"

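Alert rule summary

Alert	Condition	Duration	Severity
GPU temperature warning	> 80°C	1 min	warning
GPU temperature critical	> 85°C	30 s	critical
GPU VRAM	> 90%	5 min	warning
Server down	up == 0	5 min	critical
Low disk space	> 90%	5 min	warning
CPU overload	> 95%	10 min	warning
RAM overload	> 95%	5 min	warning
Service down	probe_success == 0	2 min	critical

💡 What the for clause means: the alert fires only if the threshold stays exceeded for the whole specified duration, which prevents false positives from momentary spikes. The key design choice is a short 30 seconds for the critical GPU temperature alert (85°C) and a long 10 minutes for CPU overload.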
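The effect of the for clause can be simulated in a few lines. This is a simplified model for illustration, not Prometheus's rule engine: it assumes one sample per 15-second evaluation and tracks only the moment the condition first became true. The temperature readings are invented.

```python
from typing import Optional


def first_firing_time(samples: list, threshold: float, for_seconds: int,
                      interval: int = 15) -> Optional[int]:
    """Return the time (s) at which the alert would fire, or None if it never does."""
    active_since = None  # when the condition last became true (the "pending" state)
    for i, value in enumerate(samples):
        t = i * interval
        if value > threshold:
            if active_since is None:
                active_since = t
            if t - active_since >= for_seconds:
                return t  # condition held long enough: the alert fires
        else:
            active_since = None  # condition cleared: the timer starts over
    return None


if __name__ == "__main__":
    # Sustained overheating: above 85°C at t=0, 15, 30 -> fires at t=30.
    print(first_firing_time([86, 86, 86], threshold=85, for_seconds=30))      # prints 30
    # A dip at t=15 resets the timer, so no alert within these samples.
    print(first_firing_time([86, 70, 86, 86], threshold=85, for_seconds=30))  # prints None
```

The second call shows the false-positive protection at work: a single reading below the threshold resets the pending state, so only a sustained problem fires.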

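9. Automating Routine Checks

Separately from the Grafana dashboards, we also run a one-command check built on a Python script. A single script walks through 11 check items and saves the results automatically as a markdown log.

The 11 check items

• Ping status of every server
• CPU usage
• Memory usage
• Disk usage
• CPU/GPU temperature
• GPU VRAM status
• Service port liveness
• Network connectivity
• Current Prometheus alerts
• Uptime
• HDD SMART status

The script fetches the latest metrics through the Prometheus API, so it never has to SSH into the individual servers. Results are saved to a timestamped markdown file, which doubles as a history of past checks.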

# Core structure of the check script

import requests
from datetime import datetime

PROMETHEUS_URL = "http://localhost:9090"

def query_prometheus(promql: str) -> list:
    """Run an instant PromQL query against Prometheus and return the result list."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,  # do not hang the whole check if Prometheus is unresponsive
    )
    resp.raise_for_status()  # surface HTTP errors instead of failing on a missing key
    return resp.json()["data"]["result"]

def check_server_status():
    """Check whether every server is up."""
    results = query_prometheus('up{job="node"}')
    for r in results:
        instance = r["metric"]["instance"]
        # Sample values arrive as strings in the API response
        status = "OK" if r["value"][1] == "1" else "DOWN"
        print(f"  {instance}: {status}")

def check_cpu_temperature():
    """Check CPU temperatures."""
    results = query_prometheus(
        'node_hwmon_temp_celsius{chip=~".*coretemp.*|.*k10temp.*"}'
    )
    for r in results:
        instance = r["metric"]["instance"]
        temp = float(r["value"][1])
        status = "OK" if temp < 80 else "WARNING"
        print(f"  {instance}: {temp:.1f}°C [{status}]")

# ... run all 11 check functions, then save the markdown log
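The log-saving step is elided above. One way it could look is sketched below. This is an assumption for illustration, not the script we run: the function names, the (check, target, status) row shape, and the check_YYYYMMDD_HHMMSS.md file name are all hypothetical.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def render_markdown(results: list, checked_at: datetime) -> str:
    """Render (check, target, status) rows as a markdown report."""
    lines = [
        f"# Server check {checked_at:%Y-%m-%d %H:%M:%S}",
        "",
        "| Check | Target | Status |",
        "| --- | --- | --- |",
    ]
    lines += [f"| {check} | {target} | {status} |" for check, target, status in results]
    return "\n".join(lines) + "\n"


def save_report(results: list, log_dir: Path,
                checked_at: Optional[datetime] = None) -> Path:
    """Write the report to a timestamped file in `log_dir` and return its path."""
    checked_at = checked_at or datetime.now()
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"check_{checked_at:%Y%m%d_%H%M%S}.md"
    path.write_text(render_markdown(results, checked_at), encoding="utf-8")
    return path


if __name__ == "__main__":
    rows = [("CPU temperature", "main-server", "OK")]  # illustrative row
    print(save_report(rows, Path("check_logs")))
```

To feed it, the check functions would return their rows instead of printing them. Keeping one file per run means a plain directory listing already serves as the check history.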

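10. Practical Operations Tips

1. Bind Prometheus to localhost

--web.listen-address=127.0.0.1:9090 blocks outside access. Grafana runs on the same server, so talking over localhost is sufficient. An exposed Prometheus query endpoint can leak details about your infrastructure.

2. Use relabel_configs for readability

Mapping meaningful server names onto the instance label makes the dashboards show names such as "main-server" and "gpu-server-1" instead of IP addresses. Telling 16 servers apart by IP is impractical.

3. Retention and disk

--storage.tsdb.retention.time=30d keeps 30 days of data. In our setup, 30 days of metrics from 16 servers takes about 600 MB, so the disk load is negligible. Extending retention to 90 or 180 days still stays in the range of a few GB.

4. Scrape GPUs every 5 seconds

15 seconds is enough for system metrics, but GPU temperature and VRAM change quickly during AI inference. A separate scrape_interval: 5s on the GPU job keeps sudden changes from being missed.

5. Back up dashboards as JSON

Grafana dashboards can be exported as JSON. Store the JSON files in /var/lib/grafana/dashboards/, keep them in Git, and point a dashboard provider under /etc/grafana/provisioning/dashboards/ at that directory. The dashboards are then restored automatically when the server is rebuilt.

6. Use the textfile collector

Metrics that node_exporter does not provide out of the box are added through the textfile collector. For example, write HDD SMART data, the number of pending apt updates, or a custom service status to a .prom file and node_exporter collects it automatically.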
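As a sketch of what writing a .prom file involves, the snippet below renders one gauge in the Prometheus text format and writes it atomically, so node_exporter never reads a half-written file. The metric name apt_updates_pending, the value 7, and the apt.prom file name are hypothetical. The directory is the one passed to --collector.textfile.directory earlier.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def render_gauge(name: str, help_text: str, value: float,
                 labels: Optional[dict] = None) -> str:
    """Render a single gauge in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )


def write_prom_file(directory: Path, filename: str, content: str) -> Path:
    """Write to a temporary file first, then rename it into place atomically."""
    directory.mkdir(parents=True, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; node_exporter runs as another user
    target = directory / filename
    os.replace(tmp_path, target)  # atomic on the same filesystem
    return target


if __name__ == "__main__":
    # Hypothetical metric; a real script would count pending updates first.
    text = render_gauge("apt_updates_pending", "Pending apt package updates", 7)
    write_prom_file(Path("/var/lib/node_exporter/textfile_collector"), "apt.prom", text)
```

Run from cron or a systemd timer with write access to that directory, a script like this keeps the value fresh. The temporary file ends in .tmp, so the collector, which only reads *.prom files, ignores it until the rename.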

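Summary

This post traced our path from shell-script monitoring to Grafana + Prometheus. The main lesson from running 16 servers is that the purpose of monitoring is to detect problems before a server dies, not after.

16 servers under unified management · 8 alert rules · 6 dashboards · ~600 MB for 30 days of data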

Treeru

We share practical insights on web development, IT infrastructure, and AI solutions. Treeru is an IT partner that helps companies with their digital transformation.


AI인프라엔지니어

2026-03-07

5.0

Moving from shell-script monitoring to Grafana is a different league. Seeing temperature trends at a glance on the dashboard is the key benefit.

서버관리자P

2026-03-04

4.0

I like the part about watching service ports with blackbox_exporter. I once had a web server go down without noticing...

스타트업CTO

2026-03-01

4.5

Prometheus + Grafana is free yet holds its own against commercial monitoring. The 85°C threshold for the GPU temperature alert is practical.