
Server failure is a matter of when, not if. Power outages, disk failures, ransomware — data disappears for any number of reasons. Backup isn't about "should we or shouldn't we" — it's about "how fast can we recover." This guide covers automated multi-server backup using the rsync Pull method, SMART-based disk health monitoring, and procedures for rebuilding any failed server on bare hardware. The backup server hardware was covered previously, so this article focuses on operational strategy and automation.

Pull mode: backup server collects
flock: concurrency prevention
30 min: SMART check interval
Manual: human-verified recovery

Why Backup Actually Matters

Power outages can be handled with a UPS, and network failures resolve with a reboot. The real threat is physical disk failure. HDD annual failure rate (AFR) runs 1-3%. Operating 10 servers for 3 years gives you a 26-60% probability of at least one disk failure.
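Those probabilities follow from the complement rule: over N drives and Y years (assuming one disk per server, so 10 servers for 3 years is 30 drive-years), P(at least one failure) = 1 - (1 - AFR)^(N x Y). A quick sanity check:

```shell
# Complement rule: P(at least one failure) = 1 - (1 - AFR)^(drive-years)
# 10 servers x 3 years = 30 drive-years
awk 'BEGIN { printf "AFR 1%%: %.0f%%\n", (1 - 0.99^30) * 100 }'   # AFR 1%: 26%
awk 'BEGIN { printf "AFR 3%%: %.0f%%\n", (1 - 0.97^30) * 100 }'   # AFR 3%: 60%
```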

Scenarios Without Backup

Risk     | Scenario                | Impact
High     | OS disk failure         | OS + config files lost entirely; 1-2 day server rebuild
Critical | Data disk failure       | Database, model files, user data permanently lost
Critical | Ransomware infection    | All files encrypted; unrecoverable without backup
High     | Accidental rm -rf       | Unrecoverable; only restorable from backup
High     | RAID controller failure | Entire RAID array inaccessible; rebuild required

Pull vs Push

Server backup approaches fall into Push (production server sends to backup) and Pull (backup server fetches from production). We chose Pull for its security and centralized management advantages.

Push Method

Backup scripts required on production servers
Production servers hold backup server SSH keys
If production is compromised, backup is at risk
Script version management needed per server
Debug backup failures on individual servers

Pull Method

All scripts centrally managed on backup server
Production servers only allow read-only SSH keys
Compromised production can't access backup server
Backup schedules managed in one place
All failure logs viewable from one location

The essence of Pull is that "production servers know nothing about backup." The backup server connects via SSH keys and uses rsync to fetch data. From the production server's perspective, it just allows regular SSH connections — no backup scripts exist, so even if an attacker compromises the production server, they can't corrupt the backups.
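One common way to enforce that read-only posture (an illustration, not necessarily our exact setup) is a restricted entry in each production server's authorized_keys, using the rrsync wrapper that ships with rsync:

```
# ~/.ssh/authorized_keys on the production server
# Limits this key to read-only rsync under /srv -- no shell, no forwarding.
# (rrsync's install path varies by distro; /srv and the key are placeholders.)
command="/usr/bin/rrsync -ro /srv",no-pty,no-agent-forwarding,no-port-forwarding,no-X11-forwarding ssh-ed25519 AAAA... backup@backup-server
```

Even with this key in hand, the backup server can only read the exported tree; a compromised production server holds no credentials in the other direction.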

rsync Script Design

We use rsync's --archive --compress --delete options for incremental backups. Only changed files are transferred, so after the initial full copy, subsequent runs complete within minutes.

Core Script Architecture

Incremental backup: rsync --archive compares timestamps and transfers only changed files. 90%+ reduction in transfer volume after the initial backup
flock locking: flock /var/lock/backup.lock prevents concurrent execution. Skips a new backup if the previous one hasn't finished
Logging: records start/end time, file count, transfer size, and errors to daily files in /var/log/backup/
Error alerts: notifies immediately if the rsync exit code is non-zero. Detects network timeouts, disk space shortages, etc.
Exclusion patterns: --exclude filters out temp files, caches, and rotated logs. Prevents unnecessary transfers

Caution with --delete: This removes files from backup that were deleted on production. If you accidentally delete files on production, they disappear from backup on the next run. To mitigate this, use --backup --backup-dir to keep deleted files in a separate directory for 7 days. Files can be recovered if caught within that window.

cron Schedule Design

Backing up multiple servers simultaneously creates network bandwidth and disk I/O contention. Stagger server schedules to prevent backup window collisions.

Time  | Target              | Backup content                              | Est. duration
02:00 | Web server          | Next.js source, static assets, PM2 config   | ~5 min
02:30 | DB server           | PostgreSQL dump + WAL archive               | ~15 min
03:00 | AI inference server | Model weights, SGLang config, LoRA adapters | ~30 min
04:00 | Monitoring server   | Prometheus data, Grafana dashboards         | ~10 min
04:30 | VPN/firewall server | WireGuard config, OPNsense backup           | ~2 min

With 30-minute intervals, each backup completes before the next begins. The AI inference server has the largest files (tens of GB), but when models haven't changed, rsync's incremental transfer is only a few MB. With flock in place, even if a previous backup runs late, it won't collide with the next one.
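On the backup server, the schedule above is just a staggered crontab (script name and host aliases are placeholders for illustration):

```
# /etc/crontab on the backup server -- one Pull job per production host
# m  h  dom mon dow  user    command
0   2  *   *   *    backup  /opt/backup/pull.sh web01
30  2  *   *   *    backup  /opt/backup/pull.sh db01
0   3  *   *   *    backup  /opt/backup/pull.sh ai01
0   4  *   *   *    backup  /opt/backup/pull.sh mon01
30  4  *   *   *    backup  /opt/backup/pull.sh vpn01
```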

HDD Health Monitoring

If the backup disks die, your backups are worthless. We automatically check SMART status of three WD Red Pro 10TB drives every 30 minutes and monitor them in real-time via Grafana.

SMART Monitoring Setup

Short test (every Sunday 06:00): quick surface scan for obvious defects
Long test (every Wednesday 02:00): full-surface precision scan; detects latent bad sectors
SMART value collection (every 30 minutes): tracks Reallocated_Sector, Current_Pending, Temperature
Threshold alerts (real-time): immediate alert if Reallocated_Sector > 0 or Temperature > 55°C
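The weekly test schedule maps directly onto smartd's -s directive; a sketch of /etc/smartd.conf (device names and mail address are placeholders):

```
# /etc/smartd.conf -- three backup drives, scheduled self-tests
# -s regex: S = short test Sun 06:00, L = long test Wed 02:00
#           (day-of-week field: 1 = Monday ... 7 = Sunday)
/dev/sda -a -o on -S on -s (S/../../7/06|L/../../3/02) -m admin@example.com
/dev/sdb -a -o on -S on -s (S/../../7/06|L/../../3/02) -m admin@example.com
/dev/sdc -a -o on -S on -s (S/../../7/06|L/../../3/02) -m admin@example.com
```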

The most critical SMART metric is Reallocated_Sector_Ct (reallocated sector count). The moment this value changes from 0 to 1 is your disk replacement trigger. Not "it still works, so it's fine" — but "once reallocation starts, it accelerates." The cost of a disk is trivial compared to the cost of data loss.
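The 30-minute collection job can enforce that trigger by parsing `smartctl -A` output. A sketch, demoed against an embedded sample so it runs without a real disk; in production you would pipe in `smartctl -A /dev/sda` instead (thresholds match the table above):

```shell
#!/usr/bin/env bash
# Flag disk-replacement triggers from smartctl's attribute table.
set -euo pipefail

check_smart() {
    # In smartctl -A output, column 2 is the attribute name and
    # column 10 is the raw value.
    awk '
        $2 == "Reallocated_Sector_Ct" && $10 + 0 > 0 {
            print "ALERT: reallocated sectors =", $10
        }
        $2 == "Temperature_Celsius" && $10 + 0 > 55 {
            print "ALERT: temperature =", $10
        }
    '
}

# Sample attribute rows in smartctl's table layout (one reallocated
# sector -> should alert; 42 C -> should stay quiet)
check_smart <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       2
194 Temperature_Celsius     0x0022   110   100   000    Old_age   Always       -       42
EOF
```

Wired into cron every 30 minutes with the output piped to your alerting channel, this turns "the value changed from 0 to 1" into a page rather than something noticed during a quarterly review.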

Disaster Recovery Plan

The most important principle in disaster recovery is "never automate recovery." Automated recovery risks false triggers — misidentifying a brief network blip as a failure and overwriting good data with backup data. A human verifies the situation, determines recovery scope, and explicitly executes the restore.

Scenario 1: Service Process Crash (Low, RTO: 5 min)
1. Check service status (systemctl status)
2. Analyze logs (journalctl)
3. Restart the process
4. Verify in monitoring

Scenario 2: OS Disk Failure (High, RTO: 2-4 hours)
1. Install the OS on a new disk
2. Restore /etc and /home from backup
3. Reinstall service packages
4. Restore config files and start services

Scenario 3: Data Disk Failure (High, RTO: 4-8 hours)
1. Mount and partition the new disk
2. Full data restore via rsync from the backup server
3. DB recovery (WAL replay)
4. Verify data integrity

Scenario 4: Complete Server Physical Failure (Critical, RTO: 8-24 hours)
1. Prepare a spare or new server
2. Build the OS + base environment
3. Full data restore from backup
4. Switch traffic via DNS change
5. Verify service operation

DNS change is the final recovery step. If the server IP changes, update the Caddy reverse proxy upstream or the DNS A record. We don't automate this because switching to the wrong IP would take down the entire service. Verify thoroughly after recovery, then switch manually.

Lessons Learned

During our backup server build, we replaced Seagate IronWolf 12TB drives with WD Red Pro 10TB. Here's what we learned.

Never Ignore SMART Warnings

The IronWolf 12TB showed 2 reallocated sectors. We dismissed it with "still works fine" — 2 weeks later the count had grown to 8, and we decided to replace the drive immediately. By the time the replacement was complete, it had reached 12.

NAS HDDs Are Vibration-Sensitive

Mounting 3 drives side by side in a server rack transmitted vibrations, triggering SMART Current_Pending warnings. Resolved with rubber isolation mounts.

Test Backup Restores Regularly

Confirming daily backups run successfully is different from actually performing a restore. We validate our recovery procedures with quarterly restore tests.

Back Up the Backup Server Too

If the backup server's own disk fails, all backups are lost with it. We distribute data across 3 WD Red Pro drives and store a second copy of critical data on 2 of them.

Summary

1. Pull method: the backup server controls everything centrally; production servers are unaware of backup
2. flock: prevents concurrent execution and backup collisions; skips a run if the previous backup hasn't finished
3. Staggered cron: 30-minute intervals between servers prevent network/disk I/O contention
4. SMART monitoring: 30-minute check intervals; replace the disk immediately if Reallocated_Sector > 0
5. Manual recovery: no fully automated recovery; humans verify and explicitly execute
6. Restore testing: quarterly real restore tests validate the procedures

Backup isn't "insurance" — it's "infrastructure." When a server dies, "how fast can you recover?" is your backup system's performance metric. rsync Pull + flock + SMART + manual recovery — with these four pillars, you can restore service within half a day no matter which server fails.