Database Backup and Disaster Recovery: strategies for critical databases in production
Executive summary
Backup and disaster recovery strategies for critical databases require RPO/RTO planning, PITR, cross-region replication, automated snapshots, and regular restore testing.
Last updated: 2026-03-12
The cost of not planning DR
Data loss is not a question of "if" but "when." Commonly cited industry figures claim that around 40% of companies never recover after critical data loss, and that 60% close within six months of a significant data disaster. Whatever the exact numbers, for critical databases the absence of a tested disaster recovery (DR) plan is a gamble with worse odds than a casino.
The challenge isn't just "doing backup" — it's defining realistic RPO (Recovery Point Objective) and RTO (Recovery Time Objective), implementing PITR (Point-In-Time Recovery), ensuring functional cross-region replication, and regularly testing restores. A backup that has never been tested is not a backup — it's an illusion.
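An RPO target only becomes actionable once something compares it against the age of the last good backup. A minimal sketch of that comparison — the epoch timestamps are illustrative, and where the last-backup timestamp comes from depends on your backup tooling:

```shell
#!/bin/sh
# Compare backup age against an RPO target (all values in epoch seconds).
rpo_status() {
  # $1 = last successful backup, $2 = RPO target in seconds, $3 = "now"
  age=$(( $3 - $1 ))
  if [ "$age" -le "$2" ]; then
    echo "RPO met (backup age ${age}s <= ${2}s)"
  else
    echo "RPO BREACHED (backup age ${age}s > ${2}s)"
  fi
}

# A 500-second-old backup checked against a 15-minute (900s) target
rpo_status 1700000000 900 1700000500   # -> RPO met (backup age 500s <= 900s)
```

Wiring a check like this into cron or a monitoring exporter is what turns an RPO from a slide-deck number into an alert.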
Defining RPO and RTO
RPO (Recovery Point Objective)
RPO is the maximum amount of data that can be lost, measured in time.
```yaml
# RPO examples by workload type
rpo_matrix:
  financial_transactions:
    rpo: "0"      # Zero data loss
    strategy: synchronous_replication
    implementation: multi-region synchronous
  customer_orders:
    rpo: "1m"     # 1 minute
    strategy: async_replication + pitr
    implementation: async replication + 1-minute PITR
  user_analytics:
    rpo: "15m"    # 15 minutes
    strategy: periodic_backups
    implementation: 15-minute logical backups
  logs_metrics:
    rpo: "1h"     # 1 hour
    strategy: periodic_backups
    implementation: hourly logical backups
  static_content:
    rpo: "24h"    # 24 hours
    strategy: daily_backups
    implementation: daily full backups
```
RTO (Recovery Time Objective)
RTO is the maximum acceptable time to restore services after a disaster.
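An RTO target is only credible if drills actually measure it. A sketch of the timing harness — the `sleep` is a stand-in for your real restore procedure, which is an assumption here:

```shell
#!/bin/sh
# Time a restore drill and report the measured RTO.
DRILL_START=$(date +%s)

# --- real restore steps would run here, e.g.: ---
# pg_restore -h drill-host -d drill_db /backups/latest.dump
sleep 1   # stand-in for the actual restore

DRILL_END=$(date +%s)
RTO_SECONDS=$(( DRILL_END - DRILL_START ))
echo "Measured RTO: ${RTO_SECONDS}s"
```

Logging the measured value after every drill gives you a trend of actual RTO versus target, not just a one-off estimate.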
```yaml
# RTO examples by criticality
rto_matrix:
  critical:
    rto: "15m"
    requirements:
      - Automated failover
      - Hot standby
      - Health checks
    implementation: active-active multi-region
  high:
    rto: "1h"
    requirements:
      - Manual failover
      - Warm standby
      - Automated backups
    implementation: active-passive with automation
  medium:
    rto: "4h"
    requirements:
      - Manual restoration
      - Cold standby
      - Backup restoration
    implementation: restore from backup
  low:
    rto: "24h"
    requirements:
      - Manual restoration
      - No standby
      - Backup restoration
    implementation: restore from backup
```
PostgreSQL Backup Strategies
Physical Backups (pg_basebackup)
Physical backups are binary copies of the data directory, ideal for fast full restores.
```bash
#!/bin/bash
# pg_basebackup for physical backup

# Configuration
PGHOST="primary-db.example.com"
PGPORT="5432"
PGUSER="backup_user"
BACKUP_DIR="/backups/postgres/physical"
RETENTION_DAYS=7

# Create backup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="${BACKUP_DIR}/${TIMESTAMP}"

# -Ft: tar format; -z: compress; -P: show progress;
# -X stream: stream WAL alongside the backup; -R: write replication config;
# -w: never prompt for a password (use ~/.pgpass or PGPASSWORD)
pg_basebackup \
  -h "$PGHOST" \
  -p "$PGPORT" \
  -U "$PGUSER" \
  -D "$BACKUP_PATH" \
  -Ft \
  -z \
  -P \
  -X stream \
  -R \
  -w

# Retention
find "$BACKUP_DIR" -maxdepth 1 -type d -mtime +$RETENTION_DAYS -exec rm -rf {} \;

# Upload to S3/GCS
aws s3 sync "${BACKUP_PATH}" "s3://my-bucket/postgres-backups/${TIMESTAMP}/"
```
Logical Backups (pg_dump)
Logical backups are SQL dumps, ideal for granular restores and schema changes.
```bash
#!/bin/bash
# pg_dump for logical backup

# Configuration
PGHOST="primary-db.example.com"
PGUSER="backup_user"
DB_NAME="production_db"
BACKUP_DIR="/backups/postgres/logical"
RETENTION_DAYS=30
SCHEMA_ONLY=false

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump"

if [ "$SCHEMA_ONLY" = true ]; then
  # Schema-only backup
  pg_dump -h "$PGHOST" -U "$PGUSER" \
    --schema-only \
    --no-owner \
    --no-privileges \
    -f "${BACKUP_FILE}.schema.sql" "$DB_NAME"
else
  # Full logical backup
  pg_dump -h "$PGHOST" -U "$PGUSER" \
    --format=custom \
    --compress=9 \
    --no-owner \
    --no-privileges \
    -f "${BACKUP_FILE}" "$DB_NAME"
fi

# Upload to S3/GCS
aws s3 cp "${BACKUP_FILE}" s3://my-bucket/postgres-backups/logical/

# Retention
find "$BACKUP_DIR" -name "${DB_NAME}_*.dump" -mtime +$RETENTION_DAYS -delete
```
Continuous Archiving (WAL Archiving)
WAL (Write-Ahead Log) archiving enables PITR.
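Archiving that fails silently destroys PITR, so the `pg_stat_archiver` view is worth watching. A sketch that turns its counters into a check — the hostnames and the `psql` invocation in the comment are assumptions:

```shell
#!/bin/sh
# Alert when WAL archiving is failing, based on pg_stat_archiver counters.
check_archiver() {
  # $1 = archived_count, $2 = failed_count
  if [ "$2" -gt 0 ]; then
    echo "WARNING: ${2} WAL archive failures (${1} archived)"
  else
    echo "OK: ${1} WAL segments archived, no failures"
  fi
}

# In production, feed real counters, e.g.:
#   psql -h primary-db.example.com -U monitoring_user -d production_db -Atc \
#     "SELECT archived_count, failed_count FROM pg_stat_archiver;"
check_archiver 1523 0
```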
```bash
# postgresql.conf - WAL archiving configuration
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-bucket/postgres-wal/%f'
max_wal_senders = 10
wal_keep_size = 1GB

# Replication slots
max_replication_slots = 10
```

```bash
#!/bin/bash
# Sync a locally archived WAL directory to S3 and prune old segments.
# Only needed if archive_command writes to a local directory instead of
# directly to S3. Note: WAL segment files have no .wal extension -- they
# are 24-character hexadecimal names.
ARCHIVE_DIR="/backups/postgres/wal"
RETENTION_DAYS=7

# Ship archived WAL segments
for WAL_FILE in "$ARCHIVE_DIR"/*; do
  FILENAME=$(basename "$WAL_FILE")
  aws s3 cp "$WAL_FILE" "s3://my-bucket/postgres-wal/${FILENAME}"
done

# Clean archived WAL
find "$ARCHIVE_DIR" -type f -mtime +$RETENTION_DAYS -delete
```
Point-In-Time Recovery (PITR)
```bash
#!/bin/bash
# PITR restore script (PostgreSQL 12+: recovery.signal replaces recovery.conf)
RESTORE_TIMESTAMP="2026-03-12 14:30:00"
BACKUP_BASE="/backups/postgres/physical/latest"
RESTORE_DIR="/var/lib/postgresql/restore"
WAL_ARCHIVE="s3://my-bucket/postgres-wal"

# 1. Stop PostgreSQL
sudo systemctl stop postgresql

# 2. Clean existing data directory
rm -rf "$RESTORE_DIR"
mkdir -p "$RESTORE_DIR"

# 3. Restore the base backup taken BEFORE the target time
#    (a fresh pg_basebackup from the primary would already contain the bad
#    data; backup was taken with -Ft -z, so extract the tarball)
tar -xzf "${BACKUP_BASE}/base.tar.gz" -C "$RESTORE_DIR"

# 4. Configure recovery
cat >> "$RESTORE_DIR/postgresql.auto.conf" <<EOF
restore_command = 'aws s3 cp ${WAL_ARCHIVE}/%f %p'
recovery_target_time = '${RESTORE_TIMESTAMP}'
recovery_target_inclusive = true
EOF
touch "$RESTORE_DIR/recovery.signal"

# 5. Start PostgreSQL
sudo systemctl start postgresql

# 6. Monitor recovery logs
tail -f /var/log/postgresql/postgresql-*.log | grep "recovery"
```
MySQL Backup Strategies
Physical Backups (Percona XtraBackup)
```bash
#!/bin/bash
# Percona XtraBackup for MySQL physical backup
BACKUP_DIR="/backups/mysql/xtrabackup"
RETENTION_DAYS=7
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Full backup
xtrabackup \
  --backup \
  --target-dir=${BACKUP_DIR}/full-${TIMESTAMP} \
  --user=backup_user \
  --password=${MYSQL_BACKUP_PASSWORD} \
  --compress \
  --parallel=4

# Decompress (required before preparing a --compress backup)
xtrabackup \
  --decompress \
  --target-dir=${BACKUP_DIR}/full-${TIMESTAMP}

# Prepare backup (apply the redo log so the backup is consistent)
xtrabackup \
  --prepare \
  --target-dir=${BACKUP_DIR}/full-${TIMESTAMP}

# Upload to S3
aws s3 sync ${BACKUP_DIR}/full-${TIMESTAMP} \
  s3://my-bucket/mysql-backups/full-${TIMESTAMP}/

# Retention
find $BACKUP_DIR -maxdepth 1 -type d -name "full-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;
```
Logical Backups (mysqldump)
```bash
#!/bin/bash
# mysqldump for logical backup
DB_NAME="production_db"
BACKUP_DIR="/backups/mysql/logical"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Single database backup
# (--single-transaction gives a consistent InnoDB snapshot without locking;
#  output is piped through gzip -- mysqldump's own --compress flag was
#  protocol compression only and was removed in MySQL 8.0)
mysqldump \
  --user=backup_user \
  --password=${MYSQL_BACKUP_PASSWORD} \
  --single-transaction \
  --routines \
  --triggers \
  --events \
  --quick \
  --lock-tables=false \
  $DB_NAME | gzip > ${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz

# All databases backup
mysqldump \
  --user=backup_user \
  --password=${MYSQL_BACKUP_PASSWORD} \
  --all-databases \
  --single-transaction \
  --routines \
  --triggers \
  --events \
  --quick \
  --lock-tables=false \
  | gzip > ${BACKUP_DIR}/all_databases_${TIMESTAMP}.sql.gz

# Upload to S3
aws s3 cp ${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz \
  s3://my-bucket/mysql-backups/logical/

# Retention
find $BACKUP_DIR -name "*.sql.gz" -mtime +$RETENTION_DAYS -delete
```
Binary Log for PITR
```bash
# my.cnf - Binary log configuration
[mysqld]
log-bin=mysql-bin
binlog_format=ROW
binlog_expire_logs_seconds=604800  # 7 days
sync_binlog=1
max_binlog_size=1G
```

```bash
#!/bin/bash
# Backup binary logs
BINLOG_DIR="/var/lib/mysql"
ARCHIVE_DIR="/backups/mysql/binlog"
RETENTION_DAYS=7

# Rotate logs first so the active binlog is closed before copying
mysqladmin --user=backup_user --password=${MYSQL_BACKUP_PASSWORD} flush-logs

# Copy binary logs (for PITR, replay them later with e.g.:
#   mysqlbinlog --stop-datetime="2026-03-12 14:30:00" mysql-bin.000001 | mysql)
cp ${BINLOG_DIR}/mysql-bin.* ${ARCHIVE_DIR}/

# Upload to S3
aws s3 sync ${ARCHIVE_DIR} s3://my-bucket/mysql-binlog/

# Retention
find $ARCHIVE_DIR -name "mysql-bin.*" -mtime +$RETENTION_DAYS -delete
```
Cross-Region Replication
PostgreSQL Logical Replication
```sql
-- Primary region
CREATE PUBLICATION my_publication
  FOR ALL TABLES;

-- Standby region (replica)
-- create_slot = false assumes the replication slot was created on the
-- primary beforehand (pg_create_logical_replication_slot)
CREATE SUBSCRIPTION my_subscription
  CONNECTION 'host=primary-db.example.com port=5432 dbname=production_db user=replication_user password=xxx'
  PUBLICATION my_publication
  WITH (create_slot = false);
```

```bash
#!/bin/bash
# Replication monitoring script
# NOTE: this lag query applies to a physical streaming standby; for logical
# replication, inspect pg_stat_subscription on the subscriber instead.
STANDBY_HOST="standby-db.example.com"
PGUSER="monitoring_user"

# Check replication lag
psql -h $STANDBY_HOST -U $PGUSER -d production_db -c "
SELECT
  now() - pg_last_xact_replay_timestamp() AS replication_lag,
  pg_last_xact_replay_timestamp() AS last_replay_time
;"
```
MySQL Replication
```sql
-- Standby region: run on the replica to point it at the primary
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST='primary-db.example.com',
  SOURCE_USER='replication_user',
  SOURCE_PASSWORD='xxx',
  SOURCE_LOG_FILE='mysql-bin.000001',
  SOURCE_LOG_POS=4;

START REPLICA;
```

```bash
#!/bin/bash
# MySQL replication monitoring
STANDBY_HOST="standby-db.example.com"
MYSQL_USER="monitoring_user"

# Check replication lag
mysql -h $STANDBY_HOST -u $MYSQL_USER -e "
SHOW REPLICA STATUS\G" | grep "Seconds_Behind_Source"
```
Automated Backup Orchestration
Kubernetes CronJobs
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: database
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  jobTemplate:
    spec:
      backoffLimit: 3
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              # NOTE: the image must provide both pg_dump and the aws CLI
              # (postgres:15-alpine alone does not ship the aws CLI)
              image: postgres:15-alpine
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -h postgres-primary -U $POSTGRES_USER \
                    -d $POSTGRES_DB -Ft -f /backup/db.dump.tar
                  # Upload to S3
                  aws s3 cp /backup/db.dump.tar \
                    s3://my-bucket/postgres-backups/$(date +%Y%m%d_%H%M%S).tar
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-secret
                      key: password
                - name: POSTGRES_USER
                  valueFrom:
                    secretKeyRef:
                      name: postgres-secret
                      key: username
                - name: POSTGRES_DB
                  value: "production_db"
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: access-key-id
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: secret-access-key
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              emptyDir: {}
```
Terraform Managed Backups
```hcl
# AWS RDS automated backups
resource "aws_db_instance" "production" {
  allocated_storage = 100
  storage_type      = "gp3"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.r6g.xlarge"
  db_name           = "production"

  # Backup configuration
  backup_retention_period = 7  # days
  backup_window           = "03:00-04:00"
  maintenance_window      = "sun:04:00-sun:05:00"

  # Multi-AZ for high availability
  multi_az = true

  # Final snapshot on deletion
  final_snapshot_identifier = "production-final-snapshot"
  skip_final_snapshot       = false

  # Deletion protection
  deletion_protection = true

  # Performance insights
  performance_insights_enabled = true
  monitoring_interval          = 60
  monitoring_role_arn          = aws_iam_role.rds_monitoring.arn
}
```
Disaster Recovery Testing
Automated DR Testing
```bash
#!/bin/bash
# Automated DR test script
PRIMARY_REGION="us-east-1"
DR_REGION="us-west-2"
DB_IDENTIFIER="production-db"
# RDS instance identifiers may not contain underscores
TEST_DB="dr-test-$(date +%Y%m%d-%H%M%S)"

# 1. Create test database from snapshot
#    (assumes the snapshot has already been copied to the DR region)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier ${TEST_DB} \
  --db-snapshot-identifier ${DB_IDENTIFIER}-snapshot \
  --region $DR_REGION

# 2. Wait for restoration to complete
aws rds wait db-instance-available \
  --db-instance-identifier ${TEST_DB} \
  --region $DR_REGION

# 3. Test database connectivity
TEST_HOST=$(aws rds describe-db-instances \
  --db-instance-identifier ${TEST_DB} \
  --region $DR_REGION \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)

psql -h $TEST_HOST -U test_user -d production_db -c "SELECT NOW();"

# 4. Run data integrity checks
psql -h $TEST_HOST -U test_user -d production_db -c "
SELECT COUNT(*) FROM users;
SELECT COUNT(*) FROM orders;
SELECT MAX(created_at) FROM orders;
"

# 5. Test query performance
time psql -h $TEST_HOST -U test_user -d production_db -c "
EXPLAIN ANALYZE SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '1 day';
"

# 6. Cleanup
aws rds delete-db-instance \
  --db-instance-identifier ${TEST_DB} \
  --skip-final-snapshot \
  --region $DR_REGION

# Log results
echo "DR test completed at $(date)" >> /var/log/dr-tests.log
```
DR Drill Schedule
```yaml
disaster_recovery_testing_schedule:
  weekly:
    - Verify replication lag < 1 minute
    - Test backup integrity
    - Verify backup completion within SLA
  monthly:
    - Restore from latest backup to test environment
    - Validate data integrity
    - Run application smoke tests
    - Document recovery time
  quarterly:
    - Full failover drill to DR region
    - Test all application functionality
    - Measure actual RTO vs target
    - Update DR documentation
  annually:
    - Complete DR plan review
    - Update RPO/RTO targets
    - Test backup restoration from 1 year ago
    - Update contact information
```
Monitoring and Alerting
Backup Monitoring
```yaml
# Prometheus alerting rules
groups:
  - name: database_backups
    rules:
      # Backup age alert
      - alert: DatabaseBackupTooOld
        expr: |
          time() - postgres_backup_last_success_timestamp_seconds > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Database backup is more than 24 hours old"
          description: "Last successful backup for {{ $labels.instance }} was {{ $value }}s ago"
      # Backup size anomaly
      - alert: DatabaseBackupSizeAnomaly
        expr: |
          abs(postgres_backup_size_bytes - postgres_backup_size_bytes offset 1h) / postgres_backup_size_bytes > 0.2
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Database backup size changed by more than 20%"
          description: "Backup size for {{ $labels.instance }} changed significantly"
      # Replication lag
      - alert: DatabaseReplicationLagHigh
        expr: |
          pg_replication_lag_seconds > 300
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database replication lag exceeds 5 minutes"
          description: "Replication lag for {{ $labels.instance }} is {{ $value }}s"
```
DR Dashboard
```typescript
// DR monitoring dashboard metrics
interface DRMetrics {
  backups: {
    lastSuccessful: Date;
    lastFailed: Date | null;
    ageHours: number;
    sizeGB: number;
    retentionDays: number;
  };
  replication: {
    lagSeconds: number;
    status: 'syncing' | 'idle' | 'error';
    throughputMBps: number;
  };
  lastDRTest: {
    date: Date;
    durationMinutes: number;
    success: boolean;
    issues: string[];
  };
  rpo: {
    target: string;
    current: string;
    status: 'met' | 'exceeded';
  };
  rto: {
    target: string;
    estimated: string;
    status: 'met' | 'exceeded';
  };
}
```
Recovery Playbooks
Scenario 1: Single Table Corruption
```bash
#!/bin/bash
# Recover single table from logical backup
TABLE_NAME="corrupted_table"
BACKUP_FILE="/backups/postgres/logical/latest.dump"
DB_NAME="production_db"

# 1. Restore the table to a temporary database
createdb -h localhost -U postgres temp_restore_db
pg_restore -h localhost -U postgres \
  -d temp_restore_db \
  -t $TABLE_NAME \
  $BACKUP_FILE

# 2. Export table data only (the table already exists in the target database)
pg_dump -h localhost -U postgres \
  --data-only \
  -t $TABLE_NAME \
  -d temp_restore_db \
  -f /tmp/table_restore.sql

# 3. Truncate corrupted table
psql -h localhost -U postgres -d $DB_NAME \
  -c "TRUNCATE TABLE $TABLE_NAME;"

# 4. Restore data
psql -h localhost -U postgres -d $DB_NAME -f /tmp/table_restore.sql

# 5. Cleanup
dropdb -h localhost -U postgres temp_restore_db
rm /tmp/table_restore.sql
```
Scenario 2: Full Database Restore
```bash
#!/bin/bash
# Full database restore via PITR (PostgreSQL 12+)
RECOVERY_TIMESTAMP="2026-03-12 14:30:00"
BACKUP_BASE="/backups/postgres/physical/latest"
RESTORE_DIR="/var/lib/postgresql/restore"

# 1. Stop database
sudo systemctl stop postgresql

# 2. Restore base backup from the archive (not a fresh copy of the primary,
#    which would already contain the data you are trying to rewind)
mkdir -p "$RESTORE_DIR"
tar -xzf "${BACKUP_BASE}/base.tar.gz" -C "$RESTORE_DIR"

# 3. Configure recovery (recovery.conf was removed in PostgreSQL 12)
cat >> "$RESTORE_DIR/postgresql.auto.conf" <<EOF
restore_command = 'aws s3 cp s3://my-bucket/postgres-wal/%f %p'
recovery_target_time = '${RECOVERY_TIMESTAMP}'
recovery_target_action = 'promote'
EOF
touch "$RESTORE_DIR/recovery.signal"

# 4. Start database in recovery mode
sudo systemctl start postgresql

# 5. Monitor recovery
tail -f /var/log/postgresql/postgresql-*.log | grep "recovery"

# 6. When the target time is reached, the server promotes itself
#    (recovery_target_action = 'promote' above)

# 7. Update application connection string
# (Manual step)
```
Scenario 3: Region Failover
```bash
#!/bin/bash
# Region failover procedure
PRIMARY_REGION="us-east-1"
DR_REGION="us-west-2"
HOSTED_ZONE_ID="Z0123456789ABC"  # Route 53 hosted zone ID for example.com

# 1. Verify DR region is healthy
aws rds describe-db-instances \
  --region $DR_REGION \
  --db-instance-identifier production-db-dr

# 2. Promote DR replica to a standalone primary
#    (promotion automatically severs replication from the old primary)
aws rds promote-read-replica \
  --db-instance-identifier production-db-dr \
  --region $DR_REGION

# 3. Update DNS
aws route53 change-resource-record-sets \
  --hosted-zone-id $HOSTED_ZONE_ID \
  --change-batch file://dns-update.json

# 4. Verify application connectivity
curl -f https://api.example.com/health

# 5. Re-establish reverse replication
aws rds create-db-instance-read-replica \
  --source-db-instance-identifier production-db-dr \
  --db-instance-identifier production-db-primary-replica \
  --region $PRIMARY_REGION
```
Conclusion
Backup and disaster recovery for critical databases is not a one-time implementation — it's a continuous discipline combining automated backups, PITR, cross-region replication, regular restore testing, and documented playbooks. Organizations that treat DR as a systematic process, not a compliance checklist, ensure survival when disasters inevitably occur.
Need help with backup and disaster recovery for critical databases? Talk to Imperialis cloud specialists about a backup and DR architecture with defined RPO/RTO, PITR, and automated testing.
Sources
- PostgreSQL Backup and Recovery documentation — accessed on 2026-03
- MySQL Backup and Recovery documentation — accessed on 2026-03
- AWS RDS Backup and Restore — accessed on 2026-03
- Percona XtraBackup documentation — accessed on 2026-03