
Database Backup and Disaster Recovery: strategies for critical databases in production

Backup and disaster recovery strategies for critical databases require RPO/RTO planning, PITR, cross-region replication, automated snapshots, and regular restore testing.

3/12/2026 · 10 min read · Cloud


Last updated: 3/12/2026

The cost of not planning DR

Data loss is not a question of "if" but "when." Frequently cited industry figures suggest that roughly 40% of companies never recover from critical data loss, and around 60% close within six months of a major data disaster. For critical databases, the absence of a tested disaster recovery (DR) plan is a gamble with worse odds than a casino.

The challenge isn't just "doing backup" — it's defining realistic RPO (Recovery Point Objective) and RTO (Recovery Time Objective), implementing PITR (Point-In-Time Recovery), ensuring functional cross-region replication, and regularly testing restores. A backup that has never been tested is not a backup — it's an illusion.
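Since an untested backup is only an illusion, even the simplest automated check adds real safety. A minimal sketch (file names, paths, and the function name are illustrative) that sanity-checks a compressed dump before it is trusted — it proves the artifact is intact, while only a real restore test proves it is usable:

```shell
#!/bin/sh
# Minimal sanity check before trusting a compressed dump: the file
# must exist, be non-empty, and decompress cleanly.
verify_dump() {
  dump=$1
  if [ ! -s "$dump" ]; then
    echo "FAIL: missing or empty: $dump"
    return 1
  fi
  if ! gzip -t "$dump" 2>/dev/null; then
    echo "FAIL: corrupt archive: $dump"
    return 1
  fi
  echo "OK: $dump"
}
```

A check like this belongs at the end of every backup job, with a non-zero exit wired to alerting.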

Defining RPO and RTO

RPO (Recovery Point Objective)

RPO is the maximum amount of data that can be lost, measured in time.

# RPO examples by workload type
rpo_matrix:
  financial_transactions:
    rpo: "0"  # Zero data loss
    strategy: synchronous_replication
    implementation: multi-region synchronous

  customer_orders:
    rpo: "1m"  # 1 minute
    strategy: async_replication + pitr
    implementation: async replication + 1-minute PITR

  user_analytics:
    rpo: "15m"  # 15 minutes
    strategy: periodic_backups
    implementation: 15-minute logical backups

  logs_metrics:
    rpo: "1h"  # 1 hour
    strategy: periodic_backups
    implementation: hourly logical backups

  static_content:
    rpo: "24h"  # 24 hours
    strategy: daily_backups
    implementation: daily full backups
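A monitoring job can turn the matrix above into an automated check. A small sketch (the helper names are illustrative) that converts an RPO string like `15m` into seconds and compares it against measured replication lag:

```shell
#!/bin/sh
# Convert an RPO string from the matrix ("0", "1m", "15m", "1h", "24h")
# into seconds so a check can compare it with measured lag.
rpo_to_seconds() {
  case $1 in
    *h) echo $(( ${1%h} * 3600 )) ;;
    *m) echo $(( ${1%m} * 60 )) ;;
    *s) echo "${1%s}" ;;
    *)  echo "$1" ;;            # bare number = seconds
  esac
}

# Returns 0 when measured lag (in seconds) is within the RPO target.
rpo_met() {
  lag=$1
  target=$2
  [ "$lag" -le "$(rpo_to_seconds "$target")" ]
}
```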

RTO (Recovery Time Objective)

RTO is the maximum acceptable time to restore services after a disaster.

# RTO examples by criticality
rto_matrix:
  critical:
    rto: "15m"
    requirements:
      - Automated failover
      - Hot standby
      - Health checks
    implementation: active-active multi-region

  high:
    rto: "1h"
    requirements:
      - Manual failover
      - Warm standby
      - Automated backups
    implementation: active-passive with automation

  medium:
    rto: "4h"
    requirements:
      - Manual restoration
      - Cold standby
      - Backup restoration
    implementation: restore from backup

  low:
    rto: "24h"
    requirements:
      - Manual restoration
      - No standby
      - Backup restoration
    implementation: restore from backup
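During a drill, the measured recovery time should be compared against these targets. A sketch of a timing wrapper (the function name is an assumption; pass your actual restore procedure as the wrapped command):

```shell
#!/bin/sh
# Wrap a restore procedure with a timer and compare the measured
# RTO against the target from the matrix above.
measure_rto() {
  target_seconds=$1
  shift
  start=$(date +%s)
  "$@"                                   # the restore procedure under test
  elapsed=$(( $(date +%s) - start ))
  echo "restore took ${elapsed}s (target ${target_seconds}s)"
  [ "$elapsed" -le "$target_seconds" ]
}
```

Recording the elapsed value from each drill gives the "actual RTO vs target" history the quarterly review needs.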

PostgreSQL Backup Strategies

Physical Backups (pg_basebackup)

Physical backups are binary copies of the data directory, ideal for fast full restores.

#!/bin/bash
# pg_basebackup for physical backup

# Configuration
PGHOST="primary-db.example.com"
PGPORT="5432"
PGUSER="backup_user"
BACKUP_DIR="/backups/postgres/physical"
RETENTION_DAYS=7

# Create backup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="${BACKUP_DIR}/${TIMESTAMP}"

# Inline comments cannot follow line-continuation backslashes, so the
# options are spelled out with their long names instead:
#   --write-recovery-conf (-R) writes standby.signal plus connection
#   settings on PostgreSQL 12+; --no-password (-w) never prompts
pg_basebackup \
  -h "$PGHOST" \
  -p "$PGPORT" \
  -U "$PGUSER" \
  -D "$BACKUP_PATH" \
  --format=tar \
  --gzip \
  --progress \
  --wal-method=stream \
  --write-recovery-conf \
  --no-password

# Retention
find $BACKUP_DIR -maxdepth 1 -type d -mtime +$RETENTION_DAYS -exec rm -rf {} \;

# Upload to S3/GCS
aws s3 sync ${BACKUP_PATH} s3://my-bucket/postgres-backups/${TIMESTAMP}/

Logical Backups (pg_dump)

Logical backups are SQL dumps, ideal for granular restores and schema changes.

#!/bin/bash
# pg_dump for logical backup

# Configuration
DB_NAME="production_db"
BACKUP_DIR="/backups/postgres/logical"
RETENTION_DAYS=30
SCHEMA_ONLY=false

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump"

if [ "$SCHEMA_ONLY" = true ]; then
  # Schema only backup
  pg_dump -h $PGHOST -U $PGUSER \
    --schema-only \
    --no-owner \
    --no-privileges \
    -f ${BACKUP_FILE}.schema.sql $DB_NAME
else
  # Full logical backup
  pg_dump -h $PGHOST -U $PGUSER \
    --format=custom \
    --compress=9 \
    --no-owner \
    --no-privileges \
    -f ${BACKUP_FILE} $DB_NAME
fi

# Upload to S3/GCS
aws s3 cp ${BACKUP_FILE} s3://my-bucket/postgres-backups/logical/

# Retention
find $BACKUP_DIR -name "${DB_NAME}_*.dump" -mtime +$RETENTION_DAYS -delete
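One caveat with mtime-based retention: modification times are easy to lose when backups round-trip through object storage. A sketch (GNU `date` assumed; the helper name is illustrative) that decides expiry from the `YYYYMMDD_HHMMSS` stamp embedded in the filename instead:

```shell
#!/bin/sh
# Decide expiry from the timestamp embedded in the backup filename
# (e.g. production_db_20260312_143000.dump) rather than from mtime.
backup_expired() {
  fname=$1
  retention_days=$2
  now=${3:-$(date +%s)}
  # Extract the 8-digit date that precedes the 6-digit time
  stamp=$(echo "$fname" | sed -n 's/.*_\([0-9]\{8\}\)_[0-9]\{6\}.*/\1/p')
  [ -n "$stamp" ] || return 1          # no stamp found: keep the file
  file_epoch=$(date -d "$stamp" +%s 2>/dev/null) || return 1
  [ $(( now - file_epoch )) -gt $(( retention_days * 86400 )) ]
}
```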

Continuous Archiving (WAL Archiving)

WAL (Write-Ahead Log) archiving enables PITR.

# postgresql.conf - WAL archiving configuration
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-bucket/postgres-wal/%f'
max_wal_senders = 10
wal_keep_size = 1GB

# Replication slots
max_replication_slots = 10
#!/bin/bash
# Archiving script for PITR
# Adjust the pg_wal location to your data directory layout
WAL_DIR="/var/lib/postgresql/15/main/pg_wal"
ARCHIVE_DIR="/backups/postgres/wal"
RETENTION_DAYS=7

# Archive WAL segments (24-hex-character filenames, no extension)
for WAL_FILE in "$WAL_DIR"/[0-9A-F]*; do
  FILENAME=$(basename "$WAL_FILE")
  aws s3 cp "$WAL_FILE" s3://my-bucket/postgres-wal/${FILENAME}
done

# Clean archived WAL
find $ARCHIVE_DIR -type f -mtime +$RETENTION_DAYS -delete
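The PostgreSQL documentation recommends that `archive_command` fail rather than overwrite an existing segment. A sketch of a wrapper enforcing that rule (shown with a local staging directory; swap in the `aws s3 cp` upload for production — the script name and paths are assumptions):

```shell
#!/bin/sh
# Safe archive_command wrapper: PostgreSQL expects the command to
# fail rather than silently overwrite an already-archived segment.
# Example postgresql.conf usage:
#   archive_command = '/usr/local/bin/archive_wal.sh %p %f'
archive_wal() {
  src=$1
  fname=$2
  dest_dir=${3:-/backups/postgres/wal}
  if [ -e "$dest_dir/$fname" ]; then
    echo "refusing to overwrite $dest_dir/$fname" >&2
    return 1
  fi
  cp "$src" "$dest_dir/$fname"
}
```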

Point-In-Time Recovery (PITR)

#!/bin/bash
# PITR restore script

RESTORE_TIMESTAMP="2026-03-12 14:30:00"
BACKUP_BASE="/backups/postgres/physical/latest"
RESTORE_DIR="/var/lib/postgresql/restore"
WAL_ARCHIVE="s3://my-bucket/postgres-wal"

# 1. Stop PostgreSQL
sudo systemctl stop postgresql

# 2. Clean restore directory
rm -rf $RESTORE_DIR
mkdir -p $RESTORE_DIR

# 3. Restore the stored base backup (do not re-run pg_basebackup
#    against the primary -- it may be the node that failed)
tar -xzf ${BACKUP_BASE}/base.tar.gz -C $RESTORE_DIR

# 4. Configure recovery (PostgreSQL 12+: recovery.conf no longer
#    exists; settings go in postgresql.auto.conf plus recovery.signal)
cat >> $RESTORE_DIR/postgresql.auto.conf <<EOF
restore_command = 'aws s3 cp s3://my-bucket/postgres-wal/%f %p'
recovery_target_time = '${RESTORE_TIMESTAMP}'
recovery_target_inclusive = true
recovery_target_action = 'promote'
EOF
touch $RESTORE_DIR/recovery.signal

# 5. Start PostgreSQL
sudo systemctl start postgresql

# 6. Monitor recovery logs
tail -f /var/log/postgresql/postgresql-*.log | grep "recovery"

MySQL Backup Strategies

Physical Backups (Percona XtraBackup)

#!/bin/bash
# Percona XtraBackup for MySQL physical backup

BACKUP_DIR="/backups/mysql/xtrabackup"
RETENTION_DAYS=7
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Full backup
xtrabackup \
  --backup \
  --target-dir=${BACKUP_DIR}/full-${TIMESTAMP} \
  --user=backup_user \
  --password=${MYSQL_BACKUP_PASSWORD} \
  --compress \
  --parallel=4

# Prepare backup
xtrabackup \
  --prepare \
  --target-dir=${BACKUP_DIR}/full-${TIMESTAMP}

# Upload to S3
aws s3 sync ${BACKUP_DIR}/full-${TIMESTAMP} \
  s3://my-bucket/mysql-backups/full-${TIMESTAMP}/

# Retention
find $BACKUP_DIR -maxdepth 1 -type d -name "full-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;

Logical Backups (mysqldump)

#!/bin/bash
# mysqldump for logical backup

DB_NAME="production_db"
BACKUP_DIR="/backups/mysql/logical"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Single database backup
mysqldump \
  --user=backup_user \
  --password=${MYSQL_BACKUP_PASSWORD} \
  --single-transaction \
  --routines \
  --triggers \
  --events \
  --quick \
  --lock-tables=false \
  $DB_NAME | gzip > ${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz

# All databases backup
mysqldump \
  --user=backup_user \
  --password=${MYSQL_BACKUP_PASSWORD} \
  --all-databases \
  --single-transaction \
  --routines \
  --triggers \
  --events \
  --quick \
  --lock-tables=false | gzip > ${BACKUP_DIR}/all_databases_${TIMESTAMP}.sql.gz

# Upload to S3
aws s3 cp ${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz \
  s3://my-bucket/mysql-backups/logical/

# Retention
find $BACKUP_DIR -name "*.sql.gz" -mtime +$RETENTION_DAYS -delete

Binary Log for PITR

# my.cnf - Binary log configuration
[mysqld]
log-bin=mysql-bin
binlog_format=ROW
binlog_expire_logs_seconds=604800  # 7 days
sync_binlog=1
max_binlog_size=1G

#!/bin/bash
# Backup binary logs
BINLOG_DIR="/var/lib/mysql"
ARCHIVE_DIR="/backups/mysql/binlog"
RETENTION_DAYS=7

# Rotate first so the active log is closed, then copy the closed logs
mysql --user=backup_user --password=${MYSQL_BACKUP_PASSWORD} -e "FLUSH BINARY LOGS;"
cp ${BINLOG_DIR}/mysql-bin.[0-9]* ${ARCHIVE_DIR}/

# Upload to S3
aws s3 sync ${ARCHIVE_DIR} s3://my-bucket/mysql-binlog/

# Retention
find $ARCHIVE_DIR -name "mysql-bin.*" -mtime +$RETENTION_DAYS -delete
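To actually perform PITR from these archived logs, events recorded after the last full backup are replayed up to the moment just before the incident. A sketch that builds the `mysqlbinlog` replay command (paths and the helper name are illustrative; pipe the result to `sh` only after reviewing it):

```shell
#!/bin/sh
# Build the mysqlbinlog replay command for PITR: replay archived
# binary logs up to (but not including) the incident timestamp.
build_binlog_replay() {
  stop_time=$1
  shift
  echo "mysqlbinlog --stop-datetime='$stop_time' $* | mysql -u root -p"
}
```

Usage: `build_binlog_replay '2026-03-12 14:30:00' /backups/mysql/binlog/mysql-bin.0000*` after restoring the last full backup.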

Cross-Region Replication

PostgreSQL Logical Replication

-- Primary region
CREATE PUBLICATION my_publication
FOR ALL TABLES;

-- Standby region (replica). With create_slot = false, the replication
-- slot must already exist on the publisher, e.g.:
--   SELECT pg_create_logical_replication_slot('my_subscription', 'pgoutput');
CREATE SUBSCRIPTION my_subscription
CONNECTION 'host=primary-db.example.com port=5432 dbname=production_db user=replication_user password=xxx'
PUBLICATION my_publication
WITH (create_slot = false);
#!/bin/bash
# Replication monitoring script
STANDBY_HOST="standby-db.example.com"
PGUSER="monitoring_user"

# Check replication lag
psql -h $STANDBY_HOST -U $PGUSER -d production_db -c "
SELECT
  now() - pg_last_xact_replay_timestamp() AS replication_lag,
  pg_last_xact_replay_timestamp() AS last_replay_time
;"
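The lag value returned by that query can be mapped to an alert level so the check can feed a pager or dashboard. A minimal sketch (thresholds are example values, not recommendations):

```shell
#!/bin/sh
# Map a measured replication lag (in seconds) to an alert level.
lag_severity() {
  lag=$1
  if   [ "$lag" -ge 300 ]; then echo critical   # matches the 5-minute alert below
  elif [ "$lag" -ge 60 ];  then echo warning
  else echo ok
  fi
}
```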

MySQL Replication

-- Standby region (replica): point the replica at the primary
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST='primary-db.example.com',
  SOURCE_USER='replication_user',
  SOURCE_PASSWORD='xxx',
  SOURCE_LOG_FILE='mysql-bin.000001',
  SOURCE_LOG_POS=4;

START REPLICA;
#!/bin/bash
# MySQL replication monitoring
STANDBY_HOST="standby-db.example.com"
MYSQL_USER="monitoring_user"

# Check replication lag
mysql -h $STANDBY_HOST -u $MYSQL_USER -e "
SHOW REPLICA STATUS\G" | grep "Seconds_Behind_Source"

Automated Backup Orchestration

Kubernetes CronJobs

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: database
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            # note: the stock postgres image does not ship the AWS CLI;
            # use an image that bundles it for the S3 upload step
            image: postgres:15-alpine
            command:
            - /bin/sh
            - -c
            - |
              pg_dump -h postgres-primary -U $POSTGRES_USER \
                -d $POSTGRES_DB -Ft -f /backup/db.dump.tar

              # Upload to S3
              aws s3 cp /backup/db.dump.tar \
                s3://my-bucket/postgres-backups/$(date +%Y%m%d_%H%M%S).tar
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: username
            - name: POSTGRES_DB
              value: "production_db"
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key-id
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: secret-access-key
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            emptyDir: {}
          restartPolicy: OnFailure
      backoffLimit: 3

Terraform Managed Backups

# AWS RDS automated backups
resource "aws_db_instance" "production" {
  allocated_storage    = 100
  storage_type         = "gp3"
  engine               = "postgres"
  engine_version       = "15.4"
  instance_class       = "db.r6g.xlarge"
  db_name              = "production"

  # Backup configuration
  backup_retention_period = 7  # days
  backup_window          = "03:00-04:00"
  maintenance_window     = "sun:04:00-sun:05:00"

  # Multi-AZ for high availability
  multi_az = true

  # Final snapshot on deletion
  final_snapshot_identifier = "production-final-snapshot"
  skip_final_snapshot      = false

  # Deletion protection
  deletion_protection = true

  # Performance insights
  performance_insights_enabled = true
  monitoring_interval         = 60
  monitoring_role_arn        = aws_iam_role.rds_monitoring.arn
}

Disaster Recovery Testing

Automated DR Testing

#!/bin/bash
# Automated DR test script

PRIMARY_REGION="us-east-1"
DR_REGION="us-west-2"
DB_IDENTIFIER="production-db"
TEST_DB="dr_test_$(date +%Y%m%d_%H%M%S)"

# 1. Create test database from snapshot
#    (assumes the snapshot has already been copied to the DR region)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier ${TEST_DB} \
  --db-snapshot-identifier ${DB_IDENTIFIER}-snapshot \
  --region $DR_REGION

# 2. Wait for restoration to complete
aws rds wait db-instance-available \
  --db-instance-identifier ${TEST_DB} \
  --region $DR_REGION

# 3. Test database connectivity
TEST_HOST=$(aws rds describe-db-instances \
  --db-instance-identifier ${TEST_DB} \
  --region $DR_REGION \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)

psql -h $TEST_HOST -U test_user -d production_db -c "SELECT NOW();"

# 4. Run data integrity checks
psql -h $TEST_HOST -U test_user -d production_db -c "
SELECT COUNT(*) FROM users;
SELECT COUNT(*) FROM orders;
SELECT MAX(created_at) FROM orders;
"

# 5. Test query performance
time psql -h $TEST_HOST -U test_user -d production_db -c "
EXPLAIN ANALYZE SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '1 day';
"

# 6. Cleanup
aws rds delete-db-instance \
  --db-instance-identifier ${TEST_DB} \
  --skip-final-snapshot \
  --region $DR_REGION

# Log results
echo "DR test completed at $(date)" >> /var/log/dr-tests.log
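The data-integrity step above can be made tolerant of rows written between the snapshot and the check. A sketch of a count comparison with an explicit tolerance (the helper name is an assumption):

```shell
#!/bin/sh
# Compare a row count captured on the primary against the restored
# copy, allowing a small tolerance for rows written after the snapshot.
counts_match() {
  primary=$1
  restored=$2
  tolerance=${3:-0}
  diff=$(( primary - restored ))
  [ "$diff" -lt 0 ] && diff=$(( -diff ))
  [ "$diff" -le "$tolerance" ]
}
```

Feed it the `COUNT(*)` results from both sides; a non-zero exit fails the DR test run.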

DR Drill Schedule

disaster_recovery_testing_schedule:
  weekly:
    - Verify replication lag < 1 minute
    - Test backup integrity
    - Verify backup completion within SLA

  monthly:
    - Restore from latest backup to test environment
    - Validate data integrity
    - Run application smoke tests
    - Document recovery time

  quarterly:
    - Full failover drill to DR region
    - Test all application functionality
    - Measure actual RTO vs target
    - Update DR documentation

  annually:
    - Complete DR plan review
    - Update RPO/RTO targets
    - Test backup restoration from 1 year ago
    - Update contact information
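Tracking whether each drill in this schedule actually happened is itself worth automating. A sketch (the helper name is illustrative) that flags an overdue drill from the epoch time of the last run and the cadence in days:

```shell
#!/bin/sh
# Returns 0 (overdue) when more than cadence_days have elapsed
# since the last recorded drill.
drill_overdue() {
  last_epoch=$1
  cadence_days=$2
  now=${3:-$(date +%s)}
  [ $(( now - last_epoch )) -gt $(( cadence_days * 86400 )) ]
}
```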

Monitoring and Alerting

Backup Monitoring

# Prometheus alerting rules
groups:
- name: database_backups
  rules:
  # Backup age alert
  - alert: DatabaseBackupTooOld
    expr: |
      time() - postgres_backup_last_success_timestamp_seconds > 86400
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Database backup is more than 24 hours old"
      description: "Last successful backup for {{ $labels.instance }} was {{ $value }}s ago"

  # Backup size anomaly
  - alert: DatabaseBackupSizeAnomaly
    expr: |
      abs(postgres_backup_size_bytes - postgres_backup_size_bytes offset 1d) / postgres_backup_size_bytes > 0.2
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "Database backup size changed by more than 20%"
      description: "Backup size for {{ $labels.instance }} changed significantly"

  # Replication lag
  - alert: DatabaseReplicationLagHigh
    expr: |
      pg_replication_lag_seconds > 300
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Database replication lag exceeds 5 minutes"
      description: "Replication lag for {{ $labels.instance }} is {{ $value }}s"

DR Dashboard

// DR monitoring dashboard metrics
interface DRMetrics {
  backups: {
    lastSuccessful: Date;
    lastFailed: Date | null;
    ageHours: number;
    sizeGB: number;
    retentionDays: number;
  };
  replication: {
    lagSeconds: number;
    status: 'syncing' | 'idle' | 'error';
    throughputMBps: number;
  };
  lastDRTest: {
    date: Date;
    durationMinutes: number;
    success: boolean;
    issues: string[];
  };
  rpo: {
    target: string;
    current: string;
    status: 'met' | 'exceeded';
  };
  rto: {
    target: string;
    estimated: string;
    status: 'met' | 'exceeded';
  };
}

Recovery Playbooks

Scenario 1: Single Table Corruption

#!/bin/bash
# Recover single table from logical backup

TABLE_NAME="corrupted_table"
BACKUP_FILE="/backups/postgres/logical/latest.dump"
DB_NAME="production_db"

# 1. Create a scratch database and restore only the affected table
createdb -h localhost -U postgres temp_restore_db
pg_restore -h localhost -U postgres \
  -d temp_restore_db \
  -t $TABLE_NAME \
  $BACKUP_FILE

# 2. Export table data
pg_dump -h localhost -U postgres \
  -t $TABLE_NAME \
  -d temp_restore_db \
  -f /tmp/table_restore.sql

# 3. Truncate corrupted table
psql -h localhost -U postgres -d $DB_NAME \
  -c "TRUNCATE TABLE $TABLE_NAME;"

# 4. Restore data
psql -h localhost -U postgres -d $DB_NAME -f /tmp/table_restore.sql

# 5. Cleanup
dropdb temp_restore_db
rm /tmp/table_restore.sql

Scenario 2: Full Database Restore

#!/bin/bash
# Full database restore from PITR

RECOVERY_TIMESTAMP="2026-03-12 14:30:00"
BACKUP_BASE="/backups/postgres/physical/latest"
RESTORE_DIR="/var/lib/postgresql/restore"

# 1. Stop database
sudo systemctl stop postgresql

# 2. Restore the stored base backup
mkdir -p $RESTORE_DIR
tar -xzf ${BACKUP_BASE}/base.tar.gz -C $RESTORE_DIR

# 3. Configure recovery (PostgreSQL 12+ style)
cat >> $RESTORE_DIR/postgresql.auto.conf <<EOF
restore_command = 'aws s3 cp s3://my-bucket/postgres-wal/%f %p'
recovery_target_time = '${RECOVERY_TIMESTAMP}'
recovery_target_action = 'promote'
EOF
touch $RESTORE_DIR/recovery.signal

# 4. Start database in recovery mode
sudo systemctl start postgresql

# 5. Monitor recovery; with recovery_target_action = 'promote' the
#    server promotes itself once the target time is reached
tail -f /var/log/postgresql/postgresql-*.log | grep "recovery"

# 6. Update application connection string
# (Manual step)

Scenario 3: Region Failover

#!/bin/bash
# Region failover procedure

PRIMARY_REGION="us-east-1"
DR_REGION="us-west-2"
HOSTED_ZONE_ID="Z0000000000000"  # Route 53 hosted zone ID for example.com

# 1. Verify DR region is healthy
aws rds describe-db-instances \
  --region $DR_REGION \
  --db-instance-identifier production-db-dr

# 2. Promote DR read replica to a standalone primary
#    (promotion severs replication from the old primary automatically;
#    there is no separate "stop replication" call for RDS replicas)
aws rds promote-read-replica \
  --db-instance-identifier production-db-dr \
  --region $DR_REGION

# 3. Update DNS
aws route53 change-resource-record-sets \
  --hosted-zone-id $HOSTED_ZONE_ID \
  --change-batch file://dns-update.json

# 4. Verify application connectivity
curl -f https://api.example.com/health

# 5. Re-establish reverse replication once the old region recovers
aws rds create-db-instance-read-replica \
  --source-db-instance-identifier production-db-dr \
  --db-instance-identifier production-db-primary-replica \
  --region $PRIMARY_REGION

Conclusion

Backup and disaster recovery for critical databases is not a one-time implementation — it's a continuous discipline combining automated backups, PITR, cross-region replication, regular restore testing, and documented playbooks. Organizations that treat DR as a systematic process, not a compliance checklist, ensure survival when disasters inevitably occur.


Need to implement backup and disaster recovery strategies for critical databases? Talk to Imperialis cloud specialists to implement backup and DR architecture with defined RPO/RTO, PITR, and automated testing.
