
Database Backup and Disaster Recovery: estratégias para bancos de dados críticos em produção

Backup and disaster recovery strategies for critical databases require RPO/RTO planning, PITR, cross-region replication, automated snapshots, and regular restore testing.

12/03/2026 · 10 min read · Cloud


Last updated: 12/03/2026

The cost of not having a DR plan

Data loss is not a question of "if" but "when". Industry figures often cited in DR literature claim that 40% of companies never recover after critical data loss, and that 60% close within six months of a significant data disaster. For critical databases, operating without a tested disaster recovery (DR) plan is a bet with worse odds than a casino's.

The challenge is not just "taking backups". It is defining realistic RPO (Recovery Point Objective) and RTO (Recovery Time Objective) targets, implementing PITR (Point-In-Time Recovery), keeping cross-region replication functional, and testing restores regularly. A backup that has never been tested is not a backup; it is an illusion.
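As a quick illustration of the RPO side of that planning, the whole question reduces to: how old is the newest restorable backup relative to the target? A minimal sketch (function name and thresholds are illustrative, not from any specific tool):

```shell
#!/bin/sh
# Illustrative helper: classify the newest backup's age against an RPO target.
# Both arguments are in seconds; names and values are hypothetical.
rpo_status() {
  backup_age=$1
  rpo_target=$2
  if [ "$backup_age" -le "$rpo_target" ]; then
    echo "met"
  else
    echo "exceeded"
  fi
}

# A backup taken 10 minutes ago against a 15-minute RPO:
rpo_status 600 900    # met
# The same backup against a 1-minute RPO:
rpo_status 600 60     # exceeded
```

In practice the first argument would come from the newest backup file's mtime; a check like this is what the monitoring rules later in this article automate.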

Defining RPO and RTO

RPO (Recovery Point Objective)

RPO is the maximum amount of data that can be lost, expressed as a window of time.

# Example RPO targets by workload type
rpo_matrix:
  financial_transactions:
    rpo: "0"  # Zero data loss
    strategy: synchronous_replication
    implementation: multi-region synchronous

  customer_orders:
    rpo: "1m"  # 1 minute
    strategy: async_replication + pitr
    implementation: async replication + 1-minute PITR

  user_analytics:
    rpo: "15m"  # 15 minutes
    strategy: periodic_backups
    implementation: 15-minute logical backups

  logs_metrics:
    rpo: "1h"  # 1 hour
    strategy: periodic_backups
    implementation: hourly logical backups

  static_content:
    rpo: "24h"  # 24 hours
    strategy: daily_backups
    implementation: daily full backups

RTO (Recovery Time Objective)

RTO is the maximum acceptable time to restore service after a disaster.

# Example RTO targets by criticality
rto_matrix:
  critical:
    rto: "15m"
    requirements:
      - Automated failover
      - Hot standby
      - Health checks
    implementation: active-active multi-region

  high:
    rto: "1h"
    requirements:
      - Manual failover
      - Warm standby
      - Automated backups
    implementation: active-passive with automation

  medium:
    rto: "4h"
    requirements:
      - Manual restoration
      - Cold standby
      - Backup restoration
    implementation: restore from backup

  low:
    rto: "24h"
    requirements:
      - Manual restoration
      - No standby
      - Backup restoration
    implementation: restore from backup

PostgreSQL Backup Strategies

Physical Backups (pg_basebackup)

Physical backups are binary copies of the data directory, ideal for fast full restores.

#!/bin/bash
# pg_basebackup physical backup script

# Configuration
PGHOST="primary-db.example.com"
PGPORT="5432"
PGUSER="backup_user"
BACKUP_DIR="/backups/postgres/physical"
RETENTION_DAYS=7

# Create the backup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="${BACKUP_DIR}/${TIMESTAMP}"

# Trailing comments after "\" would break the line continuation, so the flags
# are documented here: -Ft tar format, -z compress, -P show progress,
# -X stream include WAL, -R write standby configuration, -w never prompt
# for a password
pg_basebackup \
  -h $PGHOST \
  -p $PGPORT \
  -U $PGUSER \
  -D $BACKUP_PATH \
  -Ft \
  -z \
  -P \
  -X stream \
  -R \
  -w

# Retention
find $BACKUP_DIR -maxdepth 1 -type d -mtime +$RETENTION_DAYS -exec rm -rf {} \;

# Upload to S3/GCS
aws s3 sync ${BACKUP_PATH} s3://my-bucket/postgres-backups/${TIMESTAMP}/

Logical Backups (pg_dump)

Logical backups are SQL-level dumps, ideal for granular restores and schema changes.

#!/bin/bash
# pg_dump logical backup script

# Configuration
PGHOST="primary-db.example.com"
PGUSER="backup_user"
DB_NAME="production_db"
BACKUP_DIR="/backups/postgres/logical"
RETENTION_DAYS=30
SCHEMA_ONLY=false

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump"

if [ "$SCHEMA_ONLY" = true ]; then
  # Schema only backup
  pg_dump -h $PGHOST -U $PGUSER \
    --schema-only \
    --no-owner \
    --no-privileges \
    -f ${BACKUP_FILE}.schema.sql $DB_NAME
else
  # Full logical backup
  pg_dump -h $PGHOST -U $PGUSER \
    --format=custom \
    --compress=9 \
    --no-owner \
    --no-privileges \
    -f ${BACKUP_FILE} $DB_NAME
fi

# Upload to S3/GCS
aws s3 cp ${BACKUP_FILE} s3://my-bucket/postgres-backups/logical/

# Retention
find $BACKUP_DIR -name "${DB_NAME}_*.dump" -mtime +$RETENTION_DAYS -delete
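A dump that cannot be read back is worthless, so it is worth validating the archive right after it is written. One cheap check, sketched here under an assumed path, is asking pg_restore to parse the archive's table of contents without touching any database:

```shell
#!/bin/sh
# Sketch: validate a custom-format dump by listing its table of contents.
# pg_restore --list only reads the archive file; it never connects to a server.
verify_dump() {
  if pg_restore --list "$1" > /dev/null 2>&1; then
    echo "ok"
  else
    echo "corrupt"
  fi
}

# Hypothetical path; call this right after pg_dump in the backup script
verify_dump /backups/postgres/logical/production_db_latest.dump
```

This only proves the archive is structurally readable; the monthly restore drills described later remain the real test.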

Continuous Archiving (WAL Archiving)

WAL (Write-Ahead Log) archiving is what enables PITR.

# postgresql.conf - WAL archiving configuration
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-bucket/postgres-wal/%f'
max_wal_senders = 10
wal_keep_size = 1GB

# Replication slots (also in postgresql.conf)
max_replication_slots = 10

#!/bin/bash
# WAL archiving helper script for PITR
WAL_DIR="/var/lib/postgresql/wal"
ARCHIVE_DIR="/backups/postgres/wal"
RETENTION_DAYS=7

# Archive WAL segments (segment files have 24-hex-digit names, no extension)
for WAL_FILE in "$WAL_DIR"/0*; do
  FILENAME=$(basename "$WAL_FILE")
  aws s3 cp "$WAL_FILE" s3://my-bucket/postgres-wal/${FILENAME}
done

# Clean up old archived segments
find $ARCHIVE_DIR -type f -mtime +$RETENTION_DAYS -delete
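One subtlety of archive_command: PostgreSQL treats any non-zero exit status as "segment not archived, retry later", so the command must only succeed once the copy is durable. A wrapper script, sketched here with an assumed bucket name, makes that contract explicit:

```shell
#!/bin/sh
# Sketch of a failure-aware archive wrapper (bucket name is an assumption).
# Wire it up in postgresql.conf as:
#   archive_command = '/usr/local/bin/archive_wal.sh %p %f'
archive_wal() {
  # $1 = full path to the segment (%p), $2 = segment file name (%f)
  # Propagate the aws exit code: any non-zero status tells PostgreSQL the
  # segment was NOT archived and must be retried later.
  aws s3 cp "$1" "s3://my-bucket/postgres-wal/$2"
}

if [ $# -eq 2 ]; then
  archive_wal "$1" "$2"
fi
```

The key design point is never masking failures (no `|| true`): a silently dropped segment breaks the recovery chain for every later point in time.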

Point-In-Time Recovery (PITR)

#!/bin/bash
# PITR restore script

RESTORE_TIMESTAMP="2026-03-12 14:30:00"
BACKUP_BASE="/backups/postgres/physical/latest"
RESTORE_DIR="/var/lib/postgresql/restore"
WAL_ARCHIVE="s3://my-bucket/postgres-wal"

# 1. Stop PostgreSQL
sudo systemctl stop postgresql

# 2. Clear the existing restore directory
rm -rf $RESTORE_DIR

# 3. Restore the stored base backup (restore from the backup itself, not a
#    fresh copy of the primary, which may be unavailable or already damaged)
mkdir -p $RESTORE_DIR
tar -xzf $BACKUP_BASE/base.tar.gz -C $RESTORE_DIR

# 4. Configure recovery (PostgreSQL 12+ replaced recovery.conf with recovery
#    settings plus a recovery.signal file)
cat >> $RESTORE_DIR/postgresql.auto.conf <<EOF
restore_command = 'aws s3 cp s3://my-bucket/postgres-wal/%f %p'
recovery_target_time = '${RESTORE_TIMESTAMP}'
recovery_target_inclusive = true
EOF
touch $RESTORE_DIR/recovery.signal

# 5. Start PostgreSQL
sudo systemctl start postgresql

# 6. Monitor the recovery logs
tail -f /var/log/postgresql/postgresql-*.log | grep "recovery"

MySQL Backup Strategies

Physical Backups (Percona XtraBackup)

#!/bin/bash
# Percona XtraBackup physical backup for MySQL

BACKUP_DIR="/backups/mysql/xtrabackup"
RETENTION_DAYS=7
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Full backup
xtrabackup \
  --backup \
  --target-dir=${BACKUP_DIR}/full-${TIMESTAMP} \
  --user=backup_user \
  --password=${MYSQL_BACKUP_PASSWORD} \
  --compress \
  --parallel=4

# Decompress, then prepare the backup (a --compress backup must be
# decompressed before --prepare can run)
xtrabackup \
  --decompress \
  --target-dir=${BACKUP_DIR}/full-${TIMESTAMP}

xtrabackup \
  --prepare \
  --target-dir=${BACKUP_DIR}/full-${TIMESTAMP}

# Upload to S3
aws s3 sync ${BACKUP_DIR}/full-${TIMESTAMP} \
  s3://my-bucket/mysql-backups/full-${TIMESTAMP}/

# Retention
find $BACKUP_DIR -maxdepth 1 -type d -name "full-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;

Logical Backups (mysqldump)

#!/bin/bash
# mysqldump logical backup script

DB_NAME="production_db"
BACKUP_DIR="/backups/mysql/logical"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Single database backup
mysqldump \
  --user=backup_user \
  --password=${MYSQL_BACKUP_PASSWORD} \
  --single-transaction \
  --routines \
  --triggers \
  --events \
  --quick \
  --lock-tables=false \
  --compress \
  $DB_NAME | gzip > ${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz

# All databases backup
mysqldump \
  --user=backup_user \
  --password=${MYSQL_BACKUP_PASSWORD} \
  --all-databases \
  --single-transaction \
  --routines \
  --triggers \
  --events \
  --quick \
  --lock-tables=false \
  --compress | gzip > ${BACKUP_DIR}/all_databases_${TIMESTAMP}.sql.gz

# Upload to S3
aws s3 cp ${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz \
  s3://my-bucket/mysql-backups/logical/

# Retention
find $BACKUP_DIR -name "*.sql.gz" -mtime +$RETENTION_DAYS -delete

Binary Logs for PITR

# my.cnf - Binary log configuration
[mysqld]
log-bin=mysql-bin
binlog_format=ROW
binlog_expire_logs_seconds=604800  # 7 days
sync_binlog=1
max_binlog_size=1G

#!/bin/bash
# Archive binary logs
BINLOG_DIR="/var/lib/mysql"
ARCHIVE_DIR="/backups/mysql/binlog"
RETENTION_DAYS=7

# Flush so the active binlog is rotated, then copy only the closed logs
mysqladmin --user=backup_user --password=${MYSQL_BACKUP_PASSWORD} flush-logs
cp ${BINLOG_DIR}/mysql-bin.[0-9]* ${ARCHIVE_DIR}/

# Upload to S3
aws s3 sync ${ARCHIVE_DIR} s3://my-bucket/mysql-binlog/

# Retention
find $ARCHIVE_DIR -name "mysql-bin.*" -mtime +$RETENTION_DAYS -delete
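The archived binlogs are only half of MySQL PITR; the other half is replaying them on top of a restored full backup, stopping just before the incident. A sketch of that replay step (host, credentials, and file paths are assumptions):

```shell
#!/bin/sh
# Sketch: replay archived binary logs, stopping just before the incident.
# mysqlbinlog decodes the events to SQL; piping into mysql applies them on
# top of a restored full backup. Credentials and paths are hypothetical.
replay_binlogs() {
  stop_time="$1"; shift
  mysqlbinlog --stop-datetime="$stop_time" "$@" \
    | mysql --user=root --password="$MYSQL_ROOT_PASSWORD"
}

# Hypothetical invocation: apply everything up to one second before 14:30
# replay_binlogs "2026-03-12 14:29:59" \
#   /backups/mysql/binlog/mysql-bin.000001 \
#   /backups/mysql/binlog/mysql-bin.000002
```

Binlog files must be passed in sequence, starting with the first log written after the full backup was taken; skipping one leaves a gap in the replayed history.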

Cross-Region Replication

PostgreSQL Logical Replication

-- Primary region (publisher)
CREATE PUBLICATION my_publication
FOR ALL TABLES;

-- Standby region (subscriber)
CREATE SUBSCRIPTION my_subscription
CONNECTION 'host=primary-db.example.com port=5432 dbname=production_db user=replication_user password=xxx'
PUBLICATION my_publication
WITH (create_slot = false);  -- assumes the replication slot already exists

#!/bin/bash
# Replication lag monitoring script
STANDBY_HOST="standby-db.example.com"
PGUSER="monitoring_user"

# Check replication lag
psql -h $STANDBY_HOST -U $PGUSER -d production_db -c "
SELECT
  now() - pg_last_xact_replay_timestamp() AS replication_lag,
  pg_last_xact_replay_timestamp() AS last_replay_time
;"

MySQL Replication

-- Run on the replica in the standby region
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST='primary-db.example.com',
  SOURCE_USER='replication_user',
  SOURCE_PASSWORD='xxx',
  SOURCE_LOG_FILE='mysql-bin.000001',
  SOURCE_LOG_POS=4;

START REPLICA;
#!/bin/bash
# MySQL replication lag monitoring
STANDBY_HOST="standby-db.example.com"
MYSQL_USER="monitoring_user"

# Check replication lag
mysql -h $STANDBY_HOST -u $MYSQL_USER -e "
SHOW REPLICA STATUS\G" | grep "Seconds_Behind_Source"

Automated Backup Orchestration

Kubernetes CronJobs

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: database
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            # NOTE: the stock postgres image does not include the aws CLI;
            # in practice use an image that bundles both clients
            image: postgres:15-alpine
            command:
            - /bin/sh
            - -c
            - |
              pg_dump -h postgres-primary -U $POSTGRES_USER \
                -d $POSTGRES_DB -Ft -f /backup/db.dump.tar

              # Upload to S3
              aws s3 cp /backup/db.dump.tar \
                s3://my-bucket/postgres-backups/$(date +%Y%m%d_%H%M%S).tar
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: username
            - name: POSTGRES_DB
              value: "production_db"
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key-id
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: secret-access-key
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            emptyDir: {}
          restartPolicy: OnFailure
      backoffLimit: 3

Terraform Managed Backups

# AWS RDS automated backups
resource "aws_db_instance" "production" {
  allocated_storage    = 100
  storage_type         = "gp3"
  engine               = "postgres"
  engine_version       = "15.4"
  instance_class      = "db.r6g.xlarge"
  db_name             = "production"

  # Backup configuration
  backup_retention_period = 7  # days
  backup_window          = "03:00-04:00"
  maintenance_window     = "sun:04:00-sun:05:00"

  # Multi-AZ for high availability
  multi_az = true

  # Final snapshot on deletion
  final_snapshot_identifier = "production-final-snapshot"
  skip_final_snapshot      = false

  # Deletion protection
  deletion_protection = true

  # Performance insights
  performance_insights_enabled = true
  monitoring_interval         = 60
  monitoring_role_arn        = aws_iam_role.rds_monitoring.arn
}

Disaster Recovery Testing

Automated DR Testing

#!/bin/bash
# Automated DR test script

PRIMARY_REGION="us-east-1"
DR_REGION="us-west-2"
DB_IDENTIFIER="production-db"
TEST_DB="dr_test_$(date +%Y%m%d_%H%M%S)"

# 1. Create a test instance from a snapshot
#    (assumes the snapshot has already been copied to the DR region)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier ${TEST_DB} \
  --db-snapshot-identifier ${DB_IDENTIFIER}-snapshot \
  --region $DR_REGION

# 2. Wait for restoration to complete
aws rds wait db-instance-available \
  --db-instance-identifier ${TEST_DB} \
  --region $DR_REGION

# 3. Test database connectivity
TEST_HOST=$(aws rds describe-db-instances \
  --db-instance-identifier ${TEST_DB} \
  --region $DR_REGION \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)

psql -h $TEST_HOST -U test_user -d production_db -c "SELECT NOW();"

# 4. Run data integrity checks
psql -h $TEST_HOST -U test_user -d production_db -c "
SELECT COUNT(*) FROM users;
SELECT COUNT(*) FROM orders;
SELECT MAX(created_at) FROM orders;
"

# 5. Test query performance
time psql -h $TEST_HOST -U test_user -d production_db -c "
EXPLAIN ANALYZE SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '1 day';
"

# 6. Cleanup
aws rds delete-db-instance \
  --db-instance-identifier ${TEST_DB} \
  --skip-final-snapshot \
  --region $DR_REGION

# Log results
echo "DR test completed at $(date)" >> /var/log/dr-tests.log

DR Drill Schedule

disaster_recovery_testing_schedule:
  weekly:
    - Verify replication lag < 1 minute
    - Test backup integrity
    - Verify backup completion within SLA

  monthly:
    - Restore from latest backup to test environment
    - Validate data integrity
    - Run application smoke tests
    - Document recovery time

  quarterly:
    - Full failover drill to DR region
    - Test all application functionality
    - Measure actual RTO vs target
    - Update DR documentation

  annually:
    - Complete DR plan review
    - Update RPO/RTO targets
    - Test backup restoration from 1 year ago
    - Update contact information

Monitoring and Alerting

Backup Monitoring

# Prometheus alerting rules
groups:
- name: database_backups
  rules:
  # Backup age alert
  - alert: DatabaseBackupTooOld
    expr: |
      time() - postgres_backup_last_success_timestamp_seconds > 86400
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Database backup is more than 24 hours old"
      description: "Last successful backup for {{ $labels.instance }} was {{ $value }}s ago"

  # Backup size anomaly
  - alert: DatabaseBackupSizeAnomaly
    expr: |
      abs(postgres_backup_size_bytes - postgres_backup_size_bytes offset 1h) / postgres_backup_size_bytes > 0.2
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "Database backup size changed by more than 20%"
      description: "Backup size for {{ $labels.instance }} changed significantly"

  # Replication lag
  - alert: DatabaseReplicationLagHigh
    expr: |
      pg_replication_lag_seconds > 300
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Database replication lag exceeds 5 minutes"
      description: "Replication lag for {{ $labels.instance }} is {{ $value }}s"

DR Dashboard

// DR monitoring dashboard metrics
interface DRMetrics {
  backups: {
    lastSuccessful: Date;
    lastFailed: Date | null;
    ageHours: number;
    sizeGB: number;
    retentionDays: number;
  };
  replication: {
    lagSeconds: number;
    status: 'syncing' | 'idle' | 'error';
    throughputMBps: number;
  };
  lastDRTest: {
    date: Date;
    durationMinutes: number;
    success: boolean;
    issues: string[];
  };
  rpo: {
    target: string;
    current: string;
    status: 'met' | 'exceeded';
  };
  rto: {
    target: string;
    estimated: string;
    status: 'met' | 'exceeded';
  };
}

Recovery Playbooks

Scenario 1: Single Table Corruption

#!/bin/bash
# Recover single table from logical backup

TABLE_NAME="corrupted_table"
BACKUP_FILE="/backups/postgres/logical/latest.dump"
DB_NAME="production_db"

# 1. Restore the table into a temporary database (create it first)
createdb -h localhost -U postgres temp_restore_db
pg_restore -h localhost -U postgres \
  -d temp_restore_db \
  -t $TABLE_NAME \
  $BACKUP_FILE

# 2. Export table data
pg_dump -h localhost -U postgres \
  -t $TABLE_NAME \
  -d temp_restore_db \
  -f /tmp/table_restore.sql

# 3. Truncate corrupted table
psql -h localhost -U postgres -d $DB_NAME \
  -c "TRUNCATE TABLE $TABLE_NAME;"

# 4. Restore data
psql -h localhost -U postgres -d $DB_NAME -f /tmp/table_restore.sql

# 5. Cleanup
dropdb temp_restore_db
rm /tmp/table_restore.sql

Scenario 2: Full Database Restore

#!/bin/bash
# Full database restore from PITR

RECOVERY_TIMESTAMP="2026-03-12 14:30:00"
BACKUP_BASE="/backups/postgres/physical/latest"
RESTORE_DIR="/var/lib/postgresql/restore"

# 1. Stop database
sudo systemctl stop postgresql

# 2. Restore the stored base backup
mkdir -p $RESTORE_DIR
tar -xzf $BACKUP_BASE/base.tar.gz -C $RESTORE_DIR

# 3. Configure recovery (PostgreSQL 12+ style: settings file plus signal file)
cat >> $RESTORE_DIR/postgresql.auto.conf <<EOF
restore_command = 'aws s3 cp s3://my-bucket/postgres-wal/%f %p'
recovery_target_time = '${RECOVERY_TIMESTAMP}'
EOF
touch $RESTORE_DIR/recovery.signal

# 4. Start database in recovery mode
sudo systemctl start postgresql

# 5. Monitor recovery
tail -f /var/log/postgresql/postgresql-*.log | grep "recovery"

# 6. When the target is reached, recovery pauses; promote the server
sudo -u postgres pg_ctl promote -D $RESTORE_DIR

# 7. Update application connection string
# (Manual step)

Scenario 3: Region Failover

#!/bin/bash
# Region failover procedure

PRIMARY_REGION="us-east-1"
DR_REGION="us-west-2"
DNS_ZONE="Z123EXAMPLE"  # Route 53 hosted zone ID (not the domain name)

# 1. Verify DR region is healthy
aws rds describe-db-instances \
  --region $DR_REGION \
  --db-instance-identifier production-db-dr

# 2-3. Stop replication and promote the DR replica: promoting a read replica
#      severs replication from the old primary, so one call does both
#      (RDS has no separate "stop replication" command for instance replicas)
aws rds promote-read-replica \
  --db-instance-identifier production-db-dr \
  --region $DR_REGION

# 4. Update DNS
aws route53 change-resource-record-sets \
  --hosted-zone-id $DNS_ZONE \
  --change-batch file://dns-update.json

# 5. Verify application connectivity
curl -f https://api.example.com/health

# 6. Re-establish reverse replication (cross-region read replicas must
#    reference the source instance by its ARN; account ID is a placeholder)
aws rds create-db-instance-read-replica \
  --source-db-instance-identifier arn:aws:rds:${DR_REGION}:123456789012:db:production-db-dr \
  --db-instance-identifier production-db-primary-replica \
  --region $PRIMARY_REGION
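The dns-update.json referenced in step 4 is a Route 53 change batch. A hypothetical example that flips a database CNAME to the promoted DR endpoint (record name, TTL, and endpoint are illustrative, not from the original article):

```json
{
  "Comment": "Failover: point the database alias at the DR region",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "db.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "production-db-dr.xxxxxxxx.us-west-2.rds.amazonaws.com" }
        ]
      }
    }
  ]
}
```

A short TTL (here 60 seconds) matters: it bounds how long clients keep resolving to the dead primary after the failover.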

Conclusion

Backup and disaster recovery for critical databases is not a one-time implementation. It is an ongoing discipline that combines automated backups, PITR, cross-region replication, regular restore testing, and documented playbooks. Organizations that treat DR as a systematic process rather than a compliance checklist are the ones that survive when disasters inevitably strike.


Need a backup and disaster recovery strategy for critical production databases? Talk to the cloud specialists at Imperialis about implementing a backup and DR architecture with defined RPO/RTO targets, PITR, and automated testing.
