Database Backup and Disaster Recovery: strategies for critical production databases
Executive summary
Backup and disaster-recovery strategies for critical databases require RPO/RTO planning, PITR, cross-region replication, automated snapshots, and regular restore testing.
Last updated: 2026-03-12
The cost of not having a DR plan
Data loss is a question of "when", not "if". Industry studies are often cited claiming that 40% of companies never recover from critical data loss, and that 60% close within six months of a significant data disaster. For critical databases, running without a tested disaster recovery (DR) plan is a bet with worse odds than a casino.
The challenge is not just "taking backups" — it is defining realistic RPO (Recovery Point Objective) and RTO (Recovery Time Objective) targets, implementing PITR (Point-In-Time Recovery), keeping cross-region replication functional, and testing restores regularly. A backup that has never been tested is not a backup — it is an illusion.
Defining RPO and RTO
RPO (Recovery Point Objective)
RPO is the maximum amount of data that can be lost, measured as a window of time.
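Targets only matter if they are measured. A minimal sketch of the check (the last-backup timestamp would normally come from your backup tooling's metadata; all values here are illustrative) comparing the worst-case data loss right now against an RPO target:

```bash
#!/bin/sh
# Achieved RPO = time elapsed since the last successful backup.
rpo_status() {
  last_backup_epoch=$1   # epoch seconds of last successful backup
  rpo_target_secs=$2     # RPO target, in seconds
  now_epoch=$3           # current time (injected as a parameter for testability)
  elapsed=$((now_epoch - last_backup_epoch))
  if [ "$elapsed" -le "$rpo_target_secs" ]; then
    echo "OK: worst-case loss ${elapsed}s (target ${rpo_target_secs}s)"
  else
    echo "VIOLATED: worst-case loss ${elapsed}s (target ${rpo_target_secs}s)"
  fi
}

rpo_status 1000 900 1800   # 800s elapsed vs 900s target -> OK
rpo_status 1000 900 2100   # 1100s elapsed vs 900s target -> VIOLATED
```

In production the same comparison would feed an alert rather than stdout.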
```yaml
# Example RPO targets by workload type
rpo_matrix:
  financial_transactions:
    rpo: "0"       # Zero data loss
    strategy: synchronous_replication
    implementation: multi-region synchronous
  customer_orders:
    rpo: "1m"      # 1 minute
    strategy: async_replication + pitr
    implementation: async replication + 1-minute PITR
  user_analytics:
    rpo: "15m"     # 15 minutes
    strategy: periodic_backups
    implementation: 15-minute logical backups
  logs_metrics:
    rpo: "1h"      # 1 hour
    strategy: periodic_backups
    implementation: hourly logical backups
  static_content:
    rpo: "24h"     # 24 hours
    strategy: daily_backups
    implementation: daily full backups
```

RTO (Recovery Time Objective)
RTO is the maximum acceptable time to restore service after a disaster.
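RTO is validated empirically, not declared: time the restore drill and compare it against the target. A sketch (restore_cmd is a stand-in for the real restore procedure; the target value is illustrative):

```bash
#!/bin/sh
# Time a restore drill and compare against the RTO target.
RTO_TARGET_SECS=900   # 15-minute target for the critical tier

restore_cmd() {
  # Stand-in for the real restore (base backup extraction, WAL replay, ...)
  sleep 1
}

start=$(date +%s)
restore_cmd
elapsed=$(( $(date +%s) - start ))

if [ "$elapsed" -le "$RTO_TARGET_SECS" ]; then
  echo "RTO met: restore took ${elapsed}s (target ${RTO_TARGET_SECS}s)"
else
  echo "RTO missed: restore took ${elapsed}s (target ${RTO_TARGET_SECS}s)"
fi
```

Recording the measured value after every drill turns RTO into a tracked metric instead of a wish.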
```yaml
# Example RTO targets by criticality
rto_matrix:
  critical:
    rto: "15m"
    requirements:
      - Automated failover
      - Hot standby
      - Health checks
    implementation: active-active multi-region
  high:
    rto: "1h"
    requirements:
      - Manual failover
      - Warm standby
      - Automated backups
    implementation: active-passive with automation
  medium:
    rto: "4h"
    requirements:
      - Manual restoration
      - Cold standby
      - Backup restoration
    implementation: restore from backup
  low:
    rto: "24h"
    requirements:
      - Manual restoration
      - No standby
      - Backup restoration
    implementation: restore from backup
```

PostgreSQL Backup Strategies
Physical Backups (pg_basebackup)
Physical backups are binary copies of the data directory, ideal for fast full restores.
```bash
#!/bin/bash
# pg_basebackup physical backup

# Configuration
PGHOST="primary-db.example.com"
PGPORT="5432"
PGUSER="backup_user"
BACKUP_DIR="/backups/postgres/physical"
RETENTION_DAYS=7

# Create backup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="${BACKUP_DIR}/${TIMESTAMP}"

# -Ft: tar format; -z: compress; -P: show progress;
# -X stream: include WAL files; -R: write recovery settings; -w: never prompt for password
pg_basebackup \
  -h "$PGHOST" \
  -p "$PGPORT" \
  -U "$PGUSER" \
  -D "$BACKUP_PATH" \
  -Ft \
  -z \
  -P \
  -X stream \
  -R \
  -w

# Retention
find "$BACKUP_DIR" -maxdepth 1 -type d -mtime +$RETENTION_DAYS -exec rm -rf {} \;

# Upload to S3/GCS
aws s3 sync "${BACKUP_PATH}" "s3://my-bucket/postgres-backups/${TIMESTAMP}/"
```

Logical Backups (pg_dump)
Logical backups are SQL dumps, ideal for granular restores and schema changes.
```bash
#!/bin/bash
# pg_dump logical backup

# Configuration
PGHOST="primary-db.example.com"
PGUSER="backup_user"
DB_NAME="production_db"
BACKUP_DIR="/backups/postgres/logical"
RETENTION_DAYS=30
SCHEMA_ONLY=false

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump"

if [ "$SCHEMA_ONLY" = true ]; then
  # Schema-only backup
  pg_dump -h "$PGHOST" -U "$PGUSER" \
    --schema-only \
    --no-owner \
    --no-privileges \
    -f "${BACKUP_FILE}.schema.sql" "$DB_NAME"
else
  # Full logical backup
  pg_dump -h "$PGHOST" -U "$PGUSER" \
    --format=custom \
    --compress=9 \
    --no-owner \
    --no-privileges \
    -f "${BACKUP_FILE}" "$DB_NAME"
fi

# Upload to S3/GCS
aws s3 cp "${BACKUP_FILE}" s3://my-bucket/postgres-backups/logical/

# Retention
find "$BACKUP_DIR" -name "${DB_NAME}_*.dump" -mtime +$RETENTION_DAYS -delete
```

Continuous Archiving (WAL Archiving)
WAL (Write-Ahead Log) archiving is what enables PITR.
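PostgreSQL treats a non-zero exit from archive_command as failure and retries the segment, so a wrapper must fail loudly — and it should never silently overwrite a segment already archived. A local-disk sketch of such a wrapper (swap the cp for your S3 upload; paths and the segment name are illustrative):

```bash
#!/bin/sh
# Safer archive_command wrapper: refuse to overwrite, fail loudly.
# Called as: archive_wal <full-path-to-wal> <file-name>  (%p and %f in postgresql.conf)
archive_wal() {
  src=$1; name=$2
  dest="$ARCHIVE_DIR/$name"
  if [ -e "$dest" ]; then
    echo "ERROR: $name already archived" >&2
    return 1                      # non-zero => PostgreSQL keeps and retries the segment
  fi
  cp "$src" "$dest" || return 1   # replace with your S3 upload in production
}

# Demonstration with temporary files
ARCHIVE_DIR=$(mktemp -d)
tmp=$(mktemp); echo "wal-bytes" > "$tmp"
archive_wal "$tmp" "000000010000000000000001" && echo "archived"
archive_wal "$tmp" "000000010000000000000001" || echo "refused duplicate"
```

The duplicate refusal matters: overwriting an archived segment can corrupt the only copy needed for a later PITR.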
```ini
# postgresql.conf - WAL archiving configuration
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-bucket/postgres-wal/%f'
max_wal_senders = 10
wal_keep_size = 1GB

# Replication slots
max_replication_slots = 10
```

```bash
#!/bin/bash
# Ship locally archived WAL segments to S3 for PITR
# (applies when archive_command writes to a local directory instead of S3,
#  e.g. archive_command = 'cp %p /backups/postgres/wal/%f')
ARCHIVE_DIR="/backups/postgres/wal"
RETENTION_DAYS=7

# Upload archived WAL segments
aws s3 sync "$ARCHIVE_DIR" s3://my-bucket/postgres-wal/

# Prune local copies past the retention window
find "$ARCHIVE_DIR" -type f -mtime +$RETENTION_DAYS -delete
```

Point-In-Time Recovery (PITR)
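A recovery target is only reachable if it lies after the moment the base backup was taken; checking this up front avoids a failed recovery run. A sketch using GNU date -d (the backup timestamp would come from the backup label; values are illustrative):

```bash
#!/bin/sh
# Sanity-check a PITR target against the base backup's start time.
# Requires GNU date (-d).
pitr_target_ok() {
  backup_time=$1   # when the base backup was taken
  target_time=$2   # requested recovery_target_time
  backup_epoch=$(date -d "$backup_time" +%s)
  target_epoch=$(date -d "$target_time" +%s)
  if [ "$target_epoch" -gt "$backup_epoch" ]; then
    echo "OK: target is $((target_epoch - backup_epoch))s after the base backup"
  else
    echo "INVALID: target precedes the base backup; restore an older base backup"
  fi
}

pitr_target_ok "2026-03-12 02:00:00" "2026-03-12 14:30:00"
pitr_target_ok "2026-03-12 02:00:00" "2026-03-11 23:00:00"
```

The inverse check (target not newer than the last archived WAL) is equally worth automating.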
```bash
#!/bin/bash
# PITR restore script (PostgreSQL 12+: recovery.conf no longer exists;
# recovery settings go in postgresql.auto.conf plus a recovery.signal file)
RESTORE_TIMESTAMP="2026-03-12 14:30:00"
BACKUP_BASE="/backups/postgres/physical/latest"
RESTORE_DIR="/var/lib/postgresql/restore"

# 1. Stop PostgreSQL
sudo systemctl stop postgresql

# 2. Clear the target data directory
rm -rf "$RESTORE_DIR"
mkdir -p "$RESTORE_DIR"

# 3. Restore a base backup taken BEFORE the target time
#    (a fresh pg_basebackup from the primary would be too new to recover from)
tar -xzf "$BACKUP_BASE/base.tar.gz" -C "$RESTORE_DIR"

# 4. Configure recovery
cat >> "$RESTORE_DIR/postgresql.auto.conf" <<EOF
restore_command = 'aws s3 cp s3://my-bucket/postgres-wal/%f %p'
recovery_target_time = '${RESTORE_TIMESTAMP}'
recovery_target_inclusive = true
EOF
touch "$RESTORE_DIR/recovery.signal"

# 5. Start PostgreSQL
sudo systemctl start postgresql

# 6. Monitor the recovery logs
tail -f /var/log/postgresql/postgresql-*.log | grep "recovery"
```

MySQL Backup Strategies
Physical Backups (Percona XtraBackup)
```bash
#!/bin/bash
# Percona XtraBackup physical backup for MySQL
BACKUP_DIR="/backups/mysql/xtrabackup"
RETENTION_DAYS=7
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Full backup
xtrabackup \
  --backup \
  --target-dir="${BACKUP_DIR}/full-${TIMESTAMP}" \
  --user=backup_user \
  --password="${MYSQL_BACKUP_PASSWORD}" \
  --compress \
  --parallel=4

# Decompress before prepare (required when --compress was used)
xtrabackup \
  --decompress \
  --target-dir="${BACKUP_DIR}/full-${TIMESTAMP}"

# Prepare backup
xtrabackup \
  --prepare \
  --target-dir="${BACKUP_DIR}/full-${TIMESTAMP}"

# Upload to S3
aws s3 sync "${BACKUP_DIR}/full-${TIMESTAMP}" \
  "s3://my-bucket/mysql-backups/full-${TIMESTAMP}/"

# Retention
find "$BACKUP_DIR" -maxdepth 1 -type d -name "full-*" -mtime +$RETENTION_DAYS -exec rm -rf {} \;
```

Logical Backups (mysqldump)
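A dump is only as good as its completeness: a job killed mid-write leaves a truncated .sql.gz that restores silently up to the cut. A verification sketch worth running after every dump — gzip -t validates the compressed stream, and by default mysqldump ends its output with a "Dump completed" comment that can be asserted (the sample file below is fabricated for illustration):

```bash
#!/bin/sh
# Verify a gzipped SQL dump: valid gzip stream + mysqldump completion marker.
verify_dump() {
  f=$1
  gzip -t "$f" 2>/dev/null || { echo "CORRUPT: bad gzip stream"; return 1; }
  if zcat "$f" | tail -n 1 | grep -q "Dump completed"; then
    echo "OK: $f"
  else
    echo "INCOMPLETE: missing completion marker"
    return 1
  fi
}

# Simulated dump standing in for real mysqldump output
good=$(mktemp)
printf 'CREATE TABLE t (id INT);\n-- Dump completed on 2026-03-12\n' | gzip > "$good"
verify_dump "$good"
```

Note the marker check only works if the dump was not taken with --skip-comments.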
```bash
#!/bin/bash
# mysqldump logical backup
DB_NAME="production_db"
BACKUP_DIR="/backups/mysql/logical"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Single database backup
mysqldump \
  --user=backup_user \
  --password=${MYSQL_BACKUP_PASSWORD} \
  --single-transaction \
  --routines \
  --triggers \
  --events \
  --quick \
  --lock-tables=false \
  --compress \
  $DB_NAME | gzip > ${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz

# All databases backup
mysqldump \
  --user=backup_user \
  --password=${MYSQL_BACKUP_PASSWORD} \
  --all-databases \
  --single-transaction \
  --routines \
  --triggers \
  --events \
  --quick \
  --lock-tables=false \
  --compress | gzip > ${BACKUP_DIR}/all_databases_${TIMESTAMP}.sql.gz

# Upload to S3
aws s3 cp ${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz \
  s3://my-bucket/mysql-backups/logical/

# Retention
find $BACKUP_DIR -name "*.sql.gz" -mtime +$RETENTION_DAYS -delete
```

Binary Log for PITR
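Binlog-based PITR works by replaying archived logs up to the target time with mysqlbinlog --stop-datetime. A dry-run sketch that only assembles the command so the exact invocation can be reviewed before piping it into mysql (file names and the target are illustrative):

```bash
#!/bin/sh
# Build (but do not execute) the mysqlbinlog replay command for PITR.
STOP_DATETIME="2026-03-12 14:30:00"
BINLOGS="mysql-bin.000001 mysql-bin.000002"

CMD="mysqlbinlog --stop-datetime=\"$STOP_DATETIME\" $BINLOGS"
echo "Would run: $CMD | mysql -u root -p"
```

Replaying to the wrong point is unrecoverable on a live server, so always rehearse the command against a restored copy first.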
```ini
# my.cnf - Binary log configuration
[mysqld]
log-bin=mysql-bin
binlog_format=ROW
binlog_expire_logs_seconds=604800  # 7 days
sync_binlog=1
max_binlog_size=1G
```

```bash
#!/bin/bash
# Backup binary logs
BINLOG_DIR="/var/lib/mysql"
ARCHIVE_DIR="/backups/mysql/binlog"
RETENTION_DAYS=7

# Rotate so the currently active binlog is closed before copying
mysqladmin flush-logs

# Copy binary logs
cp "${BINLOG_DIR}"/mysql-bin.* "${ARCHIVE_DIR}/"

# Upload to S3
aws s3 sync "${ARCHIVE_DIR}" s3://my-bucket/mysql-binlog/

# Retention
find "$ARCHIVE_DIR" -name "mysql-bin.*" -mtime +$RETENTION_DAYS -delete
```

Cross-Region Replication
PostgreSQL Logical Replication
```sql
-- Primary region (publisher)
CREATE PUBLICATION my_publication
  FOR ALL TABLES;

-- Standby region (subscriber)
-- create_slot = false assumes the replication slot was created manually on the publisher
CREATE SUBSCRIPTION my_subscription
  CONNECTION 'host=primary-db.example.com port=5432 dbname=production_db user=replication_user password=xxx'
  PUBLICATION my_publication
  WITH (create_slot = false);
```

```bash
#!/bin/bash
# Replication monitoring script (run against the subscriber)
STANDBY_HOST="standby-db.example.com"
PGUSER="monitoring_user"

# Check subscription status and apply lag
psql -h "$STANDBY_HOST" -U "$PGUSER" -d production_db -c "
SELECT
  subname,
  received_lsn,
  latest_end_lsn,
  now() - latest_end_time AS apply_lag
FROM pg_stat_subscription;"
```

MySQL Replication
```sql
-- DR region (replica): point at the primary and start replicating
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST='primary-db.example.com',
  SOURCE_USER='replication_user',
  SOURCE_PASSWORD='xxx',
  SOURCE_LOG_FILE='mysql-bin.000001',
  SOURCE_LOG_POS=4;
START REPLICA;
```

```bash
#!/bin/bash
# MySQL replication monitoring
STANDBY_HOST="standby-db.example.com"
MYSQL_USER="monitoring_user"

# Check replication lag
mysql -h "$STANDBY_HOST" -u "$MYSQL_USER" -e "
SHOW REPLICA STATUS\G" | grep "Seconds_Behind_Source"
```

Automated Backup Orchestration
Kubernetes CronJobs
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: database
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  jobTemplate:
    spec:
      backoffLimit: 3
      template:
        spec:
          containers:
            - name: backup
              # Needs an image with both pg_dump and the AWS CLI;
              # postgres:15-alpine alone does not ship the aws binary
              image: postgres:15-alpine
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -h postgres-primary -U $POSTGRES_USER \
                    -d $POSTGRES_DB -Ft -f /backup/db.dump.tar
                  # Upload to S3
                  aws s3 cp /backup/db.dump.tar \
                    s3://my-bucket/postgres-backups/$(date +%Y%m%d_%H%M%S).tar
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-secret
                      key: password
                - name: POSTGRES_USER
                  valueFrom:
                    secretKeyRef:
                      name: postgres-secret
                      key: username
                - name: POSTGRES_DB
                  value: "production_db"
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: access-key-id
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: secret-access-key
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              emptyDir: {}
          restartPolicy: OnFailure
```

Terraform Managed Backups
```hcl
# AWS RDS automated backups
resource "aws_db_instance" "production" {
  allocated_storage = 100
  storage_type      = "gp3"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.r6g.xlarge"
  db_name           = "production"

  # Backup configuration
  backup_retention_period = 7  # days
  backup_window           = "03:00-04:00"
  maintenance_window      = "sun:04:00-sun:05:00"

  # Multi-AZ for high availability
  multi_az = true

  # Final snapshot on deletion
  final_snapshot_identifier = "production-final-snapshot"
  skip_final_snapshot       = false

  # Deletion protection
  deletion_protection = true

  # Performance insights
  performance_insights_enabled = true
  monitoring_interval          = 60
  monitoring_role_arn          = aws_iam_role.rds_monitoring.arn
}
```

Disaster Recovery Testing
Automated DR Testing
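Before spending time and money on a cloud restore, a DR test can fail fast on corrupted artifacts. A local sketch of a checksum-manifest check — the manifest is written at backup time and verified before any restore attempt (file names are illustrative):

```bash
#!/bin/sh
# Write a sha256 manifest at backup time; verify it before a restore test.
workdir=$(mktemp -d)
cd "$workdir"

# Backup time: produce artifacts and a manifest (contents simulated here).
echo "base backup bytes" > base.tar.gz
echo "wal bytes" > wal.tar.gz
sha256sum base.tar.gz wal.tar.gz > MANIFEST.sha256

# Restore time: verify before touching the database.
if sha256sum -c MANIFEST.sha256 >/dev/null 2>&1; then
  echo "manifest OK"
else
  echo "manifest FAILED"
  exit 1
fi
```

Shipping MANIFEST.sha256 alongside the backup catches both transfer corruption and partial uploads.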
```bash
#!/bin/bash
# Automated DR test script
PRIMARY_REGION="us-east-1"
DR_REGION="us-west-2"
DB_IDENTIFIER="production-db"
# RDS identifiers allow only letters, digits, and hyphens
TEST_DB="dr-test-$(date +%Y%m%d-%H%M%S)"

# 1. Create test database from snapshot
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier ${TEST_DB} \
  --db-snapshot-identifier ${DB_IDENTIFIER}-snapshot \
  --region $DR_REGION

# 2. Wait for restoration to complete
aws rds wait db-instance-available \
  --db-instance-identifier ${TEST_DB} \
  --region $DR_REGION

# 3. Test database connectivity
TEST_HOST=$(aws rds describe-db-instances \
  --db-instance-identifier ${TEST_DB} \
  --region $DR_REGION \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)
psql -h $TEST_HOST -U test_user -d production_db -c "SELECT NOW();"

# 4. Run data integrity checks
psql -h $TEST_HOST -U test_user -d production_db -c "
SELECT COUNT(*) FROM users;
SELECT COUNT(*) FROM orders;
SELECT MAX(created_at) FROM orders;
"

# 5. Test query performance
time psql -h $TEST_HOST -U test_user -d production_db -c "
EXPLAIN ANALYZE SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '1 day';
"

# 6. Cleanup
aws rds delete-db-instance \
  --db-instance-identifier ${TEST_DB} \
  --skip-final-snapshot \
  --region $DR_REGION

# Log results
echo "DR test completed at $(date)" >> /var/log/dr-tests.log
```

DR Drill Schedule
```yaml
disaster_recovery_testing_schedule:
  weekly:
    - Verify replication lag < 1 minute
    - Test backup integrity
    - Verify backup completion within SLA
  monthly:
    - Restore from latest backup to test environment
    - Validate data integrity
    - Run application smoke tests
    - Document recovery time
  quarterly:
    - Full failover drill to DR region
    - Test all application functionality
    - Measure actual RTO vs target
    - Update DR documentation
  annually:
    - Complete DR plan review
    - Update RPO/RTO targets
    - Test backup restoration from 1 year ago
    - Update contact information
```

Monitoring and Alerting
Backup Monitoring
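The postgres_backup_last_success_timestamp_seconds series used by the alert rules below has to be produced by something. One common pattern (an assumption here, not part of the original setup) is the node_exporter textfile collector, where the backup job atomically drops a .prom file on success:

```bash
#!/bin/sh
# Export backup success time for node_exporter's textfile collector.
# TEXTFILE_DIR is illustrative; point it at --collector.textfile.directory.
TEXTFILE_DIR=$(mktemp -d)

write_backup_metric() {
  ts=$(date +%s)
  # Write atomically (tmp file + rename) so the scraper never sees a partial file.
  tmp="$TEXTFILE_DIR/.backup.prom.tmp"
  {
    echo "# HELP postgres_backup_last_success_timestamp_seconds Unix time of last successful backup"
    echo "# TYPE postgres_backup_last_success_timestamp_seconds gauge"
    echo "postgres_backup_last_success_timestamp_seconds $ts"
  } > "$tmp"
  mv "$tmp" "$TEXTFILE_DIR/backup.prom"
}

write_backup_metric
cat "$TEXTFILE_DIR/backup.prom"
```

Calling write_backup_metric as the last step of the backup job means the metric goes stale exactly when backups stop succeeding, which is what the age alert keys on.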
```yaml
# Prometheus alerting rules
groups:
  - name: database_backups
    rules:
      # Backup age alert
      - alert: DatabaseBackupTooOld
        expr: |
          time() - postgres_backup_last_success_timestamp_seconds > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Database backup is more than 24 hours old"
          description: "Last successful backup for {{ $labels.instance }} was {{ $value }}s ago"
      # Backup size anomaly
      - alert: DatabaseBackupSizeAnomaly
        expr: |
          abs(postgres_backup_size_bytes - postgres_backup_size_bytes offset 1h) / postgres_backup_size_bytes > 0.2
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Database backup size changed by more than 20%"
          description: "Backup size for {{ $labels.instance }} changed significantly"
      # Replication lag
      - alert: DatabaseReplicationLagHigh
        expr: |
          pg_replication_lag_seconds > 300
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database replication lag exceeds 5 minutes"
          description: "Replication lag for {{ $labels.instance }} is {{ $value }}s"
```

DR Dashboard
```typescript
// DR monitoring dashboard metrics
interface DRMetrics {
  backups: {
    lastSuccessful: Date;
    lastFailed: Date | null;
    ageHours: number;
    sizeGB: number;
    retentionDays: number;
  };
  replication: {
    lagSeconds: number;
    status: 'syncing' | 'idle' | 'error';
    throughputMBps: number;
  };
  lastDRTest: {
    date: Date;
    durationMinutes: number;
    success: boolean;
    issues: string[];
  };
  rpo: {
    target: string;
    current: string;
    status: 'met' | 'exceeded';
  };
  rto: {
    target: string;
    estimated: string;
    status: 'met' | 'exceeded';
  };
}
```

Recovery Playbooks
Scenario 1: Single Table Corruption
```bash
#!/bin/bash
# Recover a single table from a logical backup
TABLE_NAME="corrupted_table"
BACKUP_FILE="/backups/postgres/logical/latest.dump"
DB_NAME="production_db"

# 1. Restore the table into a temporary database
createdb -h localhost -U postgres temp_restore_db
pg_restore -h localhost -U postgres \
  -d temp_restore_db \
  -t "$TABLE_NAME" \
  "$BACKUP_FILE"

# 2. Export the table data (data only, since the table still exists in production)
pg_dump -h localhost -U postgres \
  -t "$TABLE_NAME" \
  --data-only \
  -d temp_restore_db \
  -f /tmp/table_restore.sql

# 3. Truncate the corrupted table
psql -h localhost -U postgres -d $DB_NAME \
  -c "TRUNCATE TABLE $TABLE_NAME;"

# 4. Restore the data
psql -h localhost -U postgres -d $DB_NAME -f /tmp/table_restore.sql

# 5. Cleanup
dropdb -h localhost -U postgres temp_restore_db
rm /tmp/table_restore.sql
```

Scenario 2: Full Database Restore
```bash
#!/bin/bash
# Full database restore via PITR
RECOVERY_TIMESTAMP="2026-03-12 14:30:00"
BACKUP_BASE="/backups/postgres/physical/latest"
RESTORE_DIR="/var/lib/postgresql/restore"

# 1. Stop database
sudo systemctl stop postgresql

# 2. Restore a base backup taken before the target time
mkdir -p "$RESTORE_DIR"
tar -xzf "$BACKUP_BASE/base.tar.gz" -C "$RESTORE_DIR"

# 3. Configure recovery (PostgreSQL 12+: no recovery.conf)
cat >> "$RESTORE_DIR/postgresql.auto.conf" <<EOF
restore_command = 'aws s3 cp s3://my-bucket/postgres-wal/%f %p'
recovery_target_time = '${RECOVERY_TIMESTAMP}'
EOF
touch "$RESTORE_DIR/recovery.signal"

# 4. Start database in recovery mode
sudo systemctl start postgresql

# 5. Monitor recovery
tail -f /var/log/postgresql/postgresql-*.log | grep "recovery"

# 6. When recovery pauses at the target, promote to primary
psql -U postgres -c "SELECT pg_promote();"

# 7. Update application connection string
# (Manual step)
```

Scenario 3: Region Failover
```bash
#!/bin/bash
# Region failover procedure
PRIMARY_REGION="us-east-1"
DR_REGION="us-west-2"
HOSTED_ZONE_ID="Z0000000000000"  # placeholder: Route 53 hosted zone ID for example.com

# 1. Verify DR region is healthy
aws rds describe-db-instances \
  --region $DR_REGION \
  --db-instance-identifier production-db-dr

# 2. Promote the DR replica to a standalone primary
#    (promotion automatically severs replication from the old primary)
aws rds promote-read-replica \
  --db-instance-identifier production-db-dr \
  --region $DR_REGION

# 3. Update DNS
aws route53 change-resource-record-sets \
  --hosted-zone-id $HOSTED_ZONE_ID \
  --change-batch file://dns-update.json

# 4. Verify application connectivity
curl -f https://api.example.com/health

# 5. Re-establish reverse replication
aws rds create-db-instance-read-replica \
  --source-db-instance-identifier production-db-dr \
  --db-instance-identifier production-db-primary-replica \
  --region $PRIMARY_REGION
```

Conclusion
Backup and disaster recovery for critical databases is not a one-off implementation — it is an ongoing discipline combining automated backups, PITR, cross-region replication, regular restore testing, and documented playbooks. Organizations that treat DR as a systematic process, rather than a compliance checklist, are the ones that survive when disasters inevitably occur.
Need to implement a backup and disaster recovery strategy for critical databases? Talk to the cloud specialists at Imperialis to build a backup and DR architecture with defined RPO/RTO targets, PITR, and automated testing.
Sources
- PostgreSQL Backup and Recovery documentation — accessed on 2026-03
- MySQL Backup and Recovery documentation — accessed on 2026-03
- AWS RDS Backup and Restore — accessed on 2026-03
- Percona XtraBackup documentation — accessed on 2026-03