Database schema migration strategies for production
Schema changes are inevitable. The difference between planned migrations and outages is how you handle rollback, data transformation, and execution timing.
Last updated: 3/13/2026
Executive summary
Database schema changes are among the most dangerous operations in production. Unlike code deployments, a database change that goes wrong often has no easy rollback path. The difference between a planned migration and an outage is usually measured in three areas:
- Backward compatibility—ensuring old and new code coexist during deployment
- Rollback planning—knowing exactly how to revert when things break
- Execution timing—safely managing data transformation windows
Mature organizations treat database migrations as first-class artifacts: versioned, tested, and deployed with the same rigor as application code.
The migration lifecycle
Common failure patterns
1. Blocking migration during deployment:

```bash
# Bad: Application waits for migration
$ npm run build
$ npm run migrate   # Takes 45 minutes on production
$ npm run deploy
```

Problem: Service is down for the entire migration window, and any failure requires manual intervention.
2. Incompatible schema change:

```sql
-- Breaking change without rollback plan
ALTER TABLE users DROP COLUMN email;
```

Problem: If the deployment fails, there is no path back. The email data is lost permanently.
3. Data transformation without staging:

```sql
-- Data-heavy transformation without testing
UPDATE orders SET status = 'cancelled' WHERE created_at < '2025-01-01';
```

Problem: Unexpected edge cases in the data cause partial updates and inconsistent state.
Migration categories and patterns
Category 1: Non-breaking changes
These are safest and can be deployed without special coordination:
```sql
-- Add new column with default
ALTER TABLE users ADD COLUMN bio TEXT DEFAULT '';

-- Add new table
CREATE TABLE audit_logs (
  id SERIAL PRIMARY KEY,
  user_id INTEGER NOT NULL,
  action VARCHAR(50) NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

-- Add index (CONCURRENTLY avoids blocking writes in PostgreSQL,
-- but cannot run inside a transaction)
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
```

Deployment strategy:
- Run migration during normal deployment
- Application code can immediately use new features
- No rollback complexity needed
Category 2: Backward-compatible additions
These require application code coordination:
```sql
-- Step 1: Add nullable column
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;

-- Step 2: Deploy application code
-- Application handles both NULL and FALSE values as unverified

-- Step 3: Populate data (separate migration or background job)
UPDATE users SET email_verified = FALSE WHERE email_verified IS NULL;
ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT FALSE;
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;

-- Step 4: Deploy new application code
-- Application now assumes email_verified always exists
```

Deployment timeline:
```
T0: Migration 1 (add nullable column)
T1: Deploy code v1.1 (handles NULL values)
T2: Background job (populate data)
T3: Migration 2 (make NOT NULL, add default)
T4: Deploy code v1.2 (uses NOT NULL assumption)
```

Rollback path: At any point before T3, rollback is simple. After T3, rollback first requires reverting Migration 2 (dropping the NOT NULL constraint) before older code can write NULLs again.
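The v1.1 accessor deployed at T1 can be as small as a helper that treats NULL and FALSE identically. A minimal sketch (the `User` shape here is a hypothetical illustration, not code from the article's examples):

```typescript
// Hypothetical transition-period accessor (T1..T3): email_verified may still be NULL
interface User {
  email_verified: boolean | null;
}

// Treat NULL (not yet backfilled) the same as FALSE
function isEmailVerified(user: User): boolean {
  return user.email_verified === true;
}
```

Once Migration 2 makes the column NOT NULL, the strict comparison keeps working unchanged, so v1.1 and v1.2 can share this helper.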
Category 3: Breaking changes requiring zero-downtime
These are the most complex and require careful orchestration:
```
ZERO-DOWNTIME MIGRATION

T0: Add new column (nullable)
    └─ v1 deployed (reads/writes both old and new)

T1: Populate new column (background job)
    └─ Data consistency guaranteed

T2: Deploy v2 (writes to both columns)
    └─ Dual-write strategy active

T3: Verify data consistency (comparison job)
    └─ Ensure columns match

T4: Deploy v3 (reads from new column only)
    └─ New column is source of truth

T5: Remove old column (separate migration)
    └─ Cleanup migration
```

Example: Column rename (zero-downtime)
Goal: Rename email to email_address
Incorrect approach:

```sql
-- Breaks all existing deployments
ALTER TABLE users RENAME COLUMN email TO email_address;
```

Correct approach:
```sql
-- Migration 1
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);

-- Deploy code v2 (writes to both columns); a trigger keeps them in sync
CREATE OR REPLACE FUNCTION insert_user_trig() RETURNS TRIGGER AS $$
BEGIN
  NEW.email_address := NEW.email;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sync_email BEFORE INSERT OR UPDATE ON users
  FOR EACH ROW EXECUTE FUNCTION insert_user_trig();

-- Migration 2 (background job)
UPDATE users SET email_address = email WHERE email_address IS NULL;

-- Deploy code v3 (reads from email_address)
-- No longer uses email

-- Migration 3 (cleanup)
DROP TRIGGER sync_email ON users;
DROP FUNCTION insert_user_trig();
ALTER TABLE users DROP COLUMN email;
```

Tooling landscape
Migration libraries
TypeScript/Node.js:
```typescript
// db/migrations/202603130001_add_audit_logs.ts
import { Kysely, sql } from 'kysely';

export async function up(db: Kysely<any>): Promise<void> {
  await db.schema
    .createTable('audit_logs')
    .addColumn('id', 'serial', (col) => col.primaryKey())
    .addColumn('user_id', 'integer', (col) => col.notNull())
    .addColumn('action', 'varchar(50)', (col) => col.notNull())
    .addColumn('created_at', 'timestamp', (col) => col.defaultTo(sql`now()`).notNull())
    .execute();
}

export async function down(db: Kysely<any>): Promise<void> {
  await db.schema.dropTable('audit_logs').execute();
}
```

```typescript
// db/migrations/202603130002_populate_user_bio.ts
import { Kysely } from 'kysely';

export async function up(db: Kysely<any>): Promise<void> {
  // Data migration: backfill missing bio values
  await db
    .updateTable('users')
    .set({ bio: '' })
    .where('bio', 'is', null)
    .execute();
}
```

Python (Alembic):
```python
# alembic/versions/001_add_audit_log.py
from alembic import op
import sqlalchemy as sa

def upgrade():
    op.create_table(
        'audit_logs',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('user_id', sa.Integer(), nullable=False),
        sa.Column('action', sa.String(50), nullable=False),
        sa.Column('created_at', sa.TIMESTAMP(), server_default=sa.text('NOW()'), nullable=False),
    )
    op.create_index('idx_audit_logs_user_id', 'audit_logs', ['user_id'])

def downgrade():
    op.drop_index('idx_audit_logs_user_id', table_name='audit_logs')
    op.drop_table('audit_logs')
```

Java (Liquibase/Flyway):
```xml
<!-- db/changelog/add_audit_log.xml -->
<changeSet id="001" author="database-team">
  <createTable tableName="audit_logs">
    <column name="id" type="INT" autoIncrement="true">
      <constraints primaryKey="true"/>
    </column>
    <column name="user_id" type="INT">
      <constraints nullable="false"/>
    </column>
    <column name="action" type="VARCHAR(50)">
      <constraints nullable="false"/>
    </column>
    <column name="created_at" type="TIMESTAMP" defaultValueComputed="NOW()">
      <constraints nullable="false"/>
    </column>
  </createTable>
  <createIndex tableName="audit_logs" indexName="idx_audit_logs_user_id">
    <column name="user_id"/>
  </createIndex>
</changeSet>
```

Schema versioning
Version tracking table:
```sql
CREATE TABLE schema_migrations (
  version VARCHAR(50) PRIMARY KEY,
  applied_at TIMESTAMP NOT NULL DEFAULT NOW(),
  checksum VARCHAR(64) NOT NULL,
  execution_time_ms INTEGER
);
```

Migration executor:
```typescript
import { createHash } from 'crypto';

class MigrationRunner {
  constructor(private db: Kysely<any>) {}

  async run(migrations: Migration[]): Promise<void> {
    for (const migration of migrations) {
      const applied = await this.db
        .selectFrom('schema_migrations')
        .selectAll()
        .where('version', '=', migration.version)
        .executeTakeFirst();

      if (!applied) {
        const start = Date.now();
        await migration.up(this.db);
        await this.db
          .insertInto('schema_migrations')
          .values({
            version: migration.version,
            checksum: this.calculateChecksum(migration),
            execution_time_ms: Date.now() - start,
          })
          .execute();
      }
    }
  }

  private calculateChecksum(migration: Migration): string {
    // SHA-256 of migration content
    return createHash('sha256').update(migration.toString()).digest('hex');
  }
}
```

Production execution patterns
Pattern 1: Separate migration window
```yaml
# deploy.yml
stages:
  - name: Migrate database
    run: |
      pnpm run db:migrate
      pnpm run db:verify
  - name: Deploy application
    run: |
      pnpm run deploy:app
  - name: Verify application health
    run: |
      pnpm run health:check
```

Advantages:
- Clear separation of concerns
- Easier to rollback if migration fails
- Health checks catch application issues
Risks:
- Gap between schema and application code
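One way to shrink that gap is to make the `db:verify` step fail the pipeline when the applied versions in `schema_migrations` lag behind the migration files shipped with the release. The comparison itself is a pure function; this is a sketch, and the surrounding queries that produce the two lists are assumed:

```typescript
// Return migration versions that exist on disk but are not yet applied.
// `onDisk` is assumed to come from the migrations directory,
// `applied` from the schema_migrations table.
function missingMigrations(onDisk: string[], applied: string[]): string[] {
  const appliedSet = new Set(applied);
  return onDisk.filter((version) => !appliedSet.has(version));
}
```

If the result is non-empty, `db:verify` exits non-zero and the deploy stage never runs against a stale schema.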
Pattern 2: Application-managed migrations
```typescript
// server/index.ts
import { migrationRunner } from './db/migrations';

app.on('startup', async () => {
  await migrationRunner.run();
  console.log('Migrations completed');
});

app.on('shutdown', async () => {
  // Graceful shutdown
});
```

Advantages:
- Schema and application always in sync
- Simpler deployment pipeline
- Each instance runs migrations on startup
Risks:
- Multiple instances running migrations simultaneously
- Harder to audit migration execution
- Startup time increases
Mitigation:
```typescript
// Use a database advisory lock so only one instance runs a given migration
async function runMigrationWithLock(migration: Migration): Promise<void> {
  const lockId = hash(`migration_${migration.version}`);
  const { rows } = await db.query(
    'SELECT pg_try_advisory_lock($1::bigint) AS acquired',
    [lockId]
  );

  if (rows[0].acquired) {
    try {
      await migration.up(db);
    } finally {
      await db.query('SELECT pg_advisory_unlock($1::bigint)', [lockId]);
    }
  }
}
```

Pattern 3: Blue-green database migration
```
BLUE-GREEN MIGRATION

Phase 1: Migration on replica
    └─ Migrations run on read replica
    └─ No impact on production traffic
    └─ Validate results

Phase 2: Promote replica
    └─ Replica becomes primary (failover)
    └─ Application connects to new primary

Phase 3: Old primary becomes replica
    └─ Can be demoted or kept as backup
```

Implementation:
```sql
-- On replica first
BEGIN;
-- Run migrations
COMMIT;

-- Promote replica (PostgreSQL 12+)
SELECT pg_promote();
```

```bash
# Update application connection string
DATABASE_URL=postgresql://new-primary:5432/db
```

Rollback strategies
Reversible migrations
Always write down functions:
```typescript
export async function up(db: Kysely<any>): Promise<void> {
  await db.schema
    .createTable('feature_flags')
    .addColumn('name', 'varchar(100)', (col) => col.notNull().primaryKey())
    .addColumn('enabled', 'boolean', (col) => col.defaultTo(false).notNull())
    .execute();
}

export async function down(db: Kysely<any>): Promise<void> {
  // Always provide rollback path
  await db.schema.dropTable('feature_flags').execute();
}
```

Point-in-time recovery (PITR)
Cloud-managed databases:
- AWS RDS: Automated backups + transaction logs
- Google Cloud SQL: Point-in-time recovery enabled
- Azure Database: Automatic backups with retention
Restore process:
```bash
# AWS CLI example: restore from a snapshot
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mydb-restored \
  --db-snapshot-identifier mydb-snapshot-20260313

# For true point-in-time recovery, use restore-db-instance-to-point-in-time
# with --restore-time (or --use-latest-restorable-time) instead of a snapshot

# Connect to restored instance
psql -h mydb-restored.cxyz.us-east-1.rds.amazonaws.com -U admin
```

Application-level rollback
When schema cannot be rolled back, application code must handle both states:
```typescript
// Application code supports both old and new schema
function getUserEmail(user: User): string {
  // Backward compatible: check both columns
  return user.email_address ?? user.email ?? '';
}
```

Large table migrations
Problem with single-statement migrations
```sql
-- Bad: can lock or rewrite the table for hours on some databases
-- (e.g. MySQL before 8.0, or PostgreSQL before 11 when a default is set)
ALTER TABLE orders ADD COLUMN order_type VARCHAR(50);

-- Bad: Updates the entire table in one statement, locking every affected row
UPDATE users SET email_verified = FALSE WHERE email_verified IS NULL;
```

Strategy: Per-chunk migration
```typescript
// db/migrations/202603130003_populate_large_table.ts
const BATCH_SIZE = 10000;
const DELAY_MS = 100; // Throttle between batches

export async function up(db: Kysely<any>): Promise<void> {
  let processed = 0;
  let hasMore = true;

  while (hasMore) {
    // Process one batch of rows
    const result = await db
      .updateTable('users')
      .set({ email_verified: false })
      .where('id', 'in', db
        .selectFrom('users')
        .select('id')
        .where('email_verified', 'is', null)
        .limit(BATCH_SIZE)
      )
      .executeTakeFirstOrThrow();

    const updated = Number(result.numUpdatedRows); // Kysely reports a bigint
    processed += updated;
    hasMore = updated >= BATCH_SIZE;

    // Throttle to avoid overwhelming the database
    if (hasMore) {
      await new Promise(resolve => setTimeout(resolve, DELAY_MS));
    }
    console.log(`Processed ${processed} users`);
  }
}
```

Strategy: Add new table and backfill
```sql
-- Migration 1: Create new table with desired schema
CREATE TABLE users_new (
  id SERIAL PRIMARY KEY,
  email VARCHAR(255) NOT NULL,
  email_verified BOOLEAN NOT NULL DEFAULT FALSE,
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

-- Backfill in batches using application code

-- Migration 2: Swap tables atomically
BEGIN;
LOCK TABLE users IN ACCESS EXCLUSIVE MODE;
ALTER TABLE users RENAME TO users_old;
ALTER TABLE users_new RENAME TO users;
COMMIT;

-- Migration 3: Cleanup
DROP TABLE users_old;
```

Migration testing strategy
Pre-production validation
1. Schema validation:
```typescript
// Compare staging and production schemas
async function validateSchemaStability(): Promise<boolean> {
  const stagingSchema = await getSchema('staging');
  const productionSchema = await getSchema('production');
  const diff = compareSchemas(stagingSchema, productionSchema);

  if (diff.hasChanges) {
    console.error('Schema drift detected:', diff);
    return false;
  }
  return true;
}
```

2. Data consistency checks:
```sql
-- After data transformation migration
SELECT
  COUNT(*) AS total,
  COUNT(CASE WHEN email_address IS NULL THEN 1 END) AS missing_new_column,
  COUNT(CASE WHEN email IS NULL THEN 1 END) AS missing_old_column
FROM users;
-- Both counts should be zero after migration
```

3. Performance impact testing:
```typescript
// Measure query performance before/after migration
const before = await measureQueryTime(() =>
  db.selectFrom('users').selectAll().where('email', '=', 'test@example.com').executeTakeFirst()
);

// Run migration
await migration.up(db);

const after = await measureQueryTime(() =>
  db.selectFrom('users').selectAll().where('email', '=', 'test@example.com').executeTakeFirst()
);
console.log(`Performance impact: ${after - before}ms`);
```

Staging dry-run
```typescript
// Dry-run mode that logs without executing
class DryRunMigration implements Migration {
  constructor(private migration: Migration) {}

  async up(db: Kysely<any>): Promise<void> {
    const sql = this.getSQL(this.migration);
    console.log('[DRY RUN] Would execute:', sql);
  }

  private getSQL(migration: Migration): string {
    // Extract SQL from the migration; implementation depends on migration library
    throw new Error('Not implemented');
  }
}
```

Migration checklist
Before deployment
- [ ] Migration scripts have down() functions
- [ ] Scripts tested on staging with production-like data volume
- [ ] Estimated execution time documented
- [ ] Rollback plan documented
- [ ] Database backups verified (PITR available)
- [ ] Application code supports both old and new schema during transition
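For the "estimated execution time" item, a batched backfill like the per-chunk migration shown earlier can be estimated from row count, batch size, per-batch time, and throttle delay. Back-of-envelope only; the per-batch time is an assumption you would measure on staging:

```typescript
// Rough duration estimate for a throttled, batched backfill
function estimateBackfillMs(
  rows: number,
  batchSize: number,
  perBatchMs: number,
  delayMs: number
): number {
  const batches = Math.ceil(rows / batchSize);
  // Delays are inserted between batches, so there are (batches - 1) of them
  return batches * perBatchMs + Math.max(0, batches - 1) * delayMs;
}

// e.g. 1,000,000 rows in 10,000-row batches at ~500 ms per batch with a
// 100 ms throttle: 100 * 500 + 99 * 100 = 59,900 ms, call it one minute
```

Write the estimate (and the staging measurement it came from) into the rollback plan so the on-call engineer knows when a migration has stalled versus merely being slow.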
During deployment
- [ ] Execute migrations in transaction where possible
- [ ] Use advisory locks to prevent concurrent execution
- [ ] Monitor database connection pool
- [ ] Watch for long-running queries
- [ ] Verify migration checksums
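The "verify migration checksums" item guards against a migration file being edited after it was applied. With the SHA-256 checksums recorded by the migration executor shown earlier, verification is just comparing the stored digest against a freshly computed one; a sketch using Node's crypto module:

```typescript
import { createHash } from 'crypto';

// Recompute a migration's checksum from its source text
function checksumOf(migrationSource: string): string {
  return createHash('sha256').update(migrationSource).digest('hex');
}

// Compare against the value stored in schema_migrations
function checksumMatches(migrationSource: string, storedChecksum: string): boolean {
  return checksumOf(migrationSource) === storedChecksum;
}
```

A mismatch means the file changed after it ran in production; the deploy should stop and a human should reconcile, rather than silently re-running or skipping the migration.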
After deployment
- [ ] Run data consistency checks
- [ ] Verify application health
- [ ] Monitor query performance
- [ ] Check error logs for schema-related issues
- [ ] Document any deviations from plan
Connection pool considerations
During migrations, connection pool behavior affects execution:
```typescript
// Configure connection pool for migration window
import { Pool } from 'pg';

const pool = new Pool({
  host: process.env.DB_HOST,
  port: 5432,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  // Lower max connections during migration
  max: 10,
  // Longer timeout for acquiring a connection
  connectionTimeoutMillis: 30000,
  // Allow more time for migration queries (milliseconds)
  statement_timeout: 300000, // 5 minutes
});
```

PgBouncer considerations:
- PgBouncer may pool transactions differently during migrations
- Consider bypassing PgBouncer for direct migration connection
- Ensure PgBouncer pool size accommodates migration workload
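Bypassing PgBouncer usually means pointing the migration runner at the database's direct port instead of the pooled one. A small helper can derive that URL; this is a sketch, and the 6432-pooled / 5432-direct port convention is an assumption about your PgBouncer setup:

```typescript
// Rewrite a pooled connection URL (PgBouncer, assumed on port 6432)
// to connect to PostgreSQL directly (assumed on port 5432) for migrations
function directConnectionUrl(pooledUrl: string): string {
  const url = new URL(pooledUrl);
  if (url.port === '6432') {
    url.port = '5432';
  }
  return url.toString();
}
```

The application keeps using the pooled URL; only the migration step swaps it, so long-running DDL never ties up a PgBouncer transaction-mode slot.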
Conclusion
Database schema migration success comes from treating migrations as first-class artifacts: versioned, tested, and deployed with deliberate coordination between schema and application code.
The mature organization doesn't ask "how fast can we change the database?" but "how can we change the database without causing an incident?"
Practical closing question: When was the last time you successfully rolled back a production migration, and what would have changed if you couldn't?
Sources
- PostgreSQL Documentation: ALTER TABLE - official documentation
- MySQL Documentation: ALTER TABLE - official documentation
- Liquibase Documentation - official documentation
- Flyway Documentation - official documentation
- Alembic Documentation - official documentation
- Kysely Documentation - official documentation