Database schema migration strategies for production

Schema changes are inevitable. The difference between planned migrations and outages is how you handle rollback, data transformation, and execution timing.

Last updated: 3/13/2026

Executive summary

Database schema changes are the most dangerous operations in production. Unlike code deployments, database changes have no easy rollback path when they go wrong. The difference between a planned migration and an outage is often measured in three areas:

  1. Backward compatibility—ensuring old and new code coexist during deployment
  2. Rollback planning—knowing exactly how to revert when things break
  3. Execution timing—safely managing data transformation windows

Mature organizations treat database migrations as first-class artifacts: versioned, tested, and deployed with the same rigor as application code.

The migration lifecycle

Common failure patterns

1. Blocking migration during deployment:

```bash
# Bad: Application waits for migration
$ npm run build
$ npm run migrate  # Takes 45 minutes on production
$ npm run deploy
```

Problem: Service is down during migration window. Any issue requires manual intervention.

2. Incompatible schema change:

```sql
-- Breaking change without rollback plan
ALTER TABLE users DROP COLUMN email;
```

Problem: If the deployment fails, there is no way to roll back. The email data is gone for good unless you restore from a backup.

3. Data transformation without staging:

```sql
-- Data-heavy transformation without testing
UPDATE orders SET status = 'cancelled' WHERE created_at < '2025-01-01';
```

Problem: Unexpected data edge cases cause partial updates and inconsistent state.

Migration categories and patterns

Category 1: Non-breaking changes

These are safest and can be deployed without special coordination:

```sql
-- Add new column with default
ALTER TABLE users ADD COLUMN bio TEXT DEFAULT '';

-- Add new table
CREATE TABLE audit_logs (
  id SERIAL PRIMARY KEY,
  user_id INTEGER NOT NULL,
  action VARCHAR(50) NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

-- Add index without blocking writes (PostgreSQL; CONCURRENTLY cannot run inside a transaction)
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
```

Deployment strategy:

  1. Run migration during normal deployment
  2. Application code can immediately use new features
  3. No rollback complexity needed
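
Of the three statements above, CREATE INDEX CONCURRENTLY needs the most care: because it cannot run inside a transaction, the migration runner has to opt that particular migration out of any automatic transaction wrapper. A minimal sketch with Kysely and raw SQL (the index name is taken from the example above; transaction handling is assumed to be disabled for this file):

```typescript
import { Kysely, sql } from 'kysely';

// This migration must run outside a transaction block because of CONCURRENTLY
export async function up(db: Kysely<any>): Promise<void> {
  await sql`CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email ON users (email)`.execute(db);
}

export async function down(db: Kysely<any>): Promise<void> {
  await sql`DROP INDEX CONCURRENTLY IF EXISTS idx_users_email`.execute(db);
}
```

Most migration runners have some way to mark a migration as non-transactional; check your tool's documentation before relying on this.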

Category 2: Backward-compatible additions

These require application code coordination:

```sql
-- Step 1: Add nullable column
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;

-- Step 2: Deploy application code
-- Application handles both NULL and FALSE values as unverified

-- Step 3: Populate data (separate migration or background job)
UPDATE users SET email_verified = FALSE WHERE email_verified IS NULL;
ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT FALSE;
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;

-- Step 4: Deploy new application code
-- Application now assumes email_verified always exists
```

Deployment timeline:

T0: Migration 1 (add nullable column)
T1: Deploy code v1.1 (handles NULL values)
T2: Background job (populate data)
T3: Migration 2 (make NOT NULL, add default)
T4: Deploy code v1.2 (uses NOT NULL assumption)

Rollback path: At any point before T3, rollback is simple: revert the application code and leave the nullable column in place. After T3, a full rollback also requires reverting Migration 2 (dropping the default and the NOT NULL constraint) before the schema matches what earlier application versions expect.
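
A minimal sketch of that second migration together with its reverse, using Kysely (the same library as the tooling examples later in this piece; the file name and details are illustrative):

```typescript
// db/migrations/..._email_verified_not_null.ts (illustrative)
import { Kysely } from 'kysely';

export async function up(db: Kysely<any>): Promise<void> {
  // Backfill, then tighten the constraint
  await db.updateTable('users')
    .set({ email_verified: false })
    .where('email_verified', 'is', null)
    .execute();
  await db.schema.alterTable('users')
    .alterColumn('email_verified', (col) => col.setDefault(false))
    .execute();
  await db.schema.alterTable('users')
    .alterColumn('email_verified', (col) => col.setNotNull())
    .execute();
}

export async function down(db: Kysely<any>): Promise<void> {
  // Return to the T0 shape: column exists, but is nullable with no default
  await db.schema.alterTable('users')
    .alterColumn('email_verified', (col) => col.dropNotNull())
    .execute();
  await db.schema.alterTable('users')
    .alterColumn('email_verified', (col) => col.dropDefault())
    .execute();
}
```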

Category 3: Breaking changes requiring zero-downtime

These are the most complex and require careful orchestration:

┌─────────────────────────────────────────────────────────────┐
│                 ZERO-DOWNTIME MIGRATION              │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  T0: Add new column (nullable)                             │
│      └─ v1 deployed (reads/writes both old and new)      │
│                                                             │
│  T1: Populate new column (background job)                    │
│      └─ Data consistency guaranteed                           │
│                                                             │
│  T2: Deploy v2 (writes to both columns)                    │
│      └─ Dual-write strategy active                           │
│                                                             │
│  T3: Verify data consistency (comparison job)                  │
│      └─ Ensure columns match                                  │
│                                                             │
│  T4: Deploy v3 (reads from new column only)                │
│      └─ New column is source of truth                        │
│                                                             │
│  T5: Remove old column (separate migration)                   │
│      └─ Cleanup migration                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Example: Column rename (zero-downtime)

Goal: Rename email to email_address

Incorrect approach:

```sql
-- Breaks all existing deployments
ALTER TABLE users RENAME COLUMN email TO email_address;
```

Correct approach:

```sql
-- Migration 1
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);

-- Deploy code v2 (writes to both columns)
-- Database-level safety net: copy email into email_address on every write,
-- covering any writer that still sets only the old column
CREATE OR REPLACE FUNCTION insert_user_trig() RETURNS TRIGGER AS $$
BEGIN
  NEW.email_address := NEW.email;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sync_email BEFORE INSERT OR UPDATE ON users
  FOR EACH ROW EXECUTE FUNCTION insert_user_trig();

-- Migration 2 (background job)
UPDATE users SET email_address = email WHERE email_address IS NULL;

-- Deploy code v3 (reads from email_address)
-- v3 keeps writing email as well until the trigger and column are removed below

-- Migration 3 (cleanup)
DROP TRIGGER sync_email ON users;
DROP FUNCTION insert_user_trig();
ALTER TABLE users DROP COLUMN email;
```

Tooling landscape

Migration libraries

TypeScript/Node.js:

```typescript
// db/migrations/202603130001_create_audit_logs.ts
import { Kysely, sql } from 'kysely';

export async function up(db: Kysely<any>): Promise<void> {
  await db.schema
    .createTable('audit_logs')
    .addColumn('id', 'serial', (col) => col.primaryKey())
    .addColumn('user_id', 'integer', (col) => col.notNull())
    .addColumn('action', 'varchar(50)', (col) => col.notNull())
    .addColumn('created_at', 'timestamp', (col) => col.defaultTo(sql`now()`).notNull())
    .execute();
}

export async function down(db: Kysely<any>): Promise<void> {
  await db.schema.dropTable('audit_logs').execute();
}
```

```typescript
// db/migrations/202603130002_populate_user_bio.ts
import { Kysely } from 'kysely';

export async function up(db: Kysely<any>): Promise<void> {
  // Data migration: backfill a default bio for existing rows
  await db
    .updateTable('users')
    .set({ bio: '' })
    .where('bio', 'is', null)
    .execute();
}

export async function down(db: Kysely<any>): Promise<void> {
  // The backfill is not meaningfully reversible; leave the data in place
}
```

Python (Alembic):

```python
# alembic/versions/001_add_audit_log.py
from alembic import op
import sqlalchemy as sa

def upgrade():
    op.create_table(
        'audit_logs',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('user_id', sa.Integer(), nullable=False),
        sa.Column('action', sa.String(50), nullable=False),
        sa.Column('created_at', sa.TIMESTAMP(), server_default=sa.text('NOW()'), nullable=False),
    )
    op.create_index('idx_audit_logs_user_id', 'audit_logs', ['user_id'])

def downgrade():
    op.drop_index('idx_audit_logs_user_id', table_name='audit_logs')
    op.drop_table('audit_logs')
```

Java (Liquibase/Flyway):

```xml
<!-- db/changelog/add_audit_log.xml -->
<changeSet id="001" author="database-team">
    <createTable tableName="audit_logs">
        <column name="id" type="INT" autoIncrement="true">
            <constraints primaryKey="true"/>
        </column>
        <column name="user_id" type="INT">
            <constraints nullable="false"/>
        </column>
        <column name="action" type="VARCHAR(50)">
            <constraints nullable="false"/>
        </column>
        <column name="created_at" type="TIMESTAMP" defaultValueComputed="NOW()">
            <constraints nullable="false"/>
        </column>
    </createTable>
    <createIndex tableName="audit_logs" indexName="idx_audit_logs_user_id">
        <column name="user_id"/>
    </createIndex>
</changeSet>
```

Schema versioning

Version tracking table:

```sql
CREATE TABLE schema_migrations (
    version VARCHAR(50) PRIMARY KEY,
    applied_at TIMESTAMP NOT NULL DEFAULT NOW(),
    checksum VARCHAR(64) NOT NULL,
    execution_time_ms INTEGER
);
```

Migration executor:

```typescript
import { createHash } from 'node:crypto';
import { Kysely } from 'kysely';

class MigrationRunner {
  constructor(private db: Kysely<any>) {}

  async run(migrations: Migration[]): Promise<void> {
    for (const migration of migrations) {
      const applied = await this.db
        .selectFrom('schema_migrations')
        .select('version')
        .where('version', '=', migration.version)
        .executeTakeFirst();

      if (!applied) {
        const start = Date.now();
        await migration.up(this.db);

        await this.db
          .insertInto('schema_migrations')
          .values({
            version: migration.version,
            checksum: this.calculateChecksum(migration),
            execution_time_ms: Date.now() - start,
          })
          .execute();
      }
    }
  }

  private calculateChecksum(migration: Migration): string {
    // SHA-256 of the migration content (real tools hash the migration file's source)
    return createHash('sha256').update(migration.toString()).digest('hex');
  }
}
```

Production execution patterns

Pattern 1: Separate migration window

```yaml
# deploy.yml
stages:
  - name: Migrate database
    run: |
      pnpm run db:migrate
      pnpm run db:verify

  - name: Deploy application
    run: |
      pnpm run deploy:app

  - name: Verify application health
    run: |
      pnpm run health:check
```
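
What db:verify actually runs is left open above; one reasonable minimum is to fail the pipeline if any migration on disk has not been recorded in the schema_migrations table described earlier. A sketch, assuming Kysely and a db/migrations directory (the helper wiring is illustrative):

```typescript
import { promises as fs } from 'node:fs';
import { Kysely } from 'kysely';

// Illustrative db:verify check: every migration file must appear in schema_migrations
async function verifyMigrations(db: Kysely<any>): Promise<void> {
  const onDisk = (await fs.readdir('db/migrations'))
    .filter((f) => f.endsWith('.ts'))
    .map((f) => f.replace(/\.ts$/, ''));

  const applied = await db
    .selectFrom('schema_migrations')
    .select('version')
    .execute();
  const appliedVersions = new Set(applied.map((row) => row.version));

  const pending = onDisk.filter((version) => !appliedVersions.has(version));
  if (pending.length > 0) {
    throw new Error(`Unapplied migrations: ${pending.join(', ')}`);
  }
}
```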

Advantages:

  • Clear separation of concerns
  • Easier to roll back if the migration fails
  • Health checks catch application issues

Risks:

  • Gap between schema and application code

Pattern 2: Application-managed migrations

```typescript
// server/index.ts
import { migrationRunner } from './db/migrations';

app.on('startup', async () => {
  await migrationRunner.run();
  console.log('Migrations completed');
});

app.on('shutdown', async () => {
  // Graceful shutdown
});
```

Advantages:

  • Schema and application always in sync
  • Simpler deployment pipeline
  • Each instance runs migrations on startup

Risks:

  • Multiple instances running migrations simultaneously
  • Harder to audit migration execution
  • Startup time increases

Mitigation:

```typescript
// Use a PostgreSQL advisory lock so only one instance runs a migration at a time
async function runMigrationWithLock(migration: Migration): Promise<void> {
  // hash() is an application-defined mapping from the lock name to a 64-bit integer key
  const lockKey = hash(`migration_${migration.version}`);

  const result = await db.query(
    'SELECT pg_try_advisory_lock($1) AS acquired',
    [lockKey]
  );

  if (!result.rows[0].acquired) {
    return; // another instance holds the lock and is running this migration
  }

  try {
    await migration.up(db);
  } finally {
    await db.query('SELECT pg_advisory_unlock($1)', [lockKey]);
  }
}
```

Pattern 3: Blue-green database migration

┌─────────────────────────────────────────────────────────────┐
│               BLUE-GREEN MIGRATION                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Phase 1: Migration on replica                          │
│      └─ Migrations run on read replica                   │
│      └─ No impact on production traffic                  │
│      └─ Validate results                                  │
│                                                             │
│  Phase 2: Promote replica                                 │
│      └─ Replica becomes primary (failover)                │
│      └─ Application connects to new primary                │
│                                                             │
│  Phase 3: Old primary becomes replica                     │
│      └─ Can be demoted or kept as backup                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Implementation:

```sql
-- On the replica first
-- Note: the replica must be writable (e.g. a logical replica);
-- a physical streaming replica is read-only until promoted
BEGIN;
-- Run migrations
COMMIT;

-- Promote replica
SELECT pg_promote();

-- Update application connection string
-- DATABASE_URL=postgresql://new-primary:5432/db
```

Rollback strategies

Reversible migrations

Always write down() functions:

```typescript
import { Kysely } from 'kysely';

export async function up(db: Kysely<any>): Promise<void> {
  await db.schema
    .createTable('feature_flags')
    .addColumn('name', 'varchar(100)', (col) => col.notNull().primaryKey())
    .addColumn('enabled', 'boolean', (col) => col.defaultTo(false).notNull())
    .execute();
}

export async function down(db: Kysely<any>): Promise<void> {
  // Always provide a rollback path
  await db.schema.dropTable('feature_flags').execute();
}
```

Point-in-time recovery (PITR)

Cloud-managed databases:

  • AWS RDS: Automated backups + transaction logs
  • Google Cloud SQL: Point-in-time recovery enabled
  • Azure Database: Automatic backups with retention

Restore process:

```bash
# AWS CLI example: restore from an automated snapshot
# (use restore-db-instance-to-point-in-time for true PITR)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mydb-restored \
  --db-snapshot-identifier mydb-snapshot-20260313

# Connect to restored instance
psql -h mydb-restored.cxyz.us-east-1.rds.amazonaws.com -U admin
```

Application-level rollback

When schema cannot be rolled back, application code must handle both states:

```typescript
// Application code supports both old and new schema
function getUserEmail(user: User): string {
  // Backward compatible: check both columns
  return user.email_address ?? user.email ?? '';
}
```

Large table migrations

Problem with single-statement migrations

```sql
-- Risky: instant on modern PostgreSQL, but can rewrite and lock the table for hours
-- on other engines or older versions, especially with NOT NULL or a volatile default
ALTER TABLE orders ADD COLUMN order_type VARCHAR(50);

-- Bad: updates the entire table in one transaction, holding row locks for the duration
UPDATE users SET email_verified = FALSE WHERE email_verified IS NULL;
```

Strategy: Per-chunk migration

```typescript
// db/migrations/202603130003_populate_large_table.ts
import { Kysely } from 'kysely';

const BATCH_SIZE = 10000;
const DELAY_MS = 100; // Throttle between batches

export async function up(db: Kysely<any>): Promise<void> {
  let processed = 0;
  let hasMore = true;

  while (hasMore) {
    // Select the next batch of rows that still need the backfill
    const batch = await db
      .selectFrom('users')
      .select('id')
      .where('email_verified', 'is', null)
      .limit(BATCH_SIZE)
      .execute();

    if (batch.length === 0) break;

    await db
      .updateTable('users')
      .set({ email_verified: false })
      .where('id', 'in', batch.map((row) => row.id))
      .execute();

    processed += batch.length;
    hasMore = batch.length === BATCH_SIZE;

    // Throttle to avoid overwhelming the database
    if (hasMore) {
      await new Promise((resolve) => setTimeout(resolve, DELAY_MS));
    }

    console.log(`Processed ${processed} users`);
  }
}
```

Strategy: Add new table and backfill

```sql
-- Migration 1: Create new table with desired schema
CREATE TABLE users_new (
  id SERIAL PRIMARY KEY,
  email VARCHAR(255) NOT NULL,
  email_verified BOOLEAN NOT NULL DEFAULT FALSE,
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

-- Backfill in batches using application code (see the sketch below)

-- Migration 2: Swap tables atomically
BEGIN;
LOCK TABLE users IN ACCESS EXCLUSIVE MODE;
ALTER TABLE users RENAME TO users_old;
ALTER TABLE users_new RENAME TO users;
COMMIT;

-- Migration 3: Cleanup
DROP TABLE users_old;
```
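
The backfill itself can follow the per-chunk pattern above; a minimal sketch, assuming Kysely, a monotonically increasing id, and that users_new fills its new column from its default (all names are illustrative):

```typescript
import { Kysely } from 'kysely';

const BATCH_SIZE = 10000;

// Illustrative batched copy from users into users_new, walking the table in id order
export async function backfillUsersNew(db: Kysely<any>): Promise<void> {
  let lastId = 0;

  while (true) {
    const rows = await db
      .selectFrom('users')
      .select(['id', 'email', 'created_at'])
      .where('id', '>', lastId)
      .orderBy('id')
      .limit(BATCH_SIZE)
      .execute();

    if (rows.length === 0) break;

    // email_verified is filled by its column default in users_new
    await db.insertInto('users_new').values(rows).execute();
    lastId = rows[rows.length - 1].id;
  }
}
```

Rows written to users after their batch has already been copied will be missed, so in practice this is paired with dual writes or a final catch-up pass immediately before the swap.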

Migration testing strategy

Pre-production validation

1. Schema validation:

```typescript
// Compare staging and production schemas
// (getSchema and compareSchemas are application helpers, e.g. built on information_schema introspection)
async function validateSchemaStability(): Promise<boolean> {
  const stagingSchema = await getSchema('staging');
  const productionSchema = await getSchema('production');

  const diff = compareSchemas(stagingSchema, productionSchema);

  if (diff.hasChanges) {
    console.error('Schema drift detected:', diff);
    return false;
  }

  return true;
}
```

2. Data consistency checks:

```sql
-- After data transformation migration
SELECT
  COUNT(*) as total,
  COUNT(CASE WHEN email_address IS NULL THEN 1 END) as missing_new_column,
  COUNT(CASE WHEN email IS NULL THEN 1 END) as missing_old_column
FROM users;

-- Both should be zero after migration
```

3. Performance impact testing:

```typescript
// Measure query performance before/after migration
// (measureQueryTime is an application helper that times an async callback in milliseconds)
const before = await measureQueryTime(() =>
  db.selectFrom('users').selectAll().where('email', '=', 'test@example.com').executeTakeFirst()
);

// Run migration
await migration.up(db);

const after = await measureQueryTime(() =>
  db.selectFrom('users').selectAll().where('email', '=', 'test@example.com').executeTakeFirst()
);

console.log(`Performance impact: ${after - before}ms`);
```

Staging dry-run

```typescript
// Dry-run mode that logs without executing
class DryRunMigration implements Migration {
  constructor(private migration: Migration) {}

  async up(db: Kysely<any>): Promise<void> {
    const sql = this.getSQL(this.migration);
    console.log('[DRY RUN] Would execute:', sql);
  }

  private getSQL(migration: Migration): string {
    // Extract the SQL from the wrapped migration; implementation depends on the migration library
    throw new Error('getSQL not implemented');
  }
}
```

Migration checklist

Before deployment

  • [ ] Migration scripts have down() functions
  • [ ] Scripts tested on staging with production-like data volume
  • [ ] Estimated execution time documented
  • [ ] Rollback plan documented
  • [ ] Database backups verified (PITR available)
  • [ ] Application code supports both old and new schema during transition

During deployment

  • [ ] Execute migrations in transaction where possible
  • [ ] Use advisory locks to prevent concurrent execution
  • [ ] Monitor database connection pool
  • [ ] Watch for long-running queries
  • [ ] Verify migration checksums

After deployment

  • [ ] Run data consistency checks
  • [ ] Verify application health
  • [ ] Monitor query performance
  • [ ] Check error logs for schema-related issues
  • [ ] Document any deviations from plan

Connection pool considerations

During migrations, connection pool behavior affects execution:

```typescript
// Configure connection pool for migration window
import { Pool } from 'pg';

const pool = new Pool({
  host: process.env.DB_HOST,
  port: 5432,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  // Lower max connections during migration
  max: 10,
  // Longer timeout when waiting to acquire a connection
  connectionTimeoutMillis: 30000,
  // Allow more time for migration queries (milliseconds)
  statement_timeout: 300_000,
});
```

PgBouncer considerations:

  • In transaction-pooling mode, PgBouncer shares server connections across clients, so session-level state (advisory locks, SET statements, prepared statements) may not behave the way a migration expects
  • Consider bypassing PgBouncer and connecting directly to the database for the migration run (see the sketch below)
  • Ensure the PgBouncer pool size still accommodates the migration workload
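
A minimal sketch of a direct migration connection, assuming node-postgres and a hypothetical DB_DIRECT_HOST that points at the database rather than the PgBouncer endpoint:

```typescript
import { Client } from 'pg';

async function runMigrationsDirect(): Promise<void> {
  // Connect straight to PostgreSQL for the migration window, bypassing PgBouncer
  const client = new Client({
    host: process.env.DB_DIRECT_HOST,   // hypothetical: the real database host
    port: 5432,
    database: process.env.DB_NAME,
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    statement_timeout: 300_000,         // milliseconds; allow long-running DDL
  });

  await client.connect();
  try {
    // Hand the dedicated connection to whatever executes the migrations
    await runMigrations(client);        // illustrative entry point
  } finally {
    await client.end();
  }
}
```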

Conclusion

Database schema migration success comes from treating migrations as first-class artifacts: versioned, tested, and deployed with deliberate coordination between schema and application code.

The mature organization doesn't ask "how fast can we change the database?" but "how can we change the database without causing an incident?"

Practical closing question: When was the last time you successfully rolled back a production migration, and what would have changed if you couldn't?
