Infra | advanced

Backup & Disaster Recovery

Implement automated backup strategies and disaster recovery plans to ensure data durability and business continuity.

Also known as: DR, disaster recovery, backup strategy, data backup, business continuity, RTO/RPO

Description

Backup and disaster recovery (DR) encompasses the strategies, processes, and infrastructure needed to protect data against loss and restore services after catastrophic failures. Two key metrics define DR requirements. Recovery Point Objective (RPO) is the maximum acceptable data loss, measured in time (e.g., a 1-hour RPO means you can lose at most the last hour of data). Recovery Time Objective (RTO) is the maximum acceptable time to restore service (e.g., a 4-hour RTO means service must be back within 4 hours).
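To make the two metrics concrete, here is a minimal sketch that checks a backup schedule against them. It assumes purely periodic backups, so the worst-case data loss equals one full backup interval. The function name `meets_objectives` and the sample figures (nightly backups, a 2-hour restore) are illustrative assumptions, not part of any tool.

```python
from datetime import timedelta


def meets_objectives(
    backup_interval: timedelta,
    restore_duration: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> dict:
    """Compare a periodic backup schedule against RPO and RTO targets.

    Assumption: with periodic backups, a failure just before the next
    backup loses one full interval of data, so that is the worst case.
    """
    return {
        "rpo_met": backup_interval <= rpo,
        "rto_met": restore_duration <= rto,
    }


# Nightly backups measured against the 1-hour RPO / 4-hour RTO examples.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    restore_duration=timedelta(hours=2),  # hypothetical measured restore time
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

Nightly backups alone miss a 1-hour RPO by a wide margin, which is why tighter objectives usually call for continuous archiving or replication rather than more frequent dumps.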

Database backup strategies include automated snapshots (RDS automated backups with point-in-time recovery), logical backups (pg_dump for PostgreSQL, producing portable SQL), continuous archiving (PostgreSQL WAL archiving to S3 for point-in-time recovery), and cross-region replication for geographic redundancy. File storage backups use versioned S3 buckets with lifecycle policies and cross-region replication. The 3-2-1 backup rule recommends 3 copies of data, on 2 different media types, with 1 stored offsite.
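The 3-2-1 rule can be expressed as a small check over an inventory of copies. The sketch below is illustrative only: the `BackupCopy` type, the media labels, and the sample inventory are assumptions, and what counts as a distinct "media type" in a cloud setting is a judgment call for each team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str     # hypothetical label, e.g. "block-storage", "object-storage"
    offsite: bool  # True if stored outside the primary site or region


def satisfies_3_2_1(copies: list) -> bool:
    """Return True if there are >= 3 copies on >= 2 media with >= 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


# Hypothetical inventory: the live database counts as one of the copies.
copies = [
    BackupCopy("primary database volume", "block-storage", offsite=False),
    BackupCopy("automated snapshot", "snapshot-storage", offsite=False),
    BackupCopy("pg_dump in replicated bucket", "object-storage", offsite=True),
]
print(satisfies_3_2_1(copies))
```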

A disaster recovery plan must be documented and tested regularly (at least quarterly), and it must include runbooks for common failure scenarios: single instance failure, availability zone outage, region outage, data corruption (requiring point-in-time restore), and accidental deletion. Backups are useless if they cannot be restored, so automated restore testing should verify backup integrity on a schedule. Infrastructure as Code enables rapid environment reconstruction in a different region, while database read replicas in secondary regions can be promoted to primary for fast failover.
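One simple form of automated restore testing is to record a checksum when a backup file is written and compare it after the file is fetched back. The following is a minimal sketch using only the standard library; the helper names `sha256_of` and `verify_restore` are hypothetical. It checks that the backup artifact is byte-identical, which is a narrower guarantee than confirming a restored database is logically complete.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large dumps need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(restored: Path, expected_sha256: str) -> bool:
    """Return True if the restored file matches the checksum taken at backup time."""
    return sha256_of(restored) == expected_sha256
```

A scheduled job could call `verify_restore` on each downloaded backup and alert on `False`. A fuller drill would also load the dump into a test database and compare row counts or application-level checks.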

Prompt Snippet

Configure automated backup and DR for the production stack: enable RDS Multi-AZ and automated backups with 14-day retention and point-in-time recovery, and configure a cross-region read replica in us-west-2 for failover (RPO <5 minutes, RTO <30 minutes). Set up nightly pg_dump logical backups to a versioned S3 bucket with lifecycle rules (Standard for 30 days, Glacier for 1 year, delete after 7 years). Enable S3 Cross-Region Replication for the asset bucket. Create a documented DR runbook covering AZ failure (automatic via Multi-AZ), region failure (promote read replica, update DNS via Route53 failover routing, redeploy via Terraform to secondary region), and data corruption (RDS point-in-time restore procedure). Schedule quarterly DR drills that restore from backup to a test environment and verify data integrity with checksums.
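As a rough illustration of the lifecycle part of the snippet, the sketch below builds a rule in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument. It only constructs the dictionary and does not call AWS. The rule ID and the `pg_dump/` prefix are hypothetical, and the tiering is one reading of the snippet: objects move to Glacier after 30 days and expire after 7 years. The "Glacier for 1 year" step is not modeled separately, because the snippet does not say which tier follows it.

```python
def backup_lifecycle_configuration(prefix: str = "pg_dump/") -> dict:
    """Build an S3 lifecycle configuration for logical backup objects.

    Assumptions: objects live under `prefix`; 7 years is approximated
    as 7 * 365 days.
    """
    return {
        "Rules": [
            {
                "ID": "pg-dump-retention",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Standard for the first 30 days, then Glacier.
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                # Expire current versions after roughly 7 years.
                "Expiration": {"Days": 7 * 365},
            }
        ]
    }


config = backup_lifecycle_configuration()
print(config["Rules"][0]["Expiration"]["Days"])  # 2555
```

In a versioned bucket, expiring the current version only adds a delete marker, so noncurrent versions would need their own expiration rule if the 7-year limit is meant to remove the data entirely.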