Disaster recovery¶

This is the worst-case runbook. The data centre is gone, the cluster is gone, and all you have left are your backups, your secret manager, and a fresh environment. Read it once before you ever need it.

Targets¶

Metric	Recommended target	What sets the bound
RTO (recovery time)	≤ 4 hours	Postgres restore time + Kubernetes / VM bring-up
RPO (data loss window)	≤ 15 minutes	WAL archive cadence on PostgreSQL

These are recommendations. Tune them to your business requirements. A SaaS-style multi-tenant deployment typically lives in the 1-hour RTO / 5-minute RPO band, which it reaches by enabling continuous WAL archiving and running a hot DR site.
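Because the RPO is bounded by how often WAL reaches the archive, it can help to measure that lag on the live primary. The sketch below assumes `psql` can connect through the usual `PG*` environment variables; the function names and `RPO_TARGET_SECONDS` are hypothetical. `pg_stat_archiver` is the standard PostgreSQL statistics view. Note that `last_archived_time` only advances when a segment is archived, so an idle database without `archive_timeout` can show a large lag without any data being at risk.

```shell
# Hypothetical helpers: compare WAL archive lag against the RPO target.
RPO_TARGET_SECONDS="${RPO_TARGET_SECONDS:-900}"   # 15 minutes, per the table above

# Seconds since the last successfully archived WAL segment.
# Prints -1 if nothing has been archived yet.
wal_archive_lag_seconds() {
  psql -At -c "SELECT COALESCE(EXTRACT(EPOCH FROM now() - last_archived_time)::int, -1) FROM pg_stat_archiver;"
}

# Succeeds when the lag ($1) is known and within the target ($2, optional).
within_rpo() {
  lag="$1"
  target="${2:-$RPO_TARGET_SECONDS}"
  [ "$lag" -ge 0 ] && [ "$lag" -le "$target" ]
}
```

A monitoring job could then run `within_rpo "$(wal_archive_lag_seconds)" || echo "WAL archiving is behind the RPO target" >&2`.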

What you need on hand¶

Most recent Postgres backup in off-cluster object storage.
Master encryption key from your secret manager.
Server TLS material (or the means to re-issue, e.g. cert-manager + ACME).
The deployment manifests — they are in Git, so this is implicit.
The container image registry (or a copy of the image you were running).

If any of these is missing, focus on getting it back before continuing.

Procedure¶

1. Provision the new environment¶

Either:

a fresh Kubernetes cluster (with cert-manager and your Ingress controller), or
a fresh VM with the binary install prerequisites.

Your existing IaC (Terraform, Pulumi, Crossplane) should be able to produce this on demand. If it cannot, that is the first DR-drill bug to fix.

2. Provision Postgres and restore¶

Use the managed offering's PITR, or run a pgBackRest time-based restore (--type=time with --target):

pgbackrest --stanza=ampora restore \
  --type=time \
  --target="2026-04-25 14:30:00+00" \
  --target-action=promote

Verify the restore by counting a small table:

psql -h newdb.acme.io -U ampora -d ampora -c "SELECT count(*) FROM tenants;"

A count that matches the number of tenants you expect means the restore was applied. A zero or unexpectedly low count usually means you restored to a point in time that is too early.
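The count alone does not tell you whether the database finished recovery and was promoted. A minimal sketch that combines both checks follows; `restore_looks_sane` and the minimum tenant count are hypothetical, while `pg_is_in_recovery()` is a standard PostgreSQL function that `psql -At` prints as `t` or `f`.

```shell
# Hypothetical helper: judge a restore from two query results.
#   $1 = output of: psql -At -c "SELECT pg_is_in_recovery();"
#   $2 = output of: psql -At -c "SELECT count(*) FROM tenants;"
#   $3 = smallest tenant count you would accept (an assumption you supply)
restore_looks_sane() {
  in_recovery="$1"
  tenant_count="$2"
  min_tenants="$3"
  if [ "$in_recovery" != "f" ]; then
    echo "database is still in recovery; promotion has not completed" >&2
    return 1
  fi
  if [ "$tenant_count" -lt "$min_tenants" ]; then
    echo "only $tenant_count tenants found; restore target may be too early" >&2
    return 1
  fi
  echo "restore looks sane"
}
```

For example, `restore_looks_sane "$(psql -h newdb.acme.io -U ampora -d ampora -At -c 'SELECT pg_is_in_recovery();')" "$(psql -h newdb.acme.io -U ampora -d ampora -At -c 'SELECT count(*) FROM tenants;')" 1`.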

3. Recover the master key¶

aws secretsmanager get-secret-value \
  --secret-id ampora-prod/master-key \
  --query SecretString --output text

Or the equivalent in your secret manager. This is the same key that encrypted at-rest material in the database you just restored — using a different key gives you garbage on every encrypted column.

Inject the value into the new cluster's ampora-secrets. Never check this into Git.
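One way to move the key from the secret manager into the cluster without writing it to disk or to shell history is to pipe it straight into `kubectl`. This is a sketch under assumptions: the `ampora` namespace is taken from the rollout command in step 5, and the data key name `master-key` inside the Secret is hypothetical, so use whatever key your manifests reference. `kubectl create secret` fails if `ampora-secrets` already exists, which is the safe behaviour on a fresh cluster.

```shell
# Hypothetical helper: copy the master key from AWS Secrets Manager into
# the ampora-secrets Secret without touching the filesystem.
#   $1 = secret id (optional; defaults to the id used above)
inject_master_key() {
  secret_id="${1:-ampora-prod/master-key}"
  # Command substitution strips the trailing newline that --output text adds.
  key="$(aws secretsmanager get-secret-value \
    --secret-id "$secret_id" \
    --query SecretString --output text)" || return 1
  if [ -z "$key" ]; then
    echo "master key is empty; refusing to create the Secret" >&2
    return 1
  fi
  # "master-key" is an assumed data key name, not one confirmed by the manifests.
  printf '%s' "$key" | kubectl -n ampora create secret generic ampora-secrets \
    --from-file=master-key=/dev/stdin
  status=$?
  unset key
  return "$status"
}
```

Reading the value from stdin keeps it out of the process argument list, which `--from-literal` would not.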

4. Recover server TLS¶

If you use cert-manager + ACME:

ensure the new Ingress hostname has an A/AAAA record pointing at the new cluster's public IP;
cert-manager issues automatically once the HTTP-01 challenge resolves.

If you use a corporate CA: re-issue the leaf for the same hostname and load it as the Ingress TLS Secret.
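Either path can be confirmed from the command line. The sketch below assumes the `ampora` namespace from step 5; the Certificate resource name is hypothetical and only exists once the manifests that define it have been applied. `kubectl wait` and the `openssl s_client` / `openssl x509` invocations are standard.

```shell
# Wait for cert-manager to mark a Certificate as Ready.
#   $1 = Certificate resource name (hypothetical; use the name in your manifests)
#   $2 = timeout (optional, default 10m)
wait_for_certificate() {
  kubectl -n ampora wait --for=condition=Ready "certificate/$1" --timeout="${2:-10m}"
}

# Print the subject, issuer and expiry of the certificate actually being served.
#   $1 = public hostname of the Ingress
show_served_certificate() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -subject -issuer -enddate
}
```

Checking the issuer line is a quick way to tell a freshly issued leaf from an Ingress controller's default self-signed certificate.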

5. Apply the manifests¶

kubectl apply -k deploy/kustomize/overlays/acme-prod
kubectl -n ampora rollout status deploy/ampora-web

6. Validate¶

In order:

/health/live returns 200.
/health/ready returns 200 with all subsystems healthy.
Log in via OIDC.
The Dashboard shows familiar numbers — agents, configurations, audit events.
An agent reconnects. Its mTLS cert is signed by the persisted CA that lives in the restored database, so validation works as soon as the cert chain is loaded.
A small rollout (e.g. re-publish an existing config and target a single test agent) completes end-to-end.

If step 5 or 6 fails, the master key is the prime suspect.
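The first two validation checks are easy to script so that a drill runs them the same way every time. A minimal sketch, assuming `curl` is available; `BASE_URL` and the function names are hypothetical. It checks status codes only and does not inspect the subsystem detail in the `/health/ready` response body.

```shell
# Base URL of the restored deployment (hypothetical; substitute your hostname).
BASE_URL="${BASE_URL:-https://ampora.example.com}"

# Print the HTTP status code for a path.
# curl reports 000 when no HTTP response was received.
http_status() {
  curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$BASE_URL$1" 2>/dev/null
}

# Check the health endpoints in the documented order, stopping at the first failure.
check_health() {
  for path in /health/live /health/ready; do
    code="$(http_status "$path")"
    if [ "$code" != "200" ]; then
      echo "FAIL $path -> $code"
      return 1
    fi
    echo "OK   $path"
  done
}
```

Run `check_health` after the rollout completes; a non-zero exit status means the remaining manual checks are not worth starting yet.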

Failure during recovery¶

Symptom	Likely cause	Fix
App starts, but every encrypted-at-rest field reads as gibberish	Wrong master key	Re-fetch from the secret manager; verify against the audit-log entry of the last `KeyProtection.MasterKey` rotation
App starts; agents cannot connect via mTLS	Persisted CA private key was not in the backup	Bootstrap a new CA; push fresh client certs to all agents — see Certificate rotation
Audit log is empty	Restored from a snapshot older than the audit retention sweep	Restore from a newer backup; loss window equals time since the last good snapshot
Federation peers refuse our calls	Federation client cert is missing	Re-issue from the primary tenant's UI on this side; coordinate with the other side to update the pinned thumbprint

Federated DR¶

If you operate a federated topology, partial DR is often easier:

A region outage can be served by federation peers without restoring anything. Operators see the unaffected regions' fleet and the affected region's status as "Unreachable".
Cross-cluster handover (ADR-051) lets affected agents migrate to a healthy peer for the duration of the outage. Identity is preserved; configs follow.

DR drills¶

A DR drill is the only way to know that any of this works. Schedule quarterly drills:

Restore a recent backup into a sandbox cluster.
Run through steps 2-6 of the procedure above.
Time each step. Record the total in your runbook.
Tear down the sandbox.
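Timing each step is less error-prone when the shell does it. The wrapper below is a sketch, not part of the product: `timed_step`, `drill_total_seconds` and the log path are hypothetical names. It records one tab-separated line per step and preserves the wrapped command's exit status.

```shell
# Log file for drill timings (hypothetical path).
DRILL_LOG="${DRILL_LOG:-./dr-drill-timings.tsv}"

# Run a command, append "<label><TAB><seconds><TAB><exit status>" to the log,
# and return the command's exit status.
#   $1 = step label, remaining arguments = the command to run
timed_step() {
  label="$1"
  shift
  start="$(date +%s)"
  if "$@"; then status=0; else status=$?; fi
  end="$(date +%s)"
  printf '%s\t%s\t%s\n' "$label" "$((end - start))" "$status" >> "$DRILL_LOG"
  return "$status"
}

# Sum the seconds column to get the total for the drill.
drill_total_seconds() {
  awk -F '\t' '{ total += $2 } END { print total + 0 }' "$DRILL_LOG"
}
```

For example, `timed_step apply-manifests kubectl apply -k deploy/kustomize/overlays/acme-prod`, then compare `drill_total_seconds` against the RTO target of 14400 seconds (4 hours).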

Two failure modes that drills reliably surface:

The master key in the secret manager has been silently rotated and the old one is no longer retrievable.
The Postgres backup is technically valid but corrupt at the application layer (e.g. ran out of disk mid-archive). A test restore catches this; a "did the dump file land?" check does not.
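One way a drill can detect the first failure mode early is to compare a fingerprint of the key that is retrievable today with a fingerprint recorded when the backup was taken. Recording such a fingerprint is a hypothetical practice, not something this product is known to do, and the function names are illustrative. The sketch assumes `sha256sum` from GNU coreutils.

```shell
# Print the SHA-256 fingerprint of a key read from stdin.
# The key itself is never printed or written to disk.
key_fingerprint() {
  sha256sum | cut -d ' ' -f 1
}

# Succeeds when the key on stdin matches the fingerprint recorded at
# backup time ($1). Both must be computed over identical bytes, so strip
# the trailing newline consistently on both sides.
key_matches_backup() {
  [ "$(key_fingerprint)" = "$1" ]
}
```

During a drill this might look like `aws secretsmanager get-secret-value --secret-id ampora-prod/master-key --query SecretString --output text | tr -d '\n' | key_matches_backup "$recorded_fingerprint"`, where `recorded_fingerprint` comes from wherever you chose to store it.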