Disaster recovery
This is the worst-case runbook. The data centre is gone, the cluster is gone, and you have your backups, your secret manager, and a fresh environment. Read it once before you ever need it.
Targets
| Metric | Recommended target | What sets the bound |
|---|---|---|
| RTO (recovery time) | ≤ 4 hours | Postgres restore time + Kubernetes / VM bring-up |
| RPO (data loss window) | ≤ 15 minutes | WAL archive cadence on PostgreSQL |
These are recommendations; tune them to your business requirements. A SaaS-style multi-tenant deployment typically reaches the 1-hour RTO / 5-minute RPO band by enabling continuous WAL archiving and running a hot DR site.
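One way to keep the RPO target honest is to alert when the newest archived WAL segment is older than the target. The following is a minimal sketch, not part of the product: the function name `rpo_within_target` and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper: compare the age of the newest archived WAL segment
# against an RPO target. Times are Unix epoch seconds.
rpo_within_target() {
  last_archive_epoch=$1   # when the newest WAL segment reached the archive
  now_epoch=$2            # current time, e.g. $(date +%s)
  rpo_seconds=$3          # target, e.g. 900 for 15 minutes

  age=$((now_epoch - last_archive_epoch))
  if [ "$age" -le "$rpo_seconds" ]; then
    echo "ok: newest WAL archive is ${age}s old (target ${rpo_seconds}s)"
  else
    echo "breach: newest WAL archive is ${age}s old (target ${rpo_seconds}s)"
    return 1
  fi
}
```

On a self-managed PostgreSQL primary, one possible source for the first argument is `SELECT extract(epoch FROM last_archived_time)::bigint FROM pg_stat_archiver;`. Managed offerings usually expose an equivalent metric instead.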
What you need on hand
- Most recent Postgres backup in off-cluster object storage.
- Master encryption key from your secret manager.
- Server TLS material (or the means to re-issue, e.g. cert-manager + ACME).
- The deployment manifests — they are in Git, so this is implicit.
- The container image registry (or a copy of the image you were running).
If any of these is missing, focus on getting it back before continuing.
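The checklist above can be turned into a quick pre-flight script that runs one probe per prerequisite. This is a sketch under assumptions: `check_item` is a hypothetical helper, and the probe commands in the comments are placeholders to replace with whatever proves each item is reachable in your environment.

```shell
# Hypothetical pre-flight helper: run one probe per prerequisite and
# count the ones that fail.
missing=0

check_item() {
  label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "ok       $label"
  else
    echo "MISSING  $label"
    missing=$((missing + 1))
  fi
}

# Example probes (placeholders; substitute your own paths and names):
#   check_item "master key" aws secretsmanager describe-secret --secret-id ampora-prod/master-key
#   check_item "TLS certificate" test -s ./tls/tls.crt
#   check_item "manifests" test -d deploy/kustomize/overlays/acme-prod
#   [ "$missing" -eq 0 ] || echo "$missing prerequisite(s) missing"
```

A probe that only confirms a file or secret exists does not prove it is usable; the DR drill remains the real test.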
Procedure
1. Provision the new environment
Either:
- a fresh Kubernetes cluster (with cert-manager and your Ingress controller), or
- a fresh VM with the binary install prerequisites.
Your existing IaC (Terraform, Pulumi, Crossplane) should be able to produce this on demand. If it cannot, that is the first DR-drill bug to fix.
2. Provision Postgres and restore
Use your managed offering's point-in-time recovery (PITR), or a pgBackRest time-targeted restore (--type=time with --target):
pgbackrest --stanza=ampora restore \
  --type=time \
  --target="2026-04-25 14:30:00+00" \
  --target-action=promote
Verify the restore by counting the rows in a small table. A non-zero count for the tenants you expect means the restore was applied. A zero count means the recovery target is too early.
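A count check of this kind could be scripted as below. This is a sketch only: the table name `tenants`, the `DATABASE_URL` variable, and the `verify_restore` function are assumptions, so substitute a small table that you know was populated before the recovery target.

```shell
# Sketch only: "tenants" is a hypothetical table name and DATABASE_URL is an
# assumed connection string for the restored instance.
verify_restore() {
  # -A: unaligned output, -t: tuples only, so stdout is just the number.
  count=$(psql "$DATABASE_URL" -At -c "SELECT count(*) FROM tenants;") || return 2
  if [ "$count" -gt 0 ]; then
    echo "restore applied: $count rows"
  else
    echo "zero rows: restore target is probably too early"
    return 1
  fi
}
```

Exit status 2 signals that the query itself failed (for example, the instance is still replaying WAL), which is a different problem from an empty table.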
3. Recover the master key
aws secretsmanager get-secret-value \
  --secret-id ampora-prod/master-key \
  --query SecretString --output text
Or the equivalent in your secret manager. This must be the same key that encrypted the at-rest material in the database you just restored; a different key yields garbage on every encrypted column.
Inject the value into the new cluster's ampora-secrets. Never check this into Git.
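One way to move the key from the secret manager into the cluster without writing it to disk is sketched below. The data key name `master-key` inside the Secret and the function name `inject_master_key` are assumptions; match whatever key name your manifests expect. The namespace must already exist when this runs.

```shell
# Sketch only: the data key name "master-key" is an assumption.
inject_master_key() {
  key=$(aws secretsmanager get-secret-value \
          --secret-id ampora-prod/master-key \
          --query SecretString --output text) || return 1
  [ -n "$key" ] || { echo "empty master key" >&2; return 1; }

  # Render the Secret client-side and apply it, so re-runs update in place
  # instead of failing because the Secret already exists.
  kubectl -n ampora create secret generic ampora-secrets \
    --from-literal=master-key="$key" \
    --dry-run=client -o yaml | kubectl -n ampora apply -f -
}
```

Passing the value with `--from-literal` exposes it briefly in the local process list, so run this from a trusted workstation or use your secret manager's own Kubernetes integration if you have one.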
4. Recover server TLS
If you use cert-manager + ACME:
- ensure the new Ingress hostname has an A/AAAA record pointing at the new cluster's public IP;
- cert-manager issues automatically once the HTTP-01 challenge resolves.
If you use a corporate CA: re-issue the leaf for the same hostname and load it as the Ingress TLS Secret.
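Both TLS paths can be sketched with kubectl. The names here are assumptions: `ampora-tls` stands in for both the Ingress TLS Secret and the cert-manager Certificate resource, and the function names are hypothetical.

```shell
# Corporate CA path: load a re-issued leaf certificate as the Ingress TLS Secret.
# Sketch only: the Secret name "ampora-tls" is an assumption.
load_ingress_tls() {
  cert_file=$1
  key_file=$2
  kubectl -n ampora create secret tls ampora-tls \
    --cert="$cert_file" --key="$key_file" \
    --dry-run=client -o yaml | kubectl -n ampora apply -f -
}

# cert-manager path: block until the Certificate reports Ready, which only
# happens after DNS points at the new cluster and the challenge succeeds.
# "ampora-tls" as the Certificate name is likewise an assumption.
wait_for_certificate() {
  kubectl -n ampora wait --for=condition=Ready certificate/ampora-tls --timeout=10m
}
```

If `wait_for_certificate` times out, check the DNS record first; a stale A/AAAA record is the most common reason an HTTP-01 challenge never resolves.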
5. Apply the manifests
kubectl apply -k deploy/kustomize/overlays/acme-prod
kubectl -n ampora rollout status deploy/ampora-web
6. Validate
In order:
- /health/live returns 200.
- /health/ready returns 200 with all subsystems healthy.
- Log in via OIDC.
- The Dashboard shows familiar numbers — agents, configurations, audit events.
- An agent reconnects. Its mTLS cert is signed by the persisted CA that lives in the restored database, so validation works as soon as the cert chain is loaded.
- A small rollout (e.g. re-publish an existing config and target a single test agent) completes end-to-end.
If step 5 or step 6 fails, the master key is the prime suspect.
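The two health checks at the top of the validation list are easy to script. In this sketch `BASE_URL` is an assumed variable holding the public hostname, and the function names are hypothetical. It checks only the status code, so the subsystem detail in the /health/ready body still needs a human look.

```shell
# Sketch only: BASE_URL is an assumed variable, e.g. https://ampora.example.com
check_endpoint() {
  path=$1
  # -w prints only the status code; a connection failure is reported as 000.
  status=$(curl -sS -o /dev/null -w '%{http_code}' "${BASE_URL}${path}") || status=000
  if [ "$status" = "200" ]; then
    echo "ok   $path"
  else
    echo "FAIL $path (HTTP $status)"
    return 1
  fi
}

# Liveness first, then readiness, stopping at the first failure.
validate_health() {
  check_endpoint /health/live && check_endpoint /health/ready
}
```

The remaining checks (OIDC login, dashboard numbers, agent reconnect, test rollout) involve state that a status code cannot confirm, so keep them manual.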
Failure during recovery
| Symptom | Likely cause | Fix |
|---|---|---|
| App starts, but every encrypted-at-rest field reads as gibberish | Wrong master key | Re-fetch from the secret manager; verify against the audit-log entry of the last KeyProtection.MasterKey rotation |
| App starts; agents cannot connect via mTLS | Persisted CA private key was not in the backup | Bootstrap a new CA; push fresh client certs to all agents — see Certificate rotation |
| Audit log is empty | Restored from a snapshot older than the audit retention sweep | Restore from a newer backup; loss window equals time since the last good snapshot |
| Federation peers refuse our calls | Federation client cert is missing | Re-issue from the primary tenant's UI on this side; coordinate with the other side to update the pinned thumbprint |
Federated DR
If you operate a federated topology, partial DR is often easier:
- A region outage can be served by federation peers without restoring anything. Operators see the unaffected regions' fleet and the affected region's status as "Unreachable".
- Cross-cluster handover (ADR-051) lets affected agents migrate to a healthy peer for the duration of the outage. Identity is preserved; configs follow.
DR drills
A DR drill is the only way to know that any of this works. Schedule quarterly drills:
- Restore a recent backup into a sandbox cluster.
- Run through steps 2-6 of the procedure above.
- Time each step. Record the total in your runbook.
- Tear down the sandbox.
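Timing each step is easier when the drill commands are wrapped in a small helper. The sketch below is hypothetical: `time_step` and the `DRILL_LOG` variable are illustration names, and the log format (label, seconds, exit status, tab-separated) is an arbitrary choice.

```shell
# Hypothetical drill helper: run a step, then append its label, wall-clock
# duration in seconds, and exit status to a log file.
DRILL_LOG=${DRILL_LOG:-dr-drill-timings.tsv}

time_step() {
  label=$1; shift
  start=$(date +%s)
  rc=0
  "$@" || rc=$?
  end=$(date +%s)
  printf '%s\t%s\t%s\n' "$label" "$((end - start))" "$rc" >> "$DRILL_LOG"
  return "$rc"
}

# Example:
#   time_step "5-apply" kubectl apply -k deploy/kustomize/overlays/acme-prod
```

Summing the second column gives the measured recovery time to compare against the RTO target.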
The two failure modes that drills reliably catch:
- The master key in the secret manager has been silently rotated and the old one is no longer retrievable.
- The Postgres backup is technically valid but corrupt at the application layer (e.g. ran out of disk mid-archive). A test restore catches this; a "did the dump file land?" check does not.