High availability¶
A production HA topology is three Ampora instances behind a sticky-session reverse proxy, sharing one PostgreSQL cluster, with the dispatch backplane configured. This page wires that up.
Prerequisites¶
- A reachable PostgreSQL with the schema migrated.
- A reverse proxy that can do cookie-based session affinity for the Blazor SignalR circuit — Nginx Ingress, Traefik, AWS ALB, Azure Application Gateway, GCP HTTPS LB all support this.
- A configured backplane: either Postgres (no extra service) or Redis (an extra service, Redis 6+).
Replica count¶
Two is the minimum for HA — one replica means a deploy or a node loss is downtime. Three is the recommended baseline. Beyond three, scale based on load: a single instance comfortably handles ~ 1000 concurrent OpAMP WebSockets.
The base deployment.yaml ships with replicas: 2. The production overlay typically bumps it:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ampora-web
spec:
  replicas: 3
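If the overlays are built with Kustomize (an assumption; this page only speaks of a "base" and an "overlay"), the same bump can be sketched with the built-in replicas field instead of a patch. The ../../base path is a hypothetical directory layout.

```yaml
# overlays/production/kustomization.yaml (hypothetical location)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base          # assumed path to the base manifests
replicas:
  - name: ampora-web    # Deployment name from the base
    count: 3
```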
Sticky sessions¶
Blazor Server requires session affinity because the SignalR circuit is stateful. The OpAMP WebSocket also benefits — agents that reconnect to the same instance avoid an immediate session-ownership re-acquire.
Nginx Ingress¶
metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/affinity-mode: "persistent"
    nginx.ingress.kubernetes.io/session-cookie-name: "ampora-affinity"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "86400"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
Traefik (Kubernetes CRDs)¶
# Stickiness is set on the service reference itself; no Middleware is involved.
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: ampora
spec:
  entryPoints: [websecure]
  routes:
    - match: Host(`ampora.acme.io`)
      kind: Rule
      services:
        - name: ampora-web
          port: 8080
          sticky:
            cookie:
              name: ampora-affinity
              secure: true
              httpOnly: true
              sameSite: lax
Azure Application Gateway¶
Set the backend HTTP setting to Cookie-based affinity: Enabled. Application Gateway injects the ApplicationGatewayAffinity cookie.
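If the gateway is managed from Kubernetes through the Application Gateway Ingress Controller (an assumption; the portal setting above achieves the same result), a minimal sketch of the equivalent Ingress annotation looks like this:

```yaml
metadata:
  annotations:
    # Asks AGIC to enable cookie-based affinity on the generated backend HTTP setting.
    appgw.ingress.kubernetes.io/cookie-based-affinity: "true"
```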
AWS Application Load Balancer¶
In the target group's attributes, enable Stickiness with type lb_cookie and a duration of at least 24 hours.
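If the ALB is provisioned by the AWS Load Balancer Controller (an assumption; the console or IaC route works equally well), the same target group attributes can be sketched as Ingress annotations. The 86400-second duration is simply 24 hours expressed in seconds.

```yaml
metadata:
  annotations:
    # Register pods directly so the cookie pins a client to a pod, not to a node.
    alb.ingress.kubernetes.io/target-type: ip
    # lb_cookie stickiness for 24 hours.
    alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.type=lb_cookie,stickiness.lb_cookie.duration_seconds=86400
```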
Dispatch backplane¶
The default in-process backplane silently no-ops on multi-instance deployments: operator-side actions still work, but cross-instance dispatch fails with "no live session" warnings whenever the targeted agent's session sits on a different instance.
Pick Postgres (simplest):
Or Redis (when you already run Redis, or want to keep notification load off Postgres):
Dispatch__Backplane=Redis
Dispatch__RedisConnectionString=redis://redis.acme.svc.cluster.local:6379,abortConnect=false
See Dispatch backplane for the mechanics.
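In a Kubernetes deployment these settings would typically be passed as container environment variables. The fragment below is a sketch for the Redis variant; the container name, Secret name, and key are hypothetical, and sourcing the connection string from a Secret is a suggestion rather than something this page prescribes.

```yaml
# Patch fragment for the ampora-web Deployment.
spec:
  template:
    spec:
      containers:
        - name: ampora-web                    # assumed container name
          env:
            - name: Dispatch__Backplane
              value: "Redis"
            - name: Dispatch__RedisConnectionString
              valueFrom:
                secretKeyRef:
                  name: ampora-redis          # hypothetical Secret
                  key: connection-string      # hypothetical key
```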
PodDisruptionBudget¶
The base pdb.yaml keeps at least one instance available during voluntary disruption (node drain, cluster upgrade):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ampora-web
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ampora
      app.kubernetes.io/component: web
Override to minAvailable: 2 when you have ≥ 3 replicas and want higher tolerance.
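A sketch of that override as a strategic-merge patch, assuming the overlay patches the base PodDisruptionBudget by name. With three replicas, this lets a drain evict only one pod at a time.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ampora-web   # must match the base PDB so the patch merges into it
spec:
  minAvailable: 2    # with 3 replicas, at most one voluntary eviction at once
```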
Topology spread¶
The base deployment also pins zone- and node-spread constraints:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
ScheduleAnyway is intentional: when a constraint cannot be satisfied (for example, a zone or node has no spare capacity), the scheduler must still place the pod. Strict spreading would trade availability for theoretical neatness.
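The fragment above omits the labelSelector. In Kubernetes, a spread constraint only counts pods that match its labelSelector, so a complete constraint would look roughly like the sketch below. The labels are borrowed from the PodDisruptionBudget on this page, on the assumption that the Deployment's pods carry the same ones.

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:                          # assumed to match the pod template labels
        app.kubernetes.io/name: ampora
        app.kubernetes.io/component: web
```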
Graceful drain¶
Two settings cooperate to drain WebSockets cleanly on shutdown:
- terminationGracePeriodSeconds: 60 on the Pod.
- lifecycle.preStop runs sleep 15 so the readiness probe has time to flip to fail and the load balancer to stop sending new connections before the process gets SIGTERM.
The Ampora process itself responds to SIGTERM by closing every WebSocket with a clean OpAMP ServerErrorResponse of type ServiceShuttingDown. Agents reconnect on backoff.
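Put together, the two drain settings would sit in the pod template roughly as follows. This is a sketch: the container name is assumed, and the exec form requires a sleep binary in the image.

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60     # total budget: preStop sleep plus WebSocket close
      containers:
        - name: ampora-web                  # assumed container name
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]    # SIGTERM is sent only after this returns
```

The 15-second sleep counts against the 60-second grace period, which leaves about 45 seconds for the process to close its WebSockets before it is killed.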
Verifying HA¶
Three quick checks:
- Multiple instances active?
- Backplane wired? From any pod: The probe response includes a backplane key with Healthy/Degraded and the configured backend.
- Cross-instance dispatch? Connect an agent so its session lands on pod #1, then push a config change from a browser session pinned to pod #2. The agent must apply it within ~2 seconds.
If any of those fails, see Troubleshooting → Rollouts stall.