High availability

A production HA topology is three Ampora instances behind a sticky-session reverse proxy, sharing one PostgreSQL cluster, with the dispatch backplane configured. This page wires that up.

Prerequisites

  • A reachable PostgreSQL with the schema migrated.
  • A reverse proxy that supports cookie-based session affinity for the Blazor SignalR circuit. Nginx Ingress, Traefik, AWS ALB, Azure Application Gateway, and GCP HTTPS LB all do.
  • A configured backplane: either Postgres (no extra service) or Redis (requires a separate Redis 6+ instance).

Replica count

Two replicas is the minimum for HA: with a single replica, every deploy or node loss means downtime. Three is the recommended baseline. Beyond three, scale with load: a single instance comfortably handles ~1000 concurrent OpAMP WebSockets.

The base deployment.yaml ships with replicas: 2. The production overlay typically bumps it:

deploy/kustomize/overlays/acme-prod/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ampora-web
spec:
  replicas: 3
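
For the patch to take effect, the overlay's kustomization.yaml has to reference it. A minimal sketch, assuming the overlay sits two levels below a base directory (the ../../base path is an assumption about the repository layout, not something stated on this page):

```yaml
# deploy/kustomize/overlays/acme-prod/kustomization.yaml (hypothetical contents)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                  # assumed location of the base manifests
patches:
  - path: deployment-patch.yaml # the replica patch shown above
```

Kustomize matches the patch to the base Deployment by kind and name, so metadata.name in the patch must stay ampora-web.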

Sticky sessions

Blazor Server requires session affinity because the SignalR circuit is stateful. The OpAMP WebSocket benefits too: an agent that reconnects to the same instance avoids an immediate session-ownership re-acquire.

Nginx Ingress

metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/affinity-mode: "persistent"
    nginx.ingress.kubernetes.io/session-cookie-name: "ampora-affinity"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "86400"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
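
The annotations attach to the Ingress object that fronts the service. A minimal sketch of where they sit, assuming the same host, service name, and port as the Traefik example on this page and an ingress class named nginx (the Ingress name and class are assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ampora                  # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "ampora-affinity"
    # ...remaining annotations as listed above
spec:
  ingressClassName: nginx       # assumption: adjust to your controller's class
  rules:
    - host: ampora.acme.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ampora-web
                port:
                  number: 8080
```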

Traefik (Kubernetes CRDs)

# Stickiness is configured on the service entry below; no Middleware is needed.
# Traefik v3 removed this API group; use traefik.io/v1alpha1 there (also available from v2.10).
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: ampora
spec:
  entryPoints: [websecure]
  routes:
    - match: Host(`ampora.acme.io`)
      kind: Rule
      services:
        - name: ampora-web
          port: 8080
          sticky:
            cookie:
              name: ampora-affinity
              secure: true
              httpOnly: true
              sameSite: lax

Azure Application Gateway

In the backend HTTP settings, set Cookie-based affinity to Enabled. Application Gateway then injects the ApplicationGatewayAffinity cookie.
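
If the gateway is managed from Kubernetes through the Application Gateway Ingress Controller (an assumption; this page does not say how the gateway is provisioned), the same setting can be expressed as an Ingress annotation. A sketch:

```yaml
metadata:
  annotations:
    # AGIC annotation that turns on cookie-based affinity for this Ingress's backend.
    appgw.ingress.kubernetes.io/cookie-based-affinity: "true"
```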

AWS Application Load Balancer

In the target group's attributes, enable Stickiness with type lb_cookie and a duration of at least 24 hours.
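
When the ALB is created by the AWS Load Balancer Controller (an assumption; the attributes can equally be set in the console or with infrastructure-as-code), the target group attributes can be declared on the Ingress. A sketch, with 86400 seconds standing in for the 24-hour duration:

```yaml
metadata:
  annotations:
    # The controller documents sticky sessions as requiring IP targets.
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.type=lb_cookie,stickiness.lb_cookie.duration_seconds=86400
    # Optional and an assumption: raise the idle timeout so quiet WebSockets
    # are not cut at the 60-second default, mirroring the Nginx timeouts above.
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=3600
```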

Dispatch backplane

The default in-process backplane silently does nothing across instances. Operator-side actions still work, but cross-instance dispatch fails with "no live session" warnings whenever the targeted agent's session sits on a different instance.

Pick Postgres (simplest):

Dispatch__Backplane=Postgres

Or Redis (when you already run Redis, or want to keep notification load off Postgres):

Dispatch__Backplane=Redis
Dispatch__RedisConnectionString=redis.acme.svc.cluster.local:6379,abortConnect=false

See Dispatch backplane for the mechanics.
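
On Kubernetes these settings are ordinary container environment variables. A sketch of an overlay patch that selects the Postgres backplane, assuming the container in the base Deployment is named ampora-web (the container name is an assumption; a strategic merge patch matches containers by name, so it must equal the one in your base):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ampora-web
spec:
  template:
    spec:
      containers:
        - name: ampora-web            # assumption: must match the base container name
          env:
            - name: Dispatch__Backplane
              value: "Postgres"
```

For Redis, set the value to "Redis" and add a second entry for Dispatch__RedisConnectionString, ideally sourced from a Secret rather than written inline.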

PodDisruptionBudget

The base pdb.yaml keeps at least one instance available during voluntary disruption (node drain, cluster upgrade):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ampora-web
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ampora
      app.kubernetes.io/component: web

With three or more replicas, override to minAvailable: 2 so that at least two instances stay up during a voluntary disruption.
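
As an overlay patch, that override could look like the sketch below (the patch is illustrative; only the name and the minAvailable field come from this page):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ampora-web
spec:
  minAvailable: 2   # with replicas: 3, one pod may be evicted at a time
```

Do not apply this with replicas: 2. No pod could then be evicted voluntarily, and node drains would block until the budget is relaxed.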

Topology spread

The base deployment also pins zone- and node-spread constraints:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway

ScheduleAnyway is intentional: if a node or zone is full or unavailable, the scheduler must still place the pod somewhere. Strict spreading (DoNotSchedule) would leave the pod Pending instead, trading availability for theoretical neatness.
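
A spread constraint counts only the pods matched by its labelSelector, which the excerpt above omits (possibly for brevity). A fuller sketch of one entry, assuming the pods carry the same labels that the PodDisruptionBudget on this page selects on:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    # Assumption: labels copied from the PodDisruptionBudget selector.
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: ampora
        app.kubernetes.io/component: web
```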

Graceful drain

Two settings cooperate to drain WebSockets cleanly on shutdown:

  • terminationGracePeriodSeconds: 60 on the Pod.
  • lifecycle.preStop runs sleep 15, which gives the readiness probe time to start failing and the load balancer time to stop sending new connections before the process receives SIGTERM.

The Ampora process itself responds to SIGTERM by closing every WebSocket with a clean OpAMP ServerErrorResponse of type ServiceShuttingDown. Agents reconnect on backoff.
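
In the Deployment's pod template, the two settings could be written roughly as follows. This is a sketch: the container name is an assumption, and the exec form requires a sleep binary in the image.

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: ampora-web            # assumption: use your container's name
          lifecycle:
            preStop:
              exec:
                # Runs before SIGTERM is sent to the process.
                command: ["sleep", "15"]
```

The preStop delay counts against the grace period, so the process has roughly 45 of the 60 seconds left to close WebSockets after it receives SIGTERM.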

Verifying HA

Three quick checks:

  1. Multiple instances active?
    kubectl -n ampora get pods -l app.kubernetes.io/name=ampora
    
  2. Backplane wired? From any pod:
    curl -s localhost:8080/health/ready | jq
    
    The probe response includes a backplane key with Healthy / Degraded and the configured backend.
  3. Cross-instance dispatch? Connect an agent so its session lands on pod #1, then push a config change from a browser session pinned to pod #2. The agent should apply it within ~2 seconds.

If any of those fails, see Troubleshooting → Rollouts stall.