Scaling out

Ampora is designed for fleets in the low thousands of agents per deployment, with linear horizontal scaling as you add replicas. This page is a cheat sheet for what to watch and which knob to turn.

Capacity guidance

| Fleet size | Replicas | Notes |
|---|---|---|
| ≤ 200 agents | 1 | Single instance is fine; InProcess backplane |
| 200 – 2 000 | 2 – 3 | HA, sticky sessions, Postgres or Redis backplane |
| 2 000 – 10 000 | 3 – 5 | Watch JSONB write rate; consider read replicas for the dashboard queries |
| > 10 000 | 5+ | Federation across regions; one Ampora per region instead of one big global instance |

Why not "one big instance"? Agent state events (status, health, effective config) are write-heavy on PostgreSQL JSONB. Past 10 000 agents the bottleneck is the database, and scaling Postgres vertically is sub-linear. Federating across regions keeps each Postgres in its comfort zone.
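The capacity table can be expressed as a small lookup, which is handy in provisioning scripts. This is a minimal sketch, not part of Ampora: the `Tier` type and `recommend` function are hypothetical names, and because the table's ranges share their endpoints (200 and 2 000 appear in two rows), the sketch assumes each upper bound is inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    max_agents: float  # inclusive upper bound (assumption: boundaries belong to the smaller tier)
    replicas: str      # replica range as given in the capacity table
    notes: str


# Rows copied from the capacity table, smallest tier first.
TIERS = [
    Tier(200, "1", "Single instance; InProcess backplane"),
    Tier(2_000, "2-3", "HA, sticky sessions, Postgres or Redis backplane"),
    Tier(10_000, "3-5", "Watch JSONB write rate; consider read replicas"),
    Tier(float("inf"), "5+", "Federate: one Ampora per region"),
]


def recommend(agent_count: int) -> Tier:
    """Return the first tier whose upper bound covers agent_count."""
    if agent_count < 0:
        raise ValueError("agent_count must be non-negative")
    for tier in TIERS:
        if agent_count <= tier.max_agents:
            return tier
    raise AssertionError("unreachable: the last tier is unbounded")


if __name__ == "__main__":
    tier = recommend(3_500)
    print(tier.replicas, "-", tier.notes)
```

Treat the result as a starting point; the metrics in the next section tell you whether a given deployment actually needs more or fewer replicas.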

What to monitor

Self-observability metrics that matter most:

| Metric | Warning if … |
|---|---|
| ampora_opamp_active_sessions | Approaches 5 000 per instance |
| ampora_dispatch_local_total vs. _forwarded_total | High forwarded ratio means LB stickiness is broken |
| ampora_dispatch_no_session_total | Non-zero means agents are disconnecting before dispatch arrives |
| ampora_db_query_duration_seconds_bucket (p95) | Above 100 ms: Postgres is the bottleneck |
| ampora_live_bus_events_total | Spikes correlate with chatty fleet events; coalescing should mask most |
| kubelet container memory rss | Approaching the limit: bump resources.limits.memory |

See Self-observability for the full metric set.
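As an illustration of how two of the warnings above might be evaluated, here is a minimal sketch that takes already-scraped values for the active-session gauge and the local and forwarded dispatch counters. The function names and the thresholds `max_forwarded_ratio=0.2` and `session_fraction=0.8` are assumptions chosen for the example, not values documented by Ampora; only the 5 000 sessions-per-instance figure comes from this page.

```python
SESSION_LIMIT = 5000  # per-instance figure quoted on this page


def forwarded_ratio(local_total: float, forwarded_total: float) -> float:
    """Fraction of dispatches that had to be forwarded to another replica."""
    total = local_total + forwarded_total
    return forwarded_total / total if total else 0.0


def monitoring_warnings(active_sessions: float,
                        local_total: float,
                        forwarded_total: float,
                        max_forwarded_ratio: float = 0.2,   # assumption, tune per fleet
                        session_fraction: float = 0.8):     # assumption: "approaching" = 80 %
    """Return human-readable warnings for one Ampora instance."""
    warnings = []
    if active_sessions >= session_fraction * SESSION_LIMIT:
        warnings.append("active sessions approaching the per-instance limit")
    if forwarded_ratio(local_total, forwarded_total) > max_forwarded_ratio:
        warnings.append("high forwarded ratio: check load balancer stickiness")
    return warnings


if __name__ == "__main__":
    for line in monitoring_warnings(4500, local_total=500, forwarded_total=500):
        print(line)
```

In practice the same logic usually lives in alerting rules rather than application code; the sketch only makes the arithmetic explicit.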

Knobs

Per Ampora instance

OpAmp:
  MaxMessageBytes: 10485760            # 10 MiB. Larger configs need a larger value.
  HeartbeatWindowSeconds: 90           # Tighten for faster drift detection (= more writes)
Dispatch:
  OwnershipTtlSeconds: 60              # Lower = faster failover, more renew traffic
  LeaderLeaseSeconds: 30

Postgres

The dominant writes go to agent_status_history and, to a lesser extent, audit_events and dispatch_envelopes. Tune:

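  • Connection pool: Npgsql's maximum pool size defaults to 100 per pool. With 5 Ampora replicas, cap each at 20 (MaxPoolSize=20 in the connection string) so total connections stay below your Postgres max_connections. Note that 5 × 20 = 100 exactly matches the PostgreSQL default max_connections of 100, so either raise max_connections or cap the pools lower to leave headroom for other clients.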
  • shared_buffers: 25 % of RAM is the rule of thumb.
  • max_wal_size: bump for high-write workloads.
  • Read replicas (if your offering supports them): Ampora can target a read replica for the dashboard via ConnectionStrings:AmporaRead. Writes still go to the primary.
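The connection-pool arithmetic can be made explicit. The sketch below is a hypothetical helper, not an Ampora tool: it divides the connections left after PostgreSQL's reserved superuser slots (`superuser_reserved_connections`, which defaults to 3) and any other clients you expect, evenly across replicas. The `other_clients` parameter is an assumption standing in for migrations, monitoring agents, and ad-hoc psql sessions.

```python
def max_pool_size_per_replica(max_connections: int,
                              replicas: int,
                              reserved: int = 3,        # PostgreSQL default superuser_reserved_connections
                              other_clients: int = 0    # assumption: connections used by non-Ampora clients
                              ) -> int:
    """Largest per-replica pool cap that keeps the fleet total under max_connections."""
    if replicas <= 0:
        raise ValueError("replicas must be positive")
    usable = max_connections - reserved - other_clients
    if usable < replicas:
        raise ValueError("not enough connections for one per replica")
    return usable // replicas


if __name__ == "__main__":
    # Default max_connections of 100 with 5 replicas leaves 19 per replica.
    print(max_pool_size_per_replica(100, 5))
    # A larger server with 10 connections set aside for other tools.
    print(max_pool_size_per_replica(200, 5, other_clients=10))
```

Recompute the cap whenever the replica count changes; scaling from 3 to 5 replicas without lowering the per-replica cap is a common way to exhaust max_connections.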

Reverse proxy

WebSocket- and SignalR-friendly settings:

  • Read / send timeout ≥ 1 hour.
  • Buffer sizes: 64 KB or larger.
  • Keepalive: enabled.
  • Connection cap per client: do not set anything aggressive on the agent path; one OpAMP WebSocket per agent is normal.

Saturation signals

These are the symptoms you usually see before the system actually breaks:

  • Increasing ampora_dispatch_no_session_total — agents are flapping; check the load balancer health checks and timeouts.
  • ampora_opamp_frames_rejected_total > 0 — agents are sending oversized or malformed frames. Bump OpAmp:MaxMessageBytes or open a ticket if it is a single agent type going wrong.
  • leader_lease_acquisitions_total per leader name climbing — flaky leader election. Often a clock-skew or Postgres-latency problem.
  • Health probe /health/ready flipping — connection pool exhaustion is a common root cause; increase MaxPoolSize or scale Postgres.
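The counter-based signals above can be checked by comparing two consecutive scrapes. This is a minimal sketch under stated assumptions: metric samples are passed in as plain dictionaries keyed by metric name (labels such as the leader name are ignored for brevity), and `max_lease_acquisitions=1` is an illustrative threshold, since a single acquisition per window can be a normal failover.

```python
NO_SESSION = "ampora_dispatch_no_session_total"
FRAMES_REJECTED = "ampora_opamp_frames_rejected_total"
LEASE_ACQUISITIONS = "leader_lease_acquisitions_total"


def counter_increase(previous: float, current: float) -> float:
    """Increase of a monotonic counter between two scrapes.

    If the counter went down, the process restarted and the counter reset,
    so the current value is the increase since the restart.
    """
    if current < previous:
        return current
    return current - previous


def saturation_signals(previous: dict, current: dict,
                       max_lease_acquisitions: int = 1):  # assumption, tune per deployment
    """Return the early-warning symptoms visible between two scrapes."""
    signals = []
    if counter_increase(previous.get(NO_SESSION, 0), current.get(NO_SESSION, 0)) > 0:
        signals.append("agents flapping: check load balancer health checks and timeouts")
    if current.get(FRAMES_REJECTED, 0) > 0:
        signals.append("oversized or malformed frames: review OpAmp:MaxMessageBytes")
    lease_delta = counter_increase(previous.get(LEASE_ACQUISITIONS, 0),
                                   current.get(LEASE_ACQUISITIONS, 0))
    if lease_delta > max_lease_acquisitions:
        signals.append("flaky leader election: check clock skew and Postgres latency")
    return signals


if __name__ == "__main__":
    before = {NO_SESSION: 4, FRAMES_REJECTED: 0, LEASE_ACQUISITIONS: 2}
    after = {NO_SESSION: 9, FRAMES_REJECTED: 0, LEASE_ACQUISITIONS: 3}
    for signal in saturation_signals(before, after):
        print(signal)
```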

Federation as horizontal scale

For multi-region or very-large fleets, one Ampora per region with federation is strictly better than one giant Ampora:

  • Each region's Postgres only sees its region's writes.
  • A region outage is a region outage, not a global outage.
  • The federation protocol is read-only by default; cross-region rollouts require explicit operator intent.

This is the same pattern Kubernetes uses with one control plane per cluster.