Scaling out¶

Ampora is designed for fleets in the low thousands of agents per deployment and scales horizontally in a linear fashion as replicas are added. This page is the cheat sheet for what to watch and which knob to turn.

Capacity guidance¶

Fleet size	Replicas	Notes
≤ 200 agents	1	Single instance is fine; `InProcess` backplane
200 – 2 000	2 – 3	HA, sticky sessions, `Postgres` or `Redis` backplane
2 000 – 10 000	3 – 5	Watch JSONB write rate; consider read replicas for the dashboard queries
> 10 000	5+	Federation across regions; one Ampora per region instead of one big global instance
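
The table above can be read as a simple lookup. The sketch below is illustrative only: `recommend_replicas` is a hypothetical helper, not part of Ampora, and it assumes the boundary values (200, 2 000, 10 000) belong to the lower tier, which the table leaves open.

```python
from typing import Optional, Tuple


def recommend_replicas(agents: int) -> Tuple[int, Optional[int]]:
    """Return (min_replicas, max_replicas) for a fleet size.

    Thresholds are copied from the capacity table. max_replicas is None
    for the open-ended "5+" tier, where federation is the recommendation.
    """
    if agents < 0:
        raise ValueError("agents must be non-negative")
    if agents <= 200:
        return (1, 1)      # single instance, InProcess backplane
    if agents <= 2_000:
        return (2, 3)      # HA with a Postgres or Redis backplane
    if agents <= 10_000:
        return (3, 5)      # watch the JSONB write rate
    return (5, None)       # federate: one Ampora per region


if __name__ == "__main__":
    for size in (150, 1_500, 8_000, 25_000):
        print(size, recommend_replicas(size))
```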

Why not "one big instance"? Agent state events (status, health, effective config) are write-heavy on PostgreSQL JSONB. Past 10 000 agents the bottleneck is the database, and scaling Postgres vertically is sub-linear. Federating across regions keeps each Postgres in its comfort zone.

What to monitor¶

Self-observability metrics that matter most:

Metric	Warning if …
`ampora_opamp_active_sessions`	Approaches `5000` per instance
`ampora_dispatch_local_total` vs. `_forwarded_total`	High forwarded ratio means LB stickiness is broken
`ampora_dispatch_no_session_total`	Non-zero means agents are disconnecting before dispatch arrives
`ampora_db_query_duration_seconds_bucket` (p95)	Above 100 ms — Postgres is the bottleneck
`ampora_live_bus_events_total`	Spikes correlate with chatty fleet events; coalescing should absorb most of them
`container_memory_rss` (reported by the kubelet/cAdvisor)	Approaching the container memory limit; raise `resources.limits.memory`
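
Two of the checks in this table involve a small calculation: the forwarded ratio and a p95 taken from histogram buckets. The following is a minimal sketch of both, assuming you already have the raw counter and bucket values in hand. The function names and the sample numbers are made up for illustration, and the interpolation mirrors what Prometheus' `histogram_quantile` does rather than anything Ampora-specific.

```python
from typing import Sequence, Tuple


def forwarded_ratio(local_total: float, forwarded_total: float) -> float:
    """Share of dispatches that had to be forwarded to another instance."""
    total = local_total + forwarded_total
    return forwarded_total / total if total > 0 else 0.0


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Uses linear interpolation inside the bucket that contains the rank.
    If the rank falls in the +Inf bucket, the highest finite bound is returned.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            span = count - prev_count
            if bound == float("inf") or span == 0:
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical sample: cumulative counts per "le" bound, in seconds.
    sample = [(0.05, 80), (0.1, 90), (0.25, 98), (float("inf"), 100)]
    p95 = histogram_quantile(0.95, sample)
    print(f"p95 = {p95:.3f}s, over 100 ms: {p95 > 0.1}")
    print(f"forwarded ratio = {forwarded_ratio(900, 100):.2f}")
```

In a real Prometheus setup the quantile would normally be computed over bucket rates for a time window rather than over raw cumulative counts.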

See Self-observability for the full metric set.

Knobs¶

Per Ampora instance¶

OpAmp:
  MaxMessageBytes: 10485760            # 10 MiB. Larger configs need a larger value.
  HeartbeatWindowSeconds: 90           # Tighten for faster drift detection (= more writes)
Dispatch:
  OwnershipTtlSeconds: 60              # Lower = faster failover, more renew traffic
  LeaderLeaseSeconds: 30
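
To get a feel for the "more writes" trade-off on `HeartbeatWindowSeconds`, a back-of-the-envelope estimate can help. The sketch below assumes, purely for illustration, that every connected agent causes about one status write per heartbeat window; the actual write pattern depends on Ampora internals not described here, and `estimated_status_writes_per_second` is a hypothetical helper.

```python
def estimated_status_writes_per_second(agents: int, heartbeat_window_seconds: float) -> float:
    """Rough write rate, ASSUMING one status write per agent per window."""
    if heartbeat_window_seconds <= 0:
        raise ValueError("heartbeat_window_seconds must be positive")
    return agents / heartbeat_window_seconds


if __name__ == "__main__":
    for window in (90, 30):
        rate = estimated_status_writes_per_second(2_000, window)
        print(f"window={window}s -> ~{rate:.1f} writes/s")
```

Under that assumption, cutting the window from 90 to 30 seconds triples the estimated write rate for the same fleet.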

Postgres¶

The dominant writes go to `agent_status_history` and, to a lesser extent, `audit_events` and `dispatch_envelopes`. Tune:

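Connection pool: Npgsql's pool size defaults to 100. With 5 Ampora replicas, capping each at 20 (`MaxPoolSize=20` in the connection string) allows 5 × 20 = 100 connections in total. That equals the Postgres default `max_connections` of 100, a few of which are reserved for superusers, so either raise `max_connections` or pick a smaller cap to keep the total below the limit.
`shared_buffers`: 25 % of RAM is the rule of thumb.
`max_wal_size`: raise it above the 1 GB default for high-write workloads so that checkpoints are triggered less often.
Read replicas (if your offering supports them): Ampora can target a read replica for the dashboard via `ConnectionStrings:AmporaRead`. Writes still go to the primary.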
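
The connection-pool arithmetic can be sketched as follows. `max_pool_size_per_replica` is a hypothetical helper, not an Ampora or Npgsql API. The default of 3 reserved connections matches the Postgres default for `superuser_reserved_connections`, and `other_clients` is an assumed allowance for anything else that connects to the same server (migrations, monitoring, ad-hoc sessions).

```python
def max_pool_size_per_replica(
    max_connections: int,
    replicas: int,
    reserved: int = 3,
    other_clients: int = 0,
) -> int:
    """Largest per-replica pool cap that keeps the total under max_connections."""
    if replicas <= 0:
        raise ValueError("replicas must be positive")
    usable = max_connections - reserved - other_clients
    if usable < replicas:
        raise ValueError("not enough connections for one per replica")
    return usable // replicas


if __name__ == "__main__":
    # Postgres defaults (max_connections=100) with 5 Ampora replicas.
    print(max_pool_size_per_replica(100, 5))
    # A raised limit, leaving 10 connections for other clients.
    print(max_pool_size_per_replica(200, 5, other_clients=10))
```

With default Postgres settings and 5 replicas the result is 19, which is why a cap of exactly 20 per replica leaves no headroom unless `max_connections` is raised.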

Reverse proxy¶

WebSocket and SignalR-friendly timeouts:

Read / send timeout ≥ 1 hour.
Buffer sizes: 64 KB or larger.
Keepalive: enabled.
Connection cap per client: do not set anything aggressive on the agent path; one OpAMP WebSocket per agent is normal.

Saturation signals¶

These are the symptoms you usually see before the system actually breaks:

Increasing `ampora_dispatch_no_session_total`: agents are flapping; check the load balancer health checks and timeouts.
`ampora_opamp_frames_rejected_total` > 0: agents are sending oversized or malformed frames. Bump `OpAmp:MaxMessageBytes`, or open a ticket if it is a single agent type going wrong.
`leader_lease_acquisitions_total` per leader name climbing: flaky leader election. Often a clock-skew or Postgres-latency problem.
Health probe `/health/ready` flipping: connection pool exhaustion is a common root cause; increase `MaxPoolSize` or scale Postgres.
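
Several of these signals are about a counter increasing rather than about its absolute value. A minimal sketch of that check over a series of scraped samples is shown below; `counter_increase` is a hypothetical helper, and it assumes, as Prometheus does, that a drop in value means the process restarted and the counter began again from zero.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across consecutive samples of a monotonic counter.

    A sample lower than its predecessor is treated as a counter reset,
    so the new value itself counts as the increase for that step.
    """
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    return increase


if __name__ == "__main__":
    # Hypothetical scrapes of ampora_dispatch_no_session_total, with one restart.
    scrapes = [10, 12, 15, 2, 4]
    print(counter_increase(scrapes))
```

A flat series yields zero, so alerting on a positive increase over a window avoids paging on an old, stable non-zero total.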

Federation as horizontal scale¶

For multi-region or very-large fleets, one Ampora per region with federation is strictly better than one giant Ampora:

Each region's Postgres only sees its region's writes.
A region outage is a region outage, not a global outage.
The federation protocol is read-only by default; cross-region rollouts require explicit operator intent.

This is the same pattern Kubernetes uses with one control plane per cluster.