Scaling out¶
Ampora is designed for fleets in the low thousands of agents per deployment with linear horizontal scaling. This page is the cheat sheet for what to watch and which knob to turn.
Capacity guidance¶
| Fleet size | Replicas | Notes |
|---|---|---|
| ≤ 200 agents | 1 | Single instance is fine; InProcess backplane |
| 200 – 2 000 | 2 – 3 | HA, sticky sessions, Postgres or Redis backplane |
| 2 000 – 10 000 | 3 – 5 | Watch JSONB write rate; consider read replicas for the dashboard queries |
| > 10 000 | 5+ | Federation across regions; one Ampora per region instead of one big global instance |
Why not "one big instance"? Agent state events (status, health, effective config) are write-heavy on PostgreSQL JSONB. Past 10 000 agents the bottleneck is the database, and scaling Postgres vertically is sub-linear. Federating across regions keeps each Postgres in its comfort zone.
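The capacity table can be read as a simple lookup. The following is a minimal sketch in Python, assuming each tier's upper bound is inclusive; `recommend_replicas` and `TIERS` are hypothetical names for illustration, not part of Ampora.

```python
from typing import Optional, Tuple

# (upper bound on agents, min replicas, max replicas)
# Values are copied from the capacity table; the helper itself is illustrative.
TIERS = [
    (200, 1, 1),
    (2_000, 2, 3),
    (10_000, 3, 5),
]


def recommend_replicas(agents: int) -> Tuple[int, Optional[int]]:
    """Return the (min, max) replica range for a fleet size.

    max is None above 10 000 agents, where the table recommends 5+ replicas
    and federation across regions rather than one large deployment.
    """
    if agents < 0:
        raise ValueError("agents must be non-negative")
    for upper, low, high in TIERS:
        if agents <= upper:
            return (low, high)
    return (5, None)
```

For example, `recommend_replicas(5000)` returns `(3, 5)`, and any fleet above 10 000 agents returns `(5, None)` as a cue to consider federation.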
What to monitor¶
Self-observability metrics that matter most:
| Metric | Warning if … |
|---|---|
| ampora_opamp_active_sessions | Approaches 5000 per instance |
| ampora_dispatch_local_total vs. _forwarded_total | High forwarded ratio means LB stickiness is broken |
| ampora_dispatch_no_session_total | Non-zero means agents are disconnecting before dispatch arrives |
| ampora_db_query_duration_seconds_bucket (p95) | Above 100 ms: Postgres is the bottleneck |
| ampora_live_bus_events_total | Spikes correlate with chatty fleet events; coalescing should mask most |
| kubelet container memory rss | Approaching the limit; bump resources.limits.memory |
See Self-observability for the full metric set.
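The thresholds in the table can be evaluated from raw scrape values. The sketch below is one way to do that in Python. The function names are hypothetical, and so are two thresholds the table does not state: the 20 % forwarded ratio, and treating 80 % of 5000 as "approaching". The p95 estimate uses linear interpolation inside a cumulative bucket, in the style of PromQL's `histogram_quantile`.

```python
from typing import List, Tuple


def forwarded_ratio(local_total: float, forwarded_total: float) -> float:
    """Share of dispatches that were forwarded to another replica."""
    total = local_total + forwarded_total
    return forwarded_total / total if total > 0 else 0.0


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with float("inf").
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile lies in the overflow bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def check_instance(
    active_sessions: float,
    local_total: float,
    forwarded_total: float,
    no_session_total: float,
    p95_seconds: float,
    max_forwarded_ratio: float = 0.2,    # assumed; not stated in the table
    session_warn_fraction: float = 0.8,  # assumed meaning of "approaches"
) -> List[str]:
    """Return one warning string per threshold that is exceeded."""
    warnings = []
    if active_sessions >= session_warn_fraction * 5000:
        warnings.append("active sessions approaching 5000 per instance")
    if forwarded_ratio(local_total, forwarded_total) > max_forwarded_ratio:
        warnings.append("high forwarded ratio: check LB stickiness")
    if no_session_total > 0:
        warnings.append("agents disconnecting before dispatch arrives")
    if p95_seconds > 0.100:
        warnings.append("p95 query duration above 100 ms")
    return warnings
```

In practice these checks would more likely live in alerting rules than in application code; the sketch only makes the arithmetic behind each row explicit.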
Knobs¶
Per Ampora instance¶
OpAmp:
  MaxMessageBytes: 10485760   # 10 MiB. Larger configs need a larger value.
  HeartbeatWindowSeconds: 90  # Tighten for faster drift detection (= more writes)
Dispatch:
  OwnershipTtlSeconds: 60     # Lower = faster failover, more renew traffic
  LeaderLeaseSeconds: 30
Postgres¶
The dominant writes are agent_status_history and (to a lesser extent) audit_events and dispatch_envelopes. Tune:
- Connection pool: Npgsql defaults to 100. With 5 Ampora replicas, cap each at 20 (MaxPoolSize=20 in the connection string) so total connections stay below your Postgres max_connections.
- shared_buffers: 25 % of RAM is the rule of thumb.
- max_wal_size: bump for high-write workloads.
- Read replicas (if your offering supports them): Ampora can target a read replica for the dashboard via ConnectionStrings:AmporaRead. Writes still go to the primary.
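The connection-pool rule is plain arithmetic: replicas times MaxPoolSize must stay under max_connections, with some headroom. The following is a minimal sketch of that budget, assuming you hold back a few connections for migrations, monitoring, and admin sessions. The `reserved` parameter and the function name are illustrative assumptions, not Ampora settings.

```python
def pool_size_per_replica(
    max_connections: int, replicas: int, reserved: int = 10
) -> int:
    """Largest per-replica MaxPoolSize that keeps the fleet under the limit.

    reserved is an assumed headroom for non-Ampora connections
    (admin sessions, migrations, monitoring).
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    budget = max_connections - reserved
    if budget < replicas:
        raise ValueError("max_connections too low for this replica count")
    return budget // replicas


# With max_connections = 110 and 10 reserved, 5 replicas get 20 each,
# matching the MaxPoolSize=20 example above.
print(pool_size_per_replica(110, 5))
```

With PostgreSQL's stock max_connections of 100, five replicas at 20 each would consume the entire budget, so either raise max_connections or lower the per-replica cap (the sketch returns 18 for that case).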
Reverse proxy¶
WebSocket and SignalR-friendly timeouts:
- Read / send timeout ≥ 1 hour.
- Buffer sizes: 64 KB or larger.
- Keepalive: enabled.
- Connection cap per client: do not set anything aggressive on the agent path; one OpAMP WebSocket per agent is normal.
Saturation signals¶
These are the symptoms you usually see before the system actually breaks:
- Increasing ampora_dispatch_no_session_total: agents are flapping; check the load balancer health checks and timeouts.
- ampora_opamp_frames_rejected_total > 0: agents are sending oversized or malformed frames. Bump OpAmp:MaxMessageBytes or open a ticket if it is a single agent type going wrong.
- leader_lease_acquisitions_total per leader name climbing: flaky leader election. Often a clock-skew or Postgres-latency problem.
- Health probe /health/ready flipping: connection pool exhaustion is a common root cause; increase MaxPoolSize or scale Postgres.
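These signals are all about change over time, so they can be checked by comparing two metric scrapes plus a short history of readiness probes. The sketch below assumes the scrapes are already parsed into name-to-value dictionaries for a single leader name. The function names and the two default thresholds are hypothetical.

```python
from typing import Dict, List, Sequence


def count_flips(ready_history: Sequence[bool]) -> int:
    """Number of times /health/ready changed state in the history."""
    return sum(1 for a, b in zip(ready_history, ready_history[1:]) if a != b)


def saturation_signals(
    prev: Dict[str, float],
    curr: Dict[str, float],
    ready_history: Sequence[bool],
    max_lease_acquisitions: int = 1,  # assumed tolerance per interval
    max_ready_flips: int = 1,         # assumed tolerance per interval
) -> List[str]:
    """Compare two scrapes and return the saturation signals that fired."""

    def delta(name: str) -> float:
        # Negative deltas (counter resets after a restart) never fire.
        return curr.get(name, 0.0) - prev.get(name, 0.0)

    signals = []
    if delta("ampora_dispatch_no_session_total") > 0:
        signals.append("no_session increasing: agents are flapping")
    if curr.get("ampora_opamp_frames_rejected_total", 0.0) > 0:
        signals.append("frames rejected: oversized or malformed frames")
    if delta("leader_lease_acquisitions_total") > max_lease_acquisitions:
        signals.append("lease acquisitions climbing: flaky leader election")
    if count_flips(ready_history) > max_ready_flips:
        signals.append("/health/ready flipping: check pool exhaustion")
    return signals
```

A rate-based alert rule in your monitoring system expresses the same idea more robustly; the sketch is only meant to show which comparison each bullet implies.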
Federation as horizontal scale¶
For multi-region or very-large fleets, one Ampora per region with federation is strictly better than one giant Ampora:
- Each region's Postgres only sees its region's writes.
- A region outage is a region outage, not a global outage.
- The federation protocol is read-only by default; cross-region rollouts require explicit operator intent.
This is the same pattern Kubernetes uses with one control plane per cluster.