Rollouts stall

Symptoms: a rollout stays in InProgress longer than it should, batches do not advance, or the rollout auto-paused with a gate reason that is not obvious.

1. Read the rollout detail page

The page shows exactly which agents are pending, applying, applied, or failed. The summary at the top lists:

  • Batch X of Y — current step.
  • Pending — agents in this batch that have not received the push.
  • Applying — push delivered, awaiting RemoteConfigStatus = APPLIED.
  • Failed — agents that returned FAILED. Click into them to see the agent-side error string.
  • Disconnected since push — agents that lost the WebSocket between push and apply.

Auto-pause shows the gate name that fired and the observed value (e.g. "disconnect-ratio: 35 % > 20 %").
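
The gate reason can be read as a plain threshold comparison. The sketch below is illustrative only: it assumes the disconnect-ratio gate divides the agents disconnected since push by the agents pushed in the batch, and the function name and formula are hypothetical rather than Ampora's implementation.

```python
def disconnect_ratio_gate(disconnected: int, pushed: int, threshold_pct: float = 20.0):
    """Return (fired, reason) for a hypothetical disconnect-ratio gate.

    Assumption: ratio = disconnected-since-push / agents pushed in the batch.
    """
    if pushed == 0:
        return False, "disconnect-ratio: no agents pushed"
    observed_pct = 100.0 * disconnected / pushed
    fired = observed_pct > threshold_pct
    op = ">" if fired else "<="
    return fired, f"disconnect-ratio: {observed_pct:.0f} % {op} {threshold_pct:.0f} %"
```

With 35 of 100 pushed agents disconnected and a 20 % threshold, this reproduces the reason string shown above.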

2. Cross-instance dispatch

If you run multiple Ampora instances and a chunk of agents look Pending forever:

  • Check ampora_dispatch_no_session_total — non-zero means dispatch could not find a live session for the targeted agents.
  • Check ampora_dispatch_forwarded_total vs ampora_dispatch_local_total — if forwarded is 0 despite multi-instance, the backplane is not wired. See Dispatch backplane.
  • Confirm sticky sessions on the load balancer (HA wire-up) — if agents are bouncing between instances, every reconnect releases the ownership.
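
The two metric checks above can be combined into one triage helper. This is a minimal sketch that assumes you have already read the current counter values from your metrics backend; the function name and the wording of the findings are hypothetical.

```python
def diagnose_dispatch(no_session: float, forwarded: float, instances: int) -> list[str]:
    """Interpret dispatch counters for a multi-instance deployment.

    no_session -> value of ampora_dispatch_no_session_total
    forwarded  -> value of ampora_dispatch_forwarded_total
    instances  -> number of Ampora instances you run (supplied by you)
    """
    findings = []
    if no_session > 0:
        findings.append("dispatch found no live session for some targeted agents")
    if instances > 1 and forwarded == 0:
        findings.append("nothing forwarded despite multi-instance: backplane not wired")
    return findings
```

An empty result means neither counter points at cross-instance dispatch, so sticky sessions and the later steps are the next things to check.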

3. The agent rejected the config

If a batch of agents has RemoteConfigStatus = FAILED:

  • Click an agent. The Apply error panel shows the error_message the agent returned. Most failures are bad component references or missing extensions.
  • The semantic diff between the previous and the new config often highlights the component that does not exist on this agent type.

If only specific agents fail (e.g. only the ones running an older collector version), you have a config-vs-version mismatch. Either:

  • pin the rollout to a group that is version-homogeneous, or
  • backport the config to be compatible with the older version.
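
To confirm a config-vs-version mismatch, it can help to compute the failure rate per collector version from an export of the agent list. The sketch below assumes you can obtain (collector version, RemoteConfigStatus) pairs; the input shape and function name are hypothetical.

```python
from collections import Counter

def failure_rate_by_version(agents):
    """agents: iterable of (collector_version, remote_config_status) pairs.

    Returns {version: fraction of agents on that version reporting FAILED}.
    """
    total, failed = Counter(), Counter()
    for version, status in agents:
        total[version] += 1
        if status == "FAILED":
            failed[version] += 1
    return {version: failed[version] / total[version] for version in total}
```

A rate near 1.0 for one version and near 0.0 for the others suggests the config uses something the older collector does not support.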

4. The agent never applies

The agent moves from Pending to Applying and then stays in Applying forever.

  • Check ampora_opamp_apply_duration_seconds — what is the p95 and p99? Ampora's "stuck on apply" signal is a duration histogram you can graph per agent type.
  • The agent's own logs show whether the YAML even reached the opamp_extension. Capability mismatch (ReportsRemoteConfig not signalled) means Ampora pushed but the agent never reports back.
  • Increase the apply timeout in the agent's opamp_extension config if the config is large and the agent is on slow hardware.
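
If you want to estimate p95 and p99 by hand from the histogram's cumulative buckets, the usual approach is linear interpolation inside the bucket that contains the target rank. The sketch below follows that general technique; the bucket boundaries in the usage note are made up, and nothing here is specific to how Ampora exposes the metric.

```python
def histogram_quantile(q: float, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound;
    the last bound may be float("inf").
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if bound == float("inf"):
                # Cannot interpolate into the open-ended bucket.
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

For example, with hypothetical buckets `[(1.0, 50), (5.0, 90), (30.0, 99), (float("inf"), 100)]`, the p95 estimate lands between 5 s and 30 s, which indicates a slow tail rather than a fully stuck apply.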

5. Health gate triggered, or gate threshold is too tight

If a gate fires "fairly":

  • Resume if you understand and accept the failure.
  • Rollback if the change itself is at fault.
  • Edit gate thresholds for this rollout in the wizard, then resume.

If the same gate fires repeatedly across rollouts, your global defaults are mis-tuned. Adjust them in Settings → Rollouts → Gate defaults.

6. The leader-elected rollout ticker is stalled

A rare but possible cause: the rollout state machine itself is not ticking because the leader instance crashed without releasing the lease.

  • Check the leader_lease table: the row for rollout-engine shows the current holder and the expiry.
  • A lease that is still held past its expiry indicates the holder died without releasing it. Once the lease TTL (default 30 s) has elapsed, another instance picks it up on the next tick.
  • Persistent stalling suggests Postgres is unhealthy or the holder cannot renew. Check Database problems.
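
The lease check above amounts to comparing the row's expiry with the current time. A minimal sketch, assuming you have read the expiry timestamp from the rollout-engine row yourself; the function name, the state labels, and the grace period of one extra TTL are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta, timezone

def lease_state(expires_at: datetime, now: datetime,
                grace: timedelta = timedelta(seconds=30)) -> str:
    """Classify a leader lease by how far past its expiry it is.

    grace is an assumed allowance (one default TTL) for the next tick
    to take the lease over before treating the stall as persistent.
    """
    if now < expires_at:
        return "held"
    if now - expires_at <= grace:
        return "expired, takeover expected on next tick"
    return "stalled"

# Usage: compare an expiry read from the table with the current UTC time.
now = datetime.now(timezone.utc)
print(lease_state(now - timedelta(minutes=5), now))  # expiry long past
```

A result of "stalled" corresponds to the persistent case in the last bullet, where the database or lease renewal needs a closer look.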

7. Policy denied the push

Check ampora_policy_denied_total{policy=...} and the audit log filtered to the rollout: a Rollout.PushDenied event lists the policy that blocked the push and the agent it would have gone to. Either:

  • adjust the policy (with the four-eyes approval flow if it is custom), or
  • exclude the affected agent from the group, or
  • accept the denial (the agent stays on its previous version).

8. Agents are flapping

If agents bounce between Connected and Disconnected every few seconds, no rollout makes progress.

  • TLS / mTLS expiry: a chunk of agents whose certs expired in the same window.
  • Reverse-proxy timeout too aggressive — see Agents do not connect.
  • Network MTU / fragmentation issues on a specific subnet — usually region-specific.
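
To tell flapping apart from a single disconnect, count state transitions per agent over a recent window. The sketch below assumes you can export a time-ordered list of (timestamp, state) pairs for an agent; the input shape and the 60-second window are hypothetical.

```python
def count_flaps(events, window_seconds: float = 60.0) -> int:
    """Count state changes within the last window_seconds of the history.

    events: list of (timestamp_seconds, state) sorted by timestamp,
    where state is e.g. "Connected" or "Disconnected".
    """
    if not events:
        return 0
    cutoff = events[-1][0] - window_seconds
    flaps = 0
    for (_, prev_state), (ts, state) in zip(events, events[1:]):
        if ts >= cutoff and state != prev_state:
            flaps += 1
    return flaps
```

Grouping agents with a high count by certificate expiry date, proxy, or subnet then points at which of the three causes above applies.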

9. "Stuck" but actually working

Sanity check: large fleets take time. A 5000-agent rollout has to complete 5000 agent-applies, whatever the batch size. Look at Throughput per minute on the rollout page: if it is 100/min, your "stuck" rollout will finish in about 50 minutes (5000 / 100) and is just plodding along.
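
The same arithmetic as a small helper, assuming you read the remaining agent count and the throughput off the rollout page yourself (the function name is hypothetical):

```python
import math

def eta_minutes(remaining_agents: int, throughput_per_min: float):
    """Estimate minutes to completion; None means no progress is being made."""
    if throughput_per_min <= 0:
        return None
    return math.ceil(remaining_agents / throughput_per_min)

print(eta_minutes(5000, 100))  # 50
```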

If the throughput is actually zero, work through steps 2–8.