Health gates

Health gates auto-pause a rollout when too many agents fail to apply, disconnect, or report unhealthy. They are the safety net underneath every rollout strategy.

Built-in gates

Gate                     | Default threshold | What it watches
apply-failed-ratio       | > 20 %            | Agents in batch with RemoteConfigStatus = FAILED
disconnect-ratio         | > 20 %            | Agents that disconnect within 5 min after push
effective-mismatch-ratio | > 20 %            | Agents whose effective hash hasn't matched within 10 min
unhealthy-ratio          | > 10 %            | Agents reporting health.healthy = false after apply

All are configurable per rollout in the wizard. Defaults are tuned for a typical mixed fleet; tighten them for production-critical paths.

How thresholds are evaluated

  • The denominator is the number of agents in the current step (not the whole rollout).
  • The numerator is the running count of "bad outcomes" of the matching kind for that step.
  • Evaluation is event-driven: every relevant event ticks the numerator and re-evaluates.
  • Once a gate's threshold is exceeded, the rollout transitions to Paused immediately. The remaining agents in the step are not pushed.
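
The evaluation rules above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the class name StepGates, its fields, and the record method are hypothetical. Only the gate names and default thresholds come from the built-in gates table, and the comparison is strict because the thresholds are written as "> N %".

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Default thresholds from the built-in gates table, expressed as fractions.
DEFAULT_THRESHOLDS = {
    "apply-failed-ratio": 0.20,
    "disconnect-ratio": 0.20,
    "effective-mismatch-ratio": 0.20,
    "unhealthy-ratio": 0.10,
}


@dataclass
class StepGates:
    """Hypothetical per-step gate tracker (illustrative names only)."""

    step_size: int  # denominator: agents in the current step, not the whole rollout
    thresholds: Dict[str, float] = field(
        default_factory=lambda: dict(DEFAULT_THRESHOLDS)
    )
    bad_counts: Dict[str, int] = field(default_factory=dict)
    paused_by: Optional[str] = None  # name of the gate that fired, if any

    def record(self, gate: str) -> bool:
        """Tick the numerator for one bad outcome and re-evaluate the gate.

        Returns True when the rollout should be (or already is) paused.
        """
        threshold = self.thresholds[gate]  # KeyError for an unknown gate name
        if self.paused_by is not None:
            return True  # already paused; nothing more is pushed in this step
        self.bad_counts[gate] = self.bad_counts.get(gate, 0) + 1
        ratio = self.bad_counts[gate] / self.step_size
        if ratio > threshold:  # strictly greater, matching "> 20 %"
            self.paused_by = gate
        return self.paused_by is not None
```

With a step of 20 agents and the default apply-failed-ratio, the fourth failure (4 / 20 = 20 %) does not pause the rollout in this sketch; the fifth (25 %) does.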

What happens on a fire

  1. Rollout state → Paused.
  2. An audit event Rollout.Paused is recorded with the gate name and the observed value.
  3. The live-update bus broadcasts the state change; UIs update.
  4. The Rollout detail page surfaces:
       • which gate fired,
       • the affected agents (with deep links),
       • the next-action chips: Resume, Rollback, Abort.
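
As a rough illustration of what the audit event in this sequence might carry, the sketch below builds a Rollout.Paused record with the gate name and the observed value. The field names, the rollout_id parameter, and the JSON shape are assumptions made for the example; the actual event schema is not described here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List


@dataclass
class RolloutPausedEvent:
    # All field names are illustrative assumptions, not the product's schema.
    type: str
    rollout_id: str
    gate: str
    observed: float   # observed ratio for the gate that fired
    threshold: float  # configured threshold for that gate
    affected_agents: List[str]


def build_pause_event(
    rollout_id: str,
    gate: str,
    bad_agents: Iterable[str],
    step_size: int,
    threshold: float,
) -> RolloutPausedEvent:
    """Assemble the audit record for a gate that has just fired."""
    agents = sorted(bad_agents)
    return RolloutPausedEvent(
        type="Rollout.Paused",
        rollout_id=rollout_id,
        gate=gate,
        observed=len(agents) / step_size,
        threshold=threshold,
        affected_agents=agents,
    )


def to_json(event: RolloutPausedEvent) -> str:
    """Serialize the event, e.g. for an audit log or a live-update bus."""
    return json.dumps(asdict(event), sort_keys=True)
```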

Operator decisions

  • Resume: the rollout re-enters InProgress and continues from the next agent. Use it when you understand and accept the failure (e.g. the failing agents are stale clones you'll re-image anyway).
  • Rollback: re-assigns the previous Published version to all touched agents. The state machine moves to RolledBack.
  • Abort: stops the rollout without rolling back. Agents that were already pushed stay on the new config.
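
The three decisions can be read as transitions out of Paused. The sketch below models them as a small lookup table. The state names InProgress, Paused, and RolledBack come from this page; the name Aborted for the state after an Abort, and the action strings, are assumptions for illustration.

```python
from enum import Enum


class State(Enum):
    IN_PROGRESS = "InProgress"
    PAUSED = "Paused"
    ROLLED_BACK = "RolledBack"
    ABORTED = "Aborted"  # assumed name; this page does not name the post-abort state


# (current state, action) -> next state. Action strings are hypothetical.
TRANSITIONS = {
    (State.IN_PROGRESS, "gate_fired"): State.PAUSED,
    (State.PAUSED, "resume"): State.IN_PROGRESS,
    (State.PAUSED, "rollback"): State.ROLLED_BACK,
    (State.PAUSED, "abort"): State.ABORTED,
}


def apply_action(state: State, action: str) -> State:
    """Return the next rollout state, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"action {action!r} is not valid in state {state.value}"
        ) from None
```

Keeping the transitions in one table makes it easy to see that Resume, Rollback, and Abort are only offered while the rollout is Paused.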

Tuning

Tighten thresholds for production-critical paths:

Gate                     | Stricter setting
apply-failed-ratio       | > 5 %
disconnect-ratio         | > 10 %
effective-mismatch-ratio | > 15 %
unhealthy-ratio          | > 5 %

Loosen them for noisy environments (frequently-recycled hosts, flaky networks). Document the rationale in the rollout's description field — the audit trail keeps it.
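
If you keep threshold profiles alongside your rollout definitions, a small helper can merge per-rollout overrides onto the defaults and reject typos in gate names. The sketch below assumes thresholds are stored as fractions in a plain dictionary; resolve_thresholds and the profile name are hypothetical, and the wizard itself remains the place where thresholds are actually set.

```python
from typing import Dict

DEFAULTS = {
    "apply-failed-ratio": 0.20,
    "disconnect-ratio": 0.20,
    "effective-mismatch-ratio": 0.20,
    "unhealthy-ratio": 0.10,
}

# Stricter settings suggested above for production-critical paths.
PRODUCTION_CRITICAL = {
    "apply-failed-ratio": 0.05,
    "disconnect-ratio": 0.10,
    "effective-mismatch-ratio": 0.15,
    "unhealthy-ratio": 0.05,
}


def resolve_thresholds(overrides: Dict[str, float]) -> Dict[str, float]:
    """Merge overrides onto the defaults, validating names and ranges."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown gates: {sorted(unknown)}")
    for gate, value in overrides.items():
        if not 0.0 <= value < 1.0:
            raise ValueError(f"{gate}: threshold {value} must be in [0, 1)")
    return {**DEFAULTS, **overrides}
```

For example, resolve_thresholds({"unhealthy-ratio": 0.05}) tightens one gate and leaves the other three at their defaults.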

Scope

Gates only watch the current step's agents — not the rollout total. This matters for canary strategies: a bad step-1 trips its own gate even if the not-yet-touched majority is fine.

Edge cases

  • Group shrinks mid-step: the denominator follows. If 4 of 20 agents failed and another 5 disconnected for unrelated reasons, the gate evaluates 4 / 15 ≈ 27 %, not 4 / 20 = 20 %. That exceeds the default > 20 % threshold, where 4 / 20 would not.
  • Slow agents: a slow but eventually-applying agent counts toward effective-mismatch-ratio only after the 10-minute timeout. Raise the timeout for fleets on slow links.
  • Restarted agents: a restart that drops the WebSocket but reconnects with the new effective hash within a minute does not count toward disconnect-ratio, because the disconnect window resets when the agent reconnects with the expected hash.
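
The shrinking-denominator case is easy to check numerically. The helper below is a hypothetical sketch (gate_ratio and its parameters are not part of the product) that recomputes the ratio after agents leave the step.

```python
def gate_ratio(bad: int, step_size: int, departed: int = 0) -> float:
    """Ratio of bad outcomes over the agents still in the step.

    `departed` is the number of agents that left the step for unrelated
    reasons, which shrinks the denominator.
    """
    remaining = step_size - departed
    if remaining <= 0:
        raise ValueError("no agents left in the step")
    return bad / remaining


threshold = 0.20  # default apply-failed-ratio

before = gate_ratio(bad=4, step_size=20)             # 4 / 20
after = gate_ratio(bad=4, step_size=20, departed=5)  # 4 / 15

print(f"{before:.1%} fires: {before > threshold}")
print(f"{after:.1%} fires: {after > threshold}")
```

The first ratio sits exactly at the threshold and does not fire; the second is roughly 26.7 % and does.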

Custom gates

Custom gate expressions (in the same DSL as policies) are an opt-in extension; see Custom policy DSL. Built-in gates are sufficient for most environments.