Health gates
Health gates auto-pause a rollout when too many agents fail to apply, disconnect, or report unhealthy. They are the safety net underneath every rollout strategy.
Built-in gates
| Gate | Default threshold | What it watches |
|---|---|---|
| `apply-failed-ratio` | > 20 % | Agents in batch with `RemoteConfigStatus = FAILED` |
| `disconnect-ratio` | > 20 % | Agents that disconnect within 5 min after push |
| `effective-mismatch-ratio` | > 20 % | Agents whose effective hash hasn't matched within 10 min |
| `unhealthy-ratio` | > 10 % | Agents reporting `health.healthy = false` after apply |
All are configurable per rollout in the wizard. Defaults are tuned for a typical mixed fleet; tighten them for production-critical paths.
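The defaults and a per-rollout override can be pictured as a small immutable record. This is only an illustrative sketch: the product exposes these settings through the wizard, and the `GateThresholds` class and its field names are hypothetical, not part of any documented API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GateThresholds:
    """Hypothetical record: a gate fires when its ratio is strictly above the value."""
    apply_failed_ratio: float = 0.20
    disconnect_ratio: float = 0.20
    effective_mismatch_ratio: float = 0.20
    unhealthy_ratio: float = 0.10

    def __post_init__(self):
        # Reject values that cannot be a ratio.
        for name, value in vars(self).items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within [0, 1], got {value}")


DEFAULTS = GateThresholds()

# Per-rollout override: only the named gates change, the rest keep the defaults.
production = replace(DEFAULTS, apply_failed_ratio=0.05, unhealthy_ratio=0.05)
```

Keeping the record frozen means an override always produces a new value, so the defaults cannot be mutated by one rollout and leak into another.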
How thresholds are evaluated
- The denominator is the number of agents in the current step (not the whole rollout).
- The numerator is the running count of "bad outcomes" of the matching kind for that step.
- Evaluation is event-driven: every relevant event ticks the numerator and re-evaluates.
- Once a gate's threshold is exceeded, the rollout transitions to `Paused` immediately. The remaining agents in the step are not pushed.
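The evaluation rules above can be sketched as a tiny event-driven counter. This is a simplified model under stated assumptions, not the product's implementation: `StepGate` and `run_step` are hypothetical names, outcomes are reported synchronously right after each push, and bad outcomes are de-duplicated per agent (the text only says "running count").

```python
from dataclasses import dataclass, field


@dataclass
class StepGate:
    name: str
    threshold: float   # the gate fires when ratio > threshold
    step_size: int     # denominator: agents in the current step
    bad: set = field(default_factory=set)  # agent IDs with a bad outcome

    def on_bad_outcome(self, agent_id: str) -> bool:
        """Record one bad outcome and report whether the gate now fires."""
        self.bad.add(agent_id)  # assumption: repeat events do not double-count
        return len(self.bad) / self.step_size > self.threshold


def run_step(agents, failing, gate):
    """Push agents in order and stop as soon as the gate fires."""
    pushed = []
    for agent in agents:
        pushed.append(agent)
        if agent in failing and gate.on_bad_outcome(agent):
            return "Paused", pushed  # remaining agents are never pushed
    return "InProgress", pushed


agents = [f"a{i}" for i in range(10)]
gate = StepGate("apply-failed-ratio", threshold=0.20, step_size=len(agents))
state, pushed = run_step(agents, failing={"a1", "a3", "a5"}, gate=gate)
print(state, len(pushed))  # Paused 6
```

In this run the second failure brings the ratio to exactly 20 %, which does not exceed the threshold; the third failure (30 %) does, so the step stops after six pushes and the last four agents are left untouched.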
What happens on a fire
- Rollout state → `Paused`.
- Audit event `Rollout.Paused` with the gate name and the observed value.
- Live-update bus broadcasts the state change; UIs update.
- The Rollout detail page surfaces:
  - which gate fired,
  - the affected agents (with deep links),
  - the next-action chips: Resume, Rollback, Abort.
Operator decisions
- Resume: the rollout re-enters `InProgress` from the next agent. Use this when you understand and accept the failure (e.g. the failing agents are stale clones you'll re-image anyway).
- Rollback: re-assigns the previous Published version to all touched agents. The state machine moves to `RolledBack`.
- Abort: stops without rolling back. The pushed agents stay on the new config.
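The three decisions map naturally onto a small transition table. The sketch below is illustrative only: the `State` enum and `apply_action` function are hypothetical, it models just the paused case, and the `Aborted` state name is an assumption because the text does not name the state an abort leads to.

```python
from enum import Enum


class State(Enum):
    IN_PROGRESS = "InProgress"
    PAUSED = "Paused"
    ROLLED_BACK = "RolledBack"
    ABORTED = "Aborted"  # assumed name; the text names no state for Abort


# Operator actions offered while paused, and the state each one leads to.
PAUSED_ACTIONS = {
    "resume": State.IN_PROGRESS,
    "rollback": State.ROLLED_BACK,
    "abort": State.ABORTED,
}


def apply_action(state: State, action: str) -> State:
    """Return the next state for an operator action taken on a paused rollout."""
    if state is not State.PAUSED:
        raise ValueError(f"{action!r} is modelled only from Paused, not {state.value}")
    try:
        return PAUSED_ACTIONS[action]
    except KeyError:
        raise ValueError(f"unknown action {action!r}") from None


print(apply_action(State.PAUSED, "rollback").value)  # RolledBack
```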
Tuning
Tighten thresholds for production-critical paths:
| Gate | Stricter setting |
|---|---|
| `apply-failed-ratio` | > 5 % |
| `disconnect-ratio` | > 10 % |
| `effective-mismatch-ratio` | > 15 % |
| `unhealthy-ratio` | > 5 % |
Loosen them for noisy environments (frequently-recycled hosts, flaky networks). Document the rationale in the rollout's description field — the audit trail keeps it.
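When choosing a setting it helps to translate the percentage into an agent count for the step sizes you actually use. The helper below is a hypothetical illustration (`min_bad_to_trip` is not a product function); it assumes the strictly-greater-than comparison shown in the tables and uses integer percentages to avoid floating-point rounding.

```python
def min_bad_to_trip(step_size: int, threshold_pct: int) -> int:
    """Smallest bad-agent count whose ratio is strictly above threshold_pct."""
    # bad / step_size > pct / 100  <=>  bad > step_size * pct / 100
    return step_size * threshold_pct // 100 + 1


for step_size in (10, 40):
    for pct in (20, 10, 5):
        print(f"step={step_size:>3} threshold=>{pct:>2}% "
              f"trips at {min_bad_to_trip(step_size, pct)} bad agent(s)")
```

For a 40-agent step, the default > 20 % needs 9 bad agents while the stricter > 5 % needs only 3. For a 10-agent step, > 5 % trips on the first bad agent, which is worth keeping in mind when combining strict thresholds with small canary steps.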
Scope
Gates only watch the current step's agents — not the rollout total. This matters for canary strategies: a bad step-1 trips its own gate even if the not-yet-touched majority is fine.
Edge cases
- Group shrinks mid-step: the denominator follows. If 4 of 20 failed and another 5 disconnected for unrelated reasons, the gate evaluates 4 / 15 ≈ 27 %, not 4 / 20 = 20 %. Against a > 20 % threshold, that is the difference between firing and not firing.
- Slow agents: a slow but eventually-applying agent counts toward `effective-mismatch-ratio` only after the 10-minute timeout. Bump the timeout for fleets on slow links.
- Restarted agents: a restart that drops the WebSocket but reconnects with the new effective hash within a minute does not count toward `disconnect-ratio`; the disconnect window resets on reconnect with the expected hash.
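The first two edge cases reduce to simple arithmetic, sketched below. Both functions are hypothetical illustrations rather than product code; in particular, treating an agent as mismatched at exactly the timeout (`>=`) is an assumption, since the text does not say how the boundary is handled.

```python
def gate_ratio(bad: int, step_size: int, departed: int = 0) -> float:
    """Ratio over the agents still in the step; the denominator shrinks."""
    remaining = step_size - departed
    if remaining <= 0:
        raise ValueError("no agents left in the step")
    return bad / remaining


def mismatch_timed_out(seconds_since_push: float, hash_matched: bool,
                       timeout_s: float = 600.0) -> bool:
    """True once an agent still lacks the expected hash at the timeout."""
    return (not hash_matched) and seconds_since_push >= timeout_s


threshold = 0.20
print(gate_ratio(4, 20) > threshold)              # False: 20 % does not exceed 20 %
print(gate_ratio(4, 20, departed=5) > threshold)  # True: 4 / 15 is about 27 %
print(mismatch_timed_out(420, hash_matched=False))  # False: still inside 10 min
```

Raising `timeout_s` for a fleet on slow links simply moves the point at which a late agent starts counting against the gate.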
Custom gates
Custom gate expressions (in the same DSL as policies) are an opt-in extension; see Custom policy DSL. Built-in gates are sufficient for most environments.