Health gates
Reference for the four built-in gates that auto-pause a rollout. Each table row below is canonical; the narrative on User → Health gates explains the rationale.
apply-failed-ratio
| Property | Value |
|---|---|
| Default threshold | > 20 % |
| Numerator | Agents in current step with RemoteConfigStatus = FAILED |
| Denominator | Agents in current step that received the push |
| Window | Step lifetime (resets per step) |
| Configurable | Yes, per rollout |
The most directly actionable gate: a high apply-failed-ratio means the config does not work on those agents.
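The table above amounts to a ratio followed by a strict comparison. The following is a minimal sketch of that arithmetic, not the product's implementation: `StepStats`, `apply_failed_ratio`, and `gate_fires` are hypothetical names, and treating an empty step as a ratio of 0 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class StepStats:
    """Hypothetical per-step counters; the field names are illustrative."""
    received: int  # agents in the current step that received the push
    failed: int    # of those, agents with RemoteConfigStatus = FAILED


def apply_failed_ratio(stats: StepStats) -> float:
    # Assumption: a step in which no agent has received the push yet
    # has a ratio of 0 rather than being undefined.
    if stats.received == 0:
        return 0.0
    return stats.failed / stats.received


def gate_fires(ratio: float, threshold: float = 0.20) -> bool:
    # The default threshold is "> 20 %", so the comparison is strict:
    # a ratio of exactly 20 % does not fire the gate.
    return ratio > threshold
```

With these definitions, 10 failures out of 50 agents (exactly 20 %) does not pause the rollout, while 11 out of 50 does.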
disconnect-ratio
| Property | Value |
|---|---|
| Default threshold | > 20 % |
| Numerator | Agents in current step that disconnected within WindowSeconds after push |
| Denominator | Agents in current step that received the push |
| Window | 5 minutes (configurable as WindowSeconds) |
| Configurable | Yes, per rollout |
An agent that disconnects and then reconnects with the expected effective hash within the window does not count. An agent that disconnects and does not return within the window does.
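The counting rule can be sketched as a predicate over one agent's timeline. This is an illustration under stated assumptions: `AgentObservation` and `counts_as_disconnect` are hypothetical names, the window is measured from the push for both the disconnect and the return, and an agent that returns inside the window with an unexpected hash is still counted (the text above does not specify that case).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentObservation:
    """Hypothetical timeline for one agent; times are seconds since epoch."""
    pushed_at: float
    disconnected_at: Optional[float] = None
    reconnected_at: Optional[float] = None
    effective_hash_on_return: Optional[str] = None


def counts_as_disconnect(
    obs: AgentObservation,
    expected_hash: str,
    window_seconds: float = 300.0,  # default window of 5 minutes
) -> bool:
    deadline = obs.pushed_at + window_seconds
    # No disconnect, or a disconnect after the window closed: not counted.
    if obs.disconnected_at is None or obs.disconnected_at > deadline:
        return False
    # A return inside the window with the expected hash cancels the count.
    came_back_ok = (
        obs.reconnected_at is not None
        and obs.reconnected_at <= deadline
        and obs.effective_hash_on_return == expected_hash
    )
    # Assumption: returning with a different hash still counts.
    return not came_back_ok
```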
effective-mismatch-ratio
| Property | Value |
|---|---|
| Default threshold | > 20 % |
| Numerator | Agents in current step whose EffectiveHash != AssignedHash after TimeoutSeconds |
| Denominator | Agents in current step that received the push |
| Window | 10 minutes (configurable as TimeoutSeconds) |
| Configurable | Yes, per rollout |
Catches the case where the agent says APPLIED but reports a different hash. This usually indicates that the agent applied the config and then crashed, fell back to its on-disk config, or applied an older version.
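A minimal sketch of how this ratio could be computed from per-agent reports. `AgentReport` and `effective_mismatch_ratio` are hypothetical names. Two choices here are assumptions rather than documented behavior: an agent that has reported no effective hash by the timeout is treated as a mismatch, and agents whose timeout has not yet elapsed stay in the denominator but cannot enter the numerator.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class AgentReport:
    """Hypothetical per-agent state; times are seconds since epoch."""
    pushed_at: float
    assigned_hash: str
    effective_hash: Optional[str]  # None: no effective config reported yet


def effective_mismatch_ratio(
    reports: Iterable[AgentReport],
    now: float,
    timeout_seconds: float = 600.0,  # default window of 10 minutes
) -> float:
    reports = list(reports)
    if not reports:
        return 0.0  # assumption: an empty step has a ratio of 0
    mismatched = sum(
        1
        for r in reports
        # Only judge an agent once its timeout has elapsed.
        if now - r.pushed_at >= timeout_seconds
        and r.effective_hash != r.assigned_hash
    )
    # Denominator: every agent in the step that received the push.
    return mismatched / len(reports)
```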
unhealthy-ratio
| Property | Value |
|---|---|
| Default threshold | > 10 % |
| Numerator | Agents in current step reporting health.healthy = false after TimeoutSeconds |
| Denominator | Agents in current step that received the push |
| Window | 10 minutes (configurable as TimeoutSeconds) |
| Configurable | Yes, per rollout |
The default is tighter than the other gates' because an unhealthy report is a strong signal. An agent that goes from healthy to unhealthy after a config apply is a clear regression.
Evaluation order
When multiple gates would fire at the same time, the rollout pauses on the first gate, in strict insertion order on the gate list, whose running counter crossed its threshold. The audit event records only that gate; the others are evaluated again on resume.
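The first-match rule can be sketched as a scan over an ordered list. The function name and data shapes are hypothetical, and the order shown for the built-in gates simply follows the order of this page, which is an assumption about the actual list.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: insertion order mirrors the order the gates appear on this page.
DEFAULT_GATES: List[Tuple[str, float]] = [
    ("apply-failed-ratio", 0.20),
    ("disconnect-ratio", 0.20),
    ("effective-mismatch-ratio", 0.20),
    ("unhealthy-ratio", 0.10),
]


def first_firing_gate(
    gates: List[Tuple[str, float]],
    ratios: Dict[str, float],
) -> Optional[str]:
    """Return the first gate, in list order, whose ratio exceeds its threshold."""
    for name, threshold in gates:
        if ratios.get(name, 0.0) > threshold:
            return name  # only this gate would be recorded in the audit event
    return None
```

If both `disconnect-ratio` and `unhealthy-ratio` are over their thresholds, this scan reports `disconnect-ratio`; `unhealthy-ratio` would only surface when the gates are evaluated again on resume.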
Custom gate expressions
Custom gates use the same DSL as Custom policies. They are an opt-in extension; built-in gates cover the common cases.
A custom gate's expression is evaluated against the per-step counters (stats.applied, stats.failed, stats.disconnected, stats.unhealthy, stats.received). A typical use is a guard that keeps the gate from firing when the step is too small to be statistically meaningful.
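The logic of a small-step guard can be illustrated in Python. This is not the gate DSL, whose syntax is not shown on this page. It is only a sketch of the predicate such an expression would encode. The `Stats` class mirrors the counter names listed above; the minimum sample size of 20 and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Per-step counters, named after the stats.* fields listed above."""
    applied: int
    failed: int
    disconnected: int
    unhealthy: int
    received: int


MIN_SAMPLE = 20  # hypothetical: smallest step size considered meaningful


def small_step_guarded_gate(stats: Stats) -> bool:
    # Guard: with too few agents, one failure can swing the ratio wildly,
    # so the gate stays quiet until enough agents have received the push.
    if stats.received < MIN_SAMPLE:
        return False
    return stats.failed / stats.received > 0.20
```

With 5 agents and 5 failures the guard suppresses the gate; with 40 agents and 10 failures (25 %) the gate fires.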
Disabling gates
To effectively disable a gate, set its threshold to 100 %. The comparison is strict (>) and a ratio cannot exceed 100 %, so the gate never fires. Disabling all four gates is allowed but not recommended, because you lose the safety net.
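A short check of why a 100 % threshold never fires, assuming the strict comparison implied by the "> N %" defaults. `gate_fires` is a hypothetical helper, not part of the product.

```python
def gate_fires(numerator: int, denominator: int, threshold: float) -> bool:
    # Assumption: an empty step never fires a gate.
    if denominator == 0:
        return False
    # numerator <= denominator, so the ratio is at most 1.0 and can
    # never be strictly greater than a threshold of 1.0 (100 %).
    return numerator / denominator > threshold


# Even when every agent in the step fails, a 100 % threshold stays quiet.
worst_case_at_100 = gate_fires(10, 10, threshold=1.0)   # False
worst_case_at_20 = gate_fires(10, 10, threshold=0.20)   # True
```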
The audit log records the rollout's gate configuration at start; an operator looking back can see which gates were active for the rollout.