Health gates

This page describes the four built-in gates that auto-pause a rollout. Each table row below is the canonical reference; the narrative on User → Health gates explains the why.

apply-failed-ratio

Property | Value
Default threshold | > 20 %
Numerator | Agents in current step with RemoteConfigStatus = FAILED
Denominator | Agents in current step that received the push
Window | Step lifetime (resets per step)
Configurable | Yes, per rollout

The most directly actionable gate: a high apply-failed-ratio means the config does not work on those agents.
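As a rough illustration of the table above, the check can be sketched in Python. The function name and the handling of an empty denominator are assumptions made for this example, not part of the product.

```python
def apply_failed_ratio_fires(failed: int, received: int, threshold: float = 0.20) -> bool:
    """Return True when the share of FAILED agents in the current step exceeds the threshold.

    Assumption: a step in which no agent has received the push yet cannot fire.
    """
    if received == 0:
        return False
    # Strict ">" to match the documented "> 20 %" default.
    return failed / received > threshold
```

Under these assumptions, 4 failures out of 20 pushed agents (exactly 20 %) does not fire, while 5 out of 20 (25 %) does.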

disconnect-ratio

Property | Value
Default threshold | > 20 %
Numerator | Agents in current step that disconnected within WindowSeconds after push
Denominator | Agents in current step that received the push
Window | 5 minutes (configurable as WindowSeconds)
Configurable | Yes, per rollout

An agent that disconnects but comes back with the expected effective hash within the window does not count. An agent that disconnects and does not return within the window does.
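The counting rule above can be sketched as follows. The `AgentObservation` type and its field names are hypothetical, invented for this example. The sketch also assumes that an agent returning within the window with an unexpected hash still counts, a case this reference does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentObservation:
    # Hypothetical fields; times are seconds elapsed since the push.
    disconnected_at: Optional[float]  # None if the agent never disconnected
    reconnected_at: Optional[float]   # None if the agent has not come back
    effective_hash: Optional[str]     # hash reported after reconnecting


def counts_as_disconnect(obs: AgentObservation, expected_hash: str,
                         window_seconds: float = 300) -> bool:
    """Decide whether an agent contributes to the disconnect-ratio numerator."""
    # Only disconnects that happen inside the window are considered.
    if obs.disconnected_at is None or obs.disconnected_at > window_seconds:
        return False
    returned_in_window = (
        obs.reconnected_at is not None and obs.reconnected_at <= window_seconds
    )
    # A return within the window with the expected hash is excused.
    # Assumption: a return with any other hash is still counted.
    return not (returned_in_window and obs.effective_hash == expected_hash)
```

The default of 300 seconds mirrors the documented 5-minute window.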

effective-mismatch-ratio

Property | Value
Default threshold | > 20 %
Numerator | Agents in current step whose EffectiveHash != AssignedHash after TimeoutSeconds
Denominator | Agents in current step that received the push
Window | 10 minutes (configurable as TimeoutSeconds)
Configurable | Yes, per rollout

Catches the case where the agent says APPLIED but reports a different hash. This usually indicates that the agent applied the config and then crashed, fell back to on-disk config, or applied an older version.
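A minimal sketch of the per-agent check, assuming the comparison is only made once TimeoutSeconds has elapsed since the push. The function and parameter names are illustrative, not the product's API.

```python
def is_effective_mismatch(effective_hash: str, assigned_hash: str,
                          seconds_since_push: float,
                          timeout_seconds: float = 600) -> bool:
    """Return True when an agent should count toward effective-mismatch-ratio."""
    # Assumption: before the timeout the agent is still given time to converge.
    if seconds_since_push < timeout_seconds:
        return False
    return effective_hash != assigned_hash
```

The default of 600 seconds mirrors the documented 10-minute window.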

unhealthy-ratio

Property | Value
Default threshold | > 10 %
Numerator | Agents in current step reporting health.healthy = false after TimeoutSeconds
Denominator | Agents in current step that received the push
Window | 10 minutes (configurable as TimeoutSeconds)
Configurable | Yes, per rollout

The default is tighter than the others (10 % rather than 20 %) because unhealthy is a strong signal. An agent that goes from healthy to unhealthy after a config apply is a clear regression.

Evaluation order

When multiple gates would fire at the same time, the rollout pauses on the first gate, in strict insertion order on the gate list, whose running counter has crossed its threshold. The audit event records that single gate; the others are evaluated again on resume.
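The selection rule can be sketched as a single pass over the gate list. The function name and the data shapes (a list of name and threshold pairs, a dictionary of current ratios) are assumptions for illustration only.

```python
from typing import Optional


def first_fired_gate(gates: list[tuple[str, float]],
                     ratios: dict[str, float]) -> Optional[str]:
    """Return the name of the first gate, in list order, whose ratio exceeds its threshold."""
    for name, threshold in gates:  # list order stands in for insertion order
        if ratios.get(name, 0.0) > threshold:
            return name  # only this gate would be recorded in the audit event
    return None


# Hypothetical gate list, using the documented defaults in the order of this page.
gates = [
    ("apply-failed-ratio", 0.20),
    ("disconnect-ratio", 0.20),
    ("effective-mismatch-ratio", 0.20),
    ("unhealthy-ratio", 0.10),
]
ratios = {"apply-failed-ratio": 0.05, "disconnect-ratio": 0.30, "unhealthy-ratio": 0.50}
paused_on = first_fired_gate(gates, ratios)
```

Here both disconnect-ratio and unhealthy-ratio are over their thresholds, but `paused_on` is `"disconnect-ratio"` because it comes earlier in the list.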

Custom gate expressions

Custom gates use the same DSL as Custom policies. They are an opt-in extension; built-in gates cover the common cases.

A custom gate's expression is evaluated against the per-step counters (stats.applied, stats.failed, stats.disconnected, stats.unhealthy, stats.received). Example:

stats.failed / stats.received > 0.1 and stats.received >= 20

(The second clause keeps the gate from firing when the step is too small to be statistically meaningful.)
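To see how the example expression behaves, here is a Python transcription of it. This is not the DSL evaluator: the function name is made up, and `SimpleNamespace` merely stands in for the per-step counters. The size check is placed first so that Python's short-circuiting `and` skips the division when `stats.received` is 0. Whether the DSL itself guards against division by zero is not stated in this reference.

```python
from types import SimpleNamespace


def custom_gate_fires(stats) -> bool:
    """Python equivalent of: stats.failed / stats.received > 0.1 and stats.received >= 20"""
    # Size check first: avoids ZeroDivisionError on an empty step in this sketch.
    return stats.received >= 20 and stats.failed / stats.received > 0.1


small_step = SimpleNamespace(applied=12, failed=3, disconnected=0, unhealthy=0, received=15)
large_step = SimpleNamespace(applied=26, failed=4, disconnected=0, unhealthy=0, received=30)
```

With these counters, `custom_gate_fires(small_step)` is `False` even though 20 % of the agents failed, because only 15 agents received the push. `custom_gate_fires(large_step)` is `True`, since 4 of 30 is about 13 %.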

Disabling gates

Set the threshold to 100 % to effectively disable a gate. Thresholds are strict comparisons (>), and a ratio cannot exceed 100 %, so the gate will never fire. Disabling all four gates is allowed but not recommended, because you lose the safety net.

The audit log records the rollout's gate configuration at start; an operator looking back can see which gates were active for the rollout.