Canary step-up

A canary step-up rollout is a percentage rollout with a timed dwell between steps. Steps advance automatically, but each dwell leaves time to observe the change before it spreads further.

When to use

This is the default safe rollout strategy for any production fleet. A change reaches the cohort in waves; an operator only intervenes if a gate fires.

Configuring

In the rollout wizard:

  • Strategy: Canary step-up.
  • Steps: a list of (percentage, dwell) pairs, for example:
      • 5 %, 5 m
      • 25 %, 10 m
      • 50 %, 30 m
      • 100 %, end

The dwell starts when the step's batch reaches "all Applied or Failed". After the dwell, if no gate has fired, the next step starts automatically.
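
The advance rule above can be sketched as a small decision function. This is a minimal illustration, not the product's implementation: the `Step` type, the status strings, and `next_action` are hypothetical names, and the percentages are assumed to be cumulative shares of the cohort.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Step:
    percent: int                  # assumed: cumulative share of the cohort
    dwell_minutes: Optional[int]  # None marks the final step ("end")


# The example plan from the wizard settings above.
PLAN = [Step(5, 5), Step(25, 10), Step(50, 30), Step(100, None)]

TERMINAL = {"Applied", "Failed"}


def batch_settled(statuses: Sequence[str]) -> bool:
    """True once every agent in the step's batch is Applied or Failed."""
    return all(s in TERMINAL for s in statuses)


def next_action(step: Step, statuses: Sequence[str],
                minutes_since_settled: int, gate_fired: bool) -> str:
    """Decide what a (hypothetical) rollout controller does next."""
    if gate_fired:
        return "pause"            # a gate always wins over the dwell timer
    if not batch_settled(statuses):
        return "wait_for_batch"   # the dwell clock has not started yet
    if step.dwell_minutes is None:
        return "complete"         # final step: nothing left to advance to
    if minutes_since_settled < step.dwell_minutes:
        return "dwell"
    return "advance"
```

Note that the dwell clock is measured from the moment the batch settles, not from the moment the step starts, so a slow batch delays the next step by more than the configured dwell.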

What dwell time buys you

  • Time for downstream alerting to fire if the new config breaks the data shape.
  • Time for the agent's exporter back-pressure to surface (a stuck exporter often takes minutes, not seconds, to back up).
  • Time for the operator to spot the change in the live metrics view and stop it manually.

A 5-minute first-step dwell is the minimum we recommend. 30 minutes between mid-rollout steps is common in regulated environments.
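
To make the recommendation concrete, here is a hedged sketch of a plan check you could run before starting a rollout. The `check_plan` and `total_dwell_minutes` helpers are hypothetical and not part of the product; they only encode the 5-minute first-step minimum stated above plus two sanity checks that are assumptions (percentages strictly increase and the last step reaches 100 %).

```python
from typing import List, Optional, Tuple

# (percent, dwell in minutes); a dwell of None marks the final step.
Plan = List[Tuple[int, Optional[int]]]

MIN_FIRST_DWELL_MINUTES = 5  # the recommended minimum from the text above


def check_plan(plan: Plan) -> List[str]:
    """Return human-readable warnings; an empty list means no concerns."""
    warnings = []
    percents = [p for p, _ in plan]
    if percents != sorted(set(percents)):
        warnings.append("percentages should strictly increase")
    if not percents or percents[-1] != 100:
        warnings.append("last step should reach 100 %")
    if plan and plan[0][1] is not None and plan[0][1] < MIN_FIRST_DWELL_MINUTES:
        warnings.append("first-step dwell is below the recommended 5 minutes")
    return warnings


def total_dwell_minutes(plan: Plan) -> int:
    """Sum of all configured dwells, ignoring the final step."""
    return sum(d for _, d in plan if d is not None)
```

For the example plan (5 %, 25 %, 50 %, 100 %), the dwells add up to 45 minutes, so the rollout takes at least that long plus the time each batch needs to settle.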

Operator interactions

During dwell:

  • Advance now — skip the rest of the dwell.
  • Extend dwell — add more time to this step.
  • Pause — stop the auto-advance entirely. Resume when ready.
  • Rollback — push the previous version to all touched agents.
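
The four actions can be modelled as operations on the state of the current dwell. The sketch below is illustrative only: the `Rollout` class and its fields are hypothetical, and treating a rollback as also pausing the rollout is an assumption, not documented behaviour.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rollout:
    dwell_remaining: int                 # minutes left in the current dwell
    paused: bool = False
    touched_agents: List[str] = field(default_factory=list)
    reverted_agents: List[str] = field(default_factory=list)

    def advance_now(self) -> None:
        """Skip the rest of the dwell."""
        self.dwell_remaining = 0

    def extend_dwell(self, minutes: int) -> None:
        """Add more time to the current step's dwell."""
        if minutes <= 0:
            raise ValueError("extension must be positive")
        self.dwell_remaining += minutes

    def pause(self) -> None:
        """Stop auto-advance until resume() is called."""
        self.paused = True

    def resume(self) -> None:
        self.paused = False

    def rollback(self) -> None:
        """Mark every touched agent for the previous version."""
        self.paused = True  # assumption: a rollback also halts auto-advance
        self.reverted_agents = list(self.touched_agents)

    def should_advance(self) -> bool:
        return not self.paused and self.dwell_remaining == 0
```

In this model a paused rollout never advances, even if the dwell has already run out, which matches the description of Pause as stopping auto-advance entirely.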

Combining with health gates

Health gates (see Health gates) auto-pause a rollout regardless of strategy. The canary strategy works well with them:

  • A gate firing during step 2 pauses the rollout.
  • Resume continues into the next step, that is, the step that follows the paused step's intended target.
  • The rollout's audit history records both the gate fire and the resume, with the actor.
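
The gate, resume, and audit behaviour listed above might be modelled as follows. This is a sketch under stated assumptions: `GatedRollout`, the event names, and the audit tuple layout are hypothetical, the gate's name stands in as the "actor" for a gate fire, and resuming is read as moving on to the step after the paused one.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GatedRollout:
    step: int = 1
    paused: bool = False
    # Each audit row is (event, actor, step at the time of the event).
    audit: List[Tuple[str, str, int]] = field(default_factory=list)

    def gate_fired(self, gate_name: str) -> None:
        """A health gate pauses the rollout and leaves an audit row."""
        self.paused = True
        self.audit.append(("gate_fired", gate_name, self.step))

    def resume(self, actor: str) -> None:
        """An operator resumes; the rollout continues into the next step."""
        if not self.paused:
            raise RuntimeError("rollout is not paused")
        self.paused = False
        self.audit.append(("resumed", actor, self.step))
        self.step += 1
```

A gate firing during step 2 followed by a resume from an operator would leave two audit rows for step 2 and put the rollout in step 3.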

Pre-step preview

The wizard can show a preview of which agents will be in step N. This is useful when group membership is dynamic and you want to confirm the batch before it starts. It is the same preview you see for percentage rollouts.

Why it is the default

  • Cognitive load is low: pick percentages and dwell, push start, forget about it until something interesting happens.
  • Auditability is high: every step is its own audit row; resume and rollback events sit alongside gate fires; reviewers see the full story.
  • The blast radius is bounded: a bad config caught at step 1 has damaged 5 %, not 100 %.
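
The blast-radius point can be worked through with simple arithmetic. The sketch below assumes a hypothetical fleet of 2,000 agents, cumulative step percentages, and rounding up to whole agents; none of these come from the product itself.

```python
from typing import List, Tuple


def exposure(fleet_size: int, percents: List[int]) -> List[Tuple[int, int, int]]:
    """Return (percent, newly touched agents, cumulative touched agents) per step."""
    rows = []
    previous = 0
    for pct in percents:
        # Integer ceiling of fleet_size * pct / 100 (assumed rounding rule).
        cumulative = (fleet_size * pct + 99) // 100
        rows.append((pct, cumulative - previous, cumulative))
        previous = cumulative
    return rows


for pct, new, total in exposure(2000, [5, 25, 50, 100]):
    print(f"{pct:>3} %: +{new} agents, {total} touched in total")
```

With these numbers, a bad config caught during the first dwell has reached 100 agents; the remaining 1,900 never receive it.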