Run a canary rollout

You will roll a configuration out to a canary group with a step-up schedule, watch a synthetic failure trip the health gate, and roll back to the previous version. This is the workflow you will use most days as an Operator.

Time: ~ 15 minutes.

Prerequisites

1. Make the change you want to roll out

Open the configuration, edit the draft, and change something observable. The easiest change is to add the debug exporter to the traces pipeline:

 service:
   pipelines:
     traces:
       receivers: [otlp]
       processors: [batch, resource/edge]
-      exporters: [otlp/central]
+      exporters: [otlp/central, debug]

Save and Publish as v2.

2. Create the rollout

  1. Rollouts → New rollout.
  2. Configuration: pick edge-collector v2.
  3. Group: pick tutorial-static (or your dynamic group).
  4. Strategy: Canary step-up.
  5. Steps:
       • 33 % → dwell 60 seconds
       • 66 % → dwell 60 seconds
       • 100 % → end
  6. Health gates: leave on the defaults.
  7. Save, but do not start yet. The rollout sits in Pending.

The wizard shows a per-batch preview: which agents will be touched in each step, and which capability flag will be checked (AcceptsRemoteConfig).
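
To make the per-batch preview concrete, here is a minimal sketch of how a 33 / 66 / 100 % step-up schedule could map onto a group. The function name `plan_batches`, the agent names, and the round-up rule are assumptions for illustration; the product's actual rounding and agent ordering may differ.

```python
def plan_batches(agent_ids, percentages):
    """Split agents into batches for a cumulative step-up schedule.

    After step i, ceil(percentages[i] % of the group) agents are covered.
    The round-up rule is an assumption, not documented behaviour.
    """
    total = len(agent_ids)
    batches = []
    covered = 0
    for pct in percentages:
        target = (total * pct + 99) // 100  # integer ceiling of total * pct / 100
        batches.append(agent_ids[covered:target])
        covered = max(covered, target)
    return batches


# Hypothetical six-agent group.
agents = ["agent-1", "agent-2", "agent-3", "agent-4", "agent-5", "agent-6"]
for step, batch in enumerate(plan_batches(agents, [33, 66, 100]), start=1):
    print(f"step {step}: {batch}")
```

With a very small group, later steps can be empty because earlier steps already rounded up to cover every agent.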

3. Start it

Click Start. The rollout transitions to InProgress. Step 1 (33 % of the group) gets the new config:

  • The agent receives a ServerToAgent frame with the new config hash.
  • It applies the YAML, reports RemoteConfigStatus = APPLYING, then APPLIED.
  • It re-reports its EffectiveConfig — the hash now matches the AssignedHash.

The Rollout page shows live progress per agent. Once all batch-1 agents are green and the dwell timer ends, batch 2 starts automatically.
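
The "green" condition and the batch hand-over described above can be sketched as follows. This is an illustrative model, not the server's code: the `AgentReport` type, its field names, and the choice to pass the elapsed dwell time in as a parameter are assumptions.

```python
from dataclasses import dataclass


@dataclass
class AgentReport:
    agent_id: str
    remote_config_status: str  # e.g. "APPLYING" or "APPLIED"
    effective_hash: str        # hash of the config the agent is actually running


def agent_is_green(report, assigned_hash):
    """An agent counts as green once it reports APPLIED and its hash matches."""
    return (report.remote_config_status == "APPLIED"
            and report.effective_hash == assigned_hash)


def batch_can_advance(reports, assigned_hash, dwell_elapsed_s, dwell_s=60):
    """The next batch starts only when every agent is green and the dwell is over."""
    all_green = all(agent_is_green(r, assigned_hash) for r in reports)
    return all_green and dwell_elapsed_s >= dwell_s


# Hypothetical batch of two agents; one is still applying the config.
reports = [
    AgentReport("agent-1", "APPLIED", "abc123"),
    AgentReport("agent-2", "APPLYING", "old000"),
]
print(batch_can_advance(reports, "abc123", dwell_elapsed_s=75))
```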

4. Trip the health gate

To see the safety net work, stop one of the canary agents during step 1:

docker stop tutorial-agent

The disconnect-ratio health gate fires once the disconnect ratio exceeds 20 % within a 5-minute (300 s) window. The rollout transitions to Paused with the reason field set to:

Gate disconnect-ratio exceeded: 50 % > 20 % within 300 s.

The agent panel shows which gate fired, on which agents, and links to their detail pages.
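
The gate arithmetic can be sketched as a sliding-window check. This is a rough model under stated assumptions: the function names are hypothetical, and the ratio is taken over the agents in the current batch, which the tutorial does not confirm.

```python
def disconnect_ratio_pct(disconnected_at, batch_size, now_s, window_s=300):
    """Percentage of the batch that disconnected within the last window_s seconds."""
    recent = [t for t in disconnected_at.values() if 0 <= now_s - t <= window_s]
    return 100 * len(recent) / batch_size


def evaluate_gate(disconnected_at, batch_size, now_s, threshold_pct=20, window_s=300):
    """Return a pause reason if the gate trips, otherwise None."""
    ratio = disconnect_ratio_pct(disconnected_at, batch_size, now_s, window_s)
    if ratio > threshold_pct:
        return (f"Gate disconnect-ratio exceeded: {ratio:.0f} % > "
                f"{threshold_pct} % within {window_s} s.")
    return None


# One of two batch agents stopped 30 seconds ago (timestamps are illustrative).
reason = evaluate_gate({"tutorial-agent": 100.0}, batch_size=2, now_s=130.0)
print(reason)
```

The comparison is strict, matching the "exceeds 20 %" wording: a batch sitting at exactly 20 % does not trip the gate in this sketch.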

5. Decide

You have three options:

Action    Effect
Resume    Re-enters InProgress. Use this if you fixed the underlying issue.
Rollback  Re-assigns the previous published version to the agents that already received the new one, then stops. Use this when the change itself is at fault.
Abort     Stops without rolling back. Use only if you understand the partial-state consequences.
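
One way to picture the three options is as transitions in a small state machine. In the sketch below, Pending, InProgress, and Paused come from this tutorial, and Completed is inferred from the Rollout.Completed audit event. The terminal state names RolledBack and Aborted, the action names, and the restriction of these actions to the Paused state are assumptions for illustration.

```python
# (current state, action) -> next state
TRANSITIONS = {
    ("Pending", "start"): "InProgress",
    ("InProgress", "gate_fired"): "Paused",
    ("InProgress", "finish"): "Completed",
    ("Paused", "resume"): "InProgress",
    ("Paused", "rollback"): "RolledBack",  # hypothetical state name
    ("Paused", "abort"): "Aborted",        # hypothetical state name
}


def next_state(state, action):
    """Look up the next rollout state, rejecting actions that do not apply."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"action {action!r} is not allowed in state {state!r}") from None


# The path this tutorial follows: start, gate fires, resume, finish.
state = "Pending"
for action in ["start", "gate_fired", "resume", "finish"]:
    state = next_state(state, action)
    print(f"{action} -> {state}")
```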

For this tutorial, restart the dead agent:

docker start tutorial-agent

Wait for it to reconnect (Fleet shows it green again), then Resume. The rollout completes through batches 2 and 3.

6. Audit the rollout

Audit log → filter by Rollout. You see the full timeline:

  • Rollout.Created — by you, with the diff between v1 and v2.
  • Rollout.Started.
  • Per-batch Batch.Started and Batch.Completed.
  • Rollout.Paused — with the gate that fired.
  • Rollout.Resumed.
  • Rollout.Completed.

Every audit row carries the actor (you), the entity, and a JSON before/after snapshot.
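
If you export audit rows for your own tooling, the shape described above might be modelled like this. It is a sketch only: the `AuditRow` type, its field names, the entity identifier `rollout/example`, and the sample snapshots are all hypothetical, not the product's export schema.

```python
from dataclasses import dataclass


@dataclass
class AuditRow:
    event: str    # e.g. "Rollout.Paused"
    actor: str    # who performed the action
    entity: str   # what the action applied to
    before: dict  # JSON snapshot before the change
    after: dict   # JSON snapshot after the change


def timeline(rows, entity):
    """Events for one entity, in the order the rows were recorded."""
    return [row.event for row in rows if row.entity == entity]


def changed_fields(row):
    """Keys whose values differ between the before and after snapshots."""
    keys = set(row.before) | set(row.after)
    return sorted(k for k in keys if row.before.get(k) != row.after.get(k))


# Illustrative rows; the snapshot contents are made up for the example.
rows = [
    AuditRow("Rollout.Started", "you", "rollout/example",
             {"state": "Pending"}, {"state": "InProgress"}),
    AuditRow("Rollout.Paused", "you", "rollout/example",
             {"state": "InProgress"}, {"state": "Paused", "reason": "disconnect-ratio"}),
]
print(timeline(rows, "rollout/example"))
print(changed_fields(rows[1]))
```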

7. Roll back for real

To practise the rollback path:

  1. Rollouts → New rollout.
  2. Configuration: pick edge-collector v1.
  3. Group: same group.
  4. Strategy: Batch, batch size = group size.
  5. Start.

This is rollback as a normal rollout. There is no special "undo" mode, because rolling back a config is the same operation as rolling forward to an older version. The result is predictable, audited, and idempotent.
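
The idempotency claim can be illustrated with a toy model in which a rollout only touches agents whose assigned version differs from the target. The function names and the in-memory dictionary are assumptions made for the example, not how the server stores assignments.

```python
def agents_needing_change(assigned, target_version):
    """Agents whose currently assigned version differs from the target."""
    return sorted(a for a, v in assigned.items() if v != target_version)


def apply_rollout(assigned, target_version):
    """Assign target_version to every agent that needs it; return who changed."""
    changed = agents_needing_change(assigned, target_version)
    for agent_id in changed:
        assigned[agent_id] = target_version
    return changed


# Hypothetical group after the v2 rollout completed.
assigned = {"agent-1": "v2", "agent-2": "v2", "agent-3": "v2"}

first = apply_rollout(assigned, "v1")   # the rollback: every agent moves to v1
second = apply_rollout(assigned, "v1")  # repeating it changes nothing
print(first)
print(second)
```

Running the same rollout a second time is a no-op in this model, which is the sense in which rolling back is idempotent.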

What you learned

  • How a canary step-up rollout walks through batches with dwell timers.
  • How health gates auto-pause and which one fires when.
  • The difference between Resume, Rollback and Abort.
  • How to read the rollout audit trail.

Next