zester

Update Architecture

This page documents the current self-update execution path from CLI request to node-local binary swap.

End-to-End Flow

Control Plane Subjects

Source of truth: pkg/bus/subjects.go.

Subject | Direction | Use
zester.update.rollout.start | CLI -> master (request/reply) | Start rollout
zester.update.rollout.abort | CLI -> master (request/reply) | Abort rollout
zester.update.cmd.<id> | master -> watchdog (request/reply) | Node update commands (prepare, apply, confirm, rollback, status)

Watchdog status flows through update-status KV heartbeats, not a dedicated event subject.
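As a rough illustration of the subject layout above, the sketch below builds the per-node command subject from a node ID. The constant names, the CmdSubject helper and its validation rule are assumptions made for this example; pkg/bus/subjects.go remains the source of truth.

```go
package main

import (
	"fmt"
	"strings"
)

// Subject strings come from the table above. The constant names are
// illustrative and may not match pkg/bus/subjects.go.
const (
	SubjectRolloutStart = "zester.update.rollout.start"
	SubjectRolloutAbort = "zester.update.rollout.abort"
	subjectCmdPrefix    = "zester.update.cmd."
)

// CmdSubject builds zester.update.cmd.<id> for one node.
// Rejecting empty IDs and IDs containing whitespace is a hypothetical
// rule for this sketch, not documented behavior.
func CmdSubject(nodeID string) (string, error) {
	if nodeID == "" || strings.ContainsAny(nodeID, " \t\r\n") {
		return "", fmt.Errorf("invalid node id %q", nodeID)
	}
	return subjectCmdPrefix + nodeID, nil
}

func main() {
	subject, err := CmdSubject("web01")
	fmt.Println(subject, err)
}
```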

Rollout Controller Behavior

Source of truth: pkg/update/rollout.go, wiring in cmd/zester-master/main.go.

  • Rollouts are persisted in update-rollouts KV with CAS revision control.
  • Node list is resolved from update-status keys.
  • Targets are standard targeting expressions (glob, E@ PCRE, L@ list, G@ fact, compound), evaluated against the set of nodes currently reporting in update-status for the component.
  • Pre-flight eligibility checks run before batching (skipped when the status KV is unavailable; they also apply to dry runs):
    • Degraded exclusion: nodes whose NodeStatus.Degraded is set are excluded, and an Info log (excluding degraded nodes from rollout) lists them. If exclusion empties a non-empty target set, the rollout fails with the error no eligible nodes (N degraded excluded).
    • Min-protocol gating: when the target manifest declares MinProtocol > 0, every remaining node's NodeStatus.Protocol is checked. If any node is below MinProtocol, the whole rollout is refused with an error naming each incompatible node as node (protocol X < Y). Nodes reporting Protocol 0 (the field was never set, which is treated as compatible everywhere else) and nodes with no status record therefore fail the check whenever MinProtocol > 0; a manifest with MinProtocol 0 never refuses any node. Watchdogs stamp NodeStatus.Protocol with proto.ProtocolVersion (see pkg/proto/doc.go for the additive-only wire policy that MinProtocol gates).
  • Batch execution runs commands in parallel per batch:
    • prepare on batch
    • apply on batch
    • wait soak_time (each node's watchdog independently polls the child's readiness during this window and auto-rolls back on sustained failure — see Watchdog Runtime)
    • confirm on batch — must arrive before the watchdog's confirm-deadline (3× soak time, floored at 5 minutes), or the node auto-rolls back
    • wait batch_pause between batches
  • max_failed is global per rollout; once reached, rollout transitions to aborted.
  • Abort does not automatically roll back already-updated nodes. Abort works from any master: if the rollout is not locally active, the abort is CAS-written to the rollout record and the owning driver stops on its next save or abort-poll. Aborting an already-aborted rollout is idempotent.

Rollout Resume (Driver Failover)

A rollout survives the death of the master driving it:

  • The rollout record carries DriverID and DriverHeartbeat (msgpack driver_id / driver_heartbeat, additive). The driving controller refreshes the heartbeat on every batch step and at least every 10s while waiting or soaking.
  • Every master runs the resume loop (RolloutController.RunResumeLoop, scan period DefaultRolloutResumeInterval = 60s). It scans update-rollouts for non-terminal, non-dry-run rollouts whose heartbeat is older than DefaultRolloutStaleAfter (60s) and CAS-adopts each — writing its own driver ID and a fresh heartbeat; losing the CAS race just skips. A zero heartbeat (never written) reads as maximally stale and is adopted immediately.
  • A resumed rollout re-issues commands to its current batch starting at prepare. Nodes already past prepare (staged/soaking) reject the duplicate and are recorded as failed; these failures are bounded by max_failed and backstopped by the watchdog confirm-deadline auto-rollback. Adopted aborting records just finish the abort.
  • A master shutting down (context cancelled) stops driving without aborting, leaving the record for another master's resume loop. A mid-run persistence failure, by contrast, loudly aborts the rollout.
  • Brief double-drive during adoption is safe by design: the watchdog handler rejects duplicate prepare/apply commands, and a superseded driver stops on its next CAS save.

Rollout Persistence Model

Each rollout state tracks:

  • config (version, component, target and timing controls)
  • deterministic batches
  • per-node result map (prepared, applied, confirmed, failed)
  • current batch index
  • failed count
  • start/finish timestamps
  • state (created, rolling, completed, aborting, aborted, etc.)
  • driver identity and liveness (driver_id, driver_heartbeat) for resume adoption

Scope Notes

  • This architecture is for watchdog-mediated binary swaps, not state.apply.
  • Rollout targeting only sees nodes currently reporting to update-status — an offline watchdog is invisible to a rollout.
