Timeouts and Cancellation
Zester provides timeout and cancellation mechanisms to handle unresponsive peels, long-running jobs, and operational control over in-flight work. Every job has a configurable timeout, and any running job can be cancelled by the operator or programmatically through the Manager API.
Source: pkg/job/watcher.go, pkg/job/manager.go
Job Timeout
Every job has a configurable Timeout field (time.Duration) that limits how long the master waits for all target peels to return results. At dispatch the master converts it into an absolute Deadline on the job record (Created + Timeout, anchored to the master's clock — client-supplied values are never trusted), and the Watcher's timer is capped at time.Until(Deadline).
The deadline is what makes failover honest: a watcher recovered after a master crash honors the remaining time budget instead of restarting the full timeout, and a job whose deadline has already passed is finalized immediately from the returns collected so far. A job record read back with a zero deadline is defensively normalized to Created + Timeout before use.
Default Timeout
If no timeout is specified when creating a job, a default of 60 seconds is applied when the deadline is computed.
CLI default differs from Watcher default
The Watcher's built-in default is 60 seconds, but the state.apply CLI command defaults to 5 minutes. The CLI default is intentionally longer because state applications typically involve package installs and service restarts. The 60-second Watcher default is a safety net for programmatic usage where no timeout is specified.
Setting Timeouts
Use the --timeout flag to set a custom timeout from the CLI:
# Default 5-minute timeout (CLI default for state.apply)
zester 'web*' state.apply webserver
# Custom 10-minute timeout for complex states
zester 'web*' state.apply webserver --timeout 10m
# Short timeout for quick connectivity checks
zester 'web*' state.apply test.ping --timeout 30s
# Long timeout for large-scale deployments
zester '*' state.apply base --timeout 1h
The timeout value accepts Go duration strings: 30s, 5m, 1h, 2h30m, etc.
Timeout Behavior
When the timer fires, the Watcher stops waiting for additional returns and proceeds to finalization. The final status depends on how many returns were collected before the timeout expired.
Watcher starts
|
+-- Subscribe to ack + return subjects
|
+-- Start timer: time.NewTimer(timeout)
|
+-- Arm one-shot ack-window timer (fresh dispatches only, default 5s)
|
+-- select {
case <-ctx.Done(): // All returns received or cancelled
case <-timer.C: // Timeout expired
case <-ackC: // Ack window: re-dispatch once to silent targets
}
|
+-- Finalize: determine status, persist results
The ack-window case never terminates the loop -- it fires exactly once, re-sending the ExecRequest to targets that have neither acked nor returned (see Acks and Silent-Target Re-Dispatch).
Final Status After Timeout
| Condition | Status | Description |
|---|---|---|
| All targets returned successfully before timer fired | complete | Timer never fires because ctx.Done() triggers first |
| All targets returned, some with errors | failed | All peels responded but one or more reported failures |
| Cancelled before all targets returned | canceled | Operator or API cancelled the job |
| Some targets returned, timer fired | partial | Partial results are stored; missing peels are identifiable |
| No targets returned, timer fired | timeout | No peels responded within the timeout window |
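The table can be read as a small decision function. The sketch below is one possible encoding, assuming a hypothetical `finalStatus` helper that works from simple counts; the order in which the cases are checked (a fully returned job is never reported as canceled) is an assumption drawn from the "before all targets returned" wording, not from the source code.

```go
package main

import "fmt"

// finalStatus mirrors the status table. targets is the number of targeted
// peels, returned the number of returns collected, failed how many of those
// returns reported an error, and canceled whether the job was cancelled.
func finalStatus(targets, returned, failed int, canceled bool) string {
	switch {
	case returned >= targets && failed == 0:
		return "complete"
	case returned >= targets:
		return "failed"
	case canceled:
		return "canceled"
	case returned > 0:
		return "partial"
	default:
		return "timeout"
	}
}

func main() {
	fmt.Println(finalStatus(3, 3, 0, false)) // every peel returned successfully
	fmt.Println(finalStatus(3, 3, 1, false)) // every peel returned, one failed
	fmt.Println(finalStatus(3, 1, 0, true))  // cancelled while peels were outstanding
	fmt.Println(finalStatus(3, 2, 0, false)) // timer fired with some returns
	fmt.Println(finalStatus(3, 0, 0, false)) // timer fired with no returns
}
```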
Partial results are still stored
When a job completes with partial status, the returns that were received are still persisted in the job-returns KV bucket. This data is not discarded -- you can retrieve and act on partial results even though the job did not fully complete.
Identifying Missing Peels
Compare the job's target list with the returned peel IDs:
$ zester job show <jid>
{
"jid": "2hPx2Kd8VnR7YmWqTz4PLsCfNjA",
"function": "state.apply",
"targets": ["web-01", "web-02", "web-03"],
"status": "partial",
...
}
Returns:
PEEL SUCCESS DURATION
web-01 true 12.3s
web-02 true 14.1s
In this example, web-03 did not return. It either never received the job, is still executing, or crashed during execution. Check peel connectivity and logs to diagnose.
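When the target list is long, the comparison is easier to script than to eyeball. The sketch below computes the set difference between targets and returned peel IDs; `missingPeels` is a hypothetical helper, and how the two lists are obtained from the job output is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// missingPeels returns the targets that have no matching return, sorted by ID.
func missingPeels(targets, returned []string) []string {
	seen := make(map[string]bool, len(returned))
	for _, id := range returned {
		seen[id] = true
	}
	var missing []string
	for _, id := range targets {
		if !seen[id] {
			missing = append(missing, id)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	targets := []string{"web-01", "web-02", "web-03"}
	returned := []string{"web-01", "web-02"}
	fmt.Println(missingPeels(targets, returned))
}
```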
Cancellation
Jobs can be cancelled by the operator via the CLI or programmatically through the Manager API.
CLI Cancellation
zester job kill <jid>
$ zester job kill 2hPx2Kd8VnR7YmWqTz4PLsCfNjA
Cancel signal sent for job 2hPx2Kd8VnR7YmWqTz4PLsCfNjA
Manager API Cancellation
err := manager.Cancel(ctx, jid)
The Manager.Cancel method performs three operations:
- Signal the Watcher -- Looks up the active Watcher for the given JID and calls watcher.Cancel(), which invokes the context cancel function to stop the Watcher's select loop.
- Publish a cancel signal to peels -- Publishes to zester.job.<jid>.cancel. Each peel executing this job subscribes to this subject and aborts its execution context when the signal arrives.
- Publish a cancel event -- Publishes an event to zester.job.<jid>.cancel with EventCanceled type and "cancelled by user" data. This is captured by the job-events stream for audit.
Watcher Cancellation Internals
The Watcher's Cancel() method triggers the internal context cancellation function:
func (w *Watcher) Cancel() {
	w.mu.Lock()
	w.canceled = true
	fn := w.cancelFunc
	w.mu.Unlock()
	if fn != nil {
		fn()
	}
}
This causes the Watcher's select to unblock on ctx.Done(), proceeding to finalization. Any returns collected before cancellation are preserved.
Cancellation propagates to peels
When a job is dispatched, each peel creates a cancellable execution context and subscribes to zester.job.<jid>.cancel. When the cancel signal arrives, the peel's context is cancelled, aborting the running state modules. However, some operations (like in-progress package installations) may not be safely interruptible and will complete before the peel stops.
Manager Shutdown
When the master process shuts down gracefully, the Manager.Shutdown() method detaches all active Watchers in parallel and waits for them to finish:
func (m *Manager) Shutdown() {
	// Signal all watchers to detach
	for _, w := range watchers {
		w.Detach()
	}
	// Wait for each watcher's Watch goroutine to persist and exit
	for _, w := range watchers {
		<-w.Done()
	}
	// Clear the watchers map
}
Detaching is deliberately not finalizing: the peels are still executing, so writing a terminal status would be a lie. Instead:
- Collected returns are persisted -- Each Watcher flushes its per-peel returns to the job-returns KV bucket before exiting.
- In-flight jobs stay running -- The job keeps its current epoch so the orphan scanner on a surviving master can reclaim it once this master's heartbeat expires (~55s), seed the recovery watcher from the persisted per-peel keys plus a job-events stream replay, and collect the outstanding returns against the remaining deadline.
- No goroutine leaks -- All Watcher goroutines are cleanly terminated and awaited.
A detached watcher whose targets have all already returned finalizes normally -- the job is genuinely done. Operator cancellation (Manager.Cancel) still finalizes as canceled.
Ungraceful shutdown is recovered by surviving masters
If the master process is killed with SIGKILL (or crashes), active Watchers do not get a chance to finalize. In a multi-master deployment the orphan scanner on a surviving master reclaims the jobs (~55s worst case): per-peel returns already persisted to job-returns are read back, and the job-events stream is replayed to recover any returns that missed KV before the recovery watcher finalizes. In a single-master deployment the jobs remain running in KV until the same master restarts and reclaims them, or the 7-day TTL expires.
Best Practices
Setting Realistic Timeouts
| Scenario | Suggested Timeout | Rationale |
|---|---|---|
| test.ping | 30s | Simple connectivity check |
| fact.refresh | 1m | Fact collection is fast |
| cmd.run (simple) | 2m | Short commands like uptime or df |
| state.apply (small) | 5m | Default, covers most state applications |
| state.apply (large) | 15-30m | Complex states with package installs |
| pkg.installed (many packages) | 10-20m | Package downloads and installs take time |
| Large-scale rollout | 30-60m | Hundreds of peels with broad target expressions |
Monitoring Partial Returns
When jobs frequently complete with partial status, investigate:
- Network connectivity -- Are specific peels consistently unreachable? Check NATS connection status.
- Peel performance -- Are some peels slower than others? Compare Duration values in returns.
- Timeout adequacy -- Is the timeout too short for the operation? Increase it and re-run.
Retry Patterns
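Zester does not automatically retry jobs that end as partial, failed, or timeout. The only automatic re-send is the one-shot ack-window re-dispatch to silent targets while a job is still running. Retries are intentionally manual to prevent cascading failures.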
# Original job targeted all web servers
zester 'web*' state.apply webserver --timeout 10m
# Check which peels returned
zester job show <jid>
# Retry specific missing peels with a longer timeout
zester 'L@web-03,web-07' state.apply webserver --timeout 20m
Retry in smaller target slices
For large retries, split targets into smaller lists or narrower expressions to reduce load and isolate failures.
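Splitting a retry list into slices is mechanical enough to script. The sketch below groups peel IDs into list expressions using the L@ syntax from the retry example; `sliceTargets` and the slice size are hypothetical choices, and the printed commands are only assembled as strings, not executed.

```go
package main

import (
	"fmt"
	"strings"
)

// sliceTargets splits peel IDs into list expressions of at most size IDs each.
func sliceTargets(peels []string, size int) []string {
	if size < 1 {
		size = 1
	}
	var exprs []string
	for start := 0; start < len(peels); start += size {
		end := start + size
		if end > len(peels) {
			end = len(peels)
		}
		exprs = append(exprs, "L@"+strings.Join(peels[start:end], ","))
	}
	return exprs
}

func main() {
	missing := []string{"web-03", "web-07", "web-11", "web-12", "web-19"}
	for _, expr := range sliceTargets(missing, 2) {
		fmt.Printf("zester '%s' state.apply webserver --timeout 20m\n", expr)
	}
}
```

Running the slices one after another, and checking each job before starting the next, keeps a recurring failure confined to a small group of peels.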