zester
GuidesJobs

Tracking and Returns

Every dispatched job in Zester is tracked from creation through completion. The tracking system uses a per-job Watcher goroutine to monitor returns over NATS, persists all state transitions in JetStream KV, and publishes events to a durable stream for replay and audit.

Source: pkg/job/watcher.go, pkg/job/manager.go


Job Status Lifecycle

Jobs move through a defined set of statuses during their lifecycle:

Status Definitions

StatusDescription
pendingJob created but not yet dispatched to targets
claimedJob claimed by a master via KV CAS but ExecRequests not yet sent -- the running status is persisted before anything is published, so a claimed record provably means no peel ever received the job
runningJob dispatched; waiting for returns from targets
completeAll targeted peels returned results successfully
partialSome (but not all) targeted peels returned results before timeout
timeoutTimeout expired with zero returns received
failedAll peels returned but one or more reported errors, or a watcher/subscription failure occurred
canceledJob was cancelled before all targets returned

Terminal States

A job is considered terminal when its status is one of: complete, partial, timeout, failed, or canceled. Terminal jobs are no longer actively monitored by a Watcher.

pkg/job/job.go
func (j *Job) IsTerminal() bool {
    switch j.Status {
    case StatusComplete, StatusPartial, StatusTimeout, StatusFailed, StatusCanceled:
        return true
    }
    return false
}

Non-terminal jobs consume resources

Each non-terminal job has an active Watcher goroutine with NATS subscriptions. If a job stays in running state indefinitely (due to a bug or missed timeout), it leaks resources. The timeout mechanism is the primary safeguard against this.


Job Ownership

In multi-master deployments, every job has an Owner field that records which master instance dispatched and is watching it, and an Epoch field that serves as a fencing token to prevent duplicate execution.

FieldTypeDescription
OwnerstringMaster instance ID that dispatched and is watching this job
Epochuint64KV revision number from the CAS ownership claim (fencing token)

The Owner and Epoch are set during Manager.Dispatch via a KV CAS operation. The Epoch is included in every ExecRequest sent to peels. Peels track the highest epoch seen per JID and reject ExecRequests whose epoch is less than or equal to the stored value -- a lower epoch is a stale dispatch from a superseded owner, an equal epoch is a duplicate delivery (e.g. the ack-window re-send). The per-JID record is persisted to disk (/data/peel-dedup.msgpack), so rejection survives peel restarts; legitimate failover re-dispatch always carries a freshly CAS-bumped (higher) epoch and is unaffected.


Master Heartbeat

Each master publishes a periodic heartbeat to the master-heartbeat KV bucket. The heartbeat contains the master's instance ID, a timestamp, and the list of active job IDs it owns. Other masters watch this bucket to detect failures.

PropertyValue
Bucketmaster-heartbeat
Key pattern<master-instance-id>
TTL15 seconds (3x heartbeat interval)
History1
IntervalEvery 5 seconds

The 15-second TTL (3x the heartbeat interval) means a master must miss 3 consecutive heartbeats before being declared dead. This provides tolerance for GC pauses, brief network hiccups, and NATS leader elections.


Active-Jobs Index

The jobs KV bucket holds two key families:

KeyValueLifecycle
<jid>Full MessagePack Job recordCreated at claim time; lives out the 7-day bucket TTL
active.<jid>Tiny ActiveJobEntry (owner, updated)Created at claim time; owner rewritten on reclaim; deleted on every terminal transition

The index exists so that anything asking "which jobs are live right now?" -- the orphan scanner and zester job active -- enumerates only the active. prefix instead of scanning the entire 7-day bucket. Scan cost is O(active jobs), independent of retention and scheduler-synthetic job volume (synthetic jobs are terminal at creation and never get index entries).

Index maintenance is best-effort with self-healing:

  • The index entry is deleted by the watcher after a successful terminal write and when the reclaim cap finalizes a job as failed. A CAS-fenced finalize does not delete the entry -- the superseding owner (or the scanner) is responsible for it.
  • The scanner self-heals stale entries: an active.<jid> key whose job record is missing or already terminal is deleted and skipped. The job record itself is authoritative for status, epoch, and reclaim count; the index value is never used for reclaim decisions.

Orphan Detection and Recovery

When a master's heartbeat key expires (the master has not refreshed it within the TTL), surviving masters detect the missing heartbeat via the OrphanScanner's periodic polling and initiate orphan recovery.

Recovery Process

  1. Detect heartbeat expiration -- The OrphanScanner polls every 20 seconds, listing live masters from the master-heartbeat bucket and comparing against job owners.
  2. Scan for orphaned jobs -- Enumerate the active-jobs index (active.<jid> keys) and fetch only the referenced job records, keeping those where Owner matches a dead master's ID and Status is non-terminal (claimed or running).
  3. Adopt orphaned jobs -- For each orphaned job, perform a CAS update on the Owner field (generating a new epoch) and rewrite the index entry with the new owner. Only one master's CAS succeeds; others back off.
  4. Recover return state -- Read the incrementally persisted per-peel returns from the job-returns KV bucket, then replay the job-events stream for the job's return subjects and merge the results (KV wins on collision; replayed-only returns are persisted so the per-peel keys catch up). Re-subscribe to return subjects. If the merged seeds already cover every target, the job finalizes immediately instead of burning the remaining deadline.
  5. Close the replay gap -- Once the recovery watcher's live subscriptions are attached, the stream is replayed a second time and merged, so a return published between the first replay and the subscription attach is never lost.
  6. Finalize or continue -- The recovery watcher operates identically to a normal watcher (except it never re-publishes ExecRequests -- its peels may be mid-execution). It collects any remaining returns and finalizes the job when complete, timed out, or cancelled.

Recovery race condition

If multiple surviving masters detect the same heartbeat expiration simultaneously, they could both attempt to adopt the same orphaned jobs. This is resolved by using NATS KV's compare-and-swap (CAS) semantics on the Owner field update. Only one master's CAS succeeds; the others back off.

Incremental Return Persistence

Returns are persisted to the job-returns KV bucket incrementally during execution, not just at job finalization. Every return is written immediately (returnPersistThreshold is 1) as a per-peel entry keyed <jid>.<peel-id> — an O(1) write per return. The NATS callback only records state and enqueues; a dedicated per-watcher writer goroutine (buffered channel, synchronous-persist fallback on overflow so nothing is ever dropped) performs the KV writes, and entries whose write fails are re-queued and retried on the next persist (including the drain-and-flush that runs before finalize and when a watcher detaches at shutdown).

The per-peel keys are the sole store for return payloads. No aggregated []Return is written at finalization — a single aggregated value would bound a job's total returns by the ~1MB NATS payload cap, so a wide job could not record all its results. Instead, the finalized job record carries two summary counts, return_count and success_count, and readers list the per-peel keys (an O(this job's returns) prefix-scoped listing, not a bucket scan).

Cancel Propagation

In multi-master mode, all masters subscribe to the cancel wildcard zester.job.*.cancel. When a master receives a cancel for a job it owns, it cancels the local watcher. If the master does not own the job, it ignores the message. Peels also receive the cancel directly and stop execution.

Recovery Guarantees

GuaranteeMechanism
At-most-one ownerKV CAS on Owner field update
No duplicate executionFencing tokens (epoch) in ExecRequest; peels reject stale and same-epoch duplicates via persisted per-JID dedup (survives restarts)
Claimed = never publishedDispatch CAS-persists running before publishing any ExecRequest, so reclaimed claimed jobs are safe to re-dispatch
Returns recoverableIncremental per-peel return persistence to job-returns KV
Returns not lost (post-crash)Recovery watcher replays the job-events stream twice (seed before start, gap merge after its live subscriptions attach) and re-subscribes to return subjects
Timely detection15-second heartbeat TTL (3x heartbeat interval) provides bounded detection time
Cancel reaches ownerAll masters subscribe to zester.job.*.cancel wildcard

See also

For detailed HA deployment procedures, multi-master topology diagrams, fencing token protocol, and operational runbooks, see the High Availability documentation.


Watcher

The Watcher is a goroutine created per-job that monitors NATS subscriptions for acks and returns from target peels.

Source: pkg/job/watcher.go

Subscription Subjects

Subject PatternPurpose
zester.job.<jid>.ack.*Receive delivery acknowledgments from peels that accepted the dispatch
zester.job.<jid>.return.*Receive returns from all targeted peels

The wildcard * in the subject matches the peel ID, so a single subscription captures messages from all targets for that job.

Watcher Loop

The watcher runs a select loop that processes incoming acks and returns until one of three conditions terminates it:

  1. All returns received -- len(returns) >= TargetCount() triggers context cancellation.
  2. Timeout expires -- The time.Timer fires.
  3. Context cancelled -- External cancellation (e.g., Manager.Cancel or Manager.Shutdown).

On watchers that follow an actual publish, a fourth (non-terminating) event fires once: the ack window (default 5 seconds), which re-publishes the ExecRequest exactly once to targets that have neither acked nor returned. See Acks and Silent-Target Re-Dispatch.


Return Structure

Each peel that executes a job publishes a Return with its result:

FieldTypeDescription
JIDstringThe job this return belongs to
PeelIDstringWhich peel produced this return
SuccessboolWhether the function executed without error
ReturnDataanyThe function's output data (module-specific)
ErrorstringError message if the function failed
Durationtime.DurationHow long execution took on the peel
Timestamptime.TimeWhen the return was generated (UTC)

Returns are published to zester.job.<jid>.return.<peel-id> and collected by the Watcher. Each return is already persisted to its per-peel job-returns key as it arrives; finalization only stamps the summary counts (return_count, success_count) onto the job record.

Acknowledgment Structure

Peels publish an Ack on zester.job.<jid>.ack.<peel-id> the moment they accept a job dispatch -- after epoch/duplicate fencing, before execution starts:

FieldTypeDescription
JIDstringThe job being acknowledged
PeelIDstringWhich peel is acknowledging
Timestamptime.TimeWhen the ack was sent (UTC)

The ack lets the watcher distinguish "received, still executing" from "never received": acked peels are exempt from the ack-window re-dispatch, so long-running jobs are never double-sent to healthy peels. Rejected (stale/duplicate) dispatches and --direct request/reply executions produce no ack.


Completion Detection

The Watcher tracks how many returns it has collected and compares against the job's target count:

pkg/job/watcher.go
if len(w.returns) >= w.job.TargetCount() {
    w.cancelFunc() // triggers ctx.Done(), exits the select loop
}

When len(returns) >= TargetCount(), the watcher cancels its context, which causes the main select loop to exit and proceed to finalization.

Final Status Determination

When the watcher terminates (by completion, timeout, or cancellation), it determines the final status:

ConditionFinal Status
Cancelled and returns < targetscanceled
Returns >= Targets, all successfulcomplete
Returns >= Targets, some failedfailed
Returns > 0 but < Targetspartial
Returns == 0timeout

Finalization Steps

After determining the final status, the watcher:

  1. Flushes any still-queued per-peel returns to the job-returns KV bucket (the per-peel keys must be complete before the terminal status lands).
  2. Updates the job in the jobs KV bucket -- final status, updated timestamp, and the return_count/success_count summary -- via CAS, so a stale owner cannot overwrite a newer finalization.
  3. Deletes the job's active.<jid> index entry (only after the terminal write succeeded; a CAS-fenced finalize leaves it for the superseding owner or the scanner's self-heal).
  4. Publishes a completion event to JetStream (zester.job.<jid>.status).
  5. Closes its done channel to signal completion to the Manager.

Event Stream

All job events are captured by the job-events JetStream stream for replay and audit:

PropertyValue
Stream namejob-events
Subjectszester.job.> (all job sub-subjects)
Max age7 days
StorageFile
RetentionLimits policy

Event Types

Event TypePublished ByWhen
dispatchedMasterJob is dispatched to targets
ackedPeelPeel accepted the dispatch (published on the ack subject before execution)
returnedPeel (via watcher)Peel publishes execution result
completedMaster (watcher)Job reaches a terminal status
canceledMasterJob cancelled by operator or API
timeoutMaster (watcher)Job timeout expires
failedMaster (watcher)Job completed with all returns failing

Event stream is append-only

The job-events stream uses file storage with a 7-day retention window. Events cannot be modified or deleted -- they are immutable once written. This makes the stream suitable for compliance auditing but means sensitive data in job arguments or return data will persist for the full retention period.


Retrieving Returns

The Manager.GetReturns method reads returns from the job-returns KV bucket:

pkg/job/manager.go
returns, err := manager.GetReturns(ctx, jid)
// returns is []Return -- one entry per peel that responded

It lists the <jid>.<peel-id> keys via a prefix-scoped listing (server-side filter, O(this job's returns)) and decodes each. The per-peel keys are the only place returns live -- scheduled results are stored the same way.

The KV bucket has a 7-day TTL, after which entries are automatically purged.

KV Storage

BucketKey FormatTTLHistoryContent
jobs<jid> and active.<jid>7 days10 revisionsFull job spec and current status, plus the active-jobs index entries
job-returns<jid>.<peel-id>7 days1 revisionIncremental per-peel returns -- the sole store for return payloads, scheduled results included

The jobs bucket keeps 10 revisions of history per key, allowing you to trace how a job's status evolved over time:

Revision 1: {status: "claimed", owner: "master-1", ...}
Revision 2: {status: "running", ...}
Revision 3: {status: "complete", ...}

Active Jobs

The Manager.ActiveJobs method returns a list of JIDs for jobs that currently have an active Watcher:

pkg/job/manager.go
jids := manager.ActiveJobs()
// jids is []string -- JIDs of all jobs with running watchers

Active jobs are those in a non-terminal state with a Watcher goroutine still monitoring for returns. Note that this is a local view (this master's watchers); the fleet-wide view is the active-jobs index, which is what zester job active reads.


CLI Commands

Show Job Details

Inspect a specific job with its returns
$ zester job show 2hPx1FNsSgJMqVn5bLSTXeBQMpO
{
  "jid": "2hPx1FNsSgJMqVn5bLSTXeBQMpO",
  "function": "state.apply",
  "args": ["webserver"],
  "targets": ["web-01", "web-02"],
  "status": "complete",
  "created": "2026-02-10T14:30:00Z",
  "updated": "2026-02-10T14:30:45Z",
  "return_count": 2,
  "success_count": 2
}

Returns:
PEEL    SUCCESS  DURATION
web-01  true     12.3s
web-02  true     14.1s

The Returns table is read from the per-peel <jid>.<peel-id> keys (sorted by peel ID, so output is deterministic).

List Active Jobs

See which jobs are currently active
$ zester job active
JID                            FUNCTION    TARGETS                STATUS    USER    OWNER
2hPx2Kd8VnR7YmWqTz4PLsCfNjA   cmd.run     [db-01 db-02 db-03]    running   bob     2hPwQm3TxUvW8YnZrAb6NcDeFgH

The command enumerates only the active-jobs index keys and fetches just the referenced job records -- no full-bucket scan. The TARGETS column shows the targeted peel IDs and STATUS shows the current job lifecycle state; stale index entries (job already terminal) are skipped.

List Recent Jobs

List jobs in chronological order
$ zester job list
JID                            FUNCTION      TARGET    STATE     USER    OWNER
2hPx1FNsSgJMqVn5bLSTXeBQMpO   state.apply   web*      complete  alice   2hPwQm3TxUvW8YnZrAb6NcDeFgH
2hPx2Kd8VnR7YmWqTz4PLsCfNjA   cmd.run       db*       running   bob     2hPwQm3TxUvW8YnZrAb6NcDeFgH
2hPx3GhTWoS9ZnXrUa5QMtDgOkB   state.apply   cache*    partial   alice   2hPwQm3TxUvW8YnZrAb6NcDeFgH

The listing skips active.<jid> index keys -- only real job records are shown.

Cancel a Job

Cancel a running job
$ zester job kill 2hPx2Kd8VnR7YmWqTz4PLsCfNjA
Cancel signal sent for job 2hPx2Kd8VnR7YmWqTz4PLsCfNjA

See Timeouts and Cancellation for details on cancellation behavior.

On this page