Zester — Architecture & Resilience Review

Findings and proposed fixes. Scope: architecture, resilience, scalability, consistency, operability, failure-tolerance, and evolvability. No security dimension. Assessment only — no code was changed to produce this document.

Method: 6-dimension multi-agent review, every finding adversarially verified against the code before inclusion.
Confirmed findings: 42 — 8 critical, 18 high, 16 medium.
Refuted / overstated on verification: 2 (documented in the Appendix, for honesty).

Implementation status (July 2026)

Executive summary & cross-cutting themes
Prioritized roadmap
Confirmed findings by dimension
What is already sound
Appendix: refuted and overstated claims

Implementation status (July 2026)

The full roadmap — Tier A (items 1–13), Tier B (items 1–15), and Tier C (items 1–6) — landed in July 2026, with each wave adversarially re-reviewed and the confirmed findings fixed (unit + race + 76/76 Docker integration green). The findings below are preserved exactly as written at review time; this section records what was built against them.

Tier A

✅ ListLiveMasters (pkg/job/heartbeat.go) returns real KV errors (empty bucket ≠ error) and a per-key heartbeat read error aborts the scan (only key-not-found = expired counts as dead); OrphanScanner aborts the cycle on liveness errors and requires MissThreshold (default 2) consecutive missed scans before reclaim (pkg/job/recovery.go).
✅ Peel exec path never runs with nil settings: on exec-time resolve error, state.apply/state.highstate fall back to last-known-good cachedSettings (Warn logged); with no successful resolve ever, the execution fails with an explanatory ExecResponse error (cmd/zester-peel/main.go).
✅ The peel's WatchSecrets callback calls resolver.InvalidateCache() before the debounced re-resolve, and Resolver.Resolve fails — and never caches — when unresolved __ZESTER_SECRET:...__ placeholders survive the final merge (pkg/settings/resolve.go).
✅ returnPersistThreshold dropped 10 → 1: every return is persisted to its per-peel KV key immediately; failed writes are re-queued and retried, with a detach-path flush (pkg/job/watcher.go).
✅ bus.BumpRevision is a real CAS retry loop (Create "1" when missing / Update with revision, ~10 attempts with jitter) — concurrent bumps no longer lose updates (pkg/bus/kv.go).
✅ Job.User populated from the operator identity (os/user → $USER → "unknown"), shown in zester job list/job active (USER column) and job show; the master logs every accepted dispatch at Info with jid, user, function, target count.
✅ bus.ClientConfig.RetryConnect (wraps nats.RetryOnFailedConnect): master/peel/watchdog set it so an unreachable NATS at boot is non-fatal (background connect); the operator CLI leaves it false.
✅ Basket publisher hash-skips unchanged values (SHA-256 of sorted-key msgpack, same pattern as facts.Manager.publish); the hash is cached only after a successful put so failed writes retry next tick (pkg/basket/publisher.go).
✅ Watchdog degraded mode is no longer terminal: after 10 consecutive failed restarts the supervisor enters a slow-retry tier (SupervisorConfig.DegradedRetryInterval, default 10m) and recovers automatically; NodeStatus.Degraded (msgpack additive) is reported to update-status via ReporterConfig.DegradedFn.
✅ enroll.ClientConfig.MasterURLs (peel --master-urls/master_urls): ordered failover across all enrollment operations — rotate on connection failures/5xx, pin on success, never rotate on 4xx; a 401 "challenge verification failed" triggers a transparent fresh-nonce re-request + resubmit (3 total attempts).
✅ --log-level/--log-format on all three daemons via the shared bootstrap internal/logging.Setup(w, component, level, format); master and peel default format changed text → JSON; base attrs component/version (+ master_id/peel_id once known) on every line.
✅ Job.Deadline (msgpack deadline,omitempty) stored at dispatch; recovered watchers honor the remaining budget and an expired deadline finalizes immediately from collected returns; Job.ReclaimCount added and capped at 3 (maxReclaims), after which the job is CAS-finalized failed instead of re-dispatched.
✅ internal/config/bind.go: BindFlags(fs, cfg) generates flags from flag:"name" usage:"..." struct tags and ApplyVisited(fs, cfg) re-applies only user-set flags after YAML load — the triplicate (flag block + YAML field + flag.Visit switch) is gone in both daemons, with parity pinned by tests (including --gitfs-remotes "" disabling a YAML list).

Tier B

✅ internal/health.Checker wired: both daemons serve GET /readyz on health_addr (peel checks: nats, kv; master checks: nats, enroll-server, sched-consumer, gitfs when enabled); the watchdog's update-soak loop polls readiness via Supervisor.CheckReady / --ready-url while restart monitoring stays on /healthz liveness.
✅ internal/metrics wired: GET /metrics mounted on both daemons' health mux; master fires zester_jobs_total{status}/zester_job_duration_seconds{function} (new OnJobFinalized hook), zester_job_reclaims_total (new OrphanScanner.OnReclaim), zester_facts_sync_total, and zester_nats_reconnects_total/_disconnects_total/_slow_consumers_total (new bus.ClientConfig hooks); peel fires zester_peel_state_apply_total{state,result} + duration, zester_peel_connected, zester_nats_reconnects_total, zester_peel_uptime_seconds.
✅ Replicas tiering: bus.EffectiveReplicas(explicit, clusterSize) — explicit wins, else min(3, max(1, clusterSize)); every bucket/stream/object-store config carries a DurabilityTier (TierCritical/TierEphemeral), InitializeStorageOpts logs a loud per-asset warning when a critical asset ends up under-replicated in a cluster, and a failed replica upgrade falls back to R1 with a nats stream edit migration hint instead of aborting startup.
✅ Active-jobs index: active.<jid> keys (msgpack ActiveJobEntry{owner, updated}) in the jobs bucket, created at dispatch, rewritten on reclaim, deleted on every terminal transition; the orphan scanner enumerates only bus.ListKeysWithPrefix(jobsKV, "active.>") — O(active), independent of the 7-day retention — and self-heals stale keys; scheduler-synthetic jobs never get index entries; zester job list skips index keys and job active is rebuilt on the index.
✅ Per-peel return keys are the source of truth: finalizeJob no longer writes the ~1MB aggregated value (the job record carries ReturnCount/SuccessCount); GetReturns/recovery use prefix-scoped listing (a legacy-aggregate fallback shipped initially and was subsequently removed — the tool is pre-first-release, so the per-peel keys are the only returns storage); return persistence moved off the NATS callback into a buffered channel (256) + writer goroutine with synchronous-persist overflow fallback and guaranteed drain on finish/detach/finalize.
✅ Secrets hash-gate + single owner: PublishSecrets keeps a per-peel fingerprint cache (sender curve pub ‖ recipient curve pub ‖ sorted secrets, updated only after a successful put) and skips unchanged re-encrypts; the "facts-secrets" leader lease gates the facts-watcher secrets reaction so exactly one master re-encrypts.
✅ Single-publisher lease: advisory bus.LeaderLease (leases bucket, 15s TTL / 5s renew) key "publisher" gates settings-files publish, state-files publish, and the GitFS sync loop; loading still runs on every master, publishes run in a per-acquisition sub-context cancelled on lease loss (brief <TTL dual publishing tolerated — all gated writes idempotent/hash-gated).
✅ _manifest deletion propagation in both pipelines: publishers write a sorted {key, sha256} manifest after the file puts and before the _revision bump, then prune stale KV keys; the settings resolver verifies every loaded file against the manifest (listed-but-missing / hash mismatch / loaded-but-unlisted all fail the resolve — torn-read protection) and the statefiles cache fetches exactly the manifest set with SHA-256 verification plus local pruning. (A no-manifest legacy mode shipped initially and was subsequently removed — the tool is pre-first-release: content without a _manifest now fails as a torn/tampered publish, and only an empty pre-first-publish bucket is a clean no-op.)
✅ Dispatch ordering + peel JID dedup: Dispatch is Create(claimed) → CAS-to-running → publish, so a StatusClaimed record provably means never-published (same inversion on the reclaim-claimed path; a retried own claim resumes from the running CAS); peels persist jid→epoch dedup state to /data/peel-dedup.msgpack (4096-entry cap) and reject equal-or-lower epochs — pkg/job CAS-bumps Epoch on every legitimate redispatch.
✅ Statefiles cache: revision-triggered re-syncs reuse settings.NewDebouncedFunc (2s debounce + 5s jitter), retry with capped backoff (1s→30s) until the current _revision is synced, and Sync is atomic — the complete file set is staged in a temp dir and swapped in via rename, so mid-download failures leave the previous complete tree intact; empty-set publishes over a populated bucket are refused unless AllowEmpty.
✅ Peel heartbeat: peels Put facts.Heartbeat{ts, version, protocol} into the peel-heartbeat bucket (10s interval, 30s TTL) from the connected phase; zester peel list is truthful (ONLINE/LAST-SEEN from the bucket, absent key = offline) and the master feeds a zester_connected_peels gauge from bucket key counts.
✅ Versioned envelope: pkg/proto.ProtocolVersion (= 1) with a normative additive-only msgpack policy (doc.go) and key-set conformance tests; additive V fields on ExecRequest/ExecResponse/Job (decoded 0 = never set, compatible); Manifest.MinProtocol + NodeStatus.Protocol make StartRollout refuse rollouts naming incompatible nodes.
✅ Rollout resume + confirm-deadline: RolloutState.DriverID/DriverHeartbeat (10s heartbeat goroutine); ResumeOrphaned/RunResumeLoop (started on every master) CAS-adopt non-terminal rollouts with heartbeats staler than 60s and resume from the persisted batch; the watchdog arms a soak confirm-deadline (HandlerConfig.ConfirmDeadline, default 3×soak floored at 5m) and auto-rolls back when neither confirm nor rollback arrives; StartRollout also excludes degraded nodes.
✅ Watcher consolidation + jitter: settings.WatchSecretsAndCurve covers the per-peel secrets key and _master_curve_pub with ONE WatchFiltered consumer (available; the peel still runs the two separate watchers pending the wiring switch); all settings watcher reconnect loops add deterministic per-id jitter (SetWatchJitter, default spread 60s) to the first reconnect after a consumer loss.
✅ Bounded exec worker queue + read-only fast path + ModuleContext embed: mutating executions go through a 64-slot queue with a single execMu worker (queue-full replies "peel busy: execution queue full" immediately); the read-only set (facts.* except facts.set, settings.*/pillar.*, test.ping, grains.*, sys.list_functions) runs concurrently outside execMu on immutable contexts derived via exec.ModuleContext.WithFactsSettings from the frozen mctxTemplate; ModuleContext embeds ProviderSet (the nine duplicated fields and hand-copy are gone).

Tier C

✅ (scoped variant) Durable job path: reclaiming a running job replays the job-events stream for the JID (NewStreamReturnReplayer, ephemeral pull consumer, auto-wired via bus.ConsumerAPI) to recover returns published during the ownerless window, and dispatch watchers re-send the ExecRequest exactly once to targets that neither acked nor returned within the 5s AckWindow (peels ack on accept via job.Ack on bus.JobAckSubject, made safe by the persisted JID dedup). The full JetStream dispatch path with per-peel durable consumers was explicitly scoped out.
✅ Master-side target-resolution + basket service: every master maintains facts.Index via facts.WatchIntoIndex and serves target.StartResolveService on zester.target.resolve (queue zester-target-resolvers); the CLI and the peel's basket() wrap their KV listers in target.NewServiceLister — target.Resolve auto-delegates via ExprResolver, only matched IDs cross the wire, any failure falls back to the KV scan; Job.TargetExpr stores the operator's original expression (audit).
✅ Offline-first peel startup: internal/peeld.Agent.Run splits into a LOCAL phase (facts collect, providers, states engine, settings warm-start from /data/settings-snapshot.msgpack, scheduler, exec subscriptions) and a CONNECTED-phase goroutine (facts publish, resolve, watchers, statefiles sync, heartbeat) — a peel booted during a NATS outage enforces from local state and attaches when the connection arrives; only genuinely local errors are fatal.
✅ bus.KV narrow interface: bus-owned KVEntry/KeyWatcher/KeyLister, op consts, and errors.Is-compatible error aliases; jetstream.KeyValue adapted once in pkg/bus/kv_jetstream.go; domain packages hold no nats.go KV types.
✅ internal/peeld (Agent) and internal/masterd (Daemon) extracted: both daemon main.gos own only flag parsing, config loading, logging setup, and signal handling (~90 lines each); subsystems register named readiness checks with the shared internal/health.Checker.
✅ Admin mutations through request/reply: zester enroll approve|reject|revoke send enroll.AdminRequest on zester.admin.enroll.*, answered by the masters' zester-masters-admin queue group; direct KV writes remain only behind the break-glass --direct-kv flag; pkg/masterapi gained POST /api/v1/enrollments/{id}/reject and /revoke.

Review-driven hardening (first round)

Fixes found during the Tier A implementation's adversarial review, beyond the roadmap items:

Master-anchored Deadline — Manager.Dispatch always re-anchors Job.Deadline server-side; client clocks are untrusted.
StateID carried through job mode — Job.StateID forwards the CLI's bare positional as ExecRequest.ID, so job mode behaves like --direct (previously the JID leaked in as the positional/name default).
Nil-args panic guard — the peel's execModule guards against nil Args maps (REST-API dispatches with omitted args no longer panic the peel).
persistReturns re-queue — failed per-peel return writes are re-queued and retried instead of dropped.
Per-key heartbeat error abort — a heartbeat read error for any key aborts the orphan scan; only key-not-found counts as a dead master.
Resolver invalidation generation — an invalidation-generation counter prevents an in-flight Resolve from re-caching pre-rotation values.
/readyz degraded = 200 semantics — HTTP 503 only when a check is down; degraded means working-but-impaired and stays 200.
Watchdog ready-url derivation — --ready-url defaults to --health-url with its path replaced by /readyz, so overriding --health-url alone (e.g. :9091 for a master child) keeps soak probing the right port.
CLI fail-fast connect — the operator CLI keeps RetryConnect false and fails fast with a clear error on unreachable NATS.
Honest connected gauge — zester_peel_connected reports 0 until NATS is actually connected.
Live state-module precedence — the peel checks registry.Has at dispatch time, so Starlark state modules loaded after startup are no longer shadowed by same-named execution functions.
/healthz version field — the liveness body carries component and version; update.Supervisor.HealthVersion parses it for fleet status.

Review-driven hardening (second round)

Fixes confirmed by the adversarial review of the Tier B/C change set:

Deferred-connect self-heal — pkg/bus registers a nats.ConnectHandler: a RetryOnFailedConnect deferred initial connection now flips IsHealthy, fires OnReconnect, and signals waiters (previously the peel's connected phase never started until a subsequent NATS flap); IsHealthy() is also defensive (cached flag OR nc.IsConnected()).
Double stream replay on reclaim — the job-events stream is replayed once before the recovery watcher starts (seed) and again after its live subscriptions attach (Watcher.Subscribed() + idempotent MergeReturns()), so a return published in the replay→subscribe gap is no longer lost.
Peel ack publishing — peels publish job.Ack on bus.JobAckSubject(jid, peelID) for every accepted job dispatch (after fencing/dedup, before execution); the dispatch watcher counts acked peels as heard, so acked-but-slow peels are never re-sent within the ack window.
Synchronous dedup durability — dedupTracker.ObserveDurable flushes the accepted (jid, epoch) record to /data/peel-dedup.msgpack (atomic rename) before a job dispatch executes, so a peel crash + restart inside the master's 5s ack window cannot re-execute a non-idempotent job at the same epoch.
zester job show per-peel returns — reads the per-peel "<jid>.<peelID>" keys (the new source of truth, sorted); previously it read only the bare key the watcher no longer writes and silently omitted the Returns table for every master-dispatched job. (The legacy-aggregate fallback it initially shipped with was subsequently removed — pre-first-release, no aggregate records ever existed.)
Resolver revision sampling — Resolve samples _revision at entry and re-reads it at cache-write time, caching only when both match; a publish completing during the load/render window can no longer pin the old file batch under the new revision.
Best-effort settings prune — stale-key pruning in PublishRawFiles is garbage collection: ListKeys/Delete failures are Warn-logged and never abort the publish, so the _revision bump always fires once files + manifest are committed (previously one un-deletable key stalled all settings distribution).
GitFS validity gating + atomic clones — clone-vs-pull decided by repository VALIDITY (own .git + anchored git rev-parse), invalid dirs removed and re-cloned (self-healing); fresh clones staged in hidden .<repo>.clone-* temp dirs and renamed into place; a remote whose sync failed publishes only from a VALID local clone (last known good) — a partial clone can no longer propagate fleet-wide deletions.
resolveStatesDir retry — an empty/missing state-file cache dir is retried 5×25ms before falling back to the baked /data/states (Warn logged), so the statefiles cache's two-rename swap window cannot flip a run onto the baked tree.
Lazy states engine — KV-only peels (no baked states, cache not yet synced) no longer crash-loop at boot: the engine builds automatically on the first execution after state files land, and until then state.apply/state.highstate return a clear "states engine unavailable" error while every other module works.
facts.set serialization — facts.set is excluded from the read-only fast path (it mutates the custom-facts file via a non-atomic read-modify-write) and runs on the execMu-serialized worker, so concurrent sets cannot clobber each other's keys.

Nothing on the roadmap remains open. The only deliberately unbuilt piece is the maximal variant of Tier C item 1 — a full JetStream dispatch path with per-peel durable consumers — which was explicitly scoped out in favor of the stream-replay-on-reclaim + ack-window re-dispatch combination above.

1. Executive summary & cross-cutting themes

Cross-cutting themes

Theme 1: You bought durability and don't use it. JetStream is the only backplane, yet the hot paths route around it: job dispatch and returns ride fire-and-forget core NATS; the job-events stream durably captures every return but nothing ever replays it; watchers and the rollout controller hold state in memory; every bucket defaults to Replicas:1. The result is a system that looks HA (CAS, epochs, heartbeats, orphan scanner) but loses in-flight work on exactly the failures the HA machinery exists for. This single theme spans 8+ findings across availability, consistency, and failure-modes.

Theme 2: Symmetric multi-master without a coordination primitive. Every master does everything — facts-watch + secrets re-encryption, settings/statefiles publish, GitFS sync, orphan scanning — with no leader lease, no queue group, no sharding. Adding a master for HA multiplies load (N× secrets re-encryption, N× full-bucket WatchAll replays) and creates new failure modes (torn _revision batches from concurrent publishers, non-CAS BumpRevision). The architecture is symmetric; the workloads are not.

Theme 3: Observability is fully built and fully unwired. internal/metrics (23 metrics), internal/health (checker + 503 semantics), and pkg/facts/index.go (the inverted index that would fix O(N²) targeting) all exist, are tested, and are imported by nothing. Meanwhile the static-200 /healthz isn't just a monitoring gap — it's the signal the self-update soak/rollback mechanism trusts, so the flagship safety feature validates a constant. This is the cheapest theme to fix: the code is written.

Theme 4: Swallowed errors turn infrastructure blips into fleet-wide wrong actions. ListLiveMasters returns empty-map-nil-error on KV failure → mass false reclaim → duplicate execution of non-idempotent commands. cs, _ = resolver.Resolve(...) → states applied with nil settings → silent misconfiguration reported as success. _ = r.store.Save(...) ×6 in the rollout controller. The revision-gated cache silently eats secrets rotations. The pattern is consistent: degraded infra should degrade the operation (fail, retry, use last-known-good), never proceed with wrong data.

Theme 5: O(N) full-bucket scans are the default query pattern. Orphan scanner enumerates every retained job every 20s; every target resolution and every basket() call drains the entire facts bucket; return recovery lists all keys in job-returns; ~4 ephemeral consumers per peel. Nothing is indexed. The practical ceiling is ~1–2k peels regardless of NATS cluster size — the bottleneck is access patterns, not the backplane.

2. Prioritized roadmap

Tier A — Quick wins (small, do this month)

ListLiveMasters: distinguish ErrNoKeysFound from real errors; abort scan on error; require N consecutive misses before reclaim (consistency-1 / availability-3)
Never execute with nil settings — fail or fall back to cachedSettings (failure-modes-1)
Invalidate resolver cache on secrets-watch events; treat unresolved placeholders as resolve failure (consistency-2)
Drop returnPersistThreshold to 1 (part of availability-2 / consistency-4)
Fix BumpRevision to a CAS retry loop (part of consistency-5)
Populate Job.User from creds identity; log dispatch with jid+user (operability-4)
nats.RetryOnFailedConnect(true) in bus.NewClient (first half of availability-5)
Hash-skip in basket publisher (copy facts.Manager.publish) (scalability-7)
Watchdog: slow-retry tier instead of terminal degraded; expose degraded in NodeStatus (failure-modes-7)
Multi-URL enrollment fallback in PeelConfig; re-request nonce on "unknown challenge" (availability-7)
--log-level/--log-format flags, JSON default, shared bootstrap (operability-6)
Store dispatch deadline on Job; recovered watchers honor remaining budget; add ReclaimCount (availability-6)
Config: generate flags from YAML struct tags, kill the triplicate (modularity-6)

Tier B — Structural investments (this quarter)

Wire internal/health into /healthz//readyz (NATS check, subsystem registration); point watchdog soak at /readyz (availability-4, operability-2)
Wire internal/metrics into both daemons' existing health mux (operability-1)
Replicas tiering: default min(3, clusterSize), loud warning on R1 in a cluster; enrollments/jobs/rollouts must be R3 (availability-1)
Active-jobs index (or owner.<masterID>.<jid> keys) so orphan scan is O(active), not O(retained) (scalability-1)
Per-peel return keys as source of truth; kill the 1MB aggregated value; writer goroutine off the NATS callback (scalability-2)
Hash-gate secrets re-encryption + single-owner (lease) facts→secrets reaction; split volatile metrics from identity facts (scalability-4, failure-modes-6)
Single-publisher lease for settings/statefiles/GitFS republish (rest of consistency-5)
_manifest key for deletion propagation in both distribution pipelines (consistency-6)
Dispatch ordering fix (CAS running before publish) + peel-side persisted JID dedup (consistency-3)
Statefiles cache: reuse DebouncedFunc, retry on failure, atomic temp-dir sync (failure-modes-5)
Peel heartbeat bucket (reuse Heartbeater); make zester peel list truthful (operability-3)
Versioned envelope on ExecRequest/Response/Job/Return; rollout min-protocol check (modularity-1)
Rollout resume-on-startup scan + watchdog confirm-deadline (operability-5, failure-modes-3)
Consolidate per-peel KV watchers; jittered reconnect (scalability-5)
Bounded exec worker queue; read-only modules outside execMu; embed ProviderSet + immutable per-request ModuleContext (failure-modes-4, modularity-7)

Tier C — Re-architectures (next two quarters)

Durable job path: JetStream dispatch with per-peel consumers, or at minimum replay job-events on reclaim (availability-2, consistency-4/7, failure-modes-2)
Master-side target-resolution + basket request/reply service backed by facts.Index; store target expression not resolved list (scalability-3, scalability-6)
Offline-first peel startup: cached facts/settings/states + scheduler before NATS attaches (availability-5)
bus.KV narrow interface owning entry/watcher types; shrink FakeKV (modularity-2)
Extract internal/peeld and internal/masterd with a Runnable/Healthy subsystem interface (modularity-3/4)
Admin mutations through masterapi/NATS request-reply; direct KV becomes break-glass only (modularity-5)

3. Confirmed findings by dimension

Each finding: Weakness (the structural flaw) → Scenario (when it bites) → Impact (blast radius) → Fix (recommended change) → Evidence (file references) → effort/confidence.

3.1 Availability & High Availability

Zester's multi-master design is genuinely symmetric (queue-group dispatch, CAS job claims, epoch-fenced finalization, heartbeat+orphan-scanner failover in ~15-35s), which is a solid skeleton. However, the availability story has three structural holes. First, every durability primitive — all 13 KV buckets, the job-events stream, and the update object store — defaults to Replicas:1, so unless the operator remembers --jetstream-replicas, a single NATS node is a fleet-wide SPOF whose disk loss permanently destroys the enrollment identity database and job history even in a 3-node cluster. Second, the actual job payload path (ExecRequest dispatch and return collection) rides fire-and-forget core NATS with only every-10th-return KV persistence, so in-flight jobs are not durable: offline peels silently never execute, and returns arriving during the ownerless failover window are lost to recovery even though a durable job-events stream capturing them already exists. Third, failure detection is weak on both ends: the orphan scanner treats a KV list error as "all masters dead" (mass false reclaim + duplicate execution, since peels do no JID dedup), and /healthz is a static 200 that reflects nothing, so the watchdog supervision and update-soak machinery built on it validates nothing while subsystems (enrollment server, GitFS) can die silently inside the fat master process. Peel autonomy is also only partial: a peel that restarts during a NATS outage exits immediately instead of enforcing cached state offline.

1. 🔴 CRITICAL — Replicas:1 default on every JetStream asset makes one NATS node a fleet-wide SPOF with permanent data loss

Weakness: All 13 KV buckets (facts, jobs, job-returns, master-heartbeat, enrollments, settings-files, secrets, state-files, update-*), the job-events stream, and the update-binaries object store default to Replicas:1. Replication is opt-in via --jetstream-replicas on the master; there is no warning, no per-bucket durability tiering, and the enrollment bucket — the source of truth for peel identity — gets the same R1 default as ephemeral heartbeats.
Scenario: Operator deploys the 3-node NATS cluster from integration/nats-cluster but forgets (or doesn't know) to set --jetstream-replicas 3. Every asset lands on one JetStream node. That node's disk dies: dispatch fails (jobs bucket gone), heartbeats stop, settings/state distribution halts, and the enrollments bucket — every peel's identity record — is unrecoverable. Fleet must be re-enrolled manually.
Impact: Entire control plane unavailable on single node failure despite a 'clustered' NATS deployment; permanent loss of enrollment records, job history, rollout state, and published binaries. Blast radius is the whole fleet, and recovery requires manual re-enrollment of every peel.
Fix: Invert the default: auto-detect cluster size (JSZ/varz) and default Replicas to min(3, clusterSize), requiring explicit opt-in for R1. Tier durability: enrollments/jobs/rollouts must be R3 in any cluster; heartbeat/update-status can stay low. Log a loud startup warning (or refuse) when replicas < cluster size, and document a migration path for existing R1 buckets.
Evidence: pkg/bus/kv.go:219-305 (all DefaultBuckets Replicas:1), kv.go:389 (job-events stream Replicas:1), kv.go:449 (object store Replicas:1); override only via cmd/zester-master/main.go:67 --jetstream-replicas defaulting to 0
Effort: medium · Confidence: high

2. 🔴 CRITICAL — Job dispatch and return collection are fire-and-forget core NATS — in-flight jobs are not durable, and recovery ignores the durable stream that exists

Weakness: ExecRequests are published to peels via core NATS (m.nc.Publish, no JetStream, no ack), and returns are collected via a core NATS subscription with KV persistence only every 10th return (returnPersistThreshold=10). The job-events stream (zester.job.>) durably captures these messages, but orphan recovery reads only the per-peel KV keys and never replays the stream — a durable log exists and is unused for the exact failure it could solve.
Scenario: (a) A peel is rebooting when a job is dispatched: the ExecRequest evaporates, the peel never executes, and the job silently ends StatusTimeout with no retry. (b) Master A crashes mid-job; during the ownerless window (heartbeat TTL 15s + orphan scan up to 20s ≈ 35s) peels publish returns that no watcher hears; the reclaiming master seeds only from KV (persisted every 10 returns), so up to 9 persisted-window returns plus all window returns are lost — job finalizes Partial/Timeout even though every peel succeeded and the returns sit in the job-events stream.
Impact: Silent non-execution on offline peels (config drift with a 'timeout' that looks like a peel problem) and false job failures after any master failover. Operators re-run jobs, compounding the duplicate-execution risk. Undermines the core promise of the CAS/epoch HA machinery.
Fix: Make the payload path durable: dispatch ExecRequests via JetStream with per-peel durable consumers (peels catch up after reboot), or at minimum have reclaim recovery replay the job-events stream for the JID (returns are already captured there). Drop returnPersistThreshold to 1 — per-peel keys are already O(1) writes.
Evidence: pkg/job/manager.go:124 and 176 (core Publish of ExecRequest), pkg/job/watcher.go:15 (returnPersistThreshold=10) and 96-127 (core-NATS return subscription), pkg/job/manager.go:198-229 (StatusRunning reclaim reads only KV per-peel keys, never the stream defined at pkg/bus/kv.go:383-393)
Effort: large · Confidence: high

3. 🟠 HIGH — Orphan scanner treats KV errors as 'all masters dead' and reclaim re-dispatches without peel-side dedup — mass false reclaim and duplicate execution

Weakness: ListLiveMasters swallows every ListKeys error and returns an empty map with nil error, so a transient JetStream hiccup makes the scanner see zero live masters and reclaim every non-terminal job owned by living masters. Separately, any JetStream unavailability >15s expires ALL heartbeats simultaneously, triggering the same mass reclaim on recovery. Epoch fencing protects the KV record, but ReclaimJob re-publishes ExecRequests for 'claimed' jobs and peels perform no JID deduplication, so re-dispatch means re-execution.
Scenario: The single JetStream node hosting the R1 heartbeat bucket restarts for 20s. All master heartbeats expire. When KV returns, master B's scanner runs before master A's next beat lands, sees A as dead, and reclaims A's claimed job — republishing 'cmd.run systemctl restart postgresql' to 500 peels that already executed it. Same outcome if ListKeys returns a transient error while everything is actually healthy.
Impact: Non-idempotent modules (cmd.run, state applies with onchanges side effects) execute twice fleet-wide; dual watchers race on the same job. Blast radius is every in-flight job at the moment of any JetStream blip — precisely when the system is already stressed.
Fix: In ListLiveMasters, distinguish ErrNoKeysFound from real errors and abort the scan on error. Require N consecutive missed scans (or a per-master tombstone grace period) before declaring a master dead, so a KV outage shorter than one scan cycle can't trigger reclaim. Add peel-side JID dedup (small LRU of executed JIDs) so re-dispatch is idempotent.
Evidence: pkg/job/heartbeat.go:84-88 (ListKeys error → empty map, nil error), pkg/job/recovery.go:100 (liveness check against that map), pkg/job/manager.go:158-196 (ReclaimJob re-publishes ExecRequests), cmd/zester-peel/main.go exec handler has no JID dedup (grep for dedup/seenJIDs: none)
Effort: medium · Confidence: high

4. 🟠 HIGH — /healthz is a static 200 and master subsystems fail silently — the watchdog/soak supervision stack validates nothing

Weakness: The master's ~10 concurrent responsibilities run as unsupervised goroutines in one process with no shared health state: the enrollment HTTPS server error is logged and the process continues without an enrollment API; the GitFS syncer goroutine exits permanently on error; the schedule-results consumer is optional at startup. Meanwhile /healthz unconditionally returns {"status":"ok"} — it reflects neither NATS connectivity nor subsystem liveness. internal/health exists but is unwired, and metrics are unwired, so partial failure is invisible to the watchdog, the update-soak health check, and operators alike.
Scenario: The enrollment TLS cert expires or the port is taken: enrollSrv.Start() errors once into the log and the master runs for weeks accepting jobs but silently unable to enroll new nodes. During a self-update rollout, the watchdog soak phase polls /healthz on a master whose NATS connection is flapping and whose GitFS sync died — sees 200, confirms the update, and the rollback safety net never fires.
Impact: Silent partial outages of enrollment, state-file sync, and scheduler-result persistence; the entire self-update rollback mechanism (whose correctness depends on health signal fidelity) is built on a constant. Degrades MTTR from minutes to whenever a human notices.
Fix: Wire internal/health: per-subsystem readiness registered by each goroutine (NATS connected, enroll server listening, gitfs last-sync age, consumer alive), /healthz returns 503 when any critical subsystem is down. Supervise subsystem goroutines with restart-on-error (the GitFS one-shot goroutine especially), and make enrollment-server failure either fatal or restart-looped.
Evidence: cmd/zester-master/main.go:610-616 (static healthz), :490-494 (enroll server error logged, non-fatal), :299-304 (gitfs goroutine exits on error, no restart), :317-322 (sched consumer optional); internal/health and internal/metrics unimported by any binary
Effort: medium · Confidence: high

5. 🟠 HIGH — Peel exits fatally if NATS is unreachable at boot — no offline-first autonomy despite local caches existing

Weakness: bus.NewClient does not set nats.RetryOnFailedConnect, and unlike the master (20-attempt retry loop) the peel has no connect retry at all: connection failure at startup is a fatal error. The peel has everything needed for offline operation — state files cached on disk (states-cache), last-resolved settings, a local scheduler — but none of it starts before a successful NATS connection, so runtime resilience (unlimited reconnects) does not extend to boot.
Scenario: NATS cluster is down (or the R1 asset node is being replaced) and a managed node reboots — a common correlated event during a datacenter incident. zester-peel exits immediately; the watchdog restarts it into the same failure in a backoff loop. Scheduled highstate enforcement, which would run fine from the local cache, never executes for the entire outage.
Impact: Exactly when the control plane is degraded, restarting nodes lose all configuration enforcement — the opposite of the Salt-minion autonomy model the architecture aims for. Fleet-wide correlated reboots (power event) plus a NATS outage leaves every node unmanaged.
Fix: Add nats.RetryOnFailedConnect(true) to bus.NewClient, and restructure peel startup offline-first (the watchdog already models this pattern): load cached facts/settings/state files, start the scheduler and local enforcement immediately, and attach NATS-dependent features (job subscription, fact publish, KV watches) asynchronously when the connection arrives.
Evidence: cmd/zester-peel/main.go:167-176 (NewClient error → fatal return, no retry loop), pkg/bus/client.go (no RetryOnFailedConnect option; grep confirms zero uses), contrast cmd/zester-master/main.go:147-164 (master's 20-attempt loop)
Effort: medium · Confidence: high

6. 🟡 MEDIUM — Failover recovery restarts the job timeout from zero and offers no visibility into the reclaim window

Weakness: A reclaimed 'running' job gets a brand-new Watcher whose timeout timer starts from the full job timeout at reclaim time (up to ~35s after the original master died), and only if per-peel KV returns happen to exist does it seed prior state. There is no record on the job of how many times it was reclaimed or by whom, and detection latency (heartbeat TTL 15s + scan interval 20s) is hardcoded rather than configurable.
Scenario: A 60s-timeout job is 55s in when its master dies. Detection + reclaim takes ~35s, then the new watcher waits another full 60s before finalizing — the CLI caller timed out long ago and the operator, seeing 'running', re-submits the job under a new JID, executing it twice. Repeated master crashes can ping-pong a job between owners indefinitely with no reclaim counter to cap it.
Impact: Job completion latency after failover is unbounded relative to the operator-visible timeout; encourages manual re-runs (duplicate execution); no audit trail for debugging split ownership incidents.
Fix: Store dispatch deadline (Created + Timeout) on the Job and have recovered watchers honor the remaining budget, finalizing immediately as Timeout/Partial if already expired. Add a ReclaimCount/OwnerHistory field to Job (bounded, e.g. max 3 reclaims → Failed) and make heartbeat TTL / scan interval configurable.
Evidence: pkg/job/watcher.go:130-144 (timer starts at Watch() entry with full j.Timeout), pkg/job/manager.go:198-229 (fresh watcher on reclaim), pkg/job/recovery.go:108-110 (Owner overwritten, no history), heartbeat.go:14 and recovery.go:13 (hardcoded 5s/20s)
Effort: small · Confidence: high

7. 🟡 MEDIUM — Enrollment availability hinges on a single configured master URL and volatile challenge state

Weakness: Peels enroll against exactly one --master-url; there is no multi-URL fallback even though every master runs a symmetric enrollment server backed by shared KV. Combined with the non-fatal, non-restarted enrollment server (finding above), the enrollment path has a narrower availability envelope than the rest of the control plane. Additionally the enroll-challenges bucket is MemoryStorage with Replicas:1, so a NATS node restart wipes all in-flight challenge nonces.
Scenario: Three masters run for HA, but every peel's peel.yaml points at master-1's enroll URL. Master-1 is down for patching during an autoscaling event: new nodes fail enrollment with a connection error and sit in the watchdog's creds-polling loop indefinitely, while masters 2 and 3 could have served them. Concurrently, a NATS restart invalidates every outstanding challenge nonce mid-handshake.
Impact: New-node provisioning (the most common operation during incident recovery and scale-out) is a de facto single point of failure in an otherwise symmetric multi-master design. Failure mode is silent stalling, not an actionable error.
Fix: Accept a list of master URLs in PeelConfig (like NATS URLs already are) and rotate through them in enroll.Client with backoff. Alternatively publish enrollment endpoints via DNS/LB and document it as the required pattern. Challenge nonces are cheap to retry, but the client should transparently re-request a nonce on 'unknown challenge'.
Evidence: cmd/zester-peel/main.go:124-127 and 134-139 (single cfg.MasterURL), pkg/bus/kv.go:272-279 (enroll-challenges MemoryStorage, Replicas:1), cmd/zester-master/main.go:490-494 (enroll server non-fatal failure)
Effort: small · Confidence: high

3.2 Scalability & Load

Zester's scalability ceiling is set by four O(N)-per-event patterns layered on a single NATS backplane. What breaks first: (1) with schedules enabled, the orphan scanner (every master, every 20s, ListKeys + Get over the ENTIRE 7-day jobs bucket, which scheduled return_job runs flood with one synthetic job + 3 KV writes per run per peel) degrades at just a few thousand accumulated job keys — scan time exceeds the 20s interval well before 1k peels; (2) wide-target jobs hit a hard ceiling around 1-3k targets because finalizeJob writes ALL returns as ONE KV value (NATS 1MB max payload) and return handling is serial inside one NATS callback with synchronous KV persists (slow-consumer drops above ~6KB average return size); (3) every basket() call and every CLI job dispatch transfers the entire facts bucket (KVGetAll/Keys) — O(N) per call, O(N²) fleet-wide when settings changes trigger fleet re-renders, saturating NATS bandwidth around 1-2k peels (the purpose-built facts.Index in pkg/facts/index.go is wired to nothing); (4) at 10k peels the steady-state background load — every master re-encrypting and republishing secrets on every facts update (volatile memory/disk facts guarantee a publish per peel every 5 min, multiplied by master count), plus ~4 ephemeral JetStream consumers per peel (~40k consumers) — dominates NATS server memory and CPU. Concrete ceilings: ~500-1k peels works today; 1k-3k breaks wide jobs and orphan scanning; 10k requires re-architecting target resolution (master-side index service), return aggregation (per-peel keys only, prefix reads), and an active-jobs index.

8. 🔴 CRITICAL — Orphan scanner does full jobs-bucket scan (ListKeys + per-key Get) every 20s on every master, while scheduled results flood the same bucket

Weakness: OrphanScanner.scan enumerates every key in the jobs bucket and issues an individual KV Get per JID to check ownership — O(total retained jobs) work per 20s tick, per master, with no index separating active from terminal jobs. The jobs bucket retains 7 days of history (TTL 7d, History 10) and schedresult.go creates a synthetic job (jobsBucket.Create) plus two job-returns writes for every scheduled return_job run on every peel, so bucket cardinality grows as peels × schedule-frequency × 7 days.
Scenario: 1k peels each running one return_job schedule every 5 minutes = ~2M job keys accumulated over the 7-day TTL. Every master then attempts 2M sequential KV Gets per 20s scan; at even 1ms/Get a single scan takes >30 minutes, so scans permanently overlap the interval and dead-master job recovery latency goes from ~20s to effectively unbounded. Multiple masters do this simultaneously, multiplying read load on the NATS cluster.
Impact: HA guarantee silently degrades fleet-wide: orphaned jobs from a crashed master are not reclaimed for minutes-to-hours, and the scan itself becomes a sustained read storm against JetStream that steals throughput from dispatch and settings traffic. Breaks at low thousands of retained jobs — before 1k peels if schedules are used.
Fix: Maintain an active-jobs index: a small 'jobs-active' KV bucket (or key prefix like 'active.<jid>') written at claim time and deleted in finalizeJob, so the scanner enumerates only non-terminal jobs. Alternatively key active jobs by owner ('owner.<masterID>.<jid>') so reclaim is a single prefix listing per dead master. Also exclude scheduler-synthetic jobs from the scannable namespace entirely (they are already terminal at creation).
Evidence: /Users/ptorbus/SynologyDrive/tools/zester/pkg/job/recovery.go:68-77 (ListKeys over whole bucket + Get per jid, every 20s); pkg/job/schedresult.go:123,143-146 (job Create + 2 returns writes per scheduled run); pkg/bus/kv.go DefaultBuckets (jobs: TTL 7d, History 10)
Effort: medium · Confidence: high

9. 🔴 CRITICAL — Wide-job return path has a hard ~1MB ceiling: all returns aggregated into a single KV value, collected serially in one NATS callback

Weakness: Watcher.finalizeJob encodes every peel's return into ONE KV entry (KVPut(returnsBucket, jid, returns)), bounded by NATS max payload (1MB default). All returns for a job arrive on a single subscription whose handler runs serially and performs synchronous KV writes (persistReturns, up to 10 puts) inside the message callback, so a return burst from a wide target queues against the sub's pending limit (64MB default) while the handler blocks on JetStream round-trips. Recovery/read fallback (recoverPerPeelReturns) lists ALL keys in the job-returns bucket to find one job's per-peel keys.
Scenario: zester '*' cmd.run on 3k peels where each return carries ~1KB of command output: 3k returns × 1KB ≈ 3MB aggregate → the finalize KVPut fails (payload too large), logged as a warning only; 'zester job lookup' then falls back to a full-bucket key scan. At 10k peels returning within a few seconds, serial persistReturns inside the callback pushes the subscription past its pending limit → slow-consumer drops → returns lost → job wrongly finalized as partial/timeout.
Impact: Job results for any wide target are lost or unretrievable; operators see phantom partial/timeout statuses. This caps reliable job width at roughly 1MB / avg-return-size (~1-3k peels for realistic outputs) regardless of NATS cluster capacity.
Fix: Drop the aggregated-returns KV value entirely: per-peel keys '{jid}.{peelID}' are already written — make them the source of truth and read them via a prefix-filtered watch (kv.Watch(ctx, jid+".*")) instead of full-bucket Keys(). Move return persistence out of the NATS callback into a buffered channel + dedicated writer goroutine with batching. Store only counts/status in the job record.
Evidence: /Users/ptorbus/SynologyDrive/tools/zester/pkg/job/watcher.go:325-333 (single aggregated KVPut of all returns), watcher.go:97-127 (serial callback with synchronous persist), pkg/job/manager.go:252-275 (recoverPerPeelReturns: Keys() over entire job-returns bucket)
Effort: medium · Confidence: high

10. 🔴 CRITICAL — Target resolution and basket() queries transfer the entire facts bucket per call — O(N) per query, O(N²) fleet-wide; the facts.Index built for this is wired to nothing

Weakness: Every fact/compound target resolution — CLI job dispatch (target.KVPeelLister), masterapi, and every basket() call in every peel's template render — does ListPeelsWithFacts, i.e. KVGetAll = a WatchAll drain of the full facts bucket (all peels' complete fact maps) over NATS. pkg/facts/index.go implements exactly the in-memory inverted index needed, but no binary constructs it. basket() runs during settings rendering, so a single settings change fans out to N peels re-rendering, each pulling N fact entries.
Scenario: 10k peels × ~5KB facts = 50MB per basket() call or per 'zester G@os:ubuntu ...' invocation (multi-second CLI latency). A settings-files revision bump triggers re-resolve on all 10k peels within the 2s debounce + 5s jitter window; if templates use basket(), that is 10k × 50MB ≈ 500GB attempted through the NATS cluster in ~7 seconds — total saturation. Even at 1k peels it is 5GB in a burst, enough to stall JetStream replication and job traffic.
Impact: NATS cluster bandwidth/CPU saturation coupled to routine config changes; CLI latency grows linearly with fleet size; basket-driven service discovery becomes the fleet's biggest load generator. Practical ceiling ~1-2k peels for any deployment using basket() or fact targeting.
Fix: Wire facts.Index into the master: the facts.Watch stream the master already consumes can maintain the index incrementally. Expose target resolution and basket queries as a NATS request/reply service on the masters (queue group, index lookup = microseconds, response = matched IDs only). Peels' makeBasketFunc and the CLI's KVPeelLister call that service instead of scanning the bucket; keep direct-scan only as a --direct fallback.
Evidence: /Users/ptorbus/SynologyDrive/tools/zester/cmd/zester-peel/main.go:1328-1334 (ListPeelsWithFacts → bus.KVGetAll), pkg/bus/kv.go:155-179 (KVGetAll = WatchAll full drain), cmd/zester/cmd/exec.go:145-146 (CLI full resolution per invocation), pkg/facts/index.go:14-24 (Index exists, grep shows no non-test constructor call anywhere in cmd/ or pkg/)
Effort: large · Confidence: high

11. 🟠 HIGH — Master facts watcher re-encrypts and republishes secrets on every facts update, on every master, with full-bucket replay at startup

Weakness: facts.Watch runs on ALL masters (plain WatchAll, no queue group or leader election), and the callback unconditionally re-encrypts (randomized sealed-box, new ciphertext every time) and KVPuts the peel's secrets on every facts revision — with no change detection. Memory/disk collectors (5-min interval) report volatile values (free/used bytes), so the hash-skip in Manager.publish never suppresses; every peel writes facts every ~5 min forever. WatchAll also replays the entire bucket on master start and reconnect. The callback additionally does a synchronous enrollment KV Get per event on a single serial goroutine.
Scenario: 10k peels ÷ 300s = ~33 facts updates/s steady-state; with 3 masters that is ~100 pointless secret encrypt+KVPut/s into the secrets bucket (History 3) around the clock, each write also delivered to the target peel's secrets watcher. A master restart replays 10k facts entries → a burst of 10k enrollment Gets + 10k secret re-encryptions + 10k KV writes, times each restarting master — minutes of churn that competes with job dispatch.
Impact: Sustained write amplification on the secrets bucket and wasted master CPU that scales linearly with fleet size AND master count; the serial callback goroutine becomes a bottleneck near ~1k events/s (each event costs at least one KV round-trip). Multi-master adds load instead of sharing it.
Fix: Hash-gate secret publishing: cache SHA-256(secrets set + curve key) per peel and skip the KVPut when unchanged (mirror of the peel-side facts hash-skip). Make secret-republish a single-owner responsibility (leader via heartbeat bucket CAS, or shard peels across masters by consistent hash of peelID). Separate volatile metrics (memory/disk usage) from identity facts so routine telemetry does not bump the facts revision at all.
Evidence: /Users/ptorbus/SynologyDrive/tools/zester/cmd/zester-master/main.go:400-427 (per-update FindByPeelID + PublishSecrets, runs on every master), pkg/settings/publish.go:206-229 (unconditional encrypt + KVPut), pkg/facts/collectors/memory.go:22-28 (volatile free/used values defeat hash-skip), pkg/facts/manager.go:269 (WatchAll → full replay on start)
Effort: medium · Confidence: high

12. 🟠 HIGH — ~4 ephemeral JetStream consumers per peel: 40k server-side consumers at 10k peels, recreated en masse after NATS failover

Weakness: Each peel holds long-lived KV watchers: settings-files revision key, per-peel secrets key, master curve pub key, and state-files revision key — each an ephemeral ordered push consumer on the NATS cluster. Consumer count scales as ~4×N with no aggregation. Every watcher independently reconnects with its own backoff loop after a consumer loss, so a NATS server restart or failover triggers a fleet-wide consumer-recreation storm.
Scenario: At 10k peels the JetStream cluster carries ~40k ordered consumers across the settings-files, secrets, and state-files streams — tens-to-hundreds of MB of consumer state and per-consumer delivery bookkeeping on each server. A rolling NATS upgrade drops them all; 40k consumer-create requests arrive within the (uncoordinated, per-watcher) 1-30s backoff window, spiking the JetStream meta layer exactly when the cluster is weakest, potentially re-tripping timeouts and prolonging the storm.
Impact: NATS server memory and meta-leader CPU become the scaling bottleneck around ~5-10k peels; failover recovery time grows with fleet size, turning a routine NATS restart into a multi-minute control-plane brownout.
Fix: Collapse per-peel watchers: merge the secrets and curve-pub watches into one (both live in the secrets bucket — watch 'peelID' and '_master_curve_pub' via one multi-key watcher), and consider replacing low-churn revision watches with a single lightweight core-NATS broadcast subject ('zester.settings.revision') published by the master — one publish fans out to all peels with zero JetStream consumers; peels then do gated KV reads. Add global jitter (e.g. hash(peelID) mod 60s) to watcher reconnect backoff.
Evidence: /Users/ptorbus/SynologyDrive/tools/zester/cmd/zester-peel/main.go:497-523 (three settings-related watchers) and :539 (state cache watcher); pkg/settings/resolve.go:343,464,528 (each kv.Watch = one ordered consumer); pkg/statefiles/cache.go:97
Effort: medium · Confidence: high

13. 🟡 MEDIUM — Job record and dispatch request embed the fully-resolved target list, resolved client-side

Weakness: Target resolution happens in the CLI, and the complete peel-ID list is embedded in the Job struct, sent in the zester.job.dispatch request, stored in the jobs KV record, and rewritten on every CAS status update. Message and KV-value size grow linearly with target count toward the 1MB NATS payload cap; dispatch then publishes one ExecRequest per target in a serial loop from one goroutine.
Scenario: zester '*' at 10k peels with 40-char peel IDs: ~500KB dispatch request and jobs-KV value, rewritten at least twice (claimed→running→final) per job — approaching the 1MB cap; add job args/metadata and it fails outright with an opaque publish error. The 10k serial nc.Publish calls also delay the last peel's start by the full loop duration, skewing 'simultaneous' execution.
Impact: Hard cap on job width (~10-20k targets depending on ID length and args), triple write amplification of a large value into the R-replicated jobs stream per job, and increasingly skewed execution start times on wide targets.
Fix: Store the target expression + a count (and optionally a content-addressed target-list object in the Object Store for exact-audit needs) instead of the inline ID list; resolve on the master (ties into the master-side index service from the target-resolution finding). Dispatch wide jobs via a single broadcast subject with peel-side target matching (peels already hold their own facts) rather than N unicast publishes.
Evidence: /Users/ptorbus/SynologyDrive/tools/zester/cmd/zester/cmd/exec.go:145-161 (client-side resolve, full list into job), pkg/job/manager.go:57-140 (job with Targets encoded into KV, per-target publish loop, two CAS rewrites)
Effort: large · Confidence: high

14. 🟡 MEDIUM — Basket publisher writes unconditionally on every interval; no change-skip like the facts path

Weakness: Publisher.refreshLoop KVPuts each configured basket function every interval (default 5m) regardless of whether the value changed, unlike facts.Manager.publish which hash-skips. Each write is a new revision on the basket stream replicated across the JetStream cluster.
Scenario: 10k peels × 3 basket functions ÷ 300s = 100 writes/s of unchanged values (typically static IPs/hostnames) into the basket bucket, 24/7 — pure replication and fsync load on the NATS cluster that grows linearly with fleet size and function count, competing with job and settings traffic during bursts.
Impact: Steady-state background write load that consumes a meaningful fraction of a modest NATS cluster's write budget at 10k peels, for data that changes approximately never; also inflates stream sequence churn that all basket readers' watchers must track.
Fix: Add the same last-value hash-skip used in pkg/facts/manager.go publish(): cache the last published value per function and skip the KVPut when identical. Optionally lengthen the default interval and rely on facts-change triggers for freshness.
Evidence: /Users/ptorbus/SynologyDrive/tools/zester/pkg/basket/publisher.go:113-135 (unconditional KVPut in refreshLoop) vs pkg/facts/manager.go:199-208 (hash-skip pattern that exists but wasn't reused)
Effort: small · Confidence: high

3.3 Consistency & Data Integrity

The CAS/epoch job model is fundamentally sound (Create for idempotent claim, epoch-fenced Update, CAS finalize), but the surrounding machinery has real lost-update and duplicate-execution windows: liveness errors are swallowed and treated as "no live masters" (mass false reclaim), a crash between ExecRequest publish and the running-status CAS causes double execution on reclaim, and failover loses returns that are durably sitting unused in the job-events stream. The settings/state-file "atomic batch via _revision" protocol is single-writer by construction while the deployment is symmetric multi-master: BumpRevision is Get+Put (not CAS despite its doc comment), every master publishes on startup and per GitFS sync, and the peel resolver caches under the revision read after resolution — so torn snapshots can be pinned as current. Worst of all, the revision gate deliberately suppresses secrets-triggered re-resolves, so secret rotation (and late initial secret delivery) silently never propagates until an unrelated settings-file change. Deletion is never propagated in either distribution pipeline, and 7-day KV TTLs silently erase non-terminal jobs and their idempotency guards.

15. 🔴 CRITICAL — ListLiveMasters swallows KV errors as "no live masters", triggering mass false-orphan reclaim

Weakness: Liveness detection conflates infrastructure failure with master death. pkg/job/heartbeat.go ListLiveMasters returns an EMPTY map with nil error when kv.ListKeys fails ('No keys means no live masters'), and OrphanScanner.scan proceeds with that empty set, treating every non-terminal job owned by another master as orphaned. The master-heartbeat bucket defaults to Replicas:1, so a single NATS node restart makes the lister fail or return empty for all masters simultaneously.
Scenario: Two masters, NATS node hosting master-heartbeat restarts (or a transient JetStream timeout occurs) during an orphan scan tick (every 20s). Both masters see zero live masters, both reclaim each other's active jobs via CAS, ReclaimJob re-dispatches all StatusClaimed jobs to peels and spins up duplicate watchers for running jobs — while the real owners are alive and their watchers still running.
Impact: Fleet-wide duplicate job dispatch (non-idempotent cmd.run executed twice on every targeted peel), duplicate watchers racing on finalize (epoch fencing saves the job record but not the side effects on managed nodes), and ownership churn across all in-flight jobs. Blast radius is every active job in the cluster from a single transient KV blip.
Fix: Distinguish jetstream.ErrNoKeysFound from real errors in ListLiveMasters and return the error; make OrphanScanner.scan abort the cycle on liveness-read failure; require the owner to be absent for N consecutive scans (or compare job.Updated age against heartbeat TTL) before reclaiming.
Evidence: pkg/job/heartbeat.go:84-88 (lister error → empty map, nil error); pkg/job/recovery.go:55-124 (scan reclaims when owner not in liveMasters, no error path)
Effort: small · Confidence: high

16. 🔴 CRITICAL — Revision gate suppresses secrets-triggered re-resolves: rotated or late-arriving secrets never take effect

Weakness: The resolver's cache is keyed only on the settings-files bucket _revision, but secrets live in a separate bucket updated on a different trigger (master publishes per-peel secrets only when that peel's facts arrive, after PublishRawFiles has already bumped _revision). The peel's WatchSecrets callback funnels into the same reResolve, which hits the revision-gated cache and returns the stale result — nothing invalidates the cache on a secrets change (only master-curve-key rotation does, via SetSenderPub).
Scenario: (a) Startup race: peel resolves before the master's facts-watcher has published its secrets; loadSecrets fails soft ('continuing without'), so cachedResult contains literal ZESTER_SECRET:db.password placeholder strings, cached under revision N. Secrets land seconds later, WatchSecrets fires, Resolve sees rev==N and returns the placeholder-laden cache. (b) Rotation: operator changes only a secret value; master re-encrypts and Puts the peel's secrets key; no settings-file changed, so _revision never bumps and the old secret is used indefinitely.
Impact: Data-integrity failure in the config plane: services configured with placeholder strings as passwords, or with revoked/old credentials after rotation — silently, per-peel, until an unrelated settings-file edit happens to bump _revision. Scheduler entries and basket_scope derived from cachedSettings inherit the stale/placeholder values.
Fix: Make the secrets watcher call resolver.InvalidateCache() before triggering re-resolve (mirroring the curve-key path), or key the cache on (settings-files _revision, secrets-entry revision) pair; additionally treat unresolved placeholders in the final merged map as a resolve failure rather than a cacheable success.
Evidence: pkg/settings/resolve.go:130-137 and 212-215 (revision-gated cache), resolve.go:169-172 (soft-fail on missing secrets); cmd/zester-peel/main.go:469-508 (WatchSecrets → debounced reResolve with no InvalidateCache); cmd/zester-master/main.go:256 vs 400-426 (files+bump at startup, secrets only on facts arrival)
Effort: small · Confidence: high

17. 🟠 HIGH — Dispatch claimed→publish→running window causes double execution on reclaim; peel epoch fencing is designed to allow it

Weakness: Dispatch persists the job as StatusClaimed (Create), then publishes ExecRequests to all targets, then CAS-updates to StatusRunning. A crash between the publish loop and the CAS leaves a job that peels are already executing but that KV says was never dispatched. ReclaimJob unconditionally re-dispatches StatusClaimed jobs. The peel-side epoch fence (epochMap) accepts any request with a higher epoch for the same JID — and the reclaim path always carries a higher epoch (the reclaim CAS bumped the revision) — so the duplicate is accepted by design. The epochMap is also in-memory only, so a peel restart clears all fencing state.
Scenario: Master crashes (or is falsely declared dead per finding 1) immediately after publishing ExecRequests for a 500-target cmd.run job but before the running-status CAS. 15s heartbeat TTL + 20s scan later, another master reclaims and re-publishes; every peel that already ran the command runs it again because the reclaim epoch exceeds the stored one.
Impact: At-least-twice execution of arbitrary, potentially non-idempotent commands across the entire target set of any job caught in the window. For state.apply this is mostly benign; for cmd.run (restarts, migrations, destructive ops) it is not.
Fix: Invert the ordering: CAS to StatusRunning (or a 'dispatching' status recording intent) before publishing ExecRequests, so reclaim of claimed jobs is provably safe; track per-peel acks in KV during dispatch and re-dispatch only to unacked targets; persist the peel's JID/epoch dedup map to disk so restarts don't reopen the window.
Evidence: pkg/job/manager.go:73-143 (Create → publish loop → CAS to running), manager.go:158-196 (ReclaimJob re-dispatches StatusClaimed); cmd/zester-peel/main.go:1011,1056-1066 (in-memory epochMap accepts strictly-higher epochs)
Effort: medium · Confidence: high

18. 🟠 HIGH — Returns lost during master failover even though they are durably stored in the job-events stream

Weakness: Watchers collect acks/returns via core-NATS subscriptions and persist to KV only every 10 returns (returnPersistThreshold). During the dead window (heartbeat TTL 15s + up to 20s scan interval), returns published by peels have no subscriber. Those messages land in the job-events JetStream stream (subjects zester.job.> include zester.job.<jid>.return.<peel>), i.e., the data is durably captured — but recovery (ReclaimJob/recoverPerPeelReturns) reads only the KV per-peel keys and never replays the stream. Up to 9 already-received returns are also lost on hard crash (kill -9 skips the detach persist).
Scenario: Master owning a 200-peel job crashes; 150 peels return during the ~35s ownership gap. The reclaiming master seeds only the ≤ threshold-persisted returns, waits out the remaining timeout, and finalizes the job as partial/timeout despite every peel having succeeded and every return sitting in the job-events stream.
Impact: Incorrect terminal job status and missing result data for any job in flight during a master failure; operators re-run jobs that already succeeded (compounding the double-execution risk of finding 3). Undermines the audit trail the stream was built for.
Fix: On reclaim, replay zester.job.<jid>.return.* from the job-events stream (it already retains 7 days) to reconstruct the return set before starting the recovery watcher; alternatively have peels write returns directly to the job-returns KV keyed {jid}.{peel} (idempotent Put) with the NATS publish as a mere notification.
Evidence: pkg/job/watcher.go:12-16 (threshold 10), watcher.go:96-127 (core-NATS subscription), manager.go:198-229 (recovery reads only KV); pkg/bus/kv.go:383-393 (job-events stream captures zester.job.>); cmd/zester-peel/main.go:1337-1355 + pkg/bus/subjects.go:157-158 (return subject under the stream)
Effort: medium · Confidence: high

19. 🟠 HIGH — The _revision batch protocol assumes a single writer, but every master publishes settings and state files; BumpRevision is not atomic

Weakness: The 'write all files then bump _revision' protocol only yields consistent snapshots with one publisher, yet the architecture is symmetric: every master publishes settings-files and state-files on startup and republishes after each GitFS sync (own 5m timer per master). BumpRevision itself is Get+Put with no CAS despite its 'atomically increments' doc comment — concurrent bumps lose updates. Per-key Put interleaving between two masters with divergent working trees (git clones synced at different moments) yields a final bucket state mixed across two source versions. The peel resolver compounds this: it caches the resolved result under the _revision read AFTER resolution (resolve.go:214), so a resolve that raced a concurrent publish pins a torn file set as the current-revision snapshot until the next bump.
Scenario: Two masters restart after a config push; master A has the new git commit, master B is one sync behind. Both walk their trees and Put every key; final per-key values are an arbitrary interleaving of old and new versions, then both bump _revision. Peels re-resolve, some read mid-interleave, and each caches whatever mixture it saw under the latest revision. Fleet now runs heterogeneous, partially-old settings with no signal anything is wrong.
Impact: Fleet-wide configuration skew and torn template sets (e.g., top.zy from commit N referencing files from commit N-1); state.highstate compiles inconsistent formulas. Self-healing only occurs at the next publish event.
Fix: Elect a single active publisher via a KV lease (Create with TTL on a well-known key) so only one master runs PublishRawFiles/statefiles.Publish/GitFS republish; fix BumpRevision to a CAS retry loop (kv.Update with the read revision); have the resolver cache under the revision read at the START of Resolve and retry if the end revision differs.
Evidence: pkg/bus/kv.go:202-216 (Get+Put, no CAS); pkg/settings/publish.go:178-201 and pkg/statefiles/publisher.go:62-77 (per-key Puts then bump); cmd/zester-master/main.go:256,284,290-304 (every master publishes + GitFS republish); pkg/settings/resolve.go:130-137,212-215 (cache keyed on post-resolve revision)
Effort: medium · Confidence: high

20. 🟡 MEDIUM — No deletion propagation: removed settings/state files persist forever in KV and on peel disk caches

Weakness: Both distribution pipelines are Put-only. settings.PublishRawFiles and statefiles.PublishFiles never delete KV keys absent from the current source tree, and the peel-side statefiles.Cache.Sync writes files but never removes local files whose KV keys are gone. There is no batch manifest to define set membership, so 'the current file set' is undefined — it is the union of everything ever published.
Scenario: Operator deletes a formula (webserver/) from git and removes its top.zy entry in the same commit — that works. But a highstate that still references it (stale top on a torn peel per finding 5), an operator running state.apply webserver directly, or a renamed file (old path lingers alongside new) all silently execute the deleted/obsolete .zy content pulled from KV or the on-disk cache. Decommissioned secrets templates similarly remain readable in settings-files indefinitely (History:3 keeps prior versions too).
Impact: Configuration that was intentionally removed from the source of truth remains applicable fleet-wide, indefinitely; renames double-apply; retired sanitized templates (and their placeholder structure) never leave the bus. Violates the operator's mental model that git is authoritative.
Fix: Publish a _manifest key (list of keys + hashes) in the same batch before bumping _revision; publishers delete KV keys not in the manifest, and Cache.Sync removes local files not in the manifest. This also gives peels an integrity check that the snapshot they read matches the batch that bumped the revision.
Evidence: pkg/statefiles/publisher.go:62-77 (Put-only), pkg/statefiles/cache.go:51-83 (write-only sync, no pruning), pkg/settings/publish.go:178-201 (Put-only)
Effort: medium · Confidence: high

21. 🟡 MEDIUM — Job delivery to peels is fire-and-forget core NATS: an offline peel silently never receives the job

Weakness: ExecRequests are published per-target on zester.cmd.<peel> via core NATS with no persistence, no delivery confirmation reconciliation, and no redelivery. The ack subscription exists but acks are only collected in memory for observability — the watcher never compares acked vs targeted peels to re-send or to distinguish 'peel never got it' from 'peel is slow'. The cmd subjects are not covered by any JetStream stream.
Scenario: A 100-peel job dispatches while 3 peels are briefly reconnecting to NATS (restart, network flap). Those peels never see the request; the job burns its full timeout and finalizes as partial. The operator cannot tell delivery failure from execution failure, and re-running the whole job re-executes on the 97 peels that already succeeded.
Impact: Systematic under-execution on any fleet with normal connection churn; timeout-bound latency inflation for every partially-delivered job; retries amplify duplicate execution on healthy peels.
Fix: Use ack data: after a short ack window, re-publish to unacked targets (bounded retries) before the timeout; or move dispatch onto a JetStream work stream per peel (durable consumer per peel) so offline peels receive queued jobs on reconnect, with the existing JID/epoch dedup making redelivery safe.
Evidence: pkg/job/manager.go:109-127 (plain nc.Publish per target, errors only logged); pkg/job/watcher.go:74-93 (acks collected but never reconciled against targets); pkg/bus/kv.go:383-393 (job-events stream covers zester.job.>, not zester.cmd.*)
Effort: large · Confidence: high

3.4 Operability & Observability

Zester has essentially zero production telemetry despite the scaffolding existing in-tree: internal/metrics defines a complete Prometheus registry (23 metrics: job throughput/duration/active, connected peels, facts sync errors, settings render, state apply, NATS reconnects/slow-consumers) that is imported by no binary, and internal/health defines a real Checker/Readiness framework while both daemons serve a hardcoded always-200 /healthz stub. The stub is not benign: the self-update watchdog's soak-and-auto-rollback decision polls it, so a functionally dead child (NATS-disconnected, hung exec loop) passes soak and a broken fleet update gets confirmed. There is no peel presence signal at all — zester peel list claims to list "connected" peels but reads a 7-day-TTL facts bucket. Job.User exists in the schema and is even CAS-compared, but nothing ever sets it, so there is no who-ran-what audit trail. Logging is format-inconsistent (text on master/peel, JSON on watchdog) with hardcoded Info level and no flag. The day-2 diagnosis story for "peel X isn't applying state" is: SSH to the node and read stdout. The minimal wiring to close most of this is small: the health HTTP listener already exists in both daemons — mount metrics.Handler() and health.Checker on it, add a peel heartbeat bucket mirroring the existing master-heartbeat, and set Job.User at CLI dispatch.

Weakness: internal/metrics/metrics.go defines NewMasterRegistry()/NewPeelRegistry() with 23 metrics (zester_jobs_total by status, job_duration_seconds, job_active, connected_peels, facts_sync_errors_total, settings_render_duration, state_apply_total/duration, nats_reconnects_total, nats_slow_consumers_total, peel_uptime, beacon events) plus a promhttp Handler(), but a grep for importers finds zero — no binary instantiates the registry, no code path increments a counter, no /metrics endpoint exists anywhere. Orphan-reclaim events and rollout progress aren't even defined as metrics. The only signals are stdout logs and manual CLI polling of KV.
Scenario: Job failure rate climbs from 1% to 40% after a bad state file push, or NATS slow-consumer events start dropping returns under fleet load. No alert fires because nothing can fire; operators learn from users. Capacity questions ('how many peels are connected? what is p99 highstate duration?') are unanswerable without ad-hoc KV scraping.
Impact: Entire fleet. No SLOs, no alerting, no capacity planning, no regression detection across every subsystem (jobs, facts, settings, updates, NATS transport). This is the single largest day-2 gap.
Fix: Minimal wiring, no new listener needed: in cmd/zester-master/main.go and cmd/zester-peel/main.go instantiate the registry and mount metrics.Handler() on the existing startLocalHealthServer mux (mux.Handle("/metrics", reg.Handler())). Pass the Registry into job.Manager (increment JobsTotal/JobDuration in watcher finalize), facts.Watch (FactsSyncTotal/Errors), bus.Client handlers (NATSReconnects/SlowConsumers via the already-present DisconnectErr/Error handlers at pkg/bus/client.go:130-161), and RolloutController. Add a reclaim counter to the registry for OrphanScanner.
Evidence: internal/metrics/metrics.go:54-258 (full registry + Handler); grep -rn internal/metrics over *.go returns no importer outside the package itself; prometheus already in go.mod.
Effort: medium · Confidence: high

23. 🟠 HIGH — Static always-200 /healthz feeds the self-update auto-rollback decision — soak confirms broken binaries

Weakness: Both daemons' startLocalHealthServer returns a hardcoded {"status":"ok"} with WriteHeader(200) unconditionally (cmd/zester-master/main.go:610-616, cmd/zester-peel/main.go:1127+). It is a liveness stub, not a health check: it does not consult NATS connectivity even though bus.Client tracks a healthy bool and exposes IsHealthy() (pkg/bus/client.go:217-221) that no binary ever calls. internal/health's Checker/ReadinessHandler with per-check timeouts and 503 semantics exists unused. Critically, this stub is the exact endpoint the watchdog polls: update.Supervisor.CheckHealth hits HealthURL (pkg/update/supervisor.go:274-276, default http://127.0.0.1:9090/healthz) and update.Handler's soak loop auto-rolls-back only on its failure (pkg/update/handler.go:253-287).
Scenario: A rollout ships a peel build with a broken NATS TLS config or a deadlocked exec handler. The child process starts, binds the health port, serves 200. WaitForHealthy passes, soak passes, watchdog reports 'confirmed', RolloutController advances to the next batch. The entire fleet is 'successfully updated' to agents that can never receive a job; auto-rollback — the system's headline safety feature — never triggers unless the process fully crashes.
Impact: Fleet-wide blast radius via the update system: the safety mechanism validates the wrong signal. Also breaks k8s/systemd readiness gating and monitoring for masters (a master with NATS down reports healthy).
Fix: Replace the inline stub with internal/health.Checker: register a NATS check (client.IsHealthy()), a JetStream KV round-trip check, and on peels a 'exec-loop responsive' check (timestamp touched by the dispatch subscriber, StatusDown if stale). Keep GET /healthz as liveness but point the watchdog's HealthURL at a new /readyz backed by the Checker so soak validates functional health.
Evidence: cmd/zester-master/main.go:612-616 (unconditional 200); pkg/bus/client.go:218 IsHealthy() with zero callers in cmd/; pkg/update/handler.go:277-287 soak rollback keyed solely on Supervisor.CheckHealth of that URL; internal/health/health.go:72-152 unused Checker/ReadinessHandler.
Effort: medium · Confidence: high

24. 🟠 HIGH — No peel presence/heartbeat — 'connected peels' is unknowable and `zester peel list` lies

Weakness: Masters heartbeat into a 15s-TTL bucket (pkg/bus/kv.go:262) but peels have no equivalent. The only per-peel artifact is the facts KV entry with a 7-day TTL (pkg/bus/kv.go:248). zester peel list (cmd/zester/cmd/peel.go:25-59), documented as 'List connected peels', simply lists facts-bucket keys — a peel that died six days ago is indistinguishable from a live one. The metrics registry's zester_connected_peels gauge has no data source even if it were wired.
Scenario: Operator investigates 'peel X isn't applying state'. Step zero — is the peel even online? — has no answer: peel list shows it, facts look present (stale), and the only probe is dispatching test.ping and waiting for the job timeout. At fleet scale, 50 silently-dead agents (OOM-killed, cert expired, NATS auth revoked) drift out of configuration compliance for up to a week with zero signal, which for a config-management system is the primary failure mode it exists to prevent.
Impact: Silent configuration drift across the fleet; every diagnosis workflow starts with an unanswerable question; targeting resolves jobs to dead peels producing chronic 'partial' job statuses that mask real failures.
Fix: Add a peel-heartbeat KV bucket mirroring the existing master pattern (pkg/job/heartbeat.go — 5s interval, 15-30s TTL; the Heartbeater code is reusable nearly verbatim). Surface last-seen/online-offline in zester peel list, feed the ConnectedPeels gauge from a master-side watch, and alert on expected-vs-alive divergence.
Evidence: cmd/zester-master/main.go — master-heartbeat exists; pkg/bus/kv.go:248 (facts TTL 7d) vs :262 (master-heartbeat TTL 15s); cmd/zester/cmd/peel.go:27 'List connected peels' backed by listKVKeys over the facts bucket at :56-59.
Effort: medium · Confidence: high

25. 🟡 MEDIUM — Job.User exists but is never populated — no audit trail of who ran what

Weakness: The Job struct carries User string documented as 'the identity that initiated the job' (pkg/job/job.go:59-60) and the manager even CAS-compares it on idempotent re-dispatch (pkg/job/manager.go:405), but NewJob (job.go:74-85) never sets it and the CLI dispatch path (cmd/zester/cmd/exec.go:161 job.NewJob(module, modArgs, peels, timeout)) never assigns it — so it is empty on every job ever dispatched. Combined with the jobs bucket's 7-day TTL, there is no durable record of who executed arbitrary commands (cmd.run!) across the fleet, and --direct mode bypasses job records entirely.
Scenario: Post-incident review: someone ran zester '*' cmd.run 'rm -rf /var/lib/app' last Tuesday. The job record (if still within TTL) shows function, args, targets — and a blank user. NATS server logs might show a connection, but nothing maps it to the job. Compliance/SOC2 evidence of change control is impossible to produce.
Impact: Security/compliance gap for a tool whose core function is remote arbitrary execution as root; forensic attribution impossible; insider misuse undetectable.
Fix: Populate j.User in the CLI from the NATS creds identity (or os/user.Current() as fallback — the pattern already exists in cmd/zester/cmd/enroll_approve.go:43) before dispatch; have the master log dispatch acceptance at Info with jid+user+function+target; additionally append terminal job records to a long-retention JetStream audit stream rather than relying on the 7-day KV TTL. Surface user in zester job show.
Evidence: pkg/job/job.go:59-60,74-85 (field defined, constructor omits it); cmd/zester/cmd/exec.go:161 (no assignment); pkg/job/manager.go:405 (compares two always-empty strings).
Effort: small · Confidence: high

26. 🟡 MEDIUM — Rollout controller state is fire-and-forget: no resume, ignored persistence errors, no progress events

Weakness: RolloutController runs the batch loop as an in-memory goroutine on whichever master started it; there is no resume/recovery path on master startup (grep for Resume in pkg/update/rollout.go and cmd/zester-master/main.go: none). Mid-run KV persistence errors are systematically discarded (_ = r.store.Save(...) at rollout.go:289, 302, 341, 437, 489, 492), and progress is emitted only as local logs — no metrics, no event stream an operator can subscribe to.
Scenario: Master crashes (or is itself updated) between batch 2 and 3 of a 10-batch fleet rollout. The rollout record sits in the update-rollouts bucket in state 'applying' forever; zester update status --rollout <id> reports an in-progress rollout that no process is driving; half the fleet runs v0.5.0 and half v0.4.x indefinitely with no alert. Separately, if a CAS Save fails mid-batch, the persisted batch index diverges silently from reality, so post-mortem state in KV is untrustworthy.
Impact: The highest-blast-radius operation the system performs (fleet binary replacement) is undriveable after a master restart and its recorded state may be wrong; operators cannot distinguish stalled from progressing without reading master logs.
Fix: On master startup, scan update-rollouts for non-terminal states and either resume (state machine is already batch-indexed and persisted) or mark them 'stalled' with a reason. Add a controller heartbeat/lease field so status output can show 'no active driver'. Handle Save errors (retry, then abort loudly). Emit rollout lifecycle events on zester.update.event.* and wire batch/node counters into the metrics registry.
Evidence: pkg/update/rollout.go:289,302,341,437,489,492 (ignored Save errors); no Resume symbol in pkg/update or cmd/zester-master/main.go; progress visible only via r.logger.Info calls (rollout.go:304,342).
Effort: medium · Confidence: high

27. 🟡 MEDIUM — Inconsistent log formats and hardcoded Info level with no operator control

Weakness: Master and peel hardcode slog TextHandler at LevelInfo (cmd/zester-master/main.go:75, cmd/zester-peel/main.go:69) while the watchdog uses a JSONHandler (cmd/zester-watchdog/main.go:63). Neither daemon exposes a --log-level or --log-format flag (full flag list at master main.go:58-73 confirms). Since watchdog wraps peel/master as parent process, a single node emits interleaved JSON and text lines on the same stdout stream.
Scenario: Operator needs debug-level output to diagnose a settings-resolution failure on one peel: impossible without a rebuild. Fleet log aggregation (Loki/ELK) needs two parsers per node and text-format key=value with spaces breaks field extraction; watchdog-vs-child lines can't be uniformly queried.
Impact: Every debugging session and any centralized logging deployment; raises MTTR on all other findings since logs are currently the only telemetry.
Fix: Shared logging bootstrap in internal/ (or pkg/): --log-level/--log-format flags plus YAML fields per the project's flag.Visit convention, JSON default for daemons, and a consistent base attr set (component, peel_id/master_id, version) on every line.
Evidence: cmd/zester-master/main.go:75 and cmd/zester-peel/main.go:69 (NewTextHandler, LevelInfo hardcoded) vs cmd/zester-watchdog/main.go:63 (NewJSONHandler); no log flag among master flags at main.go:58-73.
Effort: small · Confidence: high

28. 🟡 MEDIUM — Diagnosis of peel-side failures requires SSH: no central log/result capture for scheduled and local activity

Weakness: Correlation is decent inside the job path — JID flows through ExecRequest and is logged on the peel (cmd/zester-peel/main.go:1071,1110) and in the watcher (pkg/job/watcher.go) — but everything a peel does outside an operator-initiated job is observable only on that node's stdout: settings re-resolution failures, facts publish errors, state-file cache sync errors, and scheduled runs without return_job (schedule results are published only when return_job is set, main.go:964-978). There is no 'last highstate result' or 'last error' record per peel in KV, and no command to pull recent peel-side events centrally.
Scenario: 'Peel X isn't applying its scheduled highstate.' The schedule entry lacks return_job, so failures never leave the node. The operator must SSH in (through whatever bastion the managed node sits behind) and grep text logs — for a tool whose premise is not needing to SSH into nodes. If the root cause is a template render error in settings resolution, nothing surfaced it to the master at all.
Impact: Per-node MTTR; failures in the settings/statefiles pipelines (the mechanism that changes machines) are invisible fleet-wide; scheduled-run failures silently accumulate.
Fix: Write a small per-peel status record to KV (last_highstate: jid/result/timestamp, last_settings_resolve: ok/error, last_error ring of N entries) updated by the peel, surfaced via zester peel show <id>; default return_job=true for state.* schedules; optionally publish peel Warn/Error slog records to a capped JetStream stream for zester peel logs <id>.
Evidence: cmd/zester-peel/main.go:964-978 (schedule results only published with return_job); no per-peel status bucket in pkg/bus/kv.go bucket constants; zester peel list output limited to facts-derived OS/arch (cmd/zester/cmd/peel.go:72-83).
Effort: medium · Confidence: high

3.5 Failure Modes & Fault Tolerance

Zester has genuinely good failure-handling bones in several places — CAS/epoch-fenced job ownership, a debounced+jittered settings re-resolve path, hash-gated facts publishing, a MaxDeliver=5 poison-message guard on the schedule-results consumer, and watcher-reconnect loops with backoff on every KV WatchAll. But the discipline is uneven: the state-file distribution path lacks the debounce/retry the settings path has; job returns ride ephemeral core-NATS into an in-memory watcher, so master failover silently loses results; the peel swallows settings-resolution errors during highstate and applies states with nil settings; the rollout controller is a single-process in-memory orchestration with no crash recovery, leaving watchdogs wedged in 'soaking'; and the peel's globally-serialized executor lets one slow job make an entire node appear dead. Most Warn-and-continue paths degrade correctness (stale/mixed state trees, wrong-config renders) rather than availability, which is the more dangerous direction for a config-management system.

29. 🔴 CRITICAL — Peel silently applies states with nil settings when settings resolution fails

Weakness: In the peel's execModule, state.apply and state.highstate discard the settings-resolution error: cs, _ = resolver.Resolve(execCtx, currentFacts) — a transient KV/NATS failure yields cs=nil and execution proceeds anyway. Templates then render with empty settings; .get(key, default) / pillar_get(key, default) patterns silently substitute defaults instead of production values.
Scenario: Operator runs zester '*' state.highstate during a brief JetStream leader election or secrets-decrypt failure. Every peel whose Resolve() errors compiles its states against empty settings and file.managed rewrites configs (DB endpoints, credentials placeholders, feature flags) with template defaults across the fleet, reporting Success=true.
Impact: Fleet-wide silent misconfiguration — the worst failure class for a config-management system: jobs report success while writing wrong content. Nothing in the return signals that settings were absent.
Fix: Fail the execution (return an ExecResponse error) when Resolve() errors, or fall back to the mutex-protected cachedSettings (last-known-good) and mark the result degraded. Never proceed with nil settings unless the peel has never had any.
Evidence: /Users/ptorbus/SynologyDrive/tools/zester/cmd/zester-peel/main.go:661 and :686 (cs, _ = resolver.Resolve(execCtx, currentFacts)); contrast with the guarded startup path at :418-421 which at least logs.
Effort: small · Confidence: high

30. 🟠 HIGH — Job returns ride core NATS into an in-memory watcher; master failover loses results

Weakness: The per-job Watcher collects returns via ephemeral core-NATS subscriptions (bus.PubSub) and holds them in memory, persisting to KV only every 10 returns (returnPersistThreshold). The job-events stream durably captures zester.job.>, but nothing ever replays returns from it — orphan recovery reads only the per-peel KV keys the dead watcher happened to flush.
Scenario: Master A dispatches a 500-peel job and crashes. Up to 9 unflushed returns are lost, and every return published during the failover gap (heartbeat TTL 15s + orphan scan up to 20s ≈ 35s) has no subscriber. Master B reclaims the job, starts a fresh 60s watcher, and finalizes it as timeout/partial even though all peels succeeded. Same loss occurs without any crash: the default 60s watcher timeout finalizes long highstates, and returns arriving after finalize are dropped on the floor.
Impact: Operators see failed/partial/timeout jobs that actually succeeded and re-run them — re-executing non-idempotent cmd.run states. Job history (the system of record) is unreliable exactly when HA is exercised.
Fix: Consume job returns from the job-events JetStream stream (durable or ordered consumer with DeliverAll from dispatch time) instead of core pub/sub, so failover and late returns are replayed; alternatively have peels write returns via a durable path and lower/remove the batch-of-10 persist threshold.
Evidence: /Users/ptorbus/SynologyDrive/tools/zester/pkg/job/watcher.go:15 (returnPersistThreshold=10), :97 (core Subscribe for returns), :130-136 (60s default timeout); /Users/ptorbus/SynologyDrive/tools/zester/pkg/job/manager.go:198-229 (recovery reads only KV per-peel keys); stream captures zester.job.> per /Users/ptorbus/SynologyDrive/tools/zester/pkg/bus/kv.go:387 but is never replayed for returns.
Effort: large · Confidence: high

31. 🟠 HIGH — Rollout controller is in-memory only; master crash wedges watchdogs in 'soaking' forever

Weakness: executeRollout runs as a goroutine keyed in the controller's in-memory active map. RolloutState is persisted to KV, but no master (including a restarted one) ever rehydrates or resumes a non-terminal rollout — there is no rollout equivalent of the job orphan scanner. On the watchdog side, a node that reached StateSoaking waits indefinitely for a confirm that will never come, and handlePrepare rejects any future update while in that state.
Scenario: During a fleet rollout the owning master crashes after batch 2's apply. The rollout sits in KV as 'rolling' forever; batch-2 watchdogs stay in StateSoaking running the new binary with Slots never Confirm()ed, and every subsequent rollout attempt against those nodes fails with 'cannot prepare in state soaking'. Additionally, in multi-master HA, zester update abort is queue-group routed — if it lands on a non-owning master, AbortRollout returns 'not active' and never writes RolloutAborted to KV, so the owning master's isAbortedInKV poll never fires either.
Impact: Fleet stuck in mixed versions with self-update permanently disabled on affected nodes until each watchdog is manually restarted; abort is unreliable with >1 master.
Fix: On master start (and periodically), scan update-rollouts KV for non-terminal states and resume or fail them with heartbeat-based ownership (reuse the orphan-scanner pattern). Make AbortRollout write RolloutAborted to KV via CAS even when the rollout isn't locally active. On the watchdog, add a confirm-deadline: auto-confirm or auto-rollback if no controller command arrives within N× soak time.
Evidence: /Users/ptorbus/SynologyDrive/tools/zester/pkg/update/rollout.go:244-249 (in-memory active map + goroutine, no rehydration anywhere in cmd/zester-master/main.go:528-597), :257-273 (AbortRollout errors for non-owned rollouts without touching KV); /Users/ptorbus/SynologyDrive/tools/zester/pkg/update/handler.go:154-159 (prepare rejected in soaking), :295-301 (confirm is the only exit from soaking besides health-failure rollback).
Effort: large · Confidence: high

32. 🟠 HIGH — Globally serialized peel executor: one slow job makes the whole node unresponsive

Weakness: All module execution on a peel is serialized behind a single execMu, and execModule runs synchronously inside the NATS subscription callback (one goroutine per subscription). There is no queue-depth bound, busy rejection, or concurrency for cheap read-only queries; a long execution blocks everything behind it. The per-JID epochMap also grows without eviction for the peel's lifetime.
Scenario: A highstate with a slow pkg.installed takes 10 minutes. Every subsequent dispatch (test.ping, facts.get, direct CLI request/reply) queues in the NATS client's pending buffer: request/reply CLI calls time out, master watchers finalize the queued jobs as StatusTimeout after 60s, and if the buffer overflows (client slow-consumer limits) dispatches are silently dropped. Job cancel can't help — cancelFuncs are only registered once execution starts, so queued jobs are uncancelable.
Impact: A single slow or hung state makes the peel appear dead fleet-wide for its duration; on broad targets this converts one bad state file into apparent mass node failure and a wave of misleading timeout results.
Fix: Dispatch exec requests to a bounded worker queue: keep state.apply/highstate serialized, but run read-only modules (facts., settings., test.ping) outside execMu, return an immediate 'busy/queued' ack for depth visibility, and evict epochMap entries when jobs complete.
Evidence: /Users/ptorbus/SynologyDrive/tools/zester/cmd/zester-peel/main.go:616 (execMu comment: 'serializes all module executions'), :1046-1111 (synchronous execModule inside subscription callback), :1013-1014 (epochMap never pruned).
Effort: medium · Confidence: high

33. 🟠 HIGH — State-file distribution: thundering-herd full re-sync with no debounce, no retry, and non-atomic partial application

Weakness: statefiles.Cache.Watch triggers a full Sync() (Keys + per-file Get) on every _revision bump with zero debounce or jitter — unlike the settings path, which deliberately uses DebouncedFunc(2s debounce + 5s jitter). A failed Sync is only Warn-logged with no retry until the next revision bump, and Sync writes files one-by-one, so a mid-sync failure leaves a mixed old/new states tree that the compiler will happily use.
Scenario: GitFS pulls a commit; master republishes and bumps _revision. All N peels simultaneously issue Keys() plus M Get() calls against the state-files bucket (N×M request burst on JetStream). Under that self-inflicted load, some peels' Sync fails halfway ('full re-sync failed on revision change', Warn) — those peels keep half-updated state trees indefinitely and their next highstate compiles a mixture of the old and new formula versions.
Impact: NATS/JetStream load spikes proportional to fleet size on every state change, plus silent state drift: affected peels apply internally inconsistent states with no error surfaced to the operator (no metrics are wired either).
Fix: Reuse settings.DebouncedFunc with jitter for the revision watcher; retry failed Sync with backoff until the target revision is reached; make sync atomic by downloading into a temp dir and renaming over the cache (or record the synced revision and refuse compilation on mismatch).
Evidence: /Users/ptorbus/SynologyDrive/tools/zester/pkg/statefiles/cache.go:144-146 (Sync on every revision update, error only Warned, no retry) and :66-79 (per-file writes, partial failure returns mid-loop); contrast /Users/ptorbus/SynologyDrive/tools/zester/cmd/zester-peel/main.go:460-495 where the settings path got debounce+jitter explicitly for this reason.
Effort: medium · Confidence: high

34. 🟡 MEDIUM — Facts churn drives O(peels × masters) secrets re-encryption and KV write amplification

Weakness: Every master independently runs facts.Watch (WatchAll on the facts bucket, not a queue group) and its callback does an enrollment KV lookup plus per-peel secret re-encryption and PublishSecrets on every facts update. The peel's hash gate only suppresses publishes when facts are byte-identical — volatile collectors (memory/disk usage) change every interval, so most peels republish facts continuously. WatchAll also replays the entire bucket on every master start/reconnect, triggering a full-fleet secrets re-publish burst.
Scenario: 1,000 peels with 30s collector intervals and 3 masters: ~100 facts updates/sec each fan out to 3 masters, each performing Curve25519 encryption and a secrets-bucket Put — thousands of redundant crypto ops and KV revisions per minute, and a synchronized burst whenever a master restarts or its watcher reconnects. The peels' revision-gated resolvers absorb it, but the masters and the secrets bucket do not.
Impact: Master CPU and JetStream write load scale with fleet size × master count for zero information gain; secrets-bucket history churns; adding a master for HA multiplies the waste. This is the classic retry-storm-shaped load path that will surface first as NATS cluster saturation.
Fix: Gate PublishSecrets on a hash of (peel curve key + secret set) so unchanged inputs skip re-encryption; separate volatile metric facts from identity facts (only the latter feed the secrets path); or elect one master (queue-group/leader lease) to own the facts→secrets reaction.
Evidence: /Users/ptorbus/SynologyDrive/tools/zester/cmd/zester-master/main.go:400-427 (per-update FindByPeelID + PublishSecrets on every master); /Users/ptorbus/SynologyDrive/tools/zester/pkg/facts/manager.go:191-217 (hash gate defeats only byte-identical facts), :269 (WatchAll replays full bucket).
Effort: medium · Confidence: high

35. 🟡 MEDIUM — Watchdog degraded mode is terminal with no re-arm or remote recovery path

Weakness: Supervisor.AutoRestart permanently exits after 10 consecutive failed restarts (degraded=true) and nothing ever resets it: no NATS command clears the flag, no slow background retry exists, and the AutoRestart goroutine is gone. The watchdog keeps running and its Reporter keeps heartbeating to update-status KV, so the node does not disappear from fleet status even though its managed child is dead.
Scenario: A transient condition (disk full, port conflict, missing shared lib) makes the peel binary crash-loop 10 times over ~10 minutes. The operator fixes the underlying issue, but the child stays down forever — the only recovery is out-of-band restart of the watchdog process itself, on a node whose management agent (the peel) is precisely what is dead.
Impact: Nodes drop out of management permanently after recoverable incidents; because the failure mode is 'watchdog alive, child dead', it is easy to miss in fleet status, and self-update (the watchdog's raison d'être) can no longer repair the node remotely.
Fix: Replace terminal degradation with a slow retry tier (e.g., attempt restart every 10-15 min while degraded), add an explicit NATS command (reuse the update cmd subject) to reset degraded and force a start, and surface degraded=true prominently in NodeStatus so rollout targeting excludes such nodes.
Evidence: /Users/ptorbus/SynologyDrive/tools/zester/pkg/update/supervisor.go:354-369 (degraded set, AutoRestart returns; no reset path anywhere in pkg/update or cmd/zester-watchdog); /Users/ptorbus/SynologyDrive/tools/zester/pkg/update/status.go:95 (state string reported but node remains 'alive' in KV).
Effort: small · Confidence: high

3.6 Modularity, Coupling & Evolvability

Zester's package layout is genuinely good — narrow domain packages, a two-layer exec/state design, and in-memory test fakes — but the modularity story is thinner than it looks. The bus abstraction stops one level too early: JetStreamAPI returns raw jetstream.KeyValue, so every domain package (update, enroll, settings, basket, target, statefiles) is directly coupled to the nats.go client API, and the test fakes must implement NATS's full ~25-method KV interface. The single biggest evolvability gap is the total absence of wire/schema versioning: every NATS message and KV record is a bare MessagePack struct with no version field or negotiation, in a system whose flagship feature is rolling self-updates that guarantee mixed-version fleets. The composition roots are the other debt center: cmd/zester-peel/main.go is a 1355-line run() containing untested business logic in closures (moduleDispatch, execModule, reResolve, a 50-entry module registry, three ad-hoc mutexes), and the master wires ~10 subsystems inline with no lifecycle abstraction — which is also why internal/metrics and internal/health exist but are wired into nothing. Highest-payoff refactors, in order: (1) introduce a protocol/schema version envelope + additive-only field policy before the fleet grows; (2) define a narrow bus.KV interface so NATS stops being load-bearing in every package; (3) extract peel/master runtime packages (internal/peeld, internal/masterd) so subsystem wiring becomes testable and metrics/health get a natural home.

36. 🟠 HIGH — No wire/schema versioning in a system built around rolling self-updates

Weakness: All NATS messages and KV records are bare MessagePack structs (pkg/bus/codec.go Encode/Decode) with no version field, no envelope, and no compatibility negotiation. pkg/proto/exec.go (ExecRequest/ExecResponse), pkg/job/job.go (Job, Return), enrollment records, rollout state, and update manifests are all unversioned. The RolloutController has no notion of protocol compatibility — it will happily roll a fleet through any binary combination. Mixed-version fleets are not an edge case here: they are the designed steady state during every batched rollout with soak time.
Scenario: v0.6.0 renames or retypes a field in ExecRequest (or Job, which is both a wire message and a KV record shared by all masters). During a batched rollout, upgraded peels decode messages from a not-yet-upgraded master: msgpack silently skips unknown keys and zero-fills missing ones, so Epoch fencing tokens, Args, or Success flags silently become zero values — no decode error, just wrong behavior (duplicate execution past a zeroed fence, states applied with empty args). Same for two masters at different versions sharing the jobs bucket during a master upgrade.
Impact: Fleet-wide silent misbehavior during the exact window (rollout) the product is designed to make safe; corrupted job records persist in KV; the only recovery is emergency full-fleet upgrade. Every future schema change carries this risk, so schema evolution becomes frozen-by-fear — the classic evolvability death spiral.
Fix: Add a small versioned envelope (or a leading v field, msgpack decodes it first) to ExecRequest/ExecResponse/Job/Return and KV record types; have peels publish their protocol version as a fact and their binary version in update-status; make RolloutController refuse rollouts whose manifest declares an incompatible min-protocol; adopt and CI-enforce an additive-only, never-renumber field policy documented in pkg/proto.
Evidence: pkg/proto/exec.go:9-31 (no version field), pkg/bus/codec.go:10-25 (raw msgpack.Marshal/Unmarshal), pkg/job/job.go:36-70, pkg/update/rollout.go (no compat check; grep for MinVersion/Compat returns nothing)
Effort: medium · Confidence: high

37. 🟠 HIGH — bus abstraction leaks jetstream.KeyValue — NATS client API is load-bearing in every domain package

Weakness: JetStreamAPI (pkg/bus/jsapi.go:14-20) is narrow in method count but returns raw jetstream.KeyValue, jetstream.Stream, jetstream.ObjectStore. Consequently update, enroll, settings, basket, target, statefiles, and facts all hold nats.go types in their struct fields and signatures (16 non-test files). bustest.FakeKV must implement NATS's entire ~25-method KeyValue interface (bustest/fake_kv.go, 405 lines) including KeyWatcher semantics. The abstraction abstracts the JetStream handle but not the storage.
Scenario: Two concrete failure modes: (a) any nats.go minor release that adds a method to the KeyValue interface breaks compilation of FakeKV and blocks the upgrade until the fake is extended — the dependency's interface evolution dictates your test infrastructure; (b) supporting any alternative backplane (etcd, embedded sqlite for single-node, or even a newer NATS API generation like the ObjectStore→ObjectStoreAPI split already forced) requires touching every domain package, not just pkg/bus.
Impact: NATS is effectively unswappable and nats.go upgrades are high-friction across the whole codebase. The ObjectStoreAPI/ConsumerAPI interfaces (pkg/bus/kv.go:12-24, comments literally say 'Separated ... to avoid breaking existing test fakes') are early symptoms: interface design is already being driven by fake fragility rather than domain needs.
Fix: Define a narrow bus.KV interface owned by pkg/bus (Get/Put/Create/Update/Delete/Keys/Watch returning bus-owned entry/watcher types), have GetBucket return it, and adapt jetstream.KeyValue behind it once. Domain packages then depend only on bus types; FakeKV shrinks to the methods Zester actually uses; ObjectStoreAPI/ConsumerAPI fold into the same pattern.
Evidence: pkg/bus/jsapi.go:15-19 (returns jetstream.KeyValue); pkg/update/manifest.go:45, pkg/enroll/store.go:25, pkg/settings/publish.go:33-47, pkg/basket/publisher.go:36, pkg/target/kv_lister.go:19 (domain structs holding jetstream.KeyValue); pkg/bus/bustest/fake_kv.go:12 (405-line fake of the full NATS interface)
Effort: large · Confidence: high

38. 🟠 HIGH — Peel composition root: 1355-line main.go with untested business logic embedded in closures

Weakness: cmd/zester-peel/main.go's run() (lines 109-1125) is not just wiring — it contains real business logic defined inline: moduleDispatch (the salt['mod.func'] router, lines 310-359), execModule (the core execution path for state.apply/highstate/facts/settings, ~line 622+), reResolve (settings hot-reload + scheduler reload), basket-scope propagation, and a 50-entry hand-maintained module registry (lines 224-272). Concurrency invariants live in three ad-hoc primitives in function scope (execMu, basketScopeMu, atomic.Pointer[schedule.Runner]) whose correctness depends on comments, not types or tests.
Scenario: Any new peel feature (a new query module, a new watcher, a new provider) edits this one function; the interactions between reResolve, the scheduler pointer, the exec mutex, and cachedSettings can only be verified by the Docker integration suite (5-minute cycle, requires Docker) or in production. The global execMu also encodes an architectural decision — full serialization of scheduler and ad-hoc executions — as a local variable, invisible to anyone designing around it.
Impact: Slowest-to-change file in the codebase and the highest-risk one: it is the peel's entire runtime, deployed to every managed node, and the closure logic (module dispatch fallbacks, settings cache races, states-dir switching) has zero unit coverage (only extracted helpers like makeBasketFunc are tested).
Fix: Extract an internal/peeld package with a PeelAgent struct: NewAgent(cfg, deps) wiring, Start/Stop lifecycle, and named methods for moduleDispatch/execModule/reResolve so each is unit-testable against bustest/exectest fakes. Move the module registration list to modules.RegisterAll(registry, mctx) in pkg/state/modules so adding a module doesn't touch main.go. main() shrinks to flag parsing + NewAgent + signal handling.
Evidence: cmd/zester-peel/main.go:109 (run() spanning ~1000 lines), :224-272 (manual 50-entry registry), :300-359 (moduleDispatch closure), :616-624 (execMu comment admitting global serialization), test list shows only helper-level tests
Effort: large · Confidence: high

39. 🟡 MEDIUM — Master wires ~10 subsystems inline with no lifecycle abstraction; metrics/health packages built but unwired

Weakness: cmd/zester-master/main.go run() (lines 125-608) sequentially constructs gitfs, settings publisher, statefiles publisher, job manager, sched-result consumer, dispatch/cancel/rollout subscriptions, heartbeater, orphan scanner, facts watcher, enrollment server, and rollout controller as raw goroutines with defers. There is no Subsystem interface, no readiness aggregation, and no supervision: several subsystems degrade to a log line on startup failure (sched consumer line 317-322, gitfs line 299-303, settings publish 249-272). Meanwhile internal/health (152 lines, tested) and internal/metrics (258 lines) exist but are imported by no binary — the inline /healthz (line 610-647) returns static 200 before NATS is even connected.
Scenario: GitFS dies permanently (bad SSH key after rotation) or the sched-result consumer never starts: the master reports healthy forever, state files silently go stale fleet-wide, scheduler return_job results are silently dropped. An operator load balancer or the watchdog health-check sees 200 OK throughout. Conversely, adding any new master responsibility means another 50-100 inline lines and another defer, with no way to unit-test the wiring.
Impact: Operability blind spot (health always OK, zero metrics despite a full registry existing) plus compounding wiring debt: the master is 10 responsibilities in one untestable function, which also blocks ever splitting responsibilities into separately deployable processes (e.g., enrollment API or rollout controller as their own service).
Fix: Introduce a minimal Runnable/Healthy subsystem interface in internal/masterd; register each subsystem with the existing internal/health checker and expose it on the health listener; wire internal/metrics at the same seam. This makes 'healthz reflects reality' and 'metrics exist' fall out of the same refactor.
Evidence: cmd/zester-master/main.go:125-608 (run), :610-647 (static healthz), :317-322 (consumer failure → Warn only); internal/metrics/metrics.go and internal/health/health.go present but grep -rn internal/metrics cmd/ pkg/ returns nothing
Effort: medium · Confidence: high

40. 🟡 MEDIUM — CLI reads/writes KV buckets directly — storage schema is a public API shared with operator tooling

Weakness: cmd/zester/cmd/{enroll*,update_status,update_versions,job,basket,peel,kv}.go connect to NATS and manipulate KV records directly (enrollment approve/reject writes enroll.Record via CAS; update status reads NodeStatus; kv.go is a raw bucket editor). The bucket layout and every record struct are therefore a compatibility surface between three independently-versioned artifacts: master, peel, and every operator's CLI binary.
Scenario: Master v0.7 changes the enrollment Record struct or the peel.<id> index convention. An operator with a v0.6 CLI runs zester enroll approve: it CAS-writes a record in the old shape, silently dropping new fields or corrupting the state machine — there is no server-side validation because there is no server in the path. This bites precisely when versions skew, which the self-update system makes routine.
Impact: Every storage schema change requires simultaneous upgrade of all operator CLI installs; admin writes bypass any invariant enforcement the master could provide (state-machine transitions are re-implemented client-side). Blast radius is the enrollment/update control plane — security-sensitive paths.
Fix: Route admin mutations through a single authority: either the existing masterapi HTTP handler (already built, cmd/zester-master/main.go:471-483) or NATS request/reply admin subjects handled by the master (the rollout start/abort commands already use this pattern). Keep direct-KV access only in break-glass zester kv and mark it as such.
Evidence: cmd/zester/cmd/ listing: enroll_approve.go/enroll_reject.go/enroll_revoke.go operate on KV per CLAUDE.md ('connects directly to NATS, not HTTP'); cmd/zester/cmd/kv.go, update_status.go, basket.go all call bus.GetBucket directly
Effort: medium · Confidence: high

41. 🟡 MEDIUM — Config surface maintained in triplicate per binary (flag defs + YAML struct + flag.Visit switch), enforced only by convention

Weakness: Each daemon repeats every config knob three times: the flag.StringVar block, the YAML field in internal/config, and a hand-written flag.Visit case mapping one to the other (master: 14 cases at cmd/zester-master/main.go:85-117; peel: ~20 cases). CLAUDE.md contains an explicit bolded rule reminding developers to keep the three in sync — a process rule papering over a structural gap. There is no env-var layer and no validation seam.
Scenario: A new flag is added but the flag.Visit case is forgotten (nothing fails at compile or test time): the flag parses fine, is silently ignored whenever a config file is present, and the operator debugging why --gitfs-interval 30s has no effect on a config-file deployment burns hours. The gitfs-remotes case already needed special empty-string semantics (line 110-111), showing the mapping is accreting bespoke logic.
Impact: Slow leak: every new feature pays a three-file tax and a silent-misconfiguration risk that grows with the config surface (already ~14 master + ~20 peel knobs and rising with enroll, gitfs, API tokens, schedule).
Fix: Collapse to one declarative source: define config solely as the YAML struct with struct tags, generate/bind flags from it (small reflection helper or koanf/env layering), so a new field automatically gets flag + YAML + override precedence; add a test asserting every registered flag has a struct binding.
Evidence: cmd/zester-master/main.go:40-73 (flag block) + :85-117 (manual Visit switch); cmd/zester-peel/main.go has 20 case "..." entries; CLAUDE.md bolded sync rule for both daemons
Effort: small · Confidence: high

42. 🟡 MEDIUM — ProviderSet/ModuleContext duplication makes adding an exec provider a 6-touchpoint change, and the shared mutable ModuleContext forces global execution serialization

Weakness: ModuleContext (pkg/exec/context.go:25-44) duplicates all nine ProviderSet fields instead of embedding it, and NewModuleContext hand-copies each. Adding a provider type touches: the interface file, ProviderSet, ModuleContext, NewModuleContext, DetectProviders, an exectest fake, and (if scriptable) starmod builtins. Worse, ModuleContext is a single shared mutable instance whose Facts/Settings are overwritten per request — which is exactly why cmd/zester-peel/main.go needs execMu to serialize ALL executions (scheduler + NATS handler + ad-hoc).
Scenario: A scheduled highstate is running (minutes-long apt transaction); an operator's ad-hoc zester '*' test.ping blocks behind execMu on that peel until the highstate finishes — the two-layer design's extensibility is fine, but its context-sharing choice caps peel concurrency at 1 forever. Meanwhile each new provider (e.g., FirewallExec, ZfsExec) pays the 6-file boilerplate tax and risks a missed copy in NewModuleContext (compiles fine, provider silently nil).
Impact: Peel-side throughput ceiling of one execution at a time (visible to operators as unresponsive ad-hoc commands during long runs) plus friction on the main extension axis of the product (new modules/providers).
Fix: Embed ProviderSet in ModuleContext (removes the copy and the missed-field failure mode), then make ModuleContext immutable and derived per-request (mctx.WithFactsSettings(f, s) returning a shallow copy) so execMu can shrink to protecting genuinely exclusive resources (package manager locks) rather than all executions.
Evidence: pkg/exec/context.go:10-65 (field duplication and hand-copying constructor); cmd/zester-peel/main.go:611-624 (execMu comment: 'The shared ModuleContext is mutated per execution ... without serialization those would race')
Effort: medium · Confidence: high

3.7 Lower-severity notes

[low] (failure-modes) Orphan scanner does full jobs-bucket scan per master every 20s with no terminal-job GC — Track only non-terminal jobs in a separate small KV bucket (or key prefix) that the scanner reads, and add retention: TTL or a periodic sweeper that moves terminal jobs older than N days to the job-events stream / deletes them.
[low] (modularity-evolution) Interface segregation in pkg/bus driven by test-fake fragility (JetStreamAPI vs ObjectStoreAPI vs ConsumerAPI) — When introducing the bus-owned KV/ObjectStore/Consumer types (finding #2), merge the three interfaces into capability-oriented ones (bus.Storage, bus.Streams) sized to Zester's usage, with FakeJS implementing all of it once.

4. What is already sound

The NATS-only backplane decision itself. Every finding is about usage patterns, not the choice. Do not add a second datastore; fix the access patterns.
CAS/epoch job-ownership mechanics. Create-for-idempotent-dispatch, Update-for-fencing, KSUID master IDs — the fencing design is correct; only the liveness detection feeding it is broken.
The test strategy. bustest/exectest in-memory fakes, no embedded NATS, external-test-package discipline for import cycles. This is why the Tier A fixes are cheap to land safely.
Two-layer exec/state module architecture with closure-based builder injection. Clean, testable, the right Salt-inspired split. The ModuleContext mutability issue (mod-7) is a flaw within a sound design.
Enrollment security design. Ed25519 challenge binding the curve key, mandatory TLS, zero NATS access pre-enrollment, local seed generation, atomic peel-index Create. Genuinely well thought out — the availability gaps (single URL, memory challenges) are trim, not redesign.
Two-phase settings rendering (master sanitizes, peel renders with local facts + decrypts). The pipeline shape is right; it needs a single publisher and cache-key fixes, not a rethink.
The settings-path DebouncedFunc (debounce + jitter). It's the pattern the statefiles path should have copied. Propagate it, don't replace it.
Watchdog three-slot atomic binary swap and the update state machine. The mechanism is solid; it's the health signal feeding it that's fake.
The compiler pipeline (include/extend/merge semantics, Starlark loader integration). Mature, deterministic, well-specified.

5. Appendix: refuted and overstated claims

These were raised during review but knocked down (or narrowed) on adversarial verification. Recorded so the assessment is honest about what does not apply.

Settings resolver's revision gate makes secret rotations invisible until an unrelated settings-files change

Why refuted: The mechanical description of the gate is accurate, but the claimed failure scenario is not reachable because there is no "separate secrets-publish path" that changes secret VALUES without bumping the settings-files revision. Verified against the code: (1) pkg/settings/resolve.go:125-137 does short-circuit Resolve() to the cached result whenever settings-files _revision is unchanged, and cmd/zester-peel/main.go:503-508 wires WatchSecrets to the same debounced reResolve with no InvalidateCache (unlike the curve-pub path at :511-517 which invalidates via SetSenderPub). So a secrets-bucket-only update is indeed swallowed by the gate. (2) However, new secret plaintext can ONLY enter the system through Publisher.PublishRawFiles (pkg/settings/publish.go:178-201), which extracts secrets from !encrypted values in .zy files and then calls bus.BumpRevision UNCONDITIONALLY (publish.go:196) — even if sanitized file content is byte-identical (the placeholder __ZESTER_SECRET:key__ doesn't encode the value, so sanitized content is unchanged on rotation, yet the revision still bumps). (3) The runtime secrets-publish path (cmd/zester-master/main.go:400-427, facts watcher → PublishSecrets) only re-encrypts the SAME plaintexts captured in the allSecrets closure at startup (main.go:256); those republishes are same-plaintext with fresh nonces, so cache-gating them is correct, not a staleness bug — it is exactly the amplification-loop suppression the gate was built for. (4) There is no fsnotify/SIGHUP/timer reload of the settings dir; loadSettingsFiles runs once at startup (main.go:249). Therefore rotating a !encrypted value requires a master restart, which always bumps _revision, so the peel's next resolve misses the cache and calls loadSecrets fresh. The rotation is NOT invisible until an unrelated .zy edit. What genuinely survives is a much narrower timing race: after a master restart, the peel's debounced re-resolve (2s debounce + up to 5s jitter) fires on the revision bump, while the new per-peel secrets are written asynchronously as the master's facts WatchAll replay iterates the fleet (per-peel curve encryption). If a given peel's resolve at the new revision runs BEFORE the master writes that peel's rotated secrets, the later WatchSecrets trigger hits the cache gate and that peel serves the stale secret until the next revision bump or restart. In practice the 2-7s debounce usually coalesces both events, but on a large fleet the ordering is not guaranteed, so a subset of peels can go stale after a rotation — a real but bounded-window race, not the unconditional "every rotation is invisible" defect claimed. Multi-master also has a related but distinct issue (a non-restarted master holding old allSecrets keeps republishing OLD plaintexts on facts arrival, last-writer-wins), which is a master-side split-brain, not the resolver gate.
What is actually true: The claim's premise — that secret values can be rotated via a secrets-only publish path that leaves settings-files _revision unchanged — is false. Rotation requires editing the .zy file and restarting the master, and PublishRawFiles unconditionally calls BumpRevision (pkg/settings/publish.go:196), so peels always see a revision change and reload secrets. What is actually true: (a) the gate does swallow secrets-bucket-triggered re-resolves, which is intentional and safe for the routine same-plaintext republishes driven by facts churn; (b) there is a narrow startup-ordering race where a peel's debounced resolve consumes the new revision before the master's facts-replay loop has written that peel's rotated secrets — in that case the subsequent secrets update IS silently ignored until the next revision bump or peel/master restart. A robust fix would be for the WatchSecrets callback to call resolver.InvalidateCache() (mirroring the curve-pub path) or for the cache key to incorporate the secrets entry revision, but the impact is a low-probability race on large fleets, not indefinite fleet-wide invisibility of every rotation.

Silent 7-day TTL expiry erases non-terminal jobs, their idempotency guards, and splits job/returns lifetimes

Why refuted: The raw facts check out — pkg/bus/kv.go:246-258 sets a silent 7-day per-entry TTL on both the jobs and job-returns buckets, dispatch dedup is Create-based on the jobs key (pkg/job/manager.go:73-90), and GetReturns does fall back to a full-bucket Keys() scan (manager.go:252-275, 294-308) reachable from the master API (pkg/masterapi/handler.go:315). But every headline failure scenario is mitigated or unreachable. (1) The 'weekend outage erases 300 running jobs' scenario ignores the OrphanScanner (pkg/job/recovery.go): it runs on every master every 20s, and because each master process gets a fresh KSUID masterID, even a single restarted master sees the old owner as dead (15s heartbeat TTL) and reclaims all non-terminal jobs via CAS; ReclaimJob starts a recovery watcher that finalizes with a terminal status (StatusTimeout, watcher.go:185-188), rewriting the key and resetting the TTL. Silent expiry requires zero masters running for 7 consecutive days, not a weekend. (2) The week-old-JID replay is architecturally unreachable: dispatches arrive via core NATS QueueSubscribe (cmd/zester-master/main.go:327), not a durable JetStream consumer, so nothing can redeliver a message after 7 days, and JIDs are minted fresh per CLI invocation (ksuid.New(), cmd/zester/cmd/exec.go:91). Losing the dedup key after 7 days only matters if an external actor deliberately replays a recorded JID. (3) The 'split lifetimes' mechanism is misattributed: finalizeJob writes the job status and the aggregate returns key at essentially the same instant (watcher.go:287-330), so they expire together; the Keys() fallback actually fires because the aggregate key does not exist until finalize — i.e., for every in-flight job status query, independent of TTL. (4) Terminal jobs expire at 7 days too, so the audit gap is a uniform retention window, not selective erasure of abnormal jobs.
What is actually true: What is actually true: (a) a hard-coded, silent 7-day retention window on jobs/returns with no archival or expiry event — a real but bounded operational limitation, and a job only escapes without terminal status if no master runs its orphan scanner for the entire 7 days; (b) the Create-based dedup guarantee has a 7-day horizon, exploitable only by an external client deliberately reusing an old JID (no in-system replay path exists); (c) the genuine scalability defect is that GetReturns performs an O(entire-returns-bucket) Keys() scan for ANY non-finalized job — every status poll of a running job via the master API, and once per job during orphan recovery (manager.go:211) — because per-peel returns are keyed '{jid}.{peel}' with no prefix-scoped listing. That scan issue is real and worse than framed, but it is caused by the aggregate-at-finalize design, not by TTL skew.

Zester — Architecture & Resilience Review

On this page