zester

Monitoring

Both daemons serve three local HTTP endpoints — /healthz (liveness), /readyz (readiness), and /metrics (Prometheus) — and log structured JSON via Go's log/slog package.

Health Check Endpoints

Both zester-peel and zester-master serve a plain-HTTP listener on a loopback address, started before anything else during daemon startup:

Component | Default address | Flag / YAML key
zester-peel | 127.0.0.1:9090 | --health-addr / health_addr
zester-master | 127.0.0.1:9091 | --health-addr / health_addr

The defaults are distinct so a peel and a master can colocate on the same host. Each listener serves three endpoints: /healthz, /readyz, and /metrics.

/healthz — Liveness

Always returns 200 while the process is up and serving HTTP, regardless of NATS or JetStream state:

$ curl -s http://127.0.0.1:9090/healthz
{"status":"ok","component":"peel","version":"v0.5.0"}

$ curl -s http://127.0.0.1:9091/healthz
{"status":"ok","component":"master","version":"v0.5.0"}

The version field is parsed by the watchdog for fleet status reporting — keep it if you front the endpoint with a proxy.

/readyz — Readiness

Runs named per-subsystem checks (in parallel, 2s timeout per check) and reports one of three statuses per check:

Status | Meaning | HTTP effect
ok | Subsystem fully functional | 200
degraded | Working but impaired (e.g. stale GitFS sync) | 200
down | Subsystem non-functional | 503

The endpoint returns 503 only when at least one check is down. A degraded check leaves the response at 200, so load balancers don't drop the node and the watchdog's update soak doesn't roll back healthy binaries over non-fatal conditions. The JSON body always carries per-check status, message, and latency_ms for operators and monitoring.

Peel checks:

Check | Semantics
nats | down until the NATS client is connected and healthy
kv | JetStream round-trip (bucket-info lookup on the facts bucket); down on error

Master checks:

Check | Semantics
nats | down until the NATS client is connected and healthy
enroll-server | down (with the error) if the enrollment TLS listener failed
sched-consumer | down while the schedule-results consumer is not running; a startup failure is retried in the background every 60s and the check flips to ok on success (scheduled-result consumer started after retry)
gitfs | Only registered when GitFS is enabled: ok while the last fully-successful sync is younger than 3× the GitFS interval, degraded before the first sync or when stale, down if the syncer exits
target-service | down until the target-resolution service (in-memory fact index + zester.target.resolve responder) is running; while down, CLI/API targeting falls back to facts-KV scans, which is functional but slower
reactor | Only registered when the reactor is enabled: down while the shared durable reactor consumer is not running (a boot failure is retried every 60s), degraded while the engine runs on last-known-good rules after a failed rule load or while a rule's storm circuit breaker is open (the message lists the affected rules), ok otherwise (see Reactor operations)

$ curl -s http://127.0.0.1:9090/readyz | jq .
{
  "status": "ok",
  "checks": {
    "nats": {"status": "ok", "latency_ms": 1},
    "kv": {"status": "ok", "latency_ms": 3}
  },
  "version": "v0.5.0",
  "uptime": "2h3m0s"
}
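A monitoring script can decode the body instead of relying on the HTTP code alone. The sketch below assumes the JSON shape shown in the example above; the struct and function names (readyz, checkResult, notOK) are hypothetical, and message is assumed to be omitted when empty.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// checkResult mirrors one entry under "checks" (shape taken from the example above).
type checkResult struct {
	Status    string  `json:"status"`
	Message   string  `json:"message"`
	LatencyMS float64 `json:"latency_ms"`
}

// readyz mirrors the top-level /readyz body.
type readyz struct {
	Status  string                 `json:"status"`
	Checks  map[string]checkResult `json:"checks"`
	Version string                 `json:"version"`
	Uptime  string                 `json:"uptime"`
}

// notOK returns "name=status" for every check that is not ok, sorted by name.
func notOK(body string) ([]string, error) {
	var r readyz
	if err := json.Unmarshal([]byte(body), &r); err != nil {
		return nil, err
	}
	var names []string
	for name, c := range r.Checks {
		if c.Status != "ok" {
			names = append(names, name+"="+c.Status)
		}
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	body := `{"status":"ok","checks":{"nats":{"status":"ok","latency_ms":1},"kv":{"status":"ok","latency_ms":3}},"version":"v0.5.0","uptime":"2h3m0s"}`
	names, err := notOK(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(names), "checks need attention")
}
```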

Degraded gitfs on standby masters is normal

In multi-master deployments, only the master holding the publisher lease runs GitFS syncs. A standby master with GitFS configured reports gitfs: degraded ("no successful sync yet") until it acquires the lease — that is expected steady-state, not a fault, and /readyz still returns 200. Alert on down, not degraded.

Which probe to use

Point restart-style probes (Kubernetes liveness, systemd watchdogs) at /healthz and traffic/rollout gates at /readyz. A NATS outage makes /readyz return 503 on every node at once — restarting processes on that signal would turn a broker outage into a fleet-wide restart storm.

Watchdog Health Monitoring

zester-watchdog uses both endpoints, for different decisions:

  • --health-url (default http://127.0.0.1:9090/healthz) — liveness. Drives restart monitoring and the post-apply WaitForHealthy gate. Must match the child's --health-addr — the packaged systemd units pair them accordingly (:9090 for zester-peel.service, :9091 for zester-master.service).
  • --ready-url (default: derived from --health-url by replacing the path with /readyz) — readiness. Polled only during the update soak phase, so an alive-but-functionally-dead child fails soak and auto-rolls back.
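The default --ready-url derivation described above (same scheme and host, path replaced with /readyz) could look like this in Go. It is a sketch only; deriveReadyURL is a hypothetical name, not the watchdog's actual function.

```go
package main

import (
	"fmt"
	"net/url"
)

// deriveReadyURL keeps the scheme and host of the health URL and
// replaces its path with /readyz.
func deriveReadyURL(healthURL string) (string, error) {
	u, err := url.Parse(healthURL)
	if err != nil {
		return "", err
	}
	u.Path = "/readyz"
	u.RawPath = "" // drop any escaped form of the old path
	return u.String(), nil
}

func main() {
	ready, err := deriveReadyURL("http://127.0.0.1:9091/healthz")
	if err != nil {
		panic(err)
	}
	fmt.Println(ready)
}
```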

Related watchdog flags: --health-timeout (default 5s), --health-interval (default 10s), and --health-retries (default 3 — consecutive failures before a post-update rollback). See Watchdog Runtime for the full flag list and soak policy.

Peel Presence (Heartbeats)

The health endpoints answer "is this process alive?" — fleet-wide peel presence comes from the peel heartbeat bucket. Every connected peel rewrites its own key in the peel-heartbeat KV bucket on a fixed cadence:

Property | Value
Bucket | peel-heartbeat
Key pattern | <peel-id>
Interval | Every 10 seconds
TTL | 30 seconds (3× the interval)
Value | MessagePack {ts, version, protocol}

Key presence is the liveness signal: a peel is considered offline after ~3 missed beats, when the TTL expires its key. Two consumers read the bucket:

  • zester peel list derives its ONLINE and LAST-SEEN columns from the heartbeat keys (fleets without the bucket show - in both columns).
  • The master counts the bucket's keys every 15 seconds to feed the zester_connected_peels gauge.

Heartbeat writes are non-fatal on the peel (Debug-logged, retried on the next tick). A peel whose credentials lack the heartbeat KV grant cannot write the bucket — it runs fine but looks offline in presence views.

Prometheus Metrics

Both daemons serve Prometheus metrics (OpenMetrics format) at /metrics on the same health_addr listener as the health endpoints.

Master metrics

Metric | Type | Labels | Description
zester_connected_peels | gauge | - | Live peels, counted from peel-heartbeat bucket keys every 15s (reads 0 until peels write heartbeats)
zester_jobs_total | counter | status | Jobs finalized, partitioned by final status (complete, partial, timeout, failed, ...)
zester_job_duration_seconds | histogram | function | End-to-end job duration from dispatch to finalize
zester_job_reclaims_total | counter | - | Jobs reclaimed from dead masters by the orphan scanner
zester_facts_sync_total | counter | - | Fact sync operations received from peels
zester_nats_reconnects_total | counter | - | NATS reconnection events
zester_nats_disconnects_total | counter | - | NATS disconnection events
zester_nats_slow_consumers_total | counter | - | NATS slow consumer events
zester_reactor_events_total | counter | origin_type | Events consumed by the reactor (peel, master, admin)
zester_reactor_events_dropped_total | counter | reason | Events dropped at the reactor's consume gates (malformed, decode, spoof, depth, ratelimit, stale, backpressure)
zester_reactor_events_unmatched_total | counter | - | Events that matched no reactor rule
zester_reactor_reactions_total | counter | rule, result | Reaction executions per rule and outcome
zester_reactor_render_duration_seconds | histogram | - | Time to render one matched reaction rule
zester_reactor_breaker_open | gauge | rule | 1 while a rule's storm circuit breaker is open
zester_reactor_lag | gauge | - | Pending events on the shared reactor consumer (polled every 15s)
zester_reactor_rule_errors_total | counter | - | Failed reactor rule loads (last-known-good rules stay active)
zester_reactor_rules_loaded | gauge | - | Rules in the active reactor rule set

Peel metrics

Metric | Type | Labels | Description
zester_peel_connected | gauge | - | 1 while the NATS connection is actually up, 0 otherwise (including from process start until the first connect)
zester_peel_uptime_seconds | gauge | - | Peel process uptime
zester_peel_state_apply_total | counter | state, result | Executions per module (state = module name, e.g. state.apply; result = success or error)
zester_peel_state_apply_duration_seconds | histogram | state | Execution duration per module
zester_peel_beacon_events_total | counter | beacon | Beacon events generated (counted before publish, so offline-buffered events are included)
zester_nats_reconnects_total | counter | - | NATS reconnection events
zester_nats_slow_consumers_total | counter | - | NATS slow consumer events

Both registries also expose the standard Go runtime, process, and build-info collectors (go_*, process_*, go_build_info).
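For a quick spot-check without a Prometheus server, an unlabeled series can be read straight out of the text exposition output. The sketch below is a deliberately small stdlib-only parser, not a full exposition-format implementation; gaugeValue is a hypothetical helper and the sample scrape text is illustrative.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// gaugeValue returns the value of an unlabeled series from Prometheus
// text exposition output. Labeled series (name{...}) are not matched.
func gaugeValue(exposition, name string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == name {
			v, err := strconv.ParseFloat(fields[1], 64)
			if err != nil {
				return 0, false
			}
			return v, true
		}
	}
	return 0, false
}

func main() {
	// Illustrative scrape output; only the series name comes from the table above.
	sample := "# TYPE zester_connected_peels gauge\nzester_connected_peels 42\n"
	v, ok := gaugeValue(sample, "zester_connected_peels")
	fmt.Println(v, ok)
}
```

For anything beyond a spot-check, a real client library or PromQL is the better tool, since this parser ignores labels, timestamps, and histograms.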

Registered but not yet wired

The metric registries (internal/metrics) define additional series that no code path increments yet — notably zester_job_active, zester_facts_sync_errors_total, zester_settings_render_duration_seconds, zester_targeting_resolution_duration_seconds, and the zester_nats_msgs_*/zester_nats_bytes_* transport counters on the master, plus zester_peel_facts_collect_duration_seconds on the peel. They appear in scrape output at zero (or not at all, for labeled series) — don't alert on them yet.

Scrape Configuration

The metrics listener binds to health_addr, which defaults to loopback. To scrape from a central Prometheus, either run a node-local agent (Prometheus agent mode, Grafana Alloy, etc.) or set health_addr to a routable interface — if you do the latter, firewall the port, as the endpoints are unauthenticated.

scrape_configs:
  - job_name: 'zester-master'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['master-host:9091']

  - job_name: 'zester-peel'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['web-01:9090', 'web-02:9090']

NATS Server Monitoring

Complement Zester-native metrics with the NATS server's built-in monitoring endpoints for broker-side visibility:

# Server status
curl -s http://nats-host:8222/varz | jq .

# Connected clients (peels + master)
curl -s http://nats-host:8222/connz | jq '.num_connections'

# JetStream status (storage, streams, consumers)
curl -s http://nats-host:8222/jsz | jq .

# Cluster routes
curl -s http://nats-host:8222/routez | jq .

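The monitoring port serves these endpoints as JSON, not in Prometheus format, so it cannot be scraped directly. To get broker metrics into Prometheus, run the Prometheus NATS exporter (prometheus-nats-exporter) against the monitoring port and add the exporter to your prometheus.yml. The target below assumes the exporter runs on nats-host with its default listen port, 7777:

scrape_configs:
  - job_name: 'nats'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['nats-host:7777']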

Structured Logging

Zester uses Go's log/slog package for structured logging. All daemons log JSON by default; level and format are configurable:

Flag / YAML key | Default | Values
--log-level / log_level | info | debug, info, warn, error
--log-format / log_format | json | json, text

The flags exist on zester-master, zester-peel, and zester-watchdog (the watchdog has no YAML config — flags only). Invalid values fail startup with an error listing the valid options.

Every log line carries base attributes: component (master, peel, or watchdog) and version (build version). The master additionally attaches master_id (its KSUID) and the peel attaches peel_id once known — so a watchdog and its wrapped child can share one stdout stream and still be told apart.

Log Format

{
  "time": "2026-07-01T14:30:00.123Z",
  "level": "INFO",
  "msg": "received exec request",
  "component": "peel",
  "version": "v0.5.0",
  "peel_id": "web-01",
  "jid": "2oHfKnCPMQnLEYQeBQsNtUiJp3r",
  "module": "state.apply"
}

Log Levels

Level | Usage | Examples
ERROR | Actionable failures requiring investigation | NATS connection lost, JetStream write failure, state.apply crash
WARN | Degraded conditions, non-critical issues | Reconnection attempts, timeout threshold breaches, slow consumers
INFO | Normal operational events | Job dispatch, peel connect/disconnect, state.apply, fact sync
DEBUG | Verbose diagnostic data | NATS message traces, template rendering, targeting resolution details

Multi-Master Log Lines

In multi-master deployments, KV publishing (settings files, state files, GitFS sync) and per-peel secrets encryption are gated by advisory leader leases in the leases bucket (publisher and facts-secrets). Standby masters are healthy while logging only the candidate lines — do not mistake them for a stuck startup:

Log line | Level | Meaning
publisher lease candidate started; standing by until acquired | INFO | This master is a standby for the publisher lease; normal steady state on non-leaders
leader lease acquired | INFO | This master took over a lease (at startup, or after the previous holder died)
leader lease lost | WARN | The lease could not be renewed; another master will take over and publishing stops here
settings files loaded for peel-side rendering | INFO | Logged on every master at startup (loading is not lease-gated)
published raw settings files for peel-side rendering / published state files | INFO | Logged only on the current publisher lease holder

Log Aggregation

Ship logs to your log aggregation platform via journald:

# View logs
journalctl -u zester-master -f --output json

# Export to a collector
journalctl -u zester-master -f --output json | your-log-shipper

Use tools like Filebeat, Fluentd, Promtail, or Vector to ship log files to Elasticsearch, Loki, Splunk, or your platform of choice.

Useful Log Queries

When using a log aggregation system, these queries help with operations:

Query | Purpose
level=ERROR | All errors requiring investigation
component=master AND msg="job dispatched" | All accepted dispatches (logged with jid, user, function, target count)
msg="state applied" AND changed=true | States that made changes (drift detection)
jid=<JID> | Trace a specific job
msg="http.request.complete" AND path="/api/v1/jobs" | REST dispatch request latency and status
msg="http.request.unauthorized" | Failed bearer-token auth attempts
peel_id=<ID> | All events for a specific peel
duration_ms>5000 | Slow operations
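Without an aggregation platform, the same filters can be applied to raw JSON log lines. The sketch below implements the duration_ms>5000 query over a string of newline-delimited logs; slowOps is a hypothetical helper, and the sample lines (including which messages carry duration_ms) are illustrative rather than captured output.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// slowOps returns the msg of every JSON log line whose duration_ms
// exceeds thresholdMS. Lines that are not JSON are skipped.
func slowOps(logs string, thresholdMS float64) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var line map[string]any
		if err := json.Unmarshal(sc.Bytes(), &line); err != nil {
			continue // e.g. output produced with log_format=text
		}
		// encoding/json decodes every JSON number into float64.
		d, ok := line["duration_ms"].(float64)
		if ok && d > thresholdMS {
			msg, _ := line["msg"].(string)
			out = append(out, msg)
		}
	}
	return out
}

func main() {
	// Illustrative log lines, not real Zester output.
	logs := `{"level":"INFO","msg":"http.request.complete","duration_ms":120}
{"level":"INFO","msg":"state applied","duration_ms":7300}
plain text line`
	fmt.Println(slowOps(logs, 5000))
}
```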

Alerting Recommendations

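Base alerts on Zester-native metrics where available, plus NATS server metrics for broker health. The NATS monitoring port (8222) serves JSON rather than a /metrics endpoint, so the broker series below have to come from a Prometheus exporter scraping that port: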

Critical Alerts

Alert | PromQL | Threshold | Description
Peel Disconnected | zester_peel_connected == 0 | 5 minutes | Peel process is up but has no NATS connection
Master Not Ready | probe_success{job="zester-master-readyz"} == 0 | 2 minutes | A master subsystem (NATS, enroll-server, sched-consumer, target-service, reactor) is down. Probe /readyz with the blackbox exporter
NATS Down | up{job="nats"} == 0 | 1 minute | NATS server is not responding to scrapes
JetStream Storage Full | nats_server_jetstream_storage_used / nats_server_jetstream_storage_limit > 0.9 | Immediate | JetStream disk storage above 90%

Warning Alerts

Alert | PromQL | Threshold | Description
Job Failures | increase(zester_jobs_total{status=~"failed|timeout"}[15m]) > 0 | 15 minutes | Jobs finalizing as failed or timed out
Connected Peels Drop | zester_connected_peels < <expected fleet size> | 10 minutes | Peels stopped heartbeating into the peel-heartbeat bucket; cross-check with zester peel list
Job Reclaims | increase(zester_job_reclaims_total[1h]) > 0 | 1 hour | A master died and its jobs were reclaimed; investigate the dead master
NATS Flapping | increase(zester_nats_reconnects_total[15m]) > 3 | 15 minutes | Repeated reconnects on a master or peel
Storage Usage Growing | predict_linear(nats_server_jetstream_storage_used[1h], 86400) > nats_server_jetstream_storage_limit | 1 hour | Storage will be full within 24 hours

Dashboard Recommendations

Dashboard 1: Zester Fleet

  • Connected peel count (zester_connected_peels on the master, fed from peel heartbeats)
  • Peel connectivity (zester_peel_connected across the fleet)
  • Job throughput and final status breakdown (zester_jobs_total by status)
  • Job duration percentiles (zester_job_duration_seconds by function)
  • Module execution rate and error ratio (zester_peel_state_apply_total by result)
  • Job reclaims (zester_job_reclaims_total)

Dashboard 2: NATS Infrastructure

  • Connection counts (total clients = master + peels)
  • Messages per second (published + received)
  • Bytes per second throughput
  • JetStream storage utilization
  • Slow consumer events
  • Reconnection events (zester_nats_reconnects_total / zester_nats_disconnects_total)
