Monitoring

Both daemons serve three local HTTP endpoints — /healthz (liveness), /readyz (readiness), and /metrics (Prometheus) — and log structured JSON via Go's log/slog package.

Health Check Endpoints

Both zester-peel and zester-master serve a plain-HTTP listener on a loopback address, started before anything else during daemon startup:

Component	Default address	Flag / YAML key
`zester-peel`	`127.0.0.1:9090`	`--health-addr` / `health_addr`
`zester-master`	`127.0.0.1:9091`	`--health-addr` / `health_addr`

The defaults are distinct so a peel and a master can colocate on the same host. Each listener serves three endpoints: /healthz, /readyz, and /metrics.

`/healthz` — Liveness

Always returns 200 while the process is up and serving HTTP, regardless of NATS or JetStream state:

$ curl -s http://127.0.0.1:9090/healthz
{"status":"ok","component":"peel","version":"v0.5.0"}

$ curl -s http://127.0.0.1:9091/healthz
{"status":"ok","component":"master","version":"v0.5.0"}

The watchdog parses the version field for fleet status reporting, so preserve it in the response body if you front the endpoint with a proxy.

`/readyz` — Readiness

Runs named per-subsystem checks (in parallel, 2s timeout per check) and reports one of three statuses per check:

Status	Meaning	HTTP effect
`ok`	Subsystem fully functional	`200`
`degraded`	Working but impaired (e.g. stale GitFS sync)	`200`
`down`	Subsystem non-functional	`503`

The endpoint returns 503 only when at least one check is down. A degraded check keeps the response at 200, so load balancers don't drop the node and the watchdog's update soak doesn't roll back a healthy binary over a non-fatal condition. The JSON body always carries per-check status, message, and latency_ms for operators and monitoring.
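The status-to-HTTP mapping above can be sketched in a few lines. This is an illustrative model of the documented rule, not the daemon's code; the function name readyz_http_status is hypothetical.

```python
from typing import Dict


def readyz_http_status(checks: Dict[str, str]) -> int:
    """Return the HTTP code for a map of check name -> status."""
    # Only "down" fails the probe; "ok" and "degraded" both keep it at 200.
    if any(status == "down" for status in checks.values()):
        return 503
    return 200


if __name__ == "__main__":
    print(readyz_http_status({"nats": "ok", "gitfs": "degraded"}))
    print(readyz_http_status({"nats": "down", "kv": "ok"}))
```

The first call models a node with a stale GitFS sync and stays at 200; the second models a NATS outage and yields 503.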

Peel checks:

Check	Semantics
`nats`	`down` until the NATS client is connected and healthy
`kv`	JetStream round-trip (bucket-info lookup on the facts bucket); `down` on error

Master checks:

Check	Semantics
`nats`	`down` until the NATS client is connected and healthy
`enroll-server`	`down` (with the error) if the enrollment TLS listener failed
`sched-consumer`	`down` while the `schedule-results` consumer is not running; a startup failure is retried in the background every 60s and the check flips to `ok` on success (`scheduled-result consumer started after retry`)
`gitfs`	Only registered when GitFS is enabled: `ok` while the last fully-successful sync is younger than 3× the GitFS interval, `degraded` before the first sync or when stale, `down` if the syncer exits
`target-service`	`down` until the target-resolution service (in-memory fact index + `zester.target.resolve` responder) is running; while down, CLI/API targeting falls back to facts-KV scans — functional, just slower
`reactor`	Only registered when the reactor is enabled: `down` while the shared durable `reactor` consumer is not running (a boot failure is retried every 60s), `degraded` while the engine runs on last-known-good rules after a failed rule load or while a rule's storm circuit breaker is open (the message lists the affected rules), `ok` otherwise — see Reactor operations
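As a reading aid for the gitfs row, the staleness rule can be modelled as a small pure function. This is a sketch under assumptions: the function and parameter names are hypothetical, and the comparison treats an age of exactly 3× the interval as stale, which the table does not specify.

```python
from typing import Optional


def gitfs_status(syncer_running: bool,
                 last_sync_age_s: Optional[float],
                 interval_s: float) -> str:
    """Model the documented gitfs readiness rule."""
    if not syncer_running:
        return "down"  # the syncer exited
    if last_sync_age_s is None:
        return "degraded"  # no fully successful sync yet
    # Fresh while the last good sync is younger than 3x the interval.
    if last_sync_age_s < 3 * interval_s:
        return "ok"
    return "degraded"  # stale


if __name__ == "__main__":
    for age in (None, 100.0, 200.0):
        print(age, gitfs_status(True, age, interval_s=60.0))
```

With a 60-second interval, the sync is considered fresh for 180 seconds.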

$ curl -s http://127.0.0.1:9090/readyz | jq .
{
  "status": "ok",
  "checks": {
    "nats": {"status": "ok", "latency_ms": 1},
    "kv": {"status": "ok", "latency_ms": 3}
  },
  "version": "v0.5.0",
  "uptime": "2h3m0s"
}

Degraded `gitfs` on standby masters is normal

In multi-master deployments, only the master holding the publisher lease runs GitFS syncs. A standby master with GitFS configured reports gitfs: degraded ("no successful sync yet") until it acquires the lease — that is expected steady-state, not a fault, and /readyz still returns 200. Alert on down, not degraded.
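A monitoring script that follows this advice might split the per-check results into down and degraded, and page only on the former. The sketch below parses a /readyz body; the sample body is a hypothetical, abbreviated response for a standby master, and classify_readyz is an illustrative name.

```python
import json
from typing import List, Tuple


def classify_readyz(body: str) -> Tuple[List[str], List[str]]:
    """Return (down_checks, degraded_checks) from a /readyz JSON body."""
    doc = json.loads(body)
    down, degraded = [], []
    for name, check in doc.get("checks", {}).items():
        status = check.get("status")
        if status == "down":
            down.append(name)
        elif status == "degraded":
            degraded.append(name)
    return sorted(down), sorted(degraded)


# Hypothetical, abbreviated body from a standby master.
SAMPLE = """
{"checks": {
  "nats": {"status": "ok", "latency_ms": 1},
  "gitfs": {"status": "degraded", "message": "no successful sync yet", "latency_ms": 0}
}}
"""

if __name__ == "__main__":
    down, degraded = classify_readyz(SAMPLE)
    print("page on:", down)
    print("informational:", degraded)
```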

Which probe to use

Point restart-style probes (Kubernetes liveness, systemd watchdogs) at /healthz and traffic/rollout gates at /readyz. A NATS outage makes /readyz return 503 on every node at once — restarting processes on that signal would turn a broker outage into a fleet-wide restart storm.

Watchdog Health Monitoring

zester-watchdog uses both endpoints, for different decisions:

--health-url (default http://127.0.0.1:9090/healthz) — liveness. Drives restart monitoring and the post-apply WaitForHealthy gate. Must match the child's --health-addr — the packaged systemd units pair them accordingly (:9090 for zester-peel.service, :9091 for zester-master.service).
--ready-url (default: derived from --health-url by replacing the path with /readyz) — readiness. Polled only during the update soak phase, so an alive-but-functionally-dead child fails soak and auto-rolls back.
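The default --ready-url derivation amounts to swapping the URL path. A minimal sketch with Python's standard library follows; derive_ready_url is a hypothetical name, and keeping any query string unchanged is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def derive_ready_url(health_url: str) -> str:
    """Replace the path of the liveness URL with /readyz."""
    parts = urlsplit(health_url)
    # Scheme, host, and port are kept; only the path changes.
    return urlunsplit(parts._replace(path="/readyz"))


if __name__ == "__main__":
    print(derive_ready_url("http://127.0.0.1:9090/healthz"))
    print(derive_ready_url("http://127.0.0.1:9091/healthz"))
```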

Related watchdog flags: --health-timeout (default 5s), --health-interval (default 10s), and --health-retries (default 3 — consecutive failures before a post-update rollback). See Watchdog Runtime for the full flag list and soak policy.
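To make "consecutive failures" concrete, the sketch below counts failed probes in a row and resets the count on any success. It only illustrates the counting rule; the watchdog's actual soak logic is described in Watchdog Runtime, and should_roll_back is a hypothetical name.

```python
from typing import Iterable


def should_roll_back(probe_results: Iterable[bool], retries: int = 3) -> bool:
    """Return True once `retries` probes in a row have failed."""
    consecutive = 0
    for healthy in probe_results:
        # A single healthy probe resets the failure streak.
        consecutive = 0 if healthy else consecutive + 1
        if consecutive >= retries:
            return True
    return False


if __name__ == "__main__":
    print(should_roll_back([True, False, False, True, False]))
    print(should_roll_back([True, False, False, False]))
```

With the default 10s interval and 3 retries, a child that stops responding would hit the threshold after roughly three probe intervals, plus any time spent waiting on --health-timeout.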

Peel Presence (Heartbeats)

The health endpoints answer "is this process alive?" — fleet-wide peel presence comes from the peel heartbeat bucket. Every connected peel rewrites its own key in the peel-heartbeat KV bucket on a fixed cadence:

Property	Value
Bucket	`peel-heartbeat`
Key pattern	`<peel-id>`
Interval	Every 10 seconds
TTL	30 seconds (3× the interval)
Value	MessagePack `{ts, version, protocol}`

Key presence is the liveness signal: a peel is considered offline after ~3 missed beats, when the TTL expires its key. Two consumers read the bucket:

zester peel list derives its ONLINE and LAST-SEEN columns from the heartbeat keys (fleets without the bucket show - in both columns).
The master counts the bucket's keys every 15 seconds to feed the zester_connected_peels gauge.

Heartbeat writes are non-fatal on the peel (Debug-logged, retried on the next tick). A peel whose credentials lack the heartbeat KV grant cannot write the bucket — it runs fine but looks offline in presence views.
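The presence rule can be simulated without a KV bucket by tracking the time of each peel's last write and applying the TTL by hand. This is a sketch: in the real bucket the TTL removes the key, whereas here expiry is modelled with timestamps, and all names are hypothetical.

```python
from typing import Dict, List

HEARTBEAT_INTERVAL_S = 10
HEARTBEAT_TTL_S = 3 * HEARTBEAT_INTERVAL_S  # 30 seconds


def online_peels(last_beat: Dict[str, float], now: float) -> List[str]:
    """Return peel IDs whose last heartbeat is still within the TTL."""
    return sorted(
        peel_id
        for peel_id, ts in last_beat.items()
        if now - ts < HEARTBEAT_TTL_S
    )


if __name__ == "__main__":
    beats = {"web-01": 100.0, "web-02": 65.0}
    live = online_peels(beats, now=105.0)
    # len(live) corresponds to what zester_connected_peels would report.
    print(live, len(live))
```

Here web-02 last wrote 40 seconds ago, so its key would have expired and it counts as offline.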

Prometheus Metrics

Both daemons serve Prometheus metrics (OpenMetrics format) at /metrics on the same health_addr listener as the health endpoints.

Master metrics

Metric	Type	Labels	Description
`zester_connected_peels`	gauge		Live peels, counted from `peel-heartbeat` bucket keys every 15s (reads `0` until peels write heartbeats)
`zester_jobs_total`	counter	`status`	Jobs finalized, partitioned by final status (`complete`, `partial`, `timeout`, `failed`, ...)
`zester_job_duration_seconds`	histogram	`function`	End-to-end job duration from dispatch to finalize
`zester_job_reclaims_total`	counter		Jobs reclaimed from dead masters by the orphan scanner
`zester_facts_sync_total`	counter		Fact sync operations received from peels
`zester_nats_reconnects_total`	counter		NATS reconnection events
`zester_nats_disconnects_total`	counter		NATS disconnection events
`zester_nats_slow_consumers_total`	counter		NATS slow consumer events
`zester_reactor_events_total`	counter	`origin_type`	Events consumed by the reactor (`peel`, `master`, `admin`)
`zester_reactor_events_dropped_total`	counter	`reason`	Events dropped at the reactor's consume gates (`malformed`, `decode`, `spoof`, `depth`, `ratelimit`, `stale`, `backpressure`)
`zester_reactor_events_unmatched_total`	counter		Events that matched no reactor rule
`zester_reactor_reactions_total`	counter	`rule`, `result`	Reaction executions per rule and outcome
`zester_reactor_render_duration_seconds`	histogram		Time to render one matched reaction rule
`zester_reactor_breaker_open`	gauge	`rule`	`1` while a rule's storm circuit breaker is open
`zester_reactor_lag`	gauge		Pending events on the shared `reactor` consumer (polled every 15s)
`zester_reactor_rule_errors_total`	counter		Failed reactor rule loads (last-known-good rules stay active)
`zester_reactor_rules_loaded`	gauge		Rules in the active reactor rule set

Peel metrics

Metric	Type	Labels	Description
`zester_peel_connected`	gauge		`1` while the NATS connection is actually up, `0` otherwise (including from process start until the first connect)
`zester_peel_uptime_seconds`	gauge		Peel process uptime
`zester_peel_state_apply_total`	counter	`state`, `result`	Executions per module (`state` = module name, e.g. `state.apply`; `result` = `success` or `error`)
`zester_peel_state_apply_duration_seconds`	histogram	`state`	Execution duration per module
`zester_peel_beacon_events_total`	counter	`beacon`	Beacon events generated (counted before publish, so offline-buffered events are included)
`zester_nats_reconnects_total`	counter		NATS reconnection events
`zester_nats_slow_consumers_total`	counter		NATS slow consumer events

Both registries also expose the standard Go runtime, process, and build-info collectors (go_*, process_*, go_build_info).
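For a quick look at a scrape without a Prometheus server, the labeled zester_jobs_total series can be pulled out of the text output with a small regular expression. This is a deliberately minimal sketch: the sample text (including its HELP line) is hypothetical, and a full parser such as the one shipped with the prometheus_client package handles escaping, multiple labels, and timestamps that this does not.

```python
import re
from typing import Dict

# Hypothetical excerpt of a master /metrics scrape.
SAMPLE = """\
# HELP zester_jobs_total Jobs finalized
# TYPE zester_jobs_total counter
zester_jobs_total{status="complete"} 40
zester_jobs_total{status="failed"} 2
zester_connected_peels 12
"""

JOBS_LINE = re.compile(r'^zester_jobs_total\{status="([^"]+)"\}\s+(\S+)')


def jobs_by_status(text: str) -> Dict[str, float]:
    """Map each status label to its counter value."""
    out: Dict[str, float] = {}
    for line in text.splitlines():
        match = JOBS_LINE.match(line)
        if match:
            out[match.group(1)] = float(match.group(2))
    return out


if __name__ == "__main__":
    print(jobs_by_status(SAMPLE))
```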

Registered but not yet wired

The metric registries (internal/metrics) define additional series that no code path increments yet — notably zester_job_active, zester_facts_sync_errors_total, zester_settings_render_duration_seconds, zester_targeting_resolution_duration_seconds, and the zester_nats_msgs_*/zester_nats_bytes_* transport counters on the master, plus zester_peel_facts_collect_duration_seconds on the peel. They appear in scrape output at zero (or not at all, for labeled series) — don't alert on them yet.

Scrape Configuration

The metrics listener binds to health_addr, which defaults to loopback. To scrape from a central Prometheus, either run a node-local agent (Prometheus agent mode, Grafana Alloy, etc.) or set health_addr to a routable interface — if you do the latter, firewall the port, as the endpoints are unauthenticated.

scrape_configs:
  - job_name: 'zester-master'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['master-host:9091']

  - job_name: 'zester-peel'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['web-01:9090', 'web-02:9090']

NATS Server Monitoring

Complement Zester-native metrics with the NATS server's built-in monitoring endpoints for broker-side visibility:

# Server status
curl -s http://nats-host:8222/varz | jq .

# Connected clients (peels + master)
curl -s http://nats-host:8222/connz | jq '.num_connections'

# JetStream status (storage, streams, consumers)
curl -s http://nats-host:8222/jsz | jq .

# Cluster routes
curl -s http://nats-host:8222/routez | jq .

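The monitoring port serves JSON, not the Prometheus exposition format, so scraping it directly at /metrics fails. Run an exporter such as prometheus-nats-exporter next to the server and add the exporter to your prometheus.yml (7777 is that exporter's default port; adjust the target to your deployment):

scrape_configs:
  - job_name: 'nats'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['nats-host:7777']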

Structured Logging

Zester uses Go's log/slog package for structured logging. All daemons log JSON by default; level and format are configurable:

Flag / YAML key	Default	Values
`--log-level` / `log_level`	`info`	`debug`, `info`, `warn`, `error`
`--log-format` / `log_format`	`json`	`json`, `text`

The flags exist on zester-master, zester-peel, and zester-watchdog (the watchdog has no YAML config — flags only). Invalid values fail startup with an error listing the valid options.

Every log line carries two base attributes: component (master, peel, or watchdog) and version (build version). Because component is always present, a watchdog and its wrapped child can share one stdout stream and still be told apart. The master additionally attaches master_id (its KSUID), and the peel attaches peel_id once known.

Log Format

{
  "time": "2026-07-01T14:30:00.123Z",
  "level": "INFO",
  "msg": "received exec request",
  "component": "peel",
  "version": "v0.5.0",
  "peel_id": "web-01",
  "jid": "2oHfKnCPMQnLEYQeBQsNtUiJp3r",
  "module": "state.apply"
}

Log Levels

Level	Usage	Examples
`ERROR`	Actionable failures requiring investigation	NATS connection lost, JetStream write failure, state.apply crash
`WARN`	Degraded conditions, non-critical issues	Reconnection attempts, timeout threshold breaches, slow consumers
`INFO`	Normal operational events	Job dispatch, peel connect/disconnect, state.apply, fact sync
`DEBUG`	Verbose diagnostic data	NATS message traces, template rendering, targeting resolution details

Multi-Master Log Lines

In multi-master deployments, KV publishing (settings files, state files, GitFS sync) and per-peel secrets encryption are gated by advisory leader leases in the leases bucket (publisher and facts-secrets). A standby master that logs only the candidate line is healthy; do not mistake it for a stuck startup:

Log line	Level	Meaning
`publisher lease candidate started; standing by until acquired`	INFO	This master is a standby for the `publisher` lease — normal steady-state on non-leaders
`leader lease acquired`	INFO	This master took over a lease (at startup, or after the previous holder died)
`leader lease lost`	WARN	The lease could not be renewed — another master will take over; publishing stops here
`settings files loaded for peel-side rendering`	INFO	Logged on every master at startup (loading is not lease-gated)
`published raw settings files for peel-side rendering` / `published state files`	INFO	Logged only on the current `publisher` lease holder

Log Aggregation

Ship logs to your log aggregation platform via journald:

# View logs
journalctl -u zester-master -f --output json

# Export to a collector
journalctl -u zester-master -f --output json | your-log-shipper

Use tools like Filebeat, Fluentd, Promtail, or Vector to ship the logs to Elasticsearch, Loki, Splunk, or your platform of choice.
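One detail matters when piping journalctl --output json into a custom shipper: journald wraps each line the daemon wrote in its own JSON entry and carries the original text in the MESSAGE field, so the Zester record has to be decoded a second time. A minimal sketch follows; the function name and the sample entry are hypothetical.

```python
import json
from typing import Optional


def unwrap_journal_line(line: str) -> Optional[dict]:
    """Extract the daemon's JSON record from one journald JSON entry."""
    entry = json.loads(line)
    message = entry.get("MESSAGE")
    if not isinstance(message, str):
        return None  # journald encodes non-text payloads differently
    try:
        record = json.loads(message)
    except json.JSONDecodeError:
        return None  # e.g. a line written with log_format=text
    return record if isinstance(record, dict) else None


if __name__ == "__main__":
    inner = {"level": "INFO", "msg": "job dispatched", "component": "master"}
    entry = json.dumps({"_SYSTEMD_UNIT": "zester-master.service",
                        "MESSAGE": json.dumps(inner)})
    print(unwrap_journal_line(entry))
```

Most of the shippers listed above can do this unwrapping themselves; the sketch is mainly useful for ad hoc scripts.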

Useful Log Queries

When using a log aggregation system, these queries help with operations:

Query	Purpose
`level=ERROR`	All errors requiring investigation
`component=master AND msg="job dispatched"`	All accepted dispatches (logged with `jid`, `user`, `function`, target count)
`msg="state applied" AND changed=true`	States that made changes (drift detection)
`jid=<JID>`	Trace a specific job
`msg="http.request.complete" AND path="/api/v1/jobs"`	REST dispatch request latency and status
`msg="http.request.unauthorized"`	Failed bearer-token auth attempts
`peel_id=<ID>`	All events for a specific peel
`duration_ms>5000`	Slow operations
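Without an aggregation platform, the same queries can be approximated over decoded log records in a short script. The sketch below implements the jid and duration_ms queries from the table; the sample records are hypothetical and the helper names are illustrative.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def trace_job(records: Iterable[Record], jid: str) -> List[Record]:
    """Equivalent of the `jid=<JID>` query."""
    return [r for r in records if r.get("jid") == jid]


def slow_operations(records: Iterable[Record],
                    threshold_ms: float = 5000) -> List[Record]:
    """Equivalent of the `duration_ms>5000` query."""
    out = []
    for r in records:
        duration = r.get("duration_ms")
        # Skip records that carry no numeric duration.
        if isinstance(duration, (int, float)) and duration > threshold_ms:
            out.append(r)
    return out


# Hypothetical decoded log records.
RECORDS: List[Record] = [
    {"msg": "job dispatched", "jid": "J1", "component": "master"},
    {"msg": "state applied", "jid": "J1", "duration_ms": 7200},
    {"msg": "state applied", "jid": "J2", "duration_ms": 310},
]

if __name__ == "__main__":
    print(len(trace_job(RECORDS, "J1")))
    print(slow_operations(RECORDS))
```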

Alerting Recommendations

Base alerts on Zester-native metrics where available, plus NATS server metrics (from the nats scrape job) for broker health:

Critical Alerts

Alert	PromQL	Threshold	Description
Peel Disconnected	`zester_peel_connected == 0`	5 minutes	Peel process is up but has no NATS connection
Master Not Ready	`probe_success{job="zester-master-readyz"} == 0`	2 minutes	A master readiness check (`nats`, `enroll-server`, `sched-consumer`, `gitfs`, `target-service`, `reactor`) is `down`; probe `/readyz` with the blackbox exporter
NATS Down	`up{job="nats"} == 0`	1 minute	NATS server is not responding to scrapes
JetStream Storage Full	`nats_server_jetstream_storage_used / nats_server_jetstream_storage_limit > 0.9`	Immediate	JetStream disk storage above 90%

Warning Alerts

Alert	PromQL	Threshold	Description
Job Failures	`increase(zester_jobs_total{status=~"failed|timeout"}[15m]) > 0`	15 minutes	Jobs finalizing as failed or timed out
Connected Peels Drop	`zester_connected_peels < <expected fleet size>`	10 minutes	Peels stopped heartbeating into the `peel-heartbeat` bucket — cross-check with `zester peel list`
Job Reclaims	`increase(zester_job_reclaims_total[1h]) > 0`	1 hour	A master died and its jobs were reclaimed — investigate the dead master
NATS Flapping	`increase(zester_nats_reconnects_total[15m]) > 3`	15 minutes	Repeated reconnects on a master or peel
Storage Usage Growing	`predict_linear(nats_server_jetstream_storage_used[1h], 86400) > nats_server_jetstream_storage_limit`	1 hour	Storage will be full within 24 hours
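The Storage Usage Growing alert relies on linear extrapolation: fit a line through the last hour of samples and evaluate it 86400 seconds ahead. The sketch below reproduces that arithmetic with a least-squares fit. It is an approximation for intuition, not Prometheus's implementation: it extrapolates from the last sample's timestamp rather than the rule evaluation time, and the sample values are made up.

```python
from typing import List, Tuple


def extrapolate_linear(samples: List[Tuple[float, float]],
                       horizon_s: float) -> float:
    """Least-squares fit over (timestamp, value) pairs, evaluated
    horizon_s seconds after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("need at least two samples with distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    target_t = samples[-1][0] + horizon_s
    return mean_v + slope * (target_t - mean_t)


if __name__ == "__main__":
    # Storage grows by 50 units every 30 minutes over one hour.
    samples = [(0.0, 100.0), (1800.0, 150.0), (3600.0, 200.0)]
    predicted = extrapolate_linear(samples, horizon_s=86400)
    limit = 2000.0  # hypothetical storage limit
    print(round(predicted, 3), predicted > limit)
```

In this made-up series the projected usage exceeds the limit within 24 hours, which is the condition the alert fires on.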

Dashboard Recommendations

Dashboard 1: Zester Fleet

Connected peel count (zester_connected_peels on the master, fed from peel heartbeats)
Peel connectivity (zester_peel_connected across the fleet)
Job throughput and final status breakdown (zester_jobs_total by status)
Job duration percentiles (zester_job_duration_seconds by function)
Module execution rate and error ratio (zester_peel_state_apply_total by result)
Job reclaims (zester_job_reclaims_total)

Dashboard 2: NATS Infrastructure

Connection counts (total clients = master + peels)
Messages per second (published + received)
Bytes per second throughput
JetStream storage utilization
Slow consumer events
Reconnection events (zester_nats_reconnects_total / zester_nats_disconnects_total)
