Monitoring
Both daemons serve three local HTTP endpoints — /healthz (liveness), /readyz (readiness), and /metrics (Prometheus) — and log structured JSON via Go's log/slog package.
Health Check Endpoints
Both zester-peel and zester-master serve a plain-HTTP listener on a loopback address, started before anything else during daemon startup:
| Component | Default address | Flag / YAML key |
|---|---|---|
| zester-peel | 127.0.0.1:9090 | --health-addr / health_addr |
| zester-master | 127.0.0.1:9091 | --health-addr / health_addr |
The defaults are distinct so a peel and a master can colocate on the same host. Each listener serves three endpoints: /healthz, /readyz, and /metrics.
/healthz — Liveness
Always returns 200 while the process is up and serving HTTP, regardless of NATS or JetStream state:
$ curl -s http://127.0.0.1:9090/healthz
{"status":"ok","component":"peel","version":"v0.5.0"}
$ curl -s http://127.0.0.1:9091/healthz
{"status":"ok","component":"master","version":"v0.5.0"}
The version field is parsed by the watchdog for fleet status reporting, so keep it if you front the endpoint with a proxy.
/readyz — Readiness
Runs named per-subsystem checks (in parallel, 2s timeout per check) and reports one of three statuses per check:
| Status | Meaning | HTTP effect |
|---|---|---|
| ok | Subsystem fully functional | 200 |
| degraded | Working but impaired (e.g. stale GitFS sync) | 200 |
| down | Subsystem non-functional | 503 |
The endpoint returns 503 only when at least one check is down — degraded stays 200 so load balancers don't drop the node and the watchdog's update soak doesn't roll back healthy binaries over non-fatal conditions. The JSON body always carries per-check status, message, and latency_ms for operators and monitoring.
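The aggregation rule above can be sketched in a few lines of Python. This is an illustration of the documented behavior, not the daemon's actual code; the function name `readyz_http_status` is hypothetical.

```python
def readyz_http_status(checks: dict) -> int:
    """Map per-check statuses ("ok", "degraded", "down") to an HTTP code.

    Hypothetical helper: it mirrors the documented rule only.
    """
    # Only a "down" check fails readiness; "degraded" still answers 200.
    if any(status == "down" for status in checks.values()):
        return 503
    return 200


if __name__ == "__main__":
    print(readyz_http_status({"nats": "ok", "gitfs": "degraded"}))
    print(readyz_http_status({"nats": "down", "kv": "ok"}))
```

Keeping the rule this narrow is what lets a load balancer or the update soak ignore non-fatal conditions while the JSON body still reports them.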
Peel checks:
| Check | Semantics |
|---|---|
| nats | down until the NATS client is connected and healthy |
| kv | JetStream round-trip (bucket-info lookup on the facts bucket); down on error |
Master checks:
| Check | Semantics |
|---|---|
| nats | down until the NATS client is connected and healthy |
| enroll-server | down (with the error) if the enrollment TLS listener failed |
| sched-consumer | down while the schedule-results consumer is not running; a startup failure is retried in the background every 60s and the check flips to ok on success (scheduled-result consumer started after retry) |
| gitfs | Only registered when GitFS is enabled: ok while the last fully-successful sync is younger than 3× the GitFS interval, degraded before the first sync or when stale, down if the syncer exits |
| target-service | down until the target-resolution service (in-memory fact index + zester.target.resolve responder) is running; while down, CLI/API targeting falls back to facts-KV scans (functional, just slower) |
| reactor | Only registered when the reactor is enabled: down while the shared durable reactor consumer is not running (a boot failure is retried every 60s), degraded while the engine runs on last-known-good rules after a failed rule load or while a rule's storm circuit breaker is open (the message lists the affected rules), ok otherwise; see Reactor operations |
$ curl -s http://127.0.0.1:9090/readyz | jq .
{
  "status": "ok",
  "checks": {
    "nats": {"status": "ok", "latency_ms": 1},
    "kv": {"status": "ok", "latency_ms": 3}
  },
  "version": "v0.5.0",
  "uptime": "2h3m0s"
}
Degraded `gitfs` on standby masters is normal
In multi-master deployments, only the master holding the publisher lease runs GitFS syncs. A standby master with GitFS configured reports gitfs: degraded ("no successful sync yet") until it acquires the lease — that is expected steady-state, not a fault, and /readyz still returns 200. Alert on down, not degraded.
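To make the gitfs semantics concrete, here is a minimal Python sketch of how the three statuses could be derived from the last sync age. The function name, its parameters, and the choice to let an exited syncer take precedence over everything else are assumptions for illustration; the master's real check may be structured differently.

```python
from typing import Optional


def gitfs_check_status(last_sync_age_s: Optional[float],
                       interval_s: float,
                       syncer_running: bool = True) -> str:
    """Hypothetical model of the documented gitfs readiness check."""
    if not syncer_running:
        return "down"          # the syncer exited
    if last_sync_age_s is None:
        return "degraded"      # no successful sync yet (e.g. a standby master)
    if last_sync_age_s < 3 * interval_s:
        return "ok"            # last full sync is younger than 3x the interval
    return "degraded"          # stale


if __name__ == "__main__":
    print(gitfs_check_status(None, interval_s=60))   # standby before first sync
    print(gitfs_check_status(45, interval_s=60))     # recently synced
```

A standby master stays in the second branch indefinitely, which is why paging on degraded would be noisy.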
Which probe to use
Point restart-style probes (Kubernetes liveness, systemd watchdogs) at /healthz and traffic/rollout gates at /readyz. A NATS outage makes /readyz return 503 on every node at once — restarting processes on that signal would turn a broker outage into a fleet-wide restart storm.
Watchdog Health Monitoring
zester-watchdog uses both endpoints, for different decisions:
- --health-url (default http://127.0.0.1:9090/healthz): liveness. Drives restart monitoring and the post-apply WaitForHealthy gate. Must match the child's --health-addr; the packaged systemd units pair them accordingly (:9090 for zester-peel.service, :9091 for zester-master.service).
- --ready-url (default: derived from --health-url by replacing the path with /readyz): readiness. Polled only during the update soak phase, so an alive-but-functionally-dead child fails soak and auto-rolls back.
Related watchdog flags: --health-timeout (default 5s), --health-interval (default 10s), and --health-retries (default 3 — consecutive failures before a post-update rollback). See Watchdog Runtime for the full flag list and soak policy.
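The default derivation of the readiness URL and the consecutive-failure rule behind --health-retries can be sketched as follows. Both helper names are hypothetical, and details such as whether the real watchdog preserves a query string are not covered by the text above, so treat this as a sketch under those assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def derive_ready_url(health_url: str) -> str:
    """Replace the path of the health URL with /readyz (hypothetical helper)."""
    parts = urlsplit(health_url)
    return urlunsplit(parts._replace(path="/readyz"))


def should_roll_back(probe_results, retries: int = 3) -> bool:
    """True once `retries` consecutive probes have failed.

    probe_results is an iterable of booleans (True = healthy probe).
    """
    streak = 0
    for healthy in probe_results:
        streak = 0 if healthy else streak + 1
        if streak >= retries:
            return True
    return False


if __name__ == "__main__":
    print(derive_ready_url("http://127.0.0.1:9090/healthz"))
    print(should_roll_back([True, False, False, False]))
```

A single successful probe resets the streak, so intermittent failures below the retry count do not trigger a rollback in this model.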
Peel Presence (Heartbeats)
The health endpoints answer "is this process alive?" — fleet-wide peel presence comes from the peel heartbeat bucket. Every connected peel rewrites its own key in the peel-heartbeat KV bucket on a fixed cadence:
| Property | Value |
|---|---|
| Bucket | peel-heartbeat |
| Key pattern | <peel-id> |
| Interval | Every 10 seconds |
| TTL | 30 seconds (3× the interval) |
| Value | MessagePack {ts, version, protocol} |
Key presence is the liveness signal: a peel is considered offline after ~3 missed beats, when the TTL expires its key. Two consumers read the bucket:
- zester peel list derives its ONLINE and LAST-SEEN columns from the heartbeat keys (fleets without the bucket show - in both columns).
- The master counts the bucket's keys every 15 seconds to feed the zester_connected_peels gauge.
Heartbeat writes are non-fatal on the peel (Debug-logged, retried on the next tick). A peel whose credentials lack the heartbeat KV grant cannot write the bucket — it runs fine but looks offline in presence views.
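The presence arithmetic can be modeled in a few lines of Python. In the real system the key expiry is enforced by the KV bucket's TTL, not by client code; the names below (`key_present`, `connected_peels`) are hypothetical and exist only to show how the 10-second interval and 30-second TTL interact.

```python
HEARTBEAT_INTERVAL_S = 10
HEARTBEAT_TTL_S = 3 * HEARTBEAT_INTERVAL_S  # 30 seconds, per the table above


def key_present(last_beat_ts: float, now: float,
                ttl_s: float = HEARTBEAT_TTL_S) -> bool:
    """Model of TTL expiry: the key survives until ttl_s after the last write."""
    return (now - last_beat_ts) < ttl_s


def connected_peels(last_beats: dict, now: float) -> int:
    """Count peels whose heartbeat key would still exist at time `now`."""
    return sum(1 for ts in last_beats.values() if key_present(ts, now))


if __name__ == "__main__":
    beats = {"web-01": 100.0, "web-02": 60.0}  # illustrative timestamps
    print(connected_peels(beats, now=125.0))
```

Because the master samples the bucket every 15 seconds, the gauge can lag a peel's disappearance by up to the TTL plus one sampling period.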
Prometheus Metrics
Both daemons serve Prometheus metrics (OpenMetrics format) at /metrics on the same health_addr listener as the health endpoints.
Master metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| zester_connected_peels | gauge | | Live peels, counted from peel-heartbeat bucket keys every 15s (reads 0 until peels write heartbeats) |
| zester_jobs_total | counter | status | Jobs finalized, partitioned by final status (complete, partial, timeout, failed, ...) |
| zester_job_duration_seconds | histogram | function | End-to-end job duration from dispatch to finalize |
| zester_job_reclaims_total | counter | | Jobs reclaimed from dead masters by the orphan scanner |
| zester_facts_sync_total | counter | | Fact sync operations received from peels |
| zester_nats_reconnects_total | counter | | NATS reconnection events |
| zester_nats_disconnects_total | counter | | NATS disconnection events |
| zester_nats_slow_consumers_total | counter | | NATS slow consumer events |
| zester_reactor_events_total | counter | origin_type | Events consumed by the reactor (peel, master, admin) |
| zester_reactor_events_dropped_total | counter | reason | Events dropped at the reactor's consume gates (malformed, decode, spoof, depth, ratelimit, stale, backpressure) |
| zester_reactor_events_unmatched_total | counter | | Events that matched no reactor rule |
| zester_reactor_reactions_total | counter | rule, result | Reaction executions per rule and outcome |
| zester_reactor_render_duration_seconds | histogram | | Time to render one matched reaction rule |
| zester_reactor_breaker_open | gauge | rule | 1 while a rule's storm circuit breaker is open |
| zester_reactor_lag | gauge | | Pending events on the shared reactor consumer (polled every 15s) |
| zester_reactor_rule_errors_total | counter | | Failed reactor rule loads (last-known-good rules stay active) |
| zester_reactor_rules_loaded | gauge | | Rules in the active reactor rule set |
Peel metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| zester_peel_connected | gauge | | 1 while the NATS connection is actually up, 0 otherwise (including from process start until the first connect) |
| zester_peel_uptime_seconds | gauge | | Peel process uptime |
| zester_peel_state_apply_total | counter | state, result | Executions per module (state = module name, e.g. state.apply; result = success or error) |
| zester_peel_state_apply_duration_seconds | histogram | state | Execution duration per module |
| zester_peel_beacon_events_total | counter | beacon | Beacon events generated (counted before publish, so offline-buffered events are included) |
| zester_nats_reconnects_total | counter | | NATS reconnection events |
| zester_nats_slow_consumers_total | counter | | NATS slow consumer events |
Both registries also expose the standard Go runtime, process, and build-info collectors (go_*, process_*, go_build_info).
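For a quick spot check without a Prometheus server, the scrape output can be read directly. The sketch below is a deliberately small parser for the text exposition lines of one metric; `read_samples` is a hypothetical helper, it ignores timestamps and does not handle label values that contain a closing brace. In practice the text would come from fetching /metrics on the health_addr listener.

```python
def read_samples(exposition_text: str, metric: str) -> dict:
    """Return {label_string: value} for one metric name (hypothetical helper)."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE/EOF comment lines
        if line.startswith(metric + "{"):
            # Labeled sample: name{labels} value
            labels, _, rest = line[len(metric) + 1:].partition("}")
        elif line.startswith(metric + " "):
            # Unlabeled sample: name value
            labels, rest = "", line[len(metric):]
        else:
            continue
        samples[labels] = float(rest.split()[0])
    return samples


if __name__ == "__main__":
    # Values below are made up for illustration.
    sample = 'zester_connected_peels 42\nzester_jobs_total{status="failed"} 3\n'
    print(read_samples(sample, "zester_jobs_total"))
```

Matching on the metric name followed by a brace or a space keeps histogram series such as the _bucket and _sum lines from being mistaken for the base name.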
Registered but not yet wired
The metric registries (internal/metrics) define additional series that no code path increments yet — notably zester_job_active, zester_facts_sync_errors_total, zester_settings_render_duration_seconds, zester_targeting_resolution_duration_seconds, and the zester_nats_msgs_*/zester_nats_bytes_* transport counters on the master, plus zester_peel_facts_collect_duration_seconds on the peel. They appear in scrape output at zero (or not at all, for labeled series) — don't alert on them yet.
Scrape Configuration
The metrics listener binds to health_addr, which defaults to loopback. To scrape from a central Prometheus, either run a node-local agent (Prometheus agent mode, Grafana Alloy, etc.) or set health_addr to a routable interface — if you do the latter, firewall the port, as the endpoints are unauthenticated.
scrape_configs:
  - job_name: 'zester-master'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['master-host:9091']
  - job_name: 'zester-peel'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['web-01:9090', 'web-02:9090']
NATS Server Monitoring
Complement Zester-native metrics with the NATS server's built-in monitoring endpoints for broker-side visibility:
# Server status
curl -s http://nats-host:8222/varz | jq .
# Connected clients (peels + master)
curl -s http://nats-host:8222/connz | jq '.num_connections'
# JetStream status (storage, streams, consumers)
curl -s http://nats-host:8222/jsz | jq .
# Cluster routes
curl -s http://nats-host:8222/routez | jq .
Add the NATS monitoring endpoint to your prometheus.yml:
scrape_configs:
  - job_name: 'nats'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['nats-host:8222']
Structured Logging
Zester uses Go's log/slog package for structured logging. All daemons log JSON by default; level and format are configurable:
| Flag / YAML key | Default | Values |
|---|---|---|
| --log-level / log_level | info | debug, info, warn, error |
| --log-format / log_format | json | json, text |
The flags exist on zester-master, zester-peel, and zester-watchdog (the watchdog has no YAML config — flags only). Invalid values fail startup with an error listing the valid options.
Every log line carries base attributes: component (master, peel, or watchdog) and version (build version). The master additionally attaches master_id (its KSUID) and the peel attaches peel_id once known — so a watchdog and its wrapped child can share one stdout stream and still be told apart.
Log Format
{
  "time": "2026-07-01T14:30:00.123Z",
  "level": "INFO",
  "msg": "received exec request",
  "component": "peel",
  "version": "v0.5.0",
  "peel_id": "web-01",
  "jid": "2oHfKnCPMQnLEYQeBQsNtUiJp3r",
  "module": "state.apply"
}
Log Levels
| Level | Usage | Examples |
|---|---|---|
| ERROR | Actionable failures requiring investigation | NATS connection lost, JetStream write failure, state.apply crash |
| WARN | Degraded conditions, non-critical issues | Reconnection attempts, timeout threshold breaches, slow consumers |
| INFO | Normal operational events | Job dispatch, peel connect/disconnect, state.apply, fact sync |
| DEBUG | Verbose diagnostic data | NATS message traces, template rendering, targeting resolution details |
Multi-Master Log Lines
In multi-master deployments, KV publishing (settings files, state files, GitFS sync) and per-peel secrets encryption are gated by advisory leader leases in the leases bucket (publisher and facts-secrets). Standby masters are healthy while logging only the candidate lines — do not mistake them for a stuck startup:
| Log line | Level | Meaning |
|---|---|---|
| publisher lease candidate started; standing by until acquired | INFO | This master is a standby for the publisher lease (normal steady-state on non-leaders) |
| leader lease acquired | INFO | This master took over a lease (at startup, or after the previous holder died) |
| leader lease lost | WARN | The lease could not be renewed; another master will take over and publishing stops here |
| settings files loaded for peel-side rendering | INFO | Logged on every master at startup (loading is not lease-gated) |
| published raw settings files for peel-side rendering / published state files | INFO | Logged only on the current publisher lease holder |
Log Aggregation
Ship logs to your log aggregation platform via journald:
# View logs
journalctl -u zester-master -f --output json
# Export to a collector
journalctl -u zester-master -f --output json | your-log-shipper
Use tools like Filebeat, Fluentd, Promtail, or Vector to ship log files to Elasticsearch, Loki, Splunk, or your platform of choice.
Useful Log Queries
When using a log aggregation system, these queries help with operations:
| Query | Purpose |
|---|---|
| level=ERROR | All errors requiring investigation |
| component=master AND msg="job dispatched" | All accepted dispatches (logged with jid, user, function, target count) |
| msg="state applied" AND changed=true | States that made changes (drift detection) |
| jid=<JID> | Trace a specific job |
| msg="http.request.complete" AND path="/api/v1/jobs" | REST dispatch request latency and status |
| msg="http.request.unauthorized" | Failed bearer-token auth attempts |
| peel_id=<ID> | All events for a specific peel |
| duration_ms>5000 | Slow operations |
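Without an aggregation platform, the same queries can be approximated over raw JSON log lines. The sketch below assumes one daemon log record per line, as in the Log Format example; note that journalctl's JSON output wraps each record in a journal entry with the original line in its MESSAGE field, so that field would need unwrapping first. The `select` helper is hypothetical.

```python
import json


def select(lines, predicate):
    """Yield parsed log records for which predicate(record) is true."""
    for raw in lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines, e.g. output written with log_format=text
        if isinstance(record, dict) and predicate(record):
            yield record


if __name__ == "__main__":
    # Illustrative records only; field values are made up.
    logs = [
        '{"level":"INFO","msg":"received exec request","component":"peel","peel_id":"web-01"}',
        '{"level":"ERROR","msg":"example failure","component":"master","duration_ms":7000}',
    ]
    for rec in select(logs, lambda r: r.get("level") == "ERROR"):
        print(rec["component"], rec["msg"])
    for rec in select(logs, lambda r: r.get("duration_ms", 0) > 5000):
        print("slow:", rec["msg"])
```

Using a predicate keeps equality filters (level, peel_id, jid) and numeric filters (duration_ms) in one code path.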
Alerting Recommendations
Base alerts on Zester-native metrics where available, plus NATS server metrics (at http://<nats-host>:8222/metrics) for broker health:
Critical Alerts
| Alert | PromQL | Threshold | Description |
|---|---|---|---|
| Peel Disconnected | zester_peel_connected == 0 | 5 minutes | Peel process is up but has no NATS connection |
| Master Not Ready | probe_success{job="zester-master-readyz"} == 0 | 2 minutes | A master subsystem (NATS, enroll-server, sched-consumer, target-service, reactor) is down — probe /readyz with the blackbox exporter |
| NATS Down | up{job="nats"} == 0 | 1 minute | NATS server is not responding to scrapes |
| JetStream Storage Full | nats_server_jetstream_storage_used / nats_server_jetstream_storage_limit > 0.9 | Immediate | JetStream disk storage above 90% |
Warning Alerts
| Alert | PromQL | Threshold | Description |
|---|---|---|---|
| Job Failures | increase(zester_jobs_total{status=~"failed\|timeout"}[15m]) > 0 | 15 minutes | Jobs finalizing as failed or timed out |
| Connected Peels Drop | zester_connected_peels < <expected fleet size> | 10 minutes | Peels stopped heartbeating into the peel-heartbeat bucket — cross-check with zester peel list |
| Job Reclaims | increase(zester_job_reclaims_total[1h]) > 0 | 1 hour | A master died and its jobs were reclaimed — investigate the dead master |
| NATS Flapping | increase(zester_nats_reconnects_total[15m]) > 3 | 15 minutes | Repeated reconnects on a master or peel |
| Storage Usage Growing | predict_linear(nats_server_jetstream_storage_used[1h], 86400) > nats_server_jetstream_storage_limit | 1 hour | Storage will be full within 24 hours |
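The Storage Usage Growing rule relies on PromQL's predict_linear, which fits a least-squares line through the samples in the range and extrapolates it. The Python sketch below shows the same arithmetic on a handful of made-up samples; as a simplification it extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the evaluation time.

```python
def predict_linear(samples, horizon_s: float) -> float:
    """Least-squares extrapolation, horizon_s seconds past the last sample.

    samples: list of (timestamp_seconds, value) with at least two distinct
    timestamps. Simplified stand-in for the PromQL function of the same name.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_s)


if __name__ == "__main__":
    # One hour of illustrative storage samples (bytes), extrapolated 24 hours ahead.
    usage = [(0, 100.0), (1800, 150.0), (3600, 200.0)]
    limit = 2000.0
    print(predict_linear(usage, 86400) > limit)
```

A short range makes the prediction sensitive to bursts, which is one reason to pair this rule with the absolute 90% threshold in the critical alerts.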
Dashboard Recommendations
Dashboard 1: Zester Fleet
- Connected peel count (zester_connected_peels on the master, fed from peel heartbeats)
- Peel connectivity (zester_peel_connected across the fleet)
- Job throughput and final status breakdown (zester_jobs_total by status)
- Job duration percentiles (zester_job_duration_seconds by function)
- Module execution rate and error ratio (zester_peel_state_apply_total by result)
- Job reclaims (zester_job_reclaims_total)
Dashboard 2: NATS Infrastructure
- Connection counts (total clients = master + peels)
- Messages per second (published + received)
- Bytes per second throughput
- JetStream storage utilization
- Slow consumer events
- Reconnection events (zester_nats_reconnects_total / zester_nats_disconnects_total)