Master Configuration
The master connects to an external NATS server as a client using credentials-based authentication. Configuration is provided via a YAML config file and/or command-line flags. The underlying ClientConfig struct controls connection behavior.
Configuration File
The master loads configuration from a YAML file. The default search path is /etc/zester/master.yaml. Use --config to specify a custom path.
Precedence (lowest to highest):
- Built-in defaults
- Config file values
- Command-line flags
Flags passed explicitly on the command line always override config file values. A flag left at its default does not override a config file setting.
Flags and YAML fields are generated from the same tagged config struct (MasterDaemonConfig in internal/config/master_daemon.go), so every flag has a matching YAML field with a shared default. The precedence above is unchanged by this binding — including --gitfs-remotes "" explicitly disabling GitFS over a YAML-provided remote list.
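The precedence rule can be sketched with the standard library's flag package, whose Visit method walks only the flags that were actually set. This is an illustration under assumptions, not the code in internal/config: the two-field Config type and the resolve helper are hypothetical names introduced here.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// Config is a hypothetical two-field stand-in for the master's config struct.
type Config struct {
	NATSURL      string
	GitFSRemotes []string
}

// resolve layers built-in defaults, then config file values, then only the
// flags that were explicitly present in args.
func resolve(file Config, args []string) (Config, error) {
	cfg := Config{NATSURL: "tls://nats:4222"} // built-in default
	if file.NATSURL != "" {
		cfg.NATSURL = file.NATSURL
	}
	if file.GitFSRemotes != nil {
		cfg.GitFSRemotes = file.GitFSRemotes
	}

	fs := flag.NewFlagSet("master", flag.ContinueOnError)
	natsURL := fs.String("nats-url", "tls://nats:4222", "NATS server URL")
	remotes := fs.String("gitfs-remotes", "", "comma-separated Git remote URLs")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	// Visit skips flags left at their defaults, so an untouched flag default
	// never clobbers a value that came from the config file.
	fs.Visit(func(f *flag.Flag) {
		switch f.Name {
		case "nats-url":
			cfg.NATSURL = *natsURL
		case "gitfs-remotes":
			if *remotes == "" {
				cfg.GitFSRemotes = nil // explicit empty value disables GitFS
			} else {
				cfg.GitFSRemotes = strings.Split(*remotes, ",")
			}
		}
	})
	return cfg, nil
}

func main() {
	file := Config{
		NATSURL:      "tls://nats-1.example.com:4222",
		GitFSRemotes: []string{"git@github.com:org/base-states.git"},
	}
	cfg, err := resolve(file, []string{"--gitfs-remotes", ""})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.NATSURL, len(cfg.GitFSRemotes))
}
```

In this sketch the file's nats_url survives because --nats-url was not passed, while the explicitly empty --gitfs-remotes clears the remote list from the file.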
Example Config File
# Only tls:// URLs are accepted; plaintext nats:// is rejected at startup.
nats_url: tls://nats-1.example.com:4222
nats_ca: /data/auth/nats-ca.crt
states_dir: /srv/zester/states
settings_dir: /srv/zester/settings
health_addr: "127.0.0.1:9091"
enroll:
  addr: ":8443"
  tls_cert: /data/auth/enroll.crt
  tls_key: /data/auth/enroll.key
api:
  docs_enabled: false
  tokens:
    - username: ci-system
      token_file: /data/auth/api-tokens/ci-system.token
gitfs:
  remotes:
    - git@github.com:org/base-states.git
    - git@github.com:org/app-states.git
  interval: 2m
  ssh_key: /data/auth/deploy.key

The packaged config installed by the .deb/.rpm (packaging/config/master.yaml) uses nats_url: "tls://localhost:4222" with paths under /var/lib/zester/.
Config File Reference
| Key | Type | Default | Description |
|---|---|---|---|
nats_url | string | tls://nats:4222 | NATS server URL. Must use the tls:// scheme — plaintext nats:// URLs are rejected at startup. |
nats_ca | string | "" | CA certificate file for NATS TLS server verification. See CA resolution order. |
auth_dir | string | /data/auth | Directory containing auth files (master.creds, account.seed) |
states_dir | string | /data/states | Root directory for state files |
settings_dir | string | /data/settings | Root directory for settings files |
jetstream_replicas | int | 0 | JetStream replication factor applied to all buckets, streams, and object stores (0 = auto: min(3, detected cluster size) — see Replication) |
health_addr | string | 127.0.0.1:9091 | Listen address for the local observability endpoints — /healthz, /readyz, /metrics (polled by zester-watchdog --health-url / --ready-url) |
log_level | string | info | Log level: debug, info, warn, or error. Invalid values abort startup. |
log_format | string | json | Log format: json or text. See Logging. |
enroll.addr | string | :8443 | Enrollment HTTPS API listen address |
enroll.tls_cert | string | /data/auth/enroll.crt | TLS certificate for enrollment API |
enroll.tls_key | string | /data/auth/enroll.key | TLS private key for enrollment API |
api.docs_enabled | bool | false | Serve Swagger UI + OpenAPI spec (/api/v1/docs, /api/v1/openapi.*). These routes are served without authentication on the peel-facing enrollment listener, so this is an explicit opt-in. |
api.tokens[].username | string | "" | API client username used in request context/logging |
api.tokens[].token_file | string | "" | Path to bearer token file; file is read on every request. The master logs a startup warning unless permissions are 0600 (no group/other access). |
gitfs.remotes | list | [] | Git remote URLs for state file sync |
gitfs.interval | duration | 5m | GitFS pull interval |
gitfs.ssh_key | string | "" | Path to SSH private key for GitFS |
reactor.enabled | bool | true | Run the reactor engine (event-driven reactions) on this master |
reactor.dir | string | /data/reactor | Local directory holding reactor rule files (top.zy + reaction .zy files) |
reactor.workers | int | 4 | Reactor render/execute worker pool size |
reactor.max_chain_depth | int | 3 | Maximum reaction chain depth before events are dropped |
reactor.enable_chaining | bool | true | Allow reaction rules to emit derived events (event.send) |
reactor.default_throttle | duration | 0 | Default per-(rule, source) refractory period (0 = none) |
reactor.source_rate_limit | int | 120 | Per-source event rate limit in events/minute (0 = unlimited) |
reactor.max_event_age | duration | 1h | Drop events older than this at consume time (0 = full replay). See Reactor operations. |
reactor.storm_rate | int | 60 | Per-rule fires/minute that trips the circuit breaker (0 = no breaker) |
reactor.breaker_cooldown | duration | 5m | How long a tripped reaction circuit breaker stays open |
REST API exposure and rate limiting
REST API routes are only registered when api.docs_enabled is true or at least one api.tokens entry exists. The unauthenticated peel-facing enrollment endpoints (/api/v1/enroll and subpaths) get a strict per-IP rate limit (burst 10, 1 req/10s); all other routes — the token-authenticated REST API — get a much higher budget (burst 120, 20 req/s per IP).
Command-Line Flags
| Flag | Default | Description |
|---|---|---|
--config | /etc/zester/master.yaml | Path to YAML config file |
--nats-url | tls://nats:4222 | NATS server URL (must be tls://) |
--nats-ca | "" | CA certificate for NATS TLS server verification |
--auth-dir | /data/auth | Directory containing auth files (master.creds, account.seed) |
--enroll-addr | :8443 | Enrollment HTTPS API listen address |
--enroll-tls-cert | /data/auth/enroll.crt | TLS certificate for enrollment API (required) |
--enroll-tls-key | /data/auth/enroll.key | TLS private key for enrollment API (required) |
--states-dir | /data/states | Root directory for state files |
--settings-dir | /data/settings | Root directory for settings files |
--jetstream-replicas | 0 | JetStream replication factor (0 = auto: min(3, detected cluster size); explicit count overrides). See Replication. |
--health-addr | 127.0.0.1:9091 | Listen address for /healthz, /readyz, and /metrics |
--log-level | info | Log level (debug, info, warn, error) |
--log-format | json | Log format (json, text) |
--api-docs | false | Serve Swagger UI and OpenAPI spec (unauthenticated) on the enrollment listener |
--gitfs-remotes | "" | Comma-separated Git remote URLs for state file sync. Passing an explicitly empty value (--gitfs-remotes "") disables GitFS even when the config file sets gitfs.remotes. |
--gitfs-interval | 5m | GitFS pull interval |
--gitfs-ssh-key | "" | Path to SSH private key for GitFS authentication |
--reactor | true | Enable the reactor engine (event-driven reactions) |
--reactor-dir | /data/reactor | Local directory holding reactor rule files |
--reactor-workers | 4 | Reactor render/execute worker pool size |
--reactor-max-chain-depth | 3 | Maximum reaction chain depth before events are dropped |
--reactor-enable-chaining | true | Allow reaction rules to emit derived events |
--reactor-default-throttle | 0 | Default per-(rule, source) refractory period |
--reactor-source-rate-limit | 120 | Per-source event rate limit (events/minute; 0 = unlimited) |
--reactor-max-event-age | 1h | Drop events older than this at consume time (0 = full replay) |
--reactor-storm-rate | 60 | Per-rule fires/minute that trips the circuit breaker (0 = no breaker) |
--reactor-breaker-cooldown | 5m | How long a tripped reaction circuit breaker stays open |
The master credentials file is loaded from <auth-dir>/master.creds and the account seed from <auth-dir>/account.seed. The default auth directory is /data/auth; override with --auth-dir or auth_dir in the config file.
Logging
The master emits structured slog logs with a configurable level and format:
- log_level / --log-level: debug, info, warn, or error (default info)
- log_format / --log-format: json or text (default json)
Invalid values abort startup with an error listing the valid options. Every log line carries the base attributes component=master and version=<build version>, plus master_id=<KSUID> once the instance ID is generated.
JSON is now the default log format
Earlier releases logged human-readable text by default. The default is now structured JSON, so a watchdog and its wrapped child emit uniformly parseable lines on the same stream. Set log_format: text (or --log-format text) to restore the previous behavior.
Health and Metrics Endpoints
The local listener on health_addr (default 127.0.0.1:9091) serves three endpoints:
| Endpoint | Purpose |
|---|---|
GET /healthz | Pure liveness — 200 whenever the process is up. Body: {"status":"ok","component":"master","version":"<build>"}. The watchdog's restart monitoring polls this; it never depends on NATS or any other subsystem. |
GET /readyz | Readiness — runs the subsystem checks below in parallel with a 2s per-check timeout. Returns 503 only when at least one check is down; degraded means working-but-impaired and stays 200. The JSON body carries per-check status, message, and latency. |
GET /metrics | Prometheus scrape endpoint (see Monitoring). |
Readiness checks:
| Check | down when | degraded when |
|---|---|---|
nats | NATS client not yet connected, or the connection is unhealthy | — |
enroll-server | The enrollment TLS listener failed to start or exited with an error | — |
sched-consumer | The schedule-results consumer is not running. A boot failure is retried every 60s in the background; the check flips to OK once a retry succeeds (log line: scheduled-result consumer started after retry). | — |
target-service | The target-resolution service failed to start. Non-fatal: CLI and peel targeting fall back to facts-KV scans. | — |
gitfs (registered only when GitFS is enabled) | The GitFS syncer exited | No successful sync yet — normal on a standby master that does not hold the publisher lease — or the last fully successful sync is older than 3× the pull interval |
reactor (registered only when the reactor is enabled) | The shared durable reactor consumer is not running. A boot failure is retried every 60s in the background. | The last rule load failed (running on last-known-good rules), or a rule's storm circuit breaker is open (the message lists the affected rules). See Reactor operations. |
/readyz reports 503 from process start until the NATS connection is established, so it is suitable for systemd/Kubernetes readiness gating and is what the watchdog polls during the update soak phase (--ready-url).
Multi-Master Coordination
Any number of masters can run against the same NATS server. Most subsystems run on every master and coordinate through NATS:
- Job dispatch and tracking: dispatch requests arrive via a queue group, job ownership is CAS-protected in KV, and the orphan scanner reclaims jobs from dead masters.
- schedule-results consumer: all masters share the durable consumer; whichever master receives a peel scheduler result persists it.
- Enrollment: every master serves the enrollment HTTPS API, the REST API, and the admin request/reply service (queue group zester-masters-admin).
- Target resolution: every master serves zester.target.resolve from an in-memory facts index via the shared queue group zester-target-resolvers, so requests are load-balanced across masters. When no master serves the subject, CLI and peel targeting automatically fall back to facts-KV scans.
- Rollout resume: every master scans the update-rollouts bucket every 60s and CAS-adopts rollouts whose driver heartbeat is stale (older than 60s), resuming them from the persisted batch. An in-flight rollout survives the death of the master driving it.
Two pieces of work are gated behind advisory leader leases — TTL'd entries in the leases KV bucket (15s TTL, renewed every 5s):
| Lease key | Leader-only work |
|---|---|
publisher | Settings-files publish, state-files publish, reactor-rules publish (independent of reactor.enabled), and the GitFS sync loop |
facts-secrets | Per-peel encrypted-secrets publication from the facts watcher |
In a single-master deployment the leases are acquired immediately after storage initialization, so startup behavior is unchanged. Standby masters log "publisher lease candidate started; standing by until acquired" and still load the settings files locally: the log line "settings files loaded for peel-side rendering" appears on every master, because every master must be able to encrypt secrets. Only the lease holder logs "published raw settings files for peel-side rendering" and "published state files". When a standby acquires the lease (the previous holder died or lost connectivity), it republishes the settings and state files and takes over the GitFS sync loop.
The leases are advisory: a brief (sub-TTL) window of dual ownership is tolerated by design — file publishes are idempotent KV puts and secrets publication is hash-gated, so double-publishing is harmless.
NATS TLS and CA Resolution
The master refuses to start with a plaintext NATS URL: bus.ValidateTLSNATSURLs rejects anything that is not tls:// before connecting. The same enforcement applies to peels and the watchdog.
The CA certificate used to verify the NATS server is resolved in this order:
- Explicit --nats-ca flag / nats_ca YAML field
- NATS_CA_FILE environment variable
- /data/auth/nats-ca.crt, if the file exists
- The host's system trust store
If your NATS server certificate is signed by a public or OS-trusted CA, no configuration is needed. For a private CA, either drop the CA cert at /data/auth/nats-ca.crt or point nats_ca at it.
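The resolution order can be expressed as a short fallback chain. The sketch below is illustrative only; resolveCA and its injected getenv/exists parameters are hypothetical, chosen so the order can be exercised without touching the real filesystem.

```go
package main

import (
	"fmt"
	"os"
)

// defaultCAPath is the well-known location named in the resolution order.
const defaultCAPath = "/data/auth/nats-ca.crt"

// resolveCA returns the CA file to use and where it came from. An empty
// path means "fall back to the host's system trust store".
func resolveCA(explicit string, getenv func(string) string, exists func(string) bool) (path, source string) {
	if explicit != "" {
		return explicit, "--nats-ca / nats_ca"
	}
	if v := getenv("NATS_CA_FILE"); v != "" {
		return v, "NATS_CA_FILE"
	}
	if exists(defaultCAPath) {
		return defaultCAPath, "default file"
	}
	return "", "system trust store"
}

// fileExists reports whether p can be stat'ed.
func fileExists(p string) bool {
	_, err := os.Stat(p)
	return err == nil
}

func main() {
	path, source := resolveCA("", os.Getenv, fileExists)
	fmt.Printf("ca=%q source=%s\n", path, source)
}
```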
ClientConfig Reference
The ClientConfig struct in pkg/bus/client.go controls the NATS connection:
| Field | Type | Default | Description |
|---|---|---|---|
URLs | []string | (required) | NATS server URL(s). Multiple URLs for cluster failover. |
Name | string | "zester-peel" | Client name used in NATS connection identification and logging. |
CredsFile | string | "" | Path to NATS credentials file (.creds) containing JWT and nkey seed. |
NKeySeedFile | string | "" | Path to an nkey seed file for authentication. |
MaxReconnects | int | -1 (unlimited) | Maximum reconnection attempts. Use -1 for unlimited. |
ReconnectWait | duration | 2s | Base wait between reconnection attempts. NATS adds jitter automatically. |
ReconnectBufSize | int (bytes) | 8388608 (8 MB) | Buffer size for messages published during reconnection. |
PingInterval | duration | 20s | Interval for NATS ping/pong health checks. |
MaxPingsOut | int | 3 | Outstanding pings before declaring unhealthy. |
DrainTimeout | duration | 30s | Time allowed for draining subscriptions during graceful shutdown. |
Full Example
nats_url: tls://nats-1.example.com:4222
nats_ca: /data/auth/nats-ca.crt
states_dir: /srv/zester/states
settings_dir: /srv/zester/settings
jetstream_replicas: 3
health_addr: "127.0.0.1:9091"
log_level: info
log_format: json
enroll:
  addr: ":8443"
  tls_cert: /data/auth/enroll.crt
  tls_key: /data/auth/enroll.key
api:
  docs_enabled: false
  tokens:
    - username: ci-system
      token_file: /data/auth/api-tokens/ci-system.token
gitfs:
  remotes:
    - git@github.com:org/base-states.git
    - git@github.com:org/app-states.git
  interval: 2m
  ssh_key: /data/auth/deploy.key

NATS Server Configuration
The NATS server is managed independently. For operator-mode JWT authentication with JetStream, a typical nats-server.conf looks like:
port: 4222
tls {
  cert_file: /etc/nats/tls/server.crt
  key_file: /etc/nats/tls/server.key
}
jetstream {
  store_dir: /var/lib/nats/jetstream
  max_mem: 2GB
  max_file: 50GB
}
operator: /etc/nats/operator.jwt
system_account: <system-account-public-key>
resolver: MEMORY
resolver_preload: {
  <account-public-key>: <account-jwt>
  <system-account-public-key>: <system-account-jwt>
}

TLS is required
Without TLS, NATS traffic (including JWTs and credentials) is transmitted in plaintext. Zester clients only accept tls:// URLs, so the NATS server must serve TLS. If the server certificate is signed by a private CA, distribute the CA cert to every node (see CA resolution order).
Runtime enforcement
The master rejects non-TLS NATS URLs at startup. Any nats:// URL will fail fast — use tls://.... The same enforcement applies on peels and the watchdog. Set nats_ca (or --nats-ca) only if your NATS CA is not resolvable via the CA resolution order.
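A scheme check of this kind needs only net/url. The function below is a hypothetical stand-in written for illustration; it is not the source of bus.ValidateTLSNATSURLs, whose exact signature and error messages are not shown in this document.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateTLSURLs rejects any URL whose scheme is not tls://, so plaintext
// nats:// endpoints fail before a connection is attempted.
func validateTLSURLs(urls []string) error {
	if len(urls) == 0 {
		return fmt.Errorf("no NATS URLs configured")
	}
	for _, raw := range urls {
		u, err := url.Parse(raw)
		if err != nil {
			return fmt.Errorf("invalid NATS URL %q: %w", raw, err)
		}
		if u.Scheme != "tls" {
			return fmt.Errorf("NATS URL %q must use the tls:// scheme", raw)
		}
		if u.Host == "" {
			return fmt.Errorf("NATS URL %q has no host", raw)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateTLSURLs([]string{"tls://nats-1.example.com:4222"}))
	fmt.Println(validateTLSURLs([]string{"nats://localhost:4222"}))
}
```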
JetStream Storage
JetStream storage is managed by the NATS server, not the master. The master initializes KV buckets and streams on startup via InitializeStorage().
JetStream is used for:
- Streams: Job event log for audit and replay (also carries peel scheduler results, consumed by the durable schedule-results consumer)
- Key-Value stores: Facts, settings, jobs, job returns, basket data, peel heartbeats, leader leases
- Object stores: Update binary distribution (update-binaries)
Replication
jetstream_replicas / --jetstream-replicas sets the replication factor the master applies to every KV bucket, the job-events stream, and the update-binaries Object Store when it initializes them. The default 0 means auto: the master detects the NATS cluster size from the connection's known servers and applies min(3, cluster size) — a single NATS server gets single-replica assets, while a 3+ node cluster automatically gets 3 replicas so job history, settings, enrollment records, and published binaries survive a node loss. An explicit count overrides auto-detection for all assets (counts above 3 buy little for KV workloads — 3 is the JetStream RAFT sweet spot); forcing a count below min(3, cluster size) logs a per-asset warning for the critical buckets, since a single node's disk loss would then permanently destroy them despite the cluster.
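The auto rule and the warning condition described above reduce to a few lines of arithmetic. The helper names below are hypothetical, and the sketch leaves out cluster-size detection, which the master derives from the NATS connection.

```go
package main

import "fmt"

// autoReplicas returns min(3, clusterSize), treating an unknown or zero
// cluster size as a single server.
func autoReplicas(clusterSize int) int {
	if clusterSize < 1 {
		return 1
	}
	if clusterSize > 3 {
		return 3
	}
	return clusterSize
}

// effectiveReplicas applies the configured count, where 0 means auto.
func effectiveReplicas(configured, clusterSize int) int {
	if configured > 0 {
		return configured
	}
	return autoReplicas(clusterSize)
}

// belowAuto reports whether an explicit count undercuts what auto would
// have chosen, which is the case that triggers the per-asset warning.
func belowAuto(configured, clusterSize int) bool {
	return configured > 0 && configured < autoReplicas(clusterSize)
}

func main() {
	for _, size := range []int{1, 2, 3, 5} {
		fmt.Printf("cluster=%d auto=%d\n", size, effectiveReplicas(0, size))
	}
	fmt.Println("explicit 1 on a 3-node cluster warns:", belowAuto(1, 3))
}
```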
Raising the factor on an existing deployment is attempted in place. If the replica change on an existing asset fails — an older NATS server, or a cluster without the capacity — startup does not abort: the asset is kept at 1 replica and a warning is logged with a manual migration hint, e.g.:
bucket kept 1 replica: replica change failed; migrate manually
wanted_replicas=3 hint="nats stream edit KV_jobs --replicas=3"

Follow the hint with the nats CLI to migrate the named asset by hand (KV buckets are backed by streams named KV_<bucket>; object stores by OBJ_<bucket>).
Storage Sizing
| Deployment Size | Recommended NATS Memory | Recommended NATS Disk |
|---|---|---|
| Small (< 100 peels) | 512 MB | 5 GB |
| Medium (100-1000 peels) | 2 GB | 20 GB |
| Large (1000+ peels) | 8 GB+ | 100 GB+ |
Use SSDs for NATS store_dir
JetStream storage benefits significantly from SSD/NVMe storage. Mechanical disks will work but may bottleneck fact collection and job dispatch under heavy load.
Startup Sequence
- Load config file (--config or /etc/zester/master.yaml), then apply CLI flag overrides.
- Start the local observability server on health_addr (default 127.0.0.1:9091), serving /healthz, /readyz, and /metrics.
- Validate NATS URLs (non-tls:// URLs abort startup) and build the client TLS config (CA resolution order above).
- Connect to the external NATS server using credentials (<auth-dir>/master.creds), retrying with backoff (up to 20 attempts).
- Initialize KV buckets and streams (InitializeStorage) and the update Object Store (InitializeObjectStores), each with short per-attempt timeouts and retries.
- Load account key for settings encryption (<auth-dir>/account.seed), initialize the peel-side rendering publisher, publish the master curve public key, and load + sanitize the raw settings files (log line: settings files loaded for peel-side rendering; this runs on every master).
- Initialize the state-files publisher and the reactor-files publisher (constructed regardless of reactor.enabled, because rule distribution is a lease concern, not an engine concern), and construct the GitFS syncer if gitfs.remotes is configured.
- Start the publisher leader lease: the holder publishes the settings files, state files, and reactor rule files to KV and runs the GitFS sync loop (see Multi-Master Coordination).
- Generate unique master instance ID (KSUID-based) and start the job manager.
- Start the shared durable schedule-results consumer on the job-events stream (persists peel scheduler return_job results as synthetic jobs); a boot failure is retried every 60s in the background.
- Subscribe to job dispatch requests via queue group (zester.masters).
- Subscribe to cancel wildcard (zester.job.*.cancel).
- Start master heartbeat (5s interval, 15s bucket TTL).
- Start orphan scanner to reclaim jobs from dead masters.
- Create the enrollment store (before the facts watcher, which uses it to mark peels active).
- Start the facts-secrets leader lease (its holder publishes per-peel encrypted secrets), then the facts watcher (enrollment issued → active transitions run on every master).
- Initialize the rest of the enrollment system (challenge store, credential issuer, HTTP handler).
- Register master REST API routes on the enrollment mux (only when api.docs_enabled or at least one api.tokens entry is set).
- Start the HTTPS server on enroll.addr.
- Start the enrollment admin request/reply service (queue group zester-masters-admin).
- Initialize the rollout controller, subscribe to rollout start/abort requests, and start the rollout resume loop (adopts rollouts orphaned by dead masters, 60s scan).
- Start the target-resolution service (queue group zester-target-resolvers) backed by an in-memory facts index.
- Start the reactor (when reactor.enabled): the rule loader over the reactor-files bucket, the shared durable reactor consumer on the events stream, the zester.reactor.test service, and the lag gauge; a boot failure is retried every 60s in the background.
- Start the connected-peels gauge (recounts the peel-heartbeat bucket every 15s into zester_master_connected_peels).
Shutdown
The master performs a graceful shutdown:
- Drains the NATS client connection (finishes pending messages).
- Closes the connection to NATS.
Message Encoding
All Zester messages over NATS use MessagePack encoding (not JSON). This applies to:
- Fact reports from peels
- Job dispatch and return values
- Settings distribution
- Event payloads
The bus.Encode and bus.Decode functions handle serialization transparently.