Overview

This section covers everything you need to run Zester in production, from initial deployment through day-two operations, monitoring, scaling, and incident response.

Zester's operational model is simple by design. The master connects to an external NATS server as a client — there is no external database beyond NATS, and no additional runtime dependencies. A single static binary per role is all you deploy, plus a NATS server.

Component	Binary	Deploys To	Persistent State
Master	`zester-master`	Dedicated server(s)	None (stateless client)
NATS	`nats-server`	Dedicated server(s)	JetStream data (`/var/lib/nats/jetstream/`)
Peel	`zester-peel`	Every managed host	Local working state under `/data` (credentials, state-file cache, settings snapshot, job dedup)
CLI	`zester`	Operator workstations	None

NATS JetStream holds all authoritative state — KV buckets for facts, settings, jobs, and returns, plus streams for the job event log. A peel keeps only its credentials and a small set of locally rebuilt working files (the state-file cache, a last-known-good settings snapshot, and a job-dedup record) that let it boot and enforce offline; everything except the credentials is repopulated from NATS, so peels can be replaced or upgraded without data loss.
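The offline-first behavior described above can be illustrated with a minimal sketch. It is not Zester's actual implementation: the function name `load_settings`, the `fetch_remote` callable (standing in for a read of the settings KV bucket), and the snapshot file name are all hypothetical. It only shows the general pattern of preferring the control plane and falling back to a last-known-good snapshot.

```python
import json
from pathlib import Path
from typing import Callable


def load_settings(fetch_remote: Callable[[], dict], snapshot_path: Path) -> tuple[dict, str]:
    """Return (settings, source), preferring the control plane over the snapshot.

    fetch_remote is a hypothetical callable that reads settings from NATS and
    raises ConnectionError or TimeoutError when the control plane is unreachable.
    """
    try:
        settings = fetch_remote()
    except (ConnectionError, TimeoutError):
        # Control plane unreachable: fall back to the last-known-good snapshot.
        if not snapshot_path.exists():
            raise  # nothing cached yet, so there is nothing to boot from
        return json.loads(snapshot_path.read_text()), "snapshot"

    # Refresh the snapshot via a temp file and rename, so a crash mid-write
    # never leaves a half-written snapshot behind.
    tmp = snapshot_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(settings))
    tmp.replace(snapshot_path)
    return settings, "nats"
```

Because the snapshot is rewritten on every successful fetch, deleting it (or replacing the host) costs nothing: the next successful connection rebuilds it.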

Operational Principles

Peels are cattle, not pets. Beyond their credentials, they carry only locally rebuilt caches. Replace them freely.
NATS JetStream is the single source of truth. The master is a stateless client; all persistent data lives in JetStream KV buckets and streams. Back it up.
NATS handles the hard parts. Clustering, replication, failover, message delivery guarantees — let NATS do what it does best.
Graceful degradation is built in. Peels survive master and NATS outages: they boot offline-first from the state-file cache and the settings snapshot, keep enforcing on schedule, retry connections, and resume seamlessly when the control plane returns.
Observe everything. Every component produces structured JSON logs. Use NATS server monitoring endpoints and log aggregation for operational visibility.
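As a rough illustration of working with structured JSON logs, the sketch below tallies log records by severity. The field name `level` is an assumption, since the log schema is not specified here; pass a different `level_key` if the components use another name.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_level(lines: Iterable[str], level_key: str = "level") -> Counter:
    """Count log records per severity, tolerating lines that are not JSON objects."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["unparsed"] += 1
            continue
        if not isinstance(record, dict):
            counts["unparsed"] += 1  # valid JSON, but not a log record
            continue
        # Normalize case so "ERROR" and "error" land in the same bucket.
        counts[str(record.get(level_key, "unknown")).lower()] += 1
    return counts
```

A tally like this is only a quick triage aid; for ongoing visibility, ship the same logs to a log aggregator and pair them with the NATS server monitoring endpoints.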

Sections

Deployment

Monitoring

Backup & Recovery

Scaling

Troubleshooting

Enrollment

SRE Runbook
