
Scaling

Zester scales from a handful of nodes to over 100,000 by leveraging NATS JetStream's built-in clustering, leaf nodes, and superclusters. This guide covers resource sizing, cluster topologies, and performance tuning.

Resource Sizing

Master Resources

| Scale | NATS Topology | CPU (Steady) | CPU (Burst) | RAM | Disk (JetStream) | Disk IOPS | Network |
|---|---|---|---|---|---|---|---|
| < 100 peels | Single NATS server | 0.1 core | 1 core | 512 MB | 10 GB SSD | 50 | 1 Mbps |
| 100--1,000 peels | Single NATS server | 0.5 core | 4 cores | 2 GB | 50 GB SSD | 500 | 10 Mbps |
| 1,000--10,000 peels | 3-node cluster | 4 CPU each | 8 CPU each | 8 GB each | 200 GB SSD each | 5,000 | 100 Mbps |
| 10,000--50,000 peels | 5-node cluster | 8 CPU each | 16 CPU each | 16 GB each | 500 GB SSD each | 20,000 | 500 Mbps |
| 50,000--100,000 peels | 5-node + leaf nodes | 16 CPU each | 32 CPU each | 32 GB each | 1 TB NVMe each | 50,000 | 1 Gbps |
| 100,000+ peels | Supercluster | 16+ CPU each | 32+ CPU each | 64 GB each | 2 TB NVMe each | 50,000+ | 1+ Gbps |

CPU burst occurs during mass state.apply or full fact sync. Steady-state load comes from heartbeats and incremental fact updates.
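As a rough illustration, the tier lookup in the sizing table can be sketched as a small Python helper. The function name `recommended_topology` is hypothetical and not part of Zester. Because the table's ranges share their boundary values (for example, 1,000 appears in two rows), this sketch assumes a boundary count belongs to the larger tier.

```python
from bisect import bisect_right

# Upper bounds of each tier, copied from the sizing table.
# Assumption: a count equal to a bound falls into the next (larger) tier.
_TIER_BOUNDS = [100, 1_000, 10_000, 50_000, 100_000]

_TIER_TOPOLOGY = [
    "Single NATS server",    # < 100 peels
    "Single NATS server",    # 100--1,000 peels
    "3-node cluster",        # 1,000--10,000 peels
    "5-node cluster",        # 10,000--50,000 peels
    "5-node + leaf nodes",   # 50,000--100,000 peels
    "Supercluster",          # 100,000+ peels
]


def recommended_topology(peel_count: int) -> str:
    """Return the NATS topology the sizing table lists for a fleet size."""
    if peel_count < 0:
        raise ValueError("peel_count must be non-negative")
    return _TIER_TOPOLOGY[bisect_right(_TIER_BOUNDS, peel_count)]


if __name__ == "__main__":
    for count in (50, 5_000, 25_000, 75_000, 250_000):
        print(f"{count:>7} peels -> {recommended_topology(count)}")
```

Treat the result as a starting point: the table's CPU, RAM, and disk columns still need to be checked against the actual workload.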

Use NVMe for 10k+ peels

JetStream performance is directly tied to disk I/O. At scale, NVMe storage significantly improves job dispatch latency and fact sync throughput.

Peel Resources

Peels are lightweight agents with minimal resource requirements:

| Resource | Idle | During State Apply |
|---|---|---|
| CPU | Negligible | Spikes depending on state modules |
| RAM | 20--50 MB | Varies by state complexity |
| Disk | Minimal (no local persistence) | Temporary during state operations |
| Network | ~100 B/s (heartbeat) | Bursts during fact sync and state.apply |

Network Bandwidth Estimates

| Operation | Per Peel | 1,000 Peels | 10,000 Peels | 100,000 Peels |
|---|---|---|---|---|
| Heartbeat (idle) | ~100 B/s | ~100 KB/s | ~1 MB/s | ~10 MB/s |
| Fact sync (5m interval) | ~2 KB/event | ~6.7 KB/s | ~67 KB/s | ~670 KB/s |
| State apply (typical) | ~10 KB/job | burst | burst | burst |
| Settings push | ~5 KB/peel | ~5 MB total | ~50 MB total | ~500 MB total |
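The aggregate figures follow from multiplying the per-peel inputs by the fleet size. The sketch below shows that arithmetic. The constant and function names are hypothetical, and it assumes decimal units (1 KB = 1,000 bytes) and fact syncs spread evenly across the 5-minute interval.

```python
# Per-peel inputs taken from the bandwidth table (decimal units assumed).
HEARTBEAT_BYTES_PER_SEC = 100     # ~100 B/s per idle peel
FACT_SYNC_BYTES = 2_000           # ~2 KB per fact sync event
FACT_SYNC_INTERVAL_SEC = 300      # one fact sync every 5 minutes
SETTINGS_PUSH_BYTES = 5_000       # ~5 KB per peel per settings push


def steady_state_bytes_per_sec(peels: int) -> float:
    """Average heartbeat plus fact sync traffic for the whole fleet."""
    heartbeat = peels * HEARTBEAT_BYTES_PER_SEC
    # Assumes syncs are evenly staggered rather than all firing at once.
    fact_sync = peels * FACT_SYNC_BYTES / FACT_SYNC_INTERVAL_SEC
    return heartbeat + fact_sync


def settings_push_total_bytes(peels: int) -> int:
    """Total bytes sent when settings are pushed to every peel once."""
    return peels * SETTINGS_PUSH_BYTES


if __name__ == "__main__":
    for peels in (1_000, 10_000, 100_000):
        rate_kb = steady_state_bytes_per_sec(peels) / 1_000
        push_mb = settings_push_total_bytes(peels) / 1_000_000
        print(f"{peels:>7} peels: ~{rate_kb:,.0f} KB/s steady, ~{push_mb:,.0f} MB per settings push")
```

State apply traffic is left out on purpose, since the table describes it only as a burst.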

NATS Cluster Topologies

Single Master (Development / Small Deployments)

┌─────────────┐     ┌─────────────┐
│   Master    │─────│    NATS     │ ◄── all peels connect here
│  (client)   │     │  (server)   │
└─────────────┘     └─────────────┘
  • Suitable for up to ~1,000 peels
  • No high availability — master failure means no new jobs
  • Peels survive master outage with cached state
  • Simplest to operate

3-Node Cluster (Production Standard)

┌──────────┐     ┌──────────┐     ┌──────────┐
│ Master-1 │─────│ Master-2 │─────│ Master-3 │
│  (RAFT)  │     │  (RAFT)  │     │  (RAFT)  │
└──────────┘     └──────────┘     └──────────┘
     ▲                ▲                ▲
     └── peels distribute connections ─┘
  • Tolerates 1 node failure
  • RAFT consensus for JetStream data
  • Automatic leader election (~2 seconds)
  • Recommended for 1,000--10,000 peels

5-Node Cluster (Large Scale)

  • Tolerates 2 simultaneous node failures
  • Better read distribution across nodes
  • Required for 10,000--50,000 peels
  • Same topology as 3-node, with two additional nodes
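The failure tolerance quoted for the 3-node and 5-node clusters follows from the majority quorum that RAFT requires. A minimal sketch of that arithmetic (the function names are illustrative only):

```python
def quorum_size(nodes: int) -> int:
    """Smallest majority of a cluster with the given number of nodes."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Nodes that can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(f"{n} nodes: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

A 4-node cluster tolerates the same single failure as a 3-node cluster, which is why cluster sizes are normally odd.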

Cluster + Leaf Nodes (Edge / DMZ)

┌────────── Core Cluster ──────────┐
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Master-1│ │Master-2│ │Master-3│ │
│ └────────┘ └────────┘ └────────┘ │
└──────────────────────────────────┘
         ▲            ▲
    ┌────┴────┐  ┌────┴────┐
    │  Leaf   │  │  Leaf   │
    │ (DMZ)   │  │ (Edge)  │
    └─────────┘  └─────────┘
      ▲   ▲        ▲   ▲
    peels         peels

Leaf nodes act as local proxies, similar to Salt's syndic but built into NATS:

  • Reduce latency for remote/edge peels by providing a local NATS endpoint
  • Cross firewall boundaries with a single outbound connection from leaf to core
  • Isolate traffic — local peel-to-peel communication stays on the leaf
  • Scale horizontally — each leaf handles its own connection pool

Configure a leaf node:

# On the leaf node server
nats:
  listen: "0.0.0.0:4222"
  leaf_node:
    remotes:
      - url: "tls://core-master-01:7422"
        credentials: "/etc/zester/leaf.creds"

Supercluster (Multi-Region / Global)

┌─── Region: US-East ─────┐     ┌─── Region: EU-West ─────┐
│  ┌──────┐  ┌──────┐     │     │    ┌──────┐  ┌──────┐   │
│  │ M-1  │──│ M-2  │     │     │    │ M-1  │──│ M-2  │   │
│  └──────┘  └──────┘     │     │    └──────┘  └──────┘   │
│       ▲                 │     │         ▲               │
│     peels               │     │       peels             │
└─────────┬───────────────┘     └─────────┬───────────────┘
          │       NATS Gateway            │
          └───────────────────────────────┘

NATS superclusters connect multiple independent clusters via gateways:

  • Each region operates independently if gateways fail
  • Cross-region job dispatch and fact sharing via gateway routing
  • JetStream data replication across regions (configurable)
  • Replaces Salt's complex syndic chains

Performance Tuning

Server Configuration Parameters

These parameters control the NATS server behavior (configured on the external NATS server):

| Parameter | Default | Tuning Guidance |
|---|---|---|
| JetStreamMaxMemory | 75% of RAM | Increase for large fact indexes. At 10k+ peels, allocate at least 4 GB. |
| JetStreamMaxStore | 75% of disk | Set explicitly to prevent JetStream from consuming all disk space. |
| JetStreamDomain | none | Set for multi-cluster isolation to prevent stream name collisions. |
| ReadyTimeout | 10 seconds | Increase to 30s if JetStream recovery takes longer after restart with large stores. |

Client Configuration Parameters

These parameters in ClientConfig control peel-to-master connections:

| Parameter | Default | Tuning Guidance |
|---|---|---|
| MaxReconnects | -1 (unlimited) | Keep at -1 for production. Peels should always retry. |
| ReconnectWait | 2 seconds | Base wait between reconnection attempts. NATS adds jitter automatically. Increase to 5s for very large fleets to reduce thundering herd. |
| ReconnectBufSize | 8 MB | Buffer for messages published during reconnection. Increase if peels generate high event volume during outages. |
| PingInterval | 20 seconds | NATS ping/pong health check interval. Reduce to 10s for faster disconnect detection. Increase to 30s for high-latency links. |
| MaxPingsOut | 3 | Outstanding pings before declaring unhealthy. At default PingInterval, disconnect detection takes up to 60s (3 x 20s). |
| DrainTimeout | 30 seconds | Time allowed for draining subscriptions during graceful shutdown. |
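The interaction between PingInterval and MaxPingsOut is a simple product, so the effect of a tuning change can be checked before rollout. The helper below is a hypothetical sketch that applies the "3 x 20s" rule from the table. It is not part of ClientConfig.

```python
def disconnect_detection_sec(ping_interval_sec: float = 20.0,
                             max_pings_out: int = 3) -> float:
    """Worst-case time to notice a dead connection, per the table's rule.

    Defaults mirror the documented ClientConfig defaults.
    """
    if ping_interval_sec <= 0 or max_pings_out < 1:
        raise ValueError("interval must be positive and max_pings_out >= 1")
    return ping_interval_sec * max_pings_out


if __name__ == "__main__":
    # Compare the default with the two tuning suggestions from the table.
    for interval in (10.0, 20.0, 30.0):
        print(f"PingInterval={interval:.0f}s -> up to {disconnect_detection_sec(interval):.0f}s")
```

Reducing PingInterval to 10s halves the detection window, at the cost of twice as many pings across the fleet.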

NATS Tuning for Scale

For deployments above 10,000 peels, tune the NATS server configuration:

nats:
  max_connections: 20000
  max_payload: "8MB"
  write_deadline: "10s"
  jetstream:
    max_file_store: "500GB"
    max_mem_store: "4GB"
    limits:
      max_ack_pending: 50000
      duplicate_window: "600s"

| Parameter | Default | Description |
|---|---|---|
| max_connections | 65536 | Maximum concurrent client connections. Set to 2x expected peel count. |
| max_payload | 1 MB | Maximum message payload size. Increase to 8 MB for large state applies or fact sets. |
| write_deadline | 2s | Maximum time to write to a client before dropping. Increase for slow links. |
| max_ack_pending | 65536 | Maximum unacknowledged messages per consumer. Increase for high-throughput job dispatch. |
| duplicate_window | 2m | Window for detecting duplicate messages. Increase to 10m for high-latency environments. |
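The "2x expected peel count" rule for max_connections is easy to apply programmatically when generating configuration. The sketch below is a hypothetical helper: it reuses only the keys and values shown in the example above, and the function name is an assumption.

```python
def nats_scale_settings(expected_peels: int) -> dict:
    """Build the connection-related settings for a given fleet size.

    max_connections follows the 2x rule from the table; the other
    values are the ones used in the example configuration.
    """
    if expected_peels < 0:
        raise ValueError("expected_peels must be non-negative")
    return {
        "max_connections": 2 * expected_peels,
        "max_payload": "8MB",
        "write_deadline": "10s",
    }


if __name__ == "__main__":
    # 10,000 peels yields the max_connections value from the example.
    print(nats_scale_settings(10_000))
```

The doubling leaves headroom for reconnect overlap, where old and new connections from the same peel briefly coexist.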

KV Bucket Tuning

Default bucket configurations from the codebase:

| Bucket | History | TTL | Replicas | Tuning Notes |
|---|---|---|---|---|
| facts | 5 | None | 1 | Increase replicas to 3 in clustered deployments. History of 5 enables fact change tracking. |
| settings-files | 3 | None | 1 | Increase replicas to 3 in clusters. Stores sanitized .zy templates for peel-side rendering. |
| secrets | 3 | None | 1 | Increase replicas to 3 in clusters. Per-peel encrypted values. |
| basket | 1 | None | 1 | Consider adding TTL if basket data grows unbounded. |
| jobs | 10 | 7 days | 1 | Increase replicas to 3 in clusters. Reduce TTL to 3d if storage is constrained. |
| job-returns | 1 | 7 days | 1 | Increase replicas to 3 in clusters. Most storage-heavy bucket at scale. |
| master-heartbeat | 1 | 15s | 1 | Increase replicas to 3 in clusters. Low storage, auto-expires. |
| enrollments | 10 | None | 1 | Increase replicas to 3 in clusters. Enrollment records. |
| enroll-challenges | 1 | 5 min | 1 | Memory storage. Ephemeral challenge nonces. |
| state-files | 3 | None | 1 | State file distribution from master to peels. |
| update-manifests | 5 | None | 1 | Self-update binary manifests. |
| update-status | 1 | 60s | 1 | Watchdog update status heartbeats. |
| update-rollouts | 10 | None | 1 | Fleet rollout state tracking. |

To update bucket replicas for a clustered deployment:

# Each KV bucket is backed by a stream prefixed with KV_
nats stream edit KV_facts --replicas 3
nats stream edit KV_jobs --replicas 3
nats stream edit KV_job-returns --replicas 3
# job-events is a plain stream, not a KV bucket, so it has no KV_ prefix
nats stream edit job-events --replicas 3
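When many buckets need the same change, the commands can be generated rather than typed by hand. The following sketch only builds the command strings in the form shown above and does not run them. The function name and its parameters are hypothetical.

```python
import shlex
from typing import Iterable, List


def replica_commands(buckets: Iterable[str],
                     extra_streams: Iterable[str] = (),
                     replicas: int = 3) -> List[str]:
    """Build `nats stream edit` commands for KV buckets and plain streams.

    KV bucket names get the KV_ stream prefix; names in extra_streams
    are used as-is.
    """
    names = [f"KV_{bucket}" for bucket in buckets] + list(extra_streams)
    return [
        f"nats stream edit {shlex.quote(name)} --replicas {replicas}"
        for name in names
    ]


if __name__ == "__main__":
    for cmd in replica_commands(["facts", "jobs", "job-returns"], ["job-events"]):
        print(cmd)
```

Review the generated list before running it, and apply the changes one stream at a time so replication can settle between edits.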

Operating System Tuning

For masters handling 10,000+ connections:

# /etc/sysctl.d/99-zester.conf
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1

# File descriptor limits (also set in systemd unit)
fs.file-max = 2097152

# Apply without reboot (run as root from a shell; not part of the conf file)
sysctl --system
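Each client connection consumes one file descriptor, so it is worth confirming that the per-process limit actually covers the expected connection count. The sketch below uses Python's standard `resource` module (Unix only). The function name and the `reserve` of 1,024 descriptors for files and internal sockets are assumptions for illustration.

```python
import resource
from typing import Optional


def can_hold_connections(expected_connections: int,
                         soft_limit: Optional[int] = None,
                         reserve: int = 1024) -> bool:
    """Check whether the open-file limit covers the expected connections.

    soft_limit defaults to this process's RLIMIT_NOFILE soft limit.
    reserve is a hypothetical allowance for non-connection descriptors.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= expected_connections + reserve


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft={soft} hard={hard}")
    print("10,000 connections fit:", can_hold_connections(10_000))
```

Run the check under the same user and service manager as the server process, since limits set in a systemd unit do not apply to an interactive shell.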

Scaling Checklist

When scaling to a new tier, work through this checklist:

Moving to 1,000+ Peels

  • Deploy 3-node NATS cluster for high availability
  • Set KV bucket replicas to 3
  • Configure Prometheus monitoring and alerting
  • Set up automated JetStream backups
  • Increase LimitNOFILE in systemd unit to 65536

Moving to 10,000+ Peels

  • Deploy 5-node NATS cluster
  • Switch to NVMe storage for JetStream
  • Tune max_connections, max_payload, and write_deadline
  • Apply OS-level tuning (sysctl, file descriptors)
  • Use Prometheus service discovery for peel scraping
  • Consider leaf nodes for remote sites

Moving to 50,000+ Peels

  • Deploy leaf nodes for edge and DMZ networks
  • Increase ReconnectWait on peels to 5s to reduce thundering herd
  • Increase JetStream memory allocation to 16+ GB
  • Monitor and tune max_ack_pending for job dispatch throughput
  • Consider target slicing/run partitioning to limit concurrent fan-out

Moving to 100,000+ Peels

  • Deploy NATS supercluster for multi-region
  • Configure JetStream domain isolation per region
  • Increase ReconnectBufSize on peels for longer outage tolerance
  • Implement graduated rollout for mass state applies (partitioned target sets)
  • Dedicated monitoring infrastructure for Zester metrics
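The graduated rollout and target partitioning items can be approached by splitting the target set into fixed-size batches and applying state to one batch at a time. This is a generic sketch of that idea, not Zester's own rollout mechanism. The function name and batch size are hypothetical.

```python
from typing import Iterator, List, Sequence


def partition_targets(peel_ids: Sequence[str],
                      batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of peel IDs, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(peel_ids), batch_size):
        yield list(peel_ids[start:start + batch_size])


if __name__ == "__main__":
    fleet = [f"peel-{i:05d}" for i in range(10)]
    for number, batch in enumerate(partition_targets(fleet, batch_size=4), start=1):
        # A real rollout would dispatch the job here and wait for the
        # batch to report back before moving on.
        print(f"batch {number}: {batch}")
```

Limiting the batch size caps concurrent fan-out, which keeps the CPU and network bursts from a mass state.apply within what the masters were sized for.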
