High Availability

Symphony can be deployed as a shared-nothing multi-instance cluster for fault tolerance and high availability. Each instance runs independently with its own configuration directory, storage, and embedded messaging server; there is no shared filesystem or database. The instances form a cluster that automatically replicates all state (user accounts, roles, extension registrations, key-value stores, and extension data) across nodes.

How It Works

A Symphony cluster consists of three or more instances that share the same cryptographic identity (operator and account JWTs, seeds). Each instance runs its own embedded messaging server, and these servers form a cluster using dedicated routing ports. Symphony automatically replicates data across cluster members, providing redundancy and fault tolerance. Because each node holds a complete copy of the replicated state, any node can serve any request independently.

If one instance becomes unavailable, the remaining instances continue to serve requests. Users connected to the failed instance can reconnect to any other instance in the cluster. When the failed instance recovers, it rejoins the cluster and synchronises its state automatically.

Each node in the cluster communicates with every other node via dedicated routing ports. Extensions connect to any node and automatically fail over to a remaining node if their connection is lost. Users access the cluster through a load balancer that distributes requests across all healthy nodes.

Prerequisites

  • A working single-instance Symphony deployment (the "primary" instance)
  • Three or more servers (physical or virtual) with network connectivity between them
  • Each server must be able to reach the other servers on its cluster routing port
  • A load balancer or DNS round-robin for distributing client connections
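The routing-port requirement can be checked with a plain TCP connection attempt. The following is a minimal sketch, assuming Python 3 is available on each server; can_reach is a hypothetical helper and not part of Symphony. Run it once the peers are started with their cluster configuration, because before that nothing listens on the routing port and the check returns False even on an open network.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def report(peers):
    """Return one status line per (host, port) pair."""
    lines = []
    for host, port in peers:
        status = "reachable" if can_reach(host, port) else "NOT reachable"
        lines.append(f"{host}:{port} {status}")
    return lines
```

For example, on node1 you might call report([("node2.example.com", 6223), ("node3.example.com", 6224)]) and print the result; the hostnames and ports here are the example values used in this guide.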

Cluster Architecture

Each node in a cluster needs its own ports for NATS client connections, WebSocket connections, cluster routing, and the UI. You can choose any available ports. The following table shows an example layout for a three-node cluster:

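| Node  | NATS Port | WebSocket Port | Cluster Port | UI Port |
|-------|-----------|----------------|--------------|---------|
| node1 | 4222      | 9222           | 6222         | 8080    |
| node2 | 4333      | 9333           | 6223         | 8282    |
| node3 | 4444      | 9444           | 6224         | 8383    |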

When all instances run on the same host (for development or testing), each must use different ports. When deployed on separate servers, you can use the same ports on each server.
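When every instance shares one host, a duplicated port is an easy mistake to make. The sketch below checks a planned layout for ports that are claimed more than once. It uses the example ports from this section; LAYOUT and port_conflicts are illustrative names, not Symphony settings.

```python
from collections import Counter

# Example layout from this section. The dictionary keys are illustrative labels.
LAYOUT = {
    "node1": {"nats": 4222, "websocket": 9222, "cluster": 6222, "ui": 8080},
    "node2": {"nats": 4333, "websocket": 9333, "cluster": 6223, "ui": 8282},
    "node3": {"nats": 4444, "websocket": 9444, "cluster": 6224, "ui": 8383},
}


def port_conflicts(layout):
    """Return the sorted list of ports claimed more than once across all nodes."""
    counts = Counter(port for ports in layout.values() for port in ports.values())
    return sorted(port for port, seen in counts.items() if seen > 1)
```

An empty result means the layout is safe for a single host. On separate servers, reusing the same ports on each server is fine, so this check only applies to same-host setups.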

Setting Up a Cluster

warning

Do not run the setup wizard on each node independently. The wizard generates unique cryptographic identities (operator keys, account JWTs, signing seeds, storage salt). If each node generates its own identities, the nodes will have different trust roots and the cluster cannot replicate state or authenticate accounts across nodes. Always configure one node first, then copy its configuration to the others.

1. Set up the primary instance

Install and configure the first Symphony instance normally using any installation method. Complete the setup wizard to generate the identity keys and configuration files. The instance will start automatically in single-node mode after the wizard completes—this is expected.

2. Copy configuration to additional nodes

Copy the configuration files from the primary instance to each additional node. The two files that must be identical across all nodes are:

  • symphony.config—shared identity (operator JWT and seeds, Symphony JWT and seeds, signing seeds, OIDC settings, storage salt)
  • nats.config—operator JWT, system account JWT, Symphony account JWT, and resolver preload

These files contain the cluster's cryptographic identity. Every field in symphony.config must match across nodes except network.uiport. In nats.config, only networking fields (host, port, websocket.port, logfile, store_dir) and the cluster block differ per node.
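The "everything matches except network.uiport" rule for symphony.config can be checked mechanically. The sketch below assumes symphony.config is a JSON document, as the excerpts in step 3 suggest. The helper names are hypothetical and nothing here is a Symphony tool.

```python
import json

# Dotted paths that are expected to differ per node (from the rule above).
ALLOWED_TO_DIFFER = {"network.uiport"}
_MISSING = object()


def flatten(obj, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} form."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_drift(primary, other):
    """Return dotted paths whose values differ, ignoring per-node fields."""
    a, b = flatten(primary), flatten(other)
    return sorted(
        path
        for path in a.keys() | b.keys()
        if path not in ALLOWED_TO_DIFFER
        and a.get(path, _MISSING) != b.get(path, _MISSING)
    )


def load(path):
    """Read a JSON config file from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

With copies of both files at hand, config_drift(load("node1/symphony.config"), load("node2/symphony.config")) should return an empty list; the two paths are placeholders. The same approach does not apply to nats.config, which is not JSON.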

3. Configure symphony.config on each node

Edit symphony.config on each additional node to set a unique UI port:

Node 2:

{
  "network": {
    "uiport": 8282
  }
}

Node 3:

{
  "network": {
    "uiport": 8383
  }
}

4. Configure nats.config for clustering

Add a server_name and cluster block to each node's nats.config. Each node must have a unique server_name, a unique cluster.listen port, and route entries pointing to every other node's cluster port.

Node 1 (primary):

server_name: node1

cluster {
  name: symphony-cluster
  listen: 0.0.0.0:6222
  routes: [
    nats-route://node2.example.com:6223,
    nats-route://node3.example.com:6224
  ]
}

Node 2:

server_name: node2

cluster {
  name: symphony-cluster
  listen: 0.0.0.0:6223
  routes: [
    nats-route://node1.example.com:6222,
    nats-route://node3.example.com:6224
  ]
}

Node 3:

server_name: node3

cluster {
  name: symphony-cluster
  listen: 0.0.0.0:6224
  routes: [
    nats-route://node1.example.com:6222,
    nats-route://node2.example.com:6223
  ]
}

tip

When NATS TLS is enabled on the server, cluster routes are automatically encrypted using the same certificate. No additional cluster TLS configuration is needed.

The cluster.name must be identical on all nodes. Each node's route list should include all other nodes in the cluster—routing is automatically managed once the connections are established.
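Because each node's block differs only in its name, listen port, and the list of peers, the blocks can be generated from one node inventory to avoid copy-and-paste errors. This is a sketch only: NODES and cluster_block are hypothetical names, and the rendered text mirrors the examples in this step.

```python
# Node inventory: server_name -> (hostname, cluster routing port).
NODES = {
    "node1": ("node1.example.com", 6222),
    "node2": ("node2.example.com", 6223),
    "node3": ("node3.example.com", 6224),
}


def cluster_block(name, nodes, cluster_name="symphony-cluster"):
    """Render server_name plus the cluster block for one node."""
    _, listen_port = nodes[name]
    # A node's routes point at every other node, never at itself.
    routes = [
        f"    nats-route://{host}:{port}"
        for peer, (host, port) in nodes.items()
        if peer != name
    ]
    return "\n".join([
        f"server_name: {name}",
        "",
        "cluster {",
        f"  name: {cluster_name}",
        f"  listen: 0.0.0.0:{listen_port}",
        "  routes: [",
        ",\n".join(routes),
        "  ]",
        "}",
    ])
```

Calling print(cluster_block("node2", NODES)) yields the Node 2 fragment, which can then be merged into that node's nats.config by hand.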

If running all nodes on the same host, also update each node's NATS client port and WebSocket port to unique values:

port: 4333
websocket {
  port: 9333
  no_tls: true
}

5. Start the cluster

Restart the primary instance (to pick up the cluster configuration added in step 4) and start the additional instances. The order does not matter—cluster routing handles nodes starting independently. However, the internal data store requires a majority quorum before it can accept operations (e.g. 2 of 3 nodes, or 3 of 5). Nodes that start before quorum is reached will retry automatically with increasing backoff for several minutes while waiting for peers to become available.

For the smoothest startup, start all nodes within a short window so that quorum is established quickly. If nodes start minutes apart, the early nodes will log transient errors such as "JetStream system temporarily unavailable" or "context deadline exceeded" while waiting for peers—this is normal and resolves once quorum is reached.

info

This guide includes docker compose commands. Refer to your deployment documentation for environment-specific prerequisites.

# On each node (Linux)
sudo systemctl restart symphony

# Or with Docker Compose
docker compose restart symphony

Check the logs on each node to verify cluster formation:

# Linux
sudo journalctl -u symphony -f

# Docker
docker compose logs -f symphony

Look for messages indicating successful route connections and startup:

Cluster ready  clusterName=symphony-cluster  routes=2
Symphony ready url=nats://10.0.0.1:4222
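If you collect logs centrally, the check can be scripted. The sketch below looks for the "Cluster ready" line shown above and compares its routes value with the number of peers. It assumes that routes=N counts connected peers, so a healthy three-node cluster reports routes=2 as in the sample. That reading is an assumption drawn from the sample output and not a documented contract, and the function names are hypothetical.

```python
import re

# Matches the routes=N field on a "Cluster ready" log line.
ROUTES_RE = re.compile(r"Cluster ready\b.*?\broutes=(\d+)")


def routes_reported(log_text):
    """Return the routes count from the last 'Cluster ready' line, or None."""
    matches = ROUTES_RE.findall(log_text)
    return int(matches[-1]) if matches else None


def cluster_formed(log_text, cluster_size):
    """True when the node reports a route to every other node."""
    return routes_reported(log_text) == cluster_size - 1
```

The log text can come from journalctl or docker compose logs output saved to a file.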

Account Resolver Synchronisation

Each node periodically compares its account state with other nodes in the cluster. The default synchronisation interval is 2 minutes, configured in nats.config:

resolver {
  type: full
  interval: "2m"
  timeout: "1.9s"
}

This means that when a new user account or extension is registered on one node, it may take up to 2 minutes for the registration to propagate to all other nodes. During this window, the new user or extension may only be accessible on the node where it was created.

Load Balancing

Place a load balancer (nginx, HAProxy, or a cloud load balancer) in front of the cluster to distribute client connections across nodes.

HTTP/HTTPS Traffic

Configure the load balancer to distribute HTTP requests across all nodes' UI ports. Use health checks on each node's HTTP port to detect failed instances:

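upstream symphony_http {
    # Open-source nginx applies passive health checks by default (max_fails, fail_timeout).
    server node1.example.com:8080;
    server node2.example.com:8282;
    server node3.example.com:8383;
}

server {
    listen 443 ssl;
    server_name symphony.example.com;

    # Placeholder certificate paths; nginx requires both directives when "listen ... ssl" is used.
    ssl_certificate     /etc/nginx/certs/symphony.example.com.crt;
    ssl_certificate_key /etc/nginx/certs/symphony.example.com.key;

    location / {
        proxy_pass http://symphony_http;
    }

    # The symphony_ws upstream is defined under WebSocket Traffic below.
    location /ws {
        proxy_pass http://symphony_ws;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}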

WebSocket Traffic

WebSocket connections must use sticky sessions (session affinity) to ensure that each browser session remains connected to the same node for the duration of the WebSocket connection:

upstream symphony_ws {
    ip_hash;
    server node1.example.com:9222;
    server node2.example.com:9333;
    server node3.example.com:9444;
}

NATS Client Traffic

If extensions or external NATS clients connect to the cluster, configure TCP passthrough load balancing for the NATS client ports:

# The stream block belongs at the top level of nginx.conf, outside the http block.
stream {
    upstream nats_backend {
        server node1.example.com:4222;
        server node2.example.com:4333;
        server node3.example.com:4444;
    }

    server {
        listen 4222;
        proxy_pass nats_backend;
    }
}

Cluster Sizing

  • 3 nodes—minimum recommended for production. Tolerates the loss of one node while maintaining quorum for data replication.
  • 5 nodes—tolerates the loss of two nodes. Use this for environments that require higher availability.
  • Even numbers—avoid clusters of 2 or 4 nodes. Symphony requires a majority quorum for writes, so an even number of nodes provides no additional fault tolerance over the next-lower odd number.
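The sizing guidance above follows from majority-quorum arithmetic. A short sketch, using only the majority rule stated in this guide (the function names are illustrative):

```python
def quorum(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be lost while a majority remains."""
    return nodes - quorum(nodes)


def sizing_table(max_nodes: int = 5):
    """Return (nodes, quorum, tolerated failures) rows for 1..max_nodes."""
    return [(n, quorum(n), tolerated_failures(n)) for n in range(1, max_nodes + 1)]
```

The rows for 3 and 4 nodes both tolerate one failure, and the rows for 1 and 2 nodes both tolerate none, which is why even cluster sizes add cost without adding fault tolerance.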

What Is Replicated

The following state is automatically replicated across all cluster nodes. Every item is stored inside each node's Symphony data directory (the storage location configured in symphony.config), so identifying replication problems usually starts with comparing the listed paths across nodes.

| Data | Replication mechanism | Where to find it on disk | How to verify it |
|------|-----------------------|--------------------------|------------------|
| User accounts and profiles (account JWTs) | NATS account resolver (file-based, broadcast between nodes) | accounts/<account-pubkey>.jwt (one file per user account, same filename and timestamp on every node) | ls -l accounts/ on each node; filenames and sizes should match |
| Role definitions, role assignments, extension shares, licenses, usage, business units, enforcement state | JetStream KV in the Symphony system account | storage/jetstream/<symphony-account-pubkey>/streams/KV_symphony_roles/, .../KV_symphony_role_assignments/, .../KV_symphony_shares/, .../KV_symphony_licenses/, .../KV_symphony_usage/, .../KV_symphony_business_units/, .../KV_symphony_enforcement/ | nats stream info KV_symphony_<name>; confirm that the Replicas value matches the replica factor for your cluster size (described after this table) and that the full peer list appears under Cluster → Replicas |
| Per-user KV mirrors (user's view of the Symphony state above) | JetStream KV mirror (memory-backed, re-sourced from the Symphony source on cold start) | storage/jetstream/<user-account-pubkey>/streams/KV_symphony_<name>/ under each user's account directory | Connect as that user and run nats stream info KV_symphony_<name> |
| Extension data (extension-declared KV buckets, combined registry/operations streams) | JetStream KV / streams | storage/jetstream/<user-account-pubkey>/streams/KV_<bucket-name>/ | nats stream info KV_<bucket>; compare Replicas, leader, and peer list |

Symphony uses replica factor R=3 on 3- and 4-node clusters and R=5 on 5+ node clusters. Standalone deployments (1 node) and 2-node deployments use R=1 — two nodes cannot form a majority quorum, so a 2-node cluster has the same durability as a single node. Extensions that create JetStream state at runtime can discover the current replica count via the NATS service cirata.services.cluster.info; the Go SDK wraps this in extension.CreateOrUpdateKeyValue, which fills the replica count in automatically when the caller leaves it unset.
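The replica-factor rule in the paragraph above can be written as a small function, which is handy when checking whether a stream's reported Replicas value is what you expect. This is an illustration of the stated rule only. replica_factor is a hypothetical helper, and extensions should still rely on the cluster info service or the SDK instead of hard-coding it.

```python
def replica_factor(cluster_size: int) -> int:
    """Replica factor for a given node count, per the rule described above."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if cluster_size >= 5:
        return 5
    if cluster_size >= 3:
        return 3
    # One or two nodes cannot form a useful majority, so a single replica is used.
    return 1
```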

State held by Symphony on behalf of extensions — key-value bucket content and streams — is replicated across the cluster. In a cluster of 3 or more nodes, this state survives the loss of any individual node, because writes are committed to a majority (2 of 3) before they are acknowledged and the surviving majority can continue to serve and accept writes while the failed node is recovering.

In a standalone deployment (a single node, no cluster routes) the same buckets are created at replication factor 1 because there are no peers to replicate to.

To list every replicated stream on a node and see its current replica count at a glance:

nats stream ls --json | jq '.[] | {
  name: .config.name,
  replicas: .config.num_replicas,
  leader: .cluster.leader,
  peers: .cluster.replicas
}'
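The same JSON can be filtered for streams that sit below the expected replica count. The sketch assumes the command's output is a JSON array whose items carry the config.name and config.num_replicas fields that the jq filter above reads; under_replicated is a hypothetical helper.

```python
import json


def under_replicated(streams_json: str, expected: int):
    """Return (name, replicas) for streams below the expected replica count."""
    streams = json.loads(streams_json)
    return [
        (stream["config"]["name"], stream["config"]["num_replicas"])
        for stream in streams
        if stream["config"]["num_replicas"] < expected
    ]
```

One way to use it is to save the output of nats stream ls --json to a file, read the file into a string, and pass it in with the replica factor for your cluster size. An empty list means every stream is at or above that count.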

The following items are configured or held per node. Extension bundles and extension processes are listed here for completeness; their notes describe how they behave across the cluster:

| Data | Notes |
|------|-------|
| symphony.config | Must be manually copied and maintained |
| nats.config | Must be manually configured per node |
| Log files | Each node writes its own logs |
| CDN cache | In-memory per node; rebuilt automatically from CDN on cache miss. Only populated when dependency resolution mode is Proxy or Mixed. |
| Extension bundles | Stored in JetStream object storage and replicated across the cluster like other JetStream data. Active version pointer is replicated via the bundle manifest KV. |
| Extension processes | Extensions connect to any node and automatically fail over to remaining nodes if their initial node becomes unavailable |

Removing a Node

To remove a node from the cluster:

  1. Stop the Symphony instance on the node being removed.
  2. Remove the node's route entry from the nats.config on all remaining nodes.
  3. Restart the remaining instances to apply the route changes (or send a reload signal).

The cluster will continue operating with the remaining nodes. Symphony will re-replicate any data that was held on the removed node.

Troubleshooting

  • Cluster not forming—verify that the cluster routing ports are reachable between all nodes. Check firewall rules and security groups. The cluster.name must be identical on all nodes.
  • "JetStream system temporarily unavailable" or "context deadline exceeded" at startup—this means the data store has not yet reached quorum. Ensure that a majority of nodes are running and can reach each other on the cluster routing ports. The node will retry automatically; once quorum is established the errors will stop. If the errors persist, check that all nodes have completed initial configuration (a node still showing the setup wizard does not participate in quorum).
  • Split brain / data inconsistency—ensure all nodes share the same operator JWT, system account, and Symphony account JWTs. If each node ran the setup wizard independently, they will have different cryptographic identities and cannot replicate state. The fix is to copy symphony.config and nats.config from a single primary node to all others (adjusting only per-node networking and the cluster block). Compare the operator, system_account, and resolver_preload values in nats.config across nodes—they must be identical.
  • Node won't rejoin after restart—check the NATS log for connection errors. Verify that the node's server_name is unique and that its routes point to the correct hostnames and ports.
  • Stale data after failover—the account resolver synchronisation interval (default 2 minutes) means recently created accounts or registrations may not be immediately available on all nodes. Reduce the interval value in the resolver block if faster convergence is required.
  • Replication warnings—if the log shows replication errors, ensure that jetstream is enabled in nats.config on all nodes and that each node has sufficient storage configured (max_mem and max_file in the jetstream block).
  • jetstream not enabled for account (err_code 10039)—the local node's account resolver has not yet received the affected user's account JWT (the one that grants the JetStream limits). This is usually transient and resolves as the resolver propagates; the operation will succeed on retry. Check that accounts/<account-pubkey>.jwt exists and has the same size on every node, and consider reducing the resolver interval in nats.config if propagation is slow.
  • failed to create KV … context deadline exceeded—a JetStream create operation timed out waiting for a quorum to form. On a fresh cluster, verify every node reports its peers via nats server info. If peers are connected but creation still fails, the affected stream's source buckets may not yet be available on this node; retrying after the cluster fully converges typically succeeds. Symphony automatically retries transient errors during startup.
  • Missing subdirectories under storage/jetstream/<account>/streams/: a bucket is present on some nodes but not others. On a 3-node cluster at R=3, every node should hold a copy; on a 4-node cluster at R=3, each bucket lives on three of the four nodes, so one node legitimately holds none. First check the bucket's current replica count: nats stream info KV_<bucket>. If it reports Replicas: 1, the bucket was created before Symphony switched to R=3 in this cluster; it will scale up automatically on the next Symphony startup or when the owning account next triggers bucket creation. To force an immediate scale-up: nats stream update KV_<bucket> --replicas=3. If the count is already 3 but a node is missing its copy, the node was out of the cluster at creation time and JetStream will restore it during the next leader sync (usually within a few seconds of the node rejoining).
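The accounts/ comparison suggested for err_code 10039 can be scripted when the directories are reachable from one machine, for example copied or mounted locally. That access is an assumption, and jwt_inventory and inventory_diff are hypothetical helpers, not Symphony tools.

```python
import os


def jwt_inventory(accounts_dir):
    """Map each *.jwt filename in accounts_dir to its size in bytes."""
    return {
        entry.name: entry.stat().st_size
        for entry in os.scandir(accounts_dir)
        if entry.is_file() and entry.name.endswith(".jwt")
    }


def inventory_diff(reference, other):
    """Return (missing, mismatched) filenames in `other` relative to `reference`."""
    missing = sorted(set(reference) - set(other))
    mismatched = sorted(
        name
        for name in reference.keys() & other.keys()
        if reference[name] != other[name]
    )
    return missing, mismatched
```

Comparing the primary node's inventory against each other node shows which account JWTs have not yet propagated (missing) and which differ in size (mismatched). Both lists should be empty once the resolver has converged.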