High Availability
Symphony can be deployed as a shared-nothing multi-instance cluster for fault tolerance and high availability. Each instance runs independently with its own configuration directory, storage, and embedded messaging server; there is no shared filesystem or database. The instances form a cluster that automatically replicates all state (user accounts, roles, extension registrations, key-value stores, and extension data) across nodes.
How It Works
A Symphony cluster consists of three or more instances that share the same cryptographic identity (operator and account JWTs, seeds). Each instance runs its own embedded messaging server, and these servers form a cluster using dedicated routing ports. Symphony automatically replicates data across cluster members, providing redundancy and fault tolerance. Because each node holds a complete copy of the replicated state, any node can serve any request independently.
If one instance becomes unavailable, the remaining instances continue to serve requests. Users connected to the failed instance can reconnect to any other instance in the cluster. When the failed instance recovers, it rejoins the cluster and synchronises its state automatically.
Each node in the cluster communicates with every other node via dedicated routing ports. Extensions connect to any node and automatically fail over to a remaining node if their connection is lost. Users access the cluster through a load balancer that distributes requests across all healthy nodes.
Prerequisites
- A working single-instance Symphony deployment (the "primary" instance)
- Three or more servers (physical or virtual) with network connectivity between them
- Each server must be able to reach the other servers on its cluster routing port
- A load balancer or DNS round-robin for distributing client connections
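Before configuring the cluster, it can help to confirm that each server can open a TCP connection to its peers on the cluster routing port. The following is a minimal sketch, not part of Symphony; the helper names are hypothetical, and a successful connection only shows that the port is open, not that the routing handshake will succeed.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def unreachable_peers(peers, timeout: float = 3.0):
    """Return the (host, port) pairs from peers that could not be reached."""
    return [(h, p) for h, p in peers if not can_reach(h, p, timeout)]
```

For example, running `unreachable_peers([("node2.example.com", 6223), ("node3.example.com", 6224)])` on node1 should return an empty list once firewall rules allow the routing traffic.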
Cluster Architecture
Each node in a cluster needs its own ports for NATS client connections, WebSocket connections, cluster routing, and the UI. You can choose any available ports. The following table shows an example layout for a three-node cluster:
| Node | NATS Port | WebSocket Port | Cluster Port | UI Port |
|---|---|---|---|---|
| node1 | 4222 | 9222 | 6222 | 8080 |
| node2 | 4333 | 9333 | 6223 | 8282 |
| node3 | 4444 | 9444 | 6224 | 8383 |
When all instances run on the same host (for development or testing), each must use different ports. When deployed on separate servers, you can use the same ports on each server.
Setting Up a Cluster
Do not run the setup wizard on each node independently. The wizard generates unique cryptographic identities (operator keys, account JWTs, signing seeds, storage salt). If each node generates its own identities, the nodes will have different trust roots and the cluster cannot replicate state or authenticate accounts across nodes. Always configure one node first, then copy its configuration to the others.
1. Set up the primary instance
Install and configure the first Symphony instance normally using any installation method. Complete the setup wizard to generate the identity keys and configuration files. The instance will start automatically in single-node mode after the wizard completes—this is expected.
2. Copy configuration to additional nodes
Copy the configuration files from the primary instance to each additional node. The two files that must be identical across all nodes are:
- symphony.config: shared identity (operator JWT and seeds, Symphony JWT and seeds, signing seeds, OIDC settings, storage salt)
- nats.config: operator JWT, system account JWT, Symphony account JWT, and resolver preload
These files contain the cluster's cryptographic identity. Every field
in symphony.config must match across nodes except network.uiport.
In nats.config, only networking fields (host, port,
websocket.port, logfile, store_dir) and the cluster block
differ per node.
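One way to check that the copied files still match is to compare them programmatically while ignoring the per-node field. The sketch below assumes symphony.config is a JSON document; the function names are hypothetical, and every key in the usage note other than network.uiport is invented for illustration.

```python
import json


def load_config(path):
    """Load a JSON configuration file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def differing_keys(a, b, ignore=frozenset({"network.uiport"}), prefix=""):
    """Return dotted key paths whose values differ between dicts a and b.

    Paths listed in ignore (per-node settings) are skipped.
    """
    diffs = []
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        if path in ignore:
            continue
        left, right = a.get(key), b.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            # Recurse into nested sections, extending the dotted path.
            diffs.extend(differing_keys(left, right, ignore, prefix=path + "."))
        elif left != right:
            diffs.append(path)
    return diffs
```

An empty result from `differing_keys(load_config("node1/symphony.config"), load_config("node2/symphony.config"))` would indicate that the two files differ only in the UI port. A result such as `["storage.salt"]` (a hypothetical key) would point at a field that needs to be copied again from the primary.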
3. Configure symphony.config on each node
Edit symphony.config on each additional node to set a unique UI port:
Node 2:
{
"network": {
"uiport": 8282
}
}
Node 3:
{
"network": {
"uiport": 8383
}
}
4. Configure nats.config for clustering
Add a server_name and cluster block to each node's nats.config.
Each node must have a unique server_name, a unique cluster.listen
port, and route entries pointing to every other node's cluster port.
Node 1 (primary):
server_name: node1
cluster {
name: symphony-cluster
listen: 0.0.0.0:6222
routes: [
nats-route://node2.example.com:6223,
nats-route://node3.example.com:6224
]
}
Node 2:
server_name: node2
cluster {
name: symphony-cluster
listen: 0.0.0.0:6223
routes: [
nats-route://node1.example.com:6222,
nats-route://node3.example.com:6224
]
}
Node 3:
server_name: node3
cluster {
name: symphony-cluster
listen: 0.0.0.0:6224
routes: [
nats-route://node1.example.com:6222,
nats-route://node2.example.com:6223
]
}
When NATS TLS is enabled on the server, cluster routes are automatically encrypted using the same certificate. No additional cluster TLS configuration is needed.
The cluster.name must be identical on all nodes. Each node's route
list should include all other nodes in the cluster—routing is
automatically managed once the connections are established.
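Because every node's route list is simply "all the other nodes", the per-node blocks can be generated from a single inventory instead of being edited by hand. This is a minimal sketch with a hypothetical helper name; the hostnames and ports are the example values used in this guide, and the indentation of the output is cosmetic.

```python
def cluster_block(node, nodes, cluster_name="symphony-cluster"):
    """Render the server_name line and cluster block for one node.

    nodes maps each server name to a (hostname, cluster_port) tuple.
    """
    _, port = nodes[node]
    # Every node except the one being rendered becomes a route entry.
    routes = [
        f"    nats-route://{host}:{peer_port}"
        for name, (host, peer_port) in nodes.items()
        if name != node
    ]
    return "\n".join([
        f"server_name: {node}",
        "cluster {",
        f"  name: {cluster_name}",
        f"  listen: 0.0.0.0:{port}",
        "  routes: [",
        ",\n".join(routes),
        "  ]",
        "}",
    ])


NODES = {
    "node1": ("node1.example.com", 6222),
    "node2": ("node2.example.com", 6223),
    "node3": ("node3.example.com", 6224),
}
```

Calling `print(cluster_block("node2", NODES))` produces text to paste into node2's nats.config. Generating all blocks from one inventory keeps cluster.name identical everywhere and makes a missing route less likely.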
If running all nodes on the same host, also update each node's NATS client port and WebSocket port to unique values:
port: 4333
websocket {
port: 9333
no_tls: true
}
5. Start the cluster
Restart the primary instance (to pick up the cluster configuration added in step 4) and start the additional instances. The order does not matter—cluster routing handles nodes starting independently. However, the internal data store requires a majority quorum before it can accept operations (e.g. 2 of 3 nodes, or 3 of 5). Nodes that start before quorum is reached will retry automatically with increasing backoff for several minutes while waiting for peers to become available.
For the smoothest startup, start all nodes within a short window so that quorum is established quickly. If nodes start minutes apart, the early nodes will log transient errors such as "JetStream system temporarily unavailable" or "context deadline exceeded" while waiting for peers—this is normal and resolves once quorum is reached.
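Scripts that run against a freshly started cluster (for example, provisioning or smoke tests) may hit the same transient errors and can use a similar retry pattern. The sketch below illustrates generic retry with increasing backoff; it is not Symphony's internal retry logic, and the function name, delays, and attempt count are assumptions.

```python
import time


def retry_with_backoff(operation, attempts=6, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation() until it succeeds, doubling the wait after each failure.

    Re-raises the last exception if every attempt fails.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
            # Double the wait, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

The `sleep` parameter is injectable so the function can be exercised in tests without actually waiting.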
This guide includes docker compose commands. Refer to your deployment
documentation for environment-specific prerequisites.
# On each node (Linux)
sudo systemctl restart symphony
# Or with Docker Compose
docker compose restart symphony
Check the logs on each node to verify cluster formation:
# Linux
sudo journalctl -u symphony -f
# Docker
docker compose logs -f symphony
Look for messages indicating successful route connections and startup:
Cluster ready clusterName=symphony-cluster routes=2
Symphony ready url=nats://10.0.0.1:4222
Account Resolver Synchronisation
Each node periodically compares its account state with other nodes in
the cluster. The default synchronisation interval is 2 minutes,
configured in nats.config:
resolver {
type: full
interval: "2m"
timeout: "1.9s"
}
This means that when a new user account or extension is registered on one node, it may take up to 2 minutes for the registration to propagate to all other nodes. During this window, the new user or extension may only be accessible on the node where it was created.
Load Balancing
Place a load balancer (nginx, HAProxy, or a cloud load balancer) in front of the cluster to distribute client connections across nodes.
HTTP/HTTPS Traffic
Configure the load balancer to distribute HTTP requests across all nodes' UI ports. Use health checks on each node's HTTP port to detect failed instances:
upstream symphony_http {
server node1.example.com:8080;
server node2.example.com:8282;
server node3.example.com:8383;
}
server {
listen 443 ssl;
server_name symphony.example.com;
location / {
proxy_pass http://symphony_http;
}
location /ws {
proxy_pass http://symphony_ws;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
}
WebSocket Traffic
WebSocket connections must use sticky sessions (session affinity) to ensure that each browser session remains connected to the same node for the duration of the WebSocket connection:
upstream symphony_ws {
ip_hash;
server node1.example.com:9222;
server node2.example.com:9333;
server node3.example.com:9444;
}
NATS Client Traffic
If extensions or external NATS clients connect to the cluster, configure TCP passthrough load balancing for the NATS client ports:
stream {
upstream nats_backend {
server node1.example.com:4222;
server node2.example.com:4333;
server node3.example.com:4444;
}
server {
listen 4222;
proxy_pass nats_backend;
}
}
Cluster Sizing
- 3 nodes—minimum recommended for production. Tolerates the loss of one node while maintaining quorum for data replication.
- 5 nodes—tolerates the loss of two nodes. Use this for environments that require higher availability.
- Even numbers—avoid clusters of 2 or 4 nodes. Symphony requires a majority quorum for writes, so an even number of nodes provides no additional fault tolerance over the next-lower odd number.
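The sizing guidance above follows directly from majority-quorum arithmetic, which can be sketched as follows (the function names are illustrative):

```python
def quorum(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Number of nodes that can be lost while a majority remains."""
    return nodes - quorum(nodes)


for n in range(1, 6):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

A 4-node cluster needs 3 nodes for quorum and so tolerates one failure, the same as a 3-node cluster. A 2-node cluster needs both nodes and tolerates none.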
What Is Replicated
The following state is automatically replicated across all cluster
nodes. Every item is stored inside each node's Symphony data directory (the storage location configured in symphony.config), so
identifying replication problems usually starts with comparing the
listed paths across nodes.
| Data | Replication mechanism | Where to find it on disk | How to verify it |
|---|---|---|---|
| User accounts and profiles (account JWTs) | NATS account resolver — file-based, broadcast between nodes | accounts/<account-pubkey>.jwt (one file per user account, same filename and timestamp on every node) | ls -l accounts/ on each node; filenames and sizes should match |
| Role definitions, role assignments, extension shares, licenses, usage, business units, enforcement state | JetStream KV in the Symphony system account | storage/jetstream/<symphony-account-pubkey>/streams/KV_symphony_roles/, .../KV_symphony_role_assignments/, .../KV_symphony_shares/, .../KV_symphony_licenses/, .../KV_symphony_usage/, .../KV_symphony_business_units/, .../KV_symphony_enforcement/ | nats stream info KV_symphony_<name>; confirm that the Replicas value matches the replica factor for your cluster size (R=3 on 3- and 4-node clusters, R=5 on clusters of 5 or more nodes) and that the full peer list appears under Cluster → Replicas |
| Per-user KV mirrors (user's view of the Symphony state above) | JetStream KV mirror (memory-backed, resourced from the Symphony source on cold start) | storage/jetstream/<user-account-pubkey>/streams/KV_symphony_<name>/ under each user's account directory | Connect as that user and run nats stream info KV_symphony_<name> |
| Extension data (extension-declared KV buckets, combined registry/operations streams) | JetStream KV / streams | storage/jetstream/<user-account-pubkey>/streams/KV_<bucket-name>/ | nats stream info KV_<bucket> — compare Replicas, leader, and peer list |
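Comparing the accounts/ directory across nodes can be scripted once each node's listing has been collected. The following is a minimal sketch with hypothetical helper names; it assumes you gather the filename-to-size listings yourself (for example by running `list_sizes` on each node) and does not fetch anything remotely.

```python
import os


def list_sizes(directory):
    """Map each .jwt filename in directory to its size in bytes."""
    return {
        entry.name: entry.stat().st_size
        for entry in os.scandir(directory)
        if entry.is_file() and entry.name.endswith(".jwt")
    }


def listing_mismatches(listings):
    """Find files that are missing or differ in size on some node.

    listings maps node name -> {filename: size}. The result maps each
    mismatched filename to {node: size or None}, where None means missing.
    """
    all_files = set().union(*listings.values())
    mismatches = {}
    for name in sorted(all_files):
        sizes = {node: files.get(name) for node, files in listings.items()}
        if len(set(sizes.values())) > 1:
            mismatches[name] = sizes
    return mismatches
```

An empty result means every node holds the same account files with the same sizes. A file that appears with `None` on one node may simply not have propagated there yet.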
Symphony uses replica factor R=3 on 3- and 4-node clusters and R=5 on
5+ node clusters. Standalone deployments (1 node) and 2-node deployments
use R=1 — two nodes cannot form a majority quorum, so a 2-node cluster has
the same durability as a single node. Extensions that create JetStream
state at runtime can discover the current replica count via the NATS
service cirata.services.cluster.info; the Go SDK wraps this in
extension.CreateOrUpdateKeyValue, which fills the replica count in
automatically when the caller leaves it unset.
State held by Symphony on behalf of extensions — key-value bucket content and streams — is replicated across the cluster. In a cluster of 3 or more nodes, this state survives the loss of any individual node, because writes are committed to a majority (2 of 3) before they are acknowledged and the surviving majority can continue to serve and accept writes while the failed node is recovering.
In a standalone deployment (a single node, no cluster routes) the same buckets are created at replication factor 1 because there are no peers to replicate to.
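The replica factor rule described above can be summarised as a small function. This sketch mirrors the stated rule only; the function name is illustrative and is not part of any Symphony SDK.

```python
def replica_factor(cluster_size: int) -> int:
    """Replica factor for a given node count, per the rule described above."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if cluster_size >= 5:
        return 5
    if cluster_size >= 3:
        return 3
    # 1 or 2 nodes: no majority quorum is possible beyond a single copy.
    return 1
```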
To list every replicated stream on a node and see its current replica count at a glance:
nats stream ls --json | jq '.[] | {
name:.config.name,
replicas:.config.num_replicas,
leader:.cluster.leader,
peers:.cluster.replicas
}'
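The same JSON can be checked programmatically, for example in a monitoring script. The sketch below assumes the output shape implied by the jq filter above (a list of objects with `config.name`, `config.num_replicas`, and `cluster.leader`); the function name is hypothetical.

```python
import json


def under_replicated(streams_json: str, expected: int):
    """Return names of streams with fewer replicas than expected or no leader."""
    problems = []
    for stream in json.loads(streams_json):
        config = stream.get("config", {})
        cluster = stream.get("cluster") or {}
        too_few = config.get("num_replicas", 1) < expected
        leaderless = not cluster.get("leader")
        if too_few or leaderless:
            problems.append(config.get("name"))
    return problems
```

After saving the command output with `nats stream ls --json > streams.json`, calling `under_replicated(open("streams.json").read(), expected=3)` on a healthy 3-node cluster should return an empty list.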
The following is local to each node and is not replicated, except where the notes state otherwise:
| Data | Notes |
|---|---|
| symphony.config | Must be manually copied and maintained |
| nats.config | Must be manually configured per node |
| Log files | Each node writes its own logs |
| CDN cache | In-memory per node; rebuilt automatically from CDN on cache miss. Only populated when dependency resolution mode is Proxy or Mixed. |
| Extension bundles | Stored in JetStream object storage and replicated across the cluster like other JetStream data. Active version pointer is replicated via the bundle manifest KV. |
| Extension processes | Extensions connect to any node and automatically fail over to remaining nodes if their initial node becomes unavailable |
Removing a Node
To remove a node from the cluster:
- Stop the Symphony instance on the node being removed.
- Remove the node's route entry from the nats.config on all remaining nodes.
- Restart the remaining instances to apply the route changes (or send a reload signal).
The cluster will continue operating with the remaining nodes. Symphony will re-replicate any data that was held on the removed node.
Troubleshooting
- Cluster not forming: verify that the cluster routing ports are reachable between all nodes. Check firewall rules and security groups. The cluster.name must be identical on all nodes.
- "JetStream system temporarily unavailable" or "context deadline exceeded" at startup: this means the data store has not yet reached quorum. Ensure that a majority of nodes are running and can reach each other on the cluster routing ports. The node will retry automatically; once quorum is established the errors will stop. If the errors persist, check that all nodes have completed initial configuration (a node still showing the setup wizard does not participate in quorum).
- Split brain / data inconsistency: ensure all nodes share the same operator JWT, system account, and Symphony account JWTs. If each node ran the setup wizard independently, they will have different cryptographic identities and cannot replicate state. The fix is to copy symphony.config and nats.config from a single primary node to all others (adjusting only per-node networking and the cluster block). Compare the operator, system_account, and resolver_preload values in nats.config across nodes; they must be identical.
- Node won't rejoin after restart: check the NATS log for connection errors. Verify that the node's server_name is unique and that its routes point to the correct hostnames and ports.
- Stale data after failover: the account resolver synchronisation interval (default 2 minutes) means recently created accounts or registrations may not be immediately available on all nodes. Reduce the interval value in the resolver block if faster convergence is required.
- Replication warnings: if the log shows replication errors, ensure that jetstream is enabled in nats.config on all nodes and that each node has sufficient storage configured (max_mem and max_file in the jetstream block).
- jetstream not enabled for account (err_code 10039): the local node's account resolver has not yet received the affected user's account JWT (the one that grants the JetStream limits). This is usually transient and resolves as the resolver propagates; the operation will succeed on retry. Check that accounts/<account-pubkey>.jwt exists and has the same size on every node, and consider reducing the resolver interval in nats.config if propagation is slow.
- failed to create KV … context deadline exceeded: a JetStream create operation timed out waiting for a quorum to form. On a fresh cluster, verify every node reports its peers via nats server info. If peers are connected but creation still fails, the affected stream's source buckets may not yet be available on this node; retrying after the cluster fully converges typically succeeds. Symphony automatically retries transient errors during startup.
- Missing subdirectories under storage/jetstream/<account>/streams/: a bucket is present on some nodes but not others. At R=3 every node should hold a copy. First check the bucket's current replica count: nats stream info KV_<bucket>. If it reports Replicas: 1, the bucket was created before Symphony switched to R=3 in this cluster; it will scale up automatically on the next Symphony startup or when the owning account next triggers bucket creation. To force an immediate scale-up: nats stream update KV_<bucket> --replicas=3. If the count is already 3 but a node is missing its copy, the node was out of the cluster at creation time and JetStream will restore it during the next leader sync (usually within a few seconds of the node rejoining).