Fleet capacity model

Current model

Production CXB Core fleet nodes use one pre-forked worker per call slot.

MAX_CONCURRENT_CALLS = 1
CXBCORE_WORKERS = 16
1 fleet host = 16 active calls

Each worker is a uvicorn process. nginx routes to Unix sockets using least_conn and max_conns=1. CXB Core also enforces MAX_CONCURRENT_CALLS=1 inside each worker.
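
A minimal sketch of the in-worker limit, assuming a simple counter guarded by a lock. `CallSlots` and `handle_new_call` are hypothetical names for illustration, not CXB Core's actual code; the 429 at_capacity result mirrors the rejection CXB Core returns when a worker is already busy.

```python
from __future__ import annotations

import threading

MAX_CONCURRENT_CALLS = 1  # value from the fleet settings above


class CallSlots:
    """Hypothetical in-worker guard: tracks active calls against a fixed limit."""

    def __init__(self, limit: int = MAX_CONCURRENT_CALLS) -> None:
        self._limit = limit
        self._active = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Claim a slot; return False when the worker is already at capacity."""
        with self._lock:
            if self._active >= self._limit:
                return False
            self._active += 1
            return True

    def release(self) -> None:
        """Free a slot when the call ends, whether normally or by error."""
        with self._lock:
            if self._active > 0:
                self._active -= 1


def handle_new_call(slots: CallSlots) -> tuple[int, str]:
    """Return an (HTTP status, reason) pair for an incoming call request."""
    if not slots.try_acquire():
        return 429, "at_capacity"
    return 200, "accepted"
```

The guard inside the worker is a second line of defense: nginx should not send a second call to a busy worker, but if it does, the worker rejects it instead of degrading the call in progress.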

Why one call per worker

Voice calls are long-lived, stateful, CPU-sensitive workloads. A single active call may hold:

STT streaming connection
LLM context and tool state
TTS streaming/generation
VAD/turn detection
recorder and transcript buffers
provider latency and retry state

Keeping one call per worker gives predictable failure isolation. If one worker wedges or crashes, it drops one call, not a batch.

Current outbound fleet routing

CXB API stores fleet URLs in system settings. For dialout:

CXB API health-checks every fleet URL at /health/fleet.
It skips unreachable or full servers.
It picks the healthy server with the most available capacity.
It sends /livekit/dialout to that fleet server.
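
The selection steps above can be sketched as follows. This is an illustration under assumptions: `FleetHealth`, its field names, and `pick_fleet` are hypothetical, and the real /health/fleet response shape is not shown in this document.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class FleetHealth:
    """Hypothetical result of one /health/fleet check."""
    url: str
    reachable: bool
    available: int  # free call slots reported by the fleet (field name assumed)


def pick_fleet(servers: list[FleetHealth]) -> Optional[FleetHealth]:
    """Skip unreachable or full servers, then pick the one with the most free slots."""
    candidates = [s for s in servers if s.reachable and s.available > 0]
    if not candidates:
        return None  # nothing can take the call right now
    return max(candidates, key=lambda s: s.available)
```

For example, given one unreachable server, one full server, and two servers with 3 and 9 free slots, `pick_fleet` returns the server with 9.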

CXB Dialler follows a similar idea for campaign attempts: poll fleet capacity, reserve the best server locally, and attach answered calls.

For campaign attach, there is one extra local-host detail: /attach is a short HTTP POST that starts long-running call work in the selected worker. nginx sees the POST finish while the worker remains logically busy, so max_conns=1 no longer protects that worker. If nginx later routes another attach to the same worker, CXB Core returns 429 at_capacity even if another worker on the host is free. Production fleet nginx therefore needs exact route blocks for /attach, /livekit/dialout, and /livekit/widget that retry http_429 across the upstream pool:

proxy_next_upstream error timeout http_429 non_idempotent;
proxy_next_upstream_tries <worker-count>;

Do not apply that retry block to /livekit/dispatch until its rejection path is side-effect-free. The current path can remove the inbound SIP participant before returning.
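
To make the retry behavior concrete, here is a small model of what the retry block achieves for the short POST routes. It is a simplification under stated assumptions: `route_with_retry` is a hypothetical name, and the rotation order is a stand-in, since nginx with least_conn chooses the next upstream by its own rules.

```python
from __future__ import annotations

from typing import Optional


def route_with_retry(worker_busy: list[bool], start: int, max_tries: int) -> Optional[int]:
    """Model a short POST being retried across the worker pool on 429.

    worker_busy[i] is True when worker i is logically busy with a call,
    even though its earlier POST has already completed from nginx's view.
    Returns the index of the worker that accepted, or None if every
    attempted worker answered 429.
    """
    pool_size = len(worker_busy)
    for attempt in range(min(max_tries, pool_size)):
        idx = (start + attempt) % pool_size  # simple rotation, not least_conn
        if not worker_busy[idx]:
            worker_busy[idx] = True  # the worker accepts and becomes busy
            return idx
        # A busy worker answers 429 at_capacity; move on to the next one.
    return None
```

With a single try, a request that lands on a busy worker fails even when a neighbor is free. With tries equal to the worker count, it fails only when the whole host is full.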

Multi-fleet ingress (HAProxy)

A single deployment can run more than one fleet host. For example, a production deployment can run two fleets, fleet.cxbridge.io and fleet2.cxbridge.io, fronted by an HAProxy ingress at calls.cxbridge.io that distributes inbound WebSocket calls (telephony, Exotel) across both. The per-host nginx + max_conns=1 model is unchanged behind the ingress: HAProxy only spreads load across fleets, while each fleet’s nginx still pins one call per worker.

HAProxy ingress gotchas (learned in production):

Gotcha	Rule
`retry-on` and 429	HAProxy `retry-on` does not retry on `429`; a busy-fleet 429 must be handled explicitly, not via generic retry.
Body-based routing	Use an anchored `rstring` body check. A leading-comma (non-anchored) match black-holes both fleets.
`haproxy -c`	Only validates syntax. Validate routing against a captured request body, not just config parse.
Fleet availability bound	`fleet_available` health signal is bounded to the worker count range (1–16); treat out-of-range values as unhealthy.
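
The availability bound in the last row can be sketched as a small validation step. `is_healthy_signal` is a hypothetical helper, not part of the ingress scripts. Note that 0 falls outside the stated 1–16 range, so this sketch treats a full fleet the same way as an unhealthy one.

```python
CXBCORE_WORKERS = 16  # worker count per fleet host, from the current model


def is_healthy_signal(fleet_available: object, workers: int = CXBCORE_WORKERS) -> bool:
    """Accept fleet_available only when it is an integer in the range 1..workers."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(fleet_available, bool) or not isinstance(fleet_available, int):
        return False
    return 1 <= fleet_available <= workers
```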

The HAProxy ingress is brand-portable via render.sh (templated config) and add-fleet.sh (adds a fleet backend without hand-editing). See WSS ingress.

WSS URL behind a load balancer

The public URL should remain stable:

wss://fleet.example.com/ws/{bot_id}
wss://fleet.example.com/exotel/{bot_id}

The load balancer must keep each WebSocket pinned to the backend worker selected during upgrade. The critical property is not the URL; it is capacity-aware routing so one worker does not receive two simultaneous calls.

Scale math

At the current 16-call host shape:

Target active calls	Approx fleet hosts
100	7
500	32
1,000	63
20,000	1,250

This is why 1K scale is not just “more Docker containers.” The routing, registry, observability, rollout, and cost model all matter.
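
The host counts in the table are the target divided by 16 and rounded up. A short sketch reproduces them; `hosts_needed` is a hypothetical helper, and it assumes every host runs at full occupancy with no headroom for failures or rollouts.

```python
CALLS_PER_HOST = 16  # one call per worker, 16 workers per host


def hosts_needed(target_calls: int, calls_per_host: int = CALLS_PER_HOST) -> int:
    """Return the minimum number of hosts for a target, using integer ceiling division."""
    return (target_calls + calls_per_host - 1) // calls_per_host


for target in (100, 500, 1_000, 20_000):
    print(f"{target} active calls: {hosts_needed(target)} hosts")
```

A real plan would add spare capacity on top of these minimums.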

What must change for 1K

Current	1K-ready direction
Static list of fleet URLs	Worker or node registration with TTL.
Health polling every fleet URL	Central capacity registry with atomic reservations.
One CXB Dialler loop	Partitioned dialler workers or campaign leases.
Per-host nginx config	Generated/containerized LB config or service discovery.
Host-local logs	Central logs, metrics, and alerts.
nginx retry blocks for short async POST routes	Explicit slot reservation or worker-aware routing before attach.
Manual node addition	Scripted or orchestrated fleet provisioning.
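
As a rough illustration of the first two rows (registration with TTL and atomic reservations), here is an in-memory sketch. Everything in it is an assumption: `SlotRegistry` and its methods are hypothetical, and a production registry would live in a shared store rather than in one process.

```python
from __future__ import annotations

import threading
import time
from typing import Callable, Optional


class SlotRegistry:
    """Hypothetical capacity registry: nodes heartbeat with a TTL, callers reserve slots."""

    def __init__(self, ttl_seconds: float = 15.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._nodes: dict[str, dict] = {}  # node_id -> {"free": int, "expires": float}

    def register(self, node_id: str, free_slots: int) -> None:
        """Heartbeat: record the node's free slots and refresh its TTL."""
        with self._lock:
            self._nodes[node_id] = {
                "free": free_slots,
                "expires": self._clock() + self._ttl,
            }

    def reserve(self) -> Optional[str]:
        """Atomically take one slot from the live node with the most free slots."""
        with self._lock:
            now = self._clock()
            live = {n: s for n, s in self._nodes.items()
                    if s["expires"] > now and s["free"] > 0}
            if not live:
                return None
            node_id = max(live, key=lambda n: live[n]["free"])
            self._nodes[node_id]["free"] -= 1
            return node_id

    def release(self, node_id: str) -> None:
        """Return a slot when the call ends or the attach fails."""
        with self._lock:
            if node_id in self._nodes:
                self._nodes[node_id]["free"] += 1
```

Because selection and decrement happen under one lock, two callers cannot reserve the same last slot, which is the property that polling every fleet URL cannot guarantee. A node that stops sending heartbeats drops out once its TTL expires. In this sketch a heartbeat overwrites the free count, so a real design would need to reconcile heartbeats with reservations still in flight.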

Recommended path

Dockerize without changing behavior.
Keep 1 worker = 1 call slot.
Replace per-host systemd workers with worker containers.
Add a per-host LB with maxconn=1 backends for the cheapest production version.
Add a central slot registry before moving beyond tens of hosts.