Pipeline - CX Bridge

Every call runs through the same core pipeline regardless of route. The route owns the connection details; the pipeline factory assembles media-pipeline processors based on the runtime bot configuration fetched from the CXB API.

Pipeline structure

Audio flows left to right. The Audio Buffer sits after transport output so it captures exactly what was played to the customer.

For the communication-quality controls around listening, interruptions, silence, tool timing, grounding, and call telemetry, see Communication quality.

Services

STT (speech-to-text)

Capability	Model	Notes
Streaming STT	`<model-id>`	Default. Real-time streaming.
Turn-detecting STT	`<model-id>`	Uses the engine’s `/v2/listen` and external turn detection.
Multilingual STT	`<model-id>`	Alternative engine.

Configured via stt.provider and stt.model in bot config. Extra options pass through stt.extra:

Streaming STT: endpointing, smart_format, punctuate, interim_results
Turn-detecting STT: eot_threshold, eager_eot_threshold, eot_timeout_ms, keyterm, min_confidence, language_hints, should_interrupt
Multilingual STT: language_hints, language_hints_strict, context, enable_speaker_diarization, enable_language_identification, client_reference_id, vad_force_turn_endpoint

language_hints for the turn-detecting STT engine is only applied when the multilingual model is selected. should_interrupt defaults to true.

For the multilingual STT engine, CXB Core defaults vad_force_turn_endpoint to false (the pipeline framework’s library default is true). With force-turn-endpoint on, VAD stop events finalize the STT segment mid-turn, which wedges turns when soft Hindi/Hinglish speech sits below VAD min_volume and re-engagement never fires. Operators can flip it back per bot via stt.extra.vad_force_turn_endpoint = true. This behavior lives in the STT service factory within the Core service.
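The override described above can be sketched as a small resolver. This is only an illustration of the precedence (explicit `stt.extra` value first, then the CXB Core default of false); the function name, the constant, and the `"stt_multilingual"` provider string are hypothetical and not taken from the Core service.

```python
from typing import Any, Mapping

# CXB Core's default per the paragraph above (the framework's own default is True).
CXB_DEFAULT_VAD_FORCE_TURN_ENDPOINT = False


def resolve_vad_force_turn_endpoint(stt_config: Mapping[str, Any]) -> bool:
    """Return the effective flag for the multilingual STT engine.

    An explicit value in stt.extra wins; otherwise the CXB Core default is used.
    """
    extra = stt_config.get("extra") or {}
    value = extra.get("vad_force_turn_endpoint")
    if value is None:
        return CXB_DEFAULT_VAD_FORCE_TURN_ENDPOINT
    return bool(value)


# Hypothetical bot config fragments (provider name is an assumption).
default_bot = {"provider": "stt_multilingual", "model": "<model-id>"}
opted_in_bot = {
    "provider": "stt_multilingual",
    "model": "<model-id>",
    "extra": {"vad_force_turn_endpoint": True, "language_hints": ["hi", "en"]},
}
```

With these fragments, `default_bot` resolves to false and `opted_in_bot` resolves to true.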

LLM (language model)

Provider	Example models	Notes
LLM provider A	`<model-id>`	Default. Via the LLM provider’s API.
LLM provider B	`<model-id>`	Via the LLM provider’s API.
Managed LLM platform	`<model-id>`	Requires `project_id` in `llm.extra`.

The LLM receives the system prompt and conversation context. It can call built-in functions (end_call, transfer_call, detected_voicemail, search_knowledge) and custom tools defined in the bot config.
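The `project_id` requirement from the provider table could be checked before the pipeline is built. The sketch below is a hedged example, not the Core service's validation: the `"llm_managed"` provider identifier and the `validate_llm_config` helper are hypothetical names chosen for illustration.

```python
from typing import Any, List, Mapping

# Hypothetical provider identifier; the real value is not given on this page.
MANAGED_PLATFORM = "llm_managed"


def validate_llm_config(llm_config: Mapping[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means no problem was found."""
    problems: List[str] = []
    extra = llm_config.get("extra") or {}
    if llm_config.get("provider") == MANAGED_PLATFORM and not extra.get("project_id"):
        problems.append("llm.extra.project_id is required for the managed LLM platform")
    return problems
```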

TTS (text-to-speech)

Provider	Example models	Notes
TTS engine A	`<model-id>`	Default. Multilingual models require language code.
TTS engine B	`<model-id>`	Indian language support.
TTS engine C	—	`provider: "tts_c"`. ~20-language map (English, Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu, plus Spanish/French/German/Arabic/Portuguese/Japanese/Korean/Chinese/Russian/Italian). `voice` and `model` optional.

Configured via tts.provider, tts.model, and tts.voice_id. The TTS service factory in the Core service handles the per-provider settings passthrough. The default TTS engine can be wrapped with Redis-backed TTS caching when tts.cache_config.enabled=true and TTS_CACHE_REDIS_URL is configured.
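The two conditions for TTS caching can be read as a single predicate. The following is a minimal sketch, assuming the bot config is available as a nested mapping; `tts_cache_enabled` is a hypothetical helper, and it deliberately does not check which TTS engine is selected.

```python
import os
from typing import Any, Mapping, Optional


def tts_cache_enabled(
    tts_config: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """True only when tts.cache_config.enabled is set AND a Redis URL is configured."""
    env = os.environ if env is None else env
    cache_config = tts_config.get("cache_config") or {}
    return bool(cache_config.get("enabled")) and bool(env.get("TTS_CACHE_REDIS_URL"))
```

Passing `env` explicitly keeps the check easy to test; in a running service it would fall back to the process environment.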

VAD and turn detection

Voice Activity Detection (VAD): an on-device VAD model detects when the customer is speaking. Configurable via vad in bot config (confidence threshold, start/stop timing, minimum volume).
Turn detection: an on-device turn-detection model determines when the customer has finished their turn, triggering the LLM response.
Turn-detecting STT: the turn-detecting STT engine has external turn detection, so CXB Core uses its turn events instead of VAD-based user-turn strategies for stt.provider = "stt_turn_detecting".
Interruptions: Enabled by default. When the customer speaks over the bot, TTS output is cancelled and the LLM processes the new input. min_words_interruption (default: 3) prevents accidental interruptions from short utterances.
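As an illustration of the min_words_interruption rule, the check might look like the sketch below. It assumes words are counted by whitespace splitting and that an utterance with at least min_words_interruption words qualifies; the actual counting and comparison used by CXB Core are not specified here, and `passes_min_words` is a hypothetical name.

```python
def passes_min_words(transcript: str, min_words_interruption: int = 3) -> bool:
    """True when the utterance is long enough to count as an interruption.

    Assumption: words are whitespace-separated tokens.
    """
    return len(transcript.split()) >= min_words_interruption
```

Under these assumptions a short backchannel such as "ok" would not cancel TTS output, while "wait I have a question" would.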

Built-in functions

The LLM has access to built-in functions:

Function	Behavior
`end_call()`	Bot speaks any final message, then hangs up. Sets `disconnected_by = "bot"`.
`transfer_call(reason)`	Speaks pre-transfer message, then transfers to configured target. Native transfer-state is derived from pipeline events and added to the call results — see Transfer and escalation.
`detected_voicemail()`	Speaks voicemail message (if configured), then hangs up. Sets `disconnected_by = "voicemail"`.
`search_knowledge(query)`	Searches attached CXB API knowledge bases when RAG is enabled for the bot.

Custom tools defined in bot config are also registered. Custom tools can run immediate, speak-then-run, speak-and-run-parallel, or terminal-after-speech policies. Tool calls and results are logged in call events and turn metrics.

Re-engagement (dead air handling)

When the customer goes silent mid-call, CXB Core prompts them to respond:

After gap_seconds of silence, speak a re-engagement message (shuffled, non-repeating)
Reduce the gap for subsequent attempts
After max_retries is exceeded, end the call with disconnected_by = "RNR"

Configured via re_engagement in bot config: messages (list of prompts), gap_seconds (int or [first, subsequent]), max_retries.
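The steps above can be sketched as a small state holder. This is an illustrative model only: the class and method names are hypothetical, the gap is reduced only when gap_seconds is given in the [first, subsequent] form, and "exceeded" is taken to mean that no further prompt is spoken once max_retries prompts have been used.

```python
import random
from typing import List, Optional, Sequence, Union


class ReEngagementPlan:
    """Tracks retries, gaps, and a shuffled, non-repeating message queue."""

    def __init__(
        self,
        messages: Sequence[str],
        gap_seconds: Union[float, Sequence[float]],
        max_retries: int,
        rng: Optional[random.Random] = None,
    ) -> None:
        if not messages:
            raise ValueError("at least one re-engagement message is required")
        if isinstance(gap_seconds, (list, tuple)):
            self.first_gap, self.subsequent_gap = gap_seconds
        else:
            self.first_gap = self.subsequent_gap = gap_seconds
        self.max_retries = max_retries
        self.retries = 0
        self._messages: List[str] = list(messages)
        self._queue: List[str] = []
        self._rng = rng or random.Random()

    def gap_for_next_attempt(self) -> float:
        """Silence to wait before the next prompt."""
        return self.first_gap if self.retries == 0 else self.subsequent_gap

    def next_message(self) -> Optional[str]:
        """Return the next prompt, or None when the call should end as "RNR"."""
        if self.retries >= self.max_retries:
            return None
        if not self._queue:  # reshuffle only after every message has been used
            self._queue = self._messages[:]
            self._rng.shuffle(self._queue)
        self.retries += 1
        return self._queue.pop()

    def reset(self) -> None:
        """Called when a new user turn starts (see user_turn_started)."""
        self.retries = 0
```

A caller would wait `gap_for_next_attempt()` seconds of silence, speak `next_message()`, and end the call with disconnected_by = "RNR" once it returns `None`.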

Call events

The pipeline appends structured events to the call results as the conversation progresses. These are the primary tool for debugging why a call felt slow, wedged, or force-closed. All of them are emitted from the pipeline factory in the Core service.

Service errors

When the pipeline raises an error frame, CXB Core records a service_error event:

Field	Meaning
`service`	Classified as `llm`, `tts`, `stt`, or `unknown`. Derived from the failing processor’s class name, falling back to matching provider names in the error message.
`processor`	Processor class name that raised the error.
`message`	Error string (truncated to 500 chars).
`fatal`	`true` when the error frame is fatal (terminates the call).
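The classification order in the table (processor class name first, provider names in the message as a fallback) might be approximated as follows. This is a sketch under assumptions: the `type` key, the provider hint strings, and both function names are hypothetical, and the class name is split into camel-case tokens so that a name such as `ExampleSTTService` is not mistaken for a TTS processor.

```python
import re
from typing import Any, Dict

# Hypothetical provider hints; real provider names are not listed on this page.
PROVIDER_HINTS = {
    "llm": ("llm_provider_a",),
    "tts": ("tts_c",),
    "stt": ("stt_turn_detecting",),
}
MAX_MESSAGE_CHARS = 500

# Splits "ExampleSTTService" into ["Example", "STT", "Service"].
_TOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")


def classify_service(processor_name: str, message: str) -> str:
    """Classify by class-name token first, then by provider names in the message."""
    tokens = {token.lower() for token in _TOKEN_RE.findall(processor_name)}
    for service in ("llm", "tts", "stt"):
        if service in tokens:
            return service
    text = message.lower()
    for service, hints in PROVIDER_HINTS.items():
        if any(hint in text for hint in hints):
            return service
    return "unknown"


def build_service_error_event(processor_name: str, message: str, fatal: bool) -> Dict[str, Any]:
    """Assemble a service_error event with the fields from the table above."""
    return {
        "type": "service_error",  # assumed event discriminator key
        "service": classify_service(processor_name, message),
        "processor": processor_name,
        "message": message[:MAX_MESSAGE_CHARS],
        "fatal": fatal,
    }
```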

User-turn lifecycle

These events trace how each customer turn was detected and closed — essential for diagnosing dead-air and force-closed-turn bugs.

Event	Meaning
`user_turn_started`	A user turn began. Carries `strategy` (turn strategy class name, or `null`). Resets the re-engagement retry counter.
`user_turn_inference_triggered`	A turn strategy fired and the LLM was triggered. Carries `strategy`.
`turn_stop_timeout`	No stop strategy fired before the timeout; the turn was force-closed by the watchdog without inference.
`user_turn_stopped`	The turn ended. Carries `strategy`, `inference_triggered` (whether the LLM was triggered), and `had_content` (whether the turn had transcript text).

A user_turn_stopped with strategy = null, inference_triggered = false, and had_content = true means the turn was force-closed by the stop-timeout watchdog and its transcript was discarded without reaching the LLM. This is the signature of the dead-air bug.
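When scanning call events for this signature, a filter along these lines can help. It is a sketch that assumes each event is a dict whose `type` key holds the event name; that shape and the helper names are assumptions, not the documented call-results schema.

```python
from typing import Any, Dict, List, Sequence


def is_discarded_turn(event: Dict[str, Any]) -> bool:
    """Match a force-closed turn whose transcript never reached the LLM."""
    return (
        event.get("type") == "user_turn_stopped"
        and event.get("strategy") is None
        and event.get("inference_triggered") is False
        and event.get("had_content") is True
    )


def find_discarded_turns(events: Sequence[Dict[str, Any]]) -> List[int]:
    """Return the indexes of events that match the dead-air signature."""
    return [index for index, event in enumerate(events) if is_discarded_turn(event)]
```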

Recording

Audio is captured by an AudioBufferProcessor placed after transport output. On call end, the buffer is encoded as WAV and uploaded to S3-compatible object storage: DigitalOcean Spaces or MinIO depending on runtime config. The recording URL and storage key are included in call results.
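The WAV encoding step can be illustrated with the Python standard library. The sketch assumes the buffer holds raw 16-bit PCM and uses an 8 kHz mono default; the actual sample rate, channel count, and upload call used by CXB Core are not stated here, and `encode_wav` is a hypothetical helper. The upload to object storage is intentionally left out.

```python
import io
import wave


def encode_wav(pcm: bytes, sample_rate: int = 8000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM audio in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM (assumption)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()
```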

Max duration

If max_call_duration_seconds is set in bot config (default: 600), the pipeline automatically ends the call when the limit is reached and sets disconnected_by = "timeout".
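One way to picture the limit is a timeout wrapped around the call task. The sketch below uses `asyncio.wait_for` purely as an illustration; how the pipeline actually schedules and enforces the limit is not described on this page, and `run_with_max_duration` is a hypothetical name.

```python
import asyncio
from typing import Any, Awaitable, Dict


async def run_with_max_duration(
    call: Awaitable[Dict[str, Any]],
    max_call_duration_seconds: float = 600,
) -> Dict[str, Any]:
    """Run the call, cancelling it when the duration limit is reached."""
    try:
        return await asyncio.wait_for(call, timeout=max_call_duration_seconds)
    except asyncio.TimeoutError:
        # The limit was hit before the call finished on its own.
        return {"disconnected_by": "timeout"}
```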

Latency tracking

Per-turn latency is measured and collected as samples (latency_samples):

Per-turn sample	What it measures
`stt_ms`	Time-to-first-byte from STT processor
`llm_ms`	Time-to-first-byte from LLM processor
`tts_ms`	Time-to-first-byte from TTS processor
`tool_ms`	Custom tool execution latency, when tools ran
`rag_ms`	Knowledge search latency, when RAG ran
`total_ms`	End-to-end response latency

At call end these samples are averaged into the latency object on the call results (LatencyData in the Core service’s results model, populated by the post-call latency averaging step). The averaged fields are named distinctly from the per-turn samples:

Averaged field	Source sample
`stt_avg_ms`	mean of `stt_ms`
`llm_avg_ms`	mean of `llm_ms`
`tts_avg_ms`	mean of `tts_ms`
`tool_avg_ms`	mean of `tool_ms`
`rag_avg_ms`	mean of `rag_ms`
`total_avg_response_ms`	mean of `total_ms`

Both the raw latency_samples list and the averaged latency object are included in the final call results.
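The mapping from per-turn samples to averaged fields can be sketched as below. The field names come from the two tables above; the handling of missing values (skipped, with `None` when a metric never appeared) and the absence of rounding are assumptions, and `average_latency` is a hypothetical helper rather than the Core service's averaging step.

```python
from statistics import fmean
from typing import Any, Dict, Optional, Sequence

# Per-turn sample key -> averaged field name, as listed in the tables above.
FIELD_MAP = {
    "stt_ms": "stt_avg_ms",
    "llm_ms": "llm_avg_ms",
    "tts_ms": "tts_avg_ms",
    "tool_ms": "tool_avg_ms",
    "rag_ms": "rag_avg_ms",
    "total_ms": "total_avg_response_ms",
}


def average_latency(latency_samples: Sequence[Dict[str, Any]]) -> Dict[str, Optional[float]]:
    """Average each metric over the turns where it was recorded."""
    averaged: Dict[str, Optional[float]] = {}
    for sample_key, avg_key in FIELD_MAP.items():
        values = [s[sample_key] for s in latency_samples if s.get(sample_key) is not None]
        averaged[avg_key] = fmean(values) if values else None
    return averaged
```

Because `tool_ms` and `rag_ms` are only present on turns where tools or RAG ran, their averages cover those turns only.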

Live prompt caching

For non-policy bots, the static portion of the system prompt can be served from a provider-side prompt cache to cut LLM cost and time-to-first-token on the live call. The LLM provider and managed LLM platform use explicit CachedContent; the other LLM provider uses prompt_cache_key/prompt_cache_retention request hints. CXB Core reports per-call cache hit/miss metrics on the first live llm usage entry (cache.namespace = "live_prompt"). See Caching for the full design, including how static and dynamic prompt portions must be split and the lifecycle of the cache registry.
