Every call runs through the same core pipeline regardless of route. The route owns the connection details; the pipeline factory assembles media-pipeline processors based on runtime bot configuration from CXB API.

Pipeline structure

Audio flows left to right. The Audio Buffer sits after transport output so it captures exactly what was played to the customer.
For the communication-quality controls around listening, interruptions, silence, tool timing, grounding, and call telemetry, see Communication quality.

Services

STT (speech-to-text)

| Capability | Model | Notes |
|---|---|---|
| Streaming STT | <model-id> | Default. Real-time streaming. |
| Turn-detecting STT | <model-id> | Uses the engine’s /v2/listen and external turn detection. |
| Multilingual STT | <model-id> | Alternative engine. |

Configured via stt.provider and stt.model in bot config. Extra options pass through stt.extra:
  • Streaming STT: endpointing, smart_format, punctuate, interim_results
  • Turn-detecting STT: eot_threshold, eager_eot_threshold, eot_timeout_ms, keyterm, min_confidence, language_hints, should_interrupt
  • Multilingual STT: language_hints, language_hints_strict, context, enable_speaker_diarization, enable_language_identification, client_reference_id, vad_force_turn_endpoint
language_hints for the turn-detecting STT engine is only applied when the multilingual model is selected. should_interrupt defaults to true.
For the multilingual STT engine, CXB Core defaults vad_force_turn_endpoint to false (the pipeline framework’s library default is true). With force-turn-endpoint on, VAD stop events finalize the STT segment mid-turn, which wedges turns when soft Hindi/Hinglish speech sits below VAD min_volume and re-engagement never fires. Operators can flip it back per bot via stt.extra.vad_force_turn_endpoint = true. This behavior lives in the STT service factory within the Core service.
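
The default-and-override behavior described above can be sketched as a plain merge. This is only an illustration: the function and constant names are hypothetical and are not taken from the STT service factory; the only facts assumed are the ones stated here (Core default of false, per-bot override through stt.extra).

```python
from typing import Any, Dict, Mapping, Optional

# Hypothetical constant: CXB Core's default, which differs from the
# pipeline framework's library default (true).
CORE_DEFAULT_VAD_FORCE_TURN_ENDPOINT = False


def resolve_multilingual_stt_extra(
    extra: Optional[Mapping[str, Any]],
) -> Dict[str, Any]:
    """Apply the Core default, then let stt.extra override it per bot."""
    resolved: Dict[str, Any] = {
        "vad_force_turn_endpoint": CORE_DEFAULT_VAD_FORCE_TURN_ENDPOINT
    }
    resolved.update(extra or {})
    return resolved


# A bot with no override keeps the Core default.
print(resolve_multilingual_stt_extra(None))
# An operator flips it back for one bot via stt.extra.
print(resolve_multilingual_stt_extra({"vad_force_turn_endpoint": True}))
```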

LLM (language model)

| Provider | Example models | Notes |
|---|---|---|
| LLM provider A | <model-id> | Default. Via the LLM provider’s API. |
| LLM provider B | <model-id> | Via the LLM provider’s API. |
| Managed LLM platform | <model-id> | Requires project_id in llm.extra. |

The LLM receives the system prompt and conversation context. It can call built-in functions (end_call, transfer_call, detected_voicemail, search_knowledge) and custom tools defined in the bot config.

TTS (text-to-speech)

| Provider | Example models | Notes |
|---|---|---|
| TTS engine A | <model-id> | Default. Multilingual models require language code. |
| TTS engine B | <model-id> | Indian language support. |
| TTS engine C | | provider: "tts_c". ~20-language map (English, Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu, plus Spanish/French/German/Arabic/Portuguese/Japanese/Korean/Chinese/Russian/Italian). voice and model optional. |

Configured via tts.provider, tts.model, and tts.voice_id. The TTS service factory in the Core service handles the per-provider settings passthrough. The default TTS engine can be wrapped with Redis-backed TTS caching when tts.cache_config.enabled=true and TTS_CACHE_REDIS_URL is configured.

VAD and turn detection

  • Voice Activity Detection (VAD): An on-device VAD model detects when the customer is speaking. Configurable via vad in bot config (confidence threshold, start/stop timing, minimum volume).
  • Turn detection: An on-device turn-detection model determines when the customer has finished their turn, triggering the LLM response.
  • Turn-detecting STT: The turn-detecting STT engine has external turn detection, so CXB Core uses its turn events instead of VAD-based user-turn strategies for stt.provider = "stt_turn_detecting".
  • Interruptions: Enabled by default. When the customer speaks over the bot, TTS output is cancelled and the LLM processes the new input. min_words_interruption (default: 3) prevents accidental interruptions from short utterances.
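
As a rough illustration of the min_words_interruption gate, the sketch below counts whitespace-separated words in the overlapping speech. The function name is hypothetical, and both the word-splitting rule and the inclusive threshold (three or more words pass when the setting is 3) are assumptions; the source only states the default and the intent.

```python
def allows_interruption(transcript: str, min_words_interruption: int = 3) -> bool:
    """Return True when overlapping speech is long enough to interrupt the bot.

    Assumption: words are whitespace-separated and the threshold is inclusive.
    """
    return len(transcript.split()) >= min_words_interruption


# A short backchannel is ignored; a longer utterance cancels TTS output.
print(allows_interruption("uh huh"))
print(allows_interruption("wait I have a question"))
```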

Built-in functions

The LLM has access to built-in functions:
| Function | Behavior |
|---|---|
| end_call() | Bot speaks any final message, then hangs up. Sets disconnected_by = "bot". |
| transfer_call(reason) | Speaks pre-transfer message, then transfers to configured target. Native transfer-state is derived from pipeline events and added to the call results; see Transfer and escalation. |
| detected_voicemail() | Speaks voicemail message (if configured), then hangs up. Sets disconnected_by = "voicemail". |
| search_knowledge(query) | Searches attached CXB API knowledge bases when RAG is enabled for the bot. |

Custom tools defined in bot config are also registered. Custom tools can run immediate, speak-then-run, speak-and-run-parallel, or terminal-after-speech policies. Tool calls and results are logged in call events and turn metrics.

Re-engagement (dead air handling)

When the customer goes silent mid-call, CXB Core prompts them to respond:
  1. After gap_seconds of silence, speak a re-engagement message (shuffled, non-repeating)
  2. Reduce the gap for subsequent attempts
  3. After max_retries is exceeded, end the call with disconnected_by = "RNR"
Configured via re_engagement in bot config: messages (list of prompts), gap_seconds (int or [first, subsequent]), max_retries.
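
The retry loop above can be sketched as a small state holder. This is a minimal sketch, not CXB Core's implementation: the class and method names are hypothetical, a plain int gap_seconds is assumed to apply to every attempt, and "non-repeating" is assumed to mean no repeats until every message has been used once.

```python
import random
from typing import List, Optional, Sequence, Union


class ReEngagementPlan:
    """Hypothetical helper tracking re-engagement attempts for one call."""

    def __init__(
        self,
        messages: Sequence[str],
        gap_seconds: Union[int, Sequence[int]],
        max_retries: int,
        rng: Optional[random.Random] = None,
    ) -> None:
        if not messages:
            raise ValueError("re_engagement.messages must not be empty")
        if isinstance(gap_seconds, int):
            # Assumption: a single int is used for every attempt.
            self.first_gap = self.subsequent_gap = gap_seconds
        else:
            self.first_gap, self.subsequent_gap = gap_seconds
        self.messages = list(messages)
        self.max_retries = max_retries
        self.rng = rng or random.Random()
        self.attempts = 0
        self._queue: List[str] = []

    def next_gap(self) -> int:
        """Silence to wait before the next prompt."""
        return self.first_gap if self.attempts == 0 else self.subsequent_gap

    def on_silence(self) -> Optional[str]:
        """Return the next prompt, or None once max_retries is used up.

        On None the caller would end the call with disconnected_by = "RNR".
        """
        if self.attempts >= self.max_retries:
            return None
        if not self._queue:
            self._queue = self.messages[:]
            self.rng.shuffle(self._queue)
        self.attempts += 1
        return self._queue.pop()

    def on_user_turn_started(self) -> None:
        """A new user turn resets the retry counter."""
        self.attempts = 0


plan = ReEngagementPlan(["Are you there?", "Hello?"], [8, 5], max_retries=2)
print(plan.next_gap(), plan.on_silence(), plan.next_gap())
```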

Call events

The pipeline appends structured events to the call results as the conversation progresses. These are the primary tool for debugging why a call felt slow, wedged, or force-closed. All are emitted from the pipeline factory in the Core service.

Service errors

When the pipeline raises an error frame, CXB Core records a service_error event:
| Field | Meaning |
|---|---|
| service | Classified as llm, tts, stt, or unknown. Derived from the failing processor’s class name, falling back to matching provider names in the error message. |
| processor | Processor class name that raised the error. |
| message | Error string (truncated to 500 chars). |
| fatal | true when the error frame is fatal (terminates the call). |
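
One way to picture the classification is the heuristic below. It is a sketch under stated assumptions: the function names, the acronym regex, the example class names, and the provider-hint table are all hypothetical (the hints reuse the placeholder provider ids that appear in this document), and the real matching rules may differ.

```python
import re
from typing import Any, Dict

MAX_MESSAGE_CHARS = 500

# Hypothetical provider hints, keyed by service.
PROVIDER_HINTS = {
    "stt": ("stt_turn_detecting",),
    "tts": ("tts_c",),
}

# Match an acronym that ends a CamelCase word, e.g. "ExampleSTTService".
# The lookahead avoids reading "TTS" out of "STTService".
_ACRONYM = re.compile(r"(LLM|STT|TTS)(?=[A-Z][a-z]|$)")


def classify_service(processor: str, message: str) -> str:
    """Classify by processor class name, then by provider names in the message."""
    match = _ACRONYM.search(processor)
    if match:
        return match.group(1).lower()
    text = message.lower()
    for service, hints in PROVIDER_HINTS.items():
        if any(hint in text for hint in hints):
            return service
    return "unknown"


def build_service_error(processor: str, message: str, fatal: bool) -> Dict[str, Any]:
    """Assemble a service_error event with the four documented fields."""
    return {
        "service": classify_service(processor, message),
        "processor": processor,
        "message": message[:MAX_MESSAGE_CHARS],
        "fatal": fatal,
    }


print(build_service_error("ExampleSTTService", "socket closed", fatal=False))
```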

User-turn lifecycle

These events trace how each customer turn was detected and closed — essential for diagnosing dead-air and force-closed-turn bugs.
| Event | Meaning |
|---|---|
| user_turn_started | A user turn began. Carries strategy (turn strategy class name, or null). Resets the re-engagement retry counter. |
| user_turn_inference_triggered | A turn strategy fired and the LLM was triggered. Carries strategy. |
| turn_stop_timeout | No stop strategy fired before the timeout; the turn was force-closed by the watchdog without inference. |
| user_turn_stopped | The turn ended. Carries strategy, inference_triggered (whether the LLM was triggered), and had_content (whether the turn had transcript text). |

A user_turn_stopped with strategy = null, inference_triggered = false, and had_content = true means the turn was force-closed by the stop-timeout watchdog and its transcript was discarded without reaching the LLM. This is the signature of the dead-air bug.
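
When scanning call results for this signature, a filter along the following lines may help. The event shape is an assumption: each event is taken to be a dict with an "event" name key plus the fields listed above, and the function name is hypothetical.

```python
from typing import Any, Dict, List


def find_discarded_turns(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return user_turn_stopped events matching the dead-air signature."""
    return [
        e
        for e in events
        if e.get("event") == "user_turn_stopped"
        and e.get("strategy") is None
        and e.get("inference_triggered") is False
        and e.get("had_content") is True
    ]


events = [
    {"event": "user_turn_started", "strategy": None},
    {"event": "turn_stop_timeout"},
    {"event": "user_turn_stopped", "strategy": None,
     "inference_triggered": False, "had_content": True},
]
print(len(find_discarded_turns(events)))
```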

Recording

Audio is captured by an AudioBufferProcessor placed after transport output. On call end, the buffer is encoded as WAV and uploaded to S3-compatible object storage: DigitalOcean Spaces or MinIO depending on runtime config. The recording URL and storage key are included in call results.
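
The WAV encoding step can be illustrated with Python's standard wave module. This sketch is not the Core service's code: the function name is hypothetical, and the 8 kHz, mono, 16-bit defaults are assumptions chosen for the example, since the actual buffer format is not stated here. The upload step is omitted.

```python
import io
import wave


def encode_wav(
    pcm: bytes,
    sample_rate: int = 8000,  # assumed rate, not confirmed by the source
    channels: int = 1,        # assumed mono
    sample_width: int = 2,    # bytes per sample, i.e. 16-bit PCM
) -> bytes:
    """Wrap raw PCM frames in a WAV container, entirely in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(sample_width)
        writer.setframerate(sample_rate)
        writer.writeframes(pcm)
    return buffer.getvalue()


# 20 ms of silence at the assumed format: 160 frames of 2 bytes each.
wav_bytes = encode_wav(b"\x00\x00" * 160)
print(wav_bytes[:4], len(wav_bytes))
```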

Max duration

max_call_duration_seconds in bot config (default: 600) caps the call length. When the limit is reached, the pipeline automatically ends the call and sets disconnected_by = "timeout".

Latency tracking

Per-turn latency is measured and collected as samples (latency_samples):
| Per-turn sample | What it measures |
|---|---|
| stt_ms | Time-to-first-byte from STT processor |
| llm_ms | Time-to-first-byte from LLM processor |
| tts_ms | Time-to-first-byte from TTS processor |
| tool_ms | Custom tool execution latency, when tools ran |
| rag_ms | Knowledge search latency, when RAG ran |
| total_ms | End-to-end response latency |

At call end these samples are averaged into the latency object on the call results (LatencyData in the Core service’s results model, populated by the post-call latency averaging step). The averaged fields are named distinctly from the per-turn samples:
| Averaged field | Source sample |
|---|---|
| stt_avg_ms | mean of stt_ms |
| llm_avg_ms | mean of llm_ms |
| tts_avg_ms | mean of tts_ms |
| tool_avg_ms | mean of tool_ms |
| rag_avg_ms | mean of rag_ms |
| total_avg_response_ms | mean of total_ms |

Both the raw latency_samples list and the averaged latency object are included in the final call results.
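
The sample-to-average mapping can be sketched as follows. The field names come from the tables above; everything else is an assumption for illustration, including the function name, the choice to skip turns where a sample is absent (for example tool_ms when no tool ran), and returning None for a field with no samples.

```python
from statistics import fmean
from typing import Dict, List, Optional

# Per-turn sample key -> averaged field name on the latency object.
FIELD_MAP = {
    "stt_ms": "stt_avg_ms",
    "llm_ms": "llm_avg_ms",
    "tts_ms": "tts_avg_ms",
    "tool_ms": "tool_avg_ms",
    "rag_ms": "rag_avg_ms",
    "total_ms": "total_avg_response_ms",
}


def average_latency(
    samples: List[Dict[str, Optional[float]]],
) -> Dict[str, Optional[float]]:
    """Average each per-turn sample, ignoring turns where it is missing."""
    averaged: Dict[str, Optional[float]] = {}
    for sample_key, avg_key in FIELD_MAP.items():
        values = [s[sample_key] for s in samples if s.get(sample_key) is not None]
        averaged[avg_key] = fmean(values) if values else None
    return averaged


latency_samples = [
    {"stt_ms": 100, "llm_ms": 300, "tts_ms": 80, "total_ms": 500},
    {"stt_ms": 200, "llm_ms": 500, "tts_ms": 120, "tool_ms": 40, "total_ms": 900},
]
print(average_latency(latency_samples))
```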

Live prompt caching

For non-policy bots, the static portion of the system prompt can be served from a provider-side prompt cache to cut LLM cost and time-to-first-token on the live call. The LLM provider and managed LLM platform use explicit CachedContent; the other LLM provider uses prompt_cache_key/prompt_cache_retention request hints. CXB Core reports per-call cache hit/miss metrics on the first live llm usage entry (cache.namespace = "live_prompt"). See Caching for the full design, including how static and dynamic prompt portions must be split and the lifecycle of the cache registry.