CXB Core caches the static portion of LLM prompts so that the model does not re-process the same system instructions (and tool specs) on every turn. Two independent caches share the same registry/Redis pattern:

  • Live prompt cache: caches the static system prompt and tools used during the live call. Cuts per-turn latency and input-token cost.
  • Post-call context cache: caches the analysis/QC system prompt for post-call LLM passes. Cuts post-call input-token cost.

Both use explicit CachedContent, supported only by the LLM provider and the managed LLM platform. The other LLM provider uses request-level hints instead of the cache registry.

Why

A bot’s system prompt — policy, compliance rules, objection handling, tool specs — is large and identical across every turn and every call for that bot version. Re-sending it each turn wastes input tokens and adds latency. Caching the static block lets the provider charge the cheaper cached-input rate and skip re-tokenizing it.
Never cache a fully rendered system prompt that contains per-customer data. Reuse is poor and dynamic context can leak across calls. The prompt must be split into a static block and a fresh runtime block first.

Prompt split

The split is configured per bot via PromptPartsConfig (prompt_parts in bot config, defined in the Core service’s config model):

| Field | Purpose |
| --- | --- |
| mode | legacy (no split, no cache) or direct (split prompt, cache eligible) |
| cache_enabled | Master toggle for live prompt caching on this bot |
| static_system_prompt | The cacheable block: policy, compliance, flow, tool-usage rules, disposition schema |
| dynamic_runtime_prompt | Fresh per-call block: customer variables, CRM fields, dates, attempt data |
| static_version | Cache version; bump to invalidate when the static prompt changes |
| prompt_cache_key | Cache routing key (request-hint provider only) |
| prompt_cache_retention | in_memory or 24h (request-hint provider only) |

Live caching only engages when mode == "direct" and cache_enabled is true (resolved in the Core service’s pipeline factory).
Direct-prompt bots carry both system_prompt and static_system_prompt. When the live cache is active, CXB Core builds the live LLM context from the static block plus dynamic_runtime_prompt — not the legacy system_prompt. Keep both in sync.
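
As a rough illustration of the eligibility rule and the block selection described above, the sketch below models the split with a small dataclass. `PromptParts`, `live_cache_eligible`, and `select_prompt_blocks` are hypothetical names invented for this example; only the field names come from the field list above, and the real PromptPartsConfig may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PromptParts:
    """Hypothetical stand-in for PromptPartsConfig (field names from the docs)."""
    mode: str = "legacy"
    cache_enabled: bool = False
    static_system_prompt: str = ""
    dynamic_runtime_prompt: str = ""
    static_version: str = "1"
    prompt_cache_key: Optional[str] = None
    prompt_cache_retention: Optional[str] = None


def live_cache_eligible(parts: PromptParts) -> bool:
    # Both conditions must hold; a legacy-mode bot is never cached.
    return parts.mode == "direct" and parts.cache_enabled


def select_prompt_blocks(
    parts: PromptParts, legacy_system_prompt: str
) -> Tuple[Optional[str], str]:
    """Return (cacheable_block, fresh_block) for the live LLM context."""
    if live_cache_eligible(parts):
        # Static block is cached; the runtime block is sent fresh on each call.
        return parts.static_system_prompt, parts.dynamic_runtime_prompt
    # No cacheable block: the legacy prompt is sent as-is.
    return None, legacy_system_prompt
```

Because the legacy system_prompt is ignored whenever the cache is active, a drift between it and static_system_prompt only shows up when caching is switched off, which is why the two should be kept in sync.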

Live prompt cache

Explicit-cache providers

For the LLM provider and the managed LLM platform, CXB Core resolves a CachedContent name and injects it before the LLM service is created (in the Core service’s pipeline factory):
generation_config["cached_content"] = live_prompt_cache_result.cached_content_name
generation_config["tools"] = None
generation_config["tool_config"] = None
When cached_content is active, the context is built with tools=NOT_GIVEN and system_instruction is omitted per-request — the LLM provider requires that tools and the system instruction live inside the cached content, not in the request.
Pipeline-framework builtin-tool caveat. With tools=NOT_GIVEN, the pipeline framework’s BaseLLMAdapter.from_standard_tools skips its builtin-tool injection entirely (it only merges builtin tools when the input is a ToolsSchema). CXB Core is safe today because every tool (end_call, transfer_call, detected_voicemail, search_knowledge, custom tools) is registered via register_function(...), not as a builtin tool. If a future pipeline-framework upgrade introduces a builtin tool we need, add its spec inside CreateCachedContentConfig.tools in the live-prompt-cache tool-conversion helper — do not re-enable per-request tools; the provider rejects that combination. Verified against the pinned pipeline framework version.

Request-hint provider

The request-hint LLM provider does not use the cache registry. When the bot is direct-mode + cache-enabled and prompt_cache_key is set, CXB Core passes request hints (in the Core service’s LLM factory):
  • prompt_cache_key: routing key for the provider’s automatic prefix cache
  • prompt_cache_retention: in_memory or 24h
No CachedContent is created; CXB Core relies on the provider’s automatic prefix caching and its reported cached-token usage.
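
A minimal sketch of how the hints might be assembled, assuming they are passed as plain keyword arguments on the request. `build_request_hints` is a hypothetical helper, and silently dropping an unrecognized retention value is an assumption made for this example, not documented behaviour.

```python
from typing import Dict, Optional

# The two retention values named in the docs.
ALLOWED_RETENTION = {"in_memory", "24h"}


def build_request_hints(
    mode: str,
    cache_enabled: bool,
    prompt_cache_key: Optional[str],
    prompt_cache_retention: Optional[str] = None,
) -> Dict[str, str]:
    """Return request-level cache hints, or {} when the bot is not eligible."""
    if mode != "direct" or not cache_enabled or not prompt_cache_key:
        return {}
    hints = {"prompt_cache_key": prompt_cache_key}
    # Assumption: an unknown retention value is dropped rather than rejected.
    if prompt_cache_retention in ALLOWED_RETENTION:
        hints["prompt_cache_retention"] = prompt_cache_retention
    return hints
```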

Cache resolution order

get_or_create_live_prompt_cache (in the Core service’s live-prompt-cache module) resolves the cache name in this order:
1. Prewarmed name. If CXB API supplied a prewarmed live_prompt_cache_state.cache_name, it short-circuits everything (no registry, Redis, or managed-platform create). Status hit, source prewarmed.
2. In-process registry. A bounded OrderedDict (max 512 entries) keyed by provider + client identity + model + prompt hash + version + tools hash. A fresh entry is returned directly.
3. Redis. Shared across all 16 workers via cxbcore:live_prompt_cache: keys. A hit is promoted into the local registry so siblings skip creation.
4. Create. Otherwise CXB Core creates a new CachedContent (static prompt as system_instruction, converted tools, TTL), then writes it to the registry and Redis.

Entries are proactively refreshed once they pass ~90% of their TTL (_LIVE_PROMPT_CACHE_REFRESH_RATIO = 0.9); least-recently-used entries are evicted when the registry is full.
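
The resolution order above can be sketched as follows. This is an illustration under stated assumptions, not the real get_or_create_live_prompt_cache: `Registry`, `registry_key`, and `resolve` are hypothetical names, a plain dict stands in for Redis, SHA-256 is an assumed hash choice, and an entry past 90% of its TTL is simply treated as stale and recreated.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Callable, Dict, Optional, Tuple

MAX_ENTRIES = 512      # registry bound named in the docs
REFRESH_RATIO = 0.9    # refresh once 90% of the TTL has elapsed

Record = Tuple[str, float, float]  # (cache_name, created_at, ttl_seconds)


def registry_key(provider: str, client_id: str, model: str,
                 static_prompt: str, version: str, tools_json: str) -> str:
    """Build the key from the six components listed in the docs."""
    prompt_hash = hashlib.sha256(static_prompt.encode()).hexdigest()[:16]
    tools_hash = hashlib.sha256(tools_json.encode()).hexdigest()[:16]
    return ":".join([provider, client_id, model, prompt_hash, version, tools_hash])


def _is_fresh(record: Record, now: float) -> bool:
    _, created_at, ttl = record
    return now - created_at < ttl * REFRESH_RATIO


class Registry:
    """Bounded in-process registry with least-recently-used eviction."""

    def __init__(self, max_entries: int = MAX_ENTRIES) -> None:
        self._entries: "OrderedDict[str, Record]" = OrderedDict()
        self._max = max_entries

    def get_fresh(self, key: str, now: float) -> Optional[str]:
        record = self._entries.get(key)
        if record is None or not _is_fresh(record, now):
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return record[0]

    def put(self, key: str, record: Record) -> None:
        self._entries[key] = record
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used


def resolve(key: str, prewarmed_name: Optional[str], registry: Registry,
            shared: Dict[str, Record], create: Callable[[], str],
            ttl_seconds: float, now: Optional[float] = None) -> Tuple[str, str]:
    """Return (cache_name, source), trying the four tiers in order."""
    now = time.time() if now is None else now
    if prewarmed_name:                       # 1. prewarmed name short-circuits
        return prewarmed_name, "prewarmed"
    name = registry.get_fresh(key, now)      # 2. in-process registry
    if name:
        return name, "registry"
    record = shared.get(key)                 # 3. shared store (Redis stand-in)
    if record and _is_fresh(record, now):
        registry.put(key, record)            # promote into the local registry
        return record[0], "redis"
    record = (create(), now, ttl_seconds)    # 4. create and publish
    registry.put(key, record)
    shared[key] = record
    return record[0], "created"
```

Because the version is part of the key, changing it makes every tier miss and forces a create, without touching the old entry.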

Lifecycle ownership

CXB API owns the scheduled lifecycle; CXB Core is a consumer that prefers CXB API’s prewarmed name and self-heals when it expires.

| Responsibility | Owner | Reference |
| --- | --- | --- |
| 7am prewarm (default prewarm_hour=7) | CXB API | Live-prompt-cache lifecycle service |
| 11pm cleanup (default cleanup_hour=23, disabled bots only) | CXB API | Live-prompt-cache lifecycle service |
| 10h TTL (default ttl_hours=10) | CXB API | Live-prompt-cache lifecycle service |
| Missed-run catch-up | CXB API | _catch_up_missed_runs |
| Inline single-flight recreate | CXB API | Internal live-cache route (/recreate) |
| Audit events | CXB API | Live-prompt-cache audit service |
| Prefer prewarmed name, in-process registry, Redis sharing, near-TTL refresh | CXB Core | Live-prompt-cache module |

Cache refresh is version-based, not delete-based. To force a refresh, bump static_version (which changes the registry key) and invalidate the CXB API bot-config cache. Do not try to delete provider-side caches across all workers — old caches age out by TTL. Cleanup deliberately targets disabled bots only, because deleting an enabled bot’s cache nightly created a dead window between cleanup and the next prewarm.

Cache-expiry recovery

A cached content name can become unusable mid-call — TTL expiry (400 INVALID_ARGUMENT ... is expired) or deletion/aging-out (404 ... cached content metadata ... not found). Both are recoverable. The Core service’s cache-recovery LLM service wrappers handle it:
1. Detect during iteration. The provider’s response stream is lazy: the error surfaces while iterating the response, not when the stream is awaited. CXB Core wraps the iteration itself so the error is caught.
2. Evict the stale name. Removes it from the local registry and Redis via invalidate_live_prompt_cache.
3. Fetch a replacement. Reads the current prewarmed name from CXB API (live_prompt_cache_state.cache_name); if missing or unchanged, POSTs to CXB API /recreate for an inline single-flight create.
4. Swap for the next turn. Mutates self._settings.extra["generation_config"]["cached_content"]. The pipeline framework re-reads self._settings.extra per _stream_content call, so the next turn on the same call uses the new cache.

Customer experience: the current turn fails (a moment of silence on one turn), but the call is not dropped — the next turn continues on the fresh cache. The alternative (failing the call) would break every call referencing the expired cache. Audit events (expired_in_call, swap_after_expiry) are posted to CXB API fire-and-forget.
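
The recovery steps above can be sketched as an async generator that wraps the iteration rather than the await. Everything here is illustrative: the function names are hypothetical, the callbacks (`evict`, `read_prewarmed`, `recreate`) stand in for the real registry and CXB API calls, matching errors by message text is an assumption, and ending the stream quietly (rather than re-raising) is one possible way to fail only the current turn.

```python
from typing import Any, AsyncIterator, Awaitable, Callable, Dict, Optional


def is_recoverable_cache_error(exc: Exception) -> bool:
    """Recognise the expired (400) and not-found (404) cases by message text."""
    text = str(exc).lower()
    expired = "invalid_argument" in text and "is expired" in text
    missing = "cached content" in text and "not found" in text
    return expired or missing


async def pick_replacement(
    stale: Optional[str],
    read_prewarmed: Callable[[], Awaitable[Optional[str]]],
    recreate: Callable[[], Awaitable[str]],
) -> str:
    """Prefer the current prewarmed name; recreate if missing or unchanged."""
    name = await read_prewarmed()
    if not name or name == stale:
        name = await recreate()
    return name


async def iterate_with_recovery(
    stream: AsyncIterator[Any],
    generation_config: Dict[str, Any],
    evict: Callable[[Optional[str]], Awaitable[None]],
    read_prewarmed: Callable[[], Awaitable[Optional[str]]],
    recreate: Callable[[], Awaitable[str]],
) -> AsyncIterator[Any]:
    """Yield chunks; on a cache-expiry error, swap in a fresh cache name."""
    try:
        # The try block must cover the loop: a lazy stream raises here,
        # not at the point where the stream object was obtained.
        async for chunk in stream:
            yield chunk
    except Exception as exc:
        if not is_recoverable_cache_error(exc):
            raise
        stale = generation_config.get("cached_content")
        await evict(stale)
        generation_config["cached_content"] = await pick_replacement(
            stale, read_prewarmed, recreate
        )
        # Assumption: the current turn ends here with whatever was already
        # yielded; the next turn reads the updated generation_config.
```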

Post-call context cache

Post-call analysis, QC, and callback extraction share the same registry/Redis pattern in the Core service’s post-call processor (cxbcore:post_call_cache: keys, max 512 entries, 90% TTL refresh).

Configuration

Enable per bot via either the legacy flat fields or the nested post_call_cache dict (the post-call orchestration reads both):

| Field | Purpose |
| --- | --- |
| post_call_cache_enabled | Legacy flat toggle |
| post_call_cache_version | Legacy flat version |
| post_call_cache.enabled | Nested toggle (overrides flat) |
| post_call_cache.analysis_version | Per-namespace version for analysis |
| post_call_cache.qc_version | Per-namespace version for QC |

The cached system_instruction is the prompt without the injected per-call date context — the date block is sent as fresh content so the cache key stays stable across calls (cache_system_prompt vs system_prompt in the post-call processor).
CXB API computes the version from raw analysis/QC prompt templates, not rendered per-call values. Invalidation is version-based: bump the version, the registry key changes, and the next call creates a fresh CachedContent. Never delete-based.
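
A small sketch of how the flat and nested settings might be reconciled. `resolve_post_call_cache` is a hypothetical helper; the nested toggle overriding the flat one comes from the field descriptions above, while falling back to the flat version when no per-namespace version is set is an assumption.

```python
from typing import Any, Dict, Optional, Tuple


def resolve_post_call_cache(
    bot_config: Dict[str, Any], namespace: str
) -> Tuple[bool, Optional[str]]:
    """Return (enabled, version) for a namespace such as "analysis" or "qc"."""
    nested = bot_config.get("post_call_cache") or {}
    # The nested toggle wins whenever it is present, even if it is False.
    enabled = nested.get("enabled", bot_config.get("post_call_cache_enabled", False))
    # Assumption: a missing per-namespace version falls back to the flat one.
    version = nested.get(
        namespace + "_version", bot_config.get("post_call_cache_version")
    )
    return bool(enabled), version
```

With this shape, bumping only qc_version changes the QC registry key and leaves the analysis cache untouched.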

Request-hint provider

Post-call context caching is not supported on the request-hint LLM provider. When cache_enabled is set on a post-call pass for that provider, the usage cache metadata reports status unsupported_provider with reason post_call_cache_not_supported.

Stale retry

If a cached generate fails (expired cache), the post-call path evicts the registry entry and retries once with system_instruction inline (no cache). The usage cache metadata records status stale_retry.
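
The retry can be sketched as below. `generate_with_stale_retry` and the `generate` callable's keyword arguments are hypothetical, and catching every exception from the cached attempt is a simplification; the real path presumably narrows this to cache-expiry errors.

```python
from typing import Any, Callable, Tuple


def generate_with_stale_retry(
    generate: Callable[..., Any],
    cached_name: str,
    system_instruction: str,
    evict: Callable[[str], None],
    resolved_status: str = "hit",
) -> Tuple[Any, str]:
    """Try the cached request once; on failure retry once with the prompt inline."""
    try:
        result = generate(cached_content=cached_name, system_instruction=None)
        return result, resolved_status
    except Exception:
        # Drop the stale entry so later calls do not reuse it.
        evict(cached_name)
        # Single retry without the cache; an error here propagates to the caller.
        result = generate(cached_content=None, system_instruction=system_instruction)
        return result, "stale_retry"
```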

Visibility

Every LLM usage entry carries a cache dict (UsageEntry.cache in the Core service’s results model) with enabled, status, namespace, version, reason.

| Namespace | Where attached |
| --- | --- |
| live_prompt | First live LLM usage entry only |
| post_call_analysis | Post-call analysis pass |
| qc_analysis | QC pass |
| callback_extraction | Callback-detection pass (caching disabled) |

Cache status values include disabled, hit, created, fallback, ineligible, unsupported_provider, stale_retry.
For the request-hint LLM provider, live hit/miss is derived only from real cache_read_input_tokens (hit when > 0, miss when 0). Do not invent estimated tokens or money saved. For the explicit-cache providers, cache-creation tokens come from the cache’s usage_metadata.
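
For the request-hint case, the hit/miss rule reduces to a single comparison on the reported token count. The helper below is a hypothetical sketch that fills the five cache fields listed above; the real UsageEntry.cache construction may carry more detail.

```python
from typing import Any, Dict, Optional


def request_hint_cache_entry(
    cache_read_input_tokens: int, version: Optional[str]
) -> Dict[str, Any]:
    """Build a live_prompt cache dict from provider-reported tokens only."""
    # No estimation: the status reflects exactly what the provider reported.
    status = "hit" if cache_read_input_tokens > 0 else "miss"
    return {
        "enabled": True,
        "status": status,
        "namespace": "live_prompt",
        "version": version,
        "reason": None,
    }
```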
CXB API’s build_llm_cache_summary (in the call service) is scoped to post-call/QC only (type in {"post_call", "qc"}) so dashboard aggregate cards are not polluted by live-conversation cache data. The summary reports token facts only, never money. CXB Console Call Detail renders separate Live Prompt Cache and Post-Call Cache sections.
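
The scoping rule can be illustrated with a filter over usage entries. This is not the real build_llm_cache_summary: `summarize_post_call_cache` is a hypothetical function, and the `cached_input_tokens` field name is assumed for the example.

```python
from collections import Counter
from typing import Any, Dict, Iterable

POST_CALL_TYPES = {"post_call", "qc"}


def summarize_post_call_cache(
    usage_entries: Iterable[Dict[str, Any]]
) -> Dict[str, Any]:
    """Aggregate cache status counts and cached tokens for post-call/QC only."""
    statuses: Counter = Counter()
    cached_tokens = 0
    for entry in usage_entries:
        if entry.get("type") not in POST_CALL_TYPES:
            continue  # live-conversation entries are excluded from the summary
        cache = entry.get("cache") or {}
        statuses[cache.get("status", "disabled")] += 1
        # Token facts only; no cost or savings figure is derived.
        cached_tokens += int(entry.get("cached_input_tokens") or 0)
    return {"status_counts": dict(statuses), "cached_input_tokens": cached_tokens}
```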

Bot configuration

prompt_parts, post_call_cache, and related fields.

Pipeline

Where the LLM service and live cache are wired into the pipeline.