Prompt caching - CX Bridge

CXB Core caches the static portion of LLM prompts so that the model does not re-process the same system instructions (and tool specs) on every turn. Two independent caches share the same registry/Redis pattern:

Live prompt cache

Caches the static system prompt + tools used during the live call. Cuts per-turn latency and input-token cost.

Post-call context cache

Caches the analysis/QC system prompt for post-call LLM passes. Cuts post-call input-token cost.

Both use explicit CachedContent, which only the explicit-cache providers support: the LLM provider and the managed LLM platform. The request-hint LLM provider relies on request-level hints instead of the cache registry.

Why

A bot’s system prompt — policy, compliance rules, objection handling, tool specs — is large and identical across every turn and every call for that bot version. Re-sending it each turn wastes input tokens and adds latency. Caching the static block lets the provider charge the cheaper cached-input rate and skip re-tokenizing it.

Never cache a fully rendered system prompt that contains per-customer data. Reuse is poor and dynamic context can leak across calls. The prompt must be split into a static block and a fresh runtime block first.

Prompt split

The split is configured per bot via PromptPartsConfig (prompt_parts in bot config, defined in the Core service’s config model):

Field	Purpose
`mode`	`legacy` (no split, no cache) or `direct` (split prompt, cache eligible)
`cache_enabled`	Master toggle for live prompt caching on this bot
`static_system_prompt`	The cacheable block: policy, compliance, flow, tool-usage rules, disposition schema
`dynamic_runtime_prompt`	Fresh per-call block: customer variables, CRM fields, dates, attempt data
`static_version`	Cache version — bump to invalidate when the static prompt changes
`prompt_cache_key`	Cache routing key (request-hint provider only)
`prompt_cache_retention`	`in_memory` or `24h` (request-hint provider only)

Live caching only engages when mode == "direct" and cache_enabled is true (resolved in the Core service’s pipeline factory).

Direct-prompt bots carry both system_prompt and static_system_prompt. When the live cache is active, CXB Core builds the live LLM context from the static block plus dynamic_runtime_prompt — not the legacy system_prompt. Keep both in sync.
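
As a rough illustration of the gating rule and the static/runtime split, the sketch below models prompt_parts as a dataclass. The field names come from the table above. The field types, the defaults, and the two helpers (live_cache_eligible, select_prompt_parts) are assumptions made for this example, not the Core service's actual code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PromptPartsConfig:
    """Illustrative mirror of the prompt_parts fields (types/defaults assumed)."""
    mode: str = "legacy"                 # "legacy" or "direct"
    cache_enabled: bool = False          # master toggle for live prompt caching
    static_system_prompt: str = ""       # cacheable block
    dynamic_runtime_prompt: str = ""     # fresh per-call block
    static_version: str = "1"            # bump to invalidate
    prompt_cache_key: Optional[str] = None
    prompt_cache_retention: Optional[str] = None  # "in_memory" or "24h"


def live_cache_eligible(parts: PromptPartsConfig) -> bool:
    # Hypothetical helper: caching engages only in direct mode with the toggle on.
    return parts.mode == "direct" and parts.cache_enabled


def select_prompt_parts(
    parts: PromptPartsConfig, legacy_system_prompt: str
) -> Tuple[Optional[str], str]:
    """Return (cached_block, fresh_block) for the live context.

    Hypothetical helper: with the live cache active, the static block is the
    cached part and the runtime block is sent fresh; otherwise the legacy
    system_prompt is sent in full and nothing is cached.
    """
    if live_cache_eligible(parts):
        return parts.static_system_prompt, parts.dynamic_runtime_prompt
    return None, legacy_system_prompt
```

Because the legacy system_prompt is still the fallback path in this sketch, a bot whose two prompts drift apart would behave differently depending on whether the cache is active, which is why the two need to stay in sync.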

Live prompt cache

Explicit-cache providers

For the LLM provider and the managed LLM platform, CXB Core resolves a CachedContent name and injects it before the LLM service is created (in the Core service’s pipeline factory):

generation_config["cached_content"] = live_prompt_cache_result.cached_content_name
generation_config["tools"] = None
generation_config["tool_config"] = None

When cached_content is active, the context is built with tools=NOT_GIVEN and system_instruction is omitted per-request — the LLM provider requires that tools and the system instruction live inside the cached content, not in the request.

Pipeline-framework builtin-tool caveat. With tools=NOT_GIVEN, the pipeline framework’s BaseLLMAdapter.from_standard_tools skips its builtin-tool injection entirely (it only merges builtin tools when the input is a ToolsSchema). CXB Core is safe today because every tool (end_call, transfer_call, detected_voicemail, search_knowledge, custom tools) is registered via register_function(...), not as a builtin tool. If a future pipeline-framework upgrade introduces a builtin tool we need, add its spec inside CreateCachedContentConfig.tools in the live-prompt-cache tool-conversion helper — do not re-enable per-request tools; the provider rejects that combination. Verified against the pinned pipeline framework version.

Request-hint provider

The request-hint LLM provider does not use the cache registry. When the bot is direct-mode + cache-enabled and prompt_cache_key is set, CXB Core passes request hints (in the Core service’s LLM factory):

prompt_cache_key — routing key for the provider’s automatic prefix cache
prompt_cache_retention — in_memory or 24h

No CachedContent is created; CXB Core relies on the provider’s automatic prefix caching and its reported cached-token usage.
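
The hint-passing rule can be sketched as a small pure function. The two hint names and the allowed retention values are taken from the text above. The function name build_request_hints and the decision to reject unknown retention values with a ValueError are assumptions for illustration.

```python
from typing import Dict, Optional

_ALLOWED_RETENTION = {"in_memory", "24h"}


def build_request_hints(
    mode: str,
    cache_enabled: bool,
    prompt_cache_key: Optional[str],
    prompt_cache_retention: Optional[str] = None,
) -> Dict[str, str]:
    """Return the request hints to pass, or {} when hints do not apply."""
    # Hints are only sent for direct-mode, cache-enabled bots with a key set.
    if mode != "direct" or not cache_enabled or not prompt_cache_key:
        return {}
    hints = {"prompt_cache_key": prompt_cache_key}
    if prompt_cache_retention is not None:
        # Assumption: unknown retention values are rejected rather than passed on.
        if prompt_cache_retention not in _ALLOWED_RETENTION:
            raise ValueError(f"unsupported retention: {prompt_cache_retention!r}")
        hints["prompt_cache_retention"] = prompt_cache_retention
    return hints
```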

Cache resolution order

get_or_create_live_prompt_cache (in the Core service’s live-prompt-cache module) resolves the cache name in this order:

Prewarmed name

If CXB API supplied a prewarmed live_prompt_cache_state.cache_name, it short-circuits everything else (no registry lookup, Redis lookup, or managed-platform create). The result is reported with status hit and source prewarmed.

In-process registry

A bounded OrderedDict (max 512 entries) keyed by provider + client identity + model + prompt hash + version + tools hash. A fresh entry is returned directly.

Redis

Shared across all 16 workers via cxbcore:live_prompt_cache: keys. A hit is promoted into the local registry so siblings skip creation.

Create

Otherwise CXB Core creates a new CachedContent (static prompt as system_instruction, converted tools, TTL), then writes it to the registry and Redis.
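
The four steps above can be sketched as a single lookup chain. This is a simplified model, not the body of get_or_create_live_prompt_cache: the registry is a plain dict, Redis is represented by two injected callables, freshness checks are left out, and every source label other than prewarmed is an assumed name.

```python
from typing import Callable, Dict, Optional, Tuple


def resolve_cache_name(
    prewarmed_name: Optional[str],
    key: str,
    registry: Dict[str, str],
    redis_get: Callable[[str], Optional[str]],
    redis_set: Callable[[str, str], None],
    create: Callable[[], str],
) -> Tuple[str, str]:
    """Return (cache_name, source) following the documented order."""
    # 1. A prewarmed name wins outright; nothing else is consulted.
    if prewarmed_name:
        return prewarmed_name, "prewarmed"
    # 2. In-process registry.
    name = registry.get(key)
    if name:
        return name, "registry"
    # 3. Redis, shared across workers; promote a hit into the local registry.
    name = redis_get(key)
    if name:
        registry[key] = name
        return name, "redis"
    # 4. Create, then publish to both the registry and Redis.
    name = create()
    registry[key] = name
    redis_set(key, name)
    return name, "created"
```

Passing the Redis operations as callables keeps the sketch runnable without a Redis client; a dict's get and __setitem__ are enough to exercise it.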

Entries are proactively refreshed once they pass ~90% of their TTL (_LIVE_PROMPT_CACHE_REFRESH_RATIO = 0.9); least-recently-used entries are evicted when the registry is full.
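
A minimal sketch of a bounded registry with least-recently-used eviction and the near-TTL freshness rule follows. The 512-entry bound and the 0.9 ratio come from the text. The class name BoundedRegistry, its method names, and the stored tuple layout are assumptions, and the real module may track freshness differently.

```python
import time
from collections import OrderedDict
from typing import Optional, Tuple

_LIVE_PROMPT_CACHE_REFRESH_RATIO = 0.9
_MAX_ENTRIES = 512


class BoundedRegistry:
    """Illustrative LRU registry; entries count as stale past 90% of their TTL."""

    def __init__(self, max_entries: int = _MAX_ENTRIES) -> None:
        # key -> (cache_name, created_at, ttl_seconds)
        self._entries: "OrderedDict[str, Tuple[str, float, float]]" = OrderedDict()
        self._max = max_entries

    def put(self, key: str, name: str, ttl_seconds: float,
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._entries[key] = (name, now, ttl_seconds)
        self._entries.move_to_end(key)           # mark as most recently used
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)    # evict least recently used

    def get_fresh(self, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        name, created_at, ttl = entry
        # Past the refresh threshold: report a miss so the caller recreates.
        if now - created_at >= ttl * _LIVE_PROMPT_CACHE_REFRESH_RATIO:
            return None
        self._entries.move_to_end(key)
        return name
```

With the default 10h TTL, the threshold in this sketch falls at 9h after creation, so an entry is replaced before the provider would reject it.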

Lifecycle ownership

CXB API owns the scheduled lifecycle; CXB Core is a consumer that prefers CXB API’s prewarmed name and self-heals when it expires.

Responsibility	Owner	Reference
7am prewarm (default `prewarm_hour=7`)	CXB API	Live-prompt-cache lifecycle service
11pm cleanup (default `cleanup_hour=23`, disabled bots only)	CXB API	Live-prompt-cache lifecycle service
10h TTL (default `ttl_hours=10`)	CXB API	Live-prompt-cache lifecycle service
Missed-run catch-up	CXB API	`_catch_up_missed_runs`
Inline single-flight recreate	CXB API	Internal live-cache route (`/recreate`)
Audit events	CXB API	Live-prompt-cache audit service
Prefer prewarmed name, in-process registry, Redis sharing, near-TTL refresh	CXB Core	Live-prompt-cache module

Cache refresh is version-based, not delete-based. To force a refresh, bump static_version (which changes the registry key) and invalidate the CXB API bot-config cache. Do not try to delete provider-side caches across all workers — old caches age out by TTL. Cleanup deliberately targets disabled bots only, because deleting an enabled bot’s cache nightly created a dead window between cleanup and the next prewarm.
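
To see why a version bump is enough to force a refresh, consider a sketch of the registry key. The key components are the ones listed for the in-process registry (provider, client identity, model, prompt hash, version, tools hash). The SHA-256 truncation, the ":" separator, and the function name live_cache_key are assumptions.

```python
import hashlib


def _short_hash(text: str) -> str:
    # Assumption: any stable digest works; SHA-256 is used here for illustration.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def live_cache_key(
    provider: str,
    client_identity: str,
    model: str,
    static_prompt: str,
    static_version: str,
    tools_json: str,
) -> str:
    """Build an illustrative registry key from the documented components."""
    return ":".join([
        provider,
        client_identity,
        model,
        _short_hash(static_prompt),
        str(static_version),
        _short_hash(tools_json),
    ])
```

Bumping static_version yields a different key even when the prompt text is unchanged, so the next lookup misses and creates a new cache while the old one simply ages out.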

Cache-expiry recovery

A cached content name can become unusable mid-call — TTL expiry (400 INVALID_ARGUMENT ... is expired) or deletion/aging-out (404 ... cached content metadata ... not found). Both are recoverable. The Core service’s cache-recovery LLM service wrappers handle it:

Detect during iteration

The provider’s response stream is lazy — the error surfaces while iterating the response, not when the stream is awaited. CXB Core wraps the iteration itself so the error is caught.

Evict the stale name

Removes it from the local registry and Redis via invalidate_live_prompt_cache.

Fetch a replacement

Reads the current prewarmed name from CXB API (live_prompt_cache_state.cache_name); if missing or unchanged, POSTs to CXB API /recreate for an inline single-flight create.

Swap for the next turn

Mutates self._settings.extra["generation_config"]["cached_content"]. The pipeline framework re-reads self._settings.extra per _stream_content call, so the next turn on the same call uses the new cache.

Customer experience: the current turn fails (a moment of silence on one turn), but the call is not dropped — the next turn continues on the fresh cache. The alternative (failing the call) would break every call referencing the expired cache. Audit events (expired_in_call, swap_after_expiry) are posted to CXB API fire-and-forget.
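
The recovery steps can be sketched as an async generator that wraps the iteration itself. Everything here is a simplified model under stated assumptions: the error is recognized by message substrings, the evict, fetch_prewarmed, and recreate callables stand in for invalidate_live_prompt_cache, the CXB API state read, and the /recreate call, and the function name iterate_with_recovery is hypothetical. The real wrappers live inside the pipeline framework's LLM service.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Dict, Optional


def is_cache_expiry_error(exc: Exception) -> bool:
    # Assumption: match on message fragments of the 400/404 errors.
    text = str(exc).lower()
    return "is expired" in text or ("cached content" in text and "not found" in text)


async def iterate_with_recovery(
    stream: AsyncIterator[str],
    settings_extra: Dict[str, dict],
    evict: Callable[[Optional[str]], Awaitable[None]],
    fetch_prewarmed: Callable[[], Awaitable[Optional[str]]],
    recreate: Callable[[], Awaitable[str]],
) -> AsyncIterator[str]:
    stale = settings_extra["generation_config"].get("cached_content")
    try:
        # The error surfaces here, while iterating, not when the stream is obtained.
        async for chunk in stream:
            yield chunk
    except Exception as exc:
        if not is_cache_expiry_error(exc):
            raise
        await evict(stale)                        # drop the stale name
        replacement = await fetch_prewarmed()     # prefer the prewarmed name
        if not replacement or replacement == stale:
            replacement = await recreate()        # inline single-flight create
        # Swap in place so the next turn picks up the new cache.
        settings_extra["generation_config"]["cached_content"] = replacement
        # Assumption: the current turn ends quietly instead of failing the call.
```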

Post-call context cache

Post-call analysis, QC, and callback extraction share the same registry/Redis pattern in the Core service’s post-call processor (cxbcore:post_call_cache: keys, max 512 entries, 90% TTL refresh).

Configuration

Enable per bot via either the legacy flat fields or the nested post_call_cache dict (the post-call orchestration reads both):

Field	Purpose
`post_call_cache_enabled`	Legacy flat toggle
`post_call_cache_version`	Legacy flat version
`post_call_cache.enabled`	Nested toggle (overrides flat)
`post_call_cache.analysis_version`	Per-namespace version for analysis
`post_call_cache.qc_version`	Per-namespace version for QC
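
One way to read both shapes is sketched below. The field names are the ones in the table, and the nested toggle overriding the flat one follows the table's note. Falling back to the flat post_call_cache_version when a per-namespace version is missing is an assumption, as is the helper name resolve_post_call_cache.

```python
from typing import Any, Dict, Optional

_VERSION_FIELD = {"analysis": "analysis_version", "qc": "qc_version"}


def resolve_post_call_cache(bot_config: Dict[str, Any], namespace: str) -> Dict[str, Any]:
    """Resolve the toggle and version for one namespace ("analysis" or "qc")."""
    nested = bot_config.get("post_call_cache") or {}
    # The nested toggle, when present, overrides the legacy flat toggle.
    if "enabled" in nested:
        enabled = bool(nested["enabled"])
    else:
        enabled = bool(bot_config.get("post_call_cache_enabled", False))
    # Assumption: fall back to the flat version if no per-namespace version is set.
    version: Optional[Any] = nested.get(
        _VERSION_FIELD[namespace], bot_config.get("post_call_cache_version")
    )
    return {"enabled": enabled, "version": version}
```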

The cached system_instruction is the prompt without the injected per-call date context — the date block is sent as fresh content so the cache key stays stable across calls (cache_system_prompt vs system_prompt in the post-call processor).

CXB API computes the version from the raw analysis/QC prompt templates, not from rendered per-call values. Invalidation is version-based, never delete-based: bump the version, the registry key changes, and the next call creates a fresh CachedContent.
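
The sketch below illustrates why keeping the date block out of the cached prompt keeps the version stable. Deriving the version as a hash of the raw template is an assumption made for this example (the text only says the version is computed from the raw templates), and both function names are hypothetical.

```python
import hashlib
from typing import Tuple


def template_version(raw_template: str) -> str:
    # Assumption: a content hash of the raw template serves as the version.
    return hashlib.sha256(raw_template.encode("utf-8")).hexdigest()[:12]


def split_post_call_prompt(raw_template: str, date_context: str) -> Tuple[str, str]:
    """Return (cache_system_prompt, fresh_content).

    The cached part is the template alone; the per-call date block travels
    as fresh content so it never affects the cache key.
    """
    cache_system_prompt = raw_template
    return cache_system_prompt, date_context
```

Two calls on different days then share the same cached prompt and version, and only an edit to the template itself produces a new version.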

Request-hint provider

Post-call context caching is not supported on the request-hint LLM provider. When cache_enabled is set for a post-call pass on that provider, the usage cache metadata reports status unsupported_provider with reason post_call_cache_not_supported.

Stale retry

If a cached generate fails (expired cache), the post-call path evicts the registry entry and retries once with system_instruction inline (no cache). The usage cache metadata records status stale_retry.
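
A minimal sketch of the retry-once behavior, assuming a generate callable that accepts either a cache name or an inline system instruction. CacheExpiredError, the generate signature, and the evict callable are hypothetical stand-ins; only the stale_retry status and the evict-then-retry-inline order come from the text.

```python
from typing import Callable, Dict, Optional, Tuple


class CacheExpiredError(Exception):
    """Hypothetical stand-in for the provider's expired-cache failure."""


def generate_with_stale_retry(
    generate: Callable[..., str],
    contents: str,
    cache_name: str,
    system_instruction: str,
    evict: Callable[[str], None],
) -> Tuple[str, Dict[str, object]]:
    cache_meta: Dict[str, object] = {"enabled": True, "status": "hit"}
    try:
        # First attempt: rely on the cached system instruction.
        text = generate(contents, cached_content=cache_name, system_instruction=None)
    except CacheExpiredError:
        evict(cache_name)                      # drop the registry entry
        cache_meta["status"] = "stale_retry"
        # Single retry with the system instruction inline and no cache.
        text = generate(contents, cached_content=None,
                        system_instruction=system_instruction)
    return text, cache_meta
```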

Visibility

Every LLM usage entry carries a cache dict (UsageEntry.cache in the Core service’s results model) with enabled, status, namespace, version, reason.

Namespace	Where attached
`live_prompt`	First live `llm` usage entry only
`post_call_analysis`	Post-call analysis pass
`qc_analysis`	QC pass
`callback_extraction`	Callback-detection pass (caching disabled)

Cache status values include disabled, hit, created, fallback, ineligible, unsupported_provider, stale_retry.

For the request-hint LLM provider, live hit/miss is derived only from real cache_read_input_tokens (hit when > 0, miss when 0). Do not invent estimated tokens or money saved. For the explicit-cache providers, cache-creation tokens come from the cache’s usage_metadata.
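
For the request-hint case, the derivation is small enough to show in full. The dict keys follow the cache fields listed earlier (enabled, status, namespace, version, reason). The function name and the choice to leave reason as None are assumptions.

```python
from typing import Any, Dict, Optional


def live_cache_meta_from_usage(
    cache_read_input_tokens: int, version: Optional[str] = None
) -> Dict[str, Any]:
    """Build the live_prompt cache metadata from real provider usage only."""
    # Hit when the provider reports cached tokens read, miss when it reports 0.
    status = "hit" if cache_read_input_tokens > 0 else "miss"
    return {
        "enabled": True,
        "status": status,
        "namespace": "live_prompt",
        "version": version,
        "reason": None,  # assumption: no reason is attached to a plain hit/miss
    }
```

Nothing in the result is estimated: the only input is the token count the provider actually returned.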

CXB API’s build_llm_cache_summary (in the call service) is scoped to post-call/QC only (type in {"post_call", "qc"}) so dashboard aggregate cards are not polluted by live-conversation cache data. The summary reports token facts only, never money. CXB Console Call Detail renders separate Live Prompt Cache and Post-Call Cache sections.
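
The scoping rule can be illustrated with a filter over usage entries. This is not the body of build_llm_cache_summary: the entry shape, the cached_tokens field, and the function name summarize_cache_usage are assumptions. Only the type filter and the tokens-only reporting come from the text.

```python
from typing import Any, Dict, Iterable, Mapping

_SUMMARY_TYPES = {"post_call", "qc"}


def summarize_cache_usage(entries: Iterable[Mapping[str, Any]]) -> Dict[str, int]:
    """Aggregate token facts for post-call and QC entries only."""
    summary = {"entries": 0, "cached_tokens": 0}
    for entry in entries:
        # Live-conversation entries are skipped so they cannot skew the cards.
        if entry.get("type") not in _SUMMARY_TYPES:
            continue
        summary["entries"] += 1
        summary["cached_tokens"] += int(entry.get("cached_tokens", 0))
    return summary
```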

Related

Bot configuration

prompt_parts, post_call_cache, and related fields.

Pipeline

Where the LLM service and live cache are wired into the pipeline.
