Live prompt cache
Caches the static system prompt + tools used during the live call. Cuts per-turn latency and input-token cost.
Post-call context cache
Caches the analysis/QC system prompt for post-call LLM passes. Cuts post-call input-token cost.
Both caches are built on CachedContent, which is supported only by the LLM provider and the managed LLM platform. The other LLM provider uses request-level hints instead of the cache registry.
Why
A bot’s system prompt (policy, compliance rules, objection handling, tool specs) is large and identical across every turn and every call for that bot version. Re-sending it each turn wastes input tokens and adds latency. Caching the static block lets the provider charge the cheaper cached-input rate and skip re-tokenizing it.
Prompt split
The split is configured per bot via PromptPartsConfig (prompt_parts in bot config, defined in the Core service’s config model):
| Field | Purpose |
|---|---|
| mode | legacy (no split, no cache) or direct (split prompt, cache eligible) |
| cache_enabled | Master toggle for live prompt caching on this bot |
| static_system_prompt | The cacheable block: policy, compliance, flow, tool-usage rules, disposition schema |
| dynamic_runtime_prompt | Fresh per-call block: customer variables, CRM fields, dates, attempt data |
| static_version | Cache version; bump to invalidate when the static prompt changes |
| prompt_cache_key | Cache routing key (request-hint provider only) |
| prompt_cache_retention | in_memory or 24h (request-hint provider only) |
The live cache is active only when mode == "direct" and cache_enabled is true (resolved in the Core service’s pipeline factory).
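The fields and the eligibility rule above can be sketched as a small data class. This is an illustration only: PromptPartsSketch and live_cache_active are hypothetical names, and the field types and defaults are assumptions, not the real PromptPartsConfig model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptPartsSketch:
    """Illustrative stand-in for PromptPartsConfig; types and defaults are assumed."""
    mode: str = "legacy"                          # "legacy" or "direct"
    cache_enabled: bool = False                   # master toggle for the live cache
    static_system_prompt: str = ""                # cacheable block
    dynamic_runtime_prompt: str = ""              # fresh per-call block
    static_version: str = "1"                     # bump to invalidate
    prompt_cache_key: Optional[str] = None        # request-hint provider only
    prompt_cache_retention: Optional[str] = None  # "in_memory" or "24h"


def live_cache_active(parts: PromptPartsSketch) -> bool:
    """Hypothetical helper: direct mode plus the toggle, as the rule above states."""
    return parts.mode == "direct" and parts.cache_enabled


legacy_bot = PromptPartsSketch(mode="legacy", cache_enabled=True)
direct_bot = PromptPartsSketch(
    mode="direct",
    cache_enabled=True,
    static_system_prompt="Policy and tool rules",
    dynamic_runtime_prompt="Customer: Ada",
)
```

Note that a legacy-mode bot stays ineligible even with cache_enabled set, because there is no static block to cache.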
Direct-prompt bots carry both system_prompt and static_system_prompt. When the live cache is active, CXB Core builds the live LLM context from the static block plus dynamic_runtime_prompt, not the legacy system_prompt. Keep both in sync.
Live prompt cache
Explicit-cache providers
For the LLM provider and the managed LLM platform, CXB Core resolves a CachedContent name and injects it before the LLM service is created (in the Core service’s pipeline factory).
When cached_content is active, the context is built with tools=NOT_GIVEN and system_instruction is omitted per-request, because the LLM provider requires that tools and the system instruction live inside the cached content, not in the request.
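A minimal sketch of that rule, assuming a hypothetical build_context_kwargs helper. The NOT_GIVEN object here is a local stand-in for the sentinel the pipeline uses, and the cache name in the usage is a placeholder.

```python
from typing import Any, Optional


class _NotGiven:
    """Local stand-in for a NOT_GIVEN sentinel; the real one comes from the pipeline's libraries."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


def build_context_kwargs(
    system_prompt: str,
    tools: list[dict[str, Any]],
    cached_content: Optional[str],
) -> dict[str, Any]:
    """Hypothetical helper: pick per-request fields based on whether a cache name is active."""
    if cached_content:
        # Tools and the system instruction already live inside the cached content,
        # so the request carries only the cache name.
        return {"cached_content": cached_content, "tools": NOT_GIVEN}
    # No cache: send the system instruction and tools inline.
    return {"system_instruction": system_prompt, "tools": tools}


cached_request = build_context_kwargs("policy text", [{"name": "lookup"}], "example-cache-name")
inline_request = build_context_kwargs("policy text", [{"name": "lookup"}], None)
```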
Request-hint provider
The request-hint LLM provider does not use the cache registry. When the bot is direct-mode + cache-enabled and prompt_cache_key is set, CXB Core passes request hints (in the Core service’s LLM factory):
- prompt_cache_key: routing key for the provider’s automatic prefix cache
- prompt_cache_retention: in_memory or 24h
No CachedContent is created; CXB Core relies on the provider’s automatic prefix caching and its reported cached-token usage.
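The gating described above might look like the following. The request_hints function is a hypothetical name, and dropping an unrecognized retention value is an assumption of this sketch rather than documented behaviour.

```python
from typing import Optional


def request_hints(
    mode: str,
    cache_enabled: bool,
    prompt_cache_key: Optional[str],
    prompt_cache_retention: Optional[str],
) -> dict[str, str]:
    """Hypothetical helper: return the hint fields to merge into the LLM request."""
    if mode != "direct" or not cache_enabled or not prompt_cache_key:
        return {}  # not eligible: send the request without hints
    hints = {"prompt_cache_key": prompt_cache_key}
    # Assumption: retention is optional and only the two documented values pass through.
    if prompt_cache_retention in ("in_memory", "24h"):
        hints["prompt_cache_retention"] = prompt_cache_retention
    return hints
```

For example, request_hints("direct", True, "bot-42-v3", "24h") yields both fields, while a legacy-mode bot yields an empty dict.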
Cache resolution order
get_or_create_live_prompt_cache (in the Core service’s live-prompt-cache module) resolves the cache name in this order:
Prewarmed name
If CXB API supplied a prewarmed live_prompt_cache_state.cache_name, it short-circuits everything (no registry, Redis, or managed-platform create). Status hit, source prewarmed.
In-process registry
A bounded OrderedDict (max 512 entries) keyed by provider + client identity + model + prompt hash + version + tools hash. A fresh entry is returned directly.
Redis
Shared across all 16 workers via cxbcore:live_prompt_cache: keys. A hit is promoted into the local registry so siblings skip creation.
Entries are refreshed at 90% of their TTL (_LIVE_PROMPT_CACHE_REFRESH_RATIO = 0.9); least-recently-used entries are evicted when the registry is full.
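The resolution order, the 90% refresh rule, and LRU eviction can be sketched together. Everything here is illustrative: the key layout and hash algorithm are guesses, a plain dict stands in for Redis, the final create step is assumed to be the last resort, and the source labels other than prewarmed are invented for the example.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, Optional

REFRESH_RATIO = 0.9  # mirrors _LIVE_PROMPT_CACHE_REFRESH_RATIO
MAX_ENTRIES = 512    # registry bound quoted in the text


def registry_key(provider: str, client_id: str, model: str,
                 prompt: str, version: str, tools_json: str) -> str:
    """Assumed key layout; the real hash algorithm and separator may differ."""
    prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()[:16]
    tools_hash = hashlib.sha256(tools_json.encode()).hexdigest()[:16]
    return ":".join([provider, client_id, model, prompt_hash, version, tools_hash])


def is_fresh(entry: tuple, now: float) -> bool:
    """An entry is (cache_name, created_at, ttl_seconds); stale once 90% of TTL has passed."""
    _name, created_at, ttl = entry
    return now - created_at < ttl * REFRESH_RATIO


class Registry:
    """Bounded in-process map with least-recently-used eviction."""

    def __init__(self, max_entries: int = MAX_ENTRIES) -> None:
        self._entries = OrderedDict()  # key -> (cache_name, created_at, ttl_seconds)
        self._max = max_entries

    def get_fresh(self, key: str, now: float) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        if not is_fresh(entry, now):
            del self._entries[key]      # near expiry: force the caller to refresh
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return entry[0]

    def put(self, key: str, entry: tuple) -> None:
        self._entries[key] = entry
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the least recently used entry


def resolve(key: str, prewarmed: Optional[str], registry: Registry, shared: dict,
            create: Callable[[], str], now: float, ttl: float) -> tuple[str, str]:
    """Return (cache_name, source). `shared` is a plain dict standing in for Redis."""
    if prewarmed:
        return prewarmed, "prewarmed"   # short-circuits every later step
    name = registry.get_fresh(key, now)
    if name:
        return name, "registry"
    entry = shared.get(key)
    if entry and is_fresh(entry, now):
        registry.put(key, entry)        # promote the shared hit into the local registry
        return entry[0], "redis"
    entry = (create(), now, ttl)        # assumed last resort: provider-side create
    shared[key] = entry
    registry.put(key, entry)
    return entry[0], "created"
```

Because the version is part of the key, bumping static_version produces a different key and therefore a miss at every layer, which is how version-based invalidation falls out of this design.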
Lifecycle ownership
CXB API owns the scheduled lifecycle; CXB Core is a consumer that prefers CXB API’s prewarmed name and self-heals when it expires.
| Responsibility | Owner | Reference |
|---|---|---|
| 7am prewarm (default prewarm_hour=7) | CXB API | Live-prompt-cache lifecycle service |
| 11pm cleanup (default cleanup_hour=23, disabled bots only) | CXB API | Live-prompt-cache lifecycle service |
| 10h TTL (default ttl_hours=10) | CXB API | Live-prompt-cache lifecycle service |
| Missed-run catch-up | CXB API | _catch_up_missed_runs |
| Inline single-flight recreate | CXB API | Internal live-cache route (/recreate) |
| Audit events | CXB API | Live-prompt-cache audit service |
| Prefer prewarmed name, in-process registry, Redis sharing, near-TTL refresh | CXB Core | Live-prompt-cache module |
Cache refresh is version-based, not delete-based. To force a refresh, bump static_version (which changes the registry key) and invalidate the CXB API bot-config cache. Do not try to delete provider-side caches across all workers; old caches age out by TTL. Cleanup deliberately targets disabled bots only, because deleting an enabled bot’s cache nightly created a dead window between cleanup and the next prewarm.
Cache-expiry recovery
A cached content name can become unusable mid-call — TTL expiry (400 INVALID_ARGUMENT ... is expired) or deletion/aging-out (404 ... cached content metadata ... not found). Both are recoverable. The Core service’s cache-recovery LLM service wrappers handle it:
Detect during iteration
The provider’s response stream is lazy — the error surfaces while iterating the response, not when the stream is awaited. CXB Core wraps the iteration itself so the error is caught.
Fetch a replacement
Reads the current prewarmed name from CXB API (live_prompt_cache_state.cache_name); if missing or unchanged, POSTs to CXB API /recreate for an inline single-flight create.
Audit events (expired_in_call, swap_after_expiry) are posted to CXB API fire-and-forget.
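The recovery steps above can be sketched with a plain generator. The names stream_with_recovery, choose_replacement, and is_cache_expiry are hypothetical, the detection rule simply matches the message fragments quoted earlier, and retrying only when nothing has been emitted yet is an assumption of this sketch.

```python
from typing import Callable, Iterator, Optional


def is_cache_expiry(exc: Exception) -> bool:
    """Assumed detection rule: match the error fragments quoted in the text."""
    text = str(exc).lower()
    return "is expired" in text or ("cached content" in text and "not found" in text)


def choose_replacement(old_name: str,
                       read_prewarmed: Callable[[], Optional[str]],
                       recreate: Callable[[], str]) -> str:
    """Prefer a changed prewarmed name; otherwise fall back to an inline recreate."""
    current = read_prewarmed()
    if current and current != old_name:
        return current
    return recreate()


def stream_with_recovery(open_stream: Callable[[str], Iterator[str]],
                         cache_name: str,
                         replace: Callable[[str], str]) -> Iterator[str]:
    """Yield chunks; on a cache-expiry error during iteration, swap the name and retry once."""
    emitted = False
    try:
        # The try block encloses the loop because a lazy stream raises
        # while it is being iterated, not when it is created.
        for chunk in open_stream(cache_name):
            emitted = True
            yield chunk
        return
    except Exception as exc:
        # Assumption: only retry if no output was produced, to avoid duplicate chunks.
        if emitted or not is_cache_expiry(exc):
            raise
    yield from open_stream(replace(cache_name))
```

Wrapping only the call that opens the stream would miss the error entirely, which is why the loop itself sits inside the try block.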
Post-call context cache
Post-call analysis, QC, and callback extraction share the same registry/Redis pattern in the Core service’s post-call processor (cxbcore:post_call_cache: keys, max 512 entries, 90% TTL refresh).
Configuration
Enable per bot via either the legacy flat fields or the nested post_call_cache dict (the post-call orchestration reads both):
| Field | Purpose |
|---|---|
| post_call_cache_enabled | Legacy flat toggle |
| post_call_cache_version | Legacy flat version |
| post_call_cache.enabled | Nested toggle (overrides flat) |
| post_call_cache.analysis_version | Per-namespace version for analysis |
| post_call_cache.qc_version | Per-namespace version for QC |
system_instruction is the prompt without the injected per-call date context — the date block is sent as fresh content so the cache key stays stable across calls (cache_system_prompt vs system_prompt in the post-call processor).
CXB API computes the version from raw analysis/QC prompt templates, not rendered per-call values. Invalidation is version-based: bump the version, the registry key changes, and the next call creates a fresh CachedContent. It is never delete-based.
Request-hint provider
Post-call context caching is not supported on the request-hint LLM provider. When cache_enabled is set for a post-call pass on that provider, the usage cache metadata reports status unsupported_provider with reason post_call_cache_not_supported.
Stale retry
If a cached generate fails (expired cache), the post-call path evicts the registry entry and retries once with system_instruction inline (no cache). The usage cache metadata records status stale_retry.
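A minimal sketch of the stale-retry path, assuming a hypothetical generate callable that accepts either cached_content or system_instruction. Treating any failure of the cached call as a stale cache is a simplification made for the example.

```python
from typing import Any, Callable


def generate_with_stale_retry(
    generate: Callable[..., str],
    registry: dict[str, str],
    key: str,
    system_instruction: str,
    content: str,
) -> tuple[str, dict[str, Any]]:
    """Try the cached call first; on failure evict the entry and retry once inline."""
    cache_meta: dict[str, Any] = {"enabled": True, "status": "hit"}
    try:
        text = generate(content, cached_content=registry[key])
    except Exception:
        # Assumption: any failure of the cached call is treated as a stale cache.
        registry.pop(key, None)  # evict so the next call creates a fresh cache
        text = generate(content, system_instruction=system_instruction)
        cache_meta["status"] = "stale_retry"
    return text, cache_meta
```

The retry sends the full system instruction inline, so it pays the uncached input rate once and leaves cache creation to the next call.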
Visibility
Every LLM usage entry carries a cache dict (UsageEntry.cache in the Core service’s results model) with enabled, status, namespace, version, reason.
| Namespace | Where attached |
|---|---|
| live_prompt | First live llm usage entry only |
| post_call_analysis | Post-call analysis pass |
| qc_analysis | QC pass |
| callback_extraction | Callback-detection pass (caching disabled) |
status values include disabled, hit, created, fallback, ineligible, unsupported_provider, stale_retry.
CXB API’s build_llm_cache_summary (in the call service) is scoped to post-call/QC only (type in {"post_call", "qc"}) so dashboard aggregate cards are not polluted by live-conversation cache data. The summary reports token facts only, never money. CXB Console Call Detail renders separate Live Prompt Cache and Post-Call Cache sections.
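The scoping rule can be illustrated with a small filter. This is not build_llm_cache_summary itself: summarize_cache_usage is a hypothetical name, and the input_tokens and cached_tokens field names are assumptions about the shape of a usage entry.

```python
from typing import Any

POST_CALL_TYPES = {"post_call", "qc"}


def summarize_cache_usage(entries: list[dict[str, Any]]) -> dict[str, int]:
    """Hypothetical approximation of a post-call/QC-only summary; token counts only."""
    summary = {"entries": 0, "cache_hits": 0, "input_tokens": 0, "cached_tokens": 0}
    for entry in entries:
        if entry.get("type") not in POST_CALL_TYPES:
            continue  # live-conversation usage stays out of the aggregate
        cache = entry.get("cache") or {}
        summary["entries"] += 1
        if cache.get("status") == "hit":
            summary["cache_hits"] += 1
        # Assumed field names; the real usage entry may name these differently.
        summary["input_tokens"] += int(entry.get("input_tokens", 0))
        summary["cached_tokens"] += int(entry.get("cached_tokens", 0))
    return summary


usage = [
    {"type": "llm", "input_tokens": 900, "cached_tokens": 800,
     "cache": {"status": "hit", "namespace": "live_prompt"}},
    {"type": "post_call", "input_tokens": 500, "cached_tokens": 400,
     "cache": {"status": "hit", "namespace": "post_call_analysis"}},
    {"type": "qc", "input_tokens": 300, "cached_tokens": 0,
     "cache": {"status": "created", "namespace": "qc_analysis"}},
]
summary = summarize_cache_usage(usage)
```

Here the live entry is skipped, so the summary covers two entries, one hit, 800 input tokens, and 400 cached tokens.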
Related
Bot configuration
prompt_parts, post_call_cache, and related fields.
Pipeline
Where the LLM service and live cache are wired into the pipeline.