Knowledge Base Pipeline

A knowledge base (KB) is a tenant-scoped collection of documents that CXB Core can query at call time via the search_knowledge tool. CXB API owns ingestion (extract → chunk → embed → index) and the internal search endpoint; the vectors live in the vector database and the metadata in MongoDB. The KB pipeline lives in the API service’s knowledge route, knowledge service, and the document, chunking, embedding, and vector-store modules.

Stores

Store	Collection / object	Holds
MongoDB	`knowledge_bases`	KB metadata, counts, `status` (`active`/`disabled`), `default_language`, `tenant_id`
MongoDB	`knowledge_documents`	Per-document `parse_status`, `chunk_count`, `text_char_count`, `storage_key`
Vector database	collection `cxb_knowledge_chunks` (configurable)	Chunk vectors + payload (`tenant_id`, `kb_id`, `document_id`, `chunk_id`, `chunk_text`, `active`, …)

KB and document IDs are prefixed (kb_<hex>, doc_<hex>), and chunk IDs take the form {document_id}_chunk_{i}. Vector-database point IDs are a deterministic UUID5 of the chunk ID, so re-ingesting the same document overwrites its existing points instead of creating duplicates.
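The ID scheme can be illustrated with a minimal sketch. The helper names and the UUID namespace below are assumptions for illustration only; the namespace the pipeline actually uses is not stated here.

```python
import uuid

# Hypothetical namespace: the one used by the real pipeline is not documented here.
EXAMPLE_NAMESPACE = uuid.NAMESPACE_URL


def chunk_id_for(document_id: str, index: int) -> str:
    """Build a chunk ID in the documented {document_id}_chunk_{i} form."""
    return f"{document_id}_chunk_{index}"


def point_id_for(chunk_id: str) -> str:
    """Derive a deterministic UUID5 point ID from a chunk ID."""
    return str(uuid.uuid5(EXAMPLE_NAMESPACE, chunk_id))


if __name__ == "__main__":
    cid = chunk_id_for("doc_ab12", 0)
    # The same chunk ID always maps to the same point ID, so an upsert
    # replaces the earlier point rather than adding a new one.
    print(cid, point_id_for(cid))
```

Because the point ID depends only on the chunk ID, ingesting a document twice addresses the same points both times.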

Admin endpoints

All routes are under /api/v1/knowledge-bases and require admin access:

Method	Path	Purpose
`POST`	`/`	Create a KB
`GET`	`/`	List KBs for the tenant
`GET`	`/{kb_id}`	Get one KB
`PATCH`	`/{kb_id}`	Update name/description/status/language
`DELETE`	`/{kb_id}`	Soft-disable (sets `status=disabled`)
`POST`	`/{kb_id}/documents`	Upload + ingest a document
`GET`	`/{kb_id}/documents`	List documents
`DELETE`	`/{kb_id}/documents/{document_id}`	Delete document + its vector-database chunks

Ingestion pipeline

ingest_document runs synchronously within the upload request and records progress on the document:

Stage	Step	Notes
Extract	`extract_text`	Supported: `.pdf`, `.txt`, `.md`, `.csv`, `.docx`. PDF via `pypdf`, DOCX via `python-docx` (paragraphs + tables). Unsupported types raise `KnowledgeDocumentError`.
Chunk	`chunk_text`	Normalizes whitespace, then slides a window of `chunk_size` chars with `overlap`, preferring to end each chunk at a `\n`, `.`, or space boundary past 50% of the window. Defaults: `chunk_size_chars=1200`, `chunk_overlap_chars=180`.
Embed	`embed_texts`	Uses the LLM client SDK with the model from `knowledge.embedding_model` (settings default `<embedding-model>`; the route falls back to a default embedding model if the field is unset) and `output_dimensionality` = `embedding_dimensions` (768). Requires the LLM provider API key.
Index	`upsert_chunks`	Ensures the collection (cosine distance) and the payload indexes on `tenant_id`/`kb_id`/`document_id`/`active` exist, then upserts the chunk points.
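The Chunk stage can be sketched as follows. This is an illustrative approximation, not the pipeline's actual chunk_text: the exact whitespace normalization, the order in which boundaries are tried, and the handling of edge cases are all assumptions.

```python
import re
from typing import List


def chunk_text(text: str, chunk_size: int = 1200, overlap: int = 180) -> List[str]:
    """Split text into overlapping character windows, preferring natural breaks."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    # Assumed normalization: collapse horizontal whitespace and blank lines,
    # but keep single newlines so they can serve as chunk boundaries.
    normalized = re.sub(r"[^\S\n]+", " ", text)
    normalized = re.sub(r" ?\n[ \n]*", "\n", normalized).strip()

    chunks: List[str] = []
    start, total = 0, len(normalized)
    while start < total:
        end = min(start + chunk_size, total)
        if end < total:
            # Only accept a boundary in the second half of the window.
            half = start + chunk_size // 2
            for separator in ("\n", ".", " "):
                cut = normalized.rfind(separator, half, end)
                if cut != -1:
                    end = cut + 1
                    break
        piece = normalized[start:end].strip()
        if piece:
            chunks.append(piece)
        if end >= total:
            break
        # Step back by the overlap, but always move forward at least one char.
        start = max(end - overlap, start + 1)
    return chunks


if __name__ == "__main__":
    sample = "First sentence. " * 200
    parts = chunk_text(sample)
    print(len(parts), max(len(p) for p in parts))
```

Restricting the boundary search to the second half of the window keeps chunks from becoming very short when a separator happens to appear early.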

The upload route enforces max_upload_mb (default 20) and returns 413 if exceeded. On any ingestion error the document is marked failed with a truncated parse_error (the upload still returns 200 with that status).
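The size limit and the failure handling described above can be sketched as below. The function names, the 1024 × 1024 bytes-per-MB convention, the `ready` success status, and the 500-character truncation length are assumptions for illustration; only the 20 MB default, the 413 response, the `failed` status, and the truncated parse_error come from the description.

```python
from typing import Callable, Optional


def check_upload_size(size_bytes: int, max_upload_mb: int = 20) -> Optional[int]:
    """Return 413 when the upload exceeds the limit, else None."""
    limit_bytes = max_upload_mb * 1024 * 1024  # assumed MB convention
    return 413 if size_bytes > limit_bytes else None


def ingest_safely(
    document: dict,
    run_pipeline: Callable[[dict], int],
    max_error_chars: int = 500,  # hypothetical truncation length
) -> dict:
    """Run ingestion and record the outcome on the document record."""
    try:
        document["chunk_count"] = run_pipeline(document)
        document["parse_status"] = "ready"  # hypothetical success value
    except Exception as exc:
        # The error is stored on the document instead of being raised,
        # which is why the upload call itself can still return 200.
        document["parse_status"] = "failed"
        document["parse_error"] = str(exc)[:max_error_chars]
    return document


if __name__ == "__main__":
    def broken_pipeline(_doc: dict) -> int:
        raise RuntimeError("unsupported file type")

    print(check_upload_size(25 * 1024 * 1024))
    print(ingest_safely({"document_id": "doc_ab12"}, broken_pipeline))
```

A client should therefore inspect the document's parse_status in the response rather than rely on the HTTP status code alone.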

Ingestion is not deferred to a worker: extraction, embedding, and the vector-database upsert all happen inside the upload request. Large documents therefore make the upload call slow rather than returning a queued status.

Bot attachment

A bot’s KB attachment lives in bot.knowledge (BotKnowledgeConfig in models/knowledge.py):

Field	Default	Purpose
`enabled`	`false`	Master toggle
`kb_ids`	`[]`	Attached KBs (deduped)
`top_k`	`4` (1–10)	Max chunks returned
`score_threshold`	`0.55` (0–1)	Min cosine score
`strict`	`true`	If true, emit `fallback_message` when no hit
`trigger_instructions`	`""`	Natural-language guidance injected into the `search_knowledge` tool description in CXB Core
`fallback_message`	default sentence	Spoken when nothing is found in strict mode
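The fields and bounds in the table can be mirrored in a small sketch. This is not the BotKnowledgeConfig class from models/knowledge.py; the class name, the use of a dataclass, and the validation style are assumptions, and the fallback text is left as a placeholder because the actual default sentence is not given here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BotKnowledgeConfigSketch:
    """Illustrative stand-in for the bot.knowledge settings."""

    enabled: bool = False
    kb_ids: List[str] = field(default_factory=list)
    top_k: int = 4
    score_threshold: float = 0.55
    strict: bool = True
    trigger_instructions: str = ""
    fallback_message: str = "<default sentence>"  # placeholder, not the real text

    def __post_init__(self) -> None:
        # Deduplicate attached KBs while preserving their order.
        self.kb_ids = list(dict.fromkeys(self.kb_ids))
        if not 1 <= self.top_k <= 10:
            raise ValueError("top_k must be between 1 and 10")
        if not 0.0 <= self.score_threshold <= 1.0:
            raise ValueError("score_threshold must be between 0 and 1")


if __name__ == "__main__":
    config = BotKnowledgeConfigSketch(enabled=True, kb_ids=["kb_a1", "kb_b2", "kb_a1"])
    print(config.kb_ids, config.top_k, config.score_threshold)
```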

Search contract (CXB Core)

CXB Core calls POST /api/v1/internal/knowledge/search, authenticated by X-CXB-Core-Secret. The request carries bot_id, session_id, query, and optional kb_ids/top_k/score_threshold/strict. search_knowledge enforces and accelerates access:

Access guard: validate_bot_kb_access_with_meta intersects the requested kb_ids with the bot’s attached, active KBs. Unattached or disabled KBs are silently dropped; if no active KB remains, the search returns empty hits.
Redis caching (layered): KB-access (60s), query embeddings (24h), and full results (30m on hit, 2m on no-hit). Result cache keys include a kb_revision derived from each KB’s updated_at/counts/status, so editing a KB invalidates cached answers.
Vector-database filter: points must match the tenant_id, have a kb_id in the set of active KBs, and have active=true; the search returns the top-k results subject to score_threshold.
Response includes hits[] (with score, source_name, chunk_text) and a metrics block (embedding/vector-search/cache timings and cache-hit flags). In strict mode with no hits, fallback_message is returned.
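The access guard and the revision-aware result cache key can be sketched as below. The function names, the KB record field names (`document_count`, `chunk_count`), the hashing scheme, the key prefix, and the choice to search all attached KBs when kb_ids is omitted are assumptions for illustration; they are not taken from the service code.

```python
import hashlib
import json
from typing import Dict, Iterable, List, Optional


def allowed_kb_ids(
    requested: Optional[Iterable[str]],
    attached: List[str],
    status_by_kb: Dict[str, str],
) -> List[str]:
    """Intersect requested KBs with the bot's attached, active KBs."""
    active_attached = [kb for kb in attached if status_by_kb.get(kb) == "active"]
    if not requested:
        return active_attached  # assumed behavior when kb_ids is omitted
    wanted = set(requested)
    return [kb for kb in active_attached if kb in wanted]


def kb_revision(kbs: List[dict]) -> str:
    """Fingerprint the KB state so that any edit changes the cache key."""
    material = [
        [kb["kb_id"], kb["updated_at"], kb["document_count"], kb["chunk_count"], kb["status"]]
        for kb in sorted(kbs, key=lambda kb: kb["kb_id"])
    ]
    return hashlib.sha256(json.dumps(material).encode("utf-8")).hexdigest()[:16]


def result_cache_key(
    query: str, kb_ids: List[str], revision: str, top_k: int, score_threshold: float
) -> str:
    """Build a cache key for full search results (hypothetical key layout)."""
    payload = json.dumps([query, sorted(kb_ids), revision, top_k, score_threshold])
    return "kb:search:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    statuses = {"kb_a": "active", "kb_b": "active", "kb_c": "disabled"}
    usable = allowed_kb_ids(["kb_a", "kb_x", "kb_c"], ["kb_a", "kb_b", "kb_c"], statuses)
    print(usable)
```

Including the revision in the key means stale results are never looked up after a KB changes; they simply expire under their TTL.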

Related docs

CXB Core pipeline

How search_knowledge is registered and invoked mid-call.

Settings

The knowledge system-settings block (vector-database URL, embedding model, chunk sizes).

Tools & integrations

Bot-level tool configuration including knowledge.

CXB API overview

Where the KB pipeline fits in the control plane.