RAG Pipeline
RAG Pipeline protocol schemas
RAG (Retrieval-Augmented Generation) Pipeline Protocol
Defines schemas for building context-aware AI assistants using RAG techniques.
Enables vector search, document chunking, embeddings, and retrieval configuration.
Source: packages/spec/src/ai/rag-pipeline.zod.ts
// Runtime schemas, used for validation
import { ChunkingStrategy, DocumentChunk, DocumentLoaderConfig, DocumentMetadata, EmbeddingModel, FilterExpression, FilterGroup, MetadataFilter, RAGPipelineConfig, RAGPipelineStatus, RAGQueryRequest, RAGQueryResponse, RerankingConfig, RetrievalStrategy, VectorStoreConfig, VectorStoreProvider } from '@objectstack/spec/ai';
// Type-only import, for code that needs the types but not the runtime schemas
import type { ChunkingStrategy, DocumentChunk, DocumentLoaderConfig, DocumentMetadata, EmbeddingModel, FilterExpression, FilterGroup, MetadataFilter, RAGPipelineConfig, RAGPipelineStatus, RAGQueryRequest, RAGQueryResponse, RerankingConfig, RetrievalStrategy, VectorStoreConfig, VectorStoreProvider } from '@objectstack/spec/ai';
// Validate untrusted input; parse() throws if the data does not match the schema
const result = ChunkingStrategy.parse(data);
ChunkingStrategy
This schema accepts one of the following structures:
Type: fixed
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | ✅ | |
| chunkSize | integer | ✅ | Fixed chunk size in tokens/chars |
| chunkOverlap | integer | ✅ | Overlap between chunks |
| unit | Enum<'tokens' \| 'characters'> | ✅ | |
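The interaction between `chunkSize` and `chunkOverlap` can be illustrated with a minimal sketch. It assumes `unit` is `characters` and that each window starts `chunkSize - chunkOverlap` characters after the previous one; the `FixedChunking` interface and `chunkByCharacters` function are hypothetical local helpers, not part of `@objectstack/spec`.

```typescript
// Hypothetical local mirror of the "fixed" variant (not imported from the package).
interface FixedChunking {
  type: "fixed";
  chunkSize: number;
  chunkOverlap: number;
  unit: "tokens" | "characters";
}

// Slides a window of chunkSize characters over the text, advancing by
// (chunkSize - chunkOverlap) characters each step.
function chunkByCharacters(text: string, cfg: FixedChunking): string[] {
  if (cfg.unit !== "characters") {
    throw new Error("this sketch only handles unit = 'characters'");
  }
  const step = cfg.chunkSize - cfg.chunkOverlap;
  if (cfg.chunkSize <= 0 || step <= 0) {
    throw new Error("chunkSize must be positive and larger than chunkOverlap");
  }
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + cfg.chunkSize));
    if (start + cfg.chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

const fixedChunks = chunkByCharacters("abcdefghij", {
  type: "fixed",
  chunkSize: 4,
  chunkOverlap: 1,
  unit: "characters",
});
// fixedChunks is ["abcd", "defg", "ghij"]: neighbouring chunks share one character.
```

Token-based chunking would follow the same windowing logic, applied to a tokenizer's output instead of raw characters.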
Type: semantic
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | ✅ | |
| model | string | optional | Model for semantic chunking |
| minChunkSize | integer | ✅ | |
| maxChunkSize | integer | ✅ | |
Type: recursive
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | ✅ | |
| separators | string[] | ✅ | |
| chunkSize | integer | ✅ | |
| chunkOverlap | integer | ✅ | |
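One common reading of recursive chunking is: split on the first separator, merge neighbouring pieces while they fit in `chunkSize`, and fall back to the next separator for any piece that is still too long. The sketch below follows that reading and, for brevity, ignores `chunkOverlap`. `splitRecursive` is a hypothetical helper, and the actual behaviour of a given runtime may differ.

```typescript
// Hypothetical helper: recursive separator-based splitting (chunkOverlap is ignored here).
function splitRecursive(text: string, separators: string[], chunkSize: number): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];
  if (separators.length === 0) {
    // No separators left: fall back to hard cuts of chunkSize characters.
    const hard: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) hard.push(text.slice(i, i + chunkSize));
    return hard;
  }
  const sep = separators[0]!;
  const rest = separators.slice(1);
  const out: string[] = [];
  let current = "";
  for (const piece of text.split(sep)) {
    if (piece.length === 0) continue;
    const candidate = current === "" ? piece : current + sep + piece;
    if (candidate.length <= chunkSize) {
      current = candidate; // piece still fits into the chunk being built
      continue;
    }
    if (current !== "") out.push(current);
    if (piece.length <= chunkSize) {
      current = piece;
    } else {
      // Piece is too long on its own: retry with the finer-grained separators.
      for (const sub of splitRecursive(piece, rest, chunkSize)) out.push(sub);
      current = "";
    }
  }
  if (current !== "") out.push(current);
  return out;
}

const recursiveChunks = splitRecursive("aaa bbb\n\nccc ddd eee", ["\n\n", " "], 7);
// recursiveChunks is ["aaa bbb", "ccc ddd", "eee"]
```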
Type: markdown
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | ✅ | |
| maxChunkSize | integer | ✅ | |
| respectHeaders | boolean | ✅ | Keep headers with content |
| respectCodeBlocks | boolean | ✅ | Keep code blocks intact |
DocumentChunk
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | ✅ | Unique chunk identifier |
| content | string | ✅ | Chunk text content |
| embedding | number[] | optional | Embedding vector |
| metadata | Object | ✅ | |
| chunkIndex | integer | ✅ | Chunk position in document |
| tokens | integer | optional | Token count |
DocumentLoaderConfig
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | Enum<'file' \| 'directory' \| 'url' \| 'api' \| 'database' \| 'custom'> | ✅ | |
| source | string | ✅ | Source path, URL, or identifier |
| fileTypes | string[] | optional | Accepted file extensions (e.g., [".pdf", ".md"]) |
| recursive | boolean | ✅ | Process directories recursively |
| maxFileSize | integer | optional | Maximum file size in bytes |
| excludePatterns | string[] | optional | Patterns to exclude |
| extractImages | boolean | ✅ | Extract text from images (OCR) |
| extractTables | boolean | ✅ | Extract and format tables |
| loaderConfig | Record<string, any> | optional | Custom loader-specific config |
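As a rough illustration of how `fileTypes` and `maxFileSize` might gate which files a loader picks up, the sketch below checks a path and size against both. `LoaderFileRules` and `acceptsFile` are hypothetical names, case-insensitive extension matching is an assumption, and `excludePatterns` is left out because its pattern syntax is not specified here.

```typescript
// Hypothetical subset of the loader config used by this sketch.
interface LoaderFileRules {
  fileTypes?: string[];  // e.g. [".pdf", ".md"]
  maxFileSize?: number;  // bytes
}

// Returns true when the file is within the size limit and has an accepted extension.
function acceptsFile(path: string, sizeBytes: number, rules: LoaderFileRules): boolean {
  if (rules.maxFileSize !== undefined && sizeBytes > rules.maxFileSize) return false;
  if (rules.fileTypes && rules.fileTypes.length > 0) {
    const lower = path.toLowerCase();
    // Assumption: extensions are compared case-insensitively.
    return rules.fileTypes.some((ext) => lower.slice(-ext.length) === ext.toLowerCase());
  }
  return true; // no extension restriction configured
}

const loaderRules: LoaderFileRules = { fileTypes: [".pdf", ".md"], maxFileSize: 5000 };
const acceptsGuide = acceptsFile("docs/Guide.MD", 1000, loaderRules);  // accepted
const acceptsImage = acceptsFile("logo.png", 10, loaderRules);         // wrong extension
const acceptsLarge = acceptsFile("manual.pdf", 6000, loaderRules);     // too large
```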
DocumentMetadata
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| source | string | ✅ | Document source (file path, URL, etc.) |
| sourceType | Enum<'file' \| 'url' \| 'api' \| 'database' \| 'custom'> | optional | |
| title | string | optional | |
| author | string | optional | Document author |
| createdAt | string | optional | ISO timestamp |
| updatedAt | string | optional | ISO timestamp |
| tags | string[] | optional | |
| category | string | optional | |
| language | string | optional | Document language (ISO 639-1 code) |
| custom | Record<string, any> | optional | Custom metadata fields |
EmbeddingModel
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| provider | Enum<'openai' \| 'cohere' \| 'huggingface' \| 'azure_openai' \| 'local' \| 'custom'> | ✅ | |
| model | string | ✅ | Model name (e.g., "text-embedding-3-large") |
| dimensions | integer | ✅ | Embedding vector dimensions |
| maxTokens | integer | optional | Maximum tokens per embedding |
| batchSize | integer | ✅ | Batch size for embedding |
| endpoint | string | optional | Custom endpoint URL |
| apiKey | string | optional | API key |
| secretRef | string | optional | Reference to stored secret |
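The `batchSize` field suggests that texts are sent to the embedding provider in groups. A minimal sketch of that grouping step follows; `toBatches` is a hypothetical helper, and the actual provider call is deliberately left out.

```typescript
// Hypothetical helper: groups items into consecutive batches of at most batchSize.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1 || Math.floor(batchSize) !== batchSize) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

const embeddingBatches = toBatches(["a", "b", "c", "d", "e"], 2);
// embeddingBatches is [["a", "b"], ["c", "d"], ["e"]]
```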
FilterExpression
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| field | string | ✅ | Metadata field to filter |
| operator | Enum<'eq' \| 'neq' \| 'gt' \| 'gte' \| 'lt' \| 'lte' \| 'in' \| 'nin' \| 'contains'> | ✅ | |
| value | string \| number \| boolean \| string \| number[] | ✅ | Filter value |
FilterGroup
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| logic | Enum<'and' \| 'or'> | ✅ | |
| filters | (Object \| FilterGroup)[] | ✅ | |
MetadataFilter
This schema accepts one of the following structures:
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| field | string | ✅ | Metadata field to filter |
| operator | Enum<'eq' \| 'neq' \| 'gt' \| 'gte' \| 'lt' \| 'lte' \| 'in' \| 'nin' \| 'contains'> | ✅ | |
| value | string \| number \| boolean \| string \| number[] | ✅ | Filter value |
Reference: __schema0
Type: Record<string, string | number | boolean | string | number[]>
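To make the filter structures above more concrete, here is a sketch of how a single expression and a nested and/or group could be evaluated against a chunk's metadata. The types are hand-written local mirrors of the tables, not imports from the package. The operator semantics are assumptions: `contains` is treated as substring match for strings and membership for arrays, and `in`/`nin` expect an array value. The `Record` shorthand form is not covered.

```typescript
// Hypothetical local mirrors of the filter schemas.
type FilterValue = string | number | boolean | Array<string | number>;
type Operator = "eq" | "neq" | "gt" | "gte" | "lt" | "lte" | "in" | "nin" | "contains";
interface FilterExpression { field: string; operator: Operator; value: FilterValue; }
interface FilterGroup { logic: "and" | "or"; filters: Array<FilterExpression | FilterGroup>; }
type Metadata = Record<string, unknown>;

function matchesExpression(meta: Metadata, f: FilterExpression): boolean {
  const actual = meta[f.field];
  const expected = f.value;
  switch (f.operator) {
    case "eq": return actual === expected;
    case "neq": return actual !== expected;
    case "gt": return typeof actual === "number" && typeof expected === "number" && actual > expected;
    case "gte": return typeof actual === "number" && typeof expected === "number" && actual >= expected;
    case "lt": return typeof actual === "number" && typeof expected === "number" && actual < expected;
    case "lte": return typeof actual === "number" && typeof expected === "number" && actual <= expected;
    case "in": return Array.isArray(expected) && expected.indexOf(actual as string | number) !== -1;
    case "nin": return Array.isArray(expected) && expected.indexOf(actual as string | number) === -1;
    case "contains":
      if (typeof actual === "string" && typeof expected === "string") return actual.indexOf(expected) !== -1;
      if (Array.isArray(actual)) return actual.indexOf(expected) !== -1;
      return false;
    default: return false;
  }
}

// Groups are evaluated recursively: "and" needs every child to match, "or" needs at least one.
function matchesFilter(meta: Metadata, filter: FilterExpression | FilterGroup): boolean {
  if ("logic" in filter) {
    const results = filter.filters.map((child) => matchesFilter(meta, child));
    return filter.logic === "and" ? results.every(Boolean) : results.some(Boolean);
  }
  return matchesExpression(meta, filter);
}

const sampleMetadata: Metadata = { category: "guide", year: 2024, tags: ["rag", "search"] };
const sampleFilter: FilterGroup = {
  logic: "and",
  filters: [
    { field: "category", operator: "eq", value: "guide" },
    {
      logic: "or",
      filters: [
        { field: "year", operator: "lt", value: 2020 },
        { field: "tags", operator: "contains", value: "rag" },
      ],
    },
  ],
};
const sampleMatches = matchesFilter(sampleMetadata, sampleFilter);
```

In this example the outer `and` group matches because the category is `guide` and the inner `or` group is satisfied by the `tags` condition.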
RAGPipelineConfig
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | ✅ | Pipeline name (snake_case) |
| label | string | ✅ | Display name |
| description | string | optional | |
| embedding | Object | ✅ | |
| vectorStore | Object | ✅ | |
| chunking | Object \| Object \| Object \| Object | ✅ | |
| retrieval | Object \| Object \| Object \| Object | ✅ | |
| reranking | Object | optional | |
| loaders | Object[] | optional | Document loaders |
| maxContextTokens | integer | ✅ | Maximum tokens in context |
| contextWindow | integer | optional | LLM context window size |
| metadataFilters | Object \| [__schema0](./__schema0) \| Record<string, string \| number \| boolean \| string \| number[]> | optional | Global filters for retrieval |
| enableCache | boolean | ✅ | |
| cacheTTL | integer | ✅ | Cache TTL in seconds |
| cacheInvalidationStrategy | Enum<'time_based' \| 'manual' \| 'on_update'> | optional | |
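Some fields in a pipeline config relate to each other in ways a flat table does not show. The sketch below runs a few sanity checks on a small subset of the config. Only the snake_case rule for `name` comes from the table above; the checks that `embedding.dimensions` equals `vectorStore.dimensions` and that `maxContextTokens` fits inside `contextWindow` are assumptions about sensible configurations, not rules confirmed by the schema. `PipelineConfigSubset` and `checkPipelineConfig` are hypothetical names.

```typescript
// Hypothetical subset of RAGPipelineConfig covering only the fields checked here.
interface PipelineConfigSubset {
  name: string;
  embedding: { dimensions: number };
  vectorStore: { dimensions: number };
  maxContextTokens: number;
  contextWindow?: number;
}

// Returns a list of human-readable problems; an empty list means no issue was found.
function checkPipelineConfig(cfg: PipelineConfigSubset): string[] {
  const problems: string[] = [];
  if (!/^[a-z][a-z0-9]*(_[a-z0-9]+)*$/.test(cfg.name)) {
    problems.push("name is not snake_case");
  }
  // Assumption: vectors produced by the embedding model must fit the store's index.
  if (cfg.embedding.dimensions !== cfg.vectorStore.dimensions) {
    problems.push("embedding.dimensions does not match vectorStore.dimensions");
  }
  // Assumption: the assembled context should not exceed the LLM context window.
  if (cfg.contextWindow !== undefined && cfg.maxContextTokens > cfg.contextWindow) {
    problems.push("maxContextTokens exceeds contextWindow");
  }
  return problems;
}

const goodConfigProblems = checkPipelineConfig({
  name: "support_docs",
  embedding: { dimensions: 1536 },
  vectorStore: { dimensions: 1536 },
  maxContextTokens: 4000,
  contextWindow: 8192,
});
const badConfigProblems = checkPipelineConfig({
  name: "Support Docs",
  embedding: { dimensions: 1536 },
  vectorStore: { dimensions: 768 },
  maxContextTokens: 9000,
  contextWindow: 8192,
});
```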
RAGPipelineStatus
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | ✅ | |
| status | Enum<'active' \| 'indexing' \| 'error' \| 'disabled'> | ✅ | |
| documentsIndexed | integer | ✅ | |
| lastIndexed | string | optional | ISO timestamp |
| errorMessage | string | optional | |
| health | Object | optional | |
RAGQueryRequest
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| query | string | ✅ | User query |
| pipelineName | string | ✅ | Pipeline to use |
| topK | integer | optional | |
| metadataFilters | Record<string, any> | optional | |
| conversationHistory | Object[] | optional | |
| includeMetadata | boolean | ✅ | |
| includeSources | boolean | ✅ | |
RAGQueryResponse
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| query | string | ✅ | |
| results | Object[] | ✅ | |
| context | string | ✅ | Assembled context for LLM |
| tokens | Object | optional | Token usage for this query |
| cost | number | optional | Cost for this query in USD |
| retrievalTime | number | optional | Retrieval time in milliseconds |
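The `context` field is described as the assembled context for the LLM. One plausible way to build it is to take retrieved chunks in descending score order and keep those that fit within a token budget such as `maxContextTokens`. The shape of each result (`content`, `score`, `tokens`) and the `assembleContext` function are assumptions made for this sketch; the table above only says `results` is an array of objects.

```typescript
// Hypothetical shape of one retrieved result.
interface RetrievedChunk {
  content: string;
  score: number;
  tokens: number;
}

// Greedily packs the highest-scoring chunks into the token budget,
// skipping any chunk that would overflow it.
function assembleContext(results: RetrievedChunk[], maxContextTokens: number): string {
  const sorted = results.slice().sort((a, b) => b.score - a.score);
  const picked: string[] = [];
  let used = 0;
  for (const r of sorted) {
    if (used + r.tokens > maxContextTokens) continue;
    picked.push(r.content);
    used += r.tokens;
  }
  return picked.join("\n\n");
}

const assembled = assembleContext(
  [
    { content: "B", score: 0.8, tokens: 60 },
    { content: "A", score: 0.9, tokens: 50 },
    { content: "C", score: 0.7, tokens: 40 },
  ],
  100,
);
// assembled is "A\n\nC": B is skipped because 50 + 60 would exceed the budget of 100.
```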
RerankingConfig
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| enabled | boolean | ✅ | |
| model | string | optional | Reranking model name |
| provider | Enum<'cohere' \| 'huggingface' \| 'custom'> | optional | |
| topK | integer | ✅ | Final number of results after reranking |
RetrievalStrategy
This schema accepts one of the following structures:
Type: similarity
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | ✅ | |
| topK | integer | ✅ | Number of results to retrieve |
| scoreThreshold | number | optional | Minimum similarity score |
Type: mmr
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | ✅ | |
| topK | integer | ✅ | |
| fetchK | integer | ✅ | Initial fetch size |
| lambda | number | ✅ | Diversity vs relevance (0=diverse, 1=relevant) |
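The role of `lambda` is easiest to see in code. The sketch below implements a standard maximal marginal relevance (MMR) selection loop over candidates that are assumed to be the `fetchK` results of an initial similarity search. Each step picks the candidate maximising `lambda * relevance - (1 - lambda) * redundancy`, where redundancy is the highest similarity to anything already selected. `Candidate`, `cosine`, and `mmrSelect` are hypothetical names, and cosine similarity is an assumed choice of metric.

```typescript
interface Candidate {
  id: string;
  vector: number[];
}

// Cosine similarity; returns 0 for zero-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i]! * b[i]!;
    na += a[i]! * a[i]!;
    nb += b[i]! * b[i]!;
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Selects up to topK candidates, trading relevance against redundancy via lambda.
function mmrSelect(query: number[], candidates: Candidate[], topK: number, lambda: number): Candidate[] {
  const remaining = candidates.slice();
  const selected: Candidate[] = [];
  while (selected.length < topK && remaining.length > 0) {
    let bestIndex = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < remaining.length; i++) {
      const relevance = cosine(query, remaining[i]!.vector);
      let redundancy = selected.length === 0 ? 0 : -Infinity;
      for (const s of selected) {
        redundancy = Math.max(redundancy, cosine(remaining[i]!.vector, s.vector));
      }
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIndex = i;
      }
    }
    selected.push(remaining.splice(bestIndex, 1)[0]!);
  }
  return selected;
}

const mmrCandidates: Candidate[] = [
  { id: "a", vector: [1, 0] },
  { id: "b", vector: [0.99, 0.14] }, // near-duplicate of "a"
  { id: "c", vector: [0.6, 0.8] },
];
const relevantOnly = mmrSelect([1, 0], mmrCandidates, 2, 1).map((c) => c.id);
const diversified = mmrSelect([1, 0], mmrCandidates, 2, 0.3).map((c) => c.id);
```

With `lambda = 1` the near-duplicate `b` is picked second because only relevance counts; with `lambda = 0.3` the less similar `c` is preferred.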
Type: hybrid
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | ✅ | |
| topK | integer | ✅ | |
| vectorWeight | number | ✅ | Weight for vector search |
| keywordWeight | number | ✅ | Weight for keyword search |
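A simple way to read `vectorWeight` and `keywordWeight` is as coefficients in a weighted sum of two score lists. The sketch below assumes both score sets are already normalised to a comparable range (for example 0 to 1) and that a document missing from one list contributes 0 from that side; other fusion schemes exist and may be what a given runtime uses. `HybridConfig` and `hybridRank` are hypothetical names.

```typescript
// Hypothetical local mirror of the "hybrid" variant.
interface HybridConfig {
  type: "hybrid";
  topK: number;
  vectorWeight: number;
  keywordWeight: number;
}

type Scores = Record<string, number>; // document id -> normalised score

// Combines both score sets with a weighted sum and returns the topK ids.
function hybridRank(vectorScores: Scores, keywordScores: Scores, cfg: HybridConfig): Array<{ id: string; score: number }> {
  const combined: Scores = {};
  for (const id of Object.keys(vectorScores)) {
    combined[id] = (combined[id] || 0) + cfg.vectorWeight * vectorScores[id]!;
  }
  for (const id of Object.keys(keywordScores)) {
    combined[id] = (combined[id] || 0) + cfg.keywordWeight * keywordScores[id]!;
  }
  return Object.keys(combined)
    .map((id) => ({ id, score: combined[id]! }))
    .sort((x, y) => y.score - x.score)
    .slice(0, cfg.topK);
}

const hybridResults = hybridRank(
  { a: 0.9, b: 0.2 },
  { b: 1.0, c: 0.5 },
  { type: "hybrid", topK: 2, vectorWeight: 0.7, keywordWeight: 0.3 },
);
// Combined scores: a = 0.63, b = 0.44, c = 0.15, so the top two ids are "a" and "b".
```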
Type: parent_document
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | ✅ | |
| topK | integer | ✅ | |
| retrieveParent | boolean | ✅ | Retrieve full parent document |
VectorStoreConfig
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| provider | Enum<'pinecone' \| 'weaviate' \| 'qdrant' \| 'milvus' \| 'chroma' \| 'pgvector' \| 'redis' \| 'opensearch' \| 'elasticsearch' \| 'custom'> | ✅ | |
| indexName | string | ✅ | Index/collection name |
| namespace | string | optional | Namespace for multi-tenancy |
| host | string | optional | Vector store host |
| port | integer | optional | Vector store port |
| secretRef | string | optional | Reference to stored secret |
| apiKey | string | optional | API key or reference to secret |
| dimensions | integer | ✅ | Vector dimensions |
| metric | Enum<'cosine' \| 'euclidean' \| 'dotproduct'> | ✅ | |
| batchSize | integer | ✅ | |
| connectionPoolSize | integer | ✅ | |
| timeout | integer | ✅ | Timeout in milliseconds |
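The three `metric` values correspond to standard vector comparisons. The sketch below computes each one for a pair of vectors with matching `dimensions`; `compareVectors` is a hypothetical helper, and real vector stores compute these internally. Note that for `euclidean` a smaller value means closer, while for `cosine` and `dotproduct` a larger value means closer.

```typescript
type Metric = "cosine" | "euclidean" | "dotproduct";

// Computes the chosen metric for two vectors of equal length.
function compareVectors(metric: Metric, a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same dimensions");
  let dot = 0;
  let na = 0;
  let nb = 0;
  let squaredDistance = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i]!;
    const y = b[i]!;
    dot += x * y;
    na += x * x;
    nb += y * y;
    squaredDistance += (x - y) * (x - y);
  }
  switch (metric) {
    case "dotproduct":
      return dot;
    case "euclidean":
      return Math.sqrt(squaredDistance);
    case "cosine":
      return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
    default:
      throw new Error("unknown metric");
  }
}

const cosineSame = compareVectors("cosine", [3, 4], [3, 4]);         // identical direction
const euclideanUnit = compareVectors("euclidean", [1, 0], [0, 1]);   // distance between unit axes
const dotExample = compareVectors("dotproduct", [1, 2], [3, 4]);     // 1*3 + 2*4
```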
VectorStoreProvider
Allowed values:
pinecone
weaviate
qdrant
milvus
chroma
pgvector
redis
opensearch
elasticsearch
custom