RAG Pipeline

RAG (Retrieval-Augmented Generation) Pipeline Protocol

Defines schemas for building context-aware AI assistants using RAG techniques.

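The schemas cover document loading and chunking, embedding models, vector store configuration, retrieval and reranking strategies, metadata filters, and the query request and response shapes.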

Source: packages/spec/src/ai/rag-pipeline.zod.ts

TypeScript Usage

import { ChunkingStrategy, DocumentChunk, DocumentLoaderConfig, DocumentMetadata, EmbeddingModel, FilterExpression, FilterGroup, MetadataFilter, RAGPipelineConfig, RAGPipelineStatus, RAGQueryRequest, RAGQueryResponse, RerankingConfig, RetrievalStrategy, VectorStoreConfig, VectorStoreProvider } from '@objectstack/spec/ai';
// Each name above is exported both as a Zod schema (runtime value) and as its
// inferred TypeScript type, so the single import covers both uses.

// Validate untrusted input; parse() throws a ZodError when the data is invalid.
// The values below are examples only.
const data: unknown = { type: 'fixed', chunkSize: 512, chunkOverlap: 50, unit: 'tokens' };
const result = ChunkingStrategy.parse(data);

ChunkingStrategy

Union Options

This schema accepts one of the following structures:

Option 1

Type: fixed

Properties

Property	Type	Required	Description
type	`string`	✅
chunkSize	`integer`	✅	Fixed chunk size in tokens/chars
chunkOverlap	`integer`	✅	Overlap between chunks
unit	`Enum<'tokens' \| 'characters'>`	✅

Option 2

Type: semantic

Properties

Property	Type	Required	Description
type	`string`	✅
model	`string`	optional	Model for semantic chunking
minChunkSize	`integer`	✅
maxChunkSize	`integer`	✅

Option 3

Type: recursive

Properties

Property	Type	Required	Description
type	`string`	✅
separators	`string[]`	✅
chunkSize	`integer`	✅
chunkOverlap	`integer`	✅

Option 4

Type: markdown

Properties

Property	Type	Required	Description
type	`string`	✅
maxChunkSize	`integer`	✅
respectHeaders	`boolean`	✅	Keep headers with content
respectCodeBlocks	`boolean`	✅	Keep code blocks intact
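
As an illustration of how the `fixed` option's `chunkSize` and `chunkOverlap` interact, here is a minimal sketch of a character-based splitter. The `FixedChunking` interface and the `chunkFixed` function are hypothetical local definitions that mirror the table above; they are not exports of `@objectstack/spec`. The sketch assumes `unit: 'characters'`, since token-based splitting would need a tokenizer.

```typescript
// Local mirror of the documented "fixed" option (assumption, not a package export).
interface FixedChunking {
  type: 'fixed';
  chunkSize: number;
  chunkOverlap: number;
  unit: 'tokens' | 'characters';
}

// Splits text into windows of chunkSize characters. Each window starts
// (chunkSize - chunkOverlap) characters after the previous one.
function chunkFixed(text: string, strategy: FixedChunking): string[] {
  if (strategy.unit !== 'characters') {
    throw new Error('this sketch only handles unit: "characters"');
  }
  const step = strategy.chunkSize - strategy.chunkOverlap;
  if (strategy.chunkSize <= 0 || step <= 0) {
    throw new Error('chunkSize must be positive and larger than chunkOverlap');
  }
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + strategy.chunkSize));
    // Stop once a window has reached the end of the text.
    if (start + strategy.chunkSize >= text.length) break;
  }
  return chunks;
}

const chunks = chunkFixed('abcdefghij', {
  type: 'fixed',
  chunkSize: 4,
  chunkOverlap: 2,
  unit: 'characters',
});
console.log(chunks); // [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

Consecutive chunks share `chunkOverlap` characters, which is why `chunkOverlap` has to stay below `chunkSize` for the window to advance.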

DocumentChunk

Properties

Property	Type	Required	Description
id	`string`	✅	Unique chunk identifier
content	`string`	✅	Chunk text content
embedding	`number[]`	optional	Embedding vector
metadata	`Object`	✅
chunkIndex	`integer`	✅	Chunk position in document
tokens	`integer`	optional	Token count

DocumentLoaderConfig

Properties

Property	Type	Required	Description
type	`Enum<'file' \| 'directory' \| 'url' \| 'api' \| 'database' \| 'custom'>`	✅
source	`string`	✅	Source path, URL, or identifier
fileTypes	`string[]`	optional	Accepted file extensions (e.g., [".pdf", ".md"])
recursive	`boolean`	✅	Process directories recursively
maxFileSize	`integer`	optional	Maximum file size in bytes
excludePatterns	`string[]`	optional	Patterns to exclude
extractImages	`boolean`	✅	Extract text from images (OCR)
extractTables	`boolean`	✅	Extract and format tables
loaderConfig	`Record<string, any>`	optional	Custom loader-specific config

DocumentMetadata

Properties

Property	Type	Required	Description
source	`string`	✅	Document source (file path, URL, etc.)
sourceType	`Enum<'file' \| 'url' \| 'api' \| 'database' \| 'custom'>`	optional
title	`string`	optional
author	`string`	optional	Document author
createdAt	`string`	optional	ISO timestamp
updatedAt	`string`	optional	ISO timestamp
tags	`string[]`	optional
category	`string`	optional
language	`string`	optional	Document language (ISO 639-1 code)
custom	`Record<string, any>`	optional	Custom metadata fields

EmbeddingModel

Properties

Property	Type	Required	Description
provider	`Enum<'openai' \| 'cohere' \| 'huggingface' \| 'azure_openai' \| 'local' \| 'custom'>`	✅
model	`string`	✅	Model name (e.g., "text-embedding-3-large")
dimensions	`integer`	✅	Embedding vector dimensions
maxTokens	`integer`	optional	Maximum tokens per embedding
batchSize	`integer`	✅	Batch size for embedding
endpoint	`string`	optional	Custom endpoint URL
apiKey	`string`	optional	API key
secretRef	`string`	optional	Reference to stored secret
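
The `batchSize` and `dimensions` fields suggest two checks a client might perform around an embedding call. The sketch below shows one possible shape for them; `toBatches` and `assertDimensions` are hypothetical helpers written for this example, and the actual provider call is left out.

```typescript
// Splits items into consecutive batches of at most batchSize elements.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1 || Math.floor(batchSize) !== batchSize) {
    throw new Error('batchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Throws if any returned vector does not have the configured length.
function assertDimensions(vectors: number[][], dimensions: number): void {
  for (let i = 0; i < vectors.length; i++) {
    if (vectors[i].length !== dimensions) {
      throw new Error('vector ' + i + ' has length ' + vectors[i].length + ', expected ' + dimensions);
    }
  }
}

const batches = toBatches(['a', 'b', 'c', 'd', 'e'], 2);
console.log(batches); // [ [ 'a', 'b' ], [ 'c', 'd' ], [ 'e' ] ]
```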

FilterExpression

Properties

Property	Type	Required	Description
field	`string`	✅	Metadata field to filter
operator	`Enum<'eq' \| 'neq' \| 'gt' \| 'gte' \| 'lt' \| 'lte' \| 'in' \| 'nin' \| 'contains'>`	✅
value	`string \| number \| boolean \| (string \| number)[]`	✅	Filter value

FilterGroup

Properties

Property	Type	Required	Description
logic	`Enum<'and' \| 'or'>`	✅
filters	`(FilterExpression \| FilterGroup)[]`	✅

MetadataFilter

Union Options

This schema accepts one of the following structures:

Option 1

Properties

Property	Type	Required	Description
field	`string`	✅	Metadata field to filter
operator	`Enum<'eq' \| 'neq' \| 'gt' \| 'gte' \| 'lt' \| 'lte' \| 'in' \| 'nin' \| 'contains'>`	✅
value	`string \| number \| boolean \| (string \| number)[]`	✅	Filter value

Option 2

Reference: FilterGroup

Option 3

Type: Record<string, string | number | boolean | (string | number)[]>
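
To make the nesting of `FilterExpression` inside `FilterGroup` concrete, the following sketch evaluates a filter against a chunk's metadata. The types are local mirrors of the tables above, and the operator semantics (for example, `contains` meaning substring or array membership, and ordering operators applying only to numbers) are assumptions made for this example; the spec only defines the shapes. The `Record` shorthand (Option 3) is not handled here.

```typescript
// Local mirrors of the documented shapes (assumptions, not package exports).
type FilterValue = string | number | boolean | Array<string | number>;
type Operator = 'eq' | 'neq' | 'gt' | 'gte' | 'lt' | 'lte' | 'in' | 'nin' | 'contains';

interface FilterExpression { field: string; operator: Operator; value: FilterValue; }
interface FilterGroup { logic: 'and' | 'or'; filters: Array<FilterExpression | FilterGroup>; }

// Evaluates a single expression. Operator semantics are assumed, see lead-in.
function matchesExpression(meta: Record<string, unknown>, f: FilterExpression): boolean {
  const actual = meta[f.field];
  const expected = f.value;
  switch (f.operator) {
    case 'eq': return actual === expected;
    case 'neq': return actual !== expected;
    case 'in': return Array.isArray(expected) && expected.indexOf(actual as string | number) !== -1;
    case 'nin': return Array.isArray(expected) && expected.indexOf(actual as string | number) === -1;
    case 'contains':
      if (typeof actual === 'string' && typeof expected === 'string') {
        return actual.indexOf(expected) !== -1;
      }
      return Array.isArray(actual) && actual.indexOf(expected) !== -1;
    default: {
      // Remaining operators are ordering comparisons on numbers.
      if (typeof actual !== 'number' || typeof expected !== 'number') return false;
      if (f.operator === 'gt') return actual > expected;
      if (f.operator === 'gte') return actual >= expected;
      if (f.operator === 'lt') return actual < expected;
      return actual <= expected;
    }
  }
}

// Recursively evaluates an expression or a nested group.
function matchesFilter(meta: Record<string, unknown>, filter: FilterExpression | FilterGroup): boolean {
  if ('logic' in filter) {
    return filter.logic === 'and'
      ? filter.filters.every((child) => matchesFilter(meta, child))
      : filter.filters.some((child) => matchesFilter(meta, child));
  }
  return matchesExpression(meta, filter);
}

const meta = { category: 'guide', year: 2024, tags: ['rag', 'search'] };
const filter: FilterGroup = {
  logic: 'and',
  filters: [
    { field: 'category', operator: 'eq', value: 'guide' },
    {
      logic: 'or',
      filters: [
        { field: 'year', operator: 'gte', value: 2025 },
        { field: 'tags', operator: 'contains', value: 'rag' },
      ],
    },
  ],
};
const matched = matchesFilter(meta, filter);
console.log(matched); // true
```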

RAGPipelineConfig

Properties

Property	Type	Required	Description
name	`string`	✅	Pipeline name (snake_case)
label	`string`	✅	Display name
description	`string`	optional
embedding	`EmbeddingModel`	✅
vectorStore	`VectorStoreConfig`	✅
chunking	`ChunkingStrategy`	✅
retrieval	`RetrievalStrategy`	✅
reranking	`RerankingConfig`	optional
loaders	`DocumentLoaderConfig[]`	optional	Document loaders
maxContextTokens	`integer`	✅	Maximum tokens in context
contextWindow	`integer`	optional	LLM context window size
metadataFilters	`FilterExpression \| FilterGroup \| Record<string, string \| number \| boolean \| (string \| number)[]>`	optional	Global filters for retrieval
enableCache	`boolean`	✅
cacheTTL	`integer`	✅	Cache TTL in seconds
cacheInvalidationStrategy	`Enum<'time_based' \| 'manual' \| 'on_update'>`	optional
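
For orientation, a complete pipeline definition might look like the object below. Every value is illustrative, the secret name and index name are hypothetical, and the field names follow the tables on this page. In a project this object would presumably be passed through `RAGPipelineConfig.parse(...)`; the sketch keeps it as a plain literal so it runs without the package.

```typescript
// Illustrative values only; field names follow the tables on this page.
const pipeline = {
  name: 'support_docs',
  label: 'Support Docs',
  description: 'Answers questions from the product handbook',
  embedding: {
    provider: 'openai',
    model: 'text-embedding-3-large',
    dimensions: 3072,
    batchSize: 100,
    secretRef: 'openai_api_key', // hypothetical secret name
  },
  vectorStore: {
    provider: 'pgvector',
    indexName: 'support_docs_chunks', // hypothetical index name
    dimensions: 3072, // kept equal to embedding.dimensions
    metric: 'cosine',
    batchSize: 100,
    connectionPoolSize: 10,
    timeout: 30000, // milliseconds
  },
  chunking: { type: 'markdown', maxChunkSize: 1000, respectHeaders: true, respectCodeBlocks: true },
  retrieval: { type: 'hybrid', topK: 8, vectorWeight: 0.7, keywordWeight: 0.3 },
  reranking: { enabled: true, provider: 'cohere', model: 'rerank-english-v3.0', topK: 4 },
  maxContextTokens: 4000,
  enableCache: true,
  cacheTTL: 3600, // seconds
  cacheInvalidationStrategy: 'time_based',
};

// A sanity check worth running before indexing: stored vectors must have
// the same length as the vectors the embedding model produces.
const dimensionsMatch = pipeline.embedding.dimensions === pipeline.vectorStore.dimensions;
console.log(dimensionsMatch); // true
```

`dimensions` appears on both `embedding` and `vectorStore`, so a mismatch between the two is an easy mistake to make and a cheap one to check for.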

RAGPipelineStatus

Properties

Property	Type	Required	Description
name	`string`	✅
status	`Enum<'active' \| 'indexing' \| 'error' \| 'disabled'>`	✅
documentsIndexed	`integer`	✅
lastIndexed	`string`	optional	ISO timestamp
errorMessage	`string`	optional
health	`Object`	optional

RAGQueryRequest

Properties

Property	Type	Required	Description
query	`string`	✅	User query
pipelineName	`string`	✅	Pipeline to use
topK	`integer`	optional
metadataFilters	`Record<string, any>`	optional
conversationHistory	`Object[]`	optional
includeMetadata	`boolean`	✅
includeSources	`boolean`	✅

RAGQueryResponse

Properties

Property	Type	Required	Description
query	`string`	✅
results	`Object[]`	✅
context	`string`	✅	Assembled context for LLM
tokens	`Object`	optional	Token usage for this query
cost	`number`	optional	Cost for this query in USD
retrievalTime	`number`	optional	Retrieval time in milliseconds

RerankingConfig

Properties

Property	Type	Required	Description
enabled	`boolean`	✅
model	`string`	optional	Reranking model name
provider	`Enum<'cohere' \| 'huggingface' \| 'custom'>`	optional
topK	`integer`	✅	Final number of results after reranking

RetrievalStrategy

Union Options

This schema accepts one of the following structures:

Option 1

Type: similarity

Properties

Property	Type	Required	Description
type	`string`	✅
topK	`integer`	✅	Number of results to retrieve
scoreThreshold	`number`	optional	Minimum similarity score

Option 2

Type: mmr

Properties

Property	Type	Required	Description
type	`string`	✅
topK	`integer`	✅
fetchK	`integer`	✅	Initial fetch size
lambda	`number`	✅	Diversity vs relevance (0=diverse, 1=relevant)

Option 3

Type: hybrid

Properties

Property	Type	Required	Description
type	`string`	✅
topK	`integer`	✅
vectorWeight	`number`	✅	Weight for vector search
keywordWeight	`number`	✅	Weight for keyword search

Option 4

Type: parent_document

Properties

Property	Type	Required	Description
type	`string`	✅
topK	`integer`	✅
retrieveParent	`boolean`	✅	Retrieve full parent document
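
The `lambda` parameter of the `mmr` option trades relevance against diversity. The sketch below is a generic implementation of maximal marginal relevance over embedding vectors, written to show that trade-off; it is not taken from the ObjectStack source. It assumes the `candidates` array already holds the `fetchK` nearest neighbours, and `cosine` and `mmrSelect` are hypothetical helper names.

```typescript
// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const norm = Math.sqrt(normA * normB);
  return norm === 0 ? 0 : dot / norm;
}

// Greedy MMR: repeatedly pick the candidate that maximises
// lambda * relevance - (1 - lambda) * (max similarity to already-picked items).
// Returns the indices of the selected candidates in selection order.
function mmrSelect(query: number[], candidates: number[][], topK: number, lambda: number): number[] {
  const selected: number[] = [];
  const remaining: number[] = candidates.map((_, i) => i);
  while (selected.length < topK && remaining.length > 0) {
    let bestPos = 0;
    let bestScore = -Infinity;
    for (let p = 0; p < remaining.length; p++) {
      const idx = remaining[p];
      const relevance = cosine(query, candidates[idx]);
      // With nothing selected yet there is no redundancy penalty.
      let redundancy = selected.length === 0 ? 0 : -Infinity;
      for (const s of selected) {
        redundancy = Math.max(redundancy, cosine(candidates[idx], candidates[s]));
      }
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestPos = p;
      }
    }
    selected.push(remaining.splice(bestPos, 1)[0]);
  }
  return selected;
}

const query = [1, 0];
const candidates = [
  [1, 0.1],  // index 0: closest to the query
  [1, 0.12], // index 1: near-duplicate of index 0
  [1, -1],   // index 2: less relevant but different
];
console.log(mmrSelect(query, candidates, 2, 1));   // [ 0, 1 ]
console.log(mmrSelect(query, candidates, 2, 0.5)); // [ 0, 2 ]
```

With `lambda = 1` the selection is plain similarity ranking, so the near-duplicate is kept. Lowering `lambda` to 0.5 penalises the near-duplicate enough that the more distinct candidate takes its place.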

VectorStoreConfig

Properties

Property	Type	Required	Description
provider	`Enum<'pinecone' \| 'weaviate' \| 'qdrant' \| 'milvus' \| 'chroma' \| 'pgvector' \| 'redis' \| 'opensearch' \| 'elasticsearch' \| 'custom'>`	✅
indexName	`string`	✅	Index/collection name
namespace	`string`	optional	Namespace for multi-tenancy
host	`string`	optional	Vector store host
port	`integer`	optional	Vector store port
secretRef	`string`	optional	Reference to stored secret
apiKey	`string`	optional	API key or reference to secret
dimensions	`integer`	✅	Vector dimensions
metric	`Enum<'cosine' \| 'euclidean' \| 'dotproduct'>`	✅
batchSize	`integer`	✅
connectionPoolSize	`integer`	✅
timeout	`integer`	✅	Timeout in milliseconds
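
The three `metric` values rank neighbours differently. As a rough guide, the sketch below computes each one for a pair of vectors. The `compare` function is a hypothetical helper for this example, and real vector stores may normalise or negate these values internally.

```typescript
type Metric = 'cosine' | 'euclidean' | 'dotproduct';

// Inner product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// cosine and dotproduct are similarities (higher means closer);
// euclidean is a distance (lower means closer).
function compare(metric: Metric, a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('vectors must share the configured dimensions');
  }
  if (metric === 'dotproduct') return dot(a, b);
  if (metric === 'cosine') {
    const norm = Math.sqrt(dot(a, a) * dot(b, b));
    return norm === 0 ? 0 : dot(a, b) / norm;
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

const a = [3, 4];
const b = [4, 3];
console.log(compare('dotproduct', a, b)); // 24
console.log(compare('cosine', a, b));     // 0.96
console.log(compare('euclidean', a, b));  // 1.4142135623730951
```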

VectorStoreProvider

Allowed Values

pinecone
weaviate
qdrant
milvus
chroma
pgvector
redis
opensearch
elasticsearch
custom
