Pattern: RAG¶
Motivation¶
When you need information, you don't rely solely on memory. You search the web, consult reference books, ask experts, or look through documentation. You retrieve relevant information and use it to answer questions or make decisions. Retrieval Augmented Generation (RAG) gives agents this capability: accessing external knowledge bases, finding relevant information, and augmenting their responses with retrieved context, just as humans look things up when needed.
LLMs exhibit substantial capabilities in generating human-like text. However, their knowledge base is typically confined to the data on which they were trained, limiting their access to real-time information, specific company data, or highly specialized details. Retrieval Augmented Generation addresses this limitation. RAG enables LLMs to access and integrate external, current, and context-specific information, thereby enhancing the accuracy, relevance, and factual basis of their outputs.
For AI agents, this is crucial as it allows them to ground their actions and responses in real-time, verifiable data beyond their static training. This capability enables them to perform complex tasks accurately, such as accessing the latest company policies to answer a specific question or checking current inventory before placing an order. By integrating external knowledge, RAG transforms agents from simple conversationalists into effective, data-driven tools capable of executing meaningful work.
Pattern Overview¶
Problem¶
An LLM's knowledge base is typically confined to the data on which it was trained, limiting its access to real-time information, specific company data, or highly specialized details. Without access to external knowledge bases, LLMs cannot provide up-to-date information, access proprietary or domain-specific data, or ground their responses in verifiable sources. This limitation leads to hallucinations, outdated information, and an inability to answer questions about recent events or specialized knowledge absent from the training data.
Solution¶
The RAG pattern significantly enhances the capabilities of LLMs by granting them access to external knowledge bases before generating a response. Instead of relying solely on their internal, pre-trained knowledge, RAG allows LLMs to "look up" information, much like a human might consult a book or search the internet. This process empowers LLMs to provide more accurate, up-to-date, and verifiable answers.
RAG is composed of two main steps. First, the system retrieves relevant information from a large knowledge base—not just identifying relevant documents or links, but extracting the most pertinent text segments from those documents. This retrieval step goes beyond simple document matching; it must identify and extract the specific passages, sentences, or chunks that directly address the user's query. This search is a "semantic search" that understands the user's intent and the meaning behind their words. Second, the extracted information is augmented into the LLM's context, enabling the model to generate a direct answer to the query rather than simply returning a list of sources like a traditional search engine. The LLM synthesizes the retrieved context with its reasoning capabilities to produce a coherent, contextually grounded response.
The RAG framework provides several significant benefits. It allows LLMs to access up-to-date information, thereby overcoming the constraints of their static training data. This approach also reduces the risk of "hallucination"—the generation of false information—by grounding responses in verifiable data. Moreover, LLMs can utilize specialized knowledge found in internal company documents or wikis. A vital advantage of this process is the capability to offer "citations," which pinpoint the exact source of information, thereby enhancing the trustworthiness and verifiability of the AI's responses.
Key Concepts¶
- Embeddings: Numerical representations of text (words, phrases, documents) as vectors that capture semantic meaning and relationships in a mathematical space.
- Semantic Search: A search method that understands user intent and meaning, finding conceptually similar content even when wording differs.
- Chunking: Breaking down large documents into smaller, manageable pieces (chunks) to enable efficient retrieval and focused context.
- Vector Databases: Specialized databases designed to store and query embeddings efficiently, enabling rapid semantic search through millions of vectors.
- Retrieval-Augmentation: The process of retrieving relevant information from external knowledge bases and augmenting it into the LLM's context before generation.
- Hybrid Search: Combining keyword-based methods (BM25) with semantic search to leverage both exact matches and conceptual relevance.
Historical Context and Pattern Classification
RAG represents one of the earliest AI agent implementations to gain widespread adoption. In fact, RAG can be viewed as a specific agent implementation rather than a design pattern in its own right: it is a concrete system in which the agent's tools access relevant data in a knowledge base, retrieve it, and augment the LLM's context with that information. However, we have chosen to present RAG as a design pattern in this book because the fundamental mechanisms it employs, such as embeddings, semantic similarity, and relevance-based retrieval, represent reusable solutions to the broader problem of knowledge access and grounding. These mechanisms provide a template for how agents can extend their capabilities through external knowledge retrieval, making RAG both a specific implementation and a pattern that can be adapted and applied across different agent architectures.
In addition, it's important to note that basic RAG is more a workflow than a true agentic solution: it follows a predetermined sequence of query → retrieve → augment → generate. The LLM doesn't autonomously decide when or how to retrieve information; the retrieval step is hardcoded into the pipeline. However, RAG can be viewed as a reusable skill for accessing information from corpora that can be integrated into agent architectures. This is especially valuable for agents that primarily work with large document corpora or must abide by an organization's rules and policies. In such contexts, RAG serves as a specialized capability that agents can invoke when they need to ground their responses in specific documentation or knowledge bases, making it a composable building block for more sophisticated agentic systems.
How It Works¶
RAG operates through a two-phase process: retrieval and augmentation. When a user poses a question, the system first performs semantic search across a knowledge base. The query is converted into an embedding vector, which is then compared against pre-computed embeddings of document chunks stored in a vector database. The system identifies and extracts the most relevant text segments—not just entire documents, but specific passages that directly address the query.
The retrieved chunks are then augmented into the LLM's context, creating an enriched prompt that combines the user's question with relevant external information. The LLM synthesizes this retrieved context with its reasoning capabilities to generate a response that is factually grounded in the retrieved data. This process enables the LLM to provide accurate, up-to-date answers with citations, overcoming the limitations of static training data while reducing hallucinations by grounding responses in verifiable sources.
To fully appreciate how RAG functions, it's essential to understand a few core concepts:
Embeddings¶
In the context of LLMs, embeddings are numerical representations of text, such as words, phrases, or entire documents. These representations are in the form of a vector, which is a list of numbers. The key idea is to capture the semantic meaning and the relationships between different pieces of text in a mathematical space. Words or phrases with similar meanings will have embeddings that are closer to each other in this vector space. For instance, imagine a simple 2D graph. The word "cat" might be represented by the coordinates (2, 3), while "kitten" would be very close at (2.1, 3.1). In contrast, the word "car" would have a distant coordinate like (8, 1), reflecting its different meaning. In reality, these embeddings are in a much higher-dimensional space with hundreds or even thousands of dimensions, allowing for a very nuanced understanding of language.
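To make the toy example above concrete, here is a minimal sketch in which hand-picked 2-D vectors stand in for real model-generated embeddings, showing how closeness in the vector space is measured with cosine similarity:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean similar meaning."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked 2-D vectors mirroring the example; real embeddings come from a
# trained model and have hundreds or thousands of dimensions.
cat = np.array([2.0, 3.0])
kitten = np.array([2.1, 3.1])
car = np.array([8.0, 1.0])

print(f"cat vs kitten: {cosine_similarity(cat, kitten):.3f}")  # close to 1.0
print(f"cat vs car:    {cosine_similarity(cat, car):.3f}")     # clearly lower
```

The same distance computation works unchanged on real embeddings; only the origin and dimensionality of the vectors differ.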
Text Similarity¶
Text similarity refers to the measure of how alike two pieces of text are. This can be at a surface level, looking at the overlap of words (lexical similarity), or at a deeper, meaning-based level. In the context of RAG, text similarity is crucial for finding the most relevant information in the knowledge base that corresponds to a user's query. For instance, consider the sentences: "What is the capital of France?" and "Which city is the capital of France?". While the wording is different, they are asking the same question. A good text similarity model would recognize this and assign a high similarity score to these two sentences, even though they only share a few words. This is often calculated using the embeddings of the texts.
Semantic Similarity and Distance¶
Semantic similarity is a more advanced form of text similarity that focuses purely on the meaning and context of the text, rather than just the words used. It aims to understand if two pieces of text convey the same concept or idea. Semantic distance is the inverse of this; a high semantic similarity implies a low semantic distance, and vice versa. In RAG, semantic search relies on finding documents with the smallest semantic distance to the user's query. For instance, the phrases "a furry feline companion" and "a domestic cat" have no words in common besides "a". However, a model that understands semantic similarity would recognize that they refer to the same thing and would consider them to be highly similar. This is because their embeddings would be very close in the vector space, indicating a small semantic distance. This is the "smart search" that allows RAG to find relevant information even when the user's wording doesn't exactly match the text in the knowledge base.
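As a minimal sketch of semantic similarity in practice, the snippet below assumes the sentence-transformers library and the all-MiniLM-L6-v2 model are available; any sentence-embedding model could be substituted:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

sentences = [
    "a furry feline companion",
    "a domestic cat",
    "the quarterly budget report",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# The first two sentences share almost no words but convey the same concept,
# so their cosine similarity should be much higher than with the third.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
print(util.cos_sim(embeddings[0], embeddings[2]).item())
```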
Chunking of Documents¶
Chunking is the process of breaking down large documents into smaller, more manageable pieces, or "chunks." For a RAG system to work efficiently, it cannot feed entire large documents into the LLM; it processes these smaller chunks instead. The way documents are chunked is important for preserving the context and meaning of the information. For instance, instead of treating a 50-page user manual as a single block of text, a chunking strategy might break it down into sections, paragraphs, or even sentences, so that a section on "Troubleshooting" becomes a separate chunk from the "Installation Guide." When a user asks a question about a specific problem, the RAG system can then retrieve the most relevant troubleshooting chunk, rather than the entire manual. This makes the retrieval process faster and the information provided to the LLM more focused and relevant to the user's immediate need. Once documents are chunked, the RAG system must employ a retrieval technique to find the most relevant pieces for a given query.
The primary method is vector search, which uses embeddings and semantic distance to find chunks that are conceptually similar to the user's question. An older, but still valuable, technique is BM25, a keyword-based algorithm that ranks chunks based on term frequency without understanding semantic meaning. To get the best of both worlds, hybrid search approaches are often used, combining the keyword precision of BM25 with the contextual understanding of semantic search. This fusion allows for more robust and accurate retrieval, capturing both literal matches and conceptual relevance.
It is important to note that in RAG systems, the search and extraction phase—specifically, not just identifying relevant documents or links, but extracting the most relevant text segments from those documents—is typically the most challenging task. The quality of the final response depends critically on the system's ability to pinpoint and extract the precise passages that directly address the user's query, rather than returning entire documents or irrelevant sections. This challenge makes effective chunking strategies and precise retrieval mechanisms essential components of a successful RAG implementation.
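The following sketch illustrates one simple chunking strategy: splitting along paragraph boundaries with a character budget and a small overlap so that context survives the chunk boundary. The size and overlap values are arbitrary illustrations, not recommendations:

```python
from typing import List

def chunk_document(text: str, max_chars: int = 800, overlap: int = 100) -> List[str]:
    """Split text into paragraph-aligned chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            # Carry the tail of the previous chunk forward to preserve context.
            current = current[-overlap:] + "\n\n" + para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

manual = "Installation Guide\n\nConnect the device...\n\nTroubleshooting\n\nIf the light blinks red..."
for i, chunk in enumerate(chunk_document(manual, max_chars=60, overlap=20)):
    print(f"Chunk {i}: {chunk!r}")
```

Production systems typically chunk by tokens rather than characters and attach metadata (source, section title, timestamp) to each chunk for later filtering and citation.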
TF-IDF and BM25: Keyword-Based Search Approaches
Before the advent of semantic search and embeddings, information retrieval systems relied on keyword-based approaches. Two fundamental algorithms in this domain are TF-IDF (Term Frequency-Inverse Document Frequency) and BM25 (Best Matching 25). These methods excel at finding documents that contain specific keywords from a query, making them particularly effective for exact term matching and keyword-focused searches.
TF-IDF (Term Frequency-Inverse Document Frequency)¶
TF-IDF is a statistical measure that evaluates how important a word is to a document within a collection of documents. It combines two components:
- Term Frequency (TF): Measures how frequently a term appears in a document. The intuition is that words appearing more often in a document are likely more relevant to that document's topic.
- Inverse Document Frequency (IDF): Measures how rare or common a term is across the entire document collection. Common words (like "the", "is", "a") appear in many documents and receive low IDF scores, while rare, distinctive words receive high IDF scores.
The TF-IDF score is calculated as:
TF-IDF(t, d) = TF(t, d) × IDF(t)
Where:
- TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)
- IDF(t) = log(Total number of documents / Number of documents containing term t)
Example: If the word "quantum" appears 5 times in a 100-word document, and "quantum" appears in 10 out of 1000 documents:
- TF = 5/100 = 0.05
- IDF = log(1000/10) = log(100) ≈ 4.61
- TF-IDF = 0.05 × 4.61 ≈ 0.23
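The same worked example, reproduced as a short snippet (using the natural logarithm, matching the numbers above):

```python
import math

def tf_idf(term_count: int, doc_length: int, num_docs: int, docs_with_term: int) -> float:
    """TF-IDF(t, d) = TF(t, d) × IDF(t), using the definitions above."""
    tf = term_count / doc_length
    idf = math.log(num_docs / docs_with_term)
    return tf * idf

print(round(tf_idf(term_count=5, doc_length=100, num_docs=1000, docs_with_term=10), 2))  # 0.23
```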
Strengths: Simple, interpretable, effective for keyword matching, no training required.
Limitations: Cannot capture semantic meaning, treats all word positions equally, doesn't understand synonyms or context.
BM25 (Best Matching 25)¶
BM25 is an evolution of TF-IDF that addresses some of its limitations. It's a probabilistic ranking function that scores documents based on how well they match a query. BM25 improves upon TF-IDF by:
- Saturating term frequency: Unlike TF-IDF, where term frequency grows linearly, BM25 uses a saturation function that prevents very frequent terms from dominating the score.
- Length normalization: BM25 normalizes scores by document length, preventing longer documents from having an unfair advantage simply because they contain more words.
The BM25 formula is:
score(d, q) = Σ over query terms qi of: IDF(qi) × [ f(qi, d) × (k1 + 1) ] / [ f(qi, d) + k1 × (1 − b + b × |d| / avgdl) ]
Where:
- q = query
- d = document
- f(qi, d) = frequency of query term qi in document d
- |d| = length of document d (number of words)
- avgdl = average document length in the collection
- k1 = term frequency saturation parameter (typically 1.2-2.0)
- b = length normalization parameter (typically 0.75)
- IDF(qi) = inverse document frequency of query term qi
Key Improvements over TF-IDF:
- Term frequency saturation: The (k1 + 1) factor in the numerator and the f(qi, d) + k1 × ... term in the denominator create a saturation curve. After a term appears a certain number of times, additional occurrences contribute less to the score. This prevents documents with excessive repetition from scoring too high.
- Length normalization: The (1 − b + b × |d|/avgdl) component penalizes longer documents. If b = 0, there's no length normalization. If b = 1, full normalization is applied. Typically b = 0.75 provides a good balance.
Example: Consider a query "machine learning" and two documents:
- Document A (50 words): "machine learning" appears 3 times
- Document B (500 words): "machine learning" appears 10 times

With TF-IDF, Document B might score higher simply because it has more occurrences. With BM25, Document A could score higher because:
1. The term frequency is normalized by document length
2. The saturation function means the 3 occurrences in the shorter document are more significant than the 10 in the longer one
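A quick way to verify this behavior is to compute only the BM25 term-frequency component for both documents (the IDF factor is identical for each, so it can be ignored for the comparison). The parameter values k1 = 1.5 and b = 0.75 are typical defaults, chosen here as an assumption:

```python
def bm25_tf_component(tf: int, doc_len: int, avg_doc_len: float,
                      k1: float = 1.5, b: float = 0.75) -> float:
    """Term-frequency part of the BM25 score: saturation plus length normalization."""
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

avg_len = (50 + 500) / 2
print(f"Document A (3 hits, 50 words):   {bm25_tf_component(3, 50, avg_len):.3f}")
print(f"Document B (10 hits, 500 words): {bm25_tf_component(10, 500, avg_len):.3f}")
# Document A edges ahead despite having fewer occurrences of the query term.
```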
Strengths:
- Better handling of document length variations
- Term frequency saturation prevents over-weighting repeated terms
- Proven effectiveness in information retrieval (used by search engines like Elasticsearch)
- No training required, works out of the box
Limitations:
- Still keyword-based, cannot understand semantic meaning
- Doesn't handle synonyms or paraphrasing
- Requires exact term matches (though stemming can help)
When to Use TF-IDF vs BM25¶
- Use TF-IDF when:
    - You need a simple, interpretable baseline
    - Documents are roughly the same length
    - You want to understand exactly why documents are ranked
- Use BM25 when:
    - Documents vary significantly in length
    - You want better ranking quality (BM25 generally outperforms TF-IDF)
    - You're building a production search system
TF-IDF and BM25 in RAG Systems¶
In modern RAG systems, TF-IDF and BM25 are often used in hybrid search approaches:
- Keyword precision: BM25 excels at finding documents with exact keyword matches, which is crucial when users search for specific terms, product names, or technical jargon.
- Complementing semantic search: While semantic search (using embeddings) finds conceptually similar content, BM25 ensures that documents containing the exact query terms are not overlooked.
- Handling rare terms: BM25 is particularly effective for rare, specific terms that might not have strong semantic representations in embedding models.
Hybrid Approach: Many production RAG systems combine BM25 and semantic search by:
- Running both searches in parallel
- Normalizing scores from both methods
- Combining scores with weighted averaging (e.g., 40% BM25 + 60% semantic)
- Returning the top-k results based on combined scores
This hybrid approach leverages the strengths of both methods: BM25's precision for exact matches and semantic search's ability to find conceptually relevant content even without keyword overlap.
Vector Databases¶
A vector database is a specialized type of database designed to store and query embeddings efficiently. After documents are chunked and converted into embeddings, these high-dimensional vectors are stored in a vector database. Traditional retrieval techniques, like keyword-based search, are excellent at finding documents containing exact words from a query but lack a deep understanding of language. They wouldn't recognize that "furry feline companion" means "cat." This is where vector databases excel. They are built specifically for semantic search. By storing text as numerical vectors, they can find results based on conceptual meaning, not just keyword overlap. When a user's query is also converted into a vector, the database uses highly optimized algorithms (like HNSW - Hierarchical Navigable Small World) to rapidly search through millions of vectors and find the ones that are "closest" in meaning. This approach is far superior for RAG because it uncovers relevant context even if the user's phrasing is completely different from the source documents. In essence, while other techniques search for words, vector databases search for meaning.
This technology is implemented in various forms, from managed databases like Pinecone and Weaviate to open-source solutions such as Chroma DB, Milvus, and Qdrant. Even existing databases can be augmented with vector search capabilities, as seen with Redis, Elasticsearch, and Postgres (using the pgvector extension). The core retrieval mechanisms are often powered by libraries like Meta AI's FAISS or Google Research's ScaNN, which are fundamental to the efficiency of these systems.
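As a small illustration of what such libraries do under the hood, the following sketch builds an exact inner-product index with FAISS (one of the libraries mentioned above); random vectors stand in for real embeddings, and with L2-normalized vectors inner-product search is equivalent to cosine similarity:

```python
import numpy as np
import faiss

dim = 384  # assumed embedding dimensionality
doc_vectors = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(doc_vectors)        # normalize so inner product == cosine similarity

index = faiss.IndexFlatIP(dim)         # exact (brute-force) inner-product index
index.add(doc_vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)   # scores and indices of the 5 closest chunks
print(ids[0], scores[0])
```

At larger scales, approximate indexes such as HNSW or IVF trade a little recall for much faster queries; managed vector databases wrap this machinery behind a document-level API.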
Implementations¶
Basic RAG Implementation¶
The following example demonstrates a basic RAG system with document chunking, embedding generation, and vector similarity search:
Basic RAG Implementation
from typing import List, Dict, Tuple
import numpy as np
from dataclasses import dataclass
@dataclass
class Document:
"""Represents a document chunk with its content and metadata."""
content: str
metadata: Dict = None
embedding: np.ndarray = None
class SimpleRAGSystem:
"""
Basic RAG system demonstrating core concepts:
- Document chunking
- Embedding generation
- Vector similarity search
"""
def __init__(self, embedding_model=None):
"""
Initialize RAG system.
Args:
embedding_model: Function that takes text and returns embedding vector.
If None, uses a simple TF-IDF-like approach for demo.
"""
self.documents: List[Document] = []
self.embedding_model = embedding_model or self._simple_embedding
self.vocabulary: Dict[str, int] = {} # Word to index mapping
self.vocab_built = False
def _build_vocabulary(self, texts: List[str]):
"""Build vocabulary from all documents."""
all_words = set()
for text in texts:
words = text.lower().split()
all_words.update(words)
# Create word to index mapping
self.vocabulary = {word: idx for idx, word in enumerate(sorted(all_words))}
self.vocab_built = True
def _simple_embedding(self, text: str) -> np.ndarray:
"""
Simple embedding function for demonstration.
In production, use models like OpenAI's text-embedding-ada-002,
Sentence-BERT, or similar.
"""
if not self.vocab_built:
# If vocabulary not built yet, create a simple embedding
words = text.lower().split()
unique_words = list(set(words))
embedding = np.zeros(len(unique_words))
for word in words:
if word in unique_words:
embedding[unique_words.index(word)] += 1
norm = np.linalg.norm(embedding)
return embedding / norm if norm > 0 else embedding
# Use consistent vocabulary
embedding = np.zeros(len(self.vocabulary))
words = text.lower().split()
for word in words:
if word in self.vocabulary:
embedding[self.vocabulary[word]] += 1
# Normalize
norm = np.linalg.norm(embedding)
return embedding / norm if norm > 0 else embedding
def add_documents(self, texts: List[str], metadata: List[Dict] = None):
"""
Add documents to the knowledge base.
Args:
texts: List of document texts (already chunked)
metadata: Optional list of metadata dictionaries
"""
if metadata is None:
metadata = [{}] * len(texts)
# Build vocabulary from all texts first
if not self.vocab_built:
self._build_vocabulary(texts)
for text, meta in zip(texts, metadata):
embedding = self.embedding_model(text)
doc = Document(content=text, metadata=meta, embedding=embedding)
self.documents.append(doc)
def retrieve(self, query: str, top_k: int = 3) -> List[Tuple[Document, float]]:
"""
Retrieve most relevant documents for a query using semantic similarity.
Args:
query: User's query text
top_k: Number of top results to return
Returns:
List of (document, similarity_score) tuples, sorted by relevance
"""
# Generate query embedding
query_embedding = self.embedding_model(query)
# Calculate cosine similarity with all documents
similarities = []
for doc in self.documents:
# Cosine similarity: dot product of normalized vectors
similarity = np.dot(query_embedding, doc.embedding)
similarities.append((doc, similarity))
# Sort by similarity (descending) and return top_k
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:top_k]
def query(self, query: str, top_k: int = 3) -> str:
"""
Complete RAG pipeline: retrieve relevant context and format for LLM.
Args:
query: User's question
top_k: Number of context chunks to retrieve
Returns:
Formatted context string ready for LLM prompt augmentation
"""
# Retrieve relevant documents
results = self.retrieve(query, top_k)
# Format context for LLM
context_parts = []
for i, (doc, score) in enumerate(results, 1):
context_parts.append(
f"[Context {i}] (Relevance: {score:.3f})\n{doc.content}\n"
)
context = "\n".join(context_parts)
# Return augmented prompt
return f"""Based on the following context, answer the question.
Context:
{context}
Question: {query}
Answer:"""
def main():
"""Basic RAG example."""
# Initialize and add documents
rag = SimpleRAGSystem()
documents = [
"RAG enhances LLMs by retrieving relevant information from external knowledge bases.",
"Embeddings capture semantic meaning in a high-dimensional vector space.",
"Vector databases enable fast semantic search using algorithms like HNSW.",
]
rag.add_documents(documents)
# Query and retrieve
query = "How does RAG work with embeddings?"
results = rag.retrieve(query, top_k=2)
print(f"Query: {query}")
for doc, score in results:
print(f" Score: {score:.3f} - {doc.content[:60]}...")
# Get formatted prompt
prompt = rag.query(query, top_k=2)
print(f"\nAugmented prompt:\n{prompt[:150]}...")
if __name__ == "__main__":
main()
Vector Database Integration¶
For production systems, vector databases provide scalable storage and efficient querying. This example shows integration with Chroma DB:
Vector Database Integration
from typing import List, Dict, Optional
import chromadb
from chromadb.config import Settings
class VectorDatabaseRAG:
"""
RAG system using Chroma DB vector database for production-ready
semantic search at scale.
"""
def __init__(self, collection_name: str = "knowledge_base", persist_directory: str = "./chroma_db", use_default_embeddings: bool = True):
"""
Initialize RAG system with Chroma DB.
Args:
collection_name: Name of the collection to store documents
persist_directory: Directory to persist the database
use_default_embeddings: If True, use Chroma's default embedding function
"""
# Initialize Chroma client with persistence
self.client = chromadb.PersistentClient(
path=persist_directory,
settings=Settings(anonymized_telemetry=False)
)
# Get or create collection
# If using default embeddings, Chroma will auto-generate them
# Otherwise, we'll provide our own
self.collection = self.client.get_or_create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"} # Use cosine similarity
)
self.use_default_embeddings = use_default_embeddings
def add_documents(
self,
texts: List[str],
embeddings: Optional[List[List[float]]] = None,
metadatas: Optional[List[Dict]] = None,
ids: Optional[List[str]] = None
):
"""
Add documents to the vector database.
Args:
texts: List of document chunk texts
embeddings: Pre-computed embeddings for each text (optional if using default)
metadatas: Optional metadata for each document (source, timestamp, etc.)
ids: Optional unique IDs for each document
"""
if metadatas is None:
metadatas = [{}] * len(texts)
if ids is None:
ids = [f"doc_{i}" for i in range(len(texts))]
# If using default embeddings, don't provide embeddings parameter
if self.use_default_embeddings and embeddings is None:
self.collection.add(
documents=texts,
metadatas=metadatas,
ids=ids
)
else:
# Use provided embeddings
if embeddings is None:
raise ValueError("embeddings must be provided when use_default_embeddings=False")
self.collection.add(
embeddings=embeddings,
documents=texts,
metadatas=metadatas,
ids=ids
)
def retrieve(
self,
query_embedding: List[float],
top_k: int = 5,
where: Optional[Dict] = None
) -> Dict:
"""
Retrieve relevant documents using semantic search.
Args:
query_embedding: Embedding vector of the query
top_k: Number of results to return
where: Optional metadata filter (e.g., {"source": "wiki"})
Returns:
Dictionary with 'documents', 'metadatas', 'distances', and 'ids'
"""
results = self.collection.query(
query_embeddings=[query_embedding],
n_results=top_k,
where=where
)
return results
def retrieve_by_text(
self,
query_text: str,
top_k: int = 5,
where: Optional[Dict] = None
) -> Dict:
"""
Retrieve using query text (requires embedding model).
Note: In production, you'd use an embedding model here.
This is a placeholder showing the interface.
"""
# In production: query_embedding = embedding_model(query_text)
# For demo, we'll use the collection's embedding function
results = self.collection.query(
query_texts=[query_text],
n_results=top_k,
where=where
)
return results
def format_context_for_llm(self, results: Dict, query: str) -> str:
"""
Format retrieved results into context for LLM prompt.
Args:
results: Results from retrieve() or retrieve_by_text()
query: Original user query
Returns:
Formatted context string with citations
"""
if not results['documents'] or not results['documents'][0]:
return f"No relevant context found for: {query}"
context_parts = []
documents = results['documents'][0]
metadatas = results.get('metadatas', [[]])[0] or [{}] * len(documents)
distances = results.get('distances', [[]])[0] or [0.0] * len(documents)
for i, (doc, metadata, distance) in enumerate(zip(documents, metadatas, distances), 1):
source = metadata.get('source', 'Unknown')
similarity = 1 - distance # Convert distance to similarity
context_parts.append(
f"[Source {i}: {source}] (Similarity: {similarity:.3f})\n{doc}\n"
)
context = "\n".join(context_parts)
return f"""Based on the following retrieved context, answer the question.
Retrieved Context:
{context}
Question: {query}
Answer (cite sources when possible):"""
def main():
"""Vector database RAG example."""
import os
import shutil
# Initialize with cleanup
persist_dir = "./chroma_db_demo"
if os.path.exists(persist_dir):
shutil.rmtree(persist_dir)
rag = VectorDatabaseRAG(persist_directory=persist_dir)
# Add documents
documents = [
"Our remote work policy allows employees to work from home 3 days per week.",
"The Q1 budget for Project Alpha was finalized at €65,000.",
]
metadatas = [
{"source": "hr_policy_2025", "type": "policy"},
{"source": "finance_report_q1", "type": "financial"},
]
rag.add_documents(texts=documents, embeddings=None, metadatas=metadatas)
# Query with filter
query = "What is the remote work policy?"
results = rag.retrieve_by_text(query, top_k=2, where={"type": "policy"})
if results['documents'] and results['documents'][0]:
print(f"Query: {query}")
for doc, metadata in zip(results['documents'][0], results['metadatas'][0]):
print(f" Source: {metadata.get('source')} - {doc[:60]}...")
# Cleanup
if os.path.exists(persist_dir):
shutil.rmtree(persist_dir)
if __name__ == "__main__":
main()
Hybrid Search: Combining BM25 and Semantic Search¶
Hybrid search combines the precision of keyword matching (BM25) with the contextual understanding of semantic search:
Hybrid Search: Combining BM25 and Semantic Search
from typing import List, Dict, Tuple
import numpy as np
from collections import Counter
import math
class HybridSearchRAG:
"""
Hybrid RAG system combining BM25 (keyword-based) and semantic search
for robust retrieval that captures both literal matches and conceptual relevance.
"""
def __init__(self, embedding_model=None):
"""
Initialize hybrid search system.
Args:
embedding_model: Function that generates embeddings (for semantic search)
"""
self.documents: List[str] = []
self.embeddings: List[np.ndarray] = []
self.embedding_model = embedding_model or self._simple_embedding
self.vocabulary: Dict[str, int] = {} # Word to index mapping
self.vocab_built = False
# BM25 parameters
self.k1 = 1.5 # Term frequency saturation parameter
self.b = 0.75 # Length normalization parameter
# BM25 precomputed values
self.doc_freqs: Dict[str, int] = {} # Document frequency for each term
self.idf: Dict[str, float] = {} # Inverse document frequency
self.avg_doc_length = 0.0
self.doc_lengths: List[int] = []
self.term_doc_freqs: List[Dict[str, int]] = [] # Term frequencies per document
def _build_vocabulary(self, texts: List[str]):
"""Build vocabulary from all documents."""
all_words = set()
for text in texts:
words = text.lower().split()
all_words.update(words)
# Create word to index mapping
self.vocabulary = {word: idx for idx, word in enumerate(sorted(all_words))}
self.vocab_built = True
def _simple_embedding(self, text: str) -> np.ndarray:
"""Simple embedding for demonstration."""
if not self.vocab_built:
words = text.lower().split()
unique_words = list(set(words))
embedding = np.zeros(len(unique_words))
for word in words:
if word in unique_words:
embedding[unique_words.index(word)] += 1
norm = np.linalg.norm(embedding)
return embedding / norm if norm > 0 else embedding
# Use consistent vocabulary
embedding = np.zeros(len(self.vocabulary))
words = text.lower().split()
for word in words:
if word in self.vocabulary:
embedding[self.vocabulary[word]] += 1
norm = np.linalg.norm(embedding)
return embedding / norm if norm > 0 else embedding
def add_documents(self, texts: List[str]):
"""
Add documents and precompute BM25 statistics.
Args:
texts: List of document texts
"""
self.documents = texts
# Build vocabulary first
if not self.vocab_built:
self._build_vocabulary(texts)
# Generate embeddings using consistent vocabulary
self.embeddings = [self.embedding_model(text) for text in texts]
# Precompute BM25 statistics
self._precompute_bm25(texts)
def _precompute_bm25(self, texts: List[str]):
"""Precompute BM25 statistics for all documents."""
# Tokenize and compute term frequencies
self.term_doc_freqs = []
all_terms = set()
for text in texts:
terms = text.lower().split()
term_freq = Counter(terms)
self.term_doc_freqs.append(term_freq)
self.doc_lengths.append(len(terms))
all_terms.update(terms)
# Compute document frequencies
self.doc_freqs = {}
for term in all_terms:
self.doc_freqs[term] = sum(
1 for term_freq in self.term_doc_freqs if term in term_freq
)
# Compute IDF
N = len(texts)
self.idf = {}
for term, df in self.doc_freqs.items():
self.idf[term] = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
# Average document length
self.avg_doc_length = sum(self.doc_lengths) / len(self.doc_lengths) if self.doc_lengths else 0
def bm25_search(self, query: str, top_k: int = 5) -> List[Tuple[int, float]]:
"""
BM25 keyword-based search.
Args:
query: Search query
top_k: Number of results to return
Returns:
List of (document_index, bm25_score) tuples
"""
query_terms = query.lower().split()
scores = []
for i, doc_term_freq in enumerate(self.term_doc_freqs):
score = 0.0
doc_length = self.doc_lengths[i]
for term in query_terms:
if term in doc_term_freq:
tf = doc_term_freq[term]
idf = self.idf.get(term, 0.0)
# BM25 formula
numerator = idf * tf * (self.k1 + 1)
denominator = tf + self.k1 * (1 - self.b + self.b * (doc_length / self.avg_doc_length))
score += numerator / denominator
scores.append((i, score))
# Sort by score (descending) and return top_k
scores.sort(key=lambda x: x[1], reverse=True)
return scores[:top_k]
def semantic_search(self, query: str, top_k: int = 5) -> List[Tuple[int, float]]:
"""
Semantic vector search using embeddings.
Args:
query: Search query
top_k: Number of results to return
Returns:
List of (document_index, similarity_score) tuples
"""
query_embedding = self.embedding_model(query)
scores = []
for i, doc_embedding in enumerate(self.embeddings):
# Cosine similarity
similarity = np.dot(query_embedding, doc_embedding)
scores.append((i, similarity))
# Sort by similarity (descending) and return top_k
scores.sort(key=lambda x: x[1], reverse=True)
return scores[:top_k]
def hybrid_search(
self,
query: str,
top_k: int = 5,
bm25_weight: float = 0.4,
semantic_weight: float = 0.6
) -> List[Tuple[int, float, Dict]]:
"""
Combine BM25 and semantic search using weighted scores.
Args:
query: Search query
top_k: Number of results to return
bm25_weight: Weight for BM25 scores (default 0.4)
semantic_weight: Weight for semantic scores (default 0.6)
Returns:
List of (document_index, combined_score, metadata) tuples
"""
# Get results from both methods
bm25_results = self.bm25_search(query, top_k * 2)
semantic_results = self.semantic_search(query, top_k * 2)
# Normalize scores to [0, 1] range
bm25_scores = {idx: score for idx, score in bm25_results}
semantic_scores = {idx: score for idx, score in semantic_results}
# Find max scores for normalization
max_bm25 = max(bm25_scores.values()) if bm25_scores else 1.0
max_semantic = max(semantic_scores.values()) if semantic_scores else 1.0
# Combine scores
combined_scores = {}
all_indices = set(bm25_scores.keys()) | set(semantic_scores.keys())
for idx in all_indices:
# Normalize and combine
norm_bm25 = (bm25_scores.get(idx, 0.0) / max_bm25) if max_bm25 > 0 else 0.0
norm_semantic = (semantic_scores.get(idx, 0.0) / max_semantic) if max_semantic > 0 else 0.0
combined = (bm25_weight * norm_bm25) + (semantic_weight * norm_semantic)
combined_scores[idx] = {
'combined': combined,
'bm25': norm_bm25,
'semantic': norm_semantic
}
# Sort by combined score and return top_k
sorted_results = sorted(
combined_scores.items(),
key=lambda x: x[1]['combined'],
reverse=True
)[:top_k]
return [(idx, scores['combined'], scores) for idx, scores in sorted_results]
def main():
"""Hybrid search RAG example."""
# Initialize and add documents
rag = HybridSearchRAG()
documents = [
"RAG systems use embeddings to find semantically similar documents.",
"BM25 is a keyword-based ranking algorithm used in information retrieval.",
"Hybrid search combines keyword matching with semantic understanding.",
]
rag.add_documents(documents)
# Test hybrid search
query = "semantic search for documents"
hybrid_results = rag.hybrid_search(query, top_k=2, bm25_weight=0.4, semantic_weight=0.6)
print(f"Query: {query}")
for idx, combined_score, breakdown in hybrid_results:
print(f" Combined: {combined_score:.3f} "
f"(BM25: {breakdown['bm25']:.3f}, Semantic: {breakdown['semantic']:.3f})")
print(f" {rag.documents[idx][:60]}...")
if __name__ == "__main__":
main()
RAG's Challenges¶
Despite its power, the RAG pattern is not without its challenges. A primary issue arises when the information needed to answer a query is not confined to a single chunk but is spread across multiple parts of a document or even several documents. In such cases, the retriever might fail to gather all the necessary context, leading to an incomplete or inaccurate answer. The system's effectiveness is also highly dependent on the quality of the chunking and retrieval process; if irrelevant chunks are retrieved, it can introduce noise and confuse the LLM. Furthermore, effectively synthesizing information from potentially contradictory sources remains a significant hurdle for these systems.
Besides that, another challenge is that RAG requires the entire knowledge base to be pre-processed and stored in specialized databases, such as vector or graph databases, which is a considerable undertaking. Consequently, this knowledge requires periodic reconciliation to remain up-to-date, a crucial task when dealing with evolving sources like company wikis. This entire process can have a noticeable impact on performance, increasing latency, operational costs, and the number of tokens used in the final prompt.
Graph RAG¶
GraphRAG is an advanced form of Retrieval-Augmented Generation that utilizes a knowledge graph instead of a simple vector database for information retrieval. It answers complex queries by navigating the explicit relationships (edges) between data entities (nodes) within this structured knowledge base. A key advantage is its ability to synthesize answers from information fragmented across multiple documents, a common failing of traditional RAG. By understanding these connections, GraphRAG provides more contextually accurate and nuanced responses.
Use cases include complex financial analysis, connecting companies to market events, and scientific research for discovering relationships between genes and diseases. The primary drawback, however, is the significant complexity, cost, and expertise required to build and maintain a high-quality knowledge graph. This setup is also less flexible and can introduce higher latency compared to simpler vector search systems. The system's effectiveness is entirely dependent on the quality and completeness of the underlying graph structure. Consequently, GraphRAG offers superior contextual reasoning for intricate questions but at a much higher implementation and maintenance cost. In summary, it excels where deep, interconnected insights are more critical than the speed and simplicity of standard RAG.
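A minimal sketch of the underlying idea, using the networkx library with a toy graph and illustrative relation names (all assumptions, not tied to any specific GraphRAG product):

```python
from typing import List
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Drug X", "Gene BRCA1", relation="targets")
kg.add_edge("Gene BRCA1", "DNA repair", relation="involved_in")
kg.add_edge("DNA repair", "Breast cancer", relation="implicated_in")

def facts_within_hops(graph: nx.DiGraph, start: str, hops: int = 2) -> List[str]:
    """Collect (subject, relation, object) facts reachable within `hops` edges of start."""
    facts: List[str] = []
    for node in nx.single_source_shortest_path_length(graph, start, cutoff=hops):
        for _, obj, data in graph.out_edges(node, data=True):
            facts.append(f"{node} --{data['relation']}--> {obj}")
    return facts

# Multi-hop context connecting a drug to a disease through intermediate entities;
# these facts would be serialized into the LLM's prompt just like retrieved chunks.
for fact in facts_within_hops(kg, "Drug X", hops=3):
    print(fact)
```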
Agentic RAG¶
An evolution of this pattern, known as Agentic RAG (see Fig.2), introduces a reasoning and decision-making layer to significantly enhance the reliability of information extraction. Instead of just retrieving and augmenting, an "agent"—a specialized AI component—acts as a critical gatekeeper and refiner of knowledge. Rather than passively accepting the initially retrieved data, this agent actively interrogates its quality, relevance, and completeness, as illustrated by the following scenarios.
Reflection and Source Validation¶
First, an agent excels at reflection and source validation. If a user asks, "What is our company's policy on remote work?" a standard RAG might pull up a 2020 blog post alongside the official 2025 policy document. The agent, however, would analyze the documents' metadata, recognize the 2025 policy as the most current and authoritative source, and discard the outdated blog post before sending the correct context to the LLM for a precise answer.
Reconciling Knowledge Conflicts¶
Second, an agent is adept at reconciling knowledge conflicts. Imagine a financial analyst asks, "What was Project Alpha's Q1 budget?" The system retrieves two documents: an initial proposal stating a €50,000 budget and a finalized financial report listing it as €65,000. An Agentic RAG would identify this contradiction, prioritize the financial report as the more reliable source, and provide the LLM with the verified figure, ensuring the final answer is based on the most accurate data.
Multi-Step Reasoning¶
Third, an agent can perform multi-step reasoning to synthesize complex answers. If a user asks, "How do our product's features and pricing compare to Competitor X's?" the agent would decompose this into separate sub-queries. It would initiate distinct searches for its own product's features, its pricing, Competitor X's features, and Competitor X's pricing. After gathering these individual pieces of information, the agent would synthesize them into a structured, comparative context before feeding it to the LLM, enabling a comprehensive response that a simple retrieval could not have produced.
Identifying Knowledge Gaps and Using External Tools¶
Fourth, an agent can identify knowledge gaps and use external tools. Suppose a user asks, "What was the market's immediate reaction to our new product launched yesterday?" The agent searches the internal knowledge base, which is updated weekly, and finds no relevant information. Recognizing this gap, it can then activate a tool—such as a live web-search API—to find recent news articles and social media sentiment. The agent then uses this freshly gathered external information to provide an up-to-the-minute answer, overcoming the limitations of its static internal database.
Agentic RAG Implementation¶
The following example demonstrates an Agentic RAG system that actively evaluates, validates, and refines retrieved information:
Agentic RAG Implementation
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass
from datetime import datetime
import numpy as np
@dataclass
class RetrievedDocument:
"""Document with metadata for agentic evaluation."""
content: str
source: str
timestamp: Optional[str] = None
confidence: float = 1.0
metadata: Dict = None
class AgenticRAGSystem:
"""
Agentic RAG system that actively evaluates, validates, and refines
retrieved information before passing to LLM.
"""
def __init__(self, retrieval_system, llm_agent=None):
"""
Initialize Agentic RAG system.
Args:
retrieval_system: Base RAG system for initial retrieval
llm_agent: Optional LLM agent for reasoning (simulated here)
"""
self.retrieval_system = retrieval_system
self.llm_agent = llm_agent
def retrieve_and_validate(
self,
query: str,
top_k: int = 5
) -> List[RetrievedDocument]:
"""
Retrieve documents and validate their quality and relevance.
"""
# Initial retrieval
raw_results = self.retrieval_system.retrieve(query, top_k * 2)
# Agentic validation: filter and rank
validated = []
for doc, score in raw_results:
retrieved_doc = RetrievedDocument(
content=doc.content,
source=doc.metadata.get('source', 'unknown'),
timestamp=doc.metadata.get('timestamp'),
confidence=score,
metadata=doc.metadata
)
# Validate relevance threshold
if score > 0.3: # Minimum relevance threshold
validated.append(retrieved_doc)
return validated[:top_k]
def reconcile_conflicts(
self,
documents: List[RetrievedDocument],
query: str
) -> List[RetrievedDocument]:
"""
Identify and resolve conflicts between retrieved documents.
Prioritizes more authoritative or recent sources.
"""
if len(documents) <= 1:
return documents
# Group by topic/key concept (simplified - in production use NLP)
# For demo, we'll prioritize by timestamp and source authority
prioritized = []
seen_content = set()
# Sort by timestamp (newest first) and source authority
sorted_docs = sorted(
documents,
key=lambda d: (
self._source_authority(d.source),
self._parse_timestamp(d.timestamp) if d.timestamp else datetime.min
),
reverse=True
)
for doc in sorted_docs:
# Simple deduplication: skip near-duplicates
content_hash = hash(doc.content[:100])
if content_hash not in seen_content:
prioritized.append(doc)
seen_content.add(content_hash)
return prioritized
def _source_authority(self, source: str) -> int:
"""Rank source authority (higher = more authoritative)."""
authority_map = {
'official_policy': 10,
'financial_report': 9,
'handbook': 8,
'blog_post': 3,
'wiki': 5,
}
for key, value in authority_map.items():
if key in source.lower():
return value
return 5 # Default
def _parse_timestamp(self, timestamp: str) -> datetime:
"""Parse timestamp string to datetime."""
try:
return datetime.fromisoformat(timestamp)
except:
return datetime.min
def identify_knowledge_gaps(
self,
documents: List[RetrievedDocument],
query: str
) -> Tuple[List[RetrievedDocument], bool]:
"""
Check if retrieved documents fully answer the query.
Returns (documents, has_gap).
"""
# Simple heuristic: if no documents or all low confidence
if not documents:
return documents, True
avg_confidence = sum(d.confidence for d in documents) / len(documents)
has_gap = avg_confidence < 0.5
return documents, has_gap
def multi_step_retrieval(
self,
query: str
) -> List[RetrievedDocument]:
"""
Decompose complex queries into sub-queries and retrieve for each.
"""
# Simple decomposition (in production, use LLM to decompose)
all_documents = []
# In production, decompose query into sub-queries
sub_queries = self._decompose_query(query)
for sub_query in sub_queries:
results = self.retrieve_and_validate(sub_query, top_k=3)
all_documents.extend(results)
# Deduplicate and prioritize
return self.reconcile_conflicts(all_documents, query)
def _decompose_query(self, query: str) -> List[str]:
"""Decompose complex query into sub-queries."""
# Simplified: in production, use LLM to intelligently decompose
query_lower = query.lower()
if "compare" in query_lower:
return [query] # Placeholder
elif "and" in query_lower:
# Split on "and" for multi-part queries
parts = query_lower.split(" and ")
return [q.strip() for q in parts if q.strip()]
else:
return [query]
def query_with_agentic_reasoning(
self,
query: str,
use_external_tools: bool = False
) -> str:
"""
Complete Agentic RAG pipeline with reasoning layer.
"""
# Step 1: Multi-step retrieval if needed
if self._is_complex_query(query):
documents = self.multi_step_retrieval(query)
else:
documents = self.retrieve_and_validate(query, top_k=5)
# Step 2: Reconcile conflicts
documents = self.reconcile_conflicts(documents, query)
# Step 3: Check for knowledge gaps
documents, has_gap = self.identify_knowledge_gaps(documents, query)
# Step 4: Use external tools if gap detected
if has_gap and use_external_tools:
# In production: call web search API, database, etc.
external_docs = self._fetch_external_info(query)
documents.extend(external_docs)
documents = self.reconcile_conflicts(documents, query)
# Step 5: Format context with source validation
context = self._format_agentic_context(documents, query)
return context
def _is_complex_query(self, query: str) -> bool:
"""Heuristic to detect complex queries requiring decomposition."""
complex_indicators = ["compare", "versus", "difference between", "and", "or"]
return any(indicator in query.lower() for indicator in complex_indicators)
def _fetch_external_info(self, query: str) -> List[RetrievedDocument]:
"""Fetch information from external sources (web, APIs, etc.)."""
# Placeholder: in production, call web search API, database, etc.
return []
def _format_agentic_context(
self,
documents: List[RetrievedDocument],
query: str
) -> str:
"""Format context with source citations and validation notes."""
if not documents:
return f"No relevant information found for: {query}"
context_parts = []
for i, doc in enumerate(documents, 1):
source_note = f"[Source {i}: {doc.source}]"
if doc.timestamp:
source_note += f" (Updated: {doc.timestamp})"
source_note += f" (Confidence: {doc.confidence:.3f})"
context_parts.append(f"{source_note}\n{doc.content}\n")
context = "\n".join(context_parts)
return f"""Based on the following validated and reconciled context, answer the question.
Validated Context (sources prioritized by authority and recency):
{context}
Question: {query}
Answer (cite specific sources):"""
# Simple retrieval system for Agentic RAG demo
class SimpleRetrievalSystem:
"""Simple retrieval system for Agentic RAG demo."""
def __init__(self):
self.documents = []
self.embeddings = []
self.vocabulary = {}
self.vocab_built = False
def _build_vocabulary(self, texts):
all_words = set()
for text in texts:
words = text.lower().split()
all_words.update(words)
self.vocabulary = {word: idx for idx, word in enumerate(sorted(all_words))}
self.vocab_built = True
def _simple_embedding(self, text):
if not self.vocab_built:
words = text.lower().split()
unique_words = list(set(words))
embedding = np.zeros(len(unique_words))
for word in words:
if word in unique_words:
embedding[unique_words.index(word)] += 1
norm = np.linalg.norm(embedding)
return embedding / norm if norm > 0 else embedding
embedding = np.zeros(len(self.vocabulary))
words = text.lower().split()
for word in words:
if word in self.vocabulary:
embedding[self.vocabulary[word]] += 1
norm = np.linalg.norm(embedding)
return embedding / norm if norm > 0 else embedding
def add_documents(self, texts, metadatas=None):
if metadatas is None:
metadatas = [{}] * len(texts)
if not self.vocab_built:
self._build_vocabulary(texts)
for text, meta in zip(texts, metadatas):
embedding = self._simple_embedding(text)
self.documents.append({
'content': text,
'metadata': meta,
'embedding': embedding
})
def retrieve(self, query, top_k):
query_embedding = self._simple_embedding(query)
similarities = []
for doc in self.documents:
similarity = np.dot(query_embedding, doc['embedding'])
similarities.append((
type('Doc', (), {'content': doc['content'], 'metadata': doc['metadata']})(),
similarity
))
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:top_k]
def main():
"""Agentic RAG example."""
# Initialize base retrieval system
base_retrieval = SimpleRetrievalSystem()
documents = [
"Our remote work policy from 2020 allows employees to work from home 2 days per week.",
"The official remote work policy updated in 2025 allows employees to work from home 3 days per week.",
]
metadatas = [
{"source": "blog_2020", "timestamp": "2020-01-01", "type": "blog"},
{"source": "official_policy_2025", "timestamp": "2025-01-15", "type": "policy"},
]
base_retrieval.add_documents(documents, metadatas)
# Initialize Agentic RAG
agentic_rag = AgenticRAGSystem(base_retrieval)
# Query with agentic reasoning
query = "What is our company's remote work policy?"
context = agentic_rag.query_with_agentic_reasoning(query)
print(f"Query: {query}")
print(f"Context: {context[:200]}...")
if __name__ == "__main__":
main()
Challenges of Agentic RAG¶
While powerful, the agentic layer introduces its own set of challenges. The primary drawback is a significant increase in complexity and cost. Designing, implementing, and maintaining the agent's decision-making logic and tool integrations requires substantial engineering effort and adds to computational expenses. This complexity can also lead to increased latency, as the agent's cycles of reflection, tool use, and multi-step reasoning take more time than a standard, direct retrieval process. Furthermore, the agent itself can become a new source of error; a flawed reasoning process could cause it to get stuck in useless loops, misinterpret a task, or improperly discard relevant information, ultimately degrading the quality of the final response.
In summary: Agentic RAG represents a sophisticated evolution of the standard retrieval pattern, transforming it from a passive data pipeline into an active, problem-solving framework. By embedding a reasoning layer that can evaluate sources, reconcile conflicts, decompose complex questions, and use external tools, agents dramatically improve the reliability and depth of the generated answers. This advancement makes the AI more trustworthy and capable, though it comes with important trade-offs in system complexity, latency, and cost that must be carefully managed.
Practical Applications & Use Cases¶
Knowledge Retrieval (RAG) is changing how Large Language Models (LLMs) are utilized across various industries, enhancing their ability to provide more accurate and contextually relevant responses.
Applications include:
- Enterprise Search and Q&A: Organizations can develop internal chatbots that respond to employee inquiries using internal documentation such as HR policies, technical manuals, and product specifications. The RAG system extracts relevant sections from these documents to inform the LLM's response.
- Customer Support and Helpdesks: RAG-based systems can offer precise and consistent responses to customer queries by accessing information from product manuals, frequently asked questions (FAQs), and support tickets. This can reduce the need for direct human intervention for routine issues.
- Personalized Content Recommendation: Instead of basic keyword matching, RAG can identify and retrieve content (articles, products) that is semantically related to a user's preferences or previous interactions, leading to more relevant recommendations.
- News and Current Events Summarization: LLMs can be integrated with real-time news feeds. When prompted about a current event, the RAG system retrieves recent articles, allowing the LLM to produce an up-to-date summary.
By incorporating external knowledge, RAG extends the capabilities of LLMs beyond simple communication to function as knowledge processing systems.
Summary¶
In summary, the Retrieval-Augmented Generation (RAG) pattern represents a significant leap forward in making AI more knowledgeable and reliable. By seamlessly integrating an external knowledge retrieval step into the generation process, RAG addresses some of the core limitations of standalone LLMs. The foundational concepts of embeddings and semantic similarity, combined with retrieval techniques like keyword and hybrid search, allow the system to intelligently find relevant information, which is made manageable through strategic chunking. This entire retrieval process is powered by specialized vector databases designed to store and efficiently query millions of embeddings at scale. While challenges in retrieving fragmented or contradictory information persist, RAG empowers LLMs to produce answers that are not only contextually appropriate but also anchored in verifiable facts, fostering greater trust and utility in AI.