Building RAG Pipelines That Actually Work in Production

March 29, 2026 3 min read

AI, LLMs, RAG, Embeddings, Python

Retrieval-Augmented Generation (RAG) is the most practical pattern for grounding LLM outputs in real data. But the naive version (chunk documents, embed them, dump them into a vector DB, retrieve top-k) works fine in a notebook and fails quietly in production.

This post covers the engineering decisions that separate demo RAG from production RAG: chunking strategy, embedding model selection, metadata filtering, re-ranking, and observability.

Chunking Is Not Trivial

The most common RAG mistake is treating chunking as a fixed-size window operation. Paragraphs get split mid-sentence, context is lost, and retrieval quality degrades. The fix is semantic chunking: splitting on natural boundaries (headings, paragraphs, sentence ends) with overlap between adjacent chunks.

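def semantic_chunk(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars, with tail overlap."""
    chunks = []
    current = []  # paragraphs in the chunk being built
    length = 0    # characters in current, excluding '\n\n' separators
    for paragraph in text.split('\n\n'):
        # Close the chunk before this paragraph would push it past max_chars
        if length + len(paragraph) > max_chars and current:
            chunks.append('\n\n'.join(current))
            # Seed the next chunk with the tail of the last paragraph.
            # Guard overlap == 0: s[-0:] would return the whole string.
            tail = current[-1][-overlap:] if overlap > 0 else ''
            current = [tail] if tail else []
            length = len(tail)
        current.append(paragraph)
        length += len(paragraph)
    if current:
        chunks.append('\n\n'.join(current))
    return chunks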

Semantic chunking preserves document structure and gives the retriever coherent passages instead of random text slices.
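
The function above splits only on paragraph breaks, so a single paragraph longer than max_chars passes through whole. One way to handle that case is to fall back to sentence boundaries. The following is a minimal sketch, not a definitive implementation: split_long_paragraph is a hypothetical helper, and its regex is a naive sentence detector that will mis-split abbreviations such as "e.g.".

```python
import re


def split_long_paragraph(paragraph: str, max_chars: int = 1000) -> list[str]:
    """Split a paragraph at sentence ends, keeping pieces under max_chars
    wherever the sentence boundaries allow it."""
    # Naive boundary: whitespace that follows '.', '!' or '?'.
    sentences = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    pieces: list[str] = []
    current = ''
    for sentence in sentences:
        candidate = f'{current} {sentence}'.strip()
        if len(candidate) > max_chars and current:
            # Adding this sentence would overflow: close the current piece.
            pieces.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


pieces = split_long_paragraph("One. Two. Three.", max_chars=10)
```

A single sentence longer than max_chars is still returned as one piece; the sketch never cuts inside a sentence.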

Embedding Model Selection

Not all embedding models are equal. The trade-off is dimensionality vs. precision vs. latency. Small models (384-dim) are fast but miss nuance. Large models (1536-dim) are accurate but slower and more expensive.

The pragmatic approach: use a medium model like gte-large (1024-dim) for indexing, and experiment with dimensionality reduction via Matryoshka embeddings, which requires a model that was trained to tolerate truncation. If you are using OpenAI, text-embedding-3-small at 512 dimensions strikes a good balance.

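# Matryoshka: one model, multiple dimensions
from sentence_transformers import SentenceTransformer

# Truncation only makes sense for models trained with Matryoshka
# representation learning. mixedbread-ai/mxbai-embed-large-v1 (1024-dim)
# is used here as one example of such a model.
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1', truncate_dim=512)
embeddings = model.encode(docs, normalize_embeddings=True)
# Index and query with the same truncate_dim (e.g. 256, 512, or 1024)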
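
Truncating a Matryoshka embedding means keeping its leading components and rescaling the result back to unit length, so that a dot product still behaves like cosine similarity. A minimal pure-Python sketch of that step follows; truncate_and_normalize and dot are illustrative names, not part of any library.

```python
import math


def truncate_and_normalize(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale them to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]


def dot(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Documents and queries must be truncated to the same number of dimensions.
doc = truncate_and_normalize([3.0, 4.0, 12.0], dims=2)
query = truncate_and_normalize([4.0, 3.0, 0.5], dims=2)
score = dot(doc, query)
```

Skipping the rescaling step leaves vectors shorter than unit length, which distorts any index that assumes normalized inputs.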

Retrieval with Re-Ranking

Vector search returns the top-k most semantically similar chunks. But similarity does not guarantee relevance. A chunk that mentions the same keywords might not answer the user's question.

The fix: retrieve more candidates (top-20 or top-50) and re-rank using a cross-encoder. Cross-encoders score each (query, chunk) pair directly, giving much better relevance than cosine similarity on embeddings.

from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

candidates = vector_store.similarity_search(query, k=20)
pairs = [(query, c.page_content) for c in candidates]
scores = reranker.predict(pairs)

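# Keep top-5 after re-ranking. Sort on the score alone so that tied scores
# never fall back to comparing the candidate objects themselves.
ranked = [c for _, c in sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)][:5]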

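This two-stage retrieval (vector search + cross-encoder re-ranking) typically outperforms single-stage retrieval, by 15-30% in recall at the final cutoff (for example, recall@5). The exact gain depends on the corpus and the query mix, so measure it on your own data.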
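
The re-ranking step can also be written as a small function that takes the scorer as an argument, which makes it testable without loading a model. This is a sketch under assumptions: rerank and overlap_scorer are hypothetical names, and overlap_scorer is a toy stand-in with the same call shape as a cross-encoder's predict method (a list of pairs in, one score per pair out).

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Score every (query, candidate) pair and keep the top_n by score."""
    if not candidates:
        return []
    scores = score_pairs([(query, c) for c in candidates])
    # Sort indices by score only; ties keep their original retrieval order.
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [(candidates[i], float(scores[i])) for i in order[:top_n]]


def overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy scorer: number of lowercase words shared by query and candidate."""
    return [float(len(set(q.lower().split()) & set(d.lower().split()))) for q, d in pairs]


top = rerank(
    "reset password",
    ["Billing FAQ", "How to reset a password"],
    overlap_scorer,
    top_n=1,
)
```

In a real pipeline the scorer argument would be the cross-encoder's predict method, and the candidates would be the text of the chunks returned by the vector search.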

Metadata Filtering

Vector search alone has no concept of time, source, or access control. If you are building a RAG system over a document corpus, you need metadata pre-filtering before the vector search runs.

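# Filter by date range BEFORE vector search.
# Chroma-style syntax is assumed here; other stores use different operators.
# Chroma's range operators compare numbers, so dates are stored as Unix timestamps.
from datetime import datetime, timezone

cutoff = int(datetime(2025, 1, 1, tzinfo=timezone.utc).timestamp())
where = {
    "$and": [
        {"date": {"$gte": cutoff}},
        {"source": {"$in": ["docs", "wiki"]}},
        {"access_level": "public"},
    ]
}
results = collection.query(
    query_embeddings=[embedding],
    n_results=20,
    where=where,
)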

Most vector databases (Chroma, Pinecone, Weaviate, Qdrant) support metadata filtering. Use it. It dramatically improves retrieval quality by eliminating irrelevant candidates before similarity search.
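
The order of operations matters: filtering after the search can leave fewer than k results, because chunks that fail the filter have already used up slots in the top-k. The following in-memory sketch shows the pre-filter order. All names in it (matches, prefiltered_search, the chunk dictionary layout) are illustrative assumptions, not any vector database's API.

```python
from datetime import date


def matches(metadata: dict, min_date: date, sources: set[str], access_level: str) -> bool:
    """Return True when a chunk's metadata passes all three conditions."""
    return (
        metadata["date"] >= min_date
        and metadata["source"] in sources
        and metadata["access_level"] == access_level
    )


def prefiltered_search(
    query_vec: list[float], chunks: list[dict], top_k: int, **conditions
) -> list[str]:
    """Filter on metadata first, then rank only the survivors by dot product."""
    allowed = [c for c in chunks if matches(c["metadata"], **conditions)]
    scored = [
        (sum(q * v for q, v in zip(query_vec, c["vector"])), c["id"])
        for c in allowed
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_k]]


chunks = [
    {"id": "a", "vector": [1.0, 0.0],
     "metadata": {"date": date(2025, 3, 1), "source": "docs", "access_level": "public"}},
    {"id": "b", "vector": [0.9, 0.1],
     "metadata": {"date": date(2025, 6, 1), "source": "wiki", "access_level": "private"}},
    {"id": "c", "vector": [0.2, 0.8],
     "metadata": {"date": date(2025, 2, 1), "source": "wiki", "access_level": "public"}},
]
ids = prefiltered_search(
    [1.0, 0.0], chunks, top_k=2,
    min_date=date(2025, 1, 1), sources={"docs", "wiki"}, access_level="public",
)
```

Chunk "b" is the second-closest vector to the query, but it never reaches the ranking step because its access level fails the filter.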

Observability and Evaluation

RAG pipelines degrade silently. A chunking change that looks fine in development can halve retrieval accuracy in production. You need metrics:

Hit rate: What fraction of queries return a relevant chunk?
MRR (Mean Reciprocal Rank): How high is the first relevant result?
Faithfulness: Does the LLM response actually reflect the retrieved context?

Tools like LangSmith, Arize, or a simple custom evaluation harness with GPT-4 as a judge can track these metrics over time. Without observability, you are flying blind.
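
Hit rate and MRR are simple enough to compute in a custom harness. The sketch below assumes a hypothetical evaluation set in which each query has a ranked list of retrieved chunk IDs and a hand-labeled set of relevant chunk IDs; the function names are illustrative. Faithfulness is left out because it needs a judge model rather than ID matching.

```python
def hit_rate(results: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant ID in the top k."""
    hits = sum(1 for ranked, rel in zip(results, relevant) if rel & set(ranked[:k]))
    return hits / len(results) if results else 0.0


def mean_reciprocal_rank(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant ID; a query with none scores 0."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in rel:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0


# Three queries: relevant chunk at rank 1, at rank 2, and not retrieved at all.
results = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"a"}, {"e"}, {"z"}]
hr = hit_rate(results, relevant, k=3)
mrr = mean_reciprocal_rank(results, relevant)
```

Running the same labeled queries after every chunking or model change turns a silent regression into a visible drop in these two numbers.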
