Building RAG Pipelines That Actually Work in Production
Retrieval-Augmented Generation (RAG) is the most practical pattern for grounding LLM outputs in real data. But the naive version (chunk documents, embed them, dump them into a vector DB, retrieve top-k) works fine in a notebook and fails quietly in production.
This post covers the engineering decisions that separate demo RAG from production RAG: chunking strategy, embedding model selection, metadata filtering, re-ranking, and observability.
Chunking Is Not Trivial
The most common RAG mistake is treating chunking as a fixed-size window operation. Paragraphs get split mid-sentence, context is lost, and retrieval quality degrades. The fix is semantic chunking: splitting on natural boundaries (headings, paragraphs, sentence ends) with overlap.
def semantic_chunk(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars, with a character overlap."""
    chunks = []
    current = []  # paragraphs in the chunk being built
    length = 0    # characters accumulated in `current`
    for paragraph in text.split('\n\n'):
        # Close the current chunk if adding this paragraph would exceed the budget
        if length + len(paragraph) > max_chars and current:
            chunks.append('\n\n'.join(current))
            # Seed the next chunk with the tail of the last paragraph as overlap
            overlap_text = current[-1][-overlap:] if overlap > 0 else ''
            current = [overlap_text] if overlap_text else []
            length = len(overlap_text)
        current.append(paragraph)
        length += len(paragraph)
    if current:
        chunks.append('\n\n'.join(current))
    return chunks
Semantic chunking preserves document structure and gives the retriever coherent passages instead of random text slices.
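The function above splits only on paragraph breaks, so a single paragraph longer than max_chars passes through whole. One way to cover the "sentence ends" boundary is a fallback splitter for oversized paragraphs. This is a minimal sketch, assuming a simple regex is good enough to find sentence ends; `split_long_paragraph` is a hypothetical helper name, not part of any library.

```python
import re

# Hypothetical helper: splits one oversized paragraph on sentence ends.
# A piece can still exceed max_chars if a single sentence is longer than that.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def split_long_paragraph(paragraph: str, max_chars: int = 1000) -> list[str]:
    sentences = _SENTENCE_END.split(paragraph.strip())
    pieces = []
    current = ''
    for sentence in sentences:
        candidate = f'{current} {sentence}' if current else sentence
        if len(candidate) > max_chars and current:
            # Adding this sentence would overflow, so close the current piece
            pieces.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces
```

Running each paragraph through a helper like this before grouping keeps every unit below the budget in the common case. Abbreviations such as "e.g." will confuse the regex, so a proper sentence tokenizer is worth considering for messy text.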
Embedding Model Selection
Not all embedding models are equal. The trade-off is dimensionality vs. precision vs. latency. Small models (384-dim) are fast but miss nuance. Large models (1536-dim) are accurate but slower and more expensive.
The pragmatic approach: use a medium model like gte-large (1024-dim) for indexing, and experiment with dimensionality reduction via Matryoshka embeddings. If you are using OpenAI, text-embedding-3-small at 512 dimensions strikes a good balance.
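For models trained with Matryoshka Representation Learning, shortening a vector means keeping its leading components and re-normalizing so that cosine and dot-product scores still behave. A minimal sketch in plain Python follows; `truncate_and_normalize` is a hypothetical helper name, and the approach only makes sense for models trained this way.

```python
import math


def truncate_and_normalize(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all zeros, nothing to rescale
    return [x / norm for x in head]
```

Stored vectors and query vectors need the same target size, otherwise the similarity scores are not comparable.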
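# Matryoshka: one model, multiple dimensions
from sentence_transformers import SentenceTransformer
# Assumes a model trained with Matryoshka Representation Learning, such as
# mixedbread-ai/mxbai-embed-large-v1 (1024-dim); truncating other models hurts quality
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1', truncate_dim=512)
docs = ['First document text.', 'Second document text.']  # placeholder corpus
embeddings = model.encode(docs, normalize_embeddings=True)
# Use the same truncate_dim (e.g. 256, 512, or 1024) for indexing and for queries
Retrieval with Re-Ranking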
Vector search returns the top-k most semantically similar chunks. But similarity does not guarantee relevance. A chunk that mentions the same keywords might not answer the user's question.
The fix: retrieve more candidates (top-20 or top-50) and re-rank using a cross-encoder. Cross-encoders score each (query, chunk) pair directly, giving much better relevance than cosine similarity on embeddings.
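from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# vector_store is assumed to be a LangChain-style store whose results expose page_content
candidates = vector_store.similarity_search(query, k=20)
pairs = [(query, c.page_content) for c in candidates]
scores = reranker.predict(pairs)
# Keep top-5 after re-ranking; sort on the score only so tied scores never compare documents
ranked = [c for _, c in sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)][:5]
This two-stage retrieval (vector search + cross-encoder re-ranking) typically outperforms single-stage retrieval by 15-30% in recall at the final cutoff (here, top-5), though the gain depends on the corpus and the models.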
Metadata Filtering
Vector search alone has no concept of time, source, or access control. If you are building a RAG system over a document corpus, you need to filter on metadata before the vector search runs.
# Filter by date, source, and access level BEFORE vector search.
# Chroma-style syntax shown here; other vector stores use their own filter formats.
where = {
    "$and": [
        # Chroma range operators need numeric values, so dates are stored as YYYYMMDD ints
        {"date": {"$gte": 20250101}},
        {"source": {"$in": ["docs", "wiki"]}},
        {"access_level": "public"},
    ]
}
results = collection.query(
    query_embeddings=[embedding],
    n_results=20,
    where=where,
)
Most vector databases (Chroma, Pinecone, Weaviate, Qdrant) support metadata filtering. Use it. It dramatically improves retrieval quality by eliminating irrelevant candidates before similarity search.
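To make the order of operations concrete, here is a minimal in-memory sketch of filter-then-rank, independent of any vector database. The record layout, the `filtered_search` name, and the sample data are all hypothetical; dates are ISO strings here because those compare correctly in plain Python.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, records, allowed_sources, min_date, k=3):
    # Step 1: drop records that fail the metadata checks
    eligible = [
        r for r in records
        if r["source"] in allowed_sources
        and r["date"] >= min_date
        and r["access_level"] == "public"
    ]
    # Step 2: rank only the survivors by similarity to the query
    eligible.sort(key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return eligible[:k]


records = [
    {"id": "a", "vector": [1.0, 0.0], "source": "docs", "date": "2025-03-01", "access_level": "public"},
    {"id": "b", "vector": [0.9, 0.1], "source": "wiki", "date": "2024-06-01", "access_level": "public"},
    {"id": "c", "vector": [1.0, 0.0], "source": "docs", "date": "2025-02-01", "access_level": "internal"},
    {"id": "d", "vector": [0.0, 1.0], "source": "wiki", "date": "2025-01-15", "access_level": "public"},
]
hits = filtered_search([1.0, 0.0], records, {"docs", "wiki"}, "2025-01-01", k=2)
```

Records "b" (too old) and "c" (not public) never reach the similarity step, even though both sit close to the query vector. Filtering after retrieval instead would let them take up slots in the top-k and then be discarded.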
Observability and Evaluation
RAG pipelines degrade silently. A chunking change that looks fine in development can halve retrieval accuracy in production. You need metrics:
- Hit rate: What fraction of queries return a relevant chunk?
- MRR (Mean Reciprocal Rank): How high is the first relevant result?
- Faithfulness: Does the LLM response actually reflect the retrieved context?
Tools like LangSmith, Arize, or a simple custom evaluation harness with GPT-4 as a judge can track these metrics over time. Without observability, you are flying blind.
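Hit rate and MRR are simple enough to compute in a custom harness. The sketch below assumes you have, for each test query, the ranked list of retrieved chunk IDs and a hand-labeled set of relevant IDs; the function names and sample data are illustrative only.

```python
def hit_rate(results: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant ID in the top k."""
    if not results:
        return 0.0
    hits = sum(
        1 for ranked, rel in zip(results, relevant)
        if any(doc_id in rel for doc_id in ranked[:k])
    )
    return hits / len(results)


def mean_reciprocal_rank(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant ID; a query with none contributes 0."""
    if not results:
        return 0.0
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)


# Three queries: relevant chunk at rank 1, at rank 2, and missing entirely
results = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = [{"d1"}, {"d5"}, {"dX"}]
print(hit_rate(results, relevant, k=3))
print(mean_reciprocal_rank(results, relevant))
```

For this sample the hit rate is 2/3 and the MRR is 0.5. Running the same labeled query set after every chunking or model change turns a silent regression into a visible number.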