# RAG at Scale — A 10-Step Architecture for Zero-Hallucination Search Across Millions of Documents

URL: https://whitepaper.designervenkat.online/docs/ai-machine-learning/rag-pipeline-at-scale
Markdown export: https://whitepaper.designervenkat.online/llms.mdx/docs/ai-machine-learning/rag-pipeline-at-scale
Site: White Papers - Designer Venkat
Author: Designer Venkat
Language: en
Category: AI & Machine Learning (ai-machine-learning)

How to build a hallucination-resistant RAG pipeline at scale — hybrid retrieval, confidence gating, constrained generation, and continuous evals.


Most RAG prototypes work. Most RAG systems in production hallucinate. The gap between the two is not the LLM: it is the retrieval pipeline, the confidence gate, and the evaluation loop that surrounds it.

This article walks through a ten-step architecture for building a RAG system that handles millions of documents without fabricating answers. Each step solves a specific failure mode. Every technical term is defined on first use.

## What you will learn

- Why brute-force vector search breaks at scale and what replaces it
- How to combine keyword search and semantic search to cover each other's blind spots
- Why two-stage reranking is the highest-leverage improvement in any RAG stack
- How a confidence gate stops the LLM from guessing when evidence is thin
- What constrained generation actually means in a system prompt, and why it works
- How to detect hallucinations automatically before a response reaches the user
- What three metrics tell you whether retrieval, generation, or the model itself is failing

## Background

RAG (Retrieval-Augmented Generation) is a pattern where you first retrieve relevant text from a document store, then ask the LLM to answer using only that text, not its training memory. The idea is simple. Making it reliable at scale is not.

At small scale (hundreds of documents) you can cheat. Scan every vector, put everything in the context window, ship it. At millions of documents, every shortcut breaks:

| Scale | Brute-force vector scan | Context window coverage | Retrieval failure impact |
| --- | --- | --- | --- |
| 1,000 docs | < 100ms (fine) | Can see it all | Low: you have the answer |
| 10,000 docs | ~1s (borderline) | Partial | Moderate |
| 10,000,000 docs | Minutes (impossible) | 0.001% of corpus | Fatal: retrieval IS output |

The core insight is this: retrieval quality matters more than the frontier model. A well-retrieved set of five faithful chunks fed to GPT-3.5 will produce a more trustworthy answer than GPT-4 hallucinating over poorly retrieved context. The LLM is the finishing coat. Retrieval is the foundation.

Before the ten steps, here are the key terms used throughout:

| Term | Plain meaning |
| --- | --- |
| Chunk | A fixed-size or semantically-bounded segment of a document, typically 200–500 tokens |
| Embedding | A list of numbers (a vector) that represents the meaning of a piece of text |
| BM25 | A keyword-matching algorithm that scores documents by term frequency and rarity |
| ANN | Approximate Nearest Neighbour: fast vector search that trades a little accuracy for large speed gains |
| HNSW | Hierarchical Navigable Small World: the most widely used ANN index in production |
| Cross-encoder | A model that reads a query and a document chunk together and outputs a single relevance score |
| Faithfulness | The fraction of claims in a response that are verifiable against the retrieved context |
| Grounding | Tying every assertion in a response to a specific source passage |

## The full pipeline at a glance

Steps 1–4 are the retrieval stack. Steps 5–7 are the generation guardrails. Steps 8–10 are the operational layer that keeps the system honest over time.

## Step 1: Ingest and normalize

Ten million documents arrive from everywhere: PDFs, Word files, HTML pages, database exports, scanned images, internal wikis. Each format introduces different noise. Silent normalization failures destroy retrieval recall long before any model sees the data.

What to do at ingest time:

- Strip formatting artifacts (HTML tags, PDF control characters, footnote markers)
- Apply Unicode NFC normalization: é encoded two different ways will not BM25-match
- Remove non-printable characters and control sequences
- Standardize whitespace and newlines
- Tag every document with metadata: source, date, author, domain, version
- Assign a content hash, so that re-ingesting an unchanged document is a no-op

At scale, use a distributed pipeline such as Kafka + Apache Spark or Flink. Process documents in parallel and idempotently. A document that fails normalization should land in a dead-letter queue, not silently produce garbage chunks.

Chunk size matters. Chunks that are too small lose context. Chunks that are too large dilute the signal. 200–400 tokens per chunk with a 50-token overlap is a reliable starting point for most domains. Legal and technical documents benefit from semantic chunking: splitting on paragraph or section boundaries rather than fixed token counts.

## Step 2: Hybrid retrieval with BM25 and embeddings together

The most common engineering mistake in RAG is using only vector embeddings for retrieval.

Embeddings capture meaning well. They fail at exact terms. If a user asks about "Clause 4.2.1," a semantic model returns chunks about termination in general. BM25 finds the exact clause. If a user asks "what are the general themes in contract termination?", BM25 struggles with conceptual overlap while embeddings excel.

The fix is to run both in parallel and fuse their scores.

BM25 (Okapi BM25) scores a document against a query using term frequency and inverse document frequency:

    score(q, d) = Σ IDF(tᵢ) × [tf(tᵢ,d) × (k1+1)] / [tf(tᵢ,d) + k1×(1-b+b×|d|/avgdl)]
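The BM25 formula can be sketched in plain Python to make each term concrete. This is a minimal illustration, not a production index: the whitespace tokenizer and the smoothed IDF variant are assumptions (the article does not specify either), and the defaults `k1 = 1.2` and `b = 0.75` are the values this step uses.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer (an assumption; real systems use an analyzer)."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document in `docs` against `query` with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs or 1.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant): never negative, rewards rare terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length-normalized saturation term from the formula above.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

With this sketch, a query for an exact identifier such as "clause 4.2.1" scores above zero only for documents that contain those tokens, which is the behaviour that makes BM25 complementary to embeddings.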
- `k1 = 1.2` controls term frequency saturation: repeated terms give diminishing returns
- `b = 0.75` normalizes for document length: long documents do not win just by repeating terms
- IDF penalizes common words like "the" and rewards rare, meaningful ones

Vector embeddings capture semantic similarity. `all-MiniLM-L6-v2` (384 dimensions) gives good speed. `text-embedding-3-large` gives better accuracy. Both use cosine similarity on L2-normalized vectors.

Hybrid fusion combines both scores:

    fusedScore = α × cosineSimilarity + (1 - α) × normalizedBM25
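A small sketch of the fusion formula follows. The article does not say how BM25 scores are normalized, so min-max scaling over the candidate set is an assumption here, as is treating a candidate missing from one retriever as scoring 0.0 on that side. Cosine scores are assumed to already lie in the 0 to 1 range, and the names `fuse` and `min_max_normalize` are illustrative.

```python
def min_max_normalize(scores: list[float]) -> list[float]:
    """Scale scores to [0, 1]; all-equal input maps to 0.0 (assumed convention)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(
    cosine: dict[str, float],
    bm25: dict[str, float],
    alpha: float = 0.5,
    top_k: int = 15,
) -> list[tuple[str, float]]:
    """Union the candidates of both retrievers and rank them by fused score."""
    ids = sorted(set(cosine) | set(bm25))
    if not ids:
        return []
    raw_bm25 = [bm25.get(i, 0.0) for i in ids]
    norm_bm25 = dict(zip(ids, min_max_normalize(raw_bm25)))
    fused = {
        # fusedScore = alpha * cosineSimilarity + (1 - alpha) * normalizedBM25
        i: alpha * cosine.get(i, 0.0) + (1 - alpha) * norm_bm25[i]
        for i in ids
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

For example, with `cosine = {"a": 0.9, "b": 0.4}` and `bm25 = {"b": 12.0, "c": 6.0}`, a keyword-leaning `alpha = 0.3` ranks `b` first, while a vector-leaning `alpha = 0.7` ranks `a` first.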
The α weight is tunable per domain:

| Domain | α (vector weight) | Rationale |
| --- | --- | --- |
| Legal documents | 0.3 | Exact clause references matter more than semantic proximity |
| Conceptual knowledge bases | 0.7 | Meaning matters more than exact wording |
| Customer support FAQs | 0.5 | Balance both |
| Code documentation | 0.4 | Function names and identifiers need exact match |

Run both retrieval paths in parallel. Each returns the top 30 candidates. Union them, fuse scores, and pass the top 15 to the reranker.

## Step 3: Two-stage reranking

Retrieval finds candidates. Reranking finds the right ones.

### Stage 1: Approximate Nearest Neighbour (ANN)

Exact cosine similarity over 10 million 384-dimensional vectors is not feasible in real time. ANN indices trade a small amount of recall for massive speed:

| Index | Speed | Memory | Best for |
| --- | --- | --- | --- |
| HNSW | ~10ms for top-100 at 10M vectors | High | Best recall/speed ratio (default choice) |
| IVF-PQ | Similar speed, lower memory | Low | Memory-constrained deployments |
| ScaNN | Highest throughput at extreme scale | Medium | 100M+ vector corpora |

HNSW achieves >95% recall@10 compared to exact search at 10M vectors on modern hardware with well-tuned index parameters. It is used by Pinecone, Weaviate, and pgvector.

### Stage 2: Cross-encoder reranking

ANN search compares a query embedding with document embeddings that were computed independently of each other. A cross-encoder reads the query and the chunk together:

    CrossEncoder([query, chunk]) → relevance_score ∈ [0, 1]
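The control flow of the rerank stage is short enough to sketch: score each shortlisted (query, chunk) pair jointly, sort, and truncate. The scorer below is a hypothetical stand-in based on token overlap so that the example runs without a model download. It is not a cross-encoder and only marks where the model call would go. The `rerank` name and the `top_k` default of 5 are assumptions.

```python
from typing import Callable


def overlap_scorer(query: str, chunk: str) -> float:
    """Hypothetical stand-in for a cross-encoder: share of query tokens found in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float] = overlap_scorer,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Score every (query, chunk) pair and keep the top_k highest-scoring chunks."""
    # Only the shortlist is scored here, never the full corpus.
    scored = [(chunk, score_pair(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In a real deployment, `score_pair` would wrap a call to a cross-encoder model or to the inference service that hosts it. Keeping the scorer pluggable lets the shortlist logic be tested without that dependency.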
This joint scoring is far more accurate but too slow to run across the full corpus. The architecture is: ANN retrieves its top 30 candidates quickly, then the cross-encoder reranks only the fused shortlist (the top 15 from Step 2), never the full corpus.

Think of it like hiring: ANN does the CV screen. The cross-encoder does the interview.

Recommended models: `ms-marco-MiniLM-L-6-v2` for speed (90MB), `ms-marco-MiniLM-L-12-v2` for accuracy. Both run locally. At scale, reranking runs on a dedicated inference service, not inline with the query path.

## Step 4: Source confidence scoring

Every retrieved chunk must earn its place in the prompt. A confidence score decides whether a chunk is trustworthy enough to influence the response.

Four components:

- Retrieval confidence: the normalized fusion score from Step 2 (0 to 1)
- Source freshness: documents older than two years receive a decay penalty
- Source authority: domain-specific trust scores (internal audit reports rank higher than anonymous web pages)
- Cross-chunk agreement: if four of your top five chunks say the same thing, confidence in that claim rises

Weighted formula:

    confidence = 0.5 × retrievalScore
               + 0.2 × freshnessScore
               + 0.2 × authorityScore
               + 0.1 × agreementScore
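The weighted formula and the threshold gate of this step (0.65, with a fixed refusal message) can be sketched as follows. Each signal is assumed to be pre-normalized to the 0 to 1 range. The `ChunkSignals` type and the choice to drop individual chunks below the threshold, rather than only checking whether any chunk passes, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ChunkSignals:
    """Per-chunk inputs to the confidence score, each assumed to be in [0, 1]."""
    retrieval: float
    freshness: float
    authority: float
    agreement: float


THRESHOLD = 0.65
REFUSAL = "Insufficient information found in the knowledge base."


def confidence(s: ChunkSignals) -> float:
    """Weighted sum using the illustrative weights from the formula above."""
    return 0.5 * s.retrieval + 0.2 * s.freshness + 0.2 * s.authority + 0.1 * s.agreement


def gate(chunks: dict[str, ChunkSignals], threshold: float = THRESHOLD) -> list[str]:
    """Return ids of chunks that clear the threshold; an empty list means refuse."""
    return [cid for cid, signals in chunks.items() if confidence(signals) >= threshold]


def answer_or_refuse(chunks: dict[str, ChunkSignals]) -> str:
    """Refuse when no chunk passes; otherwise report which chunks may reach the LLM."""
    passed = gate(chunks)
    if not passed:
        return REFUSAL
    return "generate with: " + ", ".join(passed)
```

A chunk with signals (0.9, 0.8, 0.7, 0.5) scores 0.80 and passes, while one with (0.5, 0.4, 0.6, 0.2) scores 0.47 and is held back.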
These weights are illustrative starting points. Tune them per domain: high-stakes domains (legal, medical) typically weight authority and freshness more heavily.

The threshold gate: if confidence < 0.65 for all retrieved chunks, do not generate a response. Return: "Insufficient information found in the knowledge base."

This is not a failure. It is the system working correctly. A confident wrong answer is worse than an honest refusal.

## Step 5: Constrained generation

This is the architectural decision that separates zero-hallucination RAG from regular RAG.

The LLM system prompt must explicitly forbid the model from using anything outside the provided context. Vague instructions like "answer based on the documents" do not work. The constraint must be unambiguous and include a defined fallback.

A system prompt that works:

    You are a citation-backed AI assistant.
    Answer using ONLY the provided Context sections below.

    Rules:
    1. Every claim you make must be supported by the provided Context.
    2. Cite every assertion with [Source N] where N is the context section number.
    3. If the Context does not contain the answer, respond with exactly:
       "The provided documents do not contain sufficient information
        to answer this question."
    4. Do NOT use any knowledge from your training data to fill gaps.
    5. Do NOT speculate, extrapolate, or make inferences beyond what
       the Context explicitly states.

    Context:
    ---
    [Source 1: contract_v4.pdf, Page 4]
    <chunk text>
    ---
    [Source 2: policy_2024.docx, Page 12]
    <chunk text>
    ---
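Assembling the numbered Context section from retrieved chunks is mechanical, and a sketch may help show how the `[Source N]` numbers stay aligned with the chunks. The `Chunk` type and `build_system_prompt` function are hypothetical names. The rules text is passed in as a string rather than repeated here.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A retrieved chunk with the metadata shown in each source header."""
    document: str
    location: str
    text: str


def build_system_prompt(rules: str, chunks: list[Chunk]) -> str:
    """Append a numbered Context section to the rules, one block per chunk."""
    parts = [rules, "", "Context:", "---"]
    for n, chunk in enumerate(chunks, start=1):
        # The number N here is what the model must cite as [Source N].
        parts.append(f"[Source {n}: {chunk.document}, {chunk.location}]")
        parts.append(chunk.text)
        parts.append("---")
    return "\n".join(parts)
```

Because numbering is derived from list order, the same list can later be used to map a `[Source N]` citation in the response back to its document and location.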
Set temperature to 0.0 or 0.1. High temperature increases creativity. In a grounding task, creativity is another word for hallucination.

## Step 6: Citations

Every factual claim in the response links back to the source chunk that supported it. Citations serve two purposes: they let the user verify answers, and they force the generation constraint to be auditable.

Format citations inline as [Source N], where N corresponds to the numbered context sections in the system prompt. Each citation includes:

- Document name and version
- Page or section number
- The specific passage the claim was drawn from

Citations also enable the hallucination detection step that follows: you cannot verify grounding without knowing what was claimed and where.

## Step 7: Hallucination detection

Even with constrained generation, hallucinations slip through. You need an automated verification layer between the model output and the user.

Three-pass verification:

Pass 1 (assertion extraction): pull all factual claims from the response. Numbers, proper nouns, dates, percentages, named entities. Use regex and Named Entity Recognition (NER).

Pass 2 (grounding check): for each extracted claim, verify it appears in the retrieved context. Use fuzzy string matching rather than exact match, because the model paraphrases. Any claim that appears in the response but not in any retrieved chunk is flagged.

Pass 3 (faithfulness threshold):

    faithfulness = verified_assertions / total_assertions
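The three passes can be sketched with the standard library alone. This is a deliberately simplified illustration: it treats every sentence of the response as one assertion instead of running regex and NER extraction, and it uses `difflib.SequenceMatcher` as the fuzzy matcher. The per-claim match ratio of 0.8 is an assumed parameter and is separate from the faithfulness threshold.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_assertions(response: str) -> list[str]:
    """Pass 1 (simplified): strip [Source N] markers, one assertion per sentence."""
    cleaned = re.sub(r"\s*\[Source \d+\]", "", response)
    return split_sentences(cleaned)


def is_grounded(claim: str, context_chunks: list[str], min_ratio: float = 0.8) -> bool:
    """Pass 2: fuzzy-match the claim against every sentence of every chunk."""
    claim_norm = claim.lower()
    for chunk in context_chunks:
        for sentence in split_sentences(chunk):
            if SequenceMatcher(None, claim_norm, sentence.lower()).ratio() >= min_ratio:
                return True
    return False


def faithfulness(response: str, context_chunks: list[str]) -> tuple[float, list[str]]:
    """Pass 3: return verified / total assertions plus the unverified claims."""
    claims = extract_assertions(response)
    unverified = [c for c in claims if not is_grounded(c, context_chunks)]
    if not claims:
        return 1.0, []
    return (len(claims) - len(unverified)) / len(claims), unverified
```

Returning the unverified claims alongside the score is what makes it possible to show the user which specific statements could not be checked.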
If faithfulness < 0.8 and flagged assertions exist, do not suppress the response. Surface it with a visible warning showing which specific claims could not be verified.

Fallback sequence (in order of severity):

1. Show response with inline warning on unverified claims
2. Re-run with temperature = 0.0 and a stricter prompt
3. Return "cannot verify" if the second pass also fails
4. Escalate to human review queue

At scale, run this as an async post-processor. Stream the response to the user and show the hallucination warning within 500ms of response completion if verification fails.

## Step 8: Continuous evaluation

You cannot fix what you cannot measure. RAG evals must run in production on every query, not just during offline testing before launch.

Three core metrics:

Context Relevance: are the retrieved chunks actually relevant to the query?

    contextRelevance = |queryTokens ∩ contextTokens| / |queryTokens|

Low context relevance is a retrieval problem. Fix the index, the chunking, or the fusion weight.

Faithfulness: does the response stay grounded in the retrieved context?

    faithfulness = verified_claims / total_claims

Low faithfulness is a generation problem. Constrain the system prompt harder, lower the temperature.

Answer Relevance: does the response actually answer the question?

    answerRelevance = |queryTokens ∩ answerTokens| / |queryTokens|
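The two token-overlap metrics are simple set operations, sketched below. The tokenizer (lowercase runs of letters and digits) is an assumption, as are the function names. Faithfulness is omitted because it depends on the claim-verification pass rather than on token overlap.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens (assumed tokenizer; no stop-word removal)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query: str, other: str) -> float:
    """|queryTokens ∩ otherTokens| / |queryTokens|, or 0.0 for an empty query."""
    q = tokens(query)
    return len(q & tokens(other)) / len(q) if q else 0.0


def context_relevance(query: str, chunks: list[str]) -> float:
    """Share of query tokens that appear anywhere in the retrieved chunks."""
    return overlap(query, " ".join(chunks))


def answer_relevance(query: str, answer: str) -> float:
    """Share of query tokens that appear in the generated answer."""
    return overlap(query, answer)
```

Token overlap is a coarse proxy: function words such as "the" count as matches, and a correct but tersely worded answer can score low. It is cheap enough to run on every query, which is the property this step relies on.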
High context relevance but low answer relevance means the model is ignoring the context it was given.

Additional production metrics:

| Metric | Target | Meaning if off target |
| --- | --- | --- |
| Latency p95 per stage | < 500ms total | Identify which stage is the bottleneck |
| Cache hit rate | > 40% | Common queries are being recomputed unnecessarily |
| Retrieval diversity | Varies across docs | System is over-indexing a small document subset |
| User rejection rate | < 5% | Users are explicitly marking answers wrong |

Alert if faithfulness drops below 0.75 over a rolling one-hour window. That is a system-level signal, not a one-off.

## Step 9: Caching and memory

At millions of documents and production traffic, many queries repeat. Recomputing the full pipeline for identical queries adds latency and cost with no benefit.

Two-level cache:

- Level 1, exact match cache: hash (query + retrieval_config + model) and cache the full response with citations. Set a TTL tied to document freshness: if any source document in the cached result is updated, invalidate that cache entry immediately.
- Level 2, semantic near-duplicate cache: cache query embeddings. For incoming queries, check if a semantically similar query (cosine similarity > 0.97) has already been answered. Return the cached result with a note indicating a similar query was matched.

Memory layer:

Session memory: within a conversation, maintain context of what has been discussed. If the user referenced "Clause 4.2.1" three turns ago, the system should not require them to repeat it. Inject relevant prior turns into the retrieval context.

Long-term correction memory (human-in-the-loop): when a domain expert corrects a wrong answer, store that correction with the query topic, source document, and domain keywords. On future similar queries, retrieve relevant corrections and prepend them to the system prompt:

    [Retrieved expert correction from prior session]
    Note: Previous answer on termination clauses was incorrect:
    Clause 4.2.1 applies only to fixed-term contracts, not at-will.
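A correction memory of this kind can be sketched as a small store plus a prompt-prepending helper. Keyword overlap stands in for whatever similarity search a real system would use to find "similar queries", and all type and function names here are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Correction:
    """One expert correction, stored with the fields named in the text."""
    topic: str
    source_document: str
    keywords: set[str]
    note: str


@dataclass
class CorrectionMemory:
    entries: list[Correction] = field(default_factory=list)

    def add(self, correction: Correction) -> None:
        self.entries.append(correction)

    def relevant(self, query: str, min_hits: int = 1) -> list[Correction]:
        """Return corrections sharing at least min_hits keywords with the query."""
        words = set(re.findall(r"[a-z0-9]+", query.lower()))
        return [c for c in self.entries if len(c.keywords & words) >= min_hits]


def prepend_corrections(system_prompt: str, corrections: list[Correction]) -> str:
    """Place each correction ahead of the system prompt, in the format shown above."""
    if not corrections:
        return system_prompt
    blocks = [
        f"[Retrieved expert correction from prior session]\nNote: {c.note}"
        for c in corrections
    ]
    return "\n\n".join(blocks + [system_prompt])
```

Storing the source document with each correction also makes it possible to retire a correction once that document is updated, although that policy is outside this sketch.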
This is how a RAG system learns from errors without retraining the model.

## Step 10: Observability everywhere

When a hallucination reaches a user at 3am, you need to answer four questions within minutes. Which document caused it? Which retrieval step ranked it too highly? Which eval metric failed to catch it? How many users saw it? Without structured traces, you guess.

Every query should emit a structured trace:

    [INGEST LAYER]      Document parsed - 847 chunks generated in 2.3s
    [VECTOR LAYER]      ANN search - 30 candidates in 8ms (HNSW index)
    [BM25 LAYER]        Keyword search - 12 candidates in 3ms
    [FUSION LAYER]      Hybrid merge - 38 unique candidates, top 15 selected
    [RERANK LAYER]      Cross-encoder scored 15 chunks in 180ms
    [CONFIDENCE LAYER]  Top chunk: 0.847, threshold: 0.65 - PASS
    [GENERATION LAYER]  LLM call - 1240ms, 387 tokens generated
    [EVAL LAYER]        Faithfulness: 0.91, Relevance: 0.84 - OK
    [CACHE LAYER]       Result cached. Key: a3f9b2c1...
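One way to produce a trace like this as structured data is a per-query recorder that times each stage and serializes to JSON. The sketch below uses only the standard library. The `QueryTrace` class and its field names are hypothetical, and a production system would more likely emit these as spans through a tracing SDK.

```python
import json
import time
from contextlib import contextmanager


class QueryTrace:
    """Collects one timed record per pipeline stage for a single query."""

    def __init__(self, query_hash: str, session_id: str) -> None:
        self.record = {"query_hash": query_hash, "session_id": session_id, "stages": []}

    @contextmanager
    def stage(self, name: str, **fields):
        """Time the enclosed block; the yielded dict accepts extra fields."""
        details = dict(fields)
        start = time.perf_counter()
        try:
            yield details
        finally:
            # Recorded even if the stage raises, so failures still leave a trace.
            details["stage"] = name
            details["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
            self.record["stages"].append(details)

    def to_json(self) -> str:
        """Serialize the whole trace as one JSON log line."""
        return json.dumps(self.record)


# Usage: wrap each stage of the pipeline in its own block.
trace = QueryTrace(query_hash="a3f9b2c1", session_id="session-1")
with trace.stage("vector", index="hnsw") as s:
    s["candidates"] = 30
with trace.stage("bm25") as s:
    s["candidates"] = 12
log_line = trace.to_json()
```

Keeping the query hash and session id at the top level of the record is what lets a log store filter traces by those keys.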
What to capture per query:

- Timing breakdown per stage, not just total end-to-end latency
- Which documents were retrieved and their scores
- Which chunks were rejected by the reranker and why
- The exact system prompt sent to the model
- Raw model response before citation parsing
- Eval scores and whether any threshold was breached
- Cache hit or miss

Infrastructure stack: OpenTelemetry for distributed tracing, Prometheus + Grafana for metrics dashboards, structured JSON logs to Elasticsearch or Loki. Every trace must be queryable by document ID, query hash, user session ID, and time range.

## How the ten steps compose

The steps are not independent modules. They form a chain where each stage sets the conditions for the next.

A failure at Step 2 (retrieval) propagates to every downstream step. The confidence gate at Step 4 is the last line of defence before the LLM. The eval loop at Step 8 is how you detect drift before users notice it.

## Choosing where to invest

| Your current failure mode | Step to fix first |
| --- | --- |
| Responses about irrelevant topics | Step 2: add BM25 to pure-vector retrieval |
| Right topic, wrong specifics | Step 3: add cross-encoder reranking |
| Answers when it should say "I don't know" | Step 4: add confidence threshold gate |
| Plausible-sounding fabrications | Step 5: tighten the system prompt constraint |
| No way to audit which document was used | Step 6: add structured citations |
| Hallucinations slip through generation | Step 7: add assertion grounding pass |
| No idea if quality is degrading | Step 8: add continuous eval metrics |
| Same slow queries repeating | Step 9: add two-level cache |
| Cannot debug production incidents | Step 10: add structured per-query traces |

Watch out: Steps 5–7 alone cannot compensate for poor retrieval. If the wrong documents reach the prompt, no generation constraint will produce a correct answer. Fix retrieval first.

## Summary

- At 10M+ documents, brute-force vector search is not feasible. ANN indices like HNSW return top-100 candidates in ~10ms with >95% recall.
- Hybrid retrieval (BM25 + embeddings, fused by a tunable α) covers each method's blind spots: keyword exactness for BM25, semantic generalization for vectors.
- Two-stage reranking uses ANN for speed and a cross-encoder for accuracy, reranking only a short list of candidates (the top 15 after fusion), never the full corpus.
- Confidence scoring gates the LLM: if no retrieved chunk clears a 0.65 threshold, the system refuses to generate rather than guess.
- Constrained generation with explicit system prompt rules and temperature ≤ 0.1 is the primary hallucination prevention mechanism.
- Three-pass hallucination detection (extract → ground → score) runs asynchronously and surfaces unverified claims inline rather than suppressing responses.
- Continuous evals on context relevance, faithfulness, and answer relevance tell you which layer is failing: retrieval, generation, or the model itself.
- Two-level caching (exact match + semantic near-duplicate at cosine > 0.97) eliminates redundant computation on repeated queries in production.
- Structured per-query traces are the only way to diagnose production hallucinations without guessing.

The engineers who get this right obsess over BM25 index quality, fusion weights, reranker calibration, and confidence thresholds, not model benchmarks. The retrieval pipeline is the product. The LLM is the final step.

## References

- Robertson, S. and Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond (2009). Foundations and Trends in Information Retrieval.
- Malkov, Y. and Yashunin, D. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (2018). IEEE Transactions on Pattern Analysis and Machine Intelligence. https://arxiv.org/abs/1603.09320
- Nogueira, R. and Cho, K. Passage Re-ranking with BERT (2019). https://arxiv.org/abs/1901.04085
- Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020). NeurIPS. https://arxiv.org/abs/2005.11401
- Es, S. et al. RAGAS: Automated Evaluation of Retrieval Augmented Generation (2023). https://arxiv.org/abs/2309.15217
- Johnson, J. et al. Billion-scale similarity search with GPUs (FAISS) (2019). IEEE Transactions on Big Data. https://arxiv.org/abs/1702.08734
- Anthropic. Effective context engineering for AI agents. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents