AI & Prompting Series

RAG Pipeline Simulator

Watch how embedding, vector search, and top-K retrieval ground LLM responses in real documents — and see exactly what happens without them.


[Interactive simulator: select a query preset (e.g., "What was Apple's Q4 2024 revenue and net income?"), choose a pipeline mode, set Top-K and animation speed, then press Start. The six stages — Query Received, Embedding Model, Vector DB Search, Top-K Retrieval, LLM Processing, Response Ready — activate in sequence, and a status panel reports the pipeline mode, active step, chunks retrieved, and grounding of the final response.]

Quick Guide: RAG Pipeline

Understanding the basics in 30 seconds

How It Works

  • Select a query preset (financial data, policy, or technical requirements)
  • Choose With RAG or Without RAG mode to compare behaviors
  • Adjust Top-K to control how many document chunks are retrieved
  • Press Start and watch each pipeline stage activate in sequence
  • Compare the grounded response with citations vs the hallucination-prone direct response

Key Benefits

  • Makes the abstract RAG pipeline concrete and step-by-step
  • Shows exactly why LLMs hallucinate without retrieved context
  • Demonstrates the role of similarity scores in chunk selection
  • Teaches the Top-K trade-off between coverage and noise
  • Connects the embedding step to the final response quality

Real-World Uses

  • Notion AI: Retrieves from workspace documents for team-specific answers
  • GitHub Copilot Chat: Retrieves from repository index for codebase questions
  • Zendesk AI: Grounds support replies in product documentation
  • Amazon Q: Connects enterprise knowledge bases to natural language queries
  • Harvey AI: Retrieves from case law for citable legal analysis

What is Retrieval-Augmented Generation?

Retrieval-Augmented Generation (RAG) is an architecture that connects a large language model to an external knowledge base at inference time. Instead of relying solely on what the model learned during training, RAG retrieves relevant documents from a vector database and injects them into the prompt as context before the LLM generates a response.

This addresses three fundamental limitations of standalone LLMs: knowledge cutoff dates, lack of access to private or proprietary data, and the tendency to hallucinate when uncertain. With RAG, the model has grounded, current, and citable sources to reason from, provided retrieval surfaces the right chunks.
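The whole loop can be sketched end to end in a few functions. This is a toy, self-contained stand-in: `embed` uses bag-of-words counts instead of a real embedding model, and the corpus, chunk texts, and prompt template are all invented for illustration (the final LLM call is omitted).

```python
import math

# Toy document store standing in for a chunked, pre-embedded corpus.
CHUNKS = [
    "Apple reported Q4 2024 revenue of $94.9 billion.",
    "The refund policy allows returns within 30 days of purchase.",
    "The API requires TLS 1.2 or higher for all requests.",
]

def embed(text):
    """Stand-in 'embedding': bag-of-words counts instead of a dense vector."""
    counts = {}
    for token in text.lower().split():
        token = token.strip(".,?!$")
        counts[token] = counts.get(token, 0) + 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=2):
    """Top-K retrieval: rank chunks by similarity to the query."""
    q = embed(query)
    return sorted(CHUNKS, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, k=2):
    """Context injection: assemble retrieved chunks into the LLM prompt."""
    context = "\n".join(f"- {c}" for c in retrieve(query, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What was Apple's Q4 2024 revenue?")
```

A real deployment swaps `embed` for an embedding-model call, replaces the in-memory list with a vector database, and sends `prompt` to the LLM.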

How the Pipeline Works

  1. Query Ingestion: The user's query enters the system. This is the same input that would go directly to an LLM in a naïve setup.
  2. Embedding: The query is passed through an embedding model (e.g., OpenAI text-embedding-3-small or a local model like nomic-embed-text). This converts the text into a high-dimensional vector — a list of numbers that captures semantic meaning.
  3. Vector Database Search: The query vector is compared against pre-computed vectors of all document chunks stored in the vector database (Pinecone, pgvector, Weaviate, Chroma, etc.) using cosine similarity or dot product distance.
  4. Top-K Retrieval: The K chunks with the highest similarity scores are selected. K is a tunable parameter — higher K means more context but also more token cost and potential noise.
  5. Context Injection: The retrieved chunks are assembled and injected into the LLM prompt alongside the original query. The model is instructed to ground its answer in the provided context.
  6. Grounded Response: The LLM generates a response citing or referencing the retrieved chunks. Citations make the output auditable and reduce hallucination.
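Steps 2 through 4 (embedding, similarity search, and Top-K selection) can be sketched concretely. The chunk IDs and 3-dimensional vectors below are invented for illustration; real embeddings have hundreds or thousands of dimensions, and a production vector database replaces this linear scan with an approximate nearest-neighbor index.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Pre-computed chunk vectors, as a vector database would store them.
index = {
    "chunk-A": [0.90, 0.10, 0.00],
    "chunk-B": [0.20, 0.80, 0.10],
    "chunk-C": [0.85, 0.20, 0.10],
}

def top_k(query_vec, k):
    """Return the k chunk IDs most similar to the query vector, with scores."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), cid) for cid, vec in index.items()),
        reverse=True,
    )
    return [(cid, round(score, 3)) for score, cid in scored[:k]]

# The query vector (step 2's output) would come from the embedding model.
results = top_k([1.0, 0.1, 0.0], k=2)
```

Raising `k` here pulls in chunk-B as well, even though its score is far lower: exactly the coverage-versus-noise trade-off the simulator's Top-K slider demonstrates.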

Trade-offs

Advantages

  • Reduces hallucination by grounding answers in retrieved sources
  • Enables access to private, proprietary, and up-to-date data
  • Responses can be audited via source citations
  • No need to fine-tune or retrain the LLM when data changes
  • Works with any LLM as a drop-in augmentation layer

Disadvantages

  • Retrieval quality depends heavily on chunking and embedding strategy
  • Poor Top-K selection can inject irrelevant or conflicting context
  • Adds latency: one extra embedding call + vector DB query per request
  • Context window limits cap how much can be retrieved
  • Keeping the vector DB in sync with changing source documents is non-trivial
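For the last point, one common mitigation is to store a content hash alongside each chunk's embedding and re-embed only chunks whose text has changed. A minimal sketch, with hypothetical function names (handling deleted chunks would additionally require a pass over stored IDs missing from the current corpus):

```python
import hashlib

def content_hash(text):
    """Fingerprint a chunk's text so changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_chunks(current, stored_hashes):
    """Return IDs of chunks whose text is new or changed since last embedding.

    current:        chunk id -> current source text
    stored_hashes:  chunk id -> hash recorded when the chunk was last embedded
    """
    return [
        cid for cid, text in current.items()
        if stored_hashes.get(cid) != content_hash(text)
    ]
```

Only the IDs returned by `stale_chunks` need an embedding call and a vector DB upsert, which keeps sync costs proportional to the amount of change rather than the corpus size.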