Retrieval-augmented generation (RAG) is a technique that improves LLM responses by letting the model retrieve relevant text from an external source and use it as additional context. In RAG, the LLM performs information retrieval from an external knowledge base before generating a response to a user prompt. In this blog post, I’ll provide a high-level overview of how RAG works.

External data, documents, and chunks

The external knowledge base is the data contained in a specified set of documents. It is external in the sense that it supplements the LLM’s original training data. Unfortunately, the term “document” is used inconsistently when discussing RAG, which can be confusing. For some, “document” refers to a large text source (e.g., a PDF, Word doc, or web page). For others, it’s a retrievable unit of text (aka a chunk). In this post, I’ll use the latter definition. Thus, the following are all valid documents: a series of paragraphs from a PDF; a code snippet from a Python script; a subgraph (that gets serialized to text) of a knowledge graph.

To drive the point home, imagine ingesting a 100-page PDF and splitting it into 500 text chunks. The retriever would see 500 documents, even though we think of them as coming from a single PDF. Chunking is mainly pragmatic: LLMs have finite context windows, so retrieved text must be small enough to fit alongside the prompt. Smaller, focused chunks also improve retrieval precision.
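To make this concrete, here is a minimal chunking sketch. The fixed-size character splitter, the chunk length, and the overlap are illustrative choices of mine, not a prescription; real systems often count tokens and split on sentence or section boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split a long text into overlapping, fixed-size chunks.

    chunk_size and overlap are measured in characters here for simplicity.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap for context
    return chunks

# A 100-page PDF, once extracted to plain text, might yield roughly 500 such
# chunks; the retriever then treats each chunk as a separate document.
pdf_text = "..."  # text extracted from the PDF with your tool of choice
documents = chunk_text(pdf_text)
```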

Stages of RAG

RAG has four key stages, summarized below.

| Stage | When? | Description |
| --- | --- | --- |
| Indexing | Offline, i.e. before any user queries | Documents (chunks of text obtained by, e.g., splitting a large text file or serializing sections of a knowledge graph) are converted to embeddings that are stored, alongside the documents themselves, in a vector database (e.g., Chroma, Pinecone) for later retrieval. |
| Retrieval | After the user query | When a user submits a query, RAG first searches the vector database for the most relevant documents using vector similarity. The retriever returns the actual text of those documents, not their embeddings (the embeddings are only used for the similarity search). |
| Augmentation | After the user query | Prompt engineering is used to combine the retrieved text with the user prompt. |
| Generation | After the user query | The LLM generates a response based on both the user prompt and the retrieved text. Some systems add extra steps to improve the output, such as re-ranking the retrieved information, context selection, and fine-tuning. |
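To make the four stages concrete, here is a minimal, self-contained Python sketch. The bag-of-words “embedding”, the cosine-similarity ranking, and the `call_llm` stub are stand-ins I’ve invented to keep the example runnable; a real pipeline would use an embedding model, a vector database such as Chroma or Pinecone, and an actual LLM call.

```python
import math
from collections import Counter

# --- Stand-ins (assumptions, not part of the post) ---------------------------
def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def call_llm(prompt: str) -> str:
    """Stub for the generation stage; swap in your LLM client of choice."""
    return f"<LLM answer based on a prompt of {len(prompt)} characters>"

# --- 1. Indexing (offline): store documents alongside their embeddings -------
documents = [
    "Einstein was born in Germany and worked as a physicist.",
    "Marie Curie conducted pioneering research on radioactivity.",
]
index = [(doc, embed(doc)) for doc in documents]

# --- 2. Retrieval: rank documents by similarity to the query embedding -------
query = "Where was Einstein born?"
query_emb = embed(query)
ranked = sorted(index, key=lambda pair: cosine(query_emb, pair[1]), reverse=True)
retrieved = [doc for doc, _ in ranked[:1]]  # keep the top-k documents, here k=1

# --- 3. Augmentation: combine the retrieved text with the user prompt --------
prompt = f"Answer using the context below.\n\nContext:\n{retrieved[0]}\n\nQuestion: {query}"

# --- 4. Generation: the LLM answers from the augmented prompt ----------------
print(call_llm(prompt))
```

Note that the vector database only matters in stages 1 and 2; stages 3 and 4 operate purely on text.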

These stages can be visualized in the diagram below:


High-level overview of RAG, showing its four main stages (indexing, retrieval, augmentation, and generation). The diagram doesn’t include details about how chunking is done. Source: I created this diagram with draw.io.

Importantly, irrespective of whether the original data is structured or unstructured, the retrieval step always returns text (not embeddings!) that the augmentation step combines with the user’s prompt. During indexing, structured data like knowledge graphs is serialized to text. For example, in a knowledge graph of all scientists, the triples (Einstein, born_in, Germany) and (Einstein, profession, physicist) could convert to: “Einstein was born in Germany and worked as a physicist”. If Einstein’s subgraph formed a document, it would be serialized to text; it is this text and its embedding that get stored in the vector DB.
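As a tiny illustration of that serialization step, here is one possible sketch; the triple format and the template are my own assumptions, not a standard, and real systems often use more careful natural-language templates.

```python
# Hypothetical example: turn one subject's triples into a single text document
# that can then be embedded and stored like any other chunk.
triples = [
    ("Einstein", "born_in", "Germany"),
    ("Einstein", "profession", "physicist"),
]

def serialize(subject: str, triples: list[tuple[str, str, str]]) -> str:
    """Join a subject's facts into one sentence-like string."""
    facts = [f"{p.replace('_', ' ')} {o}" for s, p, o in triples if s == subject]
    return f"{subject}: " + "; ".join(facts) + "."

document = serialize("Einstein", triples)
# -> "Einstein: born in Germany; profession physicist."
```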

Benefits

The external knowledge base in RAG provides several benefits: the LLM can draw on up-to-date, domain-specific information that was not in its training data; that information can be updated without retraining or fine-tuning the model; responses can be grounded in, and attributed to, specific sources; and grounding the model in retrieved facts reduces hallucinations.

It’s worth pointing out that RAG doesn’t eliminate hallucinations entirely. Even with RAG, LLMs can still generate incorrect responses. For example, one failure mode is misinterpreting context even when referencing factually correct sources.