Internal working of a RAG Application

Inner Mechanics of a RAG application: From Data Storage to Semantic Search

Large Language Models (LLMs) are powerful tools, but their capabilities are limited by the data they're trained on. They lack access to private user data and the ever-growing stream of newly published information. This challenge, along with the limitations of context window size in LLMs, fuels the need for external data connections.

In the early days of LLMs, the context window – the amount of text the model considers during generation – was relatively small. This restricted the model's ability to understand complex relationships and generate comprehensive responses. However, with advancements in technology, context windows have expanded significantly.

Although a larger context window allows for better understanding, it does not solve the problem of static training data. To address this, connecting LLMs to external data sources has become crucial, and this ability to access external data has significantly boosted the popularity of LLMs.

Internally, there are three main stages in a RAG application:

  1. Indexing

  2. Retrieval

  3. Generation

Indexing

In a RAG application, you retrieve documents or information relevant to your query. These documents need to be stored somewhere, such as a database. Storing them in the right representation is also important, because that representation is what makes it possible to measure the similarity between your query and the documents.

One way is to store them as numerical representations, or vector embeddings. Comparing numerical representations is much easier than comparing free-flowing text, and many approaches for doing so have been developed over the past few years.

Rather than embedding entire documents as single vectors, a more practical approach is to divide them into smaller segments, or chunks. This segmentation is necessary because embedding models have finite context windows, often on the order of 512 tokens.
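As a rough illustration, here is a minimal Python sketch of fixed-size chunking. Real pipelines usually split by tokens using the embedding model's tokenizer; the word-based splitting, chunk size, and overlap shown here are arbitrary example values, not prescribed settings.

```python
# A minimal word-based chunking sketch. Real pipelines usually chunk by tokens
# (matching the embedding model's tokenizer); chunk_size and overlap here are
# illustrative values only.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows small enough for an embedding model."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```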

Each document segment is then compressed into a vector representation. Converting documents into vectors lets you think of them as points in a high-dimensional space, where semantically similar documents sit closer to each other. These vectors encapsulate the semantic essence of the documents and are indexed for efficient retrieval.

Multiple document loaders and embedding models are available to support this process, allowing for flexibility and customization.
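As one concrete (and entirely optional) choice, the sketch below uses the sentence-transformers library to embed a few segments and builds a toy in-memory index. The model name is just an example, and the plain Python list standing in for a vector database is for illustration only.

```python
# A minimal indexing sketch. Assumes the sentence-transformers package is
# installed; the model name and the in-memory list standing in for a vector
# database are illustrative choices.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

segments = [
    "RAG pairs an LLM with an external knowledge store.",
    "Vector databases index embeddings for fast similarity search.",
    "Embedding models map text to dense numerical vectors.",
]

# Each segment becomes a dense vector capturing its meaning.
vectors = model.encode(segments)  # shape: (num_segments, embedding_dim)

# Toy in-memory "index": (text, vector) pairs. A production system would use
# a vector store such as FAISS, Chroma, or pgvector instead.
index = list(zip(segments, vectors))
```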

Retrieval

During runtime, the user query undergoes the same transformation into a vector embedding using the same embedding model. A semantic search is then conducted by comparing the query's vector embedding with those of the segmented documents stored in the database.

How do we perform semantic search? You can use similarity metrics such as Euclidean distance or cosine similarity, typically combined with a search method like k-nearest neighbors (KNN) to find the closest vectors.

By setting a predetermined value of k, users can refine their search to retrieve only the top k most similar documents, balancing relevance and precision.
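Putting this together, here is a minimal sketch of cosine-similarity top-k retrieval over the toy index from the indexing sketch above. The library and model are the same illustrative choices as before, and the brute-force loop stands in for what a vector database would do far more efficiently.

```python
# A minimal retrieval sketch using cosine similarity, assuming `index` holds
# (segment_text, embedding) pairs produced as in the indexing sketch above.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the indexing model

def retrieve_top_k(query: str, index: list[tuple[str, np.ndarray]], k: int = 3) -> list[str]:
    """Return the k stored segments most semantically similar to the query."""
    q = model.encode([query])[0]
    scored = []
    for text, vec in index:
        # Cosine similarity: dot product of the two vectors divided by their norms.
        sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        scored.append((sim, text))
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]
```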

Generation

Following retrieval, the text of the most similar segments is looked up (the embeddings themselves are not decoded; each stored vector points back to its original chunk). This text, alongside the user query, is passed as context to the Large Language Model (LLM). The LLM then combines that context with its existing knowledge to answer your query.
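To make the flow concrete, here is a minimal sketch of the generation step: the retrieved text and the user query are assembled into a single prompt, and a hypothetical call_llm function stands in for whichever LLM client you actually use (OpenAI, Anthropic, a local model, and so on).

```python
# A minimal generation sketch. `retrieved_segments` is the output of the
# retrieval step; `call_llm` is a hypothetical placeholder, not a real API.
def build_prompt(query: str, retrieved_segments: list[str]) -> str:
    """Combine the retrieved text and the user query into a single prompt."""
    context = "\n\n".join(retrieved_segments)
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real call to your chosen LLM API or library.
    raise NotImplementedError

# Example usage once a real LLM client is plugged in:
# answer = call_llm(build_prompt("What is a RAG application?", retrieved_segments))
```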