Exploring RAG architecture: How retrieval-augmented generation is revolutionizing GenAI

Discover how RAG bridges AI limitations, combining external data with GenAI for dynamic, accurate responses.
This article was written by Laura Meyer, an engineer at a leading consultancy specializing in AI, data science, and DevOps, with extensive experience in GenAI innovation and delivering technical training.

Bridging the gap: Overcoming LLM limitations with retrieval-augmented generation (RAG)

Large language models (LLMs) have revolutionized industries by enabling intelligent content creation, analysis, and automation. Their ability to learn from massive and diverse datasets has unlocked new possibilities in how we interact with technology, solve complex problems, and innovate. Yet, even these powerful tools have their challenges: LLMs can be resource-intensive, struggle with real-time information, lack access to proprietary or private data, and may not be tailored to specific domains. Their reliance on static training data limits their ability to adapt quickly to changing environments, and fine-tuning them can inadvertently introduce errors (so-called “hallucinations”) or erase valuable prior knowledge. In dynamic, high-stakes industries where accuracy, confidentiality, and up-to-date insights are critical, these limitations highlight the need for a more agile solution.

Retrieval-augmented generation (RAG) emerges as a groundbreaking solution, redefining how we use language models. By integrating external knowledge retrieval into the generative process, RAG enables AI systems to access up-to-date, domain-specific information and provide more accurate, context-aware responses.

This innovation bridges the gap between static AI knowledge and the real-world need for timely, reliable, and contextually relevant information.

RAG combines two essential components:

  1. Retrieval mechanism: This component pulls relevant, domain-specific information from external sources like knowledge bases, databases, or even the web.
  2. Generative model: The retrieved information is then passed to a GenAI model (like GPT) to craft responses grounded in the retrieved data.
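
Conceptually, the flow is simply “retrieve first, then generate.” The short sketch below is purely illustrative: knowledge_base and llm are hypothetical placeholders, not a specific library API.

Python:

# Purely conceptual sketch of the two RAG components working together.
# `knowledge_base` and `llm` are hypothetical placeholders, not a real API.
def rag_answer(query, knowledge_base, llm):
    relevant_docs = knowledge_base.search(query)   # 1. Retrieval mechanism
    context = "\n\n".join(relevant_docs)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm.generate(prompt)                    # 2. Generative model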

 

Imagine you’re taking an open-book exam. You have all the resources you need—notes, textbooks, and references—but the key to success isn’t memorizing everything. Instead, it’s knowing where to look, finding the right information quickly, and applying it effectively to answer complex questions. That’s exactly how RAG works.

In this scenario, the generative AI model is like the student. It has foundational knowledge (pre-trained data), but instead of storing every detail, it relies on an “open book”—an external knowledge base (like databases, websites, or even PDFs). The retrieval system acts like flipping through pages or searching your notes for the most relevant information. Once found, the model integrates this knowledge with its existing understanding to craft a clear and thoughtful response.

RAG, like a prepared student, thrives by combining what it knows with the ability to quickly access and use external resources, making its answers precise, dynamic, and context-aware.

 

Summarizing RAG architecture:

The steps below illustrate the end-to-end process of a RAG system, where a user’s query is processed through embedding, retrieval, and generative modeling to produce context-aware and accurate responses; a condensed code sketch follows the list.

  1. The process begins when a user submits a query or request. This input could be a question, a search prompt, or any task requiring a context-aware response.
  2. The query is converted into a numerical vector representation using an embedding model (e.g., text-embedding-ada-002). This representation captures the semantic meaning of the input and enables effective comparison with stored knowledge.
  3. Relevant external knowledge sources, such as structured databases, unstructured documents, knowledge graphs, APIs, or websites, are identified as potential candidates for the query.
  4. The raw information is broken into smaller, manageable chunks of text. This step maintains the context within each chunk while optimizing the search and indexing processes.
  5. Each chunk is converted into a numerical vector representation using an embedding model. These vectors are stored in a vector database for fast retrieval, where they can be indexed and searched based on semantic similarity.
  6. When the user query is received and transformed into a vector, this vector is passed to the retrieval system, which searches the vector database to find the most semantically similar chunks of text (or documents) based on the query vector.
  7. The retrieval system employs techniques such as vector-based similarity search (embedding matching) or keyword-based filtering to locate the most pertinent and contextually relevant data.
  8. The most relevant documents or chunks of text are retrieved and preprocessed, ensuring the generative model has access to accurate, up-to-date, and factual information.
  9. The user’s query, retrieved chunks, and additional instructions are combined to create an enriched input for the generative model. This augmentation ensures the model has all the necessary context to deliver an accurate response.
  10. The LLM (e.g., GPT-4, PaLM) processes the augmented input, blending the retrieved knowledge with its pre-trained understanding to generate a well-formed and context-aware response.
  11. The final response is presented to the user, whether as a direct answer, a detailed explanation, or a summarized document tailored to the user’s request.
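
To make these steps concrete, here is a rough Python sketch of the same flow. It is only an illustration: embed, chunk_text, vector_db, and llm_complete are hypothetical helpers standing in for an embedding model, a chunking routine, a vector store, and a chat model, not a specific library API.

Python:

# Illustrative sketch of the steps above; all helpers are hypothetical placeholders.

def index_documents(documents, vector_db, embed, chunk_text):
    # Steps 3-5: gather sources, chunk them, embed each chunk, and store it
    for doc in documents:
        for chunk in chunk_text(doc):
            vector_db.add(vector=embed(chunk), text=chunk)

def answer_with_rag(user_query, vector_db, embed, llm_complete, top_k=3):
    # Steps 1-2 and 6-8: embed the query and retrieve the most similar chunks
    query_vector = embed(user_query)
    top_chunks = vector_db.search(query_vector, top_k=top_k)

    # Steps 9-11: augment the prompt with the retrieved context and generate a response
    context = "\n\n".join(top_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {user_query}"
    return llm_complete(prompt)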

 

Now that we have seen what makes RAG so transformative, let’s walk through building your very own RAG system.

 

Building a simple RAG system: Hands-on guide

Basic pass-through prompt

Before exploring RAG functionality, let’s first examine a simple pass-through prompt. This demonstrates how a standalone LLM processes questions compared to a RAG architecture. A pass-through prompt directly sends a user query to the model without introducing any additional context or data retrieval, allowing us to understand how the model generates responses purely based on pre-existing knowledge from its training data.

We first need to import the necessary OpenAI components and initialize the client to interface with the LLM. The OpenAI() class provides access to the chat-based models, such as gpt-3.5-turbo.

Python:

from openai import OpenAI

# Initialize the OpenAI client
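# Note: the client reads the OPENAI_API_KEY environment variable by default for authentication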

client = OpenAI()

Next, let’s define a user query to test the model’s response. Do you know how many times per year you should change your car’s oil? 🤔

This question is stored in the user_query variable and will be sent directly to the LLM without any auxiliary context or retrieval mechanisms.

Python:

user_query = "How many times per year should a car's oil be changed?"

The client.chat.completions.create() function is used to interact with the LLM. This function specifies the model, user input, and other parameters like temperature, which controls the response’s randomness—higher values lead to more creative and varied outputs, while lower values make responses more deterministic.

Python:

response = client.chat.completions.create(

    model="gpt-3.5-turbo", # Specify the model

    messages=[{"role": "user", "content": user_query}],

    temperature=0.7 # Controls response randomness (higher = more creative)

)

Once the response is received, response.choices[0].message.content extracts the actual generated text from the model’s response. This is the answer the model generates based on the query.

Python:

print(response.choices[0].message.content)

The LLM’s response demonstrates its ability to synthesize general knowledge and context. When asked, “How many times per year should a car’s oil be changed?” it provided the common recommendation: every 3,000 to 5,000 miles or every 3 to 6 months, whichever comes first.

Furthermore, it acknowledged contextual nuances by noting that the interval may vary depending on factors like the car’s make, model, and type of oil used. This showcases the LLM’s ability to incorporate general knowledge and reason through edge cases even without additional contextual inputs or retrieval-based context augmentation.

System prompt

Next, let’s introduce a system prompt to guide the behavior of the LLM. A system prompt sets the context or instructions for how the model should respond: unlike the user query, it defines the tone, style, or constraints of the answer, and it is passed along with the user query to influence the model’s response.

In this case, we want the model to answer in the form of a poem, so we define:

Python:

system_prompt = "Please answer in a poem"

response = client.chat.completions.create(

    model="gpt-3.5-turbo",

    messages=[

        {"role": "user", "content": user_query},

        {"role": "system", "content": system_prompt}

    ],

    temperature=0.7

)

print(response.choices[0].message.content)

RAG use case

As mentioned in the introduction, standard GenAI models rely on pre-trained knowledge and patterns rather than dynamically pulling data in real-time. Let’s explore a use case to show how RAG can solve this limitation.

Choosing what to watch on Netflix can feel like an endless search—scrolling through countless options without finding the right choice. The challenge comes from trying to decide between genres, moods, or specific preferences. This is where RAG comes in. By connecting an LLM to an up-to-date database of Netflix movies and their descriptions, you can directly tell the AI how you’re feeling or what kind of mood you’re in—whether you’re in the mood for a romantic time-travel story, an action-packed thriller, or a light-hearted comedy. The AI can then provide a curated list of the best movies for you, compare options, and even summarize plot details to help you decide. This approach transforms the often stressful search into a personalized, seamless, and enjoyable discovery experience.

Today, we’re in the mood for something romantic, with a twist of time travel—something that feels both nostalgic and magical:

Python:

user_query = """Which 3 romantic movies on Netflix feature a time travel plot,

and what are their key similarities?"""

The netflix_df DataFrame serves as our external knowledge base for the RAG system. It contains 2,000 movie descriptions sourced from Netflix, acting as real-time context data to power the retrieval process. This dataset has a single column named description, which holds brief summaries of various movies available on the platform. All entries are non-null, ensuring every row has valid text data. These descriptions capture snapshots of each movie’s storyline, theme, and genre, making them ideal for user queries based on mood, preferences, or context.
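
If you want to follow along, one way to build such a DataFrame is to load it from a CSV export. The snippet below is a hedged sketch that assumes a hypothetical netflix_titles.csv file with a description column; adapt the path and column name to wherever your data actually lives.

Python:

import pandas as pd

# Hypothetical example: load movie descriptions from a local CSV export.
# The file name and column are assumptions; any source with a "description" column works.
netflix_df = pd.read_csv("netflix_titles.csv", usecols=["description"]).dropna()

netflix_df.info()  # Should show a single non-null "description" column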

Python:

netflix_df.iloc[80]['description']

The next step is to convert each movie description into a vector representation to enable semantic comparison and retrieval using RAG. This is done by computing embeddings for each description from our external Netflix knowledge base.

The get_embedding() function sends a cleaned version of the text (movie description) to the embedding model (text-embedding-3-small) and returns the resulting numerical vector. Using netflix_df.iloc[0]['description'], we extract the first movie description from the DataFrame, convert it to a string, and pass it to the embedding function.

Python:

def get_embedding(text, model="text-embedding-3-small"):

    text = text.replace("\n", "")

    return client.embeddings.create(input=[text], model=model).data[0].embedding

# Pass the first value in the Series directly (first item as a string)

result_embedding = get_embedding(str(netflix_df.iloc[0]['description']))

print(result_embedding) # Check the embeddings

The result_embedding will be a multi-dimensional numerical vector that represents the semantic meaning of the first movie’s description. You can trust me—there are 1536 values in this vector. Each of these numbers captures a different aspect of the movie’s storyline, themes, and context as understood by the embedding model.
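
You can verify the dimensionality yourself with a quick check:

Python:

print(len(result_embedding))  # 1536 dimensions for text-embedding-3-small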

Let’s apply the get_embedding() function to the entire description column to convert all movie descriptions into their respective vector representations. We’re adding a new column called “embedding” to the DataFrame, which will store these vector representations for each movie’s description. This allows us to easily compare and query these embeddings later on for similarity searches.

Python:

netflix_df["embedding"] = netflix_df["description"].apply(get_embedding)

netflix_df.head(5)
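
A note on efficiency: .apply() sends one API request per row, which is slow and costly for 2,000 descriptions. The embeddings endpoint also accepts a list of inputs, so the calls can be batched. The sketch below is an optional optimization; the batch size of 100 is an arbitrary assumption.

Python:

# Optional optimization (sketch): embed descriptions in batches instead of one call per row.
def get_embeddings_batched(texts, model="text-embedding-3-small", batch_size=100):
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = [t.replace("\n", "") for t in texts[start:start + batch_size]]
        response = client.embeddings.create(input=batch, model=model)
        # Embeddings come back in the same order as the inputs
        embeddings.extend(item.embedding for item in response.data)
    return embeddings

netflix_df["embedding"] = get_embeddings_batched(netflix_df["description"].astype(str).tolist())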

We now generate an embedding for the user’s query. Vectors that are close together in semantic space represent similar concepts, so the descriptions whose embeddings are most similar to the query embedding will be retrieved and used as context for the generative model. The user won’t see this raw retrieved content, but it guides the model to align its response with the query intent.

Python:

embedding_user_query = get_embedding(user_query)

len(embedding_user_query)

It also returns 1536, matching the model’s embedding dimensions. This embedding will be compared to the movie embeddings in netflix_df["embedding"] to find the most relevant match.

Next, we calculate the similarity between the user’s query embedding and each movie’s embedding in our dataset using cosine similarity, which measures how closely related two vectors are in terms of their semantic meaning.

We define a function called calculate_similarity, which takes a single movie embedding as input and computes the dot product between this embedding and the user query’s embedding. Because OpenAI embeddings are normalized to unit length, this dot product is equivalent to cosine similarity and effectively measures how closely the movie description is related to the user’s query.

We then apply this function to all movie embeddings in our DataFrame (netflix_df["embedding"]) to compute their similarity scores. These scores are stored in a new column called distance. A higher score indicates that the movie’s description is semantically closer to the user’s query.

Finally, we sort the DataFrame by the similarity scores in descending order, so the most relevant movies appear first. The top 5 results are then displayed, showing the movies most closely aligned with the user’s query.

Python:

import numpy as np

def calculate_similarity(page_embedding):

    """Calculates the similarity between the user query embedding and the page embedding."""

    return np.dot(page_embedding, embedding_user_query)

netflix_df["distance"] = netflix_df["embedding"].apply(calculate_similarity)

netflix_df.sort_values("distance", ascending=False, inplace=True)

netflix_df.head(5)
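
If you ever work with embeddings that are not unit-normalized, a hedged drop-in alternative to calculate_similarity is to divide the dot product by the vector norms explicitly:

Python:

# Sketch: explicit cosine similarity, a drop-in alternative to calculate_similarity
# for embeddings that are not unit-normalized.
def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))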

Python:

netflix_df.iloc[0].description

Our system identified this movie as one of the most relevant to our user query, based on semantic similarity.

Next, we’re combining the descriptions of the top 3 most relevant movies into a single string, separated by double newlines. This aggregated context serves as the background information for the LLM, allowing it to generate responses informed by the most semantically relevant movie summaries.

Python:

top_3_movies = netflix_df.head(3)

context = "\n\n".join(top_3_movies["description"].values)

Now, things are getting serious… We’re sending the user’s query, along with the relevant context, to the LLM to generate a response. The system prompt defined earlier and the user query are combined with the top 3 most contextually similar movie descriptions as input. This allows the assistant to provide an answer informed by the most relevant information extracted through similarity calculations.

Python:

response = client.chat.completions.create(

    model="gpt-3.5-turbo",

    messages=[

        {"role": "user", "content": user_query},

        {"role": "system", "content": system_prompt},

        {"role": "assistant", "content": f"Please use this context when answering the question: {context}"}

    ],

    temperature=0.7

)

print(response.choices[0].message.content)

The model’s output effectively incorporates the provided context by aligning its response with the details of the user query, system prompt, and specified contextual information.

 

Conclusion

By combining generative AI with real-time retrieval mechanisms, the RAG architecture overcomes the drawbacks of conventional search tools and static language models. This approach provides a powerful, scalable solution for delivering accurate and timely responses.
