Future of Retrieval-Augmented Generation (RAG), Will It be Replaced?

Cindy X. L.
Mar 4, 2024 · 7 min read


If you’re a heavy user of ChatGPT, you may have noticed the many limitations of today’s AI products. These limitations have led to a solution called Retrieval-Augmented Generation (RAG), which is drawing more and more attention in Silicon Valley.

In this article, I’m going to introduce what RAG is and what researchers and startup founders say about its future.

LLM Limitations

  • Hallucination: LLMs can generate incorrect or fabricated information, affecting their reliability.
  • Out-of-Date Information: Since LLMs are trained on static datasets, they cannot provide information on events occurring after their training date.
  • Lack of Domain Knowledge: LLMs may not have a deep understanding of specialized domains, leading to oversimplified or inaccurate answers.
  • Processing Confidential Data: Using LLMs is challenging for tasks involving confidential company data or personal information, since such data often cannot be sent to an external model.

All of these indicate the need to incorporate additional information into a product built on top of a pre-trained large language model.

Technical Explanation

RAG was first introduced in a 2020 paper by Facebook AI Research, with the goal of improving performance on knowledge-intensive NLP tasks.

A simple example

Suppose you have a list of tourist recommendations that you want an LLM to be able to talk about (source). Here, we view each line as a text document.

“Take a leisurely walk in the park and enjoy the fresh air.”,
“Visit a local museum and discover something new.”,
“Attend a live music concert and feel the rhythm.”,
“Go for a hike and admire the natural scenery.”,
“Have a picnic with friends and share some laughs.”,
“Explore a new cuisine by dining at an ethnic restaurant.”…

When a user inputs into the system:

user_input = “I like to hike.”

The system can search through the list and retrieve the most relevant document, which we define here as the line that shares the most words with the user input.

relevant_document = “Go for a hike and admire the natural scenery.”
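
As a minimal sketch of that retrieval step (plain Python, treating similarity as simple word overlap as described above; nothing here is a specific library the article prescribes):

import re

# The corpus from above: each line is treated as one text document.
corpus_of_documents = [
    "Take a leisurely walk in the park and enjoy the fresh air.",
    "Visit a local museum and discover something new.",
    "Attend a live music concert and feel the rhythm.",
    "Go for a hike and admire the natural scenery.",
    "Have a picnic with friends and share some laughs.",
    "Explore a new cuisine by dining at an ethnic restaurant.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, documents: list[str]) -> str:
    """Return the document that shares the most words with the query."""
    query_words = tokenize(query)
    return max(documents, key=lambda doc: len(query_words & tokenize(doc)))

user_input = "I like to hike."
relevant_document = retrieve(user_input, corpus_of_documents)
print(relevant_document)  # "Go for a hike and admire the natural scenery."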

Then we build the text prompt by filling in the retrieved document and the user input:

You are a bot that makes recommendations for activities. You answer in very short sentences and do not include extra information.
This is the recommended activity: {relevant_document}
The user input is: {user_input}
Compile a recommendation to the user based on the recommended activity and the user input.

Now the system can take the new prompt to the LLM, which will return some kind of response:

“Great! Based on your interest in hiking, I recommend checking out the nearby trails for a challenging and rewarding experience with breathtaking views.”
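
Putting the last step together is just string formatting plus a model call. Here is a sketch; llm_generate below is a stand-in I made up for whichever LLM API or local model you actually use, not a real library function:

prompt_template = """You are a bot that makes recommendations for activities. You answer in very short sentences and do not include extra information.
This is the recommended activity: {relevant_document}
The user input is: {user_input}
Compile a recommendation to the user based on the recommended activity and the user input."""

# Values from the retrieval step above, repeated so this snippet runs on its own.
user_input = "I like to hike."
relevant_document = "Go for a hike and admire the natural scenery."

def llm_generate(prompt: str) -> str:
    # Stand-in for a real model call (an API client, a local model, etc.).
    return "Based on your interest in hiking, I recommend checking out the nearby trails."

prompt = prompt_template.format(
    relevant_document=relevant_document,
    user_input=user_input,
)
response = llm_generate(prompt)
print(response)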

Bam! You get the fundamental idea of RAG.

Potential improvements

This is the most simplified version; in practice, there are a few potential improvements (see the sketch after this list):

  1. Store the documents in a vector store, chosen based on needs for scalability, cloud support, and multimodal support.
  2. Encode the documents and the user input as embeddings, i.e., vectors in a high-dimensional semantic space, rather than matching raw words.
  3. Switch the similarity measure for those vectors accordingly, e.g., to cosine similarity.
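
As an example of points 2 and 3, here is a sketch that swaps word overlap for embeddings and cosine similarity. The sentence-transformers library and the all-MiniLM-L6-v2 model are my choices for illustration, not something the original example requires; the corpus and user input are repeated from above so the snippet is self-contained.

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus_of_documents = [
    "Take a leisurely walk in the park and enjoy the fresh air.",
    "Visit a local museum and discover something new.",
    "Attend a live music concert and feel the rhythm.",
    "Go for a hike and admire the natural scenery.",
    "Have a picnic with friends and share some laughs.",
    "Explore a new cuisine by dining at an ethnic restaurant.",
]
user_input = "I like to hike."

# Encode documents and query into the same high-dimensional vector space.
doc_embeddings = model.encode(corpus_of_documents)  # shape: (n_docs, dim)
query_embedding = model.encode(user_input)          # shape: (dim,)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine_similarity(query_embedding, emb) for emb in doc_embeddings]
relevant_document = corpus_of_documents[int(np.argmax(scores))]
print(relevant_document)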

There are also advanced RAG pipelines with more modules and different components.

Source: https://arxiv.org/pdf/2312.10997.pdf

Real World Scenarios

RAG has gained wide attention since the summer of 2023.

Google Search Trends for RAG

Big tech companies, such as Google, AWS, IBM, Microsoft, and NVIDIA, have started providing tools to help their clients build RAG applications.

Kari Briski, Vice President of AI Software at Nvidia, says: “Expect to hear a lot more about retrieval-augmented generation as enterprises embrace these AI frameworks in 2024.”

Imagine using RAG in these real-world scenarios, some of which are already a reality:

  • Educational Tools: Educational platforms use RAG to generate study materials and create detailed explanations of complex subjects. Example: MathGPTPro.
  • Customized Chatbots: RAG-powered chatbots and virtual assistants provide more accurate and contextually appropriate responses to user queries. Example: CastChat.
  • Interactive Entertainment and Gaming: RAG is used to generate dynamic narratives, dialogues, and storylines based on player choices and actions. Example: Inworld AI.

CastChat, available on the Google Play Store

When to choose RAG

A 2024 article concludes that RAG best fits scenarios that require lots of external knowledge but little model adaptation.

Siva Surendira from Lyzr.ai tweeted: “RAG suits long-term production apps, while large prompts can be used in one-off scenarios.”

Richard Zhang, CSO of Seeed Studio, points out “the 3 common ways to personalize large models’ outputs: prompt engineering, RAG, and fine-tuning. RAG is the best solution for adding domain-specific knowledge to LLM whereas finetuning is good for complex prompts or special output tones. In some cases, a hybrid mode is preferred.”

Startup Experiences

Many startups have adopted RAG in their AI products, while others have decided not to.

Srijan Kumar, Professor at GeorgiaTech and the founder of Lighthouz AI, tweeted: “RAG reduces but does not solve hallucinations. In a finance assistant use case, my startup showed that an RAG system gives ‘correct and complete’ responses only 57% of the time even to simple queries.”

Richard Zhang, in response to Srijan’s point, emphasizes that RAG is still the most effective way to reduce hallucinations. He shares with me that improving the satisfaction rate from 60% to, say, 90% goes beyond the technology itself. A team needs to have a good understanding of the business logic and clean the data properly before optimizing RAG. It’s a challenging, long-term process.

Sekai is a mobile app that enables users to create and explore personalized “what if” stories with AI characters, fostering a global storytelling community. The founder shared with me that their product didn’t adopt RAG because the context window has been long enough for their use case, with an estimated 10,000 tokens per conversation.

Sekai, available on the Apple App Store

Discussions

RAG vs. Long Context Window

A hotly debated topic these days is whether RAG will become obsolete as context windows get longer. Because of memory and compute limitations, a large language model can only attend to a limited amount of recent input; that maximum length is called the context window. LLM context windows have been growing steadily longer.

Following Google’s Gemini 1.5 release, with a context window of up to one million tokens, many folks declared that RAG would be dead because the additional information can now simply be placed in the input prompt. Many well-respected members of the research community disagreed:

Oriol Vinyals, Gemini Co-lead at Google DeepMind, wrote: “RAG has some nice properties that can enhance, and be enhanced by, long context” which reflects his belief that RAG and long context will complement each other in the future.

Siva Surendira from Lyzr.ai points out that RAG is cheaper and faster.

Yao Fu, a CS PhD student at the University of Edinburgh, defends the potential of long-context models over RAG, arguing that despite higher costs and current limitations, they offer superior on-the-fly retrieval and reasoning capabilities. He suggests that language models directly accessing vast databases, like Google’s search index, might be a future advancement.

Jim Fan from Nvidia envisions some hybrid method that combines retrieval with a long context window, perhaps “some form of spreading neural activations across a giant unstructured database.”

Elvis Saravia from DAIR.ai also acknowledges the effectiveness of combining RAG and long-context LLMs for a robust system. He states that “different families of LLMs will help solve different types of problems. We need to move on from this idea that there will be one LLM that will rule all.”

RAG Alternatives

James Nguye’s article mentions a few of RAG’s long-term limitations. For example, retrieval, augmentation, and generation are handled as independent steps, so the model doing retrieval might not understand the user’s intent in the same way as the model doing generation. If the retrieval phase yields irrelevant results due to poorly chosen search queries, the system may end up making up a response, i.e., hallucinating again.

Alternatively, he proposes developing an intelligent agent such that, “when the agent needs to find knowledge it doesn’t possess, it formulates a search query and signals the search engine to retrieve the required answer.”
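
A rough sketch of that agent loop might look like the following. Both llm_generate and search_engine below are hypothetical placeholders I made up for illustration, not real APIs from any particular library:

def llm_generate(prompt: str) -> str:
    # Hypothetical placeholder for whichever LLM you call.
    return "I don't know."

def search_engine(query: str) -> str:
    # Hypothetical placeholder for a web or internal search call.
    return f"(search results for: {query})"

def agent_answer(question: str) -> str:
    # 1. Let the model try to answer from its own knowledge first.
    draft = llm_generate(question)
    if "i don't know" not in draft.lower():
        return draft
    # 2. The model formulates its own search query when knowledge is missing...
    search_query = llm_generate(f"Write a search query that would answer: {question}")
    # 3. ...the search engine retrieves, and the model answers from the results.
    results = search_engine(search_query)
    return llm_generate(f"Answer the question '{question}' using:\n{results}")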

Ethan He from Nvidia built ActGPT with a somewhat similar idea a year ago, where an agent retrieves up-to-date information from Google and performs tasks like writing a marketing post on Twitter.

As a contributor to the project, I think it mitigates the hallucination and data-freshness problems, but it cannot process confidential data.

What alternative solutions will thrive afterward? Ethan wants to explore the Mixture of Experts (MoE) approach, an architecture that routes inputs through multiple expert modules and combines their outputs through a gating network.

pub.towardsai.net/the-mixture-of-experts-moe-an-easy-tutorial-with-python-pytorch-coding-778065191853
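
To make the gating idea concrete, here is a toy PyTorch sketch of a mixture-of-experts layer: a small gate network scores each expert, and the output is the gate-weighted mix of the expert outputs. This is illustrative only, not how any production MoE LLM is implemented.

import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Toy mixture-of-experts layer: gate-weighted combination of expert outputs."""

    def __init__(self, dim: int = 16, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)  # one score per expert

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)                                 # (batch, num_experts)
        expert_outputs = torch.stack([expert(x) for expert in self.experts], dim=1)   # (batch, num_experts, dim)
        return torch.einsum("be,bed->bd", weights, expert_outputs)                    # gated result

x = torch.randn(2, 16)
print(ToyMoE()(x).shape)  # torch.Size([2, 16])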

Conclusion

Judging from research outcomes and startup practices, RAG is the most effective approach so far and will not be replaced in the next five years. In the long term, smarter architectures may evolve and eventually outperform it.


Cindy X. L.

Tech influencer (150k on Weibo), Columbia alum. This is my tiny corner to write about AI, China tech, and creator economy. Views are my own.