RAG vocabulary
Now is as good a time as any to review some vocabulary that will help you become familiar with the various concepts in RAG. In the following subsections, we will cover LLMs, prompting concepts, inference, context windows, fine-tuning approaches, vector databases, and vectors/embeddings. This is not an exhaustive list, but understanding these core concepts will make everything else we teach you about RAG easier to follow.
LLM
Most of this book will deal with LLMs, the generative AI models that focus on generating text and that sit at the heart of most RAG pipelines. To keep things simple, we will concentrate on LLMs, but we would like to clarify that RAG can also be applied to other types of generative models, such as those for images, audio, and video. We will cover these other types of models and how they are used in RAG in Chapter 14.
Some popular examples of LLMs are the OpenAI ChatGPT models, the Meta Llama models, Google’s Gemini models, and Anthropic’s Claude models.
Prompting, prompt design, and prompt engineering
These terms are sometimes used interchangeably, but technically, while they all have to do with prompting, they do have different meanings:
- Prompting is the act of sending a query or prompt to an LLM.
- Prompt design refers to the strategy you implement to design the prompt you will send to the LLM. Many different prompt design strategies work in different scenarios. We will review many of these in Chapter 13.
- Prompt engineering focuses more on the technical aspects surrounding the prompt that you use to improve the outputs from the LLM. For example, you may break a complex query into two or three separate LLM interactions, engineering the exchange to achieve superior results (see the sketch after this list). We will also review prompt engineering in Chapter 13.
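To make the distinction concrete, here is a minimal sketch contrasting a single designed prompt with a simple engineered, two-step flow. It assumes the `openai` Python client and an API key in your environment; the model name and prompts are illustrative only:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def ask(prompt: str) -> str:
    # Prompting: send a single query to the LLM and return its text response.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Prompt design: shape one prompt with a role, a task, and an output format.
designed_prompt = (
    "You are a concise technical assistant.\n"
    "Answer in two sentences: What is retrieval-augmented generation?"
)
print(ask(designed_prompt))

# Prompt engineering: break a complex task into two separate LLM interactions.
outline = ask("List three key components of a RAG pipeline, one per line.")
summary = ask(f"Write one paragraph explaining how these components work together:\n{outline}")
print(summary)
```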
LangChain and LlamaIndex
This book will focus on using LangChain as the framework for building our RAG pipelines. LangChain is an open source framework that supports not just RAG but any development effort that uses LLMs within a pipeline approach. With over 15 million monthly downloads, LangChain is the most popular generative AI development framework. It supports RAG particularly well, providing a modular and flexible set of tools that make RAG development significantly more efficient than working without a framework.
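To give you a flavor of what LangChain code looks like, here is a minimal sketch of a prompt-plus-model chain. It assumes the `langchain-openai` and `langchain-core` packages and an OpenAI API key; exact package and class names can vary across LangChain versions:

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Define a prompt template with placeholders the chain will fill in at runtime.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using the provided context.\n"
    "Context: {context}\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini")  # illustrative model name

# The pipe syntax composes the prompt, the model, and an output parser into one chain.
chain = prompt | llm | StrOutputParser()

answer = chain.invoke({
    "context": "RAG retrieves relevant documents and passes them to an LLM.",
    "question": "What does RAG do?",
})
print(answer)
```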
While LangChain is currently the most popular framework for developing RAG pipelines, LlamaIndex is a leading alternative to LangChain, with similar capabilities in general. LlamaIndex is known for its focus on search and retrieval tasks and may be a good option if you require advanced search or need to handle large datasets.
Many other options focus on various niches. Once you have become familiar with building RAG pipelines, be sure to look at some of the other options to see whether another framework works better for your particular project.
Inference
We will use the term inference from time to time. Generally, this refers to the process by which a pre-trained model generates outputs or predictions based on the inputs it is given. For example, when you ask ChatGPT a question, the process it goes through to produce a response is called inference.
Context window
A context window, in the context of LLMs, refers to the maximum number of tokens (words, sub-words, or characters) that the model can process in a single pass. It determines the amount of text the model can see or attend to at once when making predictions or generating responses.
The context window size is a key parameter of the model architecture and is typically fixed during model training. It directly relates to the input size of the model as it sets an upper limit on the number of tokens that can be fed into the model at a time.
For example, if a model has a context window size of 4,096 tokens, it means that the model can process and generate sequences of up to 4,096 tokens. When processing longer texts, such as documents or conversations, the input needs to be divided into smaller segments that fit within the context window. This is often done using techniques such as sliding windows or truncation.
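As a simple illustration of fitting text into a fixed context window, here is a sketch that uses OpenAI's `tiktoken` tokenizer to count tokens and truncate an overlong input; the encoding name and the 4,096-token window size are assumptions for this example:

```python
import tiktoken

# cl100k_base is the encoding used by several OpenAI chat models.
encoding = tiktoken.get_encoding("cl100k_base")

def truncate_to_window(text: str, max_tokens: int = 4096) -> str:
    # Encode the text into tokens, keep only what fits, and decode back to text.
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])

long_document = "RAG retrieves relevant passages for the model. " * 2000
print(len(encoding.encode(long_document)))                       # total tokens in the input
print(len(encoding.encode(truncate_to_window(long_document))))   # capped at 4,096
```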
The size of the context window has implications for the model’s ability to understand and maintain long-range dependencies and context. Models with larger context windows can capture and utilize more contextual information when generating responses, which can lead to more coherent and contextually relevant outputs. However, increasing the context window size also increases the computational resources required to train and run the model.
In the context of RAG, the context window size is important because it determines how much information from the retrieved documents can be effectively utilized by the model when generating the final response. Recent advancements in language models have led to models with significantly larger context windows, enabling them to process and retain more information from the retrieved sources. Table 1.1 lists the context windows of many popular LLMs, both closed and open source:
| LLM | Context Window (Tokens) |
| --- | --- |
| ChatGPT-3.5 Turbo 0613 (OpenAI) | 4,096 |
| Llama 2 (Meta) | 4,096 |
| Llama 3 (Meta) | 8,000 |
| ChatGPT-4 (OpenAI) | 8,192 |
| ChatGPT-3.5 Turbo 0125 (OpenAI) | 16,385 |
| ChatGPT-4.0-32k (OpenAI) | 32,000 |
| Mistral (Mistral AI) | 32,000 |
| Mixtral (Mistral AI) | 32,000 |
| DBRX (Databricks) | 32,000 |
| Gemini 1.0 Pro (Google) | 32,000 |
| ChatGPT-4.0 Turbo (OpenAI) | 128,000 |
| ChatGPT-4o (OpenAI) | 128,000 |
| Claude 2.1 (Anthropic) | 200,000 |
| Claude 3 (Anthropic) | 200,000 |
| Gemini 1.5 Pro (Google) | 1,000,000 |
Table 1.1 – Different context windows for LLMs
Figure 1.1, which is based on Table 1.1, shows that Gemini 1.5 Pro's context window is far larger than the others.

Figure 1.1 – Different context windows for LLMs
Note that Figure 1.1 also roughly traces the models' ages: older models tended to have smaller context windows, while the newest models have the largest. This trend is likely to continue, pushing the typical context window larger as time progresses.
Fine-tuning – full-model fine-tuning (FMFT) and parameter-efficient fine-tuning (PEFT)
FMFT is where you take a foundation model and train it further to give it new capabilities. You could simply give it new knowledge for a specific domain, or you could give it a skill, such as acting as a conversational chatbot. FMFT updates all of the model's parameters (its weights and biases).
PEFT, on the other hand, is a type of fine-tuning in which you update only a small subset of the model's parameters (or a small set of added parameters), but with the same goal as full fine-tuning. The latest research in this area shows that you can achieve results similar to FMFT with far less cost, time, and data.
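As a rough sketch of what PEFT looks like in practice, the following uses the Hugging Face `peft` library to wrap a base model with LoRA adapters so that only a small set of added parameters is trained; the checkpoint name and LoRA hyperparameters are illustrative assumptions:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Load a base (foundation) model; the checkpoint name is illustrative.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA injects small trainable matrices into selected attention projections.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                  # rank of the adapter matrices
    lora_alpha=16,        # scaling factor applied to the adapters
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which layers to adapt (model-dependent)
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of all parameters
```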
While this book does not focus on fine-tuning, it is a very valid strategy to try to use a model fine-tuned with your data to give it more knowledge from your domain or to give it more of a voice from your domain. For example, you could train it to talk more like a scientist than a generic foundation model, if you’re using this in a scientific field. Alternatively, if you are developing in a legal field, you may want it to sound more like a lawyer.
Fine-tuning also helps the LLM to understand your company’s data better, making it better at generating an effective response during the RAG process. For example, if you have a scientific company, you might fine-tune a model with scientific information and use it for a RAG application that summarizes your research. This may improve your RAG application’s output (the summaries of your research) because your fine-tuned model understands your data better and can provide a more effective summary.
Vector store or vector database?
Both! All vector databases are vector stores, but not all vector stores are vector databases. OK, while you get out your chalkboard to draw a Venn diagram, I will continue to explain this statement.
There are ways to store vectors that are not full databases; they are simply storage mechanisms for vectors. So, to encompass all possible ways to store vectors, LangChain calls them all vector stores, and we will do the same. Just know that not all of the vector stores that LangChain connects with are officially considered vector databases. In general, though, most of them are, and many people refer to all of them as vector databases, even when they are not technically full databases from a functionality standpoint. Phew – glad we cleared that up!
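To show how a vector store looks in code, here is a minimal LangChain sketch that embeds a few texts into an in-memory FAISS vector store and runs a similarity search. It assumes the `langchain-community`, `langchain-openai`, and `faiss-cpu` packages and an OpenAI API key; any other LangChain-supported vector store could stand in for FAISS:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

texts = [
    "RAG combines retrieval with generation.",
    "A vector store holds embeddings for similarity search.",
    "LangChain provides a common interface to many vector stores.",
]

# Embed the texts and load them into an in-memory FAISS index.
embeddings = OpenAIEmbeddings()
vector_store = FAISS.from_texts(texts, embeddings)

# Retrieve the stored text most similar to the query.
results = vector_store.similarity_search("What does a vector store do?", k=1)
print(results[0].page_content)
```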
Vectors, vectors, vectors!
A vector is a mathematical representation of your data. Vectors are often referred to as embeddings when talking specifically about natural language processing (NLP) and LLMs. Vectors are one of the most important concepts to understand, and many different parts of a RAG pipeline utilize them.
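As a small, self-contained illustration of embeddings, the following uses the `sentence-transformers` library to turn two sentences into vectors and compare them with cosine similarity; the model name is an assumption for the example:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small open source embedding model

# Each sentence becomes a fixed-length vector (an embedding).
vec_a = model.encode("The cat sat on the mat.")
vec_b = model.encode("A cat was sitting on a rug.")

# Cosine similarity: values near 1.0 mean the sentences are semantically similar.
similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(round(float(similarity), 3))
```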
We just covered many of the key vocabulary terms you will need to understand the rest of this book. Many of these concepts will be expanded upon in future chapters. In the next section, we will discuss vectors in further depth, and beyond that, we will spend Chapters 7 and 8 going over vectors and how they are used to find similar content.