RAG vocabulary

Now is a good time to review some vocabulary that will help you become familiar with the various concepts in RAG. In the following subsections, we will cover LLMs, prompting concepts, inference, context windows, fine-tuning approaches, vector stores and databases, and vectors/embeddings. This is not an exhaustive list, but understanding these core concepts will make everything else we teach you about RAG easier to follow.

LLM

Most of this book deals with LLMs, the generative AI models that focus on generating text. We will keep things simple by concentrating on LLMs because they are the type of model that most RAG pipelines use. However, while we focus primarily on LLMs, RAG can also be applied to other types of generative models, such as those for images, audio, and video. We will cover these other types of models and how they are used in RAG in Chapter 14.

Some popular examples of LLMs are the OpenAI ChatGPT models, the Meta Llama models, Google’s Gemini models, and Anthropic’s Claude models.
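
To make the term concrete, here is a minimal sketch of what calling an LLM looks like in code, using the OpenAI Python SDK as an example. The model name and prompt are illustrative assumptions, not something this book prescribes.

```python
# Minimal sketch: sending a prompt to an LLM and reading back its generated text.
# Assumes the openai package is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": "In one sentence, what is retrieval-augmented generation?"}],
)

print(response.choices[0].message.content)
```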

Prompting, prompt design, and prompt engineering

These terms are sometimes used interchangeably, but technically, while they all have to do with prompting, they do have different meanings:

  • Prompting is the act of sending a query or prompt to an LLM.
  • Prompt design refers to the strategy you implement to design the prompt you will send to the LLM. Many different prompt design strategies work in different scenarios. We will review many of these in Chapter 13.
  • Prompt engineering focuses more on the technical aspects of the prompt that you use to improve the outputs from the LLM. For example, you may break a complex query into two or three separate LLM interactions, engineering the exchange to achieve better results (a minimal sketch of this follows the list). We will also review prompt engineering in Chapter 13.
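
To illustrate the idea of splitting one request into multiple engineered interactions, here is a hedged sketch. The ask() helper and the two-step outline/answer split are hypothetical choices made for illustration, not a prescribed pattern from this book.

```python
# Sketch: breaking a complex question into two LLM interactions.
# ask() is a hypothetical helper that wraps an LLM client and returns the generated text.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send one prompt to the LLM and return the generated text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

question = "Explain how our returns policy applies to items bought on sale."

# Step 1: ask the model to plan what it needs to cover.
outline = ask(f"List the key points needed to answer this question:\n{question}")

# Step 2: ask the model to answer, guided by its own outline.
answer = ask(f"Using these points:\n{outline}\n\nWrite a clear answer to:\n{question}")
print(answer)
```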

LangChain and LlamaIndex

This book will focus on using LangChain as the framework for building our RAG pipelines. LangChain is an open source framework that supports not just RAG but any development effort that uses LLMs within a pipeline approach. With over 15 million monthly downloads, LangChain is the most popular generative AI development framework. It supports RAG particularly well, providing a modular and flexible set of tools that make RAG development significantly more efficient than building everything without a framework.
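
As a taste of that pipeline approach, here is a minimal sketch of chaining a prompt template, a model, and an output parser with LangChain's expression syntax; the specific model and prompt are assumptions made for illustration.

```python
# Minimal sketch of LangChain's pipeline style: prompt -> model -> output parser.
# Assumes langchain and langchain-openai are installed and OPENAI_API_KEY is set.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Define the term '{term}' in one sentence.")
llm = ChatOpenAI(model="gpt-4o-mini")  # illustrative model choice
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"term": "retrieval-augmented generation"}))
```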

While LangChain is currently the most popular framework for developing RAG pipelines, LlamaIndex is a leading alternative to LangChain, with similar capabilities in general. LlamaIndex is known for its focus on search and retrieval tasks and may be a good option if you require advanced search or need to handle large datasets.

Many other options focus on various niches. Once you are familiar with building RAG pipelines, be sure to look at some of the other options to see whether another framework works better for your particular project.

Inference

We will use the term inference from time to time. Generally, this refers to the process of a pre-trained LLM generating outputs or predictions based on the inputs it is given. For example, when you ask ChatGPT a question, the process it goes through to provide you with a response is called inference.

Context window

A context window, in the context of LLMs, refers to the maximum number of tokens (words, sub-words, or characters) that the model can process in a single pass. It determines the amount of text the model can see or attend to at once when making predictions or generating responses.

The context window size is a key parameter of the model architecture and is typically fixed during model training. It directly relates to the input size of the model as it sets an upper limit on the number of tokens that can be fed into the model at a time.

For example, if a model has a context window size of 4,096 tokens, it means that the model can process and generate sequences of up to 4,096 tokens. When processing longer texts, such as documents or conversations, the input needs to be divided into smaller segments that fit within the context window. This is often done using techniques such as sliding windows or truncation.
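
As a concrete sketch of the truncation idea, the snippet below counts tokens and trims a document so that it fits within a window. The tiktoken encoding and the 4,096-token budget are assumptions chosen for illustration.

```python
# Sketch: counting tokens and truncating text to fit a fixed context window.
# Assumes the tiktoken package is installed; cl100k_base is an illustrative encoding.
import tiktoken

MAX_TOKENS = 4096  # illustrative context window budget

encoding = tiktoken.get_encoding("cl100k_base")

def fit_to_window(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Return the text truncated to at most max_tokens tokens."""
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])

document = "..."  # a long retrieved document would go here
print(len(encoding.encode(fit_to_window(document))))
```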

The size of the context window has implications for the model’s ability to understand and maintain long-range dependencies and context. Models with larger context windows can capture and utilize more contextual information when generating responses, which can lead to more coherent and contextually relevant outputs. However, increasing the context window size also increases the computational resources required to train and run the model.

In the context of RAG, the context window size is essential because it determines how much information from the retrieved documents can be effectively utilized by the model when generating the final response. Recent advancements in language models have led to the development of models with significantly larger context windows, enabling them to process and retain more information from the retrieved sources. Table 1.1 shows the context windows of many popular LLMs, both closed source and open source:

LLM                                  Context Window (Tokens)
ChatGPT-3.5 Turbo 0613 (OpenAI)      4,096
Llama 2 (Meta)                       4,096
Llama 3 (Meta)                       8,000
ChatGPT-4 (OpenAI)                   8,192
ChatGPT-3.5 Turbo 0125 (OpenAI)      16,385
ChatGPT-4.0-32k (OpenAI)             32,000
Mistral (Mistral AI)                 32,000
Mixtral (Mistral AI)                 32,000
DBRX (Databricks)                    32,000
Gemini 1.0 Pro (Google)              32,000
ChatGPT-4.0 Turbo (OpenAI)           128,000
ChatGPT-4o (OpenAI)                  128,000
Claude 2.1 (Anthropic)               200,000
Claude 3 (Anthropic)                 200,000
Gemini 1.5 Pro (Google)              1,000,000

Table 1.1 – Different context windows for LLMs

Figure 1.1, which is based on Table 1.1, shows how much larger Gemini 1.5 Pro's context window is than the others.

Figure 1.1 – Different context windows for LLMs

Note that the models in Figure 1.1 generally age from right to left, meaning that the older models tended to have smaller context windows, while the newest models have larger ones. This trend is likely to continue, pushing the typical context window larger as time progresses.

Fine-tuning – full-model fine-tuning (FMFT) and parameter-efficient fine-tuning (PEFT)

FMFT is where you take a foundation model and train it further to gain new capabilities. You could simply give it new knowledge for a specific domain, or you could give it a new skill, such as being a conversational chatbot. FMFT updates all of the model's parameters (its weights and biases).

PEFT, on the other hand, is a type of fine-tuning where you update only a small, targeted subset of the model's parameters, but with a similar goal to full fine-tuning. The latest research in this area shows that you can achieve results similar to FMFT with far less cost, time, and data.
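
As a hedged illustration of what PEFT can look like in practice, the sketch below wraps a base model with a LoRA adapter using the Hugging Face peft library. The base model name, target modules, and rank are assumptions made for illustration, not a technique this book walks through here.

```python
# Sketch: parameter-efficient fine-tuning (PEFT) via a LoRA adapter.
# Assumes the transformers and peft packages are installed; model choice is illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative base model

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # which layers get adapters (gpt2's attention projection)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
# Only the small adapter weights are trainable; the base model stays frozen.
model.print_trainable_parameters()
```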

While this book does not focus on fine-tuning, using a model fine-tuned on your data is a perfectly valid strategy for giving it more knowledge of your domain or more of your domain's voice. For example, if you are working in a scientific field, you could train it to talk more like a scientist than a generic foundation model does. Alternatively, if you are developing in a legal field, you may want it to sound more like a lawyer.

Fine-tuning also helps the LLM understand your company's data, making it more effective at generating responses during the RAG process. For example, if you have a scientific company, you might fine-tune a model with scientific information and use it in a RAG application that summarizes your research. This may improve your RAG application's output (the summaries of your research) because your fine-tuned model understands your data better and can provide a more effective summary.

Vector store or vector database?

Both! All vector databases are vector stores, but not all vector stores are vector databases. OK, while you get out your chalkboard to draw a Venn diagram, I will continue to explain this statement.

There are ways to store vectors that are not full databases; they are simply storage mechanisms for vectors. So, to encompass all possible ways to store vectors, LangChain calls them all vector stores. Let's do the same! Just know that not all of the vector stores that LangChain connects with are officially considered vector databases. In general, though, most of them are, and many people refer to all of them as vector databases, even when they are not technically full databases from a functionality standpoint. Phew – glad we cleared that up!
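
To ground the term, here is a minimal sketch of creating an in-memory vector store with LangChain and querying it by similarity. FAISS and the OpenAI embedding model are assumptions chosen for illustration rather than recommendations from this chapter.

```python
# Sketch: storing a few texts as vectors and retrieving the most similar one.
# Assumes langchain-community, langchain-openai, and faiss-cpu are installed,
# and OPENAI_API_KEY is set in the environment.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

texts = [
    "RAG combines retrieval with generation.",
    "Vectors represent text numerically.",
    "LangChain provides tools for building LLM pipelines.",
]

vector_store = FAISS.from_texts(texts, OpenAIEmbeddings())

results = vector_store.similarity_search("How is text represented as numbers?", k=1)
print(results[0].page_content)
```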

Vectors, vectors, vectors!

A vector is a mathematical representation of your data. Vectors are often referred to as embeddings when talking specifically about natural language processing (NLP) and LLMs. Vectors are one of the most important concepts to understand, and many different parts of a RAG pipeline utilize them.
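
As a small, hedged illustration of vectors as embeddings, the sketch below embeds two sentences and compares them with cosine similarity. The embedding model name is an assumption made for illustration; Chapters 7 and 8 cover this properly.

```python
# Sketch: turning text into vectors (embeddings) and comparing them.
# Assumes langchain-openai and numpy are installed and OPENAI_API_KEY is set.
import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # illustrative model

vec_a = np.array(embeddings.embed_query("The cat sat on the mat."))
vec_b = np.array(embeddings.embed_query("A cat is resting on a rug."))

# Cosine similarity: 1.0 means identical direction, values near 0 mean unrelated.
similarity = vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"Cosine similarity: {similarity:.3f}")
```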

We just covered many key vocabulary terms that will be important for understanding the rest of this book. Many of these concepts will be expanded upon in future chapters. In the next section, we will discuss vectors in more depth, and beyond that, Chapters 7 and 8 go over vectors and how they are used to find similar content.
