RAG vocabulary
Now is as good a time as any to review some vocabulary that will help you become familiar with the various concepts in RAG. In the following subsections, we will cover LLMs, prompting concepts, inference, context windows, fine-tuning approaches, vector databases, and vectors/embeddings. This is not an exhaustive list, but understanding these core concepts will make everything else we teach you about RAG easier to follow.
LLM
Most of this book will deal with LLMs, the generative AI models that focus on generating text and that sit at the heart of most RAG pipelines. To keep things simple, we will concentrate on LLMs, but we would like to clarify that RAG can also be applied to other types of generative models, such as those for images, audio, and video. We will cover these other types of models and how they are used in RAG in Chapter 14.
Some popular examples of LLMs are the OpenAI ChatGPT models, the Meta Llama models, Google’s Gemini models, and Anthropic’s Claude models.
Prompting, prompt design, and prompt engineering
These terms are sometimes used interchangeably, but technically, while they all have to do with prompting, they do have different meanings:
- Prompting is the act of sending a query or prompt to an LLM.
- Prompt design refers to the strategy you implement to design the prompt you will send to the LLM. Many different prompt design strategies work in different scenarios. We will review many of these in Chapter 13.
- Prompt engineering focuses more on the technical aspects surrounding the prompt that you use to improve the outputs from the LLM. For example, you may break a complex query into two or three separate LLM interactions, engineering the exchange to achieve superior results (see the sketch after this list). We will also review prompt engineering in Chapter 13.
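To make the distinction concrete, here is a minimal sketch contrasting a single designed prompt with a simple engineered, two-step flow. It assumes the `openai` Python client and an API key in your environment; the model name and prompts are illustrative only:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def ask(prompt: str) -> str:
    # Prompting: send a single query to the LLM and return its text response.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Prompt design: shape one prompt with a role, a task, and an output format.
designed_prompt = (
    "You are a concise technical assistant.\n"
    "Answer in two sentences: What is retrieval-augmented generation?"
)
print(ask(designed_prompt))

# Prompt engineering: break a complex task into two separate LLM interactions.
outline = ask("List three key components of a RAG pipeline, one per line.")
summary = ask(f"Write one paragraph explaining how these components work together:\n{outline}")
print(summary)
```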
LangChain and LlamaIndex
This book will focus on using LangChain as the framework for building our RAG pipelines. LangChain is an open source framework that supports not just RAG but any development effort that uses LLMs within a pipeline approach. With over 15 million monthly downloads, LangChain is the most popular generative AI development framework. It supports RAG particularly well, providing a modular and flexible set of tools that make RAG development significantly more efficient than working without a framework.
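To give you a flavor of what LangChain code looks like, here is a minimal sketch of a prompt-plus-model chain. It assumes the `langchain-openai` and `langchain-core` packages and an OpenAI API key; exact package and class names can vary across LangChain versions:

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Define a prompt template with placeholders the chain will fill in at runtime.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using the provided context.\n"
    "Context: {context}\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini")  # illustrative model name

# The pipe syntax composes the prompt, the model, and an output parser into one chain.
chain = prompt | llm | StrOutputParser()

answer = chain.invoke({
    "context": "RAG retrieves relevant documents and passes them to an LLM.",
    "question": "What does RAG do?",
})
print(answer)
```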
While LangChain is currently the most popular framework for developing RAG pipelines, LlamaIndex is a leading alternative to LangChain, with similar capabilities in general. LlamaIndex is known for its focus on search and retrieval tasks and may be a good option if you require advanced search or need to handle large datasets.
Many other options focus on various niches. Once you have become familiar with building RAG pipelines, be sure to look at some of the other options to see whether another framework works better for your particular project.
Inference
We will use the term inference from time to time. Generally, this refers to the process by which a pre-trained model generates outputs or predictions based on the inputs it is given. For example, when you ask ChatGPT a question, the process it goes through to produce a response is called inference.
Context window
A context window, in the context of LLMs, refers to the maximum number of tokens (words, sub-words, or characters) that the model can process in a single pass. It determines the amount of text the model can see or attend to at once when making predictions or generating responses.
The context window size is a key parameter of the model architecture and is typically fixed during model training. It directly relates to the input size of the model as it sets an upper limit on the number of tokens that can be fed into the model at a time.
For example, if a model has a context window size of 4,096 tokens, it means that the model can process and generate sequences of up to 4,096 tokens. When processing longer texts, such as documents or conversations, the input needs to be divided into smaller segments that fit within the context window. This is often done using techniques such as sliding windows or truncation.
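As a simple illustration of fitting text into a fixed context window, here is a sketch that uses OpenAI's `tiktoken` tokenizer to count tokens and truncate an overlong input; the encoding name and the 4,096-token window size are assumptions for this example:

```python
import tiktoken

# cl100k_base is the encoding used by several OpenAI chat models.
encoding = tiktoken.get_encoding("cl100k_base")

def truncate_to_window(text: str, max_tokens: int = 4096) -> str:
    # Encode the text into tokens, keep only what fits, and decode back to text.
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])

long_document = "RAG retrieves relevant passages for the model. " * 2000
print(len(encoding.encode(long_document)))                       # total tokens in the input
print(len(encoding.encode(truncate_to_window(long_document))))   # capped at 4,096
```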
The size of the context window has implications for the model’s ability to understand and maintain long-range dependencies and context. Models with larger context windows can capture and utilize more contextual information when generating responses, which can lead to more coherent and contextually relevant outputs. However, increasing the context window size also increases the computational resources required to train and run the model.
In the context of RAG, the context window size is important because it determines how much information from the retrieved documents can be effectively utilized by the model when generating the final response. Recent advancements in language models have led to models with significantly larger context windows, enabling them to process and retain more information from the retrieved sources. Table 1.1 lists the context windows of many popular LLMs, both closed and open source:
| LLM | Context Window (Tokens) |
| --- | --- |
| ChatGPT-3.5 Turbo 0613 (OpenAI) | 4,096 |
| Llama 2 (Meta) | 4,096 |
| Llama 3 (Meta) | 8,000 |
| ChatGPT-4 (OpenAI) | 8,192 |
| ChatGPT-3.5 Turbo 0125 (OpenAI) | 16,385 |
| ChatGPT-4.0-32k (OpenAI) | 32,000 |
| Mistral (Mistral AI) | 32,000 |
| Mixtral (Mistral AI) | 32,000 |
| DBRX (Databricks) | 32,000 |
| Gemini 1.0 Pro (Google) | 32,000 |
| ChatGPT-4.0 Turbo (OpenAI) | 128,000 |
| ChatGPT-4o (OpenAI) | 128,000 |
| Claude 2.1 (Anthropic) | 200,000 |
| Claude 3 (Anthropic) | 200,000 |
| Gemini 1.5 Pro (Google) | 1,000,000 |
Table 1.1 – Different context windows for LLMs
Figure 1.1, which is based on Table 1.1, shows that Gemini 1.5 Pro's context window is far larger than the others.

Figure 1.1 – Different context windows for LLMs
Note that Figure 1.1 also roughly traces the models' ages: older models tended to have smaller context windows, while the newest models have the largest. This trend is likely to continue, pushing the typical context window larger as time progresses.
Fine-tuning – full-model fine-tuning (FMFT) and parameter-efficient fine-tuning (PEFT)
FMFT is where you take a foundation model and train it further to give it new capabilities. You could simply give it new knowledge for a specific domain, or you could give it a skill, such as acting as a conversational chatbot. FMFT updates all of the model's parameters (its weights and biases).
PEFT, on the other hand, is a type of fine-tuning in which you update only a small subset of the model's parameters (or a small set of added parameters), but with the same goal as full fine-tuning. The latest research in this area shows that you can achieve results similar to FMFT with far less cost, time, and data.
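As a rough sketch of what PEFT looks like in practice, the following uses the Hugging Face `peft` library to wrap a base model with LoRA adapters so that only a small set of added parameters is trained; the checkpoint name and LoRA hyperparameters are illustrative assumptions:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Load a base (foundation) model; the checkpoint name is illustrative.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA injects small trainable matrices into selected attention projections.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                  # rank of the adapter matrices
    lora_alpha=16,        # scaling factor applied to the adapters
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which layers to adapt (model-dependent)
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of all parameters
```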
While this book does not focus on fine-tuning, it is a very valid strategy to try to use a model fine-tuned with your data to give it more knowledge from your domain or to give it more of a voice from your domain. For example, you could train it to talk more like a scientist than a generic foundation model, if you’re using this in a scientific field. Alternatively, if you are developing in a legal field, you may want it to sound more like a lawyer.
Fine-tuning also helps the LLM to understand your company’s data better, making it better at generating an effective response during the RAG process. For example, if you have a scientific company, you might fine-tune a model with scientific information and use it for a RAG application that summarizes your research. This may improve your RAG application’s output (the summaries of your research) because your fine-tuned model understands your data better and can provide a more effective summary.
Vector store or vector database?
Both! All vector databases are vector stores, but not all vector stores are vector databases. OK, while you get out your chalkboard to draw a Venn diagram, I will continue to explain this statement.
There are ways to store vectors that are not full databases; they are simply storage mechanisms for vectors. So, to encompass all possible ways to store vectors, LangChain calls them all vector stores, and we will do the same. Just know that not all of the vector stores that LangChain connects with are officially considered vector databases. In general, though, most of them are, and many people refer to all of them as vector databases, even when they are not technically full databases from a functionality standpoint. Phew – glad we cleared that up!
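To show how a vector store looks in code, here is a minimal LangChain sketch that embeds a few texts into an in-memory FAISS vector store and runs a similarity search. It assumes the `langchain-community`, `langchain-openai`, and `faiss-cpu` packages and an OpenAI API key; any other LangChain-supported vector store could stand in for FAISS:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

texts = [
    "RAG combines retrieval with generation.",
    "A vector store holds embeddings for similarity search.",
    "LangChain provides a common interface to many vector stores.",
]

# Embed the texts and load them into an in-memory FAISS index.
embeddings = OpenAIEmbeddings()
vector_store = FAISS.from_texts(texts, embeddings)

# Retrieve the stored text most similar to the query.
results = vector_store.similarity_search("What does a vector store do?", k=1)
print(results[0].page_content)
```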
Vectors, vectors, vectors!
A vector is a mathematical representation of your data. Vectors are often referred to as embeddings when talking specifically about natural language processing (NLP) and LLMs. Vectors are one of the most important concepts to understand, and many different parts of a RAG pipeline utilize them.
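As a small, self-contained illustration of embeddings, the following uses the `sentence-transformers` library to turn two sentences into vectors and compare them with cosine similarity; the model name is an assumption for the example:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small open source embedding model

# Each sentence becomes a fixed-length vector (an embedding).
vec_a = model.encode("The cat sat on the mat.")
vec_b = model.encode("A cat was sitting on a rug.")

# Cosine similarity: values near 1.0 mean the sentences are semantically similar.
similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(round(float(similarity), 3))
```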
We just covered many of the key vocabulary terms you will need to understand the rest of this book. Many of these concepts will be expanded upon in future chapters. In the next section, we will discuss vectors in further depth, and beyond that, we will spend Chapters 7 and 8 going over vectors and how they are used to find similar content.