Local Applications of Large Language
Models based on RAG (Retrieval
Augmented Generation)
——Local Documents Q&A
Luo Weizhi
1. Large language modeling
2. Key structures of the Transformer model
3. Advantages compared with RNN networks
4. The Llama 2 large language model
5. The fine-tuning process on a Q&A dataset
6. Langchain and chaining concepts
7. RAG (Retrieval Augmented Generation)
8. Demonstration of the project
9. Conclusion
Large language modeling
01
LLM
A large language model (LLM) is a language model notable for its ability to perform general-purpose language generation and other natural language processing tasks (e.g., GPT-4).
LLMs are called "large" because of their parameter count and the size of the text corpus used for training: a 7B model (7 billion parameters) is among the smallest LLMs, yet it is still trained on a very large dataset. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process.
LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.
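To make "repeatedly predicting the next token" concrete, here is a minimal greedy-decoding sketch using the Hugging Face transformers library; the small public gpt2 checkpoint is used only as a stand-in for a larger model, and the prompt is arbitrary.

```python
# Minimal sketch of autoregressive generation: predict the next token,
# append it to the input, and repeat. Uses the small public `gpt2`
# checkpoint as a stand-in for a larger LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("Retrieval Augmented Generation is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                                   # generate 20 new tokens
        logits = model(ids).logits                        # [batch, seq_len, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        ids = torch.cat([ids, next_id], dim=-1)

print(tokenizer.decode(ids[0]))
```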
LLM
Fig. 1. Status of the development of LLMs (models with more than 10B parameters).
Key structures of the Transformer model
02
Transformer
Current LLMs are built on the Transformer network architecture. It is an encoder-decoder structure: each encoder block combines a multi-head attention mechanism with a feed-forward network, and each decoder block adds an extra masked-attention part.
Self-Attention: The core of the Transformer. It enables the model to take into account the interactions and dependencies between the elements of a sequence while processing it. Self-attention allows the model to dynamically focus on different parts of the input sequence as it generates each output, which is critical to understanding the context and meaning of the text.
Positional Encoding: Since the Transformer is
entirely based on the attention mechanism and lacks
the ability to deal with sequence order, Positional
Encoding provides information about the position of
individual elements in a sequence by adding
additional information to the input elements.
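For reference, the sinusoidal encoding from the original Transformer paper can be written as follows, where pos is the position, i the dimension index, and d_model the embedding size (Llama 2 itself uses rotary position embeddings, a later variant of the same idea):

```latex
PE_{(pos,\,2i)}   = \sin\!\left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right)
```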
Transformer
Multi-Head Attention: The attention mechanism is
decomposed into multiple "heads", each of which
learns information from a different representation
subspace, which allows the model to capture data
features from multiple perspectives at the same time.
Feed-Forward Networks: In each Transformer
block, the output of the self-attention layer is passed
to a feed-forward network, which is the same for
each position, but is applied independently at
different positions.
Transformer
Self-Attention mechanism
The self-attention mechanism allows the model to capture contextual relationships within a sequence
by taking into account other elements in the sequence as each element of the sequence is processed.
The mathematical expression for self-attention is:
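In the notation explained below, this is the standard scaled dot-product attention formula from the original Transformer paper:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V
```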
 Q,K,V are the Query, Key, and Value matrices, respectively, which are obtained by multiplying the
embedding vectors of the input sequence with three different weight matrices.
 dk is the dimension of the key vector, which is used to scale the dot product to prevent the dot product from
being too large and causing the softmax function to be in the saturation region, thus affecting the
backpropagation of the gradient.
 QK^T denotes the dot product of the query and key matrices, which is used to compute the similarity between the positions in the input sequence.
 The softmax function is used to convert the similarity into weights.
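A minimal NumPy sketch of this computation for a single head, without masking (the shapes and random inputs are only illustrative):

```python
# Scaled dot-product attention for one head, following the formula above.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity, scaled by sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of value vectors

# Example: a sequence of 4 tokens with d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```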
Transformer
Multi-Head Attention
The multi-head attention mechanism divides self-attention into multiple "heads". In layman's terms, it is like having 8 different people look at the same problem instead of just one. Each head captures information in a different representation subspace:
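With the weight matrices defined below, the multi-head attention of the original Transformer paper is:

```latex
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})
```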
 W_i^Q, W_i^K, W_i^V, and W^O are the trainable weight matrices.
 h is the number of heads.
 The information in different representation subspaces can be fused by concatenating the outputs of the different heads and multiplying them by the output weight matrix W^O.
Transformer
Position-wise Feed-Forward Networks
A position-wise feed-forward network follows each attention layer; it applies the same two linear transformations independently to the representation at each position:
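Written out, the position-wise feed-forward network is:

```latex
\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2
```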
 This is a two-layer fully-connected feed-forward network where the max(0,x) represents the ReLU
activation function.
 W_1, W_2 and b_1, b_2 are the weights and biases of the two layers.
Transformer
Output Layer
Ultimately, the Transformer generates predictions for each element of the output sequence using a linear layer followed by a softmax layer:
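With X the decoder output defined below, the predicted distribution over the vocabulary is:

```latex
\mathrm{Output} = \mathrm{softmax}(X W + b)
```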
 Here X is the output of the last decoder layer.
 W and b are the weights and biases of the output layer.
Transformer in LLM
Interactive visualization: https://bbycroft.net/llm
Advantages compared with RNN networks
03
Limitations of RNN models
An RNN processes language sequentially, left to right or right to left. Reading one word at a time forces the RNN to perform many steps to relate words that are far apart, and the more such steps a decision requires, the harder it is for the recurrent network to learn it. In effect, the number of computation steps grows with the number of words, and learning becomes difficult.
• Vanishing and exploding gradient problems, among others.
• It is practically impossible to express an entire sentence as a single fixed-length vector, which makes complex structures such as sequential information difficult to represent.
[Figure: RNN language model. A word-embedding layer feeds the RNN hidden layer, which outputs a probability for each word; at each time step t, the vector of the last word of the sequence y1, ..., yt is fed into the model.]
The Llama 2 large language model
04
Llama 2
Llama 2 is the second generation of large language models released by Meta (formerly Facebook) AI. It comes in four sizes: 7B, 13B, 34B, and 70B parameters (the 34B variant was not publicly released). Given my hardware, we will download and fine-tune the original 7B model. Although 7B is the smallest version, it still has about 7 billion trainable weight and bias parameters. These parameters are learned from large amounts of textual data during training, so that the model captures language complexity, contextual relationships, and subtle patterns in language use.
We can download many free base models from https://huggingface.co/
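For example, once the Llama 2 license has been accepted on the Hub, the 7B base weights can be fetched with the huggingface_hub client; the repository name is the usual one for the base checkpoint, and the token shown is a placeholder:

```python
# Download the Llama 2 7B base weights from the Hugging Face Hub.
# The repository is gated, so an access token from an account that has
# accepted the license is required (or log in via `huggingface-cli login`).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-hf",   # base (non-chat) 7B checkpoint
    token="hf_...",                       # placeholder: your access token
)
print("Model files downloaded to:", local_dir)
```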
Llama 2
Pretraining on this corpus gives the base model a general-purpose next-token prediction capability.
The fine-tuning process on a Q&A dataset
05
Fine-tuning
LLMs are pretrained on an extensive corpus of text. In the case of Llama 2, we know very little about the
composition of the training set, besides its length of 2 trillion tokens. In comparison, BERT (2018) was “only”
trained on the BookCorpus (800M words) and English Wikipedia (2,500M words). From experience, this is a very
costly and long process with a lot of hardware issues.
When the pretraining is complete, auto-regressive models like Llama 2 can predict the next token in a sequence.
However, this does not make them particularly useful assistants since they don’t reply to instructions. This is why
we employ instruction tuning to align their answers with what our project expects.
Fine-tuning
There are two mainstream fine-tuning techniques:
Supervised fine-tuning (SFT): Trains the model on a dataset of instructions and answers. It minimizes the difference between the generated answers and the ground-truth answers (used as labels) by adjusting the weights of the LLM (an example training record is shown below).
Reinforcement Learning from Human Feedback (RLHF): The model learns by interacting with its environment and receiving feedback. It is trained to maximize a reward signal (using PPO), which usually comes from human evaluations of the model's output.
In general, RLHF has been shown to capture more complex and nuanced human preferences, but it is also more challenging to implement effectively: the process requires a large amount of human feedback and careful engineering.
Thus, in my project we implement SFT, but this raises the question: why does fine-tuning work in the first place? As emphasized in the Orca [1] paper, my understanding is that fine-tuning leverages the knowledge learned during the pre-training process. In other words, if the model has never seen the type of data we are interested in, then fine-tuning will not help.
[1] Mukherjee, S., et al., Orca: Progressive Learning from Complex Explanation Traces of GPT-4.
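For illustration, one common way to format a single SFT record for Llama 2 chat-style models is the [INST] template; the exact template depends on the dataset, so the record below is only a hypothetical example:

```python
# Hypothetical SFT training record in the Llama 2 chat prompt format:
# the instruction sits between [INST] and [/INST], followed by the answer.
example_record = (
    "<s>[INST] What is Retrieval Augmented Generation? [/INST] "
    "Retrieval Augmented Generation (RAG) combines a retriever, which finds "
    "relevant passages in local documents, with a generator LLM that answers "
    "using those passages. </s>"
)
```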
Fine-tuning
For our hardware conditions: the RAM is 16 GB, and the Llama 2 7B weights alone take 14 GB in FP16 (7B parameters × 2 bytes), so the model must be quantized.
First, we load our prepared dataset. Here, our dataset has already been preprocessed, but typically we could reformat prompts, filter out erroneous text, merge multiple datasets, and so on.
Then, we configure 4-bit quantization with bitsandbytes.
Next, we load the Llama 2 model with 4-bit precision on the GPU, together with the matching tokenizer.
Finally, we load the QLoRA configuration and the general training parameters, and pass everything to SFTTrainer for training.
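A condensed sketch of these steps with datasets, transformers, bitsandbytes, peft, and trl is shown below. The model id, dataset, and hyperparameters are placeholders, and the SFTTrainer arguments follow an older trl API (where dataset_text_field, max_seq_length, and tokenizer are passed directly), so adjust them to your installed library versions:

```python
# Sketch of QLoRA fine-tuning: 4-bit base model + LoRA adapters + SFTTrainer.
# Model id, dataset, and hyperparameters are illustrative placeholders.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-7b-hf"                  # gated base checkpoint
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")  # example instruction set

# 1) bitsandbytes 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# 2) load the quantized model and its tokenizer on the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# 3) QLoRA adapter configuration
peft_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.1,
    bias="none", task_type="CAUSAL_LM",
)

# 4) general training parameters, then hand everything to SFTTrainer
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    fp16=True,
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
)
trainer.train()
trainer.model.save_pretrained("llama-2-7b-qa-finetuned")
```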
Fine-tuning
During this nearly 4-hour fine-tuning run, we verified that the model's fine-tuning behavior was correct.
Still Working...