Building, Evaluating, and Optimizing your
RAG App for Production
Simon Suo, co-founder/CTO of LlamaIndex
Setup
1. Clone the GitHub repo
2. Set up the Python environment
3. Set up Postgres
git clone https://github.com/Disiok/ai-engineer-workshop.git
python3 -m venv rag
source rag/bin/activate
pip install -r requirements.txt
docker-compose up -d
jupyter lab
Launch Notebooks
🗓 Today’s agenda.
● Prototype
● Production Challenges: Diagnosis & Mitigation
● Evaluate
● Experiment & Optimize
Prototype
RAG System with LlamaIndex
and Ray
LlamaIndex
Data Framework for LLM Applications
● Ingestion, indexing, and querying
● Orchestrates over LLMs,
embedding models, vector DBs,
graph DBs
● RAG, chat, data agents
Ray
Scale the entire ML pipeline
● Scalable embedding generation &
indexing with Ray Data
● Distributed training & fine-tuning
with Ray Train
● Scalable application deployments
with Ray Serve
What are we
building?
Ray Assistant
Base LLMs: query → LLM → response
RAG: query → embedding model → vector DB → retrieved contexts → LLM → response
Creating our Vector DB
Data sources → load → Docs → TextSplitter (chunk) → Embedding model (embed) → Vector DB (index)
The vector DB stores (text, source, embedding) rows for each data-source chunk.
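As a point of reference, here is a minimal single-machine sketch of the same load → chunk → embed → index → query flow, assuming the pre-0.10 llama_index API; the docs folder, chunk size, and sample question are illustrative, and the workshop notebooks scale these same steps with Ray Data and a Postgres vector store.

# Minimal ingestion + query sketch (pre-0.10 llama_index API assumed).
from llama_index import SimpleDirectoryReader, VectorStoreIndex, ServiceContext

# Load -> chunk -> embed -> index
documents = SimpleDirectoryReader("./ray-docs").load_data()  # hypothetical local copy of the docs
service_context = ServiceContext.from_defaults(chunk_size=512)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Query: embed the question, retrieve top-k chunks, generate an answer
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("How do I deploy a model with Ray Serve?")
print(response)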
Lab
Let’s dive right in!
Production
Challenges
Diagnosis & mitigation
From Prototype to Production
Prototype: a few lines of code.
Productionize: latency, cost, hallucination, poor retrieval, output parsing errors, harmful answers.
Need for evaluation, diagnosis, and optimization!
Challenges - non-quality related
Symptoms
● Latency, rate limits
● Cost
● Service availability
Diagnosis & Mitigation
● Logging & monitoring
● Isolate issues to the retrieval vs. generation components
● Evaluate different LLM service providers
● Use smaller, task-specific models
● Host your own models
Challenges - quality related
Symptoms
● Unknown performance
● Hallucinations
● Incomplete answers
● Poor retrievals
Diagnosis & Mitigation
● Evaluation
○ Collect labels
○ Collect user feedback
● Optimization
○ Tune configs
○ Customize & fine-tune models
Evaluate
Understanding system
performance
Expectation vs. Reality of Evaluation
Expectation:
● “Academic benchmarks are all you need”
● Build, evaluate, done
● A single metric that perfectly tells you the performance
Reality:
● I don’t have user data yet, what now?
● Should I label some data?
● “Vibe check” only
● I can’t run a big evaluation on every CI run
● When should I evaluate?
The development process
● Idea Validation: build prototype with toy data
● Solution Validation: build prototype with full corpus
● Deployment: deploy system & collect user queries & feedback
● Optimization: collect labels, improve system
Evaluation in the development process
Idea Validation: build prototype with toy data
At this stage:
● Quick iteration cycle is critical
● “Vibe check” on ad-hoc queries
● Experiment with different modules
● Ad-hoc configuration changes
The “Vibe Check”
Ad-hoc spot checks with random questions.
● Quick way to sanity check and iterate
● Good for initial prototyping and early development
● Can be a surprisingly good indication of performance
https://blog.replit.com/llm-training
https://eugeneyan.com/writing/llm-patterns/
Not systematic, but useful!
The development process
Solution Validation: build prototype with full corpus
At this stage:
● “Vibe check” on a curated/representative set of test queries
● Move towards more systematic evaluation to gain confidence for the initial deployment
Challenges of Systematic Evaluation
● Metrics
○ Flexibility of natural language: no one right answer
○ Human evaluation is not scalable
● Data availability
○ Labeled data is slow & costly to collect
● Actionable insight
○ Not just “it’s bad”, but also “how to improve”
○ End-to-end vs. component-wise
LLM-as-a-Judge
● Strong LLMs (GPT-4, Claude 2) are good evaluators
○ High agreement with humans
○ Scalability
○ Interpretability
● Approaches
○ Pairwise comparison
○ Single-answer grading
○ Reference-guided grading (i.e. against a “golden” answer)
https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG
https://arxiv.org/pdf/2306.05685.pdf
Systematic Evaluation - Overview
Component-wise evaluation
Helps attribute quality issues to specific components:
● Retrieval: are we retrieving the relevant context?
● Generation: given the context, are we generating an accurate and coherent answer?
Analogous to unit tests.
End-to-end evaluation
Helps understand how well the full RAG application works:
● Given a user question, how “good” is the answer?
Analogous to integration tests.
Data for Systematic Evaluation
● User Query: a representative set of real user queries
● “Golden” Context: the set of relevant documents from our corpus that best answer a given query
● “Golden” Answer: the best answer given the “golden” context (can be optional)
● User Feedback: feedback from past interactions, e.g. up/down votes or ratings, on retrieval and/or generation
Data Challenges
● User Query: need to deploy the system & collect; relatively easy
● User Feedback: need to deploy the system & collect; requires good UX to get good data
● “Golden” Context: needs labelers; relatively cheap/easy
● “Golden” Answer: needs labelers; more costly/tedious
“Cold Start” Problem
Dilemma
We have a chicken-and-egg problem:
1. We want to evaluate to gain confidence in our RAG system before deployment.
2. Without deployment, we can’t collect real user queries or label an evaluation dataset.
Strategy
● Use purely LLM-based, label-free evaluation
● Leverage an LLM to generate a synthetic labeled dataset, including:
○ Query
○ “Golden” context
○ “Golden” response
Generating Synthetic Evaluation Data
Sample a document chunk (e.g. doc_0) from the knowledge corpus and have an LLM produce a synthetic “datapoint” for eval:
{
  User query: …
  “Golden” context: doc_0
  “Golden” answer: …
}
● Sample document chunks, and leverage an LLM to generate Q&A pairs that are best answered by the given context.
● Not always representative of real user queries, but useful for development iterations & diagnostics.
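A minimal sketch of this generation loop, assuming the pre-1.0 openai client and a hypothetical chunks list of (chunk_id, chunk_text) pairs; LlamaIndex also ships dataset-generation helpers, so treat this as the core pattern rather than the workshop's exact code.

import json
import random
import openai

QA_PROMPT = (
    "You are generating evaluation data for a question-answering system.\n"
    "Given the context below, write one question that can be answered using only this "
    "context, plus the best possible answer.\n"
    'Return a JSON object with keys "query" and "answer".\n\n'
    "Context:\n{context}"
)

def make_synthetic_datapoint(chunk_id, chunk_text):
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": QA_PROMPT.format(context=chunk_text)}],
        temperature=0.3,
    )
    qa = json.loads(resp["choices"][0]["message"]["content"])
    # The "golden" context is simply the chunk the question was generated from.
    return {"query": qa["query"], "golden_context": chunk_id, "golden_answer": qa["answer"]}

# chunks: hypothetical list of (chunk_id, chunk_text) tuples sampled from the corpus
eval_dataset = [make_synthetic_datapoint(cid, text) for cid, text in random.sample(chunks, 50)]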
The development process
Deployment: deploy system & collect user queries & feedback
Optimization: collect labels, improve system
At this stage:
● We have user queries, and maybe labels & feedback
● We need repeatable, consistent, automated evaluation
● We need actionable insight or automated tuning
End-to-end evaluation
Sample an entry (query, “golden” answer) from the eval dataset, run the query through the RAG app, and pass (query, RAG answer, “golden” answer) to an evaluator.
Metrics: correctness, faithfulness, relevancy, …
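A minimal end-to-end evaluation loop, sketched under the assumption that eval_dataset is a list of {"query", "golden_answer"} records (e.g. the synthetic dataset above) and that judge_correctness is an LLM-as-a-judge scorer like the one described on the next slides.

# End-to-end evaluation sketch: run each query through the RAG app and judge the answer.
def run_e2e_eval(query_engine, eval_dataset, judge_correctness):
    scores = []
    for record in eval_dataset:
        generated = str(query_engine.query(record["query"]))  # RAG answer
        score = judge_correctness(
            query=record["query"],
            generated_answer=generated,
            reference_answer=record["golden_answer"],
        )
        scores.append(score)
    return sum(scores) / len(scores)  # mean correctness score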
LLM-as-a-Judge:
The devil is in the details
● What model?
○ Only “strong” LLMs right now
○ GPT-4, Claude 2
● What grading scale?
○ Low-precision, e.g. binary, 1-5
○ Easier rubrics, more interpretable
● Do we need few-shot examples?
○ Helps to give guidance/example for
each score level
● Holistic judgement vs. individual aspects
○ Be as concrete as possible
● Chain of thought reasoning
You are an expert evaluation system for a question answering
chatbot.
You are given the following information:
- a user query,
- a reference answer, and
- a generated answer.
Your job is to judge the relevance and correctness of the
generated answer.
Output a single score that represents a holistic evaluation.
You must return your response in a line with only the score.
Do not return answers in any other format.
On a separate line provide your reasoning for the score as well.
Follow these guidelines for scoring:
- Your score has to be between 1 and 5, where 1 is the worst and 5
is the best.
- If the generated answer is not relevant to the user query, 
you should give a score of 1.
- If the generated answer is relevant but contains mistakes, 
you should give a score between 2 and 3.
- If the generated answer is relevant and fully correct, 
you should give a score between 4 and 5.
## User Query
{query}
## Reference Answer
{reference_answer}
## Generated Answer
{generated_answer}
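A minimal sketch of wiring this prompt into an automated judge, assuming the pre-1.0 openai client; JUDGE_PROMPT_TEMPLATE is assumed to hold the prompt text above with its {query}, {reference_answer}, and {generated_answer} placeholders, and the parsing relies on the model following the "score on the first line" instruction.

# LLM-as-a-Judge call sketch (pre-1.0 openai client assumed).
import openai

def judge(query, reference_answer, generated_answer):
    prompt = JUDGE_PROMPT_TEMPLATE.format(
        query=query,
        reference_answer=reference_answer,
        generated_answer=generated_answer,
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    lines = resp["choices"][0]["message"]["content"].strip().splitlines()
    score = float(lines[0])           # first line: the 1-5 score
    reasoning = " ".join(lines[1:])   # remaining lines: the judge's reasoning
    return score, reasoning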
LLM-as-a-Judge:
Limitations
● Position bias
○ First, last, etc
● Verbosity bias
○ Longer the better?
● Self-enhancement bias
○ GPT <3 GPT
Component-wise evaluation: RAG Retrieval
● Evaluate the retrieval component: sample a query from the eval dataset, retrieve docs, and compare the retrieved docs against the “golden” context with a retrieval evaluator.
Metrics:
● Hit rate
● MRR
● MAP
● NDCG
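For intuition, a minimal sketch of the two simplest metrics above (hit rate and MRR) over (query, “golden” context id) pairs; retrieve_ids is a hypothetical helper that returns the ranked list of retrieved chunk ids for a query.

# Hit rate and MRR over an eval set of (query, golden_context_id) pairs.
def hit_rate_and_mrr(eval_pairs, retrieve_ids, k=5):
    hits = 0
    reciprocal_ranks = []
    for query, golden_id in eval_pairs:
        retrieved = retrieve_ids(query)[:k]
        if golden_id in retrieved:
            hits += 1
            reciprocal_ranks.append(1.0 / (retrieved.index(golden_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    hit_rate = hits / len(eval_pairs)
    mrr = sum(reciprocal_ranks) / len(eval_pairs)
    return hit_rate, mrr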
Component-wise evaluation: RAG Generation
● Evaluate the generation component: sample (query, “golden” context, “golden” answer) from the eval dataset, generate an answer from the query and the “golden” context, and pass (query, answer, “golden” answer) to an evaluator.
Metrics: correctness, faithfulness, relevancy, …
Experiment
An experiment config (embedding_model_name, chunk_size, chunk_overlap, similarity_top_k, llm_model_name, temperature) is passed to build_or_load_index, which produces a Retriever and a Query Engine.
● run_retrieval_exp: run the queries through the Retriever and score the retrieved docs against the “golden” contexts (hit rate)
● run_e2e_exp: run the queries through the Query Engine and score the answers against the “golden” answers (mean correctness score)
Lab
● Cold start problem: generating synthetic eval dataset from documents
● Component-wise evaluation
○ Retrieval
○ Generation
● End-to-end evaluation
Experiment &
Optimize
Improve system performance
Ways to optimize the RAG system
Configure standard components
● Use standard, off-the-shelf components (e.g. LLM, embedding model, standard semantic search)
● Select good components and tune parameters for end performance
Build a customized pipeline
● Deeply understand your data & query patterns
● Define indexing & retrieval strategies optimized for your specific use case
Model fine-tuning
● Specialize the embedding model and LLM to your domain
Configure standard components
Retrieval
● Chunk size
● Embedding model
● Number of retrieved chunks
Generation
● LLM
● (prompt)
Grid search
● Run all parameter combinations and evaluate the end performance of the system, e.g.:
{ chunk_size: 512, top_k: 5, llm: gpt-4, … } → Score: 4.5
{ chunk_size: 512, top_k: 10, llm: gpt-4, … } → Score: 3.5
Parameter grid over the RAG pipeline (query → embedding model → vector store → retrieved docs → LLM → response):
CHUNK_SIZES = [256, 512, 1024]
TOP_K = [1, 3, 5, 7]
EMBED_MODELS = [
"thenlper/gte-base",
"BAAI/bge-large-en",
"text-embedding-ada-002"
]
LLMS = [
"gpt-3.5-turbo",
"gpt-4",
"meta-llama/Llama-2-7b-chat-hf",
"meta-llama/Llama-2-13b-chat-hf",
"meta-llama/Llama-2-70b-chat-hf"
]
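A minimal grid-search sketch over the parameter lists above; evaluate_config is a hypothetical helper that builds the index and query engine for one configuration and returns its mean correctness score on the eval set.

import itertools

results = []
for chunk_size, top_k, embed_model, llm in itertools.product(
    CHUNK_SIZES, TOP_K, EMBED_MODELS, LLMS
):
    config = {
        "chunk_size": chunk_size,
        "similarity_top_k": top_k,
        "embedding_model_name": embed_model,
        "llm_model_name": llm,
    }
    score = evaluate_config(config)  # assumed: build index/query engine, run end-to-end eval
    results.append((score, config))

# Best configuration by end-to-end score
best_score, best_config = max(results, key=lambda r: r[0])
print(best_score, best_config)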
Lab
● Run experiments over standard component configurations
● Gain intuition on optimal parameter configurations
Customizing Retrieval & Generation
Strategy 1: two-stage retrieval
● First retrieve many potentially relevant contexts
● Then rerank/filter down to a smaller subset
Retriever (operates over the full corpus): less accurate, fast.
Reranker (operates over the candidates): more accurate, slow.
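A minimal sketch of the pattern, assuming a llama_index-style retriever whose results expose get_content() and the sentence-transformers cross-encoder package; the model name is illustrative, and LlamaIndex also ships reranking node postprocessors that wrap the same idea.

# Two-stage retrieval: broad vector retrieval, then cross-encoder reranking.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(retriever, query, first_stage_k=20, final_k=5):
    # Stage 1: fast, recall-oriented retrieval over the full corpus
    candidates = retriever.retrieve(query)[:first_stage_k]
    # Stage 2: slower, precision-oriented scoring of (query, passage) pairs
    scores = reranker.predict([(query, c.get_content()) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:final_k]]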
Strategy 2a: embed different representations of the same data, e.g.
● Summarize the document and embed the summary
● Extract distinct topics and embed each topic extraction separately
This can improve retrieval for specific questions.
Strategy 2b: use different data representations for retrieval vs. generation (e.g. retrieve against a compact representation such as a summary, but hand the full document text to the LLM), as sketched below.
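A minimal sketch of Strategies 2a/2b combined: index a summary for retrieval while keeping the full document for generation. It assumes the pre-1.0 openai client; vector_db.add is a hypothetical stand-in for your vector store's insert call, and the model names are illustrative.

import openai

def index_document_by_summary(doc_id, doc_text):
    # Build a compact representation for retrieval
    summary = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Summarize this document in 3-4 sentences:\n\n{doc_text}"}],
    )["choices"][0]["message"]["content"]

    embedding = openai.Embedding.create(
        model="text-embedding-ada-002", input=summary
    )["data"][0]["embedding"]

    # Retrieval matches against the summary embedding; generation later loads the
    # full doc_text by doc_id from the document store.
    vector_db.add(id=doc_id, embedding=embedding, text=summary, metadata={"source_doc": doc_id})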
Strategy 3: leverage an LLM to infer a structured query for retrieval (e.g. metadata filters, top-k, score threshold).
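A minimal sketch of inferring a structured retrieval plan with an LLM; the prompt, JSON schema, and filter fields here are entirely illustrative (not from the workshop repo), and the pre-1.0 openai client is assumed.

import json
import openai

STRUCTURED_QUERY_PROMPT = (
    "Given the user question, produce a JSON retrieval plan with fields:\n"
    '  "semantic_query": rewritten query text,\n'
    '  "filters": metadata filters, e.g. {{"section": "ray-serve"}},\n'
    '  "top_k": number of chunks to retrieve.\n'
    "Question: {question}"
)

def infer_structured_query(question):
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": STRUCTURED_QUERY_PROMPT.format(question=question)}],
        temperature=0,
    )
    # e.g. {"semantic_query": "...", "filters": {"section": "ray-serve"}, "top_k": 3}
    return json.loads(resp["choices"][0]["message"]["content"])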
Strategy 4: recursive retrieval over a hierarchical index.
Strategy 5: route to different retrieval methods depending on the query. The query may also need to be rewritten, depending on the interface of the retriever.
Model Fine-tuning
LLM fine-tuning
● When you want to adjust the “style” of generation, e.g. a professional legal assistant
● When you want to enforce output structure, e.g. JSON
Not great for injecting new knowledge or fighting hallucination.
Embedding model fine-tuning
● Improves retrieval performance, especially for documents with
○ Domain-specific terminology
○ Malformed text (e.g. extraneous spacing from parsing)
● Two approaches
○ Fine-tune the full embedding model
○ Fine-tune an adapter layer on top of a frozen embedding model
Great for improving retrieval (and thus end-to-end RAG performance).
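A minimal fine-tuning sketch using sentence-transformers with an in-batch-negatives loss on synthetic (query, “golden” context) pairs; train_pairs is a hypothetical list of (query, context_text) tuples, e.g. derived from the synthetic eval dataset generated earlier, and the base model name is illustrative.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-small-en")  # base embedding model to specialize

train_examples = [InputExample(texts=[query, context]) for query, context in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# Other pairs in the batch act as in-batch negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    warmup_steps=50,
    output_path="finetuned-bge-small-en",
)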
Lab
● Second-stage re-ranking
● Sentence window strategy
● Fine-tuning embeddings for RAG with synthetic data
Sign up for our newsletter:
https://www.llamaindex.ai/
Sign up for office hours
Get In Touch