Building, Evaluating, and Optimizing your
RAG App for Production
Simon Suo, co-founder/CTO of LlamaIndex
Setup
1. Clone the GitHub repo
2. Set up the Python environment
3. Set up Postgres
git clone https://github.com/Disiok/ai-engineer-workshop.git
python3 -m venv rag
source rag/bin/activate
pip install -r requirements.txt
docker-compose up -d
jupyter lab
Launch Notebooks
🗓 Today’s agenda.
● Prototype
● Production Challenges: Diagnosis & Mitigation
● Evaluate
● Experiment & Optimize
Prototype
RAG System with LlamaIndex
and Ray
LlamaIndex
Data Framework for LLM Applications
● Ingestion, indexing, and querying
● Orchestrates over LLMs,
embedding models, vector DBs,
graph DBs
● RAG, chat, data agents
Ray
Scale the entire ML pipeline
● Scalable embedding generation &
indexing with Ray Data
● Distributed training & fine-tuning
with Ray Train
● Scalable application deployments
with Ray Serve
What are we
building?
Ray Assistant
Base LLMs: query → LLM → response
RAG: query → embedding model → vector DB → retrieved contexts → LLM → response
Creating our Vector DB
Data sources → load → Docs → TextSplitter (chunk) → Embedding model (embed) → Vector DB (index)
The vector DB stores (text, source, embedding) rows for each data-source chunk.
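As a point of reference, here is a minimal single-machine sketch of the same load → chunk → embed → index → query flow, assuming the pre-0.10 llama_index API; the docs folder, chunk size, and sample question are illustrative, and the workshop notebooks scale these same steps with Ray Data and a Postgres vector store.

# Minimal ingestion + query sketch (pre-0.10 llama_index API assumed).
from llama_index import SimpleDirectoryReader, VectorStoreIndex, ServiceContext

# Load -> chunk -> embed -> index
documents = SimpleDirectoryReader("./ray-docs").load_data()  # hypothetical local copy of the docs
service_context = ServiceContext.from_defaults(chunk_size=512)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Query: embed the question, retrieve top-k chunks, generate an answer
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("How do I deploy a model with Ray Serve?")
print(response)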
Lab
Let’s dive right in!
Production
Challenges
Diagnosis & mitigation
From Prototype to Production
Prototype: a few lines of code.
Productionize: latency, cost, hallucination, poor retrieval, output parsing errors, harmful answers.
Need for evaluation, diagnosis, and optimization!
Challenges - non-quality related
Symptoms
● Latency, rate limits
● Cost
● Service availability
Diagnosis & Mitigation
● Logging & monitoring
● Isolate issues to the retrieval vs. generation components
● Evaluate different LLM service providers
● Use smaller, task-specific models
● Host your own models
Challenges - quality related
Symptoms
● Unknown performance
● Hallucinations
● Incomplete answers
● Poor retrievals
Diagnosis & Mitigation
● Evaluation
○ Collect labels
○ Collect user feedback
● Optimization
○ Tune configs
○ Customize & fine-tune models
Evaluate
Understanding system
performance
Expectation vs. Reality of Evaluation
Expectation:
● “Academic benchmarks are all you need”
● Build, evaluate, done
● A single metric that perfectly tells you the performance
Reality:
● I don’t have user data yet, what now?
● Should I label some data?
● “Vibe check” only
● I can’t run a big evaluation on every CI run
● When should I evaluate?
The development process
● Idea Validation: build prototype with toy data
● Solution Validation: build prototype with full corpus
● Deployment: deploy system & collect user queries & feedback
● Optimization: collect labels, improve system
Evaluation in the development process
Idea Validation: build prototype with toy data
At this stage:
● Quick iteration cycle is critical
● “Vibe check” on ad-hoc queries
● Experiment with different modules
● Ad-hoc configuration changes
The “Vibe Check”
Ad-hoc spot checks with random questions.
● Quick way to sanity check and iterate
● Good for initial prototyping and early development
● Can be a surprisingly good indication of performance
https://blog.replit.com/llm-training
https://eugeneyan.com/writing/llm-patterns/
Not systematic, but useful!
The development process
Solution Validation: build prototype with full corpus
At this stage:
● “Vibe check” on a curated/representative set of test queries
● Move towards more systematic evaluation to gain confidence for the initial deployment
Challenges of Systematic Evaluation
● Metrics
○ Flexibility of natural language: no one right answer
○ Human evaluation is not scalable
● Data availability
○ Labeled data is slow & costly to collect
● Actionable insight
○ Not just “it’s bad”, but also “how to improve”
○ End-to-end vs. component-wise
LLM-as-a-Judge
● Strong LLMs (GPT-4, Claude 2) are good evaluators
○ High agreement with humans
○ Scalability
○ Interpretability
● Approaches
○ Pairwise comparison
○ Single-answer grading
○ Reference-guided grading (i.e. against a “golden” answer)
https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG
https://arxiv.org/pdf/2306.05685.pdf
Systematic Evaluation - Overview
Component-wise evaluation
Helps attribute quality issues to specific components:
● Retrieval: are we retrieving the relevant context?
● Generation: given the context, are we generating an accurate and coherent answer?
Analogous to unit tests.
End-to-end evaluation
Helps understand how well the full RAG application works:
● Given a user question, how “good” is the answer?
Analogous to integration tests.
Data for Systematic Evaluation
● User Query: a representative set of real user queries
● “Golden” Context: the set of relevant documents from our corpus that best answer a given query
● “Golden” Answer: the best answer given the “golden” context (can be optional)
● User Feedback: feedback from past interactions, e.g. up/down votes or ratings, on retrieval and/or generation
Data Challenges
● User Query: need to deploy the system & collect; relatively easy
● User Feedback: need to deploy the system & collect; requires good UX to get good data
● “Golden” Context: needs labelers; relatively cheap/easy
● “Golden” Answer: needs labelers; more costly/tedious
“Cold Start” Problem
Dilemma
We have a chicken-and-egg problem:
1. We want to evaluate to gain confidence in our RAG system before deployment.
2. Without deployment, we can’t collect real user queries or label an evaluation dataset.
Strategy
● Use purely LLM-based, label-free evaluation
● Leverage an LLM to generate a synthetic labeled dataset, including:
○ Query
○ “Golden” context
○ “Golden” response
Generating Synthetic Evaluation Data
Sample a document chunk (e.g. doc_0) from the knowledge corpus and have an LLM produce a synthetic “datapoint” for eval:
{
  User query: …
  “Golden” context: doc_0
  “Golden” answer: …
}
● Sample document chunks, and leverage an LLM to generate Q&A pairs that are best answered by the given context.
● Not always representative of real user queries, but useful for development iterations & diagnostics.
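A minimal sketch of this generation loop, assuming the pre-1.0 openai client and a hypothetical chunks list of (chunk_id, chunk_text) pairs; LlamaIndex also ships dataset-generation helpers, so treat this as the core pattern rather than the workshop's exact code.

import json
import random
import openai

QA_PROMPT = (
    "You are generating evaluation data for a question-answering system.\n"
    "Given the context below, write one question that can be answered using only this "
    "context, plus the best possible answer.\n"
    'Return a JSON object with keys "query" and "answer".\n\n'
    "Context:\n{context}"
)

def make_synthetic_datapoint(chunk_id, chunk_text):
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": QA_PROMPT.format(context=chunk_text)}],
        temperature=0.3,
    )
    qa = json.loads(resp["choices"][0]["message"]["content"])
    # The "golden" context is simply the chunk the question was generated from.
    return {"query": qa["query"], "golden_context": chunk_id, "golden_answer": qa["answer"]}

# chunks: hypothetical list of (chunk_id, chunk_text) tuples sampled from the corpus
eval_dataset = [make_synthetic_datapoint(cid, text) for cid, text in random.sample(chunks, 50)]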
The development process
Deployment: deploy system & collect user queries & feedback
Optimization: collect labels, improve system
At this stage:
● We have user queries, and maybe labels & feedback
● We need repeatable, consistent, automated evaluation
● We need actionable insight or automated tuning
End-to-end evaluation
Sample an entry (query, “golden” answer) from the eval dataset, run the query through the RAG app, and pass (query, RAG answer, “golden” answer) to an evaluator.
Metrics: correctness, faithfulness, relevancy, …
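A minimal end-to-end evaluation loop, sketched under the assumption that eval_dataset is a list of {"query", "golden_answer"} records (e.g. the synthetic dataset above) and that judge_correctness is an LLM-as-a-judge scorer like the one described on the next slides.

# End-to-end evaluation sketch: run each query through the RAG app and judge the answer.
def run_e2e_eval(query_engine, eval_dataset, judge_correctness):
    scores = []
    for record in eval_dataset:
        generated = str(query_engine.query(record["query"]))  # RAG answer
        score = judge_correctness(
            query=record["query"],
            generated_answer=generated,
            reference_answer=record["golden_answer"],
        )
        scores.append(score)
    return sum(scores) / len(scores)  # mean correctness score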
LLM-as-a-Judge:
The devil is in the details
● What model?
○ Only “strong” LLMs right now
○ GPT-4, Claude 2
● What grading scale?
○ Low-precision, e.g. binary, 1-5
○ Easier rubrics, more interpretable
● Do we need few-shot examples?
○ Helps to give guidance/example for
each score level
● Holistic judgement vs. individual aspects
○ Be as concrete as possible
● Chain of thought reasoning
You are an expert evaluation system for a question answering
chatbot.
You are given the following information:
- a user query,
- a reference answer, and
- a generated answer.
Your job is to judge the relevance and correctness of the
generated answer.
Output a single score that represents a holistic evaluation.
You must return your response in a line with only the score.
Do not return answers in any other format.
On a separate line provide your reasoning for the score as well.
Follow these guidelines for scoring:
- Your score has to be between 1 and 5, where 1 is the worst and 5
is the best.
- If the generated answer is not relevant to the user query, 
you should give a score of 1.
- If the generated answer is relevant but contains mistakes, 
you should give a score between 2 and 3.
- If the generated answer is relevant and fully correct, 
you should give a score between 4 and 5.
## User Query
{query}
## Reference Answer
{reference_answer}
## Generated Answer
{generated_answer}
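A minimal sketch of wiring this prompt into an automated judge, assuming the pre-1.0 openai client; JUDGE_PROMPT_TEMPLATE is assumed to hold the prompt text above with its {query}, {reference_answer}, and {generated_answer} placeholders, and the parsing relies on the model following the "score on the first line" instruction.

# LLM-as-a-Judge call sketch (pre-1.0 openai client assumed).
import openai

def judge(query, reference_answer, generated_answer):
    prompt = JUDGE_PROMPT_TEMPLATE.format(
        query=query,
        reference_answer=reference_answer,
        generated_answer=generated_answer,
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    lines = resp["choices"][0]["message"]["content"].strip().splitlines()
    score = float(lines[0])           # first line: the 1-5 score
    reasoning = " ".join(lines[1:])   # remaining lines: the judge's reasoning
    return score, reasoning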
LLM-as-a-Judge:
Limitations
● Position bias
○ First, last, etc
● Verbosity bias
○ Longer the better?
● Self-enhancement bias
○ GPT <3 GPT
Component-wise evaluation: RAG Retrieval
● Evaluate the retrieval component: sample a query from the eval dataset, retrieve docs, and compare the retrieved docs against the “golden” context with a retrieval evaluator.
Metrics:
● Hit rate
● MRR
● MAP
● NDCG
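For intuition, a minimal sketch of the two simplest metrics above (hit rate and MRR) over (query, “golden” context id) pairs; retrieve_ids is a hypothetical helper that returns the ranked list of retrieved chunk ids for a query.

# Hit rate and MRR over an eval set of (query, golden_context_id) pairs.
def hit_rate_and_mrr(eval_pairs, retrieve_ids, k=5):
    hits = 0
    reciprocal_ranks = []
    for query, golden_id in eval_pairs:
        retrieved = retrieve_ids(query)[:k]
        if golden_id in retrieved:
            hits += 1
            reciprocal_ranks.append(1.0 / (retrieved.index(golden_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    hit_rate = hits / len(eval_pairs)
    mrr = sum(reciprocal_ranks) / len(eval_pairs)
    return hit_rate, mrr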
Component-wise evaluation: RAG Generation
● Evaluate the generation component: sample (query, “golden” context, “golden” answer) from the eval dataset, generate an answer from the query and the “golden” context, and pass (query, answer, “golden” answer) to an evaluator.
Metrics: correctness, faithfulness, relevancy, …
Experiment
An experiment config (embedding_model_name, chunk_size, chunk_overlap, similarity_top_k, llm_model_name, temperature) is passed to build_or_load_index, which produces a Retriever and a Query Engine.
● run_retrieval_exp: run the queries through the Retriever and score the retrieved docs against the “golden” contexts (hit rate)
● run_e2e_exp: run the queries through the Query Engine and score the answers against the “golden” answers (mean correctness score)
Lab
● Cold start problem: generating synthetic eval dataset from documents
● Component-wise evaluation
○ Retrieval
○ Generation
● End-to-end evaluation
Experiment &
Optimize
Improve system performance
Ways to optimize the RAG system
Configure standard components
● Use standard, off-the-shelf components (e.g. LLM, embedding model, standard semantic search)
● Select good components and tune parameters for end performance
Build a customized pipeline
● Deeply understand your data & query patterns
● Define indexing & retrieval strategies optimized for your specific use case
Model fine-tuning
● Specialize the embedding model and LLM to your domain
Configure standard components
Retrieval
● Chunk size
● Embedding model
● Number of retrieved chunks
Generation
● LLM
● (prompt)
Grid search
● Run all parameter combinations and evaluate the end performance of the system, e.g.:
{ chunk_size: 512, top_k: 5, llm: gpt-4, … } → Score: 4.5
{ chunk_size: 512, top_k: 10, llm: gpt-4, … } → Score: 3.5
Parameter grid over the RAG pipeline (query → embedding model → vector store → retrieved docs → LLM → response):
CHUNK_SIZES = [256, 512, 1024]
TOP_K = [1, 3, 5, 7]
EMBED_MODELS = [
"thenlper/gte-base",
"BAAI/bge-large-en",
"text-embedding-ada-002"
]
LLMS = [
"gpt-3.5-turbo",
"gpt-4",
"meta-llama/Llama-2-7b-chat-hf",
"meta-llama/Llama-2-13b-chat-hf",
"meta-llama/Llama-2-70b-chat-hf"
]
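A minimal grid-search sketch over the parameter lists above; evaluate_config is a hypothetical helper that builds the index and query engine for one configuration and returns its mean correctness score on the eval set.

import itertools

results = []
for chunk_size, top_k, embed_model, llm in itertools.product(
    CHUNK_SIZES, TOP_K, EMBED_MODELS, LLMS
):
    config = {
        "chunk_size": chunk_size,
        "similarity_top_k": top_k,
        "embedding_model_name": embed_model,
        "llm_model_name": llm,
    }
    score = evaluate_config(config)  # assumed: build index/query engine, run end-to-end eval
    results.append((score, config))

# Best configuration by end-to-end score
best_score, best_config = max(results, key=lambda r: r[0])
print(best_score, best_config)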
Lab
● Run experiments over standard component configurations
● Gain intuition on optimal parameter configurations
Customizing Retrieval & Generation
Strategy 1: two-stage retrieval
● First retrieve many potentially relevant contexts
● Then rerank/filter down to a smaller subset
Retriever (operates over the full corpus): less accurate, fast.
Reranker (operates over the candidates): more accurate, slow.
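A minimal sketch of the pattern, assuming a llama_index-style retriever whose results expose get_content() and the sentence-transformers cross-encoder package; the model name is illustrative, and LlamaIndex also ships reranking node postprocessors that wrap the same idea.

# Two-stage retrieval: broad vector retrieval, then cross-encoder reranking.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(retriever, query, first_stage_k=20, final_k=5):
    # Stage 1: fast, recall-oriented retrieval over the full corpus
    candidates = retriever.retrieve(query)[:first_stage_k]
    # Stage 2: slower, precision-oriented scoring of (query, passage) pairs
    scores = reranker.predict([(query, c.get_content()) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:final_k]]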
Strategy 2a: embed different representations of the same data, e.g.
● Summarize the document and embed the summary
● Extract distinct topics and embed each topic extraction separately
This can improve retrieval for specific questions.
Strategy 2b: use different data representations for retrieval vs. generation (e.g. retrieve against a compact representation such as a summary, but hand the full document text to the LLM), as sketched below.
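A minimal sketch of Strategies 2a/2b combined: index a summary for retrieval while keeping the full document for generation. It assumes the pre-1.0 openai client; vector_db.add is a hypothetical stand-in for your vector store's insert call, and the model names are illustrative.

import openai

def index_document_by_summary(doc_id, doc_text):
    # Build a compact representation for retrieval
    summary = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Summarize this document in 3-4 sentences:\n\n{doc_text}"}],
    )["choices"][0]["message"]["content"]

    embedding = openai.Embedding.create(
        model="text-embedding-ada-002", input=summary
    )["data"][0]["embedding"]

    # Retrieval matches against the summary embedding; generation later loads the
    # full doc_text by doc_id from the document store.
    vector_db.add(id=doc_id, embedding=embedding, text=summary, metadata={"source_doc": doc_id})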
Strategy 3: leverage an LLM to infer a structured query for retrieval (e.g. metadata filters, top-k, score threshold).
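A minimal sketch of inferring a structured retrieval plan with an LLM; the prompt, JSON schema, and filter fields here are entirely illustrative (not from the workshop repo), and the pre-1.0 openai client is assumed.

import json
import openai

STRUCTURED_QUERY_PROMPT = (
    "Given the user question, produce a JSON retrieval plan with fields:\n"
    '  "semantic_query": rewritten query text,\n'
    '  "filters": metadata filters, e.g. {{"section": "ray-serve"}},\n'
    '  "top_k": number of chunks to retrieve.\n'
    "Question: {question}"
)

def infer_structured_query(question):
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": STRUCTURED_QUERY_PROMPT.format(question=question)}],
        temperature=0,
    )
    # e.g. {"semantic_query": "...", "filters": {"section": "ray-serve"}, "top_k": 3}
    return json.loads(resp["choices"][0]["message"]["content"])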
Strategy 4: recursive retrieval over a hierarchical index.
Strategy 5: route to different retrieval methods depending on the query. The query may also need to be rewritten, depending on the interface of the retriever.
Model Fine-tuning
LLM fine-tuning
● When you want to adjust the “style” of generation, e.g. a professional legal assistant
● When you want to enforce output structure, e.g. JSON
Not great for injecting new knowledge or fighting hallucination.
Embedding model fine-tuning
● Improves retrieval performance, especially for documents with
○ Domain-specific terminology
○ Malformed text (e.g. extraneous spacing from parsing)
● Two approaches
○ Fine-tune the full embedding model
○ Fine-tune an adapter layer on top of a frozen embedding model
Great for improving retrieval (and thus end-to-end RAG performance).
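A minimal fine-tuning sketch using sentence-transformers with an in-batch-negatives loss on synthetic (query, “golden” context) pairs; train_pairs is a hypothetical list of (query, context_text) tuples, e.g. derived from the synthetic eval dataset generated earlier, and the base model name is illustrative.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-small-en")  # base embedding model to specialize

train_examples = [InputExample(texts=[query, context]) for query, context in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# Other pairs in the batch act as in-batch negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    warmup_steps=50,
    output_path="finetuned-bge-small-en",
)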
Lab
● Second-stage re-ranking
● Sentence window strategy
● Fine-tuning embeddings for RAG with synthetic data
Sign up for our newsletter:
https://www.llamaindex.ai/
Sign up for office hours
Get In Touch