#evaluation
Workflow Series (05): Evaluation Framework — Three-Layer Testing and Trace Tracking
WonderLab
Jul 3
#ai #workflow #evaluation #trace
5 min read
Short-Circuit Your Agent Evals: Tier Order Is a Latency Budget, Not a Preference
Saurav Bhattacharya
Jul 2
#ai #agents #evaluation #typescript
1 reaction
5 min read
One Triage Pass, Every Trace Format: Stop Letting Fragmentation Shrink Your Eval Coverage
Saurav Bhattacharya
Jul 2
#ai #agents #observability #evaluation
2 reactions
5 min read
Your AI judge might be reliable — and still be wrong
Breach Protocol
Jul 1
#evaluation #llmjudges #rlhf #methodology
3 min read
Reliable, and still wrong
Breach Protocol
Jul 1
#evaluation #llmasjudge #benchmarks
3 min read
Give Your Agent a Type Signature: Contract-First Output Beats a Smarter Judge
Saurav Bhattacharya
Jun 29
#ai #agents #evaluation #typescript
1 reaction
4 min read
Your Model-as-Judge Doesn't Belong in the Hot Path
Saurav Bhattacharya
Jun 28
#ai #agents #evaluation #observability
1 reaction
9 min read
Evaluating Large Language Models: The Pitfall of Overfitting in RAG
Tanishq Soni
Jun 28
#llm #evaluation #overfitting #rag
2 min read
Evaluating Large Language Models: The Overfitting Problem
Tanishq Soni
Jun 28
#llm #evaluation #overfitting #rag
2 min read
Building Evals That Don't Lie: How to Make AI Evaluation Reliable in Production
Abdul Rehman
Jun 27
#ai #evaluation #production #llm
5 min read
Our Quality Scores Were Precise, Useless, and Identical
Alex @ Vibe Agent Making
Jun 24
#engineering #management #evaluation #codequality
1 reaction
1 comment
8 min read
Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production
Saurav Bhattacharya
Jun 27
#ai #evaluation #observability #testing
1 reaction
2 comments
5 min read
Your Agent Didn't Break, It Drifted: Detecting Slow Decay in Autonomous Systems
Saurav Bhattacharya
Jun 20
#ai #agents #observability #evaluation
2 reactions
1 comment
7 min read
Stop Asking 'Is GAI Here' — Ask 'At What Layer'
keeper
Jun 19
#ai #gai #framework #evaluation
1 reaction
3 min read
Your Model Upgrade Broke Three Workflows and the Tests Still Passed
Saurav Bhattacharya
Jun 30
#ai #agents #evaluation #testing
2 reactions
4 comments
5 min read