Evaluation

WonderLab

Jul 3

Workflow Series (05): Evaluation Framework — Three-Layer Testing and Trace Tracking

#ai #workflow #evaluation #trace

5 min read

Saurav Bhattacharya

Jul 2

Short-Circuit Your Agent Evals: Tier Order Is a Latency Budget, Not a Preference

#ai #agents #evaluation #typescript

5 min read

Saurav Bhattacharya

Jul 2

One Triage Pass, Every Trace Format: Stop Letting Fragmentation Shrink Your Eval Coverage

#ai #agents #observability #evaluation

5 min read

Breach Protocol

Jul 1

Your AI judge might be reliable — and still be wrong

#evaluation #llmjudges #rlhf #methodology

3 min read

Breach Protocol

Jul 1

Reliable, and still wrong

#evaluation #llmasjudge #benchmarks

3 min read

Saurav Bhattacharya

Jun 29

Give Your Agent a Type Signature: Contract-First Output Beats a Smarter Judge

#ai #agents #evaluation #typescript

4 min read

Saurav Bhattacharya

Jun 28

Your Model-as-Judge Doesn't Belong in the Hot Path

#ai #agents #evaluation #observability

9 min read

Tanishq Soni

Jun 28

Evaluating Large Language Models: The Pitfall of Overfitting in RAG

#llm #evaluation #overfitting #rag

2 min read

Tanishq Soni

Jun 28

Evaluating Large Language Models: The Overfitting Problem

#llm #evaluation #overfitting #rag

2 min read

Abdul Rehman

Jun 27

Building Evals That Don't Lie: How to Make AI Evaluation Reliable in Production

#ai #evaluation #production #llm

5 min read

Alex @ Vibe Agent Making

Jun 24

Our Quality Scores Were Precise, Useless, and Identical

#engineering #management #evaluation #codequality

8 min read

Saurav Bhattacharya

Jun 27

Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production

#ai #evaluation #observability #testing

5 min read

Saurav Bhattacharya

Jun 20

Your Agent Didn't Break, It Drifted: Detecting Slow Decay in Autonomous Systems

#ai #agents #observability #evaluation

7 min read

keeper

Jun 19

Stop Asking 'Is GAI Here' — Ask 'At What Layer'

#ai #gai #framework #evaluation

3 min read

Saurav Bhattacharya

Jun 30

Your Model Upgrade Broke Three Workflows and the Tests Still Passed

#ai #agents #evaluation #testing

5 min read

DEV Community

# evaluation
