๐งช Evaluation & Testing¶
Frameworks and benchmarks for evaluating and testing LLM applications.
๐ Table of Contents¶
Overview¶
Evaluation tools provide: - Automated Testing: Unit tests for LLM outputs - Benchmarking: Compare models on standardized tasks - Quality Metrics: Measure accuracy, relevance, safety - Regression Testing: Catch degradations
Tools List¶
| Repo | Description | Stars |
|---|---|---|
| confident-ai/deepeval | LLM evaluation framework with 14+ metrics | |
| EleutherAI/lm-evaluation-harness | Standard framework for LLM evaluation (200+ tasks) | |
| openai/evals | Evaluation framework and registry from OpenAI | |
| patronus-ai/patronus | RAG assessment framework (RAGAS) | |
| langchain-ai/langsmith-sdk | LangChain evaluation and testing (partially open) | |
| microsoft/promptflow | Evaluation and testing for LLM apps |
Selection Guide¶
By Use Case¶
๐ General LLM Evaluation - LM Evaluation Harness - Academic benchmarks - DeepEval - Production testing - OpenAI Evals - Custom evaluations
๐ RAG-Specific - RAGAS - RAG evaluation metrics - DeepEval - RAG test cases - LangSmith - End-to-end tracing
โก Unit Testing - DeepEval - Pytest integration - PromptFlow - Flow testing - LangSmith - Dataset testing
๐ Benchmarking - LM Evaluation Harness - 200+ standard tasks - OpenAI Evals - Custom benchmarks - HELM - Holistic evaluation
By Team Size¶
Solo Developer - DeepEval - Quick setup - RAGAS - RAG focus - LM Harness - Benchmark models
Small Team (2-10) - LangSmith - Collaborative testing - PromptFlow - Visual workflows - DeepEval - CI/CD integration
Enterprise - Custom framework - Specific needs - LangSmith Enterprise - Team features - PromptFlow + Azure - Full stack
Quick Start¶
DeepEval - Production Testing¶
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
def test_customer_support_bot():
# Define test case
test_case = LLMTestCase(
input="How do I reset my password?",
actual_output="You can reset your password by clicking 'Forgot Password' on the login page.",
expected_output="Guide user to password reset flow",
context=["Password reset documentation", "Login page help"]
)
# Define metrics
relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
faithfulness_metric = FaithfulnessMetric(threshold=0.8)
# Assert
assert_test(test_case, [relevancy_metric, faithfulness_metric])
# Run with pytest
# pytest test_llm.py
Available Metrics:
from deepeval.metrics import (
AnswerRelevancyMetric, # Is answer relevant to question?
FaithfulnessMetric, # Is answer grounded in context?
ContextualRelevancyMetric, # Is retrieved context relevant?
HallucinationMetric, # Does output contain hallucinations?
ToxicityMetric, # Is output toxic?
BiasMetric, # Does output show bias?
GEval, # Custom criteria evaluation
)
Synthetic Data for Testing:
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
# Generate test cases from docs
test_cases = synthesizer.generate_goldens_from_docs(
document_paths=["./docs"],
num_goldens=50
)
# Each has: input, expected_output, context
for test_case in test_cases:
# Run your RAG pipeline
actual_output = rag_pipeline(test_case.input)
# Evaluate
metric = AnswerRelevancyMetric()
score = metric.measure(
test_case=LLMTestCase(
input=test_case.input,
actual_output=actual_output,
expected_output=test_case.expected_output
)
)
LM Evaluation Harness - Benchmarking¶
# Evaluate on MMLU (knowledge benchmark)
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--batch_size 8
# Multiple tasks
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks hendrycks_ethics,truthfulqa,gsm8k \
--output_path results/
# Custom model (API)
lm_eval --model openai-chat-completions \
--model_args model=gpt-4 \
--tasks mmlu
Python API:
from lm_eval import evaluator
results = evaluator.simple_evaluate(
model="hf",
model_args="pretrained=meta-llama/Llama-2-7b-hf",
tasks=["mmlu", "hellaswag", "arc_challenge"],
batch_size=8
)
print(f"MMLU Score: {results['results']['mmlu']['acc']}")
print(f"HellaSwag: {results['results']['hellaswag']['acc_norm']}")
200+ Available Tasks: - Knowledge: MMLU, TriviaQA, NaturalQuestions - Reasoning: BBH, GSM8K, ARC - Code: HumanEval, MBPP - Safety: ToxiGen, CrowS-Pairs - Truthfulness: TruthfulQA
RAGAS - RAG Evaluation¶
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall
)
from datasets import Dataset
# Your RAG outputs
data = {
"question": ["What is the capital of France?"],
"answer": ["Paris is the capital of France."],
"contexts": [["France is a country in Europe. Its capital is Paris."]],
"ground_truth": ["Paris"]
}
dataset = Dataset.from_dict(data)
# Evaluate
results = evaluate(
dataset,
metrics=[
faithfulness,
answer_relevancy,
context_precision,
context_recall
]
)
print(results)
# {
# "faithfulness": 0.95,
# "answer_relevancy": 0.92,
# "context_precision": 0.88,
# "context_recall": 0.90
# }
Metrics Explained: - Faithfulness: Is answer grounded in retrieved context? - Answer Relevancy: Does answer address the question? - Context Precision: Is retrieved context relevant? - Context Recall: Did we retrieve all relevant info?
OpenAI Evals¶
# Run existing eval
oaieval gpt-3.5-turbo my-eval
# Create custom eval
# evals/registry/evals/my-eval.yaml
my-eval:
class: evals.elsuite.basic.match:Match
args:
samples_jsonl: my-samples.jsonl
Custom Evaluation:
# my-eval.py
import evals
from evals.api import CompletionFn
class CustomEval(evals.Eval):
def eval_sample(self, sample, *args):
prompt = sample["input"]
expected = sample["ideal"]
# Get model output
result = self.completion_fn(
prompt=prompt,
temperature=0.0
)
# Custom scoring logic
score = self.calculate_score(result, expected)
return evals.record.record_match(
correct=score > 0.8,
expected=expected,
picked=result
)
Key Metrics¶
1. Accuracy-Based¶
# Exact Match
def exact_match(prediction, reference):
return prediction.strip().lower() == reference.strip().lower()
# F1 Score (token overlap)
def f1_score(prediction, reference):
pred_tokens = set(prediction.split())
ref_tokens = set(reference.split())
if len(pred_tokens) == 0 or len(ref_tokens) == 0:
return 0
precision = len(pred_tokens & ref_tokens) / len(pred_tokens)
recall = len(pred_tokens & ref_tokens) / len(ref_tokens)
if precision + recall == 0:
return 0
return 2 * (precision * recall) / (precision + recall)
# BLEU Score (n-gram overlap)
from nltk.translate.bleu_score import sentence_bleu
bleu = sentence_bleu([reference.split()], prediction.split())
2. Semantic Similarity¶
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
def semantic_similarity(text1, text2):
emb1 = model.encode(text1)
emb2 = model.encode(text2)
return util.cos_sim(emb1, emb2).item()
# Usage
score = semantic_similarity(
"Paris is the capital of France",
"The capital of France is Paris"
)
# 0.95 (very similar despite different wording)
3. LLM-as-Judge¶
from openai import OpenAI
client = OpenAI()
def llm_judge(question, answer, criteria="accuracy and helpfulness"):
prompt = f"""Evaluate this answer on {criteria}.
Question: {question}
Answer: {answer}
Provide:
1. Score (1-10)
2. Reasoning
Format:
Score: X
Reasoning: ...
"""
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}]
)
# Parse score and reasoning
output = response.choices[0].message.content
score = extract_score(output)
reasoning = extract_reasoning(output)
return score, reasoning
4. Safety Metrics¶
# Toxicity (using Detoxify)
from detoxify import Detoxify
model = Detoxify('original')
def toxicity_score(text):
results = model.predict(text)
return results['toxicity']
# PII Detection
import re
def contains_pii(text):
patterns = {
'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
}
for pii_type, pattern in patterns.items():
if re.search(pattern, text):
return True, pii_type
return False, None
CI/CD Integration¶
GitHub Actions¶
# .github/workflows/llm-tests.yml
name: LLM Tests
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: 3.9
- name: Install dependencies
run: |
pip install deepeval pytest
- name: Run LLM tests
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
pytest tests/test_llm.py
- name: Upload results
uses: actions/upload-artifact@v2
with:
name: test-results
path: test-results/
Pre-commit Hooks¶
# .pre-commit-config.yaml
repos:
- repo: local
hooks:
- id: llm-quality-check
name: LLM Quality Check
entry: python scripts/quality_check.py
language: python
pass_filenames: false
# scripts/quality_check.py
import sys
from deepeval.metrics import HallucinationMetric
def check_prompt_quality(prompt_file):
with open(prompt_file) as f:
prompts = f.readlines()
metric = HallucinationMetric()
for prompt in prompts:
# Test prompt for potential issues
score = test_prompt(prompt, metric)
if score < 0.8:
print(f"โ ๏ธ Low quality prompt detected: {prompt[:50]}...")
sys.exit(1)
print("โ
All prompts passed quality check")
if __name__ == "__main__":
check_prompt_quality("prompts.txt")
Regression Testing¶
Dataset-Based Testing¶
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
# Load golden dataset
import pandas as pd
golden_dataset = pd.read_csv("golden_dataset.csv")
@pytest.mark.parametrize("row", golden_dataset.itertuples())
def test_against_golden_set(row):
# Run your LLM
actual_output = my_rag_pipeline(row.question)
# Compare against golden answer
test_case = LLMTestCase(
input=row.question,
actual_output=actual_output,
expected_output=row.golden_answer,
context=row.context
)
metric = AnswerRelevancyMetric(threshold=0.8)
assert_test(test_case, [metric])
Version Comparison¶
# Compare model versions
def compare_versions(test_cases, model_v1, model_v2):
results = []
for test in test_cases:
output_v1 = model_v1.generate(test.input)
output_v2 = model_v2.generate(test.input)
score_v1 = evaluate(output_v1, test.expected)
score_v2 = evaluate(output_v2, test.expected)
results.append({
"test": test.input,
"v1_score": score_v1,
"v2_score": score_v2,
"change": score_v2 - score_v1
})
# Flag regressions
regressions = [r for r in results if r["change"] < -0.1]
improvements = [r for r in results if r["change"] > 0.1]
print(f"Regressions: {len(regressions)}")
print(f"Improvements: {len(improvements)}")
return results
Benchmark Leaderboards¶
Popular Benchmarks (2026)¶
| Benchmark | Focus | Tasks | Top Score | Model |
|---|---|---|---|---|
| MMLU | Knowledge | 57 | 89.5% | GPT-4.5 |
| BBH | Reasoning | 23 | 92.3% | Claude Opus 4 |
| HumanEval | Coding | 164 | 96.2% | GPT-4.5 |
| GSM8K | Math | 8.5K | 98.1% | Gemini Ultra 2 |
| TruthfulQA | Truthfulness | 817 | 87.4% | Claude Opus 4 |
| HellaSwag | Commonsense | 10K | 95.6% | GPT-4.5 |
Running Benchmarks¶
from lm_eval import evaluator
# Comprehensive benchmark suite
benchmark_tasks = [
"mmlu",
"bbh",
"hellaswag",
"arc_challenge",
"truthfulqa",
"gsm8k"
]
results = evaluator.simple_evaluate(
model="hf",
model_args="pretrained=meta-llama/Llama-2-70b-hf",
tasks=benchmark_tasks,
num_fewshot=5
)
# Generate report
for task, metrics in results['results'].items():
print(f"{task}: {metrics['acc']:.2%}")
A/B Testing¶
Statistical Significance¶
from scipy import stats
import numpy as np
def ab_test(variant_a_scores, variant_b_scores, alpha=0.05):
"""
H0: Both variants perform equally
H1: Variants perform differently
"""
# T-test
t_stat, p_value = stats.ttest_ind(variant_a_scores, variant_b_scores)
# Reject H0 if p < alpha
significant = p_value < alpha
# Effect size (Cohen's d)
mean_diff = np.mean(variant_b_scores) - np.mean(variant_a_scores)
pooled_std = np.sqrt((np.std(variant_a_scores)**2 + np.std(variant_b_scores)**2) / 2)
cohens_d = mean_diff / pooled_std
return {
"p_value": p_value,
"significant": significant,
"effect_size": cohens_d,
"winner": "B" if mean_diff > 0 else "A"
}
# Usage
variant_a_scores = [0.82, 0.85, 0.79, 0.88, 0.83] # Old model
variant_b_scores = [0.87, 0.90, 0.85, 0.91, 0.89] # New model
result = ab_test(variant_a_scores, variant_b_scores)
if result["significant"]:
print(f"โ
Variant {result['winner']} is significantly better")
print(f"Effect size: {result['effect_size']:.3f}")
else:
print("โ ๏ธ No significant difference detected")
Best Practices¶
1. Diverse Test Sets¶
# Don't test on training distribution only
test_sets = {
"in_distribution": normal_queries,
"edge_cases": unusual_queries,
"adversarial": tricky_queries,
"multilingual": non_english_queries
}
for name, test_set in test_sets.items():
score = evaluate(model, test_set)
print(f"{name}: {score}")
2. Human Evaluation¶
# LLM metrics aren't perfect - use human eval
import random
# Sample 100 random outputs for human review
sampled = random.sample(all_outputs, 100)
# Use tool like Argilla for annotation
import argilla as rg
dataset = rg.FeedbackDataset.from_huggingface("my-eval-set")
dataset.push_to_argilla("human-evaluation")
# Team annotates, then:
annotated = rg.load("human-evaluation")
human_accuracy = calculate_agreement(annotated)
3. Continuous Monitoring¶
# Track metrics over time
from datetime import datetime
def log_daily_metrics():
test_results = run_evaluation_suite()
mlflow.log_metrics({
"accuracy": test_results["accuracy"],
"latency_p95": test_results["latency_p95"],
"cost_per_query": test_results["cost"]
}, step=datetime.now().timestamp())
# Run daily
schedule.every().day.at("02:00").do(log_daily_metrics)
Related Resources¶
- Datasets & Data Tools - Create test datasets
- Observability - Monitor in production
- Production AI - Deploy tested models
- RAG & Knowledge - RAG-specific testing