Building Evaluation Systems with LangSmith
Learn how to create comprehensive evaluation frameworks for your AI agents using LangSmith to catch regressions and improve performance over time.
As AI systems grow more complex, robust evaluation becomes critical. LangSmith provides a comprehensive platform for evaluating, monitoring, and debugging LLM applications. In this tutorial, we'll explore how to build evaluation systems that help you catch regressions and continuously improve your agents' performance.
Why Evaluation Matters
LLM-based applications present unique challenges for testing and evaluation:
- Non-deterministic outputs: Responses vary even with the same inputs
- Multiple dimensions of quality: Correctness, helpfulness, harmlessness, etc.
- Contextual appropriateness: Responses must be situationally relevant
- Regression risks: New model versions or prompt changes can break functionality
LangSmith helps address these challenges with structured evaluation frameworks.
Setting Up LangSmith
First, let's set up LangSmith by installing the library and configuring your API key:
# Install LangSmith
pip install langsmith
# Set environment variables
import os
os.environ["LANGCHAIN_API_KEY"] = "your-api-key" # Get this from langsmith.com
os.environ["LANGCHAIN_PROJECT"] = "agent-evaluation" # Your project name
# Import the LangSmith client
from langsmith import Client
client = Client()
Creating Evaluation Datasets
The foundation of any evaluation system is a good dataset. Let's create one for our agent:
# Define sample inputs and expected outputs
examples = [
    {
        "input": {"question": "What is the capital of France?"},
        "expected_output": {"answer": "Paris"},
    },
    {
        "input": {"question": "Calculate the square root of 144."},
        "expected_output": {"answer": "12"},
    },
    {
        "input": {"question": "Summarize the main causes of climate change."},
        "expected_output": {
            "answer": "The main causes include burning fossil fuels, deforestation, industrial processes, and agricultural practices."
        },
    },
]

# Create a dataset in LangSmith
dataset_name = "agent-evaluation-dataset"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Evaluation dataset for testing our agent's capabilities",
)

# Add examples to the dataset
for example in examples:
    client.create_example(
        inputs=example["input"],
        outputs=example["expected_output"],
        dataset_id=dataset.id,
    )
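To double-check what was uploaded, you can read the examples back from LangSmith. Here is a quick sketch using the client's list_examples helper (argument names may differ slightly across SDK versions):
# Read the examples back to verify the dataset contents.
for ex in client.list_examples(dataset_id=dataset.id):
    print(ex.inputs, "->", ex.outputs)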
Basic Evaluation with Built-in Evaluators
LangSmith provides several built-in evaluators. Let's use them to evaluate our agent:
from langchain.smith import RunEvalConfig
from langchain.chat_models import ChatOpenAI
from langchain.agents import AgentExecutor, create_react_agent
from langchain.tools import Tool

# Assuming we have an agent set up
llm = ChatOpenAI(temperature=0)
tools = [...]  # Your agent's tools
prompt_template = ...  # Your ReAct prompt template
agent = create_react_agent(llm, tools, prompt_template)
agent_executor = AgentExecutor(agent=agent, tools=tools)

# Configure basic evaluation
eval_config = RunEvalConfig(
    evaluators=[
        "correctness",  # Checks if the response is factually correct
        "qa",           # Evaluates question-answering performance
        "criteria",     # Evaluates based on custom criteria
    ],
    custom_evaluators=[],
    reference_key="answer",  # Key in the dataset outputs containing the reference answer
)

# Run the evaluation on our dataset
eval_results = client.run_evaluation(
    dataset_name=dataset_name,
    llm_or_chain=agent_executor,
    evaluation=eval_config,
)

# View the results
for result in eval_results:
    print(f"Example: {result.example.inputs}")
    print(f"Score: {result.evaluations[0].score}")
    print(f"Feedback: {result.evaluations[0].feedback}")
Custom Evaluators for Domain-Specific Needs
While built-in evaluators are useful, custom evaluators let you address domain-specific requirements:
# Create a custom evaluator for factual accuracy
def factual_accuracy_evaluator(run, example):
    """Evaluates the factual accuracy of the model output."""
    # Extract the prediction and reference
    prediction = run.outputs.get("output", "")
    # Dataset outputs were stored as {"answer": ...} when we created the examples
    reference = example.outputs.get("answer", "")

    # Custom evaluation logic
    # Here we're using a simple LLM-as-judge check, but you could use a more sophisticated approach
    evaluation_llm = ChatOpenAI(model="gpt-4", temperature=0)
    prompt = f"""
Evaluate the factual accuracy of the following response to a question.

Question: {example.inputs.get('question')}
Predicted answer: {prediction}
Reference answer: {reference}

Rate the factual accuracy on a scale from 1 to 5, where:
1 = Completely inaccurate, contains major factual errors
2 = Mostly inaccurate, contains significant factual errors
3 = Partially accurate, contains some factual errors
4 = Mostly accurate, contains minor factual errors
5 = Completely accurate, no factual errors

Provide your rating as a single number followed by a brief explanation.
"""
    evaluation_result = evaluation_llm.predict(prompt)

    try:
        # The prompt asks for the rating first, so take the leading token as the score
        score_text = evaluation_result.split()[0].strip(".:")
        score = float(score_text)
        # Normalize from the 1-5 scale to a 0-1 range
        normalized_score = (score - 1) / 4
        return {
            "score": normalized_score,
            "reasoning": evaluation_result,
        }
    except (ValueError, IndexError):
        # Fallback if parsing fails
        return {
            "score": 0.5,  # Neutral score
            "reasoning": f"Failed to parse score. Raw evaluation: {evaluation_result}",
        }
# Add the custom evaluator to our config
eval_config = RunEvalConfig(
    evaluators=[
        "correctness",
        "qa",
    ],
    custom_evaluators=[factual_accuracy_evaluator],
)
Comprehensive Evaluation Framework
Let's build a more comprehensive evaluation framework that addresses multiple dimensions:
from langchain.smith import RunEvalConfig

# Define evaluation dimensions
eval_config = RunEvalConfig(
    custom_evaluators=[
        factual_accuracy_evaluator,
        helpfulness_evaluator,   # sketched below; define the others the same way
        harmlessness_evaluator,
        reasoning_evaluator,
        tool_usage_evaluator,
    ],
    # Define criteria for LLM-based evaluation
    criteria={
        "relevance": "Does the response directly address the user's question?",
        "conciseness": "Is the response concise without unnecessary information?",
        "completeness": "Does the response fully answer all aspects of the question?",
        "clarity": "Is the response clear and easy to understand?",
    },
)

# Run a full evaluation across all dimensions
results = client.run_evaluation(
    dataset_name=dataset_name,
    llm_or_chain=agent_executor,
    evaluation=eval_config,
    project_name="comprehensive-agent-eval",
)
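The additional evaluators referenced above follow the same pattern as factual_accuracy_evaluator: extract the prediction, apply your scoring logic, and return a score with reasoning. As an illustration only, here is one possible helpfulness_evaluator using the same LLM-as-judge approach; the prompt wording and scale are assumptions you would tune for your domain:
# One possible helpfulness evaluator, following the same pattern as factual_accuracy_evaluator.
def helpfulness_evaluator(run, example):
    """Scores how helpful the response is to the user, on a normalized 0-1 scale."""
    prediction = run.outputs.get("output", "")
    question = example.inputs.get("question", "")

    judge = ChatOpenAI(model="gpt-4", temperature=0)
    prompt = f"""
Rate how helpful the following answer is for the question, on a scale from 1 (not helpful) to 5 (very helpful).
Respond with a single number followed by a brief explanation.

Question: {question}
Answer: {prediction}
"""
    result = judge.predict(prompt)
    try:
        score = float(result.split()[0].strip(".:"))
        return {"score": (score - 1) / 4, "reasoning": result}
    except (ValueError, IndexError):
        return {"score": 0.5, "reasoning": f"Failed to parse score: {result}"}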
Analyzing Evaluation Results
LangSmith provides tools to analyze evaluation results and identify patterns:
# Get aggregate metrics
metrics = client.get_evaluation_metrics(project_name="comprehensive-agent-eval")
print(f"Average accuracy: {metrics.get('factual_accuracy', {}).get('mean', 0)}")
print(f"Lowest scoring category: {min(metrics.items(), key=lambda x: x[1].get('mean', 1))[0]}")

# Identify runs that scored poorly
failing_runs = client.list_runs(
    project_name="comprehensive-agent-eval",
    filter={
        "has_feedback": True,
        "feedback": {
            "score": {"lt": 0.7}  # Runs scoring below 0.7
        },
    },
)

# Analyze patterns in failures
for run in failing_runs:
    print(f"Failed example: {run.inputs}")
    print(f"Tags: {run.tags}")
    feedback = client.list_feedback(run_ids=[run.id])
    for fb in feedback:
        print(f"{fb.key}: {fb.score} - {fb.comment}")
Continuous Evaluation Workflows
To make evaluation part of your development process, set up continuous evaluation workflows:
# Continuous evaluation script (can be run in a CI/CD pipeline)
import argparse
import sys

from langsmith import Client
from langchain.chat_models import ChatOpenAI
from langchain.agents import AgentExecutor, create_react_agent

def continuous_evaluation(model_version, dataset_name, threshold=0.8):
    # Initialize client
    client = Client()

    # Set up agent with the specific model version
    llm = ChatOpenAI(model=model_version, temperature=0)
    tools = [...]  # Your agent's tools
    prompt_template = ...  # Your ReAct prompt template
    agent = create_react_agent(llm, tools, prompt_template)
    agent_executor = AgentExecutor(agent=agent, tools=tools)

    # Run evaluation (eval_config is the configuration defined earlier)
    results = client.run_evaluation(
        dataset_name=dataset_name,
        llm_or_chain=agent_executor,
        evaluation=eval_config,
        project_name=f"eval-{model_version}",
    )

    # Calculate average score
    scores = [
        result.evaluations[0].score
        for result in results
        if result.evaluations
    ]
    avg_score = sum(scores) / len(scores) if scores else 0

    # Check against threshold
    if avg_score < threshold:
        print(f"Evaluation failed: {avg_score} < {threshold}")
        return False
    else:
        print(f"Evaluation passed: {avg_score} >= {threshold}")
        return True

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run continuous evaluation")
    parser.add_argument("--model", type=str, required=True, help="Model version")
    parser.add_argument("--dataset", type=str, required=True, help="Dataset name")
    parser.add_argument("--threshold", type=float, default=0.8, help="Pass threshold")
    args = parser.parse_args()

    success = continuous_evaluation(args.model, args.dataset, args.threshold)
    if not success:
        sys.exit(1)  # Fail the CI/CD pipeline
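If your pipeline already runs pytest, the same gate can be expressed as a test instead of a standalone script. Here is a minimal sketch, assuming the continuous_evaluation function above lives in a module named continuous_eval (the module name and model identifier are placeholders):
# test_agent_eval.py -- hypothetical pytest wrapper around the script above
from continuous_eval import continuous_evaluation  # assumed module name

def test_agent_meets_quality_bar():
    passed = continuous_evaluation(
        model_version="gpt-4",                      # substitute the model you actually ship
        dataset_name="agent-evaluation-dataset",
        threshold=0.8,
    )
    assert passed, "Agent scored below the evaluation threshold"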
Regression Testing with LangSmith
Regression testing is crucial when updating prompts, models, or tools:
def compare_versions(old_version, new_version, dataset_name):
    """Compare performance between two versions of your agent."""
    client = Client()

    # tools, old_prompt, and new_prompt are assumed to be defined elsewhere

    # Set up old version
    old_llm = ChatOpenAI(model=old_version, temperature=0)
    old_agent = create_react_agent(old_llm, tools, old_prompt)
    old_executor = AgentExecutor(agent=old_agent, tools=tools)

    # Set up new version
    new_llm = ChatOpenAI(model=new_version, temperature=0)
    new_agent = create_react_agent(new_llm, tools, new_prompt)
    new_executor = AgentExecutor(agent=new_agent, tools=tools)

    # Run evaluations
    old_results = client.run_evaluation(
        dataset_name=dataset_name,
        llm_or_chain=old_executor,
        evaluation=eval_config,
        project_name=f"eval-{old_version}",
    )
    new_results = client.run_evaluation(
        dataset_name=dataset_name,
        llm_or_chain=new_executor,
        evaluation=eval_config,
        project_name=f"eval-{new_version}",
    )

    # Compute metrics
    old_scores = [r.evaluations[0].score for r in old_results if r.evaluations]
    new_scores = [r.evaluations[0].score for r in new_results if r.evaluations]
    old_avg = sum(old_scores) / len(old_scores) if old_scores else 0
    new_avg = sum(new_scores) / len(new_scores) if new_scores else 0

    # Compare and report
    print(f"Old version ({old_version}): {old_avg:.4f}")
    print(f"New version ({new_version}): {new_avg:.4f}")
    print(f"Difference: {new_avg - old_avg:.4f}")

    # List examples where performance changed significantly
    # (zip() pairs results by position, which assumes both runs preserve the dataset order)
    for old, new in zip(old_results, new_results):
        old_score = old.evaluations[0].score if old.evaluations else 0
        new_score = new.evaluations[0].score if new.evaluations else 0
        if abs(new_score - old_score) > 0.2:
            print(f"Significant change on: {old.example.inputs}")
            print(f"  Old score: {old_score}")
            print(f"  New score: {new_score}")
            print(f"  Old output: {old.outputs.get('output', '')[:100]}...")
            print(f"  New output: {new.outputs.get('output', '')[:100]}...")
Best Practices for Evaluation Systems
- Diverse datasets: Include edge cases, common queries, and representative examples
- Multiple dimensions: Evaluate across different quality aspects (accuracy, clarity, etc.)
- Automated workflows: Make evaluation part of your CI/CD pipeline
- Continuous improvement: Add failing examples to your dataset to improve coverage (see the sketch after this list)
- Human-in-the-loop: Combine automated evaluation with human review for critical systems
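Closing the loop on the continuous-improvement practice uses the same client calls from earlier: when a run fails evaluation, capture its inputs, along with a human-reviewed reference answer, as a new dataset example. A minimal sketch, assuming you have the failing run object and a corrected answer at hand:
# Turn a failing run into a new regression example in the evaluation dataset.
def add_failure_to_dataset(client, failing_run, corrected_answer, dataset_name):
    dataset = client.read_dataset(dataset_name=dataset_name)
    client.create_example(
        inputs=failing_run.inputs,              # the question that tripped up the agent
        outputs={"answer": corrected_answer},   # human-reviewed reference answer
        dataset_id=dataset.id,
    )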
Conclusion
Building robust evaluation systems with LangSmith provides a foundation for developing reliable AI agents. By implementing comprehensive evaluation across multiple dimensions, you can catch regressions early, understand your system's limitations, and continuously improve performance over time.
Remember that evaluation is not a one-time activity but an ongoing process that should evolve with your application. As you discover new edge cases or failure modes, incorporate them into your evaluation framework to build increasingly reliable AI systems.