LangSmith · May 10, 2023

Building Evaluation Systems with LangSmith

Learn how to create comprehensive evaluation frameworks for your AI agents using LangSmith to catch regressions and improve performance over time.

As AI systems grow more complex, robust evaluation becomes critical. LangSmith provides a comprehensive platform for evaluating, monitoring, and debugging LLM applications. In this tutorial, we'll explore how to build evaluation systems that help you catch regressions and continuously improve your agents' performance.

Why Evaluation Matters

LLM-based applications present unique challenges for testing and evaluation:

  • Non-deterministic outputs: Responses vary even with the same inputs
  • Multiple dimensions of quality: Correctness, helpfulness, harmlessness, etc.
  • Contextual appropriateness: Responses must be situationally relevant
  • Regression risks: New model versions or prompt changes can break functionality

LangSmith helps address these challenges with structured evaluation frameworks.

Setting Up LangSmith

First, let's set up LangSmith by installing the library and configuring your API key:

# Install LangSmith
pip install langsmith

# Set environment variables
import os
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"  # Get this from langsmith.com
os.environ["LANGCHAIN_PROJECT"] = "agent-evaluation"  # Your project name

# Import the LangSmith client
from langsmith import Client
client = Client()
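
If you also want the agent's own runs traced to LangSmith while you iterate (not just the evaluation results), enable tracing as well; at the time of writing this is controlled by an environment variable:

# Enable tracing so your agent runs also show up in the LangSmith UI
os.environ["LANGCHAIN_TRACING_V2"] = "true"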

Creating Evaluation Datasets

The foundation of any evaluation system is a good dataset. Let's create one for our agent:

# Define sample inputs and expected outputs
examples = [
    {
        "input": {
            "question": "What is the capital of France?"
        },
        "expected_output": {
            "answer": "Paris"
        }
    },
    {
        "input": {
            "question": "Calculate the square root of 144."
        },
        "expected_output": {
            "answer": "12"
        }
    },
    {
        "input": {
            "question": "Summarize the main causes of climate change."
        },
        "expected_output": {
            "answer": "The main causes include burning fossil fuels, deforestation, industrial processes, and agricultural practices."
        }
    }
]

# Create a dataset in LangSmith
dataset_name = "agent-evaluation-dataset"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Evaluation dataset for testing our agent's capabilities"
)

# Add examples to the dataset
for example in examples:
    client.create_example(
        inputs=example["input"],
        outputs=example["expected_output"],
        dataset_id=dataset.id
    )
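
It's worth sanity-checking that the examples landed in the dataset. Assuming your version of the SDK exposes the client's list_examples method, a quick check might look like this:

# Read the examples back to confirm the dataset was populated correctly
for ex in client.list_examples(dataset_id=dataset.id):
    print(ex.inputs, "->", ex.outputs)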

Basic Evaluation with Built-in Evaluators

LangSmith provides several built-in evaluators. Let's use them to evaluate our agent:

from langchain.smith import RunEvalConfig
from langchain.chat_models import ChatOpenAI
from langchain.agents import AgentExecutor, create_react_agent
from langchain.tools import Tool

# Assuming we have an agent set up
llm = ChatOpenAI(temperature=0)
tools = [...]  # Your agent's tools
agent = create_react_agent(llm, tools, prompt_template)  # prompt_template: your agent's ReAct prompt
agent_executor = AgentExecutor(agent=agent, tools=tools)

# Configure basic evaluation
eval_config = RunEvalConfig(
    evaluators=[
        "correctness",  # Checks if the response is factually correct
        "qa",           # Evaluates question-answering performance
        "criteria",     # Evaluates based on custom criteria
    ],
    custom_evaluators=[],
    reference_key="answer"  # Key in the dataset example outputs that holds the reference answer
)

# Run the evaluation on our dataset
eval_results = client.run_evaluation(
    dataset_name=dataset_name,
    llm_or_chain=agent_executor,
    evaluation=eval_config
)

# View the results
for result in eval_results:
    print(f"Example: {result.example.inputs}")
    print(f"Score: {result.evaluations[0].score}")
    print(f"Feedback: {result.evaluations[0].feedback}")

Custom Evaluators for Domain-Specific Needs

While built-in evaluators are useful, custom evaluators let you address domain-specific requirements:

# Create a custom evaluator for factual accuracy
def factual_accuracy_evaluator(run, example):
    """Evaluates the factual accuracy of the model output."""
    # Extract the prediction and reference
    prediction = run.outputs.get("output", "")
    reference = example.outputs.get("answer", "")  # Matches the "answer" key used when creating the dataset
    
    # Custom evaluation logic
    # Here we're using a simple check, but you could use a more sophisticated approach
    evaluation_llm = ChatOpenAI(model="gpt-4", temperature=0)
    
    prompt = f"""
    Evaluate the factual accuracy of the following response to a question.
    
    Question: {example.inputs.get('question')}
    Predicted answer: {prediction}
    Reference answer: {reference}
    
    Rate the factual accuracy on a scale from 1 to 5, where:
    1 = Completely inaccurate, contains major factual errors
    2 = Mostly inaccurate, contains significant factual errors
    3 = Partially accurate, contains some factual errors
    4 = Mostly accurate, contains minor factual errors
    5 = Completely accurate, no factual errors
    
    Provide your rating as a single number followed by a brief explanation.
    """
    
    evaluation_result = evaluation_llm.predict(prompt)
    
    try:
        # Extract the leading rating (the prompt asks for a single number first)
        score_text = evaluation_result.split()[0].strip(".:")
        score = float(score_text)
        # Normalize to 0-1 range
        normalized_score = (score - 1) / 4
        
        return {
            "score": normalized_score,
            "reasoning": evaluation_result
        }
    except (ValueError, IndexError):
        # Fallback if parsing fails
        return {
            "score": 0.5,  # Neutral score
            "reasoning": f"Failed to parse score. Raw evaluation: {evaluation_result}"
        }
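
Before wiring the evaluator into a config, it can be useful to smoke-test it locally. The sketch below uses SimpleNamespace objects as lightweight stand-ins for LangSmith's Run and Example objects, purely to exercise the function:

from types import SimpleNamespace

# Stand-ins that mimic the attributes the evaluator reads
fake_run = SimpleNamespace(outputs={"output": "The capital of France is Paris."})
fake_example = SimpleNamespace(
    inputs={"question": "What is the capital of France?"},
    outputs={"answer": "Paris"}
)

result = factual_accuracy_evaluator(fake_run, fake_example)
print(result["score"], result["reasoning"])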

# Add the custom evaluator to our config
eval_config = RunEvalConfig(
    evaluators=[
        "correctness",
        "qa",
    ],
    custom_evaluators={
        "factual_accuracy": factual_accuracy_evaluator
    }
)

Comprehensive Evaluation Framework

Let's build a more comprehensive evaluation framework that addresses multiple dimensions:

from langchain.smith import RunEvalConfig

# Define evaluation dimensions
eval_config = RunEvalConfig(
    custom_evaluators={
        "factual_accuracy": factual_accuracy_evaluator,
        "helpfulness": helpfulness_evaluator,  # Define these similar to factual_accuracy
        "harmlessness": harmlessness_evaluator,
        "reasoning": reasoning_evaluator,
        "tool_usage": tool_usage_evaluator,
    },
    # Define criteria for LLM-based evaluation
    criteria={
        "relevance": "Does the response directly address the user's question?",
        "conciseness": "Is the response concise without unnecessary information?",
        "completeness": "Does the response fully answer all aspects of the question?",
        "clarity": "Is the response clear and easy to understand?",
    }
)

# Run a full evaluation across all dimensions
results = client.run_evaluation(
    dataset_name=dataset_name,
    llm_or_chain=agent_executor,
    evaluation=eval_config,
    project_name="comprehensive-agent-eval"
)
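
The extra evaluators referenced above follow the same pattern as factual_accuracy_evaluator. As an illustration, here is a rough sketch of what a helpfulness_evaluator might look like; the rubric wording is just an example:

def helpfulness_evaluator(run, example):
    """Evaluates how helpful the model output is to the user."""
    prediction = run.outputs.get("output", "")
    question = example.inputs.get("question", "")

    evaluation_llm = ChatOpenAI(model="gpt-4", temperature=0)
    prompt = f"""
    Rate how helpful the following response is to the user's question
    on a scale from 1 to 5, where 1 = not helpful at all and
    5 = fully addresses the user's need.

    Question: {question}
    Response: {prediction}

    Provide your rating as a single number followed by a brief explanation.
    """
    evaluation_result = evaluation_llm.predict(prompt)

    try:
        score = float(evaluation_result.split()[0].strip(".:"))
        return {"score": (score - 1) / 4, "reasoning": evaluation_result}
    except (ValueError, IndexError):
        return {"score": 0.5, "reasoning": f"Failed to parse score: {evaluation_result}"}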

Analyzing Evaluation Results

LangSmith provides tools to analyze evaluation results and identify patterns:

# Get aggregate metrics
metrics = client.get_evaluation_metrics(project_name="comprehensive-agent-eval")
print(f"Average accuracy: {metrics.get('factual_accuracy', {}).get('mean', 0)}")
print(f"Lowest scoring category: {min(metrics.items(), key=lambda x: x[1].get('mean', 1))[0]}")

# Identify failing examples
failing_examples = client.list_runs(
    project_name="comprehensive-agent-eval",
    filter={
        "has_feedback": True,
        "feedback": {
            "score": {"lt": 0.7}  # Examples scoring below 0.7
        }
    }
)

# Analyze patterns in failures
for example in failing_examples:
    print(f"Failed example: {example.inputs}")
    print(f"Tags: {example.tags}")
    feedback = client.list_feedback(run_id=example.id)
    for fb in feedback:
        print(f"{fb.key}: {fb.score} - {fb.comment}")

Continuous Evaluation Workflows

To make evaluation part of your development process, set up continuous evaluation workflows:

# Continuous evaluation script (can be run in CI/CD pipeline)
import argparse
from langsmith import Client
from langchain.chat_models import ChatOpenAI
from langchain.agents import AgentExecutor, create_react_agent

def continuous_evaluation(model_version, dataset_name, threshold=0.8):
    # Initialize client
    client = Client()
    
    # Set up agent with specific model version
    llm = ChatOpenAI(model=model_version, temperature=0)
    tools = [...]  # Your agent's tools
    agent = create_react_agent(llm, tools, prompt_template)
    agent_executor = AgentExecutor(agent=agent, tools=tools)
    
    # Run evaluation (reuses an eval_config like the one defined earlier)
    results = client.run_evaluation(
        dataset_name=dataset_name,
        llm_or_chain=agent_executor,
        evaluation=eval_config,
        project_name=f"eval-{model_version}"
    )
    
    # Calculate average score
    scores = [
        result.evaluations[0].score 
        for result in results 
        if result.evaluations
    ]
    avg_score = sum(scores) / len(scores) if scores else 0
    
    # Check against threshold
    if avg_score < threshold:
        print(f"Evaluation failed: {avg_score} < {threshold}")
        return False
    else:
        print(f"Evaluation passed: {avg_score} >= {threshold}")
        return True

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run continuous evaluation")
    parser.add_argument("--model", type=str, required=True, help="Model version")
    parser.add_argument("--dataset", type=str, required=True, help="Dataset name")
    parser.add_argument("--threshold", type=float, default=0.8, help="Pass threshold")
    args = parser.parse_args()
    
    success = continuous_evaluation(args.model, args.dataset, args.threshold)
    if not success:
        exit(1)  # Fail the CI/CD pipeline

Regression Testing with LangSmith

Regression testing is crucial when updating prompts, models, or tools:

def compare_versions(old_version, new_version, dataset_name):
    """Compare performance between two versions of your agent."""
    client = Client()
    
    # Set up old version
    old_llm = ChatOpenAI(model=old_version, temperature=0)
    old_agent = create_react_agent(old_llm, tools, old_prompt)
    old_executor = AgentExecutor(agent=old_agent, tools=tools)
    
    # Set up new version
    new_llm = ChatOpenAI(model=new_version, temperature=0)
    new_agent = create_react_agent(new_llm, tools, new_prompt)
    new_executor = AgentExecutor(agent=new_agent, tools=tools)
    
    # Run evaluations
    old_results = client.run_evaluation(
        dataset_name=dataset_name,
        llm_or_chain=old_executor,
        evaluation=eval_config,
        project_name=f"eval-{old_version}"
    )
    
    new_results = client.run_evaluation(
        dataset_name=dataset_name,
        llm_or_chain=new_executor,
        evaluation=eval_config,
        project_name=f"eval-{new_version}"
    )
    
    # Compute metrics
    old_scores = [r.evaluations[0].score for r in old_results if r.evaluations]
    new_scores = [r.evaluations[0].score for r in new_results if r.evaluations]
    
    old_avg = sum(old_scores) / len(old_scores) if old_scores else 0
    new_avg = sum(new_scores) / len(new_scores) if new_scores else 0
    
    # Compare and report
    print(f"Old version ({old_version}): {old_avg:.4f}")
    print(f"New version ({new_version}): {new_avg:.4f}")
    print(f"Difference: {new_avg - old_avg:.4f}")
    
    # List examples where performance changed significantly
    for old, new in zip(old_results, new_results):
        old_score = old.evaluations[0].score if old.evaluations else 0
        new_score = new.evaluations[0].score if new.evaluations else 0
        
        if abs(new_score - old_score) > 0.2:
            print(f"Significant change on: {old.example.inputs}")
            print(f"  Old score: {old_score}")
            print(f"  New score: {new_score}")
            print(f"  Old output: {old.outputs.get('output', '')[:100]}...")
            print(f"  New output: {new.outputs.get('output', '')[:100]}...")

Best Practices for Evaluation Systems

  1. Diverse datasets: Include edge cases, common queries, and representative examples
  2. Multiple dimensions: Evaluate across different quality aspects (accuracy, clarity, etc.)
  3. Automated workflows: Make evaluation part of your CI/CD pipeline
  4. Continuous improvement: Add failing examples to your dataset to improve coverage (see the sketch after this list)
  5. Human-in-the-loop: Combine automated evaluation with human review for critical systems
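
For point 4, closing the loop can be as simple as promoting failing runs into your evaluation dataset with the same create_example call used earlier. A small sketch (the helper name and the corrected_answer argument are hypothetical; adapt the keys to your schema):

# Turn a failing run into a new dataset example for future regression tests
def add_failure_to_dataset(run, corrected_answer, dataset_id):
    client.create_example(
        inputs=run.inputs,                     # The question that tripped up the agent
        outputs={"answer": corrected_answer},  # Human-reviewed reference answer
        dataset_id=dataset_id
    )

# Usage with a run from the failing_examples list above:
# add_failure_to_dataset(run, corrected_answer="...", dataset_id=dataset.id)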

Conclusion

Building robust evaluation systems with LangSmith provides a foundation for developing reliable AI agents. By implementing comprehensive evaluation across multiple dimensions, you can catch regressions early, understand your system's limitations, and continuously improve performance over time.

Remember that evaluation is not a one-time activity but an ongoing process that should evolve with your application. As you discover new edge cases or failure modes, incorporate them into your evaluation framework to build increasingly reliable AI systems.

LangSmith · Evaluation · Testing · Advanced