Video
11 minutes
May 27, 2025

How to Solve the #1 AI Agent Production Blocker

In this video, LangChain CEO Harrison Chase discusses the critical role of evaluations (evals) in improving LLM agent quality and highlights offline, online, and in-the-loop evals as essential phases in the development cycle.

Artificial Intelligence · Software Engineering · Agent Evaluation · LLM-as-a-Judge · LangChain · Evals

Takeaways

  • Developers cite quality as the top barrier to moving agents from prototype to production.
  • Evals should be embedded throughout an agent’s lifecycle: offline, online, and in-the-loop.
  • Offline evals help benchmark changes before deployment, while online evals monitor live performance.
  • In-the-loop evals enable real-time agent self-correction but come with latency and cost trade-offs.
  • Custom, use-case-specific data and evaluators yield better results than academic datasets.

Summary

Quality is the leading blocker preventing LLM agents from moving into production, according to a developer survey. To address this, Chase outlines three phases of evaluation (offline, online, and in-the-loop) that should be integrated into the development lifecycle to ensure consistent improvement and reliability of AI agents. Offline evals assess agent performance against static, pre-defined datasets and enable developers to track iterative improvements. Online evals, by contrast, analyze real production data to continuously monitor application behavior in the wild. In-the-loop evals go a step further by allowing the system to correct itself as it operates, offering the highest quality assurance at the cost of latency and resource usage.
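To make the offline phase concrete, the sketch below runs an agent over a fixed dataset and reports a benchmark score that can be compared across releases. It is a minimal illustration, not LangSmith's API: run_agent, the toy dataset, and the exact-match scorer are hypothetical placeholders for a real agent, a domain-specific dataset, and a domain-specific evaluator.

# Minimal offline-eval sketch. All names here are illustrative placeholders.

DATASET = [
    {"input": "What plan is the Acme account on?", "expected": "Enterprise"},
    {"input": "When does their contract renew?", "expected": "2026-01-15"},
]

def run_agent(question: str) -> str:
    # Placeholder: swap in the real agent call under test.
    canned = {
        "What plan is the Acme account on?": "Enterprise",
        "When does their contract renew?": "2026-01-15",
    }
    return canned.get(question, "I don't know")

def exact_match(output: str, expected: str) -> bool:
    # Deterministic evaluator; real projects often need softer checks.
    return output.strip().lower() == expected.strip().lower()

def run_offline_eval() -> float:
    scores = []
    for example in DATASET:
        output = run_agent(example["input"])
        scores.append(1.0 if exact_match(output, example["expected"]) else 0.0)
    # Average score becomes the benchmark tracked across agent changes.
    return sum(scores) / len(scores)

if __name__ == "__main__":
    print(f"offline eval score: {run_offline_eval():.2f}")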

Evaluations require two core components: data and evaluators. Effective evals depend on building domain-specific datasets and custom evaluators, as general-purpose or academic datasets often fail to represent real-world use. LangSmith, LangChain’s observability and evaluation toolkit, offers tools to streamline the creation and tracking of evals, from dataset construction to real-time monitoring. Various types of evaluators are discussed, including deterministic code checks, LLM-as-a-judge scoring models, and human feedback mechanisms.
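As a rough illustration of the LLM-as-a-judge idea, the sketch below grades an answer against a reference on a 1-to-5 rubric. call_judge_model, the prompt, and the scale are assumptions made for illustration; they are not LangSmith's built-in evaluators.

# Minimal LLM-as-a-judge sketch. The judge call is stubbed out.

JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Reference answer: {reference}
Agent answer: {answer}

Score the agent answer from 1 (wrong) to 5 (fully correct and grounded).
Reply with the number only."""

def call_judge_model(prompt: str) -> str:
    # Placeholder: replace with a real chat-completion call to a judge model.
    return "4"

def llm_judge(question: str, reference: str, answer: str) -> int:
    prompt = JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
    raw = call_judge_model(prompt)
    try:
        # Clamp to the rubric range in case the judge returns something off-scale.
        return max(1, min(5, int(raw.strip())))
    except ValueError:
        return 1  # treat unparseable judge output as a failing score

print(llm_judge("What plan is Acme on?", "Enterprise", "They are on the Enterprise plan."))

The clamp and the fallback score guard against judge outputs that do not parse as a number, a common failure mode that calibration work on LLM-as-a-judge evaluators is meant to address.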

LangChain is also releasing open-source evaluation utilities that support common scenarios such as code linting, RAG, extraction, and tool use, and is introducing conversation simulation tools for multi-turn interaction testing. For complex evaluations, especially LLM-as-a-judge models, the team is launching a private preview of features inspired by recent work (e.g., Eugene Yan's "AlignEval") to help developers calibrate and trust these evaluators.
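In the same spirit as the multi-turn simulation tools described above, a conversation test can be sketched as a scripted simulated user driving the agent for a few turns, with an evaluator run over the resulting transcript. simulated_user and agent_reply below are hypothetical stand-ins, not LangChain's simulation API.

# Rough multi-turn simulation sketch with a scripted user and a stubbed agent.

def simulated_user(turn: int) -> str:
    script = ["I was double-charged last month.", "Yes, refund the duplicate charge."]
    return script[turn] if turn < len(script) else "Thanks, that's all."

def agent_reply(history: list[dict]) -> str:
    # Placeholder: replace with the real agent invoked on the running history.
    return "I can help with that. I've issued a refund for the duplicate charge."

def simulate(max_turns: int = 3) -> list[dict]:
    history: list[dict] = []
    for turn in range(max_turns):
        history.append({"role": "user", "content": simulated_user(turn)})
        history.append({"role": "assistant", "content": agent_reply(history)})
    return history

transcript = simulate()
# Simple deterministic check over the whole conversation; an LLM judge could
# replace it for softer criteria like tone or policy compliance.
assert any("refund" in m["content"].lower() for m in transcript if m["role"] == "assistant")
print(f"simulated {len(transcript) // 2} turns; refund was handled")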

Overall, evaluation is positioned not as a one-time task but as a continuous, evolving process essential for building robust and reliable LLM applications.

Job Profiles

Chief Technology Officer (CTO) · Data Analyst · Artificial Intelligence Engineer · Machine Learning Engineer · Academic/Researcher

Contributors

ABB
Content rating = A
  • Relies on reputable sources
  • Presents an objective viewpoint
  • Shows hints of original thinking
  • Adequate structure
Author rating = B
  • Has professional experience in the subject matter area
  • Shows future impact potential
  • Occasionally cited by or featured in subject-specific media
Source rating = B
  • Professional contributors
  • Acceptable editorial standards
  • Industry leader blog