Takeaways
- Developers cite quality as the top barrier to moving agents from prototype to production.
- Evals should be embedded throughout an agent’s lifecycle: offline, online, and in-the-loop.
- Offline evals help benchmark changes before deployment, while online evals monitor live performance.
- In-the-loop evals enable real-time agent self-correction but come with latency and cost trade-offs.
- Custom, use-case-specific datasets and evaluators yield better results than general-purpose academic datasets.
Summary
Quality is the leading blocker preventing LLM agents from moving into production, according to a developer survey. To address this, the speaker outlines three phases of evaluation—offline, online, and in-the-loop—that should be integrated into the development lifecycle to ensure consistent improvement and reliability of AI agents. Offline evals assess agent performance against static, pre-defined datasets and enable developers to track iterative improvements. Online evals, by contrast, involve real-time analysis of actual production data to continuously monitor application behavior in the wild. In-the-loop evals go a step further by allowing the system to self-correct as it operates, offering the highest quality assurance at the cost of latency and resource usage.
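To make the offline phase concrete, here is a minimal, framework-agnostic sketch of the pattern described above: run the agent over a fixed dataset, score each output with an evaluator, and track the aggregate score per agent version. The dataset, `run_agent`, and `exact_match` names are illustrative placeholders, not part of any library mentioned in the talk.

```python
# Minimal offline-eval sketch: score an agent against a static dataset
# and compare the aggregate score across agent versions.
from statistics import mean

examples = [
    {"input": "Refund order #1234", "expected": "refund_issued"},
    {"input": "Where is my package?", "expected": "tracking_link_sent"},
]

def run_agent(prompt: str) -> str:
    """Stand-in for the agent under test (e.g., an LLM + tools pipeline)."""
    return "refund_issued" if "refund" in prompt.lower() else "tracking_link_sent"

def exact_match(output: str, expected: str) -> float:
    """Deterministic evaluator: 1.0 if the agent's output matches the label."""
    return 1.0 if output == expected else 0.0

scores = [exact_match(run_agent(ex["input"]), ex["expected"]) for ex in examples]
print(f"offline eval accuracy: {mean(scores):.2f}")  # track this per agent version
```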
Evaluations require two core components: data and evaluators. Effective evals depend on building domain-specific datasets and custom evaluators, as general-purpose or academic datasets often fail to represent real-world use. LangSmith, LangChain’s observability and evaluation toolkit, offers tools to streamline the creation and tracking of evals, from dataset construction to real-time monitoring. Various types of evaluators are discussed, including deterministic code checks, LLM-as-a-judge scoring models, and human feedback mechanisms.
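As a sketch of the evaluator types mentioned above, the snippet below pairs a deterministic code check with a simple LLM-as-a-judge grader. The `call_llm` helper and the `{"key": ..., "score": ...}` result shape are assumptions for illustration, not a specific LangSmith interface.

```python
# Two evaluator styles: a deterministic code check and an LLM-as-a-judge grader.
# `call_llm` is a hypothetical helper standing in for whichever chat-model client you use.
import json

def has_valid_json(output: str) -> dict:
    """Deterministic evaluator: does the agent's output parse as JSON?"""
    try:
        json.loads(output)
        return {"key": "valid_json", "score": 1.0}
    except json.JSONDecodeError:
        return {"key": "valid_json", "score": 0.0}

JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Answer: {answer}
Reply with a single number from 1 (poor) to 5 (excellent)."""

def llm_judge(question: str, answer: str, call_llm) -> dict:
    """LLM-as-a-judge evaluator: ask a grader model to score the answer."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return {"key": "answer_quality", "score": int(raw.strip()) / 5}
```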
LangChain is also releasing open-source evaluation utilities that support common scenarios like code linting, RAG, extraction, and tool use, and is introducing conversation simulation tools for multi-turn interaction testing. For complex evaluations, especially LLM-as-a-judge models, the team is launching a private preview of features inspired by recent research (e.g., Eugene Yan's "AlignEval") to help developers calibrate and trust these evaluators.
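The conversation-simulation idea lends itself to a simple sketch: a scripted simulated user drives the agent for several turns, and the resulting transcript is scored as a whole. The `agent_respond` and `grade_transcript` names below are hypothetical placeholders, not the interface of the utilities being released.

```python
# Hedged sketch of multi-turn conversation simulation: a scripted "simulated
# user" drives the agent for several turns, then the full transcript is graded.
def simulate_conversation(agent_respond, user_turns):
    """Alternate simulated-user messages with agent replies."""
    transcript = []
    for user_msg in user_turns:
        transcript.append({"role": "user", "content": user_msg})
        reply = agent_respond(transcript)
        transcript.append({"role": "assistant", "content": reply})
    return transcript

def grade_transcript(transcript) -> float:
    """Placeholder multi-turn evaluator, e.g. did the agent stay on task?"""
    return 1.0 if transcript and transcript[-1]["role"] == "assistant" else 0.0

transcript = simulate_conversation(
    agent_respond=lambda history: "Sure, I can help with that.",
    user_turns=["I need to change my flight.", "Make it Tuesday instead."],
)
print(f"multi-turn score: {grade_transcript(transcript):.1f}")
```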
Overall, evaluation is positioned not as a one-time task but as a continuous, evolving process essential for building robust and reliable LLM applications.