Takeaways
- Developers cite quality as the top barrier to moving agents from prototype to production.
- Evals should be embedded throughout an agent’s lifecycle: offline, online, and in-the-loop.
- Offline evals help benchmark changes before deployment, while online evals monitor live performance.
- In-the-loop evals enable real-time agent self-correction but come with latency and cost trade-offs.
- Custom, use-case-specific datasets and evaluators yield better results than general-purpose academic datasets.
Summary
Quality is the leading blocker preventing LLM agents from moving into production, according to a developer survey. To address this, the speaker outlines three phases of evaluation—offline, online, and in-the-loop—that should be integrated into the development lifecycle to ensure consistent improvement and reliability of AI agents. Offline evals assess agent performance against static, pre-defined datasets and enable developers to track iterative improvements. Online evals, by contrast, involve real-time analysis of actual production data to continuously monitor application behavior in the wild. In-the-loop evals go a step further by allowing the system to self-correct as it operates, offering the highest quality assurance at the cost of latency and resource usage.
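To make the offline phase concrete, here is a minimal, framework-agnostic sketch of the pattern described above: run the agent over a fixed dataset, score each output with an evaluator, and track the aggregate score per agent version. The dataset, `run_agent`, and `exact_match` names are illustrative placeholders, not part of any library mentioned in the talk.

```python
# Minimal offline-eval sketch: score an agent against a static dataset
# and compare the aggregate score across agent versions.
from statistics import mean

examples = [
    {"input": "Refund order #1234", "expected": "refund_issued"},
    {"input": "Where is my package?", "expected": "tracking_link_sent"},
]

def run_agent(prompt: str) -> str:
    """Stand-in for the agent under test (e.g., an LLM + tools pipeline)."""
    return "refund_issued" if "refund" in prompt.lower() else "tracking_link_sent"

def exact_match(output: str, expected: str) -> float:
    """Deterministic evaluator: 1.0 if the agent's output matches the label."""
    return 1.0 if output == expected else 0.0

scores = [exact_match(run_agent(ex["input"]), ex["expected"]) for ex in examples]
print(f"offline eval accuracy: {mean(scores):.2f}")  # track this per agent version
```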
Evaluations require two core components: data and evaluators. Effective evals depend on building domain-specific datasets and custom evaluators, as general-purpose or academic datasets often fail to represent real-world use. LangSmith, LangChain’s observability and evaluation toolkit, offers tools to streamline the creation and tracking of evals, from dataset construction to real-time monitoring. Various types of evaluators are discussed, including deterministic code checks, LLM-as-a-judge scoring models, and human feedback mechanisms.
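As a sketch of the evaluator types mentioned above, the snippet below pairs a deterministic code check with a simple LLM-as-a-judge grader. The `call_llm` helper and the `{"key": ..., "score": ...}` result shape are assumptions for illustration, not a specific LangSmith interface.

```python
# Two evaluator styles: a deterministic code check and an LLM-as-a-judge grader.
# `call_llm` is a hypothetical helper standing in for whichever chat-model client you use.
import json

def has_valid_json(output: str) -> dict:
    """Deterministic evaluator: does the agent's output parse as JSON?"""
    try:
        json.loads(output)
        return {"key": "valid_json", "score": 1.0}
    except json.JSONDecodeError:
        return {"key": "valid_json", "score": 0.0}

JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Answer: {answer}
Reply with a single number from 1 (poor) to 5 (excellent)."""

def llm_judge(question: str, answer: str, call_llm) -> dict:
    """LLM-as-a-judge evaluator: ask a grader model to score the answer."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return {"key": "answer_quality", "score": int(raw.strip()) / 5}
```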
LangChain is also releasing open-source evaluation utilities that support common scenarios like code linting, RAG, extraction, and tool use, and is introducing conversation simulation tools for multi-turn interaction testing. For complex evaluations, especially LLM-as-a-judge models, the team is launching a private preview of features inspired by recent research (e.g., Eugene Yan's "AlignEval") to help developers calibrate and trust these evaluators.
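The conversation-simulation idea lends itself to a simple sketch: a scripted simulated user drives the agent for several turns, and the resulting transcript is scored as a whole. The `agent_respond` and `grade_transcript` names below are hypothetical placeholders, not the interface of the utilities being released.

```python
# Hedged sketch of multi-turn conversation simulation: a scripted "simulated
# user" drives the agent for several turns, then the full transcript is graded.
def simulate_conversation(agent_respond, user_turns):
    """Alternate simulated-user messages with agent replies."""
    transcript = []
    for user_msg in user_turns:
        transcript.append({"role": "user", "content": user_msg})
        reply = agent_respond(transcript)
        transcript.append({"role": "assistant", "content": reply})
    return transcript

def grade_transcript(transcript) -> float:
    """Placeholder multi-turn evaluator, e.g. did the agent stay on task?"""
    return 1.0 if transcript and transcript[-1]["role"] == "assistant" else 0.0

transcript = simulate_conversation(
    agent_respond=lambda history: "Sure, I can help with that.",
    user_turns=["I need to change my flight.", "Make it Tuesday instead."],
)
print(f"multi-turn score: {grade_transcript(transcript):.1f}")
```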
Overall, evaluation is positioned not as a one-time task but as a continuous, evolving process essential for building robust and reliable LLM applications.