The Right Way to Do AI Evals (ft Freddie Vargus) - Ep 44



AI Summary

This video is episode 44 of Tool Use, hosted by Mike Bird, with guest Freddie Vargus, CTO and co-founder of Quotient AI. Freddie shares insights on how to use evals (evaluations) to systematically build and improve AI products faster. The conversation covers why evals matter for product improvement, especially as AI apps grow in complexity toward multi-turn interactions and tool calling.

Key points discussed include:

  • The role of evals in measuring AI system performance and avoiding regressions
  • Different levels of app complexity, from single-turn prompts to multi-turn conversations with tool calls
  • The importance of testing function executability and state dependency in tool calls
  • Strategies for designing effective evals, including milestones (desired states) and minefields (undesired states) in AI behavior (see the first sketch after this list)
  • Practical ways to collect and store eval data, such as in databases and spreadsheets
  • How to bootstrap eval sets from role-played scenarios or early user data (both points are illustrated in the second sketch after this list)
  • Using human annotators versus AI judges for labeling and evaluation quality
  • Challenges around complexity and the need for intentional, thoughtful eval design
  • Practical advice on prioritizing and improving AI systems using continuous eval feedback
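
To make the milestones-and-minefields idea concrete, here is a minimal sketch of that eval pattern in Python. All names here (EvalSpec, evaluate, the example spec and transcript) are hypothetical illustrations for this summary, not Quotient AI's API, and real checks would typically use richer predicates than substring matching:

```python
# Hypothetical sketch of the "milestones and minefields" eval pattern:
# milestones are states the run SHOULD reach, minefields are states it
# should NEVER reach. Not Quotient AI's API; names are illustrative.

from dataclasses import dataclass, field

@dataclass
class EvalSpec:
    milestones: list[str] = field(default_factory=list)  # desired states
    minefields: list[str] = field(default_factory=list)  # undesired states

def evaluate(transcript: list[str], spec: EvalSpec) -> dict:
    """Score one multi-turn transcript against milestones and minefields."""
    text = "\n".join(transcript).lower()
    hit_milestones = [m for m in spec.milestones if m.lower() in text]
    hit_minefields = [m for m in spec.minefields if m.lower() in text]
    return {
        "milestones_hit": hit_milestones,
        "milestones_missed": [m for m in spec.milestones if m not in hit_milestones],
        "minefields_hit": hit_minefields,  # any hit here is a failure
        "passed": len(hit_milestones) == len(spec.milestones) and not hit_minefields,
    }

if __name__ == "__main__":
    spec = EvalSpec(
        milestones=["booking confirmed"],              # desired end state
        minefields=["card declined", "i don't know"],  # states we never want
    )
    transcript = [
        "user: book me a flight to Denver",
        "assistant: calling search_flights(destination='DEN')",
        "assistant: Booking confirmed for Friday at 9am.",
    ]
    print(evaluate(transcript, spec))
```

Keeping milestones and minefields as plain data makes it easy to rerun the same spec against every new prompt or model version and catch regressions.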
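
And here is a similarly hedged sketch of bootstrapping an eval set from hand-written, role-played scenarios and serializing it to CSV so it can live in a spreadsheet or be loaded into a database. The schema and scenarios are invented for illustration:

```python
# Hypothetical sketch of bootstrapping an eval set before real user data
# exists, then storing it as plain CSV (spreadsheet-friendly). The schema
# and example cases are illustrative, not from the episode.

import csv
import io

# Role-played scenarios written by hand to seed the eval set.
SEED_CASES = [
    {"input": "Cancel my order #1234", "expected_tool": "cancel_order",
     "milestone": "cancellation confirmed", "minefield": "order shipped anyway"},
    {"input": "What's your refund policy?", "expected_tool": "",
     "milestone": "refund window stated", "minefield": "invented policy details"},
]

def to_csv(cases: list[dict]) -> str:
    """Serialize eval cases to CSV so they can live in a spreadsheet."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(cases[0].keys()))
    writer.writeheader()
    writer.writerows(cases)
    return buf.getvalue()

if __name__ == "__main__":
    print(to_csv(SEED_CASES))
```

As real user data arrives, the same flat schema lets you append observed failures alongside the role-played seeds and grow the eval set continuously.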

Freddie emphasizes the value of taking the eval process seriously as a systematic method for improving AI products, and invites viewers to learn more or collaborate via Quotient AI.

The episode provides deep practical guidance for developers building AI tools and agents, focusing on evaluation as a key factor in product success.