How to Improve AI Apps with (Automated) Evals



AI Summary

The video by Shaw discusses improving AI applications using automated evaluations (evals). It explains that evals are metrics that assess the quality of LLM (large language model) outputs, and why automating them matters for scaling quality checks beyond manual human review. Shaw walks through an example project, a LinkedIn ghostwriter AI, where automated evals helped systematically improve writing quality by surfacing failure modes like excessive use of em dashes and an unnatural voice.

The video covers the two main types of automated evals: code-based checks (e.g., formatting, length) and LLM-based judgements (e.g., style, empathy). Shaw then describes the workflow for building automated evals: create and refine an LLM judge prompt, curate real input data, generate outputs, evaluate them automatically, and iterate on the prompt based on the eval results. This approach enables objective, rapid, large-scale prompt engineering that would be impractical to do manually.

Shaw also shares tools, including a custom Streamlit app, and tips for refining evaluation prompts effectively. The source code and all materials are freely available on GitHub, and the project was inspired by the AI evals course taught by Hamel Husain and Shreya Shankar.
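
To make the two eval types concrete, here is a minimal sketch in Python, assuming the OpenAI SDK as the judge backend. The function names, the judge prompt, the `gpt-4o-mini` model choice, and the specific checks are illustrative placeholders, not the implementation from the video.

```python
# Minimal sketch of code-based checks vs. an LLM-based judge (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def code_based_checks(post: str) -> dict[str, bool]:
    """Deterministic checks that need no model call (formatting, length)."""
    return {
        "no_em_dashes": "—" not in post,    # failure mode called out in the video
        "within_length": len(post) <= 3000,  # LinkedIn's post character limit
    }


JUDGE_PROMPT = (
    "You are judging whether this LinkedIn post reads in a natural, human voice.\n"
    "Reply with exactly one word: PASS or FAIL.\n\nPost:\n{post}"
)


def llm_judge(post: str) -> bool:
    """Subjective check (style, voice) delegated to an LLM judge."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(post=post)}],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")


def pass_rates(posts: list[str]) -> dict[str, float]:
    """Aggregate eval results over a batch of generated posts."""
    results = [{**code_based_checks(p), "natural_voice": llm_judge(p)} for p in posts]
    return {k: sum(r[k] for r in results) / len(results) for k in results[0]}
```

Running `pass_rates` over outputs generated from curated real inputs gives per-check pass rates, which is the kind of signal the iterate-on-the-prompt step uses to decide what to fix next.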