In this talk, I'll share the journey and lessons learned as we built Coda's AI evaluation process from the ground up. Drawing on my background in Data Infrastructure and Product Analytics, I'll trace how our AI evaluation evolved, highlighting practical strategies and insights relevant to startup engineers and product developers.
We'll explore the step-by-step evolution of our process, from initial tests in OpenAI's Playground, to integrating evaluations into Coda documents, to eventually leveraging various vendors for comprehensive evaluation. The talk will cover our methodical approach: understanding the specific AI tasks, performing manual evaluations, and then advancing to feature development with continuous feedback integration.
Key takeaways include:
- The importance of a robust benchmark dataset and iterative learning from internal user interactions.
- The critical role of manual evaluation alongside automated processes, using LLMs to help identify issues.
- Integrating evaluation seamlessly with application code, balancing development agility with feature synchronization.
This session is designed not just to share Coda's journey, but to equip you with actionable insights and frameworks that can be applied to your own AI feature development and evaluation processes. Whether you're building from scratch or looking to refine existing systems, this talk will provide valuable guidance and real-world applications from the startup world.
Kenny is a software engineer at Coda, a platform that transforms documents into powerful apps by seamlessly integrating data without any coding. Along with the AI team, he focuses on integrating GenAI into Coda’s products, with an emphasis on AI evaluation. He built the evaluation framework from the ground up and continues to evolve it as the GenAI landscape changes. Before joining the GenAI team, he worked as a Data Infrastructure Engineer at Coda, Box, and Simon Data, specializing in product analytics, A/B testing, and event streaming. Outside of work, you might find him playing Gloomhaven, enjoying sushi, or pushing himself in workouts at Orangetheory.