Evaluating LLMs is no longer an academic concern. As models grow more capable and incorporate more modalities, gaining confidence in your application will only get harder. To get ahead of this, the broader community needs to discuss practical ways to measure LLM performance & reliability across many metrics. In this talk, I will begin with a survey of current approaches to evaluating LLMs and their primary drawbacks: they are too slow, too expensive, or too biased. We will then cover practical solutions that unlock faster iteration & greater safety in GenAI, such as using tiny evaluators in an online setting & making efficient use of human feedback offline. By the end, you should be well equipped to understand what you can do today to get a clear signal on your GenAI application's performance.
Dhruv Singh is the Co-Founder and CTO at HoneyHive AI. Prior to that, he was a Software Engineer at Microsoft, where he contributed to frameworks for LLM developers on Microsoft's OpenAI Innovation team and worked on projects in the Office of the CTO; during that time, he won the Codex Innovation Challenge organized by the Office of the CTO. Earlier, Dhruv interned at Otsuka Pharmaceutical Companies (U.S.) and Genomic Prediction in 2018, and served as a Software Engineering Intern at Microsoft in 2019. He holds a Bachelor of Science in Computer Science from Columbia University.