Technical Talks

Building a Flexible Data Platform for LLM Training Data

Jonathan Talmi | Manager of Technical Staff | Cohere

World-class LLMs are trained on trillions of tokens, and data quality is critical in determining the ultimate performance of these systems.

At Cohere, we train LLMs and retrieval (RAG) systems from scratch using datasets created through complex ingestion, preprocessing, and distillation pipelines. This talk will cover what we know about how data drives LLM performance, as well as the data platform we use to manage hundreds of datasets, from ultra-niche finetuning datasets to petabyte-scale web data, and automate the measurement and enforcement of data quality at scale.

Here’s what practitioners will take away from this talk:

  • What the science and our own experiences show us about data for LLMs
  • A detailed understanding of the unique architecture we built to manage datasets for training complex NLP systems
  • The anatomy of an LLM training data pipeline, and how we build data quality evaluation into these pipelines
  • Practical lessons on implementing best practices in their own LLM training infrastructure

Jonathan Talmi
Manager of Technical Staff | Cohere

Jonathan is a six-year veteran of data-in-tech and leads the team responsible for integrating large, high-quality datasets into Cohere's foundational language models. He is passionate about advancing the science of data for LLMs and hopes to share what he has learned with the broader community. Previously, he led a data platform team at mobile commerce startup Super and worked on data teams at Shopify and Instacart.