Technical Talks

Automatically Fix Data Issues & Label Errors in Most ML Datasets

Curtis Northcutt | CEO | Cleanlab

Gold-standard ML test sets contain significantly more label errors than previously believed. Using confident learning (open-sourced as cleanlab), we found that, on average, at least 3.3% of labels in ten of the most commonly used benchmark test sets are erroneous (over 10% in some cases). Real-world datasets used to train public-facing models are often less curated than these academic benchmarks, which makes both benchmarking (errors in test sets) and training reliable models (errors in training sets) difficult.

In this talk, I'll present (1) cleanlab open-source, a fast-growing Python framework for data-centric AI that automatically finds issues in most ML datasets (for most models, data modalities, and data formats), often in only a few lines of code, and (2) Cleanlab Studio, a no-code, automatic web interface used by many universities and Fortune 500 companies for finding and fixing issues in ML datasets. Notably, cleanlab algorithms are theoretically proven to work well in some cases. While the proof of exact label error finding is out of scope for this talk, I'll share intuitive theoretical arguments that demonstrate why cleanlab tends to train more accurate models on real-world, messy data than prior approaches.
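To give a rough sense of the idea behind confident learning, here is a minimal pure-Python sketch. It is a simplified illustration, not the cleanlab implementation: the function names `per_class_thresholds` and `find_likely_label_errors` and the toy data are hypothetical, and the sketch assumes you already have predicted class probabilities for every example. Each class gets a threshold equal to the model's average confidence on examples carrying that label, and an example is flagged when the model confidently predicts a class other than its given label.

```python
def per_class_thresholds(labels, pred_probs, num_classes):
    """Average predicted probability of class k over the examples labeled k."""
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        # A class with no labeled examples can never be confidently predicted.
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else float("inf"))
    return thresholds


def find_likely_label_errors(labels, pred_probs):
    """Return indices whose confidently predicted class differs from the given label."""
    num_classes = len(pred_probs[0])
    thresholds = per_class_thresholds(labels, pred_probs, num_classes)
    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model predicts at or above that class's own threshold.
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class; skip
        guess = max(confident, key=lambda k: p[k])
        if guess != y:
            suspects.append(i)
    return suspects


# Toy data (made up for illustration): six examples, two classes.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model strongly favors class 1
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
suspects = find_likely_label_errors(labels, pred_probs)
print(suspects)
```

In practice the predicted probabilities should be out-of-sample (for example, obtained through cross-validation), so that the model has not memorized the labels it is being asked to audit. In the open-source library, the comparable entry point is `cleanlab.filter.find_label_issues(labels, pred_probs)`, which applies more refined estimation and ranking than this sketch.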

Curtis Northcutt
CEO | Cleanlab

Curtis Northcutt is an American computer scientist and entrepreneur focusing on machine learning and AI to empower people. He is the CEO and co-founder of Cleanlab, an AI software company that improves machine learning model performance by automatically fixing data and label issues in real-world, messy datasets. Curtis completed his PhD at MIT, where he invented Cleanlab’s algorithms for automatically finding and fixing label issues in any dataset. He is the recipient of the MIT Morris Levin Thesis Award, the NSF Fellowship, and the Goldwater Scholarship, and has worked at several leading AI research groups, including Google, Oculus, Amazon, Facebook, Microsoft, and NASA.