Automatically Fix Data Issues & Label Errors in Most ML Datasets

Technical Talks

Gold-standard ML test sets contain significantly more label errors than previously believed (see labelerrors.com). Using confident learning (open-sourced as cleanlab), we found that, on average, at least 3.3% of labels across ten of the most commonly used benchmark test sets are erroneous (over 10% in some cases). Real-world datasets used to train public-facing models are often less curated than these academic benchmarks. Errors in test sets make benchmarking unreliable, and errors in training sets make it hard to train reliable models.

In this talk, I'll present two tools. The first is cleanlab open-source (github.com/cleanlab/cleanlab), a fast-growing Python framework for data-centric AI that automatically finds issues in most ML datasets, across most models, data modalities, and data formats, often in only a few lines of code. The second is Cleanlab Studio (https://cleanlab.ai/studio), a no-code, automated web interface used by many universities and Fortune 500 companies for finding and fixing issues in ML datasets. Notably, cleanlab algorithms are theoretically proven to work well in some cases. While the proof of exact label error finding is out of scope for this talk, I'll share intuitive theoretical arguments for why cleanlab tends to train more accurate models on real-world, messy data than prior approaches.
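To make the idea behind confident learning concrete, here is a minimal sketch in plain Python. It is a simplified illustration, not cleanlab's actual implementation or API: the function names `class_thresholds` and `find_likely_label_issues` and the toy data are hypothetical. It assumes you already have out-of-sample predicted probabilities from some classifier. An example is flagged when the model is confident, relative to a per-class threshold, in a class other than the given label.

```python
def class_thresholds(labels, pred_probs):
    """Per-class threshold: the mean predicted probability of class k
    over the examples whose given label is k (hypothetical helper)."""
    num_classes = len(pred_probs[0])
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        # A class with no labeled examples can never be "confident".
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else float("inf"))
    return thresholds


def find_likely_label_issues(labels, pred_probs):
    """Return indices whose given label disagrees with a class the model
    is confident in (simplified sketch of the confident learning idea)."""
    thresholds = class_thresholds(labels, pred_probs)
    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose probability meets that class's own threshold.
        confident = [k for k, t in enumerate(thresholds) if p[k] >= t]
        if not confident:
            continue  # the model is not confident about any class
        guess = max(confident, key=lambda k: p[k])
        if guess != y:
            issues.append(i)
    return issues


# Toy data (made up for illustration): example 2 is labeled 0,
# but the model assigns probability 0.9 to class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
issues = find_likely_label_issues(labels, pred_probs)
print(issues)
```

Using a separate threshold for each class, rather than a single global cutoff, is what lets this kind of approach tolerate a model that is systematically more or less confident on some classes. The cleanlab library provides its own, more complete implementation of this idea.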


Curtis Northcutt

CEO | Cleanlab

Curtis Northcutt is an American computer scientist and entrepreneur focusing on machine learning and AI to empower people. He is the CEO and co-founder of Cleanlab, an AI software company that improves machine learning model performance by automatically fixing data and label issues in real-world, messy datasets. Curtis completed his PhD at MIT where he invented Cleanlab’s algorithms for automatically finding and fixing label issues in any dataset. He is the recipient of the MIT Morris Levin Thesis Award, the NSF Fellowship, and the Goldwater Scholarship and has worked at several leading AI research groups, including Google, Oculus, Amazon, Facebook, Microsoft, and NASA.
