The Statistics of Dirty Data

Sanjay Krishnan | UC Berkeley, AMPLab


Learning-based components are increasingly in the critical path of important software systems, such as those for fraud detection, product recommendation, robotics and control, and machine vision. Much of the research on scalable Machine Learning has focused on efficient distributed optimization and has produced several open-source libraries, e.g., TensorFlow, MLlib, and Caffe.

While these tools certainly make scalable Machine Learning applications easier to develop, significant developer effort is still required for the steps prior to model training, including extracting structure, imputing missing values, and handling inconsistencies.

I will describe statistical models that account for the effects of these data preparation steps and allow us to analyze learning pipelines end to end. I will present examples where divorcing these steps from the subsequent Machine Learning can lead to statistical biases, convergence problems, and missed opportunities for joint optimization.


Sanjay Krishnan

Graduate Researcher | UC Berkeley, AMPLab

Sanjay Krishnan is a Computer Science PhD candidate in the Algorithms, Machines, and People Lab (AMPLab) and in the Berkeley Laboratory for Automation Science and Engineering at UC Berkeley. His research studies techniques for data analytics on dirty data and data representation problems in physical systems.