The Statistics of Dirty Data

Sanjay Krishnan, Graduate Researcher | UC Berkeley, AMPLab

Learning-based components are increasingly in the critical path of important software systems, in areas such as fraud detection, product recommendation, robotics and control, and machine vision. Much of the research on scalable Machine Learning has focused on efficient distributed optimization and has resulted in the development of several open-source libraries, e.g., TensorFlow, MLlib, and Caffe.

While these tools certainly make scalable Machine Learning applications easier to develop, significant developer effort is still required for the steps prior to model training, including extracting structure, imputing missing values, and handling inconsistencies.

I will describe statistical models that account for the effects of these data preparation steps and allow us to analyze learning pipelines end to end. I will present examples where divorcing these steps from the subsequent Machine Learning can lead to statistical biases, convergence problems, and missed opportunities for joint optimization.
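As a rough illustration of the kind of statistical bias mentioned above, the following sketch shows how imputing missing values in isolation can distort simple statistics. It is not taken from the talk. The data, the missingness rule (values above 7 go missing), and the helper name `mean_impute` are all assumptions made up for this example.

```python
from statistics import mean, pvariance


def mean_impute(values):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical ground truth: the values 1.0 through 10.0.
true_values = [float(v) for v in range(1, 11)]

# Assumed missingness rule: large values are the ones that go missing,
# so missingness depends on the value itself.
dirty = [v if v <= 7 else None for v in true_values]

# Cleaning step performed without regard to the downstream analysis.
cleaned = mean_impute(dirty)

true_mean = mean(true_values)
cleaned_mean = mean(cleaned)
true_var = pvariance(true_values)
cleaned_var = pvariance(cleaned)

print(f"mean:     true={true_mean:.2f}  after imputation={cleaned_mean:.2f}")
print(f"variance: true={true_var:.2f}  after imputation={cleaned_var:.2f}")
```

In this toy setting the imputed data understates both the mean and the variance, and any model trained on it would inherit that distortion without the training step being aware of it.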



Sanjay Krishnan

Graduate Researcher | UC Berkeley, AMPLab

I'm a Computer Science PhD student in the RISELab and AUTOLAB at UC Berkeley, working on projects that span databases and robotics. I like to think about how to build intelligent learning systems, which theoretical tools help us understand these systems, and which design principles will let us scale them to the problems of the future.
