One of the key bottlenecks in building machine learning systems is creating and managing the massive training datasets that today’s models learn from. In this talk, I will describe our work on data management systems that let users specify training datasets in higher-level, faster, and more flexible ways, leading to applications that can be built in hours or days, rather than months or years.
I will start by describing Snorkel, an open-source system for programmatically labeling training data that has been deployed by major technology companies, academic labs, and government agencies. In Snorkel, rather than hand-labeling training data, users write labeling functions that label data using heuristic strategies such as pattern matching, distant supervision, and other models. These labeling functions can have noisy, conflicting, and correlated outputs, which Snorkel models and combines into clean training labels. We solve this novel data cleaning problem without any ground truth labels using a matrix-completion-style approach, which we show has strong consistency guarantees. We also demonstrate that Snorkel leads to impactful gains in applications ranging from knowledge base construction to medical imaging.
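To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use Snorkel's actual API: the function names, the toy relation-extraction task, the `KNOWN_PAIRS` lookup table, and the label constants are all assumptions made for illustration. The votes are combined by simple majority vote, which is a much cruder stand-in for the modeling step described above.

```python
import re
from collections import Counter

# Hypothetical label constants for a toy "does the drug cause the effect?" task.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical lookup table standing in for an external knowledge base.
KNOWN_PAIRS = {("aspirin", "bleeding")}


def lf_contains_cause(text):
    """Pattern matching: vote POSITIVE if the text says 'cause' or 'causes'."""
    return POSITIVE if re.search(r"\bcauses?\b", text, re.IGNORECASE) else ABSTAIN


def lf_negation(text):
    """Pattern matching: vote NEGATIVE if the text contains a negation word."""
    return NEGATIVE if re.search(r"\b(no|not|never)\b", text, re.IGNORECASE) else ABSTAIN


def lf_known_pair(text):
    """Distant supervision: vote POSITIVE if a known (drug, effect) pair appears."""
    lowered = text.lower()
    hit = any(drug in lowered and effect in lowered for drug, effect in KNOWN_PAIRS)
    return POSITIVE if hit else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_cause, lf_negation, lf_known_pair]


def apply_lfs(texts):
    """Return one row of votes per text, one column per labeling function."""
    return [[lf(text) for lf in LABELING_FUNCTIONS] for text in texts]


def majority_vote(votes):
    """Combine one row of votes; abstain when there are no votes or a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


texts = [
    "Aspirin causes bleeding in some patients.",
    "Aspirin does not cause bleeding.",
    "The weather was mild.",
]
vote_matrix = apply_lfs(texts)
labels = [majority_vote(row) for row in vote_matrix]
```

In the second sentence the labeling functions disagree, and majority vote sides with the two positive votes even though the sentence is negated. This is the kind of noisy, conflicting output that motivates modeling the labeling functions rather than simply counting their votes.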
Next, I will give an overview of two other systems that accelerate training data creation and management: TANDA, a system for optimizing and managing data augmentation strategies, wherein a labeled dataset is artificially expanded by transforming data points; and MeTaL, a system for integrating training labels across multiple related tasks. I will conclude by outlining future research directions for further accelerating and democratizing machine learning workflows, such as higher-level interfaces and massively multi-task frameworks.
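As a rough illustration of data augmentation by transforming data points, the sketch below expands a small labeled text dataset by applying short random sequences of transformation functions. This is not TANDA's implementation: the two word-level transformations, the `augment` helper, and its parameters are hypothetical, and the sequences are sampled uniformly at random rather than optimized. Whether such transformations preserve the label depends on the task.

```python
import random


def swap_adjacent_words(text, rng):
    """Swap one randomly chosen pair of neighboring words."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def drop_random_word(text, rng):
    """Delete one randomly chosen word, keeping at least one word."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)


# Hypothetical pool of transformation functions.
TRANSFORMATIONS = [swap_adjacent_words, drop_random_word]


def augment(dataset, copies_per_example=2, sequence_length=2, seed=0):
    """Return the original (text, label) pairs plus transformed copies.

    Each copy is produced by a random sequence of transformations and
    inherits the label of the example it was derived from.
    """
    rng = random.Random(seed)
    augmented = list(dataset)
    for text, label in dataset:
        for _ in range(copies_per_example):
            new_text = text
            for _ in range(sequence_length):
                transform = rng.choice(TRANSFORMATIONS)
                new_text = transform(new_text, rng)
            augmented.append((new_text, label))
    return augmented


dataset = [("the drug reduced pain", 1), ("no effect was observed", 0)]
expanded = augment(dataset)
```

With the default settings, each original example contributes two transformed copies, so the two-example dataset grows to six labeled examples.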
Alex Ratner is a fifth-year Ph.D. candidate advised by Christopher Ré in the Computer Science department at Stanford, where he is supported by a Stanford Bio-X fellowship. His research focuses on applying data management and statistical learning techniques to emerging machine learning workflows, such as creating and managing training data, and on applying these techniques to real-world problems in medicine, knowledge base construction, and more. He leads the Snorkel project (snorkel.stanford.edu), which has been deployed at large technology companies, academic labs, and government agencies, and his work has been recognized in VLDB 2018 (“Best Of”).