MacroBase is a new open-source analytics engine designed to prioritize the scarcest resource in large-scale, fast-moving data streams: human attention. In many deployments at scale, an overwhelming proportion of the data collected is never read and is instead retained only for reactive failure analysis. In contrast, MacroBase analyzes data as it arrives and provides interpretable, high-level explanations of stream behaviors, enabling real-time root cause analysis and anomaly detection.
At its core, MacroBase combines cascades of streaming classification and explanation operators to both identify individual points of interest and highlight commonalities across them. For example, the Android device ecosystem comprises over 24,000 distinct device types. How can you determine whether your mobile application is behaving correctly on all of them? MacroBase’s classification operators can identify abnormally behaving devices, while its explanation operators can aggregate many such devices into a smaller number of more interpretable outputs. To support this, MacroBase is designed as a set of reconfigurable dataflow operators that can be composed into end-to-end pipelines. It has already been used to diagnose issues in production streams in mobile, data center, and industrial applications.
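The classify-then-explain pattern described above can be illustrated with a minimal Python sketch. This is not MacroBase's API: the function names (`classify_outliers`, `explain`), the record fields (`device`, `latency_ms`), and the thresholds are hypothetical, and the specific techniques (a median/MAD robust score for classification, a risk ratio for explanation) are assumptions chosen for illustration. The sketch also runs over a small batch rather than a stream.

```python
import math
from collections import Counter
from statistics import median


def classify_outliers(records, metric, threshold=3.0):
    """Classification step (illustrative): flag records whose metric is far
    from the median, measured in units of the median absolute deviation."""
    values = [r[metric] for r in records]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    # 1.4826 rescales MAD to be comparable to a standard deviation;
    # fall back to 1.0 if all deviations are zero.
    scale = 1.4826 * mad if mad > 0 else 1.0
    return [abs(v - med) / scale > threshold for v in values]


def explain(records, labels, attribute, min_ratio=3.0):
    """Explanation step (illustrative): report attribute values that are
    over-represented among outliers, ranked by risk ratio."""
    out_counts = Counter(r[attribute] for r, o in zip(records, labels) if o)
    in_counts = Counter(r[attribute] for r, o in zip(records, labels) if not o)
    total_out = sum(out_counts.values())
    total_in = sum(in_counts.values())

    explanations = []
    for value, a_out in out_counts.items():
        a_in = in_counts.get(value, 0)
        others_out = total_out - a_out
        others_total = others_out + (total_in - a_in)
        risk_with = a_out / (a_out + a_in)
        risk_without = others_out / others_total if others_total else 0.0
        ratio = risk_with / risk_without if risk_without > 0 else math.inf
        if ratio >= min_ratio:
            explanations.append((value, ratio))
    return sorted(explanations, key=lambda pair: -pair[1])


# Hypothetical data: device type "C" reports much higher latency.
records = (
    [{"device": "A", "latency_ms": 100 + i} for i in range(8)]
    + [{"device": "B", "latency_ms": 105 + i} for i in range(8)]
    + [{"device": "C", "latency_ms": 900 + i} for i in range(4)]
)
labels = classify_outliers(records, "latency_ms")
explanations = explain(records, labels, "device")
print(explanations)
```

In this toy data the four slow records are flagged individually, and the explanation step collapses them into a single summary naming device type "C", which is the kind of aggregation that keeps the output readable when thousands of device types are involved.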
In this talk, we will walk through the core concepts behind MacroBase, its architecture, and use cases, as well as key takeaways from the recent research literature for data engineers, data scientists, and DevOps engineers. MacroBase is a core component of the Stanford DAWN project, a new research initiative designed to enable more usable and efficient machine learning infrastructure.
Sahaana Suri is a second-year PhD student in the Stanford InfoLab, working with Peter Bailis. Sahaana’s research focuses on building easily accessible data analytics and machine learning systems that scale. She holds a bachelor’s degree in electrical engineering and computer science from the University of California, Berkeley.