At Netflix, we log raw data (aka facts) from which we generate ML features, instead of following the traditional approach of logging features. Building the fact store has significantly decreased the time it takes to experiment with new ML features in an A/B test, as historical features can be generated on demand. Since we started fact logging, we have been through several iterations of building a fact store, starting from a pull-based approach and moving to a push-based approach with support for data exploration. Along with the development of the fact store, our label store has also seen major design changes, and the two remain well coupled to ensure that we are able to find the exact facts for a given label.
The latest version of the fact store is performant and debuggable for both research and production use cases. While the fact store supports logging any kind of fact, it gives special privileges to known facts, such as easier data exploration, faster query performance, and data quality metrics. To protect against regressions and support faster development, the fact store uses development swimlanes, which aid in end-to-end testing from logging to feature generation.
In this talk, we will give insights into our ML training data preparation and how our approach to building the fact and label stores evolved.
Vivek Kaushal is a senior software engineer on the Personalization Infrastructure team at Netflix. He works on distributed systems and big data, and is currently focused on storing and querying petabytes of data. Prior to Netflix, he worked at Apple, Sumo Logic, and Amazon in similar roles.