Personalization allows Stitch Fix to style its clients and provide recommendations to help them find what they love. To do this, we gather information about a client’s preferences up front when they sign up for the service, and learn more about them as they become longer-term customers.
The Data Science team at Stitch Fix is the primary owner of the recommendation systems. Backing them up is the Data Platform team, which builds and maintains the data ecosystem. Each entity in the data ecosystem is considered a resource. These resources help Data Scientists read, transform, and write the data that helps us understand our clients and style them better.
Missing from our current data ecosystem is reliable, traceable lineage information about these resources. This includes the origin of data, its journey through the ecosystem, and its upstream and downstream dependencies. All of this information is valuable to have, but there isn't a coherent system that brings it together.
In our world, lineage enhances traceability and improves resource management. If resource A changes, what else changes, and are we aware of all the relationships surrounding resource A? Extend this idea to every resource in the organization and you have a map of all the dependencies.
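To make the "what else changes" question concrete, here is a minimal sketch that treats resources as nodes in a directed graph and walks it to find everything downstream of a changed resource. The resource names, the `EDGES` list, and the `build_graph` and `affected_by` helpers are all hypothetical and only illustrate the idea; this is not how Sultan is implemented.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream resource, downstream resource).
EDGES = [
    ("A", "B"),
    ("A", "C"),
    ("B", "D"),
    ("C", "D"),
    ("E", "C"),
]


def build_graph(edges):
    """Map each resource to the set of resources that read from it directly."""
    downstream = defaultdict(set)
    for upstream, dependent in edges:
        downstream[upstream].add(dependent)
    return downstream


def affected_by(graph, resource):
    """Return every resource reachable downstream of `resource` (breadth-first)."""
    seen = set()
    queue = deque([resource])
    while queue:
        current = queue.popleft()
        for dependent in graph.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Which resources could be impacted by a change in resource A?
print(sorted(affected_by(build_graph(EDGES), "A")))
```

Running the same traversal from every resource gives the organization-wide dependency map described above.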
We designed a system that captures those relationships and lets Data Scientists use that visibility to maintain their workflows better, and we are actively building it.
In this talk, Neelesh explains the ongoing journey of building the Data Lineage system, named Sultan, and why it is so important for us at Stitch Fix.
Neelesh Srinivas Salian is a Software Engineer on the Data Platform team at Stitch Fix, where he works on building the Compute Infrastructure used by Data Scientists. Previously, he was at Cloudera, where he worked on Apache projects such as YARN, Spark, and Kafka.