In this talk we present Hamilton, a novel open-source framework for developing and maintaining scalable feature engineering dataflows. Hamilton was initially built to solve the problem of managing a codebase of transforms on pandas dataframes, enabling a data science team to scale the capabilities they offer along with the complexity of their business. Since then, it has grown into an end-to-end tool for writing, managing, and iterating on machine learning pipelines. We introduce the framework, discuss its motivations and initial successes at Stitch Fix, showcase its lightweight data lineage and catalog abilities, and share recent extensions that seamlessly integrate it with distributed compute offerings such as Dask, Ray, and Spark.
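To give a flavor of the paradigm the talk covers: in Hamilton, each transform is a plain Python function whose name declares the output it produces and whose parameter names declare its inputs, so the framework can assemble the functions into a dataflow automatically. The sketch below is not Hamilton itself; it is a minimal, self-contained illustration of that idea, with a toy resolver (`execute`) and hypothetical feature names (`spend_per_signup`, `spend_zero_mean`) standing in for a real codebase.

```python
import inspect
import pandas as pd

# Each function declares one output (its name) and its inputs (its parameters).
# These feature names are illustrative, not from Hamilton's docs.
def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Marketing spend divided by signups."""
    return spend / signups

def spend_mean(spend: pd.Series) -> float:
    """Average spend, itself usable as an input by other transforms."""
    return spend.mean()

def spend_zero_mean(spend: pd.Series, spend_mean: float) -> pd.Series:
    """Spend with the mean removed; depends on spend_mean above."""
    return spend - spend_mean

def execute(funcs: dict, inputs: dict, targets: list) -> dict:
    """Toy resolver: compute each requested output by recursing on parameter names."""
    cache = dict(inputs)
    def resolve(name):
        if name not in cache:
            fn = funcs[name]
            args = {p: resolve(p) for p in inspect.signature(fn).parameters}
            cache[name] = fn(**args)
        return cache[name]
    return {t: resolve(t) for t in targets}

funcs = {f.__name__: f for f in (spend_per_signup, spend_mean, spend_zero_mean)}
inputs = {
    "spend": pd.Series([10.0, 20.0, 30.0]),
    "signups": pd.Series([1.0, 2.0, 3.0]),
}
result = execute(funcs, inputs, ["spend_per_signup", "spend_zero_mean"])
```

In Hamilton proper, a `Driver` performs this graph construction and execution over the functions in a module; the point of the sketch is simply that dependencies are expressed through naming rather than through explicit wiring code, which is what makes the transform codebase easy to maintain and to lineage-trace.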
Elijah has always enjoyed working at the intersection of math and engineering. More recently, he has focused his career on building tools to make data scientists and researchers more productive. At Two Sigma, he built infrastructure to help quantitative researchers efficiently turn ideas into production trading models. At Stitch Fix he ran the Model Lifecycle team — a team that focuses on streamlining the experience for data scientists to create and ship machine learning models. He is now the CTO at DAGWorks, which aims to solve the problem of building and maintaining complex ETLs for machine learning. In his spare time, he enjoys geeking out about fractals, poring over antique maps, and playing jazz piano.