There are several commercial, managed-service and open source data pipeline frameworks on the market. In this talk, we will discuss two of them: the AWS Data Pipeline managed service and the open source software Airflow. These frameworks have very different feature sets and operational models; however, they have both benefited us and fallen short of our needs in similar ways.

To understand the reasons, we analyze our experience of first building a data processing platform on Data Pipeline, and then developing the next-generation platform on Airflow. We find that both the managed service and the open source framework are leaky abstractions, and thus both required us to understand and build primitives to support deployment and operations.

Likewise, we discuss the necessity of implementing cross-cutting aspects such as logging, monitoring, security and configuration, which arises from the shortcomings of existing pre-implemented components. Generalizing from specific pain points and solutions, we posit that almost any organization building a data platform using a pipeline framework or service will run into many of the same issues, because opinionated framework/service implementations will conflict with an organization's existing code, preferences and procedures.

So where is the line? What value can you expect to get from a data pipeline framework or service? What will you need to wrap, integrate with or fully implement yourself? To develop a robust data pipeline platform for your organization, you will need to bridge the gap between the framework dream and production reality. This talk will help you do that.

Mark Weiss

Senior Software Engineer | Beeswax

Mark Weiss is a Senior Software Engineer at Beeswax, the online advertising industry’s first extensible programmatic buying platform, where he focuses on designing and building data processing infrastructure and applications supporting reporting and machine learning. He has previously held various engineering roles, both as an individual contributor and in leadership, and has worked on ETL systems and data-driven distributed platforms for much of his career. Mark has spoken previously at DataEngConf NYC, and regularly speaks and mentors at the NYC Python Meetup. He also blogs and hosts the podcast "Using Reflection" at http://www.usingreflection.com, and can be found on GitHub, Twitter and LinkedIn under @marksweiss. He lives in Brooklyn, NY.
