Data Council - Data Science, Machine Learning, AI, and Engineering Blog

How to "Democratize" the Responsibility for Data Quality Across your Organization


Writing endless data transformations wasn't sustainable for an engineering team handling hundreds of inputs. Here's how Clover Health enabled their business users to help.

It's rare to find an ETL system that's completely static. As organizations change and grow, they develop new business requirements. Because of this, their data pipelines must change and adapt, ultimately becoming more robust and full-featured. Yet constant development can make already brittle ETL systems seem even more fragile.

Furthermore, systems with large numbers of different types of inputs bring special challenges: building, testing and managing an exploding number of data transformations can become a daunting project for the engineering team.

The Clover Health ETL system supports hundreds of inputs and more than 500 custom transformations in production, as well as a large number of custom connections between its different ETL pipelines. When hearing about the magnitude of the system, one might rightfully wonder, "How does Clover guarantee and maintain data quality across so many different inputs and transforms?"

Exploring the development trajectory of Clover's system makes for a fascinating story; the data team's successes and pitfalls offer illustrative lessons to other engineers seeking to increase the robustness of their own ETL systems.

Chris Hartfield is one of the lead data engineers on Clover's ETL systems. His talk at DataEngConf SF '18 will explain how his team built in ETL robustness as they developed successive iterations of their data pipelines. Chris will also share tips and tricks they learned while building the process and system that enabled users from across business units to help keep the quality and integrity of Clover's data high.


Meet Chris Hartfield of Clover Health

Chris Hartfield is a senior data engineer at Clover Health, where he uses analytics to improve healthcare outcomes for the Medicare community. Prior to joining Clover, Chris was part of the engineering team at Poplicus, an organization that creates proprietary analytics from big data in the public sector. There, he was responsible for the search logic and proprietary metrics used to detect trends in public spending. Chris holds a B.S. in Biomolecular Engineering from UC Santa Cruz.


In the original version of its tooling, Clover introduced an application that enabled business users to create their own transformations. This system supported many types of user-defined data transformations with limited testing. While it sounds like a bit of a wild west, the main goal was to prove that users would adopt such a system and see its value.

After this mission was accomplished, the data team started building tooling that put more structure around existing workflows, although there was some concern that users might feel shackled and confined. In fact, the opposite was true: users didn't simply want flexibility; they also appreciated less development overhead and the benefits of more robust testing. The tradeoff was very well received.
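To make the tradeoff concrete, here is a minimal Python sketch of one way to add structure around user-defined transformations. It is not Clover's actual framework, and every name in it (`TransformRegistry`, `register`, `apply`, `normalize_state`) is hypothetical. The assumption is simply that a transformation is only accepted if it ships with example input/output pairs that pass when it is registered.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Transform = Callable[[Row], Row]


@dataclass
class TransformRegistry:
    """Hypothetical registry: every transform must come with example cases."""

    _transforms: Dict[str, Transform] = field(default_factory=dict)

    def register(
        self, name: str, fn: Transform, examples: List[Tuple[Row, Row]]
    ) -> None:
        # Refuse untested transforms outright.
        if not examples:
            raise ValueError(f"{name}: at least one example case is required")
        # Run each example before accepting the transform.
        for given, expected in examples:
            actual = fn(dict(given))  # copy so examples are never mutated
            if actual != expected:
                raise ValueError(
                    f"{name}: expected {expected!r}, got {actual!r}"
                )
        self._transforms[name] = fn

    def apply(self, name: str, row: Row) -> Row:
        # Work on a copy so the caller's row is left untouched.
        return self._transforms[name](dict(row))


def normalize_state(row: Row) -> Row:
    """Example user-defined transform: tidy up a state code."""
    row["state"] = str(row["state"]).strip().upper()
    return row


registry = TransformRegistry()
registry.register(
    "normalize_state",
    normalize_state,
    examples=[({"state": " ca "}, {"state": "CA"})],
)
result = registry.apply("normalize_state", {"state": "ny"})
```

In a design like this, the person writing the transformation also writes its examples, which is one way responsibility for data quality can be shared beyond the engineering team.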

In his talk, "Democratizing Data with the Clover Transform Framework," Chris will tell the story of how Clover tamed its existing ETL system while at the same time creating the next version of its data tooling, which allowed the company to scale its processing by harnessing the power of user-defined transformations across its data pipelines.

Want to hear more talks like this? Join us at DataEngConf SF '18 and be part of the biggest technical conference around data science and engineering.

Data Science, Data Engineering, Event Updates, Startups, apache arrow

Written by Pete Soderling

Pete is a software engineer, 3x founder and angel investor. As the founder of Hakka Labs and DataEngConf he loves to build community for software engineers and has some bumps and bruises to prove it. He's spoken at conferences like RSA Security, O'Reilly Strata and TEDx, helped organize QCon events, and launched data meetups around the world. He's a mentor at 500 Startups in SF but he lives in Jackson Hole, WY, where the snow is far better.
