In this blog series leading up to our SF18 conference, we invite our featured startups to tell us more about their data engineering challenges. Today, we speak with Pachyderm, an early-stage company building a data platform for data science.
Q: What surprised you most as an engineer about the work you did that you'll be telling us about in your talk?
Daniel Whitenack: ML/AI is where it's at, and we're entering a world where everyone feels they need it so as not to be left behind. What surprises me, though, is how few teams seem to understand how to manage ML/AI at scale while maintaining productivity, compliance and security. We're in a phase where data science/engineering teams are swimming in a sea of Jupyter notebooks, random files, parameter sets and data sources that they don't know how to manage. Not only is this inefficient and error-prone, it's anything but compliant with the latest regulations around algorithmic decision making (like the GDPR in the EU).
The real missing piece of the puzzle is data provenance. That is, the ability to connect any piece of data with all the other pieces of data and the processing that contributed to it. No matter how many data management frameworks and distributed computing frameworks get released, this property seems to be super elusive. What we realized is that, to truly obtain full data provenance that is also easy to access and use, you need a mechanism for managing/versioning data that is unified with a data pipelining mechanism. Otherwise, you will always be trying to reconcile what data was used by what processing at which time.
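To make that idea concrete, here is a minimal, self-contained sketch (not Pachyderm's actual implementation) of what "unifying" data versioning with pipelining buys you: every output commit is content-addressed and records exactly which input commits and which transform produced it, so lineage never has to be reconciled after the fact. The `ProvenanceStore` class and its method names are invented for this illustration.

```python
import hashlib

def commit_id(content: bytes) -> str:
    """Content-address a blob so identical data always gets the same ID."""
    return hashlib.sha256(content).hexdigest()[:12]

class ProvenanceStore:
    """Toy store that versions data and records, for each output,
    the inputs and transform that produced it -- the unified
    versioning + pipelining property described above."""

    def __init__(self):
        self.blobs = {}        # commit ID -> raw bytes
        self.provenance = {}   # output commit ID -> {inputs, transform}

    def put(self, content: bytes) -> str:
        """Version a piece of raw data and return its commit ID."""
        cid = commit_id(content)
        self.blobs[cid] = content
        return cid

    def run(self, transform_name, fn, *input_ids) -> str:
        """Apply fn to versioned inputs, version the output,
        and record full provenance in the same step."""
        output = fn(*(self.blobs[i] for i in input_ids))
        out_id = self.put(output)
        self.provenance[out_id] = {
            "inputs": list(input_ids),
            "transform": transform_name,
        }
        return out_id

    def lineage(self, cid) -> dict:
        """Walk provenance from any commit back to the raw inputs."""
        node = self.provenance.get(cid)
        if node is None:
            return {"commit": cid, "source": "raw"}
        return {
            "commit": cid,
            "transform": node["transform"],
            "inputs": [self.lineage(i) for i in node["inputs"]],
        }

# Example: one raw dataset, one transform, full lineage for free.
store = ProvenanceStore()
raw = store.put(b"raw training data")
out = store.run("uppercase", bytes.upper, raw)
print(store.lineage(out))
```

Because `run` writes the provenance record at the same moment it versions the output, there is never a window where "what data was used by what processing at which time" has to be reconstructed from logs.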
Q: What do you think a listener will get out of this talk vs. other talks on distributed data processing and data versioning that they've previously heard?
Daniel Whitenack: Along with a truly holistic view of data provenance and processing, listeners will get a glimpse into what is possible with container-based solutions. Kubernetes has won as the scheduler/orchestrator of modern infrastructure. However, many data science/engineering architectures are still centered around frameworks that don't naturally fit into this new world. The demonstration that we will present is focused on a data management/pipelining framework built from the ground up on cloud native infrastructure. As such, it naturally works with any language/framework (Python, R, TensorFlow, H2O, PyTorch, etc.) and has built-in self-healing, portability, fault tolerance and scalability. In this talk, listeners will see the power of this Kubernetes-based approach and understand how language- and data-agnostic ML/AI workflows can be easily scaled within any cloud or on-premises infrastructure.
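To give a flavor of what "language-agnostic" means in practice: a Pachyderm pipeline is declared as a small JSON spec that points at an arbitrary container image, so the processing code can be written in any language. The dict below follows the shape of the public pipeline-spec format, but the pipeline, repo, and image names are placeholders invented for this sketch.

```python
import json

# Illustrative Pachyderm-style pipeline spec. The container image does the
# actual work, so it could just as well run R, TensorFlow, or PyTorch code;
# "edges", "images", and the image tag here are made-up example names.
pipeline_spec = {
    "pipeline": {"name": "edges"},
    "transform": {
        "image": "example-registry/edges:latest",  # any container image
        "cmd": ["python3", "/edges.py"],           # entrypoint inside it
    },
    "input": {
        # Read from a versioned data repo; the glob pattern controls how
        # the input is split into datums for parallel processing.
        "pfs": {"repo": "images", "glob": "/*"},
    },
}

print(json.dumps(pipeline_spec, indent=2))
```

A spec like this is submitted to the cluster with `pachctl` (e.g. `pachctl create pipeline -f spec.json` in recent releases), and Kubernetes takes over scheduling the worker containers, which is where the self-healing and scalability properties come from.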
About the Startups Track
The data-oriented Startups Track at DataEngConf features dozens of startups forging ahead with innovative approaches to data and new data technologies. We find the most interesting startups at the intersection of ML, AI, data infrastructure and new applications of data science, and highlight them in technical talks from the CTOs and lead engineers building their platforms.