This talk discusses the Apache Arrow project and its uses for high-performance analytics and system interoperability. Data processing systems have historically been full-stack systems featuring memory management, IO, file format adapters, a runtime memory format, an in-memory query engine, and front-end user interfaces. Many of these components are fully "bespoke" or "custom", in part due to a lack of open standards for many of the pieces.
Apache Arrow was created by a diverse group of open source data system developers to define open standards and community-maintained libraries for high-performance in-memory data processing. Since the beginning of 2016, we have been building a cross-language development platform for data processing to help create systems that are faster, more scalable, and more interoperable.
I discuss current development initiatives and the future roadmap as they relate to the data science and data engineering worlds.
Wes McKinney is an open source software developer focusing on data processing tools. He created the Python pandas project and has been a major contributor to many other OSS projects. He is a Member of the Apache Software Foundation and a project PMC member for Apache Arrow and Apache Parquet. He is the director of Ursa Labs, an innovation lab for open source data science tools powered by Apache Arrow.