More and more big data and machine learning applications are built on scalable, cost-effective cloud storage (S3, Azure object store, etc.), but because storage and compute are separated, they give up benefits of traditional file systems such as caching and data locality. Alluxio (www.alluxio.org) is an open-source distributed file system that not only gives distributed applications like Presto and Apache Spark a common, unified data access layer over different data sources, but also intelligently manages and places data and metadata closer to computation to improve performance. As a result, applications can seamlessly access multiple data sources with consistent performance for data and metadata operations. Alluxio began as a research project named “Tachyon” at UC Berkeley AMPLab.
In this talk, we will focus on Alluxio's design: its architecture, data flow, and metadata flow. We will dive into the choices in its design space and share our experience implementing features such as data tiering, storage options, and cache eviction policies. We will also share lessons in design, implementation, and operation from building an open-source distributed storage system with 900 contributors over 5+ years.
Bin Fan is the founding engineer of Alluxio, Inc. and a PMC member of the Alluxio open source project. Prior to Alluxio, he worked at Google building next-generation storage infrastructure. Bin received his Ph.D. in Computer Science from Carnegie Mellon University, where he worked on the design and implementation of distributed systems and algorithms.