Building a modern data lake requires dealing with a lot of complexity: querying historical and streaming data simultaneously (lambda architecture), validating data so it isn't too messy for data science and machine learning, reprocessing to handle failures, and ensuring ACID-compliant data updates. We created the Delta Lake project, open sourced under the Linux Foundation, to free data scientists and data engineers from these complex systems problems so they can focus on extracting value from data. In this talk, we'll dive into these challenges and how ACID transactions solve them. We'll discuss the patterns that emerge when you can focus on data quality, and the nitty-gritty internals of ACID on Spark that enable this focus.
Michael Armbrust is a committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and Databricks Delta. He received his PhD from UC Berkeley in 2013, where he was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage, and query optimization.