After a decade, the name “Hadoop” remains synonymous with “big data”, even as new options for processing and querying (stream processing, in-memory analytics, interactive SQL) and new storage services (S3/Google Cloud/Azure) have emerged and unlocked new possibilities. However, the overall data architecture has become more complex, with more moving parts and specialized systems, leading to duplicated data and strained usability. In this talk, we argue that by adding some missing blocks to the existing Hadoop stack, we can provide similar capabilities right on top of Hadoop, at reduced cost and increased efficiency, greatly simplifying the overall architecture in the process.
We will discuss the need for incremental processing primitives on Hadoop, motivating them with some real-world problems from Uber. We will then introduce “Hoodie”, an open-source Spark library built at Uber to make data available faster for petabyte-scale analytics and to solve these problems. We will take a deep dive into the design and implementation of the system and discuss its core concepts, including timeline consistency and the tradeoffs between ingest speed and query performance. We will contrast Hoodie with similar systems in the space, discuss how it is deployed across the Hadoop ecosystem at Uber, and finally share the technical direction ahead for the project.