Structured Streaming is the next generation of distributed stream processing in Apache Spark. Developers can write a query in their language of choice (Scala, Java, Python, or R) using high-level APIs (DataFrames, Datasets, or SQL) and apply that same query to both static datasets and streaming data. For streaming data, Spark automatically creates an incremental execution plan that handles late, out-of-order data and ensures end-to-end exactly-once fault-tolerance guarantees.
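As a rough illustration of applying one query to both static and streaming data, here is a minimal PySpark-style sketch. The `action` column, the `path` argument, and the function names are hypothetical and not taken from the talk; the sketch assumes a directory of JSON files and an existing `SparkSession`.

```python
def count_by_action(events):
    """Query logic written once; `events` may be a static or a streaming DataFrame.

    The "action" column is a hypothetical example field.
    """
    return events.groupBy("action").count()


def run_on_static_and_stream(spark, path):
    """Apply the same query to a static read and a streaming read of `path`.

    `spark` is assumed to be an existing SparkSession and `path` a
    directory of JSON files (both are assumptions for this sketch).
    """
    # Static read: the whole directory is processed once.
    static_df = spark.read.json(path)
    batch_counts = count_by_action(static_df)

    # Streaming read: file sources need an explicit schema, so this
    # sketch reuses the schema inferred from the static read.
    streaming_df = spark.readStream.schema(static_df.schema).json(path)
    stream_counts = count_by_action(streaming_df)

    # Start the incremental query; "complete" mode re-emits the full
    # aggregate to the console sink after every trigger.
    query = (
        stream_counts.writeStream
        .outputMode("complete")
        .format("console")
        .start()
    )
    return batch_counts, query
```

The point of the sketch is that `count_by_action` contains no streaming-specific code; only the read and write ends differ.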

In this practical session, I will walk through a concrete streaming ETL example where, in fewer than 10 lines, you can read raw, unstructured data from Kafka, transform it, and write it out as a structured table ready for batch and ad hoc queries on up-to-the-last-minute data. I will also give a quick glimpse of advanced features such as event-time-based aggregations, stream-stream joins, and arbitrary stateful operations.
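The kind of Kafka-to-table ETL described above might look roughly like the following PySpark-style sketch. This is not the code from the session: the JSON payload format, the `EVENT_SCHEMA` fields, the Parquet sink, and all function and parameter names are assumptions made for illustration. Running it for real also requires the Spark Kafka connector package on the classpath.

```python
# Hypothetical schema for the JSON payload carried in each Kafka message.
EVENT_SCHEMA = "device STRING, signal DOUBLE, event_time TIMESTAMP"


def start_kafka_etl(spark, bootstrap_servers, topic, output_path, checkpoint_path):
    """Start a streaming ETL query: Kafka -> parsed columns -> Parquet files.

    `spark` is assumed to be an existing SparkSession; every other
    argument is a placeholder chosen for this sketch.
    """
    # Read raw records from Kafka; `value` arrives as binary.
    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )

    # Transform: cast the payload to a string, parse it as JSON using
    # the assumed schema, and flatten the result into top-level columns.
    parsed = (
        raw.selectExpr("CAST(value AS STRING) AS json")
        .selectExpr(f"from_json(json, '{EVENT_SCHEMA}') AS data")
        .select("data.*")
    )

    # Write out as a Parquet table; the file sink requires a checkpoint
    # location so the query can recover after a failure.
    return (
        parsed.writeStream
        .format("parquet")
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

Once such a query is running, the files under `output_path` could be read with an ordinary batch query (for example `spark.read.parquet(output_path)`) to inspect recent data.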

Tathagata Das

Lead Developer of Spark Streaming | Databricks

Tathagata Das is an Apache Spark committer and a member of the PMC. He is the lead developer behind Spark Streaming and currently develops Structured Streaming. Previously, he was a grad student in the AMPLab at UC Berkeley, where he conducted research on data-center frameworks and networks with Scott Shenker and Ion Stoica.

Data Council, PO Box 2087, Wilson, WY 83014, USA - Phone: +1 (415) 800-4938 - EIN: 46-3540315 - Email: community (at) datacouncil.ai