Easy, Scalable, Fault-tolerant Stream Processing with Structured Streaming in Apache Spark

Burak Yavuz | Databricks


Structured Streaming, first introduced in Apache Spark 2.0 and generally available (GA) since Spark 2.2, is a stream processing engine built on Spark SQL that has revolutionized how developers write stream processing applications. Structured Streaming lets users express their computations the same way they would express a batch query on static data, using powerful high-level APIs including DataFrames, Datasets, and SQL. The Spark SQL engine then converts these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees. Finally, we will showcase how you can harness all this power by connecting to systems that you know and love, such as Apache Kafka and Apache Cassandra.
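To make the idea of incremental execution concrete, here is a toy sketch in plain Python. It is not Spark code and does not reflect Spark's internals. It only illustrates how a batch-style windowed count can be maintained incrementally over micro-batches, while tolerating out-of-order events up to a lateness threshold. All names here (`WINDOW_SECONDS`, `ALLOWED_LATENESS`, `IncrementalCounts`) and the drop rule for late events are assumptions made for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 10    # hypothetical tumbling-window length
ALLOWED_LATENESS = 5   # hypothetical lateness threshold, in seconds


def window_start(ts):
    """Start of the tumbling window that contains event time `ts`."""
    return ts - ts % WINDOW_SECONDS


def batch_counts(events):
    """Batch query: count events per (window, word) over a static list."""
    return Counter((window_start(ts), word) for ts, word in events)


class IncrementalCounts:
    """Toy incremental version of `batch_counts`, fed one micro-batch at a time."""

    def __init__(self):
        self.state = Counter()              # running counts kept between batches
        self.max_event_time = float("-inf")
        self.dropped = 0                    # events rejected as too late

    def process(self, micro_batch):
        # Simplified watermark: the largest event time seen in earlier
        # micro-batches, minus the allowed lateness.
        watermark = self.max_event_time - ALLOWED_LATENESS
        for ts, word in micro_batch:
            if ts < watermark:
                self.dropped += 1           # too late: ignored
                continue
            self.state[(window_start(ts), word)] += 1
            self.max_event_time = max(self.max_event_time, ts)
        return Counter(self.state)


# Three micro-batches of (event_time, word); the second and third contain
# out-of-order events.
batches = [
    [(1, "a"), (3, "b"), (12, "a")],
    [(8, "a"), (14, "b")],    # 8 is late but within the threshold: counted
    [(2, "a"), (25, "a")],    # 2 is older than the watermark: dropped
]

query = IncrementalCounts()
for micro_batch in batches:
    final = query.process(micro_batch)

# The incremental result matches the batch query over all accepted events.
accepted = batches[0] + batches[1] + [(25, "a")]
matches_batch = final == batch_counts(accepted)
```

The point of the sketch is that the query logic in `batch_counts` stays the same. Only the execution changes, by keeping running state between micro-batches instead of rescanning all data.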


Burak Yavuz

Software Engineer | Databricks

Burak Yavuz is an Apache Spark Committer and a Software Engineer at Databricks working
on Structured Streaming. He's been contributing to Spark since Spark 1.1, and is the
maintainer of Spark Packages (https://spark-packages.org, https://sparkpackages.appspot.com/).