Technical Talks

Why Streaming SQL? The Semantics and Challenges of Applying SQL to Unbounded Data

Micah Wylde | Co-founder & CEO | Arroyo Systems

Over the past decade, big data systems have transitioned from bespoke query languages and code-based pipelines to supporting SQL as their primary language. Stream processing engines have followed this trend: most new systems primarily or exclusively use SQL.

But traditional SQL semantics require us to wait for all data to be available to compute queries like joins or aggregates, and users do not want to wait forever for their queries to return. What does it mean to apply SQL to streams of data, which may never end?

This talk will cover the two most common answers to that question and introduce the underlying elements of time-oriented SQL, including event-time and watermarks, and how these features are implemented in modern streaming engines.

We will also cover the challenges that SQL introduces for continuously running pipelines, including the difficulty of upgrading program logic and of handling application upgrades that may cause pipelines to re-plan. Finally, we will look at the optimization benefits of using a declarative language like SQL.

Micah Wylde
Co-founder & CEO | Arroyo Systems

Micah is the co-founder and CEO of Arroyo, a startup building a new open-source stream processing engine. He was previously tech lead for streaming compute at Splunk, where his team built infrastructure to manage hundreds of customer Flink pipelines, and at Lyft, where he built the real-time data infrastructure powering Lyft's dynamic pricing, ETA, and safety features. Over a decade-plus career, he has worked on many big data batch and streaming systems across domains including ad-tech, fraud protection, and the on-demand economy.