Over the past decade, Big Data systems have moved from bespoke query languages and code-based pipelines to SQL as their primary language. Stream processing engines have followed this trend: most new systems primarily or exclusively use SQL.
But traditional SQL semantics require us to wait for all data to be available to compute queries like joins or aggregates, and users do not want to wait forever for their queries to return. What does it mean to apply SQL to streams of data, which may never end?
This talk will cover the two most common answers to that question, introduce the underlying elements of time-oriented SQL, including event time and watermarks, and show how these features are implemented in modern streaming engines.
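As a rough illustration of how event time and watermarks let an aggregate return results on an unbounded stream, here is a minimal Python sketch. It is not taken from any particular engine: the function name `windowed_counts`, the constants `WINDOW_SIZE` and `MAX_DELAY`, and the policy of dropping late events are all assumptions made for the example. The watermark is modeled as the largest event time seen so far minus a fixed bound on out-of-orderness, and a tumbling window's count is emitted once the watermark passes the end of that window.

```python
from collections import defaultdict

WINDOW_SIZE = 10  # assumed: width of each tumbling window, in seconds
MAX_DELAY = 5     # assumed: bound on how far out of order events may arrive


def windowed_counts(event_times):
    """Yield (window_start, count) once the watermark passes a window's end.

    event_times is an iterable of integer event timestamps, possibly out of order.
    """
    open_windows = defaultdict(int)
    watermark = float("-inf")
    for event_time in event_times:
        window_start = event_time - event_time % WINDOW_SIZE
        if window_start + WINDOW_SIZE <= watermark:
            # The window was already emitted; this sketch simply drops late events.
            continue
        open_windows[window_start] += 1
        # The watermark only moves forward: max event time seen, minus the delay bound.
        watermark = max(watermark, event_time - MAX_DELAY)
        # Emit every open window that the watermark has now passed.
        for start in sorted(open_windows):
            if start + WINDOW_SIZE <= watermark:
                yield start, open_windows.pop(start)


if __name__ == "__main__":
    for window_start, count in windowed_counts([1, 4, 12, 3, 16, 9, 27]):
        print(window_start, count)
```

In this run the event at time 3 arrives after the event at time 12 but is still counted, because the watermark has not yet passed the end of window 0. The event at time 9 arrives after that window has been emitted and is dropped. The window starting at 20 is never emitted, because the input ends before the watermark reaches 30. A real engine would offer more choices for late data than this sketch does.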
We will also cover the challenges that SQL introduces for continuously running pipelines, including the difficulty of upgrading program logic and of handling application upgrades that may cause pipelines to re-plan. Finally, we will look at the optimization benefits of using a declarative language like SQL.