The entry barrier to streaming analytics is often daunting, particularly for data scientists and machine learning engineers. Orchestrating multiple distributed systems is a significant challenge in itself, and it is compounded by the need to configure unfamiliar, non-domain tools on local machines or Kubernetes clusters before development of a streaming pipeline can even begin.
Our solution, which leverages JupyterHub as a gateway to a pre-configured environment equipped with all necessary integrations, substantially lowers the entry barrier for interactive streaming analytics. We will show how data scientists and machine learning engineers can combine JupyterHub, PyFlink, and FlinkSQL to gain intuitive abstractions for building pipelines directly within JupyterHub and then seamlessly executing them on a remote Kubernetes cluster.
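To make this concrete, the following is a minimal sketch of the kind of FlinkSQL a user might run interactively from a notebook cell; the `events` Kafka topic, connector properties, and column names are illustrative assumptions, not part of the talk itself:

```sql
-- Hypothetical source table backed by a Kafka topic (names are placeholders)
CREATE TABLE events (
    user_id    STRING,
    event_type STRING,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic'     = 'events',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format'    = 'json'
);

-- Interactive aggregation: events per type over 1-minute tumbling windows
SELECT event_type,
       TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
       COUNT(*) AS event_count
FROM events
GROUP BY event_type, TUMBLE(event_time, INTERVAL '1' MINUTE);
```

In the pre-configured environment described above, such a query can be submitted from a notebook cell and executed on the remote cluster without the user managing connector jars or cluster configuration by hand.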
The results have been compelling: users report marked productivity gains, building streaming pipelines in significantly less time and at larger scale.
Elkhan Dadashov is a software engineer at Apple. He works on the AIML data platform team, focusing on building interactive real-time data infrastructure and scaling massive data ingestion pipelines. Elkhan previously worked on world-scale microservices, machine learning model serving, and streaming/big data processing tools and pipelines at Uber AI.