Data Council Blog

Open Source Highlight: Klio

Klio is a framework that makes large-scale processing and ML research on binary files easier. Audio files were its original use case: it was developed for audio intelligence at Spotify, which open-sourced it earlier this year at the 2020 International Society for Music Information Retrieval Conference (ISMIR).

With a fast-growing catalog of more than 60 million songs, Spotify already had a tool to build data pipelines at scale: Scio (also open source). However, its teams needed to process the audio itself, not just the data attached to it, and to do so in a way that would let engineers and researchers share tooling and infrastructure and speak the same language, without a translation layer.
 
As a result, Klio is built on Python and Apache Beam: Python is arguably the lingua franca of ML, while Beam supports both batch and streaming data processing jobs and is designed to run on multiple execution engines. There is a caveat, as Beam's Python SDK and Klio aren’t fully compatible with all engines yet, though this could change in the near future, helped along by contributors. In the meantime, Klio is already cloud-agnostic in its philosophy, and it is its integration with the cloud that makes it so compelling, as it opens the prospect of productionized audio processing to anyone.
 
While it is too early to predict which use cases will emerge, it will be interesting to see Klio applied to different ML scenarios. As its announcement blog post (well worth reading) put it: “We think Klio’s ease of use — and its ability to let anyone leverage modern cloud infrastructure and tooling — has the potential to unlock new possibilities in media and ML research everywhere, from big tech companies to universities and libraries.”
 
To learn more, check out the repository, documentation, and FAQs. If you’d like to go one step further, you can also join the dedicated #klio channel in the Spotify FOSS Slack, or watch the talk that Lynn Root gave about Klio during Apache Beam Summit last August.

Big Data, Data Pipelines, Machine Learning, Open Source, Audio Research