Using Apache Spark for processing trillions of records each day at Datadog

Vadim Semenov | Datadog


Massively scaling Apache Spark can be challenging, but it’s not impossible. In this session we’ll share Datadog’s path to successfully scaling Spark and the pitfalls we encountered along the way.

We’ll discuss some low-level features of Spark, Scala, JVM, and the optimizations we had to make in order to scale our pipeline to handle trillions of records every day. We’ll also talk about some of the unexpected behaviors of Spark regarding fault-tolerance and recovery—including the ExternalShuffleService, recomputing partitions, and Shuffle Fetch failures—which can complicate your scaling efforts.
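For context, the shuffle and recovery behavior mentioned above is governed by Spark configuration. A minimal, illustrative `spark-defaults.conf` fragment (values are examples, not the settings used at Datadog) might look like:

```
# Illustrative spark-defaults.conf fragment (example values only)
spark.shuffle.service.enabled       true   # run the ExternalShuffleService so shuffle files survive executor loss
spark.shuffle.io.maxRetries         10     # retry shuffle fetches before raising a FetchFailed
spark.shuffle.io.retryWait          30s    # wait between shuffle fetch retries
spark.stage.maxConsecutiveAttempts  10     # allow more stage retries when partitions are recomputed after fetch failures
```

Tuning these involves trade-offs: longer retries mask transient network issues but delay the recomputation of lost partitions when an executor is truly gone.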



Vadim Semenov

Data Engineer | Datadog

Vadim Semenov is a data engineer at Datadog, where he introduced Spark from the ground up and continues to work on scaling Spark-based pipelines and keeping them reliable. Previously, he was a data engineer at Exponential Interactive, where he built an OLAP solution using Spark and HBase and managed Hadoop infrastructure; before that, he built distributed systems at AdoTube.