Massively scaling Apache Spark can be challenging, but it’s not impossible. In this session we’ll share Datadog’s path to successfully scaling Spark and the pitfalls we encountered along the way.
We’ll discuss some low-level features of Spark, Scala, and the JVM, and the optimizations we had to make in order to scale our pipeline to handle trillions of records every day. We’ll also cover some unexpected behaviors of Spark around fault tolerance and recovery—including the ExternalShuffleService, recomputing partitions, and shuffle fetch failures—which can complicate your scaling efforts.
Vadim Semenov has been working at Datadog for almost four years as a software engineer. He helped bring Spark to Datadog and helped build the major pipelines that process all metrics data on top of Spark. Previously, he worked at Exponential Interactive, where he scaled core systems, operated Hadoop clusters, worked closely with HBase, and built pipelines in MapReduce and Spark. In his free time he teaches programming to high school students at CodeNation, and so far he's been to two Taylor Swift concerts.