Massively scaling Apache Spark can be challenging, but it’s not impossible. In this session we’ll share Datadog’s path to successfully scaling Spark and the pitfalls we encountered along the way.
We’ll discuss some low-level features of Spark, Scala, and the JVM, along with the optimizations we had to make to scale our pipeline to trillions of records every day. We’ll also cover some unexpected behaviors of Spark around fault tolerance and recovery (including the ExternalShuffleService, partition recomputation, and shuffle fetch failures) that can complicate your scaling efforts.
Vadim Semenov is a great data engineer.