This talk is about a service that replaced our previous real-time service producing all historical metrics data, which powers many features at Datadog. The new service is built on Kafka Connect and Spark (the previous one was written in Go) and handles hundreds of terabytes and tens of trillions of records a day while running on spot instances in a highly volatile environment. We'll cover the decisions we made, the architecture, engineering considerations, and the various optimizations we had to apply. We'll show that it's possible to process trillions of records using Spark, discuss the savings we achieved as a result, walk through operational challenges and benefits, and describe the migrations we had to implement as part of rolling out the new service.
Although not everything can be directly reused by others, we believe the general approaches and considerations for building and migrating a system of this scale can serve as guidance for building systems at many different scales.
Vadim Semenov has been working at Datadog for almost 4 years as a software engineer. He helped bring Spark to Datadog and helped build the major pipelines that process all metrics data on top of Spark. Previously he worked at Exponential Interactive on scaling its core systems, operating Hadoop clusters, working closely with HBase, and building pipelines in MapReduce and Spark. In his free time he teaches high school students programming at CodeNation, and so far he's been to two Taylor Swift concerts.