Recent years have seen a dramatic amount of work on scalable computation frameworks, but how much progress have we actually made? We start with several well-known graph-processing systems and compare them against simple single-threaded implementations to find out just how much faster they go. The answer, at least when we took our measurements, was that they do not go faster: each was slower than a 10-15 line implementation running on a laptop.
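To make the "10-15 line implementation" concrete, here is a hedged sketch of the kind of single-threaded baseline such comparisons use: label-propagation connected components over an in-memory edge list. The function name and the tiny edge list are illustrative assumptions, not the talk's actual benchmark code or data.

```rust
// Sketch of a single-threaded graph-processing baseline (an assumption,
// not the speaker's actual code): connected components by repeatedly
// propagating the minimum label across each edge until nothing changes.
fn connected_components(nodes: usize, edges: &[(usize, usize)]) -> Vec<usize> {
    let mut label: Vec<usize> = (0..nodes).collect(); // each node starts alone
    let mut changed = true;
    while changed {
        changed = false;
        for &(src, dst) in edges {
            let min = label[src].min(label[dst]);
            if label[src] != min { label[src] = min; changed = true; }
            if label[dst] != min { label[dst] = min; changed = true; }
        }
    }
    label
}

fn main() {
    // Two components: {0, 1, 2} and {3, 4}.
    let labels = connected_components(5, &[(0, 1), (1, 2), (3, 4)]);
    println!("{:?}", labels); // prints [0, 0, 0, 3, 3]
}
```

The whole computation is a sequential loop over an array, which is exactly why it makes an honest baseline: there is no coordination, communication, or serialization overhead to amortize.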
This problem recurs in several areas of big-data systems research, where weak baselines make new results look like progress when they are merely recovering ground lost in the initial excitement over Hadoop and Spark. We will trek through several such evaluations, including the most recent systems to come out of the database research community, and offer some advice and structure for the performance-minded.
Frank McSherry is an independent researcher working on problems related to scalable distributed computation. He was part of Microsoft Research's Silicon Valley group, where he jointly invented differential privacy, basked in the light of DryadLINQ, and then led the Naiad project. Most recently he has been working on Rust implementations of Naiad's timely and differential dataflows in collaboration with ETH Zürich's Systems Group.