Recent years have seen a dramatic amount of work in the space of scalable computation frameworks, but how much progress have we actually made? We start with several well-known systems for graph processing and compare them against simple single-threaded implementations, to find out just how much faster the scalable systems can go. The answer, at least when we took the measurements, was that they don't go faster: each was slower than a 10-15 line single-threaded implementation running on a laptop.
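To make the claim concrete, here is a minimal sketch of what such a short single-threaded baseline can look like: label-propagation connected components over an edge list. This is an illustrative example in the spirit of those baselines, not the talk's actual code; the function name and representation are assumptions.

```rust
// A minimal sketch (not the talk's exact code) of a single-threaded
// connected-components baseline via label propagation over an edge list.
fn connected_components(num_nodes: usize, edges: &[(usize, usize)]) -> Vec<usize> {
    // Each node starts labeled with its own id.
    let mut label: Vec<usize> = (0..num_nodes).collect();
    let mut changed = true;
    // Repeatedly push the smaller label across each edge until a fixed point.
    while changed {
        changed = false;
        for &(src, dst) in edges {
            let min = label[src].min(label[dst]);
            if label[src] != min { label[src] = min; changed = true; }
            if label[dst] != min { label[dst] = min; changed = true; }
        }
    }
    label
}

fn main() {
    // Two components: {0, 1, 2} and {3, 4}.
    let labels = connected_components(5, &[(0, 1), (1, 2), (3, 4)]);
    println!("{:?}", labels); // [0, 0, 0, 3, 3]
}
```

A sequential scan over an edge list like this has essentially no coordination or communication overhead, which is exactly the advantage the laptop baselines exploit.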
This problem recurs in several areas of big data systems and research, where weak baselines make new results seem like progress when they are really just recovering ground lost in the initial excitement over Hadoop and Spark. We will trek through several such evaluations, including the most recent systems coming out of the database research community, and provide a bit of advice and structure for the performance-minded.
Frank McSherry is Chief Scientist at Materialize, where he (and others) convert SQL into scale-out, streaming, and interactive dataflows. Before this, he developed the timely and differential dataflow Rust libraries (with colleagues at ETHZ), and led the Naiad research project and co-invented differential privacy while at MSR Silicon Valley. He has a PhD in computer science from the University of Washington.