Michael Isard, Derek Murray, and I recently sent in a HotOS submission (it’s not blind, so no harm talking about it, we think). The subject is hinted at from the post title (stolen from the paper title):
Big data systems may scale well, but this can often be just because they introduce a lot of overhead.
Rather than making your computation go faster, the systems introduce substantial overheads which can require large compute clusters just to bring under control.
In many cases, you’d be better off running the same computation on your laptop.
Methodology
Here is the set-up: we took several recent graph-processing publications from the systems community, and compared the measurements they report to simple single-threaded implementations running on my work laptop (RIP). We wrote competent implementations, but we didn’t obsess deeply over fancy algorithms, cunning data-dependent tweaks, or what have you. I’ll show you the code, and you decide.
We evaluated PageRank (20 iterations) and graph connectivity on two graphs, twitter_rv and uk_2007_05. We chose these algorithms and datasets mostly because a recent OSDI 2014 paper GraphX used exactly these to evaluate several of the top systems, and we wanted to borrow their numbers rather than try to reproduce them on systems we aren’t expert with.
As a caveat, these algorithms are quite specific to graph processing, and the data sets are not large (billions of edges, but still just a few gigabytes). Our conclusion is not that the systems we are going to look at are obviously bad, but rather that there is not yet much evidence that they are especially good.
The point we try to make in the HotOS submission is that evaluation of these systems, especially in the academic context, is lacking. Folks have gotten all wound-up about scalability, despite the fact that scalability is just a means to an end (performance, capacity). When we actually look at performance, the benefits the scalable systems bring start to look much more sketchy. We’d like that to change.
Measurements
All measurements other than the “single thread” lines are from the GraphX paper linked above. I’ll talk about the single-threaded implementations soon, but the results for the most direct implementations of PageRank and label propagation (the algorithm the systems use for graph connectivity) look like this: