Parallel Local Graph Clustering

Julian Shun gave a talk at MIT last week on his paper ‘Parallel Local Graph Clustering’ (Shun et al., VLDB 2016). The paper is about parallelizing graph clustering heuristics whose running time scales with the size of the cluster being formed rather than with the size of the whole graph, what the authors call local running time. The paper doesn’t propose new clustering heuristics but instead focuses on how to parallelize various heuristics from previous work while maintaining the local running time property.

In particular, parallelization techniques are presented for three clustering heuristics – Nibble, PageRank-Nibble (PR-Nibble) and Deterministic Heat Kernel PageRank. The basic idea behind each of these heuristics is similar. First, a random vertex in the graph is chosen and an initial weight is assigned to it, usually 1.0. A number of iterations are then performed to propagate this weight/mass to other vertices in the graph. In each iteration, every active vertex distributes some fraction of its current weight among its neighbors and keeps the remaining fraction for itself. An active vertex is one whose current weight is greater than a certain threshold. The intuition is that vertices that form a good cluster will accumulate larger weights because they are well connected among themselves. This process is repeated for a pre-specified number of iterations or until a convergence condition is met, much like the PageRank algorithm for ranking vertices in a web graph.
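
To make the propagation step concrete, here is a minimal serial sketch of a push-style iteration in the spirit of PR-Nibble (one common variant, not the paper’s exact parallel formulation); the names alpha (the fraction of weight a vertex keeps) and epsilon (the activation threshold) are illustrative parameters rather than the paper’s notation.

```cpp
#include <queue>
#include <vector>

// A serial sketch of a push-style PR-Nibble iteration (one common variant,
// not the paper's exact parallel formulation). adj is an adjacency list,
// seed is the starting vertex, alpha is the fraction of weight a vertex
// keeps for itself, and epsilon is the activation threshold.
std::vector<double> pr_nibble_push_sketch(
    const std::vector<std::vector<int>>& adj,
    int seed, double alpha, double epsilon) {
  int n = adj.size();
  std::vector<double> p(n, 0.0);   // accumulated weight (the output vector)
  std::vector<double> r(n, 0.0);   // residual weight still to be propagated
  std::vector<char> queued(n, 0);
  std::queue<int> active;

  r[seed] = 1.0;                   // all initial mass sits on the seed vertex
  active.push(seed);
  queued[seed] = 1;

  while (!active.empty()) {
    int v = active.front();
    active.pop();
    queued[v] = 0;

    double rv = r[v];
    r[v] = 0.0;
    p[v] += alpha * rv;                           // keep a fraction for itself
    if (adj[v].empty()) continue;
    double share = (1.0 - alpha) * rv / adj[v].size();
    for (int u : adj[v]) {
      r[u] += share;                              // push the rest to neighbors
      // A vertex becomes active once its residual per edge exceeds epsilon.
      if (!queued[u] && r[u] >= epsilon * adj[u].size()) {
        active.push(u);
        queued[u] = 1;
      }
    }
  }
  return p;
}
```

Because only vertices whose residual exceeds the threshold are ever processed, the work stays proportional to the volume of the cluster being grown rather than to the size of the graph.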

The heuristics differ in what fraction of a vertex’s weight is kept, how weights are propagated to neighbors, how active vertices are chosen, and the convergence criteria. The output of each heuristic is a real-valued vector of weights over all the vertices in the graph. The Sweep Cut procedure takes this vector and generates the final cluster. The authors also discuss a clever parallelization technique for Sweep Cut.
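
As a rough illustration of how Sweep Cut turns the weight vector into a cluster, here is a serial sketch of the usual formulation (not the paper’s parallel algorithm): vertices are ordered by degree-normalized weight, every prefix of that order is considered as a candidate cluster, and the prefix with the lowest conductance (edges leaving the set divided by the smaller of the two sides’ volumes) is returned.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// A serial sketch of the Sweep Cut step (the usual formulation, not the
// paper's parallel algorithm). Vertices are sorted by degree-normalized
// weight, every prefix of that order is a candidate cluster, and the prefix
// with the lowest conductance is returned.
std::vector<int> sweep_cut_sketch(const std::vector<std::vector<int>>& adj,
                                  const std::vector<double>& p) {
  int n = adj.size();
  long long total_vol = 0;
  for (const auto& nbrs : adj) total_vol += nbrs.size();

  // Order vertices by p[v] / deg(v), largest first.
  std::vector<int> order(n);
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    double da = adj[a].empty() ? 1.0 : double(adj[a].size());
    double db = adj[b].empty() ? 1.0 : double(adj[b].size());
    return p[a] / da > p[b] / db;
  });

  // Grow the prefix one vertex at a time, maintaining its volume and the
  // number of edges crossing the boundary, and remember the best prefix.
  std::vector<char> in_set(n, 0);
  long long vol = 0, cut = 0;
  double best_cond = 2.0;   // conductance never exceeds 1
  int best_len = 0;

  for (int k = 0; k < n; ++k) {
    int v = order[k];
    in_set[v] = 1;
    vol += adj[v].size();
    for (int u : adj[v]) cut += in_set[u] ? -1 : +1;  // internal vs. boundary edges
    long long denom = std::min(vol, total_vol - vol);
    if (denom > 0) {
      double cond = double(cut) / double(denom);
      if (cond < best_cond) { best_cond = cond; best_len = k + 1; }
    }
  }
  return std::vector<int>(order.begin(), order.begin() + best_len);
}
```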

The authors use the primitives of the Ligra graph processing framework to parallelize these heuristics. In particular, the weight propagation step maps directly to Ligra’s EdgeMap primitive. The authors also develop a concurrent hash table to store the weights of the active vertices. The main benefit of these primitives is that they only process the edges of active vertices, making the runtime proportional to the number of active vertices plus the sum of their degrees (called the volume). Thus the parallel versions do not lose the local running time property, and the authors prove this by giving theoretical bounds on the work performed by their parallel versions.
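
The following is a simplified stand-in for the EdgeMap pattern, not Ligra’s actual interface: it touches only the out-edges of the current frontier and builds the next frontier from destinations that pass a condition, which is what keeps the per-iteration work proportional to the frontier’s volume.

```cpp
#include <vector>

// A simplified stand-in for the EdgeMap pattern (not Ligra's actual API):
// apply an update function to every edge leaving the current frontier and
// collect the destinations that pass a condition into the next frontier.
// Only the adjacency lists of frontier vertices are scanned, so the work per
// iteration is proportional to the frontier's volume, not to the graph size.
template <class Update, class Cond>
std::vector<int> edge_map_like(const std::vector<std::vector<int>>& adj,
                               const std::vector<int>& frontier,
                               Update update, Cond cond) {
  std::vector<int> next;
  // Dense duplicate-detection array, used here only for brevity; Ligra keeps
  // its frontier and per-vertex state in sparse form to preserve locality.
  std::vector<char> added(adj.size(), 0);
  for (int v : frontier) {            // Ligra processes this loop in parallel
    for (int u : adj[v]) {
      if (cond(u) && update(v, u) && !added[u]) {
        added[u] = 1;
        next.push_back(u);
      }
    }
  }
  return next;
}
```

In the paper’s setting, the update function would roughly correspond to adding a source vertex’s share of weight to the destination’s entry in the concurrent hash table, and the condition to checking the activation threshold.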

These parallel versions achieve speedups of 23x–28x on a 40-core machine with hyper-threading, for a total of 80 threads. The code is available in the Ligra GitHub repository. The paper also shows Network Community Profiles for some of the largest publicly available graphs. These profiles plot the conductance (a measure of cluster quality, where lower is better) of a set of randomly generated local clusters against the size of those clusters. For most graphs the best clusters have a size roughly equal to the square root of the number of vertices in the graph.

Some questions:

  • Is memory performance the bottleneck to the scalability of these algorithms?
  • The parallel versions use double buffering, i.e. they maintain copies of the weights for both the current and the previous iteration. Is it possible to use a single buffer while maintaining correctness, and what would the effect on performance be? (Credits: Yunming Zhang)
  • Since the general consensus seems to be that vertices in power-law graphs are harder to cluster than edges, how do these heuristics compare with edge-centric clustering heuristics?

Transactional Memory and Lock Elision

I started reading a post on recent trends in hardware and their implications for distributed systems design. One recent addition that could make parallel programming much easier is the TSX (Transactional Synchronization Extensions) instruction set in Intel Haswell processors.


TSX comes in two flavors: a backwards-compatible Hardware Lock Elision (HLE) instruction set and a more flexible, forward-looking Restricted Transactional Memory (RTM) instruction set.
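
To get a feel for the RTM flavor, here is a hedged sketch of the common transaction-plus-fallback-lock pattern using the _xbegin/_xend/_xabort intrinsics from immintrin.h (compiled with -mrtm on a TSX-capable CPU); the fallback spinlock and the no-retry policy are simplifications for illustration, not a recommended recipe.

```cpp
#include <immintrin.h>   // _xbegin / _xend / _xabort; compile with -mrtm
#include <atomic>

// Illustrative fallback lock; a production implementation would typically
// retry the transaction a few times before taking the lock.
static std::atomic<bool> fallback_lock{false};

template <class CriticalSection>
void with_rtm(CriticalSection body) {
  unsigned status = _xbegin();
  if (status == _XBEGIN_STARTED) {
    // Transactional path: reading the lock adds it to our read set, so a
    // thread that later acquires the fallback lock aborts this transaction.
    if (fallback_lock.load(std::memory_order_relaxed)) _xabort(0xff);
    body();
    _xend();   // commit all buffered reads and writes atomically
  } else {
    // The transaction aborted (conflict, capacity, unfriendly instruction, ...)
    // or TSX is unavailable: run the critical section under the real lock.
    while (fallback_lock.exchange(true, std::memory_order_acquire)) { /* spin */ }
    body();
    fallback_lock.store(false, std::memory_order_release);
  }
}
```

HLE, by contrast, works by adding the XACQUIRE/XRELEASE prefixes to existing lock acquire and release instructions; older processors simply ignore the prefixes, which is what makes it backwards compatible. Here are a few blog posts and papers that give a good overview of Intel TSX and its origins in academia.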


A brief introduction to Intel TSX and its implications for the industry

http://www.anandtech.com/show/6290/making-sense-of-intel-haswell-transactional-synchronization-extensions

Analysis of Haswell’s transactional memory by David Kanter

http://www.realworldtech.com/haswell-tm/1/

Hardware Lock Elision on Haswell

https://brooker.co.za/blog/2013/12/14/intel-hle.html

Lock elision support in GCC and Linux

http://halobates.de/adding-lock-elision-to-linux.pdf

The original paper: Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution (Rajwar and Goodman, MICRO 2001)

http://pages.cs.wisc.edu/~rajwar/papers/micro01.pdf