### Inhalt des Dokuments

# Scaling up Computations on Highly Skewed Graphs

**Information**

Advisors: Dr. Asterios Katsifodimos (TU Berlin), Vasia Kalavri (KTH), Prof. Dr. Volker Markl

Thesis assignee: Andra Lungu

**Motivation**

The world wide web, the citation network or general purpose social networks follow a highly skewed power-law degree distribution [3][4]. For instance, in Twitter, according to twittercounter.com, a regular user does not have more than several hundreds of followers, whereas celebrities such as Kim Kardashian or Katy Perry have several millions of followers. Similarly, in Flickr[5], users have a tendency towards tagging thousands of pictures within their area of interest, while other images receive less attention. When suggesting groups, a standard approach is to analyse the common neighborhoods in the tag graph.

Additionally, various machine learning and data mining algorithms(e.g. belief propagation, Gibbs sampling) are modeled using graphs[2]. Knowledge representations are generally used to predict facts about the world. The bigger the graph, the higher the accuracy of the predictions. A particular set of nodes can learn from billions of sources, while others gather knowledge from hundreds or thousands of other nodes.

Scaling graph analysis computations to such types of skewed graphs can be an utterly challenging effort.

**Problem Definition**

In distributed graph analysis, a common approach is to partition the graph and spread it across multiple cluster nodes. There are many solutions to the partitioning problem ranging from the very simple idea of assigning random vertices to a machine(based on a hash function), followed by their corresponding edges, to stream algorithms that partition the graph while it is loaded into the cluster.

Partitioning a graph into components of similar size while minimizing the number of edges per component is known to be NP-hard[7]. In order to achieve load balancing among cluster nodes, greedy algorithms have been proposed in previous works. For instance, multilevel methods[6] approximate the graph by first constructing subgraphs(coarsening), then by partitioning the smallest subgraph(coarse partitioning) and finally(refining) by projecting the previous result to larger subgraphs, spectral solutions[8] leverage the connection between partitioning and eigenvalues, eigenvectors of the graph’s Laplacian matrix in order to achieve significant results. Finally, Stanton et al.[9] propose using the program that loads a graph from an external storage into the cluster in order to find a close to optimal balanced partitioning. As vertices arrive in a stream, the partitioner decides on which machine to place it. The aforementioned approaches work well in some cases: for symmetric data[6], only when the graph information is stored on a single machine[8] and for serial input as well as a single graph loader[9].

However, if the degree of a vertex is too high, these solutions perform poorly as one processing unit may have to handle a significantly larger workload in comparison to the others.

Therefore, computation for these skewed vertices also needs to be distributed across multiple nodes. In Giraph[12], each vertex executes a compute() function where messages are received and processing is done in order to proceed to the next superstep. For algorithms like triangle detection, high degree nodes will receive numerous messages and Giraph will probably exhaust its memory resources. To solve this issue, the concept of megasteps[13] was introduced: execution is grouped in into identical supersteps. During each megastep, only some vertices will process data while others remain idle. The vertex ids are manually assigned to the supersteps. This solution amortizes the memory and message requirements within one superstep, but fails to entirely solve the skew problem.

PowerGraph[1] ’s approach is to spread computation over the edges. Even though this method successfully overcomes skew in general, for a high degree vertex, the computation can either use too much memory or can take too much time to terminate.

This thesis addresses the problems associated with analysing such highly skewed graphs.

**Proposed Solution**

Aggregation trees have been widely used in applications such as parallel prefix sum computation, performance and monitoring tools, information management systems and stream processing.

We move the concept of aggregation trees to node partitioning in the sense that a high degree node will be split into subnodes which will perform a computation in parallel. The split is done recursively, based on the root’s key: if, for example, a highly skewed node is divided into four nodes and those four nodes are, in turn, highly skewed, they are again split. The splitting method as well as the depth of the tree will be decided before the beginning of the computation. We assume that the operation used for computing is associative and commutative. As soon as one computation is finished, the result is sent back to an entity which combines the partial values. By recursively aggregating the values in a bottom-up manner, the root(initial node) will ultimately contain an efficiently calculated value since the only information it needs to collect comes from its direct child nodes.

This approach requires communication and coordination among nodes. In order to properly balance the workload, it is desirable that the divided nodes have a similar branching factor. This factor will be computed based on input such as number of available nodes, their load, their CPU or memory capacity. The height of the aggregation tree is also important as it dictates the number of nodes a message should pass before reaching the root. Latency is mostly proportional to height.

## Research goal

The general objective of the thesis is to implement and analyze the behavior of various graph algorithms and skewed datasets in evaluate the proposed solution given above. In order to achieve this objective, the following specific objectives must be reached:

- A runtime analysis of computationally expensive algorithms, such as triangle enumeration, neighbor similarity, and clustering coefficient, using the vertex-centric and the GAS(gather-apply-scatter) programming models. We intend to show how high-degree nodes act as stragglers and slow down the entire computation.
- Examination of the solution of Giraph megasteps. The use of megasteps spreads the computation of the high-degree nodes across several supersteps. Show how this technique successfully amortizes the memory and message requirements within one superstep, but fails to entirely solve the problem.
- Implementation of the subnode technique in Flink [11] and Giraph [12]: break the high-degree nodes into subnodes, which can perform parts of the computation in parallel and then combine the results. This technique is very similar to aggregation trees. The implementation includes several challenges, like (a) how to choose which node to divide into subnodes? (b) how is application correctness affected? (c) into how many sub-nodes shall a high-degree node be split?
- Evaluation of the subnode technique and comparison with vertex-centric and GAS models, using the applications from step (1).

## Prerequisites

Basic knowledge in query processing techniques and Java programming as well as graph algorithms and dataflow programming.

## References

[1] Powergraph: http://graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf

[2] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. “GraphLab: A new framework for parallel machine learning”. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2010.

[3] Michalis Faloutsos, Petros Faloutsos, Christos Faloutsos, “On power-law relationships of the Internet topology”. In Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication, Pages 251-262, SIGCOMM, 1999.

[4] Dangzhi Zhao, Andreas Strotmann, “Analysis and Visualization of Citation Networks”, Morgan and Claypool Publishers, 2015

[5] Andrea Capocci, Andrea Baldassarri, Vito D.P. Servedio, Vittorio Loreto, “Friendship, collaboration and semantics in Flickr: from social interaction to semantic similarity”. In Proceedings of the International Workshop on Modeling Social Media, Toronto, Canada June 13-16 2010 Paper n. 8, ACM, New York, NY, USA, 2010

[6] Devine, K.D., Boman, E.G. , Heaphy, R.T., Bisseling, R.H., Catalyurek, U.V. “Parallel hypergraph partitioning for scientific computing“, (IPDPS 2014).

[7] Konstantin Andreev, Harald Räcke, “Balanced graph partitioning”. In Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures, 2004.

[8] S. Arora, S. Rao, and U. Vazirani. “Expander flows, geometric embeddings and graph partitioning”. J.ACM, 2009

[9] Isabelle Stanton, Gabriel Kliot, “Streaming Graph Partitioning for Large Distributed Graphs”. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, 2012.

[10] Other related literature: http://www.citeulike.org/user/vasiakalavri/article/10800832

[11] Apache Flink: https://flink.incubator.apache.org

[12] Apache Giraph: https://giraph.apache.org/

[13] Giraph Megasteps: http://event.cwi.nl/grades2014/00-ching-slides.pdf