14.09.2011

C. Boden

## Analyzing large social graphs in the ROBUST Project

As it is ROBUST's goal to analyze online communities with millions of users, it becomes a challenging task to scale out the algorithms and infrastructure that is necessary to accomplish that. Running such analytical computations on a single computer would take days or weeks and quickly becomes an infeasible approach.

Therefore the Database Systems and Information Management Group (DIMA) from TU Berlin (TUB) is participating in ROBUST, having the task to provide scalable community analytics on parallel processing platforms such as Hadoop and Stratosphere. These systems run clusters of several machines and provide programming models (called MapReduce and PACT) to execute computations in parallel.

DIMA has strong ties to the "Mahout Machine Learning"-Project of the Apache Software Foundation (a non-profit organisation where hundreds of volunteers develop software under the open-source business-friendly Apache License). This project aims to develop scalable data mining algorithms and is a perfect fit for our work in ROBUST.

In order to drive forward standardization and make research results instantly usable for companies, a majority of the algorithms developed and implemented for ROBUST are committed into the Mahout project. The latest contributions consist of two algorithms that measure the proximity and similarity of nodes in a social graph. Social graphs usually consist of nodes that represent users which are connected by edges that represent relationships like friendship or replies in a communication process. In ROBUST scenarios we look at social graphs with millions of nodes and hundreds of millions of edges.

The first algorithm TUB implemented is called Random Walk With Restarts. It mimics a random walk from a start node through the social graph and calculates the probability that we arrive at a particular different node. This algorithm provides a measure of the proximity of nodes. If the probability to reach a particular node from our start node is exceptionally high and there is no edge linking those directly, it might be desirable to form a new connection between them. If we apply the algorithm to friendship graphs we have a way to predict "missing" friendships, similar to "people you might know" recommendations on platforms such as LinkedIn or Facebook.

The second algorithm comes from Recommendation Mining originally and interprets a link in the social graph as a recommendation from one user to another. It pairwisely compares all nodes to find the most similar ones. This is actually the same as the approach behind Amazon's well known "people who like this book also like those books" recommendations. We mine relations that could be described by "people who are friends with this user are also friends with those users". The results can again be used to predict missing friendships and might also be an indicator for like minded groups of people.