Community Detection Algorithms Explained

Sabrina Liu
Oct 14, 2021

A Survey of Fundamental Clustering Methods in Graph Theory

This article is intended for readers who are already familiar with the more basic concepts in Graph Theory, such as node statistics or node centrality algorithms. The focus here is to build an in-depth, mathematical understanding of the community detection algorithms commonly available in programming libraries.

The following community detection algorithms will be discussed:

  • Weakly Connected Components
  • Strongly Connected Components
  • Triangle Count
  • Clustering Coefficient
  • Local Clustering Coefficient
  • Modularity Maximization

Due to time limitations, several more advanced algorithms are not yet covered; the relevant content will be added later.

Weakly Connected Components

The term component is the most basic concept for understanding the structure of a network. Generally speaking, two nodes are in the same component of a graph if there exists a path connecting them. A connected graph has exactly one component and a disconnected graph has at least two. An isolated node with no connections forms a component of its own.

For an undirected graph, the adjacency matrix can be written, after suitably reordering the nodes, as the direct sum of the adjacency matrices of its individual components:

Direct Sum of Adjacency Matrices:

A = A₁ ⊕ A₂ ⊕ ⋯ ⊕ Aₖ (a block-diagonal matrix, one block per component)

The distinction between weakly connected components and strongly connected components is meaningful only for directed graphs. Two nodes are in the same weakly connected component if they are connected by at least one path, regardless of the direction of any edge along that path. This is the most intuitive type of community topology, and it applies to many real-world examples, especially social networks.

The adjacency matrix of a directed graph is generally asymmetric, whereas that of an undirected graph is perfectly symmetric. Other than that, we can still apply the matrix partition above and quickly derive each node's component by rearranging the matrix computationally. This makes weakly connected components a very efficient and easy-to-use algorithm.
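As a minimal sketch of the idea (assuming the graph is given as a small binary NumPy adjacency matrix; the function name is my own, not from any library), weakly connected components can be labeled by a plain graph traversal that ignores edge directions:

```python
import numpy as np

def weakly_connected_components(A):
    """Label the weakly connected components of a (possibly directed) graph.

    A is an n x n binary adjacency matrix; edge directions are ignored
    by symmetrizing the matrix before traversing.
    """
    n = A.shape[0]
    undirected = (A + A.T) > 0          # symmetrize: ignore edge directions
    labels = [-1] * n                   # -1 means "not yet visited"
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # traverse everything reachable from this unvisited node
        stack = [start]
        labels[start] = current
        while stack:
            node = stack.pop()
            for nbr in np.nonzero(undirected[node])[0]:
                if labels[nbr] == -1:
                    labels[nbr] = current
                    stack.append(nbr)
        current += 1
    return labels

# Two components: {0, 1, 2} linked by directed edges, and {3, 4}
A = np.array([[0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0]])
print(weakly_connected_components(A))  # [0, 0, 0, 1, 1]
```

The symmetrization step `(A + A.T) > 0` is exactly the "regardless of direction" part of the definition; everything after it is ordinary component labeling.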

Strongly Connected Components

The definition for strongly connected components is stricter: two nodes are strongly connected if and only if there exists a directed path from each one to the other. Note that the two paths need not use the same edges in reverse; what matters is mutual reachability. Consequently, a strongly connected component with more than one node always contains directed cycles, and any two of its nodes lie on a common cycle, like Bilbo Baggins would say, there and back again.

One simple matrix-based view keeps only the reciprocal entries of the adjacency matrix, truncating any entry whose symmetric counterpart across the diagonal is zero. That is

Truncation of Adjacency Matrix:

A′ᵢⱼ = Aᵢⱼ Aⱼᵢ

before rearranging the matrix as a direct sum. Be aware that this captures only components held together by mutual (two-node) cycles; strongly connected components built from longer directed cycles are missed, which is why practical implementations instead use linear-time graph traversals such as Tarjan's or Kosaraju's algorithm.

Strongly connected components widely exist in real-world technology and infrastructure networks, for example Internet Service Providers (ISPs), transportation networks, power grids, and postal services.
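A minimal sketch of Kosaraju's two-pass approach, assuming a small binary NumPy adjacency matrix (the recursive DFS is fine for toy graphs but would hit Python's recursion limit on large ones):

```python
import numpy as np

def strongly_connected_components(A):
    """Kosaraju's algorithm: DFS once on the graph to get a finish order,
    then DFS on the transposed graph in reverse finish order."""
    n = A.shape[0]

    def dfs(v, M, out):
        seen[v] = True
        for w in np.nonzero(M[v])[0]:
            if not seen[w]:
                dfs(w, M, out)
        out.append(v)

    # Pass 1: record nodes in order of DFS completion.
    order = []
    seen = [False] * n
    for v in range(n):
        if not seen[v]:
            dfs(v, A, order)

    # Pass 2: each DFS tree on the reversed graph is one component.
    seen = [False] * n
    components = []
    for v in reversed(order):
        if not seen[v]:
            comp = []
            dfs(v, A.T, comp)           # A.T reverses every edge
            components.append(sorted(comp))
    return components

# 0 -> 1 -> 2 -> 0 form a directed cycle; node 3 hangs off it
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 0]])
print(strongly_connected_components(A))  # [[0, 1, 2], [3]]
```

Note that the cycle 0 → 1 → 2 → 0 contains no reciprocal edge at all, so the truncation A′ᵢⱼ = Aᵢⱼ Aⱼᵢ would split it into three singletons, while the two-pass traversal correctly recovers it as one component.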

Triangle Count

The above concepts help us understand the natural clusters in a network from the perspective of connectivity. Now let's look at some ways to measure a more intriguing property of networks, transitivity, which captures how freely information can flow among neighbors.

First, we need to introduce the term clique, referring to a set of nodes in which every pair is connected. Communication within a clique is barrier-free, so it is considered to have perfect transitivity. A densely connected graph in which any two nodes are adjacent is one giant clique, but this rarely happens in graph modeling practice, and the case is trivial.

More often we are interested in different levels of local transitivity for clustering analysis. A triangle, or 3-clique, connecting three nodes to each other is a locally dense subgraph with locally perfect transitivity. The edges on all three sides are considered bi-directional, or reciprocal, so the math is the same as for undirected edges. The triangle count of a graph can be calculated from the adjacency matrix:

Count Triangles from Adjacency Matrix:

number of triangles = (1/6) tr(A³)

assuming the graph has no multi-edges and the entries are binary.
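The trace formula works because the diagonal of A³ counts closed walks of length three, and each triangle produces six of them (three starting vertices times two directions). A tiny sketch, assuming a binary NumPy adjacency matrix with no self-loops:

```python
import numpy as np

def triangle_count(A):
    """Total triangles in an undirected, loop-free, binary graph.

    tr(A^3) counts each triangle 6 times: once per starting vertex
    (3 choices) and once per traversal direction (2 choices).
    """
    return int(np.trace(A @ A @ A)) // 6

# A square 0-1-2-3 plus the diagonal 0-2: triangles {0,1,2} and {0,2,3}
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]])
print(triangle_count(A))  # 2
```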

Triangle count itself as a quantitative fact does not tell the whole story. It is more useful as an intermediate metric to calculate the clustering coefficient which speaks in relative terms as explained below.

Clustering Coefficient

The clustering coefficient of an entire graph is rigorously defined as “the fraction of paths of length two in the network that are closed” (Mark Newman, p.183). In a triangle, all the paths are considered closed since there is always a way to traverse back to the start point within the current neighborhood. On the other hand, all the closed paths of length two must belong to some triangle.

With a little thought, it's not hard to see that each triangle contains six distinct closed paths of length two (A-B-C and A-C-B are two different paths). Therefore, the clustering coefficient can be calculated from the triangle count as below:

Clustering Coefficient:

C = (number of closed paths of length two) / (number of paths of length two) = 6 × (number of triangles) / (number of paths of length two)

and its value lies between 0 (no 3-cliques) and 1 (densely connected). For trees, which have no triangles at all, the clustering coefficient is always 0.

The clustering coefficient can be seen as a measure of what portion of a social network displays "a friend of a friend is also a friend of mine." The measured clustering coefficients in many real-world networks are empirically low, typically 0.1–0.2 for social networks and much lower for technological and biological networks (Mark Newman, p. 185).
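Putting the pieces together, here is a short sketch of the global coefficient (assuming an undirected binary NumPy adjacency matrix; the helper name is illustrative). The denominator uses the fact that a node of degree k is the center of k(k − 1) ordered paths of length two:

```python
import numpy as np

def clustering_coefficient(A):
    """Global clustering coefficient: closed paths of length two
    divided by all paths of length two."""
    triangles = np.trace(A @ A @ A) / 6
    degrees = A.sum(axis=1)
    # Each node of degree k is the midpoint of k*(k-1) ordered 2-paths
    paths_len_two = np.sum(degrees * (degrees - 1))
    return 6 * triangles / paths_len_two

# Triangle 0-1-2 with a pendant node 3 attached to node 0
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]])
print(clustering_coefficient(A))  # 0.6 = 6*1 / (6 + 2 + 2 + 0)
```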

Local Clustering Coefficient

The local clustering coefficient examines the level of transitivity around each node, instead of the entire network. It's considered a community detection algorithm because it aims to study the structure of the neighborhood surrounding the measured node, not the importance of the node itself relative to all other nodes.

The idea is to find out how often the neighbors of a given node are also connected to each other. To compute the local clustering coefficient for node i, we divide the number of connected neighbor pairs by the number of all possible neighbor pairs around node i. Note that the numerator equals the number of triangles with one vertex at node i, and the denominator is a combination count, the number of ways to choose two of node i's kᵢ neighbors, i.e. kᵢ(kᵢ − 1)/2. Therefore, the local clustering coefficient is calculated as:

Local Clustering Coefficient:

Cᵢ = (number of connected pairs of neighbors of i) / (kᵢ(kᵢ − 1)/2)

The local clustering coefficient is a very interesting measure. Empirically, nodes with higher degrees usually end up with lower local clustering coefficients, and the kᵢ(kᵢ − 1)/2 denominator in the formula above makes it easy to see why. Additionally, this metric is locally related, in an inverse sense, to betweenness centrality, which measures the control that a node has over the flow of information: when a node's neighbors are poorly connected to each other, more of their communication must pass through the node itself. It is also used to study further concepts such as structural holes and redundancy in social networks and communications.
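The neighbor-pair definition translates almost literally into code. A sketch under the same assumptions as before (binary undirected NumPy adjacency matrix, illustrative function name), returning 0 for degree-0 and degree-1 nodes where the coefficient is undefined:

```python
import numpy as np

def local_clustering(A, i):
    """Local clustering coefficient of node i: connected neighbor pairs
    divided by all possible neighbor pairs k_i * (k_i - 1) / 2."""
    neighbors = np.nonzero(A[i])[0]
    k = len(neighbors)
    if k < 2:
        return 0.0                         # undefined for degree < 2; use 0
    sub = A[np.ix_(neighbors, neighbors)]  # adjacency among i's neighbors
    linked_pairs = sub.sum() / 2           # each edge appears twice in sub
    return linked_pairs / (k * (k - 1) / 2)

# Triangle 0-1-2 with a pendant node 3 attached to node 0
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]])
print(local_clustering(A, 0))  # 1/3: one of node 0's three neighbor pairs is linked
print(local_clustering(A, 1))  # 1.0: node 1's two neighbors are connected
```

Note how the high-degree node 0 scores lower than its low-degree neighbor, matching the empirical pattern mentioned above.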

Modularity Maximization

So far, all the above measures can be calculated directly from the graph's raw representation, the adjacency matrix. To gain deeper insights about a graph, we often need to purposely divide a connected graph into several groups. One good example: the majority of Facebook users form one giant weakly connected component, from which we can learn very little. Both policy researchers and commercial advertisers would want to know where all the special interest groups are. One way to achieve this is through modularity maximization, which is an optimization process.

The term modularity is essentially a metric of how independently a part functions within a complex system. Applied to graph partitioning, it compares the edge density achieved by competing partition plans and hence serves as the objective function for optimization. The density of a graph is defined as the number of existing edges divided by the number of possible edges (node pairs). The key takeaway is that a partition plan with a high modularity score should preserve high densities within each component.

To explain the mathematical expression of the modularity score, we first introduce the Kronecker delta function, a discrete analogue of the Dirac delta function. It returns 1 if its two arguments (usually non-negative integer indices) are equal and 0 otherwise:

Kronecker Delta Function:

δ(i, j) = 1 if i = j, and 0 otherwise

To compute the modularity score, we need to mark whether two nodes i and j belong to the same group. Letting g(i) denote the group that node i belongs to, we get

Node Assignment Delta:

δ(g(i), g(j))

where i and j are node indices.

Since density as a metric is normalized by a constant scalar, we can directly use the number of edges placed inside subgraphs instead. The total number of edges located inside the partitioned subgraphs can be calculated from the adjacency matrix entries and the node assignment delta:

Empirical Edge Count:

(1/2) Σᵢⱼ Aᵢⱼ δ(g(i), g(j))

which is divided by 2 since each node pair i and j is counted twice. The goal is then to calculate the difference between this empirical edge count and the expected edge count derived from the degree distribution:

Expected Edge Count:

(1/2) Σᵢⱼ (kᵢkⱼ / 2m) δ(g(i), g(j))

where kᵢ and kⱼ are the degrees of node i and node j, and m is the total number of edges. Consider a node i with kᵢ edge ends. If edges were placed at random while preserving degrees, each of those ends would attach to one of the 2m edge ends in the network uniformly, so the probability of landing on node j is kⱼ/2m. Summing over the kᵢ ends of node i gives kᵢkⱼ/2m expected edges between i and j, and the leading factor 1/2 corrects for each pair being counted twice, which yields the expected edge count above.

Finally, combining the two terms and normalizing by m, the resulting modularity score is:

Modularity Score:

Q = (1/2m) Σᵢⱼ [Aᵢⱼ − kᵢkⱼ/2m] δ(g(i), g(j))

This optimization problem turns out to be non-trivial because the number of possible partitions grows exponentially with the size of the graph; even dividing the nodes into just two groups already amounts to enumerating the power set of the node set. While an exhaustive brute-force search would find the true maximum, it is neither practical nor necessary. Fortunately, several useful heuristic algorithms have been developed to shorten the process, such as using the eigenvalues of the modularity matrix or using betweenness centrality to guide the optimization.
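While the maximization itself needs heuristics, evaluating Q for a given partition is a direct transcription of the formula. A sketch, assuming an undirected binary NumPy adjacency matrix and a list of group labels (function name and example graph are my own):

```python
import numpy as np

def modularity(A, groups):
    """Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(g(i), g(j))."""
    k = A.sum(axis=1)                       # degree vector
    two_m = k.sum()                         # 2m = sum of all degrees
    g = np.asarray(groups)
    delta = g[:, None] == g[None, :]        # node assignment delta matrix
    return np.sum((A - np.outer(k, k) / two_m) * delta) / two_m

# Two triangles joined by a single bridge edge between nodes 2 and 3
A = np.zeros((6, 6), dtype=int)
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1

print(modularity(A, [0, 0, 0, 1, 1, 1]))  # 5/14 ~ 0.357 for the natural split
print(modularity(A, [0, 1, 0, 1, 0, 1]))  # negative score for a bad split
```

As expected, the partition that cuts only the bridge edge scores far higher than one that slices through both triangles, which is exactly the signal the optimization heuristics search for.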

References:

  • Mark Newman, Networks (2nd Edition), Oxford University Press
  • Neo4j Graph Data Science Library Documentation

Disclaimer: Many of the mathematical explanations in this article are based on Prof. Newman's work. A ton of valuable information from the original book is left out here because this article serves a specific audience. If you are truly interested in learning Graph Theory in depth, I strongly recommend buying his book.
