Community Detection: Getting Started within Graphs and Networks

Prateek Joshi 23 Feb, 2024

11 min read

Introduction

Networks surround us, perhaps in the form of your favorite social media hubs: Facebook, Instagram, Twitter, or the intricate web of stock exchanges where buying and selling intertwine. But networks don’t just thrive in technology; they’re woven into our daily lives. Communities are the heartbeat of these networks, where clusters of closely connected nodes form distinct groups. Picture your social sphere on platforms like Facebook or Instagram – interactions with friends, colleagues, and family – a tight-knit community within the digital world. Join us on an expedition into the realm of community detection algorithms, where hidden patterns shape our interconnected world.

Getting Started within Graphs and Networks

What is a Community?

A community, with respect to graphs, can be defined as a subset of nodes that are densely connected to each other and loosely connected to the nodes in the other communities in the same graph.

Think about social media platforms such as Facebook, Instagram, or Twitter, where we try to connect with other people and different numbers of communities. Eventually, after a while, we end up being connected with people belonging to different social circles. These social circles can be a group of relatives, schoolmates, colleagues, etc.

These social circles are nothing but communities!

Also Read: Think in Graphs: Introduction to Graph Theory and its Applications using Python

What is Community Detection?

Community detection methods locate communities based on network structure, such as strongly connected groupings of nodes; however, they often ignore node properties. Nodes of similar types form a community in a network. Intra-community edges are the edges that connect the nodes in a community.

Detecting communities in a network is one of the most important tasks in network analysis. In a large scale network, such as an online social network, we could have millions of nodes and edges. Detecting communities in such networks becomes a herculean task.

Therefore, we need community detection algorithms that can partition the network into multiple communities.

Types of Community Detection

There are primarily two types of methods for detecting communities in graphs:

Agglomerative Methods
Divisive Methods

Agglomerative Methods

In agglomerative methods, we start with an empty graph that consists of nodes of the original graph but no edges. Next, the edges are added one-by-one to the graph, starting from “stronger” to “weaker” edges. This strength of the edge, or the weight of the edge, can be calculated in different ways.

Divisive Methods

In divisive methods, we go the other way round. We start with the complete graph and take off the edges iteratively. The edge with the highest weight is removed first. At every step, the edge-weight calculation is repeated, since the weight of the remaining edges changes after an edge is removed. After a certain number of steps, we get clusters of densely connected nodes.

In this article, we will cover the Girvan-Newman algorithm – an example of the divisive method.

Girvan-Newman Algorithm for Community Detection

Under the Girvan-Newman algorithm, the communities in a graph are discovered by iteratively removing the edges of the graph, based on the edge betweenness centrality value.

The edge with the highest edge betweenness is removed first. We will cover this algorithm later in the article, but first, let’s understand the concept of “edge betweenness centrality”.

Edge Betweenness Centrality (EBC)

The edge betweenness centrality (EBC) can be defined as the number of shortest paths that pass through an edge in a network. Each and every edge is given an EBC score based on the shortest paths among all the nodes in the graph.

With respect to graphs and networks, the shortest path means the path between any two nodes covering the least amount of distance. Let’s take an example to find how EBC scores are calculated. Consider this graph below:

EBC Scores

It has 6 nodes and 7 edges, right? Now, we will try to find the EBC scores for all the edges in this graph. Note that it is an iterative process and I’ve given an outline of it here:

We will take one node at a time and plot the shortest paths to the other nodes from the selected node
Based on the shortest paths, we will compute the EBC scores for all the edges
We need to repeat this process for every node in the graph. As you can see, we have 6 nodes in the graph above. Therefore, there will be 6 iterations of this process
This means every edge will get 6 scores. These scores will be added edge-wise
Finally, the total score of each edge will be divided by 2 to get the EBC score

Now, let’s start with node A. The directly connected nodes to node A are nodes B and D. So, the shortest paths to B and D from A are AB and AD respectively:

It turns out that the shortest paths to nodes C and E from A go through B and D:

The shortest paths to the last node F from node A, pass through nodes B, D, C, and E:

The graph above depicts only the shortest paths from node A to all the other nodes. Now we will see how edges are scored.

Before giving scores to the edges, we will assign a score to the nodes in the shortest-path-graph. To assign these scores, we will have to traverse the graph from the root node, i.e., node A to the last node (node F).

Assigning Scores to Nodes

As you can see in the graph above, nodes B and D have been given a score of 1 each. This is because the shortest path to either node from node A is only one. For the very same reason, node C has been given a score of 1 as there is only one shortest path from node A to node C.

Moving on to node E. It is connected to node A through two shortest paths, ABE and ADE. Hence, it gets a score of 2.

The last node F is connected to A through three shortest paths — ABCF, ABEF, and ADEF. So, it gets a score of 3.

Computing Scores for Edges

Next, we will proceed with computing scores for the edges. Here we will move in the backward direction, from node F to node A:

Step 1

We first compute the score for the edges FC and FE. The edge score for edge FC is the ratio of the node scores of C and F, i.e. 1/3 or 0.33. Similarly, for FE the edge score is 2/3.

Step 2

Now we have to calculate the edge score for the edges CB, EB, and ED. According to the Girvan-Newman algorithm, from this level onwards, every node will have a default value of 1 and the edge scores computed in the previous step will be added to this value.

So, the edge score of CB is (1 + 0.33)/1. Similarly, edge score EB or ED is (1 + 0.667)/2.

Step 3

Then we move to the next level to calculate the edge scores for BA and DA.

So far, we have computed the edge scores of the shortest paths with respect to node A. We will have to repeat the same steps again from the other remaining five nodes. In the end, we will get a set of six scores for all the edges in the network. We will add these scores and assign them to the original graph as shown below:

Step 4

Since it is an undirected graph, we will divide these scores by two and finally, we will get the EBC scores:

According to the Girvan-Newman algorithm, after computing the EBC scores, the edges with the highest scores will be taken off till the point the graph splits into two. So, in the graph above, we can see that the edges AB, BC, DE, and EF have the highest score, i.e., 4. We will strike off these edges and it gives us 3 subgraphs that we can call communities:

If you prefer learning all of this in video form, this will help you out:

Implementing the Girvan-Newman Algorithm for Community Detection in Python

I will use Zachary’s karate club Graph to demonstrate how you can perform community detection using the Girvan-Newman Algorithm.

Zachary’s karate club is a social network of a university karate club. The network became a popular example of community structure in networks after its use by Michelle Girvan and Mark Newman in 2002.

We will start off with loading the required libraries: networkx and matplotlib:

import networkx as nx

import matplotlib.pyplot as plt

%matplotlib inline

The Karate Club graph comes pre-installed with the networkx library. So, we can easily import it:

# load the graph

G = nx.karate_club_graph()

# visualize the graph

nx.draw(G, with_labels = True)

len(G.nodes), len(G.edges)

Output: (34, 78)

It turns out that the graph has 34 nodes and 78 edges. This is a pretty small graph. Usually, real-world graphs could easily be a thousand times bigger than this graph. But we’ll go ahead with this graph for this tutorial.

Now, we know that in the Girvan-Newman method, the edges of the graph are eliminated one by one based on the EBC score. So, the first task is to find the EBC values for all the edges and then take off the edge with the largest value. The below function performs the exact task:

def edge_to_remove(graph):

G_dict = nx.edge_betweenness_centrality(graph)

edge = ()

# extract the edge with highest edge betweenness centrality score

for key, value in sorted(G_dict.items(), key=lambda item: item[1], reverse = True):

edge = key

break

return edge

Now the function below will use the above function to partition the graph into multiple communities:

def girvan_newman(graph):

# find number of connected components

sg = nx.connected_components(graph)

sg_count = nx.number_connected_components(graph)

while(sg_count == 1):

graph.remove_edge(edge_to_remove(graph)[0], edge_to_remove(graph)[1])

sg = nx.connected_components(graph)

sg_count = nx.number_connected_components(graph)

return sg

Next, let’s pass the Karate Club graph to the above function:

# find communities in the graph

c = girvan_newman(G.copy())

# find the nodes forming the communities

node_groups = []

for i in c:

node_groups.append(list(i))

node_groups

Output: [[0, 1, 3, 4, 5, 6, 7, 10, 11, 12, 13, 16, 17, 19, 21],
[32, 33, 2, 8, 9, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]]

These are the node sets belonging to two communities in the Karate Club graph according to the Girvan-Newman algorithm. Let’s use these node sets to visualize the communities discovered in the graph:

# plot the communities

color_map = []

for node in G:

if node in node_groups[0]:

color_map.append('blue')

else:

color_map.append('green')

nx.draw(G, node_color=color_map, with_labels=True)

plt.show()

Community Detection i: Methods, Techniques, and Applications

Optimization Techniques for Community Detection

Community detection algorithms aim to optimize specific metrics, such as modularity, to identify the optimal partitioning of networks into communities. Modularity quantifies the quality of the division of a network into communities by comparing the number of edges within communities to the expected number in a random network. Optimization techniques iteratively adjust community assignments to maximize modularity, resulting in more accurate community detection. Examples of optimization techniques include spectral clustering, the Louvain method, and graph partitioning algorithms. These techniques can be enhanced by considering topological features such as the Lambiotte coefficient, which measures the importance of nodes within modules, and Clauset’s parameter, which characterizes the community structure.

Machine Learning Approaches to Community Detection

Machine learning techniques play a significant role in community detection, offering powerful tools for identifying communities in complex networks. Algorithms like label propagation and generative models leverage machine learning principles to partition networks into distinct communities. Label propagation algorithms assign labels to nodes based on their connectivity patterns, while generative models simulate the formation of communities based on network structure and properties. These machine-learning approaches are widely used in various domains, including biological networks and social network analysis.

Clustering Algorithms and Community Detection

Clustering algorithms form the backbone of many community detection methods, providing efficient mechanisms for grouping nodes into communities. Spectral clustering algorithms utilize eigenvectors of the graph Laplacian matrix to partition nodes into clusters based on their connectivity patterns. The Louvain method, a popular heuristic algorithm, optimizes modularity to iteratively merge and split communities until an optimal partition is achieved. These clustering algorithms are instrumental in computer science, data science, and network analysis for uncovering hidden structures in large networks.

Overlapping Community Detection

In many real-world networks, nodes may belong to multiple communities simultaneously, necessitating the detection of overlapping communities. Overlapping community detection methods, such as random walks and embedding approaches, address this challenge by identifying nodes that exhibit multiple network community memberships. Random walk-based algorithms traverse the network to identify nodes with high community memberships while embedding techniques project nodes into a latent space where overlapping communities are represented as clusters. These methods are essential for capturing the intricate relationships between nodes in complex networks.

Benchmark Datasets and Evaluation Metrics

Researchers often rely on benchmark datasets and evaluation metrics to evaluate the performance of community detection algorithms. Benchmark datasets, such as Lancichinetti-Fortunato-Radicchi benchmark graphs, provide standardized datasets for testing the efficacy of algorithms across different scenarios. Evaluation metrics, including modularity and normalized mutual information, quantify the quality of community partitions and assess algorithm performance. These benchmark datasets and evaluation metrics facilitate rigorous evaluation and comparison of community detection algorithms. (DOI: Digital Object Identifier)

Conclusion

In conclusion, community detection in graphs and networks is a dynamic field with diverse methods and applications. Optimization techniques like spectral clustering and the Louvain method, along with machine learning approaches, play crucial roles in identifying communities across various domains. Clustering algorithms provide an efficient grouping of nodes while overlapping community detection methods address the complexity of real-world networks. Benchmark datasets and evaluation metrics ensure rigorous assessment of algorithm performance. While this article provides a comprehensive overview, further exploration of concepts like vertices, different communities, and network dynamics is encouraged, along with insights from publications like “Physical Review” and “arXiv.” Overall, community detection remains a vital tool for uncovering the intricate structures of complex networks.

Frequently Asked Questions

Q1. What is community detection in graphs and networks?

A. Community detection is the process of identifying clusters or communities of closely connected nodes within a network, where nodes within a community exhibit strong connections to each other and weaker connections to nodes outside the community.

Q2. What are some common methods for community detection?

A. Common methods for community detection include agglomerative methods, where communities are formed by iteratively adding edges based on their strength, and divisive methods, where communities are identified by iteratively removing edges based on their weight. The Girvan-Newman algorithm is an example of a divisive method.

3. How does the Girvan-Newman algorithm work?

A. The Girvan-Newman algorithm identifies communities by iteratively removing edges from the graph based on their edge betweenness centrality values. Edge betweenness centrality quantifies the number of shortest paths that pass through an edge in the network.

4. What are some optimization techniques used in community detection?

A. Optimization techniques such as spectral clustering and the Louvain method aim to maximize specific metrics like modularity, which measures the quality of community divisions in a network. These techniques adjust community assignments to achieve optimal partitioning.

5. Are there any benchmark datasets and evaluation metrics for assessing community detection algorithms?

A. Yes, benchmark datasets like the Lancichinetti-Fortunato-Radicchi benchmark graphs provide standardized datasets for testing algorithm efficacy. Evaluation metrics such as modularity and normalized mutual information quantify the quality of community partitions and assess algorithm performance. Further exploration of concepts like vertices, different communities, and network dynamics is encouraged to deepen understanding, along with insights from publications like “IEEE”, “Physical Review” and “arXiv.”

Prateek Joshi 23 Feb, 2024

Data Scientist at Analytics Vidhya with multidisciplinary academic background. Experienced in machine learning, NLP, graphs & networks. Passionate about learning and applying data science to solve real world problems.

Algorithm Graphs & Networks Intermediate Python Unsupervised