School of Computer Science, Beijing University of Posts and Telecommunications, Beijing100876, China
Show Author Information
Hide Author Information
Abstract
Recently, complex networks have attracted considerable research attention. Community detection is an important problem in the field of complex networks and is useful in a variety of applications such as information propagation, link prediction, recommendation, and marketing. In this study, we focus on discovering overlapping community structures by using link partitions. We propose a Latent Dirichlet Allocation (LDA)-Based Link Partition (LBLP) method, which can find communities with an adjustable range of overlapping. This method employs the LDA model to detect link partitions, which can calculate the community belonging factor for each link. On the basis of this factor, link partitions with bridge links can be found efficiently. We validate the effectiveness of the proposed solution by using both real-world and synthesized networks. The experimental results demonstrate that the approach can find a meaningful and relevant link community structure.
No abstract is available for this article. Click the button above to view the PDF directly.
References
[1]
G.Palla, I.Derenyi, I.Farkas, and T.Vicsek, Uncovering the overlapping community structure of complex networks in nature and society, Nature, vol. 435, no. 7043, pp. 814-818, 2005.
J.Xie, S.Kelley, and B. K.Szymanski, Overlapping community detection in networks: The state of the art and comparative study, ACM Computing Surveys, Arxiv preprint arxiv:1110.5013, pp. 2013.
Z.Wu, Y.Lin, H.Wan, and S.Tian, A fast and reasonable method for community detection with adjustable extent of overlapping, inProc. 5th Int. Intelligent Systems and Knowledge Engineering Conf., Hangzhou, China, 2010, pp. 376-379.
[8]
Y.Kim and H.Jeong, Map equation for link community, Computing Research Repository, arxiv preprint arxiv: 1105.0257, 2011.
Q.Ye, B.Wu, Z.Zhao, and B.Wang, Detecting link communities in massive networks, inProc. 3th Int. Advances in Social Networks Analysis and Mining Conf., Taiwan, China, 2011, pp. 71-78.
A.Lancichinetti, S.Fortunato, and J.Kertész, Detecting the overlapping and hierarchical community structure in complex networks, New Journal of Physics, vol. 11, no. 3, p. 033015, 2009.
P.Arnau, P.Guillem, P.Julian, and M. Victor. Overlapping community search for social networks, inProc. 26th Int. Data Engineering Conf., Long Beach, California, USA, 2010, pp. 992-995.
[15]
C.Lee, F.Reid, A.McDaid, and N.Hurley, Detecting highly overlapping community structure by greedy clique expansion, arXiv preprint arXiv:1002.1827, 2010.
[16]
N.Du, B.Wang, and B.Wu, Overlapping community structure detection in networks, inProc. 17th Int. Information and Knowledge Management Conf., Napa Valley, USA, 2008, pp.1371-1372.
H.Zhang, B.Qiu, C. L.Giles, H. C.Foley, and J.Yen, An LDA-based community structure discovery approach for large-scale social networks, inProc. 5th Int. Intelligence and Security Informatics Conf., New Brunswick, NJ, USA, 2007, pp. 200-207.
T.Nepusz, A.Petróczi, L.Négyessy, and F.Bazsó, Fuzzy communities and the concept of bridgeness in complex networks, Physical Review E, vol. 77, no. 1, p.016107, 2008.
S.Zhang, R.Wang, and X.Zhang, Identification of overlapping community structure in complex networks using fuzzy c-means clustering, Physica A: Statistical Mechanics and Its Applications, vol. 374, no. 1, pp. 483-490, 2007.
S.Gregory, A fast algorithm to find overlapping communities in networks, Machine Learning and Knowledge Discovery in Databases. New York: Springer, 2008, pp. 408-423.
[22]
S.Gregory, Finding overlapping communities in networks by label propagation, New Journal of Physics, vol. 12, no. 10, p. 103018, 2010.
I.Porteous, D.Newman, A.Ihler, A.Asuncion, P.Smyth, and M.Welling, Fast collapsed Gibbs sampling for latent dirichlet allocation, inProc. 14th Int. Knowledge Discovery and Data Mining Conf., Las Vegas, USA, 2008, pp. 569-577.
T. L.Griffiths and M.Steyvers, Finding scientific topics, National Academy of Sciences of the United States of America, vol. 101, no. suppl, pp. 5228-5235, 2004.
T.Minka and J.Lafferty, Expectation-propagation for the generative aspect model, inProc. th Int. Uncertainty in Artificial Intelligence Conf., Alberta, Canada, 2002, pp. 352-359.
[27]
M.Steyvers and T.Griffiths, Probabilistic topic models, Handbook of Latent Semantic Analysis, vol. 427, no. 7, pp. 424-440, 2007.
X.Wei and W. B.Croft. LDA-based document models for ad-hoc retrieval, inProc. 29th Int. Research and Development in Information Retrieval Conf., Seattle, Washington, USA, 2006, pp. 178-185.
Yu L, Wu B, Wang B. LBLP: Link-Clustering-Based Approach for Overlapping Community Detection. Tsinghua Science and Technology, 2013, 18(4): 387-397. https://doi.org/10.1109/TST.2013.6574677
10.1109/TST.2013.6574677.F001
Graphical model for LDA.
3.1 Terminology and background
We assume that a social network contains the vertex set and the edge set . Each vertex in represents as an actor and an edge in represents the interaction between the actors.
Problem Given a network including the node set , divide the edge set , which is defined for the edges between the different nodes, into communities on the basis of the link behavior and determine the node community induced by the edge community .
Definition 1 Line graph Give a network , create a line graph LG of the original network. The nodes in LG are the edges in and the edges in LG exist on the basis of whether the edges share common nodes in , i.e., .
Definition 2 Edge weight Given , represents the nodes connecting the edge . For all the edge pairs , and , . Define the matrix F as follows: if , then ; otherwise, . The weight of edge can be defined according to Ref. [
23
]:
where denotes the degree of node . If , then ; otherwise, .
The LDA model is a general probabilistic process for modeling the sparse vectors of the count data.
Figure 1
shows the graphical model representation of the LDA model. For the considered text corpus, each document of documents is a mixture of latent topics, each of which describes a multinomial distribution of a -word vocabulary. The generative process of the LDA model for words and documents is as follows:
10.1109/TST.2013.6574677.F001
Graphical model for LDA.
The probability that a word exists in the document is as follows:
The parameters of the multinomials for topics in a documents and words in a topic have Dirichlet priors and respectively as follows:
indicates the words that are important in topic , and indicates the topics that appear in document [
25
]. Therefore, the joint probability of a document with all the words and topics is as follows:
Finally, by taking the product of the marginal probabilities of single documents, we estimate the probability of a corpus as follows:
The mixing coefficients for each document and the word-topic distribution are hidden and are learnt from data using unsupervised learning methods[
24
]. Model inference approaches that have been used frequently include variational Bayesian, Gibbs sampling, and expectation propagation[
25
,
3
,
26
]. The symbols and notations used in this paper are summarized in
Table 1
.
10.1109/TST.2013.6574677.T001
Notations
Parameter
Definition
Number of nodes
Number of edges
Number of communities
Number of edge ’s neighbors
Dirichlet parameters
Edge-community distribution
Community-edge ditribution
Link community structure
Node community structure
Co-occurrence of edge and community
Co-occurrence of edge and assigned to community
10.1109/TST.2013.6574677.F002
Encoding process.
10.1109/TST.2013.6574677.F003
Generative process of line graph.
3.2 LDA-based link clustering
When the LDA model is used for the detection of a link community, the line graph should usually be encoded in a set of structured character strings. In the line graph, each edge is characterized by its Interaction Profile (IP), which is defined as a set of its neighboring edges and the corresponding weight (Eq. (
1
)). Formally,
where denotes the number of adjacent edges. For example, in
Fig. 2
, edge has four adjacent edges {1, 2, 5, 6}, and the corresponding weights can be calculated by using Eq. (
1
). Each edge is represented by its adjacent edges. The encoding schema guarantees that edges in the same community have similar interaction profiles. Moreover, we consider that the edges in IP are exchangeable and hence, their order is not a concern.
10.1109/TST.2013.6574677.F002
Encoding process.
In our research, we use LDA to model the interactions between the nodes in a network. The main idea of LBLP algorithms is to assume that the adjacent edges of each edge are generated by a mixture of communities, where a community is represented as a multinomial probability distribution over the edges. The mixing coefficients for each edge and the edge-community distribution are unobserved and are learned from data by using unsupervised learning methods.
Figure 3
shows the graphical model of the link-LDA model. The generative process for the line graph as follows:
10.1109/TST.2013.6574677.F003
Generative process of line graph.
Therefore, the probability that there is an interaction existing between edges and is as follows:
Given the hyper-parameters and , the join distribution of all observed and hidden variables is as follows:
Given the observed edges , the goal of inference is to compute the posterior distribution over the latent link communities , the mixing proportions , and the communities .
3.2 LDA-based link clustering
When the LDA model is used for the detection of a link community, the line graph should usually be encoded in a set of structured character strings. In the line graph, each edge is characterized by its Interaction Profile (IP), which is defined as a set of its neighboring edges and the corresponding weight (Eq. (
1
)). Formally,
where denotes the number of adjacent edges. For example, in
Fig. 2
, edge has four adjacent edges {1, 2, 5, 6}, and the corresponding weights can be calculated by using Eq. (
1
). Each edge is represented by its adjacent edges. The encoding schema guarantees that edges in the same community have similar interaction profiles. Moreover, we consider that the edges in IP are exchangeable and hence, their order is not a concern.
10.1109/TST.2013.6574677.F002
Encoding process.
In our research, we use LDA to model the interactions between the nodes in a network. The main idea of LBLP algorithms is to assume that the adjacent edges of each edge are generated by a mixture of communities, where a community is represented as a multinomial probability distribution over the edges. The mixing coefficients for each edge and the edge-community distribution are unobserved and are learned from data by using unsupervised learning methods.
Figure 3
shows the graphical model of the link-LDA model. The generative process for the line graph as follows:
10.1109/TST.2013.6574677.F003
Generative process of line graph.
Therefore, the probability that there is an interaction existing between edges and is as follows:
Given the hyper-parameters and , the join distribution of all observed and hidden variables is as follows:
Given the observed edges , the goal of inference is to compute the posterior distribution over the latent link communities , the mixing proportions , and the communities .
10.1109/TST.2013.6574677.F004
Illustration of bridge edge.