Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
Suzhou Key Laboratory of Advanced Optical Communication Network Technology, School of Electronic and Information Engineering, Soochow University, Suzhou 215006, China
Abstract
The COVID-19 pandemic has hit the world hard. Reactions to pandemic-related issues have been pouring onto social platforms such as Twitter. Many public officials and governments use Twitter to make policy announcements, and people keep close track of the related information and express their concerns about the policies on Twitter. It is beneficial yet challenging to derive important information or knowledge from such Twitter data. In this paper, we propose a Tripartite Graph Clustering for Pandemic Data Analysis (TGC-PDA) framework that builds on the proposed models and analysis: (1) tripartite graph representation, (2) non-negative matrix factorization with regularization, and (3) sentiment analysis. We collect tweets containing a set of keywords related to the coronavirus pandemic as the ground truth data. Our framework can detect communities of Twitter users and analyze the topics discussed in these communities. Extensive experiments show that our TGC-PDA framework can effectively and efficiently identify the topics and correlations within the Twitter data for monitoring and understanding public opinion, which would provide policy makers with useful information and statistics for decision making.
L. J. Chang, W. Li, L. Qin, W. J. Zhang, and S. Y. Yang, pSCAN: Fast and exact structural graph clustering, IEEE Trans. Knowl. Data Eng., vol. 29, no. 2, pp. 387-401, 2017.
R. El Bacha and T. T. Zin, Ranking of influential users based on user-tweet bipartite graph, in Proc. 2018 IEEE Int. Conf. Service Operations and Logistics, and Informatics (SOLI), Singapore, 2018, pp. 97-101.
A. Rodríguez, C. Argueta, and Y. L. Chen, Automatic detection of hate speech on Facebook using sentiment and emotion analysis, in Proc. 2019 Int. Conf. Artificial Intelligence in Information and Communication (ICAIIC), Okinawa, Japan, 2019, pp. 169-174.
A. Reyes-Menendez, J. R. Saura, and C. Alvarez-Alonso, Understanding #WorldEnvironmentDay user opinions in Twitter: A topic-based sentiment analysis approach, Int. J. Environ. Res. Public Health, vol. 15, no. 11, p. 2537, 2018.
C. H. Tan, L. L. Lee, J. Tang, L. Jiang, M. Zhou, and P. Li, User-level sentiment analysis incorporating social networks, in Proc. 17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, New York, NY, USA, 2011, pp. 1397-1405.
R. R. Iyer, J. Chen, H. N. Sun, and K. Y. Xu, A heterogeneous graphical model to understand user-level sentiments in social media, arXiv preprint arXiv:1912.07911, 2019.
H. B. Deng, J. W. Han, H. Li, H. Ji, H. N. Wang, and Y. Lu, Exploring and inferring user-user pseudo-friendship for sentiment analysis with heterogeneous networks, Stat. Anal. Data Min., vol. 7, no. 4, pp. 308-321, 2014.
C. A. Phillips, Multipartite graph algorithms for the analysis of heterogeneous data, PhD dissertation, Univ. Tennessee, Knoxville, TN, USA, 2015.
D. W. Zhou, S. Zhang, M. Y. Yildirim, S. Alcorn, H. H. Tong, H. Davulcu, and J. R. He, A local algorithm for structure-preserving graph cut, in Proc. 23rd ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Halifax, Canada, 2017, pp. 655-664.
P. M. Comar, P. N. Tan, and A. K. Jain, A framework for joint community detection across multiple related networks, Neurocomputing, vol. 76, no. 1, pp. 93-104, 2012.
Y. Z. Sun, Y. T. Yu, and J. W. Han, Ranking-based clustering of heterogeneous information networks with star network schema, in Proc. 15th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Paris, France, 2009, pp. 797-806.
D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, in Proc. 13th Int. Conf. Neural Information Processing Systems, Cambridge, MA, USA, 2001, pp. 535-541.
N. Gillis, The why and how of nonnegative matrix factorization, arXiv preprint arXiv:1401.5226v2, 2014.
M. E. Wall, A. Rechtsteiner, and L. M. Rocha, Singular value decomposition and principal component analysis, in A Practical Approach to Microarray Data Analysis, D. P. Berrar, W. Dubitzky, and M. Granzow, eds. Norwell, MA, USA: Springer, 2003, pp. 91-109.
C. Ding, T. Li, W. Peng, and H. Park, Orthogonal nonnegative matrix t-factorizations for clustering, in Proc. 12th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 2006, pp. 126-135.
D. Kim, S. Sra, and I. S. Dhillon, Fast Newton-type methods for the least squares nonnegative matrix approximation problem, in Proc. 2007 SIAM Int. Conf. Data Mining, Minneapolis, MN, USA, 2007, pp. 343-354.
C. J. Lin, On the convergence of multiplicative update algorithms for nonnegative matrix factorization, IEEE Trans. Neural Netw., vol. 18, no. 6, pp. 1589-1596, 2007.
J. Kim and H. Park, Toward faster nonnegative matrix factorization: A new algorithm and comparisons, in Proc. 2008 Eighth IEEE Int. Conf. Data Mining, Pisa, Italy, 2008, pp. 353-362.
F. Wang and P. Li, Efficient nonnegative matrix factorization with random projections, in Proc. 2010 SIAM Int. Conf. Data Mining, Columbus, OH, USA, 2010, pp. 281-292.
M. Annett and G. Kondrak, A comparison of sentiment analysis techniques: Polarizing movie blogs, in Proc. 21st Conf. of the Canadian Society for Computational Studies of Intelligence, Windsor, Canada, 2008, pp. 25-35.
M. Del Vicario, G. Vivaldo, A. Bessi, F. Zollo, A. Scala, G. Caldarelli, and W. Quattrociocchi, Echo chambers: Emotional contagion and group polarization on Facebook, Sci. Rep., vol. 6, p. 37825, 2016.
S. M. Mohammad, X. D. Zhu, S. Kiritchenko, and J. Martin, Sentiment, emotion, purpose, and style in electoral tweets, Inf. Process. Manag., vol. 51, no. 4, pp. 480-499, 2015.
K. Chakraborty, S. Bhattacharyya, R. Bag, and A. Hassanien, Sentiment analysis on a set of movie reviews using deep learning techniques, in Social Network Analytics: Computational Research Methods and Techniques, Cambridge, MA, USA, 2019, pp. 127-147.
H. Meisheri, K. Ranjan, and L. Dey, Sentiment extraction from consumer-generated noisy short texts, in Proc. 2017 IEEE Int. Conf. Data Mining Workshops (ICDMW), New Orleans, LA, USA, 2017, pp. 399-406.
A. S. M. Alharbi and E. de Doncker, Twitter sentiment analysis with a deep neural network: An enhanced approach using user behavioral information, Cogn. Syst. Res., vol. 54, pp. 50-61, 2019.
M. Wang, C. K. Wang, J. X. Yu, and J. Zhang, Community detection in social networks: An in-depth benchmarking study with a procedure-oriented framework, Proc. VLDB Endow., vol. 8, no. 10, pp. 998-1009, 2015.
D. Cai, X. F. He, X. Y. Wu, and J. W. Han, Non-negative matrix factorization on manifold, in Proc. 2008 Eighth IEEE Int. Conf. Data Mining, Pisa, Italy, 2008, pp. 63-72.
H. Wang, F. P. Nie, H. Huang, and F. Makedon, Fast nonnegative matrix tri-factorization for large-scale data co-clustering, in Proc. 22nd Int. Joint Conf. Artificial Intelligence, Barcelona, Spain, 2011, pp. 1553-1558.
C. H. Q. Ding, T. Li, and M. I. Jordan, Convex and semi-nonnegative matrix factorizations, IEEE Trans. Patt. Anal. Mach. Intell., vol. 32, no. 1, pp. 45-55, 2010.
H. Abe and H. Yadohisa, Orthogonal nonnegative matrix tri-factorization based on Tweedie distributions, Adv. Data Anal. Classi., vol. 13, no. 4, pp. 825-853, 2019.
Liao X, Zheng D, Cao X. Coronavirus Pandemic Analysis Through Tripartite Graph Clustering in Online Social Networks. Big Data Mining and Analytics, 2021, 4(4): 242-251. https://doi.org/10.26599/BDMA.2021.9020010
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
Fig. 1 Examples of multipartite graphs with different k.
Fig. 2 An example of the tripartite graph co-clustering problem.
Fig. 3 An example of a tripartite graph in Twitter.
Fig. 4 An overview of the TGC-PDA framework.
Fig. 5 Building the user-topic bipartite graph by removing the tweet nodes of the tripartite graph and leveraging them as connections between user and topic nodes.
6 Experimental Results
In this section, we analyze our experimental results and the performance of TGC-PDA. We also compare NMFRU with well-known clustering methods, such as Kmeans and NMF, and with commonly used variants, including Semi-NMF (SNMF) [43] and Orthogonal NMTF (ONMTF) [44].
6.1 Dataset
We evaluate the performance of TGC-PDA on a real Twitter dataset about COVID-19 collected between Feb. 15, 2020 and Sep. 30, 2020. To get the tweet data, we wrote a Python program to crawl the tweets and the users who liked them. Multiple hashtag keywords, such as #COVID19, #coronavirus, #covid, covid pandemic, and #COVID20, are used to ensure we can collect a large dataset. Since the free Twitter API we use has rate limits that restrict the number of tweets retrieved per login access, we had to crawl the data over several months. After removing duplicate and non-English posts, we obtained 18 327 tweets and 752 649 users who interacted with them. Users who interacted with only one tweet in our dataset are identified as "less interactive" users and excluded. After the data cleanup, 301 982 users remain.
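The cleanup steps above (deduplication, language filtering, and dropping "less interactive" users) can be sketched as follows. The field names (`id`, `text`, `lang`) and the `(user_id, tweet_id)` interaction pairs are illustrative assumptions, not the paper's actual schema; the Twitter API does return a per-tweet language code that can drive the English filter.

```python
from collections import Counter

def clean_dataset(tweets, interactions):
    """Deduplicate tweets, keep English posts, and drop 'less interactive' users.

    tweets: list of dicts with 'id', 'text', and 'lang' keys (illustrative schema).
    interactions: list of (user_id, tweet_id) pairs (likes/retweets).
    """
    seen_texts = set()
    kept_ids = set()
    for t in tweets:
        # Drop non-English posts and exact duplicate texts.
        if t["lang"] != "en" or t["text"] in seen_texts:
            continue
        seen_texts.add(t["text"])
        kept_ids.add(t["id"])

    # Count interactions per user, restricted to the surviving tweets.
    counts = Counter(u for u, tid in interactions if tid in kept_ids)
    # Users who interacted with only one tweet are excluded as "less interactive".
    active_users = {u for u, c in counts.items() if c > 1}
    return kept_ids, active_users
```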
6.2 Experimental setup
As all the clustering methods (i.e., Kmeans, NMF, SNMF, ONMTF, and our NMFRU) have one or more parameters to tune, to make the comparison fair we run these methods under different parameter settings and choose the best result for each algorithm. In addition, we set the number of clusters to the true number of classes for all clustering algorithms on the dataset. Specifically, for the Kmeans and NMF algorithms, the only hyperparameter is the number of clusters; once it is given, no other parameters are needed. In NMFRU, we have two additional regularization hyperparameters. To find proper values for them, we plot a loss-value curve with the value ranging from 0.1 to 1000; suitable values can then be found by scanning the plot. Since our dataset is relatively large and cannot be completely labeled manually, we randomly choose 5% of the data to label and report the result on this labeled sample as the framework result.
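The loss-curve scan over a regularization weight can be sketched as a log-spaced grid search from 0.1 to 1000, matching the range stated above. The `loss_fn` callable stands in for one run of NMFRU at a fixed weight; its signature is an assumption for illustration.

```python
import numpy as np

def scan_hyperparameter(loss_fn, grid=None):
    """Scan one regularization weight over a log-spaced grid and return the
    value with the lowest loss, plus the (value, loss) curve for plotting.

    loss_fn: callable mapping a weight value to the final training loss
             (a stand-in for one NMFRU run; assumed interface).
    """
    if grid is None:
        grid = np.logspace(-1, 3, 9)  # 0.1, ..., 1000 on a log scale
    losses = [loss_fn(v) for v in grid]
    best = grid[int(np.argmin(losses))]
    return best, list(zip(grid, losses))
```

In practice one would plot the returned curve and read off the elbow or minimum, as the paper describes, rather than trusting the argmin blindly.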
6.3 Evaluation metrics
To evaluate the clustering results, we use widely adopted standard metrics: clustering accuracy, cluster purity, and Normalized Mutual Information (NMI).
For the clustering accuracy, we compare the output cluster labels $c_i$ with the ground truth labels $y_i$. The accuracy is defined as follows:
$$\mathrm{ACC} = \frac{\sum_{i=1}^{n} \delta\left(y_i, \mathrm{map}(c_i)\right)}{n},$$
where $n$ is the total number of data samples and $\delta(x, y)$ is the delta function, which equals one if $x = y$ and zero otherwise. $\mathrm{map}(\cdot)$ is a mapping function that maps each cluster label to the equivalent label in the ground truth data. We can use the Kuhn-Munkres algorithm to find the best mapping [45].
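The accuracy metric with Kuhn-Munkres label matching can be sketched as follows; SciPy's `linear_sum_assignment` implements the Hungarian (Kuhn-Munkres) algorithm mentioned above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Clustering accuracy: find the cluster-to-class mapping that maximizes
    agreement (Kuhn-Munkres), then count correctly mapped samples."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # Contingency matrix: cont[i, j] = #samples in cluster i with true class j.
    cont = np.zeros((len(clusters), len(classes)), dtype=int)
    for c, y in zip(y_pred, y_true):
        cont[np.searchsorted(clusters, c), np.searchsorted(classes, y)] += 1
    # Hungarian algorithm maximizes matched counts (so minimize the negation).
    row, col = linear_sum_assignment(-cont)
    return cont[row, col].sum() / len(y_true)
```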
For the cluster purity, we compare the output clusters $\{c_k\}$ with the ground truth classes $\{y_j\}$. The purity of the clustering result is calculated as
$$\mathrm{Purity} = \frac{1}{n} \sum_{k} \max_{j} \left|c_k \cap y_j\right|,$$
where $n$ is the number of data samples. A perfect clustering result has a purity of one, while a bad one has a purity value close to zero.
For the NMI, we compare the cluster output $C$ with the ground truth labels $Y$. The NMI is defined as
$$\mathrm{NMI}(C, Y) = \frac{I(C; Y)}{\sqrt{H(C)\, H(Y)}},$$
where $H(\cdot)$ represents the entropy and $I(C; Y)$ denotes the mutual information between $C$ and $Y$. A higher NMI value indicates a better clustering result.
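Purity and NMI can be sketched as below. Note the geometric-mean normalization of NMI is one common convention; the paper does not spell out which normalization it uses, so this is an assumption.

```python
import numpy as np

def cluster_purity(y_true, y_pred):
    """Purity: each cluster votes for its majority ground-truth class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0
    for c in np.unique(y_pred):
        members = y_true[y_pred == c]
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()  # size of the majority class in this cluster
    return total / len(y_true)

def nmi(y_true, y_pred):
    """NMI = I(C;Y) / sqrt(H(C) H(Y)); geometric-mean normalization assumed."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)

    def entropy(labels):
        p = np.unique(labels, return_counts=True)[1] / n
        return -np.sum(p * np.log(p))

    mi = 0.0
    for c in np.unique(y_pred):
        for y in np.unique(y_true):
            joint = np.mean((y_pred == c) & (y_true == y))
            if joint > 0:
                mi += joint * np.log(joint / (np.mean(y_pred == c) * np.mean(y_true == y)))
    return mi / np.sqrt(entropy(y_pred) * entropy(y_true))
```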
To obtain a less biased estimate of the framework, we run each algorithm twenty times and take the average result for each model.
6.4 Results and discussion
Table 2 shows the comparison between NMFRU and several baseline models: Kmeans, NMF, SNMF, and ONMTF. When applying these baseline models to our data, we do not embed the topic nodes into the user nodes; instead, we use the user-tweet bipartite graph to compute the clustering result. In the matrix form of the bipartite graph, the rows and columns correspond to the two sets of vertices, and each entry corresponds to an edge between a row and a column. From Table 2, we can see that NMFRU achieves the best performance in terms of accuracy, purity, and NMI. This is because our bipartite graph is created from our tripartite graph model and thus embeds more information than a plain bipartite graph. We also utilize the tri-factorization and locality-preserving schemes, which further improve the performance.
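The matrix form just described is a biadjacency matrix, which can be built as follows. This is a generic sketch of the plain bipartite input fed to the baselines; the variable names are illustrative.

```python
import numpy as np

def biadjacency(users, tweets, edges):
    """Biadjacency matrix of a user-tweet bipartite graph: rows index users,
    columns index tweets, and entry (i, j) is 1 if user i interacted with
    tweet j (an edge between a row vertex and a column vertex)."""
    u_idx = {u: i for i, u in enumerate(users)}
    t_idx = {t: j for j, t in enumerate(tweets)}
    A = np.zeros((len(users), len(tweets)), dtype=int)
    for u, t in edges:
        A[u_idx[u], t_idx[t]] = 1
    return A
```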
Table 2 Performance comparison of the clustering methods.

Method   Accuracy   Purity   NMI
Kmeans   0.613      0.549    0.513
NMF      0.583      0.536    0.493
SNMF     0.627      0.562    0.534
ONMTF    0.674      0.578    0.557
NMFRU    0.706      0.617    0.621
We study the average convergence time of our framework in Fig. 6. When the number of iterations reaches about 23, our framework converges with a total loss of 2, which shows that the computation of NMFRU is fast. Meanwhile, comparing the convergence times of the different baseline methods in Fig. 7, we can see that NMFRU is slower than Kmeans but faster than the other baseline clustering methods, because NMFRU performs fewer matrix multiplication operations and thus saves running time. Therefore, TGC-PDA, which uses NMFRU as its core clustering algorithm, can be applied to large datasets.
Fig. 6 Total loss with different numbers of iterations.
Fig. 7 Convergence time of the methods.
As for the polarity of the communities, Table 3 shows the ten largest communities with their polarity ratios. From Table 3, we find that the neutral ratio is quite high across all topics. To understand why, we manually examined 1000 posts and found that many media outlets and government agencies (e.g., CNN and the CDC) use Twitter to publish real-time news and the latest policies. These tweets tend to be retweeted many times, and such tweets are more likely to be considered neutral.
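Given per-tweet sentiment labels (the sentiment classifier itself is described elsewhere in the paper; here the labels are assumed given), the per-community percentages reported in Table 3 can be computed as:

```python
from collections import Counter

def polarity_ratios(labels):
    """Given the sentiment labels ('positive' | 'neutral' | 'negative') of all
    tweets in one community, return the percentage of each polarity."""
    counts = Counter(labels)
    n = len(labels)
    return {pol: 100.0 * counts.get(pol, 0) / n
            for pol in ("positive", "neutral", "negative")}
```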
Table 3 Ten largest communities with their polarity ratios (%).

Keyword          Positive   Neutral   Negative
marketcrash2021  18.2       48.7      33.1
maskshortage     14.1       41.6      44.3
death            4.1        73.1      22.8
NYbreak          12.2       57.5      30.3
antibody         30.5       41.3      28.2
stimulus         32.6       41.7      25.7
testing          32.7       38.9      28.4
vaccine          20.4       61.2      18.4
symptoms         26.3       48.9      24.8
stayathome       23.6       51.8      24.6