Regular Paper Issue
Incremental User Identification Across Social Networks Based on User-Guider Similarity Index
Journal of Computer Science and Technology 2022, 37(5): 1086-1104
Published: 30 September 2022

Identifying accounts across different online social networks that belong to the same user has attracted extensive attention. However, existing techniques rely on given user seeds and ignore the dynamic changes of online social networks, and therefore fail to generate high-quality identification results. To solve this problem, we propose an incremental user identification method based on a user-guider similarity index (called CURIOUS), which identifies users efficiently and captures the changes of user features over time. Specifically, we first construct a novel user-guider similarity index (called USI) to speed up the matching between users. Second, we propose a two-phase user identification strategy consisting of USI-based bidirectional user matching and seed-based user matching, which is effective even for incomplete networks. Finally, we propose incremental maintenance for both USI and the identification results, which dynamically captures the instant states of social networks. We conduct experimental studies based on three real-world social networks. The experiments demonstrate the effectiveness and the efficiency of our proposed method in comparison with traditional methods. Compared with the traditional methods, our method improves precision, recall and rank score by an average of 0.19, 0.16 and 0.09 respectively, and reduces the time cost by an average of 81%.
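
The abstract does not say how bidirectional user matching works. One common reading, assumed here, is that a pair of accounts is accepted only when each account is the other's best-scoring candidate. The sketch below illustrates that reading only; the function name `bidirectional_matches`, the `threshold` parameter, and the similarity table are hypothetical and are not taken from the paper.

```python
def bidirectional_matches(sim, users_a, users_b, threshold=0.5):
    """Return pairs (a, b) where a and b are each other's best candidate.

    sim maps (a, b) to a similarity score; missing pairs count as 0.0.
    This is an assumed interpretation, not the paper's USI-based procedure.
    """
    if not users_a or not users_b:
        return []
    # Best candidate in network B for every user in network A.
    best_for_a = {a: max(users_b, key=lambda b: sim.get((a, b), 0.0)) for a in users_a}
    # Best candidate in network A for every user in network B.
    best_for_b = {b: max(users_a, key=lambda a: sim.get((a, b), 0.0)) for b in users_b}
    # Keep only mutual best matches whose score reaches the threshold.
    return sorted(
        (a, b)
        for a, b in best_for_a.items()
        if best_for_b[b] == a and sim.get((a, b), 0.0) >= threshold
    )


# Hypothetical similarity scores between accounts of two networks.
sim = {
    ("a1", "b1"): 0.9, ("a1", "b2"): 0.4,
    ("a2", "b1"): 0.7, ("a2", "b2"): 0.3,
    ("a3", "b2"): 0.6,
}
matches = bidirectional_matches(sim, ["a1", "a2", "a3"], ["b1", "b2"])
```

Here `a2` prefers `b1`, but `b1` prefers `a1`, so the pair is rejected; requiring agreement in both directions trades some recall for precision.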

Regular Paper Issue
Finding Communities by Decomposing and Embedding Heterogeneous Information Network
Journal of Computer Science and Technology 2020, 35(2): 320-337
Published: 27 March 2020

Community discovery is an important task in social network analysis. However, most existing methods for community discovery rely on the topological structure alone and ignore the rich information available in the content data. To address this issue, we present a community discovery method based on heterogeneous information network decomposition and embedding. Unlike traditional methods, our method takes into account topology, node content and edge content, which can supply abundant evidence for community discovery. First, an embedding-based similarity evaluation method is proposed, which decomposes the heterogeneous information network into several subnetworks and extracts their potential deep representations to evaluate the similarities between nodes. Second, a bottom-up community discovery algorithm is proposed. Via leader node selection, initial community generation, and community expansion, communities can be found more efficiently. Third, some incremental maintenance strategies for network changes are proposed. We conduct experimental studies based on three real-world social networks. Experiments demonstrate the effectiveness and the efficiency of our proposed method. Compared with the traditional methods, our method improves normalized mutual information (NMI) and modularity by an average of 12% and 37% respectively.
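
Modularity, one of the two quality measures reported above, can be computed directly from an edge list and a community assignment. The sketch below uses the standard definition for an undirected, unweighted graph, Q = sum over communities c of (L_c / m - (d_c / 2m)^2), where m is the number of edges, L_c the number of edges inside c, and d_c the total degree of c. The function name and the toy graph are illustrative assumptions, not the paper's code or data.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity of an undirected, unweighted graph.

    edges is a list of (u, v) pairs; community maps each node to a label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra = defaultdict(int)   # edges with both endpoints in the community
    degree = defaultdict(int)  # sum of node degrees per community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}

q_two = modularity(edges, two_groups)  # equals 5/14, about 0.357
q_one = modularity(edges, one_group)   # equals 0.0
```

Splitting the graph along the bridge scores higher than placing every node in one community, which is the behaviour the measure is meant to reward.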

Open Access Issue
HPPQ: A Parallel Package Queries Processing Approach for Large-Scale Data
Big Data Mining and Analytics 2018, 1(2): 146-159
Published: 12 April 2018

Many scholars have focused on developing effective techniques for package queries, and many excellent approaches have been proposed. Unfortunately, most of the existing methods focus on a small volume of data. With the rapid increase in data volume, traditional package query methods struggle to meet the growing requirements. To solve this problem, a novel optimization method for package queries (HPPQ) is proposed in this paper. First, the data is preprocessed into regions. Data preprocessing segments the dataset into multiple subsets, and the centroid of each subset is used for package queries, which effectively reduces the volume of candidate results. Furthermore, an efficient heuristic algorithm (namely IPOL-HS) is proposed based on the preprocessing results. This improves the quality of the candidate results in the iterative stage and improves the convergence rate of the heuristic algorithm. Finally, a strategy called HPR is proposed, which relies on a greedy algorithm and parallel processing to accelerate query processing. The experimental results show that our method can significantly reduce time consumption compared with existing methods.
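
To make the problem concrete: a package query asks for a set of tuples that together satisfy a global constraint while optimizing an objective. The brute-force sketch below is not HPPQ, IPOL-HS, or HPR. It only illustrates the kind of query involved and why exhaustive search stops scaling, since the number of candidate packages grows combinatorially with the data size. The function name, field names, and data are hypothetical.

```python
from itertools import combinations


def best_package(items, size, max_total_cost):
    """Exhaustively pick `size` items maximizing total value under a cost cap.

    Returns the best combination as a tuple of items, or None if no
    combination satisfies the constraint. Illustrative baseline only.
    """
    best, best_value = None, float("-inf")
    for combo in combinations(items, size):
        cost = sum(item["cost"] for item in combo)
        value = sum(item["value"] for item in combo)
        # Global constraint on the whole package, not on single tuples.
        if cost <= max_total_cost and value > best_value:
            best, best_value = combo, value
    return best


# Hypothetical relation with a cost and a value column.
items = [
    {"name": "a", "cost": 3, "value": 10},
    {"name": "b", "cost": 4, "value": 12},
    {"name": "c", "cost": 5, "value": 20},
    {"name": "d", "cost": 2, "value": 4},
]
package = best_package(items, size=2, max_total_cost=7)
package_names = [item["name"] for item in package]
```

With four rows there are only six candidate pairs, but the count of candidates for a fixed package size grows polynomially in the number of rows and far faster when the size is free, which motivates reducing the candidate set before searching.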

Open Access Issue
Distributed and Weighted Extreme Learning Machine for Imbalanced Big Data Learning
Tsinghua Science and Technology 2017, 22(2): 160-173
Published: 06 April 2017

The Extreme Learning Machine (ELM) and its variants are effective in many machine learning applications such as Imbalanced Learning (IL) or Big Data (BD) learning. However, they are unable to handle imbalanced and large-volume data learning problems at the same time. This study addresses the IL problem in BD applications. The Distributed and Weighted ELM (DW-ELM) algorithm is proposed, which is based on the MapReduce framework. To confirm the feasibility of parallel computation, it is first shown that the matrix multiplication operators are decomposable. Then, to further improve the computational efficiency, an Improved DW-ELM algorithm (IDW-ELM) is developed using only one MapReduce job. Finally, experiments validate the proposed DW-ELM and IDW-ELM algorithms.
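
The decomposability claim can be illustrated with a small sketch. It assumes the usual weighted ELM formulation, which the abstract does not spell out, where the output weights depend on the data only through the products HᵀWH and HᵀWT (H: hidden-layer outputs, W: diagonal sample weights, T: targets). Both products are sums over samples, so each data partition can compute a partial result and the partial results can simply be added. The function names and data are hypothetical, and plain Python stands in for MapReduce.

```python
def partial_products(rows):
    """Map-style step: accumulate H^T W H and H^T W T over one partition.

    rows is a list of (h, w, t): hidden-layer output vector, sample weight,
    and scalar target. Returns (U, V) with U an L x L matrix and V a vector.
    """
    size = len(rows[0][0])
    u = [[0.0] * size for _ in range(size)]
    v = [0.0] * size
    for h, w, t in rows:
        for i in range(size):
            v[i] += h[i] * w * t
            for j in range(size):
                u[i][j] += h[i] * w * h[j]
    return u, v


def combine(left, right):
    """Reduce-style step: element-wise sum of two partial results."""
    (u_left, v_left), (u_right, v_right) = left, right
    u = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(u_left, u_right)]
    v = [x + y for x, y in zip(v_left, v_right)]
    return u, v


# Hypothetical samples: (hidden outputs, weight, target).
rows = [
    ([1.0, 2.0], 1.0, 1.0),
    ([0.0, 1.0], 2.0, -1.0),
    ([3.0, 1.0], 0.5, 1.0),
    ([2.0, 2.0], 1.0, -1.0),
]
whole = partial_products(rows)
merged = combine(partial_products(rows[:2]), partial_products(rows[2:]))
```

Because `merged` equals `whole`, the partitions can be processed independently and in any order; only the small L x L and length-L results need to be brought together before solving for the output weights.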

Open Access Issue
Modeling Chinese Microblogs with Five Ws for Topic Hashtags Extraction
Tsinghua Science and Technology 2017, 22(2): 135-148
Published: 06 April 2017

Hashtags are important metadata in microblogs and are used to mark topics or index messages. However, statistics show that hashtags are absent from most microblogs. This poses great challenges for the retrieval and analysis of these tagless microblogs. In this paper, we summarize the similarity between microblogs and short-message-style news, and then propose an algorithm, named 5WTAG, for detecting microblog topics based on a model of five Ws (When, Where, Who, What, hoW). As five-W attributes are the core components in event description, it is guaranteed theoretically that 5WTAG can properly extract semantic topics from microblogs. We introduce the detailed procedure of the algorithm in this paper including spam microblog identification, microblog segmentation, and candidate hashtag construction. In addition, we propose a novel recommendation computing method for ranking candidate hashtags, which combines syntax and semantic analysis and observes the distribution of artificial topic hashtags. Finally, we conduct comprehensive experiments to verify the semantic correctness and completeness of the candidate hashtags, as well as the accuracy of the recommendation method using real data from Sina Weibo.
