Scholar - SciOpen

The objective of knowledge graph completion is to comprehend the structure and inherent relationships of domain knowledge, thereby providing a valuable foundation for knowledge reasoning and analysis. However, existing methods for knowledge graph completion face challenges. For instance, rule-based completion methods exhibit high accuracy and interpretability, but encounter difficulties when handling large knowledge graphs. In contrast, embedding-based completion methods demonstrate strong scalability and efficiency, but make limited use of domain knowledge. To address these issues, we propose a pre-training and inference method for knowledge graph completion based on integrated rules. The approach combines rule mining and reasoning to generate precise candidate facts. Subsequently, a pre-trained language model is fine-tuned and a probabilistic structural loss is incorporated to embed the knowledge graph. This enables the language model to capture deeper semantic information while the loss function reconstructs the structure of the knowledge graph. Extensive experiments on various publicly accessible datasets indicate that the proposed model outperforms current techniques on knowledge graph completion tasks.
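The rule-reasoning step that generates candidate facts can be illustrated with a minimal sketch. It assumes a single hand-written chain rule of the form body1(x, y) AND body2(y, z) => head(x, z); the function name `apply_chain_rule`, the relation names, and the toy triples are all hypothetical, and the abstract does not specify how the paper mines or applies its rules.

```python
from collections import defaultdict

def apply_chain_rule(triples, body1, body2, head):
    """Apply body1(x, y) AND body2(y, z) => head(x, z); return facts not yet in the graph."""
    known = set(triples)
    # Index body2 edges by subject so the join on y is a dictionary lookup.
    by_subject = defaultdict(list)
    for s, r, o in triples:
        if r == body2:
            by_subject[s].append(o)
    candidates = set()
    for x, r, y in triples:
        if r != body1:
            continue
        for z in by_subject.get(y, []):
            fact = (x, head, z)
            if fact not in known:  # keep only genuinely new candidates
                candidates.add(fact)
    return candidates

# Toy graph (illustrative data, not from the paper).
triples = [
    ("alice", "born_in", "paris"),
    ("paris", "city_of", "france"),
    ("bob", "born_in", "lyon"),
    ("lyon", "city_of", "france"),
    ("bob", "nationality", "france"),
]
new_facts = apply_chain_rule(triples, "born_in", "city_of", "nationality")
print(sorted(new_facts))
```

In this sketch the fact about `bob` is already present, so only the fact about `alice` is proposed as a candidate. In a pipeline like the one the abstract describes, such candidates would then be scored by the fine-tuned language model.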

Open Access Issue

Graph Deep Active Learning Framework for Data Deduplication

Huan Cao, Shengdong Du, Jie Hu, Yan Yang, Shi-Jinn Horng, Tianrui Li

Big Data Mining and Analytics 2024, 7(3): 753-764

Published: 28 August 2024

Abstract

With the advent of the era of big data, an increasing amount of duplicate data is expressed in different forms. To reduce redundant data storage and improve data quality, data deduplication technology has never been more significant. It is usually necessary to connect multiple data tables and identify different records pointing to the same entity, especially in the case of multi-source data deduplication. Active learning trains the model by selecting the data items with the maximum information divergence and reduces the data to be annotated, which gives it unique advantages in dealing with big data annotations. However, most current active learning methods are only applied to classical entity matching and are rarely applied to data deduplication tasks. To fill this research gap, we propose a novel graph deep active learning framework for data deduplication. The framework combines similarity algorithms with the bidirectional encoder representations from transformers (BERT) model to extract the deep similarity features of multi-source data records, and it is the first to introduce a graph active learning strategy that builds a clean graph to filter the data that needs to be labeled. The result is used to delete duplicate data while retaining the most information. Experimental results on real-world datasets demonstrate that the proposed method outperforms state-of-the-art active learning models on data deduplication tasks.
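The idea of labeling only the most informative record pairs can be sketched with plain uncertainty sampling. This is a stand-in, not the graph-based strategy of the paper: it assumes a matcher has already produced a match probability per record pair, and it ranks pairs by binary entropy. The names `binary_entropy`, `select_for_labeling`, and the example probabilities are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a match probability; highest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_for_labeling(match_probs, budget):
    """Return the `budget` record pairs whose predicted match status is least certain."""
    ranked = sorted(
        match_probs,
        key=lambda pair: binary_entropy(match_probs[pair]),
        reverse=True,
    )
    return ranked[:budget]

# Predicted probability that each pair of records refers to the same entity
# (illustrative values only).
match_probs = {
    ("r1", "r2"): 0.97,
    ("r3", "r4"): 0.52,
    ("r5", "r6"): 0.08,
    ("r7", "r8"): 0.40,
}
to_label = select_for_labeling(match_probs, budget=2)
print(to_label)
```

Pairs with probabilities near 0.5 are sent to the annotator first, while confident matches and non-matches are left unlabeled, which is how active learning reduces annotation cost.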

Open Access Issue

TW-Co-MFC: Two-Level Weighted Collaborative Fuzzy Clustering Based on Maximum Entropy for Multi-View Data

Jie Hu, Yi Pan, Tianrui Li, Yan Yang

Tsinghua Science and Technology 2021, 26(2): 185-198

Published: 24 July 2020

Abstract

In recent years, multi-view clustering research has attracted considerable attention because of the rapidly growing demand for unsupervised analysis of multi-view data in practical applications. Despite the significant advances in multi-view clustering, two challenges still need to be addressed, i.e., how to make full use of the consistent and complementary information in multiple views and how to discriminate the contributions of different views and features in the same view to efficiently reveal the latent cluster structure of multi-view data for clustering. In this study, we propose a novel Two-level Weighted Collaborative Multi-view Fuzzy Clustering (TW-Co-MFC) approach to address the aforementioned issues. In TW-Co-MFC, a two-level weighting strategy is devised to measure the importance of views and features, and a collaborative working mechanism is introduced to balance the within-view clustering quality and the cross-view clustering consistency. Then an iterative optimization objective function based on the maximum entropy principle is designed for multi-view clustering. Experiments on real-world datasets show the effectiveness of the proposed approach.
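For readers unfamiliar with maximum-entropy fuzzy clustering, the sketch below shows the membership update commonly derived from an entropy-regularized objective: each membership is proportional to exp(-d / gamma), normalized over clusters. It is a single-view illustration under that assumption; the abstract does not give the TW-Co-MFC objective, and the view weights, feature weights, and collaboration term are omitted. The function name and the parameter `gamma` are hypothetical.

```python
import math

def max_entropy_memberships(distances, gamma):
    """Memberships of one sample given its squared distances to each cluster center.

    u_k = exp(-d_k / gamma) / sum_j exp(-d_j / gamma); gamma > 0 controls fuzziness.
    """
    d_min = min(distances)  # shift by the minimum for numerical stability
    weights = [math.exp(-(d - d_min) / gamma) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

u = max_entropy_memberships([1.0, 4.0, 9.0], gamma=2.0)
print([round(x, 3) for x in u])
```

A small `gamma` pushes the memberships toward a hard assignment to the nearest center, while a large `gamma` spreads them toward a uniform distribution.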

Open Access Issue

M2M: A Simple Matlab-to-MapReduce Translator for Cloud Computing

Junbo Zhang, Dong Xiang, Tianrui Li, Yi Pan

Tsinghua Science and Technology 2013, 18(1): 1-9

Published: 07 February 2013

Abstract

MapReduce is a very popular parallel programming model for cloud computing platforms, and has become an effective method for processing massive data using a cluster of computers. An X-to-MapReduce translator (where X is a programming language) is a possible solution to help traditional programmers easily deploy an application to cloud systems by translating sequential code to MapReduce code. Recently, some SQL-to-MapReduce translators have emerged that translate SQL-like queries to MapReduce code and perform well in cloud systems. However, SQL-to-MapReduce translators mainly focus on SQL-like queries, not on numerical computation. Matlab is a high-level language and interactive environment for numerical computation, visualization, and programming, which is very popular in engineering. We propose and develop a simple Matlab-to-MapReduce translator for cloud computing, called M2M, for basic numerical computations. M2M can translate Matlab code with up to 100 commands to MapReduce code in a few seconds, a task that may take a proficient Hadoop MapReduce programmer several days of coding. In addition, M2M can recognize the dependencies between complex commands, which are easily confused during hand coding. We implemented M2M and evaluated it with Matlab commands on a cluster, using several common commands in our experiments. The results show that M2M is comparable in performance with hand-coded programs.
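To make the translation target concrete, the following sketch shows how a basic numerical command such as a column-wise mean could be phrased as map and reduce steps. It is an in-memory Python illustration of the MapReduce pattern only; it is not output produced by M2M and does not use Hadoop. The function names `map_row`, `reduce_column`, and `column_means` are hypothetical.

```python
from collections import defaultdict

def map_row(row):
    """Map step: emit (column index, (value, count)) for every entry of one matrix row."""
    for j, value in enumerate(row):
        yield j, (value, 1)

def reduce_column(pairs):
    """Reduce step: combine partial (sum, count) pairs for one column into its mean."""
    total = sum(v for v, _ in pairs)
    count = sum(c for _, c in pairs)
    return total / count

def column_means(rows):
    """Run map, shuffle (group by key), and reduce in memory."""
    grouped = defaultdict(list)
    for row in rows:
        for key, pair in map_row(row):
            grouped[key].append(pair)
    return [reduce_column(grouped[j]) for j in sorted(grouped)]

means = column_means([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(means)  # [3.0, 4.0]
```

Because each row is mapped independently and each column is reduced independently, the same logic can be distributed across a cluster, which is the property a translator from sequential code has to expose.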
