Open Access Issue
Graph Deep Active Learning Framework for Data Deduplication
Big Data Mining and Analytics 2024, 7(3): 753-764
Published: 28 August 2024

With the advent of the era of big data, an increasing amount of duplicate data is expressed in different forms. To reduce redundant data storage and improve data quality, data deduplication technology has never been more significant. It is usually necessary to connect multiple data tables and identify different records pointing to the same entity, especially in the case of multi-source data deduplication. Active learning trains the model by selecting the data items with the maximum information divergence and reduces the data to be annotated, which has unique advantages in dealing with big data annotations. However, most current active learning methods address only classical entity matching and are rarely applied to data deduplication tasks. To fill this research gap, we propose a novel graph deep active learning framework for data deduplication. The framework combines similarity algorithms with the bidirectional encoder representations from transformers (BERT) model to extract the deep similarity features of multi-source data records. It is also the first to introduce a graph active learning strategy that builds a clean graph to filter the data that needs to be labeled; the graph is then used to delete duplicate data while retaining the most information. Experimental results on real-world datasets demonstrate that the proposed method outperforms state-of-the-art active learning models on data deduplication tasks.
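The selection step that active learning relies on can be illustrated with a minimal sketch. It uses plain entropy-based uncertainty sampling over predicted match probabilities, which is a common stand-in and not the graph-based strategy of the paper; the names `binary_entropy`, `select_for_labeling`, and the `match_probs` input are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a match probability; 0 when the model is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_for_labeling(match_probs, budget):
    """Return the `budget` record pairs whose predictions are most uncertain.

    match_probs maps a record pair to the model's predicted probability
    that the two records are duplicates (an assumed input format).
    """
    ranked = sorted(
        match_probs,
        key=lambda pair: binary_entropy(match_probs[pair]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical predictions for three candidate pairs.
probs = {("a", "b"): 0.95, ("a", "c"): 0.5, ("b", "c"): 0.1}
to_label = select_for_labeling(probs, budget=1)
```

Only the selected pairs would be sent to an annotator, which is how active learning keeps the labeling cost low.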

Open Access Issue
TW-Co-MFC: Two-Level Weighted Collaborative Fuzzy Clustering Based on Maximum Entropy for Multi-View Data
Tsinghua Science and Technology 2021, 26(2): 185-198
Published: 24 July 2020

In recent years, multi-view clustering research has attracted considerable attention because of the rapidly growing demand for unsupervised analysis of multi-view data in practical applications. Despite the significant advances in multi-view clustering, two challenges still need to be addressed, i.e., how to make full use of the consistent and complementary information in multiple views and how to discriminate the contributions of different views and features in the same view to efficiently reveal the latent cluster structure of multi-view data for clustering. In this study, we propose a novel Two-level Weighted Collaborative Multi-view Fuzzy Clustering (TW-Co-MFC) approach to address the aforementioned issues. In TW-Co-MFC, a two-level weighting strategy is devised to measure the importance of views and features, and a collaborative working mechanism is introduced to balance the within-view clustering quality and the cross-view clustering consistency. Then an iterative optimization objective function based on the maximum entropy principle is designed for multi-view clustering. Experiments on real-world datasets show the effectiveness of the proposed approach.
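As a rough illustration of the maximum entropy principle in fuzzy clustering, the sketch below computes the memberships of one point in a single view using the standard entropy-regularized update, u_k ∝ exp(-d_k / gamma). It omits the two-level view and feature weights and the collaborative term of TW-Co-MFC; the function name and the `gamma` parameter are assumptions made for this example.

```python
import math


def max_entropy_memberships(point, centers, gamma=1.0):
    """Soft memberships of `point` in each cluster (single view).

    Uses squared Euclidean distance and a softmax over -distance / gamma,
    the usual closed form for entropy-regularized fuzzy clustering.
    """
    dists = [
        sum((p - c) ** 2 for p, c in zip(point, center))
        for center in centers
    ]
    # Subtract the smallest distance before exponentiating for stability.
    d_min = min(dists)
    weights = [math.exp(-(d - d_min) / gamma) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


# A point sitting on the first of two hypothetical cluster centers.
u = max_entropy_memberships([0.0, 0.0], [[0.0, 0.0], [2.0, 0.0]])
```

A larger `gamma` flattens the memberships toward uniform, while a smaller one approaches a hard assignment.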

Open Access Issue
M2M: A Simple Matlab-to-MapReduce Translator for Cloud Computing
Tsinghua Science and Technology 2013, 18(1): 1-9
Published: 07 February 2013

MapReduce is a very popular parallel programming model for cloud computing platforms, and has become an effective method for processing massive data by using a cluster of computers. An X-to-MapReduce translator (where X is a programming language) is a possible solution to help traditional programmers easily deploy an application to cloud systems by translating sequential code to MapReduce code. Recently, some SQL-to-MapReduce translators have emerged that translate SQL-like queries to MapReduce code and perform well in cloud systems. However, these translators focus mainly on SQL-like queries rather than on numerical computation. Matlab is a high-level language and interactive environment for numerical computation, visualization, and programming, which is very popular in engineering. We propose and develop a simple Matlab-to-MapReduce translator for cloud computing, called M2M, for basic numerical computations. M2M can translate Matlab code with up to 100 commands to MapReduce code in a few seconds, a task that may cost a proficient Hadoop MapReduce programmer some days of coding. In addition, M2M can recognize the dependencies between complex commands, which are often confusing during hand coding. We implemented M2M and evaluated it on Matlab commands on a cluster. Several common commands are used in our experiments. The results show that M2M is comparable in performance with hand-coded programs.
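To make the idea of translating a numerical command into MapReduce concrete, here is a minimal sketch of how a column-wise sum (what Matlab's `sum(A)` returns for a matrix) could be split into a map phase and a reduce phase. It runs in a single process in plain Python and is not M2M's generated code; the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit one (column_index, value) pair per matrix entry."""
    for row in rows:
        for col, value in enumerate(row):
            yield col, value


def reduce_phase(pairs):
    """Group the emitted pairs by column index and sum each group."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return [totals[k] for k in sorted(totals)]


def column_sums(rows):
    """Column-wise sum expressed as map followed by reduce."""
    return reduce_phase(map_phase(rows))


result = column_sums([[1, 2], [3, 4]])
```

On a real cluster the map calls would run in parallel over row partitions and the framework would handle the grouping by key; the sketch only shows the shape of the decomposition.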
