With the advent of the big data era, a growing amount of duplicate data is expressed in different forms. To reduce redundant data storage and improve data quality, data deduplication technology has never been more significant. It is usually necessary to join multiple data tables and identify different records that point to the same entity, especially in multi-source data deduplication. Active learning trains the model by selecting the data items with the maximum information divergence, which reduces the amount of data to be annotated and gives it unique advantages in annotating big data. However, most current active learning methods address only classical entity matching and are rarely applied to data deduplication tasks. To fill this research gap, we propose a novel graph deep active learning framework for data deduplication. The framework combines similarity algorithms with the bidirectional encoder representations from transformers (BERT) model to extract deep similarity features of multi-source data records. It is also the first to introduce a graph active learning strategy, which builds a clean graph to filter the data that need to be labeled and is used to delete duplicate data while retaining the most information. Experimental results on real-world datasets demonstrate that the proposed method outperforms state-of-the-art active learning models on data deduplication tasks.
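The selection idea in this abstract can be illustrated with a minimal sketch. It makes two loud assumptions: token-level Jaccard similarity stands in for the BERT-based deep similarity features, and plain uncertainty sampling (picking the pair closest to a hypothetical 0.5 decision boundary) stands in for the graph active learning strategy. The function names are illustrative and do not come from the paper.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two record strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def most_uncertain_pair(pairs, score=jaccard, boundary=0.5):
    """Pick the record pair whose score is closest to the boundary.

    This is generic uncertainty sampling, used here as a stand-in
    for the paper's graph-based selection strategy (an assumption).
    """
    return min(pairs, key=lambda p: abs(score(p[0], p[1]) - boundary))


pairs = [
    ("Acme Corp New York", "Acme Corp New York"),         # identical records
    ("Acme Corp New York", "Zeta Ltd London"),            # clearly different
    ("Acme Corp New York", "Acme Corporation New York"),  # ambiguous case
]
query = most_uncertain_pair(pairs)  # the pair to send to a human annotator
```

Only the ambiguous pair is sent for labeling; the clear duplicate and the clear non-duplicate need no annotation, which is where the labeling savings come from.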
Although the Faster Region-based Convolutional Neural Network (Faster R-CNN) model has obvious advantages in defect recognition, it still cannot overcome challenging problems in bridge defect detection, such as long processing time, small targets, irregular shapes, and strong noise interference. To deal with these issues, this paper proposes a novel Multi-scale Feature Fusion (MFF) model for bridge appearance disease detection. First, the Faster R-CNN model adopts Region Of Interest (ROI) pooling, which omits the edge information of the target area, resulting in missed detections and inaccurate localization of bridge defects. Therefore, this paper proposes an MFF based on regional feature Aggregation (MFF-A), which reduces the missed detection rate of bridge defect detection and improves the positioning accuracy of the target area. Second, the Faster R-CNN model is insensitive to small targets, irregular shapes, and strong noise in bridge defect detection, which results in a long training time and low recognition accuracy. Accordingly, a novel Lightweight MFF (namely MFF-L) model for bridge appearance defect detection is proposed. It uses the lightweight network EfficientNetV2 and a feature pyramid network to fuse multi-scale features, which shortens the training time and improves recognition accuracy. Finally, the effectiveness of the proposed method is evaluated on the bridge disease dataset and a public computational fluid dynamic dataset.
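As a rough illustration of multi-scale feature fusion in a feature pyramid network, the sketch below merges a coarse feature map into a finer one by nearest-neighbor upsampling followed by element-wise addition. It is a one-dimensional toy under stated assumptions: the lateral convolutions of a real feature pyramid are omitted, and the function names are hypothetical rather than taken from the paper.

```python
def upsample2x(coarse):
    """Nearest-neighbor upsampling: repeat each value twice."""
    return [x for x in coarse for _ in range(2)]


def fuse(fine, coarse):
    """Top-down merge: upsample the coarse level and add it to the fine level."""
    up = upsample2x(coarse)
    if len(up) != len(fine):
        raise ValueError("fine level must be exactly twice as long as coarse level")
    return [f + u for f, u in zip(fine, up)]


fine_level = [1.0, 2.0, 3.0, 4.0]   # high resolution, fine detail
coarse_level = [10.0, 20.0]         # low resolution, wider context
fused = fuse(fine_level, coarse_level)
```

Each fused position carries both local detail and wider context, which is the usual motivation for applying such fusion to small targets.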
As a class of effective methods for incomplete multi-view clustering, graph-based algorithms have recently drawn wide attention. However, most of them could be improved in the following respects. First, in some graph-based models, all views are forced to share a common similarity graph regardless of the severe consistency degeneration caused by incomplete views. Second, similarity graph construction and cluster analysis are sometimes performed separately. Third, the difference in contribution among individual views is not always carefully considered. To address these issues simultaneously, this paper proposes an incomplete multi-view clustering algorithm based on auto-weighted fusion in partition space. In our algorithm, cluster structure information is introduced into the process of similarity learning to construct a desirable similarity graph, information fusion is performed in partition space to alleviate the negative impact of consistency degradation, and all views are adaptively weighted to reflect their different contributions to clustering tasks. Finally, all the subtasks are collaboratively optimized in a unified framework to reach an overall optimal result. Experimental results show that the proposed method compares favorably with the state-of-the-art methods.
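Adaptive view weighting can be sketched as follows. The abstract does not state the weighting rule, so this example assumes a rule commonly used in auto-weighted multi-view learning: each view's weight is proportional to 1 / (2 * sqrt(d_v)), where d_v is that view's distance to the consensus partition. The rule, the distances, and the function name are all assumptions for illustration.

```python
import math


def auto_weights(distances, eps=1e-12):
    """Weights proportional to 1 / (2 * sqrt(d_v)), normalized to sum to 1.

    distances: per-view distance to the consensus partition (hypothetical inputs).
    eps guards against division by zero when a view matches the consensus exactly.
    """
    raw = [1.0 / (2.0 * math.sqrt(d + eps)) for d in distances]
    total = sum(raw)
    return [r / total for r in raw]


# Three views; the first agrees most closely with the consensus.
weights = auto_weights([0.04, 0.16, 0.64])
```

Views that agree more closely with the consensus receive larger weights without any manually tuned weight parameter.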
Aspect-based sentiment analysis (ABSA) consists of two subtasks: aspect term extraction and aspect sentiment prediction. Most methods handle the subtasks in a pipeline manner, which causes problems in performance and real-world application. In this study, we propose an end-to-end ABSA model, namely, SSi-LSi, which fuses syntactic structure information and lexical semantic information to address the limitation that existing end-to-end methods do not fully exploit the text information. Through two network branches, the model extracts syntactic structure information and lexical semantic information, which integrate the part of speech, sememes, and context. Then, on the basis of an attention mechanism, the model fuses the syntactic structure information and the lexical semantic information so that the text information is fully used and higher-quality ABSA results are obtained. Subsequent experiments demonstrate that the SSi-LSi model has certain advantages in using different text information.
Dialog State Tracking (DST) aims to extract the current state from the conversation and plays an important role in dialog systems. Existing methods usually predict the value of each slot independently and do not consider the correlations among slots, which exacerbates the data sparsity problem because of the increased number of candidate values. In this paper, we propose a multi-domain DST model that integrates slot-relevant information. In particular, certain connections may exist among slots in different domains, and their corresponding values can be obtained through explicit or implicit reasoning. Therefore, we use the graph adjacency matrix to determine the correlation between slots, so that the slots can incorporate more slot-value transformer information. Experimental results show that our approach performs well on the Multi-domain Wizard-Of-Oz (MultiWOZ) 2.0 and MultiWOZ 2.1 datasets, demonstrating the effectiveness and necessity of incorporating slot-relevant information.
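To make the role of the adjacency matrix concrete, the following minimal sketch mixes each slot's feature vector with those of its neighbors through a row-normalized adjacency matrix. The slot names, the adjacency entries, and the two-dimensional features are invented for illustration, and the single averaging step is a generic graph propagation, not the model described in the paper.

```python
def normalize_rows(adj):
    """Scale each row of the adjacency matrix so that it sums to 1."""
    out = []
    for row in adj:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out


def propagate(adj, feats):
    """One propagation step: each slot becomes the average of its neighbors."""
    norm = normalize_rows(adj)
    n, dim = len(feats), len(feats[0])
    return [
        [sum(norm[i][j] * feats[j][k] for j in range(n)) for k in range(dim)]
        for i in range(n)
    ]


# Hypothetical slots; 1 marks an assumed correlation (self-loops included).
slots = ["hotel-area", "restaurant-area", "taxi-destination"]
adj = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = propagate(adj, feats)
```

After the step, each slot representation carries information from the slots it is connected to, so a value mentioned for one slot can inform a correlated slot in another domain.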
Aspect-based sentiment analysis (ABSA) consists of two subtasks: aspect term extraction and aspect sentiment prediction. Existing methods handle the two subtasks one by one in a pipeline manner, which leads to problems in performance and real-world application. This study investigates end-to-end ABSA and proposes a novel multitask multiview network (MTMVN) architecture. Specifically, the architecture takes the unified ABSA as the main task with the two subtasks as auxiliary tasks. Meanwhile, the representation obtained from the branch network of the main task is regarded as the global view, whereas the representations of the two subtasks are considered two local views with different emphases. Through multitask learning, the main task can be facilitated by additional accurate aspect boundary information and sentiment polarity information. By enhancing the correlations between the views under the idea of multiview learning, the representation of the global view can be optimized to improve the overall performance of the model. The experimental results on three benchmark datasets show that the proposed method exceeds the existing pipeline methods and end-to-end methods, demonstrating the superiority of our MTMVN architecture.
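The relation between the unified task and its two subtasks can be shown on the label level. In the sketch below, boundary tags (from aspect term extraction) and polarity tags (from aspect sentiment prediction) are merged into one unified tag sequence. The B/I/O and POS/NEG/NEU label sets are an assumed convention, not necessarily the scheme used in the paper.

```python
def unify(boundary_tags, sentiment_tags):
    """Merge boundary tags and polarity tags into unified tags.

    Assumed label sets: boundary in {B, I, O}, polarity in {POS, NEG, NEU, O}.
    """
    if len(boundary_tags) != len(sentiment_tags):
        raise ValueError("tag sequences must have the same length")
    return [
        "O" if b == "O" else f"{b}-{s}"
        for b, s in zip(boundary_tags, sentiment_tags)
    ]


tokens = ["The", "battery", "life", "is", "great"]
boundary = ["O", "B", "I", "O", "O"]        # auxiliary task 1: aspect boundaries
sentiment = ["O", "POS", "POS", "O", "O"]   # auxiliary task 2: sentiment polarity
unified = unify(boundary, sentiment)        # main task: unified ABSA tags
```

Each unified tag encodes both a boundary decision and a polarity decision, which is why supervision from the two auxiliary tasks can help the main task.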
As an important branch of natural language processing, sentiment analysis has received increasing attention. In teaching evaluation, sentiment analysis can help educators discover students' true feelings about a course in a timely manner and adjust the teaching plan accurately and promptly to improve the quality of education and teaching. To address the inefficiency and heavy workload of college curriculum evaluation methods, a Multi-Attention Fusion Modeling (Multi-AFM) method is proposed, which integrates global attention and local attention through gating unit control to generate a reasonable contextual representation and achieve improved classification results. Experimental results show that the Multi-AFM model performs better than existing methods in education and other fields.
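Gated fusion of two attention outputs can be sketched in a few lines. The example assumes an element-wise sigmoid gate g, with the fused output computed as g * global + (1 - g) * local. In a trained model the gate logits would be computed from the inputs by learned parameters; here they are passed in directly, and all names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(global_vec, local_vec, gate_logits):
    """Element-wise gate: g * global + (1 - g) * local, with g = sigmoid(logit)."""
    fused = []
    for g_val, l_val, z in zip(global_vec, local_vec, gate_logits):
        g = sigmoid(z)
        fused.append(g * g_val + (1.0 - g) * l_val)
    return fused


# A logit of 0 gives g = 0.5, so both attention outputs contribute equally.
fused = gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])
```

Large positive logits push the result toward the global attention output, and large negative logits push it toward the local one.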
Monitoring the operating status of a High-Speed Train (HST) at any moment is necessary to ensure its security. Multi-channel vibration signals are collected by sensors installed on bogies, and beneficial information is extracted to determine the running condition. Based on multi-view clustering and considering the complementary information of different views, this study proposes a Multi-view Kernel Fuzzy C-Means (MvKFCM) model for condition recognition of the HST bogie. First, fast Fourier transform coefficients of HST vibration signals of all channels are extracted. Then, the fuzzy classification coefficient of every channel is calculated after clustering to select the appropriate channels. Finally, the selected channels are clustered by MvKFCM and the conditions of the HST are determined. Experimental results show that the selection is effective in maintaining rich feature information and removing redundancy. Furthermore, the condition recognition rate of MvKFCM is higher than that of single-view and four other multiple-view clustering algorithms.
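The channel selection step can be sketched under one explicit assumption: that the "fuzzy classification coefficient" is the standard fuzzy partition coefficient, the mean over samples of the sum of squared memberships, and that channels with higher values are preferred. The abstract confirms neither point, and the membership matrices below are made up for illustration.

```python
def partition_coefficient(memberships):
    """Mean over samples of the sum of squared cluster memberships.

    Ranges from 1 / c (completely fuzzy) to 1.0 (crisp) for c clusters.
    """
    n = len(memberships)
    return sum(u * u for row in memberships for u in row) / n


def select_channels(channel_memberships, k):
    """Keep the k channels with the crispest clustering (assumed criterion)."""
    ranked = sorted(
        channel_memberships,
        key=lambda name: partition_coefficient(channel_memberships[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical fuzzy memberships: two samples, two clusters per channel.
channel_memberships = {
    "ch1": [[0.9, 0.1], [0.2, 0.8]],
    "ch2": [[0.5, 0.5], [0.5, 0.5]],
    "ch3": [[1.0, 0.0], [0.0, 1.0]],
}
selected = select_channels(channel_memberships, k=2)
```

The completely fuzzy channel is dropped, while the two channels whose clusterings separate the samples more cleanly are kept for the multi-view stage.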
In the big data era, data are generated from different sources or observed from different views. These data are referred to as multi-view data. Unleashing the power of knowledge in multi-view data is very important in big data mining and analysis. This calls for advanced techniques that consider the diversity of different views while fusing these data. Multi-view Clustering (MvC) has attracted increasing attention in recent years by aiming to exploit complementary and consensus information across multiple views. This paper summarizes a large number of multi-view clustering algorithms, provides a taxonomy according to the mechanisms and principles involved, and classifies these algorithms into five categories, namely, co-training style algorithms, multi-kernel learning, multi-view graph clustering, multi-view subspace clustering, and multi-task multi-view clustering. Therein, multi-view graph clustering is further categorized into graph-based, network-based, and spectral-based methods. Multi-view subspace clustering is further divided into subspace learning-based and non-negative matrix factorization-based methods. This paper not only introduces the mechanisms of each category of methods but also gives a few examples of how these techniques are used. In addition, it lists some publicly available multi-view datasets. Overall, this paper serves as an introductory text and survey for multi-view clustering.