Analysis and Classification of Fake News Using Sequential Pattern Mining
Big Data Mining and Analytics 2024, 7 (3): 942-963
Published: 28 August 2024

Disinformation, often known as fake news, is a major issue that has received considerable attention in recent years, and many researchers have proposed effective means of detecting and addressing it. Current machine learning and deep learning based methodologies for fake news classification/detection are content-based, network (propagation) based, or multimodal methods that combine textual and visual information. We introduce here a framework, called FNACSPM, based on sequential pattern mining (SPM), for fake news analysis and classification. In this framework, six publicly available datasets containing a diverse range of fake and real news, as well as their combination, are first transformed into a suitable format. SPM algorithms are then applied to the transformed datasets to extract frequent patterns (and rules) of words, phrases, or linguistic features. The obtained patterns capture distinctive characteristics associated with fake or real news content, providing valuable insights into the underlying structures and commonalities of misinformation. The discovered frequent patterns are subsequently used as features for fake news classification. The framework is evaluated with eight classifiers, whose performance is assessed with various metrics. Extensive experiments show that FNACSPM outperforms other state-of-the-art approaches for fake news classification and expedites the classification task with high accuracy.
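To make the pattern-to-feature step concrete, here is a minimal Python sketch of the idea: mine word patterns that recur across documents and use their presence as binary classification features. It restricts itself to contiguous token sequences for brevity (full SPM algorithms such as PrefixSpan also allow gaps between items), and all data, names, and thresholds are illustrative rather than taken from the paper.

```python
from collections import Counter
from itertools import islice

# Toy corpus standing in for the six public datasets; labels: 1 = fake, 0 = real.
docs = [
    ("breaking shocking truth they hide", 1),
    ("you will not believe this shocking truth", 1),
    ("officials confirm new policy details", 0),
    ("report confirms policy change after review", 0),
]

def ngrams(tokens, n):
    """Yield contiguous n-token patterns (a simplification of general SPM)."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def mine_frequent_patterns(corpus, min_support=2, max_len=3):
    """Return patterns that occur in at least `min_support` documents."""
    doc_counts = Counter()
    for text, _ in corpus:
        tokens = text.split()
        seen = set()
        for n in range(1, max_len + 1):
            seen.update(ngrams(tokens, n))
        doc_counts.update(seen)          # count each pattern once per document
    return [p for p, c in doc_counts.items() if c >= min_support]

def to_features(text, patterns):
    """Binary feature vector: 1 if the pattern occurs in the document."""
    tokens = text.split()
    present = set()
    for n in range(1, max(len(p) for p in patterns) + 1):
        present.update(ngrams(tokens, n))
    return [1 if p in present else 0 for p in patterns]

patterns = mine_frequent_patterns(docs)
X = [to_features(text, patterns) for text, _ in docs]
y = [label for _, label in docs]
print(patterns)  # frequent word patterns, e.g. ('shocking', 'truth')
```

The resulting matrix X can be fed to any of the usual classifiers; the paper evaluates eight, but which miner and thresholds it uses are not specified in the abstract.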

Distribution consistency-based missing value imputation algorithm for large-scale data sets
Journal of Tsinghua University (Science and Technology) 2023, 63 (5): 740-753
Published: 15 May 2023
Objective

As a significant research branch in the field of data mining, missing value imputation (MVI) aims to provide high-quality data support for the training of machine learning algorithms. However, MVI results for large-scale data sets remain unsatisfactory in terms of restoring the data distribution and improving prediction accuracy on the imputed data. To improve on existing MVI algorithms, we propose a distribution consistency-based MVI (DC-MVI) algorithm that attempts to restore the original data structure by imputing the missing values of large-scale data sets.

Methods

First, the DC-MVI algorithm constructs an objective function that determines the optimal imputation values based on the principle of probability distribution consistency. Second, the data set is preprocessed by random initialization of the missing values and normalization, and a feasible missing value update rule is derived to obtain imputation values with the closest variance to, and the greatest consistency with, the complete original values. Next, in a distributed environment, the large-scale data set is divided into multiple groups of random sample partition (RSP) data blocks that share the same distribution as the entire data set, taking its statistical properties into account. Finally, the DC-MVI algorithm is trained in parallel to obtain imputation values for the missing entries of the large-scale data set while preserving distribution consistency with the non-missing values. Rationality experiments verify the convergence of the objective function and the contribution of DC-MVI to distribution consistency. In addition, effectiveness experiments assess the performance of DC-MVI and eight other MVI algorithms (mean, KNN, MICE, RF, EM, SOFT, GAIN, and MIDA) on three indicators: distribution consistency, time complexity, and classification accuracy.
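The abstract does not give the paper's exact objective function or update rule, so the following Python sketch substitutes a simple moment-matching criterion (match the mean and variance of the observed entries) as a stand-in for the distribution-consistency objective, updating the randomly initialized missing entries of a single feature by gradient descent. The RSP partitioning and parallel training are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def dc_impute(x, mask, lr=0.1, n_iter=200):
    """Toy stand-in for a distribution-consistency update rule (one feature).

    Missing entries (mask == True) are randomly initialized and then
    adjusted by gradient descent so that the imputed column matches the
    mean and variance of the observed entries. This is an illustrative
    simplification, not the objective function from the paper.
    """
    obs = x[~mask]
    target_mean, target_var = obs.mean(), obs.var()
    z = x.copy()
    # random initialization of the missing values
    z[mask] = rng.normal(target_mean, np.sqrt(target_var), mask.sum())
    for _ in range(n_iter):
        m, v = z.mean(), z.var()
        # gradient of (m - target_mean)**2 + (v - target_var)**2 w.r.t. z,
        # with the common 1/n factor absorbed into the learning rate
        grad = 2 * (m - target_mean) + 4 * (v - target_var) * (z - m)
        z[mask] -= lr * grad[mask]   # only missing entries are updated
    return z

x = rng.normal(5.0, 2.0, 1000)
mask = rng.random(1000) < 0.2        # treat 20% of the entries as missing
x_imp = dc_impute(x, mask)
print(x_imp[mask].mean(), x_imp[mask].var())  # close to the observed moments
```

In the RSP setting described above, a rule of this kind would run on each sample block in parallel, with each block's imputation constrained toward the distribution of its own observed values.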

Results

The experimental results on seven selected large-scale data sets showed that: 1) the objective function of the DC-MVI method was effective, and the missing value update rule was feasible, allowing the imputation values to remain stable throughout the adjustment process; 2) the DC-MVI algorithm obtained the smallest maximum mean discrepancy and Jensen-Shannon divergence on all data sets, showing that the proposed method produced a probability distribution more consistent with the complete original values at the given significance level; 3) the running time of the DC-MVI algorithm tended to be stable in the time comparison experiment, whereas the running time of other state-of-the-art MVI methods increased linearly with data volume; and 4) the DC-MVI approach produced imputation values more consistent with the original data set than existing methods, which is beneficial for subsequent data mining analysis.
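The two distribution-consistency indicators used here are standard and easy to reproduce. The Python sketch below estimates the (squared) maximum mean discrepancy under an RBF kernel and the Jensen-Shannon divergence between histogram estimates; the synthetic samples and kernel bandwidth are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def mmd_rbf(x, y, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy between two
    1-D samples under the RBF kernel k(a, b) = exp(-gamma * (a - b)**2)."""
    xx = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2).mean()
    yy = np.exp(-gamma * (y[:, None] - y[None, :]) ** 2).mean()
    xy = np.exp(-gamma * (x[:, None] - y[None, :]) ** 2).mean()
    return xx + yy - 2 * xy

def js_divergence(x, y, bins=50):
    """JS divergence between histogram estimates of the two distributions."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, _ = np.histogram(x, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(y, bins=bins, range=(lo, hi), density=True)
    return jensenshannon(p, q) ** 2  # scipy returns the JS distance (sqrt)

rng = np.random.default_rng(1)
original = rng.normal(0, 1, 500)   # stand-in for the complete original values
imputed = rng.normal(0.05, 1.1, 500)  # stand-in for an imputed column
print(mmd_rbf(original, imputed), js_divergence(original, imputed))
```

Lower values of both quantities indicate an imputed sample whose distribution is closer to that of the original values, which is the sense in which DC-MVI's scores above are "smallest".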

Conclusions

Considering the peculiarities and limitations of missing data in large-scale data sets, this paper incorporates RSP into the imputation algorithm and derives update rules for the imputation values to restore the data distribution, and further confirms the effectiveness and practical performance of DC-MVI for large-scale data imputation in terms of preserving distribution consistency and increasing imputation quality. The method proposed in this paper achieves the desired result and represents a viable solution to the problem of large-scale data imputation.
