Abstract
The problem of imbalanced data classification learning has received much attention. Conventional classification algorithms are susceptible to data skew to favor majority samples and ignore minority samples. Majority weighted minority oversampling technique (MWMOTE) is an effective approach to solve this problem, however, it may suffer from the shortcomings of inadequate noise filtering and synthesizing the same samples as the original minority data. To this end, we propose an improved MWMOTE method named joint sample position based noise filtering and mean shift clustering (SPMSC) to solve these problems. Firstly, in order to effectively eliminate the effect of noisy samples, SPMSC uses a new noise filtering mechanism to determine whether a minority sample is noisy or not based on its position and distribution relative to the majority sample. Note that MWMOTE may generate duplicate samples, we then employ the mean shift algorithm to cluster minority samples to reduce synthetic replicate samples. Finally, data cleaning is performed on the processed data to further eliminate class overlap. Experiments on extensive benchmark datasets demonstrate the effectiveness of SPMSC compared with other sampling methods.