Sampling is a fundamental method for generating data subsets. Because many data analysis methods are built on probability distributions, maintaining those distributions when sampling helps to ensure good analysis performance. However, sampling a minimum subset while maintaining probability distributions remains an open problem. In this paper, we decompose a joint probability distribution into a product of conditional probabilities based on Bayesian networks, and we use the chi-square test to formulate a sampling problem in which the sampled subset must pass the distribution test, so that its distribution is maintained. Furthermore, a heuristic sampling algorithm is proposed to generate the required subset by designing two scoring functions: one based on the chi-square test and the other based on likelihood functions. Experiments on four types of datasets, each of size 60,000, show that when the significance level, α, is set to 0.05, the algorithm can exclude 99.9%, 99.0%, 93.1%, and 96.7% of the samples for the datasets based on the Bayesian networks ASIA, ALARM, HEPAR2, and ANDES, respectively. When subsets of the same size are sampled, the subset generated by our algorithm passes all the distribution tests and the average distribution difference is approximately 0.03; by contrast, the subsets generated by random sampling pass only 83.8% of the tests, and the average distribution difference is approximately 0.24.
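As a rough illustration of the distribution test described above, the following minimal sketch checks whether a random subset of a single binary variable passes a Pearson chi-square goodness-of-fit test against the full dataset at α = 0.05. It is not the paper's algorithm: the function names, the synthetic data, and the restriction to one marginal variable are assumptions made for illustration, whereas the paper applies the test to the conditional distributions of a Bayesian network and selects the subset heuristically rather than at random.

```python
import random
from collections import Counter

# Critical value of the chi-square distribution with 1 degree of freedom
# at significance level alpha = 0.05 (standard table value).
CHI2_CRIT_DF1_ALPHA_005 = 3.841


def empirical_probs(data):
    """Relative frequency of each observed value in the data."""
    n = len(data)
    return {value: count / n for value, count in Counter(data).items()}


def chi_square_statistic(subset, reference_probs):
    """Pearson chi-square statistic of the subset's counts against the
    expected counts implied by reference_probs (hypothetical helper).
    Values not listed in reference_probs are ignored."""
    n = len(subset)
    counts = Counter(subset)
    return sum(
        (counts.get(value, 0) - n * p) ** 2 / (n * p)
        for value, p in reference_probs.items()
    )


# Synthetic stand-in for one binary variable of a 60,000-row dataset.
random.seed(0)
full = [1 if random.random() < 0.3 else 0 for _ in range(60000)]
reference = empirical_probs(full)

# Draw a random subset and test it against the full-data distribution.
subset = random.sample(full, 600)
stat = chi_square_statistic(subset, reference)
passes = stat <= CHI2_CRIT_DF1_ALPHA_005
print(f"chi-square = {stat:.3f}, passes at alpha = 0.05: {passes}")
```

A subset passes when its statistic does not exceed the critical value, meaning the test finds no significant difference between the subset and the reference distribution. A variable with more than two states would need the critical value for the corresponding degrees of freedom.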
Regular Paper
Issue
Journal of Computer Science and Technology 2021, 36(4): 896-909
Published: 05 July 2021