Institute of Applied Physics and Computational Mathematics, Beijing 100094, China
School of Computer Science and Technology, Shandong University, Qingdao 266237, China
Abstract
In this paper, we propose a correlation-aware probabilistic data summarization technique to efficiently analyze and visualize large-scale multi-block volume data generated by massively parallel scientific simulations. The core of our technique is correlation modeling of distribution representations of adjacent data blocks using copula functions and accurate data value estimation by combining numerical information, spatial location, and correlation distribution using Bayes’ rule. This effectively preserves statistical properties without merging data blocks in different parallel computing nodes and repartitioning them, thus significantly reducing the computational cost. Furthermore, this enables reconstruction of the original data more accurately than existing methods. We demonstrate the effectiveness of our technique using six datasets, with the largest having one billion grid points. The experimental results show that our approach reduces the data storage cost by approximately one order of magnitude compared to state-of-the-art methods while providing a higher reconstruction accuracy at a lower computational cost.
References
[1] Ahrens, J.; Hendrickson, B.; Long, G.; Miller, S.; Ross, R.; Williams, D. Data intensive science in the Department of Energy. Technical Report, LA-UR-10-07088. Los Alamos National Laboratory, 2010.
Luo, A.; Kao, D.; Pang, A. Visualizing spatial distribution data sets. In: Proceedings of the Symposium on Data Visualisation, 29–38, 2003.
[4] Kniss, J. M.; Van Uitert, R.; Stephens, A.; Li, G.; Tasdizen, T.; Hansen, C. Statistically quantitative volume visualization. In: Proceedings of the IEEE Visualization, 287–294, 2005.
Thompson, D.; Levine, J. A.; Bennett, J. C.; Bremer, P. T.; Gyulassy, A.; Pascucci, V.; Pébay, P. P. Analysis of large-scale scalar data using hixels. In: Proceedings of the IEEE Symposium on Large Data Analysis and Visualization, 23–30, 2011.
Liu, S. S.; Levine, J. A.; Bremer, P. T.; Pascucci, V. Gaussian mixture model based volume visualization. In: Proceedings of the IEEE Symposium on Large Data Analysis and Visualization, 73–77, 2012.
[11] Dutta, S.; Shen, H. W. Distribution driven extraction and tracking of features for time-varying data analysis. IEEE Transactions on Visualization and Computer Graphics Vol. 22, No. 1, 837–846, 2016.
Chaudhuri, A.; Wei, T. H.; Lee, T. Y.; Shen, H. W.; Peterka, T. Efficient range distribution query for visualizing scientific data. In: Proceedings of the IEEE Pacific Visualization Symposium, 201–208, 2014.
Nouanesengsy, B.; Woodring, J.; Patchett, J.; Myers, K.; Ahrens, J. ADR visualization: A generalized framework for ranking large-scale scientific data using Analysis-Driven Refinement. In: Proceedings of the IEEE 4th Symposium on Large Data Analysis and Visualization, 43–50, 2014.
Athawale, T.; Sakhaee, E.; Entezari, A. Isosurface visualization of data with nonparametric models for uncertainty. IEEE Transactions on Visualization and Computer Graphics Vol. 22, No. 1, 777–786, 2016.
Dutta, S.; Woodring, J.; Shen, H. W.; Chen, J. P.; Ahrens, J. Homogeneity guided probabilistic data summaries for analysis and visualization of large-scale data sets. In: Proceedings of the IEEE Pacific Visualization Symposium, 111–120, 2017.
Wang, K. C.; Lu, K. W.; Wei, T. H.; Shareef, N.; Shen, H. W. Statistical visualization and analysis of large data using a value-based spatial distribution. In: Proceedings of the IEEE Pacific Visualization Symposium, 161–170, 2017.
Sklar, A. Fonctions de Répartition à n Dimensions et Leurs Marges. Publications de l’Institut Statistique de l’Université de Paris Vol. 8, 229–231, 1959.
Hazarika, S.; Biswas, A.; Shen, H. W. Uncertainty visualization using copula-based analysis in mixed distribution models. IEEE Transactions on Visualization and Computer Graphics Vol. 24, No. 1, 934–943, 2018.
Kim, T.; Shin, Y. An efficient wavelet-based compression method for volume rendering. In: Proceedings of the 7th Pacific Conference on Computer Graphics and Applications, 147–156, 1999.
[26] Sasaki, N.; Sato, K.; Endo, T.; Matsuoka, S. Exploration of lossy compression for application-level checkpoint/restart. In: Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 914–922, 2015.
Khodakovsky, A.; Schröder, P.; Sweldens, W. Progressive geometry compression. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, 271–278, 2000.
Tzeng, F. Y.; Lum, E. B.; Ma, K. L. A novel interface for higher-dimensional classification of volume data. In: Proceedings of the IEEE Visualization, 505–512, 2003.
[32] Kindlmann, G.; Whitaker, R.; Tasdizen, T.; Moller, T. Curvature-based transfer functions for direct volume rendering: Methods and applications. In: Proceedings of the IEEE Visualization, 513–520, 2003.
[33] Tenginakai, S.; Lee, J.; Machiraju, R. Salient iso-surface detection with model-independent statistical signatures. In: Proceedings of the Visualization, 231–238, 2001.
[34] Hladůvka, J.; König, A.; Gröller, E. Salient representation of volume data. In: Data Visualization 2001. Eurographics. Ebert, D. S.; Favre, J. M.; Peikert, R. Eds. Springer Vienna, 203–211, 2001.
Wang, K. C.; Xu, J. Y.; Woodring, J.; Shen, H. W. Statistical super resolution for data analysis and visualization of large scale cosmological simulations. In: Proceedings of the IEEE Pacific Visualization Symposium, 303–312, 2019.
Yang Y, Lu K, Wu Y, et al. Correlation-aware probabilistic data summarization for large-scale multi-block scientific data visualization. Computational Visual Media, 2023, 9(3): 513-529. https://doi.org/10.1007/s41095-022-0304-6
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
10.1007/s41095-022-0304-6.F001
Domain decomposition for parallel computing: (a) large single block of data representing the overall domain; (b) multi-block data generated by domain decomposition.
10.1007/s41095-022-0304-6.F002
Our correlation-aware probabilistic summarization method significantly reduces the data storage cost while ensuring high reconstruction accuracy. A multi-block volume rendering of aircraft electromagnetic data is shown. Compared to the ground truth (a), the reconstruction result of our method (d) has higher accuracy than the reconstruction results of Refs. [9, 20] (b, c), yet provides greater data compression. Respective sizes of the original data, block histogram model, spatial GMM, and our model are 536.8 MB, 268.4 MB, 89.5 MB, and 24.4 MB.
10.1007/s41095-022-0304-6.F003
Construction of a two-dimensional Gaussian copula function: (a) generation of new bivariate samples from a bivariate standard normal distribution, (b) construction of the Gaussian copula with uniform marginals, and (c) the final bivariate samples with the desired univariate distribution types.
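The three construction steps (a) to (c) in this caption can be sketched in a few lines of standard-library Python. This is only an illustrative sketch: the function name `sample_gaussian_copula`, the correlation value 0.8, and the two example marginals (exponential and uniform on [2, 5]) are assumptions chosen for the demonstration and do not come from the paper.

```python
import math
import random
from statistics import NormalDist


def sample_gaussian_copula(rho, n, inv_cdf_x, inv_cdf_y, seed=0):
    """Draw n pairs whose dependence follows a bivariate Gaussian copula
    with correlation rho and whose marginals are set by the inverse CDFs."""
    rng = random.Random(seed)
    std_normal = NormalDist()  # standard normal: mean 0, sigma 1
    eps = 1e-12                # keeps uniforms strictly inside (0, 1)
    pairs = []
    for _ in range(n):
        # (a) correlated sample from a bivariate standard normal
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # (b) the normal CDF maps each coordinate to a uniform marginal
        u1 = min(max(std_normal.cdf(z1), eps), 1.0 - eps)
        u2 = min(max(std_normal.cdf(z2), eps), 1.0 - eps)
        # (c) inverse CDFs impose the desired univariate distributions
        pairs.append((inv_cdf_x(u1), inv_cdf_y(u2)))
    return pairs


def exponential_inv_cdf(u):
    """Inverse CDF of an exponential distribution with rate 1 (assumed marginal)."""
    return -math.log(1.0 - u)


def uniform_2_5_inv_cdf(u):
    """Inverse CDF of a uniform distribution on [2, 5] (assumed marginal)."""
    return 2.0 + 3.0 * u


pairs = sample_gaussian_copula(0.8, 2000, exponential_inv_cdf, uniform_2_5_inv_cdf)
```

The dependence structure is fixed entirely in step (a); steps (b) and (c) only reshape each coordinate, which is why a copula can couple marginals of different distribution types.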
10.1007/s41095-022-0304-6.F004
Merging and repartitioning timing according to visualization requirements, using different numbers of parallel computing cores, for the original multi-block data with a resolution of , and modeling timing for the spatial GMM.
10.1007/s41095-022-0304-6.F005
Overview of our method. Given multi-block data, each single-block data’s numerical and spatial distributions are represented by a histogram and GMM, and further combined as a spatial GMM. Correlation-aware probabilistic summarization of multi-block data is represented by the Gaussian copula function (Stage I). Consider the block (shown with a gray background). Nine blocks are contained in its 1-ring, so nine marginal distributions are taken into consideration. In Stage II, copula-based reconstruction helps various post-hoc multivariate analysis and visualization tasks using stored data summarization. Given an arbitrary spatial location, the corresponding value is reconstructed using Bayes’ rule.
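The Bayes' rule step in Stage II can be illustrated with a deliberately reduced sketch. It assumes that each histogram bin carries a single axis-aligned 3D Gaussian as its spatial distribution (the method described here uses a GMM per bin and additionally folds in the copula-based correlation, both of which are omitted), and all names and numbers below are hypothetical.

```python
import math


def gaussian_pdf_3d(x, mean, var):
    """Density of an axis-aligned 3D Gaussian (diagonal covariance)."""
    density = 1.0
    for xi, mi, vi in zip(x, mean, var):
        density *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2.0 * math.pi * vi)
    return density


def posterior_over_bins(x, bin_probs, bin_spatial):
    """Bayes' rule: P(bin | x) is proportional to P(x | bin) * P(bin)."""
    joint = [p * gaussian_pdf_3d(x, mean, var)
             for p, (mean, var) in zip(bin_probs, bin_spatial)]
    total = sum(joint)
    if total == 0.0:
        return list(bin_probs)  # no spatial evidence: fall back to the prior
    return [j / total for j in joint]


# Hypothetical block summary: two value bins with equal prior probability.
bin_centers = [0.25, 0.75]
bin_probs = [0.5, 0.5]
bin_spatial = [((1.0, 1.0, 1.0), (1.0, 1.0, 1.0)),   # (mean, variance) of bin 0
               ((6.0, 6.0, 6.0), (1.0, 1.0, 1.0))]   # (mean, variance) of bin 1

post = posterior_over_bins((1.0, 1.0, 1.0), bin_probs, bin_spatial)
estimate = sum(p * c for p, c in zip(post, bin_centers))  # posterior-mean value
```

Taking the posterior mean is one possible way to turn the posterior into a single value; sampling from the posterior is an equally plausible choice.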
10.1007/s41095-022-0304-6.F006
Composition of the 1-ring of Block: (a) , (b) , (c) , and (d) .
10.1007/s41095-022-0304-6.F007
Schematic diagram of the spatial correlation modeling.
Solver
Because the maximum likelihood function of the Gaussian copula functions in Eqs. (6) and (7) is complex and cannot be solved using partial derivatives, we use the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [39], a widely used quasi-Newton algorithm, to accelerate the EM algorithm for this parameter estimation problem with missing data.
We now consider discrete calculation of the copula’s marginal CDF. To calculate the parameters of the Gaussian copula function, it is necessary to pre-calculate the marginal CDFs of value distributions and spatial distributions. The discrete calculation of the marginal CDFs in Eq. (6) is straightforward: it accumulates the probabilities of the bins satisfying certain conditions. The marginal CDFs in Eq. (7) are also computed discretely. Consider as an example: if , then the value of the corresponding CDF is
where , and .
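As a small illustration of the discrete CDF calculation described above, the sketch below accumulates histogram bin probabilities. The selection condition used here (a bin counts once its upper edge is at or below the query value) is an assumption made for the example, as are the function names and the sample histogram.

```python
from bisect import bisect_right
from itertools import accumulate


def histogram_cdf(bin_probs):
    """Cumulative sums of bin probabilities, normalised so the last entry is 1."""
    cumulative = list(accumulate(bin_probs))
    total = cumulative[-1]
    return [c / total for c in cumulative]


def cdf_at(value, bin_edges, cdf):
    """Accumulated probability of all bins lying entirely at or below `value`."""
    edges_at_or_below = bisect_right(bin_edges, value)
    full_bins = min(max(edges_at_or_below - 1, 0), len(cdf))
    return cdf[full_bins - 1] if full_bins > 0 else 0.0


# Hypothetical 4-bin histogram over the value range [0, 4].
bin_edges = [0.0, 1.0, 2.0, 3.0, 4.0]
bin_probs = [0.1, 0.2, 0.3, 0.4]
cdf = histogram_cdf(bin_probs)
value_cdf = cdf_at(2.5, bin_edges, cdf)  # bins [0, 1) and [1, 2) contribute
```

Precomputing the cumulative array once per block keeps each later CDF lookup to a single binary search.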
After calculating the CDFs, the parameters of the Gaussian copula functions are estimated. The EM algorithm [38] is an iterative method for computing maximum likelihood estimates of parameters and is widely applied to incomplete data. However, the EM algorithm has a sublinear convergence speed, and the derivative of the function in the EM algorithm has no explicit expression. Hence, BFGS [39] is used to accelerate the algorithm. For the case of calculating SC in Eq. (7), pseudo-code is shown in Algorithm 1.
We determine step size and stopping conditions as follows. The function in Algorithm 1 is the expectation of the logarithmic likelihood function of the complete data on the conditional probability distribution of the unobserved data. The step size is determined using a backtracking line search so that the energy decreases monotonically. To optimize , in the () iteration is initialized to . is multiplied by two if the initial decreases the energy; . For the stopping conditions, we set and in our experiments.
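To make the roles of the quasi-Newton update and the backtracking line search concrete, here is a minimal sketch for the simplest case: estimating the single correlation parameter of a bivariate Gaussian copula by maximum likelihood from fully observed data. It does not reproduce Algorithm 1 or the EM step. The derivative is approximated by finite differences, plain step halving stands in for the step-size rule (the step-doubling rule is omitted for brevity), and in one dimension the BFGS update reduces to the secant curvature estimate used below. All names, tolerances, and the synthetic data are assumptions.

```python
import math
import random
from statistics import NormalDist


def mean_neg_log_likelihood(t, scores):
    """Mean negative log-density of a bivariate Gaussian copula, rho = tanh(t)."""
    rho = max(-0.999999, min(0.999999, math.tanh(t)))
    one_minus = 1.0 - rho * rho
    total = 0.0
    for z1, z2 in scores:
        quad = rho * rho * (z1 * z1 + z2 * z2) - 2.0 * rho * z1 * z2
        total += 0.5 * math.log(one_minus) + quad / (2.0 * one_minus)
    return total / len(scores)


def fit_copula_correlation(uniform_pairs, tol=1e-6, max_iter=50):
    """Quasi-Newton fit of rho with a backtracking (step-halving) line search."""
    inv = NormalDist().inv_cdf
    scores = [(inv(u), inv(v)) for u, v in uniform_pairs]  # normal scores

    def energy(t):
        return mean_neg_log_likelihood(t, scores)

    def grad(t, h=1e-5):
        # Central finite difference: no explicit derivative is assumed.
        return (energy(t + h) - energy(t - h)) / (2.0 * h)

    t, curvature = 0.0, 1.0
    g = grad(t)
    for _ in range(max_iter):
        if abs(g) < tol:  # stopping condition on the gradient magnitude
            break
        direction = -g / curvature
        step, e0 = 1.0, energy(t)
        # Backtracking: halve the step until the energy does not increase.
        while energy(t + step * direction) > e0 and step > 1e-8:
            step *= 0.5
        t_new = t + step * direction
        g_new = grad(t_new)
        s, y = t_new - t, g_new - g
        if s * y > 0.0:        # keep the curvature estimate positive
            curvature = y / s  # 1-D BFGS (secant) update
        t, g = t_new, g_new
    return math.tanh(t)


# Synthetic check: uniforms generated from a copula with known correlation 0.7.
rng = random.Random(1)
cdf = NormalDist().cdf
true_rho = 0.7
data = []
for _ in range(2000):
    z1 = rng.gauss(0.0, 1.0)
    z2 = true_rho * z1 + math.sqrt(1.0 - true_rho ** 2) * rng.gauss(0.0, 1.0)
    data.append((cdf(z1), cdf(z2)))
rho_hat = fit_copula_correlation(data)
```

The `tanh` reparameterisation keeps the correlation inside (-1, 1) without explicit constraints, so the line search never has to reject a step for leaving the valid parameter range.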
Figure 8 shows a schematic diagram of the BFGS-based EM acceleration algorithm for estimating the parameter of a simple bivariate Gaussian copula function. The gray curve represents the objective function (the logarithmic likelihood function of ). The blue curves represent the lower bound function , which approaches the local optimum over successive iterations and converges to it.
10.1007/s41095-022-0304-6.F008
Schematic diagram of the BFGS-based EM acceleration algorithm.
10.1007/s41095-022-0304-6.F009
Visual comparison of a volume rendering of (a) the ground truth and the reconstruction results of LF, using (b) the block histogram model [9], (c) spatial GMM [20], and (d) our method.
10.1007/s41095-022-0304-6.F010
Visual comparison of a section pseudo-color rendering of (a) the ground truth and the reconstruction results of SW, using (b) the SLIC-based model [18], (c) the entropy-based model [40], and (d) our method.
10.1007/s41095-022-0304-6.F011
Visual comparison of an isosurface rendering of HI (the seventh time step): (a) ground truth, (b, c) reconstructed results using the spatial GMM [20] with and our method with .
10.1007/s41095-022-0304-6.F012
Block size versus RMSE (smaller is better) and block size vs. (larger is better) for the reconstruction results of our method for LF, SW, AE, and HI, using block sizes of 64, 32, 16, and 8.
10.1007/s41095-022-0304-6.F013
Intra-node performance overhead of (a) LF and (b) AE using the block histogram model [9], spatial GMM [20], SLIC-based model [18], entropy-based model [40], and our correlation-aware model.
10.1007/s41095-022-0304-6.F014
Visual comparison of (above) a volume rendering and (below) a section pseudo-color rendering of (a) the ground truth and the reconstructed results for AIT, using (b) the spatial GMM [20] and (c) our method.
10.1007/s41095-022-0304-6.F015
Visual comparison of (above) a volume rendering and (below) a section pseudo-color rendering of (a) the ground truth and the reconstructed results for AIV, using (b) the spatial GMM [20] and (c) our method.