AI Chat Paper
Note: Please note that the following content is generated by AMiner AI. SciOpen does not take any responsibility related to this content.
{{lang === 'zh_CN' ? '文章概述' : 'Summary'}}
{{lang === 'en_US' ? '中' : 'Eng'}}
Chat more with AI
PDF (944.9 KB)
Collect
Submit Manuscript AI Chat Paper
Show Outline
Outline
Show full outline
Hide outline
Outline
Show full outline
Hide outline
Open Access

Frequency and Similarity-Aware Partitioning for Cloud Storage Based on Space-Time Utility Maximization Model

Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China.
Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA.
Show Author Information

Abstract

With the rise of various cloud services, the problem of redundant data is more prominent in the cloud storage systems. How to assign a set of documents to a distributed file system, which can not only reduce storage space, but also ensure the access efficiency as much as possible, is an urgent problem which needs to be solved. Space-efficiency mainly uses data de-duplication technologies, while access-efficiency requires gathering the files with high similarity on a server. Based on the study of other data de-duplication technologies, especially the Similarity-Aware Partitioning (SAP) algorithm, this paper proposes the Frequency and Similarity-Aware Partitioning (FSAP) algorithm for cloud storage. The FSAP algorithm is a more reasonable data partitioning algorithm than the SAP algorithm. Meanwhile, this paper proposes the Space-Time Utility Maximization Model (STUMM), which is useful in balancing the relationship between space-efficiency and access-efficiency. Finally, this paper uses 100 web files downloaded from CNN for testing, and the results show that, relative to using the algorithms associated with the SAP algorithm (including the SAP-Space-Delta algorithm and the SAP-Space-Dedup algorithm), the FSAP algorithm based on STUMM reaches higher compression ratio and a more balanced distribution of data blocks.

References

[1]
Benson, T. Akella, A. and Maltz, D. A. Network traffic characteristics of data centers in the wild, in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, ACM, 2010, pp. 267-280.
[2]
Meyer D. T. and Bolosky, W. J. A study of practical deduplication, ACM Transactions on Storage (TOS), vol. 7, no. 4, p. 14, 2012.
[3]
Clements, A. T. Ahmad, I. Vilayannur, M. and Li, J. Decentralized deduplication in san cluster file systems, in USENIX Annual Technical Conference, 2009, pp. 101-114.
[4]
Manber, U. Finding similar files in a large file system, in Usenix Winter 1994 Technical Conference, 1994, pp. 1-10.
[5]
Baker, B. S. On finding duplication and near-duplication in large software systems, in Reverse Engineering, 1995, Proceedings of 2nd Working Conference on, IEEE, 1995, pp. 86-95.
[6]
Forman, G. Eshghi, K. and Chiocchetti, S. Finding similar files in large document repositories, in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, ACM, 2005, pp. 394-400.
[7]
Douceur, J. R. Adya, A. Bolosky, W. J. Simon, P. and Theimer, M. Reclaiming space from duplicate files in a serverless distributed file system, in Distributed Computing Systems, 2002. Proceedings. 22nd International Conference on, IEEE, 2002, pp. 617-624.
[8]
Quinlan S. and Dorward, S. Venti: A new approach to archival storage, FAST, vol. 2, pp. 89-101, 2002.
[9]
Zhu, B. Li, K. and Patterson, R. H. Avoiding the disk bottleneck in the data domain deduplication file system, FAST, vol. 8, pp. 1-14, 2008.
[10]
Balasubramanian, B. Lan, T. and Chiang, M. Sap: Similarity aware partitioning for efficient cloud storage, in Infocom 2014 Proceedings IEEE, 2014, pp. 592-600.
[11]
Bhagwat, D. Pollack, K. Long, D. D. Schwarz, T. Miller, E. L. and Paris, J. F. Providing high reliability in a minimum redundancy archival storage system, in Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2006. MASCOTS 2006. 14th IEEE International Symposium on, IEEE, 2006, pp. 413-421.
[12]
Schwarz, T. J. Xin, Q. Miller, E. L. Long, D. D. Hospodor, A. and Ng, S. Disk scrubbing in large archival storage systems, in Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, 2004. (MASCOTS 2004). Proceedings. The IEEE Computer Societys 12th Annual International Symposium on, IEEE, 2004, pp. 409-418.
[13]
Weil, S. A. Brandt, S. A. Miller, E. L. Long, D. D. and Maltzahn, C. Ceph: A scalable, high-performance distributed file system, in Proceedings of the 7th Symposium on Operating Systems Design and Implementation, USENIX Association, 2006, pp. 307-320.
[14]
Zhu, R. Qin, L.-H. Zhou, J.-L. and Zheng, H. Using multithreads to hide deduplication i/o latency with low synchronization overhead, Journal of Central South University, vol. 20, pp. 1582-1591, 2013.
[15]
Bolosky, W. J. Corbin, S. Goebel, D. and Douceur, J. R. Single instance storage in windows 2000, in Proceedings of the 4th USENIX Windows Systems Symposium, Seattle, USA, 2000, pp. 13-24.
[16]
Hong, B. Plantenberg, D. Long, D. D. and Sivan-Zimet, M. Duplicate data elimination in a san file system, in Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, 2004, pp. 301-314.
[17]
Sapuntzakis, C. P. Chandra, R. Pfaff, B. Chow, J. Lam, M. S. and Rosenblum, M. Optimizing the migration of virtual computers, ACM SIGOPS Operating Systems Review, vol. 36, no. SI, pp. 377-390, 2002.
[18]
Muthitacharoen, A. Chen, B. and Mazieres, D. A low bandwidth network file system, ACM SIGOPS Operating Systems Review, vol. 35, no. 5, pp. 174-187, 2001.
[19]
Rabin, M. O. Fingerprinting by random polynomials, Center for Research in Computing Technology, Harvard University, Report TR-15-81, 1981.
[20]
Broder, A. Z. Some applications of rabins fingerprinting method, in Sequences II. Springer, 1993, pp. 143-152.
[21]
Douglis F. and Iyengar, A. Application-specific deltaencoding via resemblance detection. in USENIX Annual Technical Conference, General Track, 2003, pp. 113-126.
[22]
Bloom, B. H. Space/time trade-offs in hash coding with allowable errors, Communications of the ACM, vol. 13, no. 7, pp. 422-426, 1970.
[23]
Lillibridge, M. Eshghi, K. Bhagwat, D. Deolalikar, V. Trezis, G. and Camble, P. Sparse indexing: Large scale, inline deduplication using sampling and locality, in 7th USENIX Conference on File and Storage Technologies, 2009, pp. 111-123.
Tsinghua Science and Technology
Pages 233-245
Cite this article:
Li J, Wu J, Ma Z. Frequency and Similarity-Aware Partitioning for Cloud Storage Based on Space-Time Utility Maximization Model. Tsinghua Science and Technology, 2015, 20(3): 233-245. https://doi.org/10.1109/TST.2015.7128935

660

Views

23

Downloads

0

Crossref

N/A

Web of Science

3

Scopus

4

CSCD

Altmetrics

Received: 16 March 2015
Accepted: 04 May 2015
Published: 19 June 2015
© The authors 2015
Return