Open Access

Frequency and Similarity-Aware Partitioning for Cloud Storage Based on Space-Time Utility Maximization Model

Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China.
Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA.

Abstract

With the rise of various cloud services, redundant data has become an increasingly prominent problem in cloud storage systems. How to assign a set of documents to a distributed file system so as to reduce storage space while preserving access efficiency as much as possible is an urgent problem. Space-efficiency relies mainly on data de-duplication technologies, while access-efficiency requires gathering files with high similarity on the same server. Building on existing data de-duplication technologies, especially the Similarity-Aware Partitioning (SAP) algorithm, this paper proposes the Frequency and Similarity-Aware Partitioning (FSAP) algorithm for cloud storage, a more reasonable data partitioning algorithm than SAP. This paper also proposes the Space-Time Utility Maximization Model (STUMM), which balances the trade-off between space-efficiency and access-efficiency. Finally, this paper uses 100 web files downloaded from CNN for testing; the results show that, relative to the algorithms associated with SAP (the SAP-Space-Delta and SAP-Space-Dedup algorithms), the FSAP algorithm based on STUMM achieves a higher compression ratio and a more balanced distribution of data blocks.
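To illustrate the two goals the abstract contrasts, the following minimal sketch combines chunk-level de-duplication (space-efficiency) with similarity-aware placement (access-efficiency). It is not the FSAP or SAP algorithm from the paper; it is a simplified illustration using fixed-size chunking, SHA-1 fingerprints, and Jaccard similarity over chunk sets, with all function names chosen for this example.

```python
import hashlib

def chunk_fingerprints(data: bytes, chunk_size: int = 64) -> set:
    """Split a file into fixed-size chunks and fingerprint each with SHA-1.
    Returning a set de-duplicates repeated chunks within the file itself."""
    return {
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }

def jaccard_similarity(a: set, b: set) -> float:
    """Fraction of shared chunk fingerprints between two chunk sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def assign_to_server(file_chunks: set, servers: list) -> int:
    """Place a file on the server whose stored chunks it overlaps most,
    breaking ties toward the least-loaded server. Co-locating similar
    files maximizes how many chunks are already present (access- and
    space-efficiency); the set union stores each shared chunk only once."""
    best = max(range(len(servers)),
               key=lambda i: (jaccard_similarity(file_chunks, servers[i]),
                              -len(servers[i])))
    servers[best] |= file_chunks
    return best

# Two similar files and one unrelated file, partitioned across two servers.
servers = [set(), set()]
f1 = chunk_fingerprints(b"the quick brown fox " * 50)
f2 = chunk_fingerprints(b"the quick brown fox " * 50 + b"jumps")
f3 = chunk_fingerprints(b"completely different payload " * 40)

s1 = assign_to_server(f1, servers)
s2 = assign_to_server(f2, servers)  # lands beside the similar file f1
s3 = assign_to_server(f3, servers)  # dissimilar, so goes to the other server
```

Here `f2` is co-located with `f1` because their chunk sets overlap heavily, while `f3` shares no chunks and is placed on the emptier server, so the duplicated chunks of `f1` and `f2` are stored exactly once.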

Tsinghua Science and Technology
Pages 233-245
Cite this article:
Li J, Wu J, Ma Z. Frequency and Similarity-Aware Partitioning for Cloud Storage Based on Space-Time Utility Maximization Model. Tsinghua Science and Technology, 2015, 20(3): 233-245. https://doi.org/10.1109/TST.2015.7128935


Received: 16 March 2015
Accepted: 04 May 2015
Published: 19 June 2015
© The authors 2015