Open Access

Wide Area Analytics for Geographically Distributed Datacenters

Department of Electrical and Computer Engineering, University of Toronto, Toronto M5S 3G4, Canada.

Abstract

Big data analytics, the process of organizing and analyzing data to extract useful information, is one of the primary uses of cloud services today. Traditionally, a collection of data is stored and processed in a single datacenter. As data volumes grow rapidly, handling all of the data in a single datacenter becomes increasingly inefficient from a performance point of view. Large cloud service providers are deploying datacenters at geographically distributed locations around the world for better performance and availability. A widely used approach to analyzing geo-distributed data is the centralized approach, which transfers all the raw data from the local datacenters to a central datacenter for aggregation. However, this approach has been observed to consume a significant amount of bandwidth, which degrades performance. A number of mechanisms have been proposed to optimize performance when data analytics are performed across geo-distributed datacenters. In this paper, we survey representative mechanisms proposed in the literature for wide area analytics. We discuss the basic ideas, present the proposed architectures and mechanisms, and use several examples to illustrate existing work. We point out the limitations of these mechanisms, compare them, and conclude with our thoughts on future research directions.

Tsinghua Science and Technology
Pages 125-135
Cite this article:
Ji S, Li B. Wide Area Analytics for Geographically Distributed Datacenters. Tsinghua Science and Technology, 2016, 21(2): 125-135. https://doi.org/10.1109/TST.2016.7442496

Received: 03 March 2016
Accepted: 10 March 2016
Published: 31 March 2016
© The author(s) 2016