Wide Area Analytics for Geographically Distributed Datacenters

Siqi Ji; Baochun Li

doi:10.1109/TST.2016.7442496


Open Access


Siqi Ji, Baochun Li

Department of Electrical and Computer Engineering, University of Toronto, Toronto M5S 3G4, Canada.


Abstract

Big data analytics, the process of organizing and analyzing data to extract useful information, is one of the primary uses of cloud services today. Traditionally, collections of data are stored and processed in a single datacenter. As the volume of data grows at a tremendous rate, relying on a single datacenter to handle all of it becomes less efficient from a performance point of view. Large cloud service providers are therefore deploying datacenters around the world for better performance and availability. A widely used approach for analytics of geo-distributed data is the centralized approach, which aggregates all the raw data from local datacenters to a central datacenter. However, it has been observed that this approach consumes a significant amount of bandwidth, which degrades performance. A number of mechanisms have been proposed to achieve optimal performance when data analytics are performed over geo-distributed datacenters. In this paper, we survey the representative mechanisms proposed in the literature for wide area analytics. We discuss the basic ideas, present the proposed architectures and mechanisms, and use several examples to illustrate existing work. We point out the limitations of these mechanisms, compare them, and conclude with our thoughts on future research directions.

Keywords

big data analytics geo-distributed datacenters


Tsinghua Science and Technology

Volume 21, Issue 2, April 2016

Pages 125-135

DOI: 10.1109/TST.2016.7442496

Cite this article:

Ji S, Li B. Wide Area Analytics for Geographically Distributed Datacenters. Tsinghua Science and Technology, 2016, 21(2): 125-135. https://doi.org/10.1109/TST.2016.7442496