A Survey of Data Partitioning and Sampling Methods to Support Big Data Analysis

Mohammad Sultan Mahmud; Joshua Zhexue Huang; Salman Salloum; Tamer Z. Emara; Kuanishbay Sadatdiynov

doi:10.26599/BDMA.2019.9020015

| Sign up

PDF (6.6 MB)

Cite

EndNote(RIS) BibTeX

Collect

Submit Manuscript

Show Outline

Figures (7)

Fig. 1

Fig. 2

Fig. 3

Fig. 4

Fig. 5

Fig. 6

Fig. 7

Tables (3)

Table 1

Table 2

Table 3

Open Access

A Survey of Data Partitioning and Sampling Methods to Support Big Data Analysis

Mohammad Sultan Mahmud, Joshua Zhexue Huang, Salman Salloum(), Tamer Z. Emara, Kuanishbay Sadatdiynov

∙ National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China, and Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.

Show Author Information

Abstract

Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.

Keywords

big data analysis data partitioning data sampling distributed and parallel computing approximate computing

References

[1]

Cattell

, Scalable SQL and NoSQL data stores, ACM SIGMOD Record, vol. 39, no. 4, pp. 12-27, 2011.

Crossref Google Scholar

[2]

Bakshi

, Considerations for big data: Architecture and approach, in Proc. of 2012 IEEE Aerospace Conference, Big Sky, MT, USA, 2012, pp. 1-7.

Crossref

[3]

Chen

and M.

Xie

, A split-and-conquer approach for analysis of extraordinarily large data, Statistica Sinica, vol. 24, no. 4, pp. 1655-1684, 2014.

Crossref Google Scholar

[4]

Lazar

, The big picture: Divide and combine to conquer big data, Chance, vol. 31, no. 1, pp. 57-59, 2018.

Crossref Google Scholar

[5]

Dean

and S.

Ghemawat

, MapReduce: Simplified data processing on large clusters, in Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI’04), San Francisco, CA, USA, 2004, pp. 137-150.

[6]

Singh

and C. K.

Reddy

, A survey on platforms for big data analytics, Journal of Big Data, vol. 2, no. 1, p. 8, 2014.

Crossref Google Scholar

[7]

H. V.

Jagadish

, J.

Gehrke

, A.

Labrinidis

, Y.

Papakonstantinou

, J. M.

Patel

, R.

Ramakrishnan

, and C.

Shahabi

, Big data and its technical challenges, Communications of the ACM, vol. 57, no. 7, pp. 86-94, 2014.

Crossref Google Scholar

[8]

Zaharia

, R. S.

Xin

, P.

Wendell

, T.

Das

, M.

Armbrust

, A.

Dave

, X.

Meng

, J.

Rosen

, S.

Venkataraman

, M. J.

Franklin

, et al., Apache spark: A unified engine for big data processing, Communications of the ACM, vol. 59, no. 11, pp. 56-65, 2016.

Crossref Google Scholar

[9]

Salloum

, R.

Dautov

, X.

Chen

, P. X.

Peng

, and J. Z.

Huang

, Big data analytics on apache spark, International Journal of Data Science and Analytics, vol. 1, no. 3, pp. 145-164, 2016.

Crossref Google Scholar

[10]

Shvachko

, H.

Kuang

, S.

Radia

, and R.

Chansler

, The Hadoop distributed file system, in Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, USA, 2010, pp. 1-10.

Crossref

[11]

C.-T.

Chu

, S. K.

Kim

, Y.-A.

Lin

, Y.

, G.

Bradski

, A. Y.

, and K.

Olukotun

, Map-reduce for machine learning on multicore, in Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS’06), Cambridge, MA, USA, 2006, pp. 281-288.

[12]

Quoc

, Approximate data analytics systems, PhD dissertation, Technische Universität Dresden, Dresden, Germany, 2017.

[13]

Nair

, Big data needs approximate computing: Technical perspective, Communications of the ACM, vol. 58, no. 1, pp. 104-104, 2015.

Crossref Google Scholar

[14]

Mittal

, A survey of techniques for approximate computing, ACM Computing Surveys, vol. 48, no. 4, pp. 1-33, 2016.

Crossref Google Scholar

[15]

D. A.

Reed

and J.

Dongarra

, Exascale computing and big data, Communications of the ACM, vol. 58, no. 7, pp. 56-68, 2015.

Crossref Google Scholar

[16]

C. E.

Otero

and A.

Peter

, Research directions for engineering big data analytics software, IEEE Intelligent Systems, vol. 30, no. 1, pp. 13-19, 2015.

Crossref Google Scholar

[17]

Agrawal

, J.

Choi

, K.

Gopalakrishnan

, S.

Gupta

, R.

Nair

, J.

, D. A.

Prener

, S.

Shukla

, V.

Srinivasan

, and Z.

Sura

, Approximate computing: Challenges and opportunities, in Proc. of 2016 IEEE International Conference on Rebooting Computing (ICRC), San Diego, CA, USA, 2016, pp. 1-8.

Crossref

[18]

Sidiroglou-Douskos

, S.

Misailovic

, H.

Hoffmann

, and M.

Rinard

, Managing performance vs. accuracy trade-offs with loop perforation, in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE’11), Szeged, Hungary, 2011, pp. 124-134.

Crossref

[19]

Agarwal

, B.

Mozafari

, A.

Panda

, H.

Milner

, S.

Madden

, and I.

Stoica

, BlinkDB: Queries with bounded errors and bounded response times on very large data, in Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys’13), Prague, Czech Republic, 2013, pp. 29-42.

Crossref

[20]

Goiri

, R.

Bianchini

, S.

Nagarakatte

, and T. D.

Nguyen

, ApproxHadoop: Bringing approximations to MapReduce frameworks, in Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15), Istanbul, Turkey, 2015, pp. 383-397.

Crossref

[21]

and G.

, Approximate query processing: What is new and where to go?, Data Science and Engineering, vol. 3, no. 4, pp. 379-397, 2018.

Crossref Google Scholar

[22]

Agarwal

, H.

Milner

, A.

Kleiner

, A.

Talwalkar

, M.

Jordan

, S.

Madden

, B.

Mozafari

, and I.

Stoica

, Knowing when you’re wrong: Building fast and reliable approximate query processing systems, in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD’14), Snowbird, UT, USA, 2014, pp. 481-492.

Crossref

[23]

D. L.

Quoc

, R.

Chen

, P.

Bhatotia

, C.

Fetzer

, V.

Hilt

, and T.

Strufe

, Streamapprox: Approximate computing for stream analytics, in Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference (Middleware’17), Las Vegas, NV, USA, 2017, pp. 185-197.

Crossref

[24]

Sagi

and L.

Rokach

, Ensemble learning: A survey, Data Mining and Knowledge Discovery, vol. 8, no. 4, pp. 1-18, 2018.

Crossref Google Scholar

[25]

Basiri

, E.

Ollila

, and V.

Koivunen

, Robust, scalable, and fast bootstrap method for analyzing large scale data, IEEE Transactions on Signal Processing, vol. 64, no. 4, pp. 1007-1017, 2016.

Crossref Google Scholar

[26]

V. K.

Singh

, M.

Taram

, V.

Agrawal

, and B. S.

Baghel

, A literature review on Hadoop ecosystem and various techniques of big data optimization, in Proceedings of International Conference on Data and Information Systems (ICDIS’17), Amarkantak, India, 2017, pp. 231-240.

Crossref

[27]

Polato

, R.

Ré

, A.

Goldman

, and F.

Kon

, A comprehensive view of Hadoop research-systematic literature review, Journal of Network and Computer Applications, vol. 46, pp. 1-25, 2014.

Crossref Google Scholar

[28]

Dean

and S.

Ghemawat

, Mapreduce: Simplified data processing on large clusters, Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.

Crossref Google Scholar

[29]

Hasan

, Z.

Anwar

, W.

Yurcik

, L.

Brumbaugh

, and R.

Campbell

, A survey of peer-to-peer storage techniques for distributed file systems, in Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’05) - Volume II, Las Vegas, NV, USA, 2005, pp. 205-213.

Crossref

[30]

Bzoch

and J.

Safarik

, State of the art in distributed file systems: Increasing performance, in Proc. of 2011 Second Eastern European Regional Conference on the Engineering of Computer Based Systems, Bratislava, Slovakia, 2011, pp. 153-154.

Crossref

[31]

M. T.

Ozsu

and P.

Valduriez

, Principles of Distributed Database Systems, 3rd ed. New York, NY, USA: Springer-Verlag, 2011.

Crossref

[32]

Taniar

, High performance database processing, in Proc. of 2012 IEEE 26th International Conference on Advanced Information Networking and Applications, Fukuoka, Japan, 2012, pp. 5-6.

Crossref

[33]

Vijayakumar

, A.

Bhargavi

, U.

Praseeda

, and S. A.

Ahamed

, Optimizing sequence alignment in cloud using Hadoop and Mpp database, in Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing (CLOUD’12), Honolulu, HI, USA, 2012, pp. 819-827.

Crossref

[34]

Ghemawat

, H.

Gobioff

, and S.-T.

Leung

, The google file system, ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 29-43, 2003.

Crossref Google Scholar

[35]

Isard

, M.

Budiu

, Y.

, A.

Birrell

, and D.

Fetterly

, Dryad: Distributed data-parallel programs from sequential building blocks, in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (EuroSys’07), Lisbon, Portugal, 2007, pp. 59-72.

Crossref

[36]

Chaiken

, B.

Jenkins

, P.-A.

Larson

, B.

Ramsey

, D.

Shakib

, S.

Weaver

, and J.

Zhou

, Scope: Easy and efficient parallel processing of massive data sets, Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1265-1276, 2008.

Crossref Google Scholar

[37]

, M.

Isard

, D.

Fetterly

, M.

Budiu

, U.

Erlingsson

, P. K.

Gunda

, and J.

Currey

, Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language, in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI’08), San Diego, CA, USA, 2008, pp. 1-14.

[38]

Olston

, B.

Reed

, U.

Srivastava

, R.

Kumar

, and A.

Tomkins

, Pig latin: A not-so-foreign language for data processing, in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD’08), Vancouver, Canada, 2008, pp. 1099-1110.

Crossref

[39]

Melnik

, A.

Gubarev

, J. J.

Long

, G.

Romer

, S.

Shivakumar

, M.

Tolton

, and T.

Vassilakis

, Dremel: Interactive analysis of web-scale datasets, Communications of the ACM, vol. 54, no. 6, pp. 114-123, 2011.

Crossref Google Scholar

[40]

Thusoo

, J.

Sarma

, N.

Jain

, Z.

Shao

, P.

Chakka

, N.

Zhang

, S.

Antony

, H.

Liu

, and R.

Murthy

, Hive a petabyte scale data warehouse using hadoop, in Proceedings of the 2010 IEEE 26th International Conference on Data Engineering (ICDE’10), Long Beach, CA, USA, 2010, pp. 996-1005.

Crossref

[41]

K. S.

Beyer

, V.

Ercegovac

, R.

Gemulla

, A.

Balmin

, M.

Eltabakh

, C.-C.

Kanne

, F.

Ozcan

, and E. J.

Shekita

, Jaql: A scripting language for large scale semi structured data analysis, Proceedings of the VLDB Endowment, vol. 4, no. 12, pp. 1272-1283, 2011.

Crossref Google Scholar

[42]

Chattopadhyay

, L.

Lin

, W.

Liu

, S.

Mittal

, P.

Aragonda

, V.

Lychagina

, Y.

Kwon

, and M.

Wong

, Tenzing: A SQL implementation on the MapReduce framework, Proceedings of the VLDB Endowment, vol. 4, no. 12, pp. 1318-1327, 2011.

Crossref Google Scholar

[43]

Karau

, A.

Konwinski

, P.

Wendell

, and M.

Zaharia

, Learning Spark: Lightning-Fast Big Data Analytics, 1st ed. Sebastopol, CA, USA: O’Reilly Media, 2015.

[44]

Zaharia

, M.

Chowdhury

, T.

Das

, A.

Dave

, J.

, M.

McCauley

, M. J.

Franklin

, S.

Shenker

, and I.

Stoica

, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI’12), San Jose, CA, USA, 2012, p. 2.

[45]

Navathe

, S.

Ceri

, G.

Wiederhold

, and J.

Dou

, Vertical partitioning algorithms for database design, ACM Transactions on Database Systems, vol. 9, no. 4, pp. 680-710, 1984.

Crossref Google Scholar

[46]

D. W.

Cornell

and P. S.

, An effective approach to vertical partitioning for physical design of relational databases, IEEE Transactions on Software Engineering, vol. 16, no. 2, pp. 248-258, 1990.

Crossref Google Scholar

[47]

W. W.

Chu

and I. T.

Ieong

, A transaction-based approach to vertical partitioning for relational database systems, IEEE Transactions on Software Engineering, vol. 19, no. 8, pp. 804-812, 1993.

Crossref Google Scholar

[48]

Curino

, E.

Jones

, Y.

Zhang

, and S.

Madden

, Schism: A workload-driven approach to database replication and partitioning, Proceedings of the VLDB Endowment, vol. 3, no. 1, pp. 48-57, 2010.

Crossref Google Scholar

[49]

Curino

, E. P.

Jones

, R. A.

Popa

, N.

Malviya

, E.

, S.

Madden

, H.

Balakrishnan

, and N.

Zeldovich

, Relational cloud: A database-as-a-service for the cloud, in Proc. of 5th Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, USA, 2011, pp. 235-240.

[50]

Das

, D.

Agrawal

, and A.

El Abbadi

, Elastras: An elastic transactional data store in the cloud, in Proceedings of the 2009 Conference on Hot Topics in Cloud Computing (HotCloud’09), San Diego, CA, USA, 2009, pp. 1-5.

Crossref

[51]

Baker

, C.

Bond

, J.

Corbett

, J.

Furman

, A.

Khorlin

, J.

Larson

, J.-M.

Leon

, Y.

, A.

Lloyd

, and V.

Yushprakh

, Megastore: Providing scalable, highly available storage for interactive services, in Proc. of Conference on Innovative Database Research (CIDR), Asilomar, CA, USA, 2011, pp. 223-234.

[52]

Kamal

, M.

Murshed

, and R.

Buyya

, Workload-aware incremental repartitioning of shared-nothing distributed databases for scalable OLTP applications, Future Generation Computer Systems, vol. 56, pp. 421-435, 2016.

Crossref Google Scholar

[53]

S. P.

Phansalkar

and A. R.

Dani

, Transaction aware vertical partitioning of database (TAVDP) for responsive OLTP applications in cloud data stores, Journal of Theoretical and Applied Information Technology, vol. 59, no. 1, pp. 73-81, 2014.

Google Scholar

[54]

P. A.

Bernstein

, I.

Cseri

, N.

Dani

, N.

Ellis

, A.

Kalhan

, G.

Kakivaya

, D. B.

Lomet

, R.

Manne

, L.

Novik

, and T.

Talius

, Adapting microsoft SQL server for cloud computing, in Proc. of 2011 IEEE 27th International Conference on Data Engineering, Hannover, Germany, 2011, pp. 1255-1263.

Crossref

[55]

Ahirrao

and R.

Ingle

, Scalable transactions in cloud data stores, in Proc. of 2013 3rd IEEE International Advance Computing Conference (IACC), Ghaziabad, India, 2013, pp. 116-119.

Crossref

[56]

S. B.

Navathe

and M.

, Vertical partitioning for database design: A graphical algorithm, ACM SIGMOD Record, vol. 18, no. 2, pp. 440-450, 1989.

Crossref Google Scholar

[57]

J. H.

Son

and M. H.

Kim

, An adaptable vertical partitioning method in distributed systems, Journal of Systems and Software, vol. 73, no. 3, pp. 551-561, 2004.

Crossref Google Scholar

[58]

Zhao

, Y.

Cheng

, and F.

Rusu

, Workload-driven vertical partitioning for effective query processing over raw data, arXiv preprint arXiv: 1503.08946, 2015.

Crossref

[59]

Y.-F.

Huang

and C.-J.

Lai

, Integrating frequent pattern clustering and branch-and-bound approaches for data partitioning, Information Sciences, vol. 328, pp. 288-301, 2016.

Crossref Google Scholar

[60]

Vojnovic

, F.

, and J.

Zhou

, Sampling-based range partition methods for big data analytics, Technical Report MSR-TR-2012-18, Microsoft Research, Redmond, WA, USA, 2012.

[61]

Chakrabarti

, S.

Parthasarathy

, and C.

Stewart

, Green- and heterogeneity-aware partitioning for data analytics, in Proc. of IEEE Conference on Computer Communications Workshops, San Francisco, CA, USA, 2016, pp. 366-371.

Crossref

[62]

Phansalkar

and S.

Ahirrao

, Survey of data partitioning algorithms for big data stores, in Proc. of 2016 Fourth International Conference on Parallel, Distributed and Grid Computing (PDGC), Waknaghat, India, 2016, pp. 163-168.

Crossref

[63]

Shanbhag

, A.

Jindal

, S.

Madden

, J.

Quiane

, and A. J.

Elmore

, A robust partitioning scheme for ad-hoc query workloads, in Proceedings of the 2017 Symposium on Cloud Computing (SoCC’17), Santa Clara, CA, USA, 2017, pp. 229-241.

Crossref

[64]

Wang

, Q.

Xiao

, and J.

Yin

, DRAW: A new Data-gRouping-AWare data placement scheme for data intensive applications with interest locality, IEEE Transactions on Magnetics, vol. 49, no. 6, pp. 2514-2520, 2013.

Crossref Google Scholar

[65]

K. H. K.

Reddy

and D. S.

Roy

, DPPACS: A novel data partitioning and placement aware computation scheduling scheme for data-intensive cloud applications, The Computer Journal, vol. 59, no. 1, pp. 64-82, 2016.

Crossref Google Scholar

[66]

Yuan

, Y.

Yang

, X.

Liu

, and J.

Chen

, A data placement strategy in scientific cloud workflows, Future Generation Computer Systems, vol. 26, no. 8, pp. 1200-1214, 2010.

Crossref Google Scholar

[67]

Slagter

, C.-H.

Hsu

, Y.-C.

Chung

, and D.

Zhang

, An improved partitioning mechanism for optimizing massive data analysis using MapReduce, The Journal of Supercomputing, vol. 66, no. 1, pp. 539-555, 2013.

Crossref Google Scholar

[68]

Chen

, J.

Yao

, and Z.

Xiao

, Libra: Lightweight data skew mitigation in MapReduce, IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 9, pp. 2520-2533, 2015.

Crossref Google Scholar

[69]

, G.

, C.

Huang

, Y.

, and W.

Tian

, A comparative study of data skew in Hadoop, in Proceedings of 26th International Conference on Network, Communication and Computing (ICNCC’17), Kunming, China, 2017, pp. 1-6.

Crossref

[70]

Tang

, W.

, K.

, and K.

, An intermediate data partition algorithm for skew mitigation in spark computing environment, IEEE Transactions on Cloud Computing, .

Crossref Google Scholar

[71]

Kwon

, M.

Balazinska

, B.

Howe

, and J.

Rolia

, A study of skew in MapReduce applications, in The 5th Open Cirrus Summit, Moskow, Russia, 2011, pp. 1-5.

Crossref

[72]

Tang

, X.

Zhang

, K.

, and K.

, An intermediate data placement algorithm for load balancing in spark computing environment, Future Generation Computer Systems, vol. 78, no. 1, pp. 287-301, 2018.

Crossref Google Scholar

[73]

, P.

Zou

, W.

, Z.

, K.

, and X.

Cui

, Sampling-based partitioning in MapReduce for skewed data, in Proc. of 2012 Seventh ChinaGrid Annual Conference (CHINAGRID’12), Beijing, China, 2012, pp. 1-8.

Crossref

[74]

del RÃo

, V.

LÃ³pez

, J. M.

BenÃtez

, and F.

Herrera

, On the use of MapReduce for imbalanced big data using random forest, Information Sciences, vol. 285, pp. 112-137, 2014.

Crossref Google Scholar

[75]

and X.

Meng

, An efficient block sampling strategy for online aggregation in the cloud, in Proc. of International Conference on Web-Age Information Management (WAIM 2015), Qingdao, China, 2015, pp. 362-373.

Crossref

[76]

Salloum

, J. Z.

Huang

, Y.

, X.

Zhang

, T. Z.

Emara

, C.

Wei

, and H.

, A random sample partition data model for big data analysis, arXiv preprint arXiv: 1712.04146, 2017.

[77]

Wei

, S.

Salloum

, T. Z.

Emara

, X.

Zhang

, J. Z.

Huang

, and Y.

, A two-stage data processing algorithm to generate random sample partitions for big data analysis, in Proc. of International Conference on Cloud Computing (CLOUD 2018), Seattle, WA, USA, 2018, pp. 347-364.

Crossref

[78]

T. Z.

Emara

and J. Z.

Huang

, A distributed data management system to support large-scale data analysis, The Journal of Systems and Software, vol. 148, pp. 105-115, 2019.

Crossref Google Scholar

[79]

T. Z.

Emara

and J. Z.

Huang

, RRPlib: A Spark library for representing HDFS blocks as a set of random sample data blocks, Science of Computer Programming, vol. 184, pp. 1-7, 2019.

Crossref Google Scholar

[80]

Jacobs

, The pathologies of big data, Communications of the ACM, vol. 52, no. 8, pp. 36-44, 2009.

Crossref Google Scholar

[81]

Cormode

and N.

Duffield

, Sampling for big data: A tutorial, in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’14), New York, NY, USA, 2014, pp. 1975-1975.

Crossref

[82]

Acharya

, I.

Diakonikolas

, C.

Hegde

, J. Z.

, and L.

Schmidt

, Fast and near-optimal algorithms for approximating distributions by histograms, in Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS’15), Melbourne, Australia, 2015, pp. 249-263.

Crossref

[83]

Efron

and R. J.

Tibshirani

, An Introduction to the Bootstrap. New York, NY, USA: Chapman & Hall, 1993.

Crossref

[84]

Olken

and D.

Rotem

, Random sampling from database files: A survey, in Proceedings of the 5th International Conference on Statistical and Scientific Database Management (SSDBM’1990), Charlotte, NC, USA, 1990, pp. 92-111.

Crossref

[85]

Olken

, Random sampling from databases, PhD dissertation, University of California at Berkeley, Berkeley, CA, USA, 1993.

[86]

Bruce

and A.

Bruce

, Practical Statistics for Data Scientists: 50 Essential Concepts. Sebastopol, CA, USA: O’Reilly Media, Inc., 2017.

[87]

W. S.

Cleveland

and R.

Hafen

, Divide and recombine (d&r): Data science for large complex data, Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 7, no. 6, pp. 425-433, 2014.

Crossref Google Scholar

[88]

Bailis

, E.

Gan

, S.

Madden

, D.

Narayanan

, K.

Rong

, and S.

Suri

, Macrobase: Prioritizing attention in fast data, in Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD’17), Chicago, IL, USA, 2017, pp. 541-556.

Crossref

[89]

Chaudhuri

, R.

Motwani

, and V.

Narasayya

, Random sampling for histogram construction: How much is enough?, ACM SIGMOD Record, vol. 27, no. 2, pp. 436-447, 1998.

Crossref Google Scholar

[90]

G. S.

Manku

, S.

Rajagopalan

, and B. G.

Lindsay

, Random sampling techniques for space efficient online computation of order statistics of large datasets, in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD’99), Philadelphia, PA, USA, 1999, pp. 251-262.

Crossref

[91]

Charikar

, K.

Chen

, and M.

Farach-Colton

, Finding frequent items in data streams, in Proceedings of the 29th International Colloquium on Automata, Languages and Programming (ICALP’02), Málaga, Spain, 2002, pp. 693-703.

Crossref

[92]

Guha

and A.

McGregor

, Space-efficient sampling, in Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007), San Juan, PUR, USA, 2007, pp. 169-176.

[93]

Kuhn

, T.

Locher

, and S.

Schmid

, Distributed computation of the mode, in Proceedings of the Twenty-seventh ACM Symposium on Principles of Distributed Computing (PODC’08), Toronto, Canada, 2008, pp. 15-24.

Crossref

[94]

Kirsch

, M.

Mitzenmacher

, A.

Pietracaprina

, E.

Upfal

, and F.

Vandin

, A rigorous statistical approach for identifying significant itemsets, in Proceedings of the IEEE International Conference on Data Mining (ICDM’08), Pisa, Italy, 2008, pp. 1-10.

[95]

J. A. R.

Rojas

, M.

Beth Kery

, S.

Rosenthal

, and A.

Dey

, Sampling techniques to improve big data exploration, in Proc. of 2017 IEEE 7th Symposium on Large Data Analysis and Visualization (LDAV), Phoenix, AZ, USA, 2017, pp. 26-35.

[96]

Kandel

, A.

Paepcke

, J. M.

Hellerstein

, and J.

Heer

, Enterprise data analysis and visualization: An interview study, IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 12, pp. 2917-2926, 2012.

Crossref Google Scholar

[97]

Lin

and D.

Ryaboy

, Scaling big data mining infrastructure: The twitter experience, SIGKDD Explorations, vol. 14, no. 2, pp. 6-19, 2013.

Crossref Google Scholar

[98]

C. T.

Fan

, Development of sampling plans by using sequential (item by item) selection techniques and digital computers, Publications of the American Statistical Association, vol. 57, no. 298, pp. 387-402, 1962.

Crossref Google Scholar

[99]

P. J.

Haas

, Data-stream sampling: Basic techniques and results, in Data Stream Management, Data-Centric Systems and Applications Book Series. Berlin, Heidelberg, Germany: Springer, 2016, pp. 13-44.

Crossref

[100]

T. E.

Oliphant

, Scipy: Open source scientific tools for python, Computing in Science and Engineering, vol. 9, no. 3, pp. 10-20, 2007.

Crossref Google Scholar

[101]

Hall

, E.

Frank

, G.

Holmes

, B.

Pfahringer

, P.

Reutemann

, and I. H.

Witten

, The weka data mining software: An update, ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10-18, 2009.

Crossref Google Scholar

[102]

Pedregosa

, G.

Varoquaux

, A.

Gramfort

, V.

Michel

, B.

Thirion

, O.

Grisel

, M.

Blondel

, P.

Prettenhofer

, R.

Weiss

, V.

Dubourg

, et al., Scikit-learn: Machine learning in python, The Journal of Machine Learning Research, vol. 12, no. 10, pp. 2825-2830, 2011.

Google Scholar

[103]

Al-Kateb

and B. S.

Lee

, Stratified reservoir sampling over heterogeneous data streams, in Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM’10), Heidelberg, Germany, 2010, pp. 621-639.

Crossref

[104]

J. S.

Vitter

, Random sampling with a reservoir, ACM Transactions on Mathematical Software, vol. 11, no. 1, pp. 37-57, 1985.

Crossref Google Scholar

[105]

C. C.

Aggarwal

, On biased reservoir sampling in the presence of stream evolution, in Proceedings of the 32nd International Conference on Very Large Databases (VLDB’06), Seoul, Korea, 2006, pp. 607-618.

[106]

Dash

and W.

, Efficient reservoir sampling for transactional data streams, in Proc. of Sixth IEEE International Conference on Data Mining-Workshops (ICDMW’06), Hong Kong, China, 2006, pp. 662-666.

Crossref

[107]

Malbasa

and S.

Vucetic

, A reservoir sampling algorithm with adaptive estimation of conditional expectation, in Proc. of International Joint Conference on Neural Networks, Orlando, FL, USA, 2007, pp. 2200-2204.

Crossref

[108]

Kleiner

, A.

Talwalkar

, P.

Sarkar

, and M. I.

Jordan

, The big data Bootstrap, in Proceedings of the 29th International Coference on International Conference on Machine Learning (ICML’12), Edinburgh, UK, 2012, pp. 1787-1794.

[109]

Breiman

, J.

Friedman

, C. J.

Stone

, and R. A.

Olshen

, Classification and Regression Trees. Boca Raton, FL, USA: CRC Press, 1984.

[110]

Breiman

, Bagging predictors, Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.

Crossref Google Scholar

[111]

M. I.

Jordan

, On statistics, computation and scalability, Bernoulli, vol. 19, no. 4, pp. 1378-1390, 2013.

Crossref Google Scholar

[112]

Genuer

, J.-M.

Poggi

, C.

Tuleau-Malot

, and N.

Villa-Vialaneix

, Random forests for big data, Big Data Research, vol. 9, pp. 28-46, 2017.

Crossref Google Scholar

[113]

Meng

, Scalable simple random sampling and stratified sampling, in Proceedings of the 30th International Conference on International Conference on Machine Learning (ICML’13), Atlanta, GA, USA, 2013, pp. 531-539.

[114]

Sanders

, S.

Lamm

, L.

Hübschle-Schneider

, E.

Schrade

, and C.

Dachsbacher

, Efficient parallel random sampling-vectorized, cache-efficient, and online, ACM Transactions on Mathematical Software, vol. 44, no. 3, pp. 1-14, 2018.

Crossref Google Scholar

[115]

Gavagsaz

, A.

Rezaee

, and H. H. S.

Javadi

, Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling, The Journal of Supercomputing, vol. 74, no. 7, pp. 3415-3440, 2018.

Crossref Google Scholar

[116]

Cormode

, M.

Garofalakis

, P. J.

Haas

, and C.

Jermaine

, Synopses for massive data: Samples, histograms, wavelets, sketches, Foundations Trends Databases, vol. 4, no. 1, pp. 1-294, 2012.

Crossref Google Scholar

[117]

, T.

Mytkowicz

, and N. S.

Kim

, Approximate computing: A survey, IEEE Design Test, vol. 33, no. 1, pp. 8-22, 2016.

Crossref Google Scholar

[118]

Shi

, X.

Meng

, F.

Wang

, and Y.

Gan

, HEDC: A histogram estimator for data in the cloud, in Proceedings of the 4th International Workshop on Cloud Data Management (CloudDB’12), Maui, HI, USA, 2012, pp. 51-58.

Crossref

[119]

D. R.

Krishnan

, D. L.

Quoc

, P.

Bhatotia

, C.

Fetzer

, and R.

Rodrigues

, IncApprox: A data analytics system for incremental approximate computing, in Proceedings of the 25th International Conference on World Wide Web (WWW’16), Montral, Canada, 2016, pp. 1133-1144.

Crossref

[120]

M. N.

Garofalakis

and P. B.

Gibbon

, Approximate query processing: Taming the terabytes, in Proceedings of the 27th International Conference on Very Large Data Bases (VLDB’01), Roma, Italy, 2001, p. 725.

[121]

Laptev

, K.

Zeng

, and C.

Zaniolo

, Early accurate results for advanced analytics on MapReduce, Proceedings of the VLDB Endowment, vol. 5, no. 10, pp. 1028-1039, 2012.

Crossref Google Scholar

[122]

, D.

Zhang

, S.

Rigo

, and T. D.

Nguyen

, Approximation with error bounds in spark, arXiv preprint arXiv: 1812.01823, 2018.

Crossref

[123]

Park

, J.

Qing

, X.

Shen

, and B.

Mozafari

, BlinkML: Approximate machine learning with probabilistic guarantees, in Proc. of the 45th International Conference on Very Large Data Bases, Los Angeles, CA, USA, 2018, pp. 1-18.

Crossref

[124]

Zhang

, J.

Wang

, and J.

Yin

, Sapprox: Enabling efficient and accurate approximations on sub-datasets with distribution-aware online sampling, Proceedings of the VLDB Endowment, vol. 10, no. 3, pp. 109-120, 2016.

Crossref Google Scholar

[125]

Salloum

, J. Z.

Huang

, Y.

, and X.

Chen

, An asymptotic ensemble learning framework for big data analysis, IEEE Access, vol. 7, no. 1, pp. 3675-3693, 2019.

Crossref Google Scholar

[126]

Salloum

, J. Z.

Huang

, and Y.

, Exploring and cleaning big data with random sample data blocks, Journal of Big Data, vol. 6, no. 1, p. 45, 2019.

Crossref Google Scholar

[127]

Wen

, D. L.

Quoc

, P.

Bhatotia

, R.

Chen

, and M.

Lee

, ApproxIoT: Approximate analytics for edge computing, in Proc. of 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, 2018, pp. 411-421.

Crossref

[128]

Elteir

, H.

Lin

, and W.

Feng

, Enhancing MapReduce via asynchronous data processing, in Proc. of 2010 IEEE 16th International Conference on Parallel and Distributed Systems, Shanghai, China, 2010, pp. 397-405.

Crossref

Big Data Mining and Analytics

Volume 3 Issue 2,
June 2020

Pages 85-101

DOI: 10.26599/BDMA.2019.9020015

Cite this article:

Mahmud MS, Huang JZ, Salloum S, et al. A Survey of Data Partitioning and Sampling Methods to Support Big Data Analysis. Big Data Mining and Analytics, 2020, 3(2): 85-101. https://doi.org/10.26599/BDMA.2019.9020015

Scheme	Strength	Limitation
Round-robin	- Sequential scan of the entire dataset	- Both point and range queries are complicated to process.
Round-robin	- Well-balanced data partition
Hash	- Sequential scan of the entire dataset	- Not well-suited for range queries
Hash	- Best suited for point queries based partitioning attributes (only one partition has to be searched)	- Also, not well-suited for point queries on non-partitioning attributes
Range	- Sequential scan of the entire dataset	- Execution skew might occur because of all processes in one or a few partitions.
	- Well-balanced data partition
	- Well-suited for both point and range queries (only one or few partitions has to be searched)
Random	- Sequential scan of the entire dataset	- It requires extra processing to calculate random values.
Random	- Approximately balanced data partition	- Records are distributed in random, and no order is followed.

Method	Description
Simple random sampling^{[100,101,102]}	Data items are selected with equal probability and the sample size is fixed.
Bernoulli sampling^[98,99]	Data items are selected with equal probability but the sample size is random.
Stratified sampling^[103]	Data items are divided into strata and a sample is drawn from each stratum.
Reservoir sampling^[104]	Data items are added to a reservoir of a fixed size.
Bootstrapping^[83,108]	Multiple samples are drawn with replacement and used for statistical estimation and diagnosis.

Framework	Description
BlinkDB^[19]	Approximate distributed query processing engine that uses stratified sampling
ApproxHadoop^[119]	Uses multi-stage sampling for approximate MapReduce job execution
EARL^[121]	Uses online uniform sampling from HDFS files with the bootstrap method
ApproxSpark^[122]	Uses multi-stage sampling or adaptive stratified reservoir sampling
ApproxSpark^[122]	Supports both record-level and block-level sampling
BlinkML^[123]	Uses online uniform sampling without replacement
BlinkML^[123]	If the dataset cannot fit into the memory, either uses Bernoulli sampling or offline samples
IncApprox^[119]	A stream data analytics system which depends on both approximate and incremental computing
Sapprox^[124]	Uses it to facilitate online sampling
RSP approach^[76]	Approach for approximate big data analysis
RSP approach^[76]	Depends on a step-wise process to analyze data in batches of RSP blocks
ApproxIoT^[127]	Sampling-based approximation in IoT with an online hierarchical stratified reservoir sampling algorithm