A Scheduling Optimization Technique Based on Reuse in Spark to Defend Against APT Attack

Jianchao Tang; Ming Xu; Shaojing Fu; Kai Huang

doi:10.26599/TST.2018.9010022

Open Access

College of Computer, National University of Defense Technology, Changsha 410073, China.

State Key Laboratory of Cryptology, Beijing 100878, China.

Abstract

The Advanced Persistent Threat (APT) attack, an attack option that has emerged in recent years, seriously threatens the security of government and enterprise data because of its advanced and persistent characteristics. To address this issue, a security policy based on big data analysis has been proposed, which analyzes the log data of servers and terminals in Spark. In practical applications, however, Spark cannot adequately analyze very large volumes of log data. To address this problem, we propose a scheduling optimization technique based on the reuse of datasets to improve Spark performance. In this technique, we define and formulate the reuse degree of Directed Acyclic Graphs (DAGs) in Spark based on Resilient Distributed Datasets (RDDs). We then define a global optimization function to obtain the optimal DAG sequence, that is, the sequence with the least execution time. To implement the global optimization function, we further propose a novel cost optimization algorithm based on the traditional Genetic Algorithm (GA). Our experiments demonstrate that this scheduling optimization technique can greatly decrease the time overhead of analyzing log data in Spark to detect APT attacks.
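
The idea of ordering DAGs so that consecutive jobs reuse each other's datasets can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the four DAGs, the RDD names, the cost table, the rule that only the RDDs of the immediately preceding DAG remain cached, and the particular GA operators (order crossover, swap mutation, elitist selection). The paper's own reuse degree formula and optimization function are not reproduced here.

```python
import random

# Hypothetical workload: each DAG is modeled as the set of RDD names it needs.
DAGS = {
    "A": {"logs", "parsed", "by_host"},
    "B": {"logs", "parsed", "by_user"},
    "C": {"dns", "dns_parsed"},
    "D": {"dns", "dns_parsed", "by_host"},
}
# Hypothetical cost (arbitrary time units) of computing each RDD from scratch.
RDD_COST = {"logs": 5, "parsed": 3, "by_host": 2,
            "by_user": 2, "dns": 4, "dns_parsed": 3}


def sequence_cost(order):
    """Total cost of running the DAGs in `order`, assuming (for illustration)
    that only the RDDs of the immediately preceding DAG are still cached."""
    total, cached = 0, set()
    for name in order:
        rdds = DAGS[name]
        total += sum(RDD_COST[r] for r in rdds - cached)
        cached = rdds
    return total


def order_crossover(p1, p2, rng):
    """Keep a slice of p1 in place and fill the other slots in p2's order."""
    i, j = sorted(rng.sample(range(len(p1) + 1), 2))
    middle = p1[i:j]
    rest = [g for g in p2 if g not in middle]
    return rest[:i] + middle + rest[i:]


def swap_mutation(order, rng, rate=0.2):
    """With probability `rate`, swap two positions of the sequence."""
    order = list(order)
    if rng.random() < rate:
        a, b = rng.sample(range(len(order)), 2)
        order[a], order[b] = order[b], order[a]
    return order


def genetic_search(generations=50, pop_size=20, seed=0):
    """Search for a low-cost DAG sequence with a simple permutation GA."""
    rng = random.Random(seed)
    names = sorted(DAGS)
    population = [rng.sample(names, len(names)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=sequence_cost)
        parents = population[: pop_size // 2]  # elitism: keep the best half
        children = [
            swap_mutation(order_crossover(*rng.sample(parents, 2), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return min(population, key=sequence_cost)


if __name__ == "__main__":
    best = genetic_search()
    print(best, sequence_cost(best))
```

Under this toy cost model, a sequence that places DAGs with shared RDDs next to each other (for example B, A, D, C) is cheaper than one that alternates between unrelated DAGs (for example A, C, B, D), which is the effect a reuse-aware scheduler tries to exploit.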

Keywords

Spark; Advanced Persistent Threat (APT); schedule; reuse; Resilient Distributed Dataset (RDD); Directed Acyclic Graph (DAG); Genetic Algorithm (GA)

Tsinghua Science and Technology

Volume 23, Issue 5, October 2018

Pages 550-560

DOI: 10.26599/TST.2018.9010022

Cite this article:

Tang J, Xu M, Fu S, et al. A Scheduling Optimization Technique Based on Reuse in Spark to Defend Against APT Attack. Tsinghua Science and Technology, 2018, 23(5): 550-560. https://doi.org/10.26599/TST.2018.9010022