Lab. of Information Security, School of Computer Science, Huazhong University of Science and Technology, Wuhan 430074, China.
Abstract
Malicious applications can be used to attack users and services in order to gain financial rewards, steal individuals’ sensitive information or company and government intellectual property, and take remote control of systems. However, traditional methods of malicious code detection, such as signature detection, behavior detection, virtual machine detection, and heuristic detection, have various weaknesses that make them unreliable. This paper reviews existing malicious code detection technologies and proposes a malicious code detection model based on behavior association. The behavior points of malicious code are first extracted through API monitoring and integrated into behaviors; a relation between behaviors is then established according to data dependence. Next, a behavior association model is built and a discrimination method based on pushdown automata is put forward. Finally, actual malicious code is taken as a sample in an experiment on behavior capture, association, and discrimination, which shows that the theoretical model is viable.
D. Spinellis, Reliable identification of bounded-length viruses is NP-complete, IEEE Transactions on Information Theory, vol. 49, no. 1, pp. 280-284, 2003.
M. Feng and R. Gupta, Detecting virus mutations via dynamic matching, in IEEE International Conference on Software Maintenance, Edmonton, Alberta, Canada, 2009, pp. 105-114.
J. R. Harrald, S. A. Schmitt, and S. Shrestha, The effect of computer virus occurrence and virus threat level on antivirus companies’ financial performance, in Engineering Management Conf., 2004. Proceedings. IEEE International, 2004, vol. 2, pp. 780-784.
Z.-P. Kang, H. Xiang, and L. Fu, Attack and defence on API hook technology of trojan horse, Information Security and Communications Privacy, vol. 2, pp. 145-148, 2007.
L. Wang, Y. Li, and Z. Li, A novel technique of recognising multi-stage attack behavior, Int. J. of High Performance Computing and Networking, vol. 6, no. 3/4, pp. 174-180, 2010.
L. C. Briand, J. Feng, and Y. Labiche, Experimenting with genetic algorithms and coupling measures to devise optimal integration test orders, in Software Engineering with Computational Intelligence, T. M. Khoshgoftaar, Ed. Kluwer Academic Publishers, 2003, pp. 204-234.
M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, Data mining methods for detection of new malicious executables, in IEEE Symposium on Security and Privacy, Oakland, CA, USA, 2001, pp. 38-49.
[11] F. Porikli and O. Tuzel, Multi-kernel object tracking, in IEEE International Conference on Multimedia and Expo, Amsterdam, Holland, 2005, pp. 1234-1237.
Y. Sakakibara, Grammatical inference in bioinformatics, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 7, pp. 1051-1062, 2005.
D. E. Muller and P. E. Schupp, Groups, the theory of ends and context-free languages, Journal of Computer and System Sciences, vol. 26, no. 3, pp. 295-310, 1983.
M. Plicka, J. Janousek, and B. Melichar, Subtree oracle pushdown automata for ranked and unranked ordered trees, in Federated Conference on Computer Science and Information Systems (FedCSIS), Szczecin, Poland, 2011, pp. 903-906.
Han L, Qian M, Xu X, et al. Malicious Code Detection Model Based on Behavior Association. Tsinghua Science and Technology, 2014, 19(5): 508-515. https://doi.org/10.1109/TST.2014.6919827
Fig. 1 Behavior association digraph.
Fig. 2 Record of behavior buffer.
2.3.2 The establishment of relationship between behaviors
After behavior extraction, we then establish the relationship between behaviors.
The relationship between behaviors includes data dependence and non-data dependence; in this paper we focus on the former. Data dependence refers to ordering relations in data processing, such as transitive relations. For example, when document B reads the content of document A, this is a typical transitive data relation. Because this relationship is relatively tight, behavior association can be established from data dependence. Non-data dependence mainly refers to dependence on logical structure: for example, behavior A and behavior B access or modify different fields in the same database file. Compared with data dependence, non-data dependence is a looser relationship, so behavior association cannot be established from it.
From the above discussion, it is clear that the relationship between behaviors can be obtained by tracking how programs access data. For example, we can record the data address (internal memory or buffer) used by a behavior and any changes to it, and observe whether that address is then used by other behaviors. In this way, we can determine the dependence between behaviors from the transmission of data and establish a behavior association digraph based on Section 2.2.
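The address-tracking idea above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the `Behavior` record, the `build_association_digraph` function, the behavior names, and the buffer addresses are all hypothetical, and the sketch assumes each captured behavior already carries the set of buffer addresses it reads and writes. An edge A -> B is added when B reads an address whose most recent writer was A.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    """Hypothetical record of one captured behavior."""
    name: str
    reads: frozenset = frozenset()   # buffer addresses the behavior reads
    writes: frozenset = frozenset()  # buffer addresses the behavior writes


def build_association_digraph(trace):
    """Return {A: [B, ...]} where B read an address last written by A."""
    last_writer = {}            # address -> name of the most recent writer
    edges = defaultdict(set)
    for behavior in trace:
        for addr in behavior.reads:
            writer = last_writer.get(addr)
            if writer is not None and writer != behavior.name:
                edges[writer].add(behavior.name)
        for addr in behavior.writes:
            last_writer[addr] = behavior.name
    return {src: sorted(dst) for src, dst in edges.items()}


# Illustrative trace; names and addresses are made up for this example.
trace = [
    Behavior("ReadFile", writes=frozenset({0x1000})),
    Behavior("Encrypt", reads=frozenset({0x1000}), writes=frozenset({0x2000})),
    Behavior("Send", reads=frozenset({0x2000})),
    Behavior("Shutdown"),  # command without parameters: no data edge
]
graph = build_association_digraph(trace)
```

In this sketch a behavior with no address parameters, such as the shutdown command, never gains a data-dependence edge, which mirrors the distinction between data and non-data dependence drawn above.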
The record of the behavior buffer is shown in Fig. 2, and the detailed algorithm is as follows:
The relationship between behaviors is mostly transmitted by parameters, while a minority is transmitted by commands without parameters, such as the shutdown command. In most cases, behavior B uses a parameter related to behavior A and thereby manipulates the data that behavior A manipulated, which establishes the data relationship between the two behaviors. In other words, analyzing parameters helps us understand the relationship between behaviors. Take the address parameter as an example: a buffer is a typical address parameter. When we analyze a buffer address, we can regard the address and the address space to which it points as the same parameter. To analyze address parameters better, we can classify addresses with important attributes, such as the import table, the service description table, and the entry point. A decision tree can be established based on address space, as shown in Fig. 3, so that different address spaces can be analyzed with the help of the decision tree [12].
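As a rough illustration of classifying address parameters by address space, the following Python sketch applies a two-level decision: first user space versus kernel space, then membership in a named region. Every constant here is an assumption made for the example (the `KERNEL_BASE` split, the region boundaries, and the label names); the paper's actual decision tree is the one in its Fig. 3 and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical named address range; `end` is exclusive."""
    label: str
    start: int
    end: int


# Assumed layout for illustration only; real values depend on the
# operating system and on the loaded image being analyzed.
KERNEL_BASE = 0x80000000
REGIONS = [
    Region("entry_point", 0x00401000, 0x00401010),
    Region("import_table", 0x00402000, 0x00402400),
    Region("service_table", 0x80500000, 0x80501000),
]


def classify_address(addr):
    """Return (space, label) for an address parameter."""
    # First decision: which half of the address space the address is in.
    space = "kernel" if addr >= KERNEL_BASE else "user"
    # Second decision: does it fall inside a region with special meaning?
    for region in REGIONS:
        if region.start <= addr < region.end:
            return space, region.label
    return space, "ordinary_buffer"
```

A call such as `classify_address(0x00402010)` would label the parameter as an import-table address under these assumed ranges, which lets later analysis treat it differently from an ordinary data buffer.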
Fig. 3 Address analysis.
Fig. 4 The model of the pushdown automaton.
3.3 The process of discrimination
We can configure a series of pushdown automata based on numerous samples of malicious code. When behaviors need to be checked, we feed them as input to the pushdown automata. If a pushdown automaton reaches its final state, the behavior is judged malicious. The automaton model is shown in Fig. 4.
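The discrimination step can be sketched as a small deterministic pushdown automaton simulator in Python. This is a simplified illustration under stated assumptions, not the automaton of Fig. 4: the states `q0`, `q1`, `qf`, the stack symbols, the behavior names, and the example pattern (each `read` must later be matched by a `send`, followed by a `cleanup`) are all hypothetical. Acceptance is by final state after the whole behavior sequence has been consumed.

```python
def run_pda(transitions, start, finals, behaviors, bottom="Z"):
    """Simulate a deterministic PDA; accept by final state.

    transitions maps (state, behavior, stack_top) to
    (next_state, symbols_to_push), where the first pushed symbol
    ends up on top of the stack.
    """
    state, stack = start, [bottom]
    for behavior in behaviors:
        if not stack:
            return False
        top = stack.pop()
        key = (state, behavior, top)
        if key not in transitions:
            return False  # no rule applies: sequence is rejected
        state, push = transitions[key]
        stack.extend(reversed(push))
    return state in finals


# Hypothetical pattern: n reads, then n sends, then one cleanup (n >= 1).
PATTERN = {
    ("q0", "read", "Z"): ("q0", ["X", "Z"]),
    ("q0", "read", "X"): ("q0", ["X", "X"]),
    ("q0", "send", "X"): ("q1", []),
    ("q1", "send", "X"): ("q1", []),
    ("q1", "cleanup", "Z"): ("qf", ["Z"]),
}


def is_malicious(behaviors, automata):
    """Flag the sequence if any configured automaton accepts it."""
    return any(run_pda(t, s, f, behaviors) for t, s, f in automata)


AUTOMATA = [(PATTERN, "q0", {"qf"})]
```

With these assumed rules, `is_malicious(["read", "read", "send", "send", "cleanup"], AUTOMATA)` accepts, while a sequence with unmatched reads and sends is rejected. The stack is what lets the automaton check such counting relations between associated behaviors, which a finite automaton could not do.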