Method And Device For Feature Extraction YANG; Kang ; et al. [Beijing Qihoo Technology Company Limited]

Patent Application Summary

U.S. patent application number 15/109343 was filed with the patent office on 2017-07-27 for method and device for feature extraction. The applicant listed for this patent is Beijing Qihoo Technology Company Limited, Qizhi Software (Beijing) Company Limited. Invention is credited to Zhuo CHEN, Hai TANG, Kang YANG.

Application Number	20170214704 15/109343
Document ID	/
Family ID	50528712
Filed Date	2017-07-27

United States Patent Application	20170214704
Kind Code	A1
YANG; Kang ; et al.	July 27, 2017

METHOD AND DEVICE FOR FEATURE EXTRACTION

Abstract

The present invention discloses a method and device for feature extraction, wherein the method comprises acquiring a batch of black sample files and white sample files from an application layer of a smart terminal operating system; parsing each file to obtain information structure of all functions contained in each file, and computing a check code of each function; determining whether each file contains functions corresponding to respective check codes so as to count times that each function appears in the black sample files and white sample files; extracting black sample features based on functions only appearing in the black sample files while not appearing in the white sample files, or extracting white sample features based on functions only appearing in the white sample files while not appearing in the black sample files. By analyzing and computing the acquired black sample files and white sample files and counting the times that a check code of each function appears in the files, the embodiments of the present invention only use the functions appearing in the black sample files while not appearing in the white sample files as the basis for feature extraction.

Inventors:

YANG; Kang; (Beijing, CN) ; CHEN; Zhuo; (Beijing, CN) ; TANG; Hai; (Beijing, CN)

Applicant:

Name	City	State	Country	Type
Beijing Qihoo Technology Company Limited Qizhi Software (Beijing) Company Limited	Beijing Beijing		CN CN

Family ID:

50528712

Appl. No.:

15/109343

Filed:

August 7, 2014

PCT Filed:

August 7, 2014

PCT NO:

PCT/CN2014/083910

371 Date:

June 30, 2016

Current U.S. Class:	1/1
Current CPC Class:	G06F 21/563 20130101; H04L 63/1425 20130101; G06F 21/566 20130101; Y04S 40/20 20130101; G06F 21/562 20130101; G06F 21/56 20130101
International Class:	H04L 29/06 20060101 H04L029/06; G06F 21/56 20060101 G06F021/56

Foreign Application Data

Date	Code	Application Number
Dec 30, 2013	CN	201310746033.6

Claims

1. A method for feature extraction, comprising: acquiring a batch of black sample files and white sample files from an application layer of a smart terminal operating system; parsing each file to obtain information structure of all functions contained in each file, and computing a check code of each function; determining whether each file contains functions corresponding to respective check codes so as to count times that each function appears in the black sample files and white sample files; extracting black sample features based on functions only appearing in the black sample files while not appearing in the white sample files, or extracting white sample features based on functions only appearing in the white sample files while not appearing in the black sample files.

2. The method according to claim 1, wherein after counting the black samples or white samples, the method further comprises: optimizing features, specifically: establishing a vector for each feature with respect to all files; initializing a set to be compared sequentially with the vector of each feature; if the set contains the compared vector, reserving the set; if the set does not contain the compared vector, getting a union of the set and the compared vector; sequentially comparing the vectors of all features, and taking the features contained in the finally obtained set as the last reserved features.

3. The method according to claim 1, wherein after counting the black samples or white samples, the method further comprises: optimizing features, specifically: for different file sets with different features, if one file set contains all files in another file set, reserving features corresponding to a file set with a larger scope, while abandoning features corresponding to a file set with a smaller scope.

4. The method according to claim 3, wherein the features contain a first feature and a second feature, files containing the first feature form a first file set, and files containing the second feature form a second file set; if the first file set contains all files in the second file set, the first feature is reserved, while the second feature is abandoned.

5. The method according to claim 1, wherein before the counting times that each function appears in the black sample files and the white sample files, the method further comprises: performing intra-file de-duplication to the check code of the function.

6. The method according to claim 5, wherein the performing intra-file de-duplication to the check code of the function comprises: for each file, if a plurality of functions have a same check code, extracting a function from the plurality of functions as a function corresponding to the check code.

7. The method according to claim 1, wherein the black sample files and the white sample files are all virtual machine executable files; the parsing each file to obtain information structure of all functions contained in the each file comprises: decompiling a virtual machine executable file to obtain a decompiled information structure of all functions contained in the virtual machine executable file.

8. The method according to claim 7, wherein the computing a check code of each function comprises: computing a hash value of the information structure of the function by a hash algorithm, and using the hash value as the check code corresponding to the function.

9-11. (canceled)

12. The method according to claim 1, wherein: the extracting black sample features based on functions only appearing in black sample files while not appearing in white sample files comprises: using a function that only appears in the black sample files while not appearing in the white sample files as the black sample feature, or using a part of code of the function that only appears in the black sample files while not appearing in the white sample files as the black sample feature; the extracting white sample features based on functions only appearing in white sample files while not appearing in black sample files comprises: using a function that only appears in the white sample files while not appearing in the black sample files as the white sample feature, or using a part of code of the function that only appears in the white sample files while not appearing in the black sample files as the white sample feature.

13. The method according to claim 1, further comprising: adding black sample features into a black sample feature library, and matching a target file using the black sample feature library; if the target file contains a function or a subset of functions corresponding to a black sample feature, determining that malicious code exists in the target file.

14. (canceled)

15. The method according to claim 1, wherein, the black sample file refers to a file preliminarily determined as containing a black sample, while the white sample file refers to a file preliminarily determined as not containing a black sample.

16. The method according to claim 15, wherein the acquiring a batch of black sample files and white sample files comprises: finding an installation package of an application from an application layer of a smart terminal operating system; parsing the installation package to obtain a virtual machine executable file of the application; using the virtual machine executable file as a black sample file or a white sample file.

17. (canceled)

18. A device for feature extraction, comprising a memory having instructions stored therein and at least one processor to execute the instructions to cause: acquiring a batch of black sample files and white sample files from an application layer of a smart terminal operating system; parsing each file to obtain information structure of all functions contained in each file; computing a check code of each function; determining whether each file contains functions corresponding to respective check codes so as to count times that each function appears in the black sample files and white sample files; and extracting black sample features based on functions only appearing in the black sample files while not appearing in the white sample files, or extracting white sample features based on functions only appearing in the white sample files while not appearing in the black sample files.

19. The device according to claim 18, wherein the processor further executes the instructions to cause optimizing features, comprising: establishing a vector for each feature with respect to all files; initializing a set to be compared sequentially with the vector of each feature; if the set contains the compared vector, reserving the set; if the set does not contain the compared vector, getting a union of the set and the compared vector; sequentially comparing the vectors of all features, and taking the features contained in the finally obtained set as the last reserved features.

20. The device according to claim 18, the processor further executes the instructions to cause: for different file sets with different features, if one file set contains all files in another file set, reserving features corresponding to a file set with a larger scope, while abandoning features corresponding to a file set with a smaller scope.

21-22. (canceled)

23. The device according to claim 22, wherein the processor further executes the instructions to cause: performing intra-file de-duplication to the check code of the function, wherein the performing intra-file de-duplication to the check code of the function comprises: for each file, if a plurality of functions have a same check code, extracting a function from the plurality of functions as a function corresponding to the check code.

24. The device according to claim 18, wherein: the black sample files and the white sample files are all virtual machine executable files; and the parsing each file to obtain information structure of all functions contained in each file specifically comprises: decompiling the virtual machine executable file to obtain a decompiled information structure of all functions contained in the virtual machine executable file.

25-28. (canceled)

29. The device according to claim 18, wherein: the extracting black sample features based on functions only appearing in black sample files while not appearing in white sample files comprises: using a function that only appears in the black sample files while not appearing in the white sample files as the black sample feature, or using a part of code of the function that only appears in the black sample files while not appearing in the white sample files as the black sample feature; the extracting white sample features based on functions only appearing in white sample files while not appearing in black sample files comprises: using a function that only appears in the white sample files while not appearing in the black sample files as the white sample feature, or using a part of code of the function that only appears in the white sample files while not appearing in the black sample files as the white sample feature.

30. The device according to claim 18, wherein the processor further executes the instructions to cause: adding a black sample feature into a black sample feature library, and matching a target file using the black sample feature library; if the target file contains a function or a subset of functions corresponding to the black sample feature, determining that malicious code exists in the target file.

31-35. (canceled)

36. A computer-readable medium, having instructions stored therein that, when executed by at least one processor, cause the processor to perform feature extraction comprising: acquiring a batch of black sample files and white sample files from an application layer of a smart terminal operating system; parsing each file to obtain information structure of all functions contained in each file, and computing a check code of each function; determining whether each file contains functions corresponding to respective check codes so as to count times that each function appears in the black sample files and white sample files; and extracting black sample features based on functions only appearing in the black sample files while not appearing in the white sample files, or extracting white sample features based on functions only appearing in the white sample files while not appearing in the black sample files.

Description

FIELD OF THE INVENTION

[0001] The present invention relates to the technical field of network security, and more specifically relates to a method and device for feature extraction.

BACKGROUND OF THE INVENTION

[0002] With the development of sciences and technologies, smart terminals are provided with more and more functions. For example, mobile phones have turned from traditional GSM and TDMA digital mobile phones into smart phones that have capabilities of processing multimedia resources and providing various kinds of information services such as network browsing, telephone conference, electronic commerce, etc. However, that also brings increasing varieties of malicious code attacks to the mobile phones and increasingly serious personal data security issues. Smart mobile phone users suffer deeply from more and more mobile phone viruses.

[0003] Mobile phone malicious code protection technologies protect against malicious code. A variety of such approaches have been provided, for example, the feature value scanning approach, virtual machine technology-based malicious code protection, heuristic scanning, and similar samples clustering. Regardless of the protection manner, a reasonably organized malicious code feature library is the basis, alongside an efficient scanning algorithm (also called a matching algorithm). Therefore, how to extract features accurately and efficiently is crucial to building a feature library, and even to the entire protection technology.

SUMMARY OF THE INVENTION

[0004] In view of the problems above, a method and device for feature extraction according to the present invention is provided so as to overcome the above problems or at least partially solve the above problems.

[0005] According to one aspect of the present invention, there is provided a method for feature extraction, comprising acquiring a batch of black sample files and white sample files from an application layer of a smart terminal operating system; parsing each file to obtain information structure of all functions contained in each file, and computing a check code of each function; determining whether each file contains functions corresponding to respective check codes so as to count times that each function appears in the black sample files and white sample files; extracting black sample features based on functions only appearing in the black sample files while not appearing in the white sample files, or extracting white sample features based on functions only appearing in the white sample files while not appearing in the black sample files.

[0006] According to another aspect of the present invention, there is provided a device for feature extraction, comprising a file acquiring unit configured to acquire a batch of black sample files and white sample files from an application layer of a smart terminal operating system; a parsing unit configured to parse each file to obtain information structure of all functions contained in each file, and a check code computing unit configured to compute a check code of each function; a counting unit configured to determine whether each file contains functions corresponding to respective check codes so as to count times that each function appears in the black sample files and white sample files; an extracting unit configured to extract black sample features based on functions only appearing in the black sample files while not appearing in the white sample files, or extract white sample features based on functions only appearing in the white sample files while not appearing in the black sample files.

[0007] Thus by analyzing and computing the acquired black sample files and white sample files and counting times that a check code of each function appears in the files, the embodiments of the present invention only use functions appearing in the black sample files while not appearing in the white sample files as the basis for feature extraction. In this way, the fast and accurate feature extraction may guarantee building of an efficient feature library and guarantee implementation of the defending technologies. Preferably, the features may be optimized so as to detect most files with least features after acquiring a large amount of extractable black sample features.

[0008] The above is only a summary of the technical solutions of the present invention. So that the technical means of the present invention may be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other objectives, features, and advantages of the present invention become more apparent and comprehensible, preferred embodiments of the present invention are specifically provided below.

BRIEF DESCRIPTION OF THE SEVERAL DRAWINGS

[0009] Through reading the detailed description of the preferred embodiments below, various other advantages and benefits will become clear to those skilled in the art. The drawings are only used for the purpose of illustrating preferred embodiments, and should not be regarded as limiting the present invention. Moreover, throughout the drawings, the same reference numerals are used to indicate the same components. In the accompanying drawings,

[0010] FIG. 1 illustrates a flow diagram of a method for feature extraction according to one embodiment of the present invention;

[0011] FIG. 2 illustrates a flow diagram of optimizing features in a method for feature extraction according to one embodiment of the present invention;

[0012] FIG. 3 illustrates a schematic diagram of a device for feature extraction according to one embodiment of the present invention;

[0013] FIG. 4 illustrates a block diagram of a smart electronic device for executing the method according to the present invention;

[0014] FIG. 5 illustrates a schematic diagram of a storage unit for maintaining or carrying program codes that implement the method according to the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0015] Hereinafter, exemplary embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings. Although the drawings reveal the exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments illustrated here. On the contrary, these embodiments are provided for a more thorough understanding of the present disclosure and for a complete delivery of the scope of the present disclosure to those skilled in the art.

[0016] The Android operating system, as an example, contains an application layer (app layer) and a system framework layer (framework layer); other layers that might be included under a functional partitioning will not be discussed here. The app layer may generally be understood as an upper layer in charge of interfaces for interaction with a user, e.g., application maintenance, or identifying the kind of content clicked on a page so as to display the corresponding context menu. The framework layer generally serves as an intermediate layer, mainly for forwarding a user request (e.g., starting an application, clicking on a link, or clicking to save a picture) to a lower layer, and for distributing content fully processed by the lower layer to the upper layer, either via a message or via an intermediate proxy class, so as to present it to the user.

[0017] The inventors of the present invention have found through research that, by counting the times that a check code of a function contained in a sample file appears in files, it may be determined whether the function is a black sample or a white sample.

[0018] Refer to FIG. 1, in which a flow diagram of a method for feature extraction according to one embodiment of the present invention is presented.

[0019] The method for feature extraction comprises steps of

[0020] S101: acquiring a batch of black sample files and white sample files from an application layer of a smart terminal operating system;

[0021] The black sample files refer to files preliminarily determined as containing a black sample, e.g., a file containing malicious code, while the white sample files refer to files preliminarily determined as not containing a black sample, e.g., a file not containing malicious code. Those skilled in the art should understand that a feature library needs to be built for matching, detecting, and removing malicious code, and that building the feature library is based on extracting features from sample files. In the embodiments of the present invention, whether each file in a batch is a black sample file or a white sample file is preliminarily determined manually in advance. A larger number of black sample files and white sample files is beneficial for accurate extraction of sample features.

[0022] In the embodiments of the present invention, the black sample files or white sample files may be, for example, dex files. Dex files are virtual machine executable files directly loaded and run in the Dalvik virtual machine (Dalvik VM) of the Android system. Dalvik is a Java virtual machine for the Android platform. An optimized Dalvik allows instances of multiple virtual machines to run concurrently in limited internal memory, and each Dalvik application is executed as an independent Linux process. The independent process prevents all programs from being closed when a virtual machine breaks down. The Dalvik virtual machine supports running a Java application that has been converted into the dex (Dalvik Executable) format. The dex format is a compressed format specifically designed for Dalvik and is suitable for a system with limited memory and processor speed. Java source code may be converted into a dex file by ADT (Android Development Tools) through a complex compilation. The dex file is an optimized result for an embedded system. The Dalvik virtual machine does not employ standard Java virtual machine instruction codes, but uses its own instruction set. Because the dex file shares many class names and constant strings, its volume is small and its operating efficiency is relatively high.

[0023] Specifically, obtaining a batch of black sample dex files and white sample dex files from a smart terminal may comprise finding an installation package of an application from an application layer of a smart terminal operating system; parsing the installation package to obtain a dex file of the application; and using the dex file as a black sample file or a white sample file. For example, the dex file can be obtained by parsing an APK (Android Package). The APK file is actually a compressed package in zip format whose file extension has been changed to apk; a dex file may be obtained after decompression via UnZip.
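
By way of illustration only, the decompression step might be sketched in Python as follows. The function names `extract_dex_files` and `make_demo_apk` are hypothetical and not part of the disclosure, and the demo archive is a stand-in built in memory rather than a real installation package.

```python
import io
import zipfile
from typing import Dict


def extract_dex_files(apk_bytes: bytes) -> Dict[str, bytes]:
    """Return {entry name: raw bytes} for every .dex entry of an APK (a zip archive)."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return {
            name: apk.read(name)
            for name in apk.namelist()
            if name.endswith(".dex")
        }


def make_demo_apk() -> bytes:
    """Build a tiny in-memory zip that stands in for a real APK (illustrative only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as apk:
        apk.writestr("AndroidManifest.xml", b"<manifest/>")
        apk.writestr("classes.dex", b"dex\n035\x00placeholder")
    return buffer.getvalue()


dex_files = extract_dex_files(make_demo_apk())
print(sorted(dex_files))  # prints ['classes.dex']
```

Each extracted dex file would then be labeled as a black sample file or a white sample file according to the preliminary determination described above.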

[0024] As previously mentioned, the Android operating system comprises an application layer (app layer) and a system framework layer (framework layer). The present invention focuses on study and improvement of the app layer. However, those skilled in the art understand that when Android is started, the Dalvik VM monitors all programs (APK files) and frameworks and creates a dependency relationship tree for them. Through this dependency relationship tree, the Dalvik VM optimizes the code of each program and stores the optimized code in a Dalvik cache (dalvik-cache). In this way, all programs use optimized code when running. When a program (or framework) changes, the Dalvik VM re-optimizes the code and stores it in the cache again. The cache/dalvik-cache is for depositing dex files generated by programs on the system, while data/dalvik-cache is for depositing dex files generated by data/app. In other words, the present invention focuses on analyzing and processing dex files generated by data/app. However, it should be understood that the theory and operation of the present invention are likewise applicable to dex files generated by programs on the system.

[0025] S102: parsing each file to obtain information structure of all functions contained in each file, and computing a check code of each function;

[0026] Still taking a dex file as an example, parsing a file to obtain information structure of all functions contained in the file comprises decompiling the dex file to obtain decompiled information structure of all functions contained in the dex file.

[0027] The dex file may be decompiled in a plurality of manners.

[0028] Manner 1: parsing the dex file according to the dex file format to obtain a function information structure of each class; determining a location and size of each function in the dex file according to fields in the function information structure, to obtain a decompiled function information structure. By parsing the function information structure, a bytecode array field indicating the function position in the dex file and a list length field indicating the function size are obtained, thereby determining the position and size of the function in the dex file.

[0029] For example, the dex file is parsed according to the dex file format to obtain the function information structure of each class. The function information structure contains the fields in Table 1.

TABLE 1
Field	Type	Description
registers_size	ushort	Number registers used in the segment of code
ins_size	ushort	Words of input parameters of the method in the segment of code
outs_size	ushort	Space that needs to be provided for invoking the segment of code to an output function of the function
tries_size	ushort	Number of try_item of the object; if not 0, it will appear as a tries array after the present object insns
debug_info_off	uint	Offset amount from the beginning of the file to the debug info; without information, the value is 0; if not 0, it represents a position of a data segment; the data shall follow a debug_info_item prescribed format
insns_size	uint	Length of the Instructions list, with two bytes as a unit
insns	ushort[insns_size]	Bytecode array. The format of the bytecode will be detailed in the file "Bytecode for the Dalvik VM." Although it is defined as a ushort-type array, some internal structure employ a 4-byte alignment; if the file is just a file subjected to a byte exchange operation, the byte exchange can only be performed within the ushort type.
padding	ushort (optional) = 0	Two padding bytes are used to satisfy the tries 4-byte alignment manner. The element only exists when the tries_size is an odd number and not 0.
tries	try_item[tries_size] (optional)	This array is for identifying where abnormalities are possibly thrown out in the representations. The array elements shall be arranged in an ascending order of the addresses, and no repetitive addresses shall appear. This element only exists when the tries_size is not 0.
handlers	encoded_catch_handler_list (optional)	These bytes represent a series of abnormal types and an address list of their processing methods; each try_item has an offset of one byte width; and the element only exists when the tries_size is not 0.

[0030] Wherein, the insns_size and insns fields in each function information structure represent the function size and position, respectively. Then, the information structure of the function may be decompiled according to the fields insns_size and insns. The decompiled information structure is comprised of Dalvik VM bytes, which will be detailed later.

[0031] Manner 2: decompiling the dex file into a virtual machine byte code using a dex file decompilation tool.

[0032] As mentioned above, the Dalvik virtual machine runs Dalvik bytecode, which exists in the form of a dex executable file. The Dalvik virtual machine executes code by interpreting the dex file. Currently, some tools are provided to decompile a dex file into Dalvik assembly code; such dex file decompiling tools include baksmali, Dedexer 1.26, dexdump, dexinspecto 03-12-12r, IDA Pro, androguard, dex2jar, and 010 Editor, etc.

[0033] It is seen that all decompiled function information structures may be obtained by decompiling the dex file. The function information structure comprises function execution code, which, in the present embodiment, is formed by a virtual machine instruction sequence and a virtual machine mnemonic sequence. As in the example below, the function information structure is formed by an instruction sequence of the Dalvik VM and a mnemonic sequence of the Dalvik VM.

[0034] For example, a function information structure obtained by decompiling the dex file according to one embodiment of the present invention is specified below:

[Listing not reproduced: decompiled function information structure, showing the machine code field alongside the Dalvik VM mnemonics (images STR00001 to STR00004)]

[0035] It is seen that the dex file is decompiled into an instruction sequence of the Dalvik VM and a mnemonic sequence of the Dalvik VM. As indicated in the example above, in the function information structure obtained by decompilation, the first 2 digits of each line in the machine code field denote an instruction sequence (the left circled part in the example above), while the part corresponding to the instruction sequence is a mnemonic (right side of the example, partially circled, not completely selected). The mnemonic is mainly for facilitating user communication and code compilation.

[0036] After the information structure of each function is obtained, the check code of the function may be computed. Later, the check code may be used to uniquely represent its corresponding function. The check code of the function may be calculated using an existing or future algorithm. For example, a hash algorithm may be used to calculate the hash value of the function as the aforementioned check code. There are many kinds of hash algorithms, e.g., CRC (Cyclic Redundancy Check), MD5 (Message Digest Algorithm), or SHA (Secure Hash Algorithm), etc.
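
A minimal sketch of computing such a check code with the Python standard library is given below. The function name `check_code`, the default of SHA-256, and the choice to hash the raw bytes of the function are assumptions made for illustration; the disclosure does not fix a particular algorithm or input encoding.

```python
import hashlib
import zlib


def check_code(function_bytes: bytes, algorithm: str = "sha256") -> str:
    """Return a hexadecimal check code for the bytes of one function."""
    if algorithm == "crc32":
        # CRC-32 is not provided by hashlib; zlib supplies it.
        return format(zlib.crc32(function_bytes) & 0xFFFFFFFF, "08x")
    # hashlib.new accepts names such as "md5", "sha1", and "sha256".
    return hashlib.new(algorithm, function_bytes).hexdigest()


body = b"\x0e\x00"  # placeholder function bytes
print(check_code(body))
print(check_code(body, "crc32"))
```

A short code such as CRC-32 is cheap to compute but collides more readily than a cryptographic digest, which matters if the check code is relied on to identify a function uniquely.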

[0037] S103: determining whether each file contains functions corresponding to respective check codes so as to count times that each function appears in the black sample files and white sample files.

[0038] This step is to count times that a hash value appears in a batch of black sample files and white sample files obtained in step S101.

[0039] Suppose a hash value of each function is determined by analyzing and computing the black sample files and white sample files; then, times that each hash value appears in the black sample files and white sample files are counted.

[0040] Suppose there are n sample files (some of them black sample files and some white sample files), wherein the first file comprises function hash values A, B, C; the second file comprises function hash values A, C, D; the third file comprises function hash values B, C, E; . . . the nth file comprises function hash values C, D. In total, after all files are analyzed, suppose 5 function hash values A, B, C, D, E are determined. Then, the times that the 5 hash values appear in the black sample files and in the white sample files are counted. Suppose the counting results are as shown in Table 2 below.

TABLE 2

Function      Total times of       Times of appearance     Times of appearance
hash value    appearance in        in the black            in the white
              the files            sample files            sample files
A             10000                5000                    5000
B             10000                10000                   0
C             10000                0                       10000
D             10000                8000                    2000
E             7000                 7000                    0

[0041] Those skilled in the art understand that different functions have different hash values, i.e., different hash values represent different functions; therefore, A, B, C, D, E are also employed subsequently to represent 5 functions or 5 features. Based on the times that the above hash values appear in the files, the times that each function appears in the files may be determined.
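The counting in step S103 can be sketched as follows. This is a minimal illustration, assuming each file has already been reduced to a list of function check codes; the name `count_appearances` and the toy data are hypothetical and are not the figures of Table 2.

```python
from collections import Counter


def count_appearances(black_files, white_files):
    """Count, per check code, how many files of each kind contain it.

    Each file is given as an iterable of function check codes.
    """
    black, white = Counter(), Counter()
    for codes in black_files:
        black.update(set(codes))  # set(): a file counts once per check code
    for codes in white_files:
        white.update(set(codes))
    return black, white


# Toy sample set (hypothetical data).
black_files = [["A", "B"], ["A", "B", "D"], ["B", "E"]]
white_files = [["A", "C"], ["C", "D"]]
black, white = count_appearances(black_files, white_files)
for code in sorted(set(black) | set(white)):
    # Columns: check code, total, black sample files, white sample files.
    print(code, black[code] + white[code], black[code], white[code])
```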

[0042] Preferably, before counting the times that each function appears in the black sample files and the white sample files, the method further comprises de-duplicating the check codes of the functions within each file. Specifically, de-duplicating the check codes within a file means that, for each file, if a plurality of functions have the same check code, one function is extracted from the plurality of functions as the function corresponding to that check code. For example, suppose that the information structures of all functions contained in a dex file are obtained by parsing it. Suppose that three information structures s1, s2, and s3 are parsed out, and that 3 hash values hash 1, hash 2, and hash 3 of the three information structures s1, s2, and s3 are then obtained through a hash algorithm. Those skilled in the art should understand that different functions have different hash values, i.e., different hash values represent different functions. Suppose that some of the three hash values are identical, e.g., hash 1=hash 2; then they are deemed to represent the same function. In this case, either one of s1 and s2 is selected, while the other one is discarded.
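The intra-file de-duplication can be sketched as below. This is an illustrative sketch only: the name `deduplicate` and the rule of keeping the first structure seen are assumptions, since the text allows either of the duplicates to be kept.

```python
def deduplicate(functions):
    """Keep one function information structure per check code.

    `functions` is a list of (check_code, info_structure) pairs for one file.
    The first structure seen for a check code is kept; later ones are dropped.
    """
    kept = {}
    for code, info in functions:
        kept.setdefault(code, info)
    return kept


# s1 and s2 share a check code, so only one of them survives.
parsed = [("hash1", "s1"), ("hash1", "s2"), ("hash3", "s3")]
print(deduplicate(parsed))
```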

[0043] S104: extracting black sample features based on functions only appearing in the black sample files while not appearing in the white sample files, or extracting white sample features based on functions only appearing in the white sample files while not appearing in the black sample files.

[0044] When extracting the features, only functions appearing in the black sample files while not appearing in the white sample files are selected as black sample features. For example, still taking Table 2 as an example, functions B and E are selected for black sample feature extraction. Specifically, functions B and E are taken as black sample features, or part of the code of functions B and E is taken as black sample features. Likewise, functions only appearing in the white sample files while not appearing in the black sample files are selected as white sample features. For example, still taking Table 2 for illustration, function C is selected for white sample feature extraction. Specifically, function C may be used as a white sample feature, or part of the code of function C may be used as a white sample feature.
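The selection rule of step S104 can be checked against the counts of Table 2 with a short sketch. The name `select_features` and the dictionary layout are hypothetical choices made for illustration.

```python
# (black count, white count) per function, taken from Table 2.
counts = {
    "A": (5000, 5000),
    "B": (10000, 0),
    "C": (0, 10000),
    "D": (8000, 2000),
    "E": (7000, 0),
}


def select_features(counts):
    """Split functions into black-only and white-only features."""
    black_only = sorted(f for f, (b, w) in counts.items() if b > 0 and w == 0)
    white_only = sorted(f for f, (b, w) in counts.items() if w > 0 and b == 0)
    return black_only, white_only


black_features, white_features = select_features(counts)
print(black_features)  # ['B', 'E']
print(white_features)  # ['C']
```

Functions A and D appear in both kinds of files and are therefore selected as neither kind of feature.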

[0045] After the black sample feature is extracted in step S104, the following steps may further be executed: adding the black sample feature to a black sample feature library; matching a target file using the black sample feature library; and, if the target file comprises a function or a subset of functions corresponding to the black sample feature, determining that malicious code exists in the target file. As understood by those skilled in the art, sample feature detecting and removing, virtual machine-based detecting and removing, heuristic detecting and removing, or similar samples clustering may be performed on the target file using the functions corresponding to the black sample features in the black sample feature library.
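A minimal sketch of matching a target file against the black sample feature library follows. It assumes that features are stored as whole-function check codes and that matching is exact set membership; the name `match_black_features` is hypothetical, and the specification does not limit the matching algorithm to this form.

```python
def match_black_features(target_codes, black_library):
    """Return the black sample features found among the target's check codes."""
    return set(target_codes) & set(black_library)


black_library = {"B", "E"}
hits = match_black_features(["A", "C", "E"], black_library)
if hits:
    print("malicious code detected, matched features:", sorted(hits))
else:
    print("no black sample feature matched")
```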

[0046] Hereinafter, the malicious code and the malicious code protection schemes (sample feature detecting and removing, virtual machine-based detecting and removing, heuristic detecting and removing, and similar samples clustering) will be introduced.

[0047] Malicious code refers to a program or code that is disseminated via a storage medium or a network, destroys the integrity of the operating system, and steals undisclosed confidential information in the system without authorization. Taking a mobile phone as an example, mobile phone malicious code refers to malicious code directed against portable devices and PDAs. Mobile phone malicious code may be simply divided into replication-type malicious code and non-replication-type malicious code, wherein the replication-type malicious code mainly includes viruses and worms, while the non-replication-type malicious code mainly includes Trojan horses, rogue software, malicious mobile code, rootkit programs, etc.

[0048] A mobile phone malicious code protection technology protects against malicious code. There are a plurality of such technologies. One example is feature value scanning. A malicious code feature library must first be built in advance; the feature values saved in the library may be a segment of continuous fixed character strings, or several segments of definite character strings interleaved with indefinite characters. During scanning, the to-be-detected file or memory is checked against the feature strings in the feature library; when a matching item is found, it may be determined that the target is infected with malicious code. Another example is malicious code protection based on virtual machine technology. This kind of protection scheme is mainly directed against polymorphic viruses and metamorphic viruses. The virtual machine refers to a complete computer system that is simulated through software, has complete hardware system functions, and runs in a completely isolated environment. This scheme is also referred to as a software simulation method, in which a software analyzer simulates and analyzes program execution. It essentially simulates a small closed program execution environment in memory, and all files to be subject to virus detection and removal are executed virtually therein. When removing a virus using virtual machine technology, feature value scanning is still used first, and only when the target is found to have a feature of encrypted malicious code is the virtual machine module started so that the encrypted code decodes itself. After decoding, the traditional feature value scanning manner may be employed for detection and removal. A further example is heuristic detection and removal, which is mainly directed against the constant mutation of malicious code and aims to strengthen the study of unknown malicious code. The term "heuristic" originates from artificial intelligence and refers to "a capability of self-discovery" or "a knowledge or technique that exerts a certain manner or method to judge an object." Heuristic detection and removal of malicious code means that the scanning software detects a virus by analyzing the structure of a program and its behavior using rules extracted empirically. To achieve infection and damage, malicious code usually exhibits certain characteristic behaviors, such as reading and writing files in an unconventional manner, terminating itself, or entering ring 0 in an unconventional manner. Therefore, whether a program is malicious code may be determined by scanning for specific behaviors or a combination of multiple behaviors. Besides, similar samples clustering may be performed on a target program, e.g., clustering similar samples determined through analysis using a K-means clustering algorithm.

[0049] Irrespective of which protection manner is used, its core always contains two parts. The first part is a reasonably organized malicious code feature library; the second part is an efficient scanning algorithm (also referred to as a matching algorithm). Matching algorithms are generally divided into single-mode matching algorithms and multi-mode matching algorithms. Single-mode matching algorithms include the BF (Brute-Force) algorithm, the KMP (Knuth-Morris-Pratt) algorithm, the BM (Boyer-Moore) algorithm, and the QS (Quick Search) algorithm. Multi-mode matching algorithms include the typical multi-mode matching DFSA algorithm and an ordered binary tree-based multi-mode matching algorithm. Additionally, matching algorithms may be divided into fuzzy matching algorithms and similarity matching algorithms.
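To make one of the named single-mode matching algorithms concrete, the following is a textbook sketch of KMP over byte strings. It is offered only as an example of the algorithm family; the specification does not prescribe this or any other matching algorithm, and the name `kmp_search` is hypothetical.

```python
def kmp_search(text: bytes, pattern: bytes) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    # failure[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # Scan the text once, reusing the failure table instead of backtracking.
    k = 0
    for i, byte in enumerate(text):
        while k and byte != pattern[k]:
            k = failure[k - 1]
        if byte == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


print(kmp_search(b"abababca", b"abca"))  # 4
```

KMP never re-reads a text byte, so a scan runs in time linear in the combined length of the text and the pattern, whereas brute-force matching may re-compare the same bytes many times.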

[0050] It should be noted that the present invention does not limit which malicious code protection solution is employed to detect malicious code. For example, the sample feature detection and removal (feature value scanning), the virtual machine-based scanning, or the heuristic detection and removal introduced above may be employed. In addition, similar samples clustering may also be performed. Moreover, the present application places no limitation on the matching algorithm. For example, the fuzzy matching algorithm or the similarity matching algorithm introduced above may be employed.

[0051] In one scenario, the set of files detected with function A contains the set of files detected with function B. In this scenario, function A is preferably used as a feature, while function B is abandoned as a feature. This is because, after a considerable number of black sample features are obtained, it is necessary to consider how to detect the most files with the fewest features. The embodiments of the present invention achieve this objective through a feature optimization method.

[0052] To summarize, the feature optimization method comprises, for different file sets with different features, if one file set contains all files in another file set, the feature corresponding to a file set with a larger scope will be reserved, while the feature corresponding to the file set with a smaller scope will be abandoned. For example, suppose there are two features: a first feature and a second feature; the files containing the first feature form a first file set, while the files containing the second feature form a second file set; if the first file set contains all files in the second file set, the first feature is reserved, while the second feature is abandoned.
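The containment rule can be sketched as follows, assuming each feature is mapped to the set of file indices it detects. The name `drop_subsumed` is hypothetical, and the tie-break for two features with identical file sets (keep the name that sorts first) is an assumption, since the text does not address that case.

```python
def drop_subsumed(feature_files):
    """Drop features whose file set is contained in another feature's set."""
    kept = {}
    for name, files in feature_files.items():
        subsumed = any(
            other != name
            and (files < other_files                       # proper subset
                 or (files == other_files and other < name))  # assumed tie-break
            for other, other_files in feature_files.items()
        )
        if not subsumed:
            kept[name] = files
    return kept


feature_files = {"first": {1, 2, 3, 4}, "second": {2, 3}}
print(sorted(drop_subsumed(feature_files)))  # ['first']
```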

[0053] FIG. 2 illustrates a flow diagram of optimizing features in a method for feature extraction according to one embodiment of the present invention. Feature optimization comprises the steps of:

[0054] S201: establishing a vector for each feature with respect to all files;

[0055] S202: initializing a set;

[0056] S203: comparing the set sequentially with the vector of each feature;

[0057] S204: determining whether the set contains the compared vector; if the set contains the compared vector, performing S205; if the set does not contain the compared vector, performing S206;

[0058] S205: reserving the set;

[0059] S206: getting a union of the set and the compared vector;

[0060] S207: determining whether the vectors of all features have been compared; if so, performing S208; otherwise, returning to perform S203 to compare with the next feature vector;

[0061] S208: taking the features contained in the finally obtained set as the last reserved features.

[0062] Hereinafter, a preferred example is provided.

[0063] Suppose there are M black sample files and N extractable features (i.e., functions). An M-dimension vector is generated for each extractable feature; the ith dimension of the vector represents whether the black sample file indexed by i can be detected with the feature.

[0064] For example, the vector generated by feature A is 1:1, 2:0, 3:1, 4:1, 5:0, 6:0. This represents that the feature may detect three files indexed by 1, 3, 4.

[0065] Steps:

[0066] initializing a set SA, which is compared sequentially with each feature vector;

[0067] if SA contains Mi (the vector of the ith feature), continuing to compare with the next feature vector;

[0068] otherwise, getting a union of SA and Mi, and then continuing to compare with the next feature vector.

[0069] For example, the vectors generated by features A, B, C, and D are specified below:

[0070] A: 1:0, 2:0, 3:1, 4:1, 5:0, 6:0

[0071] B: 1:1, 2:1, 3:1, 4:0, 5:0, 6:1

[0072] C: 1:1, 2:1, 3:1, 4:1, 5:0, 6:0

[0073] D: 1:1, 2:0, 3:1, 4:1, 5:1, 6:0

[0074] First Step:

[0075] Comparing vectors of A, B; because A does not contain B, getting the union of A and B to obtain a detected vector as AB: 1:1, 2:1, 3:1, 4:1, 5:0, 6:1;

[0076] Second Step:

[0077] Using AB to compare with C; because a file that may be detected by C can already be detected by AB, abandoning C;

[0078] Repeating the Second Step:

[0079] Using AB to compare with D, because D may detect file 5, while AB cannot; therefore, getting a union of AB and D;

[0080] Namely, ABD: 1:1, 2:1, 3:1, 4:1, 5:1, 6:1.

[0081] If a further feature E follows, ABD is compared with feature E in the same manner as in the second step.

[0082] For four vectors A, B, C, and D, the finally chosen features are A, B, D.

[0083] Therefore, a reduced feature set capable of detecting the M files is obtained.
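The worked example can be reproduced with a short sketch of steps S201 to S208, assuming each feature vector is stored as the set of file indices whose entry is 1. The name `optimize_features` is hypothetical.

```python
def optimize_features(vectors):
    """Single pass over feature vectors in the order given (S201 to S208).

    `vectors` maps a feature name to the set of file indices it detects.
    """
    covered = set()   # the set SA: files detected so far
    reserved = []     # features whose vectors were merged into SA
    for name, detected in vectors.items():
        if detected <= covered:
            continue          # SA already contains this vector: abandon it
        covered |= detected   # otherwise take the union
        reserved.append(name)
    return reserved, covered


vectors = {
    "A": {3, 4},
    "B": {1, 2, 3, 6},
    "C": {1, 2, 3, 4},
    "D": {1, 3, 4, 5},
}
reserved, covered = optimize_features(vectors)
print(reserved)         # ['A', 'B', 'D']
print(sorted(covered))  # [1, 2, 3, 4, 5, 6]
```

The result of such a pass depends on the order in which the vectors are compared. In this example B and D alone already cover all six files, so visiting the features in the order B, D, A, C would reserve only two features.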

[0084] Thus by analyzing and computing the acquired black sample files and white sample files and counting the times that a check code of each function appears in the files, the embodiments of the present invention only use the functions appearing in the black sample files while not appearing in the white sample files as the basis for feature extraction. In this way, the fast and accurate feature extraction may guarantee building of an efficient feature library and guarantee implementation of the protection technology. Preferably, the features may be optimized so as to detect most files with least features after acquiring a large amount of extractable black sample features.

[0085] Corresponding to the method above, the embodiments of the present invention further provide a device for feature extraction. The device may be implemented by software, hardware, or a combination of software and hardware. Specifically, the device may be a terminal device or a functional entity inside a terminal device. For example, the device may be a functional module inside a mobile phone. Preferably, the device runs under the Android operating system.

[0086] The feature extracting device comprises:

[0087] a file acquiring unit 301 configured to acquire a batch of black sample files and white sample files from an application layer of a smart terminal operating system;

[0088] a parsing unit 302 configured to parse each file to obtain information structure of all functions contained in each file;

[0089] a check code computing unit 303 configured to compute a check code of each function;

[0090] a counting unit 304 configured to determine whether each file contains functions corresponding to respective check codes so as to count times that each function appears in the black sample files and white sample files;

[0091] an extracting unit 305 configured to extract black sample features based on functions only appearing in the black sample files while not appearing in the white sample files, or extract white sample features based on functions only appearing in the white sample files while not appearing in the black sample files.

[0092] Preferably, the device further comprises a feature optimization unit 306 configured to, for different file sets with different features, if one file set contains all files in another file set, reserve the feature corresponding to the file set with the larger scope while abandoning the feature corresponding to the file set with the smaller scope. For example, when the first file set comprises all files in the second file set, the feature optimization unit 306 reserves a first feature corresponding to the first file set, while abandoning a second feature corresponding to the second file set.

[0093] Or, the device further comprises a feature optimization unit 306 configured to establish a vector for each feature with respect to all files; initialize a set to be compared sequentially with the vector of each feature; if the set contains the compared vector, reserve the set; if the set does not contain the compared vector, get a union of the set and the compared vector; sequentially compare the vectors of all features, and take the features contained in the finally obtained set as the last reserved features.

[0094] Preferably, the device further comprises: an inner de-duplicating unit 307 configured to perform intra-file de-duplication to a check code of a function. For example, the inner de-duplicating unit 307 is specifically configured to, for each file, if a plurality of functions have a same check code, extract a function from the plurality of functions as a function corresponding to the check code.

[0095] Wherein, the black sample files and the white sample files are all virtual machine executable files; the parsing unit 302 is specifically configured to decompile the virtual machine executable file to obtain a decompiled information structure of all functions contained in the virtual machine executable file.

[0096] Wherein, the check code computing unit 303 is specifically configured to compute a hash value of the information structure of the function to use the hash value as the check code of the function.

[0097] Wherein, the parsing unit 302 is further configured to parse the virtual machine executable file according to format of the virtual machine executable file to obtain the function information structure of each class; determine a position and size of each function of the virtual machine executable file according to fields in the function information structure, and obtain the decompiled function information structure of each function.

[0098] The parsing unit 302 is further configured to parse the function information structure to obtain a bytecode array field indicating the function position of the virtual machine executable file and a list length field indicating the function size of the virtual machine executable file; and determine a position and size of the function of the virtual machine executable file based on the bytecode array field and the list length field.
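As a rough sketch of locating a function from those two fields, the code below assumes the `code_item` layout of the published Dalvik executable format, in which a 16-byte header ends with a 32-bit `insns_size` (counted in 16-bit code units) and is followed directly by the `insns` array. Mapping the "list length field" and "bytecode array field" to these two names is an assumption, and `locate_function` and the sample bytes are hypothetical.

```python
import struct

# Fixed-size header of a code_item: registers_size, ins_size, outs_size,
# tries_size (4 x uint16), then debug_info_off and insns_size (2 x uint32).
_HEADER = struct.Struct("<4H2I")


def locate_function(dex: bytes, code_off: int):
    """Return (position, size_in_bytes) of a function's instruction array."""
    *_, insns_size = _HEADER.unpack_from(dex, code_off)
    position = code_off + _HEADER.size   # insns starts right after the header
    return position, insns_size * 2      # insns_size counts 16-bit code units


# Hypothetical buffer: 8 padding bytes, then one code_item with 2 code units.
insns = bytes.fromhex("12000f00")
blob = b"\x00" * 8 + _HEADER.pack(1, 0, 0, 0, 0, 2) + insns
position, size = locate_function(blob, 8)
print(position, size, blob[position:position + size].hex())
```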

[0099] The parsing unit 302 is specifically configured to decompile the virtual machine executable file into a virtual machine bytecode using a virtual machine executable file decompilation tool.

[0100] Wherein, the extracting unit 305 is configured to take a function that only appears in the black sample file while not appearing in the white sample file as the black sample feature, or take a part of code of the function that only appears in the black sample file while not appearing in the white sample file as the black sample feature; or,

[0101] take a function that only appears in the white sample file while not appearing in the black sample file as the white sample feature, or take a part of code of the function that only appears in the white sample file while not appearing in the black sample file as the white sample feature.

[0102] Preferably, the device further comprises: a feature library adding unit 308 configured to add a black sample feature into the black sample feature library, and a matching unit 309 configured to match a target file using the black sample feature library and, if the target file contains a function or a subset of functions corresponding to the black sample feature, determine that malicious code exists in the target file. Wherein, the matching unit 309 may specifically perform sample feature detection and removal, virtual machine-based detection and removal, heuristic detection and removal, and/or similar samples clustering on the target file using the function corresponding to the black sample feature in the black sample feature library.

[0103] Wherein, the black sample file refers to a file preliminarily determined as containing a black sample, while the white sample file refers to a file preliminarily determined as not containing a black sample.

[0104] Wherein, the file acquiring unit 301 is specifically configured to find an installation package of an application from an application layer of a smart terminal operating system; parse the installation package to obtain a virtual machine executable file of the application; and take the virtual machine executable file as a black sample file or a white sample file.
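Obtaining the virtual machine executable file from an installation package can be sketched as follows, assuming an Android APK, which is a ZIP archive whose primary executable is stored as `classes.dex`. The name `extract_dex` and the in-memory stand-in archive are hypothetical; real packages may also carry additional dex files that this sketch ignores.

```python
import io
import zipfile


def extract_dex(apk) -> bytes:
    """Read classes.dex from an APK given as a path or file-like object."""
    with zipfile.ZipFile(apk) as archive:
        return archive.read("classes.dex")


# Build a tiny in-memory archive as a stand-in for a real installation package.
fake_apk = io.BytesIO()
with zipfile.ZipFile(fake_apk, "w") as archive:
    archive.writestr("classes.dex", b"dex\n035\x00")
fake_apk.seek(0)
print(extract_dex(fake_apk)[:3])
```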

[0105] Regarding specific implementations of the device, the method embodiments may be referenced, which will not be detailed here.

[0106] The algorithm and display provided here are not inherently related to any specific computer, virtual system or other device. Various general systems may also be used with the teaching based on that. According to the depiction above, a structure required for building such system is obvious. In addition, the present invention is not directed to any specific programming language. It should be understood that various programming languages may be utilized to implement the content of the present invention depicted here, and the depiction above with respect to the specific language is for disclosing the preferred embodiments of the present invention.

[0107] The specification provided here illustrates many specific details. However, it should be understood that the embodiments of the present invention may be implemented without these specific details. In some embodiments, known methods, structure and technologies are not illustrated in detail so as not to blur the understanding of the present invention.

[0108] Similarly, it should be understood that in order to simplify the present disclosure and facilitate understanding one or more of various invention aspects, in the depiction of the exemplary embodiments of the present invention above, respective features of the present invention are sometimes grouped into a single embodiment, a figure or a depiction of the figure. However, the method of the present disclosure should not be interpreted as reflecting the following intentions: the present invention as claimed claims more features than the explicitly stated features in each claim. More specifically, as reflected by the claims below, the invention aspect is less than all features in a single embodiment as disclosed above. Therefore, the claims conforming to a specific embodiment are thereby explicitly incorporated in the specific embodiment, wherein each claim per se is used as a standalone embodiment of the present invention.

[0109] Those skilled in the art may understand that modules in a device in an embodiment may be adapted and provided in one or more devices different from the embodiment. Modules or units or components in an embodiment may be combined into one module or unit or assembly; besides, they may also be divided into a plurality of sub-modules or sub-units or sub-assemblies. Except that at least some of such features and/or processes or units are mutually exclusive, any combination may be employed to combine all features disclosed in the specification (including the appended claims, abstract and drawings) and all processes or units of any method or device such disclosed. Except otherwise explicitly stated, each feature disclosed in the present specification (including the appended claims, abstract, and drawings) may be replaced by alternative features providing same, equivalent or similar objectives.

[0110] Besides, those skilled in the art can understand that although some embodiments depicted here contain some features, rather than other features, contained in other embodiments, a combination of features from different embodiments means being within the scope of the present invention but forming a different embodiment. For example, in the appended claims, any one of the embodiments as claimed here may be used in any combination manner.

[0111] Various component embodiments of the present invention may be implemented by hardware, by software modules running on one or more processors, or by a combination thereof. Those skilled in the art should understand that in practice, a microprocessor or a digital signal processor (DSP) may be used to implement some or all functions of some or all components of the device for feature extraction according to the embodiments of the present invention. The present invention may also be implemented as a device or device program (e.g., a computer program and a computer program product) for implementing a part or all of the method described here. Such a program for implementing the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

[0112] For example, FIG. 4 illustrates a smart electronic device for executing the method for feature extraction according to the present invention. The smart electronic device conventionally comprises a processor 410 and a computer program product or a computer readable medium in the form of a memory 420. The memory 420 may be an electronic storage such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM. The memory 420 has a storage space 430 with program codes 431 for executing any method steps in the method. For example, the storage space 430 for program code may contain various program codes 431 for implementing respective steps in the methods above. These program codes may be read out from one or more computer program products or written into one or more such computer program products. These computer program products contain program code carriers such as a hard disk, a compact disk (CD), a memory card, or a floppy disk. Such a computer program product is generally a portable or fixed storage unit as depicted with reference to FIG. 5. The storage unit may have a storage segment, a storage space, and the like, in an arrangement similar to the memory 420 in the smart electronic device of FIG. 4. The program code may, for example, be compressed in any appropriate form. Generally, the storage unit contains computer readable code 431', i.e., code that may be read by a processor such as the processor 410. These codes, when being executed by the server, cause the server to execute various steps of the methods depicted above.

[0113] It should be noted that the embodiments above are intended to illustrate the present invention, not to limit the present invention; moreover, without departing from the scope of the appended claims, those skilled in the art may design alternative embodiments. In the claims, no reference numerals contained within parentheses should constitute a limitation to the claims. The word "comprise" does not exclude elements or steps not stated in the claims. Wording like "a" or "an" before an element does not exclude the existence of a plurality of such elements. The present invention may be implemented by virtue of hardware including a plurality of different elements and an appropriately programmed computer. In a device claim listing several means, several of such means may be embodied through the same hardware item. Use of words like first, second, and third does not indicate any sequence. These words may be interpreted as names.
