U.S. patent application number 15/109343 was filed with the patent office on 2017-07-27 for method and device for feature extraction.
The applicant listed for this patent is Beijing Qihoo Technology Company Limited, Qizhi Software (Beijing) Company Limited. Invention is credited to Zhuo CHEN, Hai TANG, Kang YANG.
Application Number | 20170214704 15/109343 |
Family ID | 50528712 |
Filed Date | 2017-07-27 |
United States Patent
Application |
20170214704 |
Kind Code |
A1 |
YANG; Kang ; et al. |
July 27, 2017 |
METHOD AND DEVICE FOR FEATURE EXTRACTION
Abstract
The present invention discloses a method and device for feature
extraction, wherein the method comprises acquiring a batch of black
sample files and white sample files from an application layer of a
smart terminal operating system; parsing each file to obtain
information structure of all functions contained in each file, and
computing a check code of each function; determining whether each
file contains functions corresponding to respective check codes so
as to count times that each function appears in the black sample
files and white sample files; extracting black sample features
based on functions only appearing in the black sample files while
not appearing in the white sample files, or extracting white sample
features based on functions only appearing in the white sample
files while not appearing in the black sample files. By analyzing
and computing the acquired black sample files and white sample
files and counting the times that a check code of each function
appears in the files, the embodiments of the present invention only
use the functions appearing in the black sample files while not
appearing in the white sample files as the basis for feature
extraction.
Inventors: | YANG; Kang (Beijing, CN); CHEN; Zhuo (Beijing, CN); TANG; Hai (Beijing, CN) |
Applicant: |
Name | City | State | Country | Type |
Beijing Qihoo Technology Company Limited | Beijing | | CN | |
Qizhi Software (Beijing) Company Limited | Beijing | | CN | |
Family ID: | 50528712 |
Appl. No.: | 15/109343 |
Filed: | August 7, 2014 |
PCT Filed: | August 7, 2014 |
PCT NO: | PCT/CN2014/083910 |
371 Date: | June 30, 2016 |
Current U.S. Class: | 1/1 |
Current CPC Class: | G06F 21/563 20130101; H04L 63/1425 20130101; G06F 21/566 20130101; Y04S 40/20 20130101; G06F 21/562 20130101; G06F 21/56 20130101 |
International Class: | H04L 29/06 20060101 H04L029/06; G06F 21/56 20060101 G06F021/56 |
Foreign Application Data
Date | Code | Application Number |
Dec 30, 2013 | CN | 201310746033.6 |
Claims
1. A method for feature extraction, comprising: acquiring a batch
of black sample files and white sample files from an application
layer of a smart terminal operating system; parsing each file to
obtain information structure of all functions contained in each
file, and computing a check code of each function; determining
whether each file contains functions corresponding to respective
check codes so as to count times that each function appears in the
black sample files and white sample files; extracting black sample
features based on functions only appearing in the black sample
files while not appearing in the white sample files, or extracting
white sample features based on functions only appearing in the
white sample files while not appearing in the black sample
files.
2. The method according to claim 1, wherein after counting the
black samples or white samples, the method further comprises:
optimizing features, specifically: establishing a vector for each
feature with respect to all files; initializing a set to be
compared sequentially with the vector of each feature; if the set
contains the compared vector, reserving the set; if the set does
not contain the compared vector, getting a union of the set and the
compared vector; sequentially comparing the vectors of all
features, and taking the features contained in the finally obtained
set as the last reserved features.
3. The method according to claim 1, wherein after counting the
black samples or white samples, the method further comprises:
optimizing features, specifically: for different file sets with
different features, if one file set contains all files in another
file set, reserving features corresponding to a file set with a
larger scope, while abandoning features corresponding to a file set
with a smaller scope.
4. The method according to claim 3, wherein the features contain a
first feature and a second feature, files containing the first
feature form a first file set, and files containing the second
feature form a second file set; if the first file set contains all
files in the second file set, the first feature is reserved, while
the second feature is abandoned.
5. The method according to claim 1, wherein before the counting
times that each function appears in the black sample files and the
white sample files, the method further comprises: performing
intra-file de-duplication to the check code of the function.
6. The method according to claim 5, wherein the performing
intra-file de-duplication to the check code of the function
comprises: for each file, if a plurality of functions have a same
check code, extracting a function from the plurality of functions
as a function corresponding to the check code.
7. The method according to claim 1, wherein the black sample files
and the white sample files are all virtual machine executable
files; the parsing each file to obtain information structure of all
functions contained in the each file comprises: decompiling a
virtual machine executable file to obtain a decompiled information
structure of all functions contained in the virtual machine
executable file.
8. The method according to claim 7, wherein the computing a check
code of each function comprises: computing a hash value of
information structure of the function by hash algorithm, use the
hash value as the check code corresponding to the function.
9-11. (canceled)
12. The method according to claim 1, wherein: the extracting black
sample features based on functions only appearing in black sample
files while not appearing in white sample files comprises: using a
function that only appears in the black sample files while not
appearing in the white sample files as the black sample feature, or
using a part of code of the function that only appears in the black
sample files while not appearing in the white sample files as the
black sample feature; the extracting white sample features based on
functions only appearing in white sample files while not appearing
in black sample files comprises: using a function that only appears
in the white sample files while not appearing in the black sample
files as the white sample feature, or using a part of code of the
function that only appears in the white sample files while not
appearing in the black sample files as the white sample
feature.
13. The method according to claim 1, further comprising: adding
black sample features into a black sample feature library, and
matching a target file using the black sample feature library; if
the target file contains a function or a subset of functions
corresponding to a black sample feature, determining that malicious
code exists in the target file.
14. (canceled)
15. The method according to claim 1, wherein, the black sample file
refers to a file preliminarily determined as containing a black
sample, while the white sample file refers to a file preliminarily
determined as not containing a black sample.
16. The method according to claim 15, wherein the acquiring a batch
of black sample files and white sample files comprises: finding an
installation package of an application from an application layer of
a smart terminal operating system; parsing the installation package
to obtain a virtual machine executable file of the application;
using the virtual machine executable file as a black sample file or
a white sample file.
17. (canceled)
18. A device for feature extraction, comprising a memory having
instructions stored therein and at least one processor to execute
the instructions to cause: acquiring a batch of black sample files
and white sample files from an application layer of a smart
terminal operating system; parsing each file to obtain information
structure of all functions contained in each file; computing a
check code of each function; determining whether each file contains
functions corresponding to respective check codes so as to count
times that each function appears in the black sample files and
white sample files; and extracting black sample features based on
functions only appearing in the black sample files while not
appearing in the white sample files, or extracting white sample
features based on functions only appearing in the white sample
files while not appearing in the black sample files.
19. The device according to claim 18, the processor further
executes the instructions to cause optimizing features that
comprising: establishing a vector for each feature with respect to
all files; initializing a set to be compared sequentially with the
vector of each feature; if the set contains the compared vector,
reserving the set; if the set does not contain the compared vector,
getting a union of the set and the compared vector; sequentially
comparing the vectors of all features, and taking the features
contained in the finally obtained set as the last reserved
features.
20. The device according to claim 18, the processor further
executes the instructions to cause: for different file sets with
different features, if one file set contains all files in another
file set, reserving features corresponding to a file set with a
larger scope, while abandoning features corresponding to a file set
with a smaller scope.
21-22. (canceled)
23. The device according to claim 22, wherein the processor further
executes the instructions to cause: performing intra-file
de-duplication to the check code of the function, wherein the
performing intra-file de-duplication to the check code of the
function comprises: for each file, if a plurality of functions have
a same check code, extracting a function from the plurality of
functions as a function corresponding to the check code.
24. The device according to claim 18, wherein: the black sample
files and the white sample files are all virtual machine executable
files; and the parsing each file to obtain information structure of
all functions contained in each file specifically comprises:
decompiling the virtual machine executable file to obtain a
decompiled information structure of all functions contained in the
virtual machine executable file.
25-28. (canceled)
29. The device according to claim 18, wherein: the extracting black
sample features based on functions only appearing in black sample
files while not appearing in white sample files comprises: using a
function that only appears in the black sample files while not
appearing in the white sample files as the black sample feature, or
using a part of code of the function that only appears in the black
sample files while not appearing in the white sample files as the
black sample feature; the extracting white sample features based on
functions only appearing in white sample files while not appearing
in black sample files comprises: using a function that only appears
in the white sample files while not appearing in the black sample
files as the white sample feature, or using a part of code of the
function that only appears in the white sample files while not
appearing in the black sample files as the white sample
feature.
30. The device according to claim 18, wherein the processor further
executes the instructions to cause: adding a black sample feature
into a black sample feature library, and matching a target file
using the black sample feature library; if the target file contains
a function or a subset of functions corresponding to the black
sample feature, determining that malicious code exists in the
target file.
31-35. (canceled)
36. A computer-readable medium, having instructions stored therein
that, when executed by at least one processor, cause the processor
to perform feature extraction comprising: acquiring a batch of
black sample files and white sample files from an application layer
of a smart terminal operating system; parsing each file to obtain
information structure of all functions contained in each file, and
computing a check code of each function; determining whether each
file contains functions corresponding to respective check codes so
as to count times that each function appears in the black sample
files and white sample files; and extracting black sample features
based on functions only appearing in the black sample files while
not appearing in the white sample files, or extracting white sample
features based on functions only appearing in the white sample
files while not appearing in the black sample files.
Description
FIELD OF THE INVENTION
[0001] The present invention relates to the technical field of
network security, and more specifically relates to a method and
device for feature extraction.
BACKGROUND OF THE INVENTION
[0002] With the development of sciences and technologies, smart
terminals are provided with more and more functions. For example,
mobile phones have turned from traditional GSM and TDMA digital
mobile phones into smart phones that have capabilities of
processing multimedia resources and providing various kinds of
information services such as network browsing, telephone
conference, electronic commerce, etc. However, that also brings
increasing varieties of malicious code attacks to the mobile phones
and increasingly serious personal data security issues. Smart
mobile phone users suffer deeply from more and more mobile phone
viruses.
[0003] Mobile phone malicious code protection technologies perform
protection against malicious codes. A variety of mobile phone
malicious code protection approaches have been provided, for
example, feature value scanning approach, virtual machine
technology-based malicious code protection, heuristic scanning and
similar samples clustering, etc. Whichever protection approach is
used, a reasonably organized malicious code feature library is the
foundation, alongside an efficient scanning algorithm (also called a
matching algorithm). Therefore, how to extract features accurately
and efficiently is crucial to building a feature library, and even
to the protection technology as a whole.
SUMMARY OF THE INVENTION
[0004] In view of the problems above, a method and device for
feature extraction according to the present invention is provided
so as to overcome the above problems or at least partially solve
the above problems.
[0005] According to one aspect of the present invention, there is
provided a method for feature extraction, comprising acquiring a
batch of black sample files and white sample files from an
application layer of a smart terminal operating system; parsing
each file to obtain information structure of all functions
contained in each file, and computing a check code of each
function; determining whether each file contains functions
corresponding to respective check codes so as to count times that
each function appears in the black sample files and white sample
files; extracting black sample features based on functions only
appearing in the black sample files while not appearing in the
white sample files, or extracting white sample features based on
functions only appearing in the white sample files while not
appearing in the black sample files.
[0006] According to another aspect of the present invention, there
is provided a device for feature extraction, comprising a file
acquiring unit configured to acquire a batch of black sample files
and white sample files from an application layer of a smart
terminal operating system; a parsing unit configured to parse each
file to obtain information structure of all functions contained in
each file, and a check code computing unit configured to compute a
check code of each function; a counting unit configured to
determine whether each file contains functions corresponding to
respective check codes so as to count times that each function
appears in the black sample files and white sample files; an
extracting unit configured to extract black sample features based
on functions only appearing in the black sample files while not
appearing in the white sample files, or extract white sample
features based on functions only appearing in the white sample
files while not appearing in the black sample files.
[0007] Thus by analyzing and computing the acquired black sample
files and white sample files and counting times that a check code
of each function appears in the files, the embodiments of the
present invention only use functions appearing in the black sample
files while not appearing in the white sample files as the basis
for feature extraction. In this way, the fast and accurate feature
extraction may guarantee building of an efficient feature library
and guarantee implementation of the defending technologies.
Preferably, the features may be optimized so as to detect most
files with least features after acquiring a large amount of
extractable black sample features.
[0008] The above are only summaries of the technical solutions of
the present invention; in order to understand the technical means
of the present invention more clearly, the implementation may be
based on the content in the specification. Besides, in order to
make the above and other objectives, features, and advantages of
the present invention more apparent and comprehensible, preferred
embodiments of the present invention will be specifically provided
below.
BRIEF DESCRIPTION OF THE SEVERAL DRAWINGS
[0009] Through reading the detailed description of the preferred
embodiments below, various other advantages and benefits will become
clear to those skilled in the art. The drawings are only
used for the purpose of illustrating preferred embodiments, and
should not be regarded as limitation to the present invention.
Moreover, throughout the entire drawings, same reference numerals
are used to indicate same components. In the accompanying
drawings,
[0010] FIG. 1 illustrates a flow diagram of a method for feature
extraction according to one embodiment of the present
invention;
[0011] FIG. 2 illustrates a flow diagram of optimizing features in
a method for feature extraction according to one embodiment of the
present invention;
[0012] FIG. 3 illustrates a schematic diagram of a device for
feature extraction according to one embodiment of the present
invention;
[0013] FIG. 4 illustrates a block diagram of a smart electronic
device for executing the method according to the present
invention;
[0014] FIG. 5 illustrates a schematic diagram of a storage unit for
maintaining or carrying program codes that implement the method
according to the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0015] Hereinafter, exemplary embodiments of the present disclosure
will be described in more detail with reference to the accompanying
drawings. Although the drawings reveal the exemplary embodiments of
the present disclosure, it should be understood that the present
disclosure may be implemented in various forms and should not be
limited by the embodiments illustrated here. On the contrary, these
embodiments are provided for a more thorough understanding of the
present disclosure and for a complete delivery of the scope of the
present disclosure to those skilled in the art.
[0016] The Android operating system, as an example, contains an
application layer (app layer) and a system framework layer
(framework layer); other layers that might be distinguished in
terms of functional partitioning will not be discussed here.
The app layer may be generally understood as an upper
layer, in charge of interfaces for interaction with a user, e.g.,
application maintenance, and identifying different kinds of click
contents upon clicking onto a page so as to display different
context menus. The framework layer is generally used as an
intermediate layer, mainly for forwarding a user request (e.g.,
starting an application, clicking on a link, clicking to save a
picture, and the like) to a lower layer; and distributing contents
completely processed by the lower layer to the upper layer either
via a message or via an intermediate proxy class, so as to present
them to the user.
[0017] The inventors of the present invention have found in
researching that by counting times that a check code of a function
contained in a sample file appears in files, it may be determined
whether the function is a black sample or a white sample.
[0018] Refer to FIG. 1, in which a flow diagram of a method for
feature extraction according to one embodiment of the present
invention is presented.
[0019] The method for feature extraction comprises steps of
[0020] S101: acquiring a batch of black sample files and white
sample files from an application layer of a smart terminal
operating system;
[0021] The black sample files refer to files preliminarily
determined as containing a black sample, e.g., a file containing
malicious code, while the white sample files refer to files
preliminarily determined as not containing a black sample, e.g., a
file not containing malicious code. Those skilled in the art
should understand that a feature library needs to be built during
matching, detecting, and removing malicious codes, and building of
the feature library is based on extracting features from sample
files. In the embodiments of the present invention, whether a batch
of files are black sample files or white sample files is
preliminarily determined manually in advance. More black sample
files and white sample files will be beneficial for accurate
extraction of sample features.
[0022] In the embodiments of the present invention, the black
sample files or white sample files may be, for example, dex files.
Dex files refer to virtual machine executable files directly loaded
and running in a Dalvik virtual machine (Dalvik VM) in Android
system. Dalvik is a Java virtual machine for an Android platform.
An optimized Dalvik allows concurrently running instances of
multiple virtual machines in a limited internal memory, and each
Dalvik application is executed as an independent Linux process. The
independent process can prevent closing of all programs when the
virtual machine breaks down. The Dalvik virtual machine may support
running of a Java application that has been converted into a dex
(Dalvik Executable) format. The dex format is a kind of compressed
format specifically designed for Dalvik and is suitable for a
system with limited memory and processor speed. Java source codes
may be converted into a dex file by ADT (Android Development Tools)
through a complex compilation. The dex file is an optimized result
for an embedded system. The Dalvik virtual machine does not employ
standard Java virtual machine instruction codes, but uses its
specific instruction set. The dex file shares many class names and
constant strings, so its size is small and its operating
efficiency is relatively high.
[0023] Specifically, obtaining a batch of black sample dex files
and white sample dex files from a smart terminal may comprise
finding an installation package of an application from an
application layer of a smart terminal operating system; parsing the
installation package to obtain a dex file of the application; using
the dex file as a black sample file or a white sample file.
For example, the dex file can be obtained by parsing an APK (Android
Package). The APK file is actually a compressed package in zip
format whose file extension has been changed to apk; a dex file may
be obtained after decompression via UnZip.
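The unpacking step described above can be sketched in Python with the standard `zipfile` module. This is only an illustration under stated assumptions: the helper name `extract_dex_files` is hypothetical, and the in-memory archive is a stand-in built so the sketch runs without a real installation package.

```python
import io
import zipfile


def extract_dex_files(apk_source):
    """Return {entry name: raw bytes} for every .dex entry in an APK.

    apk_source may be a filesystem path or a binary file-like object.
    """
    dex_files = {}
    with zipfile.ZipFile(apk_source) as apk:
        for name in apk.namelist():
            # An APK is an ordinary zip archive, so its dex entries can be
            # read like any other archive member.
            if name.endswith(".dex"):
                dex_files[name] = apk.read(name)
    return dex_files


# Build a tiny stand-in "APK" in memory (placeholder contents, not a real dex).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as fake_apk:
    fake_apk.writestr("classes.dex", b"dex\n035\x00placeholder")
    fake_apk.writestr("AndroidManifest.xml", b"<manifest/>")
buffer.seek(0)

print(sorted(extract_dex_files(buffer)))
```

Each returned byte string could then be labeled as a black sample file or a white sample file according to the manual preliminary determination.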
[0024] As previously mentioned, the Android operating system
comprises an application layer (app layer) and a system framework
layer (framework layer). The present invention focuses on study and
improvement of the app layer. However, those skilled in the art
understand that when Android is started, the Dalvik VM monitors all
programs (APK files) and frameworks and creates a dependency
relationship tree for them. Through this dependency relationship
tree, the Dalvik VM optimizes the code of each program and stores the
optimized code into a Dalvik cache (dalvik-cache). In this way,
all programs will use optimized code upon running. When a program
(or framework) changes, the Dalvik VM will re-optimize the code and
store it into the cache again. The cache/dalvik-cache is for
depositing dex files generated by programs on the system, while
data/dalvik-cache is for depositing dex files generated by
data/app. In other words, the present invention focuses on
analyzing and processing of dex files generated by data/app.
However, it should be understood that the theory and operation of
the present invention is likewise applicable to dex files generated
by programs on the system.
[0025] S102: parsing each file to obtain information structure of
all functions contained in each file, and computing a check code of
each function;
[0026] Still taking a dex file as an example, parsing a file to
obtain information structure of all functions contained in the file
comprises decompiling the dex file to obtain decompiled information
structure of all functions contained in the dex file.
[0027] The dex file may be decompiled in several manners.
[0028] Manner 1: parsing the dex file according to the dex file
format to obtain a function information structure of each class;
determining a location and size of each function of the dex file
according to fields in the function information structure, to obtain
a decompiled function information structure. By parsing the function
information structure, a bytecode array field indicating the
position of a function of the dex file and a list length field
indicating the size of that function are obtained, thereby
determining the position and size of the function of the dex file.
[0029] For example, the dex file is parsed according to the dex file
format to obtain the function information structure of each class.
The function information structure contains the fields in Table 1.
TABLE 1
Field | Type | Description |
registers_size | ushort | Number of registers used by this segment of code |
ins_size | ushort | Number of words of input parameters of the method in this segment of code |
outs_size | ushort | Number of words of outgoing argument space that this segment of code needs for invoking other functions |
tries_size | ushort | Number of try_item entries of the object; if not 0, they appear as the tries array after the insns of the present object |
debug_info_off | uint | Offset from the beginning of the file to the debug info; without such information, the value is 0; if not 0, it represents a position in a data segment, and the data shall follow the format prescribed by debug_info_item |
insns_size | uint | Length of the instructions list, with two bytes as a unit |
insns | ushort[insns_size] | Bytecode array. The format of the bytecode is detailed in the file "Bytecode for the Dalvik VM." Although it is defined as a ushort-type array, some internal structures employ a 4-byte alignment; if the file has been subjected to a byte exchange operation, the byte exchange can only be performed within the ushort type |
padding | ushort (optional) = 0 | Two padding bytes used to satisfy the 4-byte alignment of tries. The element only exists when tries_size is not 0 and insns_size is an odd number |
tries | try_item[tries_size] (optional) | Array identifying where in the code exceptions may be thrown. The array elements shall be arranged in ascending order of address, and no repeated addresses shall appear. This element only exists when tries_size is not 0 |
handlers | encoded_catch_handler_list (optional) | Bytes representing a list of exception types and the addresses of their handlers; each try_item has a byte-wise offset into this structure. This element only exists when tries_size is not 0 |
[0030] Wherein, the insns_size and insns fields in each function
information structure represent the function size and position,
respectively. Then, the information structure of the function may
be decompiled according to the fields insns_size and insns. The
decompiled information structure consists of Dalvik VM bytecode,
which will be detailed later.
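As a rough illustration of Manner 1, the fixed-size fields of Table 1 can be read with Python's `struct` module and then used to slice out the bytecode array. The sketch assumes the standard little-endian dex layout; the helper name `read_function_body` and the offset argument `code_off` are hypothetical, and the sample bytes are synthetic rather than taken from a real dex file.

```python
import struct

# Fixed-size leading fields of the function information structure (Table 1):
# four ushort fields followed by two uint fields, assumed little-endian.
CODE_ITEM_HEADER = struct.Struct("<4H2I")


def read_function_body(dex_bytes, code_off):
    """Return (header dict, raw bytecode) for the structure at code_off."""
    (registers_size, ins_size, outs_size, tries_size,
     debug_info_off, insns_size) = CODE_ITEM_HEADER.unpack_from(dex_bytes, code_off)
    # insns starts right after the header: this is the function's position.
    insns_start = code_off + CODE_ITEM_HEADER.size
    # insns_size counts 2-byte units: this gives the function's size.
    insns_end = insns_start + insns_size * 2
    header = {
        "registers_size": registers_size,
        "ins_size": ins_size,
        "outs_size": outs_size,
        "tries_size": tries_size,
        "debug_info_off": debug_info_off,
        "insns_size": insns_size,
    }
    return header, dex_bytes[insns_start:insns_end]


# Synthetic data: 4 filler bytes, one header, then a single 2-byte instruction.
sample = b"\xff" * 4 + CODE_ITEM_HEADER.pack(1, 1, 0, 0, 0, 1) + b"\x0e\x00"
header, insns = read_function_body(sample, 4)
print(header["insns_size"], insns.hex())
```

Locating `code_off` for each function requires walking the class and method tables of the dex file, which this sketch leaves out.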
[0031] Manner 2: decompiling the dex file into a virtual machine
byte code using a dex file decompilation tool.
[0032] As mentioned above, the Dalvik virtual machine runs a Dalvik
bytecode, which exists in a dex executable file form. The Dalvik
virtual machine executes codes by interpreting the dex file.
Currently, several tools can decompile a dex file into Dalvik
assembly code; such dex file decompiling tools include
baksmali, Dedexer 1.26, dexdump, dexinspecto 03-12-12r, IDA Pro,
androguard, dex2jar, and 010 Editor, etc.
[0033] It is seen that the decompiled information structures of all
functions may be obtained by decompiling the dex file. The
function information structure comprises function execution codes,
which, in the present embodiment, are formed by a virtual machine
instruction sequence and a virtual machine mnemonic sequence. As in
the example below, the function information structure is formed by an
instruction sequence of the Dalvik VM and a mnemonic sequence of the
Dalvik VM.
[0034] For example, a function information structure obtained by
decompiling the dex file according to one embodiment of the present
invention is specified below:
TABLE-US-00002 [Decompiled function listing; published as images ##STR00001## to ##STR00004##, not reproduced here]
[0035] It is seen that the dex file is decompiled into an
instruction sequence of the Dalvik VM and a mnemonic sequence of the
Dalvik VM. As indicated in the example above, in the function
information structure obtained by decompilation, the first 2 digits
of each line in the machine code field denote the instruction
sequence (the part circled on the left in the example above), while
the part corresponding to the instruction sequence is a mnemonic (on
the right side of the example, partially circled, not completely
selected). The mnemonic is mainly for facilitating user communication
and code compilation.
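The relationship between the first 2 digits of a machine code line and its mnemonic can be illustrated with a small lookup. This is a sketch only: the dictionary is a short excerpt of the Dalvik opcode table rather than the full table, the helper name `split_machine_code` is hypothetical, and the sample lines are made up rather than copied from the listing in the application.

```python
# Short excerpt of the Dalvik opcode table (opcode byte -> mnemonic).
MNEMONICS = {
    0x00: "nop",
    0x0E: "return-void",
    0x12: "const/4",
    0x1A: "const-string",
    0x6E: "invoke-virtual",
    0x70: "invoke-direct",
}


def split_machine_code(machine_code_hex):
    """Map the first 2 hex digits of a machine code line to a mnemonic."""
    digits = machine_code_hex.replace(" ", "")
    opcode = int(digits[:2], 16)
    # Opcodes outside the excerpt are reported as "unknown".
    return opcode, MNEMONICS.get(opcode, "unknown")


for line in ["7010 0200 0000", "0e00"]:
    print(line, "->", split_machine_code(line))
```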
[0036] After obtaining the information structure of each function,
the check code of the function may be computed. Thereafter, the check
code may be used to uniquely represent its corresponding function.
The check code of the function may be calculated using an existing
or future algorithm. For example, a hash algorithm may be used to
calculate the hash value of the function as the aforesaid check
code. There are many kinds of hash algorithms, e.g., CRC (Cyclic
Redundancy Check), MD5 (Message Digest Algorithm), or SHA (Secure
Hash Algorithm), etc.
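A minimal sketch of computing such a check code, using Python's standard `hashlib` and `zlib` modules. The function name `check_code` and the choice of hashing the raw bytecode bytes of a function are assumptions made for illustration; the text does not fix which bytes of the information structure are hashed.

```python
import hashlib
import zlib


def check_code(function_bytes, algorithm="md5"):
    """Return a hexadecimal check code for one function's bytes."""
    if algorithm == "crc32":
        # CRC is not provided by hashlib; zlib.crc32 yields a 32-bit value.
        return format(zlib.crc32(function_bytes) & 0xFFFFFFFF, "08x")
    # hashlib.new accepts names such as "md5", "sha1", or "sha256".
    return hashlib.new(algorithm, function_bytes).hexdigest()


body = b"\x12\x00\x0e\x00"  # synthetic function bytes
for name in ("crc32", "md5", "sha1"):
    print(name, check_code(body, name))
```

A 32-bit CRC collides far more readily than MD5 or SHA on a large sample batch, so the choice of algorithm affects how safely a check code can stand in for a function.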
[0037] S103: determining whether each file contains functions
corresponding to respective check codes so as to count times that
each function appears in the black sample files and white sample
files.
[0038] This step is to count times that a hash value appears in a
batch of black sample files and white sample files obtained in step
S101.
[0039] Suppose a hash value of each function is determined by
analyzing and computing the black sample files and white sample
files; then, times that each hash value appears in the black sample
files and white sample files are counted.
[0040] Suppose there are n sample files (including a part of black
sample files and a part of white sample files), wherein the first
file comprises function hash values A, B, C; the second file
comprises function hash values A, C, D; the third file comprises
function hash values B, C, E; . . . the nth file comprises hash
values C, D. All in all, after all files are analyzed, suppose 5
function hash values A, B, C, D, E are determined. Then, times that
the 5 hash values appear in the black sample files and in the white
sample files are counted. Suppose the results are shown in Table 2
below after counting.
TABLE 2
Function hash value | Total times of appearance in the files | Times of appearance in the black sample files | Times of appearance in the white sample files |
A | 10000 | 5000 | 5000 |
B | 10000 | 10000 | 0 |
C | 10000 | 0 | 10000 |
D | 10000 | 8000 | 2000 |
E | 7000 | 7000 | 0 |
[0041] Those skilled in the art understand that different functions
have different hash values, i.e., different hash values represent
different functions; therefore, A, B, C, D, E are also employed
subsequently to represent 5 functions or 5 features. Based on the
times that the above hash values appear in the files, the times
that each function appears in the files may be determined.
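The counting in step S103 can be sketched as follows. This is an
illustrative sketch only: the function name count_appearances and
the representation of each sample as a (label, hashes) pair are
assumptions, not part of the specification.

```python
from collections import Counter


def count_appearances(files):
    """Count, per hash value, the files of each label containing it.

    files: iterable of (label, hashes), where label is "black" or
    "white" and hashes lists the function hash values of one file.
    Returns (black_counts, white_counts) as Counter objects.
    """
    black, white = Counter(), Counter()
    for label, hashes in files:
        target = black if label == "black" else white
        # set() ensures a hash value is counted once per file.
        target.update(set(hashes))
    return black, white


samples = [
    ("black", ["A", "B", "C"]),
    ("white", ["A", "C", "D"]),
    ("black", ["B", "C", "E"]),
    ("white", ["C", "D"]),
]
black_counts, white_counts = count_appearances(samples)
```

The sample data here is made up for the sketch and does not
reproduce the counts of Table 2.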
[0042] Preferably, before counting the times that each function
appears in the black sample files and the white sample files, the
method further comprises de-duplicating the check codes of the
functions within each file. Specifically, de-duplicating means
that, for each file, if a plurality of functions have the same
check code, one function is extracted from the plurality of
functions as the function corresponding to that check code. For
example, suppose that a dex file is parsed to obtain the
information structures of all functions contained therein. Suppose
that three information structures s1, s2, and s3 are parsed out,
and that three hash values hash 1, hash 2, and hash 3 of the three
information structures s1, s2, and s3 are then obtained through a
hash algorithm. Those skilled in the art should understand that
different functions have different hash values, i.e., different
hash values represent different functions. Suppose that some of the
three hash values are identical, e.g., hash 1=hash 2; then they are
deemed to represent the same function. In this case, either one of
s1 and s2 is selected, while the other one is discarded.
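A minimal sketch of the intra-file de-duplication, assuming each
parsed function is available as a (check code, information
structure) pair; the helper name deduplicate is hypothetical, and
keeping the first structure seen is just one way of selecting "any
one" of the duplicates.

```python
def deduplicate(functions):
    """Keep one information structure per check code within a file.

    functions: iterable of (check_code, info_structure) pairs parsed
    from a single file. The first structure seen for a check code is
    kept; later ones with the same code are discarded.
    """
    unique = {}
    for code, structure in functions:
        unique.setdefault(code, structure)
    return unique


# hash 1 == hash 2, so only one of s1 and s2 is kept.
parsed = [("hash1", "s1"), ("hash1", "s2"), ("hash3", "s3")]
kept = deduplicate(parsed)
```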
[0043] S104: extracting black sample features based on functions
only appearing in the black sample files while not appearing in the
white sample files, or extracting white sample features based on
functions only appearing in the white sample files while not
appearing in the black sample files.
[0044] When extracting the features, only functions appearing in
the black sample files while not appearing in the white sample
files are selected as black sample features. For example, still
taking Table 2 as an example, functions B and E are selected for
black sample feature extraction. Specifically, functions B and E
are taken as black sample features, or parts of the code of
functions B and E are taken as black sample features. Likewise,
functions only appearing in the white sample files while not
appearing in the black sample files are selected as white sample
features. For example, still taking Table 2 for illustration,
function C is selected for white sample feature extraction.
Specifically, function C may be used as a white sample feature, or
part of the code of function C may be used as a white sample
feature.
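The selection rule of step S104 can be sketched as below, using the
counts listed in Table 2. The function name select_features and the
dictionary representation are assumptions made for illustration.

```python
def select_features(black_counts, white_counts):
    """Split hash values into black-only and white-only features.

    Both arguments map a function hash value to the number of black
    (or white) sample files in which it appears.
    """
    hashes = set(black_counts) | set(white_counts)
    black_only = sorted(
        h for h in hashes
        if black_counts.get(h, 0) > 0 and white_counts.get(h, 0) == 0
    )
    white_only = sorted(
        h for h in hashes
        if white_counts.get(h, 0) > 0 and black_counts.get(h, 0) == 0
    )
    return black_only, white_only


# Counts taken from Table 2.
table2_black = {"A": 5000, "B": 10000, "C": 0, "D": 8000, "E": 7000}
table2_white = {"A": 5000, "B": 0, "C": 10000, "D": 2000, "E": 0}
black_features, white_features = select_features(table2_black, table2_white)
```

Functions A and D appear in both kinds of files, so they are used
neither as black sample features nor as white sample features.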
[0045] After the black sample feature is extracted in step S104,
the following steps may further be executed: adding the black
sample feature to the black sample feature library; matching a
target file using the black sample feature library; and, if the
target file comprises a function or a subset of functions
corresponding to the black sample feature, determining that
malicious code exists in the target file. As understood by those
skilled in the art, sample feature detecting and removing, virtual
machine-based detecting and removing, heuristic detecting and
removing, or similar samples clustering may be performed on the
target file using the function corresponding to the black sample
feature in the black sample feature library.
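As an illustrative sketch of matching a target file against the
black sample feature library, assume that each library entry is a
set of one or more function check codes and that the target file
has already been parsed into its own check codes. The name
match_target and this representation are assumptions.

```python
def match_target(target_codes, feature_library):
    """Return the library features fully contained in the target.

    target_codes: check codes of the functions parsed from the
    target file.
    feature_library: iterable of frozensets; each frozenset is one
    black sample feature made of one or more function check codes.
    """
    present = set(target_codes)
    return [feature for feature in feature_library if feature <= present]


library = [frozenset({"B"}), frozenset({"E", "X"})]
hits = match_target(["A", "B", "E"], library)
is_malicious = bool(hits)
```

A non-empty result indicates that the target file contains at least
one black sample feature.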
[0046] Hereinafter, the malicious code and the malicious code
protection schemes (sample feature detecting and removing, virtual
machine-based detecting and removing, heuristic detecting and
removing, and similar samples clustering) will be introduced.
[0047] The malicious code refers to a program or code that is
disseminated via a storage medium or a network, destroys integrity
of the operating system and steals undisclosed confidential
information in the system without authorization. With a mobile
phone as an example, a mobile phone malicious code refers to a
malicious code against a portable device and a PDA. The mobile
phone malicious code may be simply divided into a replication-type
malicious code and a non-replication-type malicious code, wherein
the replication-type malicious code mainly contains a virus and a
worm, while the non-replication-type malicious code mainly contains
a Trojan horse, rogue software, a malicious mobile code, a rootkit
program, etc.
[0048] A mobile phone malicious code protection technology performs
protection against malicious code. There are a plurality of mobile
phone malicious code protection technologies. One example is the
feature value scanning manner. A malicious code feature library
must first be built by learning in advance; the feature values
saved in the malicious code feature library may be a segment of
continuous fixed character strings, or several segments of definite
character strings separated by character strings having indefinite
characters. During scanning, the to-be-detected file or the memory
is checked against the character features or strings in the feature
library; when a matching item is found, it may be determined that
the target is infected with malicious code. Another example is
malicious code protection based on virtual machine technology. This
kind of protection scheme is mainly directed against polymorphic
viruses and metamorphic viruses. The virtual machine refers to a
complete computer system that is simulated through software, has
complete hardware system functions, and runs in a completely
isolated environment. This scheme is also referred to as a software
simulation method, in which a software analyzer simulates and
analyzes the running of a program. It essentially simulates a small
closed program execution environment in memory, and all files
subject to virus detection and removal are executed virtually
therein. When removing a virus using the virtual machine
technology, the feature value scanning technology is still used
first; only when the target is found to have a feature of encrypted
malicious code is the virtual machine module started, so that the
encrypted code decodes itself. After decoding, the traditional
feature value scanning manner may be employed for detection and
removal. A further example is the heuristic detection and removal
manner, which is mainly directed against the constant mutation of
malicious code and aims to strengthen the study of unknown
malicious code. The term "heuristic" originates from artificial
intelligence and refers to "a capability of self-discovery" or "a
knowledge or technique that exerts a certain manner or method to
judge an object." Heuristic detection and removal of malicious code
means that the scanning software detects a virus by analyzing the
structure of a program and its behavior using rules extracted
empirically. In order to achieve the objectives of infection and
damage, malicious code usually exhibits certain characteristic
behaviors, such as reading and writing a file in an unconventional
manner, terminating itself, or entering ring 0 in an unconventional
manner. Therefore, whether a program is malicious code may be
determined by scanning for specific behaviors or a combination of
multiple behaviors. Besides, similar samples clustering may be
performed on a target program, e.g., clustering similar samples
determined through analysis using a K-means clustering algorithm.
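The feature value scanning manner can be illustrated with a short
sketch in which a feature value consists of fixed byte segments
separated by a bounded number of indefinite bytes. Expressing the
feature value as a regular expression, the helper names, and the
max_gap parameter are all assumptions made for this sketch.

```python
import re


def build_signature(segments, max_gap=16):
    """Compile fixed byte segments separated by indefinite bytes.

    segments: list of byte strings that must appear in order.
    max_gap: hypothetical limit on the number of arbitrary bytes
    allowed between two consecutive segments.
    """
    gap = b".{0,%d}?" % max_gap
    pattern = gap.join(re.escape(segment) for segment in segments)
    # DOTALL lets "." match every byte value, including newlines.
    return re.compile(pattern, re.DOTALL)


def scan(data, signatures):
    """Return the names of all signatures found in the data."""
    return [name for name, sig in signatures.items() if sig.search(data)]


signatures = {"demo": build_signature([b"\x01\x02", b"\xff"])}
found = scan(b"\x00\x01\x02\x0a\x0b\xff", signatures)
```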
[0049] Irrespective of which protection manner is used, its core
always contains two parts. The first part is a reasonably organized
malicious code feature library; the second part is an efficient
scanning algorithm (also referred to as a matching algorithm).
Matching algorithms are generally divided into single-mode matching
algorithms and multi-mode matching algorithms. Single-mode matching
algorithms include the BF (Brute-Force) algorithm, the KMP
(Knuth-Morris-Pratt) algorithm, the BM (Boyer-Moore) algorithm, and
the QS (Quick Search) algorithm, etc. Multi-mode matching
algorithms include the typical DFSA-based multi-mode matching
algorithm and the ordered binary tree-based multi-mode matching
algorithm. Additionally, matching algorithms may be divided into
fuzzy matching algorithms and similarity matching algorithms.
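As an example of one of the single-mode matching algorithms named
above, the following is a compact sketch of KMP
(Knuth-Morris-Pratt) search. It is given for illustration only; the
specification does not require this or any other particular
matching algorithm.

```python
def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern, or -1.

    Works on str or bytes. An empty pattern matches at index 0.
    """
    if not pattern:
        return 0
    # failure[i]: length of the longest proper prefix of
    # pattern[:i + 1] that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # Scan the text once, reusing the failure table on mismatch.
    k = 0
    for i, symbol in enumerate(text):
        while k and symbol != pattern[k]:
            k = failure[k - 1]
        if symbol == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```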
[0050] It should be noted that the present invention does not limit
which malicious code protection solution is employed to detect a
malicious code. For example, the sample feature detection and
removal (feature value scan), the virtual machine-based scan, or
heuristic detection and removal as introduced above may be
employed. In addition, a similar sample clustering may also be
performed. Moreover, the present application makes no limitation to
the matching algorithm. For example, the fuzzy matching algorithm
or the similarity matching algorithm as introduced above may be
employed.
[0051] In some scenarios, the set of files detected with function A
contains the set of files detected with function B. In such a
scenario, function A is preferably used as a feature, while
function B is abandoned as a feature. This is because, after a
considerable number of black sample features are obtained, it is
necessary to consider how to detect the most files with the fewest
features. The embodiments of the present invention achieve this
objective through a feature optimization method.
[0052] To summarize, the feature optimization method comprises: for
the file sets corresponding to different features, if one file set
contains all files in another file set, the feature corresponding
to the file set with the larger scope is reserved, while the
feature corresponding to the file set with the smaller scope is
abandoned. For example, suppose there are two features: a first
feature and a second feature; the files containing the first
feature form a first file set, while the files containing the
second feature form a second file set; if the first file set
contains all files in the second file set, the first feature is
reserved, while the second feature is abandoned.
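The containment rule summarized above can be sketched as follows.
The helper name prune_contained_features, the dictionary
representation, and the tie-breaking rule for identical file sets
are assumptions made for illustration.

```python
def prune_contained_features(file_sets):
    """Drop features whose file set lies inside another file set.

    file_sets: dict mapping a feature name to the set of files that
    contain the feature. Among identical file sets, the name that
    sorts first is kept (an arbitrary choice for this sketch).
    """
    # Visit larger sets first so that supersets are kept before
    # the subsets they contain.
    ordered = sorted(file_sets, key=lambda f: (-len(file_sets[f]), f))
    kept = []
    for feature in ordered:
        if not any(file_sets[feature] <= file_sets[k] for k in kept):
            kept.append(feature)
    return sorted(kept)


example_sets = {"first": {1, 2, 3}, "second": {2, 3}, "third": {3, 4}}
reserved = prune_contained_features(example_sets)
```

Here the second feature is abandoned because its file set is
contained in that of the first feature, while the third feature is
reserved because it detects file 4.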
[0053] FIG. 2 illustrates a flow diagram of optimizing features in
a method for feature extraction according to one embodiment of the
present invention. Feature optimization comprises the steps of:
[0054] S201: establishing a vector for each feature with respect to
all files;
[0055] S202: initializing a set;
[0056] S203: comparing the set sequentially with the vector of each
feature;
[0057] S204: determining whether the set contains the compared
vector; if the set contains the compared vector, performing S205;
if the set does not contain the compared vector, performing
S206;
[0058] S205: reserving the set;
[0059] S206: getting a union of the set and the compared
vector;
[0060] S207: determining whether the vectors of all features have
been compared; if so, performing S208; otherwise, returning to
perform S203 to compare with the next feature vector;
[0061] S208: taking the features contained in the finally obtained
set as the last reserved features.
[0062] Hereinafter, a preferred example is provided.
[0063] Suppose there are M black sample files and N extractable
features (i.e., functions). An M-dimension vector is generated for
each extractable feature; the ith dimension of the vector
represents whether the black sample file indexed by i can be
detected with the feature.
[0064] For example, the vector generated by feature A is 1:1, 2:0,
3:1, 4:1, 5:0, 6:0. This represents that the feature may detect
three files indexed by 1, 3, 4.
[0065] Steps:
[0066] initializing a set SA, which is compared sequentially with
the vector Mi of each feature;
[0067] if SA contains Mi, continuing to compare with the next
feature vector;
[0068] otherwise, getting a union of SA and Mi, and then continuing
to compare with the next feature vector.
[0069] For example, the vectors generated by features A, B, C, and
D are specified below:
[0070] A: 1:0, 2:0, 3:1, 4:1, 5:0, 6:0
[0071] B: 1:1, 2:1, 3:1, 4:0, 5:0, 6:1
[0072] C: 1:1, 2:1, 3:1, 4:1, 5:0, 6:0
[0073] D: 1:1, 2:0, 3:1, 4:1, 5:1, 6:0
[0074] First Step:
[0075] Comparing vectors of A, B; because A does not contain B,
getting the union of A and B to obtain a detected vector as AB:
1:1, 2:1, 3:1, 4:1, 5:0, 6:1;
[0076] Second Step:
[0077] Comparing AB with C; because every file that may be detected
by C can already be detected by AB, abandoning C;
[0078] Repeating the Second Step:
[0079] Comparing AB with D; because D may detect file 5 while AB
cannot, getting a union of AB and D;
[0080] Namely, ABD: 1:1, 2:1, 3:1, 4:1, 5:1, 6:1.
[0081] If a feature E follows, ABD is compared with feature E in
the same manner as in the second step.
[0082] For four vectors A, B, C, and D, the finally chosen features
are A, B, D.
[0083] In this way, a short feature set with which the M files may
be detected is obtained.
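The worked example with features A, B, C, and D can be reproduced
with the following sketch of steps S201 to S208. Each vector is
represented as the set of file indexes whose entry is 1. The
function name optimize_features and the assumption that the set SA
starts empty are made for illustration.

```python
def optimize_features(vectors):
    """Sequentially merge feature vectors into a covered set.

    vectors: dict mapping a feature to the set of file indexes it
    can detect; features are visited in insertion order.
    Returns (reserved_features, covered_files).
    """
    covered = set()   # the set SA, assumed to start empty (S202)
    reserved = []
    for feature, detected in vectors.items():
        if detected <= covered:
            # S205: the set already contains this vector.
            continue
        # S206: take the union of the set and the compared vector.
        covered |= detected
        reserved.append(feature)
    return reserved, covered


vectors = {
    "A": {3, 4},
    "B": {1, 2, 3, 6},
    "C": {1, 2, 3, 4},
    "D": {1, 3, 4, 5},
}
reserved_features, covered_files = optimize_features(vectors)
```

The result depends on the order in which the features are visited,
so this sequential procedure reduces the feature set but is not
guaranteed to find the smallest possible one.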
[0084] Thus by analyzing and computing the acquired black sample
files and white sample files and counting the times that a check
code of each function appears in the files, the embodiments of the
present invention only use the functions appearing in the black
sample files while not appearing in the white sample files as the
basis for feature extraction. In this way, fast and accurate
feature extraction helps guarantee the building of an efficient
feature library and the implementation of the protection
technology. Preferably, the features may be optimized so as to
detect the most files with the fewest features after a large amount
of extractable black sample features is acquired.
[0085] Corresponding to the method above, the embodiments of the
present invention further provide a device for feature extraction.
The device may be implemented by software, hardware or a
combination of software and hardware. Specifically, the device may
be a terminal device or a functional entity inside the device. For
example, the device may be a functional module inside the mobile
phone. Preferably, the device runs under the Android operating
system.
[0086] The feature extracting device comprises:
[0087] a file acquiring unit 301 configured to acquire a batch of
black sample files and white sample files from an application layer
of a smart terminal operating system;
[0088] a parsing unit 302 configured to parse each file to obtain
information structure of all functions contained in each file;
[0089] a check code computing unit 303 configured to compute a
check code of each function;
[0090] a counting unit 304 configured to determine whether each
file contains functions corresponding to respective check codes so
as to count times that each function appears in the black sample
files and white sample files;
[0091] an extracting unit 305 configured to extract black sample
features based on functions only appearing in the black sample
files while not appearing in the white sample files, or extract
white sample features based on functions only appearing in the
white sample files while not appearing in the black sample
files.
[0092] Preferably, the device further comprises a feature
optimization unit 306 configured to, for the file sets
corresponding to different features, if one file set contains all
files in another file set, reserve the feature corresponding to the
file set with the larger scope while abandoning the feature
corresponding to the file set with the smaller scope. For example,
when the first file set comprises all files in the second file set,
the feature optimization unit 306 reserves a first feature
corresponding to the first file set, while abandoning a second
feature corresponding to the second file set.
[0093] Alternatively, the device further comprises a feature
optimization unit 306 configured to: establish a vector for each
feature with respect to all files; initialize a set to be compared
sequentially with the vector of each feature; if the set contains
the compared vector, reserve the set; if the set does not contain
the compared vector, get a union of the set and the compared
vector; and, after the vectors of all features have been compared
sequentially, take the features contained in the finally obtained
set as the last reserved features.
[0094] Preferably, the device further comprises: an inner
de-duplicating unit 307 configured to perform intra-file
de-duplication on the check codes of the functions. For example,
the inner de-duplicating unit 307 is specifically configured to,
for each file, if a plurality of functions have the same check
code, extract one function from the plurality of functions as the
function corresponding to the check code.
[0095] Wherein, the black sample files and the white sample files
are all virtual machine executable files; the parsing unit 302 is
specifically configured to decompile the virtual machine executable
file to obtain a decompiled information structure of all functions
contained in the virtual machine executable file.
[0096] Wherein, the check code computing unit 303 is specifically
configured to compute a hash value of the information structure of
the function to use the hash value as the check code of the
function.
[0097] Wherein, the parsing unit 302 is further configured to parse
the virtual machine executable file according to format of the
virtual machine executable file to obtain the function information
structure of each class; determine a position and size of each
function of the virtual machine executable file according to fields
in the function information structure, and obtain the decompiled
function information structure of each function.
[0098] The parsing unit 302 is further configured to parse the
function information structure to obtain a bytecode array field
indicating the function position of the virtual machine executable
file and a list length field indicating the function size of the
virtual machine executable file; and determine a position and size
of the function of the virtual machine executable file based on the
bytecode array field and the list length field.
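As an illustrative sketch of determining the position and size of a
function's bytecode, the code below reads the fixed header of a
code_item structure as laid out in the published Dalvik executable
format. Treating the "bytecode array field" as the insns array and
the "list length field" as insns_size is an assumption, as are the
helper names; code_off is assumed to be the already known offset of
the code_item within the file.

```python
import struct

# Fixed header of a Dalvik code_item: four ushort fields
# (registers_size, ins_size, outs_size, tries_size), then two uint
# fields (debug_info_off, insns_size), all little-endian.
CODE_ITEM_HEADER = struct.Struct("<HHHHII")


def locate_bytecode(dex_bytes, code_off):
    """Return (position, size_in_bytes) of a function's bytecode."""
    fields = CODE_ITEM_HEADER.unpack_from(dex_bytes, code_off)
    insns_size = fields[5]   # length counted in 16-bit code units
    position = code_off + CODE_ITEM_HEADER.size
    return position, insns_size * 2


def function_bytes(dex_bytes, code_off):
    """Return the raw bytecode of the function at code_off."""
    position, size = locate_bytecode(dex_bytes, code_off)
    return dex_bytes[position:position + size]
```

The returned bytes could then be passed to a hash algorithm to
obtain the check code of the function.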
[0099] The parsing unit 302 is specifically configured to decompile
the virtual machine executable file into a virtual machine bytecode
using a virtual machine executable file decompilation tool.
[0100] Wherein, the extracting unit 305 is configured to take a
function that only appears in the black sample file while not
appearing in the white sample file as the black sample feature, or
take a part of code of the function that only appears in the black
sample file while not appearing in the white sample file as the
black sample feature; or,
[0101] take a function that only appears in the white sample file
while not appearing in the black sample file as the white sample
feature, or take a part of code of the function that only appears
in the white sample file while not appearing in the black sample
file as the white sample feature.
[0102] Preferably, the device further comprises: a feature library
adding unit 308 configured to add a black sample feature into the
black sample feature library, and a matching unit 309 configured to
match a target file using the black sample feature library; if the
target file contains a function or a subset of functions
corresponding to the black sample feature, determine that malicious
code exists in the target file. Wherein, the matching unit 309 may
specifically perform sample feature detection and removal, virtual
machine-based detection and removal, heuristic detection and
removal, and/or similar samples clustering on the target file using
the function corresponding to the black sample feature in the black
sample feature library.
[0103] Wherein, the black sample file refers to a file
preliminarily determined as containing a black sample, while the
white sample file refers to a file preliminarily determined as not
containing a black sample.
[0104] Wherein, the file acquiring unit 301 is specifically
configured to find an installation package of an application from
an application layer of a smart terminal operating system; parse
the installation package to obtain a virtual machine executable
file of the application; and take the virtual machine executable
file as a black sample file or a white sample file.
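Obtaining the virtual machine executable file from an installation
package can be sketched as follows, relying on the fact that an
Android installation package (APK) is a ZIP archive. The function
name extract_dex_files is hypothetical, and locating the package on
the device is outside the scope of this sketch.

```python
import zipfile


def extract_dex_files(apk):
    """Return {entry name: bytes} for every .dex entry in a package.

    apk: path to the installation package, or a file-like object
    opened in binary mode.
    """
    with zipfile.ZipFile(apk) as package:
        return {
            name: package.read(name)
            for name in package.namelist()
            if name.endswith(".dex")
        }
```

Each returned byte string may then be labeled as a black sample
file or a white sample file and passed on to the parsing step.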
[0105] Regarding specific implementations of the device, the method
embodiments may be referenced, which will not be detailed here.
[0106] The algorithms and displays provided here are not inherently
related to any specific computer, virtual system or other device.
Various general-purpose systems may also be used with the teaching
herein. From the depiction above, the structure required for
building such a system is obvious. In addition, the present
invention is not directed to any specific programming language. It
should be understood that various programming languages may be
utilized to implement the content of the present invention depicted
here, and the depiction above with respect to a specific language
is for disclosing the preferred embodiments of the present
invention.
[0107] The specification provided here illustrates many specific
details. However, it should be understood that the embodiments of
the present invention may be implemented without these specific
details. In some embodiments, known methods, structure and
technologies are not illustrated in detail so as not to blur the
understanding of the present invention.
[0108] Similarly, it should be understood that in order to simplify
the present disclosure and facilitate understanding one or more of
various invention aspects, in the depiction of the exemplary
embodiments of the present invention above, respective features of
the present invention are sometimes grouped into a single
embodiment, a figure or a depiction of the figure. However, the
method of the present disclosure should not be interpreted as
reflecting the intention that the present invention as claimed
requires more features than those explicitly stated in each claim.
More specifically, as reflected by the claims below, the inventive
aspects lie in less than all features of a single embodiment
disclosed above. Therefore, the claims conforming to a specific
embodiment are thereby explicitly incorporated in the specific
embodiment, wherein each claim per se is used as a standalone
embodiment of the present invention.
[0109] Those skilled in the art may understand that modules in a
device in an embodiment may be adapted and provided in one or more
devices different from the embodiment. Modules or units or
components in an embodiment may be combined into one module or unit
or assembly; besides, they may also be divided into a plurality of
sub-modules or sub-units or sub-assemblies. Unless at least some of
such features and/or processes or units are mutually exclusive, all
features disclosed in the specification (including the appended
claims, abstract and drawings) and all processes or units of any
method or device so disclosed may be combined in any combination.
Unless otherwise explicitly stated, each feature disclosed in the
present specification (including the appended claims, abstract, and
drawings) may be replaced by an alternative feature serving the
same, equivalent or similar purpose.
[0110] Besides, those skilled in the art can understand that
although some embodiments depicted here contain some features,
rather than other features, contained in other embodiments, a
combination of features from different embodiments is meant to be
within the scope of the present invention and to form a different
embodiment. For example, in the appended claims, any one of the
embodiments as claimed may be used in any combination.
[0111] Various component embodiments of the present invention may
be implemented by hardware or by software modules running on one or
more processors, or implemented by their combination. Those skilled
in the art should understand that in practice, a microprocessor or
a digital signal processor (DSP) may be used to implement some or
all functions of some or all components of the device for feature
extraction according to the embodiments of the present invention.
The present invention may also be implemented as a device or device
program (e.g., a computer program and a computer program product)
for implementing a part or all of the method described here. Such a
program for implementing the present invention may be stored on a
computer readable medium, or may have a form of one or more
signals. Such signals may be downloaded from an Internet website,
or provided on a carrier signal, or provided in any other form.
[0112] For example, FIG. 4 illustrates a smart electronic device
for executing the method for feature extraction according to the
present invention. The smart electronic device conventionally comprises a
processor 410 and a computer program product or a computer readable
medium in a form of memory 420. The memory 420 may be an electronic
storage such as a flash memory, an EEPROM (Electrically Erasable
Programmable Read-Only Memory), an EPROM, a hard disk or a ROM. The
memory 420 has a storage space 430 with program codes 431 for
executing any method steps in the method. For example, the storage
space 430 for program code may contain various program codes 431
for implementing respective steps in the methods above,
respectively. These program codes may be read out from one or more
computer program products or written into one or more such computer
program products. These computer program products contain program code
carriers such as a hard disk, a compact disk (CD), a memory card or
a floppy disk and the like. Such computer program product is
generally a portable or fixed storage unit as depicted with
reference to FIG. 5. The storage unit may have a storage segment, a
storage space and the like, in a similar arrangement to the memory
420 in the smart electronic device of FIG. 4. The program
code may, for example, be compressed in any appropriate form.
Generally, the storage unit contains a computer readable code 431',
i.e., codes that may be read by a processor such as the processor
410. These codes, when executed by the smart electronic device,
cause the device to execute various steps of the methods depicted
above.
[0113] It should be noted that the embodiments above are intended
to illustrate the present invention, not intended to limit the
present invention; moreover, without departing from the scope of
the appended claims, those skilled in the art may design
alternative embodiments. In the claims, no reference numerals
contained within parentheses should constitute a limitation to the
claims. The word "comprise" does not exclude elements or steps not
stated in the claims. Wording like "a" or "an" before an element
does not exclude existence of a plurality of such elements. The
present invention may be implemented by virtue of hardware
including a plurality of different elements and an appropriately
programmed computer. In a device claim listing several means,
several of such means may be embodied through the same hardware
item. Use of words like first, second, third, etc. does not
indicate any sequence. These words may be explained as names.
* * * * *