U.S. patent application number 12/371892 was filed with the patent office on 2009-02-16 and published on 2009-06-04 for system and program product for determining Java software code plagiarism and infringement.
This patent application is currently assigned to International Business Machines Corporation. Invention is credited to Steven Edward Atkin, Margot Bruce Casey, Tyron Jerrod Stading.
Publication Number | 20090144702 |
Application Number | 12/371892 |
Family ID | 35515508 |
Filed Date | 2009-02-16 |
Publication Date | 2009-06-04 |
United States Patent Application | 20090144702 |
Kind Code | A1 |
Atkin; Steven Edward; et al. | June 4, 2009 |
System And Program Product for Determining Java Software Code
Plagiarism and Infringement
Abstract
An analysis tool extracts class data from Java objects within a
potential plagiarizing Java program and the original Java program,
and then compares classes common to the potential plagiarizing
program and the original across various performance metrics.
Similarities disclosed by the analysis tool are output for user or
programmatic comparison of the matches between the potentially
plagiarizing Java program byte code and the original program's byte
code.
Inventors: | Atkin; Steven Edward (Austin, TX); Casey; Margot Bruce (Austin, TX); Stading; Tyron Jerrod (Austin, TX) |
Correspondence Address: |
DILLON & YUDELL LLP
8911 N. CAPITAL OF TEXAS HWY., SUITE 2110
AUSTIN, TX 78759
US |
Assignee: | International Business Machines Corporation, Armonk, NY |
Family ID: | 35515508 |
Appl. No.: | 12/371892 |
Filed: | February 16, 2009 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
10881973 | Jun 30, 2004 | 7493596 |
12371892 | | |
Current U.S. Class: | 717/121 |
Current CPC Class: | G06Q 50/18 20130101; G06F 21/16 20130101; G06F 8/71 20130101 |
Class at Publication: | 717/121 |
International Class: | G06F 9/44 20060101 G06F009/44 |
Claims
1. A system for evaluating Java software to determine similarity
with other Java software comprising: means for extracting class
data from a software program; means for extracting class data from
an original software program; means for determining class data of
the software program matching the class data from the original
software program; means for outputting the comparison as a function
of class data; and means for determining a similarity of the
software program to the original software program based on the
comparison output.
2. The system according to claim 1, wherein the class data includes
method signatures, byte codes and byte arguments.
3. The system according to claim 1, wherein the software programs
are determined to be similar according to user input based on the
comparison output.
4. The system according to claim 1, wherein the software programs
are determined to be similar according to statistical analysis
based on the comparison output.
5. The system according to claim 1, further comprising means for
extracting class data from Java objects in the software
program.
6. The system according to claim 1, wherein the class data is taken
directly from a binary class file for an object in the software
program.
7. An article of manufacture comprising machine-readable medium
including program logic embedded therein for evaluating Java
software to determine similarity with other Java software that
causes control circuitry in a data processing system to perform the
steps of: extracting class data from a software program; extracting
class data from an original software program; determining class
data of the software program matching the class data from the
original software program; outputting the comparison as a function
of class data; and determining a similarity of the software program
to the original software program based on the comparison
output.
8. The article of manufacture according to claim 7, wherein the
class data includes method signatures, byte codes and byte
arguments.
9. The article of manufacture according to claim 7, wherein the
software programs are determined to be similar according to user
input based on the comparison output.
10. The article of manufacture according to claim 7, wherein the
software programs are determined to be similar according to
statistical analysis based on the comparison output.
11. The article of manufacture according to claim 7, further
comprising the step of extracting class data from Java objects in
the software program.
12. The article of manufacture according to claim 7, wherein the
class data is taken directly from a binary class file for an object
in the software program.
Description
PRIORITY CLAIM
[0001] The present application is a continuation of, and claims
priority to, U.S. patent application Ser. No. 10/881,973 filed on
Jun. 30, 2004, entitled METHOD AND SYSTEM PROGRAM PRODUCT FOR
DETERMINING JAVA SOFTWARE CODE PLAGIARISM AND INFRINGEMENT.
CROSS-REFERENCE TO RELATED APPLICATIONS
[0002] The present application is related to the following
co-pending U.S. patent applications filed on Jun. 30, 2004, and
incorporated herein by reference in their entirety.
[0003] Ser. No. 10/881,967 entitled "METHOD, SYSTEM AND PROGRAM
PRODUCT FOR DETERMINING STANDARD JAVA OBJECTS".
[0004] Ser. No. 10/881,969 entitled "METHOD, SYSTEM AND PROGRAM
PRODUCT FOR EVALUATING JAVA SOFTWARE BEST PRACTICES ACROSS MULTIPLE
VENDORS".
[0005] Ser. No. 10/881,968 entitled "METHOD, SYSTEM AND PROGRAM
PRODUCT FOR OPTIMIZING JAVA APPLICATION PERFORMANCE THROUGH DATA
MINING".
BACKGROUND OF THE INVENTION
[0006] 1. Field of the Invention
[0007] The present invention relates to computer systems, software
architectures and programs, and more particularly to a method,
system and computer program product for evaluating Java programs to
determine code plagiarism.
[0008] 2. Description of the Related Art
[0009] Java is a robust, portable object-oriented programming
language developed by Sun Microsystems, Inc., that is gaining wide
acceptance for writing code for the Internet and World Wide Web
(hereinafter, "Web"). The nature of Java programming allows
programmers to easily decompile Java code and review its source
code. As such, competitors are able to obtain software from other
vendors and review their source code as part of their process of
developing their own Java programs. It is a simple matter for the
Java source code itself to be copied into the new software program
being developed. This copying of the Java source code infringes on
the copyright of the original author of the Java source code.
Unfortunately, there are few tools available to discover such code
plagiarism, and the typical detection devices that are available
are easily fooled when the plagiarizing source code is slightly
modified or changed from the original source code.
[0010] This difficulty in detecting code plagiarism is further
complicated when source code for the infringing software is not
available. There are a variety of obfuscation programs that
limit access to source code of a program. Without viewing the
original source code, it is extremely difficult to determine if
program code has been copied. Using artificial intelligence
algorithms, software can analyze source code for keywords,
patterns, and compare them to known software programs. If
duplication of a known software program is found within a certain
probability, the new code is considered to be copied or duplicated.
However, such a technique has heretofore only been capable of detecting
such patterns in source code.
[0011] Accordingly, it would be valuable to provide a tool for
detecting code plagiarism and intellectual property infringement of
a Java program, even when the original code has been modified
somewhat by the plagiarist. Such a tool would have particular value
if it could detect code plagiarism even if the Java source code
was not available.
SUMMARY OF THE INVENTION
[0012] In accordance with the present invention, improved methods,
systems and articles of manufacture for evaluating Java software to
determine code plagiarism and infringement are disclosed. In
one embodiment of the present invention, an analysis tool extracts
class data from a potentially infringing software program and an
original program, and then determines what class data of the
software program matches the class data from the original software
program. The analysis tool then outputs the comparison as a
function of class data. The tool or user can then determine the
similarity of the software program to the original software program
based on the comparison output.
[0013] All objects, features, and advantages of the present
invention will become apparent in the following detailed written
description.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] This invention is described in a preferred embodiment in the
following description with reference to the drawings, in which like
numbers represent the same or similar elements, as follows:
[0015] FIG. 1 is a block diagram of a general-purpose computer
system to which the present invention may be applied.
[0016] FIG. 2A is a block diagram of a software architecture for the
computer system, in accordance with the preferred embodiment of the
present invention.
[0017] FIG. 2B is a logical block diagram of a Java file method
entry in a database, in accordance with the preferred embodiment of
the present invention.
[0018] FIG. 2C shows a graphical representation of a Java class
file as viewed by the Byte Code Engineering Library.
[0019] FIG. 3 is a flow diagram of a process for decompiling a Java
archive that contains Java classes for use in the pre-analysis
process, in accordance with the preferred embodiment of the present
invention.
[0020] FIG. 4 shows a screen shot of the graphical user interface
of the Byte Code Inspector tool, in accordance with the preferred
embodiment of the present invention.
[0021] FIG. 5 shows a flow diagram for evaluating the original
program's and the potential infringer's implementations of a Java
class, in accordance with a preferred embodiment of the present
invention.
[0022] FIG. 6 shows an example of database entries indicating
various design metrics associated with the potentially plagiarizing
Java program and the original program, in accordance with a
preferred embodiment of the present invention.
[0023] FIG. 7 shows a flow diagram of a process for comparing a
vendor's program with an original program to determine if the
vendor's program plagiarizes source code of the original program,
in accordance with a preferred embodiment of the present
invention.
[0024] FIG. 8 shows an example of an output of such a visualization
tool showing the methods compared and the number of matching byte
codes for each such method across the two compared programs, in
accordance with the preferred embodiment of the present
invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0025] In a preferred embodiment, the present invention provides a
facility to analyze Java byte code to determine if the code has
been plagiarized from another source program. This is accomplished
by an analysis tool performing a pre-analysis to train itself on
particular programming metrics of the original program, and then
performing a dynamic analysis of the potentially plagiarizing or
infringing Java program to determine if its byte code matches the
criteria or programming metrics extracted from the original
program's byte code. The function of the analysis tool in a typical
software environment is described below.
[0026] In the following detailed description of exemplary
embodiments of the invention, reference is made to the accompanying
drawings, which form a part hereof, and in which is shown by way of
illustration specific exemplary embodiments in which the invention
may be practiced. These embodiments are described in sufficient
detail to enable those skilled in the art to practice the
invention, and it is to be understood that other embodiments may be
utilized and that logical, mechanical, electrical and other changes
may be made without departing from the spirit or scope of the
present invention. The following detailed description is,
therefore, not to be taken in a limiting sense, and the scope of
the present invention is defined only by the appended claims. In
particular, although the preferred embodiment is described below
with respect to a Java software system, it will be appreciated that
the present invention is not limited to Java implementations.
[0027] FIG. 1 is a block diagram of a general-purpose computer
system 100 to which the present invention may be applied. The
computer system 100 includes at least one processor (CPU) 102
operatively coupled to other components via a system bus 104. A
read only memory (ROM) 106, a random access memory (RAM) 108, a
display adapter 110, an I/O adapter 112, and a user interface
adapter 114 are coupled to system bus 104.
[0028] Display adapter 110 operatively couples a display device 116
to system bus 104. A disk storage device (e.g., a magnetic or
optical disk storage device) 118 is operatively coupled to system
bus 104 by I/O adapter 112. User interface adapter 114 operatively
couples a mouse 120 and keyboard 124 to system bus 104. One or more
objects are created when an Object-Oriented Program (not shown) is
executed in computer system 100.
[0029] The present invention has particular utility in Java
applications. Java is a robust, portable object-oriented
programming language developed by Sun Microsystems, Inc. Java
attains its portability through use of a virtual machine or "Java
Virtual Machine", or "JVM". The JVM enables isolating the details
of the underlying hardware from the compiler used to compile the
Java programming instructions.
[0030] Java applications are typically constructed using a
development toolkit such as the "JDK" (Java Development Kit)
product from Sun Microsystems, and are executed using the "JRE"
(Java Runtime Environment) product, also from Sun Microsystems.
Programs are executed from the command line when using the JRE. The
Java Runtime Environment includes the JVM, as well as a number of
files and classes that are required to run Java applications or
applets. Hereinafter, the terms "JVM" and "runtime environment"
will be used interchangeably unless otherwise noted.
[0031] Java program source code typically consists of a set of
class definitions stored in a series of text files. A Java class
contains both methods and data. Java source code is compiled into
binary code, referred to as Java "byte code." Byte codes are
machine independent so that they can be executed on any machine's
JVM, where the JVM is tailored to the specific operating
environment. After the source code files are compiled into Java
byte code class files, they are then typically loaded into memory
by a class loader for interpretation by a JVM interpreter before
the associated program is executed. Class loading can also occur
when a Java program dynamically attempts to load another class at
run time. The Java class loader uses a predetermined search
strategy when locating class files, which gives precedence to
particular locations. According to the Java 1.2 platform
specification, the highest priority search locations are the
bootstrap run time and internationalization classes in the JRE,
named "rt.jar" and "I18N.jar," respectively. The Java Archive (JAR)
or ".jar" file is a repository that contains one or more Java class
files and is simply called a "Java jar" or "JAR". JAR files provide
compression, backward compatibility with existing Java applets,
portability, and security features.
[0032] With reference now to FIG. 2A, there is shown a block
diagram of a software architecture for computer system 100, in
accordance with the preferred embodiment of the present invention.
In a preferred embodiment, an analysis tool written in Java, called
the Byte Code Inspector (BCI) 200, is executing in CPU 102. BCI 200
accesses database 202, which is contained within disk storage
device 118, for the storage and retrieval of analysis data.
Database Mining Tool (DMT) 204 is also executing in CPU 102. In a
preferred embodiment, a commercially available database mining tool
such as Enterprise Miner produced by SAS, Inc. may implement DMT
204.
[0033] BCI 200 decompiles a known Java class or JAR file (known
Java files 208) to extract specific method information to be placed
in database 202. FIG. 2B is a logical block diagram of a Java file
method entry 203 in database 202. In a preferred embodiment, method
entry 203 stores method information for a specific class of a Java
file, including the vendor 214, method signature 216, byte code 218
and byte arguments 220. A number of method entries 203 are created
in database 202 corresponding to each Java class contained in known
Java files 208.
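By way of illustration only, a method entry 203 of the kind shown in
FIG. 2B might be modeled as a small Java value class. The class name
MethodEntry, its field names, and the use of int arrays to hold the
byte codes and byte arguments are assumptions made for this sketch;
they are not taken from the described embodiment.

```java
import java.util.Arrays;

/** Hypothetical model of a method entry 203 (all names are assumptions). */
final class MethodEntry {
    final String vendor;          // vendor 214
    final String methodSignature; // method signature 216
    final int[] byteCodes;        // byte code 218, one opcode per element
    final int[] byteArguments;    // byte arguments 220, operand bytes

    MethodEntry(String vendor, String methodSignature,
                int[] byteCodes, int[] byteArguments) {
        this.vendor = vendor;
        this.methodSignature = methodSignature;
        // Copy the arrays so later changes by the caller do not leak in.
        this.byteCodes = byteCodes.clone();
        this.byteArguments = byteArguments.clone();
    }

    /** Key used to pair the same method across two vendors. */
    String key() {
        return methodSignature;
    }

    @Override
    public String toString() {
        return vendor + " | " + methodSignature + " | "
                + Arrays.toString(byteCodes) + " | "
                + Arrays.toString(byteArguments);
    }
}
```

One such object would be created per method of each class, mirroring
the one-entry-per-method layout described for database 202.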
[0034] BCI 200 retrieves the methods of a class being analyzed by
calling a publicly available toolkit called the byte code engineering
library (BCEL) 206. In a preferred embodiment, BCI 200 has been
implemented using a publicly available byte code engineering
library such as the open source Byte Code Engineering Library,
which is available from The Apache Software Foundation on the
Internet at http://jakarta.apache.org/index.html. This library is
used to separate a Java class file into its constituent parts. FIG.
2C shows a graphical representation of a Java class file as viewed
by the BCEL 206. As can be seen in FIG. 2C, the library separates
the byte code for class 225 into various attributes, including a
header 230, constant pool 235, access rights 240, implemented
interfaces 245, fields 250, methods 255 and class attributes
260.
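The leading portion of the class file layout of FIG. 2C can be read
with the standard java.io classes alone. The following is a minimal
sketch and is not the BCEL implementation; the class name
ClassFileHeader and its members are hypothetical. It reads only the
magic number, the two version numbers and the constant pool count,
which occupy the first ten bytes of a class file.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical reader for the first ten bytes of a class file. */
final class ClassFileHeader {
    final int minorVersion;
    final int majorVersion;
    final int constantPoolCount;

    private ClassFileHeader(int minor, int major, int cpCount) {
        this.minorVersion = minor;
        this.majorVersion = major;
        this.constantPoolCount = cpCount;
    }

    /** Parses the header; rejects data lacking the 0xCAFEBABE magic. */
    static ClassFileHeader parse(byte[] classBytes) {
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(classBytes))) {
            int magic = in.readInt();             // u4 magic
            if (magic != 0xCAFEBABE) {
                throw new IllegalArgumentException("not a Java class file");
            }
            int minor = in.readUnsignedShort();   // u2 minor_version
            int major = in.readUnsignedShort();   // u2 major_version
            int cpCount = in.readUnsignedShort(); // u2 constant_pool_count
            return new ClassFileHeader(minor, major, cpCount);
        } catch (IOException e) {
            // Truncated input surfaces here as an EOFException.
            throw new UncheckedIOException(e);
        }
    }
}
```

The remaining parts named above (access rights, interfaces, fields,
methods and attributes) follow the constant pool and are left to a
full library such as BCEL 206.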
[0035] With reference now to FIG. 3, there is shown a flow diagram
of a process implemented by BCI 200 for unpacking a Java archive
for use in the pre-analysis process 800 (see FIG. 5), in accordance
with the preferred embodiment of the present invention. As shown in
FIG. 3, process 300 begins at step 302, when the user loads the
known Java files 208 into BCI 200. At step 304, BCI 200 extracts
all class files contained in the known Java files 208 by calling
Java's zip file utility class (shown at 306) to obtain the list of
classes from the known Java files 208. At step 308, BCI 200 obtains
all methods for each class extracted from the known Java files 208
(at step 304) by calling the BCEL 206 to get the corresponding list
of methods known to be contained within each class (shown at 310).
At step 312, BCI 200 extracts the method signature, byte codes,
byte arguments and other fields for each method (shown at 314) by
calling the BCEL 206. Thereafter, process 300 ends at step 316,
when BCI 200 stores the method information 214-220 for each method
of each class extracted at step 312 in database 202 within method
entry 203.
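Step 304 can be sketched with the java.util.zip package of the
standard Java library. This is a minimal sketch under stated
assumptions: the class name JarClassLister and the method
listClassNames are hypothetical, and the archive is read from a
stream so that the example does not depend on any particular file.
Steps 308 and 312, which rely on BCEL 206, are not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper that lists the classes stored in a JAR. */
final class JarClassLister {
    /** Returns the names of all .class entries, in archive order. */
    static List<String> listClassNames(InputStream jarStream) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(jarStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String path = entry.getName();
                if (!entry.isDirectory() && path.endsWith(".class")) {
                    // "java/lang/Object.class" becomes "java.lang.Object"
                    String stem = path.substring(
                            0, path.length() - ".class".length());
                    names.add(stem.replace('/', '.'));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

A caller holding a JAR on disk could pass a java.io.FileInputStream
for that file to listClassNames.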
[0036] FIG. 4 shows a screen shot of the graphical user interface
window (GUI) 400 of BCI 200. FIG. 4 shows an example where "rt.jar"
has been decompiled by BCI 200 to extract all class files 402
within the rt.jar archive, and then to further extract all methods
404 within the class files 402. For example, the
"java.lang.Comparable" 405 and "java.lang.ClassLoader" 406 class
files 402 are shown within left windowpane 412. Additionally, the
"java.lang.ClassLoader" class 406 is shown expanded in the GUI 400,
thereby listing all of its methods 404 below the expanded class
file 406. For example, the "protected void <init>(ClassLoader
arg1)" method 408 is listed as one of the methods 404 of the
ClassLoader class 406. All byte codes extracted from the
user-selected method in the left hand windowpane 412 of the GUI 400
are listed in the right hand windowpane 414. For example, as shown
in FIG. 4, if the user selects the "protected void
<init>(ClassLoader arg1)" method 408 (e.g., using a mouse)
the constituent byte codes are shown in the right hand windowpane
414.
[0037] With reference now to FIG. 5, there is shown a flow diagram
for evaluating the original program's and the potential infringer's
implementations of a Java class, in accordance with a preferred
embodiment of the present invention. In a preferred process, a
common Java class is selected from among the byte code of each
program for analysis. As seen in FIG. 5, process 800 begins at step
802, where a database (database 202) of byte codes for the program
to be analyzed is created. In a preferred embodiment, a BCEL from
an open source provider may be accessed to provide the byte code
database for the potential infringing program. At step 804, BCI 200
extracts one or more classes that are common to the two programs
from byte code database 202. In a preferred embodiment, an rt.jar
for each program (known Java files 208) is accessed. Then, at step
806, BCI 200 performs process 300 to find all the methods that are
common for each of the vendor implementations within each of the
classes identified at step 804. Here, using the database created in
step 802, BCI 200 breaks the byte codes of the found methods into
multiple pieces, whereby method attributes are stored in
association with the original vendor program and the potential
infringing program in database 202.
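The selection of common methods at step 806 amounts to a set
intersection over method identifiers. The following sketch assumes,
purely for illustration, that each program has already been reduced
to a map from a "Class.methodSignature" string to that method's
opcodes; the class name CommonMethods and this map layout are
hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for step 806: methods present in both programs. */
final class CommonMethods {
    /**
     * Each map is keyed by "Class.methodSignature" and holds the opcodes
     * of that method. The result is sorted to give a stable order.
     */
    static Set<String> find(Map<String, int[]> original,
                            Map<String, int[]> compared) {
        Set<String> common = new TreeSet<>(original.keySet());
        common.retainAll(compared.keySet()); // keep keys found in both
        return common;
    }
}
```

Only the methods returned by find would be carried forward to the
metric analysis and byte code comparison.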
[0038] At step 808, BCI 200 analyzes each of the methods found in
step 806 for programming metrics of interest, such as the
size of the code, security implemented, performance provided, etc.
At step 810, each of the analyzed classes, dependencies and
metadata are stored in a database. FIG. 6 shows an example of
database 202 entries indicating the type of data stored at step
810. The database 900 shows data for the original vendor's program
916 and the compared vendor's program 918, and shows columns for
vendor 902, class 904, method 906, number of lines of code 908,
number of native calls 910, number of try/catch blocks 912, and
number of referred classes for the analyzed program 914, etc., for
example.
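A row of the kind shown in FIG. 6 might be held in a small value
class, one instance per vendor and method. The class name MetricsRow
and its field names are assumptions made for this sketch, and the
agreeingMetrics helper is an illustrative addition that is not part
of the described embodiment.

```java
/** Hypothetical row of the metrics table of FIG. 6. */
final class MetricsRow {
    final String vendor;       // column 902
    final String className;    // column 904
    final String method;       // column 906
    final int linesOfCode;     // column 908
    final int nativeCalls;     // column 910
    final int tryCatchBlocks;  // column 912
    final int referredClasses; // column 914

    MetricsRow(String vendor, String className, String method,
               int linesOfCode, int nativeCalls,
               int tryCatchBlocks, int referredClasses) {
        this.vendor = vendor;
        this.className = className;
        this.method = method;
        this.linesOfCode = linesOfCode;
        this.nativeCalls = nativeCalls;
        this.tryCatchBlocks = tryCatchBlocks;
        this.referredClasses = referredClasses;
    }

    /** Counts how many of the four numeric metrics are equal in both rows. */
    int agreeingMetrics(MetricsRow other) {
        int n = 0;
        if (linesOfCode == other.linesOfCode) n++;
        if (nativeCalls == other.nativeCalls) n++;
        if (tryCatchBlocks == other.tryCatchBlocks) n++;
        if (referredClasses == other.referredClasses) n++;
        return n;
    }
}
```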
[0039] With reference now to FIG. 7, there is shown a flow diagram
of a process for comparing a vendor's program with an original
program to determine if the vendor's program plagiarizes source
code of the original program, in accordance with an alternative
preferred embodiment of the present invention. Process 1000 begins
at step 1002, where the pre-processed byte code information from
each of the original vendor program and the vendor program to be
compared are retrieved from database 202 by BCI 200. At step 1004,
BCI 200 converts the retrieved byte code information into a matrix
format to facilitate a comparison of the byte codes. In a preferred
embodiment, BCI 200 compiles the byte code information stored
within database 202 into a table or matrix, for example, in the
form of FIG. 6. At step 1006, BCI 200 performs a comparison of each
common method within the matrix 900. For example, each common
method shown in column 906 of FIG. 6 is compared at step 1006. As
shown in FIG. 6, the "findCLASS" method would be compared between
the original vendor (Row 916) and the compared vendor (Row 918).
For each method compared at step 1006, BCI 200 would execute some
type of text mining algorithm (signified by function 1008) on the
two matrixes for each of the common methods for the two programs
(original and compared) to determine similarities between the
methods. In a preferred embodiment, a clustering tool kit (CLUTO)
that operates by running clustering algorithms on the byte codes is
utilized for this step. At step 1008, the actual byte code that
comprises the methods being compared is loaded in the text mining
tool to determine similarities (based on clustering,
classification, association, etc.).
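As a simplified stand-in for the text mining of function 1008, the
number of matching byte codes between two versions of a method can
be estimated with a longest common subsequence over their opcode
sequences. This is not CLUTO and is not a clustering algorithm; it
is one assumed way to obtain a match count that tolerates inserted
or removed instructions. The class and method names are
hypothetical.

```java
/** Hypothetical byte code comparison using a longest common subsequence. */
final class ByteCodeComparer {
    /**
     * Returns the length of the longest sequence of opcodes that appears,
     * in the same order but not necessarily contiguously, in both methods.
     */
    static int matchingByteCodes(int[] original, int[] compared) {
        // lcs[i][j] = match count for the first i and first j opcodes.
        int[][] lcs = new int[original.length + 1][compared.length + 1];
        for (int i = 1; i <= original.length; i++) {
            for (int j = 1; j <= compared.length; j++) {
                if (original[i - 1] == compared[j - 1]) {
                    lcs[i][j] = lcs[i - 1][j - 1] + 1;
                } else {
                    lcs[i][j] = Math.max(lcs[i - 1][j], lcs[i][j - 1]);
                }
            }
        }
        return lcs[original.length][compared.length];
    }
}
```

Because the count depends only on opcode order, a copy with one
instruction inserted still matches every opcode of the original.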
[0040] Thereafter, the process proceeds to step 1010, where BCI 200
determines the similarity between the compared codes based on the
detected similarities at step 1008. BCI 200 would demonstrate the
similarities at step 1010 by generating a visual representation of
the comparison using a visualizing tool. For example, a WEKA machine
learning GUI tool could be used. FIG. 8 shows an example of an
output of such a WEKA visualization tool showing the methods
compared and the number of matching byte codes for each such method
across the two compared programs. By viewing the processed
information, segments of code can be analyzed according to
similarities. If code segments are within a "threshold of
similarity", for example, a threshold number of identical byte
codes shown in the vertical axis of FIG. 8, the methods can be
considered duplicated. As will now be appreciated, because this
analysis is performed on top of compiled Java code, the semantic
space has been limited so that false positives and false negatives
are reduced, and the process has a higher degree of accuracy of
determining duplicated code.
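The "threshold of similarity" test can be sketched as a filter over
per-method match counts. The class name SimilarityThreshold, the map
of method name to matching byte code count, and the choice to treat
a count at or above the threshold as duplicated are assumptions made
for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical filter applying a threshold of similarity. */
final class SimilarityThreshold {
    /**
     * Returns, in alphabetical order, the methods whose number of
     * matching byte codes is at or above the given threshold.
     */
    static List<String> duplicatedMethods(Map<String, Integer> matchCounts,
                                          int threshold) {
        List<String> flagged = new ArrayList<>();
        // Copying into a TreeMap gives a stable, sorted iteration order.
        for (Map.Entry<String, Integer> e
                : new TreeMap<>(matchCounts).entrySet()) {
            if (e.getValue() >= threshold) {
                flagged.add(e.getKey());
            }
        }
        return flagged;
    }
}
```

The threshold itself would be chosen by the user or by statistical
analysis, consistent with the two alternatives recited in the claims.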
[0041] While the invention has been particularly shown and
described with reference to a preferred embodiment, it will be
understood by those skilled in the art that various changes in form
and detail may be made therein without departing from the spirit
and scope of the invention. For example, the present invention may
be implemented using any combination of computer programming
software, firmware or hardware. As a preparatory step to practicing
the invention or constructing an apparatus according to the
invention, the computer programming code (whether software or
firmware) according to the invention will typically be stored in
one or more machine readable storage mediums such as fixed (hard)
drives, diskettes, optical disks, magnetic tape, semiconductor
memories such as ROMs, PROMs, etc., thereby making an article of
manufacture in accordance with the invention. The article of
manufacture containing the computer programming code is used by
either executing the code directly from the storage device, by
copying the code from the storage device into another storage
device such as a hard disk, RAM, etc., or by transmitting the code
for remote execution. The method form of the invention may be
practiced by combining one or more machine-readable storage devices
containing the code according to the present invention with
appropriate standard computer hardware to execute the code
contained therein. An apparatus for practicing the invention could
be one or more computers and storage systems containing or having
network access to computer program(s) coded in accordance with the
invention. While this invention is described in terms of the best
mode for achieving this invention's objectives, it will be
appreciated by those skilled in the art that variations may be
accomplished in view of these teachings without deviating from the
spirit or scope of the present invention.
* * * * *