U.S. patent application number 10/448168 was filed with the patent office on 2004-01-08 for system, apparatus, and method for user tunable and selectable searching of a database using a weighted quantized feature vector.
Invention is credited to Framroze, Bomi Patel and Gange, David M.
Application Number | 20040006559 10/448168 |
Family ID | 33489401 |
Filed Date | 2004-01-08 |
United States Patent Application | 20040006559 |
Kind Code | A1 |
Gange, David M.; et al. | January 8, 2004 |
System, apparatus, and method for user tunable and selectable
searching of a database using a weighted quantized feature
vector
Abstract
The invention disclosed herein concerns a data processing means
for user tunable and selectable searching of a database wherein the
data contained therein have associated descriptive properties
capable of being expressed in numeric form. A quantized vector
representative of the descriptive properties is created for each
item in the database. This quantized vector becomes the fingerprint
for each data item. The user submits a query item to be matched
against the database for similarity. A fingerprint is calculated
for the query item. The user may then assign weights to the
individual descriptive properties based upon perceived importance.
A newly weighted fingerprint for the query item is then compared
with the weighted fingerprints for all the data in the database. A
list of results sorted in order of decreasing similarity is
presented to the user. The user may then change the previously
assigned weights and then re-run the similarity search. This may be
done as often as necessary to achieve the desired results. The
invention describes similarity searching in a generic database.
However, this invention is particularly desirable in databases
containing chemical compound structure data or biological response
screening result data. The process described herein may be run
stand alone or as a preliminary screening search in a large
database. If used for screening, it can greatly reduce the amount
of data required for exactly matching a query item to the data in
the database.
Inventors: | Gange, David M. (Pennington, NJ); Framroze, Bomi Patel (Bombay, IN) |
Correspondence Address: | STANLEY H. KREMEN, 4 LENAPE LANE, EAST BRUNSWICK, NJ 08816, US |
Family ID: | 33489401 |
Appl. No.: | 10/448168 |
Filed: | May 28, 2003 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
60383952 | May 29, 2002 | |
60384305 | May 30, 2002 | |
Current U.S. Class: | 1/1; 707/999.003 |
Current CPC Class: | G06F 16/24578 20190101; G16C 20/40 20190201; G06F 16/903 20190101; Y10S 707/99931 20130101; Y10S 707/955 20130101 |
Class at Publication: | 707/3 |
International Class: | G06F 007/00 |
Claims
We claim:
1. A method for searching an electronic database of data, wherein
said data has associated with them a set of one or more calculated
descriptive properties related to said data; and wherein said
descriptive properties are capable of being expressed in numeric
form; said method comprising: a) accepting a query datum submitted
electronically by a user; b) electronically calculating a set of
one or more descriptive properties of said query datum wherein the
descriptive properties of said query datum are capable of being
expressed in numeric form, are of the same number and arrangement,
and are calculated in the same manner as said descriptive
properties of the data in said database; c) allowing the user to
electronically examine the calculated descriptive properties of
said query datum; d) electronically setting a weight for every
descriptive property to unity, said weight being an importance
value for that particular descriptive property; e) allowing the
user to change said weights for any or all of the descriptive
properties to other numeric values, said other numeric values being
set at the user's discretion; f) electronically calculating a
similarity value to the query datum for all data in the database
according to a method comprising: factoring in the user assigned
weights of the descriptive properties to both the query datum and
each datum in the database thereby forming a weighted query datum
and a weighted database datum; computing a quantized vector
distance, or equivalent indicator or coefficient, between said
weighted query datum and said weighted database datum; and,
assigning said quantized vector distance, or equivalent indicator
or coefficient, to the similarity value; g) presenting a list of
data from said database to the user wherein said data is sorted in
order of their similarity values; h) repeating steps f) and g) of
this method at the user's discretion as many times as the user
desires.
2. The method according to claim 1 wherein a user can assign
weights to said descriptive properties by manipulating objects on a
computer screen.
3. The method according to claim 2 wherein said objects are sliders
with numeric scales.
4. The method according to claim 2 wherein said objects are dials
with numeric scales.
5. The method according to claim 2 wherein said objects are text
boxes allowing numeric entry.
6. The method according to claim 1 wherein said similarity value is
calculated using a weighted Euclidean Distance between said
quantized vectors of the descriptive properties.
7. The method according to claim 1 wherein said similarity value is
calculated using a weighted Hamming Distance between said quantized
vectors of the descriptive properties.
8. The method according to claim 1 wherein said similarity value is
calculated using a weighted Soergel Distance between said quantized
vectors of the descriptive properties.
9. The method according to claim 1 wherein said similarity value is
calculated using a weighted Tanimoto Coefficient between said
quantized vectors of the descriptive properties.
10. The method according to claim 1 wherein said similarity value
is calculated using a weighted Dice Coefficient between said
quantized vectors of the descriptive properties.
11. The method according to claim 1 wherein said similarity value
is calculated using a weighted Cosine Coefficient between said
quantized vectors of the descriptive properties.
12. The method according to claim 1 wherein said data in said
database is representative of structures of chemical compounds and
wherein said query datum is also representative of the structure of
a chemical compound.
13. The method according to claim 12 wherein said query datum is
generated by a chemical structure drawing package.
14. The method according to claim 12 wherein said descriptive
properties of said database data and query datum are characterized
by assigned structure fragments.
15. The method according to claim 14 wherein the numeric values of
said descriptive properties are set to either one or zero, a one
representing the presence of the structure fragment associated with
a particular descriptive property, and a zero representing the
absence of said structure fragment.
16. The method according to claim 14 wherein the structure
fragments are contained within and referenced in an electronic
dictionary.
17. The method according to claim 14 wherein the structure
fragments are generated by an algorithm.
18. The method according to claim 1 wherein said data in said
database is representative of biological activity screening results
and wherein said query datum is also representative of biological
activity screening results.
19. The method according to claim 18 further comprising: a) a user
entering biological response values for known screening results; b)
a user entering the biological response values for a target query
item; c) a user entering weights for each of the biological
response values; d) a user selectively designating a method to be
used to calculate similarity.
20. The method according to claim 1 further comprising storing the
sorted calculated similarity results for further use.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This is a U.S. nonprovisional utility patent application
that is also described in and claims the benefit of both U.S.
provisional patent application Nos. 60/383,952 filed on May 29,
2002, entitled MACHINE, METHOD AND ARTICLE OF MANUFACTURE FOR A
SELECTIVELY SEARCHING A DATABASE OF CHEMICAL COMPOUNDS, and
60/384,305 filed on May 30, 2002, entitled MACHINE, METHOD AND
ARTICLE OF MANUFACTURE FOR SEARCHING A DATABASE OF BIOLOGICAL
ACTIVITY SCREENING RESULTS, said provisional applications being
incorporated by reference in their entirety herein.
REFERENCE TO AN APPENDIX
[0002] Accompanying this patent application is a CD-R, bearing the
electronic title "Gange & Framroze," the contents of which
comprise a program listing in ASCII text file format entitled
LISTING.TXT, being of size 86 KB and having been created on May 29,
2003. The contents of said CD-R are incorporated by reference
herein. The CD-R is hand labeled as follows:
[0003] Non-Provisional Patent Application Dr. David M. Gange &
Dr. Bomi P. Framroze Filed: May 29, 2003 Docket No.:
51900-ROW2-01-001
[0004] Attached to this application and made an integral part
hereof is an APPENDIX comprising the identical program listing as
that found on said CD-R.
BACKGROUND
[0005] 1. Field of the Invention
[0006] This invention relates to data processing, and specifically
to enabling highly efficient searching of a database wherein the
entries can be characterized using a set of one or more descriptive
properties that can be expressed in numeric form.
[0007] 2. Description of the Prior Art
[0008] Modern database management systems have been used since the
early 1970's. Commercial database systems mostly concentrate on
finding exact matches. Searches are performed either to find a
specific entry, or to find multiple entries having the same
characteristics. Attributes of the data often become fields. An
exact search can be made to find a specific person by looking up
his name or social security number. A search can be performed to
find multiple individuals having the same occupation or place of
birth. Alternatively, one may locate all people born before a
particular date. Whether a single entry or multiple entries are
found, this type of query constitutes an exact search. Exact
searches try to exactly or relationally match one or more fields in
different data records.
[0009] Similarity searching of databases has been around for
several years. A similarity search compares two or more entries in
their entirety to determine how closely they match one another.
Consider the following simple database containing entries of
various animals that fly:
[0010] a house-fly
[0011] a bat
[0012] a hummingbird
[0013] a dragonfly
[0014] a flying fish
[0015] a hawk
[0016] The question: "Which are most similar?" is not meaningful
without additional input. A proper answer requires input of the key
dimension. If "feathers" represent the key dimension, then the hawk
and the hummingbird are most similar. If "the ability to fly
stationary" is the key dimension, then the dragonfly and the
hummingbird are most similar. Other possible key dimensions could
be metabolism, life span, body temperature, etc. Therefore, the
answer to the question: "Which are most similar?" is subjective
depending upon the preferences of the person supplying the
answer.
[0017] For a more complicated residential real estate database, a
potential buyer would be looking to buy a home by expressing
preferences that become the parameters for a similarity search.
Such parameters might include number of bedrooms, type of house,
asking price, neighborhood, quality of the local school system,
property taxes, age restrictions on residents, home-owners'
associations, etc. Currently, a real estate agent would first
screen for homes having a specific most desirable characteristic
(e.g., neighborhood or number of bedrooms). Then, the agent would
look for the next desirable characteristic. The process would be
repeated for each parameter, each search yielding a number of homes
for consideration by the buyer. Where a particular home appears in
the search results multiple times, it is more likely that the agent
can make a sale. However, a binary feature vector may be created
using these and other parameters, and a similarity search can be
performed to match a potential buyer's preferences. This search
would generate a list of homes approximating these preferences. A
binary vector could indicate whether or not the buyer is interested
in a particular feature. The homes can then be compared in their
entirety by computing the mathematical distance between their
feature vectors. In the rare instance where an exact match is
found, the distance between the vectors would be zero. However, if
the distance is not zero, the smaller the distance between the
feature vectors of an ideal home and an available home, the more
similar they are.
[0018] This technique has been found to be particularly useful for
searching in databases containing chemical structures. Databases of
organic chemical compounds can contain millions of records. An
atom-by-atom and bond-by-bond search becomes more difficult as the
size of the molecule increases. Even were the organic molecules to
be pre-classified according to specific features, queries to find
exact matches of these features might still yield questionable and
non-useful results. Furthermore, in large databases, exact match
searching can be extremely time consuming. Similarity searching in
a large chemical structure database is a method of screening for
compounds which are closely related to one another but may not
exactly match. Such a screening query can also be used to shorten
the list of compounds to be matched, thereby greatly
reducing the overall query time. In fact, several screening
searches using different algorithms may be performed that would
yield a manageable list of chemical compounds that would then be
exactly matched in an atom-by-atom and bond-by-bond search.
[0019] A chemical structure similarity search may be performed by
creating a chemical fragment dictionary or by using an algorithm
that generates chemical structure fragments. A fragment consists of
a grouping of atoms attached to one another by specific chemical
bonds. All of the compounds in the database are parsed to determine
whether or not a particular fragment is present. Associated with
each compound is a binary vector. Each element of the vector
represents the presence or absence of a specific fragment. This
binary vector then serves as an index for that compound in the
database. Now a search can be made to find a compound in the
database that is similar to a substance that interests the user.
The distance between the vector for the new substance and the
vectors of compounds in the database can be calculated. The results
can then be returned in order of decreasing similarity.
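As a minimal sketch of the binary fingerprint described above, the following Python fragment builds a 0/1 vector from a fragment dictionary. It assumes each compound has already been parsed into a set of fragment names; the dictionary contents and the function name `binary_fingerprint` are hypothetical and chosen only for illustration, and real substructure matching is outside the scope of the sketch.

```python
# Hypothetical fragment dictionary; the names are illustrative only.
FRAGMENT_DICTIONARY = [
    "carboxylic_acid",
    "fluoro",
    "pyridine_ring",
    "amine",
    "cyclopropyl",
]


def binary_fingerprint(fragments_present, dictionary=FRAGMENT_DICTIONARY):
    """Return a 0/1 vector: element j is 1 if dictionary[j] occurs in the compound."""
    present = set(fragments_present)
    return [1 if fragment in present else 0 for fragment in dictionary]


# A compound assumed to contain only a fluoro group and an amine.
print(binary_fingerprint({"fluoro", "amine"}))
```

Each position of the resulting list corresponds to one dictionary entry, so two compounds parsed against the same dictionary yield vectors of the same number and arrangement of elements.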
[0020] In another application, chemical compounds, natural
products, fermentation broths, and other substances are often
tested for biological activity, or pharmacological activity. The
results of these tests are often stored in electronic databases.
Biologists and chemists are often interested in searching a
database of biological screening results for substances with an
activity profile similar to a given biological activity profile.
For example, in the development of an antibiotic a scientist might
be interested in substances showing good activity against
gram-positive bacteria and one species of gram-negative bacteria.
The profile of such a substance would have strong activity values
for the several gram-positive and one gram-negative bacteria under
consideration and weak activity values for the rest of the
gram-negative species tested. In addition, physical properties of
the substances, such as LogP, molecular weight, molecular size,
pKa, and other physical properties may be considered. One method
that can be used to examine biological screening results and
property data is similarity searching.
[0021] In this type of database, a feature vector or a vector of
test results can be formed where binary values would not be used.
In this case, it would be desirable to create a vector where a
specific element would refer to a particular feature or test, and
the vector would contain numeric values other than one or zero. The
distance between vectors may be measured, and distances would
represent the degree of similarity between entries in the
database.
[0022] Similarity searching using quantized vectors is prior art.
However, prior art searches have been performed according to a
fixed searching algorithm. In a chemical compound database, the
user might wish to perform a similarity search based upon
substructure comparisons, and the data processing system would
provide an answer as a sorted list of compounds. When process
development chemists search for similarity in chemical compounds,
some parts of the molecule are more important to them than other
parts. Therefore, when performing a search, they would want
to assign a higher search priority to the
important substructures and a lower search priority to the less
important substructures. Assignment of search priorities is
arbitrary and based upon user preference. If priorities of
substructure preferences can be dynamically assigned, then should
the results of the search not be what the user desires, the user
can reassign substructure priorities, thereby refining the search
results. The units of assigned priority or weights can be
arbitrary, and only their ratio to each other is important.
[0023] In the previously mentioned residential real estate
database, the similarity search revealed homes having all of the
features that interested a potential buyer. Yet, for some potential
buyers, certain items are more important than others. For example,
for a family with four children, purchase of a house with five
bedrooms and the quality of the school system might be more
important than asking price and property taxes. Yet these latter
features could also serve as influencing factors. In such a case,
being able to assign higher priorities to certain features and
lower priorities to other features would result in a more
meaningful search.
[0024] The underlying mathematics for this search is very broadly
applicable. It can be used inter alia in biology and medical
databases, in physiology databases, in anthropology databases, in
photography databases, and in taxonomy databases. It is practical
where a characterization vector can be applied to the description
of the data.
[0025] It is an object of the invention described herein to create
a computerized system that will perform similarity searches in an
electronic database where the entries have a set of one or more
descriptive properties capable of being expressed in numeric form
and wherein the user can assign weights or priorities to the
descriptive properties so as to influence the similarity
searches.
SUMMARY OF THE INVENTION
[0026] The invention disclosed herein is a data processing product
and method that permits computerized similarity searching of an
electronic database using a quantization vector. The quantization
vector, a linear array of descriptive properties of the entries in
the database, is maintained by the system. Different datatype
representations of the quantization vector may be implemented. The
system examines the structure of a query item in terms of its known
descriptive properties. During examination, the quantization vector
is established. This vector represents the query item's
"fingerprint." The system then searches the entire database for
identity or similarity to the query item by comparing the vectors.
The system further permits the user to set numeric priorities for
the descriptive properties in a user friendly environment, said
priorities to be used in the search for entries that are similar to
the query item. An object of the invention is to provide a
simplified searching system for naive and infrequent users. In one
of the embodiments presented herein, a computerized user tunable
system is disclosed that selectively searches a database of
chemical compounds. In another embodiment presented herein, a
computerized user tunable system is disclosed that selectively
searches a database of biological activity screening test
results.
BRIEF DESCRIPTION OF THE DRAWINGS
[0027] FIG. 1 is an overview program flowchart showing a current
computerized method of performing similarity searches for a
generalized database. The method shown is prior art.
[0028] FIG. 2 is an overview program flowchart showing the
computerized method of the invention disclosed herein being used to
perform similarity searches for a generalized database.
[0029] FIG. 3 is an overview program flowchart showing the
computerized method of the invention disclosed herein being used to
perform similarity searches in a database of organic chemical
compounds.
[0030] FIG. 4 shows the screen view of the chemical structure of
the query compound, Trovafloxacin, as drawn by the user with one of
the standard chemical drawing software packages and input into the
search program.
[0031] FIG. 5 shows the structure of the query compound having been
parsed or fingerprinted according to chemical structure fragments
in a fragment dictionary. Only twelve fragments are shown on the
screen in the figure. However, a slider on the right edge of the
screen may be used to display additional fragments. An adjustable
slider with a numeric scale is associated with each fragment
shown.
[0032] FIG. 6 is the screen of FIG. 5 after the user adjusted some
of the sliders so as to assign weights to their associated
fragments.
[0033] FIG. 7 shows the results of the similarity search for the
query compound with compounds in the database. The figure shows ten
out of fifty compounds returned as part of the search.
[0034] FIG. 8 shows a program flow chart for a specific
implementation of the program in FIG. 3 in which the screens shown
in FIG. 4 through FIG. 7 are used.
[0035] FIG. 9 is an overview program flowchart showing the
computerized method of the invention disclosed herein being used to
perform similarity searches in a database of biological responses
to various compounds.
[0036] FIG. 10 is a MICROSOFT EXCEL spreadsheet divided into three
parts, FIGS. 10(a), (b), and (c), done so because the entire
spreadsheet could not conveniently fit on a single drawing sheet.
The data represented in the spreadsheet (the LEWI Data) represents
the biological response test data of rats to various
tranquilizers.
[0037] FIG. 11 represents four screen prints of the program shown
in FIG. 9 operating on the data shown in FIG. 10.
DESCRIPTION OF THE PREFERRED AND ALTERNATE EMBODIMENTS
[0038] It is feasible to perform similarity searches in an
electronic database of items, wherein said items possess a set of
one or more descriptive properties (related to the items) that can
be expressed in numeric form. Similarity searching in such a
generalized database according to current technology may be
performed in a computer using the method shown in FIG. 1.
[0039] 1. A user submits a query to the system. The query may be
submitted using different formats, but a query item must be able to
be classified according to its descriptive properties. The
descriptive properties may have inherent numeric values (e.g., test
results, characteristic values, prices, ASCII values, checksums,
etc.). Alternatively, they may have binary values (`one` indicating
the presence of a feature and `zero` indicating the absence of the
feature).
[0040] 2. The query item is parsed according to its descriptive
properties. The descriptive properties are analyzed by comparing
various elements of the query item in sequence to standardized
descriptive properties previously entered electronically into the
computer. The characteristics of these descriptive properties may
be pre-stored in an electronic dictionary or be generated
dynamically by some program algorithm. Regardless of how these
descriptive properties are presented for comparison with the query
item, the query item is analyzed for the presence or absence of a
particular property, and its numeric value is noted. A quantized vector is
formed wherein each element in the vector represents a value for a
specific descriptive property. The quantized vector can be thought
of as a "fingerprint" for the query item.
[0041] 3. The database contains entries of similarly describable
items, each such item having been similarly pre-parsed into
quantized vectors. The quantized vector (or "fingerprint") for each
entry is stored in the database and associated with its entry.
Therefore, a distance may be computed between the vector
representing the query item and each vector representing each and
every item in the database. The closer the query item vector is to
a vector representing an entry in the database, the more similar
the query item is to that database entry.
[0042] 4. The results are sorted in order of similarity.
[0043] 5. The sorted results may be stored for future use at the
user's discretion.
[0044] 6. The sorted list of database entries is then presented to
the user.
[0045] The vector distances in step 3 above may be
calculated, inter alia, as the standard Euclidean Distance, the
Tanimoto Coefficient, the Hamming Distance, the Soergel Distance,
the Dice Coefficient, or the Cosine Coefficient. Other types of
similarity measurement may also be used.
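The prior art procedure in steps 1 through 6 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the listing from the appendix: the database is assumed to be a mapping from entry name to a pre-computed fingerprint, the function name `similarity_search` is hypothetical, and the standard library function `math.dist` supplies the Euclidean distance by default.

```python
import math


def similarity_search(query_fp, database, distance=math.dist):
    """Rank database entries by distance to the query fingerprint.

    database is assumed to map an entry name to its pre-computed fingerprint.
    Smaller distances mean greater similarity, so the sort is ascending.
    """
    scored = [(name, distance(query_fp, fp)) for name, fp in database.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Illustrative data only.
database = {
    "a": [1, 1, 0, 1, 1],
    "b": [0, 1, 1, 1, 0],
    "c": [1, 1, 0, 1, 0],
}
for name, d in similarity_search([1, 1, 0, 1, 1], database):
    print(name, round(d, 2))
```

Any other measure can be passed as `distance`, provided that smaller values mean more similar; a coefficient for which larger values mean more similar would need the sort order reversed.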
[0046] The most familiar method for computing the distance between
two vectors, thereby comparing their overall similarity, is to
measure the Euclidean distance between them. This is done according
to the well known equation:
$$D_{A,B} = \left[ \sum_{j=1}^{n} \left( x_{jA} - x_{jB} \right)^{2} \right]^{1/2} \qquad [1]$$
[0047] where:
[0048] $D_{A,B}$ = the distance between vectors A and B;
[0049] j = the index to a specific vector element;
[0050] n = the number of elements in the vector;
[0051] $x_{jA}$ = the value of the jth element in the A vector; and,
[0052] $x_{jB}$ = the value of the jth element in the B vector.
[0053] This is the familiar process of obtaining the difference
between each of the elements in the same position in each vector,
squaring that difference, and then taking the square root of the
sum of the squares. Using this method, the distance between two
identical vectors would be zero. The smaller the distance between
two vectors, the greater their degree of similarity. The Euclidean
Distance can be normalized to the range of 0 to 1 if the values of
all attributes are normalized to this range and the results divided
by n.
[0054] To illustrate computation of the distance, assume two binary
dimension 5 vectors: A=1 1 0 1 1 and B=0 1 1 1 0. Using Equation
[1], the calculation of Euclidean distance from A to B is as
follows:
TABLE 1
A | B | A - B = C | C * C | Sum of C * C | Distance |
1 | 0 | 1 | 1 | 3 | 1.73 |
1 | 1 | 0 | 0 | | |
0 | 1 | -1 | 1 | | |
1 | 1 | 0 | 0 | | |
1 | 0 | 1 | 1 | | |
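The worked example above can be reproduced column by column with a short Python sketch. The variable names are chosen here for illustration and do not come from the program listing.

```python
import math

A = [1, 1, 0, 1, 1]
B = [0, 1, 1, 1, 0]

differences = [a - b for a, b in zip(A, B)]   # column "A - B = C"
squares = [c * c for c in differences]        # column "C * C"
sum_of_squares = sum(squares)                 # sum of the C * C column
distance = math.sqrt(sum_of_squares)          # Equation [1]

print(differences, squares, sum_of_squares, round(distance, 2))
```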
[0055] Another method for comparing similarity is to compute the
Tanimoto Coefficient of the two vectors. This is done using the
equation:
$$S_{A,B} = \frac{\sum_{j=1}^{n} x_{jA}\, x_{jB}}{\sum_{j=1}^{n} \left( x_{jA} \right)^{2} + \sum_{j=1}^{n} \left( x_{jB} \right)^{2} - \sum_{j=1}^{n} x_{jA}\, x_{jB}} \qquad [2]$$
[0056] where:
[0057] $S_{A,B}$ = the Tanimoto Coefficient.
[0058] The Tanimoto Coefficient is determined by taking the sum of
the element-by-element products of the two vectors and dividing it
by the sum of the squares of the elements of the first vector added
to the sum of the squares of the elements of the second vector less
that same sum of element-by-element products. Another name for the
Tanimoto Coefficient is the Jaccard Coefficient.
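A minimal Python sketch of Equation [2] follows. The function name `tanimoto` is illustrative, and returning 1.0 when both vectors are all zeros is an assumption made here to avoid division by zero; the text does not address that case.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denominator = sum(x * x for x in a) + sum(y * y for y in b) - dot
    # Assumption: two all-zero vectors are treated as identical.
    return dot / denominator if denominator else 1.0


A = [1, 1, 0, 1, 1]
B = [0, 1, 1, 1, 0]
print(tanimoto(A, B))
```

Unlike a distance, the coefficient grows with similarity: identical non-zero vectors give 1, and vectors with no features in common give 0.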
[0059] Other distance computations such as the Hamming Distance,
the Soergel Distance, the Dice Coefficient and the Cosine
Coefficient are sometimes used to perform similarity searches and
are prior art. The Hamming Distance is computed as:
$$D_{A,B} = \sum_{j=1}^{n} \left| x_{jA} - x_{jB} \right| \qquad [3]$$
[0060] The Soergel Distance is computed as:
$$D_{A,B} = \frac{\sum_{j=1}^{n} \left| x_{jA} - x_{jB} \right|}{\sum_{j=1}^{n} \max \left( x_{jA},\, x_{jB} \right)} \qquad [4]$$
[0061] The Dice Coefficient (also known as the Czekanowski
Coefficient and the Sørensen Coefficient) is computed as:
$$S_{A,B} = \frac{2 \sum_{j=1}^{n} x_{jA}\, x_{jB}}{\sum_{j=1}^{n} \left( x_{jA} \right)^{2} + \sum_{j=1}^{n} \left( x_{jB} \right)^{2}} \qquad [5]$$
[0062] The Cosine Coefficient is computed as:
$$S_{A,B} = \frac{\sum_{j=1}^{n} x_{jA}\, x_{jB}}{\left[ \sum_{j=1}^{n} \left( x_{jA} \right)^{2} \sum_{j=1}^{n} \left( x_{jB} \right)^{2} \right]^{1/2}} \qquad [6]$$
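Equations [3] through [6] can be sketched in Python as below. The function names are illustrative. The Dice denominator is taken as the sum of the two sums of squares and the Cosine denominator as the square root of their product, which are the standard forms of those coefficients. The values returned when a denominator is zero are assumptions of this sketch, since the text does not cover that case.

```python
import math


def hamming(a, b):
    """Equation [3]: sum of absolute element differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def soergel(a, b):
    """Equation [4]: Hamming distance divided by the sum of element maxima."""
    denominator = sum(max(x, y) for x, y in zip(a, b))
    return hamming(a, b) / denominator if denominator else 0.0  # assumption for all-zero input


def dice(a, b):
    """Equation [5]: twice the dot product over the sum of the squared norms."""
    dot = sum(x * y for x, y in zip(a, b))
    denominator = sum(x * x for x in a) + sum(y * y for y in b)
    return 2 * dot / denominator if denominator else 1.0  # assumption for all-zero input


def cosine(a, b):
    """Equation [6]: dot product over the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    denominator = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / denominator if denominator else 0.0  # assumption for a zero vector


A = [1, 1, 0, 1, 1]
B = [0, 1, 1, 1, 0]
print(hamming(A, B), soergel(A, B), dice(A, B), cosine(A, B))
```

Hamming and Soergel are distances (zero for identical vectors), while Dice and Cosine are coefficients (one for identical non-zero vectors), so the sort direction of a result list depends on which measure is selected.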
[0063] The foregoing comparison methodologies represented by
Equations [1] through [6] are only a few prior art techniques for
similarity measurement between two quantized vectors. Of course,
the measure of similarity depends upon the method of measurement.
Changing the "fingerprint" changes the similarity. The results are
dictated by the algorithm of the system. For the aforementioned
prior art similarity measurement methods, there is generally no
feedback, no user control over the results, and no possibility of
iteratively improving the answer.
[0064] The present invention improves the quality of the results
obtained from similarity searching in the type of database
discussed above. The results obtained from a search using the
methodology disclosed herein should be more meaningful to the user.
FIG. 2 is an overview program flowchart showing the computerized
method of the invention disclosed herein being used to perform
similarity searches for a generalized database. The methodology is
as follows:
[0065] 1. The user submits a query item to the system.
[0066] 2. The query item is parsed according to its descriptive
properties using the same method that is used to calculate the
descriptive properties of the entries stored in the database. A
quantized vector (or "fingerprint") for the query item is
formed.
[0067] 3. The user is permitted to assign a weight or priority to
each descriptive property of the quantized vector. A quantized
weight vector is then formed in this manner. The weight vector has
the same dimension (or number of elements) as the quantized vector
representing the "fingerprint" of the query item. The assignment of
weights can be done by presenting to the user a computer screen
showing the query item, the descriptive properties of the query
item, and a means to adjust weighting to assign importance values
to the descriptive properties. The means to adjust weighting may be
adjustable sliders, dials, text boxes, or any other controls that
permit the user to interactively assign weights to the descriptive
properties. Alternatively, the weight values representing the
elements of the quantized weight vector may be obtained from a file
created by the user.
[0068] 4. The user adjusts the descriptive property weightings to
suit his or her individual preferences.
[0069] 5. Using the query item properties and weightings,
similarity values between the query item and all of the items in
the database are calculated using one of the standard similarity
algorithms (Euclidean Distance, Tanimoto Coefficient, etc.).
[0070] 6. Using the calculated similarity values, the database
items are sorted.
[0071] 7. The sorted results may be stored for future use at the
user's discretion.
[0072] 8. The sorted list of database items is presented to the
user.
[0073] 9. If the user so desires, the process may be repeated until
the desired outcome is achieved.
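The steps above can be sketched in Python as a search that accepts a weight vector and is simply re-run whenever the user changes the weights. This is an illustration under stated assumptions, not the appended program listing: the names `weighted_euclidean` and `weighted_search` are hypothetical, the database is assumed to map entry names to pre-computed fingerprints, and a weighted Euclidean distance (each squared difference multiplied by its weight) stands in for whichever measure the user selects.

```python
import math


def weighted_euclidean(a, b, w):
    """Euclidean distance with each squared difference scaled by its weight."""
    return math.sqrt(sum(wj * (x - y) ** 2 for wj, x, y in zip(w, a, b)))


def weighted_search(query_fp, database, weights=None, distance=weighted_euclidean):
    """Return (name, distance) pairs sorted from most to least similar."""
    if weights is None:
        weights = [1] * len(query_fp)  # every weight starts at unity
    scored = [(name, distance(query_fp, fp, weights)) for name, fp in database.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Illustrative data: X differs from the query in the last element, Y in the first.
query = [1, 1, 0, 1, 1]
database = {"X": [1, 1, 0, 1, 0], "Y": [0, 1, 0, 1, 1]}

print(weighted_search(query, database))                   # unit weights
print(weighted_search(query, database, [5, 1, 1, 1, 1]))  # first property emphasized
print(weighted_search(query, database, [1, 1, 1, 1, 5]))  # last property emphasized
```

Emphasizing the first property pushes Y down the list, while emphasizing the last property pushes X down, which is the kind of re-ranking the user obtains by adjusting the weights and repeating the search.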
[0074] The units of assigned priority or weights can be arbitrary,
and only their ratio to each other is important. In the system
that reduces the present invention to practice, the weights are
unitless integers between zero and ten. However, a logarithmic
scale may also be used. In that case, "1" would be the inflection
point, and fractional weights (between "0" and "1") should be in
tenths. Fractional weights downscale priorities while weights
above "1" upscale priorities.
[0075] In this type of system, using a weight vector, w, the
Euclidean distance between the two vectors would be computed as:
$$D_{A,B} = \left[ \sum_{j=1}^{n} w_{j} \left( x_{jA} - x_{jB} \right)^{2} \right]^{1/2} \qquad [7]$$
[0076] where:
[0077] $D_{A,B}$ = the distance between vectors A and B; and,
[0078] $w_{j}$ = the weight assigned to vector element j.
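The earlier remark that only the ratio of the weights matters can be checked against Equation [7]. As a short derivation, with the scale factor k > 0 being a symbol introduced here for illustration, multiplying every weight by k gives:

$$D'_{A,B} = \left[ \sum_{j=1}^{n} k\, w_{j} \left( x_{jA} - x_{jB} \right)^{2} \right]^{1/2} = \sqrt{k} \left[ \sum_{j=1}^{n} w_{j} \left( x_{jA} - x_{jB} \right)^{2} \right]^{1/2} = \sqrt{k}\; D_{A,B}$$

Every distance is stretched by the same factor $\sqrt{k}$, so the sorted order of the search results does not change.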
[0079] To illustrate the new computation of the Euclidean distance
as influenced by the assigned weights ($w_j$), consider the two
previous five-dimensional binary vectors A=1 1 0 1 1 and B=0 1 1 1 0.
The calculation of the new Euclidean distance from A to B is as
follows:

    A    B    A - B = C    C * C    Weight
    1    0        1          1        3
    1    1        0          0        1
    0    1       -1          1        3
    1    1        0          0        1
    1    0        1          1        3

    Weighted sum of C * C:              9
    Distance (square root of the sum):  3
[0080] The new weighted Tanimoto Coefficient derived from Equation
[2] would be computed according to Equation [8]:
$$S_{A,B} = \frac{\sum_{j=1}^{n} w_j x_{jA} x_{jB}}{\sum_{j=1}^{n} w_j (x_{jA})^2 + \sum_{j=1}^{n} w_j (x_{jB})^2 - \sum_{j=1}^{n} w_j x_{jA} x_{jB}} \qquad [8]$$
[0081] Likewise, the new weighted Hamming Distance derived from
Equation [3] would be computed using Equation [9]:
$$D_{A,B} = \sum_{j=1}^{n} w_j \left| x_{jA} - x_{jB} \right| \qquad [9]$$
[0082] the new weighted Soergel distance derived from Equation [4]
would be computed using Equation [10]:
$$D_{A,B} = \frac{\sum_{j=1}^{n} w_j \left| x_{jA} - x_{jB} \right|}{\sum_{j=1}^{n} w_j \max(x_{jA}, x_{jB})} \qquad [10]$$
[0083] the new weighted Dice coefficient derived from Equation [5]
would be computed using Equation [11]:
$$S_{A,B} = \frac{2 \sum_{j=1}^{n} w_j x_{jA} x_{jB}}{\sum_{j=1}^{n} w_j (x_{jA})^2 + \sum_{j=1}^{n} w_j (x_{jB})^2} \qquad [11]$$
[0084] and the new weighted Cosine coefficient derived from
Equation [6] would be computed using Equation [12]:
$$S_{A,B} = \frac{\sum_{j=1}^{n} w_j x_{jA} x_{jB}}{\left[ \sum_{j=1}^{n} w_j (x_{jA})^2 + \sum_{j=1}^{n} w_j (x_{jB})^2 \right]^{1/2}} \qquad [12]$$
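By way of illustration only, the weighted measures of Equations [7] through [11] may be sketched in Python as shown below. The function names are hypothetical and are not taken from the APPENDIX. The sketch assumes that the two vectors and the weight vector have equal length and that the denominators are non-zero. The example values are the vectors A and B used above with the weights 3, 1, 3, 1, 3.

```python
import math


def weighted_euclidean(a, b, w):
    """Equation [7]: square root of the weighted sum of squared differences."""
    return math.sqrt(sum(wj * (x - y) ** 2 for x, y, wj in zip(a, b, w)))


def weighted_tanimoto(a, b, w):
    """Equation [8]. Assumes the denominator is non-zero."""
    ab = sum(wj * x * y for x, y, wj in zip(a, b, w))
    aa = sum(wj * x * x for x, wj in zip(a, w))
    bb = sum(wj * y * y for y, wj in zip(b, w))
    return ab / (aa + bb - ab)


def weighted_hamming(a, b, w):
    """Equation [9]: weighted sum of absolute differences."""
    return sum(wj * abs(x - y) for x, y, wj in zip(a, b, w))


def weighted_soergel(a, b, w):
    """Equation [10]. Assumes the denominator is non-zero."""
    return weighted_hamming(a, b, w) / sum(
        wj * max(x, y) for x, y, wj in zip(a, b, w)
    )


def weighted_dice(a, b, w):
    """Equation [11]. Assumes the denominator is non-zero."""
    ab = sum(wj * x * y for x, y, wj in zip(a, b, w))
    aa = sum(wj * x * x for x, wj in zip(a, w))
    bb = sum(wj * y * y for y, wj in zip(b, w))
    return 2 * ab / (aa + bb)


A = [1, 1, 0, 1, 1]
B = [0, 1, 1, 1, 0]
W = [3, 1, 3, 1, 3]

distance = weighted_euclidean(A, B, W)  # 3.0, matching the worked example
```

For these binary vectors the weighted Soergel distance equals one minus the weighted Tanimoto coefficient (9/11 and 2/11 respectively).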
[0085] As previously mentioned, one of the preferred uses for this
methodology implemented as a computerized system is as a means to
selectively search a database of chemical compounds. All chemical
compounds can be structurally decomposed into recognizable
fragments. Inorganic molecules are composed of atoms, and these
atoms are bound to each other in a limited number of ways. The
elements making up these molecules span the entire periodic table.
However, their structures are simple. On the other hand, organic
molecules comprise very few elements usually on the lower end of
the periodic table (e.g., carbon, hydrogen, oxygen, nitrogen,
etc.), but their structures are complex. Due to structural
complexity and the ability of these elements to form large
molecules, the number of possible organic molecules is virtually
limitless. During product development of organic compounds, it is
often important to search for other compounds having a similar
molecular structure in an effort to adjust the new structure so as
to predict its chemical, biological, and physical properties. Such
a search is also necessary to ensure that the new product does not
infringe on patented products previously developed by others.
[0086] Computerized searching of organic chemical compound
databases has been around for decades. Many of these databases
store molecular information according to their recognizable
fragments. The data processing systems maintain a fragment
dictionary, and all compounds input into the database are parsed so
as to establish a relationship between fragments in the dictionary.
The dictionary is instituted with a limited number of fragments
well known to those skilled in the art. Many database searching
tools use fragment dictionaries with a large number of entries, and
others use fragment dictionaries with a smaller number of entries.
A larger number of fragments makes it easier to define a complex
molecule, but it increases the search time. A more rapid search
engine requires fewer fragments in the dictionary.
[0087] As the number of atoms in a complex organic molecule
increases, the search time for identity and similarity in these
databases grows exponentially. There is no upper bound as to the
time required to secure a match. Therefore, a search should be done
in two stages. The first stage is a screening search. This stage
eliminates most of the compounds in the database (possibly up to
99%). In order to determine whether one structure is a
sub-structure of another, traditionally one performs an
atom-by-atom match. Atoms are graphically superimposed upon one
another to make sure that all the atoms match and that all the
bonds between the atoms match. If one is a subset of the other,
then there is a substructure match. However, this is a slow
process. In order to minimize the number of times that this process
is performed, it is important to first apply a filter in order to
perform a screening search. If a substructure match is found, all
of the atoms and bonds between the atoms of the smaller structure
will be contained within the larger structure. If there is a
fragment dictionary, all of the fragments in the molecule to be
matched must also be in the target molecule. Other fragments may
also be present in the target molecule, but every fragment of the
substructure must be present in both molecules. So after performing
the search using a binary vector of fragments, most of the molecules
are eliminated. Then, an atom-by-atom search is performed on the
remainder of the database.
[0088] One possible representation of a complex molecule would be
to parse it into a binary fragment vector. Each bit represents the
presence or absence of a particular fragment in the dictionary. The
vector element order is keyed to the fragment dictionary. Molecular
parsing is performed by analyzing the chemical structure
atom-by-atom, together with each bond associated with each atom. A
search of the fragment dictionary is performed to find a match.
When a match is found, the element of the molecular descriptor vector
corresponding to the matched fragment is set to 1. The binary
vector may be represented logically as a string of bits or bytes or
may have any convenient representation. These binary vectors then
form a fingerprint for the chemical structure of the molecule. Each
bit or fragment in the fingerprint is a dimension representing one
row in the vector. Equal weighting is applied to all dimensions.
Data processing systems that use this type of fingerprint implement
a search for similarity of new compounds with known existing
compounds.
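The binary fingerprint and the screening search described above may be sketched, by way of example only, as follows. The five-entry fragment dictionary and the fragment names are hypothetical. The sketch assumes that each molecule has already been decomposed into dictionary fragments; the substructure matching needed to do so is not shown.

```python
# Hypothetical fragment dictionary; the vector element order is keyed to it.
FRAGMENT_DICTIONARY = ["carboxylic acid", "fluoro", "pyridine", "amine", "cyclopropyl"]


def fingerprint(fragments_found):
    """Return a binary vector with 1 wherever a dictionary fragment is present."""
    present = set(fragments_found)
    return [1 if fragment in present else 0 for fragment in FRAGMENT_DICTIONARY]


def passes_screen(query_fp, target_fp):
    """Screening search: every fragment set in the query must be set in the target."""
    return all(t >= q for q, t in zip(query_fp, target_fp))


query = fingerprint(["fluoro", "amine"])
targets = {
    "a": fingerprint(["fluoro", "amine", "pyridine"]),
    "b": fingerprint(["fluoro", "cyclopropyl"]),
}

# Only molecules that pass the screen go on to the slow atom-by-atom match.
candidates = [name for name, fp in targets.items() if passes_screen(query, fp)]
```

Here molecule "b" is eliminated by the screen because it lacks the amine fragment of the query, so only molecule "a" would proceed to the atom-by-atom comparison.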
[0089] Searching using a fragment dictionary is commonly used in
chemical database technology. Chemical Abstracts (CAS/STN) uses a
dictionary of two thousand fragment keys for a
database of approximately ten million chemical compounds. Most
commercial databases use a dictionary of between five hundred and
one thousand keys for a database of approximately one million to
two million chemical compounds. The inventors have reduced the
current system to practice. Said system uses a dictionary of 230
fragment keys for a database of approximately seventy thousand
compounds. The performance of said system is excellent.
[0090] FIG. 3 is an overview program flowchart showing the
computerized method of the invention disclosed herein being used to
perform similarity searches in a database of organic chemical
compounds. In designing a search query system for a chemical
compound database, the following steps must be performed:
[0091] 1. Draw the query:--A user draws a chemical structure using
a chemical structure drawing package such as ChemDraw, ISISDraw, or
CASDraw. The resulting chemical structure, the query structure, is
transferred to the program implementing the search.
[0092] 2. Fingerprint the query:--Use the dictionary of chemical
structure fragments or an algorithm that generates the fragments to
characterize the chemical structure. The searching program
determines which structure fragments, from the fragment dictionary,
are present in the query structure.
[0093] 3. Allow the user to adjust the fragment weighting:--An
electronic form displaying the structure fragments, from the
fragment dictionary, which are present in the query structure is
displayed to the user. For each structure fragment, there is also
present a control that allows the user to define the importance of
the fragment. The control on the form could be a slider with a
numeric scale, a dial with a numeric scale, a text box allowing
numeric value entry, or any graphic or text based system that would
permit the user to interactively assign a weight to the importance
of a particular structure fragment. Alternatively, the fragment
weights may be input from a file.
[0094] 4. Run the similarity search:--After the user has assigned
the structure fragment weights, the similarity search is performed
using a Euclidean distance, the Tanimoto coefficient, or other
method of comparing the similarity between two vectors.
[0095] 5. Return Results:--The results of the similarity search may
be stored for future use. The results are then displayed to the
user. In the preferred embodiment, they would be shown as a
graphical series of compounds sorted in order of decreasing
similarity. However, any method of user-informative display could
be used.
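Steps 3 through 5 above may be sketched, by way of example only, as follows. The names tunable_search and user_weights and the compound identifiers are hypothetical and do not appear in the APPENDIX. The sketch assumes that any fragment whose control the user leaves untouched keeps the default weight of 1, and that two empty fingerprints score zero.

```python
def weighted_tanimoto(a, b, w):
    """Weighted Tanimoto coefficient of Equation [8]."""
    ab = sum(wj * x * y for x, y, wj in zip(a, b, w))
    aa = sum(wj * x * x for x, wj in zip(a, w))
    bb = sum(wj * y * y for y, wj in zip(b, w))
    denom = aa + bb - ab
    return ab / denom if denom else 0.0  # assumption: empty vectors score 0


def tunable_search(query_fp, database, user_weights, top_n=50):
    """Rank database fingerprints against the query, most similar first."""
    # Assumption: fragments the user did not adjust keep the default weight 1.
    w = [1.0] * len(query_fp)
    for key_index, value in user_weights.items():
        w[key_index] = value
    scored = [(cid, weighted_tanimoto(query_fp, fp, w)) for cid, fp in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


query = [1, 1, 0, 1]
database = {
    "cmpd-1": [1, 1, 0, 0],
    "cmpd-2": [1, 0, 0, 1],
    "cmpd-3": [1, 1, 0, 1],
}

default_results = tunable_search(query, database, {})    # all weights equal to 1
tuned_results = tunable_search(query, database, {3: 7})  # fragment 3 weighted 7
```

With equal weights, cmpd-1 and cmpd-2 tie behind the exact match cmpd-3. Raising the weight of fragment 3 to 7 moves cmpd-2, which contains that fragment, ahead of cmpd-1, which is how the search is biased in a direction defined by the user.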
[0096] Using the above method of searching, the search may be
biased in a direction defined by the user. The above tunable search
process applied to organic chemical compounds is illustrated in
FIG. 4 through FIG. 7. FIG. 4 illustrates a computer monitor screen
display of the chemical structure of the query compound [structure image not reproduced]
[0097] Trovafloxacin (C20H15F3N4O3) as
input through one of the standard chemical drawing packages. FIG. 5
shows the structure of the query compound having been
"fingerprinted" using the twelve fragments [fragment images not reproduced].
[0098] These are shown graphically on the lower portion of the
screen. Sliders are shown next to each fragment, all preset to their
default values of 1. FIG. 6 shows the same screen where the user
has set the sliders for the [fragment image not reproduced]
[0099] fragment to 6.5, the [fragment image not reproduced]
[0100] fragment to 7, and the [fragment image not reproduced]
[0101] fragment to 7. FIG. 7 shows the results of the similarity search. In
the figure the first ten compounds (of fifty) found to be similar
to Trovafloxacin are shown arranged in order of decreasing similarity.
For example, the compound labeled 1/50 is
deemed by the search criteria to be most similar. It differs only
by substitution of fluorine (F) for the ethyl (CH2)
grouping.
[0102] FIG. 8 shows a program flow chart for a specific
implementation of the program shown in FIG. 3 in which the
screens shown in FIG. 4 through FIG. 7 are used. A printed program
listing for this system can be found in the APPENDIX attached
hereto. The system comprises a MICROSOFT VISUAL BASIC program and
an associated ORACLE database. In addition, the ACCORD CHEMISTRY
TOOLKIT available from ACCELRYS is used for certain chemistry
related functions (primarily substructure matching).
[0103] The ORACLE database requires at least two tables in this
implementation of the method:
[0104] Fragment dictionary table containing:
[0105] ID numbers
[0106] Chemical fragment structures in MOLFILE or other chemistry
structure format
[0107] Main compound table containing:
[0108] ID numbers
[0109] Chemical structures in MOLFILE or other chemistry structure
format
[0110] Chemical structure fingerprints (stored as binary bit string
or other numeric format)
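The two tables may be sketched, by way of example only, as follows. The sketch uses the sqlite3 module of the Python standard library as a stand-in for the ORACLE database; the table names, column names, and sample row are hypothetical. It assumes the fingerprint is stored as a bit string with one character per dictionary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fragment_dictionary (
        fragment_id INTEGER PRIMARY KEY,
        structure   TEXT NOT NULL   -- fragment in MOLFILE or other structure format
    );
    CREATE TABLE compound (
        compound_id INTEGER PRIMARY KEY,
        structure   TEXT NOT NULL,  -- full structure in MOLFILE or other format
        fingerprint TEXT NOT NULL   -- bit string, one character per dictionary key
    );
    """
)

# Placeholder structure text; a real row would hold the MOLFILE contents.
conn.execute(
    "INSERT INTO compound VALUES (?, ?, ?)", (1, "<molfile text>", "11011")
)

row = conn.execute(
    "SELECT fingerprint FROM compound WHERE compound_id = ?", (1,)
).fetchone()
stored_fingerprint = [int(bit) for bit in row[0]]  # back to a binary vector
```

Storing the fingerprint alongside each compound lets the similarity calculation read precomputed vectors rather than re-parsing every structure for each search.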
[0111] The VISUAL BASIC program is comprised of Forms, Modules, and
Class Modules.
[0112] Forms:
[0113] 1. Search (SearchAgent.frm)--This is the main form used in
the application. Query input and function execution are primarily
handled from this form.
[0114] 2. frmTune (Tune.frm)--The form used for tuning the fragment
weights used in the chemical tunable search.
[0115] 3. frmLogin (Login.frm)--This is a small form used to take
database name, user name, and password input from the user, and
then use the information to open the ORACLE database.
[0116] Modules:
[0117] 1. AccordSDK (ACCSDK50.BAS)--Module from Accelrys containing
chemical structure handling routines.
[0118] 2. AccordSDK Constraints (ACCSDK50CNST.BAS)--Definitions of
constraints used by the chemical structure toolkit.
[0119] 3. AccordSDK Fingerprints (ACCSDK50FP.BAS)--Fingerprint
handling routines.
[0120] 4. AccordSDKOld (ACCSDK50OLD.BAS)--Older versions of
routines included for backward compatibility.
[0121] 5. AccordSDKX (ACORDX50.BAS)--ActiveX controls to use on
forms in conjunction with the rest of the toolkit routines.
[0122] 6. Utilities (Utilities.BAS)--General purpose utility
functions.
[0123] Class Modules:
[0124] 1. cChemDb (cChemDb.cls)--Class for handling chemistry
related functions of the program.
[0125] 2. cChemUtils (cChemUtils.cls)--Class containing chemistry
utilities.
[0126] 3. cError (cError.cls)--Error handling and logging
class.
[0127] The detailed program execution follows:
[0128] 1. User starts the program.
[0129] 2. Search.form_load( ) executes:
[0130] Error handler is set up;
[0131] Accelrys Accord license is checked and a new Accord session
is created to allow use of the toolkit functions;
[0132] New database connections are set up and an Accord chemistry
object is created;
[0133] The active form is displayed to the user.
[0134] 3. User clicks "Open DB Connection":
[0135] cmdOpenDbConnection_click( ) executes;
[0136] A new login form is created and displayed;
[0137] User enters database connection information, username, and
password and then clicks OK.
[0138] 4. frmlogin.cmdOK_click( ) executes:
[0139] User supplied information is loaded into variables and form
is closed.
[0140] 5. Search.fLogin_close(Cancel as Integer) executes:
[0141] ORACLE database is opened using the Open method of the
mOraCnn(ORACLE connection) object.
[0142] 6. Search.mOraCnn_ConnectionComplete executes:
[0143] Status of the ORACLE connection is returned;
[0144] User is notified that DB is open;
[0145] Database record sets are opened and initialized;
[0146] User clicks OK button on status notice.
[0147] 7. User clicks Tune button.
[0148] 8. Search.cmdTune_click( ) executes:
[0149] Strings containing (fragment) key status information are
initialized;
[0150] Tuning form is loaded.
[0151] 9. frmTune.Form_load( ) executes:
[0152] Form checks (fragment) key status information and
initializes sliders (user weighting controls) if needed.
[0153] 10. User double clicks on a structure box--ISIS Draw
starts.
[0154] 11. User draws or reads a structure into ISIS Draw.
[0155] 12. User clicks return box on ISIS Draw and returns
structure to program.
[0156] 13. frmTune.chmTune_changes( ) event fires:
[0157] The arrays containing (fragment) key information are
initialized;
[0158] Any pictures of keys already present on the form are
removed;
[0159] If it does not already exist, a chemistry object is
created;
[0160] The TunableKeys method of the chemistry object is called
(cChem.TunableKeys);
[0161] For every key found in the query the appropriate members of
the key arrays are set;
[0162] Key arrays are returned to the calling routine;
[0163] For every key that has been set in the key arrays, a picture
and a slider are loaded and displayed on the form.
[0164] 14. The user adjusts the settings of the sliders to adjust
the weightings used in the similarity calculations.
[0165] 15. The user clicks the Search button.
[0166] 16. frmTune.cmdTunableSearch_click( ) executes:
[0167] The values of the sliders are loaded into the tunable key
arrays;
[0168] The structure contained in the Tune form is loaded into the
query box of the Search form.
[0169] 17. Search.cmdTunableProductAnalogySearch_click( ) is called
by frmTune:
[0170] If it does not already exist, a chemistry object is
created;
[0171] When the object is created, the database connection is
established and the record sets are opened;
[0172] The TunableProductAnalogySearch method of the chemistry
object is called using the tunable key arrays as input;
[0173] cChem.TunableProductAnalogySearch initialization routines
are fired;
[0174] Query structures are searched for certain heterocycles--If
the heterocycles are present, copies of the query are made and
edited to generate related molecules whose syntheses are related to
the initial query--Similarity search will be performed on the query
and related synthetically equivalent structures;
[0175] Calculate the similarity values between the query
compound(s) and the molecules in the database, sort, and store the
top 50 results;
[0176] SearchDone event is raised;
[0177] Search.cChem_SearchDone executes;
[0178] Search complete message is displayed to the user;
[0179] Answers are extracted from the database and displayed on the
form.
[0180] 18. User clicks "Done" button on Tune form:
[0181] Tune form unloads.
[0182] 19. User browses answers and runs another search at his or
her discretion.
[0183] The aforementioned data processing system was also
implemented in the C++ and JAVA programming languages in addition
to the MICROSOFT VISUAL BASIC implementation shown in the APPENDIX.
As described above, several prior art software packages were used
in the implementation of the system shown in the APPENDIX. The
ORACLE database is well known to those with ordinary skill in the
art. It was used only for the implementation discussed herein, and
any comparable database management system may be substituted
therefor. Similarly, ACCORD allows a user to search through
documents and files for chemical structures and reactions. ACCORD
CHEMEXPLORER recognizes a wide range of formats--ISIS/DRAW,
SKETCHFILES or CHEMDRAW files, MOLFILES, RXNFILES, SD and RD files,
MICROSOFT WORD documents, EXCEL spreadsheets or the like. It looks,
works and feels like the WINDOWS Finder. The ACCORD utilities are
well known. They were used only for the implementation demonstrated
herein, and the data processing routines contained in the ACCORD
utilities are prior art. Similarly functioning routines may be
easily substituted therefor. Finally, ISIS/DRAW was used in this
implementation as a means to input a chemical structure into the
program. This program is available from MDL®. It is one of many
programs of this type. The data processing routines contained in
ISIS/DRAW are prior art, and an equivalent utility may be
substituted therefor.
[0184] The above mentioned method may also be used to search for
biological data. In this case, the values of the elements in the
vector might not be binary. A biological response is a continuous
variable. For example, the binding strength of a drug to a
particular receptor would have a specific numeric value, and it
would be important to express that value in the vector. These
measurements are important for drug competition experiments where
relative binding strengths are relevant. They are also important
for antibody and monoclonal antibody research that involves binding
to specific epitope sites. However, the priority or weight that a
user would apply to a characteristic such as binding strength for a
particular receptor when performing a similarity search is
independent of the actual data. Compounds can be described based
upon biological response. Plotting the biological response over a
series of tests produces a graph possessing a characteristic shape.
A database of biological compounds may be probed for those having
characteristic shapes that are similar. Often compounds having a
similar profile would have similar modes of action. In this case, a
weighted search would provide a significant advantage. The
inventors have reduced this technique to practice with excellent
results by performing such a search with a highly descriptive
biological compound model based upon biological response.
[0185] FIG. 9 is an overview program flowchart showing the
computerized method of the invention disclosed herein being used to
perform similarity searches in a database of biological responses
to various compounds. The implementation of the method described
herein is in the form of a MICROSOFT EXCEL Spreadsheet with macros
performing all of the necessary functions. A source code listing
for this implementation appears in a section entitled, "COMPUTER
PROGRAM LISTING--TUNABLE BIOLOGICAL SEARCH," at the end of this
application. The data shown in the spreadsheet of FIG. 10 has been
separated into three parts, viz., FIGS. 10(a), (b), and (c). The
source of the data (hereinafter the LEWI Data) is the paper:
Janssen, Paul A. J.; Niemegeers, Carlos J. E.; and Schellekens,
Karel H. L.; "Is it Possible to Predict the Clinical Effects of
Neuroleptic Drugs (Major Tranquillizers) from Animal Data?--Part I:
`Neuroleptic activity spectra` for rats"; from the Janssen
Pharmaceutic n.v., Research Laboratoria, Beerse (Belgium), Drug
Research, Vol 15, Heft 2, 1965, pp 104-117. A copy of this paper is
provided with this application, and is incorporated by reference as
non-essential material in its entirety herein.
[0186] The following features are needed:
[0187] 1. A row of target biological activity scores to use as a
target in the similarity search. In the LEWI Data, there are twelve
measured responses.
[0188] 2. A row of weighting values to apply to the target
biological responses. Weightings are input by the user to indicate
the relative importance that the user places on
the associated biological test.
[0189] 3. A collection of data related to individual compounds and
their associated biological responses. For the purposes of this
implementation, the data are contained within the same spreadsheet
as the target input scores and the target weightings. The LEWI Data
set contains data on 40 compounds.
[0190] The program works as follows:
[0191] 1. After the user has entered the biological response
values, and the associated biological response weightings, the user
initiates the calculation by pressing the button for Euclidean
Distance or Tanimoto Coefficient.
[0192] 2. Using the user-supplied biological activity target
values, and user-supplied target weightings, the similarity values
are calculated for each compound in the data set.
[0193] 3. The calculated similarity value for each compound in the
data set is stored.
[0194] 4. After the similarity values for all the compounds in the
data set have been calculated, the data is then sorted in order of
decreasing similarity.
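The four steps above may be sketched, by way of example only, as follows. The sketch is written in Python rather than in the EXCEL macro language, the compound names and three-test response profiles are invented for illustration and are not LEWI Data, and the weighted Euclidean distance follows Equation [7]. Because a smaller distance means greater similarity, sorting in order of decreasing similarity is an ascending sort on distance.

```python
import math


def weighted_euclidean(target, profile, weights):
    """Equation [7] applied to continuous biological response values."""
    return math.sqrt(
        sum(w * (t - p) ** 2 for t, p, w in zip(target, profile, weights))
    )


def rank_compounds(target, weights, dataset):
    """Return (name, distance) pairs, most similar (smallest distance) first."""
    scored = [
        (name, weighted_euclidean(target, profile, weights))
        for name, profile in dataset.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Invented response profiles for three hypothetical compounds.
dataset = {
    "compound X": [1.0, 2.0, 3.0],
    "compound Y": [1.0, 2.0, 7.0],
    "compound Z": [4.0, 2.0, 3.0],
}
target = [1.0, 2.0, 3.0]

equal_ranking = rank_compounds(target, [1, 1, 1], dataset)
tuned_ranking = rank_compounds(target, [9, 1, 1], dataset)  # first test emphasized
```

With equal weights compound Z (distance 3) ranks ahead of compound Y (distance 4). Weighting the first test at 9 raises the distance of compound Z to 9, so compound Y moves ahead of it.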
[0195] For the convenience of the user, other features have been
added:
[0196] To simplify the entry of target biological activity values,
a control box has been set up to allow the user to select a
compound from the data set to use as a starting point for data entry.
Biological activity values from a selected data set compound are
loaded. Then the user can modify the values to suit his or her
needs.
[0197] When scrolling through the sorted output data, a graph
showing the relationship between the input target and the data set
compound data currently selected can be shown.
[0198] The routines are as follows:
[0199] 1. cboCompoundNames_change( )--A combo box,
cboCompoundNames, is loaded with the names of the compounds in the
data set contained within the EXCEL worksheet. When the user
selects a compound name from the combo box, the biological data
associated with the compound is loaded into the Target area at the
top of the spreadsheet. This is purely a convenience for the user,
not a critical feature.
[0200] 2. cmdEuclidLC50_click( )--This routine calculates the
Euclidean Distance between the user-supplied target biological data
values, and the biological data values for the compounds in the
data set using the appropriate user-supplied weights. Biological
data values are sorted according to the calculated Euclidean
Distances.
[0201] 3. cmdEuclidSpec_click( )--Not Used!
[0202] 4. cmdTanimotoLC50_click( )--This routine calculates the
Tanimoto Coefficient between the user-supplied target biological
data values, and the biological data values for the compounds in
the data set, using the appropriate user-supplied weights.
Biological data values are sorted according to the calculated
Tanimoto coefficients.
[0203] 5. cmdTanimotoSpec_click( )--Not Used!
[0204] 6. Worksheet_activate( )--Loads the combo box with the
compound names from the data set. Resets all weightings to 1. This
routine fires when the user opens the spreadsheet.
[0205] 7. Worksheet_SelectionChange(ByVal Target As Range)--This
routine checks to see if the selection is within the dataset range.
If it is within the range, then a chart showing the biological
responses from the Target and the selected compound is shown. This
routine uses the makeChart routine to create the charts.
[0206] 8. makeChart(ByVal row As Long)--This routine creates a chart
using the Target biological responses and the biological data from
a compound in the data set. This routine is called by
Worksheet_SelectionChange.
[0207] FIG. 11 represents four screen prints of the program shown
in FIG. 9 operating on the data shown in FIG. 10. The figure is
divided into four parts, viz., FIGS. 11(a), (b), (c), and (d). The
data for all versions of FIG. 11 are those shown in FIG. 10(a).
[0208] In FIG. 11(a), the cursor is positioned on the Target
compound, Aceperone (R3248) butyr (Row 17--Column A). The chart
shows the Tanimoto fingerprint for twelve test results on rats on a
logarithmic scale. Note the "Euclid LC50" and "Tanimoto LC50" radio
buttons. Since the target compound is only being compared with
itself, only one fingerprint is shown.
[0209] In FIG. 11(b), the cursor is positioned on Promazine phen
(Row 18--Column A). Here, the chart compares two fingerprints. The
darker graph is the fingerprint of Aceperone (R3248) butyr while
the lighter graph is the fingerprint of Promazine phen. Note how
closely the fingerprints of these adjacently sorted compounds
resemble each other.
[0210] In FIG. 11(c), the cursor is positioned on Levomepromazine
phen (Row 25--Column A). Once again there are two fingerprints
being compared where the darker graph is the fingerprint of
Aceperone (R3248) butyr and the lighter graph is the fingerprint of
Levomepromazine phen. Note here that the two graphs are far less
similar than those of FIG. 11(b).
[0211] In FIG. 11(d), the cursor is positioned on Trabuton (R1516)
butyr (Row 29--Column A). The darker graph is the fingerprint of
Aceperone (R3248) butyr and the lighter graph is the fingerprint of
Trabuton (R1516) butyr. Here, the two fingerprint graphs are far
less similar than those of FIGS. 11(b) and (c).
[0212] The systems, methods, and programs disclosed herein may be
implemented in hardware or software, or a combination of both.
Preferably, the techniques are implemented in computer programs
executing on programmable computers that each comprise a processor,
a storage medium readable by said processor (including volatile and
non-volatile memory and/or storage elements), at least one input
device, and at least one output device. Program code is applied to
data entered using the input device to perform the functions
described and to generate output information. The output
information is routed to one or more output devices.
[0213] Each such computer program is preferably stored on a storage
medium or device (e.g., CD-ROM, hard disk, magnetic tape, or
magnetic diskette) that is readable by a general or special purpose
programmable computer. Said computer program configures and
operates the computer when the storage medium or device is read by
the computer to perform the procedures described in this
application. The system may also be considered to be implemented as
a computer-readable storage medium, configured with a computer
program, where the storage medium so configured causes a computer
to operate in a specific and predefined manner. The present
invention may be embodied in computer-readable media, such as
floppy disks, ZIP or JAZ disks, conventional hard disks, optical
media, CD-ROMS, Flash ROMS, nonvolatile ROM, RAM and any other
equivalent computer memory device. It will be appreciated that the
system, method of operation and product may vary as to the details
of its configuration and operation without departing from the basic
concepts disclosed herein.
[0214] Some or all of the functionality may be implemented on an
analog computer or device or on a hybrid digital/analog computer.
User tuning (i.e., the process whereby weights are assigned to the
specific descriptive properties) is an area of the computerized
process most applicable to analog processing. The analog processing
devices used may be inter alia electrical, mechanical, optical,
hydraulic, or any other means for analog processing.
Analog-to-digital or digital-to-analog conversion may take place at
any step of the process.
[0215] Based upon the disclosure of the systems, processes,
methods, and computer programs herein, as well as the foregoing
discussion of apparatus considerations, it is apparent that one
skilled in the art would be able to implement the present invention
on any of the apparatuses or devices mentioned above without undue
experimentation.
3 1 COMPUTER PROGRAM LISTING - TUNABLE BIOLOGICAL SEARCH 2 Written
to be executed within a Microsoft Excel Spreadsheet, using MS Excel
Visual Basic. 3 Private Sub cboCompoundNames_Change() 4 'Copy the
data into the appropriate boxes 5 Dim i As Long, rowNumber As Long,
rng As Range, name As String 6 If cboCompoundNames.ListIndex = -1
Then 7 'Do Nothing, empty box 8 Else 9 'Get LC50 data 10 name =
cboCompoundNames.Text 11 Set rng = Range("A17:A56").Find(name) 12
rowNumber = rng.row 13 'Set the title for the rows in header area
14 Me.Cells(3, 1) = CStr(Me.Cells(rowNumber, 1)) & " LC50" 15
Me.Cells)5, 1) = "Weighting (0 - 9)" 16 For i = 2 To 13 17 'copy
LC50 values into row 3 18 Me.Cells(3, i) = Me.Cells(rowNumber, i)
19 'copy standard weights into row 5 20 Me.Cells(5, i) = 1# 21 Next
22 End If 23 End Sub 24 Private Sub cmdEuclidLC50_Click() 25 Dim
valueRow As Long, rowCount As Long, columnCoumt As Long 26 Dim
SumOfSquares As Double, targetCellvalue As Double, testCellValue As
Doube 27 Dim Difference As Double, EuclideanDistance As Double,
weight(1 To 20) As Double 28 Dim i As Long, j As Long, weightRow As
Long, rng As Range 29 valueRow = 3 30 weightRow = 5 31 columnCount
= 14 32 'Data in rows 17 to 56 33 For i = 17 To 56 34 SumOfSquares
= 0 35 'Include first column in data 36 For j = 2 To 13 37
weight(j) = Me.Cells(weightRow, j) 38 targetCellvalue =
Me.Cells(valueRow, j) 39 testCellValue = Me.Cells(i, j) 40
Difference = weight(j) * (targetCellValue - testCellValue) 41
        SumOfSquares = SumOfSquares + (Difference * Difference)
    Next j
    'Take the square root
    EuclideanDistance = Sqr(SumOfSquares)
    Me.Cells(i, columnCount + 1) = EuclideanDistance
Next i

'Now sort the results
Set rng = Range("A16:P56")
rng.Select
rng.Sort Key1:=Range("O16"), Order1:=xlAscending, Header:=xlYes, _
    MatchCase:=False, OrderCustom:=1, Orientation:=xlRows
Set rng = Range("A17")
rng.Select
End Sub

Private Sub cmdTanimotoLC50_Click()
    Dim valueRow As Long, rowCount As Long, columnCount As Long
    Dim SumASquared As Double, SumBSquared As Double, SumAtimesB As Double
    Dim i As Long, j As Long, tanimoto As Double, weight(1 To 20) As Double
    Dim weightRow As Long, rng As Range

    valueRow = 3
    weightRow = 5
    columnCount = 14

    'Calculate the Tanimoto Coefficient
    'Use all columns when operating on the untransformed data.
    'Weighted sum of squares for the query row (A)
    SumASquared = 0
    For j = 2 To 13
        weight(j) = Me.Cells(weightRow, j)
        SumASquared = SumASquared + weight(j) * _
            (Me.Cells(valueRow, j) * Me.Cells(valueRow, j))
    Next j

    'Compare the query row against each database row (B)
    For i = 17 To 56
        SumBSquared = 0
        SumAtimesB = 0
        For j = 2 To 13
            weight(j) = Me.Cells(weightRow, j)
            SumBSquared = SumBSquared + weight(j) * (Me.Cells(i, j) * Me.Cells(i, j))
            SumAtimesB = SumAtimesB + weight(j) * (Me.Cells(valueRow, j) * Me.Cells(i, j))
        Next j
        tanimoto = SumAtimesB / (SumASquared + SumBSquared - SumAtimesB)
        Me.Cells(i, columnCount + 2) = tanimoto
    Next i

    'Now sort the results
    Set rng = Range("A16:P56")
    rng.Select
    rng.Sort Key1:=Range("P16"), Order1:=xlDescending, Header:=xlYes, _
        MatchCase:=False, OrderCustom:=1, Orientation:=xlRows
    Set rng = Range("A17")
    rng.Select
End Sub

Private Sub Worksheet_Activate()
    Dim i As Long
    cboCompoundNames.Clear
    'Fill the combo box with the compound names
    If cboCompoundNames.ListCount = 0 Then
        For i = 3 To 42
            cboCompoundNames.AddItem Sheet1.Cells(i, 1).Value
        Next i
    End If
    'Fill the weighting cells with the standard value (1.00)
    For i = 2 To 14
        Me.Cells(4, i) = 1#
    Next
End Sub

Private Sub Worksheet_SelectionChange(ByVal Target As Range)
    'Look to see if we are in the LC50 rows or spectral rows and then put up a chart
    Dim newrow As Long
    newrow = Target.Row
    If newrow <> oldrow Then
        'make a chart
        makeChart (newrow)
    End If
    oldrow = newrow
End Sub

Private Sub makeChart(ByVal row As Long)
    Dim co As ChartObject, cw As Long, rh As Long
    Dim rng As Range, oCell As Range, selection As String
    Dim MinimumValue As Double

    'Get rid of old charts
    If ActiveSheet.ChartObjects.Count > 0 Then
        Do
            ActiveSheet.ChartObjects.Delete
        Loop Until ActiveSheet.ChartObjects.Count = 0
    End If

    If row < 57 And row > 16 Then
        'Charts for LC50 similarities
        selection = "A3:M3, " & "A" & row & ":M" & row
        'Rows("2:2").Select
        'Rows(CStr(row) & ":" & CStr(row)).Select
        'Create column width and row height units
        cw = Columns(2).Width 'In points
        rh = Rows(1).Height
        'Place chart with respect to upper left corner of A1
        ' ( Left, Top, Width, Height )
        Set co = ActiveSheet.ChartObjects.Add(cw * 7.5, rh * 5.5, cw * 7, rh * 18)
        co.Name = "Test Chart"
        'Set the chart type
        'co.Chart.ChartType = xlXYScatterSmooth
        'co.Chart.ChartType = xlLine
        co.Chart.ChartType = xlLineMarkers
        co.Chart.HasLegend = False
        'Attach the data to the chart
        ' Source:=ActiveSheet.Range("B1;I1", selection),
        co.Chart.SeriesCollection.Add _
            Source:=ActiveSheet.Range(selection), _
            Rowcol:=xlRows
        co.Chart.HasTitle = False
        'These are the standard default values
        With co.Chart
            .HasAxis(xlCategory, xlPrimary) = True
            .HasAxis(xlCategory, xlSecondary) = False
            .HasAxis(xlValue, xlPrimary) = True
            .HasAxis(xlValue, xlSecondary) = False
        End With
        'Get the names from the first row, category names missing from scatter plot
        co.Chart.Axes(xlCategory).CategoryNames = _
            ActiveSheet.Range("B1:M1")
        'co.Chart.Axes(xlValue).CrossAt = xlAxisCrossesMinimum 'This doesn't work here
        'MinimumValue = co.Chart.Axes(xlValue, xlPrimary).MinimumScale
        co.Chart.Axes(xlValue, xlPrimary).MinimumScale = -4#
        co.Chart.Axes(xlValue, xlPrimary).MaximumScale = 2#
        co.Chart.Axes(xlValue).CrossesAt = -4#
        co.Chart.Axes(xlValue).HasTitle = True
        'co.Chart.Axes(xlValue).AxisTitle.Orientation = xlHorizontal
        'co.Chart.Axes(xlValue).AxisTitle.Orientation = xlVertical
        co.Chart.Axes(xlValue).AxisTitle.Orientation = xlUpward
        'co.Chart.Axes(xlValue).AxisTitle.Orientation = xlDownward
        co.Chart.Axes(xlValue).AxisTitle.Text = "Log(1/C)"
        'add a data table to the bottom
        'data table doesn't appear in scatter plots, does appear in line graphs
        'co.Chart.HasDataTable = True
        'Doesn't affect a line graph, does affect scatter plot
        co.Chart.SeriesCollection(1).MarkerSize = 5
        co.Chart.SeriesCollection(1).MarkerStyle = xlMarkerStyleDiamond
    End If
End Sub
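
The weighted Tanimoto calculation performed by cmdTanimotoLC50_Click can also be sketched outside the spreadsheet. The following Python fragment is offered only as an illustration of that calculation and of the subsequent sort in order of decreasing similarity; it is not part of the listing above. The function names weighted_tanimoto and rank_by_similarity, the compound names, and the numeric values are hypothetical, and the handling of a zero denominator is an assumption, since the listing does not guard against that case.

```python
def weighted_tanimoto(query, item, weights):
    """Weighted Tanimoto coefficient of two equal-length numeric vectors.

    Mirrors the sums in the listing: SumASquared, SumBSquared, SumAtimesB.
    """
    sum_a_squared = sum(w * a * a for w, a in zip(weights, query))
    sum_b_squared = sum(w * b * b for w, b in zip(weights, item))
    sum_a_times_b = sum(w * a * b for w, a, b in zip(weights, query, item))
    denominator = sum_a_squared + sum_b_squared - sum_a_times_b
    if denominator == 0:
        # Assumption: report zero similarity when both weighted vectors vanish.
        return 0.0
    return sum_a_times_b / denominator


def rank_by_similarity(query, database, weights):
    """Return (name, coefficient) pairs sorted by decreasing similarity."""
    scored = [
        (name, weighted_tanimoto(query, vector, weights))
        for name, vector in database.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-property fingerprints, for illustration only.
database = {
    "compound_A": [1.0, 2.0, 3.0],
    "compound_B": [3.0, 2.0, 1.0],
    "compound_C": [1.0, 2.0, 2.5],
}
query = [1.0, 2.0, 3.0]

# Search with the standard weight (1.00) on every property.
for name, score in rank_by_similarity(query, database, [1.0, 1.0, 1.0]):
    print(name, round(score, 4))

# Re-run after the user removes the third property from consideration.
for name, score in rank_by_similarity(query, database, [1.0, 1.0, 0.0]):
    print(name, round(score, 4))
```

Calling rank_by_similarity a second time with changed weights corresponds to the user adjusting the weighting cells and re-running the similarity search; only the weights change, while the stored fingerprints are reused.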
* * * * *