U.S. patent application number 09/777278 was filed with the patent office on 2001-11-01 for multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values.
This patent application is currently assigned to International Business Machines Corporation. Invention is credited to Agrawal, Rakesh, Chakrabarti, Soumen, Dom, Byron Edward, Raghavan, Prabhakar.
Application Number | 20010037324 09/777278 |
Document ID | / |
Family ID | 26728460 |
Filed Date | 2001-11-01 |
United States Patent
Application |
20010037324 |
Kind Code |
A1 |
Agrawal, Rakesh ; et
al. |
November 1, 2001 |
Multilevel taxonomy based on features derived from training
documents classification using fisher values as discrimination
values
Abstract
A system, process, and article of manufacture for organizing a
large text database into a hierarchy of topics and for maintaining
this organization as documents are added and deleted and as the
topic hierarchy changes. Given sample documents belonging to
various nodes in the topic hierarchy, the tokens (terms. phrases,
dates, or other usable feature in the document) that are most
useful at each internal decision node for the purpose of routing
new documents to the children of that node are automatically
detected. Using feature terms, statistical models are constructed
for each topic node. The models are used in an estimation technique
to assign topic paths to new unlabeled documents. The hierarchical
technique, in which feature terms can be very different at
different nodes, leads to an efficient context-sensitive
classification technique. The hierarchical technique can handle
millions of documents and tens of thousands of topics. A resulting
taxonomy and path enhanced retrieval system (TAPER) is used to
generate context-dependent document indexing terms. The topic paths
are used, in addition to keywords, for better focused searching and
browsing of the text database.
Inventors: |
Agrawal, Rakesh; (San Jose,
CA) ; Chakrabarti, Soumen; (San Jose, CA) ;
Dom, Byron Edward; (Los Gatos, CA) ; Raghavan,
Prabhakar; (Saratoga, CA) |
Correspondence
Address: |
GATES & COOPER LLP
HOWARD HUGHES CENTER
6701 CENTER DRIVE WEST, SUITE 1050
LOS ANGELES
CA
90045
US
|
Assignee: |
International Business Machines
Corporation
Armonk
NY
|
Family ID: |
26728460 |
Appl. No.: |
09/777278 |
Filed: |
February 5, 2001 |
Related U.S. Patent Documents
|
|
|
|
|
|
Application
Number |
Filing Date |
Patent Number |
|
|
09777278 |
Feb 5, 2001 |
|
|
|
09102861 |
Jun 23, 1998 |
|
|
|
6233575 |
|
|
|
|
60050611 |
Jun 24, 1997 |
|
|
|
Current U.S.
Class: |
1/1 ;
707/999.001; 707/E17.091 |
Current CPC
Class: |
Y10S 707/99932 20130101;
Y10S 707/99936 20130101; G06F 16/355 20190101 |
Class at
Publication: |
707/1 |
International
Class: |
G06F 007/00 |
Claims
What is claimed is:
1. A process for classifying new documents containing features
under nodes defining a multilevel taxonomy, based on features
derived from a training set of documents that have been classified
under respective nodes of the taxonomy, the process comprising:
associating a respective set of features with each one of said
plurality of nodes, each given set of features comprising a
plurality of features that are in at least one training document
classified under the associated node; and classifying each new
document under at least one node, based on the set of features
associated with said at least one node.
2. A process as recited in claim 1, wherein said step of
associating comprises the steps of: determining a discrimination
value for each term in at least one training document which is
classified under each one of a plurality of the nodes of the
taxonomy; and determining a minimum discrimination value for each
of said plurality of nodes; wherein the features in each given set
of features have discrimination values equal to or above the
minimum discrimination value determined for the node associated
with the given set of features.
3. A process as recited in claim 1, wherein said step of
classifying comprises: scanning each new document to determine the
features in the document; and defining, for each of said plurality
of said plurality of nodes and for each new document, the
probability that the new document is classified under the node,
based on the set of features associated with the node and the
features in the document.
4. A process as recited in claim 3, wherein said step of defining
the probability comprises the step of applying a statistical model
to define said probability that features in each given new document
would occur at the frequency at which they do occur in the given
new document.
5. A process as recited in claim 4, wherein said statistical model
comprises a Bernoulli model.
6. A process as recited in claim 4, wherein said statistical model
comprises a Poisson model.
7. A process as recited in claim 4, wherein said step of
classifying further comprises the step of assigning each given new
document to at least one respective node in at least one level of
the taxonomy, wherein the at least one node to which each given new
document is assigned is the node for which the defined probability
is above a predefined threshold among all of the nodes at the same
level in the taxonomy.
8. A process as recited in claim 7, wherein the at least one node
to which each given new document is assigned is the node for which
the defined probability is maximum among all of the nodes at the
same level in the taxonomy.
9. A process as recited in claim 8, wherein said step of assigning
each given new document to at least one respective node in at least
one level of the taxonomy comprises the step of assigning each
given new document to at least one respective node in each of a
plurality of levels of the taxonomy.
10. A process as recited in claim 2, wherein said step of selecting
a set of features comprises selecting features that are in a
plurality of training documents classified under the associated
node and that have discrimination values equal to or above the
minimum discrimination value.
11. A process as recited in claim 2, wherein said step of selecting
a set of features comprises selecting features that are in all of
training documents classified under the associated node and that
have discrimination values equal to or above the minimum
discrimination value.
12. A process as recited in claim 2, wherein said step of
determining a discrimination value comprises determining a
discrimination value for each feature in a plurality of training
documents which are classified under each one of a plurality of the
nodes of the taxonomy.
13. A process as recited in claim 2, wherein said step of
determining a discrimination value comprises determining a
discrimination value for each feature in all of the training
documents which are classified under each one of a plurality of the
nodes of the taxonomy.
14. A process as recited in claim 2, wherein said step of
determining a discrimination value for each feature comprises
determining a Fisher value for each feature, based on the equation:
20 Fisher ( t ) = c 1 c 2 ( ( c 1 , t ) - ( c 2 , t ) ) 2 c 1 / c (
x ( d , t ) - ( c , t ) ) 2 where: t represents a term, d
represents a document,
.mu.(c,t)=1/.vertline.c.vertline..SIGMA..sub.d.di-elect cons.c
X(d,t), and x(d,t) =the occurrence rate of t in d.
15. A process as recited in claim 1, wherein said step of
associating a respective set of features with each node comprises
the step of determining the number of features to associate with
each respective node.
16. A process as recited in claim 15, wherein said step of
associating a respective set of features with each given node
comprises the steps of: ranking, by discrimination power, each of a
plurality of features that are in at least one training document
classified under the each given node; providing an optimal number N
of features for each given node; and defining the set of features
associated with a given node as the features ranked highest to the
Nth highest in said step of ranking.
17. A process as recited in claim 16, wherein said step of
providing an optimal number N comprises the step of determining the
number N for each given node based on a test set of documents.
18. A process as recited in claim 1, further comprising the step of
displaying, for given node of a plurality of nodes of the taxonomy,
a signature comprising at least one feature associated with the
documents classified under the given node.
19. A process as recited in claim 18, wherein said signature for
each given node comprises a plurality of features associated with
the documents classified under the given node.
20. A process as recited in claim 18, wherein said signature for
each given node comprises a plurality of features that occur in the
documents classified under the given node, but which are determined
to have a relatively low frequency of occurrence among documents
under the given node.
21. A process for searching for documents relevant to a search
query from a group of accessible documents containing terms,
comprising the steps of: defining a multilevel taxonomy having a
plurality of nodes, including a root node, at least one
intermediate node associated with and under the root node and a
plurality of terminal nodes associated with and under each
intermediate node; classifying each one of a plurality of training
documents with at least one of the terminal and intermediate nodes;
determining a discrimination value for each term in at least one
training document which is classified with each one of a plurality
of the terminal and intermediate nodes of the taxonomy; determining
a minimum discrimination value for each of said plurality of
terminal and intermediate nodes; selecting a set of feature terms
associated with each one of said plurality of terminal and
intermediate nodes, said feature terms comprising terms that are in
at least one training document classified with the associated node
or any node under the associated node and that have discrimination
values equal to or above the minimum discrimination value;
receiving a search query: determining a plurality of search
documents, each search document comprising one of the accessible
document that is relevant to the search query; classifying each
search document with at least one of the terminal and intermediate
nodes of the taxonomy, based on the sets of feature terms
associated with the terminal and intermediate nodes of the
taxonomy; displaying a list of nodes with or under which said
search documents are classified; selecting at least one of the
displayed nodes; and displaying at least one search document
classified under each selected node.
22. A process as recited in claim 21: wherein said search query
comprises at least one search term; and wherein each search
document comprises one of the accessible document that contain said
at least one search term as one of the terms in the document.
23. A process as recited in claim 21: wherein, following said step
of selecting at least one of the displayed nodes and prior to said
step of displaying at least one search document, said process
further includes the steps of displaying a second list of further
nodes with or under which said search documents are classified; and
selecting at least one of the displayed further nodes; and wherein
said step of displaying at least one search document comprises the
step of displaying at least one search document classified under
each selected further node.
24. A process as recited in claim 21, wherein said step of
displaying a list of nodes with or under which said search
documents are classified further comprises the step of displaying
signature terms associated with said search documents classified
with or under each of said nodes in the displayed list.
25. A process as recited in claim 24, wherein said signature terms
comprise a plurality of the most frequently occurring terms in the
search documents that are also feature terms.
26. A classifier system for classifying new documents containing
terms under nodes defining a multilevel taxonomy, based on feature
terms derived from a training set of documents which are classified
under respective nodes of the taxonomy, the system comprising:
means for determining a discrimination value for each term in at
least one training document which is classified under each one of a
plurality of the nodes of the taxonomy; means for determining a
minimum discrimination value for each of said plurality of nodes;
means for selecting a set of feature terms associated with each one
of said plurality of nodes, said feature terms comprising terms
that are in at least one training document classified under the
associated node and that have discrimination values equal to or
above the minimum discrimination value; and means for classifying
each new document under at least one node, based on the feature
terms associated with said at least one node.
27. A system as recited in claim 26, wherein said means for
classifying comprises: means for scanning each new document to
determine the terms in the document; and means for defining, for
each of said plurality of said plurality of nodes and for each new
document, the probability that the new document is classified under
the node, based on the feature terms associated with the node and
the terms in the document.
28. A system as recited in claim 27, wherein said means for
defining the probability comprises means for applying a Bernoulli
model to define said probability for each of said plurality of
nodes.
29. A system as recited in claim 26, wherein said means for
selecting a set of feature terms comprises means for selecting
terms that are in a plurality of training documents classified
under the associated node and that have discrimination values equal
to or above the minimum discrimination value.
30. A system as recited in claim 26, wherein said means for
selecting a set of feature terms comprises means for selecting
terms that are in all of training documents classified under the
associated node and that have discrimination values equal to or
above the minimum discrimination value.
31. A system as recited in claim 26, wherein said means for
determining a discrimination value comprises means for determining
a discrimination value for each term in a plurality of training
documents which are classified under each one of a plurality of the
nodes of the taxonomy.
32. A system as recited in claim 26, wherein said means for
determining a discrimination value comprises means for determining
a discrimination value for each term in all of the training
documents which are classified under each one of a plurality of the
nodes of the taxonomy.
33. A system as recited in claim 26, wherein said means for
determining a discrimination value for each term comprises means
for determining a Fisher value for each term, based on the
equation: 21 Fishers ( t ) = c1 , c2 ( ( c 1 , t ) - ( c 2 , t ) )
2 c 1 / c ( x ( d , t ) - ( c , t ) ) 2 where: t represents a term,
d represents a document,
.mu.(c,t)=1/.vertline.c.vertline..SIGMA..sub.d.di-elect
cons.cx(d,t), and x(d,t)=the occurrence rate of t in d.
34. A system for searching for documents relevant to a search query
from a group of accessible documents containing terms, comprising:
means for defining a multilevel taxonomy having a plurality of
nodes, including a root node, at least one intermediate node
associated with and under the root node and a plurality of terminal
nodes associated with and under each intermediate node; means for
classifying each one of a plurality of training documents with at
least one of the terminal and intermediate nodes; means for
determining a discrimination value for each term in at least one
training document which is classified with each one of a plurality
of the terminal and intermediate nodes of the taxonomy; means for
determining a minimum discrimination value for each of said
plurality of terminal and intermediate nodes; means for selecting a
set of feature terms associated with each one of said plurality of
terminal and intermediate nodes, said feature terms comprising
terms that are in at least one training document classified with
the associated node or any node under the associated node and that
have discrimination values equal to or above the minimum
discrimination value; means for receiving a search query; means for
determining a plurality of search documents, each search document
comprising one of the accessible document that is relevant to the
search query; means for classifying each search document with at
least one of the terminal and intermediate nodes of the taxonomy,
based on the sets of feature terms associated with the terminal and
intermediate nodes of the taxonomy; means for displaying a list of
nodes with or under which said search documents are classified;
means for selecting at least one of the displayed nodes; and means
for displaying at least one search document classified under each
selected node.
35. A system as recited in claim 34: wherein said search query
comprises at least one search term; and wherein each search
document comprises one of the accessible document that contain said
at least one search term as one of the terms in the document.
36. A system as recited in claim 34, wherein said means for
displaying a list of nodes with or under which said search
documents are classified further comprises means for displaying
signature terms associated with said search documents classified
with or under each of said nodes in the displayed list.
37. A system as recited in claim 34, wherein said signature terms
comprise a plurality of the most frequently occurring terms in the
search documents that are also feature terms.
38. An article of manufacture comprising a computer program carrier
readable by a computer and embodying one or more instructions
executable by the computer to perform a process for classifying new
documents containing terms under nodes defining a multilevel
taxonomy, based on feature terms derived from a training set of
documents which are classified under respective nodes of the
taxonomy, the process comprising: determining a discrimination
value for each term in at least one training document which is
classified under each one of a plurality of the nodes of the
taxonomy; determining a minimum discrimination value for each of
said plurality of nodes; selecting a set of feature terms
associated with each one of said plurality of nodes, said feature
terms comprising terms that are in at least one training document
classified under the associated node and that have discrimination
values equal to or above the minimum discrimination value; and
classifying each new document under at least one node, based on the
feature terms associated with said at least one node.
39. An article as recited in claim 38, wherein said step of
classifying comprises: scanning each new document to determine the
terms in the document; and defining, for each of said plurality of
said plurality of nodes and for each new document, the probability
that the new document is classified under the node, based on the
feature terms associated with the node and the terms in the
document.
40. An article as recited in claim 39, wherein said step of
defining the probability comprises the step of applying a Bernoulli
model to define said probability for each of said plurality of
nodes.
41. An article as recited in claim 38, wherein said step of
selecting a set of feature terms comprises selecting terms that are
in a plurality of training documents classified under the
associated node and that have discrimination values equal to or
above the minimum discrimination value.
42. An article as recited in claim 38, wherein said step of
selecting a set of feature terms comprises selecting terms that are
in all of training documents classified under the associated node
and that have discrimination values equal to or above the minimum
discrimination value.
43. An article as recited in claim 38, wherein said step of
determining a discrimination value comprises determining a
discrimination value for each term in a plurality of training
documents which are classified under each one of a plurality of the
nodes of the taxonomy.
44. An article as recited in claim 38, wherein said step of
determining a discrimination value comprises determining a
discrimination value for each term in all of the training documents
which are classified under each one of a plurality of the nodes of
the taxonomy.
45. An article as recited in claim 38, wherein said step of
determining a discrimination value for each term comprises
determining a Fisher value for each term, based on the equation: 22
Fishers ( t ) = c1 , c2 ( ( c 1 , t ) - ( c 2 , t ) ) 2 c 1 / c ( x
( d , t ) - ( c , t ) ) 2 where: t represents a term, d represents
a document, .mu.(c,t)=1/.vertline.c.vertline..SIGMA..sub.d.di-elect
cons.cx(d,t), and x(d,t)=the occurrence rate of t in d.
46. An article of manufacture comprising a computer program carrier
readable by a computer and embodying one or more instructions
executable by the computer for searching for documents relevant to
a search query from a group of accessible documents containing
terms, comprising the steps of: defining a multilevel taxonomy
having a plurality of nodes, including a root node, at least one
intermediate node associated with and under the root node and a
plurality of terminal nodes associated with and under each
intermediate node; classifying each one of a plurality of training
documents with at least one of the terminal and intermediate nodes;
determining a discrimination value for each term in at least one
training document which is classified with each one of a plurality
of the terminal and intermediate nodes of the taxonomy; determining
a minimum discrimination value for each of said plurality of
terminal and intermediate nodes; selecting a set of feature terms
associated with each one of said plurality of terminal and
intermediate nodes, said feature terms comprising terms that are in
at least one training document classified with the associated node
or any node under the associated node and that have discrimination
values equal to or above the minimum discrimination value;
receiving a search query; determining a plurality of search
documents, each search document comprising one of the accessible
document that is relevant to the search query; classifying each
search document with at least one of the terminal and intermediate
nodes of the taxonomy, based on the sets of feature terms
associated with the terminal and intermediate nodes of the
taxonomy; displaying a list of nodes with or under which said
search documents are classified; selecting at least one of the
displayed nodes; and displaying at least one search document
classified under each selected node.
47. An article as recited in claim 46, wherein said step of
displaying a list of nodes with or under which said search
documents are classified further comprises the step of displaying
signature terms associated with said search documents classified
with or under each of said nodes in the displayed list.
48. An article as recited in claim 47, wherein said signature terms
comprise a plurality of the most frequently occurring terms in the
search documents that are also feature terms.
Description
PROVISIONAL APPLICATION
[0001] The present application claims the benefit of U.S.
Provisional Application Ser. No. 60/050,611, entitled "USING
TAXONOMY, DISCRIMINANTS, AND SIGNATURES FOR NAVIGATING IN TEXT
DATABASES", filed Jun. 24, 1997, by Rakesh Agrawal, et al.,
attorney's reference number AM9-97-060, which is incorporated
herein by reference, in its entirety.
BACKGROUND OF THE INVENTION
[0002] 1. Field of the Invention
[0003] The present invention relates, generally, to a process,
system and article of manufacture for organizing and indexing
information items such as documents by topic, and in preferred
embodiments, to such a process, system and article which employ a
topic hierarchy and involve a determination of discriminating terms
and stop terms at each internal node in the topic hierarchy.
[0004] 2. Description of Related Art
[0005] With modem advances in computer technology, modem speeds and
network and internet technologies, vast amounts of information have
become readily available in homes, businesses and educational and
government institutions throughout the world. Indeed, many
businesses, individuals and institutions rely on
computer-accessible information on a daily basis. This global
popularity has further increased the demand for even greater
amounts of computer-accessible information. However, as the total
amount of accessible information increases, the ability to locate
specific items of information within the totality becomes
increasingly more difficult.
[0006] The format with which the accessible information is arranged
also affects the level of difficulty in locating specific items of
information within the totality. For example, searching through
vast amounts of information arranged in a free-form format can be
substantially more difficult and time consuming than searching
through information arranged in a pre-defined order, such as by
topic, date, category, or the like. However, due to the nature of
certain on-line systems, such as the internet, much of the
accessible information is placed on-line in the form of free-format
text. Moreover, the amount of on-line data in the form of
free-format text continues to grow very rapidly.
[0007] Search schemes employed to locate specific items of
information among the on-line information content, typically depend
upon the presence or absence of key words (words included in the
user-entered query) in the searchable text. Such search schemes
identify those textual information items that include (or omit) the
key words. However, in systems, such as the web, or large
intranets, where the total information content is relatively large
and free-form, key word searching can be problematic, for example,
resulting in the identification of numerous text items that contain
(or omit) the selected key words, but which are not relevant to the
actual subject matter to which the user intended to direct the
search.
[0008] As text repositories grow in number and size and global
connectivity improves, there is a pressing need to support
efficient and effective information retrieval (IR), searching and
filtering. A manifestation of this need is the recent proliferation
of over one hundred commercial text search engines that crawl and
index the web, and several subscription-based information multicast
mechanisms. Nevertheless, there is little structure on the
overwhelming information content of the internet.
[0009] Common practices for managing such information complexity on
the internet or in database structures typically involve
tree-structured hierarchical indices. Many internet directories.
such as Yahoo!.TM. (http://www.yahoo.com) and Infoseek
(http://www.infoseek.com) are largely manually organized in preset
hierarchies. International Business Machine Corporation has
implemented a patent database (http://www.ibm.com/patents- ) which
is organized by the US Patent Office's class codes, which form a
preset hierarchy. Digital libraries that mimic hardcopy libraries
support some form of subject indexing such as the Library of
Congress Catalogue, which is also hierarchical. Such topic
hierarchies are referred to herein as "taxonomies." Taxonomies can
provide a means for designing vastly enhanced searching, browsing
and filtering systems. Querying with respect to a topic can be more
reliable than depending only on the presence or absence of specific
words in documents. By the same token, multicast systems such as
PointCast (http://www.pointcast.com) are likely to achieve higher
quality by registering a user profile in terms of classes in a
taxonomy rather than key words.
[0010] The danger in querying or filtering by keywords alone is
that there may be many aspects to, and often different
interpretations of the key words, and many of these aspects and
interpretations are irrelevant to the subject matter that the
searcher intended to find.
[0011] Consider, for example, a situation in which a wildlife
researcher is attempting to find information about the running
speed of the jaguar, using the conventional Alta Vista.TM. internet
search engine (http://www.altavista.digital.com), with the query
"jaguar speed". In a test search conducted with the above-noted
search engine and query, a variety of responses were generated,
spanning the car, the Atari.TM. video game, the football team, and
a LAN server, in no particular order. The first page about the
animal was ranked 183, and was directed to a fable.
[0012] To eliminate the responses on cars, the test query was then
changed to "jaguar speed-car-auto". The top response in the
generated results read as follows:
[0013] "If you own a classic Jaguar, you are no doubt aware how
difficult it can be to find certain replacement parts. This is
particularly true of gearbox parts."
[0014] The words car and auto do not occur on this page. There was
no cat in the first 50 pages of the generated response. Some search
engines such as Alta Vista.TM. propose additional keywords to
refine the query, but, at the time of writing, all of the keyword
were related to cars or football.
[0015] Even the query "jaguar speed+cat" gave unsatisfactory
results. The responses included the word "cat", but were often
about automobiles. The 25th page was the first with information
about jaguars, but did not contain the desired information.
[0016] In contrast, if a topic taxonomy such as Yahoo.TM. is used,
there is no problem in insisting that the user seeks documents
containing "jaguar" in the topical context of animals, not cars.
Unfortunately, it is labor-intensive to maintain Yahoo.TM. manually
as the web changes and grows faster than ever. In our test case,
even though the search was easily restricted to within animals, no
answer could be found within the relatively small collection
returned.
[0017] Search engines are still an immature technology. Other areas
have been researched intensively long before web search engines
were devised, and the following discussion surveys the following
overlapping areas of related research: Information Retrieval (IR)
systems and text databases, data mining, statistical pattern
recognition, and machine learning.
[0018] For data mining, machine learning, and pattern recognition,
the supervised classification problem has been addressed in
statistical decision theory (both classical, as in Wald,
Statistical Decision Functions, 1950, and Bayesian, as in Berger,
Statistical Decision Theory and Bayesian Analysis, 1985, each of
which is incorporated herein by reference), in statistical pattern
recognition (as in Duda and Hart, Pattern Classification and Scene
Analysis, 1973 and Fukunaga, An Introduction to Statistical Pattern
Recognition, 1990, each of which is incorporated herein by
reference), in machine learning (as in Weiss and Kulikowski,
Computer Systems that Learn, 1990, Natarajan, Machine Learning: A
Theoretical Approach, 1991, and Langley, Elements of Machine
Learning, 1996, each of which is incorporated herein by
reference).
[0019] Classifiers can be parametric or non-parametric. Two
well-known classes of non-parametric classifiers are decision
trees, such as CART (as in Breiman et al, Classification and
Regression Trees, 1984, which is incorporated herein by reference)
and C4.5 (as in Quinlan, C4.5: Programs for Machine Learning, 1993,
which is incorporated herein by reference), and neural networks (as
in Hush and Horne, Progress in Supervised Neural Networks, 1993,
Lippmann, Pattern Classification using Neural Networks, 1989, and
Jain et al, Artificial Neural Networks, 1996, each of which is
incorporated herein by reference. For such classifiers, feature
sets larger than 100 are considered extremely large. Document
classification may require more than 50,000.
[0020] The most mature ideas in IR systems and text databases,
which are also successfully integrated into commercial text search
systems such as Verity, ConText, and Alta Vista, involve processing
at a relatively syntactic level (e.g., stopword filtering,
tokenizing, stemming, building inverted indices, computing
heuristic term weights, and computing similarity measures between
documents and queries in the vector-space model. as described by
Rijsbergen, Information Retrieval, 1979, Salton and McGill,
Introduction to Modern Information Retrieval, 1983, or Frakes and
Baeza-Yates, Information Retrieval: Data Structures and Algorithms,
1992, each of which is incorporated herein by reference). More
recent work includes statistical modeling of documents,
unsupervised clustering (where documents are not labeled with
topics and the goal is to discover coherent clusters, as described
in Anick and Vaithyanathan, Exploiting Clustering and Phrases for
Content-based Information Retrieval, 1997, which is incorporated
herein by reference), supervised classification (as in Apte et al,
Automated Learning of Decision Rules for Text Categorization, 1994,
and Cohen and Singer, Context Sensitive Learning Methods for Text
Categorization, 1996. each of which is incorporated herein by
reference), query expansion (as in Schutze et al, A Comparison of
Classifiers and Document Representations for the Routing Problem,
1995, and Voorhees, Using WordNet to Disambiguate Word Senses for
text Retrieval, 1993. each of which is incorporated herein by
reference).
[0021] Singular value decomposition on the term-document matrix has
been found to cluster semantically related documents together even
if they do not share keywords (as discussed in Deerwester et al,
Indexing by Latent Semantic Analysis, 1990, and Papadimitriou et
al, Latent Semantic Indexing: A Probabilistic Analysis, 1996, each
of which is incorporated herein by reference). None of these works
address the supervised topic analysis problem in a hierarchy or how
to use context-dependent words for indexing, how to automatically
and efficiently compute good feature sets, and how to maintain disk
data structures as training documents and the topic structure
changes with time.
[0022] Koller and Sahami, Hierarchically Classifying Documents
Using Very Few Words, International Conference on Machine Learning,
July 1997 and Yang and Pedersen, A comparative study on feature
selection in text categorization, International Conference on
Machine Learning, July 1997 discuss classification. Koller et al
propose a sophisticated feature selection algorithm that uses a
Bayesian net to learn interterm dependencies. The complexity in the
number of features is supralinear (e.g., quadratic in the number of
starting terms and exponential in the degree of dependence between
terms). Consequently, the reported experiments have been restricted
to a few thousand features and documents. Yang and Pedersen's
experiments appear to indicate that much simpler methods suffice,
in particular, that the approach of Apte et al of picking a fixed
fraction of most frequent terms per class performs reasonably. This
fraction may be very sensitive to corpus and methodology (e.g.
whether stemming and stopwording is performed). T his is indicated
by the poor performance of methods observed in recent work by
Mladenic, Feature Subset Selection In Text Learning, 10th European
Conference on Machine Learning, 1998.
SUMMARY OF THE DISCLOSURE
[0023] Accordingly, it is an object of the present invention to
enable scalable, efficient. reliable, and semi-automatic
organization and reorganization of a database of information
entities, such as text and hypertext documents, into a topic
hierarchy with the express, but not exclusive, purpose of
facilitating searching and filtering of documents as per the user's
information need.
[0024] It is an object of further preferred embodiments to provide
a system which uses the topic taxonomy to present the user with a
series of progressively refined views of document collections in
response to queries, and to discover and highlight salient
differences between 1:0 two or more collections of documents.
[0025] It is an object of yet further preferred embodiments to
provide such a system which is sufficiently fast, for example, for
use in conjunction with a crawler or newswire service.
[0026] It is yet another object of further preferred embodiments to
provide such a system which efficiently updates its knowledge when
it makes mistakes and a human intervenes, or when there is a need
to modify the topic taxonomy to accommodate new contents.
[0027] These and other objects are accomplished, according to
preferred embodiments of the present invention, with a system
called TAPER, standing for Taxonomy And Path Enhanced Retrieval, as
described herein. For every internal node in the taxonomy, TAPER
separates feature and noise terms by computing the best
discriminant terms for that node. This is accomplished, by
computing a measure of discrimination capability or power for each
term in each document of a training set. In addition, a cut-off
point is computed, for which terms having discrimination powers
above the cut-off are considered feature (or discriminant) terms
and terms having discrimination powers below the cut-off are
considered stop terms or noise terms. In a preferred embodiment, a
discrimination measure derived from mutual information (for
example, as described by T. M. Cover and J. A. Thomas, Elements of
Information Theory, John Wiley and Sons, Inc., 1991. incorporated
herein by reference) or Fisher's discriminant (for example, as
described by R. Duda and P. Hart, Pattern Classification and Scene
Analysis, Wiley. New York, 1973, incorporated herein by reference)
may be used.
[0028] Statistical models are constructed for each topic in the
taxonomy, using the feature terms determined locally for that
topic. In preferred embodiments, a Bernoulli or binary model of
text generation may be assumed.
[0029] When classifying new documents, only the feature terms
therein are used. Such feature terms are relatively few in number,
so the class models are small and classification may be
accomplished relatively quickly, in contrast to existing
classifiers that employ a flat set of classes, the feature set
changes by context as the classification process proceeds down the
taxonomy. As a result, jargon common to lower nodes (or levels) of
the taxonomy are filtered out, as stop words, at each node (or
level), and the classification accuracy remains high in spite of
the reduction in the number of terms and candidate classes
inspected.
[0030] Addition and deletion of documents to given topics, as well
as reorganization of the topic hierarchy itself, are easily
handled. The text models built at each node also yield a means to
summarize a number of documents using a few descriptive keywords,
referred to herein as their signature.
[0031] A taxonomy has been used for illustration purposes for the
discussion above. However, the taxonomy may take on various forms,
and, in some embodiments of the present invention, is not
necessary.
BRIEF DESCRIPTION OF THE DRAWINGS
[0032] FIG. 1 shows a diagram representing a hardware environment
for one preferred embodiment of the present invention.
[0033] FIG. 2 shows a tree-shaped topic hierarchy which may be
employed with one embodiment of the present invention.
[0034] FIG. 3 shows a block diagram representing a classifier
training and feature selection system according to a preferred
embodiment of the present invention.
[0035] FIG. 4 is a flow diagram illustrating the steps performed by
the present invention for reorganizing, training, and testing.
[0036] FIG. 5 shows a graphical representation of an example of a
function of error rate versus number of terms for a single
document.
[0037] FIG. 6 shows a graphical representation of an example of a
function of accuracy versus prefix size.
[0038] FIG. 7 shows a block diagram of the main statistics table
maintained during training in one preferred embodiment.
[0039] FIG. 8 shows an illustration of the data structures involved
in the computation of discrimination power of each term during
feature selection, in one preferred embodiment.
[0040] FIG. 9 shows an illustration of the organization of the
table of terms and their discrimination powers in one preferred
embodiment.
[0041] FIG. 10 shows a diagram showing how indexed statistics are
computed after feature selection in one preferred embodiment.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0042] Querying in a Taxonomy
[0043] According to preferred embodiments of the present invention,
a search query would not elicit a list of documents, but instead
would elicit a list of topic paths. A topic path is a path in a
topic taxonomy. For example, the search query "jaguar speed" may
elicit a list of topics, as follows:
[0044] Business_and_Economy:Companies:Automotive
[0045] Recreation:
[0046] Automotive
[0047] Games:Video_Games
[0048] Sports:Football
[0049] Science:Biology:Animal_Behavior
[0050] Thereafter, the user may restrict queries by concept, not by
keyword. A concept is more general than a keyword. For example, the
concept "animal" together with the keyword "jaguar" is a better
query than "jaguar" alone. A keyword is syntactically embedded or
implied by a document, whereas a concept is a semantic attribute. A
concept could even be functional (e.g., irrespective of syntactic
content, a site can be categorized as "commercial" or "academic").
Moreover, the same keyword may induce different concepts (e.g.,
"jaguar" in the context of "animal" vs. "car").
[0051] Upon obtaining a list of topic paths, the user may restrict
the search to only one topic path or a few selected topic paths, by
selecting one or more topics in the list. Depending upon the depth
of the taxonomy, the selection of one or more topic paths may
result in a list of further topic paths, representing further
levels of the taxonomy.
[0052] As shown above, designing the query to enforce or forbid
additional keywords may not always be as effective as restricting
the search to particular topic paths. The ability to restrict
searches to particular topic paths may be very useful for multicast
channels as well. Additionally, user profiles will be topic paths
rather than keywords.
[0053] Scalable Filtering Using a Topic Taxonomy
[0054] Another paradigm of information retrieval is filtering, in
which a continual stream of documents are generated on-line, as in
newsgroups and newsfeed. The system collects interest profiles from
users and uses them to implement either content-based or
collaborative filtering. i.e., it notifies the user only of those
documents they are likely to be interested in.
[0055] In its simplest form, a profile may be a set of terms and
phrases specified explicitly by the user. This has the same problem
as querying without topic context as discussed above. A better
notion of a profile is the set of documents the user has seen
and/or liked, perhaps with scores. This may work well with small
systems, but for thousands of users and millions of documents, a
system storing this level of detail will not scale. A promising
alternative is to characterize profiles, not at the level of
individual documents, but at the level of narrow but canonical
topics. This can be realized as one embodiment of the present
invention.
[0056] Context Dependent Signatures
[0057] An exhaustive keyword index employed by such systems as Alta
Vista.TM. is perhaps more of a problem than a solution. The IR
literature has advanced further, and now prototypes exist that
extract signature terms, which are then used for indexing. These
signatures can also be used as summaries or thumbnails. The
descriptive power of signatures can often compare favorably with
that of arbitrary sentences as extracted by popular search engines.
The signatures are also effective for describing a document
cluster. Many approaches have been used for signature extraction,
and in one common approach, the most frequent terms that are not
stopwords are selected.
[0058] A document abstract or signature as a function of the
document alone is of limited utility. In the case of a taxonomy, a
useful signature is a function of both the document and the
reference node. The signature includes terms that are "surprising"
given the path from the root to the reference node. In the above
example, car and auto may be good signature terms at the top level,
or even at the Recreation level, but not when the user has drilled
down to Recreation:Automotive.
[0059] The following is an illustration of text from a document in
Health Nursing
(http://www2.best.com/goodnews/practice/faq.htm):
[0060] "Beware of the too-good-to-be-true baby that is sleeping and
sleeping and doesn't want to nurse. Especially monitor the number
of wet diapers, as seriously jaundiced babies are lethargic."
[0061] The first level classification is Health. The top signature
terms are computed with respect to Health as follows:
[0062] Jaundice, dampen, dehydration, lethargic, hydrate, forcibly,
caregiver, laxative, disposable.
[0063] This indicates that the document is about treating jaundice.
The second level classification is Health:Nursing. Shifting the
reference class, the new signature is computed to be:
[0064] Baby, water, breast-feed, monitor, new-born, hormone.
[0065] This now indicates that the document is about nursing
babies. This information comes from both the path and the
signatures. Techniques for computing context-sensitive signatures
are described herein. Thus, significant improvement in search
quality may be possible by maintaining functionally separate
indices at each taxonomy node, using only a few signature terms
from each document.
[0066] Context Dependent Term Associations
[0067] Finding term associations is another application of
context-sensitive signatures. The use of phrases for search and
classification can potentially boost accuracy. The usual way to
find phrases is to test a set of terms for occurrence rate far
above that predicted by assuming independence between terms.
Unfortunately, associations that are strong for a section of the
corpus may not be strong globally and may go unnoticed. For
example, the term "precision" may be visibly associated with the
term "recall" in a set of documents on IR. but not in a collection
also including documents on machine tools. Computing signatures at
each node can expose all such associations.
[0068] Context Dependent Feature Selection
[0069] Separating feature terms from noise terms is central to
preferred embodiments described herein. In the above examples, car
and auto should be "stopwords" within Recreation:Automotive and,
therefore, should be pruned from the signatures. Feature and noise
terms must be determined at each node in the taxonomy.
[0070] It is tricky to hand-craft the stopwords out of domain
knowledge of the language. For example, the term "can" is
frequently included in stopword lists. However, that term should
not be a stop term for a corpus on waste management. The contents
of a stopword list should be highly dependent on the corpus. This
issue looms large in searching using categories and clusters. In
hierarchical categories, the importance of a search term depends on
the position in the hierarchy.
[0071] Other Applications of Feature Selection
[0072] Feature selection is also useful in any setting in which
salient distinctions are sought between two or more sets of
documents. Consider the scenario in which a set of documents (e.g.
a keyword query result) has been clustered into subsets, and a user
wishes to annotate the clusters using salient keywords. The
clusters can be regarded as classes, and feature selection can be
used to find these keywords.
[0073] Another application of feature selection is in the discovery
of succinct differences between two sets of documents, such as
patents filed by two companies, or by the same company over two
different time intervals, to expose interesting trends over
time.
[0074] Hardware Environment
[0075] As noted above, the present invention relates, generally, to
a process, system and article of manufacture for organizing,
classifying, and indexing information items by topic, such as text
and hypertext documents. In one example embodiment, such
information items comprise documents accessible on the internet.
However, it will be understood that further embodiments of the
invention are applicable to information items accessible in local
network environments, dedicated database environments, or the
like.
[0076] An example hardware environment for an internet embodiment
is shown in FIG. 1, which includes a user computer 10, a user
display 11, and a user interface 12. The display 11 is preferably a
visual display device, such as a cathode ray tube monitor, a liquid
crystal monitor or other suitable display device. The user
interface 12 preferably comprises one or more of a key board,
mouse, touch-screen device or other suitable input device. The
computer 10 operates in accordance with a software program stored
on a computer readable medium. such as a floppy disk 13, hard disk
(not shown) or other suitable storage medium.
[0077] The computer 10 is linked, through an internet connection,
and operates in accordance with a suitable software program to
access information items stored in at least one information
database. In the illustrated embodiment, the information items
comprise text documents stored or accessible through one or more
server locations. For example, with reference to FIG. 1, a set of
text documents on a variety of topics are stored or accessible
through the Server 1 location, and a further set of text documents
on a variety of topics are stored or accessible through the Server
2 location. Further, server locations (not shown) may store or
provide access to additional documents.
[0078] As described in farther detail herein, preferred embodiments
of the present invention include a system comprising a computer,
display and user interface, which operate in accordance with a
process stored as a program on a computer readable medium, to
organize and classify information items. While such information
items may be text documents accessible through an internet
connection, as shown in FIG. 2, it will be understood that, in
further embodiments, the system may operate with information items
other than those available on the internet, including, but not
limited to information items accessible through local networks,
dedicated databases, or the like. However, for purposes of
simplifying the present disclosure, embodiments of the present
invention are described herein primarily with reference to a search
and classification system which operates with information items in
the form of text documents that are accessible through an internet
connection with the internet.
[0079] As described in further detail below, embodiments of the
invention relate to an automatic process for learning topics from
examples and later identifying topics of new documents (also called
"test documents"). The process employs a multilevel taxonomy having
a plurality of nodes, including a root node, at least one
intermediate node associated with and under the root node and a
plurality of terminal nodes associated with and under each
intermediate node. A different set of feature terms are associated
with each intermediate node, which are used to classify test
documents. The feature term sets are determined, according to
preferred embodiments, during a training procedure.
[0080] The training procedure, according to preferred embodiments,
employs a plurality of training documents that have been
pre-assigned manually to various terminal and intermediate nodes in
the taxonomy. The training documents are tokenized, and information
related to the frequency of terms or tokens is recorded in a
database. A discrimination value is determined for each term in the
training documents, and a minimum discrimination value is
determined. Then, for each intermediate node, a set of feature
terms is selected, where the feature terms are those that are in
the training documents associated with the intermediate node or any
of its descendants and that have discrimination values equal to or
above the minimum discrimination value for the intermediate
node.
[0081] In one preferred embodiment, such as related to finding
salient differences between two or more sets of documents, the
process ends here, after having output the best features.
[0082] In another embodiment, once the system is trained, test
documents are analyzed. During this phase, a text document is first
tokenized. Of all the tokens in the document, only those that are
also in the feature set of the root topic in the taxonomy are
considered useful. The statistics related to these useful terms are
retrieved from the database, and the statistics are used to compute
a score for each of the children of the root node (nodes comprising
the next level connected to the root node). A few children with
high scores are then picked for further exploration. If any child
is an intermediate node, it has associated with it another feature
set. The set of all tokens in the test document is now intersected
with this new feature set, and the procedure continues from the
child in the same manner.
[0083] In a related embodiment, the system also computes, for each
topic node with suitably high score, the terms in the test document
that are significantly more frequent than in the training set for
that topic. These are then used for building a term index.
[0084] In yet another preferred embodiment, the above system can be
used to process a search query. The search query is received from
the user, for example, through a user input device in the form of
keywords. Optionally, the user also restricts the topical context
using a suitable selection on the taxonomy. Then a plurality of
relevant documents which also adhere to the topic restrictions is
retrieved. In a preferred embodiment, each document in the database
has been pre-classified using the above system. The user is
presented with a suitable display of those portions of the taxonomy
where relevant documents were found. The user may then enter a
command through the user input device to cause the system to select
at least one of the displayed sub-topics. This process is repeated
as necessary to refine the query topic until the user's information
need is satisfied.
[0085] Topic Hierarchy
[0086] Organization and classification of information items,
according to embodiments of the present invention, involves a topic
hierarchy, or "taxonomy," preferably having a plurality of levels
of nodes. While embodiments of the invention may employ any
directed acyclic graph hierarchy structure, embodiments are
described herein with reference to a tree-like topic hierarchy.
[0087] An example of a tree-like topic hierarchy, or taxonomy, for
organizing a database of topical documents is shown in FIG. 2. The
tree 20 includes a first level comprising a single node 22 titled
"All Topics." A second level of the tree may divide the first level
"All Topics" node into several further nodes directed to general
topic categories, such as Business and Economy 24, Recreation 26,
Science 28, and so forth. Each of the second level nodes may be
divided, at the third level, into several further nodes directed to
more specific topics within each second level topic. For example,
at the third level, the Business and Economy topic 24 may be
divided into Companies, Stock Markets, and so forth. Similarly,
each of the other second level topics may be divided at the third
level to further topics. Also, in a similar fashion, further levels
under the third level may be included in the topic hierarchy, or
taxonomy. The final level of each path in the taxonomy terminates
at a terminal or leaf node, labeled c in the diagram. The taxonomy
in the diagram is provided as an example for purposes of
simplifying the present disclosure and is not intended to limit the
invention to the specific illustration of the taxonomy.
[0088] According to preferred embodiments of the present invention,
an appropriate topic hierarchy, or taxonomy, is provided by the
user, based on the material (for example text documents) that are
intended to be classified and searched. For example, if the
material to be classified and searched includes major topics on the
internet, a taxonomy might appear as shown in FIG. 2. As another
example, if the material to be classified and searched includes all
U.S. patents, a taxonomy which more closely follows the U.S. Patent
and Trademark Office classification system might be employed.
[0089] System Training
[0090] Before the system can analyze the topics of new documents,
the system is provided with examples of documents belonging to
given topic(s). Given a topic hierarchy (taxonomy), system training
is performed by providing an initial collection of documents for
which classifications are known in advance. With reference to the
block diagram of a training system 30 of FIG. 3, this may be
accomplished, for example, by collecting a number of documents 32.
For example, for classification and searching of documents
available on the internet, the document collection may be performed
with a suitable web crawler. Alternatively, a sample document
collection may be provided with the system software 13 (FIG. 1) or
manually collected from any suitable source.
[0091] The sample document collection is divided into two sets. One
set of documents is set aside as a testing set 34. The other set is
manually classified or otherwise pre-designated as corresponding to
a particular class or terminal node (or, in some cases,
intermediate node) within the given topic hierarchy and becomes the
training set 38. At block 40, the training set of documents 38 is
split, preferably randomly, into a statistics collection set of
documents 42 and a model validation set of documents 44.
[0092] In block 46, statistics are collected from the statistics
collection set 42, based on terms appearing in those documents and
the known classes for those documents. These statistics are used in
the determination of the discriminating power of terms in the
documents from the collection set 42. The statistics are calculated
for each node in the taxonomy, such that, for any one node, the
discriminating power is calculated for the terms in all of the
documents that are classified in terminal (and intermediate) nodes
below that node. That is, the power that each term has to
discriminate between classes in the next level below each node is
calculated.
[0093] Thus, with reference to the hierarchy represented in the
FIG. 2, statistics are calculated for the "science" node 28, based
on the terms in all of the documents d from the collection set 42
that are classified in classes represented by nodes (terminal and
intermediate) below the "science" node 28. including the nodes
labeled "biology," "chemistry," "electronics," and all children
nodes of those nodes. Similarly, statistics are calculated for the
terms in all of the documents (from the collection set 42) under
each of those intermediate nodes and each of the other intermediate
nodes in the hierarchy.
[0094] The statistics calculated for each intermediate node in the
hierarchy includes quantities that enable computing the
"discriminating power" of each term found in some training document
under the node. Based on these statistics, terms are ordered by
decreasing discriminating power and the top discriminating terms
(those terms with the highest discriminating power) are selected as
feature terms for use in classification, while the remaining terms
are characterized as stop terms that have little value in
distinguishing between topics in the immediate context. The
determination of which terms in the order are feature terms and
which terms are stop terms is provided by selecting a cut-off point
within the ordering.
[0095] At block 42 in FIG. 3, the "statistics collection" subset of
the training documents are used to collect term frequency
information. Then, in block 48, feature terms and stop terms are
determined for each internal topic node based on the model
validation set 44. Finally, class models are constructed over the
chosen features in block 49, preferably as described below in the
section titled "Document Models."
[0096] The class models and statistical information calculated in
block 46 are provided to the classifier 50, for classifying the
test documents 34 in a testing mode, as well as new documents when
the system is deployed. Classification of test (or new) documents
is carried out in the taxonomy, such that each test (or new)
document is ultimately classified to correspond to one or more
classes, designated by terminal or leaf nodes (or, in some cases,
intermediate nodes in the hierarchy).
[0097] FIG. 4 is a flow diagram illustrating the steps performed by
the present invention for reorganizing, training, and testing.
After starting at block 60, in block 62, an action type is
determined.
[0098] If the action type is "reorganize", then, statistics are
updated in block 64 and stored in the raw statistics database in
block 70. Reorganization refers to a change in the structure of the
topic tree. For example, one type of change involves collapsing two
topics into one (e.g., a "science" topic may be collapsed with a
"mathematics" topic). Another type of change involves expanding a
topic into multiple topics (e.g., a "mathematics" topic may be
expanded into a "calculus" topic, a "linear algebra" topic, and
other topics). Once statistics are stored in the raw database in
block 70, processing continues at block 72.
[0099] In block 72, whether compaction is needed is determined. As
will be discussed under the "Statistics Collection" heading below,
compaction refers to merging an unsorted table into the main sorted
table. If compaction is needed, then tables are compacted in block
74. If compaction is not needed or after compaction, the processing
continues to block 76. In block 76, features are selected. Feature
selection will be discussed in more detail below under the "Feature
Selection" heading. In block 78, indexed statistics are written to
tables for testing. The statistics are stored in the indexed
database in block 90.
[0100] If the action type is "new training document", then the
document is tokenized in block 66. Next, in block 68, statistics
are appended. In block 70, the appended statistics are stored in
the raw statistics database. Also, the appended statistics are
stored in the indexed database in block 90. Then, processing
continues at block 72, as discussed above.
[0101] If the action type is "test", documents are tokenized in
block 92. In block 94, the root topic is selected as a starting
point. In block 96, a top topic is picked form the pool (i.e., a
topic with a high goodness score). In block 98, using indexed
statistics from block 90 (as indicated by the arrow from block 90
to block 98), the children of the picked topic are evaluated and
the best ones (i.e., those with high goodness scores) are added to
the pool. In block 100, it is determined whether there are enough
leaf topics. If there are not enough leaf topics, at block 96,
another topic is picked. Otherwise, the context-dependent
signatures are computed in block 102. The computation of
context-dependent signatures is discussed in more detail under the
"Extracting Document Signatures" heading below. These
context-dependent signatures are displayed or indexed in block
104.
[0102] Document Models
[0103] There have been many proposals for statistical models of
text generation. One of the earliest indicators of the power of
simple statistical tests on term frequencies is Zipfs law. The
models most frequently used in the IR community are Poisson and
Poisson mixtures. If X is distributed Poisson with rate .mu.,
denoted X.about.P(x), then Pr[X=x]=e.sup.-.mu..mu..sup.x/x!, and if
Y is distributed Bernoulli with n trials and mean np, denoted
Y.about.B(n,p), then 1 Pr [ Y = y ] = ( n y ) p y ( 1 - p ) n - y .
As n .infin. and p 0 , the distributions B ( n , p ) and P ( np
)
[0104] converge to each other.
[0105] According to preferred embodiments described herein, a
Bernoulli or binary model of document generation is assumed. In the
Bernoulli model, a document d is generated by first picking a class
c. Each class has an associated multi-faced coin, with each face
representing a term t and having some success probability
.theta.(c,t). Then a document length n(d) is arbitrarily fixed, and
each term is generated by flipping the coin. In the binary model, a
document is a set of terms with counts zero or one, and
.theta.(c,t) is an estimate of the fraction of documents in class c
that contain term t at least once.
[0106] Generally, in a binary model, the focus is on whether a term
occurs, and so a term is either associated with zero (i.e. occurs)
or one (i.e., does not occur). In a Bernoulli model, the focus is
on how many times a term occurs, and so the model keeps track of
"buckets" for the number of times a term occurs (e.g., once, twice,
three times, . . . , n times). However, a variety of other models
could be used. For example, in one model, it may be relevant that a
term occurred once, twice, three times or four or more times, while
it is not relevant that the term occurred specifically four times,
five times, . . . , n times. In another model, the "buckets" could
be specified to hold terms that occur "once", "two to three times",
"four to seven times", etc.
[0107] Conceptually, as the training text is being scanned, the
classifier database will be organized as a three-dimensional table.
The first axis is for terms, the second axis is for documents, and
the third axis is for classes or topics. The measure maintained
along these dimensions, (t,d,c), is called n(t,d,c), which is the
number of times t occurs in d.di-elect cons.c. This number is
non-zero only when t.di-elect cons.d.di-elect cons.c. t E d means
that term t occurs in document d, and d.di-elect cons.c means that
d is a training sample for class c. A super-class of c, i.e., an
ancestor in the topic tree, inherits all d.di-elect cons.c.
[0108] Aggregation along some dimensions gives some important
statistics about the corpus that is used by the classifier. The
following is a list of these statistics:
[0109] The length of training document d, given by 2 n ( d ) = i n
( t , d , c ) .
[0110] The length of all documents can be found using a GROUP BY on
(d,c).
[0111] The total length of the training documents in class c,
denoted n(c).
[0112] The total number of times that term t occurred over all
training documents of class c, denoted n(t,c).
[0113] The fraction of times, 3 f ( t , d , c ) = n ( t , d , c ) /
t n ( , d , c ) ,
[0114] that term t occurs in document d. The sum of f and f.sup.2
over all documents in a class is needed.
[0115] The number m(t,c) of training documents in class c that have
at least one occurrence of term t. This will be needed for the
binary model.
[0116] The number of training documents in class c, denoted
.vertline.c.vertline..
[0117] Assuming the Bernoulli model with parameters .theta.(c,t),
the following equation, 4 Pr [ d c ] = ( n ( d ) { n ( d , t ) } )
t ( c , t ) n ( d , t ) , where ( n ( d ) { n ( d , t ) } ) = n ( d
, t ) ! n ( d , t 1 ) ! n ( d , t 2 ) !
[0118] is the multinomial coefficient. A corresponding expression
can be easily derived for the binary model as well.
[0119] The Bernoulli model makes the assumption that term
occurrences are uncorrelated, which is not accurate. First, given
that a term has occurred once in a document it is more likely to
occur again when compared to a term about which information is not
available. Second, the term frequency distributions are
dependent.
[0120] Our independence assumption leads to what is called a naive
Bayes classifier. (A naive Bayes classifier, in essence, builds
density functions, which are marginally independent, for each
class, and then classifies a data point based on which density
function has the maximum value at that point.) In practice, these
simple classifiers perform surprisingly well compared to more
sophisticated ones that attempt to approximate the dependence
between attributes.
[0121] Recently this phenomenon has been investigated in depth by
Friedman in On Bias, Variance, 0/1 Loss, and the
Curse-of-dimensionality, Data Mining and Knowledge Discovery, 1(1),
pp. 55-77, 1997, incorporated herein by reference). A classifier
that uses an estimate of class densities is subject to bias
(decision boundaries that are shifted from the "best" position,
because the model is inaccurate) and variance (decision boundaries
fit to noisy data). Friedman analyzes how the low variance of naive
density estimates can mitigate the high bias to give simple
classifiers that can often beat more sophisticated ones. It will
also be clear from a description of the system that this simplicity
enables designing of a system that can handle enormous problem
sets.
[0122] Rare Events and Laws of Succession
[0123] The average English speaker uses about 20,000 of the
1,000,000 or more terms in an English dictionary. In that sense,
many terms that occur in documents are "rare events." This means
that with reasonably small sample sets, there will be zero
occurrences of many terms in many classes, and the maximum
likelihood estimate .theta.(c,t)=n(c,t)/n(c) wilt be problematic: a
class with .theta.(c,t)=0 will reject any document containing t
even a single time.
[0124] Finding better estimates of small probabilities, also called
laws of succession, has been pursued in classical statistics for
centuries. Laplace showed that given the results of n tosses of a
L-sided coin, i.e., the number of times each face occurred,
n.sub.1, . . . , n.sub.L, the correct Bayesian estimate for the
probability of face i is not n.sub.i/n, but 5 n 1 + 1 n + L .
[0125] This is the result of assuming that all possible associated
L-component vectors of face probabilities (p.sub.1, . . . ,
p.sub.L) are a priori equally likely. This is called the uniform
prior assumption. The above value of is obtained by using Bayes
rule and evaluating 6 1 Pr [ n 1 ] 0 1 Pr [ n i ] .
[0126] Alternative priors have been suggested and justified. (E. S.
Ristad, A natural law of succession, Research report CS-TR-495-95,
Princeton University, July 1995, which is incorporated herein by
reference.) However, based on experimentation conducted in
connection with the development of the present invention, it was
found that Laplace's law provides a few percent better
classification accuracy. With this adjustment (and returning to the
earlier notation) .theta.(c,t) is estimated as 7 1 + n ( c , t ) n
( c ) + L ( c )
[0127] where L(c) is the size of the lexicon (the number of
distinct terms found in the training documents) of class c.
[0128] Hierarchical Classification
[0129] A classifier inputs a document and outputs a class. If the
class is not the one from which the document was generated, the
classifier is said to have misclassified that document. In the case
of a topic hierarchy, one may wish to give the classifier "partial
credit" for correctly finding the first few levels of the "true"
topic. This is ignored in the current discussion and will be
commented on later. For now the discussion focuses on how to find
the best leaf topics.
[0130] According to preferred embodiments of the present invention,
a distinct classifier is associated with each internal node in the
taxonomy, including the root. During classifier training, a set of
feature terms is generated for each of such nodes. Given a new
document d, the goal of the classification process is to find a
leaf node c such that the probability that the document d was
generated from class c (called the posterior probability
Pr[c.vertline.d] ) is maximized among all the leaves.
[0131] Hierarchical classification has the benefit of greatly
increased speed of classification. As described next,
classification of a test document starts at the taxonomy root by
assigning a score to each child of the root. In many cases, it will
be possible to eliminate most of the topic sub-trees as unlikely
candidates. Thus, large sub-trees in the topic tree can be
eliminated forthwith if the score of the root of those sub-trees
are very poor. Text database population is not the only application
of fast multi-level classification. With increasing connectivity,
it will be inevitable that some searches will go out to remote text
servers and retrieve results that must then be classified in real
time.
[0132] This benefit of increased speed may be useless if an error
is made in choosing a topic early in the process at a shallow level
of the tree. Thus, a greedy search for the best leaf may be risky.
Let the path to a leaf c from the root c.sub.1 be c.sub.1,c.sub.2,
. . . , c.sub.k=c. Since the root subsumes all classes, the
probability that any document d is within the root class c.sub.1 is
Pr[c.sub.1.vertline.d]-1. Thereafter,
Pr[c.sub.i.vertline.d]=Pr[c.sub.i=1.vertline.d]Pr[c.sub.i.vertline.c.sub.-
i-1,d], for i=2, . . . , k. Taking logs, log
Pr[c.sub.i.vertline.d]=log Pr[c.sub.i-1.vertline.d]+log
Pr[c.sub.i.vertline.c.sub.i-1,d]. Suppose that, in the taxonomy,
the edge (c.sub.i-1,c.sub.i) is marked with the edge cost-log
Pr[c.sub.i.vertline.c.sub.i-1,d]. Then the least-cost path from the
root to some leaf is being sought.
[0133] Computing the one-step conditional probability
Pr[c.sub.i.vertline.c.sub.i-1,d] is straight-forward. For
notational convenience, name c.sub.i-1 as r.sub.0 and its children
{r.sub.j}. Then the probability that the document d belongs to the
child node r.sub.j, given that it belongs to the parent node
r.sub.0, is given by
Pr[r.sub.j.vertline.r.sub.0,d]=Pr[r.sub.j.vertline.d]/Pr[r.sub.0.vertline-
.d], where 8 Pr [ r 0 d ] = Pr [ d ] ,
[0134] where the sum is over all children .rho. of r.sub.0. Note
that 9 Pr [ r 0 d ] = Pr [ d r j ] / Pr [ d ]
[0135] using Bayes rule, and this can be evaluated from the model
parameters. Care is needed here with finite-precision numbers and
underflow, because the probabilities are very small and the scaling
needed to condition the probability prevents maintaining the
numbers always in log-form.
[0136] Feature and Noise Terms
[0137] The above application of Bayes rule depended on a document
model. This was embedded in the .theta.(c,t) parameters. These
parameters are estimated during the training phase. using sample
documents from the statistics collection set 42 (FIG. 3). When
building a model for each class from a training set, a
determination is made as to whether a term appears only
incidentally, or sufficiently consistently to suspect a causal
connection; the term is accordingly a noise term (also called a
stopword) or a feature term. Only feature terms should be used for
classifying new documents.
[0138] A property of preferred embodiments of the present invention
that distinguishes it from prior art (as in Apte, Damerau, and
Weiss, 1994) is the use of different feature sets computed
separately for each internal node. This prevents the classifier
from losing accuracy even though it inspects very few of the
classes in the taxonomy to pick the best leaf topics.
[0139] The challenge is to select suitable feature terms from a
lexicon that can be as large as a hundred thousand terms. The
selection process is constrained both ways: the highly
discriminating terms should not be missed, and every term should
not be included, because the frequencies of some terms are noisy
and not indicative of content. This is called the feature-selection
problem in the statistical pattern recognition literature. In
general, there is a need to find a set of terms that minimizes the
probability that a document is misclassified, with the
understanding that only terms in the intersection of the document
and the feature set are used by the classifier.
[0140] It is not possible to search for the best feature set,
because it is not known what the best possible classifier does, and
because there are too many terms in the lexicon. Therefore, in
practice this is done for a fixed classifier. A heuristic is
desired that is essentially linear in the original number of terms
and preferably makes only one pass over the training corpus.
[0141] Therefore, the following approach is carried out: first a
merit measure is assigned to each term, and then a prefix of terms
with highest merit are selected. In preferred embodiments, the
merit measure comprises an index based on mutual information or on
Fisher's linear discriminant. Mutual information is a well-known
statistical measure of dependence between random variables (see
Cover and Thomas). It is straight-forward to apply mutual
information to the binary document model, but it is more
complicated to apply it to the Bernoulli model, and more expensive
to evaluate. Next, the Fisher discriminant measure that was used in
the present invention is described, and this measure was found to
be more effective than mutual information.
[0142] Fisher's Discriminant
[0143] Suppose two sets of points are given in k-dimensional
Euclidean space, interpreted as two classes. Fisher's approach
finds a direction on which to project all the points so as to
maximize (in the resulting one-dimensional space) the relative
class separation as measured by the ratio of inter-class to
intra-class variance. More specifically, let X and Y be the point
sets, and .mu..sub.X, .mu..sub.Y be their respective centroids,
i.e., 10 X = 1 X x X x
[0144] and 11 y = 1 y y Y y .
[0145] Further, let the respective covariance matrices be 12 X = 1
X x X ( x - X ) ( x - X ) T
[0146] and 13 y = 1 y y Y ( y - y ) ( y - y ) T .
[0147] Fisher's discriminant approach seeks to find a vector
.alpha. such that the ratio of the projected difference in means,
.vertline..alpha..sup.T(.mu..sub.X-.mu..sub.Y).vertline., to the
average variance,
1/2.alpha..sup.T(.SIGMA..sub.X+.SIGMA..sub.Y).alpha.=.alpha..su-
b.T.SIGMA..alpha., is maximized. It can be shown that
.alpha.=.SIGMA..sup.-1(.mu..sub.X-.mu..sub.Y) achieves the
extremum, provided .SIGMA..sup.-1 exists. Furthermore, when X and Y
are drawn from multivariate Gaussian distributions with
.SIGMA..sub.X=.SIGMA..sub.Y, this is the optimal discriminator, in
that computing .alpha..sup.Tq for a test point q and comparing the
result to a suitable threshold is the minimum error classifier.
[0148] Computing .alpha. involves a generalized eigenvalue problem
involving the covariance matrices. In applications such as signal
processing where Fisher's discriminant is used, the matrix size k
is typically a few hundred at most. In the text domain, the matrix
size k is typically 50,000 to 100,000, and the covariance matrices
may not be suitably sparse for efficient computation. Moreover, it
is difficult to interpret a discriminant that is a linear sum of
term frequencies, possibly with negative coefficients. A preferred
approach, therefore, will be to take the directions .alpha. as
given, namely, a coordinate axes for each term. A figure of merit
is assigned to each term, which is called its Fisher index, based
on the variance figures above, which is 14 T ( X - Y ) T
[0149] in the two-class case. For each term t,
.alpha.=.epsilon..sub.1, is a unit vector in the direction of
t.
[0150] In general, given a set of two or more classes {c}, with
.vertline.c.vertline. documents in class c, the ratio of the
so-called between-class to within-class scatter is computed.
Switching back to term frequency notations, this is expressed as 15
Fisher ( t ) = c 1 c 2 ( ( c 1 , t ) - ( c 2 t ) ) 2 c 1 c 1 d c (
n ( t , d , c , ) - ( c t ) ) 2 , where ( c , t ) = 1 c 1 d c n ( t
, d , c ) .
[0151] Selecting a Cut-Off
[0152] The remaining exercise, having sorted terms in decreasing
order of Fisher index, is to pick a suitable number of feature
terms starting with those having the highest index. Let F be the
list of terms in our lexicon sorted by decreasing Fisher index. A
preferred heuristic is to pick from F a prefix F.sub.k of the k
most discriminating terms. F.sub.k must include most useful
features and exclude most noise terms. A short F.sub.k enables
holding a larger taxonomy in memory and hence fast classification.
Too large an F.sub.k will degrade not only performance, but also
accuracy because of the phenomenon of over-fitting: the classifier
will fit the training data very well, but will result in degraded
accuracy for test data. There are various techniques for pruning
feature sets. The current invention prefers the technique of
minimization of the classification error rate on a separate
validation set. There exist other approaches in the prior art using
the minimum description length principle, resampling or cross
validation, but these make too many passes over the text
corpus.
[0153] The pre-classified samples are partitioned, preferably
randomly, into the training set T (shown as block 42 in FIG. 3) and
the validation set V (shown as block 44 in FIG. 3). The Fisher
index of each term based on documents in set T is computed, and
then documents in set V are classified using various prefixes
F.sub.k. Let N.sub.k be the number of documents incorrectly
classified when a prefix of k features is used, then (the smallest)
k for which N.sub.k is minimized is sought.
[0154] For classification using a feature set F.sub.k. the class c
is chosen that maximizes the following a posteriori class
probability based on the Bernoulli model described above: 16 Pr [ c
d ' ] = f k ( c ) t d F k ( f ( c t ) ) n ( d , t ) c ' ( c ' ) t d
F k ( f ( c * ' t ) ) n ( d , t ) ,
[0155] where .pi. is the prior distribution on the classes. Let
c.(d) be the "true" class of d.di-elect cons.V, then 17 N k = d N k
( d ) , where N k ( d ) = { 1 , c c * ( d ) : Pr [ c d , F k ] >
Pr [ c * ( d ) d , F k ] 0 , otherwise .
[0156] In effect, the overall plot is constructed of the faction of
documents incorrectly classified against the number of features
used by averaging the per-document function written above. For a
fixed document in the model validation set, the class for that
document is first estimated based only on the highest Fisher value
term in the order. If the class estimate is erroneous, the error is
1 for that document. If the class estimate is correct, then an
error rate of 0 (zero) is plotted for the term number 1, for that
document.
[0157] Further terms in the Fisher value order are added, one at a
time, to the estimation process, to add points to the plot of
N.sub.k(d). Thus, a plot of the error rate versus term number
function for any one particular document might appear as shown in
FIG. 5. Typically, the document will be erroneously classified
until some number m of terms are employed in the class estimation.
Accordingly, an error rate of 1 is plotted, up to m terms, upon
which the document is correctly classified and the error plot drops
to 0. As further terms (in Fisher value order) are added to the
class estimation, eventually an added term may be a noise term and
may cause misclassification of the document, upon which the error
jumps back to 1. Thereafter, further terms may cause fluctuations
in the N.sub.k(d) function, between 0 and 1.
[0158] The process is repeated for each document in the model
validation set 44, such that a function of the error rate versus
number of terms is generated for each document. The functions for
all of the documents in the model validation set are added together
to provide an overall error rate versus term number function that
appears, for example, as shown in FIG. 6., which is discussed in
more detail below under the section titled "Feature Selection". In
preferred embodiments, this aggregate error for all required number
of feature terms is computed while scanning each validation
document only once through the whole process.
[0159] The aggregate or average error will typically decrease
steeply as terms are initially added to the feature set, reach a
minimum, and then show an upward trend. Suppose k* is the smallest
number of feature terms for which N.sub.k achieves (close to) its
global minimum. Then these k* terms are picked as features for the
intermediate node of the taxonomy under discussion. In this manner,
a generally distinct set of feature terms are derived for
classification at each intermediate node in the taxonomy. However,
it will be understood that other means for determining a cut-off
point in the order of discriminating powers may be employed,
including but not limited to defining a preset number of terms as
the cut-off point.
[0160] Extracting Document Signatures
[0161] Up to a point, the user can sift a query response based only
on the topic paths. However, even the leaf classes are necessarily
coarser than individual documents. Support is therefore needed to
browse quickly through many documents without looking into the
documents in detail. Most search engines attach a few lines from
each document. Often these are the title and first few lines, or
they are sentences with the most search terms. For many documents,
better keyword extraction is needed. Moreover, it would be more
advantageous for these signatures to be extracted relative to a
node in the taxonomy.
[0162] Given this reference node c, one approach is to concatenate
the training documents vassociated with c into a super document
d.sub.c and then rank the terms t.di-elect cons.d.sub.c in
decreasing order of the number of standard deviations that n(d,t)
is away from .theta.(c,t). Here, the previously described
simplistic document model may not suffice. As mentioned above, a
term that has occurred once in a document is more likely to occur
again. Since the Bernoulli model does not take this into account,
frequent terms often remain surprising all along the taxonomy
path.
[0163] Matters are improved by moving to another simple model, the
binary model. First, consider a single test document d, and
consider t.di-elect cons.d. If the observed fraction of training
documents in class c containing term t at least once is
.theta.(c,t), all t.di-elect cons.d are sorted in increasing order
of .theta.(c,t) and report the top few. Second, if there are l>1
test documents, then the fraction .phi.(t) is found that contains t
at least once, and the terms are sorted in increasing order of 18 (
( c , t ) - ( t ) ) l ( c , t ) ( 1 - ( c , t ) )
[0164] before presenting the results. Both, in fact, correspond to
p-values computed using the normal approximation to the binomial
distribution.
[0165] Data Structures and Pseudocode
[0166] Preferred embodiments of the current invention have the
following distinctive features as an "industrial strength" topic
analyzer:
[0167] Thousands of classes and millions of documents can be
handled. The current limits in a prototype implementation are
2.sup.16 classes in the taxonomy, 2.sup.32 unique tokens and
2.sup.32 documents. A reasonable assumption is made that a simple
pointer representation of the taxonomy together with a few words of
data per class can be held in the computer memory at all times.
[0168] Training has near real-time response, as needed by crawling
and indexing applications. Training makes a single pass over the
corpus.
[0169] The prototype permits efficient incremental updates to a
fixed taxonomy with new documents, deletion of documents, or moving
documents from one class to another. With some more work, it is
also possible to reorganize entire topic sub-trees.
[0170] System modules, according to preferred embodiments of the
invention, include:
[0171] A mapping from textual terms to numeric term ID's (called
TID's), which may be derived using a global counter, or a suitable
hash function. A reverse mapping is also maintained.
[0172] Similar mappings between classes and numeric class ID's
(called CID's).
[0173] A pointer-based tree data structure in memory for storing
class-specific information, such as the number of training
documents, etc.
[0174] A module for statistics.
[0175] A module for feature selection.
[0176] A module for applying the classifier on a test/new
document.
[0177] The last three modules are described in detail next.
[0178] Statistics Collection
[0179] The goal of this module is to collect term statistics from a
training document and dispense with it as fast as possible. After
simple stopword filtering and stemming while scanning, the document
is converted to a sequence of 32-bit TID's (term ID's). The main
table, for example, maintained on a computer readable disk, is a
frequency table, such as that shown in FIG. 7. TID corresponds to a
term that occurs in some document belonging to a class
corresponding to KCID (child or kid class ID). The Parent Class ID
(PCID) represents the parent of KCID (zero if KCID is the root).
The class identifications CID's (KCID's and PCID's) are numbered
from one onwards. The explicit presence of PCID is only to simplify
the current disclosure. In an actual system, the class tree data
structure is preferably always available to map from KCID's to
PCID's, and, therefore, PCID does not need to be explicitly stored
in the table.
[0180] In the statistics collection phase the main table is kept
sorted on the keys TID and KCID. There is another unsorted table
with the same row format. There are four other numeric fields per
row. All of these four numbers are additive over documents, so for
each document d and term t, a row is appended to the unsorted
table, with SMC set to one, SNC set to the number of times t
occurred in d, called n(d,t), SF1 set to n(d,
t)/.SIGMA..sub.rn(d,t), and SF2 set to SF1.sup.2. SMC is used in
the binary model, while SNC is needed in the Bernoulli model.
[0181] This approach trades off space for time, and the unsorted
table grows rather quickly, but with a lot of duplicate keys.
Depending on how much disk space exists, once in a while the system
sorts and aggregates the unsorted table and then merges the result
into the main sorted table. Various simple heuristics may be used
to decide when to initiate a sort-merge. In one implementation,
processing documents stops while the sort-merge is in progress. To
meet tough real-time requirements, one can open a new frequency
table and fork a thread, perhaps on another processor, to aggregate
the last run while more documents continue to be accepted.
[0182] An indexed access approach could be used instead of the
frequency table. As each document is scanned, the system would look
up on the (TID, KCID) key and update SMC. SNC, SF1 and SF2. That
would result in index lookups and random IO, potentially for every
term in the training set. For large corpora, it is far more
efficient to append statistics in a logged fashion as in the
preferred embodiment. The frequency table is a temporary file and
no direct indexed access to it is actually required later. Another
benefit is compactness: this is the most space-intensive phase of
training, and the storage overheads of indexed access are avoided,
while explicit control of compaction is obtained. The space
overhead of storing TID redundantly is moderate, as the rest of
each row is already 18.about.bytes long.
[0183] Computing Fisher Indices
[0184] Before beginning feature selection, the frequency table is
aggregated one last time, if needed to eliminate all duplicates.
The frequency table is rewound and prepared for scanning. At this
stage, all rows with the same TID are collected in a contiguous run
going through all CID's where that TID occurred (see FIG. 7). Also,
preparation takes place to output another file, called the fisher
table. For the following description, a format shown in FIG. 8 is
assumed. The format includes rows that are keyed by PCID and a
floating point number FI (Fisher index), where for each fixed PCID
the rows are sorted in decreasing order of FI. The last column is
the TID (term ID) whose corresponding PCID and FI are the first and
second columns.
[0185] Because TID is the primary key in the frequency table, as it
is scanned, a sequence of runs are obtained, each run having a
fixed TID. Associated with each topic node in memory, a few words
of statistics are kept (derived from SMC, SNC, etc.). When a run is
started for a given TID, these statistics are cleared. As the
various KCID's are scanned for the given TID in the frequency
table, the node corresponding to the KCID in the taxonomy is
located, and these statistics are updated. In a large taxonomy,
very few of the nodes will be updated during a run. If a node is
updated, its parent will be updated as well. These statistics
efficiently can, therefore, be reset after each run.
[0186] When the run for a given TID completes, exploring only the
updated nodes, the Fisher index of that term is computed for every
internal node (identified by its PCID) in the taxonomy as described
in the section titled "Feature and Noise Terms." For each of these
PCID's. a row is appended to the Fisher table. Next, the Fisher
table is sorted on the key (PCID.FI). This collects all PCID's into
contiguous segments, and for each PCID, orders terms by decreasing
values of FI.
[0187] Consider now the case in which, for each internal topic c,
the number k*(c) of features to pick is specified to TAPER
directly. (The next section discusses how k*(c) is determined in
one pass over the portion of the training documents set apart
earlier for model pruning.) Given k*(c), the sorted Fisher table is
scanned while copying the first k*(c) rows for the run
corresponding to class c to an output table and discarding the
remaining terms. This involves completely sequential IO.
[0188] As shown in FIG. 10, once feature selection is performed on
the Fisher table, both the frequency and fisher table are sorted
once again, this time with (PCID,TID) as the key. After these
sorts, a merge is performed. Rows of the Fisher table are
considered one by one. For each row, once the beginning of a
key-matched row of the frequency table is found, the row is read as
long as the key remains unchanged, constructing a vector in memory
where each element has the form (KCID,SMC,SNC). This buffer is then
written into a hash table on disk.
[0189] Feature Selection
[0190] Given terms in decreasing Fisher index order for a fixed
class c, it is desirable to find a good value for k*, the number of
chosen features, as described in the section titled "Selecting a
cut-off." In a preferred embodiment, this is done in only one pass
over the validation documents in set V. FIG. 9 shows an
illustration of the organization of the table of terms and their
discrimination powers in one preferred embodiment. The table of
terms is used in the feature selection process.
[0191] The technique used holds N.sub.k (the number of incorrectly
classified documents as a function of k, the number of features
used) in memory as against scanning the validation documents for
different values of k. The following is pseudocode:
1 Nave pseudocode: For all or suitably many values of k Prepare
models with only k features For each validation document d
.epsilon. V, Determine V.sub.k (d) IO-efficient pseudocode: Compute
Fisher ordering and initialize N.sub.k = 0 for k = 0,1,2, . . . For
each validation document d .epsilon. V Compute Pr[c'.vertline.c, d,
F.sub.0] for every child c' using priors only If a wrong class has
highest Pr[c'.vertline. . . .] increment N.sub.0 For k = 1,2, . . .
Check if the k -th feature occurs in d If so, find
Pr[c'.vertline.c, d, F.sub.k] using Pr[c'.vertline.c, d, F.sub.k-1]
If a wrong class has highest Pr[c'.vertline. . . .] increment
N.sub.k
[0192] Even if N.sub.k is stored for every value of k, and the
lexicon is of size 500,000 (a sample of 266,000 documents from
Yahoo.TM. required 600,000), only 2MB is needed. As each document d
is scanned, N.sub.k (d) is aggregated into the N.sub.k array.
[0193] Earlier, it was discussed that indexed statistics could be
computed, given the optimal number of k* of terms at each node.
Here, the discussion is directed to how the k* is determined. The
documents in the validation set are considered one by one. Each
document d has a pre-assigned "correct" class label c*. The parent
c of this class node is located in the topic tree. Recall that c
has associated with it a ranked list of terms. These are
intersected with the terms in d, and the common terms are sorted by
the rank.
[0194] Now a sequence of progressively more detailed classifiers
are constructed at c. The first one has only one feature, the
top-ranked term in d. The second classifier has the top two
features, etc. For each such classifier, it is checked whether d is
correctly routed from c to c*.
[0195] This gives a sequence of zero-one readings (i.e. zero if
correct, and one if not) on a scale indexed by the rank of the last
included term. The following is an example sequence for a
particular document:
[0196] 0-29 1 (classifiers indexed 0-29 did not place d in c*)
[0197] 29-500 0 (classifiers that used terms ranked 0 through k,
for k=29, 30, . . . 500 did place d in c*)
[0198] 500-550 1 (classifiers that included even more terms went
wrong again, because they included noise)
[0199] Attached with each internal node, such as c above, is a file
in which such sequences are appended for each validation document
d. Once processing of (c, c*) is complete, processing continues up
the tree, etc., until the processing of d is completely finished.
These sequence files are then sorted on the index and a running sum
is computed to generate the error rate curve as illustrated in FIG.
6.
[0200] Updates to the Database
[0201] For a batch job, the large frequency and fisher tables can
now be deleted, leaving the relatively smaller indexed topic
statistics and the much smaller term-to-TID maps. If the system is
used to determine where new documents will be added to classes, it
is necessary to preserve the frequency table. It continues to be
used in the same way as before: rows are appended and occasionally
it is compacted to aggregate duplicate keys. Running feature
selection integrates the new data into the indexed statistics.
[0202] Like running statistics generation for a relational server,
feature selection is not an interactive operation. For example, on
a database with 2000 classes, with an average of 150 documents per
class, and an average of 100 terms per document, it may take a
couple of hours. So feature selection is invoked only when there is
reason to believe that the refreshed statistics will improve
classification. In further preferred embodiments, such times to
perform feature selection is automatically determined, for example,
based on the occurrence of one or more predetermined events.
[0203] Another issue is deletion of documents and moving of
documents from one class to another (perhaps because classification
was poor or erroneous for those documents). Since feature selection
is always preceded by a frequency table aggregation, negative
"correction" entries may be placed in it. That is, a frequency
table row is produced, corresponding to each term in the deleted
document, and SNIC, SNC, SF1 and SF2 are negated for the class(es)
from which the document is being deleted. Here, it cannot be
ensured that the document was originally included in the aggregate,
but that can be done by preserving ID's for training documents.
[0204] A more difficult issue is the reorganization of the taxonomy
itself. Notice that in TAPER, a parent class inherits, in an
additive fashion, the statistics of its children, since each
training document generates rows for each topic node from the
assigned topic up to the root. The preferred means of
reorganization therefore involves reassigning some CID's and
writing out a new frequency table with some negative "correction"
entries.
[0205] For example, consider detaching a subtree under node c.sub.1
and attaching it elsewhere under node c.sub.2. Statistics at or
above the least common ancestor c, of c.sub.1 and c.sub.2 remain
unchanged. Negative (respectively, positive) rows are appended to
the frequency table corresponding to all classes on the path
between c.sub.l inclusive and c.sub.1 (respectively, c.sub.2)
exclusive. Finally, the parent and child links have to be modified
in the taxonomy tree.
[0206] Classification
[0207] The rationale for the data organization described above is
now explained by a description of the procedure of classifying a
new or test document. When analyzing a new or test document, the
taxonomy and associated pre-computed statistics are first loaded,
and then the document is submitted to the system in the form of a
map from terms to frequencies. In the present model, the
probability that the document is generated by the root topic is 1,
by definition, and decreases down any path in the taxonomy.
Accordingly, the user also specifies a probability cut-off for
topics reported back as close matches.
[0208] Consider the document d at some internal node c.sub.0 with
children c.sub.1,c.sub.2, . . . . The system first intersects d
with the feature set at c.sub.0, then, for each retained term
t.di-elect cons.d.andgate.F(c.sub.0), and then the system looks up
the model parameters for subtopics c.sub.1, c.sub.2, etc. It is
thus best for both space and IO efficiency to index the statistics
by (c.sub.0, t) and include in the record a vector of statistics
for each child c.sub.i, for i=1,2, . . . , of node c.sub.0. The
obvious pseudocode has to be slightly modified to reflect this
optimization (i.e., p.sub.c denotes log probabilities).
2 Nave index lookup: For each child c, of c.sub.0, i = 1,2 . . .
Initialize p.sub.c, to 0 For each term t .epsilon. d .andgate.
F(c.sub.0) Lookup statistics for term t for class c, Update
p.sub.c, Normalize .SIGMA.,e.sup.p,, to 1 and add p.sub.c0 to each
p.sub.c,. Optimized index lookup: Initialize all p.sub.c, to 0 For
each term t .epsilon. d .andgate. F(c.sub.0) Skip if key (c.sub.0,
t) is not in index Otherwise, retrieve record for (c.sub.0, t) For
each c, that appears in the record Update p.sub.c, Normalize
etc.
[0209] Caching and Batch Classification
[0210] Two additional optimizations may be needed in the above
technique to rapidly compute class estimates for test/new
documents. First, observe that in practice, most terms in a
document will not be found in F(c.sub.0). It can be wasteful to
frequently look up the index on disk only to discard terms. Since
few terms are found useful, it is relatively acceptable to look up
these statistics on disk. Thus, a fast cache of feature TID's, even
without any attached statistics, just to check if a TID is a
feature or not, greatly increases speed.
[0211] Second, given not one, but several documents to classify, it
is desirable to amortize lookup of common terms by using a B-Tree
access approach on the statistics (rather than a hash table),
pre-sorting the TID's and making indexed scans on the statistics
table. It is to be understood that any such locality-based
performance optimizations and reordering of accesses to disk data
are obvious to one trained in the art.
[0212] Incomplete Classification
[0213] With some documents, it may not be possible to narrow down
the topic to a leaf node of the taxonomy. For example, there may be
strong elements of both "Art:Paintina" and "Arts:Photography" in a
document, so that the best classification should be just "Arts."
Various schemes may be proposed for dealing with such situations.
In a preferred embodiment the system can trade off between the
benefit of narrowing down the topic against the cost of doing so
with low confidence.
[0214] To illustrate this, it will suffice to consider the case of
a three-node taxonomy with the root c.sub.0 having two children
c.sub.1,c.sub.2, where the goal, for a given document d, is to
decide whether to stop at c.sub.0, or to pick between c.sub.1 and
c.sub.2. Intuitively, it is pointless to make the finer distinction
if it is incorrect too often. This can be quantified in a variety
of ways. Suppose the classifier picks c.sub.1, the fraction of
training documents in that class is .pi..sub.1, and somehow it is
known that the classifier is correct in this case with probability
p.sub.1. One possible estimate is Pr[d.di-elect
cons.c.sub.1.vertline.d.di-elect cons.c.sub.0]. This can also be
combined in various ways with the fraction of correct
classifications over a cross-validation set. Then the "expected"
factor by which the scope is narrowed down is 19 p 1 1 + ( 1 - p 1
) .
[0215] If this quantity is smaller than 1, the system stops at
c.sub.0 in a preferred embodiment. It is to be understood that any
other such test of confidence, with the purpose of stopping the
classifier prematurely, is within the spirit of the present
invention.
[0216] Performance (Speed and Accuracy) Example
[0217] We used three data sets for evaluation: the Reuters newswire
widely used in the IR community, a portion from the US Patent
database, and a sample from Yahoo.TM..
[0218] The experimental computers were between 133 and 266 MHz,
with 128-256 MB of memory. Once a document is in memory, typical
training time is 140s and typical testing time was 30s. Training
and testing on Reuters takes 20 minutes overall. The Yahoo.TM.
sample, with 2118 classes and 266,000 documents, takes 19 hours to
train. Bernoulli was found to be superior to binary for all our
experiments.
[0219] The Patent data set had three intermediate nodes below the
root and twelve leaf nodes. A few hundred terms out of the lexicon
of about 30,000 terms were sufficient to minimize error; the
overall accuracy to the leaves was 66% (i.e., 66% patents were
correctly classified) and the accuracy at the root node was 75%.
For Reuters, the accuracy was 87%, as against the earlier best
known technique's accuracy of 81% by Apte et al.
[0220] The best features in the patent taxonomy for the root and
its three children are listed in Table 1 below. Observe how
distinct the sets are.
3TABLE 1 Patent: signal, modulate, motor, receive, antenna,
telephone, transmit, frequency, demodulate, current, voltage, data
. . . Communication: Electricity: motor, Electronics: amplifier.
antenna. telephone, heat, voltage, oscillator, input, modulator,
transistor, output, output, frequency, demodulator, circuit, . . .
transistor, . . . signal, . . .
[0221] The feature selection technique is applied to find salient
differences between various interesting sets of documents. One
application is to find descriptions for clusters in unsupervised
document clustering. For example, the query mouse provides hundreds
of responses from the US patent database. Clustering the responses
and applying TAPER with the clusters treated as classes yields the
automatically generated cluster descriptions shown in Table 2
below.
4TABLE 2 Tissue, thymus, transplanted, hematopoietic, treatment,
organ, trypsin, . . . Computer, keyboard, hand, edge, top,
location, keys, support, cleaning, . . . Point, select,
environment, object, display, correspondence. image, . . .
[0222] These "cluster digests" clearly identify different contexts
of the occurrence of "mouse" and help the user easily refine the
query.
[0223] As yet another example application of the TAPER system, the
patent portfolios of two assignees, or one assignee over two
different intervals, were compared. For example, a comparison
between "Sun Microsystems" and "Silicon Graphics" gave the result
shown in Table 3 below.
5 TABLE 3 Sun Microsystems Silicon Graphics General purpose
Information processing programmable digital system organization
computer systems Data representation, Electrical computers and
computer graphics data processing system Sufrace detail, texture
Integrated circuit, Adjusting resolution or processor level of
detail
[0224] Note how the commonality between the assignees ("UNIX
workstations") is filtered out and the difference brought out.
[0225] An even more interesting example is comparing Intel patents
in 1993-94 with those in 1994-95. TAPER detects a new line of
research and patenting in the latter year, as shown in Table 4
below.
6 TABLE 4 Intel, 1993-94 Intel, 1994-95 General purpose Interactive
television programmable digital bandwidth reduction computer
systems system Chip, fabrication, counter, Involving difference
input transmission Field or frame difference
[0226] Note that it is possible to get coherent phrases in this
case because the patent meta-data stores these as effectively
single terms.
[0227] In a typical application, a taxonomy will be initially
designed by hand, and training documents obtained. Once training is
completed, the accuracy of the system can be estimated by comparing
the class output of the classifier 50 with the known
classifications of the testing documents 34 (FIG. 3). If the
accuracy is inadequate, a further training procedure, using a
different collection of documents, or a reorganization of the
taxonomy, may be carried out to retrain the system. Once
satisfactory accuracy is achieved, the system may be used for a
number of purposes such as search, filtering, and indexing as
described above.
[0228] Broadly, the current invention can be distinguished from
prior work in its use of a context dependent statistics, and its
emphasis on scalability and speed in dealing with corpora ranging
into tens to hundreds of gigabytes, the use of efficient disk data
structures, and efficient update mechanisms. The present invention
has focused on techniques that have good statistical foundation
while remaining within almost linear time and one pass over the
corpus, even when doing feature selection simultaneously for many
nodes in a large topic taxonomy.
[0229] Having thus described exemplary embodiments of the present
invention, it should be understood by those skilled in the art that
the above disclosures are exemplary only and that various other
alternatives, adaptations and modifications may be made within the
scope of the present invention. The presently disclosed embodiments
are to be considered in all respects as illustrative and not
restrictive. The scope of the invention being indicated by the
appended claims, rather than the foregoing description, and all
changes which come within the meaning and range of equivalency of
the claims are, therefore, intended to be embraced therein.
* * * * *
References