U.S. patent application number 10/921954, for a system and method for processing text utilizing a suite of disambiguation techniques, was filed with the patent office on 2004-08-20 and published on 2005-04-14.
Invention is credited to Barnes, Jeremy; Belzile, Pierre; Colledge, Matthew.
Application Number | 20050080613 10/921954 |
Family ID | 34216034 |
Filed Date | 2004-08-20 |
United States Patent
Application |
20050080613 |
Kind Code |
A1 |
Colledge, Matthew ; et
al. |
April 14, 2005 |
System and method for processing text utilizing a suite of
disambiguation techniques
Abstract
The invention relates to a system and method for processing
natural language text utilizing disambiguation components to
identify a disambiguated sense for the text. For the method, it
comprises applying a selection of the components to the text to
identify a local disambiguated sense for the text. Each component
provides a local disambiguated sense of the text with a confidence
score and a probability score. The disambiguated sense is
determined utilizing a selection of local disambiguated senses. The
invention also relates to a system and method for generating
sense-tagged text. For the method, it comprises steps of:
disambiguating a quantity of documents utilizing a disambiguation
component; generating a confidence score and a probability score
for a sense identified for a word provided by the component; if the
confidence score for the sense for the word is below a set
threshold, the sense is ignored; and if the confidence score for
the sense for the word is above the set threshold, the sense is
added to the sense-tagged text.
Inventors: |
Colledge, Matthew;
(Montreal, CA) ; Belzile, Pierre; (Lachine,
CA) ; Barnes, Jeremy; (Montreal, CA) |
Correspondence
Address: |
McCarthy Tetrault LLP
Toronto Dominion Bank Tower
66 Wellington Street West, Suite 4700
Box 48
Toronto
ON
M5K1E6
CA
|
Family ID: |
34216034 |
Appl. No.: |
10/921954 |
Filed: |
August 20, 2004 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
60496681 | Aug 21, 2003 | |
Current U.S.
Class: |
704/9 |
Current CPC
Class: |
G06F 16/951 20190101;
Y10S 707/99933 20130101; G06F 40/30 20200101; Y10S 707/99934
20130101; G06F 40/247 20200101; Y10S 707/99935 20130101 |
Class at
Publication: |
704/009 |
International
Class: |
G06F 017/27 |
Claims
We claim:
1. A method of processing natural language text utilizing a
plurality of disambiguation components to identify a disambiguated
sense for said text, said method comprising steps of: applying a
selection of components from said plurality of disambiguation
components to said text to identify a local disambiguated sense for
said text, wherein each component of said selection provides a
local disambiguated sense of said text with a confidence score and
a probability score; and said disambiguated sense is determined
utilizing a selection of local disambiguated senses from said
selection.
2. The method of processing natural language text as claimed in
claim 1, wherein said selection of components are sequentially
activated and controlled by a central module.
3. The method of processing natural language text as claimed in
claim 2, further comprising identifying a second selection of
components from said plurality of components; applying said second
selection to said text to refine said disambiguated sense, wherein
each component of said second selection provides a second local
disambiguated sense of said text with a second confidence score and
a second probability score; and said disambiguated sense is
determined utilizing a selection of second local disambiguated
senses from said second selection.
4. The method of processing natural language text as claimed in
claim 3, further comprising after applying said selection to said
text and prior to applying said second selection to refine said
disambiguated sense, eliminating a sense from said disambiguated
sense having a confidence score below a threshold.
5. The method of processing natural language text as claimed in
claim 4, wherein when a particular component of said plurality of
components is present in said selection and said second selection,
at least one of its confidence and probability scores is adjusted
when applying said second selection to said text.
6. The method of processing natural language text as claimed in
claim 4, wherein said selection and said second selection of
components are identical.
7. The method of processing natural language text as claimed in
claim 4, wherein said confidence score of said each component is
generated by a confidence function utilizing a trait of each
component.
8. The method of processing natural language text as claimed in
claim 4, wherein after applying said selection of components to
said text to identify a local disambiguated sense for said text,
said method further comprising for each said component of said
selection, generating a probability distribution for its
disambiguated sense; and merging all probability distributions for
said selection.
9. The method of processing natural language text as claimed in
claim 8, wherein said selection of components disambiguates said
text using context of said text identified from one of domain; user
history; and specified contexts.
10. The method of processing natural language text as claimed in
claim 8, further comprising after applying said selection to said
text, refining a knowledge base of each component in said selection
utilizing said disambiguated sense.
11. The method of processing natural language text as claimed in
claim 4, wherein at least one of said selection of components
provides results only for coarse senses.
12. The method of processing natural language text as claimed in
claim 4, wherein results of said selection of components are
combined into one result utilizing a merging algorithm.
13. The method of processing natural language text as claimed in
claim 12, wherein said process utilizes a first stage comprising
merging of coarse senses, and a second stage comprising merging of
fine senses within each coarse sense grouping.
14. The method of processing natural language text as claimed in
claim 13, wherein said merging process utilizes a
weighted sum of probability distributions, and said weights are the
confidence score associated with said distribution, and wherein
said merging process comprises a weighted average of confidence
scores, and said weights are again the confidence scores associated
with said distribution.
15. A method of generating sense-tagged text, said method
comprising steps of: disambiguating a quantity of documents
utilizing a disambiguation component; generating a confidence score
and a probability score for a sense identified for a word provided
by said component; if said confidence score for said sense for said
word is below a set threshold, said sense is ignored; and if said
confidence score for said sense for said word is above said set
threshold, said sense is added to said sense-tagged text.
16. A method of processing natural language text utilizing a
plurality of disambiguation components to identify a disambiguated
sense or senses for said text, said method comprising steps of:
defining an accuracy target for disambiguation; and applying a
selection of components from said plurality of disambiguation
components to meet said accuracy target.
17. A method of processing natural language text utilizing a
plurality of disambiguation components to identify a disambiguated
sense for said text, said method comprising steps of: identifying a
set of senses for said text; and identifying and removing an
unwanted sense from said set.
18. A method of processing natural language text utilizing a
plurality of disambiguation components to identify a disambiguated
sense for said text, said method comprising steps of: identifying a
set of senses for said text; and identifying and removing a
specified amount of ambiguity from said set of senses.
Description
RELATED APPLICATION
[0001] This application claims the benefit of U.S. Provisional
Application No. 60/496,681 filed on Aug. 21, 2003.
FIELD OF THE INVENTION
[0002] The present invention relates to disambiguating natural
language text, such as queries to an Internet search engine, web
pages and other electronic documents, and disambiguating textual
output of a speech to text system.
BACKGROUND
[0003] Word sense disambiguation is the process of determining the
meaning of words in text. For example, the word "bank" can mean a
financial institution, an embankment, or an aerial manoeuvre (or
several other meanings). When humans listen to or read naturally
expressed language, they automatically select the correct meaning
of each word based on the context in which it is expressed. A word
sense disambiguator is a computer-based system for accomplishing
this task, and is a critical component of technology for making
naturally expressed language understandable to computers.
[0004] A word sense disambiguator is used in applications which
require or which can be improved by making use of the meaning of
the words in the text. Such applications include but are not
limited to: Internet search and other information retrieval
applications; document classification; machine translation; and
speech recognition.
[0005] It is accepted by those skilled in the art that, although
humans perform word sense disambiguation effortlessly, and this is
a critical step in understanding naturally expressed language, no
system has yet been developed to accomplish word sense
disambiguation of general texts to an accuracy sufficient to permit
deployment in such applications. Even current advanced word sense
disambiguation systems may have an accuracy of only approximately
33%, thereby making their results too inaccurate for many
applications.
[0006] There is a need for a word sense disambiguation system and
method which address deficiencies in the prior art.
SUMMARY OF THE INVENTION
[0007] In a first aspect, a method of processing natural language
text utilizing disambiguation components to identify a
disambiguated sense or senses for the text is provided. The method
comprises applying a selection of the components to the text to
identify a local disambiguated sense for the text. Each component
provides a local disambiguated sense of the text with a confidence
score and a probability score. The disambiguated sense is
determined utilizing a selection of local disambiguated senses.
[0008] In the method, the components are sequentially activated and
controlled by a central module.
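By way of illustration only, the arrangement of paragraphs [0007] and [0008] can be sketched in Python as follows. Each disambiguation component is modelled as a callable that returns a local sense together with a probability score and a confidence score, and a central module activates a selection of components in sequence. The names LocalResult, context_component, frequency_component and run_selection, and all numeric scores, are hypothetical and do not appear in the application. Picking the result with the highest product of confidence and probability is only one possible way of using the local results.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class LocalResult:
    """A local disambiguated sense reported by one component (hypothetical type)."""
    component: str
    sense: str
    probability: float  # how likely the component considers this sense
    confidence: float   # how much the component's answer should be trusted


# A component takes (word, context words) and returns its local result.
Component = Callable[[str, List[str]], LocalResult]


def context_component(word: str, context: List[str]) -> LocalResult:
    # Toy rule: money-related context suggests the financial sense of "bank".
    if word == "bank" and {"loan", "interest", "deposit"} & set(context):
        return LocalResult("context", "financial institution", 0.9, 0.8)
    return LocalResult("context", "river bank", 0.5, 0.3)


def frequency_component(word: str, context: List[str]) -> LocalResult:
    # Toy prior: always proposes one fixed sense with low confidence.
    return LocalResult("frequency", "financial institution", 0.6, 0.4)


def run_selection(
    word: str, context: List[str], selection: List[Component]
) -> Tuple[str, List[LocalResult]]:
    """Central module: activate each selected component in sequence."""
    results = [component(word, context) for component in selection]
    # Illustrative choice only: keep the local result with the highest
    # confidence-weighted probability.
    best = max(results, key=lambda r: r.confidence * r.probability)
    return best.sense, results


sense, local_results = run_selection(
    "bank", ["interest", "rate"], [context_component, frequency_component]
)
print(sense)
```

Keeping the components behind one callable signature lets the central module change the selection, or run a second selection, without knowing how any individual component works.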
[0009] The method may further comprise identifying a second
selection of components; and applying the second selection to the
text to refine the disambiguated sense (or senses). Each component
in the second selection provides a second local disambiguated sense
(or senses) of the text with a second confidence score and a second
probability score. The disambiguated sense (or senses) is
determined utilizing a selection of the second local disambiguated
senses.
[0010] In the method, after applying the selection to the text and
prior to applying the second selection to refine the disambiguated
sense (or senses), the further step of eliminating a sense from the
disambiguated sense having a confidence score below a threshold may
be executed.
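The elimination step of paragraph [0010] can be sketched as a simple filter applied between the first and second selections. The function name prune_low_confidence, the candidate scores and the threshold of 0.5 are assumptions made for the example and are not taken from the application.

```python
from typing import Dict, Tuple

# Candidate senses after the first selection: sense -> (probability, confidence).
# The values are invented for illustration.
Candidates = Dict[str, Tuple[float, float]]


def prune_low_confidence(candidates: Candidates, threshold: float) -> Candidates:
    """Drop every sense whose confidence score is below the threshold."""
    return {
        sense: (prob, conf)
        for sense, (prob, conf) in candidates.items()
        if conf >= threshold
    }


first_pass = {
    "financial institution": (0.55, 0.80),
    "river bank": (0.35, 0.60),
    "flight manoeuvre": (0.10, 0.15),
}

# Only the surviving senses would be passed on to the second selection.
survivors = prune_low_confidence(first_pass, threshold=0.5)
print(sorted(survivors))
```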
[0011] In the method, when a particular component is present in the
selection and the second selection, its confidence and probability
scores may be adjusted when applying the second selection to the
text.
[0012] In the method, the selection and the second selection of
components may be identical.
[0013] In the method, the confidence score of each component
may be generated by a confidence function utilizing a trait of each
component.
[0014] After applying the selection of components to the text to
identify a local disambiguated sense (or senses) for the text, for
each component of the selection, the method may generate a
probability distribution for its disambiguated sense (or senses).
Further the method may merge all probability distributions for the
selection.
[0015] In the method, the selection of components may disambiguate
the text using context of the text identified from one of the
following: domain; user history; and specified context.
[0016] After applying the selection to the text, the method may
refine a knowledge base of each component in the selection
utilizing the disambiguated sense (or senses).
[0017] In the method, at least one of the selection of components
provides results only for coarse senses.
[0018] In the method, results of the selection of components may be
combined into one result utilizing a merging algorithm.
[0019] In the method, the process may utilize a first stage
comprising merging of coarse senses, and a second stage comprising
merging of fine senses within each coarse sense grouping.
[0020] In the method, the merging process may utilize a weighted
sum of probability distributions, and the weights may be the
confidence score associated with the distribution. Further, the
merging process may comprise a weighted average of confidence
scores, and the weights are again the confidence scores associated
with the distribution.
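A minimal sketch of the merging described in paragraph [0020] follows. Each component contributes a probability distribution over senses and a confidence score. The merged distribution is the confidence-weighted sum of the distributions, and the merged confidence is the average of the confidence scores weighted by those same scores. Dividing the weighted sum by the total confidence, so that the merged distribution sums to one, is an assumption of this example, as are the function name merge and the sample values.

```python
from typing import Dict, List, Tuple

Distribution = Dict[str, float]


def merge(results: List[Tuple[Distribution, float]]) -> Tuple[Distribution, float]:
    """Merge (distribution, confidence) pairs reported by several components."""
    total_conf = sum(conf for _, conf in results)
    if total_conf == 0:
        raise ValueError("at least one component must have non-zero confidence")

    # Weighted sum of probability distributions; the weights are the confidences.
    merged: Distribution = {}
    for dist, conf in results:
        for sense, prob in dist.items():
            merged[sense] = merged.get(sense, 0.0) + conf * prob
    # Assumed normalisation step so the merged probabilities sum to one.
    merged = {sense: p / total_conf for sense, p in merged.items()}

    # Weighted average of confidences; the weights are again the confidences.
    merged_conf = sum(conf * conf for _, conf in results) / total_conf
    return merged, merged_conf


component_a = ({"financial institution": 0.7, "river bank": 0.3}, 0.8)
component_b = ({"financial institution": 0.4, "river bank": 0.6}, 0.2)

merged_dist, merged_conf = merge([component_a, component_b])
print(merged_dist, merged_conf)
```

Because the weights are the confidences themselves, a confident component dominates both the merged distribution and the merged confidence, while a low-confidence component has little effect on either.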
[0021] In another aspect, a method of processing natural language
text utilizing disambiguation components to identify a
disambiguated sense for the text is provided. The method comprises
steps of: defining an accuracy target for disambiguation; and
applying a selection of components from the plurality of
disambiguation components to meet the accuracy target.
[0022] In another aspect, a method of processing natural language
text utilizing disambiguation components to identify a
disambiguated sense for the text is provided. The method comprises
steps of: identifying a set of senses for the text; and identifying
and removing an unwanted sense from the set.
[0023] In another aspect a method of processing natural language
text utilizing disambiguation components to identify a
disambiguated sense for the text is provided. The method comprises
steps of: identifying a set of senses for the text; and identifying
and removing an amount of ambiguity from the set of senses.
[0024] In a second aspect, a method of generating
sense-tagged text is provided. The method comprises steps of:
disambiguating a quantity of documents utilizing a disambiguation
component; generating a confidence score and a probability score
for a sense identified for a word provided by the component; if the
confidence score for the sense for the word is below a set
threshold, the sense is ignored; and if the confidence score for
the sense for the word is above the set threshold, the sense is
added to the sense-tagged text.
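The steps of paragraph [0024] can be sketched as follows. The component is represented by a hypothetical function that returns a sense, a probability score and a confidence score for each word. The names toy_component and build_sense_tagged_text, the whitespace tokenisation and the sample scores are assumptions of the example. The paragraph does not say how a score exactly equal to the threshold is treated; this sketch ignores it.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical component signature: word -> (sense, probability, confidence).
Component = Callable[[str], Tuple[str, float, float]]


def toy_component(word: str) -> Tuple[str, float, float]:
    # Invented scores standing in for a real disambiguation component.
    table = {
        "bank": ("financial institution", 0.7, 0.9),
        "interest": ("money paid on a loan", 0.6, 0.4),
    }
    return table.get(word, ("unknown", 0.0, 0.0))


def build_sense_tagged_text(
    documents: Iterable[str], component: Component, threshold: float
) -> List[Tuple[str, str]]:
    """Keep (word, sense) pairs whose confidence is above the threshold."""
    tagged: List[Tuple[str, str]] = []
    for document in documents:
        for word in document.split():
            sense, _probability, confidence = component(word)
            if confidence > threshold:
                tagged.append((word, sense))
            # Senses at or below the threshold are ignored.
    return tagged


tagged = build_sense_tagged_text(["bank interest rate"], toy_component, 0.5)
print(tagged)
```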
[0025] In other aspects various combinations of sets and subsets of
the above aspects are provided.
BRIEF DESCRIPTION OF THE DRAWINGS
[0026] The foregoing and other aspects of the invention will become
more apparent from the following description of specific
embodiments thereof and the accompanying drawings which illustrate,
by way of example only, the principles of the invention. In the
drawings, where like elements feature like reference numerals (and
wherein individual elements bear unique alphabetical suffixes):
[0027] FIG. 1 is a schematic representation of words and word
senses associated with an embodiment of a text processing
system;
[0028] FIG. 2 is a schematic representation of a representative
semantic relationship of words for use with the system of FIG. 1;
[0029] FIG. 3 is a schematic representation of an embodiment of a
text processing system providing word sense disambiguation;
[0030] FIG. 4 is a block diagram of a word sense disambiguator
module, control file optimizer, and database elements of the text
processing system of FIG. 3;
[0031] FIG. 5 is a diagram of data structures used to represent the
semantic relationships of FIG. 2 for the system of FIG. 3;
[0032] FIG. 6 is a flow diagram of a text processing process
performed by the embodiment of FIG. 3;
[0033] FIG. 7 is a flow diagram of a process for a disambiguating
step of the text processing process of FIG. 6;
[0034] FIG. 8 is a data flow diagram for the control file optimizer
of FIG. 4; and
[0035] FIG. 9 is a flow diagram of a bootstrapping process
associated with the text processing system of FIG. 3.
DESCRIPTION OF EMBODIMENTS
[0036] The description which follows, and the embodiments described
therein, are provided by way of illustration of an example, or
examples, of particular embodiments of the principles of the
present invention. These examples are provided for the purposes of
explanation, and not limitation, of those principles and of the
invention. In the description, which follows, like parts are marked
throughout the specification and the drawings with the same
respective reference numerals.
[0037] The following terms will be used in the following
description, and have the meanings shown below:
[0038] Computer readable storage medium: hardware for storing
instructions or data for a computer. For example, magnetic disks,
magnetic tape, optically readable medium such as CD ROMs, and
semi-conductor memory such as PCMCIA cards. In each case, the
medium may take the form of a portable item such as a small disk,
floppy diskette, cassette, or it may take the form of a relatively
large or immobile item such as hard disk drive, solid state memory
card, or RAM.
[0039] Information: documents, web pages, emails, image
descriptions, transcripts, stored text etc. that contain searchable
content of interest to users, for example, contents related to news
articles, news group messages, web logs, etc.
[0040] Module: a software or hardware component that performs
certain steps and/or processes; may be implemented in software
running on a general-purpose processor.
[0041] Natural language: a formulation of words intended to be
understood by a person rather than a machine or computer.
[0042] Network: an interconnected system of devices configured to
communicate over a communication channel using particular
protocols. This could be a local area network, a wide area network,
the Internet, or the like operating over communication lines or
through wireless transmissions.
[0043] Query: a list of keywords indicative of desired search
results; may utilize Boolean operators (e.g. "AND", "OR"); may be
expressed in natural language.
[0044] Text: textual information represented in its usual form
within a computer or associated storage device. Unless otherwise
specified, it is assumed to be expressed in natural language.
[0045] Search engine: a hardware or software component to provide
search results regarding information of interest to a user in
response to text from the user. The search results may be ranked
and/or sorted by relevance.
[0046] Sense-tagged text: text in which some or all of the words
have been marked with a word sense or senses signifying the meaning
of the word in the text.
[0047] Sense-tagged corpus: a collection of sense-tagged text
for which the senses and possibly linguistic information such as
part of speech tags of some or all words have been marked. The
accuracy of the specification of the senses and other linguistic
information must be similar to that which would be achieved by a
human lexicographer. Thus, if sense-tagged text is generated by a
machine, then the accuracy of word senses that are marked by the
machine must be similar to that of a human lexicographer performing word
sense disambiguation.
[0048] The embodiment relates to natural language processing, and
in particular to processing natural language text as a step in an
application which requires or can be improved by making use of the
meaning of the words in the text. This process is known generally
as word sense disambiguation. Applications include but are not
limited to:
[0049] 1. Internet search and other information retrieval
applications; both in disambiguating queries to better specify the
user's request, and in disambiguating documents to select more
relevant results. When working with large sets of data, such as a
database of documents or web pages on the Internet, the volume of
available data can make it difficult to find information of
relevance. Various methods of searching are used in an attempt to
find relevant information in such stores of information. Some of
the best known systems are Internet search engines, such as Yahoo
(trademark) and Google (trademark) which allow users to perform
keyword-based searches. These searches typically involve matching
keywords entered by the user with keywords in an index of web
pages. One reason for some difficulties encountered in performing
such searches is the ambiguity of words used in natural language.
Specifically, difficulties are often encountered because one word
can have several meanings, and each meaning can have multiple
synonyms or paraphrases. For example, "Java bean" is matched by a
search engine to documents which simply contain these two words. By
disambiguating "Java bean" to mean "coffee bean" instead of the
"Java Bean" computer technology by Sun Microsystems, a
disambiguator would allow documents about this computer technology
to be excluded from the results, and would similarly allow
documents concerning coffee beans to be included in the
results.
[0050] 2. Document classification; in allowing documents to be
clustered based upon precise criteria of meaning as opposed to
their textual content. For example, consider an application which
automatically sorted email messages into folders each pertaining to
a topic specified by a user. One such folder might be entitled
"programming tools", and contain any emails that mentioned any form
of "programming tool". The use of word sense disambiguation in this
application would allow emails that contained related information,
but did not contain words matching the title of the folder to be
accurately classified as belonging in the folder or not. For
example, an email containing the words "Java object" could be placed in the folder
because it contains a sense of "Java" meaning a programming
language, whereas an email containing the terms "Java coffee" or
"tools to use in designing a conference program" could be rejected
because, in the first case, the word "Java" is disambiguated to
mean a type of coffee, and, in the second case, the word "program"
refers to an event, which is a meaning not associated with computer
programming. Such an effect could be optionally achieved by giving
the senses present in a disambiguated email to a machine learning
algorithm, rather than just providing the words as is currently
done by state-of-the-art applications. The accuracy of the
classification would increase as a result, and the application
would appear more intelligent and be more useful to the user.
[0051] 3. Machine translation; in knowing the precise meanings of
words before they are translated, so that the correct translation
can be provided for words with multiple possible translations. For
example, the word "bank" in English may translate into the French
"banque" if it means "financial institution", but "rive" if it
means "river bank". In order to perform an accurate translation of
such a word, it is necessary to select a meaning. It will be
recognised by those skilled in the art that a large percentage of
the errors in prior art machine translation systems are made due to
the selection of the wrong senses of words being translated. The
addition of word sense disambiguation to such a system would
improve accuracy by reducing or eliminating the errors of this type
that are made by today's state-of-the-art systems.
[0052] 4. Speech recognition; in allowing utterances with words or
combinations of words that sound the same but are written
differently to be correctly interpreted. Most speech recognition
systems include a recognition component that analyses the phonetics
of a phrase and outputs several possible sequences of words that
could have been pronounced. For example, "I asked to people" and "I
asked two people" are pronounced the same, and would both be output
as possible sequences of words by such a recognition component.
Most speech recognition systems then include a module which selects
which of the possible word sequences is the most probable, and
outputs this sequence as the result. This module typically operates
by selecting the word sequence that matches most closely with word
sequences that are known to be uttered. Word sense disambiguation
could improve the operation of such a module by selecting the word
sequence that leads to the most consistent interpretation. For
example, consider a speech recognition system which generated two
alternative interpretations for an utterance: "I scream in flat
endings" or "Ice cream is fattening". A word sense disambiguator
would select between these two interpretations which sound the
same, in exactly the same manner as it would disambiguate between
two possible interpretations in text which are spelled the
same.
[0053] 5. Text to speech (speech synthesis); in allowing words with
multiple pronunciations to be pronounced correctly. For example, "I
saw her sow the seeds" and "The old sow was slaughtered for bacon"
both contain the word "sow", which is pronounced differently in
each sentence. A text to speech application needs to know which
interpretation applies to each word in order to correctly utter
each sentence. A word sense disambiguation module could determine
that the sense of "sow" in the first sentence was the verb "to sow"
and in the second sentence was "a female hog". The application
would then have the information necessary to pronounce each
sentence correctly.
[0054] Before describing specific aspects of the embodiment, some
background on relationships between words and their word senses is
provided. Referring to FIG. 1, a relationship between words and word
senses is shown generally by the reference 100. As seen in this
example, certain words have multiple senses. Among many other
possibilities, the word "bank" may represent: (i) a noun referring
to a financial institution; (ii) a noun referring to a river bank;
or (iii) a verb referring to an action to save money. Similarly,
the word "interest" has multiple meanings including: (i) a noun
representing an amount of money payable relating to an outstanding
investment or loan; (ii) a noun representing special attention
given to something; or (iii) a noun representing a legal right in
something.
[0055] The embodiment assigns senses to words. In particular, the
embodiment defines two senses of words: coarse and fine. A fine
sense defines a precise meaning and usage of a word. Each fine
sense applies within a particular part of speech category (noun,
verb, adjective or adverb). A coarse sense defines a broad concept
associated with a word, and may be associated with more than one
part of speech category. Each coarse sense contains one or more
fine senses, and each fine sense belongs to one coarse sense. A
word can have more than one fine and more than one coarse sense. A
fine sense is classified under the coarse sense because the fine
sense of the word matches the generic concept associated with the
coarse sense definition. Table 1 illustrates the relationship
between a word, its coarse senses and its fine senses. As an
example to illustrate the distinction between fine and coarse
senses, the fine senses for the word "bank" respect the distinction
between the verb "to bank" as in "to bank a plane" and the noun "a
bank" as in "the pilot performed a bank", whereas these two senses
are grouped together under the more general coarse sense
"Manoeuvre".
TABLE 1
Word | Coarse Sense | Fine Senses |
Bank | Financial Institutions | Financial institution (Noun); Building where banking is done (Noun); Perform Business with a Bank (Verb) |
Bank | Ground formations | Land beside water (Noun); Ridge of earth (Noun); Slope in road (Noun) |
Bank | Manoeuvre | Flight manoeuvre (Noun); Tip laterally (Verb) |
Bank | Gambling | Funds held by a gambling house (Noun); Act as a banker in gambling (Verb) |
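For illustration, the contents of Table 1 can be held in a small data structure in which each coarse sense owns a list of fine senses and each fine sense carries exactly one part of speech. The type FineSense, the mapping BANK_SENSES and the two helper functions are hypothetical names chosen for this sketch. The application's own data structures are those described with reference to FIG. 5.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class FineSense:
    gloss: str
    part_of_speech: str  # a fine sense applies within one part of speech


# Coarse sense -> fine senses for the word "bank", transcribed from Table 1.
BANK_SENSES: Dict[str, List[FineSense]] = {
    "Financial Institutions": [
        FineSense("Financial institution", "Noun"),
        FineSense("Building where banking is done", "Noun"),
        FineSense("Perform business with a bank", "Verb"),
    ],
    "Ground formations": [
        FineSense("Land beside water", "Noun"),
        FineSense("Ridge of earth", "Noun"),
        FineSense("Slope in road", "Noun"),
    ],
    "Manoeuvre": [
        FineSense("Flight manoeuvre", "Noun"),
        FineSense("Tip laterally", "Verb"),
    ],
    "Gambling": [
        FineSense("Funds held by a gambling house", "Noun"),
        FineSense("Act as a banker in gambling", "Verb"),
    ],
}


def coarse_sense_of(fine_gloss: str) -> str:
    """Each fine sense belongs to exactly one coarse sense."""
    for coarse, fine_senses in BANK_SENSES.items():
        if any(fine.gloss == fine_gloss for fine in fine_senses):
            return coarse
    raise KeyError(fine_gloss)


def parts_of_speech(coarse: str) -> Set[str]:
    """A coarse sense may span more than one part of speech."""
    return {fine.part_of_speech for fine in BANK_SENSES[coarse]}


print(coarse_sense_of("Tip laterally"), parts_of_speech("Manoeuvre"))
```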
[0056] Referring to FIG. 2, example semantic relationships between
word senses are shown. These semantic relationships are precisely
defined types of associations between two words based on meaning.
The relationships are between word senses, which are specific
meanings of words. For example, a bank (in the sense of a river
bank) is a type of terrain and a bluff (in the sense of a noun
meaning a land formation) is also a type of terrain. A bank (in the
sense of river bank) is a type of incline (in the sense of grade of
the land). A bank in the sense of a financial institution is
synonymous with a "banking company" or a "banking concern." A bank
is also a type of financial institution, which is in turn a type of
business. A bank (in the sense of financial institution) is related
to interest (in the sense of money paid on investments) and is also
related to a loan (in the sense of borrowed money) by the generally
understood fact that banks pay interest on deposits and charge
interest on loans.
[0057] It will be understood that there are many other types of
semantic relationships that may be used. Although known in the art,
following are some examples of semantic relationships between
words: Words which are in synonymy are words which are synonyms to
each other. A hypernym is a relationship where one word represents
a whole class of specific instances. For example "transportation"
is a hypernym for a class of words including "train", "chariot",
"dogsled" and "car", as these words provide specific instances of
the class. Meanwhile, a hyponym is a relationship where one word is
a member of a class of instances. From the previous list, "train"
is a hyponym of the class "transportation". A meronym is a
relationship where one word is a constituent part of, the substance
of, or a member of something. For example, for the relationship
between "leg" and "knee", "knee" is a meronym to "leg", as a knee
is a constituent part of a leg. Meanwhile, a holonym is a relationship
where one word is the whole of which a meronym names a part. From
the previous example, "leg" is a holonym to "knee". Any semantic
relationships that fall into these categories may be used. In
addition, any known semantic relationships that indicate specific
semantic and syntactic relationships between word senses may be
used.
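The relationship types above come in inverse pairs: hypernym with hyponym, and meronym with holonym, while synonymy is symmetric. A minimal sketch of a store that records both directions when one is added is given below, using the examples from the preceding paragraph. The class RelationStore and its methods are hypothetical and are not taken from the application.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# Each relationship type and the type that holds in the opposite direction.
INVERSE = {
    "hypernym": "hyponym",
    "hyponym": "hypernym",
    "meronym": "holonym",
    "holonym": "meronym",
    "synonym": "synonym",
}


class RelationStore:
    def __init__(self) -> None:
        # (word, relation) -> the words that are <relation>s of that word
        self._edges: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def add(self, source: str, relation: str, target: str) -> None:
        """Record that `source` is a <relation> of `target`, plus the inverse."""
        self._edges[(target, relation)].add(source)
        self._edges[(source, INVERSE[relation])].add(target)

    def related(self, word: str, relation: str) -> Set[str]:
        """Return the words that are <relation>s of `word`."""
        return set(self._edges.get((word, relation), set()))


store = RelationStore()
for instance in ("train", "chariot", "dogsled", "car"):
    store.add("transportation", "hypernym", instance)
store.add("knee", "meronym", "leg")

print(store.related("train", "hypernym"))
print(store.related("leg", "meronym"))
```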
[0058] It will be recognized that use of word sense disambiguation
in a search engine addresses the problem of retrieval relevance.
Furthermore, users often express text as they would express
language. However, since the same meaning can be described in many
different ways, users encounter difficulties when they do not
express text in the same specific manner in which the relevant
information was initially classified.
[0059] For example if the user is seeking information about "Java"
the island, and is interested in "holidays" on Java (island), the
user would not retrieve useful documents that had been categorized
using the keywords "Java" and "vacation". The embodiment addresses
this issue. It has been recognized that deriving precise synonyms
and sub-concepts for each key term in a naturally expressed text
increases the volume of relevant retrievals. If this were
performed using a thesaurus without word sense disambiguation, the
result could be worsened. For example, semantically expanding the
word "Java" without first establishing its precise meaning would
yield a massive and unwieldy result set with results potentially
selected based on word senses as diverse as "Indonesia" and
"computer programming". The embodiment provides systems and methods
of interpreting the meaning of each word, which can then be semantically
expanded to produce a comprehensive and simultaneously more precise
result set.
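The effect described in paragraph [0059] can be sketched as expansion that is keyed on a word together with its disambiguated sense, so that "holidays" also matches "vacation" while "Java" is not expanded with programming terms. The table EXPANSIONS, its sense labels and its entries are invented for the example, and the functions expand_query and matches are hypothetical names.

```python
from typing import Dict, List, Set, Tuple

SenseKey = Tuple[str, str]  # (word, disambiguated sense label)

# Invented expansion table: synonyms and sub-concepts per (word, sense).
EXPANSIONS: Dict[SenseKey, Set[str]] = {
    ("java", "island"): {"java"},
    ("java", "programming language"): {"java", "programming"},
    ("holidays", "leisure time away from work"): {"holidays", "vacation"},
}


def expand_query(disambiguated: List[SenseKey]) -> List[Set[str]]:
    """Return one set of acceptable keywords per disambiguated query term."""
    return [EXPANSIONS.get(key, {key[0]}) for key in disambiguated]


def matches(document_keywords: Set[str], expanded: List[Set[str]]) -> bool:
    """A document matches when it contains a keyword from every term's set."""
    return all(group & document_keywords for group in expanded)


query = [("java", "island"), ("holidays", "leisure time away from work")]
expanded = expand_query(query)

print(matches({"java", "vacation"}, expanded))
print(matches({"java", "compiler"}, expanded))
```

This sketch matches on document keywords only. Distinguishing a document about the island from one about the programming language would additionally require the documents themselves to be sense-tagged.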
[0060] Referring to FIG. 3, a text processing system associated with
an embodiment is shown generally at reference 10. The system takes
as input a text file 12. The text file 12 contains natural language
text, such as a query, a document, the output of a speech to text
system, or any source of natural language text in electronic
form.
[0061] The system includes text processing engine 20. The text
processing engine 20 may be implemented as dedicated hardware, or
as software operating on a general purpose processor. The text
processing engine may also operate on a network.
[0062] The text processing engine 20 generally includes a processor
22. The engine may also be connected, either directly thereto, or
indirectly over a network or other such communication means, to a
display 24, an interface 26, and a computer readable storage medium
28. The processor 22 is coupled to the display 24 and to the
interface 26, which may comprise user input devices such as a
keyboard, mouse, or other suitable devices. If the display 24 is
touch sensitive, then the display 24 itself can be employed as the
interface 26. The computer readable storage medium 28 is coupled to
the processor 22 for providing instructions to the processor 22 to
instruct and/or configure processor 22 to perform steps or
algorithms related to the operation of text processing engine 20,
as further explained below. Portions or all of the computer
readable storage medium 28 may be physically located outside of the
text processing engine 20 to accommodate, for example, very large
amounts of storage. Persons skilled in the art will appreciate that
various forms of text processing engines can be used with the
present invention.
[0063] Optionally, and for greater computational speed, the text
processing engine 20 may include multiple processors operating in
parallel or any other multi-processing arrangement. Such use of
multiple processors may enable the text processing engine 20 to
divide tasks among various processors. Furthermore, the multiple
processors need not be physically located in the same place, but
rather may be geographically separated and interconnected over a
network as will be understood by those skilled in the art.
[0064] Text processing engine 20 includes a database 30 for storing
a knowledge base and component linguistic resources used by the
text processing engine 20. The database 30 stores the information
in a structured format to allow computationally efficient storage
and retrieval as will be understood by those skilled in the art.
The database 30 may be updated by adding additional keyword senses
or by referencing existing keyword senses to additional documents.
The database 30 may be divided and stored in multiple locations for
greater efficiency.
[0065] A central component of text processing engine 20 is word
sense disambiguation (WSD) module 32, which processes words from an
input document or text into word senses. A word sense is a given
interpretation ascribed to a word, in view of the context of its
usage and its neighbouring words. For example, the word "book" in
the sentence "Book me a flight to New York" is ambiguous, because
"book" can be a noun or a verb, each with multiple potential
meanings. The result of processing of the words by the WSD module
32 is a disambiguated document or disambiguated text comprising
word senses rather than ambiguous or uninterpreted words. WSD
module 32 distinguishes between word senses for each word in the
document or text. WSD module 32 identifies which specific meaning
of the word is the intended meaning using a wide range of
interlinked linguistic techniques to analyze the syntax (e.g. part
of speech, grammatical relations) and semantics (e.g. logical
relations) in context. It may use a knowledge base of word senses
which expresses explicit semantic relationships between word senses
to assist in performing the disambiguation.
[0066] Referring to FIG. 4, further detail on database 30 is
provided.
[0067] To assist in disambiguating words into word senses, the
embodiment utilizes knowledge base 400 of word senses capturing
relationships of words as described above for FIG. 2. Knowledge
base 400 is associated with database 30 and is accessed to assist
WSD module 32 in performing word sense disambiguation as well as
provide the inventory of possible senses of words in a text. While
prior art dictionaries, and lexical databases such as WordNet
(trademark), have been used in systems, knowledge base 400 provides
an enhanced inventory of words, word senses, and semantic
relations. For example, while prior art dictionaries contain only
definitions of words for each of their word senses, knowledge base
400 also contains information on relations between word senses.
These relations include the definition of the sense and the
associated part of speech (noun, verb, etc.), fine sense synonyms,
antonyms, hyponyms, meronyms, pertainyms, similar adjectives
relations and other relationships known in the art. Knowledge base
400 also contains additional semantic relations not contained in
other prior art lexical databases: (i) additional relations between
word senses, such as the grouping of fine senses into coarse
senses, "instance of" relations, classification relations, and
inflectional and derivational morphological relations; (ii)
corrections of errors in data obtained from published sources; and
(iii) additional words, word senses, and relations that are not
present in other prior art knowledge bases.
[0068] In addition to containing an inventory of words and word
senses (fine and coarse) for each word and concepts, as well as
over 40 specific types of semantic links between them, database 30
also provides a repository for component resources 402 used by
linguistic components 502 and WSD components 504. Some component
resources are shared by several components while other resources
are specific to a given component. In the embodiment, the component
resources include: general models, domain specific models, user
models and session models. General models contain general domain
information, such as a probability distribution of senses for each
word for any text of unknown domain. They are trained using data
from several domains. WSD components 504 and linguistic components
502 utilize these resources as necessary. For example, a component
may use these resources on all requests or may use them only when the
request cannot be completed using more specific models.
Domain-specific models are trained from domain specific
information. They are useful for modelling usage of specialized
meanings of words in various domains. For example, the word "Java"
has different meaning for travel agents and computer programmers.
These resources allow the building of statistical models for each
group. User models are trained for a specific user. The models may
be given and may be learnt over time. The user models can be
constructed by the application or automatically by the word sense
disambiguation system. Session models provide information regarding
multiple requests regrouped within a session. For example, several
word sense disambiguation requests may be related to the same topic
during an information retrieval session using a search engine. The
session models can be constructed by the application or
automatically by WSD module 32.
[0069] Database 30 also contains sense-tagged corpus 404.
Sense-tagged corpus 404 may optionally be split up into sub-units
used for training components, training confidence functions for
components and training the control file optimizer, as described
further below.
[0070] Referring to FIG. 5, further detail on knowledge base 400 is
provided. In the embodiment, knowledge base 400 is a generalized
graph data structure and is implemented as a table of nodes 402 and
a table of edge relations 404 associating two nodes together. Each
is described in turn. Annotations of arbitrary data types may be
attached to each node or edge. In other embodiments, other data
structures, such as linked lists, may be used to implement
knowledge base 400.
[0071] In table 402, each node is an element in a row of table 402.
In the embodiment, a record for each node has as many as the
following fields: an ID field 406, a type field 408 and an
annotation field 410. There are two types of entries in table 402:
a word and a word sense definition. For example, the word "bank" in
ID field 406A is identified as a word by the "word" entry in type
field 408A. Also, exemplary table 402 provides several definitions
of words. To catalog the definitions and to distinguish definition
entries in table 402 from word entries, labels are used to identify
definition entries. For example, entry in ID field 406B is labeled
"LABEL001". A corresponding definition in type field 408B
identifies the label as a "fine sense" word relationship. A
corresponding entry in annotation field 410B identifies the label
as "Noun. A financial institution". As such, a "bank" can now be
linked to this word sense definition. Furthermore an entry for the
word "brokerage" may also be linked to this word sense definition.
Alternate embodiments may use a common word with a suffix attached
to it, in order to facilitate recognition of the word sense
definition. For example, an alternative label could be "bank/n1",
where the "/n1" suffix identifies the label as a noun (n) and the
first meaning for that noun. It will be appreciated that other
label variations may be used. Other identifiers to identify
adjectives, adverbs and others may be used. The entry in type field
408 identifies the type associated with the word. There are several
types available for a word, including: word, fine sense and coarse
sense. Other types may also be provided. In the embodiment, when an
instance of a word has a fine sense, that instance also has an
entry in annotation field 410 to provide further particulars on
that instance of the word.
[0072] Edge/Relations table 404 contains records indicating
relationships between two entries in nodes table 402. Table 404 has
the following entries: From node ID column 412, to node ID column
414, type column 416 and annotation column 418. Columns 412 and 414
are used to link two entries in table 402 together. Column 416
identifies the type of relation that links the two entries. A
record has the ID of the origin and the destination node, the type
of the relation, and may have annotations based on the type. Types
of relations include "root word to word", "word to fine sense",
"word to coarse sense", "coarse to fine sense", "derivation",
"hyponym", "category", "pertainym", "similar", "has part". Other
relations may also be tracked therein. Entries in annotation column
418 provide a (numeric) key to uniquely identify an edge type going
from a word node to either a coarse node or fine node for a given
part-of-speech.
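The two-table layout described above can be illustrated with a minimal sketch. The class names, the helper `senses_of`, and the numeric annotation keys are hypothetical and chosen only for illustration; the field comments map to the reference numerals in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    node_id: str                      # ID field 406
    node_type: str                    # type field 408: "word", "fine sense", "coarse sense"
    annotation: Optional[str] = None  # annotation field 410


@dataclass(frozen=True)
class Edge:
    from_id: str                      # from node ID column 412
    to_id: str                        # to node ID column 414
    edge_type: str                    # type column 416
    annotation: Optional[int] = None  # annotation column 418 (illustrative key)


# Nodes table 402: two word entries and one labelled sense definition.
nodes = [
    Node("bank", "word"),
    Node("brokerage", "word"),
    Node("LABEL001", "fine sense", "Noun. A financial institution"),
]

# Edge/relations table 404: both words link to the same fine sense.
edges = [
    Edge("bank", "LABEL001", "word to fine sense", 1),
    Edge("brokerage", "LABEL001", "word to fine sense", 1),
]


def senses_of(word: str, nodes: List[Node], edges: List[Edge]) -> List[Node]:
    """Return the fine-sense nodes reachable from a word node."""
    by_id = {n.node_id: n for n in nodes}
    return [
        by_id[e.to_id]
        for e in edges
        if e.from_id == word and e.edge_type == "word to fine sense"
    ]


bank_senses = senses_of("bank", nodes, edges)
```

Keeping nodes and edges in separate flat tables lets new relation types be added as rows without changing the node records.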
[0073] Referring to FIG. 4, further detail on WSD module 32 is
provided. WSD module 32 comprises control file optimizer 514,
iterative component sequencer (ICS) 500, linguistic components 502,
and WSD components 504.
[0074] Turning first to WSD components 504 and linguistic
components 502, common characteristics and features of WSD
components 504 and linguistic components 502 ("components") are now
described. Results generated by a particular component are
preferably rated using a probability distribution and a confidence
score. The probability distribution allows a component to return a
probability figure indicating the likelihood that any possible
answer is correct. In the case of WSD components 504, possible
answers comprise possible senses of words in the text. In the case
of linguistic components 502, possible answers depend on the task
being performed by the linguistic component; for example, possible
answers for part-of-speech tagger 502F are the set of possible part
of speech tags for each word. The confidence score provides an
indication of a level of confidence of the algorithm in the
probability distribution. As such, an answer having a high
probability and a high confidence score indicates that the
algorithm has identified a single answer as most probable and it is
highly likely that the identified answer is accurate. If an answer
has a high probability score and a low confidence, then although
the algorithm has identified a single answer as most probable, its
confidence score indicates that it may not be correct. In the case
of WSD components 504, a low confidence score may indicate that the
component is lacking information that it needed to disambiguate
this particular word. It is important that each component have a
good confidence function. A component with a low overall accuracy
but a good confidence function is able to contribute to the system
accuracy despite its low overall accuracy, as the confidence
function will identify correctly the subset of words for which the
answers supplied by the component can be trusted.
[0075] The confidence function considers internal operating
features of the component and its algorithm and evaluates potential
weaknesses of accuracy of the algorithm. For example, if an
algorithm relies on statistical probabilities, it would tend to
produce incorrect results when probabilities were calculated from
very few examples. Accordingly, for that algorithm, the confidence
score will use a variable containing the number of examples used by
the algorithm. A confidence function may contain several variables,
even hundreds of variables. The function is usually created by
using the variables as input into a classification or regression
algorithm (statistical, such as a generalized linear model, or
based upon machine learning, such as a neural network) familiar to
those skilled in the art. The data used to train the classification
or regression algorithm is preferably obtained by running the WSD
algorithm over a portion of sense-tagged corpus 404 that has been
set aside for this purpose.
[0076] Many of the components employ statistical techniques based
on machine learning concepts or other statistical techniques which
will be familiar to those skilled in the art. It will be
appreciated by those skilled in the art that such components
require training data in order to construct their statistical
models. For example, the priors component 504A utilizes many
sense-tagged examples of each word in order to determine what is
the statistically most likely sense for that particular word. In
the embodiment, the training data is provided by sense-tagged
corpus 404, which is known by those skilled in the art as a
"training corpus".
[0077] Further detail is now provided on features of WSD components
504. Each WSD component 504 attempts to associate the correct
senses to words in text using a particular word sense
disambiguation algorithm. Each WSD component 504 may run more than
one time during the course of a disambiguation. The system provides
semantic word data or other forms of data in database 30 that each
of the algorithms needs in order to perform disambiguation. As
noted earlier, each WSD component 504 has an algorithm that
executes a particular type of disambiguation and generates a
probability score and a confidence score with its results. The WSD
components include but are not limited to: priors component 504A;
example memory component 504B; n-gram component 504C; concept
overlapping component 504E; heuristic word sense component 504F;
frequent words component 504G; and dependency component 504H. Each
component has a specialized knowledge base associated with its
particular operation. Each component produces a confidence function
as detailed above. Details of each component are described below.
Each technique is generally known in the art, unless specific
aspects are provided herein. It will also be appreciated that not
all of the WSD components described in the embodiment may be
necessary to accomplish accurate word sense disambiguation, but
that some combination of different techniques is required.
[0078] For priors component 504A, it utilizes a priors algorithm to
predict word senses by utilizing statistical data on frequency of
appearances of various word senses. Specifically the algorithm
assigns a probability to each word sense based on the frequency of
the word sense in a sense-tagged corpus 404. These frequencies are
preferably stored in the component resources 402.
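As a minimal sketch of the priors idea, the relative frequency of each sense of a word in a sense-tagged corpus can serve as its probability. The function name `train_priors`, the toy corpus, and the sense labels are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def train_priors(tagged_tokens):
    """Estimate P(sense | word) from (word, sense) pairs by relative frequency."""
    counts = defaultdict(Counter)
    for word, sense in tagged_tokens:
        counts[word][sense] += 1
    return {
        word: {sense: n / sum(sense_counts.values())
               for sense, n in sense_counts.items()}
        for word, sense_counts in counts.items()
    }


# Toy stand-in for sense-tagged corpus 404 (hypothetical data).
corpus = [
    ("java", "island"),
    ("java", "coffee"),
    ("java", "programming language"),
    ("java", "programming language"),
]
priors = train_priors(corpus)
```

A confidence function for such a component would plausibly take the number of training examples per word as an input, since frequencies estimated from few examples are less reliable.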
[0079] For example memory component 504B, it utilizes an example
memory algorithm to predict word senses for phrases (or word
sequences). Preferably it attempts to predict word senses of all
the words in a sequence. Phrases typically are defined as a series
of consecutive words. A phrase can be two words long up to a full
sentence. The algorithm accesses a list of phrases (word sequences)
which provide a deemed correct sense for each word in that phrase.
Preferably, the list comprises sentence fragments from sense-tagged
corpus 404 that occurred multiple times, where the senses in every
occurrence of the fragment were identical. Preferably, when an
analyzed phrase contains a word which has a sense which differs
from a sense previously attributed to that word in that phrase,
senses in the analyzed phrase are rejected and are not retained in
the list of word sequences.
[0080] When disambiguating text, the example memory algorithm
identifies whether parts of the text match the previously
identified recurring sequences of words which have been retained in
the list of word sequences. If there is a match, the module assigns
the word senses of the sequence to the matching words in the
text.
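The example memory approach can be sketched as follows, assuming that a fragment is retained only if it occurs at least twice and carries identical senses in every occurrence. The function names, the `min_count` and `max_len` parameters, and the sense labels are hypothetical.

```python
from collections import defaultdict


def build_example_memory(tagged_sentences, min_count=2, max_len=3):
    """Collect word sequences whose senses were identical in every occurrence."""
    seen = defaultdict(list)
    for sentence in tagged_sentences:
        for n in range(2, max_len + 1):
            for i in range(len(sentence) - n + 1):
                fragment = sentence[i:i + n]
                words = tuple(w for w, _ in fragment)
                senses = tuple(s for _, s in fragment)
                seen[words].append(senses)
    # Keep fragments seen often enough with one consistent sense assignment.
    return {
        words: occurrences[0]
        for words, occurrences in seen.items()
        if len(occurrences) >= min_count and len(set(occurrences)) == 1
    }


def apply_example_memory(words, memory):
    """Assign senses to words covered by a remembered sequence, longest first."""
    senses = [None] * len(words)
    for length in sorted({len(key) for key in memory}, reverse=True):
        for i in range(len(words) - length + 1):
            key = tuple(words[i:i + length])
            if key in memory:
                for j, sense in enumerate(memory[key]):
                    if senses[i + j] is None:
                        senses[i + j] = sense
    return senses


# Toy sense-tagged sentences (hypothetical labels).
tagged = [
    [("river", "river/n1"), ("bank", "bank/n2")],
    [("river", "river/n1"), ("bank", "bank/n2")],
    [("bank", "bank/n1"), ("account", "account/n1")],
    [("bank", "bank/n2"), ("account", "account/n1")],
]
memory = build_example_memory(tagged)
result = apply_example_memory(["the", "river", "bank"], memory)
```

In this toy data "river bank" is retained, while "bank account" is rejected because its two occurrences disagree on the sense of "bank".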
[0081] For n-gram component 504C, it utilizes an n-gram algorithm
which operates over a fixed range of words and only attempts to
predict the sense of a single word at a time, in contrast to the
example memory algorithm. The n-grams algorithm predicts word
senses for a head word by matching features immediately surrounding
the word in a very narrow window. Such features include: lemma,
part of speech, coarse or fine word sense, and named entity type.
While the algorithm may examine n words before or following a
target word, typically, n is set at two words. With n being set at
2, the algorithm utilizes a list of word pairs with a correct sense
associated with each word. This list is derived from word pairs
from sense-tagged corpus 404 that occurred multiple times, where
the senses in every occurrence of the word pair were identical.
However, when a sense of at least one word differs, such word pair
senses are rejected and are not retained in the list. When
disambiguating text, the algorithm matches word pairs from the text
being processed with word pairs present in the list maintained by
the algorithm. A match is identified when a word pair is found and
the sense of one of the two words is already present in the text
being processed. When a match is identified, the second word in the
pair is assigned the sense recorded for it in the list.
[0082] The component resource associated with the n-grams algorithm
is trained over sense-tagged corpus 404, and is part of component
resources 402. The n-grams component resource includes a
statistical model which identifies when an n-gram has been seen
sufficiently frequently to become a valid sense predictor. Several
predictors from the knowledge base may be triggered by a pattern of
words. These predictors may reinforce a common sense or may
actually generate multiple possible senses with a given probability
distribution.
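For the case of n set at 2, the matching rule can be sketched as below. This is one reading of the text: a stored word pair fires only when one of its two words already carries the stored sense, and the other word then receives its stored sense. The function name, the pair table, and the sense labels are hypothetical, and the statistical model that decides when a pair is frequent enough is omitted.

```python
def apply_pair_model(words, known_senses, pair_senses):
    """Propagate senses across adjacent word pairs found in the pair list."""
    result = list(known_senses)
    for i in range(len(words) - 1):
        pair = (words[i], words[i + 1])
        if pair not in pair_senses:
            continue
        left_sense, right_sense = pair_senses[pair]
        if result[i] == left_sense and result[i + 1] is None:
            # Left word already disambiguated: assign the right word.
            result[i + 1] = right_sense
        elif result[i + 1] == right_sense and result[i] is None:
            # Right word already disambiguated: assign the left word.
            result[i] = left_sense
    return result


# Hypothetical pair list derived from a sense-tagged corpus.
pair_senses = {("interest", "rate"): ("interest/n2", "rate/n1")}

resolved = apply_pair_model(["interest", "rate"], [None, "rate/n1"], pair_senses)
unresolved = apply_pair_model(["interest", "rate"], [None, None], pair_senses)
```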
[0083] For concept overlapping component 504E, it has a concept
overlapping algorithm which predicts a sense for words by choosing
the senses which match most closely the general topic of the text
segment. In the embodiment, the topic of the text segment is
defined as the set of all non-removed senses for all words in text
segment, and topical similarity is assessed by comparing the topic
of the text segment which is being disambiguated with the topics
extracted from the sense tagged corpus 404 for each word sense, and
choosing the sense of each word with the highest such similarity.
One such method of comparison is the dot-product or cosine metric.
There are many other techniques for making use of topic similarity
to disambiguate text, as will be familiar to those skilled in the
art.
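A minimal sketch of the cosine comparison follows, assuming that a topic is represented as a bag of sense identifiers with counts. The function names, the topic vectors, and the sense labels are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def pick_sense(segment_topic, sense_topics):
    """Choose the candidate sense whose corpus topic best matches the segment."""
    return max(sense_topics, key=lambda s: cosine(segment_topic, sense_topics[s]))


# Topic of the text segment: the senses still possible for its words.
segment_topic = Counter({"island/n1": 1, "holiday/n1": 1, "beach/n1": 1})

# Topics extracted from the sense-tagged corpus for each sense of "Java".
sense_topics = {
    "java/island": Counter({"island/n1": 3, "indonesia/n1": 2, "holiday/n1": 1}),
    "java/programming": Counter({"compiler/n1": 3, "code/n1": 2}),
}

chosen = pick_sense(segment_topic, sense_topics)
```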
[0084] For heuristic word sense component 504F, it has a heuristic
word sense algorithm which predicts a sense of words using
human-generated rules which may use intrinsic language properties
and semantic links in the knowledge base. For example, the senses
"language" in terms of "a spoken human language" and "Indonesian"
are related in the knowledge base by the relation "Indonesian is a
language". A sentence containing both "language" and "Indonesian"
would have the word "language" disambiguated by this component.
Typically, such a relation has been manually verified, thereby
providing a high confidence in accuracy.
[0085] For frequent words component 504G, it has a frequent words
algorithm which identifies the senses of the most frequently
occurring words. In English, the 500 most frequently occurring
words account for almost a third of the words encountered in normal
text. For each of these words, a large number of training examples
are available in sense-tagged corpus 404. Accordingly, it is
possible to train using supervised machine learning methods
specific sense predictors for each word. In the embodiment, the
machine learning method used to train the component is boosting,
and the features used include the words and parts of speech of the
words in immediate proximity to the target word to be
disambiguated. Other features and machine learning techniques may
be used to accomplish the same goal, as will be familiar to those
skilled in the art.
[0086] For dependency component 504H, it has a dependency algorithm
which utilizes a sense prediction model based on the semantic
dependencies in a sentence. By determining that a word is a head
word in a dependency, and optionally the sense of the head word, it
predicts the sense of its dependent words. Similarly, having
determined that a word is a dependent and optionally the sense of
the dependent word, it can predict the sense of the head word. For
example, in the text fragment "drive the car", the head word is
"drive" and the dependent is "car". Knowledge of the sense of "car"
will be sufficient to predict the sense of "drive" as "drive a
vehicle".
[0087] It will be appreciated that other techniques for word sense
disambiguation become available from time to time as the scientific
research in the field progresses, and that such other techniques
could equally be included as new WSD components within the system.
It will be appreciated that a single WSD component may not be
sufficient to disambiguate text with high accuracy. To address this
issue, the embodiment utilizes multiple techniques to disambiguate
text. The techniques described above specify an exemplary
combination which is capable of performing high accuracy word sense
disambiguation. Other techniques may also be used.
[0088] Turning now to linguistic components 502, each component 502
provides a text processing function which can be applied to text to
determine a certain type of linguistic information. This
information is then provided to the WSD components 504 for
disambiguation. The operation of each of the linguistic components
502 will be familiar to one skilled in the art. The linguistic
components 502 include:
[0089] Tokenizer 502A which splits input text into individual words
and symbols. Tokenizer 502A processes the input text as a sequence
of characters and breaks the input text into a series of tokens,
where a token is the smallest sequence of characters that can form
a word.
[0090] Sentence boundary detector 502B which identifies sentence
boundaries in the input text. It uses rules and data (e.g., list of
abbreviations) to identify the possible sentence breaks in the
input text.
[0091] Morpher 502C which identifies a lemma, i.e. a base form, of
a word. In the embodiment, the lemma defines the fine sense and
coarse sense inventories of the word. For example, for the
inflected word "jumping" the morpher identifies its base form
"jump".
[0092] Parser 502D which identifies relationships between the words
in the input text. Parser 502D identifies grammatical structures
and phrases in the input text. The result of this operation is a
parse tree, which is a concept very well known in the field. Some
relationships include "subject of the verb" and "object of the
verb". From the phrases, a list of syntactic and semantic
dependencies can later be extracted. Parser 502D also produces part
of speech tags that are used to update the part of speech
distribution. Parser information is also used to select possible
compounds.
[0093] Dependency extractor 502J uses the parse tree to generate a
list of syntactic and semantic dependencies, which will be familiar
to those skilled in the art. The semantic dependencies are used by
a number of other components to enhance their models. Dependencies
are extracted in the following manner:
[0094] 1. Parser 502D is used to generate a syntactic parse tree,
including syntactic heads for each phrase.
[0095] 2. Using a set of heuristics, as will be familiar to those
skilled in the art, semantic heads are generated for each phrase.
Semantic heads differ from syntactic heads as the semantic rules
give preference to semantically important elements (like nouns and
verbs) while syntactic heads give preference to syntactically
important elements like prepositions.
[0096] 3. Once a semantic head (word or phrase) is identified,
sister words and phrases are considered to form dependencies with
the head.
[0097] Named-entity recogniser 502E identifies known proper nouns
such as "Albert Einstein" or "International Business Machines
Incorporated" and other multi-word proper nouns. Named-entity
recogniser 502E collects tokens that form a named entity into groups
and classifies the group into categories. Such categories include:
a person, location, artefact, as will be familiar to those skilled
in the art. Named-entity categories are determined by a Hidden
Markov Model (HMM) that is trained on parts of the sense-tagged
corpus 404 in which the named entities have been marked. For
example, in the text fragment "Today Coca-Cola announced . . . ",
the HMM will categorize "Coca-Cola" as a company (instead of an
artefact) because of analysis of the surrounding words. Many
techniques exist for named entity recognition as will be familiar
to those skilled in the art.
[0098] Part-of-speech tagger 502F assigns functional roles such as
"noun" and "verb" to the words in the input text. Part of speech
tagger 502F identifies a part of speech, which can be mapped to the
broad parts of speech (noun, verb, adverb, adjective) relevant to
disambiguating between word senses. Part-of-speech tagger 502F
utilizes a trigram-based Hidden Markov Model (HMM) trained
on a portion of sense-tagged corpus 404 which has been annotated
with part of speech information. Many techniques exist for part of
speech tagging, as will be familiar to those skilled in the
art.
[0099] Compound finder 502H finds possible compounds in the input
text. An example of a compound is "coffee table" or "fire truck",
which although sometimes written as two words need to be treated as
a single word for the purposes of word sense disambiguation.
Knowledge base 400 contains a list of compounds, which can be
identified in the text. Each identified compound is given a
probability which marks the likelihood that the compound was
correctly formed. The probability is calculated from the
sense-tagged corpus 404.
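The compound lookup can be illustrated with a simple sketch that scans the token sequence for entries in a compound list, preferring the longest match. The greedy strategy, the function name, and the probability values are assumptions made for illustration; the text specifies only that compounds are listed in knowledge base 400 and that each carries a probability calculated from the corpus.

```python
def find_compounds(tokens, compounds, max_len=3):
    """Group tokens into known compounds, trying the longest candidate first.

    Returns (text, probability) pairs; single tokens carry None.
    """
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in compounds:
                out.append((candidate, compounds[candidate]))
                i += n
                break
        else:
            # No compound starts at this position: keep the single token.
            out.append((tokens[i], None))
            i += 1
    return out


# Hypothetical compound list with likelihoods that the compound is intended.
compounds = {"coffee table": 0.9, "fire truck": 0.95}

grouped = find_compounds(["the", "coffee", "table"], compounds)
```

A fuller implementation would presumably keep both the compound and the separate-word readings alive, weighted by the probability, rather than committing to the compound immediately.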
[0100] Turning now to ICS 500, ICS 500 controls the sequence in
which linguistic components 502 and WSD components 504 are operated
on text, to continually reduce the amount of ambiguity in a text
being processed. It has several specific functions:
[0101] 1. It coordinates extraction of required elements from text
utilizing selected linguistic components 502 and provides such
elements to WSD components 504 through a common interface.
[0102] 2. It seeds an initial set of possible senses for each word
using seeder 500A, which associates an initial set of possible
senses from the knowledge base 400 to each word in the text to
identify to the WSD components 504 which senses they must
disambiguate between, thus providing an initial maximum level of
ambiguity.
[0103] 3. It invokes WSD components 504 according to an algorithm
mix identified by control file 516. Activations of the selected WSD
components 504 then attempt to disambiguate the text, providing
probabilities and confidence scores associated with possible senses
of the words in the text. Preferably, WSD components are invoked in
multiple iterations.
[0104] 4. It merges and integrates output from multiple components
using merging module 500B and ambiguity eliminator 500C. Merger
module 500B combines the outputs of all of the WSD components 504
into a single merged probability distribution and confidence score.
Ambiguity eliminator 500C determines which sense ambiguity
can be removed from the text based upon the output of merger module
500B.
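The merging operation can be sketched under the reading given later in the description: the merged distribution is a sum of component distributions weighted by confidence, and the merged confidence is an average of confidences weighted by those same confidences. The function name and the normalization by total confidence are assumptions; the normalization is chosen so that the result agrees with the description's "Java" example, in which two components with (assumed) equal confidence each contribute 50%.

```python
def merge(results):
    """Merge (distribution, confidence) pairs from several components.

    Each distribution maps sense -> probability for a single word.
    """
    total = sum(conf for _, conf in results)
    if total == 0:
        return {}, 0.0
    merged = {}
    for dist, conf in results:
        for sense, prob in dist.items():
            merged[sense] = merged.get(sense, 0.0) + conf * prob
    # Normalize so the merged probabilities sum to one.
    merged = {sense: weight / total for sense, weight in merged.items()}
    # Confidence-weighted average of the component confidences.
    merged_conf = sum(conf * conf for _, conf in results) / total
    return merged, merged_conf


# Components "A" and "B" disagree about "Java" (confidences are assumed).
component_a = ({"hot beverage": 1.0}, 0.8)
component_b = ({"programming language": 1.0}, 0.8)

merged, merged_conf = merge([component_a, component_b])
```

Results that fall below a confidence or top-probability threshold would be filtered out of `results` before the call, where the control file requests it.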
[0105] More detailed description of the function and design of ICS
500 is provided in subsequent sections describing the operation of
the process of word sense disambiguation.
[0106] The control file optimizer 514 optionally performs a
training procedure which outputs a "recipe" in the form of control
file 516, which contains optimal sequence and parameters for the
WSD components 504 in each iteration, and is used by ICS 500 during
word sense disambiguation. More detailed description of the
function and design of control file optimizer 514 is provided in
a subsequent section describing the generation of an optimized
control file.
[0107] Further detail is now provided on steps performed by the
embodiment to process text. Referring to FIG. 6, a process to
perform disambiguation of text is shown generally by reference 600. The
process may be divided into four steps. The first step is to
generate an optimized control file 602. This step creates a control
file which is used in the step disambiguate text 606. The second
step read text 604 comprises reading in the text to be
disambiguated from a file. The third step disambiguate text 606
consists of disambiguating the text, and is the main step in the
process. The fourth step output disambiguated text 608 consists of
writing the sense-tagged text to a file.
[0108] Referring to FIG. 7, further detail is now provided on the
main processing step, disambiguate text 606.
[0109] Upon receiving a text to disambiguate, ICS 500 processes the
text in the following manner:
[0110] 1. ICS 500 passes the text through tokenizer 502A to
identify the boundaries of the words and separate these from
punctuation symbols that may be present in the text.
[0111] 2. ICS 500 causes the syntactic features in the text to be
identified by passing the text through linguistic components 502.
Such features include: lemma (including compounds), part of speech,
named entities and semantic dependencies. Each feature is generated
with a confidence score and with a probability distribution.
[0112] 3. Processed text is then provided to seeder 500A which uses
lemma and part of speech generated by linguistic components 502 to
identify a list of possible senses in the knowledge base 400 for
each word in the text.
[0113] 4. ICS 500 then applies a set of WSD components 504
independently to the input text, where specific WSD components 504
and a sequence of their execution are specified in control file
516. Each WSD component 504 disambiguates some or all of the words
in the text. For senses that are disambiguated, a probability
distribution and a confidence score are generated by each WSD
component 504.
[0114] 5. ICS 500 then performs a merging operation using merging
module 500B. This module merges the results of all components for
all words to generate a single probability distribution of senses
and associated confidence score for each word. Prior to merging, if
specified in the control file 516, ICS 500 may discard results with
insufficiently high confidence, or for which the probability of the
top result is insufficiently high. The merged probability
distribution is the weighted sum of each remaining probability
distribution, with the weight being provided by the confidence
score. The merged confidence score is a weighted average of
confidence values, with weights provided by the confidence score.
For example, if a WSD component "A" had given "hot beverage" at
100% probability for the sense of the word "Java", and WSD
component "B" had given "programming language" at 100% probability
for the same word with the same confidence score, then the merged
distribution would contain both "hot beverage" and "programming
language" at 50% probability each.
In order to merge the results of WSD components 504 that produce
only coarse senses, the merger can optionally be run twice, once on
the coarse senses and a second time over the group of fine senses
associated with each coarse sense.
[0115] 6. ICS 500 then performs ambiguity reduction using ambiguity
eliminator 500C. The embodiment performs this process based upon
the merged distribution and confidence output by merging module
500B. When a sense in the merged distribution has a very high
probability and high confidence, it is deemed to be the correct
sense and all other senses can be removed. For example, if
a merged result indicated that the disambiguation for "java" was
"coffee" with 98% probability and its confidence score was 90%,
then all other senses would be excluded as being possible, and
"coffee" would be the sole remaining sense. Control file 516 sets
probability and confidence score thresholds for this decision
point. Conversely, when one or more senses have a very low
probability and high confidence score, such senses may be deemed to
be improbable and are removed from the set of senses. Again control
file 516 sets probability and confidence thresholds for this
decision point. This process reduces ambiguity from the input text
by utilizing information provided by WSD components 504, and
accordingly influences which senses are provided to WSD components
504 during subsequent iterations of disambiguation.
[0116] 7. One or more further iterations of steps 4, 5 and 6 may
optionally be performed. It will be appreciated that results of
each subsequent iteration will likely be different than those of
previous iteration(s), as WSD components 504 themselves do not
predict senses which were eliminated after previous iterations. WSD
components 504 make use of the reduced ambiguity as compared to the
previous iteration to produce a result with a more accurate
distribution and/or higher confidence score. Control file 516
identifies which set of WSD components 504 is applied on each
iteration. It will be appreciated that several iterations may be
performed until a sufficient number of words have been
disambiguated or until the number of iterations specified in the
control file 516 have been completed.
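The merging and ambiguity reduction of steps 5 and 6 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function names, the data layout (one sense-to-probability dictionary per component) and the threshold values are assumptions, as is the normalization of the weighted sum by the total confidence, which is chosen so that the "Java" example above yields 50% each.

```python
from collections import defaultdict

def merge(results):
    """Merge per-component results for one word.

    results: list of (distribution, confidence) pairs, where distribution
    maps sense -> probability. Returns (merged_distribution, merged_confidence).
    """
    total_conf = sum(conf for _, conf in results)
    if total_conf == 0:
        return {}, 0.0
    weighted = defaultdict(float)
    for dist, conf in results:
        for sense, p in dist.items():
            weighted[sense] += conf * p  # weight each distribution by confidence
    # Assumption: divide by total confidence so the result sums to 1.
    merged = {s: v / total_conf for s, v in weighted.items()}
    # Confidence-weighted average of the confidence values themselves.
    merged_conf = sum(c * c for _, c in results) / total_conf
    return merged, merged_conf

def reduce_ambiguity(dist, conf, keep_p=0.95, drop_p=0.05, min_conf=0.8):
    """Prune senses using hypothetical thresholds (set by the control file)."""
    if not dist or conf < min_conf:
        return dict(dist)  # not confident enough to remove anything
    top = max(dist, key=dist.get)
    if dist[top] >= keep_p:
        return {top: 1.0}  # sole remaining sense
    kept = {s: p for s, p in dist.items() if p > drop_p}
    if not kept:
        return dict(dist)
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()}  # renormalize (assumption)
```

With these assumed thresholds, a merged result of "coffee" at 98% probability and 90% confidence leaves "coffee" as the only sense, while a low-confidence merged result is passed through unchanged.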
[0117] In the embodiment, the word sense disambiguation process may
involve multiple iterations. Typically, in each iteration, only a
portion of ambiguity can be removed without introducing a large
number of disambiguation errors. Preferably, for each word that any
selected WSD component 504 attempts to disambiguate, the selected
WSD component 504 returns a full probability distribution over
those senses which had not previously been removed. Generally, a
WSD component 504 is not allowed to increase ambiguity of a text by
re-submitting a sense for a word which has previously been
discarded for that word. Also, each WSD component in an iteration
operates independently from the others and interactions between WSD
components 504 occur under the control of ICS 500 or via ambiguity
removed in a previous iteration. In other embodiments, different
degrees of interaction and knowledge of results between WSD
components during an iteration and between iterations may be
provided. It will be appreciated that due to the highly complex and
unpredictable nature of such interactions, systems that include a
high degree of interaction between WSD components 504 explicitly
programmed into the WSD components 504 tend to be too complex to
be built practically. As such, the controlled interaction between WSD
components 504 provided by the structure of the ICS and the
independence of the WSD components 504 is a key advantage of the
embodiment and invention.
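One way to picture the rule that a component may not re-submit a discarded sense is a guard applied to each component's output before merging. The sketch below is an assumption about how such a guard could look; the text only states the rule, and the function name and renormalization step are hypothetical.

```python
def restrict_to_surviving(dist, surviving):
    """Drop senses removed in earlier iterations and renormalize.

    dist: sense -> probability as returned by a component.
    surviving: set of senses still allowed for this word.
    """
    kept = {s: p for s, p in dist.items() if s in surviving}
    total = sum(kept.values())
    if total == 0:
        return {}  # the component offered only discarded senses
    return {s: p / total for s, p in kept.items()}
```

For example, if "language" was eliminated for a word in a previous iteration, a component that still assigns it 10% has that mass redistributed over the remaining senses.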
[0118] The combined action of merger module 500B and ambiguity
eliminator 500C is to post-process the results of several WSD
algorithms 504 to reduce ambiguity in the text. The combined action
of these modules is referred to as the post processing module 512.
It will be appreciated that the use of a merging module 500B and an
ambiguity reducer 500C as described in the embodiment is an
exemplary technique in this particular embodiment only and that
alternative techniques could be devised. For example, post
processing module 512 may utilize a machine learning technique,
such as a neural network, to merge and prune results. In this
algorithm, the probability distributions and confidence scores of
each algorithm are fed into a learning system, which generates a
combined probability and confidence score for each sense.
[0119] In relation to the merger module 500B, other algorithms,
such as voting algorithms and merging of rankings algorithms may be
used.
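As an illustration of the voting alternative mentioned above, the following sketch has each component cast one vote for its most probable sense and returns the sense with the most votes. The text does not specify a particular voting scheme; plurality voting and the function name are assumptions made for the example.

```python
from collections import Counter

def plurality_vote(distributions):
    """Return the sense chosen by the most components, or None if no votes.

    distributions: list of sense -> probability dicts, one per component.
    """
    votes = Counter(max(d, key=d.get) for d in distributions if d)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Unlike the confidence-weighted merge, this scheme ignores how strongly each component prefers its top sense, which is one reason the embodiment's merger may be preferable when good confidence functions are available.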
[0120] Referring to FIG. 8, further details are now provided on
control file optimizer process 514 used to generate an optimized
control file 516 providing maximum disambiguation accuracy. The
process begins with a sense tagged corpus 802. In the embodiment,
this sense tagged corpus is a portion of the sense tagged corpus
404 that has been set aside for the purpose of performing control
file optimizer process 514. Control file optimizer 514 uses the WSD
module 606 to generate a control file 516 that optimizes accuracy
of the WSD module over the sense tagged corpus.
[0121] Control file optimizer 514 requires that optimization
criteria are specified. Thresholds are specified separately for
either the percentage of ambiguity to be removed, or the percentage
accuracy of disambiguation; the control file optimizer then
optimizes the control file to maximize the performance of word
sense disambiguator on one measure given the threshold for the
other. It is also possible to specify a maximum number of
iterations. The number of correct results or the amount of
ambiguity removed is then maximized for each iteration.
After the optimal combination of algorithms and thresholds for a
given accuracy has been determined, the training proceeds to the
next iteration. The target accuracy is lowered at each iteration,
which allows the standard of results to drop gradually as the
number of iterations increases. Multiple sequences of target
accuracy are tested and the sequence producing the best results
over the sense tagged corpus 802 is selected. Preferentially,
accuracy or remaining ambiguity is progressively reduced on each
subsequent iteration. Example iteration accuracy sequences that are
tested are:
[0122] 1. 95%->90%->85%->80%
[0123] 2. 90%->80%
[0124] For a given iteration and target disambiguation accuracy,
the optimal list of algorithms to invoke and the associated
probability and confidence thresholds of results to keep is
identified by executing the following steps:
[0125] 1. Invoke each WSD component 504 individually on
sense-tagged corpus 802 to obtain a set of results for each
component.
[0126] 2. For a set of results of a WSD component 504, search the
space of probability and confidence thresholds to identify thresholds
which maximize performance against the optimization criteria. This
is done through a search of all combinations of probability and
confidence thresholds in the range of 0% to 100% in fixed step
increments, such as 5%.
[0127] 3. Once optimal thresholds for each WSD component 504 are
identified, results of all WSD components 504 are pruned according
to those thresholds and are merged using the merging module 500B as
described earlier.
[0128] 4. Consolidated merged results are then searched to identify
probability and confidence thresholds of merged results that
optimize a number of correct answers with an accuracy equal to or
above the target accuracy for the iteration. This is preferably
performed using the method of step 2.
[0129] 5. Step 4 is repeated for each WSD component 504 that was
merged, with the results of the WSD component 504 of interest
excluded. The probability and confidence thresholds that maximize the
number of correct results of this result set are then identified. The
difference between the maximum number of correct results of this
set and the number obtained in step 4 indicates the
contribution of correct unique answers of the algorithm of
interest. If the contribution of a WSD component 504 is negative,
it identifies this WSD component 504 as having a detrimental
impact on the results. If the contribution is zero, then it
identifies that the WSD component 504 is not contributing new
correct results in the iteration. In either case, the WSD component
504 having the lowest negative contribution is removed from the
list of WSD components 504 to be invoked in subsequent
iterations.
[0130] 6. Step 5 is repeated until a set number of WSD components 504
that have a negative or zero contribution are identified and
removed. The number may be all WSD components 504.
[0131] 7. Steps 2 through 6 are repeated but with the target
accuracy of step 2 modified by a small increment, e.g. 2.5%,
both above and then below the target accuracy of the iteration.
[0132] 8. The combination of WSD components 504 and the associated
probability and confidence thresholds that resulted in the largest
number of correct answers are retained as the solution to a given
iteration. The thresholds for probability and confidence for each
WSD algorithm 504 and the ambiguity reducer 500C are written to the
control file, and the training proceeds to the next iteration and
target disambiguation accuracy.
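The threshold search of step 2 and the leave-one-out contribution of step 5 can be sketched as follows. This is an illustrative outline only: the record layout (probability, confidence, correctness flag), the function names, and the `score` callable standing in for "merge, then search thresholds, then count correct answers" are all assumptions.

```python
from itertools import product

def search_thresholds(results, target_accuracy, step=5):
    """Exhaustive grid search over probability and confidence thresholds.

    results: list of (probability, confidence, is_correct) for the top sense
    of each word. Returns (correct_count, (p_thr, c_thr)); thresholds are
    None if no combination reaches the target accuracy.
    """
    grid = [i / 100 for i in range(0, 101, step)]  # 0% to 100% in fixed steps
    best = (0, None)
    for p_thr, c_thr in product(grid, grid):
        kept = [ok for p, c, ok in results if p >= p_thr and c >= c_thr]
        if not kept:
            continue
        correct = sum(kept)
        if correct / len(kept) >= target_accuracy and correct > best[0]:
            best = (correct, (p_thr, c_thr))
    return best

def contributions(components, score):
    """Leave-one-out contribution of each component.

    score: hypothetical callable returning the maximum number of correct
    merged results obtainable from a list of component names.
    """
    full = score(components)
    return {c: full - score([x for x in components if x != c])
            for c in components}

# Toy data: which test items each (hypothetical) component gets right.
answers = {"priors": {1, 2}, "topic": {2, 3}, "noisy": {2}}
toy_score = lambda comps: len(set().union(*[answers[c] for c in comps]))
contrib = contributions(list(answers), toy_score)
```

In the toy data, "noisy" has a contribution of zero because every item it gets right is already covered by another component, so it would be a candidate for removal under step 5.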
[0133] The control file optimizer 514 can be set to optimize
accuracy given that each word is assigned one and only one sense,
as the above description implies. It will be recognized that for
certain applications or in certain specific instances, it may not
make sense to attempt to assign only one sense to each word, or to
disambiguate all the words.
[0134] The amount of ambiguity present in text prior to any
disambiguation may be considered to be the maximum ambiguity. The
amount of ambiguity present in fully sense-tagged text, for which
each word has been assigned one and only one word sense can be
considered to be the minimum ambiguity. It will be recognized that
for some applications or in certain cases it will be useful to
remove only part of the ambiguity present in the text. This can be
accomplished by allowing a word to have more than one possible
sense, or by not disambiguating certain words, or both of these. In
the embodiment, the percentage of ambiguity removed is defined as
the (number of senses discarded), divided by the (total number of
possible senses minus one). It will further be recognized that, in
general, removing a smaller percentage of ambiguity permits word
sense disambiguator 32 to return more accurate results, given
that word sense disambiguator 32 can specify more than one possible
sense for a word, and where a word is considered correctly
disambiguated if senses specified for the word include the correct
sense of the word.
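The definition of the percentage of ambiguity removed can be written directly as a small function. The handling of a word with only one possible sense (which has no ambiguity to remove) is an assumption, since the text does not address that case.

```python
def ambiguity_removed(total_senses, senses_discarded):
    """Fraction of ambiguity removed for one word.

    Defined in the text as senses discarded / (total possible senses - 1).
    """
    if total_senses <= 1:
        return 0.0  # assumption: an unambiguous word has nothing to remove
    return senses_discarded / (total_senses - 1)
```

A word with five possible senses reduced to a single sense gives 4 / 4, i.e. all ambiguity removed; the same word reduced to three candidate senses gives 2 / 4, i.e. half.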
[0135] Optionally, the control file optimizer 514 can be provided
with separate optimization criteria and thresholds for the
percentage of ambiguity to be removed by the word sense
disambiguator 32 and the accuracy of the disambiguation results of
word sense disambiguator 32. The control file optimizer 514 can be
asked to either a) maximize the amount of ambiguity removed subject
to a minimum threshold of accuracy (for example, remove as much
ambiguity as possible, ensuring that the remaining possible senses
for the words are 95% likely to contain the correct sense), or b)
to maximize disambiguation accuracy subject to a minimum percentage
of ambiguity to remove (for example, maximize accuracy subject to
removing at least 70% of additional senses for each word). This
capability is useful in applications a) because it allows word
sense disambiguator 32 to better fit the real world of natural
language texts, in which words may be truly ambiguous (i.e.
ambiguous to a human) as expressed in a text, and therefore not
possible to fully disambiguate, and b) because it allows
applications making use of word sense disambiguator 32 to opt for
more or less conservative implementations of word sense
disambiguator 32, wherein the precision of the disambiguation is
lower, but fewer correct senses are discarded. This is particularly
valuable, for example in information retrieval applications for
which it is critical that correct information is never discarded
(e.g. due to incorrect disambiguation), even at the expense of
including extraneous information (e.g. due to additional incorrect
senses being present in the disambiguated text).
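The two optimization modes described above amount to maximizing one measure subject to a floor on the other. The sketch below selects among already-evaluated candidate configurations; the dictionary keys, the function name and the example figures are hypothetical.

```python
def pick(candidates, maximize, floor):
    """Choose the candidate that maximizes one measure given a floor on the other.

    candidates: list of dicts with 'accuracy' and 'ambiguity_removed' in [0, 1].
    maximize: 'accuracy' or 'ambiguity_removed'.
    floor: minimum acceptable value of the other measure.
    """
    if maximize not in ("accuracy", "ambiguity_removed"):
        raise ValueError("unknown measure: " + maximize)
    constrained = "accuracy" if maximize == "ambiguity_removed" else "ambiguity_removed"
    feasible = [c for c in candidates if c[constrained] >= floor]
    return max(feasible, key=lambda c: c[maximize]) if feasible else None

# Invented figures for three hypothetical control-file configurations.
cands = [
    {"name": "strict", "accuracy": 0.97, "ambiguity_removed": 0.40},
    {"name": "balanced", "accuracy": 0.95, "ambiguity_removed": 0.65},
    {"name": "aggressive", "accuracy": 0.88, "ambiguity_removed": 0.90},
]
```

Mode a) corresponds to `pick(cands, "ambiguity_removed", 0.95)` and mode b) to `pick(cands, "accuracy", 0.70)`.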
[0136] Optionally, the control file optimizer 514 can be provided
with a maximum number of iterations.
[0137] It will be appreciated that creating accurate confidence
functions is important. A component with a poor confidence
function, even a component with high accuracy, will not contribute
or will contribute less than optimally to the system accuracy. This
occurs in one of two ways:
[0138] 1. If the confidence function tends to frequently give a low
confidence value to a correct result, then merger 500B will
effectively ignore this result, due to the arithmetic of the merger
whereby results are weighted by the confidence score, with the net
effect being as if the component had not given a result at all for
that word. Thus, these correct results will be excluded from
contributing to the system due to the poor confidence function.
[0139] 2. On the other hand, if the confidence function gives a
high confidence value to incorrect results, then the automatic
training procedure will recognize that the algorithm contributes
many incorrect results, and exclude it from being run.
[0140] It will be appreciated that adding an algorithm with a poor
confidence function to the system (for example, one which is overly
optimistic and often produces incorrect results with 100%
confidence) does not severely detrimentally affect the accuracy of
the system, as the control file optimization procedure 514
described above will discount such results and it will not execute
that algorithm in further iterations of disambiguation. This
provides a level of robustness to the system against the inclusion
of poor WSD components.
[0141] It will be apparent to those skilled in the art that the
accuracy of most WSD systems increases with the size of the
training corpus but decreases with an inaccurately tagged training
corpus. The addition of accurately sense-tagged text to the
training corpus will usually increase the effectiveness of WSD
components. In addition, most WSD components 504 require a portion
of the sense-tagged corpus 404 to be set aside for the training of
their confidence function. It will be appreciated that the
effectiveness of the confidence function increases as the amount of
sense-tagged text in the portion of the sense-tagged corpus 404 set
aside for confidence function training increases.
[0142] Sense-tagged corpus 404 can be created manually by human
lexicographers. It will be appreciated that this is a time
consuming and expensive process, and that finding a way to generate
or augment sense-tagged corpus 404 automatically would be of
substantial value.
[0143] Referring to FIG. 9, the embodiment also provides a system
and method for automatically providing a sense-tagged corpus 404 or
for automatically increasing the size of sense-tagged corpus 404
for the training of WSD components 504. There are two processes
illustrated in FIG. 9. The first is the component training process
960. This process uses sense tagged text 404 or untagged text 900
as an input to the WSD component training module 906 in order to
generate improved component resources for the WSD components 504.
The second process is the corpus generation process 950. This
process processes untagged text 900 or partially tagged text 902
through the WSD module 32. Using the confidence function and
probability distributions output by the WSD process 32, senses
which are likely to be incorrectly tagged are then filtered out by
the filter module 904. This partially sense tagged text can then be
added to the partially tagged text 902 or the sense tagged corpus
404. When these two processes, component training process 960 and
corpus generation process 950, are run alternately, the effect is
to improve the accuracy of the WSD module 32 and to increase the
size of the sense-tagged corpus 404.
[0144] As described above, it will be recognized that most conceivable
WSD components 504 require a training process to be performed over
a sense tagged corpus 404 before they can be used to disambiguate
text. For example, priors component 504A requires that the
frequencies of senses be recorded from a sense tagged corpus 404.
These frequencies are stored in the WSD component resources 402. As
described above, the more sense tagged text 404 is available to the
training process, the more accurate each WSD algorithm 504 will be.
The collection of the training processes of all WSD components 504
is collectively referred to in FIG. 9 as the WSD component training
process 960.
[0145] As described above, results of several WSD components 504
are combined to disambiguate previously unseen text. This is a
process known as "bootstrapping".
[0146] With the embodiment, only results with sufficiently high
confidence are added to the training data, utilizing the following
algorithm:
[0147] 1. Train each model of each word sense disambiguation component using
the component training process 960 using available training data
from the sense tagged corpus 404.
[0148] 2. Disambiguate a large quantity of untagged documents 900
using the WSD module 32; preferably a very large quantity of
documents are used from various domains.
[0149] 3. In the filter module 904, discard all results where the
result is ambiguous or where the confidence is below a threshold,
which may be adjusted.
[0150] 4. Add the non-discarded senses to the sense tagged data
404.
[0151] 5. Re-train the set of word sense disambiguation components
using the component training process 960.
[0152] 6. Restart the training over the same documents which are
now in the sense tagged corpus 404 or over a new body of untagged
text 900.
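The bootstrapping steps above can be outlined as a loop that alternates training and filtered tagging. This is a sketch under stated assumptions: `train` and `disambiguate` are hypothetical callables standing in for component training process 960 and WSD module 32, "ambiguous" is taken to mean more than one remaining sense, and the threshold and round count are arbitrary.

```python
def filter_results(tagged, threshold=0.9):
    """Keep only unambiguous results whose confidence meets the threshold.

    tagged: list of (word, distribution, confidence), where distribution
    maps each remaining sense to its probability.
    """
    kept = []
    for word, dist, conf in tagged:
        if conf < threshold or len(dist) != 1:
            continue  # discard low-confidence or still-ambiguous results
        (sense,) = dist
        kept.append((word, sense))
    return kept

def bootstrap(corpus, untagged, train, disambiguate, rounds=3, threshold=0.9):
    """Alternate training and filtered tagging to grow the tagged corpus."""
    for _ in range(rounds):
        model = train(corpus)                    # steps 1 and 5
        results = disambiguate(model, untagged)  # step 2
        new = filter_results(results, threshold) # step 3
        corpus = corpus + [x for x in new if x not in corpus]  # step 4
    return corpus
```

Raising `threshold` admits fewer but cleaner tags per round, which is the trade-off the adjustable threshold in step 3 controls.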
[0153] A key to this process is the use of a probability
distribution and confidence score. In prior art systems, a
confidence score is not available and inaccurate results cannot be
discarded. As a result, the WSD components 504 are less accurate
after retraining on the enlarged sense tagged corpus 404 than they
were before, and such a process is not practically useful. By
setting a high confidence threshold that rejects most incorrect
senses from being added to the sense tagged corpus 404, the
embodiment eliminates this deficiency in the prior art system and
allows the training data to be enlarged with high quality tagged
text. It will be appreciated that this process can run multiple
times, and may create a self-reinforcing loop that increases both
the size of the sense tagged corpus 404 and the accuracy of the WSD
system 32. The quality of the training data extracted (due to the
use of a probability distribution and a confidence score) and the
potentially self-reinforcing nature of the bootstrapping process
are features of the embodiment.
[0154] The embodiment also provides a variant of the above
bootstrapping process to train the system for a specific domain
(e.g., law, health, etc.), utilizing the following variation on the
algorithm:
[0155] 1. A number of documents are disambiguated by a highly
accurate method, such as manually by a skilled human. Use of these
documents provides "seeding resources" to the system, which are
added to the sense tagged corpus 404.
[0156] 2. The word sense disambiguation components are trained
using the WSD component training process 960.
[0157] 3. A large quantity of documents from the domain are
automatically disambiguated and added to the sense tagged corpus
404 using the corpus generation process 950.
[0158] It will be apparent that the embodiment has several
advantages over the prior art. Some include:
[0159] 1. Multiple independent algorithms. The embodiment allows
more components to be incorporated utilizing a simplified interface
through ICS 500. As such, several disambiguation techniques (for
example between 10 and 20) can be combined without the system
becoming too complex to manipulate.
[0160] 2. Confidence functions. In prior art systems, a confidence
score is not available. The confidence score provides several
critical advantages over prior art systems:
[0161] a) Merging together of results of multiple components. The
confidence function allows results from different probabilistic
algorithms to be combined with different weights reflecting the
expected accuracy of the algorithm in a particular situation. Using
the confidence function described above, the system can merge
together decisions of many components to obtain a more likely
sense.
[0162] b) Discarding poor results or word senses for truly
ambiguous words. The confidence function allows potentially
inaccurate results to be discarded, such that the embodiment can opt
not to provide senses for words
for which it has little confidence in its answer. This reflects
better the real world of natural language expression, wherein some
expressions remain ambiguous even when analyzed by a human.
[0163] c) Bootstrapping. The confidence function provides a
likelihood that each answer is correct. This allows only highly
accurate results to be kept and reused as training text for
components and the overall system. Additional training text in turn
further improves the accuracy of the components and the overall
system. This is a highly accurate form of bootstrapping, and offers
a comparable gain in performance to sense-tagging additional
training text using human lexicographers, at a tiny fraction of the
cost. The amount of sense-tagged text that can be generated from
untagged text (for example, the Internet) with this technique is
limited only by available computer capacity. Prior art systems have
performed bootstrapping without a confidence score, but the sense
tags in the text fed to the system are far less accurate than those
provided by a human lexicographer or a confidence-score enabled
system, and the overall performance of the system quickly stagnates
or degrades.
[0164] 3. Iterative disambiguation. The system allows a component
to have multiple passes over the text being disambiguated, which
allows it to use high-accuracy disambiguations (or reductions in
ambiguity) provided by any of the other components, to improve its
accuracy in disambiguating the remaining words. For example, when
faced with the words "cup" and "green" in one sentence, a
particular WSD component 504 may not be able to distinguish between
a "golf" sense for "cup" and the more mundane "drinking vessel". If
another WSD component 504 is able to disambiguate the word "green"
into its "golf green" sense, then the first WSD component 504 may
now be able to correctly disambiguate "cup" into "golf cup". In
this sense, WSD components 504 interact with each other to arrive
at more likely senses.
[0165] 4. Method for automatically tuning WSD module 32. WSD module
32 includes a method for merging an optimal "recipe" of components
and parameter values. This merged set is optimal in the sense that
it provides the parameters which utilise multiple iterations of
multiple components to obtain the maximum possible accuracy.
[0166] 5. Multiple levels of ambiguity. By operating simultaneously
on coarse and fine senses, the embodiment can integrate different
components effectively. For example, several classes of linguistic
components operate by attempting to discern a topical content of
text. These types of components tend to have poor accuracy over
fine senses, since these often respect grammatical rather than
semantic distinctions, but do very well over coarse senses. The WSD
module 32 is capable of merging results between components that
give fine and coarse senses, allowing each component to operate
over the sense granularity most appropriate for that component.
Furthermore, an application that requires only coarse senses can
obtain these from WSD module 32. Due to their coarseness, these
coarse senses will have higher accuracy than the fine senses.
[0167] 6. Use of domain-specific data. If information about the
problem domain is known, the embodiment can be biased to favour
senses which match the problem domain. For example, if it is known
that a particular document falls within the domain of Law, then WSD
module 32 can provide sense distributions to the components which
favour those terms in the legal domain.
[0168] 7. Gradual reduction in ambiguity. It will be appreciated
that prior art systems perform disambiguation by attempting to
choose one single sense for each word in a single iteration, which
amounts to removing all ambiguity at once. This decreases the
accuracy of the disambiguation. The embodiment instead performs
this process gradually, removing some of the ambiguity at each
iteration.
[0169] Optionally, the embodiment uses metadata. For example, the
title of the document can be used to aid in the disambiguation of
the document's text, by allowing the words in the title to carry
disproportionate weight towards the disambiguation.
[0170] Although the invention has been described with reference to
certain specific embodiments, various modifications thereof will be
apparent to those skilled in the art without departing from the
scope of the invention as outlined in the claims appended hereto. A
person skilled in the art would have sufficient knowledge of one
or more of the following disciplines: computer
programming, machine learning and computational linguistics.
* * * * *