U.S. patent number 7,451,125 [Application Number 11/268,203] was granted by the patent office on 2008-11-11 for "System and method for compiling rules created by machine learning program." This patent grant is currently assigned to AT&T Intellectual Property II, L.P. Invention is credited to Srinivas Bangalore.
United States Patent 7,451,125
Bangalore
November 11, 2008
System and method for compiling rules created by machine learning program
Abstract
A system, a method, and a machine-readable medium are provided.
A group of linear rules and associated weights are provided as a
result of machine learning. Each one of the group of linear rules
is partitioned into a respective one of a group of types of rules.
A respective transducer for each of the linear rules is compiled. A
combined finite state transducer is created from a union of the
respective transducers compiled from the linear rules.
Inventors: Bangalore; Srinivas (Morristown, NJ)
Assignee: AT&T Intellectual Property II, L.P. (New York, NY)
Family ID: 36031819
Appl. No.: 11/268,203
Filed: November 7, 2005
Prior Publication Data
Document Identifier: US 20060100971 A1
Publication Date: May 11, 2006
Related U.S. Patent Documents
Application Number: 60/625,993 (provisional)
Filing Date: Nov. 8, 2004
Current U.S. Class: 706/47; 704/255; 704/4; 706/12; 704/254
Current CPC Class: G06N 5/025 (20130101); G06N 20/20 (20190101); G06N 20/00 (20190101); G06N 5/02 (20130101); G06N 7/005 (20130101)
Current International Class: G06F 17/00 (20060101); G06N 5/02 (20060101)
References Cited
Other References
Boulianne, G.; Brousseau, J.; Ouellet, P.; Dumouchel, P., "French large vocabulary recognition with cross-word phonology transducers," Proc. IEEE ICASSP 2000, vol. 3, Jun. 5-9, 2000, pp. 1675-1678. Cited by examiner.
Perez, A.; Torres, M.I.; Casacuberta, F., "Speech Translation with Phrase Based Stochastic Finite-State Transducers," Proc. IEEE ICASSP 2007, vol. 4, Apr. 15-20, 2007, pp. IV-113 to IV-116, DOI: 10.1109/ICASSP.2007.367176. Cited by examiner.
Hori, T.; Willett, D.; Minami, Y., "Language model adaptation using WFST-based speaking-style translation," Proc. IEEE ICASSP 2003, vol. 1, Apr. 6-10, 2003, pp. I-228 to I-231. Cited by examiner.
Dolfing, H.J.G.A.; Hetherington, I.L., "Incremental language models for speech recognition using finite-state transducers," Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '01), Dec. 9-13, 2001, pp. 194-197. Cited by examiner.
Masuda, Y.; Clubb, O.L.; Pang Kin Lai, "A multifaceted Internet language environment for the hearing impaired," Proc. Second Int. Conf. on Cognitive Technology, `Humanizing the Information Age`, Aug. 25-28, 1997, p. 174, DOI: 10.1109/CT.1997.617697. Cited by examiner.
Casacuberta, F.; Federico, M.; Ney, H.; Vidal, E., "Recent efforts in spoken language translation," IEEE Signal Processing Magazine, vol. 25, no. 3, May 2008, pp. 80-88, DOI: 10.1109/MSP.2008.917989. Cited by examiner.
2005 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), vol. 3, Mar. 18-23, 2005, DOI: 10.1109/ICASSP.2005.1415622. Cited by examiner.
Fromm, P.; Drews, P., "Natural language processing for dynamic environments," Proc. 24th Annual Conf. of the IEEE Industrial Electronics Society (IECON '98), vol. 4, Aug. 31-Sep. 4, 1998, pp. 2018-2021, DOI: 10.1109/IECON.1998.724028. Cited by examiner.
Jimenez, V.M.; Castellanos, A.; Vidal, E., "Some results with a trainable speech translation and understanding system," Proc. IEEE ICASSP-95, vol. 1, May 9-12, 1995, pp. 113-116, DOI: 10.1109/ICASSP.1995.479286. Cited by examiner.
Szarvas, M.; Furui, S., "Finite-state transducer based modeling of morphosyntax with applications to Hungarian LVCSR," Proc. IEEE ICASSP 2003, vol. 1, Apr. 6-10, 2003, pp. I-368 to I-371. Cited by examiner.
Xiaolong Mou; Zue, V., "Sub-lexical modelling using a finite state transducer framework," Proc. IEEE ICASSP 2001, vol. 1, May 7-11, 2001, pp. 573-576, DOI: 10.1109/ICASSP.2001.940896. Cited by examiner.
Yi-Cheng Pan; Chia-Hsing Yu; Lin-Shan Lee, "Large vocabulary continuous Mandarin speech recognition using finite state machine," Proc. 2004 Int. Symposium on Chinese Spoken Language Processing, Dec. 15-18, 2004, pp. 5-8, DOI: 10.1109/CHINSL.2004.1409572. Cited by examiner.
Caseiro, D.; Trancoso, I., "A tail-sharing WFST composition algorithm for large vocabulary speech recognition," Proc. IEEE ICASSP 2003, vol. 1, Apr. 6-10, 2003, pp. I-357 to I-359. Cited by examiner.
Caseiro, D.; Trancoso, I., "Transducer composition for `on-the-fly` lexicon and language model integration," Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '01), Dec. 9-13, 2001, pp. 393-396. Cited by examiner.
Caskey, S.P.; Story, E.; Pieraccini, R., "Interactive grammar inference with finite state transducers," Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '03), Nov. 30-Dec. 3, 2003, pp. 572-576, DOI: 10.1109/ASRU.2003.1318503. Cited by examiner.
Mohri, M.; Riley, M.; Hindle, D.; Ljolje, A.; Pereira, F., "Full expansion of context-dependent networks in large vocabulary speech recognition," Proc. IEEE ICASSP 1998, vol. 2, May 12-15, 1998, pp. 665-668, DOI: 10.1109/ICASSP.1998.675352. Cited by examiner.
Sproat, R., "Multilingual text analysis for text-to-speech synthesis," Proc. Fourth Int. Conf. on Spoken Language Processing (ICSLP 96), vol. 3, Oct. 3-6, 1996, pp. 1365-1368, DOI: 10.1109/ICSLP.1996.607867. Cited by examiner.
Olsen, J.; Yang Cao; Guohong Ding; Xinxing Yang, "A decoder for large vocabulary continuous short message dictation on embedded devices," Proc. IEEE ICASSP 2008, Mar. 31-Apr. 4, 2008, pp. 4337-4340, DOI: 10.1109/ICASSP.2008.4518615. Cited by examiner.
Bowen Zhou; Chen, S.F.; Yuqing Gao, "Constrained Phrase-based Translation Using Weighted Finite State Transducer," Proc. IEEE ICASSP 2005, vol. 1, Mar. 18-23, 2005, pp. 1017-1020, DOI: 10.1109/ICASSP.2005.1415289. Cited by examiner.
Mumolo, E.; Nolich, M.; Vercelli, G., "Pro-active service robots in a health care framework: vocal interaction using natural language and prosody," Proc. 10th IEEE Int. Workshop on Robot and Human Interactive Communication, Sep. 18-21, 2001, pp. 606-611, DOI: 10.1109/ROMAN.2001.981971. Cited by examiner.
Primary Examiner: Holmes; Michael B
Parent Case Text
This application claims the benefit of U.S. Provisional Patent
Application 60/625,993, filed in the U.S. Patent and Trademark
Office on Nov. 8, 2004, and hereby incorporated by reference herein
in its entirety.
Claims
I claim:
1. A method for compiling a plurality of linear rules created by machine learning for use in natural language processing, the method comprising: providing a plurality of linear rules and associated weights as a result of the machine learning; partitioning each of the plurality of linear rules into a respective one of a plurality of types of rules; compiling a respective transducer for each of the types of rules with the associated partitioning from the plurality of linear rules; creating a combined finite state transducer from a union of the respective transducers compiled from the plurality of linear rules; and processing natural language speech using the combined finite state transducer.
2. The method of claim 1, wherein the machine learning comprises
assigning weights to the plurality of linear rules based on using a
boosting algorithm.
3. The method of claim 1, wherein the partitioned plurality of
linear rules comprise weighted rewrite rules.
4. The method of claim 1, wherein the plurality of types of rules
comprise: a first rule type for testing a feature of a word or a
sentence, a second rule type for testing a feature of a left
context, and a third rule type for testing a feature of a right
context.
5. The method of claim 1, further comprising: applying an input to
the combined finite state transducer; searching for a best path for
the input with respect to the finite state transducer; and
determining a best classification result based on the best
path.
6. The method of claim 1, wherein creating a combined finite state
transducer from a union of the respective transducers for the
plurality of linear rules further comprises: adding weights of
paths having both a same input label and a same output label.
7. The method of claim 1, wherein compiling a respective transducer
for each of the plurality of linear rules comprises: compiling each
of the respective transducers from a set of five transducers.
8. A machine-readable medium having instructions recorded therein
for at least one processor to process natural language, the
machine-readable medium comprising: instructions for providing a
plurality of linear rules and associated weights as a result of
machine learning; instructions for partitioning each of the
plurality of linear rules into a respective one of a plurality of
types of rules; instructions for compiling a respective transducer
for each of the types of rules with the associated partitioning
from the plurality of linear rules; instructions for creating a
combined finite state transducer from a union of the respective
transducers compiled from the plurality of linear rules; and
instructions for processing natural language speech using the
combined finite state transducer.
9. The machine-readable medium of claim 8, wherein the machine
learning comprises assigning weights to the plurality of linear
rules based on using a boosting algorithm.
10. The machine-readable medium of claim 8, wherein the partitioned
plurality of linear rules comprise weighted rewrite rules.
11. The machine-readable medium of claim 8, wherein the plurality
of types of rules comprise: a first rule type for testing a feature
of a word or a sentence, a second rule type for testing a feature
of a left context, and a third rule type for testing a feature of a
right context.
12. The machine-readable medium of claim 8, further comprising:
instructions for applying an input to the combined finite state
transducer; instructions for searching for a best path for the
input with respect to the finite state transducer; and instructions
for determining a best classification result based on the best
path.
13. The machine-readable medium of claim 8, wherein the
instructions for creating a combined finite state transducer from a
union of the respective transducers for the plurality of linear
rules further comprise: instructions for adding weights of paths
having both a same input label and a same output label.
14. The machine-readable medium of claim 8, wherein the
instructions for compiling a respective transducer for each of the
plurality of linear rules further comprise: instructions for
compiling each of the respective transducers from a set of five
transducers.
15. A system for compiling a plurality of linear rules created by
machine learning, the plurality of linear rules for use in natural
language processing, the system comprising: at least one processor;
a memory; a bus to permit communications between the at least one
processor and the memory, wherein: the system is configured to:
provide a plurality of linear rules and associated weights as a
result of the machine learning, partition each of the plurality of
linear rules into a respective one of a plurality of types of
rules, compile a respective transducer for each of the types of
rules with the associated partitioning from the plurality of linear
rules; create a combined finite state transducer from a union of
the respective transducers compiled from the plurality of linear
rules; and process natural language speech using the combined
finite state transducer.
16. The system of claim 15, wherein the machine learning is
configured to assign weights to the plurality of linear rules based
on using a boosting algorithm.
17. The system of claim 15, wherein the partitioned plurality of
linear rules comprise weighted rewrite rules.
18. The system of claim 15, wherein the plurality of types of rules
comprise: a first rule type for testing a feature of a word or a
sentence, a second rule type for testing a feature of a left
context, and a third rule type for testing a feature of a right
context.
19. The system of claim 15, wherein the system is further
configured to: apply an input to the combined finite state
transducer; search for a best path for the input with respect to
the finite state transducer; and determine a best classification
result based on the best path.
20. The system of claim 15, wherein the system being configured to
compile a respective transducer for each of the plurality of linear
rules further comprises the system being configured to: compile
each of the respective transducers from a set of five
transducers.
21. A system for compiling a plurality of linear rules created by
machine learning for use in natural language processing, the
system comprising: means for receiving a plurality of linear rules
and associated weights as a result of the machine learning; means
for partitioning each of the plurality of linear rules into a
respective one of a plurality of types of rules; means for
compiling a respective transducer for each of the types of rules
with the associated partitioning from the plurality of linear
rules; means for creating a combined finite state transducer from a
union of the respective transducers compiled from the plurality of
linear rules; and means for processing natural language speech
using the combined finite state transducer.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a system and method for compiling
rules created by machine learning programs and, more particularly,
to a method for compiling rules into weighted finite-state
transducers.
2. Introduction
Many problems in Natural Language Processing (NLP) can be modeled
as classification tasks, either at the word or at the sentence
level. For example, part-of-speech tagging, named-entity identification, supertagging (associating each word with a label that represents syntactic information of the word given its context in a sentence), and word sense disambiguation are tasks that have been modeled as classification problems at the word level. In
addition, there are problems that classify an entire sentence or
document into one of a set of categories. These problems are
loosely characterized as semantic classification and have been used
in many practical applications including call routing and text
classification.
Most of these problems have been addressed in isolation assuming
unambiguous (one-best) input. Typically, however, in NLP
applications, modules are chained together with each of the modules
introducing some amount of error. In order to alleviate the errors
introduced by a module, it is typical for the module to provide
multiple weighted solutions (ideally as a packed representation)
that serve as input to the next module. For example, a speech
recognizer provides a lattice of possible recognition outputs that
is to be annotated with part-of-speech and named-entities.
SUMMARY OF THE INVENTION
Additional features and advantages of the invention will be set
forth in the description which follows, and in part will be obvious
from the description, or may be learned by practice of the
invention. The features and advantages of the invention may be
realized and obtained by means of the instruments and combinations
particularly pointed out in the appended claims. These and other
features of the present invention will become more fully apparent
from the following description and appended claims, or may be
learned by the practice of the invention as set forth herein.
In a first aspect of the invention, a method for compiling a group
of linear rules created by machine learning is provided. A group of
linear rules and associated weights are provided as a result of
machine learning. Each one of the group of linear rules is
partitioned into a respective one of a group of types of rules. A
respective transducer for each of the linear rules is compiled. A
combined finite state transducer is created from a union of the
respective transducers compiled from the linear rules.
In a second aspect of the invention, a machine-readable medium
having instructions recorded therein for at least one processor is
provided. The machine-readable medium includes instructions for
providing a group of linear rules and associated weights as a
result of machine learning, instructions for partitioning each one
of the group of linear rules into a respective one of a plurality
of types of rules, instructions for compiling a respective
transducer for each one of the group of linear rules, and
instructions for creating a combined finite state transducer from a
union of the respective transducers compiled from the group of
linear rules.
In a third aspect of the invention, a system for compiling a group
of linear rules created by machine learning is provided. The system
includes at least one processor, a memory, and a bus to permit
communications between the at least one processor and the memory.
The system is configured to provide a group of linear rules and
associated weights as a result of the machine learning, partition
each one of the group of linear rules into a respective one of a
group of types of rules, compile a respective transducer for each
one of the group of linear rules, and create a combined finite
state transducer from a union of the respective transducers
compiled from the group of linear rules.
BRIEF DESCRIPTION OF THE DRAWINGS
In order to describe the manner in which the above-recited aspects
and other advantages and features of the invention can be obtained,
a more particular description of the invention briefly described
above will be rendered by reference to specific embodiments thereof
which are illustrated in the appended drawings. Understanding that
these drawings depict only typical embodiments of the invention and
are not therefore to be considered to be limiting of its scope, the
invention will be described and explained with additional
specificity and detail through the use of the accompanying drawings
in which:
FIG. 1 illustrates an exemplary system in which implementations
consistent with the principles of the invention may operate;
and
FIG. 2 is a flowchart that illustrates an exemplary method that may
be used in implementations consistent with principles of the
invention.
DETAILED DESCRIPTION OF THE INVENTION
Various embodiments of the invention are discussed in detail below.
While specific implementations are discussed, it should be
understood that this is done for illustration purposes only. A
person skilled in the relevant art will recognize that other
components and configurations may be used without departing from the
spirit and scope of the invention.
Introduction
U.S. Pat. No. 5,819,247, which is hereby incorporated by reference
herein in its entirety, discloses an apparatus and method for
machine learning of a group of hypotheses used in a classifier
component in, for example, NLP systems, as well as other systems.
Machine learning techniques may be used to generate weak hypotheses for a training set of examples, such as, for example, a database including training speech. The resulting hypotheses may then be evaluated against the training examples. The evaluation results may be used to give a weight to each weak hypothesis such that the probability is increased that the examples used to generate the next weak hypothesis are ones that were incorrectly classified by the previous weak hypothesis. A combination of the weak hypotheses according to their weights results in a strong hypothesis.
U.S. Pat. No. 6,453,307, which is hereby incorporated by reference
herein in its entirety, discloses several boosting algorithms,
which may be used in implementations consistent with the principles
of the invention.
Finite-state models have been extensively applied to many aspects of language processing, including speech recognition, phonology, morphology, chunking, parsing, and machine translation. Finite-state
models are attractive mechanisms for language processing since they
(a) provide an efficient data structure for representing weighted
ambiguous hypotheses, (b) are generally effective for decoding, and
(c) are associated with a calculus for composing models which
allows for straightforward integration of constraints from various
levels of speech and language processing.
U.S. Pat. No. 6,032,111, which is hereby incorporated by reference
herein in its entirety, discloses a system and method for compiling
rules, such as weighted context-dependent rewrite rules, into
weighted finite-state transducers. A compiling method is disclosed
in which a weighted context dependent rewrite rule may be compiled
by using a composition of five simple finite-state transducers.
Implementations consistent with the principles of the present
invention utilize results of machine learning techniques, such as,
for example, those disclosed in U.S. Pat. No. 5,819,247 to create a
weighted finite-state transducer.
Exemplary System
FIG. 1 illustrates a block diagram of an exemplary processing
device 100 which may be used to implement systems and methods
consistent with the principles of the invention. Processing device
100 may include a bus 110, a processor 120, a memory 130, a read
only memory (ROM) 140, a storage device 150, an input device 160,
an output device 170, and a communication interface 180. Bus 110
may permit communication among the components of processing device
100.
Processor 120 may include at least one conventional processor or
microprocessor that interprets and executes instructions. Memory
130 may be a random access memory (RAM) or another type of dynamic
storage device that stores information and instructions for
execution by processor 120. Memory 130 may also store temporary
variables or other intermediate information used during execution
of instructions by processor 120. ROM 140 may include a
conventional ROM device or another type of static storage device
that stores static information and instructions for processor 120.
Storage device 150 may include any type of media, such as, for example, magnetic or optical recording media and a corresponding drive. In some implementations consistent with the principles of
the invention, storage device 150 may store and retrieve data
according to a database management system.
Input device 160 may include one or more conventional mechanisms that permit a user to input information to processing device 100, such as a keyboard, a mouse, a pen, a voice recognition device, a microphone,
a headset, etc. Output device 170 may include one or more
conventional mechanisms that output information to the user,
including a display, a printer, one or more speakers, a headset, or
a medium, such as a memory, or a magnetic or optical disk and a
corresponding disk drive. Communication interface 180 may include
any transceiver-like mechanism that enables processing device 100
to communicate via a network. For example, communication interface
180 may include a modem, or an Ethernet interface for communicating
via a local area network (LAN). Alternatively, communication
interface 180 may include other mechanisms for communicating with
other devices and/or systems via wired, wireless or optical
connections.
Processing device 100 may perform such functions in response to
processor 120 executing sequences of instructions contained in a
computer-readable medium, such as, for example, memory 130, a
magnetic disk, or an optical disk. Such instructions may be read
into memory 130 from another computer-readable medium, such as
storage device 150, or from a separate device via communication
interface 180.
Processing device 100 may be, for example, a personal computer
(PC), or any other type of processing device capable of creating
and sending messages. In alternative implementations, such as, for
example, a distributed processing implementation, a group of
processing devices 100 may communicate with one another via a
network such that various processors may perform operations
pertaining to different aspects of the particular
implementation.
Classification
In general, tagging problems may be characterized as search problems formulated as shown in Equation (1). $\Sigma$ is used to represent an input vocabulary, $\gamma$ is used to represent a vocabulary of $n$ tags, an $N$-word input sequence is represented as $W$ ($\in \Sigma^+$), and a tag sequence is represented as $T$ ($\in \gamma^+$). $T^*$, the most likely tag sequence out of the possible tag sequences $T$ that can be associated with $W$, is of particular interest:

$$T^* = \arg\max_{T} P(T \mid W) \qquad (1)$$
Following techniques of Hidden Markov Models (HMMs) applied to speech recognition, these tagging problems have been previously modeled indirectly through the transformation of Bayes rule, as in Equation (2). The problem may then be approximated for sequence classification by a $k$-th order Markov model, as shown in Equation (3):

$$T^* = \arg\max_{T} P(W \mid T)\,P(T) \qquad (2)$$

$$T^* \approx \arg\max_{T} \prod_{i=1}^{N} P(w_i \mid t_i)\,P(t_i \mid t_{i-1}, \ldots, t_{i-k}) \qquad (3)$$
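For concreteness, the following minimal Python sketch scores one candidate tag sequence under the approximation of Equation (3) with $k = 1$ (a bigram tag model). The probability tables, the `<s>` start tag, and the smoothing floor are illustrative assumptions, not values from the patent.

```python
import math

# Hypothetical emission probabilities P(w_i | t_i).
emission = {("time", "NOUN"): 0.1, ("flies", "VERB"): 0.2, ("flies", "NOUN"): 0.05}
# Hypothetical transition probabilities P(t_i | t_{i-1}).
transition = {("<s>", "NOUN"): 0.5, ("NOUN", "VERB"): 0.3, ("NOUN", "NOUN"): 0.2}

def log_score(words, tags):
    """Log of the product in Equation (3) with k = 1 (bigram tag model)."""
    score = 0.0
    prev = "<s>"  # assumed sentence-start tag
    for w, t in zip(words, tags):
        score += math.log(emission.get((w, t), 1e-9))      # P(w_i | t_i)
        score += math.log(transition.get((prev, t), 1e-9)) # P(t_i | t_{i-1})
        prev = t
    return score

# T* would be the argmax of log_score over all candidate tag sequences.
print(log_score(["time", "flies"], ["NOUN", "VERB"]))
```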
Although the HMM approach to tagging may easily be represented as a weighted finite-state transducer (WFST), it has a drawback: the use of large contexts and richer features results in data sparseness, leading to unreliable estimation of the parameters of the model.
An alternate approach to arriving at $T^*$ is to model Equation (1) directly. There are many examples in the recent literature. The general framework for these approaches is to learn a model from pairs of associations of the form $(x_i, y_i)$, where $x_i$ is a feature representation of $W$ and $y_i$ ($\in \gamma$) is one of the members of the tag set. Although these approaches have been more effective than HMMs, there have not been many attempts to represent these models as WFSTs.
Boosting
U.S. Pat. No. 5,819,247 discloses a machine learning tool which is based on the boosting family of algorithms. The basic idea of boosting is to build a highly accurate classifier by combining many "weak" or "simple" base learners, each one of which may only be moderately accurate. A weak learner, or rule, $b$ is a triple $(p, \vec{\alpha}, \vec{\beta})$, which may test a predicate $p$ of the input $x$ and may assign a weight $\alpha_i$ ($i = 1, \ldots, n$) for each member $y_i$ of $\gamma$ if $p$ is true in $x$, and may assign a weight $\beta_i$ otherwise. It is assumed that a pool of such weak learners $H = \{b\}$ may be easily constructed.
From the pool of weak learners, the selection of a weak learner to be combined may be performed iteratively. At each iteration $t$, a weak learner $b_t$ may be selected that minimizes a prediction error loss function on a training corpus which takes into account the weight $w_t$ assigned to each training example. Intuitively, the weights may encode how important it is that $b_t$ correctly classify each of the training examples. Generally, the examples that were most often misclassified by the preceding base classifiers may be given the most weight so as to force the base learner to focus on the "hardest" examples. One machine learning tool, known as Boostexter, produces confidence-rated classifiers $b_t$ that may output a real number $h_t(x, y)$ whose sign ($-$ or $+$) may be interpreted as a prediction, and whose magnitude $|h_t(x, y)|$ may be a measure of "confidence." The iterative algorithm for combining weak learners may stop after a prespecified number of iterations or when training set accuracy saturates.
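As an illustration of this selection-and-reweighting loop, the following is a minimal sketch of a discrete AdaBoost-style procedure over binary (+1/-1) labels. The predicate pool format and the weight-update rule are simplifying assumptions for illustration; this is not the Boostexter implementation.

```python
import math

def boost(examples, labels, pool, iterations=10):
    """Iteratively select weak learners b_t minimizing weighted error,
    then reweight so misclassified ("hardest") examples gain weight."""
    n = len(examples)
    weights = [1.0 / n] * n
    selected = []  # list of (predicate, confidence alpha_t) pairs
    for _ in range(iterations):
        # Pick the weak learner with the smallest weighted training error.
        best, best_err = None, float("inf")
        for p in pool:
            err = sum(w for x, y, w in zip(examples, labels, weights)
                      if (1 if p(x) else -1) != y)
            if err < best_err:
                best, best_err = p, err
        best_err = min(max(best_err, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - best_err) / best_err)  # confidence
        selected.append((best, alpha))
        # Increase weight on examples b_t got wrong, decrease on correct ones.
        for i, (x, y) in enumerate(zip(examples, labels)):
            pred = 1 if best(x) else -1
            weights[i] *= math.exp(-alpha * y * pred)
        total = sum(weights)
        weights = [w / total for w in weights]
    return selected
```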
In the case of text classification applications, the set of possible weak learners may be instantiated from simple n-grams of the input text ($W$). Thus, if $x_n$ is a function that produces all n-grams up to length $n$ of its argument, then the set of predicates for the weak learners is $P = x_n(W)$. For word-level classification problems, which take into account a left and right context, the set of weak learners created from the word features may be extended with those created from the left and right context features. Thus, features of the left context ($\Phi_L^i$), features of the right context ($\Phi_R^i$), and features of the word itself ($\Phi_{w_i}^i$) constitute the features at position $i$. The predicates for the pool of weak learners may be created from this set of features and are typically n-grams on the feature representations. Thus, the set of predicates resulting from the word-level features may be $H_W = \cup_i\, x_n(\Phi_{w_i}^i)$, from the left context features $H_L = \cup_i\, x_n(\Phi_L^i)$, and from the right context features $H_R = \cup_i\, x_n(\Phi_R^i)$. The set of predicates for the weak learners for word-level classification problems may be $H = H_W \cup H_L \cup H_R$.
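One way to realize this predicate pool construction is sketched below; the ("w", "L", "R") feature markers, the n-gram order, and the context window size are illustrative assumptions.

```python
def ngrams(tokens, n):
    """All n-grams up to length n of a feature sequence (the x_n function)."""
    return {tuple(tokens[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(tokens) - k + 1)}

def predicate_pool(words, n=2, window=2):
    """H = H_W | H_L | H_R: n-gram predicates over word, left-context,
    and right-context features at every position i."""
    pool = set()
    for i in range(len(words)):
        word_feats = [("w", words[i])]                                 # Phi_w
        left_feats = [("L", w) for w in words[max(0, i - window):i]]   # Phi_L
        right_feats = [("R", w) for w in words[i + 1:i + 1 + window]]  # Phi_R
        for feats in (word_feats, left_feats, right_feats):
            pool |= ngrams(feats, n)
    return pool

print(sorted(predicate_pool(["how", "may", "i", "help"]))[:5])
```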
Training may employ a machine learning technique such as, for example, a boosting algorithm, which may provide a set of selected rules $\{b_1, b_2, \ldots, b_N\}$ ($\subset H$). The output of the final classifier may be

$$F(x, y) = \sum_{i} h_i(x, y),$$

i.e., the sum of the confidences of all classifiers $b_i$. The real-valued predictions of the final classifier $F$ may be converted into probabilities by a logistic function transform; that is,

$$P(y \mid x) = \frac{e^{F(x, y)}}{\sum_{y' \in \gamma} e^{F(x, y')}} \qquad (4)$$
Thus the most likely tag sequence $T^*$ may be determined as in Equation (5), where $P(t_i \mid \Phi_L^i, \Phi_R^i, \Phi_{w_i}^i)$ may be computed using Equation (4):

$$T^* = \arg\max_{T} \prod_{i} P(t_i \mid \Phi_L^i, \Phi_R^i, \Phi_{w_i}^i) \qquad (5)$$
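The following sketch illustrates Equations (4) and (5) using the weak-learner triple $(p, \vec{\alpha}, \vec{\beta})$ defined above, with the weight vectors represented as per-tag dictionaries; this encoding is an assumption made for illustration.

```python
import math

def tag_probabilities(x, rules, tags):
    """P(y | x) per Equation (4). Each rule is a triple (p, alpha, beta),
    where alpha and beta map each tag y to a weight."""
    F = {y: 0.0 for y in tags}  # F(x, y): summed confidences
    for p, alpha, beta in rules:
        contribution = alpha if p(x) else beta
        for y in tags:
            F[y] += contribution[y]
    z = sum(math.exp(s) for s in F.values())  # logistic normalization
    return {y: math.exp(s) / z for y, s in F.items()}

def best_tag_sequence(positions, rules, tags):
    """Equation (5): choose the most probable tag at each position i."""
    result = []
    for x in positions:
        probs = tag_probabilities(x, rules, tags)
        result.append(max(probs, key=probs.get))
    return result
```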
Previously, the use of boosted rule sets was restricted to cases where the test input was unambiguous, such as strings or words (not word graphs). By compiling these rule sets into WFSTs, their applicability may be extended to packed representations of ambiguous input, such as word graphs.
Compilation
Weak learners selected at the end of the training process may be partitioned into one of three types based on the features that the learners tested:

$b_w$: tests features of a word (or, alternatively, a sentence)
$b_L$: tests features of the left context
$b_R$: tests features of the right context
(Weighted) context-dependent rewrite rules have the general form

$$\phi \rightarrow \psi \mid \gamma \;\_\; \delta \qquad (6)$$

where $\phi$, $\psi$, $\gamma$, and $\delta$ are regular expressions on the alphabet of the rules. The interpretation of these rules is as follows: rewrite $\phi$ by $\psi$ when it is preceded by $\gamma$ and followed by $\delta$. Furthermore, $\psi$ may be extended to a rational power series, which is a weighted regular expression in which the weights may encode preferences over the paths in $\psi$.
Each weak learner may be viewed as a set of weighted rewrite rules mapping the input word (or sentence) into each member $y_i$ ($\in \gamma$) with a weight $\alpha_i$ when the predicate of the weak learner is true, and with weight $\beta_i$ when the predicate of the weak learner is false. An exemplary translation between the three types of weak learners and the weighted context-dependency rules is shown in Table 1.
TABLE 1. Translation of three types of weak learners into weighted context-dependency rules

Type $h_W$:
  Weak learner: if WORD == $w$ then $y_i : \alpha_i$, else $y_i : \beta_i$
  Rules: $w \rightarrow \alpha_1 y_1 + \alpha_2 y_2 + \ldots + \alpha_n y_n \mid \_$
         $(\Sigma - w) \rightarrow \beta_1 y_1 + \beta_2 y_2 + \ldots + \beta_n y_n \mid \_$

Type $h_L$:
  Weak learner: if LeftContext == $w$ then $y_i : \alpha_i$, else $y_i : \beta_i$
  Rules: $\Sigma \rightarrow \alpha_1 y_1 + \alpha_2 y_2 + \ldots + \alpha_n y_n \mid w\ \_$
         $\Sigma \rightarrow \beta_1 y_1 + \beta_2 y_2 + \ldots + \beta_n y_n \mid (\Sigma - w)\ \_$

Type $h_R$:
  Weak learner: if RightContext == $w$ then $y_i : \alpha_i$, else $y_i : \beta_i$
  Rules: $\Sigma \rightarrow \alpha_1 y_1 + \alpha_2 y_2 + \ldots + \alpha_n y_n \mid \_\ w$
         $\Sigma \rightarrow \beta_1 y_1 + \beta_2 y_2 + \ldots + \beta_n y_n \mid \_\ (\Sigma - w)$
These rules may apply left to right on an input and may not repeatedly apply at the same point in an input, because the output vocabulary $\gamma$ would typically be disjoint from the input vocabulary $\Sigma$.
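To make the Table 1 translation concrete, the sketch below maps a weak-learner description to its pair of weighted context-dependency rules. The RewriteRule container and the "<any>" label standing for $\Sigma$ are illustrative conventions, not the patent's representation.

```python
from dataclasses import dataclass

@dataclass
class RewriteRule:
    focus: str     # phi: what is rewritten (a word, or SIGMA for any word)
    outputs: dict  # psi as a weighted sum: tag y_i -> weight
    left: str      # gamma: required left context ("" if unconstrained)
    right: str     # delta: required right context ("" if unconstrained)

SIGMA = "<any>"  # stands for the full input vocabulary

def translate(kind, w, alpha, beta):
    """Map a weak learner (kind in {'WORD', 'LEFT', 'RIGHT'}, word w,
    per-tag weight dicts alpha/beta) to its two rules from Table 1."""
    if kind == "WORD":
        return [RewriteRule(w, alpha, "", ""),
                RewriteRule(f"{SIGMA}-{w}", beta, "", "")]
    if kind == "LEFT":
        return [RewriteRule(SIGMA, alpha, w, ""),
                RewriteRule(SIGMA, beta, f"{SIGMA}-{w}", "")]
    if kind == "RIGHT":
        return [RewriteRule(SIGMA, alpha, "", w),
                RewriteRule(SIGMA, beta, "", f"{SIGMA}-{w}")]
    raise ValueError(f"unknown weak learner type: {kind}")
```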
A technique described in U.S. Pat. No. 6,032,111 may be used to compile each of the weighted context-dependency rules into a WFST. The compilation may be accomplished by the introduction of context symbols, which are used as markers to identify locations for rewrites of $\phi$ with $\psi$. After the rewrites, the markers may be deleted. The compilation process may be represented as a composition of five transducers.
The WFSTs resulting from the compilation of each selected weak learner ($\lambda_i$) may be unioned to create the WFST to be used for decoding. The weights of paths with both the same input and output labels may be added during the union operation:

$$\Lambda = \bigcup_i \lambda_i \qquad (7)$$
Due to the difference in the nature of the learning algorithm,
compiling decision trees results in a composition of WFSTs
representing rules on the path from the root to a leaf node, while
compiling boosted rules results in a union of WFSTs, which is
expected to result in smaller transducers. We define rules that may
be compiled into a union of WFSTs as linear rules.
In order to apply the combined WFST for decoding, the input, represented as a WFST ($\lambda_x$), may be composed with the model $\Lambda$ and searched for the best path (if one is interested in a single best classification result):

$$y^* = \mathrm{BestPath}(\lambda_x \circ \Lambda) \qquad (8)$$
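The union and decoding steps of Equations (7) and (8) are sketched below over a deliberately simplified encoding in which each compiled rule transducer is reduced to a set of single-arc paths mapping an input label to an output label with a weight. The union adds the weights of paths sharing both labels, as described above; a full implementation would of course operate on complete WFSTs.

```python
from collections import defaultdict

def union(transducers):
    """Equation (7): Lambda = union of rule transducers lambda_i, adding
    the weights of paths that share both input and output labels."""
    combined = defaultdict(float)
    for t in transducers:
        for (inp, out), weight in t.items():
            combined[(inp, out)] += weight
    return combined

def best_path(input_labels, combined):
    """Equation (8): for each input label, take the output label on the
    highest-weight matching path of the combined transducer."""
    outputs = []
    for inp in input_labels:
        candidates = {out: wt for (i, out), wt in combined.items() if i == inp}
        outputs.append(max(candidates, key=candidates.get) if candidates else None)
    return outputs

# Example: two rule transducers voting on the class of the word "balance".
t1 = {("balance", "BILLING"): 0.8, ("balance", "SALES"): -0.3}
t2 = {("balance", "BILLING"): 0.4}
print(best_path(["balance"], union([t1, t2])))  # ['BILLING']
```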
FIG. 2 is a flowchart illustrating an exemplary process that may be
used in implementations consistent with the principles of the
invention. The process may begin by obtaining linear rules and
weights from a machine learning tool (act 202) such as, for
example, Boostexter, or a machine learning tool that may employ a
boosting algorithm. In one implementation consistent with the
principles of the invention, the machine learning tool may include
a boosting algorithm such as, for example, an AdaBoost classifier.
Each one of the rules may then be partitioned into one of a group
of context-dependent rewrite rule types (act 204). For example, the
rule types may include a first rule type for testing a feature of a
word or a sentence, a second rule type for testing a feature of a
left context, and a third rule type for testing a feature of a
right context. Each of the rules may then be compiled to create a
WFST (act 206). U.S. Pat. No. 6,032,111 discloses a method for
creating transducers from context-dependent rewrite rules that may
be employed in implementations consistent with the principles of
the invention. The created WFSTs may then be unioned to provide a
combined WFST for decoding (act 208). The combined WFST may be
used, for example, to process speech transcribed by an automatic
speech recognition device and to provide a label or classification
for the transcribed speech.
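Tying the acts of FIG. 2 together, the outline below reuses the illustrative translate and union helpers from the earlier sketches; compile_rule is merely a stand-in for the five-transducer compilation of U.S. Pat. No. 6,032,111, not an implementation of it.

```python
def compile_rule(rule):
    """Stand-in for the five-transducer compilation of U.S. Pat. No.
    6,032,111 (act 206): here it merely exposes the rule's weighted
    (input label, output label) pairs in the toy encoding used above."""
    return {(rule.focus, tag): weight for tag, weight in rule.outputs.items()}

def compile_learned_rules(learned_rules):
    """learned_rules: iterable of (kind, word, alpha, beta) weak learners
    obtained from the machine learning tool (act 202)."""
    transducers = []
    for kind, w, alpha, beta in learned_rules:        # act 204: partition
        for rule in translate(kind, w, alpha, beta):  # Table 1 translation
            transducers.append(compile_rule(rule))    # act 206: rule -> WFST
    return union(transducers)                         # act 208: combined WFST
```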
Experimental Results
A machine learning tool, Boostexter, was trained on transcriptions of speech utterances from a call routing task with a vocabulary size ($|\Sigma|$) of 2,912 and 40 classes ($n = 40$). There were a total of 1,800 rules, comprising 900 positive rules and their negative counterparts. The rules resulting from Boostexter were then compiled into WFSTs, which were then unioned into a single combined WFST. The combined WFST resulting from compiling the rules had 14,372 states and 5.7 million arcs.
The accuracy of the combined WFST on a random set of 7,013 sentences was the same (85%) as the accuracy of the decoder that accompanies the Boostexter machine learning tool. This validated the compilation procedure.
CONCLUSION
Embodiments within the scope of the present invention may also
include computer-readable media for carrying or having
computer-executable instructions or data structures stored thereon.
Such computer-readable media can be any available media that can be
accessed by a general purpose or special purpose computer. By way
of example, and not limitation, such computer-readable media can
comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage,
magnetic disk storage or other magnetic storage devices, or any
other medium which can be used to carry or store desired program
code means in the form of computer-executable instructions or data
structures. When information is transferred or provided over a
network or another communications connection (either hardwired,
wireless, or a combination thereof) to a computer, the computer
properly views the connection as a computer-readable medium. Thus,
any such connection is properly termed a computer-readable medium.
Combinations of the above should also be included within the scope
of the computer-readable media.
Computer-executable instructions include, for example, instructions
and data which cause a general purpose computer, special purpose
computer, or special purpose processing device to perform a certain
function or group of functions. Computer-executable instructions
also include program modules that are executed by computers in
stand-alone or network environments. Generally, program modules
include routines, programs, objects, components, and data
structures, etc. that perform particular tasks or implement
particular abstract data types. Computer-executable instructions,
associated data structures, and program modules represent examples
of the program code means for executing steps of the methods
disclosed herein. The particular sequence of such executable
instructions or associated data structures represents examples of
corresponding acts for implementing the functions described in such
steps.
Those of skill in the art will appreciate that other embodiments of
the invention may be practiced in network computing environments
with many types of computer system configurations, including
personal computers, hand-held devices, multi-processor systems,
microprocessor-based or programmable consumer electronics, network
PCs, minicomputers, mainframe computers, and the like. Embodiments
may also be practiced in distributed computing environments where
tasks are performed by local and remote processing devices that are
linked (either by hardwired links, wireless links, or by a
combination thereof) through a communications network. In a
distributed computing environment, program modules may be located
in both local and remote memory storage devices.
Although the above description may contain specific details, they
should not be construed as limiting the claims in any way. Other
configurations of the described embodiments of the invention are
part of the scope of this invention. For example, hardwired logic
may be used in implementations instead of processors, or one or
more application specific integrated circuits (ASICs) may be used
in implementations consistent with the principles of the invention.
Further, implementations consistent with the principles of the
invention may have more or fewer acts than as described, or may
implement acts in a different order than as shown. Accordingly, the
appended claims and their legal equivalents should only define the
invention, rather than any specific examples given.
* * * * *