U.S. patent application number 11/849876 was filed with the patent office on 2007-09-04 and published on 2008-01-03 for data disambiguation systems and methods.
Invention is credited to Robert Hust, Mark Zartler.
Application Number | 20080005158 11/849876 |
Document ID | / |
Family ID | 35320658 |
Filed Date | 2007-09-04 |
United States Patent
Application |
20080005158 |
Kind Code |
A1 |
Zartler; Mark ; et
al. |
January 3, 2008 |
Data Disambiguation Systems and Methods
Abstract
Various embodiments provide a state-based, regular expression
parser in which data, such as generally unstructured text, is
received into the system and undergoes a tokenization process which
permits structure to be imparted to the data. Tokenization of the
data effectively enables various patterns in the data to be
identified. In some embodiments, one or more components can utilize
stimulus/response paradigms to recognize and react to patterns in
the data.
Inventors: |
Zartler; Mark; (Garland,
TX) ; Hust; Robert; (Hayden, ID) |
Correspondence
Address: |
LEE & HAYES, PLLC
421 W. RIVERSIDE AVE
STE 500
SPOKANE
WA
99201
US
|
Family ID: |
35320658 |
Appl. No.: |
11/849876 |
Filed: |
September 4, 2007 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
10839425 | May 4, 2004 | |
11849876 | Sep 4, 2007 | |
Current U.S.
Class: |
1/1 ;
707/999.102; 707/E17.068; 707/E17.078; 707/E17.134 |
Current CPC
Class: |
G06F 40/284 20200101;
G06F 16/3329 20190101; G06F 16/3344 20190101; Y10S 707/99943
20130101; Y10S 707/99944 20130101 |
Class at
Publication: |
707/102 ;
707/E17.134 |
International
Class: |
G06F 7/00 20060101
G06F007/00 |
Claims
1. A system comprising: one or more computer-readable media; a
dynamic agent server embodied on the one or more computer-readable
media and configured to manage deployable agents; one or more
deployable agents embodied on the one or more computer-readable
media and configured to be deployed by the dynamic agent server;
one or more knowledge base objects embodied on the one or more
computer-readable media and associated with and useable by the one
or more deployable agents, the knowledge base objects being
associated with one or more files that define cases of text that
can be matched by a functional presence engine configured as a
probabilistic parser, to text obtained from one or more data
origination entities, the one or more files being defined in a
hierarchical tag-based language.
2. The system of claim 1, wherein the hierarchical tag-based
language comprises an extensible markup language.
3. The system of claim 1, wherein an agent comprises a data source
that provides a pipeline through which data travels and a runtime
object that is configured to receive and process data in the form
of text that travels through the pipeline.
4. The system of claim 1, wherein an agent comprises a data source
that provides a pipeline through which data travels and a runtime
object that is configured to receive and process data in the form
of text that travels through the pipeline, and wherein the system
comprises a plurality of agents, and wherein data sources can
comprise different types of data sources.
5. The system of claim 4, wherein at least one data source
comprises an IRC data source.
6. The system of claim 4, wherein at least one data source
comprises a TCP/IP data source.
7. The system of claim 4, wherein at least one data source
comprises a POP3 data source.
8. The system of claim 1, wherein said one or more deployable
agents comprises agents of a first type and agents of a second
different type.
9. The system of claim 1, wherein said one or more deployable
agents comprises agents of a first type and agents of a second
different type, wherein a first type of agent is a passive agent
that listens to one or more data origination entities.
10. The system of claim 1, wherein said one or more deployable
agents comprises agents of a first type and agents of a second
different type, wherein a first type of agent is a passive agent
that listens to one or more data origination entities, and wherein
a second type of agent is an active agent that is configured to
interact with a data origination entity.
11. A method comprising: providing a dynamic agent server
configured to manage deployable agents; deploying one or more
deployable agents that are configured to receive data; with the one
or more deployable agents, using one or more knowledge base objects
to process the data using a functional presence engine configured
as a probabilistic parser, the knowledge base objects being
associated with one or more files that define cases of text that
can be matched to text obtained from one or more data origination
entities, the one or more files being defined in a hierarchical
tag-based language.
12. The method of claim 11, wherein the hierarchical tag-based
language comprises an extensible markup language.
13. The method of claim 11, wherein an agent comprises a data
source that provides a pipeline through which data travels and a
runtime object that is configured to receive and process data in
the form of text that travels through the pipeline.
14. The method of claim 11, wherein an agent comprises a data
source that provides a pipeline through which data travels and a
runtime object that is configured to receive and process data in
the form of text that travels through the pipeline, and wherein the
system comprises a plurality of agents, and wherein data sources
can comprise different types of data sources.
15. The method of claim 11, wherein said one or more deployable
agents comprises agents of a first type and agents of a second
different type.
16. The method of claim 11, wherein said one or more deployable
agents comprises agents of a first type and agents of a second
different type, and wherein a first type of agent is a passive
agent that listens to one or more data origination entities.
17. The method of claim 11, wherein said one or more deployable
agents comprises agents of a first type and agents of a second
different type, and wherein a first type of agent is a passive
agent that listens to one or more data origination entities, and
wherein a second type of agent is an active agent that is
configured to interact with a data origination entity.
Description
RELATED APPLICATION
[0001] This application is a divisional of and claims priority to
U.S. patent application Ser. No. 10/839,425, the disclosure of
which is incorporated by reference herein.
TECHNICAL FIELD
[0002] This invention relates generally to data disambiguation.
More particularly, the invention pertains to systems, methods and
software architectures that are directed to pattern processing and
recognition in the context of generally unstructured data.
BACKGROUND
[0003] There is a great deal of so-called unstructured data that
resides in the world. Typically, unstructured data has
characteristics which, as the name implies, find it highly
unstructured and difficult to work with. Perhaps a good perspective
from which to understand unstructured data is from the perspective
of structured data. Structured data, by its very nature, is
typically easily indexed and searched.
[0004] As an example, consider the following. In many cases,
governments, corporations, and various other large entities such as
businesses and the like, can have many thousands of documents to
deal with. These documents constitute knowledge in the sense that
the documents contain information that might be useful to the
particular entity. Yet, by virtue of the voluminous number of
documents and the fact that such documents may be in a generally
unstructured state, this knowledge is not reasonably and readily
attained by these entities. Even if such entities were to have, for
example, an intranet, one would have to know what to specifically
search for, and what the information means to the searcher.
[0005] Thus, as noted above, one of the difficulties in working
with unstructured data is that of building and creating knowledge
based on the unstructured data. Put another way, one of the
challenges with unstructured data pertains to disambiguating the
data so that the data can be the subject of meaningful information
processing techniques.
[0006] Some approaches that have been used in the past in an
attempt to disambiguate unstructured data utilize so-called
knowledge architects. Knowledge architects are typically very
highly skilled professionals who craft knowledge based on the data.
The techniques and approaches that these individuals use tend to be
very expensive, owing to the highly skilled nature of the
individual(s) architecting the system. Additionally, the specific
systems that are put in place by such individuals do not tend to be
easily repeatable in different scenarios or environments. Thus,
these approaches tend to be expensive and highly specifically
directed to a particular problem at hand. As such, there remains a
need, in the area of data disambiguation, for systems that are less
complex insofar as implementation and deployment are concerned. In
addition, there is a need for such systems that do not require a
highly specialized professional to set and deploy the system.
SUMMARY
[0007] Various embodiments provide a state-based, regular
expression parser in which data, such as generally unstructured
text, is received into the system and undergoes a tokenization
process which permits structure to be imparted to the data.
Tokenization of the data effectively enables various patterns in
the data to be identified. In some embodiments, one or more
components can utilize stimulus/response paradigms to recognize and
react to patterns in the data.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] FIG. 1 is a block diagram that illustrates components of a system
in accordance with one embodiment.
[0009] FIG. 2 is a block diagram that illustrates components that
can be used for conducting lexical analysis in accordance with one
embodiment.
[0010] FIG. 3 is a block diagram that illustrates software
components in a system in accordance with one embodiment.
[0011] FIG. 4 is a block diagram that illustrates a system in
accordance with one embodiment.
[0012] FIG. 5 is a block diagram that illustrates a system in
accordance with one embodiment.
[0013] FIG. 5a is a block diagram that illustrates steps in a
method in accordance with one embodiment.
[0014] FIG. 6 is a block diagram that illustrates a system in
accordance with one embodiment.
DETAILED DESCRIPTION
Overview
[0015] As an overview, various embodiments described in this
document utilize a state-based, regular expression parser that is
designed to deal with language, text and text types. In accordance
with at least one embodiment, data such as text is received into
the system and undergoes a tokenization process which permits
structure to be imparted to the data. As the data undergoes the
tokenization process, portions of the data (e.g. individual words
of text) are assigned different types. As an elementary example,
consider that individual words can be considered as parts of
speech, such as nouns, verbs, prepositions and the like. Thus, a
very elementary system might be set up to tokenize individual words
according to their respective part of speech. Perhaps a better
example is to consider that different nouns can be tokenized as
types of nouns, e.g. places, dates, email addresses, web sites, and
the like.
[0016] Tokenizing data creates patterns in the language which, in
turn, can allow simple keyword searches or searches for
different type objects such as date objects, place objects, email
address objects and the like. The tokenization process is
effectively a generalized abstraction process in which typing is
used to abstract classes of words into different contexts that can
be used for much broader purposes, as will become apparent
below.
[0017] FIG. 1 shows, generally at 100, an exemplary software
architecture or system that can be utilized to implement various
embodiments described above and below. The software architecture
can be embodied on any suitable computer-readable media.
[0018] In this particular example, system 100 comprises a
functional presence engine 102, one or more knowledge bases 104
and, optionally, an information retrieval module 106. In accordance
with one embodiment, system 100 receives unstructured data and processes
it in a manner that imparts a degree of useful structure to it. The
output of system 100 can be one or more of structured data and/or
one or more actions as will become apparent below. Each of these
individual components is discussed in more detail below under their
own respective headings.
Functional Presence Engine
[0019] In accordance with one embodiment, the functional presence
engine 102 is implemented as a probabilistic parser that performs
lexical analysis, using lexical archetypes, to define recognizable
patterns. The functional presence engine can then use one or more
stimulus/response knowledge bases, such as knowledge bases 104, to
make sense of the patterns and react to them appropriately. In
accordance with one embodiment, system 100 can learn or be trained
by changing the lexical archetypes and/or the knowledge
bases.
[0020] Lexical Analysis
[0021] The discussion below provides but one example
implementation of how lexical analysis can be performed in
accordance with the described embodiment. It is to be appreciated
and understood that the description provided below is not intended
to limit application of the claimed subject matter. Rather, other
approaches can be utilized without departing from the spirit and
scope of the claimed subject matter.
[0022] In accordance with one embodiment, lexical analysis is
performed utilizing a system, such as the system shown generally at
200 in FIG. 2. System 200 comprises, in this embodiment, an
external .lex file 202, which specifies a series of rules and their
output symbols; a program 204 that reads the .lex file and converts it
into a program which, in this example, comprises a C++ lex-program;
a lexical analysis program 206 which, when provided with data such
as text, produces tokenized content in the format specified in the
.lex file; and an independent regular expression library 208.
[0023] The .lex File
[0024] In accordance with one embodiment, the .lex file 202
comprises a structure having two component parts: a macro section
and one or more lex sections. In the illustrated and described
embodiment, the .lex file is case sensitive, as are the regular
expressions embodied by it. The macro section specifies symbol
rewrites. The macro section is used to create named identifiers
representing more complicated regular expression patterns. This
allows the author to create and re-use regular expressions without
having to rewrite the same patterns in more than one place. Macros
keep the lex sections cleaner and allow common expressions to be
changed in only one place. As an example, consider the following.
    %macros
    regular_expression -> macro-name
    regular_expression1 -> macro-name1
[0025] The following is a valid example of a macros section:
    %macros                  // begin macros
    \t\n\f\r,`- -> wb        // macros!
    \!\?\:\;\." -> sb        // more macros
[0026] With respect to the lex section, consider the following:
    %lex optional_name
    regular_expression1 -> output_specifier [,output_specifier...]
    regular_expression2 -> output_specifier [,output_specifier...]
    regular_expression3 -> output_specifier [,output_specifier...]
    optional_name -> output_specifier [,output_specifier...]
[0027] "%lex" denotes the beginning of a section of lexical rewrite
rules. In some cases it is desirable to specify a name. This is
explored in more detail below. On the lines following the "%lex"
tag, a series of rules is specified. Each rule specifies a regular
expression followed by a series of output symbols. As an example,
consider the following: ([[:alpha:]]+)[:wb:]+ -> WORD{1}
[0028] The left hand side of this expression is a regular
expression. In this example, notice the ":wb:" on the left hand side,
which specifies a macro. Macros are specified using the format
":macro-name:". A preprocessor will substitute the macro value
wherever it finds a macro name surrounded by colons. A special case
arises when the rule expression matches the name specified in
the "%lex" tag. This is a pass-through rule, meaning that if no
other rule matches, this default rule will consume the entire text
and call the output specifiers with the entire text. There are some
cases where this is useful, such as when the lexer will never be
a top level program. In accordance with one embodiment, the
regular expression engine utilized is the
public domain engine PCRE 3.9, which will be known by the skilled
artisan.
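By way of illustration only, the preprocessor substitution described above can be sketched as follows. This is a minimal sketch and not the described implementation: the function name expandMacros and the use of a std::map to hold the macro table are assumptions made for the example. Names that are not defined as macros, such as the "alpha" in "[[:alpha:]]", are left untouched.

```cpp
#include <map>
#include <string>

// Hypothetical helper (not part of the described library): replaces each
// ":name:" in a rule with the value of the macro called "name".
// Text between colons that is not a known macro name is copied unchanged.
std::string expandMacros(const std::string& rule,
                         const std::map<std::string, std::string>& macros) {
    std::string out;
    std::size_t i = 0;
    while (i < rule.size()) {
        if (rule[i] == ':') {
            std::size_t close = rule.find(':', i + 1);
            if (close != std::string::npos) {
                auto it = macros.find(rule.substr(i + 1, close - i - 1));
                if (it != macros.end()) {
                    out += it->second;   // substitute the macro value
                    i = close + 1;       // skip past the closing colon
                    continue;
                }
            }
        }
        out += rule[i++];                // ordinary character: copy as-is
    }
    return out;
}
```

With a macro table that maps "wb" to a set of word boundary characters, the rule "([[:alpha:]]+)[:wb:]+" expands so that only the ":wb:" portion is replaced.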
[0029] Continuing, after the regular expression appears a
"->" followed by a series of output specifiers. In the above
example, a match of the given regular expression produces the
output symbol "WORD" and the output text {1}. The braces and
numeric identifier are optional. They specify which sub-expression
is output with the symbol. In the illustrated and described
embodiment, sub-expressions are the text which matches regular
expressions within parentheses. In this example, the text which
matches ([[:alpha:]]+) would be output along with the token "WORD".
If the above example were changed to:
([[:alpha:]]+)[:wb:]+ -> WORD then the output token would be
the same, but the entire match would be returned as the text.
This is the same as writing "WORD{0}". As another example, consider the
following: ([[:alpha:]]+)([:sb:]+) -> WORD{1}, EOS{2} // WORD
and EOS
[0030] The example pattern above matches alpha characters, followed
by the macro :sb:, which is defined in our example to be sentence
boundary characters. When text followed by a period occurs, two tokens
are output: the WORD token and an end of sentence (EOS) token. This
demonstrates how a single match can produce more than one token.
There is no limit on the number of tokens which can be output,
except as guided by practicality. As another example, consider the
pattern appearing just below: [^:wb:]+ -> GWORD
[0031] This pattern looks for any run of characters that are not word
boundary characters and outputs a GWORD token, and the output text
is the entire match.
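By way of illustration only, the way a single match can produce more than one token can be sketched as follows. This sketch uses the standard C++ std::regex library rather than the engine named in this document, hard-codes the sentence boundary characters in place of the :sb: macro, and uses the hypothetical names Token and wordThenEos.

```cpp
#include <regex>
#include <string>
#include <vector>

struct Token {
    std::string symbol;   // e.g. "WORD" or "EOS"
    std::string text;     // text of the selected sub-expression
};

// Applies a hard-coded equivalent of the rule
//   ([[:alpha:]]+)([:sb:]+) -> WORD{1}, EOS{2}
// at the start of `input`; returns no tokens if the rule does not match.
std::vector<Token> wordThenEos(const std::string& input) {
    static const std::regex rule(R"re(([[:alpha:]]+)([!?:;."]+))re");
    std::vector<Token> tokens;
    std::smatch m;
    // match_continuous requires the match to begin at the first character.
    if (std::regex_search(input, m, rule,
                          std::regex_constants::match_continuous)) {
        tokens.push_back({"WORD", m[1].str()});  // {1}: first sub-expression
        tokens.push_back({"EOS", m[2].str()});   // {2}: second sub-expression
    }
    return tokens;
}
```

For the input "done. Next", the sketch yields a WORD token carrying "done" followed by an EOS token carrying ".".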
[0032] Putting the entire lex construct together, consider the
following:
    %lex main
    ([[:alpha:]]+)[:wb:]+ -> WORD{1}
    ([[:alpha:]]+)([:sb:]+) -> WORD{1}, EOS{2}   // WORD and EOS
    [^:wb:]+ -> GWORD                            // generic graphic word
[0033] In this particular example, when the lexer runs, it chooses
the rule which matches the most text as the rule which will trigger
the output token. Options may be added later to control this
behavior. This lexer will output text words, end of sentence
markers, and graphic words.
[0034] For handling large volumes of text, it is important to keep
the main lexer simple. That said, in some scenarios, it can be
desirable to tokenize things such as EMAIL, MONEY, IP addresses and
URLs. The following simple rules are provided as an example of
rules that tokenize such things.
    ([a-zA-Z0-9._-]+)@(([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9._-]{2,3}) -> EMAIL
    [$]([\d]+\.[\d]*) -> MONEY
    [\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3} -> IP
    ((http|https)://)?([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9]{1,3} -> URL
[0035] To address efficiency and performance issues, the format
utilized for lexical analysis can add some additional constructs.
Recall from above that the file can specify one or more %lex
constructs. This being the case, consider the following.
[0036] Instead of putting the four rules listed above into the
"main" lexer, the rules can instead be added to a sub-lexer as
follows:
    %lex GWORD
    ([a-zA-Z0-9._-]+)@(([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9]{2,3}) -> EMAIL
    [$]([\d]+\.[\d]*) -> MONEY
    [\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3} -> IP
    ((http|https)://)?([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9]{1,3} -> URL
[0037] Using this format, the entire file would look as follows:
    %macros                                      // begin macros
    \t\n\f\r,`- -> wb                            // macros!
    \!\?\:\;\.\" -> sb                           // more macros
    %lex main
    ([[:alpha:]]+)[:wb:]+ -> WORD{1}
    ([[:alpha:]]+)([:sb:]+) -> WORD{1}, EOS{2}   // WORD and EOS
    [^:wb:]+ -> GWORD{0}                         // generic graphic word
    %lex GWORD
    ([a-zA-Z0-9._-]+)@(([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9]{2,3}) -> EMAIL
    [$]([\d]+\.[\d]*) -> MONEY
    [\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3} -> IP
    ((http|https)://)?([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9]{1,3} -> URL
[0038] In this file, there are two lex programs. Generally, the
"main" lexer is the only lexer executed at the top level of the text
tokenization process. The rules under %lex GWORD in general
will not execute. However, note that the rule
"[^:wb:]+ -> GWORD{0}" has the output token GWORD and that
the new %lex construct has the name GWORD. This specifies a
recursive lex procedure. When GWORD is the matched construct from
"main", that is, when no other rule matches more text, then before
outputting GWORD the lexer will first try to match all the lexical
rules under the %lex GWORD tag. This is analogous to a procedure
call in a programming language. The data that gets passed in is the text
specified in the output: in our case GWORD{0}, the entire matched
text. From a performance standpoint, there is a performance
hit only when special graphic words are found. For alpha-only words, the
GWORD lexer will not run.
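By way of illustration only, the recursive call into the GWORD sub-lexer can be sketched as follows. This is a simplified sketch under stated assumptions: it uses std::regex instead of the engine named in this document, it includes only three of the four sub-rules, it requires a sub-rule to match the whole graphic word, and the name refineGword is hypothetical.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of the "%lex GWORD" sub-lexer. It is called only
// with text that the main lexer has already classified as a GWORD.
// Returns the more specific token if a sub-rule matches the whole text,
// otherwise falls back to "GWORD".
std::string refineGword(const std::string& gwordText) {
    static const std::vector<std::pair<std::regex, std::string>> subRules = {
        {std::regex(R"(([a-zA-Z0-9._-]+)@(([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9]{2,3}))"),
         "EMAIL"},
        {std::regex(R"([$](\d+\.\d*))"), "MONEY"},
        {std::regex(R"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"), "IP"},
    };
    for (const auto& rule : subRules) {
        // regex_match succeeds only if the entire text matches the pattern.
        if (std::regex_match(gwordText, rule.first)) return rule.second;
    }
    return "GWORD";
}
```

Because refineGword is invoked only for graphic words, alpha-only words never pay the cost of evaluating the sub-rules, which mirrors the performance point made above.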
[0039] In addition to the constructs described above, in at least
one embodiment, other constructs can be utilized. These constructs
can control which lexers lexically process the data first. As an
example, consider the construct "%push lexer-name, %pop". In
accordance with one embodiment, the lexer program can maintain a
stack of lexers. Lexers which are on the execution stack are
evaluated by a top level parser. Lexers which are not on the
execution stack are not evaluated unless recursive tokenization
occurs. It is possible to push new lexers onto the stack in order
to read specific data and then pop them when finished, as will be
appreciated by the skilled artisan.
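By way of illustration only, the stack of lexers can be modeled as follows. The class name LexerStack and its members are hypothetical, and the choice to keep the bottom ("main") entry on the stack when popping is an assumption made for this sketch rather than a statement about the described implementation.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of the execution stack: the most recently pushed lexer
// is consulted first, and the main lexer stays at the bottom as a fallback.
class LexerStack {
public:
    explicit LexerStack(std::string mainLexer) {
        stack_.push_back(std::move(mainLexer));
    }

    // Corresponds to "%push name".
    void push(const std::string& name) { stack_.push_back(name); }

    // Corresponds to "%pop": removes the top lexer but never the bottom entry.
    void pop() {
        if (stack_.size() > 1) stack_.pop_back();
    }

    // Lexers in the order in which they get a chance to evaluate the input.
    std::vector<std::string> evaluationOrder() const {
        return std::vector<std::string>(stack_.rbegin(), stack_.rend());
    }

private:
    std::vector<std::string> stack_;
};
```

A lexer that is not on this stack, such as a sub-lexer reached only through recursive tokenization, would not appear in the evaluation order at all.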
[0040] To demonstrate this, the code listing below is a text
representation of the actual .lex file parser in the lex language.
The main lexer is either the %lex named "main" or the first %lex
encountered in the file. In the present example, both conditions
are satisfied by the first-encountered lexer. In the illustrated
and described example, the %lex main program looks for "%macros"
specifiers, "%lex" specifiers, comments, or extra white
space.
[0041] When a %macros specifier is encountered, it emits a symbol "MACRO",
then pops anything on the stack, and then pushes the %lex
READ_MACRO program onto the top of the stack. The rules in %lex
READ_MACRO will now get the first chance to evaluate the incoming
data or text. If READ_MACRO fails to match, then %lex main will
also have an opportunity to evaluate the incoming data or text.
[0042] When a %lex is encountered, the same process occurs, except
the top program becomes READ_LEX. READ_LEX looks for rules and, if
encountered, it tokenizes the REGEX of the rule, and then pushes
READ_LEX_RULE to read the right hand side of the rule. This
demonstrates the recursive capabilities of the system. The program,
on certain input conditions, triggers a state change to a
specialized sub lexer which is capable of parsing a specific type
of data. The sub lexer will process the data and then perform a
%pop operation when the sub lexer has completed its task.
[0043] If READ_LEX_RULE encounters some non-white, non-comment
text, it gathers it, and calls the LEX_TOKEN program with the
gathered text. LEX_TOKEN looks for %push, %pop, xxx{digit}, or
xxx. In the illustrated and described embodiment, however, LEX_TOKEN
is not on the stack. Rather, it is a sub-component that is executed
based on the text gathered by the parent, as described above.
[0044] Consider now the additional construct "%lex default", which
is a program that is specified at the bottom of the code listing
below. In accordance with one embodiment, this program will
only execute if the text cannot be tokenized using the execution
stack. In the present example, this program is utilized to indicate
a syntax error.
    // This program is the lex specification for a lexer that
    // reads the "lex" file.
    // main lexer
    %lex main
    \%macros[ \t]* -> MACRO, %pop, %push READ_MACRO
    \%lex[ \t]*([A-Za-z0-9_]*)[ \t]* -> LEX{1}, %pop, %push READ_LEX
    //.* ->                                  // ignore
    [ \t\r\n]+ ->                            // ignore
    // read macro lexer
    %lex READ_MACRO
    [ \t]*(.+)[ \t]*-\>+[ \t]*([A-Za-z0-9_]*) -> MACRONAME{2}, MACROVALUE{1}
    // read lex rules
    %lex READ_LEX
    [ \t]*(.+)[ \t]*-\> -> RULE_RE{1}, %push READ_LEX_RULE
    // read a single lex rule
    %lex READ_LEX_RULE
    [\r\n] -> %pop                           // done reading a rule
    [ \t]*//.* -> %pop
    [ \t[:alnum:]_\{%\}]+ -> LEX_TOKEN{0}    // recursive descent into
                                             // %lex LEX_TOKEN with the string
    , ->                                     // ignore
    // read a single lex token
    %lex LEX_TOKEN
    // lex_token program
    // this is called to perform a subrecognition on lex token output forms
    // it is not in the top level parser stack
    [ \t]*%push[ \t]+(.*)[ \t]* -> STACK_PUSH{1}
    [ \t]*%pop[ \t]*$ -> STACK_POP
    [ \t]*([A-Za-z0-9_]+)\{[ \t]*([0-9]+)[ \t]*\}[ \t]* -> TOKEN{1}, TOKENPARAM{2}
    [ \t]*([A-Za-z0-9_]+)[ \t]*$ -> TOKEN{1}
    %lex default
    // default is ONLY hit when no other lexers on the stack evaluate
    // in this case we want to spit a syntax error
    // at this point, any text is considered a syntax error
    .* -> SYNTAX_ERROR
[0045] How .lex Matches
[0046] In accordance with one embodiment, under a given
lex_program, the default methodology is to attempt to match all
the regular expressions in the %lex group and choose the rule
which consumes the most input. In accordance with one embodiment,
however, a program directive "%pragma" can be utilized to specify
behaviors for the analysis system. For example, a %pragma
firstmatch before the %lex tag indicates that the matching
behavior should be to choose the first rule which successfully
matches the incoming text. This can improve performance but can
significantly change which rule fires, because rule order now
determines the result.
[0047] The syntax of this program directive is: %pragma
pragma_name. The following pragmas (case-sensitive) are currently
defined:
Pragma | Definition |
%pragma firstmatch | The %lex tags and rules below this %pragma are matched using the "first match" strategy. That is, the first rule which is able to match the incoming text is the rule which will fire; other rules are ignored. This is for performance. It is not the default behavior. |
%pragma bestmatch | The %lex tags and rules below this %pragma are matched using the "best match" strategy. All the rules of a particular %lex group are used. The rule which matches the longest string will be the rule which is fired. This is the default behavior. |
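By way of illustration only, the difference between the two matching strategies can be sketched as follows. The sketch uses std::regex in place of the engine named in this document, and the names Strategy, Rule and fire are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

enum class Strategy { FirstMatch, BestMatch };   // mirrors the two pragmas

struct Rule {
    std::regex pattern;   // left hand side of the rule
    std::string token;    // output symbol
};

// Returns the token of the rule that fires at the start of `text`,
// or an empty string if no rule matches there.
std::string fire(const std::vector<Rule>& rules, const std::string& text,
                 Strategy strategy) {
    std::string winner;
    std::ptrdiff_t longest = 0;
    for (const Rule& rule : rules) {
        std::smatch m;
        // match_continuous requires the match to begin at the first character.
        if (!std::regex_search(text, m, rule.pattern,
                               std::regex_constants::match_continuous)) {
            continue;                       // this rule does not match here
        }
        if (strategy == Strategy::FirstMatch) return rule.token;
        if (m.length(0) > longest) {        // best match: keep the longest
            longest = m.length(0);
            winner = rule.token;
        }
    }
    return winner;
}
```

Given a rule for integers listed before a rule for decimal numbers, the text "3.14 is pi" fires the integer rule under the first match strategy and the decimal rule under the best match strategy, which shows why rule order matters only in the former.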
Implementation Details
[0048] The following discussion is provided to describe one
particular implementation example of the system shown in FIG. 2.
This example is not intended to limit application of the claimed
subject matter to the specifically described example. Rather, this
example is provided as a guide to the skilled artisan as but one
way certain aspects of the described embodiments can be
implemented.
[0049] First, in accordance with one embodiment, to utilize the
lexer or the regular expression engine singularly, the user should
consider the following classes, each of which is discussed below
under its own separate heading:
[0050] CRegex: regular expression engine. This class allows the user
to set the regular expression and then search a string of data for
the expression.
[0051] lex_program: a C++ implementation of the features provided in
the .lex file format.
[0052] lex_program compiler: a compiler that produces a lex_program
from a .lex stream.
[0053] lextoken: output data from the lex_program; an individual
token of data.
CRegex
[0054] In accordance with one embodiment, this class is a
self-contained class for matching strings of text to a regular
expression. Like all C++ classes in the Lex library, CRegex
supports value class semantics, assignment, and copy construction.
All of these operations are valid and tested.
[0055] In accordance with one embodiment, the member methods in
this class include a compile method, a match method, a getMatches
method, and a GetLastError method, each of which is described
below.
[0056] CRegex::compile
[0057] bool compile(const char* szRE, int flags, long*
pFailureOffset=NULL);
[0058] This method compiles the specified regular expression, given
in szRE, and flags.
[0059] Parameters
[0060] szRE: [in] pointer to a Perl 5 compatible regular
expression
[0061] flags: [in] modifier flags for compilation
lexer::anchored | The pattern is forced to be "anchored". That is, the pattern is constrained to match only at the start of the string which is being searched (the "subject string"). This effect can also be achieved by appropriate constructs in the pattern itself. |
lexer::caseless | Letters in the pattern match both upper and lower case letters. |
lexer::dollarend | A dollar metacharacter in the pattern matches only at the end of the subject string. Without this option, a dollar also matches immediately before the final character if it is a new line (but not before any other new lines). This option is ignored if lexer::multiline is set. |
lexer::multiline | By default, CRegex treats the subject string as consisting of a single "line" of characters (even if it actually contains several new lines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating new line (unless lexer::dollarend is set). |
[0062] pFailureOffset: [in, out] a pointer to an integer variable
that will receive the offset at which the string failed to compile.
This may be useful for custom error handling. The stored value is
undefined if the compilation succeeds.
[0063] Return Value
[0064] bool--true if the compilation was successful, false
otherwise. Use CRegex::GetLastError( ) to retrieve a more detailed
error message.
[0065] CRegex::match
[0066] long match(const char* szText, long inLen, int flags=0)
[0067] This method attempts to match the compiled regular
expression stored in the class object against the specified text,
given the length of the text. It returns the number of characters
consumed by the match, or 0 if the match failed. As an example, if
"this" is the pattern to match and the string is "blah this is fun",
then match will return 9, the position just past the end of the
match.
[0068] To retrieve more detailed information about the match, the
method CRegex::getMatches, described below, can be utilized after
performing the match.
[0069] Parameters
[0070] szText--[in] data to match against
[0071] inLen--[in] number of characters in the szText to analyze.
Use -1 if szText is null terminated and you wish to match up to the
end of string.
[0072] flags: [in]
lexer::notbol | The first character of the string is not the beginning of a line, so the circumflex metacharacter should not match before it. Setting this without lexer::multiline (at compile time) causes ^ never to match. |
lexer::noteol | The end of the string is not the end of a line, so the dollar metacharacter should not match it nor (except in multiline mode) a newline immediately before it. Setting this without lexer::multiline (at compile time) causes dollar never to match. |
[0074] Notes
[0075] The return value is somewhat unintuitive. It is the
offset of the next character after the end of the matched text.
Just note that a non-zero value means there was a match. To get
specific information about exactly where the match occurs, call
CRegex::getMatches any time after the call to CRegex::match.
[0076] Returns
[0077] long--number of characters consumed by the match process--0
if no match.
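The return convention described above can be illustrated with a minimal C++ sketch. It is not the CRegex implementation: it assumes std::regex as a stand-in engine, and the helper name match_end_offset is hypothetical. It returns the offset just past the first match, or 0 when nothing matches.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical stand-in for the CRegex::match return convention:
// the offset just past the first match, or 0 when nothing matches.
long match_end_offset(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re)) {
        return 0;  // no match
    }
    // Start of the whole match plus its length is the position past the match.
    return static_cast<long>(m.position(0) + m.length(0));
}
```

With the pattern "this" and the text "blah this is fun", the helper returns 9, in line with the example given for CRegex::match. Under this convention an empty match at offset 0 cannot be told apart from a failed match.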
[0078] CRegex::getMatches
[0079] long getMatches(int** ppmatches);
[0080] Call this method after calling CRegex::match to get a
pointer to the list of matches (submatches). It returns, in
ppmatches, a pointer to the internal list of matches retrieved
after the last match. It also returns the number of valid
matches.
[0081] Each call to match only matches the regular expression once.
Callers will need to iterate to find all the particular matches.
The getMatches method returns positional information about where
the match occurred in the text. The first two integers specify the
start position and end position of the whole match. The next "n"
integers return position of all submatches in the source string. As
an example, consider the following: TABLE-US-00012
RE: ([[:alpha:]]+)([\d]+)
Subject string: "Mark123 is here"
CRegex::match returns: 7, indicating success
getMatches returns: 3
The list of integers looks like this:
[0] - 0
[1] - 7
[2] - 0
[3] - 4
[4] - 7
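A rough C++ sketch of how such positional information can be gathered is shown here. It uses std::regex rather than CRegex, and the helper name match_offsets is hypothetical. It assumes that the whole match and every submatch each contribute a start offset and an end offset, so the same subject string yields six integers; the internal layout exposed by getMatches may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Flattens the whole match (index 0) and each submatch into consecutive
// start/end offsets. Returns an empty vector when there is no match.
std::vector<int> match_offsets(const std::string& pattern, const std::string& text) {
    std::vector<int> offsets;
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        for (std::size_t i = 0; i < m.size(); ++i) {
            const int start = static_cast<int>(m.position(i));
            offsets.push_back(start);
            offsets.push_back(start + static_cast<int>(m.length(i)));
        }
    }
    return offsets;
}
```

The number of matches in the sense of getMatches (one whole match plus the submatches) is then half the size of the returned vector.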
[0082] Parameters
[0083] ppmatches--[out] a live pointer to the matches. This pointer
dies with the class or when the matcher is recompiled.
[0084] Returns
[0085] The number of matches: 1 (whole match)+number of submatches.
0 if there was no match in the last call to CRegex::match
[0086] Notes
[0087] The returned match list is a class object which goes out of
scope with the class, or when the CRegex::compile is called.
[0088] CRegex::GetLastError
[0089] std::string GetLastError( ) const;
[0090] This method returns the compilation error if any. It returns
a string that specifies where in the regular expression the
compilation failed and is useful for debugging compilation
errors.
lex_program
[0091] In accordance with the described embodiment, lex_program is
the C++ lexical analyzer and is used to tokenize data sources. The
lex_program can be created from scratch or compiled from a file
using a lex_program_compiler. The member methods of this class
include a lex_program method, a Tokenize method, a Begin method and
a GetNextTokens method, each of which is described below.
[0092] lex_program::lex_program
[0093] lex_program::lex_program(ulong lexOptions=0)
[0094] Parameters
[0095] lexOptions--[in] Flags to control the behavior of the
program. TABLE-US-00013
Lexer::opt_lineCounts: The lexer will manually keep track of
character and line position. Lextokens returned from this program
will contain valid charNum and lineNum fields. This should only be
used when this information is important; otherwise it is not
recommended because there is a modest performance hit involved in
keeping track of this information.
lexer::opt_firstMatch: This option instructs the lex program to
prefer first matches. (See how matching occurs). This is usually
controlled by the input .lex file and the lexer::lexer class. It is
recommended that this value be set in the lexer class and not here.
[0096] lex_program::Tokenize
[0097] virtual bool Tokenize(const spchar* pData, ulong length,
[0098] std::vector<lextoken>& vcTokens, bool
bResetState=true);
[0099] Given source text, this method tokenizes the source data and
returns lextokens.
[0100] Parameters
[0101] pData--[in] pointer to the data to be tokenized
[0102] length--[in] length of data to be tokenized
[0103] vcTokens--[out] list of tokens generated by the content.
Tokens are appended to the end of it.
[0104] Remarks
[0105] This method tokenizes the entire content. It is an
alternative or simplification to calling lex_program::Begin( ), and
then lex_program::GetNextTokens( ) iteratively until it returns
false.
[0106] lex_program::Begin
[0107] virtual bool Begin(const spchar* pData, ulong length, bool
bReset=true);
[0108] This method is used to set the source data for the lexical
analysis. Call this before calling lex_program::GetNextTokens. It is
not necessary to call this method if using
lex_program::Tokenize.
[0109] Parameters
[0110] pData--[in] pointer to the data to be tokenized
[0111] length--[in] length of data to be tokenized
[0112] bReset--[in] reset the stack and variable state of the lexer
back to default.
[0113] Returns
[0114] true
[0115] lex_program::GetNextTokens
[0116] virtual bool GetNextTokens(std::vector<lextoken>&
toks);
[0117] This method is used to retrieve the next token from the
input stream. It may return more than one token. Use this function
to iteratively run through the data in as atomic a way as possible.
This method returns true until the end of data is reached. It is
possible that the return token list is empty even if the return
value is true.
[0118] Parameters
[0119] toks--[out] vector of tokens. New tokens are appended onto
this list and the list is NOT CLEARED by this method; users must
clear the list manually if this is the desired effect.
[0120] Returns
[0121] false when the entire string has been tokenized, to the best
of the program's ability.
[0122] Remarks
[0123] lex_program::Begin( ) must be called before calling
GetNextTokens.
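The Begin/GetNextTokens calling pattern can be sketched in C++ as follows. ToyLexer and Token are hypothetical stand-ins for lex_program and lextoken, and the whitespace splitting is only an assumption made to keep the sketch self-contained; a real lex_program is driven by a .lex file. The point illustrated is the loop shape: set the source once, pull tokens until false is returned, and clear the token list manually because the call only appends.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct Token {  // hypothetical stand-in for lextoken
    std::size_t pos;     // absolute position in the buffer
    std::size_t length;  // length of the token text
};

class ToyLexer {  // hypothetical stand-in for lex_program
public:
    bool Begin(const char* data, std::size_t length) {
        data_ = data;
        length_ = length;
        cursor_ = 0;
        return true;
    }
    // Appends at most one whitespace-delimited token to toks.
    // Returns false once the end of the data has been reached.
    bool GetNextTokens(std::vector<Token>& toks) {
        while (cursor_ < length_ && IsSpace(data_[cursor_])) ++cursor_;
        if (cursor_ >= length_) return false;
        const std::size_t start = cursor_;
        while (cursor_ < length_ && !IsSpace(data_[cursor_])) ++cursor_;
        toks.push_back(Token{start, cursor_ - start});
        return true;
    }
private:
    static bool IsSpace(char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }
    const char* data_ = nullptr;
    std::size_t length_ = 0;
    std::size_t cursor_ = 0;
};

// Iterates over the data and collects the text of every token.
std::vector<std::string> tokenize_words(const std::string& text) {
    ToyLexer lexer;
    lexer.Begin(text.data(), text.size());
    std::vector<std::string> words;
    std::vector<Token> batch;
    while (lexer.GetNextTokens(batch)) {
        for (const Token& t : batch) words.emplace_back(text, t.pos, t.length);
        batch.clear();  // the call appends, so the caller clears
    }
    return words;
}
```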
lex_program_compiler
[0124] The class lex_program_compiler is a class that converts a
stream of text in .lex format (described above) into a lex_program
which can be used for lexical analysis. The member methods in this
class include a compile_lex method described below.
[0125] lex_program_compiler::compile_lex
[0126] bool compile_lex(const char* pData, long nDataLength, [0127]
lex_program& lexprogram, std::vector<lexfileerror_t>&
errors);
[0128] Given a pointer to .lex formatted data and a data length,
returns an instantiated lex_program capable of tokenizing streams
as specified in the pData.
[0129] Parameters
[0130] pData--[in] pointer to the data. Use
NextIt::LoadDiskFileIntoString( . . . ) or some other disk file
loading method to load the lex file into memory.
[0131] nDataLength--[in] number of .lex formatted bytes contained
in the pData pointer
[0132] lexprogram--[out] compiled program
[0133] errors--[out] descriptive list of errors, if any.
[0134] Returns
[0135] bool--true if the compile succeeded without errors or
warnings. The application is responsible for determining if errors
or warnings warrant a stoppage. Recommended: stop and display
errors.
lextoken
[0136] This data class is the return class of the lex_program and
represents a token. It is designed for efficient parsing. In
addition to returning a token constant, it also returns positional
information and length information of the source text that produced
the token, which is important for language processing.
TABLE-US-00014 Member data
lexfilepos_t startPos
typedef struct {
ulong lineNum; // 0 based line number
ulong charNum; // 0 based character index
ulong pos; // absolute position in the buffer
} lexfilepos_t;
[0137] This is the starting information within the source data. If
the lex_program was created with the lexer::opt_lineCounts, the
lexfilepos_t will also contain a valid character and line number.
startPos.pos specifies the exact byte position in the source data.
TABLE-US-00015 long length;
[0138] This is the length in bytes of text representing this
token.
[0139] An example usage of the file position information would be
to create a string representing the exact characters captured by
the token, such as:
std::string str(&pData[tok.startPos.pos], tok.length);
TABLE-US-00016 ulong idToken
[0140] This is the unique identifier for this token. The id is a
unique hash value defining the lexical token, or type, which the
lexer has recognized. In this implementation, the hashing program
is the system hasher used by many subsystems, WordHashG.
Knowledge Bases
[0141] As noted in FIG. 1, one of the components that is utilized by
system 100 is a knowledge base component 104. In the illustrated
and described embodiment, knowledge base component 104 is
implemented, at least in part, utilizing one or more files that are
defined in terms of a hierarchical, tag-based language which, in at
least some embodiments, can be used to set up cases of text that
matches incoming data or text, and define responses that are to be
triggered in the event of a case match. In the illustrated and
described embodiment, the tag-based language is referred to as
"Functional Presence Markup Language" or FPML. Effectively, the
FPML files are utilized to encode the knowledge that the system
utilizes.
[0142] FPML
[0143] The discussion provided just below describes aspects of the
FPML that are utilized by system 100 to implement various knowledge
bases. It is to be appreciated and understood that this description
is provided as but one example of how knowledge can be encoded and
used by system 100. Accordingly, other techniques and paradigms can
be utilized without departing from the spirit and scope of the
claimed subject matter.
[0144] Preliminarily, FPML is an extensible markup language (XML)
that can be utilized to define a surface-level conversational,
action-based, or information acquisition program. FPML can be
characterized as a stateful expression parser with the
expressiveness of a simple programming language. Some of the
advantages of FPML include its simplicity insofar as enabling
ordinary technical people to capture and embody a collective body
of knowledge. Further, FPML promotes extensibility in that deeper
semantic forms can be embedded in the surface level engine. In
addition, using FPML promotes scalability in that the system can be
designed to allow multiple robots to run on a single machine,
without significant performance degradation or inordinate memory
requirements. In this regard, it should be noted that one
application of the technology described in this document utilizes
robots, more properly characterized as bots, to provide
implementations that can be set up to automatically monitor and/or
engage with a particular cyberspace environment such as a chat room
or web page. The knowledge bases, through the FPML files, are
effectively utilized to encode the knowledge that is utilized by
the bots to interact with their environment.
[0145] As noted above, FPML allows a user to set up "cases" of
language text that match incoming sentences and define responses to
be triggered when the case matches. In accordance with various
embodiments, cases can be exact string matches, or more commonly
partial string matches, and more complicated forms. FPML also
supports typed variables that can be used for any purpose, for
example, to control which cases are allowed to fire at a given
time, thereby establishing a "state" for the program. Typed
variables can also be used to set and record information about
conversations that take place, as well as configuration settings
for one or more robots, as will become apparent below.
[0146] In accordance with one embodiment, any suitable types of
variables can be supported, e.g. string variables, floating point
variables, number variables, array variables, and date variables to
name just a few.
[0147] As noted above, FPML is a hierarchical tag-based language.
The discussion provided just below describes various specific tags,
their characteristics and how they can be used to encode knowledge.
Each individual tag is discussed under its own associated
heading.
Fpml Tag
[0148] The fpml tag is used as follows:
[0149] <fpml>
[0150] . . .
[0151] </fpml>
[0152] The FPML object is the top level tag for any fpml program.
It encloses or encapsulates all other tags found in a document or
file. The fpml tag can contain the following tags: <unit>,
<rem>, <situation>, <if>, <lexer> and
<load>. It should be noted that <rem name="variablename"
value="variableValue"> is used to specify initial variables for
the XML. When an FPML file is loaded, any <rem> whose
direct parent is <fpml> is evaluated. This mechanism is used
to set up initial values for variables and is used often. As an
example of the fpml tag, consider the following:
[0153] <fpml>
[0154] <unit>
[0155] <input>I like dogs</input>
[0156] <response>I like dogs too, <acq
name="name"/>!
[0157] </response>
[0158] </unit>
[0159] </fpml>
[0160] This example fpml file has one case, which recognizes the
string "I like dogs", and responds with "I like dogs too" followed
by the value of the variable "name", which by convention is the
name of the user.
Load Tag
[0161] The load tag is used as follows:
[0162] <load filename="path to file"/>
[0163] This instruction directs an fpml interpreter to load the
fpml file specified by "path to file". This path may be a fully
qualified path or a partial path relative to the FPML file in which
the <load>
tag appears. The load tag is contained in <fpml>, and does
not contain other tags as the tag should be open-closed. As an
example of the load tag, consider the following:
[0164] <!--Load the fpml program defined in braindead.fpml
!-->
[0165] <load filename="C:\fpml\braindead.fpml"/>
[0166] <load filename="\files\LAO10189-0003.fpml"/>
[0167] <load filename=".\words.fpml"/>
[0168] <load filename="words.fpml"/>
[0169] The first form loads a file from a fully qualified path. The
second form loads the file from a subdirectory of the directory in
which this file is located. The third loads from the current
directory, as does the fourth form.
Lexer Tag
[0170] The lexer tag is used as follows:
[0171] <lexer filename="path-to-file"/>
[0172] This instruction directs the fpml interpreter to load and
use the specified .lex file (described above) for breaking up
incoming text into word tokens. This is important because even
though fpml is a word-based parsing language, there is no absolute
definition of what constitutes a word. The lexer program can also
categorize words and surface this information to the fpml. This is
discussed in more detail below in connection with the <input>
tag reference. The lexer tag does not contain other tags and should
be open-closed, and is contained in the <fpml> tag. As an
example of the lexer tag, consider the following:
[0173] <lexer filename="C:\fpml\words.lex"/>
[0174] <lexer filename="\files\words.lex"/>
[0175] <lexer filename=".\words.lex"/>
[0176] <lexer filename="words.lex"/>
[0177] The first form loads from a fully qualified path. The second
form loads from a subdirectory "files" relative to the directory in
which the loading file lives. The third and fourth forms load the
file located in the same directory in which the loading file
lives.
Unit Tag
[0178] Use of the unit tag is as follows:
[0179] <unit>
[0180] . . .
[0181] </unit>
[0182] The unit tag is a "case" in the system whose subtags
identify the text that it matches, and the response that should be
taken in the presence of a match. The unit tag must contain the
following tags: <input> and <response>, and can
contain: <prev> and <prev_input>. The unit tag is
contained in the tags: <fpml>, <if>, <cond> and
<situation>.
[0183] The <input> tag is used to specify a text pattern to
match. The optional <prev> and <prev_input> tags
contain expressions that match previous dialog either from the user
or from a robot. The <response> tag specifies the output when
a match occurs. As an example of how this tag is used, consider the
following:
[0184] <unit>
[0185] <input>I like [.]</input>
[0186] <response>I like <wild index="1"/> too, <acq
name="name"/>!
[0187] </response>
[0188] </unit>
[0189] This example fpml file has one case, which recognizes the
string "I like [any single word]", and responds with "I like
%incoming-word% too" followed by the value of the variable "name",
which by convention is the name of the user.
Input Tag
[0190] Use of the input tag is as follows:
[0191] <input>text-input-expression</input>
[0192] The text contained within the input tag defines the words
and expressions which will trigger the response encapsulated by the
<response> tag. This tag contains text and no inner tags are
evaluated. The input tag is contained in the unit tag. Using the
"text-input-expression", the text contained within the
<input> tag can have a special format. It can be
characterized as a word-based regular expression. As an example of
how this tag can be utilized, consider the following:
[0193] <input>I like dogs</input>
[0194] This matches the sentence "I like dogs" and nothing else,
from the incoming text. Consider now the following use of this
tag:
[0195] <input>I like +</input>
[0196] <input>I like [+]</input>
[0197] This matches a sentence which begins with "I like" and is
followed by one or more words. Additionally, consider the following
example:
[0198] <input>I like *</input>
[0199] <input>I like [*]</input>
[0200] This matches a sentence which begins with "I like" and is
followed by zero or more words. It matches both "I like" and "I
like you over there". Further, consider the following example:
[0201] <input>I like [.]</input>
[0202] This matches a sentence which begins with "I like" and is
followed by any single word. For example, it matches "I like you",
but not "I like the pickles" or "I like". Thus, the expression [.]
matches a single word. Consider the following examples:
[0203] <input>* I like *</input>
[0204] <input>* I like +</input>
[0205] <input>+ I like +</input>
[0206] <input>* I like [.]</input>
As indicated above, input expressions can contain more than one
wildcard of any kind anywhere, as long as the wildcards are
separated by at least one space from the literals.
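The wildcard semantics described above (* for zero or more words, + for one or more words, [.] for exactly one word) can be sketched with a small recursive matcher in C++. This is an illustrative sketch only, not the fpml interpreter: the function names are hypothetical, words are split on whitespace, comparison is lowercased, and queried-wildcards such as ANY( . . . ) are not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits on whitespace and lowercases each word (an assumption of this sketch).
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string w; in >> w;) {
        for (char& c : w) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        words.push_back(w);
    }
    return words;
}

// Tries to match pattern tokens pat[p..] against input words in[i..].
bool match_from(const std::vector<std::string>& pat, std::size_t p,
                const std::vector<std::string>& in, std::size_t i) {
    if (p == pat.size()) return i == in.size();
    const std::string& tok = pat[p];
    const bool star = (tok == "*" || tok == "[*]");
    const bool plus = (tok == "+" || tok == "[+]");
    if (star || plus) {
        // Let the wildcard consume n words, starting from its minimum.
        for (std::size_t n = star ? 0 : 1; i + n <= in.size(); ++n) {
            if (match_from(pat, p + 1, in, i + n)) return true;
        }
        return false;
    }
    if (i == in.size()) return false;
    if (tok == "[.]" || tok == in[i]) return match_from(pat, p + 1, in, i + 1);
    return false;
}

bool matches(const std::string& pattern, const std::string& sentence) {
    return match_from(split_words(pattern), 0, split_words(sentence), 0);
}
```

The backtracking search is exponential in the worst case, which is acceptable for short sentences but would need a different strategy for long inputs.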
[0207] The input tag can also utilize embedded expressions.
Embedded expressions are bracketed with `[` and `]`. These bracketed
expressions are called queried-wildcards and are used to add
expressiveness to the input language. The format of this construct
is as follows:
[0208] [match-expression from-expression where-expression]
[0209] The following examples illustrate the match-expression
syntax:
[0210] [ANY(word1, word2, word3, . . . ) from `wildcard`] (where
wildcard is *, +, or .)
[0211] [ANY(word1, word2, . . . )] the wildcard `+` is implied
[0212] [ANY(W1, w2) AND NOT ANY(w3, w4 . . . ) from +|*|.]
[0213] [VAR(bot_name) from +]
[0214] The function ANY(word1, word2, word3, . . . ) matches any of
the specified words, e.g. <input>[ANY(books, magazines,
pictures) from +]</input> matches "books", "magazines" and
"pictures". The function ALL(word1, word2 . . . ) matches all of
the specified words. The function VAR(variableName) matches the
incoming string against a variable name, e.g.
<input>[VAR(bot_name) from .] +</input> recognizes the
bot name from the beginning of the sentence.
[0215] Consider also the function:
[0216] REGEX(perl5regularexpression);
[0217] <input>[REGEX(\$[\d]+(\.[\d]*)?) from +]</input>
[0218] This expression matches money amounts. The regular expression
operates on each word subsumed by the wildcard, looking for a match.
[0219] Various operators can be utilized within the input tag,
including the NOT, AND, OR, "ANY(w1, w2) AND NOT ANY(w3, w4)", and
"VAR(bot_name)" operators.
[0220] The from-expression operator is optional. It specifies the
wildcard of the queried-wildcard expression:
[0221] from + // one or more
[0222] from . // exactly one.
[0223] If the from-expression is not specified, it is assumed to be
the `+` wildcard.
[0224] The Operator Where-Expression
[0225] The where-expression is used to constrain the match further.
Currently this is used to constrain a match to a given lexical
token type. For instance if an application is looking for e-mails,
it could create a pattern that accepts only e-mail types, as
created by the lexer.
[0226] [. WHERE TYPE==EMAIL ]
[0227] This queried-wildcard expression would match any word whose
type is EMAIL.
[0228] The lexer, in addition to splitting words and sentences,
also produces tokens, which are characterizations of the graphic
word. A lexical analyzer can, for example, recognize URLs, IP
addresses, Dollars, and the like, as noted above.
[0229] This information is available to the pattern matcher and can
be used to match "types" of data. Consider the following
example:
[0230] <unit>
[0231] <input>* [where TYPE==URL] *</input>
[0232] <response>URL: <wild
index="2"/></response>
[0233] </unit>
[0234] This unit matches any sentence containing a URL. In this
example, the response is to simply provide the URL back to the
user. A more complicated example can look for a particular URL. As
an example, consider the following:
[0235] <unit>
[0236] <input>* [REGEX(spectreai) from . where TYPE==URL]
*</input>
[0237] <response>URL: <wild
index="2"/></response>
[0238] </unit>
[0239] This unit looks for URLs containing the string "spectreai"
anywhere in them.
[0240] In an implementation example, matching can proceed in a case
insensitive way. For a given sentence, all the <unit>s are given a
chance to fire (assuming an <if> or <cond> does not prevent this).
Given this, it is likely that there may be more than one match for
a given string. For example:
[0241] <unit>
[0242] <input>*</input>
[0243] <response>I don't understand you</response>
[0244] </unit>
[0245] <unit>
[0246] <input>+ what is your name +</input>
<response>My name is <acq name="bot_name"/></response>
[0247] </unit>
[0248] Suppose an incoming sentence is "Hey, what is your name dude?".
Both of these patterns actually match this string. Desirably,
however, one wants the second pattern to evaluate. Given that the
matcher is probabilistic, the second match, the one which
recognizes the most known text, is chosen. The general idea is that
the end-user should not have to worry about this. Picking the best
match is the responsibility of the fpml interpreter. In the event
of identical patterns, or identical probabilistic matches, the
match that is loaded last wins. Consider the following example:
[0249] <unit>
[0250] <input>+ what is your name +</input>
[0251] <response>My name is <acq
name="bot_name"/></response>
[0252] </unit>
[0253] <unit>
[0254] <input>+ what is your name +</input>
[0255] <response>Who cares!</response>
[0256] </unit>
[0257] They both match the same text with the same probability.
However, as the second match was the last loaded, the second will
fire.
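The selection rule can be sketched in C++ as follows. The source states only that the match recognizing the most known text is chosen and that the last loaded unit wins a tie; scoring a unit by its count of literal words is an assumption of this sketch, and the Candidate structure and pick_best name are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Candidate {         // hypothetical record for one loaded <unit>
    std::string response;  // text of the <response>
    bool matched;          // did the <input> pattern match the sentence?
    int knownWords;        // assumed score: literal words recognized
};

// Returns the index of the winning unit, or -1 if nothing matched.
// Candidates are given in load order.
int pick_best(const std::vector<Candidate>& loaded) {
    int best = -1;
    for (std::size_t i = 0; i < loaded.size(); ++i) {
        if (!loaded[i].matched) continue;
        // ">=" lets a unit loaded later win over an equal earlier score.
        if (best < 0 ||
            loaded[i].knownWords >= loaded[static_cast<std::size_t>(best)].knownWords) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

For the sentence "Hey, what is your name dude?", a catch-all "*" unit would score 0 under this assumption while "+ what is your name +" would score 4, so the second unit is selected.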
Prev Tag
[0258] Use of this tag is as follows:
[0259] <prev>text-input-expression</prev>
[0260] The <prev> element is part of the <unit> tag and
declares a constraint on the matcher. In order for a sentence to
match this unit, the
<prev>"text-input-expression"</prev> must also match
what the robot said previously. That is, the unit will match ONLY
if what the robot said prior to the current input can match against
"text-input-expression".
[0261] The format for text-input-expression is identical to the
format of data in the <input> tag, thus attention is directed
to the input tag for details on syntax. The prev tag has an
optional index attribute which specifies how many places back to go
in a robot's conversation history to find a match. The default
value is one. This means that the last sentence the robot said must
match against the text-input-expression in order for the
<unit> to match. If the index attribute is less than zero,
e.g. <prev index="-5">* yes *</prev>, then all of
the past five sentences of the robot history will be matched. If
any are matched, the unit will be allowed to match the
<input> tag.
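The index semantics can be sketched in C++ as follows. This is a hedged illustration, not the interpreter's code: the function name prev_constraint_holds is hypothetical, the history is assumed to be stored oldest first, the pattern test is passed in as a callback, and an index of zero (which the text does not define) is treated as no match. The same logic would apply to the user history consulted by the prev_input tag.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// history holds past sentences, oldest first.
// index > 0: only the sentence `index` places back is tested.
// index < 0: each of the last |index| sentences is tested; any hit suffices.
bool prev_constraint_holds(const std::vector<std::string>& history, int index,
                           const std::function<bool(const std::string&)>& matches) {
    if (index == 0) return false;  // assumption: zero is not defined by the text
    const std::size_t span = static_cast<std::size_t>(index < 0 ? -index : index);
    if (index > 0) {
        if (span > history.size()) return false;  // not enough history yet
        return matches(history[history.size() - span]);
    }
    const std::size_t count = std::min(span, history.size());
    for (std::size_t k = 1; k <= count; ++k) {
        if (matches(history[history.size() - k])) return true;
    }
    return false;
}
```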
[0262] Consider the following FPML example of a conversation
relating to going to a movie.
[0263] <unit>
[0264] <input>yes</input>
[0265] <prev index="1">* go to a movie *</prev>
[0266] <response>which one?</response>
[0267] </unit>
[0268] <unit>
[0269] <input>* matrix *</input>
[0270] <prev>which one</prev>
[0271] <response>The Matrix it is. When?</response>
[0272] </unit>
[0273] <unit>
[0274] <input>*</input>
[0275] <prev>* the matrix it is * when *</prev>
[0276] <response>Sounds good</response>
[0277] </unit>
[0278] Example dialog:
[0279] robot>do you want to go to a movie?
[0280] user>yes
[0281] robot>which one?
[0282] user>I like the matrix
[0283] robot>The matrix it is. when?
[0284] user>11:30
[0285] robot>Sounds good.
Prev_Input Tag
[0286] Use of the prev_input tag is as follows:
[0287] <prev_input
index="1">text-input-expression</prev_input>
[0288] The <prev_input> element is part of the <unit>
tag and declares a constraint on the matcher. In order for a
sentence to match this unit, the "text-input-expression" must also
match what the user said previously. That is, the unit will match
ONLY if what the user said prior to the current input can match
against "text-input-expression".
[0289] The format for text-input-expression is identical to the
format of data in the <input> tag. Thus, the reader is
referred to the discussion of the input tag for details on
syntax.
[0290] The prev_input tag has an optional index attribute which
specifies how many places back to go in the user's history to find
a match. The default value is one, which means that the last
sentence the user said must match against the text-input-expression
in order for the <unit> to match.
[0291] If the index attribute is less than zero, e.g.
<prev_input index="-5">* yes *</prev_input>, then all
of the past five sentences of the user history will be matched. If
any are matched, the unit will be allowed to match the
<input> tag.
[0292] This tag contains text-expression just like the input
expression and is contained in: <unit>.
Response Tag
[0293] Use of the response tag is as follows:
[0294] <response>
[0295] </response>
[0296] The response tag holds elements that will evaluate when the
<input> (and <prev . . . >) tags generate the best match for a
given sentence. In some embodiments, the response tag defines what
the robot will say or record. This tag is contained in:
<unit> and can contain: text, as well as the following tags:
<cond>, <rand>, <op>, <if>, <acq>,
<rem>, <cap>, <hearsay>, <impasse>,
<lc>, <uc>, <sentence>, <swap_pers>,
<swap_pers1>, <rwild>, <wild>, <recurs>,
and <quiet>.
If Tag
[0297] Use of this tag is as follows:
[0298] <if name="variableName"
value="text-input-expression">
[0299] fpml-tags
[0300] </if>
[0301] <if expr="script-expression">
[0302] </if>
[0303] The if tag is used to control execution flow. If the
specified variable can be evaluated against the value, the
contained nodes are turned on. If not, the contained nodes are not
executed. Variables and the <if> expression allow the FPML
programs to run in a stateful way. This tag can contain the
following tags: <unit>, <if>, and <situation>,
and all tags the response tag can contain. This tag is contained in
the following tags: <fpml>, <response>, all tags the
response tag can contain, <if> and <situation>. The if
tag can be used as an intra-unit tag to control program flow. As
an example, consider the following:
[0304] <if name="name" value="* tommy *">
[0305] <unit>
[0306] <input>* HI *</input>
[0307] <response>It has been a long time. still working on
the documentation</response>
[0308] </unit>
[0309] <unit>
[0310] . . .
[0311] </unit>
[0312] </if>
[0313] In this situation, the units contained within the <if>
statement will only be evaluated if the variable "name" (by
convention, the name of the user) contains "tommy". Although this
is an elementary example, it shows how to use arbitrary variables
to control program flow.
[0314] The value=" . . . " attribute of the <if> tag can be
any expression that is valid in the <input> text. It can also
be "?". When value is `?`, the conditional evaluates to true if the
variable is set and is false otherwise. This construct can be used
in <if>, <cond>, and <situation>. Alternatively,
the <if> tag can use "expr=" instead of name and value pairs.
This allows code expressions to be used to perform the test.
Additionally, the <if> tag can be used to control program
flow in the <response> tag. As an example, consider the
following:
[0315] <unit>
[0316] <input>*</input>
[0317] <response>
[0318] Hello there.
[0319] <if name="vTalkative" value="true">
[0320] Goodness, my. It is a lovely day. I wonder where the other
people are. I love to chat.
[0321] </if>
[0322] How are you?
[0323] </response>
[0324] </unit>
[0325] In this simple example, if vTalkative is set to "true", then
the text enclosed by the if tag is added to the response string.
Situation Tag
[0326] Use of this tag is as follows:
[0327] <situation name="input-text-expression">
[0328] The situation tag is another program control tag and is used
to control which units get precedence over all other units. It is
useful in managing discourse. However, it is not used in the
<response> tag. This tag can contain the following tags:
<unit> and <if>, and can be contained in: <fpml>
and <if>.
[0329] As an example, consider the situation "computers" below:
[0330] <unit>
[0331] <input>* Computers *</input>
[0332] <response>Let's talk about computers.
<quiet><rem
name="situation">computers</rem></quiet></response>
[0333] </unit>
[0334] <situation name="* computers *">
[0335] <unit>
[0336] <input>[ANY(buy, purchase, lease,
rent)]</input>
[0337] <response>I've had success with Dell. Can go to dell
online at www.dell.com</response>
[0338] </unit>
[0339] <unit>
[0340] <input>[ANY(crash, crashed, crashing,
bomb)]</input>
[0341] <response>Which operating system are you
running?</response>
[0342] </unit>
[0343] <unit>
[0344] <input>* XP *</input>
[0345] <prev_input>[ANY(crash, crashed, crashing,
bomb)]</prev_input>
[0346] <response>which program?</response>
[0347] </unit>
[0348] . . .
[0349] <unit>
[0350] <input>*</input>
[0351] <response>We were talking about computers. would you
like to talk about something else?
[0352] </response>
[0353] </unit>
[0354] </situation>
[0355] The situation tag provides a way to encapsulate a particular
subject and protect it somewhat from outside <unit>s. Matching
remains probabilistic: the <input>*</input> unit in the above
situation fires only if no other <unit>s in the global space
produce a better match.
[0356] In the above example, <situation name="* computers *"> is
syntactically equivalent to this IF statement:
[0357] <if name="situation" value="* computers *">.
Response Tags
[0358] As noted above, tags within the <response> generate
output or record information. With a couple of exceptions, such as
<cond>, every valid response tag can contain all other tags
located within the response.
Rand Tag
[0359] Use of this tag is as follows:
[0360] <rand>
[0361] <op>response-expression(1)</op>
[0362] <op>response-expression(2)</op>
[0363] <op>response-expression(3)</op>
[0364] </rand>
[0365] The rand tag picks one of its sub-elements at random and
uses it to generate the response. This tag is contained in
<response> and contains <op>. As an example of this
tag's use, consider the following: TABLE-US-00017
<unit>
  <input>HI + </input>
  <response>
    <rand>
      <op>Hello<acq name="name"/>!!!</op>
      <op>Hidy ho!</op>
      <op>Cheers!</op>
    </rand>
    <rwild/>
  </response>
</unit>
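The behavior of the rand tag can be sketched in a few lines of C++. The source does not specify how the random choice is made; a uniform pick with std::mt19937 is an assumption of this sketch, and the function name pick_random_op is hypothetical. Each string stands for the already evaluated body of one <op>.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Picks one <op> body uniformly at random.
// Returns an empty string when there are no options.
std::string pick_random_op(const std::vector<std::string>& ops, std::mt19937& rng) {
    if (ops.empty()) return std::string();
    std::uniform_int_distribution<std::size_t> pick(0, ops.size() - 1);
    return ops[pick(rng)];
}
```

Passing the generator in by reference keeps the choice reproducible in tests when a fixed seed is used.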
Cond Tag
[0366] Use of this tag is as follows:
[0367] <cond>
[0368] The cond tag allows for conditional evaluation inside the
<response> tag. It is a complicated form and has three levels
of expressivity. The first level of expressivity is where it is
identical to the <if> tag and can assume the same places and
locations. For example,
[0369] <cond name="variableName"
value="text-input-expression">
[0370] <cond expr="script-expression">.
[0371] The second level of expressivity is where the cond tag
identifies the variable name, but not the variable value. In this
case, the cond tag should contain only <op> tags. Each op tag
will define the value field. The <op> which matches best is
chosen for the evaluation. As an example, consider the following:
TABLE-US-00018
<unit>
  <input>SERVICECONNECTED</input>
  <response>
    <cond name="bot_name">
      <op value="ScoobyDruid">
        /nickserv identify oicu812
        <impasse>!MASTER \0304I took care of the privacy and the identity for you sir </impasse>
      </op>
      <op value="MonkeyKnuckles">
        /nickserv identify oicu812
        <impasse>!DELAY 1</impasse>
        <impasse>!MASTER \0304I took care of the privacy and the identity for you sir.</impasse>
      </op>
      <op>
        <impasse>!MASTER \0304This nick is not registered</impasse>
      </op>
    </cond>
  </response>
</unit>
[0372] In the third level of expressivity, <cond> has no
attributes, and each <op> field will have both a "name" and
"value" attribute. As an example, consider the following:
TABLE-US-00019 <unit>
<input>SERVICECONNECTED</input> <response>
<cond> <op name="bot_name" value="ScoobyDruid">
/nickserv identify oicu812 <impasse>!MASTER \0304I took care
of the privacy and the identity for you sir </impasse>
</op> <op name="bot_name" value="MonkeyKnuckles">
/nickserv identify oicu812<impasse>!DELAY
1</impasse><impasse>!MASTER \0304I took care of the
privacy and the identity for you sir.</impasse> </op>
<op><impasse>!MASTER \0304This nick is not
registered</impasse> </op> </cond>
</response> </unit>
[0373] Note that both forms have exactly the same behavior. There
can also be default behavior for <cond> case expressions.
Consider the example just below. If the variable "name" does not
exist (via "?" construct), nothing is output. The default case is
the last <op> tag without any expression. It is evaluated only
if no case above it fires. TABLE-US-00020
<unit> <input>* HI *</input> <response>
Hello <cond name="name"> <op value="?"></op>
<op>,<acq name="name"/></op> </cond> .
</response> </unit>
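The case selection described for the cond tag, including the "?" construct and the default case, can be sketched as follows. This is a simplified Python illustration under stated assumptions: values are compared by exact string equality (the text says the best-matching <op> is chosen, which may involve pattern matching), the first passing case wins, and the name eval_cond and the tuple layout are hypothetical.

```python
def eval_cond(variables, ops, cond_name=None):
    """Return the response of the first <op> whose test passes.

    ops is a list of (name, value, response) tuples. A name of None means
    the variable name comes from the enclosing <cond>. A value of None
    marks the default case. A value of "?" is taken to mean "the variable
    does not exist" (assumption based on the example in the text).
    """
    default = None
    for name, value, response in ops:
        if value is None:
            default = response  # remembered, used only if nothing else fires
            continue
        current = variables.get(name or cond_name)
        if value == "?":
            if current is None:
                return response
        elif current == value:
            return response
    return default if default is not None else ""

# Second and third levels of expressivity give the same result.
level2 = [(None, "ScoobyDruid", "A"), (None, "MonkeyKnuckles", "B"), (None, None, "C")]
level3 = [("bot_name", "ScoobyDruid", "A"), ("bot_name", "MonkeyKnuckles", "B"), (None, None, "C")]
state = {"bot_name": "MonkeyKnuckles"}
same = eval_cond(state, level2, "bot_name") == eval_cond(state, level3)
```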
Op Tag
[0374] Use of the op tag is as follows: TABLE-US-00021
<op>fpml-response</op> <op
value="variableValue">fpml-response</op> <op
name="variableName"
value="variableValue">fpml-response</op>
[0375] This tag is used to express a conditional or random "case"
for output. See, e.g. <cond> and <rand> for usage. This
tag contains text and any valid response tag, and is contained in
<cond> and <rand>.
Rem Tag
[0376] Use of this tag is as follows: TABLE-US-00022 <rem
name="varName" value="varValue"/> <rem
expr="script-expression"> <rem name="varName">The Variable
Value</rem>
[0377] This tag is used to set a variable to a specified value. The
names and values are arbitrary and can be any value. This tag can
contain text and any tag which is valid within the <response>
tag, and is contained in <fpml> (for variable initialization)
and <response> (for setting new variables). As an example of
this tag's usage, consider the following: TABLE-US-00023
<fpml> <rem name="bot_name" value="Mr. Z"/> <rem
name="bot_favorite_color" value="purple"/> ...
[0378] When the fpml loads, these variables are initialized to
these values. Additionally consider the following example:
TABLE-US-00024 <unit> <input>Let * talk about
the+</input> <response> Sounds great.
<quiet><rem name="situation"><wild
index="2"/></rem></quiet> Do you have strong
feelings about <wild index="2"/>? </response>
</unit>
[0379] Within the unit tag, this sets the "situation" to the
wildcard, and asks a general question.
Acq Tag
[0380] Use of this tag is as follows:
[0381] <acq name="variableName"/>
[0382] This tag is used to retrieve a variable value and contains
no other tags. This tag is contained in <response> or any
valid response tag except <cond> and <rand>. As an
example of this tag's use, consider the following: TABLE-US-00025
<unit> <input>* HI *</input> <response>Well
hello, <acq name="name"/></response> </unit>
Quiet Tag
[0383] Use of this tag is as follows:
[0384] <quiet>
[0385] This tag is used to evaluate inner tags but to nullify the
response these tags generate. This tag contains any valid tag
within the <response>, and is contained in <response>
and any valid tag within the <response>. As an example of
this tag's use, consider the following: TABLE-US-00026 <unit>
<input>* computers *</input> <response> I am a
computer. <quiet><rem
name="situation">computers</rem></quiet>
</response> </unit>
[0386] Without the quiet tag, the text "computers" would be added
to the response. With the quiet tag, it is not.
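The interaction between the rem and quiet tags can be sketched as follows. This Python fragment is only an illustration: the functions rem and quiet are hypothetical stand-ins for the tag evaluators, and the assumption that rem echoes its value into the response is drawn from the statement that the text "computers" would otherwise be added.

```python
def rem(variables, name, value):
    """Set a variable and return its value as output text (assumed behavior)."""
    variables[name] = value
    return value

def quiet(_inner_output):
    """Discard the output of the already-evaluated inner tags."""
    return ""

variables = {}
# Inner tags are evaluated first (the variable is set), then the text is dropped.
response = "I am a computer." + quiet(rem(variables, "situation", "computers"))
```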
Wild Tag
[0387] Use of this tag is as follows: TABLE-US-00027 <wild/>
<wild index="1 based wildcard index"/>
[0388] This tag is used to retrieve the value of the wildcards that
are unified in the <input> expression. This tag contains no
tags and is contained in <response> or any valid response
tags. As an example of this tag's use, consider the following:
TABLE-US-00028 <unit> <input>* I like *</input>
<response><recurs><wild
index="1"/></recurs>. I like <wild index="2"/>
</response> </unit>
Rwild Tag
[0389] Use of this tag is as follows:
[0390] <rwild/>
[0391] <rwild index="1 based wildcard index"/>
[0392] The engine supports recursion of responses. There are two
recursion tags <recurs> and <rwild>. These tags submit
their evaluations back into the engine for response. This filtering
mechanism allows language syntax to be reduced iteratively.
The <rwild/> instruction is used to recurse on the first wildcard,
and <rwild index="2"/> is used to recurse on the second
wildcard. This tag contains no other tags and is contained in
<response> or any response sub element. As an example of this
tag's use, consider the following: TABLE-US-00029 <unit>
<input>THE *</input>
<response><rwild/></response> </unit>
[0393] This generic pattern will be matched only if no other better
match is found. In this case, the determiner is stripped off and
the text is resubmitted for evaluation with the hope that the
engine will better recognize the entity without the determiner.
Recurs Tag
[0394] Use of this tag is as follows:
[0395] <recurs>fpml-response</recurs>
[0396] As noted above, the engine supports recursion of responses
and this is the other of two recursion tags. These tags submit
their evaluations back into the engine for response. This filtering
mechanism allows syntax to be scraped iteratively. The
<recurs> instruction, unlike <rwild>, can contain
elements. These elements are evaluated and the resulting text is
resubmitted as input to the fpml interpreter. As an example of this
tag's use, consider the following: TABLE-US-00030 <unit>
<input>DO YOU KNOW WHO * IS</input>
<response><recurs>WHO IS
<wild/></recurs></response> </unit>
[0397] This example takes a more complex grammatical form and
reduces it to a more generic form. Consider the synonym rewrite as
follows: TABLE-US-00031 <unit> <input>HI
THERE</input>
<response><recurs>HELLO1</recurs>
</response> </unit> <unit>
<input>Aloha</input>
<response><recurs>HELLO1</recurs>
</response> </unit> <unit>
<input>HIYA</input>
<response><recurs>HELLO1</recurs>
</response> </unit> ..
[0398] This example allows for a complicated hello response,
without having to duplicate the response expression across a
variety of units.
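The synonym rewrite above can be sketched as a lookup that resubmits its result until a non-recursive response is reached. This is a minimal Python illustration: the dictionary layout, the respond function, the depth limit, and the text of the HELLO1 response are all assumptions made for the example, and exact-match lookup stands in for the engine's pattern matching.

```python
RECURS = "recurs"
TEXT = "text"

# Each input maps to (kind, payload). The HELLO1 response text is hypothetical.
units = {
    "HI THERE": (RECURS, "HELLO1"),
    "ALOHA": (RECURS, "HELLO1"),
    "HIYA": (RECURS, "HELLO1"),
    "HELLO1": (TEXT, "Hello! How are you today?"),
}

def respond(text, max_depth=10):
    """Resolve text to a response, following <recurs>-style resubmission."""
    key = text.strip().upper()
    for _ in range(max_depth):
        if key not in units:
            return ""  # no unit matched
        kind, payload = units[key]
        if kind != RECURS:
            return payload
        key = payload.upper()  # resubmit the evaluated text as new input
    raise RecursionError("recursion limit reached")
```

The depth limit is a safeguard added for the sketch; the text does not say how the engine handles units that recurse onto each other.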
Impasse Tag
[0399] Use of the impasse tag is as follows:
[0400] <impasse>
[0401] This tag in the <response> element forces a callout to
the calling application with the evaluation text of its inner
elements. This is used to communicate information to the outer
application. This tag is contained in <response> or any
response sub element, and contains text or any response sub
element. A command structure can be utilized that uses the impasse
tag to trigger application specific operations.
Cap Tag
[0402] This tag capitalizes the first letter of the output of all
its contained elements or text and is contained in any response
tag, and contains any response tag/text. As an example of its use,
consider the following:
[0403] <cap>United States</cap>
[0404] output: United States
<Lc> and <Uc> Tags
[0405] These tags make the output of the contained elements all
lower case <lc> or upper case <uc>.
<Sentence> Tag
[0406] This <response> tag will convert the contained text
and elements into a sentence form.
<Swap_Pers> Tag
[0407] This tag transforms inner elements from first person into
second person.
<Swap_Pers1> Tag
[0408] This tag transforms inner elements and text from second
person to third person.
Script Expressions
[0409] In accordance with one embodiment, the FPML runtime
(discussed in more detail below) can support assignment and
conditional testing expressions. The syntax is ECMAScript, but it
does not include the ability to have control statements, or
functions.
[0410] Script expressions are added to fpml through the
expr="script expression" attribute. This attribute is valid in the
following tags: <if>, <cond>, <op>, <rem>
and <acq>. As an example of this attribute's use, consider the
following:
[0411] <if expr="(var1 ==1 && var2==2.0)">
[0412] <if expr="myVar ==myVar1 +1 && profile [key_name]
==`george`"/>
[0413] <rem expr="key_name =0; profile_array[key_name] =`Mark`;"
/>
[0414] If one wishes to add more than one assignment expression in
a single "expr" attribute, this is possible, by separating the
expression statements with a semicolon `;`. This is useful for
creating <rem> expressions which initialize a whole bunch of
variables. If the rem expression is in the top level of the file,
it will be evaluated when the FPML is instantiated in, for example,
a bot. As an example, consider the following: TABLE-US-00032
<fpml> <rem expr=" likes_cooking = 0; likes_eating = 1;
likes_gasgrill = 2; profile[likes_cooking] = 0.0;
profile[likes_eating] = 0.5; profile[likes_gasgrill] = 1.0; " />
</fpml>
[0415] In this case, the <rem> expression will be evaluated
on bot startup, and all those variables initialized to these
values.
[0416] Variables are loosely typed and can transform to new types
without explicit operators. New variables can be created on the fly
and are case sensitive. For example, "Var1" is not the same as
"var1". Numbers are created simply by assigning a numeric value to
a variable, e.g. Var1=1, Var1=1.23445 and the like. Strings are
created by using single quotes, e.g. Var1=`Mark`. Arrays are
created simply by indexing a variable. If the variable exists, it
will be retyped as an array; if the index is greater than the size
of the array (initially 0 length), the array will grow dynamically,
e.g. profile[0]=0.95; profile[1]=0.50; profile[2]=0.25;
likes_food=0; likes_beach=1; likes_coffee=2;
profile[likes_food]=0.95; profile[likes_beach]=0.50;
profile[likes_coffee]=0.25; profile[likes_coffee+1]=`mark`.
[0417] Array elements are full-fledged variables and can be loosely
typed. Index[0] can be a string, while Index[1] can be floating
point, Index[2] can be another array, and the like. The expression
system does not impose a limit on the dimension of the array, e.g.
Array[0][1][0][0] =`true` is valid.
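The dynamic array growth and loose typing described above can be sketched as follows. This Python fragment is an illustration only: the helper name assign is hypothetical, and padding skipped positions with None is an assumption, since the text does not say what unassigned elements hold.

```python
def assign(array, index, value):
    """Assign array[index] = value, growing the list when index is past the end."""
    if index >= len(array):
        # Assumption: positions skipped over are filled with None.
        array.extend([None] * (index + 1 - len(array)))
    array[index] = value
    return array

profile = []  # initially zero length
likes_food, likes_beach, likes_coffee = 0, 1, 2
assign(profile, likes_food, 0.95)
assign(profile, likes_beach, 0.50)
assign(profile, likes_coffee, 0.25)
assign(profile, likes_coffee + 1, "mark")  # elements may hold different types
```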
[0418] In one embodiment, the following operators and grouping
constructs are supported by the expression evaluator. Precedence
rules of ECMAScript have been adopted.
[0419] = assignment operator
[0420] == comparison operator
[0421] != comparison operator
[0422] ( ) grouping operator
[0423] > greater than
[0424] < less than
[0425] >= gte
[0426] <= lte
[0427] && Logical AND
[0428] || Logical OR
[0429] [] Array index
[0430] `xx` const string--v=`mark`
[0431] {. . . , . . . } constant array--v={0, 1, 2, 3, 4, 5};
[0432] ! logical not
[0433] + add operator
[0434] - subtract operator
[0435] * multiply operator
[0436] / divide operator
[0437] The following keywords are currently defined for the
language: true, false.
Probabilistic Expression Matching
[0438] Having now considered the above discussion of the functional
presence engine and the knowledge bases, consider the following. As
can surely be appreciated, FPML, at a basic level, can be used to
define a list of regular expressions which trigger a response when
incoming data is matched against the expression. It is desirable
that the matching process be as smart as possible insofar as its
ability to handle collisions. A collision occurs when incoming text
matches two or more FPML units. To address the issue of collisions
and in accordance with one embodiment, a statistical or
probabilistic methodology is utilized. For example, in accordance
with this embodiment, instead of returning Boolean values, the
process can return a probabilistic score that identifies how closely
the input text matches the particular knowledge base unit. If
the scoring methodology is sound, then unit interdependence is much
less of an issue and the highest ranking FPML unit which matches
the incoming text is also guaranteed to be the most semantically
relevant to the text and thus captures the most information of all
competing knowledge base units.
[0439] As noted above, more than one <unit> may unify
successfully against incoming text. This is expected and in some
instances desirable. The FPML Runtime (discussed in more detail
below), uses a probabilistic methodology to choose the best
unification among competing units. The best unification, in
accordance with one embodiment, is the unification that provides
the best semantic coverage for the incoming text. This is achieved,
in this embodiment, by scoring exact graphic word matches at a high
value and scoring wildcard matches lower. As an example, consider
the following FPML: TABLE-US-00033 <!-- input 1 -->
<input>OSAMA IS EVIL</input> ... <!-- input 2 -->
<input>* OSAMA * </input> <!-- input 3 -->
<input>OSAMA IS *</input>
[0440] Given the string "OSAMA IS EVIL", more semantic information
is uncovered by selecting input 1 as the best unification. Given
the string "OSAMA IS MOVING", more semantic information is
uncovered by selecting input 3, which matches two graphic words
where input 2 matches only one. Semantic context is garnered when
graphic words in the <input> match graphic words in the
incoming text. The more graphic words which match exactly, the more
semantic information is uncovered. Thus, a generalization is that
for any incoming text, one wants to match it to an input which
uncovers the most graphic words either directly or through a
functional process.
[0441] Consider the following mathematical approach. Each
<input> expression E is represented by (e.sub.1 . . .
e.sub.k) terms, where each term can be either a graphic word, a
wildcard type, or an embedded functional expression, and k is the
total number of terms. Given this, consider the following table
which separates four expressions into their component terms:
TABLE-US-00034
OSAMA is evil   OSAMA is a *   *OSAMA*      [ANY(OSAMA, USAMA) from+]
e1 = osama      e1 = osama     e1 = KSTAR   e1 = STAR + Fn(ANY(OSAMA, USAMA))
e2 = is         e2 = is        e2 = osama
e3 = evil       e3 = a         e3 = KSTAR
                e4 = KSTAR
[0442] Each incoming sentence S is split into words (w.sub.1 . . .
w.sub.n), where each word represents a graphic word as defined by
the lexical analyzer and n is the total number of words in the
sentence. Thus, "Osama is a evil man" breaks down as follows:
[0443] w.sub.1=Osama
[0444] w.sub.2=is
[0445] w.sub.3=a
[0446] w.sub.4=evil
[0447] w.sub.5=man
[0448] The unifier takes an <E,S> and attempts to create a
resultant list R of size k where r.sub.i is a list of words
subsumed by e.sub.i. If such a set R can be produced, S can be said
to unify against E. The incoming text "Osama is moving out" unifies
against three of the 4 specified inputs as follows:
TABLE-US-00035
OSAMA *moving*     *OSAMA*                   [ANY(OSAMA, USAMA) from+]
r.sub.1 = osama    r.sub.1 = (empty)         r.sub.1 = STAR + Fn(ANY(OSAMA))
r.sub.2 = is       r.sub.2 = Osama           r.sub.1 = Osama is moving out
r.sub.3 = moving   r.sub.3 = is moving out
r.sub.4 = out
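The construction of the resultant list R can be sketched with a small backtracking unifier. This Python fragment is a simplified illustration, not the engine's algorithm: it handles only graphic words, "*" (treated as KLEENSTAR, zero or more words) and "+" (treated as STAR, one or more words), tries the shortest wildcard capture first, and omits functional expressions. The name unify and the list-of-lists representation of R are assumptions.

```python
def unify(terms, words):
    """Return R (one list of subsumed words per term), or None if S does not unify."""
    if not terms:
        return [] if not words else None
    head, rest = terms[0], terms[1:]
    if head in ("*", "+"):
        start = 0 if head == "*" else 1
        for n in range(start, len(words) + 1):  # shortest capture first
            tail = unify(rest, words[n:])
            if tail is not None:
                return [words[:n]] + tail
        return None
    # Graphic word: must match the next incoming word exactly (case-insensitive).
    if words and words[0].lower() == head.lower():
        tail = unify(rest, words[1:])
        if tail is not None:
            return [[words[0]]] + tail
    return None

sentence = "Osama is moving out".split()
r_a = unify(["OSAMA", "*", "moving", "*"], sentence)
r_b = unify(["*", "OSAMA", "*"], sentence)
r_c = unify(["OSAMA", "IS", "EVIL"], sentence)  # does not unify
```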
[0449] In this example, the FPML engine needs to make a decision
about which is the best unification. It is easy to observe that
"OSAMA *moving*" would be the <input> which uncovers the
greatest semantic context. Thus, this is the preferred unification.
Using R, a probability is calculated by assigning a score to each
r.sub.i, summing the scores, and dividing by the number of words
in the input (n) plus the number of KLEENSTAR matches which unify
against nothing.
[0450] In accordance with one embodiment, there are two methods
that can be utilized for assigning scores. The first method is an
ad-hoc method that works well in the absence of word statistics. As
an example, consider the following:
[0451] Score.sub.i=
[0452] If E.sub.i is a graphic word type, the score for r.sub.i is
0.95.
[0453] If E.sub.i is a MATCHONE wildcard type, the score is
0.7.
[0454] If E.sub.i is a STAR (one or more), the score is 0.55 times
the number of words in r.sub.i.
[0455] If E.sub.i is a KLEENSTAR, the score is 0.45 times the
number of words in r.sub.i.
[0456] If E.sub.i is a functional type, the score is dependent on
the function. This value is usually calculated by adding high
scores for terms that match the function, and low scores for extra
terms.
[0457] The second method can utilize term weights, such as inverse
document frequency. Here, the graphic words can be assigned scores
based on the semantic information returned by the word and do not
need to be a constant. As an example, consider the following.
[0458] Count.sub.i=
[0459] If E.sub.i is a graphic word or MATCHONE, 1.
[0460] If E.sub.i is STAR, number of terms captured by the
wildcard.
[0461] If E.sub.i is KLEENSTAR and number of terms is greater than
0, number of terms captured by the wildcard.
[0462] If E.sub.i is KLEENSTAR and the number of terms is 0, 1.
[0463] The score is thus computed as follows:
$$\mathrm{Prob}(E \mid S) = \frac{\sum_{i=1}^{k} \mathrm{Score}(r_i)}{\sum_{i=1}^{k} \mathrm{Count}(r_i)}$$
$$\mathrm{Prob}(E \mid S) = \sum_{i=1}^{k} \mathrm{Score}(r_i)$$
[0464] The first equation is the normalized approach. In this
methodology, scores from different inputs can be compared to each
other in a meaningful way.
[0465] In many applications, relative comparisons among different
inputs are not necessary, and there are some consequences of the
normalization methods related to matching. Hence, the second
equation constitutes an unnormalized variant, to remove these side
effects.
[0466] Using the above scoring equation, the scores are calculated
as follows:
[0467] OSAMA * moving *
[0468] Osama is moving out
[0469] (0.95 + 0.45 + 0.95 + 0.45)/(1+1+1+1) = 0.7
[0470] *OSAMA*
[0471] Osama is moving out
[0472] (0 + 0.95 + 0.45 + 0.45 + 0.45)/(1+1+3) = 0.46
[0473] [ANY(OSAMA, USAMA) from +]
[0474] Osama is moving out
[0475] (0.95 + 0.55 + 0.55 + 0.55)/(4) = 0.65
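The ad-hoc scoring method and both forms of the equation can be sketched as follows. This Python fragment is an illustration under stated assumptions: each unification is given as a list of (term kind, captured words) pairs, the kind labels and the function names score, count and prob are hypothetical, and functional terms are left out because their score depends on the function.

```python
# Per-term constants from the ad-hoc method.
SCORES = {"word": 0.95, "matchone": 0.7, "star": 0.55, "kleenstar": 0.45}

def score(kind, captured):
    """Score one term: a constant for single-word terms, per word for wildcards."""
    if kind in ("word", "matchone"):
        return SCORES[kind]
    return SCORES[kind] * len(captured)

def count(kind, captured):
    """Count one term for normalization; an empty KLEENSTAR still counts as 1."""
    if kind in ("word", "matchone"):
        return 1
    if kind == "kleenstar" and not captured:
        return 1
    return len(captured)

def prob(unification, normalize=True):
    total = sum(score(kind, r) for kind, r in unification)
    if not normalize:
        return total  # unnormalized variant
    return total / sum(count(kind, r) for kind, r in unification)

# "OSAMA * moving *" against "Osama is moving out"
u1 = [("word", ["Osama"]), ("kleenstar", ["is"]),
      ("word", ["moving"]), ("kleenstar", ["out"])]
# "*OSAMA*" against the same text; the leading wildcard captures nothing
u2 = [("kleenstar", []), ("word", ["Osama"]),
      ("kleenstar", ["is", "moving", "out"])]
best = max([u1, u2], key=prob)
```

With normalization on, the first pattern scores about 0.7 and the second about 0.46, so the first is selected as the preferred unification.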
[0476] It is also reasonable to remove the normalization step from
the equation. In this case, generated scores will be significantly
higher, and reflect precisely the amount of data that has been
unified, regardless of the size of the source string. The advantage
is that matches will generate larger numbers. The disadvantage is
that scores generated by an input/pattern pair can only be
reasonably compared with the results of unification of other
patterns against the same input. Comparing results from unifications
of different inputs is not possible when normalization is turned
off.
Exemplary Software Architecture
[0477] The following discussion describes an exemplary software
architecture that can implement the systems and methods described
above. It is to be appreciated that the following discussion
provides but one example and is not intended to limit application
of the claimed subject matter. Accordingly, other architectures can
be utilized without departing from the spirit and scope of the
claimed subject matter.
[0478] FIG. 3 shows an exemplary system generally at 300 comprising
one or more runtime objects 302, 304, 306 and 308 and one or more
knowledge base objects 350, 352 and 354.
[0479] In the illustrated and described embodiment, the runtime
objects are software objects that have an interface that receives
data in the form of text and produces text or actions. In one
embodiment, the runtime objects are implemented as C++ objects.
Knowledge base objects 350-354 are software objects that load and
execute FPML knowledge bases and handle user requests. Together,
the runtime objects and the knowledge base objects cooperate to
implement the functionality described above.
[0480] More specifically, in the illustrated and described
embodiment, each knowledge base object is associated with a single
FPML file. Examples of FPML files are described above. The
knowledge base object is configured to load and execute, in a
programmatic manner, the FPML file. In some embodiments, FPML files
can be nested and can contain links to other objects. This allows
one broader knowledge base to include individual FPML files. This
keeps the knowledge organized and makes it easier to edit domain
specific knowledge. In the present example, runtime objects can
point to or otherwise refer to more than one knowledge base object,
thus utilizing the functionality of more than one knowledge base
object. Similarly, knowledge base objects can be shared by more
than one runtime object. This promotes economies, scalability and
use in environments in which it is desirable to receive and process
text from many different sources.
[0481] In the illustrated and described embodiment, the runtime
objects contain state information associated with the text that they
receive. For example, if the text is received in the form of a
conversation in a chat room, the runtime object maintains state
information associated with the dialog and discourse. The runtime
objects can also maintain state information associated with the
FPML that is utilized by the various knowledge base objects. This
promotes sharing of the knowledge base objects among the different
runtime objects.
[0482] As an overview to the processing that takes place using
system 300, consider the following. In the present example, the
runtime objects receive text and then utilize the knowledge base
objects to process the text using the FPML file associated with the
particular knowledge base object. Each of the runtime objects can
be employed in a different context, while utilizing the same
knowledge base objects as other runtime objects.
[0483] Now specifically consider knowledge base object 354 which is
associated with a loaded FPML file N. As described above, the FPML
file comprises a hierarchical tree structure that has <unit>
nodes that encapsulate <input> nodes and <response>
nodes. Each of these nodes (and others) is described above. When a
runtime object receives text, it passes the text to one or more of
the knowledge base objects. In this particular example, runtime
objects 304 and 308 point to knowledge base object 354. Thus, each
of these runtime objects passes its text to knowledge base
object 354. As noted above, each knowledge base object, through its
associated FPML file, can be associated with a particular lexer
that performs the lexical analysis described above. When the
knowledge base object receives text from the runtime object(s), it
lexically analyzes the text using its associated lexer.
[0484] As noted above, because each runtime object can point to
more than one knowledge base object, and because each knowledge
base object can specify a different lexer, the same text that is
received by a runtime object can be processed differently by
different knowledge base objects.
[0485] Consider now the process flow when text is received by a
runtime object. When the runtime object receives its text, it makes
a method call on one or more of the knowledge base objects and
provides the text, through the method call, to the knowledge base
object or objects. The process now loops through each of the
knowledge base objects (if there is more than one), looking for a
match. If there is a match, the method returns a score and an
associated node that generated the score, to the runtime object. In
the present example, assume that the FPML associated with knowledge
base object 354 processes the text provided to it by runtime object
308 and, as a result, generates a match and score for the
<input2> node of <unit 2>. The score and an indication
of the matching node are thus returned to the runtime object and
can be maintained as part of the state that the runtime object
maintains. In the event that there are multiple matches, a best
match can be calculated as described above. Once the runtime object
has completed the process of looping through the knowledge base
objects, and, in this example, ascertained a best match, it can
then call a method on the matching node to implement the associated
response. Note that in the presently-described embodiment, for each
<input> node there is an associated <response> node
that defines a response for the associated input. Exemplary
responses are described above. Thus, when the runtime object calls
the knowledge base object and receives a particular response, the
runtime object can then call an associated application and forward
the response to the application.
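The process flow just described can be sketched as follows. This Python fragment is a simplified illustration rather than the described C++ implementation: the class names KnowledgeBase and Runtime, the match and process methods, and the use of plain functions as matchers and responses are all assumptions made for the example.

```python
class KnowledgeBase:
    """Holds units as (matcher, response) pairs; matcher returns a score or None."""
    def __init__(self, units):
        self.units = units

    def match(self, text):
        best = None
        for matcher, response in self.units:
            s = matcher(text)
            if s is not None and (best is None or s > best[0]):
                best = (s, response)
        return best  # (score, matching node) or None

class Runtime:
    """Receives text, loops over its knowledge bases, and keeps the best match."""
    def __init__(self, knowledge_bases):
        self.knowledge_bases = knowledge_bases
        self.last_match = None  # per-runtime state, so knowledge bases can be shared

    def process(self, text):
        best = None
        for kb in self.knowledge_bases:
            result = kb.match(text)
            if result is not None and (best is None or result[0] > best[0]):
                best = result
        self.last_match = best
        return best[1](text) if best else None

def exact(pattern, s):
    return lambda text: s if text.upper() == pattern else None

def contains(word, s):
    return lambda text: s if word in text.upper().split() else None

kb_a = KnowledgeBase([(exact("OSAMA IS EVIL", 0.95), lambda t: "exact unit")])
kb_b = KnowledgeBase([(contains("OSAMA", 0.46), lambda t: "generic unit")])
rt1 = Runtime([kb_a, kb_b])
rt2 = Runtime([kb_b])  # the same knowledge base object shared by two runtimes
```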
Exemplary System Utilizing Runtime And Knowledge Base Objects
[0486] FIG. 4 shows an arrangement of components that utilize the
above-described runtime and knowledge base objects in accordance
with one embodiment, generally at 400. In this system, one or more
human users or monitors can interact with an application 402 which,
in turn, interacts with a functional presence system 404. In
accordance with one embodiment, the application can comprise an
agent manager component 403, which is discussed in greater detail
below in the section entitled "Implementation Example Using Dynamic
Agent Server".
[0487] In accordance with the presently described embodiment,
functional presence system 404 comprises a functional presence
engine 406 which itself comprises one or more runtime objects 408
and one or more knowledge base objects 410. Each knowledge base
object can have an associated lex object 412 that is configured to
perform lexical analysis as described above. The functionality of
the runtime and knowledge base objects is discussed above and, for
the sake of brevity, is not repeated here.
[0488] In addition, system 404 can comprise an information
retrieval component 414 which is described in more detail just
below.
Information Retrieval
[0489] In accordance with one embodiment, information retrieval
component 414 utilizes the services of the functional presence
engine 406 to process large numbers of documents and perform
searches on the documents in a highly efficient manner.
[0490] Before, however, describing the information retrieval
process, a little background is given so that the reader can
appreciate the inventive processes. One way to perform searches on
documents is to perform a so-called linear or serial search. For
example, assume that, given 4 Gigabytes of data, an individual
wishes to search for a particular term that might be contained
within the data. By performing a linear or serial search, a search
would proceed linearly, byte by byte, until the term was or was not
found. Needless to say, a linear search can take a long time and
can be needlessly inefficient.
[0491] In accordance with the described embodiment, the information
retrieval component creates and utilizes a table whose individual
entries point to individual documents. Entries in the table are
formed utilizing the services of the functional presence
engine.
[0492] As an example, consider FIG. 5 which shows a system
generally at 500 that comprises a functional presence engine 502 that
utilizes one or more runtime objects and one or more knowledge base
objects 506. An information retrieval component 508 utilizes
functional presence engine 502 to create and use a table 510 which
is shown in expanded form just beneath system 500.
[0493] In accordance with one embodiment, functional presence
engine 502 receives and processes data which, in this example,
comprises a large number of documents. The documents, under the
influence of the functional presence engine and its constituent
components, undergo lexical analysis and tokenization (typing) as
described above. As these processes were described above in detail,
they are not described again. The output of the tokenization or
typing process is one or more tables.
[0494] Specifically, in the present example, table 510, shown in
expanded form, includes a column 512 that holds a value associated
with a particular word found in the documents, a column 514 that
holds a value associated with the type assigned to the word in the
tokenization process, and a column 516 that holds one or more
values associated with the documents in which the word (type)
appears. Thus, in this example, each row defines a word, an
associated type and the document(s) in which the word or type
appears. So, for example, word A is assigned type 1 and appears in
documents 1 and 3.
[0495] In the illustrated and described embodiment, the typing of
the data removes any need to do a key word search. Instead, one can
search for various types and can specify, through the information
retrieval component 508, a regular expression to be used to search
the various types. For example, one might specify a search for all
documents that contain an email address that contains a certain
specific term. A search on the type "Email addresses" identifies
all of the email addresses from column 514. A regular expression
search of column 512 can then identify all of the matches whose
associated documents are indicated in column 516.
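The type-then-pattern search over the table can be sketched as follows. This Python fragment is an illustration only: the row layout mirrors columns 512, 514 and 516, but the sample rows, the type labels "EMAIL" and "WORD", and the function name search are hypothetical.

```python
import re

# Each row: (word, type, set of document ids), standing in for columns 512-516.
table = [
    ("alice@example.com", "EMAIL", {1, 3}),
    ("bob@example.org", "EMAIL", {2}),
    ("example", "WORD", {1, 2, 3}),
]

def search(rows, wanted_type, pattern):
    """Filter rows by type, then apply a regular expression to the word column."""
    regex = re.compile(pattern)
    docs = set()
    for word, word_type, doc_ids in rows:
        if word_type == wanted_type and regex.search(word):
            docs |= doc_ids
    return docs

hits = search(table, "EMAIL", r"example\.com")
```

Filtering on the type column first means the regular expression is applied only to candidate rows rather than to the raw document text.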
[0496] Although this is a simple example, as the skilled artisan
will surely appreciate, what begins to emerge is a system that
allows for structured types of operations to be performed on
unstructured data.
[0497] In the illustrated and described embodiment, the information
retrieval process is passive in that it is provided with
information and then processes the information, along with the
functional presence engine to provide a robust searching and
retrieval tool. In this particular example, the information
retrieval component is not responsible for seeking out and
acquiring the information that it searches.
Exemplary Method
[0498] FIG. 5a illustrates steps in a method in accordance with one
embodiment. The method can be implemented in connection with any
suitable hardware, software, firmware or combination thereof. In
one embodiment, the method can be implemented in connection with
systems such as those shown and described in FIGS. 1-5. Step 530
receives text from a data origination entity. A data origination
entity, as used in this document, is intended to describe an entity
from which data is obtained. For example, in the context of the
Internet, a data origination entity can comprise a server, a
server-accessible data store, a web page and the like. In the
context of a company intranet, a data origination entity can
comprise a network-accessible data store, a server, a desktop
computer and the like.
[0499] Step 532 probabilistically parses the text effective to
tokenize text portions with tokens. In the illustrated and
described embodiment, probabilistic parsing is accomplished using
one or more matching rules that are defined as regular expressions
in an attempt to match to text received from the data origination
entity. Examples of how probabilistic parsing can take place are
described above. Hence, for the sake of brevity, such examples are
not repeated here. Step 534 conducts a search on the tokens.
Examples of how and why searches can be conducted are given above
and, for the sake of brevity, are not repeated here.
Implementation Example Using Dynamic Agent Server
[0500] In accordance with one embodiment, the above-described
systems and methods can be employed in the context of a system
referred to as the "dynamic agent server." The dynamic agent server
is a client-server platform and application interface for managing
and deploying software agents across networks and over the
Internet. The dynamic agent server is enabled by the functional
presence engine and, in particular, the runtime objects that are
created by the functional presence engine. The dynamic agent server
can be configured to incorporate and use various applications and
protocols, ingest multiple textual data types, and write to files
or databases.
[0501] As an example, consider FIG. 6 which shows a system
comprising a dynamic agent server 600 that comprises or uses a
functional presence engine 602 which, in turn, utilizes one or more
runtime objects 604 and one or more knowledge base objects 606. An
application 608 is provided and includes an agent manager component
610 which manages agents that get created and deployed. One or more
data sources 612 are provided and include, in this example, IRC
data sources, TCP/IP data sources, POP3 data sources, wget data
sources, htdig data sources among others. The data sources can be
considered as a pipeline through which data passes. In the present
example, data can originate or come from the Internet 614, from a
network 616 and/or various other data stores 618.
[0502] In accordance with one embodiment, an agent can be
considered as an instance of a runtime object coupled with a data
source. In the illustrated and described embodiment, application
608 controls the types of agents that are generated and deployed.
In the present example, there are two different types of agents
that can be created. A first type of agent gets created and opens a
communication channel via a data source and simply listens to a
destination such as one of the data origination entities named
above, i.e. Internet 614, network 616 and/or data store 618. This
type of agent might be considered a passive agent. A second type of
agent gets created and opens a communication channel via a data
source and interacts in some way with the destination. This second
type of agent communicates back through application 608 to the
functional presence engine 602. This type of agent might be
considered an active agent.
[0503] In the illustrated and described embodiment, an agent (i.e.
a runtime object 604 and a data source) is associated with one or
more knowledge base objects 606. The knowledge base objects define
how the agent interacts with data from a data origination entity.
That is, the functional presence engine 602 is utilized to direct
and control agents. In the illustrated and described embodiment,
there is a one-to-one association between a particular runtime
object and data source defined by the associated agent.
[0504] Because of the componentized nature of the runtime objects,
large numbers of agents can be created and deployed across various
different types of systems. Additionally, as the runtime objects
can be associated with more than one knowledge base, a very robust
information processing system can be provided.
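By way of illustration only, the relationship among agents, data sources and knowledge base objects described above might be sketched as follows. The class names, the representation of a knowledge base as a list of (regular expression, response) pairs, and the send callback are assumptions made for this sketch and are not taken from the described embodiments.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Optional, Tuple


class AgentKind(Enum):
    PASSIVE = "passive"  # opens a channel and only listens
    ACTIVE = "active"    # opens a channel and may also respond


@dataclass
class KnowledgeBase:
    """Hypothetical stand-in for a knowledge base object."""
    rules: List[Tuple[re.Pattern, str]]  # (pattern, response text) pairs

    def react(self, text: str) -> Optional[str]:
        # Return the response of the first rule whose pattern matches.
        for pattern, response in self.rules:
            if pattern.search(text):
                return response
        return None


@dataclass
class Agent:
    """One runtime object coupled with exactly one data source."""
    kind: AgentKind
    data_source: Iterable[str]            # the pipeline that yields text
    knowledge_bases: List[KnowledgeBase]  # one or more may be associated
    send: Callable[[str], None] = print   # used only by active agents

    def run(self) -> List[str]:
        """Return the texts that matched any associated knowledge base."""
        matched = []
        for text in self.data_source:
            for kb in self.knowledge_bases:
                response = kb.react(text)
                if response is None:
                    continue
                matched.append(text)
                if self.kind is AgentKind.ACTIVE:
                    self.send(response)  # interact with the destination
                break  # the first matching knowledge base wins
        return matched
```

In this sketch, a passive agent and an active agent differ only in whether a matched rule causes a response to be sent back through the channel.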
[0505] As an example of how the dynamic agent server can be
utilized, consider the following. The wget data source is a
mechanism which, in combination with a runtime object, goes to a
particular web site and can download the entire web site. That is,
the agent in this example establishes a connection with the web
site, follows all of the links for the web site, and downloads all
of the content of that site. This, in and of itself, can provide a
huge data problem insofar as moving from a hard target (the web
site) to a very large collection of unstructured data (the entire
content of the site). The functional presence engine can alleviate
this problem by allowing the agent, in this instance, to go to the
website and only pull information that is important by identifying
which pages are relevant as defined by the FPML.
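A minimal sketch of this kind of selective retrieval is given below. It assumes that relevance can be expressed as a list of regular expressions (a stand-in for rules that would be defined in the FPML) and uses a toy in-memory site in place of a real wget data source; the names crawl_relevant, fetch and SITE are hypothetical.

```python
import re
from collections import deque
from typing import Callable, Dict, List, Tuple

Page = Tuple[str, List[str]]  # (page text, outgoing links)


def crawl_relevant(start: str,
                   fetch: Callable[[str], Page],
                   relevance_rules: List[re.Pattern]) -> Dict[str, str]:
    """Follow every link breadth-first, but keep only pages matching a rule."""
    kept: Dict[str, str] = {}
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        text, links = fetch(url)
        if any(rule.search(text) for rule in relevance_rules):
            kept[url] = text
        for link in links:
            if link not in seen:  # avoid revisiting pages
                seen.add(link)
                queue.append(link)
    return kept


# Toy in-memory "web site" so the sketch runs without network access.
SITE: Dict[str, Page] = {
    "/": ("Welcome to our site", ["/news", "/about"]),
    "/news": ("Flight schedule updated for Maui", ["/", "/archive"]),
    "/about": ("About the company", []),
    "/archive": ("Old flight schedule", []),
}

rules = [re.compile(r"\bflight schedule\b", re.IGNORECASE)]
relevant = crawl_relevant("/", SITE.__getitem__, rules)
```

All four pages are visited, but only the two pages whose text matches a rule are retained, which is the reduction in unstructured data that the paragraph above describes.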
Agent Based Information Retrieval Response System
[0506] In accordance with another embodiment, the above-described
systems and methods can be employed to deploy multiple agents
across a network to gather, read, and react to various stores of
unstructured data. The system can utilize an analysis tool that
tags, indexes and/or otherwise flags relationships in structured
and unstructured data for performing alerting, automation and
reporting functions either on a desktop system or
enterprise-wide.
[0507] In accordance with one embodiment, the system utilizes a two
stage process. In the first stage, the system retrieves information
of interest and stores the information in a location that is
associated with a particular agent. In the second stage, the system
presents a user with an interface by which the user can query the
index to find documents of interest.
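The two-stage process might be sketched as follows, assuming for illustration that the per-agent storage location is an in-memory object and that the index is a simple inverted word index; the class name AgentStore and its methods are hypothetical, and the user interface of the second stage is reduced to a query method.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


class AgentStore:
    """Documents retrieved by one agent, plus a word index over them."""

    def __init__(self, agent_name: str) -> None:
        self.agent_name = agent_name
        self.documents: Dict[int, str] = {}
        self.index: Dict[str, Set[int]] = defaultdict(set)

    def add(self, text: str) -> int:
        """Stage one: store a retrieved document and index its words."""
        doc_id = len(self.documents)
        self.documents[doc_id] = text
        for word in re.findall(r"\w+", text.lower()):
            self.index[word].add(doc_id)
        return doc_id

    def query(self, terms: str) -> List[str]:
        """Stage two: return documents containing every query term."""
        words = re.findall(r"\w+", terms.lower())
        if not words:
            return []
        ids = set.intersection(*(self.index.get(w, set()) for w in words))
        return [self.documents[i] for i in sorted(ids)]
```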
[0508] As will surely be appreciated by the skilled artisan, the
systems and methods described above provide a tool that can be
utilized to impart to generally unstructured data, a structure that
permits a robust and seemingly endless number of operations to be
employed on the now-structured data. The various approaches
discussed above are generally much simpler and more flexible than
other data disambiguation approaches that have been utilized in the
past. The examples provided below describe scenarios in which the
technology described above can be employed. These examples are not
intended to limit application of the claimed subject matter to the
specific examples described below. Rather, the examples are
intended to illustrate the flexibility that the tools described
above provide.
EXAMPLE 1
[0509] One important area of application pertains to real time
scenarios in which detection of patterns and appropriate response
generation takes place. As an example, consider a scenario in which
law enforcement individuals wish to search for potential child
molesters in chat rooms. Given the vast expanse of cyberspace and
the seemingly endless number of chat rooms that serve as forums for
child molesters, the task of monitoring these chat rooms and
reacting to dialogs from potential molesters is a daunting one. One
current approach is to assign a law enforcement individual a small
number of chat rooms and have them monitor the chat room for
suspicious dialogs. When a suspicious dialog is detected, the law
enforcement individual can intervene and attempt to ensnare the
potential molester. There are limits on this approach, the most
obvious of which is that only a small number of chat rooms can be
monitored by any one law enforcement individual. Given the
budgetary constraints of many law enforcement organizations, funds
are often not available to place as many law enforcement
individuals on this task as are necessary or desirable.
[0510] Using the above-described systems and methods, agents can be
deployed to essentially sit in multiple chat rooms and use
knowledge bases to monitor and process the dialog that takes place
in the chat room. If and when problematic words or patterns are
identified, the agent can react immediately. In one instance, the
agent might notify a human monitor, via an application such as
application 608 (FIG. 6), that a pattern has been detected, thus
allowing the human monitor to join in the conversation in the chat
room and participate in further law enforcement activities. In
another instance, the agent might be programmed to engage in a
conversation with a potential molester and, in parallel, generate an
alert for a human monitor.
[0511] In this particular instance, the inventive systems and
methods are force multipliers in that the ratio of chat
rooms to human monitors can be dramatically increased.
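As a rough sketch of how a single process might watch many rooms and queue alerts for one human monitor, consider the following. The Alert record, the monitor_rooms function and the representation of a room as a sequence of (speaker, message) pairs are assumptions of this sketch; the watch rules would in practice come from a knowledge base rather than from a hand-written dictionary.

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Alert:
    room: str
    speaker: str
    message: str
    rule: str  # name of the rule that fired


def monitor_rooms(rooms: Dict[str, Iterable[Tuple[str, str]]],
                  watch_rules: Dict[str, re.Pattern]) -> List[Alert]:
    """Scan every room's messages and collect one alert per matching rule."""
    alerts: List[Alert] = []
    for room, messages in rooms.items():
        for speaker, message in messages:
            for name, pattern in watch_rules.items():
                if pattern.search(message):
                    alerts.append(Alert(room, speaker, message, name))
    return alerts
```

Because the human monitor reviews only the alerts rather than every message, the number of rooms covered is limited by the rules and the hardware rather than by the monitor's attention.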
EXAMPLE 2
[0512] The systems and methods described above can be utilized to
develop links and relationships within generally unstructured data.
In one example, links are built through proximities, where
proximities can be subjects that appear in or at the same media,
place, time, and the like. As an example, consider that a subject
of interest is "John Doe" and that John Doe is suspected of having
a relationship with a certain person of interest "Mike Smith". Yet
to date, this relationship has been elusive. Consider now that a
system, such as the system of FIG. 6, is set up with agents to
monitor various data origination entities for information
associated with John Doe and independently, Mike Smith. Assume that
during the monitoring of the data origination entities, information
is developed that indicates that John Doe went to the Pentagon at 9
P.M. Assume also that in monitoring the data origination entities,
information is developed that associates a time range between
6 P.M. and 11 P.M. with Mike Smith's presence in the Washington
D.C. area. Once this information has been developed and processed
by the inventive systems described above, as by, for example, being
formulated into a table such as the table shown in FIG. 5, a search
can be conducted on the table to establish a link or relationship
between John Doe and Mike Smith. As noted above, the search can be
defined as a simple key word search of the table, or a more robust
regular expression search of the table.
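One way such a proximity search could work is sketched below. The sketch assumes that each table row has already been normalized to a subject, a named area and a time range in whole hours on a single day (so that, for instance, a visit to the Pentagon has been mapped to the Washington D.C. area); the Sighting record and find_links function are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Sighting:
    subject: str
    area: str
    start_hour: int  # 24-hour clock, inclusive
    end_hour: int    # 24-hour clock, inclusive


def find_links(table: List[Sighting]) -> List[Tuple[str, str, str]]:
    """Pair different subjects seen in the same area at overlapping times."""
    links = []
    for a, b in combinations(table, 2):
        same_area = a.area == b.area
        # Two inclusive intervals overlap when each starts before the other ends.
        overlap = a.start_hour <= b.end_hour and b.start_hour <= a.end_hour
        if a.subject != b.subject and same_area and overlap:
            links.append((a.subject, b.subject, a.area))
    return links
```

With a row for John Doe at hour 21 and a row for Mike Smith from hour 18 to hour 23, both in the same area, the function reports a single link between the two subjects.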
EXAMPLE 3
[0513] Consider a travel related application in which a user is
interested in booking a vacation to a particular destination.
Assume that a deployed agent now engages the user in a conversation
at a web site that books vacation trips. During the course of the
conversation, the user types in certain responses to queries by the
agent. For example, the agent may ascertain that the user wishes to
book a vacation to Maui and is interested in staying on the north
side of the island. Responsive to learning this information, the
agent causes multimedia showing the north side of Maui to be
presented to the user. As the conversation proceeds, the agent
learns other information from the user such as various general
activities in which the user likes to participate. For example, the
agent may learn that the user likes to hike and explore. Responsive
to learning this information, the agent may then cause multimedia
associated with hiking and exploring on Maui to be presented to the
user as the query continues. Needless to say, the systems and
methods described above can be utilized to provide a flexibly
robust, user-centric experience.
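The stimulus/response behavior in this example might be sketched as follows. The rule list, the slot names and the media identifiers returned by hear are all hypothetical; the sketch simply shows regular expressions recognizing facts in a user's reply and each newly learned fact triggering content to present.

```python
import re
from typing import Dict, List

# Hypothetical rules: (slot name, pattern whose first group is the value).
RULES = [
    ("destination", re.compile(r"\b(maui|oahu|kauai)\b", re.I)),
    ("region", re.compile(r"\b(north|south|east|west) side\b", re.I)),
    ("activity", re.compile(r"\b(hik(?:e|ing)|explor(?:e|ing))\b", re.I)),
]


class TravelAgent:
    def __init__(self) -> None:
        self.profile: Dict[str, List[str]] = {}  # what has been learned so far

    def hear(self, reply: str) -> List[str]:
        """Update the profile from one reply; return media to present."""
        media = []
        for slot, pattern in RULES:
            for match in pattern.finditer(reply):
                value = match.group(1).lower()
                values = self.profile.setdefault(slot, [])
                if value not in values:  # respond only to new information
                    values.append(value)
                    media.append(f"media:{slot}:{value}")
        return media
```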
Conclusion
[0514] The embodiments described above provide a state-based,
regular expression parser in which data, such as generally
unstructured text, is received into the system and undergoes a
tokenization process which permits structure to be imparted to the
data. Tokenization of the data effectively enables various patterns
in the data to be identified. In some embodiments, one or more
components can utilize stimulus/response paradigms to recognize and
react to patterns in the data.
[0515] Although the invention has been described in language
specific to structural features and/or methodological steps, it is
to be understood that the invention defined in the appended claims
is not necessarily limited to the specific features or steps
described. Rather, the specific features and steps are disclosed as
preferred forms of implementing the claimed invention.
* * * * *
References