U.S. patent application number 12/802202 was filed with the patent office on 2010-05-29 and published on 2011-12-01 for iterative fact-extraction.
Invention is credited to Martin Betz, Ankush Nagpal, Harshal Pradhan, Xiang Yu.
Application Number | 20110295864 12/802202 |
Family ID | 45022954 |
Filed Date | 2010-05-29 |
United States Patent Application | 20110295864 |
Kind Code | A1 |
Betz; Martin; et al. | December 1, 2011 |
Iterative fact-extraction
Abstract
Some embodiments provide a method for identifying a first
pattern formed by a first set of document elements. The method
associates a tag that identifies the first pattern with the
document. The method then identifies a second pattern formed by a
second set of document elements and the tag. The method associates
a second tag with the document. The second tag identifies the
second pattern and is associated with a fact. Some embodiments
provide a graphical user interface (GUI) for manually processing
tags associated with a document. Further, different embodiments
provide a system for performing iterative fact-extraction on a set
of documents.
Inventors: | Betz; Martin; (San Francisco, CA); Pradhan; Harshal; (Thane(W), IN); Nagpal; Ankush; (Gurgaon, IN); Yu; Xiang; (Clarksville, MD) |
Family ID: | 45022954 |
Appl. No.: | 12/802202 |
Filed: | May 29, 2010 |
Current U.S. Class: | 707/754; 707/E17.014; 715/780 |
Current CPC Class: | G06F 16/38 20190101 |
Class at Publication: | 707/754; 715/780; 707/E17.014 |
International Class: | G06F 17/30 20060101 G06F017/30; G06F 3/048 20060101 G06F003/048 |
Claims
1. A machine-implemented method for extracting a fact from a
document, the document comprising a plurality of document elements,
the method comprising: identifying a first pattern formed by a
first set of document elements; associating a first tag with the
document, the first tag identifying the first pattern; identifying
a second pattern formed by a second set of document elements and
the first tag; associating a second tag with the document; and
based on the second tag, recording a first fact for the
document.
2. The method of claim 1, wherein the first pattern is defined by a
list including known words and expressions.
3. The method of claim 1, wherein the second pattern is defined by
a rule that specifies a required relationship between the second
set of document elements and the first tag.
4. The method of claim 1, wherein the first tag identifies the
first pattern as a company, the second tag identifies the second
pattern as an action verb, and the first fact is a new company
hire.
5. The method of claim 1, wherein the first tag identifies the
first pattern as a person, the second tag identifies the second
pattern as a quote, and the first fact is the quote attributed to
the person.
6. The method of claim 1, wherein the first tag identifies the
first pattern as a person, the second tag identifies the second
pattern as a gender pronoun, and the first fact is the person being
male or female.
7. The method of claim 1 further comprising: identifying a third
pattern formed by a third set of document elements, the first tag,
and the second tag; and associating a third tag with the document,
the third tag identifying the third pattern and associated with a
second fact.
8. A computer readable storage medium including a computer program,
the computer program including instructions for providing a
graphical user interface (GUI) for manually processing tags
associated with a document, the GUI comprising: a first UI item for
selecting a script for performing iterative fact-extraction on text
data from the document; a text box UI item for inputting the text
data from the document for the iterative fact-extraction; a first
display portion for presenting identified patterns from the
inputted text data resulting from the iterative fact-extraction;
and a second display portion for providing a plurality of UI items
representing a plurality of tags associated with the identified
patterns.
9. The computer readable storage medium of claim 8, wherein the
plurality of UI items allows a user to modify a first tag
associated with a particular pattern to a second tag when the user
selects a second UI item that represents the second tag.
10. The computer readable storage medium of claim 8, wherein the
plurality of UI items further includes a second UI item for
removing a tag associated with a particular pattern.
11. A system for performing iterative fact-extraction on a set of
documents, the system comprising: a pattern analysis engine for
identifying a set of patterns in the set of documents; and a tag
engine for annotating the set of documents with respective tags
that are associated with facts in the set of documents.
12. The system of claim 11, wherein the pattern analysis engine
executes a set of pattern analysis instructions to identify the set
of patterns, said set of pattern analysis instructions defining the
set of patterns to identify.
13. The system of claim 11, wherein the set of documents are stored
in a document storage.
14. The system of claim 11, wherein the tags are stored in a tag
storage.
15. The system of claim 14 further comprising: a fact processing
module for processing the stored tags to extract a set of facts
associated with the tags; and a query processor for executing
search queries on the set of facts to retrieve facts that match the
search queries.
16. The system of claim 11 further comprising a document crawler
module for communicating with a network to retrieve the set of
documents on a real-time basis.
17. The system of claim 11 further comprising a document crawler
module for communicating with a network to retrieve the set of
documents on a periodic basis.
18. The system of claim 12 further comprising a file handler module
for receiving scripts that are embedded with the set of pattern
analysis instructions.
19. The computer readable medium of claim 8, wherein the first
display portion provides a second UI item for editing the
identified patterns.
20. The computer readable medium of claim 8, wherein the first
display portion provides a second UI item for removing an
identified pattern and any tag associated with the pattern.
Description
BACKGROUND
[0001] Most information today is stored electronically and is
available on the World Wide Web. This information includes blog
posts, articles (e.g., news articles, opinion pieces, etc.),
research papers, web pages, and many other types of documents.
While having this much information available is useful, it is often
desirable to find information relevant to a particular topic.
[0002] Search engines exist today to find documents on the web.
Most search engines use web crawlers to retrieve these documents
from web sites. These retrieved documents are typically passed to
an indexer that parses and generates keywords from the contents of
the documents. Some existing implementations of the indexer may
parse each sentence of a particular document to generate data
structures known as grammatical trees to represent words and
phrases of the document. These structures can then be utilized
subsequently to provide responses to user queries. However, these
responses are not always good and often require the users to spend
quite a bit of time processing the retrieved documents to derive
answers for their questions.
BRIEF SUMMARY
[0003] Some embodiments of the invention provide a novel method for
extracting one or more facts from a document that contains several
document elements. The method of some embodiments iteratively
analyzes the document to identify facts contained in the document,
which it then annotates with tags that identify the facts. In each
iteration, the method of some embodiments identifies facts by (1)
identifying patterns of document elements and/or tags from prior
iteration(s), and (2) associating a tag with a pattern identified
in the current iteration to specify the fact that is derived from
the existence of the identified pattern. In other words, after each
level of the multi-level fact-extraction analysis, the method of
some embodiments can use the tags that the method generated in the
previous level(s) of fact-extraction to identify additional
patterns of document elements and/or tags. Such additional
patterns of document elements and/or tags are then tagged to
indicate other facts contained in the document.
[0004] Accordingly, in some embodiments, the method associates each
identified pattern of document elements and/or tags with another
tag, which is, in turn, associated with a fact that can get
associated with the identified pattern. In other embodiments,
however, the method might not associate each tag with a fact. For
instance, in some embodiments, the method identifies some tags only
to use them to identify subsequent patterns that are subsequently
associated with facts through other tags.
[0005] To identify patterns of document elements and/or tags, the
method of some embodiments executes one or more sets of pattern
analysis instructions. This method uses different pattern analysis
instruction sets in different embodiments. For example, in some
embodiments, the pattern analysis instruction sets define patterns
as an uninterrupted sequence of document elements (e.g., a sequence
of words that form an expression) or tags. In other embodiments,
the pattern analysis instruction sets define patterns in terms of
one or more rules that associate document elements and/or tags.
Examples of such rules include rules that specify a relationship
between document elements and/or tags (e.g., rules that specify
intersection, spacing and/or location constraints between one or
more sets of words and/or tags). In yet other embodiments, the
pattern analysis instruction sets define patterns in terms of an
uninterrupted sequence of document elements and one or more rules
that associate document elements and/or tags.
[0006] Additionally, in some embodiments, the method determines
whether manual processing of the tags of the document is required.
In some instances, human intervention is required to resolve issues
with the extracted facts from the document. Some embodiments
provide a tool for a person to manually process the document to
adjust the tags of the document. Also, in some embodiments, the
method provides a presentation of the facts of a particular
document by processing the tags associated with several documents
that it processes.
[0007] The preceding Summary is intended to serve as a brief
introduction to some embodiments of the invention. It is not meant
to be an introduction or overview of all inventive subject matter
disclosed in this document. The Detailed Description that follows
and the Drawings that are referred to in the Detailed Description
will further describe the embodiments described in the Summary as
well as other embodiments. Accordingly, to understand all the
embodiments described by this document, a full review of the
Summary, Detailed Description and the Drawings is needed. Moreover,
the claimed subject matters are not to be limited by the
illustrative details in the Summary, Detailed Description and the
Drawings, but rather are to be defined by the appended claims,
because the claimed subject matters can be embodied in other
specific forms without departing from the spirit of the subject
matters.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] FIG. 1 conceptually illustrates an example of an iterative
fact-extraction methodology of some embodiments.
[0009] FIG. 2 illustrates the association of each pattern with a
tag, which, in turn, is associated with a fact.
[0010] FIG. 3 conceptually illustrates an iterative fact-extraction
process of some embodiments.
[0011] FIG. 4 illustrates a conceptual block-diagram view of a
multi-level fact-extraction process.
[0012] FIG. 5 conceptually illustrates a fact-extraction process
that processes a series of tags that are generated by performing
the iterative fact-extraction process multiple times for multiple
documents in order to extract and store facts associated with these
tags.
[0013] FIG. 6 conceptually illustrates an example document that has
been analyzed in multiple iterations in order to extract a fact
about management turnover.
[0014] FIG. 7 conceptually illustrates an example data structure of
a tag created by the fact-extraction process.
[0015] FIG. 8 conceptually illustrates examples of back-referenced
patterns in a particular document and a corresponding tag of some
embodiments.
[0016] FIGS. 9(a)-(b) conceptually illustrate example patterns for
attributing a name to a quote of some embodiments.
[0017] FIG. 10 conceptually illustrates an example pattern for
attributing a name to a quote of some embodiments.
[0018] FIG. 11 conceptually illustrates an example pattern of
management turnover of some embodiments.
[0019] FIGS. 12(a)-(b) conceptually illustrate other example
patterns of management turnover of some embodiments.
[0020] FIG. 13 conceptually illustrates an example pattern of
attributing commentary to management turnover of some
embodiments.
[0021] FIGS. 14(a)-(b) conceptually illustrate example patterns for
product initiatives of some embodiments.
[0022] FIGS. 15(a)-(b) conceptually illustrate example patterns for
financial transactions of some embodiments.
[0023] FIG. 16 conceptually illustrates one example of pattern
analysis that extracts facts by having later stage tags build on
earlier stage tags.
[0024] FIG. 17 illustrates an example graphical user interface
(GUI) display for manually processing documents to adjust the
results of iterative fact processing of some embodiments.
[0025] FIG. 18 illustrates an example of a GUI display for
presenting management turnover for a particular company of some
embodiments.
[0026] FIG. 19 illustrates an example of a GUI display for
displaying a product initiative of some embodiments.
[0027] FIG. 20 illustrates an example of a graphical user interface
output for displaying a financial transaction of some
embodiments.
[0028] FIG. 21 conceptually illustrates an architecture block
diagram of a system that implements the iterative fact-extraction
process of some embodiments.
[0029] FIG. 22 conceptually illustrates an electronic system with
which some embodiments are implemented.
DETAILED DESCRIPTION
[0030] In the following description, numerous details are set forth
for the purpose of explanation. However, one of ordinary skill in the
art will realize that the invention may be practiced without the
use of these specific details. For instance, some of the examples
illustrate iterative fact-extraction for specific categories of
facts. One of ordinary skill in the art will recognize that
fact-extraction could be accomplished based on any number of
categories of facts. Further, some embodiments perform
iterative fact-extraction to a specified number of levels. Each
level of iterative fact-extraction may identify additional facts.
However, one of ordinary skill in the art will also realize that
fact-extraction may be performed to any number of levels.
[0031] Some embodiments of the invention provide a novel method for
extracting one or more facts from a document that contains several
document elements. Examples of document elements include content
elements (e.g., glyphs, letters, words, punctuation, numerical
characters, symbols, etc.), structural elements (e.g., markup tags,
headers, sections, columns, dividers, lines, etc.), and any other
document elements that can be parsed in a programmatic manner.
Different embodiments can examine different types and combinations
of such document elements.
[0032] The method of some embodiments iteratively analyzes the
document to identify facts contained in the document, which it then
annotates with tags that identify the facts. In each iteration, the
method of some embodiments identifies facts by (1) identifying
patterns of document elements and/or tags from prior iteration(s),
and (2) associating a tag with each pattern identified in the
current iteration to specify the fact that is derived from the
existence of the identified pattern. In other words, after each
level of the multi-level fact-extraction analysis, the method of
some embodiments can use the tags that the method generated in the
previous level(s) of fact-extraction to identify additional
patterns of document elements and/or tags. Such additional
patterns of document elements and/or tags are then tagged to
indicate other facts contained in the document. Identifying
subsequent patterns by using tags that specify previously
identified patterns results in a more efficient fact-extraction
process because this approach does not require a later analysis
stage to repeat the work done by a previous analysis stage.
[0033] In some embodiments, the method associates each identified
pattern of document elements and/or tags with another tag, which
is, in turn, associated with a fact that can get associated with
the identified pattern, as described above. However, in other
embodiments, the method might not associate each tag with a fact.
For instance, in some embodiments, the method identifies some tags
only to use them to identify subsequent patterns that are
subsequently associated with facts through other tags.
[0034] FIG. 1 conceptually illustrates an example of an iterative
fact-extraction method of some embodiments. In this example, the
fact-extraction method analyzes one document 110 in four stages,
with each of the second, third and fourth stages building on tags
identified in the previous stages. To identify patterns of document
elements and/or tags, the method of some embodiments executes one
set of pattern analysis instructions in each stage. This method
uses different pattern analysis instruction sets in different
embodiments. For example, in some embodiments, the pattern analysis
instruction sets define patterns as an uninterrupted sequence of
document elements (e.g., a sequence of words that form an
expression) or tags. In other embodiments, the pattern analysis
instruction sets define patterns in terms of one or more rules
that associate document elements and/or tags. Examples of such
rules include rules that specify a relationship between document
elements and/or tags (e.g., rules that specify intersection, spacing
and/or location constraints between one or more sets of words
and/or tags). In yet other embodiments, the pattern analysis
instruction sets define patterns in terms of an uninterrupted
sequence of document elements and one or more rules that associate
document elements and/or tags. For simplicity's sake, this example
only illustrates words in the document 110 as the document elements
that are used for each stage of the analysis.
[0035] FIG. 1 illustrates the fact-extraction analysis of the
document 110 in terms of four sets of results 101-104 that
correspond to the output of four stages of the analysis. The first
stage result 101 shows that the first stage of the analysis
identified one pattern. The identified pattern includes a certain
noun 115 that has a particular relationship with a certain verb
120. In this example, the particular relationship is the proximity
between the certain noun 115 and the verb 120. The first analysis
stage identifies this pattern in some embodiments when executing a
pattern analysis instruction. In some embodiments, this instruction
is a relationship-based rule that is expressed as a conditional
statement. One example of such a relationship-based rule (i.e., a
proximity-based rule in this example) is:
[0036] If Type A Noun within X of Type B Verb
[0037] → Tag C {to identify the first noun/verb pattern},
where X is an integer, Type A Noun is a list of certain nouns (e.g.,
a list of the names of certain companies), Type B Verb is a list of
certain verbs (e.g., certain action verbs), and Tag C is a tag that
is associated with the identified
pattern. One example of such a noun/verb pattern might be a noun
that is a company name followed by an action verb: hired, employed,
joined, started, etc. This noun/verb pattern would possibly
indicate that a particular company has hired a new employee.
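By way of a non-limiting illustration, the proximity-based rule above could be sketched in Python as follows. This is only a sketch under stated assumptions: the `Tag` class, the `tag_noun_verb` function, the use of word positions to measure proximity, and the sample sentence (including the names "Acme" and "Jane Doe") are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """Hypothetical tag record: a name plus the word span it covers."""
    name: str
    start: int  # index of the first word covered by the pattern
    end: int    # index of the last word covered by the pattern (inclusive)


def tag_noun_verb(words, nouns, verbs, max_gap, tag_name):
    """Tag each listed noun that lies within max_gap words of a listed verb."""
    noun_positions = [i for i, w in enumerate(words) if w in nouns]
    verb_positions = [i for i, w in enumerate(words) if w in verbs]
    tags = []
    for n in noun_positions:
        for v in verb_positions:
            if abs(n - v) <= max_gap:  # "Type A Noun within X of Type B Verb"
                tags.append(Tag(tag_name, min(n, v), max(n, v)))
    return tags


# Assumed example sentence; the lists play the role of Type A and Type B.
words = "Acme hired Jane Doe as Chief Executive Officer".split()
tags = tag_noun_verb(words, {"Acme"}, {"hired", "joined"},
                     max_gap=3, tag_name="New Hire")
print(tags)
```

In this sketch the integer X of the rule corresponds to `max_gap`, and Tag C corresponds to the returned `Tag` objects.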
[0038] In some embodiments, the method not only associates each
identifiable pattern with a tag but also associates each tag with a
fact that possibly exists in the document. FIG. 2 illustrates the
association of each pattern 205 with a tag 210, which, in turn, is
associated with a fact 215. As shown in FIG. 2, some embodiments do
not associate a fact with each tag in each level of pattern
analysis. As further described below, some embodiments do not
associate facts with the earlier stage tags. Also, as further
described below, the facts that are derived in some embodiments are
the tags that are defined in the later stages of the analysis. In
other words, these embodiments do not specify an association
between tags and facts, but instead utilize the later stage tags to
express facts that can be reported, presented and/or queried.
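As a hedged illustration of the pattern-to-tag-to-fact association described above, the following Python sketch models a tag whose fact is optional, so that earlier stage tags can exist without any associated fact. The `Tag` class, its field names, the `reportable_facts` helper, and the "Job Title" sample tag are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tag:
    """Hypothetical tag: identifies a pattern and optionally carries a fact."""
    name: str
    pattern: str                # description of the identified pattern
    fact: Optional[str] = None  # None for tags used only by later stages


def reportable_facts(tags: List[Tag]) -> List[str]:
    """Return only the facts of tags that actually have an associated fact."""
    return [t.fact for t in tags if t.fact is not None]


tags = [
    # Intermediate tag with no fact of its own (assumed example).
    Tag("Job Title", "multi-word titular expression"),
    # Tag associated with a fact, as in the "New Hire" example.
    Tag("New Hire", "company name near a hiring verb",
        "New employee hiring by a company"),
]
facts = reportable_facts(tags)
print(facts)
```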
[0039] In the example illustrated in FIG. 1, the first stage result
101 associates the identified noun/verb pattern with a tag 125 that
has an associated fact. For instance, in the new hiring example
mentioned above, the tag 125 would be a "New Hire" tag that would
be associated with a fact: "New employee hiring by a company."
[0040] The second stage result 102 shows that the second stage of
the analysis identified two patterns. In the second stage, the
first identified pattern is a multi-word expression 165, while the
second identified pattern is formed by the first stage tag 125
having a relationship with (i.e., in this example, being within a
certain proximity of) the multi-word expression. The second
analysis stage identifies these patterns in some embodiments when
executing a pattern analysis instruction, which in some of these
embodiments is formulated as:
[0041] If find Type D Expression,
[0042] Then If Type D Expression within Q of Tag C →
[0043] Tag E {to identify Type D Expression},
[0044] Tag F {to identify Type D Expression within Q of Tag C},
where Q is an integer, Type D Expression is a list of certain words,
multi-word combinations or acronyms (e.g., a list of job titles),
Tag E is a tag that is associated with the identified expression,
and Tag F is a tag associated with the pattern formed by the
identified expression having a relationship with (i.e., in this
example, being within a certain proximity of) the noun/verb pattern
identified in the first stage. Tag types E and F appear as tags 135
and 140 in the second stage result 102.
Using the first stage tag 125 to identify the second stage tag 140
allows the fact-extraction method to operate more efficiently by
allowing the second stage analysis to build on the first stage
analysis (i.e., to not repeat the first stage analysis).
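The way the second stage builds on the first stage tag, rather than repeating the noun/verb search, might be sketched as follows. This is a minimal illustration under assumptions: tags are represented as `(name, start, end)` tuples over word positions, and the helper names `find_expressions` and `stage_two`, as well as the sample sentence, are hypothetical.

```python
def find_expressions(words, expressions):
    """Return (start, end) word spans where a listed expression occurs."""
    spans = []
    for expr in expressions:
        parts = expr.split()
        for i in range(len(words) - len(parts) + 1):
            if words[i:i + len(parts)] == parts:
                spans.append((i, i + len(parts) - 1))
    return sorted(spans)


def stage_two(words, expressions, prior_tags, max_gap):
    """Emit Tag E per expression, and Tag F when it is near an existing Tag C."""
    new_tags = []
    for start, end in find_expressions(words, expressions):
        new_tags.append(("E", start, end))
        for name, t_start, t_end in prior_tags:
            if name != "C":
                continue
            # Number of word positions between the two spans (0 if they overlap).
            gap = max(start - t_end, t_start - end, 0)
            if gap <= max_gap:
                new_tags.append(("F", min(start, t_start), max(end, t_end)))
    return new_tags


words = "Acme hired Jane Doe as Chief Executive Officer".split()
stage_one_tags = [("C", 0, 1)]  # assumed output of the first stage
titles = ["Chief Executive Officer", "Vice President"]
result = stage_two(words, titles, stage_one_tags, max_gap=5)
print(result)
```

Note that `stage_two` never looks at the noun or verb lists; it only consumes the Tag C spans produced earlier, which is the reuse the paragraph above describes.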
[0045] For the new hiring example mentioned above, the expression
could be a multi-word titular expression such as Chief Executive
Officer, Vice President, Chief Operating Officer, etc., while the
tag/expression pattern could be a titular expression having a
relationship with (i.e., in this example, being within a certain
proximity of) a New Hire tag. Accordingly, for this new hire
example, the tag 135 would associate a job title with the
identified titular expression pattern. The tag 140, in turn, could
be a "New Hire Title" tag 140 that could be a title (in this case,
the identified multi-word expression) associated with the New Hire
tag.
[0046] The third stage result 103 shows that the third stage of the
analysis identifies another pattern. This pattern includes a
certain noun 130 having a relationship with (e.g., in this example,
being within a certain proximity of) a certain verb 145, with this
noun/verb combination within a particular proximity of the first
stage tag 125. The third
analysis stage identifies this pattern in some embodiments when
executing a pattern analysis instruction, which in some of these
embodiments is formulated as:
[0047] If Type G Noun within Y of Type H Verb,
[0048] Then If Type G Noun or Type H Verb within W of Tag C →
[0049] Tag I {to identify noun/verb pattern},
where Y and W are integers, Type G Noun is a list of certain nouns
(e.g., a list of the names of certain companies), Type H Verb is a
list of certain verbs (e.g., certain action verbs), and Tag I is a
tag that is associated with the identified pattern of the newly
identified noun/verb combination that has a relationship with (e.g.,
is within a certain proximity of) the first Tag C, which is
associated with the first noun/verb pattern. In the third stage,
Tag I appears as tag 155.
[0050] For the new hiring example mentioned above, the second
noun/verb pattern identified in the third stage could be a noun
that is a company name followed by an action verb, such as fired,
left, resigned, quit, etc. This noun/verb pattern would possibly
indicate that a particular person has left a particular company.
When this second noun/verb pattern is within a certain proximity of
the first noun/verb pattern identified in the first stage (where
this first pattern
identifies a company hiring a new employee), this relationship
(e.g., proximity) might specify the previous employer for a newly
hired person. To annotate this potential fact, the third stage
result 103 (1) uses a "Previous Company" tag as the tag 155, and
(2) stores with this tag a reference to the noun from the
identified second noun/verb pattern, in order to identify the name
of the previous company.
[0051] The fourth stage result 104 shows that the fourth stage of
the analysis identifies yet another pattern. This pattern includes
a quote 150 having a relationship with (i.e., in this example,
within a certain proximity of) the tag 125 that represented the
first identified noun/verb pattern. The fourth analysis stage
identifies this pattern in some embodiments when executing a
pattern analysis instruction, which in some of these embodiments is
formulated as:
[0052] If Type J Quote within Z of Tag C
[0053] → Tag K {to identify quote within proximity of first
noun/verb pattern},
where Z is an integer, Type J Quote is one of the quote types from a
list of quote types (e.g., single quotes, double quotes, block
quotes, nested quotes), and Tag K is a tag that is associated with
the identified pattern of the newly identified quote within a
certain proximity of the first Tag C, which is associated with the
first noun/verb pattern. In the fourth stage, Tag K appears as tag
160.
[0054] For the new hiring example mentioned above, the quote
identified in the fourth stage could possibly be a quote about the
hiring of the new employee when this quote has a relationship with
(i.e., in this example, is within a certain proximity of) the first
noun/verb pattern identified in the first stage (where this first
pattern identifies a company hiring a new employee). Accordingly,
to annotate this potential fact, the fourth stage result 104 (1)
uses a "New Hire Quote" tag as the tag 160, and (2) stores with
this tag a reference to a portion or the entirety of this quote.
This reference can be specified in a variety of ways, such as in
terms of the start and end locations of the quote or quote portion,
or in terms of the start or end location plus or minus a certain
number of words, or in terms of a location within the quote or
document from which the rest of the portion or the quote can be
identified, etc.
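One of the reference options described above, storing the start and end locations of the quote with the tag, could be sketched as follows. The sketch assumes that proximity is measured in characters, that only double-quoted text is considered, and that the anchor is the character position of the first stage pattern; the function name `tag_quotes_near`, the dictionary keys, and the sample text are hypothetical.

```python
import re


def tag_quotes_near(text, anchor_pos, max_chars):
    """Tag double-quoted spans lying within max_chars of an anchor position."""
    tags = []
    for m in re.finditer(r'"([^"]+)"', text):
        distance = min(abs(m.start() - anchor_pos), abs(m.end() - anchor_pos))
        if distance <= max_chars:
            # Store a reference to the quote (offsets), not a copy of its text.
            tags.append({"tag": "New Hire Quote",
                         "start": m.start(1),
                         "end": m.end(1)})
    return tags


text = 'Acme hired Jane Doe. "We are thrilled," said the board.'
anchor = text.index("hired")  # assumed location of the first stage pattern
quote_tags = tag_quotes_near(text, anchor, max_chars=40)

# The quote can be recovered later from the stored offsets.
quote = text[quote_tags[0]["start"]:quote_tags[0]["end"]]
print(quote)
```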
[0055] One of ordinary skill will realize that the fact-extraction
process that is described above by reference to FIG. 1 is only one
example of how some embodiments perform fact-extraction. Other
embodiments might perform such processes differently. For instance,
while the example illustrated in FIG. 1 is described by reference
to four stages of analysis, other fact-extraction processes of
other embodiments might have fewer or additional stages. Moreover,
other embodiments might generate tags and associate these tags with
the document, document elements, and/or other tags differently. For
example, as mentioned above, the third stage of the fact-extraction
process of FIG. 1 creates one tag when it identifies a second
noun/verb pattern and then identifies the relationship (e.g., the
proximity) of this pattern with respect to the first tag 125. Other
embodiments, however, might first create and associate a tag with
the second noun/verb pattern identified in the third stage, and
then in another stage identify the relationship (e.g., proximity)
of this new tag to the first tag 125.
[0056] Several more detailed embodiments are described in the
sections below. Section I provides a conceptual description of the
multi-level iterative fact-extraction process. Next, Section II
conceptually describes example structures of documents and tags.
Section III provides conceptual descriptions of example patterns
that the iterative fact-extraction process can identify in a
document. Section IV provides a description of a tool for manually
processing the tags of a document to adjust the results of
iterative fact processing. Section V describes examples of various
graphical user interfaces for presenting the output of the
iterative fact-extraction process. Next, Section VI provides a
description for an architecture of a system that performs the
iterative fact-extraction process. Section VII describes an
electronic system that implements the iterative fact-extraction
process.
I. Exemplary Fact-Extraction Process
[0057] FIG. 3 conceptually illustrates an iterative pattern
analysis process 300 of some embodiments. In each iteration, this
process analyzes the document to identify, and to annotate with
tags, patterns of (1) document elements that are contained in the
document and/or (2) annotated tags that were previously associated
with the document in any previous iteration. In some embodiments,
some or all of the annotated tags are associated with facts.
Through this association, facts can be extracted from the document.
The process 300 will be explained below by reference to FIG. 4,
which conceptually illustrates multiple levels of analysis 410 that
are performed on a document 405 based on a group of pattern
analysis instructions 415.
[0058] The process 300 is performed each time one document needs to
be analyzed to extract facts from it. Some embodiments perform this
process on each document in a group of documents that is retrieved
from a network (e.g., a local area network, a wide area network,
etc.) or network of networks (e.g., the Internet) by a document
retriever that retrieves documents (e.g., HTML documents, PDF
documents, word processed documents--such as Word documents, Word
Perfect documents, etc.) on a periodic or continuous basis. A web
crawler is an example of such a document retriever. Some
embodiments perform this process for each document when batch
processing several documents. Other embodiments perform this
process on each document on a real-time basis as each document is
received or retrieved. Also, when a document does not have document
elements or a sufficient number of document elements in the desired
format for the pattern-identification analysis of the process 300,
some embodiments attempt to change the document's format into a
format that has the desired document elements or desired number of
document elements. For instance, when the document is a PDF
document that is not specified in terms of alphanumeric characters
or symbols, some embodiments perform optical character recognition
(OCR) analysis to produce alphanumeric characters or symbols in the
document.
[0059] As shown in FIG. 3, the process initially identifies (at
305) one set of pattern analysis instructions for identifying
patterns in the document. The process then executes (at 310) the
identified pattern analysis instruction set. The execution of such
a set may cause the process to identify one or more patterns of
document elements. For each pattern that it identifies at 310, the
process (at 310) creates a tag and associates the tag with the
pattern. In some embodiments, each tag can be associated with a
respective fact. Hence, each pattern identified at 310 can be
associated with a fact through its associated tag.
[0060] However, while the process 300 in some embodiments
associates each identified pattern with a tag that is associated
with a fact, the process 300 in other embodiments does not
associate each identified pattern with a tag that is associated
with a fact. For instance, in some embodiments, the process 300
does not associate facts with all tags, or does not associate facts
with tags from the earlier analysis stages (e.g., with tags from
the first analysis stage 310). Accordingly, in some embodiments,
not all the patterns identified by the process 300 are necessarily
associated with facts through their tags. In these embodiments, the
process 300 identifies some tags only to use them to identify
subsequent patterns that are associated with facts through their
tags.
[0061] FIG. 4 illustrates an example of a first level analysis
410_1 that is performed on a document 405 at 310. As shown in
this figure, this first analysis level executes a first set of
pattern analysis instructions 417 on the document 405. This
execution results in a first set of tags 407 that are associated
with a first set of patterns identified by the process 300 at 310.
Thus, the output of the level one analysis 410_1 includes a tag
set 407 that indicates an extracted set of facts in the document
405.
[0062] After 310, the process 300 then determines (at 315) whether
more pattern analysis is to be performed on the document, as shown
in FIG. 3. When the process determines (at 315) that it needs to
perform additional pattern analysis, it transitions to 320 to
perform another level of pattern analysis. At 320, the process
identifies another set of pattern analysis instructions to
execute.
[0063] From 320, the process then transitions back to 310, where it
executes the pattern analysis instruction set identified at 320. In
some embodiments, the pattern analysis instruction sets after the
first pattern analysis instruction set can define patterns of
document elements and/or tags identified in prior pattern analysis
levels. Accordingly, the execution (at 310) of the instruction set
identified at 320 may cause the process 300 to identify one or more
patterns of document elements and/or tags. For each pattern that it
identifies at 310, the process (at 310) creates a tag and
associates the tag with the pattern. As mentioned above, some
embodiments associate each tag with a fact, while other embodiments
do not associate each tag with a fact. When a tag is associated
with a fact, the tag associates this fact with the pattern that the
tag annotates.
[0064] FIG. 4 illustrates an example of a second level analysis
410_2 that is performed on the document 405 at 310. As shown in
this figure, this second analysis level executes a second set of
pattern analysis instructions 420 on the document 405. This
execution results in a second set of tags 422 that are associated
with a second set of patterns of document elements and/or tags that
is identified by the process 300 at 310. Thus, the output of the
level two analysis 410_2 includes a tag set 422 that indicates an
additional set of facts extracted from the document 405.
[0065] After 310, the process again determines (at 315) whether
more pattern analysis is to be performed on the document, as shown
in FIG. 3. If so, the process repeats operations 320 and 310 for
each additional set of pattern analysis instructions that it needs
to execute. In the example illustrated in FIG. 4, the process 300
performs N-2 additional pattern analysis iterations
410_3 through 410_N, after performing the first two iterations of
analysis. In each of these additional levels of analysis, the
process executes another pattern analysis instruction set (e.g.,
instruction set 430 and instruction set 440) that results in
another set of tags (e.g., tag set 432 and tag set 442). Each of
these tag sets, in turn, includes one or more tags that can be used
in subsequent levels of analysis and that can be associated with
one or more facts.
[0066] Once the process 300 determines (at 315) that it does not
need to perform any additional pattern analysis operation, the
process stores (at 330) the tags generated at 310 through multiple
levels of analysis. A separate process can then later analyze the
stored tags to extract facts associated with the tags. After 330,
the process 300 ends.
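The loop formed by operations 305-330 can be sketched in code. The following is a minimal illustration only, not the implementation of any embodiment: it assumes that a pattern analysis instruction set can be modeled as a function over the document elements and the previously created tags. The names Tag, level_one, level_two and run_process_300, as well as the small lexicons, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tag:
    tag_type: str   # illustrative type name, e.g., "person"
    start: int      # index of the first document element covered
    count: int      # number of document elements covered
    level: int      # analysis level that created the tag


# Hypothetical model of a pattern analysis instruction set: a function that
# receives the document elements and all previously created tags, and
# returns (tag_type, start, count) triples for each pattern it identifies.
InstructionSet = Callable[[List[str], List[Tag]], List[Tuple[str, int, int]]]

# Illustrative lexicons (assumptions, not part of the described process).
PERSONS = {"Chadwick"}
COMPANIES = {"Acme"}
JOINING_VERBS = {"joins", "joined", "hired"}


def level_one(elements: List[str], tags: List[Tag]) -> List[Tuple[str, int, int]]:
    """First level: patterns formed by document elements only."""
    found = []
    for i, word in enumerate(elements):
        if word in PERSONS:
            found.append(("person", i, 1))
        elif word in COMPANIES:
            found.append(("company", i, 1))
        elif word.lower() in JOINING_VERBS:
            found.append(("joining_verb", i, 1))
    return found


def level_two(elements: List[str], tags: List[Tag]) -> List[Tuple[str, int, int]]:
    """Second level: a pattern formed by tags from the prior level."""
    ordered = sorted(tags, key=lambda t: t.start)
    found = []
    for a, b, c in zip(ordered, ordered[1:], ordered[2:]):
        if (a.tag_type, b.tag_type, c.tag_type) == ("person", "joining_verb", "company"):
            found.append(("management_turnover", a.start, c.start + c.count - a.start))
    return found


def run_process_300(elements: List[str], instruction_sets: List[InstructionSet]) -> List[Tag]:
    tags: List[Tag] = []
    remaining = list(instruction_sets)
    level = 0
    while remaining:                        # 315: is more analysis needed?
        instructions = remaining.pop(0)     # 305 / 320: identify next set
        level += 1
        for tag_type, start, count in instructions(elements, tags):  # 310
            tags.append(Tag(tag_type, start, count, level))
    return tags                             # 330: tags to be stored
```

In this sketch, calling run_process_300("Chadwick joins Acme as CFO".split(), [level_one, level_two]) yields three first-level tags followed by one second-level tag that spans the first three elements. The second-level tag is defined purely by reference to earlier tags, which mirrors the case where early tags serve only to identify later patterns.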
[0067] FIG. 5 conceptually illustrates a fact-extraction process
500 that processes a series of tags that are generated by
performing the process 300 multiple times for multiple documents in
order to extract and store facts associated with these tags. Some
embodiments perform the process 500 periodically for a collection
of documents that are annotated with a collection of tags.
[0068] As shown in FIG. 5, the process 500 initially selects (at
505) a document from the collection of documents. The process then
identifies (at 507) a tag type of a tag associated with the
document. The process then determines (at 510) whether the
identified tag type can be processed.
[0069] When the process 500 determines (at 510) that the identified
tag cannot be processed, the process flags (at 515) the document
for manual processing. In some embodiments, a document is flagged
for manual processing when it is determined during the analysis of
the document that some aspect of the analysis might not be
accurate. For instance, in some cases, a set of tags associated
with a document does not accurately identify the facts in the
document. Such could be the case when the fact-extraction process
erroneously annotates the document with certain tags due to
unforeseen combinations of document elements that cause an error in
the pattern analysis. An example of the fact-extraction process
erroneously annotating the document is when a company is instead
tagged as a person (e.g., "Charles Schwab"). In another example,
the fact-extraction process fails to identify a pattern of document
elements that indicates a particular fact in the document. For
example, the fact-extraction process could fail to identify a
particular pattern for a person because the person's name (e.g., an
Indian name) is not recognized as a pattern. Additionally, when a
certain number of tags must be associated with a set of identified
patterns to derive a set of particular facts from the document, but
the number of tags does not meet this requisite number, the document
can be flagged as requiring manual processing.
[0070] For the manual processing of the document, some embodiments
provide a tool for modifying the tags associated with the document.
In this fashion, a human operator (e.g., a user) can use the tool
to manually modify the tags so that the tags are more accurately
associated with the facts in the document. The tool, in some
embodiments, then marks the document as processed or validated by a
human operator. One example of such a tool will be further
described below by reference to FIG. 17. After flagging the
document for manual processing, the process 500 transitions to 525,
which will be further described below.
[0071] Alternatively, when the process 500 determines (at 510) that
the identified tag type can be processed, the process 500 then
processes (at 520) the set of tags associated with the identified
tag type to extract a set of facts associated with these tags. In
some embodiments, the sets of tags are implemented as a set of tag
objects that are converted to a set of fact objects. This set of
fact objects is populated with the data from the tag objects.
Moreover, while processing tag objects of a particular type, the
process can search for other previously specified tags that it can
examine to retrieve facts to populate the attributes of the set of
fact objects it creates for the particular set of tag objects.
[0072] The extracted facts (e.g., fact objects) can be stored to
represent the extracted facts associated with the selected
document. Accordingly, the process stores the extracted facts in a
storage that can later be queried with or without the documents. In
some embodiments, the storage that contains the extracted facts is
the same storage that contains the documents and tags, while in
other embodiments different storages are used to store the
documents, tags, and/or facts. After 520, the process 500 then
transitions to 523.
[0073] The process 500 determines (at 523) whether the identified
tag type is the last tag type that it has to analyze for the
selected document. When the process 500 determines (at 523) that
the identified tag type is not the last tag type, the process 500
returns to 507 to select another tag type for processing and then
repeats the subsequent operations for this document in order to
process each tag type associated with the selected document.
[0074] Alternatively, when the process 500 determines that the
identified tag type is the last tag type that it has to analyze for
the selected document, the process 500 transitions to 525. The
process 500 determines (at 525) whether the selected document is
the last document from the collection of documents. When the
process 500 determines (at 525) that the document is not the last
document from the collection of documents, the process 500 returns
to 505 to select another document for processing and then repeats
the subsequent operations for this document in order to process it.
The process 500 ends when it determines that the last document has
been processed.
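The control flow of operations 505-525 can be sketched as follows. This is an illustrative sketch under stated assumptions: documents and tags are modeled as plain dictionaries, a tag type "can be processed" when a handler is registered for it, and the names run_process_500 and person_facts are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Fact = Dict[str, str]
TagDict = Dict[str, str]
# Hypothetical handler: converts the tags of one type into fact objects.
Handler = Callable[[str, List[TagDict]], List[Fact]]


def person_facts(doc_id: str, tags: List[TagDict]) -> List[Fact]:
    """Illustrative conversion of person tag objects into fact objects."""
    return [{"document": doc_id, "fact": "person", "name": t["text"]} for t in tags]


def run_process_500(
    documents: List[dict], handlers: Dict[str, Handler]
) -> Tuple[List[Fact], List[str]]:
    fact_store: List[Fact] = []     # storage for the extracted facts
    flagged: List[str] = []         # documents flagged for manual processing
    for doc in documents:                                   # 505 / 525
        # Distinct tag types of the document, in first-seen order.
        tag_types = list(dict.fromkeys(t["type"] for t in doc["tags"]))
        for tag_type in tag_types:                          # 507 / 523
            handler = handlers.get(tag_type)
            if handler is None:                             # 510: cannot process
                flagged.append(doc["id"])                   # 515: flag document
                break                                       # continue at 525
            tags = [t for t in doc["tags"] if t["type"] == tag_type]
            fact_store.extend(handler(doc["id"], tags))     # 520: extract facts
    return fact_store, flagged
```

The break after flagging reflects the transition from 515 directly to 525 described above. Whether facts already extracted for a flagged document are kept or discarded is a design choice that this sketch does not settle; here they are kept.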
[0075] One of ordinary skill will realize that other embodiments
perform pattern analysis and fact extraction differently than the
approach described above by reference to the pattern analysis
process 300 of FIG. 3 and the fact extraction process 500 of FIG.
5. For instance, some embodiments combine the pattern analysis and
fact extraction processes into one process that is performed
individually for each document. In some such embodiments, the tags
that are identified in the earlier stages are only used to identify
in later stages subsequent patterns and tags, which are then used
to extract facts. Also, in some of these embodiments, the facts
that are derived in the later stages are simply tags that are
identified based on patterns that are detected from the earlier
stage tags. Accordingly, some of these embodiments do not specify
an association between tags and facts, but instead utilize the
later stage tags to express facts that can be reported, presented
and/or queried. In some of these other embodiments, some of these
later stage tags are expressed in a manner that is optimized for
subsequent querying or reporting. However, other embodiments still
convert some of these later stage tags into a fact-specific format
that is optimized for subsequent querying and reporting.
[0076] In the sections below, examples of patterns that the
iterative fact-extraction process can identify are described
in connection with FIGS. 8-16. However, before describing the
example patterns, a description of the structure of a particular
document and tags is given.
II. Structure of Documents and Tags
[0077] A. Example Structure of a Document
[0078] In some embodiments, the process of iterative
fact-extraction examines documents for document elements that
include, but are not limited to, characters, expressions, phrases,
symbols, markup tags and other structural elements. To illustrate
this further, FIG. 6 conceptually illustrates an example document
610 that has been analyzed in multiple iterations in order to
extract a fact about management turnover.
[0079] In some embodiments, the document 610 is characterized as a
web document that includes markup tags (e.g., 605) that denote
particular sections of the document. In some embodiments, each
markup tag could be from markup languages such as Hypertext Markup
Language (HTML), Extensible Markup Language (XML), Extensible
Hypertext Markup Language (XHTML), and LaTeX.
[0080] The document 610 includes numerous document elements. In
different embodiments, the document elements of the document 610
are defined differently. For instance, some embodiments define the
document elements of this document as alphanumeric characters,
symbols and markup tags, while other embodiments define the
document elements of the document 610 as alphanumeric character
strings (which can be words), symbols and markup tags.
[0081] FIG. 6 illustrates three patterns 615-625 that have been
identified in order to extract several facts relating to a
management turnover. In some embodiments, a pattern analysis
process has identified these and other patterns in several pattern
analysis iterations. As described above, the pattern analysis
process of some embodiments, in each iteration, analyzes the
document to identify, and to annotate with tags, patterns of
document elements and/or annotated tags that are previously
associated with the document in any previous iteration. In FIG. 6,
the tags that are used to annotate the patterns are not
illustrated. Instead, only the identified patterns are circled to
simplify the presentation of this example.
[0082] When the document elements of the document 610 are
characters and markup tags, the pattern analysis process of some
embodiments first identifies patterns of character strings that
form words. In some embodiments, this process then searches the
document to see if it contains a title field, and if so, whether
this field includes words relating to management turnover in a
company. In this example, the document 610 includes a title tag 605
that specifies the title field. This title field includes three
words that the process identifies as words (specified by word tags
that identify recognized character strings) relating to management
turnover. These words are CFO 630, hired 635, and MacAffie 640.
[0083] Once the process identifies the management turnover words in
the title field of the document, the process in some embodiments
examines the document further to identify additional management
turnover facts. In this example, two additional management turnover
facts are extracted. One fact relates to the name of the newly
hired CFO, while the other fact relates to the name of the previous
CFO. The first fact is identified by identifying the pattern formed
by the word "new" 655, the name "Jonathan Chadwick" 660, and the title
"chief financial officer" 665 being within a certain proximity of
each other. The second fact is identified by identifying the
pattern formed by the words "leaving" 670, "MacAffie" 672, and "CFO"
674 and the name "Rocky Pementel" 676 being within a certain proximity
of each other. The pattern extracted from the title field and the
two patterns extracted from the body of the document can then be
grouped as a part of a larger pattern 680. This larger pattern can
then be used to represent several facts extracted from the document
610 regarding a management turnover.
[0084] B. Example Structure of Tags
[0085] As described above, the iterative fact-extraction process of
some embodiments creates tags to identify patterns of document
elements and/or tags in the document. FIG. 7 conceptually
illustrates an example of a tag data structure 710
for a tag that is created by the fact-extraction process. This tag
structure is a "non-token" tag for some embodiments that utilize
token and non-token tags. A token tag in some of these embodiments
is a tag that is initially assigned to each recognizable character
string (e.g., each recognizable word or punctuation mark) in the
document. In some embodiments, the earliest or one of the earliest
stages of the pattern analysis iterations identifies the token
tags, while the later stages of the pattern analysis define
non-token tags by reference to token tags. In other words,
non-token tags in some embodiments are tags that are defined in
later stages by reference to token tags.
[0086] As illustrated in FIG. 7, the tag 710 includes a tag
identifier 720, a tag type 730, a token number 740, a count field
750 and metadata 760. The tag identifier 720 uniquely identifies
the tag amongst the other tags. The tag type 730 represents a type
of tag corresponding to an identified pattern. For example, the tag
type can be, but is not limited to, person, company, joining verb,
quote, descriptive phrase, location, title, etc.
[0087] The token number 740 represents the position of the tag in
terms of a first token tag that it overlaps in the document. The
count field 750 represents the length of the tag in terms of the
number of tokens it overlaps, where the length is an integer value
greater than or equal to one. For instance, if the tag's token
number is 50 and its count field is 5, that means that the tag
starts at the location of the 50th token and ends at the
location of the 55th token.
[0088] The metadata 760 represents metadata that describes
additional contextual and/or factual information related to the tag
type. In some embodiments, the metadata reflects further
qualifications on the tag type. For example, the metadata can be
information related to a gender of a person, an ethnic origin of a
name, a nationality of a person, and/or a role.
[0089] In other words, different types of metadata can be included
with particular tag types in some embodiments. For example, a set
of tag types corresponding to a person "Ursula Jones", a gender
pronoun "she", a descriptive phrase "an active city council
member", a quote "This is an exciting opportunity" and a title
"Chairwoman" can include different metadata represented by the
following examples: person (gender=female), gender pronoun
(backreference=Ursula Jones), descriptive phrase
(backreference=Ursula Jones), quote (speaker=Ursula Jones), and title
(gender=female).
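The fields of the tag 710 map naturally onto a small record type. The sketch below is one possible representation, not the structure used by any embodiment: the class name NonTokenTag and the method end_location are hypothetical, and the end location is assumed to be the token number plus the count, which reproduces the example above (token number 50, count 5, end location 55).

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NonTokenTag:
    tag_id: int          # tag identifier 720: unique among the tags
    tag_type: str        # tag type 730: e.g., "person", "quote", "title"
    token_number: int    # token number 740: first token tag overlapped
    count: int           # count field 750: length in tokens, at least one
    metadata: Dict[str, str] = field(default_factory=dict)  # metadata 760

    def __post_init__(self) -> None:
        # The count is described as an integer greater than or equal to one.
        if self.count < 1:
            raise ValueError("count must be an integer greater than or equal to one")

    def end_location(self) -> int:
        # Assumption: end location = token number + count.
        return self.token_number + self.count


# Metadata examples taken from the description above.
person = NonTokenTag(1, "person", 50, 5, {"gender": "female"})
pronoun = NonTokenTag(2, "gender pronoun", 60, 1, {"backreference": "Ursula Jones"})
```

Keeping the metadata as free-form key/value pairs allows different tag types to carry different qualifications (gender, backreference, speaker) without changing the structure itself.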
[0090] Other embodiments define tag structures differently. For
instance, some embodiments do not define one set of tags in terms
of another set of tags (e.g., do not define non-token tags in terms
of token tags). Still other embodiments define building-block tag types
other than token tags. For instance, some embodiments define
sentence tags, which are used to define token tags that, in turn,
are used to define non-token tags. Instead of defining sentence
and/or token tags, some embodiments utilize elements that are
native in the document as the building blocks for defining tags
(e.g., for defining the location of the tags).
III. Exemplary Patterns
[0091] The following description provides several additional
conceptual examples of patterns that are identified by the
fact-extraction process of some embodiments. In some embodiments,
this process identifies these patterns in several pattern analysis
iterations. As described above, the pattern analysis process of
some embodiments, in each iteration, analyzes the document to
identify, and to annotate with tags, patterns of document elements
and/or annotated tags that are previously associated with the
document in any previous iteration.
[0092] In the examples below, not all the tags that are used to
annotate the patterns are illustrated. Instead, for some of the
identified patterns, only the pattern is encircled in order to
simplify the presentation of this example. Also, in these examples,
the identified patterns are patterns that are built with words or
multi-word sequences. To identify these words and sequences, some
embodiments perform pattern analysis operations that precede the
stages illustrated in the examples, and these earlier stages
identify and tag character strings or sequences of words to
represent these words and sequences. These earlier analysis
operations are not described in the examples below in order to
avoid unneeded complexity in describing the examples. Lastly,
to simplify the examples below, the patterns described below are
defined by reference to "elements." These elements can be document
elements (e.g., words, etc.) native to the document in some
embodiments, tags (specifying earlier identified patterns) in other
embodiments, or both in yet other embodiments.
[0093] A. Back-Referencing of Patterns
[0094] In some embodiments, the fact-extraction process correlates
identified patterns formed from document elements and/or tags with
other identified patterns. For example, the process can correlate a
gender pronoun (e.g., "he" or "she") with a person to determine the
gender of the person. The correlation between patterns creates a
relationship between the patterns.
[0095] FIG. 8 conceptually illustrates examples of back-referenced
patterns in a particular document and a corresponding tag of some
embodiments. Document 805 includes identified patterns as
highlighted elements 810-860. For example, as shown in FIG. 8, an
element 840 corresponds to a gender pronoun "He" and an element 820
corresponds to a person. Element 820, in some embodiments, includes
a proper full name, proper first name or proper last name.
[0096] In some embodiments, the fact-extraction process determines
a back-reference for the gender pronoun by searching the document
805 for the first instance of a proper name. The fact-extraction
process initially searches for the proper name starting from the
beginning of a sentence including the gender pronoun and continues
until reaching the end of that sentence. In a case where the proper
name is not found, the process determines the location of the
previous sentence from the current one and searches from the start
to the end of that sentence. The fact-extraction process continues
to search for the proper name in this fashion (e.g., from the start
to end of each previous sentence from the gender pronoun) until the
proper name is located.
[0097] The fact-extraction process will then correlate the gender
pronoun to the first instance of the proper name that is found. As
shown in this example, the process back-references the element 840
(i.e., gender pronoun) to element 820 (i.e., person) to create a
relationship between the proper name and the pronoun. In some
embodiments, the process confirms that element 820 is a person from
a relationship (e.g., the proximity) between element 810 (title) and
element 820 (person) when the name of the person is ambiguous.
After back-referencing the gender pronoun to the person, the
process can further annotate a respective tag associated with the
person to include contextual information or metadata for the
tag.
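The sentence-by-sentence search for a back-reference can be sketched as follows. This is a minimal sketch that assumes the document has already been split into sentences of tokens and that some earlier stage can recognize proper names; the function name find_back_reference and the is_proper_name predicate are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def find_back_reference(
    sentences: List[List[str]],
    pronoun_sentence_index: int,
    is_proper_name: Callable[[str], bool],
) -> Optional[Tuple[int, int]]:
    """Return (sentence index, token position) of the proper name to which
    the gender pronoun is correlated, or None when no name is found."""
    # Begin with the sentence that includes the gender pronoun, then move
    # to each previous sentence in turn.
    for index in range(pronoun_sentence_index, -1, -1):
        # Each sentence is searched from its start to its end.
        for position, token in enumerate(sentences[index]):
            if is_proper_name(token):
                return index, position
    return None


# Illustrative input; the set of known names is an assumption.
KNOWN_NAMES = {"Ursula", "Jones"}
sentences = [
    ["Chairwoman", "Ursula", "Jones", "spoke", "today"],
    ["The", "meeting", "ran", "late"],
    ["She", "said", "thanks"],
]
match = find_back_reference(sentences, 2, lambda token: token in KNOWN_NAMES)
```

Because each sentence is scanned from its start, a proper name that follows the pronoun within the same sentence would also be matched. That follows the search order described above and may or may not be desirable in a given embodiment.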
[0098] In some embodiments, the gender of the person can be determined
by the relationship between title element 810 and person element
820. In a case where the title is a known word and/or expression of
a gender specific title, the proximity of the title to the person
confirms that the person is of that gender derived from the title.
For example, in an example where the title is "Chairwoman," the
person within a certain proximity of the title is of the gender
(i.e., female) derived from the title.
[0099] The process of some embodiments then annotates a tag 865
corresponding to element 820 (person) with metadata 875 to describe
the gender of the person. The process can determine the gender of
the person by using the gender pronoun that is back-referenced to the
person. The tag 865 can also be annotated with other metadata derived
from this determined gender information. For example, the ethnic
origin of the person can be determined from the gender information. In
a case where the gender is male, the process can determine the ethnic
origin of the person's name by using a list of known male names of
a particular ethnic origin (e.g., male Indian, Spanish or English
names, etc.). The process then annotates tag 865 with further
metadata 875 describing the origin of the person's name.
[0100] Although the above example describes back-referencing the
gender pronoun to the person, in some embodiments, the process can
also correlate a person of a known gender to a gender pronoun. In a
case where the gender of the person is known (e.g., the name of the
person is determined to be a known male or female name), the
fact-extraction process searches from the position of the person to
locate the first instance of the gender pronoun matching the gender
of the name. After locating the matching gender pronoun, the
process then correlates the gender pronoun to the person to create
a relationship between the person and the pronoun. The tag 865 is
then annotated to describe the gender of the person in a similar
manner as described above.
[0101] B. Quote Attribution
[0102] FIGS. 9(a)-(b) conceptually illustrate example patterns for
attributing a name to a quote of some embodiments. In some
embodiments, the iterative fact-extraction process attributes a
name to a quote during a second level of pattern analysis. For
example, during a first level of pattern analysis, the process
identifies patterns for names and quotes in a particular document.
During the second level of pattern analysis, the process attributes
the name to the quote after determining the presence of intervening
text between the name and quote. By attributing the name to the
quote, the process can identify the speaker of the quote.
[0103] In the example, the pattern for quote attribution is formed
when the intervening text includes another identified element
(e.g., a "saying" verb) that is between previously identified
patterns of elements (e.g., the name and quote). In this manner,
the process attributes the name to the quote using a combination of
previously identified patterns and an identified pattern of
elements (e.g., the intervening text including the "saying" verb)
in the current iteration of fact-extraction.
[0104] As illustrated in FIG. 9(a), document 910 includes element
915 corresponding to a name, element 920 corresponding to a
"saying" verb, element 925 corresponding to a quote, and a pattern
935. The pattern 935 is formed from a combination of elements 915,
920 and 925. In some embodiments, the identified "saying" verb of
element 920 includes a verb that indicates that a particular person
provided spoken commentary (e.g., a remark that expresses a
statement or opinion). Examples of "saying" verbs include, but are
not limited to, forms of the words "say," "state," "express,"
"declare," etc. In some embodiments, the "saying" verb is
identified using a list including known "saying" verbs.
[0105] In some embodiments, the process identifies the pattern 935
by determining that element 920 ("saying" verb) is positioned
between elements 915 (name) and 925 (quote). In particular, pattern
935 requires that element 920 is positioned with intervening texts
940 and 945 between elements 915 and 925. The combination of
elements 915, 920 and 925 forms a pattern 935 that indicates that
the name of a person of element 915 said the quote of element 925.
Therefore, as illustrated in FIG. 9(a), the fact-extraction process
then annotates the pattern 935 with tag 930 to attribute the quote
to the name in document 910.
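A second-level pattern of this kind can be sketched over tags produced by an earlier level. The sketch below is illustrative only: it assumes tags are held as (type, start, count, text) records in token coordinates, and it models the relationship between the elements as a maximum number of intervening tokens (max_gap), a hypothetical parameter. The names Tag and attribute_quotes are also hypothetical.

```python
from typing import List, NamedTuple


class Tag(NamedTuple):
    tag_type: str   # illustrative: "name", "saying_verb", "quote"
    start: int      # position of the first token covered
    count: int      # number of tokens covered
    text: str       # text covered by the tag


def attribute_quotes(tags: List[Tag], max_gap: int = 5) -> List[Tag]:
    """Create a quote-attribution tag for each name / "saying" verb / quote
    sequence whose intervening text is at most max_gap tokens long."""
    ordered = sorted(tags, key=lambda t: t.start)
    attributions = []
    for i, verb in enumerate(ordered):
        if verb.tag_type != "saying_verb":
            continue
        names = [t for t in ordered[:i] if t.tag_type == "name"]
        quotes = [t for t in ordered[i + 1:] if t.tag_type == "quote"]
        if not names or not quotes:
            continue
        name, quote = names[-1], quotes[0]   # nearest name before, quote after
        gap_before = verb.start - (name.start + name.count)
        gap_after = quote.start - (verb.start + verb.count)
        if 0 <= gap_before <= max_gap and 0 <= gap_after <= max_gap:
            span = quote.start + quote.count - name.start
            attributions.append(
                Tag("quote_attribution", name.start, span, f"{name.text}: {quote.text}")
            )
    return attributions


# First-level tags for a sentence such as: Ursula Jones, the chairwoman,
# said yesterday that "This is an exciting opportunity" (positions assumed).
first_level = [
    Tag("name", 0, 2, "Ursula Jones"),
    Tag("saying_verb", 6, 1, "said"),
    Tag("quote", 9, 5, "This is an exciting opportunity"),
]
result = attribute_quotes(first_level)
```

The resulting tag spans the name, the verb, the quote and the intervening text, so a later stage can read the speaker and the quote from a single annotation.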
[0106] The process, in some embodiments, can attribute special
types of quotes to a person. To illustrate this scenario, FIG. 9(b)
conceptually illustrates a document 950 in which a "no comment" quote
(e.g., where a person states, "No comment") is attributed to the
person identified in the document.
Attribution of a "no comment" quote in document 950 is accomplished
in a similar fashion as described above in connection with document
910. In some embodiments, document 950 includes elements 955
(name), 960 ("saying" verb) and 965 ("no comment" quote) and
pattern 975. The pattern 975 is formed from the combination of
elements 955, 960 and 965. To identify pattern 975, the iterative
fact-extraction method identifies the presence of the "saying" verb
in element 960 with intervening texts 980 and 985 in between
elements 955 and 965. In some embodiments, the fact-extraction
method annotates the pattern 975 with tag 970 to attribute the "no
comment" quote to the name of the person.
[0107] Moreover, the process can attribute a name to a quote by
back-referencing the name to the quote in the document. Referring
back to FIG. 8, in some embodiments, the process back-references a
quote to a person to attribute the quote to the person. For
example, the process back-references the quote to the person by
scanning document 805 to identify the first instance of the person
in the document. As shown in FIG. 8, an element 860 corresponding
to the quote is back-referenced to the element 820 corresponding to
the person in this manner. In addition, the process can further
confirm that element 820 (person) is the speaker of the quote from
the relationship (e.g., proximity) of element 850 ("speaking"
verb) to the previously described back-referenced element 840
(gender pronoun) and element 820 (person).
[0108] FIG. 10 conceptually illustrates an additional example
pattern for attributing a name to a quote of some embodiments. This
example differs from the above quote attribution pattern examples
in that additional elements corresponding to a title and gender
pronoun are included in a document. As shown in this example,
document 1010 includes elements 1035 (title), 1015 (name), 1020
(gender pronoun), 1040 ("saying" verb), 1045 (quote) and pattern
1017. Pattern 1017 is formed from a combination of the elements 1035
(title), 1015 (name), 1020 (gender pronoun), 1040 ("saying" verb)
and 1045 (quote). In this example, the relationship (e.g.,
proximity) between the elements 1045 (quote) and 1040 ("saying" verb)
and intervening texts 1043 and 1033 to element 1020 (gender
pronoun) indicates that the name corresponding to the element 1015 is
the speaker of the quote. Moreover, the gender pronoun can be
correlated to the name to further confirm the speaker of the quote.
In some embodiments, the fact-extraction process can correlate the
gender pronoun to the name by back-referencing the gender pronoun
to the name (e.g., person) as described above in FIG. 8. Therefore,
to attribute the quote to the name of the person in the document,
pattern 1017 requires a relationship (e.g., proximity) between
element 1015 (name) and the respective elements for the gender
pronoun, the saying verb, the quote and any intervening text.
[0109] In some cases, the iterative fact-extraction process uses a
title to identify a name in the document where the name is
ambiguous (e.g., when the name is not detected solely based on
itself because of its particular ethnic origin). In this example,
the process can confirm that the name is the speaker of the quote
because pattern 1017 includes the title and intervening text 1050
having a relationship (e.g., proximity) with the name. Further, in
some embodiments, the presence of element 1020 (gender pronoun) and
intervening text 1055 between it and element 1015 (name) in pattern 1017
allows the gender pronoun to be back-referenced to the name in
the document as described above in connection with FIG. 8. The
process can use the gender pronoun back-referenced to the name to
determine that the name is the speaker of the quote.
[0110] The process then annotates the pattern 1017 with tag 1030 to
attribute the quote to the name in document 1010. In this manner,
the document is annotated with an attribution that describes the
name of the person that said the quote in document 1010.
[0111] C. Management Turnover
[0112] In some embodiments, the iterative fact-extraction process
identifies patterns for management turnover to track the addition
or attrition of employees at a particular company. This can be
useful in instances where a company wishes to research the change
of personnel for a particular competitor.
[0113] FIG. 11 conceptually illustrates example patterns of
management turnover in documents of some embodiments. Document 1110
includes a pattern 1127 that is formed from a combination of
elements 1115, 1125, 1130 and intervening texts 1120 and 1123.
Element 1125 is a "joining" verb that includes, but is not limited
to, forms of the words "hire," "join," "take," etc. Element 1130 is
a company. The iterative fact-extraction process identifies the
pattern 1127 in document 1110 to determine that management turnover
has occurred for a particular company. The process then annotates
the pattern 1127 with tag 1140 to indicate management turnover.
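The pattern described for FIG. 11 can be sketched as follows. This is
a minimal illustration only, not the disclosed implementation: it
assumes the elements appear in the order name, joining verb, company,
and it assumes a 40-character limit on each intervening text. The word
lists, the limit, the tag name and the function tag_turnover are
hypothetical.

```python
import re

# Hypothetical word lists; the text only says such lists exist.
JOINING_VERBS = ["hired", "hires", "joined", "joins", "took", "takes"]
KNOWN_COMPANIES = ["Ambrose Corporation", "PolitiFact Florida"]

verbs = "|".join(JOINING_VERBS)
companies = "|".join(re.escape(c) for c in KNOWN_COMPANIES)

# name, up to 40 chars of intervening text, verb, up to 40 chars, company
TURNOVER_RE = re.compile(
    rf"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"
    rf".{{0,40}}?\b(?P<verb>{verbs})\b"
    rf".{{0,40}}?(?P<company>{companies})"
)

def tag_turnover(text):
    """Return a tag dict for the first turnover pattern found, else None."""
    m = TURNOVER_RE.search(text)
    if m is None:
        return None
    return {
        "tag": "ManagementTurnover",
        "name": m["name"],
        "verb": m["verb"],
        "company": m["company"],
    }
```

Bounding the intervening text is one simple way to express the
proximity relationship between elements; other embodiments could
measure proximity in tokens or sentences instead.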
[0114] FIG. 12(a)-(b) conceptually illustrate other example
patterns of management turnover in documents of some embodiments.
Similar to the above described document 1110 of FIG. 11, management
turnover is conceptually illustrated in FIG. 12(a) by elements
related to a title or hire date in a particular document. In some
embodiments, document 1210 includes a pattern 1227 formed from a
combination of elements 1215 (name), 1225 (joining verb), 1230
(company), 1235 (title) and intervening texts 1240 and 1245. In
some embodiments, element 1235 (title) and intervening text 1237
having a relationship (e.g., proximity) between elements 1215
(name), 1225 (joining verb) and 1230 (company) indicates management
turnover for document 1210. The iterative fact-extraction process
then annotates pattern 1227 with tag 1240 to indicate management
turnover.
[0115] Moreover, management turnover is also indicated by a hire
date. The hire date may be identified by elements including words
or expressions that include a combination of days, months and/or
numbers (e.g., "Jan. 1, 2010", "Jan. 1, 2010", etc.). In some
embodiments, document 1250 includes a pattern 1273 formed from
elements 1265 (name), 1260 (joining verb), 1275 (company), 1280
(hire date) and intervening texts 1255, 1267 and 1263. The presence
of a hire date of element 1280 and intervening text 1267 between
the "joining" verb of element 1260 indicates management turnover in
some embodiments. In some embodiments, the iterative
fact-extraction process annotates pattern 1273 with tag 1270 to
indicate management turnover in the document.
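As a rough sketch of how a hire date element might be recognized, the
following regular expression matches one assumed date style (month
name or abbreviation, day, comma, four-digit year). The style, the
names MONTH and DATE_RE, and the function find_hire_dates are
assumptions made for illustration; the text does not specify how date
expressions are detected.

```python
import re

# Abbreviated or full month name, optional period (an assumed date style).
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"
DATE_RE = re.compile(rf"\b{MONTH} \d{{1,2}}, \d{{4}}\b")

def find_hire_dates(text):
    """Return every date-like expression in text, in order of appearance."""
    return DATE_RE.findall(text)
```

A date found this way would only indicate management turnover when it
also stands in the required relationship to a joining verb, which this
sketch does not check.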
[0116] FIG. 13 conceptually illustrates an example pattern for
indicating commentary about management turnover. In some
embodiments, the fact-extraction process annotates a pattern to
indicate commentary about the management turnover based on
identifying a quote in document 1310. Document 1310 includes a
pattern 1327 that is formed from a combination of elements 1315
(name), 1325 (joining verb), 1330 (company) and intervening texts
1320 and 1323. The pattern 1327 is annotated with a tag 1340 to
indicate management turnover. In this example, the presence of
element 1335 (quote) and intervening text 1337 in relation to the
elements and intervening text of pattern 1327 indicates commentary
about the
management turnover. The iterative fact-extraction process then
annotates pattern 1327 with tag 1345 to indicate this commentary
about management turnover in the document. In some embodiments, the
process also attributes the quote in document 1310 to the name based on
the relationship (e.g., proximity) between element 1335 (quote) and
any intervening text with element 1315 (name).
[0117] D. Product Initiative
[0118] In some embodiments, the iterative fact-extraction process
identifies patterns indicating a product initiative for a
particular company. This can be useful for tracking the
introduction of new products of a particular company.
[0119] FIGS. 14(a)-(b) conceptually illustrate example patterns for
product initiatives of some embodiments. FIG. 14(a) shows a
document 1410 that includes a pattern 1423 formed from elements
1415 (company), 1425 (product), 1430 (action verb) and intervening
texts 1420 and 1435. In some embodiments, element 1425 (product) is
identified using a list including known words and expressions for
products. Element 1430 is an action verb that may include, but is not
limited to, forms of the words "introduce," "sell," etc. The action
verb, in some embodiments, is identified using a list including
known words for action verbs. The iterative fact-extraction process
then annotates the pattern 1423 with tag 1440 to indicate a product
initiative in document 1410.
[0120] As shown in FIG. 14(b), similar to document 1410 described
above, document 1450 also indicates a product initiative. Document
1450 includes a pattern 1463 formed from elements 1455 (company),
1465 (action verb), 1470 (product) and intervening texts 1460 and
1475. Document 1450 differs from document 1410 in that the
positions of an identified action verb (element 1465) and
identified product (element 1470) are transposed (e.g., swapped
positions) in the document. In some embodiments, the iterative
fact-extraction process annotates pattern 1463 with tag 1480 to
indicate the product initiative.
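The two orderings shown in FIGS. 14(a)-(b) can be sketched by
classifying each word against a list and then comparing the sequence
of recognized elements, with intervening text dropped, against both
orders. The word lists and the function names below are hypothetical,
and the sketch handles only single-word elements.

```python
# Hypothetical lists of known companies, products and action verbs.
COMPANIES = {"Acme"}
PRODUCTS = {"phone", "tablet", "sedan"}
ACTION_VERBS = {"introduce", "introduces", "introduced",
                "sell", "sells", "sold"}

ORDER_A = ["company", "product", "verb"]   # as in FIG. 14(a)
ORDER_B = ["company", "verb", "product"]   # as in FIG. 14(b)

def classify(word):
    """Map a word to an element kind, or None for intervening text."""
    w = word.strip(".,")
    if w in COMPANIES:
        return "company"
    if w.lower() in ACTION_VERBS:
        return "verb"
    if w.lower() in PRODUCTS:
        return "product"
    return None

def is_product_initiative(sentence):
    """True when the recognized elements occur in either accepted order."""
    kinds = [k for k in map(classify, sentence.split()) if k is not None]
    return kinds in (ORDER_A, ORDER_B)
```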
[0121] E. Transactions
[0122] In some embodiments, the iterative fact-extraction process
identifies patterns indicating a financial transaction. This can be
useful for tracking specific types of offerings for financial
securities.
[0123] FIGS. 15(a)-(b) conceptually illustrate example patterns for
financial transactions of some embodiments. In particular, bond
offerings by a particular issuer are described in the following
examples. As shown in FIG. 15(a), document 1510 includes a pattern
1527 formed from elements 1515 (issuer), 1525 (offering verb), 1535
(type of offering) and intervening texts 1520 and 1530. The
iterative fact-extraction process identifies pattern 1527 to
identify a financial transaction in the document. Element 1515
(issuer) indicates the issuer of a bond offering.
[0124] In some embodiments, the issuer is identified by a list
including known words and expressions of issuers of financial
securities such as bonds. The issuers included in such a list
include known financial institutions and companies. Element 1525 is
an "offering" verb that includes, but is not limited to, forms of
the words "sell," "offer," "issue," etc. The offering verb is
identified, in some embodiments, using a list including known words
and expressions of offering verbs. Element 1535 is a type of
offering. The type of offering may include, but is not limited to,
forms of words and expressions for types of financial securities
such as bonds, stocks, etc. In some embodiments, the type of
offering is identified using a list including known words and
expressions of types of financial offerings. In some embodiments,
the iterative fact-extraction process annotates pattern 1527 with
tag 1540 to indicate the financial transaction (e.g., a bond
offering) in the document 1510.
[0125] As shown in FIG. 15(b), similar to document 1510 described
above, document 1550 indicates a financial transaction. Document
1550 includes a pattern 1565 formed from elements 1555 (issuer),
1570 (offering verb), 1585 (type of offering) and intervening texts
1560 and 1580. Document 1550 is annotated with tag 1575 to indicate
a financial transaction. The iterative fact-extraction process then
performs a subsequent level of pattern analysis to identify element
1595 that indicates an expression related to collateral trust
bonds. Collateral trust bonds are a type of financial security that
is secured by a financial asset (e.g., stocks or other bonds). In
some embodiments, the collateral trust bonds are identified using a
list including known words and expressions for collateral trust
bonds or other types of bonds. The position of element 1595 and
intervening text 1587 in relation to pattern 1565 indicates a
securitization of the financial transaction in document 1550. In
some embodiments, the iterative fact-extraction process annotates
the pattern 1565 with tag 1590 to indicate the securitization of
the financial transaction in the document 1550.
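The two levels of analysis described for FIG. 15(b) can be sketched as
two functions run in sequence, where the second level looks for a tag
produced by the first. This is a simplified illustration under stated
assumptions: the lists, the tag names and the functions level_one,
level_two and analyze are hypothetical, and position and proximity
checks are omitted.

```python
import re

# Hypothetical lists; the text only says such lists are used.
ISSUERS = {"acme corp"}
OFFERING_VERBS = {"sell", "sells", "offer", "offers", "issue", "issues"}
OFFERING_TYPES = {"bonds", "notes", "stock"}
COLLATERAL_TERMS = {"collateral trust bonds"}

def level_one(text, tags):
    """Tag a financial transaction: issuer, offering verb, offering type."""
    lower = text.lower()
    words = set(re.findall(r"[a-z]+", lower))
    if (any(issuer in lower for issuer in ISSUERS)
            and words & OFFERING_VERBS
            and words & OFFERING_TYPES):
        tags.append("FinancialTransaction")

def level_two(text, tags):
    """Build on the level-one tag to tag a securitization."""
    has_term = any(term in text.lower() for term in COLLATERAL_TERMS)
    if "FinancialTransaction" in tags and has_term:
        tags.append("Securitization")

def analyze(text):
    """Run the levels in order; later levels see earlier tags."""
    tags = []
    for level in (level_one, level_two):
        level(text, tags)
    return tags
```

The second level never fires on its own: a collateral expression
without a previously tagged transaction produces no tag, which mirrors
the iterative character of the process.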
[0126] F. Tags Building on Other Tags
[0127] As mentioned above, the pattern analysis process of some
embodiments identifies patterns of document elements and tags from
tags that it identifies in the earlier stages. In other words, it
can build tags upon previously defined tags. FIG. 16 conceptually
illustrates one example of pattern analysis that extracts facts by
having later stage tags build on earlier stage tags. In this
example, facts are not directly extracted from the earlier stage
tags. Instead, the earlier stage tags are only used to identify
later stage "fact" tags that specify facts. In some embodiments,
these fact tags might not be expressed in a manner that is
optimized for subsequent querying or reporting. Accordingly, other
embodiments convert these fact tags into a fact-specific format
that is optimized for subsequent querying and reporting.
[0128] FIG. 16 illustrates a table 1610 that shows several levels
of tags that are identified for a sentence in a document that
states: Fred Simplax has been hired as the new editor of PolitiFact
Florida, a joint project of The St. Petersburg Times and The Miami
Herald. The table 1610, in some embodiments, can be a data
structure for containing the respective tags and/or facts from the
different levels of pattern analysis. However, for other
embodiments, this table does not correspond to a data structure
used by these other embodiments, and instead serves only as a way
of conceptualizing the analysis that is performed by these other
embodiments.
[0129] The table 1610 includes rows that represent levels of
pattern analysis in ascending order. The first row of the table
1610 includes a number of cells that each include one word that
represents a token tag. In this example, each token tag is one
recognized word in the analyzed sentence or a word that represents
a punctuation mark in the sentence. After the first row, each of the
subsequent rows of the table 1610 includes one or more horizontal
bars that span a particular number of token tags. Each horizontal
bar represents a tag that was identified in a particular level of
pattern analysis. While each of the rows of the table 1610 can be
used to place horizontal tag bars in the example illustrated in
FIG. 16, the columns of the table 1610 can be used to analyze the
tags in order to identify relationships between the tags, as
further described below.
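One possible way to represent the horizontal tag bars of table 1610 is
as spans over token indices, so that a column lookup returns every tag
covering a given token. The class SpanTag, its fields and the function
tags_over are hypothetical names used only for this sketch; the text
notes that the table need not correspond to an actual data structure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpanTag:
    label: str   # e.g. "CapBar", "Person", "Joining"
    start: int   # index of the first token covered
    end: int     # one past the last token covered
    level: int   # level of pattern analysis that produced the tag

def tags_over(tags, column):
    """Return the tags whose span covers the token at index column."""
    return [t for t in tags if t.start <= column < t.end]

tokens = "Fred Simplax has been hired".split()
tags = [
    SpanTag("CapBar", 0, 2, level=1),
    SpanTag("Joining", 4, 5, level=1),
    SpanTag("Person", 0, 2, level=2),
]
```

Here tags_over(tags, 0) returns the CapBar and Person tags stacked
over "Fred", which is the kind of column-wise relationship the table
is meant to expose.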
[0130] In the example illustrated in FIG. 16, a tag in a row above
another tag is a tag from an earlier stage of pattern analysis.
However, in this example, tags that appear on each row were not
necessarily identified in the same stage of the pattern analysis.
In this example, tags from different levels of analysis might be
placed on the same row in order to provide a condensed tabular
representation of the analysis. In other words, each column of the
table 1610 can be in different ascending scales. For example, a
first column with different tags can represent levels 1-4 of
pattern analysis while a second column with different tags can
represent levels 1-7 of pattern analysis. However, one of ordinary
skill will realize that in some embodiments, each row corresponds
to one specific level of pattern analysis.
[0131] FIG. 16 illustrates that, after identifying the token tags
in the first row, the pattern analysis process identifies several
capitalized expressions in the analyzed sentence and tags these
capitalized expressions with CapBar tags 1615, 1620, 1625 and 1630. In
some embodiments, a CapBar tag annotates multi-word expressions that
might serve as names. As shown in FIG. 16, the process then
identifies the words "hired" and "new" as words that relate to a
management turnover occurrence and thereby tags these words with
Joining tags 1619 and 1622.
[0132] Next, the process identifies names of individuals and
locations within the sentence and annotates these names with the
Person tag 1617 and the Location tags 1624, 1626 and 1628. In some
embodiments, the process is designed to assume that all names would
be capitalized, and hence identifies these names by only searching
within the regions of the documents covered by a CapBar tag, in
order to achieve some efficiency in the document processing. In
other embodiments, the process identifies these names independent
of the CapBar tags.
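The efficiency idea of searching for names only inside CapBar regions
can be sketched as follows. The regular expression is a crude
assumption about what counts as a capitalized expression (it allows
periods inside words so that "St." stays within one region), and the
first-name list and function names are hypothetical.

```python
import re

# One or more space-separated capitalized words (crude assumption).
CAPBAR_RE = re.compile(r"\b[A-Z][\w.]*(?: [A-Z][\w.]*)*")

def capbars(sentence):
    """Return (start, end) character spans of capitalized expressions."""
    return [m.span() for m in CAPBAR_RE.finditer(sentence)]

def find_persons(sentence, first_names):
    """Look for person names only inside the CapBar regions."""
    persons = []
    for start, end in capbars(sentence):
        words = sentence[start:end].split()
        if len(words) >= 2 and words[0] in first_names:
            persons.append(" ".join(words))
    return persons

SENTENCE = ("Fred Simplax has been hired as the new editor of PolitiFact "
            "Florida, a joint project of The St. Petersburg Times and "
            "The Miami Herald.")
```

Restricting the name lookup to these regions means the list of first
names is consulted a handful of times per sentence rather than once
per word.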
[0133] After identifying the individuals and locations, the process
tags (1) the word "editor" with a Title tag 1627, (2) the
name "PolitiFact Florida" with a KnownCompany tag 1640, and (3) the
expression "a joint project of The St. Petersburg Times and The
Miami Herald" with a DescriptivePhrase tag 1632. In some
embodiments, the process identifies the title and company name by
searching a list of job titles and a list of known companies. Also,
in some embodiments, the process identifies the expression "a joint
project . . . Herald" as a descriptive phrase based on a
conditional relationship rule, such as "if a comma follows a known
name, then the description following the comma until the subsequent
punctuation is a descriptive phrase."
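The conditional relationship rule quoted above can be sketched over a
token list in which each punctuation mark is its own token, as with
the token tags of FIG. 16. The function descriptive_phrase and the set
PUNCTUATION are hypothetical names; treating "St." as a single token
is an assumption about the tokenizer.

```python
PUNCTUATION = {",", ".", ";", ":"}

def descriptive_phrase(tokens, name_tokens):
    """If a comma follows the known name, return the tokens after the
    comma up to the next punctuation token; otherwise return None."""
    n = len(name_tokens)
    for i in range(len(tokens) - n):
        if tokens[i:i + n] == name_tokens and tokens[i + n] == ",":
            phrase = []
            for tok in tokens[i + n + 1:]:
                if tok in PUNCTUATION:
                    break
                phrase.append(tok)
            return phrase or None
    return None

TOKENS = ["editor", "of", "PolitiFact", "Florida", ",", "a", "joint",
          "project", "of", "The", "St.", "Petersburg", "Times", "and",
          "The", "Miami", "Herald", "."]
```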
[0134] Next, the process identifies The St. Petersburg Times and
The Miami Herald as newspaper names and tags them
accordingly with tags 1642 and 1644. The process then identifies a
management turnover Fact tag 1645 by noting the relationship
between the Person tag 1617, the Joining tags 1619 and 1622, and
the KnownCompany tag 1640. This fact tag is expressed in terms of
the person and company names identified by the Person and
KnownCompany tags and the word that is tagged by the Joining
tag.
[0135] Specifically, this fact tag is expressed as "Fred Simplax
hired by PolitiFact Florida." This fact tag is also annotated with
confidence metadata, which reflects a confidence rating of 70%.
This rating indicates a confidence level regarding the accuracy of
the fact tag 1645. Such a rating is generated based on the pattern
of document elements and/or tags that is used to generate the fact
tag. Patterns that are more detailed and elaborate often have
higher confidence scores associated with them in some embodiments
than patterns that are simple and less elaborate. Reporting and
query modules can subsequently use such ratings to determine which
facts to report in response to a query or for a periodic electronic
or print reporting.
[0136] Finally, the process identifies the fact tag 1650 based on
the relationship of the previous management turnover fact tag 1645
and the other tags. For instance, in some embodiments, the fact tag
1650 is identified because the descriptive phrase tag 1632 follows
a comma after the previous management turnover fact tag 1645 and it
follows the name of the hiring company. The fact tag 1650 is thus
identified in some embodiments as an elaboration on the hiring
company. This fact tag 1650 is designated with a confidence rating
of 50%. In some embodiments, this confidence rating is the
confidence score of the pattern that led to the identification of
the fact tag 1650.
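A reporting or query module could use the confidence ratings roughly
as sketched below. The class FactTag, the function facts_to_report and
the 0.6 threshold are hypothetical; the 70% and 50% ratings come from
the example above, while the text of the second fact is a placeholder
because the example does not spell it out.

```python
from dataclasses import dataclass

@dataclass
class FactTag:
    text: str
    confidence: float  # 0.0 to 1.0

def facts_to_report(facts, threshold=0.6):
    """Return facts at or above threshold, highest confidence first."""
    kept = [f for f in facts if f.confidence >= threshold]
    return sorted(kept, key=lambda f: f.confidence, reverse=True)

facts = [
    FactTag("elaboration on the hiring company", 0.50),  # placeholder text
    FactTag("Fred Simplax hired by PolitiFact Florida", 0.70),
]
```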
IV. Manual Processing Tool
[0137] In some embodiments, the iterative fact-extraction process
includes a pattern analysis process that performs pattern analysis
of a document to identify patterns. For each iteration of
fact-extraction, the process creates tags for any identified
patterns from the document. However, the tags can be inaccurate.
For example, the process may have failed to identify a pattern
because an unexpected grammatical construction was used to
represent the pattern in the document. In other cases, the process
may have failed to identify a pattern because an unrecognized or
unexpected expression represented the fact (e.g., where a Spanish
name is unrecognized as such). Other conditions in which the
pattern analysis process or a subsequent fact-extraction process of
some embodiments flags the document for manual processing were
described above.
[0138] Accordingly, in some embodiments, the tags associated with
the document can require manual processing by a human operator
(i.e., user) because for an x number of documents that are
processed by the iterative fact-extraction process, a certain
percentage of the documents will require human intervention to
modify the tags created by the iterative fact-extraction
process.
[0139] To assist in such modifications, some embodiments provide a
tool for performing manual processing of documents. A human
operator uses the tool to modify the associated tags of a
particular document. FIG. 17 illustrates an example graphical user
interface (GUI) display for manually processing documents to adjust
the results of iterative fact processing of some embodiments. The
GUI display 1710 shown in FIG. 17 may only show a portion of the
GUI display 1710 of some embodiments. For example, in some cases
the GUI display 1710 may include additional user interface elements
that include additional functionalities and/or features that the
user can select. Furthermore, in some embodiments, the illustrated
GUI display 1710 can be part of a larger GUI display that has
additional functional features.
[0140] The GUI display 1710 includes dropdown list UI item 1715,
text box UI item 1720 and button 1725 ("analyze" button). GUI
display 1710 includes dropdown list UI item 1715 to select a script
for performing iterative fact-extraction on a particular document.
The scripts displayed in dropdown list UI item 1715 can be user
provided scripts (e.g., transferred from user clients) or scripts
from local or remote storages (i.e., script files). Each script
includes a particular set of pattern analysis instructions to
identify patterns from a document. For example, respective scripts
including specific pattern analysis instructions for identifying
patterns for management turnover, just people, just words, just
numbers, companies, or people and titles can be provided. In some
embodiments, the pattern analysis instructions can define patterns
from lists of known words and/or expressions. For example, the
lists can include, but are not limited to, a list of locations,
list of states, list of countries, list of cities, list of first
names, list of male first names, list of female first names, list
of verbs, list of titles, list of financial securities, list of
companies, and a list of expressions. Accordingly, one example of a
pattern analysis instruction defines a pattern that is formed from
a first word in a first list, a second word in a second list and a
third word that is a type of verb.
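The example instruction at the end of the preceding paragraph can be
sketched as a function built from three lists. This sketch assumes the
three words are adjacent, which the text does not require; the
function make_instruction and the sample lists are hypothetical and do
not reflect any actual script syntax.

```python
def make_instruction(first_list, second_list, verbs):
    """Build a matcher for: word in first list, word in second list,
    then a verb, as three consecutive tokens (adjacency is assumed)."""
    def match(tokens):
        hits = []
        for i in range(len(tokens) - 2):
            a, b, c = tokens[i:i + 3]
            if a in first_list and b in second_list and c in verbs:
                hits.append((i, i + 3))  # token span of the pattern
        return hits
    return match

# Hypothetical lists standing in for stored lists of known words.
instruction = make_instruction({"John", "Mary"}, {"Jones", "Smith"},
                               {"joined", "left"})
```

Calling instruction("John Jones joined Ambrose".split()) yields the
token span of the matched pattern, which a later level could then tag.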
[0141] The user can then select a script from dropdown list UI item
1715 for executing a set of pattern analysis instructions from the
selected script file on a particular document. GUI display 1710
includes text box UI item 1720 for presenting text data from a
particular document. The document can be retrieved from local or
remote storages or submitted by the user. In some embodiments, the
user (i.e., the human operator) manually enters text data into text
box UI item 1720 for processing. The user then selects button 1725
to run the selected script on the text data in text box UI item
1720. For each pattern analysis instruction from the script, the
iterative fact-extraction process then performs iterative pattern
analysis on the text data. For each level of pattern analysis, the
iterative fact-extraction process creates tags for any identified
patterns from the text data.
[0142] In some embodiments, GUI display 1710 presents the
identified patterns in GUI display portions 1790 and 1795. In
particular, the identified patterns from a first and a second
paragraph of the text in text box UI item 1720 are respectively
presented in GUI display portions 1790 and 1795. In some
embodiments, GUI display 1710 provides respective GUI display
portions for each paragraph in text box UI item 1720 to indicate
identified patterns in various highlighted colors. For example,
identified patterns in GUI display portions 1790 and 1795 are
highlighted in different colors (e.g., red, yellow, green, etc.) in
some embodiments to indicate different types of facts. Some
examples of identified patterns include, but are not limited to,
companies, persons, titles, roles, locations, and numbers.
[0143] GUI display 1710 also presents respective identified
patterns from the first and second paragraphs in GUI display
portions 1745 and 1780. Each identified pattern is tagged (i.e.,
annotated) under a specific category after executing the script. As
shown, identified patterns from the first paragraph are tagged as
indicated by any one of UI items 1760. UI items 1760 include a set
of respective UI items that indicate respective categories of
tagged patterns. For example, as shown, UI items 1760
represent the categories of companies ("C"), persons ("P"), titles
("T"), roles ("R"), locations ("L") and decimals ("D"). One of
ordinary skill in the art would understand that UI items 1760 could
include other categories of facts and be within the scope of the
invention. As shown, facts from the first paragraph stating
"Ambrose Corporation" and "Manager" are respectively tagged as a
company ("C") and title ("T") with highlighted UI items 1755 and
1765. Moreover, the identified pattern stating "He (John Jones)"
from the second paragraph is tagged as a person with highlighted UI
item 1775 ("P"). For each identified pattern, GUI display 1710
indicates the tag for a particular identified pattern with a
corresponding UI item similar to the previous description.
[0144] The user can then review the identified patterns and
associated tags in GUI display 1710 to determine the accuracy of
the pattern analysis performed by running the script. In some
cases, the pattern analysis performed by the script identifies
patterns formed from lists of known words and/or expressions as
discussed above. Therefore, by viewing the displayed identified
patterns and tags, the user can assess the accuracy of the pattern
analysis by reviewing such lists to determine whether corresponding
words and/or expressions from the analyzed text data were correctly
identified as displayed in GUI display 1710.
[0145] In cases where the pattern analysis is inaccurate, the user
can (1) modify a list of patterns (e.g., a list of known words and/or
expressions, to add or remove words and/or expressions from the text
data), or (2) modify, add or delete a tag associated with a
pattern. For example, the user can modify a list of known male
names to include a name (e.g., a male Indian name such as "Hrithik
Khan") from the text data that was not identified by the pattern
analysis. After modifying the list, the user can re-execute the
pattern analysis from the script by selecting button 1725. In some
embodiments, the pattern analysis is only performed for respective
levels of pattern analysis that are affected by the user
modification (e.g., levels that are affected by the list(s)
modified by the user). In a case where the pattern analysis
identifies male Indian names during a third level of pattern
analysis, the first and second levels of pattern analysis are not
performed to avoid repeating these levels of pattern analysis that
are not affected by the user-modified list. In other words, for
this example, the pattern analysis is performed starting from the
third level of analysis and continues for each subsequent level
until all remaining levels of pattern analysis are performed. In
this fashion, the pattern analysis avoids re-running each level of
pattern analysis from the script.
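The partial re-execution described above can be sketched by recording
which lists each level uses and caching the tags each level produced.
All names below (first_affected, run_levels, the level functions and
the "male_names" list) are hypothetical, and the level functions are
stubs that only record that they ran.

```python
def first_affected(levels, changed_list):
    """Index of the first level whose instructions use changed_list."""
    for i, (lists_used, _) in enumerate(levels):
        if changed_list in lists_used:
            return i
    return len(levels)

def run_levels(levels, text, cache, start=0):
    """Run levels[start:], reusing cached tags from levels before start."""
    tags = [t for level_tags in cache[:start] for t in level_tags]
    del cache[start:]
    for _, analyze in levels[start:]:
        new_tags = analyze(text, tags)
        cache.append(new_tags)
        tags.extend(new_tags)
    return tags

# Demo: only the third level depends on the "male_names" list.
lists = {"male_names": {"John"}}
calls = []

def tokens(text, tags):
    calls.append("tokens")
    return []

def capitals(text, tags):
    calls.append("capitals")
    return []

def persons(text, tags):
    calls.append("persons")
    return [("Person", w) for w in text.split() if w in lists["male_names"]]

levels = [(set(), tokens), (set(), capitals), ({"male_names"}, persons)]
cache = []
text = "Hrithik met John"
first_pass = run_levels(levels, text, cache)

lists["male_names"].add("Hrithik")  # the user edits the list
calls.clear()
start = first_affected(levels, "male_names")
second_pass = run_levels(levels, text, cache, start)
```

After the list edit, only the third level runs again; the cached
results of the first two levels are reused unchanged.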
[0146] After the pattern analysis has identified patterns and
created associated tags, the user can review the results and modify
the resulting patterns and tags. The GUI display 1710 provides
particular UI items to modify the tags of identified patterns after
executing the script. In some cases, the tag of an identified
pattern can be designated under a wrong category. For an identified
pattern from the first paragraph of text data, the user may select
any of UI items 1760 to modify the tag. For example, the user may
wish to modify an identified pattern "Charles Schwab" that is
tagged as a person to instead be tagged as a company. In this
example, the user may select one of the UI items 1760 corresponding
to company ("C") to modify the tag. In some embodiments, the user
may select submit button 1770 to apply the modifications to the
identified pattern. Similarly, the user may select any of UI items
1763 to modify the tag for a particular identified pattern from the
second paragraph and then apply the modification by selecting the
corresponding submit button 1783. After selecting the corresponding
submit button, in some embodiments, the iterative fact-extraction
method flags the document and tag as being human validated. In some
embodiments, the process updates the respective document and tag in
storage.
[0147] In some embodiments, GUI display 1710 displays an identified
pattern that is tagged but should not be tagged by any of the
provided tags. This is another example of where the pattern
analysis has erroneously identified an identified pattern in the
document. In one particular case, for example, the user selects UI
item 1750 ("?") to indicate that the identified pattern
corresponding to "Ambrose Corporation" is an unrecognized or
ambiguous fact and then selects submit button 1770 to apply this
indication of the identified pattern being ambiguous. After
selecting the corresponding submit button, in some embodiments, the
iterative fact-extraction method flags the document and tag as
being human validated. In some embodiments, the process updates the
respective document and tag in storage after modification.
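The effect of selecting a category UI item and then the submit button
can be sketched as a small update on a tag record. The category codes
are those shown in the GUI, with "?" for an unrecognized or ambiguous
fact; the class Tag, its fields and the function submit_category are
hypothetical, and persistence to storage is omitted.

```python
from dataclasses import dataclass

# Category codes shown in the GUI; "?" marks an ambiguous fact.
CATEGORIES = {"C": "company", "P": "person", "T": "title",
              "R": "role", "L": "location", "D": "decimal",
              "?": "ambiguous"}

@dataclass
class Tag:
    text: str
    category: str
    human_validated: bool = False

def submit_category(tag, new_category):
    """Apply an operator's category choice and flag the tag as validated."""
    if new_category not in CATEGORIES:
        raise ValueError(f"unknown category: {new_category!r}")
    tag.category = new_category
    tag.human_validated = True
    return tag
```

For instance, submit_category(Tag("Charles Schwab", "P"), "C")
re-tags the person as a company and marks the tag as human validated.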
[0148] In some embodiments, GUI display 1710 provides a
corresponding UI item for editing each identified pattern. For
example, the user selects UI item 1747 ("edit") to edit the
identified pattern for "Ambrose Corporation." The user then
modifies "Ambrose Corporation" to instead state "Ambrose" or
"Ambrose, Inc." The user then selects a corresponding submit button
(i.e., submit button 1770) to apply the modification to the
identified pattern. As shown, GUI display 1710 provides a
corresponding UI item ("edit") to edit the text of each identified
pattern (e.g., as shown for the identified patterns of the first
and second paragraphs). The user then selects the corresponding
submit button to apply the edits and to indicate the document and
tag as being human validated. In some embodiments, the process
updates the respective document and tag in storage after editing is
finished.
[0149] In some embodiments, GUI display 1710 provides a particular
UI item for removing an identified pattern and its associated tag.
For example, the user removes an identified pattern and its
associated tag by selecting a corresponding UI item in GUI display
1710. For example, the user selects UI item 1749 ("X") to remove
the identified pattern for "Ambrose Corporation" and its tag as a
company. The user then selects a corresponding submit button (i.e.,
submit button 1770) to apply the modification to the identified
pattern and tag. Each identified pattern in GUI display 1710 is
provided with a corresponding UI item to remove the identified
pattern and corresponding tag. In some embodiments, the process
updates the respective document and removes the corresponding tag
in storage.
[0150] In some embodiments, GUI display 1710 provides UI items for
further annotation of an identified pattern. GUI display 1710
provides UI items 1730 ("Joined"), 1735 ("Promoted") and 1740
("Left") to allow the user to select several extracted tags and
then annotate the selection with another tag. For example, for the
facts extracted from the first paragraph, the user selects
"Manager" and "Ambrose Corporation" from GUI display portion 1745
and then selects UI item 1730 ("Joined") to indicate that a manager
has joined the company. To apply the modification, the user then
selects the corresponding submit button (i.e., submit button 1770).
After selecting the corresponding submit button, in some
embodiments, the process flags the document and tags as being human
validated. In some embodiments, the process updates the respective
document and stores the new tag in storage. In this manner, GUI
display 1710 allows the user to provide additional facts that the
script has not indicated.
V. Example GUI Output of Iterative Fact-Extraction
[0151] In some embodiments, a graphical user interface (GUI)
display presents the facts from a document as indicated by the
associated tags of the document. The GUI display can present facts
related to management turnover, a product initiative, a financial
transaction or any of the above described patterns of facts in
documents. The following described figures illustrate some examples
of respective GUI displays for presenting facts.
[0152] FIG. 18 illustrates an example of a graphical user interface
(GUI) display for presenting management turnover for a particular
company of some embodiments. GUI display 1800 includes GUI display
portions 1810-1830 for presenting various facts for management
turnover. In some embodiments, GUI display portion 1810 presents
facts that indicate departures and hires of management personnel
for a particular company. As shown in FIG. 18, the details of
management turnover for Apple® are presented in the GUI display
portion 1810 that presents the hires and departures of several
management positions at Apple®. In GUI display portion 1820,
facts indicating new or previous positions of respective management
positions are presented. As shown, RJ Pittman was previously the
director of product management at Google® before joining
Apple®, and Pablo Calamera left Apple® to become the Chief
Technology Officer of Thumbplay®. In some embodiments, GUI
display portion 1830 includes a timeline chart presenting the
statistics of management turnover for a specified time period
(e.g., quarterly, monthly, etc.). The presentation of extracted
facts in GUI display portions 1810-1830 more easily allows a user
to discern management turnover information in GUI display 1800.
[0153] FIG. 19 illustrates an example of a GUI display for
displaying a product initiative of some embodiments. GUI display
1900 includes GUI display portions 1910 and 1920 for presenting
facts related to a product initiative for a particular company. GUI
display portion 1910 includes facts that indicate a product initiative from
a particular company. In some embodiments, facts indicating a
product initiative from a particular document are presented in GUI
display portion 1910. As shown, a product initiative related to
Jeep® introducing a new Grand Cherokee vehicle is presented in
GUI display portion 1910. In GUI display portion 1920, a quote
related to the product initiative from the document is presented.
In some embodiments, the quote is highlighted in bold to allow the
user to more quickly discern information related to the product
initiative. In this manner, relevant portions of the document are
presented to the user without requiring the user to read the entire
document.
[0154] FIG. 20 illustrates an example of a graphical user interface
output for displaying a financial transaction of some embodiments.
GUI display 2000 includes GUI display portions 2010 and 2020 for
presenting facts of a particular type of financial transaction. As
shown in GUI display portion 2010, facts for a bond offering are
presented. GUI display portion 2010 includes a highlighted (e.g.,
bolded) portion of a document that indicates a particular company
(i.e., SkillSoft PLC) may sell bonds (i.e., eight-year notes). In
GUI display portion 2020, a quote related to the bond offering is
presented. Moreover, GUI display portion 2020 includes a list of
companies mentioned in the document and a list of topics covered in
the document. This presentation of extracted facts in GUI display
portions 2010 and 2020 allows the user to more easily gather
pertinent information related to the bond offering.
VI. System Architecture
[0155] FIG. 21 conceptually illustrates an architecture block
diagram of a system 2110 that implements the iterative
fact-extraction process of some embodiments of the invention. In
addition to the iterative fact-extraction process, the system 2110
also retrieves documents from other computers connected to the
system through a network. It further retrieves or identifies
documents in response to queries received from other computers
connected to the system through the network. In some embodiments,
this system is implemented by several servers, each of which
implements one or more components of the system.
[0156] As shown in FIG. 21, the system 2110 includes a pattern
analysis engine 2115, a file handler module 2120, a document crawler
module 2125, a document storage 2130, a tag storage 2135, a pattern
analysis instructions storage 2140, a fact processing module 2145,
a tag engine 2150, a web server 2160, a query processor 2165, a
fact storage 2170, a list storage 2185, and a work tool 2190.
[0157] The document crawler module 2125 communicates with a network
2155 to retrieve documents on a real-time or periodic basis. The
document crawler 2125 can be any commercially used crawler, such as
any known web crawler. In some embodiments, the network 2155 may be
a local area network, a wide area network, a network of networks
(e.g., the Internet), a wireless network, a mobile network, or any
other type of communication network. The document crawler module 2125
then stores the retrieved documents on the document storage
2130.
[0158] The file handler module 2120 receives scripts 2105 that are
embedded with pattern analysis instructions. In some embodiments,
the file handler module 2120 can receive scripts from user
computers 2177a and 2177b. The file handler module 2120 then parses
these scripts to extract the pattern analysis instructions, which
it then stores in the pattern analysis instructions storage 2140. In
some embodiments, the scripts 2105 are generated by a set of
authors who identify different fact-extraction patterns that can be
used to potentially identify and extract different facts from
documents.
[0159] The pattern analysis engine 2115 retrieves the pattern
analysis instructions from the pattern analysis instructions
storage 2140, and then uses these instructions to analyze each
document in multiple stages. In some embodiments, the pattern
analysis engine 2115 performs for each retrieved document a
multi-level fact-extraction process like the one described above in
connection with FIGS. 1, 3 and 4. As described above, such a
process iteratively executes different sets of pattern analysis
instructions (where each set can include one or more pattern
analysis instructions) to identify different levels of patterns.
Accordingly, the pattern analysis engine 2115 iteratively performs
the level 1 analysis 410_1 through the level N analysis
410_N of FIG. 4 when it iteratively receives and executes the
first instruction set 417 through the Nth instruction set 440.
Moreover, the sets of pattern analysis instructions can define
patterns from lists of known patterns (e.g., known words and/or
expressions) that the pattern analysis engine 2115 can identify
during the iterative pattern analysis. The list storage 2185 stores
these lists of known words and/or expressions.
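As a minimal sketch only, and not a description of any particular
embodiment, the iterative analysis described above could be expressed
in Python as follows. The names Tag, tag_from_list, tag_sequence and
run_levels, the token-based matching, and the example word lists are
all assumptions introduced for illustration; the specification does
not define an instruction format. The sketch shows a first level that
tags tokens found in known-word lists and a second level whose pattern
is formed from the first-level tags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    name: str   # tag label, e.g. "COMPANY" (hypothetical)
    start: int  # index of the first token covered
    end: int    # index one past the last token covered


def tag_from_list(name, words):
    """Build an instruction that tags every token found in a known-word list."""
    def instruction(tokens, tags):
        return [Tag(name, i, i + 1) for i, tok in enumerate(tokens) if tok in words]
    return instruction


def tag_sequence(name, parts):
    """Build an instruction that tags a run of adjacent, previously stored tags."""
    def instruction(tokens, tags):
        by_start = {}
        for tag in tags:
            by_start.setdefault(tag.start, []).append(tag)
        found = []
        for first in tags:
            if first.name != parts[0]:
                continue
            pos = first.end
            for part in parts[1:]:
                nxt = next((t for t in by_start.get(pos, []) if t.name == part), None)
                if nxt is None:
                    break
                pos = nxt.end
            else:
                # Every part matched back to back, so the pattern is present.
                found.append(Tag(name, first.start, pos))
        return found
    return instruction


def run_levels(tokens, levels):
    """Execute instruction sets in order; each level sees tags from earlier levels."""
    tags = []
    for instruction_set in levels:
        new_tags = []
        for instruction in instruction_set:
            new_tags.extend(instruction(tokens, tags))
        tags.extend(new_tags)
    return tags


# Hypothetical lists, standing in for the contents of the list storage.
COMPANIES = {"Acme", "Globex"}
ACQUIRE_VERBS = {"acquires", "buys"}

LEVELS = [
    [tag_from_list("COMPANY", COMPANIES), tag_from_list("ACQUIRE_VERB", ACQUIRE_VERBS)],
    [tag_sequence("ACQUISITION", ["COMPANY", "ACQUIRE_VERB", "COMPANY"])],
]

tokens = "Acme acquires Globex for cash".split()
tags = run_levels(tokens, LEVELS)
```

In this sketch the second-level instruction never inspects the raw
tokens; it matches only on tags produced by the first level, which is
one simple way to model a pattern formed by an earlier tag.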
[0160] In executing the set of pattern analysis instructions, the
pattern analysis engine 2115 uses the tag engine 2150 to annotate
any identified patterns in the document with a respective tag.
After tagging the patterns, the tag engine 2150 stores the
identified tags in the tag storage 2135. In some embodiments, the tag
storage 2135 is a logical or physical storage separate from the
document and instruction storages 2130 and 2140. However, in other
embodiments, one or more of these storages are within the same
logical or physical storages. Also, in some embodiments, the tags
are stored within the documents (e.g., are embedded in the same
file or other storage structure with the documents).
[0161] The fact processing module 2145 performs a post-processing
operation on the tags stored in the tag storage 2135 and the
documents stored on the document storage 2130 to extract facts
associated with the tags. One example of such a post-processing
operation was described above by reference to FIG. 5. After
processing the tags, the fact processing module 2145 stores the
facts on the fact storage 2170.
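One possible form of such a post-processing step is sketched below in
Python. This is an illustrative assumption rather than the operation
of FIG. 5: the function extract_facts, the fact record layout, and the
tag names are hypothetical. The sketch reads stored tags together with
the document's tokens and emits one fact record for each tag whose
name is designated as fact-bearing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    name: str   # tag label (hypothetical)
    start: int  # index of the first token covered
    end: int    # index one past the last token covered


def extract_facts(doc_id, tokens, tags, fact_tag_names):
    """Turn fact-bearing tags into fact records that quote the tagged text."""
    facts = []
    for tag in tags:
        if tag.name in fact_tag_names:
            facts.append({
                "doc_id": doc_id,
                "type": tag.name,
                "text": " ".join(tokens[tag.start:tag.end]),
            })
    return facts


tokens = "Acme acquires Globex for cash".split()
stored_tags = [Tag("COMPANY", 0, 1), Tag("ACQUISITION", 0, 3)]
facts = extract_facts("doc-1", tokens, stored_tags, {"ACQUISITION"})
```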
[0162] The work tool 2190 provides a tool for manually processing
tags associated with documents stored on the document storage 2130
and modifying the lists stored on the list storage 2185. One
example of such a tool for manually processing tags and modifying
lists was described above by reference to FIG. 17. One or more user
computers 2179 can communicatively couple with the work tool 2190
(e.g., through the network 2155) in order to allow a user to
manually process the tags associated with the documents. After
processing the tags and/or modifying lists, the work tool 2190
updates the tags on the tag storage 2135 and/or the lists on the list
storage 2185. In some embodiments, the work tool can also modify
facts, pattern analysis instructions, and/or documents.
Accordingly, FIG. 21 shows that the work tool 2190 can communicate
with all the storages 2130, 2135, 2140, 2170 and 2185 (i.e., by
showing that the arrow to and from the work tool terminates on the
dashed box encircling these storages).
[0163] The query processor 2165 executes search queries on the
facts stored on the fact storage 2170 and the documents in the document
storage 2130 to retrieve documents and/or facts that match the search
queries. The web server 2160 receives search queries from one or more
user computers 2175 through the network 2155, and submits the
search queries to the query processor 2165 for processing. The web
server also returns search query results (i.e., the documents
and/or facts) to the user computers through the network. The web
server is also used to connect the document crawler 2125 and the
work tool 2190 to the network 2155. However, one of ordinary skill
will realize that the web server 2160 might not be so utilized in
other embodiments. For instance, in some embodiments, the work tool
2190 runs on a computer that the user of the tool operates
directly.
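The matching performed by a query processor of this kind can be
sketched as follows in Python. The function run_query, the
case-insensitive substring matching, and the record layout are
assumptions made for illustration; the specification does not state
how queries are matched against facts or documents.

```python
def run_query(query, facts, documents):
    """Return the facts and document ids whose text contains every query term."""
    terms = query.lower().split()

    def matches(text):
        lowered = text.lower()
        return all(term in lowered for term in terms)

    return {
        "facts": [fact for fact in facts if matches(fact["text"])],
        "documents": [doc_id for doc_id, text in documents.items() if matches(text)],
    }


# Hypothetical contents of the document storage and the fact storage.
documents = {
    "doc-1": "Acme acquires Globex for cash",
    "doc-2": "Globex opens a new plant",
}
facts = [{"doc_id": "doc-1", "type": "ACQUISITION", "text": "Acme acquires Globex"}]

result = run_query("globex", facts, documents)
```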
[0164] The operation of the system 2110 will now be described for
processing one document. Initially, the document crawler 2125
retrieves the document and stores the document in the document
storage 2130. The pattern analysis engine 2115 later retrieves the
document for processing. This engine performs a multi-level pattern
analysis on the retrieved document, which results in a set of tags
that are produced by the tag engine 2150 and that are stored in the
tag storage 2135. The fact processing module 2145 later analyzes
this set of tags in conjunction with the associated document, and
produces a set of facts that are associated with this set of tags.
This module then stores this set of facts in the fact storage 2170.
The stored set of facts can later be queried along with the stored
document when the query processor 2165 responds to a query from a
user computer 2175. The work tool 2190 later receives requests from
the user computer 2179 for manually processing the tags of the
stored document. After processing the tags, the work tool then
updates the tags stored in the tag storage 2135. The work tool 2190
also receives requests from the user computer 2179 for modifying
the lists stored in the list storage 2185. The work tool updates
the lists stored in the list storage 2185 responsive to these
requests.
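The single-document flow described above can be sketched end to end
as follows in Python. This is a simplified illustration under stated
assumptions: the class FactExtractionSystem and its method names are
hypothetical, the storages are plain in-memory dictionaries, the
analysis has only one level (a known-word lookup), and no network
retrieval is performed.

```python
class FactExtractionSystem:
    """In-memory stand-in for the storages and modules of FIG. 21 (hypothetical)."""

    def __init__(self, known_words):
        self.document_storage = {}  # doc_id -> document text
        self.tag_storage = {}       # doc_id -> list of (token_index, tag_name)
        self.fact_storage = {}      # doc_id -> list of fact records
        self.list_storage = {name: set(words) for name, words in known_words.items()}

    def crawl(self, doc_id, text):
        """Stand-in for the document crawler: store an already retrieved document."""
        self.document_storage[doc_id] = text

    def analyze(self, doc_id):
        """Stand-in for the pattern analysis and tag engines (one level only)."""
        tokens = self.document_storage[doc_id].split()
        self.tag_storage[doc_id] = [
            (i, name)
            for i, tok in enumerate(tokens)
            for name, words in self.list_storage.items()
            if tok in words
        ]

    def process_facts(self, doc_id):
        """Stand-in for the fact processing module: one fact per stored tag."""
        tokens = self.document_storage[doc_id].split()
        self.fact_storage[doc_id] = [
            {"type": name, "text": tokens[i]} for i, name in self.tag_storage[doc_id]
        ]

    def query(self, term):
        """Stand-in for the query processor: exact match on the fact text."""
        return [
            (doc_id, fact)
            for doc_id, facts in self.fact_storage.items()
            for fact in facts
            if fact["text"] == term
        ]

    def add_to_list(self, list_name, word):
        """Stand-in for a work tool request that modifies a stored list."""
        self.list_storage.setdefault(list_name, set()).add(word)


system = FactExtractionSystem({"COMPANY": ["Acme"]})
system.crawl("doc-1", "Acme acquires Globex")
system.analyze("doc-1")
system.process_facts("doc-1")
```

In this sketch, a list change made through add_to_list affects stored
tags and facts only after analyze and process_facts are run again.
Whether and when documents are re-analyzed after a list update is an
assumption of the sketch and is not stated above.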
[0165] Even though the system 2110 was described above to include
several features of some embodiments of the invention, one of ordinary
skill will realize that this system can be implemented differently
in other embodiments. For instance, in the system 2110, the pattern
analysis engine 2115 generates tags that it stores in the tag
storage 2135, while the fact processing module 2145 analyzes tags
to generate facts that it stores in the fact storage 2170.
[0166] However, as mentioned above, some embodiments perform the
pattern analysis and fact extraction as one process that is
performed individually for each document. In some such embodiments,
the tags that are identified in the earlier stages are only used to
identify in later stages subsequent patterns and tags, which are
then used to extract facts. Also, in some of these embodiments, the
facts that are derived in the later stages are simply tags that are
identified based on patterns that are detected from the earlier
stage tags. Accordingly, in some of these embodiments, the later
stage tags generated by pattern analysis engine 2115 express facts
that can be reported, presented and/or queried. However, even in
some of these other embodiments, some of these later stage tags are
expressed in a manner that is optimized for subsequent querying or
reporting. Accordingly, even in some of these embodiments, the
system 2110 uses the fact processing module 2145 to convert some of
the later stage tags (which express a fact) into a fact-specific format
that is optimized for subsequent querying and reporting.
[0167] Also, as illustrated in FIG. 21, the system 2110 performs
fact-extraction processing along with document crawling and query
management. However, in some embodiments, the system that performs
the fact-extraction processing is not the same system that performs
the document crawling or the query management. One such example
would be a system that retrieves documents from a network and
extracts facts from these documents, but instead of making the
extracted facts available for queries, this system might produce
electronic or physical reports or bulletins for distribution
regarding the extracted facts.
VII. Electronic System
[0168] Many of the above-described features and applications are
implemented as software processes that are specified as a set of
instructions recorded on a computer readable storage medium (also
referred to as computer readable medium). When these instructions
are executed by one or more processing unit(s) (e.g., one or more
processors, cores of processors, or other processing units), they
cause the processing unit(s) to perform the actions indicated in
the instructions. Examples of computer readable media include, but
are not limited to, CD-ROMs, flash drives, RAM chips, hard drives,
EPROMs, etc. The computer readable media do not include carrier
waves and electronic signals passing wirelessly or over wired
connections.
[0169] In this specification, the term "software" is meant to
include firmware residing in read-only memory or applications
stored in magnetic storage, which can be read into memory for
processing by a processor. Also, in some embodiments, multiple
software inventions can be implemented as sub-parts of a larger
program while remaining distinct software inventions. In some
embodiments, multiple software inventions can also be implemented
as separate programs. Finally, any combination of separate programs
that together implement a software invention described here is
within the scope of the invention. In some embodiments, the
software programs, when installed to operate on one or more
electronic systems, define one or more specific machine
implementations that execute and perform the operations of the
software programs.
[0170] FIG. 22 conceptually illustrates an electronic system 2200
with which some embodiments of the invention are implemented. The
electronic system 2200 may be a computer, phone, PDA, or any other
sort of electronic device. Such an electronic system includes
various types of computer readable media and interfaces for various
other types of computer readable media. Electronic system 2200
includes a bus 2205, processing unit(s) 2210, a system memory 2215,
a read-only memory 2220, a permanent storage device 2225, input devices
2230, output devices 2235, and a network 2240.
[0171] The bus 2205 collectively represents all system, peripheral,
and chipset buses that communicatively connect the numerous
internal devices of the electronic system 2200. For instance, the
bus 2205 communicatively connects the processing unit(s) 2210 with
the read-only memory 2220, the system memory 2215, and the permanent
storage device 2225.
[0172] From these various memory units, the processing unit(s) 2210
retrieves instructions to execute and data to process in order to
execute the processes of the invention. The processing unit(s) may
be a single processor or a multi-core processor in different
embodiments.
[0173] The read-only-memory (ROM) 2220 stores static data and
instructions that are needed by the processing unit(s) 2210 and
other modules of the electronic system. The permanent storage
device 2225, on the other hand, is a read-and-write memory device.
This device is a non-volatile memory unit that stores instructions
and data even when the electronic system 2200 is off. Some
embodiments of the invention use a mass-storage device (such as a
magnetic or optical disk and its corresponding disk drive) as the
permanent storage device 2225.
[0174] Other embodiments use a removable storage device (such as a
floppy disk, flash drive, or ZIP® disk, and its corresponding
disk drive) as the permanent storage device 2225. Like the
permanent storage device 2225, the system memory 2215 is a
read-and-write memory device. However, unlike storage device 2225,
the system memory 2215 is a volatile read-and-write memory, such as a
random access memory. The system memory 2215 stores some of the
instructions and data that the processor needs at runtime. In some
embodiments, the invention's processes are stored in the system
memory 2215, the permanent storage device 2225, and/or the
read-only memory 2220. For example, the various memory units include
instructions for processing multimedia items in accordance with
some embodiments. From these various memory units, the processing
unit(s) 2210 retrieves instructions to execute and data to process
in order to execute the processes of some embodiments.
[0175] The bus 2205 also connects to the input and output devices
2230 and 2235. The input devices enable the user to communicate
information and select commands to the electronic system. The input
devices 2230 include alphanumeric keyboards and pointing devices
(also called "cursor control devices"). The output devices 2235
display images generated by the electronic system 2200. The output
devices 2235 include printers and display devices, such as cathode
ray tubes (CRT) or liquid crystal displays (LCD). Some embodiments
include devices such as a touchscreen that functions as both input
and output devices.
[0176] Finally, as shown in FIG. 22, bus 2205 also couples
electronic system 2200 to a network 2240 through a network adapter
(not shown). In this manner, the computer can be a part of a
network of computers (such as a local area network ("LAN"), a wide
area network ("WAN"), or an Intranet), or a network of networks,
such as the Internet. Any or all components of electronic system
2200 may be used in conjunction with the invention.
[0177] These functions described above can be implemented in
digital electronic circuitry, in computer software, firmware or
hardware. The techniques can be implemented using one or more
computer program products. Programmable processors and computers
can be included in or packaged as mobile devices. The processes and
logic flows may be performed by one or more programmable processors
and by one or more programmable logic circuits. General and
special purpose computing devices and storage devices can be
interconnected through communication networks.
[0178] Some embodiments include electronic components, such as
microprocessors, storage and memory that store computer program
instructions in a machine-readable or computer-readable medium
(alternatively referred to as computer-readable storage media,
machine-readable media, or machine-readable storage media). Some
examples of such computer-readable media include RAM, ROM,
read-only compact discs (CD-ROM), recordable compact discs (CD-R),
rewritable compact discs (CD-RW), read-only digital versatile discs
(e.g., DVD-ROM, dual-layer DVD-ROM), a variety of
recordable/rewritable DVDs (e.g., DVD-RAM, DVD-RW, DVD+RW, etc.),
flash memory (e.g., SD cards, mini-SD cards, micro-SD cards, etc.),
magnetic and/or solid state hard drives, read-only and recordable
Blu-Ray® discs, ultra density optical discs, any other optical
or magnetic media, and floppy disks. The computer-readable media
may store a computer program that is executable by at least one
processing unit and includes sets of instructions for performing
various operations. Examples of computer programs or computer code
include machine code, such as is produced by a compiler, and files
including higher-level code that are executed by a computer, an
electronic component, or a microprocessor using an interpreter.
[0179] While the above discussion primarily refers to
microprocessors or multi-core processors that execute software, some
embodiments are performed by one or more integrated circuits, such
as application specific integrated circuits (ASICs) or field
programmable gate arrays (FPGAs). In some embodiments, such
integrated circuits execute instructions that are stored on the
circuit itself.
[0180] As used in this specification and any claims of this
application, the terms "computer", "server", "processor", and
"memory" all refer to electronic or other technological devices.
These terms exclude people or groups of people. For the purposes of
the specification, the terms "display" and "displaying" mean displaying
on an electronic device. As used in this specification and any
claims of this application, the terms "computer readable medium"
and "computer readable media" are entirely restricted to tangible,
physical objects that store information in a form that is readable
by a computer. These terms exclude any wireless signals, wired
download signals, and any other ephemeral signals.
[0181] While the invention has been described with reference to
numerous specific details, one of ordinary skill in the art will
recognize that the invention can be embodied in other specific
forms without departing from the spirit of the invention. In
addition, a number of the figures conceptually illustrate
processes. The specific operations of these processes may not be
performed in the exact order shown and described. The specific
operations may not be performed in one continuous series of
operations, and different specific operations may be performed in
different embodiments. Furthermore, the process could be
implemented using several sub-processes, or as part of a larger
macro process.
* * * * *