U.S. patent application number 12/133205, for extracting and displaying compact and sorted results from queries over unstructured or semi-structured text, was filed with the patent office on 2008-06-04 and published on 2008-12-04.
Invention is credited to Roger W. Hale, Sylvia F. Knight, David R. Milward, James R. Thomas.
Application Number | 20080301129 12/133205 |
Family ID | 40089426 |
Filed Date | 2008-06-04 |
United States Patent
Application |
20080301129 |
Kind Code |
A1 |
Milward; David R. ; et
al. |
December 4, 2008 |
EXTRACTING AND DISPLAYING COMPACT AND SORTED RESULTS FROM QUERIES
OVER UNSTRUCTURED OR SEMI-STRUCTURED TEXT
Abstract
A system for indexing unstructured or semi-structured data is
disclosed. The system may identify regions within the data, such as
"Abstract" or "References". The system may identify linguistic
units such as sentences, noun groups, and verb groups. The system may
also identify concepts such as companies, people, diseases,
amounts, and so forth. The query results may be formatted so that
similar results from different documents, or from the same
document, are clustered together.
Inventors: |
Milward; David R. (Cambridge, GB)
Thomas; James R. (Cambridge, GB)
Knight; Sylvia F. (Cambridge, GB)
Hale; Roger W. (Cambridge, GB) |
Correspondence
Address: |
PERKINS COIE LLP;PATENT-SEA
P.O. BOX 1247
SEATTLE
WA
98111-1247
US
|
Family ID: |
40089426 |
Appl. No.: |
12/133205 |
Filed: |
June 4, 2008 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
60941944 | Jun 4, 2007 | |
60980758 | Oct 17, 2007 | |
Current U.S.
Class: |
1/1 ; 704/1;
707/999.005; 707/E17.014; 707/E17.015; 707/E17.078; 707/E17.083;
707/E17.122 |
Current CPC
Class: |
G06F 16/31 20190101;
G06F 16/3344 20190101; G06F 16/80 20190101; G06F 40/284
20200101 |
Class at
Publication: |
707/5 ; 704/1;
707/E17.015; 707/E17.014 |
International
Class: |
G06F 17/30 20060101
G06F017/30 |
Claims
1. A system comprising: a region detect component configured to
analyze unstructured data and to identify one or more regions in
the unstructured data; a linguistic component configured to
identify linguistic units in the unstructured data; and an indexing
component configured to store the unstructured data in an index
based on the one or more identified regions and the one or more
identified linguistic units.
2. The system of claim 1 wherein the region detect component is
further configured to analyze semi-structured data that includes at
least one region and to identify one or more additional regions in
the semi-structured data.
3. The system of claim 1 wherein the region detect component is
further configured to generate a semi-structured document
comprising the unstructured data and the one or more identified
regions.
4. The system of claim 1 wherein the linguistic component is
further configured to identify regular and irregular morphological
variants of words.
5. The system of claim 1 wherein the linguistic component is
further configured to determine a position corresponding to each of
the identified linguistic units, and wherein the indexing component
is further configured to receive the determined position of each
identified linguistic unit and record the position of the selected
linguistic unit in the index.
6. The system of claim 1 wherein the identified linguistic units
include concepts.
7. The system of claim 6 wherein the identified concepts are
arranged in taxonomies.
8. The system of claim 6 wherein the linguistic component
identifies the concepts based on a form of the concepts and a
context in which the concepts appear in the unstructured data.
9. The system of claim 1 wherein the linguistic component is
further configured to identify expressions.
10. The system of claim 9 wherein the identified expressions are
selected from the group consisting of: numerical expressions,
temporal expressions, phone numbers and email addresses.
11. The system of claim 9 wherein the linguistic component
identifies the expressions based on a form of the expressions and a
context in which the expressions appear in the unstructured
data.
12. The system of claim 1 further comprising a query component
configured to query the index based on one or more constraints.
13. The system of claim 12 wherein the one or more constraints are
received from a user.
14. The system of claim 12 wherein the query component includes a
drag and drop user interface.
15. The system of claim 12 wherein the one or more constraints
include linguistic patterns.
16. The system of claim 12 further comprising a display component
configured to format and display results of the query component to
a user.
17. The system of claim 16 wherein the display component formats
the results based on one or more rules.
18. The system of claim 17 wherein the one or more rules are
configurable by a user.
19. A method comprising: in response to receiving a query
comprising a linguistic pattern, querying an index of unstructured
text based on one or more constraints of the linguistic pattern,
the index having been generated by identifying a plurality of
linguistic units in the unstructured text; and formatting results
of the query by clustering together similar results from a same
document; and clustering together similar results from different
documents.
20. The method of claim 19 wherein the formatting is based on the
linguistic pattern.
21. The method of claim 19 wherein the formatting is based on a
plurality of rules having an order of precedence.
22. The method of claim 19 further comprising displaying the
results to a user.
23. The method of claim 22 wherein the results are displayed in a
format selected from the group consisting of: HTML, XLS, XML, CSV,
and TSV.
24. The method of claim 22 further comprising providing one or more
controls that enable the user to manipulate the displayed
results.
25. The method of claim 24 wherein the one or more controls include
an expand control to expand the displayed results.
26. The method of claim 22 wherein the results are displayed in
alphabetical order.
27. The method of claim 22 wherein the results are displayed
according to frequency order.
28. The method of claim 19 further comprising, in response to
receiving a subsequent request to join the results to another
linguistic pattern: querying the index based on one or more
constraints of the other linguistic pattern; and joining the
results to results of the other linguistic pattern query.
29. A computer-readable storage medium comprising instructions
that, when executed by a computer system, cause the computer system
to: receive a query that includes at least one linguistic
constraint and an indication of at least one region; identify
within a plurality of semi-structured and unstructured documents,
at least one document that includes the at least one linguistic
constraint within the at least one region of the document; and
format results of the query by clustering together similar results
from the at least one document.
30. The computer-readable storage medium of claim 29 wherein the
results are further formatted based on a plurality of rules having
an order of precedence.
31. The computer-readable storage medium of claim 29 further
comprising instructions that, when executed by the computer system,
cause the computer system to display the results to a user.
32. The computer-readable storage medium of claim 31 wherein the
results are displayed in a format selected from the group
consisting of: HTML, XLS, XML, CSV, and TSV.
33. The computer-readable storage medium of claim 31 further
comprising instructions that, when executed by the computer system,
cause the computer system to provide one or more controls that
enable the user to manipulate the displayed results.
34. A system comprising: a region detect component configured to
analyze unstructured data and to identify one or more regions in
the unstructured data; and analyze semi-structured data that
includes at least one region and to identify one or more additional
regions in the semi-structured data; a linguistic component
configured to identify linguistic units in the unstructured and
semi-structured data; and a querying component configured to
receive a query which identifies one or more linguistic units
occurring within a region in combination with one or more
linguistic patterns.
35. The system of claim 34 wherein the linguistic component is
further configured to identify regular and irregular morphological
variants of words.
36. The system of claim 34 wherein at least one of the identified
linguistic units is a concept.
37. The system of claim 36 further comprising taxonomies, and
wherein the taxonomies are used by the linguistic component to
identify the at least one concept.
38. The system of claim 34 wherein the query is received from a
user.
39. The system of claim 38 further comprising a display component
configured to format and display results of the interactive query
component to the user.
Description
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application claims priority to U.S. Provisional Patent
Application No. 60/941,944 entitled "USE OF REGIONS TO PROVIDE
RESPONSES TO QUERIES," and filed on Jun. 4, 2007, which is hereby
incorporated by reference.
[0002] This application claims priority to U.S. Provisional Patent
Application No. 60/980,758 entitled "EXTRACTING AND DISPLAYING
COMPACT AND SORTED RESULTS DIRECTLY FROM QUERIES OVER UNSTRUCTURED
TEXT," and filed on Oct. 17, 2007, which is hereby incorporated by
reference.
BACKGROUND
[0003] Large organizations such as pharmaceutical companies and
healthcare organizations have a massive amount of information
available to them. This may include, for example, ongoing and
historical clinical trials and studies, treatment guidelines,
patient information, patents, research documents, external research
literature, news articles, as well as information on the web. Most
of this information is in the form of unstructured or
semi-structured text (e.g. XML). The vast quantities make it hard
to read, even with the help of a search engine to prune down the
number of relevant documents.
[0004] Conventional systems do not provide results directly from
the structured or unstructured text in a format that can be used
directly for decision making. Search engines do not provide any
structure, other than the structure in the original document.
Information extraction systems do not use an index, so cannot
provide fast interactive querying, nor do they allow a flexible mix
of constraints based on linguistic constructions and the structure
of the document.
BRIEF DESCRIPTION OF THE DRAWINGS
[0005] FIG. 1 is a high-level data flow diagram showing data flow
within an arrangement of components used to index and query
semi-structured and unstructured data.
[0006] FIG. 2 illustrates an example of how an indexing engine
identifies meaningful linguistic units.
[0007] FIGS. 3A and 3B illustrate controls that enable a user to
examine the evidence of a query result.
[0008] FIG. 4 illustrates the grouping of one or more key columns
that are associated with a user-preferred concept in a query
result.
[0009] FIG. 5 illustrates the joining of two queries in a query
result.
[0010] FIG. 6A is an example of a user interface for entering a
search query.
[0011] FIG. 6B is an example of a user interface for constructing a
query using a graphical drag and drop interface.
[0012] FIG. 6C is an example of a user interface for constructing a
query over a region of a document.
[0013] FIG. 7 illustrates a query over a combination of
semi-structured and unstructured data.
DETAILED DESCRIPTION
[0014] The terminology used in the description presented below is
intended to be interpreted in its broadest reasonable manner, even
though it is being used in conjunction with a detailed description
of certain specific embodiments of the invention. Certain terms may
even be emphasized below; however, any terminology intended to be
interpreted in any restricted manner will be overtly and
specifically defined as such in this Detailed Description
section.
[0015] Various embodiments of the invention will now be described.
The following description provides specific details for a thorough
understanding and enabling description of these embodiments. One
skilled in the art will understand, however, that the invention may
be practiced without many of these details. Additionally, some
well-known structures or functions may not be shown or described in
detail, so as to avoid obscuring the description of the various
embodiments.
[0016] FIG. 1 is a high-level data flow diagram showing data flow
within an arrangement of components used to index and query
semi-structured and unstructured data. The system comprises an
indexing engine 100, a querying engine 105, and an output engine
110. The indexing engine analyzes semi-structured documents 115 and
unstructured documents 120 (collectively, "source documents") and
creates an efficient representation of the content of each source
document. Semi-structured information includes both free text and
some degree of structure. The indexing engine may also analyze
other types of documents not mentioned here. In some embodiments,
the indexing engine identifies a number of regions within a source
document. A region is part of the text, which is either a
structural unit (such as, e.g., the Abstract) or a meta-data field
(such as, e.g., a publication date). For example, identified
regions may include an Abstract, Acknowledgements, Authors, Body,
Figures, Figure Text, Paragraphs, Tables, Table Row, References,
Keywords, Title, etc. Regions may be nested within the source
documents, i.e., regions may fall inside each other. For example,
the title of a work that is contained within an appendix is a
nested region.
[0017] When a document is semi-structured, the region boundaries
may be determined by identifying tags within the source documents
and associating the tags with particular types of regions. For some
semi-structured documents, however, the structuring provided is not
sufficient to identify the relevant regions. In these cases, a
Region Detect module 125 may be used to elaborate the original
structure. This may involve meta-tagging a document with fields and
values, partitioning text of the document into sections, or
marking-up of the entire document (such as, e.g., XML or HTML).
[0018] When a document is unstructured (e.g., plain text), the
Region Detect module 125 analyzes the document to determine the
regions of the document. In some embodiments, a Region Detect
module 125 analyzes unstructured documents one line at a time using
a set of rules to determine the probability that a line is part of
a particular region or is a region itself. This determination may
be based on the form of the line and the form of the lines
immediately preceding and following the analyzed line. For example,
when the line is in all capital letters the Region Detect module
may determine that the line is a title or a section heading. The
region detect module can be customized so that documents using
non-standard conventions can be indexed. After identifying the
potential region boundaries, the Region Detect module generates a
semi-structured document (e.g., an XML document) having tags that
are associated with identified region boundaries.
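By way of illustration only, the line-by-line analysis described above might be sketched in Python as follows. The sketch replaces the probabilistic rules with a single deterministic rule (a short line written entirely in capital letters is treated as a heading) and does not inspect neighboring lines; the function names and the `<region>` tag are hypothetical and are not taken from the disclosed system.

```python
from xml.sax.saxutils import escape, quoteattr


def is_heading(line: str) -> bool:
    """Hypothetical rule: a short line whose letters are all capitals is a heading."""
    stripped = line.strip()
    letters = [c for c in stripped if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters) and len(stripped.split()) <= 8


def detect_regions(text: str) -> str:
    """Emit a semi-structured (XML) document with one <region> per detected heading."""
    out = ["<document>"]
    open_region = False
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no content in this sketch
        if is_heading(line):
            if open_region:
                out.append("</region>")
            out.append(f"<region type={quoteattr(line.strip().title())}>")
            open_region = True
        else:
            if not open_region:
                # Text that precedes any heading is assigned to a default region.
                out.append('<region type="Body">')
                open_region = True
            out.append(escape(line.strip()))
    if open_region:
        out.append("</region>")
    out.append("</document>")
    return "\n".join(out)


print(detect_regions("ABSTRACT\nA system is disclosed.\n\nREFERENCES\nSmith 2001."))
```

A customized deployment could swap `is_heading` for rules suited to documents that use non-standard conventions.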
[0019] The indexing engine 100 encodes the type of each region and
the text of the source document in an index 130 designed for
efficient querying. In some embodiments, the indexing engine uses a
configuration file to map tags within source documents, for
example, to regions of particular types or to other concepts of
interest. The indexing engine uses an opening tag to identify the
start of a region and its type (e.g., paragraph, section, etc.). It
stores region start position and type, adding the end position when
the matching closing tag is found. Positions may be stored
according to sentence number and word number within a sentence.
Positions may also be stored in other fashions, such as character
position within the document.
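The tag-driven recording of region positions might look like the following minimal Python sketch. It assumes well-nested tags without attributes, stores positions as word offsets rather than sentence and word numbers, and uses a hypothetical `TAG_TO_REGION` mapping in place of the configuration file.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical configuration mapping source tags to region types.
TAG_TO_REGION = {"title": "Title", "abstract": "Abstract", "p": "Paragraph"}


@dataclass
class Region:
    type: str
    start: int                 # offset of the first word inside the region
    end: Optional[int] = None  # offset one past the last word, set at the closing tag


# Matches an opening tag, a closing tag, or a run of non-space text.
TOKEN = re.compile(r"<(/?)(\w+)>|([^<\s]+)")


def index_regions(markup: str):
    words: List[str] = []
    regions: List[Region] = []
    stack: List[Region] = []
    for closing, tag, word in TOKEN.findall(markup):
        if word:
            words.append(word)
        elif tag not in TAG_TO_REGION:
            continue  # unmapped tags are ignored in this sketch
        elif not closing:
            region = Region(TAG_TO_REGION[tag], start=len(words))
            regions.append(region)
            stack.append(region)
        elif stack:
            # Assumes tags are well nested, so the innermost open region closes here.
            stack.pop().end = len(words)
    return words, regions


words, regions = index_regions("<abstract><p>A system is disclosed.</p></abstract>")
for region in regions:
    print(region)
```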
[0020] The indexing engine 100 analyzes text according to
linguistic structure. In this embodiment, the indexing engine
processes each source document word by word and stores the start
and end position of linguistic units, including sentences, noun
groups, verb groups, etc. FIG. 2 illustrates an example of how the
indexing engine identifies meaningful linguistic units. The
indexing engine identifies the boundaries of each sentence (i.e.,
boundaries of sentences 200 and 205). The indexing engine also
identifies the boundaries of noun phrases 210 and verb groups 215.
Noun phrases match entities and verb groups match actions. The
indexing engine may also identify regular and irregular
morphological variants of words such as find vs. finds vs. found
220. In some embodiments, this is accomplished using a stemming
algorithm. Stemming is the process for reducing inflected (or
sometimes derived) words to their stem, base, or root form. The
stem need not be identical to the morphological root of the word;
it is usually sufficient that related words map to the same stem,
even if this stem is not in itself a valid root. The indexing
engine may also identify concepts (e.g. breast cancer), whether
these are referred to by the standard name (e.g. breast cancer) or
by a synonym (e.g. breast carcinoma, breast neoplasm etc.). In
addition, the indexing engine may also identify broader classes,
such as, e.g., people, companies, amounts, temporal expressions,
etc.
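As a rough illustration of how regular and irregular variants can be mapped to one form, consider the Python sketch below. The irregular-variant table and the suffix list are hypothetical and deliberately naive; an established stemmer would handle far more cases.

```python
IRREGULAR = {"found": "find"}  # hypothetical table of irregular variants
SUFFIXES = ("ing", "es", "s")  # naive regular suffixes, checked in this order


def normalize(word: str) -> str:
    """Map a word to a shared stem so that related forms index together."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


print(normalize("find"), normalize("finds"), normalize("found"))  # find find find
```

As the paragraph above notes, the stem only has to be shared by related words; it does not have to be a valid root.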
[0021] In some embodiments, the indexing engine 100 includes one or
more taxonomies of concepts that are used to index source
documents. These concept taxonomies may include a variety of
sub-concept taxonomies. For example, a concept taxonomy may include
a "disease" sub-taxonomy, which may further include a "neurological
disease" sub-taxonomy listing the preferred names of neurological
diseases as well as any synonyms or irregular morphological
variants of those preferred names. In some embodiments, each
concept taxonomy and/or sub-concept taxonomy is associated with a
unique concept identifier. When the indexing engine identifies a
concept (or synonym for that concept) within a source document, the
indexing engine records the position of the concept within the
source document in the index. In some embodiments, a user may
update and/or import a taxonomy or sub-taxonomy.
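One possible shape for a taxonomy-driven concept index is sketched below in Python. The concept identifier `C001`, the dictionary layout, and the function names are assumptions made for illustration; the synonyms are those given as examples in the description.

```python
from collections import defaultdict

# Hypothetical taxonomy: concept identifier -> (preferred name, synonyms).
TAXONOMY = {
    "C001": ("breast cancer", ["breast carcinoma", "breast neoplasm"]),
}


def build_lookup(taxonomy):
    """Map every preferred name and synonym (as a word tuple) to its concept id."""
    lookup = {}
    for concept_id, (preferred, synonyms) in taxonomy.items():
        for name in [preferred, *synonyms]:
            lookup[tuple(name.lower().split())] = concept_id
    return lookup


def index_concepts(doc_id, words, lookup, index):
    """Record (doc_id, start, end) word offsets for each concept mention."""
    lowered = [w.lower() for w in words]
    max_len = max((len(key) for key in lookup), default=0)
    for start in range(len(lowered)):
        # Try the longest candidate phrase first.
        for length in range(min(max_len, len(lowered) - start), 0, -1):
            key = tuple(lowered[start:start + length])
            if key in lookup:
                index[lookup[key]].append((doc_id, start, start + length))
                break


index = defaultdict(list)
index_concepts("doc1", "Breast carcinoma and breast cancer differ".split(),
               build_lookup(TAXONOMY), index)
print(dict(index))
```

Because mentions are stored under the concept identifier, a later query for the concept finds every synonym without having to list them.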
[0022] Querying engine 105 evaluates the constraints of a query 135
against the index 130. In some embodiments, the querying engine
includes one or more taxonomies that may be used to evaluate a
query. For example, the querying engine may expand a query to
search for synonyms of a concept (or multiple concepts) of a query.
That is, the taxonomy may be included as part of a query. In some
embodiments, the constraints are provided to the querying engine
via an API so that queries 135 can be run, for example, as part of
scheduled automatic processes.
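Query expansion of the kind described above could be sketched as follows. The dictionary layout and the `expand` function are hypothetical and serve only to show the idea of replacing one term with all names of its concept.

```python
# Hypothetical taxonomy: preferred name -> synonyms.
TAXONOMY = {"breast cancer": ["breast carcinoma", "breast neoplasm"]}


def expand(term, taxonomy):
    """Return every name of the concept that `term` refers to, or the term itself."""
    term = term.lower()
    for preferred, synonyms in taxonomy.items():
        names = [preferred, *synonyms]
        if term in names:
            return names
    return [term]


print(expand("Breast Neoplasm", TAXONOMY))
```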
[0023] In some embodiments, a query 135 is received by the querying
engine 105 from a user. When querying the index, a user may impose
a variety of constraints. The constraints of a query may include
keywords, concepts, linguistic patterns, regions, etc. For example,
the user may specify a query for a document containing a word in
the title region, and having a particular concept (e.g., a
neurological disease) in the description section of that document.
That is, the querying engine allows a user to search the index to
locate all instances of a particular region relevant to the user's
query. In some embodiments, the user constraints are provided from
a search-style text box (see, e.g., FIG. 6A), or from a graphical
drag and drop interface (see, e.g., FIG. 6B). The user can pick all
regions, an individual region, or multi-pick a set of regions to
include in the search. Regions can also be organized in a hierarchy
so that users can select a group of regions by selecting regions
that are higher in the hierarchy. In some embodiments, a user can
select a region of a document within which to search (see, e.g.,
FIG. 6C). The querying engine 105 may provide an interactive query
interface that enables a user to refine a general query based on
user-specified criteria (e.g., a selected region) and/or other
metadata describing the index schema or taxonomies exposed to the
user through the interactive query interface.
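A region-constrained query, such as a word in the title combined with a term in the description, might be evaluated as in the Python sketch below. The in-memory index layout and the `search` function are assumptions for illustration; a real index would store positions rather than word lists.

```python
from typing import Dict, List

# Hypothetical index: document id -> region type -> lower-cased words in that region.
Index = Dict[str, Dict[str, List[str]]]


def search(index: Index, constraints: Dict[str, str]) -> List[str]:
    """Return ids of documents in which every (region, word) constraint is met."""
    hits = []
    for doc_id, regions in index.items():
        if all(word.lower() in regions.get(region, [])
               for region, word in constraints.items()):
            hits.append(doc_id)
    return hits


index = {
    "d1": {"Title": ["epilepsy", "trial"], "Description": ["patients", "with", "epilepsy"]},
    "d2": {"Title": ["asthma", "trial"], "Description": ["epilepsy", "mentioned"]},
}
print(search(index, {"Title": "epilepsy", "Description": "epilepsy"}))
```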
[0024] Output engine 110 analyzes and formats the results of the
querying engine 105. The output engine may present the query
results 140 in a variety of formats, including, but not limited to
HTML, XLS (Excel format), XML, CSV (comma separated list), TSV (tab
separated list), network graph languages (e.g., SIF, XGML), etc.
FIG. 3A illustrates results for a query searching for different
types of medical studies. Column 300 identifies the preferred names
for the types of studies identified, which can be a synonym or
morphological variant of a term in the text (e.g., "non-blind" vs.
"non-blinding"). Column 305 identifies the number of documents in
which the term was found. Column 310 shows an identifier for the
identified documents, i.e., a unique identifier that is linked to
the document. Column 315 provides the number of instances that the
term appears within the document. Column 320 shows as evidence a
segment of representative text where the study name occurs.
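For the delimited text formats named above, a minimal rendering step could use Python's standard `csv` module, as sketched here. The `render` function and the sample row are hypothetical.

```python
import csv
import io


def render(rows, header, fmt="csv"):
    """Serialize result rows as comma-separated or tab-separated text."""
    delimiter = {"csv": ",", "tsv": "\t"}[fmt]
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


print(render([("single-blind", 2)], ["study type", "documents"], fmt="tsv"))
```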
[0025] In some embodiments, the output engine 110 determines the
format and/or the form of the results based on the constraints of
the query. The output engine may include a variety of default
output rules associated with particular types of queries. For
example, the output engine may include a rule associated with class
queries (e.g., types of medical studies) that indicates the form of
the results will include a key column (e.g., "study type") having
rows corresponding to the preferred class names (e.g., clinical,
single-blind, etc.). As another example, the output engine may
include a rule associated with linguistic pattern queries that
orders columns according to the order of the query terms. In this
example, the query "dosage" followed by the word "of" followed by
"any drug or chemical" (see, e.g., FIGS. 6A and 6B) would have the
following default ordering of columns: "dosage" (first column) and
"drug or chemical" (second column). In some embodiments, the
default rules have an order of precedence. For example, a rule
having a higher order of precedence may provide that columns
corresponding to prepositions (e.g., "of") are not displayed.
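One reading of default output rules with an order of precedence is sketched below in Python: each rule transforms the list of columns, and rules with higher precedence run first. The numeric precedence values, the column dictionaries, and the function names are assumptions.

```python
def hide_prepositions(columns):
    """Drop columns that correspond to prepositions such as "of"."""
    return [c for c in columns if c["kind"] != "preposition"]


def order_by_query(columns):
    """Order columns by the position of their term in the query."""
    return sorted(columns, key=lambda c: c["position"])


# Hypothetical rule table: (precedence, rule); higher precedence is applied first.
RULES = [(1, order_by_query), (10, hide_prepositions)]


def layout(columns):
    for _, rule in sorted(RULES, key=lambda r: r[0], reverse=True):
        columns = rule(columns)
    return [c["name"] for c in columns]


columns = [
    {"name": "drug or chemical", "kind": "class", "position": 2},
    {"name": "of", "kind": "preposition", "position": 1},
    {"name": "dosage", "kind": "class", "position": 0},
]
print(layout(columns))  # ['dosage', 'drug or chemical']
```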
[0026] In some embodiments, the output engine 110 determines the
format and/or the form of the results based on display preferences
specified by the user. The user's display preferences may be
specified as part of a query and/or stored within a user profile.
In some embodiments, the output engine includes an output editor
that allows the user to manipulate how the results are displayed.
For example, in the "dosage" example above, the user may manipulate
the column order such that the "drug or chemical" column is listed
first and the "dosage" column is listed second (see, e.g., results
shown in FIG. 4). The user may also specify display preferences
after the query is executed to automatically change the format or
form in which the results are displayed. For example, users can
specify one or more regions to be displayed in the results. Regions
can be nested, and the system allows users to exploit this, for
example, to look for the introduction of the conclusion.
[0027] In some embodiments, the output engine 110 provides a
variety of controls that allow a user to change how the results are
displayed. For example, the output engine may provide controls that
allow the user to add or remove columns, order the results (e.g.,
by the document identifier, by the frequency of a term or terms
within a document or region, alphabetically, etc.), etc.
[0028] In some embodiments, the output engine 110 enables the user
to drill down within a particular result to examine the evidence
for that result. As shown in FIGS. 3A and 3B, columns 305 and 315
include controls 330 and 335 represented by the arrows. The
controls allow the user to open (or close) a row to show (or hide)
the documents corresponding to a particular result (e.g., a study
type such as single-blind). The controls of column 315 allow the
user to open (or close) a row to show (or hide) the instances of
the study type within a particular document. FIG. 3B illustrates
the effect of expanding the single-blind row (i.e., row 325). When
control 330 is activated, row 325 expands to show the documents
corresponding to the single-blind study type. When control 335 is
activated, row 325 expands to show the instances of the
single-blind study type within the text of document 340. In some
embodiments, the output engine ranks the results. For example, the
document having the greatest number of instances of a term is
listed first among documents that have the same terms. As shown in
FIGS. 3A and 3B, document 340 has the greatest number of instances
(i.e., 2) of the single-blind study type.
[0029] In some embodiments, when the output engine 110 clusters
similar and/or identical results, the output engine determines
whether all of the documents or only a selection of the documents
will be presented to the user. For example, the output engine may
delete duplicate documents or display only a selection of the
documents when the cluster is based on non-key columns. In some
embodiments, the output engine orders the results. For example, the
results may be ordered alphabetically or according to frequency,
with the results found in the most documents ordered first.
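The clustering, ranking, and ordering described in the preceding paragraphs can be illustrated with the following Python sketch. The input tuple layout and the `cluster` function are hypothetical; the sketch groups hits by preferred name, ranks documents within a group by number of instances, and orders groups by the number of documents in which they were found.

```python
from collections import defaultdict


def cluster(hits):
    """hits: iterable of (preferred_name, doc_id, evidence_text) tuples."""
    by_term = defaultdict(lambda: defaultdict(list))
    for term, doc_id, evidence in hits:
        by_term[term][doc_id].append(evidence)
    rows = []
    for term, docs in by_term.items():
        # Within a term, documents with the most instances come first.
        ranked = sorted(docs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
        rows.append({"term": term, "doc_count": len(docs), "docs": ranked})
    # Terms found in the most documents come first; ties are alphabetical.
    rows.sort(key=lambda r: (-r["doc_count"], r["term"]))
    return rows


hits = [
    ("single-blind", "d2", "a single-blind study"),
    ("single-blind", "d1", "single-blind trial"),
    ("single-blind", "d1", "single blinded design"),
    ("double-blind", "d3", "double-blind study"),
]
for row in cluster(hits):
    print(row["term"], row["doc_count"], [doc for doc, _ in row["docs"]])
```

The nested structure (term, then documents, then instances) mirrors the two levels of expansion offered by the drill-down controls.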
[0030] In some embodiments, the output engine 110 highlights text
areas that are relevant to the query in the results of the query.
For example, the column 320 in FIG. 3B includes highlighted terms
and phrases that were included in or related to the search query.
The output engine may also provide hyperlinks to the documents
identified by the particular query. For example, the document
identifier such as identifier 340 may include a hyperlink to the
document. In some embodiments, the hyperlinks are included in the
relevant parts of the results so that a user can navigate to the
position within a document where the displayed region is
located.
[0031] In some embodiments, the output engine 110 groups the
results. For example, the results may be grouped according to a
preferred term, concept, string, or character position. By grouping
results, relationships among terms of the query are identified for
the user. FIG. 4 illustrates part of a table of results associated
with a search for sentences containing "drugs" and "dosages" in one
or more linguistic patterns. Column 400 shows as evidence a
representative sentence in the text where a drug appears with a
dosage in a linguistic pattern. Linguistic patterns include classes
(e.g., drugs, dosages, companies, people, genes, proteins, etc.) in
a particular structure within a sentence. This includes the classes
being at a certain word distance, or in a syntactic or semantic
relationship composed from linguistic units such as noun groups,
verb groups or prepositions. As shown in FIG. 4, the results have
been grouped according to the preferred names of the drugs, and
particular dosages (i.e., columns 405 and 410 respectively).
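A linguistic pattern based on word distance, the simplest case mentioned above, might be matched as in this Python sketch. The drug list, the dosage expression, and the distance threshold are all hypothetical stand-ins for the indexed classes; syntactic and semantic relationships are not modeled.

```python
import re

DRUGS = {"aspirin", "ibuprofen"}         # hypothetical members of a "drug" class
DOSAGE = re.compile(r"^\d+(\.\d+)?mg$")  # hypothetical dosage form, e.g. "75mg"


def match_pattern(sentence, max_distance=5):
    """Return (drug, dosage) pairs that occur within max_distance words of each other."""
    words = [w.strip(".,;").lower() for w in sentence.split()]
    pairs = []
    for i, word in enumerate(words):
        if word not in DRUGS:
            continue
        for j, other in enumerate(words):
            if abs(i - j) <= max_distance and DOSAGE.match(other):
                pairs.append((word, other))
    return pairs


print(match_pattern("Patients received a dosage of 75mg of aspirin daily."))
```

Pairs produced this way can then be grouped by drug and dosage to form the key columns of the results table.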
[0032] In some embodiments, the output engine 110 can combine one
or more queries. For example, the output engine may add queries;
subtract queries; determine the intersection, union, or difference
of queries; and/or join queries. FIG. 5 illustrates the joining of
two queries (formatted with the option of displaying the evidence
to the right). In this example, a first query looks for a
relationship between the drug cyclosporine and any gene. The search
is for the concept cyclosporine which includes the synonym CsA (as
shown in row 500 of the results). A second query looks for the
relationship between any gene and psoriasis. By joining the first
query and the second query, the results provide a list of potential
gene intermediaries and hence a hypothesis for the connection
between cyclosporine and psoriasis. That is, the joined results
provide evidence of an indirect relationship between cyclosporine
and the disease psoriasis.
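Joining two query results on a shared column can be sketched in Python as follows. The gene names `GENE_A`, `GENE_B`, and `GENE_C` are placeholders rather than real findings, and the `join_on_gene` function is hypothetical.

```python
def join_on_gene(drug_gene, gene_disease):
    """Join (drug, gene) rows with (gene, disease) rows on the gene column."""
    by_gene = {}
    for gene, disease in gene_disease:
        by_gene.setdefault(gene, []).append(disease)
    return [
        (drug, gene, disease)
        for drug, gene in drug_gene
        for disease in by_gene.get(gene, [])
    ]


first_query = [("cyclosporine", "GENE_A"), ("cyclosporine", "GENE_B")]
second_query = [("GENE_B", "psoriasis"), ("GENE_C", "psoriasis")]
print(join_on_gene(first_query, second_query))  # [('cyclosporine', 'GENE_B', 'psoriasis')]
```

Only genes that appear in both result sets survive the join, which is what makes them candidate intermediaries.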
[0033] The system's uniform treatment of linguistic units (e.g.,
sentences, noun groups, and verb groups), structural units (e.g.,
paragraphs, sections, and titles), and metadata (e.g. publication
year or list of authors) allows users considerable freedom to
formulate queries and receive results that are both relevant and
easy to process. For example, users can search for words or
concepts within specific regions. FIG. 7 illustrates a query over a
combination of semi-structured and unstructured data. In this
example, the patent numbers (column 700) and the publication dates
(column 705) were extracted from the metadata (semi-structured
text) of the documents, and the amounts (column 710) and drugs
(column 715) were extracted from the unstructured text of the
documents.
[0034] Those skilled in the art will appreciate that various
architectural changes to the system may be made while still
providing similar or identical functionality. For example, the
system may be implemented in a variety of environments including a
single, monolithic computer system, a distributed system, as well
as various other combinations of computer systems or similar
devices connected in various ways. Moreover, those skilled in the
art will further appreciate that the actions of the system
described in FIG. 1 may be altered in a variety of ways. For
example, the order of the actions may be rearranged, certain
actions may be performed in parallel, actions may be omitted, or
other actions may be included.
* * * * *