U.S. patent application number 14/601837, filed on 2015-01-21 and published on 2015-05-14 as publication number 20150134574, relates to self-learning methods for automatically generating a summary of a document, knowledge extraction and contextual mapping.
The applicant listed for this patent is Syed YASIN. The invention is credited to Syed YASIN.
Application Number: 14/601837
Publication Number: 20150134574
Kind Code: A1
Family ID: 43901157
Inventor: YASIN; Syed
Publication Date: May 14, 2015
United States Patent Application
SELF-LEARNING METHODS FOR AUTOMATICALLY GENERATING A SUMMARY OF A
DOCUMENT, KNOWLEDGE EXTRACTION AND CONTEXTUAL MAPPING
Abstract
Advanced Machine Learning or Unsupervised Machine Learning
techniques are provided that relate to self-learning processes by
which a machine generates a sensible automated summary, extracts
knowledge, and extracts contextually related Topics, along with the
justification that explains "why they are related", automatically,
without any human intervention or guidance (backing ontologies)
during the process. Such processes also relate to generating a
360-Degree Contextual Result (360-DCR) using Auto-Summary,
Knowledge Extraction and Contextual Mapping.
Inventors: YASIN; Syed (Bangalore, IN)

Applicant:
Name | City | State | Country | Type
YASIN; Syed | Bangalore | | IN |

Family ID: 43901157
Appl. No.: 14/601837
Filed: January 21, 2015
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
13575478 | Jul 26, 2012 | 8977540
PCT/IB11/50409 | Jan 31, 2011 |
14601837 | |
Current U.S. Class: 706/11
Current CPC Class: G06F 16/93 20190101; G06F 16/345 20190101; G06N 20/00 20190101; G06F 16/285 20190101; G06N 7/00 20130101
Class at Publication: 706/11
International Class: G06N 99/00 20060101 G06N099/00; G06F 17/30 20060101 G06F017/30; G06N 7/00 20060101 G06N007/00
Foreign Application Data

Date | Code | Application Number
Feb 3, 2010 | IN | 267/CHE/2010
Claims
1. A self-learning method for automatically generating a summary of
a document without human intervention, said method comprising acts
of: extracting Important Words (IW) of the document based on the
incremental order of their occurrence; listing the IW's extracted
in the order of highest Word Group (WG), wherein the highest word
group is the combination of the maximum number of words that go
together as one word; and for each IW, starting in the order of
highest word group, analyzing every sentence in the document to
determine the presence of the IW and thereafter extracting all the
sentences having the corresponding IW as important sentences (IS),
after eliminating redundancies, to generate the auto-summary for
the document.
2. The method as claimed in claim 1, wherein identifying the WG and
IW comprises using word articles and punctuation marks from the
document.
3. A self-learning method for automatically extracting knowledge
from a given set of documents without human intervention, said
method comprising acts of: extracting Important Words (IW) and
their corresponding Important Sentences (IS), and Topics (T), of
the documents in a predetermined order; eliminating duplicates of
each extracted IW and its corresponding sentences; and clustering
the IS's and Topics (T) in the list based on the extracted IW's as
a "Contextual-Topical Cluster" and a "Knowledge-Cluster" to extract
knowledge and related contextual Topics from the given set of
documents.
4. The method as claimed in claim 3, wherein defining a topic for
each document is done by comparing each IW in the document with its
file name and Title name; if any of the IW's matches, then that IW
is defined as a Topic.
5. The method as claimed in claim 4, wherein the IW with the
highest frequency of occurrence in the document is defined as the
topic of the document.
6. The method as claimed in claim 3, wherein the duplicates of each
extracted IW and its corresponding sentences are eliminated using a
hashing technique.
7. A self-learning method for automatically displaying 360-degree
Contextual Search Results without human intervention, said method
comprising acts of: generating a Topic (T), Important-Words (IW),
Important Sentences (IS) and an Auto-Summary (SY) for a given
document; storing the generated Auto-Summary as a field value
during indexing, along with the corresponding Topic and Content of
the document, in a Master-Index; extracting a Topic List (TL) by
processing the Master-Index and thereafter generating 360-degree
Contextual Mapping (360-DCM) into a 360-DCM cluster; extracting
Knowledge from the document into a Knowledge Extraction (KE)
cluster; and analyzing a user query to identify a Topic in the TL
and the corresponding 360-DCM cluster to return related Topics
along with the relationship map, wherein the Master-Index returns
search results along with an auto-summary for each result, and the
KE cluster returns relevant knowledge for the search query to
display 360-degree Contextual Search Results.
8. The method as claimed in claim 7, wherein generating the
auto-summary comprises acts of: extracting Important Words (IW) of
the document based on the incremental order of their occurrence;
listing the IW's extracted in the order of highest Word Group (WG);
splitting the document into sentences and storing the sentences in
sequential order as an Array of Sentences (AS); for each IW,
starting in the order of highest word group, analyzing every
sentence in the AS to determine the presence of the IW and
thereafter extracting all the sentences having the IW as important
sentences (IS) to eliminate redundancies; and removing the
extracted sentences from the AS and the corresponding IW from the
list of IW's to generate the auto-summary for the document.
9. The method as claimed in claim 7, wherein extracting Knowledge
from the document into the Knowledge Extraction (KE) cluster
comprises acts of: extracting Important Words (IW) and their
corresponding Important Sentences (IS), and Topics (T), of the
documents in a predetermined order; eliminating duplicates of each
extracted IW and its corresponding sentences; and clustering the
IS's and Topics (T) in the list based on the extracted IW's or the
Topic as a "Contextual-Topical Cluster" and a "Knowledge-Cluster"
to extract knowledge and related contextual Topics from the given
set of documents.
10. The method as claimed in claim 7, wherein generating the
360-degree Contextual Mapping comprises acts of: indexing one or
more documents; storing the topics identified for each document in
a predetermined order as a Topical List (TL) during the indexing
process and removing duplicate topics from the TL; extracting a
predefined number of results for each Topic in the TL by searching
one Topic at a time in the index; extracting the corresponding
Topic and Content for each extracted result and storing them in a
predetermined order as a Result-List (RL) in temporary storage;
analyzing the RL for each topic to extract Related Topics;
analyzing the corresponding document Content of the Topic to
extract "why they are related" phrases from the content; and
clustering the resultant "Related Topics" along with their
respective sentences to generate the 360-degree Contextual
Mapping.
Description
CROSS-REFERENCE TO THE RELATED APPLICATIONS
[0001] This application is a divisional application of U.S. patent
application Ser. No. 13/575,478 filed Jul. 26, 2012, which is a
U.S. National Stage Application claiming the benefit of prior filed
International Application No. PCT/IB2011/050409 filed Jan. 31,
2011, in which the International Application claims a priority date
of Feb. 3, 2010 based on prior filed Indian Application No.
267/CHE/2010, the entire contents of which are incorporated herein
by reference.
TECHNICAL FIELD
[0002] Embodiments of the present disclosure relate to Advanced
Machine Learning or Unsupervised Machine Learning techniques. More
particularly, embodiments relate to a self-learning process by
which a machine generates a sensible automated summary, extracts
knowledge, and extracts contextually related Topics, along with the
justification that explains "why they are related", automatically,
without any human intervention or guidance (backing ontologies)
during the process.
BACKGROUND
[0003] No search engine today brings the justification or
description answering "Why this relation?" while presenting
contextual Topics or Search Refinements for a user query during the
process of search. Users wonder: why or how is this Topic related?
Also, knowledge representation, as opposed to mere information
retrieval based on user queries, is the key to the next generation
of search. This algorithm brings in a 360-degree contextual
knowledge representation, apart from being capable of answering
specific questions.
[0004] Currently, most search engines perform mere keyword-based
information extraction using relevance algorithms. There is a huge
demand for overall, or 360-degree, contextual knowledge
representation in the search industry, which is the future of
search. In a nutshell, we have built, and are in the process of
building, a revelation of a 3rd & 4th generation search engine.
[0005] In light of the foregoing discussion, there is a need for a
method to solve the above mentioned shortcomings in the search
industry.
SUMMARY
[0006] The shortcomings of the prior art are overcome and
additional advantages are provided through the provision of a
method and a system as described in the description.
[0007] The present disclosure overcomes the limitations of existing
techniques by providing Advanced Machine Learning or Unsupervised
Machine Learning techniques, which use a mathematical approach to
identify and extract knowledge concepts in a given set of documents
(unstructured data). This approach does not necessarily need
training data to help make decisions when building the 360-degree
contextual map, but rather has the ability to learn statistically
from the data itself. Given a set of natural documents, web pages
or anything similar, the algorithm is elegant enough to organize
the knowledge concepts automatically without any human guidance
during the process.
[0008] In one embodiment, the technology disclosed in the present
disclosure provides a method/process that is elegant enough to
sensibly build an Auto-Summary of a given document completely
automatically (self-learning), using the important words identified
from the document.
[0009] In one embodiment, the technology disclosed in the present
disclosure is a novel and inventive Text-Analytics framework, which
extracts knowledge completely automatically from information
indexed or processed (self-learning). The solution proposed in the
present disclosure brings in 360-degree contextual results, which
are highly effective from a usability perspective.
[0010] In one embodiment, the present technology or process can be
used in Search Engines, both Web Search and Enterprise Search. It
can also be used in online business (AdWords, AdSense) and in the
summarization of documents, web pages, etc. More importantly, LPeSr
brings 360-degree contextual mapping of knowledge and contextual
clusters.
[0011] The foregoing summary is illustrative only and is not
intended to be in any way limiting. In addition to the illustrative
aspects, embodiments, and features described above, further
aspects, embodiments, and features will become apparent by
reference to the drawings and the following detailed
description.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
[0012] The novel features and characteristics of the disclosure are
set forth in the appended claims. The embodiments of the disclosure
itself, however, as well as a preferred mode of use, further
objectives and advantages thereof, will best be understood by
reference to the following detailed description of an illustrative
embodiment when read in conjunction with the accompanying drawings.
One or more embodiments are now described, by way of example only,
with reference to the accompanying drawings wherein like reference
numerals represent like elements and in which:
[0013] FIG. 1 is a flowchart illustrating a methodology to generate
an automated summary of the document and to extract knowledge
concepts from the set of given documents, in accordance with an
exemplary embodiment.
[0014] FIG. 2 is a flowchart illustrating a methodology to generate
an automated summary of the document, in accordance with an
exemplary embodiment.
[0015] FIG. 3 is a flowchart illustrating a methodology to generate
Knowledge-Extraction (KE) based on Text-Analytics of a given set of
documents, in accordance with an exemplary embodiment.
[0016] FIG. 4 is a flowchart illustrating a methodology to generate
360-Degree Contextual Mapping (360-DCM) Cluster, in accordance with
an exemplary embodiment.
[0017] FIG. 5 is a flowchart illustrating a methodology to generate
360-Degree Contextual Results (360-DCR), in accordance with an
exemplary embodiment.
[0018] FIG. 6 is an exemplary snapshot of a web page highlighting
the auto-summary created by the present technology for given
documents and/or web pages.
[0019] FIG. 7 is an exemplary snapshot of a web page highlighting
the Knowledge-Extraction (KE) created by the present technology for
given documents and/or web pages.
[0020] FIG. 8 is an exemplary snapshot of a web page displaying an
exploded view of the contextually related topic link that gives
information on why and how the selected topics are contextually
related.
[0021] FIG. 9 is an exemplary snapshot of a web page displaying
360-Degree Contextual Results (360-DCR) for the selected Topic, in
accordance with an exemplary embodiment.
[0022] The figures depict embodiments of the disclosure for
purposes of illustration only. One skilled in the art will readily
recognize from the following description that alternative
embodiments of the structures and methods illustrated herein may be
employed without departing from the principles of the disclosure
described herein.
DETAILED DESCRIPTION
[0023] The foregoing has broadly outlined the features and
technical advantages of the present disclosure in order that the
detailed description of the disclosure that follows may be better
understood. Additional features and advantages of the disclosure
will be described hereinafter which form the subject of the claims
of the disclosure. It should be appreciated by those skilled in the
art that the conception and specific embodiment disclosed may be
readily utilized as a basis for modifying or designing other
structures for carrying out the same purposes of the present
disclosure. It should also be realized by those skilled in the art
that such equivalent constructions do not depart from the spirit
and scope of the disclosure as set forth in the appended claims.
The novel features which are believed to be characteristic of the
disclosure, both as to its organization and method of operation,
together with further objects and advantages will be better
understood from the following description when considered in
connection with the accompanying figures. It is to be expressly
understood, however, that each of the figures is provided for the
purpose of illustration and description only and is not intended as
a definition of the limits of the present disclosure.
[0024] Exemplary embodiments of the present disclosure relate to
Latent Precis Extraction and Synecdoche Representation (LPeSr), a
self-learning methodology/process designed to extract important
words automatically from a given document or web page. It then
builds an automated, machine-generated summary that is not mere
chopping-off or truncation of paragraphs, but a genuinely sensible
summary that ensures the important sentences are captured; hence
the word "Precis", or "a summary". Based on this first level of
processing, the method then maps the overall extracted information
into "Knowledge Entities" using a linguistic framework built on
Natural Language Processing (NLP). During the final stage, the
method brings in the description of the contextually related
clusters, explaining "Why this relation?". In short, a "360-degree
contextual representation of Knowledge" or "360-degree contextual
map" is achieved, as opposed to mere information retrieval; hence
the word "Synecdoche" (a part of something that is used to refer to
the whole thing).
[0025] The aforesaid features are explained in detail herein below
with the help of examples for better understanding. However, these
examples should not be construed as a limitation on the scope of
the present technology.
[0026] Referring now to FIG. 1, which illustrates a high-level
snapshot of the steps used in generating an automated summary of a
document and extracting knowledge concepts from a given set of
documents. The basic step starts with processing the documents or
web pages using corresponding parsers to extract the textual
information. The textual content within the document is the input
to the algorithm. The instant methodology then processes the
content of each document to extract the important words within it
automatically. A standard procedure of finding the frequency of
each word in the document can be used to extract the high-frequency
words after filtering the word articles from the document; the
present technology instead makes use of an elegant technique that
is considerably more advanced and efficient in terms of quality.
These words make real sense, from a practical standpoint, for use
in generating the automated summary of the document and extracting
knowledge concepts from the given set of documents.
[0027] As seen from FIG. 1, the end result of the instant
technology disclosed in the present disclosure is to generate a
sensible auto-summary of the document and to extract key concepts,
or knowledge concepts, from the summarization process. The
following sections therefore get into the details of the core
aspects of the auto-summarization process and the extraction of
knowledge concepts.
[0028] Let's quickly look at the technique used as a base for the
extraction of Important Words (IW) with High-Frequency Occurrence.
In one embodiment, the words that go together are identified first;
for example, "Saudi Arabia" is one word although made up of two
words. This is termed "Saudi Arabia" = m(2), meaning two words
making sense as one word. Therefore, in a given document the method
finds such words that go together in the incremental order of their
occurrences, which means it first finds the highest order, like
m(5), then m(4), then m(3), and so on. The general way to define
this is "m(n)", where "n" is the number of words in the group that
make one word as a whole, and "m" is a constant, as it is a
representation of the word as a whole.
[0029] In one embodiment, the procedure applied for extracting the
high-frequency words of the document is as follows: [0030] a. m(n),
m(n-1), m(n-2), m(n-3) . . . m(n-p) [0031] b. Given the above, "n"
could be any positive integer; typically, the value of "n" is 4 or
5 in practical scenarios, and "n-p" is always 2.
[0032] For example, suppose the document content is something like:
[0033] "The Kingdom of Saudi Arabia, commonly known as Saudi Arabia
is the largest Arab country of the Middle East. It is bordered by
Jordan and Iraq on the north and northeast, Kuwait, Qatar and the
United Arab Emirates on the east, Oman on the southeast, and Yemen
on the south. It is also connected to Bahrain by the King Fahd
Causeway . . . "
[0034] To keep the explanation simple, the value of "n" is
considered to be "2" (value of p = 0); therefore, the words that go
together are identified by the following procedure:
[0035] Consider the first two words in the string, i.e. "The
Kingdom". Since "The" is a word article, this combination is
skipped. The next combination, "Kingdom of", is skipped too because
"of" is a word article, as is "of Saudi". The next combination,
"Saudi Arabia", seems to make sense to a human but does not yet
make sense to a machine; however, since it contains no word article
or punctuation mark, it is recorded. Assuming this to be a valid
word, the method processes the other words similarly until it finds
the same combination again; if it is found, then "Saudi Arabia"
becomes a valid word of combination 2. It follows that the
probability of such a combination depends upon at least one
co-occurrence of the word.
[0036] Wherever such a combination is found in the contents of the
document, the method extracts the word, records its frequency of
occurrence, and replaces it with a void or null value. Having
mentioned this, let's now assume that the initial value of n = 4;
then the first four-word combination is considered and compared
with the consecutive combination, e.g. "The Kingdom of Saudi" &
"Kingdom of Saudi Arabia" are compared, and so on. Once the
document reaches its end, the value of n becomes n-1, which is 3.
The process is repeated on the same document, now also keeping in
mind the void or null values that might have replaced any valid
combination of 4-word groups. This whole process is repeated until
the value of p = 0.
[0037] Now, the only leftover words are obviously single words;
they are processed to find the highest-frequency words. The
combination of Group(4), Group(3), Group(2) and Group(1) words
gives a set of valid words that make up the document; in other
words, these words are important to the document.
[0038] Therefore, in a practical scenario, if a given document has
about 1500 to 2000 words, then given a very comprehensive list to
eliminate "common words", or simply "Word-Articles" (is, it, the,
do, should, . . . ), the words in the document are reduced by more
than 50%. A huge list of these common words for English, and for
other languages, is freely available (we use one such list and
fine-tune it if any words are missed).
[0039] In one embodiment, these common words are filtered out only
during the last stage of the process, i.e. during the processing of
single words. Initially, for Group (4), Group (3) and Group (2), we
very much use this common-word list to figure out "Word-Groups"
(WG), as explained earlier. Once the method is through the
procedure of extracting WG's and single words, they typically come
down to 15 to 20 words for a document of about 2000 words. Now,
these words are used in a controlled way to develop a sensible
automated summary for the given document.
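The word-group procedure described above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `extract_word_groups`, the tiny stop-list, and the co-occurrence threshold of 2 are all choices made for the sketch.

```python
import re
from collections import Counter

# A tiny stop-list standing in for the "word articles" list; the disclosure
# assumes a much larger, freely available common-word list.
WORD_ARTICLES = {"the", "of", "is", "it", "by", "and", "on", "to", "a", "as"}

def extract_word_groups(text, n_max=4):
    """Find word groups m(n) down to m(2) that recur in the text, then
    high-frequency single words, roughly as the disclosure sketches."""
    tokens = re.findall(r"[A-Za-z]+", text)
    important = Counter()
    for n in range(n_max, 1, -1):              # m(n), m(n-1), ..., m(2)
        grams = Counter()
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # skip windows touching a word article or an already-voided slot
            if any(w == "" or w.lower() in WORD_ARTICLES for w in gram):
                continue
            grams[" ".join(gram)] += 1
        for gram, freq in grams.items():
            if freq >= 2:                      # at least one co-occurrence
                important[gram] = freq
                tokens = _void(tokens, gram.split())  # replace with null
    # leftover single words: high-frequency non-article words
    singles = Counter(w for w in tokens if w and w.lower() not in WORD_ARTICLES)
    important.update({w: f for w, f in singles.items() if f >= 2})
    return important

def _void(tokens, gram):
    """Null out every occurrence of gram in the token stream."""
    out, n, i = list(tokens), len(gram), 0
    while i <= len(out) - n:
        if out[i:i + n] == gram:
            out[i:i + n] = [""] * n
            i += n
        else:
            i += 1
    return out
```

Run on the "Saudi Arabia" sample paragraph with n_max = 2, the sketch records "Saudi Arabia" as a valid 2-word group because it co-occurs, while one-off bigrams are discarded.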
[0040] In one embodiment, the following hypothesis is made, on the
basis of which the machine/methodology judges the logic for
developing an automated summary & knowledge extraction for a given
document:
[0041] 1. Every sensible document has a meaning, a message that it
illustrates to its readers.
[0042] 2. There are a set of words that are important, around which
"common words" or "word-articles" attach to make sentences that
describe these IW. Refer to the paragraph again:
[0043] "The Kingdom of Saudi Arabia, commonly known as Saudi Arabia
is the largest Arab country of the Middle East. It is bordered by
Jordan and Iraq on the north and northeast, Kuwait, Qatar and the
United Arab Emirates on the east, Oman on the southeast, and Yemen
on the south. It is also connected to Bahrain by the King Fahd
Causeway . . . "
[0044] The highlighted (bold) words are common words. Now, if we
analyze the above paragraph, point 2 becomes clear.
[0045] 3. The important words and common words together form
sentences; typically, sentences and paragraphs are the building
blocks of a document.
[0046] 4. Each sentence is separated by a "period" (.) symbol.
[0047] 5. Therefore, every sentence that contains the IW's becomes
important.
[0048] 6. Also, every other sentence that is close to this
particular sentence can also be strongly assumed to be important.
[0049] 7. Therefore, we just need a technique to elegantly extract
these important sentences and join them to make a very sensible
summary.
[0050] In one embodiment, the step-by-step process used in the
generation of the Auto-Summary (Precis) is explained here in
detail. Firstly, the Important Words (IW's) are extracted using the
technique explained hereinabove. The next step is to list the IW's
extracted in the order of highest WG's. The given document is split
into sentences, using the period symbol as the mark for the split.
All of these sentences are stored in a storage medium in sequential
order; the storage medium includes, but is not limited to, an
array, a database or any other suitable medium. Let's call this the
"Array of Sentences", or "AS". For each IW, starting in the order
of highest WG's, every sentence in the AS is analyzed to find
whether said IW is present. If the IW is found, that particular
sentence is extracted; let's call it "S1". This sentence is then
removed from the AS, so the AS becomes (AS - S1). The corresponding
IW is also removed from the list of IW's. For example, if there are
10 IW's (IW1, IW2, IW3, . . . IW10), we would now have 9 IW's (IW2,
IW3, IW4, . . . IW10). The reason why we might want to remove an IW
after a match is found is to avoid repetitions or redundancies
during the final stage of auto-summary creation.
[0051] In one embodiment, the above process is repeated for all
other consecutive IW's. As a result, for say about 10 IW's, the
process would have extracted about 10 matching sentences (S1, S2,
S3 . . . S10). The combination of these forms a very sensible
summary of the given document in the real world; each sentence,
however, is appropriately separated by consecutive periods (about 4
to 5) to give an obvious indication that these sentences are
snippets of the document that have been joined. Therefore, the
resultant auto-summary (SY) → (S1, S2, S3 . . . S10) is a summation
of Important-Sentences (IS's) = Auto-Summary of the document, as
illustrated in FIG. 2. FIG. 6 is an exemplary snapshot of a web
page highlighting the auto-summary created by the present
technology for given documents and/or web pages.
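The auto-summary loop just described (match an IW, pull the sentence, shrink both the AS and the IW list) can be sketched as follows. The function name and the sentence-splitting regex are assumptions for illustration; the joining with consecutive periods follows the description above.

```python
import re

def auto_summary(text, important_words):
    """Build the auto-summary (SY): for each IW, highest word group
    first, pull the first remaining sentence containing it, then remove
    both the sentence and the IW to avoid redundancy."""
    # split on the period symbol, keeping sequential order (the AS array)
    sentences = [s.strip() for s in re.split(r"\.\s+", text) if s.strip()]
    # highest word group first: groups with more words sort earlier
    iw_list = sorted(important_words, key=lambda w: -len(w.split()))
    extracted = []
    for iw in iw_list:
        for s in sentences:
            if iw.lower() in s.lower():
                extracted.append(s)
                sentences.remove(s)   # AS becomes (AS - S1)
                break                 # the IW is consumed after one match
    # join the snippets with consecutive periods to mark them as excerpts
    return " .... ".join(extracted)
```

For a three-sentence document and the IW's ["Saudi Arabia", "Bahrain"], the sketch returns the two matching sentences joined as snippets, skipping the sentence that matches no IW.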
[0052] In another embodiment, the present disclosure provides
details about Knowledge-Extraction (KE) based on Text-Analytics of
a given set of documents. Based on the hypothesis mentioned
earlier, we now look into the process/methodology used for
extracting knowledge from a given set of documents.
[0053] It is understood that for a given n number of documents,
each document would have its corresponding IW's and Auto-Summary
(SY). Firstly, a Topic is defined for each document based on its
IW's and the combination of filename & Title name. This is done by
comparing each IW in a document with its filename & Title name; if
any of the IW's matches, then that IW is defined as the Topic, else
the IW with the highest frequency is defined as the Topic.
Sometimes the Topic can simply be the Title name, if it is
consistent across the given documents (if we manually analyze and
are content with the Title being the Topic, then we simply use the
Title as the Topic; in most cases this is true, but it involves
manual intervention).
[0054] Therefore, we now have, for each Document (D1), its
corresponding Topic (T1), Important-Words (IW's), its corresponding
Auto-Summary (SY1) and, more importantly, the Important-Sentences
(IS's) that were extracted to build the auto-summary, i.e. D1 = T1,
IW's, SY1, IS's. Therefore, for a given set of n documents, we
would have their corresponding Topics, Auto-Summaries,
Important-Words and Important-Sentences.
[0055] Now the data is analyzed statistically to extract
Knowledge-Clusters and corresponding Topic-Clusters that are
contextually related. For the given data set of n documents, the
list of IW's and their corresponding Important-Sentences (IS's) and
Topics (T's) is extracted in order (into an array, database or any
other suitable medium). Now, each IW and its corresponding sentence
is hashed to generate a hash code (H), which will be an integer
number; the hash code is associated with the IW's in the list in
the same order. In one embodiment, hashing is primarily used to
eliminate any redundancies during the KE process: since the same
hash code is generated for the same sentence, duplicates can be
removed, i.e. IW + IS = hash code, and no two hash codes in the
list will be the same after filtering duplicates. Hashing is used
only to facilitate the removal of redundancies; any other technique
can also be used as an alternative.
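The hash-based de-duplication step can be sketched as below, using Python's built-in `hash` on the (IW, IS) pair as a stand-in for whatever hashing technique an implementation would choose; the function name is an assumption for illustration.

```python
def dedupe_pairs(pairs):
    """Remove duplicate (IW, IS) pairs: the same sentence for the same
    important word hashes to the same code, so only the first survives."""
    seen = set()
    out = []
    for iw, sentence in pairs:
        code = hash((iw, sentence))   # IW + IS -> hash code H
        if code not in seen:
            seen.add(code)
            out.append((iw, sentence))
    return out
```

After this pass, no two retained pairs share a hash code, matching the "no two hash codes in the list would be the same" property described above.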
[0056] Once the duplicates are eliminated, the IS's & Topics (T's)
are grouped, or clustered, in the list based on the IW's.
Therefore, for each IW there would be "m" Important-Sentences
(IS's) & "m" Topics that might have been extracted. The Topics
within each IW cluster can be defined as contextually related:
based on the hypothesis mentioned earlier, if the sentence
containing the IW is important, then obviously the Topics of the
relative IS's from other documents for the same IW shall be
contextually related. Now we have two clusters for a given IW:
"Contextual-Topical Clusters" and "Knowledge-Clusters". The
aforesaid process is clearly illustrated in FIG. 3. In another
embodiment, it is possible to group based on Topic instead of
grouping based on IW; this depends on the scenario & requirement
that one is trying to address. FIG. 7 is an exemplary snapshot of a
web page highlighting the Knowledge-Extraction (KE) created by the
present technology for given documents and/or web pages.
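The grouping into the two clusters can be sketched as a simple grouping pass over (IW, IS, Topic) records; `build_clusters` and the record layout are assumptions for the sketch, not the patented data structures.

```python
from collections import defaultdict

def build_clusters(records):
    """Group (IW, IS, Topic) records by IW into a Knowledge-Cluster
    (the sentences) and a Contextual-Topical Cluster (the topics)."""
    knowledge = defaultdict(list)   # IW -> Important-Sentences (IS's)
    topical = defaultdict(set)      # IW -> contextually related Topics
    for iw, sentence, topic in records:
        knowledge[iw].append(sentence)
        topical[iw].add(topic)
    return knowledge, topical
```

Grouping by Topic instead of by IW, as the alternative embodiment suggests, would just swap the grouping key in the same loop.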
360-Degree Contextual Mapping (Synecdoche)
[0057] The basic philosophy of the present disclosure is to bring
"Knowledge, as opposed to mere Information" during the process of
retrieving results. Although the process is very laborious and
involves huge computation, the end results are simply amazing. The
next generation of search is definitely going to be in this
direction. Let's understand how this is achieved.
[0058] Note that, based on the techniques explained earlier, we
have certain attributes associated with each document after the
first two levels of processing: Topic (T), Important-Words (IW's),
Auto-Summary (SY) and Important-Sentences (IS's). To achieve
360-Degree Contextual-Mapping (360-DCM), the Topics (T's) & the
Data or Content (C) of the document itself play a vital role. There
is a certain way in which this is achieved; please refer to the
explanation below.
[0059] A given set of "n" documents has corresponding Topics (Tn)
and Content (Cn) associated with it. These documents are indexed,
or rather processed, in a very specific way. While running an index
on the documents, two separate values or fields are stored in the
index: the Topic of the document and the Document Content. During
the indexing process, the Topic identified for each document is
stored in a storage medium in a predetermined order; the storage
medium includes, but is not limited to, an array, a database or any
other suitable medium. Let's call this the "Topical-List" or "TL".
Since the TL may contain duplicates, filtering these duplicates
results in a list of non-redundant Topics. For each Topic in the
TL, the search index is queried one Topic at a time, and a
predefined set of results is extracted. The predefined threshold on
the number of results depends upon the size of the data or index;
typically, the first 50 to 150 results could be extracted. For each
result, the corresponding Topic and Content are extracted and
stored in a predetermined order. Let's call this the "Result-List"
or "RL".
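The TL and RL construction can be sketched against a toy in-memory index of (Topic, Content) records. A real system would query a search index such as a Master-Index; here a list comprehension stands in for the index hit, and the function name and `top_k` parameter are assumptions for the sketch.

```python
def build_topic_and_result_lists(index, top_k=50):
    """Build the Topical-List (TL) from an index of (topic, content)
    records, then, for each topic, pull up to top_k matching results
    into its Result-List (RL)."""
    # TL: topics in indexing order, duplicates filtered out
    tl, seen = [], set()
    for topic, _ in index:
        if topic not in seen:
            seen.add(topic)
            tl.append(topic)
    # RL per topic: every indexed document whose content mentions it
    rl = {}
    for topic in tl:
        hits = [(t, c) for t, c in index if topic.lower() in c.lower()]
        rl[topic] = hits[:top_k]
    return tl, rl
```

The `top_k` cut-off plays the role of the predefined threshold (the "first 50 to 150 results") described above.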
[0060] Let's assume that the Topic from the TL that hit the Index is
"Obama"; we get the corresponding RL. Now, two things are analyzed.
Firstly, if the RL contains Topics that match each other (for
example, if there are at least two occurrences of, say, the Topic
"Hillary"), those Topics are extracted as being related to the Topic
"Obama". Secondly, the corresponding Document Content of those
Topics is analyzed to find the sentences that contain the word
"Obama" (the same sentence-splitting technique used in the
auto-summary creation process is used here to locate this important
word). Refer to FIG. 4. Preferably, those sentences that contain
both the word "Obama" and the Topic of the document are selected.
For example, if the Topic of the document from the RL being
processed is, say, "Hillary", then the sentence extracted would be,
say, "Hillary Diane Rodham Clinton is the 67th United States
Secretary of State, serving in the administration of President
Barack Obama". This kind of structure supports the "Why this
Relationship?" or "How is this Topic Related?" functionality.
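The two analyses above can be sketched in Python as follows. This is an illustrative simplification, not the disclosed implementation: the sentence split is a plain period-based split, whereas the disclosure reuses its own auto-summary splitting technique, and the function names are hypothetical.

```python
from collections import Counter
import re

def related_topics(rl, min_occurrences=2):
    """First analysis: Topics occurring at least twice among the RL
    results are extracted as related to the query Topic."""
    counts = Counter(topic for topic, _ in rl)
    return [t for t, n in counts.items() if n >= min_occurrences]

def justification_sentences(rl, query_topic):
    """Second analysis: prefer sentences that contain both the query
    Topic (e.g. "Obama") and the result document's own Topic."""
    out = {}
    for topic, content in rl:
        for sent in re.split(r"(?<=[.!?])\s+", content):
            if (query_topic.lower() in sent.lower()
                    and topic.lower() in sent.lower()):
                out.setdefault(topic, []).append(sent.strip())
    return out
```

The sentences collected this way are what back the "Why this Relationship?" explanation shown to the user.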
[0061] In one embodiment, the aforesaid process is repeated until
all the Topics in the TL are exhausted. The resultant "Related
Topics", along with their respective sentences that justify the
relationships, are stored in a cluster. For example, the cluster
includes, but is not limited to, an array, a database or any other
suitable medium. This is represented as the "360-Degree
Contextual-Map" (360-DCM) of a given Topic, which describes "a part
of something that is used to refer to the whole thing", which is
nothing but a Synecdoche Representation.
[0062] This functionality is extremely helpful during search: the
user gets the required information along with contextually related
Topics and an explanation of "Why or How are these Related?". As an
example, if the user query is "Heart Attack", then apart from the
regular results the query hits the 360-DCM, and if there is a
cluster for the Topic "Heart Attack", the Related Topics it might
return could be Thrombolytic Therapy, Coronary Artery Spasm,
Atherosclerosis and Unstable Angina. Apart from this, if the user
clicks on the link "Why or How are these Related?", then the
following information would be displayed:
[0063] Thrombolytic Therapy
Those who die from heart attacks generally die within 1 hour from
the initial onset of symptoms and sometimes before they get to the
hospital. For a person having an acute heart attack, tPA works by
dissolving a major clot quickly. The clot is most likely blocking
one of the coronary arteries that normally allows blood and oxygen
to get to the heart muscle.
[0064] health.allrefer.com/health/thrombolytic-therapy . . .
[0065] Coronary Artery Spasm
Coronary artery spasm is a temporary, sudden narrowing of one of
the coronary arteries (the arteries that supply blood to the
heart). In many people, coronary artery spasm may occur without any
other heart risk factors (such as smoking, diabetes, high blood
pressure, and high cholesterol). If the spasm lasts long enough, it
may even cause a heart attack. Treatment: The goal of treatment is
to control chest pain and prevent a heart attack.
[0066] www.nlm.nih.gov/medlineplus/ency/ . . .
[0067] Atherosclerosis
If the coronary arteries become narrow, blood flow to the heart can
slow down or stop. This can cause chest pain (stable angina),
shortness of breath, heart attack, and other symptoms. This is a
common cause of heart attack and stroke. If the clot moves into an
artery in the heart, lungs, or brain, it can cause a stroke, heart
attack, or pulmonary embolism.
[0068] www.nlm.nih.gov/medlineplus/ency/ . . .
[0069] Unstable Angina
Unstable angina is a condition in which your heart doesn't get
enough blood flow and oxygen. It is a prelude to a heart attack.
This causes arteries to become less flexible and narrow, which
interrupts blood flow to the heart, causing chest pain. The chest
pain: occurs without cause (for example, it wakes you up from
sleep); lasts longer than 15-20 minutes; responds poorly to a
medicine called nitroglycerin; and may occur along with a drop in
blood pressure or significant shortness of breath. People with
unstable angina are at increased risk of having a heart attack.
[0070] www.nlm.nih.gov/medlineplus/ency/ . . .
[0071] An exemplary snapshot of the web page displaying an exploded
view of the contextually related topic link, which gives information
on why and how the selected topics are contextually related, is
illustrated in FIG. 8.
Summing-Up all the Features to Display 360-Degree Contextual
Results
[0072] The generation of Auto-Summary (Precis), Text-Analytics,
Knowledge Extraction and 360-Degree Contextual Mapping (Synecdoche)
techniques are explained in detail above. Using all the above steps,
the process analyzes and indexes the data in such a way that it
facilitates the retrieval of search results that portray "360-Degree
Contextual Results" for the search query, as illustrated in FIG. 5.
[0073] As seen from FIG. 5, all the processes are collated together
to bring in the 360-DCR; this is achieved in the following way:
[0074] For the given data, the corresponding Topic (T),
Important-Words (IW's), Important-Sentences (IS's) and Auto-Summary
(SY) are generated for each document as explained earlier. The
Auto-Summary is stored as a field value during indexing, along with
the corresponding Topic and Content of the document; we call this
the Master-Index, and this Index is used for displaying search
results. Since the auto-summary is a field value, every result will
have a summary of the entire document, which helps the user get a
quick overview of each result without actually having to visit the
content page. While processing the Master-Index, the Topical-List is
extracted. The TL is later run against the Master-Index to extract
the 360-DCM clusters, as explained earlier. Knowledge is extracted
into KE clusters, as explained earlier.
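The Master-Index construction can be sketched as follows. This is a hedged illustration only: the index is modelled as a list of records, and `generate_summary` is a trivial placeholder for the auto-summary process described earlier; both names are assumptions, not from the disclosure.

```python
import re

def generate_summary(content, max_sentences=2):
    """Placeholder: first sentences stand in for the real auto-summary."""
    sents = re.split(r"(?<=[.!?])\s+", content)
    return " ".join(sents[:max_sentences])

def build_master_index(docs):
    """docs: iterable of (topic, content) pairs. Each record stores the
    Topic, Content and Auto-Summary as fields, so every search result
    carries a document summary without a visit to the content page."""
    return [{"topic": t, "content": c, "summary": generate_summary(c)}
            for t, c in docs]
```

A production system would store these fields in a real search index rather than a Python list; the point is only that the summary travels with each record as a stored field.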
[0075] For a given user query, the process analyzes it to see if
such a Topic exists in the TL; if so, the corresponding cluster from
the 360-DCM clusters returns related Topics along with the
relationship map. The Master-Index returns search results along with
an auto-summary for each result. The KE cluster is analyzed to see
if such an Important-Word (IW) exists; if so, the relevant Knowledge
gathered about the search query is highlighted. Therefore, in a
nutshell, the solution is three-fold: the end-users get Information
along with a sensible summary of each document, they get Knowledge
pertaining to their query, and, last but not least, they also get
contextually related Topics listed with the relationship map, which
in itself is a separate result set that is nothing but an advanced
level of bringing in Query-Expansion-based results as part of the
contextual results.
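The three-fold query-time combination can be sketched as follows. The data structures (a list of Master-Index records, plus dicts keyed by Topic for the 360-DCM and KE clusters) are assumptions chosen for illustration, not the disclosed storage format.

```python
def contextual_results(query, master_index, dcm_clusters, ke_clusters):
    """Collate the three result sets for one query into a 360-DCR."""
    # 1. Search results, each carrying its stored auto-summary field.
    results = [r for r in master_index
               if query.lower() in r["content"].lower()]
    # 2. Related Topics with their relationship map, if the Topic clustered.
    related = dcm_clusters.get(query, {})
    # 3. Knowledge gathered for this query, if it matches an Important-Word.
    knowledge = ke_clusters.get(query, [])
    return {"results": results,
            "related_topics": related,
            "knowledge": knowledge}
```

The returned structure mirrors the three-fold solution described above: information with summaries, knowledge, and contextually related Topics with their relationship map.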
[0076] Hence, for the given query the system brings in information
about it along with relevant Knowledge and contextually related
topics and their relationship map, which gives the user more than
mere results. An exemplary snapshot of the web page displaying
360-Degree Contextual Results (360-DCR) for the selected Topic, in
accordance with an exemplary embodiment, is illustrated in FIG. 9.
[0077] The present disclosure is not to be limited in terms of the
particular embodiments described in this application, which are
intended as illustrations of various aspects. Many modifications
and variations can be made without departing from its spirit and
scope, as will be apparent to those skilled in the art.
Functionally equivalent methods and devices within the scope of the
disclosure, in addition to those enumerated herein, will be
apparent to those skilled in the art from the foregoing
descriptions. Such modifications and variations are intended to
fall within the scope of the appended claims. The present
disclosure is to be limited only by the terms of the appended
claims, along with the full scope of equivalents to which such
claims are entitled. It is also to be understood that the
terminology used herein is for the purpose of describing particular
embodiments only, and is not intended to be limiting.
[0078] With respect to the use of substantially any plural and/or
singular terms herein, those having skill in the art can translate
from the plural to the singular and/or from the singular to the
plural as is appropriate to the context and/or application. The
various singular/plural permutations may be expressly set forth
herein for sake of clarity.
[0079] In addition, where features or aspects of the disclosure are
described in terms of Markush groups, those skilled in the art will
recognize that the disclosure is also thereby described in terms of
any individual member or subgroup of members of the Markush
group.
[0080] While various aspects and embodiments have been disclosed
herein, other aspects and embodiments will be apparent to those
skilled in the art. The various aspects and embodiments disclosed
herein are for purposes of illustration and are not intended to be
limiting, with the true scope and spirit being indicated by the
following claims.
* * * * *