U.S. patent application number 13/746324 was filed with the patent office on 2013-01-22 and published on 2013-07-25 for advanced summarization on a plurality of sentiments based on intents.
This patent application is currently assigned to Formcept Technologies and Solutions Pvt Ltd. The applicant listed for this patent is Formcept Technologies and Solutions Pvt Ltd. Invention is credited to ANUJ KUMAR, SURESH SRINIVASAN.
Application Number | 20130191735 13/746324 |
Document ID | / |
Family ID | 48798096 |
Publication Date | 2013-07-25 |
United States Patent Application | 20130191735 |
Kind Code | A1 |
KUMAR; ANUJ; et al. | July 25, 2013 |
ADVANCED SUMMARIZATION ON A PLURALITY OF SENTIMENTS BASED ON INTENTS
Abstract
A method of summarizing content around a sentiment using
weighted Formal Concept Analysis (wFCA) is provided. The method
includes identifying one or more sentences associated with the
content based on parts of speech, identifying at least one
sentiment associated with the one or more sentences based on the
parts of speech, identifying one or more keywords in the one or
more sentences, disambiguating at least one ambiguous keyword from
the one or more keywords using the wFCA, computing a weight for
each sentence of the one or more sentences based on a number of
keywords of the one or more keywords associated with each sentence,
processing an input including an indication of the sentiment, and
generating a summary on the content around the sentiment based on
(a) the weight, and (b) at least one of (i) the at least one
sentiment, and (ii) the indication.
Inventors: | KUMAR; ANUJ (Bangalore, IN); SRINIVASAN; SURESH (Bangalore, IN) |
Applicant: |
Name | City | State | Country | Type |
Formcept Technologies and Solutions Pvt Ltd | Bangalore | | IN | |
Assignee: | Formcept Technologies and Solutions Pvt Ltd (Bangalore, IN) |
Family ID: | 48798096 |
Appl. No.: | 13/746324 |
Filed: | January 22, 2013 |
Current U.S. Class: | 715/254 |
Current CPC Class: | G06F 16/358 20190101; G06F 40/166 20200101; G06F 16/345 20190101 |
Class at Publication: | 715/254 |
International Class: | G06F 17/24 20060101 G06F017/24 |
Foreign Application Data
Date | Code | Application Number |
Jan 23, 2012 | IN | 263/CHE/2012 |
Claims
1. A method of summarizing content around a sentiment using
weighted Formal Concept Analysis (wFCA), said method comprising:
(i) identifying, by a processor, a plurality of sentences
associated with said content based on parts of speech; (ii)
identifying, by said processor, at least one sentiment associated
with said plurality of sentences based on said parts of speech;
(iii) identifying, by said processor, a plurality of keywords in
said plurality of sentences; (iv) disambiguating, by said
processor, at least one ambiguous keyword from said plurality of
keywords using said wFCA; (v) computing, by said processor, a
weight for each sentence of said plurality of sentences based on a
number of keywords of said plurality of keywords associated with
said each sentence; (vi) processing, by said processor, an input
comprising an indication of said sentiment; and (vii) generating,
by said processor, a summary of said content based on (a) said
weight, and b) at least one of i) said at least one sentiment, and
ii) said indication.
2. The method of claim 1, wherein said sentiment is a positive
sentiment, wherein at least one sentence of said plurality of
sentences comprises at least one positive statement.
3. The method of claim 1, wherein said sentiment is a negative
sentiment, wherein at least one sentence of said plurality of
sentences comprises at least one negative statement.
4. The method of claim 1, wherein said plurality of sentences
comprise (a) a first set of sentences, wherein each sentence of
said first set of sentences comprises said at least one
sentiment.
5. The method of claim 4, wherein said plurality of sentences
further comprise (b) a second set of sentences that are not
associated with said at least one sentiment.
6. The method of claim 1, wherein said weight is computed based on
a number of associations of each keyword within said plurality of
sentences.
7. The method of claim 6, further comprising expanding said summary
based on (a) said weight assigned for each sentence of said
plurality of sentences, and (b) at least one of i) said at least
one sentiment, and ii) said indication.
8. A non-transitory program storage device readable by computer,
and comprising a program of instructions executable by said
computer to perform a method for summarizing content around a
sentiment based on weighted Formal Concept Analysis (wFCA), said
method comprising: (i) identifying, by a processor, a plurality of
sentences associated with said content based on parts of speech;
(ii) identifying, by said processor, at least one sentiment
associated with said plurality of sentences based on said parts of
speech; (iii) identifying, by said processor, a plurality of
keywords in said plurality of sentences; (iv) disambiguating, by
said processor, at least one ambiguous keyword from said plurality
of keywords using said wFCA; (v) generating, by said processor, a
graph to compute a weight for each sentence of said plurality of
sentences based on a number of keywords of said plurality of
keywords associated with each sentence; (vi) processing, by said
processor, an input comprising an indication of said sentiment; and
(vii) generating, by said processor, a summary of said content based
on (a) said weight, and b) at least one of i) said at least one
sentiment, and ii) said indication.
9. The non-transitory program storage device of claim 8, wherein
said weight is computed based on a number of associations of each
keyword within said plurality of sentences.
10. The non-transitory program storage device of claim 9, wherein
said method further comprises expanding said summary based on (a)
said weight, and b) at least one of i) said at least one sentiment,
and ii) said indication.
11. A system for summarizing content around a sentiment based on
weighted Formal Concept Analysis (wFCA) using a content
summarization engine, said system comprising: (a) a memory unit
that stores (i) a set of modules, and (ii) a database; (b) a
display unit; (c) a processor that executes said set of modules,
wherein said set of modules comprise: (i) a sentence identifying
module executed by said processor that identifies (a) a first set
of sentences, and (b) a second set of sentences in said content
based on parts of speech, wherein each sentence of said first set
of sentences comprises at least one keyword that indicates at least
one sentiment, wherein said second set of sentences are not
associated with sentiments; (ii) a sentiment identifying module
executed by said processor that identifies said at least one
sentiment associated with said each sentence of said first set of
sentences based on said parts of speech; (iii) a keyword
identifying module executed by said processor that identifies a
plurality of keywords from said first set of sentences, and said
second set of sentences; (iv) a disambiguating module executed by
said processor that disambiguates at least one ambiguous keyword
from said plurality of keywords using said wFCA, wherein said wFCA
comprises generation of a lattice with said plurality of keywords
as objects, and categories associated with said plurality of
keywords as attributes, wherein said categories are obtained from a
knowledge base; (v) a weight computing module executed by said
processor that computes a weight for each sentence of (a) said
first set of sentences, and (b) said second set of sentences based
on a number of keywords associated with said each sentence; (vi)
a sentiment indicating module executed by said processor that
processes an input comprising an indication of said sentiment in
said content; and (vii) an intent building module executed by said
processor that generates a summary of said content based on (a)
said weight, and b) at least one of i) said at least one sentiment,
and ii) said indication.
12. The system of claim 11, wherein said summary comprises at least
one sentence that are only obtained from said first set of
sentences.
13. The system of claim 11, wherein said summary comprises at least
one sentence from said second set of sentences.
14. The system of claim 11, wherein said weight is computed based
on a number of associations of each keyword within (a) said first
set of sentences, and (b) said second set of sentences.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to Indian patent
application no. 263/CHE/2012 filed on Jan. 23, 2012, the complete
disclosure of which, in its entirety, is herein incorporated by
reference.
BACKGROUND
[0002] 1. Technical Field
[0003] The embodiments herein generally relate to content
summarization, and more particularly to a system and method for
summarizing content around a sentiment using weighted formal
concept analysis (wFCA) based on user intent.
[0004] 2. Description of the Related Art
[0005] Documents obtained via an electronic medium (e.g., the
Internet, on-line services, or any other services) are often
provided in such volume that it is important to summarize them. It
is desirable to be able to quickly obtain a brief summary of a
document rather than reading it in its entirety. Typically, such a
document may span multiple paragraphs to several pages in length.
[0006] Summarization or abstraction is even more essential in the
framework of emerging "push" technologies, where a user has hardly
any control over what documents arrive at the desktop for his/her
attention. Summarization is always a key feature in content
extraction and there is currently no solution available that
provides a summary that is comparable to that of a human.
Conventionally, summarization of content is manually performed by
users, which is time consuming and also expensive. Further, it is
slow and also not scalable for a large number of documents.
[0007] Summarization involves representing the whole content in a
limited set of words without losing the main crux of the content.
Traditional summarization of content (in general, a document) is
based on lexical chaining, in which the longest chain is assumed to
best represent the content, and the first sentence of a summary is
taken from the first sentence of the longest chain. The
second-longest chain is assumed to be the next best, and the second
sentence of the summary is then taken from the first sentence of
the second-longest chain. However, this lexical chaining approach
not only tends to miss important content related to the intent of
the user but also fails to elaborate it in a manner in which it can
be easily understood. Accordingly, there remains a need for a
system to automatically analyze one or more documents and generate
an accurate summary based on user intent.
SUMMARY
[0008] In view of the foregoing, an embodiment herein provides a
method of summarizing content around a sentiment using weighted
Formal Concept Analysis (wFCA). The method includes identifying one
or more sentences associated with the content based on parts of
speech, identifying one or more sentiments from the one or more
sentences based on parts of speech, identifying one or more
keywords in the one or more sentences, disambiguating at least one
ambiguous keyword from the one or more keywords using the wFCA,
computing a weight for each sentence of the one or more sentences
based on a number of keywords of the one or more keywords
associated with each sentence, processing an input including an
indication of the sentiment in the content, and generating a
summary on the content around the sentiment based on (i) the
weight, (ii) the one or more sentiments, and (iii) the
indication.
[0009] The sentiment may be a positive sentiment. Each sentence of
the one or more sentences includes at least one positive statement.
The sentiment may be a negative sentiment. Each sentence of the one
or more sentences includes at least one negative statement. The one
or more sentences may include (a) a first set of sentences. Each
sentence of the first set of sentences includes at least one
sentiment. The one or more sentences may further include (b) a
second set of sentences that are not associated with the at least
one sentiment. The weight may be computed based on a number of
associations of each keyword within the one or more sentences. The
summary may be expanded based on (a) the weight assigned for each
sentence of the one or more sentences, and (b) at least one of i)
the one or more sentiments, and ii) the indication.
[0010] In another aspect, a non-transitory program storage device
readable by computer, and including a program of instructions
executable by the computer to perform a method for summarizing
content around a sentiment based on weighted Formal Concept
Analysis (wFCA) is provided. The method includes identifying one or
more sentences associated with the content based on parts of
speech, identifying one or more sentiments associated with the one
or more sentences based on the parts of speech, identifying one or
more keywords in the one or more sentences, disambiguating at least
one ambiguous keyword from the one or more keywords using the wFCA,
generating a graph to compute a weight for each sentence of the one
or more sentences based on a number of keywords of the one or more
keywords associated with each sentence, processing an input
including an indication of the sentiment in the content, and
generating a summary on the content based on (a) the weight, and
(b) at least one of (i) one or more sentiments, and (ii) the
indication.
The weight may be computed based on a number of associations of
each keyword within the one or more sentences. The summary may be
expanded based on a) the weight, and b) at least one of i) one or
more sentiments, and ii) the indication.
[0011] In yet another aspect, a system for summarizing content
around a sentiment based on weighted Formal Concept Analysis (wFCA)
using a content summarization engine is provided. The system
includes (a) a memory unit that stores (i) a set of modules, and
(ii) a database, (b) a display unit, and (c) a processor that
executes said set of modules. The set of modules include (i) a
sentence identifying module executed by the processor that
identifies (a) a first set of sentences, and (b) a second set of
sentences in the content based on parts of speech. Each sentence of
the first set of sentences includes at least one keyword that
indicates at least one sentiment. The second set of sentences is
not associated with sentiments.
[0012] The set of modules further include (ii) a sentiment
identifying module executed by the processor that identifies one or
more sentiments associated with each sentence of the first set of
sentences based on the parts of speech, (iii) a keyword identifying
module executed by the processor that identifies one or more
keywords from the first set of sentences, and the second set of
sentences, (iv) a disambiguating module executed by the processor
that disambiguates at least one ambiguous keyword from the one or
more keywords using the wFCA by generating a lattice with the one
or more keywords as objects, and categories associated with the one
or more keywords as attributes. The categories are obtained from a
knowledge base.
[0013] The set of modules further includes (v) a weight computing
module executed by the processor that computes a weight for each
sentence of (a) the first set of sentences, and (b) the second set
of sentences based on a number of keywords associated with each
sentence, (vi) a sentiment indicating module executed by the
processor that processes an input including an indication of the
sentiment in the content, and (vii) an intent building module
executed by the processor that generates a summary on the content
based on (a) the weight, and (b) at least one of (i) the one or
more sentiments, and (ii) the indication.
[0014] The summary may include at least one sentence that is only
obtained from the first set of sentences. The summary may include
at least one sentence from the second set of sentences. The weight
may be computed based on a number of associations of each keyword
within (a) the first set of sentences, and (b) the second set of
sentences.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] The embodiments herein will be better understood from the
following detailed description with reference to the drawings, in
which:
[0016] FIG. 1 illustrates a system view of a user communicating
with a user system to generate a summary based on one or more
sentiments using a content summarization engine according to an
embodiment herein;
[0017] FIG. 2 illustrates an exploded view of the user system with
a memory storage unit for storing the content summarization engine
of FIG. 1, and an external database according to an embodiment
herein;
[0018] FIG. 3 illustrates an exploded view of the content summarization
engine of FIG. 1 according to an embodiment herein;
[0019] FIG. 4 illustrates a user interface view of the content
collection module of FIG. 3 of the content summarization engine of
FIG. 1 according to an embodiment herein;
[0020] FIG. 5 illustrates a user interface view of an input content
to be summarized using the content summarization engine of FIG. 1
according to an embodiment herein;
[0021] FIG. 6 illustrates an exploded view of the content
annotation module of FIG. 3 of the content summarization engine of
FIG. 1 according to an embodiment herein;
[0022] FIG. 7 illustrates a user interface view of intent selection
by the user of FIG. 1 according to an embodiment herein;
[0023] FIG. 8 is a flow diagram illustrating a method of
summarizing content around a sentiment using weighted Formal
Concept Analysis (wFCA) according to an embodiment herein;
[0024] FIG. 9 illustrates a graphical representation of a lattice
that is generated to disambiguate a keyword "kingfisher" of the
input content using the lattice construction module of FIG. 3
according to an embodiment herein;
[0025] FIG. 10 is a graphical representation illustrating a graph
that indicates an association between one or more keywords and
sentences of the input content of FIG. 5 according to an embodiment
herein;
[0026] FIG. 11 is a table view illustrating a weight that is
computed for each keyword that is identified by the keyword
identifying module of FIG. 3 based on number of associations of
each keyword with sentences of the input content according to an
embodiment herein;
[0027] FIG. 12 is a table view illustrating a weight that is
computed for each sentence of the input content based on associated
keywords according to an embodiment herein; and
[0028] FIG. 13 illustrates a schematic diagram of a computer
architecture according to an embodiment herein.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0029] The embodiments herein and the various features and
advantageous details thereof are explained more fully with
reference to the non-limiting embodiments that are illustrated in
the accompanying drawings and detailed in the following
description. Descriptions of well-known components and processing
techniques are omitted so as to not unnecessarily obscure the
embodiments herein. The examples used herein are intended merely to
facilitate an understanding of ways in which the embodiments herein
may be practiced and to further enable those of skill in the art to
practice the embodiments herein. Accordingly, the examples should
not be construed as limiting the scope of the embodiments
herein.
[0030] As mentioned, there remains a need for a system to
automatically analyze one or more documents and generate an
accurate summary based on user intent. The user intent may be an
overall summarization, a keyword-based summarization, a page-wise
summarization, and/or a section-wise summarization. The content
summarization engine computes the crux of the content of a
document, and generates a main story surrounding the content. The
content summarization engine also provides an option for an end
user to expand a summary for better understanding. The content
summarization engine takes the end user directly to a main concept
from where he/she can expand, and read the content in a flow that
gives a better understanding without spending time reading the
entire content. Referring now to the drawings, and more
particularly to FIGS. 1 through 13, where similar reference
characters denote corresponding features consistently throughout
the figures, there are shown preferred embodiments.
[0031] FIG. 1 illustrates a system view 100 of a user 102
communicating with a user system 104 to generate a summary based on
one or more sentiments using a content summarization engine 106
according to an embodiment herein. In one embodiment, the user
system 104A-N may be a personal computer (PC) 104A, a tablet 104B
and/or a smart phone 104N. The content summarization engine 106
summarizes content.
[0032] FIG. 2 illustrates an exploded view of the user system
104A-N with a memory storage unit 202 for storing the content
summarization engine 106 of FIG. 1, and an external database 216
according to an embodiment herein. The user system 104A-N includes
the memory storage unit 202, a bus 204, a communication device 206,
a processor 208, a cursor control 210, a keyboard 212, and a
display 214. The memory storage unit 202 stores the content
summarization engine 106 that includes one or more modules to
perform various functions on an input content and generates a
summary intent surrounding the input content. The external database
216 includes a knowledge base 218 that is constructed based on
concepts of linked data. The knowledge base 218 further includes a
set of categories that correspond to various keywords.
[0033] FIG. 3 illustrates an exploded view of the content summarization
engine 106 of FIG. 1 according to an embodiment herein. The content
summarization engine 106 includes a database 302, a content
collection module 304, a content parsing/extraction module 306, a
content cleaning module 308, a content annotation module 310, an
annotation extractor module 312, a sentence identifying module 314A
that includes a sentiment identifying module 314B, a keyword
identifying module 316, a disambiguating module 318, a graph
generating module 320, a sentiment indicating module 321, an intent
building module 322, and an intent expanding module 324.
[0034] The content collection module 304 collects content from at
least one format of text (e.g., including multiple different
documents of different formats) provided by the user 102. Such
formats may include, for example, .doc, .pdf, .rtf, a URL, a blog,
a feed, etc. The content parsing/extraction module 306 fetches the
content from these one or more documents (e.g., abc.doc, xyz.pdf,
etc.), and provides the content required to generate a summary.
Further, the content parsing/extraction module 306 parses HTML
content in case one of the sources of input is a URL. The content
parsing/extraction module 306 also extracts one or more sentences
from the content.
[0035] The content cleaning module 308 cleans the content, which
may include removal of junk characters, new lines that are not
useful, application specific symbols (e.g., MS Word bullets),
non-Unicode characters, etc. In one embodiment, specific parts of a
document (e.g., a footer) are specified to be excluded. Such
exclusions can be specified based on a type of content that is
provided (e.g., news article, resume, etc.), and the content
cleaning module 308 is configurable accordingly.
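The cleaning step described above can be sketched as follows. This is a minimal illustration, not the implementation of the content cleaning module 308: the function name clean_content, the BULLETS set, the UTF-8 assumption for byte input, and the rule that a blank line marks a paragraph break are all assumptions introduced here.

```python
import re
import unicodedata

# Hypothetical set of application-specific bullet symbols; the real set
# would depend on the application that produced the document.
BULLETS = "\u2022\u25cf\u25aa\u00b7\uf0b7"


def clean_content(raw) -> str:
    """Remove junk characters, bullet symbols, and unhelpful line breaks."""
    if isinstance(raw, bytes):
        # Assumption: input is UTF-8; bytes that cannot be decoded are dropped.
        raw = raw.decode("utf-8", errors="ignore")

    kept = []
    for ch in raw:
        if ch in "\n\t":
            kept.append(ch)  # keep layout characters for the next steps
        elif ch in BULLETS:
            continue  # drop bullet symbols
        elif unicodedata.category(ch).startswith("C"):
            continue  # drop control, format, private-use, unassigned chars
        else:
            kept.append(ch)
    text = "".join(kept)

    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)

    # Assumption: a blank line separates paragraphs; single line breaks
    # inside a paragraph are treated as not useful and are joined.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [
        " ".join(line.strip() for line in p.split("\n") if line.strip())
        for p in paragraphs
    ]
    return "\n\n".join(p for p in cleaned if p)
```

Exclusion of whole parts of a document (e.g., a footer) is not covered by this sketch, since it depends on the type of content provided.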
[0036] The content annotation module 310 annotates the content for
useful information, such as sentences, keywords, tokens,
sentiments, durations, sections, durations within the sections,
quantities, sentences associated with the sections, and sentences
associated with durations and quantities. The annotation extractor
module 312 extracts annotated information from the content
annotation module 310. The sentence identifying module 314A
identifies the one or more sentences from the content. The
sentiment identifying module 314B identifies one or more sentiments
from the one or more sentences. The one or more sentences may
include a first set of sentences that are associated with a
sentiment, and a second set of sentences that may not be associated
with the sentiment (e.g., a set of neutral sentences). The first
set of sentences may be a set of sentences with positive
sentiments. The first set of sentences may be a set of sentences
with negative sentiments. The sentence identifying module 314A may
identify a third set of sentences that includes both positive and
negative sentiments. When a sentence is identified as having at
least one of (a) a positive sentiment, (b) a negative sentiment,
and (c) a neutral sentiment, the sentence identifying module 314A
may prompt the user 102 to identify, edit/modify, and validate the
sentence as (a) a positive sentiment sentence, (b) a negative
sentiment sentence, or (c) a neutral sentiment sentence. The
sentence identifying module 314A may then accept an input from the
user 102 that confirms a type of the sentence and one or more
sentiments associated with the sentence. The sentence identifying
module 314A also identifies one or more sentences associated with a
keyword when the user 102 intends to summarize the content around
the sentiment.
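The grouping of sentences into the first, second, and third sets can be illustrated with the sketch below. It is only a stand-in: the description identifies sentiments based on parts of speech, whereas this sketch uses two small fixed cue-word lists (POSITIVE and NEGATIVE), which are hypothetical, as are the function names classify and partition and the label "mixed" for the third set.

```python
import re

# Hypothetical cue words. A real system would derive sentiment-bearing
# words from part-of-speech tags rather than from fixed lists.
POSITIVE = {"save", "revive", "gain", "profit", "improve"}
NEGATIVE = {"loss", "loss-making", "cash-strapped", "debt", "fail"}


def classify(sentence: str) -> str:
    """Label one sentence as positive, negative, mixed, or neutral."""
    # Hyphenated words such as "loss-making" are kept as single tokens.
    tokens = set(re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower()))
    has_pos = bool(tokens & POSITIVE)
    has_neg = bool(tokens & NEGATIVE)
    if has_pos and has_neg:
        return "mixed"      # third set: both sentiments present
    if has_pos:
        return "positive"   # first set, positive case
    if has_neg:
        return "negative"   # first set, negative case
    return "neutral"        # second set


def partition(sentences):
    """Group sentence indices by label, preserving document order."""
    groups = {"positive": [], "negative": [], "mixed": [], "neutral": []}
    for index, sentence in enumerate(sentences):
        groups[classify(sentence)].append(index)
    return groups
```

In this sketch, the sentences labeled "mixed" would be natural candidates for the user validation step, in which the user 102 confirms the type of the sentence.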
[0037] The keyword identifying module 316 identifies one or more
keywords from the first set of sentences, and the second set of
sentences of the content based on parts of speech. The keyword
identifying module 316 may further identify one or more keywords
from the third set of sentences. The disambiguating module 318
disambiguates at least one ambiguous keyword from the one or more
keywords in a context of its meaning using weighted formal concept
analysis (wFCA). The wFCA includes generation of a lattice with the
one or more keywords as objects, and one or more categories
associated with the one or more keywords as attributes. The one or
more categories are obtained from the knowledge base 218.
[0038] The disambiguating module 318 further includes a lattice
construction module 326 that generates a lattice to disambiguate
the at least one ambiguous keyword. The lattice includes one or
more concepts that are generated based on the one or more keywords,
and the one or more categories associated with the one or more
keywords. In one embodiment, the disambiguating module 318 further
includes a score computing module 328 that computes a score for
each concept of the one or more concepts. The score is used to
disambiguate the at least one ambiguous keyword.
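The lattice construction can be sketched with plain (unweighted) formal concept analysis, taking keywords as objects and categories as attributes. The sketch below is an illustration under stated assumptions: the keyword-to-category mapping is invented for the example rather than taken from the knowledge base 218, and the score used to pick a concept (the number of keywords in the concept's extent) is a simple stand-in, since the scoring performed by the score computing module 328 is not detailed in this passage.

```python
def concepts(context):
    """Return all formal concepts (extent, intent) of a keyword -> categories map."""
    all_attrs = frozenset().union(*context.values())
    # Every concept intent is an intersection of object attribute sets;
    # the full attribute set is the intersection of no objects at all.
    intents = {all_attrs}
    for attrs in context.values():
        intents |= {intent & attrs for intent in intents}
    result = []
    for intent in intents:
        # The extent is every keyword that carries all categories of the intent.
        extent = frozenset(k for k, attrs in context.items() if intent <= attrs)
        result.append((extent, intent))
    return result


def disambiguate(keyword, context):
    """Pick the categories of the largest concept that contains `keyword`.

    Assumption: a sense shared with more co-occurring keywords is more likely.
    """
    candidates = [
        (extent, intent)
        for extent, intent in concepts(context)
        if keyword in extent and intent
    ]
    if not candidates:
        return frozenset()
    extent, intent = max(candidates, key=lambda c: len(c[0]))
    return intent


# Hypothetical context: categories are illustrative only.
context = {
    "kingfisher": frozenset({"airline", "bird", "beer"}),
    "carrier": frozenset({"airline", "company"}),
    "banks": frozenset({"finance", "company"}),
    "investor": frozenset({"finance"}),
}
sense = disambiguate("kingfisher", context)  # frozenset({"airline"})
```

With this hypothetical context, "kingfisher" shares only the category "airline" with another keyword ("carrier"), so the concept with extent {kingfisher, carrier} and intent {airline} is selected. Ties between equally large concepts are not resolved deterministically in this sketch.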
[0039] The graph generating module 320 constructs/generates a graph
to obtain an association between (i) sentences, (ii) keywords and
sentences, and (iii) sentences and durations. In one embodiment,
the graph includes one or more nodes, and each node corresponds to
a sentence of the content. The graph generating module 320 includes
a weight assigning module 330 that assigns a weight to each
sentence of the content. The weight assigning module 330 computes a
weight for each sentence of (a) the first set of sentences, and (b)
the second set of sentences based on a number of keywords
associated with each sentence.
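One way to read the weighting described above is as a bipartite graph between sentences and keywords. The sketch below assumes that a keyword's weight is the number of sentences it is associated with, and that a sentence's weight is the sum of the weights of its keywords. Both formulas, and the function names, are assumptions made for illustration; the exact computation used by the weight assigning module 330 is not stated in this passage.

```python
from collections import defaultdict


def keyword_weights(sentence_keywords):
    """Weight of a keyword = number of sentences it is associated with."""
    counts = defaultdict(int)
    for keywords in sentence_keywords.values():
        for keyword in set(keywords):  # count each sentence at most once
            counts[keyword] += 1
    return dict(counts)


def sentence_weights(sentence_keywords):
    """Weight of a sentence = sum of the weights of its distinct keywords."""
    weights = keyword_weights(sentence_keywords)
    return {
        sentence_id: sum(weights[k] for k in set(keywords))
        for sentence_id, keywords in sentence_keywords.items()
    }


# Hypothetical keyword assignments for three sentences.
graph = {
    "S1": ["kingfisher", "mallaya", "deal", "banks"],
    "S2": ["deal", "banks"],
    "S3": ["banks"],
}
print(sentence_weights(graph))  # {'S1': 7, 'S2': 5, 'S3': 3}
```

Under these assumptions, a sentence scores higher both when it contains more keywords and when those keywords recur across the content.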
[0040] The graph generating module 320 generates a graph based on
(a) the first set of sentences, (b) the second set of sentences,
and (c) the plurality of keywords. Each node indicates a sentence
of (a) the first set of sentences, or (b) the second set of
sentences. The sentiment indicating module 321 processes an input
that includes an indication of the sentiment in the content. The
indication is provided by the user 102, in one example embodiment.
The user 102 may indicate a type of sentiment for which
summarization of the content occurs. The intent building module 322
generates a summary by tailoring one or more sentences in the same
exact order as they appear in the content. In one embodiment, the
intent building module 322 tailors the one or more sentences based
on a) the weight assigned for each sentence of the content, b) at
least one sentiment that is identified by the sentiment identifying
module 314B, and c) the indication provided by the user 102. The
intent expanding module 324 expands the summary while preserving
intent and elaborating further. The summary may include at least
one of (i) at least one sentence obtained only from the first set
of sentences, (ii) at least one sentence from the second set of
sentences, and (iii) at least one sentence from the third set of
sentences.
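The selection performed by the intent building module 322 can be sketched as follows, assuming that per-sentence weights and sentiment labels are already available. The function name build_summary, the size parameter, and the modeling of expansion as a call with a larger size are assumptions made for illustration, not details taken from the description.

```python
def build_summary(sentences, weights, sentiments, wanted=None, size=2):
    """Pick the `size` heaviest sentences matching `wanted`, in document order.

    `wanted` is the sentiment indicated by the user; None means no filter.
    """
    eligible = [
        i for i in range(len(sentences))
        if wanted is None or sentiments[i] == wanted
    ]
    # Heaviest first; ties go to the sentence that appears earlier.
    top = sorted(eligible, key=lambda i: (-weights[i], i))[:size]
    # Restore the order in which the sentences appear in the content.
    return [sentences[i] for i in sorted(top)]


sentences = ["A", "B", "C", "D"]
weights = [3, 7, 5, 7]
sentiments = ["negative", "positive", "negative", "negative"]

summary = build_summary(sentences, weights, sentiments, wanted="negative")
# -> ["C", "D"]
expanded = build_summary(sentences, weights, sentiments, wanted="negative", size=3)
# -> ["A", "C", "D"]
```

Because the expanded result is a superset of the shorter one and keeps document order, the expanded summary elaborates on the original summary without reordering it.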
[0041] FIG. 4 illustrates a user interface view of the content
collection module 304 of FIG. 3 of the content summarization engine
106 of FIG. 1 according to an embodiment herein. The user interface
view of the content collection module 304 includes a header 402, a
text field 404, an upload button 406, a URL text field 408, a
fetch button 410, a drag and drop field 412, an upload file button
414, a task status table 416, a task progress field 418, and a
proceed button 420. The header 402 displays a logo, a welcome
message, and a status of an application. Through the text field
404, the user 102 can provide plain text to be summarized, and
submit the plain text to a server using the upload button 406.
The text may also be provided as a Uniform (or universal)
resource locator (URL) in the URL text field 408, and text
associated with the URL is crawled using the fetch button 410. The
drag and drop field 412 is used to drag and drop any files with
text to be uploaded. Through the upload file button 414, the user
102 can browse for a file to be uploaded. The task status table 416
displays uploaded text, the URL and/or the file. The task progress
field 418 notifies the user 102 about progress of a summarization
process for each uploaded content, and the proceed button 420
directs the user 102 to a next page.
[0042] FIG. 5 illustrates a user interface view 500 of an input
content 502 to be summarized using the content summarization engine
106 of FIG. 1 according to an embodiment herein. The input content
502 may be obtained in the form of a document, a URL, and/or
plain-text as specified by the user 102 using the content
collection module 304. The content collection module 304 collects
the input content 502 and stores it on the server.
[0043] In one embodiment, the input content 502 is collected from
one or more documents (e.g., abc.doc, and/or xyz.pdf), and is
parsed/extracted (e.g., using the content parsing/extraction module
306 of FIG. 3). In another embodiment, the input content 502 may be
fed as a URL (e.g., www.xyzairlines.com/content.html). The content
collection module 304 fetches content associated with the URL, and
parses/extracts it using the content parsing/extraction module 306.
The content summarization engine 106 obtains the input content 502
to generate a summary on the input content 502.
[0044] The content cleaning module 308 cleans the input content 502
before performing annotation. Cleaning the content includes
removing (i) junk characters, (ii) new lines that are not useful,
(iii) application specific symbols (word processing bullets, etc.),
and/or (iv) non-Unicode characters, etc. In one example embodiment,
the input content 502 is already in the form of cleaned text and
requires no such removal.
[0045] FIG. 6 illustrates an exploded view of the content
annotation module 310 of FIG. 3 of the content summarization engine
106 of FIG. 1 according to an embodiment herein. The content
annotation module 310 annotates the input content 502 for useful
information (e.g., one or more keywords, one or more sentences,
etc). The content annotation module 310 includes a sentence
annotations module 602, a token annotations module 604, a stem
annotations module 606, a forced new lines, paragraphs and
indentations computing module 608, a parts of speech tag (POS)
token annotations module 610, a POS line annotation module 612, a
duration and quantities determining module 614, a section
annotations module 616, a section span annotations module 618, and
a section duration annotation module 620. The dotted lines (arrows
having a dotted line property) of FIG. 6 represent internal
dependencies among various modules, whereas the solid lines
(arrows having a solid line property) represent the flow of the
annotation process.
[0046] After parsing and cleaning the input content 502, the cleaned
content is annotated by performing various levels of annotations
using sub-modules of the content annotation module 310. The
sentence annotations module 602 extracts each and every sentence in
the input content 502. For example, a first sentence of the input
content 502 may be extracted by the sentence annotations module
602. The first sentence includes
[0047] S1: REUTERS: the chairman of kingfisher Airlines, Vijay
Mallaya, said in an interview with the financial times that he was
close to sealing a $370 million deal with an Indian private
investor and a consortium of banks that would save the airlines.
[0048] Similarly, the sentence annotations module 602 extracts all
the sentences of the input content 502.
[0049] S2: The Bangalore-based entrepreneur told the FT he was
nearing a deal with 14 banks led by State Bank of India that would
provide the loss-making carrier with working capital of 6 billion
rupees.
[0050] S3: He did not name the banks.
[0051] S4: Earlier this week, kingfisher said its net loss for the
September quarter doubled but Mallaya offered little to revive its
finances.
[0052] S5: It had also said it has been approached by strategic
investors.
[0053] S6: Mallaya, a flamboyant liquor baron who owns a Formula
One Motor-racing team, told the paper he was finalizing a separate
$250 million equity injection from an unnamed wealthy Indian
individual to recapitalize the cash-strapped carrier.
[0054] S7: He added that he was about to conclude a deal with the
banks to reduce the interest rate which the airlines is currently
paying on its $1.4 billion debt pile.
[0055] S8: Mallaya said on the social networking site Twitter®
that the report was "factually wrong" but he did not elaborate.
Reuters could not immediately reach company officials for a
comment.
[0056] S9: Shares kingfisher which is named after its best selling
beer, were down more than 5 percentage in early trade on Friday in
Mumbai.
[0057] S10: Kingfisher, which listed when it bought out budget
airline, Air Deccan in 2008, has never made a profit and its market
value has plunged 64 percentage this year.
[0058] S11: The airlines become No. 2 private carrier since it
began its operations in 2005 as the economy boomed but has become
one of the main causalities of high fuel costs and a fierce price
war between a handful of airlines which, between them, have ordered
hundreds of aircraft on delivery over the next decade in an
ambitious bet on the future.
[0059] The token annotations module 604 determines each and every
token in the sentences of the input content 502. For example,
"The", "chairman", "of", "kingfisher", "airlines", "vijay",
"mallya", "said", "in", "an", "interview", "with" are all tokens in
the first line of the input content 502. The stem annotations
module 606 computes a root word for each and every token identified
by the token annotations module 604. For example,
[0060] "reuter--the chairman of kingfish airlin, vijay mallaya,
said in an interview with 370 million deal with an indian privat
investor and a consortium of bank that would save the airlin."
[0061] The forced new lines, paragraphs and indentations computing
module 608 determines white spaces such as new lines that are
forced (e.g., an enter received, or a list of sentences that are
not properly phrased), paragraphs, and/or indentations, etc. It is
used to extract new lines and sentences separately as content in
the case of documents like feeds and tweets, which most often do
not follow the language semantics. Such documents may also contain
sentences that are not phrased correctly. In such cases, the
extraction of new lines and one or more sentences is more valuable.
The POS token annotations module 610 generates one or more parts of
speech (POS) tags such as a noun, a verb, etc. for each token in
the sentences such that each token annotation has an associated POS
tag. Further, the POS line annotations module 612 tags each token
in the new lines as a noun or a verb, etc. In addition, the new
lines are also useful for section extraction because section names
may not be proper sentences. For example, in the input content 502,
"Shares" and "Introduction" are not proper sentences but single
words, and they are captured using the section annotations module
616 as new lines because each occurs on a separate line.
[0062] The duration and quantities determining module 614 extracts
durations and quantities wherever they occur in the text of the
content 502. For example, it extracts durations like "2008" and
"2005", and quantities like "64 percentage", from the input content
502. The section annotations module 616 determines a group of
sentences that form a section that has a heading. To determine a
start point and an end point of the section, various heuristics are
used, including lookup for well-known sections, sentence
construction based on the parts of speech, relevance with respect
to surrounding text, exclusion terms, term co-occurrence, etc.
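As an illustration of the duration and quantity extraction described above, the following sketch uses regular expressions. The patterns are assumptions: the source does not state how the duration and quantities determining module 614 recognizes these spans, and the function name `extract_durations_and_quantities` is hypothetical.

```python
import re

# Assumed pattern: a duration is a four-digit year in the 1900s or 2000s.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")

# Assumed pattern: a quantity is a number followed by one of a few unit words,
# optionally preceded by a dollar sign.
QUANTITY_PATTERN = re.compile(
    r"\$?\d+(?:\.\d+)?\s*(?:percentage|percent|million|billion)\b"
)


def extract_durations_and_quantities(text):
    """Return (durations, quantities) found in the text, in reading order."""
    durations = YEAR_PATTERN.findall(text)
    quantities = QUANTITY_PATTERN.findall(text)
    return durations, quantities


sentence_s10 = (
    "Kingfisher, which listed when it bought out budget airline, "
    "Air Deccan in 2008, has never made a profit and its market value "
    "has plunged 64 percentage this year."
)
durations, quantities = extract_durations_and_quantities(sentence_s10)
```

For sentence S10 this yields the duration "2008" and the quantity "64 percentage", matching the examples given in the text.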
[0063] In one embodiment, the user 102 can specify a section name
of the input content 502 around which a summary has to be
generated. For example, from the input content 502, two section
names, "introduction" and "shares", are extracted. The user 102
can obtain information from the content associated with either of
these two sections by using the section span annotations module
618. The section duration annotations module 620 determines (i) one
or more durations that appear in the section name specified by the
user 102, and (ii) text associated with the one or more durations.
If the user 102 does not specify the section name, a summary may be
generated for the entire content.
[0064] Once annotations are done, the annotation extractor module
312 extracts all the required artifacts (e.g., sentences, keywords,
durations, sections, etc.) from the annotations. The annotation
extractor module 312 extracts one or more sentences, one or more
keywords, one or more sections, one or more durations within the
section, one or more spans of the one or more durations, etc. that
occur within the input content 502, and provides them to the user
102 for an intent selection.
[0065] With reference to FIG. 6, FIG. 7 illustrates a user
interface view 700 of the intent selection by the user 102 of FIG.
1 according to an embodiment herein. The user interface view 700 of
the intent selection includes the header 402, the input content
502, a create folder(s) or organize content button 702, an intent
analytics field 704, and an intent selection field 706. In one
embodiment, the input content 502 may be obtained from one or more
documents (e.g., an uploaded content). These documents may be
listed and/or displayed as one or more scrollable lists (e.g., a
left to right scrollable list, a right to left scrollable list, an
up to down scrollable list and/or a down to up scrollable list).
The create folder(s) or organize content button 702 is used for
organizing those contents, and enables the user 102 to create new
folders, where the content from the scrollable lists can be dragged
and dropped to a required folder to organize them. The intent
analytics field 704 displays an analysis for the input content 502
which is selected from the scrollable lists. The analysis includes,
but is not limited to, sections, summary, identified keywords, and
other details such as duration information. The intent selection
field 706 provides one or more options to specify various intents
(analysis) around which summarization is to be done. In one
embodiment, the user 102 can specify a section, and/or a page of a
document that includes content to be summarized. In another
embodiment, the user 102 specifies a keyword around which
summarization of content needs to be done. In yet another
embodiment, the user 102 specifies an overall summarization when
the user 102 intends to summarize the entire content.
[0066] FIG. 8 is a flow diagram illustrating a method of
summarizing content around a sentiment using the weighted Formal
Concept Analysis (wFCA) according to an embodiment herein. In step
802, one or more sentences associated with the content are
identified based on parts of speech. The one or more sentences may
include a first set of sentences and a second set of sentences.
The first set of sentences may include either positive sentences or
negative sentences. A positive sentence S1 may be "REUTERS: the
chairman of kingfisher Airlines, Vijay Mallaya, said in an
interview with the financial times that he was close to sealing a
$370 million deal with an Indian private investor and a consortium
of banks that would save the airlines". Similarly, a positive
sentence S2 may be "The Bangalore-based entrepreneur told the FT
he was nearing a deal with 14 banks led by State Bank of India that
would provide the loss-making carrier with working capital of 6
billion rupees". Similarly, a negative sentence S11 may be "The
airlines become No. 2 private carrier since it began its operations
in 2005 as the economy boomed but has become one of the main
causalities of high fuel costs and a fierce price war between a
handful of airlines which, between them, have ordered hundreds of
aircraft on delivery over the next decade in an ambitious bet on
the future".
[0067] In step 804, one or more sentiments are identified in the
one or more sentences based on the parts of speech. A positive
statement and a negative statement may indicate the type of
sentiment in a sentence.
[0068] In step 806, one or more keywords in the one or more
sentences are identified. In one embodiment, the keyword
identifying module 316 identifies the one or more keywords based on
the Parts of speech (POS) tagged by the POS token annotations
module 610, and/or the POS line annotations module 612 of FIG. 6.
In step 808, one or more ambiguous keywords from the one or more
keywords are disambiguated using the wFCA, by generating a
lattice that includes one or more concepts. The one or more
concepts are generated with (i) the one or more keywords as
objects, and (ii) categories associated with the one or more
keywords as attributes. The categories may be obtained from the
knowledge base 218 of FIG. 2. In step 810, a weight is computed and
assigned for each sentence of the one or more sentences based on a
number of keywords of the one or more keywords that are associated
with each sentence. The weight is computed based on a number of
associations of each keyword within the one or more sentences. A
graph may be generated based on the one or more sentences and the
one or more keywords to compute the weight. The graph includes one
or more nodes. Each node indicates a sentence of the one or more
sentences that are associated with the sentiment. In step 812, an
input that includes an indication of the sentiment is processed.
The indication may include summarization around the sentiment
(e.g., a positive sentiment, or a negative sentiment). In one
embodiment, the indication may be provided by the user 102. In step
814, a summary on the content around the sentiment is generated
based on (i) the weight and at least one of (a) the one or more
sentiments, and (b) the indication. The summary may include at
least one of (i) at least one sentence obtained only from the first
set of sentences, and (ii) at least one sentence from the second
set of sentences. When the user 102 intends to expand the summary,
the content summarization engine 106 expands the summary based on
(a) the weight assigned for each sentence of the one or more
sentences, and (b) at least one of i) the one or more sentiments,
and ii) the indication.
[0069] For example, from the input content 502, the one or more
keywords are identified and extracted based on parts of speech
(POS) tag generated by the POS token annotations module 610, and/or
the POS line annotations module 612. For instance, a noun is very
likely to be a keyword in a sentence. Similarly, co-occurring nouns
and their derivatives are also keywords. A keyword chunker is used
to obtain these keywords and keyword phrases depending on the noun
and related tags. In one embodiment, when the user 102 does not
specify a section of the input content 502, the entire input
content 502 is summarized. The annotation extractor module 312
extracts the one or more keywords (e.g., 6 keywords) using the POS
tags. For instance, the extracted keywords are:
[0070] reuters--POS Tag says that it is a noun
[0071] chairman--POS Tag says that it is a noun
[0072] Kingfisher Airlines--POS Tag says that it is a noun followed
by a noun (phrase)
[0073] Vijay Mallya--POS Tag says that it is a noun followed by a
noun (phrase)
[0074] Shares--POS Tag says that it is a noun
[0075] Kingfisher--POS Tag says that it is a noun
[0076] Once these keywords are identified and extracted, they need
to be disambiguated to find the right meaning in which the one or
more keywords are used in the input content 502. To disambiguate,
the content summarization engine 106 determines different
disambiguated terms from the one or more keywords, and their
related categories. Further, the content summarization engine 106
uses the knowledge base 218 stored in the external database 216 for
obtaining categories for the one or more keywords. Each keyword is
queried separately against the knowledge base 218 and corresponding
categories are obtained. For example, for the above keywords, the
categories obtained are
[0077] REUTERS--{Society, Corporate Groups, Organizations,
Organizations by type, Agencies, News agencies}
[0078] chairman--{Business, Management, Management occupations}
[0079] Kingfisher Airlines--{Business, Industry, Service
Industries, Travel, Transportation, Transport by mode, Aviation,
Aviation by Continent, Aviation in Asia, Aviation in India,
Airlines of India}
[0080] Vijay Mallya--{Business, Management, Management occupations,
Business executives, Chief executives, Chief executives by
nationality, Indian chief executives}
[0081] Shares--{Business, Finance, Financial Economics, Financial
markets, Stock market, Share (finance)}
[0082] Kingfisher--{Nature, Natural Sciences, Biology, Zoology,
Animals, Chordates, Vertebrates, Birds, Birds by common name,
Kingfishers}
[0083] For example, the keyword "kingfisher" has two
disambiguations (e.g., one for "Kingfisher (Bird)" and one for
"Kingfisher Airlines"). The categories corresponding to each word
are shown against them. In one embodiment, the categories may be
modified by the user 102. The modification is taken as a feedback
to the categories suggested for the keywords and is used to train
the knowledge base 218 for preferred categories. In order to
disambiguate the keyword "kingfisher" and to compute a context in
the right meaning, the content summarization engine 106 uses the
lattice construction module 326. The lattice construction module
326 constructs a lattice based on the weighted Formal Concept
Analysis (wFCA).
[0084] FIG. 9 illustrates a graphical representation 900 of a
lattice that is generated to disambiguate a keyword "kingfisher" of
the input content 502 using the lattice construction module 326 of
FIG. 3 according to an embodiment herein. The lattice construction
module 326 forms various concepts with one or more keywords, and
their associated categories. For example, concept-1 to concept-9
associated with FIG. 9 are:
[0085] Concept-1: [Kingfisher Airlines]: [Aviation in India,
Travel, Aviation, Transport by mode, Aviation by Continent,
Transportation, Business, Aviation in Asia, Service Industries,
Airlines of India, Industry]
[0086] Concept-2: [Shares]: [Finance, Business, Share (finance),
Financial Economics, Stock market, Financial markets]
[0087] Concept-3: [Vijay Mallya]: [Management occupations,
Management, Business, Chief executives, Business executives,
Indian chief executives, Chief executives by nationality]
[0088] Concept-4: [Kingfisher]: [Animals, Zoology, Natural
Sciences, Chordates, Vertebrates, Birds, Birds by common name,
Kingfishers, Nature, Biology]
[0089] Concept-5: [REUTERS]: [Society, Organizations by type,
Agencies, Organizations, Corporate Groups, News agencies]
[0090] Concept-6: [chairman, Vijay Mallya]: [Management
occupations, Management, Business]
[0091] Concept-7: [chairman, Kingfisher Airlines, Shares, Vijay
Mallya]: [Business]
[0092] Concept-8: [ ]: [Aviation in India, Society, Travel,
Animals, Management, Management occupations, Organizations, Chief
executives, Indian chief executives, Corporate Groups, Share
(finance), Financial Economics, Chordates, Airlines of India,
Transport by mode, Transportation, Agencies, Aviation in Asia,
Aviation, Organizations by type, Zoology, Aviation by Continent,
Business, Finance, Business executives, Natural Sciences, News
agencies, Chief executives by nationality, Birds by common name,
Nature, Service Industries, Stock market, Vertebrates, Birds,
Financial markets, Kingfishers, Industry, Biology]
[0093] Concept-9: [chairman, Kingfisher Airlines, Shares,
Kingfisher, Vijay Mallya, REUTERS]: [ ]
[0094] In one embodiment, the lattice construction module 326
interprets that the concept 4 "Kingfisher" is not associated with
any other concept and there are no matching contexts. Whereas, the
concept 1 "Kingfisher Airlines" is associated with the concept 3
"Vijay Mallya", the concept 2 "shares" and the concept 6
"chairman", and it also has an overlapping context "business"
(concept 7). Thus, the word kingfisher is treated as "Kingfisher
Airlines" and not "Kingfisher (bird)".
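The concepts listed for FIG. 9 can be reproduced with a small formal concept enumeration over the keyword categories. The sketch below is a brute-force illustration, not the lattice construction module 326: the function name `formal_concepts` and the closure-by-subsets approach are assumptions, chosen because six keywords give only 64 subsets to check.

```python
from itertools import combinations

# Keywords (objects) and their categories (attributes), as listed in the text.
CATEGORIES = {
    "REUTERS": {"Society", "Corporate Groups", "Organizations",
                "Organizations by type", "Agencies", "News agencies"},
    "chairman": {"Business", "Management", "Management occupations"},
    "Kingfisher Airlines": {
        "Business", "Industry", "Service Industries", "Travel",
        "Transportation", "Transport by mode", "Aviation",
        "Aviation by Continent", "Aviation in Asia", "Aviation in India",
        "Airlines of India"},
    "Vijay Mallya": {
        "Business", "Management", "Management occupations",
        "Business executives", "Chief executives",
        "Chief executives by nationality", "Indian chief executives"},
    "Shares": {"Business", "Finance", "Financial Economics",
               "Financial markets", "Stock market", "Share (finance)"},
    "Kingfisher": {"Nature", "Natural Sciences", "Biology", "Zoology",
                   "Animals", "Chordates", "Vertebrates", "Birds",
                   "Birds by common name", "Kingfishers"},
}


def formal_concepts(context):
    """Return the set of (extent, intent) pairs of a formal context."""
    objects = list(context)
    all_attributes = frozenset().union(*context.values())
    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            # Intent: the categories shared by every keyword in the subset.
            intent = all_attributes
            for obj in subset:
                intent = intent & context[obj]
            # Extent: every keyword that carries all categories of the intent.
            extent = frozenset(o for o in objects if intent <= context[o])
            concepts.add((extent, frozenset(intent)))
    return concepts


concepts = formal_concepts(CATEGORIES)
```

With the categories above, the enumeration yields nine concepts, including the pair ([chairman, Kingfisher Airlines, Shares, Vijay Mallya], [Business]) that corresponds to concept 7, while "Kingfisher" (the bird) shares a non-empty category set with no other keyword.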
[0095] Further, concept 1 to concept 5 define distinct category
sets for each keyword. The keyword "chairman" does not have a
distinct concept because its category set is a subset of the
category set of the keyword "Vijay Mallya". This implies that the
keywords "chairman" and "Vijay Mallya" are strongly related in a
context of the input content 502. In addition, the concept 6 and
concept 7 provide contextual information that the keywords
"chairman", "Kingfisher Airlines", "Shares" and "Vijay Mallya" are
related in the context of the input content 502. The keywords
"reuters" and "Kingfisher" are not related to any other keywords
and are treated as unimportant (lower priority) in the context of
the input content 502, and there is no concept that covers all the
categories.
[0096] The score computing module 328 computes a score (shown in
the percentage) for each node or concept in FIG. 9 using the
weighted FCA. A simple heuristic model of the weighted FCA computes
score of the nodes, and a node with highest score is used to
disambiguate a keyword in the context of right meaning. For
computing score, the heuristic may assign equal probability for all
six keywords. Hence, there are a total of 6 keywords having a score
of 1/6 each. The concept 1 to concept 5 define a distinct category
set for each keyword. Therefore, the score for each keyword of
concept 1 to concept 5 is 1/6 (16.67%).
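The scoring heuristic can be sketched as follows, using the keyword sets of concept 1 to concept 7 listed for FIG. 9. This is an illustrative reading of the heuristic, not the score computing module 328: the rule "a concept's score is its number of keywords times 1/6" follows the text, while the function names and the rule "pick the candidate whose best concept scores highest" are assumptions. Concept 8 and concept 9 are left out because one has no keywords and the other has no shared category.

```python
from fractions import Fraction

KEYWORD_COUNT = 6  # all six keywords get an equal probability of 1/6

# Keywords of each concept, copied from the concept list of FIG. 9.
CONCEPT_KEYWORDS = {
    "Concept-1": ["Kingfisher Airlines"],
    "Concept-2": ["Shares"],
    "Concept-3": ["Vijay Mallya"],
    "Concept-4": ["Kingfisher"],
    "Concept-5": ["REUTERS"],
    "Concept-6": ["chairman", "Vijay Mallya"],
    "Concept-7": ["chairman", "Kingfisher Airlines", "Shares", "Vijay Mallya"],
}


def concept_score(keywords, keyword_count=KEYWORD_COUNT):
    """Score of a concept: (1 / keyword_count) per keyword it contains."""
    return Fraction(len(keywords), keyword_count)


def best_score_for(keyword):
    """Highest score among the concepts that contain the keyword."""
    return max(
        concept_score(members)
        for members in CONCEPT_KEYWORDS.values()
        if keyword in members
    )


def disambiguate(candidates):
    """Hypothetical rule: choose the candidate with the highest best score."""
    return max(candidates, key=best_score_for)


chosen = disambiguate(["Kingfisher", "Kingfisher Airlines"])
```

Under these assumptions, concept 6 scores 2/6 and concept 7 scores 4/6, so `chosen` is "Kingfisher Airlines" rather than the bird sense, whose only concept scores 1/6.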
[0097] However, as described, the keyword "chairman" does not have
a distinct category set because its category set is a subset of the
category set of the keyword "Vijay Mallya". These categories are
"Management occupations", "Management" and "Business". Because
these categories are common to both keywords "chairman" and "Vijay
Mallya", the two keywords are strongly associated in the context of
the input content 502. This forms the concept 6.
[0098] Further, a score for the concept 6 is (1/6)*2 which is
33.33%.
[0099] In addition, a category "business" is associated with the
categories of the concept 2. Thus, the keywords "kingfisher
Airlines", "shares", "vijay Mallya", and "chairman" are strongly
associated with the category business. This forms the concept 7.
The score for the concept 7 is (1/6)*4, which is 66.67%. Hence,
the keyword "kingfisher Airlines" as described is strongly
associated with the category "business". Thus, the keyword
"kingfisher" is treated as "kingfisher Airlines" and not as
"kingfisher (bird)" by using weighted FCA.
[0100] In one embodiment, the weighted FCA can be further drilled
down to provide more precise results, which is often useful to
obtain more contextual information for disambiguation. In one
embodiment, using the weighted FCA, the disambiguation is done by
treating all the categories at the same level and ignoring
hierarchy, whereas in the drill down FCA all the associated
categories form a hierarchy in the knowledge base 218.
[0101] For example, consider the hierarchy for Chairman, Kingfisher
Airlines and Vijay Mallya.
[0102] Chairman--{Business->Management->Management occupations}
[0103] Kingfisher Airlines--{Business->Industry->Service
Industries->Travel->Transportation->Transport by
mode->Aviation->Aviation by Continent->Aviation in
Asia->Aviation in India, Airlines of India}
[0104] Vijay Mallya--{Business->Management->Management
occupations->Business executives->Chief executives->Chief
executives by nationality->Indian chief executives}
In the first level of the weighted FCA, i.e., considering the root
element "business", the weight for all three keywords in the
concept is (1/3)*3=1.0. Hence, in the context of "business" all are
related. But, using the drill down FCA, if the "Business" category
is drilled down to a set of categories such as {Business,
Management, Industry}, two drill down concepts will be obtained
with respect to the three concepts.
[0105] Concept-10: [chairman, Vijay Mallya]: [Business, Management]
Weight: 2/3 ≈ 0.67
[0106] Concept-11: [Kingfisher Airlines]: [Business, Industry]
Weight: 1/3 ≈ 0.33
[0107] By performing the drill-down FCA with the subset of
categories, contextual information is obtained. The contextual
information indicates an affinity among the keywords "chairman",
"Kingfisher Airlines" and "Vijay Mallya". For instance, the
concept 10 and concept 11 show that although "chairman",
"Kingfisher Airlines" and "Vijay Mallya" are related in a context
of "Business", "Kingfisher Airlines" is a different concept in
a context of "Business" with respect to "Industry", whereas
"chairman" and "Vijay Mallya" are related in a context of
"Business" with respect to "Management". Similarly, the drill down
FCA can be performed until all the contextual information is
retrieved and the disambiguation is achieved. Further, the content
summarization engine 106 accepts the input content 502 at the
disambiguation step as well and the user 102 can correct the
incorrect associations by viewing alternative category
associations.
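The drill down described above can be illustrated by grouping keywords on the first levels of their category hierarchy. This is a sketch under assumptions: the function name `drill_down`, the representation of a hierarchy as an ordered list, and the use of a path prefix as the grouping key are not taken from the source. Only the first levels of each hierarchy are reproduced; deeper levels are omitted.

```python
from collections import defaultdict
from fractions import Fraction

# Root-to-leaf category paths, truncated to the levels needed for the example.
HIERARCHY = {
    "chairman": ["Business", "Management", "Management occupations"],
    "Kingfisher Airlines": ["Business", "Industry", "Service Industries"],
    "Vijay Mallya": ["Business", "Management", "Management occupations"],
}


def drill_down(paths, depth):
    """Group keywords by the first `depth` levels of their category path.

    Returns a mapping from the shared path prefix to a pair
    (keywords, weight), where weight = number of keywords / total keywords.
    """
    groups = defaultdict(list)
    for keyword, path in paths.items():
        groups[tuple(path[:depth])].append(keyword)
    total = len(paths)
    return {
        prefix: (members, Fraction(len(members), total))
        for prefix, members in groups.items()
    }


level_1 = drill_down(HIERARCHY, depth=1)
level_2 = drill_down(HIERARCHY, depth=2)
```

At depth 1 all three keywords fall under "Business" with weight 1, and at depth 2 they split into the (Business, Management) group with weight 2/3 and the (Business, Industry) group with weight 1/3, in line with concept 10 and concept 11.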
[0108] In an embodiment, the user 102 may also disambiguate one or
more keywords in the context of right meaning using popularity of
the one or more keywords. In yet another embodiment, the user 102
is provided with a graphical representation for visualizing
summarization around the one or more keywords. The user 102 can
view a graph having the one or more keywords around which related
text such as keywords, sentences, and/or content are associated.
Once disambiguation is over, the content summarization engine 106
has one or more disambiguated keywords to generate a content
graph.
[0109] FIG. 10 is a graphical representation illustrating a graph
1000 that indicates an association between one or more keywords and
sentences of the input content 502 of FIG. 5 according to an
embodiment herein. The graph generating module 320 generates the
graph 1000 that indicates the sentences S1 through S11 and their
associated keywords. The sentences S1 and S6 are positive
sentences. The sentences S4, S7, S8, S9, S10, and S11 are negative
sentences. The sentences S2, S3 and S5 are the set of neutral
sentences. The graph 1000 includes one or more nodes, and each node
corresponds to at least one sentence of the input content 502. The
graph 1000 further includes one or more keywords of the input
content 502 identified by the keyword identifying module 316.
[0110] For example, the graph 1000 is generated for the input
content 502 with sentences S1, S2, S3, S4, S5, S6, S7, S8, S9, S10,
and S11 associated with the keywords K1, K2, K3 . . . and K21 such
as,
[0111] K1: Kingfisher Airlines
[0112] K2: Shares
[0113] K3: chairman
[0114] K4: State bank of India
[0115] K5: Reuters
[0116] K6: Investor
[0117] K7: Financial times
[0118] K8: Banks
[0119] K9: Deal
[0120] K10: Interview
[0121] K11: Bangalore
[0122] K12: Entrepreneur
[0123] K13: Vijay Mallaya
[0124] K14: Equity/Equity injection
[0125] K15: Market value
[0126] K16: Debt
[0127] K17: Profit
[0128] K18: Economy
[0129] K19: Aircraft
[0130] K20: Fuel
[0131] K21: Causalities
[0132] For instance, all the keywords K1, K2, K3, . . . and K21
have an equal weight of 1/21 (i.e., 0.04762). However, based on
number of associations of each keyword with the sentences, the
actual weight may vary. For example, the keyword K1 "kingfisher
Airlines" is associated with S1 directly, also indirectly with S4,
S5, S9, and S10 as "kingfisher", and with S7, S10, and S11 as
"Airline or Airlines". The keyword "kingfisher" is treated here as
"kingfisher Airlines" as already disambiguated. The keyword
"kingfisher Airlines" is thus associated with 7 sentences. Hence,
the weight for the keyword K1 "kingfisher Airlines" is computed as
0.04762/7. Similarly, a weight is computed for each keyword based
on its number of associations.
[0133] Similarly, keywords K4, K7, K8, K9, K11 and K12 are
associated with the second sentence S2. Keywords K8, and K13 are
associated with the third sentence S3. Keywords K1, and K13 are
associated with the fourth sentence S4. Keywords K1, and K6 are
associated with the fifth sentence S5. Keywords K7, K13 and K14 are
associated with the sixth sentence S6. Keywords K1, K8, K9, K13,
and K16 are associated with the seventh sentence S7. Keywords K5,
and K13 are associated with the eighth sentence S8. Keywords K1,
and K2 are associated with the ninth sentence S9. Keywords K1, K15,
and K17 are associated with the tenth sentence S10. Keywords K1, K18,
K19, K20 and K21 are associated with the eleventh sentence S11.
[0134] FIG. 11 is a table view 1100 illustrating a weight 1102 that
is computed for each keyword that is identified by the keyword
identifying module 316 of FIG. 3 based on number of associations
1104 of each keyword with sentences of the input content 502
according to an embodiment herein. In the same manner as described
for the keyword K1 "kingfisher Airlines", a weight is computed for
each keyword based on its number of associations with sentences of
the input content 502, as shown in the table.
[0135] Once the weight is computed for each keyword, the weight
assigning module 330 assigns a weight for a sentence in the input
content 502 based on a count that corresponds to the keywords
associated with a node that corresponds to the sentence using
simple heuristics. Similarly, a weight is computed for each
sentence of the input content 502. For example, a weight for the
first sentence S1 is computed based on a count that corresponds to
the keywords associated with a node that corresponds to the first
sentence S1. The keywords associated with the first sentence S1
are K1, K3, K5, K6, K7, K8, K9, K10, and K13. The weight for the
first sentence S1 is computed as the summation of the weights of
the keywords that correspond to the first sentence S1.
[0136] Thus,
Weight of S1 = (Weight of K1) + (Weight of K3) + (Weight of K5)
+ (Weight of K6) + (Weight of K7) + (Weight of K8) + (Weight of K9)
+ (Weight of K10) + (Weight of K13)
= (0.04762/7) + (0.04762) + (0.04762/2) + (0.04762/2) + (0.04762/3)
+ (0.04762/4) + (0.04762/3) + (0.04762) + (0.04762/6)
= 0.2013
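The keyword and sentence weighting can be sketched end to end from the sentence-to-keyword associations given in the text. This is a minimal sketch, not the weight assigning module 330: the function names are hypothetical, and the base weight is taken as exactly 1/21 rather than the rounded 0.04762, so results agree with the figures in the text only to about four decimal places.

```python
from collections import Counter

TOTAL_KEYWORDS = 21  # K1 ... K21, each starting from an equal weight of 1/21

# Keywords associated with each sentence, as listed in the text.
SENTENCE_KEYWORDS = {
    "S1": ["K1", "K3", "K5", "K6", "K7", "K8", "K9", "K10", "K13"],
    "S2": ["K4", "K7", "K8", "K9", "K11", "K12"],
    "S3": ["K8", "K13"],
    "S4": ["K1", "K13"],
    "S5": ["K1", "K6"],
    "S6": ["K7", "K13", "K14"],
    "S7": ["K1", "K8", "K9", "K13", "K16"],
    "S8": ["K5", "K13"],
    "S9": ["K1", "K2"],
    "S10": ["K1", "K15", "K17"],
    "S11": ["K1", "K18", "K19", "K20", "K21"],
}


def association_counts(sentence_keywords):
    """Number of sentences each keyword is associated with."""
    return Counter(k for kws in sentence_keywords.values() for k in kws)


def keyword_weights(sentence_keywords, total_keywords=TOTAL_KEYWORDS):
    """Equal base weight divided by the keyword's number of associations."""
    base = 1 / total_keywords
    counts = association_counts(sentence_keywords)
    return {k: base / n for k, n in counts.items()}


def sentence_weights(sentence_keywords, total_keywords=TOTAL_KEYWORDS):
    """Weight of a sentence: sum of the weights of its keywords."""
    kw = keyword_weights(sentence_keywords, total_keywords)
    return {s: sum(kw[k] for k in kws) for s, kws in sentence_keywords.items()}


weights = sentence_weights(SENTENCE_KEYWORDS)
```

With these associations, K1 is counted in 7 sentences and K13 in 6, and the weight computed for S1 comes out at approximately 0.2013, the value derived above.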
[0137] FIG. 12 is a table view 1200 illustrating a weight 1202 that
is computed for each sentence of the input content 502 based on
associated keywords 1204 according to an embodiment herein. In the
same manner as described for the sentence S1, a weight is
computed for each sentence of the input content 502 as shown in the
table.
[0138] From the graph 1000 of FIG. 10, the graph generating module
320 interprets that S1 is the most important sentence when compared
to other sentences, because it has the largest number of
associations with the keywords. The current example explains a
simple weighting scheme based on keywords. However, the weighting
scheme can also depend on various factors, like sentence selection,
section selection and sentiment selection.
[0139] Once the most important sentences are identified, the intent
building module 322 is used for tailoring sentences together in the
exact sequence in which they appear in the original text, and
provides summary of the input content 502. The number of sentences
to be used as a summary depends on an input parameter from the user
102 as well as a weighted cut-off that is configurable.
[0140] In one embodiment, the user 102 can expand the summary of
the input content 502 using the intent expanding module 324. For
example, a first level of summary for the input content 502 has
only S1 (having a highest weight). If the user wants to expand the
summary to a second level, the intent expanding module 324 relaxes
weight of sentences. Then, the next most important sentence, S11
(associated with 5 keywords, and having the next highest weight),
is tailored with S1 in the exact sequence in which they appear in
the input content 502. In one embodiment, when a summary on a
particular section is requested, then the summary is generated by
considering only sentences occurring in that section. Similarly, in
another embodiment, when a summary on a particular page is
requested, then the summary is generated by considering only
sentences occurring in that page. In yet another embodiment, the
user 102 intends to summarize content around a particular
keyword.
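Summary building and expansion can be sketched as selecting the highest-weighted sentences and emitting them in document order. The sketch is illustrative only: the function name `build_summary` and the use of an integer `level` as the number of sentences to keep are assumptions (the text refers to a user input parameter and a configurable weighted cut-off without giving their form). Only the sentence weights quoted in the text are included here.

```python
# Sentence weights quoted in the text; other sentences are omitted here.
SENTENCE_WEIGHTS = {
    "S1": 0.2013,
    "S2": 0.1865,
    "S3": 0.01984,
    "S5": 0.03061,
    "S6": 0.07143,
    "S11": 0.1973,
}

# Order in which the sentences appear in the input content.
DOCUMENT_ORDER = ["S1", "S2", "S3", "S5", "S6", "S11"]


def build_summary(weights, document_order, level):
    """Keep the `level` highest-weighted sentences, in document order."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    chosen = set(ranked[:level])
    return [sentence for sentence in document_order if sentence in chosen]


first_level = build_summary(SENTENCE_WEIGHTS, DOCUMENT_ORDER, level=1)
second_level = build_summary(SENTENCE_WEIGHTS, DOCUMENT_ORDER, level=2)
```

Here the first level contains only S1, and expanding to the second level adds S11 after S1, as in the example above.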
[0141] The sentiments that have the highest weight are arranged at
the top, followed by the sentiments of progressively lower weight,
in one example embodiment. For example, the positive sentiments are
at the top followed by neutral sentiments, in another example
embodiment. When the user 102 intends to summarize the input
content 502 around the one or more positive sentiments, the content
summarization engine 106 arranges the sentence S1 having a weight
0.2013 and the sentence S6 having a weight 0.07143 as positive
sentiments at the top, and these may be followed by one or more
neutral sentences (e.g., the neutral sentence S2 having a weight
0.1865, the neutral sentence S5 having a weight 0.03061, and the
neutral sentence S3 having a weight 0.01984). Similarly, when the
user 102 intends to summarize the input content 502 around the one
or more negative sentiments, the content summarization engine 106
arranges the sentence S11 having a weight 0.1973 and the sentence
S10 as negative sentences at the top, and these may be followed by
one or more neutral sentences (e.g., the neutral sentences S2, S5,
and S3).
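The arrangement around a sentiment can be sketched as two weight-sorted groups, the requested sentiment first and the neutral sentences after it. The function name `arrange_around` and the tuple layout are hypothetical, and only sentences whose weights are quoted in the text are included (so S10 is left out).

```python
# (sentence, sentiment, weight) for the sentences whose weights the text quotes.
SENTENCES = [
    ("S1", "positive", 0.2013),
    ("S2", "neutral", 0.1865),
    ("S3", "neutral", 0.01984),
    ("S5", "neutral", 0.03061),
    ("S6", "positive", 0.07143),
    ("S11", "negative", 0.1973),
]


def arrange_around(sentences, sentiment):
    """Sentences of the requested sentiment by decreasing weight,
    followed by neutral sentences by decreasing weight."""

    def by_weight(label):
        matching = [s for s in sentences if s[1] == label]
        return sorted(matching, key=lambda s: s[2], reverse=True)

    return [name for name, _, _ in by_weight(sentiment) + by_weight("neutral")]


positive_summary = arrange_around(SENTENCES, "positive")
negative_summary = arrange_around(SENTENCES, "negative")
```

For the positive sentiment this gives S1, S6, S2, S5, S3, which is the order described above.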
[0142] In a case where there are no neutral sentences, only the
positive sentiments having the highest weight are arranged at the
top, followed by the positive sentiments that have lesser weight.
Similarly, in a case where there are no neutral sentences, only the
negative sentences having the highest weight are arranged at the
top and may be followed by the negative sentences that have lesser
weight. Further, to
summarize around a positive sentiment, (i) one or more keywords are
identified from one or more positive sentences and not from the
entire input content 502, and (ii) one or more ambiguous keywords
are disambiguated from the one or more keywords, in one example
embodiment. Similarly, to summarize around a negative sentiment,
(i) one or more keywords are identified from one or more negative
sentences and not from the entire input content 502, and (ii) one
or more ambiguous keywords are disambiguated from the one or more
keywords, in another example embodiment. Once the summary is
generated around the positive sentiment, or around the negative
sentiment, the summary may be displayed either in a text format, or
in a graphical representation (e.g., a bar chart, or a pie
chart).
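As a rough illustration of step (i) above, the following Python
sketch collects keywords only from sentences of the requested
sentiment rather than from the entire content. The function name
and the sample keywords are hypothetical and do not come from the
input content 502. Step (ii), the wFCA-based disambiguation, is not
shown.

```python
def keywords_for_sentiment(sentences, target):
    """Collect keywords only from sentences with the target sentiment.

    `sentences` is a list of (sentiment, keywords) pairs. Keywords are
    returned in order of first appearance, without duplicates.
    """
    collected = []
    for sentiment, keywords in sentences:
        if sentiment != target:
            continue  # ignore sentences of any other sentiment
        for keyword in keywords:
            if keyword not in collected:
                collected.append(keyword)
    return collected


# Hypothetical sample data, for illustration only.
SAMPLE = [
    ("positive", ["airline", "profit"]),
    ("negative", ["airline", "debt"]),
    ("positive", ["profit", "growth"]),
]

print(keywords_for_sentiment(SAMPLE, "positive"))
print(keywords_for_sentiment(SAMPLE, "negative"))
```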
[0143] The techniques provided by the embodiments herein may be
implemented on an integrated circuit chip (not shown). The chip
design is created in a graphical computer programming language, and
stored in a computer storage medium (such as a disk, tape, physical
hard drive, or virtual hard drive such as in a storage access
network). If the designer does not fabricate chips or the
photolithographic masks used to fabricate chips, the designer
transmits the resulting design by physical means (e.g., by
providing a copy of the storage medium storing the design) or
electronically (e.g., through the Internet) to such entities,
directly or indirectly. The stored design is then converted into
the appropriate format (e.g., GDSII) for the fabrication of
photolithographic masks, which typically include multiple copies of
the chip design in question that are to be formed on a wafer. The
photolithographic masks are utilized to define areas of the wafer
(and/or the layers thereon) to be etched or otherwise
processed.
[0144] The resulting integrated circuit chips can be distributed by
the fabricator in raw wafer form (that is, as a single wafer that
has multiple unpackaged chips), as a bare die, or in a packaged
form. In the latter case the chip is mounted in a single chip
package (such as a plastic carrier, with leads that are affixed to
a motherboard or other higher level carrier) or in a multichip
package (such as a ceramic carrier that has either or both surface
interconnections or buried interconnections). In any case the chip
is then integrated with other chips, discrete circuit elements,
and/or other signal processing devices as part of either (a) an
intermediate product, such as a motherboard, or (b) an end product.
The end product can be any product that includes integrated circuit
chips, ranging from toys and other low-end applications to advanced
computer products having a display, a keyboard or other input
device, and a central processor. The embodiments herein can take
the form of an entirely hardware embodiment, an entirely software
embodiment or an embodiment including both hardware and software
elements. The embodiments that are implemented in software include
but are not limited to, firmware, resident software, microcode,
etc.
[0145] Furthermore, the embodiments herein can take the form of a
computer program product accessible from a computer-usable or
computer-readable medium providing program code for use by or in
connection with a computer or any instruction execution system. For
the purposes of this description, a computer-usable or
computer-readable medium can be any apparatus that can comprise,
store,
communicate, propagate, or transport the program for use by or in
connection with the instruction execution system, apparatus, or
device.
[0146] The medium can be an electronic, magnetic, optical,
electromagnetic, infrared, or semiconductor system (or apparatus or
device) or a propagation medium. Examples of a computer-readable
medium include a semiconductor or solid state memory, magnetic
tape, a removable computer diskette, a random access memory (RAM),
a read-only memory (ROM), a rigid magnetic disk and an optical
disk. Current examples of optical disks include compact disk-read
only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
[0147] A data processing system suitable for storing and/or
executing program code will include at least one processor coupled
directly or indirectly to memory elements through a system bus. The
memory elements can include local memory employed during actual
execution of the program code, bulk storage, and cache memories
which provide temporary storage of at least some program code in
order to reduce the number of times code must be retrieved from
bulk storage during execution.
[0148] Input/output (I/O) devices (including but not limited to
keyboards, displays, pointing devices, etc.) can be coupled to the
system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the
data processing system to become coupled to other data processing
systems or remote printers or storage devices through intervening
private or public networks. Modems, cable modems, and Ethernet cards
are just a few of the currently available types of network
adapters.
[0149] A representative hardware environment for practicing the
embodiments herein is depicted in FIG. 13. This schematic drawing
illustrates a hardware configuration of an information
handling/computer system in accordance with the embodiments herein.
The system comprises at least one processor or central processing
unit (CPU) 10. The CPUs 10 are interconnected via system bus 12 to
various devices such as a random access memory (RAM) 14, read-only
memory (ROM) 16, and an input/output (I/O) adapter 18. The I/O
adapter 18 can connect to peripheral devices, such as disk units 11
and tape drives 13, or other program storage devices that are
readable by the system. The system can read the inventive
instructions on the program storage devices and follow these
instructions to execute the methodology of the embodiments herein.
The system further includes a user interface adapter 19 that
connects a keyboard 15, mouse 17, speaker 24, microphone 22, and/or
other user interface devices such as a touch screen device (not
shown) to the bus 12 to gather user input. Additionally, a
communication adapter 20 connects the bus 12 to a data processing
network 25, and a display adapter 21 connects the bus 12 to a
display device 23 which may be embodied as an output device such as
a monitor, printer, or transmitter, for example.
[0150] The content summarization engine 106 provides the user 102
with a more precise summary of the content, and helps the user 102
grasp it quickly. Moreover, sentences of the content are stitched in
the exact order in which they appear in the content. This provides
continuity and a clear understanding to the user 102 while reviewing
the summary. The content summarization engine 106 saves a
considerable amount of the user's time by providing the summary,
and the user 102 can also expand the summary for better
understanding. Also, the user 102 obtains a summary of the content
based on their intent.
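The stitching of sentences in their original order can be sketched
as follows. This is a minimal Python illustration under assumptions,
not the engine's actual method: the function name stitch_summary,
the top_n parameter, and the sample sentences and weights are all
hypothetical. Sentences are selected by weight and then emitted in
document order.

```python
def stitch_summary(sentences, weights, top_n):
    """Pick the top_n heaviest sentences, then join them in the
    order in which they appear in the original content."""
    # Rank sentence positions by descending weight and keep top_n.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: weights[i],
                    reverse=True)[:top_n]
    # Restore the original document order before joining.
    return " ".join(sentences[i] for i in sorted(ranked))


# Hypothetical content and weights, for illustration only.
CONTENT = ["A.", "B.", "C.", "D."]
WEIGHTS = [0.1, 0.4, 0.05, 0.3]

print(stitch_summary(CONTENT, WEIGHTS, 2))
print(stitch_summary(CONTENT, WEIGHTS, 3))
```

Raising top_n corresponds to expanding the summary: more sentences
are included, but their relative order never changes.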
[0151] The content summarization engine 106 may also enable
summarization around a keyword. For example, a summary around a
keyword may be generated based on (i) a sentence having a positive
sentiment, or (ii) a sentence having a negative sentiment. For
instance, the user 102 intends to summarize the input content 502
around the keyword K1--Kingfisher Airlines that is either
associated with a set of positive sentences, or associated with a
set of negative sentences. When the user 102 intends to summarize
the keyword K1--Kingfisher Airlines that is associated with the set
of positive sentences (S1 and S6), only the sentence S1 is selected
for summarization since the keyword K1 is not associated with the
sentence S6. When the user 102 intends to summarize the keyword
K1--Kingfisher Airlines that is associated with the set of negative
sentences (S4, S7, S8, S9, S10, S11), at least one sentence (S4,
S7, S9, S10, and/or S11) is selected for summarization (e.g., S4).
Since the keyword K1 is not associated with the sentence S8, the
sentence S8 is not selected for summarization by the content
summarization engine 106.
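The keyword-based selection in this example can be sketched in
Python as a filter on two conditions: the sentence has the requested
sentiment, and the sentence is associated with the keyword. The
sentiment labels and K1 associations below are taken from the
example above. The function name candidates and the data layout are
assumptions, and the final choice among the candidates (e.g., S4) is
not shown.

```python
# Sentiment of each sentence, as stated in the example above.
SENTIMENT_OF = {
    "S1": "positive", "S6": "positive",
    "S4": "negative", "S7": "negative", "S8": "negative",
    "S9": "negative", "S10": "negative", "S11": "negative",
}

# Sentences associated with keyword K1 (Kingfisher Airlines).
K1_SENTENCES = {"S1", "S4", "S7", "S9", "S10", "S11"}


def candidates(sentiment_of, keyword_sentences, target):
    """Return ids of sentences that have the target sentiment and
    are associated with the keyword."""
    return [sid for sid, sentiment in sentiment_of.items()
            if sentiment == target and sid in keyword_sentences]


print(candidates(SENTIMENT_OF, K1_SENTENCES, "positive"))
print(candidates(SENTIMENT_OF, K1_SENTENCES, "negative"))
```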
[0152] The foregoing description of the specific embodiments will
so fully reveal the general nature of the embodiments herein that
others can, by applying current knowledge, readily modify and/or
adapt for various applications such specific embodiments without
departing from the generic concept, and, therefore, such
adaptations and modifications should and are intended to be
comprehended within the meaning and range of equivalents of the
disclosed embodiments. It is to be understood that the phraseology
or terminology employed herein is for the purpose of description
and not of limitation. Therefore, while the embodiments herein have
been described in terms of preferred embodiments, those skilled in
the art will recognize that the embodiments herein can be practiced
with modification within the spirit and scope of the appended
claims.
* * * * *