U.S. patent application number 13/616,403, published on 2013-06-13 as publication 20130151525, relates to inferring emerging and evolving topics in streaming text. This patent application is currently assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Invention is credited to Ankan Saha, Arindam Banerjee, Shiva P. Kasiviswanathan, Richard D. Lawrence, Prem Melville, Vikas Sindhwani, and Edison L. Ting.

Publication Number: 20130151525
Application Number: 13/616403
Family ID: 48572981
Filed: 2012-09-14
Published: 2013-06-13

United States Patent Application 20130151525
Kind Code: A1
Saha; Ankan; et al.
June 13, 2013
INFERRING EMERGING AND EVOLVING TOPICS IN STREAMING TEXT
Abstract
A method, system and computer program product for inferring
topic evolution and emergence in a set of documents. In one
embodiment, the method comprises forming a group of matrices using
text in the documents, and analyzing these matrices to identify
evolving topics and emerging topics. The matrices include a matrix
X identifying a multitude of words in each of the documents, a
matrix W identifying a multitude of topics in each of the
documents, and a matrix H identifying a multitude of words for each
of the multitude of topics. These matrices are analyzed to identify
the evolving and emerging topics. In an embodiment, two forms of
temporal regularizers are used to help identify the evolving and
emerging topics. In another embodiment, a two-stage approach
involving detection and clustering is used to help identify the
evolving and emerging topics.
Inventors:
Saha; Ankan; (Chicago, IL)
Banerjee; Arindam; (Roseville, MN)
Kasiviswanathan; Shiva P.; (White Plains, NY)
Lawrence; Richard D.; (Ridgefield, CT)
Melville; Prem; (New York, NY)
Sindhwani; Vikas; (Hawthorne, NY)
Ting; Edison L.; (San Jose, CA)

Applicant:
Name                       City          State  Country
Saha; Ankan                Chicago       IL     US
Banerjee; Arindam          Roseville     MN     US
Kasiviswanathan; Shiva P.  White Plains  NY     US
Lawrence; Richard D.       Ridgefield    CT     US
Melville; Prem             New York      NY     US
Sindhwani; Vikas           Hawthorne     NY     US
Ting; Edison L.            San Jose      CA     US

Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION (Armonk, NY)
Family ID: 48572981

Appl. No.: 13/616403

Filed: September 14, 2012

Related U.S. Patent Documents

Application Number   Filing Date   Patent Number
13315798             Dec 9, 2011
13616403

Current U.S. Class: 707/737; 707/E17.089
Current CPC Class: G06F 40/30 20200101; G06F 16/316 20190101
Class at Publication: 707/737; 707/E17.089
International Class: G06F 17/30 20060101 G06F017/30
Claims
1. A method of inferring topic evolution and emergence in a
multitude of documents, comprising: forming a group of matrices
using text in the documents, said group of matrices including a
first matrix X identifying a multitude of words in each of the
documents, a second matrix W identifying a multitude of topics in
each of the documents, and a third matrix H identifying a multitude
of words for each of said multitude of topics; and analyzing said
group of matrices to identify a first group of said multitude of
topics as evolving topics and a second group of said multitude of
topics as emerging topics.
2. The method according to claim 1, wherein said multitude of
documents comprise a sequence of streaming documents, each of the
documents being associated with a timepoint t, in a defined period
of time T, and wherein: the forming the group of matrices using
data in the documents includes: forming a first sequence of
matrices X(t), each of the matrices X(t) identifying a multitude of
words in each of a set of the documents associated with the
timepoints within a defined sliding window in the time period T;
forming a second sequence of matrices W(t), each of the matrices
W(t) identifying a multitude of topics in said set of documents
associated with the timepoints within said defined window; and
forming a third sequence of matrices H(t), each of the matrices
H(t) identifying a multitude of words for each of the topics
identified in an associated one of the matrices W(t); and the
analyzing the group of matrices includes using a defined equation
including the matrices X(t), W(t) and H(t), to identify the
evolving and the emerging topics.
3. The method according to claim 2, wherein: said defined equation includes a first regularizer μ to enforce smooth evolution of the evolving topics via constraints on an amount of drift allowed by the evolving topics, and a second regularizer Ω to apply a topic bandwidth for early detection of the emerging topics to extract smooth trends of candidate emerging topics; and said defined equation is an objective function:

$$(W^*, H(t)) \equiv \arg\min_{W,H} \|X(t-w,t) - WH\|_{\mathrm{fro}}^2 + \mu\,\Omega(W)$$
4. The method according to claim 2, wherein: said defined equation includes solving an ℓ₁ dictionary learning problem to identify evolving topics, and using a reconstruction error to identify novel documents; the analyzing the group of matrices further includes clustering said novel documents to identify emerging topics; and said defined equation is an objective function:

$$W^*, H^* = \arg\min_{W,H} \|X - WH\|_1 + \lambda \|W\|_1 \quad \text{such that } W, H \ge 0.$$
5. The method according to claim 1, wherein the forming the group
of matrices using text in the documents includes using the first
matrix to form the second and third matrices.
6. The method according to claim 1, wherein the analyzing the group
of matrices includes using a defined equation, including the first,
second and third matrices, to identify the evolving topics and the
emerging topics.
7. The method according to claim 6, wherein said defined equation
includes a regularizer to enforce smooth evolution of the evolving
topics via constraints on an amount of drift allowed by the
evolving topics.
8. The method according to claim 6, wherein: said defined equation includes a regularizer to apply a topic bandwidth for early detection of the emerging topics; and the regularizer extracts smooth trends of candidate emerging topics.
9. The method of claim 6, wherein the analyzing the group of matrices further includes: using said defined equation to identify novel documents based on reconstruction error; and clustering the novel documents to identify emerging topics that are used for updating the evolving topics.
10. The method according to claim 9, wherein: the using the defined equation includes applying a threshold on the reconstruction errors obtained from a solution of said defined equation to identify the novel documents; and the clustering the novel documents includes using a given clustering algorithm to cluster the novel documents.
Description
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application is a continuation of copending U.S. patent
application Ser. No. 13/315,798, filed Dec. 9, 2011, the entire
content and disclosure of which is hereby incorporated herein by
reference.
BACKGROUND OF THE INVENTION
[0002] The present invention generally relates to document
analysis, and more specifically, to inferring topic evolution and
emergence in streaming documents.
[0003] Learning a dictionary of basis elements with the objective
of building compact data representations is a problem of
fundamental importance in statistics, machine learning and signal
processing. In many settings, data points appear as a stream of
high dimensional feature vectors. Streaming datasets present new
twists to the problem. On one hand, basis elements need to be
dynamically adapted to the statistics of incoming datapoints, while
on the other hand, many applications require early detection of
rising new trends. The analysis of social media streams formed by
tweets and blog posts is a prime example of such a setting, where
topics of social discussions need to be continuously tracked and
new emerging themes need to be rapidly detected.
[0004] Consider the problem of building compact, dynamic
representations of streaming datasets such as those that arise in
social media. By constructing such representations, "signal" can be
separated from "noise" and essential data characteristics can be
continuously summarized in terms of a small number of human
interpretable components. In the context of social media
applications, this maps to the discovery of unknown "topics" from a
streaming document collection. Each new batch of documents arriving
at a timepoint is completely unorganized and may contribute either
to ongoing unknown topics of discussion (potentially causing
underlying topics to drift over time) and/or initiate new themes
that may or may not become significant going forward, and/or simply
inject irrelevant "noise".
[0005] While the dominant body of previous work in dictionary
learning and topic modeling has focused on solving batch learning
problems, a real deployment scenario in social media applications
truly requires forms of online learning. The user of such a system
is less interested in a one-time analysis of topics in a document
archive, and more in being able to follow ongoing evolving
discussions and being vigilant of any emerging themes that might
require immediate action. Several papers have proposed dynamic
topic and online dictionary learning models (see [D. Blei and J.
Lafferty, Dynamic topic models, in ICML, 2006; Tzu-Chuan Chou and
Meng Chang Chen, Using Incremental PLSI for Threshold-Resilient
Online Event Analysis, IEEE transactions on Knowledge and Data
Engineering, 2008; A. Gohr, H. Hinneburg, R. Schult, and M.
Spiliopoulou, Topic evolution in a stream of documents, in SDM,
2009; and J. Mairal, F. Bach, J. Ponce and G. Sapiro, Online
learning for matrix factorization and sparse coding, JMLR, 2010]
and references therein) that either exploit temporal order of
documents in offline batch mode or are limited to handling a fixed
bandwidth of topics with no explicit algorithmic constructs to
attempt to detect emerging themes early.
BRIEF SUMMARY
[0006] Embodiments of the invention provide a method, system and
computer program product for inferring topic evolution and
emergence in a multitude of documents. In one embodiment, the
method comprises forming a group of matrices using data in the
documents, and analyzing this group of matrices to identify
evolving topics and emerging topics. This group of matrices
includes a first matrix X identifying a multitude of words in each
of the documents, a second matrix W identifying a multitude of
topics in each of the documents, and a third matrix H identifying a
multitude of words for each of said multitude of topics. These
matrices are analyzed to identify a first group of said multitude
of topics as the evolving topics and a second group of said
multitude of topics as the emerging topics.
[0007] In an embodiment, the input is a sequence of streaming documents, and each of the documents is associated with a timepoint tᵢ. The group of matrices may include a first sequence of matrices X(t), a second sequence of matrices W(t), and a third sequence of matrices H(t). Each of the matrices X(t)
identifies a multitude of words in each of a set of the documents
associated with the timepoints within a defined sliding window w in
a time period T. Each of the matrices W(t) identifies a multitude
of topics in said set of documents associated with the timepoints
within the defined window, and each of the matrices H(t) identifies
a multitude of words for each of the topics identified in the
matrices W(t).
[0008] In one embodiment, groups of matrices are analyzed using a
defined equation, including the matrices X(t), W(t) and H(t), to
identify the evolving and the emerging topics. In an embodiment,
the defined equation includes first and second regularizers. The first regularizer μ enforces a smooth evolution of the evolving topics via constraints on an amount of drift allowed by the evolving topics. The second regularizer Ω applies a topic bandwidth for early detection of the emerging topics to extract smooth trends of candidate emerging topics.
[0009] In one embodiment, said defined equation is an objective function:

$$(W^*, H(t)) \equiv \arg\min_{W,H} \|X(t-w,t) - WH\|_{\mathrm{fro}}^2 + \mu\,\Omega(W)$$

such that W, H ≥ 0, where X(t−w, t) refers to the document-term matrix in the time interval (t−w) to t.
[0010] In another embodiment, groups of matrices are analyzed using
a defined equation, including the matrices X(t), W(t) and H(t), to
identify the emerging topics. In this embodiment, a two-stage approach based on ℓ₁ dictionary learning is used to detect emerging topics.

[0011] In one embodiment, said defined equation is an objective function:

$$(W^*, H(t)) \equiv \arg\min_{W,H} \|X(t-w,t) - WH\|_{\mathrm{fro}}^2 + \mu\,\Omega(W) \quad \text{such that } W, H \ge 0$$

[0012] In an embodiment, said defined equation is an objective function:

$$W^*, H^* = \arg\min_{W,H} \|X - WH\|_1 + \lambda \|W\|_1 \quad \text{such that } W, H \ge 0$$
[0013] Embodiments of the invention provide an online learning
framework to consistently reassemble the data streams into coherent
threads of evolving components while also serving as an early
warning or detection system for new, rapidly emerging trends.
[0014] In an embodiment, the invention provides a framework for online dictionary learning to handle streaming non-negative data matrices with a possibly growing number of components. Embodiments of the invention are rooted in non-negative matrix factorizations (NMF) [D. Lee and H. S. Seung, Learning the parts of objects using non-negative matrix factorizations, Nature, 1999], whose unregularized variants for generalized KL-divergence minimization are equivalent to pLSI [C. Ding, T. Li, and W. Peng, On the equivalence between non-negative matrix factorizations and probabilistic latent semantic analysis, Computational Statistics and Data Analysis, 2008]. For squared loss, NMF finds a low-rank approximation to a data matrix X by minimizing $\|X - WH\|_{\mathrm{fro}}^2$ under non-negativity and scaling constraints on the factors W and H. It is common to add some form of ℓ₁ or ℓ₂ regularization, e.g., to encourage sparse factors. If X is an N×D document-term matrix, then W is an N×K matrix of topic encodings of documents, while H is a K×D matrix of topic-word associations, whose rows are the dictionary elements learnt by the NMF approach.
[0015] In one embodiment of the invention, given streaming
matrices, a sequence of NMFs is learned with two forms of temporal
regularization. The first regularizer enforces smooth evolution of
topics via constraints on amount of drift allowed. The second
regularizer applies to an additional "topic bandwidth" introduced
into the system for early detection of emerging trends. Implicitly,
this regularizer extracts smooth trends of candidate emerging
topics and then encourages the discovery of those that are rapidly
growing over a short time window. This setup is formulated as an
objective function which reduces to rank-one subproblems involving
projections onto the probability simplex and SVM-like optimization
with additional non-negativity constraints. Embodiments of the
invention provide efficient algorithms for finding stationary
points of this objective function. Since they mainly involve
matrix-vector operations and linear-time subroutines, these
algorithms scale gracefully to large datasets.
[0016] In one embodiment of the invention, given streaming
matrices, a sequence of NMFs is learned under a robust objective
function. The objective function is a combination of the
ℓ₁-norms of a sparse error (robust reconstruction) and a
sparse code, which appears well suited for sparse high-dimensional
datasets such as those that arise in text applications.
Additionally, there are non-negativity constraints on the sparse
code and dictionary, to maintain interpretability.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
[0017] FIG. 1 illustrates static non-negative matrix
factorizations.
[0018] FIG. 2 illustrates dynamic NMFs with temporal
regularization.
[0019] FIG. 3 shows the temporal profile of an emerging topic and
overall dynamics in a simulated dataset used in an empirical study
of an embodiment of this invention.
[0020] FIG. 4 shows the tracking performance as a function of an evolution parameter δ in an embodiment of the invention.

[0021] FIG. 5 shows the effectiveness of emergence regularization in an embodiment of the invention.

[0022] FIGS. 6 and 7 illustrate an approach of using a robust ℓ₁ objective function for detecting emerging topics.

[0023] FIG. 8 shows the effectiveness of the robust ℓ₁/ℓ₁ objective function for emerging topic detection.
[0024] FIG. 9 depicts a computer system that may be used in the
implementation of the present invention.
DETAILED DESCRIPTION
[0025] As will be appreciated by one skilled in the art,
embodiments of the present invention may be embodied as a system,
method or computer program product. Accordingly, embodiments of the
present invention may take the form of an entirely hardware
embodiment, an entirely software embodiment (including firmware,
resident software, micro-code, etc.) or an embodiment combining
software and hardware aspects that may all generally be referred to
herein as a "circuit," "module" or "system." Furthermore,
embodiments of the present invention may take the form of a
computer program product embodied in any tangible medium of
expression having computer usable program code embodied in the
medium.
[0026] Any combination of one or more computer usable or computer
readable medium(s) may be utilized. The computer-usable or
computer-readable medium may be, for example but not limited to, an
electronic, magnetic, optical, electromagnetic, infrared, or
semiconductor system, apparatus, device, or propagation medium.
More specific examples (a non-exhaustive list) of the
computer-readable medium would include the following: an electrical
connection having one or more wires, a portable computer diskette,
a hard disk, a random access memory (RAM), a read-only memory
(ROM), an erasable programmable read-only memory (EPROM or Flash
memory), an optical fiber, a portable compact disc read-only memory
(CDROM), an optical storage device, a transmission media such as
those supporting the Internet or an intranet, or a magnetic storage
device. Note that the computer-usable or computer-readable medium
could even be paper or another suitable medium, upon which the
program is printed, as the program can be electronically captured,
via, for instance, optical scanning of the paper or other medium,
then compiled, interpreted, or otherwise processed in a suitable
manner, if necessary, and then stored in a computer memory. In the
context of this document, a computer-usable or computer-readable
medium may be any medium that can contain, store, communicate,
propagate, or transport the program for use by or in connection
with the instruction execution system, apparatus, or device. The
computer-usable medium may include a propagated data signal with
the computer-usable program code embodied therewith, either in
baseband or as part of a carrier wave. The computer usable program
code may be transmitted using any appropriate medium, including but
not limited to wireless, wireline, optical fiber cable, RF,
etc.
[0027] Computer program code for carrying out operations of the
present invention may be written in any combination of one or more
programming languages, including an object oriented programming
language such as Java, Smalltalk, C++ or the like and conventional
procedural programming languages, such as the "C" programming
language or similar programming languages. The program code may
execute entirely on the user's computer, partly on the user's
computer, as a stand-alone software package, partly on the user's
computer and partly on a remote computer or entirely on the remote
computer or server. In the latter scenario, the remote computer may
be connected to the user's computer through any type of network,
including a local area network (LAN) or a wide area network (WAN),
or the connection may be made to an external computer (for example,
through the Internet using an Internet Service Provider).
[0028] The present invention is described below with reference to
flowchart illustrations and/or block diagrams of methods, apparatus
(systems) and computer program products according to embodiments of
the invention. It will be understood that each block of the
flowchart illustrations and/or block diagrams, and combinations of
blocks in the flowchart illustrations and/or block diagrams, can be
implemented by computer program instructions. These computer
program instructions may be provided to a processor of a general
purpose computer, special purpose computer, or other programmable
data processing apparatus to produce a machine, such that the
instructions, which execute via the processor of the computer or
other programmable data processing apparatus, create means for
implementing the functions/acts specified in the flowchart and/or
block diagram block or blocks. These computer program instructions
may also be stored in a computer-readable medium that can direct a
computer or other programmable data processing apparatus to
function in a particular manner, such that the instructions stored
in the computer-readable medium produce an article of manufacture
including instruction means which implement the function/act
specified in the flowchart and/or block diagram block or
blocks.
[0029] The computer program instructions may also be loaded onto a
computer or other programmable data processing apparatus to cause a
series of operational steps to be performed on the computer or
other programmable apparatus to produce a computer implemented
process such that the instructions which execute on the computer or
other programmable apparatus provide processes for implementing the
functions/acts specified in the flowchart and/or block diagram
block or blocks.
[0030] Embodiments of the invention provide a method, system and computer program product for inferring topic evolution and emergence in a multitude of documents. In an embodiment, the invention provides a framework for online dictionary learning to handle streaming non-negative data matrices with a possibly growing number of components. With reference to FIG. 1, embodiments of the invention are rooted in non-negative matrix factorizations (NMF) [D. Lee and H. S. Seung, Learning the parts of objects using non-negative matrix factorizations, Nature, 1999], whose unregularized variants for generalized KL-divergence minimization are equivalent to pLSI [C. Ding, T. Li, and W. Peng, On the equivalence between non-negative matrix factorizations and probabilistic latent semantic analysis, Computational Statistics and Data Analysis, 2008]. For squared loss, NMF finds a low-rank approximation to a data matrix X 102 by minimizing $\|X - WH\|_{\mathrm{fro}}^2$ under non-negativity and scaling constraints on the factors W 104 and H 106. It is common to add some form of ℓ₁ or ℓ₂ regularization, e.g., to encourage sparse factors. If X is an N×D document-term matrix, then W is an N×K matrix of topic encodings of documents, while H is a K×D matrix of topic-word associations, whose rows are the dictionary elements learnt by the NMF approach.
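As a concrete illustration of the factorization just described (not part of the patent text), the sketch below minimizes $\|X - WH\|_{\mathrm{fro}}^2$ under non-negativity using the classical Lee-Seung multiplicative updates, with the rows of H rescaled to the probability simplex afterward; the embodiments below instead use rank-one residue iterations, so this is only a minimal baseline implementation.

```python
import numpy as np

def nmf_frobenius(X, K, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative N x D matrix X into W (N x K) and H (K x D)
    by minimizing ||X - WH||_fro^2 with Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = rng.random((N, K))
    H = rng.random((K, D))
    for _ in range(n_iter):
        # Multiplicative updates keep both factors entrywise nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    # Rescale rows of H to the probability simplex (topic-word distributions);
    # the compensating column scaling of W leaves the product WH unchanged.
    scale = H.sum(axis=1, keepdims=True) + eps
    H /= scale
    W *= scale.T
    return W, H
```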
[0031] In one embodiment, the ℓ₁ loss function $\|X - WH\|_1$ is used, with the motivation that the ℓ₁ loss function performs better when the underlying noise distribution is spiky.
[0032] FIG. 2 illustrates dynamic NMFs with temporal
regularization.
[0033] FIGS. 6 and 7 illustrate an approach of using a robust ℓ₁ objective function for detecting emerging topics.
[0034] Let {X(t) ∈ R^{N(t)×D}, t = 1, 2, . . .} denote a sequence of streaming matrices 152, where each row of X(t) represents an observation whose timestamp is t. In topic modeling applications over streaming documents, X(t) will represent the highly sparse document-term matrix observed at time t. X(t₁, t₂) is used to denote the document-term matrix formed by vertically concatenating {X(t), t₁ ≤ t ≤ t₂}. At the current timepoint t, the model consumes the incoming data X(t) and generates a factorization (W(t), H(t)) comprising K(t) topics.
[0035] One embodiment of this factorization stems from the following considerations: (1) The first K(t−1) topics in H(t) must be smooth evolutions of the K(t−1) topics found up to the previous timepoint, H(t−1). This is called the evolving set 154, and an evolution parameter, δ, is introduced which constrains the evolving set to reside within a box of size δ on the probability simplex around the previously found topics. With minor modifications, δ can also be made topic- or word-specific, e.g., to take topic volatility or word dominance into account. (2) A second consideration is the fast detection of emerging topics. At each timepoint, we inject additional topic bandwidth for this purpose. This is called the emerging set 156. Thus the topic variable H(t) can be partitioned into an evolving set of K(t−1) topics, H^ev, and an emerging set of K^em topics, H^em. Furthermore, it is assumed that emerging topics can be distinguished from noise based on their temporal profile. In other words, the number of documents that a true emerging topic associates with begins to rapidly increase. For this purpose, a short sliding time window ω is introduced over which topical trends are estimated. As discussed in more detail below, a novel regularizer Ω(W^em) is defined that consumes the document-topic associations for the emerging bandwidth and penalizes components that are static or decaying, so that learnt emerging topics are more likely to be ones that are rising in strength. (3) It is assumed that topics in the emerging set become part of the evolving set going forward, unless some of them are discarded as noise by manual guidance from the user or using criteria such as net current strength. In experiments, all topics in the emerging set were retained. This embodiment is discussed more below.
[0036] The discussion above motivates the following objective function, which is optimized at every timepoint t:

$$(W^*, H(t)) \equiv \arg\min_{W,H} \|X(t-w,t) - WH\|_{\mathrm{fro}}^2 + \mu\,\Omega(W) \qquad (1)$$

[0037] This objective function is minimized under the following constraints:

$$W, H \ge 0 \qquad (2)$$

$$\sum_{j=1}^{D} H_{ij} = 1 \quad \forall i \in [K(t-1) + K^{em}] \qquad (3)$$

$$\max\left(H_{ij}(t-1) - \delta,\ 0\right) \le H_{ij}(t) \le \min\left(H_{ij}(t-1) + \delta,\ 1\right), \quad \forall i \in [K(t-1)],\ \forall j \in [D] \qquad (4)$$
[0038] W(t) is then extracted from the bottom rows of W* that correspond to X(t). The system is then said to have tagged the i-th document (row) in X(t) with the most dominating topic, arg maxⱼ W(t)(i, j), which gives a clustering of documents. Note that the regularizer Ω(W), defined below, implicitly only operates on those columns of W that correspond to emerging topics. Note that W* is prepared for initializing parts of W in the next run. This hot-start mechanism significantly accelerates convergence.
[0039] In another embodiment of this factorization, the task of
detecting novel signals in streaming datasets is formulated as a
sparse signal representation problem. A signal is represented with
a sparse code over an existing dictionary along with a sparse error
term. A novel signal is detected based on the lack of sparsity in
such a representation. While one application is emerging topic
detection on streaming text, the methodology applies more broadly
to other domains. This embodiment is discussed in more detail
below.
[0040] In this embodiment, the objective function is a combination
of the ℓ₁-norms of a sparse error (robust reconstruction) and a sparse code, which appears well suited for sparse high-dimensional
datasets such as those that arise in text applications.
Additionally, there are non-negativity constraints on the sparse
code and dictionary, to maintain interpretability.
[0041] A new practical alternating direction method (ADM) is used
to solve various optimization problems appearing in the
formulation. ADM has recently gathered significant attention in the
Machine Learning community due to its wide applicability to a range
of learning problems with complex objective functions [S. Boyd, N.
Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed
Optimization and Statistical Learning via the Alternating Direction
Method of Multipliers].
Temporal Regularization
[0042] Generally, the regularization operator Ω(W) is formulated by chaining together trend extraction with a margin-based loss function to penalize static or decaying topics. We begin with a brief discussion of trend filtering.
[0043] Hodrick-Prescott (HP) Trend Filtering: Let {y_t}_{t=1}^T be a univariate time series which is composed of an unknown, slowly varying trend component {x_t}_{t=1}^T perturbed by random noise {z_t}_{t=1}^T. Trend filtering is the task of recovering the trend component {x_t} given {y_t}. The Hodrick-Prescott filter is an approach to estimate the trend assuming that it is smooth and that the random residual is small. It is based on solving the following optimization problem:

$$\{x_t\} = \arg\min\ \frac{1}{2}\sum_{t=1}^{T}(y_t - x_t)^2 + \lambda \sum_{t=2}^{T-1}\left((x_{t+1} - x_t) - (x_t - x_{t-1})\right)^2 \qquad (5)$$
[0044] Let us introduce the second-order difference matrix D ∈ R^{(T−2)×T} such that

$$D(i,i) = 1,\quad D(i,i+1) = -2,\quad D(i,i+2) = 1 \quad \forall i \in [T-2].$$

Then it is easy to see that the solution to the optimization problem of Equation (5) is given by x = [I + 2λDᵀD]⁻¹y, where we use the notation y = (y₁ . . . y_T)ᵀ, x = (x₁ . . . x_T)ᵀ. We use F to denote [I + 2λDᵀD]⁻¹, the linear smoothing operator associated with the Hodrick-Prescott filter. Given the time series y, the Hodrick-Prescott (HP) trend estimate simply is x = Fy.
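A direct numpy sketch of this smoothing operator follows (an illustration, not the patent's code); it forms F densely and inverts it, which is adequate for the short sliding windows used here, though a banded solver would be preferred for long series.

```python
import numpy as np

def hp_trend(y, lam):
    """Hodrick-Prescott trend estimate x = (I + 2*lam*D^T D)^{-1} y, where D is
    the (T-2) x T second-order difference matrix with rows (1, -2, 1)."""
    T = len(y)
    D = np.zeros((T - 2, T))
    for i in range(T - 2):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
    F = np.linalg.inv(np.eye(T) + 2.0 * lam * D.T @ D)  # smoothing operator F
    return F @ y, F
```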
[0045] Loss Function for Measuring Emerging Trend: Let x = Fy be the HP trend of the time series y. Let D be the forward difference operator, i.e., the only non-zero entries of D are D_{i,i} = −1 and D_{i,i+1} = 1. If z = Dx, then z_i = x_{i+1} − x_i reflects the discrete numerical gradient in the trend x. Given z_i, we define a margin-based loss function (the ℓ₂-hinge loss), L(z_i) = c_i max(0, δ − z_i)², where if the growth in the trend at time i is sufficient, i.e., greater than δ, the loss evaluates to 0. If the growth is insufficient, the loss evaluates to c_i(δ − z_i)², where c_i is the weight of timepoint i, which typically increases with i. For a vector z, the loss is added over the components. In terms of the original time series y, this loss function is

$$L(y) = \sum_{i=1}^{T-1} c_i \max\left(0,\ \delta - (DFy)_i\right)^2 \qquad (6)$$
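Continuing the sketch above, the loss of Equation (6) can be evaluated by smoothing y with F, taking forward differences, and applying the weighted squared hinge; here c is assumed to hold the T−1 timepoint weights c_i.

```python
import numpy as np

def emerging_trend_loss(y, F, delta, c):
    """Equation (6): penalize a smoothed trend whose growth falls short of delta."""
    x = F @ y      # HP trend estimate x = Fy
    z = np.diff(x) # forward differences z_i = x_{i+1} - x_i, i.e., (DFy)_i
    return float(np.sum(c * np.maximum(0.0, delta - z) ** 2))
```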
[0046] Optimization Problem: As documents arrive over t ∈ [T], we use S to denote a T×N time-document matrix, where S(i,j) = 1 if document j has timestamp i. Noting that each column w of W denotes the document associations for a given topic, Sw captures the time series of the total contribution of the topic w over the time frame of S. Finally, we concretize equation (1) as the following optimization problem:

$$\arg\min_{W,H \ge 0}\ \|X - WH\|_{\mathrm{fro}}^2 + \mu \sum_{w_i \in W^{em}} L(S w_i) \qquad (7)$$

subject to the constraints in Equations (3) and (4).
[0047] We optimize the above objective using the rank-one residue iteration (RRI) approach [Ngoc-Diep Ho, Paul Van Dooren, and Vincent D. Blondel, Descent methods for nonnegative matrix factorization, Numerical Linear Algebra in Signals, abs/0801.3199, 2007]. We approximate X as the sum of rank-one matrices w_i h_iᵀ and optimize cyclically over the individual w_i and h_i variables while keeping all other variables fixed. This results in three specific sub-problems, each of which requires an efficient projection of a vector onto an appropriate space.
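One RRI sweep can be sketched as follows (illustrative only): `project_h` and `project_w` are hypothetical callbacks standing in for the three projection sub-problems derived in the following paragraphs (simplex projection, possibly with box constraints, for h_i, and the non-negative or hinge-regularized update for w_i).

```python
import numpy as np

def rri_sweep(X, W, H, project_h, project_w):
    """One cyclic sweep of rank-one residue iteration: for each topic i, form the
    residual R excluding topic i and update h_i, then w_i, all else held fixed."""
    K = W.shape[1]
    for i in range(K):
        R = X - W @ H + np.outer(W[:, i], H[i, :])   # residual without topic i
        H[i, :] = project_h(R.T @ W[:, i] / (W[:, i] @ W[:, i] + 1e-12), i)
        W[:, i] = project_w(R @ H[i, :] / (H[i, :] @ H[i, :] + 1e-12), i)
    return W, H
```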
[0048] Optimization over h_i: Holding all variables except h_i fixed and omitting additive constants independent of h_i, equation (7) can be reduced to $\arg\min_{h_i \in \mathcal{C}} \|R - w_i h_i^T\|_{\mathrm{fro}}^2$ for an appropriate R ∈ R^{N×D} independent of h_i. Simple algebraic operations yield that the above is equivalent to

$$h_i^* = \arg\min_{h_i \in \mathcal{C}}\ \left\| h_i - \frac{R^T w_i}{\|w_i\|_2^2} \right\|^2 \qquad (8)$$
[0049] Case 1: h_i is evolving: For an evolving topic, the optimization needs to be performed under the constraints of equations (3) and (4). Thus the optimum h_i* is obtained by projection onto the set C = {h_i : h_i ∈ Δ_D, l_j ≤ h_ij ≤ u_j} for appropriate constants l_j and u_j. This is equivalent to a projection onto a simplex with box constraints. Adapting a method due to [P. M. Pardalos and N. Kovoor, An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds, Mathematical Programming, 46:321-328, 1990], we can find the minimizer in O(D) time, i.e., linear in the number of coordinates.
[0050] Case 2: h_i is emerging: For an emerging topic, C = {h_i : h_i ∈ Δ_D} and the optimization equation (8) becomes equivalent to a projection onto the simplex Δ_D. The same algorithm [P. M. Pardalos and N. Kovoor, An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds, Mathematical Programming, 46:321-328, 1990] again gives us the minimizer in linear time O(D).
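A sketch of the plain simplex projection used in Case 2 is given below. For brevity it uses the common O(D log D) sort-based method rather than the linear-time Pardalos-Kovoor algorithm cited above; the evolving case (Case 1) would additionally enforce the box bounds l_j ≤ h_ij ≤ u_j.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {h : h >= 0, sum(h) = 1}."""
    u = np.sort(v)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    j = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)      # shift making the result sum to 1
    return np.maximum(v + theta, 0.0)
```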
[0051] Optimization over evolving w_i: When w_i ∈ W^ev, the second term in equation (7) does not contribute, and using the RRI scheme the optimization problem can be written as w_i* = arg min_{w_i ≥ 0} ‖R − w_i h_iᵀ‖². Similar to equation (8), simple algebraic operations yield that the above minimization is equal to the following simple projection problem:

$$\arg\min_{w_i \ge 0}\ \left\| w_i - \frac{R h_i}{\|h_i\|_2^2} \right\|^2 \qquad (9)$$

The corresponding minimizer is simply given by

$$w_{ij} = \max\left(0,\ \frac{(R h_i)_j}{\|h_i\|_2^2}\right).$$
[0052] Emerging w_i: When w_i ∈ W^em, the RRI step of the corresponding optimization problem, arg min_{w_i ≥ 0} ‖R − w_i h_iᵀ‖² + μL(Sw_i), looks like

$$\arg\min_{w_i \ge 0}\ \left\| w_i - \frac{R h_i}{\|h_i\|_2^2} \right\|^2 + \frac{\mu\, L(S w_i)}{\|h_i\|_2^2} \qquad (10)$$

[0053] Noting that we choose L to be the ℓ₂ hinge loss, equation (10) leads to

$$\arg\min_{w_i \ge 0}\ \left\| w_i - \frac{R h_i}{\|h_i\|_2^2} \right\|^2 + \frac{\mu}{\|h_i\|_2^2} \sum_{i=1}^{T-1} c_i \max\left(0,\ \delta - q_i^T w_i\right)^2$$

[0054] where q_iᵀ = (DFS)_{i,·}. This can be converted into a generic minimization problem of the form

$$\min_{w \ge 0}\ J(w) = \sum_i \left(\max\left(0,\ c_i\left(\delta_i - \langle w, x_i \rangle\right)\right)\right)^2 + \frac{\lambda}{2}\left\|w - w_0\right\|^2 \qquad (11)$$
for some constant w_0. This is precisely the SVM optimization problem with additional non-negativity constraints on w_i. This objective is minimized using a projected gradient algorithm on the primal objective directly, as it is smooth and therefore the gradient is well defined. Thus

$$w^{(k+1)} = \Pi\left(w^{(k)} - \eta_k \nabla J\left(w^{(k)}\right)\right) \qquad (12)$$

where Π is the projection operator Π(s) = max(s, 0) and

$$\nabla J\left(w^{(k)}\right) = -2\sum_i \max\left(c_i\left(\delta_i - \left\langle w^{(k)}, x_i \right\rangle\right),\ 0\right) x_i + \lambda\left(w^{(k)} - w_0\right)$$

The best rate η_k at the k-th step is chosen according to [C. J. Lin, Projected gradient methods for non-negative matrix factorization, In Neural Computation, 2007]. In particular, η_k = β^{t_k} for some constant β, where t_k is the smallest integer for which

$$J\left(w^{(k+1)}\right) - J\left(w^{(k)}\right) \le \sigma\left\langle \nabla J\left(w^{(k)}\right),\ w^{(k+1)} - w^{(k)} \right\rangle \qquad (13)$$

[0055] At every iteration, η_k is hot-started from η_{k−1}, and finally it is the largest η which satisfies Equation (13).
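A sketch of this projected gradient scheme follows, under illustrative assumptions: `XC` stacks the vectors x_i as rows, `delta` and `c` may be scalars or per-row arrays, and the line search is a simplified backtracking in the spirit of Equation (13) rather than Lin's exact rule.

```python
import numpy as np

def solve_nonneg_svm(XC, delta, c, lam, w0, beta=0.5, sigma=0.01, n_iter=100):
    """Projected gradient for Equation (11):
    min_{w >= 0} J(w) = sum_i max(0, c_i*(delta_i - <w, x_i>))^2
                        + (lam/2) * ||w - w0||^2."""
    def J(w):
        m = np.maximum(0.0, c * (delta - XC @ w))
        return np.sum(m ** 2) + 0.5 * lam * np.sum((w - w0) ** 2)

    def grad(w):
        m = np.maximum(0.0, c * (delta - XC @ w))
        return -2.0 * XC.T @ (c * m) + lam * (w - w0)

    w, eta = np.maximum(w0, 0.0), 1.0
    for _ in range(n_iter):
        g = grad(w)
        w_new = np.maximum(w - eta * g, 0.0)   # projection Pi(s) = max(s, 0)
        # Backtrack until the sufficient-decrease condition of Equation (13) holds.
        while J(w_new) - J(w) > sigma * g @ (w_new - w) and eta > 1e-12:
            eta *= beta
            w_new = np.maximum(w - eta * g, 0.0)
        w = w_new
    return w
```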
ℓ₁ Dictionary Learning Approach for Emerging Topic Detection
[0056] In the discussion below, the need for an ℓ₁ loss objective function is discussed, and an optimization algorithm that is used to solve an embodiment of the factorization is then presented.
[0057] Let H(t−1) ∈ R^{k×m} represent the dictionary after time t−1, where the dictionary H(t−1) is a compact summary representation of all the documents in X(1, t−1). Given a new document vector y with timestamp t, we check whether y can be represented as a sparse linear combination of the rows of H(t−1). The sparsest representation is the solution of:

$$\min_x\ \|x\|_0 \quad \text{such that } y = H(t-1)^T x,\ \text{with } x \ge 0 \qquad (13)$$

where ‖·‖₀ is the ℓ₀ norm counting the non-zero entries of the vector. However, in the general case, solving the above optimization problem is NP-hard and also hard to approximate [E. Amaldi and V. Kann, On the Approximability of Minimizing Nonzero Variables or Unsatisfied Relations in Linear Systems]. Therefore, instead of solving (13), we solve a convex relaxation of it:

$$\min_x\ \|x\|_1 \quad \text{such that } y = H(t-1)^T x,\ \text{with } x \ge 0 \qquad (14)$$
[0058] In most practical situations, equation (14) is not applicable because it may not be possible to represent y as H(t−1)ᵀx, e.g., if y has new words which are absent (i.e., have no support) in H(t−1). In such cases, one could represent y = H(t−1)ᵀx + e, where e is an unknown noise vector. In the presence of isotropic Gaussian noise, the ℓ₂-penalty on e = y − H(t−1)ᵀx gives the best approximation of x. However, for text documents (and in most other real scenarios), the noise vector e rarely satisfies the Gaussian assumption, and some of its coefficients contain large, impulsive values. In such scenarios, the ℓ₂-penalty on the loss function may give an extremely bad approximation of x. In such a real-world scenario, imposing an ℓ₁ penalty on the reconstruction error instead gives a more robust and better approximation of x. The following ℓ₁-formulation is used to recover x:

$$\min_x\ \left\|y - H(t-1)^T x\right\|_1 + \lambda \|x\|_1 \quad \text{such that } x \ge 0 \qquad (15)$$
[0059] Given a new document y with timestamp t and a dictionary H(t−1), equation (15) is solved to determine whether y is novel (with respect to dictionary H(t−1)) or not. If the objective value of (15) is "small," then y is well reconstructed by a linear combination of some basis vectors in H(t−1). Such documents are marked as non-novel and discarded. If instead the objective value is "large," then y has no good reconstruction among the basis vectors of the previous topics, thus suggesting novelty of y. We add such documents to the set Nvl_t.
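The detection logic just described can be summarized by the sketch below; `solve_l1_code` is a placeholder for any solver of Equation (15) that returns the code and the objective value (an ADM-based candidate is sketched in the alternating-directions discussion below), and ζ is the novelty threshold.

```python
def detect_novel(X_t, H_prev, lam, zeta, solve_l1_code):
    """Mark document j as novel when the optimal value of Equation (15)
    exceeds zeta, i.e., it is poorly reconstructed over dictionary H(t-1)."""
    novel = []
    for j, y in enumerate(X_t):               # each row of X(t) is a document
        _, obj = solve_l1_code(y, H_prev, lam)
        if obj > zeta:
            novel.append(j)
    return novel
```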
[0060] Dictionary Update: The dictionary is updated so that it forms a compact summary representation of all the documents in X(1, t). The dictionary is updated by minimizing

$$W(t), H(t) = \arg\min_{W,H}\ \|X(1,t) - WH\|_1 + \lambda \|W\|_1 \quad \text{such that } W, H \ge 0 \qquad (16)$$

For scalability, an online version of dictionary learning is used, where only H is updated and W is obtained from previous stages of the algorithm.
[0061] The algorithm alternates between a "detection stage," represented in FIG. 6, and a "dictionary learning stage," represented in FIG. 7. The detection stage at time t gets as input the dictionary H(t−1) and X(t), and for each document p_j in X(t), computes the best representation of p_j in terms of H(t−1) by solving equation (15) (where y is replaced by p_j). A document p_j is classified as novel if the objective value of equation (15) is above some chosen threshold ζ. Let Nvl_t be the set of documents that are marked as novel at time t. The set of novel documents is then passed to a clustering stage, represented at 160. The idea is to again use dictionary learning. Given as input a set of (novel) documents and the number of topics k₁ to be generated, a suitable modification of equation (16) is used to detect emerging topics. The idea is as follows: if Nvl_t represents the set of novel documents, we learn a dictionary with k₁ atoms, where each atom corresponds to an emerging topic. In other words, we minimize the following function over (R(t), S(t)):

$$R(t), S(t) = \arg\min_{R,S}\ \|Nvl_t - RS\|_1 + \lambda \|R\|_1 \quad \text{such that } R, S \ge 0 \qquad (17)$$

Since the size of Nvl_t is typically small, this function is solved using a simple iterative batch procedure, alternately fixing one of R(t) and S(t) and updating the other using the method of alternating directions.
[0062] The dictionary learning stage is performed in an online fashion. In the online setting, instead of using equation (16), the dictionary is updated by minimizing the following function over H:

$$H(t) = \arg\min_{H \ge 0}\ \|X(1,t) - W(1,t)H\|_1,$$

where W(1,t) = [x₁, x₂, . . .] are the sparse codes computed during the previous detection stages. This online dictionary learning framework has a similar structure to that of [J. Mairal, F. Bach, J. Ponce, and G. Sapiro, Online Learning for Matrix Factorization and Sparse Coding].
[0063] To speed up the algorithms, the method of alternating directions is used to solve the various optimization problems. We start with a brief review of the general framework of ADM from [J. Yang and Y. Zhang, Alternating Direction Algorithms for L1-Problems in Compressive Sensing]. Let p(x): Rᵃ → R and q(y): Rᵇ → R be convex functions, F ∈ R^{c×a}, G ∈ R^{c×b}, and z ∈ Rᶜ. Consider the following optimization problem:

$$\min_{x,y}\ p(x) + q(y) \quad \text{s.t. } Fx + Gy = z,$$

where the variable vectors x and y are separate in the objective and coupled only in the constraint. The augmented Lagrangian for the above problem is given by

$$L(x, y, \rho) = p(x) + q(y) + \rho^T(z - Fx - Gy) + \frac{\beta}{2}\|z - Fx - Gy\|_2^2, \qquad (18)$$

where ρ is the Lagrange multiplier and β > 0 is a penalty parameter. ADM utilizes the separable form of equation (18) and replaces the joint minimization over x and y with two simpler problems: ADM first minimizes L over x, then over y, and then applies a proximal minimization step with respect to the Lagrange multiplier ρ.
[0064] Let R₊ be the set of positive real numbers. In the detection stage, for each document p_j, the following program is solved:

$$\min_{x,e}\ \|e\|_1 + \lambda \|x\|_1 \quad \text{such that } e = p_j - H(t-1)^T x$$

The augmented Lagrangian form of the above is then

$$L(x, e, \rho) = \|e\|_1 + \lambda \|x\|_1 + \rho^T\left(p_j - H(t-1)^T x - e\right) + \frac{\beta}{2}\left\|p_j - H(t-1)^T x - e\right\|_2^2 \qquad (19)$$

ADM is now applied to the above Lagrangian. Assume that we have (x_(i), e_(i), ρ_(i)); then (x_(i+1), e_(i+1), ρ_(i+1)) is constructed as follows. First, for fixed x_(i) and ρ_(i), e is updated by solving

$$\min_e\ \|e\|_1 + \rho_{(i)}^T\left(p_j - H(t-1)^T x_{(i)} - e\right) + \frac{\beta}{2}\left\|p_j - H(t-1)^T x_{(i)} - e\right\|_2^2$$

The minimum value of the above optimization is attained by setting

$$e = \mathrm{soft}\left(p_j - H(t-1)^T x_{(i)} + \rho_{(i)}/\beta,\ 1/\beta\right),$$

where soft(r, T) = sign(r) × max{|r| − T, 0} is applied entrywise and sign(r) is the sign of the vector r.
[0065] Now, for fixed e_(i+1) and ρ_(i), a simple manipulation shows that we can obtain the minimizing x by solving

$$\min_x\ \lambda \|x\|_1 + \frac{\beta}{2}\left\|p_j - H(t-1)^T x - e_{(i+1)} + \rho_{(i)}/\beta\right\|_2^2.$$

However, instead of solving the above optimization exactly, it is approximated by

$$\min_x\ \lambda \|x\|_1 + \beta\left(g_{(i)}^T\left(x - x_{(i)}\right) + \frac{1}{2\tau}\left\|x - x_{(i)}\right\|_2^2\right) \qquad (20)$$

where τ > 0 is a proximal parameter and g_(i) = H(t−1)(H(t−1)ᵀx_(i) + e_(i+1) − p_j − ρ_(i)/β). The minimum value of equation (20) is attained by setting x = max{x_(i) − τg_(i) − (λτ)/β, 0}. Now, given fixed x_(i+1) and e_(i+1), the multiplier ρ is updated as ρ_(i+1) = ρ_(i) + γβ(p_j − H(t−1)ᵀx_(i+1) − e_(i+1)). The ADM equations for updating the dictionary H(·) are derived similarly.
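Putting the updates of paragraphs [0064] and [0065] together gives the following sketch of the detection-stage solver. It is one plausible reading of the recursion above rather than the patent's own code: the step size τ is set from the spectral norm of H (a standard choice not specified in the text), and the returned objective value is what the earlier `solve_l1_code` placeholder would report.

```python
import numpy as np

def soft(r, T):
    """Entrywise soft-thresholding: sign(r) * max(|r| - T, 0)."""
    return np.sign(r) * np.maximum(np.abs(r) - T, 0.0)

def adm_l1_code(y, H, lam, beta=1.0, gamma=1.0, n_iter=200):
    """ADM for Equation (15): min_{x >= 0} ||y - H^T x||_1 + lam * ||x||_1,
    with residual variable e = y - H^T x as in Equation (19)."""
    tau = 1.0 / (np.linalg.norm(H, 2) ** 2 + 1e-12)   # proximal step (assumed)
    x = np.zeros(H.shape[0])
    rho = np.zeros_like(y)
    for _ in range(n_iter):
        e = soft(y - H.T @ x + rho / beta, 1.0 / beta)        # e-update
        g = H @ (H.T @ x + e - y - rho / beta)                # g_(i) of Eq. (20)
        x = np.maximum(x - tau * g - lam * tau / beta, 0.0)   # x-update, x >= 0
        rho = rho + gamma * beta * (y - H.T @ x - e)          # multiplier update
    obj = np.abs(y - H.T @ x).sum() + lam * np.abs(x).sum()
    return x, obj
```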
Empirical Studies for Detecting Evolving and Emerging Topics using Temporal Regularizers
[0066] The goal of this empirical study is to understand the influence of temporal regularization (evolution and emergence parameters) on the effectiveness of topic detection and tracking.
To enable quantitative evaluation, two topic-labeled datasets were
presented to the algorithm as streams, and the resulting topics
generated by the system were benchmarked against ground truth topic
assignments.
[0067] Datasets: Two datasets were used for the experiments. The Simulation dataset consists of 1000 documents with 2500 terms divided into 25 topics accumulated over 31 days. We generated a (nearly) low-rank document-term matrix, X = WH + S, where S is a noise matrix with sparsity 0.001 and non-zero elements randomly drawn from a uniform distribution on the unit interval. This dataset comprises 25 topics whose term-distributions (as specified by the 25 rows of H) are random 2500-dimensional points on the topic simplex with sparsity 0.01. These topics are then randomly mixed (as specified in W) to create the documents, such that each topic dominates 40 documents with at least 80% mixing proportions and each document on average contains 2.5 topics. These documents are then associated with timestamps such that topic i, i > 5, steadily emerges at timepoint i with a time profile as shown in the left subfigure in FIG. 3. These emerging topics arise in the background of 5 initial static topics, leading to an overall profile of temporal dynamics as shown (stacked area chart) in the right subfigure of FIG. 3. We choose the hinge parameter to be μ = 5 and an emerging bandwidth of 1 per timepoint for this dataset. In the experiments, a sliding window of ω = 7 timepoints was used. The second dataset is drawn from the NIST Topic Detection and Tracking (TDT2) corpus, which consists of news stories from the first half of 1998. In the evaluation, we used a set of 9394 documents represented over 19528 terms and distributed into the top 30 TDT2 topics over a period of 27 weeks. We choose the hinge parameter to be μ = 20 and an emerging bandwidth of 2 per week for this dataset. In the experiments, a sliding window of ω = 4 weeks was used.
[0068] Evaluation Metrics: For tracking, we use F1 scores, as commonly reported in the topic detection and tracking (TDT) literature. A precise definition of the micro-averaged F1 used in the experiments is given in [Tzu-Chuan Chou and Meng Chang Chen, Using Incremental PLSI for Threshold-Resilient Online Event Analysis, IEEE Transactions on Knowledge and Data Engineering, 2008]. A second performance metric is defined to capture how rapidly an emerging topic is "caught" and communicated to the user. Recall that a topic is communicated by the top keywords that dominate the associated term distribution in H(t). We first define true topic distributions as

$$H^{true}(t) = \arg\min_{H \ge 0}\ \|X(1,t) - W^{true} H\|_{\mathrm{fro}}^2,$$

where W^true is set using true topic labels. Next, for each true topic i, we compute the first detection time, which is the first timepoint at which the system generates a topic distribution in H(t) that is within a given threshold of the true topic, as measured by symmetric KL-divergence. We then record the percentage of documents missed before detection, and take the average of this miss rate across all true topics.
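This detection-time metric can be sketched as follows (illustrative; the exact threshold used in the study is not specified beyond the text above):

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two term distributions."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def first_detection_time(H_by_t, h_true, thresh):
    """First timepoint whose learned topics H(t) contain a distribution within
    `thresh` of the true topic under symmetric KL; None if never detected."""
    for t, H in enumerate(H_by_t):
        if min(sym_kl(h, h_true) for h in H) <= thresh:
            return t
    return None
```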
[0069] Results and Discussion: FIG. 4 shows tracking performance as a function of the evolution parameter δ. When δ = 0, the system freezes a topic as soon as it is detected, not allowing the word distributions to change as the underlying topic drifts over time. When δ = 1, the system has complete freedom in retraining topic distributions, causing no single channel to remain consistently associated with an underlying topic. It can be seen that both these extremes are suboptimal. Tracking is much more effective when topic distributions are allowed to evolve under sufficient constraints in response to the statistics of incoming data. In FIG. 5 we turn to the effectiveness of emergence regularization. The figure shows how much information on average is missed before underlying topics are first detected, as a function of the emergence parameter μ. We see that increasing μ, for a fixed choice of δ, typically reduces miss rates, causing topics to be detected early. As δ is increased, topics become less constrained and therefore provide additional bandwidth to drift towards emerging topics, thereby lowering the miss rate curves. However, this comes at the price of reduced tracking performance. Thus, for a fixed amount of available topic bandwidth, there is a tradeoff between tracking and early detection that can be navigated with the choice of μ and δ.
Empirical Studies for Emerging Topic Detection using ℓ₁ Dictionary Learning
[0070] The goal of this empirical study is to understand the influence of using an ℓ₁ loss function for detecting emerging topics. We first empirically evaluate our approach on
publicly-available labeled datasets from news streams and
newsgroups.
[0071] Evaluation Metrics: For the purpose of evaluation, we assume that documents in the corpus have been identified with a set of topics. For simplicity, we assume that each document is tagged with a single, most dominant topic that it associates with, which we refer to as the true topic for that document.
[0072] We use variations of standard IR measures like pairwise precision, recall, and F1 score. Given X(t), the set of documents arriving at time t, let TNvl_t be the set of true novel documents in X(t). Let C_t be the set of system-generated emerging topic clusters at time t, and let T_t be the true emerging topic clusters at time t. Note that the clusters in T_t are formed over documents in TNvl_t, whereas the clusters in C_t are formed over documents in Nvl_t, and TNvl_t may not be equal to Nvl_t.
[0073] We define our evaluation metrics over the novel documents. Pairwise precision is the number of pairs of documents that are in the same cluster in both T_t and C_t divided by the number of pairs of documents that are in the same cluster in C_t. Pairwise recall is the number of pairs of documents that are in the same cluster in both T_t and C_t divided by the number of pairs of documents that are in the same cluster in T_t. Pairwise F1 is the harmonic mean of pairwise precision and recall.
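These pairwise metrics can be computed directly from cluster assignments, as in the sketch below; for simplicity it scores only documents appearing in both labelings, since, as noted above, TNvl_t and Nvl_t may differ.

```python
from itertools import combinations

def pairwise_f1(true_labels, pred_labels):
    """Pairwise precision, recall, and F1. Both arguments map a document id
    to its cluster id; only documents present in both maps are scored."""
    docs = sorted(set(true_labels) & set(pred_labels))
    same_true = same_pred = same_both = 0
    for a, b in combinations(docs, 2):
        t = true_labels[a] == true_labels[b]
        p = pred_labels[a] == pred_labels[b]
        same_true += t
        same_pred += p
        same_both += t and p
    prec = same_both / same_pred if same_pred else 0.0
    rec = same_both / same_true if same_true else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```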
[0074] We compare the performance of the algorithm against three alternative approaches, which are based on combining nearest neighbor (NN) and K-Means algorithms with dictionary learning. We describe these baselines below.
[0075] NN-KM: To detect novel documents, we use the nearest neighbor approach used by the UMass FSD system [J. Allan, Topic Detection and Tracking: Event-based Information Organization], which is one of the best performing systems for this task. As in the UMass system, we use cosine distance as a similarity measure and a TF-IDF weighted document representation. Every document in X(t) whose cosine distance to its nearest neighbor in X(t−1) exceeds some threshold is marked as novel. We build on this algorithm to get a baseline for emerging topic detection, by running a K-Means clustering with cosine distance (Spherical K-Means) on the documents marked novel.
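A minimal sketch of this nearest-neighbor novelty test follows (rows are assumed to be TF-IDF weighted already; `sim_thresh` is the similarity cutoff corresponding to the distance threshold above):

```python
import numpy as np

def nn_novelty(X_new, X_prev, sim_thresh):
    """First-story-style detection: a row of X_new is novel when its maximum
    cosine similarity to every row of X_prev falls below sim_thresh."""
    A = X_new / (np.linalg.norm(X_new, axis=1, keepdims=True) + 1e-12)
    B = X_prev / (np.linalg.norm(X_prev, axis=1, keepdims=True) + 1e-12)
    max_sim = (A @ B.T).max(axis=1)           # nearest-neighbor similarity
    return np.nonzero(max_sim < sim_thresh)[0]
```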
[0076] DICT-KM: The second baseline is a modification of the
above-identified dictionary based scheme. We use the dictionary
learning approach to detect novel documents and then run a
Spherical K-Means clustering on these novel documents to create
emerging topic clusters.
[0077] NN-DICT: The third baseline is also a modification of the
dictionary based scheme. We first use the nearest neighbor approach
(explained above) to detect novel documents and then run a
dictionary based clustering on these novel documents to create
emerging topic clusters.
[0078] Results on TDT2 and 20 Newsgroups Datasets: We use two
standard labeled datasets to evaluate the performance of the
proposed algorithm. We start by describing these datasets and the
experimental setup.
[0079] The first dataset is the NIST topic detection and tracking
(TDT2) corpus. For the evaluation, we use a set of 9,394 documents
represented over 19,528 terms and spread over 27 weeks. These
documents are partitioned into 30 human-labeled topics. We
introduce the documents from the 27 weeks in 5 different phases. In the zeroth phase, we introduce all the documents from weeks 1 to 5, and these documents are used for initializing the dictionary H(0). In the first phase, we introduce all the documents from weeks 6 to 7 and run the emerging topic detection on these documents with dictionary H(0). In the second phase, we introduce all the documents from weeks 8 to 13 and run the emerging topic detection algorithm on these documents with dictionary H(1) (outputted by the first phase). We repeat the same steps for the third phase (weeks 14 to 17) and the fourth phase (weeks 18 to 27).
[0080] As the second dataset we use the 20 Newsgroups corpus. The
corpus contains 18,774 articles distributed among 20 clusters where
each cluster is a Usenet group. For the experiments, we use a
vocabulary of 10,000 terms selected based on frequency. We do a set
of controlled experiments on this corpus. Again, we introduce the
documents in phases. Documents within each cluster are temporally
ordered, and we use this temporal ordering to introduce the
documents. At the end of Phase i-1, we have documents from some
(old) clusters, and in Phase i we introduce a mixture of documents,
some coming from these old clusters and some belonging to new
clusters; we then see how well the algorithm performs in detecting
these new clusters. We begin Phase 0 with documents sampled from 6
randomly chosen clusters. In each subsequent phase, we introduce
documents from 2 new clusters. The numbers of documents added at each phase are presented in FIG. 8.
[0081] For baselines with K-Means clustering, we run the algorithm
8 times (with random initialization for centroids) and take the
best result. FIG. 8 presents the maximum F1 for both datasets (obtained by varying thresholds). The algorithm always outperforms all three baselines. For TDT2, the algorithm gives on average a 16.9% improvement in F1 score over NN-KM, a 6.7% improvement over DICT-KM, and a 4.3% improvement over NN-DICT. For 20 Newsgroups, we notice on average a 16.0% improvement over NN-KM, a 7.0% improvement over DICT-KM, and a 9.0% improvement over NN-DICT. The results are shown in FIG. 8.
[0082] A computer-based system 200 in which a method embodiment of
the invention may be carried out is depicted in FIG. 9. The
computer-based system 200 includes a processing unit 210, which
houses a processor, memory and other systems components (not shown
expressly in the drawing) that implement a general purpose
processing system, or computer that may execute a computer program
product. The computer program product may comprise media, for
example a compact storage medium such as a compact disc, which may
be read by the processing unit 210 through a disc drive 120, or by
any means known to the skilled artisan for providing the computer
program product to the general purpose processing system for
execution thereby.
[0083] The computer program product may comprise all the respective
features enabling the implementation of the inventive method
described herein, and which--when loaded in a computer system--is
able to carry out the method. Computer program, software program,
program, or software, in the present context means any expression,
in any language, code or notation, of a set of instructions
intended to cause a system having an information processing
capability to perform a particular function either directly or
after either or both of the following: (a) conversion to another
language, code or notation; and/or (b) reproduction in a different
material form.
[0084] The computer program product may be stored on hard disk
drives within processing unit 210, as mentioned, or may be located
on a remote system such as a server 230, coupled to processing unit
210, via a network interface such as an Ethernet interface. Monitor
240, mouse 250 and keyboard 260 are coupled to the processing unit
210, to provide user interaction. Scanner 280 and printer 270 are
provided for document input and output. Printer 270 is shown
coupled to the processing unit 210 via a network connection, but
may be coupled directly to the processing unit. Scanner 280 is
shown coupled to the processing unit 210 directly, but it should be
understood that peripherals might be network coupled, or direct
coupled without affecting the performance of the processing unit
210.
[0085] While it is apparent that the invention herein disclosed is
well calculated to fulfill the objectives discussed above, it will
be appreciated that numerous modifications and embodiments may be
devised by those skilled in the art, and it is intended that the
appended claims cover all such modifications and embodiments as
fall within the true spirit and scope of the present invention.
* * * * *