U.S. patent application number 13/087308 was filed with the patent office on 2011-04-14 and published on 2012-10-18 for a system and method for identifying users relevant to a topic of interest.
This patent application is currently assigned to PALO ALTO RESEARCH CENTER INCORPORATED. Invention is credited to Kevin Robert Canini, Ed H. Chi, Peter L. Pirolli, Bongwon Suh.
Application Number | 20120265771 13/087308 |
Document ID | / |
Family ID | 46846404 |
Filed Date | 2011-04-14 |
United States Patent
Application |
20120265771 |
Kind Code |
A1 |
Suh; Bongwon ; et
al. |
October 18, 2012 |
SYSTEM AND METHOD FOR IDENTIFYING USERS RELEVANT TO A TOPIC OF
INTEREST
Abstract
A system and method for identifying users relevant to a topic of
interest is provided. A query comprising one or more topics is
executed against a corpus of messages. Voting users associated with
the messages matching the query are identified. A set of candidate
users comprising users connected to the voting users is generated.
A relevancy score is computed for each candidate user. The
candidate users are ranked by their respective relevancy score.
Inventors: |
Suh; Bongwon; (Cupertino,
CA) ; Canini; Kevin Robert; (San Francisco, CA)
; Pirolli; Peter L.; (San Francisco, CA) ; Chi; Ed
H.; (Palo Alto, CA) |
Assignee: |
PALO ALTO RESEARCH CENTER
INCORPORATED
Palo Alto
CA
|
Family ID: |
46846404 |
Appl. No.: |
13/087308 |
Filed: |
April 14, 2011 |
Current U.S.
Class: |
707/749 ;
707/E17.108 |
Current CPC
Class: |
G06Q 10/101
20130101 |
Class at
Publication: |
707/749 ;
707/E17.108 |
International
Class: |
G06F 17/30 20060101
G06F017/30; G06F 7/00 20060101 G06F007/00 |
Claims
1. A system for identifying users relevant to a topic of interest,
comprising: a query module to execute a query comprising one or
more topics against a corpus of messages and to identify voting
users associated with the messages matching the query; a candidate
generator module to generate a set of candidate users comprising
users connected to the voting users; a relevancy scorer module to
compute a relevancy score for each candidate user; and a candidate
ranking module to rank the candidate users by their respective
relevancy score.
2. A system according to claim 1, further comprising: a calculation
module to calculate the number of voting users for each candidate
user and to calculate the number of users comprising the voting
users and non-voting users connected to each candidate user.
3. A system according to claim 2, wherein the relevancy score is
determined according to the equation f.sub.u, where f.sub.u, is the
number of voter users connected to the candidate user.
4. A system according to claim 2, wherein the relevancy score is
determined according to the equation f.sub.u/F.sub.u, where
f.sub.u, is the number of voter users connected to the candidate
user and F.sub.u is the total number of users connected to the
candidate user.
5. A system according to claim 2, wherein the relevancy score is
determined according to the equation f.sub.u/log F.sub.u, where
f.sub.u, is the number of voter users connected to the candidate
user and F.sub.u is the total number of users connected to the
candidate user.
6. A system according to claim 2, wherein the relevancy score is
determined according to the equation:
E[p|f.sub.u,F.sub.u]=(f.sub.u+.alpha.)/(F.sub.u+.alpha.+.beta.) where E
is the estimated probability p, f.sub.u, is the number of voter
users connected to the candidate user, F.sub.u is the total number
of users connected to the candidate user, and .alpha. and .beta.
are binomials.
7. A system according to claim 1, further comprising: a topic
module to identify one or more messages of each candidate user and
to apply a topic model to the identified messages; and a re-ranking
module to re-rank the candidate based on the topic model.
8. A system according to claim 7, wherein the topic model is
determined according to the equation:
$$\mathrm{Score}_{LDA}(\mathrm{user}_i)=\sum_{k}P_1(\text{query term}\mid \mathrm{topic}_k)\cdot P_2(\mathrm{topic}_k\mid \mathrm{user}_i)$$
where P.sub.1 (query term|topic.sub.k) is the probability
distribution of terms for each topic, k is the number of topics,
P.sub.2(topic.sub.k|user.sub.i) is the probability distribution of
topics for each document of a user, and i is the number of
users.
9. A system according to claim 8, wherein the re-ranking is
determined according to the equation:
Score.sub.Combined=W.sub.LinkStructure*Score.sub.LinkStructure+W.sub.LDA*Score.sub.LDA,
where Score.sub.LinkStructure equals the relevancy
score, and 0<W.sub.LinkStructure, W.sub.LDA<1 and
W.sub.LinkStructure+W.sub.LDA=1.
10. A system according to claim 1, further comprising: a time
window module to determine a time window and to apply the query
against the messages in the corpus that were created within the
time window.
11. A method for identifying users relevant to a topic of interest,
comprising: executing a query comprising one or more topics against
a corpus of messages; identifying voting users associated with the
messages matching the query; generating a set of candidate users
comprising users connected to the voting users; computing a
relevancy score for each candidate user; and ranking the candidate
users by their respective relevancy score.
12. A method according to claim 11, further comprising: calculating
the number of voting users for each candidate user; and calculating
the number of users comprising the voting users and non-voting
users connected to each candidate user.
13. A method according to claim 12, wherein the relevancy score is
determined according to the equation f.sub.u, where f.sub.u, is the
number of voter users connected to the candidate user.
14. A method according to claim 12, wherein the relevancy score is
determined according to the equation f.sub.u/F.sub.u, where
f.sub.u, is the number of voter users connected to the candidate
user and F.sub.u is the total number of users connected to the
candidate user.
15. A method according to claim 12, wherein the relevancy score is
determined according to the equation f.sub.u/log F.sub.u, where
f.sub.u, is the number of voter users connected to the candidate
user and F.sub.u is the total number of users connected to the
candidate user.
16. A method according to claim 12, wherein the relevancy score is
determined according to the equation:
E[p|f.sub.u,F.sub.u]=(f.sub.u+.alpha.)/(F.sub.u+.alpha.+.beta.)
where E is the estimated probability p, f.sub.u, is the number of
voter users connected to the candidate user, F.sub.u, is the total
number of users connected to the candidate user, and .alpha. and
.beta. are binomials.
17. A method according to claim 11, further comprising: identifying
one or more messages of each candidate user; applying a topic model
to the identified messages; and re-ranking the candidate based on
the topic model.
18. A method according to claim 17, wherein the topic model is
determined according to the equation:
$$\mathrm{Score}_{LDA}(\mathrm{user}_i)=\sum_{k}P_1(\text{query term}\mid \mathrm{topic}_k)\cdot P_2(\mathrm{topic}_k\mid \mathrm{user}_i)$$
where P.sub.1(query term|topic.sub.k) is the probability
distribution of terms for each topic, k is the number of topics,
P.sub.2(topic.sub.k|user.sub.i) is the probability distribution of
topics for each document of a user, and i is the number of
users.
19. A method according to claim 18, wherein the re-ranking is
determined according to the equation:
Score.sub.Combined=W.sub.LinkStructure*Score.sub.LinkStructure+W.sub.LDA*Score.sub.LDA,
where Score.sub.LinkStructure equals the relevancy
score, and 0<W.sub.LinkStructure, W.sub.LDA<1 and
W.sub.LinkStructure+W.sub.LDA=1.
20. A method according to claim 11, further comprising: determining
a time window; and applying the query against the messages in the
corpus that were created within the time window.
Description
FIELD
[0001] This application relates in general to management of
electronic information and, in particular, to a system and method
for identifying users relevant to a topic of interest.
BACKGROUND
[0002] A growing amount of information is shared through social
networking websites, such as Facebook and Twitter. Initially, these
types of websites were used mainly as a way to keep in touch with
friends and family by sharing personal information such as status
updates and uploaded photographs. Currently, social media tools are
increasingly used for purposes beyond personal conversations, such as
public discourse in diverse areas, including politics, business,
technology, and pop culture, as well as professional networking.
[0003] Information is transferred via a relationship, or
connection, such as "friending" in Facebook and "following" in
Twitter. For example, Twitter is a social networking and
microblogging service that allows users to send and receive short
messages, known as "tweets", and to share and discover various
topics of interest in real-time. To receive another user's tweets,
a user must subscribe to, or "follow", the other user's tweets. To
receive high-quality information about a topic of interest, a user
has to identify credible users whose tweets are relevant to the
topic. A user is found credible based at least in part on both the
expertise of the user and the trust other users have in the user,
reflected in the number of followers the user has.
[0004] As there are currently over 100 million registered users of
Twitter, finding the credible, or otherwise valuable, users who
publish information on a regular basis can be difficult as there
are no simple or efficient ways to determine which users are
relevant to particular topics of interest. Twitter has introduced
lists whereby users can organize the users they follow,
"followees," into groups. Third party services, such as Listorious,
available at listorious.com, and MyTwitterCloud, available at
mytwittercloud.com, use the created Twitter lists to index popular
users based on their membership in other users' lists. The list
assignments are aggregated and used to generate a ranking of users
for a given tag. However, user ranking is based on the manually
provided user lists, which have not been widely adopted, leading
to an underrepresentation of potentially credible users. Moreover,
the list categories are arbitrarily chosen by a user, which means
that the topics associated with a user can be arbitrary as well,
and may not reflect the actual topic of credibility of a user in
the list.
[0005] Additionally, WeFollow, available at wefollow.com, allows a
user to self-associate with a keyword of choice, which is then used
to rank the user against other users who have opted in for the same
keyword. However, a user has to manually opt in to be included on a
list, which means many credible sources may not be represented in
the list for the particular keyword or topic. Like Listorious and
MyTwitterCloud, a user may be arbitrarily associated with a
particular typographical instantiation of a keyword or topic. For
example, a user may associate with the term "photography" but, in
turn, may be weakly associated with the term "photographer."
[0006] Accordingly, there is a need for leveraging the existing
social structure to identify relevant users associated with a
particular topic of interest.
SUMMARY
[0007] An embodiment provides a system and method for identifying
users relevant to a topic of interest. A query comprising one or
more topics is executed against a corpus of messages. Voting users
associated with the messages matching the query are identified. A
set of candidate users comprising users connected to the voting
users is generated. A relevancy score is computed for each
candidate user. The candidate users are ranked by their respective
relevancy score.
[0008] Still other embodiments of the invention will become readily
apparent to those skilled in the art from the following detailed
description, wherein are described embodiments of the invention by way of
illustrating the best mode contemplated for carrying out the
invention. As will be realized, the invention is capable of other
and different embodiments and its several details are capable of
modifications in various obvious respects, all without departing
from the spirit and the scope of the invention. Accordingly, the
drawings and detailed description are to be regarded as
illustrative in nature and not as restrictive.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] FIG. 1 is a block diagram showing a system for identifying
users relevant to a topic of interest, in accordance with one
embodiment.
[0010] FIG. 2 is a flow diagram showing a method for identifying
users relevant to a topic of interest, in accordance with one
embodiment.
[0011] FIG. 3 is a data flow diagram showing, for example, a method
for generating candidate users for use in the method of FIG. 2.
[0012] FIG. 4 is a data flow diagram showing, for example, types of
relevancy measures.
DETAILED DESCRIPTION
[0013] As valuable knowledge is increasingly shared through social
networks, identifying credible users who are highly relevant to a
particular topic of interest becomes more difficult. Providing an
automated ranking of the users most relevant to a topic greatly
reduces the time and effort required by a user to identify other
users worth connecting to. FIG. 1 is a
block diagram showing a system 10 for identifying users relevant to
a topic of interest, in accordance with one embodiment. One or more
user devices 11-13 are connected to a content server 14 via an
internetwork 15, such as the Internet. The user devices 11-13 can
include a computer, laptop, or mobile device, such as a cellular
telephone or mobile Internet device (not shown). In general, each
user device 11-13 is a Web-enabled device that executes a Web
browser, which supports interfacing tools and information exchange
with the content server 14.
[0014] The content server 14 is interconnected to a content
database 16 and a user database 24. The content database 16 stores
messages 17, which are provided to the user devices 11-13 upon
request. The user database 24 stores user profiles 25, such as user
name, password, and connections between users. Other types of data
are possible. In a further embodiment, messages 17 and user
profiles 25 can be stored locally on the user devices 11-13.
[0015] A user inputs a search query of one or more keywords or
topics and the query is executed against the messages 19 in the
content database 16 via the content server 14. Messages 19 are
created by users or automatically generated, and can include status
updates from networking sites, such as Facebook and Twitter,
emails, blog postings, forums, and news content. Other types of
messages 19 are possible. Messages 19 can be queried and the
results received directly on user devices 11-13, for user review
via a user interface from the content server 14, through an
application programming interface of the message source, such as
the Twitter API, or messages 19 from many sources can be
aggregated, cached, and accessed by user devices 11-13 from other
servers 18.
[0016] Subsequently, a relevancy server 18 generates a ranking of
users relevant to the search query. The relevancy server 18 is
interconnected to the user devices 11-13 and the content server 14
via the internetwork 15, and includes a candidate generator module
19, relevancy scorer module 20, and candidate ranking module
21.
[0017] The candidate generator module 19 generates a set of
candidate users. The candidate users are generated from a
combination of the user generated search query of the messages 19
in the message database 16 and the social connections between
users. For example, social networks include features to connect
with other users, such as family, friends, colleagues, and
strangers. Facebook has "friending" and Twitter has "following."
Users connect to one another to keep updated with messages posted
by other users. The messages can include, for example, status
updates, weblinks, and photos.
[0018] The relevancy scorer module 20 applies a relevancy measure
to the candidate users and determines a relevancy score for each
candidate. The relevancy score of each candidate user is compared
and the candidates are ranked 23 based on the score. The rankings
23 can be cached for later retrieval or update in the relevancy
database 22. Users can then select one or more of the ranked users
to connect to, such as by following or friending the user.
[0019] The user devices 11-13, relevancy server 18, and content
server 14 each include components conventionally found in general
purpose programmable computing devices, such as a central
processing unit, memory, input/output ports, network interfaces,
and non-volatile storage, although other components are
possible.
[0020] Further, the user devices 11-13, relevancy server 18, and
content server 14 can each include one or more modules for carrying
out the embodiments disclosed herein. The modules can be
implemented as a computer program or procedure written as source
code in a conventional programming language and presented for
execution by the central processing unit as object or byte code.
Alternatively, the modules could also be implemented in hardware,
either as integrated circuitry or burned into read-only memory
components. The various implementations of the source code and
object and byte codes can be held on a computer-readable storage
medium, such as a floppy disk, hard drive, digital video disk
(DVD), random access memory (RAM), read-only memory (ROM) and
similar storage mediums. Other types of modules and module
functions are possible, as well as other physical hardware
components.
[0021] Users relevant to a topic of interest are identified from
content of user messages and social connections between users. FIG.
2 is a flow diagram 30 showing a method for identifying users
relevant to a topic of interest, in accordance with one embodiment.
A search query is received from a user and applied to a corpus of
messages 16 and the messages 19 satisfying the query are identified
(block 31). In one embodiment, various linguistic techniques, such as
word stemming, synonym expansion, and spelling correction, can be
applied to the search query.
[0022] The query can be applied to all messages 19 or to only those
messages 19 within a specified time window. The time period can be
manually chosen by the user or automatically determined. For
example, the time window may be all messages 19 received since the
last time the user used the system 10, those that have been
received in the last hour, or only the n most recent
messages. Other time windows are possible. The query can be applied
directly to the messages through the content server 14, through an
application programming interface of the message source, such as
the Twitter API, through the relevancy server 18, or messages 19
from many sources can be aggregated, cached, and accessed by user
devices 11-13 from other servers.
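By way of illustration only, the time-window restriction described above could be sketched in Python as follows. The `Message` record, its field names, and both helper functions are hypothetical and are not part of the disclosed embodiments; the sketch merely shows one window based on elapsed time and one based on the n most recent messages.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Message:
    # Hypothetical message record; the field names are assumptions.
    author: str
    text: str
    created: datetime


def within_window(messages: List[Message], now: datetime, window: timedelta) -> List[Message]:
    """Keep messages created no earlier than `now - window`."""
    cutoff = now - window
    return [m for m in messages if m.created >= cutoff]


def most_recent(messages: List[Message], n: int) -> List[Message]:
    """Keep the n most recently created messages, newest first."""
    return sorted(messages, key=lambda m: m.created, reverse=True)[:n]
```

Either helper could be applied before the query is matched, so that only messages inside the chosen window contribute voter users.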
[0023] Candidate users that may be relevant to the query are
generated from the identified messages (block 32), as further
discussed below with reference to FIG. 3. Briefly, the users
connected to the users whose messages satisfy the query are
identified as candidate users. The users can be connected through,
for example, social links such as "friends" on Facebook or
"following" on Twitter. Other connections between users are
possible.
[0024] Candidate users are identified from other users who follow
their message streams. FIG. 3 is a data flow diagram 40 showing,
for example, a method for generating candidate users for use in the
method of FIG. 2. A query 41 is received from a user of the system
10. The query 41 can include one or more keywords or terms. The
query is applied to the corpus through a search 42 for messages
matching the query or a subset of the query 41. In one embodiment,
the query results include all messages that match the query. In a
further embodiment the query results only include a subset of the
entire results, such as described above with reference to FIG. 2.
For example, only the most recent 1,500 messages that include the
query will be returned. Confining results to the most recent
messages increases the ability to adapt to temporal trends in
messages 19. When the semantics of a query changes over time, the
result generated by the system 10 can track the most recent meaning
of the query. For example, a query term "election" can have
multiple semantics depending on when the term is used. Near a U.S.
presidential election, "election" may be strongly associated with
"presidential election" while "election" may be more related to
gubernatorial or senate elections when used close to a midterm
election.
[0025] Users whose message content satisfies the query are
identified, placed in a voter user set, and designated as voter
users 43. The social connections of the voter users 43 are analyzed
and the users who are connected to the voter users 43 are
identified as candidate users 44. For each candidate user 44, the
number, f.sub.u, of voter users 43 who are connected to the
candidate user 44 is determined. Additionally, the total number of
users, F.sub.u, who are connected to each candidate user 44 is
determined by combining the number of voter users 43 and non-voter
users 45 for each candidate user 44. For example, candidate user
C.sub.1 has an f.sub.u value of 1, since the only voter user
connected to C.sub.1 is V.sub.1, while C.sub.1 has an F.sub.u of 2
since NV.sub.1 is connected to C.sub.1 as well. Candidate users
C.sub.2 and C.sub.3 have f.sub.u scores of 3 and 1, and F.sub.u
scores of 3 and 3, respectively. The numbers f.sub.u and F.sub.u
are then used to determine a relevancy score for each candidate
user 44, as further described below with reference to FIG. 4.
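By way of illustration only, the counting of f.sub.u and F.sub.u could be sketched in Python as follows. The `followees` adjacency map, the function name, and the small example graph are hypothetical; the graph is merely one arrangement chosen so that the counts for C.sub.1, C.sub.2, and C.sub.3 match the values quoted above, and it is not taken from FIG. 3.

```python
from typing import Dict, Set, Tuple


def generate_candidates(
    voters: Set[str],
    followees: Dict[str, Set[str]],
) -> Dict[str, Tuple[int, int]]:
    """Return {candidate: (f_u, F_u)}.

    `followees[x]` is the set of users that x follows (a hypothetical
    adjacency map). f_u counts voter followers of u; F_u counts all
    followers of u, voters and non-voters alike.
    """
    # Candidates are the users followed by at least one voter.
    candidates = set()
    for v in voters:
        candidates |= followees.get(v, set())

    counts = {c: [0, 0] for c in candidates}
    for follower, followed in followees.items():
        for c in followed:
            if c in counts:
                counts[c][1] += 1      # every follower adds to F_u
                if follower in voters:
                    counts[c][0] += 1  # voter followers add to f_u
    return {c: (f, F) for c, (f, F) in counts.items()}


# Hypothetical graph chosen to reproduce the counts quoted above.
voters = {"V1", "V2", "V3"}
followees = {
    "V1": {"C1", "C2"},
    "V2": {"C2"},
    "V3": {"C2", "C3"},
    "NV1": {"C1"},
    "NV2": {"C3"},
    "NV3": {"C3"},
}
counts = generate_candidates(voters, followees)
```

In this sketch, `counts` maps each candidate to its (f.sub.u, F.sub.u) pair, which is the only input the link structure-based relevancy measures require.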
[0026] Returning to the above discussion with respect to FIG. 2, a
relevancy measure is then applied to each of the candidate users to
determine a relevancy score (block 33), as further described below
with reference to FIG. 4. The relevancy scores of the candidate
users are then compared and the candidate users ranked (block 34)
based on the relevancy scores. The ranking is displayed to the
querying user, who can then select one or more candidate users to
connect to, for example by "following" or "friending".
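By way of illustration only, the ranking step (block 34) could be sketched in Python as a sort over per-candidate scores. The function name and the rule of breaking ties by user name are assumptions made so that the ordering is deterministic; the disclosure does not specify how ties are handled.

```python
from typing import Dict, List, Tuple


def rank_candidates(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort candidates by descending relevancy score.

    Ties are broken by user name so the ordering is deterministic;
    the tie-breaking rule is an assumption, not part of the disclosure.
    """
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```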
[0027] A relevancy measure is applied to determine a ranking of
each candidate user to a topic of interest. FIG. 4 is a data flow
diagram 50 showing, for example, types of relevancy measures 51.
Relevancy measures 51 can include NumVotes 52, DivF 53, DivLogF 54,
BetaBin(.alpha., .beta.) 55, and latent Dirichlet allocation (LDA)
56. Other relevancy measures are possible. NumVotes 52 counts the
number of voter users who follow a particular candidate user u.
Each voter user casts a vote for each of their social connections,
such as followees, and the total number determined, f.sub.u, is the
relevancy score for user u.
[0028] In some circumstances, NumVotes 52 can overly favor the most
popular users who may not be relevant to the topic of interest. For
example, some Twitter users have over one million followers and
would likely return many voting users for any search query.
Therefore, DivF 53 counts the proportion, rather than the actual
number, of a user's followers who satisfied the search query. The
higher the proportion of a user's followers who are associated with a
topic, the more relevant that user should be to the topic of the
query. DivF 53 is determined according to the equation
f.sub.u/F.sub.u.
[0029] DivF 53 may overpenalize generally popular users and
underpenalize unpopular users in some situations, and can be overly
sensitive to spuriously large values of f.sub.u when F.sub.u is
small. DivLogF 54 provides a balance between the NumVotes 52 and
DivF 53 relevancy measures. DivLogF 54 is determined according to
the equation f.sub.u/log F.sub.u. DivLogF 54 generates values
between NumVotes 52 and DivF 53, balancing between the two
measures. However, DivLogF 54, in some circumstances, may not
properly penalize generally popular users.
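By way of illustration only, the three count-based measures could be sketched in Python as follows, with DivLogF taken as f.sub.u/log F.sub.u in agreement with claims 5 and 15. The function names are hypothetical, and the use of the natural logarithm is an assumption, since the base of the logarithm is not specified.

```python
import math


def num_votes(f_u: int, F_u: int) -> float:
    """NumVotes: the raw count of voter followers."""
    return float(f_u)


def div_f(f_u: int, F_u: int) -> float:
    """DivF: the proportion of followers who are voters."""
    return f_u / F_u


def div_log_f(f_u: int, F_u: int) -> float:
    """DivLogF: voter count damped by the log of the follower count.

    The natural log is an assumption (the base is not specified), and
    F_u must exceed 1 because log(1) == 0.
    """
    if F_u <= 1:
        raise ValueError("DivLogF is undefined for F_u <= 1")
    return f_u / math.log(F_u)
```

With hypothetical counts of f.sub.u=1,000 and F.sub.u=1,000,000 for a widely followed user, and f.sub.u=50 and F.sub.u=200 for a specialist, NumVotes and DivLogF both place the widely followed user first, while DivF places the specialist first.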
[0030] BetaBin(.alpha., .beta.) 55 properly penalizes generally
popular users without underpenalizing unpopular users.
BetaBin(.alpha., .beta.) 55 is probability based. Each of a candidate
user's followers is assumed to be randomly included in the voter
user set independently of one another and with probability p, and
f.sub.u is then approximated by a Binomial(F.sub.u, p)
probability distribution. Next, a Beta(.alpha., .beta.) prior
distribution over p is used, so that after observing f.sub.u of the
user's F.sub.u followers occurring in the voter user set, the
posterior distribution of p is a Beta(f.sub.u+.alpha.,
F.sub.u-f.sub.u+.beta.) distribution. The expected value of the posterior
distribution gives an estimate, E, of the probability that each of
the user's followers is to be part of the voter user set, after
observing the values of f.sub.u and F.sub.u. The posterior expected
value is determined according to the equation:
E[p|f.sub.u,F.sub.u]=(f.sub.u+.alpha.)/(F.sub.u+.alpha.+.beta.)
[0031] which defines the BetaBin(.alpha., .beta.) 55 relevancy
measure.
[0032] Since the proportion of a user's followers within the voter
user set is expected to be low on average, .alpha. is set so that
.alpha.<<.beta.. For example, .alpha. is set to 1, while
.beta. is given values such as 10.sup.2, 10.sup.3, or 10.sup.4. Other
values for .alpha. and .beta. are possible.
[0033] Additionally, the BetaBin(.alpha., .beta.) 55 relevancy
measure functions similarly to the NumVotes 52 measure when
F.sub.u<<.alpha.+.beta., since
(f.sub.u+.alpha.)/(F.sub.u+.alpha.+.beta.).apprxeq.(f.sub.u+.alpha.)/(.alpha.+.beta.).about.f.sub.u.
Further, BetaBin(.alpha., .beta.) 55 functions similarly to the DivF 53
measure when F.sub.u>>.alpha.+.beta., since
(f.sub.u+.alpha.)/(F.sub.u+.alpha.+.beta.).apprxeq.f.sub.u/F.sub.u.
Therefore, BetaBin(.alpha., .beta.) 55 has the benefit of measuring
the proportion of a user's followers who are in the voter user set,
like DivF 53, while also appropriately penalizing unpopular users
like the NumVotes 52 measure.
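By way of illustration only, the BetaBin(.alpha., .beta.) measure could be sketched in Python as the posterior expected value given above. The function name and the default values .alpha.=1 and .beta.=10.sup.2 are assumptions chosen from the example values mentioned in the description.

```python
def beta_bin(f_u: int, F_u: int, alpha: float = 1.0, beta: float = 100.0) -> float:
    """Posterior mean (f_u + alpha) / (F_u + alpha + beta)."""
    return (f_u + alpha) / (F_u + alpha + beta)


# Hypothetical counts used only to show the two regimes.
specialist = beta_bin(50, 200)          # moderate F_u
celebrity = beta_bin(1000, 10**6)       # F_u >> alpha + beta, close to f_u / F_u
single_follower = beta_bin(1, 1)        # F_u << alpha + beta, driven by f_u
```

With these hypothetical counts, the specialist scores about 0.169, the widely followed user about 0.001, and the user whose single follower is a voter about 0.020, even though DivF would assign that last user the maximum value of 1.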
[0034] Unlike NumVotes 52, DivF 53, DivLogF 54, and BetaBin(.alpha.,
.beta.) 55, which take into account only information about the link
structure of the social network between the users, the LDA measure
56 takes into account the overall content, or topics, of users'
messages as well. Candidate users are still determined from the
voter user set, such as described above in FIG. 3. A topic model is
built to associate each user, and associated message, with one or
more topics. The entire message histories of the candidate users
are collected and the LDA measure 56 is run on the messages. The
LDA results provide a way to determine the topical similarity of
any user to a search query based on the content of the user's
tweets.
[0035] The LDA measure 56 analysis begins by collecting all
messages made by a user into a document. Each user is represented
by the aggregation of messages they have created. Next, the
parameters for the LDA analysis are chosen. The number of topics,
k, is empirically chosen, and is generally between 200 and 1,000
topics, though other topic numbers are possible. In one embodiment,
the number of topics is set to 500. Parameters alpha and beta for
the Dirichlet kernel are empirically chosen as well and are set to
0.1 and 0.5 respectively. Finally, the LDA algorithm, such as
described in D. M. Blei et al., "Latent Dirichlet Allocation," 3
Jour. Of Machine Learning Research 993-1022 (2003), the disclosure
of which is incorporated herein by reference, is applied on the set
of documents to obtain the two sets, P.sub.1 and P.sub.2, of
topical distribution. P.sub.1 (query term|topic.sub.k) is the
probability distribution of terms for each topic, where k is the
number of topics. P.sub.2(topic.sub.k|user.sub.i) is the
probability distribution of topics for each document, which is an
aggregation of messages by a user, where i is the number of
users.
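By way of illustration only, the first step of the LDA analysis, in which each user is represented by the aggregation of the messages they have created, could be sketched in Python as follows. The function name and the (user, text) input shape are assumptions; the resulting documents would then be passed to an LDA implementation of the implementer's choosing.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_user_documents(messages: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Concatenate each user's messages into a single document.

    `messages` yields (user, text) pairs; this input shape is an assumption.
    """
    parts: Dict[str, List[str]] = defaultdict(list)
    for user, text in messages:
        parts[user].append(text)
    return {user: " ".join(texts) for user, texts in parts.items()}
```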
[0036] Given the two probability distributions, P.sub.1 and P.sub.2,
the topical similarity between query terms and a user can be
calculated as the probability that the user would generate the
query terms, which is according to the equation:
$$\mathrm{Score}_{LDA}(\mathrm{user}_i)=\sum_{k}P_1(\text{query term}\mid \mathrm{topic}_k)\cdot P_2(\mathrm{topic}_k\mid \mathrm{user}_i)$$
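By way of illustration only, the LDA score for a single query term could be sketched in Python as a sum over topics of the product of the two distributions. The function name and the data layout (one term-probability dictionary per topic and one topic-probability list per user) are assumptions; the probabilities themselves would come from a previously fitted topic model, which is not shown.

```python
from typing import Dict, List


def lda_score(
    query_term: str,
    term_given_topic: List[Dict[str, float]],
    topic_given_user: List[float],
) -> float:
    """Sum over topics k of P1(term | topic_k) * P2(topic_k | user)."""
    if len(term_given_topic) != len(topic_given_user):
        raise ValueError("both inputs must cover the same topics")
    return sum(
        p1.get(query_term, 0.0) * p2
        for p1, p2 in zip(term_given_topic, topic_given_user)
    )
```

A term absent from a topic's dictionary is treated as having zero probability under that topic, which is a simplification of this sketch.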
[0037] The candidates are then ranked based on the results. In a
further embodiment, LDA can be used to re-rank candidate users that
have already been ranked by one of the link structure-based
measures, using topic similarity to the search query as the
ranking criterion. For example, the two scores for ranking can be
combined according to the equation:
Score.sub.Combined=W.sub.LinkStructure*Score.sub.LinkStructure+W.sub.LDA*Score.sub.LDA,
[0038] where Score.sub.LinkStructure equals one of NumVotes 52,
DivF 53, DivLogF 54, or BetaBin(.alpha., .beta.) 55, Score.sub.LDA
equals the LDA determination, and 0<W.sub.LinkStructure,
W.sub.LDA<1 and W.sub.LinkStructure+W.sub.LDA=1.
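By way of illustration only, the combined score could be sketched in Python as a weighted sum in which the two weights lie strictly between 0 and 1 and sum to 1. The function name and the default weight of 0.5 are assumptions; the disclosure does not state particular weight values.

```python
def combined_score(link_score: float, lda_score: float, w_link: float = 0.5) -> float:
    """Weighted sum with w_link + w_lda == 1 and both weights in (0, 1)."""
    if not 0.0 < w_link < 1.0:
        raise ValueError("w_link must lie strictly between 0 and 1")
    w_lda = 1.0 - w_link
    return w_link * link_score + w_lda * lda_score
```

Because a link structure-based score such as NumVotes and an LDA score can lie on very different numeric scales, an implementer may wish to rescale the two inputs before combining them; any such rescaling is outside this sketch.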
[0039] Other content-based algorithms can be used, for example,
probabilistic latent semantic analysis, latent semantic indexing,
hierarchical LDA, and explicit semantic analysis.
[0040] While the invention has been particularly shown and
described as referenced to the embodiments thereof, those skilled
in the art will understand that the foregoing and other changes in
form and detail may be made therein without departing from the
spirit and scope of the invention.
* * * * *