System And Method For Identifying Users Relevant To A Topic Of Interest Suh; Bongwon ; et al. [PALO ALTO RESEARCH CENTER INCORPORATED]

Patent Application Summary

U.S. patent application number 13/087308, for a system and method for identifying users relevant to a topic of interest, was filed with the patent office on 2011-04-14 and published on 2012-10-18. This patent application is currently assigned to PALO ALTO RESEARCH CENTER INCORPORATED. Invention is credited to Kevin Robert Canini, Ed H. Chi, Peter L. Pirolli, Bongwon Suh.

Application Number	20120265771 13/087308
Family ID	46846404
Filed Date	2011-04-14

United States Patent Application	20120265771
Kind Code	A1
Suh; Bongwon ; et al.	October 18, 2012

SYSTEM AND METHOD FOR IDENTIFYING USERS RELEVANT TO A TOPIC OF INTEREST

Abstract

A system and method for identifying users relevant to a topic of interest is provided. A query comprising one or more topics is executed against a corpus of messages. Voting users associated with the messages matching the query are identified. A set of candidate users comprising users connected to the voting users is generated. A relevancy score is computed for each candidate user. The candidate users are ranked by their respective relevancy score.

Inventors:	Suh; Bongwon; (Cupertino, CA) ; Canini; Kevin Robert; (San Francisco, CA) ; Pirolli; Peter L.; (San Francisco, CA) ; Chi; Ed H.; (Palo Alto, CA)
Assignee:	PALO ALTO RESEARCH CENTER INCORPORATED Palo Alto CA
Family ID:	46846404
Appl. No.:	13/087308
Filed:	April 14, 2011

Current U.S. Class:	707/749 ; 707/E17.108
Current CPC Class:	G06Q 10/101 20130101
Class at Publication:	707/749 ; 707/E17.108
International Class:	G06F 17/30 20060101 G06F017/30; G06F 7/00 20060101 G06F007/00

Claims

1. A system for identifying users relevant to a topic of interest, comprising: a query module to execute a query comprising one or more topics against a corpus of messages and to identify voting users associated with the messages matching the query; a candidate generator module to generate a set of candidate users comprising users connected to the voting users; a relevancy scorer module to compute a relevancy score for each candidate user; and a candidate ranking module to rank the candidate users by their respective relevancy score.

2. A system according to claim 1, further comprising: a calculation module to calculate the number of voting users for each candidate user and to calculate the number of users comprising the voting users and non-voting users connected to each candidate user.

3. A system according to claim 2, wherein the relevancy score is determined according to the equation $f_u$, where $f_u$ is the number of voter users connected to the candidate user.

4. A system according to claim 2, wherein the relevancy score is determined according to the equation $f_u / F_u$, where $f_u$ is the number of voter users connected to the candidate user and $F_u$ is the total number of users connected to the candidate user.

5. A system according to claim 2, wherein the relevancy score is determined according to the equation $f_u / \log F_u$, where $f_u$ is the number of voter users connected to the candidate user and $F_u$ is the total number of users connected to the candidate user.

6. A system according to claim 2, wherein the relevancy score is determined according to the equation: $E[p \mid f_u, F_u] = (f_u + \alpha)/(F_u + \alpha + \beta)$, where E is the estimated probability p, $f_u$ is the number of voter users connected to the candidate user, $F_u$ is the total number of users connected to the candidate user, and $\alpha$ and $\beta$ are binomials.

7. A system according to claim 1, further comprising: a topic module to identify one or more messages of each candidate user and to apply a topic model to the identified messages; and a re-ranking module to re-rank the candidate based on the topic model.

8. A system according to claim 7, wherein the topic model is determined according to the equation: $\mathrm{Score}_{LDA}(\mathrm{user}_i) = \sum_k P_1(\text{query term} \mid \mathrm{topic}_k) \cdot P_2(\mathrm{topic}_k \mid \mathrm{user}_i)$, where $P_1(\text{query term} \mid \mathrm{topic}_k)$ is the probability distribution of terms for each topic, k is the number of topics, $P_2(\mathrm{topic}_k \mid \mathrm{user}_i)$ is the probability distribution of topics for each document of a user, and i is the number of users.

9. A system according to claim 8, wherein the re-ranking is determined according to the equation: $\mathrm{Score}_{Combined} = W_{LinkStructure} \cdot \mathrm{Score}_{LinkStructure} + W_{LDA} \cdot \mathrm{Score}_{LDA}$, where $\mathrm{Score}_{LinkStructure}$ equals the relevancy score, and $0 < W_{LinkStructure}, W_{LDA} < 1$ and $W_{LinkStructure} + W_{LDA} = 1$.

10. A system according to claim 1, further comprising: a time window module to determine a time window and to apply the query against the messages in the corpus that were created within the time window.

11. A method for identifying users relevant to a topic of interest, comprising: executing a query comprising one or more topics against a corpus of messages; identifying voting users associated with the messages matching the query; generating a set of candidate users comprising users connected to the voting users; computing a relevancy score for each candidate user; and ranking the candidate users by their respective relevancy score.

12. A method according to claim 11, further comprising: calculating the number of voting users for each candidate user; and calculating the number of users comprising the voting users and non-voting users connected to each candidate user.

13. A method according to claim 12, wherein the relevancy score is determined according to the equation $f_u$, where $f_u$ is the number of voter users connected to the candidate user.

14. A method according to claim 12, wherein the relevancy score is determined according to the equation $f_u / F_u$, where $f_u$ is the number of voter users connected to the candidate user and $F_u$ is the total number of users connected to the candidate user.

15. A method according to claim 12, wherein the relevancy score is determined according to the equation $f_u / \log F_u$, where $f_u$ is the number of voter users connected to the candidate user and $F_u$ is the total number of users connected to the candidate user.

16. A method according to claim 12, wherein the relevancy score is determined according to the equation: $E[p \mid f_u, F_u] = (f_u + \alpha)/(F_u + \alpha + \beta)$, where E is the estimated probability p, $f_u$ is the number of voter users connected to the candidate user, $F_u$ is the total number of users connected to the candidate user, and $\alpha$ and $\beta$ are binomials.

17. A method according to claim 11, further comprising: identifying one or more messages of each candidate user; applying a topic model to the identified messages; and re-ranking the candidate based on the topic model.

18. A method according to claim 17, wherein the topic model is determined according to the equation: $\mathrm{Score}_{LDA}(\mathrm{user}_i) = \sum_k P_1(\text{query term} \mid \mathrm{topic}_k) \cdot P_2(\mathrm{topic}_k \mid \mathrm{user}_i)$, where $P_1(\text{query term} \mid \mathrm{topic}_k)$ is the probability distribution of terms for each topic, k is the number of topics, $P_2(\mathrm{topic}_k \mid \mathrm{user}_i)$ is the probability distribution of topics for each document of a user, and i is the number of users.

19. A method according to claim 18, wherein the re-ranking is determined according to the equation: $\mathrm{Score}_{Combined} = W_{LinkStructure} \cdot \mathrm{Score}_{LinkStructure} + W_{LDA} \cdot \mathrm{Score}_{LDA}$, where $\mathrm{Score}_{LinkStructure}$ equals the relevancy score, and $0 < W_{LinkStructure}, W_{LDA} < 1$ and $W_{LinkStructure} + W_{LDA} = 1$.

20. A method according to claim 11, further comprising: determining a time window; and applying the query against the messages in the corpus that were created within the time window.

Description

FIELD

[0001] This application relates in general to management of electronic information and, in particular, to a system and method for identifying users relevant to a topic of interest.

BACKGROUND

[0002] A growing amount of information is shared through social networking websites, such as Facebook and Twitter. Initially, these types of websites were used mainly as a way to keep in touch with friends and family by sharing personal information such as status updates and uploaded photographs. Currently, social media tools are increasingly used for purposes beyond personal conversations, such as professional networking and public discourse in diverse areas, including politics, business, technology, and pop culture.

[0003] Information is transferred via a relationship, or connection, such as "friending" in Facebook and "following" in Twitter. For example, Twitter is a social networking and microblogging service that allows users to send and receive short messages, known as "tweets", and to share and discover various topics of interest in real-time. To receive another user's tweets, a user must subscribe to, or "follow", the other user's tweets. To receive high-quality information about a topic of interest, a user has to identify credible users whose tweets are relevant to the topic. A user is found credible based at least in part on both the expertise of the user and the trust other users have in the user, reflected in the number of followers the user has.

[0004] As there are currently over 100 million registered users of Twitter, finding the credible, or otherwise valuable, users who publish information on a regular basis can be difficult, as there are no simple or efficient ways to determine which users are relevant to particular topics of interest. Twitter has introduced lists whereby users can organize the users they follow, or "followees," into groups. Third party services, such as Listorious, available at listorious.com, and MyTwitterCloud, available at mytwittercloud.com, use the created Twitter lists to index popular users based on their membership in other users' lists. The list assignments are aggregated and used to generate a ranking of users for a given tag. However, user ranking is based on the manually provided user lists, which have not been widely adopted, leading to an underrepresentation of potentially credible users. Moreover, the list categories are arbitrarily chosen by a user, which means that the topics associated with a user can be arbitrary as well, and may not reflect the actual topic of credibility of a user in the list.

[0005] Additionally, WeFollow, available at wefollow.com, allows a user to self-associate with a keyword of choice, which is then used to rank the user against other users who have opted in for the same keyword. However, a user has to manually opt in to be included on a list, which means many credible sources may not be represented in the list for the particular keyword or topic. As with Listorious and MyTwitterCloud, a user may be arbitrarily associated with a particular typographical instantiation of a keyword or topic. For example, a user may associate with the term "photography" but, in turn, may be only weakly associated with the term "photographer."

[0006] Accordingly, there is a need for leveraging the existing social structure to identify relevant users associated with a particular topic of interest.

SUMMARY

[0007] An embodiment provides a system and method for identifying users relevant to a topic of interest. A query comprising one or more topics is executed against a corpus of messages. Voting users associated with the messages matching the query are identified. A set of candidate users comprising users connected to the voting users is generated. A relevancy score is computed for each candidate user. The candidate users are ranked by their respective relevancy score.

[0008] Still other embodiments of the invention will become readily apparent to those skilled in the art from the following detailed description, wherein are described embodiments of the invention by way of illustrating the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and the scope of the invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] FIG. 1 is a block diagram showing a system for identifying users relevant to a topic of interest, in accordance with one embodiment.

[0010] FIG. 2 is a flow diagram showing a method for identifying users relevant to a topic of interest, in accordance with one embodiment.

[0011] FIG. 3 is a data flow diagram showing, for example, a method for generating candidate users for use in the method of FIG. 2.

[0012] FIG. 4 is a data flow diagram showing, for example, types of relevancy measures.

DETAILED DESCRIPTION

[0013] As valuable knowledge is increasingly shared through social networks, identifying credible users who are highly relevant to a particular topic of interest becomes more difficult. Providing an automated ranking of the users most relevant to a topic greatly reduces the time and effort required by a user to identify other users worth connecting to. FIG. 1 is a block diagram showing a system 10 for identifying users relevant to a topic of interest, in accordance with one embodiment. One or more user devices 11-13 are connected to a content server 14 via an internetwork 15, such as the Internet. The user devices 11-13 can include a computer, laptop, or mobile device, such as a cellular telephone or mobile Internet device (not shown). In general, each user device 11-13 is a Web-enabled device that executes a Web browser, which supports interfacing tools and information exchange with the content server 14.

[0014] The content server 14 is interconnected to a content database 16 and a user database 24. The content database 16 stores messages 17, which are provided to the user devices 11-13 upon request. The user database 24 stores user profiles 25, such as user name, password, and connections between users. Other types of data are possible. In a further embodiment, messages 17 and user profiles 25 can be stored locally on the user devices 11-13.

[0015] A user inputs a search query of one or more keywords or topics and the query is executed against the messages 17 in the content database 16 via the content server 14. Messages 17 are created by users or automatically generated, and can include status updates from networking sites, such as Facebook and Twitter, emails, blog postings, forums, and news content. Other types of messages 17 are possible. Messages 17 can be queried and the results received directly on user devices 11-13 for user review via a user interface from the content server 14, or through an application programming interface of the message source, such as the Twitter API. Alternatively, messages 17 from many sources can be aggregated, cached, and accessed by user devices 11-13 from other servers.

[0016] Subsequently, a relevancy server 18 generates a ranking of users relevant to the search query. The relevancy server 18 is interconnected to the user devices 11-13 and the content server 14 via the internetwork 15, and includes a candidate generator module 19, relevancy scorer module 20, and candidate ranking module 21.

[0017] The candidate generator module 19 generates a set of candidate users. The candidate users are generated from a combination of the user-generated search query of the messages 17 in the content database 16 and the social connections between users. For example, social networks include features to connect with other users, such as family, friends, colleagues, and strangers. Facebook has "friending" and Twitter has "following." Users connect to one another to keep updated with messages posted by other users. The messages can include, for example, status updates, weblinks, and photos.

[0018] The relevancy scorer module 20 applies a relevancy measure to the candidate users and determines a relevancy score for each candidate. The relevancy scores of the candidate users are compared and the candidates are ranked 23 based on the scores. The rankings 23 can be cached for later retrieval or update in the relevancy database 22. Users can then select one or more of the ranked users to connect to, such as by following or friending the user.

[0019] The user devices 11-13, relevancy server 18, and content server 14 each include components conventionally found in general purpose programmable computing devices, such as a central processing unit, memory, input/output ports, network interfaces, and non-volatile storage, although other components are possible.

[0020] Further, the user devices 11-13, relevancy server 18, and content server 14 can each include one or more modules for carrying out the embodiments disclosed herein. The modules can be implemented as a computer program or procedure written as source code in a conventional programming language and presented for execution by the central processing unit as object or byte code. Alternatively, the modules could also be implemented in hardware, either as integrated circuitry or burned into read-only memory components. The various implementations of the source code and object and byte codes can be held on a computer-readable storage medium, such as a floppy disk, hard drive, digital video disk (DVD), random access memory (RAM), read-only memory (ROM), and similar storage media. Other types of modules and module functions are possible, as well as other physical hardware components.

[0021] Users relevant to a topic of interest are identified from the content of user messages and the social connections between users. FIG. 2 is a flow diagram 30 showing a method for identifying users relevant to a topic of interest, in accordance with one embodiment. A search query is received from a user and applied to the corpus of messages 17 in the content database 16, and the messages 17 satisfying the query are identified (block 31). In one embodiment, various linguistic processing techniques, such as word stemming, synonym expansion, and spelling correction, can be applied to the search query.

[0022] The query can be applied to all messages 17 or to only those messages 17 within a specified time window. The time window can be manually chosen by the user or automatically determined. For example, the time window may cover all messages 17 received since the last time the user used the system 10, those that have been received in the last hour, or only the most recent n messages. Other time windows are possible. The query can be applied directly to the messages through the content server 14, through an application programming interface of the message source, such as the Twitter API, or through the relevancy server 18. Alternatively, messages 17 from many sources can be aggregated, cached, and accessed by user devices 11-13 from other servers.
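The time window options described above can be sketched as a simple filter. This is only an illustrative sketch: the function name `within_time_window`, the message dictionary fields, and the sample data are assumptions made for the example and do not come from the described system.

```python
from datetime import datetime, timedelta

def within_time_window(messages, now, max_age=None, most_recent_n=None):
    """Return messages newest first, optionally limited by age and by count.

    `messages` is a list of dicts with a 'created_at' datetime (hypothetical
    layout). `max_age` keeps only messages created within that timedelta of
    `now`; `most_recent_n` keeps only the n newest of the remaining messages.
    """
    selected = sorted(messages, key=lambda m: m["created_at"], reverse=True)
    if max_age is not None:
        selected = [m for m in selected if now - m["created_at"] <= max_age]
    if most_recent_n is not None:
        selected = selected[:most_recent_n]
    return selected

# Hypothetical sample corpus.
now = datetime(2011, 4, 14, 12, 0)
messages = [
    {"user": "V1", "text": "election results", "created_at": now - timedelta(minutes=10)},
    {"user": "V2", "text": "election coverage", "created_at": now - timedelta(minutes=50)},
    {"user": "V3", "text": "election recap", "created_at": now - timedelta(hours=3)},
]

last_hour = within_time_window(messages, now, max_age=timedelta(hours=1))
latest_one = within_time_window(messages, now, most_recent_n=1)
```

The same filter could be applied before or after the keyword match; the text does not fix that order, so the sketch leaves it to the caller.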

[0023] Candidate users that may be relevant to the query are generated from the identified messages (block 32), as further discussed below with reference to FIG. 3. Briefly, the users connected to the users whose messages satisfy the query are identified as candidate users. The users can be connected through, for example, social links such as "friends" on Facebook or "following" on Twitter. Other connections between users are possible.

[0024] Candidate users are identified from other users who follow their message streams. FIG. 3 is a data flow diagram 40 showing, for example, a method for generating candidate users for use in the method of FIG. 2. A query 41 is received from a user of the system 10. The query 41 can include one or more keywords or terms. The query 41 is applied to the corpus through a search 42 for messages matching the query 41 or a subset of the query 41. In one embodiment, the query results include all messages that match the query. In a further embodiment, the query results only include a subset of the entire results, such as described above with reference to FIG. 2. For example, only the most recent 1,500 messages that include the query will be returned. Confining results to the most recent messages increases the ability to adapt to temporal trends in messages 17. When the semantics of a query change over time, the result generated by the system 10 can track the most recent meaning of the query. For example, a query term "election" can have multiple meanings depending on when the term is used. Near a U.S. presidential election, "election" may be strongly associated with "presidential election," while "election" may be more related to gubernatorial or senate elections when used close to a midterm election.

[0025] Users whose message content satisfies the query are identified, placed in a voter user set, and designated as voter users 43. The social connections of the voter users 43 are analyzed and the users who are connected to the voter users 43 are identified as candidate users 44. For each candidate user 44, the number, $f_u$, of voter users 43 who are connected to the candidate user 44 is determined. Additionally, the total number of users, $F_u$, who are connected to each candidate user 44 is determined by combining the number of voter users 43 and non-voter users 45 for each candidate user 44. For example, candidate user $C_1$ has an $f_u$ value of 1, since the only voter user connected to $C_1$ is $V_1$, while $C_1$ has an $F_u$ of 2 since $NV_1$ is connected to $C_1$ as well. Candidate users $C_2$ and $C_3$ have $f_u$ scores of 3 and 1, and $F_u$ scores of 3 and 3, respectively. The numbers $f_u$ and $F_u$ are then used to determine a relevancy score for each candidate user 44, as further described below with reference to FIG. 4.
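The counting step above can be sketched as follows. The follow graph below is a hypothetical one, chosen only so that it reproduces the counts quoted in the text for the three candidates; the actual connections in FIG. 3 other than those involving the first candidate are not stated, and the function and variable names are assumptions.

```python
def count_votes(matching_messages, followees):
    """Compute (f_u, F_u) for every candidate user.

    matching_messages: iterable of dicts with a 'user' key (authors of
        messages that satisfied the query; hypothetical layout).
    followees: dict mapping each user to the set of users they follow.
    Returns (voters, counts) where counts[candidate] = (f_u, F_u).
    """
    voters = {m["user"] for m in matching_messages}
    f = {}  # f_u: number of voter users connected to each user
    F = {}  # F_u: number of all users (voter or not) connected to each user
    for follower, followed in followees.items():
        for user in followed:
            F[user] = F.get(user, 0) + 1
            if follower in voters:
                f[user] = f.get(user, 0) + 1
    # Candidates are the users connected to at least one voter user.
    return voters, {u: (f[u], F[u]) for u in f}

# Hypothetical graph: V* are voter users, NV* are non-voter users.
followees = {
    "V1": {"C1", "C2"},
    "V2": {"C2"},
    "V3": {"C2", "C3"},
    "NV1": {"C1"},
    "NV2": {"C3"},
    "NV3": {"C3"},
}
matching = [{"user": "V1"}, {"user": "V2"}, {"user": "V3"}, {"user": "V1"}]

voters, counts = count_votes(matching, followees)
```

A voter user who posts several matching messages still casts one vote per followee here, because voters are held in a set; whether repeated matches should count more is not addressed in the text.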

[0026] Returning to the above discussion with respect to FIG. 2, a relevancy measure is then applied to each of the candidate users to determine a relevancy score (block 33), as further described below with reference to FIG. 4. The relevancy scores of the candidate users are then compared and the candidate users are ranked (block 34) based on the relevancy scores. The ranking is displayed to the querying user, who can then select one or more candidate users to connect to, for example, by "following" or "friending" them.

[0027] A relevancy measure is applied to determine a ranking of each candidate user for a topic of interest. FIG. 4 is a data flow diagram 50 showing, for example, types of relevancy measures 51. Relevancy measures 51 can include NumVotes 52, DivF 53, DivLogF 54, BetaBin($\alpha$, $\beta$) 55, and latent Dirichlet allocation (LDA) 56. Other relevancy measures are possible. NumVotes 52 counts the number of voter users who follow a particular candidate user u. Each voter user casts a vote for each of their social connections, such as followees, and the total number determined, $f_u$, is the relevancy score for user u.

[0028] In some circumstances, NumVotes 52 can overly favor the most popular users, who may not be relevant to the topic of interest. For example, some Twitter users have over one million followers and would likely return many voting users for any search query. Therefore, DivF 53 counts the proportion, rather than the actual number, of a user's followers who satisfied the search query. The higher the proportion of a user's followers who are associated with a topic, the more relevant that user should be to the topic of the query. DivF 53 is determined according to the equation $f_u / F_u$.

[0029] DivF 53 may overpenalize generally popular users and underpenalize unpopular users in some situations, and can be overly sensitive to spuriously large values of $f_u$ when $F_u$ is small. DivLogF 54 provides a balance between the NumVotes 52 and DivF 53 relevancy measures. DivLogF 54 is determined according to the equation $f_u / \log F_u$. DivLogF 54 generates values between those of NumVotes 52 and DivF 53, balancing between the two measures. However, DivLogF 54, in some circumstances, may not properly penalize generally popular users.
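The three link-structure measures discussed so far can be compared side by side in a short sketch. The function names, the use of the natural logarithm, the handling of a candidate with a single follower, and the popular user "P" are all assumptions introduced for illustration; the counts for the other three candidates are the ones given in the FIG. 3 example.

```python
import math

def num_votes(f_u, F_u):
    # NumVotes: the raw number of voter users connected to the candidate.
    return float(f_u)

def div_f(f_u, F_u):
    # DivF: the proportion of the candidate's connections that are voters.
    return f_u / F_u

def div_log_f(f_u, F_u):
    # DivLogF: f_u / log F_u. Natural log is an assumption; the text does
    # not name a base. log(1) = 0, so a candidate with a single follower is
    # given an infinite score here; a real system would need its own policy.
    return f_u / math.log(F_u) if F_u > 1 else math.inf

def rank(counts, measure):
    # Order candidates from highest to lowest score under the given measure.
    return sorted(counts, key=lambda u: measure(*counts[u]), reverse=True)

# (f_u, F_u) per candidate; "P" is a hypothetical, generally popular user.
counts = {"C1": (1, 2), "C2": (3, 3), "C3": (1, 3), "P": (12, 1000)}

by_num_votes = rank(counts, num_votes)
by_div_f = rank(counts, div_f)
by_div_log_f = rank(counts, div_log_f)
```

With these sample counts, the popular user is ranked first by NumVotes, last by DivF, and in between by DivLogF, which matches the qualitative behavior described above.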

[0030] BetaBin($\alpha$, $\beta$) 55 properly penalizes generally popular users without underpenalizing unpopular users. BetaBin($\alpha$, $\beta$) 55 is probability based. Each of a candidate user's followers is assumed to be randomly included in the voter user set independently of one another and with probability p, so that $f_u$ is approximated by a Binomial($F_u$, p) probability distribution. Next, a Beta($\alpha$, $\beta$) prior distribution over p is used, so that after observing $f_u$ of the user's $F_u$ followers occurring in the voter user set, the posterior distribution of p is a Beta($f_u + \alpha$, $F_u - f_u + \beta$) distribution. The expected value of the posterior distribution gives an estimate, E, of the probability that each of the user's followers is part of the voter user set, after observing the values of $f_u$ and $F_u$. The posterior expected value is determined according to the equation:

$$E[p \mid f_u, F_u] = \frac{f_u + \alpha}{F_u + \alpha + \beta}$$

[0031] which defines the BetaBin(.alpha., .beta.) 55 relevancy measure.
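For readers who want the intermediate step, the expectation above follows from the standard conjugacy of the Beta prior with the Binomial likelihood. The derivation below is the textbook argument rather than something spelled out in the application; it uses the fact that a Beta(a, b) distribution has mean a/(a + b).

```latex
\begin{aligned}
P(f_u \mid p) &\propto p^{f_u}(1-p)^{F_u - f_u},
\qquad P(p) \propto p^{\alpha-1}(1-p)^{\beta-1} \\
P(p \mid f_u, F_u) &\propto p^{f_u+\alpha-1}(1-p)^{F_u-f_u+\beta-1} \\
E[p \mid f_u, F_u] &= \frac{f_u+\alpha}{(f_u+\alpha)+(F_u-f_u+\beta)}
 = \frac{f_u+\alpha}{F_u+\alpha+\beta}
\end{aligned}
```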

[0032] Since the proportion of a user's followers within the voter user set is expected to be low on average, $\alpha$ is set so that $\alpha \ll \beta$. For example, $\alpha$ is set to 1, while $\beta$ is given values such as $10^2$, $10^3$, or $10^4$. Other values for $\alpha$ and $\beta$ are possible.

[0033] Additionally, the BetaBin($\alpha$, $\beta$) 55 relevancy measure functions similarly to the NumVotes 52 measure when $F_u \ll \alpha + \beta$, since $(f_u + \alpha)/(F_u + \alpha + \beta) \approx (f_u + \alpha)/(\alpha + \beta) \sim f_u$. Further, BetaBin($\alpha$, $\beta$) 55 functions similarly to the DivF 53 measure when $F_u \gg \alpha + \beta$, since $(f_u + \alpha)/(F_u + \alpha + \beta) \approx f_u / F_u$. Therefore, BetaBin($\alpha$, $\beta$) 55 has the benefit of measuring the proportion of a user's followers who are in the voter user set, like DivF 53, while also appropriately penalizing unpopular users like the NumVotes 52 measure.
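The BetaBin measure and its two limiting regimes can be checked numerically with a few lines. This is a minimal sketch: the function name `beta_bin` and the default parameter values (1 and 100, taken from the example values mentioned above) are choices made for illustration.

```python
def beta_bin(f_u, F_u, alpha=1.0, beta=100.0):
    # Posterior mean of p under a Beta(alpha, beta) prior after observing
    # f_u voters among F_u followers.
    return (f_u + alpha) / (F_u + alpha + beta)

# Small F_u relative to alpha + beta: the score is close to
# (f_u + alpha) / (alpha + beta), so it grows with the vote count f_u.
small_following = beta_bin(3, 3)
small_approx = (3 + 1.0) / (1.0 + 100.0)

# Large F_u relative to alpha + beta: the score approaches f_u / F_u.
large_following = beta_bin(10_000, 1_000_000)
large_approx = 10_000 / 1_000_000
```

A candidate with three votes from three followers therefore scores higher than one with a single vote from three followers, while a very popular user is scored by roughly the proportion of followers who voted.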

[0034] Unlike NumVotes 52, DivF 53, DivLogF 54, and BetaBin($\alpha$, $\beta$) 55, which take into account information about the link structure of the social network between the users, the LDA measure 56 takes into account the overall content, or topics, of users' messages as well. Candidate users are still determined from the voter user set, such as described above with reference to FIG. 3. A topic model is built to associate each user, and associated messages, with one or more topics. The entire message histories of the candidate users are collected and the LDA measure 56 is run on the messages. The LDA results provide a way of determining the topical similarity of any user to a search query based on the content of the user's tweets.

[0035] The LDA measure 56 analysis begins by collecting all messages made by a user into a document. Each user is represented by the aggregation of messages they have created. Next, the parameters for the LDA analysis are chosen. The number of topics, k, is empirically chosen, and is generally between 200 and 1,000 topics, though other topic numbers are possible. In one embodiment, the number of topics is set to 500. Parameters alpha and beta for the Dirichlet kernel are empirically chosen as well and are set to 0.1 and 0.5, respectively. Finally, the LDA algorithm, such as described in D. M. Blei et al., "Latent Dirichlet Allocation," 3 Jour. of Machine Learning Research 993-1022 (2003), the disclosure of which is incorporated herein by reference, is applied to the set of documents to obtain the two sets, $P_1$ and $P_2$, of topical distributions. $P_1(\text{query term} \mid \mathrm{topic}_k)$ is the probability distribution of terms for each topic, where k is the number of topics. $P_2(\mathrm{topic}_k \mid \mathrm{user}_i)$ is the probability distribution of topics for each document, which is an aggregation of messages by a user, where i is the number of users.

[0036] Given the two probability distributions, $P_1$ and $P_2$, the topical similarity between query terms and a user can be calculated as the probability that the user would generate the query terms, which is according to the equation:

$$\mathrm{Score}_{LDA}(\mathrm{user}_i) = \sum_k P_1(\text{query term} \mid \mathrm{topic}_k) \cdot P_2(\mathrm{topic}_k \mid \mathrm{user}_i)$$
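Once the two distributions are available, the score is a sum over topics of a product of two probabilities. The sketch below assumes the distributions have already been estimated by some LDA implementation and are held in plain Python structures; the function name, the data layout, the two-topic toy values, and the restriction to a single query term are all assumptions made for illustration.

```python
def lda_score(query_term, term_given_topic, topic_given_user):
    """Sum over topics of P1(query term | topic_k) * P2(topic_k | user_i).

    term_given_topic: list with one dict per topic, mapping term -> P1.
    topic_given_user: list with one probability per topic for this user.
    """
    return sum(
        p1.get(query_term, 0.0) * p2
        for p1, p2 in zip(term_given_topic, topic_given_user)
    )

# Toy model with two topics (hypothetical probabilities).
term_given_topic = [
    {"election": 0.20, "vote": 0.15},    # topic 0: politics
    {"election": 0.01, "camera": 0.30},  # topic 1: photography
]
topic_given_user = {
    "A": [0.9, 0.1],  # user A writes mostly about politics
    "B": [0.2, 0.8],  # user B writes mostly about photography
}

score_a = lda_score("election", term_given_topic, topic_given_user["A"])
score_b = lda_score("election", term_given_topic, topic_given_user["B"])
```

How a query with several terms should be scored is not specified by the equation, so the sketch handles one term at a time.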

[0037] The candidates are then ranked based on the results. In a further embodiment, LDA can be applied after one of the link structure-based measures has been applied, to re-rank the candidate users using topic similarity to the search query as the ranking criterion. For example, the two scores for ranking can be combined according to the equation:

$$\mathrm{Score}_{Combined} = W_{LinkStructure} \cdot \mathrm{Score}_{LinkStructure} + W_{LDA} \cdot \mathrm{Score}_{LDA},$$

[0038] where $\mathrm{Score}_{LinkStructure}$ equals one of NumVotes 52, DivF 53, DivLogF 54, or BetaBin($\alpha$, $\beta$) 55, $\mathrm{Score}_{LDA}$ equals the LDA determination, and $0 < W_{LinkStructure}, W_{LDA} < 1$ and $W_{LinkStructure} + W_{LDA} = 1$.
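The weighted combination can be sketched as a small helper that enforces the stated constraints on the weights. The function name, the default weight, and the sample scores are hypothetical. The text does not say whether the two scores are normalized to a common scale before being combined, so no normalization is performed here.

```python
def combined_score(link_score, lda_score, w_link=0.5):
    # The weights must each lie strictly between 0 and 1 and sum to 1,
    # so the LDA weight is derived from the link-structure weight.
    if not 0.0 < w_link < 1.0:
        raise ValueError("w_link must be strictly between 0 and 1")
    w_lda = 1.0 - w_link
    return w_link * link_score + w_lda * lda_score

# Hypothetical scores for one candidate: a DivF-style link score of 0.5
# and an LDA score of 0.181, with 70% of the weight on link structure.
example = combined_score(0.5, 0.181, w_link=0.7)
```

Deriving one weight from the other keeps the sum-to-one constraint from being violated by a caller who sets the two weights independently.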

[0039] Other content-based algorithms can be used, for example, probabilistic latent semantic analysis, latent semantic indexing, hierarchical LDA, and explicit semantic analysis.

[0040] While the invention has been particularly shown and described as referenced to the embodiments thereof, those skilled in the art will understand that the foregoing and other changes in form and detail may be made therein without departing from the spirit and scope of the invention.

* * * * *