U.S. patent application number 16/950816 was published by the patent office on 2021-05-13 for concept-level user intent profile extraction and applications.
The applicant listed for this patent is Netseer, Inc. Invention is credited to Riccardo Boscolo, Sanjiv Ghate, Nima Khajehnouri, John Mracek, Behnam A. Rezaei, Vwani P. Roychowdhury.
Application Number | 20210141814 16/950816 |
Document ID | / |
Family ID | 1000005346895 |
Publication Date | 2021-05-13 |
United States Patent
Application |
20210141814 |
Kind Code |
A1 |
Rezaei; Behnam A.; et al. |
May 13, 2021 |
CONCEPT-LEVEL USER INTENT PROFILE EXTRACTION AND APPLICATIONS
Abstract
Methods and systems for extracting intents and intent profiles
of users, as inferred from the different activities they execute
and data they share on social media sites, and then (i)
monetization of such intents via targeted advertisements, and (ii)
enhancement of user experience via organization of their contact
lists and conversations and posts based on their content and
conceptual context.
Inventors: |
Rezaei; Behnam A.; (Santa
Clara, CA) ; Roychowdhury; Vwani P.; (Los Angeles,
CA) ; Ghate; Sanjiv; (Sunnyvale, CA) ;
Khajehnouri; Nima; (Los Angeles, CA) ; Boscolo;
Riccardo; (Culver City, CA) ; Mracek; John;
(Los Altos, CA) |
|
Applicant: |
Name | City | State | Country | Type |
Netseer, Inc. | Menlo Park | CA | US | |
Family ID: |
1000005346895 |
Appl. No.: |
16/950816 |
Filed: |
November 17, 2020 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
16430742 | Jun 4, 2019 | 10860619 |
16950816 | | |
14017123 | Sep 3, 2013 | 10311085 |
16430742 | | |
61695877 | Aug 31, 2012 | |
Current U.S.
Class: |
1/1 |
Current CPC
Class: |
G06Q 30/0269 20130101;
G06F 16/285 20190101; G06Q 30/0251 20130101; G06Q 50/01 20130101;
G06Q 30/0255 20130101; G06Q 30/0256 20130101; G06Q 30/0271
20130101 |
International
Class: |
G06F 16/28 20060101
G06F016/28; G06Q 30/02 20060101 G06Q030/02; G06Q 50/00 20060101
G06Q050/00 |
Claims
1. A computer-implemented method comprising: identifying, via a
processor, structured user data from a social media site, the
structured user data comprising an email identification, phone
number, geo-location, friends and links; obtaining a global concept
graph comprising nodes that are concepts, and edges that are
relationships among such concepts, wherein the concepts comprise
phrases that represent entities, domain-specific terms and common
expressions that are used to convey information; wherein the
relationships are identified by annotated edges among concepts,
wherein the relationships comprise measures of closeness among the
concepts; obtaining at least one weighted sub-graph of the global
concept graph using the structured user data, user activities on
the social media site, the user activities not involving the social
media site, and user searches that do not involve the social media
site; obtaining an intent profile of the user from information in
the weighted sub-graph(s); and matching the intent profile with an
advertiser profile to target an advertisement to the user.
2. The method of claim 1 further comprising identifying, via the
processor, user activities involving the Internet that do not
involve the social media site.
3. The method of claim 1 further comprising: generating a score for
each vertical in a targeting vertical list.
4. The method of claim 3, wherein the score comprises a time factor
and an interest factor.
5. The method of claim 3, wherein the score is generated based on
user activities on the social media site, user activities involving
the Internet, incoming likes, sharing, recommendation on the social
media site, and user searches.
6. The method of claim 1 further comprising modifying a user
experience of the user by organizing a contact list of the user and
conversations and posts of the user based on their content and
conceptual context.
7. The method of claim 6, further comprising performing an
aggregation scoring.
8. The method of claim 1, wherein the concepts are selected from
the group consisting of people, companies, drugs, diets, films,
shows, events, wherein the domain-specific terms are selected from
the group consisting of sports and medical terminologies, specific
treatments, and procedures.
9. The method of claim 1, further comprising tagging the intent
profile of the user with temporal data.
10. The method of claim 1, further comprising: tagging unstructured
data using collective activities of users at the social media site,
wherein the unstructured data is selected from searches,
web-browsing, posts, comments, content of web pages that received
likes and links.
11. The method of claim 1, further comprising prioritizing
unstructured data using collective activities of users at the
social media site.
12. The method of claim 1, further comprising post-processing and
tagging the intent profile of the user with weighted category
scores defined over a structured taxonomy of interest.
13. The method of claim 1, further comprising assigning at least
one advertiser a set of advertiser target profiles.
14. The method of claim 13, wherein each of the advertiser target
profiles comprises a weighted list of categories picked from a
structured taxonomy.
15. The method of claim 1, wherein selecting a final set of
advertisement units is completed by an optimization process that
maximizes objective functions of interest, including revenue for
the social media site, value and Return-On-Investment (ROI) for
advertisers, while considering a device and media of the user.
16. The method of claim 6, wherein modifying a user experience
comprises organizing friends or contact lists of the user into
potentially overlapping groups by computing similarity between a
user's profile and the profiles of those of the user's friends and
contacts.
17. The method of claim 16, wherein organizing friends or contact
lists of a user into potentially overlapping groups comprises
computing similarity between a user's profile and the profiles of
those of the user's friends and contacts.
18. The method of claim 6, wherein modifying a user engagement
comprises organizing posts, comments and social interactions
between a user and his friends based on an underlying context.
19. A non-transitory computer readable medium comprising
instructions, that when read by a processor, cause the processor to
perform: identifying structured user data from a social media site,
the structured user data comprising an email identification, phone
number, geo-location, friends and links; obtaining a global concept
graph comprising nodes that are concepts, and edges that are
relationships among such concepts, wherein the concepts comprise
phrases that represent entities, domain-specific terms and common
expressions that are used to convey information; wherein the
relationships are identified by annotated edges among concepts,
wherein the relationships comprise measures of closeness among the
concepts; obtaining at least one weighted sub-graph of the global
concept graph using the structured user data; user activities on
the social media site; the user activities not involving the social
media site; the incoming likes, sharing, and recommendations on the
social media site; and user searches that do not involve the social
media site; obtaining an intent profile of the user from
information in the weighted sub-graph(s); and matching the intent
profile with an advertiser profile to target an advertisement to
the user.
20. A system comprising: a processor; and memory communicably
coupled to the processor; wherein the processor: identifies
structured user data from a social media site, the structured user
data comprising an email identification, phone number,
geo-location, friends and links; identifies incoming likes,
sharing, recommendations on the social media site; obtains a global
concept graph comprising nodes that are concepts, and edges that
are relationships among such concepts, wherein the concepts
comprise phrases that represent entities, domain-specific terms and
common expressions that are used to convey information; wherein the
relationships are identified by annotated edges among concepts,
wherein the relationships comprise measures of closeness among the
concepts; obtains at least one weighted sub-graph of the global
concept graph using the structured user data; user activities on
the social media site; the user activities not involving the social
media site; and user searches that do not involve the social media
site; obtains an intent profile of the user from information in the
weighted sub-graph(s); and matches the intent profile with an
advertiser profile to target an advertisement to the user.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of U.S. patent
application Ser. No. 16/430,742, filed on Jun. 4, 2019, which is a
continuation of U.S. patent application Ser. No. 14/017,123, filed
on Sep. 3, 2013, now U.S. Pat. No. 10,311,085, issued on Jun. 4,
2019, which claims priority to U.S. Provisional Application Ser.
No. 61/695,877, filed on Aug. 31, 2012, the entire disclosures of
which are incorporated by reference herein.
BACKGROUND
1. Field
[0002] The present disclosure relates generally to concept-level
user intent profile extraction and applications to monetization and
user-engagement enhancement in large-scale social media
platforms.
2. Related Art
[0003] In the online world there is a major need to be able to
understand and create temporally evolving profiles of users, and
how they interact with the various institutions and activities,
both online and in the real world. If such understanding and
profiling, both at the individual user level and at the collective
level of groups of users, can be achieved, then the various service
providers (e.g., social media sites, online advertisers, offline
stores and organizations) can use automated algorithms to serve the
right information, content, and services to every individual and
organization (i.e., group of users) in the right context and at
the right time. The only kinds of information available online are
the individual user actions, and the kind of structured data they
share with various social media and other sites that they register
with voluntarily. The structured data shared, e.g., one's place of
residence, education level and degrees obtained, professional
credentials, and their explicitly stated friends, email contact
lists, and followers on social media and news sites, etc. is easy
to categorize and collect and is being stored and heavily utilized
and mined by various online entities such as social networking and
media sites, including Facebook, Twitter, LinkedIn, Google+
etc.
[0004] The majority of user actions, however, are unstructured and,
when aggregated, comprise billions of atomic or elementary
actions per day, such as (i) users' Votes or Likes for articles,
posts, or other users' posts and activities, (ii) searches done at
major search engines and at individual sites, (iii) articles and
web pages browsed, and (iv) posts on social media and networking
sites and other interactions made among friends on such sites. For
example, not all friends are created equal, and one shares
different types of information and activities with different sets
of friends and colleagues. Such preferences are not explicitly
expressed and defined, but rather can only be inferred from the
content of the posts shared and liked, and the locations visited
together and can evolve over time.
[0005] One computationally challenging problem is how to make sense
of individual users, and of groups of users collectively, from the
billions of such seemingly diverse elementary actions and the
available structured data. Is it possible to create a unified
informational and functional view of individual users and groups of
users that is granular enough to capture all aspects of behavior
and preferences, and can evolve over time to be able to track a
user's evolving needs and interests? Others have tried to
accomplish such a task at different levels of granularity and with
varying success, but a comprehensive and a computationally scalable
solution has not been proposed.
[0006] For example, in the existing art detailed structured
databases are created based on the explicitly stated attributes of
users. These may include age, gender, place of residence, education
and schools attended, favorite institutions, such as sports teams,
favorite TV shows, music and music artists, celebrities, preferred
types of food, etc. This is valuable information, but the
expressive capabilities of such explicitly stated categories are
known to be very limited in characterizing a user's intent and
profile accurately. Moreover, such information is often outdated
or incorrectly entered, making it highly noisy.
Once entered in a database, it cannot be easily updated or
corrected.
[0007] The main way to deal with unstructured activities has been
to use taxonomies with predefined categories organized in various
data structures, such as a tree. For example, if a person visits a
sports page talking about the Los Angeles Lakers then that activity
could be categorized as an activity related to
Sports/Basketball/Lakers. These categories are then aggregated to
create user profiles. The major drawbacks of such an approach are
two-fold: (i) taxonomies have to be defined manually and can
comprise only a limited number of categories in them. The manual
nature of the process makes it less expressive, and user actions
cannot be captured comprehensively and at the right granularity by
such necessarily limited sets of categories. (ii) Every action and
content has to be classified as belonging to one of the categories
in a taxonomy and this process of classification is highly error
prone. The only ways to achieve such classification are via (i)
extensive training, which means providing examples of known pages
or content for each category and (ii) providing a set of keywords
or terms for each category and a classification is done based on
how many or what sets of such keywords appear in a document. Both
of these methods are highly manual and have computational problems
associated with them, including (i) the accuracy of the underlying
classification engine is only as good as the training sets provided
to it; it can easily lead to overtraining and thereby poor
generalization capabilities on new content, (ii) the bigger the
taxonomy the more is the manual and supervised part of the training
process, (iii) keywords are notoriously ambiguous and lead to
highly inaccurate classifications, and finally (iv) often documents
or content belong to multiple categories at the same time, and
training for such cases that involves classifying documents as
belonging to more than one category at the same time leads to a
combinatorially intractable problem.
SUMMARY
[0008] The following summary of the invention is included in order
to provide a basic understanding of some aspects and features of
the invention. This summary is not an extensive overview of the
invention and as such it is not intended to particularly identify
key or critical elements of the invention or to delineate the scope
of the invention. Its sole purpose is to present some concepts of
the invention in a simplified form as a prelude to the more
detailed description that is presented below.
[0009] To address the above mentioned limitations we present a
system and methodology for creating unified intent profiles based
on a collective/global concept graph, comprising nodes that are
concepts, and edges that are relationships among such concepts.
Naturally occurring and overlapping sub-graphs or communities in
such a concept graph organically define potential intent space of
users. Each activity, such as a search or a Like, is mapped to the
concept graph and a weighted subgraph is identified in the
underlying concept graph as a record of the activity. Such weighted
subgraphs are then aggregated over all activities of a user to
create a temporally evolving intent space profile of a user. Such a
shared concept graph, with highly granular concept level
expressiveness, allows one to not only capture individual users'
intent with unprecedented accuracy and detail, but also allows one
to compare and group similar users. It is as if an all-knowing
human goes through and catalogs all the activities of every user
and creates a summary in her mind, which allows comparison of
individuals across both time and geo-locations.
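The mapping and aggregation described in paragraph [0009] can be illustrated with a minimal sketch. It is not the applicants' implementation: the toy graph, the `spread` parameter, and the function names `activity_subgraph` and `aggregate_profile` are all assumptions introduced here for illustration. Each activity is reduced to a list of concepts, mapped to a weighted sub-graph, and the sub-graphs are summed into one profile.

```python
from collections import defaultdict

# Hypothetical toy concept graph: concept -> {neighbor: closeness weight}.
CONCEPT_GRAPH = {
    "diabetes": {"insulin pump": 0.8, "sulfonylureas": 0.6},
    "insulin pump": {"diabetes": 0.8},
    "sulfonylureas": {"diabetes": 0.6},
    "sailing": {"boats": 0.7},
    "boats": {"sailing": 0.7},
}


def activity_subgraph(concepts, graph=CONCEPT_GRAPH, spread=0.5):
    """Map one activity to a weighted sub-graph (node -> weight).

    Assumption: each concept found in the activity gets weight 1.0 and
    each graph neighbor gets `spread` times the edge's closeness weight.
    """
    weights = defaultdict(float)
    for concept in concepts:
        if concept not in graph:
            continue  # concept is not part of the shared graph
        weights[concept] += 1.0
        for neighbor, closeness in graph[concept].items():
            weights[neighbor] += spread * closeness
    return dict(weights)


def aggregate_profile(activities):
    """Sum the per-activity sub-graphs into one intent profile."""
    profile = defaultdict(float)
    for concepts in activities:
        for node, weight in activity_subgraph(concepts).items():
            profile[node] += weight
    return dict(profile)


# Three activities of one user, already reduced to concepts.
profile = aggregate_profile([
    ["diabetes"],       # page visit
    ["insulin pump"],   # second page visit
    ["sulfonylureas"],  # search query
])
```

In this toy example all three activities land in the same neighborhood of the graph, so the concept "diabetes" accumulates the largest weight even though the user touched it directly only once.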
[0010] According to one embodiment, a method is disclosed that
includes extracting an intent profile of a user based on activities
executed and information shared on a social media site by the user.
Extracting the intent profile includes identifying structured user
data from the social media site; identifying user activities
exclusively on the social media site; identifying user activities
involving the Internet; identifying incoming likes, sharing,
recommendation on the social media site; identifying user searches;
determining a social graph and connections of the user; and
determining a weighted cluster and sub-graph of a global concept
graph, wherein the global concept graph comprises nodes that are
concepts, and edges that are relationships among such concepts; and
determining a user intent from the intent profile of the user.
[0011] The method may further include targeting an advertisement to
the user based on the determined user intent.
[0012] The method may further include generating a score for each
vertical in a targeting vertical list. The score may include a time
factor and an interest factor. The score may be generated based on
user activities exclusively on the social media site, user
activities involving the Internet, incoming likes, sharing,
recommendation on the social media site, and user searches.
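The application does not specify how the time factor and the interest factor are combined. As one hedged illustration, the sketch below assumes an exponential decay with a hypothetical `half_life_days` parameter as the time factor and multiplies it by a per-event interest weight; the function name `vertical_score` is likewise an assumption.

```python
def vertical_score(events, now, half_life_days=30.0):
    """Score one targeting vertical from a user's events.

    events: list of (timestamp_in_days, interest_weight) pairs.
    Assumption: the time factor halves every `half_life_days` days,
    and the score is the sum of time_factor * interest_weight.
    """
    score = 0.0
    for timestamp, interest in events:
        age = max(0.0, now - timestamp)
        time_factor = 0.5 ** (age / half_life_days)
        score += time_factor * interest
    return score


# A recent weak signal and an old strong signal for the same vertical.
example = vertical_score([(90.0, 1.0), (0.0, 4.0)], now=90.0)
```

With these assumed settings, an event that is three half-lives old contributes one eighth of its original interest weight.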
[0013] The method may further include modifying a user experience
of the user by organizing a contact list of the user and
conversations and posts of the user based on their content and
conceptual context.
[0014] The method may further include performing an aggregation
scoring.
[0015] The structured data may include an email identification,
phone number, geo-location, friends and links.
[0016] The concepts may include phrases that represent entities
(e.g. people, companies, drugs, diets, films, shows, events etc.),
domain-specific terms (e.g., sports and medical terminologies,
specific treatments, procedures etc.), and common expressions that
are used to convey information. The relationships may be identified
by annotated edges among concepts. The relationships may be
measures of closeness among the concepts, including at least one of
co-occurrence statistics and explicit semantic relationships.
[0017] The method may include tagging the user intent profiles with
temporal data.
[0018] Determining the weighted sub-graph may be based on a number
of the user's friends that share the same nodes or edges.
[0019] The method may further include tagging unstructured data
using collective activities of users at the social media site.
[0020] The method may further include prioritizing unstructured
data using collective activities of users at the social media
site.
[0021] The method may further include post-processing and tagging
the user profile with weighted category scores defined over a
structured taxonomy of interest.
[0022] The method may further include assigning each such
advertiser a set of advertiser target profiles.
[0023] The advertiser target profile may include one or more
weighted and time-tagged sub-graphs of the global concept
graph.
[0024] The advertiser target profile may include a weighted list
of categories picked from a structured taxonomy.
[0025] The method may further include determining for each user a
weighted and prioritized list of advertisers that best match the
user intent.
[0026] Matching user intent to an advertiser may include computing
an overlap or distance between the user intent profile and the
advertiser target profile by computing a measure of the distance
between the two corresponding weighted sub-graphs in the global
concept graph.
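The application leaves the overlap or distance measure open. The sketch below uses cosine similarity over node weights as one possible choice; this is an assumption, not the measure claimed. Profiles are represented as dictionaries from concept to weight, and the names `cosine_overlap` and `rank_advertisers` are hypothetical.

```python
import math


def cosine_overlap(user_profile, advertiser_profile):
    """Cosine similarity between two weighted sub-graphs (node -> weight)."""
    shared = set(user_profile) & set(advertiser_profile)
    dot = sum(user_profile[n] * advertiser_profile[n] for n in shared)
    norm_u = math.sqrt(sum(w * w for w in user_profile.values()))
    norm_a = math.sqrt(sum(w * w for w in advertiser_profile.values()))
    if norm_u == 0.0 or norm_a == 0.0:
        return 0.0  # an empty profile overlaps with nothing
    return dot / (norm_u * norm_a)


def rank_advertisers(user_profile, advertisers):
    """Return (advertiser, overlap) pairs, best match first."""
    scored = [(name, cosine_overlap(user_profile, target))
              for name, target in advertisers.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


user = {"diabetes": 1.0, "sailing": 1.0}
ranked = rank_advertisers(user, {
    "pharma": {"diabetes": 1.0},
    "golf": {"golf": 1.0},
})
```

The sorted output corresponds to the weighted and prioritized list of advertisers mentioned in paragraph [0025]; a measure that also compares edge weights would be a natural refinement.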
[0027] Selecting a final set of advertisement units may be
completed by an optimization process that maximizes objective
functions of interest, including revenue for the social media site,
value and Return-On-Investment (ROI) for the advertisers, while
considering the device and media that the user is on at the time of
the impression.
[0028] Modifying a user engagement may include organizing friends
or contact lists of a user into potentially overlapping groups by
computing similarity between a user's profile and the profiles of
those of his friends and contacts.
[0029] Organizing friends or contact lists of a user into
potentially overlapping groups may include computing similarity
between a user's profile and the profiles of those of the user's
friends and contacts.
[0030] Modifying a user engagement may include organizing posts,
comments and social interactions between a user and his friends
based on an underlying context.
[0031] The linear list of posts on a page of the social media site
can be organized into categories by automatically classifying the
posts by mapping the posts to categories in the global concept
graph.
[0032] The method may further include post-processing and tagging
the user profile with weighted category scores defined over a
structured taxonomy of interest.
[0033] The method may further include determining a suggested
search term based on the intent profile.
[0034] The method may further include grouping content based on
communal user actions; and dividing the grouped content into
clusters. The method may further include generating a collective
content profile; and tagging the corresponding users. The method
may further include prioritizing the grouped content based on page
and domain statistics.
[0035] Computer systems having memory and a processor for
implementing the method are also disclosed. A computer readable
storage medium for executing the method is also disclosed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0036] The accompanying drawings, which are incorporated into and
constitute a part of this specification, illustrate one or more
examples of embodiments and, together with the description of
example embodiments, serve to explain the principles and
implementations of the embodiments.
[0037] FIG. 1 is a graphical view of a sample concept cluster
around diabetes. FIG. 1 is an example sub-graph of a concept graph
where nodes comprise concepts and edges their relationships. Such
sub-graphs in the concept graph naturally define different topics
or categories in knowledge space. Because of the nature of the
graphical representation with edge and node weights, the categories
are overlapping and can evolve as the nature of the network and the
relative weights change over time.
[0038] FIGS. 2A and 2B illustrate mapping a user action to a global
concept graph. The user activity is illustrated in FIG. 2A by the
user, who is tagged with an ID, who visits a page on diabetes. The
user action in FIG. 2A is mapped to the globally shared Concept
Graph by finding the concepts that are most important to the page
visited by the user as shown in FIG. 2B. Nodes in grey are a subset
of nodes in the graph neighborhood related to the user action of
visiting the WebMD page, around the concept diabetes.
[0039] FIGS. 3A and 3B further illustrate tracking user actions. As
shown in FIG. 3A, the same user, i.e., the user with the same ID as
in FIG. 2A, visits another page related to insulin pumps. As shown
in FIG. 3B, the user action is mapped to the global concept graph.
In FIG. 3B, the nodes in grey are a subset of nodes in the graph
neighborhood related to this user action. As shown in FIG. 3B, the
Concept graph provides one globally shared template for tracking
user actions.
[0040] FIGS. 4A and 4B further illustrate tracking user actions. As
shown in FIG. 4A, the same user as in FIGS. 2 and 3 queries for
"Sulfonylureas". This is mapped to nodes in the Concept Graph, as
shown in FIG. 4B. As shown in FIG. 4B, the nodes in dark grey are a
subset of nodes in the graph neighborhood in the Concept Graph
related to the search user action. As shown in FIGS. 2B-4B, the
three different actions (FIGS. 2A-4A) may be mapped to the same
neighborhood and highly connected region of the underlying concept
graph.
[0041] FIGS. 5A-5C illustrate combining a social graph and a concept
graph. A social networking site can use the improvements described
herein to create powerful User-Intent Profiles. For example, a
hypothetical Mr. X can have structured information such as: Male,
34, Santa Clara, Calif., Widowed, Working Organization
Affiliations: Sharif University, Iran, UCLA and NetSeer. The user's
explicit Intent may include: Likes: Classical Music, Chopin,
Peruvian Food. The user's social Graph (Friends) may include: CEOs
of Yahoo!, Facebook, and NetSeer; Michael Jackson, and Mario
Batali. From the unstructured actions, such as likes and browsing,
one can represent the person as: Mr. X is a diabetic who likes
sailing and Rolex watches. These characterizations are captured as
the weighted sub-graphs in the globally shared Concept Graph, which
are shown in FIGS. 5A-5C.
[0042] FIG. 6 illustrates how a social media site such as Facebook
can predict users' next searches, based on the complete user
profile and posts. This can enable such social networking sites to
get search market share from Search Engines by integrating search
experience into social interactions. This can have a further
advantage in being able to bootstrap and provide price support for
their existing advertisement network. On a user clicking on one of
these search links, the user could be directed to the current
sponsored search/ad feed of the social media entity. In FIG. 6, a
post by a user about aerobics is matched to search suggestions
related to exercise, dieting and weight loss.
[0043] FIG. 7 illustrates a schematic of an Intent Based Display
Market that could be powered by the inventions stated in this
application. As shown in FIG. 7, an entity such as the assignee,
NetSeer, can take all the inputs from social networks, web
browsing, and other activities and then empower a Social Advertiser
Platform. The user intent profiles can be used to create
personalized advertisement creatives that are suitable for any
format and media, including social networking posts, mobile
devices, and web pages. The demand side feeds that can be used are
search ad feed, display advertisement, as well as feeds meant for
social sites. The supply side inventories could come from social
networking inventory, the conventional web pages or mobile
inventory. In mobile inventory, one form of advertisement could be
search suggestions targeted for the individual user in the right
context, i.e., the search suggestions not only match the intent profile
of the user but also match the context of the page being viewed or
the App being used.
[0044] FIG. 8 illustrates one exemplary application of the
invention to enhancing user experience on a social networking site.
Currently the signal is mostly social, that is, for example, who
likes whose posts. However this leads to a cluttered user
experience. As illustrated in FIG. 8, embodiments of the invention
enable the addition of an intent/context signal to the social signal. Thus,
in FIG. 8, posts from Mr. X on baseball and specifically for the
Giants could be given higher priority when listed on the wall of
Mr. Y.
[0045] FIG. 9 is a schematic diagram of an exemplary computer
system according to one embodiment of the invention.
DETAILED DESCRIPTION
[0046] Embodiments of the invention teach methods and systems for
extracting intents and intent profiles of users, as inferred from
the different activities they execute and data they share on social
media sites, and then (i) monetization of such intents via targeted
advertisements, and (ii) enhancement of user experience via
organization of their contact lists and conversations and posts
based on their content and conceptual context.
[0047] In the following paragraphs, a local algorithm according to
the present invention based on percolation theory is described.
Then the local community finding is generalized to an algorithm to
detect and explore community structure of a complex network.
Thereafter benchmarking is disclosed for the algorithm using
different randomly generated networks and real world networks
employed in the literature to estimate the performance of community
finding algorithms.
Example 1
Percolation Community Finding Approach
[0048] In this section a social definition of communities is
adopted, showing that percolation identifies communities of a given
vertex compared to previous diffusion based algorithms. Next,
essentials and performance of the algorithm on ER random graphs are
illustrated.
A. Social Communities Revisited
[0049] While other definitions provide important insight into
partitioning of the graph, a more local approach is chosen herein.
Unlike cohesive community definitions, Garton et al. [8] define
communities as follows. In social network analysis context, a group
is an empirically-discovered structure. By examining the pattern of
relationships among members of a population, groups emerge as
highly interconnected sets of actors known as cliques and clusters.
In network analytic language, they are densely-knit (most
possibilities exist) and tightly-bounded, i.e., most relevant ties
stay within the defined network [18-20]. Social network analysts
then want to know who belongs to a group, as well as the types and
patterns of relations that define and sustain such a group. Note
that this definition differs from previous cohesive definitions
used by Newman and others, which require not only that most relevant
ties stay within the community but also that nodes be highly
interconnected [19]. It also allows overlap between
communities.
[0050] This definition arises naturally in many biological, social,
or web networks in which vertices with close functionality or interest
form highly clustered communities. A vertex is connected to many
other vertices within the community.
B. Percolation: Background
[0051] In an embodiment of the present invention, bond percolation
is used as a tool for vertices to explore and identify their
structural neighborhood. Percolation theory was first used to study
the flow of fluids in porous media and introduced a new approach to
problems usually dominated by diffusion theory. Percolation theory
has been used to describe randomness and disorder in the structure
of the medium while diffusion processes ascribe random movements of
agents in a deterministic structure. Similar concepts have been
used extensively in complex networks literature to model and
analyze different phenomena in the network such as random walk and
network robustness to node and edge failures. Many of these
nonlinear dynamic properties of complex networks undergo phase
transition when subject to different factors that affect
interactions of structure and movement of agents in the system.
[0052] Other community finding algorithms have extensively used
diffusion processes and random walks to define betweenness and
identify partitioning of a network into different communities.
According to the present invention, a different approach,
percolation theory, is used to identify a highly clustered group of
vertices that have close ties to each other. Diffusion based
algorithms are based on the observation that edge betweenness [24]
of edges at community boundaries are high since they enable
communication between nodes in different communities. FIG. 1
illustrates this concept, where the thickness of each edge is
proportional to the shortest-path betweenness between all vertices in
the network. It may be easily seen that while boundaries of cohesive
communities are easily identified, the approach does not perform well
in overlapping communities. Instead of performing random walks, a
percolation message is started from a node, the set of connected
components is constituted, and the components are examined at the
point where their size does not increase as the percolation
probability is increased, as shown in FIG.
2. FIG. 1D and FIG. 1E show how a distributed percolation search
may return highly clustered nodes to a vertex, where the size of each
node is proportional to the fraction of percolation messages
returned. The random walk based algorithms may be viewed as
randomized versions of breadth-first search while percolation
method is a randomized version of depth-first search.
C. Performance on Random Graphs
[0053] Review of a simple community model in random graphs shows
the relationship between empirical social definitions and complex
network analysis. A collection of highly clustered ER graphs has
been extensively used in the literature to analyze the basic
performance of community finding algorithms [9, 12, 15]. A random
graph is a graph in which properties such as the number of graph
vertices, graph edges, and connections between them are determined
in some random way [2]. For many monotone-increasing properties of
random graphs, graphs of a size slightly less than a certain
threshold are very unlikely to have the property, whereas graphs
with a few more graph edges are almost certain to have it. This is
known as a phase transition or threshold phenomenon. Of particular
interest is the size of the largest connected component of the
graph. An ER graph G(N; p) is a random graph with N vertices where
each pair of vertices has an edge between them with probability p
[5, 6]; the existences of any two edges are independent events.
[0054] Consider a random ER graph of size N, where each pair of
vertices is connected with probability p0(N). This may be viewed
as (bond) percolation on a complete graph with percolation
probability of p0(N). Erdős and Rényi [5, 6] show that the
connected components have interesting properties when p0(N) scales
as p0(N) ∝ c/N. Depending on c, the following behaviors happen with
probability one for large N:
[0055] I. For c<1 size of the largest connected component is
.THETA.(log(N)).
[0056] II. At phase transition and for c=1 size of the largest
connected component is .THETA.(N.sup.2/3).
[0057] III. For c>1 a giant component appears and has size
.THETA.(N).
[0058] Remark 1: Bond percolation on an ER graph G(N; p0) with
probability pp will result in an ER graph G(N; p0·pp).
[0059] Thus the critical percolation probability for a randomly
generated graph with p0 is given by pc=c/(p0N) where c>1. Below
this probability, vertex i will belong to a connected component of
maximum size .THETA.(log(N)), and above the threshold the
probability of almost all vertices belonging to a giant connected
component is a constant, i.e., there is a path between any two
randomly chosen vertices with non-vanishing constant
probability for large N.
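The threshold behavior described in properties I and III can be checked numerically. The following is a minimal sketch, not part of the disclosed method: it samples G(N, c/N) with Python's standard library and measures the largest connected component with a union-find structure. The function name, the graph size N=1000, the two values of c, and the random seed are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def largest_component_size(n, p, rng):
    """Sample an ER graph G(n, p) and return the size of its largest component."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            # Each pair of vertices is joined independently with probability p.
            if rng.random() < p:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj

    sizes = Counter(find(v) for v in range(n))
    return max(sizes.values())


rng = random.Random(0)  # fixed seed, an assumption made for repeatability
n = 1000
sub = largest_component_size(n, 0.5 / n, rng)  # c = 0.5, below the threshold
sup = largest_component_size(n, 2.0 / n, rng)  # c = 2.0, above the threshold
print("largest component, c=0.5:", sub)
print("largest component, c=2.0:", sup)
```

For c below one the largest component should stay small compared with N, while for c above one it should cover a constant fraction of the vertices.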
[0060] For a vertex i, define the set Si p as the connected component
including vertex i when (bond) percolating with probability p.
Define the community with strength p of vertex i, Ci p, as the set of
pairs (j, m) where j.epsilon.Si p for m iterations out of k iterations
and m>kth.
[0061] The question remaining is how a vertex i identifies its
communities distinctively, i.e., what values of percolation strength
p correspond to distinguishable communities. Returning to the
definition of communities as sets of vertices with similar interest
or topic and thus higher probability of connection, one may observe
that communities will emerge as connected components when varying
the percolation probability. To illustrate, consider a simple
example of an ER graph G(N, p0) of size N with connection probability
p0. A subset .GAMMA. of nodes forms a local community of size M, i.e.,
each pair of vertices in .GAMMA. is connected with probability
p.GAMMA.>>p0, as illustrated in FIG. 3. Then,
[0062] Remark 2: For large M and N and a percolation probability pc
satisfying c/(p0N)>>pc>>c/(p.GAMMA.M), the probability of any two
vertices i and j belonging to the same connected component is one if
they both belong to .GAMMA. and is vanishingly small otherwise.
Proof
[0063] The proof follows directly from property II since the
percolation threshold is above the threshold for an ER graph of
.GAMMA. and below the percolation threshold of a global ER
graph.
[0064] This means that for any vertex i in .GAMMA., Ci p is
approximately .GAMMA. for c/(p0N)>p>c/(p.GAMMA.M) and will
include almost all vertices of G for p>c/(p0N).
[0065] The definition is now generalized to multiple overlapping
and non-overlapping communities, and the behavior of Ci p is
investigated in different cases. Consider an ER graph of size N with
probability p0 and two subgraphs, .GAMMA.1 and .GAMMA.2, of size M1
and M2 and connection probabilities of p1>>p0 and
p2>>p0, respectively. Define the critical percolation
probabilities pci=c/(piMi), i=1, 2. The behavior of the connected
components as the percolation probability is swept, for both the
overlapping and non-overlapping cases, is illustrated in FIG. 4A
and FIG. 4B.
[0066] For c/(p0N)>pp>max(pc1, pc2), the percolation
probability is above the subgraph percolation probabilities, so by
Remark 2 almost all the vertices in each community are connected.
Now consider two cases:
[0067] 1. If .GAMMA.1 and .GAMMA.2 overlap, then any two vertices
within the same subgraph are almost surely connected, so any two
vertices in the two communities are connected almost surely. When
percolation is started from a node in .GAMMA.k, a fraction qk of
iterations return from nodes in .GAMMA.k and, naively, a fraction
q1q2 of iterations return from nodes in the other community.
[0068] 2. If .GAMMA.1 and .GAMMA.2 are non-overlapping, the
probability of getting from any node in .GAMMA.k to any other node
when percolating is a non-vanishing constant qk. The
probability of getting from a node i in one community to a node j
in the other community is then 1-(1-q1q2).sup..alpha., where .alpha.
is the expected number of edges between the two communities, which in
this model is approximately .alpha.=M1M2p0. So any two communities
that have strong ties will also connect weakly, where the strength
depends roughly on the number of edges between the communities.
[0069] The above analysis predicts that Ci p will have phase
transitions at critical probabilities corresponding to communities,
which analysis provides a local way of distinguishing communities
without any global information.
Local Community Finding
A. Algorithm
[0070] The algorithm to find communities for each vertex involves
sending a percolation message with percolation probability pp,
forming Ci pp for a range of pp, and finding the abrupt changes in
the community size.
[0071] 1. Vertex i sends a message with percolation probability pp
and a unique ID identifying the iteration;
[0072] 2. It records the responses and forms the set Si pp of the
vertices that responded;
[0073] 3. The above task is performed k times, forming the set Ci pp
of all the vertices that responded more than kth times; and
[0074] 4. Ci pp is computed for a range of pp, and the abrupt changes
in Ci pp are found at percolation probabilities ppl, defining
community layer l with strength ppl as Ci ppl.
[0075] The above algorithm basically finds nodes with high
clustering and strong ties with the source node, while diffusion
algorithms try to identify edges with high betweenness and high
flow of random walks to find boundaries of communities. FIG. 1
compares diffusion-based algorithms and percolation-based
algorithms.
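The local community finding steps can be sketched in a few lines of Python. This is an illustrative, centralized emulation under stated assumptions, not the disclosed distributed implementation: the graph is an adjacency dictionary, each percolation iteration is simulated by a search that keeps every edge independently with probability p, and the names `percolate`, `local_community`, `sweep`, and `two_cliques`, the test graph, and the values of k and kth are all hypothetical.

```python
import random


def percolate(adj, source, p, rng):
    """One iteration: vertices reached from source when each edge is open with probability p."""
    reached = {source}
    frontier = [source]
    decided = set()  # edges whose open/closed state was already sampled this iteration
    while frontier:
        u = frontier.pop()
        for v in adj[u]:
            edge = frozenset((u, v))
            if edge in decided:
                continue
            decided.add(edge)
            if rng.random() < p and v not in reached:
                reached.add(v)
                frontier.append(v)
    return reached


def local_community(adj, source, p, k, k_th, rng):
    """Return {vertex: m} for vertices that responded in more than k_th of k iterations."""
    counts = {}
    for _ in range(k):
        for v in percolate(adj, source, p, rng):
            counts[v] = counts.get(v, 0) + 1
    return {v: m for v, m in counts.items() if m > k_th}


def sweep(adj, source, probabilities, k, k_th, rng):
    """Community size for each percolation probability; abrupt jumps mark community layers."""
    return [(p, len(local_community(adj, source, p, k, k_th, rng))) for p in probabilities]


def two_cliques(size):
    """Hypothetical test graph: two cliques joined by a single bridging edge."""
    adj = {v: set() for v in range(2 * size)}
    for base in (0, size):
        for u in range(base, base + size):
            for v in range(base, base + size):
                if u != v:
                    adj[u].add(v)
    adj[size - 1].add(size)
    adj[size].add(size - 1)
    return adj


adj = two_cliques(6)
for p, size in sweep(adj, 0, [0.1, 0.3, 0.5, 0.7, 0.9, 1.0], 400, 280, random.Random(1)):
    print(f"p={p:.1f} community size={size}")
```

In this toy graph the community of vertex 0 is expected to plateau at its own clique for intermediate probabilities and to jump to the whole graph only when the bridging edge is open in most iterations.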
B. Advantages
[0076] Using percolation-based algorithms has many advantages over
divisive and agglomerative algorithms introduced in the literature.
The distributed and parallel nature of percolation search provides
a means to locally explore communities for a particular node,
called its structural neighborhood [16]. Often in real-world
networks communities are not cohesive and have overlaps, in which
case diffusion-based approaches fail since there are no separate
boundaries for communities to find. The algorithm according to the
present invention explores communities and identifies vertices in
the overlap of communities. Another property of interest in community
structure is how strongly a vertex belongs to a community and the
level of confidence in the community structure [21]. A percolation
search may be shown to easily provide these statistics by observing
the fraction of returned messages from a particular vertex. FIG. 1
illustrates the fundamental differences between diffusion- and
percolation-based approaches. For a network with n vertices and m
edges, other types of community finding algorithms may find
community structure in O(md log(n)) [4] where d is the depth of
community dendrogram. However, one needs to process the whole graph
to capture community structure of a particular node.
Community Finding
[0077] In the previous section we discussed a local and distributed
algorithm to find communities of a single vertex. In this section
we generalize this method to find the community structure of the
graph, usually called the community dendrogram. In this case the
dendrogram is not a simple tree since communities may overlap.
[0078] The first approach to create community structure is to
define the non-symmetric distance d(i,j) between vertices as:
[0079] $$d(i,j) = \begin{cases} 0, & j \notin C_i^{p_p}\ \ \forall\, p_p \\ \max_m\, p_{p_m}, & m : j \in C_i^{p_{p_m}} \end{cases}$$
[0080] Then classical clustering approaches may be used on this
distance matrix to find the partitioning of nodes into communities
[25].
[0081] Because the local community finding algorithm finds the major
communities of each vertex, the present invention
includes a global community finding algorithm that merges the
individual vertex community finding results. This algorithm has
several advantages over previously proposed algorithms. It is more
robust since it merges the communities over several vertices. It
allows overlap of communities and purges weak and insignificant
communities automatically.
Community Finding Algorithm
[0082] For each community pair $(C_1, C_2)$, define:
[0083] $n_{1,2} = |\{(i,m) \mid (i,m) \in C_1,\ (i,m') \in C_2,\ m > 0.25\,m_1,\ m' > 0.25\,m_2\}|$
[0084] $n_1 = |\{(i,m) \mid (i,m) \in C_1,\ (i,m') \notin C_2,\ m > 0.25\,m_1\}| + |\{(i,m) \mid (i,m) \in C_1,\ (i,m') \in C_2,\ m > 0.25\,m_1,\ m' \le 0.25\,m_2\}|$
[0085] $n_2 = |\{(i,m) \mid (i,m) \notin C_1,\ (i,m') \in C_2,\ m' > 0.25\,m_2\}| + |\{(i,m) \mid (i,m) \in C_1,\ (i,m') \in C_2,\ m \le 0.25\,m_1,\ m' > 0.25\,m_2\}|$
[0086] where $m_i$ is the number of times community i has been merged.
The similarity measure, $\psi_{1,2}$, is then defined as
$\psi_{1,2} = (n_{1,2} - (n_1 + n_2)) / (n_{1,2} + (n_1 + n_2))$.
[0087] 1. For each vertex i in the network, perform the local
community finding algorithm to get the different levels of communities
$C_i^{p_{p_l}}$ corresponding to percolation probabilities $p_{p_l}$.
Normalize each by $M = \max\{m : (j,m) \in C_i^{p_{p_l}},\ j \neq i\}$.
Set (i,m)=(i,1).
[0088] 2. Find the community pair $C_l$ and $C_k$ that has the maximum
similarity $\psi_{max} = \max_{i,j} \psi_{i,j}$. If $\psi_{max} < 1/3$
go to 3.
[0089] 3. Merge community $C_l$ into $C_k$ and set $m_k = m_k + m_l$.
[0090] 4. Normalize each remaining community $C_k$ by
$(i,m) = (i,\ m / \max_{(k,n) \in C_k} n)$.
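The similarity measure and merging steps can be illustrated with a short Python sketch. It rests on several assumptions that the text does not settle: each community is a dictionary mapping a vertex to its strength m, membership symbols in the counts are read as "in" and "not in", merging stops once the best similarity falls below 1/3, and strengths are summed during merging and normalized once at the end. The function names and the example data are hypothetical.

```python
from itertools import combinations


def similarity(c1, m1, c2, m2, frac=0.25):
    """psi = (n12 - (n1 + n2)) / (n12 + (n1 + n2)) over strong members of each community."""
    strong1 = {v for v, m in c1.items() if m > frac * m1}
    strong2 = {v for v, m in c2.items() if m > frac * m2}
    n12 = len(strong1 & strong2)  # strong in both communities
    n1 = len(strong1 - strong2)   # strong in C1, absent or weak in C2
    n2 = len(strong2 - strong1)   # strong in C2, absent or weak in C1
    denom = n12 + n1 + n2
    return (n12 - (n1 + n2)) / denom if denom else 0.0


def merge_communities(communities, threshold=1 / 3):
    """Greedily merge the most similar pair until no pair reaches the threshold."""
    pool = [(dict(c), 1) for c in communities]  # (strengths, merge count m_i)
    while len(pool) > 1:
        a, b = max(
            combinations(range(len(pool)), 2),
            key=lambda ab: similarity(*pool[ab[0]], *pool[ab[1]]),
        )
        if similarity(*pool[a], *pool[b]) < threshold:
            break  # assumed stopping rule
        (ca, ma), (cb, mb) = pool[a], pool[b]
        merged = dict(ca)
        for v, m in cb.items():
            merged[v] = merged.get(v, 0.0) + m
        pool = [x for idx, x in enumerate(pool) if idx not in (a, b)]
        pool.append((merged, ma + mb))
    # Final normalization so the strongest member of each community has strength 1.
    result = []
    for strengths, _ in pool:
        top = max(strengths.values())
        result.append({v: m / top for v, m in strengths.items()})
    return result


local_results = [
    {1: 1.0, 2: 0.9, 3: 0.8},
    {1: 0.9, 2: 1.0, 3: 0.7, 4: 0.1},
    {7: 1.0, 8: 0.9},
]
merged = merge_communities(local_results)
for community in merged:
    print(sorted(community))
```

The first two local results share all of their strong members and are merged, while the third shares none and stays separate.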
[0091] To further benchmark an algorithm according to the present
invention, its results are compared on a number of randomly generated
graphs and on social and biological networks used to measure the
performance of previous community finding algorithms [9, 12, 15].
A. Randomly Generated Network
[0092] An algorithm according to the present invention is applied
to two sets of randomly generated graphs. To benchmark the
algorithm a traditional synthetic ER graph proposed in [9] is used.
Then an overlap model of randomly generated graphs is used to
demonstrate the advantages of proposed algorithm compared to
partitioning algorithms and in particular to the fast community
finding algorithm proposed in [9].
1. Random Non-Overlapping Communities.
[0093] A large number of graphs of size N=128 were generated and
divided into 4 equal-sized communities of 32 vertices each. Any two
vertices within the same community are connected with probability p1,
and any two vertices in different communities with probability p0, so
that the expected degree of each vertex is 16. The performance of the
community finding algorithm in finding the communities was examined
for different values of intra-community edges. FIGS. 5A-5C show the
percentage of the vertices classified correctly for a range of
inter-community edges. The results are benchmarked against similar
experiments with other algorithms. FIGS. 5A-5C show that the proposed
algorithm works as well as a GN algorithm for a small number of
average inter-community edges per vertex and works considerably better
for large values of inter-community edges, inasmuch as the GN
algorithm fails to detect communities when the number of edges inside
a community and the number of edges to outside of the community are
close. The algorithm according to the present invention still detects
communities, with less precision, since edges to outside of the
community are randomly distributed over the network while inside edges
form a clustered set of vertices that are more interconnected.
2. Random Overlapping Communities
[0094] While the previous example benchmarks the performance of
community finding algorithms on random graphs, often in practical
networks communities are not well separated as in the
previous model; rather, real-life communities have overlaps,
i.e., some of the nodes have strong ties to more than one community.
One of the advantages of the proposed algorithm due to its localized
approach is that it may correctly identify overlapping communities,
while traditional partitioning algorithms partition overlap
vertices into one of the communities. Further enhancements of the
GN algorithm propose to capture such behaviors by randomizing the
partitioning steps [21].
[0095] Considering a randomly-generated graph with 128 vertices,
each random vertex has 2 random edges on average. Two communities
of size 37 exist where each node has on average 14 random edges
inside the community. The two communities also have 5 nodes in
common, as shown in FIG. 6A. Both a GN fast community finding
algorithm and percolation community finding algorithm were applied.
The GN method partitions the common vertices into one of the
communities, as shown in FIG. 6B, while the method according to the
present invention identifies communities and includes overlapping
vertices in both communities, as shown in FIG. 6C. Using the method
according to the present invention, seven nodes, {44, 60, 61, 77,
88, 90, 102}, are misclassified, while using the GN method 28 nodes
are misclassified.
B. Zachary Karate Club
[0096] The local community finding algorithm according to the
present invention has been applied to the Zachary karate club
network [22]. This undirected graph has been used extensively in
previous literature [9, 12, 15] for algorithm benchmark. Zachary
recorded the contacts between members of a karate club over a
period of time. During the study, after a fight between owner and
trainer, the club eventually split in half. The original network
and the partitioning after the split are depicted in FIG. 7A. The local
community finding algorithm according to the present invention was
applied for three important nodes in the network. Nodes 1 and 33
represent the administrator and instructor respectively, and node
34 represents a high degree node with close relations with node 33,
see FIG. 7. Note that the notion of community used herein is
different from that of [9] and hence the outcome is different. The
algorithm according to the present invention looks for closely
connected nodes in a cluster. Identified communities clearly have
overlaps. The sizes of the nodes are proportional to their strength
in the community. FIG. 7B shows the local community finding for
node 1. As expected, node 17 is singled out since it does not have
strong ties to the community. FIG. 7C shows the local community
finding result for the instructor. Node 27 has been singled out of
the community since it does not have strong social connections with
the community. Also, nodes 10, 25, 26, 28, 29 were singled out.
FIG. 7D shows the community for node 34 and it shows that it also
includes the administrator. Note that although percolation
probability is symmetrical, i.e. the probability of node i and j
being in the same connected component, the inclusion in the
community is not symmetrical and node 34 is not included in
community of node 1. The reason is that the community threshold is
different for the two depending on network neighborhood. The
community structure finding algorithm was also applied. Seven
overlapping communities were identified. FIG. 8 shows the schematic
of the relationship between detected communities. Two major
communities are represented in FIG. 9. Again the sizes of the nodes
represent their strengths in the community. The algorithm is able
to identify the communities correctly and further identify the role
and strength of each node in the community. Several nodes are
clearly in the overlap between the communities as they have weak
ties with both communities. As expected, nodes 1 and 34 are in
both communities because of their close social connections with
both communities, and node 12 has been excluded from the
administrator community because it only has a single connection to
node 1.
[0097] In the present description, a new distributed algorithm for
finding communities of a vertex in a localized fashion is
disclosed. It exploits the social definition of a community as a highly
interconnected set of vertices. The algorithm according to the
present invention is generalized to achieve a list of the
communities for a network. It is shown how this algorithm has
superior performance over previous algorithms by allowing overlap
between communities and robustness to network perturbations. The
algorithm may be further optimized by taking advantage of the fact
that strong nodes in a community have similar local communities,
and so complexity of the algorithm may be reduced by removing this
computational redundancy.
Variations in the basic algorithm include:
[0098] 1. Starting from a set of seed nodes instead of a single node.
[0099] 2. Instead of defining a global and uniform percolation
probability, each node i is assigned a weight between 0 and 1 as Wi.
Each edge between nodes i and j may also be assigned a weight between
0 and 1 as Wij. Then each node, instead of passing the message with
percolation probability Pperc, passes the message with a probability
that is a function f(Pperc, Wi, Wij); for example, it may be
Pperc*Wi*Wij.
[0100] 3. Nodes and links may have different types, and each type
may have a predefined weight.
[0101] 4. Weights of different links may be trained and adjusted for
a particular user depending on the usage pattern or concept. For
example, for a user searching for biotechnology, the weight of the
nodes in other concepts like food industry could be reduced.
[0102] 5. Sweeping over the percolation probability may be optimized
by doing a quick search over this metric.
[0103] 6. The result of the community findings may be used to
adjust link and node weights.
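The weighted variation in item 2 can be made concrete with a small Python sketch. It implements only the product form Pperc*Wi*Wij given as an example in the text; the function name, the range check, and the per-type weight table are assumptions added for illustration.

```python
def forward_probability(p_perc, node_weight, edge_weight):
    """Example choice of f(Pperc, Wi, Wij): the plain product of the three factors."""
    for name, value in (
        ("p_perc", p_perc),
        ("node_weight", node_weight),
        ("edge_weight", edge_weight),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return p_perc * node_weight * edge_weight


# Hypothetical predefined weights per node type (item 3); the types are made up.
NODE_TYPE_WEIGHTS = {"patent": 1.0, "news": 0.5}

p = forward_probability(0.5, NODE_TYPE_WEIGHTS["news"], 1.0)
print(p)
```

Because every factor lies between 0 and 1, the product never exceeds the global percolation probability, so weights can only damp the spread of a message.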
[0104] The present invention has a broad scope of applicability to
almost any collection of data. FIG. 10A depicts a generalized
system flowchart of a generalized process according to the present
invention. The flowchart includes forming the network, assigning
different weights, and performing local community finding on the
network. This process is refined by feedback to adjust weights and
modify nodes based on query, community results and/or user
feedback. FIG. 10B illustrates a system architecture reference
model. The system includes different layers. Meta-data is imported
from various operational information databases and is organized and
processed into a meta-data repository. Different information
retrieval components are used to analyze the data for particular
applications. Customer and web services access an enterprise portal
network with general interfaces to make queries and receive results
processed by an information retrieval framework and refined
information presentation framework. FIG. 10C illustrates a core
architectural model according to the present invention wherein a
meta-data repository consists of analyzed linked storage of
different types of data as discussed with respect to the system
architecture. Different plug-ins may be used to interact with
structural analysis engines to answer queries. A standard
command/report API is used to access the system through web
services.
[0105] The present invention may be applied to documents, such as papers, patents, FDA
clinical trials documents, product descriptions, news reports,
market analyses, analyst reports, business reporting information,
and any combination or permutation thereof. It may also be employed
in applications for analysis of the World Wide Web, Email and spam
filtering. The present invention may also be applied to pattern
detection in biological networks, such as transcription regulatory
networks, social networks and communities, for example for military
and homeland security applications.
[0106] In patent landscape analysis, the present invention may be
used to analyze competitors and to monitor those competitors with a
watchdog capability by flagging results of ongoing analyses of
companies, concepts, and technologies. FIG. 11A illustrates a
system flowchart for application to patent information. The results
are refined by feeding back the user adjustments of results to a
meta-data repository. FIG. 11B shows transaction flows prior to
application of an algorithm according to the present invention.
FIG. 11C shows transaction flow with application to landscape
analysis of patents using an algorithm according to the present
invention.
[0107] The present invention also enables a user to browse through
communities and fine-tune the results with a simple binary filter.
In application to market analysis, the present invention may be
used to provide a portfolio for different sections of the market in
terms of competitors, technologies, latest news and technical
papers and publications. Gaps between communities, and hence
opportunities not covered by competitors, may be discovered by
examining inter-community relations.
[0108] The present invention may be used to find the web
neighborhood of a website, its impact and links and communities on
the web. It may also be used to monitor the neighborhood change
over time. The present invention also provides an email and spam
filter by providing a method to reduce spam and deliver messages
only from the people relevant to an address.
[0109] Recently the problem of unsolicited commercial email or spam
has been identified as a ubiquitous problem with email. The
present invention provides a more general framework of cybertrust
which not only provides a solution to the spam problem, but also
restricts email access to trusted individuals. A new distributed
method may be based on percolation theory for identifying an
individual user's local network of trust in cyberspace using simple
local interactions. Recommendations and social connections are used
in daily activities to identify trust and reliability. Adopting the
same social approach, percolative communication of email messages
and limiting interactions to social contacts restricts
communication to locally trusted overlapping communities. The
method according to the present invention further exploits the
properties of social networks to construct a distributed web of
trust based on a user's personal email network to simultaneously
prevent spam emails and emails from unwanted sources. The algorithm
represents a new paradigm for email communication that proves
superior to simple white-list/black-list approaches.
[0110] The problem is not just spam; the problem is the user
receiving email from people the user doesn't know. Many people
simply discard an email if it is not from their contact list,
unless somebody introduces them through an email (CC). This notion
of online recommendation may be generalized according to the
present invention and made invisible and intuitive. The method
according to the present invention may be combined with Bayesian
and text based filters.
[0111] Behind the implementing algorithm is the idea, according to
the present invention, of applying percolation to overlapping
communities for a user. The header of the email is changed so that it
includes the current receiver (To) and the final destination
(Final-To). Then, upon receiving an email, if the final destination is
the user, it is delivered to the user's inbox; otherwise, if the
current destination is the user and the time to live of the email is
less than some threshold, the email is forwarded to people in the
user's contact list with forwarding probability P. This probability is
chosen by the user and determines how limited the user wants the
email community to be.
TABLE-US-00001
Algorithm 1 PROCESS-MAIL(Email E)
if E.FROM is not in ContactList then
    Put E in (High-Probability-Spam)
else
    if E.FinalTo = MyAddress then
        Put E in INBOX
    else
        if TTL < Threshold then
            for all ContactAddress in ContactList do
                RandomVal = RANDOM-GEN01()
                if RandomVal < ForwardingProbability then
                    SENDMAIL(FinalTo:E.FinalTo, From:MyAddress, To:ContactAddress)
                end if
            end for
        end if
    end if
end if
Where SENDMAIL(FinalTo, From, To) sends an email with a proper
header for To, From and FinalTo.
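Algorithm 1 can be rendered as a runnable Python sketch. The `Email` record, the returned folder labels, and the function signature are assumptions made for illustration. The text compares the time to live against a threshold without saying how it is updated, so the sketch treats it as a hop counter that is incremented on each forward, which is also an assumption. Instead of sending mail, the function returns the messages it would forward.

```python
import random
from dataclasses import dataclass


@dataclass
class Email:
    sender: str    # FROM header
    to: str        # current receiver (To)
    final_to: str  # final destination (Final-To)
    ttl: int = 0   # assumed hop counter, compared against the threshold


def process_mail(email, my_address, contact_list, ttl_threshold,
                 forwarding_probability, rng):
    """Return (folder, forwarded): folder is 'SPAM', 'INBOX' or None."""
    if email.sender not in contact_list:
        return "SPAM", []   # High-Probability-Spam
    if email.final_to == my_address:
        return "INBOX", []
    forwarded = []
    if email.ttl < ttl_threshold:
        for contact in contact_list:
            # Forward to each contact independently with the chosen probability.
            if rng.random() < forwarding_probability:
                forwarded.append(Email(sender=my_address, to=contact,
                                       final_to=email.final_to,
                                       ttl=email.ttl + 1))
    return None, forwarded


contacts = ["alice@example.org", "bob@example.org"]
incoming = Email(sender="alice@example.org", to="me@example.org",
                 final_to="carol@example.org")
folder, out = process_mail(incoming, "me@example.org", contacts, 3, 1.0,
                           random.Random(0))
print(folder, [m.to for m in out])
```

Passing the random generator as a parameter keeps the forwarding decision reproducible, and a mail server emulating the algorithm centrally could call the same function once per mailbox.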
[0112] The algorithm according to the present invention may be
implemented in a distributed fashion, or in a centralized fashion
by emulating it in the mail server for large email providers. In
one variation according to the present invention, the forwarding
probability may be weighted as a function of the email traffic
between a sender and the user.
[0113] For social networks, including dating and recreational
activities, the present invention may be used to identify
communities and relationships between communities using social
interactions data and to find the best social connection with a
group of people.
[0114] With respect to biological networks like transcription
regulatory networks, the present invention may be used to discover
functional blueprints of a cellular system from large-scale and
high-throughput sequence and experimental data, allowing complex
intracellular processes to be revealed (see
http://arxiv.org/abs/q-bio.MN/0501039). The present invention may
be used to mine genomic data and other data to correlate functional
and structural data with sequence data, for example. Also according to
the present invention, literature, patent, patient history, drug
trial and other data may be mined to assist in providing diagnosis
or prognosis for a disease in a patient.
[0115] Patterns and communities may be revealed by applying the
present invention to homeland security data: by finding certain
patterns of groups and behaviors related to homeland security,
communities with certain relevant characteristics may be
identified.
[0116] The user activities and data processed in embodiments of the
invention include, but are not limited to, structured data (e.g.,
email-id, phone number, geo-location, friends/links, etc.) as well
as unstructured data (e.g., searches, web-browsing (both on and off
the social media properties), posts, comments, content of web pages
that receive Likes or Links, etc.). The user intent and profiles
extracted in embodiments of the invention are captured and
expressed in terms of weighted sub-graphs of a collective/global
concept graph. The collective/global concept graph includes nodes
that are concepts, and edges that are relationships among such
concepts. Concepts in embodiments of the invention include, but are
not limited to, phrases that represent entities (e.g., people,
companies, drugs, diets, films, shows, events, etc.),
domain-specific terms (e.g., sports and medical terminologies,
specific treatments, procedures, etc.), or common expressions that
are used to convey information. Relationships, as captured by
annotated edges among concepts, include, but are not limited to,
measures of closeness among the concepts, e.g., co-occurrence
statistics, or explicit semantic relationships (e.g., "acted in",
"father of", "part of", etc.).
[0117] In some embodiments of the invention, the user intent
profiles are determined and generated using weighted clusters and
sub-graphs of the global concept graph. Furthermore, the user
intent profiles can be tagged with temporal data. For example,
certain subsets of a user's profile could be considered recent and
time-sensitive (e.g., in the market to buy a product or go on a
trip), whereas, other aspects of user's profile could be longer
lasting (e.g., regions of the concept graph that represent an
antique aficionado or a diabetes patient).
[0118] Similarly, one of the parameters in computing the weights in
the sub-graph (representing a user's profile) can be based on how
many of his/her friends also share the same nodes or edges. The
group or collective activities of users at the social media site
can be further used to tag and prioritize both the unstructured
data used to extract the user profile, as well as, the distilled
user profile. For example, in one embodiment of the invention,
pages and content sites are grouped based on communal user actions,
such as Likes or posts made to the page/site; thus, pages/sites
belonging to the same group have been all visited/liked by a
distinct group of users. Such a group of content sites and pages
can then be collectively analyzed to create a collective content
profile (again in terms of the global concept graph), which can
then be used to tag the corresponding users.
[0119] In yet another embodiment of this invention, a user profile
is post-processed and tagged with weighted category scores defined
over a structured taxonomy of interest. For example, in one
embodiment, a standard taxonomy used by the advertisement industry
can be used to categorize the user profile (i.e., the sub-graph of
the concept graph) in terms of the various categories in the
taxonomy and compute a score for each such category. Since a number
of advertisers define their campaigns based on categories, rather
than the finer specificity of concept graphs, such coarse-grained
tagging of user profiles can be used to more easily and directly
match advertisements to users.
[0120] The advertisers, in embodiments of the invention, include
but are not limited to, entities that carry out display and brand
advertisement campaigns or perform search keyword auctions, or are
lead-generators who aggregate demand across different platforms and
media for merchants. Embodiments of the invention also allow one
to assign each such advertiser a set of advertiser target profiles,
which again could be expressed as (i) weighted and time-tagged
sub-graphs of the global concept graph or (ii) a weighted list of
categories picked from a certain structured taxonomy. Such
advertiser target profiles serve the same purpose as user intent
profiles. In other words, the advertiser target profiles provide a
description of their target audience in terms of a common and
globally-shared database, which is the concept graph. The act of
targeting advertisement includes, but is not limited to, determining
for each user a weighted and prioritized list of advertisers that
best match the user intent. One method for matching user intent to
an advertiser is based upon computing the overlap or distance
between the user intent profile and the advertiser target profile,
e.g., by computing a measure of the distance (e.g. weighted
shortest paths, or sum of the weights of the shared edges) between
the two corresponding weighted sub-graphs in the global concept
graph. The selection of the final set of advertisement units is
completed by an optimization process that maximizes objective
functions of interest, including revenue for the social media site,
value and Return-On-Investment (ROI) for the advertisers, while
considering the device and media that the user is on at the time of
the impression. For example, in one embodiment of the invention,
contextually relevant search-suggestions (i.e., keywords feed from
a search engine) may be provided to users while they are engaged in
a particular conversation on the social media site. In another
embodiment of the invention, contextually relevant text or display
ads are targeted to a user while the user is browsing a third-party
web page (i.e., a web page not owned by the social media site). In
yet another embodiment, an advertisement (both display and text
units) is provided to a user on a mobile device by targeting the
user profile.
[0121] In embodiments of the invention, user profiles and the
underlying concept graph are used to enhance user engagement at a
social media site. Such user engagement enhancement includes, but
is not limited to, (i) organizing friends or contact lists of a user
into potentially overlapping groups, by computing similarity
between a user's profile and the profiles of those of his friends
and contacts. An active user working in the finance sector, for
example, will have a lot of activities on the social media site (or
activities visible from the site), such as searches or browsing
history, related to finance and investment concepts; these concepts
would be shared with his friends/contacts from the finance sector,
leading to overlaps in their intent profiles and allowing the
invention to define a common profile and a group comprising these
individuals. Typically, a user has friends that are from different
contexts of one's life, e.g., work related, hobby related,
high-school friends, college friends etc. and the invention
provides a method for identifying and labeling these groups in an
automated fashion. (ii) Organizing posts, comments and social
interactions between a user and his friends in terms of the
underlying context. For example, the long, linear list of
posts on the wall or the home page of a social media site can be
organized into different subjects and categories by automatically
classifying them into categories such as sports, sports teams,
travel etc. by mapping them into the concept graph. This will also
enable one to determine the dominant types of social interactions
that happen among friends, which can then be used for targeting
advertisement. For example, if posts from a particular friend have
a predominant theme of electronic gadgets (as determined by the
invention's automated classification method), and he then
recommends/likes a new phone, then such a recommendation would have
a lot higher relevance than that coming from a
cooking-enthusiast.
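By way of non-limiting illustration only, the organization of a
contact list into potentially overlapping groups can be sketched as
follows. Profiles are simplified here to concept-to-weight maps, and
the grouping rule (a friend joins the group of every concept that
both the friend and the user weight at or above a threshold) is an
assumption of this sketch, as are all names and values.

```python
from collections import defaultdict
from typing import Dict, List

ConceptWeights = Dict[str, float]


def overlapping_groups(user: ConceptWeights,
                       friends: Dict[str, ConceptWeights],
                       threshold: float = 0.3) -> Dict[str, List[str]]:
    """Group friends under every concept they share strongly with the user.

    A friend may land in several groups, so the groups can overlap.
    """
    groups = defaultdict(list)
    for name in sorted(friends):
        for concept, weight in friends[name].items():
            # Both the user and the friend must weight the concept highly.
            if min(weight, user.get(concept, 0.0)) >= threshold:
                groups[concept].append(name)
    return dict(groups)


# Hypothetical profiles.
user = {"finance": 0.8, "sailing": 0.5, "cooking": 0.1}
friends = {
    "alice": {"finance": 0.9, "sailing": 0.4},
    "bob": {"finance": 0.7},
    "carol": {"cooking": 0.9, "sailing": 0.6},
}
groups = overlapping_groups(user, friends)
```

Here "alice" appears in both the finance group and the sailing
group, and the concept name doubles as the automatically assigned
group label.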
[0122] The user activities and data processed in embodiments of the
invention include, but are not limited to, structured data, e.g.
email-id, phone number, geo-location, friends/links, etc., as well
as, unstructured data--e.g. searches, web-browsing (both on and off
the social media properties), posts, comments, content of web pages
that receive Likes or Links etc. For example, some of the data
would include:
1. User activities exclusively on the Social Media Site:
[0123] a. Post Likes
[0124] b. Page and Group Likes
[0125] c. Posts
d. Subscriptions
2. User activities involving the rest of the Internet:
[0126] a. Incoming likes, shares, and recommendations coming from
the web through, for example, the social site's plugin on publisher
sites
[0127] b. Searches
3. Social graph and connections
[0128] These data sources are processed to extract user intent and
profiles in terms of weighted sub-graphs of a collective/global
concept graph, comprising nodes that are concepts, and edges that
are relationships among such concepts. Concepts in embodiments of
the invention include, but are not limited to, phrases that represent
entities (e.g. people, companies, drugs, diets, films, shows,
events etc.), domain-specific terms (e.g., sports and medical
terminologies, specific treatments, procedures etc.), or common
expressions that are used to convey information. Relationships, as
captured by annotated edges among concepts, include, but are not
limited to, measures of closeness among the concepts, e.g.,
co-occurrence statistics, or explicit semantic relationships (e.g.,
"acted in", "father of", "part of" etc.). For the purpose of
demonstrating the invention, an exemplary concept graph can be used
to capture and represent user profiles. The global concept graph is
constructed by using the structure of the web (analyzing trillions
of words) and an embodiment of it comprises over 50 million
concepts and 2 billion relationships, as disclosed, for example, in
U.S. Pat. No. 7,958,120 and co-pending U.S. patent application Ser.
No. 11/625,279, entitled "SYSTEMS AND METHODS FOR CREATING,
NAVIGATING, AND SEARCHING INFORMATIONAL WEB NEIGHBORHOODS", U.S.
patent application Ser. No. 12/436,748, entitled "METHODS AND
APPARATUS FOR DISCOVERING RELEVANT CONCEPT AND CONTEXT FOR CONTENT
SPECIFIC NODE", and U.S. patent application Ser. No. 12/906,051,
"GENERATING A CONCEPTUAL ASSOCIATION GRAPH FROM LARGE-SCALE
LOOSELY-GROUPED CONTENT", the entireties of which are hereby
incorporated by reference. FIG. 1 shows a sub-graph of the global
concept graph for the concept "Diabetes".
[0129] In embodiments of the invention, a sequence of page views
and search activities can be followed on the concept graph and can
be used to distill a subgraph that represents the intent as
expressed by the sequence of page views. This is represented, for
example, in FIGS. 2-4, where a User "31b23njh31abcds23j2h123213"
visited webMD on 2012-08-01 5:31 pm (FIG. 2A), then the same user
visited healthline on 2012-08-04 9:31 am (FIG. 3A), and then
searched on Google on 2012-08-04 9:31 pm (FIG. 4A). All three
activities are tracked as a group of concepts in the same
neighborhood around the concept "Diabetes" in the underlying
concept graph (FIGS. 2B, 3B and 4B).
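By way of non-limiting illustration only, distilling a sub-graph
from a sequence of activities can be sketched as follows. The toy
graph, its weights, and the per-activity concept sets are invented
for this sketch; the extraction of concepts from each page or query
is assumed to have already taken place.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Toy global concept graph: concept -> {neighbor: edge weight}.
CONCEPT_GRAPH: Dict[str, Dict[str, float]] = {
    "diabetes": {"insulin": 0.9, "blood sugar": 0.8, "low-carb diet": 0.5},
    "insulin": {"diabetes": 0.9, "blood sugar": 0.7},
    "blood sugar": {"diabetes": 0.8, "insulin": 0.7},
    "low-carb diet": {"diabetes": 0.5},
    "sailing": {"regatta": 0.6},
    "regatta": {"sailing": 0.6},
}


def distill_subgraph(
    activities: List[Set[str]],
) -> Tuple[Counter, Dict[Tuple[str, str], float]]:
    """Return (node_counts, edges) for concepts seen across the activities.

    node_counts records in how many activities each concept appeared;
    edges holds the global-graph edges whose endpoints were both seen.
    """
    node_counts = Counter(
        c for concepts in activities for c in concepts if c in CONCEPT_GRAPH
    )
    edges: Dict[Tuple[str, str], float] = {}
    for a in node_counts:
        for b, weight in CONCEPT_GRAPH[a].items():
            if b in node_counts and a < b:  # store each undirected edge once
                edges[(a, b)] = weight
    return node_counts, edges


# Three activities by the same user, already mapped to concepts.
activities = [
    {"diabetes", "insulin"},        # first page view
    {"diabetes", "blood sugar"},    # second page view
    {"diabetes", "low-carb diet"},  # search query
]
nodes, edges = distill_subgraph(activities)
```

All three activities fall in the neighborhood of "diabetes", so that
concept receives the highest count while unrelated regions of the
graph, such as "sailing", stay out of the distilled sub-graph.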
[0130] FIGS. 5A-5C illustrate an example of a user profile based on
content data that is typically found on a social media site: a Mr.
X who is diabetic and enjoys or is interested in sailing and Rolex
watches.
[0131] The data (i.e., the posts, page content, etc.) is processed
for content to generate such profiles. The data
is selected based on a number of social media signals. The data
includes, for example, Post Likes, Page and Group Likes, pages
viewed/browsed on and off the social media site. The group or
collective activities of users at the social media site can be
further used to tag and prioritize both the unstructured data used
to extract the user profile, as well as, the distilled user
profile. For example, in one embodiment of the invention, pages and
content sites are grouped based on communal user actions, such as
Likes or posts made to the page/site. Thus, pages/sites/Groups can
be divided into clusters, such that pages/sites belonging to the
same clusters have been all visited/liked by a distinct group of
users. Such a cluster of content sites and pages and Groups can
then be collectively analyzed to create a collective content
profile (again in terms of the global concept graph), which can
then be used to tag the corresponding users. Moreover, usual page
and domain statistics, such as PageRank and comScore ratings, can
be used for prioritization.
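By way of non-limiting illustration only, grouping pages by communal
user actions can be sketched as follows. The use of Jaccard
similarity over the sets of users who liked each page, the 0.5
threshold, and the merging of similar pages into connected
components are assumptions of this sketch; the disclosure does not
prescribe a particular clustering method.

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pages(likes: Dict[str, Set[str]],
                  threshold: float = 0.5) -> List[List[str]]:
    """Merge pages whose liking users overlap; return sorted clusters."""
    parent = {page: page for page in likes}

    def find(page: str) -> str:
        # Union-find lookup with path halving.
        while parent[page] != page:
            parent[page] = parent[parent[page]]
            page = parent[page]
        return page

    for p, q in combinations(sorted(likes), 2):
        if jaccard(likes[p], likes[q]) >= threshold:
            parent[find(p)] = find(q)

    clusters: Dict[str, List[str]] = {}
    for page in sorted(likes):
        clusters.setdefault(find(page), []).append(page)
    return sorted(clusters.values())


# Hypothetical page -> users who liked it.
page_likes = {
    "sailing_club": {"u1", "u2", "u3"},
    "yacht_weekly": {"u1", "u2"},
    "diabetes_support": {"u4", "u5"},
    "insulin_news": {"u4", "u5", "u6"},
}
page_clusters = cluster_pages(page_likes)
```

Each resulting cluster could then be analyzed as a unit to build the
collective content profile that is used to tag the users behind it.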
[0132] Furthermore, the user intent profiles (which include
weighted clusters and sub-graphs of the global concept graph) can
be tagged with temporal data, for example, certain subsets of a
user's profile can be considered recent and time-sensitive (e.g.,
in the market to buy a product or go on a trip), whereas other
aspects of user's profile could be longer lasting (e.g., regions of
the concept graph that represent an antique aficionado or a
diabetes patient). Similarly, one of the parameters for computing
the weights in the sub-graph (representing a user's profile) can be
based on how many of his/her friends also share the same nodes or
edges. In yet another embodiment of this invention, a user profile
is post-processed and tagged with weighted category scores defined
over a structured taxonomy of interest. For example, a standard
taxonomy can be used by the advertisement industry, and then be
used to categorize the user profile (i.e., the sub-graph of the
concept graph) in terms of the various categories in the taxonomy
and compute a score for each such category. Since a number of
advertisers define their campaigns based on categories, rather than
the finer specificity of concept graphs, such coarse-grained
tagging of user profiles can be used to more easily and directly
match advertisements to users.
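By way of non-limiting illustration only, tagging a profile with
weighted category scores can be sketched as follows. The
concept-to-category table, the category names, and the reduction of
the profile to per-concept weights are assumptions of this sketch
and do not reproduce any actual industry taxonomy.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical mapping from concepts to taxonomy categories.
CONCEPT_TO_CATEGORIES: Dict[str, List[str]] = {
    "insulin": ["Health>Diabetes"],
    "blood sugar": ["Health>Diabetes"],
    "regatta": ["Sports>Sailing"],
    "rolex": ["Shopping>Watches", "Shopping>Luxury Goods"],
}


def category_scores(profile: Dict[str, float]) -> Dict[str, float]:
    """Sum concept weights into every category the concept maps to.

    Concepts with no taxonomy entry contribute nothing.
    """
    scores = defaultdict(float)
    for concept, weight in profile.items():
        for category in CONCEPT_TO_CATEGORIES.get(concept, []):
            scores[category] += weight
    return dict(scores)


# Hypothetical per-concept weights taken from a user's sub-graph.
profile = {"insulin": 0.5, "blood sugar": 0.25, "regatta": 0.5,
           "rolex": 1.0, "poetry": 0.2}
scores = category_scores(profile)
```

The resulting coarse-grained scores can be compared directly with a
campaign that is defined over the same categories.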
[0133] The user intent profiles then can be used to, for example,
target the right advertisement to the right user at the right
moment. The advertisers, in embodiments of the invention, include
but are not limited to, entities that carry out display and brand
advertisement campaigns or perform search keyword auctions, or are
lead-generators who aggregate demand across different platforms and
media for merchants. Embodiments of the invention relate to
assignment of a set of advertiser target profiles to each such
advertiser, which again could be expressed as (i) weighted and
time-tagged sub-graphs of the global concept graph or (ii) a
weighted list of categories picked from a certain structured
taxonomy. Such advertiser target profiles serve the same purpose as
user intent profiles, i.e., provide a description of their target
audience in terms of a common and globally-shared database, which
is the concept graph. The act of targeting advertisement includes,
but is not limited to, determining for each user a weighted and
prioritized list of advertisers that best match the user intent.
One method for matching user intent to an advertiser is based upon
computing the overlap or distance between the user intent profile
and the advertiser target profile, e.g., by computing a measure of
the distance (e.g. weighted shortest paths, or sum of the weights
of the shared edges) between the two corresponding weighted
sub-graphs in the global concept graph. The selection of the final
set of advertisement units is completed by an optimization process
that maximizes objective functions of interest, including revenue
for the social media site, value and Return-On-Investment (ROI) for
the advertisers, while considering the device and media that the
user is on at the time of the impression.
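By way of non-limiting illustration only, a weighted shortest-path
distance between two sub-graphs can be sketched as follows. The
sketch assumes that each edge carries a non-negative cost (for
example, a cost that shrinks as the relationship gets stronger) and
takes the distance between two profiles to be the cheapest path from
any node of one to any node of the other; both assumptions, the toy
graph, and all names are illustrative only.

```python
import heapq
from typing import Dict, Set

Graph = Dict[str, Dict[str, float]]  # concept -> {neighbor: edge cost}


def profile_distance(graph: Graph,
                     user_nodes: Set[str],
                     advertiser_nodes: Set[str]) -> float:
    """Multi-source Dijkstra from the user's nodes to the advertiser's."""
    dist = {n: 0.0 for n in user_nodes if n in graph}
    heap = [(0.0, n) for n in sorted(dist)]
    heapq.heapify(heap)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        if node in advertiser_nodes:
            return d  # first advertiser node popped is the closest one
        for neighbor, cost in graph.get(node, {}).items():
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return float("inf")  # the two profiles are not connected


# Toy concept graph with hypothetical edge costs.
graph: Graph = {
    "diabetes": {"insulin": 1.0, "low-carb diet": 2.0},
    "insulin": {"diabetes": 1.0, "insulin pump": 1.5},
    "low-carb diet": {"diabetes": 2.0, "meal kit": 1.0},
    "insulin pump": {"insulin": 1.5},
    "meal kit": {"low-carb diet": 1.0},
    "sailing": {},
}
pump_distance = profile_distance(graph, {"diabetes"}, {"insulin pump"})
meal_distance = profile_distance(graph, {"diabetes"}, {"meal kit"})
```

A smaller distance indicates a closer match, so in this toy graph
the insulin pump advertiser would be prioritized over the meal kit
advertiser for this user.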
[0134] For example, in one embodiment described in this invention,
contextually relevant search-suggestions (i.e., keywords feed from
a search engine) are shown to users while they are engaged in a
particular conversation on the social media site.
[0135] In another embodiment of the invention, contextually
relevant text or display advertisements are targeted to a user
while browsing a third-party web page (i.e., a web page not owned
by the social media site). Yet another embodiment teaches how to
show advertisements (both display and text units) to a user on a
mobile device by targeting the user profile.
[0136] Embodiments of the invention are also directed to use of
user profiles and the underlying concept graph to enhance user
engagement at a social media site. Such user engagement enhancement
includes, but is not limited to, (i) organizing friends or contact
lists of a user into potentially overlapping groups, by computing
similarity between a user's profile and the profiles of those of
his friends and contacts. An active user working in the finance
sector, for example, will have a lot of activities on the social
media site (or activities visible from the site), such as searches
or browsing history, related to finance and investment concepts;
these concepts would be shared with his friends/contacts from the
finance sector, leading to overlaps in their intent profiles and
allowing the invention to define a common profile and a group
comprising these individuals. Typically, a user has friends that
are from different contexts of one's life, e.g., work related,
hobby related, high-school friends, college friends etc. and the
invention provides a method for identifying and labeling these
groups in an automated fashion. (ii) Organizing posts, comments and
social interactions between a user and his friends in terms of the
underlying context. For example, the long, linear list of
posts on the wall or the home page of a social media site can be
organized into different subjects and categories by automatically
classifying them into categories such as sports, sports teams,
travel etc. by mapping them into the concept graph.
[0137] This enables determination of the dominant types of social
interactions that happen among friends, which can then be used for
targeting advertisement. For example, if posts from a particular
friend have a predominant theme of electronic gadgets (as
determined by the invention's automated classification method), and
he then recommends/likes a new phone, then such a recommendation
would have a lot higher relevance than that coming from a
cooking-enthusiast.
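By way of non-limiting illustration only, weighting a recommendation
by the recommender's dominant themes can be sketched as follows. The
sketch assumes that each post has already been classified into a
single category, and it takes relevance to be the fraction of the
friend's posts that fall in the category of the recommended product;
the formula and the sample data are assumptions of this sketch.

```python
from collections import Counter
from typing import List


def recommendation_relevance(post_categories: List[str],
                             product_category: str) -> float:
    """Share of a friend's classified posts in the product's category."""
    if not post_categories:
        return 0.0
    return Counter(post_categories)[product_category] / len(post_categories)


# Hypothetical per-post categories for two friends.
gadget_fan_posts = ["electronics", "electronics", "electronics", "travel"]
cooking_fan_posts = ["cooking", "cooking", "cooking", "cooking"]

phone_relevance_gadget_fan = recommendation_relevance(gadget_fan_posts, "electronics")
phone_relevance_cooking_fan = recommendation_relevance(cooking_fan_posts, "electronics")
```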
[0138] Embodiments of the invention are also directed to a Temporal
Commercial Intent Profile (Collective Retargeting) that can be used
to target advertisement and create commercially useful user intent
profiles in a computationally efficient way from large-scale user
data.
[0139] The goal is to generate a profile for each user that shows
the level of urgency and interest in each of the standard
commercial verticals. This profile can be used as an important
factor in the placement score of an ad on a page, based on the
cross-relevancy of the page and the ad to the user's top verticals
of interest.
[0140] Output:
[0141] For each user we generate a scored version of the list of
standard targeting verticals used in the advertisement industry.
Since this list is a standard, having such a profile for users makes
it very easy to match against the intention of the advertiser.
[0142] The score could be separated into two parts as well:
1. Urgency (time factor)
2. Interest (a generic interest factor that holds regardless of
time)
[0143] Example: It is winter time and Mr. X is interested in
Travel>Tourist Destinations>Mountain & Ski Resorts. The
score of this vertical for his profile should last only as long as
the season lasts and he is still interested in taking or retaking
such a trip, based on his activities.
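By way of non-limiting illustration only, a vertical score split
into an urgency part and an interest part can be sketched as
follows. The exponential decay with a half-life, the additive
combination, and every numeric value are assumptions of this sketch;
the disclosure only states that urgency is time-dependent while
interest persists regardless of time.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class VerticalScore:
    interest: float        # long-lasting component, not decayed
    urgency: float         # time-sensitive component at last activity
    last_activity: date    # most recent activity supporting the vertical
    half_life_days: float  # assumed decay rate of the urgency component

    def value(self, today: date) -> float:
        """Interest plus urgency halved every half_life_days."""
        age_days = max((today - self.last_activity).days, 0)
        decay = 0.5 ** (age_days / self.half_life_days)
        return self.interest + self.urgency * decay


# Hypothetical score for a ski-resort vertical during winter.
ski = VerticalScore(interest=0.2, urgency=0.8,
                    last_activity=date(2012, 12, 1), half_life_days=30)
score_now = ski.value(date(2012, 12, 1))
score_month_later = ski.value(date(2012, 12, 31))
score_summer = ski.value(date(2013, 6, 1))
```

New activity in the vertical would reset last_activity, which keeps
the score high for as long as the user remains engaged.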
[0144] Inputs:
1. Internal user activities:
a. Posts and Likes
b. Pages and groups Likes
c. Posts
d. Subscriptions
2. External activities:
a. Incoming likes, sharing, recommendation coming from WEB through
for example FaceBook social-plugin on publisher sites
3. Social graph and connections
4. Using social-plugin visits (not actions) as a retargeting pixel
5. Concept Graph and all the related meta data
[0145] Besides processing content accurately, the main challenge is
to filter out a large percentage of the collective data and keep
only the part that carries relevant commercial intent. For example,
if Mr. X liked a page related to a poet months ago, that activity
should be treated as having very low commercial intent.
Exemplary Implementations:
[0146] Add a commercial intent score to concepts; this could be done
using third party data, so we will have a DB of <CONCEPT,SCORED
VERTICALS>. This DB should be temporal as well, and a good
trending method is needed to keep it updated. For example,
Shopping>Toys should be boosted at Christmas time, or
News>Politics should be boosted around election times.
TABLE-US-00002
Join <VISITED EXTERNAL URL,CONCEPTS> to <CONCEPT,SCORED VERTICALS>
and create <URL,SCORED VERTICALS>
Join <USERS,VISITED EXTERNAL URL> to <URL,SCORED VERTICALS>
and create <USERS,SCORED VERTICALS BASED ON EXTERNAL PAGES>
Do the same and join <FB PAGE,CONCEPT> to <CONCEPT,SCORED VERTICALS>
and create <FB PAGE,SCORED VERTICALS>
Join <USERS,POSTS> to <POSTS,CONCEPTS>
and create <USERS,CONCEPTS>
Join <USERS,CONCEPTS> to <CONCEPT,SCORED VERTICALS>
and create <USERS,SCORED VERTICALS BASED ON POSTS>
Join <USERS,FB PAGES> to <CONCEPT,SCORED VERTICALS>
and create <USERS,SCORED VERTICALS BASED ON FB PAGES>
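By way of non-limiting illustration only, the sequence of joins
listed above can be sketched with in-memory dictionaries standing in
for the tables. The summation of scores across joined rows, the
sample URLs, identifiers, and scores are all assumptions of this
sketch. In particular, the sketch joins <USERS,FB PAGES> against
<FB PAGE,SCORED VERTICALS>, on the assumption that this is the table
sharing a key with it.

```python
from collections import defaultdict
from typing import Dict, Iterable

Scores = Dict[str, float]


def join_scores(left: Dict[str, Iterable[str]],
                right: Dict[str, Scores]) -> Dict[str, Scores]:
    """Join <KEY,ITEMS> to <ITEM,SCORED VERTICALS>, summing per vertical."""
    out: Dict[str, Scores] = {}
    for key, items in left.items():
        total = defaultdict(float)
        for item in items:
            for vertical, score in right.get(item, {}).items():
                total[vertical] += score
        out[key] = dict(total)
    return out


# Hypothetical input tables.
concept_verticals = {
    "ski resort": {"Travel>Mountain & Ski Resorts": 1.0},
    "toy train": {"Shopping>Toys": 0.5},
    "insulin": {"Health": 0.75},
}
url_concepts = {"example.com/alps": ["ski resort"],
                "example.com/gifts": ["toy train", "ski resort"]}
user_urls = {"user_1": ["example.com/alps", "example.com/gifts"]}
fb_page_concepts = {"page_ski_club": ["ski resort"]}
user_fb_pages = {"user_1": ["page_ski_club"]}
user_posts = {"user_1": ["post_9"]}
post_concepts = {"post_9": ["insulin"]}

# <URL,SCORED VERTICALS>, then scores based on external pages.
url_verticals = join_scores(url_concepts, concept_verticals)
user_verticals_external = join_scores(user_urls, url_verticals)

# <FB PAGE,SCORED VERTICALS>, then scores based on FB pages.
fb_page_verticals = join_scores(fb_page_concepts, concept_verticals)
user_verticals_fb = join_scores(user_fb_pages, fb_page_verticals)

# <USERS,POSTS> joined to <POSTS,CONCEPTS> gives <USERS,CONCEPTS>.
user_concepts = {
    user: [c for post in posts for c in post_concepts.get(post, [])]
    for user, posts in user_posts.items()
}
user_verticals_posts = join_scores(user_concepts, concept_verticals)
```

At scale the same joins would more plausibly run in a database or a
distributed data-processing system; dictionaries are used here only
to keep the sketch self-contained.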
[0147] We can use the social graph and cross correlation (overlap
of users) on different FB pages to boost up or filter out any of
the above data. The aggregation scoring can be done based on the
above steps.
[0148] At the end of this process, we get a mapping of
<USER,AGGREGATED SCORED VERTICALS>, which can then be used to
target advertisement as illustrated in FIG. 7.
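By way of non-limiting illustration only, combining the per-source
vertical scores of one user into a single aggregated list can be
sketched as follows. The weighted sum, the per-source weights, and
the sample scores are assumptions of this sketch; the disclosure
does not fix a particular aggregation formula.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate(per_source: Dict[str, Dict[str, float]],
              source_weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Weighted sum of vertical scores across sources, highest first."""
    totals = defaultdict(float)
    for source, scores in per_source.items():
        weight = source_weights.get(source, 1.0)
        for vertical, score in scores.items():
            totals[vertical] += weight * score
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scored verticals for one user, by data source.
per_source = {
    "external_pages": {"Travel": 2.0, "Shopping>Toys": 0.5},
    "posts": {"Health": 1.0},
    "fb_pages": {"Travel": 1.0, "Health": 0.5},
}
# Hypothetical trust placed in each source.
weights = {"external_pages": 1.0, "posts": 2.0, "fb_pages": 0.5}
aggregated = aggregate(per_source, weights)
```

Boosting or filtering based on the social graph could be expressed
in the same form by adjusting the per-source weights or the input
scores before aggregation.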
[0149] FIG. 9 shows a diagrammatic representation of a machine in the
exemplary form of a computer system 900 within which a set of
instructions, for causing the machine to perform any one or more of
the methodologies discussed herein, may be executed. In alternative
embodiments, the machine operates as a standalone device or may be
connected (e.g., networked) to other machines. In a networked
deployment, the machine may operate in the capacity of a server or
a client machine in a server-client network environment, or as a peer
machine in a peer-to-peer (or distributed) network environment. The
machine may be a personal computer (PC), a tablet PC, a set-top box
(STB), a Personal Digital Assistant (PDA), a cellular telephone, a
web appliance, a network router, switch or bridge, or any machine
capable of executing a set of instructions (sequential or
otherwise) that specify actions to be taken by that machine.
Further, while only a single machine is illustrated, the term
"machine" shall also be taken to include any collection of machines
that individually or jointly execute a set (or multiple sets) of
instructions to perform any one or more of the methodologies
discussed herein.
[0150] The exemplary computer system 900 includes a processor 902
(e.g., a central processing unit (CPU), a graphics processing unit
(GPU) or both), a main memory 904 (e.g., read only memory (ROM),
flash memory, dynamic random access memory (DRAM) such as
synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.) and a static
memory 906 (e.g., flash memory, static random access memory (SRAM),
etc.), which communicate with each other via a bus 908.
[0151] The computer system 900 may further include a video display
unit 910 (e.g., a liquid crystal display (LCD) or a cathode ray
tube (CRT)). The computer system 900 also includes an alphanumeric
input device 912 (e.g., a keyboard), a cursor control device 914
(e.g., a mouse), a disk drive unit 916, a signal generation device
920 (e.g., a speaker) and a network interface device 922.
[0152] The disk drive unit 916 includes a computer-readable medium
924 on which is stored one or more sets of instructions (e.g.,
software 926) embodying any one or more of the methodologies or
functions described herein. The software 926 may also reside,
completely or at least partially, within the main memory 904 and/or
within the processor 902 during execution thereof by the computer
system 900, the main memory 904 and the processor 902 also
constituting computer-readable media.
[0153] The software 926 may further be transmitted or received over
a network 928 via the network interface device 922.
[0154] While the computer-readable medium 924 is shown in an
exemplary embodiment to be a single medium, the term
"computer-readable medium" should be taken to include a single
medium or multiple media (e.g., a centralized or distributed
database, and/or associated caches and servers) that store the one
or more sets of instructions. The term "computer-readable medium"
shall also be taken to include any medium that is capable of
storing, encoding or carrying a set of instructions for execution
by the machine and that cause the machine to perform any one or
more of the methodologies of the present invention. The term
"computer-readable medium" shall accordingly be taken to include,
but not be limited to, solid-state memories, and optical and
magnetic media.
[0155] One or more of the methodologies or functions described
herein may be embodied in a computer-readable medium on which is
stored one or more sets of instructions (e.g., software). The
software may reside, completely or at least partially, within
memory and/or within a processor during execution thereof. The
software may further be transmitted or received over a network.
[0156] It should be noted that the intent profile extraction
systems, methods and applications are illustrated and discussed
herein as having various modules which perform particular functions
and interact with one another. It should be understood that these
modules are merely segregated based on their function for the sake
of description and represent computer hardware and/or executable
software code which is stored on a computer-readable medium for
execution on appropriate computing hardware. The various functions
of the different modules and units can be combined or segregated as
hardware and/or software stored on a computer-readable medium as
above as modules in any manner, and can be used separately or in
combination.
[0157] It should be noted that the invention is illustrated and
discussed herein as having various modules which perform particular
functions and interact with one another. It should be understood
that these modules are merely segregated based on their function
for the sake of description and represent computer hardware and/or
executable software code which is stored on a computer-readable
medium for execution on appropriate computing hardware. The various
functions of the different modules and units can be combined or
segregated as hardware and/or software stored on a
computer-readable medium as above as modules in any manner, and can
be used separately or in combination.
[0158] The term "computer-readable medium" should be taken to
include a single medium or multiple media that store the one or
more sets of instructions. The term "computer-readable medium"
shall also be taken to include any medium that is capable of
storing, encoding or carrying a set of instructions for execution
by a machine and that cause a machine to perform any one or more of
the methodologies of the present invention. The term
"computer-readable medium" shall accordingly be taken to include,
but not be limited to, solid-state memories, and optical and
magnetic media.
[0159] Embodiments of the invention have been described through
functional modules at times, which are defined by executable
instructions recorded on computer readable media which cause a
computer, microprocessors or chipsets to perform method steps when
executed. The modules have been segregated by function for the sake
of clarity. However, it should be understood that the modules need
not correspond to discrete blocks of code and the described
functions can be carried out by the execution of various code
portions stored on various media and executed at various times.
[0160] It should be understood that processes and techniques
described herein are not inherently related to any particular
apparatus and may be implemented by any suitable combination of
components. Further, various types of general purpose devices may
be used in accordance with the teachings described herein. It may
also prove advantageous to construct specialized apparatus to
perform the method steps described herein. The invention has been
described in relation to particular examples, which are intended in
all respects to be illustrative rather than restrictive. Those
skilled in the art will appreciate that many different combinations
of hardware, software, and firmware will be suitable for practicing
the present invention. Various aspects and/or components of the
described embodiments may be used singly or in any combination. It
is intended that the specification and examples be considered as
exemplary only, with a true scope and spirit of the invention being
indicated by the claims.
* * * * *
References