U.S. patent application number 11/936029 was filed with the patent office on 2007-11-06 and published on 2008-03-13 for method and system for identifying image relatedness using link and page layout analysis.
This patent application is currently assigned to Microsoft Corporation. Invention is credited to Deng Cai, Xiaofei He, Wei-Ying Ma, Ji-Rong Wen.
Publication Number | 20080065627
Application Number | 11/936029
Family ID | 34939562
Filed Date | 2007-11-06
Publication Date | 2008-03-13
United States Patent Application | 20080065627
Kind Code | A1
Ma; Wei-Ying; et al. | March 13, 2008
METHOD AND SYSTEM FOR IDENTIFYING IMAGE RELATEDNESS USING LINK AND PAGE LAYOUT ANALYSIS
Abstract
A method and system for determining relatedness of images of
pages based on link and page layout analysis. A link analysis
system determines relatedness between images by first identifying
blocks within web pages, and then analyzing the importance of the
blocks to web pages, web pages to blocks, and images to blocks.
Based on this analysis, the link analysis system determines the
degree to which each image is related to each other image. The link
analysis system may also use the relatedness of images to generate
a ranking of the images. The link analysis system may also generate
a vector representation of the images based on their relatedness
and apply a clustering algorithm to the vector representations to
identify clusters of related images.
Inventors: | Ma; Wei-Ying; (Beijing, CN); Wen; Ji-Rong; (Beijing, CN); He; Xiaofei; (Chicago, IL); Cai; Deng; (Beijing, CN)
Correspondence Address: | PERKINS COIE LLP/MSFT, P. O. BOX 1247, SEATTLE, WA 98111-1247, US
Assignee: | Microsoft Corporation, Redmond, WA
Family ID: | 34939562
Appl. No.: | 11/936029
Filed: | November 6, 2007
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number
10834483 | Apr 29, 2004 | 7293007
11936029 | Nov 6, 2007 |
Current U.S. Class: | 1/1; 707/999.005; 707/E17.014; 707/E17.108
Current CPC Class: | G06F 16/951 20190101; Y10S 707/99931 20130101
Class at Publication: | 707/005; 707/E17.014
International Class: | G06F 17/30 20060101 G06F017/30
Claims
1. A method in a computer system for determining relatedness
between blocks of pages, the method comprising: calculating value
indicators of a page to a block; calculating value indicators of a
block to a page; and calculating block-to-block indicators of
relatedness of one block to another block by combining the value
indicators of a block to a page and the value indicators of
importance of a page to a block.
2. The method of claim 1 wherein the value indicators of a page to
a block are probabilities that a user will select a link from each
block that will lead to each other page.
3. The method of claim 1 wherein the value indicators of a block to
a page are probabilities that a user will focus on each block of
the page.
4. The method of claim 1 wherein the value indicators of a page to
a block are probabilities that a user will select a link from each
block that will lead to each other page and the value indicators of
a block to a page are probabilities that a user will focus on each
block of the page.
5. The method of claim 1 including calculating a rank of the blocks
from the block-to-block indicators.
6. The method of claim 5 wherein the calculated rank is based on a
probability that a user starting at an arbitrary block will
transition to another block.
7. The method of claim 1 wherein the block-to-block indicators are
calculated as follows: $W_B = ZX$ where X is a matrix of the value
indicators of a block to a page and Z is a matrix of the value
indicators of a page to a block.
8. A computer-readable storage medium containing instructions for
controlling a computer system to determine relatedness between page
elements, the method comprising: calculating value indicators of a
first element to a second element; calculating value indicators of
a second element to a first element; and calculating indicators of
relatedness of a first element to another first element by
combining the value indicators of a first element to a second
element and the value indicators of a second element to a first
element.
9. The computer-readable storage medium of claim 8 wherein the
first element is an image of a block of a page and the second
element is a block.
10. The computer-readable storage medium of claim 8 wherein the
first element is a block of a page and the second element is a
page.
11. The computer-readable storage medium of claim 8 wherein the
value indicators of a first element to a second element are
probabilities that a user will select from the first element
information relating to each other second element.
12. The computer-readable storage medium of claim 8 wherein the
value indicators of a second element to a first element are
probabilities that a user will focus on each second element within
the first element.
13. The computer-readable storage medium of claim 8 wherein the
value indicators of a first element to a second element are
probabilities that a user will select from the first element
information relating to each other second element and wherein the
value indicators of a second element to a first element are
probabilities that a user will focus on each second element within
the first element.
14. The computer-readable storage medium of claim 8 including
calculating a rank of the first elements from the indicators of
relatedness of the first element to another first element.
15. The computer-readable storage medium of claim 14 wherein the
calculated rank is based on a probability that a user starting at a
first element will transition to another first element.
16. The computer-readable storage medium of claim 8 wherein the
first element is a page and the second element is a block of the
page.
17. A computer system for determining relatedness between blocks of
pages, comprising: for each combination of a page and a block, a
probability of the page to the block indicating that a user will
select a link from the block that will lead to the page; and a
probability of the block to the page indicating that a user will
focus on the block of that page; and a component that combines the
probabilities of a block to a page and the probabilities of a page
to a block to calculate block-to-block indicators of relatedness of
one block to another block.
18. The computer system of claim 17 including a component that
ranks the block based on the block-to-block indicators of
relatedness.
19. The computer system of claim 18 wherein the ranking is based on
a probability that a user starting at an arbitrary block will
transition to another block.
20. The computer system of claim 17 wherein the block-to-block
indicators are calculated as follows: $W_B = ZX$ where $W_B$ is a
matrix of block-to-block indicators, Z is a matrix of the
probabilities of pages to blocks, and X is a matrix of the
probabilities of blocks to pages.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is a divisional of U.S. patent application
Ser. No. 10/834,483, filed Apr. 29, 2004, now U.S. Pat. No.
7,293,007, issued Nov. 6, 2007, entitled "METHOD AND SYSTEM FOR
IDENTIFYING IMAGE RELATEDNESS USING LINK AND PAGE LAYOUT ANALYSIS,"
which is hereby incorporated by reference in its entirety.
TECHNICAL FIELD
[0002] The described technology relates generally to analyzing web
pages and particularly to relatedness of images of web pages.
BACKGROUND
[0003] Many search engine services, such as Google and Overture,
provide for searching for information that is accessible via the
Internet. These search engine services allow users to search for
display pages, such as web pages, that may be of interest to users.
After a user submits a search request that includes search terms,
the search engine service identifies web pages that may be related
to those search terms. To quickly identify related web pages, the
search engine services may maintain a mapping of keywords to web
pages. This mapping may be generated by "crawling and indexing" the
web (i.e., the World Wide Web) to identify the keywords of each web
page. To crawl the web, a search engine service may use a list of
root web pages to identify all web pages that are accessible
through those root web pages. The keywords of any particular web
page can be identified using various well-known information
retrieval techniques, such as identifying the words of a headline,
the words supplied in the metadata of the web page, the words that
are highlighted, and so on. The search engine service then ranks
the web pages of the search result based on the closeness of each
match, web page popularity (e.g., Google's PageRank), and so on.
The search engine service may also generate a relevance score to
indicate how relevant the information of the web page may be to the
search request. The search engine service then displays to the user
links to those web pages in an order that is based on their
rankings.
[0004] Although many web pages are graphically oriented in that
they may contain many images, conventional search engine services
typically search based on only the textual content of a web page.
Some attempts have been made, however, to support image-based
searching of web pages. For example, a user viewing a web page may
want to identify other web pages that contain images related to an
image on that web page. The image-based search techniques are
typically either content-based or link-based and additionally use
surrounding text to aid in analyzing images. The content-based
techniques use low-level visual information for image indexing.
Because the content-based search techniques are very
computationally expensive, they are not practical for image
searching on the web. The link-based search techniques typically
assume that images on the same web page are likely to be related
and that images on web pages that are each linked to by the same
web page are related. Unfortunately, these assumptions are
incorrect in many situations primarily because a single web page
may have content relating to many different topics. For example, a
web page for a news web site may contain content relating to an
international political event and content relating to a national
sporting event. In such a case, it is unlikely that a picture of a
sports team relating to the national sporting event is related to a
web page linked to by the content relating to the international
political event.
[0005] It would be desirable to have an image-based search
technique that would not be computationally as expensive as
conventional content-based search techniques and that, unlike
conventional link-based search techniques, would account for the
diverse topics that can occur on a single web page.
SUMMARY
[0006] A system for determining relatedness of images of pages
based on link and page layout analysis is provided. A link analysis
system determines relatedness between images by first identifying
blocks within pages, and then analyzing the importance of the
blocks to pages, pages to blocks, and images to blocks. Based on
this analysis, the link analysis system determines the degree to
which each image is related to each other image. Because the
relatedness of an image to another image is based on block-level
importance, which is a smaller unit than a page, rather than
page-level importance, this relatedness is a more accurate
representation of relatedness than conventional link-based search
techniques.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] FIG. 1 is a block diagram illustrating blocks, images, and
links in a sample collection of web pages.
[0008] FIG. 2 is a block diagram illustrating components of the
link analysis system in one embodiment.
[0009] FIG. 3 is a flow diagram that illustrates processing of a
generate image-to-image matrix component in one embodiment.
[0010] FIG. 4 is a flow diagram that illustrates the processing of
a generate block-to-page matrix component in one embodiment.
[0011] FIG. 5 is a flow diagram that illustrates the processing of
a generate page-to-block matrix component in one embodiment.
[0012] FIG. 6 is a flow diagram that illustrates the processing of
a generate block-to-image matrix component in one embodiment.
DETAILED DESCRIPTION
[0013] A method and system for determining relatedness of images of
pages based on link and page layout analysis is provided. In one
embodiment, a link analysis system determines relatedness between
images by first identifying blocks within web pages, and then
analyzing the importance of the blocks to web pages, web pages to
blocks, and images to blocks. Based on this analysis, the link
analysis system determines the degree to which each image is
related to each other image. A block of a web page represents an
area of the web page that appears to relate to a similar topic. For
example, a news article relating to an international political
event may represent one block, and a news article relating to a
national sporting event may represent another block. The importance
of a block to a page may indicate a probability that a user will
focus on that block when viewing that page. The importance of a
page to a block may indicate the probability that a user will
select from that block a link to that page. The importance of an
image to a block may indicate the probability that a user will
focus on that image when viewing that block. After calculating a
numeric indicator of these importances for pairs of pages and
blocks and pairs of images and blocks, the link analysis system
generates an indicator of the relatedness of each image to each
other image by combining the calculated importance of a block to a
page, the calculated importance of a page to a block, and the
calculated importance of an image to a block. Because the
relatedness of an image to another image is based on block-level
importance rather than on page-level importance, this relatedness
is a more accurate representation of relatedness than conventional
link-based search techniques.
[0014] The link analysis system may also use the relatedness of
images to generate a ranking of the images. The ranking may be
based on a probability that a user who starts viewing an arbitrary
image will transition to another image after an arbitrarily large
number of transitions between images. The link analysis system may
also generate a vector representation of the images based on their
relatedness and apply a clustering algorithm to the vector
representations to identify clusters of related images.
[0015] FIG. 1 is a block diagram illustrating blocks, images, and
links in a sample collection of web pages. This collection of web
pages includes web pages 1-4. The blocks within the web pages are
represented as rectangles, the images within blocks are represented
as circles, and the links within blocks are represented as directed
arrows from a block to a linked-to web page. Web page 1 contains
block 1, which contains images 1 and 2 and links 1 and 2. Web page
2 contains block 2, which contains image 3 and link 3, and block 3,
which contains image 4 and link 4. Web page 3 contains block 4,
which contains image 5 and links 5 and 6, and block 5, which
contains image 6 and link 7. Web page 4 contains block 6, which
contains images 7, 8, 9, and 10 and link 8. Because the link
analysis system bases image relatedness on blocks rather than
entire web pages, the relatedness of an image to other images is
likely based on a more accurate representation of the topic of an
image. For example, web page 2 contains blocks 2 and 3, which may
be directed to different topics such as an international political
event and a national sporting event, respectively. The link
analysis system may identify that image 4 is more closely related
to the images of web page 4 than to the images of web page 3,
because block 3, which contains image 4, has a link 4 to web page
4. For example, web page 4 is more likely sport-related than is web
page 3 because block 3 contains a link to web page 4, but not to
web page 3. As such, image 4 is more likely related to images 7, 8,
9, and 10 than to images 5 and 6 of web page 3. Techniques that are
not based on block-level analysis may identify that image 4 is
equally related to web page 3 and web page 4 because those
techniques do not distinguish block 2 from block 3 on web page
2.
[0016] In one embodiment, the link analysis system calculates the
importance of a page to a block, for each block and page
combination, as the probability that a user who selects a link of
that block will select a link to that page. If a block does not
have a link to a page, then the probability is zero. If a block has
a link to a page, then the link analysis system may assume a user
will select each of the links of the block with equal probability.
A block-to-page matrix of probabilities is defined by the following
equation:
$$Z_{ij} = \begin{cases} 1/s_i & \text{if there is a link from block } i \text{ to page } j \\ 0 & \text{otherwise} \end{cases} \tag{1}$$
[0017] where $Z_{ij}$ represents the probability that a user who
selects a link of block i will select the link to page j and
$s_i$ is the number of links in block i. The block-to-page matrix
Z for the web pages of FIG. 1 is shown in Table 1. The rows of
Table 1 represent the blocks and the columns represent the pages.
In this example, the probability that a user who selects a link of
block 4 will select a link to web page 2 is 0.5.
TABLE 1
Block | Page 1 | Page 2 | Page 3 | Page 4
1 | 0 | .5 | .5 | 0
2 | 0 | 0 | 1 | 0
3 | 0 | 0 | 0 | 1
4 | 0 | .5 | 0 | .5
5 | 1 | 0 | 0 | 0
6 | 0 | 0 | 1 | 0
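Equation 1 can be sketched in a few lines of Python. This is a minimal illustration, not the described implementation: the function name `block_to_page_matrix` and the `links` dictionary (block number mapped to the pages its links target, as read from Table 1) are assumptions introduced here.

```python
def block_to_page_matrix(links, num_pages):
    """Build Z per Equation 1: Z[i][j] = 1/s_i if block i links to page j, else 0."""
    Z = []
    for block in sorted(links):
        targets = links[block]
        row = [0.0] * num_pages
        for page in targets:
            # Each of the s_i links of the block is assumed equally likely.
            row[page - 1] = 1.0 / len(targets)
        Z.append(row)
    return Z

# Hypothetical encoding of FIG. 1: block number -> pages targeted by its links.
links = {1: [2, 3], 2: [3], 3: [4], 4: [2, 4], 5: [1], 6: [3]}
Z = block_to_page_matrix(links, num_pages=4)
print(Z[3][1])  # block 4 -> page 2, the 0.5 entry quoted in the text
```

A block without links simply yields an all-zero row, matching the "0 otherwise" branch of the equation.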
[0018] In one embodiment, the link analysis system calculates, for
each page and block combination, the importance of a block to a
page as the probability of that block being the most important
block of the page. The probability of a block not contained on a
page being the most important block of that page is zero. The link
analysis system may assume that each block contained on a page is
most important with equal probability. A page-to-block matrix of
probabilities is defined by the following equation:
$$X_{ij} = \begin{cases} 1/s_i & \text{if page } i \text{ contains block } j \\ 0 & \text{otherwise} \end{cases} \tag{2}$$
where $X_{ij}$ represents the probability that block j is the most
important block of page i and $s_i$ is the number of blocks on
page i.
[0019] In one embodiment, the link analysis system calculates a
probability that a block is the most important block of a page
based on position, size, font, color, and other physical attributes
of the block. For example, a large block that is centered in the
middle of a page may be more important than a small block in the
lower left corner of the page. A technique for calculating block
importance and the degree of coherency of blocks is described in
U.S. patent application No. ______, entitled "Method and System for
Calculating Importance of a Block Within a Display Page" and filed
on Apr. 29, 2004, which is hereby incorporated by reference. The
page-to-block matrix X may be more generally represented as:
$$X_{ij} = \begin{cases} f_{p_i}(b_j) & \text{if page } i \text{ contains block } j \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
where $f_{p_i}$ is a function representing the probability that
block j is the most important block of page i. In one embodiment,
the function $f_{p_i}$ is defined as the size of block j divided by
the distance of the center of the block from the center of the
screen when page i is displayed. The function f may be defined by
the following:
$$f_{p_i}(b) = \alpha \, \frac{\text{size of block } b \text{ in page } p_i}{\text{distance from the center of } b \text{ to the center of the screen}} \tag{4}$$
[0020] where $\alpha$ is a normalization factor that ensures that
the sum of the values of the function over the blocks of a page is
1. The function f can be considered to be the probability that a
user is focused on block j when viewing page i. The page-to-block
matrix X for the web pages of FIG. 1 is shown in Table 2. The rows
of Table 2 represent the pages and the columns represent the
blocks. In this example, the probability that block 4 is the most
important block of web page 3 is 0.8.
TABLE 2
Page | Block 1 | Block 2 | Block 3 | Block 4 | Block 5 | Block 6
1 | 1 | 0 | 0 | 0 | 0 | 0
2 | 0 | .5 | .5 | 0 | 0 | 0
3 | 0 | 0 | 0 | .8 | .2 | 0
4 | 0 | 0 | 0 | 0 | 0 | 1
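The size-over-distance function of Equation 4 might be sketched as follows. The block geometry below is invented purely for illustration (the source gives no coordinates); it was chosen so that the two blocks of web page 3 come out at roughly 0.8 and 0.2. The function name and the guard against a zero distance are also assumptions.

```python
import math

def block_importance(blocks, screen_center):
    """Return f_p(b) for every block of one page, following Equation 4.

    blocks maps a block id to (area, (cx, cy)): the block size and its center.
    """
    raw = {}
    for block_id, (area, (cx, cy)) in blocks.items():
        dist = math.hypot(cx - screen_center[0], cy - screen_center[1])
        # Assumption: clamp the distance so a perfectly centered block
        # does not cause a division by zero.
        raw[block_id] = area / max(dist, 1.0)
    total = sum(raw.values())  # alpha in Equation 4 is 1 / total
    return {block_id: value / total for block_id, value in raw.items()}

# Hypothetical geometry for blocks 4 and 5 of web page 3.
page3 = {4: (40000.0, (500.0, 400.0)), 5: (10000.0, (400.0, 500.0))}
probs = block_importance(page3, screen_center=(400.0, 400.0))
```

The returned values for one page sum to 1, so they can be written directly into the corresponding row of X.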
[0021] In one embodiment, the link analysis system calculates, for
each block and image combination, the importance of an image to a
block as the probability of that image being the most important
image of that block. If a block does not contain a certain image,
then the probability of that image being the most important of that
block is zero. The link analysis system may assume that each image
of a block is most important with equal probability. The link
analysis system could use other measures of importance of an image
to a block, such as based on the relative sizes of the images, the
location of the images within the blocks, and so on. A
block-to-image matrix of the probabilities is defined by the
following equation:
$$Y_{ij} = \begin{cases} 1/s_i & \text{if block } i \text{ contains image } j \\ 0 & \text{otherwise} \end{cases} \tag{5}$$
[0022] where $Y_{ij}$ represents the probability that image j is
the most important image of block i and $s_i$ is the number of
images in block i. The block-to-image matrix Y for the web pages of
FIG. 1 is shown in Table 3. The rows of Table 3 represent blocks
and the columns represent the images. In this example, the
probability that image 2 is the most important image of block 1 is
0.5.
TABLE 3
Block \ Image | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
1 | .5 | .5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
6 | 0 | 0 | 0 | 0 | 0 | 0 | .25 | .25 | .25 | .25
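Equations 2 and 5 share the same equal-probability form, so one helper can build both matrices. The sketch below assumes a hypothetical helper name, `uniform_membership_matrix`, and takes the containment lists from the description of FIG. 1.

```python
def uniform_membership_matrix(containers, num_items):
    """Row i assigns probability 1/s_i to each of the s_i items held by container i."""
    matrix = []
    for items in containers:
        row = [0.0] * num_items
        for item in items:
            row[item - 1] = 1.0 / len(items)
        matrix.append(row)
    return matrix

# Blocks 1-6 of FIG. 1 and the images each one contains (Equation 5, Table 3).
images_in_block = [[1, 2], [3], [4], [5], [6], [7, 8, 9, 10]]
Y = uniform_membership_matrix(images_in_block, num_items=10)

# The same helper gives the equal-probability form of Equation 2.
blocks_on_page = [[1], [2, 3], [4, 5], [6]]
X_uniform = uniform_membership_matrix(blocks_on_page, num_items=6)
```

Note that `X_uniform` gives blocks 4 and 5 of page 3 equal weight, whereas Table 2 uses the layout-based weights 0.8 and 0.2 of Equation 3.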
[0023] In one embodiment, the link analysis system calculates the
importance of one page to another page, for each ordered pair of
pages, as the probability that a user viewing the first page of the
pair will select a link to the second page of the pair. The link
analysis system calculates the probability for each pair by summing
for each block of the first page the probability of that block
being the most important block of the first page times the
probability that the second page is the most important page to that
block. The importance of a page to another page thus factors in
that users may prefer to select links within the most important
blocks of a page. A page-to-page matrix of these probabilities is
represented by the following:
$$W_P = XZ \tag{6}$$
where $W_P$ represents the page-to-page matrix. The probabilities
of $W_P$ can alternately be represented as:
$$\mathrm{Prob}(\beta \mid \alpha) = \sum_{b \in \alpha} \mathrm{Prob}(\beta \mid b)\,\mathrm{Prob}(b \mid \alpha) \tag{7}$$
[0024] where $\alpha$ represents the first page of the pair and
$\beta$ represents the second page of the pair. The page-to-page
matrix $W_P$ for the web pages of FIG. 1 is shown in Table 4. In
this example, the probability that a user viewing page 3 will
transition to page 2 is 0.4.
TABLE 4
Page | 1 | 2 | 3 | 4
1 | 0 | .5 | .5 | 0
2 | 0 | 0 | .5 | .5
3 | .2 | .4 | 0 | .4
4 | 0 | 0 | 1 | 0
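The product in Equation 6 can be checked numerically. The following sketch uses plain Python lists and a small `matmul` helper (a name introduced here); the entries of X and Z are those of Tables 2 and 1.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Page-to-block matrix X (Table 2): rows are pages, columns are blocks.
X = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
# Block-to-page matrix Z (Table 1): rows are blocks, columns are pages.
Z = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

W_P = matmul(X, Z)  # Equation 6
for row in W_P:
    print(row)
```

Each row of `W_P` sums to 1, so the matrix can be read as the transition probabilities of a user moving from page to page.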
[0025] The link analysis system calculates, for each ordered pair
of blocks, the importance of one block to another block as the
probability that a user viewing the first block of the pair will
select a link to the page containing the second block of the pair
and will find that second block to be the most important of its
page. The link analysis system calculates the probability for each
pair by summing the probabilities that a user who selects a link of
the first block will select a link for the page that contains the
second block times the probability of that second block being the
most important block of its page. Thus, the importance of one block
to another block represents that a user viewing the first block
will select a link to the page containing the second block and
focus their attention on the second block. A block-to-block matrix
of these probabilities is represented by the following:
$$W_B = ZX \tag{8}$$
where $W_B$ represents the block-to-block matrix. The probabilities
of $W_B$ can alternately be represented as:
$$W_B(a, b) = \mathrm{Prob}(b \mid a) = \sum_{\gamma \in P} \mathrm{Prob}(\gamma \mid a)\,\mathrm{Prob}(b \mid \gamma) = \mathrm{Prob}(\beta \mid a)\,\mathrm{Prob}(b \mid \beta) = Z(a, \beta)\,X(\beta, b), \quad a, b \in B \tag{9}$$
where P is the set of pages, B is the set of blocks, and $\beta$ is
the page that contains block b, which is the only page for which
$\mathrm{Prob}(b \mid \gamma)$ is nonzero.
[0026] The block-to-block matrix $W_B$ for the web pages of FIG.
1 is shown in Table 5. In this example, the probability that a user
viewing block 4 will jump to page 2 and focus their attention on
block 3 is 0.25.
TABLE 5
Block | 1 | 2 | 3 | 4 | 5 | 6
1 | 0 | .25 | .25 | .4 | .1 | 0
2 | 0 | 0 | 0 | .8 | .2 | 0
3 | 0 | 0 | 0 | 0 | 0 | 1
4 | 0 | .25 | .25 | 0 | 0 | .5
5 | 1 | 0 | 0 | 0 | 0 | 0
6 | 0 | 0 | 0 | .8 | .2 | 0
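The probabilistic reading of the block-to-block matrix (the sum over pages in Equation 9) can be written directly as a function. This is a sketch only: the function name and the zero-based indexing are choices made here, and the matrices are those of Tables 1 and 2.

```python
def prob_block_to_block(a, b, Z, X):
    """Prob(b | a): sum over pages g of Prob(g | a) * Prob(b | g), per Equation 9.

    a and b are zero-based block indices; Z is blocks x pages, X is pages x blocks.
    """
    return sum(Z[a][g] * X[g][b] for g in range(len(X)))

Z = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
X = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]

# Block 4 -> page 2 -> block 3, the example quoted in the text.
p = prob_block_to_block(3, 2, Z, X)  # 0.25
```

Because every block lies on exactly one page, at most one term of the sum is nonzero, which is why the result equals the single product Z(a, β)X(β, b).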
[0027] In one embodiment, the link analysis system factors into the
block-to-block matrix the probability that two blocks on the same
page may be related. The revised block-to-block matrix is
represented by the following:
$$W_B = (1-t)ZX + tDU \tag{10}$$
where D is a diagonal matrix with $D_{ii} = \sum_j U_{ij}$, U is a
coherence matrix, and t is a weighting factor. The matrix U is
defined as follows:
$$U_{ij} = \begin{cases} 0 & \text{if block } i \text{ and block } j \text{ are on different pages} \\ \mathrm{DOC} & \text{otherwise} \end{cases} \tag{11}$$
where DOC is the degree of coherency of the smallest block
containing both block i and block j. The weighting factor t may
typically be set to a small
value (e.g., less than 0.1) because in most instances different
blocks on the same page relate to different topics.
[0028] The link analysis system calculates for each ordered pair of
images the probability that the first image of the pair is related
to the second image of the pair. The link analysis system
calculates the probability by summing the block-to-block
probabilities for the combination of each block that contains the
first image to each block that contains the second image. An
image-to-image matrix of these probabilities is represented by the
following:
$$W_I = Y^T W_B Y \tag{12}$$
[0029] where $W_I$ represents the image-to-image matrix. The
image-to-image matrix $W_I$ for the web pages of FIG. 1 is shown
in Table 6. In this example, the probability that a user viewing
image 10 will next view page 3 and focus on image 6 of block 5 is
0.05.
TABLE 6
Image | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
1 | 0 | 0 | .125 | .125 | .2 | .05 | 0 | 0 | 0 | 0
2 | 0 | 0 | .125 | .125 | .2 | .05 | 0 | 0 | 0 | 0
3 | 0 | 0 | 0 | 0 | .8 | .2 | 0 | 0 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0 | 0 | .25 | .25 | .25 | .25
5 | 0 | 0 | .25 | .25 | 0 | 0 | .125 | .125 | .125 | .125
6 | .5 | .5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
7 | 0 | 0 | 0 | 0 | .2 | .05 | 0 | 0 | 0 | 0
8 | 0 | 0 | 0 | 0 | .2 | .05 | 0 | 0 | 0 | 0
9 | 0 | 0 | 0 | 0 | .2 | .05 | 0 | 0 | 0 | 0
10 | 0 | 0 | 0 | 0 | .2 | .05 | 0 | 0 | 0 | 0
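The image-to-image computation chains the three matrices together. The sketch below, which assumes small `matmul` and `transpose` helpers introduced here, starts from the entries of Tables 1 through 3, forms the block-to-block product ZX, and then applies the image-to-image formula.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*A)]

Z = [[0.0, 0.5, 0.5, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.5, 0.0, 0.5], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
X = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.8, 0.2, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]
Y = [
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.25, 0.25],
]

W_B = matmul(Z, X)                          # block-to-block matrix
W_I = matmul(matmul(transpose(Y), W_B), Y)  # image-to-image matrix

print(W_I[0][4])  # image 1 -> image 5, approximately 0.2
print(W_I[3][6])  # image 4 -> image 7, approximately 0.25
```

Unlike the page-to-page matrix, the rows of `W_I` need not sum to 1, because an image carries only part of the weight of its block.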
[0030] In one embodiment, the link analysis system factors into the
image-to-image matrix the probability that two images in the same
block may be related. The revised image-to-image matrix is
represented by the following:
$$W_I = tDY^TY + (1-t)Y^TW_BY \tag{13}$$
where t is a weighting factor and D is a diagonal matrix with
$$D_{ii} = \sum_j (Y^TY)_{ij} \tag{14}$$
The weighting factor t may be set to a large value (e.g., 0.7-0.9)
because two images in the same block are likely to be related.
[0031] In one embodiment, the link analysis system generates a
vector representation of each image from the image-to-image matrix.
The link analysis system generates the vectors using a
least-squares approach that factors in the similarity between a
pair of images as indicated by the image-to-image matrix. The link
analysis system initially converts the image-to-image matrix to a
similarity matrix represented by the following:
$$S = (W_I + W_I^T)/2 \tag{15}$$
where S represents the similarity matrix. If $y_i$ is a vector
representation of image i, then the optimal set of image vectors is
$y = (y_1, \ldots, y_m)$ obtained using the following objective
function:
$$\min_y \sum_{i,j} (y_i - y_j)^2 S_{ij} \tag{16}$$
If D is a diagonal matrix such that $D_{ii}$ is the sum of the
values of the i-th row of the similarity matrix S, then the
minimization problem reduces to the following:
$$\min_{y^T y = 1} y^T L y \tag{17}$$
where L is equal to D-S. The solution is given by the minimum
eigenvalue solution to the general eigenvalue problem:
$$Ly = \lambda y \tag{18}$$
If $(y^0, \lambda^0), (y^1, \lambda^1), \ldots, (y^{m-1}, \lambda^{m-1})$
are solutions to Equation 18, and
$\lambda^0 < \lambda^1 < \cdots < \lambda^{m-1}$, then
$\lambda^0 = 0$ and $y^0 = (1, 1, \ldots, 1)$. The link analysis
system selects eigenvectors 1 through k to represent the images in
a k-dimensional Euclidean space. The vector for an image is
represented as follows:
$$\text{image } j \leftarrow (y^1(j), \ldots, y^k(j)) \tag{19}$$
where $y^i(j)$ denotes the j-th element of $y^i$.
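The step from the pairwise objective to the quadratic form in L can be made explicit. The short derivation below treats each $y_i$ as a scalar (one coordinate of the embedding at a time) and uses only the symmetry of S and the definition of D as the matrix of row sums of S.

```latex
\begin{aligned}
\sum_{i,j} (y_i - y_j)^2 S_{ij}
  &= \sum_{i,j} \left( y_i^2 + y_j^2 - 2 y_i y_j \right) S_{ij} \\
  &= \sum_i y_i^2 D_{ii} + \sum_j y_j^2 D_{jj} - 2 \sum_{i,j} y_i S_{ij} y_j \\
  &= 2\, y^T D y - 2\, y^T S y \;=\; 2\, y^T (D - S)\, y \;=\; 2\, y^T L y .
\end{aligned}
```

Because every row of L = D - S sums to zero, L maps the all-ones vector to the zero vector, which is why the smallest eigenvalue is 0 with eigenvector (1, 1, ..., 1) and why that trivial solution is skipped when the embedding coordinates are chosen.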
[0032] The link analysis system identifies clusters of related
images by representing each image by a vector such that the
distance between the image vectors represents their semantic
similarity. Various clustering algorithms may be applied to the
image vectors to identify clusters of semantically related images.
These clustering algorithms may include a Fiedler vector from
spectral graph theory, a k-means clustering, and so on.
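As one possible illustration of the clustering step, the following is a bare-bones k-means in plain Python. The image vectors are made-up values, not results from the described system, and seeding the centroids with the first k points is a simplification chosen to keep the sketch deterministic.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on equal-length vectors; returns one cluster index per point."""
    # Simplifying assumption: seed the centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Hypothetical 2-dimensional image vectors (illustrative values only).
vectors = [(0.1, 0.2), (0.9, 0.8), (0.15, 0.25), (0.85, 0.9), (0.12, 0.18)]
labels = kmeans(vectors, k=2)
```

Images that receive the same label are treated as one cluster of semantically related images.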
[0033] The clustering of images can be used to assist in browsing.
For example, when browsing to a web page, a user can select an
image and request to see related images. The web pages that contain
the images that are clustered together with the selected image can
then be presented as the result of the request. In one embodiment,
the web pages can be presented in an order that is based on the
distance between the image vector of each image and the image
vector of the selected image.
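The distance-based ordering of result pages could look like the sketch below. It assumes the candidate images have already been restricted to the cluster of the selected image; the function name, image identifiers, vectors, and page assignments are all hypothetical.

```python
import math

def related_pages(selected, vectors, page_of):
    """Return pages ordered by the distance of their images to the selected image."""
    ranked = sorted(
        (img for img in vectors if img != selected),
        key=lambda img: math.dist(vectors[img], vectors[selected]),
    )
    pages = []
    for img in ranked:
        # Keep only the first (closest) occurrence of each page.
        if page_of[img] not in pages:
            pages.append(page_of[img])
    return pages

# Hypothetical image vectors and the page on which each image appears.
vectors = {"img4": (0.9, 0.1), "img7": (0.8, 0.2), "img5": (0.1, 0.9), "img8": (0.85, 0.1)}
page_of = {"img4": 2, "img7": 4, "img8": 4, "img5": 3}
result = related_pages("img4", vectors, page_of)
```

A page that holds several clustered images is listed once, at the position of its closest image.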
[0034] The clustering of images can also be used to provide a
multidimensional visualization of images that are semantically
related. The image vectors can be generated for the images of a
collection of web pages. Once the clusters are identified, the
system can display an indication of each cluster on a
two-dimensional grid representing clusters based on different
eigenvectors.
[0035] The link analysis system can rank images based on the
image-to-image matrix. The image-to-image matrix represents the
probability of transitioning from image to image. It is possible
that a user will transition to an image randomly. To account for
this, the link analysis system generates a probability transition
matrix that factors this randomness into the image-to-image matrix
as follows:
$$P = \epsilon W + (1-\epsilon)U \tag{20}$$
where P is a probability transition matrix, $\epsilon$ is a
weighting factor (e.g., 0.1-0.2), and U is a transition matrix of
uniform transition probabilities ($U_{ij} = 1/m$ for all i, j).
Because of the introduction of U, the graph is connected and a
stationary distribution of a random walk of the graph exists. The
rank of an image can be represented as follows:
$$P^T \pi = \pi \tag{21}$$
where $\pi$ is an eigenvector of $P^T$ with eigenvalue 1
representing the image rank. $\pi = (\pi_1, \pi_2, \ldots, \pi_m)$
represents a stationary probability distribution and $\pi_i$
represents the rank of image i.
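The stationary distribution can be approximated by repeated multiplication (power iteration). The sketch below follows Equation 20 as written and assumes that W is row-stochastic; since the image-to-image matrix of the example is not normalized that way, the page-to-page matrix of Table 4 is used as a stand-in. The function name and the iteration count are choices made here.

```python
def stationary_rank(W, epsilon=0.15, iterations=100):
    """Approximate pi with P^T pi = pi, where P = epsilon*W + (1 - epsilon)*U."""
    m = len(W)
    # U has the uniform entry 1/m everywhere (Equation 20).
    P = [[epsilon * W[i][j] + (1.0 - epsilon) / m for j in range(m)]
         for i in range(m)]
    pi = [1.0 / m] * m
    for _ in range(iterations):
        # One power-iteration step: pi <- P^T pi.
        pi = [sum(P[i][j] * pi[i] for i in range(m)) for j in range(m)]
    return pi

# Page-to-page matrix of Table 4, used here because its rows sum to 1.
W_P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.2, 0.4, 0.0, 0.4],
    [0.0, 0.0, 1.0, 0.0],
]
pi = stationary_rank(W_P)
best_page = max(range(len(pi)), key=lambda i: pi[i]) + 1
```

Because every entry of P is positive, the iteration converges regardless of the starting vector, and the entries of `pi` can be sorted to obtain the ranking.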
[0036] FIG. 2 is a block diagram illustrating components of the
link analysis system in one embodiment. The link analysis system
200 includes a web page store 201, a calculate image rank component
202, an identify image clusters component 203, and a generate
image-to-image matrix component 211. The generate image-to-image
matrix component 211 uses an identify blocks component 212, a
generate block-to-page matrix component 213, a generate
page-to-block matrix component 214, and a generate block-to-image
matrix component 215 to generate a matrix that indicates the
image-to-image relatedness. The web page store contains the
collection of web pages. The calculate image rank component uses
the generate image-to-image matrix component to calculate the
relatedness of the images and then uses those calculations of
relatedness to
rank the images. The identify image clusters component uses the
generate image-to-image matrix component to calculate the
relatedness of the images, generates a vector representation of the
images based on the matrix, and identifies clusters of images using
the generated vectors. Although not shown in FIG. 2, the link
analysis system may also include a component to rank elements of a
web page other than the images. For example, the link
analysis system may apply the rankings of Equations 20 and 21 to
the block-to-block matrix to rank the blocks and to the
page-to-page matrix to rank the pages themselves.
[0037] The computing device on which the link analysis system is
implemented may include a central processing unit, memory, input
devices (e.g., keyboard and pointing devices), output devices
(e.g., display devices), and storage devices (e.g., disk drives).
The memory and storage devices are computer-readable media that may
contain instructions that implement the link analysis system. In
addition, the data structures and message structures may be stored
or transmitted via a data transmission medium, such as a signal on
a communications link. Various communications links may be used,
such as the Internet, a local area network, a wide area network, or
a point-to-point dial-up connection.
[0038] FIG. 2 illustrates an example of a suitable operating
environment in which the link analysis system may be implemented.
The operating environment is only one example of a suitable
operating environment and is not intended to suggest any limitation
as to the scope of use or functionality of the link analysis
system. Other well-known computing systems, environments, and
configurations that may be suitable for use include personal
computers, server computers, hand-held or laptop devices,
multiprocessor systems, microprocessor-based systems, programmable
consumer electronics, network PCs, minicomputers, mainframe
computers, distributed computing environments that include any of
the above systems or devices, and the like.
[0039] The link analysis system may be described in the general
context of computer-executable instructions, such as program
modules, executed by one or more computers or other devices.
Generally, program modules include routines, programs, objects,
components, data structures, etc., that perform particular tasks or
implement particular abstract data types. Typically, the
functionality of the program modules may be combined or distributed
as desired in various embodiments.
[0040] FIG. 3 is a flow diagram that illustrates processing of a
generate image-to-image matrix component in one embodiment. In
block 301, the component identifies the blocks within the web pages
stored in the web page store. In block 302, the component invokes
the generate block-to-page matrix component. In block 303, the
component invokes the generate page-to-block matrix component. In
block 304, the component invokes the generate block-to-image matrix
component. In block 305, the component generates the block-to-block
matrix. In block 306, the component generates the image-to-image
matrix and then completes.
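The order of the steps in blocks 305 and 306 can be sketched as two
matrix compositions. This is a sketch under stated assumptions only:
the products used below (block-to-block as the product of the
block-to-page and page-to-block matrices, and image-to-image as the
block-to-block matrix projected through the block-to-image matrix)
are simplified forms chosen for illustration, and the equations that
define these matrices may include additional terms. The names Z, X,
Y, matmul, transpose, and generate_image_to_image are hypothetical.

```python
def matmul(A, B):
    """Multiply two dense matrices given as lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    """Return the transpose of a dense matrix."""
    return [list(col) for col in zip(*A)]


def generate_image_to_image(Z, X, Y):
    """Compose the matrices produced in blocks 302-304.

    Z: block-to-page (n blocks x k pages)
    X: page-to-block (k pages x n blocks)
    Y: block-to-image (n blocks x m images)
    Both products below are assumed, simplified forms.
    """
    W_B = matmul(Z, X)                          # block 305: block-to-block
    W_I = matmul(matmul(transpose(Y), W_B), Y)  # block 306: image-to-image
    return W_B, W_I
```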
[0041] FIG. 4 is a flow diagram that illustrates the processing of
a generate block-to-page matrix component in one embodiment. In
blocks 401-408, the component loops, selecting each page, each block
within each page, and each link within each block, and setting the
importance of the page linked to by that link to that block. In
block 401, the component selects the next page. In decision block
402, if all the pages have already been selected, then the
component returns the block-to-page matrix, else the component
continues at block 403. In block 403, the component selects the
next block of the selected page. In decision block 404, if all the
blocks of the selected page have already been selected, then the
component loops to block 401 to select the next page, else the
component continues at block 405. In block 405, the component
counts the number of links within the selected block. In block 406,
the component selects the linked-to page of the next link of the
selected block. In decision block 407, if all the linked-to pages
of the selected block have already been selected, then the
component loops to block 403 to select the next block, else the
component continues at block 408. In block 408, the component sets
the importance of the linked-to page to the block and then loops to
block 406 to select the linked-to page of the next link of the
selected block.
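The loop of blocks 401-408 can be sketched as follows. The data
layout (a dictionary mapping each page identifier to a list of block
records with "id" and "links" fields) and the function name are
hypothetical. The weight assigned in block 408 is assumed here to be
one divided by the number of links counted in block 405, with
repeated links to the same page accumulating weight; the text above
does not fix that choice.

```python
def generate_block_to_page(pages):
    """Build a sparse block-to-page matrix as nested dictionaries.

    pages: {page_id: [{"id": block_id, "links": [page_id, ...]}, ...]}
    Returns {block_id: {linked_page_id: importance}}.
    """
    Z = {}
    for page_id, blocks in pages.items():      # blocks 401-402
        for block in blocks:                   # blocks 403-404
            links = block["links"]
            count = len(links)                 # block 405
            row = Z.setdefault(block["id"], {})
            for target in links:               # blocks 406-407
                # Block 408: assumed weight of 1/count per link.
                row[target] = row.get(target, 0.0) + 1.0 / count
    return Z
```

A block with no links simply produces an empty row, so no division
by zero occurs.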
[0042] FIG. 5 is a flow diagram that illustrates the processing of
a generate page-to-block matrix component in one embodiment. In
blocks 501-506, the component loops selecting each page and each
block within each page and setting the importance of that block to
the selected page. In block 501, the component selects the next
page of the web page store. In decision block 502, if all the pages
have already been selected, then the component returns the
page-to-block matrix, else the component continues at block 503. In
block 503, the component selects the next block of the selected
page. In decision block 504, if all the blocks of the selected page
have already been selected, then the component loops to block 501
to select the next page, else the component continues at block 505.
In block 505, the component calculates the importance of the
selected block to the selected page. In block 506, the component
sets the importance of the selected block to the selected page and
then loops to block 503 to select the next block of the selected
page.
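The loop of blocks 501-506 can be sketched as follows. The
importance calculation of block 505 is not specified in the passage
above, so the sketch substitutes a placeholder that uses the
fraction of the page area occupied by the block. That placeholder,
the "area" field, and the function names are assumptions made only
for illustration; any other importance function can be passed in.

```python
def area_fraction(block, blocks):
    """Hypothetical importance: the block's share of the total block area."""
    total = sum(b["area"] for b in blocks)
    return block["area"] / total if total else 0.0


def generate_page_to_block(pages, importance=area_fraction):
    """Build a sparse page-to-block matrix as nested dictionaries.

    pages: {page_id: [{"id": block_id, "area": number}, ...]}
    Returns {page_id: {block_id: importance}}.
    """
    X = {}
    for page_id, blocks in pages.items():      # blocks 501-502
        row = X.setdefault(page_id, {})
        for block in blocks:                   # blocks 503-504
            value = importance(block, blocks)  # block 505
            row[block["id"]] = value           # block 506
    return X
```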
[0043] FIG. 6 is a flow diagram that illustrates the processing of
a generate block-to-image matrix component in one embodiment. In
blocks 601-608, the component loops selecting each page, each block
within each page, and each image within each block and setting the
importance of the image to the selected block. In block 601, the
component selects the next page of the web page store. In decision
block 602, if all the pages have already been selected, then the
component returns the block-to-image matrix, else the component
continues at block 603. In block 603, the component selects the
next block of the selected page. In decision block 604, if all the
blocks of the selected page have already been selected, then the
component loops to block 601 to select the next page, else the
component continues at block 605. In block 605, the component
counts the number of images of the selected block. In block 606,
the component selects the next image of the selected block. In
decision block 607, if all the images of the selected block have
already been selected, then the component loops to block 603 to
select the next block, else the component continues at block 608.
In block 608, the component sets the importance of the selected
image to the selected block and then loops to block 606 to select
the next image of the selected block.
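The loop of blocks 601-608 can be sketched as follows. As with the
other sketches, the data layout (block records with "id" and
"images" fields) and the function name are hypothetical, and the
weight set in block 608 is assumed to be one divided by the number
of images counted in block 605.

```python
def generate_block_to_image(pages):
    """Build a sparse block-to-image matrix as nested dictionaries.

    pages: {page_id: [{"id": block_id, "images": [image_id, ...]}, ...]}
    Returns {block_id: {image_id: importance}}.
    """
    Y = {}
    for page_id, blocks in pages.items():      # blocks 601-602
        for block in blocks:                   # blocks 603-604
            images = block["images"]
            count = len(images)                # block 605
            row = Y.setdefault(block["id"], {})
            for image in images:               # blocks 606-607
                # Block 608: assumed weight of 1/count per image.
                row[image] = row.get(image, 0.0) + 1.0 / count
    return Y
```

Under this assumption, each non-empty row sums to 1, so every
block distributes its importance evenly over the images it contains.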
[0044] One skilled in the art will appreciate that although
specific embodiments of the link analysis system have been
described herein for purposes of illustration, various
modifications may be made without deviating from the spirit and
scope of the invention. Accordingly, the invention is not limited
except by the appended claims.
* * * * *