U.S. patent application number 10/995770 was filed with the patent office on 2004-11-22 and published on 2006-05-25 for methods and apparatus for assessing web page decay.
This patent application is currently assigned to International Business Machines Corporation. Invention is credited to Ziv Bar-Yossef, Andrei Zary Broder, Shanmagasundaram Ravikumar, Andrew Tomkins.
Publication Number | 20060112089 |
Application Number | 10/995770 |
Family ID | 36462123 |
Filed Date | 2004-11-22 |
Publication Date | 2006-05-25 |
United States Patent Application | 20060112089 |
Kind Code | A1 |
Broder; Andrei Zary; et al. | May 25, 2006 |
Methods and apparatus for assessing web page decay
Abstract
Systems and methods are herein disclosed for assessing the
staleness of a web page. In particular, in one method of the
present invention, the staleness of a web page is assessed by
examining internal date references within the web page. In another
method of the present invention, the staleness of a web page is
assessed by examining the meta-data associated with the web page.
In a further method of the present invention, the staleness of a
hyperlinked web page is determined by examining the link status of
the hyperlinks. If the web page has a relatively large number of
dead links, it is assessed as being a stale web page. In a still
further method of the present invention, the link status of web
pages in the neighborhood of the web page being assessed is
likewise examined.
Inventors: Broder; Andrei Zary (Bronx, NY); Bar-Yossef; Ziv (Ra'anana, IL); Ravikumar; Shanmagasundaram (Cupertino, CA); Tomkins; Andrew (San Jose, CA)
Correspondence Address: HARRINGTON & SMITH, LLP, 4 RESEARCH DRIVE, SHELTON, CT 06484-6212, US
Assignee: International Business Machines Corporation
Family ID: 36462123
Appl. No.: 10/995770
Filed: November 22, 2004
Current U.S. Class: 1/1; 707/999.004; 707/E17.116
Current CPC Class: G06F 16/958 20190101
Class at Publication: 707/004
International Class: G06F 17/30 20060101 G06F017/30
Claims
1. A signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing
apparatus of a computer system to perform operations for assessing
the currency of a web page, the operations comprising: establishing
a date threshold, wherein web pages older than the date threshold
will be assessed as not being current; accessing a web page;
extracting date information from the web page identifying the age
of the web page; and comparing the date information extracted from
the web page to the date threshold.
2. The signal-bearing medium of claim 1 further comprising:
identifying the web page as lacking currency if the date
information identifying the age of the web page is older than the
date threshold.
3. The signal-bearing medium of claim 1 further comprising:
identifying the web page as being current if the date information
identifying the age of the web page is younger than the date
threshold.
4. A signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing apparatus of a
computer system to perform operations for assessing the currency of
a web page, the operations comprising: receiving a user-specified
topicality threshold, where the topicality threshold concerns the
topicality of material content of the web page; accessing a web
page; extracting topicality information from the web page; and
comparing the topicality information extracted from the web page to
the topicality threshold.
5. The signal-bearing medium of claim 4 further comprising:
identifying the web page as lacking currency if the topicality
information extracted from the web page lacks topicality when
compared to the topicality threshold.
6. The signal-bearing medium of claim 4 further comprising:
identifying the web page as being current if the topicality
information extracted from the web page is topical when compared to
the topicality threshold.
7. A signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing
apparatus of a computer system to perform operations for assessing
the currency of a web page, the operations comprising: establishing
a link threshold, wherein a web page will be assessed as lacking
currency if a percentage of hyperlinks contained in the web page
that link to an active page is less than the link threshold;
accessing a web page containing hyperlinks; testing the hyperlinks;
calculating the percentage of hyperlinks that return active web
pages; and comparing the percentage of hyperlinks that return
active web pages with the link threshold.
8. The signal-bearing medium of claim 7 where the operations
further comprise: identifying the web page as lacking currency if
the percentage of hyperlinks that return active web pages is less
than the link threshold.
9. The signal-bearing medium of claim 7 where the operations
further comprise: identifying the web page as being current if the
percentage of hyperlinks that return active web pages is greater
than the link threshold.
10. The signal-bearing medium of claim 7 where testing the
hyperlinks further comprises: establishing a time out limit for
testing a hyperlink, where when the hyperlink is tested, the
hyperlink will be assessed as linking to a dead web page if the
time out limit is exceeded; selecting a hyperlink; and monitoring
the elapsed time until a web page is returned, if at all.
11. The signal-bearing medium of claim 10 where testing the
hyperlinks further comprises: assessing the hyperlink as linking to
a dead page if the time out limit is exceeded.
12. The signal-bearing medium of claim 7 where testing the
hyperlinks further comprises: selecting a hyperlink; and assessing
the hyperlink as linking to a dead page based on an HTTP code
returned in response to an HTTP request targeting the selected
hyperlink.
13. The signal-bearing medium of claim 7 where testing the
hyperlinks further comprises: establishing a redirect limit for
testing a hyperlink, where when the hyperlink is tested, the
hyperlink will be assessed as linking to a dead web page if the
redirect limit is exceeded; selecting a hyperlink; and monitoring
the number of redirects before the desired web page is returned, if
at all.
14. The signal-bearing medium of claim 13 where testing the
hyperlinks further comprises: assessing the hyperlink as linking to
a dead page if the redirect limit is exceeded.
15. The signal-bearing medium of claim 7 where testing the
hyperlinks further comprises: selecting a first hyperlink; saving
the web page returned in response to the selection of the first
hyperlink; formulating a web page request to a host of the first
hyperlink, where the web page request is of a form that will not
return an active web page with a high degree of probability; and
issuing the web page request.
16. The signal-bearing medium of claim 15 where testing the
hyperlinks further comprises: assessing the first hyperlink as
linking to an active web page if an HTTP error code is returned in
response to the web page request.
17. The signal-bearing medium of claim 15 where testing the
hyperlinks further comprises: saving a web page returned in
response to the web page request; comparing the web page returned
in response to the web page request to the web page returned in
response to the selection of the first hyperlink; and assessing the
first hyperlink as linking to a dead web page if the web page
returned in response to the selection of the first hyperlink is
identical to the web page returned in response to the web page
request.
18. The signal-bearing medium of claim 7 where the currency of a
web page is assessed by additionally testing the link status of
hyperlinks contained in web pages linked through a chain of at
least one hyperlink to the web page whose currency is being tested,
and where: establishing a link threshold further comprises applying
a sliding scale weighting factor to hyperlinks contained in web
pages linked to the web page whose currency is being tested, where
the weight given to a dead link decreases with the distance of the
web page containing the dead link from the web page whose currency
is being tested in terms of intermediate web pages; and where
testing the hyperlinks further comprises testing hyperlinks in web
pages linked to the web page whose currency is being tested.
19. A signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing
apparatus of a computer system to perform operations for assessing
the decay of a web page, the operations comprising: accessing a
subject web page containing hyperlinks; assessing the decay of the
subject web page by following a random walk away from the subject
web page, where the random walk consists of a testing of links on
the subject web page and web pages linked to the subject web page
under test; and assigning a decay score to the subject web page in
dependence on dead links encountered in the random walk, wherein
the decay score is a weighted sliding scale, where a dead link
encountered relatively close in the random walk to the subject web
page in terms of intermediate web pages results in a higher decay
score than a dead link encountered relatively farther away from the
subject web page.
20. The signal-bearing medium of claim 19 where testing of links
further comprises: establishing a time out limit for testing a
hyperlink, where when the hyperlink is tested, the hyperlink will
be assessed as linking to a dead web page if the time out limit is
exceeded; selecting a hyperlink; and monitoring the elapsed time
until a web page is returned, if at all.
21. The signal-bearing medium of claim 20 where testing the links
further comprises: assessing the hyperlink as linking to a dead
page if the time out limit is exceeded.
22. The signal-bearing medium of claim 19 where testing the links
further comprises: selecting a hyperlink; and assessing the
hyperlink as linking to a dead page based on an HTTP code returned
in response to an HTTP request targeting the selected
hyperlink.
23. The signal-bearing medium of claim 19 where testing the links
further comprises: establishing a redirect limit for testing a
hyperlink, where when the hyperlink is tested, the hyperlink will
be assessed as linking to a dead web page if the redirect limit is
exceeded; selecting a hyperlink; and monitoring the number of
redirects before the desired web page is returned, if at all.
24. The signal-bearing medium of claim 23 where testing the links
further comprises: assessing the hyperlink as linking to a dead
page if the redirect limit is exceeded.
25. The signal-bearing medium of claim 19 where testing the links
further comprises: selecting a first hyperlink; saving the web page
returned in response to the selection of the first hyperlink;
formulating a web page request to the host of the first hyperlink,
where the request is of a form that will not return an active web
page with a high degree of probability; and issuing the web page
request.
26. The signal-bearing medium of claim 25 where testing the links
further comprises: assessing the first hyperlink as linking to an
active web page if an HTTP error code is returned in response to
the web page request.
27. The signal-bearing medium of claim 25 where testing the links
further comprises: saving a web page returned in response to the
web page request; comparing the web page returned in response to
the web page request to the web page returned in response to the
selection of the first hyperlink; and assessing the first hyperlink
as linking to a dead web page if the web page returned in response
to the selection of the first hyperlink is identical to the web
page returned in response to the web page request.
28. A computer system for assessing the currency of a web page, the
computer system comprising: an internet connection for connecting
to the internet and for accessing web pages available on the
internet; at least one memory to store web pages retrieved from the
internet and at least one program of machine-readable instructions,
where the at least one program performs operations to assess the
currency of a web page; at least one processor coupled to the
internet connection and the at least one memory, where the at least
one processor performs the following operations when the at least
one program is executed: retrieving a date threshold, wherein web
pages older than the date threshold will be assessed as not being
current; accessing a web page; extracting date information from the
web page identifying the age of the web page; and comparing the
date information extracted from the web page to the date
threshold.
29. The computer system of claim 28 where the operations further
comprise: identifying the web page as lacking currency if the date
information identifying the age of the web page is older than the
date threshold.
30. The computer system of claim 28 where the operations further
comprise: identifying the web page as being current if the date
information identifying the age of the web page is younger than the
date threshold.
31. A computer system for assessing the currency of a web page, the
computer system comprising: an internet connection for connecting
to the internet and for accessing web pages available on the
internet; at least one memory to store web pages retrieved from the
internet and at least one program of machine-readable instructions,
where the at least one program performs operations to assess the
currency of a web page; at least one processor coupled to the
internet connection and the at least one memory, where the at least
one processor performs the following operations when the at least
one program is executed: retrieving a predetermined topicality
threshold, where the topicality threshold concerns the
topicality of material comprising a web
page; extracting topicality information from the web page; and
comparing the topicality information extracted from the web page to
the topicality threshold.
32. The computer system of claim 31 where the operations further
comprise: identifying the web page as lacking currency if the
topicality information extracted from the web page lacks topicality
when compared to the topicality threshold.
33. The computer system of claim 31 where the operations further
comprise: identifying the web page as being current if the
topicality information extracted from the web page is topical when
compared to the topicality threshold.
34. A computer system for assessing the currency of a web page, the
computer system comprising: an internet connection for connecting
to the internet and for accessing web pages available on the
internet; at least one memory to store web pages retrieved from the
internet and at least one program of machine-readable instructions,
where the at least one program performs operations to assess the
currency of a web page; at least one processor coupled to the internet
connection and the at least one memory, where the at least one
processor performs the following operations when the at least one
program is executed: establishing a link threshold, wherein a web
page will be assessed as lacking currency if a percentage of
hyperlinks contained in the web page that link to an active page is
less than the link threshold; accessing a web page containing
hyperlinks; testing the hyperlinks; calculating the percentage of
hyperlinks that return active web pages; and comparing the
percentage of hyperlinks that return active web pages with the link
threshold.
35. The computer system for assessing the currency of a web page of
claim 34 where the operations further comprise: identifying the web
page as lacking currency if the percentage of hyperlinks that
return active web pages is less than the link threshold.
36. The computer system for assessing the currency of a web page of
claim 34 where the operations further comprise: identifying the web
page as being current if the percentage of hyperlinks that return
active web pages is greater than the link threshold.
37. The computer system for assessing the currency of a web page of
claim 34 where testing the hyperlinks further comprises:
establishing a time out limit for testing a hyperlink, where when
the hyperlink is tested, the hyperlink will be assessed as linking
to a dead web page if the time out limit is exceeded; selecting a
hyperlink; and monitoring the elapsed time until a web page is
returned, if at all.
38. The computer system for assessing the currency of a web page of
claim 37 where testing the hyperlinks further comprises: assessing
the hyperlink as linking to a dead page if the time out limit is
exceeded.
39. The computer system for assessing the currency of a web page of
claim 34 where testing the hyperlinks further comprises: selecting
a hyperlink; and assessing the hyperlink as linking to a dead page
based on an HTTP code returned in response to an HTTP request
targeting the selected hyperlink.
40. The computer system for assessing the currency of a web page of
claim 34 where testing the hyperlinks further comprises:
establishing a redirect limit for testing a hyperlink, where when
the hyperlink is tested, the hyperlink will be assessed as linking
to a dead web page if the redirect limit is exceeded; selecting a
hyperlink; and monitoring a number of redirects before the desired
web page is returned, if at all.
41. The computer system for assessing the currency of a web page of
claim 40 where testing the hyperlinks further comprises: assessing
the hyperlink as linking to a dead web page if the redirect limit
is exceeded.
42. The computer system for assessing the currency of a web page of
claim 34 where testing the hyperlinks further comprises: selecting
a first hyperlink; saving the web page returned in response to the
selection of the first hyperlink; formulating a web page request to
the parent directory of the address corresponding to the first
hyperlink, where the web page request is of a form that will not
return an active web page with a high degree of probability; and
issuing the web page request.
43. The computer system for assessing the currency of a web page of
claim 42 where testing the hyperlinks further comprises: assessing
the first hyperlink as linking to an active web page if an HTTP
error code is returned in response to the web page request.
44. The computer system for assessing the currency of a web page of
claim 42 where testing the hyperlinks further comprises: saving a
web page returned in response to the web page request; comparing
the web page returned in response to the web page request to the
web page returned in response to the selection of the first
hyperlink; and assessing the first hyperlink as linking to a dead
web page if the web page returned in response to the selection of
the first hyperlink is identical to the web page returned in
response to the web page request.
45. The computer system for assessing the currency of a web page of
claim 34 where the currency of a web page is assessed by
additionally testing the link status of hyperlinks contained in web
pages linked through a chain of at least one hyperlink to the web
page whose currency is being tested, and where: establishing a link
threshold further comprises applying a sliding scale weighting
factor to hyperlinks contained in web pages linked to the web page
whose currency is being tested, where the weight given to a dead
link decreases with the distance of the web page containing the
dead link from the web page whose currency is being tested in terms
of intermediate web pages; and where testing the hyperlinks further
comprises testing hyperlinks in web pages linked from the web page
whose currency is being tested.
46. A computer system for assessing the decay of a web page
comprising: an internet connection for connecting to the internet
and for accessing web pages available on the internet; at least one
memory to store web pages retrieved from the internet and at least
one program of machine-readable instructions, where the at least
one program performs operations to assess the decay of a web page; at
least one processor coupled to the internet connection and the at
least one memory, where the at least one processor performs the
following operations when the at least one program is executed:
accessing a subject web page containing hyperlinks; assessing the
decay of the subject web page by following a random walk away from
the subject web page, where the random walk consists of a testing
of links on the subject web page and web pages linked to the
subject web page under test; and assigning a decay score to the
subject web page in dependence on dead links encountered in the
random walk, wherein the decay score is a weighted sliding scale,
where a dead link encountered relatively close in the random walk
to the subject web page in terms of intermediate web pages results
in a higher decay score than a dead link encountered relatively
farther away from the subject web page.
47. The computer system for assessing the decay of a web page of
claim 46 where testing of links further comprises: establishing a
time out limit for testing a hyperlink, where when the hyperlink is
tested, the hyperlink will be assessed as linking to a dead web
page if the time out limit is exceeded; selecting a hyperlink; and
monitoring the elapsed time until a web page is returned, if at all.
48. The computer system for assessing the decay of a web page of
claim 47 where testing the links further comprises: assessing the
hyperlink as linking to a dead page if the time out limit is
exceeded.
49. The computer system for assessing the decay of a web page of
claim 46 where testing the links further comprises: selecting a
hyperlink; and assessing the hyperlink as linking to a dead page
based on the HTTP code returned in response to an HTTP request
targeting the selected hyperlink.
50. The computer system for assessing the decay of a web page of
claim 46 where testing the links further comprises: establishing a
redirect limit for testing a hyperlink, where when the hyperlink is
tested, the hyperlink will be assessed as linking to a dead web
page if the redirect limit is exceeded; selecting a hyperlink; and
monitoring the number of redirects before the desired web page is
returned, if at all.
51. The computer system for assessing the decay of a web page of
claim 50 where testing the links further comprises: assessing the
hyperlink as linking to a dead page if the redirect limit is
exceeded.
52. The computer system for assessing the decay of a web page of
claim 46 where testing the links further comprises: selecting a
first hyperlink; saving the web page returned in response to the
selection of the first hyperlink; formulating a web page request to
the host of the first hyperlink, where the request is of a form
that will not return an active web page with a high degree of
probability; and issuing the web page request.
53. The computer system for assessing the decay of a web page of
claim 52 where testing the links further comprises: assessing the
first hyperlink as linking to an active web page if an HTTP error
code is returned in response to the web page request.
54. The computer system for assessing the decay of a web page of
claim 52 where testing the links further comprises: saving a web
page returned in response to the web page request; comparing the
web page returned in response to the web page request to the web
page returned in response to the selection of the first hyperlink;
and assessing the first hyperlink as linking to a dead web page if
the web page returned in response to the selection of the first
hyperlink is identical to the web page returned in response to the
web page request.
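By way of illustration only, the link tests recited in claims 10 through 17 (time out limit, HTTP code, redirect limit, and comparison against a request that should not return an active page) might be sketched in Python as follows. The sketch assumes the results of the requests have already been gathered into a record. The `Probe` type, its field names, the default limits of 10 seconds and 5 redirects, and the treatment of any status of 400 or above as an error are hypothetical choices made for the example and are not taken from the claims.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Probe:
    """Outcome of testing one hyperlink (hypothetical record type)."""
    elapsed_seconds: float            # time until a page came back
    status: Optional[int]             # HTTP status code, None if nothing returned
    redirects: int                    # number of redirects followed
    body: str = ""                    # content returned for the hyperlink
    # Outcome of a second request to the same host for an address that
    # almost certainly does not exist (None if that request was not made).
    bogus_status: Optional[int] = None
    bogus_body: str = ""


def is_dead(p: Probe, timeout: float = 10.0, max_redirects: int = 5) -> bool:
    """Return True if the probed hyperlink is assessed as a dead link."""
    # Time out limit: nothing came back, or it took too long.
    if p.status is None or p.elapsed_seconds > timeout:
        return True
    # Redirect limit: too many hops before a page was returned.
    if p.redirects > max_redirects:
        return True
    # HTTP code: an explicit error response (assumed here to mean >= 400).
    if p.status >= 400:
        return True
    # The host answers the improbable request with an error code, so it
    # reports missing pages honestly and the original page is taken as active.
    if p.bogus_status is not None and p.bogus_status >= 400:
        return False
    # The host returned the same page for the improbable request as for
    # the hyperlink itself, so the hyperlink is taken as dead.
    if p.bogus_status is not None and p.bogus_body == p.body:
        return True
    return False
```

The record is kept separate from the network code so that the classification rule can be examined and tested without issuing any requests.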
Description
TECHNICAL FIELD
[0001] The present invention generally concerns web pages and more
particularly concerns methods and apparatus for assessing the decay
of web pages.
BACKGROUND
[0002] The rapid growth of the web has been noted and tracked
extensively. Recent studies, however, have documented the dual
phenomenon: web pages often have small half-lives, and thus the web
exhibits rapid decay as well. Consequently, page creators are faced
with an increasingly burdensome task of keeping links up to date,
and many fall behind. In addition to individual pages, collections
of pages or even entire neighborhoods on the web exhibit
significant decay, rendering them less effective as information
resources. Such neighborhoods are identified by frustrated
searchers, seeking a way out of these stale neighborhoods, back to
more up-to-date sections of the web.
[0003] On Nov. 2, 2003, the Associated Press reported that the
"Internet [is] littered with abandoned sites." [20] The story was
picked up by many news outlets from USA's CNN to Singapore's
Straits Times. The article further states that "[d]espite the
Internet's ability to deliver information quickly and frequently,
the World Wide Web is littered with deadwood--sites abandoned and
woefully out of date."
[0004] Of course this is not news to most net-denizens, and speed
of delivery has nothing to do with the quality of content, but
there is no denying that the increase in the number of outdated
sites has made finding reliable information on the web even more
difficult and frustrating. Part of the problem is an issue of
perception: the immediacy and flexibility of the web create the
expectation that the content is up-to-date; after all, in a library
no one expects every book to be current, but, on the other hand, it
is clear that books once published do not change, and it is fairly
easy to find the publication date.
[0005] While there have been substantial efforts in mapping and
understanding the growth of the web, there have been fewer
investigations of its death and decay. Determining whether a URL is
dead or alive is quite easy, at least to a first approximation,
and, in fact, it is known that web pages disappear at a rate of
0.25-0.5%/week. However, determining whether a web page has been
abandoned is much more difficult.
[0006] Thus, those skilled in the art desire a method for assessing
the decay status or "staleness" of a web page. In addition, those
skilled in the art desire methods for assessing the staleness of a
web page so that the method can be used as a way of ranking web
pages. Further, those skilled in the art desire methods and
apparatus for use in web maintenance activities. Methods and
apparatus that accurately assess the staleness of web pages are
particularly useful in managing web maintenance activities.
SUMMARY OF THE PREFERRED EMBODIMENTS
[0007] A first alternate embodiment of the present invention
comprises a signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing
apparatus of a computer system to perform operations for assessing
the currency of a web page, the operations comprising: establishing
a date threshold, wherein web pages older than the date threshold
will be assessed as not being current; accessing a web page;
extracting date information from the web page identifying the age
of the web page; and comparing the date information extracted from
the web page to the date threshold.
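The date comparison of this embodiment might be sketched as follows, by way of example only. The sketch assumes that date information appears in the page text in `YYYY-MM-DD` form and that the most recent such date stands for the age of the page. The function names, the regular expression, and the choice to treat a page dated exactly at the threshold as current are assumptions of the example, not details given in the application.

```python
import re
from datetime import date
from typing import Optional

# Assumed date format for the example: ISO-style YYYY-MM-DD.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def latest_date(page_text: str) -> Optional[date]:
    """Return the most recent valid date found in the page text, if any."""
    found = []
    for year, month, day in ISO_DATE.findall(page_text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # Skip strings that look like dates but are not (e.g. month 13).
            continue
    return max(found) if found else None


def is_current(page_text: str, threshold: date) -> Optional[bool]:
    """Compare the extracted date to the date threshold.

    Returns None when no date information could be extracted.
    """
    newest = latest_date(page_text)
    if newest is None:
        return None
    # Pages older than the threshold are assessed as not being current.
    return newest >= threshold
```

For instance, `is_current("Updated 2004-11-01", date(2004, 1, 1))` evaluates to `True`, while a page whose only date is `2001-03-09` evaluates to `False` against the same threshold.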
[0008] A second alternate embodiment of the present invention
comprises a signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing apparatus of a
computer system to perform operations for assessing the currency of
a web page, the operations comprising: receiving a user-specified
topicality threshold, where the topicality threshold concerns the
topicality of material content of the web page; accessing a web
page; extracting topicality information from the web page; and
comparing the topicality information extracted from the web page to
the topicality threshold.
[0009] A third alternate embodiment of the present invention
comprises: a signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing
apparatus of a computer system to perform operations for assessing
the currency of a web page, the operations comprising: establishing
a link threshold, wherein a web page will be assessed as lacking
currency if a percentage of hyperlinks contained in the web page
that link to an active page is less than the link threshold;
accessing a web page containing hyperlinks; testing the hyperlinks;
calculating the percentage of hyperlinks that return active web
pages; and comparing the percentage of hyperlinks that return
active web pages with the link threshold.
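As a minimal sketch of the calculation and comparison in this embodiment, and assuming each hyperlink has already been tested and reduced to a true (active) or false (dead) result, the percentage test might look like the following. The function names and the handling of a page with no hyperlinks (no assessment is made) are assumptions of the example.

```python
from typing import Iterable, Optional


def live_link_percentage(link_is_live: Iterable[bool]) -> Optional[float]:
    """Percentage of tested hyperlinks that returned an active web page."""
    results = list(link_is_live)
    if not results:
        return None  # no hyperlinks, so no percentage can be computed
    return 100.0 * sum(results) / len(results)


def lacks_currency(link_is_live: Iterable[bool],
                   link_threshold: float) -> Optional[bool]:
    """True if the live-link percentage is less than the link threshold."""
    pct = live_link_percentage(link_is_live)
    if pct is None:
        return None
    return pct < link_threshold
```

With three active links out of four, the percentage is 75.0, so the page is assessed as current against a link threshold of 50 and as lacking currency against a link threshold of 80.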
[0010] A fourth alternate embodiment of the present invention
comprises: a signal-bearing medium tangibly embodying a program of
machine-readable instructions executable by a digital processing
apparatus of a computer system to perform operations for assessing
the decay of a web page, the operations comprising: accessing a
subject web page containing hyperlinks; assessing the decay of the
subject web page by following a random walk away from the subject
web page, where the random walk consists of a testing of links on
the subject web page and web pages linked to the subject web page
under test; and assigning a decay score to the subject web page in
dependence on dead links encountered in the random walk, wherein
the decay score is a weighted sliding scale, where a dead link
encountered relatively close in the random walk to the subject web
page in terms of intermediate web pages results in a higher decay
score than a dead link encountered relatively farther away from the
subject web page.
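One way such a random walk with a weighted sliding scale might be sketched is shown below, for illustration only. The sketch works on an in-memory link graph rather than the live web, and it weights a dead link found after `depth` intermediate pages by `damping ** depth`. The geometric weighting, the parameter names, the number of walks, and the step limit are all assumptions of the example; the passage above requires only that nearer dead links produce a higher decay score than farther ones.

```python
import random
from typing import Dict, List, Optional, Set


def decay_score(start: str,
                links: Dict[str, List[str]],
                dead: Set[str],
                damping: float = 0.5,
                walks: int = 1000,
                max_steps: int = 10,
                rng: Optional[random.Random] = None) -> float:
    """Estimate a decay score for `start` by repeated random walks.

    links maps each page to the pages it links to; dead is the set of
    link targets already assessed as dead.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    total = 0.0
    for _ in range(walks):
        page = start
        for depth in range(max_steps):
            out = links.get(page, [])
            if not out:
                break  # nowhere to go; this walk met no dead link
            target = rng.choice(out)
            if target in dead:
                # A dead link close to the subject page (small depth)
                # contributes more than one found farther away.
                total += damping ** depth
                break
            page = target
    return total / walks
```

Under these assumptions a page whose only link is dead scores 1.0, a page whose only link leads to a page whose only link is dead scores 0.5, and a page with no dead links within reach scores 0.0.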
[0011] A fifth alternate embodiment of the present invention
comprises: a computer system for assessing the currency of a web
page, the computer system comprising: an internet connection for
connecting to the internet and for accessing web pages available on
the internet; at least one memory to store web pages retrieved from
the internet and at least one program of machine-readable
instructions, where the at least one program performs operations to
assess the currency of a web page; at least one processor coupled
to the internet connection and the at least one memory, where the
at least one processor performs the following operations when the
at least one program is executed: retrieving a date threshold,
wherein web pages older than the date threshold will be assessed as
not being current; accessing a web page; extracting date
information from the web page identifying the age of the web page;
and comparing the date information extracted from the web page to
the date threshold.
[0012] A sixth alternate embodiment of the present invention
comprises: a computer system for assessing the currency of a web
page, the computer system comprising: an internet connection for
connecting to the internet and for accessing web pages available on
the internet; at least one memory to store web pages retrieved from
the internet and at least one program of machine-readable
instructions, where the at least one program performs operations to
assess the currency of a web page; at least one processor coupled
to the internet connection and the at least one memory, where the
at least one processor performs the following operations when the
at least one program is executed: retrieving a predetermined
topicality threshold, where the topicality threshold concerns the
topicality of material comprising
a web page; extracting topicality information from the web page;
and comparing the topicality information extracted from the web
page to the topicality threshold.
[0013] A seventh alternate embodiment of the present invention
comprises: a computer system for assessing the currency of a web
page, the computer system comprising: an internet connection for
connecting to the internet and for accessing web pages available on
the internet; at least one memory to store web pages retrieved from
the internet and at least one program of machine-readable
instructions, where the at least one program performs operations to
assess the currency of a web page; at least one processor coupled
to the internet connection and the at least one memory, where the at
least one processor performs the following operations when the at
least one program is executed: establishing a link threshold, wherein a
web page will be assessed as lacking currency if a percentage of
hyperlinks contained in the web page that link to an active page is
less than the link threshold; accessing a web page containing
hyperlinks; testing the hyperlinks; calculating the percentage of
hyperlinks that return active web pages; and comparing the
percentage of hyperlinks that return active web pages with the link
threshold.
[0014] An eighth alternate embodiment of the present invention
comprises: a computer system for assessing the decay of a web page
comprising: an internet connection for connecting to the internet
and for accessing web pages available on the internet; at least one
memory to store web pages retrieved from the internet and at least
one program of machine-readable instructions, where the at least
one program performs operations to assess the decay of a web page; at
least one processor coupled to the internet connection and the at
least one memory, where the at least one processor performs the
following operations when the at least one program is executed:
accessing a subject web page containing hyperlinks; assessing the
decay of the subject web page by following a random walk away from
the subject web page, where the random walk consists of a testing
of links on the subject web page and web pages linked to the
subject web page under test; and assigning a decay score to the
subject web page in dependence on dead links encountered in the
random walk, wherein the decay score is a weighted sliding scale,
where a dead link encountered relatively close in the random walk
to the subject web page in terms of intermediate web pages results
in a higher decay score than a dead link encountered relatively
farther away from the subject web page.
[0015] Thus it is seen that embodiments of the present invention
overcome the limitations of the prior art. In particular, in the
prior art there was no known way to assess the currency of a
webpage. In contrast, the apparatus and methods of the present
invention provide a reliable and accurate method for assessing the
currency of a webpage.
[0016] The methods and apparatus of the present invention are
particularly useful in combination with web ranking and enterprise
web management applications. In web ranking situations, it is not
desirable to assign a high ranking to a web page that is grossly
out of date. Accordingly, having an accurate assessment of the
currency of a web page is one factor that may be used in ranking a
particular web page.
[0017] In enterprise web management situations, proprietors of
web-based services wish to continually assess the currency of the
web pages constituting their web-based services. Thus, methods and
apparatus that can accurately assess the currency of web pages are
particularly useful in managing maintenance activities.
[0018] In conclusion, the foregoing summary of the alternate
embodiments of the present invention is exemplary and non-limiting.
For example, one of ordinary skill in the art will understand that
one or more aspects or steps from one alternate embodiment can be
combined with one or more aspects or steps from another alternate
embodiment to create a new embodiment within the scope of the
present invention.
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] The foregoing and other aspects of these teachings are made
more evident in the following Detailed Description of the Preferred
Embodiments, when read in conjunction with the attached Drawing
Figures, wherein:
[0020] FIG. 1 is a flowchart depicting the steps of a method
operating in accordance with an embodiment of the present
invention;
[0021] FIG. 2 is a flowchart depicting the steps of a method
operating in accordance with an embodiment of the present
invention;
[0022] FIG. 3 is a flowchart depicting the steps of a method
operating in accordance with an embodiment of the present
invention;
[0023] FIG. 4 is a flowchart depicting the steps of a method
operating in accordance with one embodiment of the present
invention;
[0024] FIG. 5 depicts a block diagram of a computer system suitable
for practicing the methods and apparatus of the present
invention;
[0025] FIG. 6 is a flowchart depicting the steps of a method
operating in accordance with an embodiment of the present
invention;
[0026] FIG. 7 is a flowchart depicting the steps of a method
operating in accordance with an embodiment of the present
invention;
[0027] FIG. 8 is a graph depicting the distribution of the fraction
of dead links and decay scores for various .sigma.'s;
[0028] FIG. 9 is a scatter plot of decay scores versus the fraction
of dead links;
[0029] FIG. 10 is a graph depicting the average decay score and
fraction of dead links for papers from the last ten WWW
conferences;
[0030] FIG. 11 depicts the average decay scores and fraction of
dead links for 30 Yahoo nodes; and
[0031] FIG. 12 depicts the average decay scores and fraction of
dead links for FAQs.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0032] A method for assessing the currency of a web page operating
in accordance with the present invention is depicted in FIG. 1. In
step 10, a date threshold is established, where web pages older
than the date threshold will be assessed as not being current.
Next, at step 11, a web page is accessed over the internet. Then,
at step 12 date information is extracted from the web page
identifying the age of the web page. In alternate embodiments, the
"last-modified" information may be extracted from the web page,
indicating when the web page was last modified. Next, at step 13,
the date information extracted from the web page is compared to the
date threshold. If the date information extracted from the web page
is older than the date threshold, the web page is assessed as
lacking currency; if it is younger, the web page is assessed as
being current.
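The comparison of steps 12 and 13 can be sketched as follows. This is a minimal illustration only, assuming the date information is the value of an HTTP "Last-Modified" header; the function name `is_current` and the treatment of a page dated exactly at the threshold as current are assumptions, not part of the described method.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_current(last_modified_header: str, date_threshold: datetime) -> bool:
    """Compare a page's Last-Modified date to the date threshold (step 13).

    Returns True when the page is not older than the threshold.
    Treating an exact tie as current is an assumption of this sketch.
    """
    page_date = parsedate_to_datetime(last_modified_header)
    if page_date.tzinfo is None:
        # Headers without zone information are assumed to be UTC.
        page_date = page_date.replace(tzinfo=timezone.utc)
    return page_date >= date_threshold


# Step 10: establish the date threshold (the value here is hypothetical).
threshold = datetime(2004, 1, 1, tzinfo=timezone.utc)
```

A caller would obtain the header value from the response for the web page accessed at step 11 and pass it to `is_current` together with the threshold.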
[0033] Another method operating in accordance with the present
invention is depicted in FIG. 2. At step 20, the method receives a
user-specified topicality threshold, where the topicality threshold
concerns the topicality of material content of the web page. As
used herein, "topical" means a web page whose content or subject
matter is current or up-to-date; a web page the content of which is
not "topical" is outmoded or out-of-date. The topicality threshold
can be specified in a number of ways. For example, the topicality
threshold can concern date references within the content of the web
page. Alternatively, the topicality threshold can concern
historical events the presence of which would indicate that the
page is out of date. Further, the topicality threshold can be set
by using product identifiers. If a researcher sought to assess the
currency of web pages concerning computer hardware, the researcher
could use product identifiers as indicia of whether a web page is
up to date. For example, a web page discussing Pentium III
processors for non-historical reasons would be out of date. Then,
at step 21 a web page is accessed over the internet. Next, at step
22 topicality information is extracted from the web page. Then, at
step 23, the topicality information extracted from the web page is
compared to the topicality threshold. If the comparison reveals
that the information extracted from the web page lacks topicality
when compared to the topicality threshold, the web page is assessed
as lacking currency. Alternatively, if the information extracted
from the web page is topical when compared to the topicality
threshold, the web page is assessed as being current.
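One of the options named above, a topicality threshold based on date references within the page content, might be sketched as follows. The regular expression, the helper names, and the choice to treat a page with no year references as not topical are all assumptions made for illustration.

```python
import re
from typing import Optional

# Matches four-digit years from 1900 through 2099 (an illustrative choice).
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def newest_year_mentioned(page_text: str) -> Optional[int]:
    """Step 22: extract topicality information, here the latest year cited."""
    years = [int(y) for y in YEAR_PATTERN.findall(page_text)]
    return max(years) if years else None


def is_topical(page_text: str, threshold_year: int) -> bool:
    """Step 23: compare the extracted information to the threshold.

    A page with no year references is treated as not topical, which is
    an assumption of this sketch rather than a rule from the text.
    """
    newest = newest_year_mentioned(page_text)
    return newest is not None and newest >= threshold_year
```

A threshold based on product identifiers could follow the same shape, with the extraction step looking for a list of outdated product names instead of years.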
[0034] A further method operating in accordance with the present
invention is depicted in FIG. 3. At step 30, a link threshold is
established, where a web page will be assessed as lacking currency
if a percentage of hyperlinks contained in the web page that link
to active web pages is less than the threshold. Next, at step 31, a
web page containing hyperlinks is accessed over the internet. Then,
at step 32, the hyperlinks contained in the web page are tested.
Next, at step 33, the percentage of hyperlinks that return active
web pages is calculated. Then, at step 34, the percentage of
hyperlinks that return active web pages is compared with the link
threshold. If the percentage is less than the link threshold, the
web page is assessed as lacking currency; if it is greater than the
link threshold, the web page is assessed as being current. One of
ordinary skill in the art will understand that the link threshold
could have been set in terms of hyperlinks which do not return
active web pages.
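Steps 32 through 34 can be sketched as follows. The predicate `is_active` is a hypothetical stand-in for whatever link test is used, and treating a page with no hyperlinks as fully active is an assumption of this sketch.

```python
from typing import Callable, Iterable


def active_link_percentage(links: Iterable[str],
                           is_active: Callable[[str], bool]) -> float:
    """Steps 32-33: test each hyperlink and compute the active percentage."""
    links = list(links)
    if not links:
        # Assumption: a page without hyperlinks has no dead links.
        return 100.0
    active = sum(1 for link in links if is_active(link))
    return 100.0 * active / len(links)


def is_current_by_links(links: Iterable[str],
                        is_active: Callable[[str], bool],
                        link_threshold: float) -> bool:
    """Step 34: the page lacks currency if the percentage is below threshold."""
    return active_link_percentage(links, is_active) >= link_threshold
```

Passing the link test in as a function keeps the percentage calculation independent of how a dead link is actually detected.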
[0035] The next aspect of the present invention concerns assessing
whether a hyperlink does, in fact, link to a dead page. Dead links
are the clearest giveaway to the obsolescence of a page. Indeed,
this phenomenon of "link-rot" has been studied in several
areas--for example, Fetterly et al. [16] in the context of web
research, Koehler [22, 23] in the context of digital libraries, and
Markwell and Brooks [26, 27] in the context of biology education.
However, using the proportion of dead links as a decay signal
presents two problems.
[0036] (1) The first problem--determining whether a link is
"dead"--is not trivial. According to the HTTP protocol [17] when a
request is made to a server for a page that is no longer available,
the server is supposed to return an error code, usually the HTTP
return code 404. As discussed in the following sections, in fact
many servers, including most reputable ones, do not return a 404
code--instead the servers return a substitute page and an OK code
(200). The substitute page sometimes gives a written error
indication, sometimes returns a redirect to the original domain
home page, and sometimes returns a page which has absolutely
nothing to do with the original page. Studies show that these types
of substitutions, called "soft-404s," account for more than 25% of
the dead links. This issue is discussed in detail and a heuristic
is proposed for the detection of servers that engage in soft 404s.
The heuristic is effective for all cases except for one special
case: a dead domain home page bought by a new entity and/or
"parked" with a broker of domain names: in this special case it can
be determined that the server engages in soft 404 in general but
there is no way to know whether the domain home page is a soft 404
or not.
[0037] (2) The second problem associated with dead links as a decay
signal is that they are very noisy signals. One reason is that they
are easy to manipulate. Indeed, many commercial sites use content
management systems and quality check systems that automatically
remove any link that results in a 404 code. For example,
experiments indicate that the Yahoo! taxonomy is continuously
purged of any dead links. However, this is hardly an indication
that every piece of the Yahoo! taxonomy is up-to-date.
[0038] Another reason for the noisiness is that pages of certain
types tend to live "forever" even though no one maintains them: a
typical example might be graduate students' pages--many universities
allow alumni to keep their pages and e-mail addresses indefinitely
as long as they do not waste too much space. Because these pages
link among themselves at a relatively high rate, they will have few
dead links on every page, even long after the alumni have left the
ivory towers; it is only as a larger radius is examined around
these pages that a surfeit of dead links is observed.
[0039] The discussion above suggests that the measure of the decay
of page p should depend not only on the proportion of dead pages at
distance 1 from p but also, and to a decreasing extent, on the
proportion of dead pages at distance 2, 3, and so on.
[0040] One way to estimate these proportions is via a random walk
from p: at every step if a dead page is reached failure is
declared, otherwise with probability .sigma. success is declared,
and with probability 1-.sigma. the walk continues. The decay score
of p, denoted D(p) is defined as the probability of failure in this
walk. Thus the decay score of a page p will be some number between
0 and 1.
[0041] At first glance, this process is similar to the famous
random surfer of PageRank [7]; however, they are quite different in
practice: for PageRank the importance of a page p depends
recursively on the importance of the pages that point to p. In
contrast the decay of p depends recursively on the decay of the
pages that are linked from p. Thus, computing the underlying
recurrence once the web graph is fully explored and represented is
very similar, but
[0042] 1. The decay of a given page can be approximated in
isolation, that is, without having to compute the decay of all pages
in the graph, hence it is a much easier task when the number of nodes
of interest is relatively small.
[0043] 2. While the owner of a page p has few licit means of
improving its PageRank, it can easily reduce its decay by simply
making sure that all the links on page p go to well maintained pages.
[0044] It is generally agreed that PageRank is a better signal for
the quality of a page than simply its in-degree (i.e., the number
of pages that point to it) and recent studies [29, 10] have shown
that the in-degree has only limited correlation with PageRank.
Similar questions can be asked about the decay number versus the
dead links proportion: experiments indicate that their correlation
is only limited and indeed the decay number is a better indicator.
For instance, on average, the set of 30 pages that were analyzed from
the Yahoo! taxonomy have almost no dead links, but have relatively
high decay, roughly the median value observable on the Web. This
seems to indicate that Yahoo! has a filter that drops dead links
immediately, but on the other hand the editors that maintain Yahoo!
do not have the resources to check very often whether a page once
listed continues to be as good as it was.
[0045] A dead web page is a page that is not publicly available
over the web. A page can be dead for any of the following reasons:
(1) its URL is malformed; (2) its host is down or non-existent; or
(3) it does not exist on the host. The first two types of dead
pages are easy to detect: the former fails URL parsing and the
latter fails the resolution of the host address. When fetching
pages that are not found on a host, the web server of the host is
supposed to return an error; typically the error message returned
is the 404 HTTP return code. However, it turns out that many web
servers today do not return an error code even when they receive
HTTP requests for non-existent pages. Instead, they return an OK
code (200) and some substitute page; typically, this substitute is
an error message page or the home-page of that host or even some
completely unrelated page. Such non-existent pages that cause a
server to issue the foregoing result are called "soft-404
pages".
[0046] The existence of soft-404 pages makes the task of
identifying dead pages non-trivial. Next to be described will be an
algorithm for this task operating in accordance with one embodiment
of the present invention. The pseudo code for the task is
reproduced in Appendix A, and a flowchart depicting the steps of
the method is shown in FIG. 4. For the rest of the discussion, a
web page will be identified with its URL, and the two concepts will
be used interchangeably.
[0047] A soft-404 page is a non-existent page that does not result
in the return of an error code. This is because the server to which
the web page request was directed is programmed to issue an
alternate page whenever a 404 error message would ordinarily be
issued. In contrast, a hard 404 page is a non-existent page that
returns an error code of 403, 404 or 410, or any error code of the
form 5xx. Dead pages consist of soft-404 pages, hard-404 pages, and
a few more cases such as time-outs and infinite redirects discussed
below.
[0048] Let u be the URL of a page, to be tested whether dead or
alive. Let u.host denote the host of u, and let u.parent denote the
URL of the parent directory of u. For example, both the host and
the parent directory URL of http://www.ibm.com/us are
http://www.ibm.com; however the parent directory of
http://www.ibm.com/us/hr is http://www.ibm.com/us. Both u.host and
u.parent can be extracted from u by proper parsing.
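The parsing can be illustrated with the Python standard library. This is a sketch only; the function names `host_url` and `parent_url` are hypothetical, and query strings and fragments are simply discarded.

```python
import posixpath
from urllib.parse import urlsplit


def host_url(u: str) -> str:
    """Return u.host, i.e. the scheme and network location of u."""
    parts = urlsplit(u)
    return f"{parts.scheme}://{parts.netloc}"


def parent_url(u: str) -> str:
    """Return u.parent, the URL of the parent directory of u."""
    parts = urlsplit(u)
    path = parts.path.rstrip("/")         # ignore a trailing slash
    parent_path = posixpath.dirname(path)
    if parent_path == "/":
        parent_path = ""                  # parent is the host itself
    return f"{parts.scheme}://{parts.netloc}{parent_path}"
```

Applied to the examples in the text, both `host_url` and `parent_url` map http://www.ibm.com/us to http://www.ibm.com, while `parent_url` maps http://www.ibm.com/us/hr to http://www.ibm.com/us.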
[0049] An algorithm operating in accordance with aspects of methods
and apparatus of the present invention starts by attempting to
fetch u from the web (Line 3 of the function DeadPage). A fetch
(step 100 in FIG. 4; see function atomicFetch) may result in one of
the following three outcomes: (1) it succeeds, (2) it fails, or (3)
it redirects to a different URL v. The possible reasons for failure
are: (a) u is an invalid URL and could not be properly parsed
(lines 2-3 of atomicFetch); (b) the local DNS server could not
resolve the IP address of u.HOST (lines 6-7 of atomicFetch); (c)
when creating a connection to u.HOST, there was no response within
T seconds (in experiments T=10 was chosen) (Lines 10-11 of
atomicFetch); or (d) the web server of u.HOST returns an error HTTP
return code in response to the request for u (Lines 12-13 of
atomicFetch). The HTTP return codes which are considered to be
errors are 403 (Forbidden), 404 (Not found), 410 (Gone), and all
the codes of the form 5xx (Server errors). If a return code in
these classes is returned, the algorithm concludes that the page
does not exist at step 112 in FIG. 4. A success is an HTTP return code in
the 2xx series or 4xx series (except for 403, 404, 410), and a
redirect is indicated by an HTTP return code in the 3xx series.
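The classification of HTTP return codes described in this paragraph can be written compactly. The function name `classify_status` and the string labels are assumptions of this sketch; the code ranges are those stated above.

```python
def classify_status(code: int) -> str:
    """Map an HTTP return code to "failure", "redirect", or "success"."""
    # 403, 404, 410 and every 5xx code are treated as errors.
    if code in (403, 404, 410) or 500 <= code <= 599:
        return "failure"
    # Any 3xx code indicates a redirect.
    if 300 <= code <= 399:
        return "redirect"
    # 2xx codes and the remaining 4xx codes count as success.
    if 200 <= code <= 299 or 400 <= code <= 499:
        return "success"
    # Codes outside these ranges (e.g. 1xx) are not covered by the text.
    raise ValueError(f"unclassified HTTP return code: {code}")
```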
[0050] Clearly when the fetch fails, the page is dead. Next to be
discussed is how to analyze the two other cases (success or
redirect). The redirect case is also rather simple. An algorithm
operating in accordance with the present invention attempts to
fetch u. If it redirects to a new URL v, it then attempts to fetch
v. It continues to follow the redirects, until reaching some URL
w.sub.u, whose fetch results in a success or a failure (see the
function fetch). (A third possibility is that the algorithm detects
a loop in the redirect path (Lines 12-13 of fetch) or that the
number of redirects exceeds some limit L, which is chosen to be 20
(Lines 14-15 of fetch); in such a case the algorithm declares u to
be a dead page, and stops). If the fetch of w.sub.u results in a
failure, u is declared a dead page as before. If the fetch results
in a success (step 114 in FIG. 4), the algorithm proceeds to
checking whether u is a soft-404 page.
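The redirect-following logic can be sketched as below. This is not the Appendix A pseudo code; the injected `atomic_fetch` callable, its return convention, and the tuple returned by `fetch` are assumptions chosen so the sketch runs without network access.

```python
from typing import Callable, Optional, Tuple

MAX_REDIRECTS = 20  # the limit L chosen in the text

# Hypothetical convention: atomic_fetch(url) returns (outcome, target) where
# outcome is "success", "failure" or "redirect", and target is the new URL
# for a redirect (None otherwise).
AtomicFetch = Callable[[str], Tuple[str, Optional[str]]]


def fetch(u: str, atomic_fetch: AtomicFetch,
          max_redirects: int = MAX_REDIRECTS) -> Tuple[str, str, int]:
    """Follow redirects from u; return (outcome, final_url, redirect_count)."""
    seen = {u}
    current = u
    redirects = 0
    while True:
        outcome, target = atomic_fetch(current)
        if outcome != "redirect":
            return outcome, current, redirects
        redirects += 1
        # A loop in the redirect path, or too many redirects, means dead.
        if target in seen or redirects > max_redirects:
            return "failure", current, redirects
        seen.add(target)
        current = target
```

The returned final URL and redirect count correspond to the quantities w.sub.u and K.sub.u used later in the soft-404 test.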
[0051] The algorithm detects whether u is a soft-404 page or not by
"learning" whether the web server of u.HOST produces soft-404s at
all. This is done by asking for a page r, known with high
probability not to exist on u.HOST at step 120 in FIG. 4. It then
compares the server behavior when asked for r, with its behavior
when asked for u.
[0052] The first question to be addressed is how to come up with a
page r that is likely not to exist on u.HOST with a high
probability. This is done as follows: first, a URL is chosen, which
has the same directory as u, and whose file name is a sequence of R
random letters (in experiments R=25 was chosen; see Line 5 of
DeadPage and step 120 of FIG. 4). The URL r is simply the
concatenation of the URL u.PARENT with the random sequence. Since
the file name is chosen at random, the probability that it exists
under that directory is at most N/26.sup.R, where N is the number
of files that do exist under the directory. For any reasonable
value of N, this probability is tiny, and thus it can be safely
assumed that the random page r does not exist.
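Generating r and evaluating the bound N/26.sup.R can be sketched as follows. The function names are hypothetical, and the use of lowercase letters is an assumption consistent with an alphabet of 26 symbols.

```python
import random
import string

R = 25  # length of the random file name chosen in the text


def random_sibling_url(parent_url: str, rng=random, length: int = R) -> str:
    """Concatenate u.parent with a file name of `length` random letters."""
    name = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return parent_url.rstrip("/") + "/" + name


def collision_bound(n_files: int, length: int = R) -> float:
    """Upper bound N / 26**R on the chance that the random name exists."""
    return n_files / 26 ** length
```

Even for a directory holding a million files, `collision_bound(10**6)` is far below 10^-29, which is why r can be treated as non-existent.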
[0053] The reason to choose r to be in the same directory as u (and
not as a random page under u.HOST) is that in large hosts different
directories are controlled by different web servers, and therefore
may exhibit different responses to requests for non-existent pages.
An example is the host http://www.ibm.com. When trying to fetch a
non-existent page http://www.ibm.com/blablabla, the result is a 404
code. However, a fetch of http://www.ibm.com/us/blablabla returns
the home-page http://www.ibm.com/us. Thus
http://www.ibm.com/us/blablabla is a soft-404 page, but
http://www.ibm.com/blablabla is a hard-404 page.
[0054] Next it is necessary to compare the behavior of the web
server on r with its behavior on u. Let w.sub.r and w.sub.u denote
the final URLs reached when following redirects from r and u,
respectively. Let T.sub.r and T.sub.u denote the contents of
w.sub.r and w.sub.u, respectively. Let K.sub.r and K.sub.u denote
the number of redirects the algorithm had to follow to reach
w.sub.r and w.sub.u, respectively.
[0055] If the fetch of w.sub.r results in a failure, it is
concluded at step 132 in FIG. 4 that the web server does not
produce soft-404 pages. Since the fetch of w.sub.u succeeded, the
algorithm can safely declare u as alive (Lines 7-8 in DeadPage).
Suppose, then, that the fetch of w.sub.r results in a success. Thus,
r is a soft-404 page.
[0056] If w.sub.r=w.sub.u and K.sub.r=K.sub.u, then u and r are
indistinguishable. This gives a clear indication that u is a
soft-404 page except for one special case: there are situations
when soft-404 pages and legitimate URLs both redirect to the same
final destination (for example, to the host's home-page). A good
example of that is the URL http://www.cnn.de (the CNN of Germany),
which redirects to http://www.n-tv.de; however, also a non-existent
page like http://www.cnn.de/blablabla redirects to
http://www.n-tv.de. Thus the following heuristic is used: if u is a
root of a web site, then it can never be a soft-404 page (step 140
of FIG. 4 and Lines 9-10 of DeadPage; see discussion below about
when this heuristic may fail). Otherwise at step 150, if
w.sub.r=w.sub.u and K.sub.r=K.sub.u, then u is declared a soft-404
page (Step 152 of FIG. 4; Lines 13-14 of DeadPage).
[0057] If K.sub.r.noteq.K.sub.u (step 142 in FIG. 4) the algorithm
declares u to be alive (step 180) (even if w.sub.r=w.sub.u),
because the behavior of the web server on u is different from its
behavior on r (Lines 11-12 of DeadPage). An example that
demonstrates that the number of redirects is crucial for the test
is http://www.eurosport.de/. Fetching http://www.eurosport.de/
incurs two redirects that finally land in a valid page. However,
fetching http://www.eurosport.de/blablabla redirects first to
http://www.eurosport.de/ and then results in two more redirects as
before. Thus, both the valid page and the soft-404 page end up at
the same valid page, but the former requires two redirects while
the latter requires three.
[0058] Even if w.sub.r.noteq.w.sub.u (step 152 in FIG. 4) it is
still possible that u is a soft-404 page, because in some hosts
each soft-404 page is redirected into a unique address
(http://www.amazon.com, for example). Thus, the contents of w.sub.r
and w.sub.u and the parameters K.sub.r and K.sub.u are next
examined at step 160. If w.sub.r.noteq.w.sub.u, K.sub.r=K.sub.u,
and T.sub.u and T.sub.r are identical or nearly-identical
(near-identity can be checked via shingling [8]), the algorithm
declares u to be a soft-404 page (step 162; Lines 15-16 of
DeadPage). If not, the page is declared to be alive at step 164.
Note that testing near-identity (as opposed to complete identity)
may be important, because sometimes the web server embeds the
non-existent URL u in the text of the page it returns or makes other
minor changes.
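The decision rules of the preceding paragraphs can be collected into a single sketch. It is not the DeadPage pseudo code of Appendix A: the `FetchResult` record, the word-level shingle size of 4, and the 0.9 resemblance cutoff are assumptions, and the resemblance measure is a plain Jaccard coefficient standing in for the shingling technique cited above.

```python
from dataclasses import dataclass


@dataclass
class FetchResult:
    outcome: str     # "success" or "failure" after following redirects
    final_url: str   # w: final URL reached
    redirects: int   # K: number of redirects followed
    content: str     # T: content of the final page


def shingles(text: str, k: int = 4) -> set:
    """Set of k-word shingles; short texts yield one shingle of all words."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a: str, b: str, k: int = 4) -> float:
    """Jaccard coefficient of the two shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


def is_soft_404(u_res: FetchResult, r_res: FetchResult,
                u_is_root: bool, cutoff: float = 0.9) -> bool:
    """Decide whether u is a soft-404 page, given that u's fetch succeeded."""
    if r_res.outcome == "failure":
        return False            # server returns real errors: u is alive
    if u_is_root:
        return False            # heuristic: a site root is never a soft-404
    if r_res.redirects != u_res.redirects:
        return False            # server treats u and r differently
    if r_res.final_url == u_res.final_url:
        return True             # u and r are indistinguishable
    # Different final URLs: fall back to comparing the page contents.
    return resemblance(u_res.content, r_res.content) >= cutoff
```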
[0059] A computer system for practicing the methods of the present
invention is depicted in simplified form in FIG. 5. The data
processing system 200 includes at least one data processor 201
coupled to a bus 202 through which the data processor may address a
memory sub-system 203, also referred to herein simply as "memory"
203. The memory 203 may include RAM, ROM and fixed and removable
disks and/or tape. The memory 203 is assumed to store at least one
program comprising instructions for causing the processor 201 to
execute methods in accordance with the present invention. Also
stored in memory 203 is at least one database 204.
[0060] The data processor 201 is also coupled through the bus 202
to a user interface, preferably a graphical user interface ("GUI")
205 that includes a user input device 205A, such as one or more of
a keyboard, a mouse, a trackball, a voice recognition interface, as
well as a user display device 205B, such as a high resolution
graphical CRT display terminal, a LCD display terminal, or any
suitable display device. With these input/output devices, a user
can initiate operations to determine the currency or staleness of a
web page.
[0061] The data processor 201 may also be coupled through the bus
202 to a network interface 206 that provides bidirectional access
to a data communications network 207, such as an intranet and/or
the internet. In various embodiments of the present invention, a
host 208 containing web pages to be tested can be accessed over the
internet through server 209.
[0062] In general, these teachings may be implemented using at
least one software program running on a personal computer, a
server, a microcomputer, a mainframe computer, a portable computer,
an embedded computer, or by any suitable type of programmable data
processor 201. Further, a program of machine-readable instructions
capable of performing operations in accordance with the present
invention may be tangibly embodied in a signal-bearing medium, such
as, a CD-ROM.
[0063] The above scheme attempts to capture as many of the cases of
soft-404 pages as possible. There are other instances of soft-404
errors that need to be detected, for example, when the root of a web
site is, in fact, a soft-404 page. An emerging
phenomenon on the web is the one of "parked web sites". These are
dead sites whose address was re-registered to a third party. The
third party puts a redirect from those dead sites into his own web
site. The idea is to profit from the prior promotional works of the
previous owners of the dead sites. A report by Edelman [15] gives a
nice description of this phenomenon as well as a case study of a
specific example.
[0064] Let n be the total number of pages. Let D ⊆ [n] be
the set of all dead pages, and let all other pages be live. Let M
be the n.times.n matrix of the multi-graph of links among pages, so
that M.sub.ij is the number of links on page i to page j. To begin,
one modification is performed on the matrix: M.rarw.M+I, adding a
self loop to each page. A measure D.sub..sigma.(i) will be defined
in terms of a "success parameter" .sigma..epsilon.[0, 1]. (In
experiments, .sigma.=0.1 is selected).
[0065] First, decay is described as a random process. Next, it is
given a formal recursive definition, and finally, it is cast as a
random walk in a Markov chain.
[0066] The measure can be seen as a random process governing a "web
surfer" as follows. Initially, the current page p is set to i, the
page whose decay is being computed (step 200 in FIG. 6). The surfer
at the current page will perform the following steps, eventually
returning a binary decay score depending on the random choices made
during execution of the steps; the process therefore defines a
distribution over {0, 1}. The decay D.sub..sigma.(i) is the mean of
this distribution.
[0067] 1. If p.epsilon.D, the surfer terminates with decay value 1:
the page is completely decayed (Steps 212 and 214 in FIG. 6).
[0068] 2. Otherwise the result is "no" (Step 216 in FIG. 6), and the
surfer flips a biased coin at step 220, and with probability .sigma.
decides that the content of the current page meets his information
need (Step 230 in FIG. 6), and hence terminates successfully with
decay score 0 (Step 234 in FIG. 6).
[0069] 3. With the remaining probability 1-.sigma., the surfer
chooses an outlink of p uniformly at random (Step 236 in FIG. 6),
sets p to be the destination of that outlink, and begins the process
again from step 200.
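The three steps above can be simulated directly. The following is a sketch under stated assumptions: the graph is given as a hypothetical dictionary `outlinks` from page to a list of linked pages, the set `dead` is known in advance rather than discovered by fetching, and the self loop added by M.rarw.M+I is modeled by appending the current page to its own outlink list.

```python
import random


def decay_trial(start, outlinks, dead, sigma, rng) -> int:
    """Run one walk of the surfer; return 1 on failure, 0 on success.

    Requires sigma > 0 so that the walk terminates with probability 1.
    """
    p = start
    while True:
        if p in dead:
            return 1                      # step 1: dead page reached
        if rng.random() < sigma:
            return 0                      # step 2: surfer is satisfied
        # Step 3: follow a uniformly random outlink (self loop included).
        p = rng.choice(list(outlinks.get(p, [])) + [p])


def estimate_decay(start, outlinks, dead, sigma=0.1, trials=300, rng=None):
    """Estimate the decay score as the mean of the binary trial outcomes."""
    rng = rng or random.Random()
    total = sum(decay_trial(start, outlinks, dead, sigma, rng)
                for _ in range(trials))
    return total / trials
```

The defaults mirror the values reported in the text (.sigma.=0.1 and 300 trials); a real implementation would replace the `dead` set lookup with the dead-page test described earlier.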
[0070] Unrolling this definition a few steps, it becomes clear that
the decay of a page is influenced by dead pages a few steps away,
but that the influence of a single path decreases exponentially
with the length of the path. For example, a dead page has decay 1, a
live page whose outlinks are all dead has decay 1-.sigma., a live
page all of whose outlinks point to live pages that in turn point only
to dead pages has decay (1-.sigma.).sup.2, etc.
[0071] Now, a formal definition of the decay measure is given.
Recursively, D.sub..sigma.(i) is defined as follows:

\[
D_\sigma(i) =
\begin{cases}
1 & i \in D, \\[4pt]
(1-\sigma)\,\dfrac{\sum_{j \in [n]} M_{ij}\, D_\sigma(j)}{\sum_{j \in [n]} M_{ij}} & \text{otherwise.}
\end{cases}
\]

Understanding the solution to this recursive formulation
is easiest in the context of random walks, as described below.
[0072] Decay scores may also be viewed as absorption probabilities
in a random walk. A Markov chain in which this walk takes place is
now defined. First, the incidence matrix of the web graph must be
normalized to be row stochastic (each nonzero element is divided by
its row sum). Next, two new states must be added to the chain, each
of which has a single outlink to itself: n+1 is the success state,
and n+2 is the failure state. Thus these two new states are
absorbing. Finally, the following two modifications are made to the
matrix: first, each dead state is modified to have a single outlink
with probability 1 to the failure state; second, all edges from
non-dead states ([n]\D) are multiplied by 1-.sigma. in probability,
and a new edge with probability .sigma. is added to the success
state. Hence the two new states are the only two absorbing states
of the chain, and any random walk in this chain will be eventually
absorbed in one of the two states. Walks in this new chain mirror
the random process described above, and the decay of page i is the
probability of absorption in the failure state when starting from
state i.
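When the link graph is available, the absorption probabilities can be obtained by iterating the recursive definition until it stabilizes. The sketch below assumes a hypothetical dictionary `links` from page to a list of linked pages and models the self loop of M.rarw.M+I explicitly; fixed-point iteration is one possible solution method chosen for illustration, not one prescribed by the text.

```python
def decay_scores(links, dead, sigma=0.1, tol=1e-12, max_iter=10_000):
    """Iterate D(i) = (1 - sigma) * average of D over outlinks of i."""
    dead = set(dead)
    pages = set(links) | dead | {j for outs in links.values() for j in outs}
    scores = {p: (1.0 if p in dead else 0.0) for p in pages}
    for _ in range(max_iter):
        new_scores = {}
        delta = 0.0
        for p in pages:
            if p in dead:
                new_scores[p] = 1.0       # dead pages have decay 1
                continue
            outs = list(links.get(p, [])) + [p]   # add the self loop
            avg = sum(scores[j] for j in outs) / len(outs)
            new_scores[p] = (1.0 - sigma) * avg
            delta = max(delta, abs(new_scores[p] - scores[p]))
        scores = new_scores
        if delta < tol:
            break
    return scores
```

For a live page a whose only outlink goes to a dead page, the self loop makes the fixed point D(a) = (1-.sigma.)/(1+.sigma.), about 0.818 for .sigma.=0.1. Each update shrinks differences by a factor of at most 1-.sigma., so the iteration converges for any .sigma.>0.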
[0073] Global static ranking measures such as PageRank [7] usually
have to be computed globally for the entire graph during a lengthy
batch process. Other graph oriented measures such as HITS [21] may
be computed on-the-fly, but require inlink information typically
derived from a complete representation of the web graph, such as
[4], or from a large scale search engine that makes available
information about the inlinks of a page.
[0074] Decay, on the other hand, is defined purely in terms of the
out-neighbors of i. The following observations can be made:
[0075] OBSERVATION 1. The decay value of a page can be approximated to
within constant accuracy in a constant number of HTTP fetches,
independent of the link structure of the graph, without access to
any other supporting indexes.
[0076] Such an implementation mirrors the random process definition
of decay set forth previously. Because the walk terminates with
probability at least .sigma. at each step, the distribution over
number of steps is bounded above by the geometric distribution with
parameter .sigma.; thus, the expected number of steps for a single
trial is no more than 1/.sigma., and the probability of long trials
is exponentially small. Further, the value of each trial is 0 or 1,
and so decay can be estimated to within error .epsilon. with
probability 1-.delta. in O(1/.epsilon..sup.2 log 1/.delta.) steps;
this follows from standard Chernoff bounds. (In practice, 300
trials are employed to estimate the decay value of each page).
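The trial count can be made concrete with Hoeffding's inequality, an additive Chernoff-type bound for outcomes in [0, 1]. Its use here, and the constants it implies, are assumptions of this sketch; the text states only the asymptotic form.

```python
import math


def trials_needed(epsilon: float, delta: float) -> int:
    """Trials so that |estimate - decay| <= epsilon with prob. >= 1 - delta.

    From Hoeffding: 2 * exp(-2 * n * epsilon**2) <= delta.
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def error_bound(trials: int, delta: float) -> float:
    """Error guaranteed with probability >= 1 - delta for a trial count."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * trials))


def expected_steps_per_trial(sigma: float) -> float:
    """Upper bound 1/sigma on the expected number of steps in one walk."""
    return 1.0 / sigma
```

Under this bound, 300 trials give an error of roughly 0.078 with 95% confidence, and with .sigma.=0.1 each trial takes at most 10 steps in expectation.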
[0077] An alternative method operating in accordance with the
present invention for assessing the decay of a web page is depicted
in FIG. 7. At step 250, a subject web page containing hyperlinks is
accessed over the internet. Then at step 251, the decay of the
subject web page is assessed by following a random walk away from
the subject web page, where the random walk consists of testing of
links on web pages linking from the subject web page under test. In
variants of this embodiment, the links being tested may be on web
pages directly linked to the subject web page whose decay status is
being tested, or may be on web pages linked to the subject web page
by an arbitrary number of intermediate web pages and hyperlinks.
Then at step 252, a decay score is assigned to the subject web page
in dependence on dead links encountered in the random walk, wherein
the decay score is a weighted sliding scale, where a dead link
encountered relatively close in the random walk to the subject web
page results in a higher decay score than a dead link encountered
relatively farther away from the subject web page.
[0078] Like other measures, decay is also amenable to the more
traditional batch computation; it is expected to require a time
similar to the time required by PageRank.
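One possible form of such a batch computation, offered only as a sketch, is a fixed-point iteration over the whole graph in the style of a PageRank power iteration. The function name `batch_decay` and the in-memory graph representation are hypothetical, and a live page with no outlinks is assumed to have decay 0.

```python
def batch_decay(outlinks, dead, sigma=0.1, iterations=200):
    """Iterate D(i) = (1 - sigma) * average of D over the outlinks of i."""
    pages = set(outlinks) | set(dead)
    for targets in outlinks.values():
        pages.update(targets)
    decay = {p: (1.0 if p in dead else 0.0) for p in pages}
    for _ in range(iterations):
        updated = {}
        for p in pages:
            links = outlinks.get(p, [])
            if p in dead:
                updated[p] = 1.0        # a dead page is fully decayed
            elif not links:
                updated[p] = 0.0        # assumption: no outlinks means no decay
            else:
                total = sum(decay[t] for t in links)
                updated[p] = (1.0 - sigma) * total / len(links)
        decay = updated
    return decay

scores = batch_decay({"A": ["B", "X"], "B": ["A"]}, {"X"})
```

Because each update shrinks differences by a factor of at most 1 minus the success parameter, the iteration converges geometrically, which is why a fixed iteration count is used here.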
[0079] Next, the algorithm for identifying dead pages and the
random walk algorithm for estimating the decay score of a given
page were implemented, and the several sets of experiments described
below were run. The first set of experiments validated that the
decay measure set forth previously is a reasonable measure for the
decay of web pages, and compared it to another plausible measure,
namely, the fraction of dead links on a page. After establishing
that the present decay measure is reasonable, it was used to
discover interesting facts about the web.
[0080] This section describes the parameter settings of the two
algorithms that were used in the experiments. The parameters of the
algorithm for detecting dead pages were set as follows:
[0081] A timeout of T=10 seconds is allowed for fetching a page. If
the server does not respond within 10 seconds, the page is declared
dead.
[0082] At most L=20 redirects are allowed for a page. If more than
20 redirects are encountered, the page is declared dead.
[0083] To create a random URL in the same directory as the page, a
sequence of 25 random lower case Latin letters is appended to the
parent directory.
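The construction of the random URL might be sketched as below. The helper name `random_sibling_url` is hypothetical, and dropping any query string or fragment from the original URL is an assumption of this sketch.

```python
import random
import string
from urllib.parse import urlsplit, urlunsplit

def random_sibling_url(url, length=25, rng=None):
    """Return a URL in the same directory as `url` with a random file name."""
    rng = rng or random.Random()
    parts = urlsplit(url)
    # Keep everything up to and including the last "/" of the path.
    parent_path = parts.path.rsplit("/", 1)[0] + "/"
    name = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return urlunsplit((parts.scheme, parts.netloc, parent_path + name, "", ""))

probe = random_sibling_url("http://example.com/a/b.html")
```

A name of 25 random letters is very unlikely to match a real resource, so whatever the server returns for the probe can be taken as its response to a missing page.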
[0084] The parameters of the random walk algorithm were set as
follows:
[0085] In general, a success parameter $\sigma=0.1$ is used. Thus,
at each step of the random walk, with probability 0.1, the random
walk proceeds to the success absorbing state. The expected length
of a random walk is then at most 10.
[0086] For each page, the random walk algorithm is run 300 times.
This guarantees an additive error in the decay measure estimates of
at most 0.1 with confidence at least 0.8.
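As a rough check of the stated guarantee, the number of trials can be related to the error and confidence through a concentration bound. The sketch below assumes the two-sided Hoeffding inequality for 0/1 trials; the description refers only to standard Chernoff bounds, so the exact constants are an assumption, and the function names are hypothetical.

```python
import math

def trials_needed(epsilon, delta):
    """Trials sufficient for additive error epsilon with probability 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def failure_bound(trials, epsilon):
    """Hoeffding upper bound on the probability that the error exceeds epsilon."""
    return 2.0 * math.exp(-2.0 * trials * epsilon ** 2)

enough = trials_needed(0.1, 0.2)      # error 0.1, confidence 0.8
bound_300 = failure_bound(300, 0.1)   # bound for the 300 trials used
```

Under this assumed bound, 116 trials would already suffice for error 0.1 at confidence 0.8, so the 300 trials used in the experiments satisfy the stated guarantee with a wide margin.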
[0087] On average, getting the decay score of a page took about 7
minutes on a machine with double 1.6 GHz AMD processors, 3 GB of
main memory, running a Linux operating system and having a 100 Mbps
connection to the network. Since the task was highly parallelizable
(the decay score of different pages could be estimated in parallel,
and also different random walks for the same page could be run in
parallel), about 10 random walk processes were run simultaneously,
in order to increase throughput.
[0088] The first experiment involved computing the decay score and
the fraction of dead links on 1000 randomly chosen pages. The pages
were chosen from a two billion page crawl performed largely in the
last four months.
[0089] To begin with, of the 1000 pages, 475 were already dead
(substantiating the claim that web pages have short half lives, on
average). For each remaining page, its decay score was computed as
well as the fraction of its dead links. In total, there were 710
dead links on the pages and out of these, 207 were pointing to
soft-404 pages (roughly 29%). Moreover, the random walks during the
decay score computation of the 525 pages encountered a total of
22,504 dead links, out of which 6,060 pointed to soft-404 pages
(roughly 27%). Another interesting statistic is that only 350 of
the 525 live pages had a non-empty "Last Modified Date".
[0090] The main statistic emerging out of this experiment is that
the average fraction of dead links is 0.068, whereas the average
decay scores of a live page with at least one outlink are 0.168,
0.106, 0.072, and 0.041 for $\sigma=0.1$, 0.2, 0.33, and 0.5,
respectively.
[0091] The decay curves in FIG. 8 reflect the fact that for a given
page $i$, if $\sigma_1 \ge \sigma_2$, then
$D_{\sigma_1}(i) \le D_{\sigma_2}(i)$. Proof: The decay is the
probability of absorption into the failure state. Consider all
paths that lead to the failure state. The weight of each individual
path under $\sigma_1$ is less than or equal to its weight under
$\sigma_2$; namely, for a path of length $k$, the weight under
$\sigma_j$ is $(1-\sigma_j)^k$ times the unbiased random walk weight
of the path. (The same argument does not work for the paths that
lead to the success state; their individual weight is not monotonic
in $\sigma$.)
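The argument can be written out in symbols as follows. Here the set of paths from page i to the failure state, the path length |p|, and the unbiased random walk weight w(p) are notation introduced only for this illustration.

```latex
D_{\sigma}(i) \;=\; \sum_{p \,\in\, \mathcal{P}_{\mathrm{fail}}(i)} (1-\sigma)^{|p|}\, w(p),
\qquad
\sigma_1 \ge \sigma_2
\;\Longrightarrow\;
(1-\sigma_1)^{k} \le (1-\sigma_2)^{k} \ \text{for all } k \ge 0
\;\Longrightarrow\;
D_{\sigma_1}(i) \le D_{\sigma_2}(i).
```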
[0092] For the rest of the description, $\sigma=0.1$ is used.
[0093] Clearly the decay and the fraction of dead links are related
but not in a simple way.
[0094] More precisely, if $F(i)$ is the fraction of dead links on
page $i$, and page $i$ is not dead, then
$$D_\sigma(i) = (1-\sigma)\bigl(F(i) + (1-F(i))\,\bar{D}_\sigma(i)\bigr) \qquad (1)$$
where $\bar{D}_\sigma(i)$ is the average decay of the non-dead
neighbors of $i$.
[0095] FIG. 8 shows that the distributions of the decay score and
of the fraction of dead links intersect. The difference between
them can also be seen from the scatter plot of these distributions
for $\sigma=0.1$ (FIG. 9). The scatter plot shows that the decay
score is generally more than the fraction of dead links. (This also
follows from equation 1.) More interestingly, it also shows that
the decay measure can be close to 0.5 even when the fraction of
dead links is close to 0.
[0096] The next experiments to be described concern papers from the
last ten World Wide Web conferences. All of the (refereed track)
papers from WWW3 to WWW12 were crawled, and for each paper with at
least one outlink, its decay score and fraction of dead links were
computed. The averaged results are shown in FIG. 10. The main
observation is that the trend exhibited by decay scores is more
representative and more useful than that of the fraction of dead
links. From the figure, it is evident that the decay scores decline
as conferences get more recent; on the other hand, the fraction of
dead links exhibits a flatter trend. Arguably, on average, links
contained in papers from older conferences not only have a higher
chance of themselves being dead, but also are more likely to point
to pages that have themselves decayed. Decay scores are therefore
better able to reflect the temporal aspect of hyperlink creation
and maintenance; it is believed this feature might have other
applications.
[0097] The next experiment used a set of 30 nodes from the current
Yahoo! ontology (Appendix B). The nodes were chosen so as to have a
relatively large number of outside links and to be well represented
in the Internet Archive (www.archive.org). The decay score and
fraction of dead links were computed for each of the 30 nodes. The
Internet Archive was used to fetch the previous incarnations of the
same nodes over the past five years, and the decay scores and
fraction of dead links were computed for these "old" pages as well.
Since the archived pages have time stamps embedded in the URL, at
the end of this step a history of decay scores and fraction of dead
links was obtained for each node. These scores were averaged over
the 30 nodes and the time line was bucketed into months (since
1998) to obtain FIG. 11.
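The bucketing step might look like the following sketch. It assumes that an archived URL carries a 14-digit time stamp in a `/web/<timestamp>/` path segment; that layout, the function name `monthly_average`, and the sample scores are illustrative assumptions and do not come from the experiments.

```python
import re
from collections import defaultdict

# Assumed layout: .../web/YYYYMMDDhhmmss/<original URL>
_TIMESTAMP = re.compile(r"/web/(\d{4})(\d{2})\d{8}/")

def monthly_average(snapshots):
    """Average scores per (year, month) from (archive_url, score) pairs."""
    buckets = defaultdict(list)
    for url, score in snapshots:
        match = _TIMESTAMP.search(url)
        if match is None:
            continue                     # skip URLs without a time stamp
        key = (int(match.group(1)), int(match.group(2)))
        buckets[key].append(score)
    return {key: sum(v) / len(v) for key, v in sorted(buckets.items())}

# Made-up sample data for illustration only.
sample = [
    ("http://web.archive.org/web/19990115000000/http://example.com/", 0.30),
    ("http://web.archive.org/web/19990120000000/http://example.org/", 0.10),
    ("http://web.archive.org/web/20010301000000/http://example.com/", 0.25),
]
history = monthly_average(sample)
```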
[0098] The decay scores and the fraction of dead links again behave
differently; the important point is that the difference is unlike
the one observed for the WWW conferences (FIG. 10). Unlike in the
WWW conference case, here the decay score is flatter whereas the
fraction of dead links is rapidly decreasing. The behavior of the
dead links is as expected: the fraction of dead links is close to 0
in the current version of the Yahoo! nodes, presumably due to their
automatic filtering of dead links. But, even in the current version
of these nodes, the figure shows that their decay score is as high
as that of a random web page (i.e., close to 0.2).
[0099] Thus, it can be concluded that many of the pages pointed to
by Yahoo! nodes, even though not yet dead themselves, are outdated
and littered with dead links. For example, consider the Yahoo!
category Health/Nursing. Only three out of 77 links on this page
are dead. However, the decay score of this page is 0.19. A few
examples of decayed pages that can be reached by browsing from the
above Yahoo! page are: (1) the page
http://www.geocities.com/Athens/4656/ has an ECG tutorial where all
the links are dead; (2) the page http://virtualnurse.com/er/er.html
has many dead links; (3) many of the links in the menu bar of
http://www.nursinglife.com/index.php?n=1&id=1 are dead; and so
on. It is believed that using decay scores in an automatic
filtering system will improve the overall quality of links in a
taxonomy like Yahoo!.
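As a worked illustration of the Health/Nursing figures, the relation stated earlier between the decay of a live page, its fraction of dead links, and the average decay of its live neighbors can be inverted to ask how decayed those neighbors must be. The helper name `implied_neighbor_decay` is hypothetical, and because the reported score of 0.19 is itself a sampled estimate, the result is only approximate.

```python
def implied_neighbor_decay(decay, dead_fraction, sigma=0.1):
    """Solve decay = (1 - sigma) * (f + (1 - f) * d_bar) for d_bar."""
    return (decay / (1.0 - sigma) - dead_fraction) / (1.0 - dead_fraction)

# Health/Nursing: 3 dead links out of 77, decay score 0.19, sigma = 0.1.
d_bar = implied_neighbor_decay(0.19, 3 / 77)
```

The result is roughly 0.18, so nearly all of the page's decay score is attributable to the pages it links to rather than to its own three dead links.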
[0100] The final set of experiments to be described involved the
frequently asked questions (FAQs) obtained from www.faqs.org. All
3,803 FAQs were collected and decay scores and the fraction of dead
links were computed for each of them. The last modified/last
updated date for the FAQs was computed by explicitly parsing the
FAQ (since the last modified date returned in the HTTP header from
www.faqs.org does not represent the actual date when the FAQ was
last modified/updated). As in the earlier case, the results were
collated and the time line bucketed into years since 1992 to obtain
FIG. 12.
[0101] From the figure, it is clear that although the FAQs are
hand-maintained in a distributed fashion by a number of diverse and
unrelated people, they suffer from the same problem: many pages
pointed to by FAQs are unmaintained.
[0102] A number of application areas could fruitfully apply the
decay concept:
[0103] (1) Webmaster and ontologist tools: There are a number of
tools made available to help webmasters and ontologists track dead
links on their sites; however, for web sites that maintain
resources, there are no tools to help understand whether the
linked-to resources are decayed. The observation about Yahoo! leaf
nodes suggests that such tools might provide an automatic or
semi-automatic approach to addressing the decay problem.
[0104] (2) Ranking: Decay measures have not been used in ranking,
but users routinely complain about search results pointing to pages
that either do not exist (dead pages), or exist but do not reference
valid current information (decayed pages). Incorporating the decay
measure into the rank computation will alleviate this problem.
Furthermore, web search engines could use the soft-404 detection
algorithm to eliminate soft-404 pages from their corpus. Note that
soft-404 pages indexed under their new content are still
problematic since most search engines put a substantial weight on
anchor text, and the anchor text to soft-404 pages is likely to be
quite wrong.
[0105] (3) Crawling: The decay score can be used to guide the
crawling process and the frequency of the crawl, in particular for
topic sensitive crawling [12]. For instance, one can argue that it
is not worthwhile to frequently crawl a portion of the web that has
sufficiently decayed; as seen in the described experiments, very
few pages have valid last modified dates in them. The on-the-fly
random walk algorithm for computing the decay score might be too
expensive to assist this decision at crawl time, but after a global
crawl one can compute the decay scores of all pages on the web at a
cost similar to that of PageRank. Heavily decayed pages can then be
crawled infrequently.
[0106] (4) Web sociology and economics: Measuring the decay score
of a topic can give an idea of the `trendiness` of the topic.
[0107] Thus it is seen that the foregoing description has provided
by way of exemplary and non-limiting examples a full and
informative description of the best methods and apparatus presently
contemplated by the inventors for assessing the currency or
staleness of web pages. One skilled in the art will appreciate that
the various embodiments described herein can be practiced
individually; in combination with one or more other embodiments
described herein; or in combination with methods and apparatus
differing somewhat from those described herein. Further, one
skilled in the art will appreciate that the present invention can
be practiced by other than the described embodiments; that these
described embodiments are presented for the purposes of
illustration and not of limitation; and that the present invention
is therefore limited only by the claims which follow.
[1] W. Aiello, F. Chung, and L. Lu. A random graph model for power law graphs. Experimental Mathematics, 10:53-66, 2001.
[2] Z. Bar-Yossef, A. Berg, S. Chien, J. Fakcharoenphol, and D. Weitz. Approximating aggregate queries about web pages via random walks. In Proceedings of the 26th International Conference on Very Large Databases, pages 535-544, 2000.
[3] A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509-512, 1999.
[4] K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: Fast access to linkage information on the Web. In Proceedings of the 7th International World Wide Web Conference, pages 104-111, 1998.
[5] K. Bharat and M. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104-111, 1998.
[6] B. Brewington and G. Cybenko. How dynamic is the web? In Proceedings of the Ninth International World Wide Web Conference, pages 257-276, May 2000.
[7] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th International World Wide Web Conference, pages 107-117, 1998.
[8] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web. In Proceedings of the 6th International World Wide Web Conference, pages 391-404, 1997.
[9] A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. WWW9/Computer Networks, 33(1-6):309-320, 2000.
[10] A. Z. Broder, R. Lempel, F. Maghoul, and J. Pedersen. Efficient Pagerank approximation via graph aggregation. Manuscript.
[11] S. Chakrabarti, B. Dom, D. Gibson, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Spectral filtering for resource discovery. In Proceedings of the ACM SIGIR Workshop on Hypertext Analysis, pages 13-21, 1998.
[12] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: a new approach to topic-specific web resource discovery. WWW8/Computer Networks, 31(11-16):1623-1640, 1999.
[13] J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th International Conference on Very Large Databases, pages 200-209, 2000.
[14] F. Douglis, A. Feldmann, B. Krishnamurthy, and J. C. Mogul. Rate of change and other metrics: a live study of the world wide web. In USENIX Symposium on Internet Technologies and Systems, 1997.
[15] B. Edelman. Domains reregistered for distribution of unrelated content: A case study of "Tina's Free Live Webcam". http://cyber.law.harvard.edu/people/edelman/renewals/, 2002.
[16] D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. In Proceedings of the 12th International World Wide Web Conference, pages 669-678, 2003.
[17] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. RFC2616: Hypertext Transfer Protocol--HTTP/1.1. http://www.w3.org/Protocols/rfc2616/rfc2616.html, June 1999.
[18] T. Haveliwala. Topic-sensitive PageRank. In Proceedings of the 11th International World Wide Web Conference, pages 517-526, 2002.
[19] M. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform URL sampling. WWW9/Computer Networks, 33(1-6):295-308, 2000.
[20] A. Jesdanun. Internet littered with dead web sites. http://story.news.yahoo.com/news?tmpl=story&u=/ap/20031102/ap_on_hi_te/deadwood_online_1, November 2002.
[21] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[22] W. Koehler. An analysis of web page and web site constancy and permanence. Journal of the American Society for Information Science, 50(2):162-180, 1999.
[23] W. Koehler. Digital libraries and world wide web sites and page persistence. Information Research, 4(4), 1999.
[24] K. Kokoszkiewicz (a.k.a. Alectorides Conradus). Vocabula Computatralia Anglico-Latinum. University of Warsaw, Centre for Studies on the Classical Tradition in Poland and East-Central Europe (OBTA). http://www.obta.uw.edu.pl/~draco/docs/voccomp.html.
[25] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In Proceedings of the 41st IEEE Annual Foundations of Computer Science, pages 57-65, 2000.
[26] J. Markwell and D. W. Brooks. Broken links: The ephemeral nature of educational WWW hyperlinks. Journal of Science Education and Technology, 11(2):105-108, 2002.
[27] J. Markwell and D. W. Brooks. "Link rot" limits the usefulness of web-based educational materials in biochemistry and molecular biology. Biochemistry and Molecular Biology Education, 31(1):69-72, 2003.
[28] A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proceedings of the 13th International World Wide Web Conference, 2004.
[29] G. Pandurangan, P. Raghavan, and E. Upfal. Using PageRank to characterize web structure. In Computing and Combinatorics: 8th Annual International Conference, pages 330-339, 2002.
[30] P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles. Methods for sampling pages uniformly from the world wide web. In Proceedings of the AAAI Fall Symposium on Using Uncertainty Within Computation, pages 121-128, 2001.
[31] J. L. Wolf, M. S. Squillante, P. S. Yu, J. Sethuraman, and L. Ozsen. Optimal crawling strategies for web search engines. In Proceedings of the 11th International World Wide Web Conference, pages 136-147, 2002.
APPENDIX A

Function bool DeadPage(u)
  in: URL u
  string T_u, T_r; URL w_u, w_r; int K_u, K_r; bool error
  fetch(u, T_u, w_u, K_u, error)
  if (error) then                  // a hard 404 error
    return true
  URL r = u.PARENT + 25 random characters
  fetch(r, T_r, w_r, K_r, error)
  if (error) then                  // host returns a hard-404 on dead pages
    return false
  if (u is the root of u.HOST) then
    return false                   // a root cannot be a soft-404
  if (K_u != K_r) then             // different number of redirects
    return false
  if (w_u = w_r) then              // same redirects & same number of redirects
    return true
  if (shingle(T_u) = shingle(T_r)) then   // almost-identical content
    return true
  return false                     // not a soft-404 page

Function fetch(u, T_u, w_u, K_u, error)
  in: URL u
  out: string T_u, URL w_u, int K_u, bool error
  w_u := u
  K_u := 0
  set<URL> redirects
  redirects.insert(u)
  while (true) do
    URL v, bool redirect
    atomicFetch(w_u, T_u, v, redirect, error)
    if (error) then
      return                       // a hard-404
    if (!redirect) then            // no more redirects
      return
    if (redirects.find(v)) then    // a redirect loop
      error := true; return
    if (K_u >= 20) then            // too many redirects
      error := true; return
    redirects.insert(v)            // remember v so later loops are detected
    w_u := v; K_u := K_u + 1
  end while

Function atomicFetch(w, T, v, redirect, error)
  in: URL w
  out: string T, URL v, bool redirect, bool error
  parse(w, error)
  if (error) then                  // parse URL failed
    return
  IPAddress address
  getIPAddress(w.HOST, address, error)
  if (error) then                  // resolution of host's IP address failed
    return
  HTTPRetCode code
  httpGet(address, T, v, code, timeout=10 sec, error)
  if (error) then                  // http get timed out
    return
  if (code in {403, 404, 410, 5xx}) then   // bad http return code
    error := true; return
  if (code in {3xx}) then
    redirect := true
  else
    redirect := false
APPENDIX B

1. Business_and_Economy/Classifieds
2. Business_and_Economy/Employment_and_Work/Organizations
3. Computers_and_Internet/News_and_Media/Magazines
4. Computers_and_Internet/News_and_Media/Magazines
5. News_and_Media/Journalism
6. News_and_Media/Television/Satellite
7. Entertainment/Music/Band_Naming
8. Entertainment/Humor
9. Recreation/Automotive
10. Recreation/Gambling
11. Health/Medicine
12. Health/Nursing
13. Health/Fitness
14. Government/Military/Weapons_and_Equipment
15. Government/Law
16. Regional/U_S_States/California/Education
17. Regional/Countries/France/Arts_and_Humanities/Museums_Galleries_and_Centers
18. Society_and_Culture/Environment_and_Nature
19. Society_and_Culture/Food_and_Drink/Cooking
20. Society_and_Culture/Death_and_Dying
21. Education/Higher_Education
22. Education/K_12/Gifted_Youth/Schools
23. Arts/Visual_Arts/Photography/Digital
24. Arts/Humanities/Literature/Poetry
25. Science/Computer_Science/Electronic_Computer_Aided_Design_ECAD_
26. Science/Biology/Zoology/Animals_Insects_and_Pets/Pets/Health
27. Social_Science/Psychology/Branches/Sleep_and_Dreams
28. Social_Science_Anthropology_and_Archaeology
29. Reference/Quotations
30. Reference/Dictionaries
* * * * *