U.S. patent application number 12/610225, for extraction of anchor explanatory text by mining repeated patterns, was filed with the patent office on 2009-10-30 and published on 2010-02-25.
This patent application is currently assigned to Microsoft Corporation. Invention is credited to Kefeng Deng, Feng Jing, Wei-Ying Ma, Lei Zhang.
Application Number | 20100049772 12/610225 |
Family ID | 38576736 |
Filed Date | 2009-10-30 |
United States Patent
Application |
20100049772 |
Kind Code |
A1 |
Jing; Feng ; et al. |
February 25, 2010 |
EXTRACTION OF ANCHOR EXPLANATORY TEXT BY MINING REPEATED
PATTERNS
Abstract
A method and system for identifying explanatory text for a
referenced web page based on a reference to the referenced web page
contained in a repeated pattern of a referencing web page is
provided. An anchor explanatory text ("AET") system uses the
hierarchical organization of the web page to identify a repeated
pattern of hierarchical elements that contain references to other
display pages. After the AET system identifies a repeated pattern,
it identifies the dominant reference or anchor within each
occurrence of the pattern. The AET system uses the explanatory text
surrounding a dominant anchor as a description of the referenced
web page.
Inventors: |
Feng Jing (Beijing, CN); Kefeng Deng (Beijing, CN); Lei Zhang
(Beijing, CN); Wei-Ying Ma (Beijing, CN) |
Correspondence
Address: |
PERKINS COIE LLP/MSFT
P. O. BOX 1247
SEATTLE
WA
98111-1247
US
|
Assignee: |
Microsoft Corporation
Redmond
WA
|
Family ID: |
38576736 |
Appl. No.: |
12/610225 |
Filed: |
October 30, 2009 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
11278289 | Mar 31, 2006 | 7627571 |
12610225 | | |
Current U.S.
Class: |
707/776 ;
707/E17.108; 715/234 |
Current CPC
Class: |
Y10S 707/99936 20130101;
G06F 16/951 20190101 |
Class at
Publication: |
707/776 ;
715/234; 707/E17.108 |
International
Class: |
G06F 17/30 20060101
G06F017/30; G06F 17/21 20060101 G06F017/21 |
Claims
1-11. (canceled)
12. A computer-readable storage medium containing instructions for
controlling a computer system to identify explanatory text for a
referenced web page from a referencing web page, by a method
comprising: identifying repeated patterns of elements within the
referencing web page, an element of a repeated pattern having a
reference to a web page along with text surrounding the reference;
and for each occurrence of a repeated pattern, identifying a
dominant reference to a web page; and extracting the text
surrounding the dominant reference as explanatory text for the
referenced web page.
13. The computer-readable storage medium of claim 12 wherein the
elements of the referencing web page are hierarchically organized
as nodes and wherein the identifying of repeated patterns
identifies a reference explanatory text node as a collection of
adjacent, sibling nodes with a subtree of one node containing a
reference node with surrounding text.
14. The computer-readable storage medium of claim 13 wherein the
identifying of repeated patterns identifies a reference explanatory
text region as a collection of adjacent, sibling reference
explanatory text nodes that have the same length and that are
within a threshold edit distance.
15. The computer-readable storage medium of claim 14 wherein the
threshold edit distance varies based on number of block nodes
within the reference explanatory text nodes.
16. The computer-readable storage medium of claim 12 wherein a
reference explanatory text node has a dominant reference node when
it has only one reference node that is a block node with a unique
subtree structure.
17. The computer-readable storage medium of claim 12 including
generating a summary of the referenced web page from the extracted
text that surrounds references to the referenced web page.
18. A computer system for identifying explanatory text for a
referenced web page from a referencing web page, comprising: a
memory storing computer-executable instructions of: a component
that identifies repeated patterns of elements within the
referencing web page, an occurrence of a repeated pattern having a
reference to a web page along with text surrounding the reference;
a component that identifies a dominant reference for each repeated
pattern; and a component that extracts text surrounding the
dominant reference as explanatory text for the referenced web page,
and a processor executing the computer-executable instructions
stored in the memory.
19. The computer system of claim 18 wherein the occurrences of a
repeated pattern have a similarity that is within a similarity
threshold that varies based on whether an occurrence contains a
block element.
20. The computer system of claim 18 wherein elements of the
referencing web page are hierarchically organized as nodes and
wherein the identifying of repeated patterns identifies a reference
explanatory text node as a collection of adjacent, sibling nodes
with a subtree of one node containing a reference node with
surrounding text and identifies a reference explanatory text region
as a collection of adjacent, sibling reference explanatory text
nodes that have the same length and are similar.
21. A method in a computing device for identifying explanatory text
for a referenced display page from a referencing display page,
comprising: identifying by the computing device repeated patterns
of elements within a display page by comparing elements of the
display page to other elements of the display page, a repeated
pattern having a reference to a referenced display page along with
text associated with the reference; for each identified repeated
pattern, identifying a dominant anchor of the
identified repeated pattern that is a reference to a referenced
display page along with text associated with the reference; and for
each identified dominant anchor, extracting the text associated
with the identified dominant anchor, wherein the extracted text
represents explanatory text for the display page referenced by the
identified dominant anchor.
22. The method of claim 21 wherein patterns of elements are
considered to be repeated when the patterns have the same number of
elements and the patterns have an edit distance that is within a
threshold.
23. The method of claim 21 wherein a display page is represented as
a tag tree with nodes representing elements and the identifying of
repeated patterns identifies a reference explanatory text node as a
collection of adjacent, sibling nodes with a subtree of one node
containing a reference node with associated text.
24. The method of claim 23 wherein the identifying of repeated
patterns includes identifying a reference explanatory text region
as a collection of adjacent, sibling reference explanatory text
nodes that have the same length and are similar.
25. The method of claim 21 including ranking the display page based
on the identified explanatory text.
26. The method of claim 21 wherein when an occurrence of a repeated
pattern includes multiple references with associated text,
designating that the occurrence does not have a dominant
anchor.
27. The method of claim 21 wherein when an occurrence of a repeated
pattern includes only one reference with associated text,
designating that the occurrence has a dominant anchor.
28. The method of claim 21 including using the identified
explanatory text when crawling the display page.
29. The method of claim 21 including using the identified
explanatory text for query refinement.
30. The method of claim 21 wherein the display page is a web page.
Description
BACKGROUND
[0001] The Internet allows users to access millions of electronic
documents, such as electronic mail messages, web pages, memoranda,
design specifications, electronic books, and so on. Because of the
large number of documents, it can be difficult for users to locate
documents of interest. To locate a document, a user may submit
search terms to a search engine. The search engine identifies
documents that may be related to the search terms and then presents
indications of those documents as the search result. When a search
result is presented, the search engine may attempt to provide a
summary of each document so that the user can quickly determine
whether a document is really of interest. Some documents may have
an abstract or summary section that can be used by the search
engine. Many documents, however, do not have abstracts or
summaries. The search engine may automatically generate a summary
for such documents. The usefulness of the automatically generated
summaries depends in large part on how effectively a summary
represents the main concepts of a document.
[0002] Many traditional information retrieval summarization
algorithms have been adapted to automatically generate summaries of
web pages from their content. For example, Luhn proposed an
algorithm that calculates the significance of a sentence to a
document based on keywords of the document that are contained
within the sentence. Luhn's algorithm selects the sentences with
the highest significance to form the summary of the document. As
another example, latent semantic analysis ("LSA") algorithms
generate an LSA score for each sentence of a document using
singular value decomposition. The sentences with the highest score
are selected to form the summary of the document. Unfortunately,
the summaries generated by the adaptation of these conventional
algorithms to web pages are not particularly accurate summaries of
the web pages. The main reason for the inaccuracies in the
summaries may be that many web pages contain content directed to
different topics (e.g., different news articles and
advertisements). Many conventional algorithms, in contrast, were
designed to generate a summary of a document having a primary
topic.
[0003] More recent algorithms use the hyperlink structure of the
web to generate more accurate summaries of web pages. In
particular, many of these techniques use the content of the web
pages that link to a web page to generate a summary for that web
page. The underlying assumption is that a web page author who
includes a link in their web page is likely to provide an accurate
(albeit possibly short) summary of the content of a referenced web
page. These hyperlink-based algorithms may use the text of the
hyperlink itself and the text surrounding the hyperlink to generate
a summary. Some algorithms that use the text surrounding the
hyperlink may extract a certain number of words (e.g., 25) before
and after a hyperlink or may extract a complete sentence or
paragraph surrounding a hyperlink.
[0004] These hyperlink-based or anchor-based algorithms, however,
have difficulty distinguishing hyperlinks with surrounding text
that accurately describes the referenced web page from those that
do not. For example, a web page may contain the sentence "Today, I
visited the <link>White House</link> with my mother."
The text surrounding this link, however, provides an inaccurate
description of a web page for the White House. As a result, these
hyperlink-based algorithms often generate summaries that are
inaccurate.
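By way of illustration only, the fixed-window extraction used by such hyperlink-based algorithms can be sketched as follows. The function name, the word-index representation of the anchor, and the whitespace tokenization are assumptions made for this sketch and are not part of any particular prior algorithm.

```python
from typing import List, Tuple


def window_around_anchor(
    words: List[str], anchor_start: int, anchor_end: int, window: int = 25
) -> Tuple[List[str], List[str]]:
    """Return up to `window` words before and after the anchor text.

    `anchor_start` and `anchor_end` are hypothetical word indices marking the
    anchor text as the half-open range [anchor_start, anchor_end).
    """
    before = words[max(0, anchor_start - window):anchor_start]
    after = words[anchor_end:anchor_end + window]
    return before, after


sentence = "Today, I visited the White House with my mother."
words = sentence.split()
# "White House" occupies word positions 4 and 5 of the tokenized sentence.
before, after = window_around_anchor(words, 4, 6)
print(" ".join(before + after))  # prints: Today, I visited the with my mother.
```

The extracted words describe the author's visit rather than the referenced web page, which is the difficulty noted above.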
SUMMARY
[0005] A method and system for identifying explanatory text for a
referenced web page based on a reference to the referenced web page
contained in a repeated pattern of a referencing web page is
provided. An anchor explanatory text ("AET") system uses the
hierarchical organization of the web page to identify a repeated
pattern of hierarchical elements that contain references to other
web pages. After the AET system identifies a repeated pattern, it
identifies the dominant reference or anchor within each occurrence
of the pattern. The AET system uses the explanatory text associated
with (e.g., surrounding) a dominant anchor as a description of the
referenced web page. If an occurrence has only one anchor, then
that anchor is the dominant anchor. If, however, an occurrence has
multiple anchors, then the AET system attempts to identify which of
the multiple anchors is the dominant anchor. If the AET system
cannot identify a dominant anchor within an occurrence, then the
AET system may consider the text surrounding the anchors as a
description of the referenced web page that cannot be verified as
accurate.
[0006] This Summary is provided to introduce a selection of
concepts in a simplified form that are further described below in
the Detailed Description. This Summary is not intended to identify
key features or essential features of the claimed subject matter,
nor is it intended to be used as an aid in determining the scope of
the claimed subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] FIG. 1 illustrates a search result with a repeated
pattern.
[0008] FIG. 2 illustrates a list of Federal Executive Boards home
pages as a repeated pattern.
[0009] FIG. 3 is a diagram that illustrates a tag tree
representation of the web page that contains the list of FEB home
pages.
[0010] FIG. 4A illustrates subtrees of the tag tree that should be
similar AET nodes.
[0011] FIG. 4B illustrates subtrees of a tag tree that should not
be similar AET nodes.
[0012] FIG. 5 illustrates the condition that the second criterion
is designed to identify.
[0013] FIG. 6 is a block diagram that illustrates components of the
AET system in one embodiment.
[0014] FIG. 7 is a flow diagram that illustrates the high-level
processing of an extract anchor explanatory text component of the
AET system in one embodiment.
[0015] FIG. 8 is a flow diagram that illustrates the more detailed
processing of an extract AET component of the AET system in one
embodiment.
[0016] FIG. 9 is a flow diagram that illustrates the processing of
the traverse tag tree component in one embodiment.
[0017] FIG. 10 is a flow diagram illustrating the processing of the
MAR component of the AET system in one embodiment.
[0018] FIG. 11 is a flow diagram that illustrates the processing of
the combcomp component of the AET system in one embodiment.
[0019] FIG. 12 is a flow diagram that illustrates the processing of
the find ARs component of the AET system in one embodiment.
[0020] FIG. 13 is a flow diagram that illustrates the processing of
the identify ARs component of the AET system in one embodiment.
[0021] FIG. 14 is a flow diagram that illustrates the processing of
the uncover ARs component of the AET system in one embodiment.
[0022] FIG. 15 is a flow diagram that illustrates the processing of
the ID DA component of the AET system in one embodiment.
[0023] FIG. 16 is a flow diagram that illustrates the processing of
the DA identify1 component of the AET system in one embodiment.
[0024] FIG. 17 is a flow diagram that illustrates the processing of
the DA identify2 component of the AET system in one embodiment.
DETAILED DESCRIPTION
[0025] A method and system for identifying explanatory text for a
referenced display page based on a reference to the referenced
display page contained in a repeated pattern of a referencing
display page is provided. In one embodiment, an anchor explanatory
text ("AET") system uses the hierarchical organization of a web
page to identify a repeated pattern of hierarchical elements that
contain references to other web pages. For example, a web page that
contains a list of cameras may have a list element for each camera
that each contains the same sub-elements (e.g., make, model,
description, rating, price, and URL to a detailed page). Each list
element is an occurrence of a repeated pattern. The AET system may
use a mining data records ("MDR") based algorithm to identify a
repeated pattern. After the AET system identifies a repeated
pattern, it identifies the dominant reference or anchor within each
occurrence of the pattern. The AET system uses the explanatory text
associated with (e.g., surrounding) a dominant anchor as a
description of the referenced web page. If an occurrence has only
one anchor, then that anchor is the dominant anchor. If, however,
an occurrence has multiple anchors, then the AET system attempts to
identify which of the multiple anchors is the dominant anchor. If
the AET system cannot identify a dominant anchor, then the AET
system may consider the text surrounding the anchors as a
description of the referenced web page that cannot be verified as
accurate. The explanatory text identified by the AET system may be
used by various applications such as for web page summarization,
focused crawling, query refinement, and language translation. By
relying on anchors within repeated patterns, the AET system
extracts anchor explanatory text that, in general, provides a
description of a referenced web page that is less likely to be
inaccurate than previous techniques that do not rely on repeated
patterns.
[0026] FIGS. 1 and 2 contain examples of repeated patterns of web
pages. FIG. 1 illustrates a search result with a repeated pattern.
The search result 100 includes entries 101-104. Each entry includes
a reference (e.g., a hyperlink) to a web page identified as
matching the search request. The web page containing the search
result may identify a reference by an anchor tag within the portion
of an HTML document corresponding to the entry. An anchor tag
includes the text displayed as the reference. Each entry also
contains additional text describing the referenced web page and
additional anchors for cached and similar pages. In this example,
the dominant anchor of each element is the anchor that references
the web page that matches the search request. The cached and
similar pages are anchors, but are not dominant anchors. FIG. 2
illustrates a list of Federal Executive Boards ("FEB") home pages
as a repeated pattern. The list 200 contains an entry 201 for each
home page of the FEB. Each entry contains a reference to the home
page with surrounding text.
[0027] In one embodiment, the AET system adapts an MDR-based
algorithm to identify repeated patterns within web pages. An
MDR-based algorithm is described in Liu, B., Grossman, R., and
Zhai, Y., "Mining Data Records in Web Pages," SIGKDD 2003, Aug.
24-27, 2003. The AET system first identifies AET nodes (also
referred to more generally as reference explanatory text nodes)
within an HTML tag tree of a web page; AET nodes generally correspond
to MDR generalized nodes. An AET node, like a generalized node, is
a collection of tag tree nodes (or simply nodes) that are adjacent,
sibling nodes. (A tag tree is a hierarchical structure that
represents the tags of an HTML document as nodes.) An AET node,
however, has the additional requirement that at least one node in
the collection contain an anchor node with valid surrounding text.
After identifying the AET nodes, the AET system identifies AET
regions (also referred to more generally as reference explanatory
text regions), which generally correspond to MDR data regions. Just
as an MDR data region is a collection of generalized nodes, an AET
region is a collection of AET nodes that are adjacent siblings
(i.e., have the same parent node) and that are similar. The
AET system may consider AET nodes to be similar when they have the
same length and have an edit distance within a threshold. The
length of an AET node is the number of sibling nodes that it
contains. The edit distance represents the number of changes needed
to transform the hierarchical structure of the nodes within one of
the AET nodes into the hierarchical structure of the nodes within
the other AET node. The hierarchical structure of a node may be
represented by a tag string corresponding to the tags visited in a
depth-first traversal of the subtree with its root at the node. In
addition, the AET system may use a variable threshold that varies
based on characteristics of the tags within the AET nodes.
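By way of illustration only, the tag string and edit distance described above can be sketched as follows. The Node class and function names are hypothetical, the edit distance shown is a standard Levenshtein distance over tags, and normalizing by the longer tag string is an assumption of this sketch rather than something the description specifies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tag tree node: a tag name plus child nodes."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tag_string(node: Node) -> List[str]:
    """Tags visited in a depth-first traversal of the subtree rooted at node."""
    tags = [node.tag]
    for child in node.children:
        tags.extend(tag_string(child))
    return tags


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance over tags (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, tag_a in enumerate(a, start=1):
        curr = [i]
        for j, tag_b in enumerate(b, start=1):
            cost = 0 if tag_a == tag_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: List[str], b: List[str]) -> float:
    """Assumed normalization: divide by the length of the longer tag string."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Two list items: the second wraps its anchor text in a bold tag.
item1 = Node("LI", [Node("B", [Node("TEXT")]), Node("A", [Node("TEXT")])])
item2 = Node("LI", [Node("B", [Node("TEXT")]),
                    Node("A", [Node("B", [Node("TEXT")])])])
print(tag_string(item1))  # ['LI', 'B', 'TEXT', 'A', 'TEXT']
print(edit_distance(tag_string(item1), tag_string(item2)))  # 1
```

For an AET node that contains several sibling nodes, the tag strings of the siblings could be concatenated before comparison; that choice is likewise an assumption.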
[0028] FIG. 3 is a diagram that illustrates a tag tree
representation of the web page that contains the list of FEB home
pages. The tag tree contains nodes corresponding to the tags of the
HTML document representing a web page. The tag tree includes a root
HTML tag 301 with a child head tag 302 and a child body tag 303.
The body tag includes a child bold tag 304, a child break line tag
305, and a child unordered list tag 306. The bold tag includes a
child text tag 307. The unordered list tag includes a child list
item tag 311, 321 for each home page in the list. Each list item
tag contains a child bold tag 312, 322 and a child anchor tag 314,
324. Each bold tag 312, 322 includes a child text tag 313, 323, and
each anchor tag 314, 324 includes a child text tag 315, 325. List
item tags 311, 321 correspond to AET nodes 310, 320. In this
example, each AET node includes only one child tag of the parent
unordered list tag 306. In a more general case, an AET node
includes multiple child tags of the parent tag. The AET region 330
includes AET nodes 310, 320.
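By way of illustration only, the grouping of adjacent, sibling AET nodes into an AET region can be sketched as follows for the simple case, as in FIG. 3, where each AET node is a single child of the parent tag. The function name, the comparison of each node only with its immediate predecessor, and the requirement that a region contain at least two nodes are assumptions of this sketch; the check that a node contains an anchor with valid surrounding text is omitted.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def group_regions(
    siblings: Sequence[T], similar: Callable[[T, T], bool]
) -> List[List[T]]:
    """Group runs of adjacent, similar sibling nodes into candidate regions."""
    regions: List[List[T]] = []
    current: List[T] = []
    for node in siblings:
        if current and similar(current[-1], node):
            current.append(node)
        else:
            # A run of one node is not a repeated pattern (assumption).
            if len(current) >= 2:
                regions.append(current)
            current = [node]
    if len(current) >= 2:
        regions.append(current)
    return regions


# Children of a parent tag, each represented here by its tag string.
children = [
    ("LI", "B", "TEXT", "A", "TEXT"),
    ("LI", "B", "TEXT", "A", "TEXT"),
]
regions = group_regions(children, lambda a, b: a == b)
print(len(regions), len(regions[0]))  # 1 2
```

Exact equality of tag strings stands in for the similarity test here; any predicate, such as one based on edit distance, could be passed instead.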
[0029] In one embodiment, the AET system applies a variable or
adaptive threshold for edit distance to determine whether two AET
nodes are similar. If the AET system uses a small fixed threshold,
it may fail to identify some repeated patterns. FIG. 4A illustrates
subtrees of the tag tree that should be similar AET nodes. In this
example, the search result entry 401 is represented by subtree 402.
The anchor tag of subtree 402 contains a separate bold tag and text
tag for each word of the anchor text. The search result entry 403
is represented by subtree 404. The anchor tag of subtree 404
contains only one bold tag and one text tag for the entire anchor
text. Although the edit distance between the anchor tag of subtree
402 and the anchor tag of subtree 404 is large, the subtrees are
similar and thus should be combined into the same AET region as
representing the same repeated pattern. If the AET system uses a
large fixed threshold, it may, however, incorrectly identify some
repeated patterns. FIG. 4B illustrates subtrees of a tag tree that
should not be similar AET nodes. In this example, the search result
entry 411 contains source information that corresponds to the
subtree 412. The identification of AET node 413 and AET node 414 as
being similar would be incorrect even though their edit distance is
relatively small. In such a case, a large fixed threshold would
lead to incorrectly identifying an AET region comprising AET node
413 and AET node 414.
[0030] To help ensure that similar AET nodes are correctly
identified, the AET system uses a variable threshold for similarity
that is based on the number of block nodes within the AET nodes
that are being compared. A block node generally corresponds to a
block-type tag of an HTML document. The block-type tags include the
CENTER, DD, DIV, DL, DT, FORM, LI, OL, P, PRE, TABLE, TBODY, TD,
TR, and UL tags. In one embodiment, the AET system sets the
variable threshold depending on whether (1) neither AET node has a
block node, (2) only one AET node has a block node(s), (3) both AET
nodes have at least two block nodes, and (4) otherwise. The AET
system sets the thresholds for a normalized edit distance to -1,
0.1, 0.5, and 0.3 for (1), (2), (3), and (4), respectively. The AET
system sets the threshold for (1) to -1 because if an AET node
contains no block nodes, then the pattern of a tag string may be
ambiguous. A tag string is a depth-first listing of descendant tags
of a tag. An example of an ambiguous tag string is <TEXT A TEXT
TEXT A TEXT TEXT>, which may contain the pattern of <TEXT A
TEXT> or <A TEXT TEXT>. Although setting the threshold to
-1 will reduce the recall of the algorithm, it will increase
precision of the algorithm.
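By way of illustration only, the selection of the variable threshold from the four cases above can be sketched as follows. The function names are hypothetical, and treating "within a threshold" as less than or equal to the threshold is an assumption of this sketch. The separate requirement that the AET nodes have the same length is not shown.

```python
def similarity_threshold(blocks_a: int, blocks_b: int) -> float:
    """Threshold on normalized edit distance given each AET node's block count."""
    if blocks_a == 0 and blocks_b == 0:
        return -1.0  # case (1): neither AET node has a block node
    if blocks_a == 0 or blocks_b == 0:
        return 0.1   # case (2): only one AET node has a block node
    if blocks_a >= 2 and blocks_b >= 2:
        return 0.5   # case (3): both AET nodes have at least two block nodes
    return 0.3       # case (4): otherwise


def are_similar(normalized_distance: float, blocks_a: int, blocks_b: int) -> bool:
    """Assumed comparison: similar when the distance does not exceed the threshold."""
    return normalized_distance <= similarity_threshold(blocks_a, blocks_b)


print(are_similar(0.0, 0, 0))  # False: a threshold of -1 rejects every pair
print(are_similar(0.4, 2, 3))  # True: 0.4 <= 0.5
print(are_similar(0.4, 1, 1))  # False: 0.4 > 0.3
```

Because a normalized edit distance is never negative, the threshold of -1 in case (1) means that AET nodes without block nodes are never treated as similar, even when their tag strings are identical.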
[0031] In one embodiment, the AET system builds an HTML tag tree
using a conventional algorithm that is augmented to collect
information needed for extracting anchor explanatory text. When
building a tag tree, the AET system collects the additional
information for each node that indicates whether a descendant node
is an anchor tag, whether a descendant node has valid text that
surrounds an anchor tag, the number of block nodes within
descendant nodes, and the tag string. The AET system considers any
combination of alphanumeric characters to be valid text.
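By way of illustration only, the collection of this additional information can be sketched as a bottom-up pass over an already built tag tree. The Node class, its field names, and the collect function are hypothetical. The sketch also assumes that text counts as valid when it contains at least one alphanumeric character, that text inside an anchor does not count as surrounding text, and that a node's own tag is included in its anchor flag and block count.

```python
import re
from dataclasses import dataclass, field
from typing import List

BLOCK_TAGS = {"CENTER", "DD", "DIV", "DL", "DT", "FORM", "LI", "OL",
              "P", "PRE", "TABLE", "TBODY", "TD", "TR", "UL"}
VALID_TEXT = re.compile(r"[A-Za-z0-9]")


@dataclass
class Node:
    """Hypothetical tag tree node with the fields collected for AET extraction."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    has_anchor: bool = False
    has_valid_text: bool = False
    block_count: int = 0
    tag_string: List[str] = field(default_factory=list)


def collect(node: Node, inside_anchor: bool = False) -> None:
    """Fill in the collected fields for node and all of its descendants."""
    node.has_anchor = node.tag == "A"
    node.block_count = 1 if node.tag in BLOCK_TAGS else 0
    node.tag_string = [node.tag]
    node.has_valid_text = (
        node.tag == "TEXT"
        and not inside_anchor
        and bool(VALID_TEXT.search(node.text))
    )
    for child in node.children:
        collect(child, inside_anchor or node.tag == "A")
        # Accumulate the child's information into the parent.
        node.has_anchor = node.has_anchor or child.has_anchor
        node.has_valid_text = node.has_valid_text or child.has_valid_text
        node.block_count += child.block_count
        node.tag_string.extend(child.tag_string)


# A list item shaped like those of FIG. 3; the text values are made up.
item = Node("LI", children=[
    Node("B", children=[Node("TEXT", text="Example Board")]),
    Node("A", children=[Node("TEXT", text="home page")]),
])
collect(item)
print(item.has_anchor, item.has_valid_text, item.block_count)  # True True 1
```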
[0032] In one embodiment, the AET system identifies a dominant
anchor for each AET node. If an AET node has multiple anchors, then
the dominant anchor would be the anchor of the sole node that
contains a block node and that has explanatory text for the anchor.
If the AET node has multiple anchors containing a block node with
explanatory text surrounding the anchor, then the AET system considers
none of those anchors to be dominant anchors. The AET system
identifies dominant anchors by traversing the tag tree subtree of
each node within an AET node in a depth-first manner. When a node
has multiple anchors, the AET system decides whether that node has
a dominant anchor, which is propagated up the subtree for
determining the dominant anchor of its parent node. The AET system
specifies two criteria for identifying dominant anchors. The AET
system repeatedly applies the criteria to pairs of nodes with
dominant anchors to determine whether only one node is left as a
candidate to contain the dominant anchor for the node. If so, then
that node contains the dominant anchor, else there is no dominant
anchor. Each criterion determines whether either of the nodes can
be eliminated as a candidate based on the attributes of the other
node. The AET system starts determining the dominant anchor by
creating a list of the sibling nodes within an AET node and
recursively applying the criteria. The first criterion, which is
applied when both nodes of a pair have dominant anchors, is as
follows:
[0033] If neither node is a block node, eliminate both since
neither is dominant over the other.
[0034] If one node is a block node and the other is not, eliminate
the non-block node since the non-block node is dominated by the
other.
[0035] If both nodes are block nodes and their tag strings are the
same, eliminate both since neither is dominated by the other.
[0036] If both nodes are block nodes, eliminate any node that does
not contain explanatory text since any node without explanatory
text is dominated by a node with explanatory text; if neither has
explanatory text, then neither is dominated by the other.
[0037] The second criterion, which is applied when only one of the
pair of nodes has a dominant anchor, is as follows:
[0038] If both nodes are block nodes with the same tag string and
the node without the dominant anchor has explanatory text,
eliminate the node with the dominant anchor.
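By way of illustration only, one possible reading of the two criteria can be sketched as follows. The Candidate class, its field names, and the function names are hypothetical. The sketch applies the criteria once to every pair of sibling nodes rather than recursively, assumes each candidate has a unique name, and, where both block nodes have different tag strings, eliminates a node only when exactly one of the pair has explanatory text. These choices are assumptions and other readings are possible.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of one sibling node within an AET node."""
    name: str
    is_block: bool
    tag_string: Tuple[str, ...]
    has_explanatory_text: bool
    has_dominant_anchor: bool


def eliminated_by_pair(a: Candidate, b: Candidate) -> Set[str]:
    """Names of the nodes that this pair eliminates as candidates."""
    out: Set[str] = set()
    if a.has_dominant_anchor and b.has_dominant_anchor:
        # First criterion: both nodes have dominant anchors.
        if not a.is_block and not b.is_block:
            out |= {a.name, b.name}
        elif a.is_block != b.is_block:
            out.add(b.name if a.is_block else a.name)
        elif a.tag_string == b.tag_string:
            out |= {a.name, b.name}
        elif a.has_explanatory_text != b.has_explanatory_text:
            out.add(b.name if a.has_explanatory_text else a.name)
    elif a.has_dominant_anchor != b.has_dominant_anchor:
        # Second criterion: only one node has a dominant anchor.
        with_da, without_da = (a, b) if a.has_dominant_anchor else (b, a)
        if (with_da.is_block and without_da.is_block
                and with_da.tag_string == without_da.tag_string
                and without_da.has_explanatory_text):
            out.add(with_da.name)
    return out


def dominant_candidate(nodes: List[Candidate]) -> Optional[Candidate]:
    """Return the sole surviving node with a dominant anchor, if any."""
    eliminated: Set[str] = set()
    for a, b in combinations(nodes, 2):
        eliminated |= eliminated_by_pair(a, b)
    remaining = [n for n in nodes
                 if n.has_dominant_anchor and n.name not in eliminated]
    return remaining[0] if len(remaining) == 1 else None


title = Candidate("title", True, ("DIV", "A", "TEXT", "TEXT"), True, True)
cached = Candidate("cached", False, ("A", "TEXT"), False, True)
print(dominant_candidate([title, cached]).name)  # title
```

In this example the non-block "cached" node is eliminated by the block "title" node, which mirrors the search result of FIG. 1 in which the cached and similar-pages anchors are not dominant.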
[0039] FIG. 5 illustrates the condition that the second criterion
is designed to identify. The entry 501 of a web page corresponds to
an AET node with tag tree 502. The subtree 503 contains a dominant
anchor, and the subtree 504 contains no dominant anchor. These
nodes may be grouped into the same AET node because of their
similarity. However, since the subtree 504 contains explanatory
text, all the text that surrounds the anchor in the entry is not
directly related to the entry. As a result, the AET system
eliminates the dominant anchor of subtree 503 as being a candidate
for the dominant anchor of the AET node.
[0040] FIG. 6 is a block diagram that illustrates components of the
AET system in one embodiment. The AET system 630 is connected via
communications link 620 to web sites 610 and user devices 615. The
AET system includes an extract AET component 631, a traverse tag
tree component 632, a mine anchor records ("MAR") component 633, a
find anchor regions ("ARs") component 634, an ID dominant anchor
("DA") component 635, a combination compare ("combcomp") component
636, an identify ARs component 637, a DA identify1 component 638,
and a DA identify2 component 639. The extract AET component
generates a tag tree and invokes the traverse tag tree component to
collect the information needed for anchor explanatory text
extraction. The extract AET component also invokes the MAR
component to determine the similarity between various combinations
of adjacent, sibling nodes (i.e., potential AET nodes), invokes the
find ARs component to find AET regions, and invokes the ID DA
component to identify dominant anchors for the AET nodes within the
AET regions. These components in turn invoke helper components that
include the combcomp component, the identify ARs component, the DA
identify1 component, and the DA identify2 component. The system may
extract explanatory text from the web pages identified in a web
page store 640. The AET system may be used in conjunction with
various applications such as a summarize application 651, a crawler
application 652, a refine query application 653, and a translation
application 654. The summarize application may generate a summary
of a web page based on explanatory text associated with dominant
anchors that reference the web page. The crawler application may
use the explanatory text in prioritizing unvisited URLs. The refine
query application may use the explanatory text to automatically
refine a query. The translation application may use the anchor
explanatory text to incrementally discover knowledge for extracting
multilingual translations of query terms.
[0041] The computing devices on which the AET system may be
implemented may include a central processing unit, memory, input
devices (e.g., keyboard and pointing devices), output devices
(e.g., display devices), and storage devices (e.g., disk drives).
The memory and storage devices are computer-readable media that may
contain instructions that implement the AET system. In addition,
the data structures and message structures may be stored or
transmitted via a data transmission medium, such as a signal on a
communications link. Various communications links may be used, such
as the Internet, a local area network, a wide area network, or a
point-to-point dial-up connection.
[0042] The AET system may be implemented on various computing
systems or devices including personal computers, server computers,
hand-held or laptop devices, multiprocessor systems,
microprocessor-based systems, network PCs, minicomputers, mainframe
computers, distributed computing environments that include any of
the above systems or devices, and the like. The AET system may also
provide its services to various computing systems such as personal
computers, cell phones, personal digital assistants, consumer
electronics, home automation devices, and so on.
[0043] The AET system may be described in the general context of
computer-executable instructions, such as program modules, executed
by one or more computers or other devices. Generally, program
modules include routines, programs, objects, components, data
structures, and so on that perform particular tasks or implement
particular abstract data types. Typically, the functionality of the
program modules may be combined or distributed as desired in
various embodiments.
[0044] FIG. 7 is a flow diagram that illustrates the high-level
processing of an extract anchor explanatory text component of the
AET system in one embodiment. In block 701, the
component finds the repeated patterns within a web page. In blocks
702-706, the component loops extracting explanatory text associated
with dominant anchors. In block 702, the component selects the next
repeated pattern. In decision block 703, if all the repeated
patterns have already been selected, then the component completes,
else the component continues at block 704. In block 704, the
component identifies the dominant anchor, if any, of the selected
repeated pattern. In block 705, the component extracts text
surrounding the dominant anchor. In block 706, the component
associates the extracted explanatory text with the referenced web
page. The component then loops to block 702 to select the next
repeated pattern.
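By way of illustration only, the loop of blocks 701-706 can be sketched as follows. The function name, the representation of an anchor as a (URL, anchor text) pair, and the three helper callables passed in by the caller are all hypothetical; the sketch shows only the control flow and does not implement pattern mining or dominant anchor identification.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical representation: an anchor is a (url, anchor_text) pair.
Anchor = Tuple[str, str]


def extract_aet(
    page: object,
    find_repeated_patterns: Callable[[object], Iterable[object]],
    identify_dominant_anchor: Callable[[object], Optional[Anchor]],
    extract_surrounding_text: Callable[[object, Anchor], str],
) -> Dict[str, List[str]]:
    """Map each referenced URL to the explanatory text extracted for it."""
    explanatory: Dict[str, List[str]] = {}
    for pattern in find_repeated_patterns(page):           # blocks 701-703
        anchor = identify_dominant_anchor(pattern)         # block 704
        if anchor is None:
            continue  # no dominant anchor, so nothing is extracted
        text = extract_surrounding_text(pattern, anchor)   # block 705
        url, _ = anchor
        explanatory.setdefault(url, []).append(text)       # block 706
    return explanatory


# Stub helpers over made-up data, used only to exercise the control flow.
page = [
    {"anchor": ("http://a.example/", "A"), "text": "description of A"},
    {"anchor": None, "text": "no dominant anchor here"},
]
result = extract_aet(
    page,
    find_repeated_patterns=lambda p: p,
    identify_dominant_anchor=lambda occ: occ["anchor"],
    extract_surrounding_text=lambda occ, anchor: occ["text"],
)
print(result)  # {'http://a.example/': ['description of A']}
```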
[0045] FIG. 8 is a flow diagram that illustrates the more detailed
processing of an extract AET component of the AET system in one
embodiment. In block 801, the component builds the tag tree for a
web page. In block 802, the component invokes the traverse tag tree
component passing the root of the tag tree to collect the data
needed for extracting anchor explanatory text. In block 803, the
component invokes the MAR component to calculate the similarity
between sequences of nodes that may form an AET node. In block 804,
the component invokes the find ARs component to identify the AET
regions. In block 805, the component invokes the ID DA component
for each AET node of an AET region to identify the dominant anchors
and extract the surrounding explanatory text. The component then
completes.
[0046] FIG. 9 is a flow diagram that illustrates the processing of
the traverse tag tree component in one embodiment. The component
recursively invokes itself to traverse the tag tree with its root
at the passed node in a depth-first manner. In blocks 901-905, the
component initializes the information to be collected for the node.
In block 901, the component initializes the tag string for the
node. In decision block 902, if the node is an anchor node, then
the component initializes an anchor flag for the node in block 903,
else the component continues at block 904. In decision block 904,
if the node is a block node, then the component initializes the
block count of the node in block 905, else the component continues
at block 906. In decision block 906, if the node is a leaf node,
then the component returns, else the component continues at block
907. In blocks 907-913, the component loops recursively invoking
itself for each child node of the passed node. In block 907, the
component selects the next child node. In decision block 908, if
all the child nodes have already been selected, then the component
continues at block 914, else the component continues at block 909.
In block 909, the component recursively invokes the traverse tag
tree component passing the child node. In block 910, the component
accumulates the anchor flag of the child node into the passed node.
In block 911, the component accumulates the block count of the
child node into the passed node. In block 912, the component
accumulates the tag string of the child node into the passed node.
In block 913, the component accumulates the surrounding text
information for the child node into the surrounding text
information for the passed node. The component then loops to block
907 to select the next child node. In decision block 914, if the
passed node has an anchor surrounded by text, then the component
sets the surrounding text indicator for the passed node in block
915. The component then completes.
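A minimal sketch of the depth-first collection of FIG. 9 is given
below. The Node class, the BLOCK_TAGS set, and the rule that text
inside an anchor does not count as surrounding text are assumptions
made for illustration only; the description does not fix these
details.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: which tags count as block nodes is not specified above.
BLOCK_TAGS = {"div", "li", "p", "table", "td", "tr", "ul"}


@dataclass
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    # Information collected by traverse():
    tag_string: str = ""
    has_anchor: bool = False
    block_count: int = 0
    outside_text: bool = False      # text in the subtree not inside an anchor
    anchor_with_text: bool = False  # the "surrounding text indicator"


def traverse(node: Node) -> None:
    """Collect tag string, anchor flag, block count, and text information."""
    is_anchor = node.tag == "a"
    node.tag_string = node.tag                               # block 901
    node.has_anchor = is_anchor                              # blocks 902-903
    node.block_count = 1 if node.tag in BLOCK_TAGS else 0    # blocks 904-905
    node.outside_text = (not is_anchor) and bool(node.text.strip())
    for child in node.children:                              # blocks 907-913
        traverse(child)                                      # block 909
        node.has_anchor = node.has_anchor or child.has_anchor    # block 910
        node.block_count += child.block_count                    # block 911
        node.tag_string += " " + child.tag_string                # block 912
        if not is_anchor:                                        # block 913
            node.outside_text = node.outside_text or child.outside_text
    # Blocks 914-915: an anchor accompanied by text outside the anchor.
    node.anchor_with_text = node.has_anchor and node.outside_text
```

For a leaf node the loop body never runs, so a bare anchor leaf ends
with its surrounding text indicator unset.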
[0047] FIG. 10 is a flow diagram illustrating the processing of the
MAR component of the AET system in one embodiment. The component
determines the similarity between sequences of nodes. In decision
block 1001, if the depth of the tree from the passed node is
greater than or equal to three, then the component continues at
block 1002, else the component returns. In block 1002, the
component invokes the combcomp component to calculate the
similarity between various combinations of child nodes for possible
identification as AET nodes. In blocks 1003-1005, the component
loops recursively invoking the MAR component for each child node.
In block 1003, the component selects the next child node of the
passed node. In decision block 1004, if all the child nodes have
already been selected, then the component returns, else the
component continues at block 1005. In block 1005, the component
recursively invokes the MAR component passing the child node and
then loops to block 1003 to select the next child node.
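The recursion of FIG. 10 might be sketched as follows. The Node
class and the compare_children callback (standing in for the
combcomp component) are hypothetical, and counting a leaf node as
depth one is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def depth(node: Node) -> int:
    """Number of levels in the subtree rooted at node (a leaf has depth 1)."""
    return 1 + max((depth(child) for child in node.children), default=0)


def mine_regions(
    node: Node,
    compare_children: Callable[[Node], None],
    min_depth: int = 3,
) -> None:
    """Invoke compare_children on every node whose subtree is deep enough."""
    if depth(node) < min_depth:          # decision block 1001
        return
    compare_children(node)               # block 1002
    for child in node.children:          # blocks 1003-1005
        mine_regions(child, compare_children, min_depth)
```

Recomputing the depth at every node keeps the sketch short; a
practical version would likely cache it during the traversal of
FIG. 9.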
[0048] FIG. 11 is a flow diagram that illustrates the processing of
the combcomp component of the AET system in one embodiment. The
component loops selecting collections of adjacent nodes of the
passed list of nodes and calculating their similarity as potential
AET nodes. In block 1101, the component increments a variable i,
starting at the first node, that indicates the node in the node
list at which the next possible AET node to have its similarity
calculated starts. In decision block 1102, if the variable i is less
than or equal to the maximum number of nodes in a combination, then
the component continues at block 1103, else the component returns.
In blocks 1103-1111, the component loops calculating the similarity
of possible AET nodes of different lengths, starting with a length
equal to the current start. During the first iteration with
variable i equal to 1, the component calculates the similarity for
AET nodes starting at the first node for AET nodes of length 1 to
the maximum length of an AET node. During the second iteration with
variable i equal to 2, the component only needs to calculate the
similarity for AET nodes of length 2 to the maximum length, since
the first iteration calculated the similarity for all possible AET
nodes of length 1 and similarly for subsequent iterations. In block
1103, the component sets the length j of the AET nodes for the next
iteration starting at the variable i. In decision block 1104, if
the length is less than or equal to the maximum length, then the
component continues at block 1105, else the component loops to
block 1101 to select the next start node. In decision block 1105,
if there are at least two full possible AET nodes to compare at the
current length, then the component continues at block 1106, else
the component loops to block 1103 to select the next length, which
will also be too long. In blocks 1106-1111, the component loops
calculating the similarity between successive pairs of possible AET
nodes. In block 1106, the component initializes the start node of
the first AET node of the pair. In block 1107, the component
increments the variable k to point to the start of the second AET
node of the pair. In decision block 1108, if the variable k is less
than the number of nodes in the node list, then the component
continues at block 1109, else the component loops to block 1103 to
select the next length. In decision block 1109, if there are enough
nodes in the node list to fill out the second AET node, then the
component continues at block 1110, else the component loops to
block 1107 to select the next second AET node of a pair, which will
be past the end of the list. In block 1110, the component
calculates the edit distance between the first and the second AET
nodes of the pair. In block 1111, the component sets the start of
the first AET node for the second iteration to the start of the
current second AET node and then loops to block 1107.
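The nested loops of FIG. 11 can be illustrated with the sketch
below. It assumes that each child node is represented by its tag
string, that the edit distance is computed over tag tokens, and
that the distance is normalized by the longer token sequence; none
of these choices is stated in the description. The function names
and the (length, start) dictionary key are likewise hypothetical,
and indices are zero-based.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def comb_comp(tag_strings: List[str], max_len: int) -> Dict[Tuple[int, int], float]:
    """Compare adjacent groups of child nodes.

    Returns {(length, start): distance}, where the group of `length` nodes
    starting at index `start` is compared with the group that follows it.
    """
    n = len(tag_strings)
    distances: Dict[Tuple[int, int], float] = {}
    for start in range(min(max_len, n)):              # blocks 1101-1102
        for length in range(start + 1, max_len + 1):  # blocks 1103-1104
            if start + 2 * length > n:                # block 1105: need two groups
                break                                 # longer lengths also fail
            first = start                             # block 1106
            second = first + length                   # block 1107
            while second + length <= n:               # blocks 1108-1109
                a = " ".join(tag_strings[first:first + length]).split()
                b = " ".join(tag_strings[second:second + length]).split()
                # Block 1110, with an assumed normalization to [0, 1].
                distances[(length, first)] = edit_distance(a, b) / max(len(a), len(b), 1)
                first = second                        # block 1111
                second = first + length
    return distances
```

Starting the length at start + 1 reflects the observation above that
shorter lengths were already covered by earlier start positions.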
[0049] FIG. 12 is a flow diagram that illustrates the processing of
the find ARs component of the AET system in one embodiment. The
component traverses the tag tree in a depth-first manner
identifying possible AET regions on the way down and determining
whether a parent AET region covers a child AET region on the way
up. The component discards the covered AET regions. In decision
block 1201, if the tree depth from the passed node is greater than
or equal to three, then the component continues at block 1202, else
the component returns. In block 1202, the component invokes the
identify ARs component to identify potential AET regions within the
child nodes of the passed node. In block 1203, the component
initializes a list of possible AET regions. In blocks 1204-1208,
the component loops recursively invoking the find ARs component for
each child node. In block 1204, the component selects the next
child node. In decision block 1205, if all the child nodes have
already been selected, then the component continues at block 1209,
else the component continues at block 1206. In block 1206, the
component recursively invokes the find ARs component passing the
selected child node. In block 1207, the component invokes the
uncover ARs component to identify any uncovered AET regions of the
selected child node. In block 1208, the component accumulates the
uncovered AET regions and then loops to block 1204 to select the
next child node. In block 1209, the component accumulates the
uncovered AET regions of the child nodes into the AET regions of
the passed node and then returns.
[0050] FIG. 13 is a flow diagram that illustrates the processing of
the identify ARs component of the AET system in one embodiment. The
component is recursively invoked to identify AET regions that cover
the maximum number of AET nodes. The variable maxAR indicates the
number of nodes in a combination, the location of the start child
in a node of the AET region, and the number of nodes involved in or
covered by the AET region. The variable curDR represents the
current data region being considered. The component is passed a
start location and a node. In block 1301, the component initializes
the variable maxAR. In block 1302, the component increments the
variable i to indicate the next length of an AET node, starting at
1. In decision block 1303, if the length is less than or equal to
the maximum length, then the component continues at block 1304,
else the component continues at block 1316. In block 1304, the
component increments the variable f, starting with the passed start
value. In decision block 1305, if the variable f is less than or
equal to the variable i, then the component continues at block
1306, else the component loops to block 1302 to select the next
length. In block 1306, the component sets a flag to true. In block
1307, the component increments the variable j by the variable i,
starting with the variable f. In decision block 1308, if the
variable j is less than the number of child nodes, then the
component continues at block 1309, else the component continues at
block 1314. In decision block 1309, if the edit distance for
length i at the jth child node is less than the threshold
variable, then the component continues at block 1310,
else the component continues at block 1313. In decision block 1310, if the
flag is true, then the component continues at block 1311, else the
component continues at block 1312. In block 1311, the component
starts an AET region and sets the flag to false and then loops to
block 1307 to select the next AET node. In block 1312, the
component continues the current AET region and loops to block 1307
to select the next AET node. In decision block 1313, if no AET region has
been started, then the component loops to block 1307 to select the
next AET node, else the component continues at block 1314. In block
1314, the component determines whether the current AET region
should replace the maxAR including whether the AET node contains an
anchor tag with surrounding text. If so, the component continues at
block 1315 to replace the variable maxAR. The component loops to
block 1304 to select the next variable f. In decision block 1316,
the component determines whether to return an indication of no AET
regions. If the component does not return, the component
recursively invokes the identify ARs component in block 1317 and
then returns the accumulation of the maxAR and the ARs identified
by the recursive invocation.
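A simplified sketch of the region search of FIG. 13 follows. It
assumes the pairwise distances are available in a dictionary keyed
by (length, start) with zero-based indices, and it replaces maxAR
only when a region covers strictly more nodes. The additional test
that an AET node contains an anchor tag with surrounding text
(block 1314) is omitted for brevity. These simplifications are
assumptions, not details taken from the description.

```python
from typing import Dict, List, Optional, Tuple

# (nodes per AET node, index of first child, number of children covered)
Region = Tuple[int, int, int]


def identify_regions(
    start: int,
    n_children: int,
    distances: Dict[Tuple[int, int], float],
    max_len: int,
    threshold: float,
) -> List[Region]:
    """Find the widest region at or after `start`, then search after it."""
    best: Optional[Region] = None                        # block 1301 (maxAR)
    for length in range(1, max_len + 1):                 # blocks 1302-1303
        for offset in range(start, start + length):      # blocks 1304-1305
            current: Optional[List[int]] = None          # block 1306 (flag)
            for j in range(offset, n_children, length):  # blocks 1307-1308
                d = distances.get((length, j))
                if d is not None and d < threshold:      # block 1309
                    if current is None:                  # blocks 1310-1311
                        current = [length, j, 2 * length]
                    else:                                # block 1312
                        current[2] += length
                elif current is not None:                # block 1313
                    break
            # Blocks 1314-1315 (simplified replacement rule).
            if current is not None and (best is None or current[2] > best[2]):
                best = (current[0], current[1], current[2])
    if best is None:                                     # block 1316
        return []
    next_start = best[1] + best[2]
    if next_start >= n_children:
        return [best]
    # Block 1317: look for further regions after the one just found.
    return [best] + identify_regions(next_start, n_children, distances, max_len, threshold)
```

Each recursive call starts strictly after the region just found, so
the recursion terminates once the child list is exhausted.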
[0051] FIG. 14 is a flow diagram that illustrates the processing of
the uncover ARs component of the AET system in one embodiment. The
component is passed a node and one of its child nodes. In block
1401, the component initializes a variable to track the uncovered
AET regions. In block 1402, the component selects the next AET
region of the child node. In decision block 1403, if all the AET
regions have already been selected, then the component returns the
uncovered AET regions, else the component continues at block 1404.
In decision block 1404, if the selected AET region is covered by an
AET region of the parent node, then the component loops to block
1402 to select the next AET region of the child node, else the
component continues at block 1405. In block 1405, the component
adds the selected AET region to the list of uncovered AET regions
and then loops to block 1402.
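The filtering of FIG. 14 might look like the sketch below. It
assumes that a region is a (length, first child index, children
covered) tuple and that a child's region counts as covered when the
child itself lies inside one of the parent's regions; the
description does not define coverage more precisely, so this test
is an assumption.

```python
from typing import List, Tuple

# (nodes per AET node, index of first child, number of children covered)
Region = Tuple[int, int, int]


def uncovered_regions(
    parent_regions: List[Region],
    child_index: int,
    child_regions: List[Region],
) -> List[Region]:
    """Return the child's regions that no parent region covers."""
    uncovered: List[Region] = []                      # block 1401
    for region in child_regions:                      # blocks 1402-1403
        covered = any(                                # decision block 1404
            first <= child_index < first + count
            for _, first, count in parent_regions
        )
        if not covered:
            uncovered.append(region)                  # block 1405
    return uncovered
```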
[0052] FIG. 15 is a flow diagram that illustrates the processing of
the ID DA component of the AET system in one embodiment. The
component is passed a list of nodes and determines whether an
anchor within one of those nodes is a dominant anchor. In decision
block 1501, if the list contains only an anchor node, then the
component returns, else the component continues at block 1502. In
blocks 1502-1505, the component loops recursively invoking the ID
DA component. In block 1502, the component selects the next node of
the node list starting with the first. In decision block 1503, if
not all the nodes in the node list have yet been selected, then the
component continues at block 1504, else the component continues at
block 1506. In block 1504, the component recursively invokes the ID
DA component to identify a dominant anchor for the selected node.
In block 1505, the component indicates whether the selected node
has a dominant anchor and then loops to block 1502 to select the
next node. In blocks 1506-1514, the component loops identifying a
dominant anchor for the passed list of nodes. In block 1506, the
component selects the next node in the list. In decision block
1507, if not all the nodes have been selected, then the component
continues at block 1508, else the component continues at block
1515. In decision block 1508, if the selected node has a candidate
anchor, then the component continues at block 1509, else the
component loops to block 1506 to select the next node. In blocks
1509-1514, the component loops choosing every node to determine
whether one can be eliminated as a candidate dominant anchor based
on comparison to the selected node. In block 1509, the component
chooses the next node of the node list. In decision block 1510, if
not all the nodes have been chosen, then the component continues at
block 1511, else the component loops to block 1506. In decision
block 1511, if the selected node and the chosen node are the same,
then the component loops to block 1509 to choose the next node,
else the component continues at block 1512. In decision block 1512,
if the chosen node is a candidate anchor, then the component
continues at block 1513, else the component continues at block
1514. In block 1513, the component invokes the component to apply
the first criterion for a dominant anchor and then loops to block
1509 to choose the next node. In block 1514, the component invokes
the component to apply the second criterion for a dominant anchor
and then loops to block 1509 to choose the next node. In block
1515, the component initializes a dominant anchor to null. In
decision block 1516, if there is only one candidate anchor, then
the component continues at block 1517, else the component returns
an indication that there is no dominant anchor. In block 1517, the
component sets and returns the dominant anchor.
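The pairwise elimination of blocks 1506-1517 can be sketched as
follows; the recursive descent of blocks 1501-1505 is omitted. The
Item class, its weight field, and the keep_heavier criterion are
hypothetical stand-ins used only to make the example runnable; the
actual criteria are those of FIGS. 16 and 17.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Item:
    name: str
    candidate: bool   # whether the node still has a candidate anchor
    weight: int = 0   # hypothetical field used by the demo criterion


Criterion = Callable[[Item, Item], None]


def identify_dominant_anchor(
    nodes: List[Item], criterion_1: Criterion, criterion_2: Criterion
) -> Optional[Item]:
    """Compare node pairs, then return the sole surviving candidate, if any."""
    for selected in nodes:                    # blocks 1506-1507
        if not selected.candidate:            # decision block 1508
            continue
        for chosen in nodes:                  # blocks 1509-1510
            if chosen is selected:            # decision block 1511
                continue
            if chosen.candidate:              # decision block 1512
                criterion_1(selected, chosen)     # block 1513
            else:
                criterion_2(selected, chosen)     # block 1514
    remaining = [n for n in nodes if n.candidate]
    # Blocks 1515-1517: dominant only if exactly one candidate remains.
    return remaining[0] if len(remaining) == 1 else None


def keep_heavier(a: Item, b: Item) -> None:
    """Hypothetical criterion: eliminate the node with the smaller weight."""
    if a.weight >= b.weight:
        b.candidate = False
    else:
        a.candidate = False
```

The criteria mutate the candidate flags in place, so a node
eliminated early is skipped as a selected node in later iterations.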
[0053] FIG. 16 is a flow diagram that illustrates the processing of
the DA identify1 component of the AET system in one embodiment. The
component applies criterion 1 to determine whether one or both of a
pair of nodes can be eliminated as a candidate for a dominant
anchor. In decision block 1601, if neither node is a block node,
then the component eliminates both nodes as candidate nodes in
block 1602 and then returns, else the component continues at block
1603. In decision blocks 1603 and 1605, if one of the nodes is a
block and the other is not, then the component eliminates the other
node as a candidate in blocks 1604 and 1606 and then returns, else
the component continues at block 1607. In decision block 1607, if
both the nodes are blocks and their tag strings are equal, then the
component eliminates both the nodes as candidates in block 1608 and
then returns, else the component continues at block 1609. In
decision blocks 1609 and 1611, if either or both nodes are block
nodes with no explanatory text, then the component eliminates
either or both nodes in blocks 1610 and 1612, and then returns.
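Criterion 1, read literally from the steps above, can be sketched
as follows. The Candidate class and its field names are
hypothetical, and the sketch is a direct transcription of the
described decisions rather than a verified implementation.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    is_block: bool
    tag_string: str
    has_explanatory_text: bool
    candidate: bool = True


def apply_criterion_1(a: Candidate, b: Candidate) -> None:
    """Eliminate one or both nodes of a pair as dominant-anchor candidates."""
    if not a.is_block and not b.is_block:       # blocks 1601-1602
        a.candidate = b.candidate = False
        return
    if a.is_block != b.is_block:                # blocks 1603-1606
        non_block = b if a.is_block else a
        non_block.candidate = False
        return
    if a.tag_string == b.tag_string:            # blocks 1607-1608
        a.candidate = b.candidate = False
        return
    for node in (a, b):                         # blocks 1609-1612
        if not node.has_explanatory_text:
            node.candidate = False
```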
[0054] FIG. 17 is a flow diagram that illustrates the processing of
the DA identify2 component of the AET system in one embodiment. The
component applies criterion 2 to determine whether one or both of a
pair of nodes can be eliminated as a candidate for a dominant
anchor. In decision block 1701, if both nodes are block nodes, then
the component returns, else the component continues at block 1702.
In decision block 1702, if both nodes have the same tag string, then the
component returns, else the component continues at block 1703. In
decision block 1703, if the node that does not have a dominant
anchor has explanatory text, then the component eliminates the
other node as a candidate in block 1704. The component then
returns.
[0055] Although the subject matter has been described in language
specific to structural features and/or methodological acts, it is
to be understood that the subject matter defined in the appended
claims is not necessarily limited to the specific features or acts
described above. Rather, the specific features and acts described
above are disclosed as example forms of implementing the claims.
Accordingly, the invention is not limited except as by the appended
claims.
* * * * *