U.S. patent application number 13/053156 was filed with the patent office on 2011-07-14 for determining semantically distinct regions of a document.
Invention is credited to Yonatan Zunger.
Application Number | 20110173528 13/053156 |
Family ID | 43741899 |
Filed Date | 2011-07-14 |
United States Patent
Application |
20110173528 |
Kind Code |
A1 |
Zunger; Yonatan |
July 14, 2011 |
Determining Semantically Distinct Regions of a Document
Abstract
A structured document is translated into an initial hierarchical
data structure in accordance with syntactic elements defined in the
structured document. The initial hierarchical data structure
includes a plurality of nodes, and each node corresponds to one of
the syntactic elements. The method then annotates a node with a set
of attributes including geometric parameters of semantic elements
in the structured document that are associated with the node in
accordance with a pseudo-rendering of the structured document.
Finally, the method merges the nodes in the initial hierarchical
data structure into a tree of merged nodes in accordance with their
respective attributes and a set of predefined rules such that each
merged node is associated with a semantically distinct region of
the pseudo-rendered document. The predefined rules include rules
for merging nodes associated with semantic elements that have
nearby positions and/or compatible attributes in the
pseudo-rendered document.
Inventors: |
Zunger; Yonatan; (Mountain
View, CA) |
Family ID: |
43741899 |
Appl. No.: |
13/053156 |
Filed: |
March 21, 2011 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
10947702 | Sep 22, 2004 | 7913163 |
13053156 | | |
Current U.S.
Class: |
715/234 |
Current CPC
Class: |
G06F 40/14 20200101;
G06F 40/16 20200101; G06F 16/951 20190101 |
Class at
Publication: |
715/234 |
International
Class: |
G06F 17/00 20060101
G06F017/00 |
Claims
1. A method of generating a hierarchical data structure for a
document, performed by a computer system having one or more
processors and memory storing one or more programs for execution by
the one or more processors, comprising: performing a
pseudo-rendering for the document in accordance with syntactic
elements defined in the document; generating an initial
hierarchical data structure for the document, the initial
hierarchical data structure including a plurality of nodes
corresponding to the syntactic elements of the document, each node
of the plurality of nodes having an associated set of attributes
derived from the pseudo-rendered document; and converting the
initial hierarchical data structure into a final hierarchical data
structure in accordance with a set of predefined rules, the final
hierarchical data structure having a plurality of chunks, each
chunk corresponding to a semantically distinct region of the
pseudo-rendered document.
2. The method of claim 1, wherein the document is an HTML file and
the syntactic elements are HTML tags in the HTML file.
3. The method of claim 1, wherein the set of attributes of a node
of the initial hierarchical data structure includes geometric
parameters of a semantic element of the document in accordance with
the pseudo-rendering of the document.
4. The method of claim 1, wherein the set of attributes includes
font size and color if the semantic element is a text string.
5. The method of claim 1, wherein the set of attributes includes a
pseudo-title of the semantic element in the document.
6. The method of claim 1, wherein the conversion of the initial
hierarchical data structure into the final hierarchical data
structure includes a plurality of actions selected from the group
consisting of: merging a child node into its respective parent node
in the initial hierarchical data structure if the child node is the
only child of its respective parent node; expanding a child node
into multiple child nodes if the child node is associated with
multiple semantic elements; identifying a group of nodes sharing a
set of compatible attributes, the compatible attributes including
semantic elements associated with the group of nodes having similar
geometric parameters and the semantic elements appearing in the
pseudo-rendered document in a periodic or semi-periodic manner;
annotating each chunk of the final hierarchical data structure with
an initial label in accordance with its associated semantically
distinct region in the pseudo-rendered document; merging sibling
chunks of the final hierarchical data structure into a single chunk
if their initial labels and geometric parameters are compatible;
and assigning each chunk of the final hierarchical data structure a
final label indicative of the function and location of its
associated semantically distinct region in the pseudo-rendered
document.
7. The method of claim 1, including assigning links in a first
semantically distinct region a different weight than links in a
second semantically distinct region in the document and performing
a computation using said assigned weights.
8. The method of claim 1, including assigning text in a first
semantically distinct region a different weight than text in a
second semantically distinct region in the document and performing
a computation using said assigned weights.
9. The method of claim 1, wherein the set of attributes of a node
include a pseudo-title of a semantic element in the pseudo-rendered
document associated with the node.
10. The method of claim 9, wherein the pseudo-title of the semantic
element is selected from text that satisfies predefined criteria
with respect to appearing prominently if the document were rendered
for display.
11. A system for partitioning a structured document, comprising:
one or more processors, a memory coupled to the one or more
processors; and one or more programs, stored in the memory,
configured for execution by the one or more processors, the one or
more programs comprising instructions to: perform a
pseudo-rendering for the document in accordance with syntactic
elements defined in the document; generate an initial hierarchical
data structure for the document, the initial hierarchical data
structure including a plurality of nodes corresponding to the
syntactic elements of the document, each node of the plurality of
nodes having an associated set of attributes derived from the
pseudo-rendered document; and convert the initial hierarchical data
structure into a final hierarchical data structure in accordance
with a set of predefined rules, the final hierarchical data
structure having a plurality of chunks, each chunk corresponding to
a semantically distinct region of the pseudo-rendered document.
12. The system of claim 11, wherein the document is an HTML file
and the syntactic elements are HTML tags in the HTML file.
13. The system of claim 11, wherein the set of attributes of a node
of the initial hierarchical data structure includes geometric
parameters of a semantic element of the document in accordance with
the pseudo-rendering of the document.
14. The system of claim 11, wherein the conversion of the initial
hierarchical data structure into the final hierarchical data
structure includes a plurality of actions selected from the group
consisting of: merging a child node into its respective parent node
in the initial hierarchical data structure if the child node is the
only child of its respective parent node; expanding a child node
into multiple child nodes if the child node is associated with
multiple semantic elements; identifying a group of nodes sharing a
set of compatible attributes, the compatible attributes including
semantic elements associated with the group of nodes having similar
geometric parameters and the semantic elements appearing in the
pseudo-rendered document in a periodic or semi-periodic manner;
annotating each chunk of the final hierarchical data structure with
an initial label in accordance with its associated semantically
distinct region in the pseudo-rendered document; merging sibling
chunks of the final hierarchical data structure into a single chunk
if their initial labels and geometric parameters are compatible;
and assigning each chunk of the final hierarchical data structure a
final label indicative of the function and location of its
associated semantically distinct region in the pseudo-rendered
document.
15. The system of claim 11, the one or more programs further
comprising instructions for assigning links in a first semantically
distinct region a different weight than links in a second
semantically distinct region in the document and performing a
computation using said assigned weights.
16. A non-transitory computer-readable storage medium storing one
or more programs configured for execution by a server system
comprising one or more processors, the one or more programs
comprising instructions for: performing a pseudo-rendering for the
document in accordance with syntactic elements defined in the
document; generating an initial hierarchical data structure for the
document, the initial hierarchical data structure including a
plurality of nodes corresponding to the syntactic elements of the
document, each node of the plurality of nodes having an associated
set of attributes derived from the pseudo-rendered document; and
converting the initial hierarchical data structure into a final
hierarchical data structure in accordance with a set of predefined
rules, the final hierarchical data structure having a plurality of
chunks, each chunk corresponding to a semantically distinct region
of the pseudo-rendered document.
17. The computer-readable storage medium of claim 16, wherein the
document is an HTML file and the syntactic elements are HTML tags
in the HTML file.
18. The computer-readable storage medium of claim 16, wherein the
set of attributes of a node of the initial hierarchical data
structure includes geometric parameters of a semantic element of
the document in accordance with the pseudo-rendering of the
document.
19. The computer-readable storage medium of claim 16, wherein the
conversion of the initial hierarchical data structure into the
final hierarchical data structure includes a plurality of actions
selected from the group consisting of: merging a child node into
its respective parent node in the initial hierarchical data
structure if the child node is the only child of its respective
parent node; expanding a child node into multiple child nodes if
the child node is associated with multiple semantic elements;
identifying a group of nodes sharing a set of compatible
attributes, the compatible attributes including semantic elements
associated with the group of nodes having similar geometric
parameters and the semantic elements appearing in the
pseudo-rendered document in a periodic or semi-periodic manner;
annotating each chunk of the final hierarchical data structure with
an initial label in accordance with its associated semantically
distinct region in the pseudo-rendered document; merging sibling
chunks of the final hierarchical data structure into a single chunk
if their initial labels and geometric parameters are compatible;
and assigning each chunk of the final hierarchical data structure a
final label indicative of the function and location of its
associated semantically distinct region in the pseudo-rendered
document.
20. The computer-readable storage medium of claim 16, the one or
more programs further comprising instructions for assigning links
in a first semantically distinct region a different weight than
links in a second semantically distinct region in the document and
performing a computation using said assigned weights.
Description
RELATED APPLICATIONS
[0001] This application is a divisional application of U.S. patent
application Ser. No. 10/947,702, "Determining Semantically Distinct
Regions of a Document," filed Sep. 22, 2004, which is herein
incorporated by reference in its entirety.
FIELD OF THE INVENTION
[0002] The present invention relates generally to the field of
computational linguistics, and in particular, to a system and
method of determining semantically distinct regions of a
document.
BACKGROUND OF THE INVENTION
[0003] A document displayed on a computer monitor typically
comprises multiple semantically distinct regions, such as a header,
a footer or a sidebar, each region including one or more semantic
elements such as text paragraphs, pictures, advertisement blocks or
navigational links. Each region occupies a unique location on
the computer monitor. For example, the text paragraphs and pictures
are usually the focus of the viewer and are therefore positioned at
the center of the computer monitor, which is the most eye-catching
part of the computer monitor. In contrast, the footer often
contains boilerplate items that are deemed less important from a
viewer's perspective, e.g., legal disclaimer, copyright notice or
timestamp, and is therefore located at the bottom of the
document.
[0004] Even though a semantically distinct region in a document is
easily recognizable on a computer monitor by human eyes, it may be
a difficult task for a computer program to identify its counterpart
in a file that renders the document on the computer monitor. For
example, a webpage displayed in a web browser window is typically
created from a hypertext markup language (HTML) file by the web
browser. The HTML file usually includes multiple syntactic
elements, e.g., <TABLE> and </TABLE>, that instruct the
web browser on how to display different components in the webpage
in a specific manner. But it rarely occurs that, for instance, one
pair of <TABLE> and </TABLE> corresponds to an actual
table in the webpage. More than that, a semantically distinct
region of a document, e.g., a sidebar of navigational links or a
column of advertisement blocks, is often associated with multiple
syntactic elements, but the corresponding HTML file does not group
those elements together nor does it provide any other structure for
identifying the plurality of elements that belong to a semantically
distinct region.
SUMMARY
[0005] In a first embodiment of the present invention, a method for
partitioning a structured document translates the document into an
initial hierarchical data structure in accordance with syntactic
elements defined in the structured document. The initial
hierarchical data structure includes a plurality of nodes, and each
node corresponds to one of the syntactic elements. The method then
annotates a node with a set of attributes including geometric
parameters of semantic elements in the structured document that are
associated with the node in accordance with a pseudo-rendering of
the structured document. Finally, the method merges the nodes in
the initial hierarchical data structure into a tree of merged nodes
in accordance with their respective attributes and a set of
predefined rules such that each merged node is associated with a
semantically distinct region of the pseudo-rendered document. The
predefined rules include rules for merging nodes associated with
semantic elements that have nearby positions and/or compatible
attributes in the pseudo-rendered document.
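The rule for merging nodes with nearby positions and compatible attributes can be illustrated with a minimal sketch. It is not the patented rule set: the rectangle format, the `max_gap` threshold of 10 pixels, and the use of equal `font_size` as the only compatibility test are all assumptions made for illustration.

```python
def gap(a, b):
    """Pixel gap between rectangles a and b, each (x, y, width, height).

    Returns 0 when the rectangles touch or overlap.
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = max(bx - (ax + aw), ax - (bx + bw), 0)
    dy = max(by - (ay + ah), ay - (by + bh), 0)
    return max(dx, dy)


def merge_nearby(nodes, max_gap=10):
    """Greedily group consecutive nodes that are close and share a font size.

    nodes is a list of dicts with a "box" key and an optional "font_size"
    key (hypothetical field names). Returns a list of groups.
    """
    regions = []
    for node in nodes:
        if regions:
            last = regions[-1][-1]
            close = gap(last["box"], node["box"]) <= max_gap
            compatible = last.get("font_size") == node.get("font_size")
            if close and compatible:
                regions[-1].append(node)
                continue
        regions.append([node])
    return regions


link_1 = {"box": (0, 100, 150, 20), "font_size": 12}
link_2 = {"box": (0, 122, 150, 20), "font_size": 12}
ad_block = {"box": (650, 100, 150, 200), "font_size": 12}
regions = merge_nearby([link_1, link_2, ad_block])
```

Here the two links sit 2 pixels apart and end up in one group, while the advertisement block, 500 pixels to the right, starts a group of its own.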
[0006] In a second embodiment of the present invention, a method
for partitioning a document pseudo-renders the document in
accordance with syntactic elements defined in the document and
generates an initial hierarchical data structure for the document.
The initial hierarchical data structure includes a plurality of
nodes corresponding to the syntactic elements of the document, and
each node has an associated set of attributes derived from the
pseudo-rendered document. The method then converts the initial
hierarchical data structure into a final hierarchical data
structure in accordance with a set of predefined rules. The final
hierarchical data structure includes multiple chunks, each chunk
corresponding to a semantically distinct region of the
pseudo-rendered document.
[0007] The pseudo-rendering of a document determines the
approximate position and size of each element of the document,
without necessarily performing a full rendering of the document. A
primary purpose of pseudo-rendering is to determine geometric
information for each element of the document and to associate that
geometric information with the document's elements in a
hierarchical data structure, thereby providing the factual basis
for identifying semantically distinct regions of the document and
for assigning elements of the document to those regions.
[0008] In a third embodiment of the present invention, a method of
partitioning a document into semantically distinct regions first
generates a hierarchical data structure for the document. The
hierarchical data structure includes a plurality of nodes that are
associated with a plurality of syntactic elements of the document,
each node having a set of geometric parameters characterizing one
or more semantic elements in the document. The method then merges
the nodes into one or more semantically distinct regions in
accordance with their respective sets of geometric parameters and a
set of predefined rules, each region including at least one of the
semantic elements in the document.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] FIG. 1 is a block diagram illustrating an exemplary webpage
displayed in a web browser window.
[0010] FIG. 2 depicts a hierarchical data structure, i.e., a chunk
tree, of the exemplary webpage according to one embodiment of the
present invention.
[0011] FIG. 3 is a block diagram illustrating a computer-based
geometry detector in accordance with one embodiment of the present
invention.
[0012] FIG. 4 illustrates exemplary data structures used for
storing attributes associated with a quasi-DOM tree node, a chunk
tree node and a geometric token, respectively, in accordance with
one embodiment of the present invention.
[0013] FIG. 5 is an overview flowchart of major actions by the
geometry detector according to one embodiment of the present
invention.
[0014] FIG. 6 is a flowchart illustrating further details of an
algorithm of converting a quasi-DOM tree into a chunk tree
according to one embodiment of the present invention.
[0015] FIG. 7 is a quasi-DOM tree of the exemplary webpage
according to one embodiment of the present invention.
[0016] FIGS. 8A-8D are an example illustrating the chunk tree at
different stages according to one embodiment of the present
invention.
[0017] FIGS. 9A and 9B are an example illustrating an algorithm of
row and grid analysis according to one embodiment of the present
invention.
[0018] Like reference numerals refer to corresponding parts
throughout the several views of the drawings.
DESCRIPTION OF EMBODIMENTS
Overview
[0019] FIG. 1 is a block diagram illustrating an exemplary webpage
100 displayed in a web browser window. A human being visiting the
webpage 100 can easily divide it into multiple semantically
distinct regions as indicated by the dash-line rectangles. At the
top of the webpage is a header 120 that carries a logo image 125 of
a company. There are three sections sitting side by side right
below the header 120, including a left hand side (LHS) sidebar 130,
an array of image blocks 140 in the middle and a right hand side
(RHS) sidebar 150. A footer 160 at the bottom of the webpage
includes one or more links pointing to other relevant documents,
such as company information, privacy policy, etc.
[0020] The LHS sidebar 130 includes multiple navigational links and
these links are typically on-site links that guide a viewer from
this webpage to other web pages at the same website. For example,
if this webpage is a homepage of a newspaper, different
navigational links may be associated with different topics, e.g.,
politics, sports, market, etc. A viewer can switch from one topic
to another by clicking on an associated navigational link.
[0021] The location containing the image blocks 140 typically
carries the primary content of a webpage since it is at or near the
center of the webpage. In this example, the primary content is a
photo album that includes multiple pictures, each picture
associated with a short sentence (denoted by "Desc_1," "Desc_2,"
. . . in FIG. 1) describing the picture, e.g., where the picture was
taken or who is in the picture.
[0022] In this example, the RHS sidebar 150 includes one or more
advertisement blocks (Ad_Block_1, Ad_Block_2, etc.). The
advertisement blocks are vertically separated. Each block may
contain an image and/or text that conveys commercial information.
For example, an advertisement block may contain a promotional offer
from a sponsor that creates and sells paper copies for digital
images. An advertisement block is typically associated with an
off-site link if the sponsor of the advertisement has its own
website. A viewer who is interested in a particular piece of
commercial information can jump to its sponsor's website by
clicking on the corresponding advertisement block.
[0023] Generally speaking, since a semantically distinct region
occupies a unique location in the webpage 100, it carries a unique
weight from a viewer's perspective. What is of most interest to the
viewer is probably the image blocks 140 located at the center of
the webpage, since they contain the primary content of the webpage
100. By contrast, what is of least interest is probably the footer
160, because information associated with the footer 160 is
predominantly boilerplate terms that are the same across the entire
website.
[0024] FIG. 2 depicts a hierarchical data structure, herein called
a chunk tree 200, of the webpage 100 according to one embodiment of
the present invention. The chunk tree 200 has a root labeled "ROOT"
and the root has five chunks, each chunk corresponding to one
semantically distinct region shown in FIG. 1.
[0025] For example, the chunk 210 that corresponds to the header
120 is labeled "DATA_NODE" since the header 120 includes the image
125 and is also annotated with a set of attributes, e.g., "Header"
and "Image", indicative of its function. In some embodiments, the
set of attributes further includes a group of geometric parameters
indicative of the location of the image 125 within the webpage 100,
e.g., the coordinate of the top-left corner of the image in pixels,
the width and height of the image in pixels. The chunk 230 has
attributes like "LHS SideBar" and "On-Site Links" indicative of its
function and location. The chunk 240 is labeled "MISC_NODE" and has
an attribute "Grid Root" because it is further split into eight
child chunks, each child chunk corresponding to one picture and its
associated description in the image blocks 140. Note that a child
chunk below the chunk 240 is labeled "DATA_NODE" and has a unique
attribute "Grid Element" indicating that it is a member of a grid
associated with the chunk 240. Similarly, the chunk 250 is
annotated with attributes like "RHS SideBar" and "Off-Site Links"
because the RHS sidebar 150 includes advertisement blocks pointing
to other websites, and the chunk 260 has attributes like "Footer"
and "On/Off-Site Links".
[0026] If a chunk tree of a webpage as shown in FIG. 2 is available
to a search engine, the search engine can more accurately determine
the relevance between the webpage and a search query term based
upon which semantically distinct region contains the query term.
For example, if the query term is contained in the image blocks
140, the webpage 100 should generally be granted a priority higher
than another webpage in which the query term is found in one of the
advertisement blocks.
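As a rough illustration of region-aware relevance, the sketch below weights each occurrence of a query term by the region in which it appears. The weight values, the region names, and the `score_page` helper are hypothetical; the text above only states that a match in the primary content should generally outrank a match in an advertisement block.

```python
# Hypothetical per-region weights; no numeric values are given in the text.
REGION_WEIGHTS = {
    "primary_content": 1.0,
    "header": 0.6,
    "lhs_sidebar": 0.3,
    "rhs_sidebar": 0.1,
    "footer": 0.05,
}


def score_page(query_term, chunks, weights=REGION_WEIGHTS):
    """Sum region weights over every occurrence of query_term.

    chunks is a list of (chunk_type, text) pairs, one pair per
    semantically distinct region of the page.
    """
    term = query_term.lower()
    score = 0.0
    for chunk_type, text in chunks:
        occurrences = text.lower().split().count(term)
        score += weights.get(chunk_type, 0.0) * occurrences
    return score


page_a = [("primary_content", "sunset over the harbor"),
          ("footer", "copyright notice")]
page_b = [("primary_content", "family picnic"),
          ("rhs_sidebar", "sunset prints on sale")]
```

With these assumed weights, `score_page("sunset", page_a)` exceeds `score_page("sunset", page_b)` because the term appears in the primary content of the first page but only in an advertisement sidebar of the second.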
[0027] However, what is normally available to a search engine is
actually a structured or semi-structured document such as an HTML
file which does not have a chunk tree embedded therein. The
document would need to be subsequently interpreted line by line by
a web browser in order to have a 2-D geometric structure as shown
in FIG. 1. Thus, search engines normally cannot distinguish between
query terms found in important sections of a page and query terms
found in less important sections.
[0028] As described below, one embodiment of the present invention
is a method for generating a hierarchical data structure like a
chunk tree from an HTML file by performing a pseudo-rendering of
the HTML file. The resulting hierarchical data structure can be
used by a search engine to improve search results, for example by
taking into account the document location of a query term or
link, creating a semantically meaningful title for an image, or by
constructing more accurate snippets for search results.
[0029] Pseudo-rendering a document determines the approximate
position and size of each element of the document, without
necessarily performing a full rendering of the document. A primary
purpose of pseudo-rendering is to determine geometric information
for each element of the document and to associate that geometric
information with the document's elements in a hierarchical data
structure, thereby providing the factual basis for identifying
semantically distinct regions of the document and for assigning
elements of the document to those regions. In some embodiments, the
pseudo-rendering is performed by a pseudo-browser program using a
simplified one pass rendering method, thereby achieving
pseudo-rendering with minimal computational resources at the
possible expense of accuracy. In some embodiments, the geometric
information produced by the pseudo-rendering needs to be only
approximately accurate, i.e., accurate enough to identify the
semantically distinct region to which each element belongs. In
another embodiment, pseudo-rendering is achieved using a normal
page/document rendering procedure, but the resulting image data is
used for determining the geometric information associated with the
document's elements rather than for actual display of the
document.
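A simplified one-pass pseudo-rendering might look like the following sketch. It is only an illustration under strong assumptions: elements are stacked vertically in document order, text height is estimated from character count, and an average glyph is taken to be half as wide as the font size. The `Box` type, the `pseudo_render` function, and the element dictionary keys are hypothetical names, not part of the described system.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x_pos: int
    y_pos: int
    width: int
    height: int


def pseudo_render(elements, page_width=800):
    """Estimate one bounding box per element in a single top-to-bottom pass.

    Each element is a dict with either a "text" key (plus an optional
    "font_size") or explicit "width" and "height" keys, as an image
    might have.
    """
    boxes = []
    y = 0
    for el in elements:
        if "text" in el:
            font_size = el.get("font_size", 16)
            # Crude assumption: average glyph width is half the font size.
            chars_per_line = max(1, page_width // max(1, font_size // 2))
            lines = max(1, -(-len(el["text"]) // chars_per_line))  # ceiling
            width, height = page_width, lines * font_size
        else:
            width, height = el["width"], el["height"]
        boxes.append(Box(0, y, width, height))
        y += height
    return boxes
```

The result is only approximately accurate, which matches the stated goal: the geometry needs to be good enough to place each element in the right region, not to draw the page.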
System Architecture
[0030] FIG. 3 is a block diagram illustrating a computer-based
geometry detector 300 in accordance with one embodiment of the
present invention. The geometry detector 300 typically includes one
or more processing units (CPUs) 302, one or more network or other
communications interfaces 310, memory 312, and one or more
communication buses 314 for interconnecting these components. The
geometry detector 300 optionally may include a user interface 304
comprising a display device 306 and a keyboard 308. In some
embodiments, memory 312 includes high speed random access memory
and also includes non-volatile memory, such as one or more magnetic
disk storage devices. Memory 312 may optionally include one or more
storage devices remotely located from the CPU(s) 302. Memory 312,
the non-volatile memory devices of memory 312, or a subset thereof,
comprise a non-volatile computer readable storage medium. In some
embodiments, the memory 312 or the non-volatile computer readable
storage medium of memory 312 stores the following programs, modules
and data structures, or a subset thereof:
[0031] an operating system 316 that includes procedures for handling
various basic system services and for performing hardware dependent
tasks;
[0032] a network communication module 318 that is used for connecting
the geometry detector 300 to other computers via the one or more
communication network interfaces 310 (wired or wireless), such as
the Internet, other wide area networks, local area networks,
metropolitan area networks, and so on;
[0033] a pseudo-web browser module 320 that is used for
pseudo-rendering specified HTML files and generating estimated
geometric information for various elements in the HTML file;
[0034] a quasi-DOM tree generation module 322 that is used for
generating a quasi-DOM tree for an HTML file, the quasi-DOM tree
having multiple nodes and each node corresponding to a syntactic
element in the HTML file;
[0035] a chunk tree generation module 324 that is used for converting
a quasi-DOM tree of an HTML file into a chunk tree by consolidating
the nodes of the quasi-DOM tree in accordance with a set of
predefined heuristic rules and the nodes' geometric information;
[0036] one or more quasi-DOM trees 330 generated by the quasi-DOM
tree generation module 322, each quasi-DOM tree including multiple
nodes corresponding to syntactic and semantic elements of an HTML
file and each node having a set of node attributes including a group
of geometric parameters derived from the pseudo-rendering of the
HTML file;
[0037] one or more chunk trees 340 generated by the chunk tree
generation module 324, each chunk tree including multiple chunks,
each chunk corresponding to one semantically distinct region of a
pseudo-rendered webpage and having a set of chunk attributes
indicating the function and location of the semantically distinct
region in the pseudo-rendered webpage; and
[0038] one or more geometric token lists 350 generated by the
quasi-DOM tree generation module 322, each token list including
multiple tokens, each token corresponding to a term, e.g., a word or
an image, appearing in the pseudo-rendered webpage and including a
set of token attributes representing the location and size of the
term and its associated quasi-DOM tree node and chunk tree node.
[0039] Each of the above identified modules corresponds to a set of
instructions for performing a function described above. These
modules (i.e., sets of instructions) need not be implemented as
separate software programs, procedures or modules, and thus various
subsets of these modules may be combined or otherwise re-arranged
in various embodiments. In some embodiments, memory 312 may store a
subset of the modules and data structures identified above.
Furthermore, memory 312 may store additional modules and data
structures not described above.
[0040] FIG. 4 illustrates exemplary data structures used for
storing attributes associated with a quasi-DOM tree node 410, a
chunk tree node 420 and a geometric token 430, respectively, in
accordance with one embodiment of the present invention.
[0041] In particular, in one embodiment, a quasi-DOM tree node 410
includes:
[0042] node_id that uniquely identifies the node among other nodes of
the same quasi-DOM tree;
[0043] node_type that indicates what kind of syntactic or semantic
element this node is associated with, e.g., "TABLE" for <TABLE> and
</TABLE>, "ROW" for <TR> and </TR>, "CELL" for <TD> and </TD>, etc.;
[0044] parent_node_id pointing to the current node's parent (if any)
on the quasi-DOM tree;
[0045] associated_chunk_id pointing to a chunk tree node in a
corresponding chunk tree to which the current node belongs;
[0046] (x_pos, y_pos) indicating the location of the top-left corner
of one or more semantic elements associated with the current node in
a pseudo-rendered webpage;
[0047] (width, height) indicating the area in number of pixels
occupied by the semantic elements in the pseudo-rendered webpage;
[0048] font_size of the text (if any) associated with the current
node;
[0049] color of the text (if any) associated with the current node;
and
[0050] child_node_id(s) of other nodes in the quasi-DOM tree that are
children of the current node, if any.
[0051] In another embodiment, a quasi-DOM tree node may include a
subset of the above identified fields, may contain additional
fields, and may include somewhat different fields providing similar
information. For example, geometric information may be provided in
several different but equivalent ways. In another example, the set
of fields may include the font type of the text, if any, associated
with the node.
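The quasi-DOM tree node fields listed above could be represented roughly as follows. This is a sketch under assumptions: the Python types, the default values, the plural `child_node_ids` name, and the `add_child` helper are illustrative choices, and the three-node TABLE/ROW/CELL fragment is invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QuasiDomNode:
    node_id: int
    node_type: str                        # e.g. "TABLE", "ROW", "CELL"
    parent_node_id: Optional[int] = None
    associated_chunk_id: Optional[int] = None
    x_pos: int = 0                        # top-left corner, in pixels
    y_pos: int = 0
    width: int = 0
    height: int = 0
    font_size: Optional[int] = None       # set only when the node has text
    color: Optional[str] = None
    child_node_ids: List[int] = field(default_factory=list)


def add_child(nodes, parent_id, child):
    """Register child in the nodes dict and link it under parent_id."""
    child.parent_node_id = parent_id
    nodes[child.node_id] = child
    nodes[parent_id].child_node_ids.append(child.node_id)


nodes = {0: QuasiDomNode(0, "TABLE", width=800, height=40)}
add_child(nodes, 0, QuasiDomNode(1, "ROW", width=800, height=40))
add_child(nodes, 1, QuasiDomNode(2, "CELL", width=400, height=40,
                                 font_size=12, color="#000000"))
```

Storing identifiers rather than object references mirrors the id-based fields in the list above and keeps the node easy to serialize.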
[0052] Since a chunk tree originates from a quasi-DOM tree, it has
a set of chunk attributes similar to a quasi-DOM tree node's
attributes (see, e.g., the quasi-DOM tree node 410 and the chunk
tree node 420 in FIG. 4). However, unlike a quasi-DOM tree that is
driven primarily by the syntactic elements of an HTML file, a chunk
tree is more closely tied with the semantic elements of a
pseudo-rendered webpage. Therefore, there are a few attributes that
are unique to a chunk tree node:
[0053] chunk_type that suggests the location and function of a chunk,
e.g., sidebar, header, footer, etc.; and
[0054] node_type that indicates the type of the node in a chunk tree,
e.g., "ROOT" for the root of a chunk tree, "DATA" for nodes
containing information such as text, images or links, "STRUCTURAL"
for nodes that do not contain such information, or "GRID ELEMENT"
for nodes corresponding to elements belonging to a list or array.
[0055] In some embodiments, each chunk tree node also includes a
Chunk_ID, a Parent_Chunk_ID (identifying a parent node, if any),
geometric fields such as (x_pos, y_pos) and (width, height), and
child_chunk_ID(s), or a subset thereof.
[0056] In some embodiments, the two data structures 410 and 420 are
merged into a single data structure which is shared by the
quasi-DOM tree and the chunk tree.
[0057] Finally, in some embodiments, a geometric token 430 is
created for any word or image appearing in the pseudo-rendered
webpage. The geometric token 430 may include the following fields,
or a subset thereof:
[0058] associated_node_id that points to a leaf node of a quasi-DOM
tree whose associated semantic element includes the geometric
token;
[0059] associated_chunk_id that points to a leaf node of a chunk
tree whose associated semantically distinct region includes the
geometric token;
[0060] (x_pos, y_pos) indicating the location of the top-left
corner of the geometric token in a pseudo-rendered webpage;
[0061] (width, height) indicating the area in number of pixels
occupied by the geometric token in the pseudo-rendered webpage;
[0062] token_term that may be a word, an image, a link, etc.; and
[0063] a pseudo-title flag, indicating that the token is a word
that may be used as the pseudo-title or part of the pseudo-title of
the chunk in which the associated chunk tree leaf node is found.
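As a non-limiting sketch, the geometric token fields above may be
represented as follows, together with a helper that assembles the
pseudo-title of a chunk from its flagged tokens. The helper
pseudo_title_text, the attribute name pseudo_title, and the choice to
order tokens top to bottom and then left to right are assumptions of
this sketch, not features of the data structure 430.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeometricToken:
    associated_node_id: int
    associated_chunk_id: int
    x_pos: int
    y_pos: int
    width: int
    height: int
    token_term: str
    pseudo_title: bool = False  # the pseudo-title flag


def pseudo_title_text(tokens: List[GeometricToken], chunk_id: int) -> str:
    """Join the flagged tokens of one chunk in assumed reading order."""
    flagged = [t for t in tokens
               if t.associated_chunk_id == chunk_id and t.pseudo_title]
    flagged.sort(key=lambda t: (t.y_pos, t.x_pos))
    return " ".join(t.token_term for t in flagged)


# Hypothetical tokens for a sidebar chunk with chunk id 3.
tokens = [
    GeometricToken(7, 3, 60, 10, 50, 12, "Links", pseudo_title=True),
    GeometricToken(7, 3, 10, 10, 45, 12, "Related", pseudo_title=True),
    GeometricToken(8, 3, 10, 30, 40, 10, "home"),
]
title = pseudo_title_text(tokens, 3)
```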
Process and Example
[0064] FIG. 5 is an overview flowchart of major actions by the
geometry detector according to one embodiment of the present
invention.
[0065] After receiving a document, for example in the form of an
HTML file or data structure, the geometry detector first performs a
pseudo-rendering of the HTML file (510). This action emulates the
operation of a real web browser by interpreting the HTML file line
by line and creating a webpage for the HTML file in memory, the
webpage including multiple semantically distinct regions as shown in
FIG. 1. Note that the pseudo-rendered webpage is used only for
identifying geometric information of a semantically distinct
region, not for real-world visualization. Therefore, the geometry
detector does not have to be as accurate as a real web browser. While
a real web browser typically needs to scan an HTML file twice to
generate a webpage, the geometry detector may need only one pass of
the HTML file to create a pseudo-rendered webpage, which reduces
the computational cost of determining geometric information for the
elements of the document.
[0066] The geometry detector also generates a quasi-DOM tree that
includes multiple nodes (520), each node corresponding to one of
the semantic and syntactic elements of the HTML file. A standard
document object model (DOM) tree usually contains a large number of
nodes, e.g., several hundred nodes, most of which correspond to
purely structure-oriented HTML tags like "<TABLE>",
"<TR>", "<TD>", etc. In contrast, the quasi-DOM tree
eliminates those syntactic elements which are totally irrelevant to
the geometric structure of the pseudo-rendered webpage, and creates
some nodes which do not have a direct counterpart element in the
HTML file, e.g., splitting paragraphs separated by significant
vertical gaps into multiple semantic elements. Moreover, each
node of the quasi-DOM tree is associated with a set of attributes
including geometric information derived from the pseudo-rendered
webpage. In some embodiments, operation 520 is performed before
operation 510, and the pseudo-rendering of operation 510 is then
used to populate the nodes of the quasi-DOM tree with geometric
information.
[0067] FIG. 7 is a quasi-DOM tree 700 of the exemplary webpage 100
according to one embodiment of the present invention. The quasi-DOM
tree 700 has one root node labeled "ROOT", and all other nodes are
direct or indirect descendants of the root node. Non-root and
non-leaf nodes usually correspond to syntactic elements in the HTML
file, e.g., "TABLE", "ROW", or "CELL", while leaf nodes correspond
to semantic elements in the file, e.g., "Image", "Text", or "Link".
Note that the child of a "CELL" node may be a leaf node (e.g.,
block 710 or block 730) or a sub-tree including multiple non-leaf
nodes (e.g., block 720 or block 740). In the latter case, the
sub-tree itself may include its own set of "TABLE", "ROW", and
"CELL" nodes, each "CELL" node having one or more child leaf nodes.
In some other embodiments, the parent of a leaf node may not
necessarily be a "CELL" node. For instance, the parent node could
be a "TABLE" node, a "ROW" node or any node associated with a
syntactic element in an HTML file.
[0068] It is worth noting that, for illustrative purposes, not all
attributes associated with a quasi-DOM tree have been listed in
FIG. 7. For example, FIG. 7 does not show the geometric
information, e.g., (x_pos, y_pos) or (width, height), associated
with each node. However, it will be understood by one skilled in
the art that each node shown in FIG. 7 is associated with a set of
attributes listed in the data structure 410 of FIG. 4. Relying upon
these attributes, the geometry detector converts the quasi-DOM tree
into a chunk tree (530).
[0069] FIG. 6 is a flowchart illustrating a method of converting a
quasi-DOM tree into a chunk tree according to one embodiment of the
present invention. The chunk tree at different stages of the method
is shown in FIGS. 8A-8D.
[0070] First, the geometry detector constructs an initial chunk
tree out of the quasi-DOM tree (610). In particular, the geometry
detector identifies interesting nodes on the quasi-DOM tree. In
some embodiments, an interesting node is one that contains actual
text or an image, as opposed to a purely syntactic element. At
this stage, the geometry detector may also collapse a child node
which does not have any siblings into its parent node.
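For illustration, collapsing a child that has no siblings into its
parent may be sketched as a bottom-up pass over the tree. The Node
class and the rule that the merged node takes the upper-cased label
of the surviving content node are assumptions of this sketch; the
geometric attributes are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def collapse(node: Node) -> Node:
    """Fold every child that has no siblings into its parent, bottom-up."""
    node.children = [collapse(c) for c in node.children]
    if len(node.children) == 1:
        # Assumption: the merged node is named after the surviving content.
        only = node.children[0]
        return Node(only.label.upper(), only.children)
    return node


# A "ROW" -> "CELL" -> "Image" branch reduces to a single node.
branch = collapse(Node("ROW", [Node("CELL", [Node("Image")])]))

# A table with two single-cell rows keeps two children.
table = collapse(Node("TABLE", [
    Node("ROW", [Node("CELL", [Node("Text")])]),
    Node("ROW", [Node("CELL", [Node("Link")])]),
]))
```

As written, the pass would also fold a root node that has a single
child; an implementation that must preserve the root would exempt it.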
[0071] Referring again to FIG. 7, since each parent node in the
block 710, "ROW" and "CELL", has only one child node, "CELL" and
"Image", respectively, this quasi-DOM tree branch collapses into a
single node 810 labeled "IMAGE" in FIG. 8A. Similarly, the block
720 is "shrunk" (or reduced) into the block 820 and the block 740
into the block 840. As a result, the initial chunk tree 800 has
fewer nodes than the quasi-DOM tree 700. Note that the block 730 is
not shrunk, but expanded into the block 830. This is because the
geometry detector, after identifying large vertical gaps within the
RHS sidebar 150, determines that the RHS sidebar 150 includes three
semantically distinct elements and therefore splits the "Text" leaf
node in the block 730 into three "TEXT" nodes 830-1, 830-2 and
830-3 in the block 830, each node having its own set of geometric
parameters.
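The splitting of a text node at large vertical gaps may be sketched
as follows. The representation of each rendered line as a (y_pos,
height) pair, the function name split_on_gaps, and the 20-pixel
threshold are hypothetical; the text does not specify how large a
gap must be to count as significant.

```python
from typing import List, Optional, Tuple

Line = Tuple[int, int]  # (y_pos, height) of one rendered text line


def split_on_gaps(lines: List[Line], min_gap: int) -> List[List[Line]]:
    """Group lines top to bottom, starting a new group at each large gap."""
    groups: List[List[Line]] = []
    prev_bottom: Optional[int] = None
    for y, h in sorted(lines):
        if prev_bottom is None or y - prev_bottom >= min_gap:
            groups.append([])
        groups[-1].append((y, h))
        prev_bottom = y + h
    return groups


# Hypothetical sidebar: five lines separated by two large gaps.
sidebar = [(0, 12), (14, 12), (60, 12), (74, 12), (130, 12)]
blocks = split_on_gaps(sidebar, min_gap=20)
```

With these values the sidebar splits into three groups, analogous to
the three "TEXT" nodes 830-1, 830-2 and 830-3.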
[0072] Second, the geometry detector conducts a row and grid
analysis of the initial version of the chunk tree (620). A purpose
of this analysis is to establish among selected portions of an HTML
file a logical relationship that associates a group of semantic
elements together as shown in a corresponding webpage. For example,
the image blocks 140 in FIG. 1 correspond to an image gallery that
includes an array of related pictures and their associated
descriptions. However, this logical relationship between the
pictures is not well represented by the block 850 in the initial
chunk tree. Therefore, the geometry detector may highlight the
logical relationship between the pictures by examining the
geometric parameters associated with each picture and description
through row and grid analysis.
[0073] FIGS. 9A and 9B are an example illustrating a row and grid
analysis method or process according to one embodiment of the
present invention. The analysis method focuses on the existence of
any kind of periodic or semi-periodic relationship between the
pictures. A semi-periodic relationship among a set of elements is
one in which at least a threshold percentage of a set of elements
are positioned at approximately regular intervals. The interval
pattern can be defined to allow for a predefined amount of
positioning variation, and the threshold can be selected to allow
some positions in the pattern to be either unoccupied or partially
occupied by elements not fitting the pattern. Finding a set of
elements to be compatible with a grid pattern, even if the elements
are only semi-periodically positioned, is very beneficial in terms
of grouping the elements into a single chunk, which forms or is
part of a single semantic region of the document.
[0074] As shown in FIG. 9A, the geometry detector first replaces
each leaf node in the block 850 with a hash code based on the
content type of the leaf node. For example, an "Image" leaf node is
replaced with a hash code of the expression "Image" or H("Image"),
and a "Text" leaf node is replaced with a hash code of the
expression "Text" or H("Text"). Accordingly, each of the eight
parent nodes 910-980 is associated with a hash code that depends
upon both its child nodes' hash codes and its own label "CELL".
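One possible way to derive such hash codes is sketched below. The
use of SHA-1, the "|" separator, and the function names h and
parent_hash are arbitrary choices made for this sketch; the text
does not specify the hash function H or how child hash codes are
combined with the parent's label.

```python
import hashlib
from typing import Sequence


def h(text: str) -> str:
    """Stable hash code of a label (SHA-1 is an arbitrary choice here)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def parent_hash(label: str, child_types: Sequence[str]) -> str:
    """Hash that depends on the parent's label and its children's hashes."""
    return h(label + "|" + "|".join(h(t) for t in child_types))


# Two cells with the same kinds of children receive the same hash code.
cell_a = parent_hash("CELL", ["Image", "Text"])
cell_b = parent_hash("CELL", ["Image", "Text"])
cell_c = parent_hash("CELL", ["Text"])
```

Because the hash depends only on content types and not on the text
or image data, structurally similar cells compare equal. This sketch
is sensitive to the order of the children.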
[0075] As shown in FIG. 9B, the eight hash codes are assigned to
eight elements of a 2-D array according to their respective
geometric information produced at the pseudo-rendering stage 510.
Two hash codes are selected from two distinct elements of the 2-D
array and then compared to determine whether they match each other
or not in accordance with a predefined selecting pattern. In a
first selecting pattern, the geometry detector selects all pairs of
vertically adjacent hash codes that are deemed to have a vertical
distance of "1". In a second selecting pattern, the geometry
detector selects all pairs of hash codes that have a vertical
distance of "2". Similar selecting patterns are adapted in the
horizontal direction. As a result, two tables are shown in FIG. 9B,
one containing a result indicating the number of matching pairs and
the total number of candidate pairs in the horizontal direction and
the other containing a result indicating similar information for
pairs in the vertical direction.
[0076] The data in the two tables demonstrate that many pairs of
selected hash codes match each other, 83% in the case of distance
of "1" and 67% in the case of distance of "2". These percentages
are sufficiently high to indicate that the elements are organized
in a periodic or semi-periodic manner. In other words, the
underlying pictures and descriptions in the image blocks 140 are
geometrically compatible with each other. Therefore, these semantic
elements can be associated together with a 3×3
two-dimensional grid structure, each grid element corresponding to
a picture and a description. Note that the same algorithm, when
applied to a group of vertically-spaced semantic elements, can
generate a 1-D list structure. It is also worth noting that the row
and grid analysis as shown in FIGS. 9A and 9B is a heuristic-based
approach that is flexible and robust enough to detect the geometric
relationship of a set of elements even if the relationship is not
strictly periodic, but only semi-periodic. As a result, the row and
grid analysis is robust against "noise", such as small deviations
from periodic positioning and small numbers of "missing" elements
in a pattern. For example, a missing element at the lower-right
corner of the array in FIG. 9B does not prevent the row and grid
analysis from determining that the elements in this example are
organized in a 3×3 two-dimensional grid.
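The pair counting described above may be sketched as follows. The
function match_ratio and the convention that an unoccupied grid
position still contributes a candidate pair (which cannot match) are
assumptions of this sketch. Under that convention, a grid of eight
identical hash codes with the lower-right position empty yields 5 of
6 matching pairs at distance "1" and 2 of 3 at distance "2", which
agrees with the 83% and 67% figures quoted above; the actual
contents of the tables in FIG. 9B are not reproduced here.

```python
from typing import List, Optional, Tuple

Grid = List[List[Optional[str]]]


def match_ratio(grid: Grid, distance: int, vertical: bool) -> Tuple[int, int]:
    """Return (matching pairs, candidate pairs) for positions `distance` apart."""
    rows, cols = len(grid), len(grid[0])
    dr, dc = (distance, 0) if vertical else (0, distance)
    matches = candidates = 0
    for r in range(rows - dr):
        for c in range(cols - dc):
            a, b = grid[r][c], grid[r + dr][c + dc]
            candidates += 1
            # Assumption: an empty position is a non-matching candidate.
            if a is not None and a == b:
                matches += 1
    return matches, candidates


H = "H(CELL)"  # stand-in for the hash code shared by the eight cells
grid = [[H, H, H],
        [H, H, H],
        [H, H, None]]  # the lower-right element is missing

vertical_1 = match_ratio(grid, 1, vertical=True)
vertical_2 = match_ratio(grid, 2, vertical=True)
horizontal_1 = match_ratio(grid, 1, vertical=False)
```

A caller would compare each ratio against a threshold to decide
whether the elements are periodic or semi-periodic; the threshold
value is not specified in the text.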
[0077] After row and grid analysis, the chunk tree is further
simplified by grouping together each set of elements found to fit a
periodic or semi-periodic pattern, as shown in FIG. 8B. There are
five child nodes associated with the root node, 860-1 to 860-5,
each child node corresponding to a rectangular region in the
webpage 100. For example, the child node 860-1 is associated with
the header 120, and the child node 860-2 is associated with the LHS
sidebar 130, etc. As a result, the revised chunk tree is starting
to reflect the geometric structure of the webpage 100. Note that
the three intermediate nodes separating the eight grid elements in
FIG. 8A have been eliminated by the row and grid analysis and the
eight grid elements become sibling nodes in FIG. 8B. Further,
whenever nodes in the chunk tree are combined, their geometric
information is also combined so that the revised nodes represent
the position and extent of each combined or revised node.
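One plausible way to combine the geometric information of merged
nodes is to take the smallest rectangle enclosing all of them, as
sketched below. The text states only that the geometric information
is combined; the enclosing-rectangle rule, the function name
union_box, and the sample coordinates are assumptions.

```python
from typing import Iterable, Tuple

Box = Tuple[int, int, int, int]  # (x_pos, y_pos, width, height)


def union_box(boxes: Iterable[Box]) -> Box:
    """Smallest rectangle that encloses every input rectangle."""
    boxes = list(boxes)
    left = min(x for x, _, _, _ in boxes)
    top = min(y for _, y, _, _ in boxes)
    right = max(x + w for x, _, w, _ in boxes)
    bottom = max(y + h for _, y, _, h in boxes)
    return (left, top, right - left, bottom - top)


# Three hypothetical grid elements combined into one region.
merged = union_box([(100, 50, 80, 60), (200, 50, 80, 60), (100, 130, 80, 60)])
```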
[0078] Third, the geometry detector assigns preliminary tags to
nodes of the chunk tree after the chunk tree has been simplified by
the row and grid analysis (630). The preliminary tags are assigned
according to the geometric information associated with each node.
For example, in FIG. 8C, the node 870-1 has two tags, "Header"
indicating that the associated semantic element probably serves as
the header of a webpage since it is located at the top of the
webpage and "Image" indicating that the associated semantic element
is an image, e.g., a logo image. The node 870-2 has two tags,
"LHS_Sidebar" suggesting that this node corresponds to semantic
elements located on the left hand side of the webpage and "On-Site
Links" indicating that these elements are actually links to another
webpage on the same website. The node 870-3 is tagged as
"Grid_Root" since its child nodes are eight grid elements according
to the row and grid analysis conducted previously at step 620. The
node 870-4 is similar to the node 870-2 except that it is located
on the right hand side of the webpage and all the links are
off-site links. In some embodiments, the node 870-4 may have
another tag such as "Adbanner" suggesting that its elements are
advertisement blocks sponsored by another website. Finally, the
node 870-5 is tagged as "Footer" due to the fact that it is located
at the bottom of the webpage and "On/Off-Site Links" if it includes
both on-site and off-site links.
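Assigning a location tag from geometric information alone may be
sketched as follows. The function position_tag, the 20% band, the
80% width test, the tag name "RHS_Sidebar" and the page dimensions
are all hypothetical; the text gives the tag names "Header",
"LHS_Sidebar" and "Footer" but not the rules that produce them.

```python
from typing import Optional


def position_tag(x: int, y: int, w: int, h: int,
                 page_w: int, page_h: int, band: float = 0.2) -> Optional[str]:
    """Guess a location tag from a node's rectangle (thresholds assumed)."""
    spans_page = w >= 0.8 * page_w
    if spans_page and y + h <= band * page_h:
        return "Header"
    if spans_page and y >= (1 - band) * page_h:
        return "Footer"
    if x + w <= band * page_w:
        return "LHS_Sidebar"
    if x >= (1 - band) * page_w:
        return "RHS_Sidebar"
    return None  # no location tag; other tags may still apply


# Hypothetical 1000 x 1000 pixel page with five regions.
tags = [
    position_tag(0, 0, 1000, 100, 1000, 1000),
    position_tag(0, 120, 150, 700, 1000, 1000),
    position_tag(220, 120, 560, 700, 1000, 1000),
    position_tag(820, 120, 180, 700, 1000, 1000),
    position_tag(0, 900, 1000, 100, 1000, 1000),
]
```

Content tags such as "Image" or "On-Site Links" would be assigned by
a separate inspection of each node's content, which this sketch does
not attempt.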
[0079] Fourth, the geometry detector merges semantically related
sibling nodes of the chunk tree according to their respective
preliminary tags and geometric information (640). For example, the
three sibling leaf nodes associated with the node 870-2 in FIG. 8C
have been merged into the node 890-2 in FIG. 8D. This is, in part,
due to the fact that the two tags "LHS_Sidebar" and "On-Site Links"
are sufficient in characterizing the three individual semantic
elements within the region, and there are no attributes that
substantially separate the three links at the leaf node. Similarly,
the two leaf nodes associated with a grid element are merged into a
single node labeled "DATA" and the three links in the footer region
are also merged into the footer node 890-5. However, the three leaf
nodes labeled "TEXT" that are children of the node 890-4 remain
separate because each node may be associated with a distinct type
of advertisement. For example, if the webpage 100 is an image
gallery, one of the advertisement blocks may offer a paper printing
service and another may sell digital cameras, which are
sufficiently different to remain as separate nodes. But in general,
the chunk tree is further simplified after merging sibling
nodes.
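A much-simplified sketch of the sibling merge follows. It merges
runs of adjacent siblings whose preliminary tag sets are identical
and ignores geometric information, which an actual embodiment would
also consult. The function merge_siblings and the topic tags
"printing" and "cameras" are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Leaf = Tuple[FrozenSet[str], str]            # (preliminary tags, content)
Merged = Tuple[FrozenSet[str], List[str]]    # (tags, merged contents)


def merge_siblings(leaves: List[Leaf]) -> List[Merged]:
    """Merge runs of adjacent siblings whose tag sets are identical."""
    merged: List[Merged] = []
    for tags, content in leaves:
        if merged and merged[-1][0] == tags:
            merged[-1][1].append(content)
        else:
            merged.append((tags, [content]))
    return merged


links = frozenset({"LHS_Sidebar", "On-Site Links"})
sidebar = merge_siblings([(links, "link 1"), (links, "link 2"),
                          (links, "link 3")])

# Advertisement blocks with differing (assumed) topic tags stay separate.
ads = merge_siblings([
    (frozenset({"Adbanner", "printing"}), "ad 1"),
    (frozenset({"Adbanner", "cameras"}), "ad 2"),
])
```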
[0080] Finally, the geometry detector finishes chunk tree
construction by assigning final tags to the chunk tree nodes (650).
The chunk tree, as shown in FIG. 2, has only 17 nodes in contrast
to the quasi-DOM tree that has 58 nodes as shown in FIG. 7. Not
only is the chunk tree simpler than the quasi-DOM tree, it is also
more intuitive. Specifically, the chunk tree has a root node and
the root node has five child nodes, 210-260, each child node
corresponding to a semantically distinct region shown in FIG. 1.
Among them, the nodes 210, 230 and 260 are labeled "DATA_NODE"
because each of them includes a certain type of data (e.g., image,
text or links). In contrast, the nodes 240 and 250 are labeled
"MISC_NODE" and each of these two nodes has a group of child nodes
that are labeled "DATA_NODE". The child nodes have not been merged
into their respective parent nodes for the reasons discussed
above.
[0081] In some embodiments, during the course of chunk tree
construction, the geometry detector identifies a pseudo-title for
each chunk in the chunk tree. For example, during preliminary
tagging (630), the geometry detector tags text in a chunk that
would appear prominently when the document is rendered (e.g., in a
large font size, in a unique font type, or at the beginning of a
paragraph) as a candidate pseudo-title for the chunk. In some
embodiments, the geometry detector searches for text that satisfies
predefined criteria with respect to appearing prominently if the
document were rendered for display, and identifies such text as a
pseudo-title for the associated chunk.
[0082] After preliminary tagging, the geometry detector checks
whether a candidate pseudo-title of a parent chunk could reasonably
be considered to be a pseudo-title for the children of that chunk
according to the geometric information of the parent and child
chunks. For example, if the pseudo-title is an isolated section of
text that is directly above the child chunks, it will be considered
to be the pseudo-title of the child chunks as well.
[0083] During sibling merge (640), the geometry detector checks if
the first sibling of a sequence may be reasonably considered a
pseudo-title for the other siblings. For example, the first row of
a list of links in a sidebar is often the boldfaced title for that
sidebar region. As a result, the chunk tree includes not only a map
linking various semantically distinct regions in a webpage, but
also includes an appropriate title for each region for which a
pseudo-title has been found.
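The check performed during sibling merge may be sketched as follows.
The TextRun record, its bold attribute, and the rule that the first
sibling qualifies when it is larger than, or the only boldfaced one
among, its siblings are assumptions of this sketch; the text leaves
the predefined prominence criteria open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextRun:
    text: str
    font_size: int
    bold: bool = False  # hypothetical attribute, not listed in FIG. 4


def first_sibling_title(siblings: List[TextRun]) -> Optional[str]:
    """Return the first sibling's text if it stands out from the rest."""
    if len(siblings) < 2:
        return None
    first, rest = siblings[0], siblings[1:]
    larger = all(first.font_size > r.font_size for r in rest)
    bolder = first.bold and not any(r.bold for r in rest)
    return first.text if (larger or bolder) else None


# Hypothetical sidebar whose first row is boldfaced.
sidebar = [TextRun("Related Sites", 12, bold=True),
           TextRun("Site A", 12), TextRun("Site B", 12)]
plain = [TextRun("Site A", 12), TextRun("Site B", 12)]
```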
[0084] In some embodiments, after completion of chunk tree
construction, the geometry detector generates a geometric token
list for an HTML file being pseudo-rendered. The geometric token
list includes multiple members, and each one may be a word, an
image or a link in the HTML file. For example, if a word is
considered part of the pseudo-title of a semantically distinct
region, this word will be marked accordingly in the data structure
430 as shown in FIG. 4.
General
[0085] There are many important applications that can benefit from
the chunk tree. For illustrative purposes, the following is an
exemplary list of such applications, each of which may be
implemented on either the same computer system or a different
computer system than the geometry detector 300.
[0086] Link analysis module (360). Most documents on the Internet
either have links referring to other documents, or are referred to
by other documents, or both. A link analysis module, or set of
instructions 360 (sometimes called a page ranker), helps to
determine a document's popularity on the Internet. A document
having more referring documents is often granted a higher weight by
a search engine. On the other hand, as discussed above, using the
analysis tools of the present invention, different links found in
different semantically distinct regions may be assigned different
weights. Therefore, a link from one document to another document
can be weighted in accordance with the semantically distinct region
in which the link is located in order to better characterize the
popularity of the target document.
[0087] Text analysis module (362). Similar to links appearing in
different semantic regions being assigned different weights, a text
analysis module 362 may assign identical text different weights
when that text is found in different semantic regions of a
document. Referring to FIG. 1, if a query term finds its match in
the image blocks 140 instead of the footer 160, the webpage should
be given a higher query score than if the term were found in the
footer, and thus a higher position in the list of matching
documents, because the image blocks 140 are usually the primary
target of a viewer.
[0088] Image captioning module (364). Generally speaking, text
close to an image is more likely relevant to the image than text
farther away from the image. Since every token in a webpage has a
set of associated geometric parameters, it is possible to identify
text that is close to an image and therefore create a caption for
the image, which may subsequently be used in applications like
image search. An image captioning module 364 with such image
captioning functions may be implemented as part of a search engine
backend system for generating a searchable database of information
about documents, including images.
[0089] Snippet construction module (366). When one or more chunks
in a document's chunk tree are assigned a pseudo-title, an
application may rely on the pseudo-titles to construct a more
accurate snippet of the document that captures the major topic(s)
of the document and includes at least one of the pseudo-titles. A
"snippet" is text selected from a document (or other set of
information) as being either representative of the document or text
from the document that matches and surrounds one or more query
terms or the like. A snippet construction module 366 may be
implemented as part of a search engine, email system, or the like.
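The region-dependent weighting used by the link analysis and text
analysis modules may be sketched as follows. The region names, the
weight values, and the function weighted_score are hypothetical and
serve only to show how a match in a primary content region can
outscore the same match in a footer.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical weights per semantically distinct region.
REGION_WEIGHT: Dict[str, float] = {
    "body": 1.0,
    "header": 0.6,
    "sidebar": 0.4,
    "footer": 0.2,
}


def weighted_score(hits: Iterable[Tuple[str, int]]) -> float:
    """Sum match counts, each scaled by the weight of its region."""
    return sum(REGION_WEIGHT.get(region, 1.0) * count
               for region, count in hits)


# The same single match scores higher in the body than in the footer.
in_body = weighted_score([("body", 1)])
in_footer = weighted_score([("footer", 1)])
```

The same table could weight links instead of query-term matches, so
that a link in a footer contributes less to the popularity of its
target than a link in the primary content.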
[0090] Although some of the various drawings illustrate a number of
logical stages in a particular order, stages which are not order
dependent may be reordered and other stages may be combined or
broken out. While some reordering or other groupings are
specifically mentioned, others will be obvious to those of ordinary
skill in the art, and so the groupings mentioned do not present an
exhaustive list of alternatives. Moreover, it should be recognized
that the stages
could be implemented in hardware, firmware, software or any
combination thereof.
[0091] The foregoing description, for purpose of explanation, has
been described with reference to specific embodiments. However, the
illustrative discussions above are not intended to be exhaustive or
to limit the invention to the precise forms disclosed. Many
modifications and variations are possible in view of the above
teachings.
* * * * *