U.S. patent application number 13/911049 was published by the patent office on 2014-04-03 for a product search engine.
The applicant listed for this patent is Derek Edwin Pappas. Invention is credited to Derek Edwin Pappas.
Application Number | 20140095463 13/911049 |
Document ID | / |
Family ID | 49716133 |
Publication Date | 2014-04-03 |
United States Patent Application | 20140095463 |
Kind Code | A1 |
Pappas; Derek Edwin | April 3, 2014 |
Product Search Engine
Abstract
The present invention facilitates product searches on a personal
computer, mobile or other device from remote sites via widget
lookup using a computed image signature and optional product
information extracted using a template in order to retrieve a list
of the same or similar products available at other sites. The
search starts with a widget lookup process, followed by the
submission of the product image URL, optional product information
extracted using a site specific product information template and
information from HTML attributes to a server. The image signature
is computed, a lookup based on the image signature and product
information is executed and a product list with an image, price and
link to each retailer where the product can be found is returned.
The list is reduced based on the submitted image, optional product
template and attribute information. The server sends the product
list to the user's browser for display.
Inventors: | Pappas; Derek Edwin (Palo Alto, CA) |
Applicant: |
Name | City | State | Country | Type |
Pappas; Derek Edwin | Palo Alto | CA | US | |
Family ID: | 49716133 |
Appl. No.: | 13/911049 |
Filed: | June 5, 2013 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
61656502 | Jun 6, 2012 | |
Current U.S. Class: | 707/706 |
Current CPC Class: | G06F 16/353 20190101; G06F 16/951 20190101; G06F 16/284 20190101; G06F 16/30 20190101; G06F 16/35 20190101 |
Class at Publication: | 707/706 |
International Class: | G06F 17/30 20060101 G06F017/30 |
Claims
1. A method for extracting a data record from a web page, said
method comprising: a. accessing said web page with a web browser;
b. activating a web browser device in said web page; c. associating
an extraction template with a data record type on said web page; d.
extracting the data record associated with said data record type;
e. downloading an image associated with an image url; f. creating
an image signature for said image; g. associating said image
signature with said data record; h. storing said image signature in
a third data store; i. storing the association between said image
signature, said image and said data record in the fourth data
store; and j. storing said data record in a first data store
wherein there is an association between a first data field name in
said data record in said first data store and a second data field
name in said extraction template in said second data store.
2. The method of claim 1 wherein said data record is a hidden data
record.
3. The method of claim 1 wherein said data record is a visible data
record on said web page, further comprising extracting said data
record associated with said data record type by: i. selecting a
data field value on said web page; ii. associating said first data
field name with said data field value; iii. displaying a visible
rectangle around said data field value and displaying said first
data field name; iv. calculating an XPATH value of said data field
value on said web page, wherein said extraction template is created
utilizing said first data field name and said XPATH value using
said web browser device; and v. storing said extraction template in
said first data store.
4. The method of claim 1 further comprising automatically
retrieving said extraction template for said web page.
5. The method of claim 2 further comprising storing said hidden
data record in an industry standard format and associating said
hidden data record with a hidden data record template and XPATH
location of the hidden data record and is associated with a root
URL for a web site associated with said web page.
6. The method of claim 1 further comprising automatically
displaying said data record in said web browser device panel,
accepting a description and a collection from a user, and
submitting said data record, said description and said collection
to said first data store.
7. The method of claim 2 further comprising checking validity of
said extraction template by re-extracting a current data field
value and comparing to said data field value and finding any data
field names present in the web page which are missing in the
extraction template or the hidden data record.
8. A method for displaying errors and missing template elements in
a data record from a web page, said method comprising: a. logging
in as an administrator; b. accessing said web page with a web
browser; c. activating a web browser device in said web page; d.
associating an extraction template with a data record type on the
web page; e. accessing the error report from the template error
report server for the data record type in said web page; f.
highlighting the errors and missing elements in said web page; g.
highlighting data field name/data field value pairs in said web
browser device panel that contain errors or are missing from the
template or should not be in the template; h. correcting the web
page template errors by i. associating a data field name with said
data field value or ii. removing said data field values; i.
creating an extraction template comprising said data field name and
an XPATH value using said web browser device; and j. storing said
extraction template in a first data store; k. extracting said data
record associated with said data record type; and l. storing said
data record in a second data store wherein there is an association
between said data field name in said data record in said first data
store and a second data field name in said extraction template in
said second data store.
9. The method of claim 1 further comprising computing the image
signature using one of the following methods: a. compute an image
signature from a standard manufacturer image used by stores; i.
using an external and internal image signature and color histogram;
ii. using an industry standard signature such as BRISK for the
entire image; b. compute an image signature from a random image
which displays the product from different angles using an industry
standard signature such as BRISK for the entire image.
10. The method of claim 9 further comprising computing an external
image signature by finding a minimum bounding box around a product
in a manufacturer or retailer image by projecting rays from the
edge of the image and finding the intersection of the ray with the
edge of the product in the image.
11. The method of claim 10 further comprising computing an image
signature by finding a minimal bounding box around a product in a
manufacturer or retailer image using a binary search to find the
closest point from the product object to the edge of each of the
four sides of an image.
12. The method of claim 11 further comprising creating said image
signature from points indicating the intersection between the rays
and the minimum bounding box.
13. The method of claim 12 further comprising finding an internal
image signature by finding the center of said minimal bounding box
and projecting rays from the center to the edges terminating the
rays at the boundary between two different colors/features.
14. The method of claim 10 further comprising accepting from a user
an indication that said data field value is a constant wherein said
constant becomes part of said extraction template or hidden data
record template, and said constant is displayed in subsequent
extraction processes.
15. The method of claim 1 further comprising storing said data
field value with said data field name, said XPATH value and
associating a root URL name in an extraction template in said first
data store.
16. The method of claim 1 further comprising classifying said data
field value using a product classifier and assigning a product
classification to said data field value.
17. The method of claim 1 further comprising aggregating a
plurality of said data field names and said data field values in
said second data store into user defined collections.
18. The method of claim 8 further comprising associating a plurality
of said extraction templates with a user for measuring the quality
and quantity of extraction templates generated by said user.
19. The method of claim 1 further comprising allowing a second user
accessing said web page from which the data record was extracted or
said extraction template was created or retrieved to extract a
current data field value from said web page.
20. The method of claim 1 further comprising extracting all of the
elements of a list associated with said data field value using a
repeating structured pattern associated with said data field name
and said XPATH value.
21. The method of claim 1 further comprising selecting said data
field value using a predefined extraction template retrieved from
said first data store.
22. The method of claim 1 further comprising selecting said data
field value extracted from the hidden data record.
23. The method of claim 1 further comprising selecting said data
field value by searching for a predefined data field name on said
web page.
24. The method of claim 1 further comprising converting said
extraction template from said first data store into an automatic
data extraction template to extract current data field values from
all web pages at the root web site which matches said template.
25. The method of claim 1 further comprising converting said hidden
record data template from said first data store into an automatic
data extraction template to extract current data field values from
all web pages at the root web site which matches said template.
26. The method of claim 1 further comprising cleaning said data
field value, classifying said data field value, normalizing said
data field value, storing said data field value and indexing said
data field value.
27. The method of claim 1 further comprising adding date and
purchase location information associated with said data field value
to said second data store.
28. The method of claim 1 further comprising comparing a plurality
of data field values from said second data store by a user in a
social network or a shopping engine and storing the comparison for
viewing by said user or other social network members.
29. A method for implementing a browser based information
transmission method comprising: a. extracting a data record from a
web page; b. adding said data record to a user profile on a social
network; and c. sharing said data record with a plurality of users
wherein each of said users can comment, copy, compare, vote on, or
access the web page.
30. The method of claim 29 further comprising combining said data
record with a plurality of other extracted data records to form a
collection.
31. The method of claim 29 further comprising storing said
collection in a searchable index.
32. The method of claim 29 further comprising finding a product
search result from a product image on a web page by a. accessing
the web page with a web browser; b. activating a web browser device
on the web page in a web browser; c. The image identifier
associated with the web browser device automatically finds the
images greater than a certain size; d. The selected images are
shown in a pop up; e. The user selects a single image in the pop up
and presses "done"; f. extracting the image url/bytes from the web
page; g. transmitting the extracted image url/bytes to the web
service controller; h. querying an image data store and associating
the image with a product search result; i. returning the product
search result from the web service controller to the web browser
device; j. displaying the product search result in the web browser
device.
33. The method of claim 29 wherein the data record is a visible
data record, further comprising: transmitting a root URL from the
web browser device to a web service controller; associating the
root URL with an extraction template; returning the extraction
template from the web service controller to the web browser device;
and extracting a product record from the web page using the
extraction template.
34. The method of claim 29 wherein JavaScript is inserted into the
web page such that when said user navigates to said web page said
JavaScript is activated and identifies the hidden data records,
product images or product words; transmits the information to the
web server which looks up the information and returns product
information about the image which includes the list of stores the
product can be purchased at, similar brands, other products from
the brand, other products from the store, products from the same
category with affiliate links or pay-per-click links that when
activated by a user result in the commission being paid to the end
user and to the service provider.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims the benefit of U.S.
Provisional Application No. 61/656,502, filed Jun. 6, 2012, by
Derek Edwin Pappas and titled "Structured and Social Data
Aggregator", incorporated by reference herein and for which benefit
of the priority date is hereby claimed.
FEDERALLY SPONSORED RESEARCH
[0002] Not applicable.
SEQUENCE LISTING OR PROGRAM
[0003] Not applicable.
FIELD OF INVENTION
[0004] The present invention relates to web search, image
processing, on-line shopping and social networking. Specifically,
it relates to techniques of web search and image processing that
aid users who either view, compare and buy products on-line, or
share their product findings and preferences via social networks.
BACKGROUND OF THE INVENTION
[0005] Currently, users search for products on retailer,
manufacturer, shopping engine, social network and blog websites.
When users find a product image that they like, they often want to
know how much the product costs, where they can buy it and other
details and attributes about the product. In addition, the users may
want the specifications for the product and to know the
manufacturer, model number, and product name. Currently, users
cannot search for products by image at remote websites. Users can
search by image for similar images using services such as Google
image search, but Google image search does not return the
structured data associated with image search results.
[0006] Shopping search engines do not de-duplicate or normalize the
shopping data for all products. Typically the same image will
appear in many different records in a shopping search engine
result.
[0007] Socially curated image sites which are used for curating
images from other sites typically can capture the image and the
page title. However, the meta-data (i.e., the structured data
record in the page which was generated from a database) is not
extracted automatically. The image could have been copied to a blog
by another user.
[0008] Users can save a product image found on a third-party
website, go to Google image search, upload the image and search for
it. Google image search will find the images that contain the same
features (i.e., are similar in terms of shape and structure).
Google image search will also return images that contain keywords
which match the keywords associated with the matching images. A
list of images is presented in the search results. The list will
contain the original image, images that are from the same
manufacturer and other images. The user then needs to click on each
image and visit each site to see the price and the product
information. The user then needs to note the product information on
each of the different product pages in order to compare the
different offers for the same product. The Google search results do
not include the product information or any related products.
[0009] The Google Data Highlighter tool performs structured data
extraction using a template made by the user (web page owner). The
user tags the data field values with data field names on the web
page using the tool. The Data Highlighter then finds the other
pages on the site that contain the same HTML markup and structure
as the tagged page. The extracted data is then presented as a rich
snippet in the search results. Currently, the Data Highlighter can
extract only the events-related data records which contain a time,
date, place and person. The identification of semantic information
is the most difficult part of structured data extraction from web
pages.
[0010] "Intelligent data search engine", U.S. Pat. No. 8,190,556,
automatically identifies pages with similar structure from the same
site, finds the intersection between the page structures (i.e., the
XPATH and semantic type), automatically generates an extraction
template, crawls each page on the site and checks if the page
matches the structure of the template. If there is a match, the
structured data on the page is extracted and stored in a data
store.
[0011] Pinterest is a social information catalog that is curated by
users. Users navigate to pages on remote sites which contain images
and then press the "Pin it!" button embedded in the web page or use
the "Pin it" bookmarklet to upload or add a pin on the Pinterest
web site. A set of images from the web page appears and the user
clicks on one of the images, adds a description, selects an
existing pin board or creates a new pin board and then presses
submit. The image, page title and user description are added to the
user's Pinterest pin board. Currently, Pinterest does not support
product information extraction via templates, nor does it allow the
user to perform a remote product information search via its
widget. Pinterest does identify URLs that belong to stores, looks
up the price in a database that is created from a retailer data
feed (not extracted) and displays the price in the page.
[0012] TheFind is a conventional shopping engine. A user searches
for a product by brand, store or category, or can use a limited set
of specifications to narrow down the search results. The search
results are presented to the user. The results often contain
duplicate products from the same store and from different stores.
The results do not group the stores that contain the same product.
[0013] Currently, neither Google nor Pinterest extracts product
information from a single product page using templates. Normally,
TheFind and other shopping engines do not de-duplicate or group the
same product together and display a canonical record for the
product. Moreover, the shopping engines do not present all of the
data from all of the stores on the Internet. Shopping engines use
inverted indexes generated by Apache SOLR or internal tools that
index the fields in the product records. Based on the search
results presented to the user, limited attempts are made to group
the same product from different stores.
SUMMARY OF THE INVENTION
[0014] In accordance with the present invention, there is provided
a method and system for implementing a shopping engine that users
take with them when they browse the Internet, providing a
centralized search service that is connected to content on remote
sites and provides a remote lookup system for the Internet's
products. The invention facilitates web search, image processing,
on-line shopping and social networking. Specifically, the
invention's unique methods of web search and image processing, when
employed, aid users that view, compare and buy products on-line, or
share their product findings and preferences via social
networks.
[0015] The invention facilitates users' searches for products on
retailer, manufacturer, shopping engine, social network, blog, and
other types of websites. When users find a product they are
interested in, the invention provides the information users want to
know, such as product costs, where they can buy it, model numbers,
product names, product specifications, the name of the
manufacturer, and various other details.
[0016] Another aspect of the invention is that the lookup system
provides the users viewing product information on product
information sites with the following information: other stores that
sell the same product; the best store to buy from for non-price
reasons (i.e., support, store, warranty, returns, and customer
service); the historical pricing for the product, similar products
and aggregated information about social brand messages.
[0017] The product recognition process consists of the following
eleven steps. First, execute a web browser program on a computer
device with a screen, a microprocessor, volatile memory and
persistent storage such as a hard disk drive or flash memory.
Second, log into a first remote site incorporating our invention
which contains our web browser device, and install the web browser
device. Third, navigate the Internet via the web browser to find a
product page by searching, browsing or directly typing in a known
URL.
[0018] In the fourth step, the advanced search method looks up the
site URL and, if a site template exists, sends it from the server
to the client browser. The template is created by the user in the
current or a previous session, on the same or a different page at
the same site, by selecting each data field value (DFV) in the
HTML-rendered web page in the web browser, associating the DFV with
a data field name (DFN), and extracting the XPATH to the DFV. The
web browser device then uses the XPATHs in the template to extract
the DFVs from the current page and associate them with their
respective DFNs. The product record DFNs include, but are not
limited to, the manufacturer name (MN), model number (M#), retailer
and manufacturer logos, product name (PN), product image, ratings,
breadcrumb (product category), price, sales price, the rich
attributes (specifications, colors, and features), and product
identification codes such as Universal Product Codes (UPCs) and
ISBNs. If a template does not exist, the user is prompted to
identify the parts of the page which are associated with each of
the DFNs. The product information is extracted from different
places in the HTML code of the product page using the XPATHs
associated with each DFN/DFV. These places include the "alt"
attribute of the "img" tag, the URL and the title. The product
information is also extracted from the paragraphs, headings,
breadcrumb and menu links, and tables containing the product name,
description, category, specifications, retailer name, etc. The
second method includes clicking on the web browser device icon in
the browser address bar at any remote site. The user then selects
an image to send a message to the first remote site's image lookup
server. The message contains the name of the remote site and the
image URL.
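By way of illustration only, the template-driven extraction described above can be sketched as follows. This is a minimal sketch, not the claimed implementation: the template layout (a DFN mapped to an XPATH plus an optional attribute name), the sample page, and the function name extract_record are all assumptions introduced for the example, and Python's standard ElementTree module, which supports only a subset of XPATH and requires well-formed markup, stands in for the browser-side extractor.

```python
import xml.etree.ElementTree as ET

# Hypothetical site template: data field name (DFN) -> (XPATH, attribute).
# The attribute is None when the DFV is the element's text content.
TEMPLATE = {
    "product_name": (".//h1[@id='title']", None),
    "price": (".//span[@class='price']", None),
    "image_alt": (".//img[@id='main']", "alt"),
}

# Hypothetical, well-formed product page used only for this example.
PAGE = """
<html><body>
  <h1 id="title">Acme Kettle K-200</h1>
  <img id="main" src="/img/k200.jpg" alt="Acme K-200 steel kettle"/>
  <span class="price">$39.99</span>
</body></html>
""".strip()


def extract_record(page_xml, template):
    """Apply each XPATH in the template and pair the DFV with its DFN."""
    root = ET.fromstring(page_xml)
    record = {}
    for dfn, (xpath, attr) in template.items():
        node = root.find(xpath)
        if node is None:
            record[dfn] = None  # field missing on this page
        elif attr is not None:
            record[dfn] = node.get(attr)  # DFV stored in an attribute
        else:
            record[dfn] = (node.text or "").strip()  # DFV stored as text
    return record


record = extract_record(PAGE, TEMPLATE)
print(record)
```

A field that resolves to None in such a sketch corresponds to the case where the page no longer matches the template and the user would be prompted to identify that part of the page.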
[0019] Fifth, the first remote site's image lookup server downloads
the image from the image URL. The lookup server automatically
performs the image signature computation, which converts the image
to a vector of numbers and creates the image signature. Sixth, the
software converts the image signature into a list of product IDs.
The image signature is sent for lookup in the image signature
database via the product index, which finds a list of product
records for the same or similar products with matching image
signatures (a range check is performed to allow for image artifacts
such as noise). Seventh, the list of product records is sent back
to the user, who is waiting at the remote site. The displayed list
of product records shows all stores where the user can buy the same
or similar products. Eighth, the image signature and product
information lookup results are combined to allow further refining
of the combined search results by checking for the same or similar
products in the combined list, using such checks as a range check
on the price and similar categories. In the event that two similar
products have the same signature, the product information is used
to verify that the combined results contain the same product. If
the user requested that similar products be returned, then a
combined result including similar products is returned to the user.
Ninth, the resulting product list is sent from the first remote
site via a JSON file over the World Wide Web and displayed in the
user's browser, which is executing on the user's client computing
device. Tenth, the user selects retailer sites to visit by clicking
on the links in the returned search results. Eleventh, the user is
optionally allowed to add the product(s) in the search results in
the web browser executing on the client computing device to the
user's collection on the first remote site.
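The sixth through ninth steps can be illustrated with a short sketch, under stated assumptions. The in-memory signature table, the per-component tolerance, the 25% price window, and the names signatures_match, lookup_by_signature and refine are hypothetical and chosen only for the example; the text above does not specify how the range check or the refinement is implemented.

```python
import json

# Hypothetical signature database: product ID -> (signature vector, record).
SIGNATURE_DB = {
    "p1": ([12, 40, 33, 7], {"name": "Acme Kettle K-200", "store": "StoreA",
                             "price": 39.99, "category": "kettles"}),
    "p2": ([13, 41, 32, 7], {"name": "Acme Kettle K-200", "store": "StoreB",
                             "price": 42.50, "category": "kettles"}),
    "p3": ([12, 40, 34, 8], {"name": "Toy kettle", "store": "StoreC",
                             "price": 4.99, "category": "toys"}),
    "p4": ([90, 5, 61, 20], {"name": "Desk lamp", "store": "StoreA",
                             "price": 25.00, "category": "lamps"}),
}


def signatures_match(a, b, tolerance):
    """Range check: every component must agree within the tolerance."""
    return len(a) == len(b) and all(abs(x - y) <= tolerance for x, y in zip(a, b))


def lookup_by_signature(query, tolerance=2):
    """Convert an image signature into a list of candidate product IDs."""
    return [pid for pid, (sig, _) in SIGNATURE_DB.items()
            if signatures_match(query, sig, tolerance)]


def refine(product_ids, price, category, price_range=0.25):
    """Keep candidates in the same category whose price is within the window."""
    low, high = price * (1 - price_range), price * (1 + price_range)
    kept = []
    for pid in product_ids:
        rec = SIGNATURE_DB[pid][1]
        if rec["category"] == category and low <= rec["price"] <= high:
            kept.append({"id": pid, **rec})
    return kept


candidates = lookup_by_signature([12, 40, 33, 7])
results = refine(candidates, price=39.99, category="kettles")
payload = json.dumps(results)  # product list sent back to the browser as JSON
print(payload)
```

In this sketch the toy kettle matches on signature alone but is dropped by the category and price checks, which mirrors the case where two different products share a signature and the product information is used to separate them.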
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] A complete understanding of the present invention may be
obtained by reference to the accompanying drawings, when considered
in conjunction with the subsequent, detailed description, in
which:
[0021] FIG. 1 is a block diagram of various functional components
of a system.
[0022] FIG. 2 is a block diagram of various functional components
of a system.
[0023] FIG. 3 is a block diagram of various functional components
of a system.
[0024] FIG. 4 is a flow chart of the image signature
computation.
[0025] FIG. 5A is a block diagram of a step of the example image
signature computation.
[0026] FIG. 5B is a block diagram of a step of the example image
signature computation.
[0027] FIG. 5C is a block diagram of a step of the example image
signature computation.
[0028] FIG. 5D is a block diagram of a step of the example image
signature computation.
[0029] FIG. 5E is a block diagram of a step of the example image
signature computation.
[0030] FIG. 5F is a block diagram of a step of the example image
signature computation.
[0031] FIG. 5G is a block diagram of a step of the example image
signature computation.
[0032] FIG. 5H is a block diagram of a step of the example image
signature computation.
[0033] FIG. 5I is a block diagram of a step of the example image
signature computation.
[0034] FIG. 5J is a block diagram of a step of the example image
signature computation.
[0035] FIG. 6A is a flow chart illustrating use of the web browser
device.
[0036] FIG. 6B is a flow chart illustrating use of the web browser
device.
[0037] FIG. 6C is a flow chart illustrating use of the web browser
device.
[0038] FIG. 6D is a flow chart illustrating use of the web browser
device.
[0039] FIG. 6E is a flow chart illustrating use of the web browser
device.
[0040] FIG. 6F is a flow chart illustrating use of the web browser
device.
[0041] FIG. 6G is a flow chart illustrating use of the web browser
device.
[0042] FIG. 6H is a flow chart illustrating use of the web browser
device.
[0043] FIG. 6I is a flow chart illustrating use of the web browser
device.
[0044] FIG. 7 is a flow chart of product recognition based on an
image.
[0045] FIG. 8 is a block diagram of an example computing
system.
DETAILED DESCRIPTION
[0046] Before the invention is described in further detail, it is
to be understood that the invention is not limited to the
particular embodiments described, as such may, of course, vary. It
is also to be understood that the terminology used herein is for
the purpose of describing particular embodiments only, and not
intended to be limiting, since the scope of the present invention
will be limited only by the appended claims.
[0047] Where a range of values is provided, it is understood that
each intervening value, to the tenth of the unit of the lower limit
unless the context clearly dictates otherwise, between the upper
and lower limit of that range and any other stated or intervening
value in that stated range is encompassed within the invention. The
upper and lower limits of these smaller ranges may independently be
included in the smaller ranges, and are also encompassed within the
invention, subject to any specifically excluded limit in the stated
range. Where the stated range includes one or both of the limits,
ranges excluding either or both of those included limits are also
included in the invention.
[0048] Unless defined otherwise, all technical and scientific terms
used herein have the same meaning as commonly understood by one of
ordinary skill in the art to which this invention belongs. Although
any methods and materials similar or equivalent to those described
herein can also be used in the practice or testing of the present
invention, a limited number of the exemplary methods and materials
are described herein.
[0049] It must be noted that as used herein and in the appended
claims, the singular forms "a", "an", and "the" include plural
referents unless the context clearly dictates otherwise.
[0050] All publications mentioned herein are incorporated herein by
reference to disclose and describe the methods and/or materials in
connection with which the publications are cited. The publications
discussed herein are provided solely for their disclosure prior to
the filing date of the present application. Nothing herein is to be
construed as an admission that the present invention is not
entitled to antedate such publication by virtue of prior invention.
Further, if dates of publication are provided, they may be
different from the actual publication dates and may need to be
confirmed independently.
[0051] Embodiments of the present invention include methods and
apparatus for product searches on a personal computer, mobile or
other device that provide a means for a user to gather the
information the user needs with minimal effort and in a
straightforward way. One advantage is that users are able to bring
the shopping engine with them when they browse the Internet because
the centralized service provides the databases for the Internet's
products.
[0052] In some embodiments, methods invoke a browser extension in
the form of a widget placed on the browser tool bar. A user can
navigate on the Internet to the web browser device installation
website using a browser. The web browser device installation
website contains a web browser device. The web browser device is
installed in the web browser tool bar. The user can then navigate
to any remote site where desired information is to be searched for.
The web browser device has two buttons: Remote Search (for images)
and Advanced Search (for data records). When Advanced Search is
clicked, the remote URL is sent to the template server. The
template server looks up the root URL. If a template is found, then
it is returned to the web browser along with a JavaScript
extractor, which then extracts the data (product record) from the
(template) page and sends it to the cleaner server, which performs
additional extraction and cleaning. The extracted product record is
then looked up in a product database, and all stores which carry
the product, along with additional information, are returned to the
browser, which displays the information in a popup or another
browser tab. Additional information can also be sent to the popup,
such as similar products. In the case where the image signature
results in a number of product matches from the same or different
categories, the popup can show the list of categories or canonical
product records, from which the user then chooses to get a list of
products.
[0053] If a template is not returned by the template server, then
the user is prompted to complete the search form in the web browser
device by right-clicking on the data field value elements in the
page as per the steps described above. The search process described
above is then performed and the products are looked up. When the
remote search button is clicked, the web browser device analyzes
the page, creates a template describing the page, and sends the
information to the template server.
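The Advanced Search branch described above (look up the root URL, return the template if one exists, otherwise prompt the user) can be sketched as follows. This is an illustrative sketch only: the in-memory TEMPLATE_STORE, the shop.example.com addresses, the response dictionaries, and the function names root_url and advanced_search are assumptions made for the example and are not part of the described system.

```python
from urllib.parse import urlsplit

# Hypothetical template store keyed by root URL (scheme plus host).
TEMPLATE_STORE = {
    "https://shop.example.com": {
        "product_name": ".//h1",
        "price": ".//span[@class='price']",
    },
}


def root_url(page_url):
    """Reduce a page URL to the root URL used as the template key."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}"


def advanced_search(page_url):
    """Return the site template if one exists, else ask the client to prompt."""
    key = root_url(page_url)
    template = TEMPLATE_STORE.get(key)
    if template is None:
        # No template: the user tags the data field values manually.
        return {"action": "prompt_user", "root_url": key}
    # Template found: the client-side extractor applies it to the page.
    return {"action": "extract", "root_url": key, "template": template}


found = advanced_search("https://shop.example.com/products/k200?color=red")
missing = advanced_search("https://other.example.org/item/42")
print(found["action"], missing["action"])
```

Keying templates by root URL, as in this sketch, lets one template serve every product page on a site that shares the same layout.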
[0054] FIG. 1 shows the abstract of the system. The product
information can be extracted from a remote product information site
101 by the automatic product information extraction 102 and the
user generated template semi-automatic product information
extraction 103. The extracted product information which is
normalized, grouped, de-duped and classified is stored in the
product data store 109. The product images which are first
processed in the image processing service 111 are stored in the
image data store 110. The user can then perform a lookup using the
web browser device 104. The lookup 108 queries the product data
store and the image data store and through the web services 105
returns results 114 displayed in the web browser 100.
Advertisements stored in the ads database 107 can also be looked
up 106. The advertisements that were returned from the lookup are
displayed 113 in the browser. The social network 112 communicates
with the web services 105 and contains records pointing to the
remote product information sites 101.
[0055] FIG. 2 shows the system operation when the user presses the
web browser device (bookmarklet or button or extension) 202 to
extract visible or hidden data on the remote site web page. The
user will register at the shopping engine or socially curated
shopping site or search engine 201. The user installs the web
browser device for product lookup. The user can then go to a remote
third party site 208 generated by a remote web service 207 which
contains products stored in a structured data format generated from
a remote product or other structured data website database 205 and
a remote web site template 206. The remote web page 209 contains
the product record 210 and an image URL and/or image bytes 226.
[0056] Then the user clicks on the web browser device 202
containing JavaScript code, which can be an embedded button,
widget, extension or toolbar button, in the browser 200. When the
widget, extension or button is pressed, the product record 210 and
the image URL 226 embedded in the web page 209 are extracted. The
image 226 is downloaded 230 and processed as described later in the
patent and sent to the web service controller 214. When the user
presses the web browser device 202, the JavaScript is executed by
the browser 200. The web browser device JavaScript code 202 creates
an HTML script tag 213 in the page which points to a server side
script 236 that will be created on the web service server 203. The
HTML script tag 213 passes a URL 204 from the address bar 235 to
the server side script 236 as an argument. The web service server
203 will extract the root URL from the sent URL 204 and look up the
retrieved extraction template(s) 216. The server side script 236 is
created on the web service server 203 which contains the merged
site extraction template(s) 216 for the root URL associated with
the URL 204, web browser device panel user interface code 218, and
the JavaScript extractor 217. The modified HTML page 209 that
contains the injected HTML script tag 237 is converted to the DOM
representation 212 by the browser 200. The browser then executes
the server side script 236, which creates the following elements in
213 in the browser: the web browser device panel 218 which appears
in the product page tab, the JavaScript extractor 217, and the
merged site extraction template(s) 216. If a template was retrieved then the
XPATH in each tuple is looked up in the DOM and the product record
219 is extracted and inserted into the web browser device panel UI
218 data fields. The extracted data field values will be
highlighted in the web page and tagged with the corresponding data
field name.
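The template-driven extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the template store `TEMPLATES`, the field names, the XPATH expressions and the example.com URL are hypothetical, and Python's standard library ElementTree parser (which requires well-formed markup and supports only a subset of XPath) stands in for the browser DOM and the JavaScript extractor.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Hypothetical template store keyed by root URL. Each tuple holds
# (data field name, XPATH, semantic type), mirroring the tuples in the text.
TEMPLATES = {
    "https://shop.example.com": [
        ("name", ".//h1[@id='title']", "text"),
        ("price", ".//span[@class='price']", "currency"),
    ],
}


def root_url(url):
    """Reduce a full page URL to scheme plus host, the template lookup key."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def extract_record(page_url, page_markup):
    """Return the product record for a page, or None if no template exists."""
    template = TEMPLATES.get(root_url(page_url))
    if template is None:
        return None  # the user would be asked to select the fields manually
    dom = ET.fromstring(page_markup)
    record = {}
    for field_name, xpath, _semantic_type in template:
        node = dom.find(xpath)
        if node is not None and node.text:
            record[field_name] = node.text.strip()
    return record


PAGE = (
    "<html><body><h1 id='title'> Blue Kettle </h1>"
    "<span class='price'>$24.99</span></body></html>"
)
print(extract_record("https://shop.example.com/p/1?ref=home", PAGE))
```

A page whose root URL has no stored template yields `None`, which corresponds to the case where the user is prompted to select the data field values.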
[0057] If no template was returned by the web service server 203,
the page 209 has changed, or information is missing from the page,
then the user selects that product information in the web page. The
information that the user selects in the web page is checked for
semantic errors, string-too-long errors and other types of errors,
and the data is cleaned by the product record checker/cleaner 225.
After the user selects the product information in the web page and
populates the panel, the user presses the panel submit button 221.
The web browser device then sends the submission container 220 with
the submitted product record 224, the new extraction template 222,
which contains the list of tuples (data field name, data field
value, XPATH, semantic type), and the image URL 226 in a POST
key/value form to the web service controller 214. If the user
selected a price alert option for the product in the web browser
device panel, then the set price alert message is sent to the price
alert and history server which then stores the price alert in the
price history database.
[0058] The user can press the find button 223 to search for
products in the product database 241 and in the image database 227.
The selected product record 224 is processed in the image/data
processing pipeline 240. The index 242 is generated from the image
and the product database. The lookup 243 will generate the search
results 238 sent to the web service controller 214. The browser 200
will display the search results 238 containing the list of stores
with prices 211 and product list 238 with the product that can be
selected 232. When the user selects a product 232 from the search
results 238 the selected product is looked up 231.
[0059] The web service will send a new template record which
contains the URL of the page 204, the new extraction template 222,
if the user created one, from the web browser device and submitted
product record 224 from the web page to the product record cleaner.
The cleaner will clean the product record and send a cleaned
product record.
[0060] The web service performs the following operations: (1) the
server generates a unique identifier. The product page URL 204 is
hashed to a 256-bit unique identifier by the web service 214; (2)
the web service sends the unique identifier and the user collection
identifier to the user database 228; and (4) the server sends the
unique identifier and the extraction template in JSON form 222 to
the extraction template database 215. Templates from the template
database are checked by the template checker 233. The template
widget stat server 234 communicates with the template database 215.
The
XPATH and the semantic type are used to extract data field values
from pages on the site and associate them with data field names.
Pages on the site are constructed from the same remote template
206. The new extraction template 222 contains the list of tuples (a
tuple consists of the following: data field name, data field value,
XPATH, semantic type).
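The identifier generation and the JSON form of the extraction template can be sketched as below. The text states only that the URL is hashed to a 256-bit value; the use of SHA-256, the JSON key names and the function names here are assumptions made for illustration.

```python
import hashlib
import json


def page_identifier(url):
    """Hash a product page URL to a 256-bit identifier (64 hex characters).

    SHA-256 is an assumption; the source does not name the hash function.
    """
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def template_record(url, tuples):
    """Serialize an extraction template as JSON.

    Each tuple is (data field name, data field value, XPATH, semantic type).
    The key names are hypothetical.
    """
    return json.dumps({
        "id": page_identifier(url),
        "url": url,
        "template": [
            {"name": n, "value": v, "xpath": x, "semantic_type": s}
            for n, v, x, s in tuples
        ],
    })


URL = "https://shop.example.com/p/1"
print(template_record(URL, [("price", "$24.99", ".//span[@class='price']", "currency")]))
```

Hashing the same URL always gives the same identifier, so repeated submissions for one page map to the same template record.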
[0061] If the user submitted a product using a web browser device
the user and others can see the selected data record that was
inserted into the collection specified by a collection id on their
profile page on the socially curated website. Periodically a job is
run to generate a new index 242 from the product database 241 to
make it easier to search for the products in user collections.
[0062] Search engines index words and phrases. Attempts to extract
structured data in web pages have been made by search engines using
special markup in the web pages such as RDF, GoodRelations,
microformats and rich snippets. The web designer inserts the industry
standard structured data formats into the web page to create data
records in the web pages. The search engine crawls the site and
examines the web pages for the presence of industry standard
structured data formats. The industry standard structured data
formats identify the data field values using a set of data field
names. A method for extraction of structured data from a page
containing a visible and invisible data record at a site using an
identifiable invisible data and layout format is shown when a web
browser device button is pressed on the web page. The data record
is located in a set of HTML tag(s) with corresponding data field
names. An aspect of the present invention provides that a
third-party predefined set of data field names is used to enclose
the data field values on the page. The third-party data field names
are placed in attributes next to the data field values in the HTML
tags.
[0063] Turning now to FIG. 3, the product record information in the
online store database 324 at the affiliate marketing FTP website
325 is accessed by the FTP downloader 326 which fetches the
product record data feed 327. The downloaded product records are
then sent to the data processing pipeline. A product information
web site 302 is connected to the remote web service 329 that reads
remote template(s) 328 containing the data field name variables,
and remote online store database 324 to generate the online store
site 302. The page downloader or crawler 306 reads a list of sites
or pages from the online store URL list 305 and downloads the
product pages 307.
[0064] The downloaded pages are then used in conjunction with the
selected corresponding site template 336 from template database 303
by the automatic extractor 308 which extracts the product records
from all pages matching the site template. A site may have more
than one site template. The product pages are processed by the
automatic extractor which sends the root URL of each page that it
is processing to the extraction template database 303 and retrieves
the web browser device extraction template. The web browser device
extraction template is converted to an automatic extraction
template. The automatic extractor extracts the structured data
record from each product information page using the automatic
extraction template and creates a product record 309.
[0065] The affiliate downloaded product records 327 and
automatically extracted product records 309 each are read by the
cleaner 310. The cleaner analyses each downloaded product record
and produces a cleaned product record 311. The cleaner moves data
field values and partial data field values from one data field to
another, removes extraneous text, verifies the correctness of the
data field values, and calculates statistics on the number of
good/bad data field values using semantic checking and stores the
stats in the product record. Cleaned product records are then
classified by the product classifier 312. The product classifier
matches data records to one or more product classification tuples
from the product classification tuple list using words from the
data record which are product classification base or synonym words.
The classified data records 313 are normalized and grouped by the
normalizer 314. The normalizer will de-duplicate the product record
stream, group together records which are the same record found at
different sources (e.g. stores, shopping engines, socially curated
sites, blogs, and manufacturer sites), and refine the classification
of a group of the same product records from different sources using
methods such as voting. Further normalization steps can also be
performed. The automatic extraction, cleaner, product classifier,
normalizer and grouper stages communicate with the dictionary
database 304. The dictionary looks up token(s) and returns semantic
type information. Synonyms are converted to base words. The
dictionary information is used by each pipe stage to process the
data record. The resulting cleaned, classified and normalized
product records 315 are saved 316 in the affiliate product database
319 or in the extracted product database 318 depending on the
source of the product record.
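Three of the pipeline steps above (synonym conversion, classification by base words, and refinement of a group's classification by voting) can be sketched as follows. The dictionary contents, category names and function names are hypothetical, and a simple majority vote is only one possible reading of "methods such as voting".

```python
from collections import Counter

# Hypothetical dictionary entries: synonym -> base word.
SYNONYMS = {"tv": "television", "telly": "television", "notebook": "laptop"}

# Hypothetical classification tuples: base word -> product category.
CATEGORY_WORDS = {
    "television": "Electronics/TV",
    "laptop": "Electronics/Computers",
}


def to_base_words(text):
    """Lower-case and tokenize the text, replacing synonyms by base words."""
    return [SYNONYMS.get(token, token) for token in text.lower().split()]


def classify(text):
    """Return the categories whose base words occur in the record text."""
    return [CATEGORY_WORDS[w] for w in to_base_words(text) if w in CATEGORY_WORDS]


def vote_category(categories):
    """Pick the most frequent category among records grouped as one product."""
    return Counter(categories).most_common(1)[0][0]


group = ["Sony 40in TV", "Sony Bravia telly", "Sony notebook stand"]
votes = [c for title in group for c in classify(title)]
print(vote_category(votes))
```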
[0066] The user runs the web browser device 345 in a web browser
300 and creates a new extraction template 333 and a product record
331 from a product information web page 334 which is inserted into
the extraction template database 303. The web browser device new
extraction template is converted to an automatic structured data
extraction process template which is used to do the structured data
extraction 308 of all pages matching the page layout at the site
that the web browser device extraction template was created from.
All pages are downloaded from the site. Each web page from the same
site is tested to see if it matches the structured data extraction
template(s). If there is a match the data record is extracted from
the matching pages. The extracted record is cleaned, classified,
normalized, and stored in a database or index.
[0067] The image URL from the cleaned product record 311 is used to
download the image 337. The downloaded image 338 is then processed
in the image signature computation flow 339 and the image record
340 is generated. The image record contains product id 341, image
URL 342 and the computed image signature 343. The image records are
stored in the image database 344. The web browser device extracted
product records database 317, the extracted product database 318,
the affiliate product database 319 and the image database 344 are
merged by the database merger 320 and a merged and normalized
product database 321 is created. The merged product database is
then indexed by the indexer 322 and an index 323 is created.
[0068] The user 348 can optionally search for a product using the
web browser device 345. The web browser device's panel 330 sends
the product image (image URL and/or the image bytes) 342 and/or the
product record 347 from the current web page 334 to the web service
332. The web service queries the index. The product search index
323 is looked up 301 and the search results are returned 349. The
product search result is displayed in the browser 300. The user can
then select a specific product by clicking the URL, navigating to
the remote URL and then viewing the remote product information. The
advantage of this aspect of the invention is that the user can
search for product information on remote product information web
sites without leaving the product information web page i.e. the
user does not have to cut information from the product page and
paste it into the search box at Google and/or a shopping search
engine.
[0069] The user or a previous user identifies the data field values
(DFV) on the web page and associates each DFV with a data field
name (DFN) which are converted into an extraction template. If the
template server contains a template the template is downloaded to
the browser. The template contains downloaded JavaScript used to
extract the data record from the HTML page, and send the
information to the template server. If the template server does not
contain the extraction template for the web page then the user will
be prompted to specify the data field values (DFVs). The DFVs are
then used in the product record search on the server. In either case,
after the information in the page is extracted to the web browser
device panel the user presses search and the product server looks
up the product record information and returns the list of stores
and their prices that contain the item. Additional information can
be returned as well, such as specifications and other rich
attributes and similar products.
[0070] FIG. 4 describes the image signature computation flow 402.
An image 401 is transmitted to the image processing service and
prepared for the processing. Image preprocessing 403 scales the
image to a predefined size (the scaled image) and creates a
gray-scale copy of the scaled image. Certain parts of the algorithm
use the gray-scale model. The background type (solid color,
gradient or transparent) is detected in step 404. The filter
selection 405 for the object boundary detection is determined by
the background type. If the background is transparent and the pixel
is transparent then it is a part of the background. Otherwise, it
is a part of the object. In the case of a solid background the edge
between the background and the object is detected. If the
background is a gradient the boundary between the gradient and the
object is detected. Various industry standard edge detection
algorithms can be used to detect the boundary (minimum bounding
box). Then, the binary search lookup 406 is performed along the
rays to define the intersection between the background and the edge
of the object. Using the bounding box, the image in original color
space and the gray-scale image are then cropped to the bounding box
edges and prepared for the further processing 407.
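The bounding box and cropping steps for the solid background case can be sketched as below. This is a simplified stand-in: a per-pixel threshold against a known background intensity replaces the edge detection filters and the binary search along rays, and the `tolerance` parameter and function names are assumptions.

```python
def bounding_box(gray, background, tolerance=10):
    """Return (left, top, right, bottom) of the pixels that differ from the
    background intensity by more than `tolerance`, or None if none do."""
    rows = [
        y for y, row in enumerate(gray)
        if any(abs(p - background) > tolerance for p in row)
    ]
    if not rows:
        return None
    cols = [
        x for x in range(len(gray[0]))
        if any(abs(row[x] - background) > tolerance for row in gray)
    ]
    return (cols[0], rows[0], cols[-1], rows[-1])


def crop(image, box):
    """Crop a row-major image (list of rows) to an inclusive bounding box."""
    left, top, right, bottom = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# 6x5 gray-scale image: white (255) background, dark (40) object.
GRAY = [
    [255, 255, 255, 255, 255, 255],
    [255, 255, 40, 40, 255, 255],
    [255, 40, 40, 40, 40, 255],
    [255, 255, 40, 255, 255, 255],
    [255, 255, 255, 255, 255, 255],
]
box = bounding_box(GRAY, background=255)
print(box, crop(GRAY, box))
```

The same `crop` call would be applied to both the original color image and the gray-scale copy so that the two stay aligned.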
[0071] The external image signature creation 408 projects a line
from each corner of the gray-scale cropped image, angled at
45°, until it intersects the minimum bounding box. Rays
bisecting the image edges are also projected, and their
intersection with the minimum bounding box is detected using the
same method as for the 45° lines. Traversal lines can also be
projected from other characteristic points on the edges,
perpendicular to the edge they are on. Then, the first intersection
with the object on each traversal line, starting from the line
origin, is found. Next, the lengths from the line origin at the
edge of the image to the object intersection on each line are
found. The x and y distances from the line origin to the
intersection point are equal for the 45° lines, so a single value
(x or y) is used in the image signature for each of the 45° and
90° lines in the implementation. The number of lines can be
increased for accuracy.
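The external signature can be sketched as a list of step counts along eight rays: four 45° diagonals from the corners and four perpendicular rays from the edge midpoints. The binary object mask, the ray order and the use of -1 for a ray that never meets the object are assumptions made for this illustration.

```python
def first_hit(mask, x, y, dx, dy):
    """Walk from (x, y) in direction (dx, dy) and return the number of steps
    to the first object pixel, or -1 if the ray leaves the image first."""
    height, width = len(mask), len(mask[0])
    steps = 0
    while 0 <= x < width and 0 <= y < height:
        if mask[y][x]:
            return steps
        x, y, steps = x + dx, y + dy, steps + 1
    return -1


def external_signature(mask):
    """Distances along four corner diagonals and four edge-midpoint rays.

    Because a diagonal step moves one pixel in both x and y, the step count
    is the single value kept for each 45 degree line.
    """
    h, w = len(mask), len(mask[0])
    rays = [
        (0, 0, 1, 1), (w - 1, 0, -1, 1),            # top-left, top-right
        (w - 1, h - 1, -1, -1), (0, h - 1, 1, -1),  # bottom-right, bottom-left
        (w // 2, 0, 0, 1), (w - 1, h // 2, -1, 0),  # top, right midpoints
        (w // 2, h - 1, 0, -1), (0, h // 2, 1, 0),  # bottom, left midpoints
    ]
    return [first_hit(mask, *ray) for ray in rays]


MASK = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(external_signature(MASK))  # [1, 2, 2, 2, 1, 1, 2, 1]
```

Adding more rays to the `rays` list corresponds to increasing the number of lines for accuracy.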
[0072] The first phase in the internal image signature computation
409 takes eight traversal lines starting from the center of the
cropped image (in the original color space) in eight directions.
The first line is perpendicular to and directed toward the top
edge, and each subsequent line is angled at 45° to the previous one
in the clockwise direction. Then the first color change with a
large difference in intensity (exceeding a certain threshold) is
found on each traversal line, along the line direction. The next
step is to calculate the length from the line start to the color
change point on each line.
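A minimal sketch of the internal signature follows. It works on a single intensity channel rather than the original color space, compares each pixel with the previous pixel on the line, and reports -1 when no change exceeds the threshold; these choices and the default threshold are assumptions.

```python
# Eight directions as (dx, dy) with y growing downward: the first points to
# the top edge and each following one is rotated 45 degrees clockwise.
DIRECTIONS = [(0, -1), (1, -1), (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1)]


def internal_signature(intensity, threshold=50):
    """For each direction, count the steps from the image center to the first
    pixel whose intensity differs from the previous pixel on the line by more
    than `threshold`; -1 means no such change before the image border."""
    height, width = len(intensity), len(intensity[0])
    signature = []
    for dx, dy in DIRECTIONS:
        x, y, steps, length = width // 2, height // 2, 0, -1
        while 0 <= x + dx < width and 0 <= y + dy < height:
            previous = intensity[y][x]
            x, y, steps = x + dx, y + dy, steps + 1
            if abs(intensity[y][x] - previous) > threshold:
                length = steps
                break
        signature.append(length)
    return signature


INTENSITY = [
    [200, 200, 200, 200, 200],
    [200, 100, 100, 100, 200],
    [200, 100, 100, 100, 100],
    [200, 100, 100, 100, 200],
    [200, 200, 200, 200, 200],
]
print(internal_signature(INTENSITY))  # [2, 2, -1, 2, 2, 2, 2, 2]
```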
[0073] For the color histogram 410, several pixel samples are taken
in the cropped image in the original color space, at characteristic
positions relative to the image. Then, color value intervals of
equal length are made for each sample and occurrences of values
from each interval are counted. The following is an example of the
color histogram. For each pixel the RGB pixel values are converted
to the LUV color space.
TABLE-US-00001
  R   G   B   L   U   V
  FF  10  20  6A  D7  AD

Then the occurrences of each (L, U, V) value in each set of 3 lines
are counted,

TABLE-US-00002
  L   U   V   OCC
  0   0   0   152
  C8  5B  8A  295
  8A  5B  8A  198
  60  48  2A  90
  3C  5A  70  65
  6A  D7  AD  170

and finally the first three colors by occurrence are selected, in
order of occurrence.

TABLE-US-00003
  L   U   V   OCC
  C8  5B  8A  295
  8A  5B  8A  198
  6A  D7  AD  170
Next, a table is made containing the top three colors by occurrence
for all four sampling directions. That makes the internal color
signature.
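The selection of the top three colors can be reproduced with a counter over the (L, U, V) values. The sketch below uses the occurrence counts from the example tables; the RGB to LUV conversion is omitted, and the `quantize` helper with its interval length of 16 is a hypothetical illustration of the equal-length intervals.

```python
from collections import Counter


def quantize(color, interval=16):
    """Map each channel to the start of its equal-length interval."""
    return tuple(channel // interval * interval for channel in color)


def top_colors(occurrences, n=3):
    """Return the n most frequent colors, in order of occurrence count."""
    return Counter(occurrences).most_common(n)


# Occurrence counts of (L, U, V) values taken from the example table.
OCCURRENCES = {
    (0x00, 0x00, 0x00): 152,
    (0xC8, 0x5B, 0x8A): 295,
    (0x8A, 0x5B, 0x8A): 198,
    (0x60, 0x48, 0x2A): 90,
    (0x3C, 0x5A, 0x70): 65,
    (0x6A, 0xD7, 0xAD): 170,
}

for (l, u, v), count in top_colors(OCCURRENCES):
    print(f"{l:02X} {u:02X} {v:02X} {count}")
```

Running the same selection for each of the four sampling directions and stacking the results gives the table that forms the internal color signature.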
[0074] The external image signature 408, internal image signature
409 and color histogram 410, along with the bounding box dimensions
are passed to the image signature generator 411 which produces the
image signature 412. The image signature can also be computed using
traditional feature detection algorithms, such as BRISK. Those
skilled in the art of image feature detection are familiar with the
BRISK algorithm and its computational efficiency. BRISK was designed
to match images with a high level of detail and has a configurable
(but large by default) number of keypoints that are used in
comparison. Hence, product images with a lower level of detail need
fewer keypoints for comparison and therefore an almost
proportionally lower computation time. Another performance
enhancement may be made by using only the image part within the
cropped and scaled image 407. Then, the scale-space calculation
phase in the BRISK algorithm can be omitted, as the scale dimension
is invariant.
[0075] Consequently, the product detection in images, besides being
performed by the proposed image signature algorithm, can also be
done by some other familiar algorithms in the field, in conjunction
with or as a replacement for the image signature algorithm, whilst
satisfying conditions for more efficient utilization than in a
regular use case for those algorithms. The processing is
illustrated in FIGS. 5A-5J. Shown
are the original image 501 and the scaled, gray-scale image 502.
After the background type is detected 503 the bounding box is found
in the analyzed image 504. The analyzed image is cropped to
bounding box edges 505. 506 shows the scaled bounding box. Two
image signatures are computed: the external 507 and the internal
image signature 508. 509 shows the ray color sample and 510
shows the color sample.
[0076] Turning now to FIGS. 6A-6I, the user navigates to a product
information site web page containing the product image, product
information and additional images 602, in browser 601. The
previously installed web browser device is displayed in the browser
address bar as an icon 603. When the user presses the icon the panel
with images found on the product information web page will be
displayed 604. The user can then, in browser 605, select an image
606 to lookup. The selected image will be highlighted 607. Pressing
the "Done" button will send the product information and product
image URL (optionally the image bytes will be sent as well) to the
web service. The web browser device icon will be updated 608 to
show the current status of the lookup 609. When the number of found
results appears 610, the user can click 611 on the icon to see the
lookup results. The lookup results list 613 in browser 612 can contain the
same and/or similar products found on different store pages.
[0077] In the case where the product database is normalized and the
same products from different retailers are grouped together, the user
can be presented with a list of single products which might match
the product and/or the image on the page that is being searched
for. The user then selects one of the normalized products and the
user is then presented with a list of the stores that carry that
single product. If the user is interested in similar products the
user can also indicate that they want to see similar products. This
search is facilitated by preprocessing the products and grouping
similar products by image characteristics, product classification
and the same products by product record and image signature. Brands
make products in certain categories so it is possible to group
different manufacturers' products by category.
[0078] This direct image search and product information lookup from
remote shopping engine, retailer, manufacturer and other shopping
related pages provides an efficient method for shoppers to find out
competing prices, additional product information, and other
locations where the product can be purchased. The user has an option
614 to select a lookup result from the result list 615 and the
store page containing the selected product will be shown in a popup
window 617 on the browser 616.
[0079] FIG. 7 represents the product identification by an image.
Product record and image URL and/or image bytes 702 are sent from
the remote product information web site 701 to the widget
extraction flow 703. The image is then processed in the image
processing service 704 which produces the image record 705
containing the computed image signature 706. Product and image
records are stored in the product data store and image data store
707. Other social bookmarking widgets 708 can be used on the same
remote sites to extract the image from the remote product
information website 709 and save it on the social bookmarking
website 710. Users that come to the social bookmarking website 710
can use the web browser device to perform a remote lookup 711 on
the selected image. Image URL and/or image bytes 712 are used to
run the lookup 713 which will query the image and the product
record data store 707. The data store 707 will return the product
information and/or images 714. The search results 715 can be
displayed on the remote page 716, or used to build the advertising
platform 717 or the services for external social bookmarking 718.
[0080] The product recognition consists of the following steps.
First, logging into a first remote site which contains a web
browser device and installing the web browser device. Second,
executing a web browser program on a computer device with a screen,
a microprocessor, volatile memory and persistent storage such as a
hard disk drive or flash memory. Navigating the Internet via web
browser to find a second remote site to find a URL containing a
single product page by searching, browsing or directly typing in a
known URL. The URL is sent to a remote computer server, which
contains a microprocessor, volatile memory and persistent storage
such as a hard disk drive or flash memory, over a network
connection to the Internet to retrieve the single product page (the
web page contents--the HTML) and send it over a network connection
to the Internet and rendering the HTML for the second remote site
in the browser. The user can optionally indicate that similar
products be included in the search results. Third, pressing the web
browser device button. Pressing the "find" button in the web
browser device panel. Selecting a product image from the
multi-image view to lookup. Fourth, sending the product image
signature and optional product information from the client computer
to the first remote site's server. Fifth, the first remote site's
server performs the image signature computation which produces an
image signature. Sixth, sending the image signature to the image
signature lookup which finds a list of product records for the same
or similar products with matching (a range check is performed to
allow for image artifacts such as noise) image signatures. Seventh,
performing the product information lookup which finds a list of
product records matching the client side product information.
Eighth, optionally combining the image signature and product lookup
results. Further refining the combined search results by checking
for the same or similar products in the combined list, using such
checks as a range check on the price, similar categories. In the
event that two similar products have the same signature the product
information is used to verify that the combined results contain the
same product. If the user requested that similar products be
returned then a combined result including similar products is
returned to the user. Ninth, sending the resulting product list
from the first remote site via a JSON file over the world wide web
and displaying it in the user's browser which is executing on their
client computing device. Tenth, optionally adding the product(s) in
the search results in the web browser executing on the client
computing device to the user's collection on the first remote site.
And eleventh, the user selects retailer sites to visit by clicking
on the links in the returned search results.
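The image signature lookup with a range check, combined with a price range check on the product information, can be sketched as follows. The record layout, the per-element tolerance and the relative price range are hypothetical; a production lookup would use an index rather than a linear scan.

```python
def signatures_match(a, b, tolerance=1):
    """Range check: every element may differ by at most `tolerance`,
    which allows for image artifacts such as noise."""
    return len(a) == len(b) and all(abs(x - y) <= tolerance for x, y in zip(a, b))


def lookup(query_signature, query_price, records, tolerance=1, price_range=0.25):
    """Return records whose signature matches and, when a price is known,
    whose price lies within `price_range` (as a fraction) of the query price."""
    results = []
    for record in records:
        if not signatures_match(query_signature, record["signature"], tolerance):
            continue
        if query_price is not None:
            if abs(record["price"] - query_price) > price_range * query_price:
                continue
        results.append(record)
    return results


RECORDS = [
    {"id": "a", "signature": [1, 2, 2, 2], "price": 25.0},
    {"id": "b", "signature": [1, 3, 2, 1], "price": 24.0},
    {"id": "c", "signature": [5, 9, 9, 9], "price": 25.0},
    {"id": "d", "signature": [1, 2, 2, 2], "price": 99.0},
]
print([r["id"] for r in lookup([1, 2, 2, 2], 24.99, RECORDS)])
```

Record "d" has a matching signature but is removed by the price check, which illustrates how the product information is used to verify results when two different products share a signature.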
[0081] The MN, M#, PN, UPC product information is extracted from
different places in the HTML code of the product page. The places
include "alt" attribute of the "img" tag, URL and title. The
product information is also extracted from the paragraphs,
headings, breadcrumb and menu links and tables containing product
name, description, category, specifications, retailer name,
etc.
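Collecting product hints from the "alt" attribute of "img" tags, the page title and the URL can be sketched with Python's standard library HTML parser. The class and function names are hypothetical, and treating any standalone 12-digit number as a UPC is an assumption made for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: a UPC appears as a standalone 12-digit number (UPC-A).
UPC_PATTERN = re.compile(r"\b\d{12}\b")


class ProductHintParser(HTMLParser):
    """Collects the page title and the alt text of every img tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alts.append(alt)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_hints(html, url):
    """Return the title, alt texts and UPC candidates found in page and URL."""
    parser = ProductHintParser()
    parser.feed(html)
    text = " ".join([parser.title, url] + parser.alts)
    return {
        "title": parser.title.strip(),
        "alts": parser.alts,
        "upcs": sorted(set(UPC_PATTERN.findall(text))),
    }


HTML = (
    "<html><head><title>Acme Kettle K-200</title></head><body>"
    '<img src="k.jpg" alt="Acme Kettle UPC 012345678905"></body></html>'
)
print(extract_hints(HTML, "https://shop.example.com/p/012345678905"))
```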
[0082] The information extracted from product information site web
pages is used to create clusters of different images of the same
product. The textual information is used to find potentially
similar product records. The images in the similar product records
are then analyzed by the image processing service to join existing
clusters and/or add products to clusters and/or create new
clusters. Comparison of image signatures can thus be used in
conjunction with limited, semi, and/or complete product record
information to identify products in product information sites
(i.e., manufacturer, retailer sites, blogs and social catalog).
[0083] Matching images on a product information site to a product
record facilitates the serving of ads on the social catalog site,
brand analytics on the social catalog site, conversion of links on
the social catalog site to affiliate marketing links for commission
based programs so that when the user clicks on the link to the page
at the original site that contains the image, a cookie is set on
the user's computer and, if the user buys something at the site,
the store pays a commission to the referring site. Additional
advantages include adding meta-information about the product to the
visible text on the page to give the viewer additional information
about the product. Another advantage of the system is setting
keywords in meta-tags and descriptions for search engines to index.
Other SEO and SEM advantages of adding keywords to pages are not
described here but are well understood in the Internet community.
[0084] Furthermore, the merging of structured data and social
networking information greatly increases the accuracy of search
results where qualitative results are desired. The probability of
finding useful information in response to search keywords is
significantly greater. Moreover, because the database contains more
complete information, such as numeric attribute information which
describe the database elements (e.g., the size of an object) and
qualitative information (e.g., an expert's opinion of the
durability of an object), searches can be conducted using general
descriptions of the objects (e.g., search for a digital SLR which
is within a certain dimension range and longevity) or searches can
be conducted using the category, brand, store, and social rating of
the former. Conventional search engines, by contrast, return
results that require the user to manually validate, sort, and
filter the search results. In the case of conventional search
engines that return links based on popularity, the user must search
through the list of links to find relevant web pages and manually
search social networking services to find corresponding qualitative
data.
[0085] With reference now to FIG. 8, portions of the technology are
composed of computer-readable and computer-executable instructions
that reside, for example, in or on computer-usable media of a
computer system. That is, FIG. 8 illustrates one example of a type
of computer that can be used to implement one embodiment of the
present technology.
[0086] Although computer system 800 of FIG. 8 is an example of one
embodiment, the present technology is well suited for operation on
or with a number of different computer systems including general
purpose networked computer systems, embedded computer systems,
routers, switches, server devices, user devices, various
intermediate devices/artifacts, standalone computer systems, mobile
phones, personal data assistants, and the like.
[0087] In one embodiment, computer system 800 of FIG. 8 includes
peripheral computer readable media 801 such as, for example, a
floppy disk, a compact disc, and the like coupled thereto.
[0088] Computer system 800 of FIG. 8 also includes an address/data
bus 810 for communicating information, and a processor 8091 coupled
to bus 810 for processing information and instructions. In one
embodiment, computer system 800 includes a multi-processor
environment in which a plurality of processors 8092, 8093 are
present. Conversely, computer system 800 is also well suited to
having a single processor such as, for example, processor 8091.
Processors 8091, 8092, 8093 may be any of various types of
microprocessors. Computer system 800 also includes data storage
features such as a computer usable volatile memory 806, e.g. random
access memory (RAM), coupled to bus 810 for storing information and
instructions for processors 8091, 8092 and 8093.
[0089] Computer system 800 also includes computer usable
non-volatile memory 808, e.g. read only memory (ROM), coupled to
bus 810 for storing static information and instructions for
processors 8091, 8092, 8093. Also present in computer system 800 is
a data storage unit 807 (e.g., a magnetic or optical disk and disk
drive) coupled to bus 810 for storing information and instructions.
Computer system 800 also includes an optional alpha-numeric input
device 812 including alpha-numeric and function keys coupled to bus
810 for communicating information and command selections to
processors 8091, 8092, 8093. Computer system 800 also includes an
optional cursor control device 813 coupled to bus 810 for
communicating user input information and command selections to
processor 8091 or processors 8091, 8092, 8093. In one embodiment,
an optional display device 811 is coupled to bus 810 for displaying
information.
[0090] Referring still to FIG. 8, optional display device 811 of
FIG. 8 may be a liquid crystal device, cathode ray tube, plasma
display device or other display device suitable for creating
graphic images and alpha-numeric characters recognizable to a user.
Optional cursor control device 813 allows the computer user to
dynamically signal the movement of a visible symbol (cursor) on a
display screen of display device 811. Implementations of cursor
control device 813 include a trackball, mouse, touch pad, joystick
or special keys on alphanumeric input device 812 capable of
signaling movement of a given direction or manner of displacement.
Alternatively, in one embodiment, the cursor can be directed and/or
activated via input from alpha-numeric input device 812 using
special keys and key sequence commands or other means such as, for
example, voice commands.
[0091] Computer system 800 also includes an I/O device 814 for
coupling computer system 800 with external entities. In one
embodiment, I/O device 814 is a modem for enabling wired or
wireless communications between computer system 800 and an external
network such as, but not limited to, the Internet. Referring still
to FIG. 8, various other components are depicted for computer
system 800. Specifically, when present, an operating system 802,
applications 803, modules 804, and data 805 are shown as typically
residing in one or some combination of computer usable volatile
memory 806, e.g. random access memory (RAM), and data storage unit
807. However, in an alternate embodiment, operating system 802 may
be stored in another location such as on a network or on a flash
drive. Further, operating system 802 may be accessed from a remote
location via, for example, a coupling to the Internet. In one
embodiment, the present technology is stored as an application 803
or module 804 in memory locations within RAM 806 and memory areas
within data storage unit 807.
[0092] The present technology may be described in the general
context of computer-executable instructions stored on a computer
readable medium that may be executed by a computer. However, one
embodiment of the present technology may also utilize a distributed
computing environment where tasks are performed remotely by devices
linked through a communications network.
[0093] It should be further understood that the examples and
embodiments pertaining to the systems and methods disclosed herein
are not meant to limit the possible implementations of the present
technology. Further, although the subject matter has been described
in language specific to structural features and/or methodological
acts, it is to be understood that the subject matter defined in the
appended claims is not necessarily limited to the specific features
or acts described above. Rather, the specific features and acts
described above are disclosed as example forms of implementing the
claims.
[0094] Since other modifications and changes varied to fit
particular operating requirements and environments will be apparent
to those skilled in the art, the invention is not considered
limited to the example chosen for purposes of disclosure, and
covers all changes and modifications which do not constitute
departures from the true spirit and scope of this invention.
* * * * *