U.S. patent application number 14/401828 was filed with the patent office on 2015-06-18 for systems and methods for re-ranking ranked search results.
The applicant listed for this patent is GOOGLE INC. The invention is credited to Chung Tin Kwok, Zhihuan Qiu, Lei Zhong.
Application Number | 20150169584 14/401828 |
Document ID | / |
Family ID | 48570457 |
Filed Date | 2015-06-18 |
United States Patent Application | 20150169584
Kind Code | A1
Kwok; Chung Tin; et al. | June 18, 2015
SYSTEMS AND METHODS FOR RE-RANKING RANKED SEARCH RESULTS
Abstract
A system, a computer-readable storage medium storing at least one
program, and a computer-implemented method for re-ranking ranked
search results are presented. Ranked search results satisfying a
search query are obtained, where the ranked search results include
a first search result corresponding to a first document associated
with a first entity and a second search result corresponding to a
second document associated with a second entity, and where the
first search result is ranked higher than the second search result.
The first document and the second document are determined to
satisfy a similarity criterion. The second entity is determined to
satisfy a predefined authorship differential with respect to the
first entity. Responsive to determining that the second entity
satisfies the predefined authorship differential with respect to
the first entity, the second search result and the first search
result in the ranked search results are swapped to produce
re-ranked search results.
Inventors: Kwok; Chung Tin (Mountain View, CA); Zhong; Lei (Sunnyvale, CA); Qiu; Zhihuan (Sunnyvale, CA)
Applicant:
Name | City | State | Country | Type
GOOGLE INC. | Mountain View | CA | US |
Family ID: 48570457
Appl. No.: 14/401828
Filed: May 17, 2013
PCT Filed: May 17, 2013
PCT NO: PCT/US13/41722
371 Date: November 17, 2014
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number
61648553 | May 17, 2012 |
Current U.S. Class: 707/730
Current CPC Class: G06F 16/24578 20190101; G06F 16/951 20190101; G06F 16/2228 20190101; G06F 16/248 20190101
International Class: G06F 17/30 20060101 G06F017/30
Claims
1. A computer-implemented method for re-ranking ranked search
results, performed on a server system having at least one processor
and memory storing at least one program for execution by the at
least one processor to perform the method, comprising: obtaining
ranked search results satisfying a search query, the ranked search
results including a first search result corresponding to a first
document associated with a first entity and a second search result
corresponding to a second document associated with a second entity,
the first search result being ranked higher than the second search
result; determining that the first document and the second document
satisfy a similarity criterion; determining that the second entity
satisfies a predefined authorship differential with respect to the
first entity; and responsive to determining that the second entity
satisfies the predefined authorship differential with respect to
the first entity, swapping the second search result and the first
search result in the ranked search results to produce re-ranked
search results.
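Illustrative sketch (not part of the claims): the steps recited in claim 1 might be arranged as below. The `SearchResult` record, the function name `rerank_pair`, and the two callables passed in are hypothetical names introduced only for this example; the claim does not prescribe how the similarity criterion or the authorship differential is evaluated.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SearchResult:
    """Hypothetical record: a document and the entity it is associated with."""
    document: str
    entity: str


def rerank_pair(
    results: List[SearchResult],
    first: int,
    second: int,
    is_similar: Callable[[str, str], bool],
    satisfies_differential: Callable[[str, str], bool],
) -> List[SearchResult]:
    """Return a copy of `results` with positions `first` and `second` swapped
    when both determinations hold; otherwise return an unchanged copy.

    `first` must be ranked higher (smaller index) than `second`.
    """
    if not 0 <= first < second < len(results):
        raise ValueError("first must be ranked higher than second")
    reranked = list(results)
    a, b = reranked[first], reranked[second]
    # Swap only if the documents are similar AND the second entity satisfies
    # the authorship differential with respect to the first entity.
    if is_similar(a.document, b.document) and satisfies_differential(b.entity, a.entity):
        reranked[first], reranked[second] = b, a
    return reranked
```

The function returns a new list rather than mutating its input, so the originally ranked results remain available for comparison.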
2. The computer-implemented method of claim 1, wherein determining
that the first document and the second document satisfy the
similarity criterion includes: identifying a first plurality of
portions of content for the first document; identifying a second
plurality of portions of content for the second document; and
determining that at least a predetermined number of portions in the
first plurality of portions of content and in the second plurality
of portions of content are identical.
3. The computer-implemented method of claim 2, wherein respective
portions of content in the first plurality of portions of content
and the second plurality of portions of content include a
predetermined amount of data.
4. The computer-implemented method of claim 2, wherein the
predetermined number of portions encompasses a predetermined
percentage of the data in the first document.
5. The computer-implemented method of claim 2, wherein the
predetermined number of portions is a threshold percentage of the
portions of content in the first plurality of portions of
content.
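Illustrative sketch (not part of the claims): one possible reading of claims 2-5, assuming that a "portion" is a fixed-size run of characters and that the predetermined number is expressed as a fraction of the first document's portions (the claim 5 variant). The function names, the default portion size of 8 characters, and the default threshold of 0.5 are assumptions chosen for the example.

```python
from typing import List


def split_portions(text: str, portion_size: int) -> List[str]:
    """Split text into consecutive portions of `portion_size` characters
    (the last portion may be shorter)."""
    if portion_size <= 0:
        raise ValueError("portion_size must be positive")
    return [text[i:i + portion_size] for i in range(0, len(text), portion_size)]


def portions_similar(first_doc: str, second_doc: str,
                     portion_size: int = 8, threshold: float = 0.5) -> bool:
    """True when at least `threshold` of the first document's portions
    also occur among the second document's portions."""
    first = split_portions(first_doc, portion_size)
    if not first:
        return False  # an empty document has no portions to match
    second = set(split_portions(second_doc, portion_size))
    shared = sum(1 for portion in first if portion in second)
    return shared / len(first) >= threshold
```

Fixed-offset portions are simple but sensitive to insertions: one added character near the start of a document shifts every later portion.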
6. The computer-implemented method of any of claims 1-5, wherein
determining that the first document and the second document satisfy
the similarity criterion includes: identifying a first plurality of
shingles for the first document; identifying a second plurality of
shingles for the second document; and determining that a
predetermined quantity of shingles in the first plurality of
shingles and in the second plurality of shingles of content are
identical.
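Illustrative sketch (not part of the claims): the shingle comparison of claim 6, assuming that a shingle is a run of k consecutive whitespace-delimited tokens, in line with the description of shingles as a predetermined number of consecutive tokens. The tokenizer, the value of k, and the `min_shared` quantity are assumptions for the example.

```python
from typing import Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of k-token shingles of `text`, tokenized on whitespace.
    A text with fewer than k tokens yields an empty set."""
    if k <= 0:
        raise ValueError("k must be positive")
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def shingles_similar(first_doc: str, second_doc: str,
                     k: int = 3, min_shared: int = 2) -> bool:
    """True when the two documents have at least `min_shared` identical shingles."""
    return len(shingles(first_doc, k) & shingles(second_doc, k)) >= min_shared
```

For example, "the quick brown fox jumps" and "a quick brown fox jumps over" share two 3-token shingles, so they satisfy the criterion with `min_shared=2` but not with `min_shared=3`.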
7. The computer-implemented method of any of claims 1-5, wherein
the first document and the second document satisfy the similarity
criterion when a character count of the first document and a
character count of the second document deviate by less than a
character count threshold.
8. The computer-implemented method of any of claims 1-5, wherein
the first document and the second document satisfy the similarity
criterion when a Levenshtein edit distance from the first document
to the second document is less than a predetermined number of
character edits.
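Illustrative sketch (not part of the claims): the two alternative criteria of claims 7 and 8. The function names and threshold parameters are assumptions; the edit distance is computed with the standard two-row dynamic-programming formulation of Levenshtein distance.

```python
def char_count_similar(first_doc: str, second_doc: str, max_delta: int) -> bool:
    """True when the character counts deviate by less than `max_delta`."""
    return abs(len(first_doc) - len(second_doc)) < max_delta


def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        previous = current
    return previous[-1]


def edit_distance_similar(first_doc: str, second_doc: str, max_edits: int) -> bool:
    """True when the edit distance is less than `max_edits` character edits."""
    return levenshtein(first_doc, second_doc) < max_edits
```

The edit-distance computation takes time proportional to the product of the two document lengths, so it is considerably more expensive on long documents than the character-count check.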
9. The computer-implemented method of any of claims 1-8, wherein
determining that the second entity satisfies the predefined
authorship differential with respect to the first entity includes:
determining an author score for the first entity and an author
score for the second entity, wherein the author score for a
respective entity, in the set of entities consisting of the first
entity and the second entity, indicates a likelihood that the
respective entity is an author of original content; and when the
composite author score for the second entity deviates from the
composite author score for the first entity by at least a
predetermined threshold, determining that the second entity
satisfies the predefined authorship differential with respect to
the first entity.
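Illustrative sketch (not part of the claims): the threshold comparison of claim 9. The sketch assumes the directional reading used in the corresponding system claim (the second entity's score exceeds the first entity's score by at least the threshold); the function name, the dictionary of scores, and the numeric scale are hypothetical.

```python
from typing import Mapping


def satisfies_authorship_differential(
    author_scores: Mapping[str, float],
    first_entity: str,
    second_entity: str,
    threshold: float,
) -> bool:
    """True when the second entity's author score is greater than the first
    entity's author score by at least `threshold`.

    `author_scores` maps an entity identifier to its author score, i.e. the
    likelihood that the entity is an author of original content.
    """
    difference = author_scores[second_entity] - author_scores[first_entity]
    return difference >= threshold
```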
10. The computer-implemented method of claim 9, wherein determining
the author score for a respective entity includes: determining a
citation score for the respective entity, wherein the citation
score corresponds to a frequency at which content associated with
the respective entity is cited; determining an original author
score for the respective entity, wherein the original author score
corresponds to a percentage of content associated with the
respective entity that is a first instance of the content in an
index of known content; and combining the citation score and the
original author score using a predetermined function to produce the
author score.
11. The computer-implemented method of claim 10, wherein
determining the citation score for the respective entity includes
calculating the citation score as a quotient of a number of
citations to content that is associated with the respective entity
and a quantity of content that has at least one citation.
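Illustrative sketch (not part of the claims): the citation score of claim 11 and the combination step of claim 10. Two assumptions are made and should not be read as the claimed method: the "quantity of content that has at least one citation" is taken to mean the entity's own cited items, and the "predetermined function" is stood in for by a weighted sum. The claims do not specify either.

```python
from typing import Sequence


def citation_score(citations_per_item: Sequence[int]) -> float:
    """Quotient of the total number of citations and the number of content
    items that have at least one citation (0.0 when nothing is cited)."""
    cited = [count for count in citations_per_item if count > 0]
    if not cited:
        return 0.0
    return sum(cited) / len(cited)


def author_score(citation: float, original: float,
                 citation_weight: float = 0.5) -> float:
    """Combine the two scores with a weighted sum, an assumed stand-in for
    the predetermined function."""
    if not 0.0 <= citation_weight <= 1.0:
        raise ValueError("citation_weight must be between 0 and 1")
    return citation_weight * citation + (1.0 - citation_weight) * original
```

In practice the two inputs would likely need to be normalized to a common scale before combining, since the citation score is an average count and the original author score is a percentage.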
12. The computer-implemented method of any of claims 10-11, wherein
determining the original author score for the respective entity
includes: identifying a plurality of shingles representing content
in the index of known content identified as being associated with
the respective entity; and calculating a percentage of the
plurality of shingles that are first instances of the shingles in
the index of known content.
13. The computer-implemented method of claim 12, wherein
calculating the percentage of the plurality of shingles that are
first instances includes: determining a first number of the
plurality of shingles, each having an associated timestamp, for
which there are no identical shingles in the index of known content
having an earlier timestamp; and calculating the percentage as a
quotient of the first number and a total number of shingles in the
plurality of shingles.
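Illustrative sketch (not part of the claims): the percentage calculation of claims 12 and 13. The sketch assumes the index of known content can report, for each shingle, the earliest timestamp at which that shingle appears; the `earliest_in_index` mapping and the function name are hypothetical, and the result is returned as a fraction between 0 and 1.

```python
from typing import Dict, Hashable, Iterable, Tuple


def original_author_score(
    entity_shingles: Iterable[Tuple[Hashable, float]],
    earliest_in_index: Dict[Hashable, float],
) -> float:
    """Fraction of the entity's shingles that are first instances.

    entity_shingles:   (shingle, timestamp) pairs for content associated
                       with the entity.
    earliest_in_index: earliest timestamp at which each shingle appears
                       anywhere in the index of known content.
    """
    total = 0
    first_instances = 0
    for shingle, timestamp in entity_shingles:
        total += 1
        # A shingle missing from the mapping has no other known instance.
        earliest = earliest_in_index.get(shingle, timestamp)
        if earliest >= timestamp:
            first_instances += 1  # no identical shingle with an earlier timestamp
    return first_instances / total if total else 0.0
```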
14. The computer-implemented method of any of claims 10-11, wherein
determining the original author score for the respective entity
includes: identifying a plurality of portions of content in the
index of known content identified as being associated with the
respective entity, each portion in the plurality of portions
representing a predetermined amount of data in the index of known
content; and calculating a percentage of the plurality of the
portions that are first instances of the portions of content in the
index of known content.
15. The computer-implemented method of claim 14, wherein
calculating the percentage of the plurality of portions of content
that are first instances includes: determining a first number of
the plurality of portions of content, each having an associated
timestamp, for which there are no identical portions of content in
the index of known content having an earlier timestamp; and
calculating the percentage as a quotient of the first number and a
total number of portions of content in the plurality of portions of
content.
16. The computer-implemented method of claim 9, wherein, when the
author score for the first entity and the second entity is within
the predetermined threshold of each other, the method includes:
obtaining a first timestamp corresponding to a time when the first
document was included in an index of known content; obtaining a
second timestamp corresponding to a time when the second document
was included in the index of known content; and when the second
timestamp is less than the first timestamp by at least a
predetermined time interval, determining that the second entity
satisfies the predefined authorship differential with respect to
the first entity.
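Illustrative sketch (not part of the claims): the timestamp fallback of claim 16 layered on the score comparison of claim 9. Timestamps are assumed to be seconds since an arbitrary epoch, and the parameter names are hypothetical.

```python
def authorship_differential(
    first_score: float,
    second_score: float,
    first_indexed_at: float,
    second_indexed_at: float,
    score_threshold: float,
    min_lead_seconds: float,
) -> bool:
    """Decide whether the second entity satisfies the authorship differential.

    1. If the second score exceeds the first by at least `score_threshold`,
       the differential is satisfied on scores alone.
    2. If the scores are within `score_threshold` of each other, fall back to
       index timestamps: the second document must have been indexed earlier
       than the first by at least `min_lead_seconds`.
    3. Otherwise the first entity scores clearly higher, so it is not satisfied.
    """
    difference = second_score - first_score
    if difference >= score_threshold:
        return True
    if abs(difference) < score_threshold:
        return first_indexed_at - second_indexed_at >= min_lead_seconds
    return False
```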
17. The computer-implemented method of claim 1, wherein the ranked
search results are obtained from a search engine, and wherein after
swapping the second search result and the first search result in
the ranked search results to produce the re-ranked search results,
the method further comprises providing the re-ranked search results
to the search engine.
18. A system to re-rank ranked search results, comprising: at least
one processor; memory; and at least one program stored in the
memory and executable by the at least one processor, the at least
one program comprising instructions to: obtain ranked search
results satisfying a search query, the ranked search results
including a first search result corresponding to a first document
associated with a first entity and a second search result
corresponding to a second document associated with a second entity,
the first search result being ranked higher than the second search
result; determine that the first document and the second document
satisfy a similarity criterion; determine that the second entity
satisfies a predefined authorship differential with respect to the
first entity; and responsive to determining that the second entity
satisfies the predefined authorship differential with respect to
the first entity, swap the second search result and the first
search result in the ranked search results to produce re-ranked
search results.
19. The system of claim 18, wherein the instructions to determine
that the first document and the second document satisfy the
similarity criterion include instructions to: identify a first
plurality of portions of content for the first document; identify a
second plurality of portions of content for the second document;
and determine that at least a predetermined number of portions in
the first plurality of portions of content and in the second
plurality of portions of content are identical.
20. The system of claim 19, wherein respective portions of content
in the first plurality of portions of content and the second
plurality of portions of content include a predetermined amount of
data.
21. The system of claim 19, wherein the predetermined number of
portions encompasses a predetermined percentage of the data in the
first document.
22. The system of claim 19, wherein the predetermined number of
portions is a threshold percentage of the portions of content in
the first plurality of portions of content.
23. The system of any of claims 18-22, wherein the instructions to
determine that the first document and the second document satisfy
the similarity criterion include instructions to: identify a first
plurality of shingles for the first document; identify a second
plurality of shingles for the second document; and determine that a
predetermined quantity of shingles in the first plurality of
shingles and in the second plurality of shingles of content are
identical.
24. The system of any of claims 18-22, wherein the first document
and the second document satisfy the similarity criterion when a
character count of the first document and a character count of the
second document deviate by less than a character count
threshold.
25. The system of any of claims 18-22, wherein the first document
and the second document satisfy the similarity criterion when a
Levenshtein edit distance from the first document to the second
document is less than a predetermined number of character
edits.
26. The system of any of claims 18-25, wherein the instructions to
determine that the second entity satisfies the predefined
authorship differential with respect to the first entity include
instructions to: determine an author score for the first entity and
an author score for the second entity, wherein the author score for
a respective entity, in the set of entities consisting of the first
entity and the second entity, indicates a likelihood that the
respective entity is an author of original content; and when the
author score for the second entity is greater than the author score
for the first entity by at least a predetermined threshold,
determine that the second entity satisfies the predefined
authorship differential with respect to the first entity.
27. The system of claim 26, wherein the instructions to determine
the author score for a respective entity include instructions to:
determine a citation score for the respective entity, wherein the
citation score corresponds to a frequency at which content
associated with the respective entity is cited; determine an
original author score for the respective entity, wherein the
original author score corresponds to a percentage of content
associated with the respective entity that is a first instance of
the content in an index of known content; and combine the citation
score and the original author score using a predetermined function
to produce the author score.
28. The system of claim 27, wherein the instructions to determine
the citation score for the respective entity include instructions
to calculate the citation score as a quotient of a number of
citations to content that is associated with the respective entity
and a quantity of content that has at least one citation.
29. The system of any of claims 27-28, wherein the instructions to
determine the original author score for the respective entity
include instructions to: identify a plurality of shingles
representing content in the index of known content identified as
being associated with the respective entity; and calculate a
percentage of the plurality of shingles that are first instances of
the shingles in the index of known content.
30. The system of claim 29, wherein the instructions to calculate
the percentage of the plurality of shingles that are first
instances include instructions to: determine a first number of the
plurality of shingles, each having an associated timestamp, for
which there are no identical shingles in the index of known content
having an earlier timestamp; and calculate the percentage as a
quotient of the first number and a total number of shingles in the
plurality of shingles.
31. The system of any of claims 27-28, wherein the instructions to
determine the original author score for the respective entity
include instructions to: identify a plurality of portions of
content in the index of known content identified as being
associated with the respective entity, each portion in the
plurality of portions representing a predetermined amount of data
in the index of known content; and calculate a percentage of the
plurality of the portions that are first instances of the portions
of content in the index of known content.
32. The system of claim 31, wherein the instructions to calculate
the percentage of the plurality of portions of content that are
first instances include instructions to: determine a first number
of the plurality of portions of content, each having an associated
timestamp, for which there are no identical portions of content in
the index of known content having an earlier timestamp; and
calculate the percentage as a quotient of the first number and a
total number of portions of content in the plurality of portions of
content.
33. The system of claim 26, wherein, when the author score for the
first entity and the second entity is within the predetermined
threshold of each other, the at least one program includes
instructions to: obtain a first timestamp corresponding to a time
when the first document was included in an index of known content;
obtain a second timestamp corresponding to a time when the second
document was included in the index of known content; and when the
second timestamp is less than the first timestamp by at least a
predetermined time interval, determine that the second entity
satisfies the predefined authorship differential with respect to
the first entity.
34. The system of claim 18, wherein the ranked search results are
obtained from a search engine, and wherein after swapping the
second search result and the first search result in the ranked
search results to produce the re-ranked search results, the method
further comprises providing the re-ranked search results to the
search engine.
35. A non-transitory computer readable storage medium storing at
least one program configured for execution by at least one
processor of a computer system, the at least one program comprising
instructions to: obtain ranked search results satisfying a search
query, the ranked search results including a first search result
corresponding to a first document associated with a first entity
and a second search result corresponding to a second document
associated with a second entity, the first search result being
ranked higher than the second search result; determine that the
first document and the second document satisfy a similarity
criterion; determine that the second entity satisfies a predefined
authorship differential with respect to the first entity; and
responsive to determining that the second entity satisfies the
predefined authorship differential with respect to the first
entity, swap the second search result and the first search result
in the ranked search results to produce re-ranked search
results.
36. The non-transitory computer readable storage medium of claim
35, wherein the instructions to determine that the first document
and the second document satisfy the similarity criterion include
instructions to: identify a first plurality of portions of content
for the first document; identify a second plurality of portions of
content for the second document; and determine that at least a
predetermined number of portions in the first plurality of portions
of content and in the second plurality of portions of content are
identical.
37. The non-transitory computer readable storage medium of claim
36, wherein respective portions of content in the first plurality
of portions of content and the second plurality of portions of
content include a predetermined amount of data.
38. The non-transitory computer readable storage medium of claim
36, wherein the predetermined number of portions encompasses a
predetermined percentage of the data in the first document.
39. The non-transitory computer readable storage medium of claim
36, wherein the predetermined number of portions is a threshold
percentage of the portions of content in the first plurality of
portions of content.
40. The non-transitory computer readable storage medium of any of
claims 35-39, wherein the instructions to determine that the first
document and the second document satisfy the similarity criterion
include instructions to: identify a first plurality of shingles for
the first document; identify a second plurality of shingles for the
second document; and determine that a predetermined quantity of
shingles in the first plurality of shingles and in the second
plurality of shingles of content are identical.
41. The non-transitory computer readable storage medium of any of
claims 35-39, wherein the first document and the second document
satisfy the similarity criterion when a character count of the
first document and a character count of the second document deviate
by less than a character count threshold.
42. The non-transitory computer readable storage medium of any of
claims 35-39, wherein the first document and the second document
satisfy the similarity criterion when a Levenshtein edit distance
from the first document to the second document is less than a
predetermined number of character edits.
43. The non-transitory computer readable storage medium of any of
claims 35-42, wherein the instructions to determine that the second
entity satisfies the predefined authorship differential with
respect to the first entity include instructions to: determine an
author score for the first entity and an author score for the
second entity, wherein the author score for a respective entity, in
the set of entities consisting of the first entity and the second
entity, indicates a likelihood that the respective entity is an
author of original content; and when the author score for the
second entity is greater than the author score for the first entity
by at least a predetermined threshold, determine that the second
entity satisfies the predefined authorship differential with
respect to the first entity.
44. The non-transitory computer readable storage medium of claim
43, wherein the instructions to determine the author score for a
respective entity include instructions to: determine a citation
score for the respective entity, wherein the citation score
corresponds to a frequency at which content associated with the
respective entity is cited; determine an original author score for
the respective entity, wherein the original author score
corresponds to a percentage of content associated with the
respective entity that is a first instance of the content in an
index of known content; and combine the citation score and the
original author score using a predetermined function to produce the
author score.
45. The non-transitory computer readable storage medium of claim
44, wherein the instructions to determine the citation score for
the respective entity include instructions to calculate the
citation score as a quotient of a number of citations to content
that is associated with the respective entity and a quantity of
content that has at least one citation.
46. The non-transitory computer readable storage medium of any of
claims 44-45, wherein the instructions to determine the original
author score for the respective entity include instructions to:
identify a plurality of shingles representing content in the index
of known content identified as being associated with the respective
entity; and calculate a percentage of the plurality of shingles
that are first instances of the shingles in the index of known
content.
47. The non-transitory computer readable storage medium of claim
46, wherein the instructions to calculate the percentage of the
plurality of shingles that are first instances include instructions
to: determine a first number of the plurality of shingles, each
having an associated timestamp, for which there are no identical
shingles in the index of known content having an earlier timestamp;
and calculate the percentage as a quotient of the first number and
a total number of shingles in the plurality of shingles.
48. The non-transitory computer readable storage medium of any of
claims 44-45, wherein the instructions to determine the original
author score for the respective entity include instructions to:
identify a plurality of portions of content in the index of known
content identified as being associated with the respective entity,
each portion in the plurality of portions representing a
predetermined amount of data in the index of known content; and
calculate a percentage of the plurality of the portions that are
first instances of the portions of content in the index of known
content.
49. The non-transitory computer readable storage medium of claim
48, wherein the instructions to calculate the percentage of the
plurality of portions of content that are first instances include
instructions to: determine a first number of the plurality of
portions of content, each having an associated timestamp, for which
there are no identical portions of content in the index of known
content having an earlier timestamp; and calculate the percentage
as a quotient of the first number and a total number of portions of
content in the plurality of portions of content.
50. The non-transitory computer readable storage medium of claim
43, wherein, when the author score for the first entity and the
second entity is within the predetermined threshold of each other,
the at least one program includes instructions to: obtain a first
timestamp corresponding to a time when the first document was
included in an index of known content; obtain a second timestamp
corresponding to a time when the second document was included in
the index of known content; and when the second timestamp is less
than the first timestamp by at least a predetermined time interval,
determine that the second entity satisfies the predefined
authorship differential with respect to the first entity.
51. The non-transitory computer readable storage medium of claim
35, wherein the ranked search results are obtained from a search
engine, and wherein after swapping the second search result and the
first search result in the ranked search results to produce the
re-ranked search results, the method further comprises providing
the re-ranked search results to the search engine.
52. A non-transitory computer readable storage medium storing at
least one program configured for execution by at least one
processor of a system, the at least one program comprising
instructions to be executed by the at least one processor so as to
perform the method of any of claims 1-17.
53. A system, comprising: at least one processor; and memory
storing at least one program for execution by the at least one
processor; the system including means for performing the method of
any of claims 1-17.
54. A computer-implemented method for re-ranking ranked search
results, performed on a server system having at least one processor
and memory storing at least one program for execution by the at
least one processor to perform the method, comprising: obtaining
ranked search results satisfying a search query, the ranked search
results including a first search result corresponding to a first
document associated with a first entity and a second search result
corresponding to a second document associated with a second entity,
the first search result being ranked higher than the second search
result; determining that the first document and the second document
satisfy a similarity criterion; obtaining a first timestamp
corresponding to a time when the first document was included in an
index of known content; obtaining a second timestamp corresponding
to a time when the second document was included in the index of
known content; and swapping the second search result and the first
search result in the ranked search results to produce re-ranked
search results when the second timestamp is less than the first
timestamp by at least a predetermined time interval.
55. A non-transitory computer readable storage medium storing at
least one program configured for execution by at least one
processor of a system, the at least one program comprising
instructions to be executed by the at least one processor so as to
perform the method of claim 54.
56. A system, comprising: at least one processor; and memory
storing at least one program for execution by the at least one
processor; the system including means for performing the method of
claim 54.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is a national stage filing of
PCT/US13/41722 (attorney docket No. 060963-5809WO), entitled
"Systems and Methods for Re-Ranking Ranked Search Results" filed
May 17, 2013, which claims priority to U.S. Provisional Application
61/648,553 (attorney docket No. 060963-5809PR) entitled "Systems
and Methods for Re-Ranking Ranked Search Results" filed May 17,
2012, each of which is hereby incorporated by reference herein in
its entirety.
[0002] This application is also a national stage filing of
PCT/US13/41717 (attorney docket No. 060963-5808WO), entitled
"Systems and Methods for Crawling and Indexing Content" filed May
17, 2013, which claims priority to U.S. Provisional Application
61/648,546 (attorney docket No. 060963-5808PR) entitled "Systems
and Methods for Crawling and Indexing Content," filed May 17, 2012,
each of which is hereby incorporated by reference herein in its
entirety.
[0003] This application is also a national stage filing of
PCT/US13/41723 (attorney docket No. 060963-5810WO), entitled
"Systems and Methods for Determining a Likelihood That an Entity is
an Author of Original Content" filed May 17, 2013, which claims
priority to U.S. Provisional Application 61/648,562 (attorney
docket No. 060963-5810PR) entitled "Systems and Methods for
Determining a Likelihood That an Entity is an Author of Original
Content," filed May 17, 2012, each of which is hereby incorporated
by reference herein in its entirety.
TECHNICAL FIELD
[0004] The disclosed embodiments relate generally to re-ranking ranked
search results.
BACKGROUND
[0005] When a user submits a search query to a search engine, the
search engine identifies search results satisfying the search query
and ranks the search results based on a number of ranking factors
(e.g., relative relevance of search results). The ranked search
results may include two search results that have similar (or
substantially the same) content. For example, a first document
corresponding to a first search result may be associated with a
first entity that is known to not produce original content (e.g.,
the first entity copies or otherwise redistributes original content
authored by other entities). In contrast, a second document
corresponding to a second search result may be associated with a
second entity that is known to produce original content. In some
instances, it is desirable to rank the search results that
correspond to documents that are associated with entities that are
authors of original content higher than search results
corresponding to documents that are associated with entities that are
not authors of original content even though the documents
associated with entities that are not authors of original content
have higher relevancy scores.
SUMMARY
[0006] One aspect of the present disclosure provides a
computer-implemented method for crawling and indexing content,
performed on a server system having at least one processor and
memory storing at least one program for execution by the at least
one processor to perform the method, comprising receiving a
message, the message including (i) submitted content, where the
submitted content is identified as being published by an entity,
and (ii) a link to a location on a resource hosting the submitted
content; evaluating whether the submitted content is represented in
an index of known content to determine whether the submitted
content is new relative to the known content; and in response to
the evaluating, issuing a request to a search engine to crawl and
index the submitted content hosted by the resource associated with
the link when the submitted content is new relative to the known
content, where the request to the search engine to crawl and index
the submitted content hosted by the resource associated with the
link is issued responsive to determining that the submitted content
is deemed to not be represented in the index of known content.
[0007] In some embodiments, the request is a high priority request
that is directed to a high priority crawler of the search engine,
and where a time when the high priority crawler of the search
engine is scheduled to crawl the submitted content occurs before a
time when a standard crawler of the search engine is scheduled to
crawl the submitted content.
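As an illustration only, the effect of a high priority request can be sketched with a single priority queue in which high priority entries are always dequeued before standard ones. The class name, the two priority constants, and the example URLs are hypothetical; the disclosure describes separate crawlers and does not prescribe this data structure.

```python
import heapq
import itertools

# Hypothetical priority levels: lower numbers are crawled first.
HIGH_PRIORITY = 0
STANDARD = 1


class CrawlQueue:
    """Toy crawl schedule in which high priority requests are served first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter keeps first-in, first-out order within a level.
        self._counter = itertools.count()

    def submit(self, url, priority=STANDARD):
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def next_url(self):
        _priority, _order, url = heapq.heappop(self._heap)
        return url


queue = CrawlQueue()
queue.submit("http://example.com/old-page")
queue.submit("http://example.com/new-article", priority=HIGH_PRIORITY)
first = queue.next_url()  # the high priority request, although submitted later
```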
[0008] In some embodiments, the evaluating whether the submitted
content is represented in the index of known content to determine
whether the submitted content is new relative to the known content
includes: identifying a set of shingles representing the submitted
content, a respective shingle in the set of shingles including a
predetermined number of consecutive tokens for the submitted
content; comparing the set of shingles representing the submitted
content with a first plurality of shingles representing the known
content; and when less than a first predetermined threshold number
of the set of shingles representing the submitted content is
included in the first plurality of shingles representing the known
content, determining that the submitted content is new relative to
the known content. In some embodiments, when (i) at least the first
predetermined threshold number of the set of shingles representing
the submitted content is included in the first plurality of
shingles representing the known content, (ii) at least a second
predetermined threshold number of the set of shingles representing
the submitted content is included in a second plurality of shingles
representing other submitted content identified as being published
by the entity, and (iii) at least a third predetermined threshold
of corresponding shingles in the second plurality of shingles
representing the other submitted content are first instances with
respect to corresponding shingles in the first plurality of
shingles representing the known content, the method includes:
determining that the submitted content is new relative to the known
content.
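The basic shingle comparison described at the start of this paragraph can be sketched as follows. This is a minimal sketch under stated assumptions: tokens are whitespace-separated lowercase words, a shingle is a tuple of `k` consecutive tokens, and the function names, the value of `k`, and the threshold are hypothetical rather than taken from the disclosure.

```python
def shingles(text, k=3):
    """Return the set of k-token shingles for a text (assumed tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def is_new(submitted_text, known_shingles, first_threshold, k=3):
    """Content is new when fewer than first_threshold of its shingles are known."""
    submitted = shingles(submitted_text, k)
    overlap = len(submitted & known_shingles)
    return overlap < first_threshold


known = shingles("the quick brown fox jumps over the lazy dog")

# No shared shingles, so the content is treated as new.
fresh = is_new("a completely different sentence about search ranking", known, 2)

# Three shared shingles with a threshold of two, so the content is not new.
copied = is_new("the quick brown fox jumps high", known, 2)
```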
[0009] In some embodiments, when (i) at least the first
predetermined threshold number of the set of shingles representing
the submitted content is included in the first plurality of
shingles representing the known content, (ii) at least a second
predetermined threshold number of the set of shingles representing
the submitted content is included in a second plurality of shingles
representing other submitted content identified as being published
by the entity, and (iii) less than a third predetermined threshold
of corresponding shingles in the second plurality of shingles
representing the other submitted content are first instances with
respect to corresponding shingles in the first plurality of
shingles representing the known content, the method includes:
determining that the submitted content is not new relative to the
known content.
[0010] In some embodiments, when (i) at least the first
predetermined threshold number of the set of shingles representing
the submitted content is included in the first plurality of
shingles representing the known content, and (ii) less than a
second predetermined threshold number of the set of shingles
representing the submitted content is included in a second
plurality of shingles representing other submitted content
identified as being published by the entity, the method includes:
determining that the submitted content is not new relative to the
known content.
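Taken together, paragraphs [0008] to [0010] describe a three-threshold decision. The sketch below is one possible reading of that decision, assuming the three shingle counts have already been computed; the function and parameter names are hypothetical.

```python
def is_new_content(in_known, in_entity_prior, entity_first_instances,
                   first_threshold, second_threshold, third_threshold):
    """Decide whether submitted content is new relative to known content.

    in_known: submitted shingles found among the known-content shingles.
    in_entity_prior: submitted shingles found in the entity's other content.
    entity_first_instances: shingles of the entity's other content that are
        first instances with respect to the known content.
    """
    # [0008]: little overlap with known content means the content is new.
    if in_known < first_threshold:
        return True
    # [0010]: overlaps known content, but not the entity's own prior content.
    if in_entity_prior < second_threshold:
        return False
    # [0008] versus [0009]: overlap is with the entity's own prior content,
    # so the outcome depends on whether that prior content was original.
    return entity_first_instances >= third_threshold


low_overlap = is_new_content(1, 0, 0, 5, 5, 5)
self_overlap_original = is_new_content(8, 7, 9, 5, 5, 5)
self_overlap_copied = is_new_content(8, 7, 2, 5, 5, 5)
overlap_with_others = is_new_content(8, 1, 9, 5, 5, 5)
```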
[0011] In some embodiments, prior to issuing the request to the
search engine to crawl and index the submitted content hosted by
the resource associated with the link, the method includes
determining that the entity is an author of original content based
on an evaluation of other submitted content identified as being
published by the entity, where the other submitted content
identified as being published by the entity is included in the
known content; and the request to the search engine to crawl and
index the submitted content hosted by the resource associated with
the link is issued responsive to determining that the submitted
content is deemed to not be represented in the index of known
content and determining that the entity is an author of original
content.
[0012] In some embodiments,
determining that the entity is an author of original content based
on the other submitted content identified as being published by the
entity includes: identifying a second plurality of shingles
representing the other submitted content identified as being
published by the entity; determining that at least a third
predetermined threshold number of the second plurality of shingles
representing the other submitted content identified as being
published by the entity are first instances with respect to
corresponding shingles in a first plurality of shingles
representing the known content; and responsive to the determining,
determining that the entity is an author of original content.
[0013] In some embodiments,
determining that at least the third predetermined threshold number
of the second plurality of shingles representing the other
submitted content identified as being published by the entity are
first instances with respect to the corresponding shingles in the
first plurality of shingles representing the known content
includes: determining a first number of the second plurality of
shingles representing the other submitted content identified as
being published by the entity that are unique with respect to the
first plurality of shingles representing the known content;
determining a second number of the second plurality of shingles
representing the other submitted content identified as being
published by the entity that are not unique with respect to the
shingles for the known content but that are associated with
timestamps that precede timestamps for corresponding shingles in
the first plurality of shingles representing the known content; and
determining that a sum of the first number and the second number is
at least the third predetermined threshold number.
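The first-instance count described above can be sketched as a sum of two tallies. This sketch assumes that each collection is a dictionary mapping a shingle to its earliest timestamp, and that `known_shingles` holds only records from sources other than the entity; both the data layout and the names are assumptions made for illustration.

```python
def count_first_instances(entity_shingles, known_shingles):
    """Return (unique, earlier) counts for the entity's shingles.

    unique: shingles that do not appear in the known content at all.
    earlier: shingles that do appear, but with an earlier timestamp
        on the entity's side.
    """
    unique = 0
    earlier = 0
    for shingle, timestamp in entity_shingles.items():
        if shingle not in known_shingles:
            unique += 1
        elif timestamp < known_shingles[shingle]:
            earlier += 1
    return unique, earlier


def is_original_author(entity_shingles, known_shingles, third_threshold):
    """The entity qualifies when unique + earlier reaches the threshold."""
    unique, earlier = count_first_instances(entity_shingles, known_shingles)
    return unique + earlier >= third_threshold


# Hypothetical shingle labels and timestamps.
entity = {"a": 10, "b": 20, "c": 30}
known = {"b": 25, "c": 5}
counts = count_first_instances(entity, known)  # "a" is unique, "b" is earlier
```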
[0014] In some embodiments, the method further includes:
identifying a set of shingles for the submitted content; for each
respective shingle in the set of shingles, applying a hash function
to the shingle to produce a respective hash value for the
respective shingle; and storing the link to a location on the
resource hosting the submitted content and a timestamp of the
message in a record in the index of known content indexed by the
respective hash value.
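A minimal sketch of such an index follows, assuming whitespace tokenization, three-token shingles, and SHA-1 as the hash function. None of these choices is specified by the disclosure, and the function names, record layout, link, and timestamp are hypothetical.

```python
import hashlib
from collections import defaultdict


def shingle_hash(shingle):
    """Hash a shingle (a tuple of tokens) to a hex digest (assumed: SHA-1)."""
    return hashlib.sha1(" ".join(shingle).encode("utf-8")).hexdigest()


def index_content(index, text, link, timestamp, k=3):
    """Store (link, timestamp) in a record under the hash of every shingle."""
    tokens = text.split()
    for i in range(len(tokens) - k + 1):
        key = shingle_hash(tuple(tokens[i:i + k]))
        index[key].append({"link": link, "timestamp": timestamp})


index = defaultdict(list)
index_content(
    index,
    "original reporting on search ranking",
    "http://example.com/a",   # hypothetical link to the hosting resource
    1368748800,               # hypothetical Unix timestamp of the message
)
```

Looking up a shingle's hash in `index` then returns every location and timestamp at which that shingle has been seen, which is what the first-instance comparison needs.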
[0015] In some embodiments, the method further includes, for each
respective shingle in the set of shingles, storing an identifier for
the entity in the record of
the index of known content. In some embodiments, the request to the
search engine includes the link to the location on the resource and
a timestamp associated with the message, and where the timestamp
associated with the message is included in a search index entry for
the submitted content. In some embodiments, the message is received
in response to the entity publishing the submitted content on a
website. In some embodiments, the message is received in response
to the entity submitting the submitted content and the link to the
location on the resource hosting the submitted content using a user
interface provided by the server.
[0016] In some embodiments, the message is received in response to
the entity submitting the submitted content using a user interface
provided by the server, and where the submitted content is marked
as being submitted by the entity prior to publication of the
submitted content by the entity. In some embodiments, the entity
includes at least one author of content. In some embodiments, the
entity includes at least one publisher of content. In some
embodiments, the entity includes at least one website.
[0017] In some embodiments, (i) a respective link to a respective
location on a respective resource hosting a respective known
content and (ii) a timestamp associated with the respective known
content are included in a respective record of the index of known
content indexed by a respective hash value for a respective shingle
of the respective known content.
[0018] In some embodiments, prior to receiving the message, the
method includes: receiving, from a client device of the entity, a
registration request to register the entity as a registered author
of content; providing, to the client device of the entity, at least
one verification mechanism to verify that the entity is an author
of content; responsive to the providing, receiving, from the client
device of the entity, a verification request to verify that the
entity is an author of content using the at least one verification
mechanism; verifying that the entity is an author of content using
the at least one verification mechanism; and registering the entity
as a registered author of content.
[0019] Another aspect of the present disclosure provides a system
to crawl and index content, comprising: at least one processor;
memory; and at least one program stored in the memory and
executable by the at least one processor, the at least one program
comprising instructions to: receive a message, the message
including (i) submitted content, where the submitted content is
identified as being published by an entity, and (ii) a link to a
location on a resource hosting the submitted content; evaluate
whether the submitted content is represented in an index of known
content to determine whether the submitted content is new relative
to the known content; and in response to the evaluating, issue a
request to a search engine to crawl and index the submitted content
hosted by the resource associated with the link when the submitted
content is new relative to the known content, where the request to
the search engine to crawl and index the submitted content hosted
by the resource associated with the link is issued responsive to
determining that the submitted content is deemed to not be
represented in the index of known content.
[0020] In some embodiments, the request is a high priority request
that is directed to a high priority crawler of the search engine,
and a time when the high priority crawler of the search engine is
scheduled to crawl the submitted content occurs before a time when
a standard crawler of the search engine is scheduled to crawl the
submitted content.
[0021] In some embodiments, the instructions to evaluate whether
the submitted content is represented in the index of known content
to determine whether the submitted content is new relative to the
known content include instructions to: identify a set of shingles
representing the submitted content, a respective shingle in the set
of shingles including a predetermined number of consecutive tokens
for the submitted content; compare the set of shingles representing
the submitted content with a first plurality of shingles
representing the known content; and when less than a first
predetermined threshold number of the set of shingles representing
the submitted content is included in the first plurality of
shingles representing the known content, determine that the
submitted content is new relative to the known content.
[0022] In some embodiments, when (i) at least the first
predetermined threshold number of the set of shingles representing
the submitted content is included in the first plurality of
shingles representing the known content, (ii) at least a second
predetermined threshold number of the set of shingles representing
the submitted content is included in a second plurality of shingles
representing other submitted content identified as being published
by the entity, and (iii) at least a third predetermined threshold
of corresponding shingles in the second plurality of shingles
representing the other submitted content are first instances with
respect to corresponding shingles in the first plurality of
shingles representing the known content, the at least one program
includes instructions to: determine that the submitted content is
new relative to the known content.
[0023] In some embodiments, when (i) at least the first
predetermined threshold number of the set of shingles representing
the submitted content is included in the first plurality of
shingles representing the known content, (ii) at least a second
predetermined threshold number of the set of shingles representing
the submitted content is included in a second plurality of shingles
representing other submitted content identified as being published
by the entity, and (iii) less than a third predetermined threshold
of corresponding shingles in the second plurality of shingles
representing the other submitted content are first instances with
respect to corresponding shingles in the first plurality of
shingles representing the known content, the at least one program
includes instructions to: determine that the submitted content is
not new relative to the known content.
[0024] In some embodiments, when (i) at least the first
predetermined threshold number of the set of shingles representing
the submitted content is included in the first plurality of
shingles representing the known content, and (ii) less than a
second predetermined threshold number of the set of shingles
representing the submitted content is included in a second
plurality of shingles representing other submitted content
identified as being published by the entity, the at least one
program includes instructions to: determine that the submitted
content is not new relative to the known content.
[0025] In some embodiments, prior to issuing the request to the
search engine to crawl and index the submitted content hosted by
the resource associated with the link, the at least one program
includes instructions to determine that the entity is an author of
original content based on an evaluation of other submitted content
identified as being published by the entity, where the other
submitted content identified as being published by the entity is
included in the known content; and where the request to the search
engine to crawl and index the submitted content hosted by the
resource associated with the link is issued responsive to
determining that the submitted content is deemed to not be
represented in the index of known content and determining that the
entity is an author of original content.
[0026] In some embodiments, the instructions to determine that the
entity is an author of original content based on the other
submitted content identified as being published by the entity
include instructions to: identify a second plurality of shingles
representing the other submitted content identified as being
published by the entity; determine that at least a third
predetermined threshold number of the second plurality of shingles
representing the other submitted content identified as being
published by the entity are first instances with respect to
corresponding shingles in a first plurality of shingles
representing the known content; and responsive to the determining,
determine that the entity is an author of original content.
[0027] In some embodiments, the instructions to determine that at
least the third predetermined threshold number of the second
plurality of shingles representing the other submitted content
identified as being published by the entity are first instances
with respect to the corresponding shingles in the first plurality
of shingles representing the known content include instructions to:
determine a first number of the second plurality of shingles
representing the other submitted content identified as being
published by the entity that are unique with respect to the first
plurality of shingles representing the known content; determine a
second number of the second plurality of shingles representing the
other submitted content identified as being published by the entity
that are not unique with respect to the shingles for the known
content but that are associated with timestamps that precede
timestamps for corresponding shingles in the first plurality of
shingles representing the known content; and determine that a sum
of the first number and the second number is at least the third
predetermined threshold number.
[0028] In some embodiments, the at least one program further
comprises instructions to: identify a set of shingles for the
submitted content; for each respective shingle in the set of
shingles, apply a hash function to the shingle to produce a
respective hash value for the respective shingle; and store the
link to the location on the resource hosting the submitted content
and a timestamp of the message in a record in the index of known
content indexed by the respective hash value.
[0029] In some embodiments, the at least one program further
comprises instructions to, for each respective shingle in the set
of shingles, store an identifier for the entity in the record of
the index of known content.
[0030] In some embodiments, the request to the search engine
includes the link to the location on the resource and a timestamp
associated with the message, and where the timestamp associated
with the message is included in a search index entry for the
submitted content.
[0031] In some embodiments, the message is received in response to
the entity publishing the submitted content on a website. In some
embodiments, the message is received in response to the entity
submitting the submitted content and the link to the location on
the resource hosting the submitted content using a user interface
provided by the server.
[0032] In some embodiments, the message is received in response to
the entity submitting the submitted content using a user interface
provided by the server, and where the submitted content is marked
as being submitted by the entity prior to publication of the
submitted content by the entity. In some embodiments, the entity
includes at least one author of content. In some embodiments, the
entity includes at least one publisher of content. In some
embodiments, the entity includes at least one website.
[0033] In some embodiments, (i) a respective link to a respective
location on a respective resource hosting a respective known
content and (ii) a timestamp associated with the respective known
content are included in a respective record of the index of known
content indexed by a respective hash value for a respective shingle
of the respective known content. In some embodiments, prior to
receiving the message, the at least one program includes
instructions to: receive, from a client device of the entity, a
registration request to register the entity as a registered author
of content; provide, to the client device of the entity, at least
one verification mechanism to verify that the entity is an author
of content; responsive to the providing, receive, from the client
device of the entity, a verification request to verify that the
entity is an author of content using the at least one verification
mechanism; verify that the entity is an author of content using the
at least one verification mechanism; and register the entity as a
registered author of content.
[0034] Another aspect of the present disclosure provides a
non-transitory computer readable storage medium storing at least
one program configured for execution by at least one processor of a
computer system, the at least one program comprising instructions
to: receive a message, the message including (i) submitted content,
where the submitted content is identified as being published by an
entity, and (ii) a link to a location on a resource hosting the
submitted content; evaluate whether the submitted content is
represented in an index of known content to determine whether the
submitted content is new relative to the known content; and in
response to the evaluating, issue a request to a search engine to
crawl and index the submitted content hosted by the resource
associated with the link when the submitted content is new relative
to the known content, where the request to the search engine to
crawl and index the submitted content hosted by the resource
associated with the link is issued responsive to determining that
the submitted content is deemed to not be represented in the index
of known content.
[0035] In some embodiments, the request is a high priority request
that is directed to a high priority crawler of the search engine,
and where a time when the high priority crawler of the search
engine is scheduled to crawl the submitted content occurs before a
time when a standard crawler of the search engine is scheduled to
crawl the submitted content.
[0036] In some embodiments, the instructions to evaluate whether
the submitted content is represented in the index of known content
to determine whether the submitted content is new relative to the
known content include instructions to: identify a set of shingles
representing the submitted content, a respective shingle in the set
of shingles including a predetermined number of consecutive tokens
for the submitted content; compare the set of shingles representing
the submitted content with a first plurality of shingles
representing the known content; and when less than a first
predetermined threshold number of the set of shingles representing
the submitted content is included in the first plurality of
shingles representing the known content, determine that the
submitted content is new relative to the known content.
[0037] In some embodiments, when (i) at least the first
predetermined threshold number of the set of shingles representing
the submitted content is included in the first plurality of
shingles representing the known content, (ii) at least a second
predetermined threshold number of the set of shingles representing
the submitted content is included in a second plurality of shingles
representing other submitted content identified as being published
by the entity, and (iii) at least a third predetermined threshold
of corresponding shingles in the second plurality of shingles
representing the other submitted content are first instances with
respect to corresponding shingles in the first plurality of
shingles representing the known content, the at least one program
includes instructions to: determine that the submitted content is
new relative to the known content.
[0038] In some embodiments, when (i) at least the first
predetermined threshold number of the set of shingles representing
the submitted content is included in the first plurality of
shingles representing the known content, (ii) at least a second
predetermined threshold number of the set of shingles representing
the submitted content is included in a second plurality of shingles
representing other submitted content identified as being published
by the entity, and (iii) less than a third predetermined threshold
of corresponding shingles in the second plurality of shingles
representing the other submitted content are first instances with
respect to corresponding shingles in the first plurality of
shingles representing the known content, the at least one program
includes instructions to: determine that the submitted content is
not new relative to the known content.
[0039] In some embodiments, when (i) at least the first
predetermined threshold number of the set of shingles representing
the submitted content is included in the first plurality of
shingles representing the known content, and (ii) less than a
second predetermined threshold number of the set of shingles
representing the submitted content is included in a second
plurality of shingles representing other submitted content
identified as being published by the entity, the at least one
program includes instructions to: determine that the submitted
content is not new relative to the known content.
[0040] In some embodiments, prior to issuing the request to the
search engine to crawl and index the submitted content hosted by
the resource associated with the link, the at least one program
includes instructions to determine that the entity is an author of
original content based on an evaluation of other submitted content
identified as being published by the entity, where the other
submitted content identified as being published by the entity is
included in the known content; and where the request to the search
engine to crawl and index the submitted content hosted by the
resource associated with the link is issued responsive to
determining that the submitted content is deemed to not be
represented in the index of known content and determining that the
entity is an author of original content.
[0041] In some embodiments, the instructions to determine that the
entity is an author of original content based on the other
submitted content identified as being published by the entity
include instructions to: identify a second plurality of shingles
representing the other submitted content identified as being
published by the entity; determine that at least a third
predetermined threshold number of the second plurality of shingles
representing the other submitted content identified as being
published by the entity are first instances with respect to
corresponding shingles in a first plurality of shingles
representing the known content; and responsive to the determining,
determine that the entity is an author of original content.
[0042] In some embodiments, the instructions to determine that at
least the third predetermined threshold number of the second
plurality of shingles representing the other submitted content
identified as being published by the entity are first instances
with respect to the corresponding shingles in the first plurality
of shingles representing the known content include instructions to:
determine a first number of the second plurality of shingles
representing the other submitted content identified as being
published by the entity that are unique with respect to the first
plurality of shingles representing the known content; determine a
second number of the second plurality of shingles representing the
other submitted content identified as being published by the entity
that are not unique with respect to the shingles for the known
content but that are associated with timestamps that precede
timestamps for corresponding shingles in the first plurality of
shingles representing the known content; and determine that a sum
of the first number and the second number is at least the third
predetermined threshold number.
[0043] In some embodiments, the non-transitory computer readable
storage medium further comprises instructions to: identify a set of
shingles for the submitted content; for each respective shingle in
the set of shingles, apply a hash function to the shingle to
produce a respective hash value for the respective shingle; and
store the link to the location on the resource hosting the
submitted content and a timestamp of the message in a record in the
index of known content indexed by the respective hash value.
[0044] In some embodiments, the non-transitory computer readable
storage medium further comprises instructions to, for each
respective shingle in the set of shingles, store an identifier for
the entity in the record of the index of known content. In some
embodiments, the request to the search engine includes the link to
the location on the resource and a timestamp associated with the
message, and the timestamp associated with the message is included
in a search index entry for the submitted content. In some
embodiments, the message is received in response to the entity
publishing the submitted content on a website. In some embodiments,
the message is received in response to the entity submitting the
submitted content and the link to the location on the resource
hosting the submitted content using a user interface provided by
the server. In some embodiments, the message is received in
response to the entity submitting the submitted content using a
user interface provided by the server, and where the submitted
content is marked as being submitted by the entity prior to
publication of the submitted content by the entity. In some
embodiments, the entity includes at least one author of content. In
some embodiments, the entity includes at least one publisher of
content. In some embodiments, the entity includes at least one
website.
[0045] In some embodiments, (i) a respective link to a respective
location on a respective resource hosting a respective known
content and (ii) a timestamp associated with the respective known
content are included in a respective record of the index of known
content indexed by a respective hash value for a respective shingle
of the respective known content.
[0046] In some embodiments, prior to receiving the message, the at
least one program includes instructions to: receive, from a client
device of the entity, a registration request to register the entity
as a registered author of content; provide, to the client device of
the entity, at least one verification mechanism to verify that the
entity is an author of content; responsive to the providing,
receive, from the client device of the entity, a verification
request to verify that the entity is an author of content using the
at least one verification mechanism; verify that the entity is an
author of content using the at least one verification mechanism;
and register the entity as a registered author of content.
[0047] Another aspect of the present disclosure provides a
non-transitory computer readable storage medium storing at least
one program configured for execution by at least one processor of a
system, the at least one program comprising instructions to be
executed by the at least one processor so as to perform any of the
methods disclosed herein.
[0048] Another aspect of the present disclosure provides a system,
comprising at least one processor; and memory storing at least one
program for execution by the at least one processor; the system
including means for performing any of the computer-implemented
methods disclosed herein.
[0049] Another aspect of the present disclosure provides a
computer-implemented method for re-ranking ranked search results,
performed on a server system having at least one processor and
memory storing at least one program for execution by the at least
one processor to perform the method, comprising: obtaining ranked
search results satisfying a search query, the ranked search results
including a first search result corresponding to a first document
associated with a first entity and a second search result
corresponding to a second document associated with a second entity,
the first search result being ranked higher than the second search
result; determining that the first document and the second document
satisfy a similarity criterion; determining that the second entity
satisfies a predefined authorship differential with respect to the
first entity; and responsive to determining that the second entity
satisfies the predefined authorship differential with respect to
the first entity, swapping the second search result and the first
search result in the ranked search results to produce re-ranked
search results.
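The swap step can be sketched as follows for a single pair of results. The two predicates are passed in as functions because the disclosure allows several similarity criteria and authorship tests; the result layout, the score table, and the 0.3 threshold are hypothetical values chosen only to make the example run.

```python
def rerank_pair(results, i, j, is_similar, has_authorship_differential):
    """Swap results i and j (i ranked higher) when both conditions hold."""
    if not i < j:
        raise ValueError("result i must be ranked higher than result j")
    first, second = results[i], results[j]
    reranked = list(results)  # leave the original ranking untouched
    if (is_similar(first["doc"], second["doc"])
            and has_authorship_differential(second["entity"], first["entity"])):
        reranked[i], reranked[j] = second, first
    return reranked


# Hypothetical author scores: likelihood of authoring original content.
scores = {"copier": 0.1, "other": 0.5, "original": 0.9}

results = [
    {"doc": "same text", "entity": "copier"},
    {"doc": "unrelated text", "entity": "other"},
    {"doc": "same text", "entity": "original"},
]

reranked = rerank_pair(
    results, 0, 2,
    is_similar=lambda a, b: a == b,                             # toy criterion
    has_authorship_differential=lambda second, first:
        scores[second] - scores[first] >= 0.3,                  # toy threshold
)
order = [r["entity"] for r in reranked]
```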
[0050] In some embodiments, the determining that the first document
and the second document satisfy the similarity criterion includes:
identifying a first plurality of portions of content for the first
document; identifying a second plurality of portions of content for
the second document; and determining that at least a predetermined
number of portions in the first plurality of portions of content
and in the second plurality of portions of content are identical.
In some embodiments, respective portions of content in the first
plurality of portions of content and the second plurality of
portions of content include a predetermined amount of data.
[0051] In some embodiments, the predetermined number of portions
encompasses a predetermined percentage of the data in the first
document. In some embodiments, the predetermined number of portions
is a threshold percentage of the portions of content in the first
plurality of portions of content. In some embodiments, the
determining that the first document and the second document satisfy
the similarity criterion includes: identifying a first plurality of
shingles for the first document; identifying a second plurality of
shingles for the second document; and determining that a
predetermined quantity of shingles in the first plurality of
shingles and in the second plurality of shingles of content are
identical.
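The shingle-based similarity criterion can be sketched as follows.
This is an illustrative example only: it assumes that a shingle is
a run of k consecutive whitespace-separated words and that the
predetermined quantity is expressed as a fraction of the first
document's shingles. The function names, k=4, and the 0.8 threshold
are hypothetical choices, not values taken from the disclosure.

```python
from typing import Set, Tuple


def shingles(text: str, k: int = 4) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles of `text` (assumed definition)."""
    words = text.split()
    if not words:
        return set()
    if len(words) < k:
        # A very short document yields a single, shorter shingle.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def satisfies_similarity(first_text: str, second_text: str,
                         k: int = 4, threshold: float = 0.8) -> bool:
    """True when enough of the first document's shingles recur in the second."""
    first = shingles(first_text, k)
    second = shingles(second_text, k)
    if not first:
        return False
    shared = len(first & second)  # shingles identical in both documents
    return shared / len(first) >= threshold
```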
[0052] In some embodiments, the first document and the second
document satisfy the similarity criterion when a character count of
the first document and a character count of the second document
deviate by less than a character count threshold. In some
embodiments, the first document and the second document satisfy the
similarity criterion when a Levenshtein edit distance from the
first document to the second document is less than a predetermined
number of character edits.
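Both criteria in the preceding paragraph can be sketched in a few
lines. The example below is an assumption-laden illustration: the
function names are hypothetical, the thresholds are supplied by the
caller, and the edit distance is computed with the standard
two-row dynamic-programming formulation of Levenshtein distance.

```python
def char_count_similar(first: str, second: str, threshold: int) -> bool:
    """True when the character counts deviate by less than `threshold`."""
    return abs(len(first) - len(second)) < threshold


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_distance_similar(first: str, second: str, max_edits: int) -> bool:
    """True when the edit distance is less than `max_edits`."""
    return levenshtein(first, second) < max_edits
```

The character-count test is much cheaper than the edit distance,
whose cost grows with the product of the two document lengths.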
[0053] In some embodiments, the determining that the second entity
satisfies the predefined authorship differential with respect to
the first entity includes: determining an author score for the
first entity and an author score for the second entity, where the
author score for a respective entity, in the set of entities
consisting of the first entity and the second entity, indicates a
likelihood that the respective entity is an author of original
content; and when the author score for the second entity is greater
than the author score for the first entity by at least a
predetermined threshold, determining that the second entity
satisfies the predefined authorship differential with respect to
the first entity.
[0054] In some embodiments, the determining the author score for a
respective entity includes: determining a citation score for the
respective entity, where the citation score corresponds to a
frequency at which content associated with the respective entity is
cited; determining an original author score for the respective
entity, where the original author score corresponds to a percentage
of content associated with the respective entity that is a first
instance of the content in an index of known content; and combining
the citation score and the original author score using a
predetermined function to produce the author score.
[0055] In some embodiments, the determining the citation score for
the respective entity includes calculating the citation score as a
quotient of a number of citations to content that is associated
with the respective entity and a quantity of content that has at
least one citation.
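The citation score quotient, and one possible way of combining it
with the original author score, can be sketched as follows. The
disclosure says only that a predetermined function is used; the
capped, weighted sum below is an assumption chosen for illustration,
and citation_weight and citation_cap are hypothetical parameters.

```python
from typing import Sequence


def citation_score(citations_per_item: Sequence[int]) -> float:
    """Citations to the entity's content divided by the number of
    content items that have at least one citation."""
    cited = [count for count in citations_per_item if count > 0]
    if not cited:
        return 0.0
    return sum(cited) / len(cited)


def author_score(citation: float, original_author: float,
                 citation_weight: float = 0.5,
                 citation_cap: float = 10.0) -> float:
    """Combine the two scores into one value in [0, 1].

    ASSUMPTION: the weighted sum and the cap are illustrative only.
    `original_author` is expected to be a fraction in [0, 1].
    """
    normalized = min(citation, citation_cap) / citation_cap
    return citation_weight * normalized + (1.0 - citation_weight) * original_author
```

For example, an entity with four items cited 4, 0, 2, and 0 times
has six citations over two cited items, giving a citation score of
3.0.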
[0056] In some embodiments, the determining the original author
score for the respective entity includes: identifying a plurality
of shingles representing content in the index of known content
identified as being associated with the respective entity; and
calculating a percentage of the plurality of shingles that are
first instances of the shingles in the index of known content.
[0057] In some embodiments, the calculating the percentage of the
plurality of shingles that are first instances includes:
determining a first number of the plurality of shingles, each
having an associated timestamp, for which there are no identical
shingles in the index of known content having an earlier timestamp;
and calculating the percentage as a quotient of the first number
and a total number of shingles in the plurality of shingles.
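The first-instance percentage can be sketched as follows. The index
layout used here, a mapping from each shingle to a list of
(timestamp, entity identifier) records, is a hypothetical
simplification of the index of known content, and treating an exact
timestamp tie as a first instance is likewise an assumption.

```python
from typing import Dict, List, Tuple

# Hypothetical index layout: shingle -> [(timestamp, entity_id), ...]
Index = Dict[str, List[Tuple[float, str]]]


def original_author_score(index: Index, entity_id: str) -> float:
    """Fraction of the entity's shingles with no earlier identical shingle."""
    total = 0
    first_instances = 0
    for records in index.values():
        own = [ts for ts, owner in records if owner == entity_id]
        if not own:
            continue  # this shingle is not associated with the entity
        total += 1
        earliest_any = min(ts for ts, _ in records)
        # First instance: nobody recorded this shingle before the entity did.
        if min(own) <= earliest_any:
            first_instances += 1
    return first_instances / total if total else 0.0
```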
[0058] In some embodiments, the determining the original author
score for the respective entity includes: identifying a plurality
of portions of content in the index of known content identified as
being associated with the respective entity, each portion in the
plurality of portions representing a predetermined amount of data
in the index of known content; and calculating a percentage of the
plurality of the portions that are first instances of the portions
of content in the index of known content.
[0059] In some embodiments, the calculating the percentage of the
plurality of portions of content that are first instances includes:
determining a first number of the plurality of portions of content,
each having an associated timestamp, for which there are no
identical portions of content in the index of known content having
an earlier timestamp; and calculating the percentage as a quotient
of the first number and a total number of portions of content in
the plurality of portions of content.
[0060] In some embodiments, when the author scores for the first
entity and the second entity are within the predetermined threshold
of each other, the method includes: obtaining a first timestamp
corresponding to a time when the first document was included in an
index of known content; obtaining a second timestamp corresponding
to a time when the second document was included in the index of
known content; and when the second timestamp is less than the first
timestamp by at least a predetermined time interval, determining
that the second entity satisfies the predefined authorship
differential with respect to the first entity.
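The score comparison and the timestamp fallback can be combined
into a single decision, sketched below under stated assumptions:
timestamps are seconds since an arbitrary epoch, and the default
values for score_threshold and min_interval are hypothetical
placeholders rather than values from the disclosure.

```python
def satisfies_authorship_differential(
    first_score: float,
    second_score: float,
    first_indexed_at: float,
    second_indexed_at: float,
    score_threshold: float = 0.2,   # hypothetical value
    min_interval: float = 3600.0,   # hypothetical value, in seconds
) -> bool:
    """Decide whether the lower-ranked (second) entity should be promoted."""
    gap = second_score - first_score
    if gap >= score_threshold:
        # The second entity is clearly the more likely original author.
        return True
    if abs(gap) < score_threshold:
        # Scores are too close to call: fall back to which document
        # entered the index of known content first.
        return first_indexed_at - second_indexed_at >= min_interval
    return False  # the first entity is clearly the more likely author
```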
[0061] In some embodiments, the ranked search results are obtained
from a search engine, and after swapping the second search result
and the first search result in the ranked search results to produce
the re-ranked search results, the method further comprises
providing the re-ranked search results to the search engine.
[0062] Another aspect of the present disclosure provides a system
to re-rank ranked search results, comprising: at least one
processor; memory; and at least one program stored in the memory
and executable by the at least one processor, the at least one
program comprising instructions to: obtain ranked search results
satisfying a search query, the ranked search results including a
first search result corresponding to a first document associated
with a first entity and a second search result corresponding to a
second document associated with a second entity, the first search
result being ranked higher than the second search result; determine
that the first document and the second document satisfy a
similarity criterion; determine that the second entity satisfies a
predefined authorship differential with respect to the first
entity; and responsive to determining that the second entity
satisfies the predefined authorship differential with respect to
the first entity, swap the second search result and the first
search result in the ranked search results to produce re-ranked
search results.
[0063] In some embodiments, the instructions to determine that the
first document and the second document satisfy the similarity
criterion include instructions to: identify a first plurality of
portions of content for the first document; identify a second
plurality of portions of content for the second document; and
determine that at least a predetermined number of portions in the
first plurality of portions of content and in the second plurality
of portions of content are identical.
[0064] In some embodiments, respective portions of content in the
first plurality of portions of content and the second plurality of
portions of content include a predetermined amount of data. In some
embodiments, the predetermined number of portions encompasses a
predetermined percentage of the data in the first document. In some
embodiments, the predetermined number of portions is a threshold
percentage of the portions of content in the first plurality of
portions of content. In some embodiments, the instructions to
determine that the first document and the second document satisfy
the similarity criterion include instructions to: identify a first
plurality of shingles for the first document; identify a second
plurality of shingles for the second document; and determine that a
predetermined quantity of shingles in the first plurality of
shingles and in the second plurality of shingles of content are
identical.
[0065] In some embodiments, the first document and the second
document satisfy the similarity criterion when a character count of
the first document and a character count of the second document
deviate by less than a character count threshold. In some
embodiments, the first document and the second document satisfy the
similarity criterion when a Levenshtein edit distance from the
first document to the second document is less than a predetermined
number of character edits.
[0066] In some embodiments, the instructions to determine that the
second entity satisfies the predefined authorship differential with
respect to the first entity include instructions to: determine an
author score for the first entity and an author score for the
second entity, where the author score for a respective entity, in
the set of entities consisting of the first entity and the second
entity, indicates a likelihood that the respective entity is an
author of original content; and when the author score for the
second entity is greater than the author score for the first entity
by at least a predetermined threshold, determine that the second
entity satisfies the predefined authorship differential with
respect to the first entity.
[0067] In some embodiments, the instructions to determine the
author score for a respective entity include instructions to:
determine a citation score for the respective entity, where the
citation score corresponds to a frequency at which content
associated with the respective entity is cited; determine an
original author score for the respective entity, where the original
author score corresponds to a percentage of content associated with
the respective entity that is a first instance of the content in an
index of known content; and combine the citation score and the
original author score using a predetermined function to produce the
author score.
[0068] In some embodiments, the instructions to determine the
citation score for the respective entity include instructions to
calculate the citation score as a quotient of a number of citations
to content that is associated with the respective entity and a
quantity of content that has at least one citation.
[0069] In some embodiments, the instructions to determine the
original author score for the respective entity include
instructions to: identify a plurality of shingles representing
content in the index of known content identified as being
associated with the respective entity; and calculate a percentage
of the plurality of shingles that are first instances of the
shingles in the index of known content.
[0070] In some embodiments, the instructions to calculate the
percentage of the plurality of shingles that are first instances
include instructions to: determine a first number of the plurality
of shingles, each having an associated timestamp, for which there
are no identical shingles in the index of known content having an
earlier timestamp; and calculate the percentage as a quotient of
the first number and a total number of shingles in the plurality of
shingles.
[0071] In some embodiments, the instructions to determine the
original author score for the respective entity include
instructions to: identify a plurality of portions of content in the
index of known content identified as being associated with the
respective entity, each portion in the plurality of portions
representing a predetermined amount of data in the index of known
content; and calculate a percentage of the plurality of the
portions that are first instances of the portions of content in the
index of known content.
[0072] In some embodiments, the instructions to calculate the
percentage of the plurality of portions of content that are first
instances include instructions to: determine a first number of the
plurality of portions of content, each having an associated
timestamp, for which there are no identical portions of content in
the index of known content having an earlier timestamp; and
calculate the percentage as a quotient of the first number and a
total number of portions of content in the plurality of portions of
content.
[0073] In some embodiments, when the author scores for the first
entity and the second entity are within the predetermined threshold
of each other, the at least one program includes instructions to:
obtain a first timestamp corresponding to a time when the first
document was included in an index of known content; obtain a second
timestamp corresponding to a time when the second document was
included in the index of known content; and when the second
timestamp is less than the first timestamp by at least a
predetermined time interval, determine that the second entity
satisfies the predefined authorship differential with respect to
the first entity.
[0074] In some embodiments, the ranked search results are obtained
from a search engine, and the at least one program includes
instructions to, after swapping the second search result and the
first search result in the ranked search results to produce the
re-ranked search results, provide the re-ranked search results to
the search engine.
[0075] Another aspect of the present disclosure provides a
non-transitory computer readable storage medium storing at least
one program configured for execution by at least one processor of a
computer system, the at least one program comprising instructions
to: obtain ranked search results satisfying a search query, the
ranked search results including a first search result corresponding
to a first document associated with a first entity and a second
search result corresponding to a second document associated with a
second entity, the first search result being ranked higher than the
second search result; determine that the first document and the
second document satisfy a similarity criterion; determine that the
second entity satisfies a predefined authorship differential with
respect to the first entity; and responsive to determining that the
second entity satisfies the predefined authorship differential with
respect to the first entity, swap the second search result and the
first search result in the ranked search results to produce
re-ranked search results.
[0076] In some embodiments, the instructions to determine that the
first document and the second document satisfy the similarity
criterion include instructions to: identify a first plurality of
portions of content for the first document; identify a second
plurality of portions of content for the second document; and
determine that at least a predetermined number of portions in the
first plurality of portions of content and in the second plurality
of portions of content are identical.
[0077] In some embodiments, respective portions of content in the
first plurality of portions of content and the second plurality of
portions of content include a predetermined amount of data. In some
embodiments, the predetermined number of portions encompasses a
predetermined percentage of the data in the first document. In some
embodiments, the predetermined number of portions is a threshold
percentage of the portions of content in the first plurality of
portions of content. In some embodiments, the instructions to
determine that the first document and the second document satisfy
the similarity criterion include instructions to: identify a first
plurality of shingles for the first document; identify a second
plurality of shingles for the second document; and determine that a
predetermined quantity of shingles in the first plurality of
shingles and in the second plurality of shingles of content are
identical.
[0078] In some embodiments, the first document and the second
document satisfy the similarity criterion when a character count of
the first document and a character count of the second document
deviate by less than a character count threshold.
[0079] In some embodiments, the first document and the second
document satisfy the similarity criterion when a Levenshtein edit
distance from the first document to the second document is less
than a predetermined number of character edits.
[0080] In some embodiments, the instructions to determine that the
second entity satisfies the predefined authorship differential with
respect to the first entity include instructions to: determine an
author score for the first entity and an author score for the
second entity, where the author score for a respective entity, in
the set of entities consisting of the first entity and the second
entity, indicates a likelihood that the respective entity is an
author of original content; and when the author score for the
second entity is greater than the author score for the first entity
by at least a predetermined threshold, determine that the second
entity satisfies the predefined authorship differential with
respect to the first entity.
[0081] In some embodiments, the instructions to determine the
author score for a respective entity include instructions to:
determine a citation score for the respective entity, where the
citation score corresponds to a frequency at which content
associated with the respective entity is cited; determine an
original author score for the respective entity, where the original
author score corresponds to a percentage of content associated with
the respective entity that is a first instance of the content in an
index of known content; and combine the citation score and the
original author score using a predetermined function to produce the
author score.
[0082] In some embodiments, the instructions to determine the
citation score for the respective entity include instructions to
calculate the citation score as a quotient of a number of citations
to content that is associated with the respective entity and a
quantity of content that has at least one citation.
[0083] In some embodiments, the instructions to determine the
original author score for the respective entity include
instructions to: identify a plurality of shingles representing
content in the index of known content identified as being
associated with the respective entity; and calculate a percentage
of the plurality of shingles that are first instances of the
shingles in the index of known content.
[0084] In some embodiments, the instructions to calculate the
percentage of the plurality of shingles that are first instances
include instructions to: determine a first number of the plurality
of shingles, each having an associated timestamp, for which there
are no identical shingles in the index of known content having an
earlier timestamp; and calculate the percentage as a quotient of
the first number and a total number of shingles in the plurality of
shingles.
[0085] In some embodiments, the instructions to determine the
original author score for the respective entity include
instructions to: identify a plurality of portions of content in the
index of known content identified as being associated with the
respective entity, each portion in the plurality of portions
representing a predetermined amount of data in the index of known
content; and calculate a percentage of the plurality of the
portions that are first instances of the portions of content in the
index of known content.
[0086] In some embodiments, the instructions to calculate the
percentage of the plurality of portions of content that are first
instances include instructions to: determine a first number of the
plurality of portions of content, each having an associated
timestamp, for which there are no identical portions of content in
the index of known content having an earlier timestamp; and
calculate the percentage as a quotient of the first number and a
total number of portions of content in the plurality of portions of
content.
[0087] In some embodiments, when the author scores for the first
entity and the second entity are within the predetermined threshold
of each other, the at least one program includes instructions to:
obtain a
first timestamp corresponding to a time when the first document was
included in an index of known content; obtain a second timestamp
corresponding to a time when the second document was included in
the index of known content; and when the second timestamp is less
than the first timestamp by at least a predetermined time interval,
determine that the second entity satisfies the predefined
authorship differential with respect to the first entity.
[0088] In some embodiments, the ranked search results are obtained
from a search engine, and the at least one program includes
instructions to, after swapping the second search result and the
first search result in the ranked search results to produce the
re-ranked search results, provide the re-ranked search results to
the search engine.
[0089] Another aspect of the present disclosure provides a
computer-implemented method for re-ranking ranked search results,
performed on a server system having at least one processor and
memory storing at least one program for execution by the at least
one processor to perform the method, comprising: obtaining ranked
search results satisfying a search query, the ranked search results
including a first search result corresponding to a first document
associated with a first entity and a second search result
corresponding to a second document associated with a second entity,
the first search result being ranked higher than the second search
result; determining that the first document and the second document
satisfy a similarity criterion; obtaining a first timestamp
corresponding to a time when the first document was included in an
index of known content; obtaining a second timestamp corresponding
to a time when the second document was included in the index of
known content; and swapping the second search result and the first
search result in the ranked search results to produce re-ranked
search results when the second timestamp is less than the first
timestamp by at least a predetermined time interval.
[0090] Another aspect of the present disclosure provides a
computer-implemented method for determining a likelihood that an
entity is an author of original content, performed on a server
system having at least one processor and memory storing at least
one program for execution by the at least one processor to perform
the method, comprising: receiving a message, the message including
submitted content, where the submitted content is identified as
being associated with an entity; storing a representation of the
submitted content in an index of known content; calculating an
original author score for the entity based on an evaluation of
representations of submitted content in the index of known content
identified as being associated with the entity, the original author
score indicating a likelihood that the entity is an author of
original content; and storing the original author score and an
identifier for the entity in an entity database.
[0091] In some embodiments, the message comprises a timestamp, and
where storing the representation of the submitted content in the
index of known content includes: identifying a set of shingles for
the submitted content; for each respective shingle in the set of
shingles, applying a hash function to the respective shingle to
produce a hash value for the respective shingle; and storing the
timestamp of the message in a record of the index of known content
indexed by the hash value.
[0092] In some embodiments, the message includes a link to a
location on a resource hosting the submitted content, and where the
method further comprises for each respective shingle in the set of
shingles, storing the link to the location on the resource hosting
the submitted content in a record of the index of known content
corresponding to the respective shingle in the set of shingles.
[0093] In some embodiments, the method further comprises, for each
respective shingle in the set of shingles, storing the identifier
for the entity in a record of the index of known content
corresponding to the respective shingle in the set of shingles.
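The storing steps described in the preceding paragraphs can be
sketched as follows. This is an illustration under several
assumptions: shingles are runs of k consecutive words, SHA-256 from
the Python standard library stands in for the unspecified hash
function, and the index of known content is modelled as an
in-memory dictionary. The names shingle_hashes and store_submission
are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional, Set

Record = Dict[str, object]
Index = Dict[str, List[Record]]


def shingle_hashes(text: str, k: int = 4) -> Set[str]:
    """Hash every k-word shingle of `text` (assumed shingle definition)."""
    words = text.split()
    count = max(len(words) - k + 1, 1) if words else 0
    return {
        hashlib.sha256(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(count)
    }


def store_submission(index: Index, text: str, timestamp: float,
                     entity_id: str, link: Optional[str] = None,
                     k: int = 4) -> None:
    """Record timestamp, entity, and link under each shingle's hash value."""
    for digest in shingle_hashes(text, k):
        index.setdefault(digest, []).append(
            {"timestamp": timestamp, "entity": entity_id, "link": link}
        )
```

Keying each record by the hash value means that a later submission
containing an identical shingle lands in the same record list, where
its timestamp can be compared with the earlier ones.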
[0094] In some embodiments, calculating the original author score
for the entity includes: identifying a plurality of shingles
representing content, including the submitted content, in the index
of known content identified as being associated with the entity;
and calculating a percentage of the plurality of shingles that are
first instances of the shingles in the index of known content. In
some embodiments, calculating the percentage of the plurality of
shingles that are first instances includes: determining a first
number of the plurality of shingles, each having an associated
timestamp, for which there are no identical shingles in the index
of known content having an earlier timestamp; and calculating the
percentage as a quotient of the first number and a total number of
shingles in the plurality of shingles.
[0095] In some embodiments, calculating the original author score
for the entity includes: identifying a plurality of portions of
content, including the submitted content, in the index of known
content identified as being associated with the entity, each
portion in the plurality of portions representing a predetermined
amount of data in the index of known content; and calculating a
percentage of the plurality of the portions that are first
instances of the portions of content in the index of known content.
In some embodiments, calculating the percentage of the plurality of
portions of content that are first instances includes: determining
a first number of the plurality of portions of content, each having
an associated timestamp, for which there are no identical portions
of content in the index of known content having an earlier
timestamp; and calculating the percentage as a quotient of the
first number and a total number of portions of content in the
plurality of portions of content.
[0096] In some embodiments, the message is received in response to
the entity publishing the submitted content on a website. In some
embodiments, the message is received in response to the entity
submitting the submitted content using a user interface provided by
the server. In some embodiments, the message is received in
response to the entity submitting the submitted content using a
user interface provided by the server, and the submitted content is
marked as being submitted by the entity prior to publication of the
submitted content by the entity. In some embodiments, the entity
includes at least one author of content. In some embodiments, the
entity includes at least one publisher of content. In some
embodiments, the entity includes at least one website.
[0097] In some embodiments, prior to receiving the message from the
entity, the method includes: receiving, from a client device of the
entity, a registration request to register the entity as a
registered author of content; providing, to the client device, at
least one verification mechanism to verify that the entity is an
author of content; responsive to the providing, receiving, from the
client device of the entity, a verification request to verify that
the entity is an author of content using the at least one
verification mechanism; verifying that the entity is an author of
content using the at least one verification mechanism; and
registering the entity as a registered author of content.
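The registration and verification exchange can be sketched as a
small state machine. The disclosure does not specify the
verification mechanism; the one-time token challenge below is an
assumption made purely for illustration, and the class name
AuthorRegistry and its methods are hypothetical.

```python
import secrets
from typing import Dict, Set


class AuthorRegistry:
    """Tracks pending and completed author registrations (illustrative)."""

    def __init__(self) -> None:
        self._pending: Dict[str, str] = {}
        self._registered: Set[str] = set()

    def request_registration(self, entity_id: str) -> str:
        """Handle a registration request and return the verification token."""
        token = secrets.token_hex(16)
        self._pending[entity_id] = token
        return token

    def verify(self, entity_id: str, presented_token: str) -> bool:
        """Handle a verification request; register the entity on success."""
        expected = self._pending.get(entity_id)
        if expected is None:
            return False
        # Constant-time comparison of the expected and presented tokens.
        if not secrets.compare_digest(expected.encode("utf-8"),
                                      presented_token.encode("utf-8")):
            return False
        del self._pending[entity_id]
        self._registered.add(entity_id)
        return True

    def is_registered(self, entity_id: str) -> bool:
        return entity_id in self._registered
```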
[0098] In some embodiments, calculating the original author score
for the entity based on the evaluation of representations of the
submitted content in the index of known content identified as being
associated with the entity includes calculating the original author
score for the entity based on the evaluation of a subset of the
representations of submitted content in the index of known content
identified as being associated with the entity.
[0099] In some embodiments, the subset of representations of
submitted content includes all content submitted by the entity
within a threshold period of time.
[0100] In some embodiments, each representation in the subset of
representations of submitted content satisfies a dissimilarity
criterion with respect to each other representation in the subset
of representations.
[0101] In some embodiments, the subset of representations of
submitted content consists of representations for ten or fewer
documents submitted by the entity.
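The three subset constraints above are described as separate
embodiments. The sketch below combines them in one function only
for illustration, which is an assumption, as are the newest-first
ordering, the dictionary layout of a submission, and the names
select_subset and is_dissimilar.

```python
from typing import Callable, Dict, List

Submission = Dict[str, object]


def select_subset(
    submissions: List[Submission],
    now: float,
    window: float,
    is_dissimilar: Callable[[Submission, Submission], bool],
    limit: int = 10,
) -> List[Submission]:
    """Pick at most `limit` recent, mutually dissimilar submissions."""
    # Keep only content submitted within the threshold period of time.
    recent = [s for s in submissions if now - s["timestamp"] <= window]
    recent.sort(key=lambda s: s["timestamp"], reverse=True)  # newest first
    chosen: List[Submission] = []
    for candidate in recent:
        if len(chosen) == limit:
            break
        # Each chosen item must be dissimilar to every other chosen item.
        if all(is_dissimilar(candidate, other) for other in chosen):
            chosen.append(candidate)
    return chosen
```

Requiring mutual dissimilarity prevents an entity that republishes
the same text many times from having one document dominate its
score.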
[0102] Another aspect of the present disclosure provides a system
to determine a likelihood that an entity is an author of original
content, comprising at least one processor; memory; and at least
one program stored in the memory and executable by the at least one
processor, the at least one program comprising instructions to:
receive a message, the message including submitted content, where
the submitted content is identified as being associated with an
entity; store a representation of the submitted content in an index
of known content; calculate an original author score for the entity
based on an evaluation of representations of submitted content in
the index of known content identified as being associated with the
entity, the original author score indicating a likelihood that the
entity is an author of original content; and store the original
author score and an identifier for the entity in an entity
database.
[0103] In some embodiments, the message comprises a timestamp, and
where instructions to store the representation of the submitted
content in the index of known content include instructions to:
identify a set of shingles for the submitted content; for each
respective shingle in the set of shingles, apply a hash function to
the respective shingle to produce a hash value for the respective
shingle; and store the timestamp of the message in a record of the
index of known content indexed by the hash value.
[0104] In some embodiments, the message includes a link to a
location on a resource hosting the submitted content, and where the
at least one program includes instructions to, for each respective
shingle in the set of shingles, store the link to the location on
the resource hosting the submitted content in a record of the index
of known content corresponding to the respective shingle in the set
of shingles.
[0105] In some embodiments, the at least one program includes
instructions to, for each respective shingle in the set of
shingles, store the identifier for the entity in a record of the
index of known content corresponding to the respective shingle in
the set of shingles.
[0106] In some embodiments, the instructions to calculate the
original author score for the entity include instructions to:
identify a plurality of shingles representing content, including
the submitted content, in the index of known content identified as
being associated with the entity; and calculate a percentage of the
plurality of shingles that are first instances of the shingles in
the index of known content.
[0107] In some embodiments, the instructions to calculate the
percentage of the plurality of shingles that are first instances
include instructions to: determine a first number of the plurality
of shingles, each having an associated timestamp, for which there
are no identical shingles in the index of known content having an
earlier timestamp; and calculate the percentage as a quotient of
the first number and a total number of shingles in the plurality of
shingles.
[0108] In some embodiments, the instructions to calculate the
original author score for the entity include instructions to:
identify a plurality of portions of content, including the
submitted content, in the index of known content identified as
being associated with the entity, each portion in the plurality of
portions representing a predetermined amount of data in the index
of known content; and calculate a percentage of the plurality of
the portions that are first instances of the portions of content in
the index of known content.
[0109] In some embodiments, the instructions to calculate the
percentage of the plurality of portions of content that are first
instances include instructions to: determine a first number of the
plurality of portions of content, each having an associated
timestamp, for which there are no identical portions of content in
the index of known content having an earlier timestamp; and
calculate the percentage as a quotient of the first number and a
total number of portions of content in the plurality of portions of
content.
[0110] In some embodiments, the message is received in response to
the entity publishing the submitted content on a website. In some
embodiments, the message is received in response to the entity
submitting the submitted content using a user interface provided by
the server. In some embodiments, the message is received in
response to the entity submitting the submitted content using a
user interface provided by the server, and where the submitted
content is marked as being submitted by the entity prior to
publication of the submitted content by the entity.
[0111] In some embodiments, the entity includes at least one author
of content. In some embodiments, the entity includes at least one
publisher of content. In some embodiments, the entity includes at
least one website.
[0112] In some embodiments, the at least one program includes
instructions to, prior to receiving the message from the entity:
receive, from a client device of the entity, a registration request
to register the entity as a registered author of content; provide,
to the client device, at least one verification mechanism to verify
that the entity is an author of content; responsive to the
providing, receive, from the client device of the entity, a
verification request to verify that the entity is an author of
content using the at least one verification mechanism; verify that
the entity is an author of content using the at least one
verification mechanism; and register the entity as a registered
author of content.
[0113] In some embodiments, the instructions to calculate the
original author score for the entity based on the evaluation of
representations of the submitted content in the index of known
content identified as being associated with the entity include
instructions to calculate the original author score for the entity
based on the evaluation of a subset of the representations of
submitted content in the index of known content identified as being
associated with the entity.
[0114] In some embodiments, the subset of representations of
submitted content consists of all content submitted by the entity
within a threshold period of time.
[0115] In some embodiments, each representation in the subset of
representations of submitted content satisfies a dissimilarity
criterion with respect to each other representation in the subset
of representations.
[0116] In some embodiments, the subset of representations of
submitted content consists of representations for ten or fewer
documents submitted by the entity.
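By way of illustration only, the subset criteria described above (a threshold period of time, a dissimilarity criterion, and a limit of ten documents) may be sketched as follows. The disclosure presents these as separate embodiments; combining them in one function is an assumption made for brevity. The `Representation` record, the use of Jaccard similarity over shingle hashes as the dissimilarity criterion, and the 0.5 threshold are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Representation:
    """Hypothetical record: one submitted document as a set of shingle hashes."""
    doc_id: str
    timestamp: float
    shingles: FrozenSet[int]


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard similarity of two shingle sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_subset(
    reps: List[Representation],
    now: float,
    window: float,
    max_similarity: float = 0.5,  # assumed dissimilarity threshold
    max_docs: int = 10,
) -> List[Representation]:
    """Pick recent, mutually dissimilar representations, newest first."""
    # Keep only content submitted within the threshold period of time.
    recent = [r for r in reps if now - r.timestamp <= window]
    recent.sort(key=lambda r: r.timestamp, reverse=True)
    chosen: List[Representation] = []
    for rep in recent:
        if len(chosen) >= max_docs:
            break
        # Each chosen representation must be dissimilar to every other one.
        if all(jaccard(rep.shingles, c.shingles) <= max_similarity for c in chosen):
            chosen.append(rep)
    return chosen
```

Preferring the newest representations when several are mutually similar is a design choice of this sketch; the disclosure does not state which of two similar representations is retained.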
[0117] Another aspect of the present disclosure provides a
non-transitory computer readable storage medium storing at least
one program configured for execution by at least one processor of a
computer system, the at least one program comprising instructions
to: receive a message, the message including submitted content,
where the submitted content is identified as being associated with
an entity; store a representation of the submitted content in an
index of known content; calculate an original author score for the
entity based on an evaluation of representations of submitted
content in the index of known content identified as being
associated with the entity, the original author score indicating a
likelihood that the entity is an author of original content; and
store the original author score and an identifier for the entity in
an entity database.
[0118] In some embodiments, the message comprises a timestamp, and
the instructions to store the representation of the submitted
content in the index of known content include instructions to:
identify a set of shingles for the submitted content; for each
respective shingle in the set of shingles, apply a hash function to
the respective shingle to produce a hash value for the respective
shingle; and store the timestamp of the message in a record of the
index of known content indexed by the hash value.
[0119] In some embodiments, the message includes a link to a
location on a resource hosting the submitted content, and where the
at least one program includes instructions to, for each respective
shingle in the set of shingles, store the link to the location on
the resource hosting the submitted content in a record of the index
of known content corresponding to the respective shingle in the set
of shingles.
[0120] In some embodiments, the at least one program includes
instructions to, for each respective shingle in the set of
shingles, store the identifier for the entity in a record of the
index of known content corresponding to the respective shingle in
the set of shingles.
[0121] In some embodiments, the instructions to calculate the
original author score for the entity include instructions to:
identify a plurality of shingles representing content, including
the submitted content, in the index of known content identified as
being associated with the entity; and calculate a percentage of the
plurality of shingles that are first instances of the shingles in
the index of known content.
[0122] In some embodiments, the instructions to calculate the
percentage of the plurality of shingles that are first instances
include instructions to: determine a first number of the plurality
of shingles, each having an associated timestamp, for which there
are no identical shingles in the index of known content having an
earlier timestamp; and calculate the percentage as a quotient of
the first number and a total number of shingles in the plurality of
shingles.
[0123] In some embodiments, the instructions to calculate the
original author score for the entity include instructions to:
identify a plurality of portions of content, including the
submitted content, in the index of known content identified as
being associated with the entity, each portion in the plurality of
portions representing a predetermined amount of data in the index
of known content; and calculate a percentage of the plurality of
the portions that are first instances of the portions of content in
the index of known content.
[0124] In some embodiments, the instructions to calculate the
percentage of the plurality of portions of content that are first
instances include instructions to: determine a first number of the
plurality of portions of content, each having an associated
timestamp, for which there are no identical portions of content in
the index of known content having an earlier timestamp; and
calculate the percentage as a quotient of the first number and a
total number of portions of content in the plurality of portions of
content.
[0125] In some embodiments, the message is received in response to
the entity publishing the submitted content on a website. In some
embodiments, the message is received in response to the entity
submitting the submitted content using a user interface provided by
the server.
[0126] In some embodiments, the message is received in response to
the entity submitting the submitted content using a user interface
provided by the server, and where the submitted content is marked
as being submitted by the entity prior to publication of the
submitted content by the entity. In some embodiments, the entity
includes at least one author of content. In some embodiments, the
entity includes at least one publisher of content. In some
embodiments, the entity includes at least one website.
[0127] In some embodiments, the at least one program includes
instructions to, prior to receiving the message from the entity:
receive, from a client device of the entity, a registration request
to register the entity as a registered author of content; provide,
to the client device, at least one verification mechanism to verify
that the entity is an author of content; responsive to the
providing, receive, from the client device of the entity, a
verification request to verify that the entity is an author of
content using the at least one verification mechanism; verify that
the entity is an author of content using the at least one
verification mechanism; and register the entity as a registered
author of content.
[0128] In some embodiments, the instructions to calculate the
original author score for the entity based on the evaluation of
representations of the submitted content in the index of known
content identified as being associated with the entity include
instructions to calculate the original author score for the entity
based on the evaluation of a subset of the representations of
submitted content in the index of known content identified as being
associated with the entity.
[0129] In some embodiments, the subset of representations of
submitted content consists of all content submitted by the entity
within a threshold period of time.
[0130] In some embodiments, each representation in the subset of
representations of submitted content satisfies a dissimilarity
criterion with respect to each other representation in the subset
of representations.
[0131] In some embodiments, the subset of representations of
submitted content consists of representations for ten or fewer
documents submitted by the entity.
BRIEF DESCRIPTION OF THE DRAWINGS
[0132] The embodiments disclosed herein are illustrated by way of
example, and not by way of limitation, in the figures of the
accompanying drawings. Like reference numerals refer to
corresponding parts throughout the drawings.
[0133] FIG. 1 is a block diagram illustrating a network system,
according to some embodiments.
[0134] FIG. 2A is a block diagram illustrating an example process
for crawling and indexing submitted content, according to some
embodiments.
[0135] FIG. 2B continues the process illustrated in FIG. 2A,
according to some embodiments.
[0136] FIG. 3A is a block diagram illustrating an example process
for determining a likelihood that an entity is an author of
original content, according to some embodiments.
[0137] FIG. 3B continues the process illustrated in FIG. 3A,
according to some embodiments.
[0138] FIG. 4A is a block diagram illustrating an example process
for re-ranking ranked search results, according to some
embodiments.
[0139] FIG. 4B continues the process illustrated in FIG. 4A,
according to some embodiments.
[0140] FIG. 5A is a block diagram illustrating an example process
for registering an entity as a registered author of content,
according to some embodiments.
[0141] FIG. 5B continues the process illustrated in FIG. 5A,
according to some embodiments.
[0142] FIG. 6 is a block diagram illustrating a server, according
to some embodiments.
[0143] FIG. 7 is a block diagram illustrating a resource server,
according to some embodiments.
[0144] FIG. 8 is a block diagram illustrating a search engine,
according to some embodiments.
[0145] FIG. 9 is a block diagram illustrating a client device,
according to some embodiments.
[0146] FIG. 10 is a flowchart of a method for re-ranking ranked
search results, according to some embodiments.
[0147] FIG. 11 is a flowchart of a method for determining that a
first document and a second document satisfy a similarity
criterion, according to some embodiments.
[0148] FIG. 12 is a flowchart of another method for determining
that a first document and a second document satisfy a similarity
criterion, according to some embodiments.
[0149] FIG. 13 is a flowchart of a method for determining that a
second entity satisfies a predefined authorship differential with
respect to a first entity, according to some embodiments.
[0150] FIG. 14 is a flowchart of a method for determining an author
score for a respective entity, according to some embodiments.
[0151] FIG. 15 is a flowchart of a method for determining an
original author score for a respective entity, according to some
embodiments.
[0152] FIG. 16 is a flowchart of a method for calculating a
percentage of shingles that are first instances, according to some
embodiments.
[0153] FIG. 17 is a flowchart of another method for determining an
original author score for a respective entity, according to some
embodiments.
[0154] FIG. 18 is a flowchart of a method for calculating a
percentage of portions of content that are first instances,
according to some embodiments.
[0155] FIG. 19 is a flowchart of another method for re-ranking
ranked search results, according to some embodiments.
[0156] FIG. 20A is an example screenshot illustrating a web page
that allows an entity to notify a server that content has been
published on a resource server, according to some embodiments.
[0157] FIG. 20B is an example screenshot illustrating another web
page that allows an entity to notify a server that content has been
published on a resource server, according to some embodiments.
[0158] FIG. 20C is an example screenshot illustrating another web
page that allows an entity to notify a server that content has been
published on a resource server, according to some embodiments.
DETAILED DESCRIPTION
[0159] The embodiments described herein provide techniques for
re-ranking ranked search results.
[0160] FIG. 1 is a block diagram illustrating a network system 100,
according to some embodiments. The network system 100 includes
resource servers 102 (e.g., the resource servers 102-1 to 102-N),
client devices 104 (e.g., the client devices 104-1 to 104-M) of
entities 105 (e.g., entities 105-1 to 105-P), a server 110, and a
search engine 112 coupled to network 120. Network 120 can generally
include any type of wired or wireless communication channel capable
of coupling together computing nodes. This includes, but is not
limited to, a local area network, a wide area network, or a
combination of networks. In some embodiments, network 120 includes
the Internet.
[0161] In some embodiments, a respective resource server 102 is
associated with a respective entity 105. For example, in some
embodiments the respective entity 105 is an operator of the
respective resource server 102. In other embodiments, the
respective entity 105 is a subscriber of a publication service
(e.g., a news wire service) hosted by the respective resource
server 102, and/or a subscriber of a web hosting service of the
respective resource server 102.
[0162] In some implementations, the respective entity 105 includes
at least one author of content. For example, the respective entity
105 may include an individual author or one of a plurality of
co-authors for (or contributors to) content. In some embodiments,
an entity 105 is a business organization that produces original or
partially original content. In some embodiments, an entity 105 is a
news organization. In some implementations, the entity 105 includes
at least one publisher of content. For example, the respective
entity 105 may be a publisher of books, a publisher of periodicals,
a publisher of online content and/or the like. In some
implementations, the respective entity 105 is the author of content
on at least one website. For example, the respective entity 105 may
contribute original content to a blogging website, a website for a
publisher (e.g., news, magazine, etc.) and/or the like. Note that
such a website may include a subset of the content within a
particular domain. For example, the website may include content in
a particular domain (e.g., a top-level domain example.com). In
another example, the website includes content in a sub-domain of
the particular domain (e.g., a sub-domain blogs.example.com). In
another example, the website includes content in a directory of the
domain (e.g., www.example.com/~johndoe/). In some
embodiments, the website includes content in: a plurality of
domains (e.g., a network of affiliated websites), a plurality of
sub-domains of at least one domain, and/or a plurality of
subdirectories of at least one domain. In some embodiments, the
content authored by an entity 105 is a blog post, a social network
post, or a post in an on-line discussion thread. In some
embodiments, the content authored by the entity 105 is any content
that has been posted to a location accessible on the Internet such
that it is readily ascertainable that the entity 105 posted the
content.
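By way of illustration only, the domain, sub-domain, and directory examples above may be sketched as a check of whether a URL falls within the website associated with an entity. The function name `in_website_scope`, its parameters, and the choice to treat sub-domains as falling within a domain-level scope are assumptions of this sketch and are not specified by this disclosure.

```python
from urllib.parse import urlsplit


def in_website_scope(url: str, scope_host: str, scope_path: str = "/") -> bool:
    """Return True if url lies under scope_host (or a sub-domain) and scope_path."""
    # Accept URLs written without a scheme, e.g. "www.example.com/dir/".
    parts = urlsplit(url if "://" in url else "//" + url)
    host = (parts.hostname or "").lower()
    scope_host = scope_host.lower()
    # Assumption: a sub-domain of the scope host is within the scope.
    host_ok = host == scope_host or host.endswith("." + scope_host)
    # Compare whole path segments so "/~johndoe2/" is not under "/~johndoe/".
    path = parts.path or "/"
    norm = path if path.endswith("/") else path + "/"
    prefix = scope_path if scope_path.endswith("/") else scope_path + "/"
    return host_ok and norm.startswith(prefix)
```

For example, under these assumptions a page in a user directory of www.example.com is within a scope defined by that host and directory, while a page in a different directory of the same host is not.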
[0163] Thus, network system 100 encompasses a broad array of
systems 100. For example, in some embodiments (i) a plurality of
entities 105 author or publish content (e.g., books, periodicals,
websites, etc.) that is hosted on a single resource server 102,
(ii) a plurality of entities 105 author content that is stored on a
plurality of resource servers 102, and/or (iii) a single entity 105
authors content that is stored on a plurality of resource servers
102. Any combination of such arrangements is also found in the
network systems 100 within the scope of the present disclosure. For
instance, an example network system 100 provides for one entity 105
that authors content that is stored on two resource servers 102
while two or more other entities 105 author content that is stored
on a single resource server 102.
[0164] The search engine 112 periodically crawls links to identify
and index new (or previously un-indexed) content hosted by resource
servers 102. The frequency at which the search engine 112 crawls
links for a particular domain may depend on factors including, but
not limited to, a frequency at which new (or previously un-indexed)
content is available on the particular domain, an amount of
traffic the domain receives from users (e.g., entities), an
importance of web pages in the particular domain, whether the
domain includes original content, and the like. The search engine
112 also identifies ranked search results that satisfy search
queries received from client devices 104.
[0165] The client devices 104 include any computing device that the
entities 105 may use to access the resource servers 102 (e.g., a
website or content hosted on the resource servers 102, etc.), the
server 110, and/or the search engine 112. For example, the client
devices 104 may include a laptop computer, a desktop computer, a
smartphone, a mobile phone, a personal digital assistant, a tablet
device, a set top box, a game console, and/or the like.
[0166] In some embodiments, the entity 105-1 uses the client device
104-1 to publish content onto the resource server 102-1. For
example, the resource server 102-1 may host content for a newspaper
website and the entity 105-1 may use the client device 104-1 to
upload and to publish a news article onto the newspaper website
hosted on the resource server 102-1. Note that the entity 105-1 may
upload the news article to the resource server 102-1 without
publishing the news article. For example, the entity 105-1 may
upload the news article to the resource server 102-1 and publish
the news article at a later date and/or time. The act of publishing
content makes the content available to users (e.g., to the public,
to subscribers, etc.).
[0167] As discussed above, one problem faced by search engines is
that content uploaded to the resource server 102-1 may not be
indexed and may not be available to the search engine until after
the usefulness of the content has passed. In order to address this
problem, in some embodiments, the resource server 102-1 sends a
message to the server 110 to notify the server 110 that the content
has been published on the resource server 102-1. Alternatively, in
some embodiments, the entity 105-1 uses the client device 104-1 to
send a message to the server 110 to notify the server 110 that the
content has been published on the resource server 102-1. For
example, the entity 105-1 may use the client device 104-1 to send a
message to the server 110 using a web page provided by the server
110 that allows the entity 105-1 to notify the server 110 that the
content has been published on the resource server 102-1. In either
of these alternative embodiments, the message includes all or a
portion of the content, termed the submitted content 231, which was
published on the resource server 102-1 and/or a link to the
published content on the resource server 102-1. The term "submitted
content" is used in this specification to refer to the content
included in the message or the content that is a target of a link
that is included in the message. The server 110 may then
conditionally issue a request to the search engine 112 to crawl and
index the content (i.e., the submitted content 231) that was
uploaded to the resource server 102-1. In some embodiments, the
request is a high-priority request that is directed to a high
priority crawler of the search engine 112. In these embodiments, a
time when the high priority crawler of the search engine 112 is
scheduled to crawl the content occurs before a time when a standard
crawler of the search engine 112 is scheduled to crawl the
submitted content 231. These embodiments are described in more
detail below with reference to FIGS. 2A and 2B.
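By way of illustration only, the effect of a high-priority request (content is crawled before it would be reached by a standard crawl) may be sketched with a priority queue. The disclosure describes a separate high priority crawler; modeling both crawlers as one queue, and the names `CrawlQueue`, `HIGH`, and `STANDARD`, are assumptions of this sketch.

```python
import heapq
import itertools
from typing import List, Tuple

HIGH, STANDARD = 0, 1  # lower value is crawled first (assumed encoding)


class CrawlQueue:
    """Hypothetical crawl scheduler ordering requests by priority, then arrival."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        # Monotonic counter keeps first-in, first-out order within a priority.
        self._counter = itertools.count()

    def submit(self, url: str, priority: int = STANDARD) -> None:
        """Queue a crawl request for the given link."""
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def next_url(self) -> str:
        """Remove and return the link that should be crawled next."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self) -> int:
        return len(self._heap)
```

In this sketch a link submitted with `HIGH` priority is returned by `next_url` before any pending `STANDARD` link, regardless of submission order.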
[0168] Some embodiments of the disclosed systems and methods
further address the problem that arises when crawling and indexing
encounter duplicate or substantially duplicate existing content.
Such duplicate content makes it difficult to identify the authors
of original content, so that the original content can be attributed
to those authors and returned accordingly in response to user
queries. Accordingly, in some embodiments, the
server 110 determines a likelihood that an entity is an author of
original content. These embodiments are described in more detail
below with reference to FIGS. 3A and 3B. Furthermore, in some
embodiments, search results that include content associated with
authors of original content may be ranked higher in ranked search
results than search results that include content that is not
associated with authors of original content. These
embodiments are described in more detail below with reference to,
for example, FIGS. 4A, 4B and 10-19.
[0169] Note that although FIG. 1 shows one instance for each of the
server 110 and the search engine 112, multiple servers and search
engines may be present in the network system 100. For example, each
of the server 110 and the search engine 112 may include a plurality
of distributed servers. The plurality of distributed servers may
provide load balancing and/or may provide low-latency points of
access to nearby computer systems. The distributed servers may be
located within a single location (e.g., a data center, a building,
etc.) or may be geographically distributed across multiple
locations (e.g., data centers at various geographical locations,
etc.).
[0170] Also note that although the embodiments described herein
refer to the server 110 and the search engine 112, the embodiments
may be applied to multiple servers and search engines. Furthermore,
the functionality of any of the server 110 and the search engine
112 may be implemented within a single server (or a set of
distributed servers). For example, the server 110 and the search
engine 112 may be located on the same server (or the same set of
distributed servers).
[0171] FIGS. 2A and 2B are block diagrams 200 illustrating an
example process for crawling and indexing submitted content 231,
according to some embodiments. Note that although the following
discussion refers to the resource server 102-1, the client device
104-1, and the entity 105-1, the process illustrated in FIGS. 2A
and 2B may be applied to any resource server, client device, and
entity. As illustrated in FIG. 2A, in response to content being
published on the resource server 102-1, the resource server 102-1
sends a message 230 to the server 110 indicating that the content
was published by the entity 105-1 on the resource server 102-1.
[0172] In an alternative embodiment not depicted in FIG. 2A, in
response to content being published on the resource server 102-1,
the client device 104-1 sends the message 230 to the server 110
indicating that the content was published by the entity 105-1 on
the resource server 102-1. For example, the entity 105-1 may use
the client device 104-1 to send the message 230 to the server 110
by accessing a web page provided by the server 110 that allows the
entity 105-1 to notify the server 110 that the content has been
published by the entity 105-1 on the resource server 102-1. In
another example, the client device 104-1 may send the message 230
to the server 110 indicating that the content was published by the
entity 105-1 on the resource server 102-1.
[0173] In some embodiments, the message 230 includes all of the
content that was published on the resource server 102-1 and a link
to the content that was published on the resource server 102-1. In
some embodiments, the message 230 includes a portion of the content
that was published on the resource server 102-1 and a link to the
content that was published on the resource server 102-1. For
example, the message 230 may contain the first kilobyte, first 10
kilobytes, first 100 kilobytes or first megabyte of the content
that was published on the resource server 102-1. In some
embodiments, the message 230 includes a digest of the content that
was published on the resource server 102-1 and a link to the
content that was published on the resource server 102-1. In some
embodiments, the message 230 only includes a link to the content
that was published on the resource server 102-1. In these
embodiments, the server 110 obtains all or a portion of the content
published on the resource server 102-1 prior to proceeding with the
other operations described below. For example, in some embodiments
the server 110 obtains the first kilobyte, first 10 kilobytes,
first 100 kilobytes or first megabyte of the content that was
published on the resource server 102-1. In another example, in some
embodiments the server 110 obtains a digest of the content that was
published on the resource server 102-1. This digest may be the
result of, for example, applying a predetermined hash function to
the content. As discussed above, the term "submitted content" is used
in this specification to refer to the content included in the
message 230 or the content that is a target of a link that is
included in the message 230.
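By way of illustration only, the message variants described above (content with a link, a portion with a link, a digest with a link, or a link alone) may be sketched as a single record with optional fields, together with a helper that resolves the submitted content. The `Message` fields, the `fetch` callback, the one-kilobyte default, and the use of SHA-256 as the predetermined hash function are assumptions of this sketch.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Message:
    """Hypothetical shape of the message 230; every field is optional."""
    link: Optional[str] = None
    content: Optional[bytes] = None  # all, or a leading portion, of the content
    digest: Optional[str] = None     # e.g., a hex digest of the content


def submitted_content(
    msg: Message, fetch: Callable[[str], bytes], max_bytes: int = 1024
) -> bytes:
    """Return the content to evaluate: from the message, else via the link."""
    if msg.content is not None:
        return msg.content
    if msg.link is None:
        raise ValueError("message carries neither content nor a link")
    # No content in the message: obtain a leading portion of the
    # published content from the target of the link before proceeding.
    return fetch(msg.link)[:max_bytes]


def content_digest(data: bytes) -> str:
    """One possible predetermined hash function (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()
```

The `fetch` callback stands in for whatever retrieval mechanism the server uses; passing it as a parameter keeps the sketch independent of any particular network library.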
[0174] Returning to the embodiment illustrated in FIG. 2A,
regardless of whether the message 230 originated from resource
server 102 or from client device 104, the front end module 202
receives the message 230 and provides the message 230 to the
content module 204. The content module 204 evaluates whether the
submitted content 231 is represented in an index of known content
220 to determine whether the submitted content 231 is new relative
to the known content.
[0175] In some embodiments, the index of known content 220 includes
content obtained through crawl operations of the search engine 112.
In some embodiments, the index of known content 220 includes a
representation of content obtained through messages (e.g., the
message 230) received from the resource server 102-1 (or the client
device 104-1). The content is obtained either directly from a
message 230 when the message 230 includes the submitted content 231
or indirectly when the message 230 includes a link to the submitted
content 231.
[0176] As illustrated in FIG. 2B, when the submitted content 231 is
new relative to the known content in the index of known content
220, the content module 204 issues a request 232 to the search
engine 112 (via the front end module 202) to crawl and index the
submitted content 231 that is hosted by the resource server 102-1.
The request 232 includes the link (e.g., address, URL, pointer) to
the content that was published on the resource server 102-1 (i.e.,
the submitted content 231).
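By way of illustration only, the conditional crawl request may be sketched as follows. This passage does not state how the content module 204 decides that submitted content is new relative to the known content; the sketch assumes content is new when at least a threshold fraction of its shingle hashes is absent from the index. The names `is_new`, `maybe_request_crawl`, and `send_request`, and the 0.5 threshold, are hypothetical.

```python
from typing import Callable, Iterable, Set


def is_new(
    shingle_hashes: Iterable[int],
    known: Set[int],
    min_unseen_fraction: float = 0.5,  # assumed newness threshold
) -> bool:
    """True if enough of the content's shingle hashes are absent from the index."""
    hashes = set(shingle_hashes)
    if not hashes:
        return False
    unseen = sum(1 for h in hashes if h not in known)
    return unseen / len(hashes) >= min_unseen_fraction


def maybe_request_crawl(
    link: str,
    shingle_hashes: Iterable[int],
    known: Set[int],
    send_request: Callable[[str], None],
) -> bool:
    """Issue a crawl-and-index request for the link only when the content is new."""
    if is_new(shingle_hashes, known):
        send_request(link)  # stands in for the request 232 carrying the link
        return True
    return False
```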
[0177] In some embodiments, the content module 204 stores a
representation of the submitted content 231 in the index of known
content 220. For example, in some instances, the content module 204
generates a set of shingles for the submitted content 231, and for
each respective shingle in the set of shingles, the content module
204 (i) applies a hash function to the shingle to produce a
corresponding hash value for the respective shingle and (ii) stores
information relating to the submitted content 231 (e.g., a link to
a location on the resource server 102-1 hosting the submitted
content 231, a timestamp of the message 230, etc.) in a record in
the index of known content 220 that is indexed by the respective
hash value.
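By way of illustration only, the storing of a representation of submitted content may be sketched as follows. The sketch assumes word-level shingles of four consecutive words, SHA-1 as the hash function, and an in-memory dictionary as the index of known content; none of these choices is specified by this disclosure, and the names `shingles`, `shingle_hash`, and `index_content` are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import DefaultDict, Dict, List


def shingles(text: str, k: int = 4) -> List[str]:
    """Word-level k-shingles: every run of k consecutive words."""
    words = text.split()
    if len(words) < k:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def shingle_hash(shingle: str) -> str:
    """Hash value used as the record key (SHA-1 is an assumption)."""
    return hashlib.sha1(shingle.encode("utf-8")).hexdigest()


def index_content(
    index: DefaultDict[str, List[Dict[str, object]]],
    text: str,
    link: str,
    timestamp: float,
    entity_id: str,
    k: int = 4,
) -> int:
    """Store link, timestamp, and entity under each shingle's hash value."""
    count = 0
    for sh in set(shingles(text, k)):  # index each distinct shingle once
        index[shingle_hash(sh)].append(
            {"link": link, "timestamp": timestamp, "entity": entity_id}
        )
        count += 1
    return count
```

A usage example under the same assumptions: calling `index_content(defaultdict(list), "the quick brown fox jumps", link, ts, entity)` stores records under two hash values, one for each four-word shingle.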
[0178] The embodiments described in FIGS. 2A and 2B are described
in more detail with reference to FIGS. 10-15 of U.S. Provisional
Patent Application No. 61/648,546, filed on May 17, 2012, entitled
"Systems and Methods for Crawling and Indexing Content," which is
hereby incorporated by reference herein in its entirety.
[0179] FIGS. 3A and 3B are block diagrams 300 illustrating an
example process for determining a likelihood that an entity is an
author of original content, according to some embodiments. Note
that although the following discussion refers to the resource
server 102-1, the client device 104-1, and the entity 105-1, this
process may be applied to any resource server, client device, and
entity (e.g., any resource server 102, any client device 104, and
any entity 105). As illustrated in FIG. 3A, in response to content
being published on the resource server 102-1, the resource server
102-1 sends a message 330 to the server 110 indicating that the
content was published by the entity 105-1 on the resource server
102-1. Alternatively, although not depicted in FIG. 3A, in response
to content being published on the resource server 102-1 (or in
response to an indication that content will be published at a
future time on the resource server 102-1), the client device 104-1
associated with the entity 105-1 sends the message 330 to the
server 110 indicating that the content was published by the entity
105-1 on the resource server 102-1 (or will be published on the
resource server 102-1 at a future time). For example, the entity
105-1 may use the client device 104-1 to send the message 330 to
the server 110 by accessing a web page provided by the server 110
that allows the entity 105-1 to notify the server 110 that the
content has been published by the entity 105-1 on the resource
server 102-1. FIGS. 20A-20C are exemplary screenshots of web pages
provided by the server 110 that allow the entity 105-1 to notify
the server 110 that content has been published by the entity 105-1
on the resource server 102-1. In FIG. 20A, a web page 2000 allows
the entity 105-1 to view content associated with the entity 105-1. In
this example, the entity 105-1 is associated with the domain
"www.foo.com" and the path "bar," as illustrated by a location
selector 2004. However, in other examples, the entity 105-1 may be
associated with a subdomain of "foo.com" (e.g., "blog.foo.com")
and/or any path of the domain (or subdomain) of "foo.com" (e.g.,
"blog.foo.com/bar/", "www.foo.com"). When the entity 105-1 presses
a "register content" button 2002, the entity 105-1 is redirected to
a content registration web page 2010, which is illustrated in FIG.
20B. The content registration web page 2010 allows the entity 105-1
to provide content in a content field 2016, provide labels for the
content in an optional label field 2012, provide a link to the
content (e.g., URL) in an optional link field (not shown), and
format the content using formatting tools 2014. The entity 105-1
submits the content to the server 110 by pressing a "submit
content" button 2018. After submitting the content using the
content registration web page 2010, the entity 105-1 is redirected
to the web page 2000, which now indicates that the content is
registered and the date (and/or time) at which the content was
registered (e.g., see reference numbers 2022 and 2024). In some
embodiments, full access to the functionality associated with the
web pages illustrated in FIG. 20 requires that a user log in to
their account by, for example, selecting the "My Account" tab 2050
and providing log in information that identifies the user.
Returning to FIG. 3A, in another example, the client device 104-1
may send the message 330 to the server 110 indicating that the
content was published by the entity 105-1 on the resource server
102-1.
[0180] Returning to FIG. 3A, in some embodiments, the message 330
includes only the content that was published on the resource server
102-1 or that will be published on the resource server 102-1 at a
future time. In some embodiments, the message 330 includes the
content that was published on the resource server 102-1 and a link
to the content that was published on the resource server 102-1.
[0181] In some embodiments, the message 330 only includes a link to
the content that was published on the resource server 102-1. In
some embodiments, the message 330 includes a portion of the content
that was published on the resource server 102-1 and a link to the
content that was published on the resource server 102-1. For
example, the message 330 may contain the first kilobyte, first 10
kilobytes, first 100 kilobytes or first megabyte of the content
that was published on the resource server 102-1. In some
embodiments, the message 330 may contain a digest of the content
that was published on the resource server 102-1. In embodiments
where the message 330 does not include any portion of the content
but does contain a link to the content, the server 110 obtains the
content published on the resource server 102-1 prior to proceeding
with the other operations described below. In some embodiments
where the message 330 includes a portion of the content but not all
of the content, the server 110 obtains the full set of content
published on the resource server 102-1 prior to proceeding with the
other operations described below. In some alternative embodiments
where the message 330 includes a portion of the content but not all
of the content, the server 110 does not obtain the full set of
content published on the resource server 102-1 prior to proceeding
with the other operations described below.
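The variants of the message 330 described above can be sketched as a small data structure plus a rule for when the server must retrieve content before proceeding. This is only an illustrative sketch: the class name Message330, its field names, and the fetch_when_partial flag are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message330:
    """Hypothetical container for the variants of the message 330."""
    content: Optional[str] = None   # full content, or only a leading portion
    is_partial: bool = False        # True when content is only a portion
    link: Optional[str] = None      # link to the published content, if any
    digest: Optional[str] = None    # optional digest of the content


def server_must_fetch(msg: Message330, fetch_when_partial: bool = True) -> bool:
    """Decide whether the server obtains the content from the resource
    server before proceeding with the other operations."""
    if msg.content is None:
        # Link-only message: the content must be obtained via the link.
        return msg.link is not None
    if msg.is_partial:
        # Some embodiments obtain the full content, alternative ones do not.
        return fetch_when_partial and msg.link is not None
    return False  # Full content was supplied in the message itself.
```

The fetch_when_partial flag models the two alternative embodiments for messages that carry only a portion of the content.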
[0182] The front end module 202 receives the message 330 and
provides the message 330 to the content module 204. The content
module 204 stores a representation of the submitted content 331 in
the index of known content 220. For example, in some embodiments,
the content module 204 generates a set of shingles for the
submitted content 331 and, for each respective shingle in the set
of shingles, the content module 204 (i) applies a hash function to
the respective shingle to produce a corresponding hash value for
the respective shingle and (ii) stores information relating to the
submitted content 331 (e.g., a link to a location on the resource
server 102-1 hosting the submitted content 331, a timestamp of the
message 330, etc.) in a record in the index of known content 220
that is indexed by the respective hash value.
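The shingle-and-hash indexing described above might be sketched as follows. The shingle width of four words, the use of SHA-1 as the hash function, the record fields, and the example link are all assumptions made for illustration; the disclosure does not specify them here.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def shingle_hash(shingle):
    """Hash one shingle to a hex key (SHA-1 is an arbitrary choice here)."""
    return hashlib.sha1(shingle.encode("utf-8")).hexdigest()


def index_content(index, text, link, timestamp, entity_id, k=4):
    """Store one record per shingle, indexed by that shingle's hash value."""
    record = {"link": link, "timestamp": timestamp, "entity": entity_id}
    for s in shingles(text, k):
        index[shingle_hash(s)].append(record)
    return index


# Toy stand-in for the index of known content 220.
index = defaultdict(list)
index_content(
    index,
    "the quick brown fox jumps over the lazy dog",
    link="http://www.foo.com/bar/post-1",  # hypothetical location
    timestamp=1368748800,
    entity_id="entity-105-1",
)
```

Looking up a later submission then reduces to hashing its shingles and checking which keys already hold records with earlier timestamps.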
[0183] As illustrated in FIG. 3B, the author module 206 calculates
an original author score 332 for the entity 105-1 based on an
evaluation of the collective representations of submitted content
(e.g., all submitted content associated with the entity 105-1) in
the index of known content 220 identified as being associated with
the entity 105-1. The original author score is a metric
representing a probability or likelihood that an entity is an
author of original content. In some embodiments, the original
author score for the entity 105-1 is calculated based at least in
part on a percentage of the representations of submitted content in
the index of known content 220 identified as being associated with
the entity 105-1 that are first instances of representations of
submitted content in the index of known content 220.
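One way to read the percentage-based calculation above is sketched below. It assumes each submission is reduced to a (content key, entity, timestamp) triple and that a "first instance" is the earliest-timestamped submission of a given content key; both are simplifying assumptions for illustration, not details taken from the disclosure.

```python
def original_author_score(submissions, entity_id):
    """Fraction of the entity's distinct content that it submitted first.

    submissions: iterable of (content_key, entity_id, timestamp) triples.
    """
    earliest = {}  # content_key -> (timestamp, entity_id) of first instance
    for key, ent, ts in submissions:
        if key not in earliest or ts < earliest[key][0]:
            earliest[key] = (ts, ent)

    mine = {key for key, ent, _ in submissions if ent == entity_id}
    if not mine:
        return 0.0
    first = sum(1 for key in mine if earliest[key][1] == entity_id)
    return first / len(mine)


submissions = [
    ("a", "e1", 1), ("a", "e2", 2),   # e1 published "a" first
    ("b", "e2", 1), ("b", "e1", 5),   # e2 published "b" first
    ("c", "e1", 3),                   # only e1 published "c"
]
```

With this toy data, e1 is first for two of its three pieces of content and e2 for one of its two.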
[0184] In some embodiments, the author module 206 calculates the
original author score for the entity 105-1 in response to receiving
the message 330. In some embodiments, the author module 206
periodically calculates the original author score for the entity
105-1 (e.g., every hour, every day, every week, etc.). In some
embodiments, all content associated with the entity 105-1 in the
index of known content 220 is considered in computing the original
author score for the entity 105-1. In some embodiments, only more
recent content associated with the entity 105-1 is considered in
computing the original author score for the entity 105-1. For
example, in various embodiments, only content within a threshold
time period is used to calculate the original author score of the
entity 105-1 (e.g., content with a timestamp within the past week,
the past month, the past year, or the past two years, etc.). In
some embodiments, only a discrete set of content associated with
the entity 105-1 is considered in computing the original author
score for the entity 105-1. For example, in various embodiments,
only five or fewer, ten or fewer, fifty or fewer, or 100 or fewer
different documents identified as being associated with the entity
105-1 are used to calculate the original author score of the entity
105-1. In some embodiments, recent content associated with the
entity 105-1 in the index of known content provides a more
significant contribution to the original author score for the
entity 105-1 than less recent content. To illustrate one such
embodiment, in some original authorship calculations, more recent
content is up-weighted and less recent content is down-weighted in
the original author score calculations.
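The up-weighting of recent content can be illustrated with an exponential decay, which is only one of many possible weighting schemes. The half-life parameter and the function name below are assumptions introduced for this sketch.

```python
def weighted_author_score(items, now, half_life_days=180.0):
    """Recency-weighted fraction of first-instance content.

    items: iterable of (timestamp_seconds, is_first_instance) pairs.
    Each item's weight halves every half_life_days of age.
    """
    numerator = 0.0
    denominator = 0.0
    for ts, is_first in items:
        age_days = max(0.0, (now - ts) / 86400.0)
        weight = 0.5 ** (age_days / half_life_days)
        numerator += weight * (1.0 if is_first else 0.0)
        denominator += weight
    return numerator / denominator if denominator else 0.0
```

For example, a first-instance document submitted today and a copied document submitted 180 days ago carry weights 1.0 and 0.5, so the score is 1.0 / 1.5 rather than the unweighted 0.5.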
[0185] In some embodiments, only content that is unique with
respect to all other content that (i) has already been selected to
compute an original author score for the entity 105-1 and (ii) is
associated with the entity 105-1 is considered when computing the
original author score for the entity. For example, in some
embodiments, only content that (i) satisfies a dissimilarity
criterion (e.g., at least 80 percent different, at least 90 percent
different, etc.) with respect to all other content that has been
selected for computation of an original author score for the entity
105-1 and (ii) is associated with the entity 105-1 is used in
computing the original author score for the entity 105-1. In this
way, original author scores are not biased by multiple submissions
of the same original work by the entity 105-1.
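The selection of mutually dissimilar content described above can be sketched as a greedy filter. The dissimilarity measure used here (one minus the ratio computed by Python's difflib.SequenceMatcher) is a stand-in chosen for brevity, and the 0.8 threshold mirrors the "at least 80 percent different" example; neither is prescribed by the disclosure.

```python
from difflib import SequenceMatcher


def dissimilarity(a, b):
    """Return a value in [0, 1]; 0 means identical, 1 means nothing shared."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def select_unique(documents, min_dissimilarity=0.8):
    """Keep a document only if it is sufficiently different from every
    document already selected for the score computation."""
    selected = []
    for doc in documents:
        if all(dissimilarity(doc, kept) >= min_dissimilarity for kept in selected):
            selected.append(doc)
    return selected


# A resubmission of the same work is dropped; unrelated work is kept.
unique_docs = select_unique(["aaaa", "aaaa", "bbbb"])  # -> ["aaaa", "bbbb"]
```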
[0186] Dissimilarity between a first document and second document
associated with the entity 105-1 can be computed using any measure
of similarity or dissimilarity in order to determine whether the
dissimilarity criterion has been satisfied. In some embodiments, a
first document and second document associated with the entity 105-1
satisfy the dissimilarity criterion when a Levenshtein edit
distance from the first document to the second document is greater
than a predetermined number of character edits. In some
embodiments, the first and second documents satisfy the
dissimilarity criterion when a Jaro-Winkler distance metric from
the first document to the second document is more than a
predetermined distance. In some embodiments, the first and second
documents satisfy the dissimilarity criterion when a Jaccard index
is less than a predetermined value. In some embodiments, the first
and second documents satisfy the dissimilarity criterion when a
Masi distance is greater than a predetermined distance.
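Two of the measures named above, the Levenshtein edit distance and the Jaccard index, can be sketched in a few lines. Treating each document as a set of whitespace-separated words for the Jaccard index is an assumption of this sketch, as are the function names and thresholds.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def jaccard_index(a, b):
    """Size of the word-set intersection divided by size of the union."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def dissimilar_by_edit_distance(a, b, min_edits):
    """Criterion: edit distance greater than a predetermined edit count."""
    return levenshtein(a, b) > min_edits


def dissimilar_by_jaccard(a, b, max_index):
    """Criterion: Jaccard index less than a predetermined value."""
    return jaccard_index(a, b) < max_index
```

Note the opposite directions of the two comparisons: edit distance grows as documents diverge, while the Jaccard index shrinks.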
[0187] In some embodiments, the first and second documents
satisfy the dissimilarity criterion when a character count of the
first document and a character count of the second document deviate
by more than a character count threshold (and/or a word count
threshold). For example, if the character count threshold is five
characters, the first document and the second document satisfy the
dissimilarity criterion when the character count of the first
document and the character count of the second document deviate by
more than five characters.
[0188] In some embodiments, the first and second documents
satisfy the dissimilarity criterion when the first document
contains a predetermined quantity of shingles that are not found in
the second document, or vice versa, where such shingles are
computed using the techniques disclosed below in conjunction with
step 1202 of FIG. 12.
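A shingle-based version of the dissimilarity criterion might look like the following sketch. The word-level shingling, the shingle width, and the min_missing parameter are illustrative assumptions; the shingling technique of step 1202 itself is not reproduced here.

```python
def word_shingles(text, k=2):
    """Return the set of k-word shingles in text (assumed shingling scheme)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def dissimilar_by_shingles(doc1, doc2, min_missing, k=2):
    """True when either document contains at least min_missing shingles
    that are not found in the other document."""
    s1, s2 = word_shingles(doc1, k), word_shingles(doc2, k)
    return len(s1 - s2) >= min_missing or len(s2 - s1) >= min_missing
```

For "a b c d e" and "a b c d f" with two-word shingles, each document has exactly one shingle the other lacks, so the criterion holds for min_missing=1 but not for min_missing=2.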
[0189] In some embodiments, the author module 206 stores the
original author score 332 in a record of an author database 222
that is associated with the entity 105-1.
[0190] The embodiments described in FIGS. 3A and 3B are described
in more detail below with reference to FIGS. 10-16 in U.S.
Provisional Patent Application No. 61/648,562 filed on May 17,
2012, entitled "Systems and Methods for Determining a Likelihood
that an Entity is an Author of Original Content" which is hereby
incorporated by reference herein in its entirety.
[0191] FIGS. 4A and 4B are block diagrams 400 illustrating an
example process for re-ranking ranked search results, according to
some embodiments. As illustrated in FIG. 4A, ranked search results
430 are obtained from the search engine 112. The ranked search
results 430 may be identified by the search engine 112 in response
to the search engine 112 receiving a search query from a client
device (e.g., the client device 104-1). The front end module 202
provides the ranked search results 430 to the re-ranking module
208.
[0192] When the ranked search results 430 include a first search
result corresponding to a first document associated with a first
entity (e.g., entity 105-1) and a second search result
corresponding to a second document associated with a second entity
(e.g., entity 105-2) where (1) the first search result is ranked
higher than the second search result, (2) the first document and
the second document satisfy a similarity criterion (e.g., based on
information obtained from the index of known content 220), and (3)
the second entity satisfies a predefined authorship differential
with respect to the first entity (e.g., based on original author
scores obtained from the author database 222), the re-ranking
module 208 swaps the first ranked search result and the second
ranked search result to produce re-ranked search results 431.
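The three-part swap condition above can be sketched as follows. This sketch assumes the "predefined authorship differential" is a minimum difference between original author scores, and it takes the similarity test as a caller-supplied function; the names rerank, similar, and min_differential are hypothetical.

```python
def rerank(results, similar, author_score, min_differential=0.2):
    """Swap a lower-ranked result above a higher-ranked one when the two
    documents are similar and the lower-ranked entity's original author
    score exceeds the higher-ranked entity's by min_differential.

    results: list of (doc_id, entity_id) pairs, best-ranked first.
    similar: function (doc_id, doc_id) -> bool (the similarity criterion).
    author_score: mapping entity_id -> original author score.
    """
    out = list(results)
    for i in range(len(out)):
        for j in range(i + 1, len(out)):
            (doc_hi, ent_hi), (doc_lo, ent_lo) = out[i], out[j]
            gap = author_score[ent_lo] - author_score[ent_hi]
            if similar(doc_hi, doc_lo) and gap >= min_differential:
                out[i], out[j] = out[j], out[i]
    return out


ranked = [("copy", "e2"), ("other", "e3"), ("orig", "e1")]
scores = {"e1": 0.9, "e2": 0.3, "e3": 0.5}
is_similar = lambda a, b: {a, b} == {"copy", "orig"}
reranked = rerank(ranked, is_similar, scores)
```

Here the copied document and the original trade places, while the unrelated middle result keeps its position.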
[0193] As illustrated in FIG. 4B, the re-ranking module 208
provides the re-ranked search results 431 to the search engine 112
via the front end module 202. The search engine 112 then provides
the re-ranked search results 431 to the client device for
presentation to a user of the client device.
[0194] The embodiments described in FIGS. 4A and 4B are described
in more detail below with reference to FIGS. 10-19 in U.S.
Provisional Patent Application No. 61/648,553 filed on May 17,
2012, entitled "System and Method Re-ranking Ranked Search Results"
which is hereby incorporated by reference herein in its
entirety.
[0195] FIGS. 5A and 5B are block diagrams 500 illustrating an
example process for registering an entity as a registered author of
content, according to some embodiments. The process of registering
an entity as a registered author of content is an optional process
that is performed prior to the processes described above with reference
to FIGS. 2-4. Note that although the following discussion refers to
the entity 105-1, this process may be used to register any entity
as a registered author of content. Also note that the process of
registering an entity as a registered author of content merely
identifies the entity as an author of content, but does not verify
whether the entity is an author of original content (e.g., an
entity that generates original content versus an entity that copies
content).
[0196] In accordance with the example embodiments illustrated in
FIGS. 5A and 5B, the server 110 receives, from the client device
104-1, a registration request 530 to register the entity 105-1 as a
registered author of content. The front end module 202 provides the
registration request 530 to the registration module 210. When the
registration module 210 determines that the entity 105-1 is not
registered as a registered author of content, the registration
module 210 provides, to the client device 104-1 associated with the
entity 105-1 via the front end module 202, a verification mechanism
531 to verify that the entity 105-1 is an author of content. The
verification mechanism 531 includes one or more requests and one or
more targets. For example, the verification mechanism 531 may
include a request for the entity 105-1 to upload a predetermined
file to a specified target such as a predetermined location on a
resource server associated with the entity 105-1 (e.g., the
resource server 102-1). In another example, the verification
mechanism 531 includes a request for the entity 105-1 to add
predetermined text to a target such as a predetermined webpage on a
resource server associated with the entity 105-1 (e.g., the
resource server 102-1). In another example, the verification
mechanism 531 includes a request for the entity 105-1 to add a
predetermined HTML tag in a target such as a predetermined webpage
hosted on a resource server associated with the entity 105-1 (e.g.,
the resource server 102-1). In still another example, the
verification mechanism 531 includes a request for the entity 105-1
to add a predetermined script to a target such as a predetermined
webpage hosted on the resource server 102 associated with the
entity 105-1. In yet another example, the verification mechanism
531 includes a request for the entity 105-1 to add a DNS TXT record
to a DNS record for a website operated by the entity 105-1 and
hosted on a resource server associated with the entity 105-1 (e.g.,
a website of the entity 105-1 that is hosted on the resource server
102-1). In some embodiments, in addition to performing any one of,
or any combination of, the above-identified verification
mechanisms, the registration module 210 may perform additional
verification mechanisms such as, for example, ensuring that the
applicant entity 105 is not on a list of known bad actors. For
example, in some embodiments, when the applicant entity 105 is on a
list of known bad actors, the applicant is denied registration even
if the applicant successfully enacted the request at the target
specified by the at least one verification mechanism 531. Example
verification mechanisms 531 have now been described. One of skill
in the art, based on the present disclosure, will recognize and
appreciate that there are numerous other verification mechanisms
531 and all such verification mechanisms are within the scope of
the present disclosure.
[0197] As illustrated in FIG. 5B, in some embodiments, responsive
to providing the at least one verification mechanism 531, the
server 110 receives a verification request 532 from the client
device 104-1 to verify that the entity 105-1 is an author of
content based upon the verification mechanism 531. The registration
module 210 verifies that the entity 105-1 is an author of content
by checking to see if the one or more requests included in the
verification mechanism 531 have been enacted at the one or more
targets specified by the verification mechanism 531. If each of the
respective requests in the one or more requests has been enacted at
corresponding targets in the one or more targets, the registration
module 210 registers the entity 105-1 as a registered author of
content. For example, the registration module 210 may store entity
information 533 for the entity 105-1 in a record of an author
database 222. In some embodiments in accordance with this example,
the entity information 533 for the entity 105-1 includes a name of
the entity 105-1, an original author score for the entity 105-1,
links to websites associated with the entity 105-1, links to
content authored by the entity 105-1, links to profile pages for
the entity 105-1 (e.g., on social networks or other websites),
biographic information for the entity 105-1, and/or bibliographic
information for the entity 105-1.
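Checking whether a request has been enacted at its target can be illustrated for the HTML tag example. The sketch below assumes the predetermined tag is a meta tag carrying a server-issued token and that the target page has already been retrieved as a string; the tag name "author-verification" and the function names are hypothetical.

```python
from html.parser import HTMLParser


class _MetaTokenFinder(HTMLParser):
    """Collect the content values of meta tags with a given name."""

    def __init__(self, meta_name):
        super().__init__()
        self.meta_name = meta_name
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if attr.get("name") == self.meta_name and attr.get("content"):
            self.tokens.append(attr["content"])


def tag_request_enacted(page_html, expected_token, meta_name="author-verification"):
    """True when the target page contains the predetermined tag and token."""
    finder = _MetaTokenFinder(meta_name)
    finder.feed(page_html)
    finder.close()
    return expected_token in finder.tokens


def all_requests_enacted(checks):
    """Register only if every request was enacted at its target.

    checks: iterable of zero-argument callables, one per request/target.
    """
    return all(check() for check in checks)
```

Other mechanisms (an uploaded file, predetermined text, a DNS TXT record) would each supply their own check callable to all_requests_enacted.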
[0198] In an alternative to the embodiment depicted in FIG. 5B,
after providing the at least one verification mechanism 531 to the
client device 104-1, the registration module 210 polls the target
specified by the verification mechanism 531 to determine whether
the request set forth in the verification mechanism 531 has been
enacted. In some embodiments, the registration module 210 polls the
target a number of times (e.g., every minute, every five minutes,
every hour, each day) within a predetermined cut off time (e.g.,
within one hour, one day, one week after providing the verification
mechanism 531 to the client device 104-1) to determine whether the
request set forth in the verification mechanism 531 has been
enacted. If this polling successfully determines that the request
specified by the verification mechanism 531 has been enacted at the
target, the registration module 210 registers the entity 105-1 as a
registered author of content.
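The polling alternative can be sketched as a loop bounded by a cut-off time. The function name and parameters are assumptions; the sleep and clock arguments are injectable only so the loop can be exercised without waiting in real time.

```python
import time


def poll_until_enacted(check, interval_s, cutoff_s,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll the target every interval_s seconds until check() succeeds
    or the cut-off time elapses.

    check: zero-argument callable returning True once the request set
    forth in the verification mechanism has been enacted at the target.
    Returns True on success (register the entity), False on timeout.
    """
    deadline = clock() + cutoff_s
    while True:
        if check():
            return True
        if clock() + interval_s > deadline:
            return False  # The next poll would fall after the cut-off.
        sleep(interval_s)
```

For instance, poll_until_enacted(check, interval_s=300, cutoff_s=86400) would poll every five minutes for up to one day.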
[0199] FIG. 6 is a block diagram illustrating the server 110,
according to some embodiments. The server 110 typically includes
one or more processing units (CPU's, sometimes called processors)
602 for executing programs (e.g., programs stored in memory 610),
one or more network or other communications interfaces 604, memory
610, and one or more communication buses 609 for interconnecting
these components. The communication buses 609 may include circuitry
(sometimes called a chipset) that interconnects and controls
communications between system components. The server 110 optionally
includes (but typically does not include) a user interface 605
comprising a display device 606 and input devices 608 (e.g.,
keyboard, mouse, touch screen, keypads, etc.). Memory 610 includes
high-speed random access memory, such as DRAM, SRAM, DDR RAM or
other random access solid state memory devices; and typically
includes non-volatile memory, such as one or more magnetic disk
storage devices, optical disk storage devices, flash memory
devices, or other non-volatile solid state storage devices. Memory
610 optionally includes one or more storage devices remotely
located from the CPU(s) 602. Memory 610, or alternately the
non-volatile memory device(s) within memory 610, comprises a
non-transitory computer readable storage medium. In some
embodiments, memory 610 or the computer readable storage medium of
memory 610 stores the following programs, modules and data
structures, or a subset thereof:
[0200] an operating system 612 that includes procedures for
handling various basic system services and for performing hardware
dependent tasks;
[0201] a communication module 614 that is used for connecting the
server 110 to other computers via the one or more communication
interfaces 604 (wired or wireless) and one or more communication
networks, such as the Internet, other wide area networks, local
area networks, metropolitan area networks, and so on;
[0202] an optional user interface module 616 that receives commands
from the user via the input devices 608 and generates user
interface objects in the display device 606;
[0203] the front end module 202 that provides an interface between
the server 110 and other computer systems, as described herein;
[0204] the content module 204 that evaluates submitted content to
determine whether the submitted content is new relative to known
content in the index of known content 220, issues requests to the
search engine 112 to crawl and index the submitted content, and/or
stores a representation of the submitted content in the index of
known content 220, as described herein;
[0205] the author module 206 that calculates an original author
score for an entity based at least in part on an evaluation of
representations of submitted content in the index of known content
220 identified as being associated with the entity, as described
herein;
[0206] the re-ranking module 208 that re-ranks ranked search
results based at least in part on an original author score for the
entities associated with the ranked search results, as described
herein;
[0207] the registration module 210 that registers an entity as a
registered author of content, as described herein;
[0208] the index of known content 220 that includes representations
of submitted content, as described herein; and
[0209] the author database 222 that includes entity information for
entities (e.g., the entity information 533).
[0210] In some embodiments, the programs or modules identified
above correspond to sets of instructions for performing a function
described above. The sets of instructions can be executed by one or
more processors (e.g., the CPUs 602). The above identified modules
or programs (i.e., sets of instructions) need not be implemented as
separate software programs, procedures or modules, and thus various
subsets of these programs or modules may be combined or otherwise
re-arranged in various embodiments. In some embodiments, memory 610
stores a subset of the modules and data structures identified
above. Furthermore, memory 610 may store additional modules and
data structures not described above.
[0211] Although FIG. 6 shows a "server," FIG. 6 is intended more as
a functional description of the various features which may be present
in a set of servers than as a structural schematic of the
embodiments described herein. In practice, and as recognized by
those of ordinary skill in the art, items shown separately could be
combined and some items could be separated. For example, some items
shown separately in FIG. 6 could be implemented on single servers
and single items could be implemented by one or more servers. The
actual number of servers used to implement a server and how
features are allocated among them will vary from one implementation
to another, and may depend in part on the amount of data traffic
that the system must handle during peak usage periods as well as
during average usage periods.
[0212] FIG. 7 is a block diagram illustrating the resource server
102-1, according to some embodiments. Note that discussion below is
not limited to the resource server 102-1 and may apply to any
resource server. The resource server 102-1 typically includes one
or more processing units (CPU's, sometimes called processors) 702
for executing programs (e.g., programs stored in memory 710), one
or more network or other communications interfaces 704, memory 710,
and one or more communication buses 709 for interconnecting these
components. The communication buses 709 may include circuitry
(sometimes called a chipset) that interconnects and controls
communications between system components. The resource server 102-1
optionally includes (but typically does not include) a user
interface 705 comprising a display device 706 and input devices 708
(e.g., keyboard, mouse, touch screen, keypads, etc.). Memory 710
includes high-speed random access memory, such as DRAM, SRAM, DDR
RAM or other random access solid state memory devices; and
typically includes non-volatile memory, such as one or more
magnetic disk storage devices, optical disk storage devices, flash
memory devices, or other non-volatile solid state storage devices.
Memory 710 optionally includes one or more storage devices remotely
located from the CPU(s) 702. Memory 710, or alternately the
non-volatile memory device(s) within memory 710, comprises a
non-transitory computer readable storage medium. In some
embodiments, memory 710 or the computer readable storage medium of
memory 710 stores the following programs, modules and data
structures, or a subset thereof:
[0213] an operating system 712 that includes procedures for
handling various basic system services and for performing hardware
dependent tasks;
[0214] a communication module 714 that is used for connecting the
resource server 102-1 to other computers via the one or more
communication interfaces 704 (wired or wireless) and one or more
communication networks, such as the Internet, other wide area
networks, local area networks, metropolitan area networks, and so
on;
[0215] an optional user interface module 716 that receives commands
from the user via the input devices 708 and generates user
interface objects in the display device 706;
[0216] a web module 718 that responds to requests to access web
pages or other resources (e.g., scripts, programs, etc.) hosted on
the resource server 102-1; and
[0217] a publication module 720 that publishes content for entities
and sends messages (e.g., the messages 230 and 330) to the server
110 indicating that content has been published on the resource
server 102-1.
[0218] In some embodiments, the programs or modules identified
above correspond to sets of instructions for performing a function
described above. The sets of instructions can be executed by one or
more processors (e.g., the CPUs 702). The above identified modules
or programs (i.e., sets of instructions) need not be implemented as
separate software programs, procedures or modules, and thus various
subsets of these programs or modules may be combined or otherwise
re-arranged in various embodiments. In some embodiments, memory 710
stores a subset of the modules and data structures identified
above. Furthermore, memory 710 may store additional modules and
data structures not described above.
[0219] Although FIG. 7 shows a "resource server," FIG. 7 is
intended more as a functional description of the various features
which may be present in a set of resource servers than as a
structural schematic of the embodiments described herein. In
practice, and as recognized by those of ordinary skill in the art,
items shown separately could be combined and some items could be
separated. For example, some items shown separately in FIG. 7 could
be implemented on single servers and single items could be
implemented by one or more servers. The actual number of servers
used to implement a resource server and how features are allocated
among them will vary from one implementation to another, and may
depend in part on the amount of data traffic that the system must
handle during peak usage periods as well as during average usage
periods.
[0220] FIG. 8 is a block diagram illustrating the search engine
112, according to some embodiments. The search engine 112 typically
includes one or more processing units (CPU's, sometimes called
processors) 802 for executing programs (e.g., programs stored in
memory 810), one or more network or other communications interfaces
804, memory 810, and one or more communication buses 809 for
interconnecting these components. The communication buses 809 may
include circuitry (sometimes called a chipset) that interconnects
and controls communications between system components. The search
engine 112 optionally includes (but typically does not include) a
user interface 805 comprising a display device 806 and input
devices 808 (e.g., keyboard, mouse, touch screen, keypads, etc.).
Memory 810 includes high-speed random access memory, such as DRAM,
SRAM, DDR RAM or other random access solid state memory devices;
and typically includes non-volatile memory, such as one or more
magnetic disk storage devices, optical disk storage devices, flash
memory devices, or other non-volatile solid state storage devices.
Memory 810 optionally includes one or more storage devices remotely
located from the CPU(s) 802. Memory 810, or alternately the
non-volatile memory device(s) within memory 810, comprises a
non-transitory computer readable storage medium. In some
embodiments, memory 810 or the computer readable storage medium of
memory 810 stores the following programs, modules and data
structures, or a subset thereof:
[0221] an operating system 812 that includes procedures for
handling various basic system services and for performing hardware
dependent tasks;
[0222] a communication module 814 that is used for connecting the
search engine 112 to other computers via the one or more
communication interfaces 804 (wired or wireless) and one or more
communication networks, such as the Internet, other wide area
networks, local area networks, metropolitan area networks, and so
on;
[0223] an optional user interface module 816 that receives commands
from the user via the input devices 808 and generates user
interface objects in the display device 806;
[0224] a search module 818 that identifies ranked search results
that satisfy a search query;
[0225] a crawler module 820 that periodically performs
normal-priority crawl operations on links to identify new content
and performs high-priority crawl operations to identify new content
in response to receiving a request from the content module 204 of
the server 110; and
[0226] an indexer module 822 that indexes new content identified by
the crawler module 820 into a search index 824.
[0227] In some embodiments, the programs or modules identified
above correspond to sets of instructions for performing a function
described above. The sets of instructions can be executed by one or
more processors (e.g., the CPUs 802). The above identified modules
or programs (i.e., sets of instructions) need not be implemented as
separate software programs, procedures or modules, and thus various
subsets of these programs or modules may be combined or otherwise
re-arranged in various embodiments. In some embodiments, memory 810
stores a subset of the modules and data structures identified
above. Furthermore, memory 810 may store additional modules and
data structures not described above.
[0228] Although FIG. 8 shows a "search engine," FIG. 8 is intended
more as a functional description of the various features which may be
present in a set of servers implementing a search engine than as a
structural schematic of the embodiments described herein. In
practice, and as recognized by those of ordinary skill in the art,
items shown separately could be combined and some items could be
separated. For example, some items shown separately in FIG. 8 could
be implemented on single servers and single items could be
implemented by one or more servers. The actual number of servers
used to implement a search engine and how features are allocated
among them will vary from one implementation to another, and may
depend in part on the amount of data traffic that the system must
handle during peak usage periods as well as during average usage
periods.
[0229] FIG. 9 is a block diagram illustrating the client device
104-1, according to some embodiments. Note that discussion below
may apply to any client device. The client device 104-1 typically
includes one or more processing units (CPU's, sometimes called
processors) 902 for executing programs (e.g., programs stored in
memory 910), one or more network or other communications interfaces
904, memory 910, and one or more communication buses 909 for
interconnecting these components. The communication buses 909 may
include circuitry (sometimes called a chipset) that interconnects
and controls communications between system components. The client
device 104-1 includes a user interface 905 comprising a display
device 906 and input devices 908 (e.g., keyboard, mouse, touch
screen, keypads, etc.). Memory 910 includes high-speed random
access memory, such as DRAM, SRAM, DDR RAM or other random access
solid state memory devices; and typically includes non-volatile
memory, such as one or more magnetic disk storage devices, optical
disk storage devices, flash memory devices, or other non-volatile
solid state storage devices. Memory 910 optionally includes one or
more storage devices remotely located from the CPU(s) 902. Memory
910, or alternately the non-volatile memory device(s) within memory
910, comprises a non-transitory computer readable storage medium.
In some embodiments, memory 910 or the computer readable storage
medium of memory 910 stores the following programs, modules and
data structures, or a subset thereof: [0230] an operating system
912 that includes procedures for handling various basic system
services and for performing hardware dependent tasks; [0231] a
communication module 914 that is used for connecting the client
device 104-1 to other computers via the one or more communication
interfaces 904 (wired or wireless) and one or more communication
networks, such as the Internet, other wide area networks, local
area networks, metropolitan area networks, and so on; [0232] a user
interface module 916 that receives commands from the user via the
input devices 908 and generates user interface objects in the
display device 906; and [0233] a browser module 918 that provides a
user interface for users to access and obtain documents and/or
services hosted on servers (e.g., the resource servers 102),
renders the documents and/or services in the user interface, and/or
executes scripts (or other code).
[0234] In some embodiments, the programs or modules identified
above correspond to sets of instructions for performing a function
described above. The sets of instructions can be executed by one or
more processors (e.g., the CPUs 902). The above identified modules
or programs (i.e., sets of instructions) need not be implemented as
separate software programs, procedures or modules, and thus various
subsets of these programs or modules may be combined or otherwise
re-arranged in various embodiments. In some embodiments, memory 910
stores a subset of the modules and data structures identified
above. Furthermore, memory 910 may store additional modules and
data structures not described above.
[0235] Although FIG. 9 shows a "client device," FIG. 9 is intended
more as a functional description of the various features which may be
present in a client device than as a structural schematic of the
embodiments described herein. In practice, and as recognized by
those of ordinary skill in the art, items shown separately could be
combined and some items could be separated.
[0236] Re-Ranking Ranked Search Results
[0237] The embodiments described below with reference to FIGS.
10-19 provide techniques for re-ranking ranked search results.
[0238] The following discussion refers to the resource server
102-1, the client device 104-1, the entity 105-1, and the search
engine 112. However, it should be noted that the following
discussion may be applied to any resource server, client device,
entity, and search engine. Furthermore, the following discussion
refers to particular modules of the server 110 performing
particular operations illustrated in FIGS. 10-19. However, the
operations discussed below may be performed by other modules of the
server 110.
[0239] FIG. 10 is a flowchart of a method 1000 for re-ranking
ranked search results, according to some embodiments. The
re-ranking module 208 obtains (1002) ranked search results (e.g.,
the ranked search results 430) satisfying a search query, where the
ranked search results include a first search result corresponding
to a first document associated with a first entity and a second
search result corresponding to a second document associated with a
second entity, and where the first search result is ranked higher
than the second search result.
[0240] The re-ranking module 208 determines (1004) that the first
document and the second document satisfy a similarity criterion. In
some embodiments, the first document and the second document
satisfy the similarity criterion when a similarity threshold has
been exceeded. For example, if the similarity criterion is based on
the percentage of the content in the first and second documents
that is similar, the similarity threshold may be set to a value of
90%. Accordingly, when the percentage of the content in the first
and second documents that is similar exceeds the 90% similarity
threshold, the re-ranking module 208 determines that the first
document and the second document satisfy the similarity criterion.
Operation 1004 is described in more detail below with reference to
FIGS. 11 and 12.
[0241] The re-ranking module 208 determines (1006) that the second
entity satisfies a predefined authorship differential with respect
to the first entity. In some embodiments, the authorship
differential is a difference between the original author scores for
the first and second entities. Operation 1006 is described in more
detail below with reference to FIG. 13.
[0242] Responsive to determining that the second entity satisfies
the predefined authorship differential with respect to the first
entity, the re-ranking module 208 swaps (1008) the second search
result and the first search result in the ranked search results to
produce re-ranked search results.
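By way of illustration only, operations 1002-1008 might be arranged
as in the following sketch. The names SearchResult, rerank,
is_similar, and satisfies_differential are hypothetical and do not
appear in the embodiments; the two callables merely stand in for
operations 1004 and 1006, whose details are described below.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SearchResult:
    """Hypothetical container for one ranked search result."""
    document: str  # content of the document
    entity: str    # entity associated with the document


def rerank(results: List[SearchResult],
           first: int,
           second: int,
           is_similar: Callable[[str, str], bool],
           satisfies_differential: Callable[[str, str], bool],
           ) -> List[SearchResult]:
    """Return a copy of `results` with two positions swapped when the
    conditions of operations 1004 and 1006 both hold.

    Index 0 is the highest rank, so first < second means the first
    search result is ranked higher than the second search result.
    """
    if not first < second:
        raise ValueError("first result must be ranked higher than second")
    reranked = list(results)  # leave the caller's list untouched
    a, b = reranked[first], reranked[second]
    # Operation 1004: similarity criterion on the two documents.
    # Operation 1006: authorship differential of the second entity
    # with respect to the first entity.
    if is_similar(a.document, b.document) and \
            satisfies_differential(b.entity, a.entity):
        reranked[first], reranked[second] = b, a  # operation 1008
    return reranked
```

In this sketch the swap is skipped whenever either determination
fails, so the original ordering is returned unchanged in that case.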
[0243] In some embodiments, when the ranked search results are
obtained from the search engine 112, after swapping the second
search result and the first search result in the ranked search
results to produce the re-ranked search results, the re-ranking
module 208 provides the re-ranked search results to the search
engine 112.
[0244] In some embodiments, rather than swapping the second search
result and the first search result in the ranked search results to
produce re-ranked search results as specified in (1008), the second
search result is given a higher ranking that is not necessarily
that of the first search result. For example, in some
implementations, the second search result is given the ranking of
the first result and the first result is deleted from the ranked
search results. In another example, the first search result is
given a ranking in the ranked search results that is less than its
original ranking and the second search result is given a ranking
that is greater than its original ranking in the ranked search
results. In this way, the second search result obtains a ranking
that is higher than the ranking of the first search result in the
revised ranked search results. In yet another example, the second
search result is given a ranking that is greater than its original
ranking such that its ranking is higher than the ranking of the
first search result in the revised ranked search results.
[0245] FIG. 11 is a flowchart of a method for determining (1004)
that a first document and a second document satisfy a similarity
criterion, according to some embodiments. The re-ranking module 208
identifies (1102) a first plurality of portions of content for the
first document, identifies (1104) a second plurality of portions of
content for the second document, and determines (1106) that at
least a predetermined number of portions in the first plurality of
portions of content and in the second plurality of portions of
content are identical. In some embodiments, the predetermined
number of portions encompasses a predetermined percentage of the
data in the first document. In some embodiments, the predetermined
number of portions is a threshold percentage of the portions of
content in the first plurality of portions of content.
[0246] In some embodiments, respective portions of content in the
first plurality of portions of content and the second plurality of
portions of content include a predetermined amount of data. For
example, the predetermined amount of data may include a
predetermined number of characters, a predetermined number of
bytes, a predetermined number of words, and the like.
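As a minimal sketch of operations 1102-1106, assuming that each
portion is a fixed number of characters and that the predetermined
number is expressed as a threshold percentage of the portions of the
first document, the comparison might look as follows. The function
names, the portion size of 8 characters, and the 90% default are
illustrative assumptions only.

```python
from typing import List


def split_into_portions(text: str, size: int) -> List[str]:
    """Split text into consecutive portions of `size` characters.

    The final portion may be shorter than `size`.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def portions_satisfy_similarity(first_doc: str,
                                second_doc: str,
                                size: int = 8,
                                threshold: float = 0.9) -> bool:
    """Return True when at least `threshold` of the portions of the
    first document also occur among the portions of the second."""
    first = split_into_portions(first_doc, size)          # operation 1102
    second = set(split_into_portions(second_doc, size))   # operation 1104
    if not first:
        return False
    identical = sum(1 for portion in first if portion in second)
    return identical / len(first) >= threshold            # operation 1106
```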
[0247] FIG. 12 is a flowchart of another method for determining
(1004) that a first document and a second document satisfy a
similarity criterion, according to some embodiments. The re-ranking
module 208 identifies (1202) a first plurality of shingles for the
first document, identifies (1204) a second plurality of shingles
for the second document, and determines (1206) that a predetermined
quantity of shingles in the first plurality of shingles and in the
second plurality of shingles of content are identical. In some
implementations, the predetermined quantity is a predetermined
percentage of shingles (e.g., 90% of the shingles). In some
implementations, the predetermined quantity is a predetermined
number of shingles (e.g., 20 shingles).
[0248] In some embodiments, the shingles in a respective plurality
of shingles include at least one overlapping token (e.g., word or
character). For example, if the content is "a big dog crosses the
street" and 3-shingling is used (i.e., the set of all unique
sequences of three consecutive words in the content is used as the
basis for shingling), the set of shingles includes the following
shingles: "a
big dog," "big dog crosses," "dog crosses the," and "crosses the
street." In some embodiments, the shingles in the respective
plurality of shingles do not include overlapping tokens. For
example, using the example content from the previous example, the
set of shingles for a 3-shingling of the content includes the
following shingles: "a big dog," and "crosses the street." Note
that the discussion with reference to FIG. 12 refers to operations
performed with respect to shingles (e.g., the respective tokens of
the content included in the respective shingles). However, in some
embodiments, hash values of the tokens in the shingles are used
instead of the actual tokens in the shingles. The hash values for
the tokens of the content may be generated using any appropriate
hash function (e.g., MD5, SHA1, etc.). For example, for the
overlapping token 3-shingling of the content from the example
above, the hash values for the shingles "a big dog," "big dog
crosses," "dog crosses the," and "crosses the street" may be used
instead of the tokens themselves. Thus, when the term "shingles"
is used, it should be understood that hash values of the tokens in
the shingles may be used instead. It will be appreciated that
w-Shingling, where w is a positive integer greater than 1, is used
to perform the shingling disclosed herein. In some embodiments w is
2, 3, 4, 5, 6, or an integer greater than 6.
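The overlapping and non-overlapping w-shinglings described above can
be sketched as follows, assuming whitespace-separated words as
tokens. The function names shingles and hashed_shingles are
hypothetical, and SHA1 is used only because it is one of the hash
functions named above; this sketch hashes each shingle as a whole.

```python
import hashlib
from typing import List


def shingles(text: str, w: int = 3, overlapping: bool = True) -> List[str]:
    """Return the w-shingles of `text` in order of appearance.

    With overlapping=True consecutive shingles share w - 1 tokens.
    With overlapping=False the shingles are disjoint, and any
    trailing tokens that do not fill a whole shingle are dropped.
    """
    if w < 2:
        raise ValueError("w must be an integer greater than 1")
    tokens = text.split()
    step = 1 if overlapping else w
    return [" ".join(tokens[i:i + w])
            for i in range(0, len(tokens) - w + 1, step)]


def hashed_shingles(text: str, w: int = 3,
                    overlapping: bool = True) -> List[str]:
    """Return SHA1 hex digests used in place of the shingles."""
    return [hashlib.sha1(s.encode("utf-8")).hexdigest()
            for s in shingles(text, w, overlapping)]
```

For the content "a big dog crosses the street" with w equal to 3,
shingles returns the four overlapping shingles listed above, and
passing overlapping=False returns "a big dog" and "crosses the
street".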
[0249] In some embodiments, the first document and the second
document satisfy the similarity criterion when a character count of
the first document and a character count of the second document
deviate by less than a character count threshold (and/or a word
count threshold). For example, if the character count threshold is
five characters, the first document and the second document satisfy
the similarity criterion when the character count of the first
document and the character count of the second document deviate by
less than five characters.
[0250] In some embodiments, the first document and the second
document satisfy the similarity criterion when a Levenshtein edit
distance from the first document to the second document is less
than a predetermined number of character edits. In some
embodiments, the first document and the second document satisfy the
similarity criterion when a Jaro-Winkler distance metric from the
first document to the second document is less than a predetermined
distance. In some embodiments, the first document and the second
document satisfy the similarity criterion when a Jaccard index is
greater than a predetermined value. In some embodiments, the first
document and the second document satisfy the similarity criterion
when a Masi distance is less than a predetermined distance.
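Two of the measures named above can be sketched in a few lines. The
following is a minimal illustration, not a required implementation:
levenshtein computes the standard character-level edit distance, and
jaccard_index computes the ratio of shared to total distinct items
(for example, shingles or words). Treating two empty collections as
fully similar is an assumption made here for convenience.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaccard_index(a: Iterable[str], b: Iterable[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty collections are identical
    return len(set_a & set_b) / len(union)
```

Under this sketch, the similarity criterion would be satisfied when
levenshtein returns less than a predetermined number of character
edits, or when jaccard_index returns more than a predetermined value.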
[0251] FIG. 13 is a flowchart of a method for determining (1006)
that a second entity satisfies a predefined authorship differential
with respect to a first entity, according to some embodiments. The
re-ranking module 208 determines (1302) an author score for the
first entity and an author score for the second entity, where the
author score for a respective entity indicates a likelihood that
the respective entity is an author of original content. Operation
1302 is described in more detail below with reference to FIG.
14.
[0252] When the author score for the second entity is greater than
the author score for the first entity by at least a predetermined
threshold (e.g., 10%), the re-ranking module 208 determines (1304)
that the second entity satisfies the predefined authorship
differential with respect to the first entity.
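Operation 1304 reduces to a single comparison. The sketch below
assumes that author scores are normalized to the range 0 to 1 and
that the 10% example threshold is an absolute difference of 0.10
between the scores; a threshold relative to the first score is an
equally plausible reading. The function name is hypothetical.

```python
def satisfies_authorship_differential(first_score: float,
                                      second_score: float,
                                      threshold: float = 0.10) -> bool:
    """Return True when the second entity's author score exceeds the
    first entity's author score by at least `threshold`.

    Assumption: scores lie in [0, 1] and the threshold is an
    absolute difference between the two scores.
    """
    return second_score - first_score >= threshold
```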
[0253] FIG. 14 is a flowchart of a method for determining (1302) an
author score for a respective entity, according to some
embodiments. The re-ranking module 208 determines (1402) a citation
score for the respective entity. The citation score corresponds to
a frequency at which content associated with the respective entity
is cited. In some embodiments, the re-ranking module 208 determines
the citation score for the respective entity by calculating the
citation score as a quotient of a number of citations to content
that is associated with the respective entity and a quantity of
content that has at least one citation. In some embodiments, the
re-ranking module 208 determines the citation score for the
respective entity by calculating the citation score as a quotient
of a number of citations to content that is associated with the
respective entity and a quantity of content associated with the
respective entity. In some implementations, the citation includes a
link to at least a portion of the content (e.g., a subset of the
content, the entire content, etc.). In some embodiments, the
content is all content associated with the respective entity. In
some embodiments, the content is content satisfying any of the
criteria used to identify content for computation of an original
author score discussed above in conjunction with FIG. 3B.
[0254] The re-ranking module 208 determines (1404) an original
author score for the respective entity, where the original author
score corresponds to a percentage of content associated with the
respective entity that is a first instance of the content in the
index of known content 220. Operation 1404 is described in more
detail below with reference to FIGS. 15 and 17.
[0255] The re-ranking module 208 combines (1406) the citation score
and the original author score using a predetermined function (e.g.,
simple summation of the citation score and the original author
score, a weighted sum of the citation score and the original author
score, multiplication of the citation score and the original author
score, etc.) to produce the author score.
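Operations 1402-1406 can be illustrated with the following sketch.
The function names, the method strings, and the default weights are
assumptions introduced for illustration; the three combining
functions correspond to the examples given above (simple summation,
weighted sum, and multiplication). Returning zero for an entity with
no content is likewise an assumption.

```python
def citation_score(num_citations: int, num_content_items: int) -> float:
    """Quotient of the number of citations to an entity's content and
    a quantity of content (e.g., all content associated with the
    entity, or only content that has at least one citation)."""
    if num_content_items == 0:
        return 0.0  # assumption: no content yields a zero score
    return num_citations / num_content_items


def author_score(citation: float,
                 original: float,
                 method: str = "sum",
                 citation_weight: float = 0.5,
                 original_weight: float = 0.5) -> float:
    """Combine a citation score and an original author score."""
    if method == "sum":
        return citation + original
    if method == "weighted_sum":
        return citation_weight * citation + original_weight * original
    if method == "product":
        return citation * original
    raise ValueError(f"unknown method: {method}")
```

For example, 30 citations spread over 10 items of content give a
citation score of 3.0, which a simple summation combines with an
original author score of 0.5 to produce an author score of 3.5.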
[0256] FIG. 15 is a flowchart of a method for determining (1404) an
original author score for a respective entity, according to some
embodiments. The re-ranking module 208 identifies (1502) a
plurality of shingles representing content in the index of known
content identified as being associated with the respective entity
and calculates (1504) a percentage of the plurality of shingles
that are first instances of the shingles in the index of known
content 220. Operation 1504 is described in more detail below with
reference to FIG. 16.
[0257] FIG. 16 is a flowchart of a method for calculating (1504) a
percentage of shingles that are first instances, according to some
embodiments. The re-ranking module 208 determines (1602) a first
number of the plurality of shingles, each having an associated
timestamp, for which there are no identical shingles in the index
of known content having an earlier timestamp. The re-ranking module
208 then calculates (1604) the percentage as a quotient of (i) the
first number and (ii) a total number of shingles in the plurality
of shingles.
[0258] In some embodiments, the re-ranking module 208 determines a
first number of the plurality of shingles that are unique with
respect to all other shingles in the index of known content and
determines a second number of the plurality of shingles that are
not unique with respect to the corresponding shingles in the index
of known content but that are associated with timestamps that
precede timestamps for identical shingles in the index of known
content. The re-ranking module 208 then calculates the percentage
as a quotient of (i) a sum of the first number and the second
number and (ii) a total number of shingles in the plurality of
shingles.
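The calculation of operations 1602-1604 might be sketched as
follows, assuming the index of known content can be queried as a
mapping from each shingle to the timestamps of every indexed
instance of that shingle. That mapping and the function name are
hypothetical simplifications. The result is returned as a fraction
between 0 and 1; multiplying by 100 gives the percentage.

```python
from typing import Dict, List, Tuple


def first_instance_fraction(
        entity_shingles: List[Tuple[str, float]],
        index_timestamps: Dict[str, List[float]]) -> float:
    """Fraction of an entity's shingles that are first instances.

    entity_shingles: (shingle, timestamp) pairs for the entity.
    index_timestamps: hypothetical view of the index of known
        content, mapping a shingle to the timestamps of all of its
        indexed instances (possibly including the entity's own).
    """
    if not entity_shingles:
        return 0.0
    first_instances = 0
    for shingle, timestamp in entity_shingles:
        others = index_timestamps.get(shingle, [])
        # A first instance has no identical shingle with an earlier
        # timestamp. A shingle absent from the index is unique and
        # therefore also counts.
        if all(other >= timestamp for other in others):
            first_instances += 1
    return first_instances / len(entity_shingles)
```

Because a unique shingle trivially has no earlier identical shingle,
the same loop counts both the unique shingles and the earliest
non-unique shingles.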
[0259] FIG. 17 is a flowchart of another method for determining
(1404) an original author score for a respective entity, according
to some embodiments. The re-ranking module 208 identifies (1702) a
plurality of portions of content in the index of known content
identified as being associated with the respective entity, where
each portion in the plurality of portions represents a
predetermined amount of data in the index of known content. The
re-ranking module 208 calculates (1704) a percentage of the
plurality of the portions that are first instances of the portions
of content in the index of known content. Operation 1704 is
described in more detail below with reference to FIG. 18.
[0260] FIG. 18 is a flowchart of a method for calculating (1704) a
percentage of portions of content that are first instances,
according to some embodiments. The re-ranking module 208 determines
(1802) a first number of the plurality of portions of content, each
having an associated timestamp, for which there are no identical
portions of content in the index of known content having an earlier
timestamp. The re-ranking module 208 then calculates (1804) the
percentage as a quotient of (i) the first number and (ii) a total
number of portions of content in the plurality of portions of
content.
[0261] In some embodiments, the re-ranking module 208 calculates
(1704) a percentage of portions of content that are first instances
by determining a first number of the plurality of portions of
content that are unique with respect to all other portions of
content in the index of known content and determining a second
number of the plurality of portions of content that are not unique
with respect to the corresponding portions of content in the index
of known content but that are associated with timestamps that
precede timestamps for identical portions of content in the index
of known content. The re-ranking module 208 then calculates the
percentage as a quotient of (i) a sum of the first number and the
second number and (ii) a total number of portions of content in the
plurality of portions of content.
[0262] In some embodiments, the re-ranking module 208 determines
(1404) an original author score for a respective entity by
identifying an amount of content in the index of known content
identified as being associated with the entity, where the amount of
content includes the submitted content. The re-ranking module 208
then calculates a percentage of the amount that is uniquely
represented in the index of known content. In some implementations,
the re-ranking module 208 calculates the percentage of the amount
by determining a subset of the amount for which there are no
identical portions of content in the index of known content having
an earlier timestamp and calculating the percentage as a quotient
of the subset and the amount.
[0263] In some embodiments, when the second entity does not satisfy
a predefined authorship differential with respect to the first
entity (e.g., the other branch of operation 1006), the re-ranking
module 208 compares timestamps associated with the first document
and the second document to determine whether to swap the first and
second search results. FIG. 19 is a flowchart of a method 1900 for
re-ranking ranked search results, according to some embodiments.
The re-ranking module 208 obtains (1902) a first timestamp
corresponding to a time when the first document was included in an
index of known content 220 and obtains (1904) a second timestamp
corresponding to a time when the second document was included in
the index of known content 220. A respective timestamp may include
one of: a time when a message (e.g., the message 230, the message
330) is received indicating that the respective document has been
published on a resource server, a time when a message is received
indicating that the respective document will be published at a
future time, or a time when the search engine 112 first crawled and
indexed the respective document. When the second timestamp is less
than the first timestamp by at least a predetermined time interval,
the re-ranking module 208 determines (1906) that the second entity
satisfies the predefined authorship differential with respect to
the first entity. In other words, the second document was included
in the index of known content 220 prior to the first document being
included in the index of known content 220. In such instances, the
re-ranking module 208 swaps (1008) the second search result and the
first search result in the ranked search results to produce
re-ranked search results.
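The timestamp comparison of operation 1906 can be sketched as
follows. The function name and the one-hour default interval are
assumptions chosen only for illustration; the embodiments do not
specify a value for the predetermined time interval.

```python
from datetime import datetime, timedelta


def second_indexed_sufficiently_earlier(
        first_indexed: datetime,
        second_indexed: datetime,
        min_interval: timedelta = timedelta(hours=1)) -> bool:
    """Return True when the second document entered the index of
    known content earlier than the first document by at least
    `min_interval`."""
    return first_indexed - second_indexed >= min_interval
```

When this check returns True, the second search result and the first
search result would be swapped as in operation 1008.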
[0264] The methods illustrated in FIGS. 10-19 may be governed by
instructions that are stored in a computer readable storage medium
and that are executed by at least one processor of at least one
server. Each of the operations shown in FIGS. 10-19 may correspond
to instructions stored in a non-transitory computer memory or
computer readable storage medium. In various implementations, the
non-transitory computer readable storage medium includes a magnetic
or optical disk storage device, solid state storage devices such as
Flash memory, or other non-volatile memory device or devices. The
computer readable instructions stored on the non-transitory
computer readable storage medium may be in source code, assembly
language code, object code, or other instruction format that is
interpreted and/or executable by one or more processors.
[0265] Plural instances may be provided for components, operations
or structures described herein as a single instance. Finally,
boundaries between various components, operations, and data stores
are somewhat arbitrary, and particular operations are illustrated
in the context of specific illustrative configurations. Other
allocations of functionality are envisioned and may fall within the
scope of the implementation(s). In general, structures and
functionality presented as separate components in the example
configurations may be implemented as a combined structure or
component. Similarly, structures and functionality presented as a
single component may be implemented as separate components. These
and other variations, modifications, additions, and improvements
fall within the scope of the implementation(s).
[0266] It will also be understood that, although the terms "first,"
"second," etc. may be used herein to describe various elements,
these elements should not be limited by these terms. These terms
are only used to distinguish one element from another. For example,
a first contact could be termed a second contact, and, similarly, a
second contact could be termed a first contact, without changing
the meaning of the description, so long as all occurrences of the
"first contact" are renamed consistently and all occurrences of the
"second contact" are renamed consistently. The first contact and the
second contact are both contacts, but they are not the same
contact.
[0267] The terminology used herein is for the purpose of describing
particular implementations only and is not intended to be limiting
of the claims. As used in the description of the implementations
and the appended claims, the singular forms "a", "an" and "the" are
intended to include the plural forms as well, unless the context
clearly indicates otherwise. It will also be understood that the
term "and/or" as used herein refers to and encompasses any and all
possible combinations of one or more of the associated listed
items. It will be further understood that the terms "comprises"
and/or "comprising," when used in this specification, specify the
presence of stated features, integers, steps, operations, elements,
and/or components, but do not preclude the presence or addition of
one or more other features, integers, steps, operations, elements,
components, and/or groups thereof.
[0268] As used herein, the term "if" may be construed to mean
"when" or "upon" or "in response to determining" or "in accordance
with a determination" or "in response to detecting," that a stated
condition precedent is true, depending on the context. Similarly,
the phrase "if it is determined (that a stated condition precedent
is true)" or "if (a stated condition precedent is true)" or "when
(a stated condition precedent is true)" may be construed to mean
"upon determining" or "in response to determining" or "in
accordance with a determination" or "upon detecting" or "in
response to detecting" that the stated condition precedent is true,
depending on the context.
[0269] Embodiments in which a second search result and a first
search result are swapped in ranked search results have been
presented above. In an alternative implementation of any of these
embodiments, rather than swapping the second search result and the
first search result in the ranked search results to produce
re-ranked search results as specified, the second search result is
given a higher ranking that is not necessarily that of the first
search result. For example, in some embodiments, the second search
result is given the ranking of the first result and the first
result is deleted from the ranked search results. In another
example, the first search result is given a ranking in the ranked
search results that is less than its original ranking and the
second search result is given a ranking that is greater than its
original ranking in the ranked search results. In this way, the
second search result obtains a ranking that is higher than the
ranking of the first search result in the revised ranked search
results. In yet another example, the second search result is given
a ranking that is greater than its original ranking such that its
ranking is higher than the ranking of the first search result in
the revised ranked search results.
[0270] The foregoing description included example systems, methods,
techniques, instruction sequences, and computing machine program
products that embody illustrative implementations. For purposes of
explanation, numerous specific details were set forth in order to
provide an understanding of various implementations of the
inventive subject matter. It will be evident, however, to those
skilled in the art that implementations of the inventive subject
matter may be practiced without these specific details. In general,
well-known instruction instances, protocols, structures and
techniques have not been shown in detail.
[0271] The foregoing description, for purpose of explanation, has
been described with reference to specific implementations. However,
the illustrative discussions above are not intended to be
exhaustive or to limit the implementations to the precise forms
disclosed. Many modifications and variations are possible in view
of the above teachings. The implementations were chosen and
described in order to best explain the principles and their
practical applications, to thereby enable others skilled in the art
to best utilize the implementations and various implementations
with various modifications as are suited to the particular use
contemplated.
* * * * *
References