Reverse engineering circumvention of spam detection algorithms Patent Grant Grundman December 17, 2 [Grundman; Douglas Richard]

Patent Grant 8612436

U.S. patent number 8,612,436 [Application Number 13/246,363] was granted by the patent office on 2013-12-17 for reverse engineering circumvention of spam detection algorithms. This patent grant is currently assigned to Google Inc. The grantee listed for this patent is Douglas Richard Grundman. Invention is credited to Douglas Richard Grundman.

United States Patent	8,612,436
Grundman	December 17, 2013

Reverse engineering circumvention of spam detection algorithms

Abstract

A spam score is assigned to a business listing when the listing is received at a search entity. A noise function is added to the spam score such that the spam score is varied. In the event that the spam score is greater than a first threshold, the listing is identified as fraudulent and the listing is not included in (or is removed from) the group of searchable business listings. In the event that the spam score is greater than a second threshold that is less than the first threshold, the listing may be flagged for inspection. The addition of the noise to the spam scores prevents potential spammers from reverse engineering the spam detecting algorithm such that more listings that are submitted to the search entity may be identified as fraudulent and not included in the group of searchable listings.

Inventors:

Grundman; Douglas Richard (Rosemont, PA)

Applicant:

Name	City	State	Country	Type
Grundman; Douglas Richard	Rosemont	PA	US

Assignee:

Google Inc. (Mountain View, CA)

Family ID:

49725873

Appl. No.:

13/246,363

Filed:

September 27, 2011

Current U.S. Class:	707/735; 707/754
Current CPC Class:	G06F 16/24578 (20190101); G06F 16/951 (20190101)
Current International Class:	G06F 7/00 (20060101); G06F 17/30 (20060101)
Field of Search:	707/735,999.2,754

References Cited [Referenced By]

U.S. Patent Documents


7831667	November 2010	Gleeson et al.
2005/0015454	January 2005	Goodman et al.
2005/0251496	November 2005	DeCoste et al.
2008/0222725	September 2008	Chayes et al.
2010/0094868	April 2010	Leung et al.
2010/0223250	September 2010	Guha
2012/0041846	February 2012	Rehman et al.
2012/0268269	October 2012	Doyle

Other References

B. Wu, V. Goel, and B. Davison. Propagating trust and distrust to demote web spam. In MTW'06: Proceeding of Models of Trust for the Web Workshop, International World Wide Web Conference, 2006. Retrieved on Feb. 22, 2013 from http://www.cse.lehigh.edu/~brian/pubs/2006/MTW/propagating-trust.pdf. cited by examiner.
Index of /~brian/pubs/2006. Verification of 2006 publication date of Wu Non-Patent Literature. Retrieved on Feb. 22, 2013 from http://www.cse.lehigh.edu/~brian/pubs/2006/. cited by examiner.

Primary Examiner: Mofiz; Apu
Assistant Examiner: Walker; Bryan
Attorney, Agent or Firm: Lerner, David, Littenberg, Krumholz & Mentlik, LLP

Claims

The invention claimed is:

1. A computer-implemented method comprising: receiving a listing to be included in a group of searchable listings; assigning, using a processor, a first spam score in a predetermined range to the listing; varying the first spam score to a second spam score using the processor, wherein varying the first spam score includes: varying the first spam score according to a first function if the first spam score is within a first subset of the predetermined range, and varying the first spam score according to a second function if the first spam score is within a second subset of the predetermined range, the second function being different than the first function; and in the event that the second spam score is larger than a first threshold, removing the listing from or not including the listing in the group of searchable listings.

2. The method of claim 1, further comprising, in the event that the second spam score is larger than a second threshold, identifying the listing as potential spam.

3. The method of claim 2, wherein the second threshold is less than the first threshold.

4. The method of claim 2, wherein the listing identified as potential spam is flagged to determine whether the listing is to be removed from or not included in the group of searchable listings.

5. The method of claim 2, wherein the listing identified as potential spam is demoted such that the listing does not appear as frequently or as highly ranked as other similarly situated listings in response to a search of the group of searchable listings.

6. The method of claim 2, wherein varying the first spam score according to the first function comprises: adding a noise function to the spam score using the processor.

7. The method of claim 6, wherein the second function varies the first spam score by a larger amount than the first function when the first spam score value is proximate to one of the first threshold and the second threshold.

8. The method of claim 7, wherein varying the first spam score according to the second function comprises: increasing the first spam score when the first spam score is proximate to the second threshold.

9. The method of claim 7, wherein varying the first spam score according to the second function comprises: decreasing the first spam score when the first spam score is proximate to the first threshold.

10. The method of claim 1, further comprising, in the event that the second spam score value is less than a second threshold, including the listing in the group of searchable listings.

11. A computer-implemented method comprising: receiving a listing to be included in a group of searchable listings; assigning, using a processor, a first spam score value to the listing; varying the first spam score value to a second spam score value using the processor, wherein varying the first spam score includes: varying the first spam score according to a first function if the first spam score is within a first subset of the predetermined range, the first function including a noise component, and varying the first spam score value according to a second function if the first spam score value is within a second subset of the predetermined range, the second function including a non-noise component; and in the event that the second spam score value is larger than a first threshold, identifying the listing as potential spam.

12. The method of claim 11, further comprising, in the event that the second spam score value is larger than a second threshold, removing the listing from the group of searchable listings.

13. The method of claim 12, wherein the second threshold is greater than the first threshold.

14. The method of claim 13, wherein varying the first spam score value according to the second function varies the first spam score value by a larger amount than the first function when the spam score value is proximate to one of the first threshold and the second threshold.

15. The method of claim 13, wherein varying the first spam score value according to the second function comprises: increasing the first spam score value in the event that the first spam score is proximate to the first threshold.

16. The method of claim 13, wherein varying the first spam score value according to the second function comprises: decreasing the first spam score value in the event that the first spam score is proximate to the second threshold.

17. A computer-implemented method for circumventing the reverse engineering of a spam detection algorithm, the method comprising: receiving a listing to be included in a group of searchable listings; assigning a first spam score within a predetermined range to the listing using a processor, wherein the first spam score indicates a likelihood that the listing is legitimate; varying the first spam score to a second spam score using the processor, wherein varying the first spam score includes: varying the first spam score according to a first function if the first spam score is within a first subset of the predetermined range, varying the first spam score according to a second function if the first spam score is within a second subset of the predetermined range, the second function including a non-noise component, and varying the first spam score according to a third function if the first spam score is within a third subset of the predetermined range, the third function including a non-noise component and being different from the second function; in the event that the second spam score is larger than a first threshold, removing the listing from or not including the listing in the group of searchable listings; and in the event that the second spam score value is larger than a second threshold, identifying the listing as potential spam, wherein the second threshold is less than the first threshold.

18. The method of claim 17, wherein the listing identified as potential spam is flagged to determine whether the listing is to be removed from or included in the group of searchable listings.

19. The method of claim 17, wherein the listing identified as potential spam is demoted such that it does not appear as frequently or as highly ranked as other similarly situated listings in response to a search of the group of searchable listings.

20. The method of claim 17, wherein varying the first spam score according to one of the second function and the third function comprises: varying the spam score by a larger amount when the first spam score value is proximate to one of the first threshold and the second threshold.

21. The method of claim 17, wherein varying the first spam score according to the second function comprises: increasing the first spam score when the spam score is proximate to the second threshold.

22. The method of claim 17, wherein varying the first spam score according to the third function comprises: decreasing the spam score when the spam score is proximate to the first threshold.

23. The method of claim 17, in the event that the spam score is less than the second threshold, including the listing in the group of searchable listings.

Description

BACKGROUND

Various network-based search applications allow a user to enter search terms and receive a list of search results. Such applications commonly use ranking algorithms to ensure that the search results are relevant to the user's query. For example, some systems rank such results based on reliability and safety of the search result, location of the user and search result, etc. These services may also provide business listings in response to a particular search query.

The business listing search results, or data identifying a business, its contact information, web site address, and other associated content, may be displayed to a user such that the most relevant businesses may be easily identified. In an attempt to generate more customers, some businesses may employ methods to include multiple different listings to identify the same business. For example, a business may contribute a large number of listings for nonexistent business locations to a search engine, and each listing is provided with a contact telephone number that is associated with the actual business location. The customer may be defrauded by contacting or visiting an entity believed to be at a particular location only to learn that the business is actually operating from a completely different location. Such fraudulent marketing tactics are commonly referred to as "fake business spam".

In order to provide users with correct information, search engine companies occasionally modify their ranking algorithms to attempt to identify and exclude fake business spam listings from results presented to end users. Many spammers continually monitor changes in search engine rankings for their fake listings to determine when ranking algorithm changes occur and what those changes are. By reverse engineering spam identification aspects of the ranking algorithm, spammers can determine how to modify their fake listings to avoid spam-catching ranking penalties. Given the large number of spammers doing this, it is difficult for search engines to prevail.

SUMMARY

Aspects of the present disclosure relate generally to adding noise to spam scores to circumvent the reverse engineering of spam detection algorithms. A spam score value is assigned to a business listing when the listing is received at a search entity. The spam score may be between values of zero and one, where zero indicates that the listing is legitimate and one indicates that the listing is fraudulent. The spam score is then varied slightly by adding noise to the spam score. The amount of noise that is added is sufficient to affect spammers' attempts to reverse engineer spam-identification algorithms, but small enough that the search experience of end users is minimally affected because the ranking of legitimate listings is unaffected.

In the event that the spam score is greater than a first threshold, the listing is identified as fraudulent and the listing is removed from or not included in the group of searchable business listings. In the event that the spam score is greater than a second threshold that is less than the first threshold, the listing may be demoted such that the listing does not appear in response to a search as frequently or as highly rated as listings having spam scores that are less than the second threshold. The amount of demotion may be affected by the spam score such that listings that appear to be more likely spam are demoted more than other listings. Alternatively, the listing may be flagged for inspection such that the listing may be further analyzed to determine whether the listing is fraudulent. In the event that the spam score is less than the first and second thresholds, the corresponding listing is identified as legitimate. The listing is then included in the group of searchable listings such that the listing may be provided in response to a search with no change to its ranking.

By varying the spam score, it becomes difficult for spammers to reverse engineer the spam detection algorithm. The amount by which the spam score is varied may change over time and may depend on an initial value of the spam score. Accordingly, spammers may not be able to easily reverse engineer the spam detection algorithms that are in place especially with regard to values corresponding to the first and second thresholds. The difficulty of deducing the effects of listing changes may increase greatly if those effects appear to be nondeterministic to spammers.

In one aspect, a computer-implemented method includes receiving a listing to be included in a group of searchable listings. Using a processor, a spam score is assigned to the listing and the spam score is varied. In the event that the spam score is larger than a first threshold, the listing is removed from or not included in the group of searchable listings.

In another aspect, a computer-implemented method includes receiving a listing to be included in a group of searchable listings. Using a processor, a spam score value is assigned to the listing. Noise is added to the spam score value using the processor. In the event that the spam score value is larger than a first threshold, the listing is identified as potential spam.

In another aspect, a computer-implemented method for circumventing the reverse engineering of a spam detection algorithm includes receiving a listing to be included in a group of searchable listings. A spam score is assigned to the listing using a processor. The spam score indicates a likelihood that the listing is legitimate. The spam score is varied by adding noise to the spam score using the processor. In the event that the spam score is larger than a first threshold, the listing is removed from or not included in the group of searchable listings. In the event that the spam score value is larger than a second threshold, the listing is identified as potential spam. The second threshold is less than the first threshold.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a functional diagram of a system in accordance with an example embodiment.

FIG. 2 is a pictorial diagram of the system of FIG. 1.

FIG. 3 is an exemplary flow diagram in accordance with an example embodiment.

FIG. 4 illustrates a spam score variation function in accordance with an example embodiment.

FIG. 5 illustrates a spam score variation function in accordance with an example embodiment.

DETAILED DESCRIPTION

The present disclosure is directed to the addition of noise to spam scores to circumvent the reverse engineering of algorithms that seek to identify fraudulent business listings. A search entity receives new business listings to be added to a searchable group of listings on a regular basis. Each listing is processed to associate the listing with a spam score value that identifies a likelihood that the listing is fraudulent. In some embodiments, the spam score is between zero and one, where a score close to one indicates that the listing is likely spam and a score close to zero indicates that the listing is likely legitimate.

In cases where the listing is identified as likely being a fake business listing, different penalties are assigned to the listing based on the spam score value. For example, in the event that the spam score is greater than 0.95, the listing may be removed from the searchable group of listings because there is a strong likelihood that the listing is spam. In the event that the spam score is between 0.5 and 0.95, the listing may or may not actually be a fake business listing. In this case, the listing may not be removed from the group of searchable listings; but since the spam score indicates a likelihood that the listing is not legitimate, the listing may be demoted such that, for example, it does not appear as frequently or as highly rated in search results as other similarly situated business listings. If the spam score is less than 0.5, the listing may not be subject to any penalty.
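The penalty tiers in the preceding paragraph can be illustrated with a minimal Python sketch. The cutoffs 0.5 and 0.95 are the example values from the text; the names `Penalty` and `penalty_for` are hypothetical, and the treatment of scores exactly equal to a cutoff is an assumption, since the text does not specify it.

```python
from enum import Enum


class Penalty(Enum):
    NONE = "none"
    DEMOTE = "demote"
    REMOVE = "remove"


# Example cutoffs from the text; other embodiments may use different values.
REMOVE_CUTOFF = 0.95
DEMOTE_CUTOFF = 0.5


def penalty_for(spam_score: float) -> Penalty:
    """Map a spam score in [0, 1] to one of the three penalty tiers."""
    if not 0.0 <= spam_score <= 1.0:
        raise ValueError("spam score must lie in [0, 1]")
    if spam_score > REMOVE_CUTOFF:
        return Penalty.REMOVE   # strong likelihood of spam: drop the listing
    if spam_score >= DEMOTE_CUTOFF:
        # Assumption: a score exactly at a cutoff falls in the demote tier.
        return Penalty.DEMOTE   # uncertain: keep the listing but rank it lower
    return Penalty.NONE         # likely legitimate: no penalty
```

For instance, `penalty_for(0.7)` falls in the demote tier, while `penalty_for(0.2)` incurs no penalty.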

Spammers frequently perform experiments by submitting fake data to a search entity and determining the result of a search targeting the fake data. In one example, a spammer may provide the search entity with different "spammy" listings that are each slightly different from each other. The listings are then targeted in a search by the spammer to determine which listings went essentially undetected by the spam identifying algorithms and are treated as legitimate, and which listings were determined by the search entity to be spam. Spammers also examine the order of results for searches for their own listings to determine which listings have been demoted, and to measure the relative demotion that occurs as a function of various features of those listings. This allows spammers to determine which listing features cause demotion (and how much) and which features are ignored by the spam scoring system. Accordingly, spammers can attempt to circumvent spam filters in order to have fake business listings appear as un-demoted search results.

In accordance with some aspects of the present disclosure, spam scores are varied slightly using a noise function. In the event that a spam score is near a penalty cutoff value, the spam score may not be varied too much, because otherwise some "spammy" listings may avoid being penalized. The spam score variation may occur discretely, such as whenever a search index is copied to production search machines. Spammers submit listings, wait for the search entity to copy an index and, based on the search results, attempt to decipher the result of the spam filter. As stated above, the addition of noise to spam scores may cause spammers to be continually frustrated by the outcome of their experiments.

An index is a set of data files underlying a search engine. The index may include listings from businesses that have submitted a listing to be added to the group of searchable listings. The index is organized so that the listings can be produced in response to a search based on the terms in each listing. A new index may be built every few weeks and then copied to production search machines, replacing the previous index data. Building the index includes (a) collecting data associated with the group of searchable listings, (b) merging duplicate and near-duplicate information, and removing any identifiable incorrect information, and (c) building the index structure such that listings can be located in response to the submission of search terms.
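Steps (a) through (c) above can be sketched in Python as a toy inverted index. This is only an illustration under stated assumptions: the listing fields `name` and `phone`, the duplicate-detection key, and the rule that a listing missing either field counts as identifiably incorrect are all hypothetical and not taken from the disclosure.

```python
from collections import defaultdict


def dedup_key(listing: dict) -> tuple:
    """Assumed near-duplicate key: case-folded name plus the phone digits."""
    name = " ".join(listing["name"].lower().split())
    phone = "".join(ch for ch in listing["phone"] if ch.isdigit())
    return (name, phone)


def build_index(collected: list) -> tuple:
    """Return (inverted index, kept listings) for the collected listings."""
    # (a) `collected` holds the data gathered for the group of listings.
    # (b) Merge duplicates and near-duplicates; drop incomplete records.
    merged = {}
    for listing in collected:
        if not listing.get("name") or not listing.get("phone"):
            continue  # assumption: missing fields count as incorrect data
        merged.setdefault(dedup_key(listing), listing)
    kept = list(merged.values())

    # (c) Build the structure that maps each search term to listings.
    index = defaultdict(set)
    for position, listing in enumerate(kept):
        for term in listing["name"].lower().split():
            index[term].add(position)
    return dict(index), kept


def search(index: dict, kept: list, query: str) -> list:
    """Return the listings whose names contain every term of the query."""
    positions = None
    for term in query.lower().split():
        hits = index.get(term, set())
        positions = hits if positions is None else positions & hits
    return [kept[p] for p in sorted(positions or [])]
```

Rebuilding the index from scratch and swapping it in, as the text describes, corresponds here to calling `build_index` again on freshly collected data and replacing the previous `(index, kept)` pair.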

Spam filtering may occur at any one of three different times: 1) at the beginning of the indexing process, where listings that are obviously spam are removed; 2) near the end of the indexing process, where the listings are analyzed more thoroughly to judge whether or not the listing is spam; or 3) in an update stream, which is a separate process that receives updates in real time from users and applies the updates to an output index that is used in place of a former index. The index data may be processed to identify spam near the end of the index generation process before the index is copied to the production machines. Any spam listings are demoted, are flagged for inspection, or are removed so that they are not included in the group of searchable listings.

It does not matter at what point during the index data processing the noise is added to spam scores. However, the noise may be added to the spam score of a listing before a decision is made whether to drop or demote the listing because: (a) if the listing is going to be demoted, the noise should be allowed to affect the amount of demotion or to push the listing over the drop threshold causing the listing to be dropped, or (b) if the listing is going to be dropped, the spam score is pushed over the demote threshold with small probability, causing the listing to be demoted. The noise function is shaped to identify an amount for the small probability. In some embodiments, the noise addition is performed as late in the process as possible to make coding easier.

As shown in FIGS. 1 and 2, a system 100 in accordance with example embodiments includes a computer 110 containing a processor 120, memory 130 and other components typically present in general purpose computers. The memory 130 stores information accessible by the processor 120, including instructions 132 and data 134 that may be executed or otherwise used by the processor 120. The memory 130 may be of any type capable of storing information accessible by the processor 120, including a computer-readable medium, or other medium that stores data that may be read with the aid of an electronic device, such as a hard-drive, memory card, flash drive, ROM, RAM, DVD or other optical disks, as well as other write-capable and read-only memories. In that regard, memory may include short term or temporary storage as well as long term or persistent storage. Systems and methods may include different combinations of the foregoing, whereby different portions of the instructions and data are stored on different types of media.

The instructions 132 may be any set of instructions to be executed directly (such as machine code) or indirectly (such as scripts) by the processor. For example, the instructions may be stored as computer code on the computer-readable medium. In that regard, the terms "instructions" and "programs" may be used interchangeably herein. The instructions may be stored in object code format for direct processing by the processor, or in any other computer language including scripts or collections of independent source code modules that are interpreted on demand or compiled in advance. Functions, methods and routines of the instructions are explained in more detail below.

The data 134 may be retrieved, stored or modified by the processor 120 in accordance with the instructions 132. For instance, although the architecture is not limited by any particular data structure, the data may be stored in computer registers, in a relational database as a table having a plurality of different fields and records, etc. The data may also be formatted in any computer-readable format. By further way of example only, image data may be stored as bitmaps comprised of grids of pixels that are stored in accordance with formats that are compressed or uncompressed, lossless or lossy, and bitmap or vector-based, as well as computer instructions for drawing graphics. The data may comprise any information sufficient to identify the relevant information, such as numbers, descriptive text, proprietary codes, references to data stored in other areas of the same memory or different memories (including other network locations) or information that is used by a function to calculate the relevant data. Data 134 of server 110 may include data 136 corresponding to spam score algorithms, spam score thresholds and noise functions to be added to spam scores, which are described in detail below.

The processor 120 may be any conventional processor, such as a CPU for a personal computer. Alternatively, the processor 120 may be a dedicated controller such as an ASIC. Although FIG. 1 functionally illustrates the processor 120 and memory 130 as being within the same block, it will be understood by those of ordinary skill in the art that the processor and memory may actually comprise multiple processors and memories that may or may not be stored within the same physical housing. For example, memory may be a hard drive or other storage media located in a server farm of a data center. Accordingly, references to a processor, a computer or a memory will be understood to include references to a collection of processors or computers or memories that may or may not operate in parallel.

The computer 110 may be at one node of a network 150 and capable of directly and indirectly receiving data from other nodes of the network. For example, computer 110 may comprise a web server that is capable of receiving data from client devices 160, 170 via network 150 such that server 110 uses network 150 to transmit and display information to a user on display 165 of client device 160. Server 110 may also comprise a plurality of computers that exchange information with different nodes of a network for the purpose of receiving, processing and transmitting data to the client devices 160, 170. In this instance, the client devices 160, 170 will typically still be at different nodes of the network than any of the computers comprising server 110.

Network 150, and intervening nodes between server 110 and client devices 160, 170, may comprise various configurations and use various protocols including the Internet, World Wide Web, intranets, virtual private networks, local Ethernet networks, private networks using communication protocols proprietary to one or more companies, cellular and wireless networks (e.g., Wi-Fi), instant messaging, HTTP and SMTP, and various combinations of the foregoing. Although only a few computers are depicted in FIGS. 1 and 2, it should be appreciated that a typical system can include a large number of connected computers.

Each client device 160 may be configured similarly to the server 110, with a processor, memory and instructions as described above. Each client device 160 may be a personal computer intended for use by a person, and have all of the components normally used in connection with a personal computer such as a central processing unit (CPU) 162, memory (e.g., RAM and internal hard drives) storing data 163 and instructions 164, an electronic display 165 (e.g., a monitor having a screen, a touch-screen, a projector, a television, a computer printer or any other electrical device that is operable to display information), and user input 166 (e.g., a mouse, keyboard, touch-screen or microphone). The client device 160 may also include a camera 167, geographical position component 168, accelerometer, speakers, a network interface device, a battery power supply 169 or other power source, and all of the components used for connecting these elements to one another.

In addition to the operations described below and illustrated in the figures, various operations in accordance with example embodiments will now be described. It should also be understood that the following operations do not have to be performed in the precise order described below. Rather, various steps can be handled in a different order or simultaneously, and may include additional or fewer operations.

FIG. 3 demonstrates a process 300 of adding noise to spam scores to circumvent the reverse engineering of spam detection algorithms. The process begins when a business listing is received at a search entity from a business that wants to have its listing included in a group of searchable business listings (block 310). The business may desire to increase traffic to a website or otherwise attract potential customers. Some of the listings that are received may be fraudulent, such as those listings that may be identified as "fake business spam" that are submitted in an unscrupulous attempt to increase customer traffic.

In order to identify which listings may be fraudulent, a spam score is assigned to each listing (block 320). In some embodiments, the spam score may be between values of zero and one, where zero indicates that the listing is legitimate and one indicates that the listing is fraudulent. The spam score may be based on any number of factors or combinations of factors. Example factors include the geographic density of businesses in the same category, repeated identifying information in different listings, and ratios of common terms in the business listing title to total words in the title. It is understood that the spam score may be based on any number of known methods for determining whether a listing is fraudulent.
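One of the example factors above, the ratio of common terms in the listing title to total words in the title, can be sketched as follows. The term list `COMMON_TERMS` and the function name are hypothetical, and the disclosure does not say how such a factor would be weighted or combined into the final spam score.

```python
# Hypothetical set of terms assumed to be common in listing titles.
COMMON_TERMS = frozenset({"locksmith", "cheap", "best", "24/7", "emergency"})


def common_term_ratio(title: str, common_terms=COMMON_TERMS) -> float:
    """Fraction of the words in a title that are in the common-term set."""
    words = title.lower().split()
    if not words:
        return 0.0  # an empty title has no terms to count
    hits = sum(1 for word in words if word in common_terms)
    return hits / len(words)
```

A title made up mostly of generic keywords yields a ratio near one, which a scoring algorithm might treat as one signal among several.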

The spam score is varied (block 330). The spam score may be varied by adding a noise function to the spam score. The variations in the spam score, especially at specific boundaries on the spam score spectrum, lead to the difficulty in spammers being able to reverse engineer spam detection algorithms. The noise function and its application to the spam score are described in detail below with reference to FIGS. 4 and 5.

A determination is made whether the spam score is higher than a first threshold (block 340). In this example, the first threshold is desirably set at a value that is closer to 1 than 0.5 such that spam scores that are higher than the first threshold are determined to be likely fraudulent. In some embodiments, the first threshold is set at about 0.8. In the event that the spam score is greater than the first threshold, processing proceeds to block 350. Here, the listing is identified as fraudulent, and the listing is not included in (or removed from) the group of searchable business listings. Processing then terminates.

In the event that the spam score is not greater than the first threshold, processing proceeds to block 360. At block 360, a determination is made whether the spam score is higher than a second threshold. The second threshold is set at a value that is less than the first threshold such that a spam score that is higher than the second threshold but less than the first threshold is identified as corresponding to a listing that may or may not be fraudulent. In some embodiments, the second threshold is set at about 0.6.

In the event that the spam score is greater than the second threshold but less than the first threshold, processing continues to block 370. At block 370, the corresponding listing may be demoted such that the listing does not appear in response to a search as frequently or as highly rated as listings having spam scores that are less than the second threshold. Alternatively, the listing may be flagged for inspection such that the search entity may analyze the listing to determine a likelihood that the listing is fraudulent. Processing then terminates.

In the event that the spam score is not greater than the second threshold (and is therefore also less than the first threshold), processing continues to block 380. At block 380, the corresponding listing is identified as legitimate. The listing is then included in the group of searchable listings such that the listing may be provided in response to a search. Processing then terminates.
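The decision flow of blocks 340 through 380 may be sketched as follows. This is a minimal illustration, not the claimed implementation: the names `Verdict` and `classify` are hypothetical, the threshold values 0.8 and 0.6 are the example values given above, and the handling of a score exactly equal to the second threshold (treated here as legitimate) is an assumption.

```python
from enum import Enum


class Verdict(Enum):
    FRAUDULENT = "fraudulent"   # block 350: excluded or removed
    SUSPECT = "suspect"         # block 370: demoted or flagged for inspection
    LEGITIMATE = "legitimate"   # block 380: included in searchable listings


# Example threshold values from the description ("about 0.8" and "about 0.6").
FIRST_THRESHOLD = 0.8
SECOND_THRESHOLD = 0.6


def classify(noisy_score, first=FIRST_THRESHOLD, second=SECOND_THRESHOLD):
    """Map an already-varied spam score to one of the three outcomes."""
    if not second < first:
        raise ValueError("second threshold must be less than first threshold")
    if noisy_score > first:       # block 340
        return Verdict.FRAUDULENT
    if noisy_score > second:      # block 360
        return Verdict.SUSPECT
    return Verdict.LEGITIMATE
```

The function takes the varied score as input, so the comparison logic is unchanged whether or not noise has been applied.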

As discussed above, an algorithm or combination of algorithms is used to provide each listing with a spam score value between 0.0 and 1.0, where 0.0 identifies the listing as likely legitimate and 1.0 identifies the listing as fraudulent.

Function B(x) is a function that defines a maximum amount of noise added to any given spam score. For example, if a listing has a spam score of 0.7, the value of the bound function B(0.7) may be 0.08. Accordingly, the maximum amount of noise that is applied to the listing is 0.08 such that the final spam score for this listing is between 0.62 and 0.78. Function B(x) maps [0, 1] to [0, 1] such that the input to the function B(x) is a value between 0 and 1 inclusive, and the output of the function B(x) is a value between 0 and 1 inclusive.

Function B(x) receives a spam score as input in the range of 0.0-1.0 inclusive, and outputs a noise bound in the range of 0.0-1.0 inclusive, where: $B(x) \le x$ (Eq. 1) and $B(x) \le 1 - x$ (Eq. 2).

A limit value (L) is defined to be a maximum amount of noise applied to a listing's spam score.

A simple "B" may be: $B(x) = K\,(x - x^2)^2$ (Eq. 3) for some choice of K that ensures that the properties of Eqs. 1 and 2 are fulfilled.

The function has a maximum value at x=1/2 of 0.0625 when K=1. To ensure that $B(x) \le L$ for all x, for a limit value $0 \le L \le 1$, K is set to L/0.0625. The example function B(x) that satisfies these parameters (with L=0.1) is shown in FIG. 4.
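The bound function of Eq. 3 and the scaling K=L/0.0625 may be checked numerically as sketched below. The name `make_bound` and the 1001-point grid are illustrative assumptions. Separately, a short calculation suggests that this particular form satisfies Eqs. 1 and 2 only while K is at most 27/4 (that is, L at most 27/64, about 0.42); the last check in the sketch illustrates this with L=0.5.

```python
def make_bound(limit):
    """Return B(x) = K * (x - x**2)**2 with K chosen so that max B(x) == limit."""
    if not 0.0 <= limit <= 1.0:
        raise ValueError("limit L must lie in [0, 1]")
    k = limit / 0.0625  # 0.0625 is the peak of (x - x**2)**2, reached at x = 1/2

    def bound(x):
        if not 0.0 <= x <= 1.0:
            raise ValueError("spam score must lie in [0, 1]")
        return k * (x - x * x) ** 2

    return bound


B = make_bound(0.1)                      # L = 0.1, as in FIG. 4
grid = [i / 1000 for i in range(1001)]   # sample points in [0, 1]

peak = max(B(x) for x in grid)                   # expected to equal L
eq1_holds = all(B(x) <= x for x in grid)         # Eq. 1
eq2_holds = all(B(x) <= 1 - x for x in grid)     # Eq. 2

# With a much larger limit, Eq. 1 fails near x = 1/3 for this form of B.
loose = make_bound(0.5)
eq1_violated_for_large_limit = any(loose(x) > x for x in grid)
```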

Letting "z" be a variable denoting a listing and "S" indicate a pre-existing spam function, and given a function "B" as in Eq. 3 and a random number generator "R" that has values distributed uniformly between -1.0 and 1.0 and that may use information from a listing for its seed value, a noise function "N" is defined on listings as: $N(z) = B(S(z)) \cdot R(z)$ (Eq. 4)

"R" may be a simple random number generator, or "R" may be a normalization of a hash of contents of "z" along with, for example, the date or the name of the index being built. This definition ensures that: $-B(S(z)) \le N(z) \le B(S(z))$ (Eq. 5)

Combining with Eq. 1: $-S(z) \le -B(S(z)) \le N(z) \le B(S(z)) \le S(z)$ (Eq. 6)

A "noisy spam score" function S'(z) is defined as: $S'(z) = S(z) + N(z)$ (Eq. 7), where function S'(z) has the following properties:
$0 \le S'(z) \le 1$ (Eq. 8)
$S(z) - L \le S'(z) \le S(z) + L$ (Eq. 9)
When $S(z) = 0$, $S'(z) = 0$ (Eq. 10)
When $S(z) = 1$, $S'(z) = 1$ (Eq. 11)

S' also produces spam scores that can be used in place of S with no change to any other software component in the system.
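Eqs. 4 and 7 may be sketched as follows, taking the option in which "R" is a normalization of a hash of the listing contents together with the name of the index being built. The function names, the choice of SHA-256, the use of the first eight digest bytes, and the example listing strings are all assumptions made for illustration. A hash normalized this way is only approximately uniform on [-1, 1].

```python
import hashlib


def bound(x, limit=0.1):
    """B(x) from Eq. 3 with K = L / 0.0625."""
    return (limit / 0.0625) * (x - x * x) ** 2


def seeded_unit_noise(listing_text, index_name):
    """R(z): a deterministic value in [-1, 1] derived from a hash of the
    listing contents and the index name (hypothetical normalization)."""
    payload = f"{index_name}\x00{listing_text}".encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    as_int = int.from_bytes(digest[:8], "big")      # 0 .. 2**64 - 1
    return 2.0 * as_int / (2**64 - 1) - 1.0


def noisy_score(score, listing_text, index_name, limit=0.1):
    """S'(z) = S(z) + B(S(z)) * R(z), per Eqs. 4 and 7."""
    noise = bound(score, limit) * seeded_unit_noise(listing_text, index_name)
    return score + noise
```

Because the noise is derived from the listing and the index name, resubmitting an identical listing to the same index yields the same varied score, while a different index name yields a different one. Whether that behavior is desired is a design choice; a plain random number generator is the other option named above.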

The ultimate goal of any spam detection system is to penalize "spammy" content. Content is penalized by implementing various penalties according to the spam score a listing has garnered. Since there are only a finite number of different penalties that can be meted out (e.g., suppression of the listing, lowering the listing's prominence in search results, or doing nothing to the listing), there is a small number of key points on the continuum of spam scores, namely, the points that separate regions that incur different penalties.

For example, if the policy was to demote all listings with a score greater than 0.6, but completely block all listings with a score greater than 0.8, the two interesting points would be at 0.6 (the boundary between doing nothing and demotion) and 0.8 (the boundary between demotion and blocking). The addition of noise to a spam score proximate to a boundary point (e.g., within .+-.5%) may cause the corresponding listing to receive a score that is on the opposite side of the boundary, thereby receiving an inaccurate penalty.

There are two different possible effects caused by the addition of noise to spam scores. The first effect is that a listing on one side of a boundary point may receive enough noise to change the spam score value such that the value changes to the other side of the boundary point, giving the listing a different penalty. The second effect is that some penalties are parameterized by the spam score, so the noise changes the amount of the penalty. For example, listings with bigger spam scores may be demoted more than listings with lesser spam scores. Both of these outcomes will confuse spammers. Independent control may be gained over the two outcomes to make sure that legitimate listings are not penalized in an attempt to confound spammers.

Because of the properties of N, the value of S'(z) is close to the value of S(z) for any listing z. The values of S(z) and S'(z) differ by at most B(S(z)), which is no larger than L. For most listings, if S causes z to be dropped, S' will also cause z to be dropped; and if S causes z to be demoted, S' will also cause z to be demoted. Because of the added noise, some listings are treated differently than they otherwise would be treated (e.g., demoted instead of dropped, or vice versa), and demotion amounts may change slightly. For many pairs of listings, for example z and w, where the values of S(z) and S(w) are close (e.g., they differ by less than L), S(z)<S(w) but S'(z)>S'(w). In that case z and w are both demoted by a similar amount, and z and w may exchange places in the search results ranking.
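The two effects described above (a pair of close scores exchanging order, and a score near a boundary crossing it) may be estimated with a small simulation. The sketch below is illustrative only: the scores 0.70, 0.72 and 0.58, the boundary 0.6, the trial count, and the use of a seeded pseudo-random generator for "R" are all assumed values, and L=0.1 as in the example above.

```python
import random

LIMIT = 0.1  # L, the maximum noise


def bound(x, limit=LIMIT):
    """B(x) from Eq. 3 with K = L / 0.0625."""
    return (limit / 0.0625) * (x - x * x) ** 2


def noisy(score, rng):
    """S'(z) with R drawn uniformly from [-1, 1)."""
    return score + bound(score) * rng.uniform(-1.0, 1.0)


def simulate(score_z=0.70, score_w=0.72, near_boundary=0.58,
             boundary=0.6, trials=10_000, seed=0):
    """Return (swap rate for the pair z, w; rate at which a score just
    below the boundary ends up above it)."""
    rng = random.Random(seed)
    swaps = 0
    crossings = 0
    for _ in range(trials):
        if noisy(score_z, rng) > noisy(score_w, rng):
            swaps += 1          # S(z) < S(w) but S'(z) > S'(w)
        if noisy(near_boundary, rng) > boundary:
            crossings += 1      # listing receives a different penalty
    return swaps / trials, crossings / trials


swap_rate, crossing_rate = simulate()
```

Under these assumed values both rates are expected to be well above zero but below one half, so most outcomes match the noiseless system while a spammer probing near a boundary sees inconsistent results.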

Additional constraints may be imposed on the bound function to minimize the effect of noise on end-users. Accordingly, the effect of noise does not cause anyone to see substantially more spam than they would with a noiseless system. This result is achieved by shaping the bound function (B(x)) to be proximate to specific values at boundary points. The easiest way to achieve the desired result is by multiplying the bound function by a function that is near a value of 1.0 nearly everywhere, but that is also larger or smaller at the boundary points, depending on whether the effect of the boundary crossing is to be increased or decreased proximate to each boundary. Using Eq. 3 with L=0.1, one such bound function is:

$B'(x) = B(x)\,\bigl(1 + 0.9\,e^{-2000\,(x - 0.6)^2}\bigr)\,\bigl(1 - 0.9\,e^{-5000\,(x - 0.8)^2}\bigr)$ (Eq. 12)

Using B'(x) as defined in Eq. 12 instead of the original B(x) increases the effect (e.g., allows more noise) proximate to 0.6, and decreases the effect (e.g., suppresses more noise) proximate to 0.8, as shown in FIG. 5.

The values of "2000" and "5000" in Eq. 12 affect the steepness of the modification "bumps", and depend on where along the curve the corresponding "bump" occurs. The values of "0.6" and "0.8" indicate the boundary points. Each instance of 0.9 either removes 90% from or adds 90% to the noise at the corresponding boundary point. All of the values in Eq. 12 are configurable to achieve the effect desired. In general, to increase the amount of noise by a factor of X at spam score Y with steepness Z, an additional multiplicative term is added to Eq. 12 of the form: $1 + X\,e^{-Z\,(x - Y)^2}$

To decrease the amount of noise by a factor of X at spam score Y with steepness Z, a multiplicative term is added to Eq. 12 of the form: $1 - X\,e^{-Z\,(x - Y)^2}$
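The shaping of the bound function may be sketched as below, under the assumption that Eq. 12 is a product of B(x) with one factor per boundary point, each factor being 1 plus or minus an amplitude times a Gaussian-shaped bump. That reading is inferred from the description of the constants (0.9, 2000, 5000, 0.6, 0.8) rather than taken from a verified form of the equation, and the pairing of steepness 2000 with 0.6 and 5000 with 0.8 is likewise assumed. The names `bump` and `shaped_bound` are hypothetical.

```python
import math


def base_bound(x, limit=0.1):
    """B(x) from Eq. 3 with K = L / 0.0625."""
    return (limit / 0.0625) * (x - x * x) ** 2


def bump(x, center, amount, steepness):
    """Factor that is close to 1 away from `center` and equals 1 + amount at it.
    A positive amount allows more noise; a negative amount suppresses noise."""
    return 1.0 + amount * math.exp(-steepness * (x - center) ** 2)


# (boundary point Y, signed amount X, steepness Z); assumed pairing.
DEFAULT_BUMPS = ((0.6, +0.9, 2000.0), (0.8, -0.9, 5000.0))


def shaped_bound(x, bumps=DEFAULT_BUMPS, limit=0.1):
    """B'(x): the base bound multiplied by one bump factor per boundary."""
    value = base_bound(x, limit)
    for center, amount, steepness in bumps:
        value *= bump(x, center, amount, steepness)
    return value
```

Under this reading, the bound at 0.6 is about 1.9 times B(0.6) and the bound at 0.8 is about 0.1 times B(0.8). The boosted value at 0.6 is larger than L=0.1, so a design that needs Eq. 9 to hold strictly would have to choose a smaller base limit or amplitude.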

As described above, a spam score value is assigned to a business listing when the listing is received at a search entity. A noise function is added to the spam score such that the spam score is varied. In the event that the spam score is greater than a first threshold, the listing is identified as fraudulent and the listing is not included in (or is removed from) the group of searchable business listings. In the event that the spam score is greater than a second threshold that is less than the first threshold, the listing may be flagged for inspection. In the event that the spam score is less than the first and second thresholds, the corresponding listing is identified as legitimate. The addition of the noise to the spam scores prevents potential spammers from reverse engineering the spam detecting algorithm such that more "spammy" listings that are submitted to the search entity may be identified as fraudulent and not included in the group of searchable listings.

As these and other variations and combinations of the features discussed above can be utilized without departing from the scope of the claims, the foregoing description of exemplary embodiments should be taken by way of illustration rather than by way of limitation. It will also be understood that the provision of examples (as well as clauses phrased as "such as," "e.g.", "including" and the like) should not be interpreted as limiting; rather, the examples are intended to illustrate only some of many possible aspects.
