U.S. patent application number 11/932589 was filed with the patent office on 2007-10-31 and published on 2009-04-30 for image spam filtering based on senders' intention analysis.
This patent application is currently assigned to FORTINET, INC., A Delaware Corporation. Invention is credited to Jiandong Cheng, Jun Lu.
Application Number | 20090113003 11/932589 |
Family ID | 40582891 |
Filed Date | 2007-10-31 |
United States Patent Application | 20090113003
Kind Code | A1
Lu; Jun; et al. | April 30, 2009
IMAGE SPAM FILTERING BASED ON SENDERS' INTENTION ANALYSIS
Abstract
Systems and methods for an anti-spam detection module that can
detect image spam are provided. According to one embodiment, an
image spam detection process involves determining and measuring
various characteristics of images that may be embedded within or
otherwise associated with an electronic mail (email) message. An
approximate display location of the embedded images is determined.
The existence of one or more abnormal factors associated with the
embedded images is identified. A quantity of text included in the
one or more embedded images is determined and measured by analyzing
one or more blocks of binarized representations of the one or more
embedded images. Finally, the likelihood that the email message is
spam is determined based on one or more of the approximate display
location, the existence of one or more abnormal factors and the
quantity and location of text measured.
Inventors: | Lu; Jun; (Ottawa, CA); Cheng; Jiandong; (Ottawa, CA)
Correspondence Address: | MICHAEL A DESANCTIS; HAMILTON DESANCTIS & CHA LLP; FINANCIAL PLAZA AT UNION SQUARE, 225 UNION BOULEVARD, SUITE 305; LAKEWOOD, CO 80228, US
Assignee: | FORTINET, INC., A Delaware Corporation
Family ID: | 40582891
Appl. No.: | 11/932589
Filed: | October 31, 2007
Current U.S. Class: | 709/206
Current CPC Class: | G06K 9/00456 20130101; G06Q 10/107 20130101
Class at Publication: | 709/206
International Class: | G06F 15/16 20060101; G06F015/16
Claims
1. A method comprising: measuring or estimating one or more of the
quantity and position of text within an image associated with an
electronic message; and estimating the likelihood that the
electronic message is spam based at least in part on results of
the measuring or estimating.
2. The method of claim 1, wherein the electronic message comprises
an electronic mail (email) message.
3. The method of claim 1, wherein the image is divided up into a
plurality of blocks and image processing is applied to each of the
plurality of blocks.
4. The method of claim 3, wherein the image processing includes
local thresholding.
5. The method of claim 3, wherein the image processing includes
global thresholding.
6. The method of claim 1, wherein filtering is applied to the image
to remove noise deliberately added by an originator of the
electronic message.
7. The method of claim 3, wherein the image processing comprises
converting the image or one or more of the plurality of blocks to
grayscale.
8. The method of claim 3, further comprising determining which
colors or intensities are likely to represent text within the image
or within one or more of the plurality of blocks by calculating an
intensity histogram for the image or for the one or more of the
plurality of blocks.
9. The method of claim 3, wherein the quantity of text is measured
or estimated by summing the number of blocks within a portion of
the image visible in a preview pane of an email client.
10-27. (canceled)
28. A computer-readable medium having stored thereon instructions,
which when executed by one or more processors cause the one or more
processors to perform a method comprising: measuring or estimating
one or more of the quantity and position of text within an image
associated with an electronic message; and estimating the
likelihood that the electronic message is spam based at least in
part on results of the measuring or estimating.
29. The computer-readable medium of claim 28, wherein the
electronic message comprises an electronic mail (email)
message.
30. The computer-readable medium of claim 28, wherein the image is
divided up into a plurality of blocks and image processing is
applied to each of the plurality of blocks.
31. The computer-readable medium of claim 30, wherein the image
processing includes local thresholding.
32. The computer-readable medium of claim 30, wherein the image
processing includes global thresholding.
33. The computer-readable medium of claim 28, wherein filtering is
applied to the image to remove noise deliberately added by an
originator of the electronic message.
34. The computer-readable medium of claim 30, wherein the image
processing comprises converting the image or one or more of the
plurality of blocks to grayscale.
35. The computer-readable medium of claim 30, further comprising
determining which colors or intensities are likely to represent
text within the image or within one or more of the plurality of
blocks by calculating an intensity histogram for the image or for
the one or more of the plurality of blocks.
36. The computer-readable medium of claim 30, wherein the quantity
of text is measured or estimated by summing the number of blocks
within a portion of the image visible in a preview pane of an email
client.
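As an illustration of the thresholding and intensity-histogram steps recited in claims 4, 5, 8, 31, 32 and 35, the following Python sketch computes an intensity histogram and derives a threshold from it. Otsu's method is used here only as a stand-in, since the claims do not name a particular threshold-selection rule. The function names, and the assumption that text occupies fewer pixels than the background, are hypothetical. Applying the functions to the whole image corresponds to global thresholding; applying them to each block separately corresponds to local thresholding.

```python
from typing import List, Sequence


def intensity_histogram(pixels: Sequence[int]) -> List[int]:
    """Count how many pixels fall on each of the 256 grayscale levels."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    return hist


def otsu_threshold(hist: Sequence[int]) -> int:
    """Return the level that best separates pixels <= level from pixels > level.

    Otsu's method is a swapped-in technique chosen for this sketch.
    A single-level (uniform) histogram returns 0.
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_level, best_variance = 0, -1.0
    count_low = 0  # pixels at or below the candidate level
    sum_low = 0    # sum of intensities at or below the candidate level
    for level in range(256):
        count_low += hist[level]
        sum_low += level * hist[level]
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one class is empty, so there is nothing to separate
        mean_low = sum_low / count_low
        mean_high = (sum_all - sum_low) / count_high
        # Between-class variance, up to a constant factor.
        variance = count_low * count_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def text_is_darker(hist: Sequence[int], level: int) -> bool:
    """Guess the text class, assuming text covers fewer pixels than background."""
    low = sum(hist[: level + 1])
    return low < sum(hist) - low
```

For a block with 30 dark pixels and 70 bright pixels, the threshold lands between the two groups and the dark group is flagged as the likely text class.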
Description
COPYRIGHT NOTICE
[0001] Contained herein is material that is subject to copyright
protection. The copyright owner has no objection to the facsimile
reproduction of the patent disclosure by any person as it appears
in the Patent and Trademark Office patent files or records, but
otherwise reserves all rights to the copyright whatsoever.
Copyright © 2007, Fortinet, Inc.
BACKGROUND
[0002] 1. Field
[0003] Embodiments of the present invention generally relate to the
field of spam filtering and anti-spam techniques. In particular,
various embodiments relate to image analysis and methods for
combating spam in which spammers use images to carry the
advertising text.
[0004] 2. Description of the Related Art
[0005] Image spam was originally created in order to get past
heuristic filters, which block messages containing words and
phrases commonly found in spam. Since image files have different
formats than the text found in the message body of an electronic
mail (email) message, conventional heuristic filters, which analyze
such text, do not detect the content of the message, which may be
partly or wholly conveyed by embedded text within the image. As a
result, heuristic filters were easily defeated by image spam
techniques.
[0006] To address this spamming technique, fuzzy signature
technologies, which flag both known and similar messages as spam,
were deployed by anti-spam vendors. Such fuzzy signature
technologies allowed message attachments to be targeted, thereby
recognizing as spam messages with different content but the same
attachment.
[0007] Spammers now alter the images to make the email message
appear different to signature-based filtering approaches while
maintaining readability of the embedded text message to human
viewers. The content of images lies in two levels: (i) the pixel
matrix and (ii) the text or graphics these pixel matrices
represent. At present, the notion of pixel-based matching does not
make sense, as the same text could be represented by countless
pixel matrices by simply changing various attributes, such as the
font, size, color or by adding noise. Therefore, hash matching and
other signature-based approaches have essentially been rendered
useless to block image spam as they fail as a result of even minor
changes to the background of the image.
[0008] Some vendors have attempted to catch image spam by employing
Optical Character Recognition (OCR) techniques; however, such
approaches have only limited success in view of spammers' use of
techniques to obscure the embedded text messages with a variety of
noise. FIGS. 1A and 1B illustrate sample images and obfuscation
techniques used by spammers to defeat OCR image spam detection
techniques. As shown in FIGS. 1A and 1B, polygons, lines, random
colors, jagged text, random dots, varying borders and the like may
be inserted into image spam in an attempt to defeat signature
detection techniques and obscure the embedded text from OCR
techniques. There are an almost infinite number of ways that
spammers can randomize images. In addition to the foregoing
obfuscation techniques, spammers have recently used techniques such
as varying the colors used in an image, changing the width and/or
pattern of the border, altering the font style, and slicing images
into smaller pieces (which are then reassembled to appear as a
single image to the recipient). Meanwhile, OCR is very
computationally expensive. Depending upon the implementation, fully
rendering a message and then looking for word matches against
different character set libraries may take as long as several
seconds per message, which is typically unacceptable for many
contexts.
SUMMARY
[0009] Systems and methods are described for an anti-spam detection
module that can detect image spam. According to one embodiment, one
or more of the quantity and position of text within an image
associated with an electronic message are measured or estimated.
Then, based at least in part on the results of the measuring or
estimating, the likelihood that the electronic message is spam is
determined.
[0010] According to another embodiment, an embedded image of an
electronic mail (email) message is converted to a binarized
representation by performing thresholding on a grayscale
representation of the embedded image. A quantity of text included
in the embedded image is then determined and measured by analyzing
one or more blocks of the binarized representations. Finally, the
email message is classified as spam or clean based at least in part
on the quantity of text measured.
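The embodiment of paragraph [0010] can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicants' implementation: the grayscale conversion uses the common ITU-R BT.601 luma weights, dark pixels are assumed to be foreground, and a block is counted as containing text when its foreground density falls within a hypothetical range (`min_density` to `max_density`). All function and parameter names are invented for the example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def to_grayscale(rgb_rows: List[List[Pixel]]) -> List[List[int]]:
    """Convert RGB pixels to 0-255 gray levels (BT.601 luma weights)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def binarize(gray: List[List[int]], threshold: int) -> List[List[int]]:
    """Global thresholding: 1 for dark (assumed foreground) pixels, else 0."""
    return [[1 if value <= threshold else 0 for value in row] for row in gray]


def count_text_blocks(
    binary: List[List[int]],
    block_w: int,
    block_h: int,
    min_density: float = 0.1,  # assumed lower bound for a text-like block
    max_density: float = 0.6,  # assumed upper bound; denser blocks look solid
) -> Tuple[int, int]:
    """Return (blocks judged to contain text, total blocks)."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    text_blocks = total_blocks = 0
    for top in range(0, height, block_h):
        for left in range(0, width, block_w):
            cells = [
                value
                for row in binary[top : top + block_h]
                for value in row[left : left + block_w]
            ]
            density = sum(cells) / len(cells)
            total_blocks += 1
            if min_density <= density <= max_density:
                text_blocks += 1
    return text_blocks, total_blocks
```

The ratio of text blocks to total blocks gives one possible measure of the quantity of text. A density test is a deliberately crude proxy for text detection and would need to be replaced by a stronger per-block test in practice.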
[0011] In one embodiment, the embedded image may be formatted in
accordance with the Graphic Interchange Format (GIF), Joint
Photographic Experts Group (JPEG) or Portable Network Graphics
(PNG) formats/standards.
[0012] In one embodiment, the embedded image may be an image
contained within a file attached to the email message.
[0013] In one embodiment, the method also includes determining an
approximate display location of an embedded image within the email
message and identifying existence of one or more abnormal factors
associated with the embedded image. Then, the classification can be
based upon the approximate display location, the existence of one
or more abnormal factors as well as the quantity of text
measured.
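One way to picture how the three signals of paragraph [0013] might be combined is a weighted score. The Python sketch below is an assumption-laden illustration: the weights, the cap of three abnormal factors, the text-ratio cutoff and the 0.5 decision threshold are all hypothetical values chosen for the example and do not appear in the application.

```python
from dataclasses import dataclass


@dataclass
class ImageFindings:
    in_preview_area: bool    # approximate display location near the top
    abnormal_factors: int    # number of abnormal factors identified
    text_block_ratio: float  # share of image blocks judged to contain text


def spam_likelihood(
    findings: ImageFindings,
    w_location: float = 0.3,         # hypothetical weight
    w_abnormal: float = 0.3,         # hypothetical weight
    w_text: float = 0.4,             # hypothetical weight
    text_ratio_cutoff: float = 0.25  # hypothetical ratio treated as "a lot of text"
) -> float:
    """Combine the three signals into a score between 0.0 and 1.0."""
    location = 1.0 if findings.in_preview_area else 0.0
    abnormal = min(findings.abnormal_factors, 3) / 3
    text = min(findings.text_block_ratio / text_ratio_cutoff, 1.0)
    return w_location * location + w_abnormal * abnormal + w_text * text


def classify(findings: ImageFindings, threshold: float = 0.5) -> str:
    """Label a message 'spam' or 'clean' using an assumed decision threshold."""
    return "spam" if spam_likelihood(findings) >= threshold else "clean"
```

A linear combination is only one option; the same inputs could equally feed a rule set or a trained classifier.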
[0014] Other features of embodiments of the present invention will
be apparent from the accompanying drawings and from the detailed
description that follows.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] Embodiments of the present invention are illustrated by way
of example, and not by way of limitation, in the figures of the
accompanying drawings and in which like reference numerals refer to
similar elements and in which:
[0016] FIGS. 1A and 1B illustrate sample images and obfuscation
techniques used by spammers to defeat OCR image spam detection
techniques.
[0017] FIG. 2 is a block diagram conceptually illustrating a
simplified network architecture in which embodiments of the present
invention may be employed.
[0018] FIG. 3 is a block diagram conceptually illustrating
interaction among various functional units of an email security
system with a client workstation and an email server in accordance
with an embodiment of the present invention.
[0019] FIG. 4 is an example of a computer system with which
embodiments of the present invention may be utilized.
[0020] FIG. 5 is a high-level flow diagram illustrating anti-spam
processing of images using sender's intention analysis in
accordance with an embodiment of the present invention.
[0021] FIG. 6 is a flow diagram illustrating quantity of text
measurement processing in accordance with an embodiment of the
present invention.
[0022] FIG. 7 is an example of an image spam email message
containing an embedded image.
[0023] FIG. 8 is a grayscale image based on the embedded image of
FIG. 7 according to one embodiment of the present invention.
[0024] FIG. 9 is an intensity histogram for the grayscale image of
FIG. 8 according to one embodiment of the present invention.
[0025] FIG. 10 is a binary image resulting from thresholding the
grayscale image of FIG. 8 in accordance with an embodiment of the
present invention.
[0026] FIG. 11 illustrates an exemplary segmentation of the binary
image of FIG. 10 into 28 virtual blocks and highlights the text
strings detected within the blocks in accordance with an embodiment
of the present invention.
[0027] FIG. 12 is a grayscale image based on another exemplary
embedded image observed in connection with image spam.
[0028] FIG. 13 illustrates an exemplary segmentation into 56
virtual blocks a binarized image corresponding to the image of FIG.
12 and highlights the text strings detected within the blocks in
accordance with an embodiment of the present invention.
DETAILED DESCRIPTION
[0029] Systems and methods are described for an anti-spam detection
module that can detect various forms of image spam. According to
one embodiment, images attached to or embedded within email
messages are analyzed to determine the senders' intention.
Empirical analysis reveals legitimate emails may contain embedded
images, but valid images sent through email rarely contain a
substantial quantity of text. Additionally, when legitimate images
are included within email messages, the senders of such email
messages do not painstakingly adjust the location of such included
images to assure such images appear in the preview window of an
email client. Furthermore, legitimate senders do not intentionally
inject noise into the embedded images. In contrast, spammers
usually compose email messages in different ways. For example, in
the context of image spam, spammers insert text into images to
avoid filtering by traditional text filters and employ techniques
to randomize images and/or obscure text embedded within images.
Spammers also typically make great efforts to draw attention to
their image spam by carefully placing the image in such a manner as
to make it visible to the recipient in the preview window/pane of
an email client that supports HTML email, such as Netscape
Messenger or Microsoft Outlook. Consequently, various indicators of
image spam include, but are not limited to, inclusion of one or
more images in the front part of an email message, inclusion of one
or more images containing text meeting a certain threshold and/or
inclusion of one or more images into which noise appears to have
been injected to obfuscate embedded text.
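The first indicator above, an image placed in the front part of the message, could be approximated by measuring how much visible text precedes the first image in the HTML body. The Python sketch below uses the standard library `html.parser` module. The 200-character limit and the class and function names are hypothetical, and character count is only a rough proxy for what a preview pane would actually display. The sketch also does not exclude script or style content from the count.

```python
from html.parser import HTMLParser


class FirstImageLocator(HTMLParser):
    """Record how many text characters appear before the first <img> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False
        self.image_src = ""
        self.chars_before_image = 0

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags are also routed here by HTMLParser.
        if tag == "img" and not self.found:
            self.found = True
            self.image_src = dict(attrs).get("src") or ""

    def handle_data(self, data):
        if not self.found:
            self.chars_before_image += len(data.strip())


def image_near_top(html_body: str, max_chars: int = 200) -> bool:
    """True if an image appears after at most max_chars of visible text."""
    locator = FirstImageLocator()
    locator.feed(html_body)
    locator.close()
    return locator.found and locator.chars_before_image <= max_chars
```

A message whose body opens with an image after a short greeting would return `True`, while a long newsletter with a logo in the footer would not.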
[0030] According to one embodiment, various image analysis
techniques are employed to more accurately detect image spam based
on senders' intention analysis. The goal of senders' intention
analysis is to discover the email message sender's intent by
examining various characteristics of the email message and the
embedded or attached images. If it appears, for example, after
performing image analysis that one or more images associated with
an email message have had one or more obfuscation techniques
applied, the intent is to draw attention to the one or more images
and/or the one or more images include suspicious quantities of
text, then the sender's intention analysis anti-spam processing may
flag the email message at issue as spam. In one embodiment, the
image scanning spam detection method is based on a combination of
email header analysis, email body analysis and image processing on
image attachments.
[0031] Importantly, although various embodiments of the anti-spam
detection module and image scanning methodologies are discussed in
the context of an email security system, they are equally
applicable to network gateways, email appliances, client
workstations, servers and other virtual or physical network devices
or appliances that may be logically interposed between client
workstations and servers, such as firewalls, network security
appliances, email security appliances, virtual private network
(VPN) gateways, switches, bridges, routers and like devices through
which email messages flow. Furthermore, the anti-spam techniques
described herein are equally applicable to instant messages,
Multimedia Message Service (MMS) messages and other forms of
electronic communications in the event that such messages become
vulnerable to image spam in the future.
[0032] Additionally, various embodiments of the present invention
are described with reference to filtering of incoming email
messages. However, it is to be understood that the filtering
methodologies described herein are equally applicable to email
messages originated within an enterprise and circulated internally
or outgoing email messages intended for recipients outside of the
enterprise. Therefore, the specific examples presented herein are
not intended to be limiting and are merely representative of
exemplary functionality.
[0033] Furthermore, while, for convenience, various embodiments of
the present invention may be described with reference to detecting
image spam in the graphic/image file formats currently most
prevalent (i.e., Graphic Interchange Format (GIF), Joint
Photographic Experts Group (JPEG) and Portable Network Graphics
(PNG) graphic/image file formats), embodiments of the present
invention are not so limited and are equally applicable to various
other current and future graphic/image file formats, including, but
not limited to, Progressive Graphics File (PGF), Tagged Image File
Format (TIFF), bit mapped format (BMP), HDP, WDP, XPM, MacOS-PICT,
Irix-RGB, Multiresolution Seamless Image Database (MrSID), RAW
formats used by various digital cameras, various vector formats,
such as Scalable Vector Graphics (SVG), as well as other file
formats of attachments which may themselves contain embedded
images, such as Portable Document Format (PDF), Encapsulated
PostScript, SWF, Windows Metafile, AutoCAD DXF and CorelDraw
CDR.
[0034] In the following description, numerous specific details are
set forth in order to provide a thorough understanding of
embodiments of the present invention. It will be apparent, however,
to one skilled in the art that embodiments of the present invention
may be practiced without some of these specific details. In other
instances, well-known structures and devices are shown in block
diagram form.
[0035] Embodiments of the present invention include various steps,
which will be described below. The steps may be performed by
hardware components or may be embodied in machine-executable
instructions, which may be used to cause a general-purpose or
special-purpose processor programmed with the instructions to
perform the steps. Alternatively, the steps may be performed by a
combination of hardware, software, firmware and/or by human
operators.
[0036] Embodiments of the present invention may be provided as a
computer program product, which may include a machine-readable
medium having stored thereon instructions, which may be used to
program a computer (or other electronic devices) to perform a
process. The machine-readable medium may include, but is not
limited to, floppy diskettes, optical disks, compact disc read-only
memories (CD-ROMs), and magneto-optical disks, ROMs, random access
memories (RAMs), erasable programmable read-only memories (EPROMs),
electrically erasable programmable read-only memories (EEPROMs),
magnetic or optical cards, flash memory, or other type of
media/machine-readable medium suitable for storing electronic
instructions. Moreover, embodiments of the present invention may
also be downloaded as a computer program product, wherein the
program may be transferred from a remote computer to a requesting
computer by way of data signals embodied in a carrier wave or other
propagation medium via a communication link (e.g., a modem or
network connection).
Terminology
[0037] Brief definitions of terms used throughout this application
are given below.
[0038] The term "client" generally refers to an application,
program, process or device in a client/server relationship that
requests information or services from another program, process or
device (a server) on a network. Importantly, the terms "client" and
"server" are relative since an application may be a client to one
application but a server to another. The term "client" also
encompasses software that makes the connection between a requesting
application, program, process or device to a server possible, such
as an FTP client.
[0039] The terms "connected" or "coupled" and related terms are
used in an operational sense and are not necessarily limited to a
direct connection or coupling. Thus, for example, two devices may
be coupled directly, or via one or more intermediary media or
devices. As another example, devices may be coupled in such a way
that information can be passed therebetween, while not sharing any
physical connection with one another. Based on the disclosure
provided herein, one of ordinary skill in the art will appreciate a
variety of ways in which connection or coupling exists in
accordance with the aforementioned definition.
[0040] The phrase "embedded image" generally refers to an image
that is displayed or rendered inline within a styled or formatted
electronic message, such as a HyperText Markup Language
(HTML)-based or formatted email message. As used herein, the phrase
"embedded image" is intended to encompass scenarios in which the
image data is sent with the email message and linked images in
which a reference to the image is sent with the email message and
the image data is retrieved once the recipient views the email
message. The phrase "embedded image" also includes an image that is
embedded in other file formats of attachments, such as Portable
Document Format (PDF) attachments, in which the image data is
displayed to the email recipient when the attachment is viewed.
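The first two cases in this definition, image data sent with the message versus an image merely referenced by it, can be told apart with a few lines of Python. This is a sketch only: the category labels and function names are hypothetical, the treatment of `cid:` and `data:` references as "sent with the message" is an assumption of the example, and images nested inside PDF or other attachments are not handled.

```python
from email.message import EmailMessage
from typing import List, Tuple


def classify_image_reference(src: str) -> str:
    """Classify the src of an <img> tag by where the image data lives."""
    lowered = src.strip().lower()
    if lowered.startswith(("cid:", "data:")):
        return "sent-with-message"  # data travels inside the email itself
    if lowered.startswith(("http://", "https://")):
        return "linked"             # data is fetched when the message is viewed
    return "unknown"


def image_parts(msg: EmailMessage) -> List[Tuple[str, str]]:
    """List (content type, disposition) for every image part of a message."""
    found = []
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            disposition = part.get_content_disposition() or "none"
            found.append((part.get_content_type(), disposition))
    return found
```

A scanner would typically run both checks: `image_parts` to collect image data carried in the MIME structure, and `classify_image_reference` on each image reference found in the HTML body.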
[0041] The phrase "image spam" generally refers to spam in which
the "call to action" of the message is partially or completely
contained within an embedded file attachment, such as a .gif,
.jpeg or .pdf file, rather than in the body of the email message.
Such images are typically automatically displayed to the email
recipients and typically some form of obfuscation is implemented in
an attempt to hide the true content of the image from spam
filters.
[0042] The phrases "in one embodiment," "according to one
embodiment," and the like generally mean the particular feature,
structure, or characteristic following the phrase is included in at
least one embodiment of the present invention, and may be included
in more than one embodiment of the present invention. Importantly,
such phrases do not necessarily refer to the same embodiment.
[0043] The phrase "network gateway" generally refers to an
internetworking system, a system that joins two networks together.
A "network gateway" can be implemented completely in software,
completely in hardware, or as a combination of the two. Depending
on the particular implementation, network gateways can operate at
any level of the OSI model from application protocols to low-level
signaling.
[0044] If the specification states a component or feature "may",
"can", "could", or "might" be included or have a characteristic,
that particular component or feature is not required to be included
or have the characteristic.
[0045] The term "proxy" generally refers to an intermediary device,
program or agent, which acts as both a server and a client for the
purpose of making or forwarding requests on behalf of other
clients.
[0046] The term "responsive" includes completely or partially
responsive.
[0047] The term "server" generally refers to an application,
program, process or device in a client/server relationship that
responds to requests for information or services by another
program, process or device (a client) on a network. The term
"server" also encompasses software that makes the act of serving
information or providing services possible.
[0048] The term "spam" generally refers to electronic junk mail,
typically bulk electronic mail (email) messages in the form of
commercial advertising. Often, email message content may be
irrelevant in determining whether an email message is spam, though
most spam is commercial in nature. There is spam that fraudulently
promotes penny stocks in the classic pump-and-dump scheme. There is
spam that promotes religious beliefs. From the recipient's
perspective, spam typically represents unsolicited, unwanted,
irrelevant, and/or inappropriate email messages, often unsolicited
commercial email (UCE). In addition to UCE, spam includes, but is
not limited to, email messages regarding or associated with
fraudulent business schemes, chain letters, and/or offensive sexual
or political messages.
[0049] According to one embodiment "spam" comprises Unsolicited
Bulk Email (UBE). Unsolicited generally means the recipient of the
email message has not granted verifiable permission for the email
message to be sent and the sender has no discernible relationship
with all or some of the recipients. Bulk generally refers to the
fact that the email message is sent as part of a larger collection
of email messages, all having substantively identical content. In
embodiments in which spam is equated with UBE, an email message is
considered spam if it is both unsolicited and bulk. Unsolicited
email can be normal email, such as first contact enquiries, job
enquiries, and sales enquiries. Bulk email can be normal email,
such as subscriber newsletters, customer communications, discussion
lists, etc. Consequently, in such embodiments, an email message
would be considered spam if (i) the recipient's personal identity and
context are irrelevant because the email message is equally
applicable to many other potential recipients; and (ii) the
recipient has not verifiably granted deliberate, explicit, and
still-revocable permission for the email message to be sent.
[0050] The phrase "transparent proxy" generally refers to a
specialized form of proxy that only implements a subset of a given
protocol and allows unknown or uninteresting protocol commands to
pass unaltered. Advantageously, as compared to a full proxy in
which use by a client typically requires editing of the client's
configuration file(s) to point to the proxy, it is not necessary to
perform such extra configuration in order to use a transparent
proxy.
[0051] FIG. 2 is a block diagram conceptually illustrating a
simplified network architecture in which embodiments of the present
invention may be employed. In this simple example, spammers 205 are
shown coupled to the public Internet 200 to which local area
network (LAN) 240 is also communicatively coupled through a
firewall 210, a network gateway 215 and an email security system
220, which incorporates within an anti-spam module 225 various
novel image spam detection methodologies that are described further
below.
[0052] In the present example, the email security system 220 is
logically interposed between spammers and the email server 230 to
perform spam filtering on incoming email messages from the public
Internet 200 prior to receipt and storage on the email server 230
from which and through which client workstations 260 residing on
the LAN 240 may retrieve and send email correspondence.
[0053] In the exemplary network architecture of FIG. 2, the
firewall 210 may represent a hardware or software solution
configured to protect the resources of LAN 240 from outsiders and to
control what outside resources local users have access to by
enforcing security policies. The firewall 210 may filter or
disallow unauthorized or potentially dangerous material or content
from entering LAN 240 and may otherwise limit access between the
LAN 240 and the public Internet 200 in accordance with local
security policy established and maintained by an administrator of
LAN 240.
[0054] In one embodiment, the network gateway 215 acts as an
interface between the LAN 240 and the public Internet 200. The
network gateway 215 may, for example, translate between dissimilar
protocols used internally and externally to the LAN 240. Depending
upon the distribution of functionality, the network gateway 215 or
the firewall 210 may perform network address translation (NAT) to
hide private Internet Protocol (IP) addresses used within LAN 240
and enable multiple client workstations, such as client
workstations 260, to access the public Internet 200 using a single
public IP address.
[0055] According to one embodiment, the email security system 220
performs email filtering to detect, tag, block and/or remove
unwanted spam and malicious attachments. In one embodiment, an
anti-spam module 225 of the email security system 220 performs one
or more spam filtering techniques, including but not limited to,
sender IP reputation analysis and content analysis, such as
attachment/content filtering, heuristic rules, deep email header
inspection, spam URI real-time blacklists (SURBL), banned word
filtering, spam checksum blacklist, forged IP checking, greylist
checking, Bayesian classification, Bayesian statistical filters,
signature reputation, and/or filtering methods such as
FortiGuard-Antispam, access policy filtering, global and user
black/white list filtering, spam Real-time Blackhole List (RBL),
Domain Name Service (DNS) Block List (DNSBL) and per user Bayesian
filtering so that individual users can set their own profiles.
[0056] The anti-spam module 225 also performs various novel image
spam detection methodologies or spam image analysis scanning based
on sender's intention analysis in an attempt to detect, tag, block
and/or remove spam presented in the form of one or more images.
Examples of the image analysis techniques and the sender's
intention analysis methodologies are described in more detail
below. Existing email security platforms that exemplify various
operational characteristics of the email security system 220
according to an embodiment of the present invention include the
FortiMail™ family of high-performance, multi-layered email
security platforms, including the FortiMail-100 platform, the
FortiMail-400 platform, the FortiMail-2000 platform and the
FortiMail-4000A platform all of which are available from Fortinet,
Inc. of Sunnyvale, Calif.
[0057] FIG. 3 is a block diagram conceptually illustrating
interaction among various functional units of an email security
system 320 with a client workstation 360 and an email server 330 in
accordance with an embodiment of the present invention.
[0058] While in this simplified example, only a single client
workstation, i.e., client workstation 360, and a single email
server, i.e., email server 330, are shown interacting with the
email security system 320, it should be understood that many local
and/or remote client workstations, servers and email servers may
interact directly or indirectly with the email security system 320
and directly or indirectly with each other.
[0059] According to the present example, the email security system
320, which may be implemented as one or more virtual or physical
devices, includes a content processor 321, logically interposed
between sources of inbound email 380 and an enterprise's email
server 330. In the context of the present example, the content
processor 321 performs scanning of inbound email messages 380
originating from sources on the public Internet 200 before allowing
such inbound email messages 380 to be stored on the email server
330. In one embodiment, an anti-spam module 325 of the content
processor 321 may perform spam filtering and an anti-virus (AV)
module 326 implementing AV and other filters potentially performs
other traditional anti-virus detection and content filtering on
data associated with the email messages.
[0060] In the current example, anti-spam module 325 may apply
various image analysis methodologies described further below to
ascertain email senders' intentions and therefore the likelihood
that attached and/or embedded images represent image spam.
According to the current example, the anti-spam module 325,
responsive to being presented with an inbound email message,
determines whether the email message contains embedded or attached
images and if so, as described further below with reference to FIG.
5 and FIG. 6, determines if such images represent image spam.
[0061] In one embodiment, the content processor 321 is an
integrated FortiASIC.TM. Content Processor chip developed by
Fortinet, Inc. of Sunnyvale, Calif. In alternative embodiments, the
content processor 321 may be a dedicated coprocessor or software to
help offload content filtering tasks from a host processor (not
shown).
[0062] In alternative embodiments, the anti-spam module 325 may be
associated with or otherwise responsive to a mail transfer protocol
proxy (not shown). The mail transfer protocol proxy may be
implemented as a transparent proxy that implements handlers for
Simple Mail Transfer Protocol (SMTP) or Extended SMTP (ESMTP)
commands/replies relevant to the performance of content filtering
activities and passes through those not relevant to the performance
of content filtering activities. In one embodiment, the mail
transfer protocol proxy may subject each of incoming email,
outgoing email and internal email to scanning by the anti-spam
module 325 and/or the content processor 321.
[0063] Notably, filtering of email need not be performed prior to
storage of email messages on the email server 330. In alternative
embodiments, the content processor 321, the mail transfer protocol
proxy (not shown) or some other functional unit logically
interposed between a user agent or email client 361 executing on
the client workstation 360 and the email server 330 may process
email messages at the time they are requested to be transferred
from the user agent/email client 361 to the email server 330 or
vice versa. Meanwhile, neither the email messages nor their
attachments need be stored locally on the email security system 320
to support the filtering functionality described herein. For
example, instead of the anti-spam processing running responsive to
a mail transfer protocol proxy, the email security system 320 may
open a direct connection between the email client 361 and the email
server 330, and filter email in real-time as it passes through.
[0064] While in the context of the present example, the content
processor 321, the anti-spam module 325 and the mail transfer
protocol proxy (not shown) have been described as residing within
or as part of the same network device, in alternative embodiments
one or more of these functional units may be located remotely from
the other functional units. According to one embodiment, the
hardware components and/or software modules that implement the
content processor 321, the anti-spam module 325 and the mail
transfer protocol proxy are generally provided on or distributed
among one or more Internet and/or LAN accessible networked devices,
such as one or more network gateways, firewalls, network security
appliances, email security systems, switches, bridges, routers,
data storage devices, computer systems and the like.
[0065] In one embodiment, the functionality of one or more of the
above-referenced functional units may be merged in various
combinations. For example, the content processor 321 may be
incorporated within the mail transfer protocol proxy or the
anti-spam module 325 may be incorporated within the email server
330 or the email client 361. Moreover, the functional units can be
communicatively coupled using any suitable communication method
(e.g., message passing, parameter passing, and/or signals through
one or more communication paths etc.). Additionally, the functional
units can be physically connected according to any suitable
interconnection architecture (e.g., fully connected, hypercube,
etc.).
[0066] According to embodiments of the invention, the functional
units can be any suitable type of logic (e.g., digital logic) for
executing the operations described herein. Any of the functional
units used in conjunction with embodiments of the invention can
include machine-readable media including instructions for
performing operations described herein. Machine-readable media
include any mechanism that provides (i.e., stores and/or transmits)
information in a form readable by a machine (e.g., a computer). For
example, a machine-readable medium includes read only memory (ROM),
random access memory (RAM), magnetic disk storage media, optical
storage media, flash memory devices, electrical, optical,
acoustical or other forms of propagated signals (e.g., carrier
waves, infrared signals, digital signals, etc.), etc.
[0067] FIG. 4 is an example of a computer system with which
embodiments of the present invention may be utilized. The computer
system 400 may represent or form a part of an email security
system, network gateway, firewall, network appliance, switch,
bridge, router, data storage device, server, client workstation
and/or other network device implementing one or more of the content
processor 321 or other functional units depicted in FIG. 3.
According to FIG. 4, the computer system 400 includes one or more
processors 405, one or more communication ports 410, main memory
415, read only memory 420, mass storage 425, a bus 430, and
removable storage media 440.
[0068] The processor(s) 405 may be Intel.RTM. Itanium.RTM. or
Itanium 2.RTM. processor(s), AMD.RTM. Opteron.RTM. or Athlon
MP.RTM. processor(s) or other processors known in the art.
[0069] Communication port(s) 410 represent physical and/or logical
ports. For example, communication port(s) 410 may be any of an RS-232
port for use with a modem-based dialup connection, a 10/100
Ethernet port, or a Gigabit port using copper or fiber.
Communication port(s) 410 may be chosen depending on a network such
as a Local Area Network (LAN), Wide Area Network (WAN), or any network
to which the computer system 400 connects.
[0070] Communication port(s) 410 may also be the name of the end of
a logical connection (e.g., a Transmission Control Protocol (TCP)
port or a User Datagram Protocol (UDP) port). For example,
communication ports may be one of the Well Known Ports, such as TCP
port 25 or UDP port 25 (used for Simple Mail Transfer), assigned by
the Internet Assigned Numbers Authority (IANA) for specific
uses.
[0071] Main memory 415 may be Random Access Memory (RAM), or any
other dynamic storage device(s) commonly known in the art.
[0072] Read only memory 420 may be any static storage device(s)
such as Programmable Read Only Memory (PROM) chips for storing
static information such as instructions for processors 405.
[0073] Mass storage 425 may be used to store information and
instructions. For example, hard disks such as the Adaptec.RTM.
family of SCSI drives, an optical disc, an array of disks such as
RAID, such as the Adaptec family of RAID drives, or any other mass
storage devices may be used.
[0074] Bus 430 communicatively couples processor(s) 405 with the
other memory, storage and communication blocks. Bus 430 may be a
PCI/PCI-X or SCSI based system bus depending on the storage devices
used.
[0075] Optional removable storage media 440 may be any kind of
external hard-drives, floppy drives, IOMEGA.RTM. Zip Drives,
Compact Disc-Read Only Memory (CD-ROM), Compact Disc-Re-Writable
(CD-RW), Digital Video Disk (DVD)-Read Only Memory (DVD-ROM),
Re-Writable DVD and the like.
[0076] FIG. 5 is a high-level flow diagram illustrating anti-spam
processing of images using sender's intention analysis in
accordance with an embodiment of the present invention. Depending
upon the particular implementation, the various process and
decision blocks described below may be performed by hardware
components, embodied in machine-executable instructions, which may
be used to cause a general-purpose or special-purpose processor
programmed with the instructions to perform the steps, or the steps
may be performed by a combination of hardware, software, firmware
and/or involvement of human participation/interaction.
[0077] At block 510, an email message is analyzed to determine if
it contains images. For purposes of the present example, the
direction of flow of the email message is not pertinent. As
indicated above, the email message may be inbound, outbound or an
intra-enterprise email message. In various embodiments, however,
the anti-spam processing may be enabled in one direction only or
various detection thresholds could be configured differently for
different flows.
[0078] In any event, in one embodiment, the headers, body and
attachments, if any, of the email message at issue are parsed and
scanned to identify whether the email message is deemed to contain
one or more embedded images. If so, processing continues with block
520. Otherwise, no further image spam analysis is required and
processing branches to the end.
[0079] At block 520, the email message at issue has been determined
to contain one or more embedded images. In the current example, the
senders' intention analysis anti-spam processing, therefore,
proceeds to calculate the location(s) of the embedded image(s).
Images may be embedded in a HyperText Markup Language (HTML) part
of an HTML formatted email message, within a MIME document or
attached separately. In one embodiment, by parsing the HTML, plain
text and/or other Multipurpose Internet Mail Extension (MIME)
parts, the displaying line just prior to the images can be
identified and thus the approximate displaying location of any
embedded images can be calculated.
[0080] At block 530, the one or more images are analyzed for
indications of one or more abnormal factors. Typically, the
abnormal factors are manifestations of a spammer's attempt to
obscure text embedded within the one or more images by injecting a
variety of noise. In one embodiment, abnormal factors include the
presence of one or more of the following characteristics: (i)
illegal base64 encoding; (ii) multiple images within one HTML part;
(iii) one or more low entropy frames in an animated Graphics
Interchange Format (GIF); (iv) illegal chunk data within a Portable
Network Graphics (PNG) file; (v) quantities of unsmoothed curves;
and (vi) quantities of unsmoothed color blocks.
[0081] In one embodiment, illegal base64 encoding can be detected
by, among other things, observing illegal characters, such as `!`
in the encoded content, such as the HTML formatted message or any
part of the MIME document.
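The character check described above can be sketched as follows. This is a minimal illustration only, assuming the standard base64 alphabet of RFC 4648 and assuming that whitespace is ignored because MIME bodies are line-wrapped; the function names are hypothetical and are not taken from the described embodiments.

```python
import string

# Characters permitted by the standard base64 alphabet, plus the padding character.
BASE64_ALPHABET = set(string.ascii_letters + string.digits + "+/=")


def find_illegal_base64_chars(encoded: str) -> list[str]:
    """Return the distinct characters that fall outside the base64 alphabet.

    Whitespace is skipped (an assumption of this sketch) because encoded
    MIME parts are normally wrapped across lines.
    """
    illegal = {ch for ch in encoded if not ch.isspace() and ch not in BASE64_ALPHABET}
    return sorted(illegal)


def has_illegal_base64(encoded: str) -> bool:
    """True if at least one illegal character, such as '!', is present."""
    return bool(find_illegal_base64_chars(encoded))
```

A caller would pass the encoded body of each MIME part and treat a true result as one abnormal factor among several.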
[0082] In one embodiment, the inclusion of multiple images within
one HTML part can be detected by parsing the HTML formatted email
message and observing more than one image within an HTML part. In
the exemplary HTML code excerpt below, the existence of three
images within a single table row (<tr> . . . </tr>)
reveals an intention on the part of the creator of the email
message to display a contiguous image to the email recipient based
on the three separate embedded images.
TABLE-US-00001
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
<title>abovementioned bertie</title>
<div align="center">
[...]
<tr>
<td width="33%">
<a href="http://www.lklljjp.biz/vpr6160/">
<img name="apprehension" src="cid:part2.00020108.07020409@72.ca" border="0" height="179" width="184"></a></td>
<td width="33%">
<a href="http://www.lklljjp.biz/vpr6160/">
<img name="gradate" src="cid:part3.00060308.03010709@72.ca" border="0" height="179" width="184"></a></td>
<td width="34%">
<a href="http://www.lklljjp.biz/vpr6160/">
<img name="maltreat" src="cid:part4.02080304.00040002@72.ca" border="0" height="179" width="184"></a></td>
</tr>
[...]
</body>
</html>
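One possible way to observe several images within a single table row is sketched below using the Python standard library HTML parser. The sketch assumes that a table row is the unit of interest, as in the excerpt, and ignores nested tables; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RowImageCounter(HTMLParser):
    """Counts <img> tags that appear between each <tr> and its </tr>."""

    def __init__(self):
        super().__init__()
        self.in_row = False
        self.current = 0
        self.counts = []  # one entry per completed table row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.in_row = True
            self.current = 0
        elif tag == "img" and self.in_row:
            self.current += 1

    def handle_endtag(self, tag):
        if tag == "tr" and self.in_row:
            self.counts.append(self.current)
            self.in_row = False


def max_images_in_one_row(html: str) -> int:
    """Return the largest number of images found in any single table row."""
    parser = RowImageCounter()
    parser.feed(html)
    parser.close()
    return max(parser.counts, default=0)
```

A result greater than one would suggest that separate embedded images are intended to be displayed side by side as one contiguous image.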
[0083] The existence of one or more low entropy frames of an
animated GIF may be determined on an absolute and/or relative
basis. For example, an animated GIF frame may be determined to be
low entropy with reference to observed entropy values of normal GIF
files, which vary from approximately 0.1 to 5.0. Therefore, in one
embodiment, the existence of one or more low entropy frames is
confirmed based on a comparison of the entropy values calculated
for the animated GIF at issue to 0.1. If the entropy value
calculated for any frame of the animated GIF at issue is less than
0.1, then this abnormal factor is deemed to exist. In other
embodiments, one or more frames of the animated GIF file at issue
may simply be "low" entropy relative to the other, higher entropy
frames. For example, a variation of more than 4.9 between the
highest entropy frame and the lowest entropy frame within the
animated GIF file at issue may indicate that the lowest entropy
frame is relatively lower than the others.
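The absolute and relative tests described above can be combined in a few lines. The following sketch assumes the per-frame entropy values have already been computed elsewhere; the function name and parameter names are hypothetical, and the default values of 0.1 and 4.9 are simply the figures given in the text.

```python
def has_low_entropy_frame(frame_entropies, absolute_floor=0.1, max_spread=4.9):
    """Return True if any frame looks abnormally low in entropy.

    Absolute test: some frame has entropy below absolute_floor.
    Relative test: the gap between the highest and lowest entropy
    frames exceeds max_spread.
    """
    if not frame_entropies:
        return False
    lowest = min(frame_entropies)
    highest = max(frame_entropies)
    below_floor = lowest < absolute_floor
    wide_spread = (highest - lowest) > max_spread
    return below_floor or wide_spread
```

For example, `has_low_entropy_frame([2.3, 0.05, 3.1])` reports the factor as present because one frame falls below 0.1.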
[0084] Illegal chunk data within a Portable Network Graphics (PNG)
file may be observed by evaluating information contained within
and/or about the chunks. For example, the length of the chunk and
cyclic redundancy checksum (CRC) may be verified against the actual
data length and recomputed CRC.
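A minimal sketch of such a chunk check follows. It relies only on the PNG container layout (an 8-byte signature followed by chunks, each consisting of a 4-byte big-endian length, a 4-byte type, the data and a 4-byte CRC computed over the type and data). The function name and the wording of the reported problems are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def find_illegal_chunks(data: bytes) -> list[str]:
    """Walk the chunks of a PNG byte string and report length or CRC problems."""
    if not data.startswith(PNG_SIGNATURE):
        return ["missing PNG signature"]
    problems = []
    offset = len(PNG_SIGNATURE)
    while offset < len(data):
        if offset + 8 > len(data):
            problems.append("truncated chunk header")
            break
        (length,) = struct.unpack(">I", data[offset:offset + 4])
        chunk_type = data[offset + 4:offset + 8]
        body_end = offset + 8 + length
        if body_end + 4 > len(data):
            # Declared length does not match the data actually present.
            problems.append(f"{chunk_type!r}: declared length {length} exceeds available data")
            break
        body = data[offset + 8:body_end]
        (stored_crc,) = struct.unpack(">I", data[body_end:body_end + 4])
        # The CRC covers the chunk type and chunk data, not the length field.
        if zlib.crc32(chunk_type + body) & 0xFFFFFFFF != stored_crc:
            problems.append(f"{chunk_type!r}: CRC mismatch")
        offset = body_end + 4
        if chunk_type == b"IEND":
            break
    return problems
```

An empty result means every chunk examined had a consistent length and checksum; any entry in the list could be treated as the abnormal factor being present.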
[0085] Quantities of unsmoothed curves may be detected by
evaluating the number of pixels for which the difference between
their color and the average color of the surrounding pixels is
greater than a threshold.
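The pixel comparison described above might be sketched as follows. For simplicity the sketch assumes a single-channel (grayscale) image stored as a list of rows and takes the surrounding pixels to be the up-to-eight immediate neighbors; both choices, and the function name, are assumptions made for illustration.

```python
def count_unsmoothed_pixels(gray, threshold):
    """Count pixels that differ from the mean of their neighbours by more than threshold."""
    height = len(gray)
    width = len(gray[0]) if height else 0
    count = 0
    for y in range(height):
        for x in range(width):
            # Collect the immediate neighbours, clipped at the image border.
            neighbours = [
                gray[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if not neighbours:
                continue  # a 1x1 image has nothing to compare against
            mean = sum(neighbours) / len(neighbours)
            if abs(gray[y][x] - mean) > threshold:
                count += 1
    return count
```

A large count relative to the image size would indicate injected noise rather than naturally smooth edges.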
[0086] Quantities of unsmoothed color blocks may be detected by
evaluating the number of color blocks for which the difference
between their color and the color of the surrounding color blocks
is greater than a threshold. Color blocks contain pixels with the
same or similar colors.
[0087] In one embodiment, rather than simply conveying a binary
result (e.g., a zero indicating the absence of the abnormal factor
at issue and a one indicating the presence of the abnormal factor
at issue), a value within a range (e.g., 0 to 10) may be returned
indicating the degree to which the abnormal factor is
expressed.
[0088] At block 540, the quantity of text embedded in the images is
measured. In one embodiment, images are converted to a binary
representation based on a thresholding technique described in
further detail below. In general, thresholding is a simple method
of image segmentation. Individual pixels in a grayscale image are
marked as "information" pixels if their value is greater than some
threshold value, T, (assuming the information content is brighter
than the background) and as "background" pixels otherwise.
Typically, an information pixel is given a value of "1" while a
background pixel is given a value of "0." Then, a text string
measurement algorithm is applied to the binary representation of
the portion of the image deemed to contain the information
content.
[0089] Notably, in one embodiment, rather than considering the
quantity of embedded text alone, both the quantity of text and the
relative position of such text within an email viewer's preview
window, for example, or within the image itself may be taken into
consideration. For example, a high spam score could be assigned to
a very large image (with a correspondingly smaller percentage of
text) in which the text is positioned to occupy the whole preview
window.
[0090] At block 550, the email message is classified as spam or
clean based on the observed characteristics of the embedded
image(s), such as image location information, the
existence/non-existence of various abnormal factors and the
quantity of text determined to exist within the embedded image(s).
In one embodiment, the spam/clean classification may be based upon
a weighted average of the various observed characteristics.
[0091] In one embodiment, each observed characteristic may
contribute to the score. Once the score reaches a threshold, the
email message may be classified as spam and the further
characteristics may not require analysis or observation. The email
message is classified as clean if the score is less than the
threshold after all the characteristics have been evaluated. In one
embodiment, the characteristics may be considered in the following
order:
[0092] Image location information
[0093] Presence of continuous images
[0094] Presence of illegal base64 encoding
[0095] Presence of lower entropy frames in an animated GIF
[0096] Presence of illegal chunk data of a PNG encoded image
[0097] Quantities and/or location of text in the images
[0098] Quantities of unsmoothed curves in the images
[0099] Quantities of unsmoothed color blocks in the images
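The ordered, early-exit evaluation described above can be sketched as a simple accumulation loop. The sketch assumes each characteristic is represented by a callable returning its score contribution, so that later and potentially more expensive checks are never run once the threshold is reached; the function name, the check names and the numeric contributions are hypothetical.

```python
def classify_message(checks, threshold):
    """Evaluate checks in order and stop as soon as the score reaches threshold.

    checks is an ordered sequence of (name, callable) pairs, where each
    callable returns that characteristic's contribution to the score.
    Returns the verdict, the final score and the names actually evaluated.
    """
    score = 0.0
    evaluated = []
    for name, check in checks:
        score += check()
        evaluated.append(name)
        if score >= threshold:
            return "spam", score, evaluated
    return "clean", score, evaluated
```

With contributions of 2.0, 4.0 and 5.0 from the first three characteristics and a threshold of 10.0, the message is classified as spam after the third check and the remaining characteristics are not analyzed.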
[0100] In one embodiment, similar to that described above with
reference to abnormal factors, rather than making the ultimate
spam/clean decision (because the ultimate decision could be made by
another component), a "spaminess" score may be generated. For
example, rather than simply conveying a binary result (e.g., spam
vs. clean), a value within a range (e.g., 0 to 10) may be returned
indicating the degree to which the email message appeared to
contain indications of being spam or the likelihood the email
message is spam.
[0101] If upon completion of the anti-spam processing described
above there is not sufficient data to determine the email message
is spam (e.g., there is insufficient data to determine the sender's
intention), then according to one embodiment, more CPU intensive
processes, such as OCR, may be invoked. Advantageously, in this
manner, most image spam emails can be detected in real-time without
compromising performance and more CPU intensive processes are only
performed if and when required.
[0102] FIG. 6 is a flow diagram illustrating quantity of text
measurement processing in accordance with an embodiment of the
present invention. The steps described below represent the
processing of block 540 of FIG. 5 according to one embodiment of
the present invention.
[0103] As mentioned with reference to FIG. 5, depending upon the
particular implementation, the various process and decision blocks
described below may be performed by hardware components, embodied
in machine-executable instructions, which may be used to cause a
general-purpose or special-purpose processor programmed with the
instructions to perform the steps, or the steps may be performed by
a combination of hardware, software, firmware and/or involvement of
human participation/interaction.
[0104] At block 610, if the image at issue is color, then it is
converted to grayscale to form a grayscale representation,
G.sub.i,j. According to one embodiment, color pixels of the image
at issue are converted to grayscale by computing an average or
weighted average of the red, green and blue color components. While
various conversions may be used, examples of suitable conversion
equations include the following:
$$G_{i,j} = (0.299 \cdot r_{i,j} + 0.587 \cdot g_{i,j} + 0.114 \cdot b_{i,j})/3, \quad 0 \le i < x_{\max},\ 0 \le j < y_{\max}$$ EQ #1
$$G_{i,j} = (0.3 \cdot r_{i,j} + 0.6 \cdot g_{i,j} + 0.1 \cdot b_{i,j})/3, \quad 0 \le i < x_{\max},\ 0 \le j < y_{\max}$$ EQ #2
$$G_{i,j} = (r_{i,j} + g_{i,j} + b_{i,j})/3, \quad 0 \le i < x_{\max},\ 0 \le j < y_{\max}$$ EQ #3
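A minimal sketch of the color-to-grayscale conversion follows. By default it computes the plain average of EQ #3; the weights and divisor are exposed as parameters so that the weighted forms can be expressed exactly as written. The function name and the representation of the image as a list of rows of (r, g, b) tuples are assumptions made for illustration.

```python
def to_grayscale(rgb, weights=(1.0, 1.0, 1.0), divisor=3.0):
    """Convert a 2-D list of (r, g, b) tuples to a 2-D list of gray values.

    With the defaults this is the simple average of the three color
    components; other weights and divisors may be supplied by the caller.
    """
    wr, wg, wb = weights
    return [
        [(wr * r + wg * g + wb * b) / divisor for (r, g, b) in row]
        for row in rgb
    ]
```

For example, a single pixel (30, 60, 90) maps to a gray value of 60.0 under the default average.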
[0105] At block 620, entropy and threshold values are determined
for the grayscale image, G.sub.i,j. Entropy is a statistical
measure of randomness that can be used to characterize the texture
of the input image. In connection with calculating the entropy of
the grayscale image, an intermediate data structure is built
containing an intensity histogram, C.sup.g. In the context of an
8-bit grayscale image, each pixel may have a value of 0 to 255.
Thus, the intensity histogram includes 256 bins, each of which
maintains a count of the number of pixels in the grayscale image
having that value. An example of an intensity histogram is shown in
FIG. 9, which represents an intensity histogram for a grayscale
representation of FIG. 8. In one embodiment entropy is calculated
according to:
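$$E = -\sum_{g=0}^{255} \left( \frac{C^{g}}{\sum C^{g}} \cdot \log\left( \frac{C^{g}}{\sum C^{g}} \right) \right)$$ EQ #4
Subject to:
$$C^{g} = \sum_{i=0}^{x_{\max}} \sum_{j=0}^{y_{\max}} c_{i,j}^{g}$$ EQ #5
$$c_{i,j}^{g} = \begin{cases} 1 & G_{i,j} = g \\ 0 & \text{otherwise} \end{cases}, \quad 0 \le i < x_{\max},\ 0 \le j < y_{\max}$$ EQ #6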
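The intensity histogram and the entropy computed from it can be sketched as follows. The sketch assumes an 8-bit grayscale image stored as a list of rows, treats the entropy as the Shannon entropy of the normalized histogram, and assumes a base-2 logarithm because the base is not stated; the function names are hypothetical.

```python
import math


def intensity_histogram(gray):
    """Return 256 bins, each counting the pixels having that gray level."""
    counts = [0] * 256
    for row in gray:
        for value in row:
            counts[int(value)] += 1
    return counts


def entropy(counts, log=math.log2):
    """Entropy of a histogram: minus the sum of p * log(p) over non-empty bins."""
    total = sum(counts)
    if total == 0:
        return 0.0
    result = 0.0
    for c in counts:
        if c:  # empty bins contribute nothing
            p = c / total
            result -= p * log(p)
    return result
```

An image whose pixels are split evenly between two gray levels yields an entropy of 1.0 with the base-2 logarithm, while a uniform image yields 0.0.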
[0106] According to one embodiment, a threshold value within the
intensity histogram is selected simply by choosing the mean or
median value. The rationale for this simple threshold selection is
that if the information pixels are brighter than the background,
they should also be brighter than the average. However, to
compensate for the existence of noise and variability in the
background, a more sophisticated approach is to create a histogram
of the image pixel intensities and then use the valley point as the
threshold, T. This histogram approach assumes that there is some
average value for the background and information pixels, but that
the actual pixel values have some variation around these average
values. In one embodiment, the threshold, T, is calculated by:
$$T = \operatorname{Max}(\delta_i), \quad 0 \le i \le 255$$ EQ #7
[0107] Subject to:
$$\delta_i = w_{i1} w_{i2} (M_{i1} - M_{i2})^2, \quad 0 \le i \le 255$$ EQ #8
$$w_{i1} = \sum_{g=0}^{i} C^{g}$$ EQ #9
$$w_{i2} = \sum_{g=i+1}^{255} C^{g}$$ EQ #10
$$M_{i1} = \frac{\sum_{g=0}^{i} g \cdot C^{g}}{\sum_{g=0}^{i} C^{g}}$$ EQ #11
$$M_{i2} = \frac{\sum_{g=i+1}^{255} g \cdot C^{g}}{\sum_{g=i+1}^{255} C^{g}}$$ EQ #12
[0108] According to the above example, the gray levels are divided
into two groups by i, where w.sub.i1 and w.sub.i2 are the total
number of pixels in each group and M.sub.i1 and M.sub.i2 are
the average gray level of each group.
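The threshold selection described above can be sketched as a single pass over the histogram. The sketch interprets the threshold as the gray level i at which the product of the two group sizes and the squared difference of the two group averages is largest, which is an assumption about how the maximum is to be read. Ties are resolved in favor of the lowest such level, and the function name is hypothetical.

```python
def select_threshold(counts):
    """Return the gray level i that maximises w1 * w2 * (M1 - M2) ** 2.

    counts is the 256-bin intensity histogram. Group 1 holds levels 0..i
    and group 2 holds levels i+1..255.
    """
    total = sum(counts)
    weighted_total = sum(g * c for g, c in enumerate(counts))
    best_level, best_delta = 0, -1.0
    w1 = 0    # number of pixels in group 1 so far
    sum1 = 0  # sum of gray level * count in group 1 so far
    for i, c in enumerate(counts):
        w1 += c
        sum1 += i * c
        w2 = total - w1
        if w1 == 0 or w2 == 0:
            continue  # one group is empty, so the averages are undefined
        m1 = sum1 / w1
        m2 = (weighted_total - sum1) / w2
        delta = w1 * w2 * (m1 - m2) ** 2
        if delta > best_delta:
            best_level, best_delta = i, delta
    return best_level
```

For a histogram with ten pixels at level 50 and ten at level 200, every split between the two populations scores equally and the sketch returns the first of them, level 50.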
[0109] Notably, there are many existing methods of performing
thresholding. Consequently, any other current or future method of
performing thresholding may be used depending upon the needs of a
particular implementation.
[0110] At block 630, thresholding is performed to form a binary
representation, B.sub.i,j, of the grayscale image based on the
threshold value selected in block 620. In one embodiment,
thresholding is performed in accordance with the following
equations:
$$B_{i,j} = \begin{cases} 0 & G_{i,j} < T \\ 1 & \text{otherwise} \end{cases}, \quad 0 \le i < x_{\max},\ 0 \le j < y_{\max}$$ EQ #13
$$B'_{i,j} = \begin{cases} B_{i,j} & \max(C^{k}) < \partial,\ \max(C^{k}) < T \\ !B_{i,j} & \text{otherwise} \end{cases}, \quad 0 \le k \le 255$$ EQ #14
[0111] where, .differential. is an adjustable parameter.
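The basic binarization step of EQ #13 can be sketched in one line per function. The second helper only illustrates the pixel inversion that EQ #14 applies. The condition under which the inversion is selected is left to the caller, and the function names are hypothetical.

```python
def binarize(gray, threshold):
    """Map each gray value to 0 if it is below threshold and to 1 otherwise."""
    return [[0 if value < threshold else 1 for value in row] for row in gray]


def invert(binary):
    """Swap the two pixel values of a binary image."""
    return [[1 - bit for bit in row] for row in binary]
```

With a threshold of 109, the gray values 10, 109 and 200 map to 0, 1 and 1 respectively, since a value equal to the threshold is not less than it.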
[0112] At block 640, the binary image is logically divided into
M.times.N virtual blocks.
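The logical division into virtual blocks might be sketched as follows. The sketch assumes that the first block index runs vertically and the second horizontally, and that block boundaries are obtained by integer division so that every pixel belongs to exactly one block; the function name and parameter names are hypothetical.

```python
def split_into_blocks(binary, rows, cols):
    """Divide a 2-D binary image into rows x cols virtual blocks.

    Returns a dict mapping (m, n) to the sub-image of block [m, n],
    where m indexes blocks top to bottom and n indexes them left to right.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    blocks = {}
    for m in range(rows):
        y0, y1 = m * height // rows, (m + 1) * height // rows
        for n in range(cols):
            x0, x1 = n * width // cols, (n + 1) * width // cols
            blocks[(m, n)] = [row[x0:x1] for row in binary[y0:y1]]
    return blocks
```

The blocks are views of the same binary data, so each can then be analyzed separately without further preprocessing.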
[0113] At block 650, the M.times.N virtual blocks are analyzed to
quantify the number of text strings. In one embodiment, the text
strings in the binary image are quantified in accordance with the
following equations:
$$T = \sum_{m=0}^{M} \sum_{n=0}^{N} T^{m,n}$$ EQ #15
Subject to:
$$T^{m,n} = \sum_{y_t = y_0^m}^{y_{\max}^m} \ \sum_{y_b = y_t + 1}^{y_{\max}^m} T_{y_t,y_b}^{m,n}, \quad y_0^m = \frac{y_{\max}}{\partial_0}(m-1),\ y_{\max}^m = y_0^m + \partial_0$$ EQ #16
$$T_{y_t,y_b}^{m,n} = \begin{cases} 1 & \partial_1 > \sum_{i=y_t}^{y_b} CB_i^n > \partial_2,\ \sum_{k=x_0^n}^{x_{\max}^n} B_{k,y_b+1} < \partial_3 \\ 0 & \text{otherwise} \end{cases}, \quad x_0^n = \frac{x_{\max}}{\partial_0}(n-1),\ x_{\max}^n = x_0^n + \partial_0$$ EQ #17
$$CB_i^n = \begin{cases} 1 & \partial_4 > \sum_{k=x_0^n}^{x_{\max}^n} B_{k,i} > \partial_5,\ \operatorname{Max}\left( \sum_{k=x_w}^{x_w+\partial_6} B_{k,i} \right) < \partial_7,\ x_0^n \le x_w \le x_{\max}^n \\ 0 & \text{otherwise} \end{cases}$$ EQ #18
[0114] where,
[0115] .differential..sub.0 . . . .differential..sub.7 are
adjustable parameters;
[0116] T.sub.y.sub.t,.sub.y.sub.b.sup.m,n is the likelihood that
the row between y.sub.t and y.sub.b in the block [m,n] represents
text;
[0117] CB.sub.i.sup.n is the likelihood that the line[i] is a part
of text;
[0118] B.sub.k,i is the value of pixel[k,i] in the binary
image.
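A deliberately simplified sketch of the per-block text measurement follows. It is not the full double summation of the equations above. Instead, a row of a block is treated as text-like when its count of information pixels lies between two bounds and no short horizontal window is too densely filled, and each run of consecutive text-like rows whose height lies between two further bounds is counted as one text string. All function names and parameter names are hypothetical stand-ins for the adjustable parameters.

```python
def row_is_text_like(row, min_pixels, max_pixels, window, max_in_window):
    """Simplified per-row test: moderate pixel count and no dense horizontal run."""
    if not row:
        return False
    total = sum(row)
    if not (min_pixels < total < max_pixels):
        return False
    # Densest stretch of window + 1 consecutive pixels anywhere in the row.
    densest = max(sum(row[x:x + window + 1]) for x in range(len(row)))
    return densest < max_in_window


def count_text_strings(block, min_rows, max_rows, **row_params):
    """Count runs of consecutive text-like rows whose height is within bounds."""
    flags = [row_is_text_like(row, **row_params) for row in block]
    strings, run = 0, 0
    for flag in flags + [False]:  # the sentinel closes a run at the bottom edge
        if flag:
            run += 1
        else:
            if min_rows < run < max_rows:
                strings += 1
            run = 0
    return strings
```

Summing the result over all virtual blocks gives an overall text string count comparable in spirit to the total described above, under the stated simplifications.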
[0119] Notably, while in the context of the equations presented
above, a global thresholding approach is implemented taking into
consideration the image as a whole, in alternative embodiments,
various forms of local thresholding may be performed that consider
groups of blocks or individual blocks to determine the best
thresholding approach for such block or blocks.
CONCRETE EXAMPLES
[0120] For sake of illustration, two concrete examples of
application of the thresholding and text quantification described
above will now be provided with reference to FIG. 7 to FIG. 13.
[0121] FIG. 7 is an example of an image spam email message 700
containing an embedded image 710. Typically, such image spam email
messages also include text 720 in an attempt to defeat conventional
heuristic filters.
[0122] FIG. 8 is a grayscale image 810 based on the embedded image
710 of FIG. 7 according to one embodiment of the present invention.
According to the flow diagram of FIG. 6, the first step (block 610)
is to convert the embedded image 710 to a grayscale representation,
G.sub.i,j. Assuming embedded image 710 of FIG. 7 is a color image
having red (r), green (g) and blue (b) color components, after
application of one of equations EQ #1, EQ #2, EQ #3 or the like,
the grayscale representation, G.sub.i,j, appears as grayscale image
810.
[0123] FIG. 9 is an intensity histogram 900 for the grayscale image
810 of FIG. 8 according to one embodiment of the present invention.
According to the flow diagram of FIG. 6, the next step (block 620)
is to build an intensity histogram data structure, C.sup.g, and
determine a threshold value for the grayscale image 810. After
application of one or more of equations EQ #4, EQ #5, EQ #6, EQ #7,
EQ #8, EQ #9, EQ #10, EQ #11, EQ #12 or the like to the grayscale
representation, G.sub.i,j, (grayscale image 810), an intensity
histogram data structure, C.sup.g, results, which appears as
intensity histogram 900 when displayed in graphical form. Assuming
256 possible gray levels are represented in grayscale image 810,
the intensity histogram 900 graphically illustrates the number of
pixel occurrences in grayscale image 810 for each gray level.
[0124] Application of the above-referenced equations also results
in a threshold value, T, 910, being calculated for grayscale image
810. According to this example, the threshold value 910 is 109.
[0125] FIG. 10 is a binary image 1010 resulting from thresholding
the grayscale image 810 of FIG. 8 in accordance with an embodiment
of the present invention. According to the flow diagram of FIG. 6,
the next step (block 630) is to binarize the image by performing
thresholding with the calculated threshold value. Application of
one or both of equations EQ #13 and EQ #14 or the like, causes the
binary representation, B.sub.i,j, to contain a zero for each pixel
in which the grayscale representation G.sub.i,j, is less than the
calculated threshold value, T, and to contain a one for each pixel
in which the grayscale representation G.sub.i,j, is greater than or
equal to the calculated threshold value, T. For purposes of
illustration, the result of graphically depicting the binary
representation, B.sub.i,j, in which pixels having a value of one
are presented as black and pixels having a value of zero are
presented as white, is shown as binary image 1010. As can be
seen with reference to FIG. 10, the information content intended to
be conveyed, i.e., the various text strings, to the email recipient
is now clearly distinguishable from the background.
[0126] FIG. 11 illustrates an exemplary segmentation of the binary
image of FIG. 10 into 28 virtual blocks and highlights the text
strings detected within the blocks in accordance with an embodiment
of the present invention. According to the flow diagram of FIG. 6,
the next steps (blocks 640 and 650) are to logically divide the
binary image 1010 into virtual blocks and then separately analyze
each block to measure perceived text content. Application of one or
more of equations EQ #15, EQ #16, EQ #17, EQ #18 or the like,
causes the text string count, T, to contain the sum of all blocks
determined to contain a text string.
[0127] In the present example, segmented binary image 1110 contains
28 virtual blocks, examples of which are pointed out with reference
numerals 1120 and 1130. According to equations EQ #15, EQ #16, EQ
#17 and EQ #18, 23 of the 28 blocks contain a total of 63 text
strings. Text strings detected by the algorithm are underlined.
Block 1120 is an example of a block that has been determined to
contain one or more text strings, i.e., the word "TRADE" 1121.
Block 1130 is an example of a block that has been determined not to
contain a text string.
[0128] FIG. 12 is a grayscale image 1210 based on another exemplary
embedded image observed in connection with image spam.
[0129] FIG. 13 illustrates an exemplary segmentation into 56
virtual blocks of a binarized image 1310 corresponding to the
grayscale image 1210 of FIG. 12 and highlights the text strings
detected within the blocks in accordance with an embodiment of the
present invention. In the present example, segmented binary image
1310 contains 56 virtual blocks, examples of which are pointed out
with reference numerals 1320 and 1330. According to equations EQ
#15, EQ #16, EQ #17 and EQ #18, 26 of the 56 blocks contain a total
of 40 text strings. Text strings detected by the algorithm are
underlined. Block 1320 is an example of a block that has been
determined to contain one or more text strings, i.e., the group of
letters "ebtEras". Block 1330 is an example of a block that has
been determined not to contain a text string.
[0130] Notably, to the extent reverse video or the presentation of
light colored (e.g., white) text in the context of a dark (e.g.,
black) background becomes problematic (see, e.g., the "LEARN MORE"
text string embedded within FIG. 13), one approach to detect such
text strings would be to apply a local thresholding approach using
EQ #14, which would effectively reverse the black and white pixels
for the blocks at issue.
[0131] While embodiments of the invention have been illustrated and
described, it will be clear that the invention is not limited to
these embodiments only. Numerous modifications, changes,
variations, substitutions, and equivalents will be apparent to
those skilled in the art, without departing from the spirit and
scope of the invention, as described in the claims.
* * * * *