U.S. patent application number 12/690667 was filed with the patent office on 2010-08-12 for intrusion alarm video-processing device.
This patent application is currently assigned to HITACHI KOKUSAI ELECTRIC INC.. Invention is credited to Miyuki FUJII, Mitsue ITO, Wataru ITO, Kazunari IWANAGA, Lev Borisovich KOGAN, Alexander Sergeyevich KONDRATYEV, Vitaly Alexandrovich LOPOTA, Sergey Anatoliyevich POLOVKO, Ekaterina Yurevna SMIRNOVA, Dmitry Nikolayevich STEPANOV, Kirill Nikolayevich STUPIN, Victor Ivanovich YUDIN.
Application Number | 20100201820 12/690667
Document ID | /
Family ID | 42355954
Filed Date | 2010-08-12

United States Patent Application | 20100201820
Kind Code | A1
LOPOTA; Vitaly Alexandrovich; et al. | August 12, 2010
INTRUSION ALARM VIDEO-PROCESSING DEVICE
Abstract
Binarization is performed using a threshold image obtained by
multiplying a variation in each pixel value of an input image by
a coefficient. Although the variation is time-averaged based on an
update coefficient for each pixel, the update coefficient is
switched depending on whether or not a relevant pixel belongs to
the object. Subsequently, from the binary image, an initial
detection zone is formed and a spatial filtering process is
performed thereto. The spatial filtering process includes at least
one of skeleton analysis processing, object mask processing,
morphology processing, and section analysis processing. For a
tracking zone, the temporal positional change thereof is tracked,
and the noise is reduced. Some of the tracking zones are removed,
and the remaining zones are integrated into a cluster, and
furthermore the cluster selection is performed based on the
dimensions in real space.
Inventors: |
LOPOTA; Vitaly Alexandrovich; (Saint-Petersburg, RU) ; KONDRATYEV; Alexander
Sergeyevich; (Saint-Petersburg, RU) ; YUDIN; Victor Ivanovich; (Saint-Petersburg, RU) ;
POLOVKO; Sergey Anatoliyevich; (Saint-Petersburg, RU) ; SMIRNOVA; Ekaterina Yurevna;
(Saint-Petersburg, RU) ; STUPIN; Kirill Nikolayevich; (Saint-Petersburg, RU) ; KOGAN;
Lev Borisovich; (Saint-Petersburg, RU) ; STEPANOV; Dmitry Nikolayevich; (Syas'stroy, RU) ;
ITO; Wataru; (Tokyo, JP) ; ITO; Mitsue; (Tokyo, JP) ; IWANAGA; Kazunari; (Tokyo, JP) ;
FUJII; Miyuki; (Tokyo, JP) |
Correspondence
Address: |
BRUNDIDGE & STANGER, P.C.
2318 MILL ROAD, SUITE 1020
ALEXANDRIA
VA
22314
US
|
Assignee: |
HITACHI KOKUSAI ELECTRIC
INC.
Tokyo
JP
|
Family ID: |
42355954 |
Appl. No.: |
12/690667 |
Filed: |
January 20, 2010 |
Current U.S.
Class: |
348/152 ;
348/E7.085 |
Current CPC
Class: |
G06T 2207/20044
20130101; G06T 7/254 20170101; G08B 13/1961 20130101; G06T 7/194
20170101 |
Class at
Publication: |
348/152 ;
348/E07.085 |
International
Class: |
H04N 7/18 20060101
H04N007/18 |
Foreign Application Data
Date | Code | Application Number
Jan 22, 2009 | RU | 2009102124
Claims
1. An object detection method for detecting an object in a video
image, comprising the steps of: calculating a time-domain average
value of each pixel in the image; calculating a time-domain
variance or standard deviation of each pixel in the image using a
time constant that is variable for each pixel; calculating a
time-domain maximum value of the variance or standard deviation of
each pixel in the image; binarizing a current image with a
threshold value based on a value obtained by multiplying the
maximum value by a predetermined coefficient, for each pixel in
the image; labeling the binarized image and treating the thus found
plurality of connected areas as pre-detection zones; controlling
the variable time-constant depending on whether the pixel is
classified into a background or an object, for each pixel in the
image; calculating geometric attributes in real space of the
plurality of pre-detection zones, and screening the pre-detection
zones based on the geometric attributes; carrying out spatial
filtering including at least one of skeleton analysis processing,
object mask processing, morphology operations, and section analysis
processing to the binarized image or an image derived from the
binarization; recording a pre-detection zone, which has been
subjected to the spatial filtering step or the screening step, as a
tracking zone, and updating the recorded tracking zone in
accordance with the degree of coincidence with a stored past
tracking zone or tracking a temporal positional-change of a
tracking zone of interest by extracting a line component in time
and space; grouping neighboring tracking zones into a cluster based
on a predetermined rule; and determining the cluster based on a
size of the cluster or a plurality of conditions specifying at
least one of a variation in a relative position with a
predetermined monitor area and a variation in a relative position
with another cluster.
2. The object detection method according to claim 1, wherein the
determination step uses the predetermined monitor area, wherein the
predetermined monitor area is defined by either a polygonal column
or cylinder perpendicular to a ground, or a plane area using a
coordinate system having two orthogonal axes parallel to the
ground.
3. The object detection method according to claim 1, wherein the
skeleton analysis processing comprises the steps of: acquiring
shape information on the pre-detection zone by carrying out a
thinning process or skeleton processing to the binarized image;
extracting main axes from the shape information; and extracting
axes of an object by removing axes of a shade from the extracted
axes.
4. The object detection method according to claim 1, further
comprising the steps of: inputting the plurality of conditions
described in a script format, as a monitor condition script capable
of specifying priorities for the respective conditions and
specifying whether the detection is enabled or disabled; and analyzing a
logic of the monitor condition script and generating a decision
table, wherein the determination step determines whether or not the
information on the object matches a monitor condition in accordance
with the priorities assigned to the conditions.
Description
BACKGROUND OF THE INVENTION
[0001] The present invention relates to intrusion alarm
video-processing devices, and in particular, relates to an
intrusion alarm video-processing device that detects an intruder by
processing a video shot with a monocular camera.
[0002] The conventional intruder alarm system is not satisfactory
because of frequent false alarms and a lack of versatility, i.e., it
requires delicate and labor-intensive setting adjustment for each
monitoring station. When classical tasks in image processing, such as
segmentation, skeleton extraction, recognition, and detection, need to
be realized, the difficulties in developing a typical intruder alarm
system are apparently in large part due to the presence of various
noises from various kinds of sources.
[0003] Inexpensive CMOS sensors are used in almost all surveillance
video cameras. However, even in the highest-performance of these
sensors, a certain amount of hardware noise mixes into the imaging
data. There is an inverse correlation between the luminance level
and the sensor noise level. Due to this noise, two identical images
cannot be taken even if the camera and the environment to be imaged
are not moving. Actually, the luminance value or the RGB value of a
pixel is observed as a random variable. Accordingly, the value of a
pixel observed as a random variable should be modeled with an
appropriate method. It has been experimentally proved that the sensor
noise can be appropriately modeled as white noise.
[0004] As a related art underlying the present invention, a moving
vehicle detection method by Eremin S. N. is known (see RU (Russian)
patent No. 2262661). This method comprises the steps of acquiring a
frame, calculating an inter-frame difference, binarizing with a
threshold, performing a morphological operation, calculating a Sobel
operator, storing an initial frame, updating the background
based on a special equation, detecting a difference between a frame
and a background, calculating a histogram of images, detecting the
maximum luminance, verifying by comparison with an existing object,
separating a mixed object, locating a vehicle, and generating a
rectangle that expresses a coordinate at which the vehicle may be
located within a relevant framing means.
[0005] Moreover, as a related art in connection with the present
invention, an image recognition method using a Hu invariant moment
is known (see Ming-Kuei HU, "Visual Pattern Recognition by Moment
Invariants", IRE Transactions on information theory, 1962, pp.
179-187).
[0006] Moreover, a method is known in which a Fourier-Mellin
transform or a Gabor filter is used to obtain scale-invariant values,
and these are compared with a dictionary to recognize an object (see
Park, H. J., Yang H. S, "Invariant object detection based on
evidence accumulation and Gabor features", Pattern recognition
letters 22, pp. 869-882, and Kyrki, V., Kamarainen J. K, "Simple
Gabor feature space for invariant object recognition", Pattern
recognition letters 25, No. 3, 2004, pp. 311-318).
[0007] Moreover, a corner detection method by Harris is known (see
C. Harris and M. Stephens, "A combined corner and edge detector",
Proc. Alvey Vision Conf., Univ. Manchester, 1988, pp. 147-151). In
this approach, a detected corner is used as a feature quantity. Any
object has a unique set of corner points. Recognition processing is
performed by comparison with the positional relationship of the
corners of an object in a standard image.
[0008] Moreover, there are known a method of applying a Gaussian
filter to an image in multi-stages and preparing difference image
groups thereof (Laplacian pyramid) (see U.S. Pat. No. 6,141,459),
and SIFT (Scale-Invariant Feature Transform) that extracts a
scale-invariant feature quantity, such as a key point, from the maximum
values of these image groups (see David G. Lowe, "Distinctive image
features from scale-invariant key points", Journal of Computer
Vision, 60, 2, 2004, pp. 91-110).
SUMMARY OF THE INVENTION
[0009] The drawbacks of the above-described methods are that a shadow
is erroneously detected as an object (an intruder, a vehicle, or the
like) and that the actual size of an object cannot be determined.
Another drawback is that, when an object (or its position) that is
brought into the field of view and left behind is erroneously
detected, the updating of the background model at the relevant pixels
is completely stopped, and as a result a static object cannot be
automatically integrated into the background. For this reason, a
false alarm or a detection omission occurs in the presence of a
disturbance caused by continuous or temporary changes in
illumination, leaves, the movement of a water surface, or
precipitation (rain, snow, or the like). Moreover, sufficient
consideration has not been paid to a periodic background fluctuation,
such as flicker, or to tracking within an area where the illuminance
varies greatly from place to place.
[0010] It is an object of the present invention to reduce the
number of false responses and improve the detection accuracy of the
boundary of a moving object, thereby improving the quality of a TV
surveillance security system under complex climate conditions
and a varying background, and furthermore extending the
functionality or the operability.
[0011] An intrusion alarm video-processing device of the present
invention uses a background difference method based on a parametric
model. That is, every time an image frame is input, the absolute
value of a difference between the input image of the current frame
and a background image is calculated, and is then binarized using a
threshold image. For the threshold image, the variance σ² of each
pixel value of the input image multiplied by a predetermined
coefficient k₁ is used. Although the variance σ² is time-averaged
based on an update coefficient ρ for each pixel, the update
coefficient ρ is set to a different value depending on whether the
relevant pixel belongs to the background or to the object.
[0012] Subsequently, an initial detection zone is formed from the
binary image and a spatial filtering process is performed thereto.
The spatial filtering process includes at least one of skeleton
analysis processing, object mask processing, morphology operation,
and section analysis processing. The skeleton processing includes a
process to obtain shape information of the initial detection zone
by a thinning process or skeleton processing with respect to the
binary image, a process to extract main axes from the shape
information, and a process to extract the axes of the object from
the extracted axes.
[0013] The object mask processing includes a process to extract the
border area that is not adjacent to the border of the initial
detection zone of the binary image. The morphology processing
includes an expansion process to convert a pixel adjacent to a
white pixel of a binary image to a white pixel, and a contraction
process to convert a pixel adjacent to a black pixel of a binary
image to a black pixel. The section analysis processing includes a
process to divide the initial detection zone into segments, a
process to analyze the ratio of white pixels in the binary image
with respect to each segment, and a process to select segments
based on the ratio of white pixels. Subsequently, a tracking zone
that expresses the independent part of the object is formed.
[0014] For the tracking zone, a temporal positional change of a
tracking zone of interest is tracked using at least one of the
following methods: a tracking method based on characteristic
information, such as the existence position and size, the center of
gravity, the contour feature of the image, and the moment; and a
tracking method based on the line component extraction approach
represented by Hough transform or the like, in which a line
component is extracted from the temporally arranged binary
spatio-temporal data obtained at each time point. The tracked
result is subjected to at least one of smoothing filtering,
moving-average filtering, and Kalman filtering, and thus a
component due to noise is reduced from the calculated positional
change.
[0015] Some of the tracking zones are removed, and the remaining
zones are integrated into a cluster, and furthermore the cluster
selection is performed. The cluster selection is determined based
on the size of a cluster, the position coordinates of a cluster, the
displacement from an area having a specified shape, and the
displacement from an area lying a predetermined distance or less
away from a certain cluster. This determination is made after
converting to the dimensions in real space by coordinate
conversion. This conversion is calculated using the conditions of
the image sensor of a camera and the camera parameters at the
mounting position. Eventually, a cluster that is selected and remains
is judged as an object to be detected.
[0016] Other than the intrusion alarm video-processing devices as
described above, intrusion alarm video-processing devices with some
of the constituent elements thereof replaced with the ones in other
known art may be included in the present invention.
[0017] The intrusion alarm video-processing device of the present
invention can accurately detect a monitored object from a video
even if there are various kinds of regular, temporary, or periodic
disturbances, such as the climate conditions, inactive
(abiological) movement, or a fluctuation in the artificial
image.
[0018] Other objects, features and advantages of the invention will
become apparent from the following description of the embodiments
of the invention taken in conjunction with the accompanying
drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] FIG. 1 shows the main processing loop of an intrusion alarm
video-processing device (Embodiment 1).
[0020] FIG. 2 is a view illustrating scene coordinates, camera
coordinates, and screen coordinates (Embodiment 1).
[0021] FIG. 3 is an image showing an example of a splitting process
(S118) (Embodiment 1).
[0022] FIG. 4 is an image showing an example of merge processing
(S119) (Embodiment 1).
[0023] FIGS. 5A to 5F are images showing an example of skeleton
processing (Embodiment 3).
[0024] FIG. 6 is a flowchart of object segmentation (OS) process
(Embodiment 4).
[0025] FIG. 7 is an example of monitor conditions (Embodiment
5).
[0026] FIG. 8 is an example of monitor conditions (Embodiment
5).
[0027] FIG. 9 is an example of an equipment configuration
(Embodiment 5).
[0028] FIG. 10 is an example of a decision table (Embodiment
5).
[0029] FIG. 11 is an example of setting a monitor area in the
camera coordinate system (Embodiment 6).
[0030] FIG. 12 is an example of setting a monitor area in the scene
coordinate system (Embodiment 6).
[0031] FIG. 13 is an example of generating a processed area taking
into consideration the height of a monitor area (Embodiment 6).
[0032] FIG. 14 is an example of imaging a target object to be
monitored (Embodiment 6).
DETAILED DESCRIPTION OF THE EMBODIMENTS
General Logic of a Video Monitoring System Function
[0033] In order to realize the main goals, first, the general logic
for the operation of an intrusion alarm video-processing device
according to an embodiment of the present invention needs to be
determined.
[0034] In order to solve the related-art problems, the evaluation
of an observed environmental change and data analysis at the
prediction level are required. In accordance with the analysis
result, the observed situation is evaluated as one to be alarmed (a
possible threat). Depending on the degree of risk in the situation
(also taking the prediction into consideration), a response of one
video monitoring system or of one of the other video monitoring
systems will be formed. The feature of this system is that the
response from the system is made similar to that of a human
operator.
[0035] As a result, the processing logic of the video monitoring
system is the logic itself in detection, prediction, and removal
(screening) of a threat to an article. Development of the
processing logic is based on the formalization of alert and
hazardous situations. Under actual conditions, for the
formalization of situations, the number of false alarms can be
reduced by the integration analysis and by grouping current
situations into one of the classes ("problematic", "dangerous",
"very dangerous"). It is a natural way to develop the processing
logic relying on the judgment of a person who experienced the
monitoring tasks. While looking at a plurality of complex scenes in
which environmental changes occur, he/she pays attention to an
object that may be a direct threat to a protected article, and
tries to predict a change of scenery while paying attention to the
speed or direction of a questionable object.
[0036] Identifying a moving object (or an object left behind) from
a complicated background in a scene, in which noise in the natural
world exists, should be carried out before assessing the current
situation.
[0037] Then, the compound function of this system can be divided
into four main stages below:
[0038] 1) Adjustment
[0039] 2) Initial detection (preliminary detection)
[0040] 3) Analysis on the status, taking into consideration a
detection object
[0041] 4) Alarm and analysis on metadata.
[0042] "Adjustment" includes the following items:
[0043] 1) Algorithm adjustment (parameter setting for video-data
processing)
[0044] 2) Camera setting adjustment (setting and adjustment of
camera setting parameters)
[0045] 3) Zone adjustment (selection and indication of a different
"zone of interest" corresponding to a surveillance scene).
[0046] "Initial detection" means evaluation of a difference between
a "background" and the current video frame. The main object in this
stage is to detect all the differences as much as possible based on
a selected criteria (threshold). The quality of detection
(detection of a difference from the background) is conditioned in
the initial detection stage. Here, although we may have a number of
erroneous detections, the amount thereof will decrease in the next
stage. The algorithm of the initial detection is a processing with
respect to the luminance value (having 0-255 values for each of
three channels of RGB colors) of a pixel.
[0047] "Analysis on the status" is required to reduce the amount of
erroneous detections. The first step in the status analysis is to
disregard an object that does not need to be alerted on or closely
watched. Implementation of this step in this system
includes the following items:
[0048] 1) Evaluation of the size of an initially detected
object
[0049] 2) Evaluation of the shape of an initially detected
object
[0050] 3) Evaluation of the value of "collation with the
background" of an initially detected object (i.e., not the
processing of the luminance value of one pixel but the processing
of the characteristics of all the pixels corresponding to a
detection object is performed)
[0051] 4) Evaluation of the life time of an initially detected
object
[0052] 5) Evaluation of the speed of an initially detected
object.
[0053] For the purpose of further evaluation of the object
behavior, the status recognition, and the generation of a
corresponding response, the following shaped areas within the
camera imaging range are used:
[0054] 1) Polygon area
[0055] 2) Pillar area
[0056] 3) Perpendicular plane area
A separate degree of risk can be set for each of the zones.
Embodiment 1
[0057] First, the main terms used in the description of this
embodiment are defined.
[0058] Current frame (image): one frame of image obtained from a
video input in the current processing cycle.
Background frame (image): an image obtained by successively
averaging (smoothing) the luminance value of each pixel within an
image frame. These calculations are performed using a low pass
filter 106a (described later).
[0059] Standard-deviation frame: an image obtained by successively
averaging (smoothing) the variance of the luminance value of each
pixel within an image frame. These calculations are performed using
a low pass filter 106b (described later).
[0060] Frame difference (image): an image resulting from an image
difference between the current frame and the background frame.
[0061] Binary frame (image): an image resulting from the
binarization of a difference image frame, obtained by comparing the
difference frame with the standard-deviation frame for each pixel.
[0062] Foreground pixel: a pixel within the current frame that is
contained in a non-zero zone (a zone having a pixel value greater
than zero) in the binary image frame.
[0063] Background pixel: a pixel within the current frame that is
contained in a zero zone (a zone having a pixel value of 0) in the
binary image frame. Note that, although a frame is the unit
constituting one image, it may be used synonymously with an
image.
[0064] FIG. 1 shows a main processing loop of Embodiment 1. The
initial detection phase covers from the input of an image frame (Step
101) to the binarization process (Step 109).
[0065] In Step 101, an input frame just shot with a camera is
input. Step 101 is activated via an event handler by a timer event,
thereby starting the main processing loop. The input image is in a
YUV 4:2:2 format, for example.
[0066] In Step 102, the resolution and/or the number of colors of
an input image are reduced to ones in a format suitable for
real-time processing. In this embodiment, the input image is
converted to a one-byte-per-pixel gray scale image because several of
the later-described functions support only RGB or one-channel gray
scale. YUV, HSB (HSV), or another format may also be appropriate. The
resolution corresponds to a plurality of formats and is reduced to
360×240 pixels, for example. In this Step 102, a process to
adequately blur the image with a low-frequency spatial filter is
also performed before or after the reduction of the resolution
and/or the number of colors. For example, a Gaussian filter is
suitable for high-speed processing because it can perform
calculations in the x direction and in the y direction separately.
Alternatively, a median filter that employs the median within 3×3
pixels may be used. Finally, the gain is controlled so as to make the
luminance (average) in a predetermined area within the image uniform.
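The following is a minimal sketch of this pre-processing step, assuming OpenCV and NumPy and a BGR input frame (the patent mentions a YUV 4:2:2 input; the color conversion, blur kernel, and reference area used for gain control are illustrative assumptions, not values from the specification):

```python
import cv2
import numpy as np

def prepare_frame(frame_bgr, size=(360, 240)):
    """Reduce color depth and resolution as in Step 102 (illustrative values)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)        # one-channel gray scale
    small = cv2.resize(gray, size, interpolation=cv2.INTER_AREA)
    blurred = cv2.GaussianBlur(small, (5, 5), 0)              # low-frequency spatial blur
    # Gain control: normalize the mean luminance of a reference area to a target value.
    roi = blurred[0:120, 0:180]                                # hypothetical reference area
    gain = 128.0 / max(float(roi.mean()), 1.0)
    return np.clip(blurred.astype(np.float32) * gain, 0, 255).astype(np.uint8)
```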
[0067] In Step 103, if it is the initial operation of the main
processing loop, the loop is branched to a setup (setting) process
(Step 104). In Step 104, the later-described various constants
(parameters) are set, and also the setting is made for specifying
what kind of alarm is issued when an object of what kind of size,
speed, and locus is detected in a detection area of what kind of
shape and position. Some of these settings are provided using the
values of real space coordinates (scene coordinate system) instead
of screen coordinates. The details will be described in Steps 124,
125.
[0068] In Step 105, the prepared (reduced) frame is stored to be
used as a one-frame-delayed image.
[0069] In Step 106, two types of low pass filtering are performed
using the prepared current image and the one-frame-delayed image. In
this embodiment, the background image is modeled as a stochastic
process having an unknown average and standard deviation. A
time-domain low pass filter is used to evaluate (estimate) these
moments.
[0070] The low pass filter 106a regularly updates the estimate of
the average of each pixel. The moving average is calculated (as in
Equation (1) below) every time a new frame is input.

$$B_i \equiv \mu_i = (1-\rho)\,\mu_{i-1} + \rho\, I_i \qquad (1)$$

where $I_i$ denotes the current image, $\rho$ denotes a filter
constant ($0 < \rho < 1$), and $i$ denotes the index of a frame. The
result of the low pass filter 106a is referred to as the background
frame.
[0071] The filter constant has the following meaning. Consider
the number of image frames required to capture a new object into
the background. If this capture is too fast, we may miss an object
(to be detected) that does not move that fast. For example, in the
case of ρ=1, the current (new) image frame immediately becomes the
new background image frame, while in the case of ρ=0, the first
image frame remains as the background image frame and the background
image frame is never updated. Actually, we want to realize a
(successive) moderate updating of the background and a process of
smoothing an abrupt change in the luminance value. First, T is
defined as a preferable cycle (interval) of perfect update of the
background image frame. If T is defined as a number of processing
frames (not in units of seconds), ρ is given by ρ=5/T. For example,
if perfect updating of the background is to be executed within 1000
processing frames, the filter constant is set as ρ=0.005.
[0072] The low pass filter 106b successively calculates an
estimated standard deviation σ of each pixel using the same method.

$$\sigma_i^2 = (1-\rho)\,\sigma_{i-1}^2 + \rho\,(\mu_i - I_i)^2 \qquad (2)$$

Note that the background frame or the current frame may be that of
one frame before (the frame with index i-1). As described later, ρ
is switchable for each pixel depending on the type of zone or
various kinds of conditions (e.g., luminance). ρ may also be varied
between the low pass filters 106a and 106b, and is denoted as ρ_a
and ρ_b, respectively, in this case.
[0073] Actually, the estimated standard deviation σ is stored as σ²
(i.e., the variance) in memory in order to avoid a square root
calculation, and is handled as the squared value itself until the
binarization processing.
[0074] In Step 107, the temporal maximum value σ' of the standard
deviation σ (or variance) calculated by the low pass filter 106b is
calculated and held for each pixel. Although the maximum value σ'
may be successively searched from a predetermined number of past
frames, it can also be calculated from Equation (3) below.

$$\sigma'_i = \begin{cases} (1-\rho_m)\,\sigma'_{i-1} + \rho_m\,\sigma_i, & \text{when } \sigma_i < \sigma'_{i-1} \\ \sigma_i, & \text{when } \sigma_i \geq \sigma'_{i-1} \end{cases} \qquad (3)$$
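A minimal sketch of the two low pass filters 106a, 106b and the running maximum of Step 107 (Equations (1) to (3)), assuming float32 NumPy arrays of the same shape as the reduced frame; all names and default values are illustrative:

```python
import numpy as np

def update_background_model(frame, mu, var, sigma_max, rho, rho_m=0.05):
    """One iteration of Equations (1)-(3); rho may be a scalar or a per-pixel array."""
    frame = frame.astype(np.float32)
    mu_new = (1.0 - rho) * mu + rho * frame                      # Eq. (1): background frame
    var_new = (1.0 - rho) * var + rho * (mu_new - frame) ** 2    # Eq. (2): running variance
    sigma = np.sqrt(var_new)
    # Eq. (3): decaying per-pixel maximum of the standard deviation.
    sigma_max_new = np.where(sigma >= sigma_max,
                             sigma,
                             (1.0 - rho_m) * sigma_max + rho_m * sigma)
    return mu_new, var_new, sigma_max_new
```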
[0075] In Step 108, the difference frame is generated using the
prepared current image and background image. The change detection
algorithm of this embodiment is based on the absolute value of the
image difference between the reduced input image I_i and the
background image μ_i (or μ_{i-1}) generated by the low pass filter
106a.
[0076] In Step 109, this difference frame is binarized using an
adaptive threshold k₁σ. The standard deviation is used here as the
adaptive part of the binarization threshold.

$$|I_i - B_{i-1}|^2 > k_1^2\,\sigma_{i-1}^2 \qquad (4)$$

where k₁ is a constant selected in the setting stage (Step 104). A
recommended value ranges from 3 to 4 and is determined depending on
the quality of the noise. The result of the binarization processing
is obtained as a binary image, where "0" (False) means that nothing
is detected, and "255" (True) denotes a detected pixel. If the image
has been handled as a color image until this step, integration of
the color channels is also performed here. For the integration, the
color channels may be subjected to weighted summation before
binarization, or the color channels may be combined by logical sum
(OR) after binarization. The binary image (or an area of true values
within the binary image) obtained in Step 109 is also referred to as
the initial object mask.
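A sketch of the binarization of Step 109 with the adaptive threshold k₁σ of Equation (4); the value of k₁ is illustrative and would normally come from the setup of Step 104:

```python
import numpy as np

def binarize_difference(frame, background, var, k1=3.5):
    """Return the initial object mask: 255 where |I - B|^2 > k1^2 * sigma^2, else 0."""
    diff_sq = (frame.astype(np.float32) - background) ** 2
    mask = diff_sq > (k1 * k1) * var
    return (mask.astype(np.uint8)) * 255
```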
[0077] Steps 110 to 123 are the phases of "status analysis".
[0078] In Step 110, morphological operations are applied to the
initial object mask. The morphological operations include four
basic operations as follows: a dilation process to compute logical
OR while shifting an image within a predetermined range, an erosion
process to compute logical AND, an opening process to carry out the
dilation process after the erosion process, and a closing process
to carry out the erosion process after the dilation process. The
closing process has an effect of connecting adjacent "255" (True)
pixels together, while the opening process has an effect of removing
point-like "255" (True) pixels. Either one of them is used in this
embodiment.
[0079] In the initial object mask, the morphological operations
cannot sufficiently handle the case where a false-value hole occurs
in an area of connected true values. For this reason, a hole
filling process may be carried out, in which a false-value area
surrounded by true values is detected and then this area is filled
with true values.
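A sketch of Step 110 plus the optional hole filling, assuming OpenCV and a 0/255 single-channel mask; the structuring element size and the assumption that pixel (0, 0) is background are illustrative:

```python
import cv2
import numpy as np

def clean_object_mask(mask):
    """Morphological cleanup (Step 110) followed by filling of enclosed false-value holes."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # remove point-like noise
    closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel) # connect adjacent true pixels
    # Hole filling: flood-fill the background from a corner, then OR the inverse back in.
    flood = closed.copy()
    ff_mask = np.zeros((closed.shape[0] + 2, closed.shape[1] + 2), np.uint8)
    cv2.floodFill(flood, ff_mask, (0, 0), 255)                 # assumes (0, 0) is background
    holes = cv2.bitwise_not(flood)                             # only the enclosed holes remain
    return cv2.bitwise_or(closed, holes)
```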
[0080] In Step 111, bad traces (tracking results) are cleaned up and
the (binary pixels of the) background image causing these bad traces
are removed. That is, if an erroneously detected tracking zone has
already been found in Step 120, 122, or the like of the preceding
processing cycle, the pixels within this tracking zone in the
initial object mask are disabled (set to values other than 255), and
at the same time the inside of this tracking zone in the current
frame is replaced with that of the background image and corrected.
With this step, the object mask is completed. In addition, the
original current frame is also stored separately.
[0081] In Step 112, labeling of preliminary detection zones and
calculation of the attributes thereof are carried out. The labeling
is an approach to find and mark (label) all the connected areas
within an image. In this stage, a unique number is given to a
connected area comprising pixels having true values within a binary
image, and this connected area is subsequently handled as a
preliminary detection zone "DetZones" (Dz.sub.0, Dz.sub.1, . . . )
having circumscribed rectangle coordinates (the four coordinates of
the top, bottom, left, and right sides) and an area (the area of the
inside of a connected area, or the number of connected pixels).
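A sketch of the labeling of Step 112 using OpenCV connected components; the dictionary fields and the minimum-area cutoff are illustrative:

```python
import cv2

def label_detection_zones(mask, min_area=10):
    """Label connected true-value areas and return them as preliminary detection zones."""
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(mask, connectivity=8)
    zones = []
    for i in range(1, num):                       # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:
            zones.append({"label": i,
                          "rect": (int(x), int(y), int(x + w), int(y + h)),
                          "area": int(area)})
    return zones
```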
[0082] In Step 113, when the luminance abruptly changes due to some
incident (clouds, a streetlight turning on, or the like), the main
processing loop is branched to a fast adaptation mode (Step
114). In this example, when a total area sum of the detection zones
in the whole image frame or a total area sum of the detection zones
inside a "fast adaptation zone" becomes larger than a preset
threshold, the main processing loop is branched. In this example,
the fast adaptation mode will be maintained for several periods.
This period (specified by the number of frames, not by time) is
also preset.
[0083] In Step 114, if it is in the fast adaptation period, such a
value that can totally replace the background image by the end of
the fast adaptive processing duration time will be assigned to the
filter constant. For example, if the duration time of 50 processing
frames is set in the fast adaptive processing, the filter constant
ρ becomes equal to 0.1. In this way, the fast adaptive
processing can avoid an erroneous detection caused by an abrupt
change in the background. Detection of a questionable object during
the fast adaptive processing (in Step 116 and thereafter) is not
carried out.
[0084] In Step 115, the filter constants used for detection zones
are made adaptive. The binary image is used in order to separate a
pixel in which a questionable object may be detected (having the
value of 255 in the binary image and referred to as a foreground
pixel) from a pixel in which only the background is detected (having
the value of zero in the binary image). The filter constants of the
low pass filters 106a, 106b with respect to a foreground pixel are
changed so that the speed at which an (erroneously detected)
foreground pixel becomes the background is 10 times slower than for
the other pixels of the image frame. That is, the above description
of ρ is applied to ρ₁, and ρ is redefined as follows.

$$\rho = \begin{cases} \rho_1, & \text{when the pixel is grouped as the background} \\ k \cdot \rho_1, & \text{when the pixel is grouped as the object} \end{cases} \qquad (5)$$

In this embodiment, k=0.1. Accordingly, the system can prevent the
actual object from being reflected in the background image for a
long time, as compared with a case without this local adaptive
processing. As compared with the fast adaptive processing, this
processing can avoid the oversight of an object that stops or is
moving at a low speed.
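A sketch of the per-pixel switching of the filter constant in Equation (5); the resulting ρ map can be fed back into the background update above. ρ₁ and k are illustrative (k=0.1 as stated in the text):

```python
import numpy as np

def adapt_filter_constants(object_mask, rho1=0.005, k=0.1):
    """Per-pixel rho: rho1 for background pixels, k*rho1 for (suspected) object pixels."""
    rho = np.full(object_mask.shape, rho1, dtype=np.float32)
    rho[object_mask > 0] = k * rho1        # object pixels are absorbed 10x more slowly
    return rho
```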
[0085] In Step 116, the geometric attributes of a preliminary
detection zone (analysis zone) are calculated. The geometric
attributes include the position and size (width and height) of a
detection zone expressed with the scene coordinate system. Consider
the following coordinate system (FIG. 2).
[0086] X, Y, Z: scene coordinate system (world coordinate system).
The X-Y plane is parallel to a floor surface (ground), and this
level ranges from 0.5 to 0.7 m.
[0087] X', Y', Z': camera coordinate system. X', Y' axes are
parallel to a target focal plane, X' is parallel to the X axis, and
Z' is equal to the optical axis of a camera.
[0088] Xs, Ys: image (screen) coordinates, which are similar to the
X'-Y' plane, but the unit thereof is in pixels, not in meters.
[0089] The height of the camera is denoted by h, and the gradient of
the camera optical axis with respect to the X-Y plane is denoted by
t. An object P positioned in the scene at coordinates X, Y, and Z
(Z=0) is converted to the camera coordinate system by Equation (6)
below.

$$X' = X$$
$$Y' = Y\cos(t) - h\sin(t)$$
$$Z' = Y\sin(t) + h\cos(t) \qquad (6)$$

The screen coordinates of the object are given by Equation (7)
below using projection optics equations.

$$Z' \cdot X_S = f_i\, p_X\, X'$$
$$Z' \cdot Y_S = f_i\, p_Y\, Y' \qquad (7)$$

where $f_i$ denotes the focal length, $p_X$ [m⁻¹] and $p_Y$ [m⁻¹]
denote the picture element density in the $X_S$ and $Y_S$
directions, respectively, and $f = f_i p_X = f_i p_Y$ is defined.
These camera installation parameters are provided in Step 104. By
eliminating the variable Z',

$$X_S\, Y \sin(t) + X_S\, h \cos(t) = f\, X$$
$$Y_S\, Y \sin(t) + Y_S\, h \cos(t) = f\, Y \cos(t) - f\, h \sin(t) \qquad (8)$$

is obtained, and the conversion equation is Equation (9) below.

$$Y = \frac{f\, h \sin(t) + Y_S\, h \cos(t)}{f \cos(t) - Y_S \sin(t)}$$
$$X = \frac{X_S\, Y \sin(t) + X_S\, h \cos(t)}{f} \qquad (9)$$

Since a camera may have been installed using a different method
from that of FIG. 2, the rotation angles of the camera about the Z
axis and Z' axis may also need to be considered. In this case, the
new coordinates are expressed with Equation (10) below.

$$X = X \cos(a) - Y \sin(a)$$
$$Y = X \sin(a) + Y \cos(a) \qquad (10)$$

where a denotes the rotation angle about the Z axis. Similarly, the
screen coordinates are expressed with Equation (11) below.

$$X_S = X_S \cos(a') - Y_S \sin(a')$$
$$Y_S = X_S \sin(a') + Y_S \cos(a') \qquad (11)$$

Here, a' denotes the rotation angle about the Z' axis.
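A sketch of the screen-to-scene conversion of Equation (9), assuming the camera height h (meters), tilt t (radians), and the combined focal length f (in pixel units) of Step 104, and screen coordinates measured from the optical axis; the rotations of Equations (10) and (11) are omitted:

```python
import math

def screen_to_scene(xs, ys, h, t, f):
    """Map screen coordinates (pixels) to ground-plane X, Y in meters for a point with Z = 0."""
    # Eq. (9): invert the projection for a point lying on the ground plane.
    y = (f * h * math.sin(t) + ys * h * math.cos(t)) / (f * math.cos(t) - ys * math.sin(t))
    x = (xs * y * math.sin(t) + xs * h * math.cos(t)) / f
    return x, y
```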
[0090] In Step 117, a preliminary detection zone (analysis zone)
that does not satisfy a predetermined size is blocked off (so as
not to be passed to the subsequent processes). For each detection
zone, the geometric attributes (e.g., the width and height in real
space) in the scene coordinate system (X, Y, Z) are compared with
predetermined values that define the respective lower and upper
limits (e.g., w_min=0.1, w_max=2, h_min=0.1, h_max=3), and then only
a zone that satisfies these values is filtered and stored in an
array "SelZone". Moreover, a pixel of a preliminary detection zone
in the current frame that does not satisfy the above is overwritten
with that of the background frame.
[0091] In Step 118, segmentation of the preliminary detection zones
having passed Step 117 is performed. The segmentation is required
for the analysis on the level of "hole filling of a detection
area". In order to calculate a new border of each filtered zone,
all of the filtered zones (square areas of interest) are split into
strips of equal width. The upper side and lower side of each split
zone are redefined based on the object mask, and the split width is
defined in advance as a value in meters in the scene coordinate
system. Actually, the width is finely adjusted so that the zone is
split into an integer number of strips of equal width. The split
zones are then stored as Sz_0, Sz_1, and so on.
[0092] FIG. 3 shows a result of this segmentation. Rectangles drawn
with a thick white line and vertically-long rectangles within the
white rectangle express the result of segmentation and the
re-calculated border, respectively. This reveals that the
segmentation provides the contour of an actual vehicle and the
contour of the actual shadow by setting the split width to 0.2 [m],
for example.
[0093] In Step 119, a merge of the split areas is performed using
the filling rate of an elongated zone (analysis zone). The merge is
achieved by repeating the following first to third sub-steps until
no unreferenced split zone remains.
[0094] First, a reference zone is searched for. The reference zone
is one of the above-described split zones that satisfies the
following: (1) it is nearest to the center of the base of the image
frame, (2) it is not contained in any of the merged groups, and (3)
it has not been used as a trial zone in the past.
[0095] Secondly, an elongated zone serving as a merged candidate is
calculated from the attributes of the found reference zone. The
elongated zone is a rectangle having a larger height than a
predetermined height (e.g., 0.8 m for a person) in the scene
coordinate system. The height in the metric unit is calculated from
the height of a filtered zone (zone before splitting) based on a
proportional relationship.
[0096] Thirdly, if S_cross/S_total > "Merge area overlapping ratio"
is not satisfied, the elongated zone is incorporated into a merged
group. Here, S_cross is the area of the crossing area (common area)
between a merged zone (the circumscribed rectangle of a merged
group) and the elongated zone, and S_total is the area of the
elongated zone itself. If the crossing area is 0, the
above-described overlapping ratio is calculated regarding the
reference zone itself as the merged zone, and if the above
condition is satisfied, a new merged group that regards the
elongated zone as its first member is created.
[0097] Finally, a merged group that sufficiently satisfies the
following condition is registered in an array "Merge" as a merge
zone: S_sum/S_merge > "Merge area filling ratio", where S_sum is the
sum of the individual areas of the elongated zones included in a
merged group, and S_merge is the area (circumscribed rectangle) of
the merged zone. "Merge area filling ratio" is 60%, for example. A
merged group that does not sufficiently satisfy this condition will
not be registered in the array "Merge".
[0098] FIG. 4 shows a result of the merge. Rectangles with thin
lines express the split zones that are merged into one. It can be
seen that only the tall portion of the detection object passes the
merge processing.
[0099] In Step 120, a location in the previous frame similar to
each zone (tracking zone) registered in the array Merge, and the
degree of coincidence, are calculated to update the array Trace.
Tracking zones up to the previous time are registered in the array
Trace, and this processing is intended to check whether or not
these zones exist stably in a series of processing frames, and to
reduce erroneous detections. In this step, for each tracking zone
stored in the array Merge, the tracking zone is clipped from the
previous frame (or the previous difference frame), and a search
range of the image, obtained by expanding the tracking zone by a
specified amount, is clipped from the current frame; then, within
this search area, the following calculation is performed to search
for the maximum degree of coincidence.

$$\mathrm{Collation} = (1 - D/G) \cdot 100\%$$
$$D = \sum_{i=1}^{n}\sum_{j=1}^{m} \delta_{ij}\,\bigl|a_{ij} - b_{ij}\bigr|, \qquad G = \sum_{i=1}^{n}\sum_{j=1}^{m} \delta_{ij}\,\max(a_{ij}, b_{ij})$$
$$\delta_{ij} = \begin{cases} 1 & \text{if } a_{ij} > 0 \text{ and } b_{ij} > 0 \\ 0 & \text{if } a_{ij} = 0 \text{ or } b_{ij} = 0 \end{cases} \qquad (12)$$

where a_ij denotes an element of the luminance matrix (image
fragment) of the pattern, and b_ij denotes an element of the
luminance matrix (image fragment) of the search range. If each
element has a plurality of color channels, the sum of the absolute
values of the differences over the color channels is used.
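A sketch of the collation measure of Equation (12) between a clipped tracking-zone pattern and an equally sized fragment of the search range; in Step 120 this would be evaluated at every offset within the search range to find the maximum degree of coincidence:

```python
import numpy as np

def collation(pattern, fragment):
    """Degree of coincidence (Eq. 12) between two equally sized gray-scale fragments."""
    a = pattern.astype(np.float32)
    b = fragment.astype(np.float32)
    delta = (a > 0) & (b > 0)               # only pixels non-zero in both fragments count
    d = np.abs(a - b)[delta].sum()
    g = np.maximum(a, b)[delta].sum()
    return 100.0 * (1.0 - d / g) if g > 0 else 0.0
```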
[0100] In a certain tracking zone, if the maximum degree of
coincidence is larger than a value "Trace zone correlation coef",
the position of a calculated tracking zone within the search range
updates the array Trace as a new position of the tracking zone. If
the maximum degree of coincidence is smaller than the value "Trace
zone correlation coef" for a number of frames (iMissedFrameCnt),
this tracking zone is deleted from the array Trace (and from the
array Merge).
[0101] In Step 121, provision is made so that a tracking zone
similar to the background is not added to the array Trace. That is,
a new zone will be added to the array Trace only if its collation
value with the background is smaller than "Trace zone correlation
to backgr". Moreover, if the overlap between a new zone and an
existing zone is larger than a value iTRZoneOvrCoef, this new zone
will also not be added to the array Trace. The collation in this
step may be carried out based on the degree of coincidence
(Collation) of Equation (12) (Step 120), or another feature quantity
may be used.
[0102] For the processing of Steps 120 and 121, the calculation
amount increases abruptly as the zone becomes large. For this
reason, an upper limit may be set to the zone size, whereby a
clipped image may be shrunk so as not to exceed this upper limit.
After Step 121, the array Merge is released from the memory.
[0103] In Step 122, each tracking zone of the array Trace is
integrated into a cluster to create an array Cluster. The allowable
life time and size are defined for the cluster, and the one
satisfying these will be registered with the array Cluster. The
integrating process is performed by the following first to fifth
sub-steps.
[0104] Firstly, a cluster is created as a rectangular area that
contains a group of tracking zones that exist in each other's
vicinity. The maximum allowable interval between tracking zones
integrated into a cluster is denoted by "Clustering factor", and is
5 pixels, for example.
[0105] Secondly, a process of connecting clusters, which are
created in the current processing cycle and in the previous
processing cycle (Cluster and ClustPre, hereinafter referred to as
the current cluster and the previous cluster), is performed to
create the following arrays.
[0106] MinT0Cur: denotes the previous cluster intersecting a certain
current cluster Cluster[i] and having the minimum T0 (detection
time) value
[0107] CrQPre: the number of current clusters intersecting a
certain previous cluster ClustPre[j]
[0108] CrQCur: the number of previous clusters intersecting a
certain current cluster Cluster[i].
[0109] Thirdly, the data of the array Cluster is created from the
above-described CrQCur, CrQPre, and MinT0Cur based on the following
rules. [0110] If only a certain previous cluster and a certain
current cluster intersect with each other, then the ID, T0, and
detection position of the previous cluster are inherited by the
current cluster. [0111] If a certain current cluster intersects
with one or more previous clusters, then a new ID is given to this
current cluster, and the T0 of the previous cluster having the
smallest value T0 is inherited, and as the detection position the
position of the current cluster is employed. [0112] If a certain
current cluster does not intersect with any previous clusters, then
a new ID is given to this current cluster, the current time is
given as T0, and as the detection position the position of the
current cluster is employed.
[0113] Fourthly, the locus, speed (to be used in the subsequent
step), or the like of a cluster are calculated and stored in the
array Cluster.
[0114] Fifthly, the array Cluster of the current clusters is stored
over (overwrites) the array ClustPre of the previous clusters.
[0115] In Step 123, each cluster of the array Cluster whose life
time (which is a difference between T0 and the current time, and is
in the unit of number of frames) exceeds a predetermined value
(e.g., 40) is selected, and each cluster whose life time is no more
than the predetermined value is dismissed (is not to be passed to
the next process).
[0116] In Step 124, based on the detection areas set in Step 104
and the relative position of each cluster, whether a cluster is
inside or outside each detection area is determined. The detection
areas include a polygon (cylinder) area (defined by the screen
coordinates or the scene coordinates), a pillar area (defined by
the scene coordinates, with the bottom of the pillar on the ground
(X-Y plane)), a circular area (defined by the scene coordinates on
the ground (X-Y plane)), and a perpendicular plane area (defined by
the scene coordinates, and preferable for a wall or a window). As
the position of each cluster, the coordinate values (screen
coordinates or scene coordinates) at the bottom center of the
cluster (in the ground portion, such as a person's legs) are used.
A well-known algorithm is used in determining whether a cluster is
inside or outside each detection area.
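A sketch of the inside/outside determination of Step 124 for a polygonal detection area, using the bottom-center point of the cluster; cv2.pointPolygonTest is used here as one example of a well-known algorithm, and the rectangle format is an assumption:

```python
import cv2
import numpy as np

def cluster_in_area(cluster_rect, polygon):
    """True if the bottom-center of the cluster rectangle lies inside the polygon area."""
    x0, y0, x1, y1 = cluster_rect                    # circumscribed rectangle of the cluster
    foot = ((x0 + x1) / 2.0, float(y1))              # bottom-center, e.g. a person's feet
    contour = np.array(polygon, dtype=np.float32).reshape(-1, 1, 2)
    return cv2.pointPolygonTest(contour, foot, False) >= 0
```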
[0117] In Step 125, if the results of analysis and collation of the
attributes of a cluster determined as being within the detection
area (invariant values and the like of the foreground image of the
cluster, in addition to its position and movement) satisfy the
decision rules defined for the relevant detection area, a
predetermined alarm is issued. Although the use of an invariant
value (feature quantity) is not indispensable, HOG (Histograms of
Oriented Gradients) or the like may be used, in addition to, for
example, those shown in Ming-Kuei HU, "Visual Pattern Recognition
by Moment Invariants", IRE Transactions on information theory, 1962,
pp. 179-187; Park, H. J., Yang H. S, "Invariant object detection
based on evidence accumulation and Gabor features", Pattern
recognition letters 22, pp. 869-882; Kyrki, V., Kamarainen J. K,
"Simple Gabor feature space for invariant object recognition",
Pattern recognition letters 25, No. 3, 2004, pp. 311-318; C. Harris
and M. Stephens, "A combined corner and edge detector", Proc. Alvey
Vision Conf., Univ. Manchester, 1988, pp. 147-151; and David G.
Lowe, "Distinctive image features from scale-invariant key points,
Journal of Computer Vision, 60, 2, 2004, pp. 91-110, as described
above.
[0118] The examples of the decision rule include the following
ones.
[Decision-Rule Name: Vehicles in a "Vehicle Off-Limits" Area]
[0119] If an object is detected as a vehicle and exists in a
"vehicle off-limits" warning area (an area where only access by a
person is allowed), this object is judged as an illegal object.
[Decision-Rule Name: Person Inside a Vehicle-Specific Area]
[0120] If an object is detected as a person and exists in a
"vehicle-specific" warning area, this object is judged as an
illegal object.
[Decision-Rule Name: U-Turn]
[0121] In every processing frame, the distance between a position
on the locus of an object and the current position of the object is
calculated; if this distance becomes smaller than the distance in
the previous processing frame, a "U-turn counter" of this object is
incremented, while if this distance becomes larger, the counter is
decremented. If this counter value exceeds a threshold ("the object
almost stops within the threshold number of processing frames"), it
is determined that this object is making a U-turn. More preferably,
the locus to which a smoothing filter, a moving-average filter, a
Kalman filter, or the like has been applied is used, and a reversal
of the velocity vector is judged at intervals of from several
tenths of a second to several seconds.
[Decision-Rule Name: Fixed Time Zone]
[0122] Upon detection of an object within a fixed time zone, a time
zone counter k3 of the object will increase. The time zone counter
of the object will never decrease. If the counter exceeds a
threshold k3_max, the object is judged as having stayed near a
vehicle for a long time, and an alarm is sounded.
[Decision-Rule Name: Vehicle Stoppage (Temporary Time Zone)]
[0123] If an object is detected as a vehicle, and further is
detected as being stopped, then a temporary time zone is created
around the object (the outer circumference of the object cluster is
expanded in the upward, downward, left, and right directions by the
amount of a half the object size). In the time zone, it takes some
time for the vehicle to become the background (this period is
referred to as a time zone adaptation period). Subsequently, the
zone becomes effective and the judgment operation is started. If
the object is detected as a person in the time zone, the time zone
counter k3 of the object will increase. The time zone counters of
the object will never decrease. If the counter exceeds the
threshold k3_max, the object is judged as having stayed near a
stopped vehicle for a long time, and an alarm is sounded. If a
vehicle is detected inside the time zone, a time zone removal
process is started. It takes a while until the background within
the zone is updated. During this period, within this time zone an
alarm will not be issued. Upon completion of an "adaptation period
for return", the time zone is deleted. In a processing frame in
which the speed of a stopped object/an object moving at low speed
falls below a threshold, a low-speed movement counter k2 is
incremented. In a processing frame in which the speed of the object
exceeds the threshold, the low-speed movement counter k2 is
decremented. If the counter value exceeds a threshold ("the object
almost stops in the threshold number of processing frames"), it is
determined that the object has stopped.
[Decision-Rule Name: Abandoned/Taken Away Object]
[0124] If a split of an object is detected (although there was one
object in the previous processing frame, the object is now observed
as two or more objects at the relevant position), all the "Split
flags" of these objects are turned on. If it is determined that one
of the objects has stopped and the split flag is being turned on,
then this object is judged as an "abandoned or taken away
object".
Embodiment 2
[0125] An intrusion alarm video-processing device of Embodiment 2
differs from Embodiment 1 in that TSV (Temporal Spatio-Velocity)
transform or the like is used for object tracking. The device of
Embodiment 1 is preferable for detecting an intrusion of a certain
object (a vehicle, a boat, a person) into a place where there are
usually no people, while the device of Embodiment 2 is intended to
detect an object exhibiting questionable behavior among ordinary
objects passing by.
[0126] TSV transform is based on the three-dimensional Hough
transform with regard to spatio-temporal images such as consecutive
time sequence frames. In this embodiment intended to obtain the
locus of an object, a linear Hough transform is used. That is,
lines are detected from a pixel value space defined on the three
dimensions of two spatial dimensions (the vertical direction and
horizontal direction of an original image) and a time dimension. As
the image to be TSV transformed (referred to as the initial
detection image), a sequence of difference images between adjacent
frames as follows is used.
$$S(x, y, n) = \begin{cases} 1 & \text{if } I(x, y, n) - I(x, y, n-T) > Th \\ 0 & \text{otherwise} \end{cases} \qquad (13)$$

where S(x, y, n) denotes the initial detection image of the n-th
frame, I(x, y, n) denotes the input image of the n-th frame, T
denotes a time constant, and Th denotes a threshold (constant).
Other than Equation (13), a contour detection image or the
background difference image of Embodiment 1 may be used as the
initial detection image.
[0127] In order to improve the quality, a 1×3 AND operator is
applied to all the pixels of S(x, y, n) to obtain S*(x, y, n).

$$S^*(x, y, n) = S(x, y-1, n)\;\&\;S(x, y, n)\;\&\;S(x, y+1, n) \qquad (13')$$

[0128] The notation of the TSV transform is defined as follows.

$$V_n(x, y, v_x, v_y) = \mathrm{TSV}\{S^*(x, y, n)\} \qquad (14)$$

In the Hough transform of this embodiment, an exponential decay
filtering is applied to S*(x, y, n) in advance so that the weight
of a vote decreases as the frame becomes older.

$$L_{n_p}(x, y, n) = S^*(x, y, n)\, F_{n_p}(n) \qquad (15)$$

where S*(x, y, n) denotes the binary image of the n-th image frame,
n_p denotes the index of the current image frame, and F_{n_p}(n)
denotes the filter expressed by Equation (16) below, with n ≤ n_p.

$$F_{n_p}(n) = \begin{cases} (1 - e^{-\lambda})\, e^{\lambda (n - n_p)} & n \leq n_p \\ 0 & n > n_p \end{cases} \qquad (16)$$
[0129] The Hough transform with respect to LineA in time-space is
expressed by Equation (17) below.

$$\begin{pmatrix} x \\ y \end{pmatrix} = (n - n_p)\begin{pmatrix} v_x \\ v_y \end{pmatrix} + \begin{pmatrix} p_x \\ p_y \end{pmatrix}$$
$$V_{n_p}(p_x, p_y, v_x, v_y) = \mathrm{Hough}_{LineA}\{L_{n_p}(x, y, n)\} = \sum_n L_{n_p}\bigl(v_x (n - n_p) + p_x,\; v_y (n - n_p) + p_y,\; n\bigr) \qquad (17)$$

where (x, y) denotes a coordinate, (v_x, v_y) denotes a velocity,
(p_x, p_y) denotes a reference position (e.g., the position in the
current frame of a known object), and LineA denotes a line passing
through the point (p_x, p_y) with gradient (v_x, v_y). The value of
V_{n_p} denotes the likelihood of the relevant line at time point n_p.
[0130] In the case of the exponential-function expression, V_{n_p}
can be described using the recurrence relation below.

$$V_{n_p}(p_x, p_y, v_x, v_y) = e^{-\lambda}\, V_{n_p-1}(p_x - v_x,\; p_y - v_y,\; v_x, v_y) + (1 - e^{-\lambda})\, S^*(x, y, n_p) \qquad (18)$$

p_x, p_y, v_x, and v_y are discretized to define cells, Equation
(18) is totaled within each cell and is then binarized to true or
false using an appropriate threshold, and the binarized result is
defined as V*_{n_p}(p_x, p_y, v_x, v_y).
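A sketch of the recursion of Equation (18) over a discretized velocity grid, assuming the position grid equals the pixel grid and velocities are integer pixel shifts per frame (an illustrative discretization; the border wrap-around of np.roll is ignored):

```python
import numpy as np

def tsv_update(V, s_star, velocities, lam=0.1):
    """One recursion of Eq. (18): V[k] accumulates evidence for velocity velocities[k]."""
    decay = np.exp(-lam)
    V_new = np.empty_like(V)
    for k, (vx, vy) in enumerate(velocities):
        # Shift the previous accumulator by (vx, vy) so each cell follows its own line.
        shifted = np.roll(np.roll(V[k], vy, axis=0), vx, axis=1)
        V_new[k] = decay * shifted + (1.0 - decay) * s_star
    return V_new
```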
[0131] Here, an inclined cylindrical equation of a movement model
below is introduced.

$$C(x, y, n) = \frac{(x - a_x n^2 - v_x n - p_x)^2}{R_x^2} + \frac{(y - a_y n^2 - v_y n - p_y)^2}{R_y^2} \qquad (19)$$

where the center of the cylinder is
(a_x n² + v_x n + p_x, a_y n² + v_y n + p_y), and the horizontal
radius and the vertical radius are denoted by R_x and R_y,
respectively. The parameters of the cylinder are defined by Equation
(20) below.
a x = .sigma. t 2 .tau. t , t 2 - .tau. t , t 2 .tau. t , x .sigma.
t 2 2 .sigma. t 2 - .tau. t , t 2 2 , a y = .sigma. t 2 .tau. t , t
2 - .tau. t , t 2 .tau. t , y .sigma. t 2 2 .sigma. t 2 - .tau. t ,
t 2 2 , v x = .tau. t , t 2 .tau. t 2 , x - .sigma. t 2 2 .tau. t ,
x .sigma. t 2 2 .sigma. t 2 - .tau. t , t 2 2 , v y = .tau. t , t 2
.tau. t 2 , y - .sigma. t 2 2 .tau. t , y .sigma. t 2 2 .sigma. t 2
- .tau. t , t 2 2 , p x = x _ - a x t _ 2 - v x t _ , p y = y _ - a
y t _ 2 - v y t _ R x 2 = a x 2 .sigma. t 2 2 + v x 2 .sigma. t 2 -
.sigma. x 2 + 2 a x v x .tau. t , t 2 - 2 v x .tau. t , x - 2 a x
.tau. t 2 , x , R y 2 = a y 2 .sigma. t 2 2 + v y 2 .sigma. t 2 -
.sigma. y 2 + 2 a y v y .tau. t , t 2 - 2 v y .tau. t , y - 2 a y
.tau. t 2 , y ( 20 ) ##EQU00008##
Where, σ_k² denotes the variance along the k axis, τ_{k,l} denotes the
covariance of k and l, and an overbar denotes the average of the
corresponding quantity.
[0132] The cylinder density denoting the validity of the cylinder
is defined by Equation (21) below.
$$r=\frac{N}{\pi R_x R_y h}\qquad(21)$$
Where, h is the height (i.e., observed time) of a cylinder, and N
is the number of TSV cells with a true value inside the
cylinder.
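The sketch below illustrates one way the cylinder parameters and the density
of Equation (21) could be computed from the (t, x, y) coordinates of TSV
cells judged true. It is an interpretation only: Equation (20) is read here
as a centred-moment quadratic least-squares fit, and the radii are taken
from the fit residuals rather than from the closed form above.

import numpy as np

def cylinder_fit(t, x, y):
    """Fit the inclined-cylinder movement model of Equations (19)-(21)."""
    t = np.asarray(t, float); x = np.asarray(x, float); y = np.asarray(y, float)
    t2 = t * t
    var = lambda a: np.mean((a - a.mean()) ** 2)
    cov = lambda a, b: np.mean((a - a.mean()) * (b - b.mean()))
    denom = var(t2) * var(t) - cov(t, t2) ** 2

    def axis_params(u):                       # u is x or y
        a = (var(t) * cov(t2, u) - cov(t, t2) * cov(t, u)) / denom
        v = (var(t2) * cov(t, u) - cov(t, t2) * cov(t2, u)) / denom
        p = u.mean() - a * t2.mean() - v * t.mean()
        resid = u - (a * t2 + v * t + p)
        return a, v, p, np.sqrt(np.mean(resid ** 2))   # radius from residual

    ax, vx, px, Rx = axis_params(x)
    ay, vy, py, Ry = axis_params(y)
    h = max(t.max() - t.min(), 1.0)           # observed time (cylinder height)
    r = len(t) / (np.pi * Rx * Ry * h)        # cylinder density, Equation (21)
    return (ax, vx, px, Rx), (ay, vy, py, Ry), r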
[0133] In this embodiment, the initial detection based on the inter-frame
difference described above is performed in parallel with the initial
detection based on the background difference of Steps 106 to 115 of
Embodiment 1. Moreover, Steps 120 and 121 of Embodiment 1 are omitted so
that the flow moves directly from Step 119 to Step 122, and the TSV
transform is performed in parallel therewith. In Step 122, the locus
information obtained by the TSV transform is compared with the array
"Merge" obtained in Step 119, and the same processing as that of
Embodiment 1 is performed.
Embodiment 3
[0134] An intrusion alarm video-processing device of Embodiment 3
differs from Embodiment 1 in that a skeleton processing is
performed in place of or in addition to the segmentation and merge
processings of Steps 118, 119 of Embodiment 1. The skeleton
processing includes a process to obtain shape information of the
initial detection zone by a thinning process or skeleton processing
with respect to the binary image, a process to extract main axes
from the shape information, and a process to extract the axes of
the object from the extracted axes.
[0135] An image skel (A) obtained by performing the skeleton
processing to an arbitrary image A is expressed by Equation (22)
below.
$$\mathrm{skel}(A)=\bigcup_{k=0}^{K-1}\bigl\{\mathrm{er}(A,kB)-\mathrm{open}(\mathrm{er}(A,kB),B)\bigr\}\qquad(22)$$

Where, B denotes a structuring element (which is preferably circular),
er(A, kB) denotes an operation of eroding A k times with B, and open(A, B)
denotes an operation of opening A with B.
[0136] In this embodiment, as the image A, a binary image is used
which is clipped with the preliminary detection zone (circumscribed
rectangle) obtained in Step 117.
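A minimal sketch of Equation (22) using OpenCV morphological operations is
shown below; the elliptical structuring-element size is an assumption, and
the loop runs until the eroded image is empty, which corresponds to taking K
as the largest useful erosion count.

import cv2
import numpy as np

def morphological_skeleton(A, ksize=3):
    """Morphological skeleton of Equation (22): the union over k of
    er(A, kB) minus its opening with B.  A is a binary (0/255) image."""
    B = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ksize, ksize))
    skel = np.zeros_like(A)
    eroded = A.copy()
    while cv2.countNonZero(eroded) > 0:
        opened = cv2.morphologyEx(eroded, cv2.MORPH_OPEN, B)
        skel = cv2.bitwise_or(skel, cv2.subtract(eroded, opened))
        eroded = cv2.erode(eroded, B)        # next term: er(A, (k+1)B)
    return skel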
[0137] FIGS. 5A to 5F show an example of the skeleton processing of this
embodiment. FIG. 5A shows an image of the
current frame clipped with a preliminary detection zone containing
an object (person), FIG. 5B shows a difference image corresponding
to FIG. 5A, FIG. 5C shows a binary image of FIG. 5B, and FIG. 5D
shows a thinned (filamented) image by the skeleton processing of
FIG. 5C. In FIG. 5D, short thin lines are cleaned up (deleted) and
the remaining basic thin lines are approximated with two bands
having a constant width.
[0138] FIG. 5E is the result of this approximation, showing the connected
boundary of the bands. Using these bands, the basic axes of a person and of
its shadow can be determined, and furthermore the angle of each axis with
respect to the vertical direction can be calculated. If the angle of one
band is approximately zero (almost vertical) and the angle of the other band
is within a predetermined range, the latter is determined to be a shadow.
The binary image is divided by the connected boundary, and the portion on
the shadow side is filled with a false value, so that an image containing
only the person, as shown in FIG. 5F, is obtained.
[0139] After the object mask is corrected, the processes after Step
120 can be continued as in Embodiment 1.
Embodiment 4
[0140] An intrusion alarm video-processing device of Embodiment 4
performs a process (hereinafter, referred to as an OS processing)
to extract a purer object from a preliminary detection zone in
place of the segmentation and merge processings of Steps 118, 119
of Embodiment 1.
[0141] In this embodiment, the preliminary detection zone is a
rectangular area that contains an object candidate within the
binary images obtained by initial detection of an object, the
rectangular area having horizontal and vertical sides. Hereinafter, this
preliminary detection zone is referred to as DZ. The goal of the
segmentation process in DZ is to extract the pixels of a "pure" object,
i.e., an object image containing no background pixels, as the recognition
image. Mathematically, a matrix of image pixels in DZ is the input to the
object-area splitting process of DZ, and a matrix of object pixels of DZ
without the background is its output. The image matrix is typically a matrix
of three-dimensional pixel vectors comprising RGB components, corresponding
to the matrix of pixels within DZ in the original image.
[0142] The OS process of this embodiment is a combination of the
following three methods.
[0143] (1) A difference analysis method performing difference
analysis on an image fragment in which an object is detected and an
image in which an object does not exist (background image) in
DZ.
[0144] (2) An image fragment extraction method based on brightness,
color, texture, or the like.
[0145] (3) A segmentation and shadow clipping method.
[0146] FIG. 6 is a flowchart of the above method (1), which is performed for
each DZ.
[0147] As Step 201, it is determined whether or not the background
(precisely speaking, the background that can be divided in Steps
202 to 206) is contained in a target DZ. If the background is not
contained, the flow moves to Step 207 because Steps 202 to 206 are
meaningless.
[0148] As Step 202, filter processing of the current image and
background image in DZ is performed. This process includes a median
filter processing, the so-called cell discretization processing
(hereafter, referred to as a CD (Cellular Dilation) processing) by
image expansion processing, and a low pass filter processing
(smoothing process).
[0149] The CD processing comprises a process to convert each pixel of the
original image into a square image fragment of similar pixels, including the
two, three, or more pixels surrounding the relevant pixel. This process is
useful for keeping the size of DZ as small as possible.
[0150] If a combination of the median filter processing (to be
performed before CD) and the low pass filter processing (to be
performed after CD) is used in the CD processing, the enlargement
of an image in DZ and the reconstruction of a small and low quality
image to a certain level can be performed simultaneously. These
processes are simultaneously carried out to the current image frame
and the background image (reference image) frame in each DZ,
respectively.
[0151] As Step 203, a difference frame (DF) in DZ is created and
processed. This is performed by two separate processes: a process
to create DF in each DZ from the filtered current image (containing
the object) and the background image (not containing the object) by
Step 202; and a DF binarizing process using an appropriate pixel
value threshold. The DF creating process is a simple subtraction
processing of each element of a filtered image matrix with respect
to the current image and the background image in DZ. In processing a color
image, the difference between pixel vectors is evaluated as the magnitude of
the difference vector. In the binarization processing, the same processing
as that of Step 109 of Embodiment 1 is performed using a predetermined
threshold.
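A minimal Python sketch of Steps 202 and 203 for a single DZ follows; the
nearest-neighbour enlargement stands in for the CD processing, and the
kernel sizes, scale factor, and threshold are placeholder values, not values
specified in the text.

import cv2
import numpy as np

def difference_frame(current_dz, background_dz, scale=3, threshold=30):
    """Step 202 filtering followed by the Step 203 difference frame (DF)."""
    def prepare(img):
        img = cv2.medianBlur(img, 3)                       # median filtering
        img = cv2.resize(img, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_NEAREST)  # pixel -> square fragment
        return cv2.GaussianBlur(img, (5, 5), 0)            # low-pass smoothing
    cur, bg = prepare(current_dz), prepare(background_dz)
    diff = cv2.absdiff(cur, bg)
    if diff.ndim == 3:                                     # colour: vector magnitude
        diff = np.linalg.norm(diff.astype(np.float32), axis=2)
    return (diff > threshold).astype(np.uint8)             # binarized DF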
[0152] As Step 204, a connected-area extracting process is
performed. The connected (integrated) area extracting process is a
process to extract an area connected as one block within each DZ
and having a predetermined size (number of pixels) or a size larger
than the predetermined size. This process is the same as that of
Step 112 of Embodiment 1.
[0153] As Step 205, an effective area is extracted from the plurality of
connected areas extracted in Step 204. As the candidate for the effective
area, the largest connected area (judged by the number of pixels) is
selected and denoted as ArM. Then, a process to fill any holes existing
within ArM is carried out.
[0154] To do this, first, the reverse image containing only ArM is created.

[0155] Next, a connected area that is not adjacent to the boundary of DZ is
extracted from the created reverse image. Since this area is a hole, ArM is
corrected by filling this area with "true".
[0156] By taking the hole-filled area into consideration, useful geometric
information on an object can be obtained for recognition or removal.
Nevertheless, a simple, connected object area is still required to obtain
useful features (in particular, skeleton information in the object area).
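The following Python sketch illustrates Steps 204 and 205 (largest connected
area and hole filling) using scipy.ndimage; the minimum area size is an
assumed parameter and the function name is illustrative.

import numpy as np
from scipy import ndimage

def extract_effective_area(binary_dz, min_size=50):
    """Keep the largest connected area ArM and fill its holes by inverting
    the image and discarding the inverse components touching the DZ boundary."""
    labels, n = ndimage.label(binary_dz)
    if n == 0:
        return None
    sizes = ndimage.sum(binary_dz, labels, range(1, n + 1))
    if sizes.max() < min_size:
        return None
    arm = labels == (np.argmax(sizes) + 1)          # largest connected area ArM

    inv_labels, _ = ndimage.label(~arm)             # reverse image of ArM
    border = np.zeros_like(arm, bool)
    border[0, :] = border[-1, :] = border[:, 0] = border[:, -1] = True
    border_ids = np.unique(inv_labels[border])
    holes = ~arm & ~np.isin(inv_labels, border_ids) # inverse areas not touching the boundary
    return arm | holes                              # ArM with holes filled with "true"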
[0157] As Step 206, it is determined whether an effective area could be
extracted in Step 205; if the effective area could be extracted, the flow
moves to Step 212, and otherwise it moves to Step 207.
[0158] As Step 207, segmentation based on brightness (luminance) is
performed. For example, the value of Y of the YUV format or V of HSV is
discretized, and all the pixels within DZ are sorted into groups of these
discrete values. The sorted pixels are converted into connected areas by
spatial filtering.
[0159] As Step 208, segmentation based on color is performed in the same
manner as in Step 207.
[0160] As Step 209, DZ is segmented into square blocks of several pixels, a
texture value is calculated for each block, and areas are then formed by
grouping the blocks using the texture values.
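A minimal sketch of the brightness-based segmentation of Step 207 is given
below (Steps 208 and 209 follow the same pattern with color and texture
values); the number of levels and the minimum area size are assumptions.

import numpy as np
from scipy import ndimage

def brightness_segmentation(dz_bgr, levels=8, min_size=30):
    """Discretize luminance into a few levels, group pixels by level, and
    turn each group into connected areas by spatial labelling."""
    # Y of YUV computed from BGR components (Rec. 601 weights)
    y = (0.114 * dz_bgr[..., 0] + 0.587 * dz_bgr[..., 1] + 0.299 * dz_bgr[..., 2])
    bins = np.minimum((y / 256.0 * levels).astype(int), levels - 1)
    areas = []
    for level in range(levels):
        labels, n = ndimage.label(bins == level)
        for i in range(1, n + 1):
            mask = labels == i
            if mask.sum() >= min_size:          # keep areas of a predetermined size
                areas.append(mask)
    return areas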
[0161] As Step 210, a plurality of effective area candidates are created
from a combination of the segmentations of Steps 207 to 209, based on a
predetermined rule.
[0162] As Step 211, from a plurality of effective area candidates,
one effective area is extracted based on a predetermined scale
(e.g., area size).
[0163] As Step 212, shadow detection and segmentation, and shadow
clipping are performed using, for example, the same skeleton
processing as that of Embodiment 3.
[0164] As Step 213, the corrected object mask is applied to the
current image to obtain an image matrix of only the actual
object.
Embodiment 5
[0165] In an intrusion alarm video-processing device of Embodiment
5, the setup process of Step 104 of Embodiment 1 is improved.
(1) Equipment Configuration of this Embodiment
[0166] The configuration of an image processing device is shown in
FIG. 9. This monitor device comprises an imaging unit 501, a video
input unit 502, an image processor 503, a program memory 504, a
work memory 505, an external I/F circuit 506, a video output
circuit 507, a data bus 508, an indicator unit 509, and a display
unit 510.
(2) Method for Specifying the Monitoring Conditions According to
this Embodiment
[0167] Examples of the monitoring conditions in this embodiment are shown in
FIG. 7 and FIG. 8. FIG. 7 is a script for monitoring violations of the
running speed and running direction of a vehicle, wherein if the vehicle
runs at no more than a predetermined speed and in a predetermined direction,
the vehicle is permitted (i.e., the vehicle is neither an object to be
alarmed nor an object to be monitored), and otherwise the vehicle is
prohibited (i.e., the vehicle is an object to be alarmed and monitored).
FIG. 8 shows an intermediate script obtained by subjecting the monitoring
conditions specified in the script format to lexical analysis in the image
processor 503. In the intermediate script, ":=" is an operator denoting a
definition; the left side (left-side value) of ":=" denotes the target to be
defined, and the right side (right-side value) of ":=" denotes the
conditions of the definition. Moreover, "=" is an operator denoting a
comparison; the left-side value of "=" denotes information on an object, and
the right-side value of "=" denotes a condition value set by the user.
[0168] (Examples of a list of operators and a list of information
on an object, and an example of a conversion procedure from the
script to the intermediate script are to be supplemented)
(3) Generation of a Decision Table, and Judgment Using the Decision
Table
[0169] FIG. 10 shows an example of a decision table.
[0170] In this embodiment, since the judgment condition comprises a
combination of a plurality of conditions, whether a detected object meets
the monitoring conditions is determined using a decision table as shown in
FIG. 10. Here, for simplicity of description, a case is shown in which a
decision table is created using two pieces of information, the width and the
height of a detected object, so that it can be judged whether the detected
object (as an example, an object 3 m in width and 1.5 m in height) meets a
condition 401, in other words, whether the detected object can be judged to
be a [CAR]. First, in FIG. 8, since the conditions with regard to the width
of the condition 401 are [WIDTH]=[no less than 2 m] and [WIDTH]=[less than
5 m], the [WIDTH] axis of the decision table, i.e., the horizontal axis, is
equally divided into five subdivisions, which are labeled [less than 2 m],
[2 m], [less than 5 m], [5 m], and [over 5 m], respectively. The reason why
there are five labels is that the conditions of [WIDTH] consist of two
condition values, [no less than 2 m] and [less than 5 m], and boundary
portions for discriminating "no less than" and "less than" from the others
need to be included. Moreover, if the single condition value [WIDTH]=[no
less than 2 m] were sufficient, only three subdivisions would be provided.
Accordingly, the maximum number of subdivisions is twice the number of
condition values plus one. Next, each portion meeting the condition is
filled with 1 (e.g., reference number 603), and each portion not meeting the
condition is filled with -1 (e.g., reference number 602). If this is also
performed for the [HEIGHT] axis, the decision table 601 shown in FIG. 10 is
obtained. Next, since the detected object is 3 m in width and 1.5 m in
height, the corresponding portion, indicated by reference number 603, is
filled with 1 in this decision table, and it can be judged that the detected
object meets the condition. Even if the number of conditions increases, only
the number of axes of the decision table and the number of subdivisions of
each axis need to be changed, and in practice the data size remains small
enough to be stored in the work memory. Moreover, in this method, whether a
condition is met or not is expressed using values such as -1 and 1; however,
a condition on which no decision is made ("don't care") may be expressed
using another value (e.g., 0 or the like).
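The sketch below builds such a decision table and judges the 3 m × 1.5 m
example object. The [HEIGHT] condition values of condition 401 are not given
in the text, so the 1 m and 3 m bounds used here are hypothetical, as are
the helper names.

import numpy as np

# Hypothetical condition 401: a CAR is 2 m <= width < 5 m and 1 m <= height < 3 m.
WIDTH_BOUNDS = [2.0, 5.0]
HEIGHT_BOUNDS = [1.0, 3.0]
width_ok = lambda w: 2.0 <= w < 5.0
height_ok = lambda h: 1.0 <= h < 3.0

def subdivisions(bounds, margin=0.5):
    """Representative values of the 2*len(bounds)+1 axis subdivisions:
    below the first boundary, each boundary, between boundaries, above the last."""
    reps = [bounds[0] - margin]
    for lo, hi in zip(bounds, bounds[1:]):
        reps += [lo, (lo + hi) / 2.0]
    reps += [bounds[-1], bounds[-1] + margin]
    return reps

def locate(value, bounds):
    """Index of the axis subdivision that contains a measured value."""
    k = sum(1 for b in bounds if b < value)
    return 2 * k + 1 if value in bounds else 2 * k

# Fill the decision table with 1 (condition met) / -1 (condition not met);
# rows follow the HEIGHT axis, columns follow the WIDTH axis.
table = np.array([[1 if width_ok(w) and height_ok(h) else -1
                   for w in subdivisions(WIDTH_BOUNDS)]
                  for h in subdivisions(HEIGHT_BOUNDS)])

# Judge a detected object of 3 m width and 1.5 m height.
is_car = table[locate(1.5, HEIGHT_BOUNDS), locate(3.0, WIDTH_BOUNDS)] == 1
print(is_car)   # True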
[0171] According to this embodiment, the monitoring conditions can be
specified with a readable, simple sentence (script), and furthermore, a
plurality of conditions are configured so as to be judged logically, thereby
allowing more complicated conditions to be specified than in the related
art, and further allowing a simple and correct specification to be made.
[0172] (If special monitoring conditions are configured in advance
so as to be downloaded via a network, services flexibly
corresponding to various monitoring environments can be realized,
thereby also allowing a business model to be constructed.)
Embodiment 6
[0173] (1) Equipment Configuration of this Embodiment
[0174] The equipment configuration and basic operation of
Embodiment 6 are the same as those of Embodiment 5.
(2) Setting of a Monitor Area in this Embodiment
[0175] For a monitor area 1301, information on the area desired to be
monitored is indicated on a map in the scene coordinate system (a second
coordinate system that is parallel to the ground and is similar to a map)
using the indicator unit (FIG. 12). The height information of the area
desired to be monitored is provided as a numerical value or the like. Since
the height information corresponds to the z-axis coordinate of the scene
coordinate system (where the height of the xy plane is 0), the height
information can be provided as an actual value (2 m, 3 feet, or the like)
without relying on the apparent height.
[0176] An indication of the monitor area 1301 may be made directly
to the camera coordinate system, such as an input image or the like
(FIG. 11). The height of the area desired to be monitored may be
preset in advance. The monitor area may be indicated using a circle or a
line in addition to a polygon, and the processed area can be specified using
various patterns, such as a cylindrical shape, a spherical shape, or a
vertical plane.
(3) Method of Calculating the Distance Between a Camera and a Point
on a Monitor Area, in this Embodiment
[0177] A position (x', y') in the camera coordinate system is
converted to a position (x, y) in the scene coordinate system.
[0178] Since the scene coordinate system is similar to a map, the distance
between the camera and a point on the monitor area is √(x² + y²) when the
origin O of the scene coordinate system is the position of the camera.
(4) Method of Calculating an Apparent Height of a Target Object to be
Monitored at the Above-Described Point

The camera coordinates of the upper side of the target object are denoted as
(x'_head, y'_head), and the camera coordinates of the lower side of the
target object are denoted as (x'_legs, y'_legs).

[0179] First, the following quantities are calculated in accordance with the
conversion equation to the scene coordinates using the camera installation
conditions.

[0180] Scene coordinates of the upper side of the target object: (x_head, y_head)

[0181] Angle of depression for imaging the upper side: θy_head

[0182] Scene coordinates of the lower side of the target object: (x_legs, y_legs)

[0183] Angle of depression for imaging the lower side: θy_legs

[0184] Rotation angle: θx = θx_head = θx_legs

[0185] Distance between the camera and the lower side of the target object:
D_legs = √(x_legs² + y_legs²)

[0186] Distance between the camera and the upper side of the target object
on the scene coordinates: D_head = √(x_head² + y_head²)
[0187] FIG. 14 shows an example of imaging a target object to be
monitored 1601.
[0188] According to FIG. 14, the height of a target object (Height)
is geometrically calculated by Equation (1-1) below.
Height = (D_head − D_legs) / tan(90° − θy_head)  (1-1)
(5) Method of Calculating an Apparent Height by Converting to the
Scene Coordinate System
[0189] The apparent height is calculated (i.e., by back calculation of the
above (4)) to determine at which position (x_head, y_head) on the scene
coordinates the height information (Height) at a point (x_legs, y_legs) on
the monitor area appears.

[0190] θy_head can be expressed as follows using the installation height H
of the imaging unit.

tan(θy_head) = (H − Height) / D_legs, i.e., tan(90° − θy_head) = D_legs / (H − Height)  (1-2)
Equation (1-1) is transformed and then Equation (1-2) is substituted into
it, which gives

[0191] D_head = (Height · D_legs) / (H − Height) + D_legs

Accordingly, the coordinates (x_head, y_head) of the upper side of the
monitor area can be calculated as follows.

x_head = D_head · cos(θx)

y_head = −D_head · sin(θx)
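A minimal sketch of the back calculation in (4) and (5) follows; it assumes
the camera is at the scene-coordinate origin and derives the rotation angle
θx from the indicated point, and the function name is illustrative.

import math

def apparent_head_position(x_legs, y_legs, height, camera_height):
    """Given a monitor-area point (x_legs, y_legs) in scene coordinates, the
    area height and the camera installation height H, find where the top of
    the area appears on the scene plane (Equations (1-1)/(1-2))."""
    d_legs = math.hypot(x_legs, y_legs)
    # D_head = Height * D_legs / (H - Height) + D_legs
    d_head = height * d_legs / (camera_height - height) + d_legs
    theta_x = math.atan2(-y_legs, x_legs)   # rotation angle, common to head and legs
    x_head = d_head * math.cos(theta_x)
    y_head = -d_head * math.sin(theta_x)
    return x_head, y_head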
Moreover, the camera coordinates (x'_head, y'_head) can also be calculated
in accordance with the coordinate conversion, and the apparent height in the
camera coordinate system can likewise be expressed easily.

(6) Method of Generating a Processed Area from a Monitor Area Based on the
Apparent Height
[0192] The respective apparent heights are calculated from each coordinate
of the monitor area 1301 indicated in (2) and from the height information of
the monitor area. By setting both the coordinates given by the apparent
heights and the coordinates of the indicated monitor area as a processed
area (FIG. 13), a three-dimensional processed area 1401 can be created that
takes the height of the monitor area 1301 into consideration.
[0193] In this embodiment, by setting a monitor area (on a map), a
three-dimensional processed area taking into consideration the
height of the monitor area can be automatically set up, and
therefore, simple area setting without relying on the apparent size
can be realized. Moreover, since the area setting by actually
measuring the height of an object reflected in an input image is
not required, the complexity of setting can be reduced.
[0194] Moreover, a monitor area can be set up in the scene
coordinate system, and a coordinate on a map can be used, as it is,
in setting an area. Furthermore, efficient area setting and
intruder monitoring in combination with the previous applications,
such as sharing of monitor areas between multiple monitoring
devices, are possible.
[0195] It should be further understood by those skilled in the art
that although the foregoing description has been made on
embodiments of the invention, the invention is not limited thereto
and various changes and modifications may be made without departing
from the spirit of the invention and the scope of the appended
claims.
* * * * *