U.S. patent application number 14/606,469 was filed with the patent office on 2015-01-27 and published on 2016-07-28 as publication number 20160217575 for model-less background estimation for foreground detection in video sequences. The applicant listed for this patent is Xerox Corporation. Invention is credited to Edgar A. Bernal and Qun Li.

United States Patent Application 20160217575
Kind Code: A1
Bernal; Edgar A.; et al.
July 28, 2016

MODEL-LESS BACKGROUND ESTIMATION FOR FOREGROUND DETECTION IN VIDEO SEQUENCES
Abstract
A camera outputs video as a sequence of video frames having
pixel values in a first (e.g., relatively low dimensional) color
space, where the first color space has a first number of channels.
An image-processing device maps the video frames to a second (e.g.,
relatively higher dimensional) color representation of video
frames. The mapping causes the second color representation of video
frames to have a greater number of channels relative to the first
number of channels. The image-processing device extracts a second
color representation of a background frame of the scene. The
image-processing device can then detect foreground objects in a
current frame of the second color representation of video frames by
comparing the current frame with the second color representation of
a background frame. The image-processing device then outputs an
identification of the foreground objects in the current frame of
the video.
Inventors: Bernal, Edgar A. (Webster, NY); Li, Qun (Webster, NY)
Applicant: Xerox Corporation, Norwalk, CT, US
Family ID: 56432719
Appl. No.: 14/606,469
Filed: January 27, 2015
Current U.S. Class: 1/1
Current CPC Class: G06T 7/254 (20170101); G06T 7/90 (20170101); G06T 2207/10016 (20130101); G06T 2207/10024 (20130101); G06T 2207/20032 (20130101); G06T 2207/20081 (20130101); G06T 2207/30232 (20130101)
International Class: G06T 7/00 (20060101); H04N 9/04 (20060101); G06T 7/40 (20060101)
Claims
1. A system comprising: an image-processing device; and a camera
operatively connected to said image-processing device, said camera
outputting video of a scene being monitored and being in a fixed
position relative to said scene, said camera outputting said video
as a sequence of video frames having pixel values in a first color
space, said first color space having a first number of channels,
said image-processing device mapping said video frames to a second
color representation of video frames, said second color
representation of video frames having a larger second number of
channels relative to said first number of channels, said
image-processing device producing a second color representation of
a background frame of the scene, said image-processing device
detecting at least one foreground object in a current frame of
video frames by comparing said second color representation of said
current frame with said second color representation of said
background frame, and said image-processing device outputting an
identification of said at least one foreground object in said
current frame of said video.
2. The system according to claim 1, said second color
representation of video frames having more color discrimination
relative to said video frames having pixel values in said first
color space.
3. The system according to claim 1, pixel values in said second
color representation of video frames being represented by vectors
having a greater vector length relative to pixel values in said
first color space.
4. The system according to claim 1, said image-processing device
extracting said second color representation of a background frame
by: obtaining a frame of said second color representation of video
frames when no foreground objects are present; filtering moving
objects from said second color representation of video frames by
identifying said moving objects as ones that change locations in
adjacent frames of said second color representation of video
frames; temporally averaging a number of incoming frames; or
temporally median filtering a number of incoming frames.
5. The system according to claim 1, said image-processing device
generating a third color representation of a background frame and said video frames, said
third color representation having at least one of a smaller number
of channels and a smaller bit depth relative to the second color
representation, said third color representation being obtained via
a dimensionality reduction technique, and said third color
representation preserves photometric invariance and discriminative
attributes of the second color representation.
6. A system comprising: an image-processing device; and a camera
operatively connected to said image-processing device, said camera
outputting video of a scene being monitored and being in a fixed
position relative to said scene, said camera outputting said video
as a sequence of video frames having pixel values in a first color
space, said first color space having a first number of channels and
a first number of bits per channel, said image-processing device
mapping said video frames to a second color representation of video
frames, said second color representation of video frames having a
larger second number of channels relative to said first number of
channels, and a different second number of bits per channel
relative to said first number of bits per channel, said
image-processing device producing a second color representation of
a background frame of the scene, said image-processing device
detecting at least one foreground object in a current frame of
video frames by comparing said second color representation of said
current frame with said second color representation of said
background frame, and said image-processing device outputting an
identification of said at least one foreground object in said
current frame of said video.
7. The system according to claim 6, said second color
representation of video frames having more color discrimination
relative to said video frames having pixel values in said first
color space.
8. The system according to claim 6, pixel values in said second
color representation of video frames being represented by vectors
having a greater vector length relative to pixel values in said
first color space.
9. The system according to claim 6, said image-processing device
extracting said second color representation of a background frame
by: obtaining a frame of said second color representation of video
frames when no foreground objects are present; filtering moving
objects from said second color representation of video frames by
identifying said moving objects as ones that change locations in
adjacent frames of said second color representation of video
frames; temporally averaging a number of incoming frames; or
temporally median filtering a number of incoming frames.
10. The system according to claim 6, said image-processing device
generating a third color representation of a background frame and said video frames, said
third color representation having at least one of a smaller number
of channels and a smaller bit depth relative to the second color
representation, said third color representation being obtained via
a dimensionality reduction technique, and said third color
representation preserves photometric invariance and discriminative
attributes of the second color representation.
11. A method comprising: outputting video of a scene being
monitored using a camera in a fixed position, said outputting said
video comprising outputting a sequence of video frames having pixel
values in a first color space, and said first color space having a
first number of channels; mapping said video frames to a second
color representation of video frames using an image-processing
device operatively connected to said camera, said mapping
transforming said pixel values in said first color space from said
first number of channels to a greater second number of channels;
producing a second color representation of a background frame of
the scene using said image-processing device; detecting at least
one foreground object in a current frame of said second color
representation of video frames by comparing said current frame with
said second color representation of said background frame using
said image-processing device; and outputting an identification of
said at least one foreground object in said current frame of said
video from said image-processing device.
12. The method according to claim 11, said second color
representation of video frames having more color discrimination
relative to said video frames having pixel values in said first
color space.
13. The method according to claim 11, pixel values in said second
color representation of video frames being represented by vectors
having a greater vector length relative to pixel values in said
first color space.
14. The method according to claim 11, said extracting said second
color representation of a background frame comprising: obtaining a
frame of said second color representation of video frames when no
foreground objects are present; filtering moving objects from said
second color representation of video frames by identifying said
moving objects as ones that change locations in adjacent frames of
said second color representation of video frames; temporally
averaging a number of incoming frames; or temporally median
filtering a number of incoming frames.
15. The method according to claim 11, further comprising generating
a third color representation of a background frame and said video frames, said third color
representation having at least one of a smaller number of channels
and a smaller bit depth relative to the second color
representation, said third color representation being obtained via
a dimensionality reduction technique, and said third color
representation preserves photometric invariance and discriminative
attributes of the second color representation.
16. A method comprising: outputting video of a scene being
monitored using a camera in a fixed position, said outputting said
video comprising outputting a sequence of video frames having pixel
values in a first color space, and said first color space having a
first number of channels and a first number of bits per channel;
mapping said video frames to a second color representation of video
frames using an image-processing device operatively connected to
said camera, said mapping transforming said pixel values in said
first color space from said first number of channels to a greater
second number of channels having a different second number of bits
per channel; producing a second color representation of a
background frame of the scene using said image-processing device;
detecting at least one foreground object in a current frame of said
second color representation of video frames by comparing said
current frame with said second color representation of a background
frame using said image-processing device; and outputting an
identification of said at least one foreground object in said
current frame of said video from said image-processing device.
17. The method according to claim 16, said second color
representation of video frames having more color discrimination
relative to said video frames having pixel values in said first
color space.
18. The method according to claim 16, pixel values in said second
color representation of video frames being represented by vectors
having a greater vector length relative to pixel values in said
first color space.
19. The method according to claim 16, said extracting said second
color representation of a background frame comprising: obtaining a
frame of said second color representation of video frames when no
foreground objects are present; filtering moving objects from said
second color representation of video frames by identifying said
moving objects as ones that change locations in adjacent frames of
said second color representation of video frames; temporally
averaging a number of incoming frames; or temporally median
filtering a number of incoming frames.
20. The method according to claim 16, further comprising generating
a third color representation of a background frame and said video frames, said third color
representation having at least one of a smaller number of channels
and a smaller bit depth relative to the second color
representation, said third color representation being obtained via
a dimensionality reduction technique, and said third color
representation preserves photometric invariance and discriminative
attributes of the second color representation.
Description
BACKGROUND
[0001] Systems and methods herein generally relate to processing
items in video frames obtained using a camera system, and more
particularly to image processors that discriminate between
background and foreground items within such video frames, without
using substantial background modeling processes.
[0002] Video-based detection of moving and foreground objects in
video acquired by stationary cameras is a core computer vision
task. Temporal differencing of video frames is often used to detect
objects in motion, but fails to detect slow-moving (relative to
frame rate) or stationary objects. Background estimation and
subtraction, on the other hand, can detect both moving and
stationary foreground objects, but is typically more
computationally expensive (both in terms of computing and memory
resources) than frame differencing. Background estimation
techniques construct and maintain statistical models describing
background pixel behavior. According to this approach, a historical
statistical model (e.g., a parametric density model such as a
Gaussian Mixture Model (GMM), or a non-parametric density model
such as a kernel-based estimate) for each pixel is constructed and
updated continuously with each incoming frame at a rate controlled
by a predetermined learning rate factor. Foreground detection is
performed by determining a measure of fit of each pixel value in
the incoming frame relative to its constructed statistical model:
pixels that do not fit their corresponding background model are
considered foreground pixels.
[0003] This approach has numerous limitations, including the
requirement for computational and storage resources, the fact that
the model takes time to converge, and the fact that there are many
parameters to tune (e.g., the learning rate, the goodness-of-fit
threshold, the number of components in each mixture model, etc.).
Once a set of parameters is chosen, the latitude of scenarios
supported by the model-based methods is limited; for example, too
slow a learning rate would mean that the background estimate cannot
adapt quickly enough to fast changes in the appearance of the
scene; conversely, too fast a learning rate would cause objects
that stay stationary for long periods to be absorbed into the
background estimate.
SUMMARY
[0004] An exemplary system herein includes an image-processing device and a camera operatively (meaning directly or indirectly) connected to the image-processing device. The camera is in a fixed position and outputs video of a scene being monitored. The camera outputs the video as a sequence of video frames having pixel values in a first (e.g., relatively low dimensional) color space, where the first color space has a first number of channels and a first number of bits per channel.
[0005] The image-processing device maps the video frames to a
second (e.g., relatively higher dimensional) color representation
of video frames. The mapping causes the second color representation
of video frames to have a relatively greater number of channels and
possibly a relatively different number of bits per channel. The
mapping causes the second color representation of video frames to
be more photometrically invariant to illumination conditions and
more color discriminative relative to the first color space.
[0006] The first color space can be, for example, a 3- or 4-dimensional color space (e.g., RGB, YCbCr, YUV, Lab, CMYK, Luv, etc.) while the second color representation can have much higher dimensionality, such as 11 dimensions (i.e., pixel values in the second color representation of video frames are represented by vectors having a greater vector length relative to pixel values in the first color space). Thus, the second color representation of video frames has more color discrimination relative to the video frames having pixel values in the first color space.
[0007] The image-processing device extracts a second color
representation of a background frame of the scene from at least one
of the second color representation of video frames. For example,
the image-processing device can extract the second color
representation of a background frame by: obtaining a frame of the
second color representation of video frames when no foreground
objects are present; filtering moving objects from the second color
representation of video frames by identifying the moving objects as
ones that change locations in adjacent frames of the second color
representation of video frames; temporally averaging a number of
incoming frames; or temporally median filtering a number of
incoming frames.
[0008] The image-processing device can then detect foreground
objects in a current frame of the second color representation of
video frames by comparing the current frame with the second color
representation of a background frame. The image-processing device
then outputs an identification of the foreground objects in the
current frame of the video.
[0009] Additionally, the image-processing device can generate a
third color representation of the background frame and the video
frames. The third color representation has a smaller number of
channels and/or a smaller bit depth relative to the second color
representation, where bit depth represents the number of bits per
channel. The third color representation can be obtained from the
second color representation via a dimensionality reduction
technique, and the third color representation largely preserves
photometric invariance and discriminative attributes of the second
color representation.
[0010] An exemplary method herein captures and outputs video of a scene being monitored using a camera in a fixed position. The video is output from the camera as a sequence of video frames that have pixel values in a first color space (e.g., RGB, YCbCr, YUV, Lab, CMYK, and Luv), where the first color space has a first number of channels and a first number of bits per channel. Also, this exemplary method maps the video frames to a second color representation of video frames using an image-processing device operatively connected to the camera. The mapping process can transform the pixel values in the first color space to be more photometrically invariant to illumination conditions.
[0011] The mapping process transforms the pixel values in the first color space from the first number of channels to a greater number of channels and, possibly, a different number of bits per channel. Thus, the mapping process causes pixel values in the second color representation of video frames to be represented by vectors having a greater vector length relative to pixel values in the first color space. This, therefore, causes the second color representation of video frames to have more color discrimination relative to the video frames having pixel values in the first color space.
[0012] This exemplary method also extracts a second color
representation of a background frame of the scene from at least one
of the second color representation of video frames (using the
image-processing device). More specifically, the process of
extracting the second color representation of a background frame
can be performed by, for example, obtaining a frame of the second
color representation of video frames when no foreground objects are
present, filtering moving objects from the second color
representation of video frames by identifying the moving objects as
ones that change locations in adjacent frames of the second color
representation of video frames, temporally averaging a number of
incoming frames, temporally median filtering a number of incoming
frames, etc.
[0013] Then, this method can detect foreground objects in a current
frame of the second color representation of video frames by
comparing the current frame with the second color representation of
a background frame, again using the image-processing device.
Finally, this exemplary method outputs an identification of the
foreground objects in the current frame of the video from the
image-processing device.
[0014] Additionally, this exemplary method can generate a third
color representation of the background frame and the video frames.
The third color representation has a smaller number of channels
and/or a smaller bit depth relative to the second color
representation. The third color representation can be obtained via
a dimensionality reduction technique, and the third color
representation largely preserves photometric invariance and
discriminative attributes of the second color representation.
[0015] These and other features are described in, or are apparent
from, the following detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] Various exemplary systems and methods are described in
detail below, with reference to the attached drawing figures, in
which:
[0017] FIG. 1 is a conceptual chart of an estimation/updating of a
model-based foreground detection system;
[0018] FIG. 2 is a conceptual chart of modules provided by systems
and methods herein;
[0019] FIG. 3 is a block diagram illustrating processes carried out
by systems and methods herein;
[0020] FIG. 4 is a sample video frame illustrating a conference
room scene with a moving tennis ball;
[0021] FIG. 5 shows a sample background image extracted by the
background image extraction module;
[0022] FIG. 6 illustrates mapping FIG. 5 to a higher-dimensional
representation;
[0023] FIGS. 7(a)-7(c) illustrate the foreground detection process
performed by systems and methods herein;
[0024] FIG. 8 is a flow diagram of various methods herein; and
[0025] FIG. 9 is a schematic diagram illustrating systems
herein.
DETAILED DESCRIPTION
[0026] As mentioned above, conventional systems that use modeling
to differentiate background and foreground objects in video frames
suffer from many limitations. Therefore, the systems and methods
herein do away with the added computational and storage
requirements of traditional model-based approaches and possess a
smaller number of parameters to be tuned, which results in
increased robustness across a wider range of scenarios. Also, the
model-less systems and methods herein do not require any
convergence time.
[0027] The systems and methods herein therefore provide model-less
background estimation for foreground detection that does away with
the initialization period and the added computational and storage
requirements of traditional model-based approaches, and possesses a
smaller number of parameters to be tuned, all of which results in
increased robustness across a wider range of scenarios. Note that
in this disclosure the words "model-less" and "unmodeled" are sometimes used interchangeably to describe processes that do not use modeling. The systems and methods herein use a representation of the background image in a color space that is highly photometrically invariant while at the same time being highly discriminative.
[0028] Foreground and moving object detection is usually a
precursor of video-based object tracking, and, as such, is one of
the fundamental technical problems in computer vision applications
such as surveillance, traffic monitoring and traffic law
enforcement, etc. Examples of implementations that rely on robust
object tracking include video-based vehicle speed estimation,
automated parking monitoring, and measuring total experience time
in retail spaces. The methods and systems disclosed herein diverge
from the traditional model-based approaches for background
estimation and do not rely on model construction and
maintenance.
[0029] One limitation of model-based approaches lies in the number
of parameters that need to be tuned. For example, the choice for a
learning rate involves a tradeoff between how fast the model is
updated and the range of speed of motion that can be supported by
the model. Specifically, too slow a learning rate would mean that
the background estimate cannot adapt quickly enough to fast changes
in the appearance of the scene (e.g., changes in lighting, weather,
etc.); conversely, too fast a learning rate would cause objects
that stay stationary for long periods (relative to frame rate and
learning rate) to be absorbed into the background estimate. As
another example, the choice for the number of components in each
model involves a tradeoff between how adaptable the models are to
changes in illumination and computational complexity, because a
larger number of components increases adaptability and complexity
at the same time. Unfortunately, too large a number of components
in the model may lead to overfitting issues, where the appearance
of objects in the scene other than those in the background may be
represented in the model. Also, the choice of the thresholding constant used to binarize the output of the fit test involves a tradeoff between false positives and missed detections.
[0030] Another limitation of the model-based approaches lies in the
memory and computational resources required to create, maintain
(update) and store pixel-wise models. Yet another limitation is
related to the time the model construction phase takes to converge
(usually in the order of a few hundred to a few thousand
frames).
[0031] FIG. 1 is a flowchart of a background estimation/updating
and foreground detection process. Thus, FIG. 1 illustrates a
model-based process for background estimation and updating, and
foreground detection. Specifically, reference numeral 100
identifies a binary image, reference numeral 102 identifies a fit
test process, reference numerals 104 and 106 identify background
models, reference numeral 108 identifies a model update process,
and reference numeral 110 identifies grayscale/color images that
are part of a video sequence. In the logic of FIG. 1: $F_i$ denotes the i-th video frame (grayscale or color), where i represents a temporal index; $BG_i$ denotes the i-th background model (an array of pixel-wise statistical models) used for foreground detection in conjunction with frame $F_i$ (this is the model available before an update occurs based on the newly incoming pixel samples in $F_i$); $FG_i$ denotes the i-th foreground binary mask obtained via comparison between $BG_i$ and $F_i$; $BG_{i+1}$ denotes the (i+1)-th background model obtained by updating the pixel-wise background models in $BG_i$ with the pixel values in $F_i$; lastly, $FG_{i+1}$ will subsequently be determined via comparison between $BG_{i+1}$ and frame $F_{i+1}$. Note that frames $F_1$ through $F_t$ are involved in the estimation of background model $BG_{t+1}$.
[0032] The following in-depth discussion of the operation of
model-based approaches is intended to convey the complexity and
need for storage resources, as well as the need to fine-tune a
range of parameters in model-based methods.
[0033] With respect to pixel modeling, statistical models for
background estimation model the values of a pixel over time as the
instantiations of a random variable with a given distribution.
Background estimation is achieved by estimating the parameters of
the distributions that accurately describe the historical behavior
of pixel values for every pixel in the scene. Specifically, at
frame n, what is known about a particular pixel located at spatial
coordinates (i,j) is the history of its values $\{X_1, X_2, \ldots, X_n\} = \{I(i,j,m),\ 1 \le m \le n\}$, where I is the image sequence or video frame sequence, (i,j) are the spatial pixel indices, and m is the temporal image frame index.
[0034] While the historical behavior can be described with different statistical models, including parametric models that assume an underlying distribution and estimate the relevant parameters, and non-parametric models such as kernel-based density estimation approaches, the algorithm can be implemented in terms of Gaussian mixture models; note that it is equally applicable to other online modeling approaches. One can model the recent history of values of each pixel as a mixture of K Gaussian distributions, so that the probability of observing the current value is

$P(X_t) = \sum_{i=1}^{K} w_{it} \, \eta(X_t, \mu_{it}, \Sigma_{it})$

where $w_{it}$ is an estimate of the weight of the i-th Gaussian component in the mixture at time t, $\mu_{it}$ is the mean value of the i-th Gaussian component in the mixture at time t, $\Sigma_{it}$ is the covariance matrix of the i-th Gaussian component in the mixture at time t, and $\eta(\cdot)$ is the Gaussian probability density function. Sometimes a reasonable assumption is for the different color channels to be uncorrelated, in which case $\Sigma_{it} = \sigma_{it} I$.
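For concreteness, the mixture evaluation can be sketched in a few lines of NumPy. This is a minimal illustration rather than the disclosed implementation; it assumes uncorrelated channels, so each component is described by a single variance, and the function name and array layout are illustrative.

```python
import numpy as np

def mixture_probability(x, weights, means, variances):
    """Evaluate P(X_t) for one pixel value x under a K-component
    Gaussian mixture with isotropic covariances (sigma_i^2 * I).

    x:         (D,) pixel value (e.g., D = 3 for RGB)
    weights:   (K,) component weights w_it
    means:     (K, D) component means mu_it
    variances: (K,) per-component variances sigma_it^2
    """
    d = x.shape[0]
    sq_dist = np.sum((means - x) ** 2, axis=1)        # ||x - mu_i||^2 per component
    norm = (2.0 * np.pi * variances) ** (-d / 2.0)    # Gaussian normalization terms
    density = norm * np.exp(-0.5 * sq_dist / variances)
    return float(np.sum(weights * density))
```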
[0035] Pixel modeling is usually conducted during the
initialization/training phase of the background model. To this end,
the first N frames (usually N ≈ 100 in practice) are used to
train the background model. A background model is said to have been
initialized once the parameters that best describe the mixture of
Gaussians (weights, mean vectors and covariance matrices for each
Gaussian component) for every pixel are determined. For simplicity,
the following omits the initialization/training phase of the
background model from the description of the system and assumes the
background model has been initialized upon the beginning of the
foreground detection process.
[0036] With respect to foreground pixel detection, foreground
detection is performed by determining a measure of fit of each
pixel value in the incoming frame relative to its constructed
statistical model (e.g., item 102). In one example, as a new frame
comes in, every pixel value in the frame is checked against its
respective mixture model so that a pixel is deemed to be a
background pixel if it is located within T=3 standard deviations of
the mean of any of the K components. Other values of T, or other membership/fit tests for determining pixel membership (e.g., maximum likelihood), are also possible.
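A minimal sketch of this fit test, under the same single-variance-per-component assumption as above (the helper name is illustrative):

```python
import numpy as np

def is_background(x, means, variances, t=3.0):
    """Fit test: the pixel is deemed background if it lies within t
    standard deviations of the mean of any of the K components."""
    dist = np.linalg.norm(means - x, axis=1)       # distance to each component mean
    return bool(np.any(dist <= t * np.sqrt(variances)))
```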
[0037] With respect to model updating (e.g., item 108), if none of
the K components in the distribution match the current pixel value
according to the membership test described above, the pixel may be
considered as a foreground pixel, and, additionally, the least
probable component in the mixture may be replaced with a component
with mean equal to the incoming pixel value, some arbitrarily high
variance, and a small weighting factor, the latter two values reflecting the lack of confidence in the newly added component.
[0038] If, on the other hand, there is a component in the
distribution that matches the pixel, the weights of the
distributions can be adjusted according to:
$w_{i,t+1} = (1-\alpha) w_{it} + \alpha M_{it}$

where $\alpha$ is the learning or update rate and $M_{it}$ is an indicator variable equaling 0 for every component except the matching one (in which case $M_{it} = 1$), so that only the weight factor for the matching distribution is updated. Similarly, only the mean and standard deviation/covariance estimates for matching components are updated according to:

$\mu_{t+1} = (1-\rho)\mu_t + \rho X_t$

$\sigma_{t+1}^2 = (1-\rho)\sigma_t^2 + \rho (X_t - \mu_{t+1})^T (X_t - \mu_{t+1})$

[0039] where $X_t$ is the value of the incoming pixel and $\rho = \alpha \, \eta(X_t \mid \mu_k, \sigma_k^2)$ is the learning rate for the parameters of the matching component of the distribution, k.
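The update step can be sketched as follows: a minimal NumPy illustration assuming uncorrelated channels (a single variance per component), with the mean updated before the variance so that the variance update uses $\mu_{t+1}$ as in the formulas above. The helper name is hypothetical.

```python
import numpy as np

def update_matching_component(x, weights, means, variances, k, alpha):
    """Online GMM update after pixel value x matched component k.

    Implements w_{i,t+1} = (1 - alpha) w_it + alpha M_it for all i,
    then the rho-weighted mean/variance updates for component k only.
    """
    m = np.zeros_like(weights)
    m[k] = 1.0                                   # indicator variable M_it
    weights = (1.0 - alpha) * weights + alpha * m
    d = x.shape[0]
    var = variances[k]
    eta = (2.0 * np.pi * var) ** (-d / 2.0) * np.exp(
        -0.5 * np.sum((x - means[k]) ** 2) / var)
    rho = alpha * eta                            # learning rate for component k
    means[k] = (1.0 - rho) * means[k] + rho * x  # mu_{t+1}
    variances[k] = (1.0 - rho) * var + rho * np.dot(x - means[k], x - means[k])
    return weights, means, variances
```

Note that the weight update leaves the weights summing to one (since $(1-\alpha)\sum_i w_{it} + \alpha = 1$), so no renormalization is needed in this sketch.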
[0040] To avoid such computational and storage requirements
associated with model-based systems, the systems and methods herein
provide model-less (unmodeled) background estimation for foreground
detection that does away with the added computational and storage
requirements of traditional model-based approaches and possesses a
smaller number of parameters to be tuned, which results in
increased robustness across a wider range of scenarios. The systems
and methods herein rely on the representation of the background
image in a color space that is highly photometrically invariant, while
at the same time being highly discriminative.
[0041] FIG. 2 is a conceptual chart of modules provided by systems
and methods herein, and such conceptual modules include a video
acquisition module 120, which provides the incoming video frames
either via real-time acquisition with a camera or by reading videos
stored offline; a background image extraction module 122, which
extracts an image from the video feed with no foreground objects
from the video feed; a color feature extraction module 124, which
takes as input an image (e.g., a video frame) and computes its
representation in the desired color feature space; and a foreground
detection module 126, which compares the color feature
representation of the background image and that of each incoming
video frame, and outputs a binary mask indicating the location of
foreground and moving objects.
[0042] FIG. 3 is a block diagram illustrating processes carried out
by systems and methods herein. The video acquisition module 120 (in
FIG. 2) reads video frames F.sub.i (130); the background image
extraction module 122 selects a frame 132 from the video 130 with
no foreground objects (denoted by F.sub.0 in the Figure); the color
feature extraction module 124 performs color feature extraction
tasks on both the extracted background 132 and incoming frames 130;
and the foreground detection module 126 compares the color feature
representations of the background 134 and the incoming frame 136 to
produce a binary image representative of the foreground area 138.
It can be seen from the diagram in FIG. 3 that the background
representation is static and does not require updating or
maintenance.
[0043] In greater detail, the video acquisition module 120 can be a fixed or stationary (usually surveillance) camera acquiring video
of the region of interest. Alternatively, stored video can be read
from its storage medium. FIG. 4 is a sample video frame 140
obtained by the video acquisition module 120 illustrating a
conference room scene with a moving tennis ball 142. As shown in
FIG. 4, a tennis ball (foreground object 142) bounces around the
scene being captured (the background comprises the room and
stationary objects therein), while the illumination in the scene
changes drastically at periodic intervals (lights are turned on and
off to simulate drastic illumination changes encountered in
real-life situations such as changes due to camera auto-gain or
exposure parameters, fast-moving clouds, transit of illuminated
objects across the scene, etc.)
[0044] The background image extraction module 122 extracts an image
of the scene being monitored with no foreground objects from the
video feed. This image can be extracted from a single frame (e.g.,
after camera installation, or every time the camera is moved), or
can be automatically estimated from the incoming video (e.g., by
temporally averaging or median filtering a number of incoming
frames). Since a representation of the background image in the
second color space is desired, the processing of the video frames
to extract the background image can be performed in the first color
space and the resulting background image then mapped to the second
color space; alternatively, the processing of the incoming video
frames can be performed directly from video frames in the second
color space. FIG. 5 shows a sample background image 144 extracted
by the background image extraction module 122. Other than the
requirement that no foreground object be present in the frame
chosen as a background image, no additional constraints are
imposed.
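For illustration, the temporal-median option can be sketched as follows, assuming a buffer of N incoming frames is available (a sketch, not the disclosed implementation):

```python
import numpy as np

def estimate_background(frames):
    """Pixel-wise temporal median over a buffer of incoming frames.

    frames: (N, H, W, C) array, in either the first color space or the
            high-dimensional representation (the text permits both).
    returns: (H, W, C) background estimate free of transient objects,
             provided each pixel is background in most of the N frames.
    """
    return np.median(frames, axis=0).astype(frames.dtype)
```

The temporal-averaging option mentioned above would simply substitute np.mean for np.median.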
[0045] The color feature space selected for performing the
background subtraction is robust to a wide range of photometric
conditions (e.g., illumination and changes thereof, as well as
changes caused by variations in camera parameters such as auto-gain
and exposure), so factors like time of day or weather conditions
will not have an impact on the choice for a background image. For
example, the color feature space is such that the color feature
representation of a given scene taken on a sunny day closely
matches that of the same scene taken on a rainy or cloudy day.
[0046] In addition to being photometrically invariant, the
high-dimensional color feature space is also highly discriminative.
This means that the representations of two objects in a scene with apparently similar colors (e.g., two objects with closely resembling shades of red) will be significantly different. Note
that there is a tradeoff between how discriminative and how
photometrically invariant a color space is. The highly
discriminative and photometrically invariant color space used with
the methods and systems herein strikes a good balance between
photometric invariance and discriminability.
[0047] In one example, the color feature extraction module 124 uses high-dimensional color features to represent both the
background image and the incoming video frames. The selected color
feature space, in addition to being high-dimensional (in order to
aid discriminability), is highly photometrically invariant, which
means that a given color has similar representations in the feature
space regardless of illumination conditions (varying illumination
conditions are brought about by shadows, and changes in lighting
and weather conditions, as well as changes in camera capture
parameters.) One of the reasons why model-based background
estimation algorithms are popular is because they are highly
adaptable to changing illumination conditions. As stated, however,
they have intrinsic limitations regarding how fast they can adapt
to those changes; for example, shadows cast by a passing cloud will
be detected initially as foreground, and may only be absorbed by
the background model if the cloud is moving slowly enough, relative
to the selected learning rate. By representing the background and
foreground images in a color space that is
illumination-independent, a static background representation can be
maintained for as long as the configuration of the camera relative
to the scene remains unchanged. In one example, if the use of a
low-dimensional color space is desired, a mapping from the
high-dimensional space to a low-dimensional space can be performed
via dimensionality reduction techniques (e.g., linear
dimensionality reduction techniques, such as principal component
analysis or PCA and independent component analysis or ICA,
non-linear dimensionality reduction techniques such as non-linear
PCA and ICA, manifold learning, and principal curves, or
quantization techniques such as scalar and vectorial quantization)
provided the mapping largely preserves the features of the original
space. The dimensionality reduction is such that the
low-dimensional color space largely preserves most of the
photometrically invariant and discriminative properties of the
high-dimensional space.
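As one hedged illustration of such a reduction, PCA can be applied to the pixel vectors of the high-dimensional representation; in practice the projection would be fit once (e.g., on background pixels) and reused for every frame so the representation stays consistent. The functions below are a sketch under those assumptions, not the disclosed method:

```python
import numpy as np

def fit_pca(pixels, n_components=3):
    """Fit a PCA projection on (N, D) pixel vectors drawn from the
    high-dimensional color space (e.g., D = 11)."""
    mean = pixels.mean(axis=0)
    _, _, vt = np.linalg.svd(pixels - mean, full_matrices=False)
    return mean, vt[:n_components]               # top principal axes

def reduce_frame(frame_hd, mean, basis):
    """Project an (H, W, D) frame onto the learned low-dimensional space."""
    h, w, d = frame_hd.shape
    x = frame_hd.reshape(-1, d).astype(np.float32) - mean
    return (x @ basis.T).reshape(h, w, basis.shape[0])
```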
[0048] The color feature extraction module 124 extracts
high-dimensional color features by linearly mapping a color in the
RGB or other low-dimensional color space to a high-dimensional
space, which can, for example, be based on color names. Intuitively
speaking, when colors that lie in a low-dimensional space (i.e., three-channel spaces such as RGB (red, green, blue), Lab (CIELAB or L*a*b*), YUV (luma Y' and chrominance UV), and YCrCb (luma Y' plus blue-difference CB and red-difference CR chroma components), or four-channel spaces such as CMYK (cyan, magenta, yellow, black)) are mapped to a high-dimensional space, their representation is sparse, which leads to good discriminability (i.e., a red object looks different than a blue object). For example, when the transformation is constructed taking into account human color naming, the mapping performed by the color feature extraction module 124 brings about added robustness to changes in color appearance due to variations in illumination (i.e., a red object is red regardless of whether it's sunny or cloudy; similarly, a red object looks different than a blue object regardless of illumination). These two attributes give rise to photometrically invariant and discriminative representations of colors.
[0049] The systems and methods herein use a high-dimensional mapping that maps RGB or another low-dimensional color space to a relatively higher-dimensional space (e.g., an 11-dimensional or 11-channel space), although other specific mappings are possible, as long as they satisfy the requirements described above. The transformation
they satisfy the requirements described above. The transformation
can be learned from labeled and uncalibrated real-world images with
color names. These images can be obtained, for example, via search
engine results corresponding to image sets resulting from color
name search queries. The images will comprise a wide range of
objects and scenes whose appearance more or less corresponds to the
queried color; specifically, images retrieved via the query "black"
will largely contain black scenes and objects acquired under
varying illumination, camera pose, and other capture conditions.
Probabilistic models that represent the color distribution of each
of the query images can be constructed and used to learn the
appearance of colors corresponding to color names. The learned
model can be implemented in the form of a look-up table (LUT) that
maps colors in the incoming color space (e.g., RGB, YCrCb, Luv,
Lab, etc.) to a higher dimensional color space where each of the
dimensions roughly corresponds to a color name. In more general
examples, the mappings to a higher dimensional color space can be
learned from labels other than color names, as long as the labels
are uncorrelated.
[0050] As a background image is selected by the background
extraction module 122, its high-dimensional color representation is
computed and stored by the color feature extraction module 124.
Similarly, as incoming frames are acquired and processed, their
high-dimensional representation is computed by the color feature
extraction module 124 and foreground detection is performed in the
high-dimensional space by the foreground detection module 126.
[0051] In one example, the color feature extraction module 124
performs mapping. Incoming three-channel RGB colors are quantized to 8 bits per channel, and then mapped to a 4-bit-per-channel, 11-dimensional color space via the use of a $256^3 \rightarrow 4^{11}$ LUT (although different bits per channel or bit depths and different dimensional color spaces are equally useful with the systems and methods herein, and the foregoing are merely examples). FIG. 6 illustrates the result of mapping
performed by the color feature extraction module 124; and item 150
in FIG. 6 shows the pseudocolored pixel-wise 11-dimensional or
11-channel representation of the background image from FIG. 5.
Intuitively, the mapping converts a densely populated
low-dimensional color space into a sparsely populated
high-dimensional color space because of the significant
dimensionality disparities between both spaces. The sparsely
populated high-dimensional space consists of groups of colors where
variations of a given color due to different illuminations, shadows
and object specularities are clustered together. The discriminative
capabilities of the mapping are due to the fact that color
representations of photometric variations of a given color are more
tightly clustered than color representations of different
colors.
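A minimal sketch of applying such a table is below; it assumes the LUT has already been learned offline (e.g., from color-name image statistics as described above) and stores one 11-vector per quantized RGB triplet. The names and layout are illustrative, and a full $256^3$-entry table of 11 uint8 values occupies roughly 185 MB.

```python
import numpy as np

def map_to_color_names(frame_rgb, lut):
    """Map an 8-bit RGB frame to an 11-channel color-name representation.

    frame_rgb: (H, W, 3) uint8 frame
    lut:       (256**3, 11) uint8 table, entries in 0..15 (4 bits/channel)
    returns:   (H, W, 11) high-dimensional representation
    """
    r = frame_rgb[..., 0].astype(np.int64)
    g = frame_rgb[..., 1].astype(np.int64)
    b = frame_rgb[..., 2].astype(np.int64)
    idx = (r * 256 + g) * 256 + b               # flatten RGB triplet to LUT index
    return lut[idx]                             # fancy indexing performs the lookup
```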
[0052] The foreground detection module 126 compares the color
feature representations of the background (e.g., FIG. 6) and every
incoming frame, and outputs a binary mask where active pixels are
associated with foreground or moving objects. If high-dimensional
pixel representations are interpreted as vectors in a
high-dimensional space, the comparison can take the form of a
pixel-wise vectorial distance computation. Alternatively, if pixel
representations are interpreted as discrete distributions,
divergence metrics can be used as a measure of similarity. Both
approaches can be equivalently used by systems and methods herein
(as are other similar approaches that measure
similarities/dissimilarities between vectors). In any case, the
resulting similarity number is thresholded (pixel values are
compared to a threshold value to determine if they will be white or
black in the image of the foreground objects) to produce a binary
output. In one example, a simple pixel-wise Euclidean distance
metric between the high-dimensional representation of the
background and incoming frame is performed, followed by a
thresholding operation. FIGS. 7(a)-7(c) illustrate the foreground
detection process. More specifically, FIG. 7(a) shows a video frame
with a moving object 142; FIG. 7(b) shows the result of mapping
performed by the color feature extraction module 124 on FIG. 7(a)
into an 11-dimensional color representation, and FIG. 7(c) shows
the corresponding binary foreground mask calculated by the
foreground detection module 126.
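The pixel-wise Euclidean-distance variant of this comparison can be sketched as follows (the threshold value is application dependent and hypothetical here):

```python
import numpy as np

def detect_foreground(frame_hd, background_hd, threshold=4.0):
    """Compare high-dimensional representations of the incoming frame and
    the static background; active (True) pixels mark foreground objects.

    frame_hd, background_hd: (H, W, D) arrays (e.g., D = 11)
    returns: (H, W) boolean foreground mask
    """
    diff = frame_hd.astype(np.float32) - background_hd.astype(np.float32)
    dist = np.linalg.norm(diff, axis=2)          # pixel-wise Euclidean distance
    return dist > threshold                      # thresholding -> binary mask
```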
[0053] As noted above, due to the time the model-based approaches
take to adapt to the changing illumination conditions, a
significant number of false positives are sometimes present in the
modeled foreground mask. To the contrary, with the systems and
methods herein, the false positives are kept to a minimum (because
of the photometric invariance to illumination of the color space
utilized) while still performing robust detection of foreground
objects (because of the discriminability of the color space
utilized). Therefore, the systems and methods herein are robust to
illumination changes regardless of the rate at which they happen,
use a smaller number of parameters that need to be tuned, do not
require initialization or convergence time, and reduce the
computational and memory requirements.
[0054] FIG. 8 is a flowchart illustrating an exemplary method herein.
In item 180, this method uses a camera in a fixed position to
capture and output video of a scene being monitored. The video is
output from the camera as a sequence of video frames having pixel
values in a first (e.g., relatively low-dimensional) color space
(e.g., three channel spaces such as RGB, YCbCr, YUV, Lab, and Luv,
and four-channel spaces such as CMYK) where the first color space
has a first (e.g., low) number of bits per channel or bit
depth.
[0055] Also, in item 182, this exemplary method maps the video
frames to a second (e.g., relatively higher-dimensional) color
representation of video frames using an image-processing device
operatively connected to the camera. For example, each pixel can be
transformed to a higher-dimensional representation using a
previously calculated look-up table (LUT), or other similar
processing can be performed by the image processor to map the video
frames (in a minimal processing time (e.g., in fractions of a
second) without performing any modeling). Also, the mapping process
can transform the pixel values in the first color space to be more
photometrically invariant to illumination conditions.
[0056] The mapping process in item 182 transforms the pixel values
in the first color space from the first number of bits per channel
and a first number of channels (e.g., 8 or 16 bits per channel and
3 or 4 channels) to a second number of bits per channel and a
second number of channels (e.g., 2, 4, etc., bits per channel and
8, 10, 12, etc. channels). Thus, the mapping process causes pixel values in the second color representation of video frames to be represented by vectors having a greater vector length (e.g., 8, 10
or 12 dimensions) relative to pixel values in the first color space
(e.g., 2, 3 or 4 dimensions). This, therefore, causes the second
color representation of video frames to have more color
discrimination relative to the video frames having pixel values in
the first color space.
[0057] Therefore, in item 182, the systems and methods herein
transform the pixel values from a first color space (which has a
relatively lower number of channels, and a given number of bits per
channel or bit depth) into a second, higher-dimensional color space
(which has a relatively greater number of channels, and possibly a
different number of bits per channel or bit depth) in order to
provide a color space that is both highly-discriminative, while at
the same time being photometrically invariant to illumination.
[0058] In other words, in item 182, the mapping converts a densely
populated (e.g., higher number of bits per channel or larger bit
depth) low-dimensional (e.g., lower number of channels) color space
and into a sparsely populated (e.g., lower number of bits per
channel or smaller bit depth) high-dimensional (e.g., higher number
of channels) color space. The increase in sparseness (e.g., because
low-dimensional vectors are represented via high-dimensional
vectors) leads to good discriminability between objects (i.e., a
red object looks different than a blue object) without
substantially reducing photometric invariance to illumination.
[0059] Thus, the second color space may have a smaller bit depth
relative to the first color space; however, stated more generally,
the second color space possibly has a different color bit depth
relative to the first one (both smaller and greater).
[0060] In item 184, this exemplary method also produces (e.g.,
extracts) a second color representation of a background frame of
the scene. The second color representation of a background frame
can be produced by extracting the second color representation of a
background frame from the second color space representation of
incoming video frames, or the processing of the video frames can
take place in the first color space, and then the resulting
background image can be mapped to the second color space. More
specifically, the process of extracting the second color
representation of a background frame in item 184 can be performed
by, for example, obtaining a frame of the second color
representation of video frames when no foreground objects are
present, filtering moving objects from the second color
representation of video frames by identifying the moving objects as
ones that change locations in adjacent frames of the second color
representation of video frames, temporally averaging a number of
incoming frames, temporally median filtering a number of incoming
frames, etc. Since a representation of the background image in the
second color space is desired, the processing of the video frames
to extract the background image can also be performed in the first
color space and the resulting background image then mapped to the
second color space.
[0061] Then, in item 186, this method can detect foreground objects
in a current frame of the second color representation of video
frames by comparing the current frame with the second color
representation of a background frame, again using the
image-processing device. Finally, this exemplary method outputs an
identification of the foreground objects in the current frame of
the video from the image-processing device in item 188.
[0062] Additionally, in item 190, this exemplary method can
generate a third color representation of the background frame and
the video frames. The third color representation has a smaller
number of channels and/or a smaller number of bits per channel
relative to the second color representation. The third color
representation can be obtained from the second color representation
via a dimensionality reduction technique, and the third color
representation preserves photometric invariance and discriminative
attributes of the second color representation. Therefore, in item
190, if the use of a low-dimensional color space (e.g., third color
representation) is desired, a mapping from the high-dimensional
space to a low-dimensional space can be performed via
dimensionality reduction techniques (e.g., linear dimensionality
reduction techniques, such as principal component analysis or PCA
and independent component analysis or ICA, non-linear
dimensionality reduction techniques such as non-linear PCA and ICA,
manifold learning, and principal curves, or quantization techniques
such as scalar and vectorial quantization) provided the mapping
largely preserves the features of the original space. The
dimensionality reduction is such that the low-dimensional color
space preserves most of the photometrically invariant and
discriminative properties of the high-dimensional space.
[0063] FIG. 9 illustrates a computerized device 200, which can be
used with systems and methods herein and can comprise, for example,
an image processor, etc. The computerized device 200 includes a
controller/tangible processor 216 and a communications port
(input/output) 214 operatively connected to the tangible processor
216 and to a camera 232 on an external computerized network
(external to the computerized device 200). Also, the computerized
device 200 can include at least one accessory functional component,
such as a graphical user interface (GUI) assembly 212. The user may
receive messages, instructions, and menu options from, and enter
instructions through, the graphical user interface or control panel
212.
[0064] The input/output device 214 is used for communications to
and from the computerized device 200 and comprises a wired device
or wireless device (of any form, whether currently known or
developed in the future). The tangible processor 216 controls the
various actions of the computerized device. A non-transitory,
tangible, computer storage medium device 210 (which can be optical,
magnetic, capacitor based, etc., and is different from a transitory
signal) is readable by the tangible processor 216 and stores
instructions that the tangible processor 216 executes to allow the
computerized device to perform its various functions, such as those
described herein. Thus, as shown in FIG. 9, a body housing has one
or more functional components that operate on power supplied from
an alternating current (AC) source 220 by the power supply 218. The
power supply 218 can comprise a common power conversion unit, power
storage element (e.g., a battery), etc.
[0065] The image processor 200 shown in FIG. 9 is a special-purpose
device distinguished from general-purpose computers because such a device includes specialized hardware, such as specialized
processors 216 (e.g., containing specialized filters, buffers,
application specific integrated circuits (ASICs), ports, etc.) that
are specialized for image processing, etc.
[0066] Thus, an exemplary system includes an image-processing device
200 and a camera 232 operatively (meaning directly or indirectly)
connected to the image-processing device 200. The camera 232 is in
a fixed position and outputs video of a scene being monitored. The
camera 232 outputs the video as a sequence of video frames having
pixel values in a first (e.g., relatively low dimensional) color
space, where the first color space has a first number of bits per
channel.
[0067] The image-processing device 200 maps the video frames to a second (e.g., relatively higher dimensional) color representation
of video frames. The mapping causes the second color representation
of video frames to have a greater number of channels and possibly a
different number of bits per channel relative to the first number
of bits per channel. The mapping can also cause the second color
representation of video frames to be more photometrically invariant
to illumination conditions relative to the first color space. In
one example, if the use of a low-dimensional color space is desired,
the image-processing device 200 can perform a mapping from the
second, high-dimensional space to a third, low-dimensional space
via traditional dimensionality reduction techniques (e.g., linear
dimensionality reduction techniques such as principal component
analysis or PCA and independent component analysis or ICA,
non-linear dimensionality reduction techniques such as non-linear
PCA and ICA, manifold learning, and principal curves, or
quantization techniques such as scalar and vectorial quantization)
provided the mapping largely preserves the features of the original
space. Specifically, the dimensionality reduction is such that the
low-dimensional color space preserves most of the photometrically
invariant and discriminative properties of the high-dimensional
space.
[0068] The first color space can be, for example, a 3- or 4-dimensional color space (e.g., three-channel spaces such as RGB, YCbCr, YUV, Lab, and Luv, and four-channel spaces such as CMYK) while the second color representation can have much higher dimensionality, such as 11 dimensions or 11 channels (i.e., pixel values in the second color representation of video frames are represented by vectors having a greater vector length relative to pixel values in the first color space). The mapping is such that
the second color representation of video frames has improved color
discrimination and photometric invariance relative to the video
frames having pixel values in the first color space.
[0069] The image-processing device 200 extracts a second color
representation of a background frame of the scene from at least one
of the second color representation of video frames. For example,
the image-processing device 200 can extract the second color
representation of a background frame by: obtaining a frame of the
second color representation of video frames when no foreground
objects are present; filtering moving objects from the second color
representation of video frames by identifying the moving objects as
ones that change locations in adjacent frames of the second color
representation of video frames; temporally averaging a number of
incoming frames; or temporally median filtering a number of
incoming frames, etc. Since a representation of the background
image in the second color space is desired, the processing of the
video frames to extract the background image can alternatively be
performed in the first color space and the resulting background
image then mapped to the second color space.
[0070] The image-processing device 200 can then detect foreground
objects in a current frame of the second color representation of
video frames by comparing the current frame with the second color
representation of a background frame. The image-processing device
200 then outputs an identification of the foreground objects in the
current frame of the video.
[0071] The hardware described herein, such as the camera and video
frame image processor, plays a significant part in permitting the
foregoing method to be performed, rather than function solely as a
mechanism for permitting a solution to be achieved more quickly
(i.e., through the utilization of a computer for performing
calculations).
[0072] As would be understood by one ordinarily skilled in the art,
the processes described herein cannot be performed by a human alone (or one operating with a pen and a pad of paper) and instead can only be performed by a machine. Specifically, processes such as obtaining videos and processing and analyzing video frames on a pixel-by-pixel basis require the utilization of
different specialized machines. Therefore, for example, the
processing of video frames performed by the systems and methods
herein cannot be performed manually (because it would take decades
or lifetimes to perform the mathematical calculations for all
pixels involved, that are performed in seconds or fractions of a
second by devices herein) and the devices described herein are
integral with the processes performed by methods herein. Further,
such machine-only processes are not mere "post-solution activity"
because the digital images obtained by the camera and the pixel
processing on the video frames are integral to the methods herein.
Similarly, the electronic transmissions between the camera and
image processor utilize special-purpose equipment
(telecommunications equipment, routers, switches, etc.) that are
distinct from a general-purpose processor.
[0073] The methods herein additionally solve many technological
problems related to object detection in video frames. Foreground
and moving object detection is a precursor of video-based object
tracking, and, as such, is one of the technical problems in
computer vision applications such as surveillance, traffic
monitoring and traffic law enforcement, etc. By identifying
foreground objects using unmodeled processing (which is more robust and utilizes fewer hardware resources), the systems and methods
herein provide many substantial technological benefits.
[0074] A "pixel" refers to the smallest segment into which an image
can be divided. Received pixels of an input image are associated
with a color value defined in terms of a color space, such as
color, intensity, lightness, brightness, or some mathematical
transformation thereof. Pixel color values may be converted to a
chrominance-luminance space using, for instance, an RGB-to-YCbCr
converter to obtain luminance (Y) and chrominance (Cb,Cr) values.
Further, the terms automated or automatically mean that once a
process is started (by a machine or a user), one or more machines
perform the process without further input from any user.
[0075] While some exemplary structures are illustrated in the
attached drawings, those ordinarily skilled in the art would
understand that the drawings are simplified schematic illustrations
and that the claims presented below encompass many more features
that are not illustrated (or potentially many fewer) but that are
commonly utilized with such devices and systems. Therefore,
Applicants do not intend for the claims presented below to be
limited by the attached drawings, but instead the attached drawings
are merely provided to illustrate a few ways in which the claimed
features can be implemented.
[0076] Many computerized devices are discussed above. Computerized
devices that include chip-based central processing units (CPU's),
input/output devices (including graphic user interfaces (GUI),
memories, comparators, tangible processors, etc.) are well-known
and readily available devices produced by manufacturers such as
Dell Computers, Round Rock Tex., USA and Apple Computer Co.,
Cupertino Calif., USA. Such computerized devices commonly include
input/output devices, power supplies, tangible processors,
electronic storage memories, wiring, etc., the details of which are
omitted herefrom to allow the reader to focus on the salient
aspects of the systems and methods described herein.
[0077] It will be appreciated that the above-disclosed and other
features and functions, or alternatives thereof, may be desirably
combined into many other different systems or applications. Various
presently unforeseen or unanticipated alternatives, modifications,
variations, or improvements therein may be subsequently made by
those skilled in the art which are also intended to be encompassed
by the following claims. Unless specifically defined in a specific
claim itself, steps or components of the systems and methods herein
cannot be implied or imported from any above example as limitations
to any particular order, number, position, size, shape, angle,
color, or material.
* * * * *