U.S. patent application number 14/272570 was filed with the patent office on 2014-05-08 and published under publication number 20150324659 on 2015-11-12 for a method for detecting objects in stereo images.
This patent application is currently assigned to Mitsubishi Electric Research Laboratories, Inc. The applicant listed for this patent is Mitsubishi Electric Research Laboratories, Inc. The invention is credited to Ming-Yu Liu and Oncel Tuzel.
Publication Number | 20150324659 |
Application Number | 14/272570 |
Document ID | / |
Family ID | 54368107 |
Publication Date | 2015-11-12 |
United States Patent Application | 20150324659 |
Kind Code | A1 |
Liu; Ming-Yu; et al. | November 12, 2015 |
METHOD FOR DETECTING OBJECTS IN STEREO IMAGES
Abstract
A method detects an object in a pair of stereo images acquired
of a scene by first generating a cost volume from the pair of
stereo images, wherein the cost volume includes matching costs for
a range of disparity values, for each pixel of the stereo images,
between the stereo images in the pair. Feature vectors are
determined from sub-images in the cost volume using a feature
function of the disparity values with a minimal accumulated cost
within regions inside the sub-images. Then, a classifier is applied
to the feature vectors to detect whether a sub-image includes the
object.
Inventors: | Liu; Ming-Yu; (Cambridge, MA); Tuzel; Oncel; (Winchester, MA) |
Applicant: | Mitsubishi Electric Research Laboratories, Inc.; Cambridge, MA, US |
Assignee: | Mitsubishi Electric Research Laboratories, Inc.; Cambridge, MA |
Family ID: |
54368107 |
Appl. No.: |
14/272570 |
Filed: |
May 8, 2014 |
Current U.S.
Class: |
382/197 |
Current CPC
Class: |
G06K 9/3241 20130101;
G06T 2207/10012 20130101; G06T 7/593 20170101; G06T 7/11 20170101;
G06T 7/194 20170101; G06T 17/00 20130101; G06K 9/6256 20130101;
G06K 9/52 20130101; G06K 9/468 20130101; G06K 9/6267 20130101 |
International
Class: |
G06K 9/46 20060101
G06K009/46; G06K 9/52 20060101 G06K009/52; G06T 17/00 20060101
G06T017/00; G06K 9/62 20060101 G06K009/62 |
Claims
1. A method for detecting an object in a pair of stereo images
acquired of a scene, wherein each stereo image includes pixels,
comprising the steps of: generating a cost volume from the pair of
stereo images, wherein the cost volume includes matching costs for
a range of disparity values, for each pixel, between the stereo
images in the pair; determining feature vectors from sub-images in
the cost volume using a feature function of the disparity values
with a minimal accumulated cost within regions inside the
sub-images; and applying a classifier to the feature vectors to
detect whether the sub-images include the object, wherein the
steps are performed in a processor.
2. The method of claim 1, further comprising: localizing the object
within the stereo images.
3. The method of claim 1, wherein the classifier is learned from
pairs of training stereo images.
4. The method of claim 1, further comprising: rectifying the pair
of stereo images.
5. The method of claim 1, further comprising: smoothing the pair of
stereo images.
6. The method of claim 1, wherein the generating further comprises:
matching colors and gradients of the pixels in the pair of stereo
images using a Euclidean norm.
7. The method of claim 1, wherein the feature function is
f.sub.k(J)=1 if d.sub.min(R.sub.k.sup.1)>d.sub.min(R.sub.k.sup.2); 0 if d.sub.min(R.sub.k.sup.1)=d.sub.min(R.sub.k.sup.2); and -1 otherwise, (1)
where J represents the sub-image, k represents a dimension of the feature
vectors, min represents a function that returns a minimum, and
d.sub.min(R.sub.k.sup.i) represents the disparity value that has a
minimal accumulated cost in the rectangular region R.sub.k.sup.i of the
sub-image, wherein i indexes the rectangular regions.
8. The method of claim 7, wherein
d.sub.min(R.sub.k.sup.i)=arg min.sub.d.SIGMA..sub.(x, y).di-elect cons.R.sub.k.sup.iC(x, y, d), (2)
where C(x, y, d) represents the cost volume.
9. The method of claim 1, wherein the classifier is an ensemble
classifier including T decision tree classifiers.
10. The method of claim 9, wherein the classifier provides a
detection score s for sub-image J as
s(J)=.SIGMA..sub.t=1.sup.T.theta..sub.t.delta..sub.t(J), (3)
where the .delta..sub.t's are the decision tree classifiers and the
.theta..sub.t's are corresponding weights.
Description
FIELD OF THE INVENTION
[0001] This invention relates to computer vision, and more
particularly to detecting objects in stereo images.
BACKGROUND OF THE INVENTION
[0002] Many computer vision applications use stereo images acquired
by a stereo camera to detect objects. A stereo camera typically has
multiple lenses and sensors. Usually, the intra-axial distance
between the lenses is about the same as the distance between human
eyes, to provide overlapping views.
[0003] FIG. 1 shows a conventional system for stereo-based object
detection. A stereo camera 101 acquires stereo images 102. The
detection method can include the following steps: stereo imaging 100,
cost volume determination 110, depth/disparity map estimation 120,
and object detection 130.
[0004] Most of the conventional methods for stereo-based object
detection rely on per-pixel depth information in the overlapping
area 120. This step is generally referred to as depth/range map
estimation, and can be achieved by determining disparity values,
i.e., translations of corresponding pixels in the two images, to
obtain the depth map. The depth map can then be used
for object detection 130, e.g., a histogram of oriented gradients
(HoG) of the depth map is used for object description. One method
estimates the dominant disparity in a sub-image region, and uses a
co-occurrence histogram of the relative disparity values for object
detection.
[0005] Depth/range/disparity map estimation is a challenging
problem. Local methods suffer from inaccurate depth determination,
while global methods require significant computational resources,
and are unsuited for real-time applications.
[0006] Several methods avoid the depth map determination step by
using stereo cues for region of interest generation. For example,
one method determines a stixel map which marks the potential object
locations. Each stixel is defined by a 3D position relative to the
camera and stands vertically on a ground plane. A detector based on
the color image content is then applied to the locations to detect
objects.
[0007] U.S. Publication 20130177237 uses a range map to determine
an area of interest, and uses a classifier based on an intensity
histogram to detect objects.
[0008] Region of interest methods cannot be directly applied to
object detection. They have to be applied in conjunction with other
object detectors. In addition, miss detection is certain when the
area of interest does not cover the object.
SUMMARY OF THE INVENTION
[0009] The embodiments of the invention provide a method for
detecting objects in stereo images. A cost volume is computed from
the images. Then, object detection is applied directly to features
obtained from the cost volume. The detection uses T decision tree
classifiers (AdaBoost) that are learned from training features.
[0010] The invention avoids the error-prone and
computationally-complex depth map estimation step of the prior art,
and leads to an accurate and efficient object detector. The method
is better suited for embedded systems because it does not require the
complex optimization modules necessary to obtain a good depth map. In
addition, the method searches all sub-images in the input images to
detect the object. This avoids the miss detection problem that
exists in the region of interest generation techniques.
[0011] The detection is accurate because the method can leverage a
large amount of training data and make use of machine learning
procedures. It outperforms region of interest generation techniques
in detection accuracy.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] FIG. 1 is a block diagram of a conventional stereo-based
object detection system;
[0013] FIG. 2 is a block diagram of a stereo-based object detection
system according to embodiments of the invention;
[0014] FIG. 3 is a block diagram of an object detection module for
the stereo-based object detection system of FIG. 2;
[0015] FIG. 4 is a block diagram of a method for learning the
stereo-based object detector according to embodiments of the
invention;
[0016] FIG. 5 is a schematic of cost volume determination according
to embodiments of the invention;
[0017] FIG. 6 is a schematic of a learned feature according to
embodiments of the invention; and
[0018] FIG. 7 is a schematic of objects occupying large and small
portions of sub-images.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0019] FIG. 2 shows a method and system for detecting an object 201
in a pair of stereo images 200 according to embodiments of our
invention. A cost volume 211 is generated 210 from the pair of
stereo images. This is followed by selecting and extracting 215
feature vectors 216. Then, an object detector 220 is applied to the
feature vectors to detect the object. The object detector
uses classifiers 230 learned from training image features 231.
After the object is detected, it can be localized, that is, the
location of the object in the image can be determined. The method
can be performed in a processor 250 connected to memory and
input/output interfaces by buses as known in the art.
[0020] Our invention is based on the realization that depth
information available in a depth map is also available in the cost
volume, because the depth map is derived from the cost volume.
[0021] Our detector 220 that uses the cost volume directly is
theoretically capable of matching the performance of any detector
based on the depth map. Moreover, the cost volume is a richer
representation than the conventional depth map. The depth map only
provides a depth for each pixel, while the cost volume provides
matching costs for a range of potential depths that each pixel in
the stereo images can have, including the true depth. Hence, a
detector that uses features obtained directly from the cost volume
can access more depth information, and achieve better performance.
[0022] As shown in FIG. 3, one embodiment of our invention includes
cost volume generation 210, feature extraction 310, object
detection and localization 320, learned discriminative features
330, and a learned object classification model 340. The
localization determines where the object is detected.
[0023] FIG. 4 shows a machine learning procedure for learning the
discriminative features and the learned object classification
model. Features are selected and learned 410 from training data 400
comprising pairs of training stereo images.
[0024] Cost Volume Generation
[0025] FIG. 5 shows the generation of the cost volume C 211. The
cost volume C: X.times.Y.times.D is a three-dimensional data
structure stored in the memory, where X and Y denote the image x
and y axes and D denotes a set of disparity values, which are
translations between corresponding pixels in the two stereo images
I.sub.L 501 and I.sub.R 502. We assume that I.sub.L and I.sub.R are
rectified, which means that the images have been transformed such
that lens distortion effects are compensated, and a pixel in a
row of one image is mapped to a pixel in the same row of the other
image. The cost volume can then be determined by matching pixel
appearance in the pair of stereo images I.sub.L and I.sub.R.
[0026] One way to determine the cost volume is to apply the mapping
C(x, y, d)=.parallel.I.sub.L(x, y)-I.sub.R(x-d, y).parallel..sub.2+.lamda..parallel.grad(I.sub.L(x, y))-grad(I.sub.R(x-d, y)).parallel..sub.2
for any (x, y, d).di-elect cons.X.times.Y.times.D,
where .parallel..parallel..sub.2 denotes a Euclidean norm,
I.sub.L(x, y) refers to the pixel color values at the (x, y) location
of the I.sub.L image, I.sub.R(x-d, y) refers to the pixel color values
at the (x-d, y) location of the I.sub.R image, grad(I.sub.L(x, y))
refers to the gradient at the (x, y) location of the I.sub.L image,
grad(I.sub.R(x-d, y)) refers to the gradient at the (x-d, y) location
of the I.sub.R image, and .lamda. is a weight controlling the
importance of the gradient information. Note that an image smoothing
technique, such as bilateral filtering or guided filtering, can be
applied to enhance the cost volume.
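As an illustration only, the mapping above can be sketched in Python with NumPy for grayscale images, where the color norm reduces to an absolute difference; the gradient operator, border handling, and weight .lamda. here are placeholder choices for the sketch, not the patent's.

```python
import numpy as np

def cost_volume(I_L, I_R, D, lam=0.1):
    """Sketch of C(x, y, d) for rectified grayscale images.

    I_L, I_R: float arrays of shape (Y, X); D: number of disparity
    values; lam: assumed gradient weight. Color images would use a
    norm over the channel axis instead of an absolute difference.
    """
    Y, X = I_L.shape
    # Horizontal image gradients stand in for grad(I) in the formula.
    gL = np.gradient(I_L, axis=1)
    gR = np.gradient(I_R, axis=1)
    C = np.zeros((Y, X, D))
    for d in range(D):
        # Shift the right image by disparity d so column x aligns
        # with I_R(x - d, y); clamp at the left border.
        shifted = np.roll(I_R, d, axis=1)
        shifted[:, :d] = I_R[:, :1]
        gshift = np.roll(gR, d, axis=1)
        gshift[:, :d] = gR[:, :1]
        C[:, :, d] = np.abs(I_L - shifted) + lam * np.abs(gL - gshift)
    return C
```

For a synthetic pair where the right image is the left image translated by two pixels, the cost at disparity d=2 vanishes away from the borders, as expected.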
[0027] Feature Extraction
[0028] FIG. 6 shows feature selection and extraction 215 of FIG. 2.
We extract a K-dimensional feature vector from each sub-image 600
for determining whether or not the object is present in the
sub-image. The sub-images can be considered a moving window passed
over the image, e.g., in a raster scan order, for each pixel at
multiple scales.
[0029] Note, the embodiments use only the cost volume directly to
determine the features. Depth map estimation as in the prior art is
not performed.
[0030] Each dimension of the feature vector corresponds to the
result of a numerical comparison between the minimal-cost disparity
values of two, e.g., rectangular, regions R.sub.k.sup.1 601 and
R.sub.k.sup.2 602 in the sub-image 600. Let the sub-image be denoted
as J and the k.sup.th dimension of the feature vector be represented
as f.sub.k(J). The value of f.sub.k(J) is
f.sub.k(J)=1 if d.sub.min(R.sub.k.sup.1)>d.sub.min(R.sub.k.sup.2); 0 if d.sub.min(R.sub.k.sup.1)=d.sub.min(R.sub.k.sup.2); and -1 otherwise, (1)
where d.sub.min(R.sub.k.sup.i) represents the disparity value
that has a minimal (min) accumulated cost in the region
R.sub.k.sup.i of the sub-image. That is,
d.sub.min(R.sub.k.sup.i)=arg min.sub.d.SIGMA..sub.(x, y).di-elect cons.R.sub.k.sup.iC(x, y, d). (2)
[0031] Note that determining the minimal cost disparity value in
the region is relatively simple because the accumulated cost can be
obtained efficiently using an integral image technique as known in
the art. The locations and size of the regions are learned using a
machine learning procedure, which is described below.
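The accumulated-cost minimization of Equation (2) and the comparison of Equation (1) might be sketched as follows; the per-disparity summed-area table stands in for the integral image technique mentioned above, and the (y0, y1, x0, x1) rectangle encoding is an assumed convention for the sketch.

```python
import numpy as np

def accumulated_cost(C):
    """Per-disparity summed-area tables of the cost volume C (Y, X, D).

    S[y, x, d] holds the sum of C over rows < y and columns < x, so
    any rectangle sum costs four lookups (the integral-image trick).
    """
    S = np.zeros((C.shape[0] + 1, C.shape[1] + 1, C.shape[2]))
    S[1:, 1:, :] = C.cumsum(axis=0).cumsum(axis=1)
    return S

def d_min(S, y0, y1, x0, x1):
    """Disparity with minimal accumulated cost, per Equation (2),
    over rows [y0, y1) and columns [x0, x1)."""
    rect = S[y1, x1] - S[y0, x1] - S[y1, x0] + S[y0, x0]  # shape (D,)
    return int(np.argmin(rect))

def feature(S, R1, R2):
    """One dimension f_k of the feature vector, per Equation (1)."""
    d1, d2 = d_min(S, *R1), d_min(S, *R2)
    return 1 if d1 > d2 else (0 if d1 == d2 else -1)
```

On a toy cost volume where one region has its cheapest accumulated cost at disparity 1 and the other is uniform, the feature correctly evaluates to 1.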
[0032] Object Detection and Localization
[0033] The K-dimensional feature vector associated with the
sub-image is passed to an ensemble classifier for determining a
detection score. The ensemble classifier includes T decision tree
classifiers. Each decision tree classifier takes a small number of
dimensions of the K-dimensional feature as input, and classifies
the sub-image as positive (containing an object) or negative (not
containing an object). A detection score s obtained from the
classifier for the sub-image J is given by
s(J)=.SIGMA..sub.t=1.sup.T.theta..sub.t.delta..sub.t(J), (3)
where .delta..sub.t's are the decision tree classifiers and
.theta..sub.t's are the corresponding weights. If the score is
greater than a preset threshold, then the system declares a
detection in the sub-image.
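Equation (3) can be sketched directly; the decision tree classifiers are represented here by arbitrary callables returning .+-.1, which is an assumption for illustration only.

```python
def detection_score(feature_vec, trees, weights):
    """Weighted vote s(J) = sum_t theta_t * delta_t(J), Equation (3).

    trees: callables returning +1 or -1 for a feature vector;
    weights: the corresponding learned theta_t values.
    """
    return sum(theta * tree(feature_vec)
               for tree, theta in zip(trees, weights))

def detect(feature_vec, trees, weights, threshold=0.0):
    """Declare a detection when the ensemble score exceeds the threshold."""
    return detection_score(feature_vec, trees, weights) > threshold
```

For example, three stub classifiers voting +1, -1, +1 with weights 0.5, 0.2, 0.3 yield a score of 0.6, which clears a threshold of 0.5 but not 0.7.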
[0034] As shown in FIG. 7, the classifier can be trained to give a
higher score when the object occupies a larger portion of the
sub-image 701 and a lower score when the object only occupies a
small portion of the sub-image 702, because the larger object
provides a better estimate of where the object is located within
the image than the smaller object.
[0035] Feature Selection and Classifier Learning Procedure
[0036] We use a discrete AdaBoost procedure for selecting the
regions
{(R.sub.k.sup.1, R.sub.k.sup.2)|.A-inverted.k=1, 2, . . . , K},
(4)
and for learning the decision tree classifier weights
{.theta..sub.t|.A-inverted.t=1, 2, . . . , T}. (5).
[0037] We collect a set of data for the learning task, which includes
a set of stereo training images. The sub-images that contain an
object are labeled as positive instances, while the others are
labeled as negative instances. We align the positive and negative
sub-images so that their centers coincide. The sub-images are also
scaled to have the same height. The aligned and scaled sub-images
are denoted as
D={(J.sub.i, l.sub.i), i=1, 2, . . . , V}, (6)
where J.sub.i denotes the i.sup.th sub-image, l.sub.i is the label,
and V is the total number of sub-images.
[0038] We sample a set of N regions as the feature pools {R.sub.i,
i=1, 2, . . . , N}, which have different locations and sizes and
are covered by the aligned sub-images. We randomly pair two regions
and compare their disparity values of the minimal cost. This is
performed K times to construct a K-dimensional feature vector.
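The region sampling and random pairing described above might be sketched as follows; the rectangle encoding and the counts N and K are illustrative assumptions, not values from the patent.

```python
import random

def sample_region_pairs(height, width, N=100, K=50, seed=0):
    """Sample N candidate rectangles inside an aligned sub-image,
    then randomly pair them K times to define the K feature
    dimensions (each pair's min-cost disparities get compared)."""
    rng = random.Random(seed)
    pool = []
    for _ in range(N):
        # Rectangles as half-open (y0, y1, x0, x1) spans.
        y0 = rng.randrange(height - 1)
        y1 = rng.randrange(y0 + 1, height)
        x0 = rng.randrange(width - 1)
        x1 = rng.randrange(x0 + 1, width)
        pool.append((y0, y1 + 1, x0, x1 + 1))
    return [(rng.choice(pool), rng.choice(pool)) for _ in range(K)]
```

Seeding the generator keeps the sampled feature pool reproducible across training runs.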
[0039] We use the discrete AdaBoost procedure to learn T decision
tree classifiers and their weights. The procedure starts with
assigning uniform weights to the training samples. A decision tree
is then learned based on the current training sample weights. The
weights of incorrectly classified samples are increased so that the
weights have more impact during the next round of decision tree
classifier learning. We assign the weight to the decision tree
classifier based on the weighted error rate. This process is
repeated T times to construct an ensemble classifier of T decision
tree classifiers. Pseudo code for the procedure is described
below.
[0040] Input: Feature vectors and class labels D={(f(J.sub.i), l.sub.i), i=1, 2, . . . , V}.
[0041] Output: Ensemble classifier .SIGMA..sub.t=1.sup.T.theta..sub.t.delta..sub.t(J).
[0042] Start with uniform weights w.sub.i=1/V, i=1, 2, . . . , V.
For t=1, 2, . . . , T:
[0043] 1. Learn a decision tree classifier .delta..sub.t(J).di-elect cons.{-1, 1} using the weights w.sub.i;
[0044] 2. Determine the error rate .epsilon.=.SIGMA..sub.iw.sub.i1(.delta..sub.t(J.sub.i).noteq.l.sub.i);
[0045] 3. Determine the decision tree classifier weight .theta..sub.t=log((1-.epsilon.)/.epsilon.);
[0046] 4. Set w.sub.i.rarw.w.sub.i exp(.theta..sub.t1(.delta..sub.t(J.sub.i).noteq.l.sub.i)) for i=1, 2, . . . , V; and
[0047] 5. Normalize the sample weights: w.sub.i.rarw.w.sub.i/.SIGMA..sub.iw.sub.i.
[0048] The function 1(.), used in steps 2 and 4, represents the
indicator function, which returns one if the statement in the
parentheses is true and zero otherwise.
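The pseudo code above can be sketched as a runnable procedure; depth-one decision stumps stand in for the decision tree classifiers, and the small constant eps guarding the logarithm is an added safeguard for the sketch, not part of the patent's procedure.

```python
import numpy as np

def learn_stump(F, labels, w):
    """Best single-feature threshold classifier under sample weights w."""
    best = None
    for k in range(F.shape[1]):
        for thr in np.unique(F[:, k]):
            for sign in (1, -1):
                pred = np.where(F[:, k] > thr, sign, -sign)
                err = np.sum(w * (pred != labels))
                if best is None or err < best[0]:
                    best = (err, k, thr, sign)
    return best

def adaboost(F, labels, T=5, eps=1e-10):
    """Discrete AdaBoost following the pseudo code: learn T weak
    classifiers, weight each by log((1 - err) / err), and boost
    the weights of misclassified samples after every round."""
    V = len(labels)
    w = np.full(V, 1.0 / V)            # step [0042]: uniform weights
    ensemble = []
    for _ in range(T):
        err, k, thr, sign = learn_stump(F, labels, w)   # step 1
        theta = np.log((1 - err + eps) / (err + eps))   # steps 2-3
        pred = np.where(F[:, k] > thr, sign, -sign)
        w = w * np.exp(theta * (pred != labels))        # step 4
        w = w / w.sum()                                 # step 5
        ensemble.append((theta, k, thr, sign))
    return ensemble

def ensemble_score(ensemble, f):
    """Weighted vote of the learned stumps for feature vector f."""
    return sum(th * (s if f[k] > thr else -s)
               for th, k, thr, s in ensemble)
```

On a trivially separable one-dimensional training set, the learned ensemble classifies every training sample correctly.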
[0049] Although the invention has been described by way of examples
of preferred embodiments, it is to be understood that various other
adaptations and modifications can be made within the spirit and
scope of the invention. Therefore, it is the object of the appended
claims to cover all such variations and modifications as come
within the true spirit and scope of the invention.
* * * * *