U.S. patent application number 09/781229 was filed with the patent office on 2001-02-13 and published on 2001-10-11 for canonical correlation analysis of image/control-point location coupling for the automatic location of control points.
Invention is credited to Covell, Michele, Slaney, Malcolm.
United States Patent Application 20010028731
Kind Code: A1
Covell, Michele; et al.
October 11, 2001
Canonical correlation analysis of image/control-point location
coupling for the automatic location of control points
Abstract
The identification of hidden data, such as feature-based control
points in an image, from a set of observable data, such as the
image, is achieved through a two-stage approach. The first stage
involves a learning process, in which a number of sample data sets,
e.g. images, are analyzed to identify the correspondence between
observable data, such as visual aspects of the image, and the
desired hidden data, such as the control points. Two models are
created. A feature appearance-only model is created from aligned
examples of the feature in the observed data. In addition, each
labeled data set is processed to generate a coupled model of the
aligned observed data and the associated hidden data. In the second
stage of the process, the modeled feature is located in an
unmarked, unaligned data set, using the feature appearance-only
model. This location is used as an alignment point and the coupled
model is then applied to the aligned data, giving an estimate of
the hidden data values for that data set.
Inventors: Covell, Michele (Palo Alto, CA); Slaney, Malcolm (Palo Alto, CA)
Correspondence Address: BURNS DOANE SWECKER & MATHIS L L P, POST OFFICE BOX 1404, ALEXANDRIA, VA 22313-1404, US
Family ID: 24611604
Appl. No.: 09/781229
Filed: February 13, 2001
Related U.S. Patent Documents

Application Number | Filing Date  | Patent Number
09/781229          | Feb 13, 2001 |
08/651108          | May 21, 1996 | 6,188,776
Current U.S. Class: 382/118
Current CPC Class: G06V 10/757 (20220101); G06V 10/476 (20220101); G06V 10/46 (20220101); G06V 10/443 (20220101); G06V 40/171 (20220101)
Class at Publication: 382/118
International Class: G06K 009/00
Claims
What is claimed is:
1. A method for determining continuous-valued hidden data from
observable data, comprising the steps of: A) conducting a training
stage which includes the steps of: labelling a plurality of
representative sets of unaligned observed data to identify correct
alignment of the observed data and continuous-valued hidden data
associated with each set of observed data; analyzing the observed
data to generate a first model which represents the aligned
observed data; generating a second model on the aligned and labeled
data sets which explicitly represents the coupling between aligned
observable data and the hidden data; B) for each set of unlabeled
data, conducting a labelling stage which includes the steps of:
analyzing the unlabeled set of unaligned observed data by means of
the first model to determine alignment of the observable data
associated therewith; applying the second model to said unlabeled
set of aligned observed data; and determining hidden data for the
unlabeled set of aligned data from said application of the second
model.
2. The method of claim 1 wherein said second model is generated by
means of canonical correlation.
3. The method of claim 1 wherein each set of unaligned observed
data defines an image.
4. The method of claim 3 wherein said hidden data comprises control
points which relate to fiduciary points on objects in an image.
5. The method of claim 4 wherein at least some of said control
points relate to fiduciary points on obscured portions of objects
in the images.
6. The method of claim 4 wherein control points are determined for
at least two new images, and further including the step of morphing
between said new images in accordance with the determined control
points.
7. The method of claim 4 further including the step of creating a
composite image by incorporating a new image into another image by
means of the determined control points for each of the two
images.
8. The method of claim 4 wherein said images include faces, and
further including the step of analyzing the control points to
recognize a known face in an image.
9. The method of claim 4 wherein said images comprise cartoons.
10. The method of claim 4 wherein said images include faces, and
further including the step of analyzing the control points to
recognize an expression on a face in an image.
11. The method of claim 4, further including the step of
controlling a robot to grasp an object in accordance with the
fiduciary points that are labeled in the image of the object.
12. The method of claim 1 wherein said sets of unaligned observed
data comprise a sequence of video images.
13. The method of claim 12 further including the step of analyzing
determined control points in said sequence of video images to
recognize movement of an object in the images.
14. The method of claim 13 wherein said movement comprises nearly
periodic motion.
15. The method of claim 1 wherein said sets of unaligned observed
data comprise audio signals.
16. The method of claim 1 further including the steps of selecting
a plurality of said representative sets of data, using hidden data
in said plurality of data sets to automatically generate
interpolated data sets that are based on said plurality of data
sets and that include both observable and hidden data, and
including said interpolated data sets in the plurality of
representative data sets that are analyzed to generate said second
model.
17. The method of claim 16 wherein said second model is a
multifaceted model, and said interpolated data sets are at the
boundaries of facets in said second model.
18. The method of claim 1 further including the steps of selecting
a plurality of said representative sets of data, using hidden data
in said plurality of data sets to automatically generate
interpolated data sets that are based on said plurality of data
sets and that contain observable data, and including said
interpolated data sets in the plurality of representative data sets
that are analyzed to generate said first model.
19. The method of claim 18 wherein said first model is a
multifaceted model, and said interpolated data sets are at the
boundaries of facets in said first model.
20. The method of claim 1 wherein said applying and determining
steps are carried out in a non-iterative manner.
21. The method of claim 1 wherein said first model is a manifold
model.
22. The method of claim 21 wherein said first model is an affine
manifold model.
23. The method of claim 21 wherein the step of aligning the
observed data in an unlabeled data set comprises the steps of: i)
selecting possible locations for the alignment of the data; ii) for
each possible location, determining a lower bound for the distance
between the unlabeled data set aligned at that location and an
expected appearance of aligned data, in accordance with an average
appearance defined by the first model; iii) removing the possible
locations whose lower bound exceeds a threshold value; iv) for each
possible location, determining the coordinate value for a dimension
of the first model; v) for each possible location, determining a
new lower bound by combining previously determined coordinate
values with the distance between the data set aligned at that
location and the appearance of the data set under said alignment in
accordance with the previously determined coordinate values; and
vi) repeating steps iii), iv) and v) for all of the dimensions of
the model.
24. The method of claim 23 wherein said lower bounds are determined
in accordance with expected variances along each of the dimensions
of the manifold model.
25. The method of claim 24 wherein said expected variances are
progressively smaller on each successive repetition of said
steps.
26. The method of claim 22 wherein the step of applying the second
model to the unlabeled set of aligned observed data includes
projecting, with the use of an orthonormal transform, the aligned
observed unlabeled data onto a subspace of the second model having
fewer dimensions than said second model; performing a general
matrix multiplication within said subspace; and projecting, with
the use of an orthonormal transform, into a second space of the
model to determine hidden data for the unlabeled data set.
27. The method of claim 26 wherein said general matrix
multiplication is determined, in part, according to a gradual
roll-off in manifold dimensions according to the coherence between
the hidden and aligned observed data that is used to generate said
second model.
28. The method of claim 1 further including the step of defining the
alignment of the observed data in the representative sets of data
from an analysis of the hidden data with which the data sets are
labeled.
29. The method of claim 28 wherein an analysis of the observed data
is also employed in said alignment process.
30. The method of claim 28 wherein said defining step comprises
dividing the hidden data into separate groups, and assigning a
different definition of aligned observed data in each
representative data set to the respective groups.
31. The method of claim 30 wherein the division of the hidden data
into separate groups is determined in accordance with analysis of
the hidden data.
32. The method of claim 30 wherein the definition of aligned
observed data is determined in accordance with analysis of the
hidden data.
33. The method of claim 32 wherein the definition of aligned
observed data is also determined in accordance with analysis of the
observed data.
34. The method of claim 31 wherein the observed data is also used
to divide the hidden data into said groups.
35. The method of claim 32 wherein the division of hidden data into
groups is carried out by measuring the coherence of the hidden
data.
Description
[0001] This disclosure is based upon, and claims priority from,
U.S. application Ser. No. 08/651,108 (now U.S. Pat. No. 6,188,776),
the contents of which are incorporated herein by reference.
FIELD OF THE INVENTION
[0002] The present invention is directed to data analysis, such as
audio analysis, image analysis and video analysis, and more
particularly to the estimation of hidden data from observed data.
For image analysis, this hidden data estimation involves the
placement of control points on unmarked images or sequences of
images to identify corresponding fiduciary points on objects in the
images.
BACKGROUND OF THE INVENTION
[0003] Some types of data analysis and data manipulation operations
require that "hidden" data first be derived from observable data.
In the field of speech analysis, for example, one form of
observable data is pitch-synchronous frames of speech samples. To
perform linear predictive coding on a speech signal, the
pitch-synchronous frames are labeled to identify vocal-tract
positions. The pitch-synchronous data is observable in the sense
that it is intrinsic to the data and can be easily derived using
known signal processing techniques simply by the correct alignment
between the speech sample and a frame window. In contrast, the
vocal tract positions must be estimated either using some extrinsic
assumptions (such as an acoustic waveguide having uniform length
sections with each section of constant width) or using a general
modeling framework with parameter values derived from an example
database (e.g. linear manifold model with labeled data). Therefore,
the vocal tract positions are known as "hidden" data.
[0004] In image processing applications, the observable data of an
image includes attributes such as color or grayscale values of
individual pixels, range data, and the like. In some types of image
analysis, it is necessary to identify specific points in an image
that serve as the basis for identifying object configurations or
motions. For example, in gesture recognition, it is useful to
identify the locations and motions of each of the fingers. Another
type of image processing application relates to image manipulation.
For example, in image morphing, where one image transforms into
another image, it is necessary to identify points of correspondence
in each of the two images. If an image of a face is to morph into
an image of a different face, for example, it may be appropriate to
identify points in each of the two images that designate the
outline and tip of the nose, the outlines of the eyes and the
irises, the inner and outer boundaries of the mouth, the tops and
bottoms of the upper and lower teeth, the hairline, etc. After the
corresponding points in the two images have been identified, they
serve as constraints for controlling the manipulation of pixels
during the transform from one image to the other.
[0005] In a similar manner, control points are useful in video
compositing operations, where a portion of an image is incorporated
into a video frame. Again, corresponding points in the two images
must be designated, so that the incorporated image will be properly
aligned and scaled with the features of the video frame into which
it is being incorporated. These control points are one form of
hidden data in an image.
[0006] In the past, the identification of hidden data, such as
control points in an image, was typically carried out on a manual
basis. In most morphing processes, for example, a user was required
to manually specify all of the corresponding control points in the
beginning and ending images. If only two images are involved, this
requirement is somewhat tedious, but manageable. However, in
situations involving databases that contain a large number of
images, the need to manually identify the control points in each
image can become quite burdensome. For example, U.S. Pat. No.
5,880,788 discloses a video manipulation system in which images of
different mouth positions are selected from a database and
incorporated into a video stream, in synchrony with a soundtrack.
For optimum results, control points which identify various
fiduciary points on the image of a person's mouth are designated
for each frame in the video, as well as each mouth image stored in
the database. These control points serve as the basis for aligning
the image of the mouth with the image of a person's face in the
video frame. It can be appreciated that manual designation of the
control points for all of the various images in such an application
can become quite cumbersome.
[0007] Most previous efforts at automatically recognizing salient
components of an image have concentrated on features within the
image. For example, two articles entitled "View-Based and Modular
Eigenspaces for Face Recognition," Pentland et al, Proc. IEEE
ICCVPR '94, 1994, and "Probabilistic Visual Learning for Object
Detection," Moghaddam et al, Proc. IEEE CVPR, 1995, disclose a
technique in which various features of a face, such as the nose,
eyes, and mouth, can be automatically recognized. Once these
features have been identified, an alignment point is designated for
each feature, and the variations of the newly aligned features from
the expected appearances of the features can be used for
recognition of a face.
[0008] While this technique is useful for data alignment in
applications such as face recognition, it does not by itself
provide a sufficient number of data points for image manipulation
techniques, such as morphing and image compositing, or other types
of image processing which rely upon the location of a large number
of specific points, such as general gesture or expression
recognition.
[0009] Other prior art techniques for determining data points from
an image employ active contour models or shape-plus-texture models.
Active contour models, also known as "snakes", are described in M.
Kass, A. Witkin, D. Terzopoulos, "Snakes: Active Contour Models,"
IEEE International Conference on Computer Vision, 1987, and C.
Bregler and S. Omohundro, "Surface Learning with Applications to
Lipreading," Neural Information Processing Systems, 1994. The
approaches described in these references use a relaxation technique
to find a local minimum of an "energy function", where the energy
function is the sum of an external energy term, determined from the
grayscale values of the image, and an internal energy term,
determined from the configuration of the snake or contour itself.
The external energy term typically measures the local image
gradient or the local image difference from some expected value.
The internal energy term typically measures local "shape" (e.g.
curvature, length). The Bregler and Omohundro reference discloses
the use, as an internal energy term, of a measure of the distance
between the overall shape of the snake and the expected shapes for
the contours being sought.
[0010] Snakes can easily be thought of as providing control point
locations, and the extension to snakes taught by the Bregler et al
reference allows one to take advantage of example-based learning to
constrain the estimated locations of these control points. However,
there is no direct link between the image appearance and the shape
constraints. This makes the discovery of the "correct" energy
function an error-prone process, which relies heavily on the
experience of the user and his or her familiarity with the problem
at hand. The
complete energy function is not easily and automatically derived
from data-analysis of an example training set.
[0011] Shape-plus-texture models are described in A. Lanitis, C. J.
Taylor, T. F. Cootes, "A Unified Approach to Coding and
Interpreting Face Images," International Conference on Computer
Vision, 1995, and D. Beymer, "Vectorizing Face Images by
Interleaving Shape and Texture Computations," A.I. Memo 1537.
Shape-plus-texture models describe the appearance of an object in
an image using shape descriptions (e.g. contour locations or
multiple point locations) plus a texture description, such as the
expected grayscale values at specified offsets relative to the
shape-description points. The Beymer reference discloses that the
model for texture is example-based, using an affine manifold model
description derived from the principal component analysis of a
database of shape-free images (i.e. the images are pre-warped to
align their shape descriptions). The shape model is unconstrained
(which the reference refers to as "data-driven"), and, in labeling,
is allowed to vary arbitrarily based on a pixel-level mapping
derived from optical flow. In the Lanitis et al. reference, both
the shape and the texture models are derived separately from
examples, using affine manifold model descriptions derived from
principal component analyses of a database. For the shape model,
the shape description locations (the control point (x,y) locations)
are analyzed directly (independent of the grayscale image data) to
get the shape manifold. For the texture model, as in the Beymer
reference, the example grayscale images are pre-warped to provide
"shape-free texture" and these shape-free images are analyzed to
get the texture manifold model. In both references, the locations
for control points on a new (unlabeled) image are estimated using
an iterative technique. First, a shape description for a new image
is estimated (i.e. x,y control point locations are estimated), only
allowing shape descriptions which are consistent with the shape
model. In the Beymer reference, this could be any shape
description. Then, a "shape-free texture" image is computed by
warping the new image data according to the estimated shape model.
The distance between this shape-free texture image and the texture
model is used to determine a new estimate of shape. In the case of
the Beymer reference, the new estimated shape is determined by
unconstrained optical flow between the shape-free unlabeled image
and the closest point in the texture manifold. The Lanitis
reference uses a similar update mechanism with the added constraint
that the new shape model must lie on the shape manifold. After
iterating until some unspecified criterion is met, the last shape
description can be used to describe control point locations on the
input image.
[0012] Shape-plus-texture methods give estimates for many
control-point locations. They also provide well-defined
example-based training methods and error criteria derived from that
example-based training. However, the models which are derived for
these approaches rely on estimates of unknown parameters--they need
an estimate of shape in order to process the image data. Thus, they
are forced to rely on iterative solutions. Furthermore, the shape-
and texture-models do not explicitly take advantage of the coupling
between shape and the image data. The models of admissible shapes
are derived without regard to the image values, and the models of
admissible textures are derived only after "normalizing out" the
shape model.
[0013] When deriving models to allow estimates for unknown
parameters, the coupling between observable parameters, such as
image grayscale values, and the unknown parameters in the
description should preferably be captured, rather than the
independent descriptions of the unknown parameters and of the
"normalized" known parameters. This is similar to the difference
between "reconstructive" models (models that allow data to be
reconstructed with minimum error) and "discriminative" models
(models that allow unknown classification data to be estimated with
minimum error).
BRIEF STATEMENT OF THE INVENTION
[0014] In accordance with the present invention, the determination
of hidden data from observed data is achieved through a two-stage
approach. The first stage involves a learning process, in which a
number of sample data sets, e.g. images, are analyzed to identify
the correspondence between observable data, such as visual aspects
of the image, and the desired hidden data, e.g. control points.
With reference to the case of image analysis, a number of
representative images are labeled with control point locations
relating to features of interest. An appearance-only feature model
is created from aligned images of each feature. The aligned image
data is rotated into standard orientations, to generate a coupled
model of the aligned feature appearance and the control point
locations around that feature. For example, for a coupled affine
manifold model, the expected (average) vectors for both the visible
image data and the control point locations are derived from all of
the individual vectors for the labeled representative images. A
linear manifold model of the combined image deviations and location
deviations is also determined from this data. This feature model
represents the distribution of visible aspects of an image and the
locations of control points, and the coupling relationship between
them.
[0015] In the second stage of the process, a feature is located on
an unmarked image using the appearance-only feature model. The
relevant portion of the image is then analyzed to determine a
vector for the visible image data. This vector is compared to the
average vector for the representative images, and the deviations
are determined. These values are projected onto the data model, to
identify the locations of the control points in the unmarked
image.
[0016] In a low-resolution implementation of the invention, certain
assumptions are made regarding the correspondence between the
visible image data and the control-point locations. These
assumptions can be used to reduce the amount of computation that is
required to derive the model from the training data, as well as
that which is required to locate the control points in the
labelling process. The low-resolution approach may be desirable in
those applications where a high degree of precision is not
required, such as in a low-resolution video morphing or compositing
system. In a second implementation of the invention, additional
computations are carried out during both the training and labeling
steps, to provide a higher degree of precision in the location of
the control points. This higher-resolution implementation provides
a greater degree of control for processes such as high-resolution
video morphing or compositing and the like.
[0017] The foregoing features of the invention, as well as more
specific aspects thereof which contribute to the practical
implementation of the invention under different conditions, are
explained in greater detail hereinafter with reference to exemplary
embodiments illustrated in the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] FIG. 1a is an illustration of an image of a person's lips in
a partially open position and the teeth in a closed position;
[0019] FIG. 1b is an illustration of the image of FIG. 1a in which
control points on salient portions of the image have been
identified;
[0020] FIG. 1c is an illustration of only the control points which
are identified in FIG. 1b;
[0021] FIGS. 2a-2c are illustrations corresponding to those of
FIGS. 1a-1c for an image of closed lips and closed teeth;
[0022] FIGS. 3a-3c are illustrations corresponding to those of
FIGS. 1a-1c for an image of fully open lips and partially opened
teeth;
[0023] FIG. 4 is an illustration of the generation of the data
vectors for the image of FIG. 1a;
[0024] FIG. 5 is an illustration of the manner in which the average
value vectors are determined;
[0025] FIG. 6 is an illustration of the deviation matrices;
[0026] FIG. 7 is an illustration of the inputs to a layer of
perceptrons;
[0027] FIGS. 8a-8b illustrate two examples of the mapping of data
sets to a global linear manifold; and
[0028] FIGS. 9a-9b illustrate the matching of images by reference
to coupled models.
DETAILED DESCRIPTION
[0029] Generally speaking, the present invention is directed to the
determination of continuous-valued hidden data from observed data.
To facilitate an understanding of the invention, it will be
described hereinafter with reference to the specific task of
placing control points on unmarked two-dimensional images. Control
points are locations on an image that are estimated as best
corresponding to fiduciary points on an object. For example, if an
object of interest is a face, an outside corner of the lips might
be designated as one fiduciary point on the face. A control point
marks the image location which is the best estimate of where that
fiduciary point (the outside corner of the lips) appears
in the image.
[0030] The ability to automatically estimate control-point data on
unmarked images provides a number of different opportunities for
image processing. For example, it can be used to locate control
points for applications such as expression and gesture recognition,
image morphing, "gesture-based" manipulation of video imagery, and
image segmentation and recomposition. It also provides a method for
matching fiduciary points in images of distinct but related objects
by matching each image separately to one or more models of the
appearance of object features. In addition, the results of the
present invention can be used to define and align features which
are sought in the imagery. As another example, the labeled control
points in an image can be used as the basis for controlling
physical operations, such as guiding a robotic arm in grasping an
object which appears in a real-time image.
[0031] In the following description of examples of the invention,
reference is made to features on a face as the bases for fiduciary
points. It will be appreciated that the references to various
points on a face are merely exemplary, to facilitate an
understanding of the invention, and do not represent the only
practical implementation of the invention. Rather, the principles
of the invention can be applied in any situation in which it is
desirable to automatically identify hidden data, such as control
points, from observed data within a set of data, or a subset
thereof.
[0032] In practice, the present invention is carried out on a
computer that is suitably programmed to perform the tasks described
hereinafter, as well as the ultimate data processing operation that
is desired from the hidden data, such as image morphing. The
details of the computer itself, as well as the ultimate data
processing steps, do not form part of the invention, and therefore
are not described herein. Generally speaking, the data to be
processed is stored in suitable memory within the computer, e.g.
random access memory and/or a non-volatile storage medium such as a
hard disk, and can be displayed on one or more monitors associated
with the computer, reproduced via audio speakers, or otherwise
presented in a perceptible form that is appropriate to the specific
nature of the data.
[0033] FIG. 1a illustrates an example of a representative image
from a database of training images, which might be displayed on a
monitor and which is to be labeled with control points. In this
particular example, the image is that of a human mouth, and shows
the lips slightly parted with the teeth closed. The image is
comprised of $N_x \times N_y$ pixels, and could be a portion
of a much larger image, such as a portrait of a person's face.
[0034] Within the larger image, the pixels pertaining to the
subimage of the mouth could be in a variety of different positions,
depending upon where the image of the face appears in the scene,
the tilt of the person's head, and the like. In this condition, the
pixel data pertaining to the subimage of the mouth is considered to
be unaligned. The first step in the location of the control points,
therefore, is to align the subimage of the mouth within an
$N_x \times N_y$ window of pixels.
[0035] The extraction of a subimage, such as a mouth, from an
overall image might be carried out with a number of different
approaches. In one approach, a feature of interest is first
identified in the image, for example by using the feature
recognition technique described in the previously cited articles by
Pentland et al and Moghaddam et al. Once the feature is identified,
it is then aligned within an $N_x \times N_y$ window of
pixels. This subimage is then marked by the user with the control
points which lie in the mouth area. In another approach, all of the
control points in the overall image are first identified. Groups of
control points can then be used to locate a feature of interest. In
the immediately following discussion of one embodiment of the
invention, it will be assumed that each subimage comprises a
feature that has been first identified and aligned within an
$N_x \times N_y$ window, so that the feature appears
consistently at the same location from one subimage to the
next.
[0036] The control points are labeled in the representative images
by the user, for example by using a pointing device for the
computer such as a mouse and cursor or a pen. Some illustrative
control points for the image of FIG. 1a are identified in FIG. 1b.
These control points are located at the corners of the mouth, and
at the inner and outer edges of both the upper and lower lips, at
the centers thereof. Control points are also placed at the top and
bottom edges of the upper and lower teeth which are located one
tooth to the left and one tooth to the right of the center teeth.
It is to be noted that the top edge of the upper teeth and the
bottom edge of the lower teeth are not visible in FIG. 1a, and
therefore the locations of these control points are estimated by
the user. FIG. 1c illustrates the control points by themselves. The
location of each control point within the image can be designated
by means of x and y coordinates in the pre-aligned subimage.
[0037] FIGS. 2a-2c and 3a-3c illustrate two other representative
pre-aligned subimages, with their corresponding control points
identified. For ease of understanding, all of the examples of FIGS.
1, 2 and 3 are of the same scale. This can be achieved by
resampling the images, as needed. In the image of FIG. 2a, both the
mouth and teeth are closed, so that the control points for the
inner edge of each of the upper and lower lips coincide with one
another, and all of the teeth are hidden. In the representative
image of FIG. 3a, the mouth is open wider than in FIG. 1a, so as to
reveal more of the teeth, and the teeth themselves are partially
open.
[0038] To generate a model which is used to automatically label
control points on other, unmarked images of a human mouth, the
representative pre-aligned subimages and their control-point
locations are analyzed to generate a joint model of their expected
values and of their expected coupled variations. As a first step in
the analysis, an image data vector is generated for each
representative pre-aligned subimage. In the examples of FIGS. 1a,
2a and 3a, each subimage is an $N_x \times N_y$ array of pixels.
Referring to FIG. 4, an image data vector $f_1$ for the image of
FIG. 1a is formed by a linear concatenation of the data values for
all of the pixels in the image, to thereby form a vector of length
$N_x \times N_y$. An optional processing step on
each image data vector can be included to normalize the amplitude
of the vector. This step may be required if there are significant
brightness variations between the different images in the database.
The data values that constitute the vector $f_1$ can represent
grayscale values, color, hue, saturation, or any other perceptible
attribute of an image. In the following discussion, specific
reference will be made to grayscale values, but it will be
appreciated that any other quantifiable value or vector of values
can be used as well.
[0039] In essence, each pixel value represents one dimension of an
$N_x N_y$-dimensional vector. A similar vector $p_1$ is formed for
the designated control points. If the number of control points is
identified as L, and each control point is represented by two
values, namely its x and y coordinates, the vector for the control
points will have a length of 2L. Similar vectors $f_2$, $f_3$ and
$p_2$, $p_3$ are calculated for each of the representative images
of FIGS. 2a and 3a.
[0040] After the vectors have been determined for each of the
individual images, an average vector is computed, as depicted in
FIG. 5. In the example of FIG. 5, the total number of
representative images is M. For the image data, the average vector
$\bar{F}$ contains $N_x N_y$ elements, each of which is the average
of the grayscale value for a corresponding pixel location in each
of the representative pre-aligned subimages. In a similar manner,
an average vector $\bar{P}$ is computed for the 2L control point
values of all the representative images.
[0041] Using the average vector $\bar{F}$ for the image data, an
example image-variation matrix F can be created by removing the
bias from the individual image vectors and combining the result
into a matrix, as follows:
$$F = [\,(f_1 - \bar{F})\ (f_2 - \bar{F})\ \cdots\ (f_M - \bar{F})\,]. \tag{1}$$
[0042] This matrix is depicted in FIG. 6. In a similar manner, a
matrix of control point location variations can be created as
follows:
$$P = [\,(p_1 - \bar{P})\ (p_2 - \bar{P})\ \cdots\ (p_M - \bar{P})\,]. \tag{2}$$
[0043] The combined matrix
$$\begin{bmatrix} F \\ P \end{bmatrix}$$
[0044] completely describes the observed coupled variations in the
pre-aligned representative images and the control point locations.
Each observed variation of the subimage data from $\bar{F}$, the
expected image-data values, appears in the top $N_x N_y$ rows of
each column. Each corresponding observed variation of the
control-point locations from $\bar{P}$, the expected
control-point-location values, appears in the bottom 2L rows of the
same column.
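For illustration only, the construction just described might be coded as in the following sketch, under assumed array shapes; the function and variable names are not part of the disclosure:

```python
import numpy as np

def build_deviation_matrices(subimages, control_points):
    """Form the mean vectors and the deviation matrices of equations (1) and (2).

    subimages      : (M, Ny, Nx) array of pre-aligned grayscale subimages
    control_points : (M, L, 2) array of (x, y) control-point locations
    """
    M = subimages.shape[0]
    f = subimages.reshape(M, -1).T          # each column is one image vector f_i
    p = control_points.reshape(M, -1).T     # each column is one control-point vector p_i (length 2L)
    F_bar = f.mean(axis=1)                  # average image vector
    P_bar = p.mean(axis=1)                  # average control-point vector
    F_dev = f - F_bar[:, None]              # equation (1)
    P_dev = p - P_bar[:, None]              # equation (2)
    return F_bar, P_bar, F_dev, P_dev
```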
[0045] The process as described up to this point is identical to
that described in U.S. Pat. No. 6,188,776. Furthermore, as in the
procedure of that patent, a linear manifold model of the coupled
variations is used to estimate the values of the hidden data from
the observed data.
[0046] Unlike the procedure of the earlier patent, however, the
model is not derived from the singular vectors of the combined data
matrix. Instead, traditional estimation theory is employed, and the
Wiener filter is used, which is defined as:
$$\hat{p} = R_{pf}\, R_{ff}^{+}\, (f - \bar{F}) + \bar{P}$$
[0047] where $R_{ff}^{+}$ is the pseudo-inverse of $R_{ff}$.
Estimation theory guarantees that, in the least-squares sense, this
is the optimal linear model for estimating the hidden data, p, from
the observed data, f. Hence, it may be considered a better
estimator (in the least-squares sense) than the estimator derived
in the aforementioned patent.
[0048] At this point, the method of this invention can be summarized
by the following two stages (a sketch in code follows the Stage 2 steps):
[0049] Stage 1: Training
[0050] 1.1) Pre-align labelled training data so that the "feature"
on which fiduciary points are clustered is at a known position,
through in-plane rotation and scaling of sub-images that form the
observed data. This can be done using eigenfeatures or neural
networks, or any other feature-location technique that gives
accurate alignment. For instance, affine tracking can be
employed.
[0051] 1.2) Estimate the cross correlation between the pre-aligned
input data (observed data, F) and the output data (hidden data, P),
using the labels that indicate the hidden data on this training
set. This can be done in a variety of ways; one approach to this
computation is described below.
[0052] 1.3) Estimate the inverse of the autocorrelation matrix for
the pre-aligned input data (observed data, F).
[0053] 1.4) Form the Wiener prediction matrix $W = R_{pf}\, R_{ff}^{+}$.
[0054] Stage 2: Labelling
[0055] 2.1) Determine the correct alignment, in-plane rotation, and
scaling of new (unlabelled) input data and use that to create an
aligned data vector, f.
[0056] 2.2) Subtract the mean aligned-data vector, $\bar{F}$, to
give $(f - \bar{F})$, the variation of the observed data from the
mean aligned data.
[0057] 2.3) Premultiply the variation vector, $(f - \bar{F})$, by
the Wiener matrix W to give an estimate of the variation of the
hidden data from its mean value.
[0058] 2.4) Add the mean-value vector for the hidden data,
$\bar{P}$, to give the final estimate of the hidden data.
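The following is a minimal sketch of steps 1.2 through 1.4 and 2.2 through 2.4, assuming the alignment of steps 1.1 and 2.1 has already been performed and that the deviation matrices of equations (1) and (2) are available; the function names are hypothetical:

```python
import numpy as np

def train_wiener(F_dev, P_dev):
    """Steps 1.2-1.4: estimate the correlations and form W = R_pf R_ff^+."""
    M = F_dev.shape[1]
    R_ff = F_dev @ F_dev.T / M              # autocorrelation of the aligned observed data
    R_pf = P_dev @ F_dev.T / M              # cross-correlation of hidden vs. observed data
    return R_pf @ np.linalg.pinv(R_ff)      # Wiener prediction matrix W

def label_with_wiener(f_aligned, W, F_bar, P_bar):
    """Steps 2.2-2.4: subtract the observed mean, premultiply by W, add back the hidden mean."""
    return P_bar + W @ (f_aligned - F_bar)
```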
[0059] This algorithm can be further refined by incorporating the
correlation "compression" technique of Canonical Correlation
Analysis.
[0060] In general, Canonical Correlation Analysis (CCA) uses joint
data $x_i$ and $y_i$, from input and output subspaces,
respectively, to find canonical correlation matrices, $A_x$ and
$A_y$. In the context of the present invention, therefore, the
input data comprises f and p, respectively, and produces the
matrices $A_f$ and $A_p$. These matrices whiten the input and
output data, respectively, as well as make the cross-correlation
diagonal and maximally compact. Specifically, the variables $\eta$
and $\phi$ are defined as follows:
$$\eta = A_f^T (f - \bar{F})$$
$$\phi = A_p^T (p - \bar{P})$$
[0061] and the following properties are observed:
[0062]
$$E\{\eta\eta^T\} = I$$
$$E\{\phi\phi^T\} = I$$
$$E\{\phi\eta^T\} = \Sigma_K = \mathrm{diag}\{\sigma_1, \sigma_2, \ldots, \sigma_L\}$$
[0063] where $1 \geq \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_M > 0$, and
$\sigma_{M+1} = \sigma_{M+2} = \cdots = \sigma_L = 0$.
[0064] For i starting from 1 and repeating up to L, $\sigma_i$ is
the largest possible correlation between $\eta_i$ and $\phi_i$
(where $\eta_i$ and $\phi_i$ are the i-th elements of $\eta$ and
$\phi$, respectively), given the norm constraints on $\eta$ and
$\phi$, and given that $\eta_i$ and $\phi_i$ are uncorrelated with
$\eta_1, \eta_2, \ldots, \eta_{i-1}$ and
$\phi_1, \phi_2, \ldots, \phi_{i-1}$.
[0065] The matrices $A_f$ and $A_p$ can be determined by first
whitening the input and output data, as follows:
$$f^l = R_{ff}^{-1/2}(f - \bar{F}) \tag{3}$$
$$p^l = R_{pp}^{-1/2}(p - \bar{P}) \tag{4}$$
[0066] The left and right singular vectors of the cross-correlation
matrix between the whitened data are then identified as follows:
$$K = R_{p^l f^l} = R_{pp}^{-1/2}\, R_{pf}\, R_{ff}^{-1/2} = U_K \Sigma_K V_K^T \tag{5}$$
[0067] Since the singular value decomposition (SVD) gives the same
type of maximal compaction that is needed for the cross-correlation
matrix, $A_f$ and $A_p$ can be defined as follows:
$$A_f = R_{ff}^{-1/2}\, V_K \tag{6}$$
$$A_p = R_{pp}^{-1/2}\, U_K \tag{7}$$
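One possible numerical realization of equations (3) through (7) is sketched below. It assumes the correlation matrices are small enough to form explicitly and uses a symmetric eigendecomposition for the (pseudo-)inverse square roots; the helper names are assumptions, not part of the disclosure. For full-size images, the SVD-based route of equations (8) through (17) below avoids forming these matrices.

```python
import numpy as np

def inv_sqrt(R, eps=1e-10):
    """Symmetric (pseudo-)inverse square root of a covariance matrix."""
    w, V = np.linalg.eigh(R)
    w_inv_sqrt = np.where(w > eps, 1.0 / np.sqrt(np.clip(w, eps, None)), 0.0)
    return V @ np.diag(w_inv_sqrt) @ V.T

def cca_matrices(F_dev, P_dev):
    """Compute A_f, A_p and the canonical correlations from equations (3)-(7)."""
    M = F_dev.shape[1]
    R_ff = F_dev @ F_dev.T / M
    R_pp = P_dev @ P_dev.T / M
    R_pf = P_dev @ F_dev.T / M
    K = inv_sqrt(R_pp) @ R_pf @ inv_sqrt(R_ff)      # equation (5)
    U_K, sigma, Vt_K = np.linalg.svd(K, full_matrices=False)
    A_f = inv_sqrt(R_ff) @ Vt_K.T                   # equation (6)
    A_p = inv_sqrt(R_pp) @ U_K                      # equation (7)
    return A_f, A_p, sigma
```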
[0068] The Wiener matrix can then be rewritten in terms of the
canonical-correlation matrices. This provides an implementation of
a conventional Wiener filter with better numerical properties:
$$W = R_{pf}\, R_{ff}^{+} = R_{pp}^{1/2}\,(R_{pp}^{+1/2}\, R_{pf}\, R_{ff}^{+1/2})\, R_{ff}^{+1/2}$$
$$= R_{pp}^{1/2}\, K\, R_{ff}^{+1/2}$$
$$= R_{pp}^{1/2}\, U_K \Sigma_K V_K^T\, R_{ff}^{+1/2}$$
$$= (R_{pp}^{+1/2}\, U_K)^{+T}\, \Sigma_K\, (R_{ff}^{+1/2}\, V_K)^T$$
$$= A_p^{+T}\, \Sigma_K\, A_f^T$$
[0069] where $U^{+T}$ is the transpose of the pseudo-inverse of U
and $R^{+1/2}$ is the pseudo-inverse of $R^{1/2}$. The
pseudo-inverses $R_{pp}^{+}$ and $R_{ff}^{+}$ are computed according
to an expected noise floor in the observed and hidden data spaces,
and $U_K$, $V_K$, and $\Sigma_K$ are appropriately truncated
matrices, to thereby retain only the most significant M dimensions,
where M is also chosen according to the expected noise floor in the
data.
[0070] The advantage of this recasting of the Wiener filter is that
it becomes possible to apply the filter by first projecting the
input (observed) data, $f - \bar{F}$, onto a low (M-dimensional)
subspace using $A_f^T$, then rescaling the M coordinates according
to the diagonal matrix $\Sigma_K$, and finally re-projecting the
rescaled coordinates into the (high-dimensional) output subspace
using $A_p^{+T}$.
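A sketch of that three-step application (project, rescale, re-project), assuming $A_f$, $A_p$ and the diagonal of $\Sigma_K$ have already been estimated; np.linalg.pinv(A_p).T stands in for $A_p^{+T}$:

```python
import numpy as np

def apply_cca_wiener(f_aligned, F_bar, P_bar, A_f, A_p, sigma):
    """Apply W = A_p^{+T} Sigma_K A_f^T without forming W explicitly."""
    eta = A_f.T @ (f_aligned - F_bar)        # project onto the low, M-dimensional subspace
    eta_scaled = sigma * eta                 # rescale by the diagonal of Sigma_K
    A_p_pinvT = np.linalg.pinv(A_p).T        # A_p^{+T}
    return P_bar + A_p_pinvT @ eta_scaled    # re-project into the hidden-data space
```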
[0071] This improved approach to the Wiener computation eliminates
the need to explicitly estimate $R_{pf}$ and $R_{ff}^{+}$. Instead,
it is only necessary to estimate $\Sigma_K$, $A_p^{+T}$ and
$A_f^T$. These matrices can be estimated by first estimating
$R_{pp}$, $R_{ff}$, and $R_{pf}$, computing the SVD of
$R_{pp}^{+1/2}\, R_{pf}\, R_{ff}^{+1/2}$, and using the components
to compute $A_p^{+T}$ and $A_f^T$. However, that approach
introduces a well-known problem due to doubling the dynamic range
of the analysis data. Instead, the estimation equations are derived
in terms of the components of the SVDs of the training data
matrices. Specifically, the SVDs of the zero-mean input and output
matrices are determined as follows:
$$[\,x_1 - \bar{x}\ \cdots\ x_N - \bar{x}\,] = \sqrt{N-1}\; U_x \Sigma_x V_x^T, \tag{8}$$
$$[\,y_1 - \bar{y}\ \cdots\ y_N - \bar{y}\,] = \sqrt{N-1}\; U_y \Sigma_y V_y^T. \tag{9}$$
[0072] From these two decompositions, the two correlation matrices
can be written as:
$$R_{xx} = U_x \Sigma_x^2 U_x^T, \qquad R_{xx}^{-1/2} = U_x \Sigma_x^{-1} U_x^T, \tag{10}$$
$$R_{yy} = U_y \Sigma_y^2 U_y^T, \qquad R_{yy}^{-1/2} = U_y \Sigma_y^{-1} U_y^T, \tag{11}$$
[0073] and then the cross-correlation matrix becomes:
$$R_{yx} = U_y \Sigma_y V_y^T V_x \Sigma_x U_x^T.$$
[0074] Using these expressions for the correlation matrices, the K
matrix becomes
$$K = (U_y \Sigma_y^{-1} U_y^T)(U_y \Sigma_y V_y^T V_x \Sigma_x U_x^T)(U_x \Sigma_x^{-1} U_x^T) = U_y V_y^T V_x U_x^T. \tag{13}$$
[0075] The quantity $U_y^T K U_x$ is expressed in terms of its SVD
as follows:
$$U_y^T K U_x = V_y^T V_x = (U_y^T U_K)\, \Sigma_K\, (V_K^T U_x) = U_{UKU}\, \Sigma_K\, V_{UKU}^T, \tag{14}$$
[0076] and, due to the uniqueness of the SVD,
[0077]
$$U_y^T U_K = U_{UKU}$$
[0078] and
$$U_x^T V_K = V_{UKU}. \tag{15}$$
[0079] The equation for $A_f$ can be rewritten to remove the need
for the squaring operation:
$$A_f = R_{ff}^{-1/2} V_K = U_f \Sigma_f^{-1} (U_f^T V_K) = U_f \Sigma_f^{-1} V_{UKU} \tag{16}$$
[0080] and similarly for $A_p$:
$$A_p = R_{pp}^{-1/2} U_K = U_p \Sigma_p^{-1} (U_p^T U_K) = U_p \Sigma_p^{-1} U_{UKU}. \tag{17}$$
[0081] Using these identities, $A_f$ and $A_p$ are computed using
the following steps:
[0082] 1) Find the SVDs of the data matrices using equations 8 and
9.
[0083] 2) Form a rotated version of the cross-correlation matrix K
and compute its SVD using equation 14.
[0084] 3) Compute the $A_f$ and $A_p$ matrices using equations 16
and 17.
[0085] With these estimates of $A_f$ and $A_p$, it becomes feasible
to form W and then label the data as described in Stage 2 above.
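The three steps of paragraphs [0082] through [0084] might be realized as in the following sketch, which works directly from the SVDs of the deviation matrices (equations (8), (9), (14), (16) and (17)); the truncation parameter and array names are assumptions:

```python
import numpy as np

def cca_from_data_svds(F_dev, P_dev, rank=None, eps=1e-10):
    """Steps 1)-3): compute A_f, A_p and Sigma_K from the SVDs of the deviation matrices."""
    M = F_dev.shape[1]
    # Step 1: SVDs of the zero-mean data matrices, equations (8) and (9).
    U_f, S_f, Vt_f = np.linalg.svd(F_dev / np.sqrt(M - 1), full_matrices=False)
    U_p, S_p, Vt_p = np.linalg.svd(P_dev / np.sqrt(M - 1), full_matrices=False)
    # Step 2: rotated cross-correlation V_p^T V_f and its SVD, equation (14).
    U_UKU, sigma, Vt_UKU = np.linalg.svd(Vt_p @ Vt_f.T, full_matrices=False)
    if rank is not None:                     # optional truncation to the expected noise floor
        U_UKU, sigma, Vt_UKU = U_UKU[:, :rank], sigma[:rank], Vt_UKU[:rank, :]
    # Step 3: A_f and A_p from equations (16) and (17), avoiding any squaring of the data.
    A_f = U_f @ np.diag(1.0 / np.maximum(S_f, eps)) @ Vt_UKU.T
    A_p = U_p @ np.diag(1.0 / np.maximum(S_p, eps)) @ U_UKU
    return A_f, A_p, sigma
```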
[0086] In the foregoing description, each feature of an object and
the grouping of control points with a feature was implicitly
defined. The definition of features, and the grouping of control
points, can be carried out in a number of different ways. One
approach, described in the references by Pentland et al and
Moghaddam et al, is to use manually defined features and, by
extension, manually defined groupings of features and control
points.
[0087] Another alternative is to define the features, either
manually, semi-manually, or automatically, and then automatically
assign the control points to features. In this case, a "feature
location" plus a "feature extent" is required for feature
definition. The feature location must be determined for each
feature in each training example. The feature extent can be
provided once for each feature by a "windowing function" with
compact support, i.e. the windowing function equals zero outside
some finite-sized region.
[0088] One way to derive feature definitions automatically is based
on approaches for finding visually distinct areas as described, for
example, in J. Shi and C. Tomasi, "Good Features to Track", CVPR,
1994. The techniques mentioned in this reference provide metrics
for determining how distinctive different local regions are and how
stable they are across aligned images. In the training database,
alignment can be provided by using the control points which are
already included in the training database. These control point
correspondences can then be interpolated to provide a dense
correspondence field using morphing techniques, such as those
described in T. Beier, S. Neely, "Feature-based Image
Metamorphosis," SIGGRAPH 1992.
[0089] The techniques of the Tomasi reference provide image
locations which are both distinctive and stable. This can be
translated into features by "K-means clustering" with a distance
metric that includes both the average proximity of the
differentiated points and the variance in proximity of the points
across the training database. For a description of "K-means
clustering", see Duda and Hart, Pattern Recognition and Scene
Analysis, John Wiley & Sons, 1973, pp. 211-252. Once the
differentiated points are clustered, the feature location can be
defined as a function of the locations of the clustered distinctive
image locations. Any one of the mean location, median location
(median x and median y) or modal location (the (x,y) bin with the
most points, for some bin width) can be used as the function.
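For illustration, the three candidate location functions named above (mean, median, and modal location) might be computed from a cluster of (x, y) points as in the following sketch; the bin width is an assumed parameter:

```python
import numpy as np

def feature_location(points, method="mean", bin_width=4.0):
    """Define a feature location from clustered distinctive image locations.

    points : (K, 2) array of (x, y) locations in one cluster
    """
    if method == "mean":
        return points.mean(axis=0)
    if method == "median":                      # median x and median y, taken independently
        return np.median(points, axis=0)
    if method == "mode":                        # the (x, y) bin holding the most points
        bins = np.floor(points / bin_width).astype(int)
        uniq, counts = np.unique(bins, axis=0, return_counts=True)
        return (uniq[np.argmax(counts)] + 0.5) * bin_width   # center of the fullest bin
    raise ValueError("unknown method")
```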
[0090] The spatial extent of the feature can also be defined either
manually or as a function of the clustered distinctive image
locations. One possibility is to use a convex hull of clustered
locations, with a Hamming-like drop-off perpendicular to the
boundary of the hull. Other possibilities include RBF-like windows,
where the windowing magnitude drops off as a truncated Gaussian
from each of the clustered points, with a hard-limit maximum value
of one at each point. Semi-manual definition is also reasonable,
since this only requires one basic description (the windowing
function) for each feature, instead of a new piece of information
on each training image.
[0091] Once the features have been defined, manually, semi-manually
or automatically, the control points are automatically grouped with
the features. Alternative approaches are possible for this grouping
as well. A preferred approach employs the following steps (a sketch in code follows the list):
[0092] for a control point which almost always lies within one
feature's extent, (e.g. greater than 90% of the examples), and
seldom lies within any other feature's extent, (e.g. less than 50%
of the examples), the control point is associated with the one
feature;
[0093] for each control point which lies within the extent of
plural features more often than is considered seldom (e.g. more
than 50% of the time) the same distance metric is used between the
control point and the centers of the features with which it
overlaps the required number of times. The feature which exhibits
the smallest distance metric is chosen for the control point;
[0094] for each control point which does not lie within any
feature's extent almost always (e.g. more than 90% of the time) a
distance metric is determined between the control point and the
centers of all of the features, which takes into account both the
average proximity and variance in proximity. The feature with the
smallest distance metric is chosen for the control point.
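The sketch below is one possible rendering of these three rules; the thresholds, the proximity-plus-variance distance metric, and the array layout are assumptions drawn from the description above, not a prescribed implementation:

```python
import numpy as np

def group_control_points(inside, dist_mean, dist_var, often=0.9, seldom=0.5):
    """Assign each control point to one feature, following the three rules above.

    inside    : (C, F) fraction of training examples in which control point c
                lies within feature f's extent
    dist_mean : (C, F) average distance from control point c to feature f's center
    dist_var  : (C, F) variance of that distance across the training examples
    """
    metric = dist_mean + dist_var               # assumed proximity-plus-variance metric
    assignment = np.empty(inside.shape[0], dtype=int)
    for c in range(inside.shape[0]):
        within = inside[c] >= often             # features that almost always contain the point
        overlaps = inside[c] > seldom           # features that contain it more often than "seldom"
        if within.sum() == 1 and overlaps.sum() <= 1:
            assignment[c] = int(np.flatnonzero(within)[0])        # first rule
        elif overlaps.sum() > 1:
            cand = np.flatnonzero(overlaps)                       # second rule
            assignment[c] = int(cand[np.argmin(metric[c, cand])])
        else:
            assignment[c] = int(np.argmin(metric[c]))             # third rule
    return assignment
```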
[0095] Another alternative for defining features and grouping
control points with features is to first group control points and
then define a feature to be associated with that group, either
semi-manually or automatically. The control points can be first
grouped using "K-means clustering" with a distance metric which
measures both average proximity and variance in proximity between
the control points. Once the control point clusters are defined,
the associated feature location is automatically defined as a
function of the control point locations in each cluster. Again,
mean location, median location or modal location can be employed to
define the feature location function. The feature extent can be
defined manually or automatically. If defined automatically, it can
be determined from either the clustered control point locations
only, or both of those locations and differentiated image
locations, as described previously. One approach is to take the
convex hull of the clustered control-point locations with a
Hamming-like drop-off perpendicular to the boundary. Another
approach is to include with the cluster all the differentiated
points which, in any training image, lie within the convex hull of
the clustered control points and to then use the convex hull of
this expanded set of points. In this approach, if no differentiated
image locations are associated with the clustered control points,
then the nearest differentiated image location, in average
distance, is added before finding the convex hull.
[0096] Another approach to defining features and the control points
that are grouped with them is to use a "K-means clustering" on the
combined set of control-point locations and differentiated point
locations. The distance metric for this clustering again uses
average proximity and the variance in proximity, but includes the
constraint that at least one control point and at least one
differentiated image point must be included in each cluster. The
feature location and extent can then be determined automatically
from these clusters, in the same ways as described previously.
[0097] The above approaches for control point location and for
defining feature/control point groupings can also be extended to
video inputs and to control point locations over time. For this
situation, the first frame of a sequence to be labeled is treated
as an isolated image, and is labeled with control points in the
manner described previously. For each subsequent frame, the feature
location estimate is derived from a feature tracking system, such
as those described in M. Turk and A. Pentland, "Eigenfaces for
Recognition," Journal of Cognitive Neuroscience, Vol. 3, No. 1,
1991, pp. 71-86, and J. Woodfill, R. Zabih, "An Algorithm for
Real-Time Tracking of Non-rigid Objects," AAAI-91, Proc. Natl. Conf.
on Artificial Intelligence, 1991, pp. 718-723. The image data which
is used to estimate the control-point locations is image data from
each of (T-1) prior image frames, plus image data from the current
image frame. In each set of data, a subimage is extracted on the
basis of an estimated location of the feature in that frame. In
addition, the control-point location estimates for the (T-1) prior
frames are included in the observed data. This results in
$((T-1)(N_x N_y + 2L) + N_x N_y)$ dimensions of observed
data. This data is projected (possibly with regularization) onto
the coupled model manifold, and then into the space of current
control-point locations. The coupled model manifold is derived from
image sequences in the same general manner as the isolated-image
coupling models.
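As an illustration of the stated dimension count, a hypothetical helper that stacks the (T-1) prior subimages and control-point estimates with the current subimage might look like this:

```python
import numpy as np

def temporal_observation(prior_subimages, prior_controls, current_subimage):
    """Stack the (T-1) prior frames' data with the current frame's subimage.

    prior_subimages  : (T-1, Ny, Nx) subimages extracted from the prior frames
    prior_controls   : (T-1, L, 2) control-point estimates for the prior frames
    current_subimage : (Ny, Nx) subimage extracted from the current frame

    Returns a vector of length (T-1)*(Nx*Ny + 2L) + Nx*Ny.
    """
    parts = []
    for img, ctrl in zip(prior_subimages, prior_controls):
        parts.append(img.ravel())               # Nx*Ny image values per prior frame
        parts.append(ctrl.ravel())              # 2L control-point coordinates per prior frame
    parts.append(current_subimage.ravel())      # current frame contributes image data only
    return np.concatenate(parts)
```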
[0098] In the description given above, each feature is separately
located and labeled. With multiple features, mutual information
between features can be used to improve the detection and location
of each of the features. For example, the fact that the left eye is
typically seen above and to the left of the nose can be used to
reinforce observations of this configuration of features. One
approach which can utilize mutual information between features is
to create composite models, which include many or all of the
features being sought. An example of this approach is reported in
A. Pentland, B. Moghaddam and T. Starner, "View-Based and Modular
Eigenspaces for Face Recognition," CVPR '94, pp. 84-91.
[0099] Another way to combine the information given by the manifold
match with the information given by the expected relative positions
of features is to treat them as independent sources of information
that the feature is not in a given location. Under this assumption,
the probability that the feature is not in a particular location is
given by:
$$1 - P_{total,i}(L_i) = \bigl(1 - P_{U_i}(L_i)\bigr) \prod_{j \neq i} \bigl(1 - P_{i|dist_j}(L_i)\bigr)$$
where the product runs over all features j except i, and
[0100]
$$P_{i|dist_j}(L_i) = \sum_{\Delta y} \sum_{\Delta x} P_{U_j}\bigl(L_i - [\Delta y, \Delta x]\bigr)\; P_{i|j}\bigl(L_i \mid L_i - [\Delta y, \Delta x]\bigr)$$
[0101] where $P_{total,i}(L_i)$ is the final (total) likelihood of
feature i at location $L_i$;
[0102] $P_{U_i}(L_i)$ is the match likelihood of feature i at
location $L_i$, estimated from the affine manifold model of
feature i; and
[0103] $P_{i|dist_j}(L_i)$ is the probability of feature i being at
location $L_i$, based on the match likelihood distribution of
feature j.
[0104] After some algebraic manipulation, a recursive definition
for the total probability is given by:
$$P_{0,i}(L_i) = P_{U_i}(L_i)$$
$$P_{K,i}(L_i) = \begin{cases} P_{(K-1),i}(L_i) & \text{if } K = i \\ P_{(K-1),i}(L_i) + \bigl(1 - P_{(K-1),i}(L_i)\bigr)\, P_{i|dist_K}(L_i) & \text{otherwise} \end{cases}$$
$$P_{total,i}(L_i) = P_{N,i}(L_i)$$
[0106] where $P_{K,i}(L_i)$ is the likelihood of feature i at
location $L_i$, estimated from the match probability of feature i
and the relative position information from features
[0107] $j = 0 \ldots K$, omitting i; and
[0108] N is the total number of related features.
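A direct transcription of this recursion is sketched below; P_match and P_dist are hypothetical arrays holding, respectively, the match-likelihood maps $P_{U_i}(L_i)$ and the relative-position terms $P_{i|dist_j}(L_i)$ over a grid of candidate locations:

```python
import numpy as np

def total_likelihood(P_match, P_dist, i):
    """Combine feature i's match likelihood with relative-position evidence.

    P_match : (N, H, W) match-likelihood map for each feature
    P_dist  : (N, N, H, W) where P_dist[j, i] holds P_{i|dist j} over the same map
    """
    P = P_match[i].copy()                       # P_{0,i} = P_{U_i}
    for j in range(P_match.shape[0]):
        if j == i:
            continue                            # the K = i step leaves the estimate unchanged
        P = P + (1.0 - P) * P_dist[j, i]        # the "otherwise" update
    return P                                    # P_{total,i} = P_{N,i}
```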
[0109] These recursive equations are used in labeling to modify the
match likelihood. Based on experimental results, it is also useful
to reduce the effect that one feature can have on another feature's
distribution as a function of the distance between the two
features. For example, the chin location should not have as a large
influence over the forehead location as it does over the mouth
location. With enough training data, this diffusion effect is
captured in the models of the expected relative positions of the
features: the chin/mouth dependency has a much sharper and higher
peak than the chin/forehead dependency. However, if limited
training data is available, it may be best to explicitly reduce the
coupling between distant features by reducing the magnitude of
$P_{i|j}(L_i \mid L_i - D)$ as a function of distance $(|D|)$.
[0110] The conditional probabilities relating feature locations,
P.sub.i.vertline.j(L.sub.i.vertline.L.sub.j), can be estimated from
the training data. This is done by noting that these probabilities
are approximately stationary. It is only the offset between the two
feature locations which is of significance, not the absolute
locations of the features. Using this fact, the conditional
probability P_{i|j}(L_i \mid L_j) can be estimated in the training
stage by the following steps (sketched in code after the list):
[0111] (a) aligning the training images such that the location of
feature j is at the origin of the coordinate system;
[0112] (b) accumulating the (two-dimensional) location histogram
for feature i; and
[0113] (c) normalizing the histogram values by the total number of
training images, to give an estimated distribution of
probabilities.
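The sketch below implements steps (a)-(c), assuming the training
control-point locations are supplied as (y, x) pixel coordinates;
subtracting the feature-j location is equivalent to aligning each
image so that feature j sits at the origin. The function name and the
histogram support bound max_offset are assumptions made for the
example.

```python
import numpy as np

def estimate_offset_distribution(locs_i, locs_j, max_offset=64):
    """Estimate the stationary conditional P_{i|j} as a normalized 2-D histogram.

    locs_i, locs_j : arrays of shape (M, 2) holding (y, x) locations of
                     features i and j in the M training images.
    max_offset     : assumed half-width of the histogram support.
    """
    locs_i = np.asarray(locs_i, dtype=float)
    locs_j = np.asarray(locs_j, dtype=float)
    # (a) align each image so that feature j lies at the origin
    offsets = locs_i - locs_j
    # (b) accumulate the two-dimensional location histogram for feature i
    edges = np.arange(-max_offset, max_offset + 1)
    hist, _, _ = np.histogram2d(offsets[:, 0], offsets[:, 1], bins=(edges, edges))
    # (c) normalize by the total number of training images
    return hist / len(locs_i)
```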
[0114] It will be recognized that an increase in the number of
samples that are employed in the training stage can lead to a
reduction in errors during the labelling stage. If a limited number
of training samples is available, the training set can be expanded
to provide additional pre-aligned feature images. To this end, a
set of "allowed pairings" of images are defined by the user. This
set defaults to all M(M-1)/2 combinations of image pairs in the
case of an original training set of M isolated images, and to M-1
sequentially neighboring pairs in the case of a training set
derived from images extracted from a video sequence.
[0115] For each pair in the allowed set, the images are morphed,
using the marked control-point locations, to generate an arbitrary,
user-defined, number of intermediate images and intermediate
control-point locations. These newly-generated images can be used
both to extend the example database for feature location and to
extend the database for creating a coupled model. A particular
advantage of this approach is the fact that each of the
newly-generated intermediate images is pre-labeled with the control
points, thereby reducing the effort required during the training
stage.
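A short sketch of this expansion step follows. The pair enumeration
and the pre-labeling of the intermediate control points track
paragraphs [0114]-[0115]; morph_images is a hypothetical stand-in for
the morphing routine itself (which is not defined here), and the
remaining names are illustrative.

```python
import itertools
import numpy as np

def expand_training_set(images, control_points, morph_images,
                        from_video=False, n_intermediate=3):
    """Expand a labeled training set with morphed intermediate examples.

    images         : list of M images (e.g. 2-D arrays).
    control_points : list of M arrays of shape (L, 2) of marked locations.
    morph_images   : hypothetical routine morph_images(im_a, im_b, pts_a,
                     pts_b, alpha) -> intermediate image.
    from_video     : if True, the default allowed pairs are the M-1
                     sequential neighbors; otherwise all M(M-1)/2 pairs.
    n_intermediate : user-chosen number of in-between examples per pair.
    """
    m = len(images)
    if from_video:
        allowed_pairs = [(k, k + 1) for k in range(m - 1)]
    else:
        allowed_pairs = list(itertools.combinations(range(m), 2))

    new_images, new_points = [], []
    for a, b in allowed_pairs:
        pts_a = np.asarray(control_points[a], dtype=float)
        pts_b = np.asarray(control_points[b], dtype=float)
        for step in range(1, n_intermediate + 1):
            alpha = step / (n_intermediate + 1.0)
            # intermediate control points come directly from the labels,
            # so each generated example is already pre-labeled
            pts = (1.0 - alpha) * pts_a + alpha * pts_b
            img = morph_images(images[a], images[b], pts_a, pts_b, alpha)
            new_images.append(img)
            new_points.append(pts)
    return new_images, new_points
```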
[0116] The preceding description is based upon the ability of a
linear manifold to capture variations in the coupling data across
all of the configurations of the feature, appearance and
control-point locations for the images in the training database.
However, there may be situations in which this assumption is
incorrect. In those cases, there will be no single, linear manifold
that can capture the important variations. In the past, attempts at
solving this type of a problem have resorted to the use of
piecewise linear models. See, for example, the previously cited
references by Pentland et al and Bregler et al. In some of these
approaches, the observed data is projected onto each of the
piecewise linear models and is then evaluated to determine which
model provides the best fit. In other approaches, the observed data
is projected onto a single locally linear model, which is then
evaluated to check whether the observed data "belongs" to that
linear model. If it does not, the data is reevaluated on other
pieces of the model until the best fit is found. In either case,
the number of projections which are, or may be, needed grows
linearly with the number of linear pieces in the overall model. K-D
trees (e.g. quad trees) can be used to reduce the linear growth to
logarithmic growth, but the required number of projections
nevertheless grows with the complexity of the model.
[0117] In the context of the present invention, the number of
required projections can be significantly reduced when a piecewise
linear model is employed. Rather than being linearly related to the
total number of pieces in the model, the technique of the present
invention keeps the number of projections constant, independent of
the total model complexity.
[0118] More particularly, the data is first modelled by a linear
manifold. The coordinates within this linear manifold are
quantized, using a scalar quantizer. The quantization boundaries
can be selected by training a simple threshold perceptron, with
each threshold unit having access to only one dimension of the
manifold coordinates. See J. Hertz, A. Krogh, R. G. Palmer,
Introduction to the Theory of Neural Computation, Addison-Wesley
Publishing, 1991, pp. 89-107, for a description of simple
perceptrons and their training. In this case, if there are K
dimensions in the manifold, the procedure can start with K N_L
threshold units, for some arbitrary value N_L. The input to
each perceptron is simply one of the K manifold coordinates (see
FIG. 7). The thresholds establish a grid that is used to divide the
data into clusters. Each cluster is then used to form a separate
linear manifold model. If, after training on the error in the
control-point locations, the error is still too large, N_L can
be increased by adding another K ΔN_L units and retraining the
perceptron. Once the error is acceptable, threshold units can be
removed "non-uniformly" across the K dimensions. One procedure for
doing so is as follows:
[0119] for each of K dimensions, remove one unit from the selected
dimension and retrain the perceptron. Measure the final error.
[0120] pick the network from the K alternatives with the lowest
error.
[0121] repeat until no more threshold units can be removed while still
meeting the error bound.
[0122] This technique allows non-uniform quantization to be
employed in each of the dimensions.
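A minimal sketch of this greedy removal loop is given below.
The callable train_and_measure is a hypothetical stand-in for
retraining the perceptron with a given allocation of threshold units
and returning the resulting control-point location error; the
function and argument names are assumptions for the example.

```python
def prune_threshold_units(units_per_dim, error_bound, train_and_measure):
    """Greedily remove threshold units, one dimension at a time.

    units_per_dim     : list of K integers, the current number of threshold
                        units assigned to each manifold dimension.
    error_bound       : maximum acceptable control-point location error.
    train_and_measure : hypothetical callable taking a units-per-dimension
                        list, retraining the perceptron, and returning the
                        final error.
    """
    current = list(units_per_dim)
    while True:
        best_error, best_dim = None, None
        for k in range(len(current)):
            if current[k] == 0:
                continue
            trial = list(current)
            trial[k] -= 1                      # remove one unit from dimension k
            err = train_and_measure(trial)     # retrain and measure final error
            if best_error is None or err < best_error:
                best_error, best_dim = err, k
        # stop when no removal keeps the error within the bound
        if best_dim is None or best_error > error_bound:
            return current
        current[best_dim] -= 1
```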
[0123] Alternatives to perceptrons for determining grid line
placement include global optimization procedures by regular
sampling or by statistical sampling (e.g. genetic algorithms or
simulated annealing algorithms).
[0124] This simple approach will succeed in validly segmenting the
training data as long as the data is "sufficiently linear". FIGS.
8a and 8b show two illustrative data sets to explain this concept.
In these examples, the observed data dimensionality (i.e. N_x N_y)
is 2, the global manifold dimensionality (i.e. K) is 1, and the
hidden data dimensionality (i.e. 2L) is 1. The
observed data is the location in the plane. The hidden data is the
distance along the dotted curve which has been overlaid on the
training data. Referring to FIG. 8a, the sine-wave curve can be
well approximated by segmenting the data into non-overlapping
regions on the global linear manifold and modeling each data
segment using a linear manifold model of the coupling data. As
shown in FIG. 8b, however, the ellipsoidal curve cannot be well
represented by the same type of segmentation. This is because
non-neighboring piecewise linear patches will overlap one another
when projected onto the global linear manifold.
[0125] One way to correct for this potential difficulty is to allow
some regions of the quantized global model to "point to" multiple
alternative piecewise linear models. During labelling, the model
which is used to estimate the hidden data from the observations
that fall within these multiple-model grid cells is the model with
the minimum distance between its appearance-only feature model and
the observed data.
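The sketch below illustrates this selection among the models attached
to a multiple-model grid cell, assuming each candidate's
appearance-only component is stored as a mean vector and an
orthonormal basis for its affine subspace; the residual norm after
projection stands in for the distance between the appearance-only
model and the observed data. The representation and names are
assumptions for the example.

```python
import numpy as np

def pick_model_for_cell(observation, candidate_models):
    """Choose among the piecewise linear models attached to one grid cell.

    observation      : 1-D array, the vectorized observed feature image.
    candidate_models : list of (mean, basis) pairs; basis rows are assumed
                       orthonormal directions of the appearance subspace.
    Returns the index of the model whose appearance subspace lies closest
    to the observation.
    """
    best_idx, best_dist = None, None
    for idx, (mean, basis) in enumerate(candidate_models):
        centered = observation - mean
        coords = basis @ centered                 # project onto the subspace
        residual = centered - basis.T @ coords    # component outside the subspace
        dist = np.linalg.norm(residual)
        if best_dist is None or dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx
```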
[0126] In training, deciding whether to introduce another
quantization level or to introduce multiple-model cells can be
carried out on different bases. Approaches which can be tried
include stochastic sampling of the alternatives (e.g.
population-based search or simulated annealing algorithms).
Alternatively, multiple-model cells can be used if any of the
linear dimensions of the cells fall below some threshold. Which of
these methods is best will depend heavily on the topology of the
data set from which the training data was taken.
[0127] The gridlines define regions on the global linear coupled
manifold. The training data for each region is then used to
create a new linear manifold model of that part of the global
linear coupled manifold. However, depending on the distribution of
training data, this completely "gridline-defined" division of the
training data will result in some regions which have little or no
data with which to create a model. Furthermore, since the training
data is a sparse sampling of the space, completely disjoint models
will result in areas of the global linear manifold which may be
very poorly modelled by local (one-sided) extrapolation. Instead,
models can be merged and data-sets extended in the following
ways:
[0128] Data interpolation across grid cells: The previously
described approach of using morphed intermediate examples can be
used to create intermediate examples on or near the grid-line
boundaries. These examples can be included in the data sets of the
cells on either side of the boundary.
[0129] Model merging between grid cells: If neighboring grid cells
have very similar data (i.e., the error in the control point
location using a merged model is below some user-defined bound),
then the grid cells should be merged in "best-first" order. If this
results in a large number of merged cells, then a hash function can
be used to translate grid cell number to model number (reducing the
number of look-up table entries according to the hash function).
The hash function should be selected to minimize the number of
collisions, where a collision occurs whenever identical hash keys
are expected to map to two or more distinct models. Two or more grid
cells that share a model and have identical hash keys are not
considered a collision.
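An illustrative way to score a candidate hash function against this
collision definition is sketched below; the representation of the
cell-to-model map and the helper name are assumptions. Candidate
hashes can then be compared by their collision counts before one is
adopted for the look-up table.

```python
def count_hash_collisions(cell_to_model, hash_fn, n_buckets):
    """Count collisions for a candidate grid-cell hash function.

    cell_to_model : dict mapping grid-cell number -> model number
                    (after best-first merging).
    hash_fn       : candidate function mapping a cell number to an integer.
    n_buckets     : number of look-up table entries.
    A collision is counted when a bucket is asked to hold a second
    distinct model; cells sharing a model do not count as collisions.
    """
    bucket_models = {}
    collisions = 0
    for cell, model in cell_to_model.items():
        bucket = hash_fn(cell) % n_buckets
        models = bucket_models.setdefault(bucket, set())
        if models and model not in models:
            collisions += 1
        models.add(model)
    return collisions
```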
[0130] With this approach, when a new image is being labeled, only
two projections are required for each dimension, one onto the
linear model and one onto the appropriate facet of the piecewise
linear model.
[0131] It will be apparent that extensions to this quantized
approach include such approaches as linear interpolation of the
hidden data estimates between model patches, based on a measure of
the distances between the selected model patch and the observed
data and between the neighboring model patches and the observed
data. These extensions are within the scope of the invention
described herein.
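One possible realization of such an interpolation is sketched below,
using inverse-distance weights over the selected patch and its
neighbors; the text calls only for a distance-based interpolation, so
the specific weighting scheme and names here are assumptions.

```python
import numpy as np

def blend_hidden_estimates(estimates, distances, eps=1e-9):
    """Blend hidden-data estimates from neighboring model patches.

    estimates : list of 1-D arrays, the hidden-data estimates produced by
                the selected patch and by its neighboring patches.
    distances : matching list of distances between each patch's appearance
                model and the observed data.
    """
    weights = 1.0 / (np.asarray(distances, dtype=float) + eps)
    weights /= weights.sum()          # normalize the inverse-distance weights
    return sum(w * np.asarray(e, dtype=float) for w, e in zip(weights, estimates))
```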
[0132] From the foregoing it can be seen that the present invention
provides a method for estimating the locations of control points on
unmarked imagery. Once the control points have been located, the
fiduciary points in images of distinct but related objects can be
correlated, by matching those images to a model of features of the
object, as shown in FIG. 9a. This capability of the invention is
related to model-based matching, but differs in the sense that the
model is used as an intermediary for matching two distinct
images.
[0133] The results provided by the invention can also be used to
automatically determine correspondences between images when each
image is matched to a separate feature model and when the
control-point locations estimated by each of these feature models
has a known mapping with control-point locations estimated by the
other model. This allows matching of related objects viewed under
very different imaging conditions, such as matching a frontal view
and a profile view of the same face or of different faces. It also allows
matching of unrelated objects using some pre-defined relationship,
such as matching a frontal view of a human face to the front view
of a car or the side view of a dog's body to a side view of a
table, as shown in FIG. 9b.
[0134] The results provided by the invention can be used in a
number of other applications as well. For example, the automated
location of control points can be used to provide much more
efficient image manipulation techniques, such as image segmentation
and recomposition, and automatic morphing. The invention also
facilitates the defining and aligning of features which are sought
in imagery, for recognition purposes and the like. For example,
control points in a real-time image of an object can be used as a
guide to control a robot arm whose task is to grip the object.
Other applications include face recognition, gesture recognition,
body tracking, image encoding, e.g. compression, pose estimation
(as described in Lanitis et al, "A Unified Approach To Coding and
Interpreting Face Images", International Conference on Computer
Vision, 1995), and recognition of periodic or nearly periodic
motion, such as gait recognition.
[0135] It will be appreciated by those of ordinary skill in the art
that the present invention can be embodied in other specific forms
without departing from the spirit or essential characteristics
thereof. For example, the principles of the invention are not
limited to use on natural images; they can also be employed in
connection with graphic images, including images which contain
large areas of the same color, such as cartoons. Furthermore, the
invention is not limited to use with two-dimensional images. It is
equally applicable to one-dimensional data signals, such as the
location of vocal tract positions in a speech signal, to perform
linear predictive coding. Similarly, it can be applied to video
signals, which can be viewed as three-dimensional data since they
include the added dimension of time.
[0136] The presently disclosed embodiments are therefore considered
in all respects to be illustrative, and not restrictive. The scope
of the invention is indicated by the appended claims, rather than
the foregoing description, and all changes that come within the
meaning and range of equivalence thereof are intended to be
embraced therein.
* * * * *