U.S. patent application number 13/300509 was filed with the patent office on 2012-05-24 for method and device for detecting and tracking non-rigid objects in movement, in real time, in a video stream, enabling a user to interact with a computer system.
This patent application is currently assigned to TOTAL IMMERSION. Invention is credited to Nicolas Livet, Thomas Pasquier.
Application Number: 20120129605 (13/300509)
Document ID: /
Family ID: 44168356
Filed Date: 2012-05-24
United States Patent Application 20120129605
Kind Code: A1
Livet; Nicolas; et al.
May 24, 2012
METHOD AND DEVICE FOR DETECTING AND TRACKING NON-RIGID OBJECTS IN
MOVEMENT, IN REAL TIME, IN A VIDEO STREAM, ENABLING A USER TO
INTERACT WITH A COMPUTER SYSTEM
Abstract
The invention relates in particular to the detection of
interactions with a software application according to a movement of
an object situated in the field of an image sensor. After having
received a first and a second image and having identified a first
region of interest in the first image, a second region of interest,
corresponding to the first region of interest, is identified in the
second image. The first and second regions of interest are compared
and a mask of interest characterizing a variation of at least one
feature of corresponding points in the first and second regions of
interest is determined. A movement of the object is then determined
from said mask of interest. The movement is analyzed and, in
response, a predetermined action is triggered or not triggered.
Inventors: Livet; Nicolas (Saint Egreve, FR); Pasquier; Thomas (Libourne, FR)
Assignee: TOTAL IMMERSION (Suresnes, FR)
Family ID: 44168356
Appl. No.: 13/300509
Filed: November 18, 2011
Current U.S. Class: 463/39
Current CPC Class: G06F 3/017 20130101; G06T 7/246 20170101; G06T 2207/10016 20130101; G06K 9/00355 20130101; G06T 2207/30196 20130101; G06F 3/005 20130101
Class at Publication: 463/39
International Class: A63F 9/24 20060101 A63F009/24
Foreign Application Data
Date: Nov 19, 2010; Code: FR; Application Number: 1059541
Claims
1. A computer implemented method of detecting movement of at least
one object situated in a field of an image sensor, the image sensor
providing a stream of images to the computer, the method
comprising: receiving at least one first image from the image
sensor; identifying at least one first region of interest in the
first image, wherein the at least one first region of interest
corresponds to a part of the at least one first image; receiving at
least one second image from the image sensor; identifying at least
one second region of interest in the at least one second image,
wherein the at least one second region of interest corresponds to
the at least one first region of interest of the at least one first
image; comparing the at least one first and second regions of
interest and determining a mask of interest characterizing a
variation of at least one feature of corresponding points in the at
least one first and second regions of interest; determining a
movement of the at least one object from the mask of interest,
wherein the at least one object is at least partially represented
in at least one of the at least one first and second regions of
interest; analyzing the movement; and determining whether to
trigger an action.
2. The method according to claim 1, wherein determining the
movement comprises determining and matching at least one pair of
points of interest in the at least one first and second images,
wherein at least one point of the at least one pair of points of
interest belong to the mask of interest.
3. The method according to claim 2, wherein determining the
movement comprises determining and matching a plurality of pairs of
points of interest in the at least one first and second images,
wherein at least one point of each of the pairs of points of
interest belong to the mask of interest, wherein the movement is
estimated on the basis of a transformation of a first set of points
of interest into a second set of points of interest, wherein the
points of interest of the first and second sets belong to the
plurality of pairs of points of interest, wherein the points of
interest of the first set of points of interest additionally belong
to at least one first image, and wherein the points of interest of
the second set of points of interest additionally belong to at
least one second image.
4. The method according to claim 3, wherein the transformation
implements a weighting function based on a distance between two
points of interest from the same pairs of points of interest of the
plurality of pairs of points of interest.
5. The method according to claim 3, further comprising validating
at least one point of interest of the at least one first image,
belonging to the at least one pair of points of interest, according
to the determined movement, wherein the at least one validated
point of interest is used to track the object in at least one third
image following the at least one second image and the at least one
validated point of interest is used for modifying a mask of
interest created on the basis of the at least one second and third
images.
6. The method according to claim 1, wherein comparing the at least
one first and second regions of interest comprises performing
subtraction, point by point, of values of corresponding points of
the at least one first and second regions of interest and comparing
a result of the subtraction to a threshold.
7. The method according to claim 1, further comprising detecting at
least one feature in the at least one first image, wherein the at
least one first region of interest is at least partially identified
in response to the detecting.
8. The method according to claim 7, wherein the at least one
feature includes at least one of a shape and a color.
9. The method according to claim 1, further comprising estimating
at least one modified second region of interest in the at least one
second image, wherein the at least one modified second region of
interest of the at least one second image is estimated according to
the at least one first region of interest of the at least one first
image and of the at least one second region of interest of the at
least one second image.
10. The method according to claim 9, wherein the estimating
comprises performing an object tracking algorithm of KLT type.
11. The method according to claim 1, wherein the movement comprises
at least one of a translation, a rotation, a scale factor.
12. The method according to claim 11, wherein the movement
comprises a scale factor and wherein whether the action is
triggered is determined based at least in part on the scale
factor.
13. The method according to claim 1, wherein movements of at least
two objects situated in the field of the image sensor are
determined, and wherein whether the action is triggered is
determined based at least in part on a combination of the movements
associated with the at least two objects.
14. (canceled)
15. (canceled)
16. A non-transitory computer readable medium having instructions,
which, when executed cause the computer to perform the method of
claim 1.
17. A device configured to perform the method of claim 1.
Description
[0001] The present invention concerns the detection of objects by
the analysis of images, and their tracking, in a video stream
representing a sequence of images and more particularly a method
and a device for detecting and tracking non-rigid objects in
movement, in real time, in a video stream, enabling a user to
interact with a computer system.
[0002] Augmented reality in particular seeks to insert one or more
virtual objects in images of a video stream representing a sequence
of images. According to the type of application, the position and
orientation of those virtual objects may be determined by data that
are external to the scene represented by the images, for example
coordinates obtained directly from a game scenario, or by data
linked to certain elements of that scene, for example coordinates
of a particular point in the scene such as the hand of a player.
When the nature of the objects present in the real scene has been
identified and the position and the orientation have been
determined by data linked to certain elements of that scene, it may
be necessary to track those elements according to movements of the
video camera or movements of those elements themselves in the
scene. The operations of tracking elements and embedding virtual
objects in the real images may be executed by different computers
or by the same computer.
[0003] Furthermore, in such applications, it may be proposed to
users to interact, in the real scene represented, at least
partially, by the stream of images, with a computer system in order
in particular to trigger particular actions or scenarios which for
example enable the interaction with virtual elements superposed on
the images.
[0004] The same applies in numerous other types of applications,
for example in video game applications.
[0005] With these aims, it is necessary to identify particular
movements such as hand movements to identify one or more
predetermined commands. Such commands are comparable to those
initiated by a computer pointing device such as a mouse.
[0006] The applicant has developed algorithms for visual tracking
of textured objects, having varied geometries, not using any marker
and whose originality lies in the matching of particular points
between a current image of a video stream and a set of key images
which are automatically obtained on initializing the system.
However, such algorithms, described in French patent applications
0753482, 0752810, 0902764, 0752809 and 0957353, do not enable the
detection of movements of objects that are not textured or that
have a practically uniform texture such as the hands of a user.
Furthermore, they are essentially directed to the tracking of rigid
objects.
[0007] Although solutions are known enabling a user to interact
with a computer system, in a scene represented by a sequence of
images, those solutions are generally complex to implement.
[0008] More particularly, a first solution consists in using
tactile sensors which are associated, for example, with the joints
of a user or actor. Although this approach is often dedicated to
movement tracking applications, in particular for cinematographic
special effects, it is also possible to track the position and the
orientation of an actor and, in particular, of his hands and feet
to enable him to interact with a computer system in a virtual
scene. However, the use of this technique proves to be costly since
it requires the insertion, in the scene represented by the stream
of images analyzed, of cumbersome sensors which may furthermore
suffer from disturbance linked to their environment (for example
electromagnetic interference).
[0009] Another solution, developed in particular in the European
projects "OCETRE" and "HOLONICS" consists in using several image
sources, for example several video cameras, to enable real time
three dimensional reconstruction of the environment and of the
spatial movements of the users. An example of such approaches is in
particular described in the document entitled "Holographic and
action capture techniques", T. Rodriguez, A. Cabo de Leon, B.
Uzzan, N. Livet, E. Boyer, F. Geffray, T. Balogh, Z. Megyesi and A.
Barsi, August 2007, SIGGRAPH '07, ACM SIGGRAPH 2007, Emerging
Technologies. It is to be noted that these applications may enable
the geometry of the real scene to be reproduced but do not
currently enable precise movements to be identified. Furthermore,
to meet real time constraints, it is necessary to set up complex
and costly hardware architectures.
[0010] Touch screens are also known for viewing augmented reality
scenes which enable interactions of a user with a computer system
to be determined. However, these screens are costly and poorly
adapted to the applications of augmented reality.
[0011] As regards the interactions of users in the field of video
games, an image is typically captured from a webcam type video
camera connected to a computer or to a console. After having been
stored in a memory of the system to which the video camera is
connected, this image is generally analyzed by an object tracking
algorithm, also referred to as blobs tracking, to compute in real
time the contours of certain elements of the user who is moving in
the image by using, in particular, an optical flow algorithm. The
position of those shapes in the image enables certain parts of the
displayed image to be modified or deformed. This solution thus
enables the disturbance in a zone of the image to be located in two
degrees of freedom.
[0012] However, the limits of this approach are mainly the lack of
precision since it is not possible to maintain the proper execution
of the process during a displacement of the video camera and the
lack of semantics since it is not possible to distinguish the
movements between the foreground and the background. Furthermore,
this solution uses optical flow image analysis which, in
particular, does not provide robustness to changes in lighting or
noise.
[0013] Also known is an approach to real time detection of an
interaction between a user and a computer system in an augmented
reality scene, based on an image of a sequence of images, the
interaction resulting from the modification of the appearance of
the representation of an object present in the image. However, this
method, described in particular in French patent application No.
0854382, does not enable precise movements of the user to be
identified and only applies to sufficiently textured zones of the
image.
[0014] The invention enables at least one of the problems set forth
above to be solved.
[0015] The invention is thus directed to a computer method for
detecting interactions with a software application according to a
movement of at least one object situated in the field of an image
sensor connected to a computer implementing the method, said image
sensor providing a stream of images to said computer, the method
comprising the following steps, [0016] receiving at least one first
image from said image sensor; [0017] identifying at least one first
region of interest in said first image, said at least one first
region of interest corresponding to a part of said at least one
first image; [0018] receiving at least one second image from said
image sensor; [0019] identifying at least one second region of
interest of said at least one second image, said at least one
second region of interest corresponding to said at least one first
region of interest of said at least one first image; [0020]
comparing said at least one first and second regions of interest
and determining a mask of interest characterizing a variation of at
least one feature of corresponding points in said at least one
first and second regions of interest; [0021] determining a movement
of said at least one object from said mask of interest, said at
least one object being at least partially represented in at least
one of said at least one first and second regions of interest; and
[0022] analyzing said movement and, in response to said analyzing
step, triggering or not triggering a predetermined action.
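The sequence of steps set out above can be sketched in a few lines of code. The following is a minimal illustration only, not the patented implementation: it assumes grayscale images supplied as NumPy arrays, a rectangular region of interest given as (row, col, height, width), and a crude centroid-based movement estimate in place of the point-of-interest matching developed later in the text. All names and the threshold value are illustrative assumptions.

```python
import numpy as np

def detect_movement(img1, img2, roi, threshold=25):
    # Illustrative sketch of the claimed steps; names and the
    # centroid-based estimate are assumptions, not the patented
    # method. `roi` is (row, col, height, width).
    r, c, h, w = roi
    roi1 = img1[r:r + h, c:c + w].astype(np.int16)
    roi2 = img2[r:r + h, c:c + w].astype(np.int16)
    # Mask of interest: points whose feature (here, intensity)
    # varies by more than a threshold between the two regions.
    mask = np.abs(roi2 - roi1) > threshold
    if not mask.any():
        return None  # no significant variation: no action triggered
    # Crude movement estimate: offset of the mask's centroid from
    # the centre of the region of interest.
    ys, xs = np.nonzero(mask)
    return (ys.mean() - h / 2.0, xs.mean() - w / 2.0)
```

A caller would invoke this once per pair of consecutive images and analyze the returned displacement to decide whether to trigger an action.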
[0023] The method according to the invention thus enables objects
to be tracked, in particular deformable objects with little
texture, in particular for augmented reality applications.
Furthermore, the limited quantity of processing enables the method
to be implemented in devices having limited resources (in
particular in terms of computation) such as mobile platforms.
Moreover, the method may be used with an image sensor of low
quality.
[0024] The method according to the invention enables fast movements
of objects to be tracked, even in the presence of blur in the
images acquired by the image sensor. In addition, the processing
according to the method of the invention does not depend on
specific color properties of the moving objects, and it is thus
possible to track objects such as a hand or a textured object in
movement in front of the image sensor used.
[0025] The number of degrees of freedom defining the movements of
each tracked object may be set for each region of interest.
[0026] It is possible to track several zones of interest
simultaneously in particular in order to enable multiple control.
Thus, for example, the tracking of two hands enables the number of
possible interactions between a user and a software application to be
increased.
[0027] Advantageously, said step of determining a movement
comprises a step of determining and matching at least one pair of
points of interest in said at least one first and second images, at
least one point of said at least one pair of points of interest
belonging to said mask of interest. The method according to the
invention thus enables the advantages linked to the tracking of
points of interest to be combined while limiting the zones where
those points are located in order to limit the processing and to
concentrate on the tracked object.
[0028] According to a particular embodiment, said step of
determining a movement comprises a step of determining and matching
a plurality of pairs of points of interest in said at least one
first and second images, at least one point of each of said pairs
of points of interest belonging to said mask of interest, said
movement being estimated on the basis of a transformation of a
first set of points of interest into a second set of points of
interest, the points of interest of said first and second sets
belonging to said plurality of pairs of points of interest, the
points of interest of said first set of points of interest
furthermore belonging to said at least one first image and the
points of interest of said second set of points of interest
furthermore belonging to said at least one second image. The
general movement of a part of an object may thus be determined from
the movements of a set of points of interest.
[0029] Said transformation preferably implements a weighting
function based on a distance between two points of interest from
the same pairs of points of interest of said plurality of pairs of
points of interest in order to improve the estimation of the
movement of the tracked object.
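As a sketch, the weighting idea of the preceding paragraph can be realised by down-weighting matched pairs whose internal distance deviates from the bulk of the matches, so that spurious correspondences contribute little to the estimated movement. The Gaussian weight, the sigma value, and the restriction to a pure translation below are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def weighted_translation(pts1, pts2, sigma=10.0):
    # Estimate a global translation from matched point pairs,
    # weighting each pair by the distance between its two points
    # (pairs far from the median displacement count less).
    pts1 = np.asarray(pts1, dtype=float)
    pts2 = np.asarray(pts2, dtype=float)
    disp = pts2 - pts1                   # per-pair displacement
    dist = np.linalg.norm(disp, axis=1)  # distance within each pair
    w = np.exp(-(dist - np.median(dist)) ** 2 / (2 * sigma ** 2))
    return (disp * w[:, None]).sum(axis=0) / w.sum()
```

With one grossly mismatched pair among several consistent ones, the weighted estimate stays close to the common displacement, which a plain average would not.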
[0030] Still according to a particular embodiment, the method
further comprises a step of validating at least one point of
interest of said at least one first image, belonging to said at
least one pair of points of interest, according to said determined
movement, said at least one validated point of interest being used
to track said object in at least one third image following said at
least one second image and said at least one validated point of
interest being used for modifying a mask of interest created on the
basis of said at least one second and third images. It is thus
possible to use points of interest which are the same from image to
image if they efficiently contribute to the general movement
estimation of the tracked object. Furthermore, the validated points
of interest are used to select new points of interest in order to
avoid an excessive accumulation of points of interest in a limited
region.
[0031] Said step of comparing said at least one first and second
regions of interest comprises a step of performing subtraction,
point by point, of values of corresponding points of said at least
one first and second regions of interest and a step of comparing a
result of said subtraction to a predetermined threshold. Such an
embodiment makes it possible to combine the effectiveness of the
method with limited processing resources.
[0032] According to a particular embodiment, the method further
comprises a step of detecting at least one predetermined feature in
said at least one first image, said at least one first region of
interest being at least partially identified in response to said
detecting step. The method according to the invention may thus be
automatically initialized or re-initialized according to elements
of the content of the processed image. Such a predetermined feature
is, for example, a predetermined shape and/or a predetermined
color.
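One possible automatic initialisation, sketched below, places an initial region of interest wherever a predetermined colour appears in the first image. The per-channel bounds, the fixed region size and the function name are illustrative assumptions; the text does not prescribe a particular detection scheme.

```python
import numpy as np

def roi_from_color(img, lo, hi, size=(60, 60)):
    # `img` is an (H, W, 3) array; lo/hi are inclusive per-channel
    # bounds. Returns (row, col, height, width) centred on the
    # matching pixels, or None if the colour is absent.
    match = np.all((img >= lo) & (img <= hi), axis=2)
    if not match.any():
        return None
    ys, xs = np.nonzero(match)
    cy, cx = int(ys.mean()), int(xs.mean())
    h, w = size
    return (max(cy - h // 2, 0), max(cx - w // 2, 0), h, w)
```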
[0033] Advantageously, the method further comprises a step of
estimating at least one modified second region of interest in said
at least one second image, said at least one modified second region
of interest of said at least one second image being estimated
according to said at least one first region of interest of said at
least one first image and of said at least one second region of
interest of said at least one second image. The method according to
the invention thus makes it possible to anticipate the processing
of the following image for the object tracking. Said estimation of
said at least one modified second region of interest of said at
least one second image for example implements an object tracking
algorithm of KLT type.
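The text names a KLT-type algorithm for this estimation. The sketch below is a single-window, single-point Lucas-Kanade step in plain NumPy, with no image pyramid and a fixed iteration count; it illustrates only the principle, not a production tracker (OpenCV's pyramidal implementation would normally be used instead).

```python
import numpy as np

def lk_track_point(img1, img2, pt, win=7, iters=5):
    # One-point Lucas-Kanade step. `pt` is (row, col); images are
    # float grayscale arrays. Window size and iteration count are
    # illustrative assumptions.
    r, c = int(pt[0]), int(pt[1])
    h = win // 2
    patch1 = img1[r - h:r + h + 1, c - h:c + h + 1]
    # Spatial gradients of the first image inside the window.
    Iy, Ix = np.gradient(patch1)
    G = np.array([[(Iy * Iy).sum(), (Iy * Ix).sum()],
                  [(Iy * Ix).sum(), (Ix * Ix).sum()]])
    d = np.zeros(2)  # accumulated displacement (dy, dx)
    for _ in range(iters):
        rr = r + int(round(d[0]))
        cc = c + int(round(d[1]))
        patch2 = img2[rr - h:rr + h + 1, cc - h:cc + h + 1]
        It = patch2 - patch1           # temporal difference
        b = np.array([(Iy * It).sum(), (Ix * It).sum()])
        d -= np.linalg.solve(G, b)     # solve the 2x2 normal equations
    return (pt[0] + d[0], pt[1] + d[1])
```

Tracking a point on a blob that moves by a couple of pixels between two images recovers the displacement to sub-pixel accuracy within a few iterations.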
[0034] Said movement may in particular be characterized by a
translation, a rotation and/or a scale factor.
[0035] When said movement is characterized by a scale factor,
whether or not said predetermined action is triggered may be
determined on the basis of said scale factor. Thus, a scale factor
may, for example, characterize a mouse click.
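A click of this kind can be detected with a small state machine over the per-frame scale factor: a press is registered when the object grows past one threshold (the hand approaching the camera) and the click fires when it shrinks back below another. The two threshold values below are illustrative assumptions.

```python
class ScaleClickDetector:
    # Interpret a rise-then-fall of the tracked object's scale
    # factor as a mouse-click-like action. Thresholds are
    # illustrative, not taken from the text.
    def __init__(self, press=1.25, release=1.05):
        self.press, self.release = press, release
        self.pressed = False

    def update(self, scale):
        # Feed one per-frame scale factor; returns True on the
        # frame where a complete press-then-release is detected.
        if not self.pressed and scale >= self.press:
            self.pressed = True
        elif self.pressed and scale <= self.release:
            self.pressed = False
            return True
        return False
```

Using two thresholds (hysteresis) prevents a scale factor hovering near a single threshold from generating spurious repeated clicks.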
[0036] According to a particular embodiment, the movements of at
least two objects situated in the field of said image sensor are
determined, whether or not said predetermined action is triggered
being determined according to a combination of the movements
associated with said at least two objects. It is thus possible to
determine a movement of an object on the basis of movements of
other objects, in particular other objects subjected to constraints
of relative position.
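As a minimal illustration of combining the movements of two tracked objects, the positions of two hands on a virtual steering wheel (as in FIG. 6, described later) can be reduced to a single control value, the wheel angle. The point convention (row, col) and the angle convention are assumptions for this sketch.

```python
import math

def steering_angle(left, right):
    # Combine the tracked positions of two objects (e.g. both
    # hands) into one control value: the angle, in degrees, of the
    # line joining them. Points are (row, col) tuples.
    dy = right[0] - left[0]
    dx = right[1] - left[1]
    return math.degrees(math.atan2(dy, dx))
```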
[0037] The invention is also directed to a computer program
comprising instructions adapted to the implementation of each of
the steps of the method described earlier when said program is
executed on a computer as well as a device comprising means adapted
to the implementation of each of the steps of the method described
earlier. The advantages of this computer program and of this device
are similar to those referred to earlier.
[0038] Other advantages, objects and features of the present
invention will emerge from the following detailed description,
given by way of non-limiting example, relative to the accompanying
drawings in which:
[0039] FIG. 1, comprising FIGS. 1a and 1b, illustrates two
successive images of a stream of images that may be used to
determine the movement of objects and the interaction of a
user;
[0040] FIG. 2, comprising FIGS. 2a to 2d, illustrates examples of
variation in a region of interest of an image with the
corresponding region of interest of a following image;
[0041] FIG. 3 is a diagrammatic illustration of the determination
of a movement of an object of which at least one part is
represented in a region and in a mask of interest of two
consecutive images;
[0042] FIG. 4 is a diagrammatic illustration of certain steps
implemented in accordance with the invention to identify, in
continuous operation, variations in position of objects between two
consecutive (or close) images of a sequence of images;
[0043] FIG. 5 illustrates certain aspects of the invention when
four parameters characterize a movement of an object tracked in
consecutive (or close) images of a sequence of images;
[0044] FIG. 6, comprising FIGS. 6a, 6b and 6c, illustrates an
example of implementation of the invention in the context of a
driving simulation game in which two regions of interest enable the
tracking of a user's hands in real time, characterizing a vehicle
steering wheel movement, in a sequence of images; and,
[0045] FIG. 7 illustrates an example of a device adapted to
implement the invention.
[0046] In general terms, the invention concerns the tracking of
objects in particular regions of images in a stream of images,
those regions, termed regions of interest, comprising a part of the
tracked objects and a part of the scene represented in the images.
It has been observed that the analysis of regions of interest makes
it possible to speed up the processing time and to improve the
movement detection of objects.
[0047] The regions of interest are, preferably, defined as
two-dimensional shapes, in an image. These shapes are, for example,
rectangles or circles. They are preferably constant and
predetermined. The regions of interest may be characterized by
points of interest, that is to say singular points, such as points
having a high luminance gradient, and the initial position of the
regions of interest may be predetermined, be determined by a user,
by an event such as the appearance of a shape or a color or
according to predefined features, for example using key images.
These regions may also be moved depending on the movement of
tracked objects or have a fixed position and orientation in the
image. The use of several regions of interest makes it possible,
for example, to observe several concomitant interactions of a user
(a region of interest may correspond to each of his hands) and/or
several concomitant interactions of several users.
[0048] The points of interest are used in order to find the
variation of the regions of interest, in a stream of images, from
one image to a following (or close) image, according to techniques
of tracking points of interest based, for example, on algorithms
known under the name of FAST, for the detection, and KLT (initials
of Kanade, Lucas and Tomasi), for tracking in the following image.
The points of interest of a region of interest may vary over the
images analyzed, in particular according to the distortion of the
objects tracked and their movements which may mask parts of the
scene represented in the images and/or make parts of those objects
leave the zones of interest.
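The detection step can be illustrated with a simplified corner detector: the sketch below scores each pixel by the smallest eigenvalue of the local gradient structure tensor (the Shi-Tomasi criterion classically paired with KLT tracking) and keeps the strongest responses. It is a stand-in for the FAST detector named in the text, which uses a different, faster test; all parameter values are illustrative.

```python
import numpy as np

def corner_points(img, k=4, win=3):
    # Return up to k (row, col) candidate points of interest.
    Iy, Ix = np.gradient(img.astype(float))
    h = win // 2

    def boxsum(a):
        # Sum each pixel's win x win neighbourhood.
        p = np.pad(a, h)
        s = np.zeros_like(a)
        for dy in range(win):
            for dx in range(win):
                s += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
        return s

    a, b, c = boxsum(Iy * Iy), boxsum(Iy * Ix), boxsum(Ix * Ix)
    # Smallest eigenvalue of the 2x2 structure tensor [[a, b], [b, c]].
    score = (a + c - np.sqrt((a - c) ** 2 + 4 * b * b)) / 2.0
    pts = []
    for _ in range(k):
        r, cc = np.unravel_index(np.argmax(score), score.shape)
        if score[r, cc] <= 0:
            break
        pts.append((int(r), int(cc)))
        # Crude non-maximum suppression around the chosen point.
        score[max(r - win, 0):r + win + 1,
              max(cc - win, 0):cc + win + 1] = 0
    return pts
```

On a uniform bright square, the responses concentrate at its four corners, while edges and flat areas score (near) zero, which is precisely the property that makes such points trackable.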
[0049] Furthermore, the objects whose movements may create an
interaction are tracked in each region of interest according to a
mechanism for tracking points of interest in masks defined in the
regions of interest.
[0050] FIGS. 1 and 2 illustrate the general principle of the
invention.
[0051] FIG. 1, comprising FIGS. 1a and 1b, illustrates two
successive images of a stream of images that may be used to
determine the movement of objects and the interaction of a
user.
[0052] As illustrated in FIG. 1a, image 100-1 represents a scene
having fixed elements (not represented) such as elements of decor
and mobile elements here linked to animate characters (real or
virtual). The image 100-1 here comprises a region of interest
105-1. As indicated previously, several regions of interest may be
processed simultaneously; however, in the interest of clarity, a
single region of interest is represented here, the processing of
the regions of interest being similar for each of them. It is
considered that the shape of the region of interest 105-1 as well
as its initial position are predetermined.
[0053] Image 100-2 of FIG. 1b represents an image following the
image 100-1 of FIG. 1a in a sequence of images. It is possible to
define, in the image 100-2, a region of interest 105-2,
corresponding to the position and to the dimensions of the region
of interest 105-1 defined in the preceding image, in which
disturbances may be estimated. The region of interest 105-1 is thus
compared to the region of interest 105-2 of FIG. 1b, for example by
subtracting those image parts one from another, pixel by pixel
(pixel being an acronym for PICture ELement), in order to extract
therefrom a map of pixels that are considered to be in movement.
These pixels in movement constitute a mask of pixels of interest
(presented in FIG. 2).
[0054] Points of interest, generically referenced 110 in FIG. 1a,
may be determined in the image 100-1, in particular in the region
of interest 105-1, according to standard algorithms for image
analysis. These points of interest may be advantageously detected
at positions in the region of interest which belong to the mask of
pixels of interest.
[0055] The points of interest 110 defined in the region of interest
105-1 are tracked in the image 100-2, preferably in the region of
interest 105-2, for example using the KLT tracking principles by
comparing portions of the images 100-1 and 100-2 that are
associated with the neighborhoods of the points of interest.
[0056] These matches, denoted 115, between the image 100-1 and the
image 100-2 make it possible to estimate the movements of the hand
represented with the reference 120-1 in image 100-1 and the
reference 120-2 in image 100-2. It is thus possible to obtain the
new position of the hand in the image 100-2.
[0057] The movement of the hand may next be advantageously used to
move the region of interest 105-2 from the image 100-2 to the
modified region of interest 125 which may be used for estimating
the movement of the hand in an image following the image 100-2 of
the image stream. The method of tracking objects may thus continue
recursively.
[0058] It is to be noted here that, as stated earlier, certain
points of interest present in the image 100-1 have disappeared from
the image 100-2 due, in particular, to the presence and movements
of the hand.
[0059] The determination of points of interest in an image is,
preferably, limited to the zone corresponding to the corresponding
region of interest as located on the current image or to a zone
comprising all or part thereof when a mask of interest of pixels in
movement is defined in that region of interest.
[0060] According to a particular embodiment, estimation is made of
information characterizing the relative positions and orientations
of the objects to track (for example the hand referenced 120-1 in
FIG. 1a) in relation to a reference linked to the video camera from
which the images come. Such information is, for example
two-dimensional position information (x, y), orientation
information (.theta.) and information on distance to the video
camera, that is to say scale(s) of the objects to track.
[0061] Similarly, it is possible to track the modifications that
have occurred in the region of interest 125 that is defined in the
image 100-2 relative to the region of interest 105-1 of the image
100-1 according to a movement estimated between the image 100-2 and
the following image of the stream of images. For these purposes, a
new region of interest is first of all identified in the following
image on the basis of the region of interest 125. When the region
of interest has been identified, it is compared with the region of
interest 125 in order to determine the modified elements, forming a
mask comprising parts of objects whose movements must be
determined.
[0062] FIG. 2, comprising FIGS. 2a to 2d, illustrates the variation
of a region of interest of one image in comparison with the
corresponding region of interest, at the same position, of a
following image, as described with reference to FIG. 1. The image
resulting from this comparison, having the same shape as the region
of interest, is formed of pixels which here may take two states, a
first state being associated, by default, with each pixel. A second
state is associated with the pixels corresponding to the pixels of
the regions of interest whose variation exceeds a predetermined
threshold. This second state forms a mask used here to limit the
search for points of interest to zones which are situated on
tracked objects or that are close to those tracked objects in order
to characterize the movement of the tracked objects and, possibly,
to trigger particular actions.
[0063] FIG. 2a represents a region of interest of a first image
whereas FIG. 2b represents the corresponding region of interest of
a following image, at the same position. As illustrated in FIG. 2a,
the region of interest 200-1 comprises a hand 205-1 as well as
another object 210-1. Similarly, the corresponding region of
interest, referenced 200-2 and illustrated in FIG. 2b, comprises
the hand and the object, here referenced 205-2 and 210-2,
respectively. The hand, generically referenced 205, has moved
substantially whereas the object, generically referenced 210, has
only moved slightly.
[0064] FIG. 2c illustrates the image 215 resulting from the
comparison of the regions of interest 200-1 and 200-2. The black
part, forming a mask of interest, represents the pixels whose
difference is greater than a predetermined threshold whereas the
white part represents the pixels whose difference is less than that
threshold. The black part comprises in particular the part
referenced 220 corresponding to the difference in position of the
hand 205 between the regions of interest 200-1 and 200-2. It also
comprises the part 225 corresponding to the difference in position
of the object 210 between those regions of interest. The part 230
corresponds to the part of the hand 205 present in both these
regions of interest.
[0065] The image 215 represented in FIG. 2c may be analyzed to
deduce therefrom an interaction between a computer system
processing the images and the user who moved his hand in the field
of the video camera from which the regions of interest 200-1 and
200-2 are extracted. Such an analysis may in particular consist in
identifying the movement of points of interest belonging to the
mask of interest so formed, the search for points of interest then
preferably being limited to that mask.
[0066] However, a skeletonizing step making it possible in
particular to eliminate adjoining movements such as the movement
referenced 225 is, preferably, carried out before analyzing the
movement of the points of interest belonging to the mask of
interest. This skeletonizing step may take the form of a
morphological processing operation, such as, for example,
operations of opening or closing applied to the mask of interest.
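By way of illustration, the opening operation mentioned above may be sketched as follows with a 3x3 structuring element. This is a minimal NumPy implementation, not taken from the patent; in practice an image-processing library would typically be used.

```python
import numpy as np

def erode(mask, iterations=1):
    """Binary erosion with a 3x3 square structuring element: a pixel
    survives only if all 9 pixels of its neighbourhood are set."""
    for _ in range(iterations):
        p = np.pad(mask, 1, constant_values=False)
        acc = np.ones_like(mask, dtype=bool)
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                acc &= p[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
        mask = acc
    return mask

def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square structuring element: a pixel
    is set if any pixel of its neighbourhood is set."""
    for _ in range(iterations):
        p = np.pad(mask, 1, constant_values=False)
        acc = np.zeros_like(mask, dtype=bool)
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                acc |= p[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
        mask = acc
    return mask

def opening(mask, iterations=1):
    """Morphological opening (erosion then dilation): removes small
    isolated blobs while preserving larger regions of the mask."""
    return dilate(erode(mask, iterations), iterations)
```

Applied to a mask such as that of FIG. 2c, such an opening removes small adjoining disturbances like the part 225 while preserving the bulk of the part 220.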
[0067] Furthermore, the mask of interest obtained is advantageously
modified in order to eliminate the parts situated around the points
of interest identified recursively between the image from which the
region of interest 200-1 is extracted and the image preceding
it.
[0068] FIG. 2d thus illustrates the mask of interest represented in
FIG. 2c, here referenced 235, from which the parts 240 situated
around the points of interest referenced 245 have been eliminated.
The parts 240 are, for example, circular and here have a
predetermined radius.
[0069] Zones in which already-detected points of interest are
situated, and in which it is thus not necessary to detect new ones,
are thereby cropped from the mask of interest 235. In other words,
the modified mask of interest 235 excludes a part of the mask of
interest 220 in order to avoid the accumulation of points of
interest in the same zone of the region of interest.
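The cropping of zones around already-validated points of interest may be sketched as follows. This is a minimal NumPy illustration; the function name and the circular shape of the cleared zones follow the text, the radius being a free parameter.

```python
import numpy as np

def crop_around_points(mask, points, radius):
    """Clear a disk of the given radius around each already-tracked
    point of interest (x, y) so that no new points of interest are
    detected in those zones."""
    h, w = mask.shape
    yy, xx = np.mgrid[0:h, 0:w]
    out = mask.copy()
    for (px, py) in points:
        out[(xx - px) ** 2 + (yy - py) ** 2 <= radius ** 2] = False
    return out
```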
[0070] Again, the mask of interest 235 may be used to identify
points of interest whose movements may be analyzed in order to
trigger, where applicable, a particular action.
[0071] FIG. 3 is again a diagrammatic illustration of the
determination of a movement of an object of which at least one part
is represented in a region and a mask of interest of two
consecutive (or close) images. The image 300 here corresponds to
the mask of interest resulting from the comparison of the regions
of interest 200-1 and 200-2 as described with reference to FIG. 2d.
However, a skeletonizing step has been carried out to eliminate the
disturbances (in particular the disturbance 225). Thus, the image
300 comprises a mask 305 which may be used for identifying new
points of interest whose movements characterize the movement of
objects in that region of interest.
[0072] By way of illustration, the point of interest corresponding
to the end of the user's index finger is shown. Reference 310-1
designates this point of interest according to its position in the
region of interest 200-1 and reference 310-2 designates that point
of interest according to its position in the region of interest
200-2. Thus, by using standard techniques for tracking points of
interest, for example an algorithm for tracking by optical flow, it
is possible, on the basis of the point of interest 310-1 of the
region of interest 200-1, to find the corresponding point of
interest 310-2 of the region of interest 200-2 and, consequently,
to find the corresponding translation.
[0073] The analysis of the movements of several points of interest,
in particular of the point of interest 310-1 and of the points of
interest detected and validated beforehand, for example the points
of interest 245, makes it possible to determine a set of movement
parameters for the tracked object, in particular which are linked
to a translation, a rotation and/or a change of scale.
[0074] FIG. 4 is a diagrammatic illustration of certain steps
implemented in accordance with the invention to identify, in
continuous operation, variations in arrangement of objects between
two consecutive (or close) images of a sequence of images.
[0075] The images here are acquired via an image sensor such as a
video camera, in particular a video camera of webcam type,
connected to a computer system implementing the method described
here.
[0076] After having acquired a current image 400 and if that image
is the first to be processed, that is to say if a preceding image
405 from the same video stream has not been processed beforehand, a
first step of initializing (step 410) is executed. An object of
this step is in particular to define features of at least one
region of interest, for example a shape, a size and an initial
position.
[0077] As described earlier, a region of interest may be defined
relative to a corresponding region of interest determined in a
preceding image (in recursive phase of tracking, in this case the
initializing 410 is not necessary) or according to predetermined
features and/or particular events (corresponding to the
initializing phase).
[0078] Thus, by way of illustration, it is possible for a region of
interest not to be defined in an initial state, the system being on
standby for a triggering event, for example a particular movement
of the user facing the video camera (the moving pixels in the image
being analyzed in search of a particular movement), the location of
a particular color such as the color of skin or the recognition of
a particular predetermined object whose position defines that of
the region of interest. Like the position, the size and the shape
of the region of interest may be predefined or be determined
according to features of the detected event.
[0079] The initializing step 410 may thus take several forms
depending on the object to track in the image sequence and
depending on the application implemented.
[0080] It may in particular be a static initialization. In this
case, the initial position of the region of interest is
predetermined (off-line determination) and the tracking algorithm
is on standby for a disturbance.
[0081] The initializing phase may also comprise a step of
recognizing objects of a specific type. For example, the principles
of detecting descriptors of Haar wavelet type may be implemented.
The principle of these descriptors is in particular described in
the paper by Viola and Jones, "Rapid object detection using a
boosted cascade of simple features", Computer Vision and Pattern
Recognition, 2001. These descriptors in particular enable the
detection of a face, the eyes or a hand in an image or a part of an
image. During the initializing phase, it is thus possible to search
for particular objects either in the whole image in order to
position the region of interest on the detected object or in a
region of interest itself to trigger the tracking of the recognized
object.
[0082] Another approach consists in segmenting an image and in
identifying certain color properties and certain predefined shapes.
When a shape and/or a segmented region of the processed image is
similar to the object searched for, for example the color of the
skin and the outline of the hand, the tracking process is
initialized as described earlier.
[0083] In a following step (step 415), a region of interest whose
features have been determined beforehand (on initialization or in
the preceding image) is positioned in the current image to extract
the corresponding image part. If the current image is the first
image of the video stream to be processed, that image becomes the
preceding image, a new current image is acquired and step 415 is
repeated.
[0084] This image part thus extracted is then compared with the
corresponding region of interest of the preceding image (step 420).
Such a comparison may in particular consist in subtracting from
each pixel of the considered region of interest of the current
image the corresponding pixel of the corresponding region of
interest of the preceding image.
[0085] The detection of the points in movement is thus carried out,
according to this example, by the absolute difference of parts of
the current image and of the preceding image. This difference makes
it possible to create a mask of interest capable of being used to
distinguish a moving object from the decor, which is essentially
static. However, as the object/decor segmentation is not expected
to be perfect, it is possible to update such a mask of interest
recursively on the basis of the movements in order to identify the
movements of the pixels of the tracked object and the movements of
the pixels which belong to the background of the image.
[0086] Thresholding is then preferably carried out on the
difference between pixels according to a predetermined threshold
value (step 425). Such thresholding may, for example, be carried
out on the luminance. If coding over 8 bits is used, its value is,
for example, 100. It makes it possible to isolate the pixels having
a movement considered to be sufficiently great between two
consecutive (or close) images. The difference between the pixels of
the current and preceding images is then binary coded, for example
black if the difference exceeds the predetermined threshold
characterizing the movement and white in the opposite case. The
binary image formed by the pixels whose difference exceeds the
predetermined threshold forms a mask of interest or tracking in the
region of interest considered (step 430).
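Steps 420 to 430 may be sketched as follows, assuming 8-bit luminance images and the example threshold of 100 given above (a minimal NumPy illustration):

```python
import numpy as np

def tracking_mask(prev_roi, cur_roi, threshold=100):
    """Absolute difference of two corresponding regions of interest,
    binarized: True (the mask of interest) where the luminance
    variation between consecutive (or close) images exceeds the
    predetermined threshold."""
    # Work in a signed type so the subtraction cannot wrap around.
    diff = np.abs(cur_roi.astype(np.int16) - prev_roi.astype(np.int16))
    return diff > threshold
```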
[0087] If points of interest have been validated beforehand, the
mask is modified (step 460) in order to exclude from the mask zones
in which points of interest are recursively tracked. Thus, as
represented by the use of a dashed line, step 460 is only carried
out if there are validated points of interest. As indicated earlier,
this step consists in eliminating zones from the mask created, for
example disks of a predetermined diameter, around points of
interest validated beforehand.
[0088] Points of interest are then searched for in the region of
the preceding image corresponding to the mask of interest so
defined (step 435), the mask of interest here being the mask of
interest created at step 430 or the mask of interest created at
step 430 and modified during step 460.
[0089] The search for points of interest is, for example, limited
to the detection of twenty points of interest. Naturally, this
number may be different and may be estimated according to the size
of the mask of interest.
[0090] This search is advantageously carried out with the algorithm
known by the name FAST. According to this algorithm, a Bresenham
circle, for example with a perimeter of 16 pixels, is constructed
around each pixel of the image. If k contiguous pixels (k typically
having a value of 9, 10, 11 or 12) contained in that circle all
have either greater intensity than the central pixel, or all have
lower intensity than the central pixel, that central pixel is
considered as a point of interest. It is also possible to identify
points of interest with an approach based on image gradients, as
provided by the approach known under the name of Harris corner
detection.
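The FAST criterion described above may be sketched as follows. The offsets correspond to a Bresenham circle of radius 3 (perimeter of 16 pixels); the intensity threshold t is an illustrative assumption, not a value from the patent.

```python
import numpy as np

# The 16 pixel offsets of a Bresenham circle of radius 3, in order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2),
          (1, 3), (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1),
          (-2, -2), (-1, -3)]

def is_fast_corner(img, x, y, t=20, k=12):
    """FAST criterion: (x, y) is a point of interest if k contiguous
    pixels of the circle are all brighter than the central pixel by
    more than t, or all darker by more than t."""
    c = int(img[y, x])
    ring = [int(img[y + dy, x + dx]) for dx, dy in CIRCLE]
    for sign in (1, -1):
        run = best = 0
        # Duplicate the ring so contiguous runs that wrap around the
        # circle are counted correctly.
        for v in ring + ring:
            run = run + 1 if sign * (v - c) > t else 0
            best = max(best, run)
        if best >= k:
            return True
    return False
```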
[0091] The points of interest detected in the preceding image
according to the mask of interest as well as, where applicable, the
points of interest detected and validated beforehand are used to
identify the corresponding points of interest in the current
image.
[0092] A search for corresponding points of interest in the current
image is thus carried out (step 440), preferably using a method
known under the name of optical flow. The use of this technique
gives better robustness when the image is blurred, in particular
thanks to the use of pyramids of images smoothed by a Gaussian
filter. This is for example the approach implemented by Lucas,
Kanade and Tomasi in the algorithm known under the name KLT.
[0093] When the points of interest of the current image,
corresponding to the points of interest of the preceding image
(which are determined according to the mask of interest or by
recursive tracking), have been identified, movement parameters are
estimated for objects tracked in the region of interest of the
preceding image relative to the region of interest of the current
image (step 445). Such parameters, also termed degrees of freedom,
comprise, for example, a parameter of translation along the x-axis,
a parameter of translation along the y-axis, a rotation parameter
and/or a scale parameter, the transformation making a set of
bi-directional points pass from one plane to another, grouping
together these four parameters, being termed the similarity. These
parameters are, preferably, estimated using the method of Nonlinear
Least Squares Error (NLSE) or Gauss-Newton method. This method is
directed to minimizing a re-projection error over the set of the
tracked points of interest. In order to improve the estimation of
the parameters of the model (position and orientation), it is
advantageous, in a specific embodiment, to search for those
parameters in a distinct manner. Thus, for example, it is relevant
to apply the least squares error, in a first phase, in order to
estimate only the translation parameters (x,y), these latter being
easier to identify, then, during a second iteration, to compute the
parameters of scale change and/or of rotation (possibly less
precisely).
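The estimation of the four parameters of the similarity from matched points of interest may be sketched as follows. The patent prefers an iterative Gauss-Newton (NLSE) minimization, possibly estimating the translation first; this illustration solves the same least-squares model in one step through the linear parameterization a = s*cos(θ), b = s*sin(θ).

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares similarity (Tx, Ty, rotation, scale) mapping the
    tracked points src of the preceding image onto the corresponding
    points dst of the current image.  Model: x' = a*x - b*y + Tx,
    y' = b*x + a*y + Ty, with a = s*cos(theta), b = s*sin(theta)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)
    A = np.zeros((2 * n, 4))
    rhs = np.zeros(2 * n)
    A[0::2] = np.c_[src[:, 0], -src[:, 1], np.ones(n), np.zeros(n)]
    A[1::2] = np.c_[src[:, 1], src[:, 0], np.zeros(n), np.ones(n)]
    rhs[0::2] = dst[:, 0]
    rhs[1::2] = dst[:, 1]
    a, b, tx, ty = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return tx, ty, np.arctan2(b, a), np.hypot(a, b)
```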
[0094] In a following step, the points of interest of the preceding
image, for which a match has been found in the current image, are
preferably analyzed in order to recursively determine valid points
of interest relative to the movement estimated in the preceding
step. To that end, it is verified, for each previously
determined point of interest of the preceding image (determined
according to the mask of interest or by recursive tracking),
whether the movement, relative to that point of interest, of the
corresponding point of interest of the current image is in
accordance with the identified movement. In the affirmative, the
point of interest is considered as valid whereas in the opposite
case, it is considered as not valid. A threshold, typically
expressed in pixels and having a predetermined value, is
advantageously used to authorize a certain margin of error between
the theoretical position of the point in the current image
(obtained by applying the parameters of step 445) and its real
position (obtained by the tracking method of step 440).
[0095] The valid points of interest, here referenced 455, are
considered as belonging to an object whose movement is tracked
whereas the non-valid points (also termed outliers), are considered
as belonging to the image background or to portions of an object
which are not visible in the image.
[0096] As indicated earlier, the valid points of interest are
tracked in the following image and are used to modify the mask of
interest created by comparison of a region of interest of the
current image with the corresponding region of interest of the
following image (step 460), in order to exclude from the mask
portions of pixels in movement between the current and following
images, as described with reference to FIG. 2d. This modified mask of
interest makes it possible to eliminate portions of images in which
points of interest are recursively tracked. The valid points of
interest are thus kept for several processing operations on
successive images and in particular enable stabilization of the
tracking of objects.
[0097] The new region of interest (or modified region of interest)
which is used for processing the current image and the following
image is then estimated thanks to the previously estimated degrees
of freedom (step 445). For example, if the degrees of freedom are x
and y translations, the new position of the region of interest is
estimated according to the previous position of the region of
interest, using those two items of information. If a change (or
changes) of scale is estimated and considered in this step, it is
possible, according to the scenario considered, also to modify the
size of the new region of interest which is used in the current and
following images of the video stream.
[0098] In parallel, when the different degrees of freedom have been
computed, it is possible to estimate a particular interaction
according to those parameters (step 470).
[0099] According to a particular embodiment, the estimation of a
change (or changes) of scale is used for detecting the triggering
of an action in similar manner to the click of a mouse. Similarly,
it is possible to use changes of orientation, particularly those
around the viewing axis of the video camera (referred to as roll)
in order, for example, to enable the rotation of a virtual element
displayed in a scene or to control a button of "potentiometer" type
in order, for example, to adjust the sound volume of an
application.
[0100] This detection of interactions according to the scale
factor, for example to detect an action such as a mouse click, may
be implemented in the following manner, by counting the number of
images over which the norm of the movement vector (translation) and
the scale factor (determined according to corresponding regions of
interest) are less than certain predetermined values. Such a number
characterizes a stability in the movement of the tracked objects.
If the number of images over which the movement is stable exceeds a
certain threshold, the system enters a state of standby for the
detection of a click. A click is then detected by measuring the
average of the absolute differences of the scale factors between
current and preceding images, this being performed over a given
number of images. If the sum thus computed exceeds a certain
threshold, the click is validated.
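The click detection described in this paragraph may be sketched as the following small state machine. All thresholds are illustrative assumptions, not values from the patent; the per-frame scale factor is taken relative to the preceding image, so |s - 1| measures its variation.

```python
class ClickDetector:
    """Stability-then-scale-burst click detection: first wait for the
    tracked movement to be stable over enough frames, then validate a
    click when the average scale variation over a window of frames
    exceeds a threshold."""

    def __init__(self, stable_frames=15, move_eps=1.0, scale_eps=0.02,
                 window=5, click_thresh=0.08):
        self.stable_frames = stable_frames   # frames of stability required
        self.move_eps = move_eps             # max translation norm (pixels)
        self.scale_eps = scale_eps           # max |s - 1| while stable
        self.window = window                 # frames over which to average
        self.click_thresh = click_thresh     # mean |s - 1| validating a click
        self.stable = 0
        self.armed = False
        self.diffs = []

    def update(self, translation_norm, scale):
        """Feed the per-frame translation norm and scale factor;
        returns True on the frame where a click is validated."""
        if not self.armed:
            if translation_norm < self.move_eps and abs(scale - 1) < self.scale_eps:
                self.stable += 1
                if self.stable >= self.stable_frames:
                    self.armed = True        # standby for a click
            else:
                self.stable = 0
            return False
        self.diffs.append(abs(scale - 1))
        if len(self.diffs) < self.window:
            return False
        clicked = sum(self.diffs) / len(self.diffs) > self.click_thresh
        self.diffs = []
        if clicked:
            self.armed, self.stable = False, 0
        return clicked
```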
[0101] When an object is no longer tracked in a sequence of images
(either because it disappears from the image, or because it has
been lost), the algorithm preferably returns to the initializing
step. Furthermore, loss of tracking leading to the initializing
step being re-executed may be identified by measuring the movements
of a user. Thus, it may be decided to reinitialize the method when
those movements are stable or non-existent for a predetermined
period or when a tracked object leaves the field of view of the
image sensor.
[0102] FIG. 5 illustrates more precisely certain aspects of the
invention when four parameters characterize a movement of an object
tracked in consecutive (or close) images of a sequence of images.
These four parameters here are a translation denoted (T_x, T_y), a
rotation denoted θ around the optical axis of the image sensor and
a scale factor denoted s. These four parameters represent a
similarity, that is to say the transformation enabling a point M of
a plane to be transformed into a point M'.
[0103] As illustrated in FIG. 5, O represents the origin of a frame
of reference 505 of the object in the preceding image and O'
represents the origin of a frame of reference 510 of the object in
the current image, the frame of reference 510 being obtained in
accordance with the object tracking method and the image frame of
reference here bearing the reference 500. It is then possible to
express the transformation of the point M to the point M' by the
following system of non-linear equations:
X_M' = s(X_M - X_O)cos(θ) - s(Y_M - Y_O)sin(θ) + T_x + X_O

Y_M' = s(X_M - X_O)sin(θ) + s(Y_M - Y_O)cos(θ) + T_y + Y_O

[0104] where (X_M, Y_M) are the coordinates of the point M
expressed in the image frame of reference, (X_O, Y_O) are the
coordinates of the point O in the image frame of reference and
(X_M', Y_M') are the coordinates of the point M' in the image frame
of reference.
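The system of equations above translates directly into code; for example (a minimal sketch):

```python
import math

def transform_point(xm, ym, xo, yo, s, theta, tx, ty):
    """Apply the similarity: rotate by theta and scale by s about the
    object origin O, then translate by (tx, ty)."""
    xm2 = s * (xm - xo) * math.cos(theta) - s * (ym - yo) * math.sin(theta) + tx + xo
    ym2 = s * (xm - xo) * math.sin(theta) + s * (ym - yo) * math.cos(theta) + ty + yo
    return xm2, ym2
```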
[0105] The points M_s and M_sθ represent the transformation of the
point M according to the change of scale s alone and according to
the change of scale s combined with the rotation θ,
respectively.
[0106] As described earlier, it is possible to use the nonlinear
least squares error approach to solve this system by using all the
points of interest tracked in step 440 described with reference to
FIG. 4.
[0107] To compute the new position of the object in the current
image (step 465 of FIG. 4), it suffices theoretically to apply the
estimated translation (T.sub.x,T.sub.y) to the previous position of
the object in the following manner:
X_O' = X_O + T_x

Y_O' = Y_O + T_y

[0108] where (X_O', Y_O') are the coordinates of the point O' in
the image frame of reference.
[0109] Advantageously, the partial derivatives of each point
considered, that is to say the movements associated with each of
those points, are weighted according to the associated movement.
Thus, the points of interest moving the most have greater
importance in the estimation of the parameters, which avoids the
points of interest linked to the background disturbing the tracking
of objects.
[0110] It has thus been observed that it is advantageous to add an
influence of the center of gravity of the points of interest
tracked in the current image to the preceding equation. This center
of gravity approximately corresponds to the local center of gravity
of the movement (the points tracked in the current image come from
moving points in the preceding image). The center of the region of
interest thus tends to translate to the center of the movement so
long as the distance of the object to the center of gravity is
greater than the estimated translation movement. The origin of the
frame of reference in the current image, characterizing the
movement of the tracked object, is advantageously computed
according to the following relationship:
X_O' = X_O + W_GC(X_GC - X_O) + W_T T_x

Y_O' = Y_O + W_GC(Y_GC - Y_O) + W_T T_y

[0111] where (X_GC, Y_GC) are the coordinates of the center of
gravity of the points of interest in the current image, W_GC is the
weight on the influence of the current center of gravity and W_T
the weight on the influence of the translation. The parameter W_GC
is here positively correlated with the velocity of movement of the
tracked object, whereas the parameter W_T may be fixed depending on
the desired influence of the translation.
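The weighted origin update may be sketched as follows, the weights W_GC and W_T being left as parameters:

```python
def update_origin(xo, yo, xgc, ygc, tx, ty, w_gc, w_t):
    """Move the origin O of the object frame by the estimated
    translation, blended with a pull towards the center of gravity
    (xgc, ygc) of the points of interest tracked in the current
    image."""
    return (xo + w_gc * (xgc - xo) + w_t * tx,
            yo + w_gc * (ygc - yo) + w_t * ty)
```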
[0112] FIG. 6, comprising FIGS. 6a, 6b and 6c, illustrates an
example of implementation of the invention in the context of a
driving simulation game in which two regions of interest enable the
tracking of a user's hands in real time, characterizing a vehicle
steering wheel movement, in a sequence of images.
[0113] More specifically, FIG. 6a is a pictorial presentation of
the context of the game, whereas FIG. 6b represents the display of
the game as perceived by a user. FIG. 6c illustrates the estimation
of the movement parameters, or degrees of freedom, of the tracked
objects to deduce therefrom a movement of a vehicle steering
wheel.
[0114] FIG. 6a comprises an image 600 extracted from the sequence
of images provided by the image sensor used. The latter is placed
facing the user, as if it were fastened to the windshield of the
vehicle driven by the user. This image 600 here contains a zone 605
comprising two circular regions of interest 610 and 615 associated
with a steering wheel 620 drawn in overlay by computer graphics.
The image 600 also comprises elements of the real scene in which
the user is situated.
[0115] The initial position of the regions 610 and 615 is fixed on
a predetermined horizontal line, at equal distances on respective
opposite sides of a point representing the center of the steering
wheel, while awaiting a disturbance. When the user positions his
hands in these two regions, he is able to turn the steering wheel
either to the left or to the right. The movement of the regions
610 and 615 is here constrained by the radius of the circle
corresponding to the steering wheel 620. The image representing the
steering wheel moves with the hands of the user, for example
according to the average movement of both hands.
[0116] The radius of the circle corresponding to the steering wheel
620 may also vary when the user moves his hands towards or away
from the center of that circle.
[0117] These two degrees of freedom are next advantageously used to
control the orientation of a vehicle (position of the hands on the
circle corresponding to the steering wheel 620) and its velocity
(scale factor linked to the position of the hands relative to the
center of the circle corresponding to the steering wheel 620).
[0118] FIG. 6b, illustrating the display 625 of the application,
comprises the image portion 605 extracted from the image 600. This
display enables the user to observe and control his movements. The
image portion 605 may advantageously be represented as a car
rear-view mirror in which the driver may observe his actions.
[0119] The regions 610 and 615 of the image 600 enable the
movements of the steering wheel 620 to be controlled, that is to
say to control the direction of the vehicle referenced 630 on the
display 625 as well as its velocity relative to the elements 635 of
the decor, the vehicle 630 and the elements 635 of the decor being
created here by computer graphics. In accordance with standard
driving applications, the vehicle may move in the decor and hit
certain elements.
[0120] FIG. 6c describes more precisely the estimation of the
degrees of freedom linked to each of the regions of interest, from
which the degrees of freedom of the steering wheel are deduced. In
this implementation, the parameters to estimate are the orientation
θ of the steering wheel and its diameter D.
[0121] In order to analyze the components of the movement, several
frames of reference are defined. The frame of reference Ow here
corresponds to an overall frame of reference ("world" frame of
reference), the frame of reference Owh is a local frame of
reference linked to the steering wheel 620 and the frames of
reference Oa1 and Oa2 are two local frames of reference linked to
the regions of interest 610 and 615. The vectors Va1(Xva1, Yva1)
and Va2(Xva2, Yva2) are the movement vectors resulting from the
analysis of the movement of the user's hands in the regions of
interest 610 and 615, expressed in the frames of reference Oa1 and
Oa2, respectively.
[0122] The new orientation θ' of the steering wheel is computed
relative to its previous orientation θ and on the basis of the
movement of the user's hands (determined via the two regions of
interest 610 and 615). The movement of the steering wheel is thus a
constrained movement linked to the movement of several regions of
interest. The new orientation θ' may be computed in the following
manner:

θ' = θ + ((Δθ1 + Δθ2)/2)
[0123] where Δθ1 and Δθ2 represent the rotations of the user's
hands.
[0124] Δθ1 may be computed by the following relationship:

Δθ1 = atan2(Yva1wh, D/2)

[0125] with

Yva1wh = Xva1*sin(-(θ+π)) + Yva1*cos(-(θ+π))

characterizing the translation along the y-axis in the frame of
reference Owh.
[0126] .DELTA..theta.2 may be computed in similar manner.
[0127] Similarly, the new diameter D' of the steering wheel is
computed on the basis of its previous diameter D and of the
movement of the user's hands (determined via the two regions of
interest 610 and 615). It may be computed in the following
manner:

D' = D + ((Xva1wh + Xva2wh)/2)

[0128] with Xva1wh = Xva1*cos(-(θ+π)) - Yva1*sin(-(θ+π)) and

[0129] Xva2wh = Xva2*cos(-θ) - Yva2*sin(-θ)
[0130] Thus, knowing the angular position of the steering wheel and
its diameter, the game scenario may in particular compute a
corresponding computer graphics image.
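The relationships of paragraphs [0122] to [0129] may be gathered into a single update function (a sketch; Δθ2 is computed by analogy with Δθ1, using the -θ rotation given for Xva2wh):

```python
import math

def steering_update(theta, D, va1, va2):
    """New wheel orientation theta' and diameter D' from the two hand
    motion vectors va1 and va2 (expressed in the frames Oa1 and Oa2),
    following the relationships of paragraphs [0122] to [0129]."""
    # Rotate va1 into the wheel frame Owh; hand 1 sits at angle theta + pi.
    a1 = -(theta + math.pi)
    xa1 = va1[0] * math.cos(a1) - va1[1] * math.sin(a1)
    ya1 = va1[0] * math.sin(a1) + va1[1] * math.cos(a1)
    # Hand 2 sits at angle theta.
    a2 = -theta
    xa2 = va2[0] * math.cos(a2) - va2[1] * math.sin(a2)
    ya2 = va2[0] * math.sin(a2) + va2[1] * math.cos(a2)
    d_theta1 = math.atan2(ya1, D / 2)
    d_theta2 = math.atan2(ya2, D / 2)
    new_theta = theta + (d_theta1 + d_theta2) / 2
    new_D = D + (xa1 + xa2) / 2
    return new_theta, new_D
```

For example, with the wheel horizontal (θ = 0), one hand moving down on the left and the other up on the right produce opposite tangential components that add up to a rotation of the wheel, while leaving the diameter unchanged.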
[0131] FIG. 7 illustrates an example of a device which may be used
to identify the movements of objects represented in images provided
by a video camera and to trigger particular actions according to
identified movements. The device 700 is for example a mobile
telephone of smartphone type, a personal digital assistant, a
micro-computer or a workstation.
[0132] The device 700 preferably comprises a communication bus 702
to which are connected: [0133] a central processing unit or
microprocessor 704 (CPU); [0134] a read only memory 706 (ROM) able
to include the operating system and programs such as "Prog"; [0135]
a random access memory or cache memory (RAM) 708, comprising
registers adapted to record variables and parameters created and
modified during the execution of the aforementioned programs;
[0136] a video acquisition card 710 connected to a video camera
712; and [0137] a graphics card 714 connected to a screen or a
projector 716.
[0138] Optionally, the device 700 may also have the following
items: [0139] a hard disk 720 able to contain the aforesaid
programs "Prog" and data processed or to be processed according to
the invention; [0140] a keyboard 722 and a mouse 724 or any other
pointing device such as an optical stylus, a touch screen or a
remote control enabling the user to interact with the programs
according to the invention, in particular during the phases of
installation and/or initialization; [0141] a communication
interface 726 connected to a distributed communication network 728,
for example the Internet, the interface being able to transmit and
receive data; and, [0142] a reader for memory cards (not shown)
adapted to read or write thereon data processed or to be processed
according to the invention.
[0143] The communication bus allows communication and
interoperability between the different elements included in the
device 700 or connected to it. The representation of the bus is
non-limiting and, in particular, the central processing unit may
communicate instructions to any element of the device 700 directly
or by means of another element of the device 700.
[0144] The executable code of each program enabling the
programmable apparatus to implement the processes according to the
invention may be stored, for example, on the hard disk 720 or in
read only memory 706.
[0145] According to a variant, the executable code of the programs
can be received via the communication network 728, through the
interface 726, in order to be stored in an identical
fashion to that described previously.
[0146] More generally, the program or programs may be loaded into
one of the storage means of the device 700 before being
executed.
[0147] The central processing unit 704 will control and direct the
execution of the instructions or portions of software code of the
program or programs according to the invention, these instructions
being stored on the hard disk 720 or in the read-only memory 706 or
in the other aforementioned storage elements. On powering up, the
program or programs which are stored in a non-volatile memory, for
example the hard disk 720 or the read only memory 706, are
transferred into the random-access memory 708, which then contains
the executable code of the program or programs according to the
invention, as well as registers for storing the variables and
parameters necessary for implementation of the invention.
[0148] It should be noted that the communication apparatus
comprising the device according to the invention can also be a
programmed apparatus. This apparatus then contains the code of the
computer program or programs, for example fixed in an
application-specific integrated circuit (ASIC).
[0149] Naturally, to satisfy specific needs, a person skilled in
the art will be able to make amendments to the preceding
description.
* * * * *