U.S. patent application number 10/914,621, for a mobile face capture and image processing system and method, was published by the patent office on 2005-04-21.
The invention is credited to Frank Biocca, Miguel Villaneuva Figueroa, Chandan K. Reddy, Jannick P. Rolland, and George C. Stockman.
United States Patent Application 20050083248
Kind Code: A1
Biocca, Frank; et al.
Published: April 21, 2005
Mobile face capture and image processing system and method
Abstract
Image processing procedures include receiving at least two side
view images of a face of a user. In other aspects, side view images
are warped and blended into an output image of a face of a user as
if viewed from a virtual point of view. In further aspects, a
virtual video is produced in real time of output images from a
video feed of side view images.
Inventors: Biocca, Frank (East Lansing, MI); Rolland, Jannick P. (Chuluota, FL); Stockman, George C. (Okemos, MI); Reddy, Chandan K. (Ithaca, NY); Figueroa, Miguel Villaneuva (Lansing, MI)
Correspondence Address:
HARNESS, DICKEY & PIERCE, P.L.C.
P.O. BOX 828
BLOOMFIELD HILLS, MI 48303 (US)
Family ID: 25010806
Appl. No.: 10/914,621
Filed: August 9, 2004
Related U.S. Patent Documents

Application Number   Filing Date     Patent Number
10/914,621           Aug 9, 2004     --
09/748,761           Dec 22, 2000    6,774,869
Current U.S. Class: 345/8; 348/E13.022; 348/E13.041; 348/E13.045; 348/E13.059; 348/E13.071
Current CPC Class: H04N 13/398 (20180501); H04N 13/194 (20180501); G02B 27/0172 (20130101); H04N 13/286 (20180501); G02B 2027/0187 (20130101); A41D 31/32 (20190201); H04N 13/366 (20180501); G02B 2027/0138 (20130101); G02B 2027/0134 (20130101); G02B 27/017 (20130101); H04N 13/344 (20180501); G02B 2027/0178 (20130101)
Class at Publication: 345/008
International Class: G09G 005/00
Claims
What is claimed is:
1. An image processing method, comprising: receiving at least two
side view images of a face of a user; warping and blending the side
view images into an output image of the face of the user as if
viewed from a virtual point of view; and producing a virtual video
in real time of output images from a video feed of side view
images.
2. The method of claim 1, further comprising: accessing a
three-dimensional closed mesh model of points corresponding to
salient facial feature points; and warping and blending the side
view images by texture mapping the side view images to the
three-dimensional closed mesh model based on mappings of vertices
of polygons of the mesh model into two-dimensional coordinate
spaces of side view images.
3. The method of claim 2, further comprising instantiating a mesh
model for an individual user by obtaining scaling and deformation
transformations.
4. The method of claim 3, further comprising obtaining scaling and
deformation transformations by choosing special points by hand on a
digital frontal and profile photo.
5. The method of claim 3, further comprising obtaining scaling and
deformation transformations by choosing special points from the two
side images of a neutral facial expression of the user captured by
imaging components of a head mounted display worn by the user, and
enabling the user wearing the head mounted display to make
adjustments that apply various scaling and deformation
transformations while viewing a resulting output image rendered by
the head mounted display.
6. The method of claim 2, further comprising: receiving a selection
of a virtual point of view from which to view a three-dimensional
model of the face of the user; and rendering the three-dimensional
model from the virtual point of view based on the selection.
7. The method of claim 6, further comprising applying a lighting
model that determines how the three-dimensional model appears to be
illuminated.
8. The method of claim 2, further comprising dynamically fitting
parameters of the mesh model by optimizing similarity between a
three-dimensional model rendered from a virtual point of view
corresponding to an actual point of view of a side view image while
using the parameters and the side view image.
9. The method of claim 8, further comprising fitting the parameters
via hill-climbing so that incremental dynamic updates can be made
to the model parameters for sequentially observed side video
frames.
10. The method of claim 2, further comprising training a morphable
model for dynamic use on a population of users, including capturing
side views and a frontal view of a diverse set of training
speakers.
11. The method of claim 10, further comprising: hand labeling mesh
points for a diverse set of faces and multiframe video recording;
and performing principal components analysis to obtain a minimum
spanning dimensionality.
12. The method of claim 2, further comprising texture mapping
triangles of the mesh model to stored face data as needed to fill
in un-imaged patches.
13. The method of claim 1, further comprising: accessing
transformation tables for the side view images, wherein the
transformation tables define rules for interpolating regions of the
side view images into side portions of the output image; warping
the side view images based on the transformation tables, thereby
producing the side portions of the output image; and blending the
side portions of the output image, thereby producing the output
image.
14. The method of claim 13, further comprising creating the
transformation tables by projecting a grid pattern onto a human
face at least as if from the virtual point of view and mapping
polygons of left and right calibration face images to corresponding
polygons of the grid pattern.
15. The method of claim 13, wherein warping the side view images
includes reconstructing coordinates in side portions of the output
image by accessing corresponding locations in the transformation
tables and retrieving pixels in the side view images using
interpolation.
16. The method of claim 1, wherein receiving at least two side view
images includes receiving side view images captured via at least
two imaging components of a head mounted display worn by the user,
said imaging components attached to said head mounted display unit
and thereby obtaining fixed positions and orientations relative to
the face of the user and adapted to receive at least two side views
of the face of the user.
17. The method of claim 1, further comprising linearly smoothing
the output image in order to smooth intensity across a vertical
midline of the face.
18. An apparatus, comprising: a head mounted display unit worn by a
first user, the display unit rendering to the first user an output
image of a face of a second user virtually interacting with the
first user in a collaborative, virtual environment, wherein the
output image has been formed, based on offset view images of the
face of the second user, such that the face of the second user
appears as if viewed from a virtual point of view; and an input
port receiving at least one of the following: (a) offset view
images of the face of the second user; (b) user-specific scaling
and deformation transformations specific to the second user; (c)
position of the face of the second user in a common coordinate
system of the collaborative, virtual environment; (d) a
three-dimensional model of the face of the second user; (e) a
selection of a virtual point of view from which to render the
three-dimensional model of the face of the second user; and (f) an
output image of the face of the second user.
19. The apparatus of claim 18, further comprising: an array of at
least two imaging components having fixed positions and
orientations relative to a face of the first user and adapted to
receive at least two offset views of the face of the first user;
and an output port transmitting at least one of the following: (a)
offset view images of the face of the first user; (b) user-specific
scaling and deformation transformations specific to the first user;
(c) position of the face of the first user in the common coordinate
system of the collaborative, virtual environment within which the
first user and the second user virtually interact; (d) a
three-dimensional model of the face of the first user; (e) a
selection of a virtual point of view from which to render the
three-dimensional model of the face of the first user; and (f) an
output image of the face of the first user.
20. The apparatus of claim 19, further comprising an image
processing module accessing a three-dimensional closed mesh model
of points corresponding to salient facial feature points of offset
view images of the face of the first user, and combining the offset
view images of the face of the first user by texture mapping the
offset view images of the face of the first user to the
three-dimensional closed mesh model based on mappings of vertices
of polygons of the mesh model into two-dimensional coordinate
spaces of the offset view images of the face of the first user,
thereby forming the three-dimensional model of the face of the
first user.
21. The apparatus of claim 20, wherein said image processing module
is further adapted to select a virtual point of view from which to
view the three-dimensional model of the face of the first user
based on positions of faces of the first user and the second user
in a common coordinate system of the collaborative environment, and
to render the three-dimensional model of the face of the first user
from the virtual point of view, thereby forming the output image of
the face of the first user.
22. The apparatus of claim 21, wherein said image processing module
is adapted to linearly smooth the output image in order to smooth
intensity across a vertical midline of the face of the first
user.
23. The apparatus of claim 21, wherein said image processing module
is adapted to apply a lighting model that determines how the
three-dimensional model of the face of the first user appears to be
illuminated.
24. The apparatus of claim 18, further comprising an image
processing module adapted to select a virtual point of view from
which to view the three-dimensional model of the face of the second
user based on positions of faces of the first user and the second
user in a common coordinate system of the collaborative
environment, and to render the three-dimensional model of the face
of the second user from the virtual point of view, thereby forming
the output image of the face of the second user.
25. The apparatus of claim 24, wherein said image processing module is further adapted to access a three-dimensional closed mesh model of points corresponding to salient facial feature points of offset view images of the face of the second user, and to combine the offset view images of the face of the second user by texture
mapping the offset view images of the face of the second user to
the three-dimensional closed mesh model based on mappings of
vertices of polygons of the mesh model into two-dimensional
coordinate spaces of the offset view images of the face of the
second user, thereby forming the three-dimensional model of the
face of the second user.
26. The apparatus of claim 24, wherein said image processing module
is further adapted to apply a lighting model that determines how
the three-dimensional model appears to be illuminated.
27. The apparatus of claim 18, further comprising an image
processing module adapted to linearly smooth the output image in
order to smooth intensity across a vertical midline of the
face.
28. An apparatus, comprising: an array of at least two imaging
components having fixed positions and orientations relative to a
face of a first user and adapted to receive at least two offset
views of the face of the first user; and an output port
transmitting at least one of the following: (a) offset view images
of the face of the first user; (b) user-specific scaling and
deformation transformations specific to the first user; (c)
position of the face of the first user in a common coordinate
system of a collaborative, virtual environment within which the
first user and a second user virtually interact; (d) a
three-dimensional model of the face of the first user; (e) a
selection of a virtual point of view from which to render the
three-dimensional model of the face of the first user; and (f) an
output image of the face of the first user, wherein the output
image of the face of the first user has been formed by combining
offset view images of the face of the first user into an output
image of the face of the first user as if viewed from a virtual
point of view.
29. The apparatus of claim 28, further comprising an image
processing module accessing a three-dimensional closed mesh model
of points corresponding to salient facial feature points of offset
view images of the face of the first user, and combining the offset
view images of the face of the first user by texture mapping the
offset view images of the face of the first user to the
three-dimensional closed mesh model based on mappings of vertices
of polygons of the mesh model into two-dimensional coordinate
spaces of the offset view images of the face of the first user,
thereby forming the three-dimensional model of the face of the
first user.
30. The apparatus of claim 29, wherein said image processing module
is further adapted to select a virtual point of view from which to
view the three-dimensional model of the face of the first user
based on positions of faces of the first user and the second user
in a common coordinate system of the collaborative environment, and
to render the three-dimensional model of the face of the first user
from the virtual point of view, thereby forming the output image of
the face of the first user.
31. The apparatus of claim 30, wherein said image processing module
is adapted to linearly smooth the output image in order to smooth
intensity across a vertical midline of the face of the first
user.
32. The apparatus of claim 29, wherein said image processing module
is adapted to apply a lighting model that determines how the
three-dimensional model of the face of the first user appears to be
illuminated.
33. Computer software, comprising: first instructions receiving at
least two offset view images of a contoured structure; second
instructions forming, from the offset view images, an output image
of the contoured structure as if viewed from a virtual point of
view.
34. The computer software of claim 33, wherein said second
instructions are adapted to recognize feature points of the
contoured structure in the offset view images, to access a
three-dimensional closed mesh model of feature points similar to
the recognized feature points, and to texture map the offset view
images to the three-dimensional closed mesh model based on mappings
of vertices of polygons of the mesh model into two-dimensional
coordinate spaces of the offset view images, thereby forming a
three-dimensional model of the contoured structure.
35. The computer software of claim 33, wherein said second instructions are further adapted to select a virtual point of view
from which to view the three-dimensional model, and to render the
three-dimensional model from the virtual point of view, thereby
forming the output image.
36. Computer software, comprising: a first set of instructions
receiving at least two offset view images of a contoured structure;
a second set of instructions recognizing feature points of the
contoured structure in the offset view images, accessing a
three-dimensional closed mesh model of feature points similar to
the recognized feature points, and texture mapping the offset view
images to the three-dimensional closed mesh model based on mappings
of vertices of polygons of the mesh model into two-dimensional
coordinate spaces of the offset view images, thereby forming a
three-dimensional model of the contoured structure.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation-in-part of U.S. patent
application Ser. No. 09/748,761 filed on Dec. 22, 2000. The
disclosure of the above application is incorporated herein by
reference in its entirety for any purpose.
FIELD OF THE INVENTION
[0002] The present invention generally relates to computer-based
teleconferencing in a networked virtual reality environment, and
more particularly to mobile face capture and image processing.
BACKGROUND OF THE INVENTION
[0003] Networked virtual environments allow users at remote
locations to use a telecommunication link to coordinate work and
social interaction. Teleconferencing systems and virtual
environments that use 3D computer graphics displays and digital
video recording systems allow remote users to interact with each
other, to view virtual work objects such as text, engineering
models, medical models, play environments and other forms of
digital data, and to view each other's physical environment.
[0004] A number of teleconferencing technologies support
collaborative virtual environments which allow interaction between
individuals in local and remote sites. For example,
video-teleconferencing systems use simple video screens and wide
screen displays to allow interaction between individuals in local
and remote sites. However, wide screen displays are disadvantageous
because virtual 3D objects presented on the screen are not blended
into the environment of the room of the users. In such an
environment, local users cannot have a virtual object between them.
This problem applies to representation of remote users as well. The
location of the remote participants cannot be anywhere in the room
or the space around the user, but is restricted to the screen.
[0005] Networked immersive virtual environments also present
various disadvantages. Networked immersive virtual reality systems
are sometimes used to allow remote users to connect via a
telecommunication link and interact with each other and virtual
objects. In many such systems the users must wear a virtual reality
display where the user's eyes and a large part of the face are
occluded. Because these systems only display 3D virtual
environments, the user cannot see both the physical world of the
site in which they are located and the virtual world which is
displayed. Furthermore, people in the same room cannot see each
other's full faces and eyes, so local interaction is diminished.
Because the face is occluded, such systems cannot capture and
record a full stereoscopic view of remote users' faces.
[0006] Another teleconferencing system is termed CAVES. CAVES
systems use multiple screens arranged in a room configuration to
display virtual information. Such systems have several
disadvantages. In CAVES systems, there is only one correct
viewpoint; all other local users have a distorted perspective on
the virtual scene. Scenes in the CAVES are only projected on a
wall. So two local users can view a scene on the wall, but an
object cannot be presented in the space between users. These
systems also use multiple rear screen projectors, and therefore are
very bulky and expensive. Additionally, CAVES systems may also
utilize stereoscopic screen displays. Stereoscopic screen display
systems do not present 3D stereoscopic views that interpose 3D
objects between local users of the system. These systems sometimes
use 3D glasses to present a 3D view, but only one viewpoint is
shared among many users often with perspective distortions.
[0007] Consequently, there is a need for an augmented reality
display that mitigates the above mentioned disadvantages and has
the capability to display virtual objects and environments,
superimpose virtual objects on the "real world" scenes, provide
"face-to-face" recording and display, be used in various ambient
lighting environments, and correct for optical distortion, while
minimizing computational power and time.
[0008] Faces have been captured passively in rooms instrumented
with a set of cameras, where stereo computations can be done using
selected viewpoints. Other objects can be captured using the same
methods. Such hardware configurations are unavailable for mobile
use in arbitrary environments, however. Other work has shown that
faces can be captured using a single camera and processing that
uses knowledge of the human face. Either the face has to move
relative to the camera, or assumptions of symmetry are employed.
Our approach is to use two cameras affixed to the head, which is
necessary to convey non-symmetrical facial expressions, such as the
closing of one eye and not the other, or the reflection of a fire
on only one side of the face.
[0009] There is little overlap in the images taken from outside the
user's central field of view, so the frontal view synthesized is a
novel view. In previous work, novel views have been synthesized by
a panoramic system and/or by interpolating between a set of views.
Producing novel views in a dynamic scenario was successfully shown
for a highly rigid motion. This work extended interpolation
techniques to the temporal domain from the spatial domain. A novel
view at a new time instant was generated by interpolating views at
nearby time intervals using spatio-temporal view interpolation,
where a dynamic 3-D scene is modeled and novel views are generated
at intermediate time intervals.
[0010] There remains a need for a way to generate in real time a
synthetic frontal view of a human face from two real side
views.
SUMMARY OF THE INVENTION
[0011] In accordance with the present invention, image processing
procedures include receiving at least two side view images of a
face of a user. In other aspects, side view images are warped and
blended into an output image of a face of a user as if viewed from
a virtual point of view. In further aspects, a video is produced in
real time of output images from a video feed of side view
images.
[0012] In yet other aspects, a teleportal system is provided. A
principal feature of the teleportal system is that single or
multiple users at a local site and a remote site use a
telecommunication link to engage in face-to-face interaction with
other users in a 3D augmented reality environment. Each user
utilizes a system that includes a display such as a projection
augmented-reality display and sensors such as a stereo facial
expression video capture system. The video capture system allows
the participants to view a 3D, stereoscopic, video-based image of the face of each remote participant and hear their voices, to view the local participants unobstructed, and to view a room that blends physical and virtual objects which users can interact with and manipulate.
[0013] In one preferred embodiment of the system, multiple local
and remote users can interact in a room-sized space draped in a
fine grained retro-reflective fabric. An optical tracker, preferably comprising markers attached to each user's body and digital video cameras at the site, records the location of each user at the site. A
computer uses the information about each user's location to
calculate the user's body location in space and create a correct
perspective on the location of the 3D virtual objects in the
room.
[0014] The projection augmented-reality display projects stereo
images towards a screen which is covered by a fine grain
retro-reflective fabric. The projection augmented-reality display
uses an optics system that preferably includes two miniature source
displays, and projection-optics, such as a double Gauss form lens
combined with a beam splitter, to project an image via light
towards the surface covered with the retro-reflective fabric. The
retro-reflective fabric retro-reflects the projected light brightly
and directly back to the eyes of the user. Because of the
properties of the retro-reflective screen and the optics system,
each eye receives the image from only one of the source displays.
The user perceives a 3D stereoscopic image apparently floating in
space. The projection augmented-reality display and video capture
system does not occlude vision of the physical environment in which
the user is located. The system of the present invention allows
users to see both virtual and physical objects, so that the objects
appear to occupy the same space. Depending on the embodiment of the
system, the system can completely immerse the user in a virtual
environment, or the virtual environment can be restricted to a
specific region in space, such as a projection window or table top.
Furthermore, the restricted regions can be made part of an
immersive wrap-around display.
[0015] Further areas of applicability of the present invention will
become apparent from the detailed description provided hereinafter.
It should be understood that the detailed description and specific
examples, while indicating the preferred embodiment of the
invention, are intended for purposes of illustration only and are
not intended to limit the scope of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] The present invention will become more fully understood from
the detailed description and the accompanying drawings,
wherein:
[0017] FIG. 1 is a plan view of a first preferred embodiment of a
teleportal system of the present invention showing one local user
at a first site and two remote users at a second site;
[0018] FIG. 2 is a block diagram depicting the teleportal system of
the present invention;
[0019] FIG. 3 is a perspective view of the illumination system for
a projection user-mounted display of the present invention;
[0020] FIG. 4 is a perspective view of a first preferred embodiment
of a vertical architecture of the illumination system for the
projection user-mounted display of the present invention;
[0021] FIG. 5 is a perspective view of a second preferred
embodiment of a horizontal architecture of the illumination system
for the projection user-mounted display of the present
invention;
[0022] FIG. 6 is a diagram depicting an exemplary optical pathway
associated with a projection user-mounted display of the present
invention;
[0023] FIG. 7 is a side view of a projection lens used in the
projection augmented-reality display of the present invention;
[0024] FIG. 8 is a side view of the projection augmented-reality
display of FIG. 4 mounted into a headwear apparatus;
[0025] FIG. 9 is a perspective view of the video system in the
teleportal headset of the present invention;
[0026] FIG. 10 is a side view of the video system of FIG. 9;
[0027] FIG. 11 is a top view of a video system of FIG. 9;
[0028] FIG. 12a is an alternate embodiment of the teleportal site
of the present invention with a wall screen;
[0029] FIG. 12b is another alternate embodiment of the teleportal
site of the present invention with a spherical screen;
[0030] FIG. 12c is yet another alternate embodiment of the
teleportal site of the present invention with a hand-held
screen;
[0031] FIG. 12d is yet another alternate embodiment of the
teleportal site of the present invention with body shaped
screens;
[0032] FIG. 13 is a first preferred embodiment of the projection
augmented-reality display of the present invention;
[0033] FIG. 14 is a side view of the projection augmented-reality
display of FIG. 13;
[0034] FIG. 15 is a view of a face capture concept and images from
a prototype head mounted display unit;
[0035] FIG. 16 is a view of an experimental prototype of a face
capture system;
[0036] FIG. 17 is a view demonstrating behavior of a grid
pattern;
[0037] FIG. 18 is a view of face images captured during a
calibration stage;
[0038] FIG. 19 is a block diagram of an off-line calibration stage
during synthesis of a virtual frontal view;
[0039] FIG. 20 is a block diagram of an operational stage during
synthesis of a virtual frontal view;
[0040] FIG. 21 is a set of views illustrating generation of a
frontal view during a calibration stage and reconstruction of the
frontal image from a side view using a grid: (a) left image
captured during the calibration stage; (b) operational left image
warped into virtual image plus calibration stripes; and (c)
operational left image without stripes;
[0041] FIG. 22 is a set of views illustrating: (a) a frontal view
obtained from a camcorder; and (b) a virtual frontal view obtained
as a reconstructed frontal view from transformation tables and a
side image of FIG. 21(c);
[0042] FIG. 23 is a set of views of images considered for objective
evaluation with a top row of real video frames compared to a bottom
row of virtual video frames;
[0043] FIG. 24 is a set of views of a real video image on the left
compared to a corresponding virtual video image on the right,
wherein facial regions are compared using cross-correlation;
[0044] FIG. 25 is a set of views of a real video image on the left
compared to a corresponding virtual video image on the right,
wherein distances between facial feature points are considered
using a Euclidean distance measure;
[0045] FIG. 26 is a set of views with a top row showing images
captured using a left camera, a second row showing images captured
using a right camera; a third row showing images captured using a
camcorder placed in front of the face, and a final row showing
virtual frontal views generated from images in the first two
rows;
[0046] FIG. 27 is a set of views illustrating synchronization of
eyelids during blinking, with real video displayed in a top row and
virtual video illustrated in a bottom row; and
[0047] FIG. 28 is a view identifying some feature points in a side
image and a set of triangles formed using the feature points as
vertices.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0048] The following description of the preferred embodiment(s) is
merely exemplary in nature and is in no way intended to limit the
invention, its application, or uses.
[0049] FIG. 1 depicts a teleportal system 100 using two display
sites 101 and 102. Teleportal system 100 includes a first
teleportal site or local site 101 and a second teleportal site or
remote site 102. It should be appreciated that additional
teleportal sites can be included in teleportal system 100. Although
first teleportal site 101 is described in detail below, it should
further be appreciated that the second teleportal site 102 can be
identical to the first teleportal site 101. It should also be noted
that the number of users and types of screens can vary at each
site.
[0050] Teleportal sites 101 and 102 preferably include a screen
103. Screen 103 is made of a retro-reflective material, such as bead-based or corner-cube-based materials manufactured by 3M® and Reflexite Corporation. The retro-reflective material is preferably gold, which produces a bright image with adequate resolution. Alternatively, other material having metallic fiber adequate to reflect at least a majority of the image or light projected onto its surface may be used. The retro-reflective
material preferably provides about 98 percent reflection of the
incident light projected onto its surface. The material
retro-reflects light projected onto its surface directly back upon
its incident path and to the eyes of the user. Screen 103 can be a
surface of any shape, including but not limited to a plane, sphere,
pyramid, and body-shaped, for example, like a glove for a user's
hand or a body suit for the entire body. Screen 103 can also be
formed to a substantially cubic shape resembling a room, preferably
similar to four walls and a ceiling which generally surround the
users. In the preferred embodiment, screen 103 forms four walls
which surround users 110. 3D graphics are visible via screen 103.
Because the users can see 3D stereographic images, text, and
animations, all surfaces that have retro-reflective property in the
room or physical environment can carry information. For example, a
spherical screen 104 is disposed within the room or physical
environment for projecting images. The room or physical environment
may include physical objects substantially unrelated to the
teleportal system 100. For example, physical objects may include
furniture, walls, floors, ceilings and/or other inanimate
objects.
[0051] With continued reference to FIG. 1, local site 101
includes a tracking system 106. Tracking system 106 is preferably
an optical or optical/hybrid tracking system which may include at
least one digital video camera or CCD camera. By way of example,
four digital video cameras 114, 116, 118 and 120 are shown. By way
of another example, several sets of three CCD arrays stacked up
could be used for optical tracking. Visual processing software (not
shown) processes teleportal site data acquired from digital video
cameras 114, 116, 118 and 120. The software provides the data to
the networked computer 107a. Teleportal site data, for example,
includes the position of users within the teleportal room.
[0052] Optical tracking system 106 further includes markers 96 that
are preferably attached to one or more body parts of the user. In
the preferred embodiment, markers 96 are coupled to each user's
hand, which is monitored for movement and position. Markers 96
communicate marker location data regarding the location of the
user's head and hands. It should be appreciated that the location
of any other body part of the user or object to which a marker is
attached can be acquired.
[0053] Users 110 wear a novel teleportal headset 105. Each headset
preferably has displays and sensors. Each teleportal headset 105
communicates with a networked computer. For example, teleportal
headsets 105 of site 101 communicate with networked computer 107a.
Networked computer 107a communicates with a networked computer 107b
of site 102 via a networked data system 99. In this manner,
teleportal headsets can exchange data via the networked computers.
It should be appreciated that teleportal headset 105 can be
connected via a wireless connection to the networked computers. It
should also be appreciated that headset 105 can alternatively
communicate directly with networked data system 99. The networked data system 99 can be, for example, the Internet, a dedicated telecommunication line connecting the two sites, or a wireless network connection.
[0054] FIG. 2 is a block diagram showing the components for
processing and distribution of information of the present invention
teleportal system 100. It should be appreciated that information
can be processed and distributed from other sources that provide
visual data which can be projected by teleportal system 100, for example, digital pictures of body parts, images acquired via medical imaging technology, and images of other three-dimensional (3D) objects. Teleportal headset 105 includes at least one sensor
array 220 which identifies and transmits the user's behavior. In
the preferred embodiment, sensor array 220 includes a facial
capture system 203 (described in further detail with reference to
FIGS. 9, 10, and 11) that senses facial expression, an optical
tracking system 106 that senses head motion, and a microphone 204
that senses voice and communication noise. It should be appreciated
that other attributes of the user's behavior can be identified and
transmitted by adding additional types of sensors.
[0055] Each of sensors 203, 106 and 204 is preferably connected to networked computer 107 and sends signals to the networked computer. However, it should be appreciated that sensors 203, 106 and 204 can communicate directly with networked data system 99. Facial capture system 203 provides image signals, based on the images viewed by its digital cameras, which are processed by a face-warping and image-stitching module 207. Images or "first images" sensed by face
capture system 203 are morphed for viewing by users at remote sites
via a networked computer. The images for viewing are 3D and
stereoscopic such that each user experiences a perspectively
correct viewpoint on an augmented reality scene. The images of
participants can be located anywhere in space around the user.
[0056] Morphing distorts the stereo images to produce a viewpoint
of preferably a user's moving face that appears different from the
viewpoint originally obtained by facial capture system 203. The
distorted viewpoint is accomplished via image morphing to
approximate a direct face-to-face view of the remote face.
Face-warping and image-stitching module 207 morphs images to the
user's viewpoint. The pixel correspondence algorithm or face
warping and image stitching module 207 calculates the corresponding
points between the first images to create second images for remote
users. Image data retrieved from the first images allows for a
calculation of a 3D structure of the head of the user. The 3D image
is preferably a stereoscopic video image or a video texture mapping
to a 3D virtual mesh. The 3D model can display the 3D structure or
second images to the users in the remote location. Each user in the
local and remote sites has a personal and correct perspective
viewpoint on the augmented reality scene. Optical tracking system
106 and microphone 204 provide signals to networked computer 107
that are processed by a virtual environment module 208.
[0057] A display array 222 is provided to allow the user to
experience the 3D virtual environment, for example via a projection
augmented-reality display 401 and stereo audio earphones 205 which
are connected to user 110. Display array 222 is connected to a
networked computer. In the preferred embodiment, a modem 209
connects a networked computer to network 99.
[0058] FIGS. 3 through 5 illustrate a projection augmented-reality
display 401 which can be used in a wide variety of lighting
conditions, including indoor and outdoor environments. With
specific reference to FIG. 3, a projection lens 502 is positioned
to receive a beam from a beamsplitter 503. A source display 501,
which is a reflective LCD panel, is positioned opposite of
projection lens 502 from beamsplitter 503. Alternatively, source
display 501 may be a DLP flipping mirror manufactured by Texas Instruments®. Beamsplitter 503 is angled at a position less
than ninety degrees from the plane in which projection lens 502 is
positioned. A collimating lens 302 is positioned to provide a
collimating lens beam to beamsplitter 503. A mirror 304 is placed
between collimating lens 302 and a surface mounted LCD 306. Surface
mounted LCD 306 provides light to mirror 304 which passes through
collimating lens 302 and beamsplitter 503.
[0059] Source display 501 transmits light to beamsplitter 503. It
should be appreciated that FIG. 4 depicts a pair of the projection
augmented-reality displays shown in FIG. 3; however, each of
projection augmented-reality displays 530 and 532 are mounted in a
vertical orientation relative to the head of the user. Furthermore,
FIG. 5 depicts a pair of projection augmented-reality displays of
the type shown in FIG. 3; however, each of projection
augmented-reality displays 534 and 536 are mounted in a horizontal
orientation relative to the head of the user.
[0060] FIG. 6 illustrates the optics of projection
augmented-reality display 500 relative to a user's eye 508. A
projection lens 502 receives an image from a source display 501
located beyond the focal plane of projection lens 502. Source
display 501 may be a reflective LCD panel. However, it should be
appreciated that any miniature display including, but not limited
to, miniature CRT displays, DLP flipping mirror systems and
backlighting transmissive LCDs may be alternatively utilized.
Source display 501 preferably provides an image that is further
transmitted through projection lens 502. The image is preferably
computer-generated. A translucent mirror or light beamsplitter 503
is placed after projection lens 502 at preferably 45 degrees with
respect to the optical axis of projection lens 502; therefore, the
light refracted by projection lens 502 produces an intermediary
image 505 at its optical conjugate and the reflected light of the
beam-splitter produces a projected image 506, symmetrical to
intermediary image 505 about the plane in which light beamsplitter
503 is positioned. A retro-reflective screen 504 is placed in a
position onto which projected image 506 is directed.
Retro-reflective screen 504 may be located in front of or behind
projected image 506 so that rays hitting the surface are reflected
back in the opposite direction and travel through beamsplitter 503
to user's eye 508. The reflected image is of a sufficient
brightness which permits improved resolution. User's eye 508 will
perceive projected image 506 from an exit pupil 507 of the optical
system.
[0061] FIG. 7 depicts a preferred optical form for projection lens
502. Projection lens 502 includes a variety of elements and can be
accomplished with glass optics, plastic optics, or diffractive
optics. A non-limiting example of projection lens 502 is a double
Gauss lens form formed by a first singlet lens 609, a second
singlet lens 613, a first doublet lens 610, a second doublet lens
612, and a stop surface 611, which are arranged in series.
Projection lens 502 is made of a material which is transparent to
visible light. The lens material may include glass and plastic
materials.
[0062] Additionally, the projection augmented-reality display can
be mounted on the head. More specifically, FIG. 8 shows projection
augmented-reality display 800 mounted to headwear or helmet 810.
Projection augmented-reality display 800 is mounted in a vertical
direction. Projection augmented-reality display 800 can be used in
various ambient light conditions, including, but not limited to,
artificial light and natural sunlight. In the preferred embodiment,
light source 812 transmits light to source display 814. Projection
augmented-reality display 800 provides optics to produce an image
to the user.
[0063] FIGS. 9, 10 and 11 illustrate teleportal headset 105 of the
present invention. Teleportal headset 105 preferably includes a
facial expression capture system 402, ear phones 404, and a
microphone 403. Facial expression capture system 402 preferably
includes digital video cameras 601a and 601b. In the preferred
embodiment, digital video cameras 601a and 601b are disposed on
either side of the user's face 606, such that images covering the
entire face are captured, which are then used to create one image of
the complete face, or a 3D model of the complete face that can then
be used to generate single images or stereo images for general
viewpoints of the face 606.
[0064] Each video camera 601a and 601b is mounted to a housing 406.
Housing 406 is formed as a temple section of the headset 105. In
the preferred embodiment, each digital video camera 601a and 601b
is pointed at a respective convex mirror 602a and 602b. Each convex
mirror 602a and 602b is connected to housing 406 and is angled to
reflect an image of the adjacent side of the face. Digital cameras
601a and 601b located on each side of the user's face 410 capture a
first image or particular image of the face from each convex mirror
602a and 602b associated with the individual digital cameras 601a
and 601b, respectively, such that a stereo image of the face is
captured. A lens 408 is located at each eye of user face 606. Lens
408 allows images to be displayed to the user, as the lens 408 is positioned at 45 degrees relative to the axis along which a light beam is transmitted from a projector. Lens 408 is made of a material that
reflects and transmits light. One preferred material is "half
silvered mirror."
[0065] FIGS. 12a through 12d show alternate configurations of a
teleportal site of the present invention with various shaped
screens. FIG. 12a illustrates an alternate embodiment of the
teleportal system 702 in which retro-reflective fabric screen 103
is used on a room's wall so that a more traditional
teleconferencing system can be provided. FIG. 12b illustrates
another alternate embodiment of a teleportal site 704 in which a
desktop system 702 is provided. In desktop system 702, two users
110 observe a 3D object on a table top screen 708. In the preferred
embodiment, screen 708 is spherically shaped. All users in sight of
the screen 708 can view the perspective projections at the same
time from their particular positions.
[0066] FIG. 12c shows yet another alternate embodiment of
teleportal site 704. User 110 has a wearable computer forming a
"magic mirror" configuration of teleportal site 704. Teleportal
headset 105 is connected to a wearable computer 712. The wearable
computer 712 is linked to the remote user (not shown) preferably
via a wireless network connection. A wearable screen includes a
hand-held surface 714 covered with a retro-reflective fabric for
the display of the remote user. A "magic mirror" configuration of
teleportal site 704 is preferred in the outdoor setting because it
is mobile and easy to transport. In the "magic mirror
configuration," the user holds the surface 714, preferably via a
handle and positions the surface 714 over a space to view the
virtual environment projected by the projection display of the
teleportal headset 105.
[0067] FIG. 12d shows yet another alternate embodiment of the
teleportal site 810. A body shaped screen 812 is disposed on a
person's body 814. Body shaped screen 812 can be continuous or
substantially discontinuous depending upon the desire to cover
certain body parts. For example, a body shaped screen 812 can be
shaped for a patient's head, upper body, and lower body. A body
shaped screen 812 is beneficial for projecting images, such as that
produced by MRI (or other digital images), onto the patient's body
during surgery. This projecting permits a surgeon or user 816 to
better approximate the location of internal organs prior to
invasive treatment. Body shaped screen 812 can further be formed as
gloves 816, thereby allowing the surgeon to place his hands (and
arms) over the body of the patient yet continue to view the
internal image in a virtual view without interference of his
hands.
[0068] FIGS. 13 and 14 show a first preferred embodiment of a
projection augmented-reality display 900 which includes a pair of
LCD displays 902 coupled to headwear 905. In the preferred
embodiment, a pair of LCD displays 902 project images to the eyes
of the users. A microphone 910 is also coupled to headwear 905 to
sense the user's voice. Furthermore, an earphone 912 is coupled to
headwear 905. A lens 906 covers the eyes of the user 914 but still
permits the user to view her surroundings. The glass lens
906 transmits and reflects light. In this manner, the user's eyes
are not occluded by the lens. One preferred material for the
transparent glass lens 906 is a "half silvered mirror."
[0069] Communication of the expressive human face is important to
tele-communication and distributed collaborative work. In addition
to sophisticated collaborative work environments, there is a strong
popular trend for the merger of cell phone and video functionality
at consumer prices. At both ends of the technology spectrum, there
is a problem producing quality video of a person's face without
interfering with that person's ability to perform some task
requiring both visual and motor attention. When the person is
mobile, the technology of most collaborative environments is
unusable. Referring now to FIG. 15, the solution proposed here is
to modify a helmet mounted display (HMD) for minimally intrusive
face capture. The prototype HMD has small mirrors held above the
temples and viewed by small video cameras above the ears, creating
a helmet that is balanced and light and with minimal occlusion of
the wearer's field of view. The complete HMD design includes
display components that display remote faces and scenes to the
wearer as well as reality augmentation for the wearer's
environment. The system and method of the present invention
provides a virtual frontal video of the HMD wearer. This virtual
video (VV) is synthesized by warping and blending the two real side
view videos.
[0070] Side view as used herein should be interpreted as any offset
view. Thus, the angle with respect to the face does not have to be
directly from the side. Also, the side view can be from an angle
beneath or above the face. Further, while side views of faces of
users are typically captured and used from/in a virtual view, it
should be readily understood that other parts of a user may also be
captured, such as a user's hand.
[0071] A prototype HMD facial capture system has been developed.
The development of the video processing reported here was isolated
from the HMD device and performed using a fixed lab bench and
conventional computer. Porting and integration of the video
processing with the mobile HMD hardware can be accomplished in a
variety of ways as further described below.
[0072] The prototype system was configured with off-the-shelf
hardware and software components. FIG. 16 illustrates a lab bench
used to develop the mobile face capture and image processing system
and method. The bench was built to accommodate human subjects so
they could keep their heads fixed relative to two cameras 1000A and
1000B and a structured light projector 1002. The two cameras 1000A
and 1000B are placed so that their images are similar to those that
can be obtained from the HMD optics. The light projector 1002 is
used to orient the head precisely and to obtain calibration data
used in image warping. In addition to the equipment shown in FIG.
16, a video camera (not shown) placed on top of the projector
records the subject's face during each experiment for comparison
purposes. The prototype uses an Intel Pentium III processor running
at 746 MHz with 384 MB of RAM and two Matrox Meteor II standard cards.
[0073] In the experiment demonstrating feasibility of some
embodiments of the present invention, several videos were taken for
several volunteers so that the synthetic video could be compared to
real video. One question posed was whether the synthetic frontal
video would be of sufficient quality to support the applications
intended for the HMD. The bench was set up for a general user and
adjustments were made for individuals only when needed. Video and
audio were recorded for each subject for offline processing.
[0074] The problem is to generate a virtual frontal view from two
side views. The projected light grid provides a basis for mapping
pixels from the side images into a virtual image with the
projector's viewpoint. The grid is projected onto the face for only
a few frames so that mapping tables can be built, and then is
switched off for regular operation.
[0075] Three 2D coordinate systems are involved in the creation of the virtual video; a global 3D world coordinate system is also listed for discussion, but it must be emphasized that 3D coordinates are not required for the task according to some embodiments of the present invention.
[0076] 1) World Coordinate System (WCS): for discussion only in
some embodiments
[0077] 2) Left Camera Coordinate System (LCS): I_L[s, t] is the left image with s, t coordinates.
[0078] 3) Right Camera Coordinate System (RCS): I_R[u, v] is the right image with u, v coordinates.
[0079] 4) Projector Coordinate System (PCS): V[x, y] is the output
virtual video image with coordinates defined by the projected
grid.
[0080] During the calibration phase, the transformation tables are
generated using the grid pattern coordinates. A rectangular grid is
projected onto the face and the two side views are captured as
shown in FIGS. 16 and 18. The location of the grid regions in the
side images define where real pixel data is to be accessed for
placement in the virtual video. Coordinate transformation is done
between PCS and LCS and between PCS and RCS. Using transformation
tables that store the locations of grid points, an algorithm can
map every pixel in the front view to the appropriate side view. By
centering the grid on the face, the grid also supports the
correspondence between LCS and RCS and the blending of their
pixels.
[0081] The behavior of a single gridded cell in the original side
view and the virtual frontal view is demonstrated in FIG. 17. A
grid cell in the frontal image maps to a quadrilateral with curved
edges in the side image. Bilinear interpolation is used to
reconstruct the original frontal grid pattern by warping a
quadrilateral into a square or a rectangle.
s = f_l(x, y) and t = g_l(x, y) (1)
u = f_r(x, y) and v = g_r(x, y) (2)
[0082] Equations 1 and 2 define four functions determined during the calibration stage and implemented via the transformation tables.
These transformation tables are then used in the operational stage
immediately after the grid is switched off. During operation, it is
known for each pixel V[x, y] in which grid cell of LCS or RCS it
lies. Bilinear interpolation is then used on the grid cell corners
to access an actual pixel value to be output to the VV.
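By way of illustration only, the following Python sketch shows one way the operational-stage lookup described above could be organized. It assumes the transformation table stores, for each projector-grid intersection, the (row, column) location observed for that intersection in the side image during calibration; the function and parameter names (warp_side_to_virtual, table, cell_w, cell_h) are illustrative assumptions and do not appear in the application.

import numpy as np

def warp_side_to_virtual(side_img, table, cell_w=24, cell_h=18):
    # side_img: H x W x 3 side-view frame (left or right camera).
    # table[i, j]: (row, col) in side_img observed for projector-grid
    #              intersection (i, j) during calibration; NaN if unseen.
    n_rows = (table.shape[0] - 1) * cell_h
    n_cols = (table.shape[1] - 1) * cell_w
    out = np.zeros((n_rows, n_cols, side_img.shape[2]), dtype=side_img.dtype)
    for y in range(n_rows):
        i, ry = divmod(y, cell_h)
        fy = ry / cell_h
        for x in range(n_cols):
            j, rx = divmod(x, cell_w)
            fx = rx / cell_w
            # Bilinear blend of the four stored corner coordinates of the
            # cell (the table realizes the functions of Equations 1 and 2).
            p = ((1 - fy) * (1 - fx) * table[i, j]
                 + (1 - fy) * fx * table[i, j + 1]
                 + fy * (1 - fx) * table[i + 1, j]
                 + fy * fx * table[i + 1, j + 1])
            if np.isnan(p).any():
                continue  # cell not covered by the grid (face periphery)
            s, t = int(round(p[0])), int(round(p[1]))
            out[y, x] = side_img[s, t]
    return out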
[0083] In the case where convex mirrors, wide angle lenses, or
equivalent sensors are employed to capture offset views of user
faces, warping can still be accomplished in one step by making the
point correspondences. However, in the case of strong nonlinear
distortion, it is envisioned that bicubic interpolation may be
employed instead of bilinear interpolation. It is also envisioned
that subpixel coordinates and multiple pixel sampling can be used
in cases where the face texture changes fast or where the face
normal is away from the sensor direction.
[0084] Some implementation details are as follows. A rectangular
grid of dimension 400×400 is projected onto the face. The
grid is made by repeating three colored lines. White, green and
cyan colors proved useful because of their bright appearance over
the skin color. This combination of hues demonstrated good
performance over a wide variety of skin pigmentations. However, it
is envisioned that other hues may be employed. The first few frames
have the grid projected onto the face before the grid is turned
off. One of the frames with the grid is taken and the
transformation tables are generated. The size of the grid pattern
that is projected in the calibration stage plays a significant role
in the quality of the video. This size was decided based on the
trade-off between the quality of the video and execution time. An
appropriate grid size was chosen based on trial and error. The
trial and error process started by projecting a sparse grid pattern
onto the face and then increasing the density of the grid pattern.
At one point, the increase in the density did not significantly
improve the quality of the face image but consumed too much time.
At that point, the grid was finalized with a grid cell size of
row-width 24 pixels and column-width 18 pixels. FIG. 18 shows the
frames that are captured during the calibration stage of the
experiment. This calibration step is feasible for use in
collaborative rooms; however, it is envisioned that the calibration
is applicable to mobile users as well.
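As a non-limiting illustration of the pattern just described, the sketch below generates a 400×400 calibration image whose horizontal and vertical lines repeat through white, green and cyan at the stated cell spacing. The one-pixel line thickness and the dark background are assumptions, not specifics of the application.

import numpy as np

def make_calibration_grid(size=400, cell_w=24, cell_h=18):
    # Colors chosen in the experiments for their bright appearance on skin.
    colors = [(255, 255, 255), (0, 255, 0), (0, 255, 255)]  # white, green, cyan
    grid = np.zeros((size, size, 3), dtype=np.uint8)
    for k, y in enumerate(range(0, size, cell_h)):   # horizontal grid lines
        grid[y, :, :] = colors[k % 3]
    for k, x in enumerate(range(0, size, cell_w)):   # vertical grid lines
        grid[:, x, :] = colors[k % 3]
    return grid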
[0085] FIG. 19 shows the off-line calibration stage during the
synthesis of the virtual frontal view. Projector 1002 projects grid
pattern 1004 onto human face 1006. Grid lines reflect off of human
face 1006 to left and right mirrors 1008A and 1008B, and from the
mirrors to respective left and right cameras 1000A and 1000B.
Quadrilaterals of left and right calibration face images 1010A and
1010B are mapped to corresponding squares or rectangles of grid
pattern 1004 to form left and right transformation tables 1012. It
is envisioned that more than two side views can be used, and that
other polygonal shapes besides quadrilaterals may be employed.
Thus, a grid pattern of predetermined polygonal shapes is projected
onto the face from a virtual point of view, side view images of the
face are captured, and pixels enclosed by the polygons of captured
side view images are mapped back to corresponding predetermined
polygonal shapes of the grid pattern to form the transformation
tables. It is envisioned that the side view imaging arrays may be
integrated into a projection screen of the HMD, thus eliminating
the mirrors while retaining fixed positions respective of and
orientations toward sides of the user's face.
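A minimal sketch of how a left or right transformation table of FIG. 19 might be assembled is given below, assuming the colored grid intersections have already been located in the captured calibration image (the detection step itself is not shown). The dictionary input and the names used are illustrative assumptions rather than the authors' code.

import numpy as np

def build_transform_table(grid_corners):
    # grid_corners: {(i, j): (s, t)} mapping a projector-grid intersection
    # index to the (row, col) where it was found in the calibration image.
    n_i = 1 + max(i for i, _ in grid_corners)
    n_j = 1 + max(j for _, j in grid_corners)
    table = np.full((n_i, n_j, 2), np.nan)  # NaN marks unseen intersections
    for (i, j), (s, t) in grid_corners.items():
        table[i, j] = (s, t)
    return table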
[0086] Using the transformation tables 1012 generated in the
calibration phase, each virtual frontal frame is generated. The
algorithm reconstructs each (x, y) coordinate in the virtual view
by accessing the corresponding location in the transformation table
and retrieving the pixel in I_L (or I_R) using
interpolation. Then a 1D linear smoothing filter is used to smooth
the intensity across the vertical midline of the face. Without this
smoothing, a human viewer usually perceives a slight intensity edge
at the midline of the face.
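The 1D smoothing can be sketched as a small moving average applied along each row only within a narrow band around the vertical midline; the band width and kernel size below are illustrative assumptions rather than values given in the application.

import numpy as np

def smooth_midline(virtual_img, band=8, k=5):
    # Soften the intensity edge where the left-warped and right-warped
    # halves of the face meet, using a 1-D horizontal moving average
    # applied only within +/- band columns of the vertical midline.
    out = virtual_img.astype(np.float32).copy()
    src = virtual_img.astype(np.float32)
    mid = out.shape[1] // 2
    for x in range(mid - band, mid + band):
        lo, hi = x - k // 2, x + k // 2 + 1
        out[:, x] = src[:, lo:hi].mean(axis=1)
    return out.astype(virtual_img.dtype)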
[0087] FIG. 20 shows the complete block diagram of the operational
phase. Transformation tables 1012 are used to warp left and right
face images 1010A and 1010B into left warped face image 1014A and
right warped face image 1014B. These portions of the virtual output
image 1016 are then blended by mosaicking the face image. Post
processing to linearly smooth the image is performed to result in a
final virtual face image 1018. Since the transformation is based on
the bilinear interpolation technique, each pixel can be generated
only when it is inside four grid coordinate points. Because the
grid is not defined well at the periphery of the face, the
algorithm is unable to generate the ears and hair portion of the
face. The results of the warping during the calibration and
operation stages are shown in FIGS. 21 through 23.
[0088] Some other post-processing can be included. For example,
frames with a gridded pattern can be deleted from the final output:
these can be identified by a large shift in intensity when the
projected grid is switched off. Also, a microphone recording of the
voice of the user, stored in a separate .wav file, can be appended
to the video file to obtain a final output.
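The gridded frames can be identified automatically from the intensity shift mentioned above. A minimal sketch follows; the threshold value is an assumption, not taken from the text.

```python
import numpy as np

def first_grid_free_frame(frames, jump=15.0):
    """Return the index of the first frame after the projected grid is switched off.

    frames -- sequence of grayscale frames as numpy arrays
    jump   -- mean-intensity drop (in gray levels) treated as the grid turning off
              (assumed threshold)
    """
    means = [float(f.mean()) for f in frames]
    for i in range(1, len(means)):
        if means[i - 1] - means[i] > jump:
            return i
    return 0
```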
[0089] Color balancing of the cameras can also be performed. Although
software-based approaches to color balancing can be taken, the color
balancing in the present work is done at the hardware level. Before the
cameras are used for calibration, they are balanced using a
white-balancing technique: a single sheet of white paper is shown to both
cameras, and the cameras are white balanced immediately.
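The balancing described above is done in camera hardware; for completeness, a software white-patch balance over the same white sheet would look roughly like this. The region-of-interest argument and gain formula are assumptions for the sketch.

```python
import numpy as np

def white_balance(image, white_roi):
    """Scale each channel so the white reference region becomes neutral.

    image     -- H x W x 3 uint8 frame
    white_roi -- (r0, r1, c0, c1) bounding the white paper in the frame
    """
    r0, r1, c0, c1 = white_roi
    patch_mean = image[r0:r1, c0:c1].reshape(-1, 3).mean(axis=0)
    gains = patch_mean.max() / np.maximum(patch_mean, 1e-6)
    balanced = np.clip(image.astype(float) * gains, 0, 255)
    return balanced.astype(np.uint8)
```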
[0090] The virtual video of the face can be adequate to support the
communication of identity, mental state, gesture, and gaze
direction. Some objective comparisons between the synthesized and
real videos are reported below, plus a qualitative assessment.
[0091] The real video frames from the camcorder and the virtual
video frames were normalized to the same size of 200×200 and
compared using cross correlation and interpoint distances between
salient face features. Five images that were considered for
evaluation are shown in FIG. 23. Important items considered were
the smoothness and accuracy of lips and eyes and their movements,
the quality of the intensities, and the synchronization of the
audio and video. In particular, the flaws looked for were breaks at the
centerline of the face due to blending, as well as other distortions
that may have been caused by the sensing and warping process.
[0092] 1) Normalized cross-correlation: The cross correlation
between regions of the virtual image and real image was computed
for rectangular regions containing the eyes and mouth (FIG. 24). As
Table 1 shows, there was high correlation between the real and the
virtual images taken at the same instant of time. Frames 2 and 3
shown in FIG. 23 contain facial expressions (eye and lip movements)
that were quite different from the expression used during the
calibration stage, and the generated views gave slightly lower
correlation values than the other frames. The facial expressions in
the first and fourth frames were similar to the expression in the
calibration frame; hence, these frames have higher correlation
values than the rest. The
eye and lip regions were considered for evaluating the system
because during any facial movement, these regions change
significantly and are more important in communication.
TABLE 1 -- Results of Normalized Cross-Correlation Between the Real and the
Virtual Frontal Views Applied in Regions Around the Eyes and Mouth

  video     left eye   right eye   mouth   eyes + mouth   complete
  Frame 1   0.988      0.987       0.993   0.989          0.989
  Frame 2   0.969      0.972       0.985   0.978          0.985
  Frame 3   0.969      0.967       0.992   0.978          0.986
  Frame 4   0.991      0.989       0.993   0.990          0.990
  Frame 5   0.985      0.986       0.992   0.988          0.989
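For reference, a minimal sketch of the normalized cross-correlation used for Table 1 is shown below; the example rectangle is illustrative and the actual regions are those shown in FIG. 24.

```python
import numpy as np

def normalized_cross_correlation(a, b):
    """Normalized cross-correlation of two same-size image regions."""
    a = a.astype(float).ravel()
    b = b.astype(float).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

# Example: correlation over a mouth region of the real and virtual frames
# (the rectangle coordinates are hypothetical):
# score = normalized_cross_correlation(real[120:160, 70:130], virtual[120:160, 70:130])
```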
[0093] 2) Euclidean distance measure: The differences in the
normalized Euclidean distances between some of the most prominent
feature points were computed. The feature points are chosen such that
one point of each pair is relatively static with respect to the other.
Among prominent feature points such as the corners of the eyes, the
nose tip, and the corners of the mouth, the corners of the eyes are
relatively static compared with the corners of the mouth. FIG.
25 shows the most prominent facial feature points and the distances
between those points. Let R.sub.ij represent the Euclidean distance
between two feature points i and j in the real frontal image and
V.sub.ij represent the Euclidean distance between two feature
points in the virtual frontal image. The difference in the
Euclidean distance is
D_ij = |R_ij - V_ij|. The average error ε for comparing the face images is
defined by

    ε = (1/6)[D_af + D_bf + D_cf + D_cg + D_dg + D_eg].
TABLE 2 -- Euclidean Distance Measurement of the Prominent Facial Distances
in the Real Image and Virtual Image and the Defined Average Error.
All Dimensions are in Pixels.

  Frames    D_af   D_bf   D_cf   D_cg   D_dg   D_eg   Error (ε)
  Frame 1   2.00   0.80   4.15   3.49   2.95   3.46   2.80
  Frame 2   0.59   3.00   0.79   4.91   0.63   0.80   1.79
  Frame 3   1.88   3.84   4.29   4.34   2.68   1.83   3.14
  Frame 4   1.09   2.97   2.10   6.33   3.01   4.08   3.36
  Frame 5   1.62   2.21   5.57   4.99   1.24   1.90   2.92
[0094] The results in Table 2 indicate small errors in the
Euclidean distance measurements, on the order of 3 pixels in an
image of size 200×200. The facial feature points in the five
frames were selected manually, so part of the error may be due to
the instability of manual selection. One can note that the error
values of D_cf and D_cg are larger than the others, probably
because the nose tip is not located as robustly as the eye corners.
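A minimal sketch of the average-error computation defined above follows. The pairing of labels to specific facial features is an assumption here; the actual assignment is that shown in FIG. 25.

```python
import numpy as np

# Feature-point pairs corresponding to D_af, D_bf, D_cf, D_cg, D_dg, D_eg.
PAIRS = [("a", "f"), ("b", "f"), ("c", "f"), ("c", "g"), ("d", "g"), ("e", "g")]

def average_error(real_pts, virtual_pts):
    """Average of |R_ij - V_ij| over the six feature-point pairs.

    real_pts, virtual_pts -- dicts mapping a feature label to an (x, y) position
    """
    diffs = []
    for i, j in PAIRS:
        r_ij = np.linalg.norm(np.subtract(real_pts[i], real_pts[j]))
        v_ij = np.linalg.norm(np.subtract(virtual_pts[i], virtual_pts[j]))
        diffs.append(abs(r_ij - v_ij))
    return sum(diffs) / len(diffs)
```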
[0095] A preliminary subjective study was also performed. In
general, the quality of the videos was assessed as adequate to
support the variety of intended applications. The two halves of all
the videos are well synchronized and color balanced. The quality of
the audio is good and it has been synchronized well with the lip
movements. Some observed problems were distortion in the eyes and
teeth and in some cases a cross-eyed appearance. The face appears
slightly bulged compared with the real videos, which is probably
due to the combined radial distortions of the camera and projector
lenses.
[0096] Synchronization of the two videos is important in this
application. Since two views of a face with lip movements are merged
together, even small synchronization errors produce noticeable
misalignment of the lips. This synchronization
was evaluated based on sensitive movements such as eyeball
movements and blinking eyelids. Similarly, mouth movements were
examined in the virtual videos. FIGS. 26 to 27 show some of these
effects.
[0097] Analysis indicates that a real-time mobile system is
feasible. The total computation time consists of (1) transferring
the images into buffers, (2) warping by interpolating each of the
grid blocks, and (3) linearly smoothing each output image. The
average time is about 60 ms per frame using a 746 MHz computer.
Less than 30 ms would be considered real-time; this can be
achieved with a current computer with a clock rate of 2.6 GHz. Some
implementations can require more power to mosaic training data into
the video to account for features occluded from the cameras.
[0098] It can be concluded that the algorithm being used can be
made to work in real-time. The working prototype has been tested on
a diverse set of seven individuals. From comparisons of the virtual
videos with real videos, it is expected that important facial
expressions will be represented adequately and not distorted by
more than 2%. Thus, the HMD system implementing the image
processing software of the present invention can support the
intended telecommunication applications.
[0099] It is envisioned that calibration using a projected grid can
be used with the algorithms described above. 3D texture-mapped face
models can also be created by calibrating the cameras and projector
in the WCS. 3D models present the opportunity for greater
compression of the signal and for arbitrary frontal viewpoints,
which are desired for virtual face-to-face collaboration. Although
technically feasible, structured light projection is an obtrusive
step in the process and may be cumbersome in the field. Thus, a
generic mesh model of the face can also be employed.
[0100] There is a problem due to occlusion in the blending of the
two side images. Some facial surface points that should be
displayed in the frontal image are not visible in the side images.
For example, the two cameras cannot see the back of the mouth. It
is envisioned that training data may be taken from a user and
patched into the synthetic video, either for that user or for
another, similar user. During training, the user can generate a
basis for all possible future output material. The system can
contain methods to index to the right material and blend it with
the regular warped output. A related problem is that facial
deformations that make significant alterations to the face surface
may not be rendered well by the static warp. Examples are tongue
thrusts and severe facial distortions. The static warp algorithm
achieves good results for moderate facial distortion: it does not
crash when severe cases are encountered, but the virtual video can
show discontinuities in important facial features.
[0101] Other embodiments of the present invention employ a 3D model
as described below. The 3D modeling embodiments include one or more
of the following: (a) a calibration method that does not depend
upon structured light, (b) an output format that is a dynamic 3D
model rather than just a 2D video, and (c) a real-time tracking
method that identifies salient face points in the two side videos
and updates both the 3D structure and the texture of the 3D model
accordingly.
[0102] The 3D face model can be represented by a closed mesh of n
points (x_i, y_i, z_i), i = 1, ..., n, and a texture map. This model can be
rendered rapidly by standard graphics software and displayed by
standard graphics cards. The mesh point 3D coordinates are
available for a generic face. Scaling and deformation
transformations can be used to instantiate this model for an
individual wearing the Face Capture Head Mounted Display Units
(FCHMDs). The model can be viewed/rendered from a general viewpoint
within the coverage of the cameras and not just from the central
point in front of the face. Triangles of the mesh can be
texture-mapped to the sensed images and to other stored face data
that may be needed to fill in for unimaged patches.
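A minimal sketch of such a mesh plus texture representation is shown below. The field names, the left/right source flag, and the scaling helper are illustrative; the patent does not prescribe a particular data layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FaceMesh:
    """Closed mesh of n 3D points plus texture references, as described above."""
    points: List[Tuple[float, float, float]]            # (x_i, y_i, z_i), i = 1..n
    triangles: List[Tuple[int, int, int]]                # vertex indices of each face
    # For each triangle: which side image it samples (0 = left, 1 = right)
    # and the 2D image coordinates of its three vertices in that image.
    texture_source: List[int] = field(default_factory=list)
    texture_uv: List[Tuple[Tuple[float, float], ...]] = field(default_factory=list)

def scale_and_deform(mesh: FaceMesh, scale: Tuple[float, float, float]) -> FaceMesh:
    """Apply a per-axis scaling, the simplest of the instantiation transforms."""
    sx, sy, sz = scale
    mesh.points = [(x * sx, y * sy, z * sz) for x, y, z in mesh.points]
    return mesh
```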
[0103] The 3D face model can be instantiated to fit a specific
individual by one or more of the following: (1) choosing special
points by hand on a digital frontal and profile photo; (2) choosing
special points from the two side video frames of a neutral
expression taken from the FCHMD, and enabling the wearer to make
adjustments while viewing the resulting rendered 3D model.
[0104] In some embodiments, standard rendering of the face model
requires one or more of the following: (1) the set of triangles
modeling the 3D geometry; (2) the two side images from the FCHMD;
(3) a mapping of all vertices of each 3D triangle into the 2D
coordinate space of one of the side images; (4) a viewpoint from
which to view the 3D model; and (5) a lighting model that
determines how the 3D model is illuminated.
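Item (3) above, the mapping of triangle vertices into one of the side images, can be sketched as choosing the camera each triangle faces and projecting its vertices with that camera. The projection callables and the left/right visibility test below are assumptions, not the patent's prescribed method.

```python
import numpy as np

def assign_triangle_texture(mesh_points, triangles, project_left, project_right):
    """For each triangle, pick the side image it faces and project its vertices.

    project_left / project_right -- callables mapping a 3D point to (u, v) image
    coordinates for the respective camera (assumed to come from calibration).
    Returns per-triangle source flags and 2D vertex coordinates.
    """
    sources, uvs = [], []
    for tri in triangles:
        p0, p1, p2 = (np.asarray(mesh_points[i], dtype=float) for i in tri)
        normal = np.cross(p1 - p0, p2 - p0)
        # Convention (assumed): +x points toward the user's left; triangles
        # facing left are textured from the left camera, the rest from the right.
        if normal[0] > 0:
            sources.append(0)
            uvs.append([project_left(p) for p in (p0, p1, p2)])
        else:
            sources.append(1)
            uvs.append([project_right(p) for p in (p0, p1, p2)])
    return sources, uvs
```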
[0105] FIG. 28 illustrates the identification of some feature
points in a side image and a set of triangles formed using the
feature points as vertices. These triangles serve as bounding
polygons for regions to be texture mapped to corresponding
polygonally bounded regions of a generic mesh model. On a
frame-by-frame basis, the generic mesh model used is selected to maximize
similarity between the feature points automatically recognized in
the side view image and feature points of the mesh model as if
viewed from the side. In some embodiments, scaling and deformation
transformations already obtained for causing the generic mesh model
to fit a particular user are next used to modify texture mapping of
the generic mesh model to the side view images. Then, the resulting
3D model of the user's face can be rendered from a selected virtual
point of view to result in an output image. Accordingly, input
video streams of side view images can be used in real time to
produce a video stream of output images from a virtual point of
view.
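The frame-by-frame selection of a generic mesh model can be sketched as choosing the candidate whose side-view feature points best match the points detected in the current frame. The similarity measure below (sum of squared distances after removing the mean offset) is an assumption used only for illustration.

```python
import numpy as np

def best_matching_mesh(detected_points, candidate_meshes):
    """Pick the generic mesh whose side-view feature points best match detection.

    detected_points  -- k x 2 array of feature points found in the side image
    candidate_meshes -- list of k x 2 arrays, each a mesh's feature points as
                        seen from the side (the projection is assumed given)
    """
    detected = np.asarray(detected_points, dtype=float)
    detected -= detected.mean(axis=0)
    best, best_err = None, np.inf
    for idx, mesh_pts in enumerate(candidate_meshes):
        pts = np.asarray(mesh_pts, dtype=float)
        pts -= pts.mean(axis=0)
        err = float(((detected - pts) ** 2).sum())
        if err < best_err:
            best, best_err = idx, err
    return best
```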
[0106] It is envisioned that users communicating with one another
may each wear a FCHMD, and that the FCHMD can operate in a variety
of ways. For example, side views of a first user's face can be
transmitted to the second user's FCHMD, where they can be warped
and blended to produce the 3D model, which is then rendered from a
selected perspective to produce the output image. Also, the first
user's FCHMD can warp and blend the side views to produce the 3D
model, and transmit the 3D model to the second user's FCHMD where
it can be rendered from a selected perspective to produce the
output image. Further, the first user's FCHMD can warp and blend
the side views, render the resulting 3D model from a selected
perspective to produce the output image, and transmit the output
image to the second user's FCHMD. Yet further, an external image
processing module external to the FCHMDs can perform some or all of
the steps necessary to produce the output image from the side
views. Further still, this external image processing module can be
remotely located on a communications network, rather than
physically located with one or more of the users.
Accordingly, a FCHMD may be adapted to transmit to a remote
location and/or receive from a remote location at least one of the
following: (1) side view images; (2) user-specific scaling and
deformation transformations; (3) position of a user's face in a
common coordinate system of a collaborative, virtual environment;
(4) a 3D model of a user's face; (5) a selection of a virtual point
of view from which to render a user's face; and (6) an output
image. Supplemental image data obtained from a particular user or
from training users can also be transmitted or received, and can
even be integrated into the generic mesh models ahead of time.
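The transmit/receive options above can be summarized as a small vocabulary of data types; the enumeration below is only a sketch, and the names are illustrative rather than part of any defined protocol.

```python
from enum import Enum, auto

class FCHMDPayload(Enum):
    """Kinds of data a FCHMD may transmit or receive (names are illustrative)."""
    SIDE_VIEW_IMAGES = auto()
    SCALING_DEFORMATION_TRANSFORMS = auto()
    FACE_POSITION = auto()          # position in the shared collaborative coordinate system
    FACE_MODEL_3D = auto()
    VIRTUAL_VIEWPOINT = auto()
    OUTPUT_IMAGE = auto()
    SUPPLEMENTAL_TRAINING_DATA = auto()
```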
[0107] It should be readily understood that the FCHMD does not have
to transmit or receive one or more of each of the types of data
listed above. For example, it is possible that an FCHMD may only
transmit and receive output images. It is also possible that an
FCHMD may transmit and receive only two data types, including
output images together with position of a user's face in a common
coordinate system of a collaborative, virtual environment. It is
further possible that an FCHMD will transmit and receive only side
view images. It is still further possible that an FCHMD will
transmit and receive only two data types, including side view
images, together with position of a user's face in a common
coordinate system of a collaborative, virtual environment. It is
yet further possible that an FCHMD will transmit and receive only
3D models of users' faces. It is still yet further possible that an
FCHMD will transmit and receive only two data types, including 3D
models of users' faces, together with position of a user's face in
a common coordinate system of a collaborative, virtual environment.
In the cases where 3D models or side view images are transmitted
and received, it may be the case that user-specific scaling and
deformation transformations are transmitted and received at some
point, perhaps during an initialization of collaboration. It is
additionally possible that one FCHMD can do most or all of the work
for both FCHMDs, and receive side view images and face position
data for a first user while transmitting output images or a 3D
model for a second user. Accordingly, all of these embodiments and
others that will be readily apparent to those skilled in the art
are described above.
[0108] During operation, the FCHMD optics/electronics of some
embodiments can sense in real time the real expressive face of the
wearer from the two side videos, and the software can create in
real time an active 3D face model to be transmitted to remote
collaborators.
[0109] The morphable model is trained for dynamic use on a
population of users. A diverse set of training users may wear the
FCHMD and follow a script that induces a variety of facial
expressions, while frontal video is also recorded. This training
set can support salient point tracking and also the substitution of
real data for viewpoints that cannot be observed by the side
cameras (inside the mouth, for example). Moreover, the training
videos can record sequences of articulator movements that can be
used during online FCHMD use.
[0110] Let S be a set of shape vectors composed of the face surface
points and a corresponding set T of texture vectors.
    S_j = (x_1, y_1, z_1, . . . , x_n, y_n, z_n)   (1)

    T_j = (r_1, g_1, b_1, . . . , r_n, g_n, b_n)   (2)
[0111] The shape points contain, as a subset, the salient points of
the shape mesh. Training the model can be accomplished by hand
labeling of the mesh points for a diverse set of faces and
multiframe video recording followed by principal components
analysis to obtain a minimum spanning dimensionality.
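A minimal sketch of the principal components step is shown below; the variance cutoff and the SVD-based formulation are assumptions about how the minimum spanning dimensionality might be chosen.

```python
import numpy as np

def shape_basis(shape_vectors, variance_kept=0.98):
    """Principal components of hand-labeled shape vectors S_j.

    shape_vectors -- M x 3n array, one flattened (x, y, z) mesh per training face
    variance_kept -- fraction of variance to retain (assumed cutoff)
    Returns the mean shape and the retained component directions.
    """
    S = np.asarray(shape_vectors, dtype=float)
    mean = S.mean(axis=0)
    _, sing, vt = np.linalg.svd(S - mean, full_matrices=False)
    explained = np.cumsum(sing ** 2) / np.sum(sing ** 2)
    k = int(np.searchsorted(explained, variance_kept)) + 1
    return mean, vt[:k]
```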
[0112] Any face S_p, T_p in the population can be represented as
S_p = Σ_{j=1..M} a_j S_j and T_p = Σ_{j=1..M} b_j T_j, with
Σ_{j=1..M} a_j = 1 and Σ_{j=1..M} b_j = 1.
The parameters a_j, b_j represent the face p in terms of
the training faces and the new illumination conditions and possibly
slight variation in the camera view.
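A minimal sketch of this linear combination follows; the array layout mirrors the shape and texture vectors defined in equations (1) and (2).

```python
import numpy as np

def synthesize_face(a, b, shape_vectors, texture_vectors):
    """Combine training faces into S_p = sum_j a_j S_j and T_p = sum_j b_j T_j.

    a, b            -- length-M weight vectors, each expected to sum to 1
    shape_vectors   -- M x 3n array of training shapes S_j
    texture_vectors -- M x 3n array of training textures T_j
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    S_p = a @ np.asarray(shape_vectors, dtype=float)
    T_p = b @ np.asarray(texture_vectors, dtype=float)
    return S_p, T_p
```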
[0113] Tracking of salient feature points can be accomplished to
dynamically change the transformation tables and achieve a dynamic
model. The parameters of the model a_j, b_j can be dynamically fit by
optimizing the similarity between a model rendered using these
parameters and the observed images:

    E(a_j, b_j) = Σ_{x,y} || I_observed[x, y] - I_rendered[x, y] ||   (3)
[0114] Fitting via hill-climbing is one designated optimization
procedure in some embodiments so that small dynamic updates can be
made to the model parameters for the next observed side video
frames.
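A minimal sketch of such a hill-climbing update is shown below. The coordinate-wise step, the squared-difference error, and the `render` callable (which is assumed to wrap the textured 3D model renderer) are illustrative assumptions.

```python
import numpy as np

def hill_climb_fit(params, render, observed, step=0.01, iterations=50):
    """Small coordinate-wise hill-climbing updates of the model parameters.

    params   -- 1D array of the a_j, b_j coefficients from the previous frame
    render   -- callable mapping a parameter vector to a synthesized image
    observed -- the current side-video frame to match
    """
    params = np.asarray(params, dtype=float).copy()

    def error(p):
        diff = render(p).astype(float) - observed.astype(float)
        return float((diff ** 2).sum())

    best = error(params)
    for _ in range(iterations):
        improved = False
        for k in range(params.size):
            for delta in (step, -step):
                trial = params.copy()
                trial[k] += delta
                e = error(trial)
                if e < best:
                    params, best, improved = trial, e, True
        if not improved:
            break
    return params
```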
[0115] The FCHMD can be calibrated by finding the optimal fit
between a parameterized model and the video data currently observed
on the FCHMD. Once this fit is known, locations of the salient mesh
points (X_k, Y_k, Z_k) are known and thus a texture map
is defined between the 3D mesh and the 2D images for that instant
of time (current expression). Since iterative hill-climbing is used
for the fitting procedure, it is expected that either some
intelligent guess or some hand selection will be needed to
initialize the fitting. A fully automatic procedure can be
initialized from an average wearer's face determined from the
training data. The control software for the FCHMD can also have a
backup procedure in which the HMD wearer initializes the fitting by
viewing the video images and manually selecting some salient face
points.
[0116] The description of the invention is merely exemplary in
nature and, thus, variations that do not depart from the gist of
the invention are intended to be within the scope of the invention.
Such variations are not to be regarded as a departure from the
spirit and scope of the invention.
* * * * *