U.S. patent application number 13/602097 was filed with the patent office on 2012-08-31 and published on 2013-12-05 for perspective-correct communication window with motion parallax.
This patent application is currently assigned to Microsoft Corporation. The applicants listed for this patent are Christian F. Huitema, Eric G. Lang, Yancey Christopher Smith, and Zhengyou Zhang. Invention is credited to Christian F. Huitema, Eric G. Lang, Yancey Christopher Smith, and Zhengyou Zhang.
United States Patent Application | 20130321564
Kind Code | A1
Application Number | 13/602097
Family ID | 60922750
Filed | August 31, 2012
Published | December 5, 2013
Smith; Yancey Christopher; et al.
PERSPECTIVE-CORRECT COMMUNICATION WINDOW WITH MOTION PARALLAX
Abstract
A perspective-correct communication window system and method for
communicating between participants in an online meeting, where the
participants are not in the same physical locations. Embodiments of
the system and method provide an in-person communications
experience by changing virtual viewpoint for the participants when
they are viewing the online meeting. The participant sees a
different perspective displayed on a monitor based on the location
of the participant's eyes. Embodiments of the system and method
include a capture and creation component that is used to capture
visual data about each participant and create a realistic geometric
proxy from the data. A scene geometry component is used to create a
virtual scene geometry that mimics the arrangement of an in-person
meeting. A virtual viewpoint component displays the changing
virtual viewpoint to the viewer and can add perceived depth using
motion parallax.
Inventors: Smith; Yancey Christopher (Kirkland, WA); Lang; Eric G. (Yarrow Point, WA); Zhang; Zhengyou (Bellevue, WA); Huitema; Christian F. (Clyde Hill, WA)
Applicant:
Name | City | State | Country
Smith; Yancey Christopher | Kirkland | WA | US
Lang; Eric G. | Yarrow Point | WA | US
Zhang; Zhengyou | Bellevue | WA | US
Huitema; Christian F. | Clyde Hill | WA | US
Assignee: Microsoft Corporation (Redmond, WA)
Family ID: 60922750
Appl. No.: 13/602097
Filed: August 31, 2012
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number
61653983 | May 31, 2012 |
Current U.S. Class: 348/14.08; 348/E7.077
Current CPC Class: G06T 15/04 20130101; H04R 2227/005 20130101; H04N 13/243 20180501; H04N 13/194 20180501; G06T 2210/56 20130101; G06T 17/00 20130101; H04N 7/157 20130101; G06T 15/08 20130101; H04N 7/142 20130101; H04N 13/257 20180501; H04N 13/239 20180501; H04S 2400/15 20130101; H04N 13/117 20180501; H04N 13/246 20180501; H04N 7/15 20130101; G06T 15/205 20130101
Class at Publication: 348/14.08; 348/E07.077
International Class: H04N 7/14 20060101 H04N007/14
Claims
1. A method for communicating with a remote participant in a
meeting, comprising: creating a geometric proxy for each
participant in the meeting including a local participant and the
remote participant, where the local participant and the remote
participant are in different physical locations; rendering the
geometric proxy for each participant in a scene geometry that is
consistent with an in-person conversation; transmitting each
rendered geometric proxy and the scene geometry to each
participant; displaying the geometric proxy for the remote
participant and an associated background to the local participant;
and changing a virtual viewpoint of the geometric proxy for the
remote participant and the associated background based on an
orientation of the local participant's face.
2. The method of claim 1, further comprising using a face tracking
technique to track the orientation of the local participant's
face.
3. The method of claim 2, further comprising keeping the virtual
viewpoint level with an eye gaze of the local participant using the
face tracking technique.
4. The method of claim 1, further comprising capturing images of
each of the participants in the meeting using a plurality of camera
pods.
5. The method of claim 4, further comprising creating the virtual
viewpoint using a virtual camera that is a composition of images
from at least two of the plurality of camera pods.
6. The method of claim 4, further comprising capturing RGB data and
depth information using the plurality of camera pods.
7. The method of claim 6, further comprising creating the geometric
proxy for each participant by adding the RGB data to the depth
information.
8. The method of claim 1, further comprising: determining a number
of participants in the meeting; and generating the scene geometry
based on the number of participants to simulate an in-person
conversation between the participants.
9. The method of claim 8, further comprising: using virtual boxes
to ensure that eye gaze and conversational geometry between the
participants are correct, and that the conversational geometry
looks correct to the other participants so that the local
participant can correctly see the other participants; and
determining a number of virtual boxes to use based on the number of
participants.
10. The method of claim 9, further comprising: determining that
there are two participants in the meeting; and creating a first
virtual box and a second virtual box that are facing each
other.
11. The method of claim 9, further comprising: determining that
there are three participants in the meeting; and creating a first
virtual box, a second virtual box, and a third virtual box that are
placed around a virtual round table in an equidistant manner.
12. The method of claim 1, further comprising adding an illusion of
depth to the virtual viewpoint using motion parallax.
13. The method of claim 12, further comprising using a face
tracking technique to shift the virtual viewpoint as the local
participant's head moves.
14. A method for changing a virtual viewpoint of a local
participant in an online conference, comprising: capturing images
of the local participant to obtain captured information; creating a
geometric proxy of the local participant using the captured
information; determining a number of participants in the online
conference; generating scene geometry based on the number of
participants to simulate being in an in-person conversation with
other participants in the online conference; rendering the
geometric proxy to geometric proxies of the other participants in
the scene geometry; transmitting the rendered geometric proxies and
scene geometry to the local participant; displaying on a monitor to
the local participant a rendered geometric proxy of a remote
participant that is in a different physical location than the local
participant; and changing the virtual viewpoint of the local
participant based on a position of the local participant's
face.
15. The method of claim 14, further comprising adding depth to the
virtual viewpoint using motion parallax.
16. The method of claim 14, further comprising generating the
virtual viewpoint using a virtual camera that is the composition of
images captured by a plurality of camera pods each having a
different view of a scene.
17. The method of claim 16, further comprising: tracking the
position of the local participant's face using a face tracking
technique; and keeping the local participant facing at least some
of the other participants using the position of the local
participant's face to ensure that the local participant and at
least one of the other participants are always looking straight at
each other.
18. A method for creating an in-person communication experience
between participants in an online meeting, comprising: arranging a
plurality of camera pods around a monitor that a local participant
is viewing; capturing RGB data and depth information of the local
participant from the plurality of camera pods; creating a geometric
proxy for the local participant by adding the RGB data and the depth
information; generating scene geometry based on a number of
participants in the online meeting, including the local
participant; rendering the geometric proxy for the local
participant and geometric proxies for each of the other
participants in the online meeting to each other in the scene
geometry so that the scene geometry is consistent with an in-person
conversation; transmitting the rendered geometric proxies and scene
geometry to each of the participants; displaying a virtual
viewpoint to the local participant that includes a rendered
geometric proxy for each of the participants along with a
background that is part of the scene geometry; changing the virtual
viewpoint based on a position of the local participant's face such
that the local participant's view of other participants and the
background is dependent on an orientation of the local
participant's face; and adding an illusion of depth to the virtual
viewpoint using motion parallax to create the in-person
communication experience between the participants in the online
meeting.
19. The method of claim 18, further comprising: having the local
participant and another participant at the same physical location
and viewing the monitor, where the monitor has a lenticular display
that allows multiple viewing angles; and having the local
participant view the monitor from the right side to obtain a right
field-of-view and having the other participant view the monitor
from the left side to obtain a left field-of-view, the right
field-of-view and the left field-of-view being different.
20. The method of claim 18, further comprising: having the local
participant and another participant at the same physical location
and viewing the monitor; using a face tracking technique to track
the local participant and the other participant; and providing
different views on the monitor to the local participant and the
other participant based on an orientation of their faces.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of and priority to
Provisional U.S. Patent Application Ser. No. 61/653,983, filed May
31, 2012.
BACKGROUND
[0002] Current video conferencing technology typically uses a
single camera to capture RGB data (from the red, green, and blue
(RGB) color model) of a local scene. This local scene typically
includes the people that are participating in the video conference,
or meeting participants. The data then is transmitted in real time
to a remote location and displayed to another meeting participant
who is in a different physical location.
[0003] While advances have been made in video conferencing
technology that help provide a higher definition capture,
compression, and transmission, typically the experience falls short
of recreating the face-to-face experience of an in-person
conference. One reason for this is that the typical video
conferencing experience lacks eye gaze and other correct
conversational geometry. For example, the person being captured
remotely typically is not looking into the viewer's eyes, as one
would experience in a face-to-face conversation, because that
person's eyes are looking at the screen rather than at the camera.
Moreover, three-dimensional (3D) elements
like motion parallax and image depth, as well as the freedom to
change perspective in the scene are lacking because there is only a
single, fixed video camera capturing the scene and the meeting
participants.
SUMMARY
[0004] This Summary is provided to introduce a selection of
concepts in a simplified form that are further described below in
the Detailed Description. This Summary is not intended to identify
key features or essential features of the claimed subject matter,
nor is it intended to be used to limit the scope of the claimed
subject matter.
[0005] Embodiments of the perspective-correct communication window
system and method provide a way to create an in-person
communications experience for participants in an online meeting or
conference. Embodiments of the system and method provide a correct
perspective of the participants using a changing virtual viewpoint
for the participants when they are viewing the online meeting. This
changing virtual viewpoint is dependent on the position and
orientation of the viewer's face or more particularly the viewer's
eyes. Depending on the position and orientation of the face, the
viewer sees a different perspective of the other participants in
the meeting as well as the background in the display.
[0006] Embodiments of the system and method generally include three
components. A first component, the capture and creation component,
is used to capture visual data about each participant and create a
realistic geometric proxy from the data. This geometric proxy is a
geometric representation of the participant that has real video
painted onto the geometric representation frame by frame in order
to increase the realism. Moreover, a geometric proxy is created for
each participant in the meeting. The data is captured using one or
more camera pods. In some embodiments these camera pods include a
stereoscopic infrared (IR) camera and an IR emitter (to capture
depth information) and an RGB camera (to capture RGB data). The
camera pod layout at each endpoint is variable and dependent on the
number of camera pods available at the endpoint. Each geometric
proxy is created using the RGB data and the depth information.
[0007] A second component is a scene geometry component, which is
used to create a virtual scene geometry that imitates the
arrangement of an in-person meeting. The scene geometry is
dependent on the number of participants in the meeting. Creating
the scene geometry includes both the registration of the
three-dimensional (3D) volume and the alignment of the 3D space
that the camera pods capture. The general idea of the scene
geometry component is to create relative geometry between the
meeting participants. The scene is aligned virtually to mimic a
real-life scene as if the participants are in the same physical
location and engaged in an in-person conversation.
[0008] The scene geometry uses virtual boxes to have relative,
consistent geometry between the participants. A meeting with two
participants (or a one-to-one (1:1) scene geometry) consists of two
boxes that occupy the spaces in front of the respective monitors
(not shown) of the two participants. When there are three
participants the scene geometry includes three virtual boxes that
are placed around a virtual round table in an equidistant
manner.
[0009] The scene geometry also includes a virtual camera. The
virtual camera is a composition of images from two or more of the
camera pods in order to obtain a camera view that is not captured
by any one camera pod alone. This allows embodiments of the system
and method to obtain a natural eye gaze and connection between
people. Face tracking (or, more specifically, eye tracking) is used
to improve performance by helping the virtual camera remain level
with the eye gaze of the viewer. In other words the face tracking
provides a correct virtual camera view that is aligned with the
viewer's eyes. This means that the virtual camera interacts with
the face tracking to create a virtual viewpoint that has the user
looking straight at the other participant.
[0010] Each geometric proxy is rendered relative to the others in
the scene geometry. The rendered geometric proxies and scene
geometry are then transmitted to each of the participants. The third
component is the virtual viewpoint component, which displays a
changing virtual viewpoint to the viewer based on the position and
orientation of the viewer's face. This motion parallax effect adds
realism to the scene displayed on the monitor. In addition, face
tracking can be used to track the position and orientation of the
viewer's face. What the viewer sees on the monitor in one facial
position and orientation is different from what the viewer sees in
another facial position and orientation.
[0011] Embodiments of the system and method also include
facilitating multiple participants at a single endpoint. An
endpoint means a location or environment containing one or more
participants of the conference or meeting. In some embodiments a
face tracking technique tracks two different faces and then
provides different views to different viewers. In other embodiments
glasses are worn by each of the multiple participants at the
endpoint and in some embodiments the glasses have active shutters
on them that show each wearer alternating frames displayed by the
monitor that are tuned to each pair of glasses. Other embodiments
use a monitor having multiple viewing angles such that a viewer
looking at the monitor from the right side sees one scene and
another viewer looking at the monitor from the left sees a
different scene.
[0012] It should be noted that alternative embodiments are
possible, and steps and elements discussed herein may be changed,
added, or eliminated, depending on the particular embodiment. These
alternative embodiments include alternative steps and alternative
elements that may be used, and structural changes that may be made,
without departing from the scope of the invention.
DRAWINGS DESCRIPTION
[0013] Referring now to the drawings in which like reference
numbers represent corresponding parts throughout:
[0014] FIG. 1 is a block diagram illustrating a general overview of
embodiments of the perspective-correct communication window system
and method implemented in a computing environment.
[0015] FIG. 2 is a block diagram illustrating the system details of
embodiments of the perspective-correct communication window system
and method shown in FIG. 1.
[0016] FIG. 3 illustrates a simplified example of a general-purpose
computer system on which various embodiments and elements of the
perspective-correct communication window system and method, as
described herein and shown in FIGS. 1, 2, and 4-12, may be
implemented.
[0017] FIG. 4 is a flow diagram illustrating the general operation
of embodiments of the perspective-correct communication window
system and method shown in FIGS. 1 and 2.
[0018] FIG. 5 is a block diagram illustrating the details of an
exemplary embodiment of a camera pod of embodiments of the
perspective-correct communication window system and method shown in
FIG. 1.
[0019] FIG. 6 illustrates an exemplary embodiment of a camera pod
layout (such as that shown in FIG. 2) using four camera pods.
[0020] FIG. 7 illustrates an exemplary overview of the creation of
a geometric proxy for a single meeting participant.
[0021] FIG. 8 illustrates an exemplary embodiment of the scene
geometry between participants when there are two participants in
the meeting (a 1:1 conference).
[0022] FIG. 9 illustrates an exemplary embodiment of the scene
geometry between participants when there are three participants in
the meeting (a 3-endpoint conference).
[0023] FIG. 10 illustrates an exemplary embodiment of a virtual
camera based on where the participant is looking.
[0024] FIG. 11 illustrates an exemplary embodiment of providing
depth through motion parallax based on where a viewer is
facing.
[0025] FIG. 12 illustrates an exemplary embodiment of a technique
to handle multiple participants at a single endpoint.
DETAILED DESCRIPTION
[0026] In the following description of the perspective-correct
communication window system and method, reference is made to the
accompanying drawings, which form a part thereof, and in which is
shown by way of illustration a specific example whereby embodiments
of the perspective-correct communication window system and method
may be practiced. It is to be understood that other embodiments may
be utilized and structural changes may be made without departing
from the scope of the claimed subject matter.
I. System Overview
[0027] Embodiments of the perspective-correct communication window
system and method provide a way to create an "in person"
communications experience for users. FIG. 1 is a block diagram
illustrating a general overview of embodiments of the
perspective-correct communication window system 100 and method
implemented in a computing environment. In particular, embodiments
of the system 100 and method are implemented on a computing device
110. This computing device may be a single computing device or may
be spread out over a plurality of devices. Moreover, the computing
device 110 may be virtually any device having a processor,
including a desktop computer, a tablet computing device, and an
embedded computing device.
[0028] As shown in FIG. 1, the computing environment includes a
first environment 120 and a second environment 125. In the first
environment 120, a first participant 130 is captured by a plurality
of first camera pods 135. It should be noted that four camera pods
are shown in FIG. 1, but fewer or more camera pods can be used.
Also as shown in FIG. 1, the first plurality of camera pods 135 are
shown attached to a first monitor 140, which is in communication
with the computing device 110. However, it should be noted that in
alternate embodiments the first plurality of camera pods 135 may be
mounted on some other structure or there may be some mounted on the
first monitor 140 and others mounted on other structures.
[0029] The first participant 130 is captured by the first plurality
of camera pods 135 and processed by embodiments of the
perspective-correct communication window system 100 and method, as
explained in detail below. This processed information is
transmitted across a network 150 using a first communication link
155 (from the first environment 120 to the network 150) and a
second communication link 160 (from the network 150 to the second
environment 125). In FIG. 1, embodiments of the system 100 and method
are shown residing on the network 150. However, it should be noted
that this is only one way in which the system 100 and method may be
implemented.
[0030] The transmitted processed information is received in the
second environment 125, processed by embodiments of the system 100
and method, and then displayed to a second participant 170 on a
second monitor 175. As shown in FIG. 1, the second monitor 175
contains a second plurality of camera pods 180 that are used to
capture the second participant 170. In addition, the second
plurality of camera pods 180 are used to track the eye gaze of the
second participant 170 and determine how the processed information
should be presented to the second participant 170. This is
explained in more detail below. Moreover, the first plurality of
camera pods 135 are also used to track the eye gaze of the first
participant 130 and determine how processed information should be
presented to the first participant 130. In alternate embodiments
eye gaze is tracked using some other device than a camera pod, such
as an external camera.
[0031] It should be noted that embodiments of the system 100 and
method work in both directions. In other words, the first
environment 120 can also receive transmissions from the second
environment 125 and the second environment 125 can also transmit
processed information. For pedagogical purposes, however, only the
transmission from the first environment 120 to the system 100 and
method and on to the second environment 125 is discussed above.
II. System Details
[0032] Embodiments of the system 100 and method include three main
components that work together to create that "in person"
communications experience. The first component is capturing and
creating a three-dimensional (3D) video image of each person
participating in the conference. The second component is creating
the relevant scene geometry based on the number of participants in
the conference. This component ensures that the resultant geometry
between virtual viewpoints (or windows) at the endpoints is the
same. And the third component is rendering and providing a virtual
view as if a camera were positioned at the perspective from which
the viewer is looking, thereby recreating the same scene geometry
participants would have when talking in person.
[0033] FIG. 2 is a block diagram illustrating the system details of
embodiments of the perspective-correct communication window system
100 and method shown in FIG. 1. As shown in FIG. 2, embodiments of
the system 100 and method include a capture and creation component
200, a scene geometry component 210, and a virtual viewpoint
component 220. The capture and creation component is used for
capturing and creating a 3D video image of the participant.
[0034] Specifically, the capture and creation component 200
includes a camera pod layout 230 that includes a plurality of
camera pods. The camera pod layout 230 is used to capture a
participant from multiple perspectives. Computer vision methods are
used to create a high-fidelity geometry proxy for each meeting
participant. As explained in detail below, this is achieved by
taking RGB data obtained from an RGB data collection module 235 and
depth information obtained and computed by a depth information
computation module 240. From this information a geometric proxy
creation module 245 creates a geometric proxy 250 for each
participant. Image-based rendering methods are used to create
photorealistic textures for the geometric proxy 250 such as with
view-dependent texture mapping.
[0035] The scene geometry component 210 is used to create the
correct scene geometry to simulate participants being together in a
real conversation. This scene geometry is dependent on the number
of participants (or endpoints) in the conference. A 3D registration
module 260 is used to obtain a precise registration of a monitor
with the camera pods. Moreover, a space alignment module 265 aligns
the orientation of the camera pods with the real world. For a 1:1
meeting (having two participants), this is simply the two physical
spaces lined up across from one another in the virtual environment.
The capture area that is being recreated for each participant is
the area in front of the monitor.
[0036] Once the textured geometric proxy 250 has been created for
each meeting participant and the participants are represented in a
3D virtual space that is related to the other participants in the
conference, the geometric proxies are rendered to each other in a
manner consistent with conversational geometry. Moreover, this
rendering is done based on the number of participants in the
conference. Virtual boxes are used to ensure that eye gaze and
conversational geometry between the participants are correct, and
that the conversational geometry looks correct to other
participants so that the viewer can correctly see them.
[0037] The geometric proxies and in some cases the registration and
alignment information are transmitted to remote participants by the
transmission module 270. The virtual viewpoint component 220 is
used to enhance the virtual view rendered to the remote
participants. The experience of "being there" is enhanced through
the use of a motion parallax module 280 that adds motion parallax
and depth to the scene behind the participants. Horizontal and
lateral movements by either participant change the viewpoint shown
on their local displays and the participant sees the scene they are
viewing, and the person in it, from a different perspective. This
greatly enhances the experience of the meeting participants.
III. Exemplary Operating Environment
[0038] Before proceeding further with the operational overview and
details of embodiments of the perspective-correct communication
window system and method, a discussion will now be presented of an
exemplary operating environment in which embodiments of the
perspective-correct communication window system 100 and method may
operate. Embodiments of the perspective-correct communication
window system 100 and method described herein are operational
within numerous types of general purpose or special purpose
computing system environments or configurations.
[0039] FIG. 3 illustrates a simplified example of a general-purpose
computer system on which various embodiments and elements of the
perspective-correct communication window system 100 and method, as
described herein and shown in FIGS. 1, 2, and 4-12, may be
implemented. It should be noted that any boxes that are represented
by broken or dashed lines in FIG. 3 represent alternate embodiments
of the simplified computing device, and that any or all of these
alternate embodiments, as described below, may be used in
combination with other alternate embodiments that are described
throughout this document.
[0040] For example, FIG. 3 shows a general system diagram showing a
simplified computing device 10. The simplified computing device 10
may be a simplified version of the computing device 110 shown in
FIG. 1. Such computing devices can typically be found in devices
having at least some minimum computational capability, including,
but not limited to, personal computers, server computers, hand-held
computing devices, laptop or mobile computers, communications
devices such as cell phones and PDA's, multiprocessor systems,
microprocessor-based systems, set top boxes, programmable consumer
electronics, network PCs, minicomputers, mainframe computers, audio
or video media players, etc.
[0041] To allow a device to implement embodiments of the
perspective-correct communication window system 100 and method
described herein, the device should have a sufficient computational
capability and system memory to enable basic computational
operations. In particular, as illustrated by FIG. 3, the
computational capability is generally illustrated by one or more
processing unit(s) 12, and may also include one or more GPUs 14,
either or both in communication with system memory 16. Note that
the processing unit(s) 12 of the general computing device may be
specialized microprocessors, such as a DSP, a VLIW, or other
micro-controller, or can be conventional CPUs having one or more
processing cores, including specialized GPU-based cores in a
multi-core CPU.
[0042] In addition, the simplified computing device 10 of FIG. 3
may also include other components, such as, for example, a
communications interface 18. The simplified computing device 10 of
FIG. 3 may also include one or more conventional computer input
devices 20 (such as styli, pointing devices, keyboards, audio input
devices, video input devices, haptic input devices, devices for
receiving wired or wireless data transmissions, etc.). The
simplified computing device 10 of FIG. 3 may also include other
optional components, such as, for example, one or more conventional
computer output devices 22 (e.g., display device(s) 24, audio
output devices, video output devices, devices for transmitting
wired or wireless data transmissions, etc.). Note that typical
communications interfaces 18, input devices 20, output devices 22,
and storage devices 26 for general-purpose computers are well known
to those skilled in the art, and will not be described in detail
herein.
[0043] The simplified computing device 10 of FIG. 3 may also
include a variety of computer readable media. Computer readable
media can be any available media that can be accessed by the
simplified computing device 10 via storage devices 26 and includes
both volatile and nonvolatile media that is either removable 28
and/or non-removable 30, for storage of information such as
computer-readable or computer-executable instructions, data
structures, program modules, or other data. By way of example, and
not limitation, computer readable media may comprise computer
storage media and communication media. Computer storage media
includes, but is not limited to, computer or machine readable media
or storage devices such as DVD's, CD's, floppy disks, tape drives,
hard drives, optical drives, solid state memory devices, RAM, ROM,
EEPROM, flash memory or other memory technology, magnetic
cassettes, magnetic tapes, magnetic disk storage, or other magnetic
storage devices, or any other device which can be used to store the
desired information and which can be accessed by one or more
computing devices.
[0044] Retention of information such as computer-readable or
computer-executable instructions, data structures, program modules,
etc., can also be accomplished by using any of a variety of the
aforementioned communication media to encode one or more modulated
data signals or carrier waves, or other transport mechanisms or
communications protocols, and includes any wired or wireless
information delivery mechanism. Note that the terms "modulated data
signal" or "carrier wave" generally refer to a signal that has one
or more of its characteristics set or changed in such a manner as
to encode information in the signal. For example, communication
media includes wired media such as a wired network or direct-wired
connection carrying one or more modulated data signals, and
wireless media such as acoustic, RF, infrared, laser, and other
wireless media for transmitting and/or receiving one or more
modulated data signals or carrier waves. Combinations of any of
the above should also be included within the scope of communication
media.
[0045] Further, software, programs, and/or computer program
products embodying some or all of the various embodiments of
the perspective-correct communication window system 100 and method
described herein, or portions thereof, may be stored, received,
transmitted, or read from any desired combination of computer or
machine readable media or storage devices and communication media
in the form of computer executable instructions or other data
structures.
[0046] Finally, embodiments of the perspective-correct
communication window system 100 and method described herein may be further
described in the general context of computer-executable
instructions, such as program modules, being executed by a
computing device. Generally, program modules include routines,
programs, objects, components, data structures, etc., that perform
particular tasks or implement particular abstract data types. The
embodiments described herein may also be practiced in distributed
computing environments where tasks are performed by one or more
remote processing devices, or within a cloud of one or more
devices, that are linked through one or more communications
networks. In a distributed computing environment, program modules
may be located in both local and remote computer storage media
including media storage devices. Still further, the aforementioned
instructions may be implemented, in part or in whole, as hardware
logic circuits, which may or may not include a processor.
IV. Operational Overview
[0047] FIG. 4 is a flow diagram illustrating the general operation
of embodiments of the perspective-correct communication window
system 100 and method shown in FIGS. 1 and 2. As shown in FIG. 4,
the operation of embodiments of the perspective-correct
communication window system 100 and method begins by capturing
images of each of the participants in the conference or meeting
(box 400). At least one of the participants is a remote
participant, which means that the remote participant is not in the
same physical location as the other participant. The capture of
each participant is achieved by using the camera pods.
[0048] Next, embodiments of the method use data from the captured
images to create a geometric proxy for each participant (box 410).
The number of participants then is determined (box 420). This
determination may instead be performed earlier, such that the number
of participants is determined or known beforehand. Embodiments of the
method then generate scene geometry based on the number of
participants (box 430). This scene geometry generation helps to
simulate the experience of an in-person conversation or meeting
with the remote participants.
[0049] Each geometric proxy for a particular participant then is
rendered to the other geometric proxies for the other participants
within the scene geometry (box 440). This rendering is performed
such that the geometric proxies are arranged in a manner that is
consistent with an in-person conversation. These rendered geometric
proxies and the scene geometry then are transmitted to the
participants (box 450). A changing virtual viewpoint is displayed
to each of the participants such that the virtual viewpoint is
dependent on an orientation of the viewer's face (box 460). For
additional realism, motion parallax and depth are added in order to
enhance the viewing experience for the participants (box 470). As
explained in detail below, the motion parallax and depth are
dependent on the eye gaze of the participant relative to the
monitor on which the participant is viewing the conference or
meeting.
V. Operational Details
[0050] The operational details of embodiments of the
perspective-correct communication window system 100 and method will
now be discussed. This includes the details of the camera pods,
camera pod layout, the geometric proxy creation, and the creation
of the scene geometry. Moreover, also discussed will be the concept
of a virtual camera, the addition of motion parallax and depth to
the geometric proxies and scene geometry, and the handling of more
than one participant in the same environment and viewing the same
monitor.
V.A. Camera Pod
[0051] The first component of embodiments of the
perspective-correct communication window system 100 and method is
the capture and creation component 200. This component includes a
plurality of camera pods that are used to capture the 3D scene.
Moreover, as explained below, each camera pod contains multiple
sensors.
[0052] FIG. 5 is a block diagram illustrating the details of an
exemplary embodiment of a camera pod 500 of embodiments of the
perspective-correct communication window system 100 and method
shown in FIG. 1. As noted above, embodiments of the system 100 and
method typically include more than one camera pod 500. However, for
pedagogical purposes only a single camera pod will be described.
Moreover, it should be noted that the multiple camera pods do not
necessarily have to include the same sensors. Some embodiments of
the system 100 and method may include a plurality of camera pods
that contain different sensors from each other.
[0053] As shown in FIG. 5, the camera pod 500 includes multiple
camera sensors. These sensors include stereoscopic infrared (IR)
cameras 510, an RGB camera 520, and an IR emitter 530. In
order to capture a 3D image of the scene the camera pod 500
captures RGB data and the depth coordinates in order to compute a
depth map. FIG. 5 illustrates that the stereoscopic IR cameras
510 and the IR emitter 530 are used to perform the depth
calculation. The RGB camera 520 is used for the texture acquisition
and to reinforce the depth cues using depth segmentation. Depth
segmentation, which is well known in the computer vision field,
seeks to separate objects in an image from the background using
background subtraction.
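By way of a minimal sketch (the millimeter units and the 50 mm threshold are illustrative assumptions, not values from the disclosure), depth-based background subtraction can be as simple as comparing each pixel's depth against a pre-captured background depth map:

    import numpy as np

    def segment_foreground(depth_mm, background_mm, threshold_mm=50.0):
        # Pixels whose depth differs from the empty-scene background
        # by more than the threshold are labeled foreground; a zero
        # depth reading means "no data" and is treated as background.
        valid = depth_mm > 0
        diff = np.abs(depth_mm.astype(np.float32) - background_mm)
        return valid & (diff > threshold_mm)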
[0054] In alternative embodiments, the camera pod 500 achieves
stereoscopic sensing using time of flight sensors or ultrasound
instead of the IR structured-light approach. A time-of-flight camera
is a range imaging camera system that computes distance based on
the speed of light and by measuring the time of flight of a light
signal between the camera and the object for each point in an
image. Ultrasound techniques can be used to compute distance by
generating an ultrasonic pulse in a certain direction. If there is
an object in the path of the pulse, then part or all of the pulse
will be reflected back to the transmitter as an echo. The range can
be found by measuring the difference between the pulse being
transmitted and the echo being received. In other embodiments the
distance may be found by performing an RGB depth calculation using
stereo pairs of RGB cameras.
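Both the time-of-flight and ultrasound alternatives reduce to the same round-trip range equation; a short sketch (constants chosen for illustration) follows:

    SPEED_OF_LIGHT_M_S = 299_792_458.0  # time-of-flight (light pulse)
    SPEED_OF_SOUND_M_S = 343.0          # ultrasonic pulse in air

    def round_trip_distance(round_trip_time_s, wave_speed_m_s):
        # The pulse travels to the object and back, so the distance
        # to the object is half the round-trip path length.
        return wave_speed_m_s * round_trip_time_s / 2.0

    # Example: a light pulse returning after 10 nanoseconds places
    # the object roughly 1.5 meters away.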
V.B. Camera Pod Layout
[0055] Also part of the capture and creation component 200 is the
camera pod layout. One or more camera pods are configured in a
particular layout in order to capture the 3D scene that includes
one or more of the participants. The number of camera pods directly
affects the quality of the captured images and the number of
occlusions. As the number of camera pods increases there is more
RGB data available and this improves image quality. Moreover, the
number of occlusions is diminished as the number of camera pods
increases.
[0056] In some embodiments of the system 100 and method the camera
pod layout uses four camera pods. In alternate embodiments any
number of cameras may be used. In fact there could be a lower-end
version that uses a single camera pod. For example, the single
camera pod may be mounted on top of a monitor and use image
distortion correction techniques to correct for any imaging errors.
The touchstone is that the camera pod layout should have enough
camera pods to provide a 3D view of the environment containing the
participant.
[0057] FIG. 6 illustrates an exemplary embodiment of a camera pod
layout (such as that shown in FIG. 2) using four camera pods. As
shown in FIG. 6, the four camera pods 500 are embedded in the bezel
of a monitor 600. The monitor 600 can be of virtually any size, but
larger monitors provide a more life-size re-projection. This
typically provides the user with a more realistic experience.
Displayed on the monitor 600 is a remote participant 610 that is
participating in the conference or meeting.
[0058] As shown in FIG. 6, the four camera pods 500 are arranged in
a diamond configuration. This allows embodiments of the system 100
and method to capture the user from above and below and from side
to side. Moreover, the top-middle and bottom-middle camera pods can
be used to get a realistic texture on the face of the user without a
seam. Note that cameras in the corners will typically cause a seam
issue. In other embodiments virtually any configuration and
arrangement of the four camera pods 500 can be used and may be
mounted anywhere on the monitor 600. In still other embodiments one
or more of the four camera pods 500 are mounted in places other
than the monitor 600.
[0059] In alternate embodiments three camera pods are used and
positioned at the top or bottom of the monitor 600. Some
embodiments use two camera pods that are positioned at the top or
bottom corners of the monitor 600. In still other embodiments N
camera pods are used, where N is greater than four (N>4). In
this embodiment the N camera pods are positioned around the outside
edge of the monitor 600. In yet other embodiments there are
multiple camera pods positioned behind the monitor 600 in order to
capture the 3D scene of the environment containing the local
participant.
V.C. Geometric Proxy Creation
[0060] Another part of the capture and creation component 200 is
the geometric proxy creation module 245. It should be noted that
the geometric proxy is not an avatar or a graphical representation
of the user. Instead, the geometric proxy is a geometric
representation of the participant that has real video painted onto
the geometric representation frame by frame in order to increase
the realism. The module 245 creates a geometric proxy for each of
the participants in the conference or meeting. Depth information is
computed from range data captured by the camera pods 500. Once the
depth information is obtained a sparse point cloud is created from
depth points contained in the captured depth information. A dense
depth point cloud then is generated using known methods and the
captured depth information. In some embodiments a mesh is
constructed from the dense point cloud and the geometric proxy is
generated from the mesh. In alternate embodiments the dense point
clouds are textured in order to generate the geometric proxy.
[0061] FIG. 7 illustrates an exemplary overview of the creation of
a geometric proxy for a single meeting participant. As shown in
FIG. 7, RGB data 700 is captured from the RGB cameras of the camera
pods 500. In addition, depth information 710 is computed from the
depth data obtained by the camera pods 500. The RGB data 700 and
the depth information 710 are added together in order to create the
geometric proxy 250 for the single meeting participant. This
geometric proxy creation is performed for each of the participants
such that each participant has a corresponding geometric proxy.
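A minimal sketch of this "RGB data plus depth information" combination, assuming a pinhole camera with known intrinsics (the focal lengths fx, fy and principal point cx, cy are assumed calibration parameters, not values from the disclosure): each valid depth pixel is back-projected to a 3D point and painted with its RGB value, producing the colored point cloud from which a mesh proxy could be built.

    import numpy as np

    def rgbd_to_colored_points(rgb, depth, fx, fy, cx, cy):
        # Back-project each valid depth pixel through a pinhole
        # camera model. rgb is an HxWx3 array; depth is an HxW array
        # in meters (0 = no reading).
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        valid = depth > 0
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        points = np.stack([x[valid], y[valid], depth[valid]], axis=1)
        colors = rgb[valid]  # the real video painted onto the geometry
        return points, colors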
V.D. Registration of the 3D Volume and Alignment of the 3D
Space
[0062] The second component of embodiments of the
perspective-correct communication window system 100 and method is
the scene geometry component 210. This includes both the
registration of the 3D volume and the alignment of the 3D space
that the camera pods 500 capture. The general idea of the scene
geometry component 210 is to create relative geometry between the
meeting participants. The desire is to align the scene exactly as
if the participants are in the same physical location and engaged
in an in-person conversation.
[0063] Embodiments of the system 100 and method create the scene
geometry that is a 3D scene anchored at the capturing environment.
In order to achieve this it is desirable to have a precise
estimation of the environments containing each of the participants.
Once this is obtained then embodiments of the system 100 and method
compute a precise registration of the monitor with the cameras.
This yields an orientation in virtual space that is aligned with
the real world. In other words, the virtual space is aligned with
the real space. This registration and alignment is achieved using
known methods. In some embodiments of the system 100 and method the
calibration is performed at the time of manufacture. In other
embodiments calibration is performed using a reference object in
the environment.
[0064] The scene geometry seeks to create relative geometry between
a local participant and remote participants. This includes creating
eye gaze and conversational geometry as if the participants were in
an in-person meeting. One way in which to get eye gaze and
conversational geometry correct is to have relative, consistent
geometry between the participants. In some embodiments this is
achieved by using virtual boxes. Specifically, if a box was drawn
around the participants in real space when the participants are in
a room together, then these virtual boxes are recreated in a
virtual layout to create the scene geometry. The shape of the
geometry does not matter as much as its consistency between the
participants.
[0065] Certain input form factors, such as a single monitor or
multiple monitors, will affect the optimum layout and scalability of the
solution. The scene geometry also depends on the number of
participants. A meeting with two participants (a local participant
and a remote participant) is a one-to-one (1:1) scene geometry that
is different from the scene geometry when there are three or more
participants. Moreover, as will be seen from the examples below,
the scene geometry includes eye gaze between the participants.
[0066] FIG. 8 illustrates an exemplary embodiment of scene geometry
between participants when there are two participants in the
meeting. As shown in FIG. 8 this scene geometry for a 1:1
conference 800 includes a first participant 810 and a second
participant 820. These participants are not in the same physical
location.
[0067] In this scene geometry for a 1:1 conference 800, the
geometry consists of two boxes that occupy the spaces in front of
the respective monitors (not shown) of the participants 810, 820. A
first virtual box 830 is drawn around the first participant 810 and
a second virtual box 840 is drawn around the second participant
820. Assuming same-size monitors and consistent setups allows
embodiments of the system 100 and method to know that the scene
geometry is correct without any manipulation of the captured data.
[0068] In alternate embodiments of the system 100 and method there
are multiple remote participants and the geometry is different from
the scene geometry for a 1:1 conference 800. FIG. 9 illustrates an
exemplary embodiment of the scene geometry between participants
when there are three participants in the meeting. This is the scene
geometry for a 3-endpoint conference 900. An endpoint is an
environment containing a participant of the conference or meeting.
In a 3-endpoint conference there are participants in three
different physical locations.
[0069] In FIG. 9 the scene geometry for a 3-endpoint conference 900
includes participant #1 910, participant #2 920, and participant #3
930 around a virtual round table 935. A virtual box #1 940 is drawn
around participant #1 910, a virtual box #2 950 is drawn around
participant #2 920, and a virtual box #3 960 is drawn around
participant #3 930. Each of the virtual boxes 940, 950, 960 is
placed around the virtual round table 935 in an equidistant manner.
This creates the scene geometry for a 3-endpoint conference
900.
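The equidistant placement generalizes to any number of endpoints; a sketch follows (the table radius and coordinate frame are illustrative assumptions, since the disclosure does not specify coordinates):

    import math

    def place_virtual_boxes(num_participants, table_radius=1.5):
        # Space the virtual boxes at equal angles around a virtual
        # round table, each box facing the table's center. With two
        # participants this degenerates to two boxes facing each other.
        poses = []
        for i in range(num_participants):
            angle = 2.0 * math.pi * i / num_participants
            x = table_radius * math.cos(angle)
            y = table_radius * math.sin(angle)
            poses.append((x, y, angle + math.pi))  # (position, facing)
        return poses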
[0070] This scene geometry can be extended for additional
endpoints. However, at a certain point with a flat screen, such as
when there are 4 to 5 endpoints (depending on the size of the
screen), the scene geometry exceeds the system's ability to capture
and render natural pose positions. In that case, in order to preserve
conversational geometry while not having consistent virtual and
physical geometry, embodiments of the system 100 and method seek to
"pose" participants as they look at one another, exaggerating their
movements for people in the call in order to show them at who they
are looking. This, however, can get quite complicated and can lead
to an uncanny valley type of experience.
V.E. Virtual Camera
[0071] The scene geometry component 210 also includes a virtual
camera. The virtual camera defines the perspective projection
according to which a novel view of the 3D geometric proxy will be
rendered. This allows embodiments of the system 100 and method to
obtain a natural eye gaze and connection between people. One
breakdown in current video conferencing occurs because people are
not looking where a camera is positioned, so that the remote
participants in the conference feel as though the other person is
not looking at them. This is unnatural and typically does not occur
in an in-person conversation.
[0072] The virtual camera in embodiments of the system 100 and
method is created using the virtual space from the scene geometry
and the 3D geometric proxy (having detailed texture information)
for each participant. This virtual camera is not bound to the
locations of the real camera pods being used to capture the images.
Moreover, some embodiments of the system 100 and method use face
tracking (including eye gaze tracking) to determine where the
participants are and where they are looking in their virtual space.
This allows a virtual camera to be created based on where a
participant is looking in the scene. This serves to accurately
convey the proper gaze of the participant to other participants and
provides them the proper view. Thus, the virtual camera facilitates
natural eye gaze and conversational geometry in the interaction
between meeting participants.
[0073] These virtual cameras are created by building the scene
geometry and placing virtual viewpoints within it. From the multiple
perspectives obtained by the camera pods the virtual camera is able
to move around the scene geometry and see interpolated views where
no real camera exists. For example, think of the head as a balloon.
The front of the balloon will be captured by a camera pod in front
of the balloon and one side of the balloon will be captured by a
camera pod on that side of the balloon. A virtual camera can be
created anywhere in between the full front and the side by a
composition of images from both camera pods. In other words, the
virtual camera view is created as a composition of images from the
different cameras covering a particular space.
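One way such a composition could be realized (a sketch only; the disclosure does not commit to a particular blending scheme) is to weight each pod's contribution by its angular proximity to the virtual viewpoint, in the spirit of the view-dependent texture mapping mentioned earlier:

    import numpy as np

    def blend_weights(virtual_dir, pod_dirs):
        # Weight each camera pod by how closely its viewing direction
        # matches the virtual camera's direction (all unit vectors);
        # pods facing away contribute nothing.
        cosines = np.clip(np.asarray(pod_dirs) @ virtual_dir, 0.0, None)
        total = cosines.sum()
        return cosines / total if total > 0 else cosines

    def composite_view(pod_images, virtual_dir, pod_dirs):
        # pod_images: per-pod renderings already reprojected into the
        # virtual camera's frame (the reprojection itself is omitted).
        weights = blend_weights(virtual_dir, pod_dirs)
        return sum(w * img for w, img in zip(weights, pod_images))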
[0074] FIG. 10 illustrates an exemplary embodiment of a virtual
camera based on where a participant is looking. This can also be
thought of as using virtual gaze to obtain natural eye gaze. As
shown in FIG. 10, the monitor 600 displays the remote participant
610 to a local participant 1000. The monitor 600 includes the four
camera pods 500. A virtual eye gaze box 1010 is drawn around eyes
of the remote participant 1020 and eyes of the local participant
1030. The virtual eye gaze box 1010 is level such that in virtual
space the eyes of the remote participant 1020 and eyes of the local
participant 1030 are looking at each other.
[0075] Some embodiments of the virtual camera use face tracking to
improve performance. Face tracking helps embodiments of the system
100 and method change the perspective so that the participants are
always facing each other. Face tracking helps the virtual camera
remain level with the eye gaze of the viewer. This mimics how our
eyes work during an in-person conversation. The virtual camera
interacts with the face tracking to create a virtual viewpoint that
has the user looking straight at the other participant. In other
words, the face tracking is used to change the virtual viewpoint of
the virtual camera.
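A sketch of that interaction, assuming a face tracker that reports the midpoint of the viewer's eyes in a monitor-centered coordinate frame (the tracker interface and coordinate convention are assumptions, not part of the disclosure):

    def virtual_camera_pose(local_eye_midpoint, remote_eye_height):
        # Place the virtual camera at the tracked midpoint of the
        # local viewer's eyes and aim it at the remote participant's
        # eyes at the same virtual height, so both gazes stay level
        # and the participants appear to look straight at each other.
        position = local_eye_midpoint          # (x, y, z), monitor frame
        look_at = (0.0, remote_eye_height, 0.0)
        return position, look_at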
V.F. Depth Through Motion Parallax
[0076] The third component of the system 100 and method is the
virtual viewpoint component 220. Once the rendered geometric
proxies and scene geometry are transmitted to the participants,
they are rendered on the participants' monitors. In order to add
realism to the scene displayed on the monitor, depth using motion
parallax is added to provide the nuanced changes in view that come
when the position of someone viewing something changes.
[0077] Motion parallax is added using high-speed head tracking that
shifts the camera view as the viewer's head moves. This creates the
illusion of depth. FIG. 11 illustrates an exemplary embodiment of
providing depth through motion parallax based on where a viewer is
facing. As shown in FIG. 11, the monitor 600 having the four camera
pods 500 displays an image of the remote participant 610. Note that
in FIG. 11 the remote participant 610 is shown as a dotted-line
figure 1100 and a solid-line figure 1110. The dotted-line figure
1100 illustrates that the remote participant 610 is looking to his
left and thus has a first field-of-view 1120 that includes a
dotted-line participant 1130. The solid-line figure 1110
illustrates that the remote participant 610 is looking to his right
and thus has a second field-of-view 1140 that includes a solid-line
participant 1150.
[0078] As the remote participant's 610 viewpoint moves side to side
his perspective into the other space changes. This gives the remote
participant 610 a different view of the other participants and the
room (or environment) in which the other participants are located.
Thus, if the remote participant moves left, right, up, or down he
will see a slightly different view of the participant that the
remote participant 610 is interacting with and the background
behind that person shifts as well. This gives the scene a sense of
depth and gives the people in the scene the sense of volume that
they get when talking to someone in person. The remote
participant's viewpoint is tracked using head tracking or a
low-latency face tracking technique. Depth through motion parallax
dramatically enhances the volume feel while providing full freedom
of movement since the viewer is not locked to one camera
perspective.
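A minimal sketch of the head-tracked parallax loop (the scale factor and the tracker and renderer names are hypothetical):

    def parallax_view_offset(head_offset_m, scale=1.0):
        # Map the tracked head offset (meters from the monitor
        # center) to a virtual camera offset in the remote scene;
        # re-rendering from the shifted camera makes near content
        # shift more than the background, creating a sense of depth.
        dx, dy = head_offset_m
        return (scale * dx, scale * dy)

    # Each frame: shift the virtual camera by the tracked offset and
    # re-render the remote scene (tracker and render are hypothetical).
    # offset = parallax_view_offset(tracker.head_offset())
    # render(remote_scene, camera_offset=offset)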
V.G. Multiple Participants at a Single Endpoint
[0079] Embodiments of the system 100 and method also include the
situation where there is more than one participant at an endpoint.
The above technique for depth through motion parallax works well
for a single viewer because of the ability to track the viewer and
to provide the appropriate view on the monitor based on their
viewing angle and location. This does not work, however, if there
is a second person at the same endpoint and viewing the same
monitor because the monitor can only provide one scene at a time
and it will be locked to one person. This causes the view to be off
for the other viewer that is not being tracked.
[0080] There are several ways in which embodiments of the system
100 and method address this issue. In some embodiments monitors are
used that provide different images to different viewers. In these
embodiments the face tracking technique tracks two different faces
and then provides different views to different viewers. In other
embodiments the motion parallax is removed and a fixed virtual
camera is locked in the center of the monitor. This creates a
sub-standard experience when more than one participant is at an
endpoint. In still other embodiments glasses are worn by each of
the multiple participants at the endpoint. Each pair of glasses is
used to provide different views. In still other embodiments the
glasses have active shutters on them that show each wearer
different frames from the monitor. The alternating frames displayed
by the monitor are tuned to each pair of glasses and provide each
viewer the correct image based on the viewer's location.
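The alternating-frame scheme can be sketched in a few lines (the 120 Hz figure is illustrative; the disclosure does not specify a refresh rate):

    def view_for_frame(frame_index, per_viewer_views):
        # Round-robin the per-viewer renderings across display
        # frames; each pair of active-shutter glasses opens only on
        # the frames assigned to its wearer. With two viewers on a
        # 120 Hz monitor, each viewer effectively sees a 60 Hz image.
        return per_viewer_views[frame_index % len(per_viewer_views)]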
[0081] Another embodiment uses a monitor having multiple viewing
angles. FIG. 12 illustrates an exemplary embodiment of a technique
to handle multiple participants at a single endpoint using the
monitor having multiple viewing angles. This provides each viewer
in front of the monitor with a different view of the remote
participant 610 and the room behind the remote participant 610.
[0082] As shown in FIG. 12, a monitor having a lenticular display
1200 (which allows multiple viewing angles) and having the four
camera pods 500 is displaying the remote participant 610. A first
viewer 1210 is looking at the monitor 1200 from the left side of
the monitor 1200. The eyes of the first viewer 1220 are looking at
the monitor 1200 from the left side and have a left field-of-view
1230 of the monitor 1200. A second viewer 1240 is looking at the
monitor 1200 from the right side of the monitor 1200. The eyes of
the second viewer 1250 are looking at the monitor 1200 from the
right side and have a right field-of-view 1260. Because of the
lenticular display on the monitor 1200, the left field-of-view 1230
and the right field-of-view 1260 are different. In other words, the
first viewer 1210 and the second viewer 1240 are provided with
different views of the remote participant 610 and the room behind
the remote participant 610. Thus, even if the first viewer 1210 and
the second viewer 1240 were side by side, they would see different
things on the monitor 1200 based on their viewpoint.
[0083] Moreover, although the subject matter has been described in
language specific to structural features and/or methodological
acts, it is to be understood that the subject matter defined in the
appended claims is not necessarily limited to the specific features
or acts described above. Rather, the specific features and acts
described above are disclosed as example forms of implementing the
claims.
* * * * *