U.S. patent application number 15/338676 was filed with the patent office on 2018-05-03 for automated configuration of behavior of a telepresence system based on spatial detection of telepresence components.
The applicant listed for this patent is Cisco Technology, Inc. The invention is credited to Glenn R. G. Aarrestad, Lennart Burenius, Johan Ludvig Nielsen, and Jochen Christof Schirdewahn.
Application Number | 20180124354 (15/338676)
Document ID | /
Family ID | 61801353
Filed Date | 2018-05-03

United States Patent Application | 20180124354
Kind Code | A1
Inventors | Aarrestad; Glenn R. G.; et al.
Publication Date | May 3, 2018
AUTOMATED CONFIGURATION OF BEHAVIOR OF A TELEPRESENCE SYSTEM BASED
ON SPATIAL DETECTION OF TELEPRESENCE COMPONENTS
Abstract
A system is provided that automatically configures the behavior of
the display devices of a video conference endpoint. The endpoint's controller may
detect, at a microphone array having a predetermined physical
relationship with respect to a camera, audio emitted from one or
more loudspeakers, each loudspeaker having a predetermined physical
relationship with respect to at least one of one or more display
devices in a conference room. The controller may then generate data
representing a spatial relationship between the one or more display
devices and the camera based on the detected audio. Finally, the
controller may assign video sources received by the endpoint to
each of the one or more display devices based on the data
representing the spatial relationship and the content of each
received video source, and may also assign outputs from multiple
video cameras to an outgoing video stream based on the data
representing the spatial relationship.
Inventors | Aarrestad; Glenn R. G. (Hovik, NO); Burenius; Lennart (Oslo, NO); Schirdewahn; Jochen Christof (Stabekk, NO); Nielsen; Johan Ludvig (Oslo, NO)
Applicant | Cisco Technology, Inc. (San Jose, CA, US)
Family ID | 61801353
Appl. No. | 15/338676
Filed | October 31, 2016
Current U.S. Class | 1/1
Current CPC Class | H04N 7/15 20130101; H04N 7/147 20130101; H04R 1/406 20130101; H04R 2430/20 20130101; H04N 7/142 20130101; H04R 1/403 20130101; H04R 3/005 20130101
International Class | H04N 7/14 20060101 H04N007/14; H04R 1/40 20060101 H04R001/40; H04N 7/15 20060101 H04N007/15
Claims
1. A method comprising: detecting, at a microphone array having a
predetermined physical relationship with respect to a plurality of
cameras, audio emitted from each of one or more loudspeakers, each
loudspeaker having a predetermined physical relationship with
respect to at least one of one or more display devices in a
conference room; generating data representing a spatial
relationship between each of the one or more display devices and
each of the plurality of cameras based on the detected audio, where
generating data representing the spatial relationship includes
determining azimuth and elevation angles between each of the
plurality of cameras and each of the one or more display devices;
and assigning one or more video sources of an incoming video feed
from a remote conference room to corresponding ones of the one or
more display devices based on the data representing the spatial
relationship and content of the one or more video sources.
2. (canceled)
3. (canceled)
4. The method of claim 1, wherein generating data representing the
spatial relationship further comprises: determining, based on the
determined azimuth and elevation angles, at least one of: a first
probability that a first camera of the plurality of cameras is
disposed above one of the display devices, a second probability
that the first camera is disposed below one of the display devices,
a third probability that the first camera is disposed right of one
of the display devices, or a fourth probability that the first camera
is disposed left of one of the display devices.
5. The method of claim 4, wherein assigning comprises: assigning a
video source of the incoming video feed to be displayed on a top
side of a screen, a bottom side of the screen, a right side of the
screen, or a left side of the screen of one of the display devices
based on where the first probability, second probability, third
probability and fourth probability indicate the first camera is
disposed with respect to the one of the display devices.
6. (canceled)
7. The method of claim 27, further comprising: tagging the
respective video outputs from the plurality of cameras with data
indicative of a respective field of view of each of the plurality
of cameras in the conference room.
8. The method of claim 27, wherein one of the display devices is a
user-interactive display device, and assigning further comprises:
assigning, based on the data representing the spatial relationship,
a particular one of the plurality of cameras that is positioned in
the conference room opposite the user-interactive display device to
capture a participant presenting on the user-interactive display
device and a surrounding area of the user-interactive display
device.
9. An apparatus comprising: a plurality of cameras configured to
capture video within a conference room; a microphone array having a
predetermined physical relationship with respect to the plurality
of cameras, the microphone array configured to transduce audio
received at the microphone array; and a processor configured to
control the plurality of cameras and the microphone array, wherein
the processor is configured to: cause the microphone array to
detect audio emitted from one or more loudspeakers having a
predetermined physical relationship with respect to at least one of
one or more display devices in the conference room; generate data
representing a spatial relationship between each of the one or more
display devices and each of the plurality of cameras based on the
detected audio by determining azimuth and elevation angles between
each of the plurality of cameras and each of the one or more
display devices; and assign one or more video sources of an
incoming video feed from a remote conference room to corresponding
ones of the one or more display devices based on the data
representing the spatial relationship and content of the one or
more video sources.
10. (canceled)
11. (canceled)
12. The apparatus of claim 9, wherein the processor, when
generating data representing the spatial relationship, is further
configured to: determine, based on the determined azimuth and
elevation angles, at least one of: a first probability that a first
camera of the plurality of cameras is disposed above one of the
display devices, a second probability that the first camera is
disposed below one of the display devices, a third probability that
the first camera is disposed right of one of the display devices,
or a fourth probability that the first camera is disposed left of one
of the display devices.
13. The apparatus of claim 12, wherein the processor is further
configured to: assign a video source of the incoming video feed to
be displayed on a top side of a screen, a bottom side of the
screen, a right side of the screen, or a left side of the screen of
one of the display devices based on where the first probability,
second probability, third probability and fourth probability
indicate the first camera is disposed with respect to the one of
the display devices.
14. (canceled)
15. The apparatus of claim 28, wherein the processor is further
configured to: tag the respective video outputs from the plurality
of cameras with data indicative of a respective field of view of
each of the plurality of cameras in the conference room.
16. The apparatus of claim 28, wherein one of the display devices
is a user-interactive display device and the processor is further
configured to: assign, based on the data representing the spatial
relationship, a particular one of the plurality of cameras that is
positioned in the conference room opposite the user-interactive
display device to capture a participant presenting on the
user-interactive display device and a surrounding area of the
user-interactive display device.
17. One or more non-transitory computer readable storage media, the
computer readable storage media being encoded with software
comprising computer executable instructions, and when the software
is executed, operable to: detect, at a microphone array having a
predetermined physical relationship with respect to a plurality of
cameras, audio emitted from each of one or more loudspeakers, each
loudspeaker having a predetermined physical relationship with
respect to at least one of one or more display devices in a
conference room; generate data representing a spatial relationship
between each of the one or more display devices and each of the
plurality of cameras based on the detected audio by determining
azimuth and elevation angles between each of the plurality of
cameras and each of the one or more display devices; and assign one
or more video sources of an incoming video feed from a remote
conference room to corresponding ones of the one or more display
devices based on the data representing the spatial relationship and
content of the one or more video sources.
18. (canceled)
19. (canceled)
20. (canceled)
21. The non-transitory computer-readable storage media of claim 29,
wherein the instructions are further operable to: tag the
respective video outputs from the plurality of cameras with data
indicative of a respective field of view of each of the plurality
of cameras in the conference room.
22. The non-transitory computer-readable storage media of claim 17,
wherein the instructions, when generating data representing the
spatial relationship, are further operable to: determine, based on
the determined azimuth and elevation angles, at least one of: a
first probability that a first camera of the plurality of cameras
is disposed above one of the display devices, a second probability
that the first camera is disposed below one of the display devices,
a third probability that the first camera is disposed right of one
of the display devices, or a fourth probability that the first camera
is disposed left of one of the display devices.
23. The non-transitory computer-readable storage media of claim 22,
wherein the instructions are further operable to: assign a video
source of the incoming video feed to be displayed on a top side of
a screen, a bottom side of the screen, a right side of the screen,
or a left side of the screen of one of the display devices based on
where the first probability, second probability, third probability
and fourth probability indicate the first camera is disposed with
respect to the one of the display devices.
24. The method of claim 1, further comprising: determining, from
the detected audio of each of the one or more loudspeakers, whether
each of the one or more loudspeakers is within a predetermined
distance from the microphone array.
25. The apparatus of claim 9, wherein the processor is further
configured to: determine, from the detected audio of each of the
one or more loudspeakers, whether each of the one or more
loudspeakers is within a predetermined distance from the microphone
array.
26. The non-transitory computer-readable storage media of claim 17,
wherein the instructions are further operable to: determine, from
the detected audio of each of the one or more loudspeakers, whether
each of the one or more loudspeakers is within a predetermined
distance from the microphone array.
27. The method of claim 1, further comprising: assigning video
outputs from the plurality of cameras to an outgoing video feed
based on the data representing the spatial relationship, the
outgoing video feed to be sent from the conference room to a remote
conference room.
28. The apparatus of claim 9, wherein the processor is further
configured to: assign video outputs from the plurality of cameras
to an outgoing video feed based on the data representing the
spatial relationship, the outgoing video feed to be sent from the
conference room to a remote conference room.
29. The non-transitory computer readable storage media of claim 17,
wherein the instructions are further operable to: assign video
outputs from the plurality of cameras to an outgoing video feed
based on the data representing the spatial relationship, the
outgoing video feed to be sent from the conference room to a remote
conference room.
Description
TECHNICAL FIELD
[0001] The present disclosure relates to configuring components of
a video conference endpoint in a conference room based on spatial
detection of the components.
BACKGROUND
[0002] Video conference endpoints are deployed in conference rooms.
The conference rooms can differ in size and configuration, which
affects the layout/placement of the video conference endpoint
components in the conference room and the use of the conference room.
The placement of the components within the conference room, such as
the relationship and placement of the display screens with respect
to the camera(s), affects the experience of participants of a
conference session within the conference room. Because no two
conference rooms are the same size and shape, a standard layout for
a conference room is not possible. This results in different
placements of the camera(s) with respect to the display screens of
a conference room. Typically, an operator has to manually select
which display screen is to receive specific video sources,
including which display screen, or portion of a display screen, is
to display the live video stream of the participants of the
conference session that are present at another video conference
endpoint. Such manual selection is cumbersome and inconvenient, and
often does not place the live video stream of participants of the
conference session from another video conference endpoint at a
position that maximizes eye contact between participants at
separate video conference endpoints.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] FIG. 1 is a block diagram of an example video conference
(e.g., teleconference) system in which techniques to automatically
configure the behavior of various components within the environment
based on spatial detection may be implemented, according to an
example embodiment.
[0004] FIG. 2A is an illustration of an example video conference
endpoint deployed in a conference room and configured to perform
techniques presented herein, according to an example
embodiment.
[0005] FIG. 2B is an illustration of example video conference
endpoints deployed in respective conference rooms and configured to
perform techniques presented herein, according to an example
embodiment.
[0006] FIG. 3A is a front view of one of the display devices of a
video conference endpoint where the integrated camera and
microphone array are disposed above the display device, according
to an example embodiment.
[0007] FIG. 3B is a front view of one of the display devices of a
video conference endpoint where the integrated camera and
microphone array are disposed below the display device, according
to an example embodiment.
[0008] FIG. 3C is a front view of one of the display devices of a
video conference endpoint where the integrated camera and
microphone array are disposed to the right of the display device,
according to an example embodiment.
[0009] FIG. 3D is a front view of one of the display devices of a
video conference endpoint where the integrated camera and
microphone array are disposed to the left of the display device,
according to an example embodiment.
[0010] FIG. 4A is a front view of a plurality of display devices of
a video conference endpoint where the integrated camera and
microphone array are disposed above one of the display devices,
according to an example embodiment.
[0011] FIG. 4B is a front view of a plurality of display devices of
a video conference endpoint where the integrated camera and
microphone array are disposed between the display devices,
according to an example embodiment.
[0012] FIG. 5 is a block diagram of an example controller of a
video conference endpoint configured to perform techniques
described herein, according to an embodiment.
[0013] FIG. 6 is an illustration of an example user control device
associated with a video conference endpoint, where the user control
device displays a rendering of the components of the video
conference endpoint.
[0014] FIG. 7A is a front view of a camera integrated with a
microphone array of a video conference endpoint, where the
microphone array is detecting audio outputs originating from either
the left side or the right side of the microphone array, according
to an example embodiment.
[0015] FIG. 7B is a front view of a camera integrated with a
microphone array of the video conference endpoint where the
microphone array is detecting audio outputs originating from either
above or below the microphone array, according to an example
embodiment.
[0016] FIG. 8 is a flowchart of a method of updating the display
device that receives a live video feed based on the determined
probability that an audio output originates from above, below, to the
right of, and to the left of the microphone array, according to an
example embodiment.
[0017] FIG. 9 is a flowchart of a method for configuring the roles
of a plurality of display devices of a conference room based on
detected audio outputs of the plurality of display devices,
according to an example embodiment.
[0018] FIG. 10 is a flowchart of a method for configuring the roles
of a plurality of display devices of a conference room based on
detected audio outputs of the plurality of display devices, the
detected shape of the table of the conference endpoint, and/or the
orientation of the detected faces within the conference room,
according to an example embodiment.
[0019] FIG. 11A is an illustration of a table and a plurality of
detected faces within a conference room from the viewpoint of a
camera integrated with a video conference endpoint where the camera
is mounted below a display device, according to an example
embodiment.
[0020] FIG. 11B is another illustration of a table and a plurality
of detected faces within a conference room from the viewpoint of a
camera integrated with a video conference endpoint where the
camera is mounted at the same height as a display device, according
to an example embodiment.
[0021] FIG. 11C is an illustration of a table and a plurality of
detected faces within a conference room from the viewpoint of a
camera integrated with a video conference endpoint where the
camera is mounted above a display device, according to an example
embodiment.
[0022] FIG. 12 is a flowchart of a method of generating data
representing the spatial relationship of the components of the
video conference endpoint, according to an example embodiment.
DESCRIPTION OF EXAMPLE EMBODIMENTS
Overview
[0023] Techniques presented herein relate to automatically
configuring the one or more display devices of a video conference
endpoint based on spatial detection of the components of the video
conference endpoint and the content of the video sources received
by the video conference endpoint. The video conference endpoint may
include one or more display devices, one or more loudspeakers
having a predetermined physical relationship with respect to at
least one of the one or more display devices, at least one camera,
a microphone array having a predetermined physical relationship
with respect to the camera, and a controller. The controller may be
configured to detect, at a microphone array having a predetermined
physical relationship with respect to a camera, the audio emitted
from each of one or more loudspeakers, each loudspeaker having a
predetermined physical relationship with respect to at least one of
one or more display devices in a conference room. The controller
may further be configured to generate data representing a spatial
relationship between each of the one or more display devices and
the camera based on the detected audio.
Example Embodiments
[0024] With reference to FIG. 1, there is depicted a block diagram
of a video conference (e.g., teleconference) system 100 in which
automatic configuration of the behavior of the display devices of
the system 100 based on spatial detection may be implemented,
according to an example embodiment. Video conference system 100
includes video conference endpoints 104 operated by local
users/participants 106 and configured to establish audio-visual
teleconference collaboration sessions with each other over a
communication network 110. Communication network 110 may include
one or more wide area networks (WANs), such as the Internet, and
one or more local area networks (LANs). A conference server 102 may
also be deployed to coordinate the routing of audio-video streams
among the video conference endpoints.
[0025] Each video conference endpoint 104 may include at least one
video camera (VC) 112, at least one display device 114, a
loudspeaker (LDSPKR) 116 coupled to or integrated with the display
device 114, one or more microphone arrays (MIC) 118 coupled to or
integrated with the camera 112, and an endpoint controller 120
configured to control the video camera(s) 112, at least one display
device 114, the loudspeaker 116, and the one or more microphone
arrays 118. In a transmit direction, endpoints 104 capture
audio/video from their local participants 106 with video camera
112/microphone array 118, encode the captured audio/video into data
packets, and transmit the data packets to other endpoints or to the
conference server 102. In a receive direction, endpoints 104 decode
audio/video from data packets received from the conference server
102 or other endpoints and present the audio/video to their local
participants 106 via display device 114/loudspeaker 116.
[0026] Referring now to FIG. 2A, there is depicted an illustration
of video conference endpoint 104 deployed in a conference room 200,
according to an embodiment. Video conference endpoint 104 includes
a plurality of display devices 114(1)-114(4) positioned around the
conference room 200. Display devices 114(1)-114(3) may be screens
configured to display content from video sources, while display
device 114(4) may be a user-interactive digital display device
(e.g., a whiteboard or touch screen). Display devices 114(1)-114(4)
may contain a camera 112(1)-112(4), respectively, and a microphone
array 118(1)-118(4), respectively, having a predetermined physical
relationship with respect to the cameras 112(1)-112(4),
respectively. In some embodiments, the microphone arrays
118(1)-118(4) may be integrated with the cameras 112(1)-112(4),
respectively. Cameras 112(1)-112(4) are each operated under control
of endpoint 104 to capture video of different views or scenes of
multiple participants 106 seated around a table 202 opposite from
or facing (i.e., in front of) the cameras 112(1)-112(4) (and
display devices 114(1)-114(4)). The cameras 112(1)-112(4) depicted
in FIG. 2A are only one example of many possible camera placements
and combinations that may be used (e.g., combining two video
cameras for one display device), as would be appreciated by one of
ordinary skill in the relevant arts having read the present
description.
[0027] In some forms, the display devices may be separate from one
or more cameras, and the microphone arrays may be separate from the
display devices and one or more cameras. For example, an end user
may use his/her own display devices, and some cameras available in
the market are configured to attach to a microphone stand
supporting the microphone array. However, even in that situation,
the camera will, once attached, have a known predetermined physical
relationship with respect to the microphone array. In summary, the
various components of an endpoint may be integrated together when
sold, or may be configured after purchase to be physically attached
to each other so as to have a predetermined physical relationship.
Furthermore, the loudspeakers 116(1)-116(4) may have a
predetermined physical relationship with respect to the display
devices 114(1)-114(4), respectively. In some embodiments, the
loudspeakers 116(1)-116(4) may be integrated with the display
devices 114(1)-114(4), respectively. While FIG. 2A illustrates the
loudspeakers 116(1)-116(4) being disposed centrally on the display
devices 114(1)-114(4), it should be appreciated that the
loudspeakers 116(1)-116(4) may be disposed in any location within
or around the edge/frame of the display devices 114(1)-114(4),
including, but not limited to, centrally along the bottom edge of
the frame of the display devices 114(1)-114(4), the bottom corners
of the display devices 114(1)-114(4), etc. In other embodiments,
the loudspeakers 116(1)-116(4) may be attached or mounted in close
proximity to the display devices 114(1)-114(4), respectively. Thus,
the loudspeakers 116(1)-116(4) are configured to generate audio
projected in the same directions that the display devices
114(1)-114(4), respectively, display video content. In other words,
the loudspeakers 116(1)-116(4) are integrated with the display
devices 114(1)-114(4) such that the audio outputs generated by the
loudspeakers 116(1)-116(4) originate from approximately the same
location in which the content of the video sources are
displayed.
[0028] As depicted in the example of FIG. 2A, and as briefly
explained above, microphone arrays 118(1)-118(4) are positioned
adjacent to or integrated with (or otherwise in a known
predetermined physical relationship to) the cameras 112(1)-112(4),
respectively.
In one embodiment, microphone arrays 118(1)-118(4) may be planar
microphone arrays. The combination of the cameras 112(1)-112(4)
with the microphone arrays 118(1)-118(4), respectively, may be
disposed adjacent to display devices 114(1)-114(4), respectively,
enabling the respective microphone arrays 118(1)-118(4) to receive
both audio from participants 106 in room 200 and the audio outputs
generated by the loudspeakers 116(1)-116(4) of display devices
114(1)-114(4). Each of cameras 112(1)-112(4) may include pan, tilt,
and zoom (PTZ) features that may be implemented mechanically and/or
digitally.
[0029] The video conference endpoint 104 further includes an
endpoint user control device 204 disposed within the conference
room 200. The endpoint user control device 204 may be movable
within the room 200. The endpoint user control device 204 may be a
tablet computer, smartphone or other similar device on which an
endpoint controller application is installed. The endpoint user
control device 204 may be configured to manage each of the display
devices 114(1)-114(4), including, but not limited to, the content
displayed on each of the display devices 114(1)-114(4). The
endpoint user control device 204 may also be configured to control
the pan, tilt, and zoom of the video cameras 112(1)-112(4) (in the
mechanical or digital domain) as necessary to capture video of
different views that encompass one or more of the participants 106.
[0030] Video conference endpoint 104 uses (i) audio detection
techniques to detect audio sources, i.e., loudspeakers
116(1)-116(4), by the microphone arrays 118(1)-118(4) and to
determine the spatial relationship between the cameras
112(1)-112(4), display devices 114(1)-114(4), loudspeakers
116(1)-116(4), and microphone arrays 118(1)-118(4); (ii) face
detection techniques to detect faces and associated positions
thereof of participants 106 around the table 202; and (iii) object
detection techniques to detect the shape of specific and known
objects, e.g., the table 202.
[0031] In accordance with techniques presented herein, video
conference endpoint 104 defines/establishes the spatial
relationship between cameras 112(1)-112(4) and display devices
114(1)-114(4), and automatically determines which display device
114(1)-114(4) will display certain video feeds received by the
video conference endpoint 104. In support of this, video conference
endpoint 104 also defines the probability that an audio source
detected by the microphone array 118(1)-118(4) is disposed above,
below, to the right of, or to the left of the respective cameras
112(1)-112(4) and the respective microphone arrays 118(1)-118(4),
and thus also defines the probability that a display device
114(1)-114(4) is disposed above, below, to the right of, or to the
left of the respective cameras 112(1)-112(4) and respective
microphone arrays 118(1)-118(4). In certain cases described below,
endpoint 104 automatically determines which display device
114(1)-114(4) to display a live video feed of remote participants
106 located at a remote video conference endpoint 104.
[0032] Referring now to FIG. 2B, there is depicted an illustration
of a first video conference endpoint 104(1) deployed in conference
room 200(1) and a second video conference endpoint 104(2) deployed
in conference room 200(2), the two conference endpoints 104(1),
104(2) configured to communicate with one another via network 110,
according to an embodiment. The first video conference endpoint
104(1) and the second video conference endpoint 104(2) are
substantially similar to the video conference endpoint 104 depicted
in FIG. 2A.
[0033] The first video conference endpoint 104(1) includes a
plurality of display devices 114(1)-114(4) positioned around the
conference room 200(1). Display devices 114(1)-114(3) may be
screens configured to display content from video sources, while
display device 114(4) may be a user-interactive digital display
device (e.g., a whiteboard or touch screen). Display devices
114(1)-114(4) may contain a camera 112(1)-112(4), respectively, and
a microphone array 118(1)-118(4), respectively, integrated with the
cameras 112(1)-112(4), respectively. Cameras 112(1)-112(4) are each
operated under control of endpoint 104(1) to capture video of
different views or scenes of multiple participants 106 seated
around a table 202(1) opposite from or facing (i.e., in front of)
the cameras 112(1)-112(4) (and display devices 114(1)-114(4)).
Furthermore, display devices 114(1)-114(4) may contain an
integrated loudspeaker 116(1)-116(4), respectively.
[0034] The second video conference endpoint 104(2) includes a
plurality of display devices 114(5)-114(8) positioned around the
conference room 200(2). Display devices 114(5)-114(7) may be
screens configured to display content from video sources, while
display device 114(8) may be a user-interactive digital display
device (e.g., a whiteboard or touch screen). Display devices
114(5)-114(8) may contain a camera 112(5)-112(8), respectively, and
a microphone array 118(5)-118(8), respectively, integrated with the
cameras 112(5)-112(8), respectively. Cameras 112(5)-112(8) are each
operated under control of endpoint 104(2) to capture video of
different views or scenes of multiple participants 106 seated
around a table 202(2) opposite from or facing (i.e., in front of)
the cameras 112(5)-112(8) (and display devices 114(5)-114(8)).
Furthermore, display devices 114(5)-114(8) may contain an
integrated loudspeaker 116(5)-116(8), respectively.
[0035] As illustrated in FIG. 2B, the first video conference
endpoint 104(1) and the second video conference endpoint 104(2) are
configured to communicate with each other via network 110. The
captured video and audio of the first video conference endpoint
104(1) may be sent to the second video conference endpoint 104(2),
where the captured video and audio from the first video conference
endpoint 104(1) may be output by the display devices 114(5)-114(8)
and the loudspeakers 116(5)-116(8) of the second video conference
endpoint 104(2). Conversely, the captured video and audio of the
second video conference endpoint 104(2) may be sent to the first
video conference endpoint 104(1), where the captured video and
audio from the second video conference endpoint 104(2) may be
output by the display devices 114(1)-114(4) and the loudspeakers
116(1)-116(4) of the first video conference endpoint 104(1).
[0036] As described herein, the video conference endpoint 104(1)
may be configured to use data representing the spatial relationship
of video conference components generated according to the
techniques presented herein to assign video sources contained in an
incoming video feed received from video conference endpoint 104(2)
to display devices in conference room 200(1), and to assign outputs
from a plurality of cameras in conference room 200(1) to an
outgoing video feed to be sent to video conference endpoint 104(2)
in conference room 200(2). Similarly, video conference endpoint
104(2) may be configured to use data representing the spatial
relationship of video conference components generated according to
the techniques presented herein to assign video sources contained
in an incoming video feed received from video conference endpoint
104(1) to display devices in conference room 200(2), and to assign
outputs from a plurality of cameras in conference room 200(2) to an
outgoing video feed to be sent to video conference endpoint 104(1)
in conference room 200(1).
[0037] With reference to FIGS. 3A-3D, depicted is a front view of a
display device 114 with the camera 112 and microphone array 118
disposed at various positions around the display device 114. As
previously explained, the display device 114 includes a loudspeaker
116 integrated with, coupled to, or mounted in close proximity with
the display device 114. In the examples illustrated in FIGS. 3A-3D,
the loudspeaker 116 is integrated with the display device 114 such
that the loudspeaker 116 may be disposed within the display device
114. While FIGS. 3A-3D illustrate the loudspeaker 116 being
disposed centrally on the display device 114, it should be
appreciated that the loudspeaker 116 may be disposed in any
location within or around the edge/frame of the display device 114,
including, but not limited to, centrally along the bottom edge of
the frame of the display device 114, the bottom corners of the
display device 114, etc. Furthermore, the display device 114
includes a top side 300, a bottom side 302 opposite the top side
300, a first or left side 304, and a second or right side 306
opposite the left side 304. The display device 114 further includes
a screen 310, which is configured to display first video content
312 and second video content 314. In one embodiment, first video
content 312 may be a presentation (document, slides, etc.), while
second video content 314 may be a live video feed of remote
participants 106 located at another video conference endpoint
104.
[0038] As illustrated in FIG. 3A, when the camera 112 and
integrated microphone array 118 are disposed on or proximate to the
top side 300 of the display device 114, the video conference
endpoint 104 displays the live video feed 314 on the screen 310 of
the display device 114 proximate to the top side 300 and the camera
112. FIG. 3B illustrates that when the camera 112 and integrated
microphone array 118 are disposed on or proximate to the bottom
side 302 of the display device 114, the video conference endpoint
104 displays the live video feed 314 on the screen 310 of the
display device 114 proximate to the bottom side 302 and the camera
112.
[0039] Furthermore, FIG. 3C illustrates that when the camera 112
and integrated microphone array 118 are disposed on or proximate to
the right side 306 of the display device 114, the video conference
endpoint 104 displays the live video feed 314 on the screen 310 of
the display device 114 proximate to the right side 306 and the
camera 112. As illustrated in FIG. 3D, when the camera 112 and
integrated microphone array 118 are disposed on or proximate to the
left side 304 of the display device 114, the video conference
endpoint 104 displays the live video feed 314 on the screen 310 of
the display device 114 proximate to the left side 304 and the camera
112. Thus, as illustrated in FIGS. 3A-3D, the live video feed 314
of participants 106 from another endpoint 104 are presented on the
screen 310 of the display device 114 such that the live video feed
314 is proximate to the camera 112 attached or coupled to the
display device 114. Positioning the live video feed 314 proximate
to the camera 112 enables better "eye contact" between participants
106 at different endpoints 104. The positioning of the live video
feed 314 on the screen 310 of the display device 114, as described
above, gives the appearance that participants 106 at a first
endpoint 104 are looking into the camera 112 while actually viewing
the live video feed 314 disposed on the screen 310 of the display
device 114.
[0040] With reference to FIGS. 4A and 4B, depicted is a front view
of two display devices 114(1), 114(2) arranged proximate to each
other, with the camera 112 and microphone array 118 disposed at
various positions with respect to the display devices 114(1),
114(2). Similar to the examples illustrated in FIGS. 3A-3D, the
display devices 114(1), 114(2) include a loudspeaker 116(1),
116(2) integrated with, coupled to, or mounted in close proximity
with the display devices 114(1), 114(2). In the examples
illustrated in FIGS. 4A and 4B, the loudspeakers 116(1), 116(2) are
integrated with the display devices 114(1), 114(2), respectively,
such that the loudspeakers 116(1), 116(2) may be disposed within
the display devices 114(1), 114(2), respectively. As previously
explained, while FIGS. 4A and 4B illustrate the loudspeakers
116(1), 116(2) being disposed centrally on the display devices
114(1), 114(2), it should be appreciated that the loudspeakers
116(1), 116(2) may be disposed in any location within or around the
edge/frame of the display devices 114(1), 114(2), including, but
not limited to, centrally along the bottom edge of the frame of the
display devices 114(1), 114(2), the bottom corners of the display
devices 114(1), 114(2), etc. Furthermore, each of the display
devices 114(1), 114(2) includes a top side 300(1), 300(2), a bottom
side 302(1), 302(2) opposite the top side 300(1), 300(2), a first
or left side 304(1), 304(2), and a second or right side 306(1),
306(2) opposite the left side 304(1), 304(2). The display devices
114(1), 114(2) further include a screen 310(1), 310(2), which are
configured to display first video content 312(1), 312(2), and which
may be capable of displaying second video content 314.
[0041] Even with multiple display devices 114(1), 114(2), the video
conference endpoint 104 is configured to determine on which screen
310(1), 310(2) to display the second video content or live video
feed 314, as well as the positioning on the selected screen 310(1),
310(2) such that the live video feed 314 is positioned proximate to
the camera 112 to enable better "eye contact" between participants
106 at different endpoints 104. As illustrated in FIG. 4A, the
camera 112 and integrated microphone array 118 are disposed on or
proximate to the top side 300(1) of the display device 114(1).
Thus, the video conference endpoint 104 configures the screens
310(1), 310(2) of the display devices 114(1), 114(2) to position
the live video feed 314 on the screen 310(1) of the display device
114(1) so that the live video feed 314 is proximate to the top side
300(1) and to the camera 112. As illustrated, while the live video
feed 314 is configured to share the screen 310(1) of the display
device 114(1) with the presentation 312(1), the presentation 312(2)
is also configured to encompass the entire screen 310(2) of the
display device 114(2). Therefore, participants 106 at the endpoint
104 may be able to view the content of the presentation 312(1),
312(2) on either screen 310(1), 310(2) of either display device
114(1), 114(2), while also viewing the live video feed 314 on the
screen 310(1) of the display device 114(1). Because the camera 112
is disposed on the top side 300(1) of the display device 114(1),
when participants 106 view the live video feed 314 displayed on the
screen 310(1) of the display device 114(1), which is proximate to
the top side 300(1) of the display device 114(1), the participants
106 appear to also be looking into the camera 112.
[0042] As illustrated in FIG. 4B, the camera 112 and integrated
microphone array 118 are disposed between the right side 306(1) of
display device 114(1) and the left side 304(2) of display device
114(2). In this illustrated example, the camera 112 and integrated
microphone array 118 may be disposed equidistant from the right
side 306(1) of display device 114(1) and left side 304(2) of
display device 114(2). When the camera 112 and integrated
microphone array 118 are disposed equidistant from the right side
306(1) of display device 114(1) and left side 304(2) of display
device 114(2), the video conference endpoint 104 may select on
which screen 310(1), 310(2) to display the live video feed 314. If
the camera 112 and integrated microphone array 118 are disposed
between the display devices 114(1), 114(2) such that the camera 112
and integrated microphone array 118 are closer to one of the right
side 306(1) of display device 114(1) or the left side 304(2) of
display device 114(2), the video conference endpoint 104 may
display the live video feed 314 on the screen 310(1), 310(2) to
which the camera 112 is closest.
[0043] As FIG. 4B illustrates, the camera 112 and integrated
microphone array 118 are disposed between the right side 306(1) of
display device 114(1) and left side 304(2) of display device
114(2), and the camera 112 and integrated microphone array 118 are
also disposed proximate to the bottom sides 302(1), 302(2) of
display device 114(1), 114(2). Thus, as illustrated, the video
conference endpoint 104 displays the live video feed 314 in the
bottom right corner of the screen 310(1) of the display device
114(1) proximate to both the bottom side 302(1) and the right side
306(1) of the display device 114(1). As previously explained, while
the live video feed 314 is configured to share the screen 310(1) of
the display device 114(1) with the presentation 312(1), the
presentation 312(2) is configured to encompass the entire screen
310(2) of the display device 114(2). Therefore, participants 106 at
the endpoint 104 may be able to view the content of the
presentation 312(1), 312(2) on either screen 310(1), 310(2) of
either display device 114(1), 114(2), while also viewing the live
video feed 314 on the screen 310(1) of the display device
114(1).
[0044] Reference is now made to FIG. 5, which shows an example
block diagram of an endpoint controller 120 of video conference
endpoint 104 configured to perform techniques described herein.
There are numerous possible configurations for endpoint controller
120 and FIG. 5 is meant to be an example. Endpoint controller 120
includes a processor 500, a network interface unit (NIU) 502, and
memory 504. The network interface (I/F) unit (NIU) 502 is, for
example, an Ethernet card or other interface device that allows the
endpoint 104 to communicate over communication network 110 (FIG.
1). Network interface unit 502 may include wired and/or wireless
connection capability.
[0045] Processor 500 may take the form of a collection of
microcontrollers and/or microprocessors, for example, each
configured to execute respective software instructions stored in
the memory 504. The collection of microcontrollers may include, for
example: a video controller to receive, send, and process video
signals related to display 114 and video camera 112; an audio
processor to receive, send, and process audio signals related to
loudspeaker 116 and microphone array 118; and a high-level
controller to provide overall control. Portions of memory 504 (and
the instructions therein) may be integrated with processor 500. As
used herein, the terms "audio" and "sound" are synonymous and
interchangeable.
[0046] In a distributed processor embodiment, endpoint controller
120 is a distributed processor, including, but not limited to, (i)
an audio processor for the microphone array 118 to determine audio
angle of arrival of a sound source (as discussed below), and (ii) a
video coder/decoder (i.e., codec) that is also configured to
analyze the content of the video sources received by the endpoint
104.
[0047] The memory 504 may include read only memory (ROM), random
access memory (RAM), magnetic disk storage media devices, optical
storage media devices, flash memory devices, electrical, optical,
or other physical/tangible (e.g., non-transitory) memory storage
devices. Thus, in general, the memory 504 may comprise one or more
computer readable storage media (e.g., a memory device) encoded
with software comprising computer executable instructions and when
the software is executed (by the processor 500) it is operable to
perform the operations described herein. For example, the memory
504 stores or is encoded with software instructions for Source
Display Positioning Module 506 to perform operations described
herein for determining the spatial relationship between camera 112
and display devices 114 and determining which display device 114
will display a live video feed. Source Display Positioning Module
506 also includes an Audio Analysis Module 508 and an Image
Analysis Module 510. Audio Analysis Module 508 may determine the
angle of arrival of a sound source as received by the microphone
array 118. Image Analysis Module 510 may evaluate the content of
video sources received by the video conference endpoint 104 and
determine which display device 114 will display a received video
source based on the information acquired by the Audio Analysis
Module 508.
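By way of illustration only, the division of labor described above might be sketched in Python as follows; the class and method names are hypothetical assumptions, as the disclosure does not prescribe any particular software structure:

```python
class SourceDisplayPositioningModule:
    """Illustrative sketch of Source Display Positioning Module 506.

    All identifiers here are assumptions for exposition; the disclosure
    only specifies the division of labor between the two sub-modules.
    """

    def __init__(self, audio_analysis, image_analysis):
        self.audio_analysis = audio_analysis   # Audio Analysis Module 508
        self.image_analysis = image_analysis   # Image Analysis Module 510

    def assign_video_source(self, video_source, display_devices):
        # Audio Analysis Module 508: angle of arrival of sound sources
        # as received by microphone array 118.
        angles = self.audio_analysis.angles_of_arrival()
        # Image Analysis Module 510: evaluate the content of the received
        # video source, then choose a display device using the acquired
        # spatial information.
        content = self.image_analysis.evaluate_content(video_source)
        return self.image_analysis.choose_display(
            display_devices, angles, content)
```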
[0048] With reference to FIG. 6, depicted is an example of endpoint
user control device 204. The endpoint user control device 204 may
have a display 600, such as a touchscreen display. The display 600
of the endpoint user control device 204 may be configured to
present rectangular representations 602(1)-602(3) of the display
devices 114(1)-114(3) operable at the endpoint 104. Thus, the
display 600 of the endpoint user control device 204 may display a
three-dimensional representation of the conference room 200. As
illustrated in the example of FIG. 6, two display representations
602(1), 602(2) are disposed next to each other and are facing the
third display representation 602(3). Thus, the conference room 200
may contain two display devices 114(1), 114(2) on one side of the
room 200 and a third display device 114(3) on an opposite side of
the room 200, where the third display device 114(3) faces the other
two display devices 114(1), 114(2). A user may touch or tap the
touch screen display 600 of the endpoint user control device 204 at
the location of one of the display representations 602(1)-602(3) to
control (e.g., display content, volume control, etc.) the display
device 114(1)-114(3) represented by the selected display
representation 602(1)-602(3). Furthermore, the display 600 of the
endpoint user control device 204 may further present other controls
and functions 604 at the bottom of the display 600.
[0049] With reference to FIGS. 7A and 7B, depicted are audio
outputs of a loudspeaker 116 being detected by the microphone array
118 integrated with a camera 112 along a horizontal plane (FIG. 7A)
and a vertical plane (FIG. 7B). The microphone array 118 detects
audio outputs by the loudspeaker 116 and determines relative angles
of the loudspeaker 116 originating the audio output with reference,
or in relation to, the direction A in which the camera is facing
(e.g., a normal of the camera 112). As illustrated in FIG. 7A,
audio outputs detected by the microphone array 118 and originating
from a loudspeaker 116 disposed to the right of the camera 112 may
be given an angular measurement of θ, while audio outputs detected
by the microphone array 118 and originating from a loudspeaker 116
disposed to the left of the camera 112 may be given an angular
measurement of -θ. Thus, the angular measurements of θ and -θ
represent the azimuth angles of the detected audio output with
respect to the normal A of the camera 112. As illustrated in FIG.
7B, audio outputs detected by the microphone array 118 and
originating from a loudspeaker 116 disposed above the camera 112
may be given an angular measurement of φ, while audio outputs
detected by the microphone array 118 and originating from a
loudspeaker 116 disposed below the camera 112 may be given an
angular measurement of -φ. Thus, the angular measurements of φ and
-φ represent the elevation angles of the detected audio output with
respect to the normal A of the camera 112.
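As a minimal sketch of this sign convention, the following Python fragment converts a direction-of-arrival unit vector, expressed in a camera-centered frame whose z-axis is the camera normal A, into azimuth and elevation angles. The coordinate frame and function name are assumptions; the disclosure does not specify how the angles are computed from the array signals:

```python
import math

def doa_to_angles(x, y, z):
    """Map a direction-of-arrival vector to (azimuth, elevation) in degrees.

    Assumed frame (not specified in the disclosure): +z along the camera
    normal A into the room, +x to the camera's right, +y upward. Under the
    convention of FIGS. 7A and 7B, sources right of the camera give a
    positive azimuth θ and sources above it a positive elevation φ.
    """
    azimuth = math.degrees(math.atan2(x, z))                    # θ
    elevation = math.degrees(math.atan2(y, math.hypot(x, z)))   # φ
    return azimuth, elevation

# A loudspeaker up and to the left of the camera yields a negative
# azimuth and a positive elevation under this convention.
print(doa_to_angles(-0.5, 0.3, 1.0))
```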
[0050] With reference to FIG. 8 and continued reference to FIGS.
3A-3D, 4A, 4B, 5, 7A, and 7B, there is depicted a flowchart of an
example method 800 of determining the spatial relationship between
a display device 114 with an integrated loudspeaker 116 and a
camera 112 with an integrated microphone array 118 based on the
audio generated by the loudspeaker 116. Initially, at 805, the
microphone array 118 receives a new frame of audio samples from a
loudspeaker 116. The endpoint controller 120 may be configured to
cause the loudspeaker 116 to generate an audio output and the
microphone array 118 is configured to detect the audio output. At
810, the endpoint controller 120 begins determining the azimuth
(θ) and elevation (φ) angles to the loudspeaker 116 from
the microphone array 118, while also triangulating a distance (r)
from the loudspeaker 116 generating the audio outputs detected by
the microphone array 118.
[0051] At 815, the endpoint controller 120 determines whether the
audio output detected by the microphone array 118 originates from a
location (e.g., a loudspeaker 116) that is less than a
predetermined distance (e.g., three meters) away from the
microphone array 118. If it is determined at 815 that the
loudspeaker 116 is less than the predetermined distance (three
meters) away from the microphone array 118, then the endpoint
controller 120 continues to determine the azimuth and elevation
angles of the detected audio output with respect to the microphone
array 118 at 820 and 850. However, if, at 815, the endpoint
controller 120 determines that the detected audio output is not
less than the predetermined distance (three meters) away from the
microphone array 118, then the endpoint controller 120 skips
determining the azimuth and elevation angles of the detected audio,
and, at 895, does not update the picture in picture positioning of
the live video feed 314. When the detected audio output originates
more than the predetermined distance (three meters) from the
microphone array 118, the positioning of the live video feed 314
may not be updated because the live video feed 314 may already be
disposed in an optimized position. However, other examples of
detected audio outputs that originate from more than three meters
from the microphone array 118 include audio outputs that originate
from external sources (e.g., talking participants 106, participant
106 devices, etc.) and audio outputs that originated from a
loudspeaker 116 but reflected off of the floor and/or walls.
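A sketch of steps 805-815 under these assumptions might look like the following, with the array-processing estimator left abstract (the disclosure does not specify it) and the three-meter value taken from the example above:

```python
MAX_LOUDSPEAKER_DISTANCE_M = 3.0  # predetermined distance from the example

def process_audio_frame(frame, estimate_doa):
    """Steps 805-815 as a sketch: estimate (θ, φ, r) for a new frame of
    audio samples, then gate out sources too distant to be a loudspeaker
    co-located with a display device.

    `estimate_doa` is an assumed callable returning
    (azimuth_deg, elevation_deg, distance_m) for the frame.
    """
    theta, phi, r = estimate_doa(frame)        # step 810
    if r >= MAX_LOUDSPEAKER_DISTANCE_M:        # step 815: talker/reflection?
        return None                            # step 895: leave layout as-is
    return theta, phi, r                       # continue to steps 820/850
```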
[0052] After determining that the detected audio output originates
from a location less than the predetermined distance (three meters)
away from the microphone array 118 (e.g., from a loudspeaker 116
disposed less than three meters from the microphone array 118),
then at 820, the endpoint controller 120 calculates whether the
audio output detected by the microphone array has an elevation
angle φ greater than 75 degrees. If, at 820, the determined
elevation angle φ is greater than 75 degrees, then, at 825, the
endpoint controller 120 increases a probability value that the
loudspeaker 116 is above the camera 112 (P(above)). If, at 820, the
determined elevation angle φ is not greater than 75 degrees,
then, at 830, the endpoint controller 120 decreases the probability
that the loudspeaker 116 is above the camera 112 (P(above)).
[0053] If the endpoint controller 120 decreases the probability
that the loudspeaker 116 is above the camera (P(above)), then, at
835, the endpoint controller 120 calculates whether the audio
output detected by the microphone array has an elevation angle
φ less than -75 degrees. If, at 835, the determined elevation
angle φ is less than -75 degrees, then, at 840, the endpoint
controller 120 increases the probability that the loudspeaker 116
is below the camera 112 (P(below)). If, at 835, the determined
elevation angle φ is not less than -75 degrees, then, at 845,
the endpoint controller 120 decreases the probability that the
loudspeaker 116 is below the camera 112 (P(below)).
[0054] At 850, the endpoint controller 120 calculates whether the
audio output detected by the microphone array has an azimuth angle
θ greater than 75 degrees. If, at 850, the determined azimuth
angle θ is greater than 75 degrees, then, at 855, the
endpoint controller 120 increases the probability that the
loudspeaker 116 is disposed to the right of the camera 112
(P(right)). If, at 850, the determined azimuth angle θ is not
greater than 75 degrees, then, at 860, the endpoint controller 120
decreases the probability that the loudspeaker 116 is disposed to
the right of the camera 112 (P(right)).
[0055] If the endpoint controller 120 decreases the probability
that the loudspeaker 116 is disposed to the right of the camera 112
(P(right)), then, at 865, the endpoint controller 120 calculates
whether the audio output detected by the microphone array 118 has
an azimuth angle θ less than -75 degrees. If, at 865, the
determined azimuth angle θ is less than -75 degrees, then, at
870, the endpoint controller 120 increases the probability that the
loudspeaker 116 is disposed to the left of the camera 112
(P(left)). If, at 865, the determined azimuth angle θ is not
less than -75 degrees, then, at 875, the endpoint controller 120
decreases the probability that the loudspeaker 116 is disposed to
the left of the camera 112 (P(left)).
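Steps 820 through 875 can be summarized in the following sketch, which mirrors the branching of the flowchart; the probability update step size is an assumption, since the disclosure does not give one:

```python
ANGLE_THRESHOLD_DEG = 75.0

def update_probabilities(p, theta, phi, step=0.1):
    """Steps 820-875 as a sketch. `p` is a dict holding P(above),
    P(below), P(right), and P(left); `theta` and `phi` are the measured
    azimuth and elevation in degrees. The increment/decrement size is
    an assumed value."""
    def up(key):
        p[key] = min(1.0, p[key] + step)

    def down(key):
        p[key] = max(0.0, p[key] - step)

    if phi > ANGLE_THRESHOLD_DEG:            # 820 -> 825
        up('above')
    else:                                    # 820 -> 830
        down('above')
        if phi < -ANGLE_THRESHOLD_DEG:       # 835 -> 840
            up('below')
        else:                                # 835 -> 845
            down('below')

    if theta > ANGLE_THRESHOLD_DEG:          # 850 -> 855
        up('right')
    else:                                    # 850 -> 860
        down('right')
        if theta < -ANGLE_THRESHOLD_DEG:     # 865 -> 870
            up('left')
        else:                                # 865 -> 875
            down('left')
    return p
```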
[0056] After all of the probabilities P(above), P(below), P(right),
P(left) have been calculated and it is verified that the
loudspeaker 116 is less than three meters away from the microphone
array 118, the endpoint controller 120, at 880, determines a
spatial relationship value S between the loudspeaker 116 generating
the audio output and the microphone array 118 by determining which
calculated probability P(above), P(below), P(right), P(left) has
the largest value. In one embodiment, the endpoint controller 120
may also disregard any of the probabilities P(above), P(below),
P(right), P(left) with lower values. At 885, the endpoint
controller 120 then determines whether or not the spatial
relationship value S is greater than a predetermined threshold
value. If, at 885, the endpoint controller 120 determines that the
spatial relationship value S is greater than the predefined
threshold, then, at 890, the endpoint controller 120 updates the
picture in picture positioning of the live video feed 314 so that
the live video feed 314 is positioned proximate to the camera 112
as illustrated in FIGS. 3A-3D, 4A, and 4B. However, if, at 885, the
endpoint controller 120 determines that the spatial relationship
value S is less than the predefined threshold, then, at 895, the
endpoint controller 120 does not update the picture in picture
positioning of the live video feed 314 because the live video feed
314 may already be disposed in an optimized position proximate to
the camera 112, as illustrated in FIGS. 3A-3D, 4A, and 4B.
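Steps 880 through 895 then reduce to comparing the largest of the four probabilities against a threshold, roughly as follows; the threshold value here is an assumed example, as the disclosure does not specify one:

```python
def decide_pip_update(p, s_threshold=0.6):
    """Steps 880-895 as a sketch: the spatial relationship value S is
    the largest of the four placement probabilities; the picture-in-
    picture position of live video feed 314 is updated only when S
    exceeds the threshold. The 0.6 threshold is an assumed value."""
    side, s = max(p.items(), key=lambda kv: kv[1])   # step 880
    if s > s_threshold:                              # step 885
        return side    # step 890: move the live feed toward this side
    return None        # step 895: keep the current position
```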
[0057] In another example, with reference to the conference rooms
200, 200(1), and 200(2) depicted in FIGS. 2A and 2B, and with
continued reference to FIGS. 7A and 7B, the endpoint controller 120
is configured to determine which of cameras 112(1)-112(3) is best
oriented to capture a participant 106 presenting or collaborating
on the user-interactive display device 114(4). In this example, the
endpoint controller 120 may utilize the microphone arrays
118(1)-118(3) integrated with the cameras 112(1)-112(3) to detect
audio outputs from the loudspeaker 116(4) of user-interactive
display device 114(4), which is configured as a whiteboard or other
similar presentation/display device, in order to determine the
spatial relationship between the cameras 112(1)-112(3) and the
user-interactive display device 114(4). The endpoint controller 120
may, from the detected audio output, calculate the azimuth and
elevation angles, as well as the distance, of the loudspeaker
116(4) with respect to each of the microphone arrays
118(1)-118(3). As previously explained, because the loudspeaker
116(4) is integrated with the user-interactive display device
114(4) and because the microphone arrays 118(1)-118(3) are
integrated with the respective cameras 112(1)-112(3), the
calculated distance and azimuth and elevation angles also represent
the spatial relationship of the user-interactive display device
114(4) with respect to the respective cameras 112(1)-112(3).
However, unlike the example method of FIG. 8, where the endpoint
controller 120 determines which display device
114(1)-114(4) is best utilized to display a video source based on a
calculated short distance (e.g., less than 3 meters from one of the
cameras 112(1)-112(4)) and calculated large azimuth and elevation
angles (e.g., greater than 75 degrees with respect to the normal A
of one of the cameras 112(1)-112(4)), the endpoint controller 120
may assign the function of displaying the presenting participant
106 standing at or next to the user-interactive display device
114(4) to a particular one of the cameras 112(1)-112(3) that is
calculated to be a large distance from the user-interactive display
device 114(4) and that has relatively small azimuth and elevation
angles with respect to the user-interactive display device 114(4).
When it is calculated that user-interactive display device 114(4)
is greater than a predetermined distance from one of the cameras
112(1)-112(3), and that the user-interactive display device 114(4)
is oriented with relatively small azimuth and elevation angles
(i.e., less than or equal to a predetermined angle) with respect to
the normal A of one of the cameras 112(1)-112(3), the endpoint
controller 120 may determine that a particular one of
the cameras 112(1)-112(3) is both pointed in the general direction
of the user-interactive display device 114(4) (e.g., the normal A
of one of the cameras 112(1)-112(3) extends in the general
direction of the user-interactive display device 114(4)) and is
disposed within the conference room 200 at a location opposite of
the user-interactive display device 114(4). As illustrated in FIG.
2A, the endpoint controller 120 of conference room 200 may
determine that camera 112(2) has an acceptable field of view of
the user-interactive display device 114(4), where camera 112(2) is
capable of capturing the user-interactive display device 114(4),
items displayed on the user-interactive display device 114(4), and
any participant that may be presenting or collaborating on the
user-interactive display device 114(4). After making this
determination, the endpoint controller 120 may assign camera 112(2)
the function of capturing the user-interactive display device
114(4) and any participant that may be present at the
user-interactive display device 114(4) such that the field of view
of the camera 112(2) can be transmitted to another video conference
endpoint.
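For illustration, the camera-selection rule of this example may be sketched as follows, where the specific distance and angle thresholds are hypothetical placeholders rather than values taken from the description:

```python
# Sketch of the camera-selection rule in this example: choose a
# camera that is far from the user-interactive display device and
# whose normal points roughly toward it (small azimuth/elevation).
# MIN_DISTANCE_M and MAX_ANGLE_DEG are illustrative placeholders.

MIN_DISTANCE_M = 2.5
MAX_ANGLE_DEG = 30.0

def pick_whiteboard_camera(measurements):
    """measurements: list of (camera_id, distance_m, azimuth_deg, elevation_deg)."""
    candidates = [
        (dist, cam) for cam, dist, az, el in measurements
        if dist > MIN_DISTANCE_M
        and abs(az) <= MAX_ANGLE_DEG
        and abs(el) <= MAX_ANGLE_DEG
    ]
    # Prefer the farthest qualifying camera for the widest view.
    return max(candidates)[1] if candidates else None

readings = [("112(1)", 1.2, 80.0, 5.0),
            ("112(2)", 4.1, 10.0, 3.0),
            ("112(3)", 3.0, 55.0, 4.0)]
print(pick_whiteboard_camera(readings))  # -> "112(2)"
```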
[0058] With continued reference to FIGS. 2A, 2B, and 8, once the
spatial relationship (e.g., the azimuth (θ) angles, the
elevation (φ) angles, and the distance (r)) between each of the
display devices 114(1)-114(4) and each of the cameras 112(1)-112(4)
has been calculated, the endpoint controller 120(1) of the first
conference room 200(1) can assign inbound video sources (e.g.,
video sources received by the first conference room 200(1) from the
second conference room 200(2)) to the display devices 114(1)-114(4)
within the first conference room 200(1). The assignment of the
inbound video sources to the display devices 114(1)-114(4) may be
based on the spatial relationship of each of the display devices
114(1)-114(4) with each of the cameras 112(1)-112(4), and the
respective locations of each of the display devices 114(1)-114(4)
and each of the cameras 112(1)-112(4) within the first conference
room 200(1). Similarly, the endpoint controller 120(1) may also
assign video outputs from the cameras 112(1)-112(4) to an outgoing
video feed sent from the first conference room 200(1) to the second
conference room 200(2). As previously explained, the cameras
112(1)-112(4) may be configured and operated to capture video of
different views or scenes of multiple participants 106 seated
around a table 202(1) opposite from or facing (i.e., in front of)
the cameras 112(1)-112(4) (and display devices 114(1)-114(4)). The
cameras 112(1)-112(4) may also be configured and operated to
capture video of participants 106 disposed around particular
display devices 114(1)-114(4). The assignment of the captured video
may be based on the data representing the spatial relationship of
each of the cameras 112(1)-112(4) with each of the display devices
114(1)-114(4), and the respective location of each of the cameras
112(1)-112(4) within the first conference room 200(1). Each of the
video outputs may be tagged or labeled with metadata indicating the
respective field of view of each of the cameras 112(1)-112(4) that
captured each of the video outputs. This tagged metadata may be
utilized by a remote conference room, such as the second conference
room 200(2), to further determine how to assign the inbound video
sources of the second conference room 200(2) to the display devices
114(5)-114(8) of the second conference room 200(2). The tagged
metadata is also useful for the remote conference rooms, such as
the second conference room 200(2), when the inbound video source
simultaneously includes video outputs from more than one camera
112(1)-112(4) of the first conference room 200(1); this is referred
to as multi-stream. That is, a video feed may include multiple
video streams.
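One possible sketch of such tagging, assuming a simple dictionary-based metadata representation whose field names ("camera", "view", "contains") are hypothetical:

```python
# Sketch of tagging each outgoing camera stream with field-of-view
# metadata. The dict layout and field names are illustrative
# assumptions, not a defined wire format.

outgoing_feed = {
    "endpoint": "200(1)",
    "streams": [
        {"camera": "112(2)", "view": "room-overview",
         "contains": ["table 202(1)", "participants"]},
        {"camera": "112(3)", "view": "whiteboard",
         "contains": ["display 114(4)", "presenter"]},
    ],
}

# A remote endpoint such as 200(2) can inspect the tags to route each
# stream of the multi-stream feed to a suitable display device.
for stream in outgoing_feed["streams"]:
    print(stream["camera"], "->", stream["view"])
```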
[0059] With reference to FIG. 9, depicted is a flowchart of an
example method 900 for utilizing the loudspeakers 116 and the
microphone array 118 to determine spatial relationship between the
camera 112 and the display devices 114 disposed within a conference
room 200. Reference is also made to FIGS. 3A-3D for purposes of the
description of FIG. 9. At 905, the endpoint controller 120 plays an
audio output out of each of the display devices 114 that contain a
loudspeaker 116. The display devices 114 may be connected to the
endpoint 104 via a high-definition multimedia interface (HDMI)
cable, which is capable of transporting both video and audio
signals over the same cable. In the event that an audio signal is a
multi-channel audio signal, the loudspeaker 116 integrated with the
display device 114 may output a separate audio output for each
channel of the audio signal. At 910, the microphone array 118
integrated with the camera 112 detects the audio outputs to
determine the spatial location (e.g., above, below, side, opposite,
etc.) of the display device 114 with respect to the camera 112. The
steps at 905 and 910 may be repeated for each of the display
devices 114 and for each of the cameras 112 located within the
conference room 200. The detection and determination of the spatial
relationship between the cameras 112 and the display devices 114
may be completed as described above with respect to FIG. 8.
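A minimal sketch of the calibration sweep at steps 905 and 910; play_tone() and localize() are hypothetical stand-ins for the endpoint's actual audio playback and localization routines:

```python
# Sketch of the calibration sweep at steps 905-910: play a test tone
# from each display's loudspeaker and record the localization result
# at each camera's microphone array. play_tone() and localize() are
# hypothetical stand-ins for the endpoint's audio APIs.

def calibrate(displays, cameras, play_tone, localize):
    """Return {(display, camera): (azimuth, elevation, distance)}."""
    model = {}
    for display in displays:
        play_tone(display)               # emit audio from display's loudspeaker
        for camera in cameras:
            model[(display, camera)] = localize(camera)
    return model

# Example with stubbed hardware functions:
displays, cameras = ["114(1)", "114(2)"], ["112(1)"]
stub_play = lambda d: None
stub_localize = lambda c: (12.0, -4.0, 2.1)  # (azimuth, elevation, distance)
print(calibrate(displays, cameras, stub_play, stub_localize))
```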
[0060] Once all of the spatial relationship and placement data has
been collected, the endpoint controller 120, at 915, builds an internal
model of the placement and relationship of the display devices 114
and cameras 112 in the conference room 200. At 920, the endpoint
controller 120 then configures the display device roles and rules
for presenting video and audio content based on the derived model
and the content of the video and audio sources. For example, if,
based on the derived model, the camera 112 is adjacent to (e.g.,
above, below, or to the side of) a display device 114 and the content of
one of the video and audio sources is a live video feed 314 of
another endpoint 104, then the endpoint controller 120 may
determine that that display device 114 should receive and display
the live video feed 314. In another example, if, based on the
derived model, the camera 112 is disposed opposite of the display
device 114 (e.g., the camera 112 is across the conference room 200
from a user-interactive display device 114), the endpoint
controller 120 may determine that that camera 112 should be
utilized to capture the presentation presented on that display
device 114. Finally, at 925, the endpoint controller 120 presents
the three-dimensional model of the conference room 200 on the
display 600 of the endpoint user control device 204, illustrated in
FIG. 6.
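The role-and-rule configuration at step 920 may be sketched, under simplified assumptions, as a lookup over the derived model; the relation labels and role strings below are illustrative only:

```python
# Sketch of the role rules at step 920: a display adjacent to a
# camera shows the remote live feed; a camera opposite a
# user-interactive display captures the presenter. The relation
# labels and role strings are illustrative assumptions.

def assign_roles(model):
    """model: {(display, camera): relation}, relation in
    {'above', 'below', 'side', 'opposite'}."""
    roles = {}
    for (display, camera), relation in model.items():
        if relation in ("above", "below", "side"):
            roles[display] = "show remote live video feed"
        elif relation == "opposite":
            roles[camera] = f"capture presenter at {display}"
    return roles

model = {("114(1)", "112(1)"): "below", ("114(4)", "112(2)"): "opposite"}
print(assign_roles(model))
```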
[0061] With reference to FIG. 10, illustrated is a flowchart of a
method 1000 for determining the spatial relationship between a
single camera 112 and a single display device 114, where the camera
112 is disposed either above or below the display device 114.
Reference is also made to FIGS. 2A, 2B, 3A-3D, 4A, 4B, 5, 8, and
11A-11C for purposes of the description of FIG. 10. The method 1000
includes a plurality of processes for determining the location of
the camera 112 with respect to the display device 114. These
processes include using the microphone array 118 integrated with
the camera 112 to determine the location of the loudspeaker 116
integrated with the display device 114, as described above, and
detecting the faces of participants 106 seated around the table 202
within the conference room 200. These techniques may be performed
together, or one may be used in lieu of the other when the
conditions for one technique are not sufficient for it to
adequately determine the position of the camera 112 with respect to
the display device 114.
[0062] At 1005, the endpoint controller 120 utilizes facial
detection software, in conjunction with the camera 112 of a
conference room 200 of an endpoint 104, to detect the faces of
participants 106 positioned around a table 202 within a conference
room 200. Facial detection techniques are well known in the art and
are not described in detail herein. At 1010, the endpoint
controller 120 analyzes the image captured by the camera 112 and
plots a line passing through the detected faces around the table
202, as shown in FIGS. 11A-11C and described hereinafter. The line
passing through the detected faces may be a parabola of the form
y = ax^2 + bx + c that is fitted to the locations of the detected
faces using any conventional method, including, but not limited to,
the method of least squares. When the value of "a" in the equation
y = ax^2 + bx + c is greater than zero, the line may have a
curvature that opens upward, like that of line B illustrated in
FIG. 11A. If "a" has a value of zero, the line may be a straight
line, like that of line C illustrated in FIG. 11B. However, when
"a" has a value less than zero, the line may have a downward-facing
curvature, like that of line D illustrated in FIG. 11C.
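A minimal sketch of the curve fit at 1010 using least squares (here via numpy.polyfit, one conventional choice); the face coordinates are illustrative, and the sketch assumes y increases upward as in the figures:

```python
# Sketch of the least-squares parabola fit at 1010, using
# numpy.polyfit as one conventional method. The face coordinates
# are illustrative (x, y) centers of detected faces, and the sketch
# assumes y increases upward, as in the figures; in image pixel
# coordinates (y increasing downward) the sign test would flip.
import numpy as np

faces_x = np.array([80.0, 240.0, 400.0, 560.0, 720.0])
faces_y = np.array([210.0, 180.0, 170.0, 185.0, 215.0])

a, b, c = np.polyfit(faces_x, faces_y, deg=2)  # fit y = a*x^2 + b*x + c

if a >= 0:
    print("line opens upward or is straight: camera likely below display")
else:
    print("line opens downward: camera likely above display")
```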
[0063] After the endpoint controller 120 plots a line passing
through the detected faces of the participants 106 within a
conference room 200, the endpoint controller 120, at 1015,
determines whether or not the value of "a" in the equation
y = ax^2 + bx + c is greater than or equal to zero. If, at 1015, the
value of "a" is found to be greater than or equal to zero, then, at
1020, the endpoint controller 120 can make the determination that
the camera 112 is disposed at a lower height within the conference
room 200 and is thus likely disposed below the display device 114
in the conference room 200. FIGS. 11A and 11B illustrate the field
of view of the camera 112 when the camera 112 is mounted below the
display device 114. FIG. 11A illustrates the field of view of the
camera 112 and the curved line B, where the value of "a" is greater
than zero. The detected faces of the participants 106 seated around
the conference table 202 in FIG. 11A are aligned with the upwardly
curved line B. FIG. 11B illustrates the field of view of the camera
112 and horizontal line C, where the value of "a" is equal to zero.
The detected faces of the participants 106 seated around the
conference table 202 in FIG. 11B are aligned with horizontal line
C. The difference between line B and line C may demonstrate that
the camera 112 in FIG. 11A is disposed at a lower height than the
camera 112 in FIG. 11B. As the height of the camera 112 increases,
the curvature of the line through the detected faces of the
participants transitions from an upwardly curved line to a
downwardly curved line. Thus, FIG. 11B illustrates a field of view
of the camera 112 when the camera 112 is disposed more closely to
being equal in height, or is disposed equal in height, with the
display device 114 than that of FIG. 11A.
[0064] However, if, at 1015, the value of "a" is found to be less
than zero, then, at 1025, the endpoint controller 120 can make the
determination that the camera 112 is disposed at a higher height
within the conference room 200 and is thus likely disposed above
the display device 114 in the conference room 200. FIG. 11C
illustrates the field of view of the camera 112 when the camera 112
is mounted above the display device 114. FIG. 11C illustrates the
field of view of the camera 112 and the curved line D, where the
value of "a" is less than zero. The detected faces of the
participants 106 seated around the conference table 202 in FIG. 11C
are aligned with the downwardly curved line D.
[0065] In some embodiments, in order to rely on the facial
detection method, the detected faces of the participants 106
disposed in the middle of the field of view of the camera 112 need
to be smaller than the detected faces of the participants 106
disposed on the sides of the field of view of the camera 112. As
illustrated in FIGS. 11A-11C, the detected faces disposed on the
sides of the field of view of the camera 112 are larger than the
detected faces disposed centrally in the field of view of the
camera 112. In addition, the least-squares error during curve
fitting needs to be below a certain predetermined threshold. If the
least-squares error exceeds the predetermined threshold, the
seating arrangement around the conference room table 202 may not be
conducive to the facial detection method (e.g., a large conference
room table 202 may not be disposed in the middle of the conference
room 200). In other embodiments, the reference value against which
the value of "a" from the equation y = ax^2 + bx + c is compared
may be greater than or less than zero.
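These validity checks may be sketched as follows; the middle-third/side-thirds split and the residual threshold are illustrative assumptions:

```python
# Sketch of the validity checks described above: central faces must
# appear smaller than faces at the sides, and the least-squares
# residual must stay below a threshold. The thirds-based split and
# MAX_RESIDUAL are illustrative assumptions.
import numpy as np

MAX_RESIDUAL = 50.0

def facial_method_applicable(faces_x, faces_y, face_sizes):
    """faces_*: arrays of face centers; face_sizes: apparent face sizes."""
    order = np.argsort(faces_x)
    sizes = np.asarray(face_sizes)[order]
    n = len(sizes)
    central = sizes[n // 3 : 2 * n // 3].mean()   # middle third of the view
    sides = np.concatenate([sizes[: n // 3], sizes[2 * n // 3 :]]).mean()
    if central >= sides:
        return False                              # perspective cue missing
    _, residuals, *_ = np.polyfit(faces_x, faces_y, 2, full=True)
    return bool(residuals.size == 0 or residuals[0] < MAX_RESIDUAL)

print(facial_method_applicable(
    [80, 240, 400, 560, 720],
    [210, 180, 170, 185, 215],
    [60, 48, 40, 50, 62]))
```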
[0066] Returning to FIG. 10, in addition to utilizing facial
detection software at 1005, or instead of using facial detection
software (e.g., when participants 106 are not located within the
conference room 200, not enough participants 106 are located within
the conference room 200, etc.), the endpoint controller 120 may, at
1030, as described above with respect to FIGS. 7A, 7B, 8, and 9,
track audio outputs, using the microphone array 118 integrated with
the camera 112, to localize the position of the loudspeaker 116
integrated with the display device 114. At 1035,
the endpoint controller 120 determines, based on the audio
tracking, if the camera 112 is mounted above or below the display
device 114.
[0067] Once the information has been collected regarding whether
the camera 112 is mounted above or below the display device 114,
the endpoint controller 120 uses the information to update the
layout of the screen 310 of the display device 114. As previously
explained and illustrated with regard to FIG. 3A, if it is
determined that the camera 112 is mounted above the display device
114, then the endpoint controller 120 may position the live video
feed 314 on the screen 310 proximate to the top edge 300 of display
device 114. Conversely, as previously explained and illustrated
with regard to FIG. 3B, if it is determined that the camera 112 is
mounted below the display device 114, then the endpoint controller
120 may position the live video feed 314 on the screen 310
proximate to the bottom edge 302 of display device 114.
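A minimal sketch of this layout rule, using hypothetical normalized screen coordinates (y = 0.0 at the top edge 300, y = 1.0 at the bottom edge 302):

```python
# Sketch of the picture-in-picture placement rule: anchor the live
# video feed 314 near the screen edge closest to the camera 112.
# The normalized coordinates are illustrative.

def pip_anchor(camera_above: bool):
    """Return a normalized (x, y) anchor for the live video feed."""
    return (0.5, 0.0) if camera_above else (0.5, 1.0)

print(pip_anchor(camera_above=True))   # near the top edge 300
print(pip_anchor(camera_above=False))  # near the bottom edge 302
```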
[0068] With reference to FIG. 12, illustrated is a flowchart of a
method 1200 performed by the endpoint controller 120 for assigning
video and audio sources to the display devices 114 located within a
conference room. Reference is also made to FIGS. 2A, 2B, 3A-3D,
4A-4B, and 5 for purposes of the description of FIG. 12. At 1205,
the endpoint controller 120 detects, at a microphone array 118
having a predetermined physical relationship with respect to a
camera 112, the audio emitted from each of one or more loudspeakers
116, where each of the one or more loudspeakers 116 has a
predetermined physical relationship with respect to at least one of
one or more display devices 114 in a conference room. Thus, the
endpoint controller 120 enables the microphone array 118 to detect
the audio that is played, generated, or emitted from each
loudspeaker 116 of each display device 114 located within a
conference room 200. At 1210, the endpoint controller 120 utilizes
known triangulation and audio localization algorithms to determine
the direction and distance from the microphone array 118 to each of
the loudspeakers 116 that output audio received by the microphone
array 118. In other words, from the detected audio, the endpoint
controller 120 may determine the spatial relationship between the
microphone array 118 and the loudspeakers 116 within a conference
room 200. Because, as previously explained, the microphone array
118 has a known predetermined physical relationship with respect to
the camera 112 and each of the loudspeakers 116 has a known
predetermined physical relationship with respect to a corresponding one of the
display devices 114, determining the spatial relationship between
the microphone array 118 and the loudspeakers 116 also determines
the spatial relationship between the camera 112 and the display
devices 114.
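For illustration, the composition of the known mounting offsets with the localized loudspeaker position may be sketched as follows; all vectors are illustrative values in meters in a shared coordinate frame:

```python
# Sketch of the inference at 1210: the microphone array localizes
# each loudspeaker, and the fixed mounting offsets (mic-to-camera
# and loudspeaker-to-display) are then applied to obtain the
# camera-to-display vector. All values are illustrative, in meters.
import numpy as np

mic_to_camera = np.array([0.00, -0.05, 0.00])       # known mounting offset
speaker_to_display = np.array([0.00, 0.10, 0.00])   # known mounting offset
mic_to_speaker = np.array([1.80, 0.40, 0.30])       # from audio localization

# camera -> display = -mic_to_camera + mic_to_speaker + speaker_to_display
camera_to_display = -mic_to_camera + mic_to_speaker + speaker_to_display
distance = np.linalg.norm(camera_to_display)
azimuth = np.degrees(np.arctan2(camera_to_display[0], camera_to_display[2]))
print(camera_to_display, round(distance, 2), round(azimuth, 1))
```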
[0069] After determining the spatial relationship between the
camera 112 and the display devices 114, the endpoint controller 120
may then assign video sources to each of the display devices 114
based on the data representing the spatial relationship,
the content of the video sources, and the use of the camera 112.
For example, if it is determined that a camera is disposed adjacent
to a display device 114 (e.g., mounted directly above or directly
below the display device 114), then that display device 114 may
receive a live video feed of another remote video conference
endpoint operated by other participants 106 while the camera 112
records a live video feed of the conference room 200 in which it is
disposed. The live video feed of the conference room 200 may be
sent to the remote video conference endpoint for viewing by
participants at that remote video conference endpoint. In another
example, as previously explained, if it is determined that another
display device, such as a user-interactive display device, is
disposed opposite a camera 112, that camera 112 may be used to
capture the participant 106 presenting or collaborating on the
user-interactive display device 114.
[0070] Techniques presented herein automatically determine the
roles of the cameras and the display devices of a video conference
endpoint within a conference room when participating in a video
conference. The detection of the placement of components of a video
conference endpoint/system in a room is automated using spatial
detection, by a microphone array, of audio signals emitted by the
components, together with image analysis, to optimize screen usage
and visualization of the room for simpler control. No additional
equipment is needed. Rather, the equipment to be installed as part
of the video conference endpoint is used for the process. This
makes it easy to add and remove components, such as a digital
whiteboard, to/from that room. In addition to automatic setup, these
techniques can be useful in suggesting layouts and room
configurations in a semi-manual wizard-guided type of setup
procedure.
[0071] Specifically, generated audio from each of the
loudspeakers of the one or more display devices is detected by the
microphone arrays of each of the cameras to determine the spatial
relationship between each of the cameras and each of the
loudspeakers within a conference room. The determined spatial
relationship, along with the content of the video and audio sources
of the video conference, may be used by the controller of the video
conference endpoint to automatically determine the roles of the one
or more display devices and each of the cameras of the conference
room. This eliminates a need to manually set up each video
conference room, and eliminates the need to have participants of
the video conference manually switch the roles of the display
devices and the cameras during a video conference.
[0072] In summary, in one form, a method is provided comprising:
detecting, at a microphone array having a predetermined physical
relationship with respect to a camera, the audio emitted from each
of one or more loudspeakers, each loudspeaker having a
predetermined physical relationship with respect to at least one of
one or more display devices in a conference room; and generating
data representing a spatial relationship between each of the one or
more display devices and the camera based on the detected
audio.
[0073] In another form, an apparatus is provided comprising: a
camera configured to capture video of a field of view; a microphone
array having a predetermined physical relationship with respect to
the camera, the microphone array configured to transduce audio
received at the microphone array; and a processor to control the
camera and the microphone array to: cause the microphone array to
detect audio emitted from one or more loudspeakers having a
predetermined physical relationship with respect to at least one of
one or more display devices in a conference room; and generate data
representing a spatial relationship between each of the one or more
display devices and the camera based on the detected audio.
[0074] In yet another form, a (non-transitory) processor readable
medium is provided. The medium stores instructions that, when
executed by a processor, cause the processor to: detect, at a
microphone array having a predetermined physical relationship with
respect to a camera, the audio emitted from each of one or more
loudspeakers, each loudspeaker having a predetermined physical
relationship with respect to at least one of one or more display
devices in a conference room; and generate data representing a
spatial relationship between each of the one or more display
devices and the camera based on the detected audio.
[0075] As described herein, the data representing the spatial
relationship may be used to assign one or more video sources of an
incoming video feed from a remote conference room to corresponding
ones of the one or more display devices. Similarly, the data
representing the spatial relationship may be used to assign video
outputs from a plurality of cameras in a conference room to an
outgoing video feed to be sent to a remote conference room.
[0076] The above description is intended by way of example only.
Various modifications and structural changes may be made therein
without departing from the scope of the concepts described herein
and within the scope and range of equivalents of the claims.
* * * * *