U.S. patent application number 11/988897 was filed with the patent office on 2010-06-03 for arrangement and method for the recording and display of images of a scene and/or an object.
Invention is credited to Ronny Billert, Torma Ferenc, Daniel Fuessel, Jens Meichsner, David Reuss, Alexander Schmidt.
United States Patent Application 20100134599
Kind Code: A1
Billert; Ronny; et al.
June 3, 2010
Arrangement and method for the recording and display of images of a
scene and/or an object
Abstract
The invention relates to an arrangement and a method for
capturing and displaying images of a scene and/or an object. Said
arrangement and method are particularly suited to display the
captured images in a three-dimensionally perceptible manner. The
aim of the invention is to create a new possibility to take images
of real scenes and/or objects with as little effort as possible and
then autostereoscopically display the same in a three-dimensional
fashion from two or more perspectives. Said aim is achieved by
providing at least one main camera of a first camera type for
recording images, at least one satellite camera of a second camera
type for recording images, an image converting device which is
mounted behind the cameras, and a 3D image display device, the two
camera types differing from each other in at least one parameter. A
total of at least three cameras is provided. Also disclosed is a
method for transmitting 3D data.
Inventors: |
Billert; Ronny; (Erfurt,
DE) ; Ferenc; Torma; (Jena, DE) ; Fuessel;
Daniel; (Hohenleuben, DE) ; Reuss; David;
(Jena, DE) ; Schmidt; Alexander; (Erfurt, DE)
; Meichsner; Jens; (Jena, DE) |
Correspondence Address: |
FROMMER LAWRENCE & HAUG
745 FIFTH AVENUE - 10TH FL.
NEW YORK, NY 10151 US |
Family ID: |
38596678 |
Appl. No.: |
11/988897 |
Filed: |
October 29, 2007 |
PCT Filed: |
October 29, 2007 |
PCT NO: |
PCT/DE2007/001965 |
371 Date: |
January 15, 2008 |
Current U.S.
Class: |
348/48 ; 348/441;
348/E13.074 |
Current CPC
Class: |
H04N 13/194 20180501;
H04N 13/344 20180501; H04N 13/133 20180501; H04N 13/395 20180501;
H04N 19/597 20141101; H04N 2013/0081 20130101; H04N 13/332
20180501; H04N 13/25 20180501; H04N 13/243 20180501; H04N 13/282
20180501; H04N 13/286 20180501; H04N 13/302 20180501; H04N 13/239
20180501; H04N 13/111 20180501 |
Class at
Publication: |
348/48 ;
348/E13.074; 348/441 |
International
Class: |
H04N 13/02 20060101
H04N013/02 |
Foreign Application Data
Date |
Code |
Application Number |
Nov 22, 2006 |
DE |
10 2006 055 641.0 |
Claims
1-39. (canceled)
40. The arrangement for the recording of images of a scene and/or
an object and their display for spatial perception, comprising: at
least one main camera of a first camera type for the recording of
images; at least two satellite cameras of a second camera type for
the recording of images, with the camera types differing in at
least one parameter; an image conversion device, arranged
downstream of the cameras, that receives and processes the initial
image data, said image conversion device performing, among other
processes, a depth or disparity recognition employing only those
images recorded by cameras of the same camera type (by said at
least two satellite cameras), but not the remaining images; and a
3D image display device, connected to the image conversion device,
that displays the provided image data for spatial perception
without special aids, with the 3D image display device displaying
at least two views.
41. The arrangement as claimed in claim 40, wherein the two camera
types differ at least in the resolution of the images to be
recorded.
42. The arrangement as claimed in claim 40, wherein the two camera
types differ at least in the built-in imaging chip.
43. The arrangement as claimed in claim 40, wherein exactly one
main camera and two satellite cameras are provided.
44. The arrangement as claimed in claim 40, wherein exactly one
main camera and three satellite cameras are provided.
45. The arrangement as claimed in claim 40, wherein exactly one
main camera and five satellite cameras are provided.
46. The arrangement as claimed in claim 40, wherein the second
camera type has a lower resolution than the first camera type.
47. The arrangement as claimed in claim 43, wherein the main camera
is arranged between the satellite cameras.
48. The arrangement as claimed in claim 40, wherein at least one
partially transparent mirror is arranged in front of each of the
objectives of the main camera and all satellite cameras.
49. The arrangement as claimed in claim 44, wherein the center
points of the objectives of the three satellite cameras form a
triangle.
50. The arrangement as claimed in claim 49, wherein the triangle is
an isosceles triangle.
51. The arrangement as claimed in claim 49, wherein the center
point of the objective of the main camera is arranged inside said
triangle, with the triangle to be understood to include its
sides.
52. The arrangement as claimed in claim 44, wherein one satellite
camera and the main camera are optically arranged relative to each
other in such a way that both record an image on essentially the
same optical axis, for which purpose preferably at least one
partially transparent mirror is arranged between the two
cameras.
53. The arrangement as claimed in claim 52, wherein the two other
satellite cameras are arranged to form a straight line or a
triangle together with the satellite camera associated with the main
camera.
54. The arrangement as claimed in claim 40, wherein the image
conversion device generates at least two views of the scene or
object recorded, and that, for generating these at least two views,
the image conversion device employs, besides the depth or disparity
data recognized, the image recorded by the at least one main camera
and at least one more image recorded by the satellite cameras, but
not necessarily the images of all cameras provided.
55. The arrangement as claimed in claim 54, wherein one of the at
least three views generated is still equal to one of the input
images.
56. The arrangement as claimed in claim 40, wherein the main camera
or all main cameras, and all satellite cameras record with
frame-accurate synchronization at a tolerance of maximally 100
frames per 24 hours.
57. A method for the recording and display of images of a scene
and/or an object, comprising the following steps: generating at
least one n-tuple of images, with n>2, with at least two images
of the n-tuple having different resolutions; transferring the image
data to an image conversion device, in which then a rectification,
a color adjustment, a depth or disparity recognition and subsequent
generation of further views from the n or less than n images of the
said n-tuple and from the depth or disparity recognition data are
carried out, with at least one view being generated that is not
exactly equal to any of the n-tuple of images generated, and with
the image conversion device employing, for depth or disparity
recognition, only such images of the n-tuple that have the same
resolution; subsequently generating a combination of at least two
different views or images in accordance with the parameter
assignment of the 3D display of a 3D image display device, for
spatial presentation without special aids; and finally presenting
the combined 3D image on the 3D display.
58. The method as claimed in claim 57, wherein, for the depth or
disparity recognition, those images of equal resolution are
employed whose resolution has the lowest total number of pixels
compared with all other resolutions provided.
59. The method as claimed in claim 58, wherein, for depth
recognition, a stack structure is established by means of a
line-by-line comparison of the pre-processed initial image data of
an n-tuple, precisely, of those images of the n-tuple only that
have the same resolution, in such a way that first those lines of
the different images of an n-tuple which have the same Y coordinate
are placed in register on top of each other and then a first
comparison is made, the result of the comparison being saved in one
line in such a way that equal tonal values in register are saved,
whereas different tonal values are deleted, which is followed by a
displacement of the lines in opposite directions by specified
increments of preferably 1/4 to 2 pixels, the results after each
increment being saved in further lines analogously to the first
comparison; so that, as a result after the comparisons made for
each pixel, the Z coordinate provides the information about the
degree of displacement of the views relative to each other.
60. The method as claimed in claim 59, wherein, after the
establishment of the stack structure, an optimization is made in
such a way that ambiguities are eliminated, and/or a reduction of
the elements to an unambiguous height profile curve is carried
out.
61. The method as claimed in claim 59, wherein, after the
establishment of the stack structure or after the steps described
in claim 60, the depth is determined for at least three original
images of the n-tuple, preferably in the form of depth maps.
62. The method as claimed in claim 61, wherein, after transfer of
the original images of the n-tuple and the respective depths
appertaining to them, a reconstruction is carried out by inverse
projection of the views of the n-tuple into the stack space by
depth maps, so that the stack structure is reconstructed, and so
that again different views can be subsequently generated therefrom
by projection.
63. The method as claimed in claim 57, wherein the images generated
are transmitted to the image conversion device.
64. The method as claimed in claim 57, wherein all views generated
of each image by the image conversion device are transmitted to the
3D image display device.
65. The method as claimed in claim 61, wherein the original images
of the n-tuple with the respective depths appertaining to them are
transmitted to the 3D image display device, after which first the
reconstruction according to claim 62 is carried out.
66. The method as claimed in claim 57, wherein the images of the
n-tuple are generated by a 3D camera system.
67. The method as claimed in claim 57, wherein the images of the
n-tuple are generated by a computer.
68. The method as claimed in claim 61, wherein at least two depth
maps differing in resolution are generated.
69. A method for the transmission of 3D information for the purpose
of later display for spatial perception without special aids, on
the basis of at least two different views, comprising the steps of:
proceeding from at least one n-tuple of images, with n>2, which
characterize different angles of view of an object or a scene, with
at least two images of the n-tuple having different resolutions;
determining the depth for at least three images; and thereafter, at
least three images of the n-tuple, together with the respective
depth information, are transmitted in a transmission channel.
70. The method as claimed in claim 69, wherein the depth
information is in the form of depth maps.
71. The method as claimed in claim 69, wherein the n-tuple of
images is a quadruple of images (n=4), with three images preferably
having the same resolution, whereas the fourth image has a higher
resolution and preferably belongs to the images transmitted in the
transmission channel.
72. The method as claimed in claim 69, wherein at least two of the
three depth maps have different resolutions.
73. The method as claimed in claim 69, wherein the image data and
the depth information are generated in the MPEG-4 format.
74. The method as claimed in claim 69, wherein the depth
information is determined only from such images of the n-tuple that
have the same resolution.
75. The method as claimed in claim 74, wherein, from the depth
information determined, the depth also for at least one image of
higher resolution is generated.
76. The method as claimed in claim 57, wherein depth information
determined from images of the n-tuple that have the lowest
resolution provided is transformed into a higher resolution by way
of edge recognitions in the at least one image of higher
resolution.
77. The method as claimed in claim 57, wherein a great number of
n-tuples of images and appertaining depth information are processed
in succession, so that a spatial display of moving images is made
possible.
78. The method as claimed in claim 77, wherein the great number of
n-tuples of images is subjected to spatial and temporal
filtering.
79. A method of transmitting 3D information for the purpose of
subsequent display for spatial perception without special aids, on
the basis of at least two different views, comprising the steps of:
proceeding from at least one n-tuple of images with n>2, which
characterize different viewing angles of an object or a scene;
determining the depth for at least three images; and thereafter at
least three images of the n-tuple, together with the respective
depth information, are transmitted in a transmission channel.
80. The method of claim 79, wherein the depth information is in the
form of depth maps.
81. The method of claim 79, wherein the n-tuple of images is a
triple of images (n=3), with the three images having the same
resolution.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This is a national phase application of International
Application No. PCT/DE2007/001965, filed Oct. 29, 2007, which
claims priority of German Application No. 10 2006 055 641.0, filed
Nov. 22, 2006, the complete disclosures of which are hereby
incorporated by reference.
BACKGROUND OF THE INVENTION
[0002] a) Field of the Invention
[0003] The invention relates to an arrangement and a method for the
recording and display of images of a scene and/or an object,
suitable especially for the display of the recorded images for
spatial perception. The invention further relates to a method of
transmitting images for spatial perception.
[0004] b) Description of the Related Art
[0005] At present there exist essentially three basically different
methods, and the appertaining arrangements, for recording 3D image
information:
[0006] (1) The classical stereo camera, consisting of two like
cameras, one each for a left and right image. For a highly resolved
display, high-resolving camera systems are required, though. With
multichannel systems, interpolation of the intermediate views is
required. This causes artefacts to be visible especially in the
middle views.
[0007] (2) The use of a multiview camera system. Its advantage over
the stereo camera is the correct image reproduction with
multichannel systems. In particular, no interpolations are
required. A disadvantage is the great effort needed to implement
exact alignment of the cameras (e.g., eight of them) with each other.
Another disadvantage is the increased cost involved in the use of
several cameras, which in addition entail further problems such as
differing white levels, tonal values and/or geometry
characteristics, which have to be balanced accordingly. The high
data rate to be handled with this method is also a drawback.
[0008] (3) The use of a depth camera. Here, use is made of a color
camera jointly with a depth sensor, which registers the (as a rule,
cyclopean) depth information of the scene to be recorded. Apart
from the fact that depth sensors are relatively expensive, their
drawback is that they often do not work very exactly, and/or that
no acceptable compromise between accuracy and speed is achieved.
The method requires a general extrapolation, in which artefacts,
especially in the outer views, cannot be excluded and, generally,
occluding artefacts cannot be covered up.
OBJECT AND SUMMARY OF THE INVENTION
[0009] The invention is based on the problem of setting forth a new
way of recording real scenes and/or objects with the least possible
effort and subsequently displaying them three-dimensionally in two
or more views for spatial perception. Another problem of the
invention is to find a suitable method for transmitting images for
spatial perception.
[0010] According to the invention, the problem is solved with an
arrangement for recording images of a scene and/or an object and
displaying them for spatial perception, this arrangement to
comprise the following components:
[0011] at least one main camera of a first camera type for the
recording of images;
[0012] at least two satellite cameras of a second camera type for
the recording of images, with the first and second camera types
differing by at least one parameter;
[0013] an image conversion device arranged downstream of the
cameras, for receiving and processing the initial image data, this
image conversion device performing, among other processes, a depth
or disparity recognition, for which only those images are employed
that were recorded by cameras of the same camera type (preferably
those that were recorded by the at least two satellite cameras),
but not the residual images; and
[0014] a 3D image display device connected to the image conversion
device, which displays the image data for spatial perception
without special viewing aids, the 3D image display device
displaying at least two views of the scene and/or object.
[0015] However, the 3D image display device may also display 3, 4,
5, 6, 7, 8, 9 or even more views simultaneously or averaged over
time. Especially in image display devices of the last-named,
so-called "multi-view" 3D type with 4 or more views displayed, the
special advantages of the invention take effect, viz. that it is
possible, with relatively few (e.g. three or four) cameras, to
provide more views than the number of cameras employed.
[0016] The main and the satellite cameras generally, but not
necessarily, differ in quality. Mostly, the main camera is a
high-quality camera, whereas the satellite cameras employed may be
of lesser quality (e.g., industrial cameras) and thus mostly, but
not necessarily, have a lower resolution, among other parameters.
In this case, then, the second camera type has a lower resolution
than the first one. The two camera types may also differ (at least)
by the built-in imaging chip.
[0017] Essentially, the advantage of the invention is that, instead
of the classical stereo camera system, consisting essentially of
two identical high-resolution cameras, a three-camera system is
used, preferably consisting of a central high-quality camera and
two additional cameras of lower resolution, arranged to the left
and right, respectively, of the main camera. Thus, the main camera
is arranged between the satellite cameras, for example.
[0018] Preferably then, the main camera is arranged between the
satellite cameras. The distances between the cameras and their
alignment (either in parallel or pointed at a common focus) are
variable within customary limits. The use of further satellite
cameras may be of advantage, as this enables a further reduction of
misinterpretations especially during the subsequent processing of
the image data.
[0019] According to the embodiment of the invention, it may thus be
of advantage that
[0020] exactly one main camera and two satellite cameras are
provided ("version 1+2");
[0021] exactly one main camera and three satellite cameras are
provided ("version 1+3"); or
[0022] exactly one main camera and five satellite cameras are
provided ("version 1+5").
[0023] The general idea of the invention obviously also includes
other embodiments, e.g., the use of several main cameras or of
still more satellite cameras.
[0024] All cameras may be arranged in parallel or pointed at a
common focus. It is also possible that not all of them are pointed
at a common focus (convergence angle). The optical axes of the
cameras may lie in one plane or in different planes, with the
center points of the objectives preferably arranged in line or on a
(preferably isosceles or equilateral) triangle. For special cases
of application, the center points of the cameras' objectives may
also be spaced at unequal distances relative to each other (with
the objective center points forming a scalene triangle). It is
further possible that all (at least three) cameras (i.e. all
existing main and satellite cameras) differ by at least one
parameter, e.g. by their resolution. The cameras can be
synchronized with regard to zoom, f-stop, focus etc. as well as
with regard to the individual frames (i.e. best possible
true-to-frame synchronization in recording). The cameras may be
fixed at permanent locations or movable relative to each other; the
setting of both the base distance between the cameras and the
convergence angles may be automatic.
[0025] It may be of advantage to provide adapter systems that
facilitate fixing especially the satellite cameras to the main
camera. In this way, ordinary cameras can be subsequently converted
into a 3D camera. It is also feasible, though, to convert an
existing stereo camera system into a 3D camera conforming to the
invention by retrofitting an added main camera.
[0026] Furthermore, the beam path (preferably in front of the
objectives of the various cameras) can be provided with additional
optical elements, e.g. one or several semitransparent mirrors. This
makes it possible, e.g., to arrange each of two satellite cameras
rotated 90 degrees relative to the main camera, so that the camera
bodies of all three cameras are arranged in such a way that their
objective center points are closer together horizontally than they
would be if all three cameras were arranged immediately side by
side, in which case the dimension of the camera bodies would
necessitate a certain, greater spacing of the objective center
points. In the constellation with the two satellite cameras rotated
90 degrees, a semitransparent mirror in reflection position,
arranged at an angle of about 45 degrees relative to the principal
rays emerging from the objectives of the satellite cameras, would
follow, whereas the same mirror follows in transmission position,
arranged at an angle of also 45 degrees relative to the principal
ray emerging from the objective of the main camera.
[0027] Preferably, the objective center points of the main camera
and of at least two satellite cameras form an isosceles triangle,
e.g., in version "1+2".
[0028] For version "1+3" it may be of advantage that the objective
center points of the three satellite cameras form a triangle,
preferably an isosceles one. In this case, the objective center
point of the main camera should be arranged within the said
triangle, the triangle being assumed to include its sides.
[0029] Moreover, in version "1+3" it is possible that one satellite
camera and the main camera are optically arranged relative to each
other in such a way that both record an image on essentially the
same optical axis, with preferably at least one semitransparent
mirror being arranged between the two cameras.
[0030] In this case, the two other satellite cameras are preferably
arranged so as to form a straight line or a triangle with the
satellite camera that is associated with the main camera.
[0031] Other embodiments can also be implemented, such as, e.g., a
version "1+4" with a quadrangle (e.g. square) of 4 satellite
cameras with a main camera inside the quadrangle (e.g., at the
center of its area), or even a version "1+n" with a circle of n=5
or more satellite cameras.
[0032] Advantageously, the image conversion device generates at
least three views of the recorded scene or object, employing for
such generation of at least three views, in addition to the depth
or disparity data registered, the image recorded by the at least
one main camera and at least two more images recorded by the
satellite cameras, though not necessarily by all cameras provided.
It is quite possible that one of the at least three views generated
still is equal to one of the input images. In the simplest case,
the image conversion device may even use only the image recorded by
the at least one main camera and the associated depth information
for generating the views.
[0033] The main camera (or all main cameras) and all satellite
cameras preferably record with frame-accurate synchronization, at a
tolerance of maximally 100 frames per 24 hours.
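Expressed numerically, a tolerance of 100 frames per 24 hours is a very small relative rate. The following calculation is illustrative only; the frame rate is assumed, as none is specified in the text.

```python
# Synchronization tolerance of 100 frames per 24 hours as a relative rate.
# A recording frame rate of 25 fps is assumed here for illustration.
fps = 25
frames_per_day = fps * 60 * 60 * 24      # 2,160,000 frames in 24 hours
tolerance = 100 / frames_per_day         # about 4.6e-5
print(round(tolerance * 1e6, 1))         # in parts per million; prints 46.3
```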
[0034] For special embodiments it may also be useful to use
black-and-white cameras as satellite cameras, and subsequently
automatically assign a tonal value preferably to the images
produced by them.
[0035] The problem is also solved by a method for the recording and
display of images of a scene and/or an object, comprising the
following steps:
[0036] Creation of at least an n-tuple of images, with n>2, with
at least two images having different resolutions;
[0037] Transfer of the image data to an image conversion device, in
which subsequently a rectification, a color adjustment, a depth or
disparity recognition and subsequent generation of further views
from n or less than n images of the said n-tuple and the depth or
disparity recognition values are carried out, with at least one
view being generated that is not exactly equal to any of the images
of the n-tuple created, and with the image conversion device
employing, for depth or disparity recognition, only such images of
the n-tuple that have the same resolution;
[0038] Subsequent creation of a combination of at least three
different views or images in accordance with the parameter
assignment of the 3D display of a 3D image display device for
spatial presentation without special viewing aids; and finally
[0039] Presentation of the combined 3D image on the 3D display.
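The data flow of the steps above can be sketched as follows. The helper logic is deliberately simplistic (placeholders, not the disclosed algorithms); only the structure follows the method, in particular the restriction of depth or disparity recognition to the images of equal (lowest) resolution.

```python
import numpy as np

def process_n_tuple(images, n_views=8):
    """Illustrative data flow for the method steps; placeholder logic."""
    # Rectification and color adjustment would act here
    # (identity placeholders in this sketch).
    rectified = [img.astype(float) for img in images]

    # Depth/disparity recognition uses ONLY the images sharing
    # the lowest resolution.
    lowest = min(img.size for img in rectified)
    depth_inputs = [img for img in rectified if img.size == lowest]
    assert len(depth_inputs) >= 2, "need at least two equal-resolution images"

    # Placeholder disparity estimate standing in for the real recognition.
    disparity = np.abs(depth_inputs[0] - depth_inputs[1])

    # Generate n_views views; simple blending stands in for view synthesis,
    # so at least one generated view differs from every input image.
    views = [(1 - a) * depth_inputs[0] + a * depth_inputs[1]
             for a in np.linspace(0.0, 1.0, n_views)]
    return disparity, views
```

The combination step for the 3D display would then interleave the returned views according to the display's parameter assignment.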
[0040] The depth or disparity recognition employs those images of
equal resolution whose resolution has the lowest total number of
pixels compared with all other resolutions provided.
[0041] The depth recognition and subsequent generation of further
views from the n-tuple of images and the depth or disparity
recognition data can be carried out, for example, by creating a
stack structure and projecting it onto a desired view.
[0042] The creation of a stack structure may be replaced by other
applicable depth or disparity recognition algorithms, with the
depth or disparity values recognized being used for the creation of
desired views.
[0043] A stack structure may, in general, correspond to a layer
structure of graphical elements in different (virtual) planes.
[0044] If a 3D camera system consisting of cameras of different
types with different image resolutions is used, it is possible
first to carry out a size adaptation after transfer of the image
data to the image conversion device. The result is images
that all have the same resolution. This may correspond to the
highest resolution of the cameras, but preferably it is equal to
that of the lowest-resolution camera(s). Subsequently, the camera
images are rectified, i.e. their geometric distortions are
corrected (compensation of lens distortions, misalignment of
cameras, zoom differences, etc., if any). The size adaptation may
also be performed within the rectifying process. Immediately
afterwards, a color adjustment is carried out, e.g. as taught by the
publications Joshi, N., "Color Calibration for Arrays of Inexpensive
Image Sensors", Technical Report CSTR 2004-02, Stanford University,
2004, and A. Ilie and G. Welch,
"Ensuring color consistency across multiple cameras", ICCV 2005. In
particular, the tonal/brightness values of the camera images are
matched, so that they are at an equal or at least comparable level.
For the image data thus provided, the stack structure for depth
recognition is established. In this process, the input images (only
the images of the n-tuple having the same resolution), stacked on
top of each other in the first step, are compared with each other
line by line. The line-by-line comparison may instead be made in
an oblique direction; this is favorable if the cameras
are not arranged in a horizontal plane. If pixels lying on top of
each other have the same tonal value, this will be saved; if they
have different tonal values, none of these will be saved.
Thereafter, the lines are displaced relative to each other by
defined steps (e.g., by 1/4 or 1/2 pixel) in opposite directions;
after every step the result of the comparison is saved again. At
the end of this process, the three-dimensional stack structure with
the coordinates X, Y and Z is obtained, with X and Y corresponding
to the pixel coordinates of the input image, whereas Z represents
the extent of relative displacement between the views. Thus, if two
or three cameras are used, two or three lines, respectively, are
always compared and displaced relative to each other. It is also
possible to use more than two cameras (e.g., three) and still
always combine only two lines, in which case the comparisons have
to be reconciled afterwards. If three or more lines are compared, there
are far fewer ambiguities than with the comparison of only the two
lines of two input images. In the subsequent optimization of the
stack structure, the task essentially consists in deleting the
least probable combinations in case of ambiguous representations of
image elements in the stack. In addition, this contributes to data
reduction. Further reduction is achieved if a height profile curve
is derived from the remaining elements to obtain an unambiguous
mapping of the tonal values to a discrete depth plane (Z
coordinate). What normally follows now is the projection of the
stack structure onto the desired views. At least two views should
be created, one of which might still be equal to one of the input
images. However, this is done, as a rule, with the particular 3D
image display device in mind that is used thereafter. The
subsequent combination of the different views provided corresponds
to the parameter assignment of the 3D display.
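For illustration only, the line comparison at the core of the stack structure can be sketched as follows. Integer one-pixel displacement steps and wrap-around at the line borders are simplifying assumptions of this sketch; the description names increments of 1/4 to 2 pixels.

```python
import numpy as np

def line_stack(line_a, line_b, max_shift=8, tol=2.0):
    """Build one X-Z slice of the stack: compare two corresponding image
    lines at successive relative displacements.

    Axis 0 of the result is the displacement index (Z coordinate), axis 1
    the pixel position (X). Where the displaced tonal values agree within
    `tol`, the value is kept; where they differ, the cell stays empty (NaN),
    mirroring the "equal values saved, differing values deleted" rule.
    """
    n = len(line_a)
    stack = np.full((2 * max_shift + 1, n), np.nan)
    for z, shift in enumerate(range(-max_shift, max_shift + 1)):
        displaced = np.roll(line_b, shift)   # displace line_b against line_a
        agree = np.abs(line_a - displaced) <= tol
        stack[z, agree] = line_a[agree]
    return stack
```

For each pixel, the Z index of the row in which its tonal value survives then indicates the relative displacement, i.e. the disparity.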
[0045] Once the stack structure has been created, or following the
method steps in accordance with the invention, the depth is
determined for at least three original images of the n-tuple,
preferably in the form of depth maps. Preferably, at least two
depth maps having different resolution are created.
[0046] Further, after the original images of the n-tuple and the
respective associated depths have been taken over, preferably a
reconstruction is performed by inverse projection of the images of
the n-tuple into the stack space by means of depth maps, so that
the stack structure is reconstructed, and so afterwards again
different views can be generated therefrom by projection. Other
methods of creating the views from the image data provided
(n-tuples of images, depth information) are also possible.
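A minimal stand-in for "inverse projection into the stack space and re-projection" can be sketched as follows; it is one of the "other methods" alluded to above rather than the disclosed procedure. Each pixel is displaced horizontally in proportion to its depth value, with a z-buffer so that nearer pixels overwrite occluded ones.

```python
import numpy as np

def synthesize_view(image, depth, view_offset):
    """Shift each pixel by view_offset * depth (a crude view synthesis).

    Positions left unfilled (disocclusions) stay NaN; view_offset selects
    the virtual viewpoint, and 0 reproduces the input view.
    """
    h, w = image.shape
    out = np.full((h, w), np.nan)
    zbuf = np.full((h, w), -np.inf)      # z-buffer: nearer pixels win
    for y in range(h):
        for x in range(w):
            xt = x + int(round(view_offset * depth[y, x]))
            if 0 <= xt < w and depth[y, x] > zbuf[y, xt]:
                out[y, xt] = image[y, x]
                zbuf[y, xt] = depth[y, x]
    return out
```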
[0047] Moreover, the original images of the n-tuple with the
respective associated depths can be transmitted to the 3D image
display device and then the reconstruction in accordance with the
inventive method can be done first.
[0048] In general, the images of the n-tuple are created, e.g., by
means of a 3D camera system, e.g. a multiple camera system
consisting of several separate cameras.
[0049] Alternatively it is possible, in the method described above
for the recording and display of images of a scene and/or an
object, to create the images by means of a computer. In this case,
preferably a depth map is created for each image, so that the
rectification, color adjustment and depth or disparity recognition
steps can be dropped. Preferably, at least two of the three depth
maps differ in resolution. In a preferred embodiment, n=3 images
may be provided, one of which has the (full-color) resolution of
1920×1080 pixels and the other two have the (full-color)
resolution of 1280×720 (or 1024×768) pixels, whereas
the appertaining depth maps have 960×540 and 640×360
(or 512×384) pixels, respectively. The image having the
higher resolution corresponds, in spatial terms, to a perspective
view lying between the perspective views of the other two
images.
[0050] The 3D image display device employed can preferably display
2, 3, 4, 5, 6, 7, 8, 9 or even more views simultaneously or
averaged over time. It is particularly with such devices, known as
"multi-view" 3D image display devices displaying at least 4 views,
that the special advantages of the invention take effect, namely,
that with relatively few (e.g. three) original images, more views
can be provided for spatial display than the number of original
images. Incidentally, the combination, mentioned farther above, of
at least two different views or images in accordance with the
parameter assignment of the 3D display of a 3D image display device
for spatial presentation without special viewing aids may contain a
combination of views not only from different points in space but
also in time.
[0051] Another important advantage of the invention is the fact
that, after the optimization of the stack structure, the depth is
determined per original image. The resulting data have an extremely
efficient data transfer format, viz. as n images (e.g. original
images, or views) plus n depth images (preferably with n=3), so
that a data rate is achieved that is markedly lower than that
required if all views were transferred. As a consequence, a unit
for the reconstruction of the stack structure and the unit for the
projection of the stack structure onto the desired view, or units
of other kind that perform the reconstruction of views differently,
have to be integrated into the 3D image display device.
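The data-rate advantage can be illustrated with the resolutions named in this document, under the assumption of 3 bytes per color pixel and 1 byte per gray-scale depth pixel, uncompressed:

```python
# Data volume per frame for the proposed format (3 color images + 3 depth
# images at the resolutions named in the text) versus transmitting e.g. eight
# full-resolution views. Byte counts are uncompressed, illustrative values.

def frame_bytes(resolutions, bytes_per_pixel):
    return sum(w * h * bytes_per_pixel for (w, h) in resolutions)

color = frame_bytes([(1920, 1080), (1280, 720), (1280, 720)], 3)  # color, 3 B/px
depth = frame_bytes([(960, 540), (640, 360), (640, 360)], 1)      # gray depth, 1 B/px
proposed = color + depth

all_views = frame_bytes([(1920, 1080)] * 8, 3)  # eight full views instead
ratio = proposed / all_views                    # roughly one quarter
```

Even before compression, the n-images-plus-n-depths format is thus only about a quarter of the volume of eight full views.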
[0052] For the steps mentioned above, it is possible to use
disparity instead of depth. Incidentally, the term "projection"
here may, in principle, also mean a mere displacement.
[0053] Of course, other depth or disparity recognition methods than
the one described before can be used to detect depth or disparities
from the n-tuple of images (with n>2), and/or to generate
further views from this n-tuple of images. Such alternative methods
or partial methods are described, for example, in the publications
"Tao, H. and Sawhney, H.: Global matching criterion and color
segmentation based stereo, in Proc. Workshop on the Application of
Computer Vision (WACV2000), pp. 246-253, December 2000", "M. Lin
and C. Tomasi: Surfaces with occlusions from layered stereo.
Technical report, Stanford University, 2002 (in preparation)", "C.
Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon
Winder, Richard Szeliski: High-quality video view interpolation
using a layered representation, International Conference on
Computer Graphics and Interactive Techniques, ACM SIGGRAPH 2004,
Los Angeles, Calif., pp: 600-608", "S. M. Seitz and C. R. Dyer:
View Morphing, Proc. SIGGRAPH 96, 1996, 21-30".
[0054] By the method according to the invention, it is possible, in
principle, that the images created are transferred to the image
conversion device. Moreover, all views of each image, generated by
the image conversion device can be transferred to the 3D image
display device.
[0055] In an advantageous embodiment, the invention comprises a
method for the transmission of 3D information for the purpose of
later display for spatial perception without special viewing aids,
on the basis of at least three different views, a method in which,
starting from at least one n-tuple of images (with n>2)
characterizing different angles of view of an object or a scene,
with at least two images of the n-tuple having different
resolutions, the depth is determined for at least three images, and
thereafter at least three images of the n-tuple together with the
respective depth information (preferably in the form of depth maps)
are transmitted in a transmission channel.
[0056] In a preferred embodiment, the n-tuple of images is a
quadruple of images (n=4), with preferably three images having the
same resolution and the fourth one having a higher resolution, and
with the fourth image preferably belonging to the images
transmitted in the transmission channel, so that, for example, one
high-resolution image and two of the lower-resolution images are
transmitted together with the depth information.
[0057] Herein, at least two of the depth maps determined may differ
in resolution. The depth information is determined only from images
of the n-tuple having the same resolution.
[0058] It is also possible that, from the depth information
determined, the depth is also generated for at least one image of
higher resolution.
[0059] In a further development of the method according to the
invention as well as of the transmission method, the depth
information determined from images of the n-tuple having the lowest
existing resolution can be transformed into a higher resolution by
way of edge recognitions in the at least one image of higher
resolution. This is helpful especially if, e.g., in the versions
"1+2", "1+3" and "1+5" described before, the high-resolution main
camera zooms in on various scene details and/or objects, i.e.
records at a magnification. In this case, there is no absolute need
to vary the zoom settings of the satellite cameras, too. Instead,
the resolution of the corresponding depth information for the main
camera is increased as described above, so that the desired views
can be created with sufficient quality.
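One simple way to realize this edge-guided increase of depth resolution can be sketched on a single scanline: the low-resolution depth is first replicated to the higher resolution, and depth transitions are then aligned with color edges found in the high-resolution image. Function name and threshold are assumptions:

```python
# Illustrative 1D sketch of raising depth resolution with the help of edges
# in a higher-resolution image: the low-resolution depth line is doubled by
# replication, then each depth transition that does not coincide with a color
# edge is moved along until it does. This is only one simple realization of
# the edge-guided refinement described above.

def upsample_depth_with_edges(depth_lo, color_hi, edge_threshold=50):
    # Step 1: naive 2x upsampling by sample replication.
    depth_hi = [d for d in depth_lo for _ in (0, 1)]
    # Step 2: align depth transitions with color edges of the high-res image.
    for i in range(1, len(depth_hi)):
        color_edge = abs(color_hi[i] - color_hi[i - 1]) > edge_threshold
        depth_edge = depth_hi[i] != depth_hi[i - 1]
        if depth_edge and not color_edge:
            depth_hi[i] = depth_hi[i - 1]  # shift the depth edge rightwards
    return depth_hi
```

Real systems would work in 2D and per color channel; the principle, however, is the same: the high-resolution image decides where the depth discontinuity belongs.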
[0060] Further developments provide for a great number of n-tuples
of images and associated depth information to be processed in
succession, so that a spatial display of moving images is made
possible. Finally in this case, it is also possible to perform a
spatial and temporal filtering of the great number of n-tuples of
images.
[0061] The transmission channel may be, e.g., a digital TV signal,
the Internet or a DVD (HD, SD, Blu-ray etc.). As a compression
standard, MPEG-4 can be used to advantage.
[0062] It is also of advantage if at least two of the three depth
maps have different resolutions. For example, in a preferred
embodiment, n=3 images may be provided, one of them having the
(full-color) resolution of 1920.times.1080 pixels, and two having
the (full-color) resolution of 1280.times.720 (or 1024.times.768)
pixels, whereas the pertaining depth maps have 960.times.540 and
640.times.360 (or 512.times.384) pixels, respectively. The image
having the higher resolution corresponds, in spatial terms, to a
perspective view lying between the perspective views of the other
two images.
[0063] The 3D image display device employed can preferably display
2, 3, 4, 5, 6, 7, 8, 9 or even more views simultaneously or
averaged over time. Especially with the last-mentioned devices,
known as "multi-view" 3D image display devices displaying 4 or more
views, the special advantages of the invention take effect, viz.
that with relatively few (e.g. three) original images, more views
can be provided than the number of original images. The
reconstruction of different views from the n-tuple of images
transmitted together with the respective depth information (with at
least two images of the n-tuple having different resolutions) is
performed, e.g., in the following way: In a three-dimensional
coordinate system, the color information of each image--observed
from a suitable direction--is arranged at the depth positions
marked by the respective depth information belonging to the image.
This creates a colored three-dimensional volume with volume pixels
(voxels), which can be imaged from different perspectives or
directions by a virtual camera or by parallel projections. In this
way, more than three views can be advantageously regenerated from
the information transmitted. Other reconstruction algorithms for
the views or images are possible as well.
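The regeneration of a view can be sketched for a single scanline: each pixel is displaced in proportion to its depth (larger values meaning nearer to the viewer in this sketch), and a z-buffer resolves occlusions. This is a minimal illustration, not the exact reconstruction of the invention; hole filling is omitted:

```python
# Minimal sketch of view regeneration from one color line and its depth line:
# a virtual camera shifted by `view_offset` displaces each pixel in
# proportion to its depth; where two pixels land on the same position, the
# nearer one (larger depth value by the convention used here) wins.

def render_view(color_row, depth_row, view_offset):
    width = len(color_row)
    out = [None] * width        # None marks a hole to be filled later
    z_buffer = [None] * width
    for x, (c, z) in enumerate(zip(color_row, depth_row)):
        xs = x + round(view_offset * z)   # depth-dependent displacement
        if 0 <= xs < width and (z_buffer[xs] is None or z > z_buffer[xs]):
            out[xs] = c                   # nearer voxel occludes farther one
            z_buffer[xs] = z
    return out
```

Varying `view_offset` regenerates more views than were transmitted, which is the point made above.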
[0064] Regardless of this, the information transmitted is
reconstructible in a highly universal way, e.g. as (perspective)
views, tomographic slice images or voxels. Such image formats are
of great advantage for special 3D presentation methods, such as
volume 3D displays.
[0065] Moreover, in all transmission versions proposed by this
invention it is possible to transmit meta-information, e.g. in a
so-called alpha channel in addition. This may be information
supplementing the images, such as geometric conditions of the
n>2 images (e.g., relative angles, camera parameters), or
transparency or contour information.
[0066] Finally, the problem of the invention can be solved by a
method of transmitting 3D information for the purpose of subsequent
display for spatial perception without special viewing aids, on the
basis of at least three different views, whereby, starting from at
least one n-tuple of images with n>2 that characterize different
viewing angles of an object or scene, the depth is determined for
at least three images, and at least three images of the n-tuple
together with the respective depth information (preferably in the
form of depth maps) are subsequently transmitted in a transmission
channel.
[0067] Preferably the n-tuple of images is a triple of images
(n=3), with the three images having the same resolution. It is also
possible, however, that, e.g., n=5 or n=6 cameras generate 5 or 6
images each, so that the depth information is determined from the
quintuple or sextuple of images, or at least from three images of
them, and 3 of the 5 or 6 images together with their depth maps are
subsequently transmitted, even with the added possibility of a
reduction of the resolution of individual images and/or depth
maps.
[0068] Below, the invention is described in greater detail by
example embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
[0069] The drawings show
[0070] FIG. 1: a sketch illustrating the principle of the
arrangement according to the invention, with a main camera and
three satellite cameras;
[0071] FIG. 2: a version with a main camera and two satellite
cameras;
[0072] FIG. 3: a schematic illustration of the step-by-step
displacement of two lines against one another, and generation of
the Z coordinate;
[0073] FIG. 4: a scheme of optimization by elimination of
ambiguities compared to FIG. 3;
[0074] FIG. 5: a scheme of optimization by reduction of the
elements to an unambiguous height profile curve, compared to FIG.
4;
[0075] FIG. 6: a schematic illustration of the step-by-step
displacement of three lines against one another, and generation of
the Z coordinate;
[0076] FIG. 7: a scheme of optimization by elimination of
ambiguities compared to FIG. 6;
[0077] FIG. 8: a scheme of optimization by reduction of the
elements to an unambiguous height profile curve, compared to FIG.
7;
[0078] FIG. 9: a schematic illustration of a projection of a view
from the scheme of optimization;
[0079] FIG. 10: a schematic illustration of an image combination of
four images, suitable for spatial display without special viewing
aids (state of the art); and
[0080] FIG. 11: a schematic illustration of the transmission method
according to the invention.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0081] An arrangement according to the invention essentially
consists of a 3D camera system 1, an image conversion device 2 and
a 3D image display device 3. As shown in FIG. 1, the 3D camera
system 1 contains three satellite cameras 14, 15 and 16, one main
camera 13; the image conversion device 2 contains a rectification
unit 21, a color adjustment unit 22, a unit for establishing the
stack structure 23, a unit for the optimization of the stack
structure 24, and a unit 25 for the projection of the stack
structure onto the desired view, and the 3D image display device 3
contains an image combination unit 31 and a 3D display 32, with the
3D display 32 displaying at least two views of a scene or object
for spatial presentation. The 3D display 32 can also work on the
basis of, say, 3, 4, 5, 6, 7, 8, 9 or even more views. As an
example, a 3D display 32 of the model "Spatial View 19 inch" may be
used, which displays 5 different views at a time.
[0082] FIG. 2 shows another arrangement according to the invention.
Here, the 3D camera system 1 contains a main camera 13, a first
satellite camera 14, and a second satellite camera 15. The image
conversion device 2 contains a rectification unit 21, a color
adjustment unit 22, a unit for establishing the stack structure 23,
a unit for the optimization of the stack structure 24, a unit 25
for projecting the stack structure onto the desired view, and a
unit for determining the depth 26; and the 3D image display device
3 contains, as shown in FIG. 2, a unit for the reconstruction of
the stack structure 30, an image combination unit 31, and a 3D
display 32.
[0083] According to the embodiment shown in FIG. 2, the 3D camera
system 1 consists of a main camera 13 and two satellite cameras 14,
15, with the main camera 13 being a high-quality camera with a high
resolving power, whereas the two satellite cameras 14, 15 are
provided with a lower resolving power. As usual, the camera
positions relative to each other are variable in spacing and
alignment within the known limits, so that stereoscopic images can
be taken. In the rectification unit 21, the camera images are
rectified, i.e. a compensation of lens distortions, camera
rotations, zoom differences, etc., is made. The rectification unit
21 is followed by the color adjustment unit 22. Here, the
tonal/brightness values of the recorded images are balanced to a
common level. The image data thus corrected are now fed to the unit
23 for establishing the stack structure.
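The color balancing performed by the color adjustment unit 22 can be sketched in simplified form. Real implementations would match full tonal histograms per color channel; the gain-only matching below is an assumed, minimal illustration with hypothetical names:

```python
# Simplified sketch of color adjustment: the tonal values of a satellite
# camera image are scaled so that its mean brightness matches that of a
# reference image (e.g. from the main camera). One channel only.

def adjust_brightness(image, reference):
    flat = [v for row in image for v in row]
    ref = [v for row in reference for v in row]
    gain = (sum(ref) / len(ref)) / (sum(flat) / len(flat))
    # Scale every tonal value, clipping to the 8-bit range.
    return [[min(255, round(v * gain)) for v in row] for row in image]
```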
[0084] Now, in principle, a line-by-line comparison is made of the
input images, but only those of the satellite cameras (14, 15
according to FIG. 2, or 14, 15, 16 according to FIG. 1). The
comparison according to FIG. 3 is based on the comparison of only
two lines each. In the first step, at first two lines are placed
one on top of the other with the same Y coordinate, which,
according to FIG. 3, corresponds to plane 0. The comparison is made
pixel by pixel, and, as shown in FIG. 3, the result of the
comparison is saved as a Z coordinate in accordance with the
existing comparison plane, a process in which pixels lying on top
of each other retain their tonal value if it is identical; if it is
not, no tonal value is saved. In the second step, the lines are
displaced against one another by increments of 1/2 pixel as shown in
FIG. 3, and the next comparison is made in plane 1, the result of
which is saved in plane 1 (Z coordinate).
As can be seen from FIG. 3, the comparisons are generally made up
to plane 7 and then with plane -1 up to plane -7, each being saved
as a Z coordinate in the respective plane. The number of planes
corresponds to the maximum depth information occurring, and may
vary depending on the image content. The three-dimensional
structure thus established with the XYZ coordinates means that, for
each pixel, the degree of relative displacement between the views
is saved via the appertaining Z coordinate. As shown in FIG. 6, the
same comparison is made on the basis of the embodiment shown in
FIG. 1, save that three lines are compared here accordingly. A
simple comparison between FIG. 6 and FIG. 3 shows that the
comparison of three lines involves substantially fewer
misinterpretations. Thus, it is of advantage to do the comparison
with more than two lines. The stack structure established, which is
distinguished also by the fact that now the input images are no
longer present individually, is fed to the subsequent unit 24 for
optimization of the stack structure. Here, ambiguous depictions of
image elements are identified with the aim of deleting such errors
due to improbable combinations, so that a corrected set of data is
generated in accordance with FIG. 4 or FIG. 7. In the next step, a
height profile curve that is as shallow or smooth as possible is
established from the remaining elements in order to achieve an
unambiguous imaging of the tonal values in a discrete depth plane
(Z coordinate). The results are shown in FIG. 5 and FIG. 8,
respectively. The result according to FIG. 5 is now fed to the unit
25 for the projection of the stack structure onto the desired view
as shown in FIG. 1. Here, the stack structure is projected onto a
defined plane in the space. The (i.e. each) desired view is
generated via the angles of the plane, as can be seen in FIG. 9. As
a rule, at least one view is generated that is not exactly equal to
any of the images recorded by the camera system 1. All views
generated are present at the output port of the image conversion
device 2 and can thus be transferred to the subsequent 3D image
display device 3 for stereoscopic presentation; by means of the
image combination unit 31 incorporated, at first the different
views are combined in accordance with the given parameter
assignment of the 3D display 32.
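The core of the two-line comparison of FIG. 3 can be sketched as follows. Whole-pixel displacement steps are used for simplicity, whereas the description above uses 1/2-pixel increments; all identifiers are illustrative:

```python
# Illustrative sketch of the two-line comparison: one satellite line is
# displaced step by step against the other, and wherever the tonal values of
# overlapping pixels agree, the value is saved at the Z coordinate of the
# current comparison plane, building the stack structure.

def compare_lines(line_a, line_b, max_plane=7):
    matches = {}                        # (x, z) -> tonal value
    for z in range(-max_plane, max_plane + 1):
        for x, value in enumerate(line_a):
            xb = x + z                  # displacement of line_b for plane z
            if 0 <= xb < len(line_b) and line_b[xb] == value:
                matches[(x, z)] = value
    return matches
```

The subsequent optimization (FIGS. 4, 5, 7, 8) would then prune the ambiguous entries of `matches` down to a single smooth height profile.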
[0085] FIG. 2 illustrates another, optional way for transmitting
the processed data to the 3D image display device 3. Here, the unit
24 for the optimization of the stack structure is followed by the
unit 26 for determining the depth (broken line). Determining the
depth of the images creates a particularly efficient data transfer
format. This is because only three images and three depth images
are transferred, preferably in the MPEG-4 format. According to FIG.
2, the 3D image display device 3 is provided, on the input side,
with a unit 30 for reconstructing the stack structure, a subsequent
image combination unit 31 and a 3D display 32. In the unit 30 for
reconstructing the stack structure, the images and depths received
can be particularly efficiently reconverted into the stack
structure by inverse projection, so that the stack structure can be
made available to the subsequent unit 25 for projecting the stack
structure onto the desired view. The further procedure is then
identical to the version illustrated in FIG. 1, save for the
advantage that not all the views need to be transferred, especially
if the unit 25 is integrated in the 3D image display device 3. This
last-named, optional way can also be taken in the embodiment
according to FIG. 1, provided that the circumstances are matched
accordingly.
[0086] For better understanding, FIG. 10 shows a schematic
illustration of a state-of-the-art method (JP 08-331605) to create
an image combination of four images or views, suitable for spatial
presentation on a 3D display without special viewing aids, for
example on the basis of a suitable lenticular or barrier
technology. For that purpose, the four images or views have been
combined in the image combination unit 31 in accordance with the
image combination structure suitable for the 3D display 32.
[0087] FIG. 11, finally, is a schematic illustration of the
transmission method according to the invention. In an MPEG-4 data
stream, a total of 3 color images and 3 depth images (or streams of
moving images accordingly) are transmitted. To particular
advantage, one of the color image streams has a resolution of
1920.times.1080 pixels, whereas the other two have a resolution of
1280.times.720 (or 1024.times.768) pixels. Each of the appertaining
depth images (or depth image streams) is transmitted with half the
horizontal and half the vertical resolution, i.e. 960.times.540
pixels and 640.times.360 (or 512.times.384) pixels, respectively.
In the simplest case, the depth images consist of gray-scale
images, e.g. with 256 or 1024 possible gray levels per pixel, with
each gray level representing one depth value.
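Decoding such a gray-scale depth image can be sketched as a linear mapping of the gray level to a depth value; the near and far limits below are arbitrary assumptions:

```python
# Sketch of reading the transmitted depth images: each gray level is mapped
# linearly to a depth value between assumed near and far limits. The
# 256-level case from the text is shown; 1024 levels work the same way.

def gray_to_depth(gray, levels=256, near=0.0, far=10.0):
    return near + (gray / (levels - 1)) * (far - near)
```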
[0088] In another embodiment, the highest-resolution color image
would have, for example, 4096.times.4096 pixels, and the other
color images would have 2048.times.2048 or 1024.times.1024 pixels.
The appertaining depth images (or depth image streams) are
transmitted with half the horizontal and half the vertical
resolution. This version would be of advantage if the same data
record is to be used for stereoscopic presentations of particularly
high resolution (e.g. in the 3D movie theater with right and left
images) as well as for less well-resolved 3D presentation on 3D
displays, but then with at least two views presented.
[0089] While the foregoing description and drawings represent the
present invention, it will be obvious to those skilled in the art
that various changes may be made therein without departing from the
true spirit and scope of the present invention.
LIST OF REFERENCE NUMBERS
[0090] 1 Camera system [0091] 13 Main camera [0092] 14 First
satellite camera [0093] 15 Second satellite camera [0094] 16 Third
satellite camera [0095] 2 Image conversion device [0096] 21
Rectification unit [0097] 22 Color adjustment unit [0098] 23 Unit
for establishing the stack structure [0099] 24 Unit for optimizing
the stack structure [0100] 25 Unit for projecting the stack
structure onto the desired view [0101] 26 Unit for determining the
depth [0102] 3 3D image display device [0103] 30 Unit for
reconstructing the stack structure [0104] 31 Image combination unit
[0105] 32 3D display
* * * * *