U.S. patent application number 14/415714 was published by the patent office on 2015-06-04 for a method and device for improving the rendering of multi-channel audio signals.
The applicant listed for this patent is THOMSON LICENSING. The invention is credited to Johannes Boehm, Peter Jax and Olivier Wuebbolt.
United States Patent Application 20150154965, Kind Code A1
Wuebbolt, Olivier; et al.
Publication Date: June 4, 2015
Application Number: 14/415714
Family ID: 48874273
METHOD AND DEVICE FOR IMPROVING THE RENDERING OF MULTI-CHANNEL
AUDIO SIGNALS
Abstract
Conventional audio compression technologies perform a
standardized signal transformation that is independent of the
content type. Multi-channel signals are decomposed into their signal
components, which are subsequently quantized and encoded. This is
disadvantageous because of the lack of knowledge about the
characteristics of the scene composition, especially for e.g.
multi-channel audio or Higher-Order Ambisonics (HOA) content. An
improved method for encoding pre-processed audio data comprises
encoding the pre-processed audio data, and encoding auxiliary data
that indicate the particular audio pre-processing. An improved
method for decoding encoded audio data comprises determining that
the encoded audio data had been pre-processed before encoding,
decoding the audio data, extracting from received data information
about the pre-processing, and post-processing the decoded audio data
according to the extracted pre-processing information.
Inventors: Wuebbolt, Olivier (Hannover, DE); Boehm, Johannes (Goettingen, DE); Jax, Peter (Hannover, DE)
Applicant: THOMSON LICENSING, Issy-les-Moulineaux, FR
Family ID: 48874273
Appl. No.: 14/415714
Filed: July 19, 2013
PCT Filed: July 19, 2013
PCT No.: PCT/EP2013/065343
371 Date: January 19, 2015
Current U.S. Class: 704/500
Current CPC Class: G10L 19/167 (2013.01); H04S 3/008 (2013.01); H04S 2420/03 (2013.01); H04S 2420/11 (2013.01); G10L 19/008 (2013.01); H04R 5/027 (2013.01); H04S 2400/03 (2013.01); H04S 2400/01 (2013.01); H04S 2400/15 (2013.01)
International Class: G10L 19/008 (2006.01)
Foreign Application Data: EP 12290239.8, filed Jul 19, 2012
Claims
1-16. (canceled)
17. A method for encoding pre-processed audio data, comprising
detecting for the audio data an audio data type out of at least
three different types, the types comprising a first Higher-Order
Ambisonics (HOA) format, a microphone recording with a given setup
of a plurality of microphones and a multichannel audio stream mixed
according to a specific panning; if, according to the detecting,
the audio data have a first HOA format, transforming coefficients
of the audio data of the first HOA format by an inverse Discrete
Spherical Harmonics Transform (iDSHT) to coefficients of a
different second HOA format; encoding the audio data, or said
coefficients of the spatial domain if, according to the detecting,
the audio data have the first HOA format; encoding auxiliary data
that indicate a particular audio pre-processing of the audio data,
the auxiliary data comprising at least metadata about virtual or
real loudspeaker positions and mixing information about the audio
data, the mixing information comprising details of at least one of
details of the first HOA format, the given setup of the plurality
of microphones and details of said specific panning.
18. The method according to claim 17, wherein the pre-processed
audio data and at least a part of the auxiliary data are obtained
from an audio production stage, the obtained part of the auxiliary
data comprising at least one of modification information, editing
information and synthesis information.
19. The method according to claim 18, wherein the audio production
stage is adapted for performing at least one of recording, mixing
and sound synthesis.
20. The method according to claim 17, wherein the auxiliary data
indicate that the audio content was derived from HOA content, plus
at least one of: an order of the HOA content representation, a 2D,
3D or hemispherical representation, and positions of spatial
sampling points.
21. The method according to claim 17, wherein the auxiliary data
indicate that the audio content was mixed synthetically using
vector-based amplitude panning (VBAP), plus an assignment of VBAP
tuples or triples of loudspeakers.
22. The method according to claim 17, wherein the auxiliary data
indicate that the audio content was recorded with fixed, discrete
microphones, plus at least one of: one or more positions and
directions of one or more microphones on the recording set, and one
or more kinds of microphones.
23. A method for decoding encoded audio data, comprising a.
determining that the encoded audio data has been pre-processed
before encoding; b. decoding the audio data; c. extracting from
received data information about the pre-processing, the information
comprising at least metadata about virtual or real loudspeaker
positions and mixing information about the audio data, the mixing
information comprising details of at least one of details of a
first HOA format, a setup of a plurality of microphones and details
of a specific panning; and d. post-processing the decoded audio
data according to the extracted pre-processing information.
24. The method according to claim 23, wherein the information about
the pre-processing indicates that the audio content was derived
from HOA content, plus at least one of an order of the HOA content
representation, a 2D, 3D or hemispherical representation, and
positions of spatial sampling points, and wherein the
post-processing comprises applying a DSHT to recover, from the
decoded audio data, a HOA representation according to the first HOA
format.
25. The method according to claim 17, wherein the information about
the pre-processing indicates that the audio content was mixed
synthetically using VBAP, plus an assignment of VBAP tuples or
triples of loudspeakers.
26. The method according to claim 17, wherein the information about
the pre-processing indicates that the audio content was recorded
with fixed, discrete microphones, plus at least one of: one or more
positions and directions of one or more microphones on the
recording set, and one or more kinds of microphones.
27. The method according to claim 17, wherein usage of the metadata
is optional and can be switched on or off.
28. An encoder for encoding pre-processed audio data, the audio
data having an audio data type out of at least three different
types, the types comprising a first Higher-Order Ambisonics (HOA)
format, a microphone recording with a given setup of a plurality of
microphones and a multichannel audio stream mixed according to a
specific panning, the encoder comprising a. an inverse Discrete
Spherical Harmonics Transform (iDSHT) block for transforming
coefficients of the audio data of the first HOA format to
coefficients of a different second HOA format if the audio data
have the first HOA format; b. a first encoder for encoding the audio
data, or for encoding said coefficients of the spatial domain if
the audio data have the first HOA format; c. a second encoder for
encoding auxiliary data that indicate a particular audio
pre-processing of the audio data, the auxiliary data comprising at
least metadata about virtual or real loudspeaker positions and
mixing information about the audio data, the mixing information
comprising details of at least one of details of the first HOA
format, the given setup of the plurality of microphones and details
of said specific panning.
29. The encoder according to claim 28, wherein the encoder comprises
a DSHT block, an MDCT block, a second inverse DSHT block for
performing an inverse DSHT, a source direction detecting block and
a parameter calculating block, wherein the DSHT block is configured
for calculating and performing a DSHT that is inverse to an iDSHT
as performed by said inverse Discrete Spherical Harmonics Transform
block, the DSHT block providing output to the MDCT block, the
source direction detecting block and the parameter calculating
block, and wherein the MDCT block is adapted for compensating a
temporal overlapping of audio frame segments, the MDCT block
providing output to the second inverse DSHT block, and wherein the
source direction detecting block is adapted for detecting one or
more strongest source directions within the output of the DSHT
block and provides output to the parameter calculating block, and
wherein the parameter calculating block is adapted for calculating
rotation parameters and provides the rotation parameters to the
second inverse DSHT block, the rotation parameters defining a
rotation that maps a spatial sample position of a sampling grid of
the inverse DSHT of the second inverse DSHT block to one of the one
or more detected strongest source directions, and wherein the
second inverse DSHT block is adapted for calculating an adaptive
rotation matrix from the rotation parameters received from the
parameter calculating block and for performing an adaptive inverse
DSHT, the adaptive inverse DSHT comprising a rotation according to
the adaptive rotation matrix and an inverse DSHT.
30. A decoder for decoding encoded audio data, comprising a. an
analyzer for determining that the encoded audio data has been
pre-processed before encoding; b. a first decoder for decoding the
audio data; c. a data stream parser and extraction unit for
extracting from received data information about the pre-processing,
the information comprising at least metadata about virtual or real
loudspeaker positions and mixing information about the audio data,
the mixing information comprising details of at least one of
details of a first HOA format, a setup of a plurality of
microphones and details of a specific panning; and d. processing
unit for post-processing the decoded audio data according to the
extracted pre-processing information.
31. The decoder according to claim 30, wherein the information
about the pre-processing comprises indication of a microphone setup
or of a panning algorithm that has been used for mixing the audio
data.
32. An audio renderer suitable for rendering HOA signals, the audio
renderer including an interface that comprises a plurality of input
channels for receiving multi-channel audio data and spatial
position information for the input channels, and at least one
channel for receiving metadata, the metadata specifying a type of
audio mixing that has been applied to the multi-channel audio
data.
33. The method according to claim 23, wherein the information about
the pre-processing indicates that the audio content was mixed
synthetically using VBAP, plus an assignment of VBAP tuples or
triples of loudspeakers.
34. The method according to claim 23, wherein the information about
the pre-processing indicates that the audio content was recorded
with fixed, discrete microphones, plus at least one of: one or more
positions and directions of one or more microphones on the
recording set, and one or more kinds of microphones.
35. The method according to claim 23, wherein usage of the metadata
is optional and can be switched on or off.
36. The encoder according to claim 28, wherein the pre-processed
audio data and at least a part of the auxiliary data are obtained
from an audio production stage, the obtained part of the auxiliary
data comprising at least one of modification information, editing
information and synthesis information.
37. The encoder according to claim 36, wherein the audio production
stage is adapted for performing at least one of recording, mixing
and sound synthesis.
38. The encoder according to claim 28, wherein the auxiliary data
indicate that the audio content was derived from HOA content, plus
at least one of: an order of the HOA content representation, a 2D,
3D or hemispherical representation, and positions of spatial
sampling points.
39. The encoder according to claim 28, wherein the auxiliary data
indicate that the audio content was mixed synthetically using
vector-based amplitude panning (VBAP), plus an assignment of VBAP
tuples or triples of loudspeakers.
40. The encoder according to claim 28, wherein the auxiliary data
indicate that the audio content was recorded with fixed, discrete
microphones, plus at least one of: one or more positions and
directions of one or more microphones on the recording set, and one
or more kinds of microphones.
41. The decoder according to claim 30, wherein the information
about the pre-processing indicates that the audio content was
derived from HOA content, plus at least one of an order of the HOA
content representation, a 2D, 3D or hemispherical representation,
and positions of spatial sampling points, and wherein the
post-processing comprises applying a DSHT to recover, from the
decoded audio data, a HOA representation according to the first HOA
format.
42. The decoder according to claim 30, wherein the information
about the pre-processing indicates that the audio content was mixed
synthetically using vector-based amplitude panning (VBAP), plus an
assignment of VBAP tuples or triples of loudspeakers.
43. The decoder according to claim 30, wherein the information
about the pre-processing indicates that the audio content was
recorded with fixed, discrete microphones, plus at least one of:
one or more positions and directions of one or more microphones on
the recording set, and one or more kinds of microphones.
44. The decoder according to claim 30, wherein usage of the
metadata is optional and can be switched on or off.
Description
FIELD OF THE INVENTION
[0001] The invention is in the field of audio compression, in
particular the compression of multi-channel audio signals and
sound-field-oriented audio scenes, e.g. Higher Order Ambisonics
(HOA).
BACKGROUND OF THE INVENTION
[0002] At present, compression schemes for multi-channel audio
signals do not explicitly take into account how the input audio
material has been generated or mixed. Thus, known audio compression
technologies are not aware of the origin or mixing type of the content
they are to compress. In known approaches, a "blind" signal
transformation is performed, by which the multi-channel signal is
decomposed into its signal components that are subsequently
quantized and encoded. A disadvantage of such approaches is that
the computation of the above-mentioned signal decomposition is
computationally demanding, and it is difficult and error-prone to
find the best suitable and most efficient signal decomposition for
a given segment of the audio scene.
SUMMARY OF THE INVENTION
[0003] The present invention relates to a method and a device for
improving multi-channel audio rendering.
[0004] It has been found that at least some of the above-mentioned
disadvantages are due to the lack of prior knowledge on the
characteristics of the scene composition. Especially for spatial
audio content, e.g. multichannel-audio or Higher-Order Ambisonics
(HOA) content, this prior information is useful in order to adapt
the compression scheme. For instance, a common pre-processing step
in compression algorithms is an audio scene analysis, which aims at
extracting directional audio sources or audio objects from the
original content or original content mix. Such directional audio
sources or audio objects can be coded separately from the residual
spatial audio content.
[0005] In one embodiment, a method for encoding pre-processed audio
data comprises steps of encoding the pre-processed audio data, and
encoding auxiliary data that indicate the particular audio
pre-processing.
[0006] In one embodiment, the invention relates to a method for
decoding encoded audio data, comprising steps of determining that
the encoded audio data had been pre-processed before encoding,
decoding the audio data, extracting from received data information
about the pre-processing, and post-processing the decoded audio
data according to the extracted pre-processing information. The
step of determining that the encoded audio data had been
pre-processed before encoding can be achieved by analysis of the
audio data, or by analysis of accompanying metadata.
[0007] In one embodiment of the invention, an encoder for encoding
pre-processed audio data comprises a first encoder for encoding the
pre-processed audio data, and a second encoder for encoding
auxiliary data that indicate the particular audio
pre-processing.
[0008] In one embodiment of the invention, a decoder for decoding
encoded audio data comprises an analyzer for determining that the
encoded audio data had been pre-processed before encoding, a first
decoder for decoding the audio data, a data stream parser unit or
data stream extraction unit for extracting from received data
information about the pre-processing, and a processing unit for
post-processing the decoded audio data according to the extracted
pre-processing information.
[0009] In one embodiment of the invention, a computer readable
medium has stored thereon executable instructions to cause a
computer to perform a method according to at least one of the
above-described methods.
[0010] A general idea of the invention is based on at least one of
the following extensions of multi-channel audio compression
systems:
[0011] According to one embodiment, a multi-channel audio
compression and/or rendering system has an interface that comprises
the multi-channel audio signal stream (e.g. PCM streams), the
related spatial positions of the channels or corresponding
loudspeakers, and metadata indicating the type of mixing that had
been applied to the multi-channel audio signal stream. The mixing
type indicates, for instance, a (previous) use or configuration and/or
any details of HOA or VBAP panning, specific recording techniques,
or equivalent information. The interface can be an input interface
towards a signal transmission chain. In the case of HOA content,
the spatial positions of loudspeakers can be positions of virtual
loudspeakers.
[0012] According to one embodiment, the bit stream of a
multi-channel compression codec comprises signaling information in
order to transmit the above-mentioned metadata about virtual or
real loudspeaker positions and original mixing information to the
decoder and subsequent rendering algorithms. Thereby, any applied
rendering techniques on the decoding side can be adapted to the
specific mixing characteristics on the encoding side of the
particular transmitted content.
[0013] In one embodiment, the usage of the metadata is optional and
can be switched on or off. That is, the audio content can be decoded
and rendered in a simple mode without using the metadata, but the
decoding and/or rendering will not be optimized in the simple mode.
In an enhanced mode, optimized decoding and/or rendering can be
achieved by making use of the metadata. In this embodiment, the
decoder/renderer can be switched between the two modes.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] Advantageous exemplary embodiments of the invention are
described with reference to the accompanying drawings, which show
in
[0015] FIG. 1 the structure of a known multi-channel transmission
system;
[0016] FIG. 2 the structure of a multi-channel transmission system
according to one embodiment of the invention;
[0017] FIG. 3 a smart decoder according to one embodiment of the
invention;
[0018] FIG. 4 the structure of a multi-channel transmission system
for HOA signals;
[0019] FIG. 5 spatial sampling points of a DSHT;
[0020] FIG. 6 examples of spherical sampling positions for a
codebook used in encoder and decoder building blocks; and
[0021] FIG. 7 an exemplary embodiment of a particularly improved
multi-channel audio encoder.
DETAILED DESCRIPTION OF THE INVENTION
[0022] FIG. 1 shows a known approach for multi-channel audio
coding. Audio data from an audio production stage 10 are encoded in
a multi-channel audio encoder 20, transmitted and decoded in a
multi-channel audio decoder 30. Metadata may be transmitted
explicitly (or their information may be included implicitly) and
relate to the spatial audio composition. Such conventional
metadata are limited to information on the spatial positions of
loudspeakers, e.g. in the form of specific formats (e.g. stereo or
ITU-R BS.775-1 also known as "5.1 surround sound") or by tables
with loudspeaker positions. No information on how a specific
spatial audio mix/recording has been produced is communicated to
the multi-channel audio encoder 20, and thus such information
cannot be exploited or utilized in compressing the signal within
the multi-channel audio encoder 20.
[0023] However, it has been recognized that knowledge of at least
one of origin and mixing type of the content is of particular
importance if a multi-channel spatial audio coder processes at
least one of content that has been derived from a Higher-Order
Ambisonics (HOA) format, a recording with any fixed microphone
setup and a multi-channel mix with any specific panning algorithms,
because in these cases the specific mixing characteristics can be
exploited by the compression scheme. Original multi-channel
audio content can also benefit from an indication of additional mixing
information. It is advantageous to indicate, e.g., the panning
method that was used, such as Vector-Based Amplitude Panning (VBAP),
or any details thereof, in order to improve the encoding efficiency.
Advantageously, the signal models for the audio scene analysis, as
well as the subsequent encoding steps, can be adapted according to
this information. This results in a more efficient compression
system with respect to both rate-distortion performance and
computational effort.
[0024] In the particular case of HOA content, there is the problem
that many different conventions exist, e.g. complex-valued vs.
real-valued spherical harmonics, multiple/different normalization
schemes, etc. In order to avoid incompatibilities between
differently produced HOA content, it is useful to define a common
format. This can be achieved via a transformation of the HOA
time-domain coefficients to their equivalent spatial representation,
which is a multi-channel representation, using a transform such as
the Discrete Spherical Harmonics Transform (DSHT). The DSHT is
created from a regular spherical distribution of spatial sampling
positions, which can be regarded as equivalent to virtual loudspeaker
positions. More definitions and details about the DSHT are given
below. Any system using another definition of HOA is able to derive
its own HOA coefficients representation from this common format
defined in the spatial domain. Compression of signals of said
common format benefits considerably from the prior knowledge that
the virtual loudspeaker signals represent an original HOA signal,
as described in more detail below.
[0025] Furthermore, this mixing information etc. is also useful for
the decoder or renderer. In one embodiment, the mixing information
etc. is included in the bit stream. The rendering algorithm that is used
can be adapted to the original mixing, e.g. HOA or VBAP, to allow
for a better down-mix or rendering to flexible loudspeaker
positions.
[0026] FIG. 2 shows an extension of the multi-channel audio
transmission system according to one embodiment of the invention.
The extension is achieved by adding metadata that describe at least
one of the type of mixing, type of recording, type of editing, type
of synthesizing etc. that has been applied in the production stage
10 of the audio content. This information is carried through to the
decoder output and can be used inside the multi-channel compression
codec 40,50 in order to improve efficiency. The information on how
a specific spatial audio mix/recording has been produced is
communicated to the multi-channel audio encoder 40, and thus can be
exploited or utilized in compressing the signal.
[0027] One example as to how this metadata information can be used
is that, depending on the mixing type of the input material,
different coding modes can be activated by the multi-channel codec.
For instance, in one embodiment, a coding mode is switched to a
HOA-specific encoding/decoding principle (HOA mode), as described
below (with respect to eq. (3)-(16)), if HOA mixing is indicated at
the encoder input, while a different (e.g. more traditional)
multi-channel coding technology is used if the mixing type of the
input signal is not HOA, or unknown. In the HOA mode, the encoding
starts in one embodiment with a DSHT block in which a DSHT regains
the original HOA coefficients, before a HOA-specific encoding
process is started. In another embodiment, a different discrete
transform other than DSHT is used for a comparable purpose.
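Purely as an illustration (all type, field and function names below are hypothetical and not taken from the patent), such a dispatch on the signalled mixing type could be sketched as follows:

```python
from dataclasses import dataclass
from enum import Enum, auto

class MixingType(Enum):          # hypothetical labels for the signalled mixing type
    HOA_DERIVED = auto()
    VBAP_MIXED = auto()
    MIC_RECORDING = auto()
    UNKNOWN = auto()

@dataclass
class MixMetadata:               # assumed container for the auxiliary data
    mixing_type: MixingType
    speaker_positions: list      # (azimuth, elevation) per channel, real or virtual
    details: dict                # e.g. HOA order, VBAP triples, microphone setup

def select_coding_mode(meta: MixMetadata) -> str:
    """Pick an encoding principle from the signalled mixing type."""
    if meta.mixing_type is MixingType.HOA_DERIVED:
        return "hoa_mode"        # DSHT regains HOA coefficients, then HOA-specific coding
    return "generic_multichannel_mode"
```

The point is only that the encoder front end branches on the auxiliary metadata instead of analysing the input signal blindly.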
[0028] FIG. 3 shows a "smart" rendering system according to one
embodiment of the invention, which makes use of the inventive
metadata in order to accomplish a flexible down-mix, up-mix or
re-mix of the decoded N channels to M loudspeakers that are present
at the decoder terminal. The metadata on the type of mixing,
recording etc. can be exploited for selecting one of a plurality of
modes, so as to accomplish efficient, high-quality rendering. A
multi-channel encoder 50 uses optimized encoding, according to
metadata on the type of mix in the input audio data, and
encodes/provides not only N encoded audio channels and information
about loudspeaker positions, but also e.g. "type of mix"
information to the decoder 60. The decoder 60 (at the receiving
side) uses real loudspeaker positions of loudspeakers available at
the receiving side, which are unknown at the transmitting side
(i.e. encoder), for generating output signals for M audio channels.
In one embodiment, N is different from M. In one embodiment, N
equals M or is different from M, but the real loudspeaker positions
at the receiving side are different from loudspeaker positions that
were assumed in the encoder 50 and in the audio production 10. The
encoder 50 or the audio production 10 may assume e.g. standardized
loudspeaker positions.
[0029] FIG. 4 shows how the invention can be used for efficient
transmission of HOA content. The input HOA coefficients are
transformed into the spatial domain via an inverse DSHT (iDSHT)
410. The resulting N audio channels, their (virtual) spatial
positions, as well as an indication (e.g. a flag such as a "HOA
mixed" flag) are provided to the multi-channel audio encoder 420,
which is a compression encoder. The compression encoder can thus
utilize the prior knowledge that its input signals are HOA-derived.
An interface between the audio encoder 420 and an audio decoder 430
or audio renderer comprises N audio channels, their (virtual)
spatial positions, and said indication. An inverse process is
performed at the decoding side, i.e. the HOA representation can be
recovered by applying, after decoding 430, a DSHT 440 that uses
knowledge of the related operations that had been applied before
encoding the content. This knowledge is received through the
interface in form of the metadata according to the invention.
[0030] Some (but not necessarily all) kinds of metadata that are in
particular within the scope of this invention would be, for example,
at least one of the following (a structural sketch of such metadata
follows this list):
[0031] an indication that original content was derived from HOA content, plus at least one of:
[0032] an order of the HOA representation;
[0033] an indication of a 2D, 3D or hemispherical representation; and
[0034] positions of spatial sampling points (adaptive or fixed);
[0035] an indication that original content was mixed synthetically using VBAP, plus an assignment of VBAP tuples (pairs) or triples of loudspeakers; and
[0036] an indication that original content was recorded with fixed, discrete microphones, plus at least one of:
[0037] one or more positions and directions of one or more microphones on the recording set; and
[0038] one or more kinds of microphones, e.g. cardioid vs. omnidirectional vs. super-cardioid, etc.
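The three kinds of metadata listed above could be carried as typed payloads next to a mixing-type flag such as the one sketched earlier; every name below is illustrative rather than normative:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Direction = Tuple[float, float]            # (azimuth, elevation) in radians

@dataclass
class HoaDerivedInfo:                      # content was derived from HOA
    order: int                             # HOA order N
    representation: str                    # "2D", "3D" or "hemispherical"
    sampling_points: Optional[List[Direction]] = None  # adaptive or fixed grid

@dataclass
class VbapMixInfo:                         # content was mixed synthetically with VBAP
    speaker_tuples: List[Tuple[int, ...]]  # loudspeaker pairs/triples by channel index

@dataclass
class MicRecordingInfo:                    # content was recorded with discrete microphones
    positions: List[Tuple[float, float, float]]  # microphone positions on the set
    directions: List[Direction]
    kinds: List[str]                       # e.g. "cardioid", "omnidirectional"
```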
[0039] Main advantages of the invention are at least the
following.
[0040] A more efficient compression scheme is obtained through
better prior knowledge on the signal characteristics of the input
material. The encoder can exploit this prior knowledge for improved
audio scene analysis (e.g. a source model of mixed content can be
adapted). An example for a source model of mixed content is a case
where a signal source has been modified, edited or synthesized in
an audio production stage 10. Such an audio production stage 10 is
usually used to generate the multichannel audio signal, and it is
usually located before the multi-channel audio encoder block 20.
Such an audio production stage 10 is also assumed (but not shown) in
FIG. 2 before the new encoding block 40. Conventionally, the
editing information is lost and not passed to the encoder, and can
therefore not be exploited. The present invention enables this
information to be preserved. Examples of the audio production stage
10 comprise recording and mixing, synthetic sound or
multi-microphone information, e.g., multiple sound sources that are
synthetically mapped to loudspeaker positions.
[0041] Another advantage of the invention is that the rendering of
transmitted and decoded content can be considerably improved, in
particular for ill-conditioned scenarios where a number of
available loudspeakers is different from a number of available
channels (so-called down-mix and up-mix scenarios), as well as for
flexible loudspeaker positioning. The latter requires re-mapping
according to the loudspeaker position(s).
[0042] Yet another advantage is that audio data in a sound field
related format, such as HOA, can be transmitted in channel-based
audio transmission systems without losing important data that are
required for high-quality rendering.
[0043] The transmission of metadata according to the invention
allows at the decoding side an optimized decoding and/or rendering,
particularly when a spatial decomposition is performed. While a
general spatial decomposition can be obtained by various means,
e.g. a Karhunen-Loeve Transform (KLT), an optimized decomposition
(using metadata according to the invention) is less computationally
expensive and, at the same time, provides a better quality of the
multi-channel output signals (e.g. the individual channels can be
adapted or mapped to loudspeaker positions more easily during the rendering,
and the mapping is more exact). This is particularly advantageous
if the number of channels is modified (increased or decreased) in a
mixing (matrixing) stage during the rendering, or if one or more
loudspeaker positions are modified (especially in cases where each
channel of the multi-channels is adapted to a particular
loudspeaker position).
[0044] In the following, the Higher Order Ambisonics (HOA) and the
Discrete Spherical Harmonics Transform (DSHT) are described.
[0045] HOA signals can be transformed to the spatial domain, e.g.
by a Discrete Spherical Harmonics Transform (DSHT), prior to
compression with perceptual coders. The transmission or storage of
such multi-channel audio signal representations usually demands for
appropriate multi-channel compression techniques. Usually, a
channel-independent perceptual decoding is performed before finally
matrixing the $I$ decoded signals $\hat{\hat{x}}_i(l)$, $i=1,\dots,I$,
into $J$ new signals $\hat{y}_j(l)$, $j=1,\dots,J$. The term matrixing
means adding or mixing the decoded signals $\hat{\hat{x}}_i(l)$ in a
weighted manner. Arranging all signals $\hat{\hat{x}}_i(l)$, $i=1,\dots,I$,
as well as all new signals $\hat{y}_j(l)$, $j=1,\dots,J$, in vectors according to

$$\hat{\hat{\mathbf{x}}}(l) := [\hat{\hat{x}}_1(l) \;\dots\; \hat{\hat{x}}_I(l)]^T \qquad (1a)$$

$$\hat{\mathbf{y}}(l) := [\hat{y}_1(l) \;\dots\; \hat{y}_J(l)]^T \qquad (1b)$$

the term "matrixing" originates from the fact that $\hat{\mathbf{y}}(l)$ is,
mathematically, obtained from $\hat{\hat{\mathbf{x}}}(l)$ through a matrix operation

$$\hat{\mathbf{y}}(l) = \mathbf{A}\,\hat{\hat{\mathbf{x}}}(l) \qquad (2)$$

where $\mathbf{A}$ denotes a mixing matrix composed of mixing weights. The
terms "mixing" and "matrixing" are used synonymously herein.
Mixing/matrixing is used for the purpose of rendering audio signals
for any particular loudspeaker setups.
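As a minimal numerical illustration of eq. (2), matrixing is simply a matrix-vector product per time sample; the weight matrix A below is an arbitrary example (not from the patent) that down-mixes I = 3 decoded channels to J = 2 outputs:

```python
import numpy as np

# Assumed mixing matrix A (J x I): weights chosen purely for illustration.
A = np.array([[1.0, 0.0, 0.5],    # left output  = x1 + 0.5 * x3
              [0.0, 1.0, 0.5]])   # right output = x2 + 0.5 * x3

x_hat = np.array([0.2, -0.1, 0.4])  # one time sample of the I decoded signals
y_hat = A @ x_hat                   # eq. (2): matrixing / mixing
print(y_hat)                        # -> [0.4  0.1]
```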
[0046] The particular individual loudspeaker set-up on which the
matrix depends, and thus the matrix that is used for matrixing
during the rendering, is usually not known at the perceptual coding
stage.
[0047] The following section gives a brief introduction to Higher
Order Ambisonics (HOA) and defines the signals to be processed
(data rate compression).
[0048] Higher Order Ambisonics (HOA) is based on the description of
a sound field within a compact area of interest, which is assumed
to be free of sound sources. In that case the spatiotemporal
behavior of the sound pressure $p(t,\mathbf{x})$ at time $t$ and position
$\mathbf{x}=[r,\theta,\phi]^T$ within the area of interest (in spherical
coordinates) is physically fully determined by the homogeneous wave
equation. It can be shown that the Fourier transform of the sound
pressure with respect to time, i.e.,

$$P(\omega,\mathbf{x}) = \mathcal{F}_t\{p(t,\mathbf{x})\} \qquad (3)$$

where $\omega$ denotes the angular frequency (and $\mathcal{F}_t\{\,\cdot\,\}$
corresponds to $\int_{-\infty}^{\infty} p(t,\mathbf{x})\, e^{-i\omega t}\, dt$), may be
expanded into the series of Spherical Harmonics (SHs) according to:

$$P(kc_s,\mathbf{x}) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} A_n^m(k)\, j_n(kr)\, Y_n^m(\theta,\phi) \qquad (4)$$

[0049] In eq. (4), $c_s$ denotes the speed of sound and $k = \omega / c_s$
the angular wave number. Further, $j_n(\cdot)$ indicate the spherical
Bessel functions of the first kind and order $n$, and $Y_n^m(\cdot)$
denote the Spherical Harmonics (SH) of order $n$ and degree $m$. The
complete information about the sound field is actually contained
within the sound field coefficients $A_n^m(k)$. It should be
noted that the SHs are complex valued functions in general.
However, by an appropriate linear combination of them, it is
possible to obtain real valued functions and perform the expansion
with respect to these functions.
[0050] Related to the pressure sound field description in eq. (4), a
source field can be defined as:

$$D(kc_s,\Omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} B_n^m(k)\, Y_n^m(\Omega), \qquad (5)$$

with the source field or amplitude density [9] $D(kc_s,\Omega)$
depending on angular wave number and angular direction
$\Omega=[\theta,\phi]^T$. A source field can consist of
far-field/near-field, discrete/continuous sources [1]. The source
field coefficients $B_n^m$ are related to the sound field
coefficients $A_n^m$ by [1]:

$$A_n^m = \begin{cases} 4\pi\, i^n\, B_n^m & \text{for the far field} \\ -ik\, h_n^{(2)}(kr_s)\, B_n^m & \text{for the near field} \end{cases} \qquad (6)$$

where $h_n^{(2)}$ is the spherical Hankel function of the
second kind and $r_s$ is the source distance from the origin.
Concerning the near field, it is noted that positive frequencies
and the spherical Hankel function of the second kind $h_n^{(2)}$
are used for incoming waves (related to $e^{-ikr}$).
[0051] Signals in the HOA domain can be represented in the frequency
domain or in the time domain as the inverse Fourier transform of the
source field or sound field coefficients. The following description
will assume the use of a time domain representation of source field
coefficients

$$b_n^m = \mathcal{F}_t^{-1}\{B_n^m\} \qquad (7)$$

of a finite number: the infinite series in eq. (5) is truncated at
$n=N$. Truncation corresponds to a spatial bandwidth limitation. The
number of coefficients (or HOA channels) is given by

$$O_{3D} = (N+1)^2 \text{ for 3D} \qquad (8)$$

or by $O_{2D} = 2N+1$ for 2D-only descriptions. The coefficients
$b_n^m$ comprise the audio information of one time sample $m$
for later reproduction by loudspeakers. They can be stored or
transmitted and are thus subject to data rate compression. A single
time sample $m$ of coefficients can be represented by a vector $\mathbf{b}(m)$
with $O_{3D}$ elements:

$$\mathbf{b}(m) := [b_0^0(m),\, b_1^{-1}(m),\, b_1^0(m),\, b_1^1(m),\, b_2^{-2}(m),\, \dots,\, b_N^N(m)]^T \qquad (9)$$

and a block of $M$ time samples by a matrix $\mathbf{B}$

$$\mathbf{B} := [\mathbf{b}(m_{\mathrm{START}}+1),\, \mathbf{b}(m_{\mathrm{START}}+2),\, \dots,\, \mathbf{b}(m_{\mathrm{START}}+M)] \qquad (10)$$
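For example, a third-order 3D representation (N = 3) has O_3D = (3+1)^2 = 16 coefficient channels, so a block of M time samples forms a 16 x M matrix B; a trivial, purely illustrative sketch:

```python
import numpy as np

N = 3                        # HOA order
O_3D = (N + 1) ** 2          # eq. (8): number of coefficient channels -> 16
M = 1024                     # block length in time samples

# Block of HOA coefficients, eq. (10): each column is one vector b(m), eq. (9).
B = np.zeros((O_3D, M))
print(B.shape)               # -> (16, 1024)
```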
[0052] Two-dimensional representations of sound fields can be
derived by an expansion with circular harmonics. This can be
seen as a special case of the general description presented above,
using a fixed inclination of $\theta = \pi/2$,
a different weighting of coefficients and a set reduced to $O_{2D}$
coefficients ($m = \pm n$). Thus all of the following considerations
also apply to 2D representations; the term sphere then needs to be
substituted by the term circle.
[0053] The following describes a transform from the HOA coefficient
domain to a spatial, channel-based, domain and vice versa. Eq. (5)
can be rewritten using time domain HOA coefficients for $l$ discrete
spatial sample positions $\Omega_l = [\theta_l, \phi_l]^T$ on the unit sphere:

$$d_{\Omega_l} := \sum_{n=0}^{N} \sum_{m=-n}^{n} b_n^m\, Y_n^m(\Omega_l), \qquad (11)$$

[0054] Assuming $L_{sd} = (N+1)^2$ spherical sample positions
$\Omega_l$, this can be rewritten in vector notation for a HOA
data block $\mathbf{B}$:

$$\mathbf{W} = \mathbf{\Psi}_i\, \mathbf{B}, \qquad (12)$$

with $\mathbf{W} := [\mathbf{w}(m_{\mathrm{START}}+1),\, \mathbf{w}(m_{\mathrm{START}}+2),\, \dots,\, \mathbf{w}(m_{\mathrm{START}}+M)]$ and
$\mathbf{w}(m) = [d_{\Omega_1}(m),\, \dots,\, d_{\Omega_{L_{sd}}}(m)]^T$
representing a single time sample of an $L_{sd}$-channel
signal, and the matrix $\mathbf{\Psi}_i = [\mathbf{y}_1, \dots, \mathbf{y}_{L_{sd}}]^H$ with vectors
$\mathbf{y}_l = [Y_0^0(\Omega_l),\, Y_1^{-1}(\Omega_l),\, \dots,\, Y_N^N(\Omega_l)]^T$. If the spherical sample
positions are selected to be sufficiently regular, a matrix $\mathbf{\Psi}_f$ exists
with

$$\mathbf{\Psi}_f\, \mathbf{\Psi}_i = \mathbf{I}, \qquad (13)$$

where $\mathbf{I}$ is an $O_{3D} \times O_{3D}$ identity matrix. Then the
transformation corresponding to eq. (12) can be defined by:

$$\mathbf{B} = \mathbf{\Psi}_f\, \mathbf{W}. \qquad (14)$$

[0055] Eq. (14) transforms $L_{sd}$ spherical signals into the
coefficient domain and can be rewritten as a forward transform:

$$\mathbf{B} = \mathrm{DSHT}\{\mathbf{W}\}, \qquad (15)$$

where $\mathrm{DSHT}\{\,\cdot\,\}$ denotes the Discrete Spherical Harmonics Transform.
The corresponding inverse transform transforms $O_{3D}$
coefficient signals into the spatial domain to form $L_{sd}$
channel-based signals, and eq. (12) becomes:

$$\mathbf{W} = \mathrm{iDSHT}\{\mathbf{B}\}. \qquad (16)$$
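A compact numerical sketch of eqs. (12)-(16) follows. It is purely illustrative: it builds the mode matrix from SciPy's complex spherical harmonics (the text above notes that real-valued SHs can be obtained by linear combination), uses an arbitrary rather than a carefully optimized grid, and takes the pseudo-inverse as Ψ_f:

```python
import numpy as np
from scipy.special import sph_harm          # complex spherical harmonics Y_n^m

def fibonacci_grid(count):
    """Roughly uniform spherical sample positions (azimuth, colatitude); illustrative grid."""
    i = np.arange(count)
    colatitude = np.arccos(1.0 - 2.0 * (i + 0.5) / count)
    azimuth = (np.pi * (1.0 + 5 ** 0.5) * i) % (2.0 * np.pi)
    return azimuth, colatitude

def mode_matrix(order, azimuth, colatitude):
    """Psi_i of eq. (12): rows index positions Omega_l, columns index (n, m) pairs."""
    cols = []
    for n in range(order + 1):
        for m in range(-n, n + 1):
            cols.append(sph_harm(m, n, azimuth, colatitude))   # SciPy: sph_harm(m, n, az, colat)
    return np.conj(np.array(cols)).T        # [y_1 ... y_Lsd]^H, shape (L_sd, O_3D)

N = 2
O_3D = (N + 1) ** 2                         # eq. (8)
L_sd = O_3D                                 # as many sample positions as coefficients
az, colat = fibonacci_grid(L_sd)

Psi_i = mode_matrix(N, az, colat)           # iDSHT matrix of eq. (12)
Psi_f = np.linalg.pinv(Psi_i)               # forward matrix; eq. (13) holds for a regular grid

B = np.random.default_rng(0).standard_normal((O_3D, 4))  # small block of HOA coefficients, eq. (10)
W = Psi_i @ B                               # spatial-domain signals, eq. (16)
B_rec = Psi_f @ W                           # recovered coefficients, eq. (15)
print(np.allclose(B, B_rec))                # True when the grid keeps Psi_i well conditioned
```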
[0056] The DSHT with a number of spherical positions $L_{sd}$
matching the number of HOA coefficients $O_{3D}$ (see eq. (8)) is
described below. First, a default spherical sample grid is
selected. For a block of $M$ time samples, the spherical sample grid
is rotated such that the logarithm of the term

$$\sum_{l=1}^{L_{sd}} \sum_{j=1}^{L_{sd}} \left|\Sigma_{W_{sd}}\right|_{l,j} - \left( \sigma_{sd_1}^2 + \dots + \sigma_{sd_{L_{sd}}}^2 \right) \qquad (17)$$

is minimized, where $|\Sigma_{W_{sd}}|_{l,j}$
are the absolute values of the elements of $\Sigma_{W_{sd}}$
(with matrix row index $l$ and column index $j$) and $\sigma_{sd_l}^2$
are the diagonal elements of $\Sigma_{W_{sd}}$. Visualized, this
corresponds to the spherical sampling grid of the DSHT as shown in
FIG. 5.
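One possible reading of eq. (17), assumed here to be the total absolute mass of Σ_W_sd minus its diagonal (i.e. the absolute off-diagonal mass), evaluated in a short helper:

```python
import numpy as np

def rotation_criterion(W_sd):
    """Eq. (17), as read here: absolute off-diagonal mass of the block covariance of W_sd.

    W_sd: array of shape (L_sd, M), one row per spatial channel of the block.
    """
    sigma_w = np.cov(W_sd)                  # covariance matrix Sigma_W_sd
    total = np.abs(sigma_w).sum()           # sum over all |Sigma_W_sd|_{l,j}
    diagonal = np.trace(sigma_w)            # sum of the variances sigma_{sd_l}^2
    return np.log(total - diagonal)         # the grid rotation minimizes this value

# Usage: evaluate the criterion for candidate grid rotations and keep the smallest.
rng = np.random.default_rng(1)
W_sd = rng.standard_normal((9, 1024))       # L_sd = 9 channels, M = 1024 samples
print(rotation_criterion(W_sd))
```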
[0057] Suitable spherical sample positions for the DSHT and
procedures to derive such positions are well-known. Examples of
sampling grids are shown in FIG. 6. In particular, FIG. 6 shows
examples of spherical sampling positions for a codebook used in
encoder and decoder building blocks pE, pD, namely in FIG. 6 a) for
L.sub.sd=4, in FIG. 6 b) for L.sub.sd=9, in FIG. 6 c) for
L.sub.sd=16 and in FIG. 6 d) for L.sub.sd=25. Such codebooks can,
inter alia, be used for rendering according to pre-defined spatial
loudspeaker configurations.
[0058] FIG. 7 shows an exemplary embodiment of a particularly
improved multi-channel audio encoder 420 shown in FIG. 4. It
comprises a DSHT block 421, which calculates a DSHT that is inverse
to the Inverse DSHT of block 410 (in order to reverse the block
410). The purpose of block 421 is to provide at its output a signal 70
that is substantially identical to the input of the
Inverse DSHT block 410. The processing of this signal 70 can then
be further optimized. The signal 70 comprises not only audio
components that are provided to an MDCT block 422, but also signal
portions 71 that indicate one or more dominant audio signal
components, or rather one or more locations of dominant audio
signal components. These are then used for detecting 424 at least
one strongest source direction and calculating 425 rotation
parameters for an adaptive rotation of the iDSHT. In one
embodiment, this is time variant, i.e. the detecting 424 and
calculating 425 is continuously re-adapted at defined discrete time
steps. The adaptive rotation matrix for the iDSHT is calculated and
the adaptive iDSHT is performed in the iDSHT block 423. The effect
of the rotation is that the sampling grid of the iDSHT 423 is
rotated such that one of the sides (i.e. a single spatial sample
position) matches the strongest source direction (this may be time
variant). This provides a more efficient and therefore better
encoding of the audio signal in the iDSHT block 423. The MDCT block
422 is advantageous for compensating the temporal overlapping of
audio frame segments. The iDSHT block 423 provides an encoded audio
signal 74, and the rotation parameter calculating block 425
provides rotation parameters as (at least a part of) pre-processing
information 75. Additionally, the pre-processing information 75 may
comprise other information.
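The rotation step described for blocks 424, 425 and 423 can be sketched as follows; the toy grid, the detected direction and the Rodrigues construction are illustrative choices, not the patent's specific algorithm:

```python
import numpy as np

def unit_vec(azimuth, colatitude):
    """Cartesian unit vector for a direction on the sphere."""
    return np.array([np.sin(colatitude) * np.cos(azimuth),
                     np.sin(colatitude) * np.sin(azimuth),
                     np.cos(colatitude)])

def rotation_between(a, b):
    """3x3 rotation matrix mapping unit vector a onto unit vector b (Rodrigues formula)."""
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, 1.0):                    # already aligned
        return np.eye(3)
    if np.isclose(c, -1.0):                   # opposite: rotate 180 deg about an orthogonal axis
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-8:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis /= np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

# Illustrative use: align one grid position with the detected strongest source direction.
grid = [unit_vec(a, c) for a, c in [(0.0, 1.2), (2.1, 0.9), (4.0, 1.8)]]  # toy sampling grid
strongest = unit_vec(0.7, 1.0)                # direction from the source detection block 424
R = rotation_between(grid[0], strongest)      # rotation parameters from block 425
rotated_grid = [R @ p for p in grid]          # rotated grid used by the adaptive iDSHT 423
```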
[0059] Further, the present invention relates to the following
embodiments.
[0060] In one embodiment, the invention relates to a method for
transmitting and/or storing and processing a channel based 3D-audio
representation, comprising steps of sending/storing side
information (SI) along with the channel based audio information, the
side information indicating the mixing type and intended speaker
position of the channel based audio information, where the mixing
type indicates an algorithm according to which the audio content
was mixed (e.g. in the mixing studio) in a previous processing
stage, where the speaker positions indicate the positions of the
speakers (ideal positions e.g. in the mixing studio) or the virtual
positions of the previous processing stage. Further processing
steps, after receiving said data structure and channel based audio
information, utilize the mixing & speaker position
information.
[0061] In one embodiment, the invention relates to a device for
transmitting and/or storing and processing a channel based 3D-audio
representation, comprising means for sending (or means for storing)
side information (SI) along with the channel based audio information,
the side information indicating the mixing type and intended
speaker position of the channel based audio information, where the
mixing type signals the algorithm according to which the audio
content was mixed (e.g. in the mixing studio) in a previous
processing stage, where the speaker positions indicate the
positions of the speakers (ideal positions e.g. in the mixing
studio) or the virtual positions of the previous processing stage.
Further, the device comprises a processor that utilizes the mixing
& speaker position information after receiving said data
structure and channel based audio information.
[0062] In one embodiment, the present invention relates to a 3D
audio system where the mixing information signals HOA content, the
HOA order and virtual speaker position information that relates to
an ideal spherical sampling grid that has been used to convert HOA
3D audio to the channel based representation before. After
receiving/reading transmitted channel based audio information and
accompanying side information (SI), the SI is used to re-encode the
channel based audio to HOA format. Said re-encoding is done by
calculating a mode matrix $\mathbf{\Psi}$ from said spherical sampling
positions and matrix-multiplying it with the channel based content
(DSHT).
[0063] In one embodiment, the system/method is used for
circumventing ambiguities of different HOA formats. The HOA 3D
audio content in a 1st HOA format at the production side is
converted to a related channel based 3D audio representation using
the iDSHT related to the 1st format, and is distributed together with the SI.
The received channel based audio information is converted to a
2nd HOA format using the SI and a DSHT related to the 2nd
format. In one embodiment of the system, the 1st HOA format
uses a HOA representation with complex values and the 2nd HOA
format uses a HOA representation with real values. In one
embodiment of the system, the 2nd HOA format uses a complex
HOA representation and the 1st HOA format uses a HOA
representation with real values.
[0064] In one embodiment, the present invention relates to a 3D
audio system, wherein the mixing information is used to separate
directional 3D audio components (audio object extraction) from the
signal used within rate compression, signal enhancement or
rendering. In one embodiment, further steps are signaling HOA, the
HOA order and the related ideal spherical sampling grid that has
been used to convert HOA 3D audio to the channel based
representation before, restoring the HOA representation and
extracting the directional components by determining main signal
directions by use of block based covariance methods. Said
directions are used for HOA decoding the directional signals to
these directions. In one embodiment, the further steps are
signaling Vector Base Amplitude Panning (VBAP) and related speaker
position information, where the speaker position information is
used to determine the speaker triplets and a covariance method is
used to extract a correlated signal out of said triplet
channels.
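A very rough sketch of a block-based covariance approach to estimating dominant directions (the eigenvector weighting heuristic is an assumption for illustration, not the method prescribed by the text):

```python
import numpy as np

def main_directions(W, grid_dirs, num=1):
    """Estimate dominant source directions from a block of spatial-domain signals.

    W: (L_sd, M) block of channel signals; grid_dirs: (L_sd, 3) unit vectors of the
    spatial sample positions. A simple block-covariance reading of the idea above.
    """
    cov = np.cov(W)                                  # block-based covariance matrix
    vals, vecs = np.linalg.eigh(cov)                 # eigen-decomposition (ascending order)
    dirs = []
    for k in range(1, num + 1):
        weights = np.abs(vecs[:, -k])                # k-th dominant eigenvector over channels
        centroid = weights @ grid_dirs               # power-weighted direction estimate
        dirs.append(centroid / np.linalg.norm(centroid))
    return dirs
```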
[0065] In one embodiment of the 3D audio system, residual signals
are generated from the directional signals and the restored signals
related to the signal extraction (HOA signals, VBAP triplets
(pairs)).
[0066] In one embodiment, the present invention relates to a system
to perform data rate compression of the residual signals by steps
of reducing the order of the HOA residual signal and compressing
reduced order signals and directional signals, mixing the residual
triplet channels to a mono stream and providing related correlation
information, and transmitting said information and the compressed
mono signals together with compressed directional signals.
[0067] In one embodiment of the system to perform data rate
compression, it is used for rendering audio to loudspeakers,
wherein the extracted directional signals are panned to
loudspeakers using the main signal directions and the de-correlated
residual signals in the channel domain.
[0068] The invention generally allows signaling of audio
content mixing characteristics. The invention can be used in audio
devices, particularly in audio encoding devices, audio mixing
devices and audio decoding devices.
[0069] It should be noted that although shown simply as a DSHT,
other types of transformation than a DSHT may be constructed or
applied, as would be apparent to those of ordinary skill in the
art, all of which are contemplated within the spirit and scope of
the invention. Further, although the HOA format is exemplarily
mentioned in the above description, the invention can also be used
with soundfield-related formats other than
Ambisonics, as would be apparent to those of ordinary skill in the
art, all of which are contemplated within the spirit and scope of
the invention.
[0070] While there has been shown, described, and pointed out
fundamental novel features of the present invention as applied to
preferred embodiments thereof, it will be understood that various
omissions and substitutions and changes in the apparatus and method
described, in the form and details of the devices disclosed, and in
their operation, may be made by those skilled in the art without
departing from the spirit of the present invention. It will be
understood that the present invention has been described purely by
way of example, and modifications of detail can be made without
departing from the scope of the invention. It is expressly intended
that all combinations of those elements that perform substantially
the same function in substantially the same way to achieve the same
results are within the scope of the invention. Substitutions of
elements from one described embodiment to another are also fully
intended and contemplated.
REFERENCES
[0071] [1] T. D. Abhayapala, "Generalized framework for spherical
microphone arrays: Spatial and frequency decomposition", in Proc.
IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), April 2008, Las Vegas, USA.
[0072] [2] J. R. Driscoll and D. M. Healy Jr., "Computing Fourier
transforms and convolutions on the 2-sphere", Advances in Applied
Mathematics, 15:202-250, 1994.
* * * * *