U.S. patent application number 16/769576 was published on 2021-06-03 as publication number 20210166360 for a method and apparatus for inverse tone mapping. This patent application is currently assigned to KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY. The applicant listed for this patent is KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY. Invention is credited to Dae Eun KIM, Mun Churl KIM, Soo Ye KIM.

United States Patent Application 20210166360
Kind Code: A1
KIM; Mun Churl; et al.
June 3, 2021

METHOD AND APPARATUS FOR INVERSE TONE MAPPING
Abstract
Inverse tone mapping (ITM) aims at generating a single high
dynamic range (HDR) image from a low dynamic range (LDR) image.
While ITM was frequently used for graphics rendering in the HDR
space, the advent of HDR consumer displays (e.g., HDR TV) and the
consequent need for HDR multimedia contents open up new horizons
for the consumption of ultra-high quality video contents. However,
due to the lack of HDR-filmed contents, the legacy LDR videos must
be up-converted for viewing on these HDR displays. Unfortunately,
the previous ITM methods are not appropriate for HDR consumer
displays, and their inverse-tone-mapped results are not visually
pleasing with noise amplification or lack of details. In this
paper, we propose a convolutional neural network (CNN) based
architecture designed for the ITM to HDR consumer displays, called
ITM-CNN, and its training strategy for enhancing the performance
based on image decomposition using the guided filter. We
demonstrate the benefits of decomposing the image by experimenting
with various architectures and also compare the performance for
different training strategies. To the best of our knowledge, this
paper first presents the ITM problem using CNNs for HDR consumer
displays, where the network is trained to restore lost details and
local contrast. Our ITM-CNN can readily up-convert LDR images for
direct viewing on an HDR consumer medium, and is a very powerful
means to solve the lack of HDR video contents with legacy LDR
videos.
Inventors: KIM; Mun Churl (Daejeon, KR); KIM; Soo Ye (Daejeon, KR); KIM; Dae Eun (Daejeon, KR)
Applicant: KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (Daejeon, KR)
Assignee: KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (Daejeon, KR)
Family ID: 1000005434082
Appl. No.: 16/769576
Filed: December 6, 2017
PCT Filed: December 6, 2017
PCT No.: PCT/KR2017/014265
371 Date: June 3, 2020
Current U.S. Class: 1/1
Current CPC Class: G06F 17/11 (20130101); G06T 2207/20208 (20130101); G06T 2207/20084 (20130101); G06N 3/08 (20130101); G06T 2207/10016 (20130101); G06T 2207/10024 (20130101); G06N 3/02 (20130101); G06T 5/50 (20130101); G06T 5/009 (20130101); G06T 2207/20081 (20130101)
International Class: G06T 5/00 (20060101); G06T 5/50 (20060101)
Claims
1. An apparatus for inverse tone mapping, comprising: an end-to-end CNN configured to jointly optimize the LDR decomposition and HDR reconstruction phases, wherein the end-to-end CNN allows all legacy LDR images/video to be up-converted for direct viewing on HDR TVs.
Description
TECHNICAL FIELD
[0001] At least one example embodiment relates to a method for
inverse tone mapping and apparatuses performing the method.
BACKGROUND ART
[0002] The human visual system perceives the world as much brighter, with stronger contrasts and more details than is typically presented on standard dynamic range (SDR) displays. In comparison, recently available high dynamic range (HDR) consumer displays allow users to enjoy videos closer to reality as seen by the naked eye, with a brightness of at least 1,000 cd/m² (as opposed to 100 cd/m² for SDR displays), a higher contrast ratio, an increased bit depth of 10 bits or more, and a wide color gamut (WCG). However, although HDR TVs are readily available in the market, there is a severe lack of HDR contents.
[0003] Inverse tone mapping (ITM), also referred to as reverse tone
mapping, is a popular area of research in computer graphics that
aims to predict HDR images from low dynamic range (LDR) images for
better graphics rendering. Another field of research, HDR imaging,
makes use of multiple LDR images of different exposures to create a
single HDR image that contains details in the saturated regions. In
the above two fields of research, the lighting calculations are
conducted in the HDR domain with the belief that this would yield a
more accurate representation of the graphic or natural scene on an
SDR display. The HDR images are viewed on professional HDR monitors
during rendering. Consequently, the HDR domain referred to in the
above areas are not necessarily the same as the now available HDR
consumer displays, and the resulting HDR images by the ITM methods
for such purposes are not suitable for direct viewing on an HDR TV.
[0004] When the conventional ITM methods are applied for an HDR TV with a maximum brightness of 1,000 cd/m², they are not capable of fully utilizing the available HDR capacity due to their weakness in generating full contrast and/or details, or due to noise amplification, as seen in FIG. 1. (Note that the images in FIG. 1 are tone mapped for viewing on paper.)
DISCLOSURE OF INVENTION
Technical Problem
[0005] When the conventional ITM methods are applied for an HDR TV with a maximum brightness of 1,000 cd/m², they are not capable of fully utilizing the available HDR capacity due to their weakness in generating full contrast and/or details, or due to noise amplification.
Solution to Problem
[0006] Therefore, we formulate a slightly different problem where
we aim to generate HDR images that can be directly viewed on
commercial HDR TVs. In this way, LDR legacy videos may be up-converted to be viewed on available HDR displays without requiring additional information. We propose an effective
convolutional neural network (CNN) based structure and its learning
strategy for up-converting a single LDR image of 8 bits/pixel,
gamma-corrected [20], in the BT.709 color container [25], to an HDR
image of 10 bits/pixel through the perceptual quantization (PQ)
transfer function [27] in the BT.2020 color container [26], that
may be directly viewed with commercial HDR TVs.
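By way of illustration only, the sketch below shows the standard PQ (SMPTE ST 2084) inverse EOTF and a commonly used linear-light BT.709-to-BT.2020 primaries matrix involved in such a format conversion; this is a minimal sketch of the target signal format, not of the proposed method, and the function names are hypothetical.

import numpy as np

# SMPTE ST 2084 (PQ) constants.
M1 = 2610.0 / 16384.0            # 0.1593017578125
M2 = 2523.0 / 4096.0 * 128.0     # 78.84375
C1 = 3424.0 / 4096.0             # 0.8359375
C2 = 2413.0 / 4096.0 * 32.0      # 18.8515625
C3 = 2392.0 / 4096.0 * 32.0      # 18.6875

def pq_encode(y):
    """Map linear luminance y (1.0 = 10,000 cd/m^2) to PQ code values in [0, 1]."""
    yp = np.power(np.clip(y, 0.0, 1.0), M1)
    return np.power((C1 + C2 * yp) / (1.0 + C3 * yp), M2)

# Linear-light BT.709 -> BT.2020 primaries conversion (per ITU-R BT.2087).
BT709_TO_BT2020 = np.array([
    [0.6274, 0.3293, 0.0433],
    [0.0691, 0.9195, 0.0114],
    [0.0164, 0.0880, 0.8956],
])

# A 10-bit full-range code value would then be, e.g., np.round(1023 * pq_encode(y)).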
BRIEF DESCRIPTION OF DRAWINGS
[0007] These and/or other aspects will become apparent and more
readily appreciated from the following description of example
embodiments, taken in conjunction with the accompanying drawings of
which:
[0008] FIG. 1 is a diagram illustrating conventional ITM methods;
[0009] FIG. 2 is a diagram illustrating the architecture of our ITM-CNN;
[0010] FIG. 3 illustrates image decomposition using a guided filter;
[0011] FIG. 4 is a diagram illustrating the pre-train structure;
[0012] FIG. 5a, FIG. 5b and FIG. 5c are diagrams illustrating different architectures using image decomposition; and
[0013] FIG. 6 is a diagram illustrating comparisons with previous methods.
MODE FOR THE INVENTION
[0014] Hereinafter, some example embodiments will be described in
detail with reference to the accompanying drawings. Regarding the
reference numerals assigned to the elements in the drawings, it
should be noted that the same elements will be designated by the
same reference numerals, wherever possible, even though they are
shown in different drawings. Also, in the description of
embodiments, detailed description of well-known related structures
or functions will be omitted when it is deemed that such
description will cause ambiguous interpretation of the present
disclosure.
[0015] It should be understood, however, that there is no intent to
limit this disclosure to the particular example embodiments
disclosed. On the contrary, example embodiments are to cover all
modifications, equivalents, and alternatives falling within the
scope of the example embodiments.
[0016] The terminology used herein is for the purpose of describing
particular embodiments only and is not intended to limit the
disclosure. As used herein, the singular forms "a," "an," and "the"
are intended to include the plural forms as well, unless the
context clearly indicates otherwise. It will be further understood
that the terms "comprises," "comprising," "includes," and/or
"including," when used herein, specify the presence of stated
features, integers, steps, operations, elements, and/or components,
but do not preclude the presence or addition of one or more other
features, integers, steps, operations, elements, components, and/or
groups thereof.
[0017] In addition, terms such as first, second, A, B, (a), (b),
and the like may be used herein to describe components. Each of
these terminologies is not used to define an essence, order or
sequence of a corresponding component but used merely to
distinguish the corresponding component from other
component(s).
[0018] It should be noted that if it is described in the disclosure
that one component is "connected," "coupled," or "joined" to
another component, a third component may be "connected," "coupled,"
and "joined" between the first and second components, although the
first component may be directly connected, coupled or joined to the
second component.
[0019] It should also be noted that in some alternative
implementations, the functions/acts noted may occur out of the
order noted in the figures. For example, two figures shown in
succession may in fact be executed substantially concurrently or
may sometimes be executed in the reverse order, depending upon the
functionality/acts involved.
[0020] Unless otherwise defined, all terms including technical and
scientific terms used herein have the same meaning as commonly
understood by one of ordinary skill in the art to which these
example embodiments belong. It will be further understood that
terms, such as those defined in commonly used dictionaries, should
be interpreted as having a meaning that is consistent with their
meaning in the context of the relevant art and will not be
interpreted in an idealized or overly formal sense unless expressly
so defined herein.
[0021] Various example embodiments will now be described more fully
with reference to the accompanying drawings in which some example
embodiments are shown. In the drawings, the thicknesses of layers
and regions are exaggerated for clarity.
[0022] We propose the first CNN architecture, called ITM-CNN, for the ITM problem targeting readily available HDR consumer displays.
[0023] Our architecture is a fully end-to-end CNN that is able to
jointly optimize the LDR decomposition and HDR reconstruction
phases.
[0024] This allows all legacy LDR images/video to be up-converted
for direct viewing on HDR TVs.
2. RELATED WORK
[0025] Tone mapping is a popular problem dealt with in computer graphics
and image processing. When graphics are rendered in the HDR domain
for better representation of the scenes which have near-continuous
brightness and high contrast in the real world, they have to be
tone mapped somehow to the displayable range. ITM came later, to
transfer the LDR domain images to the HDR domain. The term itself
was first used by Banterle et al. in [14]. The previous ITM methods
mainly focus on generating the expand map and revealing contrast in
saturated regions. In this section, previous methods regarding ITM
to the HDR domain will be reviewed. It should be noted that we
address a slightly different problem in this paper than the
previous methods, where our final goal is not viewing HDR rendered
tone mapped images on SDR displays, nor viewing HDR images on a
professional HDR monitor, but viewing generated HDR images directly
on consumer HDR displays (e.g., HDR TVs).
[0026] ITM was first introduced by [14, 17], where Banterle et al.
formulated the inverse of Reinhard's tone mapping operator [18]. In
addition to the inverse of the tone mapping operator, they also
proposed an expand map, which specifies the amount of expansion for
every pixel position. The two main purposes of the expand map are to
reduce contouring artifacts resulting from quantization and to
expand the bright regions in the LDR images. Similarly, Rempel et
al. proposed a brightness enhancement function in [22]. The
brightness enhancement function is derived from the blurred mask
that indicates saturated pixel areas. With the edge stopping
filter, the brightness enhancement function preserves edges with
high contrast. Another related approach is [15, 16] where Kovaleski
and Oliveira used a bilateral grid to make an edge-preserving
expand map.
[0027] There are also methods that find a global mapping function
of the whole image instead of a pixel-wise mapping. In [21], Meylan
et al. applied linear expansions with two different slopes
depending on whether the pixel is classified as the diffuse region
or the specular region. Pixels with values greater than a
predefined threshold are classified into the specular region and
all other pixels are classified into the diffuse region. The
specular region is expanded with a steeper function than the
diffuse region. The ITM algorithms presented above mainly classify
pixels into two classes: pixels to be expanded more and pixels to
be expanded less. Usually, the pixels in the bright regions tend to
be expanded more.
[0028] Another method that investigates a global mapping function
is [23]. In [23], Masia et al. evaluated a number of ITM algorithms
and found that the performance of the algorithms decreased for
overexposed input images. Based on this observation, they proposed
an ITM curve based on the gamma curve, where the parameter gamma is
a function of the statistics of the input image. Their following
work [24] improves upon their previous work with a robust
multilinear regression model. In [19], Huo et al. proposed an ITM
algorithm imitating the characteristics of the human visual system
and its retina response. This ITM algorithm enhances local
contrast.
[0029] Recently, CNN-based structures have shown exceptional
performance in modelling images to find a non-linear mapping,
especially for classification problems [11] or regression problems
[12, 13]. There are very few ITM methods based on CNNs. One of
them, proposed by Endo et al. [7] is an indirect approach where
they use a combination of 2D and 3D CNNs to generate multiple LDR
images (bracketed images) of different exposures from a single LDR
image, and merge these bracketed images using the existing methods
to obtain a final tone mapped HDR image. Kalantari et al. [8] proposed an HDR imaging method where they used a CNN for integrating the given multiple LDR images of different exposures to generate an HDR image which is tone mapped for viewing.
[0030] Eilertsen et al. [9] and Zhang et al. [10] proposed encoder-decoder-based networks for HDR reconstruction, whose results are also tone mapped for viewing.
[0031] However, the previous methods share an ultimate step of tone mapping the HDR-rendered images, where the HDR domain referred to in their papers is the one to be viewed on professional HDR monitors for rendering operations. Consequently, they have not considered transfer functions such as the PQ-EOTF [27, 28] or Hybrid Log Gamma [28] and color containers related to the SDR or HDR formats. Even when the transfer function and the color container are converted
the transfer function and the color container are converted
manually, HDR images converted by the previous methods are not
suitable for viewing on consumer HDR displays. In this paper, we
propose an ITM method with ITM-CNN, by which the resulting HDR
images can be directly viewed on commercial HDR TVs. The end-to-end
CNN-based structure of our ITM-CNN benefits from image
decomposition along with delicate training strategies.
3. PROPOSED METHOD
[0032] Our proposed ITM-CNN has an end-to-end CNN structure for the
prediction of the HDR image from a single LDR image.
[0033] 3.1 Network Architecture
[0034] In tone mapping, edge-preserving filters (e.g., the bilateral
filter) are frequently used on the HDR input to decompose the image
into the base layer and detail layer so that only the base layer is
compressed while preserving the detail layer.
[0035] The processed base and detail layers are then integrated to
obtain a final LDR output image. In contrast, the purpose of an ITM
algorithm is to predict lost details with an extended base layer to
match the desired brightness to finally generate the output HDR
image. If the image is decomposed into two parts with different
characteristics, appropriate processing may be done for the
individual branches for a more accurate prediction of the output
image. Following this idea, we explicitly model our CNN
structure (ITM-CNN) as three parts: (i) LDR decomposition, (ii)
feature extraction and (iii) HDR reconstruction, as illustrated in
FIG. 2.
[0036] The first part of our ITM-CNN, LDR decomposition, consists
of three convolution layers where the number of output channels for
the last layer is six, intended to decompose each LDR input image
into two different sets of feature maps, simply divided as the
first three and the last three feature maps. Then, the convolution
layers in the feature extraction part proceed individually for the
two sets so that each of the two CNN branches is able to focus on
the characteristics of the respective inputs for specialized
feature extraction. Lastly, for HDR reconstruction, the extracted
feature maps from the individual passes are concatenated and the
network (HDR reconstruction part) learns to integrate the feature
maps from the two passes to finally generate an HDR image. Our
ITM-CNN jointly optimizes all three steps of LDR decomposition,
feature extraction and HDR reconstruction, but has to be trained
delicately to fully benefit from the image filtering idea
(decomposition of LDR input).
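As a minimal, non-limiting sketch of this three-part layout (written here in PyTorch for illustration; the layer counts and channel widths are borrowed from structure (c) in Table 3, and the exact hyper-parameters of FIG. 2 may differ):

import torch
import torch.nn as nn

class ITMCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # (i) LDR decomposition: three conv layers; the last outputs 6 channels,
        # split into two 3-channel sets of feature maps.
        self.decompose = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 6, 3, padding=1),
        )
        # (ii) Feature extraction: one branch per decomposed set.
        def branch():
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            )
        self.branch_a, self.branch_b = branch(), branch()
        # (iii) HDR reconstruction: integrate the concatenated branch features.
        self.reconstruct = nn.Sequential(
            nn.Conv2d(64, 40, 3, padding=1), nn.ReLU(),
            nn.Conv2d(40, 40, 3, padding=1), nn.ReLU(),
            nn.Conv2d(40, 3, 3, padding=1),
        )

    def forward(self, ldr):
        d = self.decompose(ldr)
        feats = torch.cat([self.branch_a(d[:, :3]), self.branch_b(d[:, 3:])], dim=1)
        return self.reconstruct(feats)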
[0037] 3.2 Training Strategy
[0038] First, we pre-train the feature extraction and HDR reconstruction parts of the ITM-CNN after setting the LDR decomposition part as a guided-filtering-based separation of the base and detail layers for the LDR input.
[0039] The pre-training structure of the ITM-CNN is illustrated in FIG. 4. The guided
filter [1, 2] is an edge-preserving filter that does not suffer
from gradient reversal artifacts like the bilateral filter [3]. The
base layer is extracted using the self-guided filter as suggested
in [1, 2], and the detail layer is obtained by element-wise
division of the input LDR image by the base layer, given as
$I_{\text{detail}} = I_{\text{LDR}} \oslash I_{\text{base}}$ (1)
[0040] where $I_{\text{LDR}}$ is the LDR input, $I_{\text{detail}}$ is the detail layer and $I_{\text{base}}$ is the base layer. $\oslash$ in (1) denotes an element-wise division operator. An example of an image separated using the guided filter
is given in FIG. 3.
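As an illustrative sketch of this decomposition (assuming a single-channel image in [0, 1] processed per channel; the box-filter formulation follows the guided filter of [1, 2], and the radius and epsilon values are placeholders):

import numpy as np
from scipy.ndimage import uniform_filter

def self_guided_base(img, radius=8, eps=1e-3):
    """Base layer via the self-guided filter (guide = input), per [1, 2]."""
    size = 2 * radius + 1
    mean = uniform_filter(img, size)
    var = uniform_filter(img * img, size) - mean * mean
    a = var / (var + eps)            # with self-guidance, cov(I, I) = var(I)
    b = mean - a * mean
    return uniform_filter(a, size) * img + uniform_filter(b, size)

def decompose(ldr, eps_div=1e-6):
    base = self_guided_base(ldr)
    detail = ldr / (base + eps_div)  # element-wise division, as in equation (1)
    return base, detail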
[0041] After pre-training the feature extraction and HDR
reconstruction parts using the pre-train structure with
guided-filtering-separation, the guided filter is replaced with
three convolution layers (LDR decomposition part in FIG. 2)
allowing for the final fully convolutional architecture as given in
FIG. 2. We pre-train the three layers in the new LDR decomposition
part with the same data but without updating the weights of later
layers by setting the learning rate to zero for those layers, so
that the convolution layers learn to decompose the LDR input image
into feature maps that lower the final loss, while utilizing the
weights in later layers that were trained with guided filter
separation. When the pre-training is finished, the network
(ITM-CNN) is finally trained end-to-end for joint optimization of
all three parts. We observed a significant increase in performance
by using this pre-training strategy.
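In a PyTorch-style sketch of this staged schedule (reusing the hypothetical ITMCNN sketch above; the SGD optimizer and momentum value are assumptions, and the separate bias settings reported in Section 4.1 are omitted for brevity, while the zero learning rate for fixed layers follows the text):

import torch

model = ITMCNN()  # from the sketch above

# Stage 2: train only the new LDR decomposition layers; the pre-trained feature
# extraction and HDR reconstruction weights stay fixed via a zero learning rate.
optimizer = torch.optim.SGD(
    [
        {"params": model.decompose.parameters(), "lr": 1e-4},
        {"params": model.branch_a.parameters(), "lr": 0.0},     # fixed
        {"params": model.branch_b.parameters(), "lr": 0.0},     # fixed
        {"params": model.reconstruct.parameters(), "lr": 0.0},  # fixed
    ],
    lr=1e-4, momentum=0.9, weight_decay=5e-4,
)

# Stage 3: end-to-end joint optimization of all three parts.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9, weight_decay=5e-4)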
4. EXPERIMENTS
4.1 Experiment Conditions
[0042] Data. We collected 7,268 frames of 3840×2160 UHD resolution of LDR-HDR data pairs containing diverse scenes. The specifications are given in Table 1. The HDR video is professionally filmed and mastered, and both the LDR and HDR data are normalized to be in the range [0, 1]. For the synthesis of training data, we randomly cropped 20 subimages of size 40×40 per frame with a frame stride of 30. This resulted in total training data of size 40×40×3×4,860. For testing,
we selected 14 frames from six different scenes that are not
included in the training set. All videos were converted to the YUV
color space and all three YUV channels were used for training.
Although it is common to use Y channel only, using all three
channels is more reasonable for our ITM problem since the color
container also changes from BT.709 [25] to BT.2020 [26]. The
quantitative benefit of using all three channels is shown in Table
2 when experimenting with a simple CNN structure.
TABLE 1
Data Type | Bit Depth     | Transfer Function | Color Container
LDR       | 8 bits/pixel  | Gamma             | BT.709
HDR       | 10 bits/pixel | PQ-EOTF           | BT.2020
TABLE 2
Train Data       | Y only | YUV
PSNR of Y (dB)   | 44.36  | 46.36
PSNR of YUV (dB) | 32.53  | 48.25
[0043] The huge difference in the PSNR when measured for all three YUV channels is largely in part due to the color container and transfer function mismatch of LDR and HDR images if the U and V channels are not trained.
[0044] Note that the Y channel benefits from the complementary information of the U and V channels.
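A sketch of this training-data synthesis (with a hypothetical helper name; it assumes the frames are given as aligned (LDR, HDR) pairs of H×W×3 arrays in [0, 1]):

import numpy as np

def sample_patches(frame_pairs, stride=30, per_frame=20, size=40, rng=None):
    """Crop 20 random 40x40 subimage pairs from every 30th (LDR, HDR) frame pair."""
    rng = rng or np.random.default_rng(0)
    patches = []
    for idx in range(0, len(frame_pairs), stride):
        ldr, hdr = frame_pairs[idx]
        h, w = ldr.shape[:2]
        for _ in range(per_frame):
            y = rng.integers(0, h - size + 1)
            x = rng.integers(0, w - size + 1)
            patches.append((ldr[y:y + size, x:x + size],
                            hdr[y:y + size, x:x + size]))
    return patches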
[0045] Training parameters. The weight decay of the convolution filters was set to 5×10⁻⁴, with that of the biases set to zero. The mini-batch size was 32, with a learning rate of 10⁻⁴ for filters and 10⁻⁵ for biases. All convolution filters are of size 3×3, and the weights were initialized with the Xavier initialization [4], which draws the weights from a normal distribution whose variance depends on the number of input and output neurons. The loss function of the network (ITM-CNN) is given by

$$L(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left\| F\big(I_{\text{LDR}}^{(i)}; \theta\big) - I_{\text{HDR}}^{(i)} \right\|^2 \qquad (2)$$

[0046] where $\theta$ is the set of model parameters, n is the number of training samples, $I_{\text{LDR}}^{(i)}$ is the i-th input LDR image, F is the non-linear mapping function of the ITM-CNN giving the prediction of the network as $F(I_{\text{LDR}}; \theta)$, and $I_{\text{HDR}}^{(i)}$ is the corresponding ground truth HDR image. The activation function is the rectified linear unit (ReLU) [5], given by

$$\text{ReLU}(x) = \max(0, x) \qquad (3)$$
[0047] All network models are implemented with the MatConvNet [6]
package.
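Equivalently, in a PyTorch-style sketch (not the MatConvNet implementation used here), the loss in (2) for a mini-batch of n samples could be computed as:

import torch

def itm_loss(pred, target):
    # Equation (2): (1 / 2n) * sum over samples of the squared L2 distance
    # between the network prediction F(I_LDR; theta) and the ground truth I_HDR.
    per_sample = ((pred - target) ** 2).sum(dim=(1, 2, 3))
    return 0.5 * per_sample.mean()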
4.2 Input Decomposition
[0048] Decomposing the LDR input lets the feature extraction layers concentrate on each of the decompositions. Specifically, we use
the guided filter [1, 2] for input decomposition. We compare three
different architectures shown in FIG. 5 to observe the effect of
decomposing the LDR input.
[0049] The first structure shown in FIG. 5a is a simple
six-convolution-layer structure with residual learning. Since both
the input LDR image with 8 bits/pixel and the ground truth HDR
image with 10 bits/pixel are normalized to be in the range [0, 1],
we can simply model the network (FIG. 5a) to learn the difference
between the LDR and HDR image for a more accurate prediction.
Although no decomposition is performed on the input LDR image,
residual learning may be interpreted as an additive separation of
the output that allows the CNN (FIG. 5a) to focus only on
predicting the difference between LDR and HDR.
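A minimal sketch of structure (a) with residual learning (channel widths taken from Table 3; PyTorch-style, for illustration only):

import torch.nn as nn

class SimpleResidualITM(nn.Module):
    """Structure (a): six conv layers predicting only the LDR-to-HDR difference."""
    def __init__(self, ch=32):
        super().__init__()
        layers = [nn.Conv2d(3, ch, 3, padding=1), nn.ReLU()]
        for _ in range(4):
            layers += [nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(ch, 3, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, ldr):
        # Residual learning: output = input + predicted (HDR - LDR) difference.
        return ldr + self.body(ldr)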
[0050] The second structure shown in FIG. 5b uses the guided filter
for multiplicative input decomposition, and has two individual
passes where one predicts the base layer and the other predicts the
detail layer of the HDR image. The base and detail layer
predictions are then multiplied element-wise to obtain the final
HDR image. By providing the ground truth base and detail layers of
the HDR image, this second structure can fully concentrate on the
decompositions for the final prediction.
[0051] The last structure, shown in FIG. 5c, also uses the guided
filter for multiplicative input decomposition, but the feature maps
from the individual passes are concatenated for direct prediction
of the HDR image. This network learns the optimal integration
operation that lowers the final loss through the last three
convolutional layers, whereas for the structure in FIG. 5b, we
explicitly force the network to model the base and detail layers of
the HDR image for an element-wise multiplicative integration. Note
that this last structure is the same as the pre-train structure in
FIG. 4.
TABLE 3
(Entries are the numbers of filter channels (input, output) per layer; "(x2)" marks layers applied in each of the two parallel branches. The second (a) column is a 45-channel variant of structure (a), sized to match the parameter count of the other structures.)

Structure          (a)*    (a)     (a)     (b)          (c)*         (c)
Layer 1            3, 32   3, 32   3, 45   3, 32 (x2)   3, 32 (x2)   3, 32 (x2)
Layer 2            32, 32  32, 32  45, 45  32, 32 (x2)  32, 32 (x2)  32, 32 (x2)
Layer 3            32, 32  32, 32  45, 48  32, 32 (x2)  32, 32 (x2)  32, 32 (x2)
Layer 4            32, 32  32, 32  48, 45  32, 32 (x2)  32, 52       64, 40
Layer 5            32, 32  32, 32  45, 45  32, 32 (x2)  52, 48       40, 40
Layer 6            32, 3   32, 3   45, 3   32, 3 (x2)   48, 3        40, 3
Total parameters   38,592  38,592  77,760  77,184       77,328       77,112
PSNR of Y (dB)     45.46   46.36   46.36   46.84        46.73        47.03
PSNR of YUV (dB)   47.28   48.25   48.39   48.65        48.87        48.82
[0052] The results of the experiment are given in Table 3 where
(a), (b) and (c) denote the structures illustrated in FIG. 5. For
fair comparison, we tune the number of filters in the hidden layers so that the total number of parameters for each of the structures is similar. We perform two additional experiments with structures (a) and (c), denoted by (a)* and (c)*, where (a)* is structure (a) without residual learning, and (c)* is structure (c) with element-wise multiplication instead of concatenation for integrating the feature maps after the third convolution layer. We
compare the structures in terms of PSNR measured only for the Y
channel and for all three YUV channels.
[0053] Even with a similar number of parameters, there is a maximum PSNR difference of 0.67 dB measured for the Y channel only and 0.48 dB measured for the YUV channels, depending on whether input decomposition is used and on how the decomposed inputs are treated. The highest performing structure is structure (c), where the network has the freedom to learn the integration of the two feature extraction passes, although (c)* shows comparable results for PSNR measured over YUV. Comparing the structures (a) and (b), we confirm that the input decomposition using the guided filter is highly beneficial. Also, for the simple CNN architecture in structure (a), it is crucial to use residual learning for improved prediction. Letting the convolution layers focus on specific input decompositions with different characteristics, and learning to combine the information generated by the different branches, are important in reconstructing a high-quality HDR image.
4.3 Effect of Pre-Training
[0054] We model a fully end-to-end CNN structure as illustrated in
FIG. 2 by replacing the guided filter based separation in the
structure (c) with three convolutional layers, each with 32 filters of size 3×3. However, we find that pre-training the network
is essential to fully utilize the three parts of the network (LDR
decomposition, feature extraction and HDR reconstruction) as
intended. Table 4 shows the results of the same network, the fully
convolutional network in FIG. 2, with different training
procedures. The specified order n corresponding to each of the three parts in Table 4 indicates the training order, where only the layers of a specific part are trained in the nth stage, and the
weights of other layers remain fixed. When the feature extraction
and HDR reconstruction parts are marked as being trained first
(denoted as `1st` in columns 3, 4 and 5 of Table 4), the LDR
decomposition part was replaced with the guided filter separation
for the pre-training. The `All` in row 5 of Table 4 indicates that
all layers were trained, end-to-end.
[0055] If the whole network is simply trained end-to-end without
any pre-training, it achieves 0.32 dB lower performance in PSNR of
Y than the structure of FIG. 5c, even though it has three more
convolution layers or 11,808 more filter parameters. However, if
the pre-trained filter values (of FIG. 4 or 5c) using the guided
filter separation (instead of the LDR decomposition layers) are
used to initialize the network before the end-to-end training, the
resulting PSNR performance jumps by 0.4 dB and 0.18 dB for Y and
YUV respectively, thus exceeding the best performing version in
Table 3.
[0056] The highest PSNR performance can be obtained by also
pre-training the LDR decomposition layers in conjunction with the
pre-training by the guided filter separation network and the
end-to-end training at the end.
TABLE 4
Network Part         Order of Training
LDR Decomposition    --      2nd     --      2nd
Feature Extraction   --      1st     1st     1st
HDR Reconstruction   --      1st     1st     1st
All (end-to-end)     1st     --      2nd     3rd
PSNR of Y (dB)       46.71   46.28   47.11   47.27
PSNR of YUV (dB)     48.80   48.15   48.98   49.21
TABLE 5 (PSNR in dB)
Method                 Aquarium  Leaves  Cuisine  Average
Banterle et al. [17]   20.78     19.53   17.47    19.26
Huo et al. [19]        28.57     29.47   28.73    28.92
Kovaleski et al. [16]  23.24     23.48   21.51    22.74
Masia et al. [24]      23.96     19.64   22.24    21.95
Meylan et al. [21]     23.59     23.05   22.01    22.88
Rempel et al. [22]     24.80     25.13   24.79    24.91
Ours (ITM-CNN)         32.24     29.95   31.07    31.09
4.4 Comparisons with Conventional Methods
[0057] Since no previous method was explicitly trained for viewing on consumer HDR displays, a fair comparison with our method is difficult. When comparing with the previous ITM methods, we set the maximum brightness to 1,000 cd/m², remove gamma correction and apply the expansion operator in the linear space. After the expansion, the color container is converted from BT.709 to BT.2020 and the PQ-EOTF transfer function is applied so that pixel values are in a logarithmic space. Note that our method works directly in the logarithmic space without any conversion using the transfer function or color container.
[0058] Another complication is tone mapping for viewing on paper or SDR displays. All images in this paper were tone mapped using the madVR renderer with the MPC-HC player, heuristically found to be most similar to viewing on HDR consumer displays. Although this is not the exact application of our problem, the resulting images still support the validity of our approach, and show that the existing methods are not directly applicable for viewing on HDR consumer displays. The result images for subjective comparisons are given in FIG. 6. A PSNR performance comparison is given in Table 5. More experimental results are provided as supplementary material attached to this paper.
5. CONCLUSION
[0059] Although high-end HDR TVs are readily available in the market, HDR video contents are scarce. This entails the need for a means to up-convert LDR legacy video to HDR video for the high-end HDR TVs. Although existing ITM methods share a similar goal of up-converting, their ultimate objective is not to render the inversely-tone-mapped HDR images on consumer HDR TV displays, but to transfer the LDR scenes to the HDR domain for better graphics rendering on professional HDR monitors. The resulting HDR images from previous ITM methods exhibit noise amplification in dark regions, a lack of local contrast, or unnatural colors when viewed on consumer HDR TV displays.
[0060] In this paper, we first present the ITM problem using CNNs
for HDR consumer displays, where the network is trained to restore
lost details and local contrast. For an accurate prediction of the
HDR image, the different parts (LDR decomposition, feature
extraction and HDR reconstruction) of our network must be trained
separately prior to end-to-end training. Specifically, we adopt the
guided filter for LDR decomposition for the pre-training stage so
that the later layers can focus on the individual decompositions
with separate passes. The HDR reconstruction part of ITM-CNN learns
to integrate the feature maps from the two passes. The resulting
HDR images are artifact-free, restore local contrast and details,
and are closest to the ground truth HDR images when compared to
previous ITM methods. ITM-CNN is readily applicable to legacy LDR
videos for their direct viewing as HDR videos on consumer HDR TV
displays.
[0061] The units and/or modules described herein may be implemented
using hardware components and software components. For example, the
hardware components may include microphones, amplifiers, band-pass filters, analog-to-digital converters, and processing devices. A processing device may be implemented using one or more hardware devices configured to carry out and/or execute program code by
performing arithmetical, logical, and input/output operations. The
processing device(s) may include a processor, a controller and an
arithmetic logic unit, a digital signal processor, a microcomputer,
a field programmable array, a programmable logic unit, a
microprocessor or any other device capable of responding to and
executing instructions in a defined manner. The processing device
may run an operating system (OS) and one or more software
applications that run on the OS. The processing device also may
access, store, manipulate, process, and create data in response to
execution of the software. For purposes of simplicity, the description of a processing device is used in the singular; however, one skilled in the art will appreciate that a processing device may include multiple processing elements and multiple types of processing elements. For example, a processing device may include multiple processors or a processor and a controller. In addition, different processing configurations are possible, such as parallel processors.
[0062] The software may include a computer program, a piece of
code, an instruction, or some combination thereof, to independently
or collectively instruct and/or configure the processing device to
operate as desired, thereby transforming the processing device into
a special purpose processor. Software and data may be embodied
permanently or temporarily in any type of machine, component,
physical or virtual equipment, computer storage medium or device,
or in a propagated signal wave capable of providing instructions or
data to or being interpreted by the processing device. The software
also may be distributed over network coupled computer systems so
that the software is stored and executed in a distributed fashion.
The software and data may be stored by one or more non-transitory
computer readable recording mediums.
[0063] The methods according to the above-described example
embodiments may be recorded in non-transitory computer-readable
media including program instructions to implement various
operations of the above-described example embodiments. The media
may also include, alone or in combination with the program
instructions, data files, data structures, and the like. The
program instructions recorded on the media may be those specially
designed and constructed for the purposes of example embodiments,
or they may be of the kind well-known and available to those having
skill in the computer software arts. Examples of non-transitory
computer-readable media include magnetic media such as hard disks,
floppy disks, and magnetic tape; optical media such as CD-ROM
discs, DVDs, and/or Blu-ray discs; magneto-optical media such as
optical discs; and hardware devices that are specially configured
to store and perform program instructions, such as read-only memory
(ROM), random access memory (RAM), flash memory (e.g., USB flash
drives, memory cards, memory sticks, etc.), and the like. Examples
of program instructions include both machine code, such as produced
by a compiler, and files containing higher level code that may be
executed by the computer using an interpreter. The above-described
devices may be configured to act as one or more software modules in
order to perform the operations of the above-described example
embodiments, or vice versa.
[0064] A number of example embodiments have been described above.
Nevertheless, it should be understood that various modifications
may be made to these example embodiments. For example, suitable
results may be achieved if the described techniques are performed
in a different order and/or if components in a described system,
architecture, device, or circuit are combined in a different manner
and/or replaced or supplemented by other components or their
equivalents. Accordingly, other implementations are within the
scope of the following claims.
* * * * *