U.S. patent application number 15/144613 was filed with the patent office on 2016-11-03 for "Using Machine Learning to Filter Monte Carlo Noise from Images." The applicant listed for this patent is The Regents of the University of California. Invention is credited to Steve Bako, Nima Khademi Kalantari, and Pradeep Sen.
United States Patent Application 20160321523
Kind Code: A1
Application Number: 15/144613
Family ID: 57204111
Filed: 2016-11-03
Published: November 3, 2016
Inventors: Sen; Pradeep; et al.
USING MACHINE LEARNING TO FILTER MONTE CARLO NOISE FROM IMAGES
Abstract
A method of producing noise-free images is disclosed. The method
includes using machine learning incorporating a filter to output
filter parameters from the training images. The machine learning
may include training a neural network. The filter parameters are
applied to Monte Carlo rendered training images that have noise to
generate noise-free images. The training may include determining,
computing and extracting features of the training images; computing
filter parameters; applying an error metric; and applying
backpropagation. The neural network may be a multilayer perceptron.
The machine learning model is applied to new noisy Monte Carlo
rendered images to create noise-free images. This may include
applying the filter to the noisy Monte Carlo rendered images using
the filter parameters to create the noise-free images.
Inventors: Sen; Pradeep (Goleta, CA); Kalantari; Nima Khademi (La Jolla, CA); Bako; Steve (Santa Barbara, CA)
Applicant: The Regents of the University of California (Oakland, CA, US)
Family ID: 57204111
Appl. No.: 15/144613
Filed: May 2, 2016
Related U.S. Patent Documents
Application Number: 62/155,104
Filing Date: Apr 30, 2015
Current U.S. Class: 1/1
Current CPC Class: G06N 3/084 (20130101); G06T 2207/20084 (20130101); G06K 9/6256 (20130101); G06N 7/005 (20130101); G06T 2207/20081 (20130101); G06T 15/06 (20130101); G06T 5/002 (20130101); G06T 15/50 (20130101)
International Class: G06K 9/66 (20060101); G06K 9/62 (20060101); G06T 5/20 (20060101); G06K 9/46 (20060101); G06T 5/00 (20060101)
GOVERNMENT INTERESTS
[0002] This invention was made with Government support under Grant
(or Contract) Nos. IIS-1321168 and IIS-1342931 awarded by the
National Science Foundation. The Government has certain rights in
the invention.
Claims
1. A method of producing a noise-free image, the method comprising:
obtaining training images; using machine learning incorporating a
filter on the training images to output filter parameters;
receiving a Monte Carlo rendered image that has noise; and
executing the filter on the noisy image using the filter parameters
to generate the noise-free image.
2. The method of claim 1 wherein the training images include both
ground truth training images and noisy training images.
3. The method of claim 1 wherein the using machine learning is
training a neural network.
4. The method of claim 3 wherein the neural network is a multilayer
perceptron.
5. The method of claim 3 wherein the training the neural network
includes: extracting, determining and/or computing features from
the training images; computing testing filter parameters using the
machine learning model including applying the filter using the
features to create a temporary image; applying an error metric to
the temporary image; correcting the machine learning model based on
the error metric including updating the testing filter parameters;
repeating the computing, the applying and the correcting to
determine final filter parameters.
6. The method of claim 5 wherein the extracting, determining and/or
computing features includes: determining primary features of the
training images; extracting and/or computing secondary features of
the training images using the primary features.
7. The method of claim 6 wherein the machine learning is a neural
network of the secondary features.
8. The method of claim 6 wherein the primary features include some
selected from the group including: positions, colors, world
positions, visibility, shading normals, texture values.
9. The method of claim 6 wherein the secondary features include
some selected from the group including: variances and noise
approximation in local regions, mean of primary features at various
block sizes, standard deviation of the primary features at various
block sizes, gradients of primary features, mean deviation of the
primary features, median absolute deviation (MAD) of primary
features, sampling rate.
10. The method of claim 3 further including applying
backpropagation on the neural network.
11. The method of claim 5 wherein the error metric is a modified
relative mean squared error function.
12. The method of claim 5 wherein the error metric is a perceptual
metric such as a structural similarity index (SSIM).
13. The method of claim 1 wherein executing the filter includes one
selected from the group including a Gaussian filter, a
cross-bilateral filter, and a cross non-local means filter.
14. The method of claim 1 wherein using machine learning includes:
applying an error metric to measure the distance between filtered
images and ground truth images; applying an optimization strategy
to minimize an energy function on results of the error metric.
15. The method of claim 14 wherein the energy function computes
errors in a multilayer perceptron.
16. The method of claim 14 wherein the error metric is a modified
relative mean squared error (RelMSE) metric.
17. The method of claim 14 wherein the optimization strategy is
backpropagation.
18. A method of producing a noise-free Monte Carlo rendered image,
the method comprising: training a machine learning model on a
plurality of Monte Carlo-rendered training images to learn how to
output noise-free images; receiving a new Monte Carlo rendered
noisy image; executing the trained machine learning model on the
new Monte Carlo rendered noisy image to generate a noise-free
result.
19. The method of claim 18, wherein the training includes placing a
filter after the machine learning model and the machine learning
model outputs parameters for the filter, wherein the executing
includes applying the filter to denoise the new Monte Carlo
rendered noisy image.
20. The method of claim 19, wherein during the training the machine
learning model is trained to output optimal filter parameters for
the training images such that when the new Monte Carlo noisy image
is received, the machine learning model outputs the filter
parameters to remove noise from the new image.
21. The method of claim 18 wherein the machine learning model is a
neural network.
Description
RELATED APPLICATION INFORMATION
[0001] This patent claims priority from Provisional Patent
Application No. 62/155,104, filed Apr. 30, 2015, titled A
LEARNING-BASED APPROACH FOR FILTERING MONTE CARLO NOISE, which is
incorporated by reference in its entirety.
NOTICE OF COPYRIGHTS AND TRADE DRESS
[0003] A portion of the disclosure of this patent document contains
material which is subject to copyright protection. This patent
document may show and/or describe matter which is or may become
trade dress of the owner. The copyright and trade dress owner has
no objection to the facsimile reproduction by anyone of the patent
disclosure as it appears in the Patent and Trademark Office patent
files or records, but otherwise reserves all copyright and trade
dress rights whatsoever.
BACKGROUND
[0004] 1. Field
[0005] This disclosure relates generally to methods for computer
graphics rendering. More specifically, it relates to removing noise
from and improving Monte Carlo rendered images.
[0006] 2. Description of the Related Art
[0007] Producing photorealistic images from a scene model requires
computing a complex multidimensional integral of the scene function
at every pixel of the image. For example, generating effects like
depth of field and motion blur requires integrating over domains
such as lens position and time. Monte Carlo (MC) rendering systems
approximate this integral by tracing light rays (samples) in the
multidimensional space to evaluate the scene function. Although an
approximation to this integral can be quickly evaluated with just a
few samples, the inaccuracy of this estimate relative to the true
value appears as unacceptable noise in the resulting image. Since
the variance of the MC estimator decreases linearly with the number
of samples, many samples are required to get a reliable estimate of
the integral. The high cost of computing additional rays results in
lengthy render times that negatively affect the applicability of MC
renderers in modern film production.
[0008] One way to mitigate this problem is to quickly render a
noisy image with a few samples and then filter it as a post-process
to generate an acceptable, noise-free result. This approach has
been the subject of extensive research in recent years. The more
successful methods typically use feature-based filters (e.g.,
cross-bilateral or cross non-local means filters) to leverage
additional scene features, such as world position, that help guide
the filtering process. Since these features are highly correlated
with scene detail, using them in the filtering process greatly
improves the quality of the results.
[0009] Some approaches have used this information to handle
specific distributed effects such as global illumination and depth
of field. However, a major challenge is how to exploit this
additional information to denoise distributed effects, which
requires setting the filter weights for all features (called
"filter parameters" hereafter) so that noise is removed while scene
detail is preserved. To do this, some have proposed to use the
functional dependencies between scene features and random
parameters calculated using mutual information, a process that
removes noise but was slow. Several other algorithms build upon
this by using error estimation metrics to select the best filter
parameters from a discrete set. The main drawback of these methods
is that their error metrics are usually noisy at low sampling
rates, reducing the accuracy of filter selection. Furthermore, they
choose the filter parameters from a preselected, discrete set that
may not contain the optimum. As a result, these methods produce
images with over/under blurred regions.
[0010] Since the introduction of distributed ray tracing by Cook et
al., researchers have proposed a variety of algorithms to address
the noise in Monte Carlo (MC) rendering. Some of these include
variance reduction techniques, low-discrepancy sampling patterns,
new Monte Carlo formulations with faster convergence, and methods
that exploit specific properties of the integrand, and methods that
position or reuse samples based on the shape of the
multidimensional integrand.
[0011] Filtering approaches render a noisy image with a few samples
and then denoise images through a filtering process. Some methods
adaptively sample as well, further improving the results. Some
previous work on MC filtering uses only sample color during
filtering, while other work uses additional scene information.
[0012] Color-Based Filter Methods
[0013] These methods are inspired by traditional image denoising
techniques and use only pixel color information from the rendering
system to remove MC noise. Early work by Lee and Redner used
nonlinear filters (median and alpha-trimmed mean filters) to remove
spikes while preserving edges. Rushmeier and Ward proposed to
spread the energy of input samples through variable width filter
kernels. To reduce the noise in path-traced images, Jensen and
Christensen separated illumination into direct and indirect
components, filtered the indirect portion, and then added the
components back together. Bala et al. exploited an edge image to
facilitate the filtering process, while Xu and Pattanaik used a
bilateral filter to remove MC noise. Egan et al. used frequency
analysis to shear a filter for specific distributed effects such as
motion blur and occlusion/shadowing, while Mehta et al. used
related analysis to derive simple formulas that set the variance of
a screen-space Gaussian filter to target noise from specific
effects. Most of these approaches use the analysis to adaptively
position samples as well.
[0014] For denoising general distributed effects, Overbeck et al.
adapted wavelet shrinkage to MC noise reduction, while Rousselle et
al. selected an appropriate scale for a Gaussian filter at every
pixel to minimize the reconstruction error. Rousselle later
improved this using a non-local means filter. Using the median
absolute deviation to estimate the noise at every pixel, Kalantari
and Sen were able to apply arbitrary image denoising techniques to
MC rendering. Finally, Delbracio et al. proposed a method based on
non-local means filtering which computes the distance between two
patches using their color histograms. Although these color-based
methods are general and work on a variety of distributed effects,
they need many samples to produce reasonable results. At low
sampling rates, they generate unsatisfactory results on challenging
scenes.
[0015] Filters That Use Additional Information
[0016] The approaches in this category leverage additional scene
features (e.g., world positions, shading normals, texture values,
etc.) which are computed by the MC renderer. Thus, they tend to
generate higher-quality results compared to the color-based
approaches described above.
[0017] For example, McCool removed MC noise by using depths and
normals to create a coherence map for an anisotropic diffusion
filter. To efficiently render scenes with global illumination,
Segovia et al. and Laine et al. used a geometry buffer. Meanwhile,
to reduce global illumination noise, Dammertz et al. incorporated
wavelet information into the bilateral filter and Bauszat et al.
used guided image filtering. Shirley et al. used a depth buffer to
handle depth of field and motion blur effects, while Chen et al.
combined a depth map with sample variance to filter the noise from
depth of field. These methods are directed to a fixed set of
distributed effects and are not general.
[0018] Hachisuka et al. performed adaptive sampling and
reconstruction based on discontinuities in the multidimensional
space. Although this method handles general distributed effects, it
suffers from the curse of dimensionality.
[0019] To handle general MC noise using additional scene features,
Sen and Darabi observed the need to vary the filter's feature
weights across the image. Specifically, they proposed to compute
these weights using mutual information to approximate the
functional dependencies between scene features and the random
parameters. Li et al. used Stein's unbiased risk estimator (SURE)
to estimate the appropriate spatial filter parameter in a
cross-bilateral filter, while hard coding the weights of the
remaining cross terms. Rousselle et al. significantly improved upon
this by using the SURE metric to select between three candidate
cross non-local means filters that each weight color and features
differently. Moon et al. compute a weighted local regression on a
reduced feature space and evaluate the error for a discrete set of
filter parameters to select the best one.
[0020] The main problem with the aforementioned approaches, which
constitute the state of the art, is that they weight each filter
term through heuristic rules and/or an error metric that is quite
noisy at low sampling rates. Thus, they are not able to robustly
estimate the appropriate filter weights in challenging cases.
[0021] Neural Networks in Graphics/Denoising
[0022] Neural networks have been used in computer graphics
processing. Grzeszczuk et al. used neural networks to create
physically realistic animation. Nowrouzezahrai et al. used neural
networks to predict per vertex visibility. Dachsbacher classified
different visibility configurations using neural networks. Ren et
al. used a neural network to model the radiance regression function
to render indirect illumination of a fixed scene in real time.
Neural networks have also been used in image denoising where they
have been directly trained on a set of noisy and clean patches.
[0023] In addition, Jakob et al. have a method that, while not
utilizing neural networks, performs learning through expectation
maximization to find the appropriate parameters of a Gaussian
mixture model to denoise photon maps, a different but related
problem.
DESCRIPTION OF THE DRAWINGS
[0024] FIG. 1 is a block diagram of the internal components of a
computing device on which the methods described herein may
execute.
[0025] FIG. 2 is a block diagram of a computing device on which the
methods described herein may execute.
[0026] FIG. 3 is a flow chart of a method of filtering Monte Carlo
noise from images.
[0027] FIG. 4 is a flow chart of a method of training a neural
network used in filtering Monte Carlo noise from images.
[0028] FIG. 5 is a diagram showing a multilayer perceptron.
[0029] FIG. 6 is a first scene and a portion thereof showing the
results of a method of filtering Monte Carlo noise from images.
[0030] FIG. 7 is a second scene and a portion thereof showing the
results of a method of filtering Monte Carlo noise from images.
DETAILED DESCRIPTION
[0031] Monte Carlo rendering allows for the creation of realistic
and creative images. However, the resulting images may be full of
noise and artifacts. As such, the images are considered noisy. The
term "noise" when used alone herein refers to Monte Carlo (MC)
noise that reduces image quality, not to noise that is
intentionally added or otherwise desirable.
[0032] A machine learning approach to reduce noise in Monte Carlo
(MC) rendered images is described herein. To model the complex
relationship between ideal filter parameters and a set of features
extracted from the input noisy images, machine learning is used. In
one embodiment, a multilayer perceptron (MLP) neural network is
used as a nonlinear regression model for the machine learning. To
effectively train the neural network, the MLP neural network is
combined with a filter. In this arrangement, the MLP evaluates a
set of features extracted from a local neighborhood at each pixel
and outputs a set of filter parameters. The filter parameters and
the noisy samples are provided as inputs to the filter to generate
a filtered pixel that is compared to the ground truth pixel during
training. The neural network is trained on a set of images with a
variety of distributed effects and then applied to different images
containing various distributed effects or characteristics such as,
for example, motion blur, depth of field, area lighting, glossy
reflections, and global illumination. The machine learning approach
includes training an MLP neural network with a filter to provide
denoised or noise-free images.
[0033] There is a complex relationship between the input noisy
image and the optimal filter parameters needed to create an
accurate image. These filter parameters can be effectively
estimated using different factors (e.g., feature variances and
noise in local regions), but each individual factor by itself
cannot accurately predict them. Based on these observations, a
supervised learning method is described herein. The supervised
learning method learns the complex relationship between these
factors and the optimal filter parameters. In this way, the methods
avoid the problems of previous approaches. According to one version
of the method, a nonlinear regression model is trained on a set of
noisy MC rendered images and their corresponding ground truth
images, using a multilayer perceptron (MLP) coupled with a matching
filter during training and refinement.
[0034] During the training stage, the method renders both noisy
images at low sampling rates as well as their corresponding ground
truth images for a set of scenes with a variety of distributed
effects. The method then processes the noisy images and extracts a
set of useful features in square regions around every pixel. The
method is trained based on the extracted features to drive the
filter to produce images that resemble the ground truth. This is
done according to a specific error metric.
[0035] After the neural network has been trained, in an application
stage the method filters new noisy renderings with general
distributed effects. The method is fast (and may take a few seconds
or less) and produces better results than existing methods for a
wide range of distributed effects including depth of field, motion
blur, area lighting, glossy reflections, and global illumination.
Further, unlike earlier approaches, in one embodiment, no adaptive
sampling is performed. In another embodiment of the method,
adaptive sampling may be included. The method described herein is a
post-process step that effectively removes MC noise.
[0036] The method includes: reducing general MC noise using machine
learning including supervised learning for MC noise reduction; and
training a neural network in combination with a filter to generate
results that are close to ground truth images. In other
implementations, the machine learning may be support vector
machines, random forests, and other kinds of machine learning. As
such, the methods are not limited to neural networks.
[0037] Description of Apparatus
[0038] The methods described herein may be implemented on a
computing device such as a computer workstation or personal
computer. An example computing device 100 is shown in FIGS. 1 and
2. The computing device has at least one central processing unit
(CPU 112) which typically has multiple cores, a specialized
graphics processing unit (GPU 114), sufficient memory (random
access memory, RAM 116), and a non-volatile storage device 120.
Storage device 120 is typically a solid state (also known as
silicon storage) device (SSD) or hard disk drive (HDD) or
combination thereof. The GPU may be included on a motherboard 110
with the CPU or be included on an add-on card. Other components
included in computing device 100 that are commonly included are not
shown, including, for example, one or more network interface cards
or chips (NICs) that allow for network communication, buses such as
universal serial bus (USB), peripheral component interconnect
express (PCIe bus), serial advanced technology attachment (SATA),
serial attached small computer system interface (serial attached
SCSI or SAS), and others. Images may be displayed on one or more
monitors 102 coupled with the computing device 100. User input may
be provided via one or more input devices 103 such as a keyboard,
mouse, track ball, track pad or digitized pen as well as a touch
screen included with monitor 102. The computing device 100 runs an
operating system such as, for example, a version of Linux, Apple OS
X, Microsoft Windows, and Ubuntu.
[0039] In one version, the method was implemented and run on a
computing device having an INTEL quad-core 3.7 GHz CPU with 24 GB
of RAM and a GeForce GTX TITAN GPU from NVIDIA Corporation. Many
other computing device configurations may be used; this is merely
provided as an example.
[0040] To implement one version of the methods described herein, a
learning-based filtering (LBF) component was written in the C++
programming language and integrated into the PBRT2 platform
(Physically Based Rendering, Second Edition, see pbrt.org). In one
implementation, the neural network and filter were written in CUDA
(a parallel computing platform for graphics processing available
from NVIDIA Corporation) to take advantage of GPU acceleration.
[0041] Description of Processes
[0042] The goal of the method described herein is to take a noisy
input image rendered with only a few samples and generate a
noise-free image that is similar to the ground truth image rendered
with many samples. Referring now to FIG. 3, there is shown a flow
chart of an overview of a method described herein. In its most
basic form, the method includes obtaining training images, as shown
in block 310. Both noisy and ground truth training images may be
used. Machine learning incorporating a filter is applied to the
training images to output filter parameters, as shown in block 320. The
machine learning model may be a neural network, a support vector
machine, a random forest, and other kinds of machine learning. A
plurality of Monte Carlo rendered images having noise (aka noisy
images) are obtained, as shown in block 330. A filter is executed
on or applied to the noisy images using the filter parameters, as
shown in block 340. Less noisy or, ideally, denoised or noise-free
images are provided, as shown in block 350.
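To make the flow of FIG. 3 concrete, the following Python sketch wires the blocks together. It is illustrative only: a single Gaussian width stands in for the learned per-pixel filter parameters, and the helper names (train_model, denoise) are hypothetical rather than part of this disclosure.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def train_model(training_pairs):
    # Blocks 310/320: choose the parameter that drives the filter
    # closest to the ground truth over (noisy, ground truth) pairs.
    # A single Gaussian width is a toy stand-in for the parameters
    # that the method's neural network would output per pixel.
    best_sigma, best_err = None, float("inf")
    for sigma in (0.5, 1.0, 2.0, 4.0):
        err = sum(float(np.mean((gaussian_filter(noisy, sigma) - gt) ** 2))
                  for noisy, gt in training_pairs)
        if err < best_err:
            best_sigma, best_err = sigma, err
    return best_sigma

def denoise(noisy_images, sigma):
    # Blocks 330-350: execute the filter, with the learned
    # parameter, on new noisy Monte Carlo renderings.
    return [gaussian_filter(img, sigma) for img in noisy_images]
```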
[0043] Examples of the results of the application of the method are
shown in FIGS. 6 and 7. Scenes 600 and 700 are provided. Sample
image portions 610 and 710 are selected. MC rendering is performed
resulting in image portions 612 and 712. The image portions 612 and
712 are received as input to the method. The ground truth versions
of the image portions are shown as 616 and 716. Image portions 614
and 714 show result image portions produced from application of the
method of filtering Monte Carlo noise from images described herein.
As is shown, the results are striking.
[0044] Returning now to the discussion of the method, the filtered
color $\hat{c}_i = \{c_r, c_g, c_b\}$ at pixel $i$ is computed as a
weighted average of all of the pixels in a square neighborhood
$N(i)$ (for example, 55×55) centered around pixel $i$:

$$\hat{c}_i = \frac{\sum_{j \in N(i)} d_{i,j}\,\bar{c}_j}{\sum_{j \in N(i)} d_{i,j}},$$
where $d_{i,j}$ is the weight between pixel $i$ and its neighbor $j$
as defined by the filter and $\bar{c}_j$ is the noisy pixel color
computed by averaging all the sample colors in pixel $j$. For
example, for a standard Gaussian filter, $d_{i,j}$ would be the
Gaussian-weighted distance between pixels $i$ and $j$ in the spatial
domain. More sophisticated filters, such as the cross-bilateral
filter, may be used because they can leverage additional scene
features (e.g., world positions, shading normals, texture values,
etc.) to improve the quality of filtering. When using a
cross-bilateral filter, $d_{i,j}$ is:

$$d_{i,j} = \exp\!\left[-\frac{\lVert \bar{p}_i - \bar{p}_j \rVert^2}{2\alpha_i^2}\right] \times \exp\!\left[-\frac{D(\bar{c}_i, \bar{c}_j)}{2\beta_i^2}\right] \times \prod_{k=1}^{K} \exp\!\left[-\frac{D_k(\bar{f}_{i,k}, \bar{f}_{j,k})}{2\gamma_{k,i}^2}\right],$$

where $\bar{p}_i$ and $\bar{f}_{i,k}$ refer to pixel $i$'s
screen-space position and scene feature $k$, respectively, and
$\alpha_i^2$, $\beta_i^2$ and $\gamma_{k,i}^2$ are the variances at
pixel $i$ for the spatial, color, and $k$-th feature terms. Here,
$D$ and $D_k$ are specific distance functions for colors and scene
features. In one version of the method the cross-bilateral filter is
used; in other versions, other differentiable filters may be used.
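As an illustration, a direct (unoptimized) Python sketch of the weighted average and the cross-bilateral weight above follows. Plain squared distances stand in for $D$ and $D_k$; the variance-normalized versions given later in this disclosure can be substituted directly. Array layouts, the small default window, and parameter names are assumptions of the sketch, not the disclosure's implementation.

```python
import numpy as np

def cross_bilateral_pixel(i, j0, colors, positions, features,
                          alpha2, beta2, gamma2, radius=3):
    """Filter one pixel (i, j0) of an H x W x 3 color image.

    positions: H x W x 2 screen coordinates; features: list of
    H x W x C arrays (world position, shading normal, ...);
    alpha2, beta2: spatial/color variances; gamma2: one variance
    per feature. The text's example window is 55 x 55 (radius 27).
    """
    H, W, _ = colors.shape
    num, den = np.zeros(3), 0.0
    for y in range(max(0, i - radius), min(H, i + radius + 1)):
        for x in range(max(0, j0 - radius), min(W, j0 + radius + 1)):
            # Spatial term: exp(-||p_i - p_j||^2 / (2 alpha_i^2)).
            d = np.exp(-np.sum((positions[i, j0] - positions[y, x]) ** 2)
                       / (2.0 * alpha2))
            # Color term: exp(-D(c_i, c_j) / (2 beta_i^2)).
            d *= np.exp(-np.sum((colors[i, j0] - colors[y, x]) ** 2)
                        / (2.0 * beta2))
            # Feature terms: product over the K scene features.
            for f, g2 in zip(features, gamma2):
                d *= np.exp(-np.sum((f[i, j0] - f[y, x]) ** 2) / (2.0 * g2))
            num += d * colors[y, x]
            den += d
    return num / den
```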
[0045] The filtering process may be written as:

$$\hat{c}_i = h\big(\bar{s}_{N(i)}, \theta_i\big), \quad \text{where} \quad \bar{s}_{N(i)} = \bigcup_{j \in N(i)} \bar{s}_j.$$

Here, $\bar{s}_{N(i)}$ is the collection of mean primary features in
the neighborhood of the $i$-th pixel. The term "primary features"
refers to scene features that are computed directly by the
rendering system when shading samples. Primary features include
sample positions, colors, and the $K$ scene features such as world
positions, shading normals, direct illumination visibility, and
texture values (namely first and second intersection texture
colors). The term "mean" refers to averaging the features of every
sample in a pixel. In the filtering process, $h$ is the filter
function which implements the filtering described above, and
$\theta_i$ is an array of $M$ filter parameters at pixel $i$. To
identify the filter parameters $\hat{\theta}_i$ that estimate the
optimum filter parameters $\theta_i^*$, the noisy mean primary
features in a pixel's neighborhood are processed to generate a set
of more meaningful data called "secondary features,"
$x_i = \{x_1, x_2, \ldots, x_N\}_i$. The secondary features include
feature variances, noise approximation in local regions, and the
like. The filter parameters are then approximated as a function
$\Phi$ of the secondary features: $\hat{\theta}_i = \Phi(x_i)$. The
relationship between the secondary features and the optimal filter
parameters is complicated and difficult to model. For this reason,
the method minimizes the following energy function on training
images:

$$\Phi^* = \arg\min_{\Phi} \sum_i E\Big(h\big(\bar{s}_{N(i)}, \Phi(x_i)\big), c_i\Big).$$

This energy function is used to compute the filter parameters that
will generate a filtered image close to the ground truth.
[0046] To avoid problems in computing the energy function and the
filter parameters, the method uses a learning system that directly
minimizes errors. The method uses a nonlinear regression model
based on a neural network and directly combines the neural network
with a matching filter during training and later application.
Ground truth images are used during training to directly compute
the error between the filtered and ground truth image without need
for error estimation. During an application stage, the trained
machine learning model (resulting from iterations that minimize the
error computed by the energy function) is applied to additional or
secondary features from new scenes to compute filter parameters
that produce results close to the ground truth.
[0047] We now describe how to train a neural network in combination
with a filter by minimizing the energy function to create filter
parameters. Referring now to FIG. 4, training images are obtained,
both ground truth and noisy, as shown in block 410. Primary
features of the training images are determined, as shown in block
420. Secondary features of the training images are extracted or
computed, as shown in block 430. The secondary features may be
based on or computed from the primary features or may be extracted
or computed independent of the primary features. Training is then
performed using a neural network incorporating a filter, as shown
in block 440. The training includes computing training filter
parameters using a multilayer perceptron of the secondary features,
as shown in block 442. A filter is applied using the training
filter parameters, as shown in block 444. An error metric is
applied, as shown in block 446. The error metric is used to compare
temporary images with the ground truth images. Backpropagation is
then applied, as shown in block 448 in an effort to improve or
correct the training filter parameters to get the temporary images
closer to the ground truth images. The results are final filter
parameters 450 that allow for preparation of near ground truth,
noise-free images from noisy images. Detailed descriptions of these
actions are set forth below.
[0048] Neural Network.
[0049] In one embodiment, the neural network includes three
elements: (1) a model for representing the energy function, (2) an
appropriate error metric to measure the distance between the
filtered and ground truth images, and (3) an optimization strategy
to minimize the energy function.
[0050] The Energy Function
[0051] In one embodiment, the machine learning model is represented
as a neural network in the form of a multilayer perceptron (MLP).
The MLP is a regression model since it is a simple and powerful
system for discovering complex nonlinear relationships between
inputs and outputs. Moreover, MLPs are inherently parallel and can
be efficiently implemented on a GPU and are very fast once trained,
which is important for rendering. The method described herein
differs from standard MLPs in that a filter is incorporated into
the training process. By using a filter during machine learning and
particularly with the MLP, the method "backpropagates" to update
the weights of the neural network during training. To be used in
this way, the filter must be differentiable with respect to filter
parameters. Filters such as Gaussian, cross-bilateral, and cross
non-local means filters are all differentiable and may be
incorporated in the method. Other appropriate filters may also be
used.
[0052] As shown in FIG. 5, the MLP 500 consists of multiple layers
known as the input, hidden, and output layers. Each layer has
several nodes which are fully connected to all nodes in the next
layer through weights. The output of a certain node is a function
of the weighted sum of the outputs of the nodes from the previous
layer plus an additional bias term used as an offset. Specifically,
the output of the $s$-th node at the $l$-th layer is:

$$a_s^l = f^l\!\left(\sum_{t=1}^{n_{(l-1)}} w_{t,s}^l\, a_t^{l-1} + w_{0,s}^l\right),$$

where $n_{(l-1)}$ is the number of nodes in layer $l-1$,
$w_{t,s}^l$ is the weight associated with the connection between
node $t$ in layer $l-1$ and node $s$ in layer $l$, $w_{0,s}^l$ is
the bias for this node, and $f^l$ is the activation function for
layer $l$. In one implementation, nonlinear activation functions
are used in all layers. Multiple kinds of nonlinear activation
functions may be used, such as the sigmoid function
$f^l(x) = 1/(1 + e^{-x})$. In various implementations, combinations
of linear and nonlinear activation functions may be used.
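The feed-forward pass of such an MLP reduces to a sequence of matrix products and activations, as the following Python sketch shows. The input and output widths match the N = 36 secondary features and M = 7 filter parameters discussed in this disclosure, while the hidden-layer width of 10 and the uniform weight initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    # The example activation from the text: f(x) = 1 / (1 + e^-x).
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, weights, biases):
    """Feed-forward pass: weights[l] has shape (n_{l-1}, n_l) and
    biases[l] has shape (n_l,); sigmoid is applied at every layer."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(a @ W + b)  # weighted sum of previous layer + bias
    return a

# Example: 36 secondary features in, 7 filter parameters out,
# one hidden layer of 10 nodes (an illustrative size).
rng = np.random.default_rng(0)
sizes = [36, 10, 7]
weights = [rng.uniform(-0.5, 0.5, (m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [rng.uniform(-0.5, 0.5, n) for n in sizes[1:]]
filter_params = mlp_forward(rng.random(36), weights, biases)
```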
[0053] The Error Metric
[0054] The error metric used in the method to measure the error
between the filtered and ground truth pixel values is a modified
relative mean squared error (RelMSE) metric:

$$E_i = \frac{n}{2} \sum_{q \in \{r,g,b\}} \frac{(\hat{c}_{i,q} - c_{i,q})^2}{c_{i,q}^2 + \epsilon},$$

where $n$ is the number of samples per pixel, $\hat{c}_{i,q}$ and
$c_{i,q}$ are the $q$-th color channel of the filtered and ground
truth pixels, respectively, and $\epsilon$ is a small number (0.01
in one implementation) to avoid division by zero. In this equation,
division by $c_{i,q}^2$ is incorporated to account for human visual
sensitivity to color variations in darker regions of the image by
giving higher weight to the regions where the ground truth image is
darker. Further, by multiplying the squared error by $n$, an inverse
relationship to training image bias is removed and all of the
images have an equal contribution to the error regardless of
sampling rate. In addition, division by 2 is included to produce a
simpler derivative.
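A direct Python transcription of this per-pixel metric, summed over an image, might look as follows; the 0.01 epsilon follows the implementation value mentioned in the text.

```python
import numpy as np

def modified_relmse(filtered, truth, n, eps=0.01):
    """Modified relative MSE. filtered, truth: H x W x 3 arrays;
    n: samples per pixel. Returns the sum of E_i over all pixels."""
    per_channel = (filtered - truth) ** 2 / (truth ** 2 + eps)
    return 0.5 * n * float(np.sum(per_channel))
```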
[0055] Optimization Strategy
[0056] The optimization starts with a large set of noisy images and
the corresponding ground truth images, which can be generated prior
to training. For each noisy image, a set of secondary features at
each pixel are extracted. The secondary features are used to train
the neural network through an iterative, three-step process called
"backpropagation". The goal of backpropagation is to determine the
optimal weights w.sub.t,s.sup.l for all nodes in the neural network
which minimize the error between the computed and desired outputs
(i.e., the ground truth values) for all pixels in the training
images, $E = \sum_{i \in \text{all pixels}} E_i$.
[0057] Before starting the backpropagation process, the weights are
randomly initialized to small values around zero (for example,
between -0.5 to 0.5). Then in the first step, known as the
feed-forward pass, the output of the neural network is computed
using all inputs. This can be implemented efficiently using a
series of matrix multiplications and activation functions applied
to the input data to evaluate $a_s^l$ using the equation
above. In the second step, the error between the computed and
desired outputs is used to determine the effect of each weight on
the output error. To do this, the derivative of the error is taken
with respect to each weight, $\partial E / \partial w_{t,s}^l$.
Thus, the
activation functions (and the filter as well) need to be
differentiable. These two steps are performed for all of the data
in the training set. The error gradient of each weight is
accumulated. In the third step, all the weights are updated
according to their error gradient and the actual error computed by
the Error Metric above. This completes a single iteration of
training, known as an epoch. Epochs are performed until a converged
set of weights is obtained.
[0058] Next, the chain rule is used to express the derivative of the
energy function:

$$\frac{\partial E_i}{\partial w_{t,s}^l} = \sum_{m=1}^{M} \left[ \sum_{q \in \{r,g,b\}} \left[ \frac{\partial E_{i,q}}{\partial \hat{c}_{i,q}}\,\frac{\partial \hat{c}_{i,q}}{\partial \theta_{m,i}} \right] \frac{\partial \theta_{m,i}}{\partial w_{t,s}^l} \right],$$

where $M$ is the number of filter parameters. The first term is the
derivative of the error with respect to the filtered pixel color
$\hat{c}_{i,q}$ and can be calculated as:

$$\frac{\partial E_i}{\partial \hat{c}_{i,q}} = n\,\frac{\hat{c}_{i,q} - c_{i,q}}{c_{i,q}^2 + \epsilon}.$$

In addition, $\theta_{m,i}$ is the output of the MLP network (shown
in FIG. 5 and described above), so the last term is the standard
derivative of a neural network output with respect to its weights.
The middle term requires that the filter be differentiable so the
derivative of the filtered color with respect to the filter
parameters can be computed. The cross-bilateral, cross non-local
means, and Gaussian filters may all be used for this, and other
differentiable filters may also be used.
[0059] The derivative of the energy function is computed for each weight
within the neural network, and the weights are updated after every
epoch. The process iterates until convergence is achieved.
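The following Python sketch shows one backpropagation epoch for the MLP in isolation: a feed-forward pass, gradient accumulation, and a weight update, with sigmoid activations throughout. The gradient of the error with respect to the network output is passed in as a callable; in the full method that quantity is the bracketed chain-rule term above, flowing through the differentiable filter. Treating it as a black box here is a simplification that keeps the sketch self-contained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_epoch(X, grad_fn, weights, biases, lr=0.1):
    """One epoch over training inputs X (each a 1-D feature vector).

    grad_fn(output) must return dE/d(output) for one sample; in the
    method described here it would encapsulate the chain-rule factor
    that passes through the filter (dE/dc_hat * dc_hat/dtheta).
    """
    gW = [np.zeros_like(W) for W in weights]
    gb = [np.zeros_like(b) for b in biases]
    for x in X:
        # Step 1: feed-forward pass, caching every layer's output.
        acts = [x]
        for W, b in zip(weights, biases):
            acts.append(sigmoid(acts[-1] @ W + b))
        # Step 2: backward pass; sigmoid'(z) = a * (1 - a).
        delta = grad_fn(acts[-1]) * acts[-1] * (1.0 - acts[-1])
        for l in range(len(weights) - 1, -1, -1):
            gW[l] += np.outer(acts[l], delta)   # accumulate dE/dW
            gb[l] += delta                      # accumulate dE/db
            if l > 0:
                delta = (weights[l] @ delta) * acts[l] * (1.0 - acts[l])
    # Step 3: update all weights from the accumulated gradients.
    for l in range(len(weights)):
        weights[l] -= lr * gW[l]
        biases[l] -= lr * gb[l]
```

Epochs of this form are repeated until the weights converge.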
[0060] Primary Features
[0061] Primary features are those directly output by the rendering
system. In one version of the method, seven primary features (M=7)
are used in the cross-bilateral filter. The primary features are:
screen position, color, and five additional features (K=5): world
position, shading normal, texture values for the first and second
intersections, and direct illumination visibility.
[0062] During rendering, the method records for each sample: screen
position in x, y coordinates, color in RGB format, world position
in Cartesian coordinates (x, y, z), shading normal (i, j, k),
texture values for the first and second intersections in RGB
format, and a single binary value for the direct illumination
visibility, for a total of 18 floating point values. These values
are averaged over all samples in a pixel to produce the mean
primary features for every pixel in the image. At this point, the
average direct illumination visibility represents the fraction of
shadow rays that see the light and is no longer a binary value.
Moreover, the additional features are prefiltered using a non-local
means filter in an 11×11 window with patch size 7×7.
[0063] The distances for the color and additional features are
normalized by their variances. The following function is used for
the color term:

$$D(\bar{c}_i, \bar{c}_j) = \frac{\lVert \bar{c}_i - \bar{c}_j \rVert^2}{\psi_i^2 + \psi_j^2 + \zeta},$$

where $\psi_i$ and $\psi_j$ are the standard deviations of the
color samples at pixels $i$ and $j$, respectively, and $\zeta$ is a
small number (such as, for example, $10^{-10}$) to avoid division
by zero. The distances for the additional features are computed
with the following function:

$$D_k(\bar{f}_{i,k}, \bar{f}_{j,k}) = \frac{\lVert \bar{f}_{i,k} - \bar{f}_{j,k} \rVert^2}{\max(\psi_{k,i}^2, \delta)},$$

where $\psi_{k,i}$ is the standard deviation of the $k$-th feature
at pixel $i$ and $\delta$ is a small number (such as, for example,
$10^{-4}$) to avoid division by zero. The method smooths the noisy
standard deviations of the additional features $\psi_{k,i}$ by
filtering them using the same weights computed by the non-local
means filter when prefiltering the primary features.
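In code, these two normalized distances are one-liners. The sketch below uses the constants named in the text and can replace the plain squared distances in the cross-bilateral weight sketched earlier; the argument names are assumptions.

```python
import numpy as np

def color_distance(ci, cj, psi_i, psi_j, zeta=1e-10):
    # D: squared color difference normalized by the color sample
    # standard deviations at pixels i and j.
    return np.sum((ci - cj) ** 2) / (psi_i ** 2 + psi_j ** 2 + zeta)

def feature_distance(fi, fj, psi_ki, delta=1e-4):
    # D_k: squared feature difference normalized by the (smoothed)
    # standard deviation of feature k at pixel i.
    return np.sum((fi - fj) ** 2) / max(psi_ki ** 2, delta)
```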
[0064] Secondary Features
[0065] At every pixel, the method computes a set of secondary
features from the neighboring noisy samples to serve as inputs to
the neural network.
[0066] Feature statistics: the mean and standard deviation for the
K=5 additional features are computed for all samples in the pixel.
To capture more global statistics, the method also calculates the
mean and standard deviation of the pixel-averaged features in a
7×7 block around each pixel. The method computes the
statistics for each component (e.g., i, j, k for shading normal)
separately and averages them together to create a single value per
feature. Thus, according to the method, there are 20 total values
for each pixel and the block around it.
[0067] Gradients: The gradients of features may be used to decrease
the weight of a feature in regions with sharp edges. The method
calculates the gradient magnitude (scalar) of the K additional
features using a Sobel operator (5 values total).
[0068] Mean deviation: This term is the average of the absolute
difference between each individual pixel in a block and the block
mean. This feature can help identify regions with large errors. In
response, the neural network can adjust the filter parameters. For
each of the K additional features, the method computes the mean
deviation of all the pixel-averaged features in a 3×3 block
around each pixel. This feature is computed on each component
separately and then averaged to obtain a single value for each
additional feature (5 values total).
[0069] Median Absolute Deviation (MAD): The method uses the MAD to
estimate the amount of noise in each pixel, which is directly
related to the size of the filter. The method computes the MAD for
each of the K additional features (5 values total).
[0070] Sampling rate: The method uses the inverse of the sampling
rate as a secondary feature. The variance of MC noise decreases
linearly with the number of samples and, therefore, the filter
parameters should reflect this. Since the method includes training
a single neural network, the neural network is capable of handling
different sampling rates and adjusting the filter size
accordingly.
[0071] In one version of the system, the method computes a total of
N=36 secondary features for each pixel. These secondary features
are used as input to the neural network. The neural network outputs
the parameters to be used by the filter to generate the final
filtered pixel. The method does this for all the pixels to produce
a final result.
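The following Python sketch computes a representative member of each secondary-feature family for one pixel-averaged feature channel. It is an approximation for illustration: the MAD here is taken over the whole channel rather than per pixel, and the mean deviation uses each pixel's own local mean, so exact windowing differs slightly from the text.

```python
import numpy as np
from scipy.ndimage import uniform_filter, sobel

def secondary_features(feature, spp):
    """feature: H x W pixel-averaged feature channel; spp: samples
    per pixel. The full method gathers N = 36 values per pixel
    across the K = 5 additional features and their statistics."""
    mean7 = uniform_filter(feature, size=7)            # 7x7 block mean
    std7 = np.sqrt(np.maximum(
        uniform_filter(feature ** 2, size=7) - mean7 ** 2, 0.0))
    # Gradient magnitude via a Sobel operator.
    grad = np.hypot(sobel(feature, axis=0), sobel(feature, axis=1))
    # Mean deviation in a 3x3 block (approximate form).
    mean3 = uniform_filter(feature, size=3)
    mean_dev = uniform_filter(np.abs(feature - mean3), size=3)
    # Median absolute deviation as a noise estimate (global here).
    mad = np.median(np.abs(feature - np.median(feature)))
    inv_rate = 1.0 / spp                               # sampling-rate term
    return mean7, std7, grad, mean_dev, mad, inv_rate
```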
[0072] Video Application
[0073] Although described herein with regard to still scene images,
the method may be applied to frames of video. To handle video
sequences, the
existing neural network described herein may be used without
retraining and the cross-bilateral filter may be extended to
operate on 3-D spatio-temporal volumes. This modification to the
filter is incorporated to reduce the flickering that might appear
if each frame is independently filtered. In one version of the
method, only three neighboring frames on each side of a current
frame (7 frames total) were used for spatio-temporal filtering. The
method generates high-quality, temporally-coherent videos from
noisy input sequences with low sampling rates.
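As a sketch, moving to a spatio-temporal volume only changes which pixels enter the weighted average. The neighborhood gathering might look like this, with the frame-stack layout assumed for illustration; the cross-bilateral weight sketched earlier is then evaluated over this volume instead of a single-frame window.

```python
import numpy as np

def spatiotemporal_neighborhood(frames, t, y, x, radius=3, t_radius=3):
    """frames: T x H x W x 3 stack. t_radius=3 gives three frames on
    each side of frame t (7 frames total), as in the text."""
    T, H, W, _ = frames.shape
    t0, t1 = max(0, t - t_radius), min(T, t + t_radius + 1)
    y0, y1 = max(0, y - radius), min(H, y + radius + 1)
    x0, x1 = max(0, x - radius), min(W, x + radius + 1)
    return frames[t0:t1, y0:y1, x0:x1]
```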
[0074] Closing Comments
[0075] Throughout this description, the embodiments and examples
shown should be considered as exemplars, rather than limitations on
the apparatus and procedures disclosed or claimed. Although many of
the examples presented herein involve specific combinations of
method acts or system elements, it should be understood that those
acts and those elements may be combined in other ways to accomplish
the same objectives. With regard to flowcharts, additional and
fewer steps may be taken, and the steps as shown may be combined or
further refined to achieve the methods described herein. Acts,
elements and features discussed only in connection with one
embodiment are not intended to be excluded from a similar role in
other embodiments.
[0076] As used herein, "plurality" means two or more. As used
herein, a "set" of items may include one or more of such items. As
used herein, whether in the written description or the claims, the
terms "comprising", "including", "carrying", "having",
"containing", "involving", and the like are to be understood to be
open-ended, i.e., to mean including but not limited to. Only the
transitional phrases "consisting of" and "consisting essentially
of", respectively, are closed or semi-closed transitional phrases
with respect to claims. Use of ordinal terms such as "first",
"second", "third", etc., in the claims to modify a claim element
does not by itself connote any priority, precedence, or order of
one claim element over another or the temporal order in which acts
of a method are performed, but are used merely as labels to
distinguish one claim element having a certain name from another
element having a same name (but for use of the ordinal term) to
distinguish the claim elements. As used herein, "and/or" means that
the listed items are alternatives, but the alternatives also
include any combination of the listed items.
* * * * *