U.S. patent application number 15/664480 was published by the patent office on 2018-02-08 for an image analyzing apparatus, image analyzing method, and recording medium.
The applicant listed for this patent is Takayuki HARA. The invention is credited to Takayuki HARA.
United States Patent Application 20180039856 (Kind Code: A1)
Application Number: 15/664480
Family ID: 61069318
Inventor: HARA, Takayuki
Publication Date: February 8, 2018

IMAGE ANALYZING APPARATUS, IMAGE ANALYZING METHOD, AND RECORDING MEDIUM
Abstract
An image analyzing apparatus reprojects an input image in a
plurality of different directions to divide the input image into a
plurality of partial images, extracts a feature amount from each of
the partial images, and calculates a degree of importance of the
input image by position from the extracted feature amount in
accordance with a predetermined regression model.
Inventors: HARA, Takayuki (Kanagawa, JP)

Applicant: HARA, Takayuki; Kanagawa, JP

Family ID: 61069318
Appl. No.: 15/664480
Filed: July 31, 2017
Current U.S. Class: 1/1
Current CPC Class: G06K 9/4671 (20130101); G06K 9/66 (20130101); G06N 3/0472 (20130101); G06N 3/08 (20130101); G06T 7/11 (20170101); G06N 3/0454 (20130101); G06N 3/0481 (20130101)
International Class: G06K 9/46 (20060101); G06K 9/66 (20060101); G06N 3/04 (20060101); G06T 7/11 (20060101)
Foreign Application Data

Date: Aug 4, 2016; Code: JP; Application Number: 2016-153492
Claims
1. An image analyzing apparatus, comprising: one or more
processors; and a memory to store a plurality of instructions
which, when executed by one or more processors, cause the
processors to: reproject an input image in a plurality of different
directions to divide the input image into a plurality of partial
images; extract a feature amount from each of the partial images;
and calculate a degree of importance of the input image by position
from the extracted feature amount in accordance with a
predetermined regression model.
2. The image analyzing apparatus according to claim 1, wherein the
instructions further cause the processors to calculate a likelihood
distribution of an attention point from the calculated degree of
importance in accordance with the predetermined regression model,
and calculate the attention point from the likelihood distribution
of the attention point.
3. The image analyzing apparatus according to claim 2, wherein the
instructions further cause the processors to calculate, as the
attention point, a position corresponding to one of a maximum
likelihood value, an average value, and a local maximum value of
the likelihood distribution of the attention point.
4. The image analyzing apparatus according to claim 2, wherein the
instructions further cause the processors to add the degree of
importance to calculate the likelihood distribution of the
attention point.
5. The image analyzing apparatus according to claim 3, wherein at
least one processing among the extraction of the feature amount,
the calculation of the degree of importance, and the calculation of
the attention point likelihood distribution is executed by a neural
network.
6. A method for extracting a degree of importance of an input image
by position, comprising: reprojecting an input image in a plurality
of different directions to divide the input image into a plurality
of partial images; extracting a feature amount from each of the
partial images; and calculating a degree of importance of the input
image by position from the extracted feature amount in accordance
with a predetermined regression model.
7. A method for calculating an attention point of an input image,
comprising: reprojecting an input image in a plurality of different
directions to divide the input image into a plurality of partial
images; extracting a feature amount from each of the partial
images; calculating a degree of importance of the input image by
position from the extracted feature amount in accordance with a
predetermined regression model; calculating a likelihood
distribution of an attention point from the calculated degree of
importance in accordance with a predetermined regression model; and
calculating an attention point in accordance with the likelihood
distribution of the attention point.
8. The method for calculating the attention point of the input
image according to claim 7, wherein the calculating the attention
point includes calculating a position corresponding to one of a
maximum likelihood value, an average value, and a local maximum
value of the likelihood distribution of the attention point as the
attention point.
9. The method for calculating the attention point of the input
image according to claim 7, wherein the calculating the likelihood
distribution of the attention point includes adding the degree of
importance to calculate the likelihood distribution of the
attention point.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This patent application is based on and claims priority
pursuant to 35 U.S.C. § 119(a) to Japanese Patent Application
No. 2016-153492, filed on Aug. 4, 2016, in the Japan Patent Office,
the entire disclosure of which is hereby incorporated by reference
herein.
BACKGROUND
Technical Field
[0002] The present invention relates to an image analyzing
apparatus, an image analyzing method, and a recording medium.
Description of the Related Art
[0003] A technique to extract a region of interest of a user from an image has been widely used in, for example, automatic cropping, thumbnail generation, and preprocessing for annotation generation in image understanding or image search. To extract the region of interest, methods using object recognition or a saliency map have been known.
SUMMARY
[0004] Example embodiments of the present invention include an
apparatus and a method, each of which reprojects an input image in
a plurality of different directions to divide the input image into
a plurality of partial images, extracts a feature amount from each
of the partial images, and calculates a degree of importance of the
input image by position from the extracted feature amount in
accordance with a predetermined regression model.
[0005] Example embodiments of the present invention include an apparatus and a method, each of which reprojects an input image in a plurality of different directions to divide the input image into a plurality of partial images; extracts a feature amount from each of the partial images; calculates a degree of importance of the input image by position from the extracted feature amount in accordance with a predetermined regression model; calculates a likelihood distribution of an attention point from the calculated degree of importance in accordance with a predetermined regression model; and calculates an attention point in accordance with the likelihood distribution of the attention point.
[0006] Example embodiments of the present invention include a
non-transitory recording medium storing a program for causing one
or more processors to perform any one of the above-described
operations.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
[0007] A more complete appreciation of the disclosure and many of
the attendant advantages and features thereof can be readily
obtained and understood from the following detailed description
with reference to the accompanying drawings, wherein:
[0008] FIG. 1 is a conceptual view for explaining an image of an
equirectangular projection (or equidistant cylindrical projection)
format;
[0009] FIG. 2 is a block diagram of functional blocks of an image analyzing apparatus according to a first embodiment;
[0010] FIG. 3 is a flowchart illustrating operation executed by the
image analyzing apparatus of the first embodiment;
[0011] FIG. 4 is a conceptual view for explaining example
processing to be executed by a partial image divider;
[0012] FIGS. 5A and 5B are conceptual views for explaining example processing to be executed by the partial image divider;
[0013] FIG. 6 is a conceptual view for explaining example
processing to be executed by an attention-point-likelihood
distribution calculator;
[0014] FIGS. 7A and 7B are conceptual views for explaining example
processing executed by the partial image divider;
[0015] FIG. 8 is a diagram illustrating a neural network
configuration of a feature amount extractor according to a second
embodiment;
[0016] FIG. 9 is a diagram illustrating a neural network
configuration of a degree-of-importance calculator of the second
embodiment;
[0017] FIG. 10 is a diagram illustrating a neural network
configuration of the feature amount extractor of the second
embodiment;
[0018] FIG. 11 is a diagram illustrating a neural network
configuration of an attention-point-likelihood distribution
calculator of the second embodiment;
[0019] FIG. 12 is a diagram illustrating a neural network
configuration of the attention point calculator of the second
embodiment;
[0020] FIG. 13 is a diagram illustrating a neural network
configuration of the second embodiment;
[0021] FIG. 14 is a diagram illustrating a neural network
configuration of the second embodiment; and
[0022] FIG. 15 is a schematic block diagram illustrating a hardware
configuration of an image analyzing apparatus according to an
embodiment.
[0023] The accompanying drawings are intended to depict embodiments
of the present invention and should not be interpreted to limit the
scope thereof. The accompanying drawings are not to be considered
as drawn to scale unless explicitly noted.
DETAILED DESCRIPTION
[0024] The terminology used herein is for the purpose of describing
particular embodiments only and is not intended to be limiting of
the present invention. As used herein, the singular forms "a", "an"
and "the" are intended to include the plural forms as well, unless
the context clearly indicates otherwise.
[0025] In describing embodiments illustrated in the drawings,
specific terminology is employed for the sake of clarity. However,
the disclosure of this specification is not intended to be limited
to the specific terminology so selected and it is to be understood
that each specific element includes all technical equivalents that
have a similar function, operate in a similar manner, and achieve a
similar result.
[0026] Embodiments of the present invention will be described
below, but these embodiments do not intend to limit the present
invention. In the accompanying drawings used in the following
description, the same reference signs will be given to common
elements whose description will not be repeated as appropriate.
[0027] According to an embodiment of the present invention, an
image analyzing apparatus includes a feature to extract a region of
interest from an input image. More particularly, the image
analyzing apparatus estimates an attention point (a point in the
region of interest or a center of gravity of the region of
interest). Before describing the image analyzing apparatus of the
present embodiment, a region-of-interest extracting technique of the background art is described, which is not capable of accurately extracting the region of interest from an ultrawide image. The
ultrawide image is an image taken by a fish-eye camera having an
angle of view of more than 180 degrees or an omnidirectional camera
capable of shooting all directions over 360 degrees.
[0028] First, an ultrawide image may be converted into an image of
an equirectangular projection (equidistant cylindrical projection)
format to extract a region of interest from the converted image.
The equirectangular projection format is an expression format
mainly used in panoramic shooting. As illustrated in FIG. 1, a
three-dimensional direction of a pixel is resolved into latitude
and longitude to arrange corresponding pixel values in a square
grid. A pixel value can be obtained in the three-dimensional
direction according to the coordinate values of the latitude and
longitude of the image in the equirectangular projection format.
Thus, the image in the equirectangular projection format can be
understood as the pixel values plotted on a unit sphere.
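As a minimal sketch of this correspondence (not part of the original disclosure), the following assumes an H x W equirectangular image whose columns span longitude θ in [-π, π] and whose rows span latitude φ in [-π/2, π/2]; the function names are illustrative.

```python
import numpy as np

def equirect_to_direction(u, v, width, height):
    """Pixel (u, v) of an equirectangular image -> 3D unit vector,
    using longitude theta in [-pi, pi] and latitude phi in [-pi/2, pi/2]."""
    theta = (u / width) * 2.0 * np.pi - np.pi
    phi = np.pi / 2.0 - (v / height) * np.pi
    return np.array([np.cos(phi) * np.cos(theta),
                     np.cos(phi) * np.sin(theta),
                     np.sin(phi)])

def direction_to_equirect(d, width, height):
    """Inverse mapping: 3D unit vector -> pixel coordinates (u, v)."""
    theta = np.arctan2(d[1], d[0])
    phi = np.arcsin(np.clip(d[2], -1.0, 1.0))
    u = (theta + np.pi) / (2.0 * np.pi) * width
    v = (np.pi / 2.0 - phi) / np.pi * height
    return u, v
```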
[0029] In extracting the region of interest directly from the image
of the equirectangular projection format, it is not possible to
extract a region of interest that exists near the zenith or nadir of the sphere, or at the image boundary, where the distortion becomes extremely large.
[0030] Secondly, an ultrawide image may be divided into a plurality
of images to extract the region of interest from the divided
images. In this case, however, it is not apparent how the saliency maps obtained from the individual divided images should be integrated.
[0031] Moreover, an ultrawide image is likely to include a plurality of highly salient objects in one image, but past techniques do not include any scheme to prioritize such objects.
[0032] To solve the above problems in the conventional
region-of-interest extracting techniques, an image analyzing
apparatus of the present embodiment includes a function to
accurately extract a region of interest (attention point) of a user
from an ultrawide image having a large distortion and including a
plurality of objects. A specific configuration of the image
analyzing apparatus of the present embodiment will be described
below.
[0033] FIG. 2 illustrates functional blocks of an image analyzing
apparatus 100 as a first embodiment of the present invention. As
illustrated in FIG. 2, the image analyzing apparatus 100 includes
an image input 101, a partial image divider 102, a feature amount
extractor 103, a degree-of-importance calculator 104, an
attention-point-likelihood distribution calculator 105, and an
attention point calculator 106.
[0034] The image input 101 inputs a target image to be
processed.
[0035] The partial image divider 102 reprojects the target image to
be processed in a plurality of different directions to divide the
target image to be processed into a plurality of partial
images.
[0036] The feature amount extractor 103 extracts a feature amount
from each of the partial images.
[0037] From the extracted feature amount, the degree-of-importance
calculator 104 calculates a degree of importance for each position
of the target image to be processed in accordance with a
predetermined regression model.
[0038] From the calculated degree of importance, the
attention-point-likelihood distribution calculator 105 calculates a
likelihood distribution of an attention point in accordance with a
predetermined regression model.
[0039] In accordance with the calculated attention-point-likelihood
distribution, the attention point calculator 106 calculates the
attention point.
[0040] In the present embodiment, a computer included in the image
analyzing apparatus 100 executes a predetermined program to enable
the above-described functions of the image analyzing apparatus
100.
[0041] The functional configuration of the image analyzing
apparatus 100 of the present embodiment has been described. Next,
the processing executed by the image analyzing apparatus 100 is described using the flowchart of FIG. 3.
[0042] First, at S101, the image input 101 reads an omnidirectional
image of the equirectangular projection format as a target image to
be processed from a storage area, and inputs the read image.
Hereinafter, the image having been input is referred to as an
"input image".
[0043] Subsequently, at S102, the partial image divider 102 divides
the shooting direction of the input image (omnidirectional image)
equally and spatially to reproject the input image in a plurality
of different shooting directions. Thus, the input image is divided
into a plurality of partial images. The division of the input image
into the partial images is described.
[0044] As illustrated in FIG. 1, a pixel value in the
three-dimensional direction can be obtained from coordinate values
of the latitude and longitude of the image of the equirectangular
projection format. The image of the equirectangular projection
format can conceptually be understood as including pixel values
plotted on a unit sphere. In the present embodiment, as illustrated
in FIG. 4, a predetermined projection plane is defined. With the
center of the unit sphere placed at the center of projection O, perspective projection is carried out to make a pixel (θ, φ) of the omnidirectional image of the equirectangular projection format correspond to a pixel (x, y) on the defined projection plane according to equation (1) below. Thus, the partial image is obtained. In equation (1), P represents a matrix of the perspective projection, and the equality holds up to a nonzero scalar multiple.

[Equation 1]
\[
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \cong P \begin{pmatrix} \cos\phi \cos\theta \\ \cos\phi \sin\theta \\ \sin\phi \end{pmatrix} \tag{1}
\]
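As an illustrative sketch of equation (1), the following renders one partial image by back-projecting every output pixel through the inverse of P and sampling the equirectangular source; P is assumed to be a pinhole intrinsic matrix times a rotation, and nearest-neighbor sampling is used for brevity.

```python
import numpy as np

def render_partial_image(equirect, P, out_w, out_h):
    """Render one partial image per equation (1): each output pixel (x, y)
    is back-projected to a ray through the inverse of P, and the ray
    direction is looked up in the equirectangular source image."""
    H, W = equirect.shape[:2]
    xs, ys = np.meshgrid(np.arange(out_w), np.arange(out_h))
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(P).T                  # back-project pixels to rays
    rays /= np.linalg.norm(rays, axis=1, keepdims=True)
    theta = np.arctan2(rays[:, 1], rays[:, 0])       # longitude
    phi = np.arcsin(np.clip(rays[:, 2], -1.0, 1.0))  # latitude
    u = ((theta + np.pi) / (2.0 * np.pi) * W).astype(int) % W
    v = ((np.pi / 2.0 - phi) / np.pi * (H - 1)).astype(int)
    return equirect[v, u].reshape(out_h, out_w, -1)
```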
[0045] Specifically, a regular polyhedron having its center common
to the center of the unit sphere is defined as the projection plane
of the omnidirectional image of the equirectangular projection
format. With a normal line of each surface of the regular
polyhedron being the direction of the line of sight, the
perspective projection is carried out to obtain partial images.
FIG. 5A illustrates an example of a regular octahedron defined as
the projection plane of the omnidirectional image. FIG. 5B
illustrates an example of a regular dodecahedron defined as the
projection plane of the omnidirectional image.
[0046] Subsequently, at S103, the feature amount extractor 103
extracts a predetermined feature amount from each partial image
obtained in the preceding S102. The feature amount may be extracted
for each pixel of the partial image, or from a particular sampling
position. In the present embodiment, the input image is divided as
described above to calculate the feature amount from the partial
image having a small distortion. Thus, it is possible to robustly
process the ultrawide image having a wide angle of more than 180
degrees.
[0047] As the feature amount, the present embodiment can use
colors, edges, saliency, object positions/labels, and so on.
[0048] The color feature can be represented by values in a specific
color space (e.g., RGB or L*a*b*), or the Euclidean distance or
Mahalanobis distance from a particular color (e.g., color of the
skin).
[0049] The edge feature can be represented by the direction or
intensity of the pixel values extracted using a Sobel filter or a
Gabor filter.
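For concreteness, a minimal sketch of two of the feature amounts described above, assuming scipy is available; the reference color is an arbitrary placeholder for the "particular color".

```python
import numpy as np
from scipy import ndimage

def color_distance_feature(img, ref_rgb=(224, 172, 105)):
    """Per-pixel Euclidean distance from a particular color
    (ref_rgb is a placeholder skin tone; any reference color works)."""
    return np.linalg.norm(img.astype(float) - np.asarray(ref_rgb), axis=-1)

def edge_feature(gray):
    """Edge intensity and direction from Sobel derivatives."""
    gx = ndimage.sobel(gray.astype(float), axis=1)
    gy = ndimage.sobel(gray.astype(float), axis=0)
    return np.hypot(gx, gy), np.arctan2(gy, gx)
```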
[0050] The saliency can be represented by values of saliency
extracted by an existing saliency extracting algorithm.
[0051] For example, a region-of-interest extracting technique based on object recognition includes a technique of detecting a face region in the image to extract an image of the face region, or a technique of detecting a human to extract the region of the human from the image.
[0052] Meanwhile, in extracting the region of interest using the
saliency map, a low-order feature amount, such as colors or edges,
is used to allow more universal extraction of the region of
interest. In one example, a human vision model, which has been
studied in the field of brain and neuroscience, may be used to
generate a saliency map in a bottom-up manner from local features
of the image. Alternatively, the saliency map can be obtained
accurately by a technique to multiply an edge amount map calculated
for each pixel by a region-of-interest weighing map. The saliency
can further be calculated by a technique to combine the feature
amount of the image with depth information.
[0053] Moreover, a recent approach to region-of-interest extraction uses higher-order and more meaningful information relative to the lower-order features (e.g., colors, edges, or depths) of the image. For example, the
higher-order features of the image can be extracted using a neural
network to estimate the region of interest.
[0054] The object position/label features to be used include the position of an object (usually expressed by the coordinates of the four corners of a detected rectangle) detected by an existing object detecting algorithm, and the type of the object (e.g., face, human, or car). Herein, the algorithms disclosed in Japanese Patent Registration No. 4538008 (International Patent Publication No. WO 2007/020789) and Japanese Patent Registration No. 3411971 (Japanese Patent Publication No. 2002-24544) may be used as example object detecting algorithms.
[0055] Obviously, the feature amounts usable in the present embodiment are not limited to those described above, and other feature amounts that have conventionally been used in the field of image recognition (e.g., local binary patterns (LBP), Haar-like features, histogram of oriented gradients (HOG), or scale-invariant feature transform (SIFT)) may also be used.
[0056] Subsequently, at S104, the degree-of-importance calculator
104 calculates the degree of importance for each position (pixel)
of the input image according to the feature amount extracted from
each partial image using the predetermined regression model. This
is described in detail below.
[0057] Assume that vector l_i represents a vector arranging the feature amounts for each position of the i-th partial image among the N partial images divided from the input image, and that vector g represents a vector arranging the degree of importance for each position of the input image. The regression model f expressed by equation (2) is considered.

[Equation 2]
\[
g = f(l_1, l_2, \ldots, l_N) \tag{2}
\]
[0058] Equation (3) illustrates a linear conversion as a specific form of the regression model f.

[Equation 3]
\[
g = W \begin{pmatrix} l_1 \\ l_2 \\ \vdots \\ l_N \end{pmatrix} + b \tag{3}
\]
[0059] In equation (3), W and b represent parameters. In the present embodiment, training data using the feature amounts l_i as input and the degree of importance g as output is prepared in advance, and the training data is used for learning to identify the parameters W and b.
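A minimal sketch of this learning step, assuming the training pairs are stacked row-wise and the bias b is absorbed as an extra column; ordinary least squares is used here, though the patent does not prescribe a particular solver.

```python
import numpy as np

def fit_importance_regression(L, G):
    """Fit g = W l + b of equation (3) by least squares.
    L: (num_samples, D) stacked feature vectors [l_1; ...; l_N], one row per image.
    G: (num_samples, M) teacher importance vectors g."""
    A = np.hstack([L, np.ones((L.shape[0], 1))])  # bias column absorbs b
    Wb, *_ = np.linalg.lstsq(A, G, rcond=None)
    return Wb[:-1].T, Wb[-1]                      # W: (M, D), b: (M,)

def predict_importance(W, b, l):
    """Degree of importance by position for one image (equation (3))."""
    return W @ l + b
```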
[0060] In doing this, the present embodiment assumes that the
degree of importance g which is the output (teacher data) of the
training data is obtained in an appropriate manner. One of the
simplest ways of obtaining the degree of importance g is that an
examinee designates a region that the examinee considers to be
important in the target image, and the degree of importance of the pixels included in the designated region is set to "1" while the degree of importance of the other pixels is set to "0". Alternatively, a locus of the viewpoint of an examinee who views the target image is obtained by, for example, an eye tracker, and the obtained locus (line) is subjected to Gaussian blur to obtain degrees of importance (from 0 to 1) normalized in accordance with the contrast level of the blurred locus.
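As an illustrative sketch of the eye-tracker labeling scheme, assuming gaze samples in pixel coordinates and using scipy's Gaussian filter:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def importance_from_gaze(gaze_uv, shape, sigma=15.0):
    """Teacher importance map: rasterize an eye-tracker gaze locus,
    apply Gaussian blur, and normalize the result to [0, 1]."""
    g = np.zeros(shape, dtype=float)
    for u, v in gaze_uv:                          # gaze samples in pixels
        g[int(v) % shape[0], int(u) % shape[1]] = 1.0
    g = gaussian_filter(g, sigma=sigma)
    return g / g.max() if g.max() > 0 else g
```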
[0061] At S105, based on the design concept that the attention
points of the user are present in the direction having a higher
degree of importance, the attention-point-likelihood distribution
calculator 105 calculates the likelihood of the attention points in
accordance with the distribution of the degree of importance
calculated previously at S104. In the present embodiment, as illustrated in FIG. 6, a region R is defined with the viewpoint A, through which the shooting direction extends, as its center. The degrees of importance at the positions in the region R are summed, and the summed value can be used as the likelihood of the attention point at the viewpoint A. Further, in the present embodiment, a weight is applied to the degree of importance at each position in the region R so that the contribution attenuates as the position moves away from the viewpoint A. Using such a weight, the weighted sum of the degrees of importance can be used as the likelihood of the attention point at the viewpoint A.
[0062] With a three-dimensional vector p in the shooting direction, and a degree of importance g(p) in the shooting direction p, the likelihood of attention point a(p) is formulated as equation (4):

[Equation 4]
\[
a(p) = \eta\!\left( \int g(q)\, w(p, q)\, dq \right) \tag{4}
\]
[0063] In equation (4), η represents a monotonically increasing function, w(p, q) represents the weight, the integration is a definite integral, and the integration range is the entire unit sphere for shooting. In the present embodiment, η can be an exponential function, and w(p, q) is the function expressed in equation (5).
[Equation 5]
\[
w(p, q) = \exp\!\left( a\, p^{T} q \right) \tag{5}
\]
[0064] The above equation (5) is based on the von Mises distribution. The weight is maximum if the directions p and q are identical, and minimum if the directions p and q point oppositely. In the present embodiment, the parameter a determines the attenuation rate of the weight, allowing the angle of view that contributes to the attention point to be reflected.
[0065] Further, in the present embodiment, the weight w(p, q) can be expressed as equation (6) below, with {α_i} being parameters, so that a polynomial in the inner product of the directions p and q is provided as the argument.

[Equation 6]
\[
w(p, q) = \exp\!\left( \sum_{i} \alpha_i \left( p^{T} q \right)^{i} \right) \tag{6}
\]
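A discretized sketch of equations (4) and (5), assuming K sampled shooting directions on the unit sphere, uniform-area quadrature, and an exponential chosen for η; the result is shifted before exponentiation for numerical stability, which rescales the likelihood by a constant factor.

```python
import numpy as np

def attention_likelihood(dirs, g, a=8.0):
    """Discretized equation (4) with the weight of equation (5).
    dirs: (K, 3) unit vectors sampling the shooting sphere.
    g:    (K,) degree of importance at each direction.
    a:    concentration; larger values attenuate the weight faster."""
    w = np.exp(a * dirs @ dirs.T)          # w[p, q] = exp(a p^T q)
    s = w @ g * (4.0 * np.pi / len(dirs))  # uniform-area quadrature
    return np.exp(s - s.max())             # eta = exp, shifted for stability
```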
[0066] Description continues by referring back to FIG. 3.
[0067] At S106, the attention point calculator 106 calculates the
attention point in accordance with the attention point likelihood
distribution a(p). For example, in the present embodiment, the
position corresponding to a shooting direction p that corresponds
to the maximum likelihood value of the likelihood of attention
point a(p) may be calculated as the attention point. Alternatively,
the position corresponding to the shooting direction p that
corresponds to an average value of the attention point likelihood
distribution a(p) may be provided as the attention point, as in equation (7). The integral in equation (7) is a definite integral, with the integration range being the entire unit sphere for shooting.

[Equation 7]
\[
\bar{p} = \int p\, a(p)\, dp \tag{7}
\]
[0068] The present embodiment may calculate positions corresponding
to N shooting directions p (N is an integer of at least 1) that
correspond to a local maximum value of the attention point
likelihood a(p) as the attention points. If a plurality of local
maximum values of the attention point likelihood a(p) are present,
a plurality of attention points can be obtained. The local maximum
value of the attention point likelihood a(p) can be determined by
hill climbing from the initial value of p which is randomly
generated. If it is desired to determine M attention points at discrete positions, the attention points can be determined as the p1, p2, . . . , pM that maximize the evaluation function of equation (8):

[Equation 8]
\[
J = \sum_{i=1}^{M} a(p_i) + d(p_1, p_2, \ldots, p_M) \tag{8}
\]
[0069] In the equation (8), d represents a function representing a
distance between viewpoints, such as a dispersion among p1, p2, . .
. , pM or a sum of the Euclidean distance between viewpoints.
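A simplified sketch of selecting M discrete attention points in the spirit of equation (8); greedy selection stands in for the joint maximization, and d is taken as a λ-weighted sum of pairwise Euclidean distances (both choices are assumptions, not the patent's prescription).

```python
import numpy as np

def pick_attention_points(dirs, a_vals, M=3, lam=1.0):
    """Greedily pick M directions trading off likelihood a(p_i) against
    mutual spread, in the spirit of equation (8).
    dirs: (K, 3) candidate directions; a_vals: (K,) float likelihoods."""
    chosen = [int(np.argmax(a_vals))]
    for _ in range(M - 1):
        spread = sum(np.linalg.norm(dirs - dirs[c], axis=1) for c in chosen)
        score = a_vals + lam * spread      # likelihood plus distance term
        score[chosen] = -np.inf            # never re-pick a chosen point
        chosen.append(int(np.argmax(score)))
    return dirs[chosen]
```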
[0070] The series of processing steps of calculating the attention
points from the input image (omnidirectional image in the
equirectangular projection format) have been described. If the
image analyzing apparatus 100 of the present embodiment is used in
cropping or generation of thumbnails, the region of interest is
defined by setting a particular angle of view around the attention
point determined by the above-described procedure, and the image of
the defined region of interest is used as it is as a cropped image
or a thumbnail image. In this case, the angle of view to be set is
preferably the angle of view of the region of interest including
the attention point in the training data that has been given to the
regression model. Meanwhile, if the image analyzing apparatus 100
of the present embodiment is applied to the image recognition/image
searching system, the object region including the attention point
is used as an object for recognition or search.
[0071] As described above, the present embodiment does not
calculate the attention point directly from the feature amount of
each partial image. Instead, the configuration adopted by the
present embodiment includes calculating the degree of importance
using a first regression model according to the feature amount of
each partial image, and then calculating the attention point with a
second regression model according to the calculated degree of
importance. Thus, it is possible to calculate the degree of importance while reflecting the mutual interaction among the partial images, enabling accurate estimation of the attention point even for an image, such as an ultrawide image, that includes a plurality of highly salient objects, and decreasing the number of explanatory variables to improve generalization capability.
[0072] The following design changes are available for the first
embodiment described above.
[0073] For example, the input image may be divided by an arbitrary
dividing method other than dividing the spherical surface of the
omnidirectional image by approximating the regular polyhedrons. For
example, the spherical surface of the omnidirectional image may be
divided by approximating quasi-regular polyhedrons, or by Voronoi
division with randomly developed seeds on the spherical surface of
the omnidirectional image.
[0074] The partial images are not limited to the images obtained by
perspective projection of the omnidirectional image, and may be
obtained by other projection methods. For example, the partial
images may be obtained by orthographic projection. Alternatively,
the perspective projection may be carried out by shifting the
center of projection O from the center of the unit sphere, as
illustrated in FIGS. 7A and 7B. According to the projection method
illustrated in FIGS. 7A and 7B, the distortion of projection at the
edge of the image can be alleviated, while allowing the projection
of an angle of view of at least 180 degrees. Thus, the features can be extracted with a smaller number of divided images.
[0075] If an image taken by a camera having an angle of view of
less than 360 degrees is processed, the image having such an angle
of view is converted into an image of the equirectangular
projection format (partially excluded image) which is processed in
the same procedure as described above.
[0076] Even when the image to be processed is not in the
equirectangular projection format, the image processing can be
carried out similarly as described above, so long as the camera
that takes the image has been calibrated (i.e., directions of light
rays in the three-dimensional space corresponding to the position
of the imaging surface of the camera are known). When the image to
be processed is taken by an uncalibrated camera, the image cannot be divided by approximating regular polyhedrons, but another applicable dividing method (e.g., the Voronoi division mentioned above) may be used to divide the region.
[0077] In the above, the first embodiment of the present invention
in which the attention point is estimated from the input image in
accordance with the linear regression model has been described.
Next, a second embodiment of the present invention is described.
The second embodiment differs from the first embodiment in that a
neural network is used to estimate the attention point from the
input image. In the following, features common to the first embodiment will not be described, and mainly the parts that differ from the first embodiment are described.
[0078] In the second embodiment, the feature amount extractor 103
is provided as a neural network to which a partial image is input
and from which a feature amount is output. For example, the feature
amount extractor 103 can be formed using a convolution network,
such as the one used in areas of object recognition, as illustrated
in FIG. 8. In this case, a filter operation including a plurality
of kinds of weights is carried out in a convolution layer
("convolution layer 1", "convolution layer 2", "convolution layer
3") to calculate a value which is then converted by an activation
function ("activation function"). Examples of the activation
function include a logistics function, an inverse tangent function,
and a rectified linear activation unit (ReLU) function. Pooling
("pooling") is a downsizing operation of variables, such as
maxpooling or average pooling.
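A minimal PyTorch sketch of the FIG. 8 structure (three convolution layers, each followed by an activation function and pooling); the channel widths and kernel sizes are illustrative guesses, not taken from the disclosure.

```python
import torch.nn as nn

# Three convolution layers, each followed by an activation function (ReLU)
# and a pooling operation (max pooling), mirroring FIG. 8.
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),   # convolution layer 1
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # convolution layer 2
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), # convolution layer 3
    nn.ReLU(),
    nn.MaxPool2d(2),
)
```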
[0079] In one example, the degree-of-importance calculator 104 is
implemented as a neural network to which a group of feature amounts
extracted from the partial images is input and from which a degree
of importance corresponding to the position of the input image is
output. The degree-of-importance calculator 104 integrates, as
illustrated in FIG. 9, the input feature amounts ("feature amount 1
to N") and repeatedly carries out linear conversion in a full
connected layer ("full connected layer 1, 2") and non-linear
conversion of the activation function ("activation function") to
calculate the degree of importance.
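A corresponding PyTorch sketch of the FIG. 9 structure; the number of partial images, feature dimension, and number of output positions are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImportanceCalculator(nn.Module):
    """Concatenate the N partial-image feature vectors, then apply two
    full connected layers with activation functions, as in FIG. 9."""
    def __init__(self, n_parts=8, feat_dim=128, hidden=256, n_positions=1024):
        super().__init__()
        self.fc1 = nn.Linear(n_parts * feat_dim, hidden)  # full connected layer 1
        self.fc2 = nn.Linear(hidden, n_positions)         # full connected layer 2
        self.act = nn.ReLU()

    def forward(self, feats):         # feats: list of N (batch, feat_dim) tensors
        x = torch.cat(feats, dim=1)   # integrate feature amounts 1 to N
        return self.act(self.fc2(self.act(self.fc1(x))))
```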
[0080] In one example, learning is carried out in advance using
training data to identify parameters for the neural networks that
form the feature amount extractor 103 and the degree-of-importance
calculator 104. The present embodiment may also use a method called
fine tuning in which the learning is carried out at least in the
feature amount extractor 103 or the degree-of-importance calculator
104, and the feature amount extractor 103 and the
degree-of-importance calculator 104 are connected as one network to
allow overall learning.
[0081] In one example, the feature amount extractor 103 learns using a data set of the partial images and the feature amounts (e.g., saliency and object labels) as the training data, while the degree-of-importance calculator 104 learns using a data set of the feature amounts (e.g., saliency and object labels) and the degree of importance as the training data. Moreover, in the present embodiment, values are extracted from intermediate layers of the network, after the data set of the partial images and the object labels (feature amounts) is learned, to let the degree-of-importance calculator 104 learn the data set of the intermediate-layer values and the degree of importance, as illustrated in FIG. 10. In the present embodiment, the feature amount extractor 103 and the degree-of-importance calculator 104 may be regarded as one network to allow learning of the data set of the input image and the degree of importance.
[0082] In one example, the attention-point-likelihood distribution
calculator 105 may be implemented as a neural network to which the
degree of importance is input and from which the likelihood
distribution of attention points is output. In the present embodiment, the above-described equation (4) is understood as obtaining the likelihood of attention point a(p) by converting a convolution of the degree of importance g(p) with the weight by the function η. The function η is regarded as an activation function, and the integral of the convolution is replaced with a numerical integration over the discretized variable q, thus allowing the calculation to be carried out in the neural network format.
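A sketch of this formulation as a network layer, assuming K discretized directions; the weight matrix is initialized from equation (5), and softplus stands in for the monotonic activation η (a stability-motivated substitution, not the patent's choice).

```python
import torch
import torch.nn as nn

class LikelihoodCalculator(nn.Module):
    """Equation (4) as a network layer: a linear map whose weights are
    w(p, q), followed by a monotonic activation standing in for eta."""
    def __init__(self, dirs, a=8.0, learn_weights=False):
        super().__init__()
        w = torch.exp(a * dirs @ dirs.T)   # initialize from equation (5)
        self.w = nn.Parameter(w, requires_grad=learn_weights)
        self.eta = nn.Softplus()           # monotonic activation

    def forward(self, g):                  # g: (batch, K) degrees of importance
        return self.eta(g @ self.w.T)      # weighted sum, then eta
```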
[0083] In this example, the attention-point-likelihood distribution
calculator 105 can learn in the following manner. The parameter to
be determined is a weight w(p, q) for weighted summation of the
degree of importance. This can be learned directly or fixed at the
value of the equation (6). Alternatively, the value of the equation
(6) is set as an initial value for learning.
[0084] FIG. 11 illustrates a neural network configuration of the
attention-point-likelihood distribution calculator 105.
[0085] In the present embodiment, the attention point calculator
106 is formed as a neural network to which the likelihood
distribution of the attention points is input and from which the
attention point is output. FIG. 12 is a configuration example of
the attention point calculator 106 formed to generate the attention point corresponding to an average value of the likelihood distribution of the attention points, as in equation (7). The attention point calculator 106 is not limited to the
configuration of FIG. 12, and may also be configured to output the
maximum value as the attention point using the maximum output
layer.
[0086] The network weights of the attention point calculator 106 are fixed in the viewpoint directions (p1, p2, . . . , pM), for which adjustment is intrinsically unnecessary. Alternatively, however, the fixed viewpoint directions may be given as initial values and adjusted by learning. In the present embodiment, the attention-point-likelihood distribution calculator 105 and the attention point calculator 106 may also be regarded as a single network to form the neural network as illustrated in FIG. 13.
[0087] When learning the attention points, an angle between the
vector of the attention point of the teacher data and the vector of
the calculated attention point can be used as an error function in
the present embodiment. The Euclidean distance between the
attention point of the teacher data and the predicted attention
point may be used as the error. If the Euclidean distance is used,
a norm is also evaluated in addition to the direction of the vector
of the attention point. It is, therefore, preferable to introduce a normalizing step to normalize the likelihood of attention point a(p). This can be implemented using a softmax function including the function η. FIG. 14 is a configuration example of the normalization of the likelihood of the attention point a(p) with the softmax function.
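A sketch of such a training loss, combining softmax normalization of a(p), the discretized expectation of equation (7), and the angular error to the teacher attention point; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def angular_loss(a_logits, dirs, p_teacher):
    """Softmax-normalize the attention-point likelihood, take the expected
    direction (discretized equation (7)), and penalize the angle to the
    teacher attention point.
    a_logits: (batch, K); dirs: (K, 3); p_teacher: (batch, 3)."""
    a = torch.softmax(a_logits, dim=-1)        # normalized likelihood
    p_pred = F.normalize(a @ dirs, dim=-1)     # expected direction on the sphere
    cos = (p_pred * F.normalize(p_teacher, dim=-1)).sum(-1)
    return torch.acos(cos.clamp(-1.0 + 1e-6, 1.0 - 1e-6)).mean()
```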
[0088] In the present embodiment, the attention-point-likelihood
distribution calculator 105 and the attention point calculator 106
may learn separately, or entire learning may be carried out as one
network. Alternatively, one calculator may learn first and the
entire fine tuning follows.
[0089] In the above description, the feature amount extractor 103,
the degree-of-importance calculator 104, the
attention-point-likelihood distribution calculator 105, and the
attention point calculator 106 are implemented as the neural
networks. Alternatively, these four portions may be implemented as a single neural network, or at least one of the four portions may be replaced with a linear regression.
[0090] As described above, the present embodiment uses the neural networks to estimate the attention point from the input image (omnidirectional image in the equirectangular projection format), allowing total optimization from input to output and scalable learning of a large volume of training data. As a result, the attention point can be estimated accurately.
[0091] Although the attention point is estimated from the input
image using the neural network in the second embodiment described
above, at least one of the neural networks described above may be
replaced with other non-linear regression, such as support vector
regression or random forest regression.
[0092] In the above-described configuration, the degree of
importance is calculated from the feature amount of each partial
image using the first regression model, and the calculated degree
of importance is used to calculate the attention point using the
second regression model. However, the degree of importance calculated with the first regression model can also be used for other purposes, as described below. For example, in the embodiments of the present invention, the degree of importance calculated from the input image can be used to generate a heat map of a user's attention points in the input image. Alternatively, the degree of importance calculated from the input image can be used to control the bit rate allocated to the input image when the image is compressed.
Specifically, a higher bit rate is allocated to pixels having a
higher degree of importance and a lower bit rate is allocated to
pixels having a lower degree of importance, thus optimizing the
quality of the image. Thus, the image analyzing apparatus 100
according to the embodiments of the present invention can be
considered not only as the apparatus that calculates the attention
points of the input image, but can also be considered as the
apparatus that calculates the degree of importance for each
position of the input image.
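As a codec-agnostic sketch of the bit-rate idea above, the following distributes a total bit budget over fixed-size tiles in proportion to each tile's mean degree of importance; the tile size and budget are arbitrary assumptions.

```python
import numpy as np

def block_bit_budget(importance, block=16, total_bits=1_000_000):
    """Distribute a bit budget over block x block tiles in proportion to
    each tile's mean degree of importance."""
    H, W = importance.shape
    h, w = H // block, W // block
    tiles = importance[:h * block, :w * block].reshape(h, block, w, block)
    weights = tiles.mean(axis=(1, 3))
    weights = weights / weights.sum()
    return (weights * total_bits).astype(int)  # bits allocated per tile
```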
[0093] Moreover, the embodiments of the present invention have been
described mainly as a two-step method in which the first regression
model is used to calculate the degree of importance from the
feature amount of the partial images, followed by calculating the
attention points using the second regression model from the
calculated degree of importance. Alternatively, however, a
composite function, which uses the partial images as input and the
attention points as output and has an intermediate variable
corresponding to the degree of importance described above, may be
designed. As a result, a regression model can be formed by a single
step learning using the training data in which partial images (or
feature amount extracted from the partial images) are input and the
attention point is output. In this case, the intermediate variable
of the composite function can be used in place of the
above-described degree of importance to visualize the attention
points of the user or control allocation of bit rate in image
compression.
[0094] Referring to FIG. 15, a hardware configuration of a computer
included in the image analyzing apparatus 100 of the embodiments of
the present invention is described.
[0095] As illustrated in FIG. 15, the computer of the image analyzing apparatus 100 according to the embodiments of the present invention includes a processor 10, such as a central processing unit (CPU), that controls the entire operation of the apparatus; a read only memory (ROM) 12 that stores a boot program, a firmware program, and so on; a random access memory (RAM) 14 that provides an area to execute programs; an auxiliary storage 15 that stores programs and an operating system (OS) to enable the above-described functions of the image analyzing apparatus 100; an input/output interface 16 used to connect to external input/output devices; and a network interface 18 used to connect to a network.
[0096] The features of the above-described embodiments are implemented by programs written in programming languages such as C, C++, C#, and Java (registered trademark). In the embodiments of the present invention, such programs can be distributed as being stored in a storage medium, such as a hard disk device, a compact disc read-only memory (CD-ROM), a magneto-optical disc (MO), a digital versatile disc (DVD), a flexible disc, an electrically erasable programmable read-only memory (EEPROM), or an erasable programmable read-only memory (EPROM), or transferred via a network in a format readable by other devices.
[0097] The above-described embodiments are illustrative and do not
limit the present invention. Thus, numerous additional
modifications and variations are possible in light of the above
teachings. For example, elements and/or features of different
illustrative embodiments may be combined with each other and/or
substituted for each other within the scope of the present
invention.
[0098] Each of the functions of the described embodiments may be
implemented by one or more processing circuits or circuitry.
Processing circuitry includes a programmed processor, as a
processor includes circuitry. A processing circuit also includes
devices such as an application specific integrated circuit (ASIC),
digital signal processor (DSP), field programmable gate array
(FPGA), and conventional circuit components arranged to perform the
recited functions.
* * * * *