U.S. patent application number 16/888068, for operating on a video frame to generate a feature map of a neural network, was filed with the patent office on 2020-05-29 and published on 2021-12-02.
The applicant listed for this patent is Arm Limited. Invention is credited to Daren CROXFORD, Jayavarapu Srinivasa RAO, Dominic Hugo SYMES.
Application Number | 20210374422 16/888068 |
Document ID | / |
Family ID | 1000004913119 |
Filed Date | 2020-05-29 |
United States Patent Application | 20210374422 |
Kind Code | A1 |
RAO; Jayavarapu Srinivasa; et al. | December 2, 2021 |
OPERATING ON A VIDEO FRAME TO GENERATE A FEATURE MAP OF A NEURAL
NETWORK
Abstract
A method is described for operating on a frame of a video to
generate a feature map of a neural network. The method determines
if a block of the frame is an inter block or an intra block, and
performs an inter block process in the event that the block is an
inter block and/or an intra block process in the event that the
block is an intra block. The inter block process determines a
measure of differences between the block of the frame and a
reference block of a reference frame of the video, and performs
either a first process or a second process based on the measure to
generate a segment of the feature map. The intra block process
determines a measure of flatness of the block of the frame, and
performs either a third process or a fourth process based on the
measure to generate a segment of the feature map.
Inventors: |
RAO; Jayavarapu Srinivasa (Cambridge, GB) |
CROXFORD; Daren (Cambridge, GB) |
SYMES; Dominic Hugo (Cambridge, GB) |
Applicant: |
Name | City | State | Country | Type |
Arm Limited | Cambridge | | GB | |
Family ID: | 1000004913119 |
Appl. No.: | 16/888068 |
Filed: | May 29, 2020 |
Current U.S. Class: | 1/1 |
Current CPC Class: | G06N 3/08 20130101; G06K 9/00744 20130101; G06K 9/6232 20130101; G06N 3/04 20130101; G06K 9/6256 20130101 |
International Class: | G06K 9/00 20060101 G06K009/00; G06K 9/62 20060101 G06K009/62; G06N 3/04 20060101 G06N003/04; G06N 3/08 20060101 G06N003/08 |
Claims
1. A method comprising: operating on a frame of a video to generate
a feature map of a neural network, wherein the frame comprises a
plurality of blocks and operating on the frame comprises:
determining if a block of the frame is an inter block or an intra
block; and performing an inter block process in the event that the
block is an inter block or an intra block process in the event that
the block is an intra block, wherein the inter block process
comprises: determining a measure of differences between the block
of the frame and a reference block of a reference frame of the
video; and performing one of a first process and a second process
based on the measure of differences, wherein: the first process
comprises performing at least one operation of the neural network
on the block of the frame to generate a segment of the feature map;
and the second process comprises using a segment of a reference
feature map to generate the segment of the feature map; and wherein
the intra block process comprises: determining a measure of
flatness of the block; and performing one of the third process and
a fourth process based on the measure of flatness, wherein: the
third process comprises performing at least one operation of the
neural network on the block to generate a segment of the feature
map; and the fourth process comprises performing an inverse
frequency transform on a DC coefficient of the block to generate a
DC offset, and using the DC offset to generate each element of the
segment of the feature map.
2. A method as claimed in claim 1, wherein the block of the frame
comprises frequency coefficients, and the measure of differences
comprises a sum of squares of the frequency coefficients or a sum
of absolute values of the frequency coefficients.
3. A method as claimed in claim 1, wherein the method comprises
performing the first process in the event that the measure of
differences is greater than a first threshold and performing the
second process in the event that the measure of differences is less
than the first threshold.
4. A method as claimed in claim 1, wherein the block of the frame
comprises a motion vector, and the second process comprises using
the motion vector to identify or generate the segment of the
reference feature map.
5. A method as claimed in claim 1, wherein the segment of the
reference feature map is at a fractionally translated position
within the reference feature map, and the second process comprises
using interpolation in order to generate the segment of the
reference feature map.
6. A method as claimed in claim 1, wherein the second process
comprises: comparing the measure of differences against a second
threshold; in the event that the measure of differences is less
than the second threshold, using the segment of the reference
feature map as the segment of the feature map; and in the event
that the measure of differences is greater than the second
threshold, modifying the segment of the reference feature map to
generate a modified segment and using the modified segment as the
segment of the feature map.
7. A method as claimed in claim 6, wherein modifying the segment of
the reference feature map comprises using the block of the frame to
modify the segment of the reference feature map to generate the
modified segment.
8. A method as claimed in claim 6, wherein modifying the segment of
the reference feature map comprises performing the operation on
residual data of the block of the frame to generate a residual
segment, and applying the residual segment to the segment of the
reference feature map to generate the modified segment.
9. A method as claimed in claim 6, wherein the block of the frame
comprises frequency coefficients, the frequency coefficients
comprise a DC coefficient, and modifying the segment of the
reference feature map comprises performing an inverse frequency
transform on the DC coefficient to generate a DC offset, and using
the DC offset to modify the segment of the reference feature
map.
10. A method as claimed in claim 9, wherein using the DC offset to
modify the segment of the reference feature map comprises using the
DC offset to generate one or more compensation values and applying
the compensation values to elements of the segment of the reference
feature map to generate the modified segment.
11. A method as claimed in claim 10, wherein the operation of the
neural network comprises a convolution operation and using the DC
offset to modify the segment of the reference feature map comprises
using the DC offset to generate a compensation value for each
kernel of the convolution operation, and applying the compensation
value to each element of a respective channel of the segment of the
reference feature map to generate the modified segment.
12. A method as claimed in claim 11, wherein using the DC offset to
generate a compensation value for each kernel comprises multiplying
the DC offset with a sum of weights of the kernel.
13. A method as claimed in claim 6, wherein modifying the segment
of the reference feature map comprises: determining a measure of
flatness of the block of the frame, and based on the measure of
flatness, modifying the segment of the reference feature map using
one of a first modification process and a second modification
process.
14. A method as claimed in claim 13, wherein: the first
modification process comprises performing the operation on residual
data of the block of the frame to generate a residual segment, and
using the residual segment to modify the segment of the reference
feature map to generate the modified segment; and the second
modification process comprises performing an inverse frequency
transform on a DC coefficient of the block of the frame to generate
a DC offset, and using the DC offset to modify the segment of the
reference feature map.
15. A method as claimed in claim 1, wherein determining the measure
of flatness comprises determining at least one of a size of the
block and a measure of high frequency coefficients of the
block.
16. A method as claimed in claim 15, wherein the operation is a
convolution operation, and using the DC offset to generate each
element comprises multiplying the DC offset with a sum of weights
for each kernel of the convolution operation to generate each
element of a respective channel of the segment of the feature
map.
17. A processing unit configured to: determine if a block of a
frame of a video is an inter block or an intra block; and perform
an inter block process in the event that the block is an inter
block or an intra block process in the event that the block is an
intra block, wherein the inter block process comprises: determining
a measure of differences between the block of the frame and a
reference block of a reference frame of the video; and performing
one of a first process and a second process based on the measure of
differences, wherein: the first process comprises performing or
instructing a further processing unit to perform at least one
operation of a neural network on the block of the frame to generate
a segment of the feature map; and the second process comprises
using a segment of a reference feature map to generate the segment
of the feature map; and wherein the intra block process comprises:
determining a measure of flatness of the block; and performing one
of the third process and a fourth process based on the measure of
flatness, wherein: the third process comprises performing or
instructing a further processing unit to perform at least one
operation of a neural network on the block to generate a segment of
the feature map; and the fourth process comprises performing an
inverse frequency transform on a DC coefficient of the block to
generate a DC offset, and using the DC offset to generate each
element of the segment of the feature map.
18. A system comprising a first processing unit and a second
processing unit, wherein the first processing unit is configured
to: operate on a frame of a video to generate a feature map of a
neural network, wherein the frame comprises a plurality of blocks
and operating on the frame comprises: determining if a block of a
frame of a video is an inter block or an intra block; and
performing an inter block process in the event that the block is an
inter block or an intra block process in the event that the block
is an intra block, wherein the inter block process comprises:
determining a measure of differences between the block of the frame
and a reference block of a reference frame of the video; and
performing one of a first process and a second process based on the
measure of differences, wherein: the first process comprises
instructing the second processing unit to perform at least one
operation of the neural network on the block of the frame to
generate a segment of the feature map; and the second process
comprises using a segment of a reference feature map to generate
the segment of the feature map; and wherein the intra block process
comprises: determining a measure of flatness of the block; and
performing one of the third process and a fourth process based on
the measure of flatness, wherein: the third process comprises
performing or instructing the second processing unit to perform at
least one operation of the neural network on the block to generate
a segment of the feature map; and the fourth process comprises
performing an inverse frequency transform on a DC coefficient of
the block to generate a DC offset, and using the DC offset to
generate each element of the segment of the feature map.
Description
BACKGROUND OF THE INVENTION
Field of the Invention
[0001] The present invention relates to a method of operating on a
frame of a video to generate a feature map of a neural network, and
to a processing unit and system for implementing the method.
Description of the Related Technology
[0002] Machine learning may be used to process visual data, such as
video. For example, machine learning may be used to extract
meaningful information from the visual data (e.g. the identity and
location of objects), or to enhance or manipulate the visual data
(e.g. increase the resolution or dynamic range). It may be
desirable to implement machine learning locally on an embedded
device, e.g. due to concerns over latency or privacy. However,
machine learning algorithms can be computationally expensive and
may be challenging to implement on an embedded
device, particularly in battery-powered products having a low
energy or power budget.
SUMMARY
[0003] According to a first aspect of the present disclosure, there
is provided a method comprising operating on a frame of a video to
generate a feature map of a neural network, wherein the frame
comprises a plurality of blocks and operating on the frame
comprises: determining if a block of the frame is an inter block or
an intra block; and performing an inter block process in the event
that the block is an inter block and/or an intra block process in
the event that the block is an intra block, wherein the inter block
process comprises: determining a measure of differences between the
block of the frame and a reference block of a reference frame of
the video; and performing one of a first process and a second
process based on the measure of differences, wherein: the first
process comprises performing at least one operation of the neural
network on the block of the frame to generate a segment of the
feature map; and the second process comprises using a segment of a
reference feature map to generate the segment of the feature map;
and wherein the intra block process comprises: determining a
measure of flatness of the block; and performing one of a third
process and a fourth process based on the measure of flatness,
wherein: the third process comprises performing at least one
operation of the neural network on the block to generate a segment
of the feature map; and the fourth process comprises performing an
inverse frequency transform on a DC coefficient of the block to
generate a DC offset, and using the DC offset to generate each
element of the segment of the feature map.
[0004] According to a second aspect of the present disclosure,
there is provided a processing unit configured to: determine if a
block of a frame of a video is an inter block or an intra block;
and perform an inter block process in the event that the block is
an inter block and/or an intra block process in the event that the
block is an intra block, wherein the inter block process comprises:
determining a measure of differences between the block of the frame
and a reference block of a reference frame of the video; and
performing one of a first process and a second process based on the
measure of differences, wherein: the first process comprises
performing or instructing a further processing unit to perform at
least one operation of a neural network on the block of the frame
to generate a segment of the feature map; and the second process
comprises using a segment of a reference feature map to generate
the segment of the feature map; and wherein the intra block process
comprises: determining a measure of flatness of the block; and
performing one of a third process and a fourth process based on
the measure of flatness, wherein: the third process comprises
performing or instructing a further processing unit to perform at
least one operation of a neural network on the block to generate a
segment of the feature map; and the fourth process comprises
performing an inverse frequency transform on a DC coefficient of
the block to generate a DC offset, and using the DC offset to
generate each element of the segment of the feature map.
[0005] According to a third aspect of the present disclosure, there
is provided a system comprising a first processing unit and a
second processing unit, wherein the first processing unit is
configured to: operate on a frame of a video to generate a feature
map of a neural network, wherein the frame comprises a plurality of
blocks and operating on the frame comprises: determining if a block
of a frame of a video is an inter block or an intra block; and
performing an inter block process in the event that the block is an
inter block and/or an intra block process in the event that the
block is an intra block, wherein the inter block process comprises:
determining a measure of differences between the block of the frame
and a reference block of a reference frame of the video; and
performing one of a first process and a second process based on the
measure of differences, wherein: the first process comprises
instructing the second processing unit to perform at least one
operation of the neural network on the block of the frame to
generate a segment of the feature map; and the second process
comprises using a segment of a reference feature map to generate
the segment of the feature map; and wherein the intra block process
comprises: determining a measure of flatness of the block; and
performing one of a third process and a fourth process based on
the measure of flatness, wherein: the third process comprises
performing or instructing the second processing unit to perform at
least one operation of the neural network on the block to generate
a segment of the feature map; and the fourth process comprises
performing an inverse frequency transform on a DC coefficient of
the block to generate a DC offset, and using the DC offset to
generate each element of the segment of the feature map.
[0006] According to a fourth aspect of the present disclosure,
there is provided a method comprising operating on a frame of a
video to generate a feature map of a neural network, wherein the
frame comprises a plurality of blocks and operating on the frame
comprises: determining a measure of differences between a block
of the frame and a reference block of a reference frame of the
video; and performing one of a first process and a second process
based on the measure, wherein: the first process comprises
performing at least one operation of the neural network on the
block of the frame to generate a segment of the feature map; and
the second process comprises using a segment of a reference feature
map to generate the segment of the feature map.
[0007] According to a fifth aspect of the present disclosure, there
is provided a method comprising operating on a frame of a video to
generate a feature map of a neural network, wherein the frame
comprises a plurality of blocks and operating on the frame
comprises: determining a measure of flatness of a block of the frame; and
performing one of a first process and a second process based on
the measure, wherein: the first process comprises performing at
least one operation of the neural network on the block to generate
a segment of the feature map; and the second process comprises
performing an inverse frequency transform on a DC coefficient of
the block to generate a DC offset, and using the DC offset to
generate each element of the segment of the feature map.
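The flat-block shortcut of this aspect can be sketched as follows. This is a minimal illustration only, not the applicant's implementation. It assumes an orthonormal N x N two-dimensional DCT (so that a block holding only a DC coefficient reconstructs to the constant pixel value DC/N), a stride-1 convolution with no padding and no bias, and hypothetical helper names (dc_offset, flat_segment_values, conv2d_valid).

```python
def dc_offset(dc_coefficient, block_size):
    """Inverse orthonormal 2-D DCT of a DC-only N x N block: each pixel is DC / N."""
    return dc_coefficient / block_size


def flat_segment_values(dc_coefficient, block_size, kernels):
    """Return one constant per kernel: DC offset times the sum of the kernel weights."""
    offset = dc_offset(dc_coefficient, block_size)
    return [offset * sum(sum(row) for row in kernel) for kernel in kernels]


def conv2d_valid(image, kernel):
    """Direct stride-1 convolution (cross-correlation) without padding, for checking."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(len(image[0]) - kw + 1)
        ]
        for y in range(len(image) - kh + 1)
    ]


# A flat 8x8 block of value 10 has DC coefficient 8 * 10 = 80 under the assumed transform.
kernels = [[[1, 0], [0, -1]], [[1, 2], [3, 4]]]
shortcut = flat_segment_values(80, 8, kernels)

# Direct convolution of the flat block gives the same constant in every element.
flat_block = [[10] * 8 for _ in range(8)]
direct = [conv2d_valid(flat_block, k) for k in kernels]
assert all(v == shortcut[0] for row in direct[0] for v in row)
assert all(v == shortcut[1] for row in direct[1] for v in row)
```

Under these assumptions the shortcut replaces a full convolution of the block with one multiplication per kernel. Real codecs typically use scaled integer transforms, so the DC-to-pixel scaling would differ.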
[0008] Further features will become apparent from the following
description, given by way of example only, which is made with
reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] FIG. 1 is a flow chart of an example method for performing
operations of a neural network on a video frame;
[0010] FIG. 2 illustrates an example of a neural network;
[0011] FIG. 3 is a flow chart of a further example of a method for
performing operations of a neural network on a video frame;
[0012] FIG. 4 is a first example method for modifying a segment of
a reference feature map;
[0013] FIG. 5 is a second example method for modifying a segment of
a reference feature map;
[0014] FIG. 6 is a flow chart of a still further example method for
performing operations of a neural network on a video frame;
[0015] FIG. 7 is a flow chart of another example method for
performing operations of a neural network on a video frame; and
[0016] FIG. 8 is a block diagram of a machine learning system.
DETAILED DESCRIPTION OF CERTAIN INVENTIVE EMBODIMENTS
[0017] Details of systems and methods according to examples will
become apparent from the following description, with reference to
the Figures. In this description, for the purpose of explanation,
numerous specific details of certain examples are set forth.
Reference in the specification to "an example" or similar language
means that a particular feature, structure, or characteristic
described in connection with the example is included in at least
that one example, but not necessarily in other examples. It should
further be noted that certain examples are described schematically
with certain features omitted and/or necessarily simplified for
ease of explanation and understanding of the concepts underlying
the examples.
[0018] Machine learning may be used to extract information from
video (e.g. the identity and location of objects, features, or
activities) or to enhance or manipulate video (e.g. increase the
resolution or dynamic range). A machine learning system may employ
a neural network, such as a convolutional neural network, in order
to extract information from each frame of the video. However,
processing each frame in this way can be computationally
expensive.
[0019] FIG. 1 is a flow chart of an example method for performing
operations of a neural network on a frame of a video.
[0020] The method 100 comprises storing 110 a reference feature
map. The reference feature map is generated by at least one
operation of a neural network performed on a reference frame of a
video. The reference feature map may be generated using the methods
described herein. Alternatively, the reference feature map may be
generated by performing the neural network operation(s) on the
reference frame in a conventional manner.
[0021] FIG. 2 illustrates an example of a neural network. The input
data corresponds to a single frame of the video. The input data
comprises just one channel corresponding to the luma channel of the
frame. However, the input data might additionally comprise channels
corresponding to one or both of the chroma channels. Moreover, the
input data may comprise channels of an alternative color space,
such as RGB.
[0022] The first operation of the neural network of FIG. 2 is a
convolution. Consequently, the reference feature map may comprise
the feature map output by the convolution operation. Storing the
feature map output by the convolution operation has the benefit
that linear terms may be added to the reference feature map in
order to generate a modified feature map, as described below in
more detail. The convolution operation is followed by an activation
operation and a pooling operation. Accordingly, the reference
feature map might alternatively comprise the feature map output by
the activation or pooling operation. Storing the feature map output
by the pooling operation has the benefit that the reference feature
map is smaller and thus requires less memory. In FIG. 2, each
convolution operation is shown as comprising an activation function
(ReLU). However, the convolution operation (which is linear) and
the activation operation (which is typically non-linear) should
be regarded as two separate operations.
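The benefit of storing the convolution output can be illustrated with a small sketch. It assumes a single-channel, stride-1 convolution without padding or bias; the helper names (conv2d_valid, add_maps, relu) and the sample values are hypothetical. Because convolution is linear, convolving the residual and adding the result to the stored reference output gives the same values as convolving the reconstructed block directly. The same is not true once a non-linear activation has been applied.

```python
def conv2d_valid(image, kernel):
    """Stride-1 convolution (cross-correlation) without padding."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(len(image[0]) - kw + 1)
        ]
        for y in range(len(image) - kh + 1)
    ]


def add_maps(a, b):
    """Element-wise sum of two equally sized 2-D lists."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def relu(feature_map):
    """Element-wise ReLU activation."""
    return [[max(0, v) for v in row] for row in feature_map]


reference_block = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
residual = [[0, 1, 0], [-1, 0, 1], [0, -1, 0]]
kernel = [[1, -2], [0, 3]]

# Convolving the reconstructed block directly ...
current_block = add_maps(reference_block, residual)
direct = conv2d_valid(current_block, kernel)

# ... matches adding the convolved residual to the stored reference output.
updated = add_maps(conv2d_valid(reference_block, kernel),
                   conv2d_valid(residual, kernel))
assert direct == updated

# ReLU is not additive, so the same update is not valid after activation.
a, b = [[-1]], [[2]]
assert relu(add_maps(a, b)) != add_maps(relu(a), relu(b))
```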
[0023] The method comprises operating on a current frame of the
video to generate a current feature map. Each frame of the video
comprises a plurality of blocks. For example, each frame may
comprise a plurality of macroblocks, and each macroblock may
comprise one or more blocks. For each block of the current frame,
the method determines 120 whether the block is an inter block or an
intra block. In the event that the block is an inter block, the
method performs an inter block process, and in the event that the
block is an intra block, the method performs an intra block
process.
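The per-block dispatch described above can be sketched as below. The Block type, its fields, and the process_frame function are hypothetical names introduced for illustration; the inter and intra block processes are passed in as callables because their details are described separately.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Block:
    """Hypothetical container for one decoded block of a frame."""
    is_inter: bool                      # True for an inter block, False for an intra block
    coefficients: List[float] = field(default_factory=list)


def process_frame(blocks: List[Block],
                  inter_process: Callable[[Block], Any],
                  intra_process: Callable[[Block], Any]) -> List[Any]:
    """Return one feature-map segment per block, chosen by block type."""
    segments = []
    for block in blocks:
        if block.is_inter:
            segments.append(inter_process(block))
        else:
            segments.append(intra_process(block))
    return segments


# Usage with placeholder processes that only report which path was taken.
frame = [Block(is_inter=True), Block(is_inter=False), Block(is_inter=True)]
paths = process_frame(frame, lambda b: "inter", lambda b: "intra")
```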
[0024] The inter block process determines 130 a measure of the
differences between pixels of the block of the current frame and
pixels of the corresponding reference block of the reference frame.
This may comprise performing an inverse frequency transform on the
block of the current frame (in addition to any decoding and
dequantization) in order to convert the coefficients from the
frequency domain (frequency coefficients) to the spatial domain
(residual coefficients), and then determining a measure of the
differences based on the residual coefficients. However, many video
transforms, including the discrete cosine transform, are orthogonal and
so preserve energy content. As a result, a measure of the
differences in the pixels of the two blocks may be determined
without the need to perform an inverse frequency transform. In
particular, a measure of the differences may be determined from the
frequency coefficients (quantized or dequantized) of the block of
the current frame.
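The energy-preservation property can be checked with a short sketch. It assumes an orthonormal one-dimensional DCT-II (the two-dimensional case is separable and behaves the same way); the function names and sample residuals are hypothetical. Practical codecs often use scaled integer approximations of the transform, and quantization perturbs the coefficients, so in practice the equality would hold only approximately.

```python
import math


def dct_orthonormal(samples):
    """Orthonormal 1-D DCT-II of a list of samples."""
    n = len(samples)
    coefficients = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, x in enumerate(samples)
        )
        coefficients.append(scale * total)
    return coefficients


def energy(values):
    """Sum of squares of the values."""
    return sum(v * v for v in values)


residuals = [3, -1, 0, 2, -2, 1, 0, -3]      # spatial-domain residuals
coefficients = dct_orthonormal(residuals)    # frequency-domain coefficients

# The sum of squares is the same in both domains (up to rounding error),
# so the measure can be computed without an inverse transform.
assert math.isclose(energy(residuals), energy(coefficients), rel_tol=1e-9)
```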
[0025] The measure of differences may comprise the sum of squares
of the coefficients (be they frequency coefficients or residual
coefficients) of the block of the current frame. Alternatively, the
measure of differences may comprise the sum of absolute values of
the coefficients or any other statistical measure that may be used
to determine the magnitude of the residuals. Moreover, the measure
of differences may comprise more than one determiner. For example,
the measure may comprise the sum of squares of the coefficients and
the maximum absolute value of the coefficients. A macroblock may
include a skip flag, which when set implies that the residual
coefficients for the block are zero. Accordingly, the measure may
also comprise the value or setting of the skip flag.
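A minimal sketch of these measures follows. The function name, the dictionary keys, and the treatment of a set skip flag as all-zero coefficients are assumptions made for illustration.

```python
def difference_measures(coefficients, skip_flag=False):
    """Compute several candidate measures of differences for one block.

    `coefficients` may be frequency or residual coefficients. When the skip
    flag is set, the residuals are taken to be zero.
    """
    values = [] if skip_flag else list(coefficients)
    return {
        "sum_squares": sum(c * c for c in values),
        "sum_abs": sum(abs(c) for c in values),
        "max_abs": max((abs(c) for c in values), default=0),
        "skip": skip_flag,
    }


measures = difference_measures([3, -4, 0, 1])
skipped = difference_measures([3, -4, 0, 1], skip_flag=True)
```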
[0026] The inter block process determines 140 whether the two
blocks (i.e. the block of the current frame and the reference block
of the reference frame) are similar based on the measure of
differences. For example, the process may determine that the blocks
are similar in the event that the skip flag is set. Additionally,
or alternatively, the process may compare the measure against a
threshold. The blocks are then determined to be similar if the
measure is less than the threshold, and dissimilar if the measure
is greater than the threshold. As noted above, the measure may
comprise more than one determiner. So, for example, the measure may
comprise the sum of squares of the coefficients and the maximum
absolute value of the coefficients. In this instance, the method
may determine that the two blocks are similar if the sum of the
squares is less than a threshold, and the maximum value is less
than a further threshold.
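The similarity test might be sketched as follows. The function name and threshold values are hypothetical. The text does not say how a measure exactly equal to a threshold is treated, so this sketch assumes it counts as dissimilar.

```python
def blocks_similar(sum_squares, max_abs, skip_flag,
                   sum_squares_threshold, max_abs_threshold):
    """Decide whether the current block and its reference block are similar."""
    if skip_flag:
        # A set skip flag implies zero residuals, so the blocks are similar.
        return True
    # Both determiners must fall below their respective thresholds.
    return sum_squares < sum_squares_threshold and max_abs < max_abs_threshold


# Illustrative thresholds only; suitable values would be tuned per application.
similar = blocks_similar(26, 4, False, sum_squares_threshold=100, max_abs_threshold=8)
large_peak = blocks_similar(26, 9, False, sum_squares_threshold=100, max_abs_threshold=8)
skipped = blocks_similar(500, 50, True, sum_squares_threshold=100, max_abs_threshold=8)
```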
[0027] The inter block process performs one of two operations based
on the measure of differences. In particular, the inter block
process performs a first process if the blocks are determined to be
dissimilar, and a second process if the blocks are determined to be
similar.
[0028] The first process comprises performing 150 the operation of
the neural network on the block of the current frame to generate a
portion or segment of the current feature map. The operation is
performed on the pixel data of the block. Accordingly, the pixel
data of the current block are first reconstructed from the
coefficients of the block. So, for example, the frequency
coefficients of the block may be decoded, dequantized, and inverse
transformed in order to obtain the residual coefficients, and then
the residual coefficients may be added to the pixel data of the
reference block in order to obtain the pixel data of the current
block. The operation of the neural network is the same (at least
one) operation that was used to generate the reference feature map.
So, for example, if the first convolution operation of FIG. 2 was
used to generate the reference feature map, the first process
comprises operating on the pixel data of the current block using
the first convolution operation to generate a segment of the
current feature map.
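The final reconstruction step (adding residuals to the reference block) can be sketched as below. Decoding, dequantization, and the inverse transform are assumed to have already produced the residuals. Clipping to the valid pixel range for an assumed 8-bit depth is an addition of this sketch, as are the function name and sample values.

```python
def reconstruct_block(reference_pixels, residuals, bit_depth=8):
    """Add residuals to the reference block and clip to the valid pixel range."""
    max_value = (1 << bit_depth) - 1
    return [
        [min(max_value, max(0, ref + res)) for ref, res in zip(ref_row, res_row)]
        for ref_row, res_row in zip(reference_pixels, residuals)
    ]


reference_block = [[100, 250], [0, 128]]
residual_block = [[5, 10], [-3, 0]]
current_block = reconstruct_block(reference_block, residual_block)
```

The neural network operation would then be applied to current_block to produce the segment of the current feature map.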
[0029] The width and height of the segment of the current feature
map are defined by the size of the current block, as well as the
receptive field, stride length and padding of the filter used by
the operation. So, for example, if the size of the block is
64×64 and the receptive field, the stride length, and the
padding of the filter are 4×4 and (1,1) and zero then the
segment size might be 58×58.
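For a single convolution, the usual relationship between input size, receptive field, stride, and padding can be sketched as below. The function name is hypothetical. The 11×11 filter with stride 4 used in the check is an assumption (it is not stated in the text) that is consistent with the 227×227 input and 55×55 feature map of FIG. 2.

```python
def segment_size(block_size, receptive_field, stride=1, padding=0):
    """Output width (or height) of one convolution applied to one block dimension."""
    return (block_size + 2 * padding - receptive_field) // stride + 1


# Assumed filter parameters that reproduce the FIG. 2 sizes: 227 -> 55.
fig2_width = segment_size(227, receptive_field=11, stride=4, padding=0)

# A small block with a 3x3 filter, stride 1, and no padding loses a 1-element border.
small_width = segment_size(8, receptive_field=3)
```

When several operations are chained, the function would be applied once per operation, each time to the previous output size.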
[0030] The second process comprises using 160 a segment of the
reference feature map to generate the segment of the current
feature map. The reference segment comprises all channels of the
reference feature map, i.e. the depth of the reference segment is
the same as that of the reference feature map. However, the width
and height of the reference segment are again defined by the size
of the current block, as well as the receptive field and stride
length of the filter used by the operation.
[0031] The block of the current frame may comprise a motion vector,
and the second process may comprise using this motion vector in
order to identify and/or generate the reference segment. The motion
vector may be fractional (e.g. 1.5 pixels up and 2.25 pixels left).
As a result, the reference segment may be at a fractionally
translated position within the reference feature map. The second
process may therefore use interpolation, such as linear
interpolation, in order to generate the reference segment.
Additionally or alternatively, the reference feature map may have a
different width and/or height to that of the reference frame. This
can be seen in the neural network of FIG. 2, in which the width and
height of the input data is 227×227, whereas the width and
height of the feature map of the first convolution operation is
55×55. Any differences in the sizes of the frame and the
feature map will result in corresponding changes or scaling of the
motion vector. So, for example, if the reference feature map were
half the size of the reference frame then a motion vector of (3, 5)
would become (1.5, 2.5). Again, interpolation may be used in order
to generate the segment of the reference feature map.
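The motion-vector scaling and interpolation described above can be sketched as follows for a single channel. All function names are hypothetical, sizes are given as (width, height), and adding the displacement to the segment position is an assumed sign convention. Bilinear interpolation with clamping at the map edges is used here as one form of linear interpolation.

```python
import math


def scale_motion_vector(motion_vector, frame_size, map_size):
    """Scale an (x, y) motion vector from frame coordinates to feature-map coordinates."""
    dx, dy = motion_vector
    return (dx * map_size[0] / frame_size[0], dy * map_size[1] / frame_size[1])


def bilinear_sample(channel, x, y):
    """Sample one feature-map channel at a fractional position, clamping at the edges."""
    height, width = len(channel), len(channel[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = channel[y0][x0] * (1 - fx) + channel[y0][x1] * fx
    bottom = channel[y1][x0] * (1 - fx) + channel[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def fetch_reference_segment(channel, left, top, width, height, motion_vector):
    """Build a reference segment at a fractionally translated position."""
    dx, dy = motion_vector
    return [
        [bilinear_sample(channel, left + col + dx, top + row + dy)
         for col in range(width)]
        for row in range(height)
    ]


# A feature map half the size of the frame halves the motion vector.
scaled = scale_motion_vector((3, 5), frame_size=(100, 100), map_size=(50, 50))

reference_channel = [[0, 10, 20], [30, 40, 50], [60, 70, 80]]
segment = fetch_reference_segment(reference_channel, 0, 0, 2, 2, (0.5, 0.5))
```

A full implementation would repeat the sampling for every channel of the reference feature map.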
[0032] The segment of the reference feature map may be used as the
segment of the current feature map. That is to say that the
reference segment may be used in the current feature map without
any modification. Alternatively, as described below in more detail,
the reference segment may instead be modified, and the modified
reference segment may be used as the segment of the current feature
map. In both instances, the reference segment is used (be it
unmodified or modified) to generate the segment of the current
feature map.
[0033] The intra block process comprises performing 170 the
operation of the neural network on the block of the current frame
to generate a segment of the current feature map. Again, as with
the inter block that is deemed dissimilar, the operation is
performed on the pixel data of the block.
[0034] The method may be repeated for each block of the current
frame in order to generate a full feature map composed of a
plurality of segments. As noted above, each segment of the current
feature map is likely to be smaller than the respective block. As a
result, there will be gaps or missing data in the current feature
map around each segment. There are various ways in which this might
be addressed. For example, the neural network operation may be
performed on those portions of the pixel data of the current frame
necessary to generate the missing data of the feature map. If the
current block is an inter block and has the same motion vector as
that of neighboring blocks, then the reference segment as a whole
may be used as the segment of the current feature map. Equally, if
the current block is an intra block and has the same DC coefficient
as that of neighboring blocks which are also deemed flat, then the
reference segment as a whole may be used as the segment of the
current feature map. As a further example, the video may be encoded
using non-standard encoding such that the blocks within each frame
overlap. Moreover, the degree of overlap may be defined such that
there are no gaps or missing data between segments of the current
feature map.
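By way of illustration only, the following sketch shows why gaps arise and how much overlap would close them. It assumes an unpadded convolution with a square kernel; the function names and the 16-pixel block size are hypothetical and do not form part of the method described above.

```python
def segment_size(block_size: int, kernel_size: int, stride: int = 1) -> int:
    """Number of output elements per dimension for an unpadded convolution."""
    return (block_size - kernel_size) // stride + 1


def overlap_for_gapless_segments(kernel_size: int, stride: int = 1) -> int:
    """Pixels by which adjacent blocks must overlap so that segments tile.

    A block producing n outputs spans (n - 1) * stride + kernel_size pixels,
    while the first output of the next block starts n * stride pixels later.
    The difference between the two is kernel_size - stride.
    """
    return kernel_size - stride


print(segment_size(16, 3))              # 14: one element is lost on each side
print(overlap_for_gapless_segments(3))  # 2
```

With a 3x3 kernel and a stride of one, a 16x16 block yields a 14x14 segment, so neighboring blocks would need to overlap by two pixels for the segments to abut.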
[0035] The current feature map, once complete, may be stored and
used as the reference feature map for the next frame of the
video.
[0036] With the method described above, the feature map for the
current frame may be generated using segments from the
previously-generated reference feature map. As a result, the
current feature map may be generated in a computationally simpler
manner. In particular, the current feature map may be generated
without having to perform the neural network operation on each and
every block of the current frame. Indeed, for video that changes
relatively little from frame to frame, such as that captured by a
security camera or by a camera of an augmented reality headset,
significant savings in computation may be achieved.
[0037] Referring now to FIG. 3, the second process of the method of
FIG. 1 (i.e. the step of using 160 the reference segment to
generate the segment of the current feature map) may comprise
determining 161 whether the current block and the reference block
are very similar (i.e. identical or differing in relatively minor
details). For example, the second process may determine that the
blocks are very similar in the event that the skip flag is set.
Additionally or alternatively, the second process may determine
that the blocks are very similar by comparing the measure (e.g. sum
of squares of the coefficients) against a further threshold. More
particularly, the second process may determine that the blocks are
very similar in the event that the measure is less than the further
threshold.
[0038] In the event that the blocks are very similar, the second
process comprises using 162 the reference segment as the segment of
the current feature map. However, in the event that the blocks are
not very similar, the second process comprises modifying 163 the
reference segment, and then using the modified reference segment as
the segment of the current feature map.
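The decision of FIG. 3 might be sketched as follows. The function name and the returned labels are hypothetical, and combining the skip flag and the threshold test with a logical OR is an assumption made for illustration.

```python
def second_process(measure, further_threshold, skip_flag=False):
    """Choose between steps 162 and 163 for a block already deemed similar."""
    # The blocks count as very similar if the skip flag is set or if the
    # measure of differences falls below the further threshold.
    very_similar = skip_flag or measure < further_threshold
    if very_similar:
        return "use_reference_segment"      # step 162
    return "modify_reference_segment"       # step 163


print(second_process(3.0, further_threshold=10.0))    # use_reference_segment
print(second_process(42.0, further_threshold=10.0))   # modify_reference_segment
print(second_process(42.0, 10.0, skip_flag=True))     # use_reference_segment
```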
[0039] FIG. 4 is a first example method for modifying 163 the
reference segment. The method comprises performing 164 the
operation of the neural network (again the same operation(s) as
that used to generate the reference feature map) on the residual
data of the current block to generate a residual segment, and then
applying (e.g. adding) the residual segment to the reference
segment to generate the modified segment. By operating on the
residual data rather than the pixel data of the current block, the
values processed are likely to be smaller and thus a reduction in
the amount of toggling in the data-path may be achieved. As a
result, the power required to generate the segment of the current
feature map may be reduced. Additionally, many of the residual
values of the current block are likely to be zero. As a result, the
computation required to generate the segment of the current feature
map may be better optimized, e.g. by using sparse optimization.
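The approach of FIG. 4 relies on the linearity of the operation. The following minimal sketch assumes a bias-free, unpadded convolution and a reference block that is already aligned with the current block; the helper name and the sample values are hypothetical.

```python
def conv2d_valid(data, kernel):
    """Unpadded 2D convolution (cross-correlation) on lists of lists."""
    k = len(kernel)
    rows = len(data) - k + 1
    cols = len(data[0]) - k + 1
    return [[sum(kernel[i][j] * data[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(cols)]
            for r in range(rows)]


kernel = [[0, 1, 2], [0, 2, 0], [2, 1, 0]]
reference_block = [[10, 12, 11, 9], [8, 10, 13, 12],
                   [9, 9, 10, 11], [12, 10, 8, 9]]
# Residual data: mostly zero, as is typical for a similar block.
residual = [[0, 0, 1, 0], [0, 0, 0, 0], [0, -1, 0, 0], [0, 0, 0, 0]]
current_block = [[p + d for p, d in zip(p_row, d_row)]
                 for p_row, d_row in zip(reference_block, residual)]

reference_segment = conv2d_valid(reference_block, kernel)
residual_segment = conv2d_valid(residual, kernel)
# Apply (add) the residual segment to the reference segment.
modified_segment = [[a + b for a, b in zip(a_row, b_row)]
                    for a_row, b_row in zip(reference_segment, residual_segment)]

# Same result as operating on the pixel data of the current block directly.
assert modified_segment == conv2d_valid(current_block, kernel)
```

Because most residual values are zero, an implementation could in principle skip the corresponding multiplications, which the dense loop above does not attempt.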
[0040] FIG. 5 is a second example method for modifying 163 the
reference segment. Where the current block comprises frequency
coefficients, the DC coefficient may be used to modify the
reference segment. For a block that has been determined to be
similar to its reference block, it is possible that all frequency
coefficients, with the exception of the DC coefficient, are zero or
are relatively small. Indeed, this may be made a requirement when
determining that the two blocks are similar. As noted above, when
determining 130 the measure of differences between the current
block and the reference block, the measure may comprise more than
one determiner. The method may therefore determine that the two
blocks are similar if, for example, the sum of squares of the
frequency coefficients is less than a threshold and each frequency
coefficient is less than a further threshold.
[0041] Modifying 163 the reference segment comprises generating 166
a DC offset by performing an inverse frequency transform on the DC
coefficient. The DC offset is then used to modify the reference
segment to generate the modified segment. The method may comprise
using 167 the DC offset to generate one or more compensation
values, and applying 168 (e.g. adding) the compensation values to
the elements of the reference segment to generate the modified
segment. Where the operation comprises a convolution operation, the
method may comprise using the DC offset to generate a compensation
value for each kernel of the convolution, and applying the
compensation value to each element of a respective channel of the
segment of the reference feature map to generate the modified
segment. The compensation value for each kernel may be generated by
multiplying the DC offset with a sum of weights of the kernel. So,
for example, a DC offset of 5 and a 3×3 kernel having the
weights {0,1,2;0,2,0;2,1,0} would result in a compensation value of
40. Since convolution operations are linear, the compensation value
may be applied to the output of the convolution operation in order
to achieve the same result as that which would be achieved had the
DC offset been applied to the input. However, by applying the
compensation value to the output (i.e. the segment of the reference
feature map) rather than the input (i.e. the pixel data of the
current block), the need to perform the convolution is avoided and
thus the segment of the current feature map may be generated in a
computationally simpler manner. Moreover, for a trained neural
network, the sum of the weights for each kernel may be
precalculated and stored, thus further simplifying the
computation.
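The worked example above can be checked with a short sketch. It assumes the DC offset has already been obtained from the inverse frequency transform and that the convolution has no bias term; the patch values and the helper name are hypothetical.

```python
kernel = [[0, 1, 2], [0, 2, 0], [2, 1, 0]]
dc_offset = 5

# For a trained network the sum of weights can be precalculated and stored.
weight_sum = sum(sum(row) for row in kernel)   # 8
compensation = dc_offset * weight_sum          # 40


def apply_kernel(patch, weights):
    """One output element of the convolution for a kernel-sized patch."""
    return sum(w * p
               for w_row, p_row in zip(weights, patch)
               for w, p in zip(w_row, p_row))


reference_patch = [[10, 12, 11], [8, 10, 13], [9, 9, 10]]
shifted_patch = [[p + dc_offset for p in row] for row in reference_patch]

# Adding the compensation value to the output matches convolving the
# offset input, so the convolution need not be repeated.
reference_element = apply_kernel(reference_patch, kernel)
assert apply_kernel(shifted_patch, kernel) == reference_element + compensation
print(compensation)  # 40
```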
[0042] When determining if the two blocks are similar, the DC
coefficient may be omitted from the measure, i.e. the measure of
differences may be based on the AC coefficients only. As already
noted, the DC coefficient corresponds to a DC offset in the spatial
domain that is applied to all pixels of the current block. As a
result, the DC offset comprises no features. If the DC coefficient
were included in the measure of differences (e.g. the sum of
squares of the frequency coefficients) and the DC coefficient was
relatively large, the method may determine that the current block
and the reference block are dissimilar even if the other frequency
coefficients were all zero. However, by omitting the DC coefficient
from the measure of differences, the method may determine that the
current block and reference block are similar and instead use the
method of FIG. 5 to generate the segment of the current feature map
in a computationally simpler way.
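The effect of omitting the DC coefficient may be illustrated as follows. The sketch assumes that the frequency coefficients are held in a two-dimensional array with the DC coefficient at position [0][0]; the function name and the sample values are hypothetical.

```python
def difference_measure(coefficients, include_dc=False):
    """Sum of squares of the frequency coefficients of a block."""
    total = sum(c * c for row in coefficients for c in row)
    if not include_dc:
        # coefficients[0][0] is assumed to hold the DC coefficient.
        total -= coefficients[0][0] ** 2
    return total


coefficients = [[120, 2, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0]]

print(difference_measure(coefficients, include_dc=True))   # 14405
print(difference_measure(coefficients))                    # 5
```

With the DC coefficient included, a large uniform brightness change dominates the measure; with AC coefficients only, the same block yields a small measure and may be deemed similar.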
[0043] FIGS. 4 and 5 show two examples for modifying the segment of
the reference feature map. Common to both is the notion that the
reference segment is modified using the block of the current frame.
The two examples are not mutually exclusive, which is to say that
the inclusion of one example does not preclude the inclusion of the
other example. The method may use one of the two examples based on
the measure of differences or some other metric. For example, the
method may compare the measure of differences against a third
threshold, modify the reference segment using the second example
(FIG. 5) in the event that the measure is less than the third
threshold, and modify the reference segment using the first example
(FIG. 4) in the event that the measure is greater than the third
threshold. Alternatively, the method may determine the flatness of
the block using the sum of squares or absolute sum of the high
frequency coefficients of the current block. The method may then
modify the reference segment using the second example (FIG. 5) in
the event that the block is flat (i.e. the sum is less than a
threshold), and modify the reference segment using the first
example (FIG. 4) in the event that the block is not flat (i.e. the
sum is greater than the threshold). In this way, the comparatively
simpler example of FIG. 5 is used to modify the segment of the
reference feature map only when there appears to be little or no detail
(i.e. zero or small high frequency coefficients) in the current
block.
[0044] Example methods have thus far been described for generating
segments of the current feature map from inter blocks. However, as
will now be described with reference to FIG. 6, it is also possible
to provide computational benefits in connection with intra blocks
of the video frame.
[0045] FIG. 6 illustrates a modification to the method of FIG. 1.
The intra block process determines 180 whether the block is flat.
A flat block is one in which the pixel data is generally uniform
and is therefore devoid of any features. In intra prediction, a
large block generally suggests that the block is flat. Accordingly,
the intra block process may determine that the block is flat based
on the size of the block. For example, the process may determine
that the block is flat if the size of the block is greater than
32×32 or some other threshold. The quantization applied to
the coefficients of the block may also be an indicator of the
flatness of the block. Accordingly, the intra block process may
additionally or alternatively determine that the block is flat
based on the quantization that is used. The flatness of the block
may also be determined from the amplitudes of the high frequency
coefficients of the block. For example, the intra block process may
comprise determining the sum of squares or absolute sum of the high
frequency coefficients of the block, and then determining that the
block is flat if the sum is less than a threshold.
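A flatness test along these lines might be sketched as below. Which coefficients count as high frequency is not specified above, so the index-based cutoff is an assumption, as are the function names, the default thresholds, the use of the side length as the block size, and the combination of the two tests with a logical OR.

```python
def high_frequency_energy(coefficients, cutoff):
    """Sum of squares of coefficients whose row + column index >= cutoff."""
    return sum(c * c
               for u, row in enumerate(coefficients)
               for v, c in enumerate(row)
               if u + v >= cutoff)


def is_flat(block_side, coefficients, size_threshold=32,
            energy_threshold=10, cutoff=2):
    """Deem a block flat if it is large or has little high-frequency energy."""
    if block_side > size_threshold:
        return True
    return high_frequency_energy(coefficients, cutoff) < energy_threshold


smooth = [[90, 1, 0, 0], [2, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
textured = [[90, 1, 6, 0], [2, 5, 0, 0], [0, 0, 3, 0], [0, 0, 0, 0]]

print(is_flat(4, smooth))     # True: high-frequency energy is 0
print(is_flat(4, textured))   # False: high-frequency energy is 70
print(is_flat(64, textured))  # True: block exceeds the size threshold
```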
[0046] In the event that the block is not flat, the intra block
process performs 170 the operation of the neural network on the
pixel data of the block to generate the segment of the current
feature map.
[0047] In the event that the block is flat, the intra block process
uses 190 the DC offset to generate the segment of the current
feature map. More particularly, the intra block process performs an
inverse frequency transform on the DC coefficient of the block in
order to generate the DC offset. The DC offset is then used to
generate each element of the segment of the current feature map.
Where the operation comprises a convolution operation, the process
may comprise using the DC offset to generate an activation value
for each kernel. The activation value may be generated by
multiplying the DC offset with a sum of weights of the kernel. The
activation value is then used for all elements of a respective
channel of the segment of the current feature map. So, for example,
if the activation value for the first kernel is 40 and the
activation for the second kernel is 25 then all elements of the
first channel of the segment will be 40, and all elements of the
second channel will be 25.
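The example above may be reproduced with the following sketch, which assumes a DC offset of 5 and kernels whose weights sum to 8 and 5 respectively. The function name, the channels-first layout and the 2x2 segment size are hypothetical.

```python
def flat_segment(dc_offset, kernel_weight_sums, height, width):
    """Fill each channel with the activation value of its kernel."""
    segment = []
    for weight_sum in kernel_weight_sums:
        activation = dc_offset * weight_sum
        segment.append([[activation] * width for _ in range(height)])
    return segment


segment = flat_segment(5, [8, 5], height=2, width=2)
print(segment[0])  # [[40, 40], [40, 40]]
print(segment[1])  # [[25, 25], [25, 25]]
```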
[0048] By identifying intra blocks that are flat and then using the
DC coefficient to generate activation values for all elements of
the segment of the current feature map, the feature map may be
generated in a computationally simpler way. As noted above, for a
trained neural network, the sum of the weights for each kernel may
be precalculated and stored, thus further simplifying the
computation.
[0049] Example methods have thus far been described in which a
reference feature map is stored and used when processing inter
blocks of the current frame. The reference feature map is not,
however, used when processing intra blocks of the current frame.
Accordingly, when processing only intra blocks, there is no
requirement to store a reference feature map. Additionally, whilst
the methods described thus far store a single reference feature
map, the method may comprise storing more than one reference
feature map for use in processing inter blocks. For example, the
method may comprise storing a first reference feature map that
corresponds to the output of the first convolution operation of
FIG. 2, and a second reference feature map that corresponds to the
output of the first pooling operation of FIG. 2. In the event that
the method determines 161 that the current block and the reference
block are very similar, the method may use 162 a segment of the
second reference feature map as the segment of the current feature
map. However, in the event that the method determines 161 that the
blocks are not very similar, the method may modify 163 a segment of
the first reference feature map, perform the subsequent activation
and pooling operations of the neural network on the modified
segment of the first reference feature map, and then use the
result as the segment of the current feature map.
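The use of two reference feature maps might be sketched as follows. The details of FIG. 2 are not restated here, so a ReLU activation and a 2x2 maximum pooling are assumed purely for illustration, as is modification of the first reference segment by a single compensation value; all names are hypothetical.

```python
def relu(segment):
    return [[max(0, v) for v in row] for row in segment]


def max_pool_2x2(segment):
    """Non-overlapping 2x2 maximum pooling on a list of lists."""
    return [[max(segment[r][c], segment[r][c + 1],
                 segment[r + 1][c], segment[r + 1][c + 1])
             for c in range(0, len(segment[0]) - 1, 2)]
            for r in range(0, len(segment) - 1, 2)]


def current_segment(first_ref, second_ref, very_similar, compensation=0):
    if very_similar:
        # Reuse the post-pooling reference segment unchanged.
        return second_ref
    # Otherwise modify the post-convolution reference segment, then
    # repeat only the subsequent activation and pooling operations.
    modified = [[v + compensation for v in row] for row in first_ref]
    return max_pool_2x2(relu(modified))


first_ref = [[-50, 3, 1, 2], [4, -60, 0, 5], [1, 1, 1, 1], [2, 2, -100, 2]]
second_ref = max_pool_2x2(relu(first_ref))

print(current_segment(first_ref, second_ref, very_similar=True))
print(current_segment(first_ref, second_ref, False, compensation=40))
```

Storing the second reference feature map avoids repeating the activation and pooling for very similar blocks, at the cost of additional memory.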
[0050] FIG. 7 is a flow chart of an example method that includes
many of the examples described above. The method 200 determines 220
whether a block of the current frame is an inter block or an intra
block.
[0051] In the event that the block is an inter block, the method
determines the similarity of the block with its respective
reference block. More particularly, the method determines 230 a
measure of the differences, ΔB, between the two blocks by
determining the sum of squares of the frequency coefficients of the
block of the current frame. The method then compares 240 the
measure, ΔB, against a first threshold, T_1. If the measure is less
than the first threshold, the block is deemed to be similar to the
reference block, otherwise the block is deemed to be dissimilar. If
the block is deemed to be dissimilar, the method performs 250 the
operation of the neural network on the pixel data of the current
block to generate the segment of the current feature map. If, on
the other hand, the block is deemed similar, the method compares
260 the measure, ΔB, against a second threshold, T_2. If the
measure is less than the second threshold, the block is deemed to
be very similar to the reference block and the method uses 270 the
reference segment as the segment of the current feature map.
Otherwise, the method compares 280 the measure, ΔB, against a third
threshold, T_3. If the measure is less than the third threshold,
the method modifies 290 the reference segment using the DC offset
in order to generate a modified reference segment, which is then
used as the segment of the current feature map. If, on the other
hand, the measure is greater than the third threshold, the method
performs the operation of the neural network on the residual data
of the current block to generate a residual segment. The reference
segment is then modified 300 using the residual segment, and the
resulting modified segment is then used as the segment of the
current feature map.
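The cascade of comparisons for an inter block may be summarized by the following sketch. The function name and returned labels are hypothetical, the thresholds are assumed to satisfy T_2 < T_3 < T_1, and the handling of a measure exactly equal to a threshold is an assumption since it is not specified above.

```python
def inter_block_action(measure, t1, t2, t3):
    """Select the inter block step of FIG. 7 from the measure of differences."""
    if measure >= t1:
        return "run_network_on_pixel_data"      # dissimilar, step 250
    if measure < t2:
        return "reuse_reference_segment"        # very similar, step 270
    if measure < t3:
        return "modify_with_dc_offset"          # step 290
    return "modify_with_residual_segment"       # step 300


for measure in (150, 5, 30, 70):
    print(measure, inter_block_action(measure, t1=100, t2=10, t3=50))
```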
[0052] In the event that the block is an intra block, the method
determines the flatness of the block by first comparing 310 the
size of the block against a threshold, T_SZ. If the size of the
block is greater than the threshold, the block is deemed to be
flat. Otherwise, the method determines 320 the sum of squares of
the high frequency coefficients, Σf_H^2, and compares
330 this against a threshold, T_FQ. If the sum of squares is less
than the threshold, the block is deemed to be flat; otherwise the
block is deemed to be not flat. If the block is
deemed to be flat, the method uses 340 the DC offset to generate
each element of the segment of the current feature map. If, on the
other hand, the block is deemed to be not flat, the method performs
350 the operation of the neural network on the pixel data of the
block to generate the segment of the current feature map.
[0053] FIG. 8 shows an example of a machine learning system for
implementing one or more of the methods described above. The system
10 comprises a first processing unit 20, a second processing unit
30, and a system memory 40. The first processing unit 20 may be a
central processing unit (CPU) and the second processing unit 30 may
be a neural processing unit (NPU).
[0054] The system memory stores the video (in whole or in part), as
well as the reference feature map and the current feature map. The
first processing unit 20 is responsible for performing the majority
of the steps of the methods described above. However, the second
processing unit is responsible for performing any operations of the
neural network. Accordingly, whenever the method calls for an
operation to be performed on the current block (be it pixel data or
residual data), the first processing unit 20 outputs an instruction
to the second processing unit 30. The instruction may comprise the
type of operation to be performed, the locations in the system
memory 40 of the input data (e.g. pixel data or residual data) and
the output data (e.g. the segment or the residual segment) and,
where applicable, the weights, along with other parameters relating
to the operation, such as the number of kernels, kernel size,
stride and/or padding. Employing a second processing unit to
perform the operations of the neural network allows the system to
exploit parallel processing. In
particular, the second processing unit may perform the operations
of the neural network whilst the first processing unit is analyzing
the frames of the video. Additionally, the two processing units may
be optimized for their specific tasks and workloads. Nevertheless,
the first processing unit could conceivably perform the operations
of the neural network, thus obviating the need for a second
processing unit.
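An instruction of the kind described might carry fields such as those in the sketch below. The class name, field names, types, default values and addresses are all hypothetical; no particular instruction format is defined above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NpuInstruction:
    """Hypothetical instruction passed from the first to the second unit."""
    operation: str                          # type of operation, e.g. "conv2d"
    input_address: int                      # location of pixel or residual data
    output_address: int                     # location for the (residual) segment
    weights_address: Optional[int] = None   # only where weights are applicable
    num_kernels: int = 1
    kernel_size: int = 3
    stride: int = 1
    padding: int = 0


instruction = NpuInstruction(
    operation="conv2d",
    input_address=0x1000,
    output_address=0x8000,
    weights_address=0x4000,
    num_kernels=16,
)
print(instruction)
```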
[0055] The methods described above may be used to reduce the
computation necessary to generate a feature map of a video frame.
The methods exploit the work previously performed by the encoder
when computing differences between blocks of adjacent frames. In
particular, the methods make use of the differences in order to
determine if a segment of a previously-generated feature map (i.e.
the reference feature map) may be reused. The methods described
above are suitable for use with existing video formats. However,
further efficiencies may be made by making changes at the encoder.
For example, the encoder may be configured to generate overlapping
blocks, and the degree of overlap may be defined by the receptive
field and stride length of the filter of the neural network
operation. As a result, a feature map may be generated without any
gaps or missing data around the segments. Additionally, the encoder
may determine if a block and its reference block are very similar,
similar or dissimilar (as employed in the methods above). The
encoder may then set a flag within the macroblock to indicate
whether a block is very similar, similar or dissimilar to its
reference block.
[0056] It is to be understood that any feature described in
relation to any one example may be used alone, or in combination
with other features described, and may also be used in combination
with one or more features of any other of the examples, or any
combination of any other of the examples. Furthermore, equivalents
and modifications not described above may also be employed without
departing from the scope of the accompanying claims.