U.S. patent application number 13/193,001, for real-time video coding/decoding, was published by the patent office on 2011-11-17.
The invention is credited to Alexey Martemyanov, Nikolay Terterov, and Alexander Zheludkov.
United States Patent Application 20110280306
Kind Code: A1
Zheludkov; Alexander; et al.
November 17, 2011
REAL-TIME VIDEO CODING/DECODING
Abstract
A video codec having a modular structure for encoding/decoding a
digitized sequence of video frames in a multi-core system is
described. The video codec comprises a memory unit; a
multithreading engine; and a plurality of control and task modules
organized in a tree structure, each module corresponding to a
coding operation. The modules communicate with each other by
control messages and shared memory. The control modules control all
coding logic and workflow, and lower level task modules perform
tasks and provide calculations upon receiving messages from the
control modules. The multithreading engine maintains the context
of each task and assigns at least one core to each task for
execution. The method of coding/decoding comprises an error
resilient algorithm.
Inventors: Zheludkov; Alexander (Saint-Petersburg, RU); Martemyanov; Alexey (Saint-Petersburg, RU); Terterov; Nikolay (Saint-Petersburg, RU)
Family ID: 40429744
Appl. No.: 13/193,001
Filed: July 28, 2011
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
12/205,482 (parent of the present application 13/193,001) | Sep 5, 2008 | 8,023,562
60/970,680 (provisional) | Sep 7, 2007 |
Current U.S. Class: 375/240.13; 375/240.16; 375/240.28; 375/E7.093; 375/E7.101; 375/E7.149
Current CPC Class: H04N 19/523 (20141101); H04N 19/85 (20141101); H04N 19/42 (20141101); H04N 19/56 (20141101)
Class at Publication: 375/240.13; 375/240.28; 375/240.16; 375/E07.101; 375/E07.149; 375/E07.093
International Class: H04N 7/26 (20060101) H04N007/26; H04N 7/28 (20060101) H04N007/28
Claims
1. A video codec having a modular structure for encoding/decoding a
digitized sequence of video frames in a multi-core system, the
video codec comprising: a memory unit; a multithreading engine; a
plurality of control modules configured to control all coding logic
and workflow; and a plurality of task modules configured to execute
tasks assigned by the multithreading engine; wherein the modules
are organized in a tree structure, each module corresponding to a
coding operation, the modules are configured to share the memory
and communicate with each other by control messages, and the
multithreading engine is configured to initialize tasks, to
maintain context of each task and to assign at least one of the
plurality of task modules to each task for execution, to send and
handle messages, and to provide synchronization and
communication.
2. The video codec of claim 1, wherein the tree structure is
flexible and changeable depending on a number of available modules
and tasks.
3. The video codec of claim 1, wherein the control modules are
configured to control all coding logic and workflow, and
performance of the task modules.
4. The video codec of claim 3, wherein the task modules include
separate modules for long lasting calculations.
5. The video codec having a modular structure according to claim 1,
wherein the plurality of task modules comprises: at least one
module configured to provide pre-processing temporal denoising; at
least one module configured to perform core motion estimation; at
least one module configured to perform distributed motion
estimation; at least one module configured to perform weighted
prediction; and at least one module configured to perform an error
resilience algorithm.
6. A method of encoding/decoding a digitized sequence of video
frames in a multi-core system, the method comprising: encoding the
sequence of video frames by performing core motion estimation and
weighted texture prediction; and decoding the encoded sequence of video
frames using high motion update and low motion update for error
resilience; wherein the error resilience comprises: defining each
macroblock as a high motion macroblock or a low motion macroblock;
defining for each macroblock of a current frame a set of
macroblocks including the macroblock of the current frame and at
least one macroblock located at the same position in previous
frames; and making a choice between INTRA and INTER coding mode for each macroblock of a high motion update frame and each macroblock of a low motion update frame.
7. The method of encoding/decoding of claim 6, wherein the choice between INTRA and INTER coding mode for the error resilience is based on the following: the INTRA coding mode is used for high motion update frames when cost_intra*K_H < cost_inter and the set C(D) contains at least one high motion macroblock; and the INTRA coding mode is used for low motion update frames when cost_intra*K_L < cost_inter and the set C(T_L) contains at least one high or low motion macroblock; where K_H is the INTRA cost scaling factor for high motion frames; K_L is the INTRA cost scaling factor for low motion frames; T_H is the high motion INTRA update period; T_L is the low motion INTRA update period; and D is the high motion INTRA update depth.
8. The method of encoding/decoding of claim 7, wherein a macroblock is defined as a high motion macroblock when S >= MVT_H and SD >= MVTD_H; and a macroblock is defined as a low motion macroblock when MVT_L <= S < MVT_H and MVTD_L <= SD < MVTD_H; where S is the sum of absolute values of the motion vector components of the macroblock; SD is the sum of absolute values of the motion vector prediction difference components of the macroblock; MVT_H is the high motion threshold for the sum of absolute values of the motion vector components; MVTD_H is the high motion threshold for the sum of absolute values of the motion vector prediction difference components; MVT_L is the low motion threshold for the sum of absolute values of the motion vector components; and MVTD_L is the low motion threshold for the sum of absolute values of the motion vector prediction difference components.
Description
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This is a division of U.S. application Ser. No. 12/205,482,
filed Sep. 5, 2008, which claims benefit of U.S. Provisional
Application No. 60/970,680, filed Sep. 7, 2007, both of which are
hereby incorporated herein by reference in their entirety.
FIELD OF THE INVENTION
[0002] This application relates to video codec (coder/decoder) systems, used preferably for broadcast applications, such as news cameras or studio video manipulation equipment, or as part of a computer system that involves real-time or near-real-time video compression and transfer: video conferencing systems, internet video broadcasting systems, video security and surveillance systems, etc.
BACKGROUND OF THE INVENTION
[0003] Video codecs are employed to convert an initial video sequence (a set of video images, also named pictures, or frames) into an encoded bitstream (a set of compressed video sequence binary data), and also to convert video sequence binary data produced by a video codec system into a reconstructed video sequence (a decoded set of video images, or reconstructed frames). Hereinafter, the terms "frame" and "picture" are assumed to be identical. It is known that video compression relies on two basic assumptions. The first is that human sensitivity to noise in the picture (frame) is highly dependent on the frequency of the noise. The second is that in a picture sequence every picture has a lot in common with the preceding picture. In a picture, large objects result in low spatial frequencies, whereas small objects result in high spatial frequencies. The noise detected by human vision is mostly at low spatial frequencies. The data may be compressed by sending only the difference between one picture and the next, and by raising the noise where it cannot be detected, thus shortening the length of data words. A video sequence contains a significant amount of statistical and subjective redundancy within and between pictures that can be reduced by data compression techniques to make its size smaller. For still pictures (as in the JPEG format), intra-frame or spatial redundancy reduction is used, which treats each picture individually, without reference to any other picture. In intra-coding, the main step is to perform a spatial frequency analysis of the image using the known technique of the Discrete Cosine Transform (DCT). The DCT converts input pixels into a form in which the redundancy can be identified. The frame is broken up into rectangular areas called macroblocks and converted a macroblock at a time. A typical two-dimensional block (2D-block) is 8×8 pixels. The 2D-DCT converts the block into a block of 64 coefficients. A coefficient is a number which describes the amount of a particular spatial frequency which is present. The coefficients are then zig-zag scanned, weighted and run-length coded.
[0004] For moving pictures, the mode of inter-coding is known to be
used to exploit redundancy between pictures, which gives a higher
compression factor than the intra-coding. The "difference" picture
is produced by subtracting every pixel in one picture from a pixel
in the same position in the next picture. The difference picture
may then be compressed using intra-coding with DCT.
[0005] In the case of significant movement between the pictures
resulting in large differences, it is known to use motion
compensation (MC), which allows a higher compression factor.
According to the known MC technique, at the coder, successive
pictures are compared and the shift of an area from one picture to
the next is measured to produce motion vectors. The codec attempts
to model the object in the new picture from the previous picture
using motion vectors. Each macroblock has its own motion vector
which applies to the whole block. The vector from the previous
picture is coded and vector differences are sent. Any discrepancies
are eliminated by comparing the model with the actual picture. The
codec sends the motion vectors and the discrepancies. The decoder
does the inverse process shifting the previous picture by the
vectors and adding the discrepancies to produce the next picture.
The quality of a reconstructed video sequence is measured as the total deviation of its pixels from the initial video sequence. The increased use of real-time digital video communication applications, such as video conferencing and video telephony, creates an ever-increasing demand for high video quality.
[0006] In view of the increasing use of real-time and near-real-time video compression, and the arrival of new standards improving the quality of real-time video communication, there is a need for new effective algorithms applicable to different types of video codecs, which can be used in video encoders complying with ITU-T Recommendation H.264, also known as MPEG-4 Part 10, or AVC (ISO/IEC 14496-10), etc.
[0007] Most known block-based video coding systems, such as MPEG-4 or ITU-T H.264, use coding algorithms with the common steps
of dividing each video frame into blocks of pixels (pels);
predicting the block pixels using "inter" prediction, or "intra"
prediction technique; transforming texture prediction error blocks;
predicting the motion vectors and calculating the motion vector
prediction differences; and coding the texture prediction error quantized transform coefficients, the motion vector prediction differences, the intra prediction types and the auxiliary frame data.
[0008] The idea of pre-processing a motion picture sequence using the pixels of the current and previous frames has been treated repeatedly in the prior art. However, most such algorithms suffer either from possible over-smoothing, due to the application of spatial filters together with temporal ones, or from very high complexity. The advantages of the proposed method are its relatively low complexity (depending mostly on the motion estimation and on the way the block edges are smoothed) and its efficient denoising while preserving good original image detail (especially for high noise).
[0009] The possibility of creating error resilient streams is also highly important for industrial codecs used in broadcasting, streaming and other applications operating in error-prone environments. One universal class of error resilient stream creation methods is the intra update methods (also called intra refresh methods). These methods are based on inserting some extra INTRA macroblocks inside the inter-coded frames. These INTRA macroblocks should use for texture prediction only the reconstructed texture of the previously coded macroblocks of the current frame which are INTRA coded as well. Thus, all the INTRA macroblocks of each frame will be decoded correctly even if the texture of the previously decoded frames is lost or corrupted. There are several conventional approaches to the INTRA update method.
[0010] The simplest approach is to insert the INTRA macroblocks in
random positions of the frame with the probability corresponding to
the expected loss rate. According to another approach, the INTRA macroblocks are inserted into the current frame according to a pre-specified geometric scheme that changes from frame to frame by some pre-specified rule. The main drawback of such methods is that they lead to enormous bitrate growth.
[0011] Another class of INTRA update schemes is highly dependent on the current frame texture and motion. In these methods the INTRA macroblocks are inserted either in the areas of highest activity, determined by the average motion vector magnitude, or using a loss-aware rate-distortion optimization scheme under the assumption that the current macroblock may be lost with a given probability. Such methods are described, for example, in the following papers:
[0012] Yao Wang, Stephan Wenger, Jiangtao Wen, and Aggelos K.
Katsaggelos, "Review of Error Resilient Coding Techniques for
Real-Time Video Communication", IEEE Signal Processing Magazine,
vol. 17, no. 4, pp. 61-82, July 2000;
[0013] R. Zhang, S. L. Regunathan and K. Rose, "Video Coding with
Optimal Inter/Intra Mode Switching for Packet Loss Resilience,"
IEEE Journal on Selected Areas in Communications, Special Issue on
Error-Resilient Image and Video Transmission, pp. 966-976, vol. 18,
no. 6, June 2000; and
[0014] Minyoung Kim, Hyunok Oh, Nikil Dutt, Alex Nicolau, Nalini
Venkatasubramanian, "PBPAIR: An Energy-efficient Error-resilient
Encoding Using Probability Based Power Aware Intra Refresh", ACM
SIGMOBILE Mob. Comput. Commun. Rev. 10(3): 58-69, 2006.
[0015] The drawback of these schemes is that they do not take into account that a high potential reconstruction error, caused by the loss of the current macroblock or of the previous frame texture, will necessarily increase the reconstruction error of the inter macroblocks of the next frames which refer to the current macroblock.
[0016] In most encoders, which deal with different motion
compensation block sizes, a separate motion estimation procedure is
used for each block size. This increases the complexity of the
motion estimation algorithm and could present a problem in
providing efficient interconnections between the motion vectors
used in texture blocks of different sizes.
[0017] The new H.264 Standard improved the accuracy of the motion
vector calculation using a quarter-pel-accurate motion compensation
form. However, during motion estimation and motion compensation a
quite complicated interpolation procedure is needed for calculating
the pixel values with non-integer coordinates. In order to provide
an adequate motion estimation using known methods, it is necessary
either to store in memory a 4-times-zoomed frame, or to perform a
non-integer pixel interpolation during the motion estimation. Both
methods have their disadvantages. In the first case the memory storage required for reference frames is increased by a factor of 16. The
second method increases the algorithm computational complexity and
leads to an additional CPU load.
[0018] The upcoming high-performance systems will integrate tens of
multithreaded processor cores on a single chip, resulting in
hundreds of concurrent threads sharing system resources. The proposed new modular video encoding and decoding design fits much better into such multi-core systems. It is based on splitting all coding operations into separate "tasks". Such an architecture makes it possible to load a big number of cores even for one-slice-per-picture coding. Another benefit of the proposed design is the high flexibility of system integration. One can easily construct any required system (encoder, decoder, transcoder, etc.) from an appropriate set of modules. Moreover, such encoder system assembly can be done dynamically, depending on the available resources, to control load balancing for maximum encoding quality.
SUMMARY
[0019] The proposed video codec comprises a plurality of modules
organized in a tree structure, each module corresponding to a
coding operation. The codec has a memory unit and a multithreading
engine. The task modules communicate with each other by control
messages and shared memory. The multithreading engine initializes tasks, sends and handles messages, and provides synchronization and communication. The tree structure of the codec is flexible and changeable depending on the number of available cores and tasks. The plurality of modules comprises control modules, which control all coding logic and workflow, and lower level task modules, which perform operations and provide calculations upon receiving messages from the control modules. Any task can be executed on any one or more available cores. In order to improve efficiency, long-lasting calculations are performed by separate modules. The multithreading engine maintains the context of each task and can assign its execution to any available module, or to several modules. Task-to-module assignment can be controlled by a high level control module.
[0020] A method of encoding/decoding a digitized sequence of video frames in a multi-core system comprises pre-processing temporal de-noising; core motion estimation; distributed motion estimation; determining whether weighted texture prediction should be used and, if so, defining the parameters for weighted texture prediction; and decoding the encoded sequence of video frames, creating error resilient streams using high motion update and low motion update.
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] FIG. 1 shows a block-diagram of a video codec having a
modular architecture for a multi-core system.
[0022] FIG. 2 shows a block-diagram of the motion estimation
algorithm of the coding/decoding method.
[0023] FIG. 3 shows a block-diagram of a simple inter frame
encoding unit.
DETAILED DESCRIPTION
[0024] The encoding system is designed to transform an input video
sequence into a compressed bitstream, so that the video sequence
can be restored from the bitstream. The proposed video encoding and
decoding system has a modular architecture for massive parallelism.
The upcoming high-performance systems will integrate tens of
multithreaded processor cores on a single chip, resulting in
hundreds of concurrent threads sharing system resources. The proposed new modular video encoding and decoding design fits much better into such multi-core systems. It is based on splitting all coding
operations into separate "tasks" and providing a plurality of
modules to control and perform the tasks. The system comprises a
plurality of modules, including task modules and control modules.
The modules communicate with each other by control messages and
shared memory. Access to the shared memory is restricted, minimized
and executed via a unified interface. The modules are organized
into the "tree graph" structure. This tree structure can be changed
dynamically, depending on the number of available cores and tasks.
Messages are sent from a higher level, or parent, to a lower level,
or child, and back from child to parent.
[0025] All hard calculations are done in the task modules. They start work after receiving a message from a parent control module, and send a message back to the parent when finished. The control modules can receive messages from parent modules and child modules, and can send messages to parent and child modules. The control modules control all coding logic and workflow. They should react to each message very quickly. If the reaction to some message requires a long calculation, such a calculation should be organized as a separate task module.
[0026] Task initialization, the sending and handling of messages, synchronization and other communication are implemented via a special multithreading engine. This engine is the only part of the source code which differs from platform to platform. The rest of the source code remains unchanged when porting the system from one platform to another.
[0027] There is no hard connection between the tasks and the
hardware cores. The multithreading engine maintains the context of
each task and can execute it on any available core. Moreover, it is
possible that one task module utilizes several cores for the
particular task execution. Task-to-core assignment in
multithreading can be controlled by a high level module. The
preferred cores for each task can be defined at initialization.
[0028] Such an architecture makes it possible to load a big number of cores even for one-slice-per-picture coding. Another benefit of this design is the high flexibility of system integration. One can easily construct any required system (encoder, decoder, transcoder, etc.) from an appropriate set of modules. Moreover, such "system assembly" in the encoder can be done dynamically, depending on the available resources, to control load balancing for maximum encoding quality.
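As an illustration only, the following minimal C++ sketch shows how such a parent/child message flow might be organized. The Message structure, the Mailbox queue and the message types are illustrative assumptions, not part of the patented design:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

struct Message { int type; int param; };     // illustrative payload

class Mailbox {                              // per-module message queue
    std::queue<Message> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void post(Message msg) {
        { std::lock_guard<std::mutex> lk(m); q.push(msg); }
        cv.notify_one();
    }
    Message wait() {                         // block until a message arrives
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return !q.empty(); });
        Message msg = q.front(); q.pop();
        return msg;
    }
};

struct Module {
    Module* parent = nullptr;
    std::vector<Module*> children;           // the tree structure
    Mailbox inbox;
};

enum { MSG_START_TASK, MSG_TASK_DONE, MSG_SHUTDOWN };

// A task module: starts work on a message from its parent control
// module and reports back to the parent when finished.
void taskModuleLoop(Module* self) {
    for (;;) {
        Message msg = self->inbox.wait();
        if (msg.type == MSG_SHUTDOWN) return;
        if (msg.type == MSG_START_TASK) {
            // ... long calculation on shared memory goes here ...
            self->parent->inbox.post({MSG_TASK_DONE, msg.param});
        }
    }
}
```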
[0029] An example of a task tree of such an architecture, used for H.264 video encoding, is shown in FIG. 1. The system comprises a
Memory unit 20 and a Multithreading Engine 30. The plurality of
modules includes control modules (C), task modules (T) and
calculation modules (TC).
The control modules include:
[0030] Main Encoder control module 10;
[0031] Preproc control module 40, which controls pre-processing
operations;
[0032] EncodeFrame control modules 42-44, each module controls at
least one of the frame encoding tasks. It is possible to activate
several simultaneously running tasks of this type for several
frames.
[0033] EncodeSlice 45-47 control modules, each controls one slice
encoding task. Each EncodeFrame module can activate several
simultaneously running tasks of this type for several slices of one
frame.
The task modules include:
[0034] Input task module 50--receives raw video frames and stores
them into the shared memory;
[0035] Output task module 52--outputs the encoded sequence when ready;
[0036] PreprocLine 54-56--each task module pre-processes one line of macroblocks. It is possible to activate several simultaneously running task modules of this type for one frame.
[0037] DeblockLine 61-63--each task module performs deblocking of
one line of macroblocks. Each EncodeFrame module can activate
several simultaneously running tasks of this type for several lines
in one frame.
[0038] InterpLine 64-66--each task module performs half-pel
interpolation of one line of macroblocks. Each EncodeFrame module
can activate several simultaneously running task modules of this
type for several lines in one frame.
[0039] EncMbLine 70-72--a task module of this type performs all encoding calculations, except entropy encoding, for one line of
macroblocks. Each EncodeSlice module can activate several
simultaneously running task modules 70-72 of this type for several
lines in one slice.
[0040] EntrEncMbLine 80--this task module performs entropy encoding
for one line of macroblocks. Each EncodeSlice task maintains only
one task module of this type and activates it line by line.
[0041] The proposed method of encoding/decoding a video sequence in a multi-core system comprises:
[0042] pre-processing purely temporal denoising;
[0043] encoding the video sequence, including: [0044] core motion estimation; [0045] distributed motion estimation; [0046] determining the use of weighted texture prediction from the difference between the normalized density functions for the luminance histograms of an original frame and of a reference frame and, if used, [0047] calculating the parameters for weighted texture prediction; and
[0048] decoding the encoded sequence of video frames using high motion update and low motion update for error resilience while maintaining sequence compression quality.
[0049] Presented below are the preferred algorithms for each of the
method steps.
Purely Temporal De-Noising for Video Sequence Visual Quality
Improvement
[0050] The algorithm comprises the following general steps: [0051]
1) Motion estimation--creating a frame P of texture prediction for
the current frame from a previous frame and the corresponding
motion vectors. [0052] 2) Smoothing (deblocking) of the prediction
picture. [0053] 3) Modifying the current frame based on the current
frame pixels and the prediction frame pixels.
[0054] 1. Motion estimation. The process starts with performing motion estimation. Hereafter it is assumed that the motion estimation is performed over blocks of size 8×8. It is also assumed that the block-matching criterion for the motion estimation algorithm is as follows:

mv_cost = SAD + λ*(log2(abs(MVDx)) + log2(abs(MVDy))),

where SAD is the sum of the absolute differences between the pixels of the current block and the pixels of the reference block; λ is a pre-defined constant (one of the denoising strength parameters, see below); and MVDx and MVDy are the differences between the current motion vector components and some prediction motion vector derived from the motion vectors calculated before for the neighboring blocks (for example, one can take for such a prediction the median values of the corresponding components of the motion vectors of the upper, left and upper-right blocks).
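For illustration, a small C++ sketch of this criterion is given below. The +1 guard inside the logarithm is an assumption (the text does not specify how a zero vector difference is handled), and sad8x8/mvCost are hypothetical names:

```cpp
#include <cmath>
#include <cstdlib>

// Sum of absolute differences over an 8x8 block; stride is the
// picture width in pixels.
int sad8x8(const unsigned char* cur, const unsigned char* ref, int stride) {
    int sad = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            sad += std::abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

// mv_cost = SAD + lambda*(log2(|MVDx|) + log2(|MVDy|)); the +1
// inside log2 avoids log2(0) and is our assumption.
double mvCost(int sad, int mvdx, int mvdy, double lambda) {
    return sad + lambda * (std::log2(1.0 + std::abs(mvdx)) +
                           std::log2(1.0 + std::abs(mvdy)));
}
```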
[0055] The results of the motion estimation procedure are:
[0056] The set of motion vectors for all 8×8 blocks of the picture.
[0057] The values of mv_cost corresponding to these motion vectors.
[0058] The texture prediction blocks of size 8×8, defined by the reference picture and the motion vectors, and the prediction picture P composed from these prediction blocks.
[0059] A sufficiently large sum of the mv_cost values over all motion vectors of the picture signals a scene change. In this case one can either change the reference picture from the previous picture to the successive one and repeat the motion estimation, or perform no denoising for the current picture at all. Since situations of multiple-picture scene change are quite uncommon for typical video sequences, this cannot notably affect the denoising visual effect.
[0060] 2. Prediction picture deblocking. After the motion estimation and the construction of the prediction picture, smoothing of the vertical and horizontal boundaries of the 8×8 blocks is performed. This smoothing should be sufficiently strong and may be quite rough. Thus, for example, the following method for smoothing the block boundaries is quite suitable. Let us denote as P(x,y) a pixel of the prediction picture with the coordinates (x,y), and as P_res(x,y) the same pixel of the prediction picture after its smoothing. Consider the 8×8 block with the upper-left pixel coordinates (x,y). If the total boundary difference between this block and its left neighbor is sufficiently large, then its left boundary may be smoothed by the linear transform:

P_res(x+j, y+l) = (P(x-4, y+l)*(4-j) + P(x+3, y+l)*(4+j))/8, for all j = 0, . . . 3 and l = 0, . . . 7.

[0061] Otherwise, if this boundary difference is sufficiently small, then its left boundary may be smoothed by the linear transform:

P_res(x+j, y+l) = (P(x+j, y+l)*3 + P(x-1, y+l))/4, for all j = 0, 1 and l = 0, . . . 7.

In integer arithmetic, with y[-4] standing for P(x-4, y+l) and y[3] for P(x+3, y+l), the strong transform above becomes:

y[0]=(y[-4]+y[3]+1)>>1;
y[1]=(3*y[-4]+5*y[3]+4)>>3;
y[2]=(y[-4]+3*y[3]+2)>>2;
y[3]=(y[-4]+7*y[3]+4)>>3;
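A C++ sketch of this left-boundary smoothing might look as follows; the strong/weak selection flag and the function name smoothLeftBoundary are assumptions, since the text leaves the boundary-difference thresholds unspecified:

```cpp
// Smooth the left boundary of one 8x8 prediction block. p points at
// the block's upper-left pixel (x, y); stride is the picture width.
// 'strong' selects between the two linear transforms above.
void smoothLeftBoundary(unsigned char* p, int stride, bool strong) {
    for (int l = 0; l < 8; ++l) {
        unsigned char* row = p + l * stride;
        if (strong) {
            int a = row[-4], b = row[3];   // y[-4] and y[3] above
            row[0] = (unsigned char)((a + b + 1) >> 1);
            row[1] = (unsigned char)((3 * a + 5 * b + 4) >> 3);
            row[2] = (unsigned char)((a + 3 * b + 2) >> 2);
            row[3] = (unsigned char)((a + 7 * b + 4) >> 3);
        } else {
            // weak transform: P(x+j) = (P(x+j)*3 + P(x-1))/4, j = 0, 1
            row[0] = (unsigned char)((row[0] * 3 + row[-1]) >> 2);
            row[1] = (unsigned char)((row[1] * 3 + row[-1]) >> 2);
        }
    }
}
```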
[0062] 3. Denoising. The final denoising procedure is performed by modifying all the pixels of the current frame as follows:

I = I - sign(I-P)*F(abs(I-P)),

where I is the pixel of the current frame; P is the pixel at the same position in the prediction frame after the deblocking procedure described above; abs is the absolute value function; and sign is the sign function (sign(x) = -1 if x < 0; sign(x) = 0 if x = 0; and sign(x) = 1 if x > 0). Each pixel is thus pulled toward its motion-compensated prediction by the amount F(abs(I-P)). Besides, the denoising is not performed for those 8×8 blocks of the current frame for which the value of mv_cost exceeds some pre-defined threshold MV_COST_THR. Since such blocks usually correspond to weakly predictable high motion areas, their denoising may be omitted without notable visual quality decrease.
[0063] The main algorithm parameters, which control the strength of the noise suppression, are T_0, T_1, T_2, λ, and MV_COST_THR.
[0064] One of the key issues of the method is choosing the discrete-time function F. For its derivation, the values T_0, T_1, T_2 should be set up so that 0 ≤ T_0 < T_1 < T_2. The function should satisfy the following requirements:
[0065] F(D) is sufficiently smooth inside the interval [0; T_2];
[0066] F(0) = 0;
[0067] F(D) ≥ D/2 and F(D) ≤ D, if D ∈ [0; T_0);
[0068] F(D) = D/2, if D ∈ [T_0; T_1);
[0069] F(D) ≤ D/2 and F(D) ≥ 0, if D ∈ [T_1; T_2); and
[0070] F(D) = 0, if D ≥ T_2.
[0071] These properties of the function F(D), together with a proper choice of the values T_0, T_1, T_2, provide a good smoothing effect with minimal added visual artifacts.
[0072] To complete the description of the algorithm, given below are an instance of the function F(D) providing good noise suppression results and example values for the parameters T_0, T_1, T_2, λ, MV_COST_THR. Function F(D):

F(D) = D*(1 - D^2*(5*T_0 - 3*D)/(4*T_0^3)), if D < T_0;
F(D) = D/2, if D ∈ [T_0; T_1);
F(D) = D/2 + D*((D - T_1)^3 - 2*(T_2 - T_1)*(D - T_1)^2 - (T_2 - T_1)^2*(D - T_1))/(4*(T_2 - T_1)^3), if D ∈ [T_1; T_2); and
F(D) = 0, if D ≥ T_2.

To avoid this computation for every pixel of the image, a pre-calculated table should be created containing F for all possible values of abs(I-P). For the typical color bit depth of 8, this table consists of 256 one-byte numbers.
[0073] As an example, the values of T_0, T_1, T_2, λ, MV_COST_THR for the case of strong noise may be chosen as T_0 = 16, T_1 = 32, T_2 = 48, λ = 24, MV_COST_THR = 6000.
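A possible C++ sketch of the table construction and the per-pixel update, assuming 8-bit luminance (buildFTable and denoisePixel are illustrative names):

```cpp
#include <cmath>

static unsigned char Ftab[256];   // pre-calculated F(|I-P|) table

// Fill the table from the piecewise definition of F(D) above.
void buildFTable(double T0, double T1, double T2) {
    for (int d = 0; d < 256; ++d) {
        double D = d, f;
        if (D < T0)
            f = D * (1.0 - D * D * (5 * T0 - 3 * D) / (4 * T0 * T0 * T0));
        else if (D < T1)
            f = D / 2;
        else if (D < T2) {
            double t = D - T1, w = T2 - T1;
            f = D / 2 + D * (t * t * t - 2 * w * t * t - w * w * t)
                          / (4 * w * w * w);
        } else
            f = 0;
        Ftab[d] = (unsigned char)(f + 0.5);
    }
}

// I = I - sign(I-P)*F(|I-P|): pull the pixel toward its prediction.
// Since F(D) <= D, the result always stays between P and I.
unsigned char denoisePixel(unsigned char I, unsigned char P) {
    int d = I - P;
    if (d > 0) return (unsigned char)(I - Ftab[d]);
    if (d < 0) return (unsigned char)(I + Ftab[-d]);
    return I;
}
```

For example, buildFTable(16, 32, 48) matches the strong-noise parameters above; the table is then applied to every pixel of blocks whose mv_cost does not exceed MV_COST_THR.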
Core Motion Estimation
[0074] FIG. 2 shows a block diagram of the core motion estimation algorithm. Consider a frame F and a rectangular block of width W and height H inside this frame. Hereafter W and H will be considered constant values, and B(F, X_INIT, Y_INIT) will denote the block inside the frame F whose left-top corner has the coordinates X_INIT, Y_INIT. Consider another frame R, called a reference frame, a block B(R, X, Y), and some matching criterion Q(F, R, X, Y), which is the measure of similarity between the pixels of the block B(F, X_INIT, Y_INIT) and the block B(R, X, Y). The smaller the value of the function Q, the more similar the blocks. Hereafter the coordinates X_INIT and Y_INIT will be assumed constant.
[0075] The aim of the algorithm is to find the coordinates X, Y inside the reference frame R which provide a good correspondence between the blocks B(F, X_INIT, Y_INIT) and B(R, X, Y), minimizing the matching criterion Q(F, R, X, Y). In particular, if the block B(F, X_INIT, Y_INIT) contains some object whose position moves from the reference frame to the current frame, then the values X_INIT-X and Y_INIT-Y should estimate well the motion of the object from frame R to frame F.
[0076] Let us define the search range value W (distinct from the block width above), which is supposed to be a power of 2, and the following values, which will be used as thresholds for the function Q(F, R, X, Y): TE, the threshold for exiting the algorithm after the candidate check; T2, the threshold for halving the search range during the logarithmic search; and T4, the threshold for dividing the search range by 4 during the logarithmic search.
[0077] Suppose we have already found the best values of (X, Y) for some of the neighboring blocks of the block B(F, X_INIT, Y_INIT), as well as some other pairs (X, Y) which are likely to be close to the coordinate pair we are looking for. Let us call all these N pairs (X_j, Y_j), where j = 0, 1 . . . N, candidates.
[0078] The first step of the algorithm is to calculate the values of Q(F, R, X_j, Y_j) for all candidates. If the minimal value of Q(F, R, X_j, Y_j), j = 0, 1, . . . N, is less than TE, then the algorithm comes to an end, and the pair (X_MIN, Y_MIN) providing the minimum of Q(F, R, X_j, Y_j) is considered the result of the algorithm.
[0079] Otherwise, denote this minimal value Q(F, R, X_MIN, Y_MIN) as Q_MIN and perform the following procedure.
[0080] Step 1. If Q_MIN < T2 then W = W/2; else if Q_MIN < T4 then W = W/4. If W < 1, exit the algorithm: the result of the algorithm is the pair (X_MIN, Y_MIN).
[0081] Step 2. Define four pairs:

(Dx_0, Dy_0) = (-W, 0),
(Dx_1, Dy_1) = (0, -W),
(Dx_2, Dy_2) = (W, 0), and
(Dx_3, Dy_3) = (0, W).

Calculate the values of Q(F, R, X_MIN + Dx_0, Y_MIN + Dy_0), Q(F, R, X_MIN + Dx_1, Y_MIN + Dy_1), Q(F, R, X_MIN + Dx_2, Y_MIN + Dy_2), and Q(F, R, X_MIN + Dx_3, Y_MIN + Dy_3). Find the pair (Dx_j, Dy_j) (j = 0 . . . 3) providing the minimal value of Q(F, R, X_MIN + Dx_j, Y_MIN + Dy_j). Denote this pair as (Dx_MIN, Dy_MIN). If Q(F, R, X_MIN + Dx_MIN, Y_MIN + Dy_MIN) >= Q_MIN then go to Step 5.
[0082] Step 3. Calculate the values Q(F, R, X_MIN + j*Dx_MIN, Y_MIN + j*Dy_MIN) for increasing j = 0, 1, 2 . . . J until the following inequality (I) is true or the boundary of the frame is reached:

Q(F, R, X_MIN + j*Dx_MIN, Y_MIN + j*Dy_MIN) <= Q(F, R, X_MIN + (j+1)*Dx_MIN, Y_MIN + (j+1)*Dy_MIN), (I)

where J is the minimal nonnegative integer value of j for which the inequality above is true.
[0083] Step 4. Set X_MIN = X_MIN + J*Dx_MIN, Y_MIN = Y_MIN + J*Dy_MIN, and Q_MIN = Q(F, R, X_MIN, Y_MIN), where J is obtained in the previous step.
[0084] Step 5. Set W = W/2. If W ≠ 1 then return to Step 1. Otherwise the result of the algorithm is the one of the coordinate pairs (X_MIN, Y_MIN), (X_MIN + Dx_MIN, Y_MIN + Dy_MIN), (X_MIN - Dx_MIN, Y_MIN + Dy_MIN), (X_MIN + Dx_MIN, Y_MIN - Dy_MIN), (X_MIN - Dx_MIN, Y_MIN - Dy_MIN) which provides the minimum of the function Q(F, R, X, Y).
[0085] Note 1. The arguments X, Y of the function Q(F, R, X, Y) may be measured not only in integer coordinates but also in half-pixels, quarter-pixels, etc. In the case of non-integer block coordinates it is assumed that such block pixels can be retrieved from the frame R pixels by some pre-defined interpolation procedure.
[0086] Note 2. One can see that the values of Q(F, R, X, Y) are sometimes calculated more than once for the same pair (X, Y). For example, at least three of the five coordinate pairs of Step 5 were used in Step 2 and Step 3. Evidently, such repeated calculations should be identified and eliminated in a practical implementation of the method.
[0087] Note 3. The initial value of W, the quality function Q(F, R, X, Y) and the values TE, T2, T4 are the parameters of the algorithm, controlling the trade-off between motion estimation speed and motion estimation quality.
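The following C++ sketch outlines a simplified version of this search. Frame-boundary checks, the elimination of repeated Q evaluations (Note 2) and the final five-point refinement of Step 5 are omitted, and the function and parameter names are illustrative:

```cpp
#include <climits>
#include <vector>

struct MV { int x, y; };

// Q is any block-matching criterion callable as Q(x, y) -> int (e.g. SAD).
// 'candidates' must be non-empty; W0, TE, T2, T4 are the tuning
// parameters of Note 3.
template <typename QFunc>
MV coreMotionSearch(QFunc Q, const std::vector<MV>& candidates,
                    int W0, int TE, int T2, int T4) {
    // Candidate check: best of the predicted vectors.
    MV best = candidates[0];
    int qmin = Q(best.x, best.y);
    for (const MV& c : candidates) {
        int q = Q(c.x, c.y);
        if (q < qmin) { qmin = q; best = c; }
    }
    if (qmin < TE) return best;               // early exit

    int W = W0;                               // search range, a power of 2
    for (;;) {
        if (qmin < T2) W /= 2;                // Step 1: shrink the range
        else if (qmin < T4) W /= 4;
        if (W < 1) return best;

        // Step 2: the four axis-aligned points at distance W.
        const MV dirs[4] = {{-W, 0}, {0, -W}, {W, 0}, {0, W}};
        MV bestDir = dirs[0];
        int qdir = INT_MAX;
        for (const MV& d : dirs) {
            int q = Q(best.x + d.x, best.y + d.y);
            if (q < qdir) { qdir = q; bestDir = d; }
        }
        if (qdir < qmin) {
            // Steps 3-4: walk along the winning direction while improving.
            int j = 1;
            while (Q(best.x + (j + 1) * bestDir.x,
                     best.y + (j + 1) * bestDir.y) <
                   Q(best.x + j * bestDir.x, best.y + j * bestDir.y))
                ++j;
            best.x += j * bestDir.x;
            best.y += j * bestDir.y;
            qmin = Q(best.x, best.y);
        }
        W /= 2;                               // Step 5 (refinement omitted)
        if (W < 1) return best;
    }
}
```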
Distributed Motion Estimation
[0088] The motion estimation algorithm described in the previous section can be applied directly for encoding video frames. We call it here the "core" algorithm. It provides good results for video of QCIF, CIF and D1 resolution. However, it can be improved for high-definition video, where the dimension of one block is essentially less than the dimension of the whole picture. A similar improvement may be used when the reference frames are separated by several non-reference frames. For such situations the following high level additions to the motion estimation algorithm can be applied.
[0089] Consider successive pictures to be coded as F_1, F_2, F_3, F_4 . . . , where for the pictures F_2, F_3, F_4 only the picture F_1 can be used as a reference. So we need only the F_2-to-F_1, F_3-to-F_1 and F_4-to-F_1 motion vectors. The block sizes for motion estimation can be 16×16, 16×8, 8×16 and 8×8.
[0090] The following algorithm is applied:
[0091] 1. Calculate and store the spatially reduced pictures f_1, f_2, f_3, and f_4. The reduction coefficient can be 2 or 4 in each spatial dimension.
[0092] 2. Select the reduced block size in the reduced pictures as 8×8.
[0093] 3. Calculate and store motion vectors for the following reduced picture pairs: f_2 to f_1, f_3 to f_2 and f_4 to f_3. This motion estimation is done using the core motion estimation algorithm described above.
[0094] 4. Having the motion vector sets f_2-to-f_1 and f_3-to-f_2, calculate the f_3-to-f_1 motion vectors as follows (see the sketch after this list):
[0095] For each block b_3 of f_3, let (p_x, p_y) be its center point and (m_x, m_y) its motion vector from the f_3-to-f_2 set. Select the block of the picture f_2 in which the pixel (p_x + m_x, p_y + m_y) resides. Let (n_x, n_y) be the motion vector from the f_2-to-f_1 set for this block. Assign (n_x + m_x, n_y + m_y) as the motion vector of the block b_3 in the f_3-to-f_1 motion vector set.
[0096] 5. Similarly to step 4, having the f_3-to-f_1 and f_4-to-f_3 motion vector sets, calculate the f_4-to-f_1 motion vectors:
[0097] For each block b_4 of f_4, let (p_x, p_y) be its center point and (m_x, m_y) its motion vector from the f_4-to-f_3 set. Select the block of the picture f_3 in which the pixel (p_x + m_x, p_y + m_y) resides. Let (n_x, n_y) be the motion vector from the f_3-to-f_1 set for this block. Assign (n_x + m_x, n_y + m_y) as the motion vector of the block b_4 in the f_4-to-f_1 motion vector set.
[0098] 6. For each block B_i of picture F_i (i = 2, 3, 4), take the motion vector from the appropriate block of the reduced picture and scale it back according to the picture reduction coefficient in each dimension. This will be the coarse full-pel motion vector for this block. Apply the core motion estimation algorithm to find the fine full-pel motion vector, using this coarse motion vector as an additional candidate and reducing the search range to 2 or 4, depending on the picture reduction coefficients.
[0099] 7. For each block B_i of picture F_i (i = 2, 3, 4), apply the core motion estimation algorithm to find the quarter-pel motion vector, using the fine full-pel motion vector as a starting point and a search range of 2.
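Steps 4 and 5 both chain two motion vector sets through the block in which the displaced block center lands. A C++ sketch of this chaining for the f_3-to-f_1 case is given below; the clamping of the displaced center to the picture area is an assumption, as the text does not say how out-of-picture centers are treated:

```cpp
#include <vector>

struct MV { int x, y; };

// Chain f3->f2 vectors (mv32) with f2->f1 vectors (mv21) into f3->f1
// vectors, for reduced pictures of rw x rh blocks of size 8x8.
std::vector<MV> chainVectors(const std::vector<MV>& mv32,
                             const std::vector<MV>& mv21,
                             int rw, int rh) {
    std::vector<MV> mv31(rw * rh);
    for (int by = 0; by < rh; ++by) {
        for (int bx = 0; bx < rw; ++bx) {
            MV m = mv32[by * rw + bx];
            // Center of this block, displaced by its f3->f2 vector.
            int px = bx * 8 + 4 + m.x;
            int py = by * 8 + 4 + m.y;
            // Clamp to the picture area (our assumption), then find
            // the f2 block in which the displaced center resides.
            if (px < 0) px = 0;
            if (px >= rw * 8) px = rw * 8 - 1;
            if (py < 0) py = 0;
            if (py >= rh * 8) py = rh * 8 - 1;
            MV n = mv21[(py / 8) * rw + (px / 8)];
            mv31[by * rw + bx] = {n.x + m.x, n.y + m.y};  // f3 -> f1
        }
    }
    return mv31;
}
```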
[0100] Using this additional algorithm has the following benefits:
[0101] It separates different stages of motion estimation, so that they can be implemented on different processors or processor cores.
[0102] It utilizes information from the intermediate frames even if these frames are not used as references in the actual encoding.
[0103] It reduces the range of memory access, which can be critical on systems with low cache capabilities.
Weighted Prediction Detection Using Histograms
[0104] Typically, inter-coding of a picture block uses some block of an already coded reference picture as a prediction of this block and then encodes the difference between the prediction and the block itself. That is, the prediction of the block of pixels P(x,y) is calculated as L(x+mx, y+my), where L(x,y) are the luminance values of the reference picture and (mx, my) is the motion vector for this block. The H.264 video standard allows using weighted prediction for inter-picture coding. This means that the prediction of the block of pixels P(x,y) is calculated as A*L(x+mx, y+my) + B, where A and B are weighted prediction parameters, common for all blocks of the coded picture. The problem for the encoder is the fast detection of video fragments where this tool should be used, and the determination of the weighted prediction parameters. Here we describe a fast algorithm to detect such situations. The inputs for this algorithm are the luminance histograms of the reference frame and of the frame to be encoded. The output is a flag indicating whether to use weighted prediction or not. If the algorithm decides to use weighted prediction, it also returns the parameters of weighted prediction for this inter encoded picture.
[0105] The luminance histogram of a picture (Hist) is an unsigned integer array of size MAX_LUM_VAL+1, where Hist[i] is the number of pixels in the picture with luminance value = i. MAX_LUM_VAL depends on the number of bits used to represent pixel luminance. The histogram is an approximation of the probability density function if the luminance value of the pixels is considered a random value.
[0106] This algorithm is based on the assumption that in the case when weighted prediction is applicable, i.e. one frame is well predicted by the weighted prediction formula, the shapes of the probability density functions for both frames should be similar. This similarity is detected using the following obvious fact: if P is a random variable with mean M and variance S^2, then P_n = (P-M)/S is a random value with mean = 0 and variance = 1, where P_n is called the normalization of P. The probability density function of the random value P_n can easily be calculated from the probability density function of P.
[0107] In the case when weighted prediction is applicable, the normalized probability density functions for the reference and original frames should be approximately the same. Having the histograms Ho and Hr of the original and reference frames, the normalized density functions for both can be estimated and compared. If the difference is lower than some predefined threshold, the weighted prediction should be applied.
[0108] The parameters A and B are calculated from the equations for the mean and deviation of the reference and original frames:

A = So/Sr; B = Mo - A*Mr,

where Mo, So are the mean and standard deviation of the original frame and Mr, Sr those of the reference frame.
[0109] Comparison of the normalized density functions is done without explicit calculation of these functions, using the following fast implicit algorithm.
[0110] The inputs for this algorithm are Hist0 and Hist1, the histograms of the original and reference frames.
[0111] The output is a flag indicating whether weighted prediction should be applied, together with its parameters A and B.
[0112] The part of the histogram Hist[i_s, i_e] is defined by the starting and ending indexes i_s and i_e. Any histogram part can be considered as the distribution of some separate stochastic variable. The following values can be calculated for any part of the histogram:

S(i_s, i_e, Hist) = Hist[i_s] + Hist[i_s+1] + Hist[i_s+2] + . . . + Hist[i_e];

M(i_s, i_e, Hist) = i_s*Hist[i_s] + (i_s+1)*Hist[i_s+1] + (i_s+2)*Hist[i_s+2] + . . . + i_e*Hist[i_e];

C(i_s, i_e, Hist) = M(i_s, i_e, Hist)/S(i_s, i_e, Hist),

where C(i_s, i_e, Hist) is the "center of mass" of the particular part of the histogram. Taking i_c = round(C(i_s, i_e, Hist)), we obtain two new histogram parts, defined by the two index pairs (i_s, i_c) and (i_c, i_e). We call these the low subpart and the high subpart. Obviously, if the histogram parts Hist0[i0_s, i0_e] and Hist1[i1_s, i1_e] of two different frames have close normalized densities, and S(i0_s, i0_e, Hist0) is approximately equal to S(i1_s, i1_e, Hist1), the same can be said about the pair Hist0[i0_s, i0_c], Hist1[i1_s, i1_c] and the pair Hist0[i0_c, i0_e], Hist1[i1_c, i1_e]. Hence S(i0_s, i0_c, Hist0) should be approximately equal to S(i1_s, i1_c, Hist1), and S(i0_c, i0_e, Hist0) should be approximately equal to S(i1_c, i1_e, Hist1).
[0113] The algorithm starts with the whole histograms, checks the above approximate equality, and applies itself recursively to each subpart pair. A recursion depth of 3-4 is quite enough to detect weighted prediction cases. Note that this algorithm can be further sped up if the sums S(0, i, Hist) for all i = 0, 1, . . . 255 are pre-calculated.
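As a sketch, the parameters A and B can be computed from the histogram moments as follows (C++, assuming 8-bit luminance and non-empty histograms; the recursive shape comparison itself is not shown, and all names are illustrative):

```cpp
#include <cmath>

struct MeanStd { double mean, stddev; };

// Mean and standard deviation of the luminance distribution
// represented by a 256-bin histogram.
MeanStd histMoments(const unsigned int hist[256]) {
    double n = 0, sum = 0, sum2 = 0;
    for (int i = 0; i < 256; ++i) {
        n    += hist[i];
        sum  += (double)i * hist[i];
        sum2 += (double)i * i * hist[i];
    }
    double mean = sum / n;                 // assumes n > 0
    return { mean, std::sqrt(sum2 / n - mean * mean) };
}

// A = So/Sr, B = Mo - A*Mr, per the equations above.
void weightedPredictionParams(const unsigned int histOrig[256],
                              const unsigned int histRef[256],
                              double* A, double* B) {
    MeanStd o = histMoments(histOrig);
    MeanStd r = histMoments(histRef);
    *A = o.stddev / r.stddev;              // scale
    *B = o.mean - *A * r.mean;             // offset
}
```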
INTRA or INTER Mode of Encoding
[0114] The frame encoder may work in INTRA or INTER mode.
[0115] In the INTRA mode, a picture is broken into 16×16 macroblocks, which in turn may be broken into 16×16 or 4×4 blocks (the fragmentation can be different for different macroblocks of a picture). The macroblocks are coded sequentially row-by-row, left to right. The blocks of each macroblock are coded in the same order. The pixel values for each block may be predicted from the already coded blocks by using several prediction schemes. The known technique of the Discrete Cosine Transform (DCT) is applied to every texture prediction error block: the texture difference from the prediction is transformed into the frequency domain with the two-dimensional DCT (2D-DCT) transform. The transform coefficients are quantized, reordered and then coded arithmetically.
[0116] In the INTER mode, the picture is also broken into 16×16 macroblocks that are coded row-by-row, left-to-right. Each macroblock can be intra or inter coded. All data are coded with an arithmetic encoder. For inter coding, block-match motion compensation from up to 2 reference frames is used. The intra coding is identical to that described above. For inter coding, a macroblock is broken into 16×16, 8×8, 16×8, 8×16, 4×8, 8×4 or 4×4 blocks. For each block, a motion vector to one of the reference frames is transmitted that provides the best prediction of the block pixels in block match. The texture difference from the prediction is coded as in the intra mode.
[0117] FIG. 3 schematically presents a Simple Inter Frame Encoding Unit 30. The encoding unit 30 includes a Macroblock Encoding unit 42, which is a frame texture coding unit corresponding to 16×16 frame texture blocks, and a Frame Header Coding unit 44, which codes a set of parameters needed for proper frame decoding. Encoding of these parameters does not involve arithmetic coding; each parameter is coded as an integer number with the minimal necessary number of bits according to the range of the parameter value. The encoding unit 30 also includes a Motion Estimator 46, which performs the motion estimation procedures. First, the picture area is divided into blocks of size M×N. For each block whose left-top corner is located at the (x_init, y_init) point of the image, wherein x_init is a multiple of M and y_init is a multiple of N, a two-dimensional shift value (x_shift, y_shift) is calculated, so that the M×N block of the reference frame whose left-top corner is located at the point (x_init + x_shift, y_init + y_shift) is the prediction. The Coded Frame Reconstruction and Storage unit 48 provides the intra macroblock texture prediction and/or the inter macroblock texture prediction.
[0118] The intra macroblock texture prediction is the prediction for the current macroblock texture which involves only the reconstructed texture of the previously coded macroblocks from the same frame.
[0119] The inter macroblock texture prediction is the prediction for the current macroblock texture which involves the reconstructed texture of the previously coded frames. A frame is called an inter frame if it contains inter predicted macroblocks.
[0120] The intra prediction mode includes 8 different algorithms, which are used for the intra macroblock texture prediction.
[0121] In the inter prediction mode the macroblock area is divided into 8 subdivisions, which are rectangular blocks with width and height of either 8 or 16, each rectangular block having a corresponding separate motion vector which is used for the inter prediction.
Error Resilience
[0122] For error resilience, the following INTRA refresh algorithm is proposed.
[0123] Hereafter we assume that the frame is divided into 16×16 blocks called macroblocks, each of which may be coded as an INTER macroblock, using the reconstructed texture of the previous frames for coding, or as an INTRA macroblock, using for coding only the reconstructed texture of the current frame. It is further assumed that the video sequence frames for which the algorithm is applied are coded consecutively in display order, and that for each macroblock (texture block of size 16×16) of an INTER frame the following values are calculated: the motion vector to the reference frame, the motion vector prediction via the motion vectors of the neighboring macroblocks, and the values cost_intra and cost_inter, such that in regular situations, when cost_intra < cost_inter, the INTRA coding mode for this macroblock shall be chosen.
[0124] The essence of the method is as follows. If in some of the sequence frames there are regions that are lost or corrupted, then the main temporally spreading visual defects are produced by the moving regions. The high motion regions produce strong artifacts immediately (and these artifacts will immediately be noticed by the viewer) if the texture of at least one of the frames at, just before or just after the motion is corrupted. That is why, only by inserting INTRA macroblocks at the motion areas in several consecutive frames around the current frame, the temporal spreading of these artifacts may be effectively terminated. If the texture of low motion regions is corrupted in the decoder, the viewer will not see appreciable visual artifacts immediately after the texture corruption, so there is no need to insert the INTRA macroblocks right after the motion. However, these visual defects will gradually cumulate and grow, and they are terminated, at the latest after a pre-defined number of frames, by the low motion INTRA update procedure. Besides large motion vectors, an indirect sign of an area that may cause artifacts when a corrupted stream is decoded is the closeness of the values of the functions cost_intra and cost_inter.
[0125] Let us define the following values, characterizing the temporal distances between frames: T_H, the high motion INTRA update period; T_L, the low motion INTRA update period; and D, the high motion INTRA update depth.
[0126] Let us define the following threshold values for the motion vector components: MVT_L, the low motion threshold for the sum of absolute values of the motion vector components; MVT_H, the high motion threshold for the sum of absolute values of the motion vector components; MVTD_L, the low motion threshold for the sum of absolute values of the motion vector prediction difference components; and MVTD_H, the high motion threshold for the sum of absolute values of the motion vector prediction difference components. Let us also define the following values, which will be used as scaling factors: K_H, the INTRA cost scaling factor for high motion frames, and K_L, the INTRA cost scaling factor for low motion frames.
[0127] Let us call each T_L-th frame of the video sequence for which the algorithm is applied a low motion update frame, and each T_H-th frame of the sequence a high motion update frame.
[0128] Let us consider a macroblock a high motion macroblock when S, the sum of the absolute values of its motion vector components, is greater than or equal to MVT_H, and when SD, the sum of the absolute values of its motion vector prediction difference components, is greater than or equal to MVTD_H. Let us consider a macroblock a low motion macroblock when the sum S satisfies the inequality MVT_L <= S < MVT_H and when the sum SD satisfies the inequality MVTD_L <= SD < MVTD_H.
[0129] Let us also define for each macroblock the set C(T) of T macroblocks, including this macroblock itself and the macroblocks located at the same position in the previous frames.
[0130] Based on the definitions above, it is proposed to make the choice between the INTRA and INTER coding modes for each macroblock of the high and low motion update frames according to the following rules (a sketch of these rules in code is given below):
1. For high motion update frames: the macroblock shall be coded as INTRA if cost_intra*K_H < cost_inter and the set C(D) contains at least one high motion macroblock.
2. For low motion update frames: the macroblock shall be coded as INTRA if cost_intra*K_L < cost_inter and the set C(T_L) contains at least one high or low motion macroblock.
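A compact C++ sketch of the macroblock classification and the two decision rules follows; tracking of the sets C(D) and C(T_L) is assumed to be done by the caller, and all names are illustrative:

```cpp
enum Motion { NO_MOTION, LOW_MOTION, HIGH_MOTION };

// Classify one macroblock from S (sum of |mv| components) and
// SD (sum of |mvd| components) against the four thresholds.
Motion classify(int S, int SD, int mvtL, int mvtH, int mvtdL, int mvtdH) {
    if (S >= mvtH && SD >= mvtdH)
        return HIGH_MOTION;
    if (S >= mvtL && S < mvtH && SD >= mvtdL && SD < mvtdH)
        return LOW_MOTION;
    return NO_MOTION;
}

// Rule 1 (high motion update frame): force INTRA when the scaled
// intra cost wins and C(D) has at least one high motion macroblock.
bool forceIntraHigh(double costIntra, double costInter, double kH,
                    bool cDHasHighMotion) {
    return costIntra * kH < costInter && cDHasHighMotion;
}

// Rule 2 (low motion update frame): force INTRA when the scaled intra
// cost wins and C(T_L) has at least one high or low motion macroblock.
bool forceIntraLow(double costIntra, double costInter, double kL,
                   bool cTLHasAnyMotion) {
    return costIntra * kL < costInter && cTLHasAnyMotion;
}
```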
[0131] Despite its apparent simplicity, the proposed intra update scheme shows very high efficiency and avoids the drawbacks of the known intra update schemes. The main point of the algorithm is its ability to keep refreshing moving areas even if the motion stopped several frames ago; various experiments show the necessity of such treatment. The extra INTRA macroblocks are sufficient for effectively terminating noticeable artifacts, and these macroblocks are placed mostly in positions where they will not cause essential bitrate growth with respect to the INTER macroblocks. At the same time, no unnecessary INTRA macroblocks are inserted, keeping the error resilient sequence compression quality at a reasonable level.
[0132] The preceding description is intended to be illustrative of
the principles of the invention, and it will be appreciated that
numerous changes and modifications may occur to those skilled in
the art, and it is intended in the appended claims to cover all
those changes and modifications which fall within the true spirit
and scope of the present invention.
* * * * *