U.S. patent application number 10/444511, for spatiotemporal prediction for Bidirectionally Predictive (B) pictures and motion vector prediction for multi-picture reference motion compensation, was filed with the patent office on May 23, 2003 and published on January 1, 2004.
The invention is credited to Li, Shipeng; Tourapis, Alexandros; and Wu, Feng.
Publication Number: 20040001546
Application Number: 10/444511
Family ID: 29550205
Publication Date: 2004-01-01
United States Patent Application 20040001546
Kind Code: A1
Tourapis, Alexandros; et al.
January 1, 2004
Spatiotemporal prediction for bidirectionally predictive (B)
pictures and motion vector prediction for multi-picture reference
motion compensation
Abstract
Several improvements for use with Bidirectionally Predictive (B)
pictures within a video sequence are provided. In certain
improvements Direct Mode encoding and/or Motion Vector Prediction
are enhanced using spatial prediction techniques. In other
improvements Motion Vector prediction includes temporal distance
and subblock information, for example, for more accurate
prediction. Such improvements and others presented herein
significantly improve the performance of any applicable video
coding system/logic.
Inventors: Tourapis, Alexandros (Nicosia, CY); Li, Shipeng (Irvine, CA); Wu, Feng (Beijing, CN)
Correspondence Address:
LEE & HAYES PLLC
421 W RIVERSIDE AVENUE SUITE 500
SPOKANE, WA 99201
Filed: May 23, 2003
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number
60385965 | Jun 3, 2002 |
Current U.S. Class: 375/240.12; 375/E7.119; 375/E7.133; 375/E7.165; 375/E7.176; 375/E7.211; 375/E7.25; 375/E7.258; 375/E7.262; 375/E7.266
Current CPC Class: H04N 19/142 20141101; H04N 19/61 20141101; H04N 19/58 20141101; H04N 19/593 20141101; H04N 19/573 20141101; H04N 19/577 20141101; H04N 19/105 20141101; H04N 19/102 20141101; H04N 19/513 20141101; H04N 19/137 20141101; H04N 19/87 20141101; H04N 19/176 20141101; H04N 19/107 20141101; H04N 19/52 20141101; H04N 19/56 20141101; H04N 19/51 20141101
Class at Publication: 375/240.12
International Class: H04N 007/12
Claims
What is claimed is:
1. A method for use in encoding video data within a sequence of
video frames, the method comprising: identifying at least a portion
of at least one video frame to be a Bidirectionally Predictive (B)
picture; and selectively encoding said B picture using at least
spatial prediction to encode at least one motion parameter
associated with said B picture.
2. The method as recited in claim 1, wherein said B picture
includes a macroblock.
3. The method as recited in claim 2, wherein selectively encoding
said B picture using at least spatial prediction to encode said at
least one motion parameter produces a Direct Macroblock.
4. The method as recited in claim 1, wherein said B picture
includes a slice.
5. The method as recited in claim 1, wherein said B picture
includes at least a portion of a macroblock.
6. The method as recited in claim 1, wherein selectively encoding
said B picture using at least spatial prediction to encode said at
least one motion parameter further includes employing linear motion
vector prediction for said B picture based on at least one
reference picture that is at least another portion of said video
frame.
7. The method as recited in claim 1, wherein selectively encoding
said B picture using at least spatial prediction to encode said at
least one motion parameter further includes employing non-linear
motion vector prediction for said B picture based on at least one
reference picture that is at least another portion of said video
frame.
8. The method as recited in claim 1, wherein selectively encoding
said B picture using at least spatial prediction to encode said at
least one motion parameter further includes employing median motion
vector prediction for said B picture based on at least two
reference pictures that are both portions of said video frame.
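Median motion vector prediction, as in claim 8, is commonly realized as a component-wise median over the motion vectors of already-coded neighboring blocks. The sketch below is an illustration under stated assumptions, not the claimed method itself: it assumes three neighbors labeled A (left), B (above) and C (above-right), as in H.264-style codecs, and integer vectors; the names `median3` and `median_mv_prediction` are hypothetical.

```python
from typing import Tuple

MV = Tuple[int, int]  # (horizontal, vertical) displacement; units assumed to be sub-pel integers


def median3(a: int, b: int, c: int) -> int:
    """Median of three integers (the middle value after sorting)."""
    return sorted((a, b, c))[1]


def median_mv_prediction(mv_a: MV, mv_b: MV, mv_c: MV) -> MV:
    """Component-wise median of three spatially adjacent motion vectors."""
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))


if __name__ == "__main__":
    print(median_mv_prediction((4, -2), (6, 0), (-1, 3)))  # prints (4, 0)
```

Taking the median per component means the predicted vector need not equal any single neighbor's vector, but each component is always one of the observed values.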
9. The method as recited in claim 1, wherein said at least one
motion parameter includes at least one motion vector.
10. The method as recited in claim 1, wherein at least one other
portion of at least one other video frame is processed to further
selectively encode said B picture using temporal prediction to
encode at least one temporal-based motion parameter associated with
said B picture.
11. The method as recited in claim 10, wherein said temporal
prediction includes bidirectional temporal prediction.
12. The method as recited in claim 10, wherein said at least one
other video frame is a Predictive (P) frame.
13. The method as recited in claim 10, further comprising
selectively scaling said at least one temporal-based motion
parameter based at least in part on a temporal distance between
said other video frame and said frame that includes said B
picture.
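The scaling in claim 13 can be pictured as rescaling a vector in proportion to the temporal distance it must span. A minimal sketch, assuming constant (linear) motion between frames; the helper name `scale_mv` is hypothetical, and Python's round-half-to-even rounding stands in for whatever integer arithmetic a real codec would specify.

```python
from fractions import Fraction
from typing import Tuple

MV = Tuple[int, int]  # (horizontal, vertical) displacement in integer sub-pel units (assumed)


def scale_mv(mv: MV, tr_target: int, tr_source: int) -> MV:
    """Rescale a motion vector spanning tr_source intervals so it spans tr_target.

    Assumes linear motion; uses exact fractions, then Python's round()
    (round-half-to-even) purely for illustration.
    """
    if tr_source == 0:
        raise ValueError("source temporal distance must be non-zero")
    ratio = Fraction(tr_target, tr_source)
    return (round(mv[0] * ratio), round(mv[1] * ratio))


# A P-frame vector spanning 3 frame intervals, reused for a B picture 1 interval away.
print(scale_mv((8, -6), tr_target=1, tr_source=3))  # prints (3, -2)
```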
14. The method as recited in claim 13, wherein temporal distance
information is encoded within a header associated with said encoded
B picture.
15. The method as recited in claim 10, wherein said at least one
other portion includes at least a portion of a macroblock within
said at least one other video frame.
16. A computer-readable medium having computer implementable
instructions for configuring at least one processing unit to
perform acts comprising: accessing data for a sequence of video
frames; identifying at least a portion of at least one video frame
to be a Bidirectionally Predictive (B) picture; and selectively
encoding said B picture using at least spatial prediction to encode
at least one motion parameter associated with said B picture.
17. The computer-readable medium as recited in claim 16, wherein
said B picture includes a macroblock.
18. The computer-readable medium as recited in claim 17, wherein
selectively encoding said B picture using at least spatial
prediction to encode said at least one motion parameter produces a
Direct Macroblock.
19. The computer-readable medium as recited in claim 16, wherein
said B picture includes a slice.
20. The computer-readable medium as recited in claim 16, wherein
said B picture includes at least a portion of a macroblock.
21. The computer-readable medium as recited in claim 16, wherein
selectively encoding said B picture using at least spatial
prediction to encode said at least one motion parameter further
includes employing linear motion vector prediction for said B
picture based on at least one reference picture that is at least
another portion of said video frame.
22. The computer-readable medium as recited in claim 16, wherein
selectively encoding said B picture using at least spatial
prediction to encode said at least one motion parameter further
includes employing non-linear motion vector prediction for said B
picture based on at least one reference picture that is at least
another portion of said video frame.
23. The computer-readable medium as recited in claim 16, wherein
selectively encoding said B picture using at least spatial
prediction to encode said at least one motion parameter further
includes employing median motion vector prediction for said B
picture based on at least two reference pictures that are both
portions of said video frame.
24. The computer-readable medium as recited in claim 16, wherein
said at least one motion parameter includes at least one motion
vector.
25. The computer-readable medium as recited in claim 16, wherein at
least one other portion of at least one other video frame is
processed to further selectively encode said B picture using
temporal prediction to encode at least one temporal-based motion
parameter associated with said B picture.
26. The computer-readable medium as recited in claim 25, wherein
said temporal prediction includes bidirectional temporal
prediction.
27. The computer-readable medium as recited in claim 25, wherein
said at least one other video frame is a Predictive (P) frame.
28. The computer-readable medium as recited in claim 25, having
computer implementable instructions for configuring said at least
one processing unit to perform acts comprising: selectively scaling
said at least one temporal-based motion parameter based at least in
part on a temporal distance between said other video frame and said
frame that includes said B picture.
29. The computer-readable medium as recited in claim 28, wherein
temporal distance information is encoded within a header associated
with said encoded B picture.
30. The computer-readable medium as recited in claim 25, wherein
said at least one other portion includes at least a portion of a
macroblock within said at least one other video frame.
31. An apparatus for use in encoding video data within a sequence
of video frames, the apparatus comprising: logic operatively
configured to access video data for a sequence of video frames,
identify at least a portion of at least one video frame to be a
Bidirectionally Predictive (B) picture, and selectively encode said
B picture using at least spatial prediction to encode at least one
motion parameter associated with said B picture.
32. The apparatus as recited in claim 31, wherein said B picture
includes a macroblock.
33. The apparatus as recited in claim 32, wherein said logic
selectively encodes said B picture using at least spatial
prediction to encode said at least one motion parameter to produce
a Direct Macroblock.
34. The apparatus as recited in claim 31, wherein said B picture
includes a slice.
35. The apparatus as recited in claim 31, wherein said B picture
includes at least a portion of a macroblock.
36. The apparatus as recited in claim 31, wherein said logic is
further configured to employ linear motion vector prediction for
said B picture based on at least one reference picture that is at
least another portion of said video frame.
37. The apparatus as recited in claim 31, wherein said logic is
further configured to employ non-linear motion vector prediction
for said B picture based on at least one reference picture that is
at least another portion of said video frame.
38. The apparatus as recited in claim 31, wherein said logic is
further configured to employ median motion vector prediction for
said B picture based on at least two reference pictures that are
both portions of said video frame.
39. The apparatus as recited in claim 31, wherein said at least one
motion parameter includes at least one motion vector.
40. The apparatus as recited in claim 31, wherein said logic is
further configured to process at least one other portion of at
least one other video frame and selectively encode said B
picture using temporal prediction to encode at least one
temporal-based motion parameter associated with said B picture.
41. The apparatus as recited in claim 40, wherein said temporal
prediction includes bidirectional temporal prediction.
42. The apparatus as recited in claim 40, wherein said at least one
other video frame is a Predictive (P) frame.
43. The apparatus as recited in claim 40, wherein said logic is
further configured to selectively scale said at least one
temporal-based motion parameter based at least in part on a
temporal distance between said other video frame and said frame
that includes said B picture.
44. The apparatus as recited in claim 43, wherein said logic is
further configured to include temporal distance information within
a header associated with said encoded B picture.
45. The apparatus as recited in claim 40, wherein said at least one
other portion includes at least a portion of a macroblock within
said at least one other video frame.
46. A method for encoding video data, the method comprising:
identifying at least a portion of at least one video frame to be
coded in an enhanced direct mode; and encoding said portion in said
enhanced direct mode using at least spatial information associated
with said portion within said at least one video frame.
47. The method as recited in claim 46 wherein encoding said portion
in said enhanced direct mode further includes using temporal
information associated with said portion and at least one other
portion of at least one other video frame.
48. The method as recited in claim 46, wherein encoding said
portion in said enhanced direct mode further includes using motion
vector prediction based on at least one other portion within said
at least one video frame.
49. The method as recited in claim 48, wherein said motion vector
prediction includes median prediction.
50. The method as recited in claim 46, wherein said enhanced direct
mode includes using spatial prediction to calculate said spatial
information based on at least one linear function that considers
motion information of at least one other portion of said at least
one video frame.
51. The method as recited in claim 46, wherein said enhanced direct
mode includes using spatial prediction to calculate said spatial
information based on at least one non-linear function that
considers motion information of at least one other portion of said
at least one video frame.
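Claims 50 and 51 distinguish spatial prediction based on a linear function of neighboring motion information from prediction based on a non-linear function. As an illustration only, the sketch assumes the arithmetic mean as the linear function and the component-wise median as the non-linear one; the claims do not name specific functions, and both helper names are hypothetical.

```python
from statistics import median
from typing import Sequence, Tuple

MV = Tuple[float, float]


def linear_spatial_prediction(neighbors: Sequence[MV]) -> MV:
    """Linear example: arithmetic mean of the neighboring motion vectors."""
    if not neighbors:
        raise ValueError("need at least one neighboring motion vector")
    n = len(neighbors)
    return (sum(mv[0] for mv in neighbors) / n,
            sum(mv[1] for mv in neighbors) / n)


def nonlinear_spatial_prediction(neighbors: Sequence[MV]) -> MV:
    """Non-linear example: component-wise median of the neighboring motion vectors."""
    if not neighbors:
        raise ValueError("need at least one neighboring motion vector")
    return (median(mv[0] for mv in neighbors),
            median(mv[1] for mv in neighbors))


neighbors = [(0, 0), (2, 2), (10, -5)]
print(linear_spatial_prediction(neighbors))     # prints (4.0, -1.0)
print(nonlinear_spatial_prediction(neighbors))  # prints (2, 0)
```

The example shows the practical difference: the outlying vector (10, -5) pulls the mean noticeably, while the median ignores it.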
52. A computer-readable medium having computer implementable
instructions for configuring at least one processing unit to
perform acts comprising encoding video data by identifying at least
a portion of at least one video frame to be coded in an enhanced
direct mode, and encoding said portion in said enhanced direct mode
using at least spatial information associated with said portion
within said at least one video frame.
53. The computer-readable medium as recited in claim 52 wherein
encoding said portion in said enhanced direct mode further includes
using temporal information associated with said portion and at
least one other portion of at least one other video frame.
54. The computer-readable medium as recited in claim 52, wherein
encoding said portion in said enhanced direct mode further includes
using motion vector prediction based on at least one other portion
within said at least one video frame.
55. The computer-readable medium as recited in claim 54, wherein
said motion vector prediction includes median prediction.
56. The computer-readable medium as recited in claim 52, wherein
said enhanced direct mode includes using spatial prediction to
calculate said spatial information based on at least one linear
function that considers motion information of at least one other
portion of said at least one video frame.
57. The computer-readable medium as recited in claim 52, wherein
said enhanced direct mode includes using spatial prediction to
calculate said spatial information based on at least one non-linear
function that considers motion information of at least one other
portion of said at least one video frame.
58. An apparatus comprising: logic operatively configured to encode
video data by identifying at least a portion of at least one video
frame to be coded in an enhanced direct mode, and encode said
portion in said enhanced direct mode using at least spatial
information associated with said portion within said at least
one video frame.
59. The apparatus as recited in claim 58 wherein said logic is
further operatively configured to use temporal information
associated with said portion and at least one other portion of at
least one other video frame to encode said portion in said enhanced
direct mode.
60. The apparatus as recited in claim 58, wherein said logic is
further operatively configured to encode said portion in said
enhanced direct mode using motion vector prediction
information based on at least one other portion within said at
least one video frame.
61. The apparatus as recited in claim 60, wherein said motion vector
prediction includes median prediction.
62. The apparatus as recited in claim 58, wherein said logic is
further operatively configured to use spatial prediction to
calculate said spatial information based on at least one linear
function that considers motion information of at least one other
portion of said at least one video frame.
63. The apparatus as recited in claim 58, wherein said logic is
further operatively configured to use spatial prediction to
calculate said spatial information based on at least one non-linear
function that considers motion information of at least one other
portion of said at least one video frame.
64. A method to predict a reference picture in direct mode video
encoding, the method comprising: selecting a reference picture from
a group comprising a minimum reference picture for a plurality of
predictions related to at least a portion of a video frame to be
encoded, a median reference picture for said plurality of
predictions, and a current reference picture based on a single
direction prediction; and encoding said at least one portion of
said video frame based on said selected reference picture.
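The three options in claim 64 (a minimum reference picture, a median reference picture, or the current reference of a single-direction prediction) can be sketched as a small selection function. This assumes reference pictures are identified by integer indices where a smaller index means a temporally closer picture; the function name and the `rule` strings are hypothetical.

```python
from statistics import median_low
from typing import Sequence


def select_reference(neighbor_refs: Sequence[int], current_ref: int, rule: str) -> int:
    """Pick a reference index for a direct-mode portion (hypothetical helper).

    rule == "minimum": smallest index among the neighboring predictions
    rule == "median":  median index; median_low keeps the result an actual candidate
    rule == "current": reference already used by a single-direction prediction
    """
    if rule == "current":
        return current_ref
    if not neighbor_refs:
        raise ValueError("need at least one neighboring prediction")
    if rule == "minimum":
        return min(neighbor_refs)
    if rule == "median":
        return median_low(neighbor_refs)
    raise ValueError(f"unknown rule: {rule!r}")


print(select_reference([2, 0, 1], current_ref=3, rule="minimum"))  # prints 0
print(select_reference([2, 0, 1], current_ref=3, rule="median"))   # prints 1
```

`median_low` is used instead of `median` so that an even number of neighbors still yields an existing reference index rather than an average of two.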
65. The method as recited in claim 64, wherein selecting said
reference picture further includes selecting at least one spatially
related prediction.
66. The method as recited in claim 64, wherein selecting said
reference picture further includes selecting at least one
temporally related prediction.
67. A computer-readable medium having computer implementable
instructions for configuring at least one processing unit to
perform acts comprising: selecting a reference picture from a group
comprising a minimum reference picture for a plurality of
predictions related to at least a portion of a video frame to be
encoded, a median reference picture for said plurality of
predictions, and a current reference picture based on a single
direction prediction; and encoding said at least one portion of
said video frame based on said selected reference picture.
68. The computer-readable medium as recited in claim 67, wherein
selecting said reference picture further includes selecting at
least one spatially related prediction.
69. The computer-readable medium as recited in claim 67, wherein
selecting said reference picture further includes selecting at
least one temporally related prediction.
70. An apparatus comprising: logic that is operatively configured
to select a reference picture from a group comprising a minimum
reference picture for a plurality of predictions related to at
least a portion of a video frame to be encoded, a median reference
picture for said plurality of predictions, and a current reference
picture based on a single direction prediction, and encode said at
least one portion of said video frame based on said selected
reference picture.
71. The apparatus as recited in claim 70, wherein said logic is
operatively configured to select at least one spatially related
prediction.
72. The apparatus as recited in claim 70, wherein said logic is
operatively configured to select at least one temporally related
prediction.
73. A method for use in selecting between temporal prediction,
spatial prediction, or both temporal and spatial prediction for
encoding at least a portion of at least one video frame in an
enhanced direct mode, the method comprising: selecting temporal
prediction if at least one motion vector of a collocated portion of
said video frame is zero; if surrounding portions within said video
frame use different reference pictures than a collocated reference
picture, then selecting spatial prediction only; if a motion flow
associated with said portion of said video frame is substantially
different than a motion flow associated with a reference picture,
then selecting spatial prediction; if temporal prediction of direct
mode is signaled inside an image header, then selecting temporal
prediction; and if spatial prediction of direct mode is signaled
inside said image header, then selecting spatial prediction.
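The selection rules of claim 73 can be expressed as a decision function. The claim lists the conditions but not their precedence, so this sketch makes several labeled assumptions: a mode signaled in the image header overrides the per-portion heuristics, the remaining checks run in the order claimed, "motion flow substantially different" is reduced to a hypothetical scalar compared against a hypothetical threshold, and combined spatiotemporal prediction is the fallback. All names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class DirectPrediction(Enum):
    TEMPORAL = "temporal"
    SPATIAL = "spatial"
    SPATIOTEMPORAL = "both"


@dataclass
class DirectContext:
    collocated_mv: Tuple[int, int]   # motion vector of the collocated portion
    collocated_ref: int              # reference index used by the collocated portion
    neighbor_refs: List[int]         # reference indices of the surrounding portions
    flow_difference: float           # hypothetical measure of motion-flow mismatch
    header_mode: Optional[DirectPrediction] = None  # mode signaled in the image header, if any


FLOW_THRESHOLD = 8.0  # hypothetical "substantially different" threshold


def choose_direct_prediction(ctx: DirectContext) -> DirectPrediction:
    # Assumption: an explicit header signal takes priority over per-portion rules.
    if ctx.header_mode is not None:
        return ctx.header_mode
    # Zero collocated motion: temporal prediction.
    if ctx.collocated_mv == (0, 0):
        return DirectPrediction.TEMPORAL
    # Surrounding portions all use a reference other than the collocated one: spatial only.
    if ctx.neighbor_refs and all(r != ctx.collocated_ref for r in ctx.neighbor_refs):
        return DirectPrediction.SPATIAL
    # Motion flow differs substantially from the reference picture's flow: spatial.
    if ctx.flow_difference > FLOW_THRESHOLD:
        return DirectPrediction.SPATIAL
    # Assumed fallback: use both temporal and spatial prediction.
    return DirectPrediction.SPATIOTEMPORAL


print(choose_direct_prediction(DirectContext((3, 1), 0, [1, 2], 0.0)))  # prints DirectPrediction.SPATIAL
```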
74. The method as recited in claim 73, further comprising:
correcting at least one temporally predicted parameter based on
spatial information.
75. The method as recited in claim 73, further comprising:
correcting at least one spatially predicted parameter based on
temporal information.
76. A computer-readable medium having computer implementable
instructions for configuring at least one processing unit to
perform acts comprising: selecting between temporal prediction,
spatial prediction, or both temporal and spatial prediction for
encoding at least a portion of at least one video frame in an
enhanced direct mode, such that: temporal prediction is selected if
at least one motion vector of a collocated portion of said video
frame is zero, only spatial prediction is selected if surrounding
portions within said video frame use different reference pictures
than a collocated reference picture, spatial prediction is selected
if a motion flow associated with said portion of said video frame
is substantially different than a motion flow associated with a
reference picture, temporal prediction is selected if temporal
prediction of direct mode is signaled inside an image header, and
spatial prediction is selected if spatial prediction of direct mode
is signaled inside said image header.
77. The computer-readable medium as recited in claim 76, further
comprising: correcting at least one temporally predicted parameter
based on spatial information.
78. The computer-readable medium as recited in claim 76, further
comprising: correcting at least one spatially predicted parameter
based on temporal information.
79. An apparatus comprising: logic operatively configured to select
between and employ temporal prediction, spatial prediction, or both
temporal and spatial prediction for encoding at least a portion of
at least one video frame in an enhanced direct mode, wherein said
logic: selects temporal prediction if at least one motion vector of
a collocated portion of said video frame is zero, selects only
spatial prediction if surrounding portions within said video frame
use different reference pictures than a collocated reference
picture, selects spatial prediction if a motion flow associated
with said portion of said video frame is substantially different
than a motion flow associated with a reference picture, selects
temporal prediction if temporal prediction of direct mode is
signaled inside an image header, and selects spatial prediction if
spatial prediction of direct mode is signaled inside said image
header.
80. The apparatus as recited in claim 79, wherein said logic is
further operatively configured to correct at least one temporally
predicted parameter based on spatial information.
81. The apparatus as recited in claim 79, wherein said logic is
further operatively configured to correct at least one spatially
predicted parameter based on temporal information.
82. A method for use in encoding video data, the method comprising:
selecting a reference portion of a future video frame to serve as a
B picture to at least one portion of an earlier video frame; using
motion vectors associated with said reference frame to calculate
motion vectors associated with said at least one portion; and
encoding said at least one portion based on said calculated motion
vectors associated with said at least one portion.
83. A method as recited in claim 82, wherein using said motion
vectors associated with said reference frame to calculate said
motion vectors associated with said at least one portion further
includes estimating at least one possible prediction for use in
direct mode coding by projecting and inverting backward and forward
motion vectors of the reference portion.
84. The method as recited in claim 83, wherein encoding said at
least one portion based on said calculated motion vectors
associated with said at least one portion further includes applying
selective projection and inversion based on at least one temporal
parameter associated with said reference portion with respect to
said at least one portion.
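One familiar way to "project and invert" a reference portion's motion vector, as in claims 83 and 84, is the temporal direct derivation used in H.264: the collocated vector is scaled to the B picture's temporal position to give the forward vector, and the backward vector is the remainder of the collocated vector, pointing the opposite way. The sketch below uses that derivation as an assumed interpretation, not as the exact procedure of these claims; the helper name is hypothetical, and exact fractions with Python rounding replace H.264's fixed-point arithmetic.

```python
from fractions import Fraction
from typing import Tuple

MV = Tuple[int, int]


def project_collocated_mv(mv_col: MV, tr_b: int, tr_d: int) -> Tuple[MV, MV]:
    """Derive forward and backward direct-mode vectors from a collocated vector.

    mv_col: vector of the reference portion in the future frame, pointing back to
            its own reference picture tr_d intervals earlier.
    tr_b:   distance from that earlier reference picture to the current B picture.
    """
    if not 0 < tr_b < tr_d:
        raise ValueError("expected 0 < tr_b < tr_d for a B picture between the references")
    # Projection: scale the collocated vector to the B picture's temporal position.
    fwd = (round(Fraction(mv_col[0] * tr_b, tr_d)),
           round(Fraction(mv_col[1] * tr_b, tr_d)))
    # Inversion: the backward vector covers the remaining span in the opposite direction.
    bwd = (fwd[0] - mv_col[0], fwd[1] - mv_col[1])
    return fwd, bwd


print(project_collocated_mv((9, -6), tr_b=1, tr_d=3))  # prints ((3, -2), (-6, 4))
```

Computing the backward vector as `fwd - mv_col` rather than scaling separately keeps the two vectors consistent: they always sum to the collocated vector's span, even after rounding.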
85. The method as recited in claim 82, wherein only one reference
portion is used for B pictures when encoding in direct mode.
86. The method as recited in claim 82, wherein encoding said at
least one portion based on said calculated motion vectors
associated with said at least one portion further includes encoding
in a direct mode wherein at least one of said calculated motion
vectors is based on at least one projected motion vector that
refers to at least two reference portions in two different
reference pictures.
87. The method as recited in claim 82, wherein encoding said at
least one portion based on said calculated motion vectors
associated with said at least one portion further includes encoding
in a direct mode wherein at least one of said calculated motion
vectors is based on spatial prediction associated with said
reference portion.
88. A computer-readable medium having computer implementable
instructions for configuring at least one processing unit to
perform acts comprising: selecting a reference portion of a future
video frame to serve as a B picture to at least one portion of an
earlier video frame; using motion vectors associated with said
reference frame to calculate motion vectors associated with said at
least one portion; and encoding said at least one portion based on
said calculated motion vectors associated with said at least one
portion.
89. A computer-readable medium as recited in claim 88, wherein
using said motion vectors associated with said reference frame to
calculate said motion vectors associated with said at least one
portion further includes estimating at least one possible
prediction for use in direct mode coding by projecting and
inverting backward and forward motion vectors of the reference
portion.
90. The computer-readable medium as recited in claim 89, wherein
encoding said at least one portion based on said calculated motion
vectors associated with said at least one portion further includes
applying selective projection and inversion based on at least one
temporal parameter associated with said reference portion with
respect to said at least one portion.
91. The computer-readable medium as recited in claim 88, wherein
only one reference portion is used for B pictures when encoding in
direct mode.
92. The computer-readable medium as recited in claim 88, wherein
encoding said at least one portion based on said calculated motion
vectors associated with said at least one portion further includes
encoding in a direct mode wherein at least one of said calculated
motion vectors is based on at least one projected motion vector
that refers to at least two reference portions in two different
reference pictures.
93. The computer-readable medium as recited in claim 88, wherein
encoding said at least one portion based on said calculated motion
vectors associated with said at least one portion further includes
encoding in a direct mode wherein at least one of said calculated
motion vectors is based on spatial prediction associated with said
reference portion.
94. An apparatus comprising: logic operatively configured to select
a reference portion of a future video frame to serve as a B picture
to at least one portion of an earlier video frame, use motion
vectors associated with said reference frame to calculate motion
vectors associated with said at least one portion, and encode said
at least one portion based on said calculated motion vectors
associated with said at least one portion.
95. The apparatus as recited in claim 94, wherein said logic is
further operatively configured to estimate at least one possible
prediction for use in direct mode coding by projecting and
inverting backward and forward motion vectors of the reference
portion.
96. The apparatus as recited in claim 95, wherein said logic is
further operatively configured to apply selective projection and
inversion based on at least one temporal parameter associated with
said reference portion with respect to said at least one
portion.
97. The apparatus as recited in claim 94, wherein only one
reference portion is used for B pictures when encoding in direct
mode.
98. The apparatus as recited in claim 94, wherein said logic is
further operatively configured to encode in a direct mode wherein
at least one of said calculated motion vectors is based on at least
one projected motion vector that refers to at least two reference
portions in two different reference pictures.
99. The apparatus as recited in claim 94, wherein said logic is
further operatively configured to encode in a direct mode wherein
at least one of said calculated motion vectors is based on spatial
prediction associated with said reference portion.
100. A method for use in determining motion vectors during video
encoding, the method comprising: selecting at least three
predictors A, B and C that each use a different reference picture
having an associated temporal distance $TR_A$, $TR_B$, and
$TR_C$ respectively, and a motion vector $MV_A$, $MV_B$, and
$MV_C$; and predicting a median motion vector $MV_{pred}$
associated with a current reference picture that has a temporal
distance equal to $TR$.
101. The method as recited in claim 100, wherein said median
predictor $MV_{pred}$ is calculated as:
$MV_{pred} = TR \times \mathrm{Median}\left(\frac{MV_A}{TR_A}, \frac{MV_B}{TR_B}, \frac{MV_C}{TR_C}\right)$.
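The formula in claim 101 normalizes each predictor by its own temporal distance, takes the median of those per-interval vectors, and rescales the result to the current reference's distance $TR$. A minimal sketch, applying the median component-wise (an assumption, since the claim does not say how the median of vectors is taken); the function name is hypothetical.

```python
from statistics import median
from typing import Sequence, Tuple

MV = Tuple[float, float]


def scaled_median_predictor(tr: float, mvs: Sequence[MV], trs: Sequence[float]) -> MV:
    """MV_pred = TR * Median(MV_A/TR_A, MV_B/TR_B, MV_C/TR_C), component-wise."""
    if len(mvs) != 3 or len(trs) != 3:
        raise ValueError("expected exactly three predictors A, B, C")
    if any(t == 0 for t in trs):
        raise ValueError("temporal distances must be non-zero")
    # Normalize each predictor to a per-interval displacement.
    per_unit = [(mv[0] / t, mv[1] / t) for mv, t in zip(mvs, trs)]
    # Median per component, then rescale to the current reference's distance.
    return (tr * median(v[0] for v in per_unit),
            tr * median(v[1] for v in per_unit))


# A, B, C reference pictures 2, 3 and 1 intervals away; current reference is 2 away.
print(scaled_median_predictor(2, [(4, 2), (9, -3), (-2, 4)], [2, 3, 1]))  # prints (4.0, 2.0)
```

Without the normalization, a neighbor pointing to a distant reference would have a proportionally larger vector and could dominate the median even if it described the same motion.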
102. The method as recited in claim 100, wherein said median
predictor $MV_{pred}$ is calculated as:
$\vec{MV}_{pred} = \mathrm{Median}(\mathrm{Ave}(\vec{MV}_{C_1}, \vec{MV}_{C_2}), \mathrm{Ave}(\vec{MV}_{A_1}, \vec{MV}_{A_2}), \vec{MV}_B)$.
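Claim 102 averages two vectors on the A side and two on the C side before taking a three-way median with $\vec{MV}_B$. The sketch assumes, consistent with the abstract's mention of subblock information, that $A_1, A_2$ and $C_1, C_2$ are the vectors of two subblocks of neighbors A and C; that reading, the component-wise median, and the helper names are assumptions.

```python
from typing import Tuple

MV = Tuple[float, float]


def ave(u: MV, v: MV) -> MV:
    """Component-wise average of two motion vectors."""
    return ((u[0] + v[0]) / 2, (u[1] + v[1]) / 2)


def median3_mv(a: MV, b: MV, c: MV) -> MV:
    """Component-wise median of three motion vectors."""
    return (sorted((a[0], b[0], c[0]))[1], sorted((a[1], b[1], c[1]))[1])


def subblock_median_predictor(mv_c1: MV, mv_c2: MV, mv_a1: MV, mv_a2: MV, mv_b: MV) -> MV:
    """MV_pred = Median(Ave(MV_C1, MV_C2), Ave(MV_A1, MV_A2), MV_B)."""
    return median3_mv(ave(mv_c1, mv_c2), ave(mv_a1, mv_a2), mv_b)


print(subblock_median_predictor((4, 0), (6, 2), (0, 0), (2, -2), (3, 5)))  # prints (3, 1.0)
```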
103. The method as recited in claim 100, further comprising:
selecting at least a fourth predictor D having an associated
temporal distance $TR_D$ and a motion vector $MV_D$, and
wherein said median predictor $MV_{pred}$ is calculated as:
$\vec{MV}_{pred} = \mathrm{Median}(\mathrm{Median}(\vec{MV}_{C_1}, \vec{MV}_{C_1}, \vec{MV}_D), \ldots, \mathrm{Median}(\vec{MV}_D, \vec{MV}_{A_1}, \vec{MV}_{C_1}), \mathrm{Median}(\vec{MV}_B, \vec{MV}_{A_1}, \vec{MV}_{A_2}))$.
104. The method as recited in claim 100, further comprising:
selecting at least a fourth predictor D having an associated
temporal distance $TR_D$ and a motion vector $MV_D$, and
wherein said median predictor $MV_{pred}$ is calculated as:
$\vec{MV}_{pred} = \mathrm{Median}(\vec{MV}_{C_1}, \vec{MV}_{C_2}, \vec{MV}_D, \vec{MV}_B, \vec{MV}_{A_1}, \vec{MV}_{A_2})$.
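Claim 104 takes a single median over six vectors. The claim does not say how a median of an even number of values is resolved; this sketch assumes the conventional rule of averaging the two middle values (what Python's `statistics.median` does) and applies it per component. The function name is hypothetical.

```python
from statistics import median
from typing import Tuple

MV = Tuple[float, float]


def six_way_median_predictor(mv_c1: MV, mv_c2: MV, mv_d: MV,
                             mv_b: MV, mv_a1: MV, mv_a2: MV) -> MV:
    """Component-wise median of six vectors; even count, so the two middle values are averaged."""
    vectors = (mv_c1, mv_c2, mv_d, mv_b, mv_a1, mv_a2)
    return (median(v[0] for v in vectors), median(v[1] for v in vectors))


print(six_way_median_predictor((1, 0), (2, 4), (8, 1), (3, 3), (5, -2), (0, 6)))  # prints (2.5, 2.0)
```

If a codec needs the prediction to equal one of the candidate values, `statistics.median_low` or `median_high` could be substituted; which choice the patent intends is not stated here.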
105. The method as recited in claim 100, further comprising
selectively substituting an adjacent portion of a reference frame
for a selected portion of said reference frame for use in
determining motion vector prediction when intra coding is used.
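Claim 105 substitutes an adjacent portion for a selected portion when intra coding leaves the selected portion without a usable motion vector. The sketch below is hypothetical throughout: intra-coded neighbors are marked with `None`, the substitution table (C replaced by D) is borrowed from the common H.264 convention for an unavailable above-right neighbor, and a zero vector is the assumed fallback when no substitute exists.

```python
from typing import Dict, Optional, Tuple

MV = Tuple[int, int]

# Hypothetical substitution table: which adjacent portion replaces an intra-coded one.
SUBSTITUTES: Dict[str, str] = {"C": "D"}


def resolve_predictors(neighbors: Dict[str, Optional[MV]]) -> Dict[str, MV]:
    """Return usable vectors for predictors A, B, C; None marks an intra-coded portion."""
    resolved: Dict[str, MV] = {}
    for label in ("A", "B", "C"):
        mv = neighbors.get(label)
        if mv is None and label in SUBSTITUTES:
            # Substitute the adjacent portion's vector, if that portion is inter coded.
            mv = neighbors.get(SUBSTITUTES[label])
        # Assumed fallback when neither the portion nor its substitute has a vector.
        resolved[label] = mv if mv is not None else (0, 0)
    return resolved


print(resolve_predictors({"A": (1, 2), "B": (3, 4), "C": None, "D": (5, 6)}))
```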
106. A computer-readable medium having computer implementable
instructions for configuring at least one processing unit to
perform acts comprising: selecting at least three predictors A, B
and C that each use a different reference picture having an
associated temporal distance $TR_A$, $TR_B$, and $TR_C$
respectively, and a motion vector $MV_A$, $MV_B$, and $MV_C$;
and predicting a median motion vector $MV_{pred}$ associated with a
current reference picture that has a temporal distance equal to
$TR$.
107. The computer-readable medium as recited in claim 106, wherein
said median predictor $MV_{pred}$ is calculated as:
$MV_{pred} = TR \times \mathrm{Median}\left(\frac{MV_A}{TR_A}, \frac{MV_B}{TR_B}, \frac{MV_C}{TR_C}\right)$.
108. The computer-readable medium as recited in claim 106, wherein
said median predictor $MV_{pred}$ is calculated as:
$\vec{MV}_{pred} = \mathrm{Median}(\mathrm{Ave}(\vec{MV}_{C_1}, \vec{MV}_{C_2}), \mathrm{Ave}(\vec{MV}_{A_1}, \vec{MV}_{A_2}), \vec{MV}_B)$.
109. The computer-readable medium as recited in claim 106, further
comprising: selecting at least a fourth predictor D having an
associated temporal distance $TR_D$ and a motion vector $MV_D$,
and wherein said median predictor $MV_{pred}$ is calculated as:
$\vec{MV}_{pred} = \mathrm{Median}(\mathrm{Median}(\vec{MV}_{C_1}, \vec{MV}_{C_2}, \vec{MV}_D), \ldots, \mathrm{Median}(\vec{MV}_D, \vec{MV}_{A_1}, \vec{MV}_{C_2}), \mathrm{Median}(\vec{MV}_B, \vec{MV}_{A_1}, \vec{MV}_{A_2}))$.
110. The computer-readable medium as recited in claim 106, further
comprising: selecting at least a fourth predictor D having an
associated temporal distance $TR_D$ and a motion vector $MV_D$,
and wherein said median predictor $MV_{pred}$ is calculated as:
$\vec{MV}_{pred} = \mathrm{Median}(\vec{MV}_{C_1}, \vec{MV}_{C_2}, \vec{MV}_D, \vec{MV}_B, \vec{MV}_{A_1}, \vec{MV}_{A_2})$.
111. The computer-readable medium as recited in claim 106, further
comprising selectively substituting an adjacent portion of a
reference frame for a selected portion of said reference frame for
use in determining motion vector prediction when intra coding is
used.
112. An apparatus comprising logic operatively configured to select
at least three predictors A, B and C that each uses a different
reference picture having an associated temporal distance $TR_A$,
$TR_B$, and $TR_C$ respectively, and a motion vector $MV_A$,
$MV_B$, and $MV_C$, and predict a median motion vector
$MV_{pred}$ associated with a current reference picture that has a
temporal distance equal to $TR$.
113. The apparatus as recited in claim 112, wherein said median
predictor $MV_{pred}$ is calculated as:
$\vec{MV}_{pred} = TR \times \mathrm{Median}\left(\frac{\vec{MV}_A}{TR_A}, \frac{\vec{MV}_B}{TR_B}, \frac{\vec{MV}_C}{TR_C}\right)$.
114. The apparatus as recited in claim 112, wherein said median
predictor $MV_{pred}$ is calculated as:
$\vec{MV}_{pred} = \mathrm{Median}(\mathrm{Ave}(\vec{MV}_{C_1}, \vec{MV}_{C_2}), \mathrm{Ave}(\vec{MV}_{A_1}, \vec{MV}_{A_2}), \vec{MV}_B)$.
115. The apparatus as recited in claim 112, wherein said logic is
further operatively configured to select at least a fourth
predictor D having an associated temporal distance $TR_D$ and a
motion vector $MV_D$, and wherein said median predictor
$MV_{pred}$ is calculated as:
$\vec{MV}_{pred} = \mathrm{Median}(\mathrm{Median}(\vec{MV}_{C_1}, \vec{MV}_{C_2}, \vec{MV}_D), \mathrm{Median}(\vec{MV}_D, \vec{MV}_{A_1}, \vec{MV}_{C_2}), \mathrm{Median}(\vec{MV}_B, \vec{MV}_{A_1}, \vec{MV}_{A_2}))$.
116. The apparatus as recited in claim 112, wherein said logic is
further operatively configured to select at least a fourth
predictor D having an associated temporal distance $TR_D$ and a
motion vector $MV_D$, and wherein said median predictor
$MV_{pred}$ is calculated as:
$\vec{MV}_{pred} = \mathrm{Median}(\vec{MV}_{C_1}, \vec{MV}_{C_2}, \vec{MV}_D, \vec{MV}_B, \vec{MV}_{A_1}, \vec{MV}_{A_2})$.
117. The apparatus as recited in claim 112, wherein said logic is
further operatively configured to selectively substitute an
adjacent portion of a reference frame for a selected portion of
said reference frame for use in determining motion vector
prediction when intra coding is used.
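By way of illustration only, the alternative median predictors recited in claims 108-110 (and 114-116) can be sketched as follows. The sketch assumes that Median and Ave operate separately on the horizontal and vertical components and that the median of an even number of values (six, in claims 110 and 116) is the mean of the two middle values; the claims do not specify either point. All function names are hypothetical, and the terms of claims 109/115 are reproduced exactly as recited.

```python
from fractions import Fraction
from statistics import median


def _vec(v):
    # Exact rational arithmetic so averages and even-count medians stay exact.
    return (Fraction(v[0]), Fraction(v[1]))


def _componentwise(fn, vecs):
    # Apply fn separately to the horizontal (0) and vertical (1) components.
    return tuple(fn([v[i] for v in vecs]) for i in (0, 1))


def ave(*vecs):
    return _componentwise(lambda xs: sum(xs) / len(xs), [_vec(v) for v in vecs])


def med(*vecs):
    # statistics.median returns the mean of the two middle values for an
    # even count (assumption: the claims leave this case open).
    return _componentwise(median, [_vec(v) for v in vecs])


def pred_ave(a1, a2, b, c1, c2):
    # Claims 108/114: Median(Ave(C1, C2), Ave(A1, A2), B)
    return med(ave(c1, c2), ave(a1, a2), b)


def pred_median_of_medians(a1, a2, b, c1, c2, d):
    # Claims 109/115: Median(Median(C1, C2, D), Median(D, A1, C2), Median(B, A1, A2))
    return med(med(c1, c2, d), med(d, a1, c2), med(b, a1, a2))


def pred_six(a1, a2, b, c1, c2, d):
    # Claims 110/116: Median over all six predictors
    return med(c1, c2, d, b, a1, a2)
```

The averaging form smooths pairs of neighbors before the median, while the median-of-medians form keeps every intermediate result equal to an actual component of one of its inputs.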
Description
RELATED PATENT APPLICATIONS
[0001] This U.S. Non-provisional Application for Letters Patent
claims the benefit of priority from, and hereby incorporates by
reference the entire disclosure of, co-pending U.S. Provisional
Application for Letters Patent Serial No. 60/385,965, filed Jun. 3,
2002, and titled "Spatiotemporal Prediction for Bidirectionally
Predictive (B) Frames and Motion Vector Prediction for Multi-Frame
Reference Motion Compensation".
TECHNICAL FIELD
[0002] This invention relates to video coding, and more
particularly to methods and apparatuses for providing improved
coding and/or prediction techniques associated with different types
of video data.
BACKGROUND
[0003] The motivation for increased coding efficiency in video
coding has led to the adoption in the Joint Video Team (JVT) (a
standards body) of more refined and complicated models and modes
describing motion information for a given macroblock. These models
and modes tend to take better advantage of the temporal
redundancies that may exist within a video sequence. See, for
example, ITU-T, Video Coding Experts Group (VCEG), "JVT
Coding--(ITU-T H.26L & ISO/IEC JTC1 Standard)--Working Draft
Number 2 (WD-2)", ITU-T JVT-B 118, March 2002; and/or Heiko Schwarz
and Thomas Wiegand, "Tree-structured macroblock partition", Doc.
VCEG-N17, December 2001.
[0004] There is a continuing need for further improved methods and
apparatuses that can support the latest models and modes and also
possibly introduce new models and modes to take advantage of
improved coding techniques.
SUMMARY
[0005] The above-stated needs and others are addressed, for example,
by a method for use in encoding video data within a sequence of
video frames. The method includes identifying at least a portion of
at least one video frame to be a Bidirectionally Predictive (B)
picture, and selectively encoding the B picture using at least
spatial prediction to encode at least one motion parameter
associated with the B picture. In certain exemplary implementations
the B picture may include a block, a macroblock, a subblock, a
slice, or other like portion of the video frame. For example, when
a macroblock portion is used, the method produces a Direct
Macroblock.
[0006] In certain further exemplary implementations, the method
also includes employing linear or non-linear motion vector
prediction for the B picture based on at least one reference
picture that is at least another portion of the video frame. By way
of example, in certain implementations, the method employs median
motion vector prediction to produce at least one motion vector.
[0007] In still other exemplary implementations, in addition to
spatial prediction, the method may also process at least one other
portion of at least one other video frame to further selectively
encode the B picture using temporal prediction to encode at least
one temporal-based motion parameter associated with the B picture.
In some instances the temporal prediction includes bidirectional
temporal prediction, for example based on at least a portion of a
Predictive (P) frame.
[0008] In certain other implementations, the method also
selectively determines applicable scaling for a temporal-based
motion parameter based at least in part on a temporal distance
between the predictor video frame and the frame that includes the B
picture. In certain implementations temporal distance information
is encoded, for example, within a header or other like data
arrangement associated with the encoded B picture.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The present invention is illustrated by way of example and
not limitation in the figures of the accompanying drawings. The
same numbers are used throughout the figures to reference like
components and/or features.
[0010] FIG. 1 is a block diagram depicting an exemplary computing
environment that is suitable for use with certain implementations
of the present invention.
[0011] FIG. 2 is a block diagram depicting an exemplary
representative device that is suitable for use with certain
implementations of the present invention.
[0012] FIG. 3 is an illustrative diagram depicting spatial
prediction associated with portions of a picture, in accordance
with certain exemplary implementations of the present
invention.
[0013] FIG. 4 is an illustrative diagram depicting Direct
Prediction in B picture coding, in accordance with certain
exemplary implementations of the present invention.
[0014] FIG. 5 is an illustrative diagram depicting what happens
when a scene change occurs or when the collocated block is
intra-coded, in accordance with certain exemplary implementations
of the present invention.
[0015] FIG. 6 is an illustrative diagram depicting handling of
collocated intra within existing codecs wherein motion is assumed
to be zero, in accordance with certain exemplary implementations of
the present invention.
[0016] FIG. 7 is an illustrative diagram depicting how Direct Mode
is handled when the reference picture of the collocated block in
the subsequent P picture is other than zero, in accordance with
certain exemplary implementations of the present invention.
[0017] FIG. 8 is an illustrative diagram depicting an exemplary
scheme wherein $MV_{FW}$ and $MV_{BW}$ are derived from spatial
prediction, in accordance with certain exemplary implementations of
the present invention.
[0018] FIG. 9 is an illustrative diagram depicting how spatial
prediction solves the problem of scene changes and the like, in
accordance with certain exemplary implementations of the present
invention.
[0019] FIG. 10 is an illustrative diagram depicting joint
spatio-temporal prediction for Direct Mode in B picture coding, in
accordance with certain exemplary implementations of the present
invention.
[0020] FIG. 11 is an illustrative diagram depicting Motion Vector
Prediction of a current block considering reference picture
information of predictor macroblocks, in accordance with certain
exemplary implementations of the present invention.
[0021] FIG. 12 is an illustrative diagram depicting how to use more
candidates for Direct Mode prediction especially if bidirectional
prediction is used within the B picture, in accordance with certain
exemplary implementations of the present invention.
[0022] FIG. 13 is an illustrative diagram depicting how B pictures
may be restricted in using future and past reference pictures, in
accordance with certain exemplary implementations of the present
invention.
[0023] FIG. 14 is an illustrative diagram depicting projection of
collocated Motion Vectors to a current reference for temporal
direct prediction, in accordance with certain exemplary
implementations of the present invention.
[0024] FIGS. 15a-c are illustrative diagrams depicting Motion
Vector Predictors for one MV in different configurations, in
accordance with certain exemplary implementations of the present
invention.
[0025] FIGS. 16a-c are illustrative diagrams depicting Motion
Vector Predictors for one MV with 8×8 partitions in different
configurations, in accordance with certain exemplary
implementations of the present invention.
[0026] FIGS. 17a-c are illustrative diagrams depicting Motion
Vector Predictors for one MV with additional predictors for
8×8 partitioning, in accordance with certain exemplary
implementations of the present invention.
DETAILED DESCRIPTION
[0027] Several improvements for use with Bidirectionally Predictive
(B) pictures within a video sequence are described below and
illustrated in the accompanying drawings. In certain improvements
Direct Mode encoding and/or Motion Vector Prediction are enhanced
using spatial prediction techniques. In other improvements Motion
Vector prediction includes temporal distance and subblock
information, for example, for more accurate prediction. Such
improvements and others presented herein significantly improve the
performance of any applicable video coding system/logic.
[0028] While these and other exemplary methods and apparatuses are
described, it should be kept in mind that the techniques of the
present invention are not limited to the examples described and
shown in the accompanying drawings, but are also clearly adaptable
to other similar existing and future video coding schemes, etc.
[0029] Before introducing such exemplary methods and apparatuses,
an introduction is provided in the following section for suitable
exemplary operating environments, for example, in the form of a
computing device and other types of devices/appliances.
[0030] Exemplary Operational Environments:
[0031] Turning to the drawings, wherein like reference numerals
refer to like elements, the invention is illustrated as being
implemented in a suitable computing environment. Although not
required, the invention will be described in the general context of
computer-executable instructions, such as program modules, being
executed by a personal computer.
[0032] Generally, program modules include routines, programs,
objects, components, data structures, etc. that perform particular
tasks or implement particular abstract data types. Those skilled in
the art will appreciate that the invention may be practiced with
other computer system configurations, including hand-held devices,
multi-processor systems, microprocessor-based or programmable
consumer electronics, network PCs, minicomputers, mainframe
computers, portable communication devices, and the like.
[0033] The invention may also be practiced in distributed computing
environments where tasks are performed by remote processing devices
that are linked through a communications network. In a distributed
computing environment, program modules may be located in both local
and remote memory storage devices.
[0034] FIG. 1 illustrates an example of a suitable computing
environment 120 on which the subsequently described systems,
apparatuses and methods may be implemented. Exemplary computing
environment 120 is only one example of a suitable computing
environment and is not intended to suggest any limitation as to the
scope of use or functionality of the improved methods and systems
described herein. Neither should computing environment 120 be
interpreted as having any dependency or requirement relating to any
one or combination of components illustrated in computing
environment 120.
[0035] The improved methods and systems herein are operational with
numerous other general purpose or special purpose computing system
environments or configurations. Examples of well-known computing
systems, environments, and/or configurations that may be suitable
include, but are not limited to, personal computers, server
computers, thin clients, thick clients, hand-held or laptop
devices, multiprocessor systems, microprocessor-based systems, set
top boxes, programmable consumer electronics, network PCs,
minicomputers, mainframe computers, distributed computing
environments that include any of the above systems or devices, and
the like.
[0036] As shown in FIG. 1, computing environment 120 includes a
general-purpose computing device in the form of a computer 130. The
components of computer 130 may include one or more processors or
processing units 132, a system memory 134, and a bus 136 that
couples various system components including system memory 134 to
processor 132.
[0037] Bus 136 represents one or more of any of several types of
bus structures, including a memory bus or memory controller, a
peripheral bus, an accelerated graphics port, and a processor or
local bus using any of a variety of bus architectures. By way of
example, and not limitation, such architectures include Industry
Standard Architecture (ISA) bus, Micro Channel Architecture (MCA)
bus, Enhanced ISA (EISA) bus, Video Electronics Standards
Association (VESA) local bus, and Peripheral Component
Interconnect (PCI) bus, also known as Mezzanine bus.
[0038] Computer 130 typically includes a variety of computer
readable media. Such media may be any available media that is
accessible by computer 130, and it includes both volatile and
non-volatile media, removable and non-removable media.
[0039] In FIG. 1, system memory 134 includes computer readable
media in the form of volatile memory, such as random access memory
(RAM) 140, and/or non-volatile memory, such as read only memory
(ROM) 138. A basic input/output system (BIOS) 142, containing the
basic routines that help to transfer information between elements
within computer 130, such as during start-up, is stored in ROM 138.
RAM 140 typically contains data and/or program modules that are
immediately accessible to and/or presently being operated on by
processor 132.
[0040] Computer 130 may further include other
removable/non-removable, volatile/non-volatile computer storage
media. For example, FIG. 1 illustrates a hard disk drive 144 for
reading from and writing to a non-removable, non-volatile magnetic
media (not shown and typically called a "hard drive"), a magnetic
disk drive 146 for reading from and writing to a removable,
non-volatile magnetic disk 148 (e.g., a "floppy disk"), and an
optical disk drive 150 for reading from or writing to a removable,
non-volatile optical disk 152 such as a CD-ROM/R/RW,
DVD-ROM/R/RW/+R/RAM or other optical media. Hard disk drive 144,
magnetic disk drive 146 and optical disk drive 150 are each
connected to bus 136 by one or more interfaces 154.
[0041] The drives and associated computer-readable media provide
nonvolatile storage of computer readable instructions, data
structures, program modules, and other data for computer 130.
Although the exemplary environment described herein employs a hard
disk, a removable magnetic disk 148 and a removable optical disk
152, it should be appreciated by those skilled in the art that
other types of computer readable media which can store data that is
accessible by a computer, such as magnetic cassettes, flash memory
cards, digital video disks, random access memories (RAMs), read
only memories (ROM), and the like, may also be used in the
exemplary operating environment.
[0042] A number of program modules may be stored on the hard disk,
magnetic disk 148, optical disk 152, ROM 138, or RAM 140,
including, e.g., an operating system 158, one or more application
programs 160, other program modules 162, and program data 164.
[0043] The improved methods and systems described herein may be
implemented within operating system 158, one or more application
programs 160, other program modules 162, and/or program data
164.
[0044] A user may provide commands and information into computer
130 through input devices such as keyboard 166 and pointing device
168 (such as a "mouse"). Other input devices (not shown) may
include a microphone, joystick, game pad, satellite dish, serial
port, scanner, camera, etc. These and other input devices are
connected to the processing unit 132 through a user input interface
170 that is coupled to bus 136, but may be connected by other
interface and bus structures, such as a parallel port, game port,
or a universal serial bus (USB).
[0045] A monitor 172 or other type of display device is also
connected to bus 136 via an interface, such as a video adapter 174.
In addition to monitor 172, personal computers typically include
other peripheral output devices (not shown), such as speakers and
printers, which may be connected through output peripheral
interface 175.
[0046] Computer 130 may operate in a networked environment using
logical connections to one or more remote computers, such as a
remote computer 182. Remote computer 182 may include many or all of
the elements and features described herein relative to computer
130.
[0047] Logical connections shown in FIG. 1 are a local area network
(LAN) 177 and a general wide area network (WAN) 179. Such
networking environments are commonplace in offices, enterprise-wide
computer networks, intranets, and the Internet.
[0048] When used in a LAN networking environment, computer 130 is
connected to LAN 177 via network interface or adapter 186. When
used in a WAN networking environment, the computer typically
includes a modem 178 or other means for establishing communications
over WAN 179. Modem 178, which may be internal or external, may be
connected to system bus 136 via the user input interface 170 or
other appropriate mechanism.
[0049] Depicted in FIG. 1 is a specific implementation of a WAN
via the Internet. Here, computer 130 employs modem 178 to establish
communications with at least one remote computer 182 via the
Internet 180.
[0050] In a networked environment, program modules depicted
relative to computer 130, or portions thereof, may be stored in a
remote memory storage device. Thus, e.g., as depicted in FIG. 1,
remote application programs 189 may reside on a memory device of
remote computer 182. It will be appreciated that the network
connections shown and described are exemplary and other means of
establishing a communications link between the computers may be
used.
[0051] Attention is now drawn to FIG. 2, which is a block diagram
depicting another exemplary device 200 that is also capable of
benefiting from the methods and apparatuses disclosed herein.
Device 200 is representative of any one or more devices or
appliances that are operatively configured to process video and/or
any related types of data in accordance with all or part of the
methods and apparatuses described herein and their equivalents.
Thus, device 200 may take the form of a computing device as in FIG.
1, or some other form, such as, for example, a wireless device, a
portable communication device, a personal digital assistant, a
video player, a television, a DVD player, a CD player, a karaoke
machine, a kiosk, a digital video projector, a flat panel video
display mechanism, a set-top box, a video game machine, etc. In
this example, device 200 includes logic 202 configured to process
video data, a video data source 204 configured to provide video data
to logic 202, and at least one display module 206 capable of
displaying at least a portion of the video data for a user to view.
Logic 202 is representative of hardware, firmware, software and/or
any combination thereof. In certain implementations, for example,
logic 202 includes a compressor/decompressor (codec), or the like.
Video data source 204 is representative of any mechanism that can
provide, communicate, output, and/or at least momentarily store
video data suitable for processing by logic 202. Video reproduction
source is illustratively shown as being within and/or without
device 200. Display module 206 is representative of any mechanism
that a user might view directly or indirectly and see the visual
results of video data presented thereon. Additionally, in certain
implementations, device 200 may also include some form of
capability for reproducing or otherwise handling audio data
associated with the video data. Thus, an audio reproduction module
208 is shown.
[0052] With the examples of FIGS. 1 and 2 in mind, and others like
them, the next sections focus on certain exemplary methods and
apparatuses that may be at least partially practiced with
such environments and with such devices.
[0053] Encoding Bidirectionally Predictive (B) Pictures And Motion
Vector Prediction
[0054] This section describes several exemplary improvements that
can be implemented to encode Bidirectionally Predictive (B)
pictures and Motion Vector prediction within a video coding system
or the like. The exemplary methods and apparatuses can be applied
to predict motion vectors and to enhance the design of the B
picture Direct Mode. Such methods and apparatuses are particularly
suitable for multiple picture reference codecs, such as, for
example, JVT, and can achieve considerable coding gains especially
for panning sequences or scene changes.
[0055] Bidirectionally Predictive (B) pictures are an important
part of most video coding standards and systems since they tend to
increase the coding efficiency of such systems, for example, when
compared to only using Predictive (P) pictures. This improvement in
coding efficiency is mainly achieved by the consideration of
bidirectional motion compensation, which can effectively improve
motion compensated prediction and thus allow the encoding of
significantly reduced residue information. Furthermore, the
introduction of the Direct Prediction mode for a Macroblock/block
within such pictures can further increase efficiency considerably
(e.g., more than 10-20%) since no motion information is encoded.
Such may be accomplished, for example, by allowing the prediction
of both forward and backward motion information to be derived
directly from the motion vectors used in the corresponding
macroblock of a subsequent reference picture.
[0056] By way of example, FIG. 4 illustrates Direct Prediction of a
B picture at time t+1 based on P frames at times t and t+2, and
the applicable motion vectors (MVs). Here, an assumption is made
that an object in the picture is moving with constant speed. This
makes it possible to predict a current position inside a B picture
without having to transmit any motion vectors. The motion vectors
($\vec{MV}_{fw}$, $\vec{MV}_{bw}$) of the Direct Mode are derived
from the motion vector $\vec{MV}$ of the collocated MB in the first
subsequent P reference picture basically as:
$\vec{MV}_{fw} = \frac{TR_B \cdot \vec{MV}}{TR_D}$ and
$\vec{MV}_{bw} = \frac{(TR_B - TR_D) \cdot \vec{MV}}{TR_D}$,
[0057] where $TR_B$ is the temporal distance between the current
B picture and the reference picture pointed to by the forward MV of
the collocated MB, and $TR_D$ is the temporal distance between
the future reference picture and the reference picture pointed to by
the forward MV of the collocated MB.
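As a minimal sketch of the temporal Direct Mode scaling in paragraph [0056], the function below (a hypothetical name) computes both Direct Mode vectors from the collocated forward MV. It assumes exact rational arithmetic; practical codecs typically use fixed-point scaling with a specific rounding rule, which is not modeled here.

```python
from fractions import Fraction
from typing import Tuple

Vec = Tuple[int, int]  # (horizontal, vertical) motion vector components


def temporal_direct(mv_col: Vec, tr_b: int, tr_d: int):
    """Scale the collocated MB's forward MV into Direct Mode (fw, bw) MVs.

    mv_col: forward MV of the collocated MB in the first subsequent P picture.
    tr_b:   distance from the current B picture to the picture mv_col points to.
    tr_d:   distance from the future reference to that same picture.
    """
    if tr_d == 0:
        raise ValueError("TR_D must be non-zero")
    # MV_fw = TR_B * MV / TR_D
    mv_fw = tuple(Fraction(tr_b * c, tr_d) for c in mv_col)
    # MV_bw = (TR_B - TR_D) * MV / TR_D  (points backward, hence the sign flip)
    mv_bw = tuple(Fraction((tr_b - tr_d) * c, tr_d) for c in mv_col)
    return mv_fw, mv_bw
```

For a B picture at t+1 whose collocated MV in the P picture at t+2 points to t ($TR_B = 1$, $TR_D = 2$), a collocated MV of (8, -4) yields a forward MV of (4, -2) and a backward MV of (-4, 2), consistent with the constant-speed assumption.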
[0058] Unfortunately there are several cases where the existing
Direct Mode does not provide an adequate solution, thus not
efficiently exploiting the properties of this mode. In particular,
when the collocated Macroblock in the subsequent P picture is Intra
coded, existing designs of this mode usually force the motion
parameters of the Direct Macroblock to be zero. For
example, see FIG. 6, which illustrates handling of collocated intra
within existing codecs wherein motion is assumed to be zero. This
essentially means that, for this case, the B picture Macroblock
will be coded as the average of the two collocated Macroblocks in
the first subsequent and past P references. This immediately raises
the following concern: if a Macroblock is Intra-coded, how does one
know how much relationship it has with the collocated Macroblock of
its reference picture? In some situations, there may be little if
any actual relationship. Hence, the coding efficiency of the Direct
Mode may be reduced. An extreme case is a scene change, as
illustrated in FIG. 5. FIG. 5 illustrates what happens when a scene
change occurs in the video sequence and/or when the collocated block
is intra. Here, no relationship exists between the two reference
pictures given the scene change. In such a case bidirectional
prediction would provide little if any benefit, and the Direct Mode
could be completely wasted.
Unfortunately, conventional implementations of the Direct Mode
restrict it to always perform a bidirectional prediction of a
Macroblock.
[0059] FIG. 7 is an illustrative diagram depicting how Direct Mode
is handled when the reference picture of the collocated block in
the subsequent P picture is other than zero, in accordance with
certain implementations of the present invention.
[0060] An additional issue with Direct Mode Macroblocks exists
when multi-picture reference motion compensation is used. Until
recently, for example, the JVT standard provided the timing
distance information ($TR_B$ and $TR_D$), thus allowing for the
proper scaling of the parameters. Recently, this was changed in the
new revision of the codec (see, e.g., Joint Video Team (JVT) of
ISO/IEC MPEG and ITU-T VCEG, "Joint Committee Draft (CD) of Joint
Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10
AVC)", ITU-T JVT-C167, May 2002, which is incorporated herein by
reference). In the new revision, the motion vector parameters of
the subsequent P picture are to be scaled equally for the Direct
Mode prediction, without taking into account the reference picture
information. This could lead to significant performance degradation
of the Direct Mode, since the constant motion assumption is no
longer followed.
[0061] Nevertheless, even if the temporal distance parameters were
available, it is not always certain that the usage of the Direct
Mode as defined previously is the most appropriate solution. In
particular, for the B pictures which are closer to a first forward
reference picture, the correlation might be much stronger with that
picture than with the subsequent reference picture. An extreme
example which could contain such cases could be a sequence where
scene A changes to scene B, and then moves back to scene A (e.g., as
may happen in a news bulletin, etc.). All the above could degrade
the performance of B picture encoding considerably since Direct Mode
will not be effectively exploited within the encoding process.
[0062] With these and other concerns in mind, unlike the previous
definitions of the Direct Mode where only temporal prediction was
used, in accordance with certain aspects of the present invention,
a new Direct Macroblock type is introduced wherein temporal
prediction and/or spatial prediction are considered. The type(s) of
prediction used can depend on the type of reference picture
information of the first subsequent P reference picture, for
example.
[0063] In accordance with certain other aspects of the present
invention, motion vector prediction for both P and B pictures may
also be considerably improved when multiple picture references are
used, by taking temporal distances into consideration, if such are
available.
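One way to take temporal distances into account is the form recited in claim 107: $\vec{MV}_{pred} = TR \times \mathrm{Median}(\vec{MV}_A/TR_A, \vec{MV}_B/TR_B, \vec{MV}_C/TR_C)$. The sketch below is illustrative only; it assumes a component-wise median (as in conventional median MV prediction) and exact rational arithmetic, and the function name is hypothetical.

```python
from fractions import Fraction
from statistics import median
from typing import Sequence, Tuple


def scaled_median_predictor(
    neighbors: Sequence[Tuple[Tuple[int, int], int]], tr: int
) -> Tuple[Fraction, Fraction]:
    """neighbors: [(mv, tr_n), ...] for predictors A, B, C, where tr_n is the
    temporal distance of the reference picture that neighbor uses.
    tr: temporal distance of the current block's reference picture."""
    # Normalize each neighbor MV to motion per unit of temporal distance.
    normalized = [(Fraction(mv[0], t), Fraction(mv[1], t)) for mv, t in neighbors]
    med_x = median(v[0] for v in normalized)
    med_y = median(v[1] for v in normalized)
    # Rescale the median "velocity" to the current reference distance.
    return (tr * med_x, tr * med_y)
```

With A = ((4, 2), 1), B = ((6, 0), 2), C = ((-3, 9), 3) and $TR = 2$, the normalized vectors are (4, 2), (3, 0) and (-1, 3), giving a prediction of (6, 4); a plain median of the unscaled vectors would give (4, 2) regardless of which reference pictures the neighbors use.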
[0064] These enhancements are implemented in certain exemplary
methods and apparatuses as described below. The methods and
apparatuses can achieve significant bitrate reductions while
maintaining similar or better quality.
[0065] Direct Mode Enhancements:
[0066] In most conventional video coding systems, Direct Mode is
designed as a bidirectional prediction scheme where motion
parameters are always predicted in a temporal way from the motion
parameters in the subsequent P images. In this section, an enhanced
Direct Mode technique is provided in which spatial information may
also/alternatively be considered for such predictions.
[0067] One or more of the following exemplary techniques may be
implemented as needed, for example, depending on the complexity
and/or specifications of the system.
[0068] One technique is to implement spatial prediction of the
motion vector parameters of the Direct Mode without considering
temporal prediction. Spatial prediction can be accomplished, for
example, using existing Motion Vector prediction techniques used
for motion vector encoding (e.g., median prediction). If
multiple picture references are used, then the reference picture of
the adjacent blocks may also be considered (even though there is no
such restriction and the same reference, e.g. 0, could always be
used).
[0069] Motion parameters and reference pictures could be predicted
as follows and with reference to FIG. 3, which illustrates spatial
prediction associated with portions A-E (e.g., macroblocks,
slices, etc.) assumed to be available and part of a picture. Here,
E is predicted in general from A, B, C as Median(A, B, C). If C is
actually outside of the picture, then D is used instead. If B, C,
and D are outside of the picture, then only A is used, whereas if A
does not exist, it is replaced with (0,0). Those skilled in the
art will recognize that spatial prediction may be done at a
subblock level as well.
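The neighbor rules of paragraph [0069] can be sketched as below. This is a minimal illustration under stated assumptions: `None` marks a neighbor outside the picture, the median is taken per component, and any case the text does not cover (for example, B missing while C is present) substitutes a zero vector for the missing predictor. The function name is hypothetical.

```python
from statistics import median
from typing import Optional, Tuple

Vec = Tuple[int, int]


def spatial_predict(a: Optional[Vec], b: Optional[Vec],
                    c: Optional[Vec], d: Optional[Vec]) -> Vec:
    """Predict E's motion vector from neighbors A, B, C, D of FIG. 3."""
    if a is None:
        a = (0, 0)          # A does not exist: replaced with (0,0)
    if b is None and c is None and d is None:
        return a            # B, C and D outside the picture: only A is used
    if c is None:
        c = d               # C outside the picture: D is used instead
    # Assumption (not specified in the text): any predictor still missing
    # is treated as a zero vector.
    b = b if b is not None else (0, 0)
    c = c if c is not None else (0, 0)
    # Component-wise Median(A, B, C)
    return (median([a[0], b[0], c[0]]), median([a[1], b[1], c[1]]))
```

The same function can be applied per subblock by passing the subblock's neighbors instead of macroblock neighbors.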
[0070] In general, spatial prediction can be seen as a linear or
nonlinear function of all available motion information calculated
within a picture or a group of macroblocks/blocks within the same
picture.
[0071] Various methods may be used to predict the reference picture
for Direct Mode. For example, one
method may be to select a minimum reference picture among the
predictions. In another method, a median reference picture may be
selected. In certain methods, a selection may be made between a
minimum reference picture and median reference picture, e.g., if
the minimum is zero. In still other implementations, a higher
priority could also be given to either vertical or horizontal
predictors (A and B) due to their possibly stronger correlation
with E.
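The minimum and median selection rules can be sketched as follows, assuming reference pictures are identified by integer indices and `None` marks an intra or unavailable neighbour. Defaulting to reference 0 when no neighbour has a reference, and taking the lower median for an even count, are assumptions of this sketch.

```python
from typing import List, Optional


def _available(refs: List[Optional[int]]) -> List[int]:
    """Sorted reference indices of neighbours that have one (None = intra)."""
    return sorted(r for r in refs if r is not None)


def min_reference(refs: List[Optional[int]]) -> int:
    """Smallest reference index among the spatial predictors."""
    avail = _available(refs)
    return avail[0] if avail else 0  # assumption: default to reference 0


def median_reference(refs: List[Optional[int]]) -> int:
    """Median reference index (lower median when the count is even)."""
    avail = _available(refs)
    return avail[(len(avail) - 1) // 2] if avail else 0
```

For neighbours using references 3, 1, and 2, the minimum rule picks 1 while the median rule picks 2.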
[0072] If one of the predictions does not exist (e.g., all
surrounding macroblocks are predicted only in the forward (FW) or
only in the backward (BW) direction, or are intra), then either only
the existing one is used (single-direction prediction) or the
missing one could be predicted from the one available. For example,
if forward prediction is available then:
$$\vec{MV}_{bw} = \frac{(TR_B - TR_D)\,\vec{MV}_{fw}}{TR_B}$$
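The equation above can be evaluated with exact rational arithmetic, as in this sketch. It assumes the usual temporal Direct Mode convention that $TR_B$ is the temporal distance from the current B picture to its forward reference and $TR_D$ the distance between the forward and backward references; rounding to the nearest integer is also an assumption.

```python
from fractions import Fraction
from typing import Tuple

MV = Tuple[int, int]


def backward_from_forward(mv_fw: MV, tr_b: int, tr_d: int) -> MV:
    """Derive MV_bw = (TR_B - TR_D) * MV_fw / TR_B, rounded to the nearest integer.

    Fraction keeps the scaling exact; Python's round() resolves exact
    halves to the nearest even integer.
    """
    factor = Fraction(tr_b - tr_d, tr_b)
    return (round(mv_fw[0] * factor), round(mv_fw[1] * factor))


print(backward_from_forward((8, -4), tr_b=1, tr_d=3))  # (-16, 8)
```

Because $TR_B < TR_D$, the factor is negative, so the derived backward vector points in the opposite direction to the forward one.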
[0073] As in existing codecs, temporal prediction is used for
macroblocks if the subsequent P reference is non-intra. Attention
is now drawn to FIG. 8, in which $\vec{MV}_{FW}$ and $\vec{MV}_{BW}$
are derived from spatial prediction (the median MV of the
surrounding macroblocks). If either one is not available (i.e.,
there are no predictors), single-direction prediction is used. If
the subsequent P reference is intra, spatial prediction can be used
instead, as described above. Assuming that no restrictions exist,
if one of the predictions is not available, Direct Mode becomes a
single-direction prediction mode.
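The decision described above might be organised as in this sketch. The mode labels and parameter names are hypothetical, the temporal vectors are assumed to have been computed elsewhere, and the zero-motion fallback when no spatial predictor exists at all is an assumption rather than something the text specifies.

```python
from typing import Optional, Tuple

MV = Tuple[int, int]
Direct = Tuple[str, Optional[MV], Optional[MV]]  # (mode, forward MV, backward MV)


def direct_mode_decision(reference_is_intra: bool,
                         temporal_fw: MV, temporal_bw: MV,
                         spatial_fw: Optional[MV],
                         spatial_bw: Optional[MV]) -> Direct:
    """Choose how a Direct Mode macroblock is predicted."""
    if not reference_is_intra:
        # Non-intra subsequent P reference: conventional temporal prediction
        return ("temporal", temporal_fw, temporal_bw)
    # Intra P reference: use spatial (median) prediction instead
    if spatial_fw is not None and spatial_bw is not None:
        return ("spatial-bi", spatial_fw, spatial_bw)
    if spatial_fw is not None:
        return ("spatial-fw", spatial_fw, None)  # single direction
    if spatial_bw is not None:
        return ("spatial-bw", None, spatial_bw)  # single direction
    # Hypothetical fallback: no predictors at all, use zero motion both ways
    return ("spatial-bi", (0, 0), (0, 0))
```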
[0074] Such spatial prediction could considerably benefit video
coding at scene changes, as illustrated for example in FIG. 9, and
even when fading exists within a video sequence. As illustrated in
FIG. 9, spatial prediction may be used to solve the problem of a
scene change.
[0075] If temporal distance information is not available within a
codec, temporal prediction will be less efficient in Direct Mode for
blocks whose collocated P reference block uses a non-zero reference
picture. In such a case, spatial prediction may also be used as
above. Alternatively, one may estimate scaling parameters if one of
the surrounding macroblocks uses the same reference picture as the
collocated P reference block. Furthermore, special handling may be
provided for the case of zero (or close to zero) motion with a
non-zero reference: here, regardless of temporal distance, the
forward and backward motion vectors could always be taken as zero.
The best solution, however, may be to always examine the reference
picture information of the surrounding macroblocks and, based on
it, decide how Direct Mode should be handled in such a case.
[0076] More particularly, for example, given a non-zero reference,
the following sub-cases may be considered:
[0077] Case A: Temporal prediction is used if the motion vectors of
the collocated P block are zero.
[0078] Case B: If all surrounding macroblocks use different
reference pictures than the collocated P reference, then spatial
prediction appears to be a better choice and temporal prediction is
not used.
[0079] Case C: If the motion flow inside the B picture appears to
be quite different from that in the P reference picture, then
spatial prediction is used instead.
[0080] Case D: Spatial or temporal prediction of Direct Mode
macroblocks could be signaled inside the image header. A
pre-analysis of the image could be performed to decide which should
be used.
[0081] Case E: Correction of the temporally predicted parameters
based on spatial information (or vice versa). For example, if both
have the same or approximately the same phase information, the
spatial information could be a very good candidate for the Direct
Mode prediction. A correction could also be applied to the phase,
thus correcting the sub-pixel accuracy of the prediction.
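Cases A through C can be combined into one decision function, sketched below. The text does not define how "quite different" motion flow is measured, so the L1 distance between the spatial MV and the collocated MV and the `flow_threshold` value are hypothetical stand-ins; cases D and E are not modelled.

```python
from typing import List, Optional, Tuple

MV = Tuple[int, int]


def choose_direct_predictor(colloc_mv: MV, colloc_ref: int,
                            neighbour_refs: List[Optional[int]],
                            spatial_mv: MV, flow_threshold: int = 16) -> str:
    """Return "temporal" or "spatial" for a Direct Mode block.

    neighbour_refs holds the reference index of each surrounding macroblock
    (None = intra/unavailable). flow_threshold is a hypothetical tuning value.
    """
    if colloc_ref == 0:
        return "temporal"  # zero reference: the sub-cases below do not apply
    # Case A: collocated motion is zero -> temporal prediction
    if colloc_mv == (0, 0):
        return "temporal"
    # Case B: every available neighbour uses a different reference
    avail = [r for r in neighbour_refs if r is not None]
    if avail and all(r != colloc_ref for r in avail):
        return "spatial"
    # Case C (hypothetical measure): spatial and collocated motion differ a lot
    diff = abs(spatial_mv[0] - colloc_mv[0]) + abs(spatial_mv[1] - colloc_mv[1])
    if diff > flow_threshold:
        return "spatial"
    return "temporal"
```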
[0082] FIG. 10 illustrates a joint spatio-temporal prediction for
Direct Mode in B picture coding. In this example, Direct Mode can
be a 1- to 4-direction mode depending on the information available.
Instead of using bidirectional prediction for Direct Mode
macroblocks, a multi-hypothesis extension of this mode can be used,
with multiple predictions employed instead.
[0083] Combined with the discussion above, Direct Mode macroblocks
can be predicted using one to four possible motion vectors,
depending on the information available. This can be decided, for
example, based on the mode of the collocated P reference image
macroblock and on the surrounding macroblocks in the current B
picture. In such a case, if the spatial prediction is too different
from the temporal one, one of them could be selected as the sole
prediction in favor of the other. Since spatial prediction, as
described previously, might favor a different reference picture
than the temporal one, the same macroblock might be predicted from
more than two reference pictures.
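A sketch of the multi-hypothesis idea follows. The rule for discarding the spatial candidates (an L1 distance above a hypothetical `max_diff` between the first temporal and first spatial MV, keeping temporal in that case) and the equal-weight sample averaging are assumptions; the text only says that one prediction may be chosen over the other and that multiple predictions may be combined.

```python
from typing import List, Sequence, Tuple

Hypothesis = Tuple[int, Tuple[int, int]]  # (reference index, motion vector)


def gather_hypotheses(temporal: Sequence[Hypothesis],
                      spatial: Sequence[Hypothesis],
                      max_diff: int = 8) -> List[Hypothesis]:
    """Collect 1 to 4 Direct Mode hypotheses from temporal and spatial candidates.

    Hypothetical rule: if the first spatial MV differs from the first temporal
    MV by more than max_diff (L1 distance), keep only the temporal candidates.
    """
    if temporal and spatial:
        (tx, ty), (sx, sy) = temporal[0][1], spatial[0][1]
        if abs(tx - sx) + abs(ty - sy) > max_diff:
            return list(temporal)
    return list(temporal) + list(spatial)


def average_predictions(blocks: Sequence[Sequence[int]]) -> List[int]:
    """Equal-weight average of N motion-compensated blocks, rounded to nearest."""
    n = len(blocks)
    return [(sum(samples) + n // 2) // n for samples in zip(*blocks)]
```

When a spatial candidate uses a reference index (say 2) that the temporal candidates do not, the resulting set spans three reference pictures, matching the remark above.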
[0084] The JVT standard does not restrict the first future
reference to be a P picture. Hence, in such a standard, this
picture can be a B picture, as illustrated in FIG. 12, or even a
Multi-Hypothesis (MH) picture. This implies that more motion vectors
are assigned per macroblock, a property that one may also use to
increase the efficiency of Direct Mode by more effectively
exploiting the additional motion information.
[0085] In FIG. 12, the first subsequent reference picture is a B
picture (pictures $B_8$ and $B_9$). This enables one to use more
candidates for Direct Mode prediction, especially if bidirectional
prediction is used within the B picture.
[0086] In particular one may perform the following:
[0087] a.) If the collocated reference block in the first future
reference uses bidirectional prediction, the corresponding motion
vectors (forward or backward) are used for calculating the motion
vectors of the current block. Since the backward motion vector of
the reference corresponds to a future reference picture, special
care should be taken when estimating the current motion parameters.
For example, as illustrated in FIG. 12, the backward motion vector
of $B_8$, $\vec{MV}_{B_8,bw}$, can be calculated as
$2 \times \vec{MV}_{B_7,bw}$ due to the temporal distances between
$B_8$, $B_7$, and $P_6$. Similarly, for $B_9$ the backward motion
vector can be taken as $\vec{MV}_{B_7,bw}$, even though these
vectors refer to $B_7$. One may also restrict these to refer to the
first subsequent P picture, in which case the motion vectors can be
scaled accordingly. A similar conclusion can be drawn for the
forward motion vectors. Multiple picture references or intra
macroblocks can be handled similarly to the previous discussion.
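The doubling of $\vec{MV}_{B_7,bw}$ for $B_8$ is a ratio of temporal distances. A minimal sketch of that scaling is below; the function name and the example distances (a ratio of 2, as for $B_8$) are illustrative assumptions, since the exact picture timing of FIG. 12 is not reproduced here.

```python
from fractions import Fraction
from typing import Tuple

MV = Tuple[int, int]


def scale_mv(mv: MV, target_distance: int, source_distance: int) -> MV:
    """Scale a motion vector by the ratio of temporal distances.

    source_distance: temporal distance spanned by the reference picture's MV.
    target_distance: temporal distance the derived MV must span.
    """
    factor = Fraction(target_distance, source_distance)
    return (round(mv[0] * factor), round(mv[1] * factor))


mv_b7_bw = (3, -2)
print(scale_mv(mv_b7_bw, target_distance=2, source_distance=1))  # (6, -4)
```

Restricting the vectors to refer to the first subsequent P picture amounts to calling the same function with that picture's distance as `target_distance`.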
[0088] b.) If bidirectional prediction for the collocated block is
used, then, in this example, one may estimate four possible
predictions for one macroblock for the direct mode case by
projecting and inverting the backward and forward motion vectors of
the reference.
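Projection and inversion can be expressed with one linear-trajectory rule: a collocated vector spanning from time $t_{col}$ to its reference at $t_{ref}$ is rescaled to span from the current picture to the target reference, and the sign flips automatically when the target lies in the opposite direction. The sketch assumes integer display times for all pictures; the names are hypothetical. With the forward collocated vector this reduces to the temporal Direct Mode scaling ($TR_B/TR_D$ and $(TR_B - TR_D)/TR_D$).

```python
from fractions import Fraction
from typing import List, Tuple

MV = Tuple[int, int]


def project(mv: MV, t_col: int, t_mv_ref: int, t_cur: int, t_target: int) -> MV:
    """Project a collocated MV onto the current block along its motion trajectory.

    mv spans from the collocated picture (time t_col) to its reference (t_mv_ref);
    the result spans from the current picture (t_cur) to t_target.
    """
    factor = Fraction(t_target - t_cur, t_mv_ref - t_col)
    return (round(mv[0] * factor), round(mv[1] * factor))


def four_direct_predictions(col_fw: MV, t_col_fw_ref: int,
                            col_bw: MV, t_col_bw_ref: int,
                            t_col: int, t_cur: int,
                            t_fw_ref: int, t_bw_ref: int) -> List[MV]:
    """Project both collocated MVs to both current references (4 candidates)."""
    out = []
    for mv, t_ref in ((col_fw, t_col_fw_ref), (col_bw, t_col_bw_ref)):
        out.append(project(mv, t_col, t_ref, t_cur, t_fw_ref))  # forward candidate
        out.append(project(mv, t_col, t_ref, t_cur, t_bw_ref))  # backward candidate
    return out
```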
[0089] c.) Selective projection and inversion may be used depending
on temporal distance. According to this solution, one selects the
motion vectors from the reference picture that are more reliable
for the prediction. For example, considering the illustration in
FIG. 12, one will note that $B_8$ is much closer to $P_2$ than to
$P_6$. This implies that the backward motion vector of $B_7$ may
not be a very reliable prediction. In this case, Direct Mode motion
vectors can therefore be calculated only from the forward
prediction of $B_7$. For $B_9$, however, both motion vectors seem
adequate for the prediction and may therefore be used. Such
decisions/information may also be decided or supported within the
header of the image. Other conditions and rules may also be
implemented. For example, the additional spatial confidence of a
prediction and/or the motion vector phase may be considered. Note,
in particular, that if the forward and backward motion vectors have
no relationship, the backward motion vector might be too unreliable
to use.
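One possible reliability test is sketched below. It is a hypothetical rule, not the one prescribed above: a collocated vector is kept only if its reference picture is no more than `max_ratio` times farther from the current picture than the nearest such reference. The picture times in the test are illustrative only.

```python
from typing import Dict, List


def reliable_vectors(ref_times: Dict[str, int], t_cur: int,
                     max_ratio: float = 2.0) -> List[str]:
    """Hypothetical rule: keep collocated MVs whose reference picture is not much
    farther from the current picture than the nearest one.

    ref_times maps a vector label (e.g. "fw", "bw") to its reference picture time.
    """
    dist = {name: abs(t - t_cur) for name, t in ref_times.items()}
    nearest = min(dist.values())
    return sorted(n for n, d in dist.items() if d <= max_ratio * max(nearest, 1))
```

A current picture close to the forward reference keeps only `"fw"`, mirroring the $B_8$ case, while one roughly midway keeps both, as for $B_9$.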
[0090] Single Picture Reference for B Pictures:
[0091] A special case exists when only one picture reference is
used for B pictures (although, typically, a forward and a backward
reference are necessary), regardless of how many reference pictures
are used in P pictures. From observations of encoding sequences in
the current JVT codec, for example, it was noted that, when
comparing the single-picture reference and multi-picture reference
cases using B pictures, even though the encoding performance of P
pictures in the multi-picture case is almost always superior to
that of the single-picture case, the same is not always true for B
pictures.
[0092] One reason for this observation is the overhead of signaling
the reference picture used for each macroblock. Considering that B
pictures rely more on motion information than P pictures, the
reference picture information overhead reduces the number of bits
that are transmitted for the residue information at a given
bitrate, thereby reducing efficiency. A rather simple and efficient
solution could be to select only one picture reference for each of
backward and forward motion compensation, so that no reference
picture information needs to be transmitted.
[0093] This is considered with reference to FIGS. 13 and 14. As
illustrated in FIG. 13, B pictures can be restricted to using only
one future and one past reference picture. Thus, for Direct Mode
motion vector calculation, projection of the motion vectors is
necessary. A projection of the collocated MVs to the current
reference for temporal direct prediction is illustrated in FIG. 14
(note that it is possible that $TD_{D,0} > TD_{D,1}$). Thus, in
this example, Direct Mode motion parameters are calculated by
projecting motion vectors that refer to other reference pictures
onto the two reference pictures, or by using spatial prediction as
in FIG. 13. Note that such options not only allow for possibly
reduced encoding complexity of B pictures, but also tend to reduce
memory requirements, since fewer B pictures (e.g., at most two)
need to be stored if B pictures are allowed to reference B
pictures. In certain cases, a reference picture of the first future
reference picture may no longer be available in the reference
buffer. This immediately creates a problem for estimating Direct
Mode macroblocks, and special handling of such cases is required.
Obviously, there is no such problem if a single picture reference
is used. However, if multiple picture references are desired,
possible solutions include projecting the motion vector(s) to the
first forward reference picture and/or to the reference picture
that was closest to the unavailable picture. Either solution could
be viable; again, spatial prediction could be an alternative
solution.
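The "closest available reference" option can be sketched as follows, assuming pictures are identified by integer display times and that the rescaling follows the same temporal-distance ratio used elsewhere in this section. The function name and the tie-breaking (first candidate wins) are hypothetical.

```python
from fractions import Fraction
from typing import Sequence, Tuple

MV = Tuple[int, int]


def project_to_available(mv: MV, t_col: int, t_missing: int,
                         available: Sequence[int]) -> Tuple[int, MV]:
    """Re-target a collocated MV whose reference is no longer buffered.

    Picks the available reference picture closest in time to the missing one
    and rescales the MV by the ratio of temporal distances.
    """
    t_target = min(available, key=lambda t: abs(t - t_missing))
    factor = Fraction(t_target - t_col, t_missing - t_col)
    return t_target, (round(mv[0] * factor), round(mv[1] * factor))
```

For a collocated picture at time 6 whose vector (12, -6) points to a discarded picture at time 0, with pictures at times 2 and 4 still buffered, the vector is re-targeted to time 2 and scaled by 2/3 to (8, -4).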
[0094] Refinements of the motion vector prediction for single- and
multi-picture reference motion compensation
[0095] Motion vector prediction for multi-picture reference motion
compensation can significantly affect the performance of both B and
P picture coding. Existing standards, such as JVT, do not always
consider the reference pictures of the macroblocks used in the
prediction. The only such consideration is made when only one of
the prediction macroblocks uses the same reference; in that case,
only that predictor is used for the motion prediction. The
reference picture is not considered if only one, or all, of the
predictors use a different reference.
[0096] In such a case, for example, and in accordance with certain
further aspects of the present invention, one can scale the
predictors according to their temporal distance relative to the
current reference. Attention is drawn to FIG. 11, which illustrates
motion vector prediction of a current block (C) that considers the
reference picture information of the predictor macroblocks (Pr) and
makes the proper adjustments (e.g., scaling of the predictors).
[0097] If predictors A, B, and C use reference pictures with
temporal distances $TR_A$, $TR_B$, and $TR_C$, respectively, and the
current reference picture has a temporal distance equal to $TR$,
then the median predictor is calculated as follows:
$$\vec{MV}_{pred} = TR \times \mathrm{Median}\left(\frac{\vec{MV}_A}{TR_A}, \frac{\vec{MV}_B}{TR_B}, \frac{\vec{MV}_C}{TR_C}\right)$$
[0098] If integer computation is to be used, it may be easier to
place the multiplication inside the median, thus increasing
accuracy. The division could also be replaced with shifting, but
that reduces performance, and it might also be necessary to handle
signed shifting (e.g., -1 >> N = -1). It is thus very important in
such cases to have the temporal distance information available for
performing the appropriate scaling. This information could also be
made available within the header, if it cannot be predicted
otherwise.
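The integer form with the multiplication moved inside the median can be sketched as follows. The sketch assumes positive temporal distances and uses round-to-nearest with ties away from zero; that rounding mode, and the function names, are choices of the sketch rather than something the text fixes.

```python
from typing import Sequence, Tuple

MV = Tuple[int, int]


def div_round(num: int, den: int) -> int:
    """Integer division rounded to nearest, ties away from zero (den > 0)."""
    q = (abs(num) + den // 2) // den
    return q if num >= 0 else -q


def scaled_median_predictor(mvs: Sequence[MV], trs: Sequence[int], tr: int) -> MV:
    """MV_pred = Median(TR*MV_A/TR_A, TR*MV_B/TR_B, TR*MV_C/TR_C), per component.

    The multiplication by TR is moved inside the median so that each
    predictor is scaled before the (rounded) integer division.
    """
    def med(vals):
        return sorted(vals)[len(vals) // 2]

    xs = [div_round(tr * mv[0], t) for mv, t in zip(mvs, trs)]
    ys = [div_round(tr * mv[1], t) for mv, t in zip(mvs, trs)]
    return (med(xs), med(ys))


# An arithmetic right shift floors toward minus infinity: -1 >> 1 == -1,
# which is why plain shifting is not a drop-in replacement for division.
print(scaled_median_predictor([(4, 2), (6, -3), (9, 0)], [1, 2, 3], tr=1))  # (3, 0)
```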
[0099] Motion vector prediction as discussed previously is
basically median-based, meaning that the median value among a set
of predictors is selected for the prediction. If one only uses one
type of macroblock (e.g., 16×16) with one motion vector (MV), these
predictors can be defined, for example, as illustrated in FIG. 15,
where MV predictors are shown for one MV. In FIG. 15a, the MB is
neither in the first row nor in the last column. In FIG. 15b, the
MB is in the last column. In FIG. 15c, the MB is in the first row.
[0100] The JVT standard improves on this further by also
considering the case where only one of the three predictors exists
(i.e., the other macroblocks are intra or use a different reference
picture in the case of multi-picture prediction). In such a case,
only the existing, or same-reference, predictor is used for the
prediction, and all others are not examined.
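The single-matching-reference rule can be sketched as below. Counting intra predictors as (0,0) in the median is a simplification of this sketch, not a statement of the exact JVT specification behaviour, and the names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

MV = Tuple[int, int]
Predictor = Tuple[Optional[int], MV]  # (reference index or None if intra, MV)


def same_reference_predictor(preds: Sequence[Predictor], cur_ref: int) -> MV:
    """Median prediction with the single-matching-reference rule."""
    same = [mv for ref, mv in preds if ref == cur_ref]
    if len(same) == 1:
        return same[0]  # only one predictor shares the reference: use it alone
    # Otherwise: component-wise median; intra predictors count as (0, 0) here
    mvs = [mv if ref is not None else (0, 0) for ref, mv in preds]
    xs = sorted(mv[0] for mv in mvs)
    ys = sorted(mv[1] for mv in mvs)
    return (xs[len(xs) // 2], ys[len(ys) // 2])
```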
[0101] Intra coding does not always imply that a new object has
appeared or that the scene has changed. It might instead be the
case, for example, that motion estimation and compensation are
inadequate to represent the current object (e.g., due to the search
range, the motion estimation algorithm used, quantization of the
residue, etc.) and that better results could be achieved through
intra coding instead. The available motion predictors could still
be adequate to provide a good motion vector predictor.
[0102] What is intriguing is the consideration of subblocks within
a macroblock, each of which may be assigned different motion
information. The MPEG-4 and H.263 standards, for example, can have
up to four such subblocks (e.g., of size 8×8), whereas the JVT
standard allows up to sixteen subblocks while also being able to
handle variable block sizes (e.g., 4×4, 4×8, 8×4, 8×8, 8×16, 16×8,
and 16×16). In addition, JVT also allows 8×8 intra subblocks, thus
complicating things even further.
[0103] Considering the common cases of JVT and MPEG-4/H.263 (8×8
and 16×16), the predictor set for a 16×16 macroblock is illustrated
in FIGS. 16a-c, which are arranged similarly to FIGS. 15a-c,
respectively. Here, motion vector predictors are shown for one MV
with 8×8 partitions. Even though the described predictors could
give reasonable results in some cases, they may not adequately
cover all possible predictions.
[0104] Attention is drawn next to FIGS. 17a-c, which are also
arranged similarly to FIGS. 15a-c, respectively. In FIGS. 17a-c,
there are two additional predictors that could also be considered
in the prediction phase ($C_1$ and $A_2$). If 4×4 blocks are also
considered, the number of possible predictors increases by four.
[0105] Instead of employing a median of the three predictors A, B,
and C (or $A_1$, B, and $C_2$), one may now have some additional,
and apparently more reliable, options. For example, one can observe
that predictors $A_1$ and $C_2$ are essentially too close to one
another and may not be sufficiently representative in the
prediction phase. Instead, selecting predictors $A_1$, $C_1$, and B
seems to be a more reliable solution due to their separation. An
alternative could be the selection of $A_2$ instead of $A_1$, but
that may again be too close to predictor B. Simulations suggest that
the first case is usually a better choice. For the last column,
$A_2$ could be used instead of $A_1$. For the first row, either one
of $A_1$ and $A_2$, or even their average value, could be used. A
gain of up to 1% was noted within JVT with this implementation.
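The position-dependent predictor choice just described can be sketched as follows. Predictor labels follow FIGS. 17a-c (not reproduced here); using the average of $A_1$ and $A_2$ for the first row picks one of the listed options, and rounding the average half up is an assumption.

```python
from typing import Dict, Tuple

MV = Tuple[int, int]


def med3(a: int, b: int, c: int) -> int:
    """Median of three integers."""
    return sorted((a, b, c))[1]


def subblock_predictor(mvs: Dict[str, MV], first_row: bool, last_column: bool) -> MV:
    """Choose the predictor set of FIGS. 17a-c.

    mvs maps the labels "A1", "A2", "B", "C1" to motion vectors.
    """
    if first_row:
        # First row: use the average of A1 and A2 (rounded half up)
        (x1, y1), (x2, y2) = mvs["A1"], mvs["A2"]
        return ((x1 + x2 + 1) // 2, (y1 + y2 + 1) // 2)
    a = mvs["A2"] if last_column else mvs["A1"]  # A2 replaces A1 in the last column
    b, c = mvs["B"], mvs["C1"]
    return (med3(a[0], b[0], c[0]), med3(a[1], b[1], c[1]))
```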
[0106] The previous case adds some tests for the last column. By
examining FIG. 17b, for example, it can be seen that this tends to
provide the best partitioning available. Thus, an optional solution
could be the selection of $A_2$, $C_1$, and B (from the upper-left
position). This may not always be recommended, however, since such
an implementation may adversely affect the performance of the right
predictors.
[0107] An alternative solution would be to use averages of the
predictors within a macroblock. The median may then be computed as
follows:
$$\vec{MV}_{pred} = \mathrm{Median}\left(\mathrm{Ave}(\vec{MV}_{C_1}, \vec{MV}_{C_2}),\ \mathrm{Ave}(\vec{MV}_{A_1}, \vec{MV}_{A_2}),\ \vec{MV}_B\right)$$
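The averaged median can be sketched directly from the formula. The text does not say how $\mathrm{Ave}$ rounds, so rounding half up here is an assumption, as are the function names.

```python
from typing import Tuple

MV = Tuple[int, int]


def ave(p: MV, q: MV) -> MV:
    """Component-wise average of two MVs (rounded half up, an assumption)."""
    return ((p[0] + q[0] + 1) // 2, (p[1] + q[1] + 1) // 2)


def averaged_median(c1: MV, c2: MV, a1: MV, a2: MV, b: MV) -> MV:
    """MV_pred = Median(Ave(C1, C2), Ave(A1, A2), B), per component."""
    vals = (ave(c1, c2), ave(a1, a2), b)
    return (sorted(v[0] for v in vals)[1], sorted(v[1] for v in vals)[1])
```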
[0108] For median row/column calculation, the median can be
calculated as:
$$\vec{MV}_{pred} = \mathrm{Median}\Big(\mathrm{Median}(\vec{MV}_{C_1}, \vec{MV}_{C_2}, \vec{MV}_D), \ldots, \mathrm{Median}(\vec{MV}_D, \vec{MV}_{A_1}, \vec{MV}_{C_2}), \mathrm{Median}(\vec{MV}_B, \vec{MV}_{A_1}, \vec{MV}_{A_2})\Big)$$
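A sketch of the median-of-medians form follows. It assumes that the three inner medians written out in the formula are the complete argument list; the "…" may only mark a line break in the original layout, and that reading is an assumption of this sketch.

```python
from typing import Tuple

MV = Tuple[int, int]


def median3_mv(p: MV, q: MV, r: MV) -> MV:
    """Component-wise median of three motion vectors."""
    return (sorted((p[0], q[0], r[0]))[1], sorted((p[1], q[1], r[1]))[1])


def median_of_medians(c1: MV, c2: MV, d: MV, a1: MV, a2: MV, b: MV) -> MV:
    """Outer median over the three inner medians written out in the formula."""
    m1 = median3_mv(c1, c2, d)
    m2 = median3_mv(d, a1, c2)
    m3 = median3_mv(b, a1, a2)
    return median3_mv(m1, m2, m3)
```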
[0109] Another possible solution is a Median5 solution. This is
probably the most computationally complex option (quick-sort or
bubble-sort could, for example, be used), but it could potentially
yield the best results. If 4×4 blocks are considered, for example,
then Median9 could also be used:
$$\vec{MV}_{pred} = \mathrm{Median}(\vec{MV}_{C_1}, \vec{MV}_{C_2}, \vec{MV}_D, \vec{MV}_B, \vec{MV}_{A_1}, \vec{MV}_{A_2})$$
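A general component-wise median covers Median5, Median9, and the six-argument form above. Sorting is used for simplicity, in line with the sorting methods mentioned in the text; taking the lower middle value for an even count is an assumption, since the text does not say how an even-sized median is resolved.

```python
from typing import Sequence, Tuple

MV = Tuple[int, int]


def median_n(mvs: Sequence[MV]) -> MV:
    """Component-wise median of any number of MVs (Median5, Median9, ...).

    For an even count the lower of the two middle values is taken (assumption).
    """
    k = (len(mvs) - 1) // 2
    xs = sorted(mv[0] for mv in mvs)
    ys = sorted(mv[1] for mv in mvs)
    return (xs[k], ys[k])
```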
[0110] Considering that JVT allows intra subblocks within an inter
macroblock (e.g., a tree macroblock structure), this could also be
taken into consideration within motion prediction. If a subblock
(e.g., from the macroblocks above or to the left only) to be used
for MV prediction is intra, the adjacent subblock may be used
instead. Thus, if $A_1$ is intra but $A_2$ is not, $A_1$ can be
replaced by $A_2$ in the prediction. A further possibility is to
replace one missing intra macroblock with the MV predictor from the
upper-left position. In FIG. 17a, for example, if $C_1$ is missing,
D may be used instead.
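The two substitutions named above ($A_1 \to A_2$ and $C_1 \to D$) can be encoded as a lookup table, as in this sketch. Intra subblocks are represented by `None`; any further entries in the table would be hypothetical extensions beyond what the text describes.

```python
from typing import Dict, List, Optional, Tuple

MV = Tuple[int, int]

# Substitutes named in the text: A1 -> A2 (adjacent subblock), C1 -> D (upper-left).
# Any other entries would be hypothetical extensions.
SUBSTITUTES: Dict[str, List[str]] = {"A1": ["A2"], "C1": ["D"]}


def resolve_predictor(name: str, mvs: Dict[str, Optional[MV]]) -> Optional[MV]:
    """Return the MV for predictor `name`, substituting if it is intra (None)."""
    mv = mvs.get(name)
    if mv is not None:
        return mv
    for alt in SUBSTITUTES.get(name, []):
        if mvs.get(alt) is not None:
            return mvs[alt]
    return None  # no usable substitute
```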
[0111] In the above sections, several improvements to B picture
Direct Mode and to motion vector prediction were presented. It was
illustrated that spatial prediction can also be used for Direct
Mode macroblocks, whereas motion vector prediction should consider
temporal distance and subblock information for more accurate
prediction. Such considerations should significantly improve the
performance of any applicable video coding system.
[0112] Conclusion
[0113] Although the description above uses language that is
specific to structural features and/or methodological acts, it is
to be understood that the invention defined in the appended claims
is not limited to the specific features or acts described. Rather,
the specific features and acts are disclosed as exemplary forms of
implementing the invention.