U.S. patent application number 12/107794, for improving throughput performance when applying deblocking filters on reconstructed image frames, was filed on 2008-04-23 and published by the patent office on 2008-12-04.
This patent application is currently assigned to TEXAS INSTRUMENTS INCORPORATED. Invention is credited to Anurag Mithalal Jain, Sunand Mittal, Vipulkumar Parasottambhai Paladiya.
Application Number | 20080298472 12/107794
Document ID | /
Family ID | 40088154
Publication Date | 2008-12-04
United States Patent Application | 20080298472
Kind Code | A1
Jain; Anurag Mithalal; et al. | December 4, 2008
Throughput Performance When Applying Deblocking Filters On
Reconstructed Image Frames
Abstract
Improving throughput performance when applying deblocking
filters on reconstructed image frames. In one embodiment, an image
frame received in the form of a set of values in encoded format is
decoded to form a second set of values representing a
reconstruction of the image frame in a decoded format. The specific
ones of the pairs of edges (formed by sub-blocks in the image frame)
to which a deblocking filter is to be applied are then determined by
evaluating any pre-conditions that need to be satisfied according
to a standard. The deblocking filter is then applied to the
determined specific ones of the pairs of edges, with the
application being performed after the determining.
Inventors: | Jain; Anurag Mithalal; (Bangalore, IN); Paladiya; Vipulkumar Parasottambhai; (Bangalore, IN); Mittal; Sunand; (Ghaziabad, IN)
Correspondence Address: | TEXAS INSTRUMENTS INCORPORATED, P O BOX 655474, M/S 3999, DALLAS, TX 75265, US
Assignee: | TEXAS INSTRUMENTS INCORPORATED, Dallas, TX
Family ID: | 40088154
Appl. No.: | 12/107794
Filed: | April 23, 2008
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
60941881 | Jun 4, 2007 |
Current U.S. Class: | 375/240.29; 375/E7.189
Current CPC Class: | H04N 19/86 20141101; H04N 19/436 20141101; H04N 19/61 20141101; H04N 19/117 20141101; H04N 19/14 20141101; H04N 19/176 20141101
Class at Publication: | 375/240.29; 375/E07.189
International Class: | H04N 7/26 20060101 H04N007/26
Claims
1. A machine readable medium carrying one or more sequences of
instructions for causing a system to process image frames in
encoded format, wherein execution of said one or more sequences of
instructions by one or more processors contained in said system
causes said system to perform the actions of: receiving a first
plurality of values representing an image frame in encoded format,
said image frame containing a plurality of macro-blocks, each of
said macro-blocks in turn containing a plurality of sub-blocks, a
plurality of horizontal edges and a plurality of vertical edges
being formed by said plurality of sub-blocks, said plurality of
horizontal edges and said plurality of vertical edges including
pairs of edges of same orientation; decoding said first plurality
of values to form a second plurality of values representing a
reconstruction of said image frame in a decoded format; determining
the specific ones of said pair of edges to which a deblocking
filter is to be applied by evaluating a set of pre-conditions that
need to be satisfied according to a standard; and applying said
deblocking filter to the determined specific ones of said pair of
edges, wherein said applying is performed after said
determining.
2. The machine readable medium of claim 1, wherein each of said
pair of edges are adjacent to each other.
3. The machine readable medium of claim 2, wherein said determining
is performed for all edges in one orientation before performing
said applying.
4. The machine readable medium of claim 1, further comprising one
or more instructions for: forming a bit field containing a set of
bits, with each bit indicating whether said deblocking filter is to
be applied to a corresponding edge.
5. The machine readable medium of claim 4, further comprising one
or more instructions for: loading said bit field into a register;
and identifying a next bit starting from a first bit in said
register, wherein said next bit indicates a next edge to which said
deblocking filter is to be applied, wherein said identifying also
identifies a following bit starting from said next bit, wherein
said following bit indicates a following edge after said next edge
to which said deblocking filter is to be applied.
6. The machine readable medium of claim 5, wherein said identifying
comprises: using an instruction which receives an offset as an
input and indicates in said register a next bit position starting
from said offset at which the corresponding bit equals a desired
binary value, wherein said identifying identifies said next bit by
invoking said instruction with said offset equal to the bit
position of said first bit and then identifies said following bit
by invoking said instruction with said offset equaling the bit
position of said next bit in said bit field loaded into said
register.
7. The machine readable medium of claim 5, wherein said identifying
comprises shifting said bit field in said register by a number of
positions determined by the bit position at which said next bit is
present in said bit field when loaded into said register.
8. The machine readable medium of claim 5, further comprising
determining a number of bits in said bit field indicating that
deblocking filter is to be applied to corresponding edges, wherein
said identifying identifies each present edge to which deblocking
filter is to be applied in a corresponding loop, wherein said loop
is executed said number of times.
9. The machine readable medium of claim 8, further comprising one
or more instructions for: maintaining an edge counter which
indicates the number of bit positions from said first bit to a bit
representing said present edge; and determining a first set of
addresses of memory locations storing the specific ones of said
second plurality of values which are required to apply said
deblocking filter to said present edge based on said edge
counter.
10. The machine readable medium of claim 9, further comprising one
or more instructions for: maintaining a lookup table indicating the
addresses of memory locations storing said second plurality of
values which are required to apply said deblocking filter
corresponding to each of said plurality of horizontal edges and
each of said plurality of vertical edges, wherein said lookup table
is indexed based on said edge counter, wherein said determining
determines said first set of addresses corresponding to said
present edge based on said edge counter and said lookup table.
11. The machine readable medium of claim 4, wherein said second
plurality of values are stored in a plurality of memory locations
of a memory, wherein said bit field indicates that deblocking
filter is to be applied to a present edge, wherein said present
edge requires values at a set of memory locations contained in said
plurality of locations as inputs to said deblocking filter, further
comprising one or more instructions for: loading the values from
said set of memory locations into a set of registers; checking
whether said bit field indicates that said deblocking filter is to
be applied to a base edge corresponding to said present edge,
wherein application of said deblocking filter to said base edge
causes at least some of the values in said set of memory locations
to be modified to corresponding new values; applying said
deblocking filter to said present edge using said values in said
set of registers if said bit field indicates that said deblocking
filter is not to be applied to said base edge; and waiting for
availability of said new values before applying said deblocking
filter to said present edge if said bit field indicates that said
deblocking filter is to be applied to said base edge.
12. The machine readable medium of claim 11, further comprising one
or more instructions for: storing said new values in a buffer,
which provides faster access than said memory; replacing the values
in said set of registers using said new values in said buffer after
said waiting; and applying said deblocking filter to said present
edge using the replaced values in said set of registers.
13. A method of processing image frames in encoded format, said
method comprising: receiving a first plurality of values
representing an image frame in encoded format, said image frame
containing a plurality of macro-blocks, each of said macro-blocks
in turn containing a plurality of sub-blocks, a plurality of
horizontal edges and a plurality of vertical edges being formed by
said plurality of sub-blocks, said plurality of horizontal edges
and said plurality of vertical edges including a pair of adjacent
edges of same orientation; decoding said first plurality of values
to form a second plurality of values representing said image frame
in a decoded format; determining the specific ones of said pair of
adjacent edges to which a deblocking filter is to be applied by
evaluating any pre-conditions that need to be satisfied according
to a standard; and applying said deblocking filter to the
determined specific ones of said pair of adjacent edges, wherein
said determining is performed for all edges in one orientation
before performing said applying.
14. The method of claim 13, further comprising forming a bit field
containing a set of bits, with each bit indicating whether said
deblocking filter is to be applied to a corresponding edge.
15. The method of claim 14, further comprising: loading said bit
field into a register; and identifying a next bit starting from a
first bit in said register, wherein said next bit indicates a next
edge to which said deblocking filter is to be applied, wherein said
identifying also identifies a following bit starting from said next
bit, wherein said following bit indicates a following edge after
said next edge to which said deblocking filter is to be
applied.
16. The method of claim 15, wherein said identifying comprises:
using an instruction which receives an offset as an input and
indicates in said register a next bit position starting from said
offset at which the corresponding bit equals a desired binary
value, wherein said identifying identifies said next bit by
invoking said instruction with said offset equal to the bit
position of said first bit and then identifies said following bit
by invoking said instruction with said offset equaling the bit
position of said next bit in said bit field loaded into said
register.
17. The method of claim 15, further comprising determining a number
of bits in said bit field indicating that deblocking filter is to
be applied to corresponding edges, wherein said identifying
identifies each present edge to which deblocking filter is to be
applied in a corresponding loop, wherein said loop is executed said
number of times.
18. The method of claim 17, further comprising: maintaining an edge
counter which indicates the number of bit positions from said first
bit to a bit representing said present edge; and computing
addresses of memory locations storing the specific ones of said
second plurality of values which are required to apply said
deblocking filter to said present edge.
19. The method of claim 14, wherein said second plurality of values
are stored in a plurality of memory locations of a memory, wherein
said bit field indicates that deblocking filter is to be applied to
a present edge, wherein said present edge requires values at a set
of memory locations contained in said plurality of locations as
inputs to said deblocking filter, said method further comprising:
loading the values from said set of memory locations into a set of
registers; checking whether said bit field indicates that said
deblocking filter is to be applied to a base edge corresponding to
said present edge, wherein application of said deblocking filter to
said base edge causes at least some of the values in said set of
memory locations to be modified to corresponding new values;
applying said deblocking filter to said present edge using said
values in said set of registers if said bit field indicates that
said deblocking filter is not to be applied to said base edge; and
waiting for availability of said new values before applying said
deblocking filter to said present edge if said bit field indicates
that said deblocking filter is to be applied to said base edge.
20. The method of claim 19, further comprising: storing said new
values in a buffer, which provides faster access than said memory;
replacing the values in said set of registers using said new values
in said buffer after said waiting; and applying said deblocking
filter to said present edge using the replaced values in said set
of registers.
Description
RELATED APPLICATION(S)
[0001] The present application claims priority from co-pending U.S.
provisional application Ser. No. 60/941,881, entitled "Deblocking
Filter Implementation on VLIW Architectures for H.264 Video", filed
on 4 Jun. 2007, naming the same applicant Texas Instruments Inc
(the intended assignee) and the same inventors Anurag Mithalal
Jain, Vipulkumar Parasottambhai Paladiya, and Sunand Mittal as in
the subject application, attorney docket number TI-60039PS, and is
incorporated herein in its entirety.
BACKGROUND
[0002] 1. Field of Disclosure
[0003] The present disclosure relates generally to data
compression/decompression technologies, and more specifically to
improving throughput performance when applying deblocking filters
on reconstructed image frames.
[0004] 2. Related Art
[0005] Image frames are often required to be reconstructed from
corresponding compressed/encoded data. Reconstruction refers to
forming the uncompressed data, which is as close as possible to the
original data from which the compressed/encoded data is formed.
[0006] For example, data representing a sequence of image frames
generated from a video signal capturing a scene of interest is
often provided in a compressed/encoded form, typically for reducing
storage space or for reducing transmission bandwidth requirements.
Such a technique may necessitate the reconstruction of the scene of
interest (the sequence of image frames) by uncompressing/decoding
the provided data.
[0007] H.264 is an example of a standard according to which image
frames are represented in a compressed form (thereby necessitating
reconstruction). H.264 is described in further detail in
"Information technology--Coding of audio-visual objects--Part 10:
Advanced Video Coding", available from ISO/IEC (International
Standards Organization/International Electrotechnical
Commission).
[0008] Deblocking filters are often applied on reconstructed image
frames. As is well known, compression/decompression techniques are
often "lossy" which could lead to undesirable visual
characteristics in the display of reconstructed image frames, and
applying the deblocking filters makes the display of reconstructed
image frames less objectionable to the human eye.
[0009] For example, the image frames reconstructed from data
compressed/encoded at low bit rates using the H.264 standard noted
above may exhibit blockiness (block edges) and/or abrupt color
transitions due to the underlying compression/decompression
techniques. By applying deblocking filters, at least in the case of
the H.264 standard, the eventual display of images can be made less
objectionable to the human eye.
[0010] Application of a deblocking filter generally requires
substantial computational time/resources. As such, it may be
desirable that throughput performance be improved when applying
deblocking filters to reconstructed image frames.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] Example embodiments will be described with reference to the
following accompanying drawings, which are described briefly
below.
[0012] FIG. 1 is a block diagram illustrating an example
environment in which several features of the present invention may
be implemented.
[0013] FIG. 2A is a block diagram of the internal details of an
H.264 encoder in which several features of the present invention
are implemented in one embodiment.
[0014] FIG. 2B is a block diagram of the internal details of an
H.264 decoder in which several features of the present invention
are implemented in one embodiment.
[0015] FIG. 2C depicts the manner in which image frames are
compressed/encoded using a block-based compression/encoding
technique in one embodiment.
[0016] FIGS. 3A, 3B, and 3C together illustrate the manner in which
deblocking filters are applied to a reconstructed macro-block
(corresponding to block 290) in one embodiment in the context of
H.264.
[0017] FIG. 4A is a block diagram illustrating the details of
processing unit 150A in an embodiment.
[0018] FIG. 4B is a block diagram of a processing environment
containing multiple execution units, each potentially implementing
a pipelined architecture, in one embodiment.
[0019] FIG. 4C depicts the manner in which machine instructions
(executable code) may be generated in one embodiment.
[0020] FIG. 5 is a flowchart illustrating the manner in which a
deblocking filter is applied with enhanced parallelism according to
an aspect of the present invention.
[0021] FIG. 6 is a flowchart illustrating the manner in which the
enhanced parallelism is obtained in application of deblocking
filters in one embodiment of the present invention.
[0022] FIG. 7 depicts a bit field indicating both the vertical and
horizontal edges of a macro block to which a deblocking filter is
to be applied in one embodiment.
[0023] FIGS. 8A and 8B together illustrate the dependencies in the
application of deblocking filter to the edges of a macro-block in
one embodiment.
[0024] FIG. 9 is a flowchart illustrating the manner in which
memory dependencies in processing the edges (in one orientation) of
a reconstructed block are reduced according to an aspect of the
present invention.
[0025] In the drawings, like reference numbers generally indicate
identical, functionally similar, and/or structurally similar
elements. The drawing in which an element first appears is
indicated by the leftmost digit(s) in the corresponding reference
number.
DETAILED DESCRIPTION
1. Overview
[0026] Several features of the present invention can be used to
improve the throughput performance of applying deblocking filters
on reconstructed image frames. In one embodiment, a set of values
representing an image frame in encoded format is received, with the
image frame containing multiple macro-blocks. Each macro-block in
turn contains multiple sub-blocks forming horizontal and vertical
edges, with the edges including pairs of adjacent edges in the same
orientation (horizontal or vertical).
[0027] The received set of values is first decoded (and/or
decompressed) to form a second set of values representing a
reconstruction of the image frame in a decoded format. As noted in
the Background section, a deblocking filter may need to be applied
to the reconstructed image frames to make the display of the images
less objectionable to the human eye.
[0028] According to an aspect of the present invention, the
specific ones of the pairs of edges to which a deblocking filter is
to be applied are determined by evaluating a set of pre-conditions
that need to be satisfied according to a standard. The deblocking
filter is then applied to the determined specific ones of the pairs
of edges, with the application of the deblocking filter being
performed after the determining.
[0029] Thus, by determining the specific edges to which the
deblocking filter is to be applied, the application of the
deblocking filter to the edges can be performed with enhanced
parallelism, thereby improving throughput performance.
[0030] In one embodiment, the pairs of edges are adjacent edges.
Further, the determination of the specific ones of the pairs of
edges is performed for all the edges in one orientation (horizontal
or vertical) before application of the deblocking filter to the
determined pairs is performed. Such determination can further
enhance the parallelism in the manner of applying the deblocking
filter.
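The determine-then-apply split described above can be sketched in C as two separate passes over the edges of one orientation; the names, the edge count, and the pre-condition test below are illustrative assumptions, not the standard's actual filter logic:

```c
#include <assert.h>

#define NUM_EDGES 8  /* assumed number of edges in one orientation */

/* Pass 1: evaluate the pre-conditions for every edge of one
 * orientation up front (a hypothetical strength test stands in for
 * the standard's conditions), recording the decision per edge. */
static void determine_edges(const int strength[NUM_EDGES],
                            int apply[NUM_EDGES])
{
    for (int e = 0; e < NUM_EDGES; e++)
        apply[e] = (strength[e] > 0);
}

/* Pass 2: visit only the pre-selected edges. With no condition
 * evaluation interleaved, this loop is branch-light and thus easier
 * to software-pipeline; the actual pixel filtering is elided here,
 * and the count of filtered edges is returned instead. */
static int apply_filter(const int apply[NUM_EDGES])
{
    int filtered = 0;
    for (int e = 0; e < NUM_EDGES; e++)
        if (apply[e])
            filtered++;
    return filtered;
}
```

Separating the two passes is what exposes the parallelism: once the decisions are fixed, the filtering pass has no data-dependent control flow left.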
[0031] According to another aspect of the present invention, a bit
field containing a set of bits, with each bit indicating whether
the deblocking filter is to be applied to a corresponding edge is
formed.
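Such a bit field might be formed as below; the per-edge strength array standing in for the standard's pre-conditions is an assumption for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Build a bit field in which bit e is 1 exactly when the deblocking
 * filter is to be applied to edge e. The strength test is a
 * hypothetical stand-in for the standard's pre-conditions. */
static uint32_t build_edge_bitfield(const int strength[], int num_edges)
{
    uint32_t field = 0;
    for (int e = 0; e < num_edges; e++)
        if (strength[e] > 0)
            field |= (uint32_t)1 << e;  /* one bit per edge */
    return field;
}
```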
[0032] According to yet another aspect of the present invention,
the formed bit field is loaded into a register and then used to
identify a next bit (starting from a first bit) indicating a next
edge to which the deblocking filter is to be applied. The bit field
in the register is then used to identify a following bit (starting
from the next bit) indicating a following edge after the next edge
to which said deblocking filter is to be applied.
[0033] In one embodiment, the identification is performed using an
instruction which receives an offset as an input and indicates in
the register a next bit position starting from the offset at which
the corresponding bit equals a desired binary value (for example
"1"). Accordingly, the next bit is identified by invoking the
instruction with the offset equal to the bit position of the first
bit and the following bit is identified by invoking the instruction
with the offset equaling the bit position of the next bit in the
bit field loaded into the register.
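A portable sketch of that identification follows; `next_set_bit` is an assumed software stand-in for such a bit-detect instruction, returning -1 when no further edge needs filtering:

```c
#include <assert.h>
#include <stdint.h>

/* Return the position of the first 1 bit at or after 'offset' in
 * 'field', or -1 if there is none (no further edge to filter). A DSP
 * bit-detect instruction could replace this loop. */
static int next_set_bit(uint32_t field, int offset)
{
    for (int pos = offset; pos < 32; pos++)
        if (field & ((uint32_t)1 << pos))
            return pos;
    return -1;
}
```

For a field with bits 1, 3, and 6 set, `next_set_bit(field, 0)` yields 1 (the next edge) and `next_set_bit(field, 2)` yields 3 (the following edge), mirroring the two invocations described above.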
[0034] In an alternative embodiment, identifying the next and
following bits is performed by shifting the bit field in the
register by a number of positions determined by the bit position at
which the next bit is present in the bit field when loaded in the
register.
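The shift-based alternative can be sketched as below (names and bookkeeping are illustrative); each call shifts the consumed bits out so that the next search again starts at bit 0:

```c
#include <assert.h>
#include <stdint.h>

/* Pop the next edge out of '*field' by shifting: '*base' tracks how
 * many bit positions have already been shifted out, so the returned
 * value is the edge's position in the original bit field. Returns -1
 * when the field is exhausted. */
static int next_edge_by_shift(uint32_t *field, int *base)
{
    if (*field == 0)
        return -1;
    int pos = 0;
    while (!(*field & 1)) {          /* skip over 0 bits */
        *field >>= 1;
        pos++;
    }
    int edge = *base + pos;
    *field >>= 1;                    /* consume the found 1 bit */
    *base += pos + 1;
    return edge;
}
```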
[0035] According to one more aspect of the present invention, the
number of bits in the bit field (formed according to an aspect
described above) indicating that a deblocking filter is to be
applied to corresponding edges is determined. Each present edge to
which the deblocking filter is to be applied is then identified in
a corresponding loop, with the loop being executed the determined
number of times.
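Determining that trip count amounts to counting the 1 bits in the field; the loop below is a portable stand-in for a hardware population-count operation:

```c
#include <assert.h>
#include <stdint.h>

/* Count the edges to be filtered, i.e. the 1 bits in the field; this
 * count fixes the number of loop iterations up front, instead of
 * testing every edge inside the loop. */
static int count_edges(uint32_t field)
{
    int n = 0;
    while (field) {
        n += field & 1;
        field >>= 1;
    }
    return n;
}
```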
[0036] According to an aspect of the present invention, an edge
counter which indicates the number of bit positions from a first
bit to a bit representing a present edge (to which a deblocking
filter is to be applied) is maintained. The addresses of the memory
locations storing the specific ones of the second set of values
(forming the reconstruction of the image frame) which are required
to apply the deblocking filter to the present edge are then
computed based on the edge counter.
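A sketch of such counter-based addressing, using a lookup table indexed by the edge counter; the table contents and the flat pixel buffer are purely illustrative assumptions:

```c
#include <assert.h>

#define NUM_EDGES 8  /* assumed number of edges in one orientation */

/* Hypothetical lookup table: for each value of the edge counter, the
 * offset (into the reconstructed-frame buffer) of the first value
 * needed to filter that edge. The offsets are illustrative only. */
static const int edge_offset[NUM_EDGES] = {0, 4, 8, 12, 16, 20, 24, 28};

/* Resolve the input address for the present edge from the edge
 * counter, keeping per-edge address arithmetic out of the filter loop. */
static const int *edge_pixels(const int *frame, int edge_counter)
{
    return frame + edge_offset[edge_counter];
}
```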
[0037] According to another aspect of the present invention, a
present edge (to which a deblocking filter is to be applied as
indicated by the bit field) is processed by first loading into a
set of registers, the values required as inputs to the deblocking
filter from specific memory locations in a memory. The bit field is
then checked to determine whether the deblocking filter is to be
applied to a base edge corresponding to the present edge (also
referred to as a dependent edge), where application of the
deblocking filter to the base edge causes at least some of the
values in the specific memory locations to be modified to
corresponding new values.
[0038] The deblocking filter is then applied to the
present/dependent edge using the loaded values in the set of
registers if the bit field indicates that the deblocking filter is
not to be applied to the base edge. Alternatively, the system waits
for availability of the new values (caused by applying the
deblocking filter to the base edge) before applying the deblocking
filter to the present edge if the bit field indicates that the
deblocking filter is to be applied to the base edge.
[0039] Further, the new values are stored in a buffer, which
provides faster access than the memory, with the loaded values in
the set of registers being replaced with the new values in the
buffer after waiting for the new values to be available. The
deblocking filter is then applied to the present edge using the
replaced values in the set of registers.
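The base-edge dependency test can be sketched as a small decision helper; the enum names and edge numbering are assumptions for illustration:

```c
#include <assert.h>
#include <stdint.h>

enum input_src { USE_LOADED_VALUES, WAIT_FOR_NEW_VALUES };

/* Given the edge bit field, decide whether the present (dependent)
 * edge can be filtered from the values already loaded into registers,
 * or must wait for the new values produced by filtering its base edge
 * (which rewrites some of the input pixels). */
static enum input_src input_source(uint32_t field, int base_edge)
{
    if (field & ((uint32_t)1 << base_edge))
        return WAIT_FOR_NEW_VALUES;  /* base edge is filtered first */
    return USE_LOADED_VALUES;        /* inputs cannot change: proceed */
}
```

The useful case is the second one: when the base edge is not filtered, the dependent edge proceeds immediately on the speculatively loaded values, with no serialization.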
[0040] Several aspects of the invention are described below with
reference to examples for illustration. It should be understood
that numerous specific details, relationships, and methods are set
forth to provide a full understanding of the invention. For
example, many of the functional units described in this
specification have been labeled as modules/blocks in order to more
particularly emphasize their implementation independence.
[0041] A module/block may be implemented as a hardware circuit
containing custom very large scale integration circuits or gate
arrays, off-the-shelf semiconductors such as logic chips,
transistors or other discrete components. A module/block may also
be implemented in programmable hardware devices such as field
programmable gate arrays, programmable array logic, programmable
logic devices, or the like.
[0042] Modules/blocks may also be implemented in software for
execution by various types of processors. An identified module of
executable code may, for instance, contain one or more physical or
logical blocks of computer instructions which may, for instance, be
organized as an object, procedure, or function. Nevertheless, the
executables of an identified module need not be physically located
together, but may contain disparate instructions stored in
different locations which when joined logically together constitute
the module/block and achieve the stated purpose for the
module/block.
[0043] It may be appreciated that a module/block of executable code
could be a single instruction or many instructions, and may even be
distributed over several code segments, among different programs,
and across several memory devices. Further, the functionality
described with reference to a single module/block can be split
across multiple modules/blocks, or alternatively the functionality
described with respect to multiple modules/blocks can be combined
into a single module/block (or other combination of blocks), as
will be apparent to a skilled practitioner based on the disclosure
provided herein.
[0044] Similarly, operational data may be identified and
illustrated herein within modules and may be embodied in any
suitable form and organized within any suitable type of data
structure. The operational data may be collected as a single data
set, or may be distributed over different locations including over
different member disks, and may exist, at least partially, merely
as electronic signals on a system or network.
[0045] Reference throughout this specification to "one embodiment",
"an embodiment", or similar language means that a particular
feature, structure, or characteristic described in connection with
the embodiment is included in at least one embodiment of the
present invention. Thus, appearances of the phrases "in one
embodiment", "in an embodiment" and similar language throughout
this specification may, but do not necessarily, all refer to the
same embodiment.
[0046] Furthermore, the described features, structures, or
characteristics of the invention may be combined in any suitable
manner in one or more embodiments. In the following description,
numerous specific details are provided such as examples of
programming, software modules, user selections, network
transactions, database queries, database structures, hardware
modules, hardware circuits, hardware chips, etc., to provide a
thorough understanding of embodiments of the invention.
[0047] However, one skilled in the relevant art will recognize that
the invention can be practiced without one or more of the specific
details, or with other methods, components, materials, and so forth.
In other instances, well-known structures, materials, or operations
are not shown in detail to avoid obscuring the features of the
invention. Furthermore, the features/aspects described can be
practiced in various combinations, though only some of the
combinations are described herein for conciseness.
2. Example Environment
[0048] FIG. 1 is a block diagram illustrating an example
environment in which several features of the present invention may
be implemented. The example environment is shown containing only
representative systems for illustration. However, real-world
environments may contain many more systems/components as will be
apparent to one skilled in the relevant arts by reading the
disclosure provided herein. Implementations in such environments
are also contemplated to be within the scope and spirit of various
aspects of the present invention.
[0049] The diagram is shown containing end systems 110A and 110B
designed/configured to communicate with each other in a video
conferencing application. End system 110A is shown containing
processing unit 150A, video camera 130A, and display unit 170A,
while end system 110B is shown containing processing unit 150B,
video camera 130B, and display unit 170B. Each component is
described in detail below.
[0050] Video camera 130A captures images of a scene (a general area
sought to be captured), and forwards the captured image to
processing unit 150A via path 135. The captured image is forwarded
in the form of corresponding image frames, with each image frame
containing a set of pixel values representing the captured image
when viewed as a two-dimensional area. The image frames (generally
in an uncompressed format) may be forwarded from video camera 130A
in any of formats such as RGB, YUV, etc.
[0051] Processing unit 150A may compress/encode each image frame
received from video camera 130A, and forward the compressed/encoded
image frames via path 155 to end system 110B. Path 155 may contain
various transmission paths (including networks, point-to-point
lines, etc.) providing a bandwidth for transmission of the
image/video data.
[0052] Alternatively, processing unit 150A may store the
compressed/encoded image frames in a memory (not shown). Processing
unit 150A may also receive compressed/encoded image data from end
system 110B, and forward the uncompressed/decoded image data
(representing the reconstructed scene) to display unit 170A via
path 157 for display.
[0053] Processing unit 150B, video camera 130B and display unit
170B respectively operate similar to the corresponding components
of end system 110A, and the description is not repeated for
conciseness. In particular, end system 110B may reconstruct the
scene by decompressing/decoding the image frames received from end
system 110A and then may display the reconstructed scene on display
unit 170B. Such reconstruction may be performed in both processing
unit 150A and 150B, according to several aspects of the present
invention, as described below with examples.
[0054] Several features of the present invention are described
below in a specific context of H.264 standard. However, it should
be appreciated that the features can be implemented with respect to
other encoding/decoding of sequence of image frames in other
contexts and/or other standards as well, as will be apparent to one
skilled in the relevant arts by reading the disclosure provided
herein.
3. H.264 Standard
[0055] FIG. 2A is a block diagram of the internal details of an
H.264 encoder in which several features of the present invention
are implemented in one embodiment. The encoder may be implemented
within processing unit 150A or externally (e.g., using custom
ASICs).
[0056] Only some of the details as pertinent to the features
described below are shown for conciseness. For further details of
the H.264 standard, the reader is referred to the document noted in
the background section. Further, though shown as separate blocks
with distinct functionalities merely for illustration, the various
blocks of FIGS. 2A and 2B may be implemented as more/fewer blocks,
possibly with some of the functionalities merged/split into
one/multiple blocks (particularly when implemented as software
modules).
[0057] The block diagram is shown containing source image frame
210, reference image frame 215, encoding block 220, compression
block 230, compressed/encoded bit stream 235, decoding block 240,
reconstructed image frame 245, and deblocking filter 250. Each
block is described in detail below.
[0058] Source image frame 210 represents one of the image frames
received from video camera 130A desired to be compressed/encoded
according to the H.264 standard. In one embodiment, each source
image frame is encoded using a block-based compression encoding
technique as described below.
[0059] FIG. 2C depicts the manner in which image frames are
compressed/encoded using a block-based compression/encoding
technique in one embodiment. In a block-based technique, an image
frame is viewed as containing multiple blocks, with each block
representing a group of adjacent pixels with a desired dimension
and shape. The encoding and decoding of the image frame may then be
performed based on the blocks in the image frame.
[0060] In the following description, a compressed block refers to a
block after compression/encoding, while a reconstructed block
refers to the (uncompressed) block generated by
uncompressing/decoding a compressed block.
[0061] In the H.264 standard, each block may be chosen to be a
square block of 16×16 pixels, as shown for block 290. However, an
image frame can be divided into square blocks of other sizes, such
as 4×4 and 8×8 pixels. Further, the blocks can be of other shapes
(e.g., rectangular or non-uniform shapes) and/or sizes in
alternative standards. Each of these blocks is hereafter referred
to as a macro-block, to differentiate it from the sub-blocks
described in the sections below.
[0062] Accordingly, source image frame 210 (only a portion is shown
there for conciseness) is shown as being divided into a number of
macro-blocks (shown numbered sequentially from m1 to m99 for
reference). Each macro-block represents a group of pixels which are
processed together while compressing/encoding source image frame
210.
[0063] Encoding block 220 encodes the received source image frame
210 according to the H.264 standard. The encoding of source image
frame 210 may be performed with respect to reference image frame
215.
[0064] Reference image frame 215 generally represents a
reconstructed image frame corresponding to a previous image frame
received from video camera 130A prior to (the present) source image
frame 210 being compressed. Reference image frame 215 may be
received from deblocking filter 250 or alternatively, in the
absence of a deblocking filter, correspond to reconstructed image
frame 245 generated by decoding block 240. It should be noted that
reference image frame 215 may not be identical to the previous
image frame, due to the lossy nature of the video compression
scheme.
[0065] Each macro-block (such as block 290) is encoded by first
finding the difference between the values of the (16×16) pixels in
the macro-block and the values of the corresponding pixels in a
reference macro-block (contained in source image frame 210 or
reference image frame 215). The difference between the
macro-blocks is often expressed in terms of luma (representing the
brightness information) and chroma (representing the color
information) corresponding to the pixels.
[0066] Encoding block 220 then encodes the differences to generate
corresponding encoded macro-block data. For example, the
difference may be transformed using a block transform and then
quantized to generate a corresponding set of quantized transform
coefficients (thereby compressing the data) representing the
macro-block being encoded. Such quantization may result in "lossy"
compression of the image frame, whereby some of the visual
information contained in source image frame 210 is not
compressed/encoded and therefore cannot be reconstructed when
decoding the corresponding compressed data.
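The lossy effect of quantization described above can be illustrated with a small sketch. This is not the actual H.264 integer transform or quantizer; the block contents and the quantization step size are assumptions chosen merely to show that the reconstruction no longer matches the source exactly.

```python
# Illustrative only: quantizing the residual between a source block and a
# reference block discards information, making the compression lossy.

QP_STEP = 8  # hypothetical quantization step size

def encode_block(source, reference, step=QP_STEP):
    """Quantize the pixel-wise residual between source and reference."""
    return [(s - r) // step for s, r in zip(source, reference)]

def decode_block(coeffs, reference, step=QP_STEP):
    """Reconstruct the block from quantized residuals and the reference."""
    return [r + c * step for c, r in zip(coeffs, reference)]

source    = [120, 124, 131, 140]   # one row of a block, for illustration
reference = [118, 120, 130, 143]

coeffs = encode_block(source, reference)
reconstructed = decode_block(coeffs, reference)
```

Because the residuals are divided by the step size before being stored, small differences are rounded away and the reconstructed row differs from the source, consistent with the "lossy" compression noted above.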
[0067] In one embodiment, each macro-block in the image frame is
encoded to generate a corresponding 16×16 luma block and two 8×8
chroma blocks. The color (chroma) information is generally smaller
than the brightness (luma) information, since human vision is less
sensitive to color changes than to brightness changes.
[0068] Encoding block 220 then assembles the encoded macro-block
data corresponding to the macro-blocks forming source image frame
210 to form the encoded image data and forwards (makes available)
the encoded image data to compression block 230.
[0069] Compression block 230 further compresses the encoded image
data using entropy-encoding techniques, well known in the relevant
arts. Entropy encoding may involve using fewer bits to encode more
frequently occurring data in the encoded image data and more bits
to encode less frequently occurring data.
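The principle that more frequent data receives shorter codes can be sketched with a minimal Huffman-style code-length computation. This is illustrative only; H.264 actually specifies CAVLC and CABAC entropy coders, whose details differ substantially from plain Huffman coding.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Return {symbol: code length in bits} for a Huffman code over data."""
    freq = Counter(data)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # heap entries: (weight, tiebreak id, {symbol: current depth})
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # two least-frequent subtrees
        w2, _, t2 = heapq.heappop(heap)
        merged = {s, l + 1} if False else {s: l + 1 for s, l in {**t1, **t2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

data = "aaaaaaabbbcd"  # 'a' is most frequent, 'c' and 'd' are rarest
lengths = huffman_code_lengths(data)
```

Here the most frequent symbol, 'a', receives the shortest code and the rare symbols the longest, consistent with the description above.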
[0070] The compressed/encoded image data is then generated in the
form of compressed/encoded data stream 235 (containing a set of
values in encoded format), which may then either be stored or
transmitted to a recipient system such as end system 110B.
Compressed/encoded data stream 235 may represent the entire image
frame or portions of it in a compressed/encoded form, and may
include information (such as size/dimension/shape of each of the
corresponding macro-blocks) to enable a device (such as processing
unit 150B of FIG. 1) to decompress/decode the image frame
accurately.
[0071] Decoding block 240 receives the output of encoding block 220
and decodes the encoded image data. Such decoding may be necessary
to generate reference image frame 215 to be used in encoding the
next image frame received from video camera 130A.
[0072] Decoding block 240 reconstructs the macro-block (in
reconstructed image frame 245) from the corresponding macro-block
data, as well as previously decoded macro-blocks which may be
retrieved from a storage unit (not shown). Decoding block 240 may
substantially perform the reverse of the corresponding operations
used to compress and encode a macro-block, such as an inverse
quantization and inverse transform, performed by encoding block
220.
[0073] Decoding block 240 then assembles the reconstructed
macro-blocks to generate reconstructed image frame 245, which is
then forwarded to deblocking filter 250. Deblocking filter 250,
provided according to several aspects of the present invention,
removes the visual defects in reconstructed image frame 245 to
generate reference image frame 215, as described in the sections
below.
[0074] It may be appreciated that a similar approach may be used in
decompressing/decoding the compressed/encoded data stream 235 as
described in detail below.
[0075] FIG. 2B is a block diagram of the internal details of an
H.264 decoder, illustrating an example embodiment in which several
features of the present invention are implemented. The decoder may
be implemented within processing unit 150B or externally (e.g.,
using custom ASICs). Only some of the
details as pertinent to the features described below are shown for
conciseness.
[0076] Decompression block 260 receives the compressed/encoded
image frame in the form of compressed/encoded data stream 235 and
may substantially perform the reverse of the operations performed
by compression block 230 to generate the encoded image data.
Decompression block 260 may then forward the encoded image data to
decoding block 240.
[0077] Decoding block 240 reconstructs the image frame from the
encoded image data in the form of reconstructed image frame 245
(containing a set of values in a decoded format), which is then
processed by deblocking filter 250 to generate displayed image
frame 265. Displayed image frame 265 may be displayed on display
unit 170B.
[0078] It may be appreciated that displayed image frame 265
corresponds (at least substantially) to source image frame 210
after being compressed and decompressed according to H.264
standard. As described above, it may be necessary to apply the
deblocking filter to reconstructed image frame 245. The general
concepts underlying such application of the deblocking filter
according to the H.264 standard are described below with examples.
4. Applying Deblocking Filters
[0079] FIGS. 3A, 3B, and 3C together illustrate the manner in which
deblocking filters are applied to a reconstructed macro-block
(corresponding to block 290) in one embodiment in the context of
H.264. Each of the Figures is described in detail below.
[0080] According to the H.264 standard, the deblocking filter is to
be applied to each square block of 4×4 pixels (hereafter referred
to as a sub-block) in the reconstructed image frame. As such, each
reconstructed macro-block (16×16 pixels) may be viewed as
containing 16 sub-blocks of 4×4 pixels. The application of the
deblocking filter may then be performed in the context of the
sub-blocks.
[0081] FIG. 3A illustrates the order in which each of the sets of
horizontal and vertical edges (formed between sub-blocks) are to be
processed for deblocking. H.264 requires that the vertical edges be
processed before horizontal edges and accordingly the example
embodiments below are described based on that constraint. However,
it should be appreciated that alternative embodiments/standards can
be implemented with a different order of processing of the edges,
as desired in specific environments, without departing from the
scope and spirit of several aspects of the present invention.
[0082] FIG. 3A depicts 16 vertical edges (shown as v0-v15 in 310,
i.e., edges of the same vertical orientation) and 16 horizontal
edges (shown as h0-h15 in 320, i.e., edges of the same horizontal
orientation) that are processed for the luma information
corresponding to a macro-block (m41 or block 290).
[0083] The vertical edges v0-v15 are processed first, according to
the sequence numbers associated with each edge, followed by the
horizontal edges h0-h15 (also according to the sequence numbers).
It should be appreciated that each edge may be viewed as covering
the area between the display regions of the adjacent pixels, as
also shown in FIG. 3B.
[0084] Thus, vertical edge v0 between sub-blocks 312 and 315 (a
sub-block in the previous macro-block m40) is the area between
pixel pairs of {p0, q0}, with p0 representing the boundary pixel
for sub-block 315 and q0 representing the boundary pixel for
sub-block 312, as depicted in FIG. 3B. The horizontal edge h4 is
the area in the boundary of sub-blocks 322 and 325 between pixel
pairs of {m0, n0} as also illustrated in FIG. 3B. The remaining
edges for luma information are defined similarly.
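As a concrete illustration of the edge geometry just described, the following sketch gathers the {p0..p3} and {q0..q3} pixels for one 4-pixel-high vertical edge from a toy frame held as a list of rows. The frame layout, contents and function name are hypothetical, not part of the standard.

```python
# Illustrative only: for a vertical edge, the p-side pixels lie to the left
# (in the previous sub-block) and the q-side pixels to the right, with p0
# and q0 nearest the edge.

def vertical_edge_pixels(frame, x, y):
    """For each of the 4 rows starting at y, return ([p0..p3], [q0..q3]).

    x is the column index of the first pixel to the right of the edge (q0).
    """
    rows = []
    for r in range(y, y + 4):
        p = [frame[r][x - 1 - i] for i in range(4)]  # p0 is nearest the edge
        q = [frame[r][x + i] for i in range(4)]      # q0 is nearest the edge
        rows.append((p, q))
    return rows

# 8x8 toy frame: left half all 10s, right half all 90s, edge at column 4
frame = [[10] * 4 + [90] * 4 for _ in range(8)]
pairs = vertical_edge_pixels(frame, 4, 0)
```

Each of the four returned tuples corresponds to one pair of pixel rows straddling the edge, matching the four similarly labeled {p, q} pairs noted in FIG. 3B.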
[0085] It may be further appreciated that v1, v2 and v3 are the
respective adjacent edges (of the same orientation) to v0, v1 and
v2. The remaining adjacent vertical edges are similarly described
with respect to 310. Similarly, h1, h2 and h3 are the respective
adjacent edges (in the horizontal orientation) to h0, h1 and
h2.
[0086] It may be observed in FIG. 3B that the pairs of pixels on
either side of edge v0 are labeled similarly as {p0, q0}. However,
the pixel pairs of {p0, q0} represent (four) different pairs of
pixels and may have different values based on the encoded data. The
labeling of the different pixel pairs using the same labels is
merely for convenience in describing the embodiments of the present
invention.
[0087] The order of processing the edges c0-c7 for the chroma
information corresponding to the macro-block (m41 or block 290) may
be similarly understood based on the depiction at 330 in FIG. 3A.
Various features of the invention hereafter are substantially
described with respect to processing of luma information for
conciseness. However, the processing may be applicable to chroma
information as well, as will be apparent to one skilled in the
relevant arts by reading the disclosure provided herein.
[0088] FIG. 3C indicates the rules set by H.264 standard with
respect to the number of adjacent pixels to be used ("boundary
strength") as inputs to the deblocking filter while processing each
edge. Column 360 specifies the conditions of the rule and column
365 specifies the corresponding boundary strength. As is well
known, the boundary strength is calculated as per the process
specified in the H.264 standard and depends on multiple data fields
(such as the motion vector, quantization parameter, macro-block
type, etc.) decoded from compressed/encoded data stream 235.
[0089] Thus, row 371 indicates that when either of the two
sub-blocks (conveniently named p and q and which may correspond to
sub-blocks 322 and 325) is intra coded and the edge is a
macro-block edge (e.g., edges v0-v3 and h0-h3 in FIG. 3A), the
boundary strength is 4, indicating that four pixels on both sides
of the edge (e.g., p0-p3 and q0-q3 in FIG. 3B) are to be used as
inputs to the deblocking filter. The remaining rows 372-375 are
similarly explained, with row 375 indicating the condition under
which deblocking filter need not be applied (boundary
strength=0).
[0090] During operation, along with the compressed values,
additional information may be received indicating the manner of
encoding of each macro-block, which facilitates a determination of
whether a macro-block is a candidate for application of deblocking
filter (i.e., having a boundary strength greater than 0).
[0091] Even if the boundary strength is greater than 0, the
decision on whether to apply the deblocking filter may be based on
the below conditions (referred to as the threshold requirements):

|p0-q0|<t1, |p1-p0|<t2, and |q1-q0|<t2 Equation (1)

[0092] wherein | | represents the absolute value operator, the
pixels {p1, p0, q0, q1} have been determined to be used as inputs
to the deblocking filter, and t1 and t2 are thresholds specified by
the H.264 standard (commonly referred to as the alpha and beta
thresholds). It should be appreciated that the values p0, q0, etc.
in the above equation need to be used after any modification (or
computation of new values) by application of the deblocking filter
to the corresponding base edges.
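Equation (1) may be encoded directly as a predicate, as in the following sketch. The concrete threshold values used below are hypothetical; in practice t1 (alpha) and t2 (beta) are derived from the quantization parameter as specified by the standard.

```python
# Direct encoding of the threshold requirements of Equation (1).

def edge_is_filterable(p1, p0, q0, q1, t1, t2):
    """True when all three pixel differences fall below the thresholds."""
    return abs(p0 - q0) < t1 and abs(p1 - p0) < t2 and abs(q1 - q0) < t2

# hypothetical threshold values, for illustration only
smooth = edge_is_filterable(100, 102, 106, 104, t1=8, t2=4)  # gentle ramp
strong = edge_is_filterable(50, 52, 150, 148, t1=8, t2=4)    # sharp step
```

A small ramp across the edge satisfies the requirements (and is smoothed), while a large step, likely a real image edge rather than a blocking artifact, fails them and is left unfiltered.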
[0093] Once a determination is made to apply the deblocking filter,
a specific formula based on the boundary strength is applied. The
formulas used for deblocking are not described here, as they are
not essential to an understanding of the described embodiments.
However, it is sufficient to understand that as each
(vertical/horizontal) edge is filtered, the pixels used for inputs
are recomputed and the recomputed/output values may replace the
input values. Once replaced, the new values may be used for
filtering the later edges according to the sequence described above
with respect to FIG. 3A.
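The write-back behaviour described above may be sketched as follows, with a deliberately simple averaging filter standing in for the H.264 formulas (which, as noted, are not reproduced here). The only point illustrated is that filtering one edge replaces its input pixels, and the replaced values then feed the filtering of later edges.

```python
# Illustrative only: NOT the H.264 filter formula. Filtering pulls the two
# pixels on either side of the edge toward their mutual average, in place.

def filter_edge(pixels, edge):
    """Smooth the two pixels straddling `edge` (between edge-1 and edge)."""
    a, b = pixels[edge - 1], pixels[edge]
    avg = (a + b) // 2
    pixels[edge - 1] = (a + avg) // 2
    pixels[edge]     = (b + avg) // 2

row = [10, 10, 10, 10, 90, 90, 90, 90]
filter_edge(row, 4)  # first edge: uses the original pixel values
# a later, adjacent edge would now see the recomputed values at indices 3-4
```

This in-place replacement is what creates the dependency between a base edge and its adjacent edges in the processing sequence of FIG. 3A.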
[0094] It may be appreciated that the boundary strength and the
threshold requirements represent pre-conditions that need to be
satisfied for the application of the deblocking filter in the H.264
standard. However, other standards may specify different or
additional pre-conditions that need to be satisfied in determining
the application of the deblocking filter, as will be apparent to
one skilled in the relevant arts.
[0095] It should be appreciated that several features of the
invention described below can be implemented in various embodiments
as a desired combination of one or more of hardware, software, and
firmware. The description is continued with respect to an
embodiment in which various features are operative when software
instructions are executed.
5. Software Implementation
[0096] FIG. 4A is a block diagram illustrating the details of
processing unit 150A in an embodiment. The description below also
applies to processing unit 150B.
[0097] Processing unit 150A may contain one or more processors such
as central processing unit (CPU) 410, random access memory (RAM)
420, secondary storage unit 450, display controller 460, network
interface 470, and input interface 480. All the components may
communicate with each other over communication path 440, which may
contain several buses as is well known in the relevant arts. The
components of FIG. 4A are described below in further detail.
[0098] CPU 410 may execute instructions stored in RAM 420 to
provide several features of the present invention. CPU 410 may
contain multiple execution units as described below with respect to
FIG. 4B, with each execution unit potentially being designed for a
specific task. Alternatively, CPU 410 may contain only a single
general-purpose processing unit.
[0099] RAM 420 may receive instructions from secondary storage unit
450 using communication path 440. In addition, RAM 420 may store
video frames received from a video camera during the encoding
operations noted above. Display controller 460 generates display
signals (e.g., in RGB format) to display unit 170B (FIG. 1) based
on data/instructions received from CPU 410.
[0100] Network interface 470 provides connectivity to a network
(e.g., using Internet Protocol), and may be used to
receive/transmit compressed/encoded video/image frames on path 155
of FIG. 1. Input interface 480 may include interfaces such as
keyboard/mouse, and interface for receiving video frames from video
camera 130A.
[0101] Secondary storage unit 450 may contain hard drive 456, flash
memory 457, and removable storage drive 458. Some or all of the
data and instructions may be provided on removable storage unit
459, and the data and instructions may be read and provided by
removable storage drive 458 to CPU 410. Floppy drive, magnetic tape
drive, CD-ROM drive, DVD Drive, Flash memory, removable memory chip
(PCMCIA Card, EPROM) are examples of such removable storage drive
458.
[0102] Alternatively, data and instructions may be copied to RAM
420 from which CPU 410 may read and execute the instructions using
the data. Removable storage unit 459 may be implemented using
medium and storage format compatible with removable storage drive
458 such that removable storage drive 458 can read the data and
instructions. Thus, removable storage unit 459 includes a computer
readable (storage) medium having stored therein computer software
and/or data.
[0103] In general, the computer (or generally, machine) readable
medium refers to any medium from which processors can read and
execute instructions. The medium can be randomly accessed (such as
RAM 420 or flash memory 457), volatile, non-volatile, removable or
non-removable, etc. While the computer readable medium is shown
being provided from within processing unit 150A for illustration,
it should be appreciated that the computer readable medium can be
provided external to processing unit 150A as well.
[0104] In this document, the term "computer program product" is
used to generally refer to removable storage unit 459 or hard disk
installed in hard drive 456. These computer program products are
means for providing software to CPU 410. CPU 410 may retrieve the
software instructions, and execute the instructions to provide
various features of the present invention described below. Groups
of software instructions in any form (for example, in
source/compiled/object form or post linking in a form suitable for
execution by CPU 410) are termed as code.
[0105] It may be appreciated that though the H.264 standard
requires the edges to be processed in a particular sequence, it may
be desirable that as many computations as possible be performed in
parallel. Accordingly, the edges may be processed in parallel,
subject to the dependency requirements caused, for example, by the
boundary strengths and the need to use the recomputed values to
filter the later edges.
[0106] In one embodiment described below, multiple execution units
are employed to potentially process multiple edges in parallel. The
manner in which the throughput performance of applying a deblocking
filter may be enhanced in such an environment is described below
with examples (even though various features of the present
invention can be implemented in other types of environments,
potentially without multiple execution units, as will be apparent
to one skilled in the relevant arts by reading the disclosure
provided herein).
6. Processing Environment
[0107] FIG. 4B is a block diagram of a processing environment
containing multiple execution units, each potentially implementing
a pipelined architecture in one embodiment. A pipelined
architecture refers to an implementation technique in which
instructions are executed in a sequence of stages, thereby
facilitating multiple different instructions (or parts thereof)
being executed in parallel.
[0108] CPU 410 represents such a processing environment
implementing a pipelined architecture and is shown containing
instruction cache 411, instruction register 412, data registers
413, execution units 415A-415D, and data cache 417. Merely for
illustration, only a representative number/type of components is
shown in the Figure. Many processing environments contain many more
components, both in number and type, depending on the purpose for
which the processing environment is designed.
[0109] It should be appreciated that the pipelining technique
and/or the multiple-execution-units are pertinent to only some of
the features of the invention, as will be clear from the
corresponding context. Further, the execution units can be present
as different CPUs as well. Each component of FIG. 4B is described
below in further detail.
[0110] Instruction cache 411 maintains machine instructions to be
executed. The instructions may be loaded from a memory (such as RAM
420 via path 440) prior to commencement of execution. The
instructions together represent in machine executable form, a
software module designed to apply the deblocking filter to the
different edges in a reconstructed macro-block.
[0111] Instruction register 412 stores the machine instruction
currently being executed. During execution, each machine
instruction in instruction cache 411 is loaded into instruction
register 412 which then holds the software instruction while being
decoded and executed by the different execution units
415A-415D.
[0112] Data registers 413 contain various registers, with each
register having capabilities such as holding the input values to an
instruction, storing the execution results, providing access
to/from data between execution units, etc. In general, the
registers provide a small amount of storage (in comparison to data
cache 417 and RAM 420) while typically providing fast access to the
data.
[0113] Data cache 417 represents a temporary storage for frequently
accessed data (though accessed less frequently than the data stored
in data registers 413, in one embodiment). Once data is stored in
data cache 417, future uses can be served by accessing the cached
copy rather than fetching the original data (from memory such as
RAM 420) or recomputing it. The data stored in data cache 417 may
be periodically written to the memory.
[0114] Memory (such as RAM 420) may be viewed as containing
multiple memory locations, with each location storing a
corresponding data. As such, accessing specific data values may
require CPU 410 to specify the corresponding memory locations. In
general, accessing the data in memory locations is slower than
accessing data in data cache 417 which is slower than accessing
data in data registers 413.
[0115] Each of execution units 415A-415D may be designed to
independently execute a corresponding given set of machine
instructions together designed to perform a logical task (e.g.,
processing of a single edge, upon appropriate design of software
module 490 and/or compiler 494, described below).
[0116] Each of the execution units may further contain functional
units (each representing a stage in the pipelined architecture)
capable of performing corresponding specific operations, such as
loading/storing data values, branching based on conditions,
performing integer/floating point operations, etc. Such functional
units may fetch the machine instruction currently being executed
(or part thereof) from instruction register 412, decode the machine
instruction, and perform the corresponding operations indicated by
the machine instruction.
[0117] It may be appreciated that the throughput performance of
deblocking may be enhanced by utilizing the parallelism made
possible by the presence of multiple execution units and the
pipelining features.
[0118] Several aspects of the present invention enable the
parallelism to be exploited with respect to application of
deblocking filters. The manner in which the software code (and
consequently the resulting machine instruction) can be
specified/written to implement the deblocking filters is further
described with respect to an example environment supporting the
architecture of FIG. 4B described above.
7. Generating Machine Instructions for Parallelism
[0119] FIG. 4C depicts the manner in which machine instructions
(executable code) may be generated in one embodiment. Software
module 490 represents the software code containing user
instructions written by a developer. The software code may be
specified in any programming language, though higher-level
languages (e.g., C, C++, and Java) are generally preferred to
enhance the developers' productivity.
[0120] Compiler 494 processes the software code (in the specified
programming language) to generate executable code 498 containing
machine instructions suitable for execution in the processing
environment of FIG. 4B. Concepts such as object files, linking,
target machine specification, code generation, etc., are not
described in detail as not being pertinent to the concepts sought
to be illustrated. Compiler 494 may be designed to exploit the
parallelism possible in the processing environment (for example,
the environment described above), potentially by reordering the
logic without violating dependencies.
[0121] However, it is desirable that software module 490 itself
contain processing logic which lends itself to further exploitation
of the parallelism in the processing environment. The manner in
which the user instructions can be designed for enhanced
parallelism in several contexts is described below. It should
however be appreciated that the features of the invention can be
realized by embedding the corresponding intelligence in compilers
and similar systems software as well.
[0122] According to an aspect of the present invention, such
enhanced parallelism is obtained in the application of the
deblocking filter to the edges of a reconstructed macro-block. Such
a feature will be clearer in comparison with a prior approach which
uses an alternative technique, and accordingly the prior approach
is described briefly below.
8. Prior Approach
[0123] In one prior approach, each edge may be allocated to one of
the execution units, which then determines whether the edge meets
the requirements set forth with respect to FIG. 3C, and applies the
deblocking filter to the edge if the requirements are met.
[0124] Such an approach causes irregular branching during the
processing of each edge (since the "if" condition, checking whether
the edge meets the pre-conditions, could fail or succeed), thereby
breaking the pipelining process. The breaking of the pipeline
reduces the parallelism within an execution unit, as is well known
in the relevant arts. The reduced parallelism may in turn impede
the parallelism possible across the execution units (since the
dependent edges need to wait for completion of processing of the
base edge).
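The interleaved structure of the prior approach may be sketched as follows; compute_bs and apply_filter are hypothetical stand-ins for the determination and filtering operations. The data-dependent "if" inside the per-edge loop is the irregular branch referred to above.

```python
# Sketch of the prior, interleaved structure: determination and filtering
# are mixed in a single loop, so every iteration contains a branch whose
# outcome depends on the data.

def deblock_interleaved(edges, compute_bs, apply_filter):
    for edge in edges:
        bs = compute_bs(edge)
        if bs > 0:               # irregular branch: may go either way
            apply_filter(edge, bs)

filtered = []
deblock_interleaved(
    range(16),
    compute_bs=lambda e: e % 2,             # hypothetical: odd edges qualify
    apply_filter=lambda e, bs: filtered.append(e),
)
```

Because the branch outcome alternates unpredictably from edge to edge, a pipelined execution unit running this loop repeatedly stalls, which is the drawback the batched approach below seeks to avoid.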
[0125] A software module (or user instructions) designed according
to several aspects of the present invention improves the throughput
performance when applying the deblocking filter to reconstructed
image frames, while overcoming some of the disadvantages of the
above prior approach.
[0126] As may be appreciated, the user instructions (or the
corresponding machine instructions, also stored on a machine
readable medium) in turn cause CPU 410 (or the components therein)
to operate in the corresponding manner. Accordingly, the design of
the user instructions is described with reference to the effective
operation of CPU 410 in the description below.
9. Applying Deblocking Filter with Enhanced Parallelism
[0127] FIG. 5 is a flowchart illustrating the manner in which a
deblocking filter is applied with enhanced parallelism according to
an aspect of the present invention. The flowchart is described with
respect to FIGS. 1, 4A and 4B, merely for illustration. However,
various features can be implemented in other environments and other
components.
[0128] Furthermore, the steps are described in a specific sequence
merely for illustration. Alternative embodiments in other
environments, using other components and different sequence of
steps can also be implemented without departing from the scope and
spirit of several aspects of the present invention, as will be
apparent to one skilled in the relevant arts by reading the
disclosure provided herein. The flowchart starts in step 501, in
which control passes immediately to step 520.
[0129] In step 520, CPU 410 determines the specific ones of the
edges of the (reconstructed) macro-block to which a deblocking
filter is to be applied, based on boundary strength. In general, a
set of pre-conditions for each specific edge that are to be
satisfied prior to applying a deblocking filter, according to the
applicable standard (H.264 in the illustrative example), may be
evaluated for the determination.
[0130] In the case of the H.264 standard, the determination may be
performed similarly to the manner described above with respect to
FIG. 3C and therefore the description is not repeated in detail,
for conciseness. In summary, the deblocking filter is to be applied
to an edge if the corresponding boundary strength is greater than
0.
[0131] It should be noted that not all the pre-conditions (for
application of the deblocking filter) need be checked, as suited to
specific environments. For example, in the example embodiment
described below, the threshold requirement is not checked when
determining the edges to which the deblocking filter is to be
applied, since the threshold requirement is based on the values of
the pixels on either side of an edge, which may be modified by the
application of the deblocking filter to the base edge.
[0132] In step 560, CPU 410 applies the deblocking filter to each
of the determined specific edges of the macro-block in the order
specified in FIG. 3A. In particular, new values may be computed, as
a result, for a number of adjacent pixels determined by the
boundary strength of the edge being filtered. The new values, along
with any unchanged values of the various blocks, together represent
the reconstructed frame. The flowchart ends in step 599.
[0133] Thus, CPU 410 may first determine, as a batch, all the
specific edges of the macro-block that are to be filtered by
evaluating any applicable pre-conditions, and then apply the
deblocking filter, again as a batch, to only the determined edges
(i.e., those for which the pre-conditions are satisfied). This
means that the determination and filter-application steps are not
interspersed. In an embodiment, this manifests in software code
with the determination being outside of program structures, such as
loops, which apply the deblocking filter to each of the edges.
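The batched structure described in steps 520 and 560 may be sketched as follows, again with hypothetical compute_bs and apply_filter stand-ins: the determination loop runs to completion before the application loop begins, so no pre-condition check appears inside the filtering loop.

```python
# Sketch of the batched structure: phase 1 determines all qualifying edges
# (independently per edge), phase 2 applies the filter to only those edges.

def deblock_batched(edges, compute_bs, apply_filter):
    # Phase 1 (step 520): determination, with no filtering interspersed
    strengths = [compute_bs(e) for e in edges]
    to_filter = [e for e, bs in zip(edges, strengths) if bs > 0]
    # Phase 2 (step 560): application over the pre-selected edges only,
    # with no data-dependent branch in the loop body
    for e in to_filter:
        apply_filter(e, strengths[e])
    return to_filter

filtered = []
selected = deblock_batched(
    list(range(16)),
    compute_bs=lambda e: e % 2,             # hypothetical: odd edges qualify
    apply_filter=lambda e, bs: filtered.append(e),
)
```

Since phase 1 has no cross-edge dependencies, its iterations can be distributed across the available execution units, and phase 2 iterates only over edges already known to need filtering.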
[0134] It may be appreciated that by determining the specific edges
to be filtered prior to application of the deblocking filter, the
irregular branching caused by prior approaches can also be avoided
within the individual execution units applying the deblocking
filter to the corresponding edge, thereby improving the performance
of the deblocking filter (by increasing the parallelism in CPU
410).
[0135] Furthermore, as the determination of step 520 for each edge
can be performed without dependency on the other edges, it may be
possible to utilize as many execution units as are available for
the determination, thereby increasing the throughput performance.
Several features of the present invention provide for enhanced
parallelism even within execution of step 560, as described in
sections below.
[0136] It should be appreciated that the flowchart of FIG. 5 can be
implemented using various approaches, with corresponding
advantages. The description is continued with respect to an example
implementation of realizing the above noted features.
10. Example Implementation of Enhanced Parallelism
[0137] FIG. 6 is a flowchart illustrating the manner in which the
enhanced parallelism is obtained in application of deblocking
filters in one embodiment of the present invention. The flowchart
is described with respect to FIGS. 1, 4A and 4B, merely for
illustration. However, various features can be implemented in other
environments and other components.
[0138] Furthermore, the steps are described in a specific sequence
merely for illustration. Alternative embodiments in other
environments, using other components and different sequence of
steps can also be implemented without departing from the scope and
spirit of several aspects of the present invention, as will be
apparent to one skilled in the relevant arts by reading the
disclosure provided herein.
[0139] It may be appreciated that steps of FIG. 6 are first
performed for processing the vertical edges in a macro-block and
then may be performed again to process the horizontal edges in the
macro-block as specified by the H.264 standard. Accordingly, the
steps of the flow chart are described in relation to the processing
of the vertical edges, though the description is applicable to the
processing of the horizontal edges as well. The flowchart starts in
step 601, in which control passes immediately to step 610.
[0140] In step 610, CPU 410 generates a bit field representing the
(vertical/horizontal) edges in a macro-block to which a deblocking
filter is to be applied. A bit field contains a set of bits, with
each bit representing a corresponding Boolean flag (having the
values "true" or "false") indicating whether the deblocking filter
is to be applied to the corresponding edge.
[0141] In general, the false value may be represented as a bit
value of "0" or "1" with the true value represented as a bit value
of the opposite parity ("1" in the case of "0" and vice versa). In
one embodiment, the true and false values are respectively
represented as bit values "1" (indicating that the filter is to be
applied) and "0" (indicating that the filter need not be
applied).
[0142] It may be appreciated that the bit field can be formed while
performing the determination of step 520 described above. The
generated bit field may be loaded into a register (in data
registers 413) during the processing of the edges. The generated
bit field may indicate only the vertical or horizontal edges in the
macro-block or a combination of both.
[0143] In one embodiment, shown in FIG. 7, the bit field indicates
both the vertical and horizontal edges of a macro block to which a
deblocking filter is to be applied. As described above, the value
of each bit in bit field 720 indicates whether the corresponding
vertical/horizontal edge is to be filtered (value 1) or not (value
0).
[0144] The bit positions of the bits in bit field 720 are indicated
in 710, while the edges corresponding to the bits are indicated in
730. Accordingly, the horizontal edges (h0-h15) are represented by
the left most 16 bits (in the bit positions 0-15) while the
vertical edges (v0-v15) are represented by the right most 16 bits
(in the bit positions 16-31).
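The layout of bit field 720 can be sketched in software as follows; the helper names are illustrative, and the left-to-right bit-position convention matches FIG. 7:

```python
WIDTH = 32  # positions 0-15: horizontal edges h0-h15; 16-31: vertical v0-v15

def set_edge(field, pos):
    # Positions are counted from the left-most bit, as in FIG. 7.
    return field | (1 << (WIDTH - 1 - pos))

def edge_marked(field, pos):
    # True when the deblocking filter is to be applied to this edge.
    return (field >> (WIDTH - 1 - pos)) & 1 == 1
```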
[0145] It may be observed that the bit position corresponding to an
edge indicates the position of the edge in the sequence in which
all the edges in the macro-block are to be filtered (within the
group of edges in each orientation), as shown in FIG. 3A. For
example, bit position 5 corresponds to horizontal edge h5 which is
filtered fifth according to the horizontal sequence of FIG. 3A.
Similarly, bit position 23 (or 7 for the vertical orientation)
corresponds to the vertical edge v7 which is filtered seventh
according to the vertical sequence of FIG. 3A.
[0146] The description is continued assuming that the left most 16
bits in the bit field representing the horizontal edges is first
extracted to generate a horizontal edge bit field before performing
the below steps. The bits representing the vertical edges may be
extracted to form a vertical edge bit field when processing
vertical edges.
[0147] In step 620, CPU 410 sets a variable `loop count` equal to
the number of edges to which the deblocking filter is to be
applied. According to the convention above, loop count would
equal the number of `1` bits in the vertical/horizontal edge bit field.
Thus, for bit field 720, loop count may be set equal to 8 when
processing vertical edges and to 7 when processing horizontal
edges.
[0148] In one embodiment, a machine instruction is provided to
count the number of `1` bits in a register, and accordingly the
vertical/horizontal edge bit field may be loaded into the register
and the corresponding machine instruction may be executed to
determine the value for loop count.
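A software stand-in for such a population-count instruction can be sketched as follows (the function name is illustrative):

```python
def loop_count(edge_field):
    # Counts the '1' bits in the edge bit field, mirroring a hardware
    # population-count machine instruction.
    return bin(edge_field).count("1")
```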
[0149] The steps of 640-690 operate to apply the deblocking filter
to each determined vertical/horizontal edge of the macro-block (to
which the deblocking filter is to be applied). The loop would be
executed only as many times as the number of vertical/horizontal
edges determined in step 520.
[0150] In step 640, CPU 410 checks whether the loop count is
greater than 0. Control passes to step 660 if loop count is greater
than 0 (indicating that there is at least one edge that is to be
filtered) and to step 699 otherwise. The flowchart ends in step 699
indicating that there are no more edges to be filtered.
[0151] In step 660, CPU 410 identifies an edge to be processed
using the vertical/horizontal edge bit field. The edge may be
identified using a single instruction or a set of instructions
based on the instruction set capable of being executed by CPU
410.
[0152] In one embodiment, the identification of the edge is
performed using an instruction which receives an offset as an input
and indicates a next bit position starting from the offset at which
the corresponding bit equals a desired binary value ("1" indicating
an edge to which the deblocking filter is to be applied).
[0153] Thus, the instruction is first invoked with an offset (by
convention chosen as -1) to identify the next bit position at which
the corresponding bit indicates an edge to which the deblocking
filter is to be applied. During the next execution of step 660, the
instruction is invoked with the offset equaling the next bit
position to identify a following bit (also having a value of "1")
indicating a following edge to which the deblocking filter is to be
applied after the next edge.
[0154] Referring to FIG. 7, the instruction is first invoked with
offset=-1 and returns the value of 0, indicating that h0 is the next
edge to which the deblocking filter is to be applied. The instruction
is then invoked (during the next execution) with offset=0 (the next
bit position) to return the value of 3 indicating the following
edge h3 to which the deblocking filter is applied. Similarly, the
instruction is invoked in subsequent executions (with the offset
equaling the bit position of the edge determined in a previous
invocation) to identify the edges to which the deblocking filter is
to be applied.
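The offset-based scan of this embodiment may be sketched as follows; `next_set_bit` is a hypothetical software model of the machine instruction described above, applied here to a 16-bit horizontal edge field:

```python
def next_set_bit(field, offset, width=16):
    # Returns the first position after `offset` (counted from the
    # left-most bit) whose bit is "1", or -1 when none remains.
    for pos in range(offset + 1, width):
        if (field >> (width - 1 - pos)) & 1:
            return pos
    return -1

def edges_to_filter(field, width=16):
    # Repeatedly invoke the instruction model, feeding back the
    # previously returned position as the new offset.
    positions, pos = [], -1
    while (pos := next_set_bit(field, pos, width)) >= 0:
        positions.append(pos)
    return positions
```

For the FIG. 7 example (bits set at positions 0 and 3), the scan yields the edges h0 and h3 in filtering order.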
[0155] In an alternative embodiment, the identification of the edge
to be processed is performed by shifting the bit field (loaded in
the register) by a number of positions determined by the bit
position at which the next bit is present in the bit field.
[0156] The number of positions to be shifted may be determined by a
specific instruction such as "LMBD" (left most bit detect) which
receives a bit field and a flag indicating the bit value to be
detected as inputs and provides the position (from the left most
bit) of the next bit in the bit field having the bit value
indicated by the flag.
[0157] Accordingly, when the LMBD instruction is invoked with bit
field 720 and a flag "1" (indicating that the deblocking filter is
to be applied) as inputs, the value of "0" is generated as the
output indicating that the deblocking filter is to be applied to
the edge h0. The output value is stored in a register (acting as an
accumulator) for future use.
[0158] The bit field is then shifted left by 1 position
(determined as 1 more than the output value), resulting in the "1"
bit at bit position 0 being removed and each of the bits in the
other bit positions being moved to a bit position to their
respective left. Thus, the shifted bit field contains "0010" in the
bit positions 0-3.
[0159] During the next execution of step 660, the LMBD instruction
is again invoked with the shifted bit field and the flag as inputs
to generate the value of "2" as the output. The (output value +1)
is added to the value in the accumulator to derive the position of
the next edge to be filtered. In this case, the value "3" in the
accumulator indicates that the deblocking filter is to be applied
to the edge h3 (bit position 3 in bit field 720). The shifted bit
field is again shifted by 3 positions (1 more than the
output value) and the process is repeated for identifying the edges
to which the deblocking filter is to be applied.
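The LMBD-and-shift variant can be sketched as below, assuming a 16-bit field; `lmbd` is a software model of the left-most-bit-detect instruction, and the accumulator arithmetic follows the description above:

```python
def lmbd(field, width=16):
    # Left-most bit detect for bit value "1": position of the first
    # set bit from the left, or `width` when the field is empty.
    for pos in range(width):
        if (field >> (width - 1 - pos)) & 1:
            return pos
    return width

def edges_via_lmbd(field, width=16):
    positions, acc = [], -1
    mask = (1 << width) - 1
    remaining = field
    while remaining:
        out = lmbd(remaining, width)
        acc += out + 1            # accumulator holds the absolute position
        positions.append(acc)
        # Shift left by (output value + 1) to drop the processed bit.
        remaining = (remaining << (out + 1)) & mask
    return positions
```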
[0160] In step 670, CPU 410 determines the memory locations at
which the values of the pixels required to filter the edge are
stored using the bit-position in the bit-field for the edge to be
filtered. The determination of the specific memory locations may be
performed in a known way.
[0161] In one embodiment, a variable named `edge counter`
indicating the specific edge being processed is maintained. The
edge counter indicates the number of bit positions from a first bit
to a bit representing the specific/present edge to which a
deblocking filter is to be applied.
[0162] Referring to FIG. 7, the edge counter is initially set to
the value 0 (since the bit position 0 corresponding to edge h0 has
a bit value of "1"). During the next execution of step 670, the
edge counter is set to the value 3, the bit position of the
following edge h3. The value of the edge counter may be returned by
an instruction or may be maintained as a sum/accumulator of the
values returned by the instruction (for example, LMBD described
above).
[0163] The value of the edge counter may then be used to determine
the memory locations. In one embodiment, a lookup table is
maintained indicating the memory locations at which the values of
the pixels for each of the horizontal/vertical edge are stored. The
lookup table is indexed based on the edge counter, thereby
facilitating the determination of the specific memory
locations.
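A minimal sketch of such a lookup, under the assumption of a simplified layout in which each successive horizontal edge begins one 4-word sub-block row further into the macro-block (the offsets are illustrative only, not the actual H.264 memory layout):

```python
# Hypothetical lookup table: maps each horizontal-edge counter (0-15)
# to the byte offset of the first pixel word needed for that edge.
EDGE_OFFSETS = [i * 16 for i in range(16)]  # 4 words * 4 bytes apart

def edge_memory_location(mb_base, edge_counter):
    # Index the table with the edge counter to obtain the address.
    return mb_base + EDGE_OFFSETS[edge_counter]
```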
[0164] Alternatively, the memory locations may be computed based on
specific edge being processed (as indicated by the value of the
edge counter) in combination with the offset locations at which the
macro-block is stored, the size of each memory location, the
boundary strength, etc.
[0165] In step 680, CPU 410 applies the deblocking filter to the
edge to cause the values in the memory locations to be modified to
corresponding new values. The application of the deblocking filter
may involve loading the pixel values from the computed memory
locations, the performance of the filter operations to generate new
values (not described for conciseness) and storing the new values
to the corresponding memory locations.
[0166] In step 690, CPU 410 decrements the loop count by 1
indicating that the edge has been processed. Control then passes to
step 640, where the loop count value is checked to determine
whether more edges are to be processed.
[0167] It may be appreciated that the above steps provide enhanced
parallelism in a scenario that the processing of the different
horizontal/vertical edges can be performed independently. It may be
desirable that the processing of the edges be performed with
maximum parallelism even in a scenario that dependencies exist
among the edges.
[0168] An aspect of the present invention enables dependent edges
to be processed while providing enhanced parallelism, thereby
improving the throughput performance when applying deblocking
filters on reconstructed image frames. It may be helpful to first
understand the manner in which dependencies exist among the edges
and accordingly the description is continued illustrating the
dependencies existing in the application of the deblocking filter
to a macro-block in one embodiment.
11. Dependencies in Applying Deblocking Filter
[0169] FIGS. 8A and 8B together illustrate the dependencies in the
application of deblocking filter to the edges of a macro-block in
one embodiment. Each of the Figures is described in detail
below.
[0170] FIG. 8A depicts the dependencies in processing vertical
edges in one embodiment. In particular the Figure depicts the
dependency between the vertical edges v0 (between sub-blocks 315
and 312) and v4 (between sub-blocks 312 and 318). In one scenario,
when the vertical edges v0 and v4 are determined to have respective
boundary strengths of 3 and 2, the pixels {p2, p1, p0, q0, q1, q2}
represent the inputs to the deblocking filter for edge v0, while
the pixels {r1, r0, s1, s0} represent the inputs for edge v4.
[0171] It may be observed that pixels r1 and q2 refer to the same
pixel (shown as the multiple value "q2/r1" in the corresponding
box) indicating that the new value of pixel q2 is to be used as the
value of r1 in processing the vertical edge v4. As such, it may be
necessary that the application of the deblocking filter to edge v4
be performed after the processing of edge v0 (at least until the
new values of the common pixels are generated).
[0172] Accordingly, the edge v4 (dependent edge) is said to have a
dependency on edge v0 (base edge). The dependency is applicable to
each of the horizontal/vertical edges in the reconstructed
macro-block. For macro-block edges such as v0 and h0, the
dependency may be with respect to an edge in another
macro-block.
[0173] Similarly, FIG. 8B depicts the dependencies in processing
horizontal edges of a macro-block in one embodiment. In particular,
the Figure indicates that the processing of edge h8 (between
sub-blocks 325 and 328) requires the new values of the rows of
pixels n1 and n2 (shown as "n1/j2" and "n2/j1"), and therefore it
may be necessary that the application of the deblocking filter to
edge h8 be performed after the processing of edge h4 (between
sub-blocks 322 and 325).
[0174] It may be appreciated that since each sub-block is of size
4.times.4 pixels, such dependencies often occur when a pair of
edges (in the same orientation) are determined to be filtered using
boundary strengths greater than 2. In the H.264 standard, dependencies
are common while processing the luma information, since for chroma
information the maximum boundary strength used is 2 which causes no
overlapping pixels between pairs of edges.
[0175] Accordingly, the various aspects of the present invention
are described with respect to processing of luma information.
However, the features described below can be implemented for luma
and chroma information as well when encoding/decoding image frames
in other contexts and/or other standards as will be apparent to one
skilled in the relevant arts by reading the disclosure herein.
[0176] It may be noted that each edge (horizontal or vertical) may
have a dependency on only one other corresponding base edge at
least based on the sequence of processing the edges. In one
embodiment, when the edges are numbered in a sequence (as depicted
in FIG. 3A), the base edge is determined by subtracting 4 from the
sequence number of the dependent edge (with a value less than 0
indicating that the base edge occurs in another macro-block). For
example, for the horizontal edge h15, the base edge can be
calculated to be h11 (15-4).
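The base-edge derivation of this embodiment reduces to a single subtraction, sketched below:

```python
def base_edge(dependent_edge):
    # Subtract 4 from the dependent edge's sequence number; a negative
    # result means the base edge lies in a neighbouring macro-block.
    return dependent_edge - 4
```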
[0177] It may be appreciated that the edge dependencies described
above may cause memory dependencies while processing the edges in a
reconstructed macro-block. In one embodiment, the luma information
corresponding to each sub-block is stored as 4 words in a memory,
with each word containing 4 bytes, each byte storing the luma
information corresponding to a pixel. Thus, in FIG. 3B, the 4 words
represent the corresponding 4 rows of pixels {p3, p2, p1, p0} or
the corresponding rows of pixels {m3}, {m2}, {m1} and {m0}.
[0178] In such a scenario, a dependency between horizontal edges
causes a memory dependency of at least one memory location (rows of
pixels {n2/j1} in the example above), while a dependency between
vertical edges causes a memory dependency for at least 4 memory
locations (4 rows of pixels {q0, q1, q2/r1}).
[0179] It may be desirable that such memory dependencies be reduced
(or removed) to enhance the parallelism in the processing of the
edges, thereby improving the performance of the deblocking
filter.
[0180] Various aspects of the present invention enable such memory
dependencies to be reduced when processing horizontal/vertical
edges of a reconstructed macro-block. The description is continued
illustrating the manner in which memory dependencies are reduced
when processing edges (in one orientation) in the reconstructed
macro-block.
12. Processing Edges in One Orientation
[0181] FIG. 9 is a flowchart illustrating the manner in which
memory dependencies in processing the edges (in one orientation) of
a reconstructed block are reduced according to an aspect of the
present invention.
[0182] The description is continued assuming that the horizontal
edges of the reconstructed macro-block are being processed and
accordingly the flowchart is described with respect to FIGS. 4A,
4B, and 8B, merely for illustration.
[0183] Further, it is assumed for convenience that the processing
of a horizontal edge is being performed by one of the execution
units (415A) contained in CPU 410, though the processing of
different edges may be performed by different execution units in
parallel, at least as permitted by several aspects of the present
invention described below. However, the various features can be
implemented in other environments and other components.
[0184] Furthermore, the steps are described in a specific sequence
merely for illustration. Alternative embodiments in other
environments, using other components and different sequence of
steps can also be implemented without departing from the scope and
spirit of several aspects of the present invention, as will be
apparent to one skilled in the relevant arts by reading the
disclosure provided herein. The flowchart starts in step 901, in
which control passes immediately to step 910.
[0185] In step 910, execution unit 415A receives the memory
locations storing the values of the pixels required to filter a
horizontal (or present) edge of a reconstructed macro block and
also a bit field indicating the edges in the macro block to which a
deblocking filter is to be applied.
[0186] The received memory locations may correspond to the memory
locations computed in step 670, while the received bit field may
correspond to the bit field generated in step 610 (an example of
which is shown in FIG. 7). As described above, the bit field
indicates the specific edges to which a deblocking filter is to be
applied (in one embodiment, by corresponding bit values of
"1").
[0187] The memory locations and the bit field may be received from
another execution unit (or a scheduling unit not shown) which
identifies the horizontal edge to be processed. The description is
continued assuming that the deblocking filter is being applied
to horizontal edge h8.
[0188] In step 930, execution unit 415A reads/loads an input set of
values from the corresponding memory locations (received in step
910). The reading/loading and writing/storing of the values from/to
the memory locations may be performed in a known way. The input set
of values may be read into a set of registers provided in data
registers 413.
[0189] The number of values to be read/loaded may be determined
based on the boundary strength (which indicates the number of
pixels to be used as inputs to the deblocking filter). In one
embodiment, the input set of values corresponding to the two
sub-blocks (forming the edge) is retrieved in the form of 1-8
32-bit words (1-4 words per sub-block) from respective memory
locations in memory 480.
[0190] Thus, while processing edge h8 and assuming a boundary
strength of 3, the 3 words corresponding to the rows of pixels
{j2}, {j1}, and {j0} in sub-block 325 and the 3 words corresponding
to the rows of pixels {k2}, {k1}, and {k0} in sub-block 328 are
read into a set of registers provided in data registers 413. For
convenience, the registers are named j2_3210, j1_3210, j0_3210,
k0_3210, k1_3210, and k2_3210, with the name indicating
the pixels stored in the corresponding register.
[0191] In step 940, execution unit 415A checks whether the bit
field (received in step 910) indicates that the deblocking filter
is to be applied to the base edge (the edge on which the horizontal
edge h8 is dependent upon as described above with respect to FIG.
8B).
[0192] Execution unit 415A first determines the base edge
corresponding to the horizontal edge in a convenient/suitable
manner. In one embodiment described above, the base edge is
determined by subtracting 4 from the sequence number of the
horizontal edge. Thus, for the horizontal edge h8, the base edge is
calculated to be h4 (8-4).
[0193] Execution unit 415A then checks whether the bit field
indicates that the deblocking filter is to be applied to the base
edge by inspecting the value of the bit corresponding to the base
edge in the bit field. Control passes to step 950 if the bit has a
value of "1" (indicating that the deblocking filter is to be
applied to the base edge) and to step 960 otherwise.
[0194] In step 950, execution unit 415A replaces the dependent
values in the input set with corresponding values from a buffer.
The dependent values may be determined based on the pixels that are
common to both the horizontal edge and its base edge. The
values in the buffer represent the new values corresponding to the
common pixels.
[0195] In the above example, assuming that the base edge h4 is
filtered using a boundary strength of 3 (as depicted in FIG. 8B)
the dependent values may be determined to be the values
corresponding to the row of pixels {n1/j2} and {n2/j1}. Thus, the
new values of the rows of pixels {n1} and {n2} (generated by
applying the deblocking filter to the base edge h4) may be
retrieved from the buffer and used to respectively replace the
values in the registers j2_3210 and j1_3210 (respectively storing
the old values of the rows of pixels {j2} and {j1}).
[0196] It may be appreciated that the buffer may contain the new
values of only the common pixels (determined based on the
dependency among the edges). The buffer may be provided in data
cache 417 instead of memory (such as RAM 420), thereby increasing
the speed of access to the data. Control then passes to step
960.
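Steps 940 and 950 can be sketched together as follows; the row names and the dict-based buffer are illustrative stand-ins for the registers and data-cache buffer described above:

```python
def prepare_input(edge, loaded, bit_field, buffer, width=16):
    # Step 940 analogue: check whether the base edge (edge - 4) is
    # marked in the edge bit field (positions counted from the left).
    base = edge - 4
    if base >= 0 and (bit_field >> (width - 1 - base)) & 1:
        # Step 950 analogue: replace the loaded values of the shared
        # pixel rows with the filtered values kept in the buffer.
        for row in loaded:
            if row in buffer:
                loaded[row] = buffer[row]
    return loaded
```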
[0197] In step 960, execution unit 415A performs filter operations
(as part of applying the deblocking filter) using the input set to
generate a corresponding output set of values, the output set
representing the set of pixels after filtering. The output set of
values may be generated and stored in another set of registers,
conveniently named, j2_3210', j1_3210', j0_3210', k0_3210',
k1_3210', and k2_3210' with the name indicating the pixels stored
in the corresponding register.
[0198] It may be observed that in a scenario that the base edge is
determined to be filtered, the input set used in performing the
filter operations contains the new values of the common/dependent
pixels (replaced in step 950) in conformance to the H.264 standard.
Alternatively, in a scenario that the base edge is determined to be
not filtered (bit in the mask=0), the loaded values are used as the
input set in performing the filter operations.
[0199] As described above, the specific set of filter operations
used to generate the output values is not described for
conciseness. Further, though the output values are assumed to be
generated in a different set of registers (having names similar to
the input set of registers), the techniques described herein can
also be applied when the output values are generated in data cache
and/or memory.
[0200] In step 970, execution unit 415A computes a set of
differences (according to equation 1 noted above) using the input
set of values. The computed differences are compared with the
respective threshold values described above with respect to FIG.
3C. As described above, each of the set of differences is computed
based on any new values of the common/dependent pixels according to
the H.264 standard.
[0201] The set of differences are computed for each of the set of
pixels forming the edge. Thus, for horizontal edge h8, the set of
differences is computed for each of the sets of pixels {j2, j1, j0,
k0, k1, k2} by substituting the values of j0, j1, k0, k1
respectively for p0, p1, q0 and q1 in Equation 1. The set of
differences are then used to determine the values to be
written/stored in the memory locations as described below.
[0202] Though the computation of the set of differences are shown
as being performed by execution unit 415A, it may be appreciated
that the computation may be performed by another execution unit
(such as 415B) in parallel with the performance of the filter
operations in step 960, thereby improving the throughput
performance. Such parallel performance of steps 960 and 970 is
facilitated by the replacement of the dependent values from a
buffer in step 950.
[0203] In step 980, execution unit 415A stores the output or input
set of values to the buffer based on comparison results of the set
of differences with respective threshold values according to
Equation 1. The output set of values are stored in the buffer if
the comparison satisfies the threshold requirements shown in
Equation 1 and the input set of values are stored in the buffer
otherwise.
[0204] It may be appreciated that the values (representing the new
values of the pixels) stored in the buffer may later be used in
step 950 when the deblocking filter is applied to the horizontal
edge h12. As described above, only the dependent/common values in
the output set may be stored in the buffer for convenience.
[0205] In step 990, execution unit 415A writes the output or input
set of values (from the corresponding set of registers) to the
corresponding memory locations based on the comparison results
noted above. The memory locations may be the same memory locations
from which the input set of values was read in step 930, in which
case only the output set of values need to be written and the input
set of values need not be written back. Alternatively, output or
input set of value may be written to a corresponding set of memory
locations where the post-filtered/displayed image frame is to be
generated.
[0206] In one embodiment, a set of mutually exclusive
conditional-store instructions are used to write the output/input
set of values to the memory locations. Each conditional-store
instruction receives as inputs a value to be written, a memory
location at which the value is to be written and a condition. On
execution, the value is written to the memory location only when
the condition is fulfilled.
[0207] An output/input value is then written to a memory location
by having two conditional-store instructions in tandem, whereby the
threshold is provided as the condition of the instruction for
writing the output value while the negation of the threshold is
provided as condition of the other instruction for writing the
input value. Accordingly, on execution of the tandem
conditional-store instructions, the output value is written to the
memory location when the threshold is fulfilled and the input value
is written otherwise.
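The tandem of mutually exclusive conditional stores can be modeled as below; the function names are illustrative:

```python
def conditional_store(memory, addr, value, condition):
    # Models a predicated store: the write occurs only when the
    # condition is fulfilled.
    if condition:
        memory[addr] = value

def tandem_store(memory, addr, output_value, input_value, threshold_ok):
    # Two predicated stores with complementary conditions: exactly one
    # executes, so the value selection needs no branch.
    conditional_store(memory, addr, output_value, threshold_ok)
    conditional_store(memory, addr, input_value, not threshold_ok)
```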
[0208] A set of tandem conditional-store instructions may be used
to write the output or input set of values to the corresponding
memory locations. In the above example, the storage of the output
or input set of values is performed by four tandem
conditional-store instructions (one for each of the sets of pixels
{j2, j1, j0, k0, k1, k2}), with each of the tandem conditional-store
instructions containing an instruction for storing the value in a
corresponding output register (such as j2_3210') and another
instruction for storing the value in a corresponding input register
(such as j2_3210).
[0209] It may be appreciated that in a scenario that the output set
of values are to be written to the same memory locations (from
which the input set of values was read in step 930), the
conditional-store instruction corresponding to the non-fulfillment of
the threshold (for writing the input value) need not be
executed.
[0210] The storage of the input set of values in the memory
locations/buffer (instead of the output set of values) indicates
that the deblocking filter has not been applied to the
corresponding set of pixels of the present edge being processed.
The flow chart ends in step 999.
[0211] It may be appreciated that the determination of whether a
base edge is being filtered using the bit field, enables the
application of deblocking filters to at least some of the edges in
the macro-block in parallel (without necessitating waiting for the
processing of the base edge to be completed).
[0212] Further, by storing the dependent values (determined based
on the dependency among the edges) in a buffer in data cache 417
(faster than memory), the memory dependencies among the horizontal
edges are reduced, further improving the throughput
performance.
[0213] It may be observed that the above steps for the application
of a deblocking filter are related to processing of edges in one
orientation (horizontal or vertical). However, for the other
orientation (vertical in case of horizontal and vice versa) the
above steps may be modified based on the information contained in
the patent document titled "Loop Deblock Filtering Of Block Coded
Video In a Very Long Instruction Word Processor" by Jagadeesh
Sankaran with publication number US 2005/0117653 available from US
patent office.
[0214] In particular, the reader is directed to FIGS. 15 and 16 in
the above noted patent document which illustrate a manner in which
edges in the other orientation can be processed. Accordingly, the
steps of FIG. 9 described above may be modified to exploit the
transpose feature noted in the patent document, as described below
with examples.
13. Processing Edges in the Other Orientation
[0215] Assuming that deblocking filter is to be applied to vertical
edge v4 (and referring to FIG. 8A), the input set of values is read
into a set of registers named s1_s0_r0_r1_0, s1_s0_r0_r1_1,
s1_s0_r0_r1_2, and s1_s0_r0_r1_3, the name indicating the
pixels stored in each register and the last number indicating the
corresponding row. It may be observed that the corresponding values
of the rows of pixel {r1} (having a dependency on the vertical edge
v0) are read into each of the 4 registers, indicating that there
are 4 memory dependencies.
[0216] The input values in the registers are then transposed and
stored in the same/different register conveniently named s1_3210,
s0_3210, r0_3210, and r1_3210. It may be observed that by
transposing the input set of values, the rows of pixel {r1} are
stored in a corresponding register r1_3210, indicating that the
number of memory dependencies has been reduced from 4 to 1.
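The 4x4 transpose that gathers the shared column into a single register can be sketched as follows:

```python
def transpose4(rows):
    # Transposes a 4x4 block of pixel values so that a column holding
    # the shared pixels (e.g. {r1}) lands in one row/register.
    return [list(col) for col in zip(*rows)]
```

Applying the transpose twice restores the original layout, which corresponds to the re-transposition of the output set before the store described below.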
[0217] The filter operations are then performed using the
transposed input set of values to generate a corresponding
transposed output set of values in the set of registers. The filter
operations may be performed on the transposed input set loaded from
memory or on the input set containing the new values replaced from
a buffer, if the base edge is determined to be filtered based on
the bit field.
[0218] The transposed output set of values (representing the new
values of the pixels) are then stored in the buffer and may later
be used when the deblocking filter is applied to the vertical edge
v8. As described above, only the dependent values in the transposed
output set may be stored in the buffer for convenience.
[0219] The transposed output set of values is then transposed to
generate an output set of values in the set of registers, the
output set representing the values of the pixels after filtering.
Thus, the transposed output set of values in the registers s1_3210,
s0_3210, r0_3210, and r1_3210 is transposed to generate the output set
of values in the same or different registers, conveniently named
s1_s0_r0_r1_0, s1_s0_r0_r1_1, s1_s0_r0_r1_2, and s1_s0_r0_r1_3,
similar to the registers into which the input set of values was
read. The output set of values in the set of registers is then
written to the corresponding set of memory locations.
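Under the same illustrative assumptions (4x4 block, 8-bit pixels, simplified non-normative smoother), the steps of paragraphs [0215] through [0219] may be sketched end to end. Transposing the output is the same 4x4 operation applied a second time, which restores row order so that each register can be written back with a single row store.

```c
#include <stdint.h>
#include <string.h>

static void xpose(const uint8_t a[4][4], uint8_t b[4][4]) {
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            b[c][r] = a[r][c];
}

/* Round-trip sketch: read rows, transpose, filter in the transposed
 * domain, transpose back, and store rows (illustrative only). */
static void deblock_vertical_edge(uint8_t frame[4][4]) {
    uint8_t rows[4][4], t[4][4], out[4][4];
    memcpy(rows, frame, sizeof rows);       /* row loads               */
    xpose(rows, t);                         /* t[0..3] = s1,s0,r0,r1   */
    for (int k = 0; k < 4; k++) {           /* simplified smoother     */
        uint8_t p = t[1][k], q = t[2][k];
        t[1][k] = (uint8_t)((t[0][k] + 2 * p + q + 2) >> 2);
        t[2][k] = (uint8_t)((p + 2 * q + t[3][k] + 2) >> 2);
    }
    xpose(t, out);                          /* restore row order       */
    memcpy(frame, out, sizeof out);         /* row stores              */
}
```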
[0220] It may be observed that the transposed output set of values
(generated by application of the deblocking filter) is maintained
in the buffer instead of the output set (generated after
transposing), thereby improving the performance of the application
of the deblocking filter to the edges in the other orientation.
14. Conclusion
[0221] While various embodiments of the present invention have been
described above, it should be understood that they have been
presented by way of example only, and not limitation. Thus, the
breadth and scope of the present invention should not be limited by
any of the above-described exemplary embodiments, but should be
defined only in accordance with the following claims and their
equivalents.
[0222] It should be understood that the figures and/or screen shots
illustrated in the attachments highlighting the functionality and
advantages of the present invention are presented for example
purposes only. The present invention is sufficiently flexible and
configurable, such that it may be utilized in ways other than that
shown in the accompanying figures.
[0223] Further, the purpose of the following Abstract is to enable
the U.S. Patent and Trademark Office and the public generally, and
especially the scientists, engineers and practitioners in the art
who are not familiar with patent or legal terms or phraseology, to
determine quickly from a cursory inspection the nature and essence
of the technical disclosure of the application. The Abstract is not
intended to be limiting as to the scope of the present invention in
any way.
* * * * *