U.S. patent application number 11/979092 was filed with the patent office on 2008-06-05 for architecture and method for parallel embedded block coding.
This patent application is currently assigned to National Taiwan University. Invention is credited to Yu-Wei Chang, Liang-Gee Chen, Hung-Chi Fang.
Application Number | 20080131012 11/979092 |
Document ID | / |
Family ID | 34677508 |
Filed Date | 2008-06-05 |
United States Patent Application | 20080131012 |
Kind Code | A1 |
Chen; Liang-Gee; et al. | June 5, 2008 |
Architecture and method for parallel embedded block coding
Abstract
The present invention provides a high-speed, memory-efficient
parallel coding technique for embedded block coding with optimized
truncation (EBCOT) used in still image compression. Owing to its
parallel processing method and structure, it processes a discrete
wavelet transform (DWT) coefficient per clock cycle without storing
any state variable. Therefore, the state variable memory can be
eliminated and the external memory bandwidth can be reduced. With
the same chip-area cost and lower power consumption, the
processing rate of this invention is several times higher than
that of conventional schemes. Furthermore, the present invention
processes 50 M coefficients per second at 100 MHz and can encode
lossless HDTV 720p resolution pictures at 30 fps in real time.
Inventors: | Chen; Liang-Gee; (Taipei, TW); Fang; Hung-Chi; (Taipei, TW); Chang; Yu-Wei; (Taipei, TW) |
Correspondence Address: |
TROXELL LAW OFFICE PLLC
SUITE 1404
5205 LEESBURG PIKE
FALLS CHURCH, VA 22041, US |
Assignee: | National Taiwan University |
Family ID: | 34677508 |
Appl. No.: | 11/979092 |
Filed: | October 31, 2007 |
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number |
10739067 | Dec 19, 2003 | |
11979092 | | |
Current U.S. Class: | 382/240; 375/E7.047; 375/E7.072 |
Current CPC Class: | H04N 19/13 20141101; H04N 19/63 20141101; H04N 19/647 20141101 |
Class at Publication: | 382/240 |
International Class: | G06K 9/36 20060101 G06K009/36 |
Claims
1-9. (canceled)
10. A coding apparatus processing a DWT coefficient having a
plurality of bit-planes in parallel at a time to provide coding
information for further coding process, said coding apparatus
comprising: a Gobang register bank (GRB) module, a compute most
significant bit pass (CMP) module, find contribution and coding
pass (FC) modules, context formation (CF) modules, a reconfigurable
first-in first-out register (RFIFO) module and arithmetic encoder
(AE) modules; wherein there are one said FC module and one said CF
module for each bit-plane and one AE module for every two
bit-planes.
11. The coding apparatus claimed in claim 10, wherein said Gobang
register bank module is a 2-dimensional shift register bank whereby
input DWT coefficients are recorded to meet the JPEG 2000 scan
order.
12. The coding apparatus claimed in claim 10, wherein said compute
most significant bit pass module determines a value $p_c^{m_c}$ of
the target coefficient and computes variables $\lambda_s^k$,
$\kappa_s^k$ of each coefficient.
13. The coding apparatus claimed in claim 10, wherein said find
contribution and coding pass (FC) module determines the coding pass
information of the target coefficient and the PHVD information for
said context formation module to compute contexts.
14. The coding apparatus claimed in claim 10, wherein said context
formation module is provided to treat special run-length codes.
15. The coding apparatus claimed in claim 10, wherein said
reconfigurable first-in first-out register (RFIFO) module is
provided to make the compression process more fluent.
16. The coding apparatus claimed in claim 10, wherein said
arithmetic encoder is provided to decrease the hardware requirement
and increase the system utilization.
17. (canceled)
Description
FIELD OF THE INVENTION
[0001] The present invention relates to a coding apparatus and
method for still image compression and particularly to the
architecture and method for parallel embedded block coding.
BACKGROUND OF THE INVENTION
[0002] JPEG 2000 is the latest standard for still image coding. As
the successor to JPEG, with excellent coding performance and
abundant features, it may become the most popular still image
coding standard, applied in digital cameras, digital video cameras
and other digital devices.
[0003] In JPEG 2000, however, embedded block coding is the most
complicated part as well as a hot topic investigated by
researchers. Conventional coding schemes process a code-block
bit-plane by bit-plane in serial and have many drawbacks, listed as
follows:
[0004] 1. The processing rate is slow.
[0005] 2. The requirement of on-chip random access memory is
high.
[0006] 3. The efficiency and integration of the system are poor,
because it accesses off-chip memory with a bit-plane as the unit.
[0007] The present invention solves the above problems with a
parallel coding technique. It speeds up the processing rate and
avoids the memory requirement for state variables. Furthermore,
it greatly improves the external memory access scheme and
facilitates integration of the coder by dealing with a discrete
wavelet transform (DWT) coefficient at word level at a time.
SUMMARY OF THE INVENTION
[0008] Therefore, a primary object of the present invention is to
provide an image processing system for real-time image compression
products, such as digital cameras, digital video cameras, real-time
surveillance systems, or any lossless compression for medical or
military imagery.
[0009] Another object of the present invention is to provide a
coding method to treat all bit-planes in parallel at a time under
the latest standard of still image compression.
[0010] Still another object of the present invention is to provide
a high-speed, memory-efficient architecture for embedded block
coding. With the same chip-area cost, it increases the processing
rate and decreases the memory bandwidth by a factor of six compared
to state-of-the-art technology.
[0011] In order to achieve the foregoing objects, the coding method
processes a DWT coefficient having a plurality of bit-planes in
parallel at a time to provide coding information for the further
coding process, comprising the steps of:
(a) Acquiring a target coefficient and the eight neighboring
coefficients around said target coefficient.
(b) Assigning a first contribution to each bit-plane of the
neighboring coefficients in respect to the location of their MSBs
and determining the coding pass of each bit-plane of said target
coefficient by said contributions.
(c) Calculating a first group of state variables of every bit-plane
of said target and neighboring coefficients in respect to the
location of their MSBs.
(d) Assigning a second contribution to each bit-plane of said
target and said neighboring coefficients according to said first
group of state variables and the coding pass of said target and
neighboring coefficients, and determining magnitude coding
information obtained through a first predefined table using said
second contributions as references.
(e) Calculating a second group of state variables of said
neighboring coefficients according to said first group of state
variables.
(f) Assigning a third contribution to the neighboring coefficients
according to said second group of state variables and determining
sign coding information obtained through a second predefined table
using said third contributions as references.
[0012] The coding apparatus processes a DWT coefficient having a
plurality of bit-planes in parallel at a time to provide coding
information for further coding process. The apparatus comprises a
Gobang register bank (GRB) module, a compute most significant bit
pass (CMP) module, find contribution and coding pass (FC) modules,
context formation (CF) modules, a reconfigurable first-in first-out
register (RFIFO) module, and arithmetic encoder (AE) modules,
wherein there are one said FC module and one said CF module for
each bit-plane and one AE module for every two bit-planes.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] These and other objects, features, and advantages of the
present invention will become apparent with reference to the
following descriptions and accompanying drawings, in which:
[0014] FIG. 1 is a view showing the coefficients in the context
window according to the present invention;
[0015] FIG. 2 is a table showing the sign coding information
produced by contributions according to the present invention;
[0016] FIG. 3 is a view showing the coding apparatus according to
the present invention;
[0017] FIG. 4 is a view showing a structure of the Gobang register
bank (GRB) module according to the present invention;
[0018] FIG. 5 is a view showing a structure of the compute most
significant bit (MSB) pass (CMP) module according to the present
invention;
[0019] FIG. 6 is a view showing processing elements of the compute
most significant bit (MSB) pass (CMP) module according to the
present invention;
[0020] FIG. 7 is a view showing a structure of the find
contribution and coding pass (FC) module according to the present
invention;
[0021] FIG. 8 is a view showing processing elements of the find
contribution and coding pass (FC) module according to the present
invention;
[0022] FIG. 9 is a view showing a function $H^\chi$ according
to the present invention;
[0023] FIG. 10 is a view showing a structure of the context
formation (CF) module according to the present invention;
[0024] FIG. 11 is a view showing a function $XD^\chi$
according to the present invention;
[0025] FIG. 12 is a view showing the relationship between the
length of the first-in first-out register (FIFO) and the number of
clock cycles according to the present invention;
[0026] FIG. 13 is a view showing a structure of the arithmetic
encoder (AE) modules according to the present invention; and
[0027] FIG. 14 is a table showing a performance comparison between
the present invention and conventional schemes.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0028] The following descriptions of the preferred embodiments are
provided for understanding the features and the structures of the
present invention.
[0029] The present invention is a parallel embedded block coding
scheme which codes all bit-planes in parallel, increasing the
processing speed and decreasing the memory bandwidth at the same
chip-area cost compared with conventional schemes. Referring to
FIG. 1 and FIG. 2, the coding method of the present invention is
described in the following steps.
[0030] In the step (a), a target coefficient and eight neighboring
coefficients around said target coefficient are acquired. Referring
to FIG. 1, the symbol c denotes the target coefficient and d0, v0,
d1, h0, h1, d2, v1, and d3 denote the neighboring coefficients
where prefixes d, v, and h mean the diagonal, vertical, and
horizontal relationship in respect to the target.
[0031] In the step (b), a first contribution of each bit-plane of
said neighboring coefficients in respect to the location of their
MSB is obtained. If the neighboring coefficient s is processed
after the target coefficient, like h1, v1, and d3 shown in FIG. 1,
the first contribution is

$$\phi_s^k = \begin{cases} 0, & k \ge m_s \\ 1, & k < m_s, \end{cases}$$

and if the neighboring coefficient s is processed before the target
coefficient, the first contribution is

$$\phi_s^k = \begin{cases} 1, & k < m_s \\ 1, & (k = m_s) \;\&\; (p_s^{m_s} = 1) \\ 0, & \text{otherwise.} \end{cases}$$

The symbol $m_s$ appearing in the above functions is obtained
from

$$m_s = \begin{cases} -1, & \mu_s = 0 \\ m, & 2^m \le \mu_s < 2^{m+1}, \end{cases}$$

where the symbol $\mu_s$ denotes the magnitude of the
coefficient s.
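The MSB location $m_s$ above can be sketched in Python (a hypothetical helper for illustration, not part of the patent's hardware):

```python
def msb_index(mu: int) -> int:
    """MSB location m_s of a coefficient magnitude mu_s:
    -1 when mu_s = 0, otherwise the m with 2**m <= mu_s < 2**(m+1)."""
    return mu.bit_length() - 1  # bit_length() of 0 is 0, giving -1
```

For example, a magnitude of 5 satisfies 4 <= 5 < 8, so its MSB location is 2.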
[0032] Next, after acquiring the contributions $\phi_s$ of said
neighboring coefficients, the coding pass $p_c^k$ of coefficient c
at the k-th bit-plane is determined by the following function:

$$p_c^k = \begin{cases} 2, & k < m_c \\ 3, & k = m_c \;\&\; \phi_s = 0 \\ 1, & \text{otherwise.} \end{cases}$$
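As an illustrative sketch (function and parameter names are my own, not the patent's), the first-contribution and coding-pass rules can be written as:

```python
def first_contribution(k: int, m_s: int, processed_before: bool,
                       p_s_msb: int = 0) -> int:
    """phi_s^k for a neighbor s with MSB location m_s.
    p_s_msb stands for p_s^{m_s}, the coding pass of s at its own MSB."""
    if not processed_before:          # neighbor coded after the target
        return 1 if k < m_s else 0
    if k < m_s:                       # neighbor coded before the target
        return 1
    return 1 if (k == m_s and p_s_msb == 1) else 0

def coding_pass(k: int, m_c: int, all_phi_zero: bool) -> int:
    """p_c^k of the target c: 2 below the MSB, 3 at the MSB when every
    neighbor contribution phi_s is 0, and 1 otherwise."""
    if k < m_c:
        return 2
    if k == m_c and all_phi_zero:
        return 3
    return 1
```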
[0033] In step (c), the first group of state variables of every
bit-plane of the coefficients in respect to the location of its MSB
is obtained, wherein said state variables comprise a first variable
$\lambda_s^k$ and a second variable $\kappa_s^k$. The first
variable $\lambda_s^k$ denotes whether the bit of coefficient s in
the k-th bit-plane is lower than its most significant bit (MSB) and
is obtained by

$$\lambda_s^k = \begin{cases} 1, & k < m_s \\ 0, & k \ge m_s. \end{cases}$$

The second variable $\kappa_s^k$ denotes whether the bit of
coefficient s in the k-th bit-plane is the most significant bit
(MSB) and is obtained by

$$\kappa_s^k = \begin{cases} 1, & k = m_s \\ 0, & k \ne m_s. \end{cases}$$
[0034] In the step (d), the second contribution of each bit-plane
of said neighboring coefficients according to said first group of
state variables and the coding pass of said target and neighboring
coefficients is obtained. If the neighboring coefficient s is
processed after the target coefficient, the second contribution
$\sigma_s^k$ of the coefficient s in the k-th bit-plane is obtained
by

$$\sigma_s^k = \begin{cases} \lambda_s^k, & \kappa_s^k = 0 \\ 1, & \kappa_s^k = 1 \;\&\; p_c^k \ne 1 \;\&\; p_s^k = 1 \\ 0, & \text{otherwise.} \end{cases}$$

If the neighboring coefficient s is processed before the target
coefficient, the second contribution $\sigma_s^k$ of the
coefficient s in the k-th bit-plane is obtained by

$$\sigma_s^k = \begin{cases} \lambda_s^k, & \kappa_s^k = 0 \\ 1, & \kappa_s^k = 1 \;\&\; p_c^k \ne 1 \;\&\; p_s^k \ne 1 \\ 0, & \text{otherwise.} \end{cases}$$
[0035] Next, after acquiring the eight second contributions, the
second contributions are summed separately by groups of horizontal,
vertical, and diagonal coefficients in respect to the target. The
contribution summations are

$$H^k = \sum_{i=0}^{1} \sigma_{hi}^k, \qquad V^k = \sum_{i=0}^{1} \sigma_{vi}^k, \qquad D^k = \sum_{i=0}^{3} \sigma_{di}^k.$$
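A sketch of the second-contribution rule and the H/V/D sums (names are illustrative; lam_k and kap_k stand for the λ and κ variables of step (c)):

```python
def second_contribution(lam_k: int, kap_k: int, p_c_k: int, p_s_k: int,
                        processed_before: bool) -> int:
    """sigma_s^k for neighbor s at bit-plane k."""
    if kap_k == 0:
        return lam_k
    if not processed_before:          # neighbor coded after the target
        return 1 if (p_c_k != 1 and p_s_k == 1) else 0
    return 1 if (p_c_k != 1 and p_s_k != 1) else 0

def hvd_sums(sigma_h, sigma_v, sigma_d):
    """H^k, V^k, D^k: group-wise sums over the two horizontal, two
    vertical, and four diagonal neighbor contributions."""
    return sum(sigma_h), sum(sigma_v), sum(sigma_d)
```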
[0036] Next, the contexts for the magnitude coding are determined
through the first table, predefined in compliance with the JPEG
2000 standard, using said contribution summations as references.
[0037] In step (e), the second group of state variables of the
neighbors is obtained, wherein said state variables comprise a
third variable $\alpha_s$ and a fourth variable $\beta_s$. The
third variable represents the relative location of the MSBs of said
target coefficient and the neighboring coefficient s and is
obtained by

$$\alpha_s = \bigvee_k \lambda_s^k \;\&\; \kappa_c^k.$$

The fourth variable represents whether or not the MSBs of said
target coefficient and the neighboring coefficient s are in the
same bit-plane and is obtained by

$$\beta_s = \bigvee_k \kappa_s^k \;\&\; \kappa_c^k.$$
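Reading the reduction over k as a bitwise OR (the operator is garbled in the published text, but κ_c^k is one-hot, so OR and sum agree), α_s and β_s can be sketched as follows; lam_s, kap_s, and kap_c are the per-bit-plane λ and κ vectors from step (c):

```python
def alpha(lam_s, kap_c):
    """alpha_s = OR over k of (lambda_s^k AND kappa_c^k): 1 when the
    target's MSB bit-plane lies below the neighbor's MSB."""
    return int(any(l and kc for l, kc in zip(lam_s, kap_c)))

def beta(kap_s, kap_c):
    """beta_s = OR over k of (kappa_s^k AND kappa_c^k): 1 when the two
    MSBs fall in the same bit-plane."""
    return int(any(ks and kc for ks, kc in zip(kap_s, kap_c)))
```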
[0038] In step (f), a new variable is introduced. The variable
$\chi_s$ denotes the sign (0 for positive) of coefficient s. Only
four coefficients, the horizontal and vertical ones, are concerned.
If the neighboring coefficient s is processed after the target
coefficient c, the third contribution $\sigma_s^\chi$ is obtained
by

$$\sigma_s^\chi = \begin{cases} \alpha_s, & \beta_s = 0 \\ 1, & \beta_s = 1 \;\&\; p_c^{m_c} \ne 1 \;\&\; p_s^{m_c} = 1 \\ 0, & \text{otherwise.} \end{cases}$$

Otherwise, the third contribution $\sigma_s^\chi$ is obtained
by

$$\sigma_s^\chi = \begin{cases} \alpha_s, & \beta_s = 0 \\ 1, & \beta_s = 1 \;\&\; p_c^{m_c} \ne 3 \;\&\; p_s^{m_c} \ne 1 \\ 0, & \text{otherwise.} \end{cases}$$
[0039] Next, after obtaining the third contributions, the
parameters $H^\chi$ and $V^\chi$ can be determined through the
table shown in FIG. 2. Therefore, the sign coding information can
be determined.
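The sign-contribution rule mirrors the structure of step (d); a sketch with illustrative names (p_c_mc and p_s_mc stand for p_c^{m_c} and p_s^{m_c}):

```python
def third_contribution(alpha_s: int, beta_s: int, p_c_mc: int,
                       p_s_mc: int, processed_before: bool) -> int:
    """sigma_s^chi for a horizontal or vertical neighbor s."""
    if beta_s == 0:
        return alpha_s
    if not processed_before:          # neighbor coded after the target
        return 1 if (p_c_mc != 1 and p_s_mc == 1) else 0
    return 1 if (p_c_mc != 3 and p_s_mc != 1) else 0
```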
[0040] From FIG. 3 to FIG. 14, the coding apparatus of the present
invention is described in the following paragraphs.
[0041] Referring to FIG. 3, the coding apparatus processes a DWT
coefficient having a plurality of bit-planes in parallel at a time
to provide coding information for the further coding process. The
apparatus comprises a Gobang register bank (GRB) module 1, a
compute most significant bit (MSB) pass (CMP) module 2, find
contribution and coding pass (FC) modules 3, context formation (CF)
modules 4, a reconfigurable first-in first-out register (RFIFO)
module 5, and arithmetic encoder (AE) modules 6. There are one said
FC module 3 and one said CF module 4 for each bit-plane and one AE
module 6 for every two bit-planes. The Gobang register bank module
1 is a 2-dimensional shift register bank, as shown in FIG. 4,
whereby discrete wavelet coefficients are recorded to meet the JPEG
2000 scan order. The input data is first rotated within each column
to match the data flow of one column in the stripe in a clock cycle
(for example: W1→W2, W2→W3, W3→W4, W4→W0, W0→W1). When a column in
the stripe is coded, every four clock cycles the data samples are
shifted to the next column for the next column in the stripe (for
example: W1→W7, W2→W8, W3→W9, W4→W5, W0→W6). The symbols
$\omega_{CMP}$ and $\omega_{FC}$ indicate two sets of 3×3 registers
that form the context windows for the CMP and FC modules 2, 3.
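The per-clock column rotation of the GRB can be modeled as a toy sketch (register labels W0–W4 follow the example above; the real bank is a 2-D hardware structure):

```python
def rotate_column(regs):
    """One clock cycle of a GRB column: each register passes its sample
    to the next one (W0->W1, W1->W2, ..., W4->W0), matching the
    stripe-column data flow described in the text."""
    return [regs[-1]] + regs[:-1]

# Four clock cycles code one column of the stripe; afterwards the samples
# shift to the next column (e.g. W1->W7) for the following stripe column.
```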
[0042] The compute MSB pass (CMP) module 2, as shown in FIG. 5,
determines the value $p_c^{m_c}$ of the target coefficient and the
variables $\lambda_s^k$ and $\kappa_s^k$ of each coefficient. With
these variables, every bit-plane can be computed independently by
the modules after the compute MSB pass (CMP) module 2. In this way,
power consumption can be reduced by turning off modules when there
is no data in the corresponding bit-planes. FIG. 6 shows a detailed
circuit diagram of the processing elements (PE) of this module 2.
In FIG. 6, the device UOR treats all bit-planes of the two inputs
by an OR operation and further combines the results by an OR
operation into one output, i.e. the output is "1" if, for some
bit-plane, both bits of the two inputs are "1". The sub-module SG,
composed of simple AND and OR gates, calculates $\kappa_s^k$. The
sub-module FO is the first-one detector.
[0043] The find contribution and coding pass (FC) module 3, which
is shown in FIG. 7, determines the coding pass information and
calculates the PHVD (coding pass and HVD contributions) values to
form the context. The circuit diagram of the processing elements of
FC module 3 is shown in FIG. 8. FIG. 9 shows the circuit diagram of
the function $H^\chi$, directly derived from the algorithm. The
circuit of the function $V^\chi$ is the same as that of $H^\chi$.
[0044] FIG. 10 shows the context formation (CF) module 4, wherein
the zero coding (ZC) module and the magnitude refinement (MR)
module operate through tables defined to meet the standard. In
order to cope with the run-length code (RLC), the four contexts for
the four samples in a column of the stripe are buffered. After the
RLC is decided, the contexts are generated. By contrast, the other
way to generate contexts needs to store the HVD values and the
coding pass information in registers and uses more chip area.
Two-thirds of the chip area for the buffers can be saved by
choosing the former method. In one clock cycle, at most four
contexts are produced. Furthermore, FIG. 11 shows the circuit
diagram providing the context of the sign, denoted by $XD^\chi$,
and the truth table for the GS and SC sub-modules. Note that the
circuit shown in FIG. 10 is needed for every bit-plane, while only
one circuit shown in FIG. 11, which passes results to every
bit-plane, is needed for all.
[0045] The reconfigurable first-in first-out register (RFIFO)
module 5 uses first-in first-out (FIFO) registers to smooth the
compression process. While the number of contexts produced by CF
module 4 ranges from 0 to 4, the average number produced is one per
cycle, and the number consumed by the following AE module 6 is also
one per cycle. FIG. 12 shows a simulation result comparing the
length of the FIFO (the number of registers) with the average
number of clock cycles used for a code-block. Referring to FIG. 12,
the process is more efficient when the FIFO length is 15. However,
the register requirement is then too high. In order to lower the
hardware requirement, it is suitable to exploit the embedded block
coding feature by using the reconfigurable first-in first-out
(RFIFO) module 5. Finally, two 15-length and eight 4-length FIFOs
can achieve 80% of the performance of the conventional structure,
which requires ten 15-length FIFOs.
[0046] In a preferred embodiment, a configuration having two
15-length and eight 4-length FIFO registers is more efficient when
processing a new code-block. According to theoretical analysis, it
is most efficient to allocate the long FIFO registers to the third
and fourth bit-planes of the block. Therefore, at the beginning of
block coding, RFIFO module 5 is reconfigured according to the MSB.
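One way to picture the reconfiguration, as a sketch: assuming one FIFO per magnitude bit-plane and taking "third and fourth bit-planes" to mean the two planes two and three below the code-block MSB (the function and this indexing are my own illustration, not stated in the text):

```python
def fifo_depths(msb: int, n_bitplanes: int = 10):
    """Allocate the two 15-deep FIFOs to the third and fourth bit-planes
    below the code-block MSB and 4-deep FIFOs to the rest, matching the
    two-15-plus-eight-4 configuration described in the text."""
    deep = {msb - 2, msb - 3}
    return [15 if k in deep else 4 for k in range(n_bitplanes)]
```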
[0047] In the present invention, the arithmetic encoder (AE)
modules 6, each processing six separate embedded bit-streams,
decrease the hardware demand and increase the hardware utilization.
Since the present invention supports 11 bit-planes (1 bit for sign
and 10 bits for magnitude), there are at most 28 (3×9+1) embedded
bit-streams. The most direct way to achieve the proposed structure
is to have 28 arithmetic encoders. However, this leads to large
hardware cost and low hardware utilization. The hardware
requirement can be reduced by further analysis of two features of
the embedded block encoder. First, due to the mutually exclusive
property of the three coding passes of a bit-plane in one clock
cycle, the number of encoders can be reduced to 10, one for each of
the 10 magnitude bit-planes in the present invention. Second, since
the number of contexts to process decreases from low to high
bit-planes, the largest number of contexts for an arithmetic
encoder to process appears at the lowest bit-plane. Therefore, the
number of encoders can be further reduced to 5 by assigning one
arithmetic encoder module 6 to treat 2 bit-planes. Although the
encoder count decreases to 5/18 of the direct implementation, not
all elements in the AE modules 6 can be reduced by the same
percentage. Since the coding state registers of the independent
bit-streams must be kept separate and cannot be shared, the area
that can be eliminated is that of the MQ coder and the probability
table. FIG. 13 shows the block diagram of the arithmetic encoder
module 6, wherein the blocks A, C, CT, and B are coding state
registers. Each module stores six coding states for the six
embedded bit-streams. By Bi and Pi, the AE module selects one of
the six coding state register sets and encodes the input
context-decision pair.
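The encoder sharing can be sketched as a simple mapping (a hypothetical helper; the real hardware selection also uses the Bi and Pi signals):

```python
def encoder_for_bitplane(k: int) -> int:
    """Map magnitude bit-plane k (0..9) to one of five shared AE
    modules, each serving two adjacent bit-planes."""
    return k // 2
```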
[0048] FIG. 14 shows a comparison between the present invention and
other proposed architectures by four factors. The first factor is
speed, defined as the average number of clock cycles needed to
treat a discrete wavelet transform coefficient. Referring to FIG.
14, the speed of the present invention is five times higher than
that of the other schemes on average, where the symbol n denotes
the number of bit-planes of the embedded block. Since the other
architectures compress bit-planes in serial, their speed is related
to the number of bit-planes of the embedded block. The second
factor is the logic gate count. Due to the parallel structure of
the present invention, it needs four times more gates than the
others. The third factor is the on-chip memory requirement.
Generally, memory occupies a large chip area. The memory
requirement of the present invention is about 4.7% of the
conventional requirement. As a result of the memory reduction, the
area of the present invention is similar to that of conventional
architectures. The last factor is off-chip memory bandwidth. Since
external memory accesses consume large power, with the external
memory bandwidth reduced by a factor of six, the present invention
consumes less power than the others.
[0049] As discussed above, the invention is superior to many
conventional schemes. With the same chip area and lower power
consumption, the speed of this invention is six times higher than
that of past schemes.
[0050] It should be understood that although certain preferred
embodiments of the present invention have been illustrated and
described, various modifications, alternatives and equivalents
thereof will become apparent to those skilled in the art and,
accordingly, the scope of the present invention should be defined
only by the appended claims and equivalents thereof.
* * * * *