U.S. patent application number 09/794706 was filed with the patent office on 2002-08-29 for font compression and retrieval.
Invention is credited to Aberg, Jan, Smeets, Bernard.
United States Patent Application 20020118885
Kind Code: A1
Smeets, Bernard; et al.
August 29, 2002
Font compression and retrieval
Abstract
Method and apparatus for compressing data representing a set of
symbols such that each symbol of the set can be separately accessed
and decompressed. Each symbol of the set of symbols is encoded in
the form of a two-part code wherein a first part of the code is
common for all encoded symbols and a second part of the code
encodes the data representing a symbol. An identifier is provided
for each symbol, permitting each encoded symbol to be separately
accessed and decompressed. The invention is particularly useful for
storing large fonts such as a Chinese or Japanese character
set.
Inventors: Smeets, Bernard; (Dalby, SE); Aberg, Jan; (Lund, SE)
Correspondence Address: Richard J. Moura, Esq., Jenkens & Gilchrist, P.C., Suite 3200, 1445 Ross Avenue, Dallas, TX 75202-2799, US
Family ID: 25163415
Appl. No.: 09/794706
Filed: February 27, 2001
Current U.S. Class: 382/246
Current CPC Class: G06T 9/00 20130101; H03M 7/30 20130101
Class at Publication: 382/246
International Class: G06K 009/36
Claims
1. A method for compressing data representing a set of symbols,
comprising: encoding each symbol of said set of symbols in the form
of a two-part code wherein a first part of said two-part code is
common for all encoded symbols of said set and a second part of
said two-part code comprises encoded data representing a symbol of
said set, wherein each encoded symbol of said set can be separately
accessed and decompressed.
2. The method according to claim 1, wherein the first part of said
two-part code comprises a statistical model of said set of symbols,
and wherein said encoding step includes encoding data representing
each symbol of said set of symbols with a code derived from said
statistical model to provide said second part of said two-part code
for each symbol of said set.
3. The method according to claim 2, wherein said data representing
each symbol of said set comprises a two-dimensional bitmap of each
said symbol.
4. The method according to claim 3, wherein said encoding step
includes the step of determining the context of each pixel of a
bitmap representing a symbol, and constructing a probability table
of pixel values having one entry per possible context.
5. The method according to claim 4, wherein said encoding step
comprises encoding each symbol using an arithmetic encoder with
said probability table.
6. The method according to claim 3, wherein said encoding step
includes the step of determining the context of each pixel of a
bitmap representing a symbol, and constructing a prediction table
indicating the most probable bit value in each context.
7. The method according to claim 6, wherein the most probable bit
value for each bit is exclusive-ORed with the actual bit producing
a bit stream that is encoded by a Huffman code.
8. The method according to claim 2, and further including the step
of storing said encoded data representing each symbol of said set
in a memory.
9. The method according to claim 8, and further including the step
of providing an identifier for each symbol of said set of symbols
for identifying a location at which the encoded data representing
each symbol is stored in the memory.
10. The method according to claim 9, wherein the encoded data
representing each symbol of said set are sorted by length, and
wherein a set of indexing tables is created, one for each length,
each indexing table including identifiers for encoded data of its
respective length.
11. The method according to claim 10, wherein the identifiers
included in each indexing table are arranged in an ascending
order.
12. The method according to claim 10, wherein a length table is
created, said length table including information about each index
table, including the length it corresponds to and the length of the
index table.
13. The method according to claim 1, wherein said set of symbols
comprises a font, and wherein each symbol comprises a symbol of
said font.
14. The method according to claim 13, wherein said font comprises a
font selected from the group consisting of a Chinese character set
and a Japanese character set.
15. A method for retrieving encoded data representing a symbol of a
set of symbols which are stored in a memory, given an identifier
for each symbol of said set of symbols, comprising: identifying a
location at which said data representing said symbol of said set of
symbols is stored in said memory using the identifier for said
symbol; and retrieving said data representing said symbol from said
memory.
16. The method according to claim 15, further including providing
an index table containing a list of identifiers for different
symbols of said set of symbols, and wherein said identifying step
includes searching said index table to locate the identifier for
the symbol.
17. The method according to claim 16, wherein encoded data
representing said symbols of said set of symbols are sorted first
by length and then by identifier, in ascending order, and wherein
said step of providing an index table includes providing an index
table for each length, each index table containing a list of
identifiers sorted in ascending order for symbols represented by
encoded data of a particular length.
18. The method according to claim 17, wherein said searching step
comprises searching said index tables for said identifier using a
binary search.
19. The method according to claim 15, wherein said method further
includes the step of decompressing the retrieved data.
20. The method according to claim 19, wherein said encoded data
comprises a two-dimensional bitmap of a symbol.
21. Apparatus for compressing data representing a set of symbols,
said apparatus comprising: an encoder which encodes each symbol of
said set of symbols in the form of a two-part code, wherein a first
part of said two-part code is common for all encoded symbols of
said set, and a second part of said two-part code comprises encoded
data representing a symbol of said set; wherein each encoded symbol
of said set can be separately accessed and decompressed.
22. The apparatus according to claim 21, wherein the first part of
said two-part code comprises a statistical model of said set of
symbols, and wherein said encoder encodes data representing each
symbol of said set of symbols with a code derived from said
statistical model to provide said second part of said two-part code
for each symbol of said set.
23. The apparatus according to claim 22, wherein said data
representing each said symbol comprises a bitmap of each said
symbol, and wherein said encoder includes: an arithmetic encoder
for sequentially encoding pixels of said bitmap; and a source model
for providing coding probabilities for the arithmetic encoder, said
source model including a context forming unit for forming a context
for each pixel of the bitmap, and a probability table containing
the pixel probability of each pixel conditioned on the context
formed by the context forming unit.
24. The apparatus according to claim 22, wherein said data
representing each said symbol comprises a bitmap of each said
symbol, and wherein said encoder includes: a source model for
providing a predicted value for each pixel of the bitmap, a unit
for exclusive-ORing the predicted value of each pixel with the
actual bit to produce a bit stream, and a Huffman coding unit for
encoding the bit stream.
25. The apparatus according to claim 24, wherein said source model
includes a context forming unit for forming a context for each
pixel of the bitmap, and a prediction table for providing the
predicted value for each pixel in each context.
26. The apparatus according to claim 21, wherein said set of
symbols comprises a font.
27. Apparatus for retrieving data representing a symbol of a set of
symbols which are stored in a memory, each of said symbols having
an identifier, said apparatus comprising: a locator which
identifies a location in said memory at which said data
representing a symbol of said set of symbols is stored using said
identifier for said symbol; and a decoder which decodes said
symbol, wherein data representing each encoded symbol can be
separately accessed and decoded.
28. The apparatus according to claim 27, wherein said locator
identifies said location by searching an index table which contains
a list of identifiers for different symbols of said set of
symbols.
29. The apparatus according to claim 28, wherein data representing
said set of symbols are sorted by length, and wherein an index
table is provided for each length, each index table containing a
list of identifiers for symbols represented by data of a particular
length.
Description
BACKGROUND OF THE INVENTION
[0001] 1. Field of the Invention
[0002] The present invention relates generally to the compression
and retrieval of data representing a font or other set of symbols;
and, more particularly, to a method and apparatus for storing a
large font, such as a Chinese or Japanese character set, in a
compressed form while retaining access to individual symbols of the
font.
[0003] 2. Description of the Prior Art
[0004] In order to display messages in languages such as Chinese or
Japanese on a CRT or an LCD display, a large set of symbols, or
glyphs, is required. For example, the Chinese Unicode standard
character set contains about 21,000 different Chinese symbols.
Furthermore, each symbol occupies at least several hundred pixels;
as a result, storing a complete Chinese font requires a large
amount of memory. Storing the glyphs in a more compact format than
pure bitmaps therefore substantially reduces memory
requirements.
[0005] For laser printers or high resolution displays, a font is
usually stored as a series of points that are joined by curves.
This brings the additional advantage of making font scaling
possible, although for fonts stored in this way, some processing is
needed to render the image itself. For lower resolution displays,
font scaling is not of interest, and it would be more efficient to
store the font as a bitmap.
[0006] The majority of lossless data compression methods known in
the art work in a sequential manner by referring back to data that
has already been encoded. Such methods are inappropriate for font
compression, however, where, ideally, only a single glyph should be
decompressed at a time. If sequential methods of this type are
employed, some blocking of the glyphs is required, and a trade-off
must be made between the two extremes of compressing the entire
font as one block, thus losing random access capability, and
compressing each symbol separately, in which case overall
performance becomes quite poor.
[0007] Instead of the above-mentioned sequential codes, a two-part
code can also be used to compress and retrieve a font. Typically,
the first part of such a code describes the statistical properties,
or a model, of the data; and the second part encodes the data by a
code derived from the model.
[0008] Font compression and retrieval methods known in the prior
art include those described in U.S. Pat. Nos. 5,488,365;
5,058,187; 5,587,725; 5,473,704; and 5,020,121; and PCT Publication
No. WO 98/16902. In general, however, none of these prior art
methods describes a font compression and retrieval technique that
provides complete random access of individual symbols of the font,
which is important to permit high-speed access of the symbols by
modern high-speed equipment.
SUMMARY OF THE INVENTION
[0009] The present invention is directed to a method and apparatus
for the compression and retrieval of data representing a set of
symbols; and, particularly, to a method and apparatus for
compressing a font which comprises a large number of glyphs, such
as a Chinese or a Japanese character set, in such a manner that
each individual glyph of the character set can be separately
accessed and decompressed.
[0010] A method for compressing data representing a set of symbols
according to the present invention includes encoding each symbol of
the set of symbols in the form of a two-part code wherein a first
part of the two-part code is common for all encoded symbols of the
set and a second part of the two-part code comprises encoded data
representing a symbol of the set, wherein each encoded symbol of
the set can be separately accessed and decompressed.
[0011] The present invention provides a data compression method
based on the use of a two-part code; however, the first part of the
code is common for all symbols of the set of symbols, and this
allows each encoded symbol to be separately accessed and
decompressed. The present invention, accordingly, provides the
property of random access to the individual symbols of the set
which, as indicated above, is a particularly important capability
in modern high-speed equipment.
[0012] According to a presently preferred embodiment of the
invention, the set of symbols comprises a font of individual
symbols or glyphs, and the data representing the glyphs includes a
bitmap for each glyph. To encode a font, a statistical model of the
set of glyphs is created (the first part of the two-part code or
the "model"), and each glyph is then separately compressed with a
code derived from the model (the second part of the two-part code
or the "codeword"). The compressed glyphs are partitioned by
codeword length, and one indexing table, sorted by an identifier
for each glyph, is created for each partition. An encoded font will
thus comprise a statistical model, a set of codewords, a set of
indexing tables, a table of lengths for the indexing tables and,
perhaps, auxiliary tables used for decoding.
[0013] To decode a particular glyph, given the identifier for the
glyph, the indexing tables are first searched for a matching entry.
From the table lengths and the position in the table, the position
or location of the particular glyph in the code set can be
computed, and this permits the desired codeword for that glyph to
then be extracted and decoded. Because, in the present invention, a
two-part code is used in which the first part of the code is common
for all the encoded glyphs, indexing is greatly simplified: for
each glyph it is only necessary to locate the codeword for that
particular glyph.
[0014] In accordance with one presently preferred embodiment of the
invention, font compression is achieved utilizing an arithmetic
encoder with a fixed probability table. This procedure provides
near optimal compression of the glyphs, given the probability
table, without the need of additional tables. According to an
alternative embodiment of the invention, font compression is by
means of a predictive encoding scheme with a fixed prediction table
followed by a fixed Huffman coding. This procedure makes it
possible to have a very fast decompression while retaining
reasonable compression speeds. This embodiment is also particularly
suitable for hardware implementation.
[0015] In general, with the present invention, complete random
access to individual symbols of a set of symbols is gained at the
cost only of using a two-part code with separate codewords instead
of an adaptive one-part code as is known in the prior art. The
additional cost in memory requirements is estimated to be no more
than about 10 to 15 percent for a font of 10,000 symbols.
[0016] Yet further advantages and specific features of the
invention will become apparent hereinafter in conjunction with the
following detailed description of presently preferred embodiments
thereof.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] FIG. 1 schematically illustrates an encoder for a font
compression scheme according to a first presently preferred
embodiment of the invention;
[0018] FIG. 2 schematically illustrates one example of a
conditioning context of a pixel to assist in explaining the
operation of the encoder of FIG. 1;
[0019] FIG. 3 schematically illustrates a decoder for retrieving
data encoded by the encoder of FIG. 1;
[0020] FIG. 4 schematically illustrates an encoder for a font
compression scheme according to a second presently preferred
embodiment of the invention; and
[0021] FIGS. 5 and 6 are flow charts illustrating the basic steps
of the encoding and decoding procedures, respectively, for a font
compression and retrieval method according to the present
invention.
DETAILED DESCRIPTION OF PRESENTLY PREFERRED EMBODIMENTS
[0022] FIG. 1 is a block diagram schematically illustrating the
encoder structure of a compression scheme according to a first
embodiment of the present invention for compressing data
representing a set of symbols such as a font of Chinese or Japanese
symbols or glyphs. Initially, it should be appreciated that the
encoding procedures described can be carried out without time
limitations so as to permit optimization of the size of the
compressed data.
[0023] The encoder of FIG. 1 is generally designated by reference
number 10 and is composed of a plurality of modules. Initially, a
two-dimensional bitmap 12 representing a symbol or glyph of a font,
is converted to a sequence x.sup.n=x.sub.1, x.sub.2, . . . ,
x.sub.n of bits by a serialize module 14 by scanning the bitmap
according to some specified rule. Possible scan orders, for
example, include row-wise, column-wise, diagonal or a more involved
scan order as are well-known to those skilled in the art.
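The row-wise scan of paragraph [0023] can be sketched as follows. This is an illustrative sketch (the patent specifies no implementation language, and the function and variable names are hypothetical):

```python
def serialize_row_wise(bitmap):
    """Flatten a glyph bitmap (a list of rows of 0/1 pixels) into the
    bit sequence x_1, x_2, ..., x_n using a row-wise scan order."""
    return [pixel for row in bitmap for pixel in row]

# A 3x3 "plus" glyph, scanned row by row.
glyph = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
bits = serialize_row_wise(glyph)
```

A column-wise or diagonal scan would change only the iteration order; the encoder and decoder simply need to agree on the rule chosen.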
[0024] The sequence of bits output from the serialize module 14 is
directed to an arithmetic encoder module 16 which may be a standard
binary arithmetic encoding unit (see, for example, C. B. Jones, "An
Efficient Coding System for Long Source Sequences", IEEE
Transactions on Information Theory, vol. 27, no. 3, pp. 280-291,
May, 1981). For efficiency, the arithmetic precision of the encoder
should be matched with the precision of the probability table that
will be discussed hereinafter. The bits are encoded sequentially as
the bitmap is scanned.
[0025] The model that provides the coding probabilities for the
arithmetic coding is illustrated by dashed block 18 and is
designated in FIG. 1 as a source model. The source model is context
based, i.e., the probability distribution of each pixel of the
bitmap is determined by a conditioning context of surrounding pixel
values. Thus, the model includes a context forming unit or module
22 which selects bits from previously encoded ones in the same
bitmap to determine the context, which is represented as an
integer. FIG. 2 schematically illustrates one example of the
correspondence between context pixels and bit positions. The
conditioning context of any pixel must contain only pixels that
appear earlier in the scan order. Its shape may vary depending on
the pixel. Any context pixel outside the bitmap is set to zero.
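As a sketch of how such a context-forming function might operate, the following uses a hypothetical four-pixel template of previously scanned neighbors (the actual context shape and size are design choices, as the text notes; all names here are illustrative):

```python
def pixel_context(bitmap, r, c,
                  template=((-1, -1), (-1, 0), (-1, 1), (0, -1))):
    """Form the integer context of pixel (r, c) from neighbours that
    appear earlier in a row-wise scan; any context pixel outside the
    bitmap contributes zero, as required."""
    ctx = 0
    for dr, dc in template:
        rr, cc = r + dr, c + dc
        bit = 0
        if 0 <= rr < len(bitmap) and 0 <= cc < len(bitmap[rr]):
            bit = bitmap[rr][cc]
        ctx = (ctx << 1) | bit
    return ctx

glyph = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
# Centre pixel (1, 1): neighbours 0, 1, 0, 1 give context 0b0101 = 5.
```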
[0026] The source model 18 also includes a probability table 24
which has one entry per possible context, containing the pixel
probability conditioned on that context, stored with fixed
precision.
[0027] Given the size and shape of the context, the probability
table 24 is constructed by counting the occurrences of ones in each
context, and normalizing by the number of occurrences of the
context.
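The counting-and-normalizing construction of paragraph [0027] can be sketched as a single pass over a training set of bitmaps. The `context_fn` below is a stand-in for the context forming unit, and the trivial left-pixel context is used only to keep the example small:

```python
from collections import defaultdict

def build_probability_table(bitmaps, context_fn):
    """Estimate P(pixel = 1 | context) by counting occurrences of ones
    in each context and normalizing by the number of occurrences of
    that context."""
    ones = defaultdict(int)
    total = defaultdict(int)
    for bm in bitmaps:
        for r, row in enumerate(bm):
            for c, bit in enumerate(row):
                ctx = context_fn(bm, r, c)
                total[ctx] += 1
                ones[ctx] += bit
    return {ctx: ones[ctx] / total[ctx] for ctx in total}

# Trivial one-pixel context: the pixel immediately to the left
# (zero at a row start), standing in for the richer context of FIG. 2.
def left_pixel(bm, r, c):
    return bm[r][c - 1] if c > 0 else 0

table = build_probability_table([[[0, 1, 1], [1, 1, 1]]], left_pixel)
```

In practice the probabilities would be stored with the fixed precision mentioned in paragraph [0026], matched to the arithmetic encoder's precision.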
[0028] If only certain codeword lengths are allowed, for instance
integer bytes, zeros are appended to the end of the output of the
arithmetic encoder by byte alignment module 26. The output of the
byte alignment module is the codeword representing a symbol or
glyph of the font.
[0029] Each glyph of the font is encoded utilizing the encoder of
FIG. 1 until the entire font has been encoded and stored in memory.
Before a font is encoded, the scan order and the context forming
function are chosen. Different sizes of contexts, scan orders,
context forming functions, and precisions of the probability table
can be tried in order to find the one yielding the best
compression. The quantity that is minimized here to yield the best
compression is the size of the codeword set plus the size of the
probability table.
[0030] The codewords produced by the above procedure are sorted
first by length and then by identifier, which is given for each
glyph of the font, and an index table for each length is
constructed as a list of identifiers sorted in ascending order. For
each index table, a length table stores the codeword length it
corresponds to and the table length.
[0031] The codewords are stored together with the index table and
the length table. It should be noted that the information about the
location and length of each codeword in memory is present only in
the index and length tables, i.e., the codewords are stored sorted
by length and identifier but without any separators.
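The storage layout of paragraphs [0030] and [0031] might be sketched as follows, modelling codewords as `bytes` and glyph identifiers as integers (all names are illustrative):

```python
from collections import defaultdict

def pack_font(codewords):
    """codewords: mapping glyph identifier -> encoded bytes.
    Returns the separator-free codeword blob (sorted by length, then
    identifier), one index table per codeword length, and the length
    table of (codeword length, table length) pairs."""
    by_len = defaultdict(list)
    for gid, cw in codewords.items():
        by_len[len(cw)].append(gid)
    blob = b""
    index_tables = {}
    length_table = []
    for length in sorted(by_len):
        ids = sorted(by_len[length])   # identifiers in ascending order
        index_tables[length] = ids
        length_table.append((length, len(ids)))
        blob += b"".join(codewords[gid] for gid in ids)
    return blob, index_tables, length_table

blob, index_tables, length_table = pack_font(
    {3: b"aa", 7: b"bb", 1: b"ccc"})
```

Note that the blob itself carries no separators: the location and length of every codeword are recoverable only through the index and length tables, exactly as the text states.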
[0032] When a glyph with given identifier I is to be decoded, the
index tables are first searched one by one, using a binary search.
If the identifier is found, the address of the corresponding
codeword is found by summing the product of the codeword length and
the number of codewords of that length over all codeword lengths
smaller than the one of the searched table (counting of the
codewords should begin at zero), and adding the codeword length of
the searched table times the position of the identifier in the
table. Other search orders of the tables are also possible. For
instance, one could search the tables in order of descending size,
if desired; and it is not intended to limit the invention to any
particular search order. It should also be understood that other
searching methods can be used as well without departing from the
scope of the present invention, and it is also not intended to
limit the invention to any particular searching method.
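The address computation described above (a binary search of each index table, summing codeword length times codeword count over all shorter lengths) can be sketched as follows, with illustrative names:

```python
from bisect import bisect_left

def locate_codeword(gid, index_tables, length_table):
    """Return (address, length) of glyph `gid`'s codeword within the
    separator-free codeword blob, or None if the identifier is absent.
    length_table lists (codeword length, table length) pairs, shortest
    codeword length first."""
    address = 0
    for length, count in length_table:
        ids = index_tables[length]
        pos = bisect_left(ids, gid)        # binary search, counting from zero
        if pos < count and ids[pos] == gid:
            return address + length * pos, length
        address += length * count          # skip every codeword of this length
    return None

# Example layout: two 2-byte codewords (ids 3 and 7) followed by one
# 3-byte codeword (id 1).
index_tables = {2: [3, 7], 3: [1]}
length_table = [(2, 2), (3, 1)]
```

Searching the tables in a different order (e.g., by descending table size) would change only the loop order and the running offset bookkeeping, consistent with the text's remark that the search order is not limited.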
[0033] Once the codeword has been located in memory, it is decoded
by the decoding structure 30 illustrated in FIG. 3. In FIG. 3, the
source model 18 is identical to the source model 18 in the encoder
10 of FIG. 1 (and thus has been given the same reference number),
the arithmetic decoder 36 parallels the arithmetic encoder 16 in
the usual way, and the image former 34 is simply the inverse of the
serializer 14 in the encoder of FIG. 1. The decoder 30 in FIG. 3
provides as the output thereof the bitmap 32 of the desired glyph
from the compressed font.
[0034] FIG. 4 is a block diagram schematically illustrating the
structure of an encoder 40 according to a second embodiment of the
present invention. In FIG. 4, the probability table of the source
model 18 of the encoder 10 of FIG. 1 is replaced by a prediction
table 42 in a source model 48 in which each entry is one bit,
indicating the most probable bit value in each context. The
predicted value for each bit is exclusive-ORed with the actual bit
by unit 44, producing a bit stream that is encoded by a Huffman
code in Huffman encoder module 46 (see D. A. Huffman, "A Method for
the Construction of Minimum-Redundancy Codes", Proc. IRE, vol. 40,
pp 1098-1101, 1952.)
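The prediction and XOR step of this second embodiment can be sketched as follows (the prediction table and context values below are illustrative). The resulting residual stream, dominated by zeros when predictions are good, is what the Huffman coder then compresses:

```python
def prediction_residual(bits, contexts, prediction_table):
    """XOR each actual bit with the most probable bit value for its
    context: a correct prediction yields 0, a miss yields 1."""
    return [bit ^ prediction_table[ctx]
            for bit, ctx in zip(bits, contexts)]

table = {0: 0, 1: 1}          # context -> most probable bit value
residual = prediction_residual([0, 1, 1, 0], [0, 1, 0, 1], table)
```

Decoding is symmetric: XORing the residual with the same predictions, formed from already decoded pixels, recovers the original bits.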
[0035] In this embodiment, in addition to the codewords, the index
table and the length table, a description of the Huffman code must
also be made available to the decoder. The optimization with
respect to context size, etc. as described above with respect to
the first embodiment, can be applied to this embodiment as well.
The decoder structure for use with the encoder of this embodiment
is also analogous to the decoder used with the encoder of the first
embodiment (illustrated in FIG. 3), and need not be further
described herein.
[0036] FIGS. 5 and 6 are flowcharts which generally illustrate the
steps of the encoding and decoding procedures, respectively, of the
compression and retrieval methods of the present invention.
Initially, with respect to FIG. 5, to encode a font,
two-dimensional bitmaps of the individual symbols or glyphs of the
font are serialized at step 61 using the serializer 14 shown in
FIG. 1 or FIG. 4. A source or statistical model of the serialized data
is created at step 62 using the context forming module 22 and
either the probability table 24 of FIG. 1 or the prediction table
of FIG. 4. The sequence of bits output from the serializer is then
encoded in step 63 where each symbol or glyph of the font is
independently encoded with a code derived from the statistical
model by either the arithmetic encoder 16 of FIG. 1 or the Huffman
encoder 46 of FIG. 4 to provide the encoded codeword set
representing the font. The encoded font is then stored in memory at
step 64 for later retrieval, for example. As indicated above, the
codewords are stored together with the index table and the length
table.
[0037] To decode the encoded symbols stored in memory, as
illustrated in FIG. 6, the index tables are first searched in step
71 until the identifier for the encoded symbol is found. The
address of the stored encoded symbol is then found using the
identifier, step 72; and, finally, the codeword is retrieved and
decoded, step 73, using the decoder of, for example, FIG. 3, to
provide the decompressed bitmap of the selected symbol or
glyph.
[0038] An important aspect of the present invention is that a
two-part code is used wherein the first part of the code, i.e., the
model, is common for all the encoded glyphs; and the second part of
the code, i.e., the codeword, comprises the encoded data
representing a glyph. This simplifies indexing, as for each glyph
it is only necessary to locate the codeword. In the first embodiment
described above, an arithmetic coder with a fixed probability table
is used, which ensures near optimal compression of the glyphs,
given the probability table, without the need for additional
tables, as distinguished from Lempel-Ziv and Huffman coding schemes
which perform poorly on short data blocks and require extensive
code tables, respectively.
[0039] By the use of predictive coding, as provided in the second
embodiment described above, with a fixed prediction table followed
by a fixed Huffman code, it becomes possible to have a very fast
decompression while retaining reasonable compression. This method
is particularly suitable for hardware implementation.
[0040] In general, in the present invention, by using an indexing
method with a length table and indexing tables in which at least
one search is performed, each table is reduced to a list of
identifiers. With the present invention, the total size of the
addressing tables is only marginally larger than the space
originally occupied by the identifiers; thus, the property of
random access to the glyphs is gained with only a slight increase
in index table size.
[0041] It should be emphasized that the term "comprises/comprising"
when used in this specification is taken to specify the presence of
stated features, integers, steps or components, but does not
preclude the presence or addition of one or more other features,
integers, steps, components or groups thereof.
[0042] It should also be emphasized that while what has been
described herein constitutes presently preferred embodiments of the
invention, it should be recognized that the invention could take
numerous other forms. Accordingly, it should be understood that the
invention should be limited only insofar as is required by the
scope of the following claims.
* * * * *