U.S. patent application number 17/579904 was filed with the patent office on 2022-01-20 and published on 2022-05-12 for method and apparatus for compressing data of storage system, device, and readable storage medium.
This patent application is currently assigned to HUAWEI TECHNOLOGIES CO., LTD. The applicant listed for this patent is HUAWEI TECHNOLOGIES CO., LTD. Invention is credited to Kun Guan, Shaohui Quan, Jianqiang Shen, Liyu Wang.
Application Number | 20220147255 17/579904 |
Document ID | / |
Family ID | 1000006152132 |
Filed Date | 2022-01-20 |
United States Patent
Application |
20220147255 |
Kind Code |
A1 |
Guan; Kun ; et al. |
May 12, 2022 |
METHOD AND APPARATUS FOR COMPRESSING DATA OF STORAGE SYSTEM,
DEVICE, AND READABLE STORAGE MEDIUM
Abstract
In a method of storing a data block, a storage device has stored a
plurality of data block groups, each data block in a group having a
common part that is contained in another data block in that group.
For a target data block to be stored, the storage device selects from
the data block groups a target data block group that has one data block
whose common part is identical to a part of the target data block.
The storage device then saves the target data block by storing a target
reference block of the target data block group and differential
data between the target data block and the target reference
block.
Inventors: |
Guan; Kun (Saint-Petersburg, RU); Quan; Shaohui (Hangzhou, CN);
Wang; Liyu (Beijing, CN); Shen; Jianqiang (Hangzhou, CN) |
|
Applicant: |
Name | City | State | Country | Type |
HUAWEI TECHNOLOGIES CO., LTD. | Shenzhen | | CN | |
Assignee: |
HUAWEI TECHNOLOGIES CO.,
LTD.
Shenzhen
CN
|
Family ID: |
1000006152132 |
Appl. No.: |
17/579904 |
Filed: |
January 20, 2022 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
PCT/CN2019/097144 | Jul 22, 2019 | |
17579904 | | |
Current U.S. Class: | 1/1 |
Current CPC Class: | H03M 7/30 20130101; G06F 3/0641 20130101; G06F 3/0673 20130101; G06F 3/0608 20130101 |
International Class: | G06F 3/06 20060101 G06F003/06; H03M 7/30 20060101 H03M007/30 |
Claims
1. A method of storing a data block performed by a storage device,
comprising: storing data block groups, wherein each data block
group of the data block groups has a plurality of data blocks, and
each data block of said each data block group has a common part
identical to a part of another data block of said each data block
group; selecting from the data block groups a target data block
group, wherein one data block in the target data block group has a
common part identical to a part of a target data block; and
saving the target data block by storing a target reference block of the
target data block group and differential data between the target
data block and the target reference block.
2. The method according to claim 1, wherein the step of saving the
data block groups comprises: storing, for each data block group of
the data block groups, a reference block and differential data
between each data block in the data block group and the reference
block; wherein each data block group comprises a reference
block.
3. The method according to claim 2, further comprising:
continuously storing all data of each data block group in storage
address.
4. The method according to claim 1, further comprising:
deduplicating data blocks in the storage system to obtain data
blocks of the data block groups, wherein data blocks obtained after
deduplicating are not the same.
5. The method according to claim 1, further comprising:
comparing fingerprints of parts of data blocks of the data block
groups with fingerprints of parts of the target data block for the
selecting.
6. The method according to claim 1, further comprising:
comparing fingerprints of common parts with fingerprints of parts
of the target data block, wherein all data blocks in each data
block group have a common part.
7. The method according to claim 1, wherein the comparing step
comprises: obtaining the target data block by the reference block
and the differential data.
8. A storage device, comprising: a memory storing executable
instructions; and a processor configured to execute the executable
instructions to: save data block groups, wherein each data block
group of the data block groups has a plurality of data blocks, and each
data block of said each data block group has a common part
identical to a part of another data block of said each data block
group; select from the data block groups a target data block group,
wherein one data block in the target data block group has a common
part identical to a part of a target data block; and save the
target data block by storing a target reference block of the target data
block group and differential data between the target data block and
the target reference block.
9. The storage device according to claim 8, wherein the processor
is configured to save the data block groups by storing, for each
data block group of the data block groups, a reference block and
differential data between each data block in the data block group
and the reference block; wherein each data block group comprises a
reference block.
10. The storage device according to claim 9, wherein the processor
is configured to further execute the executable instructions to:
continuously store all data of each data block group in storage
address.
11. The storage device according to claim 8, wherein the processor
is configured to further execute the executable instructions to:
deduplicate data blocks to obtain data blocks of the data block
groups, wherein data blocks obtained after deduplicating are not
the same.
12. The storage device according to claim 8, wherein the processor
is configured to: compare fingerprints of a part of data blocks of
data block groups and fingerprints of parts of the target data
block.
13. The storage device according to claim 8, wherein the processor
is configured to: compare fingerprints of common parts with
fingerprints of parts of the target data block, wherein all data
blocks in each data block group have a common part.
14. The storage device according to claim 8, wherein the processor
is configured to: obtain the target data block by the reference
block and the differential data.
15. A storage device, comprising: a memory storing executable
instructions; and a processor configured to execute the executable
instructions to: determine, in a plurality of data blocks, whether
one data block has a common part identical to a part of another
data block; select a group of data blocks from the plurality of
data blocks, wherein each data block of the group has a common part
identical to a part of another data block of the group; and save
the group of data blocks by storing a reference block and
differential data between each data block in the group and the
reference block.
16. The storage device according to claim 15, wherein the processor
is configured to: compare fingerprints between different data
blocks of the plurality of data blocks, wherein each said data
block of the plurality of data blocks has at least one fingerprint.
17. The storage device according to claim 15, wherein the processor
is configured to further execute the executable instructions to:
continuously store all data of the data block group in storage
address.
18. The storage device according to claim 15, further
comprising: deduplicate data blocks in the storage device to obtain
the plurality of data blocks.
19. The storage device according to claim 15, the selecting step
comprising: select a group of data blocks from a number of data
blocks, according to fingerprint of part of each data block in the
number of data blocks.
20. The storage device according to claim 15, wherein the processor
is configured to: select the group of data blocks from the
plurality of data blocks according to whether all data blocks in
each data block group have a common part.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This is a continuation of International Patent Application
No. PCT/CN2019/097144, filed on Jul. 22, 2019. The disclosure of
the aforementioned application is hereby incorporated by reference
in its entirety.
TECHNICAL FIELD
[0002] This application relates to the field of storage
technologies, and in particular, to a method and an apparatus for
compressing data of a storage system, a device, and a readable
storage medium.
BACKGROUND
[0003] With rapid development of big data, cloud computing, and
artificial intelligence, enterprises have an explosive growth in
data storage requirements. If data is directly stored, relatively
large storage space is occupied, and costs are relatively high. To
improve utilization of storage space, a data reduction technology
is usually used to compress data.
[0004] In a related technology, a deduplication technology is
generally used to improve the utilization of storage space. To be
specific, a file is divided into data blocks of a same size, and a
deduplication fingerprint of each data block is calculated. Because
a same deduplication fingerprint indicates that content of data
blocks is the same, data blocks with a same deduplication
fingerprint can be stored only once.
[0005] When the deduplication technology is used, redundant data
can be deleted only when content of data blocks is the same. During
actual data storage, however, there is a low probability that there
are data blocks that are completely the same. Therefore, a data
reduction effect is poor.
SUMMARY
[0006] Embodiments of this application provide a method and an
apparatus for compressing data of a storage system, a device, and a
readable storage medium, to overcome a problem of a poor data
reduction effect in a related technology.
[0007] According to an aspect, this application provides a method
for compressing data of a storage system. The method includes:
determining whether deduplication can be performed on a target data
block; when deduplication cannot be performed on the target data
block, obtaining a similar fingerprint of the target data block;
determining, based on the similar fingerprint, a combined data
block group to which the target data block belongs; and performing
similar compression on the target data block based on a reference
block in the combined data block group.
[0008] In a solution shown in this embodiment of this application,
when the storage system stores data blocks in batches, the storage
system determines whether deduplication can be performed on a
target data block that refers to any one of the data blocks. When
the storage system determines that deduplication cannot be
performed on the target data block, the storage system may obtain
the similar fingerprint of the target data block. The storage
system may determine the similar fingerprint of the target data
block before determining whether deduplication can be performed on
the target data block, or when determining that deduplication
cannot be performed on the target data block. A determining manner
may be: splitting the target data block into equal-sized data
units, and separately inputting each data unit into a preset hash
function, to obtain an output result, namely, the similar
fingerprint of the target data block. It can be learned that the
similar fingerprint of the target data block is not one numeric
value, but includes a group of numeric values.
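The determining manner described above can be illustrated with a minimal Python sketch. The unit size of 1024 bytes, the choice of BLAKE2b as the preset hash function, and the function name `similar_fingerprint` are assumptions made only for illustration; the application does not specify them.

```python
import hashlib

# Assumed unit size in bytes; the application does not fix a value.
UNIT_SIZE = 1024


def similar_fingerprint(block: bytes, unit_size: int = UNIT_SIZE) -> list[int]:
    """Split the block into equal-sized units and hash each unit.

    The result is a group of numeric values, one per unit, rather than
    a single value for the whole block.
    """
    values = []
    for offset in range(0, len(block), unit_size):
        unit = block[offset:offset + unit_size]
        # BLAKE2b stands in for the "preset hash function" (an assumption).
        digest = hashlib.blake2b(unit, digest_size=8).digest()
        values.append(int.from_bytes(digest, "big"))
    return values
```

Two blocks that differ in only one unit share all other values of their similar fingerprints, which is what makes the fingerprint usable for finding similar, rather than identical, blocks.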
[0009] After obtaining the similar fingerprint of the target data
block, the storage system may determine the combined data block
group to which the target data block belongs. Data blocks included
in the combined data block group may be compressed together. The
storage system may further determine the reference block in the
combined data block group. If the target data block is not the
reference block in the combined data block group, the storage
system may determine differential data between the target data
block and the reference block, and compress the differential data.
If the target data block is the reference block in the combined
data block group, the storage system may compress the target data block
in a conventional compression manner, and perform similar
compression on the other data blocks in the combined data block
group in a same manner as the target data block. In this way,
similar compression and deduplication are combined. When
deduplication cannot be performed, similar compression can be used
to further compress some data, to improve a reduction rate.
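As a rough illustration of compressing differential data against a reference block, the following Python sketch uses a byte-wise XOR as the differential data and zlib as the compressor. Both choices, the equal-size requirement, and the function names are assumptions for illustration; the application does not prescribe a particular differencing or compression algorithm.

```python
import zlib


def xor_delta(block: bytes, reference: bytes) -> bytes:
    """Byte-wise XOR of two equal-sized blocks (assumed form of differential data)."""
    if len(block) != len(reference):
        raise ValueError("this sketch assumes equal-sized blocks")
    return bytes(a ^ b for a, b in zip(block, reference))


def compress_similar(block: bytes, reference: bytes) -> bytes:
    """Compress only the difference between the block and the reference block."""
    return zlib.compress(xor_delta(block, reference))


def decompress_similar(payload: bytes, reference: bytes) -> bytes:
    """Restore the original block from the compressed difference and the reference."""
    return xor_delta(zlib.decompress(payload), reference)
```

When the target data block is close to the reference block, the XOR result is mostly zero bytes, so the compressed difference is typically much smaller than the block itself.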
[0010] In a possible implementation, the determining whether
deduplication can be performed on a target data block includes:
generating a deduplication fingerprint of the target data block;
and querying whether the storage system has a fingerprint the same
as the deduplication fingerprint, to determine whether
deduplication can be performed on the target data block.
[0011] In the solution shown in this embodiment of this
application, a deduplication fingerprint table is recorded in the
storage system. The deduplication fingerprint table includes a
deduplication fingerprint of a data block that is compressed and
stored, a deduplication fingerprint of a received data block that
is not compressed, and metadata information of the corresponding
data block. The storage system may input the target data block into
a fingerprint extraction function, to obtain the deduplication
fingerprint of the target data block as an output result. The
fingerprint extraction function may be a hash function. Then, the
storage system looks up the deduplication fingerprint in the
deduplication fingerprint table to determine whether the same
fingerprint is recorded for a received data block that is not
compressed. If it is, it may indicate that a same data block
exists. In this case, deduplication can be performed on the target
data block. If it is not, it may indicate that no data block
identical to the target data block exists. In this case,
deduplication cannot be performed on the target data block; the
target data block needs to be directly compressed in a conventional
manner (for example, Huffman encoding), and a compressed data block
is stored. In this way, whether deduplication can be performed on
the target data block can be accurately determined.
[0012] In a possible implementation, the determining whether
deduplication can be performed on a target data block includes:
determining a load of the storage system to determine whether
deduplication can be performed on the target data block.
[0013] In the solution shown in this embodiment of this
application, the load of the storage system directly affects
storage efficiency of a data block. The storage system may
determine the load of the storage system, and determine whether the
load meets a load exceeding condition. The load may be reflected by
a central processing unit (CPU) usage, a storage space usage, and a
current time period. If the load meets the load exceeding
condition, the storage system performs deduplication to improve
processing efficiency of the storage system. If the load does not
meet the load exceeding condition, the storage system has high
processing efficiency and does not perform deduplication. In this
way, the processing efficiency of the storage system can be
improved by determining the load.
[0014] In a possible implementation, the method further includes:
consecutively storing, in a same storage block, compressed data
obtained after similar compression is performed on the target data
block, and compressed data of another data block in the combined
data block group.
[0015] In the solution shown in this embodiment of this
application, when compressed data obtained after similar
compression is performed on the target data block is stored, a
storage block in which the compressed data of the other data block
in the combined data block group is stored and a storage location
of the compressed data in the storage block may be determined.
Then, the compressed data obtained after similar compression is
performed on the target data block and the compressed data of the
other data block in the combined data block group are consecutively
stored together. In this way, during data reading, differential
data and data of the reference block can be read at a time. This
can improve data reading efficiency.
[0016] In a possible implementation, if there are a plurality of
data blocks other than the reference block in the combined data
block group, compressed data of m data blocks is before the data of
the reference block, and compressed data of n data blocks is after
the reference block, where a difference between m and n is equal to
any one of 0, 1, or -1, and both m and n are greater than or equal
to 1.
[0017] In the solution shown in this embodiment of this
application, if there are a plurality of data blocks other than the
reference block in the combined data block group, assuming that
there are m+n data blocks other than the reference block, the
compressed data of m data blocks may be set before the data of the
reference block, and the compressed data of n data blocks may be
set after the data of the reference block. If m+n is an odd number,
a relationship between m and n may be that m-n is equal to 1 or -1.
If m+n is an even number, a relationship between m and n may be
that m-n is equal to 0. In this way, during data reading, if
differential data of a data block after the reference block needs
to be read, the reading may directly start from the reference block
until the differential data of the data block is read, without a
need to read differential data of all data blocks. If differential
data of a data block before the reference block needs to be read,
the reading may directly start from the differential data of the
data block, and ends after the data of the reference block is read,
without a need to read all the data. Therefore, less data is read,
and reading efficiency is improved.
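The placement rule and its effect on reads can be sketched in Python as follows. The function names and the use of list indices to stand for storage positions are assumptions for illustration only.

```python
def split_counts(others: int) -> tuple[int, int]:
    """Return (m, n) with m + n == others and m - n equal to 0 or -1."""
    m = others // 2
    n = others - m
    return m, n


def layout(reference: str, others: list[str]) -> list[str]:
    """Place m blocks before the reference block and n blocks after it."""
    m, _ = split_counts(len(others))
    return others[:m] + [reference] + others[m:]


def read_span(order: list[str], reference: str, wanted: str) -> tuple[int, int]:
    """Inclusive index range that must be read to restore `wanted`.

    The span always runs between the wanted block and the reference
    block, so at most about half of the group is read.
    """
    r = order.index(reference)
    w = order.index(wanted)
    return (min(r, w), max(r, w))
```

For example, with four non-reference blocks `d1` to `d4`, the order is `d1, d2, R, d3, d4`; restoring `d3` reads only positions 2 to 3, and restoring `d2` reads only positions 1 to 2.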
[0018] In a possible implementation, the determining, based on the
similar fingerprint, a combined data block group to which the
target data block belongs includes: determining, based on a similar
fingerprint quantity, a data block group corresponding to the
target data block, where the similar fingerprint quantity is a
quantity of same similar fingerprints in any two data blocks in one
data block group; and forming, in the data block group
corresponding to the target data block, a first quantity of data
blocks that have a same target fingerprint as the target data block
into the combined data block group to which the target data block
belongs, where a data amount of differential data between the
target data block and a data block that has the target fingerprint
is less than a data amount of differential data between the target
data block and a data block that does not have the target
fingerprint.
[0019] In the solution shown in this embodiment of this
application, a similar fingerprint table is established in the
storage system. The similar fingerprint table includes a
correspondence between each similar fingerprint and metadata
information of a data block. The storage system may determine,
based on the similar fingerprint table, an uncompressed data block
corresponding to each fingerprint in the similar fingerprint of the
target data block. Then, data blocks having a similar fingerprint
quantity of same fingerprints are grouped into one group by using
the similar fingerprint quantity. In this way, the data block group
corresponding to the target data block can be obtained. In the data
block group corresponding to the target data block, data blocks
that have a same target fingerprint as the target data block and
that do not form a combined data block group with another data
block may be successively selected from each data block group, to
form the combined data block group to which the target data block
belongs. A combined data block group to which any data block
belongs may be determined in this manner. Because the data amount
of the differential data between the target data block and the data
block that has the target fingerprint is less than the data amount
of the differential data between the target data block and the data
block that does not have the target fingerprint, the data block
that has the target fingerprint and the target data block are
selected to form a combined data block group, and are compressed
together. This can improve the reduction rate.
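A similar fingerprint table of the kind described above can be sketched in Python as an inverted index from each fingerprint value to the data blocks that contain it. This is a simplified sketch: block identifiers stand in for metadata information, and candidates are selected by the number of values they share with the target data block, whereas the application defines the similar fingerprint quantity between any two data blocks in a group. All names are hypothetical.

```python
from collections import Counter, defaultdict


def build_similar_table(fingerprints: dict[str, list[int]]) -> dict[int, set[str]]:
    """Map each similar-fingerprint value to the blocks that contain it."""
    table = defaultdict(set)
    for block_id, values in fingerprints.items():
        for value in set(values):
            table[value].add(block_id)
    return table


def candidate_group(target_fp: list[int],
                    table: dict[int, set[str]],
                    min_shared: int) -> set[str]:
    """Blocks sharing at least `min_shared` fingerprint values with the target."""
    hits = Counter()
    for value in set(target_fp):
        for block_id in table.get(value, ()):
            hits[block_id] += 1
    return {block_id for block_id, count in hits.items() if count >= min_shared}
```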
[0020] In a possible implementation, the determining, based on the
similar fingerprint, a combined data block group to which the
target data block belongs includes: determining, based on a similar
fingerprint quantity, a data block group corresponding to the
target data block, where the similar fingerprint quantity is a
quantity of same similar fingerprints in any two data blocks in one
data block group; determining a quantity of same similar
fingerprints in the target data block and in a data block in each
data block group; and forming, in the data block group
corresponding to the target data block, a first quantity of data
blocks that have a maximum quantity of same similar fingerprints as
the target data block into the combined data block group to which
the target data block belongs.
[0021] In the solution shown in this embodiment of this
application, a similar fingerprint table is established in the
storage system. The similar fingerprint table includes a
correspondence between each similar fingerprint and metadata
information of a data block. The storage system may determine,
based on the similar fingerprint table, an uncompressed data block
corresponding to each fingerprint in the similar fingerprint of the
target data block. Then, data blocks having a similar fingerprint
quantity of same fingerprints are grouped into one group by using
the similar fingerprint quantity. In this way, the data block group
corresponding to the target data block can be obtained. For the
target data block, in the data block group corresponding to the
target data block, data blocks that do not form a combined data
block group with another data block are determined, a quantity of
same similar fingerprints in the data blocks and in the target data
block is determined, and then the data blocks are arranged in
descending order. The first quantity of the data blocks are
consecutively selected from the beginning in sequence, to form the
combined data block group to which the target data block belongs.
In this way, if data blocks have more same similar fingerprints, it
indicates that data blocks are more similar. Data blocks having a
relatively large quantity of same similar fingerprints may be
selected to form the combined data block group, so that a data
reduction rate can be improved.
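The descending-order selection described above can be sketched in Python as follows. Treating a similar fingerprint as a set of values, and the names `shared_count`, `pick_combined_group`, and `first_quantity`, are assumptions for illustration.

```python
def shared_count(fp_a: list[int], fp_b: list[int]) -> int:
    """Number of similar-fingerprint values that two blocks have in common."""
    return len(set(fp_a) & set(fp_b))


def pick_combined_group(target_fp: list[int],
                        candidates: dict[str, list[int]],
                        first_quantity: int) -> list[str]:
    """Rank candidates by shared fingerprint count and take the first ones.

    `candidates` is assumed to hold only blocks that do not already
    belong to another combined data block group.
    """
    ranked = sorted(
        candidates,
        key=lambda block_id: shared_count(target_fp, candidates[block_id]),
        reverse=True,
    )
    return ranked[:first_quantity]
```

Because Python's sort is stable, candidates with the same shared count keep their original order, which makes the selection deterministic.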
[0022] According to an aspect, an apparatus for compressing data of
a storage system is provided. The apparatus includes one or more
modules, configured to perform the foregoing method for compressing
data of a storage system.
[0023] According to an aspect, a storage device is provided. The
storage device includes an interface and a processor. The interface
and the processor cooperate to perform the foregoing method for
compressing data of a storage system.
[0024] According to an aspect, a computer-readable storage medium
is provided. The computer-readable storage medium stores an
instruction, and when the instruction is run on
a storage system, the storage system is enabled to perform the
foregoing method for compressing data of a storage system.
[0025] According to an aspect, a computer program product including
an instruction is provided. When the computer program product runs
on a storage system, the storage system is enabled to perform the
foregoing method for compressing data of a storage system.
[0026] The technical solutions provided in this application include
at least the following beneficial effects:
[0027] In the embodiments of this application, when a data block is
stored, whether deduplication can be performed on a target data
block is determined; when deduplication cannot be performed on the
target data block, a similar fingerprint of the target data block
is obtained; a combined data block group to which the target data
block belongs is determined based on the similar fingerprint; and
similar compression is performed on the target data block based on
a reference block in the combined data block group. In this way,
similar compression and deduplication are combined. When
deduplication cannot be performed, similar compression can be used
to further compress some data, to improve a reduction rate.
BRIEF DESCRIPTION OF DRAWINGS
[0028] FIG. 1 is an architectural diagram of a storage system
according to an example embodiment of this application;
[0029] FIG. 2 is a structural diagram of a storage system according
to an example embodiment of this application;
[0030] FIG. 3 is a flowchart of a method for compressing data of a
storage system according to an example embodiment of this
application;
[0031] FIG. 4 is a schematic diagram of storage of a data block
according to an example embodiment of this application;
[0032] FIG. 5 is a schematic diagram of storage of a data block
according to an example embodiment of this application;
[0033] FIG. 6 is a schematic diagram of storage of a storage block
according to an example embodiment of this application;
[0034] FIG. 7 is a schematic diagram of a compressed block
according to an example embodiment of this application;
[0035] FIG. 8 is a flowchart of a method for reading data according
to an example embodiment of this application; and
[0036] FIG. 9 is a schematic diagram of a structure of an apparatus
for compressing data of a storage system according to an example
embodiment of this application.
DESCRIPTION OF EMBODIMENTS
[0037] To make the objectives, technical solutions, and advantages
of this application clearer, the following further describes the
implementations of this application in detail with reference to the
accompanying drawings.
[0038] To facilitate understanding of the embodiments of this
application, the following first describes a system architecture
and the terms used in the embodiments of this application.
[0039] The embodiments of this application are applicable to a
storage system in the storage field. The storage system may be a
server with a storage function, a server cluster with a storage
function, a storage array, a distributed storage system, or the
like. An architecture of the storage system may be shown in FIG. 1.
The storage system may include a space management layer, a data
management layer, and an underlying storage layer. The space
management layer may include a plurality of execution modules. The
underlying storage layer may also include a plurality of execution
modules. The space management layer may be configured to connect to
an upper layer, receive data, and send the data to the data
management layer. The data management layer may be configured to
compress an input data block to obtain a compressed data block, and
send the compressed data block to the underlying storage layer for
storage. For example, an input of the data management layer is data
blocks A1, B, A2, . . . , X, and B, and an output of the data
management layer is data blocks A1, A2, B, . . . , and X. Because
only one B is stored, the amount of stored data is reduced.
[0040] Compression: A compression technology can be classified into
lossless compression and lossy compression. Lossless compression
means that when compressed data is decompressed, the obtained data is
the same as the original data. The storage system mainly uses lossless
compression algorithms, such as Huffman encoding, Lempel-Ziv-Welch
(LZW), and Deflate. Lossy compression means that when compressed data
is decompressed, the obtained data is different from the original data.
Lossy compression is mainly applicable to the field of image or
video compression.
[0041] Deduplication: Same files or data blocks in a distributed
storage system are eliminated, to effectively reduce physical
storage space occupied by data. This technology can be used in
storage backup and archiving systems. Generally, a file is divided
into a plurality of data blocks, a deduplication fingerprint of
each data block is calculated, and data with same fingerprints
indicates that data blocks have same content. Therefore, original
data can be stored only once for data blocks with same
fingerprints, to reduce a data amount.
[0042] An embodiment of this application provides a method for
compressing data of a storage system. The method may be performed
by the storage system.
[0043] FIG. 2 is a block diagram of a structure of a storage system
according to an embodiment of this application. The storage system
may include at least an interface 201 and a processor 202. The
interface 201 may be configured to receive data. In a specific
implementation, the interface 201 may be a hardware interface, for
example, a network interface card (NIC) or a host bus adapter (HBA),
or may be a program interface module. The processor 202 may be a
combination of a central processing unit and a memory, or may be a
field programmable gate array (FPGA) or other hardware. The
processor 202 may alternatively be a combination of a central
processing unit and other hardware, for example, a combination of
the central processing unit and an FPGA.
The processor 202 may be a control center of the storage system,
and is connected to all parts of the entire storage system through
various interfaces and lines. In a possible implementation, the
processor 202 may include one or more processing cores. Further,
the storage system further includes a hard disk, configured to
provide storage space for the storage system.
[0044] An embodiment of this application provides a method for
compressing data of a storage system. As shown in FIG. 3, an
execution procedure of the method may include the following
steps.
[0045] Step 301: Determine whether deduplication can be performed
on a target data block.
[0046] During implementation, after the storage system is online,
if an upper-layer application needs to store data, the upper-layer
application may send the data to the storage system. The storage
system receives the data. If a data amount of the data is
relatively large, the storage system may divide the data into data
blocks, and a size of each data block may be 4 KB, 8 KB, or another
value. If the data amount of the data is less than a data amount of
one data block, the data may be directly determined as a data
block. Then, the storage system may periodically process the data
blocks, or process the data blocks in batches when a data amount of
received data blocks reaches a specific value. Any data block in
the data blocks processed in batches this time may be the target
data block, and whether deduplication can be performed on the
target data block may be determined based on a current status of
the storage system, whether the storage system stores a
deduplication fingerprint of the target data block, or the
like.
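The division of received data into data blocks can be sketched in Python as follows. The 4 KB block size is one of the sizes mentioned above; leaving a shorter trailing block unpadded is an assumption, since the application does not say how a remainder is handled.

```python
# 4 KB, one of the block sizes mentioned in the text.
BLOCK_SIZE = 4096


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Divide data into fixed-size data blocks.

    Data no larger than one block is treated directly as a single data
    block. A shorter trailing block is kept as-is (an assumption).
    """
    if len(data) <= block_size:
        return [data]
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```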
[0047] In the step 301, there are a plurality of manners of
determining whether deduplication can be performed on the target
data block. The following provides two feasible
implementations.
[0048] Manner 1: Generate the deduplication fingerprint of the data
block, and query whether the storage system has a fingerprint that
is the same as the deduplication fingerprint, to determine whether
deduplication can be performed on the data block.
[0049] During implementation, a deduplication fingerprint table is
recorded in the storage system. The deduplication fingerprint table
includes a deduplication fingerprint of a data block that is
compressed and stored, a deduplication fingerprint of a received
data block that is not compressed, and metadata information of the
corresponding data block. Each time after the storage system
determines the deduplication fingerprint of the data block, the
storage system correspondingly adds the deduplication fingerprint
and corresponding metadata information to the deduplication
fingerprint table. The metadata information of the data block
includes an identifier indicating whether the data block is a
reference block or a duplicate block (if this has not yet been
determined, the field may be left unfilled and is filled in after
the determination is made), a storage location (present once the
data block has been stored), and the like.
[0050] The storage system may input the target data block into a
fingerprint extraction function, to obtain the deduplication
fingerprint of the target data block as an output result. The
fingerprint extraction function may be a hash function. Then, the
storage system determines, in the deduplication fingerprint table
by using the deduplication fingerprint, whether the same
deduplication fingerprint is already recorded for a received data
block that is not compressed. If the same deduplication fingerprint
is recorded, it may indicate that a same data block exists. In this
case, deduplication can be performed on the target data block. If
the same deduplication fingerprint is not recorded, it may indicate
that no data block identical to the target data block exists. In
this case, deduplication cannot be performed on the target data
block.
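For illustration only, the lookup in Manner 1 can be sketched as follows. The sketch assumes SHA-256 as the fingerprint extraction function (the text only says that it may be a hash function), and the names `dedup_fingerprint`, `DedupTable`, and `check_and_add` are hypothetical.

```python
import hashlib


def dedup_fingerprint(block: bytes) -> str:
    # Assumption: SHA-256 stands in for the unspecified hash function.
    return hashlib.sha256(block).hexdigest()


class DedupTable:
    """Hypothetical in-memory deduplication fingerprint table."""

    def __init__(self):
        # fingerprint -> metadata of the first data block seen with it
        self.entries = {}

    def check_and_add(self, block: bytes, location: str) -> bool:
        """Return True if deduplication can be performed on the block."""
        fp = dedup_fingerprint(block)
        if fp in self.entries:
            return True  # a same data block is already recorded
        # The role (reference or duplicate) is left unfilled for now.
        self.entries[fp] = {"role": None, "location": location}
        return False
```

A block whose fingerprint is new is recorded and reported as not deduplicable. A later block with identical content is reported as deduplicable.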
[0051] Manner 2: Determine a load of the storage system to
determine whether deduplication can be performed on the data
block.
[0052] During implementation, the storage system may determine the
load of the storage system. The load may be reflected by a CPU
usage, a storage space usage, and a current time period. The
storage system may determine whether a current CPU usage exceeds a
first value. If the current CPU usage exceeds the first value, the
storage system may determine that the load meets a load exceeding
condition, and perform deduplication. The storage system may
determine whether a current storage space usage exceeds a second
value. If the current storage space usage exceeds the second value,
the storage system may determine that the load meets the load
exceeding condition, and perform deduplication. The storage system
may determine a current time point, to determine a time period in
which the current time point is located. If the time period in
which the current time point is located is a target time period
(for example, the target time period may be from 7:00 to 24:00),
the load meets the load exceeding condition. The storage system may
perform any one or more of the foregoing operations to determine
whether the load meets the load exceeding condition. If none of the
foregoing conditions is met, the load does not meet the load
exceeding condition. The storage system may concurrently determine
whether the current CPU usage exceeds the first value, whether the
current storage space usage exceeds the second value, and whether
the current time point is in the target time period. As soon as
one determining result is that the load meets the load exceeding
condition, the storage system may stop the remaining determining
operations.
[0053] If the storage system determines that the load of the
storage system meets the load exceeding condition, the storage
system may perform deduplication on the data block. If the storage
system determines that the load of the storage system does not meet
the load exceeding condition, the storage system may determine that
deduplication does not need to be performed.
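For illustration only, the load exceeding condition of Manner 2 can be sketched as follows. The threshold defaults (0.8 and 0.7) are assumptions, because the text does not give the first value or the second value. The checks run one after another and stop at the first one that is met, which approximates the early stop described above without real concurrency.

```python
from datetime import time


def load_exceeds(cpu_usage: float, space_usage: float, now: time,
                 first_value: float = 0.8, second_value: float = 0.7,
                 period_start: time = time(7, 0)) -> bool:
    """Return True if any one of the three conditions is met."""
    checks = (
        lambda: cpu_usage > first_value,       # CPU usage check
        lambda: space_usage > second_value,    # storage space usage check
        lambda: now >= period_start,           # 7:00 up to midnight
    )
    # any() stops evaluating as soon as one check returns True.
    return any(check() for check in checks)
```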
[0054] It should be noted that, because a CPU needs to be occupied
each time a data block is compressed, the CPU needs to be
considered. Because storage space is also occupied when duplicate
data is stored, the storage space also needs to be considered. In
some time periods, the upper-layer application stores a large
amount of data, and in other time periods, the upper-layer
application stores a small amount of data. Therefore,
deduplication needs to be
performed during peak hours and does not need to be performed
during off-peak hours.
[0055] Step 302: When deduplication cannot be performed on the
target data block, obtain a similar fingerprint of the target data
block.
[0056] The similar fingerprint may include one or more
fingerprints.
[0057] During implementation, when deduplication cannot be
performed on the target data block, the storage system may obtain
the similar fingerprint of the target data block. The similar
fingerprint of the target data block may be added to a similar
fingerprint table. The storage system stores the similar
fingerprint table. The similar fingerprint table includes a
correspondence between each similar fingerprint and metadata
information of a data block. Similar fingerprints included in the
similar fingerprint table are similar fingerprints of data blocks
whose similar fingerprints have been determined (including similar
fingerprints of uncompressed data blocks and similar fingerprints
of compressed data blocks). For any data block, the metadata
information in the similar fingerprint table includes information
such as an identifier indicating whether the data block is a
reference block or a similar block (if this has not yet been
determined, the field may be left unfilled and is filled in after
the determination is made), and a storage location (present once
the data block has been stored). In addition, the metadata
information may further record a strongly similar fingerprint, for
example, a target fingerprint identifier mentioned below. A
strongly similar fingerprint of a data block is determined based on
all similar fingerprints of the data block, and may be obtained by
performing processing such as weighting. For example, there are
three similar fingerprints: a fingerprint 1, a fingerprint 2, and a
fingerprint 3. Each fingerprint corresponds to a weight value, and
the weight values are a, b, and c respectively. A sum of a, b, and
c is equal to 1, and the strongly similar fingerprint is equal to
a*fingerprint 1+b*fingerprint 2+c*fingerprint 3. In this way, the
strongly similar fingerprint reflects all fingerprints in the
similar fingerprint. The similar
fingerprint table may be stored in a form of a table. As shown in
Table 1, similar fingerprints include a fingerprint 1, a
fingerprint 2, . . . , and a fingerprint n. The fingerprint 1
corresponds to metadata information of a data block 1, metadata
information of a data block 2, metadata information of a data block
3, and the like. The fingerprint 2 corresponds to the metadata
information of the data block 2, the metadata information of the
data block 3, metadata information of a data block 5, and the like.
The fingerprint n corresponds to the metadata information of the
data block 1, metadata information of a data block 4, and the
like.
TABLE-US-00001 TABLE 1
Fingerprint | Metadata information
Fingerprint 1 | The metadata information of the data block 1, the metadata information of the data block 2, and the metadata information of the data block 3
Fingerprint 2 | The metadata information of the data block 2, the metadata information of the data block 3, and the metadata information of the data block 5
. . . | . . .
Fingerprint n | The metadata information of the data block 1, and the metadata information of the data block 4
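For illustration only, the similar fingerprint table and the weighted strongly similar fingerprint can be sketched as follows. The sketch assumes an in-memory mapping from each fingerprint to a list of metadata records and assumes numeric fingerprints for the weighting. The names `SimilarTable` and `strong_fingerprint` are hypothetical.

```python
from collections import defaultdict


def strong_fingerprint(fingerprints, weights):
    """Weighted combination a*fp1 + b*fp2 + c*fp3, weights summing to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * f for w, f in zip(weights, fingerprints))


class SimilarTable:
    """Hypothetical similar fingerprint table: fingerprint -> metadata list."""

    def __init__(self):
        self.index = defaultdict(list)

    def add(self, block_id, fingerprints, location=None):
        # One metadata record per data block, shared by all its fingerprints.
        meta = {"block": block_id, "role": None, "location": location}
        for fp in fingerprints:
            self.index[fp].append(meta)

    def blocks_for(self, fp):
        return [m["block"] for m in self.index.get(fp, [])]


# Rows of Table 1, with "fp_n" standing for the fingerprint n.
table = SimilarTable()
table.add(1, ["fp1", "fp_n"])
table.add(2, ["fp1", "fp2"])
table.add(3, ["fp1", "fp2"])
table.add(4, ["fp_n"])
table.add(5, ["fp2"])
```

With this data, looking up the fingerprint 1 yields the data blocks 1, 2, and 3, matching the first row of Table 1.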
[0058] It should be noted that, in this embodiment of this
application, the similar fingerprint of the target data block may
be determined when it is determined that deduplication cannot be
performed on the target data block, or the similar fingerprint of
the target data block may be determined when whether deduplication
can be performed is determined. When a similar fingerprint is
determined, a hash algorithm may be used to determine the similar
fingerprint of the target data block. The processing may be:
dividing the target data block into a plurality of small data units
(each data unit has a same length), and calculating a hash value,
namely the similar fingerprint of the target data block, for each
data unit by using a preset hash function.
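For illustration only, the extraction of a similar fingerprint can be sketched as follows. The unit length of 512 bytes and the use of an 8-byte BLAKE2b digest as the preset hash function are assumptions. The function name `similar_fingerprints` is hypothetical.

```python
import hashlib


def similar_fingerprints(block: bytes, unit_size: int = 512) -> list:
    """Split the block into equal-length units and hash each unit."""
    fps = []
    for off in range(0, len(block), unit_size):
        unit = block[off:off + unit_size]
        # Assumption: an 8-byte BLAKE2b digest is the per-unit hash value.
        digest = hashlib.blake2b(unit, digest_size=8).digest()
        fps.append(int.from_bytes(digest, "big"))
    return fps
```

Two data blocks that differ in only one unit share all other per-unit hash values, which is what allows the similar fingerprint table to relate them.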
[0059] It should be noted that, when the storage system is just
online, the similar fingerprint table is blank. As time goes by,
more data blocks are stored, and the similar fingerprint table is
increasingly large.
[0060] It should be further noted that the foregoing hash functions
for determining the deduplication fingerprint and the similar
fingerprint are different functions.
[0061] In addition, when deduplication cannot be performed on the
target data block, the similar fingerprint of the target data block
is directly determined. Alternatively, when deduplication cannot be
performed on the target data block, the similar fingerprint of the
target data block may not be directly determined. Instead, whether
the load of the storage system meets the load exceeding condition
is first determined (for the determining processing, refer to the
foregoing Manner 2 of step 301). If the load exceeding condition is
met, the
similar fingerprint of the target data block may be generated. If
the load exceeding condition is not met, subsequent similar
compression processing may not be performed, in other words, the
similar fingerprint of the target data block is not determined, and
subsequent steps 303 and 304 may not be performed.
[0062] Step 303: Determine, based on the similar fingerprint, a
combined data block group to which the target data block
belongs.
[0063] During implementation, the storage system may determine, in
the similar fingerprint table by using the similar fingerprints of
the target data block, a data block group corresponding to each
fingerprint in the similar fingerprint of the target data block,
and then determine, in the data block groups, the combined data
block group to which the target data block belongs.
[0064] In an optional implementation, the combined data block group
to which the target data block belongs may be determined in a
plurality of manners. The following provides two feasible
manners.
[0065] Manner 1: Determine, based on a similar fingerprint
quantity, a data block group corresponding to the target data
block, where the similar fingerprint quantity is a quantity of same
similar fingerprints in any two data blocks in one data block
group; and form, in the data block group corresponding to the
target data block, a first quantity of data blocks that have a same
target fingerprint as the target data block into the combined data
block group to which the target data block belongs.
[0066] The similar fingerprint quantity may be set in advance, and
is stored in the storage system. For example, the similar
fingerprint quantity may be 2. The similar fingerprint quantity is
related to a quantity of similar fingerprints extracted from each
data block. Generally, a larger quantity of fingerprints included
in a similar fingerprint indicates a larger similar fingerprint
quantity, and a smaller quantity of fingerprints included in a
similar fingerprint indicates a smaller similar fingerprint
quantity. A data amount of differential data between data blocks
having the target fingerprint (which may also be referred to as a
strongly similar fingerprint) is the smallest, so that a data
amount of compressed data of the data blocks is the smallest. The
first quantity may be preset, for example, 8.
[0067] During implementation, the foregoing similar fingerprint
table is established in the storage system, and an uncompressed
data block corresponding to each fingerprint in the similar
fingerprint of the target data block may be determined from the
similar fingerprint table. Then, data blocks having a similar
fingerprint quantity of same fingerprints are grouped into one
group by using the similar fingerprint quantity, to determine a
data block group where the target data block is located, namely the
data block group corresponding to the target data block.
[0068] When the target data block has not been selected as a member
of another reference block, in the data block group corresponding
to the target data block, data blocks that have a same target
fingerprint as the target data block and that do not form a
combined data block group with another data block may be
successively selected from each data block group, to form the
combined data block group to which the target data block belongs.
For any data block, a combined data block group to which each data
block belongs may be determined in this manner.
[0069] When members are being selected for another reference block
and a target fingerprint exists in both that reference block and
the target data block, a combined data block group to which the
reference block belongs may be determined as the combined data
block group to which the target data block belongs.
[0070] It should be noted that, if a quantity of data blocks in a
data block group is limited, after the quantity of data blocks in
the combined data block group reaches the first quantity, no data
block is further added to the combined data block group. For
example, the target data block corresponds to three data block
groups. When a quantity of data blocks that are in the first two
data block groups and that have the target fingerprint of the
target data block has reached the first quantity, the data blocks
form a combined data block group. In this case, the combined data
block group to which the target data block belongs is
determined.
[0071] For example, similar fingerprints of a target data block C3
include a fingerprint 1, a fingerprint 2, and a fingerprint 3, a
target fingerprint of C3 is a fingerprint 4, and the similar
fingerprint quantity is 1. The fingerprint 1 in the similar
fingerprints of the data block corresponds to data blocks C0, C1,
C2, C3, C4, C5, and C6. The fingerprint 2 corresponds to data
blocks C0, D1, C3, D3, C5, and C7. The fingerprint 3 corresponds to
data blocks C0, C3, C5, C7, D5, and D6. Because the similar
fingerprints of the target data block include the fingerprint 1,
the fingerprint 2, and the fingerprint 3, a data block group formed
by the data blocks corresponding to the fingerprint 1 is a data
block group corresponding to the target data block, a data block
group formed by the data blocks corresponding to the fingerprint 2
is a data block group corresponding to the target data block, and a
data block group formed by the data blocks corresponding to the
fingerprint 3 is a data block group corresponding to the target
data block. For the fingerprint 1, C0 is selected as a reference
block. If both C3 and C0 have a same strongly similar fingerprint
(namely, the target fingerprint), C3 may be left in a data block
group in which C0 is used as a reference block, and the data block
group in which C0 is used as a reference block is a combined data
block group to which the target data block C3 belongs. For the
fingerprint 1, the target fingerprint also exists in C5 and C6. In
this case, C5 and C6 may be added to the data block group in which
C0 is used as a reference block. For the fingerprint 2, the target
fingerprint also exists in C7, and C7 may be added to the data
block group in which C0 is used as a reference block. Because the
target fingerprint exists in all selected data blocks, the target
fingerprint exists in all data blocks in the combined data block
group.
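For illustration only, the selection in Manner 1 can be sketched with the data of the foregoing example. The sketch assumes that only C0, C3, C5, C6, and C7 carry the target fingerprint (the fingerprint 4), and the function name `combined_group_by_target_fp` and its parameters are hypothetical.

```python
def combined_group_by_target_fp(target, groups, target_fp_of, grouped,
                                first_quantity=8):
    """Collect up to first_quantity blocks sharing the target fingerprint.

    groups: one list of block ids per similar fingerprint of the target.
    target_fp_of: block id -> strongly similar (target) fingerprint.
    grouped: blocks that already belong to some combined data block group.
    """
    wanted = target_fp_of[target]
    members = [target]
    for group in groups:
        for block in group:
            if len(members) >= first_quantity:
                return members  # the group is full
            if block in members or block in grouped:
                continue
            if target_fp_of.get(block) == wanted:
                members.append(block)
    return members


groups = [
    ["C0", "C1", "C2", "C3", "C4", "C5", "C6"],   # fingerprint 1
    ["C0", "D1", "C3", "D3", "C5", "C7"],         # fingerprint 2
    ["C0", "C3", "C5", "C7", "D5", "D6"],         # fingerprint 3
]
# Assumption: blocks not listed here do not have the fingerprint 4.
target_fp_of = {b: "fp4" for b in ["C0", "C3", "C5", "C6", "C7"]}
members = combined_group_by_target_fp("C3", groups, target_fp_of, set())
```

The sketch starts from the target data block rather than from the reference block C0, but it arrives at the same membership as the example: C0, C3, C5, C6, and C7.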
[0072] Manner 2: Determine, based on a similar fingerprint
quantity, a data block group corresponding to the target data
block, where the similar fingerprint quantity is a quantity of same
similar fingerprints in any two data blocks in one data block
group; determine a quantity of same similar fingerprints in the
target data block and in a data block in each data block group; and
form, in the data block group corresponding to the target data
block, a first quantity of data blocks that have a maximum quantity
of same similar fingerprints as the target data block into the
combined data block group to which the target data block
belongs.
[0073] During implementation, the foregoing similar fingerprint
table is established in the storage system, and an uncompressed
data block corresponding to each fingerprint in the similar
fingerprint of the target data block may be determined from the
similar fingerprint table. Then, data blocks having a similar
fingerprint quantity of same fingerprints are grouped into one
group by using the similar fingerprint quantity, to determine a
data block group where the target data block is located, namely the
data block group corresponding to the target data block.
[0074] When the target data block has not been selected as a member
of another reference block, data blocks that are in the data block
group corresponding to the target data block and that do not form a
combined data block group with another data block are determined, a
quantity of same similar fingerprints in each of the data blocks
and in the target data block is determined, and then the data
blocks are arranged in descending order of that quantity. The first
quantity of the data blocks are
selected to form the combined data block group to which the target
data block belongs.
[0075] When members are being selected for another reference
block, the first quantity of members need to be selected for that
reference block. Uncompressed data blocks in a data block group
corresponding to the reference block are ranked in descending
order of the quantity of similar fingerprints that they share with
the reference block. If the target data block ranks within the
first quantity, a combined data block group to which the reference
block belongs may be determined as the combined data block group to
which the target data block belongs.
[0076] For example, similar fingerprints of a target data block E3
include a fingerprint 1, a fingerprint 2, and a fingerprint 3, and
the similar fingerprint quantity is 1. The fingerprint 1 in the
similar fingerprints of the data block corresponds to data blocks
E0, E1, E2, E3, E4, E5, and E6. The fingerprint 2 corresponds to
data blocks E0, F1, E3, F3, E5, and E7. The fingerprint 3
corresponds to data blocks E0, E3, E5, E7, F5, and F6. Because the
similar fingerprints of the target data block include the
fingerprint 1, the fingerprint 2, and the fingerprint 3, a data
block group formed by the data blocks corresponding to the
fingerprint 1 is a data block group corresponding to the target
data block, a data block group formed by the data blocks
corresponding to the fingerprint 2 is a data block group
corresponding to the target data block, and a data block group
formed by the data blocks corresponding to the fingerprint 3 is a
data block group corresponding to the target data block. Currently,
uncompressed data blocks include E2, E4, E5, E6, F5, and F6.
Quantities of same similar fingerprints are arranged in a
descending order as E4, E5, F5, F6, E2, and E6. The first quantity
is 6. E4, E5, F5, F6, E2, and E3 may be selected to form a combined
data block group.
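For illustration only, the ranking in Manner 2 can be sketched as follows. The data below is hypothetical and does not reproduce the foregoing example, because the example does not state the shared-fingerprint counts. The sketch assumes that the first quantity counts the target data block itself, as the selection of E3 in the example suggests. The function name `combined_group_by_overlap` is hypothetical.

```python
def combined_group_by_overlap(target, fps_of, candidates, first_quantity):
    """Pick the candidates sharing the most similar fingerprints with target."""
    target_fps = set(fps_of[target])
    # sorted() is stable, so candidates with equal counts keep their order.
    ranked = sorted(candidates,
                    key=lambda b: len(target_fps & set(fps_of[b])),
                    reverse=True)
    return [target] + ranked[:first_quantity - 1]


# Hypothetical similar fingerprints of five data blocks.
fps_of = {
    "T": [1, 2, 3],
    "P": [1, 2, 3],   # shares 3 fingerprints with T
    "Q": [1, 2, 9],   # shares 2
    "R": [1, 8, 9],   # shares 1
    "S": [7, 8, 9],   # shares 0
}
group = combined_group_by_overlap("T", fps_of, ["S", "R", "Q", "P"], 3)
```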
[0077] It should be noted that, if a quantity of data blocks in a
data block group is limited, after the quantity of data blocks in
the combined data block group reaches the first quantity, no data
block is further added to the combined data block group.
[0078] In addition, after a second quantity of batch processing
cycles or a second quantity of batch processing runs (for example,
the second quantity may be 2), if no similar data block or
duplicate data block is found for some data blocks, the data blocks
may be directly compressed and stored in a conventional manner.
[0079] In addition, in this embodiment of this application, the
data block group corresponding to the target data block may
alternatively be determined not based on the similar fingerprint
quantity. A first data block of processed data blocks in the batch
is used as the reference block. The first quantity of data blocks
having same strongly similar fingerprints are selected from the
remaining data blocks, to form a data block group to which the
first data block belongs. Alternatively, the first quantity of data
blocks having a maximum quantity of same similar fingerprints as
the first data block are selected from the remaining data blocks,
to form a data block group to which the first data block belongs.
Next, a first data block is selected from data blocks that do not
form a data block group as the reference block, and then processing
of selecting data blocks from the remaining data blocks continues
to be performed, to obtain a data block group to which the first
data block belongs. In this manner, the combined data block group
to which the target data block belongs may be obtained. In
addition, if the first quantity of data blocks having the same
strongly similar fingerprints cannot be selected for the first data
block, a data block whose similar fingerprint quantity exceeds a
value may be selected after the current selection, and added to the
data block group using the first data block as the reference
block.
[0080] For the foregoing combined data block group, manners of
determining the reference block are further provided in this
embodiment of this application.
[0081] Manner 1: In the combined data block group, a first added
data block is determined as the reference block.
[0082] During implementation, in the combined data block group, an
adding order of each data block may be determined, and an earliest
added data block is determined as the reference block of the
combined data block group. For example, in the foregoing example,
C0 is first added, and C0 is determined as the reference block.
[0083] Manner 2: In the combined data block group, a data block
that has a highest data reduction rate of the combined data block
group is determined as the reference block.
[0084] During implementation, when the data blocks in the combined
data block group are compressed, each data block in turn is used as
a candidate reference block, and each data block in the combined
data block group is compressed to obtain compressed data of each
data block in the combined data block group. Then, a data amount of
the combined data block group before compression is compared with a
data amount of the compressed data of the combined data block
group, to obtain a reduction rate corresponding to the candidate
reference block. This manner may be used to determine a reduction
rate corresponding to every candidate reference block. The data
block with the largest reduction rate is selected as the reference
block.
[0085] For example, the combined data block group includes three
data blocks: A1, A2, and A3. When A1 is used as the reference
block, an overall reduction rate of the combined data block group
is 77%. When A2 is used as the reference block, the overall
reduction rate of the combined data block group is 65%. When A3 is
used as the reference block, the overall reduction rate of the
combined data block group is 50%. In this way, it may be obtained
that the overall reduction rate of the combined data block group is
the highest when A1 is used as the reference block. Therefore, in
the combined data block group, A1 may be used as the reference
block.
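For illustration only, Manner 2 of determining the reference block can be sketched as follows. The sketch assumes equal-length data blocks, a byte-wise XOR as the differential data, zlib as the compressor, and a reduction rate defined as 1 minus the ratio of the compressed data amount to the original data amount. None of these choices is stated in the text, and the function names are hypothetical.

```python
import zlib


def xor_delta(block: bytes, ref: bytes) -> bytes:
    # Assumption: differential data is the byte-wise XOR of equal-length blocks.
    return bytes(a ^ b for a, b in zip(block, ref))


def group_compressed_size(blocks: dict, ref_id) -> int:
    """Size of the group when ref_id is the reference block."""
    ref = blocks[ref_id]
    size = len(zlib.compress(ref))
    for block_id, data in blocks.items():
        if block_id != ref_id:
            size += len(zlib.compress(xor_delta(data, ref)))
    return size


def pick_reference(blocks: dict):
    """Try every block as the reference and keep the best reduction rate."""
    original = sum(len(d) for d in blocks.values())
    rates = {bid: 1 - group_compressed_size(blocks, bid) / original
             for bid in blocks}
    best = max(rates, key=rates.get)
    return best, rates
```

Every candidate requires compressing the whole group once, which is why this manner costs more than taking the first added data block.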
[0086] It should be noted that determining efficiency of the
foregoing manner 1 of determining the reference block is relatively
high, but a data block with a highest reduction rate may not be
selected. In the foregoing manner 2 of determining the reference
block, although a data block with a high reduction rate can
be determined, a selection process is complex and efficiency is
relatively low. Therefore, when there are a relatively large
quantity of data blocks in the combined data block group, the
manner 1 of determining the reference block may be selected, to
improve selection efficiency. However, when there are a relatively
small quantity of data blocks in the combined data block group, the
manner 2 of determining the reference block may be selected, to
provide a high reduction rate.
[0087] Step 304: Perform similar compression on the target data
block based on the reference block in the combined data block
group.
[0088] During implementation, the storage system may determine
differential data between the target data block and the reference
block in the combined data block group. If the reference block has
been compressed, the differential data may be directly compressed
to obtain compressed data of the target data block. If the
reference block has not been compressed, the reference block may be
compressed, and the differential data is compressed. Subsequently,
data of the target data block may be restored by using data of the
reference block and the differential data.
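For illustration only, similar compression and restoration can be sketched as follows. The sketch assumes equal-length data blocks, a byte-wise XOR as the differential data, and zlib as the compressor. The text does not specify either, and the function names are hypothetical.

```python
import zlib


def similar_compress(target: bytes, reference: bytes) -> bytes:
    """Compress the differential data between target and reference."""
    if len(target) != len(reference):
        raise ValueError("sketch assumes equal-length blocks")
    # Assumption: XOR is the differential encoding; matching bytes become 0.
    delta = bytes(t ^ r for t, r in zip(target, reference))
    return zlib.compress(delta)


def restore(compressed_delta: bytes, reference: bytes) -> bytes:
    """Rebuild the target from the reference and the differential data."""
    delta = zlib.decompress(compressed_delta)
    return bytes(d ^ r for d, r in zip(delta, reference))
```

When the target data block closely matches the reference block, the differential data is mostly zero bytes and compresses to a small fraction of the block size.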
[0089] In an optional implementation, in the storage system, data
in a same combined data block group is stored in storage blocks,
and may be stored in a same storage block or in different storage
blocks. This is not limited in this embodiment
of this application. When the data is stored in different storage
blocks, if the reference block and differential data of a currently
to-be-read data block are in a same storage block, the reference
block and the differential data may be directly read from the
storage block at a time (if the reference block is in the front,
the reading may be performed from the reference block to the
differential data of the to-be-read data block; and if the
reference block is in the back, the reading may be performed from
the differential data of the to-be-read data block to the reference
block). If the reference block and the differential data of the
currently to-be-read data block are not in a same storage block,
the reference block and the differential data may be separately
read from different storage blocks.
[0090] In an optional implementation, in the storage system,
compressed data is stored in a storage block. During storage,
compressed data of a same combined data block group needs to be
stored in one storage block and is consecutively stored. The
processing may be as follows:
[0091] consecutively storing, in a same storage block, compressed
data obtained after similar compression is performed on data
blocks, and compressed data of another data block in the combined
data block group.
[0092] During implementation, when compressed data obtained after
similar compression is performed on the target data block is
stored, a storage block in which the compressed data of the other
data block in the combined data block group is stored and a storage
location of the compressed data in the storage block may be
determined. Then, the compressed data obtained after similar
compression is performed on the target data block and the
compressed data of the other data block in the combined data block
group are consecutively stored together.
[0093] For example, as shown in FIG. 4, the data blocks are A1 and
A2, A0 is a reference block, dA1 is differential data between A1
and A0, and dA2 is differential data between A2 and A0. A0, dA1,
and dA2 may be stored in a same storage block, and are
consecutively stored, where A0 is adjacent to dA1, and dA1 is
adjacent to dA2.
[0094] It should be noted that, because a processing resource is
consumed each time data is read, the reference block and the
differential data are generally read at a time instead of being
read twice, to save the processing resource. Therefore, the
compressed data in the foregoing combined data block group is
consecutively stored in one storage block, so that the reference
block, and the differential data between the other data block and
the reference block may be read at a time during reading.
[0095] In an optional implementation, to reduce an amount of data
read at a time, a location in which the data of the reference block
is stored may be configured, and the processing may be as
follows:
[0096] if there are a plurality of data blocks other than the
reference block in the combined data block group, compressed data
of m data blocks is before the data of the reference block, and
compressed data of n data blocks is after the reference block,
where a difference between m and n is equal to any one of 0, 1, or
-1, and both m and n are greater than or equal to 1.
[0097] During implementation, if there are a plurality of data
blocks other than the reference block in the combined data block
group, assuming that there are m+n data blocks other than the
reference block, the compressed data of m data blocks may be set
before the data of the reference block, and the compressed data of
n data blocks may be set after the data of the reference block. If
m+n is an odd number, a relationship between m and n may be that
m-n is equal to 1 or -1. If m+n is an even number, a relationship
between m and n may be that m-n is equal to 0. In this way, during
data reading, if differential data of a data block after the
reference block needs to be read, the reading may directly start
from the reference block until the differential data of the data
block is read, without a need to read differential data of all data
blocks. If differential data of a data block before the reference
block needs to be read, the reading may directly start from the
differential data of the data block, and ends after the data of the
reference block is read, without a need to read all the data.
Therefore, less data is read, and reading efficiency is
improved.
[0098] For example, as shown in FIG. 5 that corresponds to FIG. 4,
in addition to the reference block, there are two data blocks A1
and A2 in the combined data block group, and A0 is stored between
dA1 and dA2. In this way, when data of A2 is read, the reading may
directly start from A0, and ends after dA2 is read, without a need
to read dA1. This can speed up the reading. When data of A1 is
read, the reading may start from dA1, and ends after A0 is read,
without a need to read dA2. This can speed up the reading.
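For illustration only, the placement of the reference block between the differential data, and the resulting read range, can be sketched as follows. The function names `layout` and `read_span` are hypothetical, and items are represented by their names rather than by stored bytes.

```python
def layout(reference, deltas):
    """Place m deltas before and n deltas after the reference, |m - n| <= 1."""
    m = len(deltas) // 2
    return deltas[:m] + [reference] + deltas[m:]


def read_span(order, reference, delta):
    """Items read in one pass to obtain both the reference and the delta."""
    i, j = order.index(reference), order.index(delta)
    lo, hi = min(i, j), max(i, j)
    return order[lo:hi + 1]


order = layout("A0", ["dA1", "dA2"])
```

For the data of FIG. 5, the order is dA1, A0, dA2. Reading A2 covers only A0 and dA2, and reading A1 covers only dA1 and A0.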
[0099] In addition, when there is one data block other than the
reference block in the combined data block group, the data of the
reference block may be located before compressed data of another
data block in the combined data block group, or may be located
after the compressed data of the other data block in the combined
data block group.
[0100] In addition, the length of the storage block is generally
fixed. When the storage block is not full after data of one
combined data block group is stored in the storage block, the
storage block may store data of another combined data block group,
but data of a same combined data block group needs to be stored in
a same storage block, to facilitate subsequent reading.
[0101] To describe a structure of the storage block more clearly,
an embodiment of this application further provides a structure of a
storage block. As shown in FIG. 6, in original structures of
storage blocks, a storage block 0 is used to store data blocks X,
Y, Z, and G, a storage block 1 is used to store data blocks A0, B0,
A1, and D1, and a storage block 2 is used to store data blocks A2,
D0, A0, and B1. The data blocks X, Y, Z, and G are data blocks on
which deduplication or similar compression is not performed, and no
change may be made. Because both the storage block 1 and the
storage block 2 have A0, deduplication may be performed, to delete
one A0. Because both A1 and A2 are similar to A0, similar
compression may be performed, to obtain differential data dA1
between A1 and A0 and differential data dA2 between A2 and A0. dA1,
dA2, and A0 may be placed in one storage block and stored in the
storage block 1. Because B0 is similar to B1 and B0 is a reference
block, similar compression may be performed, to obtain differential
data dB1 between B1 and B0. Because D1 is similar to D0 and D0 is a
reference block, similar compression may be performed, to obtain
differential data dD1 between D1 and D0. B0 and dB1 may be stored
in a same storage block, D0 and dD1 may be stored in a same storage
block, and B0, dB1, D0 and dD1 are stored in the storage block 2.
In other words, the storage blocks are classified into two types.
One type of storage block is used to store a data block on which
deduplication and/or similar compression are not performed, and the
other type of storage block is used to store a data block on which
deduplication and/or similar compression are performed.
[0102] In addition, in the foregoing step 304, compressed data of a
same combined data block group may be stored in a same compressed
block, and the processing may be as follows:
[0103] if a compressed block to which the combined data block group
belongs has a remaining capacity, compress differential data
between a data block and a reference block, and store the
compressed data in the compressed block; or if a compressed block
to which the combined data block group belongs has no remaining
capacity, create a new compressed block, re-determine a data block
group to which a data block belongs, select a reference block from
the re-determined data block group, compress differential data of
the data block and the re-selected reference block, and store the
compressed data into the newly created compressed block.
[0104] Each compressed block is used to store compressed data of
one combined data block group, and a data amount of data that can
be stored in the compressed block is a fixed value which may be 16
KB, 32 KB, or another value.
[0105] During implementation, when similar compression is performed
on a target data block in a combined data block group, if the
compressed block to which the combined data block group belongs has
a remaining capacity, and the remaining capacity is greater than or
equal to a data amount of differential data between the target data
block and the reference block in the combined data block group, the
differential data may be compressed, and then the compressed
differential data is stored in the compressed block.
[0106] If the compressed block to which the combined data block
group belongs has no remaining capacity to store the differential
data of the target data block relative to the reference block, a
new compressed block may be created. If there is another data block
that is in the combined data block group and that is not
compressed, the target data block and the other data block in the
combined data block group may form a new combined data block group,
and then a reference block is determined in the new combined data
block group. A data block that is first added may be determined as
the reference block, or a data block that maximizes a reduction
rate of the new combined data block group may be determined as the
reference block (in this case, the target data block may also be
selected as the reference block). If the target data block is not
the reference block, differential data between the target data
block and the reselected reference block is compressed, the
compressed data is stored in the new compressed block, and the
reference block in the new combined data block group may be stored.
If the target data block is the reference block, conventional
compression and storage may be directly performed, and similar
compression may be performed on another data block with reference
to the target data block. In this way, the target data block can be
compressed.
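The capacity check and the fallback to a new compressed block can be sketched as follows. This is an illustration under several assumptions that are not in the application: differential data is modeled as a byte-wise XOR of equal-length blocks, compression uses zlib, the capacity check is made against the compressed size, the first-added block is chosen as the new reference block, and the names `CompressedBlock`, `xor_delta`, and `store_delta` are hypothetical. The sketch also does not handle the case where the new group itself overflows the new compressed block.

```python
import zlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class CompressedBlock:
    capacity: int                                   # fixed size, e.g. 16 KB
    payloads: List[bytes] = field(default_factory=list)

    def remaining(self) -> int:
        return self.capacity - sum(len(p) for p in self.payloads)


def xor_delta(block: bytes, reference: bytes) -> bytes:
    # Stand-in for differential data; assumes equal-length blocks.
    if len(block) != len(reference):
        raise ValueError("blocks must have the same length")
    return bytes(b ^ r for b, r in zip(block, reference))


def store_delta(target: bytes, reference: bytes, cblock: CompressedBlock,
                pending: List[bytes]) -> CompressedBlock:
    """Store target in cblock if it fits; otherwise regroup into a new block.

    pending holds the blocks of the group that are not yet compressed.
    """
    payload = zlib.compress(xor_delta(target, reference))
    if len(payload) <= cblock.remaining():
        cblock.payloads.append(payload)
        return cblock
    # No room: form a new group from the uncompressed blocks plus the target.
    new_group = pending + [target]
    new_ref = new_group[0]                          # first-added block as reference
    new_block = CompressedBlock(cblock.capacity or 16 * 1024)
    new_block.payloads.append(zlib.compress(new_ref))   # conventional compression
    for blk in new_group[1:]:
        new_block.payloads.append(zlib.compress(xor_delta(blk, new_ref)))
    return new_block
```

When `pending` is empty, the target data block itself becomes the reference block of the new group and is stored with conventional compression only.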
[0107] For example, as shown in FIG. 7, a combined data block group
to which data blocks belong includes five data blocks: a reference
block, a, b, c, and d. The first data block is the reference block.
Lossless compression is performed on data of the reference block,
similar compression is performed on the other data blocks relative
to the reference block, and the compressed data is stored in
one compressed block. The compressed data of the other data blocks
is sequentially da, db, dc, and dd.
[0108] It should be noted that the capacity of the compressed block
is generally kept small to facilitate reading. If the compressed
block can store a large amount of data, then to read data at the end
of the compressed block, reading needs to proceed from the
reference block to the end. A large amount of data is therefore
read at a time, and more resources are wasted.
[0109] According to the embodiments of this application, a new
storage system may directly combine deduplication and similar
compression, to obtain a system with a new compression technology.
For a system that is already online and has no deduplication or
similar compression, an independent processing mechanism may be
embedded into the system.
[0110] In this embodiment of this application, when a data block is
stored, whether deduplication can be performed on a target data
block is determined; when deduplication cannot be performed on the
target data block, a similar fingerprint of the target data block
is obtained; a combined data block group to which the target data
block belongs is determined based on the similar fingerprint; and
similar compression is performed on the target data block based on
a reference block in the combined data block group. In this way,
similar compression and deduplication are combined. When
deduplication cannot be performed, similar compression can be used
to further compress some data, to improve a reduction rate.
[0111] Based on the foregoing processing of compressing data, an
embodiment of this application further correspondingly provides a
process of reading compressed data. That compressed data in a
combined data block group is stored in a same storage block is used
as an example. Reading steps are shown in FIG. 8.
[0112] Step 801: Receive a read request for a to-be-read data
block.
[0113] During implementation, after a data block is stored in a
storage system, if the data block needs to be read subsequently, a
read request may be sent to the storage system, and an identifier
of the to-be-read data block is carried in the read request.
[0114] Step 802: Obtain metadata information of the to-be-read data
block.
[0115] During implementation, the storage system may read the
metadata information of the to-be-read data block from a storage
block (the metadata information is usually stored in a first
storage block) by using the identifier of the to-be-read data
block, and the metadata information may include a storage block in
which a reference block of the to-be-read data block is located, a
location of the reference block in the storage block, and an offset
location of the to-be-read data block relative to the reference
block (the offset location may be an offset data amount, a quantity
of offset data blocks, or the like). For example, the to-be-read
data block is located eight data blocks to the right of the
location of the reference block.
[0116] Step 803: Read, based on the metadata information, the
reference block of the to-be-read data block and differential data
between the to-be-read data block and the reference block from the
storage block to which the reference block of the to-be-read data
block belongs.
[0117] During implementation, after obtaining the metadata
information, the storage system may determine, by using the
metadata information, the location of the reference block in the
storage block and the offset location of the to-be-read data block
relative to the reference block. If the reference block is before
the differential data of the to-be-read data block, reading may
start from the reference block until the differential data between
the to-be-read data block and the reference block is read. If the
reference block is after the differential data of the to-be-read
data block, reading may start from the differential data of the
to-be-read data block until the reference block is read. Data of the
reference block and the
differential data between the to-be-read data block and the
reference block are obtained from read data.
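The contiguous read described in steps 802 and 803 can be sketched as below. The metadata layout is an assumption made for illustration: the application says the metadata carries the location of the reference block and an offset of the to-be-read data block relative to it, but the field names, the use of byte offsets, and the functions `read_span` and `split_read` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BlockMetadata:
    storage_block: int  # storage block that holds the reference block
    ref_pos: int        # byte position of the reference block in that storage block
    ref_len: int        # stored length of the reference block
    diff_offset: int    # signed byte offset of the differential data from ref_pos
    diff_len: int       # stored length of the differential data


def read_span(meta: BlockMetadata) -> Tuple[int, int]:
    """Return (start, length) of the single contiguous range to read."""
    diff_pos = meta.ref_pos + meta.diff_offset
    start = min(meta.ref_pos, diff_pos)
    end = max(meta.ref_pos + meta.ref_len, diff_pos + meta.diff_len)
    return start, end - start


def split_read(meta: BlockMetadata, data: bytes) -> Tuple[bytes, bytes]:
    """Extract (reference data, differential data) from the bytes that were read."""
    start, _ = read_span(meta)
    diff_pos = meta.ref_pos + meta.diff_offset
    ref = data[meta.ref_pos - start: meta.ref_pos - start + meta.ref_len]
    diff = data[diff_pos - start: diff_pos - start + meta.diff_len]
    return ref, diff
```

The range works in both directions: a negative `diff_offset` covers the case where the differential data is stored before the reference block, and any differential data of other data blocks that lies in between is read along with it and then ignored.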
[0118] It should be noted herein that the storage block in which
the reference block is located further includes a header (head) of
the reference block, and the header is used to describe a quantity
of data blocks included in the storage block, a data amount of the
storage block, and the like.
[0119] Step 804: Restore data of the to-be-read data block based on
the reference block and the differential data.
[0120] During implementation, the storage system may superpose the
data of the reference block with the differential data, to obtain
all data of the to-be-read data block.
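As a minimal sketch of the superposition in step 804, the following assumes that the differential data is a byte-wise XOR of equal-length blocks compressed with zlib. The application does not specify the differential encoding, so both choices and the names `make_delta` and `restore` are assumptions for illustration only.

```python
import zlib


def make_delta(target: bytes, reference: bytes) -> bytes:
    """Produce compressed differential data of target relative to reference."""
    if len(target) != len(reference):
        raise ValueError("blocks must have the same length")
    return zlib.compress(bytes(t ^ r for t, r in zip(target, reference)))


def restore(reference: bytes, compressed_delta: bytes) -> bytes:
    """Superpose the reference data with the differential data."""
    delta = zlib.decompress(compressed_delta)
    if len(delta) != len(reference):
        raise ValueError("differential data does not match the reference block")
    return bytes(d ^ r for d, r in zip(delta, reference))
```

With XOR, superposing the reference block with the differential data is the exact inverse of producing it, so the restored data is bit-identical to the original data block.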
[0121] Step 805: Send the data of the to-be-read data block to a
requester.
[0122] In this way, if the metadata information is stored in a
memory, only one disk read is needed, because reading in the storage
block may start from the reference block and continue until the
differential data of the to-be-read data block is read. When
differential data of another data block exists between the
reference block and the differential data of the to-be-read data
block, the differential data of the other data block is also read.
Even so, compared with first reading the data of the reference
block and then separately reading the differential data of the
to-be-read data block, this approach occupies fewer processing
resources. It can be learned that in this application, all the data
of the to-be-read data block can be obtained with a single read.
[0123] If the metadata information is stored in the disk, the
metadata information of the to-be-read data block is first read from
the disk, and then the data of the reference block and the
differential data between the reference block and the to-be-read
data block are read from the storage block in a single read.
Therefore, in this application, all the data of the to-be-read data
block can be obtained with only two reads.
[0124] FIG. 9 is a structural diagram of an apparatus for
compressing data of a storage system according to an embodiment of
this application. The apparatus may be implemented as a part of the
apparatus or the entire apparatus by using software, hardware, or a
combination thereof. The apparatus provided in this embodiment of
this application may implement the procedure in the embodiment of
this application shown in FIG. 3. The apparatus includes a
determining module 910, an obtaining module 920, and a compression
module 930.
[0125] The determining module 910 is configured to determine
whether deduplication can be performed on a target data block, and
may specifically be configured to perform the step 301 and implicit
steps included therein.
[0126] The obtaining module 920 is configured to, when
deduplication cannot be performed on the target data block, obtain
a similar fingerprint of the target data block, and may
specifically be configured to perform the step 302 and implicit
steps included therein.
[0127] The determining module 910 is further configured to
determine, based on the similar fingerprint, a combined data block
group to which the target data block belongs, and may specifically
be configured to perform the step 303 and the implicit steps
included therein.
[0128] The compression module 930 is configured to perform similar
compression on the target data block based on a reference block in the
combined data block group, and may specifically be configured to
perform the step 304 and implicit steps included therein.
[0129] In an optional implementation, the determining module 910 is
configured to:
[0130] generate a deduplication fingerprint of the target data
block; and
[0131] query whether the storage system has a fingerprint that is
the same as the deduplication fingerprint, to determine whether
deduplication can be performed on the target data block.
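The fingerprint query can be illustrated with a short sketch. SHA-256 is used here as the deduplication fingerprint purely as an assumption; the application does not name a hash function, and the class `DedupIndex` and its in-memory dictionary are hypothetical.

```python
import hashlib
from typing import Dict, Optional


class DedupIndex:
    """In-memory map from deduplication fingerprint to stored location."""

    def __init__(self) -> None:
        self._index: Dict[bytes, str] = {}

    def lookup_or_add(self, block: bytes, location: str) -> Optional[str]:
        """Return the location of an identical stored block, if any.

        When no identical block exists, record this block's fingerprint
        and return None, meaning deduplication cannot be performed.
        """
        fingerprint = hashlib.sha256(block).digest()
        existing = self._index.get(fingerprint)
        if existing is not None:
            return existing
        self._index[fingerprint] = location
        return None
```

A return value of `None` corresponds to the case in which the target data block goes on to similar fingerprint extraction and similar compression.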
[0132] In an optional implementation, the determining module 910 is
configured to:
[0133] determine a load of the storage system to determine whether
deduplication can be performed on the target data block.
[0134] In an optional implementation, the compression module 930 is
further configured to:
[0135] consecutively store, in a same storage block, compressed
data obtained after similar compression is performed on the target
data block, and compressed data of another data block in the
combined data block group.
[0136] In an optional implementation, if there are a plurality of
data blocks other than the reference block in the combined data
block group, compressed data of m data blocks is before data of the
reference block, and compressed data of n data blocks is after the
reference block, where a difference between m and n is equal to any
one of 0, 1, or -1, and both m and n are greater than or equal to
1.
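A layout that satisfies this constraint can be sketched by splitting the compressed data of the non-reference data blocks into two halves around the reference block. The function name `layout_group` and the use of string labels in place of stored data are hypothetical.

```python
from typing import List


def layout_group(reference: str, deltas: List[str]) -> List[str]:
    """Order a group so that m items precede and n items follow the reference.

    With k = len(deltas) >= 2, m = k // 2 and n = k - m, so both are at
    least 1 and n - m is 0 or 1.
    """
    if len(deltas) < 2:
        raise ValueError("this layout applies to two or more non-reference blocks")
    m = len(deltas) // 2
    return deltas[:m] + [reference] + deltas[m:]
```

Placing the reference block in the middle bounds how much unrelated differential data is read when any one data block of the group is restored.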
[0137] In an optional implementation, the determining module 910 is
further configured to:
[0138] determine, based on a similar fingerprint quantity, a data
block group corresponding to the target data block, where the
similar fingerprint quantity is a quantity of same similar
fingerprints in any two data blocks in one data block group; and
form, in the data block group corresponding to the target data
block, a first quantity of data blocks that have a same target
fingerprint as the target data block into the combined data block
group to which the target data block belongs, where a data amount
of differential data between the target data block and a data block
that has the target fingerprint is less than a data amount of
differential data between the target data block and a data block
that does not have the target fingerprint.
[0139] In an optional implementation, the determining module 910 is
further configured to:
[0140] determine, based on a similar fingerprint quantity, a data
block group corresponding to the target data block, where the
similar fingerprint quantity is a quantity of same similar
fingerprints in any two data blocks in one data block group;
determine a quantity of same similar fingerprints in the target
data block and in a data block in each data block group; and form,
in the data block group corresponding to the target data block, a
first quantity of data blocks that have a maximum quantity of same
similar fingerprints as the target data block into the combined
data block group to which the target data block belongs.
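The selection in this implementation can be sketched as a ranking by the number of similar fingerprints shared with the target data block. The representation of similar fingerprints as sets of integers, the tie-breaking by insertion order, and the name `select_combined_group` are assumptions made for illustration.

```python
from typing import Dict, List, Set


def select_combined_group(target_fps: Set[int],
                          candidates: Dict[str, Set[int]],
                          first_quantity: int) -> List[str]:
    """Pick the first_quantity candidates sharing the most similar fingerprints.

    candidates maps each data block in the data block group to its set of
    similar fingerprints. Ties keep the dictionary's insertion order,
    because Python's sort is stable.
    """
    ranked = sorted(candidates,
                    key=lambda name: len(target_fps & candidates[name]),
                    reverse=True)
    return ranked[:first_quantity]
```

The selected data blocks, together with the target data block, would then form the combined data block group from which a reference block is chosen.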
[0141] In this embodiment of this application, when a data block is
stored, whether deduplication can be performed on a target data
block is determined; when deduplication cannot be performed on the
target data block, a similar fingerprint of the target data block
is obtained; a combined data block group to which the target data
block belongs is determined based on the similar fingerprint; and
similar compression is performed on the target data block based on
a reference block in the combined data block group. In this way,
similar compression and deduplication are combined. When
deduplication cannot be performed, similar compression can be used
to further compress some data, to improve a reduction rate.
[0142] It should be noted that when the apparatus for compressing
data of a storage system, provided in the foregoing embodiment,
processes data, division of the foregoing functional modules is
used only as an example for description. In actual application, the
foregoing functions may be allocated to different functional
modules and implemented according to a requirement, in other words,
an internal structure of the apparatus is divided into different
functional modules for implementing all or some of the functions
described above. In addition, the apparatus for compressing data of
a storage system, provided in the foregoing embodiment, and the
embodiment of the method for compressing data of a storage system
belong to a same concept. For details about a specific
implementation process of the apparatus, refer to the method
embodiment. Details are not described herein again.
[0143] In an optional implementation, an embodiment of this
application further provides a computer-readable storage medium.
The computer-readable storage medium stores an instruction, and
when the instruction is run on a storage system,
the storage system is enabled to perform the foregoing method for
compressing data of a storage system.
[0144] In an optional implementation, an embodiment of this
application further provides a computer program product including
an instruction. When the computer program product runs on a storage
system, the storage system is enabled to perform the foregoing
method for compressing data of a storage system.
[0145] All or some of the foregoing embodiments may be implemented
by using software, hardware, firmware, or any combination thereof.
When the software is used for implementation, all or some of the
embodiments may be implemented in a form of a computer program
product. The computer program product includes one or more computer
instructions. When the computer program instructions are loaded and
executed on a server or a terminal, all or some of the procedures
or functions according to the embodiments of this application
are generated. The computer instructions may be stored in a
computer-readable storage medium or may be transmitted from a
computer-readable storage medium to another computer-readable
storage medium. For example, the computer instructions may be
transmitted from a website, computer, server, or data center to
another website, computer, server, or data center in a wired (for
example, a coaxial cable, an optical fiber, or a digital
subscriber line) or wireless (for example, infrared, radio, or
microwave) manner. The computer-readable storage medium may be any
usable medium accessible by a server or a terminal, or a data
storage device, such as a server or a data center, integrating one
or more usable media. The usable medium may be a magnetic medium
(for example, a floppy disk, a hard disk, or a magnetic tape), an
optical medium (for example, a digital video disc (DVD)), or a
semiconductor medium (for example, a solid-state drive).
[0146] The foregoing descriptions are merely example embodiments of
this application, but are not intended to limit this application.
Any modification, equivalent replacement, or improvement made
without departing from the spirit and principle of this application
should fall within the protection scope of this application.
* * * * *