U.S. patent application number 14/950456 was published by the patent office on 2016-05-26 for content-based replication of data between storage units.
The applicant listed for this patent is Nimble Storage, Inc. The invention is credited to Tomasz Barszczak, Nimesh Bhagat, and Gurunatha Karaje.
United States Patent Application 20160150012
Kind Code: A1
Inventors: Barszczak, Tomasz; et al.
Publication Date: May 26, 2016
Application Number: 14/950456
Family ID: 56010446
CONTENT-BASED REPLICATION OF DATA BETWEEN STORAGE UNITS
Abstract
Methods, systems, and computer programs are presented for
replicating data across storage systems. One method includes an
operation for transferring a snapshot of a volume from an upstream
array to a downstream array. The method further includes an
operation for comparing an upstream snapshot checksum of the
snapshot in the upstream array with a downstream snapshot checksum
of the snapshot in the downstream array. When the upstream snapshot
checksum is different from the downstream snapshot checksum, a
plurality of chunks in the snapshot are defined. Further, for each
chunk in the snapshot, a comparison is made of an upstream chunk
checksum calculated by the upstream array with a downstream chunk
checksum calculated by the downstream array. When the upstream
chunk checksum is different from the downstream chunk checksum then
the data of the chunk is sent from the upstream array to the
downstream array.
Inventors: Barszczak, Tomasz (San Jose, CA); Karaje, Gurunatha (San Jose, CA); Bhagat, Nimesh (San Jose, CA)
Applicant: Nimble Storage, Inc., San Jose, CA, US
Family ID: 56010446
Appl. No.: 14/950456
Filed: November 24, 2015
Related U.S. Patent Documents

Application Number | Filing Date  | Patent Number
62/084,395         | Nov 25, 2014 |
62/084,403         | Nov 25, 2014 |
Current U.S. Class: 709/219
Current CPC Class: G06F 3/067 20130101; G06F 16/2237 20190101; G06F 3/0689 20130101; H04L 67/1095 20130101; G06F 3/0607 20130101; G06F 3/065 20130101; G06F 16/27 20190101; G06F 3/0619 20130101; H04L 67/1097 20130101; G06F 16/1844 20190101; G06F 16/2365 20190101
International Class: H04L 29/08 20060101 H04L029/08; G06F 3/06 20060101 G06F003/06
Claims
1. A method for replicating data across storage systems, the method
comprising: transferring a snapshot of a volume from an upstream
array to a downstream array, the volume being a single accessible
storage area within the upstream array; comparing an upstream
snapshot checksum of the snapshot in the upstream array with a
downstream snapshot checksum of the snapshot in the downstream
array; when the upstream snapshot checksum is different from the
downstream snapshot checksum, defining a plurality of chunks in the
snapshot; and for each chunk in the snapshot, comparing an upstream
chunk checksum calculated by the upstream array with a downstream
chunk checksum calculated by the downstream array; and sending,
from the upstream array to the downstream array, data of the chunk
when the upstream chunk checksum is different from the downstream
chunk checksum.
2. The method as recited in claim 1, further including: exchanging,
before defining the plurality of chunks, transfer parameters
between the upstream array and the downstream array, the transfer
parameters including one or more of checksum type for calculating
the upstream chunk checksum and the downstream chunk checksum, or a
checksum size, or a chunk size, or a cursor indicating at what
chunk to start the comparing of the upstream chunk checksum and the
downstream chunk checksum.
3. The method as recited in claim 2, further including: starting
comparing the upstream chunk checksum with the downstream chunk
checksum at the chunk indicated by the cursor.
4. The method as recited in claim 1, wherein comparing the upstream
chunk checksum with the downstream chunk checksum further includes:
calculating, by the upstream array, the upstream chunk checksum;
sending, from the upstream array to the downstream array, a request
to get the downstream chunk checksum; calculating, by the
downstream array, the downstream chunk checksum; sending the
downstream chunk checksum to the upstream array; and comparing, by
the upstream array, the upstream chunk checksum with the downstream
chunk checksum.
5. The method as recited in claim 1, wherein transferring the
snapshot of the volume includes transferring all data from the
snapshot from the upstream array to the downstream array, wherein
the upstream array is a first storage system that includes a first
processor, a first volatile memory, and a first permanent storage,
wherein the downstream array is a second storage system that
includes a second processor, a second volatile memory, and a second
permanent storage, wherein the upstream array stores the snapshot
of the volume to be replicated to the downstream array, wherein a
volume holds data for the single accessible storage area, wherein
data of the volume is accessible by a host in communication with
the storage system.
6. The method as recited in claim 1, further including: sending,
from the upstream array to the downstream array, a confirmation
message indicating that the snapshot has been validated.
7. The method as recited in claim 1, wherein the snapshot of the
volume includes one or more blocks, wherein data from the snapshot
is accessed by a host in units of a size of the block, the method
further including: storing, in the upstream array, checksums of
blocks of the snapshot; and calculating the upstream chunk checksum
based on the checksums of the blocks in the chunk, wherein the
chunk is not uncompressed to calculate the upstream chunk
checksum.
8. The method as recited in claim 7, wherein a chunk includes one
hundred or more blocks, wherein data from the chunk is not directly
addressable by the host.
9. The method as recited in claim 7, wherein a chunk has a size in
a range from 1 Megabyte to 16 Megabytes, wherein a block has a
block size in a range from 1 Kilobyte to 256 Kilobytes.
10. A method for replicating data across storage systems, the
method comprising: transferring a snapshot of a volume from an
upstream array to a downstream array, the volume being a single
accessible storage area within the upstream array; comparing an
upstream snapshot checksum of the snapshot in the upstream array
with a downstream snapshot checksum of the snapshot in the
downstream array; when the upstream snapshot checksum is different
from the downstream snapshot checksum, defining a plurality of
chunks in the snapshot; and for each chunk in the snapshot,
comparing an upstream chunk checksum calculated by the upstream
array with a downstream chunk checksum calculated by the downstream
array; when the upstream chunk checksum is different from the
downstream chunk checksum, defining a plurality of blocks in the
chunk; and for each block in the chunk, comparing an upstream block
checksum calculated by the upstream array with a downstream block
checksum calculated by the downstream array; and sending, from the
upstream array to the downstream array, data of the block when the
upstream block checksum is different from the downstream block
checksum.
11. The method as recited in claim 10, further including: exchanging, before defining
the plurality of chunks, transfer parameters between the upstream
array and the downstream array, the transfer parameters including
one or more of checksum type for calculating the upstream chunk
checksum and the downstream chunk checksum, or a checksum size, or
a chunk size, or a cursor indicating at what chunk to start the
comparing of the upstream chunk checksum and the downstream chunk
checksum.
12. The method as recited in claim 11, further including: starting
comparing the upstream chunk checksum with the downstream chunk
checksum at the chunk indicated by the cursor.
13. The method as recited in claim 10, wherein data from the
snapshot is accessed by a host in units of a size of the block, the
method further including: storing, in the upstream array, checksums
of blocks of the snapshot; and calculating the upstream chunk
checksum based on the checksums of the blocks in the chunk, wherein
the chunk is not uncompressed to calculate the upstream chunk
checksum.
14. The method as recited in claim 10, wherein operations of the
method are performed by a computer program when executed by one or
more processors, the computer program being embedded in a
non-transitory computer-readable storage medium.
15. A non-transitory computer-readable storage medium storing a
computer program for replicating data across storage systems, the
computer-readable storage medium comprising: program instructions
for transferring a snapshot of a volume from an upstream array to a
downstream array, the volume being a single accessible storage area
within the upstream array; program instructions for comparing an
upstream snapshot checksum of the snapshot in the upstream array
with a downstream snapshot checksum of the snapshot in the
downstream array; program instructions for, when the upstream
snapshot checksum is different from the downstream snapshot
checksum, defining a plurality of chunks in the snapshot; and for
each chunk in the snapshot, program instructions for comparing an
upstream chunk checksum calculated by the upstream array with a
downstream chunk checksum calculated by the downstream array; and
program instructions for sending, from the upstream array to the
downstream array, data of the chunk when the upstream chunk
checksum is different from the downstream chunk checksum.
16. The storage medium as recited in claim 15, further including:
program instructions for exchanging, before defining the plurality
of chunks, transfer parameters between the upstream array and the
downstream array, the transfer parameters including one or more of
checksum type for calculating the upstream chunk checksum and the
downstream chunk checksum, or a checksum size, or a chunk size, or
a cursor indicating at what chunk to start the comparing of the
upstream chunk checksum and the downstream chunk checksum.
17. The storage medium as recited in claim 16, further including:
program instructions for starting comparing the upstream chunk
checksum with the downstream chunk checksum at the chunk indicated
by the cursor.
18. The storage medium as recited in claim 15, wherein comparing
the upstream chunk checksum with the downstream chunk checksum
further includes: program instructions for calculating, by the
upstream array, the upstream chunk checksum; program instructions
for sending, from the upstream array to the downstream array, a
request to get the downstream chunk checksum; program instructions
for calculating, by the downstream array, the downstream chunk
checksum; program instructions for sending the downstream chunk
checksum to the upstream array; and program instructions for
comparing, by the upstream array, the upstream chunk checksum with
the downstream chunk checksum.
19. The storage medium as recited in claim 15, wherein transferring
the snapshot of the volume includes transferring all data from the
snapshot from the upstream array to the downstream array.
20. The storage medium as recited in claim 15, further including:
program instructions for sending, from the upstream array to the
downstream array, a confirmation message indicating that the
snapshot has been validated.
Description
CLAIM OF PRIORITY
[0001] This application claims priority from U.S. Provisional
Patent Application No. 62/084,395, filed Nov. 25, 2014, entitled
"Content-Based Replication of Data Between Storage Units," and from
U.S. Provisional Patent Application No. 62/084,403, filed Nov. 25,
2014, entitled "Content-Based Replication of Data in Scale Out
System." These provisional applications are herein incorporated by
reference.
CROSS REFERENCE TO RELATED APPLICATIONS
[0002] This application is related by subject matter to U.S. patent
application Ser. No. ______ (Attorney Docket No. NIMSP112) filed on
the same day as the instant application and entitled "Content-Based
Replication of Data in Scale Out System", which is incorporated
herein by reference.
BACKGROUND
[0003] 1. Field of the Invention
[0004] The present embodiments relate to methods, systems, and
programs for replicating data in a networked storage system.
[0005] 2. Description of the Related Art
[0006] Network storage, also referred to as network storage systems
or storage systems, is computer data storage connected to a
computer network providing data access to heterogeneous clients.
Typically, network storage systems process a large number of
Input/Output (IO) requests, and high availability, speed, and
reliability are desirable characteristics of network storage.
[0007] Sometimes data is copied from one system to another, such as
when an organization upgrades to a new data storage device, when
backing up data to a different location, or when backing up data
for the purpose of disaster recovery. The data needs to be migrated
or replicated to the new device from the old device.
[0008] However, when transferring large volumes of data, there
could be some glitches during the transfer/replication process, and
some of the data may be improperly transferred. Retransferring all the data may be very expensive in terms of resources, because it may take a large amount of processor and network resources and may impact the ongoing operation of the data service. Also, when data
is being replicated to a different storage system, there could be a
previous snapshot of the data in both systems. If a change is
detected between snapshots being replicated, it may be very
expensive to transmit over the network large amounts of data if
only a small portion of the data has changed. Further yet, if a
common base snapshot is lost, resending all the data may be very
expensive.
[0009] What is needed is a network storage device, software, and
systems that provide verification of the correct transfer of large
amounts of data from one system to another, as well as ways to
correct errors found during the replication process.
[0010] It is in this context that embodiments arise.
SUMMARY
[0011] The present embodiments relate to fixing problems when data
is replicated from a first system to a second system. It should be
appreciated that the present embodiments can be implemented in
numerous ways, such as a method, an apparatus, a system, a device,
or a computer program on a computer readable medium. Several
embodiments are described below.
[0012] One aspect includes a method for replicating data across
storage systems. The method includes an operation for transferring
a snapshot of a volume from an upstream array to a downstream
array, the volume being a single accessible storage area within the
upstream array. The method further includes comparing an upstream
snapshot checksum of the snapshot in the upstream array with a
downstream snapshot checksum of the snapshot in the downstream
array. When the upstream snapshot checksum is different from the
downstream snapshot checksum, a plurality of chunks is defined in
the snapshot. For each chunk in the snapshot, an upstream chunk
checksum calculated by the upstream array is compared with a
downstream chunk checksum calculated by the downstream array.
Further, the method includes an operation for sending, from the
upstream array to the downstream array, data of the chunk when the
upstream chunk checksum is different from the downstream chunk
checksum.
[0013] One general aspect includes a method for replicating data
across storage systems. The method includes an operation for
transferring a snapshot of a volume from an upstream array to a
downstream array, the volume being a single accessible storage area
within the upstream array. The method also includes an operation
for comparing an upstream snapshot checksum of the snapshot in the
upstream array with a downstream snapshot checksum of the snapshot
in the downstream array. When the upstream snapshot checksum is
different from the downstream snapshot checksum, a plurality of
chunks is defined in the snapshot. For each chunk in the snapshot,
an upstream chunk checksum calculated by the upstream array is
compared with a downstream chunk checksum calculated by the
downstream array. When the upstream chunk checksum is different
from the downstream chunk checksum, a plurality of blocks is
defined in the chunk. Further, for each block in the chunk an
upstream block checksum calculated by the upstream array is
compared with a downstream block checksum calculated by the
downstream array. When the upstream block checksum is different
from the downstream block checksum data of the block is sent from
the upstream array to the downstream array.
[0014] One aspect includes a non-transitory computer-readable
storage medium storing a computer program for replicating data
across storage systems. The computer-readable storage medium
includes program instructions for transferring a snapshot of a
volume from an upstream array to a downstream array, the volume
being a single accessible storage area within the upstream array,
and program instructions for comparing an upstream snapshot
checksum of the snapshot in the upstream array with a downstream
snapshot checksum of the snapshot in the downstream array. When the
upstream snapshot checksum is different from the downstream
snapshot checksum, a plurality of chunks is defined in the
snapshot. For each chunk in the snapshot, an upstream chunk
checksum calculated by the upstream array is compared with a
downstream chunk checksum calculated by the downstream array. The
storage medium further includes program instructions for sending,
from the upstream array to the downstream array, data of the chunk
when the upstream chunk checksum is different from the downstream
chunk checksum.
[0015] Other aspects will become apparent from the following
detailed description, taken in conjunction with the accompanying
drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] The embodiments may best be understood by reference to the
following description taken in conjunction with the accompanying
drawings.
[0017] FIG. 1 illustrates the replication of the snapshots from one
system to another, according to one embodiment.
[0018] FIG. 2 illustrates the partition of a volume into a
plurality of chunks, where each chunk may include a plurality of
blocks, according to one embodiment.
[0019] FIG. 3 illustrates the content-based replication (CBR)
method for validating data and correcting erroneous data between
two volumes, according to one embodiment.
[0020] FIG. 4 illustrates the CBR process which includes checking
block checksums, according to one embodiment.
[0021] FIG. 5 illustrates the read and write paths within the
storage array, according to one embodiment.
[0022] FIG. 6 illustrates an example of a configuration where
multiple arrays can be made part of a group (i.e., a cluster), in
accordance with one embodiment.
[0023] FIG. 7 illustrates the architecture of a storage array,
according to one embodiment.
[0024] FIG. 8 is a flow chart of a method for replicating data
across storage systems, according to one embodiment.
DETAILED DESCRIPTION
[0025] The following embodiments describe methods, devices,
systems, and computer programs for replicating data across storage
systems. It will be apparent that the present embodiments may be
practiced without some or all of these specific details. In other
instances, well-known process operations have not been described in
detail in order not to unnecessarily obscure the present
embodiments.
[0026] In some implementations, a Snapshot Delta Replication (SDR)
method is used to replicate snapshots of data volumes in a network
storage device. However, something could have gone wrong during the
replication, and a check is made to determine if the replicated
snapshot is correct. If the replication is not completely correct, the data would have to be resent, which may be very costly in terms of resources. In order to avoid having to replicate all the data again, a
Content-Based Replication (CBR) method is used to minimize the
amount of data needed to fix the replicated snapshot.
[0027] With the CBR method, volume checksums are made at the
upstream system (system being replicated) and the downstream system
(system where the replicated data is received). If the checksums do
not match, the volume is divided into large pieces of data,
referred to herein as chunks, (e.g., 16 MB although other values
are also possible). Then checksums are performed for each chunk, at
the upstream system and the downstream system. If the corresponding pair of checksums for the same chunk does not match between the upstream and the downstream systems, then the upstream system sends the chunk of data to the downstream system.
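For illustration only, the chunk-level comparison just described can be sketched as follows. This is a simplified, in-memory sketch rather than the claimed implementation; the chunk size, the choice of SHA-1, and helper names such as split_into_chunks are assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 16 * 1024 * 1024  # e.g., 16 MB chunks, per the example above

def checksum(data: bytes) -> str:
    # A cryptographically strong checksum (SHA-1 is one option named in the text).
    return hashlib.sha1(data).hexdigest()

def split_into_chunks(snapshot: bytes, chunk_size: int = CHUNK_SIZE):
    # Define a plurality of chunks over the snapshot data.
    return [snapshot[i:i + chunk_size] for i in range(0, len(snapshot), chunk_size)]

def content_based_replication(upstream_snapshot: bytes, downstream_snapshot: bytearray):
    # 1. Compare whole-snapshot checksums; if they match, nothing needs fixing.
    if checksum(upstream_snapshot) == checksum(bytes(downstream_snapshot)):
        return
    # 2. Otherwise, compare per-chunk checksums and resend only mismatched chunks.
    up_chunks = split_into_chunks(upstream_snapshot)
    down_chunks = split_into_chunks(bytes(downstream_snapshot))
    for i, up_chunk in enumerate(up_chunks):
        down_chunk = down_chunks[i] if i < len(down_chunks) else b""
        if checksum(up_chunk) != checksum(down_chunk):
            # The upstream sends the chunk; the downstream overwrites its copy.
            start = i * CHUNK_SIZE
            downstream_snapshot[start:start + len(up_chunk)] = up_chunk
```

In a real deployment the two arrays would exchange checksums over the network rather than share memory, but the control flow is the same.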
[0028] In one embodiment, another level of iteration is used to
further divide the chunks into smaller pieces and perform checksums
on the smaller pieces. For example, checksums of the blocks within
a chunk can be compared, and then the blocks that have mismatched
checksums are transmitted over the network.
[0029] In another embodiment, an automated program determines when
CBR is to be performed, based on system parameters (design by the
system designers), or user configuration (e.g., once a week), or
based on heuristics that determine when the risk of an incorrect
replication is high (e.g., after installing a new release). For
example, in one embodiment, CBR could be more efficient for replication seeding than SDR when a common base snapshot is not found between the upstream and the downstream volumes; however, the downstream volume may already have blocks of the volume due to an earlier SDR.
[0030] FIG. 1 illustrates the replication of the snapshots from one
system to another, according to one embodiment. In one embodiment,
a volume is a single accessible storage area, reserved for one
application or one host, or for a group of users of an organization
or to segment/separate types of data for security or accessibility. In
one embodiment, the data of the volume is divided into blocks, and
the data from the volume is accessed by identifying a block (e.g.,
identifying an offset associated with the block being retrieved).
That is, data from the volume is accessed by the host in units of a
size of the block, and the block is the smallest amount of data
that can be requested from the volume. The networked storage device
where the data is stored is also referred to herein as a storage
array or a storage system.
[0031] In one embodiment, a first system creates snapshots of a
volume over time (e.g., S.sub.1, S.sub.2, S.sub.3, etc.). The
volume replicates one or more of the snapshots to a second volume,
for example to provide backup of the data in a different location
or in a different storage array.
[0032] The storage array that holds the source data to be copied is
referred to as the upstream storage array, or the upstream system,
or the base storage array, and the storage array that receives a
copy of the data is referred to as the downstream storage array or
the downstream system. When SDR is in the process of replicating a
snapshot to create a replicated snapshot in another storage array,
to compute what blocks need to be transferred, SDR uses a base
snapshot that is already present on the downstream as well as on
the upstream. This common snapshot is also referred to as the
common ancestor snapshot. After SDR is complete, the replicated
snapshot is present on both the upstream and the downstream storage
arrays.
[0033] In one embodiment, replication means copying all the data
from the upstream volume to the downstream volume. In some
embodiments, if the common ancestor snapshot of the volume has
already been replicated, the replication of a later snapshot
includes copying only the data that has changed, which is also
referred to herein as the delta data or the difference between the
two snapshots. It is noted that not all the snapshots in the
upstream volume have to be replicated to the downstream volume.
[0034] For example, in the exemplary embodiment of FIG. 1, the
upstream volume has over time generated five snapshots, S.sub.1,
S.sub.2, S.sub.3, S.sub.4, and S.sub.5. The replication policy
specifies that every other snapshot in the upstream volume is to be
copied to the downstream volume. Therefore, the downstream volume
has replicated snapshots S.sub.1', S.sub.3', and S.sub.5'. As used
herein, the snapshots with the apostrophe mark refer to the data in
the downstream system.
[0035] Replicating snapshot S.sub.1 requires copying all the data
from S.sub.1 to S.sub.1' because there are no previous snapshots
that have been replicated. However, replicating snapshot S.sub.3
requires only copying the difference between S.sub.3 and S.sub.1
[S.sub.3-S.sub.1]. In one embodiment, this method for replicating
snapshots from the upstream to the downstream volume by copying the
difference between two snapshots in time is referred to herein as
snapshot delta replication (SDR).
[0036] Sometimes, SDR is an efficient process, but other times SDR
is very inefficient. For example, in one scenario, two blocks,
B.sub.1 and B.sub.2 are written to the volume after snapshot
S.sub.1 is taken but before snapshot S.sub.3 is taken. If SDR is
performed for snapshot S.sub.3 using snapshot S.sub.1 as the common
snapshot, only blocks B.sub.1 and B.sub.2 will be replicated (i.e.,
transmitted to the downstream system) and SDR is efficient in this
case. However, if for some reason, snapshot S.sub.1 is not
available in the downstream system, then SDR would be inefficient
as the complete volume would have to be transmitted to the
downstream system.
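The delta computation behind SDR can be illustrated with the following sketch, under the simplifying assumption that a snapshot is modeled as a map from block number to block data; the function name snapshot_delta is hypothetical and not taken from the application.

```python
def snapshot_delta(base_snapshot: dict, new_snapshot: dict) -> dict:
    # Blocks that are new or have changed since the common ancestor snapshot.
    return {
        block_no: data
        for block_no, data in new_snapshot.items()
        if base_snapshot.get(block_no) != data
    }

# Example mirroring the paragraph above: only blocks B1 and B2 changed between
# S1 and S3, so only those blocks are transmitted to the downstream system.
s1 = {0: b"aaaa", 1: b"bbbb", 2: b"cccc"}
s3 = {0: b"aaaa", 1: b"BBBB", 2: b"cccc", 3: b"dddd"}
delta = snapshot_delta(s1, s3)   # {1: b"BBBB", 3: b"dddd"}
```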
[0037] FIG. 2 illustrates the partition of a volume into a
plurality of chunks, where each chunk includes a plurality of
blocks, according to one embodiment. Sometimes, a downstream
snapshot is not exactly the same as the upstream snapshot, e.g.,
because of a failure during the communication of the data from the
upstream to the downstream volume.
[0038] In one embodiment, the detection that the snapshots are not exactly equal may be performed by computing checksums of the upstream and downstream volumes. If the checksums do not match, then there is a problem with the copied data. An obvious and expensive solution is to recopy all the data until the checksums match. However, re-copying large amounts of data may cause stress in the data storage system and impact performance, which means that the transfer of large amounts of data should be avoided during normal operating hours. Therefore, resending all the data is not a desirable solution.
[0039] In one embodiment, the volume is logically divided into
large groups of data, referred to herein as chunks. In one
embodiment, the size of a block is 4 KBytes, but other values are
also possible, such as in the range from 256 bytes to 50 Kbytes or
more.
[0040] A chunk (e.g., 16 MB) is usually much larger than a block,
so the chunk includes a plurality of blocks. In one embodiment, the
chunk is not addressable for accessing data from the volume and the
chunk is only utilized for correcting the replication of snapshots,
as described in more detail below. Other embodiments may include
other sizes for chunks, such as in the range from 1 megabyte to
100 megabytes, or in the range from 100 megabytes to 1 or several
gigabytes. In one embodiment, the size of the chunk is 100 times
the size of the block, but other multipliers may also be possible,
such as 50 to 5000. Therefore, the size of the chunk may be 50 to
5000 times bigger than the size of the block.
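The chunk-to-block arithmetic implied above can be sketched as follows. The sizes chosen (4 KB blocks, 4096 blocks per 16 MB chunk) are illustrative assumptions within the ranges mentioned, and the chunk is only a logical grouping that is not addressable by the host.

```python
BLOCK_SIZE = 4 * 1024            # e.g., 4 KB blocks
BLOCKS_PER_CHUNK = 4096          # e.g., a 16 MB chunk holds 4096 blocks of 4 KB

def blocks_in_chunk(chunk_index: int) -> range:
    # Block numbers covered by a given chunk.
    first = chunk_index * BLOCKS_PER_CHUNK
    return range(first, first + BLOCKS_PER_CHUNK)

def chunk_of_block(block_number: int) -> int:
    # Chunk that contains a given block.
    return block_number // BLOCKS_PER_CHUNK

# Chunk 2 covers blocks 8192..12287 in this example.
assert chunk_of_block(10000) == 2
assert 10000 in blocks_in_chunk(2)
```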
[0041] FIG. 2 shows a volume that has been divided into chunks
C.sub.1, C.sub.2, C.sub.3, etc. Further, each chunk contains
blocks; for example, chunk C.sub.1 includes blocks B.sub.1, B.sub.2, B.sub.3, etc. The checksums performed can be of any type. In one embodiment, a cryptographically strong checksum is utilized. For example, SHA-1, which requires reading the data and computing the checksum, produces a 20-byte checksum (e.g., about 5 GB of checksums per TB if a checksum is transmitted for every 4K uncompressed block). In another embodiment, a 16-byte checksum is utilized. In another embodiment, the checksum is SHA-2.
[0042] Another possible checksum is a Fletcher checksum. Further,
several types of checksums may be utilized depending on the size of
the data to be checksummed. For example, a Fletcher checksum may be
utilized for snapshots, and an SHA-1 checksum may be utilized for
chunks or blocks. In one embodiment, the checksum is negotiated
between the upstream and the downstream storage arrays during the
CBR initialization period.
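As a hedged illustration of the two checksum families mentioned above, the sketch below computes a SHA-1 digest and one common Fletcher-32 variant. The specific Fletcher variant and the assignment of checksum types to data sizes are assumptions; the application only states that the type is negotiated between the arrays.

```python
import hashlib

def sha1_checksum(data: bytes) -> bytes:
    # Cryptographically strong, 20-byte digest; suitable for chunks or blocks.
    return hashlib.sha1(data).digest()

def fletcher32(data: bytes) -> int:
    # One common Fletcher-32 variant: sums over 16-bit words, modulo 65535.
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole number of 16-bit words
    sum1, sum2 = 0, 0
    for i in range(0, len(data), 2):
        word = data[i] | (data[i + 1] << 8)   # little-endian 16-bit word
        sum1 = (sum1 + word) % 65535
        sum2 = (sum2 + sum1) % 65535
    return (sum2 << 16) | sum1

snapshot_checksum = fletcher32(b"snapshot data")   # cheaper, e.g., for whole snapshots
chunk_checksum = sha1_checksum(b"chunk data")      # stronger, e.g., for chunks or blocks
```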
[0043] Further, the checksums may be performed over compressed or
uncompressed data. In one embodiment, the checksum of uncompressed
data is utilized but this requires decompression which causes
higher resource utilization. In another embodiment, the checksum is
performed over compressed data, however, this option may stop
working when compression of blocks starts differing between
upstream and downstream (e.g., due to background strong
recompression taking place in the downstream system). In yet
another embodiment, uncompressed checksums are stored for certain
data ranges, and a larger checksum is formed by combining the data
from the uncompressed checksums. This way, there is no need to
decompress the blocks to obtain the checksums of the chunks.
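One way to realize the last option is sketched below: store the checksum of each uncompressed block as it is written, and later form a chunk checksum by hashing the concatenation of the stored block checksums, so the compressed blocks never need to be decompressed. The data structures and names are illustrative assumptions, not the application's own.

```python
import hashlib
from typing import Dict, List

# Checksums of uncompressed blocks, recorded when the blocks are written.
block_checksums: Dict[int, bytes] = {}

def record_block_checksum(block_no: int, uncompressed_block: bytes) -> None:
    block_checksums[block_no] = hashlib.sha1(uncompressed_block).digest()

def chunk_checksum(block_numbers: List[int]) -> bytes:
    # Combine the stored per-block checksums; no decompression is required.
    combined = hashlib.sha1()
    for block_no in block_numbers:
        combined.update(block_checksums[block_no])
    return combined.digest()
```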
[0044] FIG. 3 illustrates the content-based replication (CBR)
method for validating data and correcting erroneous data transfers
between two volumes, according to one embodiment. In one
embodiment, a snapshot S.sub.1 is copied from an upstream volume to
a snapshot S.sub.1' in the downstream volume. For example, the
snapshots can be replicated by using the SDR method described
above. In one embodiment, the network storage system may limit the
CBR process to one volume at a time, in order to limit the stress
on the system. In another embodiment, one or more volumes may skip
the CBR process if the volumes have been created after a certain
time (e.g., time when the storage array was upgraded past a known
release with a potential replication problem).
[0045] At start time, the upstream and the downstream arrays may
exchange CBR-related information, such as checksum type, checksum
size, and how much data is covered by each checksum (e.g., size of
the chunk, how many blocks in each chunk).
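The exchanged CBR-related information could be modeled roughly as below. The field names and example values are assumptions for illustration; the actual negotiation message format is not specified in the application.

```python
from dataclasses import dataclass

@dataclass
class CbrTransferParameters:
    checksum_type: str      # e.g., "sha1" or "fletcher"
    checksum_size: int      # size of each checksum, in bytes
    chunk_size: int         # how much data is covered by each chunk checksum
    blocks_per_chunk: int   # how many blocks in each chunk
    cursor: int             # chunk index at which to start (or resume) comparing

# Example of the values the two arrays might agree on at CBR start time.
params = CbrTransferParameters(
    checksum_type="sha1",
    checksum_size=20,
    chunk_size=16 * 1024 * 1024,
    blocks_per_chunk=4096,
    cursor=0,
)
```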
[0046] The validation of the snapshots can be initiated in
different ways. For example, an administrator may request a storage
array to check for the validity of a snapshot in a downstream
volume, or an automated validating process may be initiated by the
storage array. For example, a validating process may be initiated
periodically or may be initiated after the data center updates the software of one or more storage arrays, or as additional hardware (e.g., another storage array) is added to the network data
system.
[0047] In one embodiment, the upstream volume computes the checksum
of S.sub.1, i.e., the checksum of the complete snapshot S.sub.1.
The upstream volume then sends a request to the downstream volume
to provide the checksum of the downstream snapshot S.sub.1'. In
another embodiment, the downstream volume initiates the process for
comparing the checksums. In general, some of the methods described
herein include operations performed by the upstream volume (e.g.,
initiating the validation procedure, comparing checksums, etc.),
but the same principles may be applied when the downstream volume performs these operations (e.g., initiating the validation of the replicated data).
[0048] The downstream volume then calculates S.sub.1' checksum (or
retrieves the checksum from memory if the checksum is already
available) and sends the checksum to the upstream volume. The
upstream volume compares the two checksums of S.sub.1 and S.sub.1', and if
the checksums match that snapshot is assumed to be correct (e.g.,
validated). However, if the checksums do not match, then the
content-based replication CBR process is started.
[0049] A principle of CBR is to calculate the checksums of large
amounts of data (e.g., for each chunk) instead of comparing the
checksums for each of the individual blocks in the volume. In one
embodiment, when a mismatch is found, the system administrator gets
an alert (on the downstream array, or on the upstream array, or on
both). The alert indicates that the replicated snapshot is
compromised (and maybe older snapshots too). After executing CBR,
the system administrator will get another alert that the mismatch
has been fixed in the most recent replicated snapshot.
[0050] The upstream volume sends a request to the downstream volume
to start the CBR process, and sends information related to the
process, such as the checksum type to be performed, the chunk size,
and a cursor used to indicate at what chunk to start the CBR
process. The cursor is useful in case the CBR process is suspended
for any reason, such as a system suffering downtime or a
network-related problem (e.g., network disconnect). This way, when
the upstream and the downstream volume are ready to continue with
the suspended CBR process, the process does not have to be
restarted from the beginning but from the place associated with the
value of the cursor. In one embodiment, the cursor may be kept in
the upstream volume, or in the downstream volume, or in both
places. In one embodiment, the cursor is an identifier for a chunk
in the volume, wherein all the chunks that preceded the identified
chunk are considered to have been already validated.
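A minimal sketch of how the cursor could be used to resume a suspended CBR run follows. The persistence format, file location, and helper names are hypothetical; the application only states that the cursor may be kept on the upstream volume, the downstream volume, or both.

```python
import json

CURSOR_FILE = "cbr_cursor.json"   # illustrative persistent location

def load_cursor(volume_id: str) -> int:
    # All chunks before the stored cursor are considered already validated.
    try:
        with open(CURSOR_FILE) as f:
            return json.load(f).get(volume_id, 0)
    except FileNotFoundError:
        return 0

def save_cursor(volume_id: str, chunk_index: int) -> None:
    try:
        with open(CURSOR_FILE) as f:
            cursors = json.load(f)
    except FileNotFoundError:
        cursors = {}
    cursors[volume_id] = chunk_index
    with open(CURSOR_FILE, "w") as f:
        json.dump(cursors, f)

def validate_chunks(volume_id: str, total_chunks: int, validate_one_chunk) -> None:
    # Resume from the cursor instead of restarting from chunk 0.
    for chunk_index in range(load_cursor(volume_id), total_chunks):
        validate_one_chunk(chunk_index)
        save_cursor(volume_id, chunk_index + 1)
```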
[0051] For each chunk, the upstream and the downstream systems
calculate the respective checksums C.sub.i and C.sub.i'. Then the
downstream array sends C.sub.i' checksum to the upstream array, and
the upstream array compares checksums C.sub.i and C.sub.i'. If the
checksums match, the process continues with the next chunk, until
all the chunks are validated. However, if the checksums C.sub.i and
C.sub.i' do not match, the upstream storage array sends the data
for chunk C.sub.i to the downstream array. When the last chunk has
been validated, the upstream storage array sends a CBR complete
notification to the downstream array.
[0052] In some embodiments, the upstream array and the downstream
array coordinate the validation process by checksumming and
comparing a plurality of chunks simultaneously (e.g., in parallel),
that is, the arrays do not have to wait till a chunk validation is
completed to perform the validation of the next chunk and several
chunk validation processes may be performed in parallel.
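A sketch of validating several chunks in parallel is shown below, using an in-memory stand-in for the two copies of the snapshot. The thread pool, chunk size, and corruption example are assumptions for illustration; the application only states that multiple chunk validations may proceed simultaneously.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

CHUNK_SIZE = 1024  # small illustrative chunk size

# In-memory stand-ins for the upstream and downstream copies of a snapshot.
upstream = bytearray(b"A" * 8 * CHUNK_SIZE)
downstream = bytearray(upstream)
downstream[3 * CHUNK_SIZE] ^= 0xFF          # corrupt one chunk on the downstream side

def chunk(data: bytearray, i: int) -> bytes:
    return bytes(data[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE])

def validate_chunk(i: int) -> None:
    # Compare checksums; only a mismatched chunk is copied ("sent") downstream.
    if hashlib.sha1(chunk(upstream, i)).digest() != hashlib.sha1(chunk(downstream, i)).digest():
        downstream[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE] = chunk(upstream, i)

def validate_all_chunks(total_chunks: int, parallelism: int = 4) -> None:
    # Several chunk validations proceed simultaneously.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        list(pool.map(validate_chunk, range(total_chunks)))

validate_all_chunks(total_chunks=8)
assert downstream == upstream
```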
[0053] It is noted that SDR and CBR may coexist in the same storage
array, or even in the same volume, because at different times and
under different circumstances one method may be preferred over
others.
[0054] In one embodiment, a per-volume state is maintained, in both
the upstream and the downstream array, for managing and tracking
content based replication of each volume. The downstream volume's
state is consulted during the replication protocol phase that
occurs prior to the SDR data transfer phase. If the upstream or the
downstream volume state indicates the need for content based
replication to occur, the upstream and/or the downstream array
coordinate with the storage control system to initiate CBR.
[0055] Once the data transfer phase has completed, the upstream
array sends an indication to the downstream array during the
snapshot replication phase as to whether or not content based
replication was carried out. This allows the downstream array to
update the volume state, which includes clearing a flag that
indicates a content based replication is needed, and updating a
state to indicate the snapshot ID at which content based
replication occurred. Also, the downstream array will issue an
alert if the volume record indicates that errors took place (which
need to be fixed at this point).
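The per-volume state described above could look roughly like the following sketch; the flag and field names are illustrative assumptions based on the description, not the application's own identifiers.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VolumeReplicationState:
    needs_cbr: bool = False                  # set when a mismatch or error is detected
    cbr_snapshot_id: Optional[int] = None    # snapshot ID at which CBR last occurred

def on_snapshot_replication_done(state: VolumeReplicationState,
                                 snapshot_id: int,
                                 cbr_was_carried_out: bool) -> None:
    # The downstream updates its volume state based on the upstream's indication.
    if cbr_was_carried_out:
        state.needs_cbr = False
        state.cbr_snapshot_id = snapshot_id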
[0056] FIG. 4 illustrates the CBR process which includes checking
block checksums, according to one embodiment. As discussed above
with reference to FIG. 3, the purpose of CBR is to perform
checksums in large groups of data, and if the checksums fail, then send only the data that is incorrectly replicated. The volume has
been divided into chunks, as shown in FIG. 2, but the process may
be performed iteratively and further divide each chunk into
sub-chunks which are smaller than the chunks.
[0057] If the checksum for a chunk fails, then checksums for the
sub-chunks are calculated and compared and the data for the
sub-chunks that fail the validation is sent from the upstream array
to the downstream array, instead of having to send the whole chunk.
In one embodiment, the size of the chunk is between 5 times and
1000 times the size of the sub-chunk, but other multipliers are also possible.
[0058] FIG. 4 is an example of a two-level CBR process, where the
first level of validation is performed for the chunks, as described
above with reference to FIG. 3, and the second level of validation
is performed at the block level. Although blocks are being utilized
as sub-chunks, any other size of sub-chunk may also be
utilized.
[0059] The operations described in FIG. 4 are initially the same as
the method in FIG. 3, but the method diverges once the checksum for
a chunk fails. In this case, the second level CBR is initiated at
the block level. The upstream volume sends a command to the
downstream volume that there has been a chunk checksum mismatch and
block checksum is initiated. The command includes information
regarding the second level validation, such as a block cursor
(similar to the chunk cursor), the chunk identifier for the
validation, the number of blocks to be validated in the chunk,
etc.
[0060] The upstream and the downstream volumes then calculate the
checksum for a block B.sub.j and the downstream volume sends the
checksum of B.sub.j'. The upstream volume compares the checksums of
B.sub.j and B.sub.j', and if there is a mismatch the data for block
B.sub.j is sent to the downstream array. In one embodiment, once all the blocks in the chunk are validated, an indication is sent to the downstream array that the validation of that chunk has been
completed. In one embodiment, the checksums for the chunk are
rechecked to validate that the chunk is now correctly replicated.
In one embodiment, the downstream array compares the checksums and
notifies the upstream array which blocks to re-send.
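A sketch of this second-level (block-level) repair of a single mismatched chunk follows, under the same simplifying in-memory assumptions as the earlier chunk-level sketch; the block size and function name are illustrative.

```python
import hashlib

BLOCK_SIZE = 4 * 1024   # illustrative block size

def repair_chunk(upstream_chunk: bytes, downstream_chunk: bytearray) -> None:
    # The chunk checksums did not match, so compare per-block checksums
    # and copy ("send") only the blocks whose checksums differ.
    for offset in range(0, len(upstream_chunk), BLOCK_SIZE):
        up_block = upstream_chunk[offset:offset + BLOCK_SIZE]
        down_block = bytes(downstream_chunk[offset:offset + BLOCK_SIZE])
        if hashlib.sha1(up_block).digest() != hashlib.sha1(down_block).digest():
            downstream_chunk[offset:offset + len(up_block)] = up_block
    # Optionally re-check the whole chunk to confirm the repair.
    assert hashlib.sha1(upstream_chunk).digest() == hashlib.sha1(bytes(downstream_chunk)).digest()
```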
[0061] In CBR, the upstream and downstream arrays compute checksums, and if the checksums do not match, the upstream array sends data to fix the mismatch. The two stages of verification and fixing can be done sequentially or in parallel; for example, if the checksum of chunk 0 . . . 16 MB of bin1 does not match, the system will start fixing chunk 0 . . . 16 MB while performing the checksum on the next chunk 16 MB . . . 32 MB.
[0062] FIG. 5 illustrates the read and write paths within the
storage array, according to one embodiment. Regarding the write
path, the initiator 106 in the host 104 sends the write request to
the storage array 102. As the write data comes in, the write data
is written into NVRAM 108, and an acknowledgment is sent back to
the initiator (e.g., the host or application making the request).
In one embodiment, storage array 102 supports variable block sizes.
Data blocks in the NVRAM 108 are grouped together to form a segment
that includes a plurality of data blocks, which may be of different
sizes. The segment is compressed and then written to HDD 110. In
addition, if the segment is considered to be cache-worthy (i.e.,
important enough to be cached or likely to be accessed again) the
segment is also written to the SSD cache 112. In one embodiment,
the segment is written to the SSD 112 in parallel while writing the
segment to HDD 110.
[0063] In one embodiment, the performance of the write path is
driven by the flushing of NVRAM 108 to disk 110. With regards to
the read path, the initiator 106 sends a read request to storage
array 102. The requested data may be found in any of the different
levels of storage mediums of the storage array 102. First, a check
is made to see if the data is found in RAM (not shown), which is a
shadow memory of NVRAM 108, and if the data is found in RAM then
the data is read from RAM and sent back to the initiator 106. In
one embodiment, the shadow RAM memory (e.g., DRAM) keeps a copy of
the data in the NVRAM and the read operations are served from the
shadow RAM memory. When data is written to the NVRAM, the data is
also written to the shadow RAM so the read operations can be served
from the shadow RAM leaving the NVRAM free for processing write
operations.
[0064] If the data is not found in the shadow RAM then a check is
made to determine if the data is in cache, and if so (i.e., cache
hit), the data is read from the flash cache 112 and sent to the
initiator 106. If the data is found in neither the NVRAM 108 nor the flash cache 112, then the data is read from the hard drives 110 and
sent to the initiator 106. In addition, if the data being served
from hard disk 110 is cache worthy, then the data is also cached in
the SSD cache 112.
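The read-path ordering described above (shadow RAM, then flash cache, then HDD, with cache-worthy data promoted to the SSD cache) can be sketched as follows; the dictionaries standing in for each storage tier are assumptions made purely for illustration.

```python
from typing import Dict, Optional

# In-memory stand-ins for the storage tiers.
shadow_ram: Dict[int, bytes] = {}    # copy of NVRAM contents, serves reads
flash_cache: Dict[int, bytes] = {}   # SSD cache
hard_disk: Dict[int, bytes] = {}     # HDD, authoritative store

def read_block(block_no: int, cache_worthy: bool = True) -> Optional[bytes]:
    # 1. Serve from shadow RAM (the copy of NVRAM) if present.
    if block_no in shadow_ram:
        return shadow_ram[block_no]
    # 2. Otherwise serve from the SSD flash cache on a cache hit.
    if block_no in flash_cache:
        return flash_cache[block_no]
    # 3. Otherwise read from the hard drives; promote to cache if cache-worthy.
    data = hard_disk.get(block_no)
    if data is not None and cache_worthy:
        flash_cache[block_no] = data
    return data
```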
[0065] FIG. 6 illustrates an example of a configuration where
multiple arrays can be made part of a group (i.e., a cluster), in
accordance with one embodiment. In this example, a group is
configured by storage arrays that have also been associated with
pools 1150, 1152. For example, array 1 and array 2 are associated
with pool 1150. The arrays 1 and 2 of pool 1150 are configured with
volume 1 1160-1 and array 3 is configured in pool 1152 for managing
volume 2 1160-2. Pool 1152 that currently contains volume 2, can be
grown by adding additional arrays to increase performance and
storage capacity. Further illustrated is the ability to replicate a
particular group, such as group A to group B, while maintaining the
configuration settings for the pools and volumes associated with
group A.
[0066] As shown, a volume can be configured to span multiple
storage arrays of a storage pool. In this configuration, arrays in
a volume are members of a storage pool. In one example, if an array is added to a group and the array is not assigned to a particular pool, the array will be made a member of a default storage pool.
For instance, in FIG. 6, the default storage pool may be pool 1150
that includes array 1 and array 2. In one embodiment, pools can be
used to separate organizational sensitive data, such as finance and
human resources to meet security requirements. In additional to
pooling by organization, pooling can also be made by application
type. In some embodiments, it is possible to selectively migrate
volumes from one pool to another pool. The migration of pools can
include migration of their associated snapshots, and volumes can
support reads/writes during migration processes. In yet another
feature, existing pools can add arrays to boost performance and
storage capacity or evacuate arrays from existing pools (e.g., when
storage and/or performance is no longer needed or when one array is
being replaced with another array). Still further, logic in the
storage OS allows for merging of pools of a group. This is useful
when combining storage resources that were previously in separate
pools, thus increasing performance scaling across multiple
arrays.
[0067] The difference between groups and storage pools is that
groups aggregate arrays for management while storage pools
aggregate arrays for capacity and performance. As noted above, some
operations on storage pools may include creating and deleting
storage pools, adding and removing arrays to or from storage pools,
merging storage pools, and the like. In one example, a command line
may be provided to access a particular pool, which allows
management of multiple storage arrays via the command line (CLI)
interface. In one embodiment, a scale-out set up can be created by
either performing a group merge or adding an array. A group merge
is meant to merge two arrays that are already set up and have
objects and data stored thereon. The merge process ensures that
there are no duplicate objects and the merge adheres to other rules
around replication, online volumes, etc. Multi-array groups can
also be created by adding an underutilized array to another
existing array.
[0068] In one embodiment, storage pools are rebalanced when storage
objects such as arrays, pools and volumes are added, removed or
merged. Rebalancing is a non-disruptive low-impact process that
allows application IO to continue uninterrupted even to the data
sets during migration. Pool rebalancing gives highest priority to
active data IO and performs the rebalancing process with a lower
priority.
[0069] As noted, a group may be associated with several arrays, and
at least one array is designated as the group leader (GL). The
group leader has the configuration files and data that it maintains
to manage the group of arrays. In one embodiment, a backup group
leader (BGL) may be identified as one of the members of the storage
arrays. Thus, the GL is the storage array manager, while the other
arrays of the group are member arrays. In some cases, the GL may be
migrated to another member array in case of a failure or possible
failure at the array operating as the GL. As the configuration
files are replicated at the BGL, the BGL is the one that takes the
role as a new GL and another member array is designated as the BGL.
In one embodiment, volumes are striped across a particular pool of
arrays. As noted, group configuration data (configuration files and
data managed by a GL) is stored in a common location and is
replicated to the BGL.
[0070] In one embodiment, only a single management IP (Internet
Protocol) address is used to access the group. Benefits of a
centrally managed group include single volume collections across
the group, snapshot and replication schedules spanning the group,
added level of security by creating pools, shared access control
lists (ACLs), high availability, and general array administration
that operates at the group level and CLI command access to the
specific group.
[0071] In one implementation, the storage scale-out architecture
allows management of a storage cluster that spreads volumes and
their IO requests between multiple arrays. A host cannot assume
that a volume can be accessed through specific paths to one
specific array or another. Instead of advertising all of the iSCSI
interfaces on the array, the disclosed storage scale-out clusters
advertise one IP address (e.g., iSCSI discovery). Volume IO
requests are redirected to the appropriate array by leveraging deep
integration with the host operating system platforms (e.g.,
Microsoft, VMware, etc.), or using iSCSI redirection.
[0072] FIG. 7 illustrates the architecture of a storage array,
according to one embodiment. In one embodiment, storage array 102
includes an active controller 1120, a standby controller 1124, one
or more HDDs 110, and one or more SSDs 112. In one embodiment, the
controller 1120 includes non-volatile RAM (NVRAM) 1118, which is
for storing the incoming data as the data arrives to the storage
array. After the data is processed (e.g., compressed and organized
in segments (e.g., coalesced)), the data is transferred from the
NVRAM 1118 to HDD 110, or to SSD 112, or to both.
[0073] In addition, the active controller 1120 further includes CPU
1108, general-purpose RAM 1112 (e.g., used by the programs
executing in CPU 1108), input/output module 1110 for communicating
with external devices (e.g., USB port, terminal port, connectors,
plugs, links, etc.), one or more network interface cards (NICs)
1114 for exchanging data packages through network 1156, one or more
power supplies 1116, a temperature sensor (not shown), and a
storage connect module 1122 for sending and receiving data to and
from the HDD 110 and SSD 112. In one embodiment, standby controller
1124 includes the same components as active controller 1120.
[0074] Active controller 1120 is configured to execute one or more
computer programs stored in RAM 1112. One of the computer programs
is the storage operating system (OS) used to perform operating
system functions for the active controller device. In some
implementations, one or more expansion shelves 1130 may be coupled
to storage array 102 to increase HDD 1132 capacity, or SSD 1134
capacity, or both.
[0075] Active controller 1120 and standby controller 1124 have
their own NVRAMs, but they share HDDs 110 and SSDs 112. The standby
controller 1124 receives copies of what gets stored in the NVRAM
1118 of the active controller 1120 and stores the copies in its own
NVRAM. If the active controller 1120 fails, standby controller 1124
takes over the management of the storage array 102. When servers,
also referred to herein as hosts, connect to the storage array 102,
read/write requests (e.g., IO requests) are sent over network 1156,
and the storage array 102 stores the sent data or sends back the
requested data to host 104.
[0076] Host 104 is a computing device including a CPU 1150, memory
(RAM) 1146, permanent storage (HDD) 1142, a NIC card 1152, and an
IO module 1154. The host 104 includes one or more applications 1136
executing on CPU 1150, a host operating system 1138, and a computer
program storage array manager 1140 that provides an interface for
accessing storage array 102 to applications 1136. Storage array
manager 1140 includes an initiator 1144 and a storage OS interface
program 1148. When an IO operation is requested by one of the
applications 1136, the initiator 1144 establishes a connection with
storage array 102 in one of the supported formats (e.g., iSCSI,
Fibre Channel, or any other protocol). The storage OS interface
1148 provides console capabilities for managing the storage array
102 by communicating with the active controller 1120 and the
storage OS 1106 executing therein.
[0077] To process the IO requests, resources from the storage array
102 are required. Some of these resources may be a bottleneck in
the processing of storage requests because the resources are over
utilized, or are slow, or for any other reason. In general, the CPU
and the hard drives of the storage array 102 can become over
utilized and become performance bottlenecks. For example, the CPU
may become very busy because the CPU is utilized for processing
storage IO requests while also performing background tasks, such as
garbage collection, snapshots, replication, alert reporting, etc.
In one example, if there are many cache hits (i.e., the SSD
contains the requested data during IO requests), the SSD cache,
which is a fast responding system, may press the CPU for cycles,
thus causing potential bottlenecks for other requested IOs or for
processing background operations.
[0078] The hard disks may also become a bottleneck because the
inherent access speed to data is slow when compared to accessing
data from memory (e.g., NVRAM) or SSD. Embodiments presented herein
are described with reference to CPU and HDD bottlenecks, but the
same principles may be applied to other resources, such as a system
with insufficient amount of NVRAM.
[0079] As used herein, SSDs functioning as flash cache should be understood to operate as a cache for block level data
access, providing service to read operations instead of only
reading from HDDs 110. Thus, if data is present in SSDs 112,
reading will occur from the SSDs instead of requiring a read to the
HDDs 110, which is a slower operation. As mentioned above, the
storage operating system 1106 is configured with an algorithm that
allows for intelligent writing of certain data to the SSDs
112 (e.g., cache-worthy data), and all data is written directly to
the HDDs 110 from NVRAM 1118.
[0080] The algorithm, in one embodiment, is configured to select
cache-worthy data for writing to the SSDs 112, in a manner that
provides an increased likelihood that a read operation will access
data from SSDs 112. In some embodiments, the algorithm is referred
to as a cache accelerated sequential layout (CASL) architecture,
which intelligently leverages unique properties of flash and disk
to provide high performance and optimal use of capacity. In one
embodiment, CASL caches "hot" active data onto SSD in real
time--without the need to set complex policies. This way, the
storage array can instantly respond to read requests--as much as
ten times faster than traditional bolt-on or tiered approaches to
flash caching.
[0081] For purposes of discussion and understanding, reference is
made to CASL as being an algorithm processed by the storage OS.
However, it should be understood that optimizations, modifications,
additions, and subtractions to versions of CASL may take place from
time to time. As such, reference to CASL should be understood to
represent exemplary functionality, and the functionality may change
from time to time, and may be modified to include or exclude
features referenced herein or incorporated by reference herein.
Still further, it should be understood that the embodiments
described herein are just examples, and many more examples and/or
implementations may be defined by combining elements and/or
omitting elements described with reference to the claimed
features.
[0082] In some implementations, SSDs 112 may be referred to as
flash, or flash cache, or flash-based memory cache, or flash
drives, storage flash, or simply cache. Consistent with the use of
these terms, in the context of storage array 102, the various
implementations of SSD 112 provide block level caching to storage,
as opposed to instruction level caching. As mentioned above, one
functionality enabled by algorithms of the storage OS 1106 is to
provide storage of cache-worthy block level data to the SSDs, so
that subsequent read operations are optimized (i.e., reads that are
likely to hit the flash cache will be stored to SSDs 112, as a form
of storage caching, to accelerate the performance of the storage
array 102).
[0083] In one embodiment, it should be understood that the "block
level processing" of SSDs 112, serving as storage cache, is
different than "instruction level processing," which is a common
function in microprocessor environments. In one example,
microprocessor environments utilize main memory, and various levels
of cache memory (e.g., L1, L2, etc.). Instruction level caching is differentiated further because instruction level caching is
block-agnostic, meaning that instruction level caching is not aware
of what type of application is producing or requesting the data
processed by the microprocessor. Generally speaking, the
microprocessor is required to treat all instruction level caching
equally, without discriminating or differentiating processing of
different types of applications.
[0084] In the various implementations described herein, the storage
caching facilitated by SSDs 112 is implemented by algorithms
exercised by the storage OS 1106, which can differentiate between
the types of blocks being processed for each type of application or
applications. That is, block data being written to storage 1130 can
be associated with block data specific applications. For instance,
one application may be a mail system application, while another
application may be a financial database application, and yet
another may be for a website-hosting application. Each application
can have different storage accessing patterns and/or requirements.
In accordance with several embodiments described herein, block data
(e.g., associated with the specific applications) can be treated
differently when processed by the algorithms executed by the
storage OS 1106, for efficient use of flash cache 112.
[0085] FIG. 8 is a flow chart of a method for replicating data
across storage systems, according to one embodiment. Operation 802
is for transferring the snapshot of a volume from an upstream array
to a downstream array, the volume being a single accessible storage
area within the upstream array. From operation 802, the method
flows to operation 804 for comparing an upstream snapshot checksum
(usc) of the snapshot in the upstream array with a downstream
snapshot checksum (dsc) of the snapshot in the downstream
array.
[0086] In operation 806, a check is made to determine if usc is
equal to dsc. If usc is equal to dsc, the method flows to operation
818, and if usc is not equal to dsc, the method flows to operation
808. In operation 808, a plurality of chunks is defined in the
snapshot.
[0087] From operation 808, the method flows to operation 810 where
a comparison is made of an upstream chunk checksum (ucc) calculated
by the upstream array with a downstream chunk checksum (dcc) calculated by the downstream array.
[0088] In operation 812, a check is made to determine if ucc is equal to dcc. If ucc is equal to dcc, the method flows to operation 814 where the chunk is considered validated. If ucc is not equal to dcc, the method flows to operation 816 where data of the chunk is sent from the upstream array to the downstream array. Operations
810, 812, and 814 or 816 are repeated for all the chunks defined in
operation 808. When all the chunks have been validated, the
snapshot is considered validated in operation 818.
[0089] The snapshot operations described herein solve the problem
of having to re-send all the data of a volume from one storage
system to another storage system when a problem occurs while
replicating data. The data of the original snapshot is kept in
permanent storage of the upstream array, and the replicated data is
kept in permanent storage in the downstream array. Instead of
having to resend all the data of the volume, the volume is divided
into logical groups of data, referred to as chunks, and then each
chunk is validated. When all the chunks have been validated, the
replicated version of the volume is considered validated. The
operations described herein refer to the exchange of information
between two separate storage devices, which exchange data and
transfer parameters to validate the replicated data. By
re-transmitting only the chunks that have been improperly
replicated, savings in time and resources are attained because the
data of the chunks that have been correctly replicated does not
have to be re-transmitted.
[0090] One or more embodiments can also be fabricated as computer
readable code on a non-transitory computer readable storage medium.
The non-transitory computer readable storage medium is any
non-transitory data storage device that can store data, which can
thereafter be read by a computer system. Examples of the
non-transitory computer readable storage medium include hard
drives, network attached storage (NAS), read-only memory,
random-access memory, CD-ROMs, CD-Rs, CD-RWs, magnetic tapes and
other optical and non-optical data storage devices. The
non-transitory computer readable storage medium can include
computer readable storage medium distributed over a network-coupled
computer system so that the computer readable code is stored and
executed in a distributed fashion.
[0091] Although the method operations were described in a specific
order, it should be understood that other housekeeping operations
may be performed in between operations, or operations may be
adjusted so that they occur at slightly different times, or may be
distributed in a system which allows the occurrence of the
processing operations at various intervals associated with the
processing, as long as the processing of the overlay operations is performed in the desired way.
[0092] Although the foregoing embodiments have been described in
some detail for purposes of clarity of understanding, it will be
apparent that certain changes and modifications can be practiced
within the scope of the appended claims. Accordingly, the present
embodiments are to be considered as illustrative and not
restrictive, and the embodiments are not to be limited to the
details given herein, but may be modified within the scope and
equivalents of the described embodiments.
* * * * *