U.S. patent application number 14/846689 was filed with the patent office on 2015-09-04 and published on 2016-03-17 for fibre channel storage array methods for handling cache-consistency among controllers of an array and consistency among arrays of a pool.
The applicant listed for this patent is Nimble Storage, Inc. Invention is credited to Matti Vanninen.
Application Number: 20160077752 / 14/846689
Family ID: 55454784
Publication Date: 2016-03-17
United States Patent Application 20160077752
Kind Code: A1
Vanninen; Matti
March 17, 2016
Fibre Channel Storage Array Methods for Handling Cache-Consistency
Among Controllers of an Array and Consistency Among Arrays of a
Pool
Abstract
Storage arrays, systems and methods for operating storage arrays
for maintaining consistency in configuration data between processes
running on an active controller and a standby controller of the
storage array are provided. One example method includes executing a
primary process in user space of the active controller. The primary
process is configured to process request commands from one or more
initiators, and the primary process has access to a volume manager
for serving data input/output (I/O) requests and non-I/O requests.
The primary process has primary access to the configuration data
and includes a first logical unit (LU) cache for storing the
configuration data. The method also includes executing a secondary
process in user space of the standby controller. The secondary
process is configured to process request commands from one or more
of the initiators, wherein the secondary process does not have
access to the volume manager. The secondary process has a second LU
cache for storing the configuration data, and the second LU cache
is used by the secondary process for responding to non-I/O
requests. The method includes receiving, at the primary process, an
update to the configuration data and sending, by the primary
process, the update to the configuration data to the secondary
process for updating the second LU cache. When the primary process
receives an acknowledgement from the secondary process that the
update to the configuration data was received, then the updates to
the configuration data are committed to the first LU cache of the
active controller.
Inventors: Vanninen; Matti (Durham, NC)
Applicant: Nimble Storage, Inc. (San Jose, CA, US)
Family ID: 55454784
Appl. No.: 14/846689
Filed: September 4, 2015
Related U.S. Patent Documents
Application Number 62050680, filed Sep 15, 2014
Current U.S. Class: 711/114
Current CPC Class: G06F 3/061; G06F 3/0617; G06F 11/2092; G06F 13/426; G06F 3/0611; G06F 3/0689; G06F 11/20; G06F 3/0659; G06F 11/14; G06F 3/0631; G06F 11/00; G06F 3/0635; G06F 13/4027; G06F 3/0665; G06F 3/0619; G06F 13/4282; G06F 3/067; G06F 3/0685 (all 20130101)
International Class: G06F 3/06 (20060101)
Claims
1. A method for maintaining consistency in configuration data
between processes running on an active controller and a standby
controller of a storage array, comprising, executing a primary
process in user space of the active controller, the primary process
configured to process request commands from one or more initiators,
the primary process having access to a volume manager for serving
data input/output (I/O) requests and non-I/O requests, the primary
process having primary access to the configuration data and having
a first logical unit (LU) cache for storing the configuration data;
executing a secondary process in user space of the standby
controller, the secondary process configured to process request
commands from one or more of the initiators, the secondary process
not having access to the volume manager, the secondary process
having a second LU cache for storing the configuration data, the
second LU cache being used by the secondary process for responding
to non-I/O requests; receiving, at the primary process, an update
to the configuration data; sending, by the primary process, the
update to the configuration data to the secondary process for
updating the second LU cache, and if the primary process receives
an acknowledgement from the secondary process that the update to
the configuration data was received, then committing the updates to
the configuration data to the first LU cache of the active
controller.
2. The method of claim 1, wherein if the primary process does not
receive the acknowledgement from the secondary process, then
waiting a period of time before committing the update to the first
LU cache, wherein the primary process commits the update to the
first LU cache after waiting the period of time, as the secondary
process will have restarted after the period of time has been
reached.
3. The method of claim 2, wherein the period of time is determined
when a heart-beat exchange between the secondary process and the
primary process is missing.
4. The method of claim 1, wherein the primary process includes a
first SCSI layer for processing calls to the volume manager for
serving the data input/output (I/O) requests and non-I/O
requests.
5. The method of claim 1, wherein the secondary process includes a
second SCSI layer for processing calls to a volume manager, wherein
code of the second SCSI layer is configured to return an error or
an unavailable response to said SCSI layer calls since the volume
manager having access for serving the data input/output (I/O)
requests is not made available via the secondary process.
6. The method of claim 1, further comprising, providing the primary
process with communication access to a configuration database, the
configuration database is configured to persistently store
configuration updates made to logical unit number (LUN) masking and
mapping, and configuration updates made to port name generation and
port configuration.
7. The method of claim 1, wherein a SCSI layer of the active
controller is provided with an access interface to the first LU
cache, and the first LU cache operates as a library linked between
the primary process, the secondary process and a configuration
management unit that is interfaced with a configuration
database.
8. The method of claim 1, wherein the secondary process uses a
proxy service of the primary process to communicate with a
configuration database regarding changes to the second LU
cache.
9. The method of claim 1, wherein the primary process is configured
to push said updates to the configuration data to the secondary
process for commitment, such that updates to be made to the first
LU cache are made to the second LU cache.
10. The method of claim 1, further comprising, resending the update
to the configuration data to the primary process until committed,
wherein the primary process commits after receipt of the
acknowledgement from the secondary process to avoid the second LU cache having an update that is not yet committed to the
first LU cache.
11. The method of claim 1, further comprising, receiving port state
data at a management configuration unit from a process that monitors port information, wherein the management configuration unit is then
configured to instruct the primary process of the update to the
configuration data that includes said port state data.
12. The method of claim 1, wherein the secondary process is further
executed in the user space of the active controller, such that each
of the active controller and the standby controller executes a
respective one of the secondary process, the secondary process of
the active controller has access to the first LU cache, whereas the
secondary process of the standby controller has access to the
second LU cache.
13. The method of claim 1, further comprising, providing at least
two of said storage array; designating one of said storage array as
a group leader (GL) to form a pool of storage arrays; executing a
configuration management unit on the GL; and receiving changes at
each primary process of each storage array from the configuration
management unit of the GL, such that each primary process of each
storage array pushes updates to respective secondary processes so
that first and second LU cache in each of the active controller and
standby controller of respective array is maintained consistent for
the pool of storage arrays.
14. The method of claim 1, wherein the first and second LU cache is
configured to store said configuration data related to one or more
of LUN mapping data, or port state data, or inquiry data, or
combinations of two or more thereof.
15. A storage array, comprising, (a) an active controller
configured to execute a primary process and a first secondary
process, (i) the primary process includes a volume manager and a
first SCSI layer; (ii) the first secondary process includes a
second SCSI layer; and (iii) a first logical unit (LU) cache, the
first LU cache configured to store configuration data related to
logical unit number (LUN) mapping and port data; (b) a standby
controller configured to execute a second secondary process, (i)
the second secondary process includes a third SCSI layer; (ii) a
second logical unit (LU) cache, the second LU cache is also
configured to store the configuration data related to logical unit
number (LUN) mapping and port data; and (c) a configuration
management unit that is configured to communicate changes to the
configuration data to the primary process, the primary process is
configured to push said changes to the configuration data to said
second secondary process to enable commitment to said second LU
cache, wherein the primary process of the active controller is
configured to wait to commit the changes to the configuration data
to the first LU cache until confirmation is received by the primary
process that the second secondary process has committed the changes
to the configuration data to the second LU cache; wherein the
storage array is configured to service requests from one or more
initiators.
16. The storage array of claim 15, wherein said first SCSI layer
and said second SCSI layer each have access to the first LU cache,
and said third SCSI layer has access to the second LU cache.
17. The storage array of claim 15, wherein the configuration
management unit is interfaced with a configuration database.
18. The storage array of claim 15, wherein if the primary process
does not receive a confirmation from the secondary process, then
waiting a period of time before committing the update to the first
LU cache, such that the primary process commits the update to the
first LU cache after waiting the period of time, as the secondary
process is programmed to have restarted after the period of time
has been reached.
19. The storage array of claim 15, further comprising a controller
management daemon to monitor port state of the storage array,
wherein changes to port state are received by the configuration
management unit, said changes are defined as changes to the
configuration data that are stored to a configuration database and
pushed to the primary process of the active controller for
propagation to the first LU cache and the second LU cache.
20. The storage array of claim 15, wherein the volume manager is
provided with access to storage of the storage array and only the
primary process is used for serving data input/output (I/O)
requests.
21. The storage array of claim 15, wherein two or more of said
storage arrays are programmable to operate as a pool of arrays,
wherein one of said pool of arrays is a group leader (GL) and each
of said storage arrays has a respective first LU cache and second
LU cache.
22. A storage array, comprising, an active controller configured to
execute a primary process that includes a volume manager and a
first SCSI layer, the active controller further includes a first
logical unit (LU) cache for storing configuration data related to
logical unit number (LUN) mapping and port data of the storage
array; a standby controller configured to execute a secondary
process (280b), the secondary process includes a second SCSI layer,
the standby controller further includes a second logical unit (LU)
cache that is also configured to store the configuration data
related to logical unit number (LUN) mapping and port data of the
storage array; and a configuration management unit that is
configured to communicate changes to the configuration data to the
primary process, the primary process is configured to push said
changes to the configuration data to said secondary process to
enable commitment to said second LU cache, wherein the primary
process of the active controller is configured to wait to commit
the changes to the configuration data to the first LU cache until
confirmation is received by the primary process that the secondary
process has committed the changes to the configuration data to the
second LU cache; wherein the storage array is configured to service
requests from one or more initiators.
23. The storage array of claim 22, wherein the active controller
includes a second secondary process having a third SCSI layer,
wherein the first and third SCSI layer is provided access to the
first LU cache of the active controller and the second SCSI layer
is provided with access to the second LU cache of the standby
controller; wherein the first SCSI layer of the active controller
is used to make updates to the first LU cache and pushes said
updates to the second SCSI layer of the standby controller for
making said updates to the second LU cache.
24. The storage array of claim 22, further comprising a controller
management daemon to monitor port state of the storage array,
wherein changes to the port state are received by the configuration
management unit, said changes are defined as changes to the
configuration data that are stored to a configuration database and
pushed to the primary process of the active controller for
propagation to the first LU cache and the second LU cache.
25. The storage array of claim 22, wherein the volume manager is
provided with access to storage of the storage array and only the
primary process is used for serving data input/output (I/O)
requests to initiators that use the storage array as a target.
26. The storage array of claim 22, wherein two or more of said
storage arrays are programmable to operate as a pool of arrays,
wherein one of said pool of arrays is a group leader (GL) and each
of said storage arrays has a respective first LU cache and second
LU cache.
27. Computer readable media, being non-transitory, for maintaining
consistency in configuration data between processes running on an
active controller and a standby controller of a storage array,
comprising, program instructions for executing a primary process in
user space of the active controller, the primary process configured
to process request commands from one or more initiators, the
primary process having access to a volume manager for serving data
input/output (I/O) requests and non-I/O requests, the primary
process having primary access to the configuration data and having
a first logical unit (LU) cache for storing the configuration data;
program instructions for executing a secondary process in user
space of the standby controller, the secondary process configured
to process request commands from one or more of the initiators, the
secondary process not having access to the volume manager, the
secondary process having a second LU cache for storing the
configuration data, the second LU cache being used by the secondary
process for responding to non-I/O requests; program instructions
for receiving, at the primary process, an update to the
configuration data; program instructions for sending, by the
primary process, the update to the configuration data to the
secondary process for updating the second LU cache, and if the
primary process receives an acknowledgement from the secondary
process that the update to the configuration data was received,
then committing the updates to the configuration data to the first
LU cache of the active controller.
28. The computer readable media of claim 27, further comprising,
program instructions for determining if the primary process does
not receive the acknowledgement from the secondary process, and
then waiting a period of time before committing the update to the
first LU cache; wherein the primary process commits the update to
the first LU cache after waiting the period of time, as the
secondary process will have restarted after the period of time has
been reached.
29. The computer readable media of claim 28, wherein the period of
time is determined when a heart-beat exchange between the secondary
process and the primary process is missing.
30. The computer readable media of claim 27, wherein the primary
process includes a first SCSI layer for processing calls to the
volume manager for serving the data input/output (I/O) requests and
non-I/O requests; wherein the secondary process includes a second
SCSI layer for processing calls to a volume manager, wherein code
of the second SCSI layer is configured to return an error or an
unavailable response since the volume manager having access for
serving the data input/output (I/O) requests is not made available
via the secondary process.
31. The computer readable media of claim 27, further comprising,
program instructions for providing the primary process with
communication access to a configuration database, the configuration
database is configured to persistently store configuration updates
made to logical unit number (LUN) masking and mapping, and
configuration updates made to port name generation and port
configuration.
32. The computer readable media of claim 27, for at least two of
said storage array, the computer readable media includes, program
instructions for designating one of said storage array as a group
leader (GL) to form a pool of storage arrays; program instructions
for executing a configuration management unit on the GL; and
program instructions for receiving changes at each primary process
of each storage array from the configuration management unit of the
GL, such that each primary process of each storage array pushes
updates to respective secondary processes so that first and second
LU cache in each of the active controller and standby controller of
respective array is maintained consistent for the pool of storage
arrays.
Description
CLAIM OF PRIORITY
[0001] This application claims priority from U.S. Provisional
Patent Application No. 62/050,680, filed on Sep. 15, 2014, entitled
"Fibre Channel Storage Array Systems and Methods," which is herein
incorporated by reference.
BACKGROUND
[0002] 1. Field of the Invention
[0003] The present embodiments relate to storage arrays, methods,
systems, and programs for maintaining a consistent cache of logical
unit data and port data for Fibre Channel arrays, such as where
storage arrays use fail-over processes and standby hardware to
maintain high availability to initiators.
[0004] 2. Description of the Related Art
[0005] Network storage, also referred to as network storage systems
or storage systems, is computer data storage connected to a
computer network providing data access to heterogeneous clients.
Typically, network storage systems process a large number of
Input/Output (I/O) requests, and high availability, speed, and
reliability are desirable characteristics of network storage.
[0006] One way to provide quick access to data is by utilizing fast
cache memory to store data. Since the difference in access times
between a cache memory and a hard drive is significant, the
overall performance of the system is highly impacted by the cache
hit ratio. Therefore, it is important to provide optimal
utilization of the cache memory in order to have in cache the data
that is accessed most often.
[0007] There is also a need for storage systems that operate on Fibre Channel networks to provide fault-tolerant connections to initiators. If initiators see storage arrays with excessive
failures, even when a storage array is processing failover
procedures, such storage arrays will be viewed as less than
optimal. A need therefore exists for a storage array that is
capable of handling failover operations while providing initiators
with consistent connections to such storage arrays.
[0008] It is in this context that embodiments arise.
SUMMARY
[0009] Methods and storage systems for processing failover
operations in a storage array configured for Fibre Channel
communication are provided. A storage array includes an active
controller and a standby controller. In one embodiment, management
changes may be made at the active controller, and those changes
should consistently be made at the standby controller. In one
configuration, both the active controller and the standby
controller include a logical unit (LU) cache. The methods disclosed
herein relate to managing consistency of the LU cache between
copies accessed by the active controller and the standby
controller, i.e., synchronizing cache content between them. Maintaining consistency among the active and standby controllers ensures
that initiators accessing an array have the correct logical unit
number mappings, port data, etc., even when failover has occurred
between the active and standby controllers.
[0010] In one embodiment, a method for operating a storage array
for maintaining consistency in configuration data between processes
running on an active controller and a standby controller of the
storage array is provided. In this embodiment, the method includes
executing a primary process in user space of the active controller.
The primary process is configured to process request commands from
one or more initiators, and the primary process has access to a
volume manager for serving data input/output (I/O) requests and
non-I/O requests. The primary process has primary access to the
configuration data and includes a first logical unit (LU) cache for
storing the configuration data. The method also includes executing
a secondary process in user space of the standby controller. The
secondary process is configured to process request commands from
one or more of the initiators, wherein the secondary process does
not have access to the volume manager. The secondary process has a
second LU cache for storing the configuration data, and the second
LU cache is used by the secondary process for responding to non-I/O
requests. The method includes receiving, at the primary process, an
update to the configuration data and sending, by the primary
process, the update to the configuration data to the secondary
process for updating the second LU cache. When the primary process
receives an acknowledgement from the secondary process that the
update to the configuration data was received, then the updates to
the configuration data are committed to the first LU cache of the
active controller.
[0011] In another embodiment, a storage array is provided. The
storage array includes an active controller configured to execute a
primary process that includes a volume manager and a first SCSI
layer. The active controller further includes a first logical unit
(LU) cache for storing configuration data related to logical unit
number (LUN) mapping and port data of the storage array. Further
included is a standby controller configured to execute a secondary
process. The secondary process includes a second SCSI layer. The
standby controller further includes a second logical unit (LU)
cache that is also configured to store the configuration data
related to logical unit number (LUN) mapping and port data of the
storage array. A configuration management unit is also provided and
is configured to communicate changes to the configuration data to
the primary process. The primary process is configured to push said
changes to the configuration data to said secondary process to
enable commitment to said second LU cache. The primary process of
the active controller is configured to wait to commit the changes
to the configuration data to the first LU cache until confirmation
is received by the primary process that the secondary process has
committed the changes to the configuration data to the second LU
cache. The storage array is configured to service requests from one
or more initiators.
[0012] In yet another embodiment, computer readable media is
provided, having program instructions for operating a storage array
for maintaining consistency in configuration data between processes
running on an active controller and a standby controller of the
storage array.
[0013] Other aspects will become apparent from the following
detailed description, taken in conjunction with the accompanying
drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] The embodiments may best be understood by reference to the
following description taken in conjunction with the accompanying
drawings.
[0015] FIG. 1A provides one example view of a storage array SCSI
target stack, in accordance with one embodiment.
[0016] FIG. 1B illustrates an example of a storage array having an
active controller and a standby controller, in accordance with one
embodiment.
[0017] FIG. 1C shows an example of the active controller, which is
configured with a data services daemon (DSD) and a standby failover
daemon (SFD), in accordance with one embodiment.
[0018] FIG. 2 illustrates an example of the architecture of a
storage array, according to one embodiment.
[0019] FIGS. 3A and 3B illustrate example logical diagrams of a
storage array, which includes an active controller and a standby
controller, in accordance with two embodiments.
[0020] FIG. 4 illustrates a logical unit (LU) cache distributed in
two arrays and associated management among the arrays, in
accordance with one embodiment.
[0021] FIG. 5 illustrates an example architecture, which
illustrates communication among user space processes, which
includes LU cache processing, in accordance with one
embodiment.
[0022] FIG. 6 illustrates an active controller and a standby
controller managing configuration changes and updates to LU cache
in each controller, in accordance with one embodiment.
DETAILED DESCRIPTION
[0023] The following embodiments describe methods, devices,
systems, and computer programs for storage arrays, in which cache within the storage arrays is managed for consistency. Cache consistency
is particularly needed in storage arrays that maintain separate
cache copies for each controller, in a multi-controller storage
array. Multi-controller storage arrays are those that have an
active controller for serving data and information to requesting
initiators and standby controllers that stand ready to take over
the role as the active controller if any failure or power down of
the active controller occurs. In one configuration, logical unit
(LU) cache copies are maintained by each of the active controller
and the standby controller. During operation, user space processes
work to synchronize changes made to the LU cache, which include
changes to logical unit numbers and port state information.
[0024] More detail regarding maintaining LU cache consistency among
controllers of a storage array will be provided with reference to
FIGS. 3A-6 below.
[0025] It should be noted that various embodiments described in the
present disclosure may be practiced without some or all of these
specific details. In other instances, well known process operations
have not been described in detail in order not to unnecessarily
obscure various embodiments described in the present
disclosure.
[0026] One protocol is iSCSI (Internet Small Computer System
Interface). iSCSI is used for interconnecting storage arrays to a
network, which enables the transport of SCSI commands over Ethernet
connections using TCP/IP (i.e., for IP networks). In such
configurations, an iSCSI storage implementation can be deployed
using Ethernet routers, switches, network adapters, and
cabling.
[0027] Another protocol is Fibre Channel. Fibre Channel is a
high-speed network technology, which is primarily utilized in
storage area networks (SANs). Storage arrays are the target
devices in a SAN configuration, wherein the fabric and initiators
all intercommunicate using the Fibre Channel protocol. Fibre
Channel Protocol (FCP) is a transport protocol (similar to TCP used
in IP networks) that predominantly transports SCSI commands over
Fibre Channel networks.
[0028] In accordance with various embodiments described herein, a
storage array configurable for Fibre Channel mode or iSCSI mode is
provided. The storage array can include logic and hardware to
operate in the iSCSI mode and can implement one or more Ethernet
cards. To operate in the Fibre Channel mode, the storage array is
provided with a Fibre Channel (FC) card (e.g., a hardware card of
the controller). The FC card is the link between the Fibre Channel
physical network (i.e., PHY) and the Fibre Channel (FC) driver of the storage array.
[0029] FIG. 1A provides one example view of a storage array SCSI
target stack 100. The stack includes a volume manager (VM) 102,
which broadly includes the operating system (OS) 106 of the storage
array and an I/O handling protocol 108 that processes read and
write I/O commands to storage of the storage array. The I/O
handling protocol, in one embodiment, is referred to herein as a
cache accelerated sequential layout (CASL) process, which
intelligently leverages unique properties of flash and disk of the
storage array to provide high performance and optimal use of
capacity. CASL functions as the file system of the array, albeit
processing is generally performed at the block level instead of
file level.
[0030] Below the VM 102 is a SCSI layer 104, which is configured to
handle SCSI commands. In one embodiment, the SCSI layer 104 has
been implemented to be independent of iSCSI transport
functionality. For example, in storage arrays configured for pure
iSCSI mode operation, the iSCSI transport 112 may include logic
that is shared by the SCSI layer 104. However, to implement a Fibre
Channel operating storage array, the SCSI layer 104 has been
implemented to remove dependencies on the iSCSI transport 112. The
SCSI target stack 100 further includes a Fibre Channel (FC)
transport 110, which functions as user space for running various
processes, which are referred to herein as daemons. The user-space
of the FC transport 110 serves as the conduit to the SCSI target
(i.e., SCSI layer 104).
[0031] A Fibre Channel (FC) driver 116 is further provided, which
is in communication with a Fibre Channel (FC) card 118. In one
embodiment, in order to interact with the FC card 118, which is a
dedicated hardware/firmware, a dedicated FC driver 116 is provided.
For each FC card 118 (i.e., port) in an array, an instance of the
FC driver 116 is provided. The FC driver 116 is, in one embodiment,
a kernel level driver that is responsible for interacting directly
with the FC card 118 to retrieve incoming SCSI commands, request
data transfer, and send SCSI responses, among other things. In one
embodiment, the FC card 118 may be an adapter card, which includes
hardware, firmware and software for processing Fibre Channel
packets between the Fibre Channel fabric and the FC driver. In one
specific example, the FC card 118 may be a Fibre Channel Host Bus
Adapter (HBA) card. If the storage array is configured for iSCSI
mode, Linux sockets are used to communicate with a TCP/IP network
interface card (NIC), for communication with an Ethernet
fabric.
[0032] FIG. 1B illustrates an example of a storage array 202, which
includes an active controller 220, a standby controller 224, and
storage (i.e., hard disk drives (HDDs) 226, and solid state drives
(SSDs) 228). This configuration shows the storage array SCSI target
stack 100 usable in each of the active and standby controllers 220
and 224, depending on the state of operation. For example, if the
active controller 220 is functioning normally, the standby
controller is not serving IOs to and from the storage, and ports of
the standby controller are simply operational in a standby (SB)
state in accordance with an asymmetric logical unit access (ALUA)
protocol. The ALUA protocol is described in more detail in a Fibre
Channel standard, entitled "Information technology-SCSI Primary
Commands-4 (SPC-4)", revision 36s, dated 21 March, 2014 (Project
T10/BSR INCITS 513), which is incorporated herein by reference.
Generally speaking, ALUA is a multi-pathing method that allows each
port to manage access states and path attributes using assignments
that include: (a) active/optimized (AO); (b) active/non-optimized
(ANO); (c) standby (SB); (d) unavailable (UA); and (e) logical block dependent (LBD).
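By way of a non-limiting illustration, the ALUA access states listed above can be modeled as a simple enumeration. The sketch below is not part of this disclosure or of the SPC-4 text; the numeric codes shown follow commonly used encodings and should be treated as illustrative assumptions.

    from enum import IntEnum

    class AluaAccessState(IntEnum):
        # Illustrative ALUA asymmetric access state codes.
        ACTIVE_OPTIMIZED = 0x0         # AO: ports serving I/O on the active controller
        ACTIVE_NON_OPTIMIZED = 0x1     # ANO
        STANDBY = 0x2                  # SB: ports on the standby controller
        UNAVAILABLE = 0x3              # UA
        LOGICAL_BLOCK_DEPENDENT = 0x4  # LBD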
[0033] In the example of FIG. 1B, it is noted that the standby
controller 224 may not have the iSCSI transport 112 during the time
it operates as a "standby" controller. If failover occurs and the
standby controller 224 becomes the active controller 220, then the
iSCSI transport 112 will be populated. Note also, that during Fibre
Channel operation, the FC transport 110 is the module that is in
operation. Alternatively, if the storage arrays are used in an
iSCSI configuration, the iSCSI transport 112 will be needed, along
with the Linux Sockets 114 to enable Ethernet fabric
communication.
[0034] FIG. 1C shows an example of the active controller 220, which
is configured with a data services daemon (DSD) 260. DSD 260 is
designed to provide full access to the storage array 202 via the VM
102, which includes serving IOs to the volumes of the storage array
202 (e.g., in response to initiator access requests to the SCSI
target storage array 202). The DSD 260 of the active controller 220
is a user space process. For failover capabilities within the
active controller 220 itself, the user space of the active
controller 220 also includes a standby failover daemon (SFD) 280a.
The SFD 280a is configured as a backup process that does not
process IOs to the volumes of the storage array 202, but can
provide limited services, such as responding to information SCSI
commands while the DSD 260 is re-started (e.g., after a crash). In
one embodiment, SFD may also be referred to as a SCSI failover and
forwarding daemon.
[0035] If the SFD 280a takes over for the DSD 260, the I_T Nexus
(i.e., connection) between initiators and the target array remains unterminated. As will be described in more detail below in
reference to a port-grab mechanism, during the transition between
DSD 260 and SFD 280a, the FC driver 116 can transition between user
space partner processes (e.g., DSD/SFD), without terminating the
SCSI I_T_Nexus and forcing the initiator to reestablish its
connection to the target.
[0036] The standby controller 224 of the storage array 202 is also
configured with an SFD 280b in its user space. As noted above, the
ports of the standby controller 224 are set to standby (SB) per
ALUA. If a command is received by the SFD of the standby
controller, it can process that command in one of three ways. In
regard to a first way, for many commands, including READ and WRITE,
the SCSI standard does not require the target to support the
operation. For this case, SFD 280b returns the SCSI response
prescribed by the standard to indicate non-support. In a second
way, among the mandatory-to-support SCSI commands, there are
certain commands for which initiators expect quick response under
all conditions, including during failover.
[0037] Examples include, without limitation, INQUIRY, REPORT_LUNS,
and REPORT_TARGET_PORT_GROUPS. For these commands, SFD 280b
responds locally and independently. In a third way, for other
mandatory-to-support SCSI commands (such as
PERSISTENT_RESERVATION_IN/OUT), the SFD 280b will depend on the DSD
260 process running on the active controller 220. Thus, a
forwarding engine is used to forward SCSI commands from the standby
controller 224 to the active controller 220. The active controller
220 will process the commands and send responses back to the
standby controller 224, which will in turn send them to the
initiator.
[0038] For commands that need to be processed locally, all
information required to create an accurate and consistent SCSI
response will be stored locally in an LU cache 290. As will be
described in more detail below, a logical unit (LU) cache will be
present on each of the active and standby controllers 220/224, and
consistency methods ensure that all LU cache states are updated.
The SFD 280a/b uses the LU cache 290 to independently respond to a
small number of commands, such as Inquiry, Report LUNs and
RTPG.
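For purposes of illustration only, the three-way handling described in paragraphs [0036]-[0038] may be sketched as a small dispatch routine. The command groupings and the helper objects (lu_cache, dsd_proxy) below are assumptions made for the sketch and are not limiting.

    from dataclasses import dataclass

    # Commands the standby SFD answers locally from the LU cache (second way).
    LOCAL_FROM_LU_CACHE = {"INQUIRY", "REPORT_LUNS", "REPORT_TARGET_PORT_GROUPS"}
    # Commands forwarded to the DSD on the active controller (third way).
    FORWARD_TO_ACTIVE = {"PERSISTENT_RESERVE_IN", "PERSISTENT_RESERVE_OUT"}

    @dataclass
    class ScsiCommand:
        opcode: str
        lun: int = 0

    def handle_on_standby(cmd, lu_cache, dsd_proxy):
        """Dispatch a SCSI command received on a standby-controller port."""
        if cmd.opcode in LOCAL_FROM_LU_CACHE:
            # Respond locally and independently using only LU cache contents.
            return lu_cache.build_response(cmd)
        if cmd.opcode in FORWARD_TO_ACTIVE:
            # Forward to DSD on the active controller and relay its response
            # back to the initiator.
            return dsd_proxy.forward(cmd)
        # First way: for operations the standby ports need not support
        # (e.g. READ/WRITE), return the response prescribed by the standard.
        return ("CHECK CONDITION", "operation not supported on standby port")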
[0039] Furthermore, in Fibre Channel, each FC transport endpoint is
identified by a Fibre Channel (FC) World Wide Node Name (WWNN) and
World Wide Port Name (WWPN). It is customary and expected that all
ports for a given target advertise the same single WWNN. The client
OS storage stack will establish a single FC connection to each
available FC transport endpoint (WWNN/WWPN pair). In some
embodiments, where FC requires a separate WWNN/WWPN pair for each target, a single-LUN target model would require a separate WWNN/WWPN pair for each exported volume. It should be understood that single-LUN target models are just one example, and configurations that are not single-LUN target may also be implemented. In one example of storage array
202, it may have two FC transport endpoints for each of the active
controller 220 and the standby controller 224. That is, the active
controller 220 may have two ports (i.e., two WWNN/WWPN pairs), and
the standby controller 224 may also have two ports (i.e., two
WWNN/WWPN pairs). It should be understood that the configuration of
the storage array 202 may be modified to include more or fewer
ports.
[0040] The LUN mapping is configured to persistently store the
mapping information and maintain consistency across reboots. The
LUN mapping is stored in the LU cache 290. The DSD 260 and SFD 280a
and 280b are provided with direct access to the LU cache 290. As
will be described below in more detail, the LU cache 290a/b will
also store inquiry data and port state information. In one
embodiment, as described with reference to FIGS. 3A and 3B below, a
GDD 297 (Group Data Daemon) and a GMD 298 (Group Management Daemon)
may be used to maintain LUN mapping information for each initiator.
GDD 297, from SCSI perspective, is configured to work with SCSI
layer 104 to handle SCSI Reservation and TMF (task management
function). In one embodiment, GDD 297 will support iSCSI login and
connection re-balancing for when the storage array 202 is
configured/used as an iSCSI target. In one configuration, GDD 297
and GMD 298 operate as a configuration management unit 291.
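As a non-limiting sketch of what the LU cache 290 may hold, the structure below groups the LUN mapping, inquiry data and port state described above; the field names and types are illustrative assumptions only.

    from dataclasses import dataclass, field
    from typing import Dict, Tuple

    @dataclass
    class PortState:
        wwpn: str
        alua_state: str        # e.g. "AO", "ANO", "SB"

    @dataclass
    class LuCache:
        # (initiator, LUN) -> volume handle, used to validate and map LU numbers.
        lun_map: Dict[Tuple[str, int], str] = field(default_factory=dict)
        # Data needed to answer INQUIRY locally, keyed by volume handle.
        inquiry_data: Dict[str, dict] = field(default_factory=dict)
        # Per-port state needed to answer REPORT TARGET PORT GROUPS, keyed by WWPN.
        port_states: Dict[str, PortState] = field(default_factory=dict)
        generation: int = 0    # bumped on every committed update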
[0041] It will be apparent that the present embodiments may be
practiced without some or all of these specific details.
Modification to the modules, code and communication interfaces are
also possible, so long as the defined functionality for the storage
array or modules of the storage array is maintained. In other
instances, well-known process operations have not been described in
detail in order not to unnecessarily obscure the present
embodiments.
Storage Array Example Structure
[0042] FIG. 2 illustrates an example of the architecture of a
storage array 202, according to one embodiment. In one embodiment, storage array 202 includes an active controller 220, a standby
controller 224, one or more HDDs 226, and one or more SSDs 228. In
one embodiment, the controller 220 includes non-volatile RAM
(NVRAM) 218, which is for storing the incoming data as it arrives
to the storage array. After the data is processed (e.g., compressed
and organized in segments (e.g., coalesced)), the data is
transferred from the NVRAM 218 to HDD 226, or to SSD 228, or to
both.
[0043] In addition, the active controller 220 further includes CPU
208, general-purpose RAM 212 (e.g., used by the programs executing
in CPU 208), input/output module 210 for communicating with
external devices (e.g., USB port, terminal port, connectors, plugs,
links, etc.), one or more network interface cards (NICs) 214 for
exchanging data packages through network 256, one or more power
supplies 216, a temperature sensor (not shown), and a storage
connect module 222 for sending and receiving data to and from the
HDD 226 and SSD 228. In one embodiment, the NICs 214 may be
configured for Ethernet communication or Fibre Channel
communication, depending on the hardware card used and the storage
fabric. In other embodiments, the storage array 202 may be
configured to operate using the iSCSI transport or the Fibre
Channel transport.
[0044] Active controller 220 is configured to execute one or more
computer programs stored in RAM 212. One of the computer programs
is the storage operating system (OS) used to perform operating
system functions for the active controller device. In some
implementations, one or more expansion shelves 230 may be coupled
to storage array 202 to increase HDD 232 capacity, or SSD 234
capacity, or both.
[0045] Active controller 220 and standby controller 224 have their
own NVRAMs, but they share HDDs 226 and SSDs 228. The standby
controller 224 receives copies of what gets stored in the NVRAM 218
of the active controller 220 and stores the copies in its own
NVRAM. If the active controller 220 fails, standby controller 224
takes over the management of the storage array 202. When servers,
also referred to herein as hosts, connect to the storage array 202,
read/write requests (e.g., I/O requests) are sent over network 256,
and the storage array 202 stores the sent data or sends back the
requested data to host 204.
[0046] Host 204 is a computing device including a CPU 250, memory
(RAM) 246, permanent storage (HDD) 242, a NIC card 252, and an I/O
module 254. The host 204 includes one or more applications 236
executing on CPU 250, a host operating system 238, and a computer
program storage array manager 240 that provides an interface for
accessing storage array 202 to applications 236. Storage array
manager 240 includes an initiator 244 and a storage OS interface
program 248. When an I/O operation is requested by one of the
applications 236, the initiator 244 establishes a connection with
storage array 202 in one of the supported protocols (e.g., iSCSI,
Fibre Channel, or any other protocol). The storage OS interface 248
provides console capabilities for managing the storage array 202 by
communicating with the active controller 220 and the storage OS 106
executed therein. It should be understood, however, that specific
implementations may utilize different modules, different protocols,
different number of controllers, etc., while still being configured
to execute or process operations taught and disclosed herein.
[0047] As discussed with reference to FIGS. 1A-1C, in a storage
array 202, a kernel level process occurs at the FC driver 116,
which is charged with communicating down with the Fibre Channel
(FC) card 118. The FC card 118, itself includes firmware that
provides the FC processing between the FC driver 116 and the
physical network (PHY) or Fibre Channel fabric. In the illustrated
configuration, the FC driver 116 is in direct communication with
the user space, which includes the FC transport 110 and the SCSI layer 104.
[0048] FIG. 3A illustrates an example logical diagram of a storage
array 202, which includes an active controller 220 and a standby
controller 224. The active controller 220 is shown to include DSD
260 and SFD 280a, while the standby controller 224 includes an SFD
280b. Generally speaking, the DSD 260 is a primary process and the
SFD 280a and SFD 280b are secondary processes. In one embodiment,
the DSD 260 running on the active controller 220 is provided with
access to a fully functioning volume manager (VM) 102, while the
SFD 280a of the active controller 220 and the SFD 280b of the
standby controller 224 are only provided with a VM stub 102'. This
means that VM stub 102' is not provided with access to the storage
of the storage array 202. For example, the SCSI layer 104 may make
calls to the VM stub 102', as the code used for the DSD 260 may be
similar, yet a code object is used so that an error or unavailable
response is received from the VM stub 102', as no access to the VM
is provided. Also shown is that the SFD 280a on the active
controller 220 and the DSD 260 have access to the LU cache 290a. As
noted above, the LU cache 290a is configured to store available LUN
mapping, inquiry data and port state information. The standby
controller 224 also includes an SFD 280b that has access to LU
cache 290b.
[0049] As noted, a SCSI logical unit is visible through multiple
Fibre Channel ports (namely, all of the ports which reside on
arrays within the logical unit's pool). An initiator may issue a
SCSI command to any of these ports, to request the port state for
all ports through which the logical unit may be accessed. In one
embodiment, this requires a CMD 402 (Controller Management Daemon)
to monitor port state for FC target ports on a given array 202,
report initial state and state changes to AMD 404 (Array Management
Daemon). The AMD 404 will forward this information to GDD 297. GDD
297 is a clearing house for all FC target ports in the entire
group, and will disseminate this information to DSD 260. DSD 260
will retrieve the port state and store it into the LU cache
290a.
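A minimal sketch of this reporting chain (CMD to AMD to GDD to DSD to the LU cache) follows; the class and method names are hypothetical, and the sketch omits transport, batching and error handling.

    class Cmd:
        """Controller Management Daemon: observes FC target port changes."""
        def __init__(self, amd):
            self.amd = amd
        def on_port_change(self, wwpn, new_state):
            self.amd.report_port_state(wwpn, new_state)

    class Amd:
        """Array Management Daemon: forwards port state to GDD."""
        def __init__(self, gdd):
            self.gdd = gdd
        def report_port_state(self, wwpn, new_state):
            self.gdd.update_port_state(wwpn, new_state)

    class Gdd:
        """Group Data Daemon: clearing house for all FC target ports."""
        def __init__(self):
            self.subscribed_dsds = []
        def update_port_state(self, wwpn, new_state):
            for dsd in self.subscribed_dsds:
                dsd.apply_port_state(wwpn, new_state)

    class Dsd:
        """Data Services Daemon: stores the port state into the LU cache."""
        def __init__(self, lu_cache):
            self.lu_cache = lu_cache
        def apply_port_state(self, wwpn, new_state):
            self.lu_cache[wwpn] = new_state   # e.g. "AO", "SB"

    # Example wiring of the chain.
    lu_cache = {}
    gdd = Gdd()
    gdd.subscribed_dsds.append(Dsd(lu_cache))
    Cmd(Amd(gdd)).on_port_change("50:06:01:60:3b:a0:12:34", "AO")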
[0050] In one embodiment, the SCSI layer 104 within DSD 260 and SFD
280 will need access to several pieces of system information, in
order to process SCSI commands. This information includes LUN
mapping information, e.g. to build REPORT_LUNS responses, and to
validate and map a logical unit number to its associated volume.
The SCSI layer 104 will need access to the FC port state to build
REPORT_TARGET_PORT_GROUPS response, and to determine the
port_identifier fields for certain SCSI INQUIRY responses. The LU
cache 290a, being accessible to DSD 260 and SFD 280a, enables memory-speed access to this information. The DSD 260 is, in one
embodiment, configured to build the LU cache 290a so it can quickly
retrieve the needed LUN mapping and port state information from GDD
297 and make this information available to SFD 280a and 280b
processes. The SFD 280b on the standby controller 224 maintains
communication with DSD 260 on the active controller 220, to
maintain an up-to-date copy of LU cache 290b.
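By way of illustration, the following sketch shows how a SCSI layer might consult such a cache to build a REPORT_LUNS response and to map a logical unit number to its backing-store volume; the functions assume the LuCache structure sketched earlier and are not an actual implementation.

    def report_luns(lu_cache, initiator):
        # List every LUN the given initiator is allowed to see.
        return sorted(lun for (init, lun) in lu_cache.lun_map if init == initiator)

    def map_lun_to_volume(lu_cache, initiator, lun):
        # Validate the LU number and resolve it to a backing-store volume handle.
        try:
            return lu_cache.lun_map[(initiator, lun)]
        except KeyError:
            raise LookupError("invalid LUN %d for initiator %s" % (lun, initiator))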
[0051] At startup, DSD 260 needs an up-to-date LU cache 290a in
order to handle incoming SCSI commands. Therefore, during startup,
DSD 260 needs to retrieve from GDD 297 the LUN mapping
configuration and current port state information, and populate the
LU cache 290a (or verify the validity of the existing LU cache
290a). DSD 260 also needs to notify the SFD 280b on the standby
controller 224 if the LU cache 290a contents are updated. DSD 260
also needs to interact with the FC kernel driver 116, to claim
responsibility for current and future SCSI I_T nexuses and
commands.
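One possible rendering of this startup sequence is sketched below; the gdd, standby_sfd and fc_driver objects and their method names are assumptions used only to make the ordering of the steps concrete.

    def dsd_startup(lu_cache, gdd, standby_sfd, fc_driver):
        # 1. Pull the authoritative LUN mapping and current port state from GDD.
        lun_map = gdd.fetch_lun_mapping()
        port_states = gdd.fetch_port_states()

        # 2. Populate the local LU cache, or verify an existing cache is valid.
        changed = lu_cache.populate(lun_map, port_states)

        # 3. If the contents changed, notify the SFD on the standby controller
        #    so its copy of the LU cache stays consistent.
        if changed:
            standby_sfd.sync_lu_cache(lu_cache.snapshot())

        # 4. Claim responsibility for current and future SCSI I_T nexuses
        #    and commands with the FC kernel driver.
        fc_driver.claim_scsi_ownership()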
[0052] Thus, in order for DSD 260 to process non LU_CACHE-variety
commands directed to a specific logical unit (e.g. READ and WRITE),
the contents of the LU cache 290a is necessary, but not sufficient.
The SCSI layer 104 within DSD 260 consults the LU cache 290a in
order to validate the specified LU number, and to map the LU number
to a backing-store volume. Then the SCSI command handler can
process the command to the proper volume.
[0053] On the active controller 220, if the SFD 280a gains access (i.e., via port grab when DSD 260 goes down), SFD 280a will get the
latest copy of the LU cache, as previously populated by DSD 260,
which may be by directly accessing a shared memory segment. Thus,
whenever DSD 260 is unavailable (e.g. crashed or is restarting),
SFD 280a services certain SCSI commands. For LU_CACHE-variety
commands, SFD 280a fully processes the commands using only
information from the LU cache 290a. For other commands, SFD 280a
returns appropriate responses indicating that the command could not
be immediately completed.
[0054] On the standby controller 224, SFD 280b always responds to
certain incoming SCSI commands. For LU_CACHE-variety commands, SFD
280b fully processes the commands using only information from the
LU cache 290b. For commands which constitute LUN-level serializing
events (e.g. SCSI Reservations, LUN_RESET), interaction with GDD
297 is required by the DSD 260 which is providing access to the
affected LUN. In one embodiment, SFD 280b on the standby controller
224 is not permitted to communicate directly with GDD 297, so this
is achieved using a proxy service provided for this purpose by DSD
260 on the active controller 220. If DSD 260 is available, the
command is handled using this DSD proxy service. If DSD 260 is not
available, error response is provided. For other commands, SFD 280b
returns SCSI responses as appropriate for such commands received on
ALUA standby ports.
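A short sketch of this proxy path, with hypothetical method names, is shown below: the standby SFD cannot reach GDD directly, so serializing commands are completed through the DSD proxy service when DSD is available, and an error response is returned otherwise.

    def handle_serializing_command(cmd, dsd_proxy):
        """Handle LUN-level serializing events (e.g. reservations, LUN_RESET)
        received by the SFD on the standby controller."""
        if dsd_proxy is None or not dsd_proxy.is_available():
            # DSD on the active controller is down: return an error response.
            return ("BUSY", "active-controller data path unavailable")
        # DSD coordinates with GDD on behalf of the standby SFD.
        return dsd_proxy.forward(cmd)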
[0055] In general and in one configuration, the two processes
(e.g., primary process (DSD 260) and secondary process (SFD 280))
provide various advantages and efficiencies in storage
architectures. One technical advantage is seamless transition from
standby-mode to full active-optimized mode on the standby
controller, as it becomes the active controller. Another technical
advantage is reduced disruption on a single controller during short
periods of DSD 260 down time (e.g. DSD crashes, but failover not
triggered).
[0056] In one configuration, a storage array 202 includes an active
controller 220 and a standby controller 224. As mentioned above,
the LU cache 290a is a module shared by DSD 260 and SFD 280a that
caches data needed to serve certain SCSI commands. With multi-LUN
target Fibre Channel, the SFD 280a will also be serving SCSI
commands, but SFD 280a does not have access to VM 102. Multi-LUN
target is an implementation that requires tracking of LUN to Volume
mappings. LU cache 290a is designed as a way for SFD 280a to
provide volume attribute and LUN inventory information to the SCSI
layer 104 in the absence of VM 102 access.
[0057] Conceptually, LU cache 290a sits between the SCSI layer 104
in DSD 260 and SFD 280a (i.e., user space), and the configuration
information is stored in a configuration database 296, referred to
herein as a scale-out database. As an advantage, the configuration
database 296 stores configuration information, which may be used
for scale-out and non-scale out implementations. The configuration
database 296, in one embodiment, is designed as a persistent
storage of LUN data (e.g., LUN inventory information (i.e., LUN
mapping), inquiry data, port state info, etc.), which is provided
to the DSD 260 by GDD 297 (e.g., based on changes made using GMD
298). The configuration database 296 generally stores configuration data. LU cache 290a presents access interfaces to
SCSI layer 104 and modifier interfaces to GMD 298 and GDD 297. In
one embodiment, the GMD 298 and GDD 297 are collectively operating
as a configuration management unit 291 for the array 202, as shown
in FIG. 3A. The configuration management unit 291, e.g., one or
both of GDD 297 and GMD 298, is further shown interfaced with the
configuration database 296. In one embodiment, LU cache 290 is
implemented as a library linked in by SFD 280a, DSD 260 and GDD
297.
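A non-limiting sketch of such a library, separating the access interfaces used by the SCSI layer 104 from the modifier interfaces used by GMD 298 and GDD 297, might look as follows; the method names are assumptions.

    class LuCacheLibrary:
        """LU cache as a library linked in by SFD, DSD and GDD."""
        def __init__(self):
            self._lun_map = {}      # (initiator, lun) -> volume handle
            self._port_state = {}   # wwpn -> ALUA state

        # --- access interfaces (used by the SCSI layer in DSD/SFD) ---
        def lookup_volume(self, initiator, lun):
            return self._lun_map.get((initiator, lun))

        def port_states(self):
            return dict(self._port_state)

        # --- modifier interfaces (used by GMD/GDD, via DSD) ---
        def apply_lun_mapping(self, initiator, lun, volume):
            self._lun_map[(initiator, lun)] = volume

        def apply_port_state(self, wwpn, alua_state):
            self._port_state[wwpn] = alua_state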
[0058] In one embodiment, the configuration management unit 291
includes GDD 297 and GMD 298. In specific examples, GMD 298 (Group
Management Daemon) is a process primarily responsible for system
management of a storage group. A storage group is a cluster of
arrays with a single shared management plane. In one example, GMD
298 provides APIs (programmatic interfaces) and CLIs (command line
interfaces) by which administrators can perform management
operations, such as provisioning and monitoring storage. In one
example, GDD 297 (Group Data Daemon) is a process responsible for
coordinating distributed data path operations in a storage group.
For example, this may include acquiring and checking SCSI
reservations, and iSCSI login permissions.
[0059] GMD 298 and GDD 297 further provide an interface to SODB
(i.e., the configuration database 296), which is a persistent store
for configuration information in a storage group, and it
communicates with DSD 260, AMD 404, and other processes to perform
management activities. The information in LU cache is a subset of
the information in SODB. LU cache is initialized by fetching data
from GDD 297, and then incremental updates are applied via GMD
298.
[0060] FIG. 3B illustrates another embodiment, wherein the VM stubs
102' are not part of the design. In this embodiment, different SCSI
layer 104 libraries are used for SFD 280 (i.e., 280a and 280b) and
DSD 260. By providing different SCSI layer 104 libraries for SFD
280 and DSD 260, calls made to the VM 102 via the SCSI layer 104 of
either the active controller SFD 280a or the standby controller SFD
280b will not be provided with access to the VM 102. In one
embodiment, an error may be returned or an unavailable response may
be returned by the SCSI layer 104 library of the respective SFD
280. In one embodiment, the SCSI layer 104 of the SFD 280a and 280b
implement error handling, which avoids the need for a VM stub 102'.
On the other hand, if the SCSI layer 104 of the active controller
DSD 260 receives the request to access the VM 102, access will be
provided.
[0061] FIG. 4 illustrates how LU cache 290a and 290b are distributed
in each array 202a and 202b. Each DSD 260 of each storage array
will have a copy of LU cache 290a, which it manages and provides
access to SFD 280a on the active controller 220 and sends to SFD
280b on the standby controller 224. On the active controller 220,
one copy of LU cache 290a is shared by DSD 260 and SFD 280a. As
noted above, the configuration database (SODB) is persistent storage for holding LUN mapping, inquiry data, and port state information. In one embodiment, a GDD 297 (Group Data
Daemon) and a GMD 298 (Group Management Daemon) will maintain the
persistent storage and provide this data to the DSDs 260 for
populating the LU caches 290a and 290b. As can be appreciated,
consistency is most important between controllers (i.e., active
controller 220 and standby controller 224) within a single storage
array 202, because in the event that one controller fails, the other one will have an up-to-date and correct copy of LU cache 290.
Although consistency in a pool configuration is also beneficial, it
is less important to maintain exact consistency between different arrays in a pool. Accordingly, in one embodiment, an implementation enforces that two controllers 220/224 in a storage array 202 are consistent, but allows temporary inconsistency between two arrays in
a pool.
[0062] In some implementations, the configuration database 296 will
also store other information that is sent to DSD 260 for populating
LU cache 290a. This information may include VolSnap, which is a
combination of VolUid and SnapshotUid. These values are stored in
the configuration database 296 and sent to or retrieved by DSD 260
via GMD 298 using Simple Object Access protocol (SOAP) calls. In
one embodiment, logical unit (LU) serial numbers may be derived
from the VolSnap. VolSnap may be used as a vol handle for
(initiator, lun) and vol mapping in LU cache 290. Accordingly,
VolSnap may be stored in LU cache 290a. Additionally, Iqn
Uniquifier, which is a hash of the volume's containing group UID (i.e., the UID of the group that contained the volume when the volume was created, which can change after a group merge), may also be stored
in LU cache 290a. In one embodiment, this information can be used
together with VolSnap to create a serial number.
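Purely as an illustration, one way such a serial number could be derived from the VolSnap (VolUid plus SnapshotUid) together with the Iqn Uniquifier is sketched below; the actual derivation is not specified here, so the hashing and formatting are assumptions.

    import hashlib

    def lu_serial_number(vol_uid, snapshot_uid, iqn_uniquifier):
        # VolSnap is the combination of VolUid and SnapshotUid.
        volsnap = "%s:%s" % (vol_uid, snapshot_uid)
        digest = hashlib.sha1((volsnap + ":" + iqn_uniquifier).encode()).hexdigest()
        return digest[:16].upper()   # e.g. a 16-character serial string

    print(lu_serial_number("vol-01234", "snap-0001", "a1b2c3d4"))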
[0063] These are just some examples of types of data that can be
cached in LU cache 290a/b, and other data may also be included
depending on the implementation. For example, LUN inventory data
can also be stored in LU cache 290a, which in a multi-LUN target is
needed to identify a list of LUNs available to an initiator. This
is also needed to respond to REPORT_LUNS which is a command
identified as requiring fast response from SFD 280a. Volume Size is
a volume attribute stored in the configuration database 296 and is
propagated to VM 102 via GMD 298 and DSD 260 using SOAP. In one
example, this data is used to respond to REPORT_CAPACITY which is a
command identified as requiring fast response from SFD 280a.
Accordingly, it should be understood that configuration database
296 and LU cache 290a may include a variety of data that can be
shared across storage arrays and must be kept consistent among the
active and standby controllers and among arrays.
[0064] FIG. 5 illustrates an architecture, which includes DSD 260,
LU cache 290a and 290b, and SFD 280a and 280b. The solid arrows
show inter-process communication, while dotted arrows show
interaction between a process and a data store. In this example, a solid arrow's tail corresponds to a client of cache data and its head corresponds to the server of cache data. A dotted arrow's tail, in turn, points to a process and its head points to the data store. In this example, storage array 202b
(Array 2) is the group leader (GL) of a cluster of two arrays. For
this reason, GMD 298 and configuration database 296 are also
managed by the GL, which is Array 2. If a single array were
present, then GMD 298 and configuration database 296 would be
handled by that single array. Of course, grouping of arrays enables
clustering (e.g., pooling) of arrays for performance, such as in
scale-out implementations.
[0065] The SFD 280a is a process, wherein a single instance runs on
every controller (220 and 224). SFD 280a includes SCSI layer 104
and Transport layer 110 (see FIG. 3), but does not incorporate the
VM 102. In one embodiment, the SFD 280a has the ability to
independently respond to some SCSI commands like REPORT LUNS,
INQUIRY, REPORT TARGET PORT GROUPS, by consulting the LU cache 290a
locally on the controller that it is running on. However, an SFD 280a is not able to handle READs and WRITEs, as SFD 280a and 280b do not have access to a VM 102. As shown, the SFD 280b also has the
ability to forward SCSI requests to DSD 260 on the active
controller 220.
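For purposes of illustration only, the following minimal sketch shows how an SFD-style dispatcher might serve INQUIRY, REPORT LUNS, and RTPG from the locally held LU cache while forwarding READ/WRITE commands to DSD 260 on the active controller. The LuCache interface, the forward_to_dsd() stub, and the empty responses are assumptions made for this sketch and are not taken from the described system.

    #include <cstdint>
    #include <vector>

    // Standard SCSI operation codes for the commands discussed above.
    enum : uint8_t {
        OP_INQUIRY     = 0x12,
        OP_REPORT_LUNS = 0xA0,
        OP_MAINT_IN    = 0xA3,  // MAINTENANCE IN; RTPG is its 0x0A service action
        OP_READ_10     = 0x28,
        OP_WRITE_10    = 0x2A
    };

    struct ScsiResponse { std::vector<uint8_t> data; uint8_t status = 0; };

    // Hypothetical view of the locally held LU cache (290a or 290b).
    struct LuCache {
        ScsiResponse inquiry() const     { return {}; }   // inquiry data
        ScsiResponse report_luns() const { return {}; }   // LUN inventory
        ScsiResponse rtpg() const        { return {}; }   // ALUA port-group state
    };

    // Stub standing in for forwarding a request to DSD 260 on the active controller.
    ScsiResponse forward_to_dsd(uint8_t /*opcode*/) { return {}; }

    ScsiResponse sfd_handle(uint8_t opcode, const LuCache& cache) {
        switch (opcode) {
            case OP_INQUIRY:     return cache.inquiry();      // answered locally
            case OP_REPORT_LUNS: return cache.report_luns();  // answered locally
            case OP_MAINT_IN:    return cache.rtpg();         // answered locally
            default:             return forward_to_dsd(opcode); // READ/WRITE need the VM
        }
    }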
[0066] In one embodiment, DSD 260 is able to respond to requests
from SFD 280a and 280b. In one configuration, the LU cache 290a is
a single instance that is available on every controller. In one
embodiment, LU cache 290a is required primarily by SFD 280a, but
DSD 260 may also use LU cache 290a. As mentioned, LU cache 290a has
all the information necessary for SFD 280a to generate SCSI
responses to a small set of commands that include at least Inquiry,
Report LUNs, and RTPG. GMD 298 is enhanced to respond to on-change
notifications from DSD 260 (local or remote) and to generate
on-change notifications that affect LU cache 290a state.
Configuration database 296, in one embodiment, can be enhanced to
also store new ALUA related state information.
[0067] In one embodiment, management changes may be made at the
active controller 220, and those changes should consistently be
made at the standby controller 224. As discussed and shown in FIG.
4, both the active controller 220 and the standby controller 224
include a LU cache 290a and LU cache 290b, respectively.
Consistency is needed so that the LU cache 290a of the active
controller 220 is maintained in sync with the LU cache 290b of the
standby controller 224. This is particularly needed in cases of
failover, wherein the active controller 220 may fail and the
standby controller 224 is required to take over the role of active
controller 220. If consistency is not maintained, there may be
cases where changes and/or updates to configuration data (e.g., LUN
mapping, inquiry data, and port state info) in the LU cache 290a of the active controller 220 are not yet present in the LU cache 290b of the standby controller 224. In another case, the LU cache 290b
of the standby controller 224 may have a state that is ahead of the
state of the LU cache 290a of the active controller. Thus, within a
storage array 202, it is important for the controllers to have a
consistent LU cache 290a and 290b.
[0068] Consistency in LU cache 290a and 290b within a storage array is an even stronger requirement than consistency between arrays within a pool of arrays. The reason is that if an inconsistency between arrays causes an initiator to miss a path, it still has other available paths through other arrays. Those paths may be less performant, but there is no loss of service. However, if an initiator misses a path from one controller in an array, and a failover occurs, the initiator will lose all paths.
[0069] In one embodiment, changes to LU cache 290a flow from GMD 298 to DSD 260 to SFD 280b. In this configuration, DSD 260 does not get changes from SFD 280a/b. If DSD 260 has crashed, then changes can occur in GMD 298, but those will remain in a pending state and will be retried.
[0070] As noted, the goal is for DSD 260 and SFD 280b to always
have exactly the same data. Unfortunately, it is not always possible for two processes communicating across a potentially unstable link to remain consistent, as it is always possible for a message from one to the other to be lost. Given this inherent constraint, one method is provided in which the order in which updates are made reduces inconsistency.
[0071] FIG. 6 illustrates the active controller 220 with DSD 260
and the standby controller 224 with SFD 280b. Also shown is that
GMD 298 may enable configuration updates by GDD 297 to the LUNs.
The configuration updates are stored in the configuration database
(SODB) and GDD 297 will send to DSD 260 changes [1] that are pushed
[2] to SFD 280b. At this point, if SFD 280b is operational, SFD
280b will commit the changes [2.5] to LU cache 290b in the standby
controller 224. Then, SFD 280b acknowledges [3] back to DSD 260
that the changes were received and/or committed. At this point, DSD
260 is allowed to commit [4] the changes to LU cache 290a of the
active controller 220.
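A minimal sketch of this push-then-commit ordering follows, assuming a simple acknowledgement-based transport between DSD 260 and SFD 280b; the class and function names are illustrative, and the transport and commit routines are stubs rather than the actual implementation.

    // Ordering of steps [2]-[4] from FIG. 6: push the delta to the standby SFD,
    // wait for its acknowledgement, and only then commit to the active LU cache.
    struct LuCacheUpdate { /* delta produced from the configuration database */ };

    class ActiveDsd {
    public:
        // Returns true once the update is committed in order on both controllers.
        bool apply_update(const LuCacheUpdate& delta) {
            bool acked = push_to_standby(delta);   // [2] push; SFD commits [2.5]
            if (!acked) {
                // [3] not received: leave the change pending; GMD/GDD retries [1]-[4].
                return false;
            }
            commit_local(delta);                   // [4] commit to active LU cache 290a
            return true;
        }

    private:
        bool push_to_standby(const LuCacheUpdate&) { return true; }  // stub transport
        void commit_local(const LuCacheUpdate&) {}                   // stub commit
    };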
[0072] In one embodiment, if step [3] fails, SFD 280b will have
committed a change that DSD 260 still has not committed. This is
because DSD 260 is configured to commit after it receives the
acknowledgement from SFD 280b, and therefore SFD 280b will be ahead
(i.e., will have newer data, not yet committed to LU cache 290a of
the active controller 220). In one embodiment, the GMD 298
processes are configured to retry telling DSD 260 of the changes, following steps [1]-[4], until this situation is eventually resolved. That is, DSD 260 and SFD 280b will eventually have the same LU cache data. In another embodiment, it is possible to order the operations so that DSD 260 commits first, but that would risk a new LUN coming online without any standby paths, which would be a higher availability risk; it would nevertheless be possible as an alternate configuration.
[0073] In one embodiment, a ping/pong heartbeat is processed
between SFD 280b of the standby controller 224 and DSD 260 of the
active controller 220. SFD 280b sends a ping message to DSD 260 and
DSD 260 responds with a pong every second. If the ping does not go
through then the pong response will also not be sent. If SFD 280b
has not received a pong for a period of time (e.g., in 5 minutes)
it restarts. If DSD 260 has not received the ping in a period of
time (e.g., in 5 minutes) it can infer that SFD 280b has restarted
because no pong messages have been sent. In one embodiment, DSD 260
must push any delta update to SFD 280b before committing, but in
this state it knows SFD 280b is not up (e.g., the time period has passed), so it is free to accept more changes. The DSD 260 knows that if the standby controller 224 cannot reach the active controller 220 for some time (e.g., 5 minutes), then the standby controller 224 is programmed to restart itself so that it can then get LU cache 290a from DSD 260 running on the active controller 220.
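For illustration, the sketch below captures the two timeout decisions described above, using the 5-minute figure from the example; the Heartbeat structure and its method names are introduced here for clarity and are not part of the described system.

    #include <chrono>

    using Clock = std::chrono::steady_clock;

    struct Heartbeat {
        std::chrono::seconds deadline{300};                     // e.g., 5 minutes
        Clock::time_point last_pong_received = Clock::now();    // tracked by SFD 280b
        Clock::time_point last_ping_received = Clock::now();    // tracked by DSD 260

        // Standby side: SFD restarts itself if no pong arrives within the deadline.
        bool standby_should_restart(Clock::time_point now) const {
            return (now - last_pong_received) >= deadline;
        }

        // Active side: after the same deadline, DSD may infer that SFD has
        // restarted, so it is free to commit unacknowledged updates and accept
        // new changes.
        bool active_may_commit_unacknowledged(Clock::time_point now) const {
            return (now - last_ping_received) >= deadline;
        }
    };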
[0074] In one embodiment, the secondary process (SFD 280) must
restart itself after the predetermined period of time (e.g., 5
minutes). The primary process (DSD 260) is allowed to commit an
unacknowledged transaction (e.g., a transaction for which a
confirmation of commitment is not received), because the primary
process knows (e.g., by programming) that the secondary (SFD 280)
should have restarted. However, the primary process (DSD 260) is
not just "waiting a period of time" but specifically waiting a
pre-defined amount of time after which secondary process (SFD 280)
must have restarted. Accordingly, the primary process commits the
update to the first LU cache after waiting the period of time, as
the secondary process is programmed to have restarted after the
period of time has been reached.
[0075] It should further be understood that the period of time of 5
minutes is just an example, and lower time settings, such as 30
seconds, 1 minute, 2 minutes, 3 minutes, or 4 minutes, or a time selected between 1 second and 30 minutes, may be used, depending on
the configuration of the system.
[0076] In one embodiment, LU cache 290 updates are initiated by
configuration changes in GMD 298 and configuration database 296. To
manage LUN inventory, one embodiment will use application
programming interfaces (APIs). Some APIs may include, just for
example:
[0077] void add_initiator_to_igroup(wwn_t wwpn, uint64_t
igroupid)
[0078] void add_lun_to_igroup(uint64_t igroupid, uint64_t lun,
VolSnap vs)
[0079] void add_vol(VolSnap vs, uint64_t capacity)*
[0080] void remove_initiator_from_igroup(wwn_t wwpn, uint64_t
igroup)
[0081] void remove_lun_from_igroup(uint64_t igroupid, uint64_t lun,
VolSnap vs)
[0082] void remove_vol(VolSnap vs, uint64_t capacity)
[0083] These calls are made in GDD 297 when DSD 260 requests a new
cache. GDD 297 reads SODB (i.e., the configuration database 296)
and builds a complete LU cache using these calls, and transmits it
to DSD 260 via a remote procedure call (RPC). These calls are also
made in DSD 260 when GMD 298 does incremental updates. In one
embodiment, calls for port information can also be made via APIs, such as port->portset->igroup mappings.
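As a hypothetical usage example only, the sketch below shows how GDD 297 might invoke the calls listed above to rebuild a complete LU cache from an SODB snapshot before transmitting it to DSD 260 via RPC. The SodbSnapshot row types, the wwn_t definition, and the stubbed call bodies are assumptions made for this illustration.

    #include <cstdint>
    #include <vector>

    typedef uint64_t wwn_t;                                   // assumed WWN representation
    struct VolSnap { uint64_t vol_uid, snapshot_uid; };

    // Stubs standing in for the cache-building calls named above; the real
    // implementations populate the LU cache containers.
    void add_vol(VolSnap, uint64_t /*capacity*/) {}
    void add_initiator_to_igroup(wwn_t /*wwpn*/, uint64_t /*igroupid*/) {}
    void add_lun_to_igroup(uint64_t /*igroupid*/, uint64_t /*lun*/, VolSnap) {}

    struct LunMapping  { uint64_t lun; VolSnap vs; };
    struct IgroupRow   { uint64_t id; std::vector<wwn_t> initiators;
                         std::vector<LunMapping> luns; };
    struct VolumeRow   { VolSnap vs; uint64_t capacity; };
    struct SodbSnapshot { std::vector<VolumeRow> volumes;
                          std::vector<IgroupRow> igroups; };

    // GDD-style rebuild of a complete LU cache from an SODB snapshot.
    void build_full_lu_cache(const SodbSnapshot& sodb) {
        for (const auto& v : sodb.volumes)
            add_vol(v.vs, v.capacity);
        for (const auto& ig : sodb.igroups) {
            for (wwn_t wwpn : ig.initiators)
                add_initiator_to_igroup(wwpn, ig.id);
            for (const auto& m : ig.luns)
                add_lun_to_igroup(ig.id, m.lun, m.vs);
        }
        // The populated cache is then transmitted to DSD 260 via RPC.
    }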
[0084] In one embodiment, the SFD 280 and DSD 260 need to request
an LU cache 290 (i.e., the most current version of the cache contents)
on startup, and GDD 297 needs to push out a new LU cache 290 after
management operations.
[0085] In one example, the LU cache is implemented as a collection
of containers stored in logically contiguous memory. This structure
containing the LU cache data structures is called the LucStore. The LucStore, in one example, contains three maps (one possible rendering of these containers is sketched following the lists below):
[0086] FcpInitiatorMap: Initiator WWPN->IgroupIdSet
[0087] IgroupMap: Igroups ID->LunMap for the Igroup
[0088] VolumeMap: VolSnap->Volume attributes
[0089] The contents of the Initiator and Igroup Maps are themselves
containers:
[0090] IgroupIdSet: A set of igroup ids
[0091] LunMap: LU number->VolSnap
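One possible C++ rendering of the maps listed above is sketched below. The std::map and std::set containers are used here only to show the key/value relationships; the actual LucStore is stored in logically contiguous memory as described above, and the attribute fields and ordering operator are assumptions for this sketch.

    #include <cstdint>
    #include <map>
    #include <set>

    typedef uint64_t wwn_t;
    struct VolSnap { uint64_t vol_uid, snapshot_uid; };
    struct VolumeAttrs { uint64_t capacity; /* other volume attributes */ };

    // Ordering needed so VolSnap can serve as a map key in this sketch.
    inline bool operator<(const VolSnap& a, const VolSnap& b) {
        return a.vol_uid != b.vol_uid ? a.vol_uid < b.vol_uid
                                      : a.snapshot_uid < b.snapshot_uid;
    }

    using IgroupIdSet     = std::set<uint64_t>;              // a set of igroup ids
    using LunMap          = std::map<uint64_t, VolSnap>;     // LU number -> VolSnap
    using FcpInitiatorMap = std::map<wwn_t, IgroupIdSet>;    // Initiator WWPN -> IgroupIdSet
    using IgroupMap       = std::map<uint64_t, LunMap>;      // Igroup ID -> LunMap for the Igroup
    using VolumeMap       = std::map<VolSnap, VolumeAttrs>;  // VolSnap -> Volume attributes

    // Collection of containers making up the LucStore.
    struct LucStore {
        FcpInitiatorMap initiators;
        IgroupMap       igroups;
        VolumeMap       volumes;
    };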
[0092] In one embodiment, the source of truth for the LU cache is what is stored in the configuration database 296 (SODB). The LucStore structure is created by GDD 297 on the group leader (GL), transmitted to DSD 260 on each array upon request, and then transmitted from DSD 260 to SFD 280. The LU cache is therefore stored in a shared memory object by DSD 260 and SFD 280. Further, in one example, GDD 297 on the group leader (GL) may be able to use the same shared memory object as DSD 260. One alternative is to store the LU cache in a file so that it can be recovered after a reboot.
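A minimal sketch of establishing such a shared memory object with the POSIX shm_open/mmap interfaces is shown below; the object name, size handling, and error handling are simplified assumptions and do not reflect the actual implementation.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>

    // Create or attach to a named shared memory object and map it into the
    // calling process (e.g., DSD or SFD); returns nullptr on failure.
    void* map_lu_cache_shm(const char* name /* e.g., "/lu_cache" */, size_t size) {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0) return nullptr;
        if (ftruncate(fd, static_cast<off_t>(size)) != 0) { close(fd); return nullptr; }
        void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                               // the mapping remains valid
        return addr == MAP_FAILED ? nullptr : addr;
    }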
[0093] In general, when changes or updates are made to the
configuration database 296 via the configuration management unit 291, those updates that affect LU cache must be propagated or pushed down to DSD 260, which in turn propagates them to SFD 280 for commitment to LU cache 290b and then LU cache 290a. Additionally, changes to port information are communicated up to the configuration management unit 291 using CMD 402, which communicates directly with the FC
Kernel driver 116, and the AMD 404 (e.g., see FIGS. 3A/3B). Once
the configuration management unit 291 has changes to port data
(e.g., changes and/or updates), this data can be pushed to DSD 260,
which in turn propagates to SFD 280 for commitment to LU cache 290b
and then LU cache 290a. This process ensures that the SCSI layer
104 has the most current LUN mapping data and/or most current port
data. Having the most current and accurate information in LU cache
290a and 290b ensures that the SCSI layer 104 can more rapidly
respond to requests from initiators with accurate data. This fast
and accurate reply is also beneficial in failover cases, wherein the standby controller 224 may become the active
controller 220. This benefit is also useful in cases where the DSD
260 may go down and the SFD 280 may take over the primary role of
responding to requests, using LU cache 290b. Again, these are just
examples, and those skilled in the art will appreciate that various
combinations of the disclosed embodiments and elements are
possible.
Example Storage Array Infrastructure
[0094] In some embodiments, a plurality of storage arrays may be
used in data center configurations or non-data center
configurations. A data center may include a plurality of servers, a
plurality of storage arrays, and combinations of servers and other
storage. It should be understood that the exact configuration of
the types of servers and storage arrays incorporated into specific
implementations, enterprises, data centers, small office
environments, business environments, and personal environments,
will vary depending on the performance and storage needs of the
configuration.
[0095] In some embodiments, servers may be virtualized utilizing
virtualization techniques, such that operating systems can be
mounted on hypervisors to allow hardware and other resources to be
shared by specific applications. In virtualized environments,
storage is also accessed by virtual hosts that provide services to
the various applications and provide data and store data to
storage. In such configurations, the storage arrays can be
configured to service specific types of applications, and the
storage functions can be optimized for the type of data being
serviced.
[0096] For example, a variety of cloud-based applications are
configured to service specific types of information. Some
information requires that storage access times are sufficiently
fast to service mission-critical processing, while other types of
applications are designed for longer-term storage, archiving, and
more infrequent accesses. As such, a storage array can be
configured and programmed for optimization that allows servicing of
various types of applications. In some embodiments, certain
applications are assigned to respective volumes in a storage array.
Each volume can then be optimized for the type of data that it will
service.
[0097] As described above with reference to FIG. 2, the storage
array 202 can include one or more controllers 220, 224. One
controller serves as the active controller 220, while the other
controller 224 functions as a backup controller (standby). For
redundancy, if the active controller 220 were to fail, immediate
transparent handoff of processing (i.e., fail-over) can be made to
the standby controller 224. Each controller is therefore configured
to access storage 1130, which in one embodiment includes hard disk
drives (HDD) 226 and solid-state drives (SSD) 228. As mentioned
above, SSDs 228 are utilized as a type of flash cache, which
enables efficient reading of data stored to the storage 1130.
[0098] As used herein, SSDs functioning as a "flash cache" should be understood to mean that the SSDs operate as a cache for block-level data access, servicing read operations instead of requiring reads only from HDDs 226. Thus, if data is present in SSDs 228,
reading will occur from the SSDs instead of requiring a read to the
HDDs 226, which is a slower operation. As mentioned above, the
storage operating system 106 is configured with an algorithm that
allows for intelligent writing of certain data to the SSDs 228
(e.g., cache-worthy data), and all data is written directly to the
HDDs 226 from NVRAM 218.
[0099] The algorithm, in one embodiment, is configured to select
cache-worthy data for writing to the SSDs 228, in a manner that
provides an increased likelihood that a read operation will access
data from SSDs 228. In some embodiments, the algorithm is referred
to as a cache accelerated sequential layout (CASL) architecture,
which intelligently leverages unique properties of flash and disk
to provide high performance and optimal use of capacity. In one
embodiment, CASL caches "hot" active data onto SSD in real
time--without the need to set complex policies. This way, the
storage array can instantly respond to read requests--as much as
ten times faster than traditional bolt-on or tiered approaches to
flash caching.
[0100] For purposes of discussion and understanding, reference is
made to CASL as being an algorithm processed by the storage OS.
However, it should be understood that optimizations, modifications,
additions, and subtractions to versions of CASL may take place from
time to time. As such, reference to CASL should be understood to
represent exemplary functionality, and the functionality may change
from time to time, and may be modified to include or exclude
features referenced herein or incorporated by reference herein.
Still further, it should be understood that the embodiments
described herein are just examples, and many more examples and/or
implementations may be defined by combining elements and/or
omitting elements described with reference to the claimed
features.
[0101] In some implementations, SSDs 228 may be referred to as
flash, or flash cache, or flash-based memory cache, or flash
drives, storage flash, or simply cache. Consistent with the use of
these terms, in the context of storage array 102, the various
implementations of SSD 228 provide block level caching to storage,
as opposed to instruction level caching. As mentioned above, one
functionality enabled by algorithms of the storage OS 106 is to
provide storage of cache-worthy block level data to the SSDs, so
that subsequent read operations are optimized (i.e., reads that are
likely to hit the flash cache will be stored to SSDs 228, as a form
of storage caching, to accelerate the performance of the storage
array 102).
[0102] In one embodiment, it should be understood that the "block
level processing" of SSDs 228, serving as storage cache, is
different than "instruction level processing," which is a common
function in microprocessor environments. In one example,
microprocessor environments utilize main memory, and various levels
of cache memory (e.g., L1, L2, etc.). Instruction level caching is differentiated further because instruction level caching is
block-agnostic, meaning that instruction level caching is not aware
of what type of application is producing or requesting the data
processed by the microprocessor. Generally speaking, the
microprocessor is required to treat all instruction level caching
equally, without discriminating or differentiating processing of
different types of applications.
[0103] In the various implementations described herein, the storage
caching facilitated by SSDs 228 is implemented by algorithms
exercised by the storage OS 106, which can differentiate between
the types of blocks being processed for each type of application or
applications. That is, block data being written to storage 1130 can
be associated with block data specific applications. For instance,
one application may be a mail system application, while another
application may be a financial database application, and yet
another may be for a website-hosting application. Each application
can have different storage accessing patterns and/or requirements.
In accordance with several embodiments described herein, block data
(e.g., associated with the specific applications) can be treated
differently when processed by the algorithms executed by the
storage OS 106, for efficient use of flash cache 228.
[0104] Continuing with the example of FIG. 2, the active
controller 220 is shown including various components that enable
efficient processing of storage block reads and writes. As
mentioned above, the controller may include an input/output (I/O)
210, which can enable one or more machines to access functionality
of the storage array 202. This access can provide direct access to
the storage array, instead of accessing the storage array over a
network. Direct access to the storage array is, in some
embodiments, utilized to run diagnostics, implement settings,
implement storage updates, change software configurations, and/or
combinations thereof. As shown, the CPU 208 is communicating with
storage OS 106.
[0105] One or more embodiments can also be fabricated as computer
readable code on a non-transitory computer readable storage medium.
The non-transitory computer readable storage medium is any
non-transitory data storage device that can store data, which can
thereafter be read by a computer system. Examples of the
non-transitory computer readable storage medium include hard
drives, network attached storage (NAS), read-only memory,
random-access memory, CD-ROMs, CD-Rs, CD-RWs, magnetic tapes and
other optical and non-optical data storage devices. The
non-transitory computer readable storage medium can include
computer readable storage medium distributed over a network-coupled
computer system so that the computer readable code is stored and
executed in a distributed fashion.
[0106] The method operations were described in a specific order,
but it should be understood that other housekeeping operations may
be performed in between operations, or operations may be adjusted
so that they occur at slightly different times, or may be
distributed in a system which allows the occurrence of the
processing operations at various intervals associated with the
processing, as long as the processing of the overlay operations is
performed in the desired way.
[0107] Although the foregoing embodiments have been described in
some detail for purposes of clarity of understanding, it will be
apparent that certain changes and modifications can be practiced
within the scope of the appended claims. Accordingly, the present
embodiments are to be considered as illustrative and not
restrictive, and the embodiments are not to be limited to the
details given herein, but may be modified within the scope and
equivalents of the described embodiments and sample appended
claims.
* * * * *