U.S. patent application number 11/144796 was filed with the patent office on 2005-06-06 and published on 2006-10-05 as publication number 20060224784, for a storage system providing stream-oriented performance assurance.
The invention is credited to Naoto Matsunami and Akira Nishimoto.
Application Number: 20060224784 (Appl. No. 11/144796)
Family ID: 37071950
Filed: 2005-06-06
Published: 2006-10-05
United States Patent Application: 20060224784
Kind Code: A1
Inventors: Nishimoto; Akira; et al.
Publication Date: October 5, 2006
Storage system providing stream-oriented performance assurance
Abstract
The transfer rates of individual streams are assured even if plural streams
are mixed. A disk array system required to process multiple streams
from a host computer carries out recognition of the transfer rates and
recognition of the drive performance and fault processing time,
determines the size of the sequential buffer holding each stream, and
determines the drive I/O size. Using the transfer rate and buffer
size determined in these processing steps, the required end times by
which prefetch and destage must be terminated are found. Based on
these times, the I/O priorities are determined.
Inventors: Nishimoto; Akira (Sagamihara, JP); Matsunami; Naoto (Hayama, JP)
Correspondence Address: ANTONELLI, TERRY, STOUT & KRAUS, LLP, 1300 North Seventeenth Street, Suite 1800, Arlington, VA 22209-3873, US
Family ID: 37071950
Appl. No.: 11/144796
Filed: June 6, 2005
Current U.S. Class: 710/36
Current CPC Class: G06F 3/0659 (2013.01); G06F 3/0611 (2013.01); G06F 3/0689 (2013.01)
Class at Publication: 710/036
International Class: G06F 3/00 (2006.01) (G06F 003/00)
Foreign Application Data
Date | Code | Application Number
Apr 4, 2005 | JP | 2005-107013
Claims
1. A storage system comprising: a controller connectable to a
plurality of computers; and a plurality of storage devices
connected to the controller, wherein the plurality of storage
devices store a plurality of contents, wherein the controller
receives a first request from a one of the plurality of computers
which requests access to a first content of the plurality of
contents, wherein the controller calculates a first data transfer
rate of the first request, and wherein the controller executes a
plurality of processes of the first request based on a result of the
calculation.
2. A storage system according to claim 1, wherein the controller
executes the plurality of processes within a first period of time
to maintain the first data transfer rate, and wherein the first
period of time is calculated based on a resource and performance of
the storage system, and the first data transfer rate.
3. A storage system according to claim 2, wherein the controller
receives a second request which requests access to a second content
of the plurality of contents, calculates a second data transfer
rate of the second request, and executes a plurality of second
processes of the second request based on the second data transfer
rate.
4. A storage system according to claim 3, wherein the controller
executes the plurality of second processes within a second period
of time to maintain the second data transfer rate, and wherein the
second period of time is calculated based on the resource and
performance of the storage system, and the second data transfer
rate.
5. A storage system according to claim 4, wherein the controller
compares an ending time of the first period of time and the second
period of time, and executes the plurality of second processes
prior to the plurality of processes if the ending time of the
second period of time is earlier than the ending time of the first
period of time.
6. A storage system according to claim 5, comprising a memory,
wherein the controller configures memory area of the memory used by
the plurality of processes based on the first data transfer
rate.
7. A storage system according to claim 6, wherein, if the
controller receives the first request, the controller sends data
based on a maximum data transfer rate of the storage system to the
one of the plurality of computers, checks an amount of transferring
data between the storage system and the one of the plurality of
computers and calculates the first data transfer rate based on the
amount of transferring data.
8. A storage system according to claim 7, wherein the controller
configures a plurality of logical units based on the plurality of
storage drives, wherein a first logical unit of the plurality of
logical units stores the first content of the plurality of contents
and a second logical unit of the plurality of logical units stores
the second content, and wherein the controller calculates a data
transfer rate for the first logical unit as the first data transfer
rate, and calculates a data transfer rate for the second logical
unit as the second data transfer rate.
9. A storage system according to claim 7, wherein the controller
checks whether the first request indicates a sequential access or
not, and calculates the first data transfer rate if the first
request indicates the sequential access.
10. A storage system according to claim 7, wherein the controller
stores information of the first data transfer rate, and uses the
information of the first data transfer rate if the controller
receives the first request again.
11. A data transfer method used in a storage system that stores a
plurality of contents, comprising: receiving a first request from a
one of a plurality of computers connected to the storage system which
requests access to a first content of the plurality of contents;
calculating a first data transfer rate of the first request; and
sending data requested by the first request based on a result of
calculation.
12. A data transfer method according to claim 11 comprising:
executing a plurality of processes for transferring the data within
a first period of time to maintain the first data transfer rate,
and wherein the first period of time is calculated based on a
resource and performance of the storage system and the first data
transfer rate.
13. A data transfer method according to claim 12 comprising:
receiving a second request which requests access to a second
content of the plurality of contents; calculating a second data
transfer rate of the second request; and sending second data
requested by the second request based on the second data transfer
rate.
14. A data transfer method according to claim 13 comprising:
executing a plurality of second processes for sending the second
data within a second period of time to maintain the second data
transfer rate, and wherein the second period of time is calculated
based on the resource and performance of the storage system and the
second data transfer rate.
15. A data transfer method according to claim 14 comprising:
comparing an ending time of the first period of time and the second
period of time; and executing the plurality of second processes
prior to the plurality of processes if the ending time of the
second period of time is earlier than the ending time of the first
period of time.
16. A data transfer method according to claim 15, wherein the
calculating includes: sending data based on a maximum data transfer
rate of the storage system to the one of the plurality of computers
in response to the first request; checking an amount of transferring
data between the storage system and the one of the plurality of
computers; and calculating the first data transfer rate based on
the amount of transferring data.
17. A data transfer method according to claim 16, wherein the
calculating includes: checking whether the first request received
from the one of the plurality of computers indicates a sequential
access or not; and calculating the first data transfer rate if the
first request indicates the sequential access.
18. A storage system comprising: a means for storing a plurality of
contents; a means for receiving a first request from a one of a
plurality of computers connected to the storage system which
requests access to a first content of the plurality of contents; a
means for receiving a second request which requests access to a
second content of the plurality of contents; a means for
calculating a first data transfer rate of the first request and a
second data transfer rate of the second request; and a means for
sending data requested by the first request based on a result of
calculation and second data requested by the second request based
on the second data transfer rate.
19. A storage system according to claim 18 comprising: means for
executing a plurality of processes for transferring the data within
a first period of time to maintain the first data transfer rate,
and executing a plurality of second processes for sending the
second data within a second period of time to maintain the
second data transfer rate, wherein the first period of time is
calculated based on a resource and performance of the storage
system and the first data transfer rate, and the second period of
time is calculated based on the resource and performance of the
storage system and the second data transfer rate.
20. A storage system according to claim 19 comprising: means for
comparing an ending time of the first period of time and the second
period of time; and means for executing the plurality of second
processes prior to the plurality of processes if the ending time of
the second period of time is earlier than the ending time of the
first period of time.
Description
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] This application relates to and claims priority from
Japanese Patent Application No. 2005-107013, filed on Apr. 04,
2005, the entire disclosure of which is incorporated herein by
reference.
BACKGROUND OF THE INVENTION
[0002] The present invention relates to a technique for assuring
qualities, such as transfer rate and response time, in the
operation of a storage system.
[0003] In recent years, broadband communications services have
enjoyed widespread use. With this trend, media-rich contents, such
as digitized movies and news, have increasingly been delivered by
streaming technology. Storage equipment, a storage system,
or a storage array system contained in a system for delivery of
streaming media is required to assure the qualities of data
transfer (hereinafter referred to as qualities of services (QoS)),
such as transfer rate and response time, as well as the maximum
throughput performance, in order to assure stable delivery of data
to clients. Prior techniques regarding such quality assurance are
disclosed in Patent References 1 to 3.
[0004] Patent Reference 1 discloses a technique that is intended to
assure communications qualities from a computer to storage areas of
storage equipment while taking into account the components inside
the storage equipment and their respective performance values. In
particular, a communications quality-setting device for assuring
communications qualities between the storage equipment and a
computer is disclosed in Patent Reference 1. This quality-setting
device has an assured performance value-creating portion and an
assured performance value-setting portion. The assured performance
value-creating portion determines a performance value to be assured
in each storage area, based on information about requests regarding
assurance of the communications qualities, on the assurable
performance value of the interface of the storage equipment, and on
the assurable performance values of the storage areas of the
storage equipment. The assured performance value-setting portion
assures communications between the computer and the storage areas
inside the storage equipment by giving an instruction to a storage
controller to control the arrangement of data between the storage
areas and the interface according to the performance values to be
assured.
[0005] Patent Reference 2 discloses storage equipment for assuring
the data transfer rate. Specifically, Patent Reference 2 discloses
a technique that uses an expected data transfer rate and the transfer
rate of each track. The storage equipment is informed
of a required, expected data transfer rate. This rate is registered
in a management table. During formatting, the storage equipment
recognizes bad sectors, if any, by writing and reading of data to
and from each track. The writing time to the sectors, excluding the
bad sectors, is registered in the management table. If the resulting
data transfer rate is less than the already registered, expected data
transfer rate, it is recognized that no data can be stored in those
sectors. Data is stored using only the sectors which can assure the
expected transfer rate.
[0006] Patent Reference 3 discloses storage equipment having a
timeout table and a data creation means. A time at which access to
data recorded in the recording portion should be terminated is
recorded in the timeout table. If the access does not end when the
time stored in the timeout table has passed, the data creation
means gains access to redundant data recorded in the recording
portion and creates data.
[Patent Reference 1] JP-A-2004-86512
[Patent Reference 2] JP-A-10-162505
[Patent Reference 3] Japanese Patent No. 3,080,584
[0007] In the aforementioned streaming delivery system, the
computer offering services to customers makes plural simultaneous
accesses (hereinafter referred to as multiple streaming accesses) to
a storage device in which the media contents are stored, in order to
deliver the contents to plural customers simultaneously.
"Streaming" here indicates the transfer of a unit of data.
[0008] For example, one stream corresponds to data transfer of one
content. It is necessary that storage equipment treating multiple
streaming accesses assure a predetermined quality of service (QoS)
for each stream.
[0009] In the publications described above, the assurance of
quality of service regarding a single stream is mentioned. However,
with respect to multiple streams, how the quality of service of each
individual stream is assured is not mentioned at all.
SUMMARY OF THE INVENTION
[0010] It is an object of the present invention to assure quality
of service (QoS) for each stream in storage equipment capable of
processing multiple streams.
[0011] One embodiment of the present invention is a storage system
for receiving streaming accesses from a computer. The storage
system itself detects the data transfer rates of streaming
accesses. In this configuration, the storage system calculates a
time required to execute internal processing, such as readout of
data based on the detected data transfer rate, and processes the
data based on the result.
[0012] More specifically, the storage system calculates the time
required to execute the processing from the resources of the array
system, from the performance, and from the detected data transfer
rate. Where the detected data transfer rate cannot be sustained with
the resources available at that time, the storage system may
modify the configuration of the resources. An example of the
resources is the buffer memory size. The performance can be the
performance of a drive or the processing time taken when a fault
occurs.
[0013] When plural streams are processed by the storage system, I/O
operations are internally scheduled according to the required
processing time.
[0014] In addition, the storage system can be so configured that a
streaming access is judged according to whether it is a sequential
access. Moreover, a streaming access may be judged based on the
access destination.
[0015] Other structures of the present invention will become
apparent from the following description of various embodiments of
the invention. Obviously, the concept of the present invention is
not limited to the embodiments described herein.
[0016] The storage system according to the present invention
receives multiple streams and can stabilize the bit rates of the
streams and the response time.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] FIG. 1 is a block diagram showing an example of the
configuration of a system in accordance with the present
invention;
[0018] FIG. 2 is a diagram schematically illustrating the
processing of read streams;
[0019] FIG. 3 is a diagram schematically illustrating the
processing of write streams;
[0020] FIG. 4 is a diagram showing a transfer rate setting
table;
[0021] FIG. 5 is a diagram showing a job management table;
[0022] FIG. 6 is a diagram illustrating relations among an instant
at which a job is created, an instant at which the job is required
to be started, an instant at which the job is required to be
terminated, and a time required to complete the job;
[0023] FIG. 7 is a diagram schematically illustrating registration
of jobs in a priority queue and a nonpriority queue and job
selection;
[0024] FIG. 8 is a flowchart illustrating a procedure for
registration in queues;
[0025] FIG. 9 is a flowchart illustrating an example of a procedure
for selecting executed jobs;
[0026] FIG. 10 is a diagram schematically illustrating relations
among processing steps executed by an embodiment of the present
invention; and
[0027] FIG. 11 is a diagram showing a table of numbers of enabled
tagged queues.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0028] FIG. 1 shows a computer system according to a first
embodiment of the present invention. The computer system has
terminals (hereinafter simply referred to as users) 101-104 that
are used by users, computers 109 and 110 (hereinafter may also be
referred to as the host computers) that receive delivery requests
from the users via the Internet 105, and a storage system 113
(hereinafter referred to as the disk array system) for processing
access requests from the host computers. The users and host
computers are interconnected via the Internet 105. The host
computers and the disk array system 113 are connected via
communications lines, which may be replaced by the Internet
105.
[0029] The disk array system 113 has a disk array controller 114
for controlling disk accesses and other operations and drive
devices 118-121 for storing data. The disk array controller 114 has
an MPU 115 for executing a control program and a cache memory 116
for storing data.
[0030] In the present embodiment, the disk array system 113 is
shared between the plural host computers, which in turn receive
content request accesses from plural users. Accordingly, the disk
array system 113 is shared among a multiplicity of users. The disk
array system 113 needs to process multiple streaming accesses.
[0031] Each of the host computers is a general computer and
includes a processor, an interface with a network, and a memory. As
mentioned previously, the host computer receives content delivery
requests from users and requests the disk array system 113 to read
out or write the media contents (hereinafter may be referred to as
streaming accesses) according to the contents of the requests.
[0032] The user terminal may be a device capable of displaying
media contents. For example, the terminal may be a cell phone,
personal computer, or TV receiver. The device of the user terminal
is equipped with an interface for connection with a network.
Furthermore, the device includes a device for making communications
with the host computers and a display unit.
[0033] The processing of multiple streams in the present embodiment
will be described briefly below.
[0034] When multiple streams are received, the disk array system
113 of the present embodiment first detects the data transfer rates
(hereinafter may be simply referred to as transfer rates) required
by the individual streams. Then, the array system 113 determines
which of its resources to assign in order to maintain the data
transfer rates of the individual streams. In this embodiment, the
assigned resources are the buffer size and disk I/O size for each
individual stream. Then, based on the assigned
resources, the array system 113 processes the multiple streams.
[0035] Where multiple streams are processed in practice, the disk
array system 113 determines the order in which processing steps
are performed while taking account of I/O processing of other
streams such that processing responding to the received I/O request
can be completed in a time sufficient to maintain the detected data
transfer rate.
[0036] FIG. 10 illustrates an example of a set of relations among
processing steps of the processing briefly described above. The
disk array system 113 achieves the processing by performing the
processing steps shown in FIG. 10.
[0037] The processing briefly described above includes a processing
step 1003 for setting parameters, a processing step 1005 for
determining the buffer size and drive I/O size, and a processing
step 1006 for performing I/O scheduling. That is, the disk array
system 113 performs these processing steps. The processing step
1003 for setting parameters contains a processing substep 1000 for
recognizing transfer rates and a processing substep 1004 for
recognizing the drive performance and the processing time when a
fault occurs.
[0038] In the processing step 1005 for determining the buffer size
and drive I/O size, the disk array system 113 determines the buffer
size and disk I/O size from the set parameters. In the processing
step 1006 for performing I/O scheduling, the disk array system 113
schedules I/Os occurring in the multiple streams, using the set
parameters and buffer size.
[0039] Based on these processing steps, the disk array system 113
recognizes the transfer rates of the individual streams contained
in the multiple streams and assigns resources, which are matched to
the transfer rates, to the streams.
[0040] Furthermore, the array system 113 schedules and processes
the I/Os of the multiple streams based on the resources. The
processing steps are described in further detail below.
[0041] In the processing substep 1000 for recognizing transfer
rates, the disk array system 113 recognizes the transfer rates of
the streams issued to the array system 113 from a host computer.
Processing for maintaining the transfer rates necessary for the
streams is performed by the disk array controller 114 using the
transfer rates recognized by the processing, as described above.
More specifically, the disk array system 113 recognizes the
transfer rates using either automatic recognition (1001) or user's
indication (1002).
[0042] In the case of automatic recognition (1001), when a
streaming access occurs from a host computer to the disk array
system 113, the array system 113 first makes a decision as to
whether the received access is a streaming access.
[0043] A method of discerning streaming accesses by the disk array
system 113 is described below. Generally, streaming accesses are
often sequential accesses (i.e., accesses to consecutive sectors).
Accordingly, in the present embodiment, if the disk array system
113 determines that an access request from the host computer is a
sequential access, the access is judged to be a streaming
access.
[0044] One available method of judging a sequential access checks
the sequentiality of the addresses of data specified by the I/Os
received, for example, from the host computer. More specifically,
when an I/O is received from the host computer, the disk array
system 113 checks to see if the required data exists in the cache
memory. At this time, the disk array
system 113 also checks to see if data in the sector indicated by the
address (e.g., logical block address (LBA)) immediately preceding
the sector in which the required data is stored exists in the cache
memory. If the data exists, the received I/O is judged to be one
stream access.
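As an illustration of this check, a minimal Python sketch follows; the names (such as CacheDirectory) are hypothetical and not from the patent. An I/O is judged to belong to a stream when the sector immediately preceding its start address is already resident in the cache.

```python
class CacheDirectory:
    """Toy stand-in for the cache's address index (illustrative only)."""
    def __init__(self):
        self.cached_lbas = set()  # LBAs currently resident in the cache

    def contains(self, lba):
        return lba in self.cached_lbas

def is_streaming_access(cache, start_lba):
    # Judged sequential (one stream) when the sector immediately
    # preceding the requested sector already exists in the cache.
    return cache.contains(start_lba - 1)

cache = CacheDirectory()
cache.cached_lbas.update(range(1000, 1128))   # extent read by earlier I/Os
print(is_streaming_access(cache, 1128))       # True: continues the run
print(is_streaming_access(cache, 5000))       # False: random access
```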
[0045] If the access is judged as a streaming access, the disk
array system 113 calculates the bit rate. First, the array system
113 sets its maximum transfer rate as a parameter. I/O operations
with the host computer regarding the streaming access are executed
at the maximum transfer rate for a given period. Then, the array
system 113 measures the amount of data actually transferred to and
from the host computer regarding the streaming access for the given
period. The array system 113 finds the transfer rate regarding the
stream from the measured value.
[0046] The found transfer rate is taken as the transfer rate of
this stream. Thereafter, the disk array system transfers data based
on the streaming access while controlling itself so as to maintain
the found data transfer rate.
[0047] In streaming delivery, data transfer rates required by a
delivery request from users to the host computer are often constant
among streams. In this case, the transfer rates required by
transfer requests from the host computer to the disk array system
113 are also constant among streams. Therefore, the transfer rate
of each stream can be recognized by recording the transfer rate of
each stream during a given time in the disk array system.
[0048] The given time indicates an arbitrary time interval until
the state of transmission of data settles down. This may be
specified by the administrator or set as follows. The storage
equipment repeatedly calculates the bit rate at regular intervals
of time (e.g., at intervals of 30 seconds) and takes the time when
the calculated variation decreases below a certain value as the
given time.
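The rate-detection step can be pictured with the following sketch, which samples the amount of data moved at fixed intervals and accepts the rate once successive samples vary by less than a threshold; the interval, threshold, and function names are assumptions for illustration, not values fixed by the patent.

```python
def detect_transfer_rate(samples_bytes, interval_s=30.0, tolerance=0.05):
    """samples_bytes: bytes moved in each successive interval.
    Returns the settled rate in bytes/s, or None if it never settles."""
    prev_rate = None
    for moved in samples_bytes:
        rate = moved / interval_s
        if prev_rate is not None and abs(rate - prev_rate) <= tolerance * prev_rate:
            return rate  # variation fell below the threshold: settled
        prev_rate = rate
    return None

# A stream that settles near 6.25 MB/s after an initial burst
samples = [210e6, 195e6, 188e6, 187e6]
print(detect_transfer_rate(samples))  # roughly 6.23e6 bytes/s
```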
[0049] As another embodiment of the present invention, the disk
array system 113 may determine whether the access is a streaming
access, according to whether it is a request for access to a
certain address or address area, not according to whether it is a
sequential access. In a simple method, the disk array system 113
grasps the head address (or address area) of content stored in the
array system 113, and, if there is an access to the head address
(or address area), the array system 113 determines that the access
is a streaming access.
[0050] However, it is unrealistic to assume that the disk array
system 113 will grasp the head addresses of all of the contents.
Therefore, the decision as to whether the access is a streaming
access is made according to whether there is a request for access
to a unit of managed storage sector (e.g., a logical unit (LU)) in
the disk array system 113. For instance, where access is made to a
certain LU, the disk array system 113 recognizes this access as a
streaming access and starts to calculate the data transfer rate. In
this case, the administrator may manage the disk array system 113
in such a way that contents which require a similar data transfer
rate are stored in the same LU.
[0051] In addition, the disk array system 113 may hold information
about a data transfer rate once it has been calculated, and set the
data transfer rate using the held information (i.e., calculation of
the data transfer rate is omitted) in a case where there is another
access to the same content. In this case, the disk array system 113
must record information about the relation between the content and
the data transfer rate. For example, the head address (or head
address of the LU) of the content and the data transfer rate may be
interrelated and recorded.
[0052] Similar processing is performed in a case where plural
streaming accesses are received. Where a streaming access is judged
by a sequential access, the disk array system 113 judges different
sequential accesses (i.e., including plural accesses to the same
content and different accesses to different contents) as different
streaming accesses.
[0053] On the other hand, where a streaming access is judged by an
address (content or LU), the disk array system 113 judges accesses
to different addresses as different streaming accesses. In this
case, if the accesses are judged as plural accesses to the same
address, the disk array system 113 calculates only the data
transfer rate of any one of plural streaming accesses. The result
can be applied to the plural streaming accesses.
[0054] The disk array system 113 stores information about the
transfer rates of the recognized individual streams in the disk
array controller 114 in association with each stream.
[0055] The method (1002) using user's indication will now be
described with reference to FIG. 4. A transfer rate either required
by the administrator of the disk array system 113 or calculated by
the host computer executing an agent program is set into the disk
array system 113. Where the administrator of the array system 113
sets the transfer rate, the administrator gives an instruction to
the array system 113 through the management terminal 122 to cause
the disk array controller 114 within the disk array system 113 to
set values into the table shown in FIG. 4, which is present within
the controller 114. Where the rate is set by the host computer
executing the agent program, the host computer receives transfer
rate information from a program that controls delivery made by a
delivery server and gives an instruction to the disk array system
113 to set the transfer rate information in-band. More
specifically, the host computer sends out a special command, which
is received by the disk array system 113. The array system 113
then sets the specified values into the table shown in FIG. 4.
[0056] The transfer rate is specified for each LU as indicated by
column 401 or for each area or sector as given by columns 402 and
403. With respect to the transfer rate in column 404, either the
value of the bit rate is directly set, such as 10 Mbps or 1.5 Mbps,
or information about a compression rate standard, such as MPEG1,
MPEG2, MPEG4, or high definition (HD), is set. In the latter case,
the disk array system 113 judges the transfer rate from these
standards. In addition to the transfer rate, the required response
time (column 405) of each I/O corresponding to each stream can be
set for each LU or for each area.
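The setting table of FIG. 4 can be sketched as a simple in-memory structure, as below; the field names and the standard-to-rate mapping are illustrative assumptions, since the text only states that either a bit rate or a compression standard may be given.

```python
STANDARD_RATES_BPS = {     # assumed nominal rates for each standard
    "MPEG1": 1.5e6,
    "MPEG2": 10e6,
    "MPEG4": 2e6,
    "HD":    25e6,
}

rate_table = [
    # LU (401), start/end of area (402-403), rate (404), response (405)
    {"lu": 0, "start": 0, "end": 999999, "rate": "MPEG2", "resp_ms": 50},
    {"lu": 1, "start": 0, "end": 499999, "rate": 1.5e6,   "resp_ms": 100},
]

def resolve_rate_bps(entry):
    # Decode a standard name into a bit rate if one was set in place
    # of a numeric value (the latter-case judgment described above).
    rate = entry["rate"]
    return STANDARD_RATES_BPS[rate] if isinstance(rate, str) else rate

print(resolve_rate_bps(rate_table[0]))  # 10000000.0 for MPEG2
```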
[0057] In the processing substep 1004 for recognizing the drive
performance and a processing time taken when a fault occurs, the
disk array controller 114 recognizes information about the
performance of the drive and information about the processing time
taken when a fault occurs. The type of drive inside the disk array
system 113 has been previously determined. Information about the
performance of the drive, such as the seek time and the data
transfer time, has values intrinsic to the drive. These values
have been previously set into the disk array controller 114. The
controller 114 finds information about the drive actually used
from these preset values and sets the information into the table
within the disk array controller 114. In addition, information
about the processing time taken when a fault occurs has been
previously determined. These kinds of information are set in the
table within the disk array controller 114. When I/O processing is
performed, the controller 114 takes into account the information
about the drive performance and the processing time taken when a
fault occurs.
[0058] These kinds of information are used to find the forecasted
end time of the command issued to the drive. Also, the forecasted
end time is used in the processing step 1006 for performing I/O
scheduling.
[0059] The disk array system 113 performs the processing step 1005
for determining the buffer size and disk I/O size, using the
parameters set in the processing step 1003. FIGS. 2 and 3
schematically illustrate an example of a method of determining the
buffer size during read and write operations and an example of a
method using the buffer.
[0060] An example of a method of using the buffer and determining
the buffer size when multiple streaming accesses regarding data
readout are received from the host computer will be described with
reference to FIG. 2. It is assumed that multiple streaming accesses
to the disk array system 113 are generated and that the streams are
read accesses. Generally, streaming accesses for reading out data
are consecutive read accesses (hereinafter referred to as
sequential read accesses). Therefore, to cope with streams for
reading out data, the disk array system 113 has sequential buffers
within the cache memory 116, as shown in FIG. 2. The number of
sequential buffers is N, and each buffer corresponds to one
stream.
[0061] In the disk array system 113, each sequential buffer is
formed by plural surfaces. Each of these surfaces indicates a
physical unit of storage forming the cache memory 116. For example,
where a memory having storage elements mounted on the front and
rear surfaces of a single substrate is used, the front and rear
surfaces correspond to the "surfaces". In FIG. 2, each sequential
buffer is formed by surface 0 (for example, front surface 227) and
surface 1 (for example, rear surface 228). Data about one stripe
row is stored on each surface of the sequential buffer. The "stripe
row" indicates a unit of stored data when an array configuration is
formed in storage equipment.
[0062] FIG. 2 shows a case in which the disk array system 113 has
an array configuration of the RAID4 or RAID5 type with 4D1P (four
data disks and 1 parity disk). Surface 0 of one sequential buffer
has sectors 207-210 for storing data and a sector 211 for storing
parity data. Similarly, surface 1 has sectors 212-215 for storing
data and a sector 216 for storing parity data.
[0063] At the time of sequential reading, an operation for reading
successive sectors in a storage area possessed by the disk array
system 113 takes place. In this case, the array system 113 performs
a prefetch: where a data readout location can be forecasted, the
disk array controller 114 reads data from the drive into the cache
memory 116 before the host computer generates the read request,
thus hiding the drive access time within the disk array system 113.
In the case of a sequential read, the data readout location can be
forecasted, and, therefore, the prefetch can occur prior to data
readout. The prefetch is also adopted in the present embodiment.
[0064] In the present embodiment, the disk array system 113
prefetches each stripe row on surface 0 or 1 on a one-by-one basis.
Accordingly, the disk array controller 114 does not issue a read
request to the drives in response to every request from the host
computer, but issues a read request to the drives 221-224 for
every stripe row. When a read request from the host computer is
issued in practice, the disk array controller 114 transfers data
corresponding to the read request to the host computer, if the data
corresponding to the read request exists in the sequential buffer.
If the data does not exist in the buffer, the controller 114
performs a prefetch, reads data about the corresponding stripe rows
from the drives, and transfers the data to the host computer.
[0065] When transfer of data stored on one surface to the host
computer is completed, the disk array controller 114 performs the
next prefetch operation for the corresponding buffer. While data
stored on one surface by a prefetch is being transferred to the
host computer, data is stored on the other surface. For this
purpose, the disk array controller 114 performs prefetch for this
surface. Thus, the disk array system 113 can send data to the host
computer without interruption. Conversely speaking, unless prefetch
of data for the other surface is completed at the time when data
transfer to the host computer regarding one surface ends, data
transfer to the host computer is delayed until prefetch of data for
the other surface is completed. In this case, there is the danger
that the disk array system 113 cannot assure the data transfer
rate.
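The two-surface operation just described can be sketched as a simple double buffer: while the host drains one surface, the controller prefetches the next stripe row into the other. The class below is a hypothetical illustration, not the disclosed implementation.

```python
class SequentialBuffer:
    """Two-surface sequential buffer (illustrative sketch)."""
    def __init__(self, surface_size):
        self.surfaces = [None, None]
        self.surface_size = surface_size   # e.g. one 512 KB stripe row
        self.active = 0                    # surface the host reads from

    def prefetch(self, drive_read, stripe_row):
        # Fill the idle surface ahead of the host's demand.
        self.surfaces[1 - self.active] = drive_read(stripe_row,
                                                    self.surface_size)

    def drain(self):
        # Host consumed the active surface; flip to the prefetched one.
        data = self.surfaces[self.active]
        self.surfaces[self.active] = None
        self.active = 1 - self.active
        return data

fake_drive_read = lambda row, size: bytes(size)  # stands in for drives
buf = SequentialBuffer(surface_size=512 * 1024)
buf.surfaces[0] = fake_drive_read(0, buf.surface_size)  # initial fill
buf.prefetch(fake_drive_read, stripe_row=1)  # fill surface 1 early
host_sees = buf.drain()                      # host drains surface 0
buf.prefetch(fake_drive_read, stripe_row=2)  # refill surface 0 meanwhile
```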
[0066] At this time, the disk array system 113 determines the sizes
of sequential buffers assigned to individual read streams based on
information obtained by the previously described processing step
1003 for setting parameters. More specifically, the array system
113 determines the sizes of the sequential buffers to assure the
transfer rates of the read streams which have been detected or
set.
[0067] The lower portion (226) of FIG. 2 illustrates the relation
between the transfer rate of streams set by the processing step
1003 and the time at which the prefetch ends. Generally, the rate
at which data is required to be transferred to the host computer
cannot be maintained (i.e., data to be transferred ceases to be
present in the cache) unless the prefetch ends within a period
given by (size of one buffer surface/required transfer rate). In
the example shown at 226, the stripe size is 128 KB, the array
configuration is 4D1P, the stripe row size is 512 KB, and the size
of the data area of one surface of the sequential buffer is 512 KB.
In case (1), the transfer rate is 192 KB/s. In case (2), the
transfer rate is 6.25 MB/s. In the two cases, the prefetch end
times (hereinafter may also be referred to as required prefetch end
times) are shown.
[0068] Where the required transfer rate is 192 KB/s, the
above-described calculational formula indicates that if the
prefetch ends within 2.6 s, the transfer rate required by the host
computer can be assured. Furthermore, where the required transfer
rate is 6.25 MB/s, the prefetch for 1 stripe row must be terminated
within 80 ms.
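The two cases can be checked directly from the formula (size of one buffer surface / required transfer rate); the short calculation below reproduces the figures given above (the text rounds 2.67 s to 2.6 s).

```python
def required_prefetch_end_s(surface_bytes, rate_bytes_per_s):
    # Required prefetch end time = surface size / required rate
    return surface_bytes / rate_bytes_per_s

SURFACE = 512 * 1024  # 4D1P with 128 KB stripes: 512 KB per surface
print(required_prefetch_end_s(SURFACE, 192 * 1024))      # ~2.67 s
print(required_prefetch_end_s(SURFACE, 6.25 * 1024**2))  # 0.08 s = 80 ms
```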
[0069] The disk array system 113 first calculates the required
prefetch end time using the calculational formula, based on the
size of the sequential buffer assigned to read streams at the
present time. The array system 113 checks to see whether the disk
array controller 114 can complete the prefetch within the
calculated, required prefetch end time by referring to the drive
performance and fault processing time parameters set in the
parameter-setting processing step 1003.
[0070] In the example shown in FIG. 2, for example, the transfer
rate is 6.25 MB/s, the array configuration is 4D1P, the stripe size
is 128 KB, and one surface of the buffer is 1 stripe row (128
KB*4=512 KB). In this case, it takes 512 KB/6.25 MB=80 ms for the
host computer to read data of 512 KB on one surface of the buffer.
Accordingly, if the prefetch of one surface of the buffer ends
within 80 ms, the transfer rate required by the host computer can
be maintained. Since the prefetch is performed for every stripe
row, the transfer rate can be maintained if the request for a read
of 128 KB to each drive ends within 80 ms.
[0071] The disk array controller 114 can find the read time of the
drive for one read request according to the information about the
drive performance.
[0072] It is assumed, for example, that the drive has a command
time of 0.5 ms, a rotation waiting time of 2 ms, a seek time of 3.8
ms, and an internal transfer time of 0.24 ms. The data transfer
time to the disk array controller is 0.15 ms. One drive I/O time
taken to handle one read request is about 7 ms. Accordingly, unless
plural commands are issued to the drive or a fault has occurred,
this drive can handle the read request of 128 KB within 80 ms.
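The per-I/O figure follows from simple addition of the component times given above, as this short check shows.

```python
command_ms  = 0.5    # command overhead
rotate_ms   = 2.0    # rotational waiting time
seek_ms     = 3.8    # seek time
internal_ms = 0.24   # internal transfer time
xfer_ms     = 0.15   # transfer to the disk array controller

one_io_ms = command_ms + rotate_ms + seek_ms + internal_ms + xfer_ms
print(one_io_ms)   # 6.69 ms, i.e. "about 7 ms" per drive I/O
```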
[0073] However, where plural (e.g., 10) commands are issued to the
drive, a time of 11 × 7 ms = 77 ms elapses until a final (i.e.,
eleventh) command is issued and data corresponding to the command
is sent. In this case, it is difficult for the drive to handle the
read request of 128 KB within 80 ms. The processing time required
when a fault occurs is considered similarly. For example, where
generation of a drive fault is found and it is necessary to read in
parity data again for data recovery, corresponding drive accesses
are generated. This delays the prefetch end time.
[0074] If the disk array controller 114 can perform a prefetch
within the calculated, required prefetch end time as a result of
considerations of the drive performance and fault-processing time,
as described previously, the disk array system 113 uses the already
assigned sequential buffers without modifying the buffer size.
[0075] On the other hand, in a case where it is impossible to
perform a prefetch within the required prefetch end time calculated
by the disk array controller 114, the disk array system 113
increases the size of the sequential buffers assigned to streams,
taking into account the drive performance and the fault-processing
time, such that the controller 114 can complete the prefetch within
the required prefetch end time. For example, if the size of
one buffer surface is increased to 1 MB corresponding to 2 stripe
rows in 226, the required prefetch end time of (1) increases to 5.2
s and the required prefetch end time of (2) increases to 160
ms.
[0076] In the above-described example, with respect to a drive to
which 10 commands, for example, are sent, if the buffer size is set
to 1 MB, for example, to maintain the transfer rate, then the
required prefetch end time is 160 ms. Therefore, the transfer rate
can be maintained.
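The sizing decision can be sketched as a loop that grows the surface by whole stripe rows until the required prefetch end time covers the drive's estimated completion time (queued commands plus fault handling); the function name and the 16-row cap are illustrative assumptions.

```python
def size_surface_rows(rate_Bps, stripe_row_bytes, drive_time_s,
                      max_rows=16):
    """Smallest surface size, in stripe rows, whose required prefetch
    end time is no shorter than the drive's estimated time."""
    for rows in range(1, max_rows + 1):
        if rows * stripe_row_bytes / rate_Bps >= drive_time_s:
            return rows
    raise RuntimeError("transfer rate cannot be sustained")

# 6.25 MB/s stream, 512 KB stripe rows
print(size_surface_rows(6.25 * 1024**2, 512 * 1024, 0.077))  # 1 row
print(size_surface_rows(6.25 * 1024**2, 512 * 1024, 0.100))  # 2 rows -> 160 ms
```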
[0077] FIG. 3 illustrates the case where multiple streaming access
requests from the host computer are writes. In the case of a write
request (hereinafter may be referred to as writing stream), streams
are sequential accesses in the same way as read streams.
Accordingly, the disk array system 113 assigns sequential buffers
to write streams in the same way as shown in FIG. 2. Furthermore,
each sequential buffer is formed by two or more surfaces in the
same way as shown in FIG. 2.
[0078] In the case of write streams, after receiving data about one
surface from the host computer, the disk array system 113 generates
parity solely from the data received from the host computer and
writes the data and parity of one stripe row into the drive. That
is, after storing the data about one surface, the disk array system
113 activates processing for that surface to generate parity and
write it into the drive (destaging). Meanwhile, once parity
generation and destaging to the drive have started for one surface,
the disk array system 113 receives write data from the host
computer using the other surface.
[0079] Accordingly, if parity generation and destaging for one
surface end before the other surface is filled with data received
from the host computer, the disk array system 113 can receive data
without causing a write request from the host computer to wait.
[0080] Therefore, using a calculational formula similar to the
formula shown in FIG. 2, the disk array system 113 calculates the
time taken to generate parity and perform a destage operation based
on the size of the sequential buffer given to write streams at the
present time. The array system 113 checks to see whether the disk
array controller 114 can write data within the calculated time,
using the information about the drive performance and
fault-processing time.
[0081] Where the disk array controller 114 cannot write data within
the calculated time, the disk array system 113 increases the size
of the sequential buffer to such an extent that the disk array
controller 114 can complete the write within the required time.
[0082] The lower portion (330) of FIG. 3 illustrates an example of
a set of relations among parity generation, destage processing
time, buffer size, and transfer rate. The stripe size is 128 KB.
The array configuration is 4D1P. The size on one surface of the
buffer is 512 KB. In case (1), the transfer rate from the host
computer is 192 KB/s. In case (2), the transfer rate is 6.25 MB/s.
Where the transfer rate is 192 KB/s, the required end time of the
processing for parity generation and destage is calculated in the
same way as in the case of reads and results in 2.6 s. Where the
transfer rate is 6.25 MB/s, the time is 80 ms. In the same way as
in the case of reads illustrated in FIG. 2, it is checked to
determine whether the disk array controller 114 can complete the
destage process within the calculated, required end time from the
relation to the drive performance and fault-processing time. If it
is impossible, the buffer size is increased.
[0083] In the processing substep 1004 for recognizing the drive
performance and the processing time when a fault occurs, not only
are the performance parameters, such as the seek time and transfer
time of the drive, set. In addition, the number of enabled tagged
queues for each drive is set. FIG. 11 shows an example of a set of
numbers of enabled tagged queues set for each drive. The table is
loaded in the disk array controller 114. The controller 114 issues
commands to the drive. For some of these commands, completion
acknowledgements have not yet been sent back to the controller 114
from the drive. The "number of tagged queues" is the number of
such outstanding commands, i.e., the number of commands being
processed within the drive. The number of enabled
tagged queues shown in FIG. 11 indicates the limit value of the
number of tagged queues.
[0084] The disk array controller 114 refers to the table of FIG. 11
when an I/O is issued to the drive and checks to determine whether
the present number of tagged queues has reached the limit number
put in the table. If the number of tagged queues has reached the
limit value, the disk array controller 114 suppresses issuance of
I/Os to the drive. This control is used to assure the transfer rate
of streams that have a high degree of urgency and to assure the
response time.
[0085] Generally, the I/O response time from the drive (hereinafter
may also be referred to as the drive I/O response time) increases
roughly in proportion to the number of tagged queues. Therefore, if
the number of tagged queues is permitted without limit, it is highly
likely that the drive I/O response time will increase beyond the
required end time of the processing for prefetch and destage shown
in FIGS. 2 and 3. To circumvent this situation, the number of
tagged queues to the drive is suppressed using the values shown in
FIG. 11, thus assuring maximum drive response times for all of the
I/Os.
[0086] The maximum drive response time is the drive's response time
necessary to maintain the transfer rate to the host computer. That
is, it is the maximum allowable value of the time taken from the
time when a read command is issued to the drive until data is sent
back. Where 10 commands have been already queued in the drive when
an I/O is issued, as mentioned previously, it takes about 70 ms
until data about the commands is returned because the processing
time of one drive I/O is almost fixed (about 7 ms in the above
example). This is the maximum drive response time in a case where
the number of tagged queues is 10. On the other hand, if the number
of tagged queues is 0, data is sent in about 7 ms.
[0087] That is, if the number of tagged queues is limited, the
maximum value of the processing time required by the drive to
process one command can be forecasted. Suppose that issued commands
are prioritized in the disk array controller 114, e.g., a command
of a higher priority (a command arising from a stream of a high bit
rate) is distinguished from commands whose issuance to the drive is
made to wait, for example, by the queuing limitation within the
disk array controller. Then the command of the higher priority is
issued to the drive ahead of the waiting commands. If this
operation is performed, it can be assured that the response time of
the drive to the higher-priority command does not increase beyond
the drive processing time (maximum drive response time), that is,
the number of queued commands * the time taken to process one
command.
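With a per-command service time of about 7 ms, this bound can be computed as below; the sketch only restates the arithmetic implied by the text.

```python
PER_COMMAND_MS = 7.0   # per-command drive time from the earlier example

def max_drive_response_ms(queued_commands):
    # Bound from the text: queued commands x time per command; the new
    # command's own ~7 ms of service adds one further slot on top.
    return queued_commands * PER_COMMAND_MS

for limit in (0, 4, 10):
    print(limit, max_drive_response_ms(limit))   # 0, 28, 70 ms
```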
[0088] An I/O whose issuance to the drive is suppressed is made to
wait in the drive queue. Where a job having a short required end
time, such as a prefetch caused by a high transfer rate stream, is
produced, the processing is terminated within the required end
time by registering the job at the head position of the drive queue.
[0089] In the processing step 1005 for determining the drive I/O
size, the disk array system 113 determines the drive I/O size based
on the size of the sequential buffer determined by the processing
for determining the buffer size. The "drive I/O size" indicates the
amount of data read out or written in one operation set by a data
readout (or write) command issued to the drive from the disk array
controller 114. Accordingly, if the drive I/O size is increased,
the throughput performance in reading or writing the drive is
improved.
[0090] Therefore, where multiple streams required to be sent at
high transfer rates are received, the drive efficiency and
performance can be enhanced by increasing the drive I/O size. With
respect to a reading operation, the disk array controller 114
issues a command requesting a prefetch for one surface of the
buffer to the drive. With respect to a writing operation, the
controller issues a command requesting a destage for one surface of
the buffer to the drive. Accordingly, the I/O size to the drive is
increased by increasing the size of the sequential buffer.
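For a 4D1P row, the relation between surface size and per-drive I/O size reduces to a division, sketched below for illustration.

```python
def drive_io_size(surface_bytes, data_drives=4):
    # One prefetch/destage command covers one buffer surface, so each
    # data drive in a 4D1P row receives surface/data_drives bytes.
    return surface_bytes // data_drives

print(drive_io_size(512 * 1024))    # 131072 bytes = 128 KB per drive
print(drive_io_size(1024 * 1024))   # 262144 bytes = 256 KB per drive
```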
[0091] The processing step 1006 for performing I/O scheduling will
be described next. In the processing step 1006 for I/O scheduling
according to the present embodiment, a method illustrated at 1007
is used. The disk array controller 114 prioritizes jobs (such as a
prefetch request for the drive) regarding the processing of
multiple streams using the parameters illustrated at 1007, based on
the required end time of each job and executes the jobs. In this
way, the multiple streams can be processed while assuring the
transfer rates of the streams. The processing will be described in
further detail below with reference to FIGS. 6-9.
[0092] FIG. 6 is a diagram illustrating the relation between
processing steps performed by the disk array controller 114 and a
processing time. In FIG. 6, as described previously, a unit of
processing performed by the array controller is represented as a
job. It is assumed that the aforementioned prefetch for a reading
operation is implemented by a prefetch job. The processing for
parity generation and destage for a writing operation is performed
by a destage job.
[0093] At the instant of time Tg (601), the disk array controller
114 creates a job in response to the processing of a stream. This
corresponds, for example, to the instant at which the disk array
controller 114 creates a prefetch job for reading data about one
buffer surface from the drive when a sequential read occurs based
on a read stream. In the case of a write stream, it
corresponds to an instant of time at which the disk array
controller 114 creates a destage job after data about one buffer
surface has been stored.
[0094] The job created at instant Tg is required to be terminated
at an instant of time Te (603). As described previously, the
required end time Te of the job is found from the size of the
sequential buffer and the required transfer rate of the stream.
Unless each job can be completed before this time, the sequential
buffer will be depleted. This will delay data transfer to the host
computer, or data from the host computer will not be accepted.
[0095] The period of time Tr (605) indicates a time taken to
process the job generated at the instant Tg. The time Tr is found
based on the number of queued commands to the drive, the drive
performance, and information about the fault processing time, as
described previously.
[0096] The instant of time Ts (602), found from Te - Tr, indicates
the time at which the job must be started. More particularly, the
transfer rate of the corresponding stream cannot be assured unless
the disk array system 113 starts the processing of the job at the
instant Ts at the latest.
[0097] The disk array system 113 computes the instant Ts at all
times for all of the multiple streams. The array system 113
executes the sequence of jobs from the job of the stream
corresponding to the earliest instant Ts at that time according to
the result of the computation. Accordingly, the order in which the
jobs are created may be different from the order in which they are
executed. That is, the jobs are prioritized in the order of their
start times Ts.
[0098] In the present embodiment, it is assumed that the execution
time of one job is substantially identical with the execution times
of other jobs. Under this assumption, the required job start time
(Ts) is found from the required job end time Te. The order of
execution of the jobs is based on the order of their start instants
of time Ts. That is, it is assumed that a job having an earlier Te
has an earlier start time Ts. However, the execution time of one
job may be different from the execution time of another job. In
this case, the job end times Te may be simply compared in terms of
their order, and the jobs may be executed according to the order of
the job end times.
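The quantities of FIG. 6 and the resulting ordering can be sketched as follows; the Job class and its values are hypothetical, chosen to mirror the prefetch examples above.

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    ts: float = field(init=False)   # latest required start time, Te - Tr
    tg: float                       # creation time Tg
    te: float                       # required end time Te
    tr: float                       # estimated processing time Tr

    def __post_init__(self):
        self.ts = self.te - self.tr

# A slow stream's prefetch (Te = 2.6 s) and a fast stream's (Te = 80 ms)
jobs = [Job(tg=0.00, te=2.60, tr=0.08),
        Job(tg=0.01, te=0.08, tr=0.07)]
for job in sorted(jobs):   # ordered by Ts: the fast stream's job first
    print(round(job.ts, 2), job.te)
```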
[0099] In this way, as attributes of each job, its time-related
parameters, such as the start time, end time, and processing time,
are introduced. As a result, processing jobs of a multiplicity of
streams having different required transfer rates can be
prioritized. Hence, streams having a higher priority in terms of
time can be processed with priority.
[0100] The aforementioned job creation time, job start time,
required job end time, and required execution time are loaded in a
job management table, as shown in FIG. 5. This table is stored in
the disk array controller 114. Based on the required job start time
Ts registered in the column 504 of FIG. 5, the disk array
controller 114 prioritizes jobs in the order #1, #2, #3, and
#4.
[0101] An example of the procedure of the processing step 1006 for
performing I/O scheduling by the disk array controller 114 will be
described below.
[0102] FIG. 7 is a diagram summarily illustrating scheduling of
jobs in the disk array controller 114. When a stream-processing
request is received, the array controller 114 creates a job (e.g.,
a command for causing the drive to perform processing)
corresponding to the stream. The array controller 114 previously
sets a queue area in the cache memory 116. The controller 114
registers a created job as one queue in the queue area.
[0103] Referring still to FIG. 7, the queue area of the disk array
controller 114 includes two areas: a priority queue 702 and a
nonpriority queue 708. For example, jobs regarding prefetch and
destage used in streaming are registered in the priority queue 702.
On the other hand, jobs responding to random I/Os, which differ
from the sequential accesses of streaming, are registered in the
nonpriority queue. In a further embodiment of the present
invention, neither a priority queue nor a nonpriority queue is
provided.
[0104] Where the disk array system 113 is so set that priority is
given to reading, jobs regarding reading may be registered in the
priority queue, while jobs other than reads, such as writing, may
be stored in the nonpriority queue.
[0105] Based on the conditions described above, the disk array
controller 114 determines in which of the priority and nonpriority
queues the created job (700 in the figure) is registered
(registration 714 into either queue). Jobs are registered in the
queues (704-710). Furthermore, the array controller 114 selects
jobs from the queues to execute the jobs (selection 711 for
executing jobs) and executes the selected jobs (712).
[0106] A detailed example of the procedure of the queue
registration 714 and executed job selection 711 will be described
next. FIG. 8 is a diagram showing an example of the processing
procedure of the queue registration 714 performed by the disk array
controller 114. The controller 114 first defines a job to be
registered (hereinafter referred to as the registration requesting
job) as JOB. At this time, the controller 114 sets information
about Tg (job creation time), Ts (required start time), and Te
(required end time) for the JOB regarding the registration
requesting job, based on the information registered in the table
shown in FIG. 5 (step 802).
[0107] Then, the disk array controller 114 makes a decision as to
whether the JOB is a job (hereinafter referred to as the priority
job) registered in the priority queue or a job (hereinafter
referred to as the nonpriority job) registered in the nonpriority
queue (step 803).
[0108] Depending on the result of the decision made in step 803,
the disk array controller 114 selects the nonpriority queue as the
registration queue if the JOB is a nonpriority job (step 804) and
selects the priority queue if the JOB is a priority job (step 805).
After the processing of step 804 or
805, the controller 114 determines the position inside the queue in
which the JOB is registered. Specifically, the controller 114
compares the required start time of each job already registered in
the registered queue and the Ts set in the JOB. Of the jobs having
required start times earlier than Ts, the position located
immediately after the job having the latest required start time is
taken as the registration position of the JOB (step 806). Finally,
the controller 114 registers the JOB in the position determined in
step 806 (step 807).
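The registration steps 802-807 amount to a sorted insertion keyed on Ts, sketched below (jobs are reduced to their Ts values for brevity; the names are illustrative).

```python
import bisect

def register_job(priority_q, nonpriority_q, ts, is_priority):
    # Steps 803-805: choose the queue; steps 806-807: insert the job
    # just after the last entry whose Ts is earlier, keeping the queue
    # ordered by required start time.
    queue = priority_q if is_priority else nonpriority_q
    bisect.insort(queue, ts)

pq, npq = [], []
register_job(pq, npq, 0.50, True)
register_job(pq, npq, 0.10, True)
register_job(pq, npq, 0.90, False)
print(pq, npq)   # [0.1, 0.5] [0.9]
```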
[0109] FIG. 9 is a flowchart illustrating an example of the
procedure of processing of the selection 711 for executing jobs. In
the processing of the selection 711, the disk array controller 114
selects a job with the highest priority from the priority and
nonpriority queues. Since jobs in the priority and nonpriority
queues are registered using start time Ts, the jobs are prioritized
in each queue. Therefore, in selecting a job to be executed, the
jobs in the heads of the priority and nonpriority queues are
selected. Of these two jobs, the job with a higher priority is
selected. With respect to processing such as random access
registered in the nonpriority queue, a default value in a range in
which the command does not time out is set as the start time
Ts.
[0110] First, the disk array controller 114 takes the job in the
head of the priority queue as a job JOBp to be selected from the
priority queue and takes the job in the head of the nonpriority
queue as a job JOBnp to be selected from the nonpriority queue. Let
Tp_s be the required start time of JOBp. Let Tnp_s be the required
start time of JOBnp. Let Tc be the present time (step 901).
[0111] Then, the disk array controller 114 compares the present
time Tc with the required start time Tp_s. The controller compares
the present time Tc with the required start time Tnp_s (step 902).
If both Tp_s and Tnp_s are later than the present time Tc, it
follows that the jobs registered in the queues, respectively, have
not reached the required start times. Therefore, the controller 114
compares the start times Tp_s and Tnp_s, and takes the job with the
earlier time as a job to be executed in steps 905 and 907 (step
904).
[0112] Where the required start time of the job registered in at
least one of the queues is earlier than the present time Tc, the
disk array controller 114 checks to see if both Tp_s and Tnp_s have
passed the present time Tc (step 903). If so, the controller 114
preferentially executes the job in the priority queue (step
906).
[0113] Where the required start time of the job registered in
either queue is earlier than the present time Tc, the disk array
controller 114 compares their required start times (step 904). The
job having the earlier required start time is selected (steps 905
and 907). Consequently, this is equivalent to selecting the job
having the required start time earlier than the present time
Tc.
[0114] Either the priority job or the nonpriority job can safely be
executed as long as its required start time has not yet passed, and
so the job having the earlier required start time is selected by
the processing described above. Where both required start times
have already passed the present time, the priority job is selected
to minimize the delay in processing of the priority job.
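The selection rule of FIG. 9 can be condensed into the following sketch (illustrative names; the return value stands for the queue whose head job is executed).

```python
def select_job(tp_s, tnp_s, tc):
    """tp_s / tnp_s: required start times of the head jobs of the
    priority and nonpriority queues; tc: the present time."""
    p_due, np_due = tp_s <= tc, tnp_s <= tc
    if p_due and np_due:             # step 903: both overdue
        return "priority"
    if not p_due and not np_due:     # step 904: neither overdue yet
        return "priority" if tp_s <= tnp_s else "nonpriority"
    return "priority" if p_due else "nonpriority"  # run the overdue job

print(select_job(5.0, 8.0, 0.0))   # priority: earlier start time
print(select_job(5.0, 2.0, 3.0))   # nonpriority: already overdue
print(select_job(1.0, 2.0, 3.0))   # priority: both overdue
```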
[0115] The I/O scheduling illustrated in FIG. 7 can be used in
plural locations within the disk array controller. For example, it
can be applied to a ready queue, in which jobs that have become
executable wait, and to a drive queue, in which jobs are queued
when a command is issued to the drive.
[0116] Still another embodiment of the present invention involves a
disk array system comprising a transfer rate recognition portion, a
second recognition portion for recognizing drive performance and a
fault processing time, a buffer size determination portion, a drive
I/O size determination portion, and an I/O scheduling portion. The
buffer size determination portion and the drive I/O size
determination portion determine the buffer size and the drive I/O
size, using the transfer rate recognized by the transfer rate
recognition portion and the drive performance and the fault
processing time recognized by the second recognition portion. The
I/O scheduling portion prioritizes I/O processes, using the
recognized transfer rate, drive performance, fault processing time,
determined buffer size, and drive I/O size. Thus, the disk array
system assures the transfer rate.
* * * * *