U.S. patent application number 09/110,110 was filed with the patent office on July 2, 1998 and published on 2001-08-09 as publication number 20010013084, for a system and method for modeling and optimizing I/O throughput of multiple disks on a bus.
The invention is credited to BARVE, RAKESH D., GIBBONS, PHILLIP B., HILLYER, BRUCE K., MATIAS, YOSSI, SHRIVER, ELIZABETH, and VITTER, JEFFREY S.
United States Patent Application 20010013084
Kind Code: A1
Published: August 9, 2001
Application Number: 09/110,110
Filed: July 2, 1998
Family ID: 22331277
Inventors: BARVE, RAKESH D.; et al.
SYSTEM AND METHOD FOR MODELING AND OPTIMIZING I/O THROUGHPUT OF
MULTIPLE DISKS ON A BUS
Abstract
A method for scheduling access to data blocks in a
computer system having a plurality of disk drives, each disk drive
having a disk cache with a specified fence parameter value and coupled to
a host computer via a common bus. The method, according to one
embodiment, comprises the steps of: (a) sequentially accessing each
of the disk drives for a predetermined number of iterations to
retrieve a predetermined number of data blocks; (b) for a specified
number of the iterations, transferring data located in the disk
cache to the common bus and requesting data
corresponding to the following iteration to be transferred to the
disk cache; and (c) repeating steps (a) and (b) until the
predetermined number of iterations is completed.
Inventors: BARVE, RAKESH D. (DURHAM, NC); GIBBONS, PHILLIP B. (WESTFIELD, NJ); HILLYER, BRUCE K. (LEBANON, NJ); MATIAS, YOSSI (POTOMAC, MD); SHRIVER, ELIZABETH (JERSEY CITY, NJ); VITTER, JEFFREY S. (DURHAM, NC)
Correspondence Address: JOSEPH SOFER, 342 MADISON AVENUE, SUITE 1921, NEW YORK, NY 10173
Family ID: 22331277
Appl. No.: 09/110,110
Filed: July 2, 1998
Current U.S. Class: 711/113; 711/114; 711/167; 711/E12.019
Current CPC Class: G06F 3/0613 (2013.01); G06F 3/0689 (2013.01); G06F 3/0656 (2013.01); G06F 12/0866 (2013.01)
Class at Publication: 711/113; 711/114; 711/167
International Class: G06F 012/00
Claims
We claim:
1. In a computer system having a plurality of disk drives each disk
drive having a disk cache with a specified fence parameter value
coupled to a host computer via a common bus, a method for
scheduling access of data blocks located in each one of said disk
drives, said method comprising the steps of: (a) sequentially
accessing each of said disk drives for a predetermined number of
iterations to retrieve a predetermined number of data blocks; (b)
for a specified number of said iterations, transferring data
located in said disk cache to be transferred to said common bus and
requesting data corresponding to the following iteration to be
transferred to said disk cache; and (c) repeating said steps (a)
and (b) until said predetermined iterations are completed.
2. The method in accordance with claim 1, wherein said step (b)
comprises the steps of transferring data located in said disk cache
and requesting data corresponding to the following iteration using
an asynchronous read transfer of a disk sector that is located just
before said requesting data.
3. The method in accordance with claim 1, wherein said step (b)
comprises the steps of transferring data located in said disk cache
and requesting data corresponding to the following iteration using
a non-blocking read transfer of a disk sector that is located just
before said requesting data.
4. In a computer system having a plurality of disk drives each disk
drive having a disk cache with a specified fence parameter value
coupled to a host computer via a common bus, a method for
scheduling access of data blocks located in each one of said disk
drives, said method comprising the steps of: (a) sequentially
accessing each of said disk drives for a predetermined number of
iterations to retrieve a predetermined number of data blocks; (b)
for each of said iterations, transferring data located in said disk
cache to be transferred to said common bus and requesting data
corresponding to the following iteration to be transferred to said
disk cache; and (c) repeating said steps (a) and (b) until said
predetermined iterations are completed.
5. The method in accordance with claim 4, wherein said step (b)
comprises the steps of transferring data located in said disk cache
and requesting data corresponding to the following iteration using
an asynchronous read transfer of a disk sector that is located just
before said requesting data.
6. The method in accordance with claim 4, wherein said step (b)
comprises the steps of transferring data located in said disk cache
and requesting data corresponding to the following iteration using
a non-blocking read transfer of a disk sector that is located just
before said requesting data.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This patent application is related to co-pending patent
application Ser. No. 09/___,___, attorney docket no. R. Barve
1-10-4-16-1-4, filed concurrently herewith and
entitled "A System and Method for Modeling and Optimizing I/O
Throughput of Multiple Disks on a Bus."
FIELD OF THE INVENTION
[0002] This invention relates to data transfer arrangements in
multiple disk systems and specifically to a system and method for
optimizing data throughput in an input/output (I/O) bus coupled to
a plurality of disk drives.
BACKGROUND OF THE INVENTION
[0003] In the past decade, computer systems have enjoyed a
hundred-fold increase in processor speed, while the speed of disk
drives has increased by less than a factor of 10. As a consequence
of this disparity, computer systems that run applications that
perform I/O-intensive processing are designed to use many disks in
parallel, usually organized as a disk farm or a RAID array. The
physical organization generally consists of one or more I/O buses
(e.g., SCSI, FC, or SSA) with several disks on each bus.
[0004] Previous work related to disk I/O performance has focused on
the disk drive, downplaying the importance of bus contention and
other bus effects. Indeed, the bus effects play an insignificant
role in I/O performance for workloads with small I/O request sizes.
But many I/O-intensive applications benefit significantly from
larger requests (8-128 KB). Among these are multimedia servers and
certain database and scientific computing applications that use
external memory and out-of-core algorithmic techniques to process
massive data sets. In such applications, parallel I/O performance
is often limited by the bus.
[0005] Some prior art systems have attempted to implement a model
of a computer system that retrieves data from a plurality of disk
drives that are coupled to a bus, for example, a bus that employs a
Small Computer System Interface (SCSI) protocol. Others have
presented detailed performance studies for single disk systems, and
approximation techniques for multiple disk systems. For several
important workloads, the previous disk models fail to give an
accurate prediction of system performance.
[0006] Thus there is a need for a system and a method for obtaining
an analytical model of a bus supporting multiple disks, and based
on that model, implementing a system that is configured to optimize
the data throughput traveling via that bus.
SUMMARY OF THE INVENTION
[0007] In accordance with one embodiment of the invention, a
computer system accesses data located in a plurality of disk drives
coupled to a disk bus having a predetermined bus bandwidth. Each
disk drive includes a buffer or cache memory for storing data
intended to be transferred via the bus or onto the disk surface.
The data from the disk are stored in the cache memory at a disk
rotational bandwidth, and the data from cache to the disk bus are
transferred at the bus bandwidth. During each read iteration, each
disk drive loads its disk cache with the next request's data while
the bus is being used by other disk drives to transfer the data for
the current requests. Thus, each disk drive retrieves the data for
the following read iteration from each disk to the corresponding
disk cache, while data for the current read iteration is being
provided from each disk cache to the disk bus.
[0008] In accordance with another embodiment of the invention,
during each read iteration, each drive loads its disk cache with
the data in the disk sector located before the sector that contains
the data required for the next request. Thus, each disk drive
retrieves the data for the following read iteration from each disk
to the corresponding disk cache using a disk pre-fetch feature
while data for the current read iteration is being provided from
each disk cache to the disk bus.
[0009] In accordance with another embodiment of the invention, a
computer system includes a plurality of disk drives, each disk drive
having a disk cache with a zero fence parameter value and coupled to a
host computer via a common bus. A read duration estimator for
measuring the average time to read data blocks in each one of the
disk drives comprises an overhead unit configured to provide the
time during which a request is created and sent from the host
computer to a disk drive via the bus. A minimum positioning time
estimator is also included and is configured to measure the
shortest time required for a disk drive to locate the data block. A
mechanism-to-cache read time estimator is included and is
configured to measure the time required for a leading portion of a
requested data block to be transferred to a disk cache with the
minimum positioning time. A data block read time estimator is
configured to measure the time required to transfer the data blocks
remaining after transmitting to the host a corresponding leading
portion of a requested data block in each of the disk caches. An
adder is coupled to the overhead unit, the minimum positioning time
estimator, the mechanism-to-cache read time estimator, and the data
block read time estimator to provide an estimated duration for a data
request.
[0010] It is noted that in accordance with another embodiment of
the invention, the read duration estimator employs a disk drive
with a non-zero fence parameter. Thus, a computer system in
accordance with this embodiment comprises an overhead unit
configured to provide the time during which a request is created
and sent from a host computer to a disk drive via the bus. A
minimum positioning time estimator is configured to measure an
expected minimum positioning time corresponding to the shortest
time required for a disk drive to locate the requested data block.
A mechanism-to-cache read time estimator is configured to provide
the time required for a disk drive to transfer a data portion to a
disk cache. A data block read time estimator is configured to
measure the time required to transfer data blocks stored in each of
the disk caches to the host. An adder is coupled to the overhead
unit, the minimum positioning time estimator, the
mechanism-to-cache read time estimator, and the data block read
time estimator to provide an estimated duration for a data
request.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] The subject matter regarded as the invention is particularly
pointed out and distinctly claimed in the concluding portion of the
specification. The invention, however, both as to organization and
method of operation, together with features, objects, and
advantages thereof may best be understood by reference to the
following detailed description when read with the accompanying
drawings in which:
[0012] FIG. 1 is a block diagram of a computer system that employs
a method for optimizing data throughput in accordance with one
embodiment of the present invention.
[0013] FIG. 2 is a block diagram of a read duration estimator in
accordance with one embodiment of the present invention.
[0014] FIG. 3 is a block diagram of a read duration estimator in
accordance with another embodiment of the present invention.
[0015] FIG. 4 is a flow diagram of a scheduling process for
retrieving data from a plurality of disk drives in accordance with
one embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[0016] FIG. 1 illustrates a hardware configuration of a computer
system 20 in accordance with one embodiment of the invention. A
host computer 10 is configured to retrieve data from a plurality of
disk drives 14 via a disk bus 12. In accordance with one embodiment
of the invention, disk drives 14 may be a currently available disk
drive such as a Seagate Cheetah® model ST-34501W, connected to a
computer 10, such as a Sun Ultra-1® running the Solaris 2.5.2
operating system, or a Seagate Barracuda® model ST-32171W,
connected to a DEC AlphaStation® computer running the Digital
Unix 4.0 operating system. It is noted that although the
embodiments described herein refer to disk drives connected to a
bus, the principles of the present invention apply to other data
devices connected to the bus, such as tape drives. Furthermore, a
combination of disk drives and tape drives may be coupled to a
bus.
[0017] In accordance with one embodiment of the invention, bus 12
employs a communications protocol known as the small computer
system interface (SCSI) protocol. To this end, each disk drive 14
includes a unique SCSI identifier which determines the priority of
the disk drive when multiple disk drives are coupled to bus 12.
Computer 10 also includes a SCSI controller 22, which has the
highest priority. Thus, controller 22 prevails in any contention in
which it participates. A memory system 26 is coupled to controller
22, and is configured to receive and store the data that has been
retrieved from disk drives 14.
[0018] Each disk drive 14 includes a data cache 16 which is
configured to act as a data buffer for transferring disk data to
bus 12. Disk drive 14 also includes a plurality of disk platters 18
that contain a predetermined volume of data. Each platter includes
a plurality of tracks that in turn contain a plurality of data
sectors per track. Each data sector contains a predetermined length
of data, such as 512 bytes. A plurality of disk heads 24 are
controlled by a disk controller to be positioned to an appropriate
location of a platter 18 in response to a request received by disk
drive 14. It is noted that typically the disk platters are
positioned on top of each other, spaced apart at a predetermined
distance, and are rotated around a vertical central axle. The
tracks from each platter, disposed at an equal distance from the
axle, form a cylinder referred to as a disk cylinder.
[0019] Data is transferred from cache 16 to host 10 in accordance
with a control arrangement referred to as the fence parameter. The
fence parameter determines the time at which a disk drive 14 will
begin to contend for the SCSI bus. The fence is also called the
buffer full ratio in accordance with the SCSI protocol, as described in
the SCSI-2 disconnect/reconnect control mode page. When a disk
drive 14 is instructed to perform a read, and the disk recognizes
that there will be a significant delay, such as the time it takes
for head 24 to locate the appropriate data, the disk releases
control of the SCSI bus (it disconnects). When disk drive 14 is
ready to transfer the data to host 10, it contends for control of
SCSI bus 12 (it reconnects) so that the read can be completed.
Thus, the time at which the bus receives a request to transfer data
from a disk drive to the host depends mainly on the value of the
fence parameter.
[0020] If the fence parameter is set to the minimum value, the disk
drive will contend for bus 12 after the first sector of data has been
transferred from a disk platter 18 to disk cache 16. By contrast,
if the fence parameter is set to the maximum value, the disk drive
will wait until almost all of the requested data has accumulated in the disk
cache before contending for the bus. The performance implication is
as follows. A low fence setting tends to reduce the response time,
because the disk attempts to send data to the host as soon as the
first sector is available. But when the cached data has been sent
to the host (at the bus bandwidth), the disk continues to hold the
bus. The remainder of the transfer occurs at a bandwidth referred
to as rotational bandwidth, which is the rate at which bits pass
under the disk head. The rotational bandwidth is usually less than
25% of the bus bandwidth, and for some disks, far less. A high
fence parameter setting causes the disk to delay the start of data
transfer to the host, but when the transfer does occur, it proceeds
at "bus bandwidth", from cache 16 on the disk drive into host
controller 22. In systems with multiple disks on a bus, a high
fence setting potentially increases overall throughput for
I/O-intensive workloads.
[0021] In accordance with one embodiment of the invention, a
performance model for a system that employs a disk drive
configuration in accordance with FIG. 1 can be obtained. This model
preferably approximates the time to complete a read operation in
response to a request for a predetermined length of data referred
to as a data block located on a disk drive 14.
[0022] The significant components of the time to complete a read
operation are as follows.
[0023] Host queue time
[0024] The time during which a request remains queued up in the SCSI
controller.
[0025] Overhead
[0026] The time necessary to create a request and send the request
from host 10 to a disk drive 14.
[0027] Device queue time
[0028] The time that a request waits in a disk drive 14 while a
previous request is being served. This time is zero for a drive
that can only handle one request at a time.
[0029] Seek time
[0030] The time required by disk head 24 to move to the track
containing a requested data block address. Seek time has a
nonlinear dependency on the number of tracks to be traversed.
[0031] Rotational latency time
[0032] After a seek completes, the time during which the disk
rotates to position the disk head at the start of the data
block.
[0033] Rotational transfer time
[0034] After the rotational latency completes, the time required
for the head to transfer data from the disk platter 18 to cache 16.
This time is largely governed by the speed of rotation and the
number of bytes per track. This time is proportional to the number
of bytes transferred, and includes any additional time required for
track switches and cylinder switches when an I/O extends across
multiple tracks or cylinders.
[0035] Bus busy time
[0036] The time period during which (some or all of) the data block
resides in cache 16, waiting for bus 12 to become available for a
transfer to host 10.
[0037] Bus transfer time
[0038] The time required to transmit a data block over bus 12, at
the sustained bus bandwidth, from a disk drive 14 to host 10. It is
proportional to the number of bytes to be transferred.
[0039] It is noted that the service time for a disk request is not
simply the sum of these components. For instance, if the fence
parameter is 0, some of the rotational transfer time may be
overlapped with the bus transfer time. Moreover, under different
scenarios, different terms may dominate. If many disks share a bus,
the overlapped I/O transfers may cause the bus busy time to
dominate, leading to service times much larger than the bus
transfer time. If the I/O requests are small, then the overhead may
dominate, in which case the effective data rate on the bus cannot
approach the bus bandwidth, even if many disks share the bus.
[0040] In accordance with one embodiment of the invention, several
simulated workloads may be performed to obtain and verify a model
corresponding to the behavior of system 20. Throughout these
simulations it is assumed that at most one request per disk is
outstanding so that both the host queue time and the device queue
time are zero. It is noted that the exemplary workloads described
herein are for purposes of illustration only, and other workloads
may also be employed in accordance with other embodiments of the
invention.
[0041] An example of a simulated workload includes a process which
consists of random, fixed-sized reads. Another simulated workload
process may consist of random reads where the requested data size
is uniformly distributed. A third simulated workload may consist of
fixed-sized reads uniformly distributed on a subset of the
cylinders of the disks; these workloads are referred to as having
"spatial locality." These workloads capture the access patterns of
external-memory algorithms designed for the Parallel Disk Model as
described in Jeffrey S. Vitter and Elizabeth A. M. Shriver,
Algorithms for Parallel Memory I: Two-Level Memories, 12(2/3)
Algorithmica 110-147 (August and September 1994), incorporated
herein by reference. Examples of such algorithms are merge sort as
described in Rakesh D. Barve, Edward F. Grove, and Jeffrey S.
Vitter, Simple Randomized Mergesort on Parallel Disks, 23(4)
Parallel Computing 601-631, North-Holland (Elsevier Scientific
1997), incorporated herein by reference. Another example
includes matrix multiplication as described in Algorithms for
Parallel Memory I: Two-Level Memories, id.
[0042] In Parallel Disk Model algorithms, reads and writes are
concurrent requests to a set of disks, issued in lock-step, one
request per disk. The above described workloads also model
applications that use balanced collective I/O's, i.e., where all
processes make a single joint I/O request rather than numerous
independent requests. The workloads also can be used to model a
video-on-demand server that stripes data across multiple disks.
[0043] Preferably, in each workload, the requests are directed to a
collection of independent disk drives 14 that share a bus 12. The
requests are generated by multiple processes of equal priority
running concurrently on a uniprocessor, one process per disk. Each
process executes a tight loop that generates a random block address
on its corresponding disk drive. The process then takes a time
stamp corresponding to the time the request for a data block is
made. Thereafter, the process issues a seek and a read system call
to the raw disk (bypassing the file system). Thereafter, the
process takes another time stamp corresponding to the time when the
read request completes.
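For concreteness, the following sketch illustrates one such per-disk measurement process in C, assuming a POSIX-style host; the device path, capacity, block size, iteration count, and the use of pread() in place of the separate seek and read system calls are illustrative assumptions rather than details taken from the experiments described herein.

/* Illustrative sketch of one workload process (one per disk): generate a
 * random, sector-aligned block address on the raw device, take a time stamp,
 * issue the read (bypassing the file system), and take a second time stamp. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define BLOCK_SIZE   (64 * 1024)          /* B: requested data block size */
#define NUM_REQUESTS 1000                 /* measured iterations */

static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    const char *raw_dev = "/dev/rdsk/c0t1d0s2";   /* hypothetical raw disk */
    off_t dev_sectors = 4000000;                  /* device size in 512-byte sectors */
    char *buf = malloc(BLOCK_SIZE);
    int fd = open(raw_dev, O_RDONLY);

    if (fd < 0 || buf == NULL)
        return 1;
    for (int i = 0; i < NUM_REQUESTS; i++) {
        off_t sector = (off_t)(drand48() * (dev_sectors - BLOCK_SIZE / 512));
        double t0 = now_seconds();                    /* time stamp: request issued */
        pread(fd, buf, BLOCK_SIZE, sector * 512);     /* seek + read on the raw disk */
        double t1 = now_seconds();                    /* time stamp: read completed */
        printf("%d %.6f\n", i, t1 - t0);              /* read duration for this request */
    }
    close(fd);
    free(buf);
    return 0;
}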
[0044] In accordance with one embodiment of the invention, each of
the simulated workloads mentioned above consists of three phases: a
startup period during which requests are issued but not timed, a
measurement period during which the timings are accumulated in
tables in main memory system 26, and a cool-down period during
which requests continue to be issued. The purpose of the startup
and cool-down periods is to ensure that the I/O system is under
full load during the measurements. The I/O systems provide fairness
in all our experiments: each disk completes approximately the same
number of I/Os, as explained below in more detail.
[0045] Based on the workloads described above, the behavior of
system 20 exhibits what is referred to as a "round behavior." A
round defines a periodic convoy behavior wherein all disk drives 14
receive a read request from host 10, in response to which each disk
drive transmits the requested data block to the host before any
disk drive receives another read request.
[0046] Remarkably, the round behavior described above is contrary
to expectation. Since host 10 has the highest priority, it is
expected that soon after a disk drive completes one request, the
host would seize the bus to send another request to that disk
drive, thereby keeping the bus and all the disk drives busy. It is
noted that rounds could arise if the operating system kernel
implements a fairness policy that forcibly balances the number of
requests sent to each disk during periods of heavy I/O load by
issuing requests in batches, instead of sending requests to disks
as soon as possible.
[0047] In accordance with one embodiment of the invention, in order
to ascertain whether D number of disk drives are served in
accordance with a round behavior under some workload, it is
preferable to examine the ordered I/O completion time stamps using
a sliding window of size D. A violation of round ordering is said
to occur on the jth time stamp in the window (where 0 ≤ j ≤ D-1)
if there is an i < j such that the ith and
jth I/O of the window both originate from the same disk: if the
current sliding window contains a violation at the jth position,
the window is advanced by j positions. Otherwise it is advanced by
D positions. The fraction of I/O operations that do not violate
round ordering is a measure of the extent of round formation for
that experiment. In simulations described above, rounds occurred
88-99% of the time for uniform random workloads containing a
mixture of 1, 2, 3, or 4 different request sizes and for workloads
that have spatial locality. The workloads that were experimented
with have request sizes of B, 2B, . . . , iB, for i the number of request
sizes in the workload and for B = 8, 16, 32, 64, or 128 KB.
[0048] It is noted that if the request size is small, system 20
does not exhibit a round behavior. In this case, bus 12 does not
experience a bottleneck.
[0049] In accordance with one embodiment of the invention, a read
duration model for reading data is provided as described
hereinafter. The read duration is defined as the time period
between a time stamp immediately before a read operation is made
and immediately after the data is returned to the host.
[0050] The read duration model is described for a system 20 which
includes only one disk drive 14, with a fence parameter value of
zero and a non-zero fence parameter. The read duration model is
also described for a system 20 having a plurality of disk drives 14
with zero and non-zero fence parameters.
[0051] Single disk model
[0052] In accordance with one embodiment of the invention, a model
that characterizes read duration when only a single disk drive is
active is described hereinafter. The model derived based on the
principles of the present invention applies to both zero and
non-zero fence parameter values. This model allows a system
designer to estimate the performance of a system that utilizes disk
drives, such as disk drive 14 of FIG. 1 for retrieving and storing
data from a host computer 10 via a bus 12. Although the examples
provided herein relate to a SCSI bus, it will be appreciated that
the invention is not limited in scope in that respect and other
types of bus protocols may be employed.
[0053] Read duration for fence value 0.
[0054] When the fence parameter value of a disk drive 14 is zero,
the disk drive requests the bus as soon as the first sector is
available in disk cache 16. After the first sector has been
transferred to the host, the transfer of the remainder of the data
occurs at a mechanism-to-cache rate referred to as the
rotational bandwidth (bandwidth_rot), which corresponds to the
rotational transfer time. As described above, the rotational
transfer time is the time required for head 24 to transfer data
from disk platter 18 to disk cache 16. It is noted that the
rotational bandwidth is smaller than the cache-to-host rate,
referred to as the bus bandwidth (bandwidth_bus).
[0055] When using only a single disk, and the data block does not
cross a track or cylinder boundary, the average time to read a data
block of size B (B >> 1 sector) is well approximated by

ReadDuration = Overhead + E[SeekTime] + E[RotationalLatency] + B / bandwidth_rot    (1)
[0056] wherein Overhead is the time required by the bus
protocol to send a request from controller 22 to disk drive 14,
and E[SeekTime] is the expected value of the time required by disk
head 24 to move to the track containing a requested data block
address, and E[Rotational Latency] is the expected time after a
seek completes during which the disk platter rotates to position
disk head 24 at the start of the data block, and B is the data
block size.
[0057] Equation (1) approximates the average read duration as the
sum of the bus protocol overhead time, the expected seek time, the
expected rotational latency, and the time to read the data from the
disk surface. The data is transferred over the bus at the
rotational transfer rate. This follows because disk cache 16 is
used as a speed matching buffer.
[0058] When B is large, the requested data will extend over a
number of tracks and possibly cylinders. Thus, the track and
cylinder switch times must be taken into account as well. These
switching times are respectively referred to as TrackSwitchTime and
CylinderSwitchTime, which correspond to the amount of time to
perform one track switch and one cylinder switch, respectively. The
number of cylinder switches may be approximated by
B/AverageCylinderSize, and the number of track switches (including
those that also cross a cylinder boundary) by B/AverageTrackSize.
Thus, the sum of the track and cylinder switch times, referred to
as TrackCylinderSwitchTime, may be defined as

TrackCylinderSwitchTime = TrackSwitchTime * (B / AverageTrackSize - B / AverageCylinderSize) + CylinderSwitchTime * (B / AverageCylinderSize)    (2)
[0059] Using the above definition of TrackCylinderSwitchTime, the
following expression for the average read duration is defined by

ReadDuration = Overhead + E[SeekTime] + E[RotationalLatency] + B / bandwidth_rot + TrackCylinderSwitchTime    (3)
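As a minimal sketch, equations (2) and (3) can be evaluated directly in C; the function and parameter names below are illustrative, with times in seconds, sizes in bytes, and bandwidths in bytes per second.

/* Minimal sketch of equations (2) and (3) for a single disk with fence value 0. */
double track_cylinder_switch_time(double B, double avg_track_size,
                                  double avg_cylinder_size,
                                  double track_switch_time,
                                  double cylinder_switch_time)
{
    /* Equation (2): pure track switches plus cylinder switches. */
    return track_switch_time * (B / avg_track_size - B / avg_cylinder_size)
         + cylinder_switch_time * (B / avg_cylinder_size);
}

double read_duration_fence0(double overhead, double expected_seek_time,
                            double expected_rotational_latency, double B,
                            double bandwidth_rot, double tc_switch_time)
{
    /* Equation (3): Overhead + E[SeekTime] + E[RotationalLatency]
     * + B / bandwidth_rot + TrackCylinderSwitchTime. */
    return overhead + expected_seek_time + expected_rotational_latency
         + B / bandwidth_rot + tc_switch_time;
}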
[0060] FIG. 2 illustrates a read duration time estimator 102 that
is employed to measure the read duration time for a computer system
20 that employs one disk drive such as 14, in accordance with one
embodiment of the invention. As illustrated, overhead unit 104 is
configured to provide the bus overhead time depending, among other
things, on the bus protocol being employed, the host
characteristics, the operating system employed by the host, the
host controller, and the disk controller. Seek time estimator 106
calculates the average time required by disk head 24 to move to the
track containing a requested data block. Rotational latency
estimator 108 is configured to calculate the average time after the
seek is complete during which the disk platter rotates to position
disk head 24 at the start of a data block. Data block read time
estimator 110 is configured to calculate the time to read data from
disk platter 18. Finally, TrackCylinderSwitchTime estimator 112
measures the sum of the track and cylinder switch times, when a
data block crosses track and cylinder boundaries. The outputs of
units 104, 106, 108, 110, and 112 are provided to a summing unit 114
so as to provide the read duration time for system 20 as described
above.
[0061] Read duration for non-zero fence value.
[0062] When the fence parameter value of disk drive 14 is set to a
non-zero value, a fraction of the requested data is first read into
the disk drive's cache before the bus is requested. Data is
transferred first from disk platter 18 into disk cache 16 at the
rate of rotational bandwidth (bandwidth_rot) as explained
above, and then over bus 12 at the cache-to-host rate or bus
bandwidth (bandwidth_bus).
[0063] When the data is going over the bus to the host, either the
rest of the mechanism-to-cache data transfer will be hidden by the
cache-to-host transfer, i.e., the transfer time is B/bandwidth_bus,
or the cache-to-host transfer will be hidden by the remaining
mechanism-to-cache transfer, i.e., the transfer time is
(B - B_c)/bandwidth_rot. It is noted that the number of bytes in
the disk cache before the bus is requested is denoted as B_c.
Preferably, B_c = B * (FenceValue/256), wherein B is the data block
size and the maximum fence parameter value is 255. As mentioned
above, when the fence parameter value is 255, the disk waits until
255/256 of the requested number of sectors are in the disk cache
before the disk drive contends for bus 12.
[0064] When using only a single disk, the average time to read a
data block of size B that does not span across multiple tracks or
cylinders is

ReadDuration = Overhead + E[SeekTime] + E[RotationalLatency] + B_c / bandwidth_rot + max(B / bandwidth_bus, (B - B_c) / bandwidth_rot)    (4)

[0065] Taking into account the time for the cylinder and track
crossings, the read duration time is

ReadDuration = Overhead + E[SeekTime] + E[RotationalLatency] + B_c / bandwidth_rot + TrackCylinderSwitchTime + max(B / bandwidth_bus, (B - B_c) / bandwidth_rot)    (5)
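A corresponding sketch for the non-zero fence case of equations (4) and (5), under the same illustrative units and naming assumptions:

/* Sketch of equations (4) and (5) for a single disk with a non-zero fence
 * parameter value; B_c = B * (FenceValue / 256) is the number of bytes staged
 * in the disk cache before the drive contends for the bus.  For example,
 * B = 64 KB with fence value 128 gives B_c = 32 KB. */
double read_duration_fence(double overhead, double expected_seek_time,
                           double expected_rotational_latency, double B,
                           int fence_value, double bandwidth_rot,
                           double bandwidth_bus, double tc_switch_time)
{
    double B_c = B * ((double)fence_value / 256.0);
    double cache_to_host  = B / bandwidth_bus;          /* bus transfer of the whole block */
    double remaining_mech = (B - B_c) / bandwidth_rot;  /* mechanism-to-cache tail */

    /* Equation (5): whichever of the two overlapped transfers is longer is visible. */
    return overhead + expected_seek_time + expected_rotational_latency
         + B_c / bandwidth_rot + tc_switch_time
         + (cache_to_host > remaining_mech ? cache_to_host : remaining_mech);
}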
[0066] It is noted that the models presented in equations (3) and
(5) may be extended to multiple request sizes by providing a
weighted average of the read durations for each request size.
Similarly, when the workload requests are not distributed across
the entire disk, but instead are confined to a contiguous subset of
the disk platters, the expected seek time used in equations (3) and
(5) is calculated over that number of cylinders.
[0067] Parallel disk model
[0068] As explained above, when system 20 employs a plurality of
disk drives 14 coupled to a bus 12, the input/output (I/O)
transactions with the disk drives form a round behavior. In each
round, one request is served from each disk. When the fence
parameter value is 0, a disk is ready to transfer data to the host
after it has positioned its head to the data and read the first
sector into its disk cache. This time is dominated by the
positioning time, which greatly exceeds the rotational transfer
time for one sector. Transmission of data to the host begins when
any one of the disks is ready, so on a bus with D disks, the idle
time on the bus at the beginning of a round is well approximated by
the expected minimum positioning time, denoted MPT(D).
[0069] Parallel read duration for fence value 0.
[0070] The general scenario in a round in accordance with the
present invention is as follows. One request is sent to each of D
disks 14. Usually the requested data blocks are not in disk caches
16, so the drives disconnect from bus 12. The disk with the
smallest of the D positioning times reads the first requested
sector into its cache, and reconnects to the host. It transmits the
first sector at the bus bandwidth (bandwidth_bus), and then
continues transmitting at the rotational bandwidth (bandwidth_rot).
After sending some data to the host, the disk disconnects, either
because it has transferred the entire data block, or because the
remaining portion of the data block lies on the next track or
cylinder. By the time this disconnection occurs, it is likely that
other drives have read enough data into their disk caches that the
remaining portion of the D data blocks can be sent to the host at
bus bandwidth (bandwidth_bus). There may be several disconnects
during this transmission, as various drives reach track or cylinder
boundaries, but as soon as one drive disconnects, another
reconnects to continue sending data to the host.
[0071] The average size of the leading portion of the first data
block (i.e., the amount transferred prior to the first
disconnection) is referred to as Leading_Portion(B). Although
the first disk sends one sector at the bus
bandwidth (bandwidth_bus) before sending more at the
rotational bandwidth (bandwidth_rot), it is assumed that the
entire leading portion from the first disk is sent at the
rotational bandwidth. Furthermore, the overhead of the
disconnection and reconnection is sufficiently small that it is
absorbed into the overhead term. Thus, in accordance with one
embodiment of the invention, the average read duration is given by

ReadDuration = Overhead + MPT(D) + Leading_Portion(B) / bandwidth_rot + (D * B - Leading_Portion(B)) / bandwidth_bus    (6)
[0072] wherein overhead is the time required for the bus to send a
request from controller 22 to disk drive 14 in accordance with the
bus protocol, and MPT(D) is the expected minimum positioning time
required to position a head 24 at the start of the requested data block.
[0073] When the request size B is small, it is usual for the entire
data block to reside on a single track, whereas for large request
sizes the expected size of the leading portion is one half the
track size. Thus, if B ≤ AverageTrackSize/2, advantageously,
Leading_Portion(B) is approximated as Leading_Portion(B) = B;
otherwise it is approximated as
Leading_Portion(B) = AverageTrackSize/2.
[0074] It is noted that equation (6) does not contain terms to
account for the track and cylinder crossings such as those
contained in equations (3) and (5). These crossings do not add to
the read duration because the bus remains busy: one disk
disconnects and another disk immediately seizes the bus to send its
data to the host.
[0075] Parallel read duration for non-zero fence value.
[0076] In this case, the bus is idle during the shortest
positioning time, then the bus continues to remain idle while the
disk with the shortest positioning time reads B_c = B *
(FenceValue/256) bytes of the B bytes into its cache 16. Next the
bus transmits those bytes to the host, followed by the rest of the
data block and the data blocks from the other D - 1 disks. Thus the
average read duration in this case is given by

ReadDuration = Overhead + MPT(D) + B_c / bandwidth_rot + D * B / bandwidth_bus    (7)
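The parallel-disk estimates of equations (6) and (7) can be sketched in the same illustrative style; MPT_D is assumed to be supplied (for example, from equations (8) and (14) below), and the leading-portion approximation follows paragraph [0073].

/* Sketch of the D-disk estimates of equations (6) and (7). */
double leading_portion(double B, double avg_track_size)
{
    /* Leading_Portion(B) = B for small requests, otherwise half a track. */
    return (B <= avg_track_size / 2.0) ? B : avg_track_size / 2.0;
}

double parallel_read_duration_fence0(double overhead, double MPT_D, int D,
                                     double B, double avg_track_size,
                                     double bandwidth_rot, double bandwidth_bus)
{
    double lp = leading_portion(B, avg_track_size);
    /* Equation (6): only the leading portion of the first block moves at the
     * rotational bandwidth; the rest of the D blocks move at bus bandwidth. */
    return overhead + MPT_D + lp / bandwidth_rot + (D * B - lp) / bandwidth_bus;
}

double parallel_read_duration_fence(double overhead, double MPT_D, int D,
                                    double B, int fence_value,
                                    double bandwidth_rot, double bandwidth_bus)
{
    double B_c = B * ((double)fence_value / 256.0);
    /* Equation (7): the bus idles while B_c bytes are staged in the first
     * disk's cache, then all D blocks are transmitted at bus bandwidth. */
    return overhead + MPT_D + B_c / bandwidth_rot + (D * B) / bandwidth_bus;
}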
[0077] FIG. 3 illustrates a read duration time estimator 130 that
is employed to measure the read duration time for a computer system
20 that employs a plurality of D disk drives such as 14, with a
non-zero fence parameter value in accordance with one embodiment of
the invention. As illustrated, overhead unit 132 is configured to
provide the bus overhead time depending on the bus protocol being
employed. Minimum positioning time estimator 136 is configured to
obtain the shortest time that it takes for one of the D disk drives
to position its corresponding head 24 over the beginning of a
requested data block contained in that disk drive.
Mechanism-to-cache read time estimator 138 provides the time that
is required for the disk with shortest positioning time to transfer
data from the corresponding disk platter 18 to disk cache 16 in
accordance with a specified fence parameter value. Finally, data
block read time estimator for all disks 140 estimates the time
during which the remaining data blocks on all disk drives are transferred
to host 10 via bus 12. The outputs of units 132, 136, 138, and 140 are
provided to a summing unit 142 so as to provide the read duration
time for system 20 as described above.
[0078] It is noted that the round behavior of system 20 does have
an impact on the specified fence parameter values and on the data
throughput in bus 12. For example, a higher fence parameter value
would increase overall throughput if the time to read the B_c bytes
into the cache at each disk were fully overlapped with bus
bandwidth transfers by other disks. Since the workload attempts to
keep all disks busy, it is expected that a fully overlapped
scenario would occur. However, due to round behavior of system 20,
the fully overlapped scenario does not occur and the throughput is
reduced. In particular, the first such read (as well as the
corresponding positioning time) is not overlapped, so that in fact
smaller fence values result in higher throughput, even with an
aggressive workload.
[0079] In accordance with another embodiment of the invention,
minimum positioning time estimator 136 provides an expected minimum
positioning time as described hereinafter. This expected minimum
positioning time may be advantageously obtained for a system
consisting of D disk drives 14 where each disk receives a random
request at approximately the same time. Let ST be the random
variable denoting the seek time of one disk and let MST_D be
the random variable denoting the minimum seek time for a D-disk
system. The expected minimum positioning time can be approximated
as the sum of the expected minimum seek time and the mean
rotational latency:
MPT(D) = E[MST_D] + E[RotationalLatency]    (8).
[0080] The random variable MST_D denoting the minimum seek time
for a D-disk system is estimated as described hereinafter.
[0081] Since it is assumed that the D disks are independent and
have identical seek curves,

Pr[MST_D ≥ z] = (Pr[ST ≥ z])^D    (9)

[0082] wherein Pr[X ≥ x] is the probability that the random
variable X is greater than or equal to x.
[0083] The number of cylinders that the disk head can move past
during time x is denoted as cylinder[x]; this is formally defined
as

cylinder[x] = ((x - a) / b)^2,    a < x < SeekTime[e]
cylinder[x] = (x - c) / d,    SeekTime[e] ≤ x < SeekTime[MaxCylinder]    (10)

[0084] where the seek curve of the disk is defined as

SeekTime[dis] = 0,    dis = 0
SeekTime[dis] = a + b * sqrt(dis),    0 < dis ≤ e
SeekTime[dis] = c + d * dis,    dis > e    (11)
[0085] where a, b, c, d, and e are device-specific parameters and
dis is the number of cylinders to be traveled. Using equation (4.5)
from Elizabeth Shriver, Performance Modeling for Realistic Storage
Devices, PhD thesis, Department of Computer Science, New York
University, New York, N.Y., May 1997, incorporated herein by
reference, and equation (9),

Pr[MST_D ≥ z] = (1 - cylinder[z] / MaxCylinder)^(2D)    (12)

[0086] wherein MaxCylinder is the maximum number of cylinders on the
disk.
[0087] Using the definition of expectation for a finite continuous
real random variable and equation (12),

E[MST_D] = ∫0^∞ Pr[MST_D ≥ z] dz = ∫0^∞ (1 - cylinder[z] / MaxCylinder)^(2D) dz    (13)
[0088] Assuming the three-part seek curve as presented in equation
(11), equation (13) can be simplified to

E[MST_D] = a + b * sqrt(MaxCylinder) * SUM_{i=0..2D} [ C(2D, i) * (-1)^i * (e / MaxCylinder)^((2i+1)/2) / (2i + 1) ] + (d * MaxCylinder / (2D + 1)) * (1 - e / MaxCylinder)^(2D+1)    (14)

wherein C(2D, i) denotes the binomial coefficient.
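A sketch in C of the closed form (14); the binomial-coefficient helper and the floating-point evaluation order are implementation choices, and the seek-curve parameter c is not needed in this reconstruction of the simplified expression.

/* Sketch of equation (14): expected minimum seek time of a D-disk system,
 * using the seek-curve parameters a, b, d, e and MaxCylinder of equation (11). */
#include <math.h>

static double binomial(int n, int k)
{
    double r = 1.0;
    for (int i = 1; i <= k; i++)
        r = r * (n - k + i) / i;
    return r;
}

double expected_min_seek_time(int D, double a, double b, double d,
                              double e, double max_cylinder)
{
    double sum = 0.0;

    /* Alternating binomial sum over the square-root portion of the seek curve. */
    for (int i = 0; i <= 2 * D; i++)
        sum += binomial(2 * D, i) * ((i % 2) ? -1.0 : 1.0)
             * pow(e / max_cylinder, i + 0.5) / (2.0 * i + 1.0);

    /* Equation (14): constant term a, the square-root region integral, and
     * the linear region integral. */
    return a + b * sqrt(max_cylinder) * sum
             + d * max_cylinder / (2.0 * D + 1.0)
               * pow(1.0 - e / max_cylinder, 2 * D + 1);
}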
[0089] It is noted that the round behavior of system 20 has an
impact on fence parameter effects and data throughput in bus 12.
For example, a higher fence parameter value would increase overall
throughput if the time to read the B_c bytes into the cache at
each disk were fully overlapped with bus bandwidth transfers by
other disks. Since the workload attempts to keep all disks busy, it
would be expected that a fully overlapped scenario would occur.
However, due to the round behavior, the fully overlapped scenario
does not occur and the throughput is reduced. In particular, the
first such read (as well as the corresponding positioning time) is
not overlapped, so that in fact smaller fence values result in
higher throughput, even with an aggressive workload.
[0090] Furthermore, similar to the single disk model explained
above, the multiple disk model can be modified to support a
workload that has requests which are randomly distributed across a
subset of the cylinders, by adjusting the expected seek time
approximation.
[0091] FIG. 4 is a flow chart of a bus scheduling process in
accordance with one embodiment of the present invention. In
accordance with one embodiment of the invention, the model
equations (6) and (7) suggest two ways to decrease the read
duration. Thus, it is possible to decrease the minimum positioning
time, and convert those transfers that occur at the rotational
bandwidth (bandwidth_rot) to the faster bus bandwidth
(bandwidth_bus).
[0092] As illustrated in FIG. 4, the scheduling process for
retrieving data blocks from a plurality of disk drives 14 is based
on a sequential iteration of data requests. Thus, assuming that
during iteration j-1, host 10 has knowledge of the data blocks that
will be requested during iteration j, the scheduling or pipelining
technique in accordance with the present invention is to overlap
the positioning time for iteration j with the transfer time of the
previous iteration. Furthermore, this pipelining technique stages
data in disk caches 16, so that the first data block transmitted
during iteration j is sent from cache at the bus bandwidth
(bandwidth_bus), rather than from the disk platter at the
rotational bandwidth (bandwidth_rot). At step 202 host 10
begins scheduling read requests. At step 204, with b_{i,j}
denoting the data block to be retrieved from disk i in round j,
host 10 schedules bus 12 so that for all the D disks the data blocks
corresponding to the 0th iteration are transferred to the
corresponding disk caches 16.
[0093] At step 208, during each iteration j, host 10 sends a read
request to the D disk drives 14. At step 212 host 10 also sends to
each disk drive a read request for the block that is required
during the following iteration. As a result, while a disk drive is
fetching the data for a following request, data from its disk cache
and other disk caches are being transferred to host 10 via bus 12.
Pseudo code describing the pipelining technique that schedules a
SCSI bus in accordance with one embodiment of the present invention
is as follows:
[0094] for 0 ≤ i ≤ D-1
[0095] Request LoadIntoDiskBuffer(b_{i,0}) on disk i.
[0096]
[0097] for 0 ≤ j ≤ NumRequests
[0098] for 0 ≤ i ≤ D-1
[0099] Read(b_{i,j}) from disk i.
[0100] Request LoadIntoDiskBuffer(b_{i,j+1}) on disk i.
[0101] The pseudo code LoadIntoDiskBuffer(b) causes the disk to
prefetch data block b into its cache so that a subsequent Read(b)
will not incur disk head positioning time or a head-limited
transfer rate. The prefetch occurs while the bus is busy
transmitting data blocks from other disks and from the previous
round. Thus, the random access latency is overlapped with bus
transfers, and the bus transfers occur at the higher cache data
rate, rather than the slower disk-head rate. The result is fair
parallel I/O in rounds, with a high aggregate bandwidth for random
I/O. It is noted that in accordance with another embodiment of the
invention, instead of performing a prefetch for each iteration, the
system may, for a specified number of iterations, transfer data
located in the disk cache and request data corresponding to the
following iteration to be transferred to the disk cache.
[0102] In accordance with another embodiment of the invention, the
command LoadIntoDiskBuffer(b) is implemented by an asynchronous or
a non-blocking read transfer of a disk sector that is located just
before the data block b that is intended to be read during a
following iteration. This non-blocking read command denoted as
aioread() triggers the corresponding disk drive and its related
mechanism to load data block b into the disk cache. For each data
block, the aioread() implementation incurs the overhead of sending
an extra bus request to the disk and of host 10 receiving the
unwanted sector that triggers the disk read-ahead.
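The following sketch illustrates the pipelined schedule using the portable POSIX AIO interface (aio_read) in place of the Solaris aioread() call mentioned above; device handling, the next_offset() helper, and completion handling are simplified, illustrative assumptions rather than details from the experiments.

/* Sketch of the pipelined schedule of FIG. 4: while block b_{i,j} is read
 * synchronously (and transferred over the bus), a non-blocking read of the
 * sector just before b_{i,j+1} triggers the drive's read-ahead so that the
 * next block is staged in the disk cache. */
#include <aio.h>
#include <string.h>
#include <unistd.h>

#define SECTOR 512

/* LoadIntoDiskBuffer(b): a non-blocking one-sector read just before block b,
 * so the drive prefetches b into its cache. */
static void load_into_disk_buffer(int fd, struct aiocb *cb, char *sector_buf,
                                  off_t block_offset)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf    = sector_buf;
    cb->aio_nbytes = SECTOR;
    cb->aio_offset = block_offset - SECTOR;   /* sector located just before b */
    aio_read(cb);                             /* returns immediately */
}

/* One iteration j of the schedule for D disks: Read(b_{i,j}) from each disk
 * and request b_{i,j+1} into each disk cache.  next_offset() maps (disk,
 * iteration) to a byte offset; a full implementation would also collect the
 * asynchronous completions (aio_suspend()/aio_return()) before reusing cb[i]. */
void pipelined_round(const int *fd, int D, int j, char **block_buf, size_t B,
                     struct aiocb *cb, char **sector_buf,
                     off_t (*next_offset)(int disk, int iteration))
{
    for (int i = 0; i < D; i++) {
        pread(fd[i], block_buf[i], B, next_offset(i, j));         /* Read(b_{i,j}) */
        load_into_disk_buffer(fd[i], &cb[i], sector_buf[i],
                              next_offset(i, j + 1));             /* prefetch b_{i,j+1} */
    }
}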
[0103] For disk drives that employ a SCSI bus protocol, a SCSI
Prefetch implementation allows the prefetch of data blocks without
the need to prefetch a sector just prior to the data block intended
to be transferred. This prefetch implementation would only have the
overhead of sending one extra SCSI request for each data block.
[0104] The results of experiments on several hardware
configurations illustrate a performance gain from pipelining in
accordance with the present invention despite the additional
overhead of the aioread implementation of LoadIntoDiskBuffer.
[0105] Table 1 evaluates the effectiveness of the pipelining
technique with 2, 3, and 4 Cheetah® disks on a Sun Ultra-1,
transferring data blocks with sizes ranging from 8 KB to 128 KB. The
measurements are averaged over 1000 I/Os. The table compares the
aggregate transfer rate in MB/s achieved by the "naive" approach
(one process per disk performing random I/Os) with the pipelined
technique in accordance with the present invention. The column
labeled "%" contains the relative improvement (in percent) of the
pipelined technique. With small data block sizes, the overhead
outweighs the improvement. With 2, 3 or 4 disks and moderate or
large data block sizes, the overlaps gained by the pipeline
technique more than compensate for the increased overhead. For
example, with 4 disks and 96 KB data blocks, the bandwidth improves
17%.
TABLE 1
Data Block        D = 2                    D = 3                    D = 4
size (KB)    Naive  Pipeline   %      Naive  Pipeline   %      Naive  Pipeline   %
    8         1.32     1.22   -8       1.97     1.74  -12       2.63     2.30  -13
   16         2.48     2.34   -6       3.65     3.31   -9       4.83     4.28  -11
   32         4.43     4.27   -4       6.32     6.07   -4       8.02     7.75   -3
   64         7.08     7.09    0       9.38    10.05    7      10.72    12.48   16
   96         8.76     9.48    8      10.85    12.76   18      12.09    14.12   17
  128         9.86    11.01   12      11.79    14.19   20      13.00    14.45   11
(Aggregate transfer rates in MB/s; % is the relative improvement of the pipelined technique.)
[0106] Thus, in accordance with the principles of the present
invention, a model that quantifies the performance impacts of round
behavior is achieved, and a system that predicts the average read
duration time when one or multiple disk drives are connected to a
bus can be implemented in accordance with FIGS. 2 and 3 as discussed
above.
[0107] Furthermore, a scheduling process in accordance with the
present invention that accesses data across a collection of disks that
share a bus may improve performance on the order of 20%. This is
achieved by an application-level pipelining technique, which
increases the aggregate disk bandwidth on the shared bus by
increasing the overlap between disk seeks and data transfers, and
by increasing the fraction of transfers that occur at the disk
cache transfer rate rather than the slower disk head rate. The
pipelining technique in accordance with the present invention
enables each disk drive to be self-governing, such that it is not
necessary to predict the positioning time that will be incurred by
each I/O request. It is noted that if the workload does not have a
uniform request size, the pipelining technique of the present
invention may be employed selectively, for example, when a
predetermined threshold for a request size has been reached.
[0108] While only certain features of the invention have been
illustrated and described herein, many modifications,
substitutions, changes or equivalents will now occur to those
skilled in the art. It is therefore, to be understood that the
appended claims are intended to cover all such modifications and
changes that fall within the true spirit of the invention.
* * * * *