U.S. patent application number 10/759,974 was filed with the patent office on 2004-01-15 and published on 2004-09-09 for register based remote data flow control.
The invention is credited to Futral, William T.
Application Number: 10/759,974
Publication Number: 20040174814 (Kind Code A1)
Family ID: 32328612
Filed: January 15, 2004
Published: September 9, 2004
Inventor: Futral, William T.
Register based remote data flow control
Abstract
In a method according to an example embodiment of the invention,
a data packet is transferred from an I/O node to a host across a
channel-based switching fabric interconnect. The method stores a
value in a register in the I/O node which is indicative of a number
of send credits available to the I/O node. The I/O node keeps a
count of the number of data transfers. It is then determined from
the value of the register whether or not a sufficient number of
send credits is available to the I/O node for the data to be
transferred by comparing it with the count of previous data
transfers. If a sufficient number of send credits is available to
the I/O node, it promptly transfers the data to the host over the
channel-based switching fabric interconnect. If a sufficient number
of send credits is not available to the I/O node, it waits for the
host to update the value stored in the register before transferring
data.
Inventors: Futral, William T. (Portland, OR)
Correspondence Address: BLAKELY SOKOLOFF TAYLOR & ZAFMAN, 12400 WILSHIRE BOULEVARD, SEVENTH FLOOR, LOS ANGELES, CA 90025, US
Family ID: 32328612
Appl. No.: 10/759,974
Filed: January 15, 2004
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number
10/759,974 | Jan 15, 2004 |
09/461,236 | Dec 16, 1999 | 6,747,949
60/135,259 | May 21, 1999 |
Current U.S. Class: 370/231
Current CPC Class: G06F 13/385 (20130101)
Class at Publication: 370/231
International Class: H04L 001/00
Claims
1. A method of transferring data from an I/O node to a host across
a channel-based switching fabric interconnect, the method
comprising: storing a value in a register in the I/O node which is
indicative of a number of send credits available to the I/O node;
determining from the value of the register if there is a sufficient
number of send credits available to the I/O node for the data to be
transferred; promptly transferring the data from the I/O node to
the host over the channel-based switching fabric interconnect if a
sufficient number of send credits is available to the I/O node; and
otherwise, if a sufficient number of send credits is not available
to the I/O node, waiting for the host to update the value stored in
the register before transferring data.
2. The method of claim 1, wherein the data is transferred by a send
message sent from the I/O node to the host over the channel-based
switching fabric when sufficient send credits are available for the
data.
3. The method of claim 1, wherein each of the send credits
available to the I/O node represents one or more receive buffers
which are available at the host for receiving and storing a data
packet.
4. The method of claim 2, wherein said step of transferring data by
sending a send message comprises: preparing a descriptor describing
the send operation to be performed; and posting the descriptor to
one of a plurality of work queues in the I/O node.
5. The method of claim 4, wherein the I/O node further comprises a
plurality of send buffers storing the data to be transferred, the
step of transferring data by sending a send message comprises:
preparing a descriptor describing the send operation to be
performed; posting the send descriptor to one of the work queues in
the I/O node; processing the posted send descriptor by transferring
the data from one of the send buffers to the channel-based
switching fabric interconnect.
6. The method of claim 4, wherein the host updates the value stored
in the register by performing an RDMA write operation to the
register.
7. The method of claim 5, further comprising: fragmenting the data
to be transferred into two or more data packets; performing the
following steps until all data packets have been sent: a)
determining if a sufficient number of send credits is available at
the I/O node; b) sending a data packet from the I/O node over the
channel-based switching fabric if a sufficient number of send
credits is available, and adjusting the number of send credits
based on the sending of the data packet; and c) otherwise, if a
sufficient number of send credits is not available at the I/O node,
waiting for the host to update the value stored in the register
before sending a data packet.
8. An I/O node configured to communicate with a host across a
channel-based switching fabric interconnect, the I/O node
comprising: a channel adapter connecting the I/O node to the
channel-based switching fabric; and a virtual interface, including:
a plurality of send and receive buffers; a transport service layer,
the transport service layer transferring data between the I/O node
and the host; an interface user agent coupled to the transport
service provider; a kernel agent; a plurality of work queues; and a
network interface controller coupled to the kernel agent, the work
queues and the channel adapter; said virtual interface to issue one
or more control commands to the kernel agent to establish a
connection between the I/O node and the host across the
channel-based switching fabric and to post data transfer requests
to the work queues in response to commands from the transport
service layer; and the network interface controller to process the
data transfer requests by transferring data between the send and
receive buffers and the channel adapter.
9. The I/O node of claim 8 wherein the virtual interface is in
accordance with at least a portion of the Virtual Interface (VI)
Architecture.
10. The I/O node of claim 9, wherein the kernel agent comprises a
Virtual Interface (VI) kernel agent, and the network interface
controller comprises a Virtual Interface (VI) network interface
controller.
11. An I/O node configured to communicate with a host across a
channel-based switching fabric, said I/O node comprising: a memory
including send and receive application buffers; a transport service
layer providing for data transfer across the channel-based
switching fabric; a network interface controller coupled to the
network; a plurality of work queues coupled to the network
interface controller for posting data transfer requests thereto; a
user agent coupled to the send and receive buffers and the network
interface controller, the user agent posting data transfer requests
to the work queues, the network interface controller processing the
posted data transfer requests by transferring data between the send
and receive buffers and the channel-based switching fabric.
12. An I/O node configured to communicate with a host computer over
a channel-based switching fabric interconnect, the I/O node
comprising: a processor; a register storing send credits; one or
more work queues for posting data transfer requests; one or more
registered send buffers; one or more registered receive buffers; a
network interface controller coupled to the processor, the work
queues, the buffers and the channel-based switching fabric, the
network interface controller processing the posted data transfer
requests by transferring data between the registered buffers and
the channel-based switching fabric interconnect; and the processor
being programmed to control the transfer of data through the
network interface controller according to a credit-based flow
control scheme depending on the send credits stored in said
register.
13. The I/O node of claim 12, wherein said processor is programmed
to perform the following: determine if a sufficient number of send
credits is available; send a send message containing a data packet
from the I/O node to the host over the channel-based switching
fabric interconnect if a sufficient number of send credits are
available; and otherwise, if a sufficient number of send credits is
not available, wait for the host to update the value stored in
the register before transferring the data packet.
14. The I/O node of claim 13, wherein the I/O node places a count
value of the number of data transfers in the send message.
15. The I/O node of claim 13, wherein the host updates the value
stored in the register by performing an RDMA write operation to the
register of the I/O node.
16. A host computer comprising: a network interface controller
adapter connecting the host to a host channel adapter on a
channel-based switching fabric interconnect; a host processor; a
memory having registered send and receive buffers; and a device
driver coupled to the host processor and the memory, and having one
or more work queues for posting data transfer requests and a
transport service layer providing an end-to-end credit-based flow
control across the channel-based switching fabric interconnect
according to the status of said registered receive buffers.
17. The host computer recited in claim 16, wherein the device
driver comprises a virtual interface, including: a transport
service layer, the transport service layer transferring data
between the I/O node and the host; an interface user agent coupled
to the transport service provider; and a kernel agent coupled to
the work queues, said virtual interface
issuing one or more control commands to the kernel agent to
establish a connection between the I/O node and the host across the
channel-based switching fabric and posting data transfer requests
to the work queues in response to commands from the transport
service layer.
18. The host computer of claim 17, wherein the virtual interface is
in accordance with at least a portion of the Virtual Interface (VI)
Architecture.
19. The host computer of claim 18, wherein the kernel agent
comprises a Virtual Interface (VI) kernel agent, and the network
interface controller comprises a Virtual Interface (VI) network
interface controller.
20. The host computer of claim 16, wherein the device driver
allocates receive buffers in the memory and performs an RDMA write
operation to the I/O node to update a register storing the send
credits of said I/O node.
Description
[0001] This application is a continuation of application Ser. No.
09/461,236, filed Dec. 16, 1999, now U.S. Pat. No. 6,747,949, which
claims the benefit of Provisional Application Serial No. 60/135,259,
filed on May 21, 1999.
BACKGROUND
[0002] 1. Field of the Invention
[0003] The invention relates generally to methods and apparatus for
data communications across a network. In particular, the invention
relates to methods and apparatus for register based remote data
flow control over a channel-based switching fabric interconnect or
the like.
[0004] 2. Description of the Related Art
[0005] Conventionally, an Input/Output (I/O) node functioning as an
intermediary between a host computer and a Local Area Network (LAN)
consists of a bus master network interface card (NIC). This process
is shown generally in FIG. 1. The I/O controller in the NIC is
provided with specific information (i.e., descriptor or memory
token) about each buffer in a list of buffers set up and maintained
in the host. Every time the NIC receives one or more packets from
the LAN (step 1), it reads the buffer list (step 2) and uses a
Direct Memory Access (DMA) write operation across the bus to put
the packet(s) into the next receive buffer(s) on the list (step 3)
and send a notification of the placement of the packet(s) in the
buffer(s) to the host (step 4). When the notification is received
and processed, either the buffer(s) is emptied or a driver
allocates a number of buffers (i.e., by placing another buffer(s)
at the end of the list of buffers) in memory and sends a
notification to the NIC of the information identifying each of the
buffers (step 5). The I/O controller of the NIC must continually
read the buffer list and manage the information about the pool of
host buffers in order to be able to write data into the proper
buffer as LAN packets are received.
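By way of illustration only, the following C sketch models the conventional receive path just described; the structures and helper names (host_buffer, dma_write, notify_host) are hypothetical and stand in for a real bus-master NIC driver. It shows the per-buffer bookkeeping and the separate notification that accompany every DMA write.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define BUFFER_LIST_LEN 8
#define BUFFER_SIZE     2048

/* Hypothetical descriptor the host posts for each receive buffer. */
struct host_buffer {
    uint64_t phys_addr;   /* where the NIC must DMA the packet        */
    uint32_t length;      /* capacity of the buffer                   */
    int      in_use;      /* consumed by a previously delivered packet */
};

/* Buffer list the NIC must continually read and manage (step 2). */
static struct host_buffer buffer_list[BUFFER_LIST_LEN];
static unsigned next_buffer;   /* index of the next free buffer */

/* Stand-ins for bus operations; a real NIC would issue DMA cycles. */
static void dma_write(uint64_t phys_addr, const void *data, size_t len) {
    (void)data;
    printf("DMA write of %zu bytes to host address 0x%llx\n",
           len, (unsigned long long)phys_addr);
}
static void notify_host(unsigned buffer_index) {          /* step 4 */
    printf("notify host: packet placed in buffer %u\n", buffer_index);
}

/* Steps 1-4: a LAN packet arrives and must be pushed to the host. */
static int deliver_lan_packet(const void *pkt, size_t len) {
    struct host_buffer *b = &buffer_list[next_buffer];
    if (b->in_use)
        return -1;   /* no free host buffer: the packet must be discarded */
    dma_write(b->phys_addr, pkt, len < b->length ? len : b->length);
    b->in_use = 1;
    notify_host(next_buffer);
    next_buffer = (next_buffer + 1) % BUFFER_LIST_LEN;
    return 0;
}

int main(void) {
    /* Step 5: the host driver replenishes the list with buffer info. */
    for (unsigned i = 0; i < BUFFER_LIST_LEN; i++)
        buffer_list[i] = (struct host_buffer){ 0x100000u + i * BUFFER_SIZE,
                                               BUFFER_SIZE, 0 };
    char pkt[64] = "example LAN packet";
    return deliver_lan_packet(pkt, sizeof pkt) == 0 ? 0 : 1;
}
```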
[0006] There are a number of disadvantages to such a process. First,
a relatively large amount of overhead information must be transferred
on the bus between the host computer and the NIC concerning all of
the buffers being set up, emptied and allocated in the host. This
information must first be transferred from the host computer to the
NIC so that the NIC can maintain and refer to the buffer list when
transferring data. Then, when data is transferred to a specific
host buffer, that data must be accompanied by information
identifying the specific host buffer. The process also increases
the number of operations required to transfer data from the NIC to
the host computer. In addition to the data transfer operation
itself, the NIC must also send a separate notification to the host
computer. Since the data transfer operation is a DMA write
operation, there is typically a response sent back acknowledging
the successful transfer of data. These additional operations also
increase the load on the bus connecting the host computer and the
NIC.
[0007] Such a process also leads to complexity and latencies in the
I/O controller of the NIC. The I/O controller must continuously
receive and store the information concerning all of the buffers
being set up, emptied and allocated in the host. It must also
continuously maintain and refer to the list of host buffers in
order to determine the proper buffer for data to be transferred
into and to attach the corresponding buffer identifying information
as overhead in the DMA write operation transferring the data to
that buffer. There can be significant latencies because of the
several different operations across the bus and the processing of
the buffer information in the I/O controller. Also, the host may be
busy processing other tasks when it gets notified of the DMA write
operation and not realize that all of the buffers are full or close
to full and that additional buffers need to be posted. In the
meantime, the NIC may continue to receive LAN packets. If
additional buffers are not posted to the bottom of the list of
buffers in time, then all of the buffers may be consumed before the
host responds to the notification. In such an event, there is an
overflow at the NIC and the LAN packets have to be discarded. While
the host node may re-request the lost data, it causes more LAN
traffic which in turn increases the latency (and decreases the
performance and efficiency) of the NIC when transferring data from
the LAN to the host computer. Although additional buffering may be
used to offset these effects to some extent, it increases the cost
of the NIC, an important consideration in the LAN environment.
SUMMARY
[0008] The present invention is directed to methods and apparatus
for data communications across a network. In a method according to
an example embodiment of the invention, a value is stored in a
register in the I/O node which is indicative of a number of send
credits available to the I/O node.
It is then determined from the value of the register whether or not
there is a sufficient number of send credits available to the I/O
node for the data to be transferred. If a sufficient number of send
credits is available to the I/O node, it promptly transfers the
data to the host over the channel-based switching fabric
interconnect using send/receive semantics. If a sufficient number
of send credits is not available to the I/O node, it waits for the
host to update the value stored in the register before transferring
the data.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The foregoing and a better understanding of the present
invention will become apparent from the following detailed
description of example embodiments and the claims when read in
connection with the accompanying drawings, all forming a part of
the disclosure of the invention. While the foregoing and following
written and illustrated disclosure focuses on disclosing example
embodiments of the invention, it should be clearly understood that
the same is by way of illustration and example only and is not to
be taken by way of limitation; the spirit and scope of the present
invention are set forth by the appended claims.
[0010] The following represents a brief description of the
drawings, wherein:
[0011] FIG. 1 is a block and flow diagram illustrating a
conventional method of transferring data from a network interface
card to a host computer.
[0012] FIG. 2 is a block diagram illustrating the NGIO/VI
Architectural model used in an example embodiment of the
invention.
[0013] FIG. 3 is a block diagram of the host and I/O node in an
example embodiment of the invention.
DETAILED DESCRIPTION
[0014] With the advent of Next Generation Input/Output (NGIO)
architecture, Version 1.0, published Jul. 22, 1999, low latency and
high bandwidth channel-based switching fabric interconnects between
a host computer and connected devices (including I/O nodes) have
become a reality. This has opened new horizons for cluster
computing. When implemented in conjunction with the Virtual
Interface (VI) model described in the Virtual Interface
Architecture Specification, Version 1.0, Dec. 16, 1997 (jointly
authored by Intel Corporation, Microsoft Corporation, and Compaq
Computer Corporation), it is possible for distributed applications
to perform low-overhead communication using off-the-shelf NGIO
hardware. However, building high-level applications using
primitives provided by the VI model is complex and requires
substantial development efforts because the NGIO/VI channel-based
switching fabric interconnect conventionally does not provide
transport level functionality such as flow control, buffer
management, fragmentation and reassembly. Moreover, it is
impractical to implement existing network protocols such as the
Transmission Control Protocol (TCP) over NGIO/VI because this would
result in unnecessary additional overhead. TCP uses a sliding
window flow control protocol incorporating sequence numbers,
positive acknowledgments, error and duplicate detection, timeouts
and retransmission of lost packets, etc., because the underlying
network is presumed to be inherently unreliable. In contrast,
NGIO/VI channel-based switching fabric interconnects have very low
error rates and high reliability levels (delivery and reception)
and consider transport errors catastrophic for reliable data
delivery mode. Thus, due to the reliable data delivery and
reception of NGIO/VI channel-based switched fabric interconnects,
the channel connection is broken in the rare case of a lost packet
or transport error. Since the virtual interface guarantees that
data is delivered exactly once in order, many of the functions
performed by TCP to ensure reliability are redundant and would add
unnecessary overhead.
[0015] Even though the host computer and I/O node (and other
devices in a computing cluster) can be connected by a NGIO/VI
channel-based switching fabric having low latency and high
bandwidth, the effective data transfer performance across that
switching fabric can be less than optimum because of the lack of
flow control and buffer management. This is especially true in
computing clusters having an I/O node connected to a local area
network (LAN) or other bursty, asynchronous, network where the
amount of network traffic can increase or decrease suddenly and/or
the data transfers can vary greatly in size and type from large
pre-recorded contiguous blocks of image data, such as multimedia
data from a CD-ROM, to much smaller heavily fragmented user data.
The LAN packets received by the I/O nodes can range in maximum size
anywhere from 1500 bytes to over 64000 bytes. In such
installations, the manner in which the data packets are buffered
and transferred by I/O nodes, host and other elements in the
computing cluster can be crucial. Therefore, a need exists for a
data communication service over a channel-based switching fabric
interconnect that overcomes the disadvantages of conventional PCI
compliant LAN NICs discussed above with respect to FIG. 1., yet
still provides flow control and buffer management for data transfer
between devices of a computing cluster connected by the switching
fabric.
[0016] The example embodiment of the present invention is applied
to a host computer and I/O node of a computing cluster connected to
each other over a NGIO/VI channel-based switching fabric. The host
computer has a processor, associated system memory with a plurality
of allocated and configured buffers, and at least one internal bus
connecting these components. It uses the VI Architectural Model
which will be described shortly. However, the invention may be
implemented in conjunction with other different channel-based
switching fabric interconnects having messaging abilities. The
example embodiment and other embodiments of the invention may
utilize any other architecture and channel-based interconnect which
supports both message passing and remote direct memory access, such
as the System I/O (SIO) architecture currently being developed as a
standardization of NGIO with other architectures. In this patent
application, message passing refers to the transfer of data from
one end of a channel to the other end wherein the unit receiving
the data determines the desired location of the transferred data in
its memory. In contrast, remote direct memory access (RDMA)
operations allow the initiating end of a data transfer operation to
identify the memory location at the receiving end of the channel
where the data will be read or stored at the completion of the data
transfer operation. According to the present invention, a channel
is any means of transferring data, including but not limited to
virtual channels, used to transfer data between two endpoints.
While the example embodiment is an NGIO implementation and this
channel definition is provided in the NGIO specification identified
above, the present invention is not so limited. Furthermore, the
terminology used in this application is consistent with the
aforementioned NGIO specification, and other architectures may use
different terminology to describe similar and corresponding
aspects. For example, in NGIO, the smallest possible autonomous
unit of data is called a cell, and a packet is made up of a number
of such cells. In contrast, SIO uses the term "packet" to describe
the smallest possible autonomous unit of data instead of "cell" as
in NGIO, and uses the term "message" instead of "packet".
Furthermore, an SIO packet differs slightly from the corresponding
NGIO cell. An NGIO cell has a fixed header size and a fixed maximum
payload of 256 bytes. An SIO packet has several headers of fixed
length, but which are only conditionally present in the packet.
Also, the payload of an SIO packet is a minimum of 256 bytes and
the maximum payload is variable and negotiable.
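As a rough illustration of the cell granularity discussed above, the following C sketch models an NGIO-style cell with a fixed-size header and a maximum payload of 256 bytes. The header fields shown are assumptions made for illustration and do not reproduce the actual NGIO header layout.

```c
#include <stdint.h>

#define NGIO_CELL_MAX_PAYLOAD 256  /* fixed maximum payload per cell */

/* Illustrative header only; the real NGIO header fields differ. */
struct cell_header {
    uint32_t dest;       /* destination / channel identifier (assumed) */
    uint16_t length;     /* number of valid payload bytes              */
    uint16_t seq;        /* position of this cell within a packet      */
};

struct ngio_cell {
    struct cell_header hdr;
    uint8_t payload[NGIO_CELL_MAX_PAYLOAD];
};

/* A packet is carried as a number of such cells; this helper just
 * computes how many cells a payload of a given size would require. */
static inline uint32_t cells_for_packet(uint32_t packet_bytes) {
    return (packet_bytes + NGIO_CELL_MAX_PAYLOAD - 1) / NGIO_CELL_MAX_PAYLOAD;
}
```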
[0017] For ease of comparison with the conventional method and
apparatus discussed above with respect to FIG. 1, the example
embodiment of the invention is directed in part to a network
interface card (NIC) connected to a local area network. However,
such an application is but one of several possible examples of the
invention which may, of course, be applied to any I/O node or to
any pair of devices where an improvement in transferring data
between the devices is desired for whatever reason. An I/O node
refers generally to any device or controller that connects a host
device or cluster to a network fabric. Although the example
embodiment of the invention is implemented and particularly well
suited for data from a local area network, the invention is not so
limited in its application. Other embodiments of the invention may
be implemented for other networks, especially asynchronous, bursty,
networks having widely varying and fluctuating data traffic which
is not requested by the receiving host computer.
[0018] While the embodiments of the invention can be applied to any
I/O technology, the traffic studied in local area networks
typically has fragmented data in the first 64 bytes of transferred
packets. There are inefficiencies in the fragmentation and
reassembly because the data cannot be processed at the receiving
end until the last cell containing an element of the data is
received. Large transfers, in particular, can hold up resources for
a long time since there must be a validation that the entire
payload is without uncorrectable errors. (Error correction
information may be included in the cells in addition to the header
and payload.)
[0019] As shown in FIG. 2, the VI Architectural model includes a VI
consumer 8 and a VI provider 24. A VI consumer 8 is a software
process that communicates using a Virtual Interface (VI). The VI
consumer 8 typically includes an application program 10, an
operating system communications facility 12 (e.g., Sockets, Remote
Procedure Call or RPC, MPI) and a VI user agent 14. The VI provider
24 includes the combination of a VI network interface controller
(VI NIC) 18 and a VI kernel agent 16. It connects to the NGIO
channel-based switching fabric through a channel adapter 30.
[0020] VI NIC 18 can directly access memory for data transfer
operations with the channel-based switching fabric. There are a
pair of work queues, one for send operations (a send queue 21) and
one for receive operations (receive queue 19). The work queues
store one or more descriptors 23 between the time a descriptor is
Posted (placed in the queue) and the time it is Done (when the VI NIC has
completed processing it). The descriptor 23 is a data structure
recognizable by the VI NIC that describes a data movement request,
and it includes a list of segments (a control segment, an optional
address segment and one or more data segments). The control segment
identifies the type of VI NIC data movement operation to be
performed and the status of a completed NIC data movement
operation. The data segment describes a communications buffer for
the data transfer operations. A receive queue 19 contains
descriptors that describe where to place incoming data. A send
queue 21 contains descriptors that describe the data to be
transmitted. A pair of VIs are associated using connection
primitives (e.g., VipConnectWait, VipConnectAccept,
VipConnectRequest) to allow packets sent at one VI to be received
at the other VI. A send doorbell (not shown) and a receive doorbell
(not shown) are provided for allowing the VI consumer to notify the
VI NIC 18 that work (a descriptor describing a requested data
transfer operation) has been placed in the send queue 21 and
receive queue 19, respectively.
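The following C sketch illustrates, in simplified form, the descriptor and work-queue organization described above: a control segment, an optional address segment, one or more data segments, and a send/receive queue pair with doorbells. The field layout is illustrative only and is not the actual VI descriptor format.

```c
#include <stdint.h>

enum vi_op { VI_OP_SEND, VI_OP_RECEIVE, VI_OP_RDMA_WRITE, VI_OP_RDMA_READ };

/* Control segment: identifies the operation and carries its status. */
struct control_segment {
    enum vi_op op;
    uint32_t   status;      /* filled in when the descriptor is Done */
};

/* Optional address segment: remote address for RDMA operations. */
struct address_segment {
    uint64_t remote_vaddr;
    uint32_t remote_handle;
};

/* Data segment: describes one registered communication buffer. */
struct data_segment {
    uint64_t local_vaddr;
    uint32_t local_handle;
    uint32_t length;
};

#define MAX_DATA_SEGMENTS 4

struct descriptor {
    struct control_segment ctrl;
    struct address_segment addr;   /* used only by RDMA descriptors */
    struct data_segment    data[MAX_DATA_SEGMENTS];
    uint32_t               num_data_segments;
};

#define WQ_DEPTH 32

/* One work queue; a VI has one for sends and one for receives. */
struct work_queue {
    struct descriptor *posted[WQ_DEPTH];
    uint32_t head, tail;
    volatile uint32_t doorbell;    /* rung to notify the VI NIC */
};

struct virtual_interface {
    struct work_queue send_queue;  /* send queue 21    */
    struct work_queue recv_queue;  /* receive queue 19 */
};
```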
[0021] The VI user agent 14 is a software component that enables an
operating system communication facility 12 to utilize a particular
VI provider 24. The VI user agent abstracts the details of the
underlying VI NIC hardware in accordance with an interface defined
by an operating system communication facility 12. The VI user agent
includes a library of primitives known as the VI primitives library
(VIPL) that provide functions for creating a VI (VipCreateVI), for
destroying a VI (VipDestroyVI), for connecting one VI to another VI
(e.g., VipConnectWait, VipConnectRequest), for accepting or
rejecting a VI connection request (VipConnectAccept or
VipConnectReject), for terminating or disconnecting a connection
between two VIs (VipDisconnect), to allow a process to register
process memory with a VI NIC (VipRegisterMem), to post descriptors
(to place a descriptor in a VI work queue using, e.g., VipPostSend,
VipPostRecv), etc. Details of the VI primitives (VIPL) are set
forth in the VI Architecture Specification, version 1.0, Dec. 16,
1997.
[0022] The kernel agent 16 is the privileged part of the operating
system, usually a driver supplied by the VI NIC vendor, that
performs the setup and resource management functions. These
functions include connection setup/teardown, interrupt management
and/or processing, management of system memory used by the VI NIC
and error handling. VI consumers access the kernel agent 16 using
the standard operating system mechanisms such as system calls. As
shown by arrow 26, the OS communication facility 12 makes system
calls to the VI kernel agent 16 to perform several control
operations, including to register memory. The VI architecture
requires the VI consumer to register memory to be used for data
transfer prior to submitting the request for data transfer. The
memory regions used by descriptors and data buffers are registered
prior to data transfer operations. Memory registration gives a VI
NIC a method to translate a virtual address to a physical address.
The user receives an opaque memory handle as a result of memory
registration. This allows a user to refer to a memory region using
a memory handle/virtual address pair without worrying about
crossing page boundaries and keeping track of the virtual address
to tag mapping. Memory registration enables the VI provider to
transfer data directly between the registered buffers of a VI
consumer and the channel-based switching fabric.
[0023] After registering memory, operating system communication
facility 12 can use data transfer primitives of the VIPL library of
VI user agent 14 to send and receive data. The VI Architecture
defines two types of data transfer operations: 1) send/receive
message passing, and 2) RDMA read/write operations. Once a
connection is established, the operating system facility 12 posts
the application's send and receive requests directly to the send
and receive queues. The descriptors are posted (e.g., placed in a
work queue) and then a doorbell is rung to notify the NIC that work
has been placed in the work queue. The doorbell can be rung (and
the VI NIC 18 notified of the work in the queue) without kernel
processing. The VI NIC 18 then processes the descriptor by sending
or receiving data, and then notifies the VI User Agent 14 of the
completed work using the completion queue 22. The VI NIC 18
directly performs the data transfer functions in response to the
posted descriptors.
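A minimal C sketch of the post-and-doorbell flow described in this paragraph follows. The reduced descriptor and work-queue types, the doorbell represented as a memory location, and the polling of a Done status are illustrative assumptions standing in for the hardware-specific mechanisms of a real VI NIC.

```c
#include <stdint.h>

#define WQ_DEPTH 32

struct descriptor {            /* reduced form of the earlier sketch */
    uint32_t status;           /* 0 = pending, 1 = Done               */
    uint64_t buffer_vaddr;
    uint32_t length;
};

struct work_queue {
    struct descriptor *posted[WQ_DEPTH];
    uint32_t head, tail;
    volatile uint32_t doorbell;    /* location the VI NIC watches */
};

/* Post a descriptor: place it in the work queue (it is now "Posted")
 * and ring the doorbell so the VI NIC knows work is waiting.  No
 * kernel processing is involved in this step. */
static int post_descriptor(struct work_queue *wq, struct descriptor *d) {
    uint32_t next = (wq->tail + 1) % WQ_DEPTH;
    if (next == wq->head)
        return -1;                  /* queue full */
    d->status = 0;
    wq->posted[wq->tail] = d;
    wq->tail = next;
    wq->doorbell++;                 /* "ring": a real NIC maps this as MMIO */
    return 0;
}

/* Reap a completed descriptor, as a completion queue entry would
 * report it once the VI NIC has finished the data movement. */
static struct descriptor *poll_completion(struct work_queue *wq) {
    if (wq->head == wq->tail)
        return NULL;                /* nothing outstanding */
    struct descriptor *d = wq->posted[wq->head];
    if (d->status != 1)
        return NULL;                /* NIC has not marked it Done yet */
    wq->head = (wq->head + 1) % WQ_DEPTH;
    return d;
}
```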
[0024] The NGIO/VI Architecture supports an unacknowledged class of
service at the NIC level. However, it does not perform other
transport level functions, including flow control and buffer
management. The VI Architecture Specification, version 1.0, Dec.
16, 1997 states at page 15 that "VI consumers are responsible for
managing flow control on a connection." The present invention is
designed to provide data flow control over the NGIO/VI architecture
or similar architecture.
[0025] An example embodiment of the invention is illustrated by the
block diagram in FIG. 3. Host computer 300 has a device driver 301
configured according to the VI model described above, a host
processor (CPU) 302 controlling operation of host computer 300 and
a system memory 303 coupled to host processor 302 via a host bus.
The device driver 301 is coupled to host memory 303 and to host
processor 302. It has send/receive descriptors and information for
credit based flow control. A host channel adapter (HCA) 304
connects host computer 300 to NGIO switching fabric 305. A portion
of system memory 303 is allocated for a plurality of send buffers
303-1 and receive buffers 303-2, which are registered with device
driver 301. Once the buffers are registered, device driver 301 can
transfer incoming data directly from HCA 304 to a receive buffer
303-2, and outgoing data can be directly transferred from a send
buffer 303-1 to HCA 304. Pools of associated send and receive
descriptors are also created and registered in device driver
301.
[0026] The switching fabric may contain many different switches SW
and redundant paths (not shown) throughout the fabric, such that a
plurality of messages can be traveling through the switching fabric
at any given time. The switched fabric configuration can contain a
plurality of channel adapters, such that there can be a multitude
of different messages traveling through the fabric and where all of
the various connected devices can continue operating while their
messages are traveling through the switching fabric.
[0027] I/O node 310 is connected to NGIO switching fabric 305
through target channel adapter (TCA) 311 and to LAN 320 through a
conventional LAN receive engine 313. I/O node 310 includes an I/O
controller 312 configured according to the VI model described
above. According to a feature of the invention, I/O controller 312
includes a credit register 314 storing credits indicating the
number of receive buffers available in host computer 300. Device
driver 301 is responsible for managing data flow between host
computer 300 and I/O node 310 over a channel in NGIO switching
fabric 305.
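The essence of this register-based arrangement can be captured in a few lines of C, shown below as an illustrative sketch. The structure and the use of cumulative counts with unsigned wraparound are assumptions for illustration; the description only requires a credit register that the host can update remotely and a local count of transfers.

```c
#include <stdint.h>

/* I/O controller state for one connection to the host. */
struct flow_control_state {
    /* Credit register 314: written only by the host, via RDMA writes
     * across the switching fabric.  Its value reflects the cumulative
     * number of receive buffers the host has made available. */
    volatile uint32_t credit_register;

    /* Counter 316: incremented by the I/O controller for every send
     * message (data packet) transferred to the host. */
    uint32_t transfer_count;
};

/* Send credits still usable by the I/O node.  Unsigned subtraction is
 * assumed here so the comparison keeps working after 32-bit wraparound. */
static inline uint32_t credits_available(const struct flow_control_state *fc) {
    return fc->credit_register - fc->transfer_count;
}
```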
[0028] The data transfers are optimized through the device driver
301 and I/O controller 312 at all times. This helps keep the
processor and other elements of the host computer 300 or I/O node
310 from having to expend system resources to accomplish transfer
of data blocks, since there may be access conflicts with other
functions. This method results in an immediate advantage compared
to the conventional method shown in FIG. 1, which requires several
operations across the bus, plus a direct memory access, to place
the data in a receive buffer of the host. The host
channel adapter 304 and target channel adapter 311 provide all of
the scatter/gather capability in the NGIO hardware so that the data
is immediately delivered to the target as one contiguous block of
data when possible. This minimizes the number of NGIO operations
and transaction latency while improving the efficiency of data
transfers. If necessary, data can be transferred in one or more
data packets. In such an event, the individual data packets are
successively transferred according to the same register based flow
control scheme as intact data.
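Where data must be fragmented, each packet passes through the same credit gate, as the following illustrative C sketch shows. The max_payload value and the wait_for_credit() and send_packet() stubs are assumptions standing in for the fabric's actual packet-size limit and send path (see the flow-control sketches later in this description).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stubs standing in for the real credit gate and fabric send path. */
static int wait_for_credit(void) { return 0; }       /* block until a credit exists */
static void send_packet(const uint8_t *p, size_t n) {
    (void)p;
    printf("sent packet of %zu bytes\n", n);
}

/* Fragment a block of data and send each piece under flow control.
 * Every fragment consumes one send credit, i.e. one host receive buffer. */
static int send_fragmented(const uint8_t *data, size_t len, size_t max_payload) {
    size_t offset = 0;
    while (offset < len) {
        size_t chunk = len - offset;
        if (chunk > max_payload)
            chunk = max_payload;
        if (wait_for_credit() != 0)
            return -1;                                /* credits never replenished */
        send_packet(data + offset, chunk);
        offset += chunk;
    }
    return 0;
}

int main(void) {
    uint8_t block[10000] = {0};
    return send_fragmented(block, sizeof block, 1500);  /* e.g. LAN-sized packets */
}
```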
[0029] Before connection is started between host 300 and I/O node
310, a memory token is transferred to device driver 301. The memory
token provides host 300 with access to credit register 314. The
memory token can be of any format, e.g., simply a series of bits
indicating the address of the remaining left-over data in memory of
host computer 300. In the example embodiment, the memory token
consists of a virtual address and a memory handle. The virtual
address is determined by the I/O controller 312 and when received
as part of a RDMA read operation, it is converted by a translation
table into a physical address in memory corresponding to credit
register 314. The I/O controller 312 may require that the memory
handle accompanying the RDMA read operation is the same as that
provided by it to ensure that the initiator of the RDMA read
operation is entitled to access to the data. In advanced memory
handle techniques, the memory handle may also indicate the
authorization of the RDMA read operation to access the credit
register 314. An important advantage of the example embodiment is
that only credit register 314 and a single memory token need to be
provided rather than the entire buffer list in the conventional
system in FIG. 1.
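A hedged C sketch of the memory-token mechanism follows: the token pairs the virtual address chosen by I/O controller 312 with a memory handle, and the I/O controller validates the handle and translates the address when an RDMA operation targeting credit register 314 arrives. The single-entry translation table and the field widths are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Memory token handed to the host before the connection starts. */
struct memory_token {
    uint64_t vaddr;    /* virtual address chosen by I/O controller 312 */
    uint32_t handle;   /* proves the initiator may access the register */
};

/* One translation-table entry mapping the token's virtual address to
 * the physical location of credit register 314 (illustrative only). */
struct translation_entry {
    uint64_t vaddr;
    uint32_t handle;
    volatile uint32_t *phys;   /* credit register in I/O-node memory */
};

/* Validate an incoming RDMA operation that targets the credit register:
 * the virtual address and memory handle must both match the entry the
 * I/O controller exported; otherwise access is refused. */
static volatile uint32_t *
translate_rdma_target(const struct translation_entry *tbl,
                      uint64_t vaddr, uint32_t handle) {
    if (vaddr != tbl->vaddr || handle != tbl->handle)
        return NULL;           /* initiator not entitled to access */
    return tbl->phys;
}
```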
[0030] After the initial RDMA write operation to initialize credit
register 314, device driver 301 will initiate multiple subsequent RDMA
write operations as necessary to send credits to indicate that
receive buffers 303-2 have been emptied or replenished, or that
additional receive buffers 303-2 have been allocated in memory 303
of host 300. This process is indicated by the step (1) in FIG. 3. A
key feature of this example embodiment is that I/O controller 312
does not have to manage a buffer list or buffer information.
Indeed, the send credits in credit register 314 are updated by host
300 without any participation by I/O controller 312. CPU 302
schedules the buffer setup, and corresponding RDMA write
operations, at a rate consistent with the resources on host 300. In
particular, it schedules the buffer operations, and corresponding
RDMA write operations, at the rate that it and I/O node 310 can
best consume them, thus increasing efficiency without additional
demand on I/O node 310. A key advantage of this example embodiment
is the efficiency with which the host 300 can use its resources.
Host computers, especially servers, typically have many gigabytes
of memory and a large amount of data that is being transferred to
and from a network. But the amount of memory on an I/O node is
relatively small in comparison. The granularity of cells passed
back and forth in the NGIO switching fabric allows the example
embodiment to optimize the use of receive buffers in the host
300.
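On the host side, the credit update described in this paragraph might look like the following C sketch. The rdma_write_u32() stub stands in for an RDMA write posted through the host's send queue, and the running total of granted credits is an assumption about how device driver 301 could track the cumulative value it writes into credit register 314.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for an RDMA write posted through the host's
 * send queue to the address named by the memory token. */
static int rdma_write_u32(uint64_t remote_vaddr, uint32_t remote_handle,
                          uint32_t value) {
    printf("RDMA write %u to 0x%llx (handle 0x%x)\n",
           value, (unsigned long long)remote_vaddr, remote_handle);
    return 0;
}

struct host_credit_state {
    uint64_t token_vaddr;     /* from the memory token exchanged at setup */
    uint32_t token_handle;
    uint32_t credits_granted; /* cumulative receive buffers made available */
};

/* Called whenever receive buffers 303-2 are emptied, replenished, or
 * newly allocated: bump the cumulative credit total and push it to
 * credit register 314 with an RDMA write (step (1) in FIG. 3). */
static int grant_credits(struct host_credit_state *h, uint32_t new_buffers) {
    h->credits_granted += new_buffers;
    return rdma_write_u32(h->token_vaddr, h->token_handle, h->credits_granted);
}
```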
[0031] There may be a series of data transfer operations sending
LAN data to host computer 300 from I/O controller 312 according to
the example embodiment of the invention as shown by step (2) in
FIG. 3. I/O controller 312 counts the number of transfers in
counter 316. As mentioned before, the I/O controller 312 can
immediately transfer data whenever credit register 314 is greater
than counter 316. Even though the data arrives asynchronously and
unexpectedly from the LAN, it can be promptly forwarded to host 300
since there is no need for complicated processing. Conversely, the
I/O controller 312 stops transferring data when host 300 consumes
all of the registered receive buffers 303-2. As a result, data flow
control can be simply and remotely established by host 300.
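Putting the pieces together, the transmit decision of I/O controller 312 reduces to a comparison between credit register 314 and counter 316, as in the illustrative C sketch below. The fabric_send() stub and the cumulative-count convention are assumptions; the sketch simply sends while credits remain and reports that it must wait once the two values are equal.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct flow_control_state {
    volatile uint32_t credit_register;  /* credit register 314, host-written */
    uint32_t transfer_count;            /* counter 316                        */
};

/* Placeholder for posting a send message across the switching fabric. */
static void fabric_send(const void *pkt, size_t len, uint32_t count) {
    (void)pkt;
    printf("send message #%u, %zu bytes\n", count, len);
}

/* Forward a LAN packet to host 300 if a send credit is available;
 * otherwise report that the I/O controller must wait for the host to
 * RDMA-write a larger value into the credit register. */
static int try_forward(struct flow_control_state *fc, const void *pkt, size_t len) {
    if (fc->credit_register == fc->transfer_count)
        return -1;                      /* all host receive buffers consumed */
    fc->transfer_count++;               /* one credit consumed */
    fabric_send(pkt, len, fc->transfer_count);
    return 0;
}

int main(void) {
    struct flow_control_state fc = { .credit_register = 2, .transfer_count = 0 };
    char pkt[64] = {0};
    try_forward(&fc, pkt, sizeof pkt);  /* sent: credit available       */
    try_forward(&fc, pkt, sizeof pkt);  /* sent: second credit consumed */
    return try_forward(&fc, pkt, sizeof pkt) == -1 ? 0 : 1;  /* must wait */
}
```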
[0032] The data flow control also allows corrupted packets to be
silently dropped between I/O node 310 and host 300. As described
above, counter 316 in the I/O node 310 is incremented every time a
LAN packet is sent to the host. The value
of that counter is placed inside the data packet that is sent from
I/O node 310 to host 300. If a LAN packet is received where the
counter is equal to credit register 314, the packet is discarded
and not sent. Placing the counter value in the send message itself
allows the host to detect when it has missed packets. This
increases efficiency and compensates for the receive buffer 303-2
that did not get consumed by the defective packet. Every time host
300 detects a gap in the count values in send messages, it also
updates the credit register 314, increasing the number of send
credits so that the buffer which was not filled can be used for
another data packet.
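The host-side gap detection can be sketched in C as follows. The count value is assumed to be carried in each send message, and grant_one_credit() is a stub standing in for the RDMA write that increases credit register 314 so the receive buffer skipped by a silently dropped packet can be reused.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for an RDMA write adding one send credit to register 314. */
static void grant_one_credit(void) {
    printf("credit register bumped to reclaim an unused receive buffer\n");
}

struct host_rx_state {
    uint32_t expected_count;   /* count the next send message should carry */
};

/* Process the count value carried in an arriving send message.  A gap
 * means one or more packets were silently dropped; for each missing
 * packet the host grants a replacement credit so the receive buffer
 * that was never filled can be reused for another data packet. */
static void on_send_message(struct host_rx_state *rx, uint32_t msg_count) {
    uint32_t expected = rx->expected_count;
    while (expected != msg_count) {        /* unsigned arithmetic handles wrap */
        grant_one_credit();
        expected++;
    }
    rx->expected_count = msg_count + 1;    /* next message should carry this */
}
```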
[0033] This leads to another important feature of the example
embodiment. The data flow control is set up by host channel adapter
304 and target channel adapter 311 over a channel with an
unacknowledged class of service. If data packets were corrupted on
an acknowledged channel, the data transfers would have to be
stopped and restarted to compensate. If all of the data packets were
held during that period of time, there would be a tremendous buffer
requirement.
[0034] The example embodiment uses the messaging ability of the
NGIO/VI architecture to send simple send messages over the
unacknowledged channel. Consequently, I/O node 310 does not have to
know anything about host 300 or the memory address location
destination of the data. Instead of doing an RDMA write across the
channel-based switching fabric, it uses a send message to a
particular queue pair set up by the virtual interface in device
driver 301 that has receive buffers 303-2 associated with
corresponding receive queues. So as the data comes across the
channel-based switching fabric, it goes into those buffers
automatically. The only element that needs to know the particulars
of the process is the virtual interface in device driver 301; not
even the host channel adapter 304 and target channel adapter 311
need that information, and the data transfer itself is fairly simple.
[0035] Although an example embodiment has been described, the
invention is not limited to the embodiment illustrated in FIG. 3. Indeed, an
advantage of the invention is that it is particularly useful and
widely adaptable to any I/O device having latency in data transfer
operations. In this way, data transfers can be efficient both in a
server that has a great many network I/O interfaces and in other
interfaces. The example embodiments will automatically adapt to
transfer characteristics in which large blocks of data are
generally asynchronously transferred as well as small blocks of
data. Indeed, the example embodiments will adapt to any I/O data
interface.
[0036] Other features of the invention may be apparent to those
skilled in the art from the detailed description of the example
embodiments and claims when read in connection with the
accompanying drawings. While the foregoing and following written
and illustrated disclosure focuses on disclosing example
embodiments of the invention, it should be understood that the same
is by way of illustration and example only, is not to be taken by
way of limitation and may be modified in learned practice of the
invention. While the foregoing has described what are considered to
be example embodiments of the invention, it is understood that
various modifications may be made therein and that the invention
may be implemented in various forms and embodiments, and that it
may be applied in numerous applications, only some of which have
been described herein. It is intended by the following claims to
claim all such modifications and variations.
* * * * *