U.S. patent application number 12/762407 was published by the patent office on 2010-08-12 for enabling memory transactions across a lossy network. This patent application is currently assigned to FORTINET, INC. Invention is credited to Daniel J. Maltbie, Joseph R. Mihelich, and Bert H. Tanaka.
Application Number: 12/762407
Publication Number: 20100205502
Family ID: 41432384
Publication Date: 2010-08-12

United States Patent Application 20100205502
Kind Code: A1
Tanaka; Bert H.; et al.
August 12, 2010
ENABLING MEMORY TRANSACTIONS ACROSS A LOSSY NETWORK
Abstract
Methods and systems for enabling remote programmed I/O to be
carried out across a "lossy" network are provided. According to one
embodiment, a node maps a portion of a remote memory of a remote
node into its physical address space. Memory transaction messages
(MTMs) conforming to a processor bus protocol are received by a
network interface of the
node. The MTMs destined for the remote node are encapsulated within
network packets. Each network packet is assigned a sending priority
based upon a transaction type of the encapsulated MTM and based
upon ordering rules associated with the processor bus protocol. The
network packets are organized into groups based upon sending
priority and transmitted to the remote node via a lossy network
according to the sending priorities. It is ensured that a
particular subset of the network packets having a particular
sending priority is received by the remote node in a proper
sequence.
Inventors: Tanaka; Bert H.; (Saratoga, CA); Maltbie; Daniel J.; (Santa Cruz, CA); Mihelich; Joseph R.; (Fremont, CA)
Correspondence Address: MICHAEL A DESANCTIS; HAMILTON DESANCTIS & CHA LLP; FINANCIAL PLAZA AT UNION SQUARE, 225 UNION BOULEVARD, SUITE 150; LAKEWOOD, CO 80228, US
Assignee: FORTINET, INC., Sunnyvale, CA
Family ID: 41432384
Appl. No.: 12/762407
Filed: April 19, 2010
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
11333877 (parent of present application 12762407) | Jan 17, 2006 | 7702742
60645000 (provisional) | Jan 18, 2005 |
Current U.S. Class: 714/749; 709/217; 709/236; 714/E11.113
Current CPC Class: G06F 9/546 20130101; G06F 15/16 20130101; G06F 9/466 20130101; G06F 15/167 20130101; H04L 47/24 20130101
Class at Publication: 714/749; 709/217; 709/236; 714/E11.113
International Class: G06F 15/16 20060101 G06F015/16; H04L 1/18 20060101 H04L001/18; G06F 11/14 20060101 G06F011/14
Claims
1. A computer-implemented method for a local computer system to
access a remote memory of a remote computer system by performing
remote programmed input/output (I/O), the method comprising:
mapping, by the local computer system, a portion of the remote
memory into a physical address space of the local computer system;
receiving, by a local network interface of the local computer
system coupled in communication with the remote computer system via
a lossy network, a plurality of memory transaction messages (MTMs)
from a local memory controller of the local computer system,
wherein each MTM of the plurality of MTMs comprises a set of
information and conforms to a processor bus protocol that is used
by one or more local processors of the local computer system or the
local memory controller to access a local memory of the local
computer system; determining, by the local network interface, that
the plurality of MTMs is destined for the remote computer system;
determining, by the local network interface, for each MTM of the
plurality of MTMs, a transaction type; generating a plurality of
network packets, by the local network interface, within which at
least a portion of the set of information for each MTM is
encapsulated; assigning to each network packet of the plurality of
network packets, by the local network interface, one of a plurality
of sending priorities based upon the transaction type of the MTM of
the plurality of MTMs whose said at least a portion of the set of
information is encapsulated therein and based upon ordering rules
associated with the processor bus protocol that defines relative
priorities among MTMs exchanged between (i) one or more local
processors or the local memory controller and (ii) the local
memory; organizing, by the local network interface, the plurality
of network packets into groups based upon sending priority;
transmitting, by the local network interface, the plurality of
network packets to the remote computer system via the lossy network
according to an order determined based, at least partially, upon
the sending priorities of the plurality of network packets; and
ensuring, by the local network interface, that a particular subset
of the plurality of network packets having a particular sending
priority of the sending priorities is received by the remote
computer system in a proper sequence.
2. The computer-implemented method of claim 1, wherein the local
memory controller controls the local memory, and wherein an MTM of
the plurality of MTMs comprises a request to access the remote
memory which is local to and controlled by the remote computer
system, or a response to an access request received from the remote
computer system to access the local memory.
3. The computer-implemented method of claim 2, wherein each MTM of
the plurality of MTMs is one of the following transaction types:
(a) posted request; (b) response; and (c) non-posted request.
4. The computer-implemented method of claim 3, wherein said
assigning to each network packet of the plurality of network
packets, by the local network interface, one of a plurality of
sending priorities comprises: assigning a network packet of the
plurality of network packets a first sending priority if the MTM
that that packet is encapsulating is of the posted request type;
assigning the network packet a second sending priority, which is
lower than the first sending priority, if the MTM that that packet
is encapsulating is of the response type; and assigning the network
packet a third sending priority, which is lower than the second
sending priority, if the MTM that that packet is encapsulating is
of the non-posted request type.
5. The computer-implemented method of claim 4, wherein said
organizing, by the local network interface, the plurality of
network packets into groups comprises: placing all of the network
packets of the plurality of network packets having the first
sending priority into a first queue, wherein ordering of the
network packets within the first queue is determined based upon
when the MTMs encapsulated by the network packets within the first
queue were received by the local network interface from the local
memory controller; placing all of the network packets of the
plurality of network packets having the second sending priority
into a second queue, wherein ordering of the network packets within
the second queue is determined based upon when the MTMs
encapsulated by the network packets within the second queue were
received by the local network interface from the local memory
controller; and placing all of the network packets of the plurality
of network packets having the third sending priority into a third
queue, wherein ordering of the network packets within the third
queue is determined based upon when the MTMs encapsulated by the
network packets within the third queue were received by the local
network interface from the local memory controller.
6. The computer-implemented method of claim 5, wherein said
transmitting, by the local network interface, the plurality of
network packets to the remote computer system via the lossy network
comprises: transmitting all of the network packets in the first
queue; transmitting the network packets in the second queue after
all of the network packets in the first queue have been
transmitted; and transmitting the network packets in the third
queue after all of the network packets in the first queue and the
second queue have been transmitted.
7. The computer-implemented method of claim 1, wherein said
ensuring, by the local network interface, that a particular subset
of the plurality of network packets having a particular sending
priority of the sending priorities is received by the remote
computer system in a proper sequence comprises: determining whether
an acknowledgement has been received from the remote computer
system indicating that the remote computer system has received a
particular network packet within the particular subset of the
network packets; and in response to a determination that the
acknowledgement has not been received: (a) retransmitting the
particular network packet to the remote computer system; and (b)
retransmitting, to the remote computer system, all subsequent
network packets in the particular subset of the network packets
that were transmitted to the remote computer system after the
particular network packet was transmitted.
8. The computer-implemented method of claim 7, wherein the
subsequent network packets are retransmitted in the same order as
they were sent previously.
9. The computer-implemented method of claim 1, further comprising:
receiving, by the local network interface, an incoming network
packet from the remote computer system, wherein the incoming packet
is part of a certain sequence of incoming network packets from the
remote computer system; extracting from the incoming network
packet, by the local network interface, an associated priority and
a sequence number; determining, by the local network interface,
whether the sequence number matches an expected sequence number for
the associated priority; and in response to a determination that
the sequence number matches the expected sequence number for the
associated priority, transmitting, by the local network interface,
an acknowledgement packet to the remote computer system.
10. The computer-implemented method of claim 9, further comprising:
processing, by the local network interface, the incoming network
packet; and updating, by the local network interface, the expected
sequence number for the associated priority to a new expected
sequence number.
11. The computer-implemented method of claim 10, further
comprising: in response to a determination that the sequence number
does not match the expected sequence number for the associated
priority, dropping, by the local network interface, the incoming
network packet.
12. A computer-readable storage medium tangibly embodying a set of
instructions, which when executed by one or more processors of a
local computer system cause the local computer system to perform a
method for accessing a remote memory of a remote computer system by
performing remote programmed input/output (I/O), the method
comprising: mapping a portion of the remote memory into a physical
address space of the local computer system; receiving, by a local
network interface of the local computer system coupled in
communication with the remote computer system via a lossy network,
a plurality of memory transaction messages (MTMs) from a local
memory controller of the local computer system, wherein each MTM of
the plurality of MTMs comprises a set of information and conforms
to a processor bus protocol that is used by one or more local
processors of the local computer system or the local memory
controller to access a local memory of the local computer system;
determining that the plurality of MTMs is destined for the remote
computer system; determining, for each MTM of the plurality of MTMs,
a transaction type; generating a plurality of network packets
within which an MTM of the plurality of MTMs is encapsulated;
assigning to each network packet of the plurality of network
packets one of a plurality of sending priorities based upon the
transaction type of the MTM encapsulated therein and based upon
ordering rules associated with the processor bus protocol that
defines relative priorities among MTMs exchanged between (i) one or
more local processors or the local memory controller and (ii) the
local memory; organizing the plurality of network packets into
groups based upon sending priority; transmitting the plurality of
network packets to the remote computer system via the lossy network
according to an order determined based, at least partially, upon
the sending priorities of the plurality of network packets; and
ensuring that a particular subset of the plurality of network
packets having a particular sending priority of the sending
priorities is received by the remote computer system in a proper
sequence.
13. The computer-readable storage medium of claim 12, wherein the
local memory controller controls the local memory, and wherein an
MTM of the plurality of MTMs comprises a request to access the
remote memory which is local to and controlled by the remote
computer system, or a response to an access request received from
the remote computer system to access the local memory.
14. The computer-readable storage medium of claim 13, wherein each
MTM of the plurality of MTMs is one of the following transaction
types: (a) posted request; (b) response; and (c) non-posted
request.
15. The computer-readable storage medium of claim 14, wherein said
assigning to each network packet of the plurality of network
packets one of a plurality of sending priorities comprises:
assigning a network packet of the plurality of network packets a
first sending priority if the MTM that that packet is encapsulating
is of the posted request type; assigning the network packet a
second sending priority, which is lower than the first sending
priority, if the MTM that that packet is encapsulating is of the
response type; and assigning the network packet a third sending
priority, which is lower than the second sending priority, if the
MTM that that packet is encapsulating is of the non-posted request
type.
16. The computer-readable storage medium of claim 15, wherein said
organizing the plurality of network packets into groups comprises:
placing all of the network packets of the plurality of network
packets having the first sending priority into a first queue,
wherein ordering of the network packets within the first queue is
determined based upon when the MTMs encapsulated by the network
packets within the first queue were received by the local network
interface from the local memory controller; placing all of the
network packets of the plurality of network packets having the
second sending priority into a second queue, wherein ordering of
the network packets within the second queue is determined based
upon when the MTMs encapsulated by the network packets within the
second queue were received by the local network interface from the
local memory controller; and placing all of the network packets of
the plurality of network packets having the third sending priority
into a third queue, wherein ordering of the network packets within
the third queue is determined based upon when the MTMs encapsulated
by the network packets within the third queue were received by the
local network interface from the local memory controller.
17. The computer-readable storage medium of claim 16, wherein said
transmitting the plurality of network packets to the remote
computer system via the lossy network comprises: transmitting all
of the network packets in the first queue; transmitting the network
packets in the second queue after all of the network packets in the
first queue have been transmitted; and transmitting the network
packets in the third queue after all of the network packets in the
first queue and the second queue have been transmitted.
18. The computer-readable storage medium of claim 12, wherein said
ensuring that a particular subset of the plurality of network
packets having a particular sending priority of the sending
priorities is received by the remote computer system in a proper
sequence comprises: determining whether an acknowledgement has been
received from the remote computer system indicating that the remote
computer system has received a particular network packet within the
particular subset of the network packets; and in response to a
determination that the acknowledgement has not been received: (a)
retransmitting the particular network packet to the remote computer
system; and (b) retransmitting, to the remote computer system, all
subsequent network packets in the particular subset of the network
packets that were transmitted to the remote computer system after
the particular network packet was transmitted.
19. The computer-readable storage medium of claim 18, wherein the
subsequent network packets are retransmitted in the same order as
they were sent previously.
20. A network device comprising: a network interface operable to be
coupled in communication with a remote node via a lossy network; a
local memory having stored therein one or more routines for
performing a method of accessing a remote memory of the remote node
by performing remote programmed input/output (I/O); one or more
processors coupled to the network interface and the local memory,
operable to execute the one or more routines; wherein the method
comprises: mapping, by the network device, a portion of a remote
memory of the remote node into a physical address space of the
network device; receiving, by the network interface, a plurality of
memory transaction messages (MTMs) from a local memory controller
of the network device, wherein each MTM of the plurality of MTMs
comprises a set of information and conforms to a processor bus
protocol that is used by the one or more processors or the local
memory controller to access the local memory of the network device;
determining that the plurality of MTMs is destined for the remote node;
determining for each MTM of the plurality of MTMs, a transaction
type; generating a plurality of network packets within which an MTM
of the plurality of MTMs is encapsulated; assigning to each network
packet of the plurality of network packets one of a plurality of
sending priorities based upon the transaction type of the MTM
encapsulated therein and based upon ordering rules associated with
the processor bus protocol that defines relative priorities among
MTMs exchanged between (i) the one or more processors or the local
memory controller and (ii) the local memory; organizing the
plurality of network packets into groups based upon sending
priority; transmitting the plurality of network packets to the
remote node via the lossy network according to an order determined
based, at least partially, upon the sending priorities of the
plurality of network packets; and ensuring a particular subset of
the plurality of network packets having a particular sending
priority of the sending priorities is received by the remote node
in a proper sequence.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of U.S. patent
application Ser. No. 11/333,877, filed on Jan. 17, 2006, which
claims the benefit of priority to U.S. Provisional Patent
Application No. 60/645,000, filed on Jan. 18, 2005, the contents of
both of which are hereby incorporated by reference in their
entirety for all purposes.
COPYRIGHT NOTICE
[0002] Contained herein is material that is subject to copyright
protection. The copyright owner has no objection to the facsimile
reproduction of the patent disclosure by any person as it appears
in the Patent and Trademark Office patent files or records, but
otherwise reserves all rights to the copyright whatsoever.
Copyright © 2005-2010, Fortinet, Inc.
BACKGROUND
[0003] 1. Field
[0004] Embodiments of the present invention generally relate to
distributed computing systems. In particular, embodiments of the
present invention relate to network interfaces for enabling remote
programmed input/output (I/O) to be carried out across a lossy
network.
[0005] 2. Description of the Related Art
[0006] To satisfy the ever-growing need for computing power, the
computing industry has moved towards the use of distributed
computing systems. In a distributed computing system, multiple
computing nodes are coupled together via a network to form an
overall more powerful system. One of the advantages of a
distributed computing system is that it is highly scalable. To
increase the computing power of the system, one or more computing
nodes may simply be added. Another advantage of a distributed
system is that it enables less expensive, commodity computing nodes
to be used. This makes it possible to add computing power to a
system with relatively minimal cost. Because of these and other
advantages, the popularity of distributed computing systems has
grown in recent years.
[0007] In a distributed computing system, one of the major
considerations is the ability of the various computing nodes to
communicate with each other. The more easily and efficiently the
computing nodes can communicate and interact, the more the overall
system appears to be a single integrated system. One of the aspects
of this node-to-node interaction is the ability of one node to
access the memory controlled by another node.
[0008] Currently, a local node can access the memory controlled by
a remote node in several ways. One way is through the use of
programmed I/O (input/output). With programmed I/O, a portion of
the memory controlled by the remote node is mapped into the
physical address space of the local node. Once mapped in this way,
a processor of the local node may access the remote memory portion
as if it were a part of the local node's local memory. The
processor may do this, for example, by issuing memory transaction
messages (MTMs). These MTMs may have the same format and conform to
the same processor bus protocol as the MTMs that the processor
would otherwise issue to access the local node's local memory. An
underlying component (for example, a network interface) would
encapsulate these MTMs within network packets, and send those
packets across the network to the remote node. In turn, the remote
node would process the packets and perform the requested accesses
on the memory that it controls. In this way, the processor on the
local node is able to access the memory controlled by the remote
node.
[0009] With programmed I/O, the processor of the local node expects
the same operation and result from a remote memory access as it
does from a local memory access. Thus, when performing a remote
memory access, the network needs to ensure that its behavior
satisfies the expectations of the processor. If it does not,
serious errors may result. With programmed I/O, the processor has
two main expectations. First, the processor expects the MTMs that
it issues will be processed in an order that is consistent with the
processor bus protocol. This ordering is important in ensuring
proper processing of information, deadlock avoidance, etc. Second,
the processor expects that its MTMs will be processed. The MTMs
cannot be dropped or ignored. In order to accommodate remote memory
access, a network needs to guarantee that these two expectations
are met.
[0010] Unfortunately, most standard commodity networks, such as
Ethernet, do not satisfy these conditions. In an Ethernet network,
for example, the switches within the network, under certain
circumstances, may drop packets. In the context of programmed I/O,
such dropped packets may, and most likely will, lead to serious
errors. Also, because packets may be dropped in an Ethernet
network, there is no guarantee that packets will be received and
processed in any particular order (even if dropped packets are
resent). As a result, it has thus far not been possible to use
standard commodity networks to implement remote programmed I/O.
Rather, proprietary networks such as SCI and DEC memory channel
have been used. These proprietary networks are undesirable,
however, because they tend to be expensive. Also, because they are
non-standard, they tend to be incompatible with most standard
equipment, which further increases cost. Because of these and
other shortcomings of proprietary networks, it has been difficult
up to this point to implement remote programmed I/O in a cost
effective and efficient manner.
SUMMARY
[0011] Methods and systems are described for enabling remote
programmed I/O to be carried out across a "lossy" network, such as
an Ethernet network. According to one embodiment, a
computer-implemented method is provided for a local computer system
to access a remote memory of a remote computer system by performing
remote programmed input/output (I/O). The local computer system
maps a portion of the remote memory into a physical address space
of the local computer system. A local network interface of the
local computer system, coupled in communication with the remote
computer system via a lossy network, receives multiple memory
transaction messages (MTMs) from a local memory controller of the
local computer system. Each of the MTMs includes a set of
information and conforms to a processor bus protocol that is used
by one or more local processors of the local computer system or the
local memory controller to access a local memory of the local
computer system. The local network interface determines that the MTMs
are destined for the remote computer system. The local network
interface determines a transaction type for each MTM. The local
network interface generates network packets within which at least a
portion of the set of information for each MTM is encapsulated. The
local network interface assigns to each network packet a sending
priority based upon the transaction type of the MTM whose
information is encapsulated therein and based upon ordering rules
associated with the processor bus protocol that defines relative
priorities among MTMs exchanged between (i) one or more local
processors or the local memory controller and (ii) the local
memory. The network packets are organized into groups based upon
sending priority and transmitted by the local network interface to
the remote computer system via the lossy network according to an
order determined based, at least partially, upon the sending
priorities of the network packets. The local network interface
ensures a particular subset of the network packets having a
particular sending priority is received by the remote computer
system in a proper sequence.
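The prioritization and queueing steps of the method can be sketched as follows. This is an illustrative model only, assuming the three transaction types and the strict drain order recited in claims 4 through 6; the function names and the dictionary representation of an MTM are hypothetical.

```python
from collections import deque

# Lower number = higher sending priority: posted requests first, then
# responses, then non-posted requests (cf. claim 4).
PRIORITY = {"posted": 0, "response": 1, "non-posted": 2}


def enqueue(mtms):
    """Group packets by sending priority, preserving per-queue arrival order."""
    queues = {p: deque() for p in range(3)}
    for mtm in mtms:
        packet = {"payload": mtm, "priority": PRIORITY[mtm["type"]]}
        queues[packet["priority"]].append(packet)
    return queues


def drain(queues):
    """Empty the first queue, then the second, then the third (cf. claim 6)."""
    sent = []
    for p in range(3):
        while queues[p]:
            sent.append(queues[p].popleft())
    return sent
```

For example, given the arrival sequence response, posted, non-posted, posted, the packets leave in the order posted, posted, response, non-posted.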
[0012] Other features of embodiments of the present invention will
be apparent from the accompanying drawings and from the detailed
description that follows.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] Embodiments of the present invention are illustrated by way
of example, and not by way of limitation, in the figures of the
accompanying drawings and in which like reference numerals refer to
similar elements and in which:
[0014] FIG. 1 is a functional block diagram of a distributed
computing system in which one embodiment of the present invention
may be implemented.
[0015] FIG. 2 shows a computing node in accordance with an
embodiment of the present invention.
[0016] FIG. 3 shows a computing node in accordance with an
alternative embodiment of the present invention.
[0017] FIG. 4 shows a table in which a set of processor bus
protocol ordering rules may be specified in accordance with an
embodiment of the present invention.
[0018] FIG. 5 is a functional block diagram of two computing nodes
illustrating how remote memory access may be carried out in
accordance with an embodiment of the present invention.
[0019] FIG. 6 shows the operation of a network interface in
accordance with an embodiment of the present invention.
[0020] FIG. 7 shows sample packet queues associated with a sample
remote node in accordance with an embodiment of the present
invention.
[0021] FIG. 8 shows a sample set of linked lists for sent packets
in accordance with an embodiment of the present invention.
[0022] FIG. 9 shows an updated version of the linked lists of FIG.
8 in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION
[0023] Methods and systems are described for enabling remote
programmed I/O to be carried out across a "lossy" network, such as
an Ethernet network.
[0024] With reference to FIG. 1, there is shown a functional block
diagram of a distributed computing system 100 in which one
embodiment of the present invention may be implemented. As shown,
the system 100 comprises a plurality of computing nodes 104 coupled
together by a network 102. The network 102 enables the nodes 104 to
communicate, interact, and exchange information so that the various
nodes 104 can cooperate to act as a single integrated computing
system. For purposes of the present invention, the network 102 may
be any type of network, including, but certainly not limited to, a
"lossy" network, such as an Ethernet network. As used herein, the
term lossy refers broadly to any type of network in which one or
more network packets may be dropped by the network (e.g. due to
congestion, packet corruption, etc.). Because network packets may
be dropped, a lossy network does not guarantee that packets sent in
a particular order from a source will arrive at a destination in
that order. In fact, some of the packets may not arrive at the
destination at all.
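The behavior just defined can be made concrete with a toy channel model: each packet independently survives or is dropped, so neither delivery nor delivery order is guaranteed. The function below is purely illustrative and not part of the disclosed system.

```python
import random


def lossy_send(packets, drop_probability, rng=None):
    """Return the subset of packets that survive transit, in sending order."""
    rng = rng or random.Random()
    return [p for p in packets if rng.random() >= drop_probability]
```

With a `drop_probability` of 0 every packet arrives; with 1 none do; anything in between leaves gaps that a higher-layer protocol must detect and repair.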
[0025] FIG. 2 shows one possible embodiment of a computing node
104. As shown, the node 104 comprises a processor 202 having an
integrated memory controller 204, a local memory 206, a persistent
storage 208 (e.g. a hard drive) for storing an operating system
(OS) 210 and other programs (not shown), and a network interface
220. In this architecture, the local memory 206 and the network
interface 220 are coupled to the processor 202. The processor 202
accesses the local memory 206 through the integrated memory
controller 204.
[0026] FIG. 3 shows another possible embodiment of a computing node
104. As shown, the node 104 comprises a processor 302, a north
bridge 304, a local memory 306, a persistent storage 308 for
storing an OS 210 and other programs (not shown), and a network
interface 220. In this architecture, the processor 302, local
memory 306, and network interface 220 are all coupled to the north
bridge 304. To access the local memory 306, the processor 302 goes
through the north bridge 304. In this architecture, the north
bridge 304 performs the function of a memory controller (as well as
other functions).
[0027] FIGS. 2 and 3 show just two possible computing node
architectures. For purposes of the present invention, any computing
node architecture may be used. In fact, a combination of different
node architectures may be used within the overall system 100. For
example, node 104(a) may have the architecture shown in FIG. 2,
node 104(b) may have the architecture shown in FIG. 3, and node
104(c) may have yet a different architecture. This is within the
scope of the present invention. For the sake of simplicity,
however, it will be assumed in the following sections that the
nodes 104 take on the architecture shown in FIG. 3. It should be
noted, though, that the teachings provided herein may be applied
generally to any desired node architecture.
Accessing Local Memory
[0028] With reference to FIG. 3, a method for accessing a local
memory will now be described. When a computing node 104 boots up,
its processor 302 loads and executes the OS 210 stored in the
persistent storage 308. Upon execution, the OS 210 learns the
particulars of the local memory 306. These particulars include the
size of the local memory 306. Once it knows the size of the local
memory 306, the OS 210 knows how much local physical memory space
it has to work with. With this knowledge, the OS 210 defines a
range of valid physical memory addresses. Once this range is
defined, it is provided to the north bridge 304 for future use
(basically, the north bridge 304 now knows the range of valid local
physical memory addresses).
[0029] During regular operation, the OS 210 supports the execution
of one or more programs. Execution of these programs gives rise to
virtual address spaces. As portions of these virtual address spaces
are accessed by the programs, the OS 210 maps the virtual addresses to
physical memory addresses. When the mapped virtual addresses are
accessed, the corresponding physical memory addresses are
accessed.
[0030] To access a physical memory address in local memory 306, the
OS 210 causes the processor 302 to issue a memory transaction
message (MTM) to the north bridge 304. The MTM may, for example, be
a request to write a set of data into a physical memory address, or
a request to read a set of data from a physical memory address. In
one embodiment, an MTM includes a memory access command (e.g.
write, read, etc.), a physical memory address, and a set of data
(in the case of a write; no data in the case of a read). In response to
an MTM, the north bridge 304 determines (using the physical address
range previously provided by the OS 210) whether the physical
memory address specified in the MTM is a valid local memory
address. If it is, then the north bridge 304 interacts with the
local memory 306 to carry out the requested access. If the
requested access was a write, the north bridge 304 causes the data
in the MTM to be written into the specified physical memory
address. If the requested access was a read, then the north bridge
304 causes data to be read from the specified physical memory
address. The north bridge then returns the read data to the
processor 302. In this manner, the processor 302, north bridge 304,
and local memory 306 cooperate to carry out a local memory
access.
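The dispatch just described can be sketched in a few lines. The following is a minimal illustration only; the class and field names (`MTM`, `NorthBridge`, `command`, `address`) are assumptions for this sketch and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MTM:
    """An MTM as described above: a memory access command, a physical
    memory address, and a set of data (present only for writes)."""
    command: str                   # "write" or "read"
    address: int                   # physical memory address
    data: Optional[bytes] = None   # payload for writes; None for reads

class NorthBridge:
    def __init__(self, local_range):
        self.local_range = local_range  # (start, end) provided by the OS
        self.memory = {}                # stand-in for local memory 306

    def handle(self, mtm):
        start, end = self.local_range
        if not (start <= mtm.address < end):
            # not a valid local address: pass the MTM on (e.g. toward
            # the network interface, as in the remote access sections)
            return "forward"
        if mtm.command == "write":
            self.memory[mtm.address] = mtm.data
            return None                 # posted write: no response expected
        return self.memory.get(mtm.address)  # read: return the read data
```

A posted write returns nothing, a read returns the stored data, and an out-of-range address is forwarded rather than serviced locally.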
Processor Bus Protocol
[0031] In carrying out local memory accesses, the processor 302 and
north bridge 304 implement a processor bus protocol. This protocol
dictates the manner in which the MTMs are to be exchanged. It also
governs the format of the messages to be used between the processor
302 and north bridge 304. In addition, the protocol dictates the
order in which different types of requests are to be processed. In
one embodiment, the ordering of the different types of requests is
enforced by the north bridge 304.
[0032] In accordance with a processor bus protocol, there are
basically three types of MTMs (i.e. three transaction types): (1) a
posted request; (2) a non-posted request; and (3) a response. A
posted request is a memory access request sent from the processor
302 to the north bridge 304 for which the processor 302 expects no
response. A posted write is an example of a posted request. With
a posted write, the processor 302 is simply asking the north bridge
304 to write a set of data into a physical memory address. The
processor 302 is expecting no response from the north bridge 304.
In contrast, a non-posted request is a memory access request sent
from the processor 302 to the north bridge 304 for which the
processor 302 is expecting a response. A read request is an example
of a non-posted request. With a read request, the processor 302 is
expecting the north bridge to respond with a set of read data.
Finally, a response is a message sent from the north bridge 304 to
the processor 302 in response to a previous request. A read
response is an example of a response. The north bridge is
responding to a previous read request with the data that the
processor 302 requested.
[0033] In the processor bus protocol, the three types of
transaction messages discussed above are not treated equally.
Rather, the north bridge 304 may allow some messages to pass other
messages, even though they arrive at the north bridge 304 at a
later time (by "passing", it is meant that the requested memory
access in a later received MTM is allowed to take place before the
requested memory access in an earlier received MTM). FIG. 4 shows a
table summarizing the processor bus protocol ordering rules. In the
table of FIG. 4, the columns represent a first issued MTM and the
rows represent a subsequently issued MTM. The rows and columns are
separated into transaction types, and the cells at the intersection
of the rows and columns indicate the ordering relationship between
the two transactions. The table also shows a bypass flag. This is
an additional parameter associated with an MTM that the north
bridge 304 can use to determine whether to allow one MTM to pass
another. If an MTM has its bypass flag set, then that MTM may have
permission to bypass other MTMs that do not have their bypass flag
set under certain circumstances. The table entries are defined as
follows:
[0034] Yes--The second MTM (the row) must be allowed to pass the first
MTM (the column). When blocking occurs, the second MTM is required to
pass the first MTM.
[0035] Yes/No--There are no requirements. The second MTM may pass the
first MTM, or be blocked by it.
[0036] No--The second MTM must not be allowed to pass the first MTM.
[0037] From a perusal of the table, it becomes clear that a posted
request transaction type has the highest priority of all of the
transaction types. As shown in the table of FIG. 4, if the second
MTM (row) is of the posted request transaction type, and the first
MTM (column) is of the non-posted request transaction type, then
the second MTM can pass the first MTM, as indicated by the four
"Yes's" in the cells at the intersection of these rows and columns.
Likewise, if the second MTM is of the posted request transaction
type, and the first MTM is of the response transaction type, then
the second MTM can also pass the first MTM. Thus, a posted request
can pass both a non-posted request and a response. Further perusal
of the table also reveals that a response transaction type has a
higher priority than a non-posted request transaction type. This is
shown by the fact that, if the second MTM is of the response
transaction type, and the first MTM is of the non-posted request
transaction type, then the second MTM can pass the first MTM.
Hence, from this observation, it is clear that a posted request
transaction type has the highest priority, a response transaction
type has the next highest priority, and a non-posted request
transaction type has the lowest priority. The significance of this
observation will be made clear in a later section.
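The priority ordering derived above can be captured as a simple mapping. This is a deliberately simplified sketch that collapses the Yes/No cells of the table into a strict ordering; the names and the `may_pass` helper are illustrative, not part of the specification.

```python
# Lower number = higher priority, per the observation above: posted
# requests highest, responses next, non-posted requests lowest.
PRIORITY = {
    "posted": 1,
    "response": 2,
    "non-posted": 3,
}

def may_pass(second_type, first_type):
    """True if a later-issued MTM of second_type is permitted to pass an
    earlier-issued MTM of first_type under this simplified ordering
    (the bypass-flag cases of FIG. 4 are omitted)."""
    return PRIORITY[second_type] < PRIORITY[first_type]
```

So a posted request may pass both a non-posted request and a response, a response may pass a non-posted request, and nothing may pass a posted request.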
[0038] Enforcement of these ordering rules is important for a
number of reasons. These reasons include: maintaining data
coherency (by ensuring that transactions are performed in a
deterministic order); avoiding deadlock; supporting legacy busses;
and maximizing performance. For proper operation, a processor
expects these MTM ordering rules to be enforced. If they are not,
serious errors can occur.
Accessing Remote Memory
[0039] The above discussion describes the manner in which local
memory access is carried out. With reference to FIG. 5, the manner
in which remote memory access may be performed will now be
described. FIG. 5 shows two nodes: node A 104(a) and node B 104(b).
For purposes of the following discussion, it will be assumed that
node B is the remote node that is making a portion of its local
memory available to other nodes, and that node A is the local node
that is remotely accessing the local memory of node B.
[0040] In one embodiment, to make a portion 520 of its physical
address space 310(b) (and hence, a portion of its local memory)
available to other nodes, the OS on node B instructs the network
interface 220(b) of node B to advertise the availability of portion
520 to other nodes. As part of this instruction, the OS provides an
aperture ID (for example, aperture B) associated with the portion
520, a starting address for the portion 520, and a size of the
portion 520. This information is stored by the network interface
220(b) into the upstream translation table 502. It will thereafter
be up to the network interface 220(b) to map that aperture ID to
portion 520. A point to note is that, in remotely accessing portion
520, the other nodes will be referencing the aperture ID and not
any particular physical memory address within portion 520. Using
the aperture ID in this way enables each other node's physical
address space to be decoupled from node B's physical address space;
hence, each other node is able to maintain its own physical address
space. After updating the upstream translation table 502, the
network interface 220(b) of node B sends one or more packets across
network 102 to advertise the availability of portion 520.
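The upstream translation table can be sketched as a dictionary keyed by aperture ID. The function and field names here are assumptions for illustration; the specification only requires that the aperture ID, starting address, and size be stored and later resolvable.

```python
# Hypothetical upstream translation table: aperture ID -> (start, size).
upstream_table = {}

def advertise(aperture_id, start, size):
    """Record an exported portion, as node B's OS instructs its network
    interface to do for portion 520."""
    upstream_table[aperture_id] = {"start": start, "size": size}

def resolve(aperture_id, offset):
    """Map an (aperture ID, offset) pair from an incoming packet to the
    actual physical address within the exported portion."""
    entry = upstream_table[aperture_id]
    if not 0 <= offset < entry["size"]:
        raise ValueError("offset outside aperture")
    return entry["start"] + offset
```

Note that remote nodes reference only the aperture ID and an offset, never node B's physical addresses directly, which is what keeps the two physical address spaces decoupled.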
[0041] An advertisement packet is received by the network interface
220(a) of node A and passed on to the OS executing on node A. In
response, the OS on node A augments the physical address space
510(a) of node A to include physical address portion 530, which in
this example is the same size as portion 520. The OS knows that
this portion 530 is not part of the physical address space provided
by the local memory (hence, portion 530 is shown in dashed rather
than solid lines). Nonetheless, the OS knows that it can access
portion 530 in the same manner as it can access any portion of
local memory. After augmenting the physical address space 510(a) to
include portion 530, the OS informs the north bridge of node A of
the augmentation. The north bridge will thereafter know portion 530
is mapped to the local memory of a remote node. The OS also causes
the network interface 220(a) of node A to update its downstream
translation table 504 to include the starting address of portion
530, the size of portion 530, the aperture ID (aperture B in the
current example), and the network address of node B. Thereafter,
the network interface 220(a) will know to map any access to an
address within portion 530 to aperture B of node B.
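The downstream side is the mirror image: given a physical address in an augmented portion, find the matching entry and compute the offset. The table layout and example values below are illustrative assumptions.

```python
# Hypothetical downstream translation table entries:
# (start of local portion, size, aperture ID, remote node address).
downstream_table = [
    (0x8000, 0x1000, "B", "node-B"),
]

def map_downstream(phys_addr):
    """Map a physical address within an augmented portion (e.g. portion
    530) to the remote node, its aperture ID, and an offset; the offset
    is the value applied to the portion's starting address to recover
    the original physical address."""
    for start, size, aperture, node in downstream_table:
        if start <= phys_addr < start + size:
            return node, aperture, phys_addr - start
    return None  # not a remote address
```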
[0042] Suppose now that the OS on node A causes an MTM of the
posted request transaction type to be sent to the north bridge of
node A. Suppose further that this MTM is directed to a physical
address within portion 530. Upon receiving this MTM, the north
bridge determines that this physical address is not within the
physical address range provided by the local memory. Thus, the
north bridge forwards the MTM on to the network interface 220(a).
In response, the network interface 220(a), using the information
stored in the downstream translation table 504, maps this physical
address to aperture B of node B. Also, based upon the starting
address of portion 530 and the physical address in the MTM, it
computes an offset value (this offset is the value that needs to be
applied to the starting address of portion 530 to derive the
physical address in the MTM). Once it has done that, the network
interface 220(a) composes a packet, which encapsulates at least
some if not all of the information in the MTM. This packet also
includes the aperture ID, the computed offset value, and the
network address of node B. After that is done, the network
interface 220(a) sends the packet into the network 102 destined for
node B.
[0043] Upon receiving the packet, the network interface 220(b) of
node B, using the information in the upstream translation table
502, maps the aperture ID to the starting address of portion 520.
It then derives the actual physical address within portion 520 that
is to be accessed by applying the offset value provided in the
packet to the starting address of portion 520. The network
interface 220(b) then uses the MTM information in the packet to
compose an MTM that will be sent to the north bridge 304 of the
node B. This MTM will contain all of the information that the north
bridge 304 will need to perform the requested memory access on a
particular physical address within portion 520. In this manner,
remote memory access may be set up and implemented between two
nodes on a network. In the example of FIG. 5, node B is shown as
making a portion of its local memory available to node A. It should
be noted that node A may likewise make a portion of its local
memory available to node B. In such a case, the network interface
220(a) on node A would also have an upstream translation table
similar to that shown for the network interface 220(b) of node B,
and the network interface 220(b) on node B would also have a
downstream translation table similar to that shown for the network
interface 220(a) of node A. In addition, the network interface
220(a) of node A would perform the operations just described for
the network interface 220(b) of node B, and the network interface
220(b) of node B would perform the operations just described for
the network interface 220(a) of node A.
[0044] In the above description, the OSs on both nodes A and B are
already up and running; thus, they are able to control the remote
memory access setup process (which includes advertising the
availability of portion 520, causing the upstream translation table
502 to be populated, augmenting physical address space 510(a) to
include portion 530, causing the downstream translation table 504
to be populated, etc.). It should be noted, though, that it is
possible to implement remote memory access even before the OSs are
up and running. This requires some pre-programming, but it is
possible. To illustrate how this may be done, reference will be
made to an example.
[0045] Suppose that the OS 210 of node A is not stored in the local
persistent storage 308 of node A. Instead, suppose that the OS 210
is stored remotely in portion 520 of node B's physical address
space 510(b). Suppose further that the downstream translation table
504 and the upstream translation table 502 are pre-populated with
mapping information that enables the OS 210 to be located in
portion 520.
[0046] When node A boots up, the processor 302 on node A executes a
set of code in a BIOS (basic input output system) (not shown). The
purpose of this code is to perform some very low level hardware
setup functions, and then to load and transfer control over to
an OS. In one embodiment, this BIOS code is pre-programmed to look
for the OS code 210 at a specific physical address. When the BIOS
code is executed by the processor 302 of node A, the processor 302
generates an MTM to access the specific physical address at which
the OS code 210 starts. This MTM (which would be a read request) is
passed to north bridge 304. The north bridge 304 does not recognize
this physical address; thus, it passes the MTM on to the network
interface 220(a) of node A. Using the pre-populated information in
the downstream translation table 504, the network interface 220(a)
maps the specific physical address to node B. It then composes a
network packet to encapsulate the MTM (in which the specific
physical address is translated into an aperture ID and offset), and
sends the packet into the network 102.
[0047] The packet is received by the network interface 220(b) of
node B. Using the pre-populated information in the upstream
translation table 502, the network interface 220(b) maps the
aperture ID and offset to a physical address within portion 520.
The network interface 220(b) then uses the MTM information in the
packet to compose an MTM that is sent to the north bridge 304 of
node B. In response, the north bridge 304 of node B accesses the
proper physical address within portion 520, reads the contents
therein, and returns the contents to network interface 220(b). The
network interface 220(b) then sends the contents back to the
network interface 220(a) of node A, which forwards the contents to
the north bridge 304 of node A, which forwards the contents to the
processor 302 of node A. In this manner, the processor 302 of node
A is able to load and execute an OS 210 that is located at a remote
memory. With this capability, it is possible for a node to not
store an OS 210 locally.
Other Functions Performed by Network Interface
[0048] The above discussion highlights the mapping function
performed by the network interfaces 220. In addition to this
function, the network interfaces, in one embodiment of the present
invention, also implement several other functions to enable remote
memory access to be carried out across a lossy network.
[0049] Recall from previous discussion that when a processor issues
an MTM to carry out a memory access, the processor has two major
expectations. First, it expects the memory access requested in the
MTM to be performed, i.e. the MTM cannot be dropped or ignored.
Second, the processor expects the ordering rules of the processor
bus protocol discussed above to be enforced. For a local memory
access, these expectations are not a problem. The processor bus
coupling the processor and the north bridge is a lossless bus;
thus, the MTM is assured of being received and processed by the
north bridge. In addition, the north bridge ensures that the
ordering rules are enforced. Thus, in a local memory access, these
two expectations are easily met.
[0050] The same cannot be said for a remote memory access, however.
Because the MTMs (encapsulated within network packets) now have to
travel across a potentially lossy network, there is no guarantee
that the MTMs will get to the remote node at all. Also, because the
local north bridge is no longer controlling the memory access
process, there is no guarantee that the ordering rules of the
processor bus protocol will be enforced. In addition, because
packets may be dropped, even if packets are sent in the right order
from one node, there is no guarantee that they will be received in
the same order at the remote node (dropped packets and resent
packets may cause the packets to arrive out of order). In light of
these problems, unless additional functionalities are provided, the
expectations of the processor cannot be met in a remote memory
access. In one embodiment, to enable remote memory access to be
implemented properly to meet all of the expectations of the
processor, the network interfaces 220 of the sending and receiving
nodes implement additional functionalities. These additional
functionalities: (1) facilitate enforcement of the ordering rules
of the processor bus protocol; and (2) ensure that all packets sent
to a remote node are received by the remote node, and that they are
received in the proper sequence.
Operational Overview of Network Interface
[0051] With reference to FIG. 6, there is shown an overview of the
operation of a network interface 220 in accordance with one
embodiment of the present invention. FIG. 6 shows the operation of
the network interface 220 when it is sending packets encapsulating
MTMs to a remote node. In describing this operation, reference will
be made to FIGS. 3, 5, and 6. The network interface whose operation
is being described in FIG. 6 will be assumed to be the network
interface 220(a) of node A, and the remote node will be assumed to
be node B.
[0052] In operation, the network interface 220(a) of node A
receives (block 602) a plurality of MTMs from the north bridge 304
of node A. Zero or more of these MTMs may be requests originated by
the processor 302 to access portion 530 of node A's physical
address space, which maps to portion 520 of node B's physical
address space. Also, zero or more of these MTMs may be responses
generated by the north bridge 304 in response to requests from node
B. For example, if node B sent a read request to read data from an
address of node A's local memory 306, then the north bridge 304
would read data from that address, and generate a response that
contains the read data.
[0053] Based on the information in the MTMs, the network interface
220(a) determines (block 604) that all of these MTMs are destined
for remote node B. This determination may be made based upon
information in the MTMs, the mapping information stored in the
downstream translation table 504, or both. For example, the network
interface 220(a) may use the information in the downstream
translation table 504 to map the physical address specified in an
MTM to aperture B of node B. In addition to determining a
destination node for each MTM, the network interface 220(a) also
determines (block 606) a transaction type for each MTM. In one
embodiment, these transaction types are the ones mentioned
previously, namely, posted request, non-posted request, and
response. The transaction type may be determined by examining the
memory access command in each MTM.
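The classification step in block 606 can be sketched as follows. This is a simplification under stated assumptions: it treats writes as posted and reads as non-posted, and takes a flag to mark bridge-generated responses; real MTMs may carry richer type information.

```python
def transaction_type(command, is_response=False):
    """Classify an MTM into one of the three transaction types discussed
    previously, by examining its memory access command."""
    if is_response:
        return "response"       # e.g. a read response from the bridge
    if command == "write":
        return "posted"         # posted write: no response expected
    return "non-posted"         # e.g. a read request: response expected
```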
[0054] Further, the network interface 220(a) composes (block 608),
for each MTM, a network packet to encapsulate at least a portion of
that MTM. In addition to information pertaining to the MTM, this
network packet contains all of the information needed to transport
the packet to the remote node B, which includes, for example, the
network address of node B, the aperture ID for portion 520, and an
offset value. As each network packet is composed, it is assigned
(block 610) a sending priority, and that sending priority is
inserted into the network packet. In one embodiment, the sending
priority assigned to a network packet is determined based upon the
transaction type of the MTM that that network packet is
encapsulating. As observed previously, MTMs of the posted request
type have the highest priority in the processor bus protocol
ordering rules. MTMs of the response type have the next highest
priority, and MTMs of the non-posted request type have the lowest
priority. Thus, in one embodiment, to be consistent with the
processor bus protocol ordering rules, the network interface 220(a)
assigns priorities as follows. If a network packet is encapsulating
an MTM that is of the posted request type, then a first priority is
assigned to that network packet, where the first priority is the
highest priority. If a network packet is encapsulating an MTM that
is of the response type, then a second priority is assigned to that
network packet, where the second priority is lower than the first
priority. Finally, if a network packet is encapsulating an MTM that
is of the non-posted request type, then a third priority is
assigned to that network packet, where the third priority is lower
than the second priority.
[0055] After the sending priorities are assigned, the network
interface 220(a) organizes (block 612) the network packets into
groups based upon sending priority. In one embodiment, all network
packets with first priority are put into a first queue (in one
embodiment, the last composed packet goes to the end of the queue).
All network packets with second priority are put into a second
queue, and all network packets with third priority are put into a
third queue. In one embodiment, the first, second, and third queues
are maintained per remote node. That is, there are first, second, and
third queues for remote node B, first, second, and third queues for
remote node C, and so forth.
[0056] After the network packets are organized into groups, the
network interface 220(a) sends (block 614) the network packets into
the network 102 to be transported to remote node B. The network
packets are sent in an order determined based, at least partially,
upon the sending priorities. In one embodiment, all of the network
packets in the first queue are sent first. Only if there are no
network packets in the first queue will the network packets in the
second queue be sent. Likewise, only if there are no network
packets in the first and second queues will the network packets in
the third queue be sent. By sending the network packets in this
order, the network interface 220(a) in effect enforces the ordering
rules of the processor bus protocol. In one embodiment, the sent
packets are stored in a packet buffer (not shown) on the network
interface 220(a). They remain stored in the packet buffer until
their receipt is acknowledged by the remote node B.
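The per-node queues of block 612 and the strict-priority draining of block 614 can be sketched together. The data structures and names below are illustrative assumptions; the essential behavior is that priority 1 empties before priority 2, which empties before priority 3.

```python
from collections import deque, defaultdict

# One set of three queues per remote node, created on demand
# (priority 1 = highest, matching posted requests).
queues = defaultdict(lambda: {1: deque(), 2: deque(), 3: deque()})

def enqueue(node, priority, packet):
    queues[node][priority].append(packet)  # last-composed packet at the tail

def next_packet(node):
    """Strict-priority selection: a lower-priority packet is sent only if
    every higher-priority queue for that node is currently empty."""
    for prio in (1, 2, 3):
        if queues[node][prio]:
            return queues[node][prio].popleft()
    return None
```

Run against the P1/P2/P3 example of the next section, this yields P2, then P3, then P1, exactly the order the text describes.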
[0057] Thereafter, the network interface 220(a) ensures (block 616)
that all of the sent packets are received by the remote node B, and
that they are received in a proper sequence. More specifically, the
network interface 220(a) ensures that at least a particular subset
of the network packets having a particular sending priority are
received by remote node B in a proper sequence. In one embodiment,
the network interface 220(a) does so by maintaining a plurality of
linked-lists of sent packets organized by sending priority. These
linked-lists are specific to a particular remote node. Thus, in the
current example, there are three linked-lists for remote node B:
one with all of the first priority packets; another with all of the
second priority packets; and another with all of the third priority
packets. In one embodiment, as each network packet is received by
the network interface 220(b) of the remote node B, an
acknowledgement packet is sent by the network interface 220(b)
indicating receipt of that network packet. If the network packets
are received by the remote node B in the same order as they were
sent, then the acknowledgement packets should also come back to the
network interface 220(a) on node A in that order.
[0058] Under certain circumstances however (for example, a network
packet was dropped or corrupted, or an acknowledgement packet was
dropped or corrupted), the network interface 220(a) on node A may
not receive an acknowledgement packet for a network packet within a
certain period of time. When that occurs, the network interface
220(a) resends that network packet and all network packets
following that network packet in the linked-list of which the
packet is a part. The network packets are resent in the same order
as before. The network interface 220(a) then waits for
acknowledgement of the resent packets. By monitoring for
acknowledgement packets, and by resending network packets, when
necessary, in this manner, the network interface 220(a) ensures
that all of the network packets are received by the remote node B,
and that the network packets are received in the right
sequence.
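The per-priority linked-list tracking and resend behavior amounts to a go-back-N style recovery. Below is a minimal sketch under that interpretation; the class name and methods are assumptions, and a plain deque stands in for the linked-list of sent packets.

```python
from collections import deque

class SendList:
    """One per-remote-node, per-priority list of sent-but-unacknowledged
    packets. On an acknowledgement timeout, the timed-out packet and all
    packets after it in the list are resent, in the original order."""
    def __init__(self):
        self.unacked = deque()   # sequence numbers, in sending order

    def record_sent(self, seq):
        self.unacked.append(seq)

    def acknowledge(self, seq):
        # When nothing is lost, acks arrive in sending order; drop
        # everything up to and including the acknowledged packet.
        while self.unacked and self.unacked[0] <= seq:
            self.unacked.popleft()

    def on_timeout(self):
        # Resend the oldest unacknowledged packet and all that follow it.
        return list(self.unacked)
```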
Specific Operational Examples
[0059] The above discussion provides an overview of the operation
of a network interface 220 in accordance with one embodiment of the
present invention. With reference to specific examples, the
operation of a network interface 220 will now be described in
greater detail. In the following discussion, reference will again
be made to FIGS. 3 and 5, and it will again be assumed that node A
is the local sending node and node B is the remote target node.
[0060] Suppose that the processor 302 of node A sends an MTM to the
north bridge 304 that is a read request to read from a physical
address within portion 530. Upon receiving this MTM, the north
bridge 304 determines that the physical address in the MTM is not
within the local memory 306; thus, it forwards the MTM to the
network interface 220(a) of node A to perform a remote memory
access.
[0061] Upon receiving the forwarded MTM, the network interface
220(a), using the information in the downstream translation table
504, maps the physical address in the MTM to aperture B of remote
node B; hence, it determines that the MTM is destined for remote
node B. In addition, the network interface 220(a) determines (in
the manner described previously) an offset value for enabling the
proper physical address of portion 520 to be accessed. The network
interface 220(a) further determines, from the fact that the MTM is
a read request, that the MTM is of the non-posted request
transaction type.
[0062] The network interface 220(a) proceeds to compose a network
packet (let's label it P1) to encapsulate at least a portion, if
not all, of the information in this MTM. The network interface
220(a) inserts into this packet all of the information needed to
transport the packet to remote node B, as well as information
needed by node B to properly process the packet. Thus, packet P1
may include the network address of node B, information indicating
that node A sent the packet, the aperture ID for portion 520, the
offset value, etc. One other set of information that is included in
the packet is a sending priority. In this example, the MTM is of
the non-posted request type. That being the case, the network
interface 220(a) assigns it a third and lowest priority. Another
set of information that is included in the packet is a sequence
number. This sequence number is assigned per remote node per
priority level. Assuming that this packet is the initial third
priority packet being destined for remote node B, it is assigned a
sequence number of 1 (or any other desired initial number). After
packet P1 is composed, it is put onto a third priority queue
associated with remote node B.
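The sequence numbering just described, assigned per remote node per priority level, can be sketched as a counter keyed on that pair. Names are illustrative.

```python
from collections import defaultdict

# Next sequence number for each (remote node, priority level) pair;
# each counter starts at 1 (or any other desired initial number).
_next_seq = defaultdict(lambda: 1)

def assign_seq(node, priority):
    key = (node, priority)
    seq = _next_seq[key]
    _next_seq[key] += 1
    return seq
```

Under this scheme P1 (the first third-priority packet for node B) gets sequence number 1, P2 (the first first-priority packet) also gets 1, and P3 (the next first-priority packet) gets 2, matching the example.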
[0063] Suppose now that the processor 302 of node A sends an MTM to
the north bridge 304 that is a posted write request to write some
data to a physical address within portion 530. Upon receiving this
MTM, the north bridge 304 again determines that the physical
address in the MTM is not within the local memory 306; thus, it
forwards the MTM to the network interface 220(a) of node A to
perform a remote memory access.
[0064] Upon receiving the forwarded MTM, the network interface
220(a), using the information in the downstream translation table
504, maps the physical address in the MTM to aperture B of remote
node B; hence, it determines that the MTM is destined for remote
node B. In addition, the network interface 220(a) determines an
offset value for enabling the proper physical address of portion
520 to be accessed. The network interface 220(a) further
determines, from the fact that the MTM is a posted write request,
that the MTM is of the posted request transaction type.
[0065] The network interface 220(a) proceeds to compose a network
packet (let's label it P2) to encapsulate at least a portion, if
not all, of the information in this MTM. The network interface
220(a) inserts into this packet all of the information needed to
transport the packet to remote node B, as well as information
needed by node B to properly process the packet. One other set of
information that is included in the packet is a sending priority.
In this example, the MTM is of the posted request type. Thus, the
network interface 220(a) assigns it a first and highest priority.
Another set of information that is included in the packet is a
sequence number. Assuming that this packet is the initial first
priority packet being destined for remote node B, it is assigned a
sequence number of 1 (or any other desired initial number). After
packet P2 is composed, it is put onto a first priority queue
associated with remote node B.
[0066] Suppose further that the processor 302 of node A sends
another MTM to the north bridge 304 that is a posted write
request to write some data to a physical address within portion
530. Upon receiving this MTM, the north bridge 304 again determines
that the physical address in the MTM is not within the local memory
306; thus, it forwards the MTM to the network interface 220(a) of
node A to perform a remote memory access.
[0067] Upon receiving the forwarded MTM, the network interface
220(a), using the information in the downstream translation table
504, maps the physical address in the MTM to aperture B of remote
node B; hence, it determines that the MTM is destined for remote
node B. In addition, the network interface 220(a) determines an
offset value for enabling the proper physical address of portion
520 to be accessed. The network interface 220(a) further
determines, from the fact that the MTM is a posted write request,
that the MTM is of the posted request transaction type.
[0068] The network interface 220(a) proceeds to compose a network
packet (let's label it P3) to encapsulate at least a portion, if
not all, of the information in this MTM. The network interface
220(a) inserts into this packet all of the information needed to
transport the packet to remote node B, as well as information
needed by node B to properly process the packet. One other set of
information that is included in the packet is a sending priority.
In this example, the MTM is of the posted request type. Thus, the
network interface 220(a) assigns it a first and highest priority.
Another set of information that is included in the packet is a
sequence number. Since this packet is also destined for remote node
B, and since it follows packet P2 as the next first priority
packet, it is given the next first priority sequence number, which
would be, for example, 2. After packet P3 is composed, it is put
onto the first priority queue associated with remote node B, right
after packet P2. The current queuing situation is shown in FIG. 7.
At this point, it should be noted that, in one embodiment, a
separate set of queues is established for each remote node. Thus,
one set of queues is established for node B, another set is
established for node C (if node A is sending any remote memory
access requests to node C), another set is established for node D,
and so forth.
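The composition step above can be sketched in Python. This is a minimal illustration, not the patented implementation: the `Packet` fields, class names, and the mapping of priority 1 to posted requests are assumptions drawn from the surrounding description. One set of priority queues is kept per remote node, and a separate sequence-number counter is kept per (remote node, priority) pair.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

@dataclass
class Packet:
    dest_node: str
    priority: int   # 1 = first/highest (posted requests, assumed mapping)
    seq: int
    payload: bytes

class Sender:
    def __init__(self):
        # One set of priority queues per remote node, created on demand.
        self.queues = defaultdict(lambda: {1: deque(), 2: deque(), 3: deque()})
        # Next sequence number per (remote node, priority) pair.
        self.next_seq = defaultdict(lambda: 1)

    def compose_and_enqueue(self, dest_node, priority, payload):
        # Assign the next sequence number for this node/priority pair,
        # then place the packet at the tail of the matching queue.
        seq = self.next_seq[(dest_node, priority)]
        self.next_seq[(dest_node, priority)] = seq + 1
        pkt = Packet(dest_node, priority, seq, payload)
        self.queues[dest_node][priority].append(pkt)
        return pkt
```

In the running example, packets P2 and P3 are both first-priority packets destined for node B, so they receive sequence numbers 1 and 2 on that queue, while third-priority packet P1 independently receives sequence number 1 on its own queue.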
[0069] Suppose now that the network interface 220(a) begins sending
the packets into the network 102. The network interface 220(a)
starts with the first priority queue, and sends out packets P2 and
P3, in that order (notice that even though these packets were
composed by the network interface 220(a) after packet P1, they are
sent out prior to packet P1 because of their higher priority).
After all of the packets in the first priority queue are sent, the
packets (if any) in the second priority queue may be sent. A second
priority packet will be sent only if there are currently no packets
in the first priority queue. If, before a second priority packet is
sent, another first priority packet is added to the first priority
queue, then that first priority packet will be sent before the
second priority packet. In the current example, there are no
packets in the second priority queue; thus, the network interface
220(a) proceeds to the third priority queue.
[0070] A third priority packet will be sent only if there are
currently no packets in the first and second priority queues. If,
before a third priority packet is sent, another first priority
packet is added to the first priority queue and/or another second
priority packet is added to the second priority queue, then
that/those higher priority packet(s) will be sent before the third
priority packet. In the current example, it will be assumed that no
other packets are added to the other queues; thus, the network
interface 220(a) sends packet P1 into the network 102.
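The scheduling rule described in the two paragraphs above amounts to strict-priority selection. A minimal sketch (function name is illustrative) over one remote node's set of queues:

```python
from collections import deque

def next_packet_to_send(queues):
    """Strict-priority selection over one remote node's queue set: a
    lower-priority queue is served only while every higher-priority
    queue is empty, so a newly arrived higher-priority packet always
    jumps ahead of waiting lower-priority ones."""
    for priority in sorted(queues):
        if queues[priority]:
            return queues[priority].popleft()
    return None
```

With P2 and P3 waiting on the first-priority queue and P1 on the third-priority queue, repeated calls drain the packets in the order P2, P3, P1, matching the example.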
[0071] In one embodiment, the network interface 220(a) stores each
packet in a packet buffer (not shown) residing on the network
interface 220(a). The packets in the packet buffer may be organized
into linked-lists to facilitate easy and convenient access. An example
of the linked-lists for the current situation is shown in FIG. 8.
In one embodiment, a separate set of linked-lists is maintained for
each remote node, and a separate linked-list is maintained for each
priority type. FIG. 8 shows the set of linked-lists maintained for
the packets that have been sent to node B. As shown, there is a
linked-list for first priority packets and a linked-list for third
priority packets (there is no linked-list for second priority
packets since, in the current example, no such packets have been
sent to node B). The linked-list for first priority packets shows
packet P2 as being the initial first priority packet sent. Packet
P2 has a pointer to packet P3, the next first priority packet that
was sent. The linked-list for third priority packets shows just
packet P1, since this is the only third priority packet that has
been sent. Maintaining the linked-lists in this way enables the
network interface 220(a) to easily keep track of all of the packets
that have been sent to node B.
[0072] Suppose now that after packets P2, P3, and P1 have been
sent, the processor 302 of node A sends another MTM to the north
bridge 304 that is a posted write request to write some data to a
physical address within portion 530. Again, the north bridge 304
forwards this MTM to the network interface 220(a) of node A to
perform a remote memory access.
[0073] The network interface 220(a) processes this MTM in the same
manner as that described above in connection with packets P2 and
P3, and gives rise to another first priority network packet (let's
label it P4). Once this packet is composed, it is put onto the
first priority queue for node B, and sent into the network 102. The
updated linked-lists after packet P4 has been sent to node B are
shown in FIG. 9. As can be seen, packet P4 has been added to the
first priority packets linked-list. It has been given a sequence
number of 3 since it is the third first priority packet that has
been sent to node B. Further, packet P3 has been updated to point
to packet P4. After the packets are sent, the network interface
220(a) waits for acknowledgement from the remote node B.
[0074] Suppose that the network interface 220(b) on node B receives
packet P2. When it does so, the network interface 220(b) extracts
from the packet the source information (information indicating that
node A sent the packet), the sending priority, and the sequence
number. The network interface 220(b) then determines whether this
sequence number matches the expected sequence number for this
sending node (node A) for this sending priority (first priority).
In one embodiment, the network interface 220(b) maintains an
expected sequence number for each sending priority for each sending
node. For example, it maintains an expected sequence number for
node A, first priority, another expected sequence number for node
A, second priority, and another expected sequence number for node
A, third priority. Similarly, it maintains an expected sequence
number for node C, first priority, another expected sequence number
for node C, second priority, and another expected sequence number
for node C, third priority. This expected sequence number
represents the sequence number that the network interface 220(b) is
expecting the next packet from a particular node having a
particular sending priority to have. If the extracted sequence
number is not the same as the expected sequence number, then it
means that the received packet is out of sequence or out of order
somehow. In such a case, the network interface 220(b), in one
embodiment, drops the packet. Doing so prevents node B from
processing out of order memory access requests. This in turn helps
to enforce the ordering rules of the processor bus protocol. In one
embodiment, the network interface 220(b) will keep dropping packets
until a packet with the right sequence number is received.
[0075] In the current example, however, it will be assumed that
sequence #1 is the expected sequence number for a first priority
packet from node A; thus, the network interface 220(b) accepts the
packet, and sends an acknowledgement packet back to node A. This
acknowledgement packet includes, among other information, the
sequence number of packet P2 and the sending priority of packet P2.
In addition to sending the acknowledgment packet, the network
interface 220(b) also updates the expected sequence number to 2.
Thus, the network interface 220(b) will expect the next first
priority packet from node A to have the sequence number 2.
Furthermore, the network interface 220(b) processes packet P2 to
give the MTM encapsulated therein effect. Specifically, the network
interface 220(b) extracts the MTM information from packet P2,
determines the address within portion 520 that is to be accessed,
composes an MTM to cause the requested posted write operation to be
performed (this MTM includes the data to be written into the
address), and forwards the MTM to the north bridge 304 of node B to
cause the requested access to be performed. In this manner, the
network interface 220(b) facilitates the remote posted write
process.
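The receive path of paragraphs [0074] and [0075] can be sketched as follows. This is a simplified illustration (class and method names are assumptions): one expected sequence number is kept per (sending node, priority) pair, a mismatched packet is dropped, and a matching packet is acknowledged and its expected number advanced.

```python
from collections import defaultdict

class Receiver:
    """Sketch of the receive path on node B. One expected sequence
    number is kept per (sending node, priority); a packet whose
    sequence number does not match is dropped, which keeps
    same-priority memory transactions strictly in order."""
    def __init__(self):
        self.expected = defaultdict(lambda: 1)

    def on_packet(self, src_node, priority, seq):
        key = (src_node, priority)
        if seq != self.expected[key]:
            return None  # drop until a packet with the right seq arrives
        self.expected[key] = seq + 1
        # ...here the encapsulated MTM would be forwarded to the
        # north bridge to perform the requested memory access...
        return ("ack", priority, seq)  # acknowledgement back to sender
```

Note that each (node, priority) stream advances independently: dropping an out-of-sequence first-priority packet from node A does not affect third-priority packets from node A, or any packets from node C.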
[0076] On node A's side, when the network interface 220(a) receives
the acknowledgement packet sent by node B, it extracts the priority
information and the sequence number therefrom. Using this
information, the network interface 220(a) determines that packet P2
on the first priority linked-list has been received and
acknowledged by node B. Thus, it knows that it can remove packet P2
from the first priority linked-list. It also knows that the
"earliest packet not yet acknowledged" on the first priority list
is now packet P3. In this manner, the network interface 220(a)
maintains the linked-lists.
[0077] In one embodiment, the network interface 220(a) periodically
checks to see if the "earliest packet not yet acknowledged" in each
linked-list has been acknowledged. If it has not been acknowledged
within a certain timeout period, then the network interface 220(a)
may take some action. To illustrate, suppose that packet P2 has not
been acknowledged within the timeout period. This may occur for a
number of reasons. For example, node B may have never received
packet P2 (e.g. because P2 was dropped by the network 102), or the
acknowledgement packet sent by node B may have been dropped by the
network 102. Whatever the cause, when the network interface 220(a)
determines that packet P2 has not been acknowledged and hence has
"timed out", it resends packet P2 to node B. In addition, it
resends all of the subsequent packets on the first priority
linked-list to node B. These packets are resent to node B in the
same order in which they were originally sent. Thus, in this
example, the network interface 220(a) would resend packet P2,
packet P3, and packet P4, in that order, to node B (note: in one
embodiment, the network interface 220(a) does not resend packet P1
since P1 does not have the same sending priority as the packet that
timed out (packet P2)). Resending all of the packets in the first
priority linked-list in this way helps to ensure that the remote
node B will receive all of the first priority packets, and that
they will be received in the proper sequence.
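The timeout behavior above resembles a go-back-N retransmission confined to one priority's linked-list. A minimal sketch, assuming (as in the acknowledgment handling described later) that acknowledged packets are removed from the list, so its head is always the "earliest packet not yet acknowledged":

```python
def packets_to_resend(sent_list, now, timeout):
    """If the earliest unacknowledged packet in one priority's sent
    list has timed out, that packet and every later packet in the same
    list are resent in the original send order. Other priority lists
    are left untouched. Each entry records when it was last sent."""
    if sent_list and now - sent_list[0]["sent_at"] > timeout:
        return list(sent_list)
    return []
```

In the example, a timeout on P2 triggers a resend of P2, P3, and P4 in that order, while P1 on the third-priority list is unaffected.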
[0078] The above discussion shows the example of one of the packets
in the first priority linked-list timing out. Packets in one or
more of the other linked-lists may also time out. For example,
packet P1 on the third priority linked-list may time out. In such a
case, the network interface 220(a), in one embodiment, would handle
this time out in the same manner as that described above.
Specifically, the network interface 220(a) would resend packet P1.
In addition, it would resend all subsequent packets (if any) in the
third priority linked-list. The packets would be resent in the same
order in which they were originally sent. Resending packets in this
manner ensures that all of the packets having the same priority are
received by the remote node in the proper sequence.
[0079] To illustrate the operation of network interface 220(a)
further, suppose that packets P2, P3, and P4 are sent to node B.
Suppose further that the network interface 220(b) of node B
receives all of the packets, and sends acknowledgement packets for
all of them back to node A. Suppose, however, that the network
interface 220(a) of node A does not receive the acknowledgement
packet for packet P2, but does receive the acknowledgement packets
for packets P3 and P4 before the timeout period for packet P2. In
such a case, when the network interface 220(a) receives the
acknowledgement packet for packet P3, it realizes that packet P3 is
not the "earliest packet not yet acknowledged" (packet P2 is).
Nonetheless, because the remote node would not have acknowledged
receipt of packet P3 unless it also received packet P2, the network
interface 220(a) can assume that packet P2 was also received. Thus,
in this example, the network interface 220(a) would remove both
packets P2 and P3 from the first priority linked-list, and make
packet P4 the "earliest packet not yet acknowledged". Then, when
the acknowledgment for packet P4 is received, that packet would
also be removed from the linked-list. Receipt of all of the first
priority packets by the remote node B in the proper order would
thus be assured.
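This behavior is, in effect, a cumulative acknowledgment within one priority's sent list, and can be sketched as follows (representing the list by the sequence numbers of its unacknowledged packets; the function name is illustrative):

```python
from collections import deque

def on_ack(sent_list, acked_seq):
    """Handle an acknowledgement for one priority's sent list. Because
    the receiver accepts same-priority packets only in sequence, an ack
    for sequence number N implies every earlier packet of that priority
    was also received; all entries up to and including N are removed,
    and the new head becomes the 'earliest packet not yet
    acknowledged'."""
    while sent_list and sent_list[0] <= acked_seq:
        sent_list.popleft()
```

So when the acknowledgment for P3 (sequence 2) arrives before any for P2 (sequence 1), both P2 and P3 are removed and P4 becomes the head of the list.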
[0080] In the above discussion, node A acts as the local node that
is accessing the memory of a remote node B. Thus, the network
interface 220(a) of node A performs the sending and linked-list
maintenance functions described above and the network interface
220(b) of node B performs the receiving and acknowledgement
functions described above. It should be noted, though, that node A
may also make a portion of its local memory available to node B. In
such a case, node B would act as the local node that is accessing
the memory of a remote node A. In such an arrangement, the network
interface 220(b) of node B would perform the sending and
linked-list maintenance functions described above and the network
interface 220(a) of node A would perform the receiving and
acknowledgement functions described above. In one embodiment, each
network interface 220 is capable of performing both sets of
functions. Which set of functions it performs at any particular
time will depend upon the role that its associated node is playing
at that time (e.g. acting as a local node or a remote node).
Network Interface Implementation
[0081] The functionalities provided by the network interface 220
have been described in detail. For purposes of the present
invention, the network interface 220 and these functionalities may
be realized using any desired technology. For example, the network
interface 220 and its functionalities may be realized using
hardware (e.g. hardware logic components, ASICs, etc.), software
(e.g. having one or more processors execute one or more sets of
instructions), or a combination thereof. All such implementations
are within the scope of the present invention.
[0082] At this point, it should be noted that although the
invention has been described with reference to one or more specific
embodiments, it should not be construed to be so limited. Various
modifications may be made by those of ordinary skill in the art
with the benefit of this disclosure without departing from the
spirit of the invention. Thus, the invention should not be limited
by the specific embodiments used to illustrate it but only by the
scope of the issued claims and the equivalents thereof.
* * * * *