U.S. patent application number 10/464715 was published by the patent office on 2004-12-23 for load balancer performance using affinity modification.
Invention is credited to Gage, Christopher A. S., Pozefsky, Diane P., Sarkar, Soumitra.
United States Patent Application 20040260745
Kind Code: A1
Gage, Christopher A. S.; et al.
December 23, 2004
Load balancer performance using affinity modification
Abstract
A method, system, and computer program for managing network
connectivity between a host and a server cluster. The invention
helps reduce network traffic bottlenecks at the server cluster by
instructing the host to modify its network mapping such that
messages sent by the host to the server cluster reach a selected
server cluster member without passing through a dispatching
node.
Inventors: Gage, Christopher A. S. (Raleigh, NC); Pozefsky, Diane P. (Chapel Hill, NC); Sarkar, Soumitra (Cary, NC)
Correspondence Address: Ido Tuchman, Suite 503, 69-60 108th Street, Forest Hills, NY 11375, US
Family ID: 33517335
Appl. No.: 10/464715
Filed: June 18, 2003
Current U.S. Class: 709/200; 709/225; 718/105
Current CPC Class: H04L 29/12018 20130101; H04L 67/1038 20130101; H04L 69/163 20130101; H04L 69/164 20130101; H04L 67/1002 20130101; H04L 29/12009 20130101; H04L 67/1008 20130101; H04L 61/10 20130101; H04L 69/16 20130101; H04L 67/1017 20130101
Class at Publication: 709/200; 709/225; 718/105
International Class: G06F 015/173; G06F 015/16
Claims
1. A method for managing network connectivity between a host and a
target server, the target server belonging to a server cluster, and
the server cluster including a dispatching node configured to
dispatch network traffic to cluster members, the method comprising:
receiving an initial message from the host at the dispatching node;
selecting the target server to receive the initial message; sending
the initial message to the target server; and instructing the host
to modify its network mapping such that future messages sent by the
host to the server cluster reach the target server without passing
through the dispatching node.
2. The method of claim 1, wherein instructing the host to modify
its network mapping includes directing the host to modify its
address lookup table.
3. The method of claim 1, wherein instructing the host to modify
its network mapping includes adding a redirect rule to the host's IP
(Internet Protocol) routing table such that any message sent by the
host to the server cluster is instead sent to the target
server.
4. The method of claim 1, wherein instructing the host to modify
its network mapping includes directing the host to modify its ARP
(Address Resolution Protocol) cache such that the target server's
MAC (Media Access Control) address is substituted for the server
cluster's MAC address when sending an IP datagram to the server
cluster.
5. The method of claim 1, further comprising instructing the host
to modify its network mapping from the target server to the server
cluster after a communication session between the host and the
target server is completed.
6. The method of claim 5, further comprising informing the
dispatching node that the communication session (or the affinity
relationship) between the host and the target server is
completed.
7. The method of claim 1, further comprising instructing the host
to modify its network mapping from the target server to the server
cluster after an affinity relationship is terminated based on
dispatching node configuration when a stateless protocol is
used.
8. The method of claim 7, further comprising informing the
dispatching node that the affinity relationship between the host
and the target server is completed.
9. A system for managing network connectivity between a host and a
target server, the target server belonging to a server cluster, and
the server cluster including a dispatching node configured to
dispatch network traffic to cluster members, the system comprising:
a receiving module configured to receive network messages from the
host at the dispatching node; a selecting module configured to
select the target server to receive the network messages from the
host; a dispatching module configured to dispatch the network
messages to the target server; and an instructing module configured
to instruct the host to modify its network mapping such that future
messages sent by the host to the server cluster reach the target
server without passing through the dispatching node.
10. The system of claim 9, wherein the instructing module is
further configured to direct the host to modify its address lookup
table.
11. The system of claim 9, wherein the instructing module is
further configured to add a redirect rule to the host's IP (Internet
Protocol) routing table such that any message sent by the host to
the server cluster is instead sent to the target server.
12. The system of claim 9, wherein the instructing module is
further configured to direct the host to modify its ARP (Address
Resolution Protocol) cache such that the target server's MAC (Media
Access Control) address is substituted for the server cluster's MAC
address when sending an IP datagram to the server cluster.
13. The system of claim 9, further comprising a session completion
module configured to instruct the host to modify its network
mapping from the target server to the server cluster after a
communication session between the host and the target server is
completed.
14. The system of claim 13, further comprising an informing module
configured to inform the dispatching node that the communication
session between the host and the target server is completed.
15. The system of claim 9, further comprising a session completion
module configured to instruct the host to modify its network
mapping from the target server to the server cluster after an
affinity relationship is terminated based on dispatching node
configuration.
16. The system of claim 15, further comprising an informing module
configured to inform the dispatching node that the affinity
relationship is to be terminated based on dispatching node
configuration.
17. A computer program product embodied in a tangible media
comprising: computer readable program codes coupled to the tangible
media for managing network connectivity between a host and a target
server, the target server belonging to a server cluster, and the
server cluster including a dispatching node configured to dispatch
network traffic to cluster members, the computer readable program
codes configured to cause the program to: receive an initial
message from the host at the dispatching node; select the target
server to receive the initial message; send the initial message to
the target server; and instruct the host to modify its network
mapping such that future messages sent by the host to the server
cluster reach the target server without passing through the
dispatching node.
18. The computer program product of claim 17, wherein instructing
the host to modify its network mapping includes directing the host
to modify its address lookup table.
19. The computer program product of claim 17, wherein the computer
readable program code configured to instruct the host to modify its
network mapping is further configured to add a redirect rule to the
host's IP (Internet Protocol) routing table such that any message
sent by the host to the server cluster is instead sent to the
target server.
20. The computer program product of claim 17, wherein the computer
readable program code configured to instruct the host to modify its
network mapping is further configured to direct the host to modify
its ARP (Address Resolution Protocol) table such that the target
server's MAC (Media Access Control) address is substituted for the
server cluster's MAC address.
21. The computer program product of claim 17, further comprising
computer readable program code configured to instruct the host to
modify its network mapping from the target server to the server
cluster after a communication session between the host and the
target server is completed.
22. The computer program product of claim 21, further comprising
computer readable program code configured to inform the dispatching
node that the communication session between the host and the target
server is completed.
23. A system for managing network connectivity between a host and a
target server, the target server belonging to a server cluster, and
the server cluster including a dispatching node configured to
dispatch network traffic to cluster members, the system comprising:
means for receiving an initial message from the host at the
dispatching node; means for selecting the target server to receive
the initial message; means for sending the initial message to the
target server; and means for instructing the host to modify its
network mapping such that future messages sent by the host to the
server cluster reach the target server without passing through the
dispatching node.
Description
FIELD OF THE INVENTION
[0001] The present invention relates generally to computer
networks, and more specifically to management of network
connectivity between a host and server cluster members in a
clustered network environment.
BACKGROUND
[0002] A computer network is a collection of computers, printers,
and other network devices linked together by a communication
system. Computer networks allow devices within the network to
transfer information and commands between one another. Many
computer networks are divided into smaller "sub-networks" or
"subnets" to help manage the network and to assist in message
routing. A subnet generally includes all devices in a network
segment that share a common address component. For example, a subnet
can be composed of all devices in the network having an IP
(Internet Protocol) address with the same subnet identifier.
[0003] Some network systems utilize server clusters, also called
computer farms, to handle various resources in the network. A
server cluster distributes work among its cluster members so that
no one computer (or server) becomes overwhelmed by task requests.
For example, several computers may be organized as members in a
server cluster to handle an Internet site's Web requests. Server
clusters help prevent bottlenecks in a network by harnessing the
power of multiple servers.
[0004] Generally, a server cluster includes a load balancing node
that keeps track of the availability of each cluster member and
receives all inbound communications to the server cluster. The load
balancing node systematically distributes tasks among the cluster
members. When a client or host (i.e., a computer) outside the
server cluster initially submits a request to the server cluster,
the load balancing node selects the best-suited cluster member to
handle the message. The load balancing node then passes the request
to the selected cluster member and records the selection in an
"affinity" table. In this context, the affinity is a relationship
between the network addresses of the client and (selected) server,
as well as subaddresses that identify the applications on each.
Such an affinity might be established irrespective of whether the
underlying network protocol supports connection-oriented (as in
Transmission Control Protocol, or TCP) or connectionless (User
Datagram Protocol, or UDP) service.
[0005] Once such an affinity is established between the client and
the cluster member, all future communications identifying the
established connection are sent to the same cluster member using
the affinity table until the affinity relationship is
removed. For connectionless (e.g., UDP) traffic, the duration of
the relationship can be based on a configured timer value--e.g.,
after 5 minutes of inactivity between the client and the server
applications the affinity table entry is removed. For
connection-oriented (e.g., TCP) traffic, the affinity exists as
long as the network connection exists, the termination of which can
be recognized by looking for well-defined protocol messages.
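The affinity behavior of paragraph [0005] can be sketched in code. The following is an illustrative Python model only, not part of the claimed invention: TCP entries persist until an explicit close (e.g., observing connection-termination protocol messages), while UDP entries lapse after a configured inactivity timer. All class and method names are hypothetical.

```python
import time

class AffinityTable:
    """Illustrative affinity table: maps (client, port) to a cluster member.

    TCP entries are removed when the connection terminates; UDP entries
    expire after `udp_timeout` seconds of inactivity. Hypothetical API,
    not taken from the patent text.
    """

    def __init__(self, udp_timeout=300.0, clock=time.monotonic):
        self.udp_timeout = udp_timeout
        self.clock = clock
        self._entries = {}  # (client, port) -> [server, protocol, last_seen]

    def lookup(self, client, port):
        entry = self._entries.get((client, port))
        if entry is None:
            return None
        server, proto, last_seen = entry
        # UDP affinity lapses silently after the inactivity timeout.
        if proto == "udp" and self.clock() - last_seen > self.udp_timeout:
            del self._entries[(client, port)]
            return None
        entry[2] = self.clock()  # refresh the inactivity timer
        return server

    def record(self, client, port, protocol, server):
        self._entries[(client, port)] = [server, protocol, self.clock()]

    def close(self, client, port):
        # Explicit removal, e.g. on observing TCP connection termination.
        self._entries.pop((client, port), None)
```

With an injectable clock, one can observe a UDP entry expiring after five minutes of inactivity while a TCP entry persists until `close()` is called.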
[0006] In load balancing nodes (e.g., IBM's Network Dispatcher),
such affinity configuration is typical for UDP packets from a given
host to the cluster IP address, and a given target port identifying
a "service" (e.g., Network File System (NFS) V2/V3). In the NFS
case, if there is a cluster of servers serving NFS requests, it is
beneficial to direct all UDP requests for NFS file services from a
given host (NFS client) to a given server (running NFS server
software) in the cluster because even though UDP is a stateless
(and connectionless) protocol, the given server in the cluster
might accumulate state information specific to the host (e.g., NFS
lock information handed to the NFS client running on that host)
such that directing all NFS traffic from that host to the same
server would be beneficial from a performance point of view. Since
UDP is connectionless, when to break the affinity between the host
and the server in the cluster is determined by a timer that
indicates a certain period (e.g., 10 minutes) of inactivity.
[0007] In such a load balancing scheme, when a cluster member
communicates directly with a client, it identifies itself using its
own address instead of the address of the server cluster. Outbound
traffic does not go through the load balancing node. The fact that
network traffic is being distributed between various servers in the
server cluster is invisible to the client. Moreover, to a computer
outside the server cluster, the server cluster structure is
invisible.
[0008] As mentioned above, the implementation of a conventional
server cluster model requires that all inbound network traffic
travel through the load balancing node before arriving at an
assigned server. In many applications, this overhead is perfectly
acceptable. The most commonly cited application of server clusters
is to load balance HTTP (HyperText Transfer Protocol) requests in a
Web server farm. HTTP requests are typically small inbound
messages, i.e., a GET or POST request specifying a URL (Uniform
Resource Locator) and perhaps some parameters. It is usually the
HTTP response that is large, such as an HTML (HyperText Markup
Language) file and/or an image file sent to a browser. Therefore,
conventional server cluster models work well in such
applications.
[0009] In other applications, however, the conventional server
cluster model can be quite burdensome. Requiring that each inbound
packet travel through the load balancing node can cause performance
bottlenecks at the load balancing node if the inbound messages are
large. For example, in file serving applications, such as a
clustered NAS (Network Attached Storage) configuration, the size of
inbound file write requests can be substantial. In such a case, the
overhead of reading an entire write request packet at the load
balancing node and then writing the packet back out on a NIC
(Network Interface Card) to redirect it to another server can cause
a bottleneck on the network, the load balancing node's CPU, or its
PCI bus.
SUMMARY OF THE INVENTION
[0010] The present invention addresses the above-mentioned
limitations of traditional server cluster configurations when the
networking protocol in use is TCP or UDP, each of which operates on
top of Internet Protocol (IP). It works by instructing a host
communicating with a server cluster to modify its network mapping
such that future messages sent by the host to the server cluster
reach a selected target server without passing through the load
balancing node. Such a configuration bypasses the load balancing
node and therefore beneficially eliminates potential bottlenecks at
the load balancing node due to inbound host network traffic.
[0011] Thus, an aspect of the present invention involves a method
for managing network connectivity between a host and a target
server. The target server belongs to a server cluster, and the
server cluster includes a dispatching node configured to dispatch
network traffic to the cluster members. The method includes a
receiving operation for receiving an initial message from the host
at the dispatching node, where an initial message could be a TCP
connection request for a given service (port), or a connectionless
(stateless) UDP request for a given port. A selecting operation
selects the target server to receive the initial message and a
sending operation sends the initial message to the target server.
An instructing operation requests the host to modify its network
mapping such that subsequent messages sent by the host to the
server cluster reach the target server without passing through the
dispatching node, until the dispatching node decides to end the
client-to-server-application affinity.
[0012] Another aspect of the invention is a system for managing
network connectivity between a host and a target server. As above,
the target server belongs to a server cluster, and the server
cluster includes a dispatching node configured to dispatch network
traffic to the cluster members. The system includes a receiving
module configured to receive network messages from the host at the
dispatching node. A selecting module is configured to select the
target server to receive the network messages from the host and a
dispatching module is configured to dispatch the network messages
to the target server. An instructing module is configured to
instruct the host to modify its network mapping such that
subsequent messages sent by the host to the server cluster reach
the target server without passing through the dispatching node,
until the dispatching node decides to end the
client-to-server-application affinity.
[0013] A further aspect of the invention is a computer program
product embodied in a tangible media for managing network
connectivity between a host and a target server. The computer
program includes program code configured to cause the program to
receive an initial message from the host at the dispatching node,
select the target server to receive the initial message, send the
initial message to the target server, and instruct the host to
modify its network mapping such that subsequent messages sent by
the host to the server cluster reach the target server without
passing through the dispatching node, until the dispatching node
decides to end the client-to-server-application affinity.
[0014] The foregoing and other features, utilities and advantages
of the invention will be apparent from the following more
particular description of various embodiments of the invention as
illustrated in the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] FIG. 1 shows an exemplary network environment embodying the
present invention.
[0016] FIG. 2 shows one embodiment of messages sent to and from a
server cluster in accordance with the present invention.
[0017] FIG. 3 shows a high level flowchart of operations performed
by one embodiment of the present invention.
[0018] FIG. 4 shows an exemplary system implementing the present
invention.
[0019] FIG. 5 shows a detailed flowchart of operations performed by
the embodiment described in FIG. 3.
[0020] FIG. 6 shows details of steps 530 and 536 of FIG. 5, as
applicable to the ARP broadcast method and the ICMP_REDIRECT
method.
[0021] FIG. 7 shows an example of one possible race condition that
may occur under the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0022] The following description details how the present invention
is beneficially employed to improve the performance of traditional
server clusters. Throughout the description of the invention,
reference is made to FIGS. 1-7. When referring to the figures, like
structures and elements shown throughout are indicated with like
reference numerals.
[0023] In FIG. 1, an exemplary network environment 102 embodying
the present invention is shown. It is initially noted that the
network environment 102 is presented for illustration purposes
only, and is representative of countless configurations in which
the invention may be implemented. Thus, the present invention
should not be considered limited to the system configuration shown
in the figure.
[0024] The network environment 102 includes a host 104 coupled to a
computer subnet 106. The host 104 is representative of any network
device capable of modifying its network mapping information
according to the present invention, as described in detail below.
In one embodiment of the invention, the host 104 is a NAS
client.
[0025] The subnet 106 is configured to effectuate communications
between various nodes within the network environment 102. In a
particular embodiment of the invention, the subnet 106 includes all
devices in the network environment 102 that share a common address
component. For example, the subnet 106 may comprise all devices in
the network environment 102 having IP (Internet Protocol)
addresses that belong to the same IP subnet. The subnet 106 may be
arranged using various topologies known to those skilled in the
art, such as hub, star, and local area network (LAN) arrangements,
and include various communication technologies known to those
skilled in the art, such as wired, wireless, and fiber optic
communication technologies. Furthermore, the subnet 106 may support
various communication protocols known to those skilled in the art.
In one embodiment of the present invention, the subnet 106 is
configured to support Address Resolution Protocol (ARP) and/or
Internet Control Message Protocol (ICMP), each of which operates
alongside TCP, UDP, and IP.
[0026] A server cluster 108 is also coupled to the subnet 106. As
mentioned above, the host 104 and server cluster 108 are located on
the same subnet 106. In other words, network packets sent from the
host 104 require no additional router hops to reach the server
cluster 108. The server cluster 108 comprises several servers 110
and a load balancing node 112 connected to the subnet 106. As used
herein, a server cluster 108 is a group of servers 110 selected to
appear as a single entity. Furthermore, as used herein, a load
balancing node includes any dispatcher configured to redirect work
among the servers 110. Thus, the load balancing node 112 is but one
type of dispatching node that may be utilized by the present
invention, and the dispatching node may use any criteria,
including, but not limited to, workload balancing to make its
redirection decisions. The servers 110 selected to be part of the
cluster 108 may be selected for any reason. Furthermore, the
cluster members may not necessarily be physically located close to
one another or share the same network connectivity. Every server
110 in the cluster 108, however, must have connectivity to the load
balancing node 112 and the subnet 106. It is envisioned that the
server cluster 108 may contain as many servers 110 as required by
the system to deal with average as well as peak demands from
hosts.
[0027] Each server 110 in the cluster 108 may include a load
balancer agent 114 that talks to the load balancing node 112.
Typically, these agents 114 provide server load information to the
load balancer 112 (reporting an infinite load if the server 110 is
dead and the agent 114 is not responding) to allow it to make
intelligent load balancing decisions. As discussed in more detail
below, the agent 114 may also perform additional functions, such as
monitoring when the number of TCP connections initiated by a host
104 drops to zero, allowing the load balancer 112 to regain control
of dispatching TCP connections to the server cluster IP address.
The same is the case with UDP traffic: the individual servers
110 and agents 114 must monitor when there has been a sufficient
period of inactivity in UDP traffic from the host 104 to allow the
load balancing node 112 to regain control of dispatching UDP
datagrams sent to the cluster IP address.
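The monitoring performed by the load balancer agent 114 can be illustrated as follows. This is a hedged sketch with hypothetical names, not the agent's actual interface: it tracks when a host's open TCP connection count drops to zero, and when the host's UDP traffic has been idle past the configured timeout.

```python
import time

class LoadBalancerAgent:
    """Illustrative per-server agent (hypothetical API). It detects when a
    host's TCP connection count reaches zero, or its UDP traffic has been
    idle past the configured timeout, so the load balancing node can
    regain control of dispatching for that host."""

    def __init__(self, udp_timeout, clock=time.monotonic):
        self.udp_timeout = udp_timeout
        self.clock = clock
        self.tcp_conns = {}      # host -> open TCP connection count
        self.udp_last_seen = {}  # host -> time of last UDP datagram

    def tcp_opened(self, host):
        self.tcp_conns[host] = self.tcp_conns.get(host, 0) + 1

    def tcp_closed(self, host):
        self.tcp_conns[host] -= 1
        # When the count reaches zero, the affinity can be released.
        return self.tcp_conns[host] == 0

    def udp_seen(self, host):
        self.udp_last_seen[host] = self.clock()

    def udp_idle(self, host):
        # True once the host has been silent longer than the timeout.
        last = self.udp_last_seen.get(host)
        return last is not None and self.clock() - last > self.udp_timeout
```

In use, a `True` return from `tcp_closed` or `udp_idle` would prompt the agent to notify the load balancing node that dispatching for that host can be reclaimed.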
[0028] Typically, the server cluster 108 is a collection of
computers designed to distribute network load among the cluster
members 110 so that no one server 110 becomes overwhelmed by task
requests. The load balancing node 112 performs load balancing
functions in the server cluster 108 by dispatching tasks to the
least loaded servers in the server cluster 108. The load balancing
is generally based on a scheduling algorithm and distribution of
weights associated with cluster members 110. In one configuration
of the present invention, the server cluster 108 utilizes a Network
Dispatcher developed by International Business Machines Corporation
to achieve load balancing. It is contemplated that the present
invention may be used with other network load balancing nodes, such
as various custom load balancers.
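Paragraph [0028] notes that load balancing is generally based on a scheduling algorithm and a distribution of weights. One common such algorithm, shown here purely as an illustration (the invention does not mandate any particular scheduler), is smooth weighted round-robin:

```python
class WeightedRoundRobin:
    """Sketch of a smooth weighted round-robin scheduler: a deterministic
    way a dispatching node can spread tasks in proportion to per-member
    weights. Illustrative only; names are hypothetical."""

    def __init__(self, weights):
        self.weights = dict(weights)       # member -> configured weight
        self.current = {m: 0 for m in weights}

    def next_member(self):
        # Each pick: raise every member's running score by its weight,
        # choose the highest scorer, then subtract the total weight from
        # the chosen member so picks interleave smoothly.
        total = sum(self.weights.values())
        for m, w in self.weights.items():
            self.current[m] += w
        best = max(self.current, key=self.current.get)
        self.current[best] -= total
        return best
```

Over any window of seven picks with weights 5:1:1, the heavy member is chosen five times, but never seven times in a row, which keeps the lighter members responsive.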
[0029] In a particular embodiment of the invention, the server
cluster 108 is configured as a NAS (Network-Attached Storage)
server cluster. As mentioned above, conventional server clusters
configured as clustered NAS servers are prone to network traffic
bottlenecks at the load balancing node 112 because the size of
inbound network packets can be quite large when file system write
operations are involved. As discussed in detail below, the present
invention overcomes such bottlenecks by instructing the host 104 to
modify its network mapping such that future messages sent by the
host 104 to the server cluster 108 reach a selected target server
without passing through the load balancing node 112. Such a
configuration bypasses the load balancing node 112 and therefore
beneficially eliminates potential bottlenecks at the load balancing
node 112.
[0030] While the network configuration of FIG. 1 describes the host
104 and server cluster 108 as being on the same subnet 106, this is
a typical and very useful real-world configuration. For example,
servers such as Web servers or databases that use a cluster of
Network Attached Storage devices (supporting file access protocols
like NFS and CIFS) often reside in the same IP subnet of a data
center environment. For the clustered NAS to function in high
availability mode, load balancing is typically performed. Thus, the
present invention allows the overhead of the load balancing node to
be alleviated in very common network configurations.
[0031] Referring now to FIG. 2, one embodiment of messages sent to
and from the server cluster 108 is shown. In accordance with this
embodiment, an initial message 202 is transmitted from the host 104
to the server cluster 108. It is noted that the initial message 202
may not necessarily be the first host message in a network session
between the host 104 and the server cluster 108, and may include special
information or commands, as discussed below. In general, the
initial message 202 is either a TCP connection request or UDP
datagram intended for the server cluster's virtual IP address 204.
A virtual IP address is an IP address selected to represent a
cluster or service provided by a cluster, which does not map
uniquely to a single box. The initial message 202 includes a
destination port (TCP or UDP) that identifies which application is
being accessed in the server cluster 108.
[0032] The cluster's virtual IP address 204 is mapped to the load
balancing node 112 so that the initial message 202 arrives at the
load balancing node 112. As mentioned above, the host 104, the
server cluster 108, and the cluster members are all located on the
same subnet 106. Thus, each device on the subnet 106 belongs to the
same IP subnet. For example, the host 104, the server cluster 108,
and the cluster members may all belong to the same IP subnet
"9.37.38", as shown.
[0033] After the load balancing node 112 receives the initial
message 202 from the host 104, the load balancing node 112 selects
a target server 206 to receive the initial message 202. In most
applications, the load balancing node 112 selects the target server
206 based on loading considerations; however, the present invention
is not limited to such selection criteria. Once the target server
206 is selected, the load balancing node 112 forwards the message
207 to the target server 206. Note that any message from server 206
to host 104 bypasses the load balancing node 112 and goes directly
to 104, as indicated by message 209.
[0034] After forwarding the initial message to the target server
206, the load balancing node 112 sends an instructing message 210
to the host 104. In one embodiment of the invention, the load
balancing node 112 sends the instructing message 210 only if the
host 104 is in the same subnet as the IP address of the server
cluster 108. This is easy to check since the source IP address is
available for both TCP and UDP protocols. The instructing message
210 requests that the host 104 modify its network mapping such that
future messages 212 sent by the host 104 to the server cluster 108
reach the target server 206 without passing through the load
balancing node 112. This is done by either telling the host that it
is taking a different route to the destination, or by mapping the
cluster IP address to a different physical network address. By
doing so, messages from the host 104 that would normally be
forwarded to the target server 206 using the load balancing node
112 arrive at the target server 206 directly. Thus, bottlenecks at
the load balancing node 112 due to large inbound messages can be
substantially reduced using the present invention.
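The two remapping options named in paragraph [0034], taking a different route to the destination versus mapping the cluster IP address to a different physical network address, can be modeled abstractly. This toy Python model is purely illustrative (real hosts apply these changes inside their kernel network stacks, and all names here are hypothetical):

```python
class HostNetworkMapping:
    """Toy model of a host's view of the subnet: an ARP cache (IP -> MAC)
    plus per-destination route redirects. Illustrative only."""

    def __init__(self, arp_cache):
        self.arp_cache = dict(arp_cache)  # IP -> MAC
        self.redirects = {}               # destination IP -> next-hop IP

    def apply_redirect(self, dest_ip, gateway_ip):
        # Option 1: tell the host it has "a different route to the
        # destination" (ICMP redirect style).
        self.redirects[dest_ip] = gateway_ip

    def apply_arp_override(self, dest_ip, mac):
        # Option 2: map the cluster IP address to a different physical
        # (MAC) address via the ARP cache.
        self.arp_cache[dest_ip] = mac

    def next_hop_mac(self, dest_ip):
        # Where a frame for dest_ip is actually delivered on the subnet.
        hop = self.redirects.get(dest_ip, dest_ip)
        return self.arp_cache[hop]
```

Under either option, a frame addressed to the cluster IP is delivered to the target server's MAC address rather than the load balancing node's, which is exactly the bypass described above.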
[0035] It is contemplated that the instructing message 210 may be
any message known to those skilled in the art for modifying the
host's network mapping. Thus, the content of the instructing
message 210 is implementation dependent and can vary depending on
the protocol used by the present invention. In one embodiment of
the invention, for example, an ICMP_REDIRECT message can be used to
request the network mapping change. In another embodiment, an ARP
response message can be used to request the network mapping change
when host 104 sends an ARP broadcast requesting an
IP-address-to-MAC-address mapping for the cluster IP address. More
information about the ICMP and ARP protocols can be found in
Internetworking with TCP/IP Vol. 1: Principles, Protocols, and
Architecture (4th Edition), by Douglas Comer, ISBN 0130183806.
While each technique has unique implementation aspects, their end
result is that whenever the host 104 sends another packet to the
primary cluster IP address 204, it is directed to the target server
206 without passing through the load balancing node 112.
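For the ICMP_REDIRECT option, the wire format of the message body is fixed by the ICMP specification: type 5, code 1 ("redirect for host"), a checksum, the new gateway address, and the offending datagram's IP header plus its first 8 data bytes. The sketch below builds only that ICMP body (the enclosing IP header and actual transmission are omitted, and the addresses are illustrative):

```python
import struct

def icmp_checksum(data):
    """Standard Internet checksum (RFC 1071): one's-complement sum of
    16-bit words, folded and inverted."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data)//2}H", data))
    total = (total & 0xFFFF) + (total >> 16)
    total = (total & 0xFFFF) + (total >> 16)
    return (~total) & 0xFFFF

def build_icmp_redirect(gateway_ip, original_datagram_head):
    """Build an ICMP Redirect-for-Host body (type 5, code 1) naming
    gateway_ip as the new first hop. original_datagram_head should be the
    triggering datagram's IP header plus its first 8 data bytes."""
    gw = bytes(int(octet) for octet in gateway_ip.split("."))
    # Compute the checksum over the message with a zeroed checksum field.
    zeroed = struct.pack("!BBH", 5, 1, 0) + gw + original_datagram_head
    csum = icmp_checksum(zeroed)
    return struct.pack("!BBH", 5, 1, csum) + gw + original_datagram_head
```

A receiver can validate such a message by recomputing the checksum over the whole body, which yields zero for a well-formed packet.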
[0036] In addition to sending the instructing message 210, the load
balancing node 112 can optionally send a control message 208 to the
load balancer agent running on the target server 206 after the
initial message is forwarded to the target server 206. For example,
if UDP is being used as the underlying transport protocol, the
tracking of the inactivity timeout for UDP traffic to the
configured port, whose expiration causes traffic from the host 104
to the target server 206 to once again be directed through the load
balancing node 112, must be performed by the target server 206,
since the load balancing node 112 is unable to monitor that
traffic. The target server 206 therefore has to be aware of the
timeout configured in the load balancing node 112. Note that while
the server 206 is aware of the timeout configured in the load
balancing node 112, it can choose to implement a higher timeout, if
based on its analysis of response times when communicating with the
host, it concludes that the host's path to it is slower than
expected.
[0037] Once the communication session between the host 104 and
target server 206 is completed, the host's network mapping is
returned to its original state so that future load balancing by the
load balancing node 112 can be performed. In one embodiment of the
invention, a completed communication session is defined as the
point when the total connections between the host 104 and the
target server 206 is zero in a stateful protocol (such as TCP), and
the point after a specified period of inactivity between the host
104 and the target server 206 in a stateless protocol (such as
UDP). Thus, upon completion of the communication session (i.e., a
decision by the target server 206 to terminate the special affinity
relationship between the host 104 and itself), the target server
206 sends a control message 214 to the load balancing node 112, and
the load balancing node 112 sends an instructing message 216 to the
host 104 to modify its network mapping table. This instructing
message 216 requests that the host 104 modify its network mapping
again so that messages sent to the server cluster 108 stop being
routed directly to the target server 206 and instead travel to the
load balancing node 112.
[0038] FIG. 2 also includes a second cluster IP address 218. This
address is used in another embodiment of the invention that uses
the ICMP_REDIRECT method when redirecting the host back to the load
balancing node.
[0039] In FIG. 3, a flowchart showing some of the operations
performed by one embodiment of the present invention is presented.
It should be remarked that the logical operations of the invention
may be implemented (1) as a sequence of computer executed steps
running on a computing system and/or (2) as interconnected machine
modules within the computing system. The implementation is a matter
of choice dependent on the performance requirements of the system
implementing the invention. Accordingly, the logical operations
making up the embodiments of the present invention described herein
are referred to alternatively as operations, steps, or modules.
[0040] Operation flow begins with receiving operation 302, wherein
the load balancing node receives an initial message from the host.
As mentioned above, the initial message is typically sent to a
server cluster's virtual network address and is routed to the load
balancing node by means of address mapping. In a particular
configuration of the invention, different IP addresses are used to
access different server cluster services. For example, the
cluster's NFS file service would have one server cluster IP
address, while the cluster's CIFS file service would have another
server cluster IP address. This arrangement avoids redirecting all
the traffic from a host for the cluster's services to the target
server when only one service redirection is intended.
[0041] In some real-world configurations the server cluster may
have only one cluster-wide virtual IP address and different ports
(TCP or UDP) are used to identify different services (e.g., NFS,
CIFS, etc.). Since the present invention works at the granularity
of an IP address, implementation of the invention may require that
different cluster IP addresses be assigned for different services.
Thus, a given host can be assigned to one server in the cluster for
one service, and a different server in the cluster for a different
service, based on the destination (TCP or UDP) port numbers. After
the receiving operation 302 is completed, control passes to
selecting operation 304.
[0042] At selecting operation 304, the load balancing node selects
one of the cluster members as a target server responsible for
performing tasks requested by the host. As mentioned above, the
load balancing node may select the target server for any reason.
Most often, the target server will be selected for load balancing
reasons. The load balancing node typically maintains a connection
table to keep track of which cluster member was assigned to handle
which network session. In a particular embodiment of the invention,
the load balancing node maintains connection table entries for TCP
connections, and maintains affinity (virtual connections) table
entries for UDP datagrams. Thus, in the general load balancing
function, all UDP datagrams with a given (src IP address, src port)
and (destination IP address, destination port) are directed to the
same target server in the cluster until some defined time period of
inactivity between the host and the server cluster expires.
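The UDP affinity (virtual connection) behavior described above can be sketched as a small table keyed by the source and destination address/port pair. The class, the policy callback, and the server addresses are illustrative stand-ins:

```python
import itertools

class UdpAffinityTable:
    """Sketch of the UDP affinity (virtual connection) table: datagrams
    sharing a (src IP, src port, dst IP, dst port) tuple go to the same
    target server until the entry expires. The select_server callback is
    an illustrative stand-in for the load balancing policy."""

    def __init__(self, select_server):
        self._entries = {}
        self._select_server = select_server

    def dispatch(self, src_ip, src_port, dst_ip, dst_port):
        key = (src_ip, src_port, dst_ip, dst_port)
        if key not in self._entries:
            # First datagram of a flow: make a load balancing decision.
            self._entries[key] = self._select_server()
        return self._entries[key]

    def expire(self, src_ip, src_port, dst_ip, dst_port):
        # Called when the defined inactivity period elapses.
        self._entries.pop((src_ip, src_port, dst_ip, dst_port), None)

# Example with a round-robin policy over two cluster members
# (addresses are illustrative):
servers = itertools.cycle(["9.37.38.32", "9.37.38.33"])
table = UdpAffinityTable(lambda: next(servers))
```

Until `expire` is called for a flow, every datagram with the same four-tuple is dispatched to the same cluster member.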
[0043] During selecting operation 304, the load balancing node may
also decide whether or not to initiate direct server routing
according to the present invention. Thus, it is contemplated that
the load balancing node may selectively initiate direct message
routing on a case-by-case basis based on anticipated inbound
message sizes from the host or other factors. For example, the load
balancing node may implement conventional server cluster
functionality for communication sessions with relatively small
inbound messages (e.g., HTTP requests for Web page serving). On the
other hand, the load balancing node may implement direct message
routing for communication sessions with relatively large inbound
messages (e.g., file serving using NFS or CIFS). Such decision
making is facilitated by the fact that when the underlying
transport protocol is TCP or UDP, well-known (TCP or UDP) port
numbers can be used to identify the underlying application being
accessed over the network.
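A minimal sketch of such a port-based decision follows. The specific port-to-policy assignments are assumptions for illustration; an actual deployment would configure its own policy:

```python
# Hypothetical policy: well-known destination ports identify the
# application, and services with large inbound messages get direct
# routing while small-request services remain conventionally balanced.
DIRECT_ROUTING_PORTS = {2049, 445}   # NFS, CIFS (assumed assignment)

def use_direct_routing(dst_port):
    """Return True when the load balancing node should initiate direct
    message routing for a session on this destination port."""
    return dst_port in DIRECT_ROUTING_PORTS
```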
[0044] Once the selecting operation 304 is completed, the load
balancing node then forwards the initial message to the target
server during sending operation 306. The initial message may be
directed to the target server by only changing the LAN (Local Area
Network) level MAC (Media Access Control) address of the message.
The selecting operation 304 may also include creating a connection
table entry at the load balancing node. After the sending operation
306 is completed, control passes to instructing operation 308.
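Forwarding by changing only the LAN-level MAC address can be sketched as a byte-level rewrite of an Ethernet frame header. This is an illustrative simplification; a real dispatcher performs the rewrite at the driver level:

```python
def rewrite_dest_mac(frame, new_mac):
    """Sketch of MAC-level dispatching: the IP packet, and therefore the
    cluster IP destination address, is left untouched; only the first
    six bytes of the Ethernet header (the destination MAC address) are
    replaced with the target server's MAC."""
    if len(frame) < 14:
        raise ValueError("truncated Ethernet frame")
    if len(new_mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    return new_mac + frame[6:]
```

Because the IP header is unchanged, the target server still sees the cluster IP address as the packet's destination.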
[0045] At instructing operation 308, the load balancing node
instructs the host to modify its routing table so that future
messages from the host arrive at the target server without first
passing through the load balancing node. Once the host updates its
routing table, the load balancing node is no longer required to
forward messages to the target server from the host. It is
contemplated that the load balancing node may update its connection
table to flag the fact that routing modification on the host has
been requested. It should be noted that if the host does not modify
its routing table as requested by the load balancing node, the
server cluster simply continues to function in a conventional
manner without the benefit of direct message routing.
[0046] Once affinity between the host and the target server is
established, direct communications between these nodes continues
until the network session is completed. What constitutes a
completed network session may be dependent on the specific
mechanism used to implement the present invention. For example, in
one embodiment of the invention, the network session is considered
completed after a specified period of inactivity between the host
and the target server, when a stateless protocol such as UDP is
used. In other embodiments of the invention, completion of the
network session may occur when a connection count between the host
and the target server goes to zero, when a stateful protocol such
as TCP is used.
[0047] As mentioned above, the host's network mapping is returned
to its original configuration after the communication session is
completed. Generally speaking, this procedure involves reversing
the mapping operations above. Thus, when the communication session
is finished, the target server sends a control message to the load
balancer to inform it that the session is being terminated. In
response, the load balancer sends an instructing message to the
host requesting that the host modify its network mapping again such
that messages sent to the server cluster stop being routed directly
to the target server and instead travel to the server cluster and
thus the load balancing node.
[0048] In FIG. 4, an exemplary system 402 implementing the present
invention is shown. The system 402 includes a receiving module 404
configured to receive network messages from the host at the load
balancing node. A selecting module 406 is configured to select the
target server to receive the network messages from the host. A
dispatching module 408 is configured to dispatch the network
messages to the target server. An instructing module 410 is
configured to instruct the host to modify its network mapping such
that future messages sent by the host to the server cluster reach
the target server without passing through the load balancing
node.
[0049] The system 402 may also include a session completion module
412 and an informing module 414. The session completion module 412
is configured to instruct the host to modify its network mapping
from the target server to the server cluster after a communication
session between the host and the target server is completed. The
informing module 414 is configured to inform the load balancing
node that the communication session between the host and the target
server should be completed.
[0050] In FIG. 5, a flowchart for the processing logic in the load
balancing node is shown. As stated above, the logical operations of
the invention may be implemented (1) as a sequence of computer
executed steps running on a computing system and/or (2) as
interconnected machine modules within the computing system.
Accordingly, the logical operations making up the embodiments of
the present invention described herein are referred to
alternatively as operations, steps, or modules.
[0051] Operation flow begins with the receiving operation 504,
wherein the load balancing node receives an inbound message. Once
the message is received, control passes to decision operation 506,
where the load balancing node checks whether the message is a TCP
or UDP packet from a host or a control message from a server in the
cluster. The load balancing node can distinguish the control
messages from servers in the cluster from the "application"
messages from hosts outside the cluster based on the TCP or UDP
port it receives the message on. Furthermore, messages from hosts
outside the cluster are sent to the cluster-wide (virtual) IP
address, whereas control messages from servers in the cluster
(running load balancing node agents) are sent to a different IP
address.
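The classification in decision operation 506 might be sketched as follows, assuming one cluster-wide virtual IP for application traffic and a separate (hypothetical) address and port for agent control messages:

```python
# Assumed addresses and ports, for illustration only.
CLUSTER_VIP = "9.37.38.39"    # cluster-wide virtual IP (application traffic)
CONTROL_IP = "9.37.38.50"     # hypothetical address for agent control traffic
CONTROL_PORT = 12345          # hypothetical agent control port

def classify_message(dst_ip, dst_port):
    """Sketch of decision operation 506: traffic addressed to the
    cluster-wide virtual IP is application traffic from a host outside
    the cluster; traffic to the control address and port is a control
    message from a load balancing node agent."""
    if dst_ip == CLUSTER_VIP:
        return "application"
    if dst_ip == CONTROL_IP and dst_port == CONTROL_PORT:
        return "control"
    return "unknown"
```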
[0052] If the message is from a host outside the cluster, control
proceeds to query operation 508. During this operation, the message
is checked to determine if it is an initial message from a host in
the form of a TCP connection setup request or not. If the message
is a TCP connection setup request to the cluster IP address,
control passes to selecting operation 522. If the message is not a
TCP connection setup request, as determined by query operation 508,
control proceeds to decision operation 510.
[0053] At decision operation 510, a check is made to determine if
the message is a new UDP request between a pair of IP addresses and
ports. In other words, decision operation 510 checks whether no
connection table entry exists for this source and destination IP
address pair and target port, and whether affinity for UDP packets
is configured for the target port. In decision operation 510, if
the request received is a UDP datagram for a given target port
(service) for which no affinity exists and affinity is to be
maintained (decision yields YES), then it too is an initial message
and control passes to selecting operation 522. If the decision
yields a value of NO, then control proceeds to decision operation
512.
[0054] At decision operation 512, a check is made to determine if a
connection table entry already exists for the TCP or UDP packet, in
the form of an entry whose key is <source IP address, target
(cluster) IP address, target port number>. This entry indicates
an affinity relationship between a source application on a host,
and a target application running in every server in the cluster.
The connection table entry exists for TCP as well as UDP packets,
but the latter will only exist if UDP affinity is configured for
the target port (application, e.g., the NFS well-known ports).
Control comes to decision operation 512 if the load balancing node
is operating in "legacy mode". Legacy mode operation would occur
if, for example, the host is not on the same subnet, the host's
mapping table cannot be changed, or the ICMP technique (described
later) is being used to change the host's mapping table but the
host is ignoring the ICMP_REDIRECT message. If, at decision
operation 512, it is determined that a connection table entry does
exist for the packet, control proceeds to forwarding operation 518.
If a connection table entry does not exist, control proceeds to
decision operation 514.
[0055] Decision operation 514 addresses a "race condition" that may
occur during operation of the invention. To illustrate the race
condition that may occur, reference is now made to FIG. 7. As
shown, the host 104 sends a close message 702 to the target server
206 terminating its last TCP connection. Upon receipt of the close
message 702, the target server 206 sends an end affinity message
704 to the load balancing node 112 requesting that the current
target server redirection be terminated. In response, the load
balancing node 112 sends a mapping table changing command 706 to
the host requesting that future TCP packets to the cluster IP
address be routed to the load balancing node 112 rather than the
target server 206. However, before the mapping table changing
command 706 reaches the host 104, a new TCP connection 708 is sent
from the host 104 to the target server 206. Furthermore, once the
mapping table changing command 706 is processed by the host 104,
data 710 on the new TCP connection is sent to load balancing node
112. Thus, the race condition causes traffic on the new TCP
connection to split between the load balancing node 112 and the
target server 206.
[0056] To handle this race condition, the target server 206 informs
the load balancing node 112 of the fact that the session has ended,
and the load balancing node 112 issues the mapping table changing
command 706 to the host 104, being fully prepared for the race
condition to occur. Since the load balancing node 112 is prepared
for the race condition, when it receives TCP traffic from the host
104 for which no connection table entry exists, it could keep
operating in "legacy" mode by creating a connection table entry and
sending another mapping table changing command 706 that directs the
host 104 back to the target server 206.
[0057] Returning to FIG. 5, at decision operation 512, once the
target server notes that the number of connections from the host
have dropped to 0 (zero), it sends a control message (see
identifying operation 534 where the control message is received by
the load balancing node) to the load balancing node to indicate
that it can send another mapping table changing message to the host
such that future TCP or UDP requests to the cluster go through the
load balancing node once more, thus allowing load balancing
decisions to be taken again. However, as described above, because
the host, server, and load balancing node operate independently over
the network, it is possible that before the load balancing node
receives the control message from the server and sends a mapping
table changing command to the host (see instructing operation 536),
the host has already sent a new TCP connection request directly to
the assigned server based on its old mapping table (possibly to a
different port). In that case there is no mapping table entry for
that <source IP address, destination IP address, target port> key in
the load balancing node. Later, when the load balancing node
executes instructing operation 536 and directs the host to send it
IP packets intended for the cluster IP address, it ends up receiving
packets on this new TCP connection without having seen the TCP
connection request.
[0058] Thus, decision operation 514 ensures that this possible
sequence of events is accounted for. The load balancing node
prepares for this possibility in identifying operation 534. If the
load balancing node encounters this condition in decision operation
514 (the decision yields the value YES), it understands that it
must switch the host's connection table back to the assigned
server, and control proceeds to forwarding operation 526. However,
if the decision of operation 514 yields the value NO, then control
proceeds to decision operation 516.
[0059] Control reaches decision operation 516 if the load balancing
node receives a TCP or UDP packet with a given <source IP
address, destination IP address, destination port> key for which
no connection table entry exists. This situation is only valid if
it is a UDP packet for which no affinity has been configured for the
target port (application). In this (UDP) case, if a previous UDP
packet from that host was received on a different target port for
which affinity was configured, and the load balancer used one of the
two methods to direct the host to a specific server in the cluster,
then the load balancer must enforce affinity to that same server
even for this target port, even though affinity was not configured
for it. This is another race condition that the load balancer must
deal with: once the ICMP_REDIRECT or ARP method alters the mapping
table on the host, all UDP packets from that host to any target port
will be directed to the specific server in the cluster, and this
race condition indicates a scenario where the ICMP_REDIRECT or ARP
response has simply not completed its desired side effect in the
host yet. If no affinity has been
configured for the target port, then a target server needs to be
selected to handle this particular (stateless) request, and control
passes from decision operation 516 to forwarding operation 518.
Otherwise, this is a TCP packet, no connection table entry exists,
and a packet from the same source node (host) was not previously
dispatched to a server in the cluster (the condition of decision
operation 514). Thus, this is an invalid packet and control
proceeds to discarding operation 520 where the packet is
discarded.
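The decision chain of operations 508 through 520 can be summarized in a single routing function. The dictionary-based packet representation and all parameter names are assumptions made for illustration:

```python
def route_host_packet(pkt, conn_table, affinity_ports, prior_dispatch):
    """Sketch of the FIG. 5 decision chain for a packet from a host
    (operations 508-520). `pkt` is a plain dict, `conn_table` maps
    (src IP, dst IP, dst port) keys to target servers, `affinity_ports`
    holds UDP ports with affinity configured, and `prior_dispatch`
    records hosts already redirected to a server. All of these names
    and representations are illustrative."""
    key = (pkt["src_ip"], pkt["dst_ip"], pkt["dst_port"])
    if pkt["proto"] == "tcp" and pkt.get("syn"):
        return "select_target"        # operation 508 -> 522: initial TCP message
    if (pkt["proto"] == "udp" and pkt["dst_port"] in affinity_ports
            and key not in conn_table):
        return "select_target"        # operation 510 -> 522: initial UDP message
    if key in conn_table:
        return "forward_legacy"       # operation 512 -> 518: legacy-mode entry
    if pkt["src_ip"] in prior_dispatch:
        return "forward_to_assigned"  # operation 514 -> 526: race condition
    if pkt["proto"] == "udp":
        return "forward_legacy"       # operation 516 -> 518: no-affinity UDP
    return "discard"                  # operation 520: invalid TCP packet
```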
[0060] Returning to forwarding operation 518, packet forwarding
takes place for a TCP or UDP packet in "legacy" mode, where the
invention techniques are either not applicable because the host is
in a different subnet, or the technique is not functioning because
of the host implementation (e.g., the host is ignoring
ICMP_REDIRECT messages). In this case, the target server is chosen
based on the connection table entry if control reaches the
forwarding operation 518 from decision operation 512, or based on
some other load balancing node policy (e.g., round robin, or
currently least loaded server as indicated by the load balancing
node agent on that server) if control reaches here from decision
operation 516.
[0061] Referring again to selecting operation 522, which is reached
from operations 508 or 510, a target server is selected based on
load balancing node policy (currently least loaded server,
round-robin, etc.). This operation is the point where the invention
technique might be applicable and an "initial message", either TCP
or UDP, has been received. After selecting operation 522 is
completed, control passes to generating operation 524. During
generating operation 524, a connection table entry is recorded to
reflect the affinity between the (source) host and (destination)
server in the cluster, for a given port (application). The need for
the port as part of the affinity mapping is legacy load balancing
node behavior. After generating operation 524 is completed, control
passes to forwarding operation 526. In forwarding operation 526,
the packet (TCP connection request, or UDP packet) is forwarded to
the selected server. Control then proceeds to decision operation
528.
[0062] At decision operation 528, a check is made to see if the
host (as determined by the source IP address) is in the same IP
subnet. If the host is in the same IP subnet, the invention
technique can be applied and control proceeds to instructing
operation 530. If the host is not in the same IP subnet, processing
ends. It should be noted that in some configurations, even if the
host is on the same subnet, the load balancer may choose not to use
the optimization of the present invention based, for example, on a
configured policy and a target port as mentioned above.
[0063] At instructing operation 530, the host is instructed to
change how a packet from the host, intended for a given destination
IP address, is sent to another machine on the IP network. After the
instructing operation 530 completes, control proceeds to sending
operation 532. Details of instructing operation 530 are shown in
FIG. 6.
[0064] In sending operation 532, a control message is sent from the
load balancing node to the server to which the TCP or UDP initial
message was just sent, to tell the load balancing node agent on
that node that the redirection has occurred. The sending operation
532 also indicates that the load balancing node agent should
monitor operating conditions to determine when it should switch
control back to the load balancing node. One example of such
monitoring would be involved if a TCP connection is dispatched to
it from a given host. Due to the host mapping table change, the
server will not only directly receive further TCP packets from that
host, bypassing the load balancing node, but it could also receive
new TCP connection requests. For example, certain implementations
of a service protocol can set up multiple TCP connections for
reliability, bandwidth utilization, etc. In that case, the load
balancing node tells the agent on that server to switch control
back when the number of TCP connections from that host goes to 0
(zero). For UDP packets forwarded to the server where affinity is
configured, the load balancing node tells the server to monitor
inactivity between the host and server, and when the inactivity
timeout configured in the load balancing node is observed in the
server, it should pass control back to the load balancing node.
Note that while the server is aware of the timeout configured in
the load balancing node, it can choose to implement a higher
timeout, if based on its analysis of response times when
communicating with the host, it concludes that the host's path to
it is slower than expected.
[0065] In receiving operation 534, the load balancing node receives
a message from a server in the cluster (from the load balancing
agent running on that server) indicating that the server is giving
control back to the load balancing node (because the number of TCP
connections from that host is down to 0 (zero) or because of UDP
traffic inactivity). Control then proceeds to sending operation
536.
[0066] At sending operation 536, the load balancing node sends a
message to the host to revert its network mapping tables back to
the original state such that all messages sent from that host to
the cluster IP address once again are sent to the load balancing
node, essentially reverting the host state back to what existed
before instructing operation 530 was executed. Once the sending
operation 536 is completed, the process ends. Details of
instructing operation 536 are shown in FIG. 6.
[0067] FIG. 6 shows details of operations 530 and 536 of FIG. 5, as
applicable to both the ARP broadcast method and the ICMP_REDIRECT
method described above. The process begins at decision operation
602. During this operation, the load balancing node determines
whether or not the ICMP_REDIRECT method can be used. It is
envisioned that the ICMP_REDIRECT method can be selected by a system
administrator or by testing whether the host responds to
ICMP_REDIRECT commands. If the ICMP_REDIRECT method is used,
control passes to query operation 604.
[0068] During query operation 604, the process determines whether
the host-to-cluster session has completed (see operation 536 of
FIG. 5), or if this is a new host-to-cluster session being set up
(see operation 530 of FIG. 5). If query operation 604 determines
that the host-cluster session has not completed, control passes to
sending operation 606.
[0069] At sending operation 606, the host is instructed to modify
its IP routing table using ICMP_REDIRECT messages. The format of an
ICMP_REDIRECT message is shown in Table 1. An ICMP_REDIRECT works
by changing the next hop for the host's IP traffic, in effect
telling the host to take a different route. Normally, the new next
hop named in an ICMP_REDIRECT is a router; here, the target server
plays that role. In this embodiment,
an ICMP_REDIRECT message with code value 1 instructs the host to
change its routing table such that whenever it sends an IP datagram
to the server cluster (virtual) IP address, it will send it to the
target server instead. In the ICMP_REDIRECT message, the router IP
address is the address of the target server address selected by the
load balancing node. The "IP header+first . . . " field contains
the header of an IP datagram whose target IP address is the primary
virtual cluster IP address. As mentioned above, in the event that
the host ignores the ICMP_REDIRECT message, the server cluster will
continue to operate in a conventional fashion.
TABLE 1
Format of ICMP_REDIRECT Packet
  Type (5)  |  Code (0 to 3)  |  Checksum
  Router IP address
  IP header + first 64 bits of datagram . . .
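A sketch of constructing a packet in the Table 1 layout, using the standard Internet checksum (RFC 1071), follows. The field offsets track the table above; the helper names are illustrative:

```python
import socket
import struct

def icmp_checksum(data):
    """Standard Internet checksum (RFC 1071): fold a one's-complement
    sum of 16-bit words and invert the result."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_icmp_redirect(router_ip, ip_header_plus_64_bits, code=1):
    """Assemble the Table 1 layout: Type 5 (redirect), code 1
    (redirect datagrams for the host), checksum, router IP address
    (here, the target server's address), then the offending datagram's
    header and first 64 bits of payload."""
    body = (struct.pack("!BBH", 5, code, 0)
            + socket.inet_aton(router_ip)
            + ip_header_plus_64_bits)
    return body[:2] + struct.pack("!H", icmp_checksum(body)) + body[4:]
```

For the FIG. 2 example, the router IP address would be the target server's address (9.37.38.32) and the embedded header would name the cluster IP address (9.37.38.39) as its destination.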
[0070] For inbound UDP (User Datagram Protocol) messages, the load
balancing node can direct the first UDP datagram from the host to
the target server, create a connection table entry based on
<source IP address, destination IP address, destination
port>, and then send the ICMP_REDIRECT message to the host, thus
pointing the host to the target server IP address. Returning to
FIG. 2, this redirect message would, for example, be of the form:
Router IP address=9.37.38.32, IP datagram address=9.37.38.39. If
the routing table is updated by the host 104, future datagrams from
the host 104 to the server cluster IP address 204 will be sent to
the target server 206 (IP address 9.37.38.32) directly, thus
bypassing the load balancing node 112.
[0071] Referring back to query operation 604 of FIG. 6, if it is
determined that the process is being executed because the
host-to-cluster session has completed, control passes to sending
operation 608. At sending operation 608, the host is instructed to
modify its IP routing table using ICMP_REDIRECT messages such that
whenever it sends an IP datagram to the target server, the message
is sent to the server cluster IP instead. Thus, sending operation
608 reverses the effect of the ICMP_REDIRECT message issued in
sending operation 606. The router IP address is an alternate
cluster address as discussed below.
[0072] Returning to FIG. 2, when the UDP port affinity timer for
the host 104 expires, as indicated by the control message from
server 206 to the load balancing node 112, load balancing node 112
can send another ICMP_REDIRECT message to the host 104 pointing to
the alternate server cluster IP address 218. Such an ICMP_REDIRECT
message would, for example, be of the form: Router IP
address=9.37.38.39, IP datagram address=9.37.38.40. This message
would create a host routing table entry pointing one server cluster
IP address to another (alternate) server cluster IP address. The
alternate IP address enables host messages to reach the load
balancing node 112 without causing a loop in the routing table of
the host 104. Note that for the above technique to work, it is
required that the server cluster have two virtual IP addresses,
which is not uncommon.
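The need for the alternate cluster address can be illustrated with a toy model of the host's routing table: a teardown redirect whose next hop equals the address being redirected would create a loop, so the redirect must name a distinct (alternate) address. All names here are placeholders:

```python
def apply_redirect(routing_table, dest_ip, gateway_ip):
    """Host-side effect of an ICMP_REDIRECT: future packets for
    dest_ip are sent via gateway_ip instead."""
    routing_table[dest_ip] = gateway_ip

def next_hop(routing_table, dest_ip, max_hops=8):
    """Resolve the address packets are actually sent to by following
    redirect entries; a self-referential entry would loop forever,
    so the walk is capped."""
    hops = 0
    while dest_ip in routing_table:
        dest_ip = routing_table[dest_ip]
        hops += 1
        if hops > max_hops:
            raise RuntimeError("routing loop")
    return dest_ip
```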
[0073] For inbound TCP (Transmission Control Protocol) messages,
the load balancing node 112 can create a connection table entry for
the first TCP connection request from the host 104, forward the
request to the target server 206, and send an ICMP_REDIRECT message
to the host 104. The ICMP_REDIRECT message could, for example, be
of the form: Router IP address=9.37.38.32, IP datagram
address=9.37.38.39. Future TCP packets sent by the host 104 on that
connection would be sent to the target server 206 (IP address
9.37.38.32) directly, bypassing the load balancing node 112.
[0074] With TCP, it is important to redirect the host 104 back to
the load balancing node 112 when the total number of TCP
connections between the host 104 and the target server 206 is zero.
Since the load balancing node 112 does not see any inbound TCP
packets after the first connection is established between the host
104 and the target server 206, information about when the
connection count goes to zero must come from the target server 206.
This can be achieved by adding code in the load balancing node
agent that typically runs in each server (to report load, etc.),
extending such an agent to monitor the number of TCP connections,
or UDP traffic inactivity, in response to receiving control
messages from the load balancing node as in step 532 in FIG. 5.
Such load balancing node agent extensions can be implemented by
using well known techniques for monitoring TCP/IP traffic on a
given operating system, which typically involves writing
kernel-layer "wedge" drivers (e.g., a TDI filter driver on
Microsoft's Windows operating system) and sending control messages
to the load balancing node in response to the conditions being
observed. Windows is a registered trademark of Microsoft
Corporation in the United States and other countries.
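The agent extension described above, counting TCP connections per host and signaling the load balancing node when the count returns to zero, might be sketched as follows. The notify callback stands in for the (unspecified) control channel back to the load balancing node:

```python
class AgentConnectionMonitor:
    """Sketch of the load balancing node agent extension: count TCP
    connections from a given host and emit a control message when the
    count returns to zero. `notify` is an illustrative stand-in for
    the control channel back to the load balancing node."""

    def __init__(self, notify):
        self._counts = {}
        self._notify = notify

    def connection_opened(self, host_ip):
        self._counts[host_ip] = self._counts.get(host_ip, 0) + 1

    def connection_closed(self, host_ip):
        self._counts[host_ip] -= 1
        if self._counts[host_ip] == 0:
            del self._counts[host_ip]
            # Give control back to the load balancing node (cf. the
            # control message received in operation 534).
            self._notify(host_ip)
```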
[0075] Returning to FIG. 6, if at decision operation 602 it is
determined that the ICMP_REDIRECT method is not being used, control
passes to waiting operation 610.
[0076] At waiting operation 610, the process waits until an ARP
broadcast message is issued from the host requesting the MAC
address of any of the configured cluster IP addresses. During the
waiting operation 610, messages from the host are sent to the
server cluster, received by load balancing node, and then forwarded
to the target server in a conventional manner until an ARP
broadcast is received from the host to refresh the host's ARP
cache. Once an ARP broadcast message is received from the host,
control passes to query operation 612.
[0077] At query operation 612, the process determines whether the
communication session between the host and the server cluster has
ended. If the session has not ended, then a new host-to-cluster
session is being set up, and control passes to sending operation
614.
[0078] At sending operation 614, the host is instructed to modify
its ARP cache such that the MAC address associated with the cluster
IP address is that of the target server instead of the MAC address
of the load balancing node. Thus, in response to the ARP broadcast,
the load balancing node returns the MAC address of the target
server to the host rather than its own MAC address. As a result,
subsequent UDP or TCP packets sent by the host to the cluster
virtual IP address reach the target server, bypassing the load
balancing node. It is contemplated that load-balancer-to-agent
protocols may be needed for each server to report its MAC address
to the load balancing node to which its IP address is bound.
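The ARP reply described above can be sketched at the byte level per RFC 826. During redirection the load balancing node would answer with the target server's MAC as the sender hardware address; the helper below is an illustrative layout only and omits the Ethernet framing:

```python
import socket
import struct

def build_arp_reply(sender_mac, sender_ip, target_mac, target_ip):
    """ARP reply body (RFC 826): hardware type 1 (Ethernet), protocol
    type 0x0800 (IPv4), 6-byte hardware and 4-byte protocol addresses,
    opcode 2 (reply), then sender and target address pairs. The sender
    pair is the binding being advertised: the cluster IP mapped to
    whichever MAC address the load balancing node wants the host to
    use."""
    if len(sender_mac) != 6 or len(target_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    return (struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
            + sender_mac + socket.inet_aton(sender_ip)
            + target_mac + socket.inet_aton(target_ip))
```

At session end, sending operation 616 would correspond to answering the next ARP broadcast with the load balancing node's own MAC instead.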
[0079] If, at query operation 612, it is determined that the
session between the host and cluster has ended, control passes to
sending operation 616. During sending operation 616, the host is
instructed to modify its ARP cache such that the MAC address
associated with the cluster IP address is that of the load
balancing node instead of the MAC address of the target server.
Thus, sending operation 616 reverses the ARP cache modification
message issued in sending operation 614.
[0080] Turning again to FIG. 2, the ARP-based embodiment requires
another ARP broadcast from the host 104 for the cluster IP address
to switch messages back to the load balancing node 112. Thus, once
the number of TCP connections between the target server 206 and the
host 104 goes to zero, the target server 206 notifies the load
balancing node 112 about the opportunity to redirect the host 104
back to the load balancing node 112 as the destination for messages
sent to the cluster IP address 204. The load balancing node 112
cannot redirect the host 104 until it receives the next ARP
broadcast from the host 104 for the cluster IP address. When the
ARP broadcast is received, the load balancing node 112 responds
with its own MAC address, such that subsequent UDP or TCP packets
from the host 104 reach the load balancing node 112 again.
[0081] The foregoing description of the invention has been
presented for purposes of illustration and description. It is not
intended to be exhaustive or to limit the invention to the precise
form disclosed, and other modifications and variations may be
possible in light of the above teachings. The embodiments disclosed
were chosen and described in order to best explain the principles
of the invention and its practical application to thereby enable
others skilled in the art to best utilize the invention in various
embodiments and various modifications as are suited to the
particular use contemplated. It is intended that the appended
claims be construed to include other alternative embodiments of the
invention except insofar as limited by the prior art.
* * * * *