U.S. patent application number 09/928235 was filed with the patent office on 2002-02-28 for network server card and method for handling requests received via a network interface.
Invention is credited to Bestler, Caitlin B. and Phillips, Robert C.
Application Number: 20020026502 (Ser. No. 09/928,235)
Family ID: 24561362
Filed Date: 2002-02-28
United States Patent Application 20020026502
Kind Code: A1
Phillips, Robert C.; et al.
February 28, 2002
Network server card and method for handling requests received via a
network interface
Abstract
An information asset (e.g., audio/video data) server system and
a set of steps performed by the data server system are disclosed
that facilitate efficient handling of a potentially large workload
arising from request messages received from users via a
communicatively coupled network link. The network data server
system comprises a content transfer node including an external
network interface and a set of event engines. A workload request
received by the interface is delegated to one of the set of event
engines. The delegated event engine executes the request (or a
portion thereof) based upon the request type. In an embodiment of
the invention, further requests that are part of the same logically
related group of communications are identified explicitly or
implicitly by header fields and associated with a single set of
context data and state tracking maintained by the content transfer
node.
Inventors: Phillips, Robert C. (Northbrook, IL); Bestler, Caitlin B. (Chicago, IL)
Correspondence Address:
LEYDIG VOIT & MAYER, LTD
TWO PRUDENTIAL PLAZA, SUITE 4900
180 NORTH STETSON AVENUE
CHICAGO, IL 60601-6780, US
Family ID: 24561362
Appl. No.: 09/928,235
Filed: August 10, 2001
Related U.S. Patent Documents

Application Number   Filing Date     Patent Number
09/928,235           Aug 10, 2001
09/638,774           Aug 15, 2000
Current U.S. Class: 709/219; 709/249
Current CPC Class: H04L 9/40 20220501; H04L 65/40 20130101
Class at Publication: 709/219; 709/249
International Class: G06F 015/16
Claims
What is claimed is:
1. A network server system for efficiently processing requests for
information assets stored upon a set of storage drives, wherein the
requests are received via a communicatively coupled network link,
the server system comprising: an internal network communicatively
coupling nodes within the network data server system; a
supplemental processor node communicatively coupled to the internal
network and comprising a general purpose processor and operating
system, and wherein the supplemental processor supports executing
application programs; a data storage node communicatively coupled
to the internal network, the data storage node comprising storage
media and conversion circuitry for packaging retrieved data from
the storage media to a format for transmission over the internal
network; and an external network access node supporting network
connections between the network server system and client nodes via
an external network, the external network access node comprising: an
external network interface comprising an external network interface
engine for executing data transfers between the external network
access node and the external network, an internal network interface
comprising an internal network interface engine for executing data
transfers between the external network access node and the internal
network, and one or more event engines for executing information
asset transfers between the data storage node and the external
network in accordance with contexts, maintained by the external
network access node, describing a present state of executing
information asset transfers performed by the one or more event
engines.
2. A method for processing requests for information assets stored
upon a set of data storage drives by a network server system,
wherein the requests are received via a communicatively coupled
external network link, the method comprising the steps of:
receiving, by an external network access node via the external
network link, a request for an information asset; creating, by the
external network access node, a context for the request wherein the
context includes a buffer identification and a processing engine on
the external network access node assigned to execute the request;
submitting, by the external network access node, a request for data
from a storage node connected to the external network access node
by an internal network; and receiving, by the external network
access node from the storage node, data corresponding to the
request for data from the storage node, and storing the received
data within memory on the external network access node
corresponding to the buffer identification, wherein data
transferred from the storage node to the receiving external network access node bypasses application memory space on a general processor node;
and transmitting, by the external network access node, the data
stored within memory corresponding to the buffer identification,
over the external network link.
3. A network server system for efficiently processing requests for
information assets stored upon a set of storage drives, wherein the
requests are received via a communicatively coupled network link,
the server system comprising: a supplemental processor node; a
network interface node comprising: a network interface
communicatively coupled to the network link and configured to
receive requests from clients via the network link; delegation
logic facilitating: associating a request type with at least a
portion of a request, identifying a handler from a set of
processing elements for executing at least the portion of the
request based upon the request type, and creating a data structure
linking at least the portion of the request to the identified handler; and a data path from the set of storage drives
to the network interface, the data path facilitating data transfers
between the set of storage drives and the network interface node
containing the set of processing elements that bypass the
supplemental processor node.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application is a continuation-in-part of
Phillips et al. U.S. patent application Ser. No. 09/638,774 filed
on Aug. 15, 2000, entitled "Network Server Card and Method for
Handling Requests Via a Network Interface" that is expressly
incorporated herein by reference in its entirety including the
contents of any references contained therein.
FIELD OF THE INVENTION
[0002] The present invention generally relates to the field of
server systems for handling requests from multiple users in a
network environment. More particularly, the present invention
concerns apparatuses and methods for efficiently providing
specialized services to requesters in a network environment through
request distribution.
BACKGROUND OF THE INVENTION
[0003] As the Internet expands, so too does the potential number of
users that simultaneously seek access to particular Internet
resources (e.g., a particular Web site/address). Thus, operators of
Internet sites are well advised to arrange their systems in a
manner such that the current and future expanded versions of the
system hardware relied upon to deliver site resources to users are
capable of responding to a potentially high volume of user
requests.
[0004] As the population of web users grows, the number of
concurrent clients that a single web server must support similarly
grows. It is now routine for a "single site" to have peak load
access demands far exceeding the capacity of a single server. Users
of such a site desire the illusion of accessing a single server
that has consistent information about each user's past
transactions. Furthermore, every distinct user should access the
apparent single server using the same name. The challenge is to
handle request processing load such that users are unaffected by
any undesirable side-effects of handling high request volume and
high data traffic volume associated with responding to the
requests.
[0005] Web servers today are faced with increasing demand to
provide audio/video materials, such as MP3, MPEG and Quicktime
files. Such files are considerably larger than simple web pages.
The combination of more users demanding larger files presents a
potentially overwhelming demand for vastly increased data volume
handling capabilities in web servers. More servers are needed to
handle the increasing data retrieval and transmission workload.
[0006] Solutions have been implemented that share a common goal of
dividing the workload of the single virtual server over many actual
servers. It is desired that the workload of responding to multiple
user requests be divided in a transparent manner so that the
division is not visible to the customer. However, load-balancing
mechanisms have limited or no knowledge with regard to the context
or prior history of the messages they are trying to distribute.
Existing solutions either overly restrict the requests that can be processed by particular servers, thereby limiting the ability to evenly balance the traffic load, or divide work between servers operating in ignorance of each other.
[0007] There are at least four known approaches for dealing with
the aforementioned problems encountered as a result of high user
load. The oldest is server mirroring that involves replicating
content across multiple sites. Users are encouraged to select a
specific server that is closest geographically to them. Because the
load distribution is voluntary it is not very efficient.
Synchronizing the content of all servers across all sites is
problematic and time consuming. Generally this approach is
considered suitable only for non-commercial information
distribution.
[0008] Second, a Domain Name Service (DNS) approach accepts
requests from many clients and translates a single Domain Name in
the requests to an Internet Protocol (IP) address. Rather than
returning a same address to all clients, multiple servers are
designated, with each client receiving an IP address of only a
particular single server from the available servers. The IP
addresses supplied within the DNS answers are balanced with the
goal of evenly dividing the client load amongst the available
servers. Such distributed processing methods are inexact due to
remote caching. Furthermore, problems arise with regard to ensuring
that related requests from the same client issued at different
times result in a connection to the same server. Traffic to the DNS
server is increased because DNS queries are increased. This second
approach demonstrates that merely duplicating hardware will not
meet a need for additional throughput in a server resource/system.
Additionally, because later queries from the same user could easily
go to a different server, all work must be recorded on shared
storage devices. Such additional storage devices could be
additional database or file servers, or devices on a storage area
network.
[0009] Third, OSI layer 3 (network) and layer 4 (transport)
switching solutions distribute connections across multiple servers
having similar capabilities (though possibly differing load
handling capacity). Entire sessions are distributed to a particular
server regardless of the type of request. Because a layer 3 or 4
switch is not an actual participant in any of the protocol sessions
going through it, and because it is of a simpler and more
specialized design than the actual servers, it cannot fully
understand the protocols that it implements. It cannot with full
certainty understand the types of requests it sees. Though limited
examination of packet contents may occur, the majority of such
switches do not examine the content of packets that pass through
them.
[0010] An FTP (File Transfer Protocol) file transfer is one example
where limited examination may occur. Some layer 3 switches are
configured with enough knowledge of the FTP protocol to recognize
the description of a second connection buried within the payload of
a packet transferred via a first connection. Others merely assume
that all connections between a pair of IP addresses are for the
same user. The former approach requires the switch to stay current
with all new application protocols, and cannot work with encrypted
payloads. The latter presents the problem of requiring all users
working behind a single firewall using Port Network Address
Translation (PNAT), or masquerading, to be considered as a single
user. Given that the goal of load balancing is to evenly distribute
the workload, unpredictability in how much work is being committed
to a single server with a given dispatch presents a problem. This
example illustrates the deficiencies in the prior known systems
that are limited to switching functionality rather than delegating
execution of requests for resources to specialized
processors/processes.
[0011] Once a session is initiated it is assigned to a particular
server based on very little context information. Typically the
context information includes only the addressing information within
the initial packet itself without the benefit of knowing the
substance of any queries to customer databases or other records. In
particular it is not practical for the switch to distinguish
between a single user and an entire building of users sharing a
single IP address. Thus, an initial assignment may be supplemented
only by limited analysis of later packets to identify packets
belonging to the same session. This distribution scheme is carried
out by a switch having very limited analytical capabilities.
[0012] Other prior load distribution solutions conduct only limited
analyses to determine the type of a request received by the server
system after a session is initiated and assigned to a particular
server. Furthermore, providing a set of equally capable servers is
a potentially expensive solution that is likely considered too
expensive for many potential Internet service providers. Sharing
results through direct back channel communications or sharing of
database, file or storage servers is still required for these
solutions.
[0013] In a fourth known attempt to distribute requests for
resources over distributed network servers, server clusters
distribute work internally over an internal communications bus that
is typically a switched access bus. Communications to the external
network interface are performed via a single centralized server.
Thus, while distributing the computational load, the fourth known
option introduces a potential data communications bottleneck at the
centralized communication server.
[0014] The first three solutions divide a single virtual server's
workload into multiple user sessions. Such prior known solutions
attempt to ensure that all traffic for a given user session is
handled by a single actual server while attempting to distribute
the work load evenly over the entire set of actual servers. These
goals are incompatible in the prior known systems. Mirroring,
because it relies upon the user identifying the targeted server, is
excellent at ensuring only one actual server deals with a given
user, but the only mechanism for balancing load between the servers
is the process of users shifting between servers out of
frustration. Layer 3 switch solutions achieve load balancing, but
at the cost of failing to identify all parts of a single user's
interaction with the server. The fourth solution avoids these
problems by having a single server delegate work, but limits the
scope of this optimization by remaining a single communications
bottleneck.
[0015] Thus, while a number of solutions have been implemented to
deal with the problem of explosive growth in the popularity and
volume of use of the Internet and the resident Web sites, none
provide a solution to the increased cost and overhead associated
with distributing user workload addressed to a single apparent
resource that is, in actuality, distributed across multiple
processors.
[0016] It is further noted that today many Internet servers are
deployed almost exclusively to distribute stored content. In other
words, such servers execute one task: delivering data stored on a memory drive to a requesting client over the Internet. Such servers
are structured on a request-load-process-deliver model. The server
application on a server host machine accepts a request for content
from a client. In response, the server application loads the
requested content from a memory drive into process memory space
reserved for the server application on the host. After loading the
content into process memory space, the server application processes
the content, and then the server delivers the processed content to
the client.
[0017] Ideally, content providers would prefer utilizing as few
server machines as possible to deliver content to Internet clients.
However, in the case of streaming data (e.g., video, audio, etc.)
the process of accepting a request, loading content and then
delivering the content consumes nearly all of the processor's
capacity serving only a single Gigabit Ethernet port. When multiple
Gigabit Ethernet ports are provided for a single server
application, the processor becomes a system bottleneck. As a
result, streaming content servers often require multiple replicas
to provide satisfactory bandwidth to client/users.
[0018] Many factors contribute to this bottleneck. Packets on both
the internal network (where the stored content is accessed) and
external network (where the clients are located) arrive
interleaved, fragmented and even out-of-order. The received packets
must be processed by the host operating system. In addition to the
inherent work of de-fragmenting the incoming packets, processing
the packets by the host operating system involves switching between
user and system memory maps and quite likely copying data between
the user memory space and system memory space. Furthermore, because
of successive layering of protocols, it is common to apply a
variety of checksum or CRC algorithms to different portions of the
payload with different degrees of reliability. For example, when a
higher layer protocol specifies an end-to-end 32-bit checksum, it
cannot eliminate an inadequate 16-bit checksum specified by a lower
level protocol. Hence both must be generated and checked. A number
of system architectures attempt to address the processor bottleneck
problem by optimizing the path between a network interface and a
processing application. These include Virtual Interface
Architecture (VIA), Infiniband Architecture (IBA), Warp and
Microsoft's Winsock Direct.
[0019] The following discusses the classical method by which a server responds to client requests in an IP environment. In
classic server design, IP datagrams arrive at multiple network
interface cards (NICs). Each NIC deals with only one communication
protocol. A typical server includes NICs for dealing with the
external IP network (almost always Ethernet) and separate ones for
dealing with an internal storage-oriented network. The internal
storage-oriented network may be Ethernet/IP oriented, in which case
it is possible to apply a NAS (network attached storage) strategy
to access network file servers over the internal network.
Alternatively the internal network may be a specialized Storage
Area Network (SAN), such as Fibre Channel. The servers on these
networks typically provide block-level services, rather than
file-oriented services.
[0020] For an Ethernet/IP interface, the classic solution exhibits
the following characteristics. Buffers for incoming traffic are
allocated from a system memory pool on a per device basis, without
regard to the eventual destination. After the Ethernet frame has
been collected and validated, it is passed in FIFO order to the
host operating system's protocol stacks. The host operating
system's protocol stacks perform all required segmentation and
re-assembly to deliver messages to the requesting application
process. This must be based upon information within the reassembled
packet. Typically this involves one or more copy operations to
create an in-memory image of the complete buffer. Alternatively,
more complex interfaces can be used to pass a "scatter/gather list"
to the application. This results in more complex application code,
or postponing coalescing the message fragments to the application
code.
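A scatter/gather list of the kind mentioned above is simply an array of (address, length) pairs describing fragments left in place. The sketch below is generic C, not the application's interface, and shows the coalescing copy that the application inherits when it accepts such a list instead of a flat buffer.

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One fragment of a reassembled message, left where it was received. */
struct sg_element {
    void  *addr;
    size_t len;
};

struct sg_list {
    struct sg_element *elems;
    unsigned           count;
};

/* Coalescing the fragments is exactly the copy work postponed onto the
 * application when it is handed a scatter/gather list. */
static void *sg_coalesce(const struct sg_list *sg, size_t *out_len)
{
    size_t total = 0;
    for (unsigned i = 0; i < sg->count; i++)
        total += sg->elems[i].len;

    char *flat = malloc(total);
    if (!flat)
        return NULL;

    size_t off = 0;
    for (unsigned i = 0; i < sg->count; i++) {
        memcpy(flat + off, sg->elems[i].addr, sg->elems[i].len);
        off += sg->elems[i].len;
    }
    *out_len = total;
    return flat;
}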
[0021] The work involved in transferring network payload through a
Host Operating System protocol stack can be so time consuming,
especially with switches between system and user memory space, that
a single processor server can be almost totally consumed merely
getting network traffic between a Gigabit Ethernet NIC and
applications. This leaves almost no processing resources for an
application to actually accomplish any true data processing.
[0022] A virtual interface (VI) allows a NIC to directly post
results to user memory. This allows a complete message to be
assembled directly in the application's buffers without kernel/user
mode switching or intermediate copying. VI interfaces support two
delivery modes: send and remote direct memory access (RDMA). Send
delivery mode is based upon a connection. The incoming data is
paired with application read requests and transferred directly to
the application memory. Successive reads consume requests, and
reads are pre-issued. During RDMA delivery the request supplies a
memory key validating its access to the target application memory
space and an explicit offset within that target memory space. An
RDMA interface allows a file server to utilize an out-of-order
delivery strategy. Fragmented files are read from disk drives in
the most convenient order. Being restricted to reading them in the
"correct" order would require server-side buffering and/or more
passes of the drive heads. Out-of-order delivery also
enhances/supports striping of material across multiple drives.
Without the need to synchronize multiple sources it is easier to
get the multiple sources to deliver requested data in parallel.
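As a rough illustration of the RDMA delivery mode just described, the structure below uses hypothetical field names (it is not the VI or InfiniBand wire format): each write carries a key naming a pre-registered target memory region and an explicit offset within it, which is what allows fragments to be placed out of order.

#include <stdint.h>
#include <string.h>

/* Hypothetical RDMA write descriptor. */
struct rdma_write_desc {
    uint32_t memory_key;     /* key validating access to a registered target region */
    uint64_t target_offset;  /* explicit byte offset within that region */
    uint32_t length;         /* payload bytes carried by this fragment */
};

/* Place one fragment directly at its final position; because every
 * fragment names its own offset, arrival order does not matter. */
static void rdma_place(uint8_t *region, uint64_t region_len,
                       const struct rdma_write_desc *d, const uint8_t *payload)
{
    if (d->target_offset + d->length <= region_len)
        memcpy(region + d->target_offset, payload, d->length);
}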
[0023] VI and similar interfaces (such as InfiniBand) provide
improvements over the classic via-kernel methods. However, they
still exhibit significant drawbacks. First, VI and InfiniBand (IB)
require buffers to be pre-allocated for each pending read. When
applied only over the internal storage network this creates only
relatively minor problems. However, when dealing with remote
clients over the public external network long delays can be
expected. Furthermore, pre-allocating distinct buffers for 10,000
active clients is wasteful when, for example, it can be predicted
with near statistical certainty that no more than 500 of them will
respond in the immediate future. The time between each client
action and the time required to process each client request allows
a single buffer to be re-used potentially many times. However,
under VI and other similar RDMA protocols a buffer is locked down
for a single client from before the time the request is issued
until the request is fully satisfied. Second, under VI and IB the
NIC obtains from a kernel agent the mapping data required to
translate virtual memory addresses to physical memory for each
client. Obtaining this information, and ensuring that it is "locked
down" so that it cannot be moved or swapped out of memory, involves
a time-consuming negotiation with the kernel agent. One VI-derived
solution, the Infiniband Architecture, has added "Memory Windows"
to provide logical slices of memory that are registered with the kernel only once but have finer-grained access rights administration on the NIC
itself. Third, the requirements of VI cause significant portions of
"application memory" to be pinned down and effectively function as
system memory. Pinning down application memory forces other buffers
to be swapped more often and makes it harder for the kernel to
optimize application performance. This is especially true if the
kernel has not been specifically re-engineered for VI. Fourth,
while VI enables bypassing the host processor's operating system,
the host processor is not bypassed. Therefore, all data traffic
flows into the host processor's memory, is examined by an
application, and then typically flows back out to another NIC
connected to a network.
[0024] The two above-described prior data transmission approaches
adopted by servers (i.e., classic and VI) are now described with
reference to how an incoming Ethernet frame of data from, for
example, an internal data storage network is handled by the server.
In the classic server approach, the Ethernet frame is DMA
transferred to a system buffer. Buffer selection is steered by the
device type upon which the data is to be transmitted--not by the
ultimate target. Intermediate software, typically the host
operating system's protocol stacks, assembles messages and then
copies them to user memory. This process is reversed to send the
loaded payload as part of an HTTP response via an external
network.
[0025] In the VI approach, when the Ethernet Frame arrives from a
data storage network it is paired either with an application read
request (in the case of a "send" operation) or an application
memory space (in the case of an "RDMA" operation). Because the
memory (and/or request) was pre-registered with the NIC and kernel,
the NIC determines the destination address(es) in physical memory
and deposits the payload there. This process is reversed to send
the loaded payload as part of an HTTP response via the external
network.
[0026] It is also interesting to note the number of memory transfer
operations associated with processing an incoming Ethernet frame.
In the classic server approach, the frame passes into the NIC, into
the host processor system buffer, into the host processor user
buffer, back to the host processor system buffer, back to a
different NIC, and then out to a destination. In a VI type system,
the frame passes into the NIC, into a host processor user buffer,
back to a different NIC, and out to a destination. In both cases,
buffering consumes considerable system resources.
SUMMARY OF THE INVENTION
[0027] While the above-described server architectures improve the handling of requests under the request-load-process-deliver model,
such architectures seek to optimize getting stored content to a
server application process and then transferring the data from the
server application to the client via network communication
interface processes. In accordance with an aspect of the present
invention, a server architecture facilitates optimizing transfer of
requested content from data storage drives to requesting
clients.
[0028] In accordance with the present invention, a data asset
server handles content delivery traffic where there is no need to
process the stored content, but rather merely a need to package the
data and control its delivery to a designated location. Thus, the
server carries out a "request-select-deliver" model wherein a
request is received by the server from a client. Next, content to
be delivered to the client in response to the request is selected.
In an embodiment of the invention, notification, but not the
selected data itself, is delivered to an application running on the
server system that received the request. The content is delivered
via an external network interface engine (with necessary protocol
wrappers but no transformation of the actual content) to the
requesting client.
[0029] In a particular embodiment of the invention, to which the
invention is not limited, a content transfer engine receives an
Ethernet Frame and a destination message buffer is determined
either by a pre-existing read or based on the state of the
connection. The destination is content transfer engine controlled
memory. Notification, but not the data itself, is delivered to a
content transfer daemon in the content transfer engine. The content
transfer daemon controls transfer of content for a single end-user
session. The content transfer daemon executes within the context of
the content transfer engine itself. The content transfer daemon
optionally extends its processing scope in conjunction with an
extended content transfer daemon that runs on a conventional
processor. Buffer capture events are typically sent to the content
transfer daemon using event queues, but may be sent over the
internal network directly to an extended content transfer daemon.
Because dispatching notices should be prompt and highly reliable it
would be highly unusual for an external network interface engine to
be used. The content transfer daemon then, by way of example,
composes an HTTP response referencing payload already in the
content transfer engine's memory. The HTTP response, including the
payload in the content transfer engine's memory, is then sent via the external network.
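A rough sketch of this request-select-deliver step, using invented structure names rather than the application's actual data layout: the daemon generates only the small protocol wrapper and describes the body by reference to a buffer already resident in engine-controlled memory, so the content is never copied through an application.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical reference to a capture buffer held in CTE-controlled memory. */
struct buffer_ref {
    uint32_t buffer_id;   /* identifies the engine-resident buffer */
    uint32_t offset;      /* start of the payload within that buffer */
    uint32_t length;      /* number of payload bytes to transmit */
};

/* Hypothetical response descriptor: header text generated by the daemon,
 * payload transmitted by reference rather than by copy. */
struct http_response_desc {
    char headers[256];
    struct buffer_ref body;
};

static void compose_response(struct http_response_desc *r,
                             uint32_t buf_id, uint32_t len)
{
    snprintf(r->headers, sizeof r->headers,
             "HTTP/1.1 200 OK\r\nContent-Length: %u\r\n\r\n", (unsigned)len);
    r->body.buffer_id = buf_id;   /* payload stays in the engine's memory */
    r->body.offset = 0;
    r->body.length = len;
}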
[0030] With regard to memory operations, the received Ethernet
frame is stored in the data buffer physically located on an access
node. The Ethernet frame data is not moved into system buffer space
managed by a conventional processor, nor is the frame processed by
the conventional processor. Instead, the stored data is output to a
connected network interface engine associated with an identified
target destination without ever entering the conventional
processor's data space.
[0031] In an embodiment of the present invention, multiple event
engines execute in parallel or as co-routines upon the content
transfer engine to serve client requests. Division of
responsibilities between a content transfer engine (CTE), resident content transfer daemons (CTDs), and conventional-processor-resident extended content transfer daemons (XCTDs) is an application- and implementation-specific tradeoff. A conventional processor favored strategy would
use the Content Transfer Engine to capture and transmit buffers,
while making most protocol decisions on the conventional processor.
A CTE favored strategy would handle virtually all normal cases
itself, and only forward exceptional cases for conventional
processor handling. The placement of wrapping protocol headers and
trailers around the stored content is performed in any of a variety
of possible locations, including both on and off the CTE. Finally,
it will be understood by those skilled in the art that the present
invention is applicable to a variety of external network
communications protocols.
BRIEF DESCRIPTION OF THE DRAWINGS
[0032] The appended claims set forth the features of the present
invention with particularity. The invention, together with its
objects and advantages, may be best understood from the following
detailed description taken in conjunction with the accompanying
drawings of which:
[0033] FIG. 1 is a diagram depicting an exemplary network
environment into which the invention is advantageously
incorporated;
[0034] FIG. 2 is a block diagram identifying primary logical
components within a set of network nodes arranged in an exemplary
physical arrangement in accordance with an information asset server
system embodying the present invention;
[0035] FIG. 3 depicts a set of functional/logical components of a
content transfer engine within a server system depicted in FIG.
2;
[0036] FIG. 4 depicts a set of data structures for external buffer
control and facilitating execution of operations in accordance with
an exemplary embodiment of the present invention;
[0037] FIG. 5 depicts a memory control block for a table data
structure facilitating carrying out an exemplary embodiment of the
present invention;
[0038] FIG. 6 depicts an exemplary code routine memory control
block facilitating carrying out the present invention;
[0039] FIG. 7 depicts an exemplary memory control block for an
allocated buffer;
[0040] FIG. 8 depicts an exemplary memory control block for a free
buffer;
[0041] FIG. 9 depicts a set of exemplary event queue target
descriptors in accordance with an exemplary embodiment of the
present invention;
[0042] FIG. 10 depicts an exemplary event queue entry format for
queue entries placed within the event queues of a content transfer
engine embodying the present invention;
[0043] FIG. 11 is a sequence of buffer representations depicting
the state of a capture buffer as data arrives in an out of order
fashion using RDMA mode capture;
[0044] FIG. 12 is a state diagram specifying how cells arriving on
a node channel are collected into a packet;
[0045] FIG. 13 is a state diagram specifying how Ethernet frames
are collected into a packet;
[0046] FIG. 14 is a state diagram specifying the life cycle of an
event engine;
[0047] FIG. 15 is a state diagram that expands the "Dispatching
Event" state of FIG. 14;
[0048] FIG. 16 is a state diagram that expands the "Completing
Successfully" state from FIG. 14;
[0049] FIG. 17 is a state diagram that expands the "Completing With
Exception" state from FIG. 14;
[0050] FIG. 18 is a state diagram specifying the life cycle of a
Capture Buffer; and
[0051] FIG. 19 is a flow diagram depicting the general operation of
the disclosed data server.
DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT
[0052] A new network server architecture is described below in the
form of a preferred embodiment and variations thereof. At the heart
of this new network server architecture is a hybrid multiprocessor
server. The hybrid multiprocessor server includes a data
transmission engine comprising a set of micro-engines. The
micro-engines are processors with limited capabilities for
executing a particular set of tasks relating to, for example, data
communication. The network server also includes at least one
supplemental processor, a general purpose processor operating a
standard operating system. The network is, by way of example, the
Internet. Thus, communication between the server and an external
network is preferably conducted in accordance with the Internet
Protocol (IP).
[0053] When a user requests to initiate a session with the network
server, a session request is received by a network interface
engine. The network interface engine passes the request to either a
micro-engine or the supplemental processor. After establishing a
session/connection with a client, fulfillment of the client's data
requests is carried out primarily by one or more of the
micro-engines. The supplemental processor, in an embodiment of the
invention, receives notification of a data transfer. However, data
is transmitted between data storage drives (e.g., disk drives)
managed by the network server and the requesting clients without
transferring the transmitted data into the supplemental (host)
processor's memory space.
[0054] As mentioned above, the network server of the present
invention includes a set of one or more specialized micro-engine
processors, including for example a set of micro-engines referred
to herein as event engines that are configured to accept and
execute content transfer requests. Each connection is tracked as a
separate context. Each connection belongs to a specified class. The
class specifies, for example, a state transition model to which all
instances of that class will conform. The state model specifies the
default buffer allocation behavior, and how event capture is to be
processed (to what event queue and to what target routine) for each
state. Instances deal with specific user sessions. An example of a
specific class is one specifically configured to control delivery
of an html file according to the HTTP delivery protocol using
TCP/IP. Each instance actively transfers a specific html file to an
end user/client via a specific TCP/IP connection.
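The class/instance relationship described above might be modeled roughly as follows. This is a sketch with invented names, not the application's actual data layout: a class fixes the state-transition behavior shared by every connection of that type, and each context records the per-connection state of one transfer.

#include <stdint.h>

struct context;   /* forward declaration */

/* Per-state behavior shared by all instances of a connection class:
 * which buffer pool feeds the state and where captured events are sent. */
struct state_rules {
    uint8_t  buffer_pool_id;   /* default pool for capture buffers in this state */
    uint8_t  event_queue_id;   /* event queue that receives capture notifications */
    void   (*on_event)(struct context *ctx, void *event);   /* target routine */
};

/* A class, e.g. "HTTP delivery of an html file over TCP/IP". */
struct connection_class {
    const struct state_rules *states;   /* the state transition model */
    uint8_t num_states;
};

/* A context: one active transfer, e.g. one html file to one client. */
struct context {
    const struct connection_class *cls;
    uint8_t  current_state;
    uint32_t tcp_seq_expected;    /* example of per-instance protocol state */
    uint32_t bytes_remaining;     /* example of per-instance transfer state */
};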
[0055] An interface to an internal network provides communication
paths between one or more storage servers (comprising multiple
storage media devices such as hard drives) and the data
transmission engine. In an exemplary implementation the internal
network is an internal network of the type generally disclosed in
U.S. application Ser. No. 09/579,574, filed on May 26, 2000, and
entitled: "Information Distribution System and Method" which is
explicitly incorporated herein in its entirety by reference.
However the invention disclosed here is intended to be applied to
other storage area networks, including Fibre Channel and
Infiniband.
[0056] The architecture of a network server system including the
above-described hybrid-multiprocessor data transmission interface
is extensible, and thus multiple hybrid-multiprocessor network
servers having the above general arrangement are integrated to
provide access to a shared data resource connected via a switching
fabric. Each of the network servers includes its own network
interface engines. In the case of an Internet connection, each
network server is assigned at least one unique Internet address.
The invention is not limited to an Internet environment and may be
incorporated into virtually any type of server environment
including intranet, local area network, and hybrid combinations of
various network interface engines to the network servers. Thus,
while generally referred to herein as a network server, the
architecture is applicable to network configurations that include
both WAN and LAN interfaces.
[0057] The present invention is not intended to be limited to a
particular arrangement of distributed handler processors on a
Network server. The "server" may take any of many different forms.
While preferably the default and specialized handler processors are
arranged upon a single, multi-layer printed circuit board, the
server is implemented in various embodiments as a set of cards
connected via a high speed data/control bus.
[0058] In accordance with and exemplary embodiment of the present
invention, an Internet Protocol (IP) access node, comprising a
network server and associated stored content requested by clients
on an IP network, includes a data access interface and a host. The
host provides a conventional application program execution
environment including off-the-shelf operating system software such
as NetBSD, FreeBSD, Linux or Windows 2000. The data access
interface comprises a combination of limited/specific purpose
micro-engines that perform a limited set of operations associated
with retrieving requested data from a set of data storage drives
and delivering the requested data to clients. When a request for
data is received by the IP access node, subsequently delivered data
bypasses the host (e.g., supplemental) processor thereby avoiding a
potential bottleneck to data delivery. While the invention is not
limited to any particular protocol, in an exemplary embodiment of
the invention, the data access interface is specifically programmed
to handle high volume Worldwide Web and streaming media protocols
such as the well known HTTP, FTP, RTSP and RTP data transfer
protocols.
[0059] With regard to data transmission buffering, in contrast to the above-described known systems, the disclosed content transfer
engines have the ability to defer the actual allocation of a
capture buffer until the first packet arrives. The context has a
current state. In that state, there are defined memory pools for
the internal and external networks. When a packet actually arrives,
the first free buffer in that memory pool is allocated. The system
only needs to dedicate as many buffers to that pool as are
statistically required to meet a desired safety margin to ensure
that buffer overflow does not occur. Furthermore, because the pools
are state sensitive, allocation rules are incorporated to ensure
they cannot be exhausted by denial-of-service (DOS) attacks. A pool
is configured to only provide on-demand buffers to established
sessions. The pool specified for each state varies based on the
peak demand for connections in that state, and the required data
collection size. For example, during the life span of an inbound
FTP session, there are times when a simple command is expected
(requiring a relatively small buffer), and there are times when a
large file transfer is expected (requiring large allocated buffer
capacity).
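The deferred, state-sensitive allocation described above could look roughly like this. The names and pool sizing are hypothetical: no buffer is reserved when a context is created, and the first arriving packet pulls a buffer from whichever pool the context's current state names.

#include <stddef.h>

struct buffer {
    struct buffer *next;       /* free-list link */
    unsigned       use_count;
    /* payload storage follows in the same allocation */
};

/* One pool per connection state, sized for that state's statistical peak
 * demand (e.g. a small pool for command packets, a large one for bulk data). */
struct buffer_pool {
    struct buffer *free_list;
    unsigned       buf_size;
};

/* Deferred allocation: invoked only when the first packet of a message
 * actually arrives; state_pool_id comes from the context's current state. */
static struct buffer *alloc_on_arrival(struct buffer_pool pools[],
                                       unsigned state_pool_id)
{
    struct buffer_pool *p = &pools[state_pool_id];
    struct buffer *b = p->free_list;
    if (b == NULL)
        return NULL;           /* pool exhausted: drop rather than starve others */
    p->free_list = b->next;
    b->use_count = 1;
    return b;
}

Because a pool serves only contexts in a given state, an attacker that never completes session establishment can exhaust at most the pool dedicated to that early state.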
[0060] Turning now to FIG. 1, in accordance with an exemplary
embodiment of the present invention, a content transfer engine is
deployed in a server system environment including a set of content
transfer access nodes 10 and 12. In the exemplary embodiment of the
present invention, the content transfer access nodes 10 and 12
access storage devices 14. The storage devices 14 are
represented in FIG. 1 as a set of storage nodes accessed by the
content transfer access nodes 10 and 12 over an internal network 16
that preferably exhibits at least RDMA capabilities. The internal
network 16 is preferably a fully non-blocking switch fabric operating
under an ATM protocol or a variation thereof, and data is packaged
within cells transmitted over the internal network. However, such
an internal network is not required to realize the advantages of
the present invention.
[0061] In addition, the server system provides Internet content
distribution services, e.g., HTTP, to an extensible set of clients
18 over an external network 20 such as the Internet. The interfaces
between the content transfer access nodes 10 and 12 and the
external network 20 are, by way of example, Gigabit or 100 Mbit
Ethernet ports. A particular embodiment of the invention includes
dual Gigabit Ethernet ports. The exemplary server system includes
one or more supplemental processor (SP) nodes 22 that reside on
the host portion of the server system containing the content
transfer access nodes 10 and 12. The supplemental processor nodes
22 handle portions of protocols and requests that cannot be handled
by the limited state models supported by the content transfer
engines within the content transfer access nodes 10 and 12. The SP
nodes 22 are accessed directly, or alternatively (as shown in FIG.
1) via the internal network 16. The non-blocking paths of the
internal network 16 are governed by a server controller 24. A
number of internal networks are contemplated in accordance with
various embodiments of the present invention. Examples of such
internal networks include a "DirectPath" architecture of Ikadega,
Inc., of Northbrook, Ill., disclosed in U.S. application Ser. No.
09/579,574, filed on May 26, 2000, and entitled: "Information
Distribution System and Method"; and Phillips et al. U.S. patent
application Ser. No. 09/638,774 filed on Aug. 15, 2000, entitled
"Network Server Card and Method for Handling Requests Via a Network
Interface" the contents of which are explicitly incorporated herein
by reference. Content transfer engines executed within the content
transfer access nodes 10 and 12 perform/execute the role of an
access node fabric interface in the server system.
[0062] Other suitable internal network types include Infiniband
Architecture (IBA) wherein the content transfer access nodes 10 and
12 serve the roles of a host channel adapter (HCA); and Fibre
Channel wherein the content transfer access nodes 10 and 12 are
host adapters.
[0063] Having generally described an exemplary network environment,
an embodiment of the invention and exemplary alternatives,
attention is now directed to FIG. 2 that contains a schematic block
diagram depicting components of a server system including the
content transfer access node 10 in a network server system
embodying the present invention as well as logical connections
between the identified components. The architecture of the content
transfer access node 10 is particularly well-suited for providing
Internet oriented services, especially those involving streaming
video or other real-time extended data delivery systems. The
content transfer access node 10 features multiple execution units
(engines/micro-engines). There is an execution unit associated with
each network interface (interface engines) and multiple execution
units to process protocol events (event engines). Each execution
unit is fed by a corresponding dedicated queue. Such queues are
described herein below. Execution units handle requests from their
corresponding queue on a FIFO basis. Because of the limited
processing capabilities of the set of micro-engines, the content
transfer access node 10 is preferably utilized to control
high-speed data transfers rather than to process (e.g., performing
calculations upon) the transferred data itself. In an exemplary
embodiment of the invention, the event engines only examine very
small portions of the transferred data--typically only protocol
headers.
[0064] The architecture of the content transfer access node 10
facilitates dividing tasks associated with processing client
requests between the content transfer access nodes 10 and 12 and
the SP nodes 22. The content transfer access node 10 is
implemented, by way of example, as a combination of field
programmable gate arrays (FPGAs), programmable logic devices
(PLDs), and/or application specific integrated circuits (ASICs)
that perform a limited set of task types at a relatively high speed
and with minimal overhead. The SP nodes 22, on the other hand,
comprise conventional processors executing application programs on
an off-the-shelf operating system. The SP nodes 22 execute tasks
that are outside the limited scope of tasks executable by a content
transfer engine 50. The SP nodes 22 execute the tasks delegated by,
for example, the content transfer access nodes 10 and 12 with the
full overhead associated with processing requests in an
off-the-shelf operating system environment.
[0065] The exemplary content transfer access node 10 includes the
content transfer engine (CTE) 50 for communicating data between the
external network 20 and the internal network 16. The CTE 50
includes an external network interface engine 54. The external
network interface engine 54 passes request notifications to one of
a set of event engines 58 via an event queue 60. The event engines
58 execute content transfer daemons 59 (CTDs) that execute, by way
of example, a single instance of a specific protocol. Each instance
maintains data to facilitate tracking the state of the protocol
interchange and generating responses after being invoked with new
packets received for the particular protocol interchange. Each
content transfer daemon belongs to a class that specifies the
behavior/code common to all instances executing the same protocol.
An external interface event queue 56 temporarily stores request
notifications for transmitting content over the external network 20
received from one of the set of event engines 58. The request
notifications in the external interface event queue 56 concern
transmissions from the server system to the external network 20.
Similarly, the event engines 58 communicate to the internal network
16 via notifications placed within an internal network interface
engine queue 62. Request notifications are passed from the event
engines 58 to the internal network interface engine queue 62.
Thereafter, the requests are propagated via an internal network
interface engine 64 to the internal network 16. The operation of
the internal network interface engine 64 is described herein below
with reference to FIG. 3.
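The queue-fed execution units described above might be sketched as a simple FIFO of notifications. The entry fields below are hypothetical (the actual entry format is described later with reference to FIG. 10): interface engines post buffer-capture notices, and each event engine drains its own dedicated queue in order.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical capture notification handed from an interface engine
 * to an event engine. */
struct event_entry {
    uint32_t context_id;   /* which connection/transfer this belongs to */
    uint32_t buffer_id;    /* capture buffer holding the new data */
    uint16_t length;       /* bytes captured */
    uint8_t  event_code;   /* e.g. frame complete, error, timeout */
};

/* Fixed-size ring used as one event engine's dedicated FIFO queue. */
#define QUEUE_DEPTH 256
struct event_queue {
    struct event_entry entries[QUEUE_DEPTH];
    unsigned head, tail;   /* head == tail means empty */
};

static bool queue_post(struct event_queue *q, const struct event_entry *e)
{
    unsigned next = (q->tail + 1) % QUEUE_DEPTH;
    if (next == q->head)
        return false;      /* full: producer must back off */
    q->entries[q->tail] = *e;
    q->tail = next;
    return true;
}

static bool queue_take(struct event_queue *q, struct event_entry *out)
{
    if (q->head == q->tail)
        return false;      /* empty */
    *out = q->entries[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    return true;
}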
[0066] In an exemplary embodiment of the invention, an external RAM
70 is physically located upon the content transfer access node 10
(located, for example on a network interface card). The external
RAM 70 operates as a buffer between external nodes and nodes
connected to the internal network 16. During data transfers from a
node on the internal network 16 to one of the external clients 18,
data is placed within an external buffer within the external RAM 70
rather than an application buffer space located in one of the
supplemental processor nodes 22. Thus, in accordance with an
embodiment of the invention, transferred data bypasses application
space during the data transfers. Access to the external RAM 70 is
controlled by a buffer control 72 and memory controller 74 that
maintain a context for data transfer operations executed by the CTE
50. An exemplary context arrangement is depicted, by way of
example, in FIG. 4.
[0067] The CTE 50 includes multiple event engines that process
notifications of buffer captures within the CTE 50 itself. There
are many possible implementations for the event engines 58. The
capabilities of the event engines 58 include: having access to
their own high-speed memory (referred to herein as "Internal
Memory"), accepting work via the event queue 60 on the CTE 50,
allocating and de-allocating buffers using a buffer control 72 of
the CTE 50, performing DMA transfers to/from those buffers, posting
events to interface engines (e.g., interface engines 54 and 64),
and posting events to other event engines. Multiple event engines
58 are implemented as co-routines on shared hardware and/or through
true parallel processing. If implemented as co-routines, event engines that are not engaging in DMA transfers to/from external memory should not be blocked merely because another engine is performing a DMA transfer. The life cycle of one of the event engines 58 is
summarized herein below with reference to FIG. 14.
[0068] Buffer Control 72
[0069] The assigned usage of internal (on-chip) and external memory
within the CTE 50 differs from the usage in a conventional
processor. The conventional solution is for on-chip resources to be
utilized for dynamic caching of content from the external memory.
Generally the conventional processor is unaware of the operation of
the cache, and operates as though it were truly working from the
external ram. The CTE 50, by contrast, considers its internal
memory to be its primary random access memory resource. The
external memory is viewed as a random access device. Asynchronous
memory transfers are executed to load the internal memory from the
external memory, or to store back to the external memory.
Additionally, DMA input and output is executed to/from external
memory.
[0070] The CTE 50 also manages buffer allocation, event queuing and
context switching.
[0071] The buffer control 72 manages allocation and de-allocation
of memory within the external RAM 70 corresponding to event
buffers. The buffer control 72 allocates a capture buffer for
capturing input for a specified context. Each context represents a
distinct content transfer and an associated content transfer daemon
instance. On the internal network 16 contexts are explicitly
identified in the first portion of a packet. In an embodiment of
the invention the context is specified within the header of the
first cell.
[0072] Furthermore, for data transmitted over the external network
20, application of a hash algorithm to the header fields of a
packet is performed to identify a context. For a TCP/IP connection
the context is identified by the combination of a VLAN (Virtual
LAN) field with the source and destination IP address and source
and destination TCP Port.
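The header-based context lookup described above could be sketched roughly as follows. The hash and table layout are invented for illustration (the engine's actual hash is not specified here): the VLAN tag, IP addresses and TCP ports of an arriving packet are folded into a table index that names the connection's context.

#include <stdint.h>

/* Fields that, taken together, identify one TCP/IP connection's context. */
struct flow_key {
    uint16_t vlan;
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

#define CONTEXT_TABLE_SIZE 4096   /* power of two, sized for peak connections */

/* A simple multiplicative fold of the header fields into a table index. */
static unsigned flow_hash(const struct flow_key *k)
{
    uint32_t h = 2166136261u;                 /* FNV-style mixing constants */
    h = (h ^ k->vlan)     * 16777619u;
    h = (h ^ k->src_ip)   * 16777619u;
    h = (h ^ k->dst_ip)   * 16777619u;
    h = (h ^ k->src_port) * 16777619u;
    h = (h ^ k->dst_port) * 16777619u;
    return h & (CONTEXT_TABLE_SIZE - 1);
}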
[0073] The allocation of a capture buffer within the CTE 50 follows
any of a variety of scenarios including, by way of example, the
following: a buffer may have been pre-loaded for the context by a
content transfer daemon or extended content transfer daemon, the
current state of the context specifies that a buffer should be
allocated from a specified memory pool (described herein below), or
the buffer may already be allocated as a result of collecting
packets for the same connection that are part of a larger message.
This set of options is in distinct contrast to prior solutions,
which would only have pre-loaded read descriptors or the option of
allocating buffer space from a common context-insensitive buffer
pool.
[0074] One way to control allocation of buffer space is to
pre-allocate memory pools that are used for particular request
types or requesting entities. Allocation of a buffer from a
specified memory pool is accomplished by simply removing a buffer
control block from the head of a list of free buffer control
blocks. Each memory pool has its own distinct free list. Allocating
like-sized buffers from a linked list is a well-known practice in
embedded software, although more typically employed within the
context of a conventional processor.
[0075] Upon allocating a buffer control block, its associated use
count is incremented. As will be described later herein, there are
other actions that can "attach" or make claims on this buffer
control block. The number of claims eventually reaches zero. When
the number of claims reaches zero the buffer control block is
returned to the buffer control 72's free pool. For a default memory
pool, returning the allocated buffer to the free pool is
accomplished by placing the buffer control block at the end of the
free pool linked list of free memory blocks.
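The claim accounting just described might be sketched as follows, with hypothetical names: every party that "attaches" to a buffer control block bumps its use count, and the block returns to the tail of its pool's free list only when the last claim is released.

#include <stddef.h>

struct buf_ctl_block {
    struct buf_ctl_block *next;   /* free-list link */
    unsigned              use_count;
    unsigned              pool_id;
};

struct free_list {
    struct buf_ctl_block *head;
    struct buf_ctl_block *tail;
};

/* An additional claim on the block, e.g. a smart output buffer referencing it. */
static void bcb_attach(struct buf_ctl_block *b)
{
    b->use_count++;
}

/* Release one claim; when the last claim is gone the block goes back to
 * the end of its pool's free list for reuse. */
static void bcb_release(struct buf_ctl_block *b, struct free_list *fl)
{
    if (--b->use_count != 0)
        return;
    b->next = NULL;
    if (fl->tail)
        fl->tail->next = b;
    else
        fl->head = b;
    fl->tail = b;
}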
[0076] The buffer control 72, by way of example, is invoked during
routine execution to allocate a new derived or "smart" output
buffer. Smart buffers do not contain the actual output, but rather
describe the output desired by reference to other buffers. Each of
the buffer references is one "attachment" to the referenced smart
buffer. For each segment of a "smart" output buffer, the sender may
request that the payload for that section be applied to the
calculation of a specific CRC or checksum. Other segments may
request that the accumulated CRC or checksum be placed as output
and then the current value reset. In this manner CRCs dependent on
the payload content can be calculated without requiring either one
of the event engines 58 or supplemental processor nodes 22 to
examine the payload. However, the event engines 58 and/or
supplemental processor nodes 22 retain full control over what
payload contributes to the calculation of a particular CRC. Without
this feature the network interface engines 54 and 64 would have to
understand the CRC algorithms for every protocol the CTE 50
supports. This would be particularly difficult because it is common
for the same payload to contribute to multiple CRCs at different
OSI layers.
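The derived or "smart" output buffer described above might be modeled roughly as a list of segment descriptors; the flags and fields below are hypothetical. Most segments reference payload held in other buffers and ask that those bytes be folded into a named checksum accumulator as they are transmitted, while an emit segment asks the interface engine to insert the accumulated value into the output and reset it.

#include <stdint.h>

/* Hypothetical segment of a "smart" output buffer. */
enum seg_kind {
    SEG_DATA_REF,    /* transmit bytes referenced from another buffer */
    SEG_EMIT_CHECK   /* output the accumulated CRC/checksum, then reset it */
};

struct smart_segment {
    enum seg_kind kind;
    uint32_t ref_buffer_id;  /* SEG_DATA_REF: buffer holding the bytes */
    uint32_t offset, length; /* SEG_DATA_REF: slice of that buffer */
    uint8_t  check_id;       /* which accumulator these bytes feed, or which
                                accumulated value an emit segment outputs */
    uint8_t  accumulate;     /* SEG_DATA_REF: nonzero if bytes feed check_id */
};

struct smart_buffer {
    struct smart_segment segs[16];
    unsigned             count;
};

In this arrangement the daemon decides which payload contributes to which integrity value without ever reading the payload, and the same referenced bytes can feed several accumulators at different protocol layers.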
[0077] The buffer control 72 is also responsible for allocating and
releasing any general free-standing data buffers. Free-standing
buffers are used by content transfer daemons 59 to pre-stock
boilerplate portions of responses. These may be included by
reference in later responses. In this fashion common message
elements, such as server identification strings and error codes, do not have to be generated for each message that requires them.
[0078] Supplemental Processor Node 22a
[0079] A supplemental processor (SP) node 22a, also referred to as
a "host," is capable of performing a wide variety of tasks
delegated to it by the content transfer access node 10. SP node 22a
comprises, by way of example, its own dedicated external RAM 78.
The exemplary SP node 22a also includes a conventional processor 80
executing an off-the-shelf operating system 82 (e.g., WINDOWS,
UNIX, LINUX, etc.). The supplemental processor 80 also executes one
or more application programs 84 within the operating environment of
the off-the-shelf OS 82. The SP node 22a also includes the
following program components: extended content transfer daemons
(XCTDs) 86, a content transfer engine extension (CTEX) 88, and an
internal network (fabric) interface 90. Other components of the SP node 22a include a FLASH memory device for bootstrapping the conventional processor 80. The SP node 22a may
additionally have its own external interfaces for administrative
purposes. Such external interfaces include, by way of example,
Ethernet and serial interfaces.
[0080] With regard to the programs executed upon the SP node 22a,
the operating system 82 includes a kernel and drivers. A set of
socket daemons execute on the operating system 82 and act as Internet
daemons by accepting connections through a BSD (Berkeley Software
Distribution) socket application program interface. The BSD sockets
API is the predominant interface used for applications running on
Unix or Unix-derived operating systems that wish to interact via
the Internet Protocols. The socket daemons are launched at system
start-up or on-demand via a gate-keeping daemon such as INETD
(under most Unix systems) which manages the process of launching
daemons in response to external connection requests. An example of a
socket daemon is an APACHE web server. The SP node 22a executes the
CTEX 88 as a background process/task. The only special kernel
support required is permission to interface directly to the
internal network interface. The SP node 22a's internal network
interface engine 90 performs the same function as the CTE's
internal network interface engine 64, but it may have differing
performance requirements/capabilities for dealing with concurrent
packet reception. The CTEX 88 bridges events/packets between the
content transfer access node 10 and the XCTDs 86.
[0081] Socket daemons are a specific form of applications 84 (as
shown in FIG. 2). The methods and conventions related to writing
this sort of application are well known to those skilled in the
art. Indeed, a major goal of the invention is to preserve the
ability to execute the vast repository of existing applications
code under this paradigm while enabling the actual data transfers
to proceed in a more efficient manner.
[0082] In an embodiment of the invention the CTEX 88 operates
closely with the SP node 22a's internal network interface engine 90
and access is not filtered on a per-message basis by the SP node
22a operating system 82. Such access is provided, for example, by
memory mapped locations and FIFO structures maintained directly by
the CTEX 88 with the permission of the operating system 82. The
socket daemon applications of the applications 84 may communicate
with the CTEX 88 either via the host operating system 82 kernel or
via a kernel bypass interface (represented by line 91) that avoids
calls to the operating system. Examples of such bypass interfaces
would include the DAFS (Direct Access File System) API and VIPL
(Virtual Interface Programming Library).
[0083] The XCTDs 86 comprise object oriented Internet daemons
operating within the context of the CTEX 88 using an event
dispatch-process model. The CTEX 88 invokes XCTDs 86 with an event,
and accepts results to be effected upon successful completion. The
results include messages to be sent, updates to shared data regions
and creation/deletion of contexts. Thus, the relationship between
XCTDs 86 and the CTEX 88 is similar to the relationship that exists
between the CTE 50's event engines 58 and their content transfer
daemons (CTDs) 59. However, the XCTDs 86 have access to the SP node
22a's memory and are capable of more complex computations. The
XCTDs 86 are capable of handling protocols without assistance, or
alternatively act as pre- and post-filters to conventional
socket-oriented application daemons.
[0084] In summary with regard to the role of the SP node 22a in the
proposed streaming data server application environment, the content
transfer access node 10 relieves the SP node 22a of the task of
executing the actual transfer of requested content. The shift from
"processing" the stored content to "selecting" and "wrapping" the
content changes the role of the SP node 22a. The SP node 22a is
not even required for handling simple requests for data. Within the
context of a server system including the content transfer access
node 10, a "host processor" node containing a conventional
processor is more correctly referred-to as a "supplemental
processor" due to its supporting role in content transfer
operations. The SP node 22a is still able to control all protocol
handling and participate in handling portions of protocols that are
not handled by the content transfer access node 10. To accomplish
these objectives, the SP node 22a receives notifications that input
has arrived. Under the content transfer approach, the SP node 22a
is notified that the content is loaded within the content transfer
access node 10's memory. The SP node 22a does not directly examine
or process the transferred content. The SP node 22a is capable of
reading the content transfer access node 10's memory when it
actually must see the content. Finally, the SP node 22a is capable
of creating buffers in the content transfer access node 10's
memory.
[0085] A data storage node 14a is communicatively coupled to the
internal network 16 via an internal network interface 92. In an
exemplary embodiment of the invention, the data storage node 14a
comprises a set of ATA Drives 94a and 94b coupled to the internal
network interface 92 via storage adaptors 96a and 96b. An example
of the structure and operation of the storage adaptors 96a and 96b
is provided in Phillips et al. U.S. patent application Ser. No.
09/638,774 filed on Aug. 15, 2000, entitled "Network Server Card
and Method for Handling Requests Via a Network Interface"
incorporated herein by reference in its entirety including all
references incorporated therein.
[0086] As mentioned previously herein above, the internal network
16 is preferably a non-blocking switch fabric. The data paths of
the switch fabric are controlled by the server controller 24. The
server controller 24 includes an internal network interface 98. The
internal network interface 98 is, by way of example, a switch
fabric interface for communicating with the other server components
via the internal network 16. The server controller 24 includes, in
a particular embodiment, an event engine 100 driven by
notifications presented upon an event queue (not shown). The server
controller 24 includes its own allocated external RAM 102. The
server controller 24 maintains a list of files and volumes in file
system 104 and volume system 106 to facilitate access to requested
content on the data storage node 14a. The server controller 24 also
includes a storage array control 108 responsible for issuing
instructions to the data storage nodes so as to fulfill the
requests of other nodes and higher level components. The server
controller 24 also includes a traffic shaping algorithm and
controller 110 that establishes and maintains control over
distribution of tasks to multiple available processing engines
within the server system depicted in FIGS. 1 and 2. In the
exemplary embodiment the storage array control and traffic shaping
responsibilities are implemented by an algorithm that schedules
their work jointly in a single software subsystem.
[0087] One of the primary tasks of a data server system embodying
the present invention, and one for which it is especially
well-suited, is delivery of content in the form of streaming data.
In the exemplary embodiment of the present invention depicted in
FIG. 1, the internal network 16 interfaces content transfer access
nodes 10 and 12 directly to a set of data storage nodes 14 (e.g., a
set of hard disk drives supplying streaming media content to a set
of requesting clients). The content transfer access nodes 10 and 12
include interfaces to the external network 20. The intention of
this architecture, and alternative network data access
architectures embodying the present invention, is to provide a data
path between a data storage device and an external network
interface for transmitting data retrieved from the data storage
device to requesting clients, one that bypasses a general processor
(e.g., the supplemental processor nodes 22). In accordance with
embodiments of the present invention, once a data stream commences
in response to a networked client's request, the stream of
requested data bypasses conventional processors and operating
systems (in the supplemental processor nodes 22) that add
delay-inducing, resource-consuming overhead to such
transmissions.
[0088] In accordance with an exemplary embodiment of the present
invention incorporating the general logical arrangement of
components of FIGS. 1 and/or 2, the content transfer access nodes
10 and 12 receive requests for content stored upon the set of data
storage nodes 14. In response to the requests, possibly after
initial setup procedures implemented by the supplemental processor
nodes 22, retrieved content/data passes through the internal
network 16 to the content transfer access nodes 10 and/or 12 that
operate without the overhead of conventional processors running
conventional, off-the-shelf operating systems. The retrieved
content/data is handled by one or more of the micro-engines
operating within the content transfer access nodes 10 and 12.
[0089] Handling is distinguished from "processing" the retrieved
data. Handling, in an illustrative embodiment, includes basic
packaging (or wrapping) of the retrieved data with, for example, a
header, and transmitting the data according to rules specified by
communications protocols. In other instances, the data is packaged
by other entities such as the data storage-to-internal network
interfaces. During handling the content of the retrieved data is
not examined. On the other hand during processing, an application
executes to substantively examine and/or modify the retrieved data.
Processing is generally executed by an application executing upon
an off-the-shelf operating system.
[0090] After performing any specified handling, the packaged
retrieved data is transmitted over the external network 20 to one
of the connected clients 18. Similarly, the content transfer access
nodes 10 and 12 facilitate storing data received via external
network 20 onto the data storage nodes 14 without passing the data
through a conventional processor/operating system incorporated into
the supplemental processor nodes 22.
[0091] The above-described processing differs from that of
conventional processors in several ways. First, the content
transfer engine 50 has no general purpose caches. Instead the
internal memory is used as the primary working area, and the
external RAM is viewed as a storage device. Typically processors
view the external RAM as the primary working area, and on-chip
memory is used to dynamically cache portions of the working area in
a manner that is transparent to the software. Given that
conventional processors need to support a wide range of application
architectures, this generalized approach is clearly superior. But
the content transfer engine supports a highly specialized
processing pattern, and applications can be designed to simply work
from the equivalent of the internal cache directly, with the
software taking explicit responsibility for transferring to and from
the external RAM. Second, the context switching required by the CTE
50 to support this type of transfer is extremely minimal compared
to that under a conventional processor. Switching execution units
only occurs between interpretive instructions (cooperative
multi-tasking as opposed to pre-emptive) and switching between
connection contexts only occurs at event dispatch. In conventional
processor terms, switching to a connection context is just an
application directed load and flush of a data cache. Loading and
flushing data caches occurs constantly in conventional processor
architectures.
[0092] Additionally, conventional processor architectures must save
and restore registers, switch stack frames and possibly shift
memory maps. Third, the CTE 50 has no need for memory mapping
support. Conventional processors, and the entire VI family of
protocols, go through many steps to ensure that user-mode
applications can use the buffers they want to, at the memory
addresses they desire. The CTE 50 provides buffers to content
transfer daemons (CTDs). The CTDs do not attempt to control their
own memory allocation, and hence do not really care where their
data is stored, or how the buffers they are handling are addressed.
The aforementioned content request handling is carried out, for
example, by the content transfer access node 10 including the
content transfer engine (CTE) 50 having a general arrangement
depicted in FIG. 3. In addition to identifying a set of functional
components within the CTE 50, FIG. 3 depicts a set of logical paths
linking the identified functional components of the CTE 50 to one
another. Links are also depicted between the CTE 50 and the
external RAM 70, the external network 20 and the internal network
16.
[0093] The CTE 50 includes a physical interface linking to the
internal network 16. In particular, a bi-directional data path 120
links the internal network interface engine 64 to the internal
network 16. In an embodiment of the invention, the internal network
interface engine 64 supports cell-based data transmissions over the
bi-directional data path 120. However, in alternative embodiments,
the data transmitted over the data path 120 is arranged in other
formats in accordance with a variety of supported data transmission
protocols.
[0094] In accordance with an embodiment of the present invention,
data is transferred between the external RAM 70 and the internal
network interface engine 64--thereby bypassing costly conventional
processor overhead. A bi-directional data path 122 between the
internal network interface engine 64 and external RAM
interface/buffer control 124 (corresponding to the buffer control
72 and memory controller 74 in FIG. 2) facilitates content data
transfers between the internal network interface engine 64 and the
external RAM 70. A bi-directional address/data bus 126 links the
external RAM interface/buffer control 124 to the external RAM 70.
The bi-directional data path 122 and bi-directional address/data
bus 126 facilitate transmitting content data directly between the
internal network interface engine 64 and external RAM 70.
[0095] Data transfers to the internal network 16 from the external
RAM 70 are executed according to commands generated by the event
engines 58. Upon generation by the event engines 58, the commands
are placed within the internal network interface engine queue 62.
Transfer of generated commands from the event engines 58 to the
internal network interface engine queue 62 are represented by line
130. The queued commands are read by the internal network interface
engine 64 (represented by line 132). The internal network interface
engine 64 executes read and write operations involving the external
RAM 70 and internal nodes connected via the internal network 16
(e.g., storage nodes 14 and/or the supplemental processor nodes
22).
[0096] The internal network interface engine 64 is also capable of
generating events (e.g., commands or messages) that drive the
operation of the event engines 58. The internal network interface
engine 64 submits events (described herein below) to the event
queue 60 as represented by line 134. The queued events are read and
executed by appropriate ones of the event engines 58 (represented
by line 136). A bi-directional data path 138 between the event
engines 58 and the external RAM interface/buffer control 124
facilitates submission of read and write requests by the event
engines 58 to the external RAM interface/buffer controller 124.
Such requests concern transfer of data between the external RAM 70
and the internal network interface engine 64 and/or the external
network interface engine 54. When capturing buffers for storing
incoming data, this interface is used to locate a target memory
location in the external RAM 70 to store the received data, and
then to store it. The posted event merely refers to the buffer
collected.
[0097] The output command posted to the external network interface
engine queue 56 will typically contain references to this data.
Thus content transfer is controlled by the CTD, without requiring
examination or processing of the actual packet contents.
[0098] Internal Network Interface Engine
[0099] Converting Between Packets and Cells
[0100] One of the functions performed by the CTE 50 is converting
data between a packetized form (for transmission over the external
network) and a cell form (for transmission over the internal
network). As previously mentioned, in the illustrative embodiment
of the present invention, data is transmitted in the form of data
cells over the internal network 16. In an exemplary embodiment of
the present invention the cell/packet and packet/cell conversions
are executed within the internal network interface engine 64. Such
conversion functionality is embedded into other components of the
CTE 50 (e.g., the external network interface engine 54) in
alternative embodiments of the invention.
[0101] In its outbound role (sending data over the internal
network 16), the internal network interface engine 64 receives
event buffers from the internal network interface engine queue 62.
In response, the internal network interface engine 64 fetches a
buffered packet from the external RAM 70 (via the buffer control
124) based upon a location pointed to by the contents of an event
retrieved from the queue 62. The internal network interface engine
64 generates a set of cells from the retrieved buffered packet, and
transmits the cells to the internal network 16 via lines 120.
Generating the cells includes generating and appending a CRC-32 for
the data within the buffered packet. At the end of transmitting a
series of cells corresponding to the buffered packet to the
internal network 16, a buffering outbound depacketizer within the
internal network interface engine 64 releases the event buffer back
to the buffer control 124.
[0102] An output request is made by posting an event that
references a "smart buffer" to the external network interface
engine queue 56. As previously described for the external
interface, the internal interface transmits each portion of the
described output to a buffer set aside in the external RAM 70.
After each segment is transmitted, the claim on that portion of the
smart buffer set aside in RAM 70 is released. Finally the claim on
the smart buffer itself is released when the requested data has
been transmitted entirely. With regard to data transmission to the
internal network, the internal network interface engine 64
translates each segment into a series of internal network
cells.
[0103] In the exemplary embodiment of the invention, the internal
network interface engine 64 receives a sequence of one or more
cells from the internal network 16 in accordance with known data
cell delivery specifications and data communications techniques--or
variations of such standards. The internal network interface engine
64 converts the received cells into packets. The internal network
interface engine 64 supports concurrent construction of one packet
per assigned reception channel. Each received cell corresponds to a
unique reception channel that distinguishes a received cell for a
particular packet from other cells received by the CTE 50 that are
associated with other incomplete packets being constructed within
the CTE 50. In view of the need to distinguish packets under
construction, the internal network interface engine 64 reassembles
a maximum of one packet per reception channel at any given instant
in time. To maintain such support, the buffering inbound packetizer
of the internal network interface engine 64 stores a packet
reception state for each designated reception channel (e.g.,
expecting packet start/middle packet, accumulated packet length,
accumulated CRC-32 value, etc.). The buffering inbound packetizer
latches a context buffer in internal RAM associated with each
reception channel based upon a start cell for each packet to be
constructed. The internal network interface engine 64 also invokes
the external RAM interface/buffer control 124 to establish a data
buffer within the external RAM 70 to the extent needed to handle an
incoming packet. The options for allocating the buffer are as
previously described for the external network interface engine.
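By way of illustration only, the per-channel reassembly state
described above might be represented as in the following C sketch;
the structure, field names, and channel count are hypothetical rather
than taken from the specification.

    /* Hypothetical per-reception-channel state for the buffering inbound
     * packetizer: at most one packet is reassembled per channel at a time. */
    #include <stdint.h>

    enum rx_phase { EXPECT_PACKET_START, EXPECT_PACKET_MIDDLE };

    struct reception_channel {
        enum rx_phase phase;           /* expecting packet start / middle  */
        uint32_t accumulated_length;   /* payload bytes collected so far   */
        uint32_t accumulated_crc32;    /* running CRC-32 over the payload  */
        uint32_t capture_buffer_id;    /* buffer latched for this packet   */
        uint32_t promised_length;      /* size announced by the start cell */
    };

    #define N_RECEPTION_CHANNELS 64    /* illustrative value only */
    static struct reception_channel channels[N_RECEPTION_CHANNELS];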
[0104] Thereafter, the data payload in the corresponding received
cells is stored within the established data buffer. When a packet
is complete, the internal network interface engine 64 creates an
event message. The packet completion event message is placed within
a designated one of the event queues 60 associated with a
particular event engine of the set of event engines 58 that will
handle the completed packet.
[0105] Regardless of the manner in which a packet is accumulated,
once it is complete, the constructed packet's CRC is validated. If
the constructed packet's CRC is good, the payload is delivered to a
dedicated destination. Otherwise CRC failure is tallied for
statistical tracking, and the packet is dropped. For buffered
inbound packetizing, when a packet is invalid (e.g., a failed
CRC-32 check), the buffering inbound packetizer of the internal
network interface engine 64 releases the current buffer associated
with the incoming data packet.
[0106] External Network Interface Engine
[0107] Having described the portions of the CTE 50 most closely
associated with data transfers over the internal network 16,
attention is now directed to portions of the CTE 50 components that
facilitate data transfers between the external RAM 70 and the
external network 20. Such transfers are performed, for example, by
the CTE 50 to provide retrieved data from one of the data storage
nodes 14 to a requesting client via the external network 20. The
external network interface engine 54 receives and transmits packets
to/from the external network. In an exemplary embodiment the
external network interface engine 54 manages a device bus and the
network interface engine devices on the device bus. In accordance
with an embodiment of the present invention, data is transferred
between the external RAM 70 and the external network interface
engine 54. A bi-directional data path 142 between the external
network interface engine 54 and external RAM interface/buffer
control 124 facilitates content data transfers between the external
network interface engine 54 and the external RAM 70. The
bi-directional data path 142 and bi-directional address/data bus
126 facilitate transmitting content data directly between the
external network interface engine 54 of the CTE 50 and external RAM
70.
[0108] To capture packets, the external network interface engine 54
responds to a packet ready condition (typically in the form of a
status line and/or interrupt from an actual network interface
engine device) by transferring just the packet header from the
actual interface. This is illustrated in FIG. 13 as the
"headerReceived" transition from the "idle" state.
[0109] Fields of the header are used to find the hash entry for
this connection. If none exists a new context is allocated and
placed in the hash table. Fresh context allocations come from a
reserved pool of "unknown connection" contexts. This limits the
total resources that can be tied down responding to new external
network connection requests.
[0110] Such limits represent yet another manner in which the system
defends against denial-of-service attacks. If the context resource
pool is exhausted, then the incoming frame is discarded. A
diagnostic tally of all such discards is kept. After assigning a
context, the header is examined to determine the capture type
(serial or RDMA) and the total packet size.
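By way of illustration only, the bounded context allocation described
above might resemble the following C sketch. The pool size and the
hash_lookup, unknown_pool_alloc, and hash_insert interfaces are
assumptions introduced for the example, not elements of the
specification.

    /* Illustrative sketch: new connections draw contexts from a bounded
     * reserve pool; when the pool is empty the frame is discarded and a
     * diagnostic tally is incremented, limiting the resources a flood of
     * connection attempts can consume. */
    #include <stddef.h>
    #include <stdint.h>

    struct context;                                        /* opaque       */
    extern struct context *hash_lookup(uint32_t hash);     /* assumed      */
    extern struct context *unknown_pool_alloc(void);       /* assumed      */
    extern void hash_insert(uint32_t hash, struct context *c); /* assumed  */

    static uint64_t n_context_denials;   /* tally of discarded frames      */

    struct context *find_or_allocate_context(uint32_t header_hash)
    {
        struct context *c = hash_lookup(header_hash);
        if (c != NULL)
            return c;                    /* existing connection            */

        c = unknown_pool_alloc();        /* bounded "unknown" reserve pool */
        if (c == NULL) {
            n_context_denials++;         /* tally the discard              */
            return NULL;                 /* caller flushes the frame       */
        }
        hash_insert(header_hash, c);
        return c;
    }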
[0111] Having determined the capture type and packet size, the
context is latched for the identified capture type. As detailed
herein, latching the capture buffer for a context finds or
allocates a buffer to capture the incoming data. Once a capture
buffer is allocated, the fetched header is stored in that buffer.
Then a DMA transfer from the external network device to the capture
buffer is requested.
[0112] The DMA transfer either completes successfully (as shown by
the "goodTransferNotice" transition in FIG. 13) or an error is
detected, such as an invalid CRC (as shown by the
"badTransferNotice"). If the DMA transfer is successful the
captured bytes are committed. At this point the immediate
collection is complete, and the context is unlatched. In an
external protocol dependent fashion the interface engine determines
whether use of the current buffer has ended. If use of the current
buffer is done, then the buffer is released. Otherwise the engine
returns to an "idle" state. The one condition where a buffer is
always "done" is when it has been completely filled. On a bad
transfer notice the context is unlatched and the buffer is
aborted.
[0113] External Network Output Requests
[0114] Data transfers between the external network 20 and the
external RAM 70 are executed according to commands generated by the
event engines 58. Upon generation by the event engines 58, the
commands are placed within the external network interface engine
queue 56. Transfer of generated commands from the event engines 58
to the external network interface engine queue 56 are represented
by line 144. The queued commands are read by the external network
interface engine 54 (represented by line 146). The external network
interface engine 54 executes read and write operations involving
the external RAM 70 and external nodes connected via the external
network 20 (e.g., clients 18).
[0115] The external network interface engine 54 is also capable of
generating events (e.g., commands or messages) that drive the
operation of the event engines 58. The external network interface
engine 54 submits events (described herein below) to the event
queue 60 as represented by line 148. The queued events are read and
executed by appropriate ones of the event engines 58 (represented
by line 136).
[0116] Context Mutual Exclusion by Serialization
[0117] One requirement of object-oriented network processing is
that the system must ensure that the events for any given object
are processed serially. The processing of event N for
object/connection X must be completed before the processing for
event N+1 for object/connection X starts. Conventional approaches
to solving this problem include assigning each object to a single
processing thread, and use of semaphores or locks to ensure that if
two threads do try to update the same context, the second one
will be forced to wait for the first's completion. These
conventional solutions consume system resources and can result in
delays in handling events. Static assignment of threads is low
overhead, but can result in some threads being idle while others
are backed up in processing their input queue.
[0118] The CTE 50 solves this problem by dynamically assigning a
context/object to a specific execution unit when it makes a
dispatch. If there are no currently dispatched events for that
context, one of the event engines is picked arbitrarily. To ensure
that this selection is statistically balanced, and that it can be
done without requiring references to tables stored in external
memory, the preferred implementation is to make the selection based
upon a hash of a dispatched routine handle. However there are a
wide range of algorithms to make a balanced selection with low
overhead.
[0119] When a dispatch is made the CTE increments a count of
dispatched events for that context, and records a selected event
engine of the event engines 58. As long as that count remains
non-zero, all future dispatches for this context/object will be
placed on the same queue. At the completion of dispatching each
event, the selected event engine will decrement the dispatch count
for the context/object. When the count returns to zero, the choice
of event queue is once again unlatched and can be dynamically
assigned to a later event associated with the context/object.
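By way of illustration only, the dispatch-count latching described
above might be sketched in C as follows; the structure, the engine
count, and the hash constant are hypothetical, and only the
increment/decrement discipline follows the text.

    /* Sketch of per-context event serialization: while a context has
     * outstanding dispatches, every new event for it is routed to the same
     * event engine; when the count returns to zero the binding is
     * unlatched and a later event may pick a new engine. */
    #include <stdint.h>

    #define N_EVENT_ENGINES 4            /* illustrative count */

    struct context_dispatch {
        uint16_t dispatch_count;         /* outstanding events             */
        uint8_t  assigned_engine;        /* valid while count is non-zero  */
    };

    static uint8_t pick_engine(uint32_t routine_handle)
    {
        /* Balanced, table-free selection, e.g. a hash of the dispatched
         * routine handle. */
        return (uint8_t)((routine_handle * 2654435761u) >> 28)
               % N_EVENT_ENGINES;
    }

    uint8_t dispatch_event(struct context_dispatch *ctx,
                           uint32_t routine_handle)
    {
        if (ctx->dispatch_count == 0)
            ctx->assigned_engine = pick_engine(routine_handle);
        ctx->dispatch_count++;           /* latch the engine choice        */
        return ctx->assigned_engine;     /* queue the event on this engine */
    }

    void dispatch_complete(struct context_dispatch *ctx)
    {
        if (--ctx->dispatch_count == 0) {
            /* Binding unlatched; a later event may be assigned to any
             * engine dynamically. */
        }
    }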
[0120] Queues
[0121] In an exemplary embodiment, queues are a fundamental element
for interfacing the execution units (engines) and the CTDs 59 and
device control routines that they run. Each of the queues depicted
in FIGS. 2 and 3 feeds a single execution unit. When a calling
component seeks to send a message to another component of the CTE
50, the calling component posts a message to the target queue.
Posting may be a result of buffer capture or may be the result of a
successful event dispatch by an event engine. The receiving
component is responsible for fetching the received messages from
its queue.
[0122] Furthermore, events need not always be posted immediately to
a particular queue when they arise. In an exemplary embodiment of
the invention, a ticker within the internal network interface
engine 64 posts events received from a time deferred request queue
(not shown in the figures). The ticker accepts event messages and
buffers them for a designated time period. The ticker reposts the
event message to an appropriate one of the event queues 60 after a
designated wait period specified in the received event message.
[0123] In an embodiment of the invention, message queues are
implemented entirely upon a chip containing the processing
components of the CTE 50 (i.e., no external, off-chip memory is
utilized), and there is no access delay for making an external call
to off-board memory. Because, in the exemplary embodiment, the
message queue storage is on-chip, the capacity of each queue is not
dynamically modifiable. Instead, changes to queue size are carried
out in the FPGA core or based upon a parameter accessed when the FPGA
is initialized.
[0124] The various queues 56, 60, and 62 store request
"notifications" rather than the requests themselves. The request
notifications, an exemplary format of which is discussed herein
below with reference to FIG. 10, include a command and a pointer to
additional instructions and/or data stored within external random
access memory 70 managed by the buffer control 72 and memory
controller 74. Transferring notifications, rather than the buffered
requests themselves, between components reduces memory transfers
within the data access server system embodying the present
invention and streamlines data transfer procedures.
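By way of illustration only, the three-byte notification described
with reference to FIG. 10 could be packed and unpacked as in the
following C sketch; the helper names are hypothetical.

    /* 24-bit queue notification: a 4-bit completion-notification type
     * followed by a 20-bit pointer (buffer ID) into external RAM. */
    #include <stdint.h>

    static inline uint32_t pack_notification(uint8_t type, uint32_t buffer_id)
    {
        /* type in bits 23..20, buffer ID in bits 19..0 */
        return ((uint32_t)(type & 0x0F) << 20) | (buffer_id & 0xFFFFF);
    }

    static inline uint8_t notification_type(uint32_t n)
    {
        return (uint8_t)((n >> 20) & 0x0F);
    }

    static inline uint32_t notification_buffer_id(uint32_t n)
    {
        return n & 0xFFFFF;
    }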
[0125] As those skilled in the art will readily appreciate, the
above described hardware/firmware arrangement depicted in FIG. 3 is
exemplary of an architecture for facilitating transfer of large
amounts of data, retrieved from a storage device via the internal
network 16 (e.g., switch fabric), to a requesting external device
connected via an external network. The architecture of the content
transfer access node 10 depicted in FIGS. 2 and 3 has only limited
processing capabilities for facilitating the data transfers. The
result is a relatively inexpensive high-throughput server interface
for providing information assets to connected clients over an
external network.
[0126] Turning now to FIG. 4, an exemplary context format 200 and
related data structures are depicted. The context represents the
state of a particular task being performed by the CTE 50. A class
state ID 202 provides a pointer to a class state record 203. The
class state record 203 includes a memory pool ID 204 that
corresponds to a particular buffer pool from which buffer space is
allocated to a particular context object. A dispatch target 205
describes a particular type of action to be performed by the
particular context object to which the context is assigned. In an
embodiment of the invention, the class state also includes an event
engine routine handle that identifies a code routine executed in
association with the particular class state.
[0127] A set of memory pools are associated with each of a set of
capture buffer types. The memory pool ID 204 points to a particular
memory pool record 206 associated with one of the set of capture
buffer types. A memory pool record includes two fields: a list head
field 208 and a list tail field 210. The list head field 208 stores
a value corresponding to the location of a first free buffer
control block in a linked list of free buffer control blocks
representing unused ones of memory blocks allocated to the
particular memory pool. The list tail field 210 stores a value
corresponding to the location of a last free buffer control block
in the linked list of free buffer control blocks. In an
illustrative embodiment the list head 208 and list tail 210
comprise a total of 48 bits. The first four bits are reserved, the
next twenty bits specify a location in the external RAM 70 where
the first free buffer control block is located, the next four bits
are zero, and the final twenty bits specify a location in the
external RAM 70 where the last free buffer control block for a
particular memory pool is located.
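By way of illustration only, the 48-bit memory pool record described
above might be unpacked as in the following C sketch. The helper
names are hypothetical, and the sketch assumes the "first" bits of
the record are the most significant bits.

    /* 48-bit memory pool record: 4 reserved bits, a 20-bit location of the
     * first free buffer control block, 4 zero bits, and a 20-bit location
     * of the last free buffer control block. */
    #include <stdint.h>

    struct memory_pool {
        uint64_t raw;                              /* low 48 bits used     */
    };

    static inline uint32_t pool_list_head(const struct memory_pool *p)
    {
        return (uint32_t)((p->raw >> 24) & 0xFFFFF);   /* bits 43..24 */
    }

    static inline uint32_t pool_list_tail(const struct memory_pool *p)
    {
        return (uint32_t)(p->raw & 0xFFFFF);           /* bits 19..0  */
    }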
[0128] For each network interface engine (e.g., internal and
external), the context includes a corresponding capture buffer
descriptor 220, 240. Each internal network capture buffer
descriptor 220 and external network capture buffer descriptor 240
includes an optional handle (pointer or other reference)
referencing a capture buffer having a format of the type depicted
by capture buffer control block 222. Non-null handles indicate that
the process of capturing a buffer for this context on that network
interface engine is in-progress. Each instance of the capture
buffer control block 222 includes, by way of example, a copied to
buffer address 224 that identifies a location where a buffered data
block commences. A logical length 226 identifies the portion of the
buffer that has fully captured (or otherwise logically valid)
content. By contrast, an allocated length field 228 identifies the
total length of the memory allocated to a particular capture
buffer. As discussed previously above, the dispatch target field
205 includes one of an extensible set of various target descriptors
(see, FIG. 9). As indicated by target queue 230, in addition to
pointing to a particular memory pool (field 229), the capture
buffer can also point to a particular event queue that receives
notifications/requests for a particular class of tasks. As
indicated by arrows 232, the event queues 60 in turn include queue
entries that specify buffer ID's pointing back to an instance of
the capture buffer control block 222. Furthermore, a smart output
buffer 234, in an embodiment of the present invention, includes a
list of segment descriptor/buffer reference pairs that reference
multiple instances of the capture buffer control block 222. In such
cases, the segment descriptor specifies an operation/task, and the
buffer reference (as depicted by lines 236) specifies a particular
instance of the capture buffer control block 222 upon which the
operation/task is to be performed.
[0129] Sub-fields within the internal network capture buffer
descriptor 220 of the context 200 also track a commit limit and
total committed fields. These sub-fields support reporting of
partial delivery over the internal network when the out-of-order
capability of RDMA transfers is invoked. An exemplary buffer
allocation scheme involving partial delivery is described herein
below. A buffer use count 238 stores a value indicating the number
of outstanding claims upon the capture buffer; when the count
returns to zero, the capture buffer is released.
[0130] The external network capture buffer descriptor 240, as
mentioned above, stores a reference (optional) to an instance of a
capture buffer control block 222. Instances of external network
capture buffers correspond to data transfers from the external
network 20 into the external RAM 70. An assigned engine queue 250
specifies one of the set of event engines 58 that has been assigned
to handle the task associated with the particular context.
[0131] As explained herein above, the CTE 50 dynamically assigns a
context/object to a specific execution unit when it makes a
dispatch. The dispatch count 260 identifies the number of
dispatches associated with a particular context. If there are no
currently dispatched events for that context, one of the event
engines is picked arbitrarily. When a dispatch is made the CTE
increments a count of dispatched events for that context, and
records a selected event engine of the event engines 58. As long as
that count remains non-zero, all future dispatches for this
context/object will be placed on the same queue. At the completion
of dispatching each event, the selected event engine will decrement
the dispatch count for the context/object. When the count returns
to zero, the choice of event queue is once again unlatched and can
be dynamically assigned to a later event associated with the
context/object.
[0132] In an exemplary embodiment of the present invention all
buffers are referenced by twenty-bit buffer IDs that are indexes to
buffer control blocks. Four distinct types of buffer control blocks
are supported: tables, code routines, allocated buffers and free
buffers. The buffer control blocks for tables and code routines are
64 bits each, and 128 bits are required for allocated and free
buffers.
[0133] A pre-defined location holds the buffer control block for
buffer ID zero. This is pre-defined to be a table that holds the
buffer control blocks for all tables and code routines. Buffer ID
"one" is reserved for the table holding the buffer control blocks
for allocated and free buffers. The first element in the table
holding the buffer control blocks will have a buffer ID higher than
the buffer ID of the last entry in the table/code-routine
table.
[0134] Turning now to FIG. 5, a 64-bit table memory control block
format is defined. A table buffer address 300 points to the first
entry in a table containing memory control blocks (both 64 and
128-bit). The next 8 bits, field 302 of the memory control block
are zeroes (a format marker). A sixteen-bit record size 304
specifies the size of each record. An allocated table size 306
(only 16 of 24 bits are specified) indicates the number of bytes
allocated to hold the records; this must be a multiple of the
record size.
[0135] An exemplary 64-bit code routine memory control block format
is defined in FIG. 6. While their content and usage is different,
code routine memory control blocks are identical in format to table
control blocks. Code routine memory control blocks include eight
1's as a format marker 312 and all zeroes in the record size 314.
Rather than a buffer size, routine size 316 stores a size of a code
routine.
[0136] Turning to FIG. 7, the fields are depicted for an allocated
buffer control block. Types of buffers include capture (described
herein above with reference to FIG. 4), direct and smart output.
Each instance of a 128-bit memory control block includes a copied
to buffer address 320 that identifies a location where a buffered
data block commences. A logical length 322 identifies the portion
of the allocated space that is logically usable. For output buffers
this would indicate to the network interface engine how much of the
buffer should be examined for output specifications (or data). For
capture buffers it indicates how large the fully committed portion
of the buffer is. An allocated length 324 identifies the total
length of the memory allocated to a particular allocated buffer. A
format marker 326 indicates whether a particular buffer is a smart
output buffer, simple output buffer, or a capture buffer. A memory
pool ID 328 indicates the one of a total of 256 potential memory
pools with which a particular memory buffer is associated. A
designated queue 330 identifies the event queue of the set of event
queues 60 with which the memory buffer is associated. An event
target 334 comprises a description of a particular event and
specifies how to process the buffer's contents. Examples of event
formats are described herein below with reference to FIG. 9. A
buffer use count field 336 identifies the number of users of a
particular buffer. The buffer use count field 336 is incremented
when a buffer is allocated or attached. It is decremented each time
a claim on it is released. When it is decremented back to zero it
will be placed back in its assigned home memory pool by appending
it to the tail of that pool's free list.
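By way of illustration only, the allocated buffer control block of
FIG. 7 and its release discipline might be sketched in C as follows;
field widths follow the text but the packing, type choices, and the
free_list_append helper are assumptions of the example.

    /* Sketch of an allocated buffer control block (128 bits in the text;
     * packing here is illustrative only). */
    #include <stdint.h>

    struct buffer_control_block {
        uint32_t buffer_address;    /* where the buffered data commences   */
        uint32_t logical_length;    /* logically usable / committed bytes  */
        uint32_t allocated_length;  /* total bytes allocated to the buffer */
        uint8_t  format_marker;     /* smart output / simple output / capture */
        uint8_t  memory_pool_id;    /* one of up to 256 memory pools       */
        uint8_t  designated_queue;  /* associated event queue              */
        uint64_t event_target;      /* 40-bit event target (see FIG. 9)    */
        uint16_t use_count;         /* number of outstanding claims        */
    };

    extern void free_list_append(uint8_t pool_id, uint32_t buffer_id); /* assumed */

    void buffer_release(struct buffer_control_block *b, uint32_t buffer_id)
    {
        /* Each released claim decrements the use count; at zero the buffer
         * is appended to the tail of its home pool's free list. */
        if (--b->use_count == 0)
            free_list_append(b->memory_pool_id, buffer_id);
    }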
[0137] A free buffer control block format (one that has not been
allocated to an event) is summarized in FIG. 8. The free buffer
control block format is similar to the buffer control block
depicted in FIG. 7. The identifying difference is that a free
buffer has a zero use count, while an allocated buffer has a
non-zero use count. The memory address, allocated buffer length and
use count fields are aligned, so as to allow references to buffer
control blocks without prior knowledge as to whether the specific
control block is free or in use. Each instance of a 128-bit memory
control block includes a copied to buffer address 340 that
identifies a location where a buffered data block commences. A next
set of 24 bits 342 are reserved (all zero). An allocated length 344
identifies the total length of the memory allocated to a particular
free buffer. A memory pool 348 indicates the one of a total of 256
potential memory pools with which a particular memory buffer is
associated. A next set of twenty-eight bits 350 are undefined. They
may have leftover content from when this buffer control block was
last allocated. A next free buffer 352 (20 bits) identifies the
address of a next free buffer control block (thereby enabling
chaining of free buffer control blocks). A buffer use count field
354 identifies the number of users of a particular buffer. In a
"free buffer" this value is zero by definition--if there were
claims upon the buffer then it would not be a free buffer.
[0138] Turning now to FIG. 9, a set of exemplary event targets (40
bits) are depicted. In general, the first section of an event
comprises a format marker field specifying an event type. This is a
variable length "Huffman" style encoding, which is well-known
technique in the field.
[0139] The first type 400 (Internal Network Target) specifies the
packet header information required for an Internal Network Target.
This format is used for output requests placed on the internal
network event queue. Normally this is only used for output requests
from CTDs, but a context state may specify forwarding event notices
directly to an XCTD via the internal network.
[0140] The second type 401 (Table Update) is used by a CTD or XCTD
to send updates to specific tables. This mechanism is used to
bootstrap the CTE. The target specifies which table is being
updated (16 bits) and a record offset of the new data (20 bits).
Normally both the `set` and `clear` flags are set, causing the
entire content of the addressed records to be replaced. By setting
only `set` or `clear` the updater may request bit-wise setting or
clearing of bits in the target record.
[0141] The third type 402 (Ticker Target) is used by CTDs to
request notification after a specific period of time has elapsed.
The target specifies the context that is to receive the event and
the minimum number of implementation-defined clock ticks that must
transpire first.
[0142] The fourth type 403 (External Network Target) is used by the
CTD to request output on the external network. The target details
the set of external network ports that are acceptable for sending
this traffic. In most configurations the CTD finds all ports
acceptable, but there could be configurations where different ports
lead to different sub-networks.
[0143] The fifth type 404 (Context Target) is used upon buffer
capture and for communication between CTDs. It specifies the target
context (20 bits), the generation (4 bits) and the capture type (2
bits). The generation is zero for packet capture. For CTD generated
events it is any number greater than that of the event currently
being processed. This limitation prevents "chain reactions" in
faulty software where a single event results in posting two events,
which result in posting four events, etc.
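By way of illustration only, the generation rule for the Context
Target type might be checked as in the following C sketch; the
structure and function names are hypothetical, and only the
comparison rule is drawn from the text.

    /* A CTD-posted event must carry a generation greater than that of the
     * event currently being processed, preventing a faulty daemon from
     * starting a chain reaction of self-posted events.  The generation is
     * zero for raw packet capture. */
    #include <stdbool.h>
    #include <stdint.h>

    struct context_target {
        uint32_t context_id;    /* 20 bits */
        uint8_t  generation;    /* 4 bits; zero for packet capture */
        uint8_t  capture_type;  /* 2 bits */
    };

    bool context_target_postable(const struct context_target *t,
                                 uint8_t current_generation)
    {
        if (t->generation == 0)
            return true;                        /* raw packet capture     */
        return t->generation > current_generation; /* CTD-generated event */
    }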
[0144] The sixth type 405 (Simulated Input Target) allows a CTD to
simulate input with a supplied buffer that is to be treated as
though it were raw input received upon the specified channel. For
example, the internal network interface engine could be told to
simulate reception on a given Node Channel and then be supplied an
array of cells. This feature is intended to support debugging and
diagnostics.
[0145] As mentioned herein above, the CTE 50 includes a set of
message queues 56, 60 and 62 that buffer request/command
transmissions between internal components of the CTE 50. Referring
to FIG. 10, in an embodiment of the invention, each message is 3
bytes (24 bits). The first four bits 410 identify a specific event
completion notification type. The remaining twenty bits 412
comprise a pointer to a buffer in the external RAM 70 associated with
the event and created by the external RAM interface/buffer control
124 of the CTE 50.
[0146] Turning now to FIG. 11, a set of buffer depictions
illustrate another aspect of an embodiment of the present invention
enabling buffers to fill in an out-of-order manner while still
enabling prompt use of totally delivered portions of the buffer. An
important feature of an exemplary embodiment of the invention is
the ability to issue large read requests and then receive partial
completion notifications as the requested content is delivered.
This approach allows the storage network to exercise flexibility in
scheduling deliveries.
[0147] The more conventional approach would use more, smaller
buffers so as to allow for a steady flow of completion
notifications. The conventional approach, however, reduces the
flexibility of the scheduler. Depending on the internal network
protocols, the requests may have to be dealt with in sequential
order. At a minimum, scheduling would be constrained to respect the
boundaries between the requests. A single transfer could not be
scheduled that crossed the artificial boundaries between sequential
requests.
[0148] When files are stored in non-contiguous sectors on storage
media maintained on the storage nodes 14, optimum scheduling of the
drive head may require reading blocks "out of order", i.e. in an
order that differs from the logical arrangement of the data in the
file. Requiring the server to deliver the blocks in order either
causes more disk head motion or storage-side buffering (until the
complete file is retrieved). Out-of-order delivery facilitates the
greatest flexibility for disk data delivery scheduling while
eliminating the need for extra disk side buffering. Augmenting
out-of-order delivery with early completions allows the use of
larger buffers without the increasing pipeline delay that waiting
for complete delivery of those larger buffers would entail. As
shown in FIG. 11, the ability to support partial completions with
out-of-order delivery is achieved by logically dividing each
capture buffer into three zones (uncommitted, partially committed
and fully committed).
[0149] FIG. 11 depicts a sequence of commits to a data capture
buffer. Initially, the buffer is completely empty. The first commit
operation adds a data block having a size of "4" and an offset
"10." Thus, since a gap exists at the beginning of the buffer, the
fully committed value remains at zero. The value of the highest
location in the buffer containing data, "14", is stored in the
commit limit field. The total committed data size is set to 4.
[0150] Next, a data block of size "6" is added to the buffer
starting at location zero. The total committed is increased to 10,
the highest filled location remains "14," and the fully committed
range in the data buffer is increased to "6." The next commit is
size "4" beginning at offset "6." The commit limit remains
unchanged at "14." At this point the total committed is increased
from "10" to "14" and now equals the commit limit (now "14").
Therefore, the fully committed field can be updated to "14." A
partial completion event can now be posted that specifies the
readiness of data up to location 14 to be transmitted from the data
capture buffer to a specified destination.
[0151] The remaining two diagrams of FIG. 11 depict the completion
of two more commits. The first of the two adds data to the end of
the buffer and increases the total committed and the commit limit.
The second of the two commits adds data to fill a gap and complete
the space of the data capture buffer.
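By way of illustration only, the three-zone bookkeeping of FIG. 11
might be tracked as in the following C sketch. The field and function
names are hypothetical, and the update rules are a simplification
inferred from the worked example above; they track only the three
fields described and do not handle every possible out-of-order
pattern.

    /* fully_committed: length of the gap-free prefix of the buffer;
     * total_committed: committed bytes regardless of order;
     * commit_limit:    highest filled offset in the buffer. */
    #include <stdint.h>

    struct capture_commits {
        uint32_t fully_committed;
        uint32_t total_committed;
        uint32_t commit_limit;
    };

    /* Returns the new fully committed length when a partial-completion
     * event may be posted, or 0 when the fully committed zone did not grow. */
    uint32_t commit(struct capture_commits *c, uint32_t offset, uint32_t size)
    {
        uint32_t before = c->fully_committed;

        c->total_committed += size;
        if (offset + size > c->commit_limit)
            c->commit_limit = offset + size;
        if (offset == c->fully_committed)
            c->fully_committed = offset + size;   /* prefix extended       */
        if (c->total_committed == c->commit_limit)
            c->fully_committed = c->commit_limit; /* no gaps remain at all */

        return (c->fully_committed > before) ? c->fully_committed : 0;
    }

    /* Replaying the FIG. 11 sequence: commit(10,4) leaves fully committed
     * at 0 with a commit limit of 14; commit(0,6) raises fully committed
     * to 6; commit(6,4) fills the gap, total committed (14) equals the
     * commit limit (14), and data up to location 14 may be transmitted. */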
[0152] Turning now to FIG. 12, a set of states and transitions are
depicted that summarize the creation of data packets from a set of
received cells in accordance with an embodiment of the present
invention. There are two primary states: a "Not In Packet" state 700
and an "In Packet" state 716. Between these two states, cell collectors
perform sequences of operations in association with state
transitions. From the "Not In Packet" state 700 the normal course
of events is to receive a "startCell". The startCell is processed
as follows. At action 702 the packet size is extracted from the
start cell header. Next, at action 704 the capture buffer is
latched for the type of capture (serial or RDMA) and the packet
size promised. If during action 704 no capture buffer is available,
an ovverun is tallied at action 706 and the collector returns to
the "Not In Packet" state 700.
[0153] Otherwise if a capture buffer is available, then at action
708 the CRC accumulation field is zeroed. Next, at action 710 the
packet size collected field is zeroed. Next, at action 712, the
base address for the collection (within the capture buffer) is
obtained. For RDMA packets this is the RDMA offset from the packet
base. For serial packets it is after any previous packets already
collected for this buffer. Thereafter, at action 714 the cell's
payload is stored, with the CRC and length being accumulated. The
collector transitions to the "In Packet" state 716.
[0154] From the "Not In Packet" state 700 the following error
handling reactions are required. On receipt of an "endCell" the
"nMissedEnds" tally is incremented at action 720 and then the Not
In Packet state 700 is reentered. On receipt of a "midCell" the
"nStrayCells" tally is incremented at action 722 and then the Not
In Packet state 700 is reentered.
[0155] From the "In Packet" state 716 the following is done in
response receipt of a "midCell". In response action 714 stores the
cell's payload, with the CRC and length being accumulated. The In
Packet state 716 is reentered.
[0156] From the "In Packet" state the following operations are
executed in the following identified states in response to
receiving an "endCell." The cell's payload is stored per action
724, with the CRC and length being accumulated. Action 726 checks
the accumulated CRC. If it is bad, the capture buffer is aborted at
action 728, the bad packet counter is incremented at action 730,
the context port is unlatched at action 732 and the collector
returns to the "Not In Packet" state 700. Otherwise, at action 726
if the CRC is valid, the collected bytes are committed at action
740 and the context port is unlatched at action 732 before the
collector returns to the "Not In Packet" state 700.
[0157] From the "Not In Packet" state a "solo cell" may also be
received. A solo cell is processed as though it were both a start
and end cell. After extracting the packet size at action 750, at
action 752 a capture buffer is latched for the packet size
promised. If during action 752 no capture buffer is available, an
overrun is tallied at action 754 and the collector returns to the
"Not In Packet" state 700. Otherwise, action 756 is entered wherein
the CRC accumulation field is zeroed.
[0158] While in the "In Packet" state 716 either a "soloCell" or
"startCell" may be received. After aborting processing of the current
packet, the solo and start cells are processed as though the
collector were in the "Not In Packet" state 700. Aborting the current
packet occurs by
aborting the current capture buffer at action 760 and 762 for start
and solo cells respectively. The context port is unlatched at
action 764 and 766, and the "nMissedEnds" tally is incremented at
actions 768 and 770 for start and solo cells respectively. The
remaining steps concern processing the cell as though the cell was
received while the collector was in the "Not In Packet State" 700.
In particular, a promised packet size is extracted from a start
cell or a solo cell at stages 772 and 750, respectively.
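By way of illustration only, the two-state cell collector of FIG. 12
might be structured as in the following C sketch. The transition
structure follows the description above, but all identifiers are
hypothetical, and the buffer-latching, payload-storage, and CRC
helpers are assumed interfaces rather than elements of the
specification.

    #include <stdbool.h>
    #include <stdint.h>

    enum collector_state { NOT_IN_PACKET, IN_PACKET };
    enum cell_kind { START_CELL, MID_CELL, END_CELL, SOLO_CELL };

    struct collector {
        enum collector_state state;
        uint32_t crc32, collected_len;
        uint32_t n_overruns, n_missed_ends, n_stray_cells, n_bad_packets;
    };

    /* Assumed helpers: latch/abort/commit a capture buffer, accumulate a
     * cell payload (updating CRC and length), and check the accumulated CRC. */
    extern bool latch_capture_buffer(struct collector *c, uint32_t promised);
    extern void abort_capture_buffer(struct collector *c);
    extern void commit_collected_bytes(struct collector *c);
    extern void store_payload(struct collector *c, const uint8_t *p, uint32_t n);
    extern bool crc_is_valid(const struct collector *c);

    void on_cell(struct collector *c, enum cell_kind kind,
                 uint32_t promised_size, const uint8_t *payload, uint32_t len)
    {
        if (c->state == NOT_IN_PACKET) {
            if (kind == MID_CELL) { c->n_stray_cells++; return; }
            if (kind == END_CELL) { c->n_missed_ends++; return; }
            /* startCell or soloCell: latch a capture buffer for the size
             * promised in the cell header. */
            if (!latch_capture_buffer(c, promised_size)) {
                c->n_overruns++;
                return;
            }
            c->crc32 = 0;
            c->collected_len = 0;
            store_payload(c, payload, len);
            if (kind == START_CELL) { c->state = IN_PACKET; return; }
            /* soloCell falls through to end-of-packet handling below. */
        } else {
            if (kind == START_CELL || kind == SOLO_CELL) {
                /* Missing end cell: abort the current packet, then process
                 * the cell as though in the "Not In Packet" state. */
                abort_capture_buffer(c);
                c->n_missed_ends++;
                c->state = NOT_IN_PACKET;
                on_cell(c, kind, promised_size, payload, len);
                return;
            }
            store_payload(c, payload, len);
            if (kind == MID_CELL) return;
            /* endCell falls through to end-of-packet handling below. */
        }
        if (crc_is_valid(c)) commit_collected_bytes(c);
        else { abort_capture_buffer(c); c->n_bad_packets++; }
        c->state = NOT_IN_PACKET;
    }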
[0159] Turning now to FIG. 13, a set of states summarize how
Ethernet frames are collected into a packet in accordance with an
aspect of an illustrative data server interface. Starting from an
"Idle" state 800 the collector responds to the "headerReceived"
notice as follows. At action 802 the header contents are read, and
used to calculate a hash value. The hash value is used at 802 to
find the entry in the hash table where this context should be. If
it does not already exist, then a new context is allocated if there
are any available in the context buffer pool for that purpose.
[0160] If at action 802 no context existed or could be allocated
("[no context available]") then the "MaxNewContextDenials" tally is
incremented at action 804 and the Ethernet frame is flushed at
action 806. The collector then returns to the "Idle" state 800.
[0161] Otherwise if a context existed or can be created, then at
action 810 the capture type (serial or RDMA) and packet size is
determined from the header. Based upon the acquired information, at
action 812 the capture buffer is latched for the port and capture
type. If required by the external protocol, the read header
information is transferred into the capture buffer at action 814.
Next, at action 816 an RDMA transfer is requested of the frame
payload to the capture buffer and the "Transferring payload to
Capture Buffer" state 820 is entered. The "transferring payload to
Capture Buffer" state 820 completes with either a successful
transfer or an error. In either case the collector returns to the
"Idle" state 800 (albeit via differing paths of
sub-states/operations). In the case of a good transfer, at action
822 the collected bytes are committed and at action 824 the context
is unlatched. Otherwise, in the case of a bad transfer the capture
buffer is aborted at action 826, and the context is unlatched at
action 824.
[0162] FIG. 14 summarizes states and substates/operations within
the life cycle of an exemplary one of the event engines 58.
Starting with a Waiting for Event state 1000, the event engine is
activated. The Waiting for Event state 1000 occurs continuously in
a parallel execution architecture, or by having another event
engine block in a co-routine implementation. As a result of
activation the event engine will enter the Dispatching Event state
1002. The Dispatching Event state 1002 will be explored in
more detail later herein with reference to FIG. 15. Normally this
state is exited with an "ok" status, indicating that an event has
been dispatched and the execution of its response can now begin at
Instruction state 1004. Alternately, state 1002 can exit with a
"queue empty" status and return to the "Waiting for Event"
state.
[0163] Execution proceeds from the Instruction state 1004 by
fetching the next instruction from internal memory; the opcode is
decoded at state 1006 and the implementation of that opcode is
invoked during state 1008. The execute opcode state 1008 exits in
one of four ways. The result state transitions are
labeled as "ok-done" 1010, "ok-DMA requested" 1012, "successful
completion" 1014 and "exception" 1016.
[0164] The "ok-done" transition 1010 that returns to the Fetch
Execution state 1004 indicates that the instruction has completed,
and the event engine is ready to execute the next instruction. In
co-routine implementations there is a possibility that the CTE 50
could shift execution to another unblocked event engine between any
two instructions.
[0165] The "ok-DMA requested"transition 1012 indicates that the
engine has requested a DMA transfer to or from external memory from
buffer control. The event engine will be blocked in wait state 1013
until that transfer has completed. At that point the event engine
transitions to the fetch instruction state 1004. In a co-routine
implementation, the CTE shifts execution to an unblocked event
engine.
[0166] The "successful completion" transition 1014 indicates that
the event has been fully processed without an exception being
raised. This transfers to a Completing Successfully state 1015
which is explained with reference to FIG. 16.
[0167] The "exception" transition 1016 occurs when an instruction
raises an exception during the execution of a routine instruction.
Causes of exceptions include divide by zero, failure to complete
before the dispatch deadline or use of an invalid index. A
"Completing with Exception" state 1018 is described herein below
with reference to FIG. 17.
[0168] Referring now to FIG. 15, to dispatch an event an event
engine first consumes an event from its event queue at state 1100.
If the event queue is empty, state 1100 is exited and state 1102 is
entered with a "queue empty status". Otherwise at action 1104 a
buffer ID is extracted from the consumed event. At action 1106 a
DMA read is requested of a buffer control block indexed by the
buffer ID. A wait state 1108 is entered.
[0169] When the DMA read completes, at action 1110 a target context
ID is extracted from the buffer control block. This is used as the
key to initiating a DMA read of the Context itself at action 1112.
The event engine waits at state 1114 for the DMA read to
complete.
[0170] When that read is complete at action 1116 the fault status
of the retrieved context is checked. If the context has been
quarantined due to uncorrected faults control passes to action 1118
and the current buffer is released. The event engine returns to
state 1100 and attempts to fetch an event from its input queue.
[0171] Otherwise at action 1120 the event engine gets the current
context state from the retrieved context and starts a DMA read of
the context state's data at action 1122. The event engine waits at
state 1124 for the context data read to finish.
[0172] Upon completion at action 1126 a routine handle is extracted
from the context. At action 1128 the event engine ensures that the
cache register is set and sets the program counter during action
1130.
[0173] If the code routine is already found within the engine's
code cache then during action 1130 the program counter merely needs
to be pointed at that location and the "ok" exit can be taken by
the event engine during state 1132 to start execution of the
routine.
[0174] Otherwise after the program counter is initialized during
action 1130, the code routine is DMA read into the code cache. A
DMA read is initiated during action 1134 for the code routine's
control block. After waiting at state 1136 for the DMA read to
complete, at action 1138 the event engine extracts the address and
size of the code routine. Next, at action 1140 a DMA read of the
code routine is initiated and the event engine waits for its
completion during state 1142. Thereafter, the event engine enters
the exit state 1132 and execution of the loaded code routine is
enabled.
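The dispatch sequence of FIG. 15 may likewise be summarized in a
short C-language sketch. The structure and function names below are
hypothetical; the DMA requests and their associated wait states are
folded into the helper functions for brevity, and the sketch assumes
the buffer ID is returned directly by the consume step.

    /* Illustrative sketch of the event dispatch sequence of FIG. 15. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint32_t target_context_id; } buffer_control_block;
    typedef struct {
        bool     quarantined;     /* set when uncorrected faults occurred */
        uint32_t routine_handle;  /* code routine for the current state   */
    } context;

    bool consume_event(uint32_t *buffer_id);                        /* 1100-1104 */
    void dma_read_bcb(uint32_t buffer_id, buffer_control_block *b); /* 1106-1110 */
    void dma_read_context(uint32_t context_id, context *ctx);       /* 1112-1116 */
    void dma_read_context_state(const context *ctx);                /* 1120-1126 */
    void release_buffer(uint32_t buffer_id);                        /* action 1118 */
    void set_program_counter(uint32_t routine_handle);              /* 1128-1130 */
    bool routine_in_code_cache(uint32_t routine_handle);
    void dma_load_routine_into_cache(uint32_t routine_handle);      /* 1134-1142 */

    /* Returns false when the input queue is empty ("queue empty" exit). */
    bool dispatch_next_event(void)
    {
        uint32_t buffer_id;
        buffer_control_block bcb;
        context ctx;

        for (;;) {
            if (!consume_event(&buffer_id))        /* state 1100 -> 1102 */
                return false;
            dma_read_bcb(buffer_id, &bcb);         /* actions 1104-1110 */
            dma_read_context(bcb.target_context_id, &ctx);
            if (!ctx.quarantined)                  /* fault check, 1116 */
                break;
            release_buffer(buffer_id);             /* action 1118, retry */
        }
        dma_read_context_state(&ctx);              /* actions 1120-1126 */
        set_program_counter(ctx.routine_handle);   /* action 1130 */
        if (!routine_in_code_cache(ctx.routine_handle))
            dma_load_routine_into_cache(ctx.routine_handle); /* 1134-1142 */
        return true;                               /* "ok" exit, state 1132 */
    }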
[0175] Referring to FIG. 16, upon successful completion the event
engine determines at state 1200 whether all events that it must post
are currently postable. If not, and none of the target queues have
their matching execution units in the "Other Full" state, then at
state 1202 the event engine sets its own "Other Full" status flag
and at state 1204 waits until it is time to recheck the postability
of the dispatch results. Thereafter, at state 1206 the event engine
clears its "Other Full" flag and returns to state 1200. If at state
1200 waiting could create a deadlock (a target's "Other Full" flag
is already set), then the event engine must raise the deadlock
exception and complete handling of the dispatch in the "Completing
with Exception" state described below with reference to FIG. 17.
[0176] Otherwise the event engine creates any required log entries
at state 1210 and DMA writes them to external memory, waiting at
state 1212. At state 1214 the event engine posts the events to the
other queues, and at state 1216 the event engine DMA writes its
context data and state back to external memory. The event engine
waits for the DMA write to complete at state 1218. When the DMA
write is complete, at state 1220 the event engine decrements the
context's dispatch count, and then at state 1222 releases the buffer
associated with the event.
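For illustration, the following hypothetical C sketch captures the
postability check, the deadlock-avoidance test and the write-back
steps of FIG. 16; all names are placeholders, and the flag and wait
primitives are assumed to exist elsewhere.

    /* Illustrative sketch of the "Completing Successfully" flow of FIG. 16. */
    #include <stdbool.h>

    bool all_dispatch_results_postable(void);     /* state 1200 */
    bool any_target_marked_other_full(void);      /* deadlock test at 1200 */
    void set_other_full_flag(void);               /* state 1202 */
    void wait_before_rechecking(void);            /* state 1204 */
    void clear_other_full_flag(void);             /* state 1206 */
    void write_log_entries(void);                 /* states 1210-1212 */
    void post_events_to_target_queues(void);      /* state 1214 */
    void write_back_context(void);                /* states 1216-1218 */
    void decrement_context_dispatch_count(void);  /* state 1220 */
    void release_event_buffer(void);              /* state 1222 */
    void complete_with_exception(void);           /* FIG. 17 path */

    void complete_successfully(void)
    {
        while (!all_dispatch_results_postable()) {   /* state 1200 */
            if (any_target_marked_other_full()) {    /* waiting would deadlock */
                complete_with_exception();
                return;
            }
            set_other_full_flag();                   /* state 1202 */
            wait_before_rechecking();                /* state 1204 */
            clear_other_full_flag();                 /* state 1206 */
        }
        write_log_entries();                         /* states 1210-1212 */
        post_events_to_target_queues();              /* state 1214 */
        write_back_context();                        /* states 1216-1218 */
        decrement_context_dispatch_count();          /* state 1220 */
        release_event_buffer();                      /* state 1222 */
    }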
[0177] Referring to FIG. 17, when an exception has been raised, all
pending results of the current dispatch must be discarded. This
requires, at state 1300, releasing the buffers associated with the
events that would have been posted upon a successful completion.
When possible, at state 1302 the event engine allocates an
exception report buffer. There is a pre-designated report buffer
for each type of event. At state 1304 that report buffer is filled
and placed at the head of the event engine's input queue. At state
1306, the fault is reported to the object itself, which is required
to validate that its internal data is not corrupt. If it cannot do
so, or faults while attempting to do so, the context will be marked
as quarantined and no further events will be dispatched to it. The
exception report buffer will typically have a reference to the
original buffer, which will require that it be attached. After
handling the exception report buffer, at state 1308 the engine will
release its claim on the event buffer that caused the exception,
and at state 1310 a signal is provided that the event was processed
(albeit unsuccessfully).
[0178] If the report buffer is full, then the event engine passes
from state 1306 to state 1312, wherein it increments a counter that
tallies lost exceptions. Next, at state 1314, the event engine
releases its claim upon the exception report buffer. Lastly, the
event's context dispatch count must be decremented, just as it
would have been with a successful completion.
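A simplified, hypothetical C sketch of the exception path follows.
The branching has been condensed (for example, the release of the
faulting event buffer and the processed signal are shown
unconditionally), so it should be read as illustrative rather than
as an exact transcription of FIG. 17; all names are placeholders.

    /* Illustrative sketch of the "Completing with Exception" flow of FIG. 17. */
    #include <stdbool.h>

    void release_pending_result_buffers(void);    /* state 1300 */
    bool exception_report_posted(void);           /* states 1302-1304 */
    bool object_validates_internal_data(void);    /* state 1306 */
    void quarantine_context(void);                /* no further dispatches */
    void increment_lost_exception_counter(void);  /* state 1312 */
    void release_exception_report_buffer(void);   /* state 1314 */
    void release_faulting_event_buffer(void);     /* state 1308 */
    void signal_event_processed(void);            /* state 1310 */
    void decrement_context_dispatch_count(void);  /* as on success */

    void complete_with_exception(void)
    {
        release_pending_result_buffers();          /* state 1300 */
        if (exception_report_posted()) {           /* states 1302-1304 */
            if (!object_validates_internal_data()) /* state 1306 */
                quarantine_context();
        } else {                                   /* report buffer full */
            increment_lost_exception_counter();    /* state 1312 */
            release_exception_report_buffer();     /* state 1314 */
        }
        release_faulting_event_buffer();           /* state 1308 */
        signal_event_processed();                  /* state 1310 */
        decrement_context_dispatch_count();
    }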
[0179] FIG. 18 illustrates the life cycle of a buffer control
block. It starts in the "Free" state 1400. After buffer control 72
removes the buffer control block from its memory pool's free list,
it is assigned to perform a specific capture type for a specific
context. This involves setting the initial use count to 1 at state
1402, initializing the logical length to zero at state 1404, and
setting the event target at state 1406. The Event Target encodes
the target context, the selected target queue, the generation and
the capture type (for generation zero events). Target context and
queue selection have been described previously herein. The
generation field is set to zero when the buffer is being allocated
for capture. When events are sent by CTDs to other CTDs the
generation field must be set to a value higher than that of the
event which is currently being processed. The four types of capture
are internal network serial mode, internal network RDMA mode,
external network serial mode and external network RDMA mode. A
given context may have at most one active capture in-progress for
each capture type.
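By way of illustration, a buffer control block and its allocation
for capture might be represented as in the following C sketch. The
field layout, widths and names are hypothetical; the disclosure does
not prescribe a particular encoding of the Event Target.

    /* Illustrative sketch of a buffer control block and its allocation
     * (states 1400-1412 of FIG. 18). Layout is hypothetical. */
    #include <stdint.h>

    typedef enum {                   /* the four capture types */
        CAPTURE_INTERNAL_SERIAL,
        CAPTURE_INTERNAL_RDMA,
        CAPTURE_EXTERNAL_SERIAL,
        CAPTURE_EXTERNAL_RDMA
    } capture_type;

    typedef struct {
        uint32_t target_context;     /* context the event is dispatched to */
        uint8_t  target_queue;       /* selected target queue */
        uint8_t  generation;         /* zero when allocated for capture */
        uint8_t  capture;            /* capture_type, generation-zero only */
    } event_target;

    typedef struct {
        uint32_t     use_count;      /* number of outstanding claims */
        uint32_t     logical_length; /* bytes committed and in order */
        event_target target;
    } buffer_control_block;

    /* Prepare a block taken from a pool's free list for a specific capture
     * on a specific context (states 1402-1406). */
    void allocate_for_capture(buffer_control_block *bcb,
                              uint32_t context_id,
                              uint8_t queue,
                              capture_type type)
    {
        bcb->use_count = 1;                        /* state 1402 */
        bcb->logical_length = 0;                   /* state 1404 */
        bcb->target.target_context = context_id;   /* state 1406 */
        bcb->target.target_queue = queue;
        bcb->target.generation = 0;                /* generation zero for capture */
        bcb->target.capture = (uint8_t)type;
        /* The buffer is now in the Allocated state 1412. */
    }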
[0180] For RDMA captures two additional fields (TotalCommitted and
CommitLimit) must be initialized to zero at states 1408 and 1410.
These fields are kept in the context data, rather than in the buffer
control block, to avoid wasting space on buffers that do not support
RDMA capture. The buffer is now in the Allocated state 1412. From
the Allocated state 1412 the buffer may be further attached, which
causes the use count to be incremented at state 1414; the buffer
then returns to the Allocated state 1412.
[0181] Also from the Allocated state 1412, the buffer may be
committed at state 1416 or 1418. Committing a portion of a capture
buffer indicates two things. First, it indicates that the packet
collector has received the portion being committed and that the
portion has been validated by a capture-specific method, most
typically a CRC32. Second, by committing a portion of the capture
buffer, the packet collector yields its permission to further
modify those bytes. Committed bytes are ready for processing by an
event engine.
[0182] For serial captures the LogicalLength is simply incremented
by the number of newly committed bytes at state 1416. For RDMA
(out-of-order) captures the CommitLimit, TotalCommitted and
LogicalLength fields are updated as follows: (1) if the base of the
newly committed data (Commit.base) matches the current CommitLimit,
then CommitLimit is incremented by the newly committed size
(Commit.size) at state 1418; (2) TotalCommitted is incremented by
Commit.size at state 1420; (3) if the LogicalLength (the portion
fully committed) is equal to Commit.base, then the LogicalLength is
incremented by Commit.size at state 1422; and (4) if all bytes below
the CommitLimit have been committed (TotalCommitted is equal to
CommitLimit), then the LogicalLength is set to the TotalCommitted at
state 1424. The rationale for out-of-order commits was discussed
more fully with reference to FIG. 11.
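The byte-counter maintenance just described reduces to a few lines
of arithmetic. The following C sketch restates states 1416-1424
using hypothetical field and parameter names; it assumes each commit
is reported as a (base, size) pair relative to the start of the
capture buffer.

    /* Illustrative sketch of the commit bookkeeping for states 1416-1424. */
    #include <stdint.h>

    typedef struct {
        uint32_t logical_length;   /* bytes fully committed, in order */
        uint32_t commit_limit;     /* RDMA captures only (context data) */
        uint32_t total_committed;  /* RDMA captures only (context data) */
    } capture_counters;

    /* Serial capture: state 1416. */
    void commit_serial(capture_counters *c, uint32_t size)
    {
        c->logical_length += size;
    }

    /* RDMA (out-of-order) capture: states 1418-1424. */
    void commit_rdma(capture_counters *c, uint32_t base, uint32_t size)
    {
        if (base == c->commit_limit)               /* extends the commit frontier */
            c->commit_limit += size;               /* state 1418 */
        c->total_committed += size;                /* state 1420 */
        if (c->logical_length == base)             /* extends the in-order prefix */
            c->logical_length += size;             /* state 1422 */
        if (c->total_committed == c->commit_limit) /* no holes below the limit */
            c->logical_length = c->total_committed;  /* state 1424 */
    }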
[0183] After the byte counter maintenance steps are completed, the
number of users is incremented at state 1426 and the event is
posted to the target queue at state 1428. If the target queue is
full, then at state 1430 the execution unit's status is marked as
"Other Full" before the execution unit suspends itself at state
1432. The "Other Full" status prevents other execution units from
suspending while trying to post output requests to this execution
unit's queue. This avoids the well-known "deadly embrace" deadlock,
in which two execution units each wait for the other to empty its
input queue, but neither can because each is waiting for the other
to act first. The execution unit will be activated from this
"waiting to post" state 1432 after any execution unit has completed
processing an event. The flag is cleared at state 1434 and the post
is re-attempted at state 1428.
[0184] As each claim on the buffer is released, at state 1440 the
number of users is decremented. When the last claim is released, at
state 1442 the buffer is restored to the tail of its home memory
pool and returns to the "Free" state 1400.
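The attach/release reference counting of states 1414 and 1440-1442
might look as follows in C. The free-list linkage shown is a
hypothetical illustration and is not dictated by the disclosure.

    /* Illustrative sketch of attach/release counting and return to the
     * home pool (states 1414, 1440-1442 of FIG. 18). */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct buffer_control_block {
        uint32_t use_count;
        struct buffer_control_block *next_free;  /* link for the free list */
    } buffer_control_block;

    typedef struct {
        buffer_control_block *free_head;
        buffer_control_block *free_tail;
    } memory_pool;

    void attach(buffer_control_block *bcb)        /* state 1414 */
    {
        bcb->use_count++;                          /* buffer stays Allocated */
    }

    void release(buffer_control_block *bcb, memory_pool *home)
    {
        if (--bcb->use_count > 0)                  /* state 1440 */
            return;
        /* Last claim released: return to the tail of the home pool (1442). */
        bcb->next_free = NULL;
        if (home->free_tail)
            home->free_tail->next_free = bcb;
        else
            home->free_head = bcb;
        home->free_tail = bcb;                     /* buffer is Free again (1400) */
    }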
[0185] Turning now to FIG. 19, a set of steps is depicted that
summarizes an exemplary sequence of events/operations for
transmitting requested data from a data server to an external
client node. It is noted that, in general, the disclosed
architecture of the CTE 50 enables applications, through the use of
status/control/notification messages, to control the transfer of
data from a data storage node to a requesting client for Internet,
NAS, SAN and similar protocols without incurring the overhead
associated with placing transferred data payloads in the memory
space of application software.
[0186] In an embodiment of the present invention applications are
developed as a set of object classes. The resulting objects are
invoked either as CTDs 59 on the CTE 50 itself or as XCTDs 86 on a
companion CTEX 88 running on the supplemental processor node 22a.
The CTE 50 captures incoming packets and dispatches corresponding
events to a context object to which a particular client request is
assigned. By way of example, the context objects access shared
tables, generate output requests, generate messages to other
objects and modify their own data. The CTEX 88 performs the same
function for extended content transfer daemons (XCTDs) on the
supplemental processor node 22a.
[0187] With reference now to FIG. 19, an exemplary connection and
corresponding client request includes the following
operations/steps. Initially, at step 1500 an incoming request from
the external network is identified as a new connection. In
response, at step 1502 a new context (context object) is assigned
and an entry associating the new connection with the new context is
created in the external network interface's packet-identifying hash
tables.
[0188] Thereafter, at step 1504 the context object completes
initial validating negotiations and establishment of the connection
with a requesting client. An example of such negotiations is
completing the TCP three-way handshake. Due to the wide variety
and/or potential complexity of this step, finishing establishment
of the connection, and particularly validation of the user and any
supplied credentials (such as a user password), will typically be
passed by an event engine within the CTE 50 to an XCTD running on
the CTEX 88.
[0189] After the CTEX 88 validates the session, at step 1506 the
CTEX 88 reassigns the external network packet-identifying hash
table entry for the connection to a new context object that
handles the in-session protocol for responding to the client
request(s). The original context object is returned to the
available pool of context objects for processing/validating unknown
connections.
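By way of illustration, the packet-identifying hash table entry that
binds a connection to its context object might be handled as in the
following hypothetical C sketch. The flow tuple, table size and hash
function are placeholders, and collision handling is omitted for
brevity; none of these choices is dictated by this disclosure.

    /* Illustrative sketch of binding a connection to a context (step 1502)
     * and reassigning it after validation (step 1506). */
    #include <stdint.h>
    #include <stddef.h>

    #define TABLE_SIZE 4096

    typedef struct {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint32_t context_id;      /* context object assigned to this flow */
        int      in_use;
    } flow_entry;

    static flow_entry flow_table[TABLE_SIZE];

    static size_t flow_hash(uint32_t sip, uint32_t dip, uint16_t sp, uint16_t dp)
    {
        return ((sip ^ dip) * 2654435761u ^ ((uint32_t)sp << 16 | dp)) % TABLE_SIZE;
    }

    /* Step 1502: a new connection is bound to a freshly assigned context. */
    void bind_new_connection(uint32_t sip, uint32_t dip, uint16_t sp, uint16_t dp,
                             uint32_t validation_context_id)
    {
        flow_entry *e = &flow_table[flow_hash(sip, dip, sp, dp)];
        e->src_ip = sip;  e->dst_ip = dip;
        e->src_port = sp; e->dst_port = dp;
        e->context_id = validation_context_id;
        e->in_use = 1;
    }

    /* Step 1506: after validation the entry is pointed at the in-session
     * context object; the validating context returns to its pool. */
    void reassign_to_session_context(uint32_t sip, uint32_t dip,
                                     uint16_t sp, uint16_t dp,
                                     uint32_t session_context_id)
    {
        flow_entry *e = &flow_table[flow_hash(sip, dip, sp, dp)];
        e->context_id = session_context_id;
    }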
[0190] Next, at step 1508 the new context object executing on the
CTE 50 and associated with an in-session protocol issues/passes
requests over the internal network to obtain content requested via
the validated client connection. This is performed, by way of
example, on a read-ahead basis that is primarily regulated by
available buffering rather than as a direct response to a sequence
of incoming requests. The requested data is transferred from one or
more of the storage nodes 14 via the internal network 16 directly
to a content transfer access node 10, thereby bypassing the
supplemental processor nodes 22. The content transfer engines
generally keep supervisory applications executing upon the
supplemental processors aware of the status of data transfers and
connections. However, the status knowledge is acquired through
status notifications passed to the supplemental processor rather
than direct observation of the transferred data content by the
supervisory applications executing on the supplemental
processor.
[0191] During step 1510 the in-session context object issues
protocol-specific output messages (requests/responses) to the
connected external client as material becomes available and subject
to any pacing specifications. Potentially applicable data transfer
protocols include both client-based pacing (in response to acks)
and time-driven pacing (aiming at a specific rate until nacked).
Generally there will be fewer internal network fetches
(transferring data directly from a data storage node to a content
transfer engine residing on the content transfer access node) than
external network transmits. For example, it would be common to
fetch a 48 KByte HTML file in a single read over the internal
network, but it would take at least 32 separate TCP segments to
deliver it. For extremely large deliveries, such as streaming
media, pacing drives transmission from the buffer space assigned in
the CTE 50 for the connection, while reads are issued whenever the
buffer drops below a configured threshold.
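The interplay of buffer-regulated read-ahead (step 1508) and paced
external transmission (step 1510) is illustrated by the hypothetical
C sketch below; the threshold values, sizes and names are
placeholders and are not part of the disclosed design.

    /* Illustrative sketch of buffer-regulated read-ahead and paced delivery. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t buffered_bytes;     /* content staged in CTE buffer space   */
        uint32_t refill_threshold;   /* issue an internal read below this    */
        uint32_t read_size;          /* e.g. one large internal network read */
        uint32_t segment_size;       /* e.g. one external TCP segment        */
        bool     client_window_open; /* client-based pacing (acks received)  */
    } connection_state;

    void issue_internal_read(uint32_t bytes);  /* storage node -> CTE buffer */
    void transmit_segment(uint32_t bytes);     /* CTE buffer -> external client */

    void service_connection(connection_state *c)
    {
        /* Read ahead whenever buffering, not the client, is the limit. */
        if (c->buffered_bytes < c->refill_threshold) {
            issue_internal_read(c->read_size);   /* few, large internal fetches */
            c->buffered_bytes += c->read_size;
        }
        /* Transmit in protocol-sized pieces, subject to pacing. */
        while (c->client_window_open && c->buffered_bytes >= c->segment_size) {
            transmit_segment(c->segment_size);   /* many, small external sends */
            c->buffered_bytes -= c->segment_size;
        }
    }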
[0192] It is noted that execution of steps 1508 and 1510 can
overlap. As illustrated in the examples discussed herein above,
this is especially true in cases where relatively large files are
transferred. In such instances, the transfer of data over an
external interface to a requesting client commences prior to
completing transfer of the file from a storage node to a buffer in
the CTE 50.
[0193] It is further noted that during both steps 1508 and 1510 the
new in-session context object on the CTE 50 reports aggregate
progress and exception conditions (if present) to a corresponding
XCTD executing on the CTEX 88 associated with an application
executing upon the supplemental processor node 22a. The
application, executing upon the supplemental processor 22a and
potentially supervising client data requests and corresponding
responses, does not directly access the transferred data. Instead,
the application executing upon the supplemental processor 22a
observes, via notifications from the CTE 50, the progress of data
transfers. Based upon the notifications, the application issues
control instructions via the XCTD to the CTE 50 performing the
actual data transfers.
[0194] After completing a response to a file request or
alternatively when a session is completed, during step 1512 a hash
entry for the associated CTD in the CTE 50 is removed from the hash
table, the CTD notifies a corresponding XCTD of the completion, and
the CTD returns to the available pool of CTDs for its object
type.
[0195] Illustrative embodiments of the present invention and
certain variations thereof have been provided in the Figures and
accompanying written description. The present invention is not
intended to be limited to these embodiments. Rather the present
invention is intended to cover the disclosed embodiments as well as
others falling within the scope and spirit of the invention to the
fullest extent permitted in view of this disclosure and the
inventions defined by the claims appended herein below.
* * * * *