U.S. patent application number 13/928,339, for context switching with offload processors, was filed with the patent office on June 26, 2013 and published on July 17, 2014.
The applicant listed for this patent is Xockets IP, LLC. The invention is credited to Stephen Paul Belair and Parin Bhadrik Dalal.
United States Patent Application 20140201761
Kind Code: A1
Appl. No.: 13/928,339
Family ID: 51165034
Publication Date: July 17, 2014
Inventors: Dalal, Parin Bhadrik; et al.
Context Switching with Offload Processors
Abstract
A method for context switching of multiple offload processors is
disclosed. The method can include receiving network packets for
processing through a memory bus connected socket, organizing the
network packets into multiple sessions for processing, suspending
processing of at least one session by reading a cache state of at
least one of the offload processors into a context memory by
operation of a scheduling circuit, with virtual memory locations
and physical cache locations being aligned, and subsequently
directing transfer of the cache state to at least one of the
offload processors for processing by operation of the scheduling
circuit.
Inventors: Dalal, Parin Bhadrik (Milpitas, CA); Belair, Stephen Paul (Santa Cruz, CA)
Applicant: Xockets IP, LLC (Wilmington, DE, US)
Family ID: 51165034
Appl. No.: 13/928,339
Filed: June 26, 2013
Related U.S. Patent Documents

Application Number    Filing Date      Patent Number
61/753,892            Jan 17, 2013
61/753,895            Jan 17, 2013
61/753,899            Jan 17, 2013
61/753,901            Jan 17, 2013
61/753,903            Jan 17, 2013
61/753,904            Jan 17, 2013
61/753,906            Jan 17, 2013
61/753,907            Jan 17, 2013
61/753,910            Jan 17, 2013
Current U.S. Class: 718/108
Current CPC Class: G06F 12/0815 20130101; H04L 47/2441 20130101; G06F 12/1081 20130101;
H04L 47/56 20130101; G06F 9/461 20130101; H04L 61/6086 20130101; H04L 47/193 20130101;
G06F 13/285 20130101; H04L 47/624 20130101; G06F 12/0875 20130101; H04L 67/1097 20130101;
G06F 9/3877 20130101; H04L 61/2592 20130101; H04L 49/40 20130101; G06F 13/4022 20130101;
H04L 29/08549 20130101; G06F 15/17337 20130101; H04L 61/103 20130101; G06F 13/16 20130101;
G06F 13/1652 20130101; G06F 13/362 20130101; G06F 12/1027 20130101; H04L 49/90 20130101;
G06F 15/161 20130101; H04L 29/08135 20130101; H04L 47/6295 20130101; G06F 13/4068 20130101;
Y02D 10/00 20180101; G06F 2212/1024 20130101; H04L 67/10 20130101; G06F 9/4843 20130101
Class at Publication: 718/108
International Class: G06F 9/46 20060101 G06F009/46
Claims
1. A method for context switching of multiple offload processors,
comprising the steps of: receiving network packets for processing
through a memory bus connected socket, organizing the network
packets into multiple sessions for processing, suspending
processing of at least one session by reading a cache state of at
least one of the offload processors into a context memory by
operation of a scheduling circuit, with virtual memory locations
and physical cache locations being aligned, and subsequently
directing transfer of the cache state to at least one of the
offload processors for processing by operation of the scheduling
circuit.
2. The method of claim 1 wherein the bulk read is through an
accelerator coherency port.
3. The method of claim 1 wherein the associated cache state
includes at least one of: a state of offload processor registers,
instructions for execution by the offload processor, a stack
pointer, program counter, prefetched instructions for execution by
the offload processor, prefetched data for use by the offload
processor, and data written into the cache of the offload
processor.
4. The method of claim 1, further including: the cache state
includes session context; and setting the session context to be
physically contiguous in the cache of the offload processor.
5. The method of claim 4, wherein the setting of the session
context includes cooperation of an operating system (OS) running on
the offload processor and the scheduling circuit.
6. The method of claim 1, further including upon initialization of
a processing session, communicating session data to the scheduling
circuit.
7. The method of claim 6, wherein the session data includes any
selected from: a session color, a session size and starting
physical cache address for the processing session.
8. The method of claim 1, further including determining starting
address for each of a plurality of processing sessions, the number
of sessions allowable in a cache of an offload processor, and the
number of locations wherein a session can be found for a given
session color.
9. The method of claim 1 further including transferring the cache
state of one of the offload processors to the cache of another of
the offload processors.
10. The method of claim 1, further including prioritizing a
processing of network packets in a first queue received over the
memory bus by stopping a first session associated with one of the
offload processors, storing the cache state of the offload
processor, and initiating processing of network packets held in a
second queue.
11. The method of claim 1, wherein the suspending processing of at
least one session includes operating in a preemption mode to
control a session execution.
12. The method of claim 1, wherein receiving network packets
includes receiving network packets through a dual in line memory
module (DIMM) compatible socket.
13. The method of claim 1, wherein reading the cache state into the
context memory includes reading the cache state data over the
memory bus.
Description
PRIORITY CLAIMS
[0001] This application claims the benefit of U.S. Provisional
Patent Applications 61/753,892 filed on Jan. 17, 2013, 61/753,895
filed on Jan. 17, 2013, 61/753,899 filed on Jan. 17, 2013,
61/753,901 filed on Jan. 17, 2013, 61/753,903 filed on Jan. 17,
2013, 61/753,904 filed on Jan. 17, 2013, 61/753,906 filed on Jan.
17, 2013, 61/753,907 filed on Jan. 17, 2013, and 61/753,910 filed
on Jan. 17, 2013, the contents all of which are incorporated by
reference herein.
TECHNICAL FIELD
[0002] Described embodiments relate to deterministic context
switching for computer systems that include a memory bus connected
module with offload processors.
BACKGROUND
[0003] Context switching (sometimes referred to as a process switch
or a task switch) is the switching of a processor from execution of
one process or thread to another. During a context switch the state
(context) of a process is stored in memory so that execution can be
resumed from the same point at a later time. This enables multiple
processes to share a single processor and support a multitasking
operating system. Commonly, a process is an executing or running
instance of a program that can run in parallel and share an address
space (i.e., a range of memory locations) and other resources with
their parent processes. A context generally includes the contents
of a processor's registers and program counter at a specified time.
An operating system can suspend the execution of a first process
and store the context for that process in memory, while
subsequently retrieving the context of a second process from memory
and restoring it in the processor's registers. After terminating or
suspending the second process, the context of the first process can
be reloaded, resuming execution of the first process.
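As an informal illustration of the state involved (not drawn from this application), the following C sketch models a saved context as a plain structure; the names and fields are hypothetical, and a real kernel would capture the registers with architecture-specific assembly rather than a memcpy.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical register context for one process or thread. */
    struct thread_context {
        uint64_t gpr[31];   /* general purpose registers */
        uint64_t sp;        /* stack pointer             */
        uint64_t pc;        /* program counter           */
        uint64_t status;    /* processor status word     */
    };

    /* Suspend: store the live context into memory for later resumption. */
    static void save_context(struct thread_context *saved,
                             const struct thread_context *live)
    {
        memcpy(saved, live, sizeof(*saved));
    }

    /* Resume: reload a previously stored context into the processor state. */
    static void restore_context(struct thread_context *live,
                                const struct thread_context *saved)
    {
        memcpy(live, saved, sizeof(*live));
    }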
[0004] However, context switching is computationally intensive. A
context switch can require considerable processor time, which can
be on the order of nanoseconds for each of the tens or hundreds of
context switches per second. Since modern processors can handle
hundreds or thousands separate processes, the time devoted to
context switching can represents a substantial cost to the system
in terms of processor time. Improved context switching methods and
systems can greatly improve overall system performance and reduce
hardware and power requirements for server or other data processing
systems.
SUMMARY
[0005] This disclosure describes embodiments of systems, hardware
and methods suitable for context switching of processors in a
system. Embodiments can include multiple offload processors, each
connected to a memory bus, with the respective offload processors
each having an associated cache with an associated cache state. A
low latency memory can be connected to multiple offload processors
through the memory bus, and a scheduling circuit can be used for
directing storage of a cache state from at least one of the
respective offload processors into the low latency memory, and for
later directing transfer over the memory bus of the cache state to
at least one of the respective offload processors. In certain
embodiments, including those associated with the use of ARM
architecture processors, the multiple offload processors can have
an accelerator coherency port for accessing cache state with
improved speed. In other embodiments, a common module can support
the offload processors, low latency memory, and scheduling circuit,
with access for external network packets provided through a memory
socket mediated connection, including but not limited to dual in
line memory module (DIMM) sockets.
[0006] In some embodiments the associated cache state includes at
least one of: a state of the processor registers saved in register
save area, instructions in the pipeline being executed, stack
pointer and program counter, prefetched instructions and data
waiting to be executed by the session, and data written into the
cache. The system can further include an operating system (OS)
running on at least one of the multiple offload processors. The OS and
scheduling circuit can cooperate to establish session contexts that
are physically contiguous in a cache. Session color, size, and
starting physical address can be communicated to the scheduler
circuit upon session initialization and a memory allocator used to
determine a starting address of each session, the number of
sessions allowable in the cache, and the number of locations
wherein a session can be found for a given color.
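A minimal sketch of the bookkeeping just described, assuming an illustrative cache geometry and session size (none of these values are specified by the disclosure); the structure and helper names are hypothetical.

    #include <stdint.h>

    /* Illustrative geometry only: 512 KB L2 cache, 4 KB pages, 8 ways,
     * 64 KB session footprint. */
    #define CACHE_SIZE_BYTES   (512u * 1024u)
    #define PAGE_SIZE_BYTES    4096u
    #define CACHE_WAYS         8u
    #define SESSION_SIZE_BYTES (64u * 1024u)

    /* Data a session might report to the scheduling circuit when it is
     * initialized. */
    struct session_desc {
        uint32_t session_id;
        uint32_t color;        /* cache color of the session's first page */
        uint32_t size;         /* session footprint in bytes              */
        uint64_t start_paddr;  /* starting physical (cache) address       */
    };

    /* Number of whole sessions the allocator can keep resident in the
     * cache at one time. */
    static inline uint32_t sessions_per_cache(void)
    {
        return CACHE_SIZE_BYTES / SESSION_SIZE_BYTES;   /* 8 here */
    }

    /* Color of a physical page: which group of cache sets its lines index
     * to. An allocator can use this to place sessions so that their pages
     * do not collide with one another. */
    static inline uint32_t page_color(uint64_t paddr)
    {
        uint32_t colors = CACHE_SIZE_BYTES / (PAGE_SIZE_BYTES * CACHE_WAYS);
        return (uint32_t)((paddr / PAGE_SIZE_BYTES) % colors);
    }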
[0007] According to embodiments, a cache state stored by one of the
multiple offload processors can be transferred to another offload
processor. In particular applications, this can enable a scheduling
circuit to prioritize processing of network packets in a first
queue received over the memory bus by stopping a first session
associated with one of the offload processors, storing the
associated cache state, and initiating processing of network
packets held in a second queue.
[0008] Embodiments can also include a method for context switching
of multiple offload processors, each having an associated cache
with an associated cache state, and using a low latency memory
connected to multiple offload processors through the memory bus.
The method includes directing storage of a cache state via a bulk
read from at least one of the respective offload processors into
the low latency memory using a scheduling circuit, with any virtual
and physical memory locations being aligned. Subsequently, transfer
is directed over the memory bus of the cache state to at least one
of the respective offload processors for processing, with transfer
being controlled by the scheduling circuit. As with the structure
embodiments previously described, a common module can support the
offload processors, low latency memory, and scheduling circuit,
with access for external network packets provided through a DIMM or
other memory socket connection.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] FIG. 1-0 shows a system having context switching according
to an embodiment.
[0010] FIG. 1-1 is a diagram showing page collisions in a
physically indexed cache without coloring.
[0011] FIG. 1-2 shows a virtually indexed cache.
[0012] FIG. 1-3 shows a virtual/physical aligned cache according to
an embodiment.
[0013] FIGS. 2-0 to 2-3 show processor modules according to various
embodiments.
[0014] FIG. 2-4 shows a conventional dual-in-line memory
module.
[0015] FIG. 2-5 shows a system according to another embodiment.
[0016] FIG. 3 shows a system with a memory bus connected offload
processor module having context switching capabilities, according
to one embodiment.
[0017] FIG. 4 is a flow diagram showing context switching
operations according to one particular embodiment.
DETAILED DESCRIPTION
[0018] Various embodiments will now be described in detail with
reference to a number of drawings. The embodiments show modules,
systems and methods for switching contexts in offload processors
that are connected to a system memory bus. Such offload processors
can be in addition to any host processors connected to the system
memory bus, and can operate on data transferred over the system
memory bus independent of any host processors. In particular
embodiments, offload processors can have access to a low latency
memory, which can enable rapid storage and retrieval of context
data for rapid context switching. In very particular embodiments,
processing modules can populate physical slots for connecting
in-line memory modules (e.g., dual in line memory modules (DIMMs))
to a system memory bus.
[0019] FIG. 1-0 shows a system 100 according to one embodiment. A
system 100 can include one or more offload processors 118, a
scheduler 116, and a context memory 120. An offload processor 118
can include one or more processor cores that operate in conjunction
with a cache memory. In a context switch operation, the context of
a first processing task of offload processor 118 can be stored in
context memory 120, and the offload processor 118 can then
undertake a new processing task. Subsequently, the stored context
can be restored from the context memory 120 to the offload
processor 118, and the offload processor 118 can resume the first
processing task. In particular embodiments, the storing and
restoring of context data can include the transfer of data between
the cache of an offload processor 118 and the context memory
120.
[0020] A scheduler 116 can coordinate context switches of offload
processors 118 based on received processing requests. Accordingly,
a scheduler 116 can be informed of, or can have access to, the
state of offload processors 118 as well as the location of context
data for the offload processors 118. Context data locations can
include locations in processor cache, as well as locations in
context memory 120. A scheduler 116 can also follow, or be updated
with, a state of offload processors 118.
[0021] As understood from above, a context memory 120 can store
context data of offload processors 118 for subsequent retrieval. A
context memory 120 can be separate from cache memories of the
offload processors. In some embodiments, a context memory 120 can
be low latency memory as compared to other memory in the system, to
enable rapid context storage and retrieval. In some embodiments, a
context memory 120 can store data other than context data.
[0022] In the particular embodiment shown, offload processors 118,
scheduler 116, and context memory 120 can be part of a module 122
connected to a memory bus 124. Data and processing tasks for
execution by offload processors 118 can be received over memory bus
124. In some embodiments, transfers of context data between offload
processors 118 and context memory 120 can occur over memory bus
124. However, in other embodiments, such transfers can occur over a
different data path on the module 122.
[0023] Referring still to FIG. 1-0, in the very particular
embodiment shown, a system 100 can further include a second switch 114,
a memory controller 112, a host processor 110, an input/output
(I/O) fabric 108, and a first switch 106. A second switch 114 can
be included on module 122. The particular system 100 of FIG. 1-0 is
directed to network packet processing scheduling and traffic
management, but it is understood that other embodiments can include
context switching operations, as described herein or equivalents,
directed to other types of processing tasks.
[0024] In the particular embodiment of FIG. 1-0, a first switch 106
can receive and/or transmit data packets 104 from data source 102.
A data source 102 can be any suitable source of packet data,
including the Internet, a network cloud, inter- or intra-data
center networks, cluster computers, rack systems, multiple or
individual servers or personal computers, or the like. Data can be
packet or switch based, although in particular embodiments
non-packet data is generally converted or encapsulated into packets
for ease of handling. The data packets typically have certain
characteristics, including transport protocol number, source and
destination port numbers, or source and destination (Internet
Protocol) IP addresses. The data packets can further have
associated metadata that helps in packet classification and
management.
[0025] A switch 106 can be a virtual switch (an I/O device). A
switch 106 can include, but is not limited to, devices compatible
with peripheral component interconnect (PCI) and/or PCI express
(PCIe) devices connecting with host motherboard via PCI or PCIe bus
107. The switch 106 can include a network interface controller
(NIC), a host bus adapter, a converged network adapter, or a
switched or an asynchronous transfer mode (ATM) network interface.
In some embodiments, a switch 106 can employ IO virtualization
schemes such as a single root I/O virtualization (SR-IOV) interface
to make a single network I/O device appear as multiple devices.
SR-IOV permits separate access to resources among various PCIe
hardware functions by providing both physical control and virtual
functions. In certain embodiments, the switch 106 can support
OpenFlow or similar software defined networking to abstract out
the control plane. The control plane of the first virtual switch
performs functions such as route determination, target node
identification etc.
[0026] A switch 106 can be capable of examining network packets,
and using its control plane to create appropriate output ports for
network packets. Based on route calculation for the network packets
or data flows associated with the network packets, the forwarding
plane of the switch 106 can transfer the packets to an output
interface. An output interface of the switch may be connected with
an IO bus, and in certain embodiments the switch 106 may have the
capability to directly (or indirectly, via an I/O fabric 108)
transfer the network packets to a memory bus interconnect 109 for a
memory read or write operation (direct memory access operation).
Functionally, for certain applications the network packets can be
assigned for transport to specific memory locations based on
control plane functionality.
[0027] Switch 106, connected to an IO fabric 108 and memory bus
interconnect 109, can also be connected to host processor(s) 110.
Host processor(s) 110 can include one or more host processors which
can provide computational services including a provisioning agent
111. The provisioning agent 111 can be part of an operating system
or user code running on the host processor(s) 110. The provisioning
agent 111 typically initializes and interacts with virtual function
drivers provided by system 100. The virtual function driver can be
responsible for providing the virtual address of the memory space
where a direct memory access (DMA) operation is needed. Each device
driver can be allocated virtual addresses that map to the physical
addresses. A device model can be used to create an emulation of a
physical device for the host processor 110 to recognize each of the
multiple virtual functions (VF) that can be created. The device
model can be replicated multiple times to give the impression to VF
drivers (a driver that interacts with a virtual IO device) that
they are interacting with a physical device. For example, a certain
device model may be used to emulate a network adapter that the VF
driver can act to connect. The device model and the VF driver can
be run in either privileged or non-privileged mode. There can be no
restriction with regard to which device hosts/runs the code
corresponding to the device model and the VF driver. The code,
however, can have the capability to create multiple copies of
device model and VF driver so as to enable multiple copies of said
I/O interface to be created. In certain embodiments the operating
system can also create a defined physical address space for
applications supported by VF drivers. Further, the host operating
system can allocate a virtual memory address space to the
application or provisioning agent. The provisioning agent 111 can
broker with the host operating system to create a mapping between
virtual addresses and a subset of the available physical address
space. The provisioning agent 111 can be responsible for creating
each VF driver and allocating it a defined virtual address
space.
[0028] A second virtual switch 114 can also be connected to the
memory controller 112 using memory bus 109. The second virtual
switch 114 receives and switches traffic originating from the
memory bus 109 both to and from offload processors 118. Traffic may
include, but is not limited to, data flows to virtual devices
created and assigned by the provisioning agent 111, with processing
supported by offload processors 118. The forwarding plane of the
second virtual switch transports packets from a memory bus 109 to
offload processors 118 or from the offload processors 118 back onto
the memory bus 109. For certain applications, the described system
architecture allows relatively direct communication of network
packets to the offload processors 118 with minimal or no
interruptions to a host processor 110. The second virtual switch
114 can be capable of receiving packets and classifying them prior
to distribution to different hardware schedulers based on a defined
arbitration and scheduling scheme. The hardware scheduler 116
receives packets that can be assigned to flow sessions that are
scheduled for processing in one or more separate sessions.
[0029] A scheduler 116 can control processing tasks for execution
by offload processors 118, including the switching of contexts. In
some embodiments, metadata included within data received over
memory bus 124 (or metadata derived from such data) can be used to
by scheduler 116 to schedule/switch tasks of the offload
processors 118. However, command based control of a scheduler via
memory bus received commands or flags is also contemplated.
[0030] In the particular embodiment of FIG. 1-0, a scheduler 116
can be employed to implement traffic management of incoming
packets. Packets from a certain source, relating to a certain
traffic class, pertaining to a specific application or flowing to a
certain socket are referred to as part of a session flow and are
classified using session metadata. Session metadata often serve as
the criterion by which packets are prioritized and as such,
incoming packets can be reordered based on their session metadata.
This reordering of packets can occur in one or more buffers and can
modify the traffic shape of these flows. Packets of a session that
are reordered based on session metadata can be sent over to
specific traffic managed queues that are arbitrated out to output
ports using an arbitration circuit (not shown). The arbitration
circuit can feed these packet flows to a downstream packet
processing/terminating resource directly. Certain embodiments
provide for integration of thread and queue management so as to
enhance the throughput of downstream resources handling termination
of network data through the above-described threads.
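The classification step can be pictured with a short C sketch, assuming a conventional 5-tuple as the session metadata; the field and function names are hypothetical, and real embodiments may classify on other metadata entirely.

    #include <stdint.h>

    #define NUM_QUEUES 64u   /* illustrative number of traffic managed queues */

    /* Session metadata assumed here: a conventional 5-tuple. */
    struct flow_key {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  protocol;
    };

    /* Map a flow to one queue so that all packets of a session stay
     * together and can be reordered/shaped as a unit before arbitration. */
    static uint32_t queue_for_flow(const struct flow_key *k)
    {
        uint32_t h = k->src_ip;
        h = h * 31u + k->dst_ip;
        h = h * 31u + k->src_port;
        h = h * 31u + k->dst_port;
        h = h * 31u + k->protocol;
        return h % NUM_QUEUES;
    }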
[0031] Referring still to FIG. 1-0, data arriving at the scheduler
116 may also be packet data waiting to be terminated at the offload
processors 118 or it could be packet data waiting to be processed,
modified or switched out. The scheduler 116 can be responsible for
segregating incoming packets into corresponding application
sessions based on examination of packet data. The scheduler 116 can
have circuits for packet inspection and identifying relevant packet
characteristics. In some embodiments, a scheduler 116 can offload
part of the network stack to free offload processors 118 from
overhead incurred from network stack processing. In particular
embodiments, a scheduler 116 can carry out any of: TCP/transport
offload, encryption/decryption offload, segmentation and reassembly
thus allowing offload processors to operate on payloads of network
packets directly.
[0032] A scheduler 116 can further have the capability to transfer
the packets belonging to a session into a particular traffic
management queue. A scheduler 116 can control the scheduling of
each of multiple such sessions into a general purpose OS. The
stickiness of sessions across a pipeline of stages, including a
general purpose OS, can be supported by scheduler 116 optimizing
operations carried out at each of the stages. Particular
embodiments of such operations are described in more detail
below.
[0033] While a scheduler 116 can have any suitable form, a scheduling
circuit which can be used in whole, or in part, as a scheduler is shown
in U.S. Pat. No. 7,760,715, issued to Dalal on Jul. 20, 2010
(hereinafter the '715 patent), which is incorporated herein by
reference. The '715 patent discloses a scheduling circuit that
takes account of downstream execution resources. The session flows
in each of these queues are sent out through an output port to a
downstream network element.
[0034] A scheduler can employ an arbitration circuit to mediate
access of multiple traffic management output queues to available
output ports. Each of the output ports may be connected to one of
the offload processor cores through a packet buffer. The packet
buffer may further include a header pool and a packet body pool.
The header pool may only contain the header of packets to be
processed by offload processors 118. Sometimes, if the size of the
packet to be processed is sufficiently small, the header pool may
contain the entire packet. Packets can be transferred over to the
header pool or packet body pool depending on the nature of
operation carried out at the offload processor 118. For packet
processing, overlay, analytics, filtering and such other
applications it might be appropriate to transfer only the packet
header to the offload processors 118. In this case, depending on
the handling of the packet header, the packet body might either be
sewn together with a packet header and transferred over an egress
interface or dropped. For applications requiring the termination of
packets, the entire body of the packet might be transferred.
Offload processors can thus receive the packets and execute
suitable application sessions on them.
[0035] A scheduler 116 can schedule different sessions on the
offload processors 118, acting to coordinate such sessions to
reduce the overhead during context switches. A scheduler 116 can
arbitrate, not just between outgoing queues or session flows at
line rate speeds, but also between terminated sessions at very high
speeds. A scheduler 116 can manage the queuing of sessions on
offload processors 118 and be responsible for invoking new
application sessions on the OS. A scheduler 116 can indicate to the
OS that packets for a new session are available based on
traffic.
[0036] A scheduler 116 can also be informed of the state of the
execution resources of offload processors 118, the current session
that is run on the execution resource and the memory space
allocated to it, as well as the location of the session context in
the offload processor cache. A scheduler 116 can thus use the state
of the execution resource to carry out traffic management and
arbitration decisions.
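As a rough sketch only, the per-processor state a scheduling circuit might track for such arbitration decisions could look like the following; all names are hypothetical.

    #include <stdint.h>
    #include <stdbool.h>

    enum core_state { CORE_IDLE, CORE_RUNNING, CORE_SWITCHING };

    /* Bookkeeping for one offload processor core. */
    struct core_status {
        enum core_state state;        /* current execution state            */
        uint32_t current_session;     /* session running on this core       */
        uint64_t session_mem_base;    /* memory space allocated to it       */
        uint64_t session_mem_size;
        uint64_t ctx_cache_addr;      /* session context location in cache  */
        uint64_t ctx_saved_addr;      /* saved copy in context memory       */
        bool     ctx_in_cache;        /* context currently cache resident   */
    };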
[0037] In the embodiment shown, a scheduler 116 can provide for an
integration of thread management on the operating system with
traffic management of incoming packets. It can induce a persistence
of session flows across a spectrum of components including traffic
management queues and processing entities on the offload processors
118. An OS running on an offload processor 118 can allocate
execution resources such as processor cycles and memory to a
particular queue it is currently handling. The OS can further
allocate a thread or a group of threads for that particular queue,
so that it can be handled distinctly by the general purpose
processing element as a separate entity. Having multiple sessions
running on a general purpose (GP) processing resource (e.g.,
offload processor resource), each handling data from a particular
session flow resident in a queue on the scheduler 116, can tightly
integrate the scheduler 116 and the GP processing resource. This
can bring an element of persistence within session information
across the traffic management and scheduler 116 and the GP
processing resource.
[0038] In some embodiments, an offload processor 118 OS can be
modified from a previous OS to reduce the penalty and overhead
associated with context switch between resources. This can be
further exploited by the hardware scheduler to carry out seamless
switching between queues, and consequently their execution as
different sessions by the execution resource.
[0039] According to particular embodiments, a scheduler 116 can
implement traffic management of incoming packets. Packets from a
certain source, relating to a certain traffic class, pertaining to
a specific application or flowing to a certain socket are referred
to as part of a session flow and are classified using session
metadata. Session metadata can serve as the criterion by which
packets are prioritized and as such, incoming packets are reordered
based on their session metadata. This reordering of packets can
occur in one or more buffers and can modify the traffic shape of
these flows. Packets of a session that are reordered based on
session metadata can be sent over to specific traffic managed
queues that are arbitrated out to output ports using an arbitration
circuit. An arbitration circuit can feed these packet flows to a
downstream packet processing and/or terminating resource directly
(e.g., offload processor). Certain embodiments provide for
integration of thread and queue management so as to enhance the
throughput of downstream resources handling termination of network
data through the above-described threads.
[0040] In addition to carrying out traffic management, arbitration
and scheduling of incoming network packets (and flows), a scheduler
116 can be responsible for enabling minimal overhead context
switching between terminated sessions on the OS of the offload processor
118. Switching between multiple sessions on the offload
processors 118 can make it possible to terminate multiple sessions
at very high speeds. In the embodiment shown, rapid context
switching can occur by operation of context memory 120. In
particular embodiments, a context memory 120 can provide for
efficient, low latency context services in a system 100.
[0041] In the particular embodiment shown, packets can be
transferred over to the scheduler 116 by operation of second switch
114. Scheduler 116 can be responsible both for switching between a
session and a new session on the offload processors 118, as well as
initiating the saving of context in context memory 120. The context
of a session can include, but is not limited to: the state of the
processor registers saved in register save area, the instructions
in the pipeline being executed, stack pointer and program counter,
prefetched instructions and data that are waiting to be executed by
the session, data written into the cache recently and any other
relevant information that can identify a session executing on the
offload processor 118. In particular embodiments, a session context
can be identified using the combination of session id, session
index in the cache, and starting physical address.
[0042] As will be described in more detail with respect to FIG.
1-3, a translation scheme can be used such that contiguous pages of
a session in virtual memory are physically contiguous in a cache of
an offload processor 118. This contiguous nature of the session in
the cache can allow for a bulk read out of the session context into
a `context snapshot` for storage in context memory 120, from where
it can be retrieved when the operating system (OS) switches
processor resources back to the session. The ability to seamlessly
fetch session context from a context memory 120 (which can be a low
latency memory, and thus be orders of magnitude faster than a main
memory of a system) can serve to effectively expand the size of an
L2 cache of offload processors 118.
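A minimal sketch of the "context snapshot" idea, assuming the session's cache footprint is physically contiguous as described above; the identifiers are hypothetical, and a real implementation would move the data over a coherency port or the memory bus rather than a plain memcpy.

    #include <stdint.h>
    #include <string.h>

    /* A session context identified by session id, index in the cache,
     * and starting physical address, as described above. */
    struct session_ctx_id {
        uint32_t session_id;
        uint32_t cache_index;   /* which session slot of the cache    */
        uint64_t start_paddr;   /* first physical address of the slot */
        uint32_t length;        /* contiguous footprint in bytes      */
    };

    /* Suspend a session: because its pages are contiguous in the cache,
     * the whole context can be read out in one bulk copy into low
     * latency context memory. */
    static void save_context_snapshot(void *context_mem,
                                      const void *cache_slot,
                                      const struct session_ctx_id *id)
    {
        memcpy(context_mem, cache_slot, id->length);
    }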
[0043] In some embodiments, an OS of system 100 can implement
optimizations in its input output memory management unit (IOMMU)
(not shown) to allow a translation lookaside buffer (TLB) (or
equivalent lookup structure) to distinctly identify contents of
each session. Such an arrangement can allow address translations to
be identified distinctly during a session switch out and
transferred to a page table cache that is external to the TLB. Use
of a page table cache can allow for an expansion in the size of the
TLB. Also given the fact that contiguous locations in the virtual
memory are at contiguous locations in physical memory and in a
physically indexed cache, the number of address translations
required for identifying a session are significantly reduced.
[0044] In the particular embodiment of FIG. 1-0, a system 100 can
be well suited to provide session and packet termination
services. In some embodiments, control of network stack processing
can be performed by a scheduler 116. Thus, a scheduler 116 can act
as a traffic management queue, arbitration circuit and a network
stack offload device. The scheduler 116 can be responsible for
handling entire session and flow management on behalf of offload
processors 118. In such an arrangement, offload processors 118 can
be fed with the packets pertaining to a session directly into a
buffer, from where they can extract packet data for use. Processing
of a network stack can be optimized to avoid switches to a kernel
mode to handle network generated interrupts (and execute an
interrupt service routine). In this way, a system 100 can be
optimized to carry out context switching of sessions seamlessly and
with as little overhead as possible.
[0045] Referring still to FIG. 1-0, as will be understood, multiple
types of conventional input/output busses such as PCI, Fibre
Channel can be used in the described system 100. The bus
architecture can also be based on relevant JEDEC standards, on DIMM
data transfer protocols, on HyperTransport, or any other high
speed, low latency interconnection system. Offload processors 118
may include DDR DRAM, RLDRAM, embedded DRAM, next generation
stacked memory such as Hybrid Memory Cube (HMC), flash, or other
suitable memory, separate logic or bus management chips,
programmable units such as field programmable gate arrays (FPGAs),
custom designed application specific integrated circuits (ASICs)
and an energy efficient, general purpose processor such as those
based on ARM, ARC, Tensilica, MIPS, Strong/ARM, or RISC
architectures. Host processors 110 can be a general purpose
processor, including those based on Intel or AMD x86 architecture,
Intel Itanium architecture, MIPS architecture, SPARC architecture
or the like.
[0046] As will also be understood, conventional systems executing
processing like that performed by the system of FIG. 1-0 can be
implemented on multiple threads running on multiple processing
cores. Such parallelization of tasks into multiple thread contexts
can provide for increased throughput. Processor architectures such
as MIPS may include deep instruction pipelines to improve the
number of instructions per cycle. Further, the ability to run a
multi-threaded programming environment results in enhanced usage of
existing processor resources. To further increase parallel
execution on the hardware, processor architectures may include
multiple processor cores. Multi-core architectures comprising the
same type of cores, referred to as homogeneous core architectures,
provide higher instruction throughput by parallelizing threads or
processes across multiple cores. However, in such homogeneous core
architectures, the shared resources, such as memory, are amortized
over a small number of processors. In still other embodiments,
multiple offload or host processors can reside on modules connected
to individual rack units or blades that in turn reside on racks or
individual servers. These can be further grouped into clusters and
datacenters, which can be spatially located in the same building,
in the same city, or even in different countries. Any grouping
level can be connected to each other, and/or connected to public or
private cloud internets.
[0047] In such conventional approaches, memory and I/O accesses can
incur a high amount of processor overhead. Further, as noted
herein, context switches in conventional general purpose processing
units can be computationally intensive. It is therefore desirable
to reduce context switch overhead in a networked computing resource
handling a plurality of networked applications in order to increase
processor throughput. Conventional server loads can require complex
transport, high memory bandwidth, extreme amounts of data bandwidth
(randomly accessed, parallelized, and highly available), but often
with light touch processing: HTML, video, packet-level services,
security, and analytics. Further, idle processors still consume
more than 50% of their peak power consumption.
[0048] In contrast, in an embodiment such as that shown in FIG.
1-0, or an equivalent, complex transport, data bandwidth intensive,
frequent random access oriented, "light touch" processing loads can
be handled behind a socket abstraction created on multiple offload
processor 118 cores. At the same time, "heavy touch", computing
intensive loads can be handled by a socket abstraction on a host
processor 110 core (e.g., x86 processor cores). Such software
sockets can allow for a natural partitioning of these loads between
light touch (e.g., ARM) and heavy touch (e.g., x86) processor
cores. By usage of new application level sockets, according to
embodiments, server loads can be broken up across the offload
processors 118 and the host processor(s) 110.
[0049] To better understand operations of the embodiments disclosed
herein, conventional cache schemes are described with reference to
FIGS. 1-1 and 1-2. Modern operating systems that implement virtual
memory are responsible for the allocation of both virtual and
physical memory for processes, resulting in virtual to physical
translations that occur when a process executes and accesses
virtually addressed memory. In the management of memory for a
process, there is typically no coordination between the allocation
of a virtual address range and the corresponding physical addresses
that will be mapped by the virtual addresses. This lack of
coordination can affect both processor cache overhead and
effectiveness when a process is executing.
[0050] In conventional systems, a processor allocates memory pages
that are contiguous in virtual memory for each process that is
executing. The processor also allocates pages in physical memory,
which are not necessarily contiguous. A translation scheme is
established between the two schemes of addressing to ensure that
the abstraction of virtual memory is correctly supported by
physical memory pages. A processor can employ cache blocks that are
resident close to the processor to meet the immediate data
processing needs. Conventional caches can be arranged in a
hierarchy. Level One (L1) caches are closest to the processor,
followed by L2, L3, and so on. L2 acts as a backup to L1, and so on.
For caches that are indexed by a part of the process's physical
addresses, the lack of correlation between the allocation of
virtual and physical memory for a range of addresses beyond the
size of a memory management unit (MMU) page results in haphazard
and inefficient effects in the processor caches. This increases
cache overhead and introduces delay during a context switch
operation.
[0051] In physically addressed caches, the cache entry for the next
page in the virtual memory may not correspond to the next
contiguous page in the cache--thus degrading the overall
performance that can be achieved. For example, in FIG. 1-1,
contiguous pages in virtual memory 130 (Pages 1 and 2 of Process 1)
collide in the cache as their physical addresses in physical memory
132 index to the same location of the physically indexed cache 134
(of the processor). That is, processor cache (i.e., 134) is
physically indexed, and the addresses of the pages in the physical
memory 132 index to the same page in the processor cache.
Furthermore, when the effects of multiple processes accessing a
shared cache are considered, there is typically a lack of
consideration of overall cache performance when the OS allocates
physical memory to processes. This lack of consideration results in
different processes thrashing in the cache across context switches
(e.g., Process 1 and Process 2 in FIG. 1-1), which can
unnecessarily displace each other's lines, which can result in an
indeterminate number of cache miss/fills upon resuming a process,
or an increased number of line writebacks across a context
switch.
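The collision can be seen with simple arithmetic; the following sketch assumes an illustrative 32 KB direct-mapped, physically indexed cache with 64-byte lines, values chosen only for the example.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE 64u
    #define NUM_SETS  (32u * 1024u / LINE_SIZE)   /* 512 sets */

    /* A physically indexed cache picks the set from physical address bits. */
    static uint32_t set_index(uint64_t paddr)
    {
        return (uint32_t)((paddr / LINE_SIZE) % NUM_SETS);
    }

    int main(void)
    {
        uint64_t page1 = 0x00108000;  /* one physical page of a process     */
        uint64_t page2 = 0x00110000;  /* another, 32 KB (one cache size) away */
        /* Both print set 0: the two pages index to the same cache lines
         * and evict each other, even if they are contiguous in virtual
         * memory. */
        printf("page1 -> set %u\n", set_index(page1));
        printf("page2 -> set %u\n", set_index(page2));
        return 0;
    }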
[0052] As described with reference to FIG. 1-2, in other
conventional arrangements, processor caches can alternatively be
indexed by a part of the process's virtual addresses. Virtually
indexed caches are accessed by using a section of the bits of the
virtual address of the processor. Pages that are contiguous in
virtual memory 130 will be contiguous in virtually indexed caches
136 as seen in FIG. 1-2. As long as processor caches are virtually
indexed, no attention needs to be paid to coordinating the
allocation of physical memory 132 with the allocation of virtual
addresses. As programs sweep through virtual address ranges, they
will enjoy the benefits of spatial locality in the processor cache.
Such set-associative caches can have several entries corresponding
to an index. A given page which maps onto the given cache index can
be anywhere in that particular set. Given that there are several
positions available for a cache entry, the problems that caused
thrashing in the cache across context switches (i.e., as shown in
FIG. 1-1) are alleviated to a certain extent with set-associative
caches, as the processor can afford to keep used entries in the
cache to the longest extent possible. For this, caches employ a
least recently used algorithm. This results in mitigation of some
of the problems associated with a virtual addressing scheme
followed by an operating system, but places constraints on the size
of the cache. Consequently, bigger, multi-way associative caches
can be required to ensure that recently used entries are not
invalidated/flushed out. The comparator circuitry for a multi-way
set associative cache can be complex to accommodate for parallel
comparison, which increases the circuit level complexity associated
with the cache.
[0053] A cache control scheme known as "page coloring" has been
used by some conventional operating systems to deal with the
problem of cache-misses due to a virtual addressing scheme. If the
processor cache was physically indexed, the operating system was
constrained to look for physical memory locations that would not
index to locations in the cache of the same color. Under such a
cache control scheme, an operating system would have to assess, for
every virtual address, those pages in the physical memory that are
allowable based on the index they hash to in the physically indexed
cache. Several physical addresses are disallowed as the indices
derived might be of the same color. So, for physically indexed
caches, every page in the virtual memory would be colored to
identify its corresponding cache location and determine if the next
page is allocated to a physical memory and thus a cache location of
the same color or not. This process would be repeated for every
page, which can be a cumbersome operation. While it improves cache
efficiency, page coloring increases the overhead on the memory
management and translation unit as colors of every page would have
to be identified to prevent recently used pages from being
overwritten. The level of complexity of the operating system
increases correspondingly, as it needs to maintain an indicator of
the color of the previous virtual memory page in the cache.
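A small sketch of the color computation and the allocation constraint it creates, assuming illustrative parameters (a 1 MB physically indexed cache, 4 KB pages, and 16 ways, hence 16 colors); the names are hypothetical.

    #include <stdint.h>
    #include <stddef.h>

    #define NUM_COLORS 16u   /* 1 MB cache / (4 KB page * 16 ways) */

    /* Color of a physical page frame number. */
    static uint32_t color_of(uint64_t pfn)
    {
        return (uint32_t)(pfn % NUM_COLORS);
    }

    /* Page coloring forces the allocator to keep one free list per color:
     * on a page fault it must pick a physical page whose color matches
     * the color of the faulting virtual page. */
    struct free_page { uint64_t pfn; struct free_page *next; };
    static struct free_page *free_lists[NUM_COLORS];

    static struct free_page *alloc_colored_page(uint32_t color)
    {
        struct free_page *p = free_lists[color % NUM_COLORS];
        if (p != NULL)
            free_lists[color % NUM_COLORS] = p->next;
        return p;   /* NULL if no free page of the requested color remains */
    }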
[0054] The problem with a virtually indexed cache is that despite
the fact that the cache access latencies are higher, there is the
pervasive problem of aliasing. In the case of aliasing, multiple
virtual addresses (with different indices) mapping to the same page
in the physical memory are at different locations in the cache (due
to the different indices). Page coloring allows the virtual pages
and physical pages to have the same color and therefore occupy the
same set in the cache. Page coloring makes aliases to share the
same superset bits and index to the same lines in the cache. This
removes the problem of aliasing. Page coloring also imposes
constraints on memory allocation. When a new physical page is
allocated on a page fault, a memory management algorithm must pick
a page with the same color as the virtual color from the free list.
Because systems allocate virtual space systematically, the pages of
different programs tend to have the same colors, and thus some
physical colors may be more frequent than others. Thus page
coloring may impact the page fault rate. Moreover, the predominance
of some physical colors may create mapping conflicts between
programs in a second-level cache accessed with physical addresses.
Thus, a processor is faced with a very big problem with the
conventional page coloring scheme just described. Each of the
virtual pages could be occupying different pages in the physical
memory such that they occupy different cache colors, but the
processor would need to store the address translation of each and
every page. Given that a process could be sufficiently large, and
each process would comprise several virtual pages, the page
coloring algorithm could become very complex. This would also
complicate matters at the TLB end, as the TLB would need to identify,
for each page of the process's virtual memory, the equivalent physical
address.
the processor would need to carry out page walks and fill the TLB
entries, and this would further add indeterminism and latency to
what is a routine context switch.
[0055] In this way, in commonly available conventional operating
systems, context switches result in collisions in the cache as well
as TLB misses when a process/thread is resumed. When the
process/thread resumes, there are an indeterminate number of
instruction and data cache misses as the thread's working set is
reloaded back into the cache (i.e., as the thread resumes in user
space and executes instructions, the instructions will typically
have to be loaded into the cache, along with the application data).
Upon switch-in (i.e., resumption of process/thread), the TLB
mappings may be completely or partially invalidated, with the base
of the new thread's page tables written to a register reserved for
that purpose. As the thread executes, the TLB misses will result in
page table walks (either by hardware or software) which result in
TLB fills. Each of these TLB misses has its own hardware costs,
including pipeline stall due to an exception (e.g., the overhead
created by memory accesses when performing a page table walk, along
with the associated cache misses/memory loads if the page tables
are not in the cache). These costs are dependent upon what took
place in the processor between successive runs of a process and are
therefore not fixed costs. Furthermore, these extra latencies add
to the cost of a context switch and detract from the effective
execution of a process. As will be appreciated, such foregoing
described cache control methods are non-deterministic with respect
to processing time, memory requirements, or other operating system
controlled resources, reducing overall efficiency of system
operation.
[0056] FIG. 1-3 shows a cache control system according to an
embodiment. In the cache control system, session contents can be
contiguous in a physically indexed cache 134'. The described
embodiment can use a translation scheme such that contiguous pages
of a session in virtual memory 130 are physically contiguous in the
physically indexed cache 134'. In contrast to the foregoing
described non-deterministic cache control schemes, at least the
duration of a context switch operation can be deterministic. In the
described embodiment, replacing the context of a previous process
by the context of a new process involves transferring the new
process context from an external low latency memory such as
provided by context memory 120 of FIG. 1-0. In the process of
context switching, access of a main memory of a system can be
avoided (where such accesses can be delay intensive). The process
context is prefetched from the context memory 120 (which can be a
low latency memory). If needed for another context switch, process
context can be saved once again to the context memory 120. In this
way, deterministic context switching is achieved, as a context
switch operation can be defined in terms of the number of cycles
and the operations needed to be carried out. Further, use of a low
latency memory to store context data can provide for rapid context
switching.
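A sketch of the switch-in side, under the same assumptions as the snapshot example above: the work is a bounded copy from low latency context memory rather than an open-ended series of main-memory misses. The identifiers are illustrative.

    #include <stdint.h>
    #include <string.h>

    /* Restore a previously saved session into its physically contiguous
     * cache slot. The cost is fixed by context_bytes, which is what makes
     * the duration of the context switch deterministic. */
    static void switch_in_session(void *cache_slot,
                                  const void *context_snapshot,
                                  uint32_t context_bytes)
    {
        memcpy(cache_slot, context_snapshot, context_bytes);
        /* The OS can now resume the session without page walks or main
         * memory refills for its working set. */
    }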
[0057] FIGS. 2-0 to 2-5 describe aspects of hardware embodiments of
a module that can include context switching as described herein. In
particular embodiments, such processing modules can include DIMM
mountable modules.
[0058] FIG. 2-0 is a block diagram of a processing module 200
according to one embodiment. A processing module 200 can include a
physical connector 202, a memory interface 204, arbiter logic 206,
offload processor(s) 208, local memory 210, and control logic 212.
A connector 202 can provide a physical connection to a system memory
bus. This is in contrast to a host processor which can access a
system memory bus via a memory controller, or the like. In very
particular embodiments, a connector 202 can be compatible with a
dual in-line memory module (DIMM) slot of a computing system.
Accordingly, a system including multiple DIMM slots can be
populated with one or more processing modules 200, or a mix of
processing modules and DIMM modules.
[0059] A memory interface 204 can detect data transfers on a system
memory bus, and in appropriate cases, enable write data to be
stored in the processing module 200 and/or read data to be read out
from the processing module 200. Such data transfers can include the
receipt of packet data having a particular network identifier. In
some embodiments, a memory interface 204 can be a slave interface,
thus data transfers are controlled by a master device separate from
the processing module 200. In very particular embodiments, a memory
interface 204 can be a direct memory access (DMA) slave, to
accommodate DMA transfers over a system memory bus initiated by a
DMA master. In some embodiments, a DMA master can be a device
different from a host processor. In such configurations, processing
module 200 can receive data for processing (e.g., DMA write), and
transfer processed data out (e.g., DMA read) without consuming host
processor resources.
[0060] Arbiter logic 206 can arbitrate between conflicting accesses
of data within processing module 200. In some embodiments, arbiter
logic 206 can arbitrate between accesses by offload processor 208
and accesses external to the processor module 200. It is understood
that a processing module 200 can include multiple locations that
are operated on at the same time. It is understood that accesses
arbitrated by arbiter logic 206 can include accesses to physical
system memory space occupied by the processor module 200, as well
as accesses to other resources (e.g., cache memory of offload or
host processor). Accordingly, arbitration rules for arbiter logic
206 can vary according to application. In some embodiments, such
arbitration rules are fixed for a given processor module 200. In
such cases, different applications can be accommodated by switching
out different processing modules. However, in alternate
embodiments, such arbitration rules can be configurable.
[0061] Offload processor 208 can include one or more processors
that can operate on data transferred over the system memory bus. In
some embodiments, offload processors can run a general operating
system or server applications such as Apache (as but one very
particular example), enabling processor contexts to be saved and
retrieved. Computing tasks executed by offload processor 208 can be
handled by the hardware scheduler. Offload processors 208 can
operate on data buffered in the processor module 200. In addition
or alternatively, offload processors 208 can access data stored
elsewhere in a system memory space. In some embodiments, offload
processors 208 can include a cache memory configured to store
context information. An offload processor 208 can include multiple
cores or one core.
[0062] A processor module 200 can be included in a system having a
host processor (not shown). In some embodiments, offload processors
208 can be a different type of processor as compared to the host
processor. In particular embodiments, offload processors 208 can
consume less power and/or have less computing power than a host
processor. In very particular embodiments, offload processors 208
can be "wimpy" core processors, while a host processor can be a
"brawny" core processor. However, in alternate embodiments, offload
processors 208 can have equivalent computing power to any host
processor. In very particular embodiments, a host processor can be
an x86 type processor, while an offload processor 208 can include
an ARM, ARC, Tensilica, MIPS, Strong/ARM, or RISC type processor,
as but a few examples.
[0063] Local memory 210 can be connected to offload processor 208
to enable the storing of context information. Accordingly, an
offload processor 208 can store current context information, and
then switch to a new computing task, then subsequently retrieve the
context information to resume the prior task. In very particular
embodiments, local memory 210 can be a low latency memory with
respect to other memories in a system. In some embodiments, storing
of context information can include copying an offload processor 208
cache.
[0064] In some embodiments, a same space within local memory 210 is
accessible by multiple offload processors 208 of the same type. In
this way, a context stored by one offload processor can be resumed
by a different offload processor.
[0065] Control logic 212 can control processing tasks executed by
offload processor(s). In some embodiments, control logic 212 can be
considered a hardware scheduler that can be conceptualized as
including a data evaluator 214, scheduler 216 and a switch
controller 218. A data evaluator 214 can extract "metadata" from
write data transferred over a system memory bus. "Metadata", as
used herein, can be any information embedded at one or more
predetermined locations of a block of write data that indicates
processing to be performed on all or a portion of the block of
write data and/or indicate a particular task/process to which the
data belongs (e.g., classification data). In some embodiments,
metadata can be data that indicates a higher level organization for
the block of write data. As but one very particular embodiment,
metadata can be header information of one or more network packets
(which may or may not be encapsulated within a higher layer packet
structure).
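For illustration only, the metadata peek might look like the following C sketch; the block layout, field names, and sizes are assumptions, since the actual format of embedded metadata is implementation defined.

    #include <stdint.h>
    #include <string.h>

    /* Assumed layout: each block of write data begins with a small,
     * fixed-position metadata header. */
    struct write_block_meta {
        uint16_t session_hint;   /* task/process the block belongs to */
        uint16_t flags;          /* e.g. header-only vs. full packet  */
        uint32_t payload_len;    /* bytes of packet data that follow  */
    };

    /* Data evaluator step: read the metadata from its predetermined
     * location without touching the rest of the block. */
    static void extract_metadata(const uint8_t *write_data,
                                 struct write_block_meta *out)
    {
        memcpy(out, write_data, sizeof(*out));
    }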
[0066] A scheduler 216 (e.g., a hardware scheduler) can order
computing tasks for offload processor(s) 208. In some embodiments,
scheduler 216 can generate a schedule that is continually updated
as write data for processing is received. In very particular
embodiments, a scheduler 216 can generate such a schedule based on
the ability to switch contexts of offload processor(s) 208. In this
way, on-module computing priorities can be adjusted on the fly. In
very particular embodiments, a scheduler 216 can assign a portion
of physical address space (e.g., memory locations within local
memory 210) to an offload processor 208, according to computing
tasks. The offload processor 208 can then switch between such
different spaces, saving context information prior to each switch,
and subsequently restoring context information when returning to
the memory space.
[0067] Switch controller 218 can control computing operations of
offload processor(s) 208. In particular embodiments, according to
scheduler 216, switch controller 218 can order offload processor(s)
208 to switch contexts. It is understood that a context switch
operation can be an "atomic" operation, executed in response to a
single command from switch controller 218. In addition or
alternatively, a switch controller 218 can issue an instruction set
that stores current context information, recalls context
information, etc.
[0068] In some embodiments, processor module 200 can include a
buffer memory (not shown). A buffer memory can store received write
data on board the processor module. A buffer memory can be
implemented on an entirely different set of memory devices, or can
be a memory embedded with logic and/or the offload processor. In
the latter case, arbiter logic 206 can arbitrate access to the
buffer memory. In some embodiments, a buffer memory can correspond
to a portion of a system physical memory space. The remaining
portion of the system memory space can correspond to other like
processor modules and/or memory modules connected to the same
system memory bus. In some embodiments buffer memory can be
different than local memory 210. For example, buffer memory can
have a slower access time than local memory 210. However, in other
embodiments, buffer memory and local memory can be implemented with
like memory devices.
[0069] In very particular embodiments, write data for processing
can have an expected maximum flow rate. A processor module 200 can
be configured to operate on such data at, or faster than, such a
flow rate. In this way, a master device (not shown) can write data
to a processor module without danger of overwriting data "in
process".
[0070] The various computing elements of a processor module 200 can
be implemented as one or more integrated circuit devices (ICs). It
is understood that the various components shown in FIG. 2-0 can be
formed in the same or different ICs. For example, control logic
212, memory interface 204, and/or arbiter logic 206 can be
implemented on one or more logic ICs, while offload processor(s)
208 and local memory 210 are separate ICs. Logic ICs can be fixed
logic (e.g., application specific ICs), programmable logic (e.g.,
field programmable gate arrays, FPGAs), or combinations
thereof.
[0071] Advantageously, the foregoing hardware and systems can
provide improved computational performance as compared to
traditional computing systems. Conventional systems, including
those based on x86 processors, are often ill-equipped to handle
such high volume applications. Even idling, x86 processors use a
significant amount of power, and near continuous operation for high
bandwidth packet analysis or other high volume processing tasks
makes the processor energy costs one of the dominant price
factors.
[0072] In addition, conventional systems can have issues with the
high cost of context switching wherein a host processor is required
to execute instructions which can include switching from one thread
to another. Such a switch can require storing and recalling the
context for the thread. If such context data is resident in a host
cache memory, such a context switch can occur relatively quickly.
However, if such context data is no longer in cache memory (i.e., a
cache miss), the data must be recalled from system memory, which
can incur a multi-cycle latency. Continuous cache misses during
context switching can adversely impact system performance.
[0073] FIG. 2-1 shows a processor module 200-1 according to one
very particular embodiment which is capable of reducing issues
associated with high volume processing or context switching
associated with many conventional server systems. A processor
module 200-1 can include ICs 220-0/1 mounted to a printed circuit
board (PCB) type substrate 222. PCB type substrate 222 can include
in-line module connector 202, which in one very particular
embodiment, can be a DIMM compatible connector. IC 220-0 can be a
system-on-chip (SoC) type device, integrating multiple functions.
In the very particular embodiment shown, an IC 220-0 can include
embedded processor(s), logic and memory. Such embedded processor(s)
can be offload processor(s) 208 as described herein, or
equivalents. Such logic can be any of control logic 212, memory
interface 204 and/or arbiter logic 206, as described herein, or
equivalents. Such memory can be any of local memory 210, cache
memory for offload processor(s) 208, or buffer memory, as described
herein, or equivalents. Logic IC 220-1 can provide logic functions
not included in IC 220-0.
[0074] FIG. 2-2 shows a processor module 200-2 according to another
very particular embodiment. A processor module 200-2 can include
ICs 220-2, -3, -4, -5 mounted to a PCB type substrate 222, like
that of FIG. 2-1. However, unlike FIG. 2-1, processor module
functions are distributed among single purpose type ICs. IC 220-2
can be a processor IC, which can be an offload processor 208. IC
220-3 can be a memory IC which can include local memory 210, buffer
memory, or combinations thereof. IC 220-4 can be a logic IC which
can include control logic 212, and in one very particular
embodiment, can be an FPGA. IC 220-5 can be another logic IC which
can include memory interface 204 and arbiter logic 206, and in one
very particular embodiment, can also be an FPGA.
[0075] It is understood that FIGS. 2-1/2 represent but two of
various implementations. The various functions of a processor
module can be distributed over any suitable number of ICs,
including a single SoC type IC.
[0076] FIG. 2-3 shows an opposing side of a processor module 200-1
or 200-2 according to a very particular embodiment. Processor
module 200-3 can include a number of memory ICs, one shown as
220-6, mounted to a PCB type substrate 222, like that of FIG. 2-1.
It is understood that various processing and logic components can
be mounted on an opposing side to that shown. A memory IC 220-6 can
be configured to represent a portion of the physical memory space
of a system. Memory ICs 220-6 can perform any or all of the
following functions: operate independently of other processor
module components, providing system memory accessed in a
conventional fashion; serve as buffer memory, storing write data
that can be processed with other processor module components; or
serve as local memory for storing processor context
information.
[0077] FIG. 2-4 shows a conventional DIMM module (i.e., it serves
only a memory function) that can populate a memory bus along with
processor modules as described herein, or equivalents.
[0078] FIG. 2-5 shows a system 230 according to one embodiment. A
system 230 can include a system memory bus 228 accessible via
multiple in-line module slots (one shown as 226). According to
embodiments, any or all of the slots 226 can be occupied by a
processor module 200 as described herein, or an equivalent. In the
event that not all slots 226 are occupied by processor modules 200,
the remaining slots can be occupied by conventional in-line memory
modules 224. In a very particular embodiment, slots 226 can be DIMM
slots.
[0079] In some embodiments, a processor module 200 can occupy one
slot. However, in other embodiments, a processor module can occupy
multiple slots.
[0080] In some embodiments, a system memory bus 228 can be further
interfaced with one or more host processors and/or input/output
device (not shown).
[0081] Having described processor modules according to various
embodiments, operations of an offload processor module capable of
interfacing with a server or similar system via a memory bus,
according to a particular embodiment, will now be described.
[0082] FIG. 3 shows a system 301 that can execute context switches
in offload processors according to an embodiment. In the example
shown, a system 301 can transport packet data to one or more
computational units (one shown as 300) located on a module, which
in particular embodiments, can include a connector compatible with
an existing memory module. In some embodiments, a computational
unit 300 can include a processor module as described in embodiments
herein, or an equivalent. A computational unit 300 can be capable
of intercepting or otherwise accessing packets sent over a memory
bus 316 and carrying out processing on such packets, including but
not limited to termination or metadata processing. A system memory
bus 316 can be a system memory bus like those described herein, or
equivalents (e.g., 228).
[0083] Referring still to FIG. 3, a system 301 can include an I/O
device 302 which can receive packet or other I/O data from an
external source. In some embodiments I/O device 302 can include
physical or virtual functions generated by the physical device to
receive a packet or other I/O data from the network or another
computer or virtual machine. In the very particular embodiment
shown, an I/O device 302 can include a network interface card (NIC)
having input buffer 302a (e.g., DMA ring buffer) and an I/O
virtualization function 302b.
[0084] According to embodiments, an I/O device 302 can write a
descriptor including details of the necessary memory operation for
the packet (i.e. read/write, source/destination). Such a descriptor
can be assigned a virtual memory location (e.g., by an operating
system of the system 301). I/O device 302 then communicates with an
input output memory management unit (IOMMU) 304 which can translate
virtual addresses to corresponding physical addresses with an IOMMU
function 304b. In the particular embodiment shown, a translation
look-aside buffer (TLB) 304a can be used for such translation.
Virtual function reads or writes of data between the I/O device and
system memory locations can then be executed with a direct memory
transfer (e.g., DMA) via a memory controller 306b of the system
301. An I/O
device 302 can be connected to IOMMU 304 by a host bus 312. In one
very particular embodiment, a host bus 312 can be a peripheral
interconnect (PCI) type bus. IOMMU 304 can be connected to a host
processing section 306 at a central processing unit I/O (CPUIO)
306a. In the embodiment shown, such a connection 314 can support a
HyperTransport (HT) protocol.
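A descriptor of the kind written by I/O device 302 might resemble the
following C structure; the field names and widths are assumptions for
illustration and do not correspond to any particular device's
descriptor format.

    #include <stdint.h>

    /* Hypothetical I/O descriptor giving details of the memory operation. */
    struct io_descriptor {
        uint64_t buf_addr; /* virtual address assigned by the OS; the IOMMU */
                           /* (304) translates it to a physical address     */
        uint32_t length;   /* number of bytes to transfer                   */
        uint16_t flags;    /* direction: read from or write to memory       */
        uint16_t status;   /* completion status written back by the device  */
    };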
[0085] In the embodiment shown, a host processing section 306 can
include the CPUIO 306a, memory controller 306b, processing core
306c and corresponding provisioning agent 306d.
[0086] In particular embodiments, a computational unit 300 can
interface with the system bus 316 via standard in-line module
connection, which in very particular embodiments can include a DIMM
type slot. In the embodiment shown, a memory bus 316 can be a DDR3
type memory bus. Alternate embodiments can include any suitable
system memory bus. Packet data can be sent by memory controller
306b via memory bus 316 to a DMA slave interface 310a. DMA slave
interface 310a can be adapted to receive encapsulated read/write
instructions from a DMA write over the memory bus 316.
[0087] A hardware scheduler (308b/c/d/e/h) can perform traffic
management on incoming packets by categorizing them according to
flow using session metadata. Packets can be queued for output in an
onboard memory (310b/308a/308m) based on session priority. When the
hardware scheduler determines that a packet for a particular
session is ready to be processed by the offload processor 308i, the
onboard memory is signaled for a context switch to that session.
Utilizing this method of prioritization, context switching overhead
can be reduced, as compared to conventional approaches. That is, a
hardware scheduler can handle context switching decisions and thus
optimize the performance of the downstream resource (e.g., offload
processor 308i).
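A software analogue of this classification and per-session queuing
step is sketched below; the five-tuple hash and the fixed queue count
are assumptions, and the actual scheduler described here is
implemented in hardware.

    #include <stdint.h>

    #define NUM_SESSION_QUEUES 32  /* assumed number of per-session queues */

    /* Session metadata used to identify a flow. */
    struct flow_key {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  protocol;
    };

    /* Map a packet's session metadata to an onboard output queue. */
    static unsigned classify_to_queue(const struct flow_key *k)
    {
        uint32_t h = k->src_ip ^ k->dst_ip;
        h ^= ((uint32_t)k->src_port << 16) | k->dst_port;
        h ^= k->protocol;
        h *= 2654435761u;              /* simple multiplicative hash      */
        return h % NUM_SESSION_QUEUES; /* queue holding this session flow */
    }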
[0088] As noted above, in very particular embodiments, an offload
processor 308i can be a "wimpy core" type processor. According to
some embodiments, a host processor 306c can be a "brawny core" type
processor (e.g., an x86 or any other processor capable of handling
"heavy touch" computational operations). While an I/O device 302
can be configured to trigger host processor interrupts in response
to incoming packets, according to embodiments, such interrupts can
be disabled, thereby reducing processing overhead for the host
processor 306c. In some very particular embodiments, an offload
processor 308i can include an ARM, ARC, Tensilica, MIPS, StrongARM
or any other processor capable of handling "light touch"
operations. Preferably, an offload processor can run a general
purpose operating system for executing a plurality of sessions,
which can be optimized to work in conjunction with the hardware
scheduler in order to reduce context switching overhead.
[0089] Referring still to FIG. 3, in operation, a system 301 can
receive packets from an external network over a network interface.
The packets are destined for either a host processor 306c or an
offload processor 308i based on the classification logic and
schemes employed by I/O device 302. In particular embodiments,
I/O device 302 can operate as a virtualized NIC, with packets for a
particular logical network or for a certain virtual MAC (VMAC)
address being directed into separate queues and sent over to the
destination logical entity. Such an arrangement can transfer
packets to different entities. In some embodiments, each such
entity can have a virtual driver, a virtual device model that it
uses to communicate with the connected virtual network.
[0090] According to embodiments, multiple devices can be used to
redirect traffic to specific memory addresses. So, each of the
network devices operates as if it is transferring the packets to
the memory location of a logical entity. However, in reality, such
packets are transferred to memory addresses where they can be
handled by one or more offload processors (e.g., 308i). In
particular embodiments such transfers are to physical memory
addresses, thus logical entities can be removed from the
processing, and a host processor can be free from such packet
handling.
[0091] Accordingly, embodiments can be conceptualized as providing
a memory "black box" to which specific network data can be fed.
Such a memory black box can handle the data (e.g., process it) and
respond back when such data is requested.
[0092] Referring still to FIG. 3, according to some embodiments,
I/O device 302 can receive data packets from a network or from a
computing device. The data packets can have certain
characteristics, including transport protocol number, source and
destination port numbers, source and destination IP addresses, for
example. The data packets can further have metadata that is
processed (308d) to help in their classification and
management.
[0093] I/O device 302 can include, but is not limited to,
peripheral component interconnect (PCI) and/or PCI express (PCIe)
devices connecting with a host motherboard via PCI or PCIe bus
(e.g., 312). Examples of I/O devices include a network interface
controller (NIC), a host bus adapter, a converged network adapter,
an ATM network interface, etc.
[0094] In order to provide for an abstraction scheme that allows
multiple logical entities to access the same I/O device 302, the
I/O device may be virtualized to provide for multiple virtual
devices each of which can perform some of the functions of the
physical I/O device. The IO virtualization program (e.g., 302b)
according to an embodiment, can redirect traffic to different
memory locations (and thus to different offload processors attached
to modules on a memory bus). To achieve this, an I/O device 302
(e.g., a network card) may be partitioned into several functional
parts, including a controlling function (CF) supporting input/output
virtualization (IOV) architecture (e.g., single-root IOV) and
multiple virtual function (VF) interfaces. Each virtual function
interface may be provided with resources during runtime for
dedicated usage. Examples of the CF and VF may include the physical
function and virtual functions under schemes such as Single Root
I/O Virtualization or Multi-Root I/O Virtualization architecture.
The CF acts as the physical resource that sets up and manages
virtual resources. The CF is also capable of acting as a
full-fledged IO device. The VF is responsible for providing an
abstraction of a virtual device for communication with multiple
logical entities/multiple memory regions.
[0095] The operating system/the hypervisor/any of the virtual
machines/user code running on a host processor 306c may be loaded
with a device model, a VF driver and a driver for a CF. The device
model may be used to create an emulation of a physical device for
the host processor 306c to recognize each of the multiple VFs that
are created. The device model may be replicated multiple times to
give the impression to a VF driver (a driver that interacts with a
virtual IO device) that it is interacting with a physical device of
a particular type.
[0096] For example, a certain device model may be used to emulate
a network adapter such as the Intel.RTM. Ethernet Converged Network
Adapter (CNA) X540-T2, so that the I/O device 302 believes it is
interacting with such an adapter. In such a case, each of the
virtual functions may have the capability to support the functions
of the above said CNA, i.e., each of the Physical Functions should
be able to support such functionality. The device model and the VF
driver can be run in either privileged or non-privileged mode. In
some embodiments, there is no restriction with regard to who
hosts/runs the code corresponding to the device model and the VF
driver. The code, however, has the capability to create multiple
copies of device model and VF driver so as to enable multiple
copies of said I/O interface to be created.
[0097] An application or provisioning agent 306d, as part of an
application/user level code running in a kernel, may create a
virtual I/O address space for each VF, during runtime and allocate
part of the physical address space to it. For example, if an
application handling the VF driver instructs it to read or write
packets from or to memory addresses 0xaaaa to 0xffff, the device
driver may write I/O descriptors into a descriptor queue with a
head and tail pointer that are changed dynamically as queue entries
are filled. The data structure may be of another type as well,
including but not limited to a ring structure 302a or hash
table.
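A minimal ring-style descriptor queue with dynamically updated head
and tail pointers, of the type mentioned above, could be sketched as
follows; the entry layout and ring size are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define RING_ENTRIES 256  /* assumed power-of-two ring size */

    struct ring_desc {
        uint64_t buf_addr; /* buffer to read packets from or write them to */
        uint32_t length;   /* bytes to transfer                            */
        uint32_t flags;    /* read/write direction, completion status      */
    };

    struct desc_ring {
        struct ring_desc entries[RING_ENTRIES];
        uint32_t head; /* next entry the device will consume */
        uint32_t tail; /* next entry the driver will fill    */
    };

    /* Post a descriptor; returns false when the ring is full. */
    static bool ring_post(struct desc_ring *r, struct ring_desc d)
    {
        uint32_t next = (r->tail + 1) & (RING_ENTRIES - 1);
        if (next == r->head)
            return false;      /* full: the head has not advanced yet */
        r->entries[r->tail] = d;
        r->tail = next;        /* tail pointer changed dynamically    */
        return true;
    }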
[0098] The VF can read from or write data to the address location
pointed to by the driver. Further, on completing the transfer of
data to the address space allocated to the driver, interrupts,
which are usually triggered to the host processor to handle said
network packets, can be disabled. Allocating a specific I/O space
to a device can include allocating to said I/O space a specific
physical memory space.
[0099] In another embodiment, the descriptor may comprise only a
write operation, if the descriptor is associated with a specific
data structure for handling incoming packets. Further, the
descriptor for each of the entries in the incoming data structure
may be constant so as to redirect all data write to a specific
memory location. In an alternate embodiment, the descriptor for
consecutive entries may point to consecutive entries in memory so
as to direct incoming packets to consecutive memory locations.
[0100] Alternatively, said operating system may create a defined
physical address space for an application supporting the VF drivers
and allocate a virtual memory address space to the application or
provisioning agent 306d, thereby creating a mapping for each
virtual function between said virtual address and a physical
address space. Said mapping between virtual memory address space
and physical memory space may be stored in IOMMU tables (e.g., a
TLB 304a). The application performing memory reads or writes may
supply virtual addresses to said virtual function, and the host
processor OS may allocate a specific part of the physical memory
location to such an application.
[0101] Alternatively, VF may be configured to generate requests
such as read and write which may be part of a direct memory access
(DMA) read or write operation, for example. The virtual addresses
can be translated by the IOMMU 304 to their corresponding physical
addresses, and the physical addresses may be provided to the memory
controller for access. That is, the IOMMU 304 may modify the memory
requests sourced by the I/O devices to change the virtual address
in the request to a physical address, and the memory request may be
forwarded to the memory controller for memory access. The memory
request may be forwarded over a bus 314 that supports a protocol
such as HyperTransport. The VF may in such cases carry out a
direct memory access by supplying the virtual memory address to the
IOMMU 304.
[0102] Alternatively, said application may directly code the
physical address into the VF descriptors if the VF allows for it.
If the VF cannot support physical addresses of the form used by the
host processor 306c, an aperture with a hardware size supported by
the VF device may be coded into the descriptor so that the VF is
informed of the target hardware address of the device. Data that is
transferred to an aperture may be mapped by a translation table to
a defined physical address space in the system memory. The DMA
operations may be initiated by software executed by the processors,
programming the I/O devices directly or indirectly to perform the
DMA operations.
[0103] Referring still to FIG. 3, in particular embodiments, parts
of computational unit 300 can be implemented with one or more
FPGAs. In the system of FIG. 3, computational unit 300 can include
FPGA 310 in which can be formed a DMA slave device module 310a and
arbiter 310f. A DMA slave module 310a can be any device suitable
for attachment to a memory bus 316 that can respond to DMA
read/write requests. In alternate embodiments, a DMA slave module
310a can be another interface capable of block data transfers over
memory bus 316. The DMA slave module 310a can be capable of
receiving data from a DMA controller (when it performs a read from
a `memory` or from a peripheral) or transferring data to a DMA
controller (when it performs a write instruction on the DMA slave
module 310a). The DMA slave module 310a may be adapted to receive
DMA read and write instructions encapsulated over a memory bus,
(e.g., in the form of a DDR data transmission, such as a packet or
data burst), or any other format that can be sent over the
corresponding memory bus.
[0104] A DMA slave module 310a can reconstruct the DMA read/write
instruction from the memory R/W packet. The DMA slave module 310a
may be adapted to respond to these instructions in the form of data
reads/data writes to the DMA master, which could either be housed
in a peripheral device, in the case of a PCIe bus, or a system DMA
controller in the case of an ISA bus.
[0105] I/O data that is received by the DMA device 310a can then be
queued for arbitration. Arbitration can include the process of
scheduling packets of different flows, such that they are provided
access to available bandwidth based on a number of parameters. In
general, an arbiter 310f provides resource access to one or more
requestors. If multiple requestors request access, an arbiter 310f
can determine which requestor becomes the accessor and then passes
data from the accessor to the resource interface, and the
downstream resource can begin execution on the data. After the data
has been completely transferred to a resource, and the resource has
completed execution, the arbiter 310f can transfer control to a
different requestor and this cycle repeats for all available
requestors. In the embodiment of FIG. 3 arbiter 310f can notify
other portions of computational unit 300 (e.g., 308) of incoming
data.
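A simple software analogue of the arbitration cycle, assuming a
round-robin policy over a fixed set of requestors, is sketched below;
the actual arbiter 310f is a hardware circuit and is not limited to
this policy.

    #include <stdbool.h>

    #define NUM_REQUESTORS 4  /* assumed number of requestors */

    /* Hypothetical hooks into the surrounding logic. */
    extern bool is_requesting(unsigned id);
    extern void grant_access(unsigned id); /* pass accessor data downstream */

    /* One arbitration pass: grant the next requestor after *last that is
     * asserting a request, then remember it so the cycle repeats over all
     * available requestors. */
    static void arbitrate_once(unsigned *last)
    {
        for (unsigned i = 1; i <= NUM_REQUESTORS; i++) {
            unsigned id = (*last + i) % NUM_REQUESTORS;
            if (is_requesting(id)) {
                grant_access(id);
                *last = id;
                return;
            }
        }
    }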
[0106] Alternatively, a computation unit 300 can utilize an
arbitration scheme shown in U.S. Pat. No. 7,813,283, issued to
Dalal on Oct. 12, 2010, the contents of which are incorporated
herein by reference. Other suitable arbitration schemes known in
art could be implemented in embodiments herein. Alternatively, the
arbitration scheme of the current invention might be implemented
using an OpenFlow switch and an OpenFlow controller.
[0107] In the very particular embodiment of FIG. 3, computational
unit 300 can further include notify/prefetch circuits 310c which
can prefetch data stored in a buffer memory 310b in response to DMA
slave module 310a, and as arbitrated by arbiter 310f. Further,
arbiter 310f can access other portions of the computational unit
300 via a memory mapped I/O ingress path 310e and egress path
310g.
[0108] Referring to FIG. 3, a hardware scheduler can include a
scheduling circuit 308b/n to implement traffic management of
incoming packets. Packets from a certain source, relating to a
certain traffic class, pertaining to a specific application or
flowing to a certain socket are referred to as part of a session
flow and are classified using session metadata. Such classification
can be performed by classifier 308e.
[0109] In some embodiments, session metadata 308d can serve as the
criterion by which packets are prioritized and scheduled and as
such, incoming packets can be reordered based on their session
metadata. This reordering of packets can occur in one or more
buffers and can modify the traffic shape of these flows. The
scheduling discipline chosen for this prioritization, or traffic
management (TM), can affect the traffic shape of flows and
micro-flows through delay (buffering), bursting of traffic
(buffering and bursting), smoothing of traffic (buffering and
rate-limiting flows), dropping traffic (choosing data to discard so
as to avoid exhausting the buffer), delay jitter (temporally
shifting cells of a flow by different amounts) and by not admitting
a connection (e.g., cannot simultaneously guarantee existing
service level agreements (SLAs) with an additional flow's SLA).
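As one concrete illustration of such traffic shaping, the sketch below
applies a generic token-bucket rate limit to a flow before it is
queued; the rate and burst parameters are assumptions, and the
embodiments do not prescribe this particular discipline.

    #include <stdbool.h>
    #include <stdint.h>

    struct token_bucket {
        uint64_t tokens;   /* bytes currently available      */
        uint64_t burst;    /* maximum bucket depth, in bytes */
        uint64_t rate_bps; /* refill rate, bytes per second  */
        uint64_t last_ns;  /* time of the last refill        */
    };

    /* Admit the packet if enough tokens remain; otherwise shape the flow
     * (e.g., delay or drop the packet). */
    static bool tb_admit(struct token_bucket *tb, uint32_t pkt_len,
                         uint64_t now_ns)
    {
        uint64_t refill = (now_ns - tb->last_ns) * tb->rate_bps / 1000000000u;
        uint64_t filled = tb->tokens + refill;
        tb->tokens  = filled > tb->burst ? tb->burst : filled;
        tb->last_ns = now_ns;
        if (tb->tokens < pkt_len)
            return false;  /* rate-limit this packet */
        tb->tokens -= pkt_len;
        return true;
    }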
[0110] According to embodiments, computational unit 300 can serve
as part of a switch fabric, and provide traffic management with
depth-limited output queues, the access to which is arbitrated by a
scheduling circuit 308b/n. Such output queues are managed using a
scheduling discipline to provide traffic management for incoming
flows. The session flows queued in each of these queues can be sent
out through an output port to a downstream network element.
[0111] It is noted that conventional traffic management does not
take into account the handling and management of data by downstream
elements, except for meeting the SLA agreements it already has with
said downstream elements.
[0112] In contrast, according to embodiments a scheduler circuit
308b/n can allocate a priority to each of the output queues and
carry out reordering of incoming packets to maintain persistence of
session flows in these queues. A scheduler circuit 308b/n can be
used to control the scheduling of each of these persistent sessions
into a general purpose operating system (OS) 308j, executed on an
offload processor 308i. Packets of a particular session flow, as
defined above, can belong to a particular queue. The scheduler
circuit 308b/n may control the prioritization of these queues such
that they are arbitrated for handling by a general purpose (GP)
processing resource (e.g., offload processor 308i) located
downstream. An OS 308j running on a downstream processor 308i can
allocate execution resources such as processor cycles and memory to
a particular queue it is currently handling. The OS 308j may
further allocate a thread or a group of threads for that particular
queue, so that it is handled distinctly by the general purpose
processing element 308i as a separate entity. The fact that there
can be multiple sessions running on a GP processing resource, each
handling data from a particular session flow resident in a queue
established by the scheduler circuit, tightly integrates the
scheduler and the downstream resource (e.g., 308i). This can bring
about persistence of session information across the traffic
management and scheduling circuit and the general purpose
processing resource 308i.
[0113] Dedicated computing resources (e.g., 308i), memory space and
session context information for each of the sessions can provide a
way of handling, processing and/or terminating each of the session
flows at the general purpose processor 308i. The scheduler circuit
308b/n can exploit this functionality of the execution resource to
queue session flows for scheduling downstream. The scheduler
circuit 308b/n can be informed of the state of the execution
resource(s) (e.g., 308i), the current session that is run on the
execution resource, the memory space allocated to it, and the
location of the session context in the processor cache.
[0114] According to embodiments, a scheduler circuit 308b/n can
further include switching circuits to change execution resources
from one state to another. The scheduler circuit 308b/n can use
such a capability to arbitrate between the queues that are ready to
be switched into the downstream execution resource. Further, the
downstream execution resource can be optimized to reduce the
penalty and overhead associated with context switch between
resources. This is further exploited by the scheduler circuit
308b/n to carry out seamless switching between queues, and
consequently their execution as different sessions by the execution
resource.
[0115] According to embodiments, a scheduler circuit 308b/n can
schedule different sessions on a downstream processing resource,
wherein the two are operated in coordination to reduce the overhead
during context switches. An important factor in decreasing the
latency of services and engineering computational availability can
be hardware context switching synchronized with network queuing. In
embodiments, when a queue is selected by a traffic manager, a
pipeline coordinates swapping in of the cache (e.g., L2 cache) of
the corresponding resource (e.g., 308i) and transfers the
reassembled I/O data into the memory space of the executing
process. In certain cases, no packets are pending in the queue, but
computation is still pending to service previous packets. Once this
process makes a memory reference outside of the data swapped, the
scheduler circuit (308b/n) can enable queued data from an I/O
device 302 to continue scheduling the thread.
[0116] In some embodiments, to provide fair queuing to a process
not having data, a maximum context size can be assumed as data
processed. In this way, a queue can be provisioned as the greater
of computational resource and network bandwidth resource. As but
one very particular example, a computation resource can be an ARM
A9 processor running at 800 MHz, while the network bandwidth can be
3 Gbps. Given the lopsided nature of this ratio,
embodiments can utilize computation having many parallel sessions
(such that the hardware's prefetching of session-specific data
offloads a large portion of the host processor load) and having
minimal general purpose processing of data.
[0117] Accordingly, in some embodiments, a scheduler circuit 308b/n
can be conceptualized as arbitrating, not between outgoing queues
at line rate speeds, but arbitrating between terminated sessions at
very high speeds. The stickiness of sessions across a pipeline of
stages, including a general purpose OS, allows a scheduler circuit
to optimize any or all such stages of such a pipeline.
[0118] Alternatively, a scheduling scheme can be used as shown in
U.S. Pat. No. 7,760,715 issued to Dalal on Jul. 20, 2010,
incorporated herein by reference. This scheme can be useful when it
is desirable to rate limit the flows for preventing the downstream
congestion of another resource specific to the over-selected flow,
or for enforcing service contracts for particular flows.
Embodiments can include an arbitration scheme that allows service
contracts of downstream resources, such as a general purpose OS, to
be enforced seamlessly.
[0119] Referring still to FIG. 3, a hardware scheduler according to
embodiments herein, or equivalents, can provide for the
classification of incoming packet data into session flows based on
session metadata. It can further provide for traffic management of
these flows before they are arbitrated and queued as distinct
processing entities on the offload processors.
[0120] In some embodiments, offload processors (e.g., 308i) can be
general purpose processing units capable of handling packets of
different application or transport sessions. Such offload
processors can be low power processors capable of executing general
purpose instructions. The offload processors could be any suitable
processor, including but not limited to: ARM, ARC, Tensilica, MIPS,
StrongARM or any other processor that serves the functions
described herein. Such offload processors have a general purpose OS
running on them, wherein the general purpose OS is optimized to
reduce the penalty associated with context switching between
different threads or group of threads.
[0121] In contrast, context switches on host processors can be
computationally intensive processes that require the register save
area, process context in the cache, and TLB entries to be restored
if they are invalidated or overwritten. Instruction cache misses in
host processing systems can lead to pipeline stalls, data cache
misses can lead to operation stalls, and such cache misses reduce
processor efficiency and increase processor overhead.
[0122] Also in contrast, an OS 308j running on the offload
processors 308i in association with a scheduler circuit 308b/n, can
operate together to reduce the context switch overhead incurred
between different processing entities running on it. Embodiments
can include a cooperative mechanism between a scheduler circuit and
the OS on the offload processor 308i, wherein the OS sets up
session context to be physically contiguous (physically colored
allocator for session heap and stack) in the cache; then
communicates the session color, size, and starting physical address
to the scheduler circuit upon session initialization. During an
actual context switch, a scheduler circuit can identify the session
context in the cache by using these parameters and initiate a bulk
transfer of these contents to an external low latency memory (e.g.,
308g). In addition, the scheduler circuit can manage the prefetch
of the old session if its context was saved to a local memory 308g.
In particular embodiments, a local memory 308g can be low latency
memory, such as a reduced latency dynamic random access memory
(RLDRAM), as but one very particular embodiment. Thus, in
embodiments, session context can be identified distinctly in the
cache.
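The cooperation described above might look roughly like the following
at session initialization and switch-out; the registration structure
and the memcpy stand-in for the hardware bulk transfer are
illustrative assumptions only.

    #include <stdint.h>
    #include <string.h>

    /* Parameters the OS communicates to the scheduler circuit when a
     * session is initialized. */
    struct session_ctx_info {
        uint32_t color;      /* cache color of the session's pages       */
        uint32_t size;       /* bytes of physically contiguous context   */
        uint64_t phys_start; /* starting physical address of the context */
    };

    /* Stand-in for the scheduler-initiated bulk save of the identified
     * cache contents to external low latency memory (e.g., 308g). */
    static void bulk_save(void *low_latency_mem, const void *cached_session,
                          const struct session_ctx_info *info)
    {
        memcpy(low_latency_mem, cached_session, info->size);
    }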
[0123] In some embodiments, context size can be limited to ensure
fast switching speeds. In addition or alternatively, embodiments
can include a bulk transfer mechanism to transfer out session
context to a local memory 308g. The cache contents stored therein
can then be retrieved and prefetched during context switch back to
a previous session. Different context session data can be tagged
and/or identified within the local memory 308g for fast retrieval.
As noted above, context stored by one offload processor may be
recalled by a different offload processor.
[0124] In the very particular embodiment of FIG. 3, multiple
offload processing cores can be integrated into a computation FPGA
308. Multiple computational FPGAs can be arbitrated by arbitrator
circuits in another FPGA 310. The combination of computational
FPGAs (e.g., 308) and arbiter FPGAs (e.g., 310) are referred to as
"XIMM" modules or "Xockets DIMM modules" (e.g., computation unit
300). In particular applications, these XIMM modules can provide
integrated traffic and thread management circuits that broker
execution of multiple sessions on the offload processors.
[0125] FIG. 3 also shows an offload processor tunnel connection
308k, as well as a memory interface 308m and access unit 308l
(which can be an accelerator coherency port (ACP)). Memory
interface 308m can access buffer memory 308a. According to
embodiments, system 301 can use an access (or "snooping") unit
308l to access the cache contents of an offload processor
308i. In particular embodiments, the cache accessed can be an L2
cache. An access unit 308l can provide a port or other access
capability that can load data from an external, non-cached memory
308g into an offload processor cache, as well as transfer the cache
contents of an offload processor 308i to a non-cache memory 308g.
As part of a computational element 300, there can be several memory
devices (e.g., RAMs) that form memory 308g. Thus, memory 308g can
be used to store the cache contents of sessions. Memory 308g can
include one or more low latency memories, and can be conceptualized
as supplementing and/or augmenting the available L2 cache, and
extending the coherency domain of sessions. The additional memory
308g and access unit 308l can reduce the adverse effects of cache
misses for switched-in sessions, in that a session's context can be
fetched and pre-fetched into an offload processor cache so that
when the thread resumes, most of its previous working set is
already present in the cache.
[0126] According to one particular embodiment, in a session
switch-out, an offload processor 308i cache contents can be
transferred to memory 308g via tunnel 308k. However, in some
embodiments, a thread's register set may be saved to memory as part
of switch-out, thus these register contents can be resident in the
cache. Therefore, as part of switch-in, when a session's contents
are prefetched and transferred into the cache of an offload
processor 308i, the register contents can be loaded by the kernel
upon resuming the thread, and these loads should be from the cache
and not from memory 308g. Thus, with the careful management of a
session's cache contents, the cost of context switching due to
register set save and restore and cache misses on switch-in can be
greatly reduced, and even eliminated in some optimal cases, thereby
eliminating two sources of context switch overhead and reducing the
latency for the switched-in session to resume useful
processing.
[0127] According to some embodiments, an access (or snooping) unit
(e.g., 308l) can have the indices of all the lines in the cache
where the relevant session context resides. If the session is
scattered across locations in a physically indexed cache, it can
become very cumbersome to access all of the session contents as
multiple address translations would be required to access multiple
pages of the same session. Accordingly, embodiments can include a
page coloring scheme, in which the session contents are established
in contiguous locations in a physically indexed cache. A memory
allocator for session data can allocate from physically contiguous
pages so that there is control over physical address ranges for the
sessions. In some embodiments, this is done by aligning the virtual
memory page and the physical memory page to index to the same
location in the cache (e.g., FIG. 1-3). In alternate embodiments,
virtual and physical memory pages do not have to index the same
location in a physically indexed cache, but the different pages of
the session can be contiguous in physical memory, such that
knowledge of the beginning index and size of the entry in the cache
suffices to access all session data. Further, the set size is equal
to the size of a session, so that once the index of a session entry
in the cache is known, the index, the size, and the set color can
be used to completely transfer out the session contents from the
cache to an external memory (e.g., 308g).
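A sketch of deriving the cache set index and session color from a
physical address in a physically indexed cache is given below; the
line size, associativity, and session size are assumptions chosen to
match the example in the next paragraph.

    #include <stdint.h>

    #define CACHE_SIZE   (512 * 1024) /* assumed L2 size in bytes         */
    #define LINE_SIZE    64           /* assumed cache line size          */
    #define WAYS         8            /* assumed associativity            */
    #define NUM_SETS     (CACHE_SIZE / (LINE_SIZE * WAYS))
    #define SESSION_SIZE (8 * 1024)   /* assumed bytes of session context */

    /* Set index selected by a physical address in a physically indexed
     * cache; aligning virtual and physical pages keeps this predictable. */
    static unsigned set_index(uint64_t phys_addr)
    {
        return (unsigned)((phys_addr / LINE_SIZE) % NUM_SETS);
    }

    /* Session color: which SESSION_SIZE-aligned slice of one cache way the
     * address falls into; all pages of a session share this color. */
    static unsigned session_color(uint64_t phys_addr)
    {
        return (unsigned)((phys_addr % (CACHE_SIZE / WAYS)) / SESSION_SIZE);
    }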
[0128] According to embodiments, all pages of a session can be
assigned the same color in a cache of an offload processor. In a
particular embodiment, all pages of a session can start at the page
boundary of a defined color. The number of pages allocated to a
color can be fixed based on the size of a session in the cache. An
offload processor (e.g., 308i) can be used for executing a specific
type of sessions and it is informed of the size of each session
beforehand. Based on this, the offload processor can begin a new
entry at a session boundary. It can similarly allocate pages in
physical memory that index to the session boundary in the cache.
The entire cache context can be saved beginning at the session
boundary. In the currently described embodiment, multiple pages in
the session can be contiguous in a physically indexed cache.
Multiple pages of a session can have the same color (i.e., they are
part of the same set) and are located contiguously. Pages of a
session are accessible by using an offset from the base index of
the session. The cache can be arranged and broken up into distinct
sets, not as pages, but as sessions. To move from one session to
another, the memory allocation scheme uses an offset to the lowest
bit of the indexes used to access these sessions. For example, a
physically indexed cache can be an L2 cache having a size of 512
KB. The cache can be 8-way associative, with eight ways per set
possible in the L2 cache. Therefore, there are eight lines per any
color in L2, or eight separate instances of each color in L2. With
a session context size of 8 KB, there will then be eight different
session areas within the 512 KB L2 cache, or eight session colors
with these chosen sizes.
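The arithmetic of this example can be checked directly with the sizes
stated above; the short program below simply restates those figures.

    #include <stdio.h>

    int main(void)
    {
        unsigned cache_bytes   = 512 * 1024; /* 512 KB L2 cache      */
        unsigned ways          = 8;          /* 8-way associative    */
        unsigned session_bytes = 8 * 1024;   /* 8 KB session context */

        unsigned bytes_per_way  = cache_bytes / ways;            /* 65536 */
        unsigned session_colors = bytes_per_way / session_bytes; /* 8     */

        printf("%u session colors of %u bytes each\n",
               session_colors, session_bytes);
        return 0;
    }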
[0129] According to embodiments, a physical memory allocator can
identify the color corresponding to a session based on the cache
entry/main memory entry of the temporally previous session. In a
particular embodiment, the physical memory allocator can identify a
session of the previous session based on the 3 bits of the address
used to assign a cache entry to the previous session. The physical
memory allocator can assign the new session to a main memory
location (whose color can be determined through a few comparisons
to the most recently used entry) and will cause a cache entry
corresponding to a session of a different color to be evicted based
on a least recently used policy. In another embodiment, an offload
processor can include multiple cores. In such an embodiment, cache
entries can be locked out for use by each processor core. For
example, if the offload processor has two cores, a given set of
cache lines in a cache (i.e., L2 cache) can be divided among
processors, halving the number of colors. The color of the session,
the index of the session, and the session size can be communicated
to an external scheduler when a new session is created. This
information can be used for queue management of incoming session
flows.
[0130] Embodiments can also permit isolation of shared text and any
shared data by locking these lines into a cache, apart from session
data. Again, a physical memory allocator and physical coloring
techniques can be used. If separate shared data is placed in a
cache, it is possible to lock it into the cache, as long as
transfers by an access unit (e.g., ACP) do not copy such lines.
When allocating memory for session data, the memory allocator can
be aware of the physical color to which session data residing in
the cache is mapped.
[0131] Having described various embodiments suitable for cache and
context switching management operations, an example illustrating
particular aspects will now be described.
[0132] FIG. 4 shows a method 400 of reduced overhead context
switching for a system according to an embodiment. At
initialization, a determination can be made if session coloring is
required (402). Such a determination can be made by an OS. If
session coloring is not required (No from 402), page coloring may
or may not be present depending upon default choices of an OS
(424).
[0133] If session coloring is required (Yes from 402), an OS can
initialize a memory allocator (404). A memory allocator can employ
cache optimization techniques that can allocate each session entry
to a "session" boundary. The memory allocator can determine the
starting address of each session, the number of sessions allowable
in the cache, and the number of locations wherein a session can be
found for a given color. Such operations can include determining
the number of sets available based on cache size, number of colors,
and size of a session (step 406).
[0134] When a packet for a session arrives, a determination can be
made of whether the packet is for a current or different session
(408). Such an action can be performed by the OS. If a packet is
for a different session (Yes from 408), a determination can be made
of whether the packet is from an earlier session (410). If the
packet is not from an earlier session (i.e., it is a new session),
a determination can be made to see if there is enough memory for
the new session (418). If there is enough space (Yes from 418), a
switch can be made to the new session (422). Such an action can
include allocating a new session at a session boundary, and saving
the context of the process that is currently executing to a context
memory (which can be an external low latency memory).
[0135] If no cache memory is available for a new session (No from
418) and/or the packet is for an earlier session (Yes from 410) an
examination can be made to determine if the packet of the old/new
session is of a same color (412). If it is of a different color (No
from 412), a switch can be made to that session (step 414). Such an
action can include retrieving (for an earlier session) or creating
(for a new session) cache entries for the task. In addition, such
an action can include, if needed, the flushing of cache entries
according to an LRU scheme.
[0136] If a packet of the old/new session is of the same color (Yes
from 412), a determination can be made as to whether the color
pressure can be exceeded (416). If the color pressure can be
exceeded, or a session of some other color is not available (Yes,
or . . . from 416), a switch can be made to the new session (420).
Such an action can include creating cache entries and remembering
the new session color. If the color pressure cannot be exceeded,
but sessions of some other color are available (No, but . . . from
416), a method can proceed to 414.
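The per-packet decisions of method 400 can be condensed into the
following sketch; all of the helper predicates are hypothetical
placeholders for the determinations 402-422 described above.

    #include <stdbool.h>

    /* Hypothetical predicates corresponding to determinations 408-418. */
    extern bool for_current_session(const void *pkt);   /* 408 */
    extern bool from_earlier_session(const void *pkt);  /* 410 */
    extern bool enough_memory_for_new_session(void);    /* 418 */
    extern bool same_color_as_current(const void *pkt); /* 412 */
    extern bool color_pressure_can_be_exceeded(void);   /* 416 */

    extern void switch_to_new_session(void);       /* 420/422 */
    extern void switch_to_session_of_packet(void); /* 414     */

    /* Condensed per-packet decision flow of method 400. */
    static void handle_packet(const void *pkt)
    {
        if (for_current_session(pkt))
            return;                           /* No from 408 */

        if (!from_earlier_session(pkt) && enough_memory_for_new_session()) {
            switch_to_new_session();          /* 422: allocate at a session
                                                 boundary, save old context */
            return;
        }

        if (!same_color_as_current(pkt)) {
            switch_to_session_of_packet();    /* 414: different color       */
        } else if (color_pressure_can_be_exceeded()) {
            switch_to_new_session();          /* 420: exceed color pressure */
        } else {
            switch_to_session_of_packet();    /* 414: use another color     */
        }
    }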
[0137] It should be appreciated that in the foregoing description
of exemplary embodiments of the invention, various features of the
invention are sometimes grouped together in a single embodiment,
figure, or description thereof for the purpose of streamlining the
disclosure and aiding in the understanding of one or more of the
various inventive aspects. This method of disclosure, however, is
not to be interpreted as reflecting an intention that the claimed
invention requires more features than are expressly recited in each
claim. Rather, as the following claims reflect, inventive aspects
lie in less than all features of a single foregoing disclosed
embodiment. Thus, the claims following the detailed description are
hereby expressly incorporated into this detailed description, with
each claim standing on its own as a separate embodiment of this
invention.
[0138] It is also understood that the embodiments of the invention
may be practiced in the absence of an element and/or step not
specifically disclosed. That is, an inventive feature of the
invention may be elimination of an element.
[0139] Accordingly, while the various aspects of the particular
embodiments set forth herein have been described in detail, the
present invention could be subject to various changes,
substitutions, and alterations without departing from the spirit
and scope of the invention.
* * * * *