U.S. patent application number 14/024063 was filed with the patent office on 2013-09-11 and published on 2015-03-12 for reducing latency associated with timestamps. This patent application is currently assigned to NetLogic Microsystems, Inc. The applicant listed for this patent is NetLogic Microsystems, Inc. Invention is credited to David T. Hass, Kaushik Kuila, and Ahmed Shahid.
United States Patent Application 20150074442
Kind Code: A1
Shahid, Ahmed; et al.
Publication Date: March 12, 2015
REDUCING LATENCY ASSOCIATED WITH TIMESTAMPS
Abstract
A system and method are provided for reducing a latency
associated with timestamps in a multi-core, multi-threaded
processor. A processor capable of simultaneously processing a
plurality of threads is provided. The processor includes a
plurality of cores, a plurality of network interfaces for network
communication, and a timer circuit for reducing a latency
associated with timestamps used for synchronization of the network
communication utilizing a precision time protocol.
Inventors: Shahid, Ahmed (San Jose, CA); Kuila, Kaushik (San Jose, CA); Hass, David T. (Santa Clara, CA)
Applicant: NetLogic Microsystems, Inc. (Irvine, CA, US)
Assignee: NetLogic Microsystems, Inc. (Irvine, CA)
Family ID: 41721782
Appl. No.: 14/024063
Filed: September 11, 2013
Current U.S. Class: 713/500
Current CPC Class: G06F 1/00 (20130101); G06F 1/14 (20130101); H04J 3/0685 (20130101); H04J 3/0667 (20130101); G06F 1/12 (20130101)
Class at Publication: 713/500
International Class: G06F 1/00 (20060101) G06F001/00
Claims
1. An apparatus, comprising: a processor configured to process a
plurality of threads, the processor including: a plurality of
network interfaces for network communication, and a plurality of
cores configured to: generate at least one packet including a
timestamp for synchronization of the network communication with the
network interfaces using a single register write so as to reduce a
latency associated with the timestamp, and communicate the packet
to at least one of the plurality of network interfaces.
2. The apparatus of claim 1, wherein the latency includes memory
latency.
3. The apparatus of claim 1, wherein the latency includes interrupt
latency.
4. The apparatus of claim 3, wherein the interrupt latency is
eliminated by avoiding use of interrupts.
5. The apparatus of claim 1, wherein the at least one of the
plurality of cores is coupled to a network for communicating the
timestamp.
6. The apparatus of claim 1, wherein the processor includes a timer
circuit configured to utilize a time-transfer protocol defined by
IEEE 1588.
7. The apparatus of claim 1, wherein the processor includes a
programmable timer circuit.
8. The apparatus of claim 1, wherein the processor includes a timer
circuit coupled to a plurality of clock frequency sources.
9. The apparatus of claim 1, wherein the processor includes a timer
circuit and wherein the timer circuit includes a divider, a
multiplier, and an offset sub-circuit.
10. The apparatus of claim 1, wherein the processor includes a
timer circuit configured to synchronize the network communication
across each of the plurality of network interfaces.
11. The apparatus of claim 1, wherein the packet is a precision
time protocol packet.
12. The apparatus of claim 1, wherein the packet including the
timestamp is configured to be processed by any selected one of the
cores.
13. The apparatus of claim 2, wherein the apparatus is configured
to reduce the memory latency by communicating a second timestamp
directly from the at least one of the plurality of cores to the at
least one of the plurality of network interfaces.
14. A method of operating a processor including a plurality of
cores and a plurality of network interfaces, comprising: generating
a packet including a timestamp using a single register write to
reduce a latency associated with timestamps used for
synchronization of network communication; and communicating the
packet to at least one of the plurality of network interfaces.
15. The method of claim 14, wherein the latency includes memory
latency.
16. The method of claim 14, wherein the latency includes interrupt
latency.
17. The method of claim 16, further comprising avoiding use of
interrupts to eliminate interrupt latency.
18. The method of claim 14, further comprising coupling at least
one of the plurality of cores to a network for communicating the
timestamps.
19. The method of claim 14, further comprising synchronizing the
network communication across each of the plurality of network
interfaces using a timer circuit.
20. The method of claim 15, further comprising communicating a
second timestamp directly from the at least one of the plurality of
cores to at least one of the plurality of network interfaces to
reduce the memory latency.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of U.S. application Ser.
No. 12/201,689, filed Aug. 29, 2008, which is incorporated herein
by reference in its entirety.
FIELD OF THE INVENTION
[0002] The present invention relates to time-stamping packets in
processors, and more particularly to high-precision time-stamping
of network packets in multi-core, multi-threaded processors.
BACKGROUND
[0003] The Precision Time Protocol (PTP) is a time-transfer
protocol that allows precise synchronization of networks (e.g.,
Ethernet networks). Typically, accuracy within a range of a few nanoseconds may be achieved with this protocol when using hardware-generated timestamps. Often, this protocol is utilized such that a
set of slave devices may determine the offset between time
measurements on their clocks and time measurements on a master
device.
[0004] To date, the use of the PTP time-transfer protocol has been
optimized for systems employing single-core processors. Latency issues arising from interrupts and memory writes render the implementation of such a protocol on other systems inefficient. There
is thus a need for addressing these and/or other issues associated
with the prior art.
SUMMARY
[0005] A system and method are provided for reducing latency
associated with timestamps in a multi-core, multi-threaded
processor. A processor capable of simultaneously processing a
plurality of threads is provided. The processor includes a
plurality of cores, a plurality of network interfaces for network
communication, and a timer circuit for reducing a latency
associated with timestamps used for synchronization of the network
communication utilizing a precision time protocol.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] FIG. 1 shows an apparatus for reducing latency associated
with timestamps in a multi-core, multi-threaded processor, in
accordance with one embodiment.
[0007] FIG. 2 shows a timer circuit for reducing latency associated
with timestamps in a multi-core, multi-threaded processor, in
accordance with one embodiment.
[0008] FIG. 3 shows a timing diagram for synchronizing a clock of a
slave device with a clock of a master device, in accordance with
one embodiment.
[0009] FIG. 4 shows a system for reducing latency associated with
timestamps in a multi-core, multi-threaded processor, in accordance
with another embodiment.
[0010] FIG. 5 shows a system illustrating various agents attached
to a fast messaging network (FMN), in accordance with one
embodiment.
[0011] FIG. 6 illustrates an exemplary system in which the various
architecture and/or functionality of the various previous
embodiments may be implemented.
DETAILED DESCRIPTION
[0012] FIG. 1 shows an apparatus 100 for reducing latency
associated with timestamps in a multi-core, multi-threaded
processor in accordance with one embodiment. As shown, a processor
102 capable of simultaneously processing a plurality of threads is
provided. As shown further, the processor includes a plurality of
cores 104, a plurality of network interfaces 106 for network
communication, and at least one timer circuit 108 for reducing
latency associated with timestamps used for synchronization of the
network communication utilizing a precision time protocol.
[0013] In the context of the present description, a precision time
protocol (PTP) refers to a time-transfer protocol that allows
precise synchronization of networks (e.g., Ethernet-based networks,
wireless networks, etc.). In one embodiment, the precision time
protocol may be defined by IEEE 1588.
[0014] Furthermore, the latency associated with the timestamps may
include memory latency and/or interrupt latency. In this case,
reducing the latency may include reducing the latency with respect
to conventional processor systems. In one embodiment, the interrupt
latency may be reduced or eliminated by avoiding the use of
interrupts.
[0015] In another embodiment, the memory latency may be reduced or
eliminated by avoiding the writing of timestamps to memory. In this
case, the writing of timestamps to memory may be avoided by
directly transferring the timestamps between the plurality of cores
104 and the plurality of network interfaces 106.
[0016] In one embodiment, the cores 104 may each be capable of generating a precision time protocol packet including one of the timestamps, utilizing a single register write. Additionally, a
precision time protocol packet including one of the timestamps may
be capable of being processed by any selected one of the cores 104.
Furthermore, any of the cores 104 may be capable of managing any of
the network interfaces 106.
[0017] More illustrative information will now be set forth
regarding various optional architectures and features with which
the foregoing framework may or may not be implemented, per the
desires of the user. It should be strongly noted that the following
information is set forth for illustrative purposes and should not
be construed as limiting in any manner. Any of the following
features may be optionally incorporated with or without the
exclusion of other features described.
[0018] FIG. 2 shows a timer circuit 200 for increasing precision
and reducing latency associated with timestamps in a multi-core,
multi-threaded processor, in accordance with one embodiment. As an
option, the timer circuit 200 may be implemented in the context of
the functionality of FIG. 1. Of course, however, the timer circuit
200 may be implemented in any desired environment. It should also
be noted that the aforementioned definitions may apply during the
present description.
[0019] As shown, two clock signals (e.g. a 1 GHz CPU clock signal
and a 125 MHz reference clock signal) are input into a first
multiplexer 202. A clock select signal is used to select one of the
CPU clock signal and the reference clock signal. The clock signal
output from the first multiplexer 202 is input into a programmable
clock divider 204, which is utilized to determine a frequency for
updating a first accumulating unit 206. Thus, the programmable
clock divider 204 receives the clock signal and divides the clock
signal by a user programmable ratio such that the first
accumulating unit 206 and an increment value generation portion 208
of the circuit 200 may utilize the divided clock signal as an input
clock signal.
[0020] In operation, the increment value generation portion 208 generates an increment value that is summed with an output of the first accumulating unit 206. The increment value generation portion 208 includes a second accumulating unit 210. Every clock cycle, a numerator value ("Inc_Num") defined by the programmable clock divider 204 is added to the value being tracked by the second accumulating unit 210, producing a sum "Y." The moment the value "Y" becomes greater than or equal to a denominator value ("Inc_Den") defined by the programmable clock divider 204, an output "X" becomes 1 and the 1 is summed with an integer value ("Inc_Int") defined by the programmable clock divider 204, which produces a total increment value that is summed with an output of the first accumulating unit 206 and added to a register of the first accumulating unit 206 every clock cycle. In cycles where "X" is zero, the total increment value is equal to "Inc_Int."
[0021] Furthermore, an offset value "ACC Offset" is added to the
register of the first accumulating unit 206 whenever the register
is written to by software. As an option, this offset value may be
utilized to adjust the value of an output of the timer circuit 200.
For example, the offset value may be used to automatically
synchronize different devices (e.g. a master device and a slave
device, etc.). In one embodiment, this offset value may be provided
by an offset sub-circuit.
[0022] In this way, the programmable clock divider 204 may be
programmed with a ratio "a/b" that may be used to determine a
precision of synchronization. For example, the clock divider 204
may be programmed with a value of 2/3, where "Inc_Num" is equal to
2 and "Inc_Den" is equal to 3 in this case. For this example, the
increment value generation portion 208 will generate the values as
shown in Table 1, where the value of the second accumulator 210 is equal to the Y value of the previous clock cycle if that Y value is less than 3, and equal to the Y value of the previous clock cycle minus 3 when that Y value is greater than or equal to 3.
TABLE 1

  Clock Cycle Number        1  2  3  4  5
  Second Accumulator Value  0  2  1  0  2
  Y                         2  4  3  2  4
  X                         0  1  0  1  1
[0023] The output "X" is summed with an output of the first
accumulating unit 206 and added to a register of the first
accumulating unit 206 every clock cycle. Furthermore, an offset
value "ACC Offset" may be added to the register of the first
accumulating unit 206 whenever the register is written to by
software. In the case that a/b is equal to 5/3 (i.e. 1 and 2/3),
"Inc_Num" is equal to 2, "Inc_Den" is equal to 3, and "Inc_Int" is
equal to 1. Thus, the output "X" will be the same as illustrated in
Table 1. The output "X" is then summed with "Inc_Int," or 1 in this
case, the output of the first accumulating unit 206, and then added
to a register of the first accumulating unit 206 every clock
cycle.
[0024] Table 2 shows logic associated with the increment value
generation portion 208, in accordance with one embodiment.
TABLE 2

  if (y >= Inc_Den) X = 1; else X = 0;
[0025] Ultimately, when the programmable timer 200 is programmed
with a ratio "a/b," where "a" is less than "b," the value of "a" is
added to the first accumulating unit 206 every "b" number of clock
cycles. When the programmable timer 200 is programmed with a ratio
"a/b" where "a" is greater than "b," "a/b" may be viewed as
"c+(a1/b1)," and a value of "a1" is added to the first accumulating
unit 206 every "b1" number of clock cycles and "c" is added to the
first accumulating unit 206 every clock cycle. In other words, when
the programmable timer 200 is programmed with a ratio "a/b," where
"a" is less than "b," "a/b" corresponds to "Inc_Num/Inc_Den." When
the programmable timer 200 is programmed with a ratio "a/b," where
"a" is greater than "b," "a/b" corresponds to "c+(a1/b1)," or
"Inc_Int+(Inc_Num/Inc_Den)." The programmable clock divider 204 is
present to reduce the incoming high-frequency clock to a lower frequency to reduce power consumption. However, the precision of
the clock circuit 200 is still quite high because it allows the
clock increment value to be any number that can be represented by
"a/b."
[0026] Thus, for every clock cycle, "Inc_Int" is added to the first
accumulating unit 206. Additionally, for every "Inc_Den" number of
clock cycles, "Inc_Num" is added to the first accumulating unit
206. As noted above, the increment value generation portion 208 is
utilized to determine the "Inc_Den" number of clock cycles and when
"Inc_Num" is to be added to the first accumulating unit 206.
Accordingly, the programmable clock timer 200 may be programmed
with any proper or improper fraction such that the first
accumulating unit 206 increments utilizing that value.
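The behavior described in paragraphs [0020] through [0026] can be summarized with a short software model. The following C sketch is illustrative only: the names inc_int, inc_num, inc_den, frac, and timer are assumptions standing in for "Inc_Int," "Inc_Num," "Inc_Den," the second accumulating unit 210, and the first accumulating unit 206, respectively, and the model ignores register widths and the clock select/divider stages.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* Program the timer with a/b = 5/3, i.e. Inc_Int = 1, Inc_Num = 2,
         * Inc_Den = 3, as in the example of paragraph [0023]. */
        const uint64_t inc_int = 1, inc_num = 2, inc_den = 3;

        uint64_t timer = 0;  /* first accumulating unit 206  */
        uint64_t frac  = 0;  /* second accumulating unit 210 */

        for (int cycle = 1; cycle <= 6; cycle++) {
            uint64_t y = frac + inc_num;         /* candidate sum "Y"    */
            uint64_t x = (y >= inc_den) ? 1 : 0; /* Table 2 logic        */
            frac = x ? (y - inc_den) : y;        /* wrap the accumulator */
            timer += inc_int + x;                /* total increment      */
            printf("cycle %d: X=%llu timer=%llu\n",
                   cycle, (unsigned long long)x, (unsigned long long)timer);
        }
        /* Every 3 cycles the timer advances by 3*1 + 2 = 5 counts, i.e.
         * exactly a/b = 5/3 per cycle on average. */
        return 0;
    }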
[0027] The output of the first accumulating unit 206 may then be
used as the timer circuit output. Thus, the timer circuit clock
accuracy may be established based on this programmable value. In
this way, a source clock may be slower than the effective timer rate.
Accordingly, the programmable timer circuit 200, fed by a plurality
of clock frequency sources, may be utilized for synchronization of
network communication across each of a plurality of network
interfaces.
[0028] It should be noted that, in one embodiment, the first
accumulating unit 206 and/or the second accumulating unit 210 may
represent a clocking mechanism for IEEE 1588 timers.

FIG. 3 shows a
timing diagram 300 for synchronizing a clock of a slave device with
a clock of a master device, in accordance with one embodiment. As
an option, the timing diagram 300 may be implemented in the context
of the functionality of FIGS. 1-2. Of course, however, the timing
diagram 300 may be implemented in any desired environment. Again,
the aforementioned definitions may apply during the present
description.
[0029] As shown, a master device sends a synchronization message to
a slave device. The master device samples the precise time (t1)
when the message left the interface. The slave device then receives
this synchronization message and records the precise time (t2) that
the message was received.
[0030] The master device then sends a follow up message including
the precise time when the synchronization message left the master
device interface. The slave device then sends a delay request
message to the master. The slave device also samples the time (t3)
when this message left the interface.
[0031] The master device then samples the exact time (t4) when it
receives the delay request message. A delay response message
including this time is then sent to the slave device. The slave
device then uses t1, t2, t3, and t4 to synchronize the slave clock
with the clock of the master device.
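Although the text does not spell out the arithmetic, the slave typically combines the four timestamps using the standard IEEE 1588 formulas, which assume a symmetric path delay. A minimal C sketch with illustrative timestamp values:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* Example timestamps in nanoseconds. */
        int64_t t1 = 1000;  /* master: sync message sent      */
        int64_t t2 = 1530;  /* slave:  sync message received  */
        int64_t t3 = 2000;  /* slave:  delay request sent     */
        int64_t t4 = 2470;  /* master: delay request received */

        int64_t delay  = ((t2 - t1) + (t4 - t3)) / 2; /* one-way path delay */
        int64_t offset = ((t2 - t1) - (t4 - t3)) / 2; /* slave minus master */

        printf("path delay = %lld ns, slave offset = %lld ns\n",
               (long long)delay, (long long)offset);
        /* The slave may then correct its clock, e.g. via the "ACC Offset"
         * input of the timer circuit 200 in FIG. 2. */
        return 0;
    }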
[0032] FIG. 4 shows a system 400 for reducing latency associated
with timestamps in a multi-core, multi-threaded processor, in
accordance with another embodiment. As an option, the system 400
may be implemented in the context of the functionality of FIGS.
1-3. Of course, however, the system 400 may be implemented in any
desired environment. Further, the aforementioned definitions may
apply during the present description.
[0033] As shown, the system 400 includes a plurality of central
processing units (CPUs) 402 and a plurality of network interfaces
404. The CPUs 402 and the network interfaces 404 are capable of
communicating over a fast messaging network (FMN) 406. All
components on the FMN 406 may communicate directly with any other
components on the FMN 406.
[0034] For example, any one of the plurality of CPUs 402 may
communicate timestamps directly to any one of the network
interfaces 404 utilizing the FMN 406. Similarly, any one of the
plurality of network interfaces 404 may communicate timestamps
directly to any one of the CPUs 402 utilizing the FMN 406. In this
way, a memory latency introduced by writing the timestamps to
memory before communicating the timestamps between a CPU and
network interface may be avoided. Furthermore, by transferring the
timestamps directly between the CPUs 402 and the network interfaces
404 utilizing the FMN 406, the use of interrupts may be
avoided.
[0035] For example, one of the network interfaces 404 may receive a
packet, write the packet to memory 408, generate a descriptor
including address, length, status, and control information, and
forward the descriptor to one of the CPUs 402 over the FMN 406. In
this case, a timestamp generated at the network interface 404 may
also be included in the descriptor sent to one of the CPUs 402 over
the FMN 406. Thus, any memory latency that would occur from writing
the timestamp to memory is avoided. Furthermore, because the CPU
402 receives the packet information and the timestamp as part of
the descriptor, the CPU 402 is not interrupted from any processing.
Thus, interrupts may be avoided by transferring the timestamp directly over the FMN 406. Furthermore, avoiding
interrupts enables the master device to simultaneously attempt
synchronization of timestamps with a plurality of slave devices,
thereby reducing latency in achieving network-wide timer
synchronization.
[0036] In one embodiment, a unique descriptor format for PTP
packets (e.g. PTP 1588) may be utilized that allows the CPUs 402 to
construct and transmit PTP packets with a single register write. In
other words, each of the cores may be capable of generating a
precision time protocol packet including one of the timestamps
utilizing a single register write.
[0037] For example, a descriptor may be designated as an IEEE 1588
format, and may include address, length, status, and control
information. This descriptor may be sent from any of the CPUs 402
to any of the network interfaces 404 and cause an IEEE 1588 format
packet to be generated and transmitted. The network interface 404
may then capture a timestamp corresponding to the IEEE 1588 packet
exiting the network interface 404 and return a follow up descriptor
with the captured timestamp to the CPU 402 utilizing the FMN 406.
Thus, interrupt and memory latency may be avoided. Further,
multiple IEEE 1588 packets may be generated by a plurality of CPUs
and sent to multiple networking interfaces, in parallel, thereby
allowing for timer synchronization with multiple slave devices,
simultaneously.
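A hypothetical C sketch of this transmit path is shown below. The descriptor layout, field widths, and the name FMN_TX_REG are assumptions for illustration; the text does not define a bit-level descriptor format or register map.

    #include <stdint.h>

    /* Assumed layout of an IEEE 1588 format transmit descriptor
     * (address, length, status, and control, per paragraph [0037],
     * plus the free back ID of paragraph [0039]). */
    struct ptp_tx_descriptor {
        uint64_t address;      /* physical address of the PTP packet data */
        uint32_t length;       /* packet length in bytes                  */
        uint32_t status;       /* completion/status bits                  */
        uint32_t control;      /* e.g. IEEE 1588 format flag              */
        uint32_t free_back_id; /* CPU/thread to receive the follow-up     */
    };

    /* Assumed FMN transmit message register. */
    extern volatile uint64_t FMN_TX_REG;

    void ptp_send(const struct ptp_tx_descriptor *d) {
        /* The core hands the descriptor to the network interface as one
         * FMN message, modeled here as a single register write; no
         * interrupt is taken and no timestamp is written to memory. */
        FMN_TX_REG = (uint64_t)(uintptr_t)d;
        /* The interface transmits the packet, captures the egress
         * timestamp, and returns a follow-up descriptor over the FMN. */
    }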
[0038] It should be noted that any of the network interfaces 404
may utilize any of the CPUs 402 to process a timestamp. Thus,
single or multiple time clock masters may be utilized on a per
network interface basis. Furthermore, any of the cores may be capable of managing any of the network interfaces 404. Additionally,
the network interfaces 404 may include a master network interface
and a slave network interface.
[0039] In one embodiment, a free back ID may be included in the
descriptor. In this case, the free back ID may be used to define the CPU or thread to which a descriptor and an included timestamp are routed when the descriptor is being sent from one of the network interfaces
404. In this way, the free back ID may allow a captured timestamp
to be routed to any CPU and/or thread in a multi-core,
multi-threaded processor.
[0040] It should be noted that any number of CPUs 402 and any
number of network interfaces 404 may be utilized. For example, in
various embodiments, 8, 16, 32, or more CPUs may be utilized. As an
option, the CPUs may include one or more virtual CPUs.
[0041] FIG. 5 shows a system 500 illustrating various agents
attached to a fast messaging network (FMN), in accordance with one
embodiment. As an option, the present system 500 may be implemented
in the context of the functionality and architecture of FIGS. 1-4.
Of course, however, the system 500 may be implemented in any
desired environment. Again, the aforementioned definitions may
apply during the present description.
[0042] As shown, eight cores (Core-0 502-0 through Core-7 502-7)
along with associated data caches (D-cache 504-0 through 504-7) and
instruction caches (I-cache 506-0 through 506-7) may interface to
an FMN. Further, Network I/O Interface Groups can also interface to
the FMN. Associated with a Port A, a DMA 508-A, a Parser/Classifier
512-A, and an XGMII/SPI-4.2 Port A 514-A can interface to the FMN
through a Packet Distribution Engine (PDE) 510-A. Similarly, for a
Port B, a DMA 508-B, a Parser/Classifier 512-B, and an
XGMII/SPI-4.2 Port B 514-B can interface to the FMN through a PDE
510-B. Also, a DMA 516, a Parser/Classifier 520, an RGMII Port A
522-A, an RGMII Port B 522-B, an RGMII Port C 522-C, and an RGMII
Port D 522-D can interface to the FMN through a PDE 518. Also, a
Security Acceleration Engine 524 including a DMA 526 and a DMA
Engine 528 can interface to the FMN.
[0043] In one embodiment, all agents (e.g. cores/threads or
networking interfaces, such as shown in FIG. 5) on the FMN can send
a message to any other agent on the FMN. This structure can allow
for fast packet movement among the agents, but software can alter the use of the messaging system for any other appropriate purpose by defining the syntax and semantics of the message container accordingly.
In any event, each agent on the FMN may include a transmit queue
and a receive queue. Accordingly, messages intended for a
particular agent can be dropped into the associated receive queue.
All messages originating from a particular agent can be entered
into the associated transmit queue and subsequently pushed on the
FMN for delivery to the intended recipient.
[0044] In another aspect of embodiments of the invention, all threads of each core (e.g., Core-0 502-0 through Core-7 502-7) can share that core's queue resources. In order to ensure fairness in sending
out messages, a "round-robin" scheme may be implemented for
accepting messages into the transmit queue. This can guarantee that
all threads have the ability to send out messages even when one of
them is issuing messages at a faster rate. Accordingly, it is
possible that a given transmit queue may be full at the time a
message is issued. In such a case, all threads may be allowed to
queue up one message each inside the core until the transmit queue
has room to accept more messages. Further, the networking
interfaces may use the PDE to distribute incoming packets to the
designated threads. Further, outgoing packets for the networking
interfaces may be routed through packet ordering software.
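A minimal sketch of such a round-robin acceptance scheme, with assumed types and sizes (the text does not specify the arbiter at this level of detail):

    #include <stdbool.h>

    #define NUM_THREADS 4

    /* Round-robin selection among per-thread pending messages. Returns
     * the index of the next thread allowed to enqueue into the shared
     * transmit queue, or -1 if no thread has a message pending. */
    int rr_next(const bool pending[NUM_THREADS], int *last) {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int t = (*last + i) % NUM_THREADS;
            if (pending[t]) { *last = t; return t; }
        }
        return -1;  /* nothing pending */
    }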
[0045] As an example of one implementation of the system 500,
packets may be received by a network interface. The network
interface may include any network interface. For example, in
various embodiments, the network interface may include a Gigabit
Media Independent Interface (GMII), a Reduced Gigabit Media
Independent Interface (RGMII), or any other network interface.
[0046] When the network interface begins to receive a packet, the
network interface stores the packet data in memory, and notifies
software of the arrival of the packet, along with a notification of
the location of the packet in memory. In this case, the storing and
the notification may be performed automatically by the network
interface, based on parameters set up by software.
[0047] In one embodiment, storing the packet may include allocating
memory buffers to store the packet. For example, as packet data
arrives, a DMA may consume preallocated memory buffers and store
packet data in memory. As an option, the notification of the
arrival of the packet may include deciding which thread of a
plurality of CPUs should be notified of the arrival.
[0048] In one embodiment, the incoming packet data may be parsed
and classified. Based on this classification, a recipient thread
may be selected from a pool of candidate recipient threads that are
designed to handle packets of this kind. A message may then be sent
via the FMN to the designated thread announcing its arrival. By
providing a flexible feedback mechanism from the recipient thread,
the networking interfaces may achieve load balancing across a set
of threads.
[0049] A single FMN message may contain a plurality of packet
descriptors. Additional FMN messages may be generated as desired to
represent long packets. In one embodiment, packet descriptors may
contain address data, packet length, and port of origin data. One
packet descriptor format may include a pointer to the packet data
stored in memory. In another case, a packet descriptor format may
include a pointer to an array of packet descriptors, allowing for
packets of virtually unlimited size to be represented.
[0050] As an option, a bit field may indicate the last packet
descriptor in a sequence. Using packet descriptors, network
accelerators and threads may send and receive packets, create new
packets, and forward packets to other threads or to any device, such as a network interface, for transmission. When a packet is finally
consumed, such as at the transmitting networking interface, the
exhausted packet buffer may be returned to the originating
interface so it can be reused.
[0051] In one embodiment, facilities may exist to return freed
packet descriptors back to their origin across the FMN without
thread intervention. Although FMN messages may be transmitted in
packet descriptor format, the FMN may be implemented as a general
purpose message-passing system that can be used by threads to
communicate arbitrary information among them.
[0052] In another implementation, at system start-up, software may
provide all network interfaces with lists of fixed-size
pre-allocated memory called packet buffers to store incoming packet
data. Pointers to the packet buffers may then be encapsulated in packet descriptors and sent via the FMN to the various network
interfaces.
[0053] Each interface may contain a Free-In Descriptor FIFO used to
queue up these descriptors. Each of these FIFOs may correspond to a
bucket on the FMN. At startup, initialization software may populate
these FIFOs with free packet descriptors. In one embodiment, the
Free-In Descriptor FIFO may hold a fixed number of packet
descriptors on-chip (e.g. 128, 256, etc.) and be extended into
memory using a "spill" mechanism.
[0054] For example, when a FIFO fills up, spill regions in memory
may be utilized to store subsequent descriptors. These spill
regions may be made large enough to hold all descriptors necessary
for a specific interface. As an option, the spill regions holding
the free packet descriptors may also be cached.
[0055] When a packet comes in through the receive side of the
network interfaces, a free packet descriptor may be popped from the
Free-In Descriptor FIFO. The memory address pointer in the
descriptor may then be passed to a DMA engine which starts sending
the packet data to a memory subsystem. As many additional packet
descriptors may be popped from the Free-In Descriptor FIFO as are
utilized to store the entire packet. In this case, the last packet
descriptor may have an end-of-packet bit set.
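A minimal software model of the Free-In Descriptor FIFO with spill, assuming illustrative sizes and names (the text gives 128 or 256 on-chip entries as examples); for brevity the sketch uses stack order, whereas the hardware structure is a FIFO:

    #include <stdint.h>
    #include <stdbool.h>

    #define ONCHIP_DEPTH 128   /* on-chip descriptor slots (e.g. 128, 256)    */
    #define SPILL_SLOTS  4096  /* spill region in memory, sized per interface */

    struct free_in_fifo {
        uint64_t onchip[ONCHIP_DEPTH];
        uint64_t *spill;       /* preallocated spill region in memory */
        int count;             /* descriptors currently on-chip       */
        int spilled;           /* descriptors currently spilled       */
    };

    /* Initialization software (or a returning interface) adds a free
     * packet descriptor; overflow goes to the spill region. */
    bool free_in_push(struct free_in_fifo *f, uint64_t desc) {
        if (f->count < ONCHIP_DEPTH) { f->onchip[f->count++] = desc; return true; }
        if (f->spilled < SPILL_SLOTS) { f->spill[f->spilled++] = desc; return true; }
        return false;          /* spill region is sized to avoid this */
    }

    /* The receive path pops one descriptor per buffer of packet data. */
    bool free_in_pop(struct free_in_fifo *f, uint64_t *desc) {
        if (f->count > 0)   { *desc = f->onchip[--f->count];  return true; }
        if (f->spilled > 0) { *desc = f->spill[--f->spilled]; return true; }
        return false;          /* no free packet buffers available    */
    }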
[0056] In various embodiments, the packet descriptor may include
different formats. For example, in one embodiment, a receive packet
descriptor format may be used by the ingress side of network
interfaces to pass pointers to packet buffers and other useful
information to threads.
[0057] In another embodiment, a P2D type packet descriptor may be
used by the egress side of network interfaces to access pointers to
packet buffers to be transmitted. In this case, the P2D packet
descriptors may contain the physical address location from which
the transmitting DMA engine of the transmitting network interface
will read packet data to be transmitted. As an option, the physical
address may be byte-aligned or cache-line aligned. Additionally, a
length field may be included within P2D Descriptors which describes
the length of useful packet data in bytes.
[0058] In still another embodiment, a P2P type descriptor may be
used by the egress side of network interfaces to access packet data
of virtually unlimited size. The P2P type descriptors may allow FMN
messages to convey a virtually unlimited number of P2D type
descriptors. As an option, the physical address field specified in
the P2P type descriptor may resolve to the address of a table of
P2D type descriptors. In other embodiments, a free back descriptor
may be used by the network interfaces to indicate completion of
packet processing and a free in descriptor may be sent from threads
during initialization to populate the various descriptor FIFOs with
free packet descriptors.
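The descriptor types above might be encoded along the following lines; the field widths and names are assumptions, since the text describes the fields only at a functional level.

    #include <stdint.h>

    /* P2D: points directly at packet data in memory. */
    struct p2d_descriptor {
        uint64_t phys_addr;   /* byte- or cache-line-aligned data address */
        uint32_t length;      /* length of useful packet data, in bytes   */
        uint32_t flags;       /* e.g. EOP (end-of-packet) bit             */
    };

    /* P2P: points at a table of P2D descriptors, allowing packets of
     * virtually unlimited size to be represented. */
    struct p2p_descriptor {
        uint64_t table_addr;  /* physical address of the P2D table  */
        uint32_t num_entries; /* number of P2D entries in the table */
    };

For example, the three-chunk packet of paragraph [0060] would be described by three p2d_descriptor entries (headers, payload, trailer), with the EOP flag set only in the last one.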
[0059] In one embodiment, four P2D packet descriptors may be used
to describe the packet data to be sent. For example, a descriptor
"A1" may contain a byte-aligned address which specifies the
physical memory location containing the packet data used for
constructing the packet to be transmitted, a total of four of which
comprise the entire packet. The byte-aligned length and
byte-aligned address fields in each packet descriptor may be used
to characterize the four components of the packet data to be
transmitted. Furthermore, a descriptor "A4" may have an EOP bit set
to signify that this is the last descriptor for this packet.
[0060] Since P2D packets can represent multiple components of a
packet, packet data need not be contiguous. For example, a
descriptor "A1" may address a buffer containing an Authentication
Header (AH) and Encapsulating Security Payload (ESP) headers,
which may be the first chunk of data needed to build up the packet.
Likewise, the second chunk of data required is likely the payload
data, addressed by a descriptor "A2." The ESP authentication data
and ESP trailer are the last chunk of data needed to build the
packet, and so may be pointed to by a last descriptor "A3," which
also has the EOP bit set signifying that this is the last chunk of
data being used to form the packet. In a similar manner, other
fields, such as VLAN tags, could be inserted into packets by using
the byte-addressable pointers available in the P2D descriptors.
[0061] FIG. 6 illustrates an exemplary system 600 in which the
various architecture and/or functionality of the various previous
embodiments may be implemented. As shown, a system 600 is provided
including at least one host processor 601 which is connected to a
communication bus 602. The system 600 may also include a main
memory 604. Control logic (software) and data may be stored in the
main memory 604 which may take the form of random access memory
(RAM).
[0062] The system 600 may also include a graphics processor 606 and
a display 608, i.e. a computer monitor. In one embodiment, the
graphics processor 606 may include a plurality of shader modules, a
rasterization module, etc. Each of the foregoing modules may even
be situated on a single semiconductor platform to form a graphics
processing unit (GPU).
[0063] In the present description, a single semiconductor platform
may refer to a sole unitary semiconductor-based integrated circuit
or chip. It should be noted that the term single semiconductor
platform may also refer to multi-chip modules with increased
connectivity which simulate on-chip operation, and make substantial
improvements over utilizing a conventional central processing unit
(CPU) and bus implementation. Of course, the various modules may
also be situated separately or in various combinations of
semiconductor platforms per the desires of the user.
[0064] The system 600 may also include a secondary storage 610. The
secondary storage 610 includes, for example, a hard disk drive
and/or a removable storage drive, representing a floppy disk drive,
a magnetic tape drive, a compact disk drive, etc. The removable
storage drive reads from and/or writes to a removable storage unit
in a well known manner.
[0065] Computer programs, or computer control logic algorithms, may
be stored in the main memory 604 and/or the secondary storage 610.
Such computer programs, when executed, enable the system 600 to
perform various functions. Memory 604, storage 610 and/or any other
storage are possible examples of computer-readable media.
[0066] In one embodiment, the architecture and/or functionality of
the various previous figures may be implemented in the context of
the host processor 601, graphics processor 606, an integrated
circuit (not shown) that is capable of at least a portion of the
capabilities of both the host processor 601 and the graphics
processor 606, a chipset (i.e. a group of integrated circuits
designed to work and sold as a unit for performing related
functions, etc.), and/or any other integrated circuit for that
matter.
[0067] Still yet, the architecture and/or functionality of the
various previous figures may be implemented in the context of a
general computer system, a circuit board system, a game console
system dedicated for entertainment purposes, an
application-specific system, and/or any other desired system. For
example, the system 600 may take the form of a desktop computer,
lap-top computer, and/or any other type of logic. Still yet, the
system 600 may take the form of various other devices including,
but not limited to, a personal digital assistant (PDA) device, a
mobile phone device, a television, etc.
[0068] Further, while not shown, the system 600 may be coupled to a
network (e.g. a telecommunications network, local area network
(LAN), wireless network, wide area network (WAN) such as the
Internet, peer-to-peer network, cable network, etc.) for
communication purposes.
[0069] While various embodiments have been described above, it
should be understood that they have been presented by way of
example only, and not limitation. Thus, the breadth and scope of a
preferred embodiment should not be limited by any of the
above-described exemplary embodiments, but should be defined only
in accordance with the following claims and their equivalents.
* * * * *