U.S. patent application number 11/042476 was filed with the patent office on 2006-03-16 for multi-core debugger.
This patent application is currently assigned to Cavium Networks. Invention is credited to Michael S. Bertone, David A. Carlson, Philip H. Dickinson, Muhammad R. Hussain, Richard E. Kessler, Trent Parker.
Publication Number: 20060059286
Application Number: 11/042476
Family ID: 38731731
Filed Date: 2006-03-16

United States Patent Application 20060059286
Kind Code: A1
Bertone; Michael S.; et al.
March 16, 2006
Multi-core debugger
Abstract
In a multi-core processor, a high-speed interrupt-signal
interconnect allows more than one of the processors to be
interrupted at substantially the same time. For example, a global
signal interconnect is coupled to each of the multiple processors,
each processor being configured to selectively provide an interrupt
signal, or pulse, thereon. Preferably, each of the processor cores
is capable of pulsing the global signal interconnect during every
clock cycle to minimize delay between a triggering event and its
respective interrupt signal. Each of the multiple processors also
senses, or samples, the global signal interconnect, preferably
during the same cycle within which the pulse was provided, to
determine the existence of an interrupt signal. Upon sensing an
interrupt signal, each of the multiple processors responds to it
substantially simultaneously. For example, an interrupt signal
sampled by each of the multiple processors causes each processor to
invoke a debug handler routine.
Inventors: Bertone; Michael S.; (Marlborough, MA); Carlson; David A.; (Haslet, TX); Kessler; Richard E.; (Shrewsbury, MA); Dickinson; Philip H.; (Cupertino, CA); Hussain; Muhammad R.; (Pleasanton, CA); Parker; Trent; (San Jose, CA)
Correspondence Address:
HAMILTON, BROOK, SMITH & REYNOLDS, P.C.
530 VIRGINIA ROAD
P.O. BOX 9133
CONCORD, MA 01742-9133
US
Assignee: Cavium Networks (Santa Clara, CA)
Family ID: 38731731
Appl. No.: 11/042476
Filed: January 25, 2005
Related U.S. Patent Documents

Application Number: 60/609,211
Filing Date: Sep 10, 2004
Current U.S. Class: 710/260; 714/E11.207
Current CPC Class: G06F 9/30014 (20130101); G06F 9/30043 (20130101); G06F 11/3632 (20130101); G06F 12/0815 (20130101); G06F 12/0835 (20130101); G06F 12/084 (20130101); G06F 9/383 (20130101); G06F 12/0875 (20130101); G06F 2212/6022 (20130101); G06F 9/30138 (20130101); G06F 12/0804 (20130101); G06F 13/24 (20130101); G06F 2212/6012 (20130101); G06F 12/0813 (20130101); G06F 9/3824 (20130101); G06F 12/0891 (20130101)
Class at Publication: 710/260
International Class: G06F 13/24 (20060101)
Claims
1. A multi-core processor comprising: a plurality of independent
processor cores, each processor core executing instructions and
operating in parallel to perform work; each of the plurality of
independent processor cores respectively including: an
interrupt-signal sensor; and an interrupt-signal generator
selectively providing an interrupt signal; and a global
interrupt-signal interconnect in electrical communication with each
of the plurality of independent processor cores, more than one of
the processor cores respectively interrupting its execution of
instructions substantially simultaneously responsive to sampling
with respective interrupt-signal sensors an interrupt signal on the
global interrupt-signal interconnect.
2. The multi-core processor of claim 1, wherein the respective
interrupt-signal generator of each of the plurality of independent
processor cores is coupled to the global interrupt-signal
interconnect.
3. The multi-core processor of claim 2, wherein the respective
interrupt-signal generator of each of the plurality of independent
processor cores is coupled to the global interrupt-signal
interconnect in a wired-OR configuration.
4. The multi-core processor of claim 1, wherein an interrupt signal
is provided in response to a write to a register in one of the
plurality of independent processor cores.
5. The multi-core processor of claim 1, wherein an interrupt signal
is provided in response to execution of a debug breakpoint
instruction in one of the plurality of independent processor
cores.
6. The multi-core processor of claim 1, wherein an interrupt signal
is provided in response to detection of an instruction or data
breakpoint match in one of the plurality of independent processor
cores.
7. The multi-core processor of claim 1, wherein the global
interrupt-signal interconnect comprises a plurality of independent
global interrupt-signal interconnects, each of the independent
global interrupt-signal interconnects representing a respective
interrupt signal.
8. The multi-core processor of claim 1, further comprising a trace
buffer coupled to the global interrupt-signal interconnect, the
trace buffer being configured to monitor memory transactions of the
independent processor cores in response to an interrupt signal on
the global interrupt-signal interconnect.
9. The multi-core processor of claim 1, wherein each of the
plurality of independent processor cores comprises a respective
register storing information, the register configurable according
to the sampled interrupt signal.
10. The multi-core processor of claim 1, further comprising a
core-processor clock signal for coordinating execution of the
instructions, wherein the interrupt-signal sensor samples the
global interrupt-signal interconnect during each cycle of the
core-processor clock signal.
11. The multi-core processor of claim 10, wherein each processor
core respectively interrupts its execution of instructions within
three core-processor clock cycles of sampling an interrupt signal
on the global interrupt-signal interconnect.
12. The multi-core processor of claim 1, wherein the global
interrupt-signal interconnect is used to communicate after the
plurality of processor cores are interrupted.
13. A method of debugging a multi-core processor comprising the
steps of: selectively providing an interrupt signal on a global
interrupt-signal interconnect, the global interrupt-signal
interconnect coupled to each of a plurality of processor cores
comprising the multi-core processor; sampling the provided
interrupt signal at each of the plurality of processor cores; and
interrupting execution of more than one of the plurality of
processor cores substantially simultaneously responsive to the
sensed interrupt signal.
14. The method of claim 13, wherein the interrupt signal is
selectively provided by one of the plurality of processor
cores.
15. The method of claim 14, wherein the interrupt signal is
provided in response to software control.
16. The method of claim 15, wherein the software control comprises
software writing a value to a register.
17. The method of claim 14, wherein the interrupt signal is
provided in response to execution of a debug breakpoint
instruction.
18. The method of claim 14, wherein the interrupt signal is
provided in response to a breakpoint match.
19. The method of claim 13, further comprising entering a debug
handler routine at each of the interrupted processor cores.
20. The method of claim 19, wherein each of the interrupted
processor cores communicates with an external device responsive to
entering the debug handler routine.
21. The method of claim 20, wherein each of the plurality of
processor cores communicates with the external device using a Joint
Test Action Group (JTAG) test access port.
22. The method of claim 20, further comprising using the global
interrupt-signal interconnect to communicate after the plurality of
processor cores are interrupted.
23. A multi-core processor comprising: means for selectively
providing an interrupt signal on a global interrupt-signal
interconnect, the global interrupt-signal interconnect coupled to
each of a plurality of processor cores comprising the multi-core
processor; means for sensing the provided interrupt signal at each
of the plurality of processor cores; and means for interrupting
execution of more than one of the plurality of processors
substantially simultaneously responsive to a sensed interrupt
signal.
Description
RELATED APPLICATION
[0001] This application claims the benefit of U.S. Provisional
Application No. 60/609,211, filed on Sep. 10, 2004. The entire
teachings of the above application are incorporated herein by
reference.
BACKGROUND OF THE INVENTION
[0002] Complex computer systems and programs rarely work exactly as
designed. During the development of a new computer system,
unexpected errors or bugs may be discovered by thorough testing and
exhaustive execution of a variety of programs and applications. The
source or cause of an error is often not apparent from the error
itself; many times, an error manifests itself by locking up the target
system for no apparent reason. Thus, tracking down the source of
the error can be problematic.
[0003] Software and system developers commonly use tools referred
to as debuggers to identify the source of unexpected errors and to
assist in their resolution. A debugger is a software program used
to break (i.e., interrupt) program execution at one or more
locations in an application program. Once interrupted, a user is
presented with a debugger command prompt for entering debugger
commands that will allow for setting breakpoints, displaying or
changing memory, single stepping, and so forth. Often, processors
include onboard features accessible by a debugger to facilitate
access to and operation of the processor during debugging.
[0004] One of the most difficult tasks facing designers of embedded
systems today is emulating and debugging embedded hardware and
software in a real-world environment. Embedded systems are growing
more complex, offering increasingly higher levels of performance,
and using larger software programs than ever before. To meet the
challenges of dealing with embedded systems, engineers and
programmers seek advanced tools that enable appropriate levels of
debugging.
[0005] Tracking down problems is particularly challenging when the
target system includes a multi-core processor. Multi-core
processors include two or more processor cores that are each
capable of simultaneously executing independent programs.
[0006] Using standard debug features that may be provided with the
individual processor cores of the multi-core processor can provide
insight into operation of the individual processor cores. Assessing
operation of parallel applications being developed and executed on
the multi-core processor system by debugging an individual
processor core will generally be inadequate. Namely, if an
operation of a first processor is interrupted as described above,
the other processors will continue to operate, thereby changing the
state of the system with each subsequent clock cycle as measured
from the moment of interrupt.
SUMMARY OF THE INVENTION
[0007] A multi-core processor includes a global interrupt
capability that selectively breaks operation of more than one of
the multiple processor cores at substantially the same time,
usually within a few clock cycles. A global interrupt-signal
interconnect is coupled to each of the plurality of independent
processor cores. Each of the processor cores includes an
interrupt-signal sensor for sampling an interrupt signal on the
global interrupt-signal interconnect and an interrupt-signal generator for
selectively providing an interrupt signal. Each processor core
respectively interrupts its execution of instructions in response
to sampling an interrupt signal on the global interrupt-signal
interconnect.
[0008] The respective interrupt-signal generator of each of the
plurality of independent processor cores is coupled to the global
interrupt-signal interconnect. Outputs from the respective
interrupt-signal generators can be coupled together and further to
the global interrupt-signal interconnect in a wired-OR
configuration. Thus, each of the processor cores can individually
assert an interrupt signal on the same global interrupt-signal
interconnect.
[0009] The multi-core processor can further include an interface
adapted to connect to an external device. For example, the
interface can be defined by a Joint Test Action Group (JTAG)
interface. In some embodiments, more than one global
interrupt-signal interconnect is provided. In such a
configuration, each of the global interrupt-signal interconnects
can represent a different interrupt signal. Additionally,
information that may be relevant to debugging the multi-core
processor can be provided by a combination of signals asserted on
the multiple interrupt-signal interconnects.
[0010] In some embodiments, the plurality of independent processor
cores resides on a single semiconductor die. The independent
processor cores can be Reduced Instruction Set Computer (RISC)
processors. Alternatively or in addition, each of the multiple
independent processor cores includes a respective register storing
information configurable according to the sampled interrupt
signal.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] The foregoing and other objects, features and advantages of
the invention will be apparent from the following more particular
description of preferred embodiments of the invention, as
illustrated in the accompanying drawings in which like reference
characters refer to the same parts throughout the different views.
The drawings are not necessarily to scale, emphasis instead being
placed upon illustrating the principles of the invention.
[0012] FIG. 1 is a block diagram of a security appliance including
a network services processor according to the principles of the
present invention;
[0013] FIG. 2 is a block diagram of the network services processor
shown in FIG. 1;
[0014] FIGS. 3A and 3B are block diagrams illustrating exemplary
embodiments of a multi-core debug architecture;
[0015] FIG. 4 is a more detailed block diagram of one of the
processor cores shown in FIGS. 3A and 3B;
[0016] FIG. 5 is a schematic block diagram of a debug register;
[0017] FIG. 6 is a schematic block diagram of the Multi Core Debug
(MCD) register;
[0018] FIG. 7A is a schematic diagram of an exemplary Test Access
Port (TAP) controller;
[0019] FIG. 7B is a more-detailed block diagram illustrating the
interconnection of the TAPs among the multiple processor cores
shown in FIGS. 3A and 3B; and
[0020] FIG. 8 is a more-detailed block diagram of the debug
architecture within one of the processor cores shown in FIGS. 3A
and 3B.
DETAILED DESCRIPTION OF THE INVENTION
[0021] A description of preferred embodiments of the invention
follows.
[0022] Applications for multi-core processors are as limitless as
applications that use a single microprocessor. Some applications
that are particularly well suited for multi-core processors include
telecommunications and networking. Having multiple processor cores
enables a single sizeable task to be broken down into several
smaller, more manageable subtasks, each subtask being executed on a
different core processor. Breaking down large tasks in this way
typically simplifies the overall processing of complex, high-speed
data manipulations, such as those used in data security.
[0023] A debugging system for multi-core processors is provided to
facilitate debugging these parallel applications executing on
several independent processor cores. This is accomplished, at least
in part, by generating internal trigger events from one or more of
the multiple processor cores. These multiple trigger events can be
transmitted to an external debug console using a debug interface
having relatively few I/O signal lines. Preferably the debug
interface is separate from the processor core's memory interface
(e.g., the Dynamic Random Access Memory (DRAM) interface) to avoid
interference with the parallel application. A separate debug
interface also allows a majority of the hardware for the debug
interface to remain useable during normal processing of the
multi-core processors.
[0024] Combining multiple processor cores in a single system leads
to a closer placement of cores with respect to each other. Reducing
separation between the processor cores generally reduces
propagation delay, thereby increasing communication speed between
them. In some embodiments, the processor cores are provided within
the same central processing unit and are interconnected using
cables. Alternatively or in addition, some of the processor cores
can be interconnected in the same socket, e.g., plugged into a
common processor socket on a motherboard. In some applications, the
multiple processors are provided together on the same semiconductor
die.
[0025] With different processor cores operating cooperatively to
implement a common function, such as packet processing in a
high-speed packet processor, it may be necessary to examine the
state of more than one of the multiple processor cores during any
debugging activity. Thus, it would be beneficial to interrupt the
multiple processor cores at substantially the same time, thereby
allowing examination of the register contents and memory values
attributable to any of the multiple processor cores. Once
interrupted, operation of the multiple processor cores can be
stepped sequentially, in unison according to operation in a debug
mode. This special class of fast interrupts is referred to herein
as Multi-Core Debug (MCD) interrupts.
[0026] To facilitate the very fast debug interrupt, a separate,
high-speed interrupt-signal interconnect is provided. This separate
signal interconnect allows for substantially simultaneous
interruption of more than one of the multiple processor cores. For
example, a global signal interconnect is coupled to each of the
processor cores. Each of the processor cores, in turn, is
configured to selectively provide an interrupt signal, or pulse, on
the global signal interconnect. Preferably, each of the processor
cores is capable of pulsing the global signal interconnect during
any cycle of the processor clock. Additionally, each of the
processor cores samples the global signal interconnect to determine
whether any processor core has provided an interrupt signal.
[0027] Each of the multiple processor cores is connected to the
global signal interconnect, with each core being capable of
independently pulsing the signal interconnect. Once pulsed, the
processor cores sampling the signal interconnect receive the
interrupt substantially simultaneously. Using a logical OR
configuration of the contributed pulses from all of the multiple
processor cores provides the desired functionality (i.e., the
global signal interconnect is asserted if any one of the
interconnected processor cores asserts the interconnect).
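The wired-OR behavior just described can be sketched in software. The following C model is illustrative only, assuming a hypothetical 16-core arrangement; the names NUM_CORES and resolve_wired_or are not from the disclosure. It models one clock cycle: the global line resolves to a logical one if any core pulses it, and every core samples the same resolved value.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_CORES 16                 /* core count is an assumption */

    /* One simulated clock cycle of a global MCD wire: the line resolves
     * to the logical OR of every core's pulse output (wired-OR), and
     * every core samples the same resolved value in that same cycle. */
    static bool resolve_wired_or(const bool pulse[NUM_CORES])
    {
        bool line = false;
        for (int i = 0; i < NUM_CORES; i++)
            line = line || pulse[i];     /* any single driver asserts the line */
        return line;
    }

    int main(void)
    {
        bool pulse[NUM_CORES] = { false };
        pulse[3] = true;                 /* core 3 hits a debug trigger event */

        if (resolve_wired_or(pulse))     /* sampled by all cores together */
            for (int i = 0; i < NUM_CORES; i++)
                printf("core %d: interrupt sampled, entering debug mode\n", i);
        return 0;
    }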
[0028] Each of the processor cores includes respective debug
circuitry with supporting extensions that enable concurrent
multi-core debugging. The debug circuitry is responsive to the
global signal interconnect being asserted.
[0029] More generally, a multi-core processor architecture includes
multiple global signal interconnects. Each of the multiple
interconnects is independently configured as the single
interconnect described above. Thus, each of the global signal
interconnects can be asserted (i.e., pulsed) by any of the multiple
processor cores as described above.
[0030] FIG. 1 is a block diagram of an exemplary security appliance
102 that includes a network services processor 100 according to the
principles of the present invention. The network services processor
100 is a multi-core processor. The security appliance 102 can be a
standalone system that switches packets received at one Ethernet
port (Gig E) to another Ethernet port (Gig E). Preferably, the
security appliance 102 also performs one or more security functions
related to the received packets prior to forwarding the packets.
For example, the security appliance 102 can be used to perform
security processing on packets received from a Wide Area Network
(WAN) 102 prior to forwarding the processed packets to a Local Area
Network (LAN) 103. Exemplary network services processors 100
adapted to perform such security processing can include hardware
packet processing, buffering, work scheduling, ordering,
synchronization, and coherence support to accelerate packet
processing tasks according to the principles of the present
invention.
[0031] The network services processor 100 generally processes
higher layer protocols. For example, the network services processor
100 processes one or more of the Open System Interconnection (OSI)
network L2-L7 layer protocols encapsulated in received packets. As
is well-known to those skilled in the art, the OSI reference model
defines seven network protocol layers: Layers 1-7 (referred to
herein as L1-L7). The physical layer (L1) represents an actual
physical interface. Namely, the electrical and physical attributes
that enable a device to be connected to a transmission medium. The
data link layer (L2) performs data framing. The network layer (L3)
formats the data into packets. The transport layer (L4) handles end
to end transport. The session layer (L5) manages communications
between devices, for example, whether communication is half-duplex
or full-duplex. The presentation layer (L6) manages data formatting
and presentation, for example, syntax, control codes, special
graphics and character sets. The application layer (L7) permits
communication between users, for example, file transfer and
electronic mail.
[0032] To support multiple interconnects, the network services
processor 100 includes a number of interfaces. For example, the
network services processor 100 includes a number of Ethernet Media
Access Control interfaces with standard Reduced Gigabit Media
Independent Interface (RGMII) connections to off-chip destinations
using physical interfaces (PHYs) 104a, 104b.
[0033] In operation, the network services processor 100 receives
packets from one or more external destinations at one or more
respective Ethernet ports (Gig E) through the physical interfaces
PHY 104a, 104b. The network services processor 100 then selectively
performs L7-L2 network protocol processing on the received packets,
forwarding processed packets through the physical interfaces 104a,
104b. The processed packets may be forwarded to another "hop" in
the network, to their final destination, or through a local
communications bus for further processing by a host processor. The
local communications bus can be any one of a number of industry
standard busses, such as a Peripheral Component Interconnect (PCI)
bus 106 or a PCI Extended (PCI-X) bus. Other PC busses include
Integrated Systems Architecture (ISA), Extended ISA (EISA), Micro
Channel, VL-bus, NuBus, TURBOchannel, VMEbus, MULTIBUS, STD bus,
and proprietary busses. Further, the network protocol processing
can include processing of network security protocols such as
Firewall, Application Firewall, Virtual Private Network (VPN)
including IP Security (IPSec) and/or Secure Sockets Layer (SSL),
Intrusion Detection System (IDS) and Anti-Virus (AV).
[0034] A DRAM controller in the network services processor 100
controls access to an external Dynamic Random Access Memory (DRAM)
108 that is coupled to the network services processor 100. The DRAM
108 stores data packets received from the PHY interfaces 104a, 104b
or from a local communications bus, such as the PCI-X interface 106,
for processing by the network services processor 100. In one
embodiment, the DRAM interface supports 64 or 128 bit Double Data
Rate II Synchronous Dynamic Random Access Memory (DDR II SDRAM)
operating at speeds up to and including 800 MHz.
[0035] A boot bus 110 can be provided, such that the necessary boot
code is accessible allowing the network services processor 100 to
execute the boot code upon power-on and/or reset. Generally, the
boot code is stored in a memory, such as a flash memory 112.
Application code can also be loaded into the network services
processor 100 over the boot bus 110. For example, application code
can be loaded from a device 114 implementing the Compact Flash
standard, or from another high-volume device, such as a disk,
attached via the PCI bus.
[0036] A miscellaneous I/O interface 116 offers auxiliary
interfaces such as General Purpose Input/Output (GPIO), Flash, IEEE
802 two-wire Management Interface (MDIO), Universal Asynchronous
Receiver-Transmitters (UARTs), and serial interfaces.
[0037] The network services processor 100 can include another
memory controller for controlling Low latency DRAM 118. The low
latency DRAM 118 can be used for Internet Services and Security
applications, thereby allowing fast lookups, including the
string-matching that may be required for Intrusion Detection System
(IDS) or Anti Virus (AV) applications.
[0038] FIG. 2 is a more-detailed block diagram of an exemplary
network services processor 100, such as the one shown in FIG. 1. As
discussed above, the network services processor 100 can be adapted
to deliver high application performance by including multiple
processor cores 202. Network operations can be categorized into
data plane operations and control plane operations. A data plane
operation includes packet operations for forwarding packets. A
control plane operation includes processing of portions of complex
higher level protocols such as Internet Protocol Security (IPSec),
Transmission Control Protocol (TCP) and Secure Sockets Layer (SSL).
Advantageously, in such a network application, selective processor
cores 202 can be dedicated to performing respective data plane or
control plane operations. A data plane operation can include
processing of other portions of these complex higher level
protocols.
[0039] A packet input unit 214 can be used to allocate and create a
work queue entry for each packet. The work queue entry, in turn,
contains a pointer to a buffered packet temporarily stored in
memory, such as Level-2 cache 212 or DRAM 108 (FIG. 1).
[0040] Packet Input/Output processing is performed by a respective
interface unit 210a, 210b, a packet input unit (Packet Input) 214,
and a packet output unit (PKO) 218. The input controller 214 and
interface units 210a, 210b can perform parsing of received packets
and checking of results to offload the processor cores 202.
[0041] A packet is received by any one of the interface units 210a,
210b (generally 210) through a predefined interface, such as a
System Packet Interface SPI-4.2 (e.g., SPI-4 phase 2 standard of
the Optical Internetworking Forum) or an RGMII interface. A packet
can also be received by a PCI interface 224. The interface unit
210a, 210b handles L2 network protocol pre-processing of the
received packet by checking various fields in the L2 network
protocol header included in the received packet. After the
interface unit 210 has performed L2 network protocol processing,
the packet is forwarded to the packet input unit 214. The
pre-processed packet can be forwarded over an input/output (I/O)
bus, such as I/O bus 225. The packet input unit 214 can be used to
perform additional pre-processing, such as pre-processing of L3 and
L4 network protocol headers included in the received packet. The
pre-processing can include checksum checks for Transmission Control
Protocol (TCP)/User Datagram Protocol (UDP) (L4 network
protocols).
[0042] The packet input unit 214 writes packet data into buffers in
Level-2 cache 212 or DRAM 108 (FIG. 1) in a format that is
convenient to higher-layer software executed in at least one
processor core 202 for further processing of higher level network
protocols. The packet input unit 214 can support a programmable
buffer size and can distribute packet data across multiple buffers
to support large packet input sizes.
[0043] The Packet order/work (POW) module (unit) 228 queues and
schedules work (i.e., packet processing operations) for the
processor cores 202. Work can be defined to be any task to be
performed by a processor core 202 that is identified by an entry on
a work queue. The task can include packet processing operations,
for example, packet processing operations for L4-L7 layers to be
performed on a received packet identified by a work queue entry on
a work queue. Each separate packet processing operation is a piece
of the work to be performed by a processor core 202 on the received
packet stored in memory. For example, the work can be the
processing of a received Firewall/Virtual Private Network (VPN)
packet. The processing of a Firewall/VPN packet includes the
following separate packet processing operations (i.e., pieces of
work): (1) defragmentation to reorder fragments in the received
packet; (2) IPSec decryption; (3) IPSec encryption; and (4) Network
Address Translation (NAT) or TCP sequence number adjustment prior
to forwarding the packet.
[0044] The POW module 228 selects (i.e., schedules) work for a
processor core 202 and returns a pointer to the work queue entry
that describes the work to the processor core 202. Each piece of
work (i.e., a packet processing operation) has an associated group
identifier and a tag.
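As a rough illustration of the work queue entry just described, the following C struct collects the fields named above (a pointer to the buffered packet, a group identifier, and a tag). The field names and widths are assumptions for illustration, not the actual hardware layout.

    #include <stdint.h>

    /* Hypothetical layout of a work queue entry: a pointer to the
     * buffered packet in Level-2 cache or DRAM, plus the scheduling
     * metadata the POW module 228 uses to pick a processor core. */
    typedef struct {
        void    *packet;   /* buffered packet temporarily stored in memory */
        uint16_t group;    /* group identifier associated with the work    */
        uint32_t tag;      /* tag ordering/synchronizing related work      */
    } work_queue_entry_t;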
[0045] Prior to describing the operation of the processor cores 202
in further detail, the other modules in the network services processor 100 will
be described. After the packet has been processed by the processor
cores 202, a packet output unit (PKO) 218 reads the packet data
from Level-2 cache 212 or DRAM 108 (FIG. 1), performs L4 network
protocol post-processing (e.g., generates a TCP/UDP checksum),
forwards the packet through the interface unit 210 and frees the
Level-2 cache 212 or DRAM 108 locations used to store the
packet.
[0046] The network services processor 100 can also include
application specific co-processors that offload the processor cores
202 so that the network services processor 100 achieves a
high-throughput. The application specific co-processors can include
a DFA co-processor 244 that performs Deterministic Finite Automata
(DFA) and a compression/decompression co-processor 208 that
performs compression and decompression. Other co-processors include
a Random Number Generator (RNG) 246 and a timer unit 242. The timer
unit 242 is particularly useful for TCP applications.
[0047] Each processor core 202 can include a dual-issue,
superscalar processor with a respective instruction cache 206, a
respective Level-1 data cache 204, and respective built-in hardware
acceleration (e.g., a crypto acceleration module) 200 for
cryptography algorithms with direct access to low latency memory
over the low latency memory bus 230. The low-latency, direct-access
path to low-latency memory 118 (FIG. 1) bypasses the Level-2
cache memory 212 and can be accessed directly from both the
processor cores 202 and the DFA co-processor 244.
[0048] The network services processor 100 also includes a memory
subsystem. The memory subsystem includes the respective Level-1
data cache memory 204 of each of the processor cores 202,
respective instruction cache 206 in each of the processor cores
202, a Level-2 cache memory 212, a DRAM controller 216 for external
DRAM memory 108 (FIG. 1), and an interface, such as a low-latency
bus 230 to external low latency memory (not shown).
[0049] The memory subsystem is configured to support the multiple
processor cores 202 and can be tuned to deliver both the
high-throughput and the low-latency required by memory-intensive,
content-networking applications. Level-2 cache memory 212 and
external DRAM memory 108 (FIG. 1) are shared by all of the
processor cores 202 and I/O co-processor devices.
[0050] Each of the processor cores 202 can be coupled to the
Level-2 cache by a local bus, such as a coherent memory bus 234.
Thus, the coherent memory bus 234 can represent the communication
channel for memory and I/O transactions between the processor cores
202, an I/O Bridge (IOB) 232, and the Level-2 cache and controller
212.
[0051] A Free-Pool Allocator (FPA) 236 maintains pools of pointers
to free memory in Level-2 cache memory 212 and DRAM 108. A
bandwidth efficient (Last-In-First-Out (LIFO)) stack is implemented
for each free pointer pool.
[0052] The I/O Bridge 232 manages the overall protocol and
arbitration and provides coherent I/O partitioning. The I/O Bridge
232 includes a bridge 238 and a Fetch-and-Add Unit (FAU) 240. The
bridge 238 includes buffer queues for storing information to be
transferred between the I/O bus 225, coherent memory bus 234, the
packet input unit 214 and the packet output unit 218.
[0053] The Fetch-and-Add Unit 240 includes a 2 kilobyte (KB)
register file supporting read, write, atomic fetch-and-add, and
atomic update operations. The Fetch-and-Add Unit 240 can be
accessed from both the cores 202 and the packet output unit 218.
The registers store highly-used values and thus reduce traffic to
access these values. Registers in the Fetch-and-Add Unit 240 are
used to maintain lengths of the output queues that are used for
forwarding processed packets through the packet output unit
218.
[0054] The PCI interface controller 224 has a Direct Memory Access
(DMA) engine that allows the processor cores 202 to move data
asynchronously between local memory in the network services
processor 100 and remote (PCI) memory (not shown) in both
directions.
[0055] In some embodiments, a key memory (KEY) 248 is provided. The
key memory 248 is a protected memory coupled to the I/O Bus 225
that can be written/read by the processor cores 202. For example,
the key memory can include error checking and correction. ECC will
report single and double bit errors and repair single bit errors.
The memory is a single-port memory that can be provided with write
precedence. In some embodiments, the key memory 248 can be used to
temporarily store Loads, Stores, and I/O pre-fetches.
A Miscellaneous Input/Output (MIO) unit 226 can also be
coupled to the I/O bus 225 to provide interface support for one or
more external devices. For example, the MIO unit 226 can support
one or more interfaces to a Universal Asynchronous
Receiver/Transmitter (UART), to a boot bus, to a General Purpose
Input/Output (GPIO) interface for communicating with peripheral
devices (not shown), and more generally to a Field-Programmable
Gate Array (FPGA) for interfacing with external devices. For
example, an FPGA can be used to interface to external Ternary
Content-Addressable Memory (TCAM) hardware providing fast-lookup
performance. In particular, the MIO 226 can provide an interface to
an external debugger console described below.
[0057] The processor core 202 supports multiple operational modes
including: user mode, kernel mode, and debug mode. User mode is
most often employed when executing applications programs (e.g., the
internal flow of program control). Kernel mode is typically used
for handling exceptions and operating system kernel functions,
including management of any related coprocessor and Input/Output
(I/O) device access. Debug mode is a special operational mode
typically used by software developers to examine variables and
memory locations, to stop code execution at predefined break
points, and to step through the code one line or unit at a time,
usually while monitoring variables and memory locations. Debug mode
is also different from other operational modes in that there are
substantially no restrictions on access to coprocessors and memory
areas. Additionally, while in Debug mode, the usual exceptions like
address error and interrupt are masked.
[0058] A multi-core processor 100 configured for debugging parallel
applications executing on more than one independent processor core
is shown in FIGS. 3A and 3B. For example, the multi-core processor
100 includes three separate global signal interconnects: MCD_0,
MCD_1, and MCD_2. Each of the three global signal interconnects
MCD_0, MCD_1, and MCD_2 is coupled to each of the multiple
processor cores 202a, 202b, . . . 202n (generally 202). Each
processor core 202, in turn, includes circuitry configured to
assert a global interrupt signal on one or more of the global
signal interconnects. Preferably, each of the processor cores 202
is configured to independently and selectively assert (i.e., pulse)
an interrupt on one or more of the global signal interconnects
MCD_0, MCD_1, and MCD_2.
[0059] Each processor core 202 also includes sensing circuitry
configured to sample each of the global signal interconnects to
determine the presence of an asserted interrupt. Preferably, each
of the processor cores 202 independently samples the global signal
interconnects MCD_0, MCD_1, and MCD_2 to determine whether an
interrupt has been asserted, and on which of the several global
signal interconnects the interrupt has been asserted--interrupts
can be asserted on more than one of the global signal interconnects
at a time. The global signal interconnects MCD_0, MCD_1, and MCD_2
are preferably sampled continuously, or at least once during each
clock cycle to determine the presence of an interrupt. The sensing
circuitry can include a register into which the state of the global
signal interconnect is latched. For example, a register is
configured to store one bit for each of the multiple global signal
interconnects, the value of the stored bit indicative of the state
of the respective global signal interconnect.
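A minimal C sketch of the sensing register just described, assuming one sticky bit per interconnect (bit 0 for MCD_0, bit 1 for MCD_1, bit 2 for MCD_2); the type and function names here are illustrative, not from the disclosure.

    #include <stdint.h>

    /* Per-core latch holding one bit per global signal interconnect.
     * Called once per core clock cycle with the sampled line levels;
     * a bit, once set, stays set until software clears it. */
    typedef struct {
        uint8_t mcd_state;                      /* bit n = state of MCD_n */
    } core_sense_t;

    static void sample_interconnects(core_sense_t *c,
                                     int mcd0, int mcd1, int mcd2)
    {
        c->mcd_state |= (uint8_t)((mcd0 != 0) << 0);
        c->mcd_state |= (uint8_t)((mcd1 != 0) << 1);
        c->mcd_state |= (uint8_t)((mcd2 != 0) << 2);
    }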
[0060] Having more than one global signal interconnect MCD_0,
MCD_1, and MCD_2 coupled to each of the processor cores 202 can
provide additional information. For example, three wires, each
capable of being independently pulsed between two states (e.g., a
logical Low or "0" and a logical High or "1"), can together provide
information corresponding to one of up to eight different messages
(i.e., 2^3=8).
[0061] Alternatively, or in addition, the global signal
interconnects can be used to communicate with the processor cores
202 once interrupted. An external debug console 325 hosting a
debugger application and providing a user interface can be
interconnected to one or more of the processor cores 202 to
facilitate debugging of the system 100. Preferably, the global
signal interconnects are accessible by the debugger. For example, a
debugger can assert a pulse on MCD_1 to instruct the processor
cores 202 to check their mailbox location (e.g., in main memory)
for an instruction from the debugger. The debugger can assert a
pulse on MCD_2 to restart all processor cores 202 after a
multi-core interrupt. Thus, usage of the global signal
interconnects can minimize disruption of the state contained in the
processor cores 202 and in the system 100, while the debugger
examines it. This capability can be very useful to isolate the
cause of bugs in parallel applications.
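The mailbox convention sketched above might look as follows in C. The helpers execute_debug_command, the mailbox layout, and the latched-state word are hypothetical stand-ins for the debugger plumbing, not an actual documented API.

    #include <stdint.h>
    #include <stdio.h>

    enum { MCD_0 = 1u << 0, MCD_1 = 1u << 1, MCD_2 = 1u << 2 };

    /* Hypothetical stand-in for debugger plumbing. */
    static void execute_debug_command(int core_id, uint32_t cmd)
    {
        printf("core %d: running debugger command %u\n", core_id, cmd);
    }

    /* Each interrupted core spins here: a pulse sampled on MCD_1 means
     * "check your mailbox location in main memory for an instruction";
     * a pulse on MCD_2 means "restart after the multi-core interrupt". */
    static void debug_wait_loop(int core_id, volatile uint32_t *mcd_state,
                                const volatile uint32_t *mailbox)
    {
        for (;;) {
            uint32_t s = *mcd_state;
            *mcd_state = 0;                 /* clear the latched pulses */
            if (s & MCD_1)
                execute_debug_command(core_id, mailbox[core_id]);
            if (s & MCD_2)
                return;                     /* leave debug mode, resume */
        }
    }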
[0062] The processor cores 202 are each coupled to the one or more
global signal interconnects MCD_0, MCD_1, and MCD_2 in a respective
"wired-OR" fashion. Thus, respective interrupt-signal generators of
each of the processor cores 202 are all interconnected at a first
wired-OR 310a, further connected to the first global signal
interconnect MCD_0. Second and third wired-ORs 310b, 310c are
provided to similarly interconnect the processor cores 202 to the
second and third global signal interconnects MCD_1 and MCD_2,
respectively.
[0063] Thus, should one or more of the processor cores 202 assert a
pulse on any of its respective interrupt-signal generator outputs
(e.g., to wired-OR 310a), the pulse will be asserted on the
respective global signal interconnect (e.g., MCD_0). Preferably, a
pulse can be asserted during any cycle. Using the wired-OR 310a,
310b, 310c (generally 310) provides the desired logic allowing any
of the processor cores 202 to drive the global interconnect signal
(i.e., the wired-OR providing an output="1" if any of its
inputs="1"), while also minimizing any corresponding delay. Once a
pulse is asserted on the global signal interconnects, all of the
processor cores 202 sample it, allowing the cores 202 to be
interrupted very quickly--at the same time, or at least within a
few cycles of the processor clock. Such a rapid interrupt of all of
the processor cores 202 preserves the entire state of the parallel
application at the time of the interrupt for examination by the
debugger.
[0064] In other embodiments, each of the processor cores 202 can be
interconnected to the global signal interconnects MCD_0, MCD_1, and
MCD_2 using combinational logic, such as a logical OR gate. Such
logic, however, represents additional complexity generally
resulting in a corresponding delay (e.g., a gate delay due to
synchronous logic, and/or a rise time delay due to the capacitance
of the logic circuitry).
[0065] Each processor core 202 provides an exception handler.
Generally, an exception refers to an error or other special
condition detected during normal program execution. The exception
handler can interrupt the normal flow of program control in
response to receiving an exception. For example, a debug exception
handler halts normal operation in response to receiving a debug
interrupt. The exception handler then passes control to a debug
handler, or software program, that controls operation in debug
mode.
[0066] Some exemplary exception types include a Debug Single Step
(DSS) exception resulting in single step execution in debug mode. A
general Debug Interrupt (DINT) results in entry of debug mode and
can be caused by the assertion of an external interrupt (e.g.,
EJ_DINT), or by setting a related bit in a debug register. An
interrupt can result from assertion of an unmasked hardware or
software interrupt signal. A debug hardware instruction break
matched (DIB) exception results in entry of debug mode when an
instruction matches a predetermined instruction breakpoint.
Similarly, a debug breakpoint instruction (DBp) results in entry of
debug mode upon execution of a special instruction (e.g., a
software debug breakpoint instruction, such as the EJTAG "SDBBP"
instruction that places a processor into debug mode and fetches
associated handler code from memory). A Data Address Break (address
only) or Data Value (e.g., DDBL/DDBS) results in entry of debug
mode when a particular memory address is accessed, or a particular
value is written to/read from memory.
[0067] Each of the processor cores 202 includes respective onboard
debug circuitry 318. As shown in FIG. 3A, each of the multiple
processor cores 202 can include a respective core Test Access Port
(TAP) 320', 320'', 320''' (generally 320) for accessing the
respective debug circuitry 318. The core TAPs 320 are connected to
one system TAP 330. As shown, each of the respective core TAPs 320
and the system TAP 330 can be interconnected in a daisy chain
configuration. Additionally, the debug circuitry 318 of all of the
interconnected processor cores 202 can be coupled to the external
debug console 325.
[0068] Once in debug mode, the debug control console can be used to
inspect the values stored in registers and memory locations. The
debug control console provides a software program that communicates
with the onboard debug circuitry 318 to accomplish inspection of
stored values, setting of breakpoints, stopping, restarting and
sequentially stepping each of the processor cores 202 in
unison.
[0069] Alternatively, or in addition, each of the processor cores
202 can be coupled to the external debug console 325 through one or
more Universal Asynchronous Receiver-Transmitter (UART) devices
that include receiving and transmitting circuits for asynchronous
serial communications, as shown in FIG. 3B. In one embodiment, the
multi-core processor 100 includes two UART devices 335a, 335b
(generally 335) used to control serial data transmission and
reception between the processor cores 202 and external devices,
such as the external debug console 325. The UART devices 335 can be
included within the Miscellaneous I/O unit (FIG. 2). Thus, each
processor core 202 can communicate with another device, such as the
external debug console 325, through a respective memory bus
interface 340 using one or more of the UART devices 335 accessible
through the I/O bridge 238. Advantageously, communicating with the
external debug console 325 using the UART device 335 removes
constraints that would have otherwise been imposed by using a
standard interface, such as the JTAG TAP interface (FIG. 3A).
[0070] The multi-core processor 100 optionally includes a trace
buffer 610 (shown in phantom) for selectively monitoring memory
transactions of the processor cores 202. For example, the trace
buffer 610 is coupled to the coherent memory bus 234 to monitor
transactions thereon. Generally, the trace buffer 610 stores
information that can be used to assist in any debugging activity.
For example, the trace buffer 610 can be configured to store the
last "N" transactions on the bus, the (N+1)st transaction being
dumped as a new transaction occurs. Further, when using a single
trace buffer 610, identification tags can be used to identify the
particular core processor 202 associated with each stored
transaction.
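The last-N behavior described above can be modeled as a ring buffer. In this C sketch the depth, field names, and widths are assumptions; it also carries the identification tag that ties each stored transaction to its core processor 202.

    #include <stdint.h>

    #define TRACE_DEPTH 64          /* "N"; the actual depth is an assumption */

    typedef struct {
        uint8_t  core_id;           /* identification tag: issuing core */
        uint64_t address;           /* coherent-bus transaction address */
        uint64_t data;
    } trace_entry_t;

    typedef struct {
        trace_entry_t entry[TRACE_DEPTH];
        unsigned head;              /* next slot; oldest entry gets dumped */
    } trace_buffer_t;

    /* Record one memory transaction, overwriting the oldest when full. */
    static void trace_record(trace_buffer_t *tb, uint8_t core_id,
                             uint64_t addr, uint64_t data)
    {
        tb->entry[tb->head] = (trace_entry_t){ core_id, addr, data };
        tb->head = (tb->head + 1) % TRACE_DEPTH;
    }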
[0071] Beneficially, the trace buffer 610 is also coupled to each
of the one or more global signal interconnects MCD_0, MCD_1, and
MCD_2, and configured with sensing circuitry sampling any pulses
asserted on the global signal interconnects. The trace buffer 610
also includes a trigger that initiates the starting and/or stopping
of monitoring in response to sampling an interrupt signal on the
global signal interconnect. Although a single trace buffer 610
supporting multiple core processors 202 is illustrated, other
configurations are possible. For example, multiple trace buffers
610 can be provided with each trace buffer 610 respectively
corresponding to one of the multiple core processors 202.
Additionally, the trace buffer 610 can be on-chip, as shown, or
off-chip and accessible by a probe.
[0072] Alternatively or in addition, the trace buffer 610 includes
circuitry configured to assert a global interrupt signal on one or
more of the global signal interconnects MCD_0, MCD_1, and MCD_2. As
shown, the trace buffer 610 can be coupled to the global signal
interconnects MCD_0, MCD_1, and MCD_2 through the wired-OR circuits
310. In this configuration, the trace buffer 610 can selectively
assert a global interrupt signal on one or more of the global
signal interconnects MCD_0, MCD_1, and MCD_2, thereby interrupting
more than one of the multiple processor cores 202 in response to
activity on the coherent memory bus 234.
[0073] FIG. 4 is a more detailed block diagram of an exemplary
processor core 202 shown in FIGS. 3A and 3B. In general, a
processor core 202 interprets and executes instructions. In some
embodiments, the processor core 202 is a Reduced Instruction Set
Computing (RISC) processor core. In more detail, the processor
core 202 includes an execution unit 400, an instruction dispatch
unit 402, an instruction fetch unit 404, a load/store unit 416, a
Memory Management Unit 406, a system interface 408, a write buffer
420 and security accelerators 200. The processor core 202 also
includes debug circuitry 318 allowing debug operations to be
performed. The system interface 408 controls access to external
memory, that is, memory external to the processor core 202, such as
the L2 cache memory described in relation to FIG. 2.
[0074] Still referring to FIG. 4, the execution unit 400 includes a
multiply/divide unit 412 and at least one register file 414. The
multiply/divide unit 412 has a 64-bit register-direct multiply. The
instruction fetch unit 404 includes Instruction Cache (ICache) 206.
The load/store unit 416 includes Data Cache (DCache) 204. A portion
of the data cache 204 can be reserved as local scratch pad/local
memory 422. In one embodiment, the instruction cache 206 is 32
Kilobytes, the data cache 204 is 8 Kilobytes and the write buffer
420 is 2 Kilobytes. The memory management unit 406 includes a
Translation Lookaside Buffer (TLB) 410.
[0075] In one embodiment, the processor core 202 includes a crypto
acceleration module (security accelerators) 200 that includes
cryptography acceleration. For example, the cryptography
acceleration can include one or more of Triple Data Encryption
Standard (3DES), Advanced Encryption Standard (AES), Secure Hash
Algorithm (SHA-1), and Message Digest Algorithm #5 (MD5). The
crypto acceleration module 200 communicates by moves to and from
the main register file 414 in the execution unit 400. Particular
algorithms, such as Rivest-Shamir-Adleman (RSA) and
Diffie-Hellman (DH), can be implemented and are performed in the
multiply/divide unit 412.
[0076] In some embodiments, the multi-core processor 100 (FIG. 2)
includes a superscalar processor. A superscalar processor includes
a superscalar instruction pipeline that allows more than one
instruction to be completed each cycle of the processor's clock
period by allowing multiple instructions to be issued
simultaneously and dispatched in parallel to multiple execution
units 400. The RISC-type processor core 202 has an instruction set
architecture that defines instructions by which the programmer
interfaces with the RISC-type processor 202. In one embodiment, the
superscalar RISC-type core is an extension of the MIPS64 version 2
core. Only load-and-store instructions access external memory; that
is, memory external to the processor core 202. In one embodiment,
the external memory is accessed over a coherent memory bus 234
(FIG. 2). All other instructions operate on data stored in the
register file 414 within the execution unit 400 of the processor
core 202. In some embodiments, the superscalar processor can be a
dual-issue processor.
[0077] The instruction pipeline is divided into stages, each stage
taking one clock cycle to complete. Thus, in a five stage pipeline,
it takes five clock cycles to process each instruction and five
instructions can be processed concurrently with each instruction
being processed by a different stage of the pipeline in any given
clock cycle. Typically, a five stage pipeline includes the
following stages: fetch, decode, execute, memory and write
back.
[0078] During the fetch-stage, the instruction fetch unit 404
fetches an instruction from instruction cache 206 at a location in
instruction cache 206 identified by a memory address stored in a
program counter. During the decode-stage, the instruction fetched
in the fetch-stage is decoded by the instruction dispatch unit 402
and the address of the next instruction to be fetched for the
issuing context is computed. During the execute-stage, the
execution unit 400 performs an operation dependent on the type of
instruction. For example, the execution unit 400 begins the
arithmetic or logical operation for a register-to-register
instruction, calculates the virtual address for a load or store
operation, or determines whether the branch condition is true for a
branch instruction. During the memory-stage, data is aligned by the
load/store unit 416 and transferred to its destination in external
memory. During the write back-stage, the result of a
register-to-register or load instruction is written back to the
register file 414.
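The overlap described above can be visualized with a short C program. It is a generic illustration of a five-stage pipeline schedule, not a model of any Cavium-specific timing: it simply prints which instruction occupies which stage in each cycle.

    #include <stdio.h>

    enum { NUM_STAGES = 5, NUM_INSNS = 5 };
    static const char *stage_name[NUM_STAGES] =
        { "fetch", "decode", "execute", "memory", "write-back" };

    /* Instruction i occupies stage (t - i) in cycle t, so up to five
     * instructions are processed concurrently in a five-stage pipeline. */
    int main(void)
    {
        for (int t = 0; t < NUM_INSNS + NUM_STAGES - 1; t++) {
            printf("cycle %d:", t);
            for (int i = 0; i < NUM_INSNS; i++)
                if (t - i >= 0 && t - i < NUM_STAGES)
                    printf("  insn%d=%s", i, stage_name[t - i]);
            printf("\n");
        }
        return 0;
    }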
[0079] The system interface 408 is coupled via the coherent memory
bus 234 (FIG. 2) to external memory. In one embodiment, the
coherent memory bus 234 is 384 bits wide and includes four separate
buses: (i) an Address/Command Bus; (ii) a Store Data Bus; (iii) a
Commit/Response control bus; and (iv) a Fill Data bus. All store
data is sent to external memory over the coherent memory bus 234
via a write buffer entry in the write buffer 420. In one
embodiment, the write buffer 420 has 16 write buffer entries.
[0080] Store data flows from the load/store unit 416 to the write
buffer 420, and from the write buffer 420 through the system
interface 408 to external memory. The processor core 202 can
generate data to be stored in external memory faster than the
system interface 408 can write the store data to the external
memory. The write buffer 420 minimizes pipeline stalls by providing
a resource for storing data prior to forwarding the data to
external memory.
[0081] The write buffer 420 is also used to aggregate data to be
stored in external memory over the coherent memory bus 234 into
aligned cache blocks to maximize the rate at which the data can be
written to the external memory. Furthermore, the write buffer 420
can also merge multiple stores to the same location in external
memory resulting in a single write operation to external memory.
The write-merging operation of the write buffer 420 can result in
the order of writes to the external memory being different than the
order of execution of the store instructions.
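A rough C sketch of the merging policy just described, assuming 128-byte cache blocks and the 16-entry buffer mentioned earlier; data and byte-valid tracking are omitted, and the names are illustrative.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define WB_ENTRIES  16          /* matches the 16-entry write buffer */
    #define BLOCK_SHIFT 7           /* 128-byte cache blocks; an assumption */

    typedef struct {
        uint64_t block;             /* aligned cache-block address */
        bool     valid;             /* data and byte-valid mask omitted */
    } wb_entry_t;

    /* Merge a store into an existing entry for the same cache block when
     * possible, so several stores retire as one external write; otherwise
     * allocate a free entry, or report "full" (a pipeline stall). */
    static wb_entry_t *wb_insert(wb_entry_t wb[WB_ENTRIES], uint64_t addr)
    {
        uint64_t block = addr >> BLOCK_SHIFT;
        for (int i = 0; i < WB_ENTRIES; i++)
            if (wb[i].valid && wb[i].block == block)
                return &wb[i];      /* merge with a pending store */
        for (int i = 0; i < WB_ENTRIES; i++)
            if (!wb[i].valid) {
                wb[i] = (wb_entry_t){ block, true };
                return &wb[i];
            }
        return NULL;                /* full: stall until an entry drains */
    }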
[0082] The processor core 202 also includes an exception control
system providing circuitry for identifying and managing exceptions.
An exception refers to an interruption or change of the normal flow
of program control that occurs when an event or other special
condition is detected during execution. Exceptions can be caused by
a variety of sources, including boundary cases in data, external
events, or even program errors, being generated (i.e., "raised") by
hardware or software. Exemplary hardware exceptions include resets,
interrupts and signals from a memory management unit. Hardware
exceptions may be generated by an arithmetic logic unit or
floating-point unit for numerical errors such as divide by zero,
overflow or underflow, or instruction decoding errors such as
privileged, reserved, trap or undefined instructions. Software
exceptions are even more varied. For example, a software exception
can refer to any kind of error checking that alters the normal
behavior of the program. An exception transfers control from code
being executed at the instant of the exception to different code, a
routine commonly referred to as an exception handler.
[0083] A system co-processor can also be provided within the
processor core 202 for providing a diagnostic capability, for
controlling the operating mode (i.e., kernel, user, and debug), for
configuring interrupts as enabled or disabled, and for storing
other configuration information.
[0084] The processor core 202 also includes a Memory Management
Unit (MMU) 406 coupled to the instruction fetch unit 404 and the
load/store unit 416. The MMU 406 is a hardware device or circuit
that supports virtual memory and paging by translating virtual
addresses into physical addresses. Thus, the MMU 406 may receive a
virtual memory address from program instructions being executed on
the processor core 202. The virtual memory address is associated
with a read from or a write to physical memory. The MMU 406
translates the virtual address to a physical address to allow a
related physical memory location to be accessed by the program.
[0085] In a multitasking system, all processes compete for the use
of memory and of the MMU 406. In some memory management
architectures, however, each process is allowed to have its own
area or configuration of the page table, with a mechanism to switch
between different mappings on a process switch. This means that all
processes can have the same virtual address space rather than
require load-time relocation. To accomplish this task, the MMU 406
can include a Translation Lookaside Buffer (TLB) 410.
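A minimal C sketch of the translation step performed by the TLB 410; the entry count, page size, and fully associative search are assumptions for illustration only.

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64          /* entry count is an assumption */
    #define PAGE_SHIFT  12          /* 4 KB pages, also an assumption */

    typedef struct {
        uint64_t vpn;               /* virtual page number   */
        uint64_t pfn;               /* physical frame number */
        bool     valid;
    } tlb_entry_t;

    /* Translate a virtual address by searching the TLB; on a miss the
     * hardware would raise a refill exception for the OS to handle. */
    static bool tlb_translate(const tlb_entry_t tlb[TLB_ENTRIES],
                              uint64_t vaddr, uint64_t *paddr)
    {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *paddr = (tlb[i].pfn << PAGE_SHIFT)
                       | (vaddr & ((1ull << PAGE_SHIFT) - 1));
                return true;        /* hit */
            }
        return false;               /* miss */
    }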
[0086] The debug circuitry 318 on each processor core 202 can
include an onboard debug controller. Having an onboard debug
controller facilitates operation of the processor core 202 in the
debug mode. For example, the debug controller can allow for
single-step execution of the processor core 202. Further, the debug
controller can support breakpoints, enabling them to transition the
processor core 202 into debug mode. For example, the breakpoints
can be one or more of instruction breakpoints, data breakpoints,
and virtual address breakpoints.
[0087] In some embodiments, the onboard debug circuitry 318
includes standardized features. For example, the onboard debug
circuitry 318 can be compliant with the design philosophy of the
Joint Test Action Group (JTAG) interface--a popular standardized
interface defined by IEEE Standard 1149.1. In embodiments that
utilize MIPS processor cores, the onboard controller is referred to as
the standard MIPS Enhanced JTAG (EJTAG) debug circuitry 318.
[0088] Each processor core 202 includes one or more debug
registers, each register including one or more pre-defined fields
for storing information (e.g., state bits) related to different
aspects of debug mode operation. The debug registers 425 can be
located in the instruction fetch unit 404. For example, one of the
debug registers 425 is a Debug register 500. The Debug register 500
is illustrated in more detail in FIG. 5. The Debug register 500
includes a DM state bit indicative of whether the processor core
202 is operating in debug mode. Other bits include a DBD state bit
indicative of whether the last debug exception or exception in
Debug Mode occurred in a branch or jump delay slot. A DDBSImpr bit
is indicative of an imprecise debug data break store. A DDBLImpr
bit is indicative of an imprecise debug data break load. This bit
can be implemented for load value breakpoints. A DExcC bit is set to
one when Debug[DExcCode] is valid and should be interpreted.
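For illustration, state bits such as DM and DBD might be tested from
software as in the following C sketch; the bit positions used here
are placeholders, the actual layout of the Debug register 500 being
given by FIG. 5.

    #include <stdint.h>

    /* Hypothetical bit positions for this sketch only; the actual
     * layout of the Debug register 500 is defined by FIG. 5. */
    #define DEBUG_DM        (1u << 30)  /* in debug mode            */
    #define DEBUG_DBD       (1u << 31)  /* exception in delay slot  */
    #define DEBUG_DDBSIMPR  (1u << 19)  /* imprecise data break st. */
    #define DEBUG_DDBLIMPR  (1u << 18)  /* imprecise data break ld. */

    static inline int in_debug_mode(uint32_t debug_reg) {
        return (debug_reg & DEBUG_DM) != 0;
    }

    static inline int was_in_delay_slot(uint32_t debug_reg) {
        return (debug_reg & DEBUG_DBD) != 0;
    }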
[0089] Another one of the debug registers 425 is a Multi-Core Debug
(MCD) register 600, shown in FIG. 6. The MCD register 600
includes dedicated multi-core debug state positions 615, one
position being provided for each of the respective global signal
interconnects MCD_0, MCD_1, and MCD_2. Similarly, the MCD register
600 includes dedicated mask-disable state positions 605, one
position being provided for each of the respective global signal
interconnects MCD_0, MCD_1, and MCD_2. When set, the mask-disable
bits (one bit for each global signal interconnect) disable the
effect of sampling a pulse on the corresponding global signal
interconnect.
[0090] The MCD register 600 also includes respective
software-control bit locations 610 for each of the several global
MCD wires. For the exemplary multi-core processor 100, the three
software-control bit locations 610, referred to as Pls0, Pls1, and
Pls2, are reserved. These software-control bit locations 610
correspond to the three global signal interconnects MCD_0, MCD_1,
and MCD_2, respectively. Thus, bits written by software into
the software control bit locations 610 can be used to pulse any
combination of the three global MCD wires.
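The field layout of the MCD register 600 might be expressed in
software as in the following C sketch; the bit positions assigned to
fields 605, 610, and 615 are assumptions made for illustration, the
actual layout being given by FIG. 6.

    #include <stdint.h>

    /* Illustrative layout for the MCD register 600; the positions of
     * fields 605, 610, and 615 are placeholders for this sketch. */
    #define MCD_STATE(n)    (1u << (n))         /* state bits 615     */
    #define MCD_MASKDIS(n)  (1u << (8 + (n)))   /* mask-disable 605   */
    #define MCD_PLS(n)      (1u << (16 + (n)))  /* sw-control 610     */

    /* A sampled pulse on global wire n takes effect only when its
     * mask-disable bit is clear. */
    static int pulse_takes_effect(uint32_t mcd, int n) {
        return (mcd & MCD_MASKDIS(n)) == 0;
    }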
[0091] In some embodiments, the debug registers 425 (FIG. 4)
include a DEPC register for imprecise debug exceptions and
imprecise exceptions in Debug Mode. Imprecise debug data breakpoints
are provided for load value compares; otherwise, debug data
breakpoints are precise. The DEPC register contains an address at
which execution should be resumed when returning to Non-Debug
Mode.
[0092] Exception handlers can be entered for debug processing in a
number of ways. First, software such as the processor core
instruction set and/or the debugger can include a breakpoint
instruction. When the breakpoint instruction is executed by the
execution unit 400, it causes a specific exception. Alternatively
or in addition, a set of trap instructions can be provided. When
the trap instructions are executed by the execution unit 400, a
specific exception will result, but only when certain register
value criteria are also satisfied. Further, a pair of optional
Watch registers can be programmed to cause a specific exception on
a load, store, or instruction fetch access to a specific word
(e.g., a 64-bit double word) in virtual memory. Still further, an
optional TLB-based MMU 406 can be programmed to "trap," or
otherwise interrupt program execution on any access, or more
specifically, on any store to a page of memory. These exceptions
generally refer to interrupting operation on any one of the
processor cores 202. To interrupt the other processor cores 202, a
pulse must be asserted on one or more of the global signal
interconnects MCD_0, MCD_1, and MCD_2.
[0093] In operation, when one or more of the processor cores 202
asserts a pulse on one of the global signal interconnects MCD_0,
MCD_1, and MCD_2, the corresponding signal value can be a high
state, or logical one. The respective instruction fetch unit 404 of
each of the interconnected processor cores 202 samples the one on
the global signal interconnect. In response to sampling the one,
the instruction fetch unit 404 sets an internal state bit
corresponding to the sampled pulse. The internal state bit, or MCD
state bit, can be one of the dedicated multi-core debug state
positions 615 in the multi-core debug register 600 (i.e., Multi-Core
Debug[MCD0, MCD1, MCD2]).
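This sampling-and-latching behavior can be described with the
following behavioral C model; it is a sketch, not register-transfer
logic, and the names and types are illustrative only.

    #include <stdint.h>

    /* Behavioral model: each cycle, every core samples the three
     * global wires and latches any observed pulse into its MCD
     * state bits 615. */
    typedef struct {
        uint32_t mcd_state;   /* Multi-Core Debug[MCD0, MCD1, MCD2] */
    } core_model_t;

    /* 'wires' carries the sampled values of MCD_0..MCD_2 in bits
     * 0..2 for the current clock cycle. */
    static void sample_mcd_wires(core_model_t *core, uint32_t wires) {
        core->mcd_state |= wires & 0x7u;  /* latch any sampled pulse */
    }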
[0094] If any of multi-core debug state bits 615 are non-zero on a
given processor core 202 (and that processor core 202 is not
already in debug mode), the onboard debug circuitry 318 requests a
debug exception on its respective processor core 202. With all of
the multiple processor cores 202 sampling the same pulse and
setting their respective bits 615 at substantially the same time,
all of the unmasked processor cores 202 are interrupted at
substantially the same time. Preferably, this occurs during the
same cycle, but it can also occur within a few clock cycles.
Software can later clear Multi-Core Debug[MCD0, MCD1, MCD2] bits by
overwriting them (e.g., writing a one to them). Such a provision
ensures that no further debug interrupts occur after exiting the
debug handler.
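A minimal C sketch of this write-one-to-clear convention follows;
read_mcd() and write_mcd() are hypothetical accessors standing in
for the implementation's actual register interface.

    #include <stdint.h>

    /* mcd_shadow stands in for the MCD register. Writing a one to a
     * state bit clears it, so the handler acknowledges exactly the
     * bits it observed. */
    static uint32_t mcd_shadow;

    static uint32_t read_mcd(void)        { return mcd_shadow; }
    static void     write_mcd(uint32_t v) { mcd_shadow &= ~v; } /* W1C */

    static void debug_handler_epilogue(void) {
        uint32_t pending = read_mcd() & 0x7u;  /* MCD0..MCD2 bits */
        write_mcd(pending);   /* clear them so no further debug
                                 interrupt occurs after returning */
    }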
[0095] In general, interrupts can be assigned different priority
values to ensure the desired results in situations in which more
than one type of interrupt occurs. In particular, the MCD
interrupts can occur at the same priority level as standard debug
interrupts provided within the debug circuitry 318 of each of the
processor cores 202. The exception location can also be the same as
a debug interrupt, with the multi-core debug bits 615 being similar
to the DINT bit of the debug register shown in FIG. 5.
[0096] The detailed behavior of the bits, however, is different.
For example, the DINT bit is read-only, whereas Multi-Core
Debug[MCD0, MCD1, MCD2] bits can be written to, allowing the bits
to be cleared by the debug handler. Further, the DINT is cleared
when Multi-Core Debug[DExcC] is set, whereas the multi-core debug
state bits 615 need not be.
[0097] There are at least four ways that the global signal
interconnects MCD_0, MCD_1, and MCD_2 can be pulsed. First,
software can cause initiation of a pulse on the global MCD wires.
For example, debugger software running on a processor core 202 can
write one or more values (e.g., a logical "1") to any combination
of the software-control state bits 610 of the MCD register 600.
When a "1" is written into one or more of these bits 610, the
processor core 202 interprets it as an instruction to assert an
interrupt signal, or pulse, on the corresponding global signal
interconnects.
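A software-initiated pulse might therefore be coded as in the
following sketch; the Pls bit positions and the register stand-in
are assumptions made for illustration only.

    #include <stdint.h>

    /* Illustrative Pls bit positions; the real MCD register 600 is
     * reached through the implementation's register interface. */
    #define MCD_PLS0 (1u << 16)
    #define MCD_PLS1 (1u << 17)
    #define MCD_PLS2 (1u << 18)

    static uint32_t mcd_register;   /* stand-in for register 600 */

    /* Writing a "1" into a Pls bit instructs the core to assert a
     * pulse on the corresponding global signal interconnect. */
    static void pulse_wires(int w0, int w1, int w2) {
        uint32_t v = 0;
        if (w0) v |= MCD_PLS0;
        if (w1) v |= MCD_PLS1;
        if (w2) v |= MCD_PLS2;
        mcd_register |= v;
    }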
[0098] The global signal interconnects can also be pulsed by
execution of a special instruction. For example, execution of a
software breakpoint instruction, such as the SDBBP instruction, by
any one of the processor cores 202 results in that core 202
asserting a pulse on the MCD_0 global signal interconnect. Whether
a pulse is actually asserted by a processor core 202 in response to
the breakpoint instruction can be further controlled by a
global-signal debug bit 618 in the MCD register 600. Thus, a pulse
is only asserted in response to the breakpoint instruction when the
MCD[GSDB] bit 618 is set.
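The following C sketch illustrates this sequence under the stated
gating; the MCD[GSDB] bit position and the register stand-in are
hypothetical, and the SDBBP instruction, the MIPS software debug
breakpoint, is emitted only when compiling for a MIPS target.

    #include <stdint.h>

    #define MCD_GSDB (1u << 24)   /* hypothetical GSDB position */

    static uint32_t mcd_register;  /* stand-in for register 600 */

    static void breakpoint_all_cores(void) {
        mcd_register |= MCD_GSDB;  /* SDBBP will now also pulse MCD_0 */
    #if defined(__mips__)
        __asm__ volatile ("sdbbp 0");  /* local debug exception plus
                                          MCD_0 pulse when GSDB set */
    #endif
    }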
[0099] Alternatively or in addition, the initiation of a pulse on
the global signal interconnects can result if one or more bits
within a particular register are set and a breakpoint match occurs.
When these two conditions occur, the hardware (e.g., the debug
circuitry 318) pulses one of the global MCD wires (e.g., the MCD_0
wire). An Instruction Breakpoint Control-n register (IBCn, "n"
being a numbered reference to a particular instruction breakpoint)
stores a value responsive to a match of an executed breakpoint
instruction. Similarly, a Data Breakpoint Control-n (DBCn) stores a
value responsive to a match of a data transaction. The registers
IBCn and DBCn generally include special bits (e.g., BE, TE) that
can be used to enable the respective breakpoints.
[0100] Table 1 below describes an exemplary embodiment in which the
detailed behavior on a breakpoint match is defined based on
exemplary register values.

TABLE 1. Breakpoint Match Behavior

  BE  TE  Comment
  0   0   Nothing happens on a match.
  0   1   MCD0 is pulsed on a match. BS bits are also set in
          IBS/DBS. No direct local exception occurs. (This mode may
          not be used.)
  1   0   A local breakpoint exception occurs due to the breakpoint
          match, causing the Core to enter debug mode. MCD0 is not
          pulsed. BS bits are set in IBS/DBS. (This mode will be
          used when debugging, but not multi-Core.)
  1   1   A local breakpoint exception occurs due to the breakpoint
          match, causing the Core to enter debug mode. MCD0 is also
          pulsed. BS bits are also set in IBS/DBS. (This mode will
          be used when debugging multi-Core.)
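The BE/TE selections of Table 1 might be applied to an IBCn or DBCn
control word as in the following C sketch; the BE and TE bit
positions are placeholders for illustration.

    #include <stdint.h>

    /* Illustrative BE/TE positions within an IBCn/DBCn control word. */
    #define BRK_BE (1u << 0)   /* local breakpoint exception on match */
    #define BRK_TE (1u << 1)   /* pulse MCD0 on match                 */

    /* Per Table 1, BE=1/TE=1 yields both a local debug exception and
     * an MCD0 pulse, so one core's breakpoint match interrupts every
     * unmasked core at substantially the same time. */
    static uint32_t control_for_multicore_debug(uint32_t ctl) {
        return ctl | BRK_BE | BRK_TE;
    }

    static uint32_t control_for_local_debug(uint32_t ctl) {
        return (ctl | BRK_BE) & ~BRK_TE;  /* BE=1, TE=0: no pulse */
    }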
[0101] An exemplary TAP controller 700 is shown in FIG. 7A. The TAP
controller 700 includes one or more registers 705 for storing
instruction, data, and control information relating to the TAP
interface 320. The registers 705 allow a user to set up the onboard
debug circuitry 318, and provide important status information during
a debug session. The size of the registers 705 depends on the
specific implementation, but the registers are usually at least 32
bits wide.
[0102] The registers 705 receive information from an external
source using the Test Data Input (TDI) input (i.e., pin). The
registers also provide information to an external source using the
Test Data Output (TDO) output (i.e., pin). Operation of the
interface is provided by a TAP controller state machine 710. The
TAP controller 700 uses a communications channel, such as a serial
communications channel that operates according to a clock signal
received on the Test Clock (TCK) input (i.e., pin). Thus, movement
of data into and/or out of the registers 705 operates according to
the received clock signal. Similarly, operation of the state
machine also relies on the received clock.
[0103] A more detailed interconnection of respective TAP interfaces
320 on each of the multiple processor cores 202 is shown in FIG.
7B. A JTAG interface, referred to as a Test Access Port (TAP) 320',
320'', 320''' (generally 320), includes at least four signal lines:
Test Clock (TCK); Test Mode Select (TMS); Test Data In (TDI); and
Test Data Out (TDO). The interface can also include one or more
power and ground signal lines (not shown). The JTAG interface is a
serial interface that
is capable of transferring data according to a clock signal
received on the TCK signal line. Operating frequency varies per
chip, but is typically defined by a clock signal having a frequency
between about 10 MHz and about 100 MHz (i.e., from about 100
nanoseconds to about 10 nanoseconds per bit time).
[0104] Configuration of the respective debug circuitry 318 on each
processor core (FIGS. 3A and 3B) can be performed by manipulating an
internal
state machine. For example, a debug controller state machine within
the debug circuitry 318 can be externally manipulated one bit at a
time via the TMS signal line of the TAP 330. Data can then be
transferred in and out, one bit at a time, during each TCK clock
cycle. The data can be received via the TDI signal line and
transmitted out via the TDO signal line. Different
instruction modes can be loaded into the debug controller 318 to
read the core identification (ID), to sample input, to drive
(and/or float) output, to manipulate functions, and/or to bypass
(pipe TDI to TDO to logically shorten chains of multiple
chips).
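The bit-at-a-time transfer described above can be sketched in C as
follows; the pin accessors are hypothetical stubs standing in for
whatever circuit drives the physical TAP signal lines.

    #include <stdint.h>

    /* Hypothetical pin accessors (stubbed so the sketch compiles). */
    static int tdo_pin;
    static void set_tck(int v) { (void)v; /* drive TCK pin */ }
    static void set_tms(int v) { (void)v; /* drive TMS pin */ }
    static void set_tdi(int v) { (void)v; /* drive TDI pin */ }
    static int  get_tdo(void)  { return tdo_pin; }

    /* Shift 'nbits' LSB-first through the TAP while in a shift
     * state. TDO is stable before the rising TCK edge; the shift
     * occurs on the rising edge; TDO updates on the falling edge.
     * TMS is raised on the last bit to leave the shift state. */
    static uint64_t tap_shift(uint64_t out, int nbits) {
        uint64_t in = 0;
        for (int i = 0; i < nbits; i++) {
            set_tms(i == nbits - 1);
            set_tdi((int)((out >> i) & 1u));
            in |= (uint64_t)get_tdo() << i;
            set_tck(1);
            set_tck(0);
        }
        return in;
    }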
[0105] The respective TAP 320 of each of the multiple processor
cores 202a, 202b . . . 202n (generally 202) is coupled to the
respective TAP 320 of the other multiple processor cores 202 in a
serial, or "daisy chain" configuration. Thus, the TCK signal of the
first TAP 320' is serially interconnected to the corresponding TCK
signal lines of all of the other TAPs 320. The interconnected TCK
signal lines are further connected to a corresponding TCK signal
line of a system TAP 330. Typically, the system TAP 330 is
interconnected to one end of the chain of interconnected processor
cores 202 (i.e., processor core 202n or processor core 202a, as
shown), that end processor core 202 being referred to as a "master"
processor core 202a. For the most part, the remaining TAP signal
lines are interconnected in a similar manner, being further
connected from the master processor core 202a to the corresponding
TAP signal lines on the system TAP 330. Interconnection of the TDI
and TDO signal lines, however, is different as described in more
detail below.
[0106] In the daisy chain configuration, the TDI signal line of the
master processor core 202a connects to the corresponding TDI signal
line of the system TAP 330, the master processor core 202a
receiving data from an external source. The TDO signal line of the
master processor core 202a, however, is connected to the TDI signal
line of an adjacent processor core 202b. Additional processor cores
202 are connected in a similar manner, the TDO signal line of one
processor core 202 being interconnected to the TDI signal line of
the next processor core 202 in the chain, until the TDO signal line
of the
last processor core 202n in the chain is interconnected to the TDO
signal line of the system TAP 330.
[0107] A more-detailed diagram illustrating an alternative
embodiment of a processor core 202 including exemplary onboard
debug circuitry is shown in FIG. 8. An execution unit 400 (e.g., a
combined processor and co-processor) is coupled to a memory (e.g.,
cache) controller 805 through an MMU 410. The MMU 410 may include a
TLB. The memory controller 805 is further coupled to a memory
system interface through a bus interface unit 408. Access and
control of the onboard debug features is provided through an EJTAG
TAP 320. The execution unit 400 includes a number of registers 830
that support debug operation. For example, the processor core 202
includes an MCD register 835 and a debug register 836, as discussed
above, as well as a DEPC register 837 and a DESAVE register 838.
[0108] A debug control register 832 is coupled between the
registers 830 of the execution unit 400, the memory controller 805,
and externally via the EJTAG TAP 320. A hardware breakpoint unit
825 is also coupled between the registers 830 of the execution unit
400, the memory controller 805, and the MMU 410. The Hardware
Breakpoint Unit 825 implements memory-mapped registers that control
the instruction and data hardware breakpoints. The memory-mapped
region containing the hardware breakpoint registers is accessible
to software only in debug mode.
[0109] The debug features provide compatibility with existing
debuggers. The debug circuitry 318 also includes specific
extensions that enable concurrent multi-Core debugging. For
example, controlling logic can be used to interpret the values of
the software-control bit locations 610. Upon interpreting a value
indicative of a pulse, the controlling logic can write the
interpreted values into the corresponding MCD_0, MCD_1, and MCD_2
bit locations of the MCD register. The controlling logic can then
pulse the one or more corresponding global MCD wires, according to
the corresponding values 615. Once pulsed, the processor cores 202
sample the pulse. The pulse sampling can occur during the next
clock cycle after the pulse was written. Once sampled, each of the
processor cores 202 that is not masked will initiate a debug
exception handler routine.
[0110] The debug exception handler can then follow a set of
predetermined rules to determine the one or more causes of a given
debug exception after reading the Debug and/or Multi-Core Debug
registers. For example, the debug exception handler can follow the
rules listed in Table 2 below.

TABLE 2. Debug Exception Handler Rules

  1. Any of the MCD state bit locations (Multi-Core Debug[MCD0,
     MCD1, MCD2]) could be set at any time, indicating that the
     corresponding MCD state bit is set.
  2. If Multi-Core Debug[DExcC] is set, all of Debug[DDBSImpr,
     DDBLImpr, DINT, DIB, DDBS, DDBL, DBp, DSS] will be clear, and
     Debug[DExcCode] will contain a valid code. (This is the case
     for a debug mode exception.)
  3. If none of Debug[DDBSImpr, DDBLImpr, DINT, DIB, DDBS, DDBL,
     DBp, DSS] are set, then the exception was either due to MCD*,
     or Multi-Core Debug[DExcC] being set and Debug[DExcCode] being
     valid.
  4. No more than one of Debug[DIB, DDBS, DDBL, DBp, DSS] can be
     set.
  5. If Multi-Core Debug[DExcC] is clear, any combination of
     Debug[DDBLImpr, DINT] may be set.
  6. At least one of Debug[DDBLImpr, DINT, DIB, DDBS, DDBL, DBp,
     DSS] and Multi-Core Debug[MCD0, MCD1, MCD2, DExcC] will be
     set.
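A debug exception handler might apply rules 2 and 3 of Table 2 as in
the following C sketch; the field masks are placeholders, and
'debug_local' collapses the listed Debug register status bits into a
single word for brevity.

    #include <stdint.h>

    /* Illustrative masks; actual positions come from FIGS. 5 and 6. */
    #define MCD_STATE_BITS 0x7u       /* Multi-Core Debug[MCD0..MCD2] */
    #define MCD_DEXCC      (1u << 3)  /* Multi-Core Debug[DExcC]      */

    typedef enum {
        CAUSE_DEBUG_MODE_EXC,  /* rule 2: Debug[DExcCode] is valid */
        CAUSE_MCD_ONLY,        /* rule 3: only MCD* bits are set   */
        CAUSE_LOCAL            /* a local debug status bit is set  */
    } debug_cause_t;

    /* 'debug_local' is nonzero when any of Debug[DDBSImpr, DDBLImpr,
     * DINT, DIB, DDBS, DDBL, DBp, DSS] is set. */
    static debug_cause_t classify(uint32_t mcd, uint32_t debug_local) {
        if (mcd & MCD_DEXCC)           return CAUSE_DEBUG_MODE_EXC;
        if (debug_local == 0 &&
            (mcd & MCD_STATE_BITS))    return CAUSE_MCD_ONLY;
        return CAUSE_LOCAL;
    }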
[0111] While this invention has been particularly shown and
described with references to preferred embodiments thereof, it will
be understood by those skilled in the art that various changes in
form and details may be made therein without departing from the
scope of the invention encompassed by the appended claims.
* * * * *