U.S. patent application number 11/042476 was filed with the patent office on 2006-03-16 for multi-core debugger.
This patent application is currently assigned to Cavium Networks. Invention is credited to Michael S. Bertone, David A. Carlson, Philip H. Dickinson, Muhammad R. Hussain, Richard E. Kessler, Trent Parker.
Publication Number: 20060059286
Application Number: 11/042476
Family ID: 38731731
Filed Date: 2006-03-16

United States Patent Application 20060059286
Kind Code: A1
Bertone; Michael S.; et al.
March 16, 2006
Multi-core debugger
Abstract
In a multi-core processor, a high-speed interrupt-signal
interconnect allows more than one of the processors to be
interrupted at substantially the same time. For example, a global
signal interconnect is coupled to each of the multiple processors,
each processor being configured to selectively provide an interrupt
signal, or pulse, thereon. Preferably, each of the processor cores
is capable of pulsing the global signal interconnect during every
clock cycle to minimize delay between a triggering event and its
respective interrupt signal. Each of the multiple processors also
senses, or samples, the global signal interconnect, preferably
during the same cycle within which the pulse was provided, to
determine the existence of an interrupt signal. Upon sensing an
interrupt signal, each of the multiple processors responds to it
substantially simultaneously. For example, an interrupt signal
sampled by each of the multiple processors causes each processor to
invoke a debug handler routine.
Inventors: Bertone; Michael S.; (Marlborough, MA); Carlson; David A.; (Haslet, TX); Kessler; Richard E.; (Shrewsbury, MA); Dickinson; Philip H.; (Cupertino, CA); Hussain; Muhammad R.; (Pleasanton, CA); Parker; Trent; (San Jose, CA)
Correspondence Address:
HAMILTON, BROOK, SMITH & REYNOLDS, P.C.
530 VIRGINIA ROAD
P.O. BOX 9133
CONCORD, MA 01742-9133
US
Assignee: Cavium Networks (Santa Clara, CA)
Family ID: 38731731
Appl. No.: 11/042476
Filed: January 25, 2005
Related U.S. Patent Documents

Application Number: 60/609,211
Filing Date: Sep 10, 2004
Current U.S. Class: 710/260; 714/E11.207
Current CPC Class: G06F 9/30014 (20130101); G06F 9/30043 (20130101); G06F 11/3632 (20130101); G06F 12/0815 (20130101); G06F 12/0835 (20130101); G06F 12/084 (20130101); G06F 9/383 (20130101); G06F 12/0875 (20130101); G06F 2212/6022 (20130101); G06F 9/30138 (20130101); G06F 12/0804 (20130101); G06F 13/24 (20130101); G06F 2212/6012 (20130101); G06F 12/0813 (20130101); G06F 9/3824 (20130101); G06F 12/0891 (20130101)
Class at Publication: 710/260
International Class: G06F 13/24 (20060101)
Claims
1. A multi-core processor comprising: a plurality of independent
processor cores, each processor core executing instructions and
operating in parallel to perform work; each of the plurality of
independent processor cores respectively including: an
interrupt-signal sensor; and an interrupt-signal generator
selectively providing an interrupt signal; and a global
interrupt-signal interconnect in electrical communication with each
of the plurality of independent processor cores, more than one of
the processor cores respectively interrupting its execution of
instructions substantially simultaneously responsive to sampling
with respective interrupt-signal sensors an interrupt signal on the
global interrupt-signal interconnect.
2. The multi-core processor of claim 1, wherein the respective
interrupt-signal generator of each of the plurality of independent
processor cores is coupled to the global interrupt-signal
interconnect.
3. The multi-core processor of claim 2, wherein the respective
interrupt-signal generator of each of the plurality of independent
processor cores is coupled to the global interrupt-signal
interconnect in a wired-OR configuration.
4. The multi-core processor of claim 1, wherein an interrupt signal
is provided in response to a write to a register in one of the
plurality of independent processor cores.
5. The multi-core processor of claim 1, wherein an interrupt signal
is provided in response to execution of a debug breakpoint
instruction in one of the plurality of independent processor
cores.
6. The multi-core processor of claim 1, wherein an interrupt signal
is provided in response to detection of an instruction or data
breakpoint match in one of the plurality of independent processor
cores.
7. The multi-core processor of claim 1, wherein the global
interrupt-signal interconnect comprises a plurality of independent
global interrupt-signal interconnects, each of the independent
global interrupt-signal interconnects representing a respective
interrupt signal.
8. The multi-core processor of claim 1, further comprising a trace
buffer coupled to the global interrupt-signal interconnect, the
trace buffer being configured to monitor memory transactions of the
independent processor cores in response to an interrupt signal on
the global interrupt-signal interconnect.
9. The multi-core processor of claim 1, wherein each of the
plurality of independent processor cores comprises a respective
register storing information, the register configurable according
to the sampled interrupt signal.
10. The multi-core processor of claim 1, further comprising a
core-processor clock signal for coordinating execution of the
instructions, wherein the interrupt-signal sensor samples the
global interrupt-signal interconnect during each cycle of the
core-processor clock signal.
11. The multi-core processor of claim 10, wherein each processor
core respectively interrupts its execution of instructions within
three core-processor clock cycles of sampling an interrupt signal
on the global interrupt-signal interconnect.
12. The multi-core processor of claim 1, wherein the global
interrupt-signal interconnect is used to communicate after the
plurality of processor cores are interrupted.
13. A method of debugging a multi-core processor comprising the
steps of: selectively providing an interrupt signal on a global
interrupt-signal interconnect, the global interrupt-signal
interconnect coupled to each of a plurality of processor cores
comprising the multi-core processor; sampling the provided
interrupt signal at each of the plurality of processor cores; and
interrupting execution of more than one of the plurality of
processor cores substantially simultaneously responsive to the
sensed interrupt signal.
14. The method of claim 13, wherein the interrupt signal is
selectively provided by one of the plurality of processor
cores.
15. The method of claim 14, wherein the interrupt signal is
provided in response to software control.
16. The method of claim 15, wherein the software control comprises
software writing a value to a register.
17. The method of claim 14, wherein the interrupt signal is
provided in response to execution of a debug breakpoint
instruction.
18. The method of claim 14, wherein the interrupt signal is
provided in response to a breakpoint match.
19. The method of claim 13, further comprising entering a debug
handler routine at each of the interrupted processor cores.
20. The method of claim 19, wherein each of the interrupted
processor cores communicates with an external device responsive to
entering the debug handler routine.
21. The method of claim 20, wherein each of the plurality of
processor cores communicates with the external device using a Joint
Test Action Group (JTAG) test access port.
22. The method of claim 20, further comprising using the global
interrupt-signal interconnect to communicate after the plurality of
processor cores are interrupted.
23. A multi-core processor comprising: means for selectively
providing an interrupt signal on a global interrupt-signal
interconnect, the global interrupt-signal interconnect coupled to
each of a plurality of processor cores comprising the multi-core
processor; means for sensing the provided interrupt signal at each
of the plurality of processor cores; and means for interrupting
execution of more than one of the plurality of processors
substantially simultaneously responsive to a sensed interrupt
signal.
Description
RELATED APPLICATION
[0001] This application claims the benefit of U.S. Provisional
Application No. 60/609,211, filed on Sep. 10, 2004. The entire
teachings of the above application are incorporated herein by
reference.
BACKGROUND OF THE INVENTION
[0002] Complex computer systems and programs rarely work exactly as
designed. During the development of a new computer system,
unexpected errors or bugs may be discovered by thorough testing and
exhaustive execution of a variety of programs and applications. The
source or cause of an error is often not apparent from the error
itself; many times, an error manifests itself by locking up the target
system for no apparent reason. Thus, tracking down the source of
the error can be problematic.
[0003] Software and system developers commonly use tools referred
to as debuggers to identify the source of unexpected errors and to
assist in their resolution. A debugger is a software program used
to break (i.e., interrupt) program execution at one or more
locations in an application program. Once interrupted, a user is
presented with a debugger command prompt for entering debugger
commands that will allow for setting breakpoints, displaying or
changing memory, single stepping, and so forth. Often, processors
include onboard features accessible by a debugger to facilitate
access to and operation of the processor during debugging.
[0004] One of the most difficult tasks facing designers of embedded
systems today is emulating and debugging embedded hardware and
software in a real-world environment. Embedded systems are growing
more complex, offering increasingly higher levels of performance,
and using larger software programs than ever before. To meet the
challenges of dealing with embedded systems, engineers and
programmers seek advanced tools that enable appropriate levels of
debugging.
[0005] Tracking down problems is particularly challenging when the
target system includes a multi-core processor. Multi-core
processors include two or more processor cores that are each
capable of simultaneously executing independent programs.
[0006] Using standard debug features that may be provided with the
individual processor cores of the multi-core processor can provide
insight into operation of the individual processor cores. Assessing
operation of parallel applications being developed and executed on
the multi-core processor system by debugging an individual
processor core will generally be inadequate. Namely, if an
operation of a first processor is interrupted as described above,
the other processors will continue to operate, thereby changing the
state of the system with each subsequent clock cycle as measured
from the moment of interrupt.
SUMMARY OF THE INVENTION
[0007] A multi-core processor includes a global interrupt
capability that selectively breaks operation of more than one of
the multiple processor cores at substantially the same time,
usually within a few clock cycles. A global interrupt-signal
interconnect is coupled to each of the plurality of independent
processor cores. Each of the processor cores includes an
interrupt-signal sensor for sampling an interrupt signal on the
global interrupt-signal interconnect and an interrupt-signal generator for
selectively providing an interrupt signal. Each processor core
respectively interrupts its execution of instructions in response
to sampling an interrupt signal on the global interrupt-signal
interconnect.
[0008] The respective interrupt-signal generator of each of the
plurality of independent processor cores is coupled to the global
interrupt-signal interconnect. Outputs from the respective
interrupt-signal generators can be coupled together and further to
the global interrupt-signal interconnect in a wired-OR
configuration. Thus, each of the processor cores can individually
assert an interrupt signal on the same global interrupt-signal
interconnect.
[0009] The multi-core processor can further include an interface
adapted to connect to an external device. For example, the
interface can be defined by a Joint Test Action Group (JTAG)
interface. In some embodiments, more than one global
interrupt-signal interconnect is provided. In such a
configuration, each of the global interrupt-signal interconnects
can represent a different interrupt signal. Additionally,
information that may be relevant to debugging the multi-core
processor can be provided by a combination of signals asserted on
the multiple interrupt-signal interconnects.
[0010] In some embodiments, the plurality of independent processor
cores resides on a single semiconductor die. The independent
processor cores can be Reduced Instruction Set Computer (RISC)
processors. Alternatively or in addition, each of the multiple
independent processor cores includes a respective register storing
information configurable according to the sampled interrupt
signal.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] The foregoing and other objects, features and advantages of
the invention will be apparent from the following more particular
description of preferred embodiments of the invention, as
illustrated in the accompanying drawings in which like reference
characters refer to the same parts throughout the different views.
The drawings are not necessarily to scale, emphasis instead being
placed upon illustrating the principles of the invention.
[0012] FIG. 1 is a block diagram of a security appliance including
a network services processor according to the principles of the
present invention;
[0013] FIG. 2 is a block diagram of the network services processor
shown in FIG. 1;
[0014] FIGS. 3A and 3B are block diagrams illustrating exemplary
embodiments of a multi-core debug architecture;
[0015] FIG. 4 is a more detailed block diagram of one of the
processor cores shown in FIGS. 3A and 3B;
[0016] FIG. 5 is a schematic block diagram of a debug register;
[0017] FIG. 6 is a schematic block diagram of the Multi Core Debug
(MCD) register;
[0018] FIG. 7A is a schematic diagram of an exemplary Test Access
Port (TAP) controller;
[0019] FIG. 7B is a more-detailed block diagram illustrating the
interconnection of the TAPs among the multiple processor cores
shown in FIGS. 3A and 3B; and
[0020] FIG. 8 is a more-detailed block diagram of the debug
architecture within one of the processor cores shown in FIGS. 3A
and 3B.
DETAILED DESCRIPTION OF THE INVENTION
[0021] A description of preferred embodiments of the invention
follows.
[0022] Applications for multi-core processors are as limitless as
applications that use a single microprocessor. Some applications
that are particularly well suited for multi-core processors include
telecommunications and networking. Having multiple processor cores
enables a single sizeable task to be broken down into several
smaller, more manageable subtasks, each subtask being executed on a
different core processor. Breaking down large tasks in this way
typically simplifies the overall processing of complex, high-speed
data manipulations, such as those used in data security.
[0023] A debugging system for multi-core processors is provided to
facilitate debugging these parallel applications executing on
several independent processor cores. This is accomplished, at least
in part, by generating internal trigger events from one or more of
the multiple processor cores. These multiple trigger events can be
transmitted to an external debug console using a debug interface
having relatively few I/O signal lines. Preferably the debug
interface is separate from the processor core's memory interface
(e.g., the Dynamic Random Access Memory (DRAM) interface) to avoid
interference with the parallel application. A separate debug
interface also allows a majority of the hardware for the debug
interface to remain useable during normal processing of the
multi-core processors.
[0024] Combining multiple processor cores in a single system leads
to a closer placement of cores with respect to each other. Reducing
separation between the processor cores generally reduces
propagation delay, thereby increasing communication speed between
them. In some embodiments, the processor cores are provided within
the same central processing unit and are interconnected using
cables. Alternatively or in addition, some of the processor cores
can be interconnected in the same socket, e.g., plugged into a
common processor socket on a motherboard. In some applications, the
multiple processors are provided together on the same semiconductor
die.
[0025] With different processor cores operating cooperatively to
implement a common function, such as packet processing in a
high-speed packet processor, it may be necessary to examine the
state of more than one of the multiple processor cores during any
debugging activity. Thus, it would be beneficial to interrupt the
multiple processor cores at substantially the same time, thereby
allowing examination of the register contents and memory values
attributable to any of the multiple processor cores. Once
interrupted, operation of the multiple processor cores can be
stepped sequentially, in unison according to operation in a debug
mode. This special class of fast interrupts is referred to herein
as Multi-Core Debug (MCD) interrupts.
[0026] To facilitate the very fast debug interrupt, a separate,
high-speed interrupt-signal interconnect is provided. This separate
signal interconnect allows for substantially simultaneous
interruption of more than one of the multiple processor cores. For
example, a global signal interconnect is coupled to each of the
processor cores. Each of the processor cores, in turn, is
configured to selectively provide an interrupt signal, or pulse, on
the global signal interconnect. Preferably, each of the processor
cores is capable of pulsing the global signal interconnect during
any cycle of the processor clock. Additionally, each of the
processor cores samples the global signal interconnect to determine
whether any processor core has provided an interrupt signal.
[0027] Each of the multiple processor cores is connected to the
global signal interconnect, with each core being capable of
independently pulsing the signal interconnect. Once pulsed, the
processor cores sampling the signal interconnect receive the
interrupt substantially simultaneously. Using a logical OR
configuration of the contributed pulses from all of the multiple
processor cores provides the desired functionality (i.e., the
global signal interconnect is asserted if any one of the
interconnected processor cores asserts the interconnect).
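The wired-OR behavior just described can be sketched in software. The following C model is illustrative only, assuming a hypothetical 16-core arrangement; the names NUM_CORES and resolve_wired_or are not from the disclosure. It models one clock cycle: the global line resolves to a logical one if any core pulses it, and every core samples the same resolved value.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_CORES 16                 /* core count is an assumption */

    /* One simulated clock cycle of a global MCD wire: the line resolves
     * to the logical OR of every core's pulse output (wired-OR), and
     * every core samples the same resolved value in that same cycle. */
    static bool resolve_wired_or(const bool pulse[NUM_CORES])
    {
        bool line = false;
        for (int i = 0; i < NUM_CORES; i++)
            line = line || pulse[i];     /* any single driver asserts the line */
        return line;
    }

    int main(void)
    {
        bool pulse[NUM_CORES] = { false };
        pulse[3] = true;                 /* core 3 hits a debug trigger event */

        if (resolve_wired_or(pulse))     /* sampled by all cores together */
            for (int i = 0; i < NUM_CORES; i++)
                printf("core %d: interrupt sampled, entering debug mode\n", i);
        return 0;
    }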
[0028] Each of the processor cores includes respective debug
circuitry with supporting extensions that enable concurrent
multi-core debugging. The debug circuitry is responsive to the
global signal interconnect being asserted.
[0029] More generally, a multi-core processor architecture includes
multiple global signal interconnects. Each of the multiple
interconnects is independently configured as the single
interconnect described above. Thus, each of the global signal
interconnects can be asserted (i.e., pulsed) by any of the multiple
processor cores as described above.
[0030] FIG. 1 is a block diagram of an exemplary security appliance
102 that includes a network services processor 100 according to the
principles of the present invention. The network services processor
100 is a multi-core processor. The security appliance 102 can be a
standalone system that switches packets received at one Ethernet
port (Gig E) to another Ethernet port (Gig E). Preferably, the
security appliance 102 also performs one or more security functions
related to the received packets prior to forwarding the packets.
For example, the security appliance 102 can be used to perform
security processing on packets received from a Wide Area Network
(WAN) 102 prior to forwarding the processed packets to a Local Area
Network (LAN) 103. Exemplary network services processors 100
adapted to perform such security processing can include hardware
packet processing, buffering, work scheduling, ordering,
synchronization, and coherence support to accelerate packet
processing tasks according to the principles of the present
invention.
[0031] The network services processor 100 generally processes
higher layer protocols. For example, the network services processor
100 processes one or more of the Open System Interconnection (OSI)
network L2-L7 layer protocols encapsulated in received packets. As
is well-known to those skilled in the art, the OSI reference model
defines seven network protocol layers: Layers 1-7 (referred to
herein as L1-L7). The physical layer (L1) represents an actual
physical interface. Namely, the electrical and physical attributes
that enable a device to be connected to a transmission medium. The
data link layer (L2) performs data framing. The network layer (L3)
formats the data into packets. The transport layer (L4) handles end
to end transport. The session layer (L5) manages communications
between devices, for example, whether communication is half-duplex
or full-duplex. The presentation layer (L6) manages data formatting
and presentation, for example, syntax, control codes, special
graphics and character sets. The application layer (L7) permits
communication between users, for example, file transfer and
electronic mail.
[0032] To support multiple interconnects, the network services
processor 100 includes a number of interfaces. For example, the
network services processor 100 includes a number of Ethernet Media
Access Control interfaces with standard Reduced Gigabit Media
Independent Interface (RGMII) connections to off-chip destinations
using physical interfaces (PHYs) 104a, 104b.
[0033] In operation, the network services processor 100 receives
packets from one or more external destinations at one or more
respective Ethernet ports (Gig E) through the physical interfaces
PHY 104a, 104b. The network services processor 100 then selectively
performs L7-L2 network protocol processing on the received packets,
forwarding processed packets through the physical interfaces 104a,
104b. The processed packets may be forwarded to another "hop" in
the network, to their final destination, or through a local
communications bus for further processing by a host processor. The
local communications bus can be any one of a number of industry
standard busses, such as a Peripheral Component Interconnect (PCI)
bus 106 or a PCI Extended (PCI-X) bus. Other PC busses include
Integrated Systems Architecture (ISA), Extended ISA (EISA), Micro
Channel, VL-bus, NuBus, TURBOchannel, VMEbus, MULTIBUS, STD bus,
and proprietary busses. Further, the network protocol processing
can include processing of network security protocols such as
Firewall, Application Firewall, Virtual Private Network (VPN)
including IP Security (IPSec) and/or Secure Sockets Layer (SSL),
Intrusion Detection System (IDS) and Anti-Virus (AV).
[0034] A DRAM controller in the network services processor 100
controls access to an external Dynamic Random Access Memory (DRAM)
108 that is coupled to the network services processor 100. The DRAM
108 stores data packets received from the PHY interfaces 104a, 104b
or from a local communications bus, such as the PCI-X interface 106,
for processing by the network services processor 100. In one
embodiment, the DRAM interface supports 64 or 128 bit Double Data
Rate II Synchronous Dynamic Random Access Memory (DDR II SDRAM)
operating at speeds up to and including 800 MHz.
[0035] A boot bus 110 can be provided, such that the necessary boot
code is accessible allowing the network services processor 100 to
execute the boot code upon power-on and/or reset. Generally, the
boot code is stored in a memory, such as a flash memory 112.
Application code can also be loaded into the network services
processor 100 over the boot bus 110. For example, application code
can be loaded from a device 114 implementing the Compact Flash
standard, or from another high-volume device, such as a disk,
attached via the PCI bus.
[0036] A miscellaneous I/O interface 116 offers auxiliary
interfaces such as General Purpose Input/Output (GPIO), Flash, IEEE
802 two-wire Management Interface (MDIO), Universal Asynchronous
Receiver-Transmitters (UARTs), and serial interfaces.
[0037] The network services processor 100 can include another
memory controller for controlling Low latency DRAM 118. The low
latency DRAM 118 can be used for Internet Services and Security
applications, thereby allowing fast lookups, including the
string-matching that may be required for Intrusion Detection System
(IDS) or Anti Virus (AV) applications.
[0038] FIG. 2 is a more-detailed block diagram of an exemplary
network services processor 100, such as the one shown in FIG. 1. As
discussed above, the network services processor 100 can be adapted
to deliver high application performance by including multiple
processor cores 202. Network operations can be categorized into
data plane operations and control plane operations. A data plane
operation includes packet operations for forwarding packets. A
control plane operation includes processing of portions of complex
higher level protocols such as Internet Protocol Security (IPSec),
Transmission Control Protocol (TCP) and Secure Sockets Layer (SSL).
Advantageously, in such a network application, selective processor
cores 202 can be dedicated to performing respective data plane or
control plane operations. A data plane operation can include
processing of other portions of these complex higher level
protocols.
[0039] A packet input unit 214 can be used to allocate and create a
work queue entry for each packet. The work queue entry, in turn,
contains a pointer to a buffered packet temporarily stored in
memory, such as Level-2 cache 212 or DRAM 108 (FIG. 1).
[0040] Packet Input/Output processing is performed by a respective
interface unit 210a, 210b, a packet input unit (Packet Input) 214,
and a packet output unit (PKO) 218. The input controller 214 and
interface units 210a, 210b can perform parsing of received packets
and checking of results to offload the processor cores 202.
[0041] A packet is received by any one of the interface units 210a,
210b (generally 210) through a predefined interface, such as a
System Packet Interface SPI-4.2 (e.g., SPI-4 phase 2 standard of
the Optical Internetworking Forum) or an RGMII interface. A packet
can also be received by a PCI interface 224. The interface unit
210a, 210b handles L2 network protocol pre-processing of the
received packet by checking various fields in the L2 network
protocol header included in the received packet. After the
interface unit 210 has performed L2 network protocol processing,
the packet is forwarded to the packet input unit 214. The
pre-processed packet can be forwarded over an input/output (I/O)
bus, such as I/O bus 225. The packet input unit 214 can be used to
perform additional pre-processing, such as pre-processing of L3 and
L4 network protocol headers included in the received packet. The
pre-processing can include checksum checks for Transmission Control
Protocol (TCP)/User Datagram Protocol (UDP) (L4 network
protocols).
[0042] The packet input unit 214 writes packet data into buffers in
Level-2 cache 212 or DRAM 108 (FIG. 1) in a format that is
convenient to higher-layer software executed in at least one
processor core 202 for further processing of higher level network
protocols. The packet input unit 214 can support a programmable
buffer size and can distribute packet data across multiple buffers
to support large packet input sizes.
[0043] The Packet order/work (POW) module (unit) 228 queues and
schedules work (i.e., packet processing operations) for the
processor cores 202. Work can be defined to be any task to be
performed by a processor core 202 that is identified by an entry on
a work queue. The task can include packet processing operations,
for example, packet processing operations for L4-L7 layers to be
performed on a received packet identified by a work queue entry on
a work queue. Each separate packet processing operation is a piece
of the work to be performed by a processor core 202 on the received
packet stored in memory. For example, the work can be the
processing of a received Firewall/Virtual Private Network (VPN)
packet. The processing of a Firewall/VPN packet includes the
following separate packet processing operations (i.e., pieces of
work): (1) defragmentation to reorder fragments in the received
packet; (2) IPSec decryption; (3) IPSec encryption; and (4) Network
Address Translation (NAT) or TCP sequence number adjustment prior
to forwarding the packet.
[0044] The POW module 228 selects (i.e., schedules) work for a
processor core 202 and returns a pointer to the work queue entry
that describes the work to the processor core 202. Each piece of
work (i.e., a packet processing operation) has an associated group
identifier and a tag.
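As a rough illustration of the work queue entry just described, the following C struct collects the fields named above (a pointer to the buffered packet, a group identifier, and a tag). The field names and widths are assumptions for illustration, not the actual hardware layout.

    #include <stdint.h>

    /* Hypothetical layout of a work queue entry: a pointer to the
     * buffered packet in Level-2 cache or DRAM, plus the scheduling
     * metadata the POW module 228 uses to pick a processor core. */
    typedef struct {
        void    *packet;   /* buffered packet temporarily stored in memory */
        uint16_t group;    /* group identifier associated with the work    */
        uint32_t tag;      /* tag ordering/synchronizing related work      */
    } work_queue_entry_t;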
[0045] Prior to describing the operation of the processor cores 202
in further detail, the other modules in the network services processor 100 will
be described. After the packet has been processed by the processor
cores 202, a packet output unit (PKO) 218 reads the packet data
from Level-2 cache 212 or DRAM 108 (FIG. 1), performs L4 network
protocol post-processing (e.g., generates a TCP/UDP checksum),
forwards the packet through the interface unit 210 and frees the
Level-2 cache 212 or DRAM 108 locations used to store the
packet.
[0046] The network services processor 100 can also include
application specific co-processors that offload the processor cores
202 so that the network services processor 100 achieves a
high-throughput. The application specific co-processors can include
a DFA co-processor 244 that performs Deterministic Finite Automata
(DFA) and a compression/decompression co-processor 208 that
performs compression and decompression. Other co-processors include
a Random Number Generator (RNG) 246 and a timer unit 242. The timer
unit 242 is particularly useful for TCP applications.
[0047] Each processor core 202 can include a dual-issue,
superscalar processor with a respective instruction cache 206, a
respective Level-1 data cache 204, and respective built-in hardware
acceleration (e.g., a crypto acceleration module) 200 for
cryptography algorithms with direct access to low latency memory
over the low latency memory bus 230. The low-latency, direct-access
path to low-latency memory 118 (FIG. 1) bypasses the Level-2
cache memory 212 and can be accessed directly from both the
processor cores 202 and the DFA co-processor 244.
[0048] The network services processor 100 also includes a memory
subsystem. The memory subsystem includes the respective Level-1
data cache memory 204 of each of the processor cores 202,
respective instruction cache 206 in each of the processor cores
202, a Level-2 cache memory 212, a DRAM controller 216 for external
DRAM memory 108 (FIG. 1), and an interface, such as a low-latency
bus 230 to external low latency memory (not shown).
[0049] The memory subsystem is configured to support the multiple
processor cores 202 and can be tuned to deliver both the
high-throughput and the low-latency required by memory-intensive,
content-networking applications. Level-2 cache memory 212 and
external DRAM memory 108 (FIG. 1) are shared by all of the
processor cores 202 and I/O co-processor devices.
[0050] Each of the processor cores 202 can be coupled to the
Level-2 cache by a local bus, such as a coherent memory bus 234.
Thus, the coherent memory bus 234 can represent the communication
channel for memory and I/O transactions between the processor cores
202, an I/O Bridge (IOB) 232, and the Level-2 cache and controller
212.
[0051] A Free-Pool Allocator (FPA) 236 maintains pools of pointers
to free memory in Level-2 cache memory 212 and DRAM 108. A
bandwidth efficient (Last-In-First-Out (LIFO)) stack is implemented
for each free pointer pool.
[0052] The I/O Bridge 232 manages the overall protocol and
arbitration and provides coherent I/O partitioning. The I/O Bridge
232 includes a bridge 238 and a Fetch-and-Add Unit (FAU) 240. The
bridge 238 includes buffer queues for storing information to be
transferred between the I/O bus 225, coherent memory bus 234, the
packet input unit 214 and the packet output unit 218.
[0053] The Fetch-and-Add Unit 240 includes a 2 kilobyte (KB)
register file supporting read, write, atomic fetch-and-add, and
atomic update operations. The Fetch-and-Add Unit 240 can be
accessed from both the cores 202 and the packet output unit 218.
The registers store highly-used values and thus reduce traffic to
access these values. Registers in the Fetch-and-Add Unit 240 are
used to maintain lengths of the output queues that are used for
forwarding processed packets through the packet output unit
218.
[0054] The PCI interface controller 224 has a Direct Memory Access
(DMA) engine that allows the processor cores 202 to move data
asynchronously between local memory in the network services
processor 100 and remote (PCI) memory (not shown) in both
directions.
[0055] In some embodiments, a key memory (KEY) 248 is provided. The
key memory 248 is a protected memory coupled to the I/O Bus 225
that can be written/read by the processor cores 202. For example,
the key memory can include error checking and correction. ECC will
report single and double bit errors and repair single bit errors.
The memory is a single-port memory that can be provided with write
precedence. In some embodiments, the key memory 248 can be used to
temporarily store Loads, Stores, and I/O pre-fetches.
A Miscellaneous Input/Output (MIO) unit 226 can also be
coupled to the I/O bus 225 to provide interface support for one or
more external devices. For example, the MIO unit 226 can support
one or more interfaces to a Universal Asynchronous
Receiver/Transmitter (UART), to a boot bus, to a General Purpose
Input/Output (GPIO) interface for communicating with peripheral
devices (not shown), and more generally to a Field-Programmable
Gate Array (FPGA) for interfacing with external devices. For
example, an FPGA can be used to interface to external Ternary
Content-Addressable Memory (TCAM) hardware providing fast-lookup
performance. In particular, the MIO 226 can provide an interface to
an external debugger console described below.
[0057] The processor core 202 supports multiple operational modes
including: user mode, kernel mode, and debug mode. User mode is
most often employed when executing applications programs (e.g., the
internal flow of program control). Kernel mode is typically used
for handling exceptions and operating system kernel functions,
including management of any related coprocessor and Input/Output
(I/O) device access. Debug mode is a special operational mode
typically used by software developers to examine variables and
memory locations, to stop code execution at predefined break
points, and to step through the code one line or unit at a time,
usually while monitoring variables and memory locations. Debug mode
is also different from other operational modes in that there are
substantially no restrictions on access to coprocessors and memory
areas. Additionally, while in Debug mode, the usual exceptions like
address error and interrupt are masked.
[0058] A multi-core processor 100 configured for debugging parallel
applications executing on more than one independent processor core
is shown in FIGS. 3A and 3B. For example, the multi-core processor
100 includes three separate global signal interconnects: MCD_0,
MCD_1, and MCD_2. Each of the three global signal interconnects
MCD_0, MCD_1, and MCD_2 is coupled to each of the multiple
processor cores 202a, 202b, . . . 202n (generally 202). Each
processor core 202, in turn, includes circuitry configured to
assert a global interrupt signal on one or more of the global
signal interconnects. Preferably, each of the processor cores 202
is configured to independently and selectively assert (i.e., pulse)
an interrupt on one or more of the global signal interconnects
MCD_0, MCD_1, and MCD_2.
[0059] Each processor core 202 also includes sensing circuitry
configured to sample each of the global signal interconnects to
determine the presence of an asserted interrupt. Preferably, each
of the processor cores 202 independently samples the global signal
interconnects MCD_0, MCD_1, and MCD_2 to determine whether an
interrupt has been asserted, and on which of the several global
signal interconnects the interrupt has been asserted--interrupts
can be asserted on more than one of the global signal interconnects
at a time. The global signal interconnects MCD_0, MCD_1, and MCD_2
are preferably sampled continuously, or at least once during each
clock cycle to determine the presence of an interrupt. The sensing
circuitry can include a register into which the state of the global
signal interconnect is latched. For example, a register is
configured to store one bit for each of the multiple global signal
interconnects, the value of the stored bit indicative of the state
of the respective global signal interconnect.
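A minimal C sketch of the sensing register just described, assuming one sticky bit per interconnect (bit 0 for MCD_0, bit 1 for MCD_1, bit 2 for MCD_2); the type and function names here are illustrative, not from the disclosure.

    #include <stdint.h>

    /* Per-core latch holding one bit per global signal interconnect.
     * Called once per core clock cycle with the sampled line levels;
     * a bit, once set, stays set until software clears it. */
    typedef struct {
        uint8_t mcd_state;                      /* bit n = state of MCD_n */
    } core_sense_t;

    static void sample_interconnects(core_sense_t *c,
                                     int mcd0, int mcd1, int mcd2)
    {
        c->mcd_state |= (uint8_t)((mcd0 != 0) << 0);
        c->mcd_state |= (uint8_t)((mcd1 != 0) << 1);
        c->mcd_state |= (uint8_t)((mcd2 != 0) << 2);
    }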
[0060] Having more than one global signal interconnect MCD_0,
MCD_1, and MCD_2 coupled to each of the processor cores 202 can
provide additional information. For example, three wires, each
capable of being independently pulsed between two states (e.g., a
logical Low or "0" and a logical High or "1"), can together provide
information corresponding to one of up to eight different messages
(i.e., 2^3=8).
[0061] Alternatively, or in addition, the global signal
interconnects can be used to communicate with the processor cores
202 once interrupted. An external debug console 325 hosting a
debugger application and providing a user interface can be
interconnected to one or more of the processor cores 202 to
facilitate debugging of the system 100. Preferably, the global
signal interconnects are accessible by the debugger. For example, a
debugger can assert a pulse on MCD_1 to instruct the processor
cores 202 to check their mailbox location (e.g., in main memory)
for an instruction from the debugger. The debugger can assert a
pulse on MCD_2 to restart all processor cores 202 after a
multi-core interrupt. Thus, usage of the global signal
interconnects can minimize disruption of the state contained in the
processor cores 202 and in the system 100, while the debugger
examines it. This capability can be very useful to isolate the
cause of bugs in parallel applications.
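The mailbox convention sketched above might look as follows in C. The helpers execute_debug_command, the mailbox layout, and the latched-state word are hypothetical stand-ins for the debugger plumbing, not an actual documented API.

    #include <stdint.h>
    #include <stdio.h>

    enum { MCD_0 = 1u << 0, MCD_1 = 1u << 1, MCD_2 = 1u << 2 };

    /* Hypothetical stand-in for debugger plumbing. */
    static void execute_debug_command(int core_id, uint32_t cmd)
    {
        printf("core %d: running debugger command %u\n", core_id, cmd);
    }

    /* Each interrupted core spins here: a pulse sampled on MCD_1 means
     * "check your mailbox location in main memory for an instruction";
     * a pulse on MCD_2 means "restart after the multi-core interrupt". */
    static void debug_wait_loop(int core_id, volatile uint32_t *mcd_state,
                                const volatile uint32_t *mailbox)
    {
        for (;;) {
            uint32_t s = *mcd_state;
            *mcd_state = 0;                 /* clear the latched pulses */
            if (s & MCD_1)
                execute_debug_command(core_id, mailbox[core_id]);
            if (s & MCD_2)
                return;                     /* leave debug mode, resume */
        }
    }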
[0062] The processor cores 202 are each coupled to the one or more
global signal interconnects MCD_0, MCD_1, and MCD_2 in a respective
"wired-OR" fashion. Thus, respective interrupt-signal generators of
each of the processor cores 202 are all interconnected at a first
wired-OR 310a, further connected to the first global signal
interconnect MCD_0. Second and third wired-ORs 310b, 310c are
provided to similarly interconnect the processor cores 202 to the
second and third global signal interconnects MCD_1 and MCD_2,
respectively.
[0063] Thus, should one or more of the processor cores 202 assert a
pulse on any of its respective interrupt-signal generator outputs
(e.g., to wired-OR 310a), the pulse will be asserted on the
respective global signal interconnect (e.g., MCD_0). Preferably, a
pulse can be asserted during any cycle. Using the wired-OR 310a,
310b, 310c (generally 310) provides the desired logic allowing any
of the processor cores 202 to drive the global interconnect signal
(i.e., the wired-OR providing an output="1" if any of its
inputs="1"), while also minimizing any corresponding delay. Once a
pulse is asserted on the global signal interconnects, all of the
processor cores 202 sample it, allowing the cores 202 to be
interrupted very quickly--at the same time, or at least within a
few cycles of the processor clock. Such a rapid interrupt of all of
the processor cores 202 preserves the entire state of the parallel
application at the time of the interrupt for examination by the
debugger.
[0064] In other embodiments, each of the processor cores 202 can be
interconnected to the global signal interconnects MCD_0, MCD_1, and
MCD_2 using combinational logic, such as a logical OR gate. Such
logic, however, represents additional complexity generally
resulting in a corresponding delay (e.g., a gate delay due to
synchronous logic, and/or a rise time delay due to the capacitance
of the logic circuitry).
[0065] Each processor core 202 provides an exception handler.
Generally, an exception refers to an error or other special
condition detected during normal program execution. The exception
handler can interrupt the normal flow of program control in
response to receiving an exception. For example, a debug exception
handler halts normal operation in response to receiving a debug
interrupt. The exception handler then passes control to a debug
handler, or software program, that controls operation in debug
mode.
[0066] Some exemplary exception types include a Debug Single Step
(DSS) exception resulting in single step execution in debug mode. A
general Debug Interrupt (DINT) results in entry of debug mode and
can be caused by the assertion of an external interrupt (e.g.,
EJ_DINT), or by setting a related bit in a debug register. An
interrupt can result from assertion of an unmasked hardware or
software interrupt signal. A debug hardware instruction break
matched (DIB) exception results in entry of debug mode when an
instruction matches a predetermined instruction breakpoint.
Similarly, a debug breakpoint instruction (DBp) results in entry of
debug mode upon execution of a special instruction (e.g., a
software debug breakpoint instruction, such as the EJTAG "SDBBP"
instruction that places a processor into debug mode and fetches
associated handler code from memory). A Data Address Break (address
only) or Data Value (e.g., DDBL/DDBS) results in entry of debug
mode when a particular memory address is accessed, or a particular
value is written to/read from memory.
[0067] Each of the processor cores 202 includes respective onboard
debug circuitry 318. As shown in FIG. 3A, each of the multiple
processor cores 202 can include a respective core Test Access Port
(TAP) 320', 320'', 320''' (generally 320) for accessing the
respective debug circuitry 318. The core TAPs 320 are connected to
one system TAP 330. As shown, each of the respective core TAPs 320
and the system TAP 330 can be interconnected in a daisy chain
configuration. Additionally, the debug circuitry 318 of all of the
interconnected processor cores 202 can be coupled to the external
debug console 325.
[0068] Once in debug mode, the debug control console can be used to
inspect the values stored in registers and memory locations. The
debug control console provides a software program that communicates
with the onboard debug circuitry 318 to accomplish inspection of
stored values, setting of breakpoints, stopping, restarting and
sequentially stepping each of the processor cores 202 in
unison.
[0069] Alternatively, or in addition, each of the processor cores
202 can be coupled to the external debug console 325 through one or
more Universal Asynchronous Receiver-Transmitter (UART) devices
that include receiving and transmitting circuits for asynchronous
serial communications, as shown in FIG. 3B. In one embodiment, the
multi-core processor 100 includes two UART devices 335a, 335b
(generally 335) used to control serial data transmission and
reception between the processor cores 202 and external devices,
such as the external debug console 325. The UART devices 335 can be
included within the Miscellaneous I/O unit (FIG. 2). Thus, each
processor core 202 can communicate with another device, such as the
external debug console 325, through a respective memory bus
interface 340 using one or more of the UART devices 335 accessible
through the I/O bridge 238. Advantageously, communicating with the
external debug console 325 using the UART device 335 removes
constraints that would have otherwise been imposed by using a
standard interface, such as the JTAG TAP interface (FIG. 3A).
[0070] The multi-core processor 100 optionally includes a trace
buffer 610 (shown in phantom) for selectively monitoring memory
transactions of the processor cores 202. For example, the trace
buffer 610 is coupled to the coherent memory bus 234 to monitor
transactions thereon. Generally, the trace buffer 610 stores
information that can be used to assist in any debugging activity.
For example, the trace buffer 610 can be configured to store the
last "N" transactions on the bus, the (N+1)st transaction being
dumped as a new transaction occurs. Further, when using a single
trace buffer 610, identification tags can be used to identify the
particular core processor 202 associated with each stored
transaction.
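The last-N behavior described above can be modeled as a ring buffer. In this C sketch the depth, field names, and widths are assumptions; it also carries the identification tag that ties each stored transaction to its core processor 202.

    #include <stdint.h>

    #define TRACE_DEPTH 64          /* "N"; the actual depth is an assumption */

    typedef struct {
        uint8_t  core_id;           /* identification tag: issuing core */
        uint64_t address;           /* coherent-bus transaction address */
        uint64_t data;
    } trace_entry_t;

    typedef struct {
        trace_entry_t entry[TRACE_DEPTH];
        unsigned head;              /* next slot; oldest entry gets dumped */
    } trace_buffer_t;

    /* Record one memory transaction, overwriting the oldest when full. */
    static void trace_record(trace_buffer_t *tb, uint8_t core_id,
                             uint64_t addr, uint64_t data)
    {
        tb->entry[tb->head] = (trace_entry_t){ core_id, addr, data };
        tb->head = (tb->head + 1) % TRACE_DEPTH;
    }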
[0071] Beneficially, the trace buffer 610 is also coupled to each
of the one or more global signal interconnects MCD_0, MCD_1, and
MCD_2, and configured with sensing circuitry sampling any pulses
asserted on the global signal interconnects. The trace buffer 610
also includes a trigger that initiates the starting and/or stopping
of monitoring in response to sampling an interrupt signal on the
global signal interconnect. Although a single trace buffer 610
supporting multiple core processors 202 is illustrated, other
configurations are possible. For example, multiple trace buffers
610 can be provided with each trace buffer 610 respectively
corresponding to one of the multiple core processors 202.
Additionally, the trace buffer 610 can be on-chip, as shown, or
off-chip and accessible by a probe.
[0072] Alternatively or in addition, the trace buffer 610 includes
circuitry configured to assert a global interrupt signal on one or
more of the global signal interconnects MCD_0, MCD_1, and MCD_2. As
shown, the trace buffer 610 can be coupled to the global signal
interconnects MCD_0, MCD_1, and MCD_2 through the wired-OR circuits
310. In this configuration, the trace buffer 610 can selectively
assert a global interrupt signal on one or more of the global
signal interconnects MCD_0, MCD_1, and MCD_2, thereby interrupting
more than one of the multiple processor cores 202 in response to
activity on the coherent memory bus 234.
[0073] FIG. 4 is a more detailed block diagram of an exemplary
processor core 202 shown in FIGS. 3A and 3B. In general, a
processor core 202 interprets and executes instructions. In some
embodiments, the processor core 202 is a Reduced Instruction Set
Computing (RISC) processor core. In more detail, the processor
core 202 includes an execution unit 400, an instruction dispatch
unit 402, an instruction fetch unit 404, a load/store unit 416, a
Memory Management Unit 406, a system interface 408, a write buffer
420 and security accelerators 200. The processor core 202 also
includes debug circuitry 318 allowing debug operations to be
performed. The system interface 408 controls access to external
memory, that is, memory external to the processor core 202, such as
the L2 cache memory described in relation to FIG. 2.
[0074] Still referring to FIG. 4, the execution unit 400 includes a
multiply/divide unit 412 and at least one register file 414. The
multiply/divide unit 412 has a 64-bit register-direct multiply. The
instruction fetch unit 404 includes Instruction Cache (ICache) 206.
The load/store unit 416 includes Data Cache (DCache) 204. A portion
of the data cache 204 can be reserved as local scratch pad/local
memory 422. In one embodiment, the instruction cache 206 is 32
Kilobytes, the data cache 204 is 8 Kilobytes and the write buffer
420 is 2 Kilobytes. The memory management unit 406 includes a
Translation Lookaside Buffer (TLB) 410.
[0075] In one embodiment, the processor core 202 includes a crypto
acceleration module (security accelerators) 200 that includes
cryptography acceleration. For example, the cryptography
acceleration can include one or more of Triple Data Encryption
Standard (3DES), Advanced Encryption Standard (AES), Secure Hash
Algorithm (SHA-1), and Message Digest Algorithm #5 (MD5). The
crypto acceleration module 200 communicates by moves to and from
the main register file 414 in the execution unit 400. Particular
algorithms, such as Rivest-Shamir-Adleman (RSA) and
Diffie-Hellman (DH), can be implemented and are performed in the
multiply/divide unit 412.
[0076] In some embodiments, the multi-core processor 100 (FIG. 2)
includes a superscalar processor. A superscalar processor includes
a superscalar instruction pipeline that allows more than one
instruction to be completed each cycle of the processor's clock
period by allowing multiple instructions to be issued
simultaneously and dispatched in parallel to multiple execution
units 400. The RISC-type processor core 202 has an instruction set
architecture that defines instructions by which the programmer
interfaces with the RISC-type processor 202. In one embodiment, the
superscalar RISC-type core is an extension of the MIPS64 version 2
core. Only load-and-store instructions access external memory; that
is, memory external to the processor core 202. In one embodiment,
the external memory is accessed over a coherent memory bus 234
(FIG. 2). All other instructions operate on data stored in the
register file 414 within the execution unit 400 of the processor
core 202. In some embodiments, the superscalar processor can be a
dual-issue processor.
[0077] The instruction pipeline is divided into stages, each stage
taking one clock cycle to complete. Thus, in a five stage pipeline,
it takes five clock cycles to process each instruction and five
instructions can be processed concurrently with each instruction
being processed by a different stage of the pipeline in any given
clock cycle. Typically, a five stage pipeline includes the
following stages: fetch, decode, execute, memory and write
back.
[0078] During the fetch-stage, the instruction fetch unit 404
fetches an instruction from instruction cache 206 at a location in
instruction cache 206 identified by a memory address stored in a
program counter. During the decode-stage, the instruction fetched
in the fetch-stage is decoded by the instruction dispatch unit 402
and the address of the next instruction to be fetched for the
issuing context is computed. During the execute-stage, the
execution unit 400 performs an operation dependent on the type of
instruction. For example, the execution unit 400 begins the
arithmetic or logical operation for a register-to-register
instruction, calculates the virtual address for a load or store
operation, or determines whether the branch condition is true for a
branch instruction. During the memory-stage, data is aligned by the
load/store unit 416 and transferred to its destination in external
memory. During the write back-stage, the result of a
register-to-register or load instruction is written back to the
register file 414.
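The overlap described above can be visualized with a short C program. It is a generic illustration of a five-stage pipeline schedule, not a model of any Cavium-specific timing: it simply prints which instruction occupies which stage in each cycle.

    #include <stdio.h>

    enum { NUM_STAGES = 5, NUM_INSNS = 5 };
    static const char *stage_name[NUM_STAGES] =
        { "fetch", "decode", "execute", "memory", "write-back" };

    /* Instruction i occupies stage (t - i) in cycle t, so up to five
     * instructions are processed concurrently in a five-stage pipeline. */
    int main(void)
    {
        for (int t = 0; t < NUM_INSNS + NUM_STAGES - 1; t++) {
            printf("cycle %d:", t);
            for (int i = 0; i < NUM_INSNS; i++)
                if (t - i >= 0 && t - i < NUM_STAGES)
                    printf("  insn%d=%s", i, stage_name[t - i]);
            printf("\n");
        }
        return 0;
    }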
[0079] The system interface 408 is coupled via the coherent memory
bus 234 (FIG. 2) to external memory. In one embodiment, the
coherent memory bus 234 is 384 bits wide and includes four separate
buses: (i) an Address/Command Bus; (ii) a Store Data Bus; (iii) a
Commit/Response control bus; and (iv) a Fill Data bus. All store
data is sent to external memory over the coherent memory bus 234
via a write buffer entry in the write buffer 420. In one
embodiment, the write buffer 420 has 16 write buffer entries.
[0080] Store data flows from the load/store unit 416 to the write
buffer 420, and from the write buffer 420 through the system
interface 408 to external memory. The processor core 202 can
generate data to be stored in external memory faster than the
system interface 408 can write the store data to the external
memory. The write buffer 420 minimizes pipeline stalls by providing
a resource for storing data prior to forwarding the data to
external memory.
[0081] The write buffer 420 is also used to aggregate data to be
stored in external memory over the coherent memory bus 234 into
aligned cache blocks to maximize the rate at which the data can be
written to the external memory. Furthermore, the write buffer 420
can also merge multiple stores to the same location in external
memory resulting in a single write operation to external memory.
The write-merging operation of the write buffer 420 can result in
the order of writes to the external memory being different than the
order of execution of the store instructions.
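A rough C sketch of the merging policy just described, assuming 128-byte cache blocks and the 16-entry buffer mentioned earlier; data and byte-valid tracking are omitted, and the names are illustrative.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define WB_ENTRIES  16          /* matches the 16-entry write buffer */
    #define BLOCK_SHIFT 7           /* 128-byte cache blocks; an assumption */

    typedef struct {
        uint64_t block;             /* aligned cache-block address */
        bool     valid;             /* data and byte-valid mask omitted */
    } wb_entry_t;

    /* Merge a store into an existing entry for the same cache block when
     * possible, so several stores retire as one external write; otherwise
     * allocate a free entry, or report "full" (a pipeline stall). */
    static wb_entry_t *wb_insert(wb_entry_t wb[WB_ENTRIES], uint64_t addr)
    {
        uint64_t block = addr >> BLOCK_SHIFT;
        for (int i = 0; i < WB_ENTRIES; i++)
            if (wb[i].valid && wb[i].block == block)
                return &wb[i];      /* merge with a pending store */
        for (int i = 0; i < WB_ENTRIES; i++)
            if (!wb[i].valid) {
                wb[i] = (wb_entry_t){ block, true };
                return &wb[i];
            }
        return NULL;                /* full: stall until an entry drains */
    }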
[0082] The processor core 202 also includes an exception control
system providing circuitry for identifying and managing exceptions.
An exception refers to an interruption or change of the normal flow
of program control that occurs when an event or other special
condition is detected during execution. Exceptions can be caused by
a variety of sources, including boundary cases in data, external
events, or even program errors, being generated (i.e., "raised") by
hardware or software. Exemplary hardware exceptions include resets,
interrupts and signals from a memory management unit. Hardware
exceptions may be generated by an arithmetic logic unit or
floating-point unit for numerical errors such as divide by zero,
overflow or underflow, or instruction decoding errors such as
privileged, reserved, trap or undefined instructions. Software
exceptions are even more varied. For example, a software exception
can refer to any kind of error checking that alters the normal
behavior of the program. An exception transfers control from code
being executed at the instant of the exception to different code, a
routine commonly referred to as an exception handler.
[0083] A system co-processor can also be provided within the
processor core 202 for providing a diagnostic capability, for
controlling the operating mode (i.e., kernel, user, and debug), for
configuring interrupts as enabled or disabled, and for storing
other configuration information.
[0084] The processor core 202 also includes a Memory Management
Unit (MMU) 406 coupled to the instruction fetch unit 404 and the
load/store unit 416. The MMU 406 is a hardware device or circuit
that supports virtual memory and paging by translating virtual
addresses into physical addresses. Thus, the MMU 406 may receive a
virtual memory address from program instructions being executed on
the processor core 202. The virtual memory address is associated
with a read from or a write to physical memory. The MMU 406
translates the virtual address to a physical address to allow a
related physical memory location to be accessed by the program.
[0085] In a multitasking system, all processes compete for the use
of memory and of the MMU 406. In some memory management
architectures, however, each process is allowed to have its own
area or configuration of the page table, with a mechanism to switch
between different mappings on a process switch. This means that all
processes can have the same virtual address space rather than
require load-time relocation. To accomplish this task, the MMU 406
can include a Translation Lookaside Buffer (TLB) 410.
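A minimal C sketch of the translation step performed by the TLB 410; the entry count, page size, and fully associative search are assumptions for illustration only.

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64          /* entry count is an assumption */
    #define PAGE_SHIFT  12          /* 4 KB pages, also an assumption */

    typedef struct {
        uint64_t vpn;               /* virtual page number   */
        uint64_t pfn;               /* physical frame number */
        bool     valid;
    } tlb_entry_t;

    /* Translate a virtual address by searching the TLB; on a miss the
     * hardware would raise a refill exception for the OS to handle. */
    static bool tlb_translate(const tlb_entry_t tlb[TLB_ENTRIES],
                              uint64_t vaddr, uint64_t *paddr)
    {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *paddr = (tlb[i].pfn << PAGE_SHIFT)
                       | (vaddr & ((1ull << PAGE_SHIFT) - 1));
                return true;        /* hit */
            }
        return false;               /* miss */
    }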
[0086] The debug circuitry 318 on each processor core 202 can
include an onboard debug controller. Having an onboard debug
controller facilitates operation of the processor core 202 in the
debug mode. For example, the debug controller can allow for
single-step execution of the processor core 202. Further, the debug
controller can support breakpoints, enabling them to transition the
processor core 202 into debug mode. For example, the breakpoints
can be one or more of instruction breakpoints, data breakpoints,
and virtual address breakpoints.
[0087] In some embodiments, the onboard debug circuitry 318
includes standardized features. For example, the onboard debug
circuitry 318 can be compliant with the design philosophy of the
Joint Test Action Group (JTAG) interface--a popular standardized
interface defined by IEEE Standard 1149.1. In embodiments that
utilize MIPS processor cores, the onboard controller is referred to as
the standard MIPS Enhanced JTAG (EJTAG) debug circuitry 318.
[0088] Each processor core 202 includes one or more debug
registers, each register including one or more pre-defined fields
for storing information (e.g., state bits) related to different
aspects of debug mode operation. The debug registers 425 can be
located in the instruction fetch unit 404. For example, one of the
debug registers 425 is a Debug register 500. The Debug register 500
is illustrated in more detail in FIG. 5. The Debug register 500
includes a DM state bit indicative of whether the processor core
202 is operating in debug mode. Other bits include a DBD state bit
indicative of whether the last debug exception or exception in
Debug Mode occurred in a branch or jump delay slot. A DDBSImpr bit
is indicative of an imprecise debug data break store. A DDBLImpr
bit is indicative of an imprecise debug data break load. This bit
can be implemented for load value breakpoints. A DExcC bit is set to
one when Debug[DExcCode] is valid and should be interpreted.
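For illustration, state bits such as DM and DBD might be tested from
software as in the following C sketch; the bit positions used here
are placeholders, the actual layout of the Debug register 500 being
given by FIG. 5.

    #include <stdint.h>

    /* Hypothetical bit positions for this sketch only; the actual
     * layout of the Debug register 500 is defined by FIG. 5. */
    #define DEBUG_DM        (1u << 30)  /* in debug mode            */
    #define DEBUG_DBD       (1u << 31)  /* exception in delay slot  */
    #define DEBUG_DDBSIMPR  (1u << 19)  /* imprecise data break st. */
    #define DEBUG_DDBLIMPR  (1u << 18)  /* imprecise data break ld. */

    static inline int in_debug_mode(uint32_t debug_reg) {
        return (debug_reg & DEBUG_DM) != 0;
    }

    static inline int was_in_delay_slot(uint32_t debug_reg) {
        return (debug_reg & DEBUG_DBD) != 0;
    }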
[0089] Another one of the debug registers 425 is a Multi-Core Debug
(MCD) register 600, shown in FIG. 6. The MCD register 600
includes dedicated multi-core debug state positions 615, one
position being provided for each of the respective global signal
interconnects MCD_0, MCD_1, and MCD_2. Similarly, the MCD register
600 includes dedicated mask-disable state positions 605, one
position being provided for each of the respective global signal
interconnects MCD_0, MCD_1, and MCD_2. When set, the mask-disable
bits (one bit for each global signal interconnect) disable the
effect of sampling a pulse on the corresponding global signal
interconnect.
[0090] The MCD register 600 also includes respective
software-control bit locations 610 for each of the several global
MCD wires. For the exemplary multi-core processor 100, the three
software-control bit locations 610, referred to as Pls0, Pls1, and
Pls2, are reserved. These software-control bit locations 610
correspond to the three global signal interconnects MCD_0, MCD_1,
and MCD_2, respectively. Thus, bits written by software into
the software control bit locations 610 can be used to pulse any
combination of the three global MCD wires.
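The field layout of the MCD register 600 might be expressed in
software as in the following C sketch; the bit positions assigned to
fields 605, 610, and 615 are assumptions made for illustration, the
actual layout being given by FIG. 6.

    #include <stdint.h>

    /* Illustrative layout for the MCD register 600; the positions of
     * fields 605, 610, and 615 are placeholders for this sketch. */
    #define MCD_STATE(n)    (1u << (n))         /* state bits 615     */
    #define MCD_MASKDIS(n)  (1u << (8 + (n)))   /* mask-disable 605   */
    #define MCD_PLS(n)      (1u << (16 + (n)))  /* sw-control 610     */

    /* A sampled pulse on global wire n takes effect only when its
     * mask-disable bit is clear. */
    static int pulse_takes_effect(uint32_t mcd, int n) {
        return (mcd & MCD_MASKDIS(n)) == 0;
    }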
[0091] In some embodiments, the debug registers 425 (FIG. 4)
include a DEPC register for imprecise debug exceptions and
imprecise exceptions in Debug Mode. Imprecise debug data breakpoints
are provided for load value compares; otherwise, debug data
breakpoints are precise. The DEPC register contains an address at
which execution should be resumed when returning to Non-Debug
Mode.
[0092] Exception handlers can be entered for debug processing in a
number of ways. First, software such as the processor core
instruction set and/or the debugger can include a breakpoint
instruction. When the breakpoint instruction is executed by the
execution unit 400, it causes a specific exception. Alternatively
or in addition, a set of trap instructions can be provided. When
the trap instructions are executed by the execution unit 400, a
specific exception will result, but only when certain register
value criteria are also satisfied. Further, a pair of optional
Watch registers can be programmed to cause a specific exception on
a load, store, or instruction fetch access to a specific word
(e.g., a 64-bit double word) in virtual memory. Still further, an
optional TLB-based MMU 406 can be programmed to "trap," or
otherwise interrupt program execution on any access, or more
specifically, on any store to a page of memory. These exceptions
generally refer to interrupting operation on any one of the
processor cores 202. To interrupt the other processor cores 202, a
pulse must be asserted on one or more of the global signal
interconnects MCD_0, MCD_1, and MCD_2.
[0093] In operation, when one or more of the processor cores 202
asserts a pulse on one of the global signal interconnects MCD_0,
MCD_1, and MCD_2, the corresponding signal value can be a high
state, or logical one. The respective instruction fetch unit 404 of
each of the interconnected processor cores 202 samples the one on
the global signal interconnect. In response to sampling the one,
the instruction fetch unit 404 sets an internal state bit
corresponding to the sampled pulse. The internal state bit, or MCD
state bit, can be one of the dedicated multi-core debug state
positions 615 in the multi-core debug register 600 (i.e., Multi-Core
Debug[MCD0, MCD1, MCD2]).
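This sampling-and-latching behavior can be described with the
following behavioral C model; it is a sketch, not register-transfer
logic, and the names and types are illustrative only.

    #include <stdint.h>

    /* Behavioral model: each cycle, every core samples the three
     * global wires and latches any observed pulse into its MCD
     * state bits 615. */
    typedef struct {
        uint32_t mcd_state;   /* Multi-Core Debug[MCD0, MCD1, MCD2] */
    } core_model_t;

    /* 'wires' carries the sampled values of MCD_0..MCD_2 in bits
     * 0..2 for the current clock cycle. */
    static void sample_mcd_wires(core_model_t *core, uint32_t wires) {
        core->mcd_state |= wires & 0x7u;  /* latch any sampled pulse */
    }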
[0094] If any of multi-core debug state bits 615 are non-zero on a
given processor core 202 (and that processor core 202 is not
already in debug mode), the onboard debug circuitry 318 requests a
debug exception on its respective processor core 202. With all of
the multiple processor cores 202 sampling the same pulse and
setting their respective bits 615 at substantially the same time,
all of the unmasked processor cores 202 are interrupted at
substantially the same time. Preferably, this occurs during the
same cycle, but it can also occur within a few clock cycles.
Software can later clear Multi-Core Debug[MCD0, MCD1, MCD2] bits by
overwriting them (e.g., writing a one to them). Such a provision
ensures that no further debug interrupts occur after exiting the
debug handler.
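A minimal C sketch of this write-one-to-clear convention follows;
read_mcd() and write_mcd() are hypothetical accessors standing in
for the implementation's actual register interface.

    #include <stdint.h>

    /* mcd_shadow stands in for the MCD register. Writing a one to a
     * state bit clears it, so the handler acknowledges exactly the
     * bits it observed. */
    static uint32_t mcd_shadow;

    static uint32_t read_mcd(void)        { return mcd_shadow; }
    static void     write_mcd(uint32_t v) { mcd_shadow &= ~v; } /* W1C */

    static void debug_handler_epilogue(void) {
        uint32_t pending = read_mcd() & 0x7u;  /* MCD0..MCD2 bits */
        write_mcd(pending);   /* clear them so no further debug
                                 interrupt occurs after returning */
    }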
[0095] In general, interrupts can be assigned different priority
values to ensure the desired results in situations in which more
than one type of interrupt occurs. In particular, the MCD
interrupts can occur at the same priority level as standard debug
interrupts provided within the debug circuitry 318 of each of the
processor cores 202. The exception location can also be the same as
a debug interrupt, with the multi-core debug bits 615 being similar
to the DINT bit of the debug register shown in FIG. 5.
[0096] The detailed behavior of the bits, however, is different.
For example, the DINT bit is read-only, whereas Multi-Core
Debug[MCD0, MCD1, MCD2] bits can be written to, allowing the bits
to be cleared by the debug handler. Further, the DINT is cleared
when Multi-Core Debug[DExcC] is set, whereas the multi-core debug
state bits 615 need not be.
[0097] There are at least four ways that the global signal
interconnects MCD_0, MCD_1, and MCD_2 can be pulsed. First,
software can cause initiation of a pulse on the global MCD wires.
For example, debugger software running on a processor core 202 can
write one or more values (e.g., a logical "1") to any combination
of the software-control state bits 610 of the MCD register 600.
When a "1" is written into one or more of these bits 610, the
processor core 202 interprets it as an instruction to assert an
interrupt signal, or pulse, on the corresponding global signal
interconnects.
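A software-initiated pulse might therefore be coded as in the
following sketch; the Pls bit positions and the register stand-in
are assumptions made for illustration only.

    #include <stdint.h>

    /* Illustrative Pls bit positions; the real MCD register 600 is
     * reached through the implementation's register interface. */
    #define MCD_PLS0 (1u << 16)
    #define MCD_PLS1 (1u << 17)
    #define MCD_PLS2 (1u << 18)

    static uint32_t mcd_register;   /* stand-in for register 600 */

    /* Writing a "1" into a Pls bit instructs the core to assert a
     * pulse on the corresponding global signal interconnect. */
    static void pulse_wires(int w0, int w1, int w2) {
        uint32_t v = 0;
        if (w0) v |= MCD_PLS0;
        if (w1) v |= MCD_PLS1;
        if (w2) v |= MCD_PLS2;
        mcd_register |= v;
    }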
[0098] The global signal interconnects can also be pulsed by
execution of a special instruction. For example, execution of a
software breakpoint instruction, such as the SDBBP instruction, by
any one of the processor cores 202 results in that core 202
asserting a pulse on the MCD_0 global signal interconnect. Whether
a pulse is actually asserted by a processor core 202 in response to
the breakpoint instruction can be further controlled by a
global-signal debug bit 618 in the MCD register 600. Thus, a pulse
is only asserted in response to the breakpoint instruction when the
MCD[GSDB] bit 618 is set.
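The following C sketch illustrates this sequence under the stated
gating; the MCD[GSDB] bit position and the register stand-in are
hypothetical, and the SDBBP instruction, the MIPS software debug
breakpoint, is emitted only when compiling for a MIPS target.

    #include <stdint.h>

    #define MCD_GSDB (1u << 24)   /* hypothetical GSDB position */

    static uint32_t mcd_register;  /* stand-in for register 600 */

    static void breakpoint_all_cores(void) {
        mcd_register |= MCD_GSDB;  /* SDBBP will now also pulse MCD_0 */
    #if defined(__mips__)
        __asm__ volatile ("sdbbp 0");  /* local debug exception plus
                                          MCD_0 pulse when GSDB set */
    #endif
    }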
[0099] Alternatively or in addition, the initiation of a pulse on
the global signal interconnects can result if one or more bits
within a particular register are set and a breakpoint match occurs.
When these two conditions occur, the hardware (e.g., the debug
circuitry 318) pulses one of the global MCD wires (e.g., the MCD_0
wire). An Instruction Breakpoint Control-n register (IBCn, "n"
being a numbered reference to a particular instruction breakpoint)
stores a value responsive to a match of an executed breakpoint
instruction. Similarly, a Data Breakpoint Control-n (DBCn) stores a
value responsive to a match of a data transaction. The registers
IBCn and DBCn generally include special bits (e.g., BE, TE) that
can be used to enable the respective breakpoints.
[0100] Table 1 below describes an exemplary embodiment in which the
detailed behavior on a breakpoint match is defined based on
exemplary register values.

TABLE 1. Breakpoint Match Behavior

  BE  TE  Comment
  0   0   Nothing happens on a match.
  0   1   MCD0 is pulsed on a match. BS bits are also set in
          IBS/DBS. No direct local exception occurs. (This mode may
          not be used.)
  1   0   A local breakpoint exception occurs due to the breakpoint
          match, causing the Core to enter debug mode. MCD0 is not
          pulsed. BS bits are set in IBS/DBS. (This mode will be
          used when debugging, but not multi-Core.)
  1   1   A local breakpoint exception occurs due to the breakpoint
          match, causing the Core to enter debug mode. MCD0 is also
          pulsed. BS bits are also set in IBS/DBS. (This mode will
          be used when debugging multi-Core.)
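The BE/TE selections of Table 1 might be applied to an IBCn or DBCn
control word as in the following C sketch; the BE and TE bit
positions are placeholders for illustration.

    #include <stdint.h>

    /* Illustrative BE/TE positions within an IBCn/DBCn control word. */
    #define BRK_BE (1u << 0)   /* local breakpoint exception on match */
    #define BRK_TE (1u << 1)   /* pulse MCD0 on match                 */

    /* Per Table 1, BE=1/TE=1 yields both a local debug exception and
     * an MCD0 pulse, so one core's breakpoint match interrupts every
     * unmasked core at substantially the same time. */
    static uint32_t control_for_multicore_debug(uint32_t ctl) {
        return ctl | BRK_BE | BRK_TE;
    }

    static uint32_t control_for_local_debug(uint32_t ctl) {
        return (ctl | BRK_BE) & ~BRK_TE;  /* BE=1, TE=0: no pulse */
    }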
[0101] An exemplary TAP controller 700 is shown in FIG. 7A. The TAP
controller 700 includes one or more registers 705 for storing
instruction, data, and control information relating to the TAP
interface 320. The registers 705 allow a user to set up the onboard
debug circuitry 318, and provide important status information during
a debug session. The size of the registers 705 depends on the
specific implementation, but the registers are usually at least 32
bits wide.
[0102] The registers 705 receive information from an external
source using the Test Data Input (TDI) input (i.e., pin). The
registers also provide information to an external source using the
Test Data Output (TDO) output (i.e., pin). Operation of the
interface is provided by a TAP controller state machine 710. The
TAP controller 700 uses a communications channel, such as a serial
communications channel that operates according to a clock signal
received on the Test Clock (TCK) input (i.e., pin). Thus, movement
of data into and/or out of the registers 705 operates according to
the received clock signal. Similarly, operation of the state
machine also relies on the received clock.
[0103] A more detailed interconnection of respective TAP interfaces
320 on each of the multiple processor cores 202 is shown in FIG.
7B. A JTAG interface, referred to as a Test Access Port (TAP) 320',
320'', 320''' (generally 320), includes at least four signal lines:
Test Clock (TCK); Test Mode Select (TMS); Test Data In (TDI); and
Test Data Out (TDO). The interface can also include one or more
power and ground signal lines (not shown). The JTAG interface is a
serial interface that
is capable of transferring data according to a clock signal
received on the TCK signal line. Operating frequency varies per
chip, but is typically defined by a clock signal having a frequency
between about 10 MHz and about 100 MHz (i.e., from about 100
nanoseconds to about 10 nanoseconds per bit time).
[0104] Configuration of the respective debug circuitry 318 on each
processor core (FIGS. 3A and 3B) can be performed by manipulating an
internal
state machine. For example, a debug controller state machine within
the debug circuitry 318 can be externally manipulated one bit at a
time via the TMS signal line of the TAP 330. Data can then be
transferred in and out, one bit at a time, during each TCK clock
cycle. The data can be received via the TDI signal line and
transmitted out via the TDO signal line. Different
instruction modes can be loaded into the debug controller 318 to
read the core identification (ID), to sample input, to drive
(and/or float) output, to manipulate functions, and/or to bypass
(pipe TDI to TDO to logically shorten chains of multiple
chips).
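The bit-at-a-time transfer described above can be sketched in C as
follows; the pin accessors are hypothetical stubs standing in for
whatever circuit drives the physical TAP signal lines.

    #include <stdint.h>

    /* Hypothetical pin accessors (stubbed so the sketch compiles). */
    static int tdo_pin;
    static void set_tck(int v) { (void)v; /* drive TCK pin */ }
    static void set_tms(int v) { (void)v; /* drive TMS pin */ }
    static void set_tdi(int v) { (void)v; /* drive TDI pin */ }
    static int  get_tdo(void)  { return tdo_pin; }

    /* Shift 'nbits' LSB-first through the TAP while in a shift
     * state. TDO is stable before the rising TCK edge; the shift
     * occurs on the rising edge; TDO updates on the falling edge.
     * TMS is raised on the last bit to leave the shift state. */
    static uint64_t tap_shift(uint64_t out, int nbits) {
        uint64_t in = 0;
        for (int i = 0; i < nbits; i++) {
            set_tms(i == nbits - 1);
            set_tdi((int)((out >> i) & 1u));
            in |= (uint64_t)get_tdo() << i;
            set_tck(1);
            set_tck(0);
        }
        return in;
    }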
[0105] The respective TAP 320 of each of the multiple processor
cores 202a, 202b . . . 202n (generally 202) is coupled to the
respective TAP 320 of the other multiple processor cores 202 in a
serial, or "daisy chain" configuration. Thus, the TCK signal of the
first TAP 320' is serially interconnected to the corresponding TCK
signal lines of all of the other TAPs 320. The interconnected TCK
signal lines are further connected to a corresponding TCK signal
line of a system TAP 330. Typically, the system TAP 330 is
interconnected to one end of the chain of interconnected processor
cores 202 (i.e., processor core 202n or processor core 202a, as
shown), that end processor core 202 being referred to as a "master"
processor core 202a. For the most part, the remaining TAP signal
lines are interconnected in a similar manner, being further
connected from the master processor core 202a to the corresponding
TAP signal lines on the system TAP 330. Interconnection of the TDI
and TDO signal lines, however, is different as described in more
detail below.
[0106] In the daisy chain configuration, the TDI signal line of the
master processor core 202a connects to the corresponding TDI signal
line of the system TAP 330, the master processor core 202a
receiving data from an external source. The TDO signal line of the
master processor core 202a, however, is connected to the TDI signal
line of an adjacent processor core 202b. Additional processor cores
202 are connected in a similar manner, the TDO signal line of one
processor core 202 being interconnected to the TDI signal line of
the next processor core 202 in the chain, until the TDO signal line
of the
last processor core 202n in the chain is interconnected to the TDO
signal line of the system TAP 330.
[0107] A more-detailed diagram illustrating an alternative
embodiment of a processor core 202 including exemplary onboard
debug circuitry is shown in FIG. 8. An execution unit 400 (e.g., a
combined processor and co-processor) is coupled to a memory (e.g.,
cache) controller 805 through an MMU 410. The MMU 410 may include a
TLB. The memory controller 805 is further coupled to a memory
system interface through a bus interface unit 408. Access and
control of the onboard debug features is provided through an EJTAG
TAP 320. The execution unit 400 includes a number of registers 830
that support debug operation. For example, the processor core 202
includes an MCD register 835 and a debug register 836, as discussed
above, as well as a DEPC register 837 and a DESAVE register 838.
[0108] A debug control register 832 is coupled between the
registers 830 of the execution unit 400, the memory controller 805,
and externally via the EJTAG TAP 320. A hardware breakpoint unit
825 is also coupled between the registers 830 of the execution unit
400, the memory controller 805, and the MMU 410. The Hardware
Breakpoint Unit 825 implements memory-mapped registers that control
the instruction and data hardware breakpoints. The memory-mapped
region containing the hardware breakpoint registers is accessible
to software only in debug mode.
[0109] The debug features provide compatibility with existing
debuggers. The debug circuitry 318 also includes specific
extensions that enable concurrent multi-Core debugging. For
example, controlling logic can be used to interpret the values of
the software-control bit locations 610. Upon interpreting a value
indicative of a pulse, the controlling logic can write the
interpreted values into the corresponding MCD_0, MCD_1, and MCD_2
bit locations of the MCD register. The controlling logic can then
pulse the one or more corresponding global MCD wires, according to
the corresponding values 615. Once pulsed, the processor cores 202
sample the pulse. The pulse sampling can occur during the next
clock cycle after the pulse was written. Once sampled, each of the
processor cores 202 that is not masked will initiate a debug
exception handler routine.
[0110] The debug exception handler can then follow a set of
predetermined rules to determine the one or more causes of a given
debug exception after reading the Debug and/or Multi-Core Debug
registers. For example, the debug exception handler can follow the
rules listed in Table 2 below.

TABLE 2. Debug Exception Handler Rules

  1. Any of the MCD state bit locations (Multi-Core Debug[MCD0,
     MCD1, MCD2]) could be set at any time, indicating that the
     corresponding MCD state bit is set.
  2. If Multi-Core Debug[DExcC] is set, all of Debug[DDBSImpr,
     DDBLImpr, DINT, DIB, DDBS, DDBL, DBp, DSS] will be clear, and
     Debug[DExcCode] will contain a valid code. (This is the case
     for a debug mode exception.)
  3. If none of Debug[DDBSImpr, DDBLImpr, DINT, DIB, DDBS, DDBL,
     DBp, DSS] are set, then the exception was either due to MCD*,
     or Multi-Core Debug[DExcC] being set and Debug[DExcCode] being
     valid.
  4. No more than one of Debug[DIB, DDBS, DDBL, DBp, DSS] can be
     set.
  5. If Multi-Core Debug[DExcC] is clear, any combination of
     Debug[DDBLImpr, DINT] may be set.
  6. At least one of Debug[DDBLImpr, DINT, DIB, DDBS, DDBL, DBp,
     DSS] and Multi-Core Debug[MCD0, MCD1, MCD2, DExcC] will be
     set.
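A debug exception handler might apply rules 2 and 3 of Table 2 as in
the following C sketch; the field masks are placeholders, and
'debug_local' collapses the listed Debug register status bits into a
single word for brevity.

    #include <stdint.h>

    /* Illustrative masks; actual positions come from FIGS. 5 and 6. */
    #define MCD_STATE_BITS 0x7u       /* Multi-Core Debug[MCD0..MCD2] */
    #define MCD_DEXCC      (1u << 3)  /* Multi-Core Debug[DExcC]      */

    typedef enum {
        CAUSE_DEBUG_MODE_EXC,  /* rule 2: Debug[DExcCode] is valid */
        CAUSE_MCD_ONLY,        /* rule 3: only MCD* bits are set   */
        CAUSE_LOCAL            /* a local debug status bit is set  */
    } debug_cause_t;

    /* 'debug_local' is nonzero when any of Debug[DDBSImpr, DDBLImpr,
     * DINT, DIB, DDBS, DDBL, DBp, DSS] is set. */
    static debug_cause_t classify(uint32_t mcd, uint32_t debug_local) {
        if (mcd & MCD_DEXCC)           return CAUSE_DEBUG_MODE_EXC;
        if (debug_local == 0 &&
            (mcd & MCD_STATE_BITS))    return CAUSE_MCD_ONLY;
        return CAUSE_LOCAL;
    }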
[0111] While this invention has been particularly shown and
described with references to preferred embodiments thereof, it will
be understood by those skilled in the art that various changes in
form and details may be made therein without departing from the
scope of the invention encompassed by the appended claims.
* * * * *