U.S. patent application number 14/501437 was filed with the patent office on 2014-09-30 and published on 2015-02-26 as publication number 20150058569 for non-data inclusive coherent (nic) directory for cache.
The applicant listed for this patent is International Business Machines Corporation. The invention is credited to Timothy C. Bronson, Garrett M. Drapala, Rebecca M. Gott, Pak-kin Mak, Vijayalakshmi Srinivasan, and Craig R. Walters.
United States Patent Application | 20150058569 |
Kind Code | A1 |
Publication Date | February 26, 2015 |
Application Number | 14/501437 |
Document ID | / |
Family ID | 51489346 |
Inventors | Bronson; Timothy C.; et al. |
NON-DATA INCLUSIVE COHERENT (NIC) DIRECTORY FOR CACHE
Abstract
Embodiments relate to a non-data inclusive coherent (NIC)
directory for a symmetric multiprocessor (SMP) of a computer. An
aspect includes determining a first eviction entry of a
highest-level cache in a multilevel caching structure of a first
processor node of the SMP. Another aspect includes determining that
the NIC directory is not full. Another aspect includes determining
that the first eviction entry of the highest-level cache is owned
by a lower-level cache in the multilevel caching structure. Another
aspect includes, based on the NIC directory not being full and
based on the first eviction entry of the highest-level cache being
owned by the lower-level cache, installing an address of the first
eviction entry of the highest-level cache in a first new entry in
the NIC directory. Another aspect includes invalidating the first
eviction entry in the highest-level cache.
Inventors: | Bronson; Timothy C.; (Round Rock, TX); Drapala; Garrett M.; (Poughkeepsie, NY); Gott; Rebecca M.; (Poughkeepsie, NY); Mak; Pak-kin; (Poughkeepsie, NY); Srinivasan; Vijayalakshmi; (New York, NY); Walters; Craig R.; (Highland, NY) |

Applicant: | International Business Machines Corporation (Armonk, NY, US) |

Family ID: | 51489346 |
Appl. No.: | 14/501437 |
Filed: | September 30, 2014 |
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number |
13784958 | Mar 5, 2013 | |
14501437 | | |
Current U.S. Class: | 711/122 |
Current CPC Class: | G06F 12/0831 20130101; G06F 2212/621 20130101; G06F 12/0811 20130101; G06F 12/0833 20130101; G06F 2212/283 20130101 |
Class at Publication: | 711/122 |
International Class: | G06F 12/08 20060101 G06F012/08 |
Claims
1. A computer implemented method for operation of a non-data
inclusive coherent (NIC) directory for a symmetric multiprocessor
(SMP) of a computer, the method comprising: determining a first
eviction entry of a highest-level cache in a multilevel caching
structure of a first processor node of the SMP; determining that
the NIC directory is not full; determining that the first eviction
entry of the highest-level cache is owned by a lower-level cache in
the multilevel caching structure; based on the NIC directory not
being full and based on the first eviction entry of the
highest-level cache being owned by the lower-level cache,
installing an address of the first eviction entry of the
highest-level cache in a first new entry in the NIC directory; and
invalidating the first eviction entry in the highest-level
cache.
2. The method of claim 1, further comprising: determining a second
eviction entry in the lower-level cache in the multilevel caching
structure; determining that an entry corresponding to the second
eviction entry is located in the NIC directory; determining that an
entry corresponding to the second eviction entry is not located in
another lower-level cache in the multilevel caching structure;
based on the entry corresponding to the second eviction entry being
located in the NIC directory and based on no entry corresponding to
the second eviction entry being located in another lower-level
cache of the multilevel caching structure, creating a second new
entry corresponding to the second eviction entry in the
highest-level cache; and invalidating the entry corresponding to
the second eviction entry in the NIC directory.
3. The method of claim 2, further comprising: setting the second
new entry in the highest-level cache to a most recently used (MRU)
position; and setting an ownership of the second new entry in the
highest-level cache to unowned.
4. The method of claim 1, further comprising: based on the NIC
directory being full and based on a least recently used (LRU)
unowned entry existing in the highest-level cache, evicting to a
main memory of the computer system the LRU unowned entry; and based
on the NIC directory being full and based on a least recently used
(LRU) unowned entry not existing in the highest-level cache,
evicting an LRU owned entry of the highest-level cache to the main
memory of the computer system.
5. The method of claim 1, further comprising: receiving a snoop by
the first processor node from a second processor node of the SMP
via a SMP bus; determining that an entry corresponding to the snoop
is located in the NIC directory; retrieving data corresponding to
the snoop from the lower-level cache; and forwarding the retrieved
data to the second processor node via the SMP bus.
6. The method of claim 5, wherein the snoop comprises an exclusive
snoop, and further comprising: invalidating the entry corresponding
to the exclusive snoop in the NIC directory.
7. The method of claim 5, wherein the snoop comprises a shared
snoop, and further comprising: updating an ownership of the entry
corresponding to the shared snoop in the NIC directory to shared.
8. The method of claim 1, wherein the highest-level cache and the
NIC directory are in communication with a plurality of lower-level
caches in the multilevel caching structure; wherein the
highest-level cache comprises a directory comprising entries
corresponding to a first plurality of addresses, and data
associated with the first plurality of addresses in the directory;
and wherein the NIC directory comprises entries corresponding to a
second plurality of addresses, and wherein the NIC directory does
not comprise data associated with the second plurality of addresses.
9. The method of claim 1, wherein determining the first eviction
entry of the highest-level cache of the first processor node of the
SMP comprises: based on an entry that is exclusively owned by the
lower-level cache existing in the highest-level cache, selecting
the entry that is exclusively owned by the lower-level cache as the
first eviction entry; based on an entry that is exclusively owned
by the lower-level cache not existing in the highest-level cache,
and based on a shared entry having an unset intervention master
(IM) tag existing in the highest-level cache, selecting the shared
entry having the unset IM tag as the first eviction entry; and
based on a shared entry having an unset IM tag not existing in the
highest-level cache, selecting a shared entry having a set IM tag
as the first eviction entry.
10. A computer program product for implementing a non-data
inclusive coherent (NIC) directory for a symmetric multiprocessor
(SMP) of a computer, the computer program product comprising: a
tangible storage medium readable by a processing circuit and
storing instructions for execution by the processing circuit for
performing a method comprising: determining a first eviction entry
of a highest-level cache in a multilevel caching structure of a
first processor node of the SMP; determining that the NIC directory
is not full; determining that the first eviction entry of the
highest-level cache is owned by a lower-level cache in the
multilevel caching structure; based on the NIC directory not being
full and based on the first eviction entry of the highest-level
cache being owned by the lower-level cache, installing an address
of the first eviction entry of the highest-level cache in a first
new entry in the NIC directory; and invalidating the first eviction
entry in the highest-level cache.
11. The computer program product of claim 10, further comprising:
determining a second eviction entry in the lower-level cache in the
multilevel caching structure; determining that an entry
corresponding to the second eviction entry is located in the NIC
directory; determining that an entry corresponding to the second
eviction entry is not located in another lower-level cache in the
multilevel caching structure; based on the entry corresponding to
the second eviction entry being located in the NIC directory and
based on no entry corresponding to the second eviction entry being
located in another lower-level cache of the multilevel caching
structure, creating a second new entry corresponding to the second
eviction entry in the highest-level cache; and invalidating the
entry corresponding to the second eviction entry in the NIC
directory.
12. The computer program product of claim 11, further comprising:
setting the second new entry in the highest-level cache to a most
recently used (MRU) position; and setting an ownership of the
second new entry in the highest-level cache to unowned.
13. The computer program product of claim 10, further comprising:
based on the NIC directory being full and based on a least recently
used (LRU) unowned entry existing in the highest-level cache,
evicting to a main memory of the computer system the LRU unowned
entry; and based on the NIC directory being full and based on a
least recently used (LRU) unowned entry not existing in the
highest-level cache, evicting an LRU owned entry of the
highest-level cache to the main memory of the computer system.
Description
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application is a continuation of U.S. application Ser.
No. 13/784,958 (Bronson et al.), filed on Mar. 5, 2013, which is
herein incorporated by reference in its entirety.
BACKGROUND
[0002] The present invention relates generally to a cache for a
computer processor, and more specifically, to a cache including a
non-data inclusive coherent (NIC) directory.
[0003] A symmetric multiprocessor (SMP) is a computer system that
includes a plurality of processor nodes that are linked by one or
more SMP buses. A computer system, such as an enterprise server
computer system, may include multiple processor sockets that are
interconnected in a SMP bus topology so as to achieve a relatively
large overall processor capacity. Each processor node in a SMP
includes a cache subsystem; a robust cache subsystem may be
critical to good performance of a SMP. A relatively large SMP may
have high traffic on the SMP bus, including snoops, which are
requests for data by a processor node that are sent to the other
processor nodes in the SMP, and cache-to-cache interventions, in
which data migrates from one processor node to another. A snoop may
require that a processor node interrogate a lower-level cache in
the processor node to determine if the data requested by the snoop
exists in the processor node. Such lower-level cache interrogations
may interfere with core performance in the processor node.
[0004] An inclusive cache policy may be used in a multi-level cache
hierarchy, allowing the highest-level cache to filter out snoops
from the SMP bus when the requested data does not reside in the
lower-level caches in the processor node. However, an inclusive
cache policy may be relatively inefficient in use of available
cache bits in the highest-level cache, as, in an inclusive cache,
the highest-level cache holds the same data, or older versions of
the data, that resides in the lower level caches. A victim
highest-level cache that includes copies of the lower-level cache
directories may also be used. However, such a caching structure
requires a relatively large amount of space for the copied
directories, and may also have relatively long shared intervention
latency with owned data that is returned from a lower-level
cache.
SUMMARY
[0005] Embodiments include a method and computer program product
for a non-data inclusive coherent (NIC) directory for a symmetric
multiprocessor (SMP) of a computer. An aspect includes determining
a first eviction entry of a highest-level cache in a multilevel
caching structure of a first processor node of the SMP. Another
aspect includes determining that the NIC directory is not full.
Another aspect includes determining that the first eviction entry
of the highest-level cache is owned by a lower-level cache in the
multilevel caching structure. Another aspect includes, based on the
NIC directory not being full and based on the first eviction entry
of the highest-level cache being owned by the lower-level cache,
installing an address of the first eviction entry of the
highest-level cache in a first new entry in the NIC directory.
Another aspect includes invalidating the first eviction entry in
the highest-level cache.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
[0006] The subject matter which is regarded as embodiments is
particularly pointed out and distinctly claimed in the claims at
the conclusion of the specification. The foregoing and other
features, and advantages of the embodiments are apparent from the
following detailed description taken in conjunction with the
accompanying drawings in which:
[0007] FIG. 1 depicts a computing system including a SMP in
accordance with an embodiment;
[0008] FIG. 2 depicts a processor node including a NIC directory in
accordance with an embodiment;
[0009] FIGS. 3A-D depict a processor node including a NIC directory
in accordance with an embodiment;
[0010] FIG. 4 depicts a process flow for line address installations
and invalidations in a processor node including a L4 cache and NIC
directory in accordance with an embodiment;
[0011] FIG. 5 depicts a process flow for operation of a processor
node including a NIC directory in accordance with an
embodiment;
[0012] FIG. 6 depicts a processor node including a NIC directory in
accordance with an embodiment;
[0013] FIG. 7 depicts a process flow for a L3 cache fetch that hits
in the L4 or NIC directories in accordance with an embodiment;
[0014] FIG. 8 depicts a process flow for a L3 cache fetch that
misses in the L4 and NIC directories in accordance with an
embodiment;
[0015] FIG. 9 depicts a process flow for a L3 cache eviction
castout that hits in the L4 directory in accordance with an
embodiment;
[0016] FIG. 10 depicts a process flow for a L3 cache eviction
castout that misses in the L4 directory and hits in the NIC
directory in accordance with an embodiment;
[0017] FIG. 11 depicts a process flow for eviction of an entry from
the L4 cache to the NIC directory in accordance with an
embodiment;
[0018] FIG. 12 depicts a process flow for eviction from the L4
cache to main memory in accordance with an embodiment;
[0019] FIG. 13 depicts a process flow for a snoop fetch that hits
exclusively to a L3 cache in accordance with an embodiment;
[0020] FIG. 14 depicts a process flow for a snoop for a L3 cache
shared fetch that hits in L4 directory in accordance with an
embodiment;
[0021] FIG. 15 depicts a process flow for a snoop for a L3 shared
fetch that hits in the NIC directory in accordance with an
embodiment; and
[0022] FIG. 16 illustrates a computer program product in accordance
with an embodiment.
DETAILED DESCRIPTION
[0023] Embodiments of a NIC directory for a cache are provided,
with exemplary embodiments being discussed below in detail. The NIC
directory is used in conjunction with a multi-level caching
structure in a processor node in a SMP. The NIC directory tracks
data residing in the lower-level caches that has particular
ownership states. The NIC directory and highest-level cache filter
snoops from other processor nodes in the SMP, reducing cross
interrogations to the lower levels of the cache. The NIC directory
holds entries including line addresses and ownership information,
but no data. The highest-level cache also comprises a directory
that holds entries including line addresses and ownership
information; the highest-level cache additionally holds data that
is associated with the lines in its directory. The NIC directory
and highest-level cache act to capture and track data that is
evicted from the lower-level caches to maintain an inclusive cache
management policy, allowing snoop filtering, increased cache bit
efficiency and relatively fast intervention of shared data on snoop
hits. A NIC directory may have any appropriate size; the size of a
NIC directory may be determined based on an amount of space
available on a chip residing within the processor node.
[0024] In some embodiments, a NIC directory may reside adjacent to
the highest-level cache in the multi-level caching structure. In
further embodiments, the NIC directory may comprise an additional
associative compartment of the cache directory, but without the
corresponding data. The NIC directory tracks lines that are
exclusively owned by a lower-level cache in the processor node.
Because exclusively-owned data is likely to be modified in the
lower-level cache, storage of such line data in the highest-level
cache may be wasteful. The NIC directory also tracks shared
read-only data that may or may not be used for off-node cache
shared interventions. The highest-level cache that is used in
conjunction with the NIC directory may track evictions from the
lower-level cache, regardless of whether the data in the evicted
line has been modified or not, and also commonly shared lines to
enable fast intervention to other processor nodes in the SMP. This
allows the highest-level cache and the NIC directory to effectively
filter out snooping and intervention traffic from other processor
nodes.
[0025] In some embodiments, line addresses may be stored in both
the NIC directory and in the highest-level cache directory based on
an addressing scheme including a directory address tag, which is
derived from a low address portion of the system address; a cache
row, which is derived from a middle portion of the system address;
and a byte offset comprising a targeted byte index within a cache
line. The lines in the NIC and highest-level directories may
further include the following fields: a validity bit that indicates
whether the entry is valid; an address tag that, when combined with the
cache row field, is used to determine the full system address for
directory hit/miss compares; an ownership tag which identifies
which lower-level cache within the processor node has ownership of
the entry, and whether the ownership is read-only or exclusive; an
intervention master (IM) bit, which, if set (i.e., IM=1), indicates
that the processor node will be sourcing the data on the next snoop
fetch; and a shared or multiple copy (MC) bit which, if unset
(i.e., MC=0), indicates that the processor node has the sole copy
of the data (which implies the IM bit for the entry is set). The
highest-level cache additionally holds data associated with the
addresses in its directory.
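The entry fields described above can be sketched as a simple record. This is an illustrative Python model, not the hardware format; the string `owner` label, the `row_bits` width, and the helper names are assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Ownership(Enum):
    UNOWNED = 0
    READ_ONLY = 1
    EXCLUSIVE = 2

@dataclass
class DirectoryEntry:
    # Fields mirror paragraph [0025]; types and widths are illustrative.
    valid: bool = False                       # validity bit: is this entry in use?
    address_tag: int = 0                      # tag compared on directory hit/miss checks
    owner: Optional[str] = None               # which lower-level (L3) cache owns the line, if any
    ownership: Ownership = Ownership.UNOWNED  # read-only or exclusive ownership
    im: bool = False                          # intervention master: node sources the next snoop fetch
    mc: bool = False                          # multiple copy: other nodes may also hold the line

def full_address(entry: DirectoryEntry, cache_row: int, row_bits: int = 12) -> int:
    # Hit/miss compares combine the stored tag with the cache row (the middle
    # portion of the system address); the byte offset within the line is omitted.
    return (entry.address_tag << row_bits) | cache_row

def sole_copy(entry: DirectoryEntry) -> bool:
    # MC=0 means this processor node holds the only copy, which implies IM=1.
    return entry.valid and not entry.mc
```

The L4 directory would pair each such entry with its line data, while the NIC directory stores only the entry itself.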
[0026] FIG. 1 illustrates an embodiment of a computing system 100
including a SMP 101. SMP 101 includes a plurality of processor
nodes 103A-N that are linked by a SMP bus 102. Computing system 100
also includes a main memory 104, and may be any appropriate type of
computing system. FIG. 1 is shown for illustrative purposes only; a
SMP in a computing system may include any appropriate number of
processor nodes having any appropriate configuration, and the
processor nodes may be connected by any appropriate number and
configuration of SMP buses. Each processor node 103A-N includes a
multi-level caching structure including a NIC directory, which is
described in further detail below. In order to exchange data
between processor nodes 103A-N, snoops are sent by a requesting
processor node of processor nodes 103A-N to the other processor
nodes via the SMP bus 102. These snoops may be intercepted by the
highest-level cache and the NIC directory in each of the receiving
processor nodes 103A-N.
[0027] In various embodiments, a NIC directory may be used in
conjunction with any appropriate multi-level caching structure; in
some embodiments, the multi-level caching structure may comprise a
4-level cache structure. While the NIC directory is discussed below
with respect to a 4-level caching structure, this is for
illustrative purposes only. In embodiments comprising a 4-level
caching structure, a L4 cache comprises the highest-level cache,
and a plurality of L3, L2, and L1 caches are located below the L4
cache. In such embodiments, the L4 cache and NIC directory may be
shared by all the L3 caches within the processor node, and may
communicate directly with the SMP bus. A L4 cache may have a size
of about 256 megabytes (MB) in some embodiments. The L3 cache may
comprise a store-in cache that is shared by some number of cores,
and may have a size of about 32 MB in some embodiments. In some
embodiments, there may be three L3 shared caches in a node, for a
total of up to 96 MB of unique data. In conjunction with a 256 MB
L4 cache there may be up to 352 MB of unique data within the
processor node. The L1 cache and L2 cache may comprise
store-through caches that are private to a particular core in a
processor node. In some embodiments, the NIC directory size may be
smaller than the sum of the next lower-level cache directories,
e.g., less than 96 MB.
[0028] FIG. 2 illustrates an embodiment of a processor node 200
including an NIC directory 213 and a 4-level caching structure.
Processor node 200 includes L4 cache 201, which is the
highest-level cache, in communication with multiple L3 caches
202A-N. L4 cache 201 includes a L4 directory that tracks addresses
in L4 cache 201, and L4 data that is associated with the addresses.
Each of L3 caches 202A-N include a respective L3 directory and L3
data. Each of L3 caches 202A-N is in communication with a
lower-level caching structure including respective L2 caches
203A-N, 206A-N, and 209A-N, and L1 caches 204A-N, 207A-N, and
210A-N. The L2 caches 203A-N, 206A-N, and 209A-N, and L1 caches 204A-N, 207A-N,
and 210A-N are each assigned to a respective core of cores 205A-N,
208A-N, and 211A-N. NIC directory 213 is located next to L4 cache
201 and is also in communication with L3 caches 202A-N. NIC
directory 213 tracks addresses, but does not store data. Main
memory 212 may comprise a sub-address space of a main memory (for
example, main memory 104 of FIG. 1) that is assigned to processor
node 200. Processor node 200 may comprise any of the processor
nodes 103A-N that are shown in FIG. 1. FIG. 2 is shown for
illustrative purposes only; any appropriate number and
configuration of cache levels, and caches within those levels, may
be included in a processor node of a SMP. Further, a NIC directory
such as NIC directory 213 may be located in any appropriate
location within a processor node.
[0029] FIGS. 3A-D illustrate various operations that may be
performed within an embodiment of a processor node 300 including a
NIC directory 302. In processor node 300 that is shown in FIGS.
3A-D, L3 directory 301 may be a directory that is located in any of
L3 caches 202A-N of FIG. 2, and includes addresses for data that is
held in the particular L3 cache. NIC directory 302 may comprise NIC
directory 213 of FIG. 2. The L4 directory 303 may be located in L4 cache
201 of FIG. 2, and includes addresses of the data in the L4 cache;
the L4 cache data is located in L4 data 305. Combined L3 data 304
comprises all the data located in all of the L3 caches 202A-N. FIG.
3A shows an embodiment of a snoop 310A that is received by the
processor node 300 from the SMP bus. The snoop address is checked
against the NIC directory 302 and the L4 directory 303, and if
there is a L3-owned hit in either NIC directory 302 or the L4
directory 303, a cross interrogation 311A is sent from either the
NIC directory 302 or the L4 directory 303 to the L3 directory 301
that owns the data requested by the snoop. FIG. 3B shows an
embodiment of a fetch from a core in the processor node 300 that
misses in the L3 directory 301. The L3 miss address 310B is sent
from the L3 directory 301 to the L4 directory 303, and is then
broadcast as a snoop 311B on the SMP bus. The snoop may be either
an exclusive snoop or a shared snoop, depending on whether the data
is intended to be modified or not. Data 312B is returned from the
SMP bus in response to the snoop, and installed in both combined L3
data 304 and L4 data 305. FIG. 3C shows an embodiment of
installation of data from the L4 cache in the L3 cache in the
processor node 300. The install data 310C is sent from L4 data 305
to combined L3 data 304. In order to install data 310C in combined
L3 data 304, an entry, comprising L3 data 311C, is evicted from
combined L3 data 304 and installed in L4 data 305. FIG. 3D shows an
embodiment of data sourcing in response to a snoop hit in the
processor node 300. Shared and unowned data 310D is sourced from L4
data 305 on a snoop hit, while modified data is sourced from
combined L3 data 304 on a snoop hit.
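The snoop-filtering check of FIG. 3A can be sketched as follows; this is a minimal model assuming directories represented as dictionaries mapping line addresses to records with a hypothetical `owner` field, not the actual hardware lookup.

```python
def filter_snoop(snoop_addr, nic_dir, l4_dir):
    """Return the L3 cache to cross-interrogate for a snoop, or None.

    A hit in either the NIC directory or the L4 directory whose entry is
    owned by an L3 cache triggers a cross interrogation (311A); a miss in
    both, or an unowned hit, is handled without disturbing the L3 caches.
    """
    for directory in (nic_dir, l4_dir):
        entry = directory.get(snoop_addr)
        if entry is not None and entry["owner"] is not None:
            return entry["owner"]
    return None
```

In this model an unowned L4 hit returns None because the L4 cache can source the data itself, which is the filtering benefit described above.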
[0030] FIG. 4 depicts a method 400 for line address installations
and invalidations in a processor node including a L4 cache and NIC
directory in accordance with an embodiment. In block 401, the
following entry types are installed in the L4 cache: L3 fetches
that miss in the L4 and NIC directories; L3 exclusive evictions or
castouts (based on, for example, L3 least recently used, or LRU
replacement policy) that hit in the NIC directory; and L3 read-only
shared LRU castouts that hit in the NIC directory and are a final copy
of the data, i.e., are not owned by any other L3 cache in the same processor
node. In block 402, the following entry types are invalidated in
the L4 cache, by, for example, setting the validity bit in the
entry's line in the L4 directory to invalid: exclusive snoops from
the SMP bus; and L4 evictions. In block 403, the following entry
type is installed in the NIC directory if the NIC directory is not
full: L4 evictions that are owned by a L3 cache. In block 404, the
following entry types are invalidated in the NIC directory by, for
example, setting the validity bit in the entry's line in the NIC
directory to invalid: lines that are hit by exclusive snoops from
the SMP bus; L3 exclusive LRU castouts; and L3 read-only shared LRU
castouts that hit in the NIC directory and do not hit in another L3
cache. Entries in the L4 cache and NIC directory that have a
validity bit set to invalid may be overwritten by an installation
that is performed according to blocks 401 or 403, and an
invalidation according to blocks 402 or 404 may be triggered by an
installation that is performed according to blocks 401 or 403.
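The install rules of blocks 401 and 403 can be expressed as predicates over events; this is a sketch under assumed event records (dictionaries with hypothetical flag names), not the actual control logic.

```python
def installs_in_l4(event):
    """Block 401: does this event install a new entry in the L4 cache?"""
    if event["type"] == "l3_fetch":
        # L3 fetches that miss in both the L4 and NIC directories.
        return not event["hit_l4"] and not event["hit_nic"]
    if event["type"] == "l3_castout" and event["hit_nic"]:
        # Exclusive castouts, or read-only shared castouts that are the
        # final copy on the node, are captured by the L4 cache.
        return event["exclusive"] or event["final_copy"]
    return False

def installs_in_nic(event, nic_full):
    """Block 403: L4 evictions still owned by an L3 move to the NIC directory."""
    return event["type"] == "l4_eviction" and event["l3_owned"] and not nic_full
```

The invalidation rules of blocks 402 and 404 would be companion predicates clearing the validity bit of the affected entry.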
[0031] FIG. 5 depicts a method 500 for operation of a processor
node including a NIC directory in accordance with an embodiment.
First, in block 501, the SMP starts up, and the L4 cache and NIC
directory in the processor node are empty. Next, in block 502, as
the SMP begins executing instructions, initial lines are installed
in the L4 cache. The L4 cache is initially filled with cache lines
that are marked IM=1 and owned by a L3 cache in the processor node,
and the NIC directory is empty. Then, in block 503, in embodiments
in which the L4 cache is larger than the combined L3 caches, the L3
caches will start to cast out LRU data before the L4 cache is full.
The ownership status of lines in the L4 cache corresponding to
these L3 cache LRU castouts is updated to unowned. At this point,
the L4 cache contains mostly owned lines, with some unowned lines;
the NIC directory is still empty. Flow then proceeds to block 504,
in which, as the SMP continues to perform work, the L4 cache fills
up and starts evicting entries to make room for new entries.
Evictions from the L4 cache that are owned by a L3 cache are moved
to the NIC directory. This preserves lines that are owned by a L3
cache in the caching structure. At this point, the L4 cache
contains a mixture of L3-owned and unowned lines, and the NIC
directory has some L3-owned lines. Next, in block 505, the NIC
directory is filling up, and L3 cache LRU castouts start hitting in
the NIC directory. The L3 LRU castouts that hit in the NIC
directory are moved to the L4 cache. The L4 cache may make room for
a L3 LRU castout that hits in the NIC directory by selecting an
entry in the L4 cache that is owned exclusively by a L3 cache to be
moved to the NIC directory. If such an entry is not available in
the L4 cache, the L4 cache may select an entry for which IM=0 and
ownership is shared (MC=1) by one or more L3s. If such an entry is
not available in the L4 cache, the L4 cache may select an entry
for which IM=1 and ownership is shared (MC=1) by one or more L3s. At
this point, the L4 cache may have more unowned lines than owned
lines, and the NIC directory has more owned lines. Lastly, in block
506, a steady state is achieved, and most of the lines owned by a
L3 cache within the processor node are now in the NIC directory,
and the L4 cache holds mainly unowned lines, which may be evicted
to make room for new entries as needed. The L3-owned lines that
remain in the L4 cache may have IM=1 and MC=1 tags, allowing for
relatively fast responses to interventions requesting data to be
transferred to other processor nodes that are received on the SMP
bus.
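The victim-selection preference of block 505 can be sketched as an ordered search; this is an illustrative model with assumed flag names on toy entry records, not the L4 replacement hardware.

```python
def select_l4_victim_for_nic(entries):
    """Pick the L4 entry to displace into the NIC directory (block 505).

    Preference order: an entry exclusively owned by an L3 first, then a
    shared (MC=1) entry with IM=0, then a shared entry with IM=1.
    Returns None if no L3-owned entry qualifies.
    """
    for qualifies in (
        lambda e: e["exclusive"],                # exclusively owned by an L3
        lambda e: e["shared"] and not e["im"],   # shared, IM=0
        lambda e: e["shared"] and e["im"],       # shared, IM=1
    ):
        for entry in entries:
            if qualifies(entry):
                return entry
    return None
```

This ordering keeps IM=1 shared lines in the L4 cache for as long as possible, supporting the fast intervention responses described above.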
[0032] FIG. 6 depicts a processor node 600 including a NIC
directory in accordance with another embodiment. Processor node 600
includes L3 interfaces 601, SMP bus interface 602, pipeline 603,
NIC directory 604, L4 directory 605, L4 LRU 606, eviction logic
607, local store address registers (LSAR) 608, and local fetch
address register (LFAR) 609. L3 interfaces 601 may be in
communication with any appropriate number of L3 caches in the
processor node 600. SMP bus interface 602 is in communication with
a SMP bus that links a plurality of processor nodes in a SMP. L4
LRU 606 tracks the LRU entries in the L4 directory 605, and is used
by eviction logic 607 to determine entries to evict from the L4
directory 605 and not from the NIC directory 604, as NIC directory
604 does not need to evict for entry replacement. Elements 601-609
of processor node 600 may be included in the various embodiments of
processor nodes 103A-N, 200, and 300 that are shown in FIGS. 1, 2,
and 3A-D. FIGS. 7-15, which describe embodiments of various
operations that are performed in a processor node including a NIC
directory, are discussed below with respect to processor node 600
of FIG. 6.
[0033] FIG. 7 depicts a method 700 for a L3 fetch that hits in the
L4 or NIC directories in accordance with an embodiment. First, in
block 701, a L3 fetch from a requesting L3 cache goes from L3
interfaces 601 into pipeline 603, and hits in the L4 directory 605
or the NIC directory 604. Then, in block 702, based on the hit
being in the L4 directory 605, the hit entry is set to the most
recently used (MRU) position in the L4 directory 605. Next, in
block 703, the L3 fetch goes back into pipeline 603 to return the
fetch data back to the requesting L3 via L3 interfaces 601. On a
NIC directory hit, data is returned from another L3 cache within
the processor node. Lastly, in block 704, the ownership tag of the
hit entry is updated in either the NIC directory 604 or L4
directory 605 to reflect the requesting L3 cache.
[0034] FIG. 8 depicts a method 800 for a L3 fetch that misses in
the L4 and NIC directories in accordance with an embodiment. First,
in block 801, a L3 fetch from a requesting L3 cache goes from L3
interfaces 601 into pipeline 603, and misses the L4 directory 605
and the NIC directory 604. Next, in block 802, based on the L4
cache being full, an entry is evicted from the L4 cache to make
room for a new entry; this is discussed in further detail below
with respect to FIGS. 11 and 12. Then, in block 803, a snoop is
sent to the SMP bus for the L3 fetch via LFAR 609, pipeline 603,
and SMP bus interface 602. Next, in block 804 the fetch data is
returned on the SMP bus via SMP bus interface 602 in response to
the snoop, and is sent to the requesting L3 cache via L3 interfaces
601. In block 805 a new entry is created in the L4 directory 605
for the returned fetch data. Lastly, in block 806, the new entry in
the L4 directory 605 is validated and set to the MRU position.
[0035] FIG. 9 depicts a method 900 for a final copy L3 castout that
hits in the L4 directory in accordance with an embodiment. First,
in block 901, an entry is cast out, or evicted, from a L3 directory
(based on, for example, the L3 LRU), and this castout entry hits in
the L4 directory 605. Next, in block 902, the hit entry is set to
the MRU position in L4 directory 605. Then, in block 903, the
castout data is installed in the hit entry in the L4 cache. Lastly,
in block 904, the ownership tag of the hit entry in the L4
directory 605 is updated to unowned.
[0036] FIG. 10 depicts a method 1000 for a final copy L3 castout
that misses in the L4 directory and hits in the NIC directory in
accordance with an embodiment. First, in block 1001, an entry is
cast out, or evicted, from the L3 directory (based on, for example,
the L3 LRU) and this castout entry misses in the L4 directory 605
but hits in the NIC directory 604. Then, in block 1002, based on
the L4 cache being full, an entry is evicted from the L4 cache to
make room for a new entry corresponding to the L3 castout; this is
discussed in further detail below with respect to FIGS. 11 and 12.
Next, in block 1003, the address and data of the L3 castout entry
are installed in a new entry in the L4 cache. In block 1004, the
new entry is validated and set to the MRU position in the L4
directory 605. In block 1005, the ownership tag of the new entry in
the L4 directory 605 is set to unowned. Lastly, in block 1006, the
hit entry in the NIC directory 604 is invalidated.
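The two final-copy castout flows of methods 900 and 1000 differ only in whether the castout address already has an L4 entry. The following sketch combines them under assumed names and entry fields; it is an illustration of the described behavior, not the patent's logic.

```python
from collections import OrderedDict

def l3_castout(addr, data, l4_directory, nic_directory, l4_capacity=4):
    """Illustrative final-copy L3 castout against an OrderedDict L4 directory."""
    if addr in l4_directory:
        # Method 900: MRU the hit entry, install the castout data,
        # and mark the entry unowned (blocks 902-904).
        l4_directory.move_to_end(addr)
        l4_directory[addr].update(data=data, owner=None)
        return "L4-hit"
    # Method 1000: miss in L4 but hit in the NIC directory.
    if len(l4_directory) >= l4_capacity:
        l4_directory.popitem(last=False)   # block 1002: evict to make room
    # Blocks 1003-1005: install address/data as a new unowned MRU entry.
    l4_directory[addr] = {"data": data, "owner": None, "valid": True}
    # Block 1006: invalidate the hit entry in the NIC directory.
    nic_directory.pop(addr, None)
    return "NIC-hit"
```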
[0037] FIG. 11 depicts a method 1100 for eviction of an entry from
the L4 cache to the NIC directory in accordance with an embodiment.
First, in block 1101, it is determined that the L4 cache is full
and an eviction is needed from the L4 cache to make room for a new
entry, and that the NIC directory 604 has room for a new entry.
Next, in block 1102, the eviction logic 607 selects an entry from
the L4 directory 605 for eviction. Any L3 exclusively owned entry
in the L4 directory 605 is selected first; if no L3 exclusively
owned entry exists in the L4 directory 605, any shared entry with
IM=0 is selected; if no shared entry with IM=0 exists in the L4
directory 605, any shared entry with IM=1 is selected by the
eviction logic 607. Next, in block 1103, the selected entry is
installed and validated in the NIC directory 604. Lastly, in block
1104, the selected entry is invalidated in the L4 directory
605.
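The three-tier selection priority of block 1102 (exclusively owned first, then shared with IM=0, then shared with IM=1) lends itself to a short sketch. The entry fields "state" and "im" are assumed for illustration; the patent does not specify this representation.

```python
def select_l4_victim(l4_directory):
    """Pick the entry to move from the L4 directory to the NIC directory,
    following the priority order described for block 1102."""
    for predicate in (
        lambda e: e["state"] == "exclusive",              # L3 exclusively owned
        lambda e: e["state"] == "shared" and e["im"] == 0, # shared, IM=0
        lambda e: e["state"] == "shared" and e["im"] == 1, # shared, IM=1
    ):
        for addr, entry in l4_directory.items():
            if predicate(entry):
                return addr
    return None  # no eligible entry
```

The ordered tuple of predicates makes the priority explicit: a later class is consulted only when every earlier class is empty.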
[0038] FIG. 12 depicts a method 1200 for eviction of an entry from
the L4 cache to the main memory in accordance with an embodiment.
First, in block 1201, it is determined that the L4 cache is full
and an eviction is needed from the L4 cache to make room for a new
entry, and that the NIC directory 604 is also full. Next, in block
1202, the eviction logic 607 selects an entry based on L4 LRU 606
information for eviction from the L4 directory 605. The oldest
entry in the L4 directory 605 that is not owned by any L3 cache is
selected first; if no entry that is not owned by any L3 cache
exists in the L4 directory 605, the oldest entry in the L4
directory 605 that is owned by a L3 cache is selected by the
eviction logic 607. Then, in block 1203, if the data in the evicted
entry has been modified, the modified data is written back to the
main memory. Lastly, in block 1204, the selected entry is
invalidated in the L4 directory 605.
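Method 1200's policy, preferring the oldest entry not owned by any L3 and writing back modified data, can be sketched as follows. The OrderedDict models L4 LRU order (oldest key first), and the "owner", "modified", and "data" fields are illustrative assumptions.

```python
from collections import OrderedDict

def evict_to_memory(l4_directory, main_memory):
    """Illustrative eviction from a full L4 to main memory per method 1200."""
    # Block 1202: prefer the oldest entry not owned by any L3 cache;
    # otherwise fall back to the oldest owned entry.
    victim = next((a for a, e in l4_directory.items() if e["owner"] is None),
                  next(iter(l4_directory)))
    entry = l4_directory.pop(victim)        # block 1204: invalidate in the L4
    if entry.get("modified"):
        main_memory[victim] = entry["data"] # block 1203: write back if dirty
    return victim
```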
[0039] FIG. 13 depicts a method 1300 for a snoop fetch in
accordance with an embodiment. First, in block 1301, a snoop fetch
is received from another processor node on the SMP bus via SMP bus
interface 602. Next, in block 1302, the snoop hits exclusive to a
L3 cache in either the NIC directory 604 or the L4 directory 605,
and a cross interrogation is forwarded to the owning L3 via L3
interfaces 601. Then, in block 1303, the fetch data that was
retrieved by the cross interrogation is sent on the SMP bus via SMP
bus interface 602 to the requesting processor node. Lastly, in
block 1304, the ownership tag of the entry in the NIC directory 604
or the L4 directory 605 corresponding to the snoop hit is updated
to shared or invalidated based on the snoop fetch type.
[0040] FIG. 14 depicts a method 1400 for a snoop fetch that hits in
the L4 cache in accordance with an embodiment. First, in block 1401, a
snoop fetch is received from another processor node on the SMP bus
via SMP bus interface 602. Next, in block 1402, the snoop hits in
the L4 directory 605 having a shared IM=1 state, and a cross
interrogation is forwarded to the owning L3(s) via L3 interfaces
601 for an exclusive snoop to invalidate the L3(s). Then, in block
1403, the fetch data is accessed from the L4 cache and is sent on
the SMP bus via SMP bus interface 602 to the requesting processor
node. Lastly, in block 1404, the ownership tag of the entry in the
L4 directory 605 corresponding to the hit is either updated to
shared (for a shared snoop) or invalidated (for an exclusive
snoop), based on the snoop fetch type.
[0041] FIG. 15 depicts a method 1500 for a snoop fetch that hits in
the NIC directory in accordance with an embodiment. First, in block
1501, a snoop fetch is received from another processor node on the
SMP bus via SMP bus interface 602. Next, in block 1502, the snoop
hits in the NIC directory 604 having a shared IM=1 state, and a
cross interrogation is forwarded to the owning L3(s) via L3
interfaces 601. Then, in block 1503, the fetch data that was
retrieved by the cross interrogation is sent on the SMP bus via SMP
bus interface 602 to the requesting processor node. Lastly, in
block 1504, the ownership tag of the entry in the NIC directory 604
corresponding to the hit is updated to shared (for a shared snoop)
or invalidated (for an exclusive snoop), based on the snoop fetch
type.
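Methods 1300, 1400, and 1500 share a common shape: cross-interrogate the owning L3 cache(s) where needed, return the data to the requesting node, then downgrade or invalidate the directory entry by snoop type. The sketch below unifies them under assumed names; in particular, an entry with backing data models the L4 hit of method 1400 (data served from the L4 cache even though the L3s are interrogated), while an entry without data models the NIC cases.

```python
def handle_snoop(addr, snoop_type, directory, cross_interrogate):
    """snoop_type is 'shared' or 'exclusive'; directory maps addr -> entry."""
    entry = directory[addr]
    if entry["state"] == "exclusive" or entry.get("im") == 1:
        # Blocks 1302/1402/1502: cross-interrogate the owning L3 cache(s).
        fetched = cross_interrogate(addr, entry["owner"])
        # L4 hit (method 1400): data comes from the L4 cache itself;
        # NIC hit (methods 1300/1500): data comes from the owning L3.
        data = entry.get("data", fetched)
    else:
        data = entry.get("data")
    # Blocks 1304/1404/1504: update the ownership tag by snoop fetch type.
    if snoop_type == "shared":
        entry["state"] = "shared"   # downgrade to shared
    else:
        entry["state"] = "invalid"  # exclusive snoop: invalidate
    return data
```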
[0042] As will be appreciated by one skilled in the art, one or
more aspects of the present invention may be embodied as a system,
method or computer program product. Accordingly, one or more
aspects of the present invention may take the form of an entirely
hardware embodiment, an entirely software embodiment (including
firmware, resident software, micro-code, etc.) or an embodiment
combining software and hardware aspects that may all generally be
referred to herein as a "circuit," "module" or "system".
Furthermore, one or more aspects of the present invention may take
the form of a computer program product embodied in one or more
computer readable medium(s) having computer readable program code
embodied thereon.
[0043] Any combination of one or more computer readable medium(s)
may be utilized. The computer readable medium may be a computer
readable storage medium. A computer readable storage medium may be,
for example, but not limited to, an electronic, magnetic, optical,
electromagnetic, infrared or semiconductor system, apparatus, or
device, or any suitable combination of the foregoing. More specific
examples (a non-exhaustive list) of the computer readable storage
medium include the following: an electrical connection having one
or more wires, a portable computer diskette, a hard disk, a random
access memory (RAM), a read-only memory (ROM), an erasable
programmable read-only memory (EPROM or Flash memory), an optical
fiber, a portable compact disc read-only memory (CD-ROM), an
optical storage device, a magnetic storage device, or any suitable
combination of the foregoing. In the context of this document, a
computer readable storage medium may be any tangible medium that
can contain or store a program for use by or in connection with an
instruction execution system, apparatus, or device.
[0044] Referring now to FIG. 16, in one example, a computer program
product 1600 includes, for instance, one or more storage media
1602, wherein the media may be tangible and/or non-transitory, to
store computer readable program code means or logic 1604 thereon to
provide and facilitate one or more aspects of embodiments described
herein.
[0045] Program code, when created and stored on a tangible medium
(including but not limited to electronic memory modules (RAM),
flash memory, compact discs (CDs), DVDs, magnetic tape, and the
like) is often referred to as a "computer program product". The computer
program product medium is typically readable by a processing
circuit preferably in a computer system for execution by the
processing circuit. Such program code may be created using a
compiler or assembler, for example, to assemble instructions that,
when executed, perform aspects of the invention.
[0046] Technical effects and benefits include interception of
snoops by higher-level caches in a processor node of an SMP.
[0047] The terminology used herein is for the purpose of describing
particular embodiments only and is not intended to be limiting of
embodiments. As used herein, the singular forms "a", "an" and "the"
are intended to include the plural forms as well, unless the
context clearly indicates otherwise. It will be further understood
that the terms "comprises" and/or "comprising," when used in this
specification, specify the presence of stated features, integers,
steps, operations, elements, and/or components, but do not preclude
the presence or addition of one or more other features, integers,
steps, operations, elements, components, and/or groups thereof.
[0048] The corresponding structures, materials, acts, and
equivalents of all means or step plus function elements in the
claims below are intended to include any structure, material, or
act for performing the function in combination with other claimed
elements as specifically claimed. The description of embodiments
has been presented for purposes of illustration and description,
but is not intended to be exhaustive or limited to the embodiments
in the form disclosed. Many modifications and variations will be
apparent to those of ordinary skill in the art without departing
from the scope and spirit of the embodiments. The embodiments were
chosen and described in order to best explain the principles and
the practical application, and to enable others of ordinary skill
in the art to understand the embodiments with various modifications
as are suited to the particular use contemplated.
[0049] Computer program code for carrying out operations for
aspects of the embodiments may be written in any combination of one
or more programming languages, including an object oriented
programming language such as Java, Smalltalk, C++ or the like and
conventional procedural programming languages, such as the "C"
programming language or similar programming languages. The program
code may execute entirely on the user's computer, partly on the
user's computer, as a stand-alone software package, partly on the
user's computer and partly on a remote computer or entirely on the
remote computer or server. In the latter scenario, the remote
computer may be connected to the user's computer through any type
of network, including a local area network (LAN) or a wide area
network (WAN), or the connection may be made to an external
computer (for example, through the Internet using an Internet
Service Provider).
[0050] Aspects of embodiments are described above with reference to
flowchart illustrations and/or schematic diagrams of methods,
apparatus (systems) and computer program products according to
embodiments. It will be understood that each block of the flowchart
illustrations and/or block diagrams, and combinations of blocks in
the flowchart illustrations and/or block diagrams, can be
implemented by computer program instructions. These computer
program instructions may be provided to a processor of a general
purpose computer, special purpose computer, or other programmable
data processing apparatus to produce a machine, such that the
instructions, which execute via the processor of the computer or
other programmable data processing apparatus, create means for
implementing the functions/acts specified in the flowchart and/or
block diagram block or blocks.
[0051] These computer program instructions may also be stored in a
computer readable medium that can direct a computer, other
programmable data processing apparatus, or other devices to
function in a particular manner, such that the instructions stored
in the computer readable medium produce an article of manufacture
including instructions which implement the function/act specified
in the flowchart and/or block diagram block or blocks.
[0052] The computer program instructions may also be loaded onto a
computer, other programmable data processing apparatus, or other
devices to cause a series of operational steps to be performed on
the computer, other programmable apparatus or other devices to
produce a computer implemented process such that the instructions
which execute on the computer or other programmable apparatus
provide processes for implementing the functions/acts specified in
the flowchart and/or block diagram block or blocks.
[0053] The flowchart and block diagrams in the Figures illustrate
the architecture, functionality, and operation of possible
implementations of systems, methods, and computer program products
according to various embodiments. In this regard, each block in the
flowchart or block diagrams may represent a module, segment, or
portion of code, which comprises one or more executable
instructions for implementing the specified logical function(s). It
should also be noted that, in some alternative implementations, the
functions noted in the block may occur out of the order noted in
the figures. For example, two blocks shown in succession may, in
fact, be executed substantially concurrently, or the blocks may
sometimes be executed in the reverse order, depending upon the
functionality involved. It will also be noted that each block of
the block diagrams and/or flowchart illustration, and combinations
of blocks in the block diagrams and/or flowchart illustration, can
be implemented by special purpose hardware-based systems that
perform the specified functions or acts, or combinations of special
purpose hardware and computer instructions.
* * * * *