U.S. patent number 5,701,503 [Application Number 08/360,331] was granted by the patent office on 1997-12-23 for method and apparatus for transferring information between a processor and a memory system.
This patent grant is currently assigned to Intel Corporation. Invention is credited to John M. Bauer, Michael W. Rhodehamel, Nitin V. Sarangdhar, Gurbir Singh, Wen-Hann Wang.
United States Patent |
5,701,503 |
Singh , et al. |
December 23, 1997 |
**Please see images for:
( Certificate of Correction ) ** |
Method and apparatus for transferring information between a
processor and a memory system
Abstract
A method and apparatus for transferring information between a
processor and a memory system utilizing a chunk write buffer, where
read and write requests to the L2 cache memory are controlled by
the processor. The cache line associated with each such request is
larger than the interface coupling the L2 cache memory and the
processor. Read requests are returned from the L2 cache memory to
the processor in burst fashion. Write requests are transferred from
the processor to the L2 cache memory during clock cycles in which
the processor does not require the interface for a read request.
Write requests need not be transferred in burst fashion; rather, a
portion of the write request corresponding to the size of the
interface, referred to as a chunk, is transferred from the
processor to the L2 cache memory and stored temporarily in the
chunk write buffer. When the processor has transferred the entire
cache line to the L2 cache memory, the processor signals the L2
cache memory to transfer the contents of the chunk write buffer
into the data array of the cache memory.
Inventors: |
Singh; Gurbir (Portland,
OR), Wang; Wen-Hann (Portland, OR), Rhodehamel; Michael
W. (Beaverton, OR), Bauer; John M. (Portland, OR),
Sarangdhar; Nitin V. (Portland, OR) |
Assignee: |
Intel Corporation (Santa Clara,
CA)
|
Family
ID: |
46250162 |
Appl.
No.: |
08/360,331 |
Filed: |
December 21, 1994 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number | Issue Date |
176782 | Jan 4, 1994 | | |
Current U.S.
Class: |
711/126; 711/113;
711/118; 711/E12.043 |
Current CPC
Class: |
G06F
12/0897 (20130101); G06F 12/0831 (20130101) |
Current International
Class: |
G06F
12/08 (20060101); G06F 012/00 () |
Field of
Search: |
;395/800,445,440,872,875,876,550,877,469,470,453 |
Primary Examiner: Harvey; Jack B.
Assistant Examiner: Etienne; Ario
Attorney, Agent or Firm: Blakely, Sokoloff, Taylor &
Zafman
Parent Case Text
This application is a continuation-in-part application of U.S.
patent application Ser. No. 08/176,782, entitled "Method and
Apparatus for Controlling Multiple Dice with a Single Die", filed
Jan. 4, 1994, now abandoned, and assigned to the corporate assignee
of the present invention.
Claims
What is claimed is:
1. A method of transferring information between a processor and a
cache memory via a bus, the method comprising the steps of:
(a) transferring a first portion of data corresponding to a write
request from the processor to the cache memory during a first clock
cycle;
(b) transferring a read request from the processor to the cache
memory during a second clock cycle; and
(c) transferring a second portion of data corresponding to the
write request from the processor to the cache memory during a third
clock cycle, wherein the second clock cycle is both subsequent to
the first clock cycle and prior to the third clock cycle, and
wherein the read request and the write request target different
memory addresses.
2. The method of claim 1, further comprising the steps of
transferring a third portion of data corresponding to the write
request and a fourth portion of data corresponding to the write
request from the processor to the cache memory during a fourth
clock cycle and a fifth clock cycle, respectively, in which the
processor does not require the use of the bus for the read
request.
3. The method of claim 2, further comprising the step of storing
the first portion of data, second portion of data, third portion of
data and fourth portion of data corresponding to the write request
in the cache memory such that the second portion of data is stored
in a location subsequent to the first portion of data, the third
portion of data is stored in a location subsequent to the second
portion of data, and the fourth portion of data is stored in a
location subsequent to the third portion of data.
4. The method of claim 1, wherein the read request comprises
transferring a cache line from the cache memory to the processor
during consecutive clock cycles.
5. The method of claim 1, wherein the first portion of data is a
fraction of a cache line associated with the write request.
6. The method of claim 1, wherein the first portion of data is
one-fourth of a cache line associated with the write request.
7. The method of claim 1, wherein the cache memory includes a
plurality of cache lines and a buffer, wherein the transferring
step (a) comprises the step of transferring the first portion of
data to the buffer of the cache memory, wherein the transferring
step (c) comprises the step of transferring the second portion of
data to the buffer of the cache memory, and further comprising the
step of transferring a signal to the cache memory from the
processor, the signal indicating to the cache memory to transfer
the data in the buffer to one of the plurality of cache lines.
8. The method of claim 7, wherein the step of transferring a signal
further comprises transferring the signal during a fourth clock
cycle, the fourth clock cycle being subsequent to the third
cycle.
9. The method of claim 7, wherein the step of transferring a signal
further comprises transferring the signal during the third clock
cycle.
10. The method of claim 1, wherein the read request comprises
requesting to transfer a cache line from the cache memory to the
processor.
11. In a computer system having a processor coupled to a bus and a
cache memory coupled to the bus, wherein the cache memory includes
a plurality of cache lines and a buffer, a method of transferring
information between the processor and the cache memory, the method
comprising the steps of:
(a) transferring a plurality of read requests from the processor to
the cache memory during a first plurality of consecutive clock
cycles, wherein the first plurality of consecutive clock cycles
includes a first plurality of non-consecutive clock cycles, a
second plurality of consecutive clock cycles and a third plurality
of consecutive clock cycles, and wherein a first read request of
the plurality of read requests utilizes the bus during each clock
cycle of the second plurality of consecutive clock cycles and a
second read request of the plurality of read requests utilizes the
bus during each clock cycle of the third plurality of consecutive
clock cycles;
(b) transferring a first portion of a write request from the
processor to the buffer during a first clock cycle, wherein the
first clock cycle is one of the first plurality of non-consecutive
clock cycles, wherein the first clock cycle occurs prior to the
second plurality of consecutive clock cycles, and wherein the first
portion of the write request comprises transferring a first portion
of a cache line from the processor to the buffer;
(c) transferring a second portion of the write request from the
processor to the buffer during a second clock cycle, wherein the
second clock cycle occurs subsequent to the second plurality of
consecutive clock cycles, and wherein the second clock cycle is one
of the first plurality of non-consecutive clock cycles; and
(d) transferring a signal to the cache memory from the processor,
the signal indicating to the cache memory to transfer the contents
of the buffer to a cache line of the cache memory.
12. The method of claim 11, wherein the transferring step (c)
further comprises transferring a third portion of the write request
and a fourth portion of the write request from the processor to the
buffer during a third clock cycle and a fourth clock cycle,
respectively, the third and fourth clock cycles each being one of
the first plurality of non-consecutive clock cycles.
13. The method of claim 11, wherein the transferring step (d)
comprises transferring the signal during a third clock cycle, the
third clock cycle being one of the first plurality of
non-consecutive clock cycles and also being subsequent to the first
and second clock cycles.
14. The method of claim 11, wherein the transferring step (d)
comprises transferring the signal during the second clock
cycle.
15. A computer system comprising:
a cache memory which includes a plurality of cache lines and a
buffer;
a processor;
an interface coupled to the cache memory and the processor; and
interface control logic, coupled to the interface, for transferring
a plurality of read requests from the processor to the cache memory
and transferring a plurality of portions of data corresponding to a
write request from the processor to the buffer only in clock cycles
in which the processor does not require the use of the interface
for any of the plurality of read requests, wherein each of the
plurality of read requests is to a different address than the write
request, and wherein one of the plurality of read requests is
initiated by the processor subsequent to the write request being
initiated by the processor.
16. The system of claim 15, wherein the interface is a bus.
17. The system of claim 15, wherein the cache memory stores each of
the plurality of portions of data in the buffer, and wherein the
interface control logic is also for transferring a signal to the
cache memory from the processor, the signal indicating to the cache
memory to transfer the contents of the buffer to one of the
plurality of cache lines.
18. A memory subsystem for use in a computer system, the memory
subsystem comprising:
a cache memory which includes a plurality of cache lines and a
buffer;
an interface coupled to the cache memory and a processor; and
interface control logic coupled to the interface for,
transferring a first portion of data corresponding to a write
request from the processor to the cache memory via the interface
during a first clock cycle;
transferring a read request from the processor to the cache memory
via the interface during a second clock cycle; and
transferring a second portion of data corresponding to the write
request from the processor to the cache memory via the interface
during a third clock cycle, wherein the second clock cycle is both
subsequent to the first clock cycle and prior to the third clock
cycle, and wherein the read request and the write request are to
different addresses.
19. The memory subsystem of claim 18, wherein the interface is a
dedicated bus.
20. The memory subsystem of claim 18, wherein the interface control
logic is for transferring both the first portion of data and the
second portion of data to the buffer of the cache memory, and
wherein the interface control logic is also for transferring a
signal to the cache memory from the processor, the signal
indicating to the cache memory to transfer the data in the buffer
to one of the plurality of cache lines.
21. The memory subsystem of claim 20, wherein the interface control
logic is for transferring the signal during a fourth clock cycle,
the fourth clock cycle being subsequent to the third clock
cycle.
22. The memory subsystem of claim 20, wherein the interface control
logic is for transferring the signal during the third clock cycle.
Description
FIELD OF THE INVENTION
The present invention relates to the field of computer systems;
more particularly, the present invention relates to the field of
transferring information between an agent and a memory unit.
BACKGROUND OF THE INVENTION
Throughout the development of computer systems, a primary emphasis
has been on increasing the speed of such systems and their ability
to handle larger and more complicated programs while reducing their
cost. In order to increase the ability of a computer system, it is
necessary to both increase the size of the random access memory
(RAM) so its larger programs may be utilized by the computer system
and to increase the speed at which access to that RAM is afforded.
The straightforward method of increasing access speed is to use
components which operate more quickly. However, such
rapidly-operating components are more expensive than slower memory
components.
With the cost involved in providing high speed RAM, advanced
computer systems have used high-speed cache memory arrangements to
increase the operational speed of the memory system. A cache memory
arrangement provides a small portion of a specially fast memory in
addition to the regular RAM. As commands are issued and data is
utilized, the information is called from the RAM and stored in this
cache memory. As each new read and write command is issued, the
system looks to the fast memory cache to determine if the
information is stored in the cache. If the information is available
in the cache memory, access to the RAM is not required and the
command may be processed or the data accessed much more readily. If
the information is not available in the cache memory, the new data
can be copied from the main memory and stored in the cache memory
where it remains and can be accessed for later use by the system.
In well-designed memory systems, the information sought lies in the
cache memory over 90% of the time, on average. Consequently, use of
the cache memory substantially speeds the overall operation of the
memory utilized in the computer system.
In order to further enhance the speed of operation of the computer
system, it has been found desirable to directly associate a small
portion of extremely rapid cache memory directly on a processor
chip. For example, it may be useful to provide such a small fast
cache memory consisting of 8 kilobytes of memory directly on the
chip with the other elements of a CPU. Such an arrangement is
capable of greatly increasing the speed of operation of the
system for information which is used repeatedly
by various processes.
Today, cache memories are commonly designed at two levels: a first
level (L1) cache and a second level (L2) cache. An L1 cache is a
single layer of high speed memory between a microprocessor and main
system dynamic RAM (DRAM) memory. L1 caches hold copies of code and
data most frequently requested by the microprocessor and are
typically small ranging from 4 kilobytes to 64 kilobytes in size.
The L2 cache, on the other hand, is a second layer of high speed
memory between the L1 cache and the main system DRAM memory. L2
caches also hold copies of code and data frequently requested by
the microprocessor. The L2 cache handles the more random memory
requests that the L1 cache misses. In order to simplify the handling
of requests that the L1 cache misses, the L2 cache typically
includes all the data of the L1 cache and more. As a result, an L2
cache is almost always larger than an L1 cache, ranging in size
typically from 64 kilobytes to 1024 kilobytes.
The performance of a cache is affected by the organization of the
cache. Typically, there are three types of organizations that are
most commonly used. These are fully associative, set associative
and direct mapped (one-way set associative). In a fully associative
cache memory, each item of information from the main memory system
is stored as a unique cache entry. There is no relationship between
the location of the information in the data cache RAM memory and
its original location in the main system memory. If there are x
storage locations in the cache, the cache will remember the last x
main system memory locations accessed by the microprocessor. With a
fully associative cache, the location of each store can hold
information from any location in the main system memory. As a
result, the cache requires complex tag entries (to map the complete
main memory system memory space), resulting in very complex and
expensive cache comparison logic. Set associative cache
organizations divide the data cache RAM into banks of memory, or
"ways". A 2-way set associative cache divides the data cache RAM
into two ways, a 4-way set associative cache into four ways, and so
on. The set associative cache separates main system memory into
pages, where each page is equal in size to the size of a way. For
example, a 64 k-byte 2-way set associative cache would logically
see main memory as a collection of 32 k-byte pages, equal in size
to each way. Each location in a memory page can map only to the
same location in a cache way. For example, in a 2-way set
associative cache memory, each location in the main system memory
page can map to the same location in either of the two cache
ways. When the microprocessor makes a memory
request, the set associative cache compares the memory request with
the tag entry at the page location in each of its ways to determine
if the information is in the cache (i.e., a hit). This means the
cache has to do one comparison for each way, for a total number of
comparisons equal to the number of ways. For example, in a 2-way
set associative cache memory, the cache would only have to make two
parallel comparisons to determine if the information requested is
stored in the cache.
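By way of illustration only, the page arithmetic described above
may be sketched as follows. The function names are hypothetical and
do not appear in the original disclosure; the sketch assumes the
cache size divides evenly into the number of ways.

```python
def way_size_bytes(cache_bytes: int, ways: int) -> int:
    """Size of one way, which is also the page size described in the text."""
    if ways <= 0 or cache_bytes % ways:
        raise ValueError("cache size must divide evenly into the ways")
    return cache_bytes // ways


def location_in_way(address: int, cache_bytes: int, ways: int) -> int:
    """Offset of an address within its page.

    Every page maps to the same offset in each way, so two addresses
    exactly one page apart compete for the same location.
    """
    return address % way_size_bytes(cache_bytes, ways)
```

For the 64 k-byte 2-way example in the text, way_size_bytes(64 *
1024, 2) evaluates to 32 k-bytes, and any two addresses that differ
by a multiple of 32 k-bytes yield the same location_in_way value.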
A direct mapped (1-way set associative) cache organization uses the
entire data cache RAM as one bank of memory or way. The main system
memory is logically separated into pages, where each
page is the size of a data cache RAM. Each location in any main
system memory page directly maps only into the same location in the
data cache RAM.
In prior art, a separate cache controller is used to provide access
to the L2 cache. The cache controller is separate from the
processor in the computer system, usually as a separate computer
chip. The cache controller is very complicated logic. Most
processor systems contain two such controllers, one to control the
L1 cache within the processor and the other to control the L2 cache
in the system. The design of these two controllers is a compromise
between performance and complexity of state that must be shared
between them. The system of such hierarchical caches would provide
the highest overall performance if the two cache controllers had
access to information of both the caches and the processor and bus
accesses. This is clearly not possible when the cache controller
for the L2 cache lies in a separate package.
Another problem with the prior art is that the L2 cache is on the
system bus and access to the L2 cache is limited to the speed of
the system bus. For instance, if the system bus is running at 10
MHz, an access to the L2 cache cannot be performed faster than 10
MHz. It would be beneficial for the processor to be able to access
the L2 cache at a rate faster than that of the system bus in order
to increase the overall speed of the system.
Thus, it is desirable to have a processor that may operate with
multiple types of cache organizations, including the option of
operating without a cache (if so desired). Therefore, as the
different organizations are upgraded in the future, the
microprocessor may not have to undergo any changes itself.
An additional problem with the prior art is that many processors
utilize information from memory in quantities larger than the size
of the bus. For example, a processor may read and write information
in 16-byte quantities over a 4-byte bus. Thus, at least four clock
cycles are needed to transfer the entire 16-byte quantity over the
bus. With some transactions, however, a pause by the processor of
four clock cycles while the 16-byte quantity is transferred
unnecessarily degrades processor performance. For example, for a
write request where the processor has finished using the data being
written back to memory, a pause while the data is transferred over
the bus prevents the processor from executing further instructions.
This pause is not necessary for the processor's operation because
the processor does not need the data being transferred. Thus, it
would be advantageous for the processor to be able to continue
executing instructions without pausing in such situations. The
present invention provides for this and other advantageous
results.
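The cycle count in the example above follows from a ceiling
division of the transfer size by the bus width. A minimal sketch,
using a hypothetical helper name and assuming one bus-width transfer
per clock cycle, is:

```python
def transfer_cycles(quantity_bytes: int, bus_bytes: int) -> int:
    """Minimum clock cycles to move a quantity over a bus of the given width.

    Assumes one bus-width transfer per clock cycle (an assumption made
    for illustration, consistent with the example in the text).
    """
    if quantity_bytes < 0 or bus_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division without floating point.
    return -(-quantity_bytes // bus_bytes)
```

With the figures from the text, transfer_cycles(16, 4) gives four
clock cycles.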
SUMMARY OF THE INVENTION
A method and apparatus for transferring information between a
processor and a memory system are described herein. Read and write
requests to the L2 cache memory are controlled by the processor.
The cache line associated with each such request is larger than the
interface coupling the L2 cache memory and the processor. Read
requests are returned from the L2 cache memory to the processor in
burst fashion. Write requests are transferred from the processor to
the L2 cache memory during clock cycles in which the processor does
not require the interface for a read request. Write requests need
not be transferred in burst fashion; rather, a portion of the write
request corresponding to the size of the interface, referred to as
a chunk, is transferred from the processor to the L2 cache memory
and stored temporarily in the chunk write buffer. When the
processor has transferred the entire cache line to the L2 cache
memory, the processor signals the L2 cache memory to transfer the
contents of the chunk write buffer into the data array of the cache
memory.
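The interleaving summarized above may be sketched, by way of
illustration only, as a small cycle-by-cycle model. All names are
hypothetical. The model assumes four chunks per cache line, takes
the cycles needed for reads as a given set of cycle numbers, and
assumes the commit signal is sent in the same cycle as the final
chunk; it may equally be sent in a later cycle.

```python
from dataclasses import dataclass, field

CHUNKS_PER_LINE = 4  # a 32-byte line moved over an 8-byte data bus


@dataclass
class ChunkWriteBuffer:
    """Holds write chunks until the whole cache line has arrived."""
    chunks: list = field(default_factory=list)

    def store(self, chunk):
        self.chunks.append(chunk)

    def is_full(self) -> bool:
        return len(self.chunks) == CHUNKS_PER_LINE

    def commit(self, data_array: dict, line_address: int):
        """Move the buffered line into the data array and empty the buffer."""
        data_array[line_address] = tuple(self.chunks)
        self.chunks.clear()


def schedule(read_busy_cycles, write_line, line_address, total_cycles):
    """Return a per-cycle trace and the resulting data array.

    'R' marks a cycle the interface is needed for a read,
    'W' marks a cycle used to send one write chunk,
    '-' marks an idle cycle.
    """
    buffer = ChunkWriteBuffer()
    data_array = {}
    pending = list(write_line)
    trace = []
    for cycle in range(total_cycles):
        if cycle in read_busy_cycles:
            trace.append("R")          # reads always take priority
        elif pending:
            buffer.store(pending.pop(0))
            trace.append("W")
            if buffer.is_full():       # assumed: commit with the last chunk
                buffer.commit(data_array, line_address)
        else:
            trace.append("-")
    return "".join(trace), data_array
```

For example, with reads occupying cycles 1 through 4, the first
write chunk goes out in cycle 0 and the remaining three follow once
the read burst has finished, after which the line is committed.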
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will be understood more fully from the
detailed description given below and from the accompanying drawings
of the preferred embodiments of the invention, which, however,
should not be taken to limit the invention to the specific
embodiments, but are for explanation and understanding only.
FIG. 1 is a block diagram of one embodiment of the cache memory
system of the present invention.
FIG. 2 is a block diagram of the second level (L2) cache memory of
the present invention.
FIG. 3 is a block diagram of one embodiment of the memory subsystem
of a computer system implemented according to the present
invention.
FIG. 4 is a diagram of one embodiment of the computer system of the
present invention.
FIGS. 5A, 5B, 5C and 5D illustrate embodiments of an integrated
circuit package containing multiple integrated circuit dice
according to the present invention.
FIG. 6 is a flowchart showing the steps followed by one embodiment
of the present invention in transferring a write request to an L2
cache memory.
DETAILED DESCRIPTION OF THE PRESENT INVENTION
A cache memory subsystem for use in a computer system is described.
In the following detailed description of the present invention
numerous specific details are set forth, such as specific numbers
of bits, command and signal names, etc., in order to provide a
thorough understanding of the present invention. However, it will
be understood by one skilled in the art that the present invention
may be practiced without these specific details. In other
instances, well-known structures and devices are shown in block
diagram form, rather than in detail, in order to avoid
unnecessarily obscuring the present invention.
The present invention provides for single integrated circuit (IC)
die control of multiple IC dice. The multiple dice may be contained
within a single integrated circuit package, such that the single
die controls the operations of the dice within the package,
including its own control. The single die is able to provide
control for other dice due to the partitioning of functionality
between the dice. The control function of the dice is partitioned,
such that the control necessary to monitor and start operations on
other dice is placed on a single die, while the control needed to
perform a specific function on a die remains. In this manner, a
single die is able to know what all of the other dice are doing,
while the other dice are performing the function themselves.
The present invention may provide for single die control of a two
die integration. In one embodiment, the present invention provides
for a second level (L2) cache memory and a processor multi-die
arrangement in a computer system.
FIG. 1 illustrates the processor and L2 cache memory arrangement of
the present invention. Referring to FIG. 1, processor 101 is
coupled to the L2 cache memory 102 by bus 103. Bus 103 is referred
to as the backside bus. Processor 101 is also coupled to a system
bus (not shown) via bus 104. In one embodiment, L2 cache memory 102
may also be coupled to the system bus via its own dedicated bus.
Processor 101 and L2 cache memory 102 may be contained in a dual
cavity integrated circuit package. Such a package, along with the
coupling of processor 101 to cache memory 102, is described later
below.
In one embodiment, the processor of the present invention is an
Intel Architecture Microprocessor such as manufactured by Intel
Corporation of Santa Clara, Calif., the corporate assignee of the
present invention. In the present invention, processor 101 may
include a level one (L1) cache memory. In the present invention,
processor 101 contains the functionality for controlling both the
L1 and L2 cache memories.
In one embodiment, L2 cache memory 102 comprises cache management
logic and a static random access memory (SRAM) that includes data
and a cache directory. The data is stored in a data storage array.
The cache directory may include a tag array, tag status and least
recently used (LRU) bits. The cache management logic includes the
logic to perform tag matching, a replacement algorithm and data
routing.
In one embodiment, L2 cache memory 102 is a 4-way set associative
cache memory. L2 cache memory 102 is block oriented with a 32-byte
line size divided into four 8-byte "chunks". In one embodiment, L2
cache memory 102 stores 512 Kbytes of information. L2 cache memory
102 may be designed to store other amounts of information, such as
256 Kbytes of information.
The L2 cache memory 102 communicates with processor 101 using bus
103. In one embodiment, bus 103 includes an address bus for sending
addresses from processor 101 to L2 cache memory 102, a
bi-directional data bus, a command bus for sending commands from
processor 101 to the L2 cache memory 102 and a bi-directional
status and state bus. Bus 103 is internal to the package when
processor 101 and cache memory 102 are contained in the same
package. In one embodiment, bus 103 is a high frequency (125-225
MHz), non-multiplexed bus with a 64-bit wide data bus with error
correction coding (ECC), a 33-bit address bus (supporting a 36-bit
address), 24 control pins, 12 data integrity pins plus a reset
signal and a clock signal. Each 8-byte quantity on the data bus is
referred to herein as a "chunk".
In the present invention, L2 cache memory 102 and processor 101 may
operate at different speeds. In one embodiment, L2 cache memory 102
operates at 66 MHz, while processor 101 operates at 133 MHz. In one
embodiment, bus 103 provides for communication between L2 cache
memory 102 and processor 101 and operates at 150 MHz.
Included along with each L2 data array 32-byte line are 32 data
integrity bits. (Eight bits are stored with each set of 64 data
bits). The L2 cache memory regards these extra bits as extensions
of the data array line, and does not interpret, or generate these
bits.
One Embodiment of the L2 Cache Memory
A block diagram of one embodiment of the L2 cache memory is shown
in FIG. 2. Referring to FIG. 2, L2 cache memory 200 includes
address latch 201, tag array 202, tag comparison logic 203, control
logic 204, chunk write buffer 205, data array 206, multiplexer
(MUX) select 207, data output register 208, and chunk MUX select
logic 209. Control logic 204 is coupled to components in L2 cache
memory 200 to control their operation, partially in response to
control signals 212 received on its inputs and signals by L2 cache
memory 200. Control logic 204 also provides signals 213 to the
processor.
The address latch 201 receives an address from bus 103. Bits [15:5]
of the address select one of the sets in the tag array 202. The
contents of the set are compared with bits [35:16] of the address
in the comparison logic 203 and the result is used to select the
appropriate way in the multiplexer select logic 207. Bits [4:3] of
the address select which chunk of the data gets sent to the data
bus 220, via the chunk multiplexer logic 209. The status signals
from the tag array 202 are also included with signals 213 from the
control logic 204.
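The address fields described above may be extracted with shifts and
masks. The following is a minimal sketch for the bit positions given
in the text (a 36-bit address with 2048 sets); the function name is
hypothetical.

```python
def split_address(address: int):
    """Split a 36-bit address into (tag, set_index, chunk).

    chunk     = bits [4:3]   selects one of four 8-byte chunks
    set_index = bits [15:5]  selects one of 2048 sets
    tag       = bits [35:16] compared against the stored tags
    """
    chunk = (address >> 3) & 0x3
    set_index = (address >> 5) & 0x7FF
    tag = (address >> 16) & 0xFFFFF
    return tag, set_index, chunk
```

Bits [2:0] select a byte within a chunk and are not used by the
fields above.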
In one embodiment, cache memory 200 is a 256 Kbyte four-way set
associative cache, such that tag array 202 includes four separate
ways (shown as way 0-3). In one implementation, tag array 202
includes 2048 sets, labeled set 0 to set 2047. In an alternate
embodiment, cache memory 200 is a 512 Kbyte four-way set
associative cache. In this alternate embodiment, tag array 202
includes 4096 sets, bits [16:5] of the address in address latch 201
select one of the sets in the tag array 202, and the contents of
the set are compared with bits [35:17] of the address in the
comparison logic 203.
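The set counts and bit ranges of the two embodiments above follow
from the cache size, the number of ways and the line size. A sketch
of that arithmetic, using a hypothetical function name and assuming
power-of-two sizes, a 32-byte line and a 36-bit address, is:

```python
def address_fields(cache_bytes: int, ways: int = 4,
                   line_bytes: int = 32, address_bits: int = 36):
    """Derive the number of sets and the (high, low) bit ranges.

    Assumes the line size and the resulting set count are powers of two.
    """
    sets = cache_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # 5 for a 32-byte line
    index_bits = sets.bit_length() - 1          # 11 for 2048 sets
    set_high = offset_bits + index_bits - 1
    return {
        "sets": sets,
        "set_bits": (set_high, offset_bits),
        "tag_bits": (address_bits - 1, set_high + 1),
    }
```

For a 256 Kbyte cache this yields 2048 sets, set bits [15:5] and tag
bits [35:16]; for a 512 Kbyte cache it yields 4096 sets, set bits
[16:5] and tag bits [35:17], in agreement with the text.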
Each set location in each way includes tag bits and status bits and
additional parity bits. Two additional parity bits are included.
The size of the tag is dependent on the size of the L2 cache
memory. The status bits reflect the state of the L2 cache memory
line: Modified, Exclusive, Shared, or Invalid (MESI). These states
are well-known to those skilled in the art.
It should be noted that the L2 cache memory regards the
Modified/Exclusive/Shared states as valid, and does not distinguish
between any of the valid states. In the present invention, the
processor distinguishes between the valid states and, thus,
implements the cache coherency protocol. That is, the processor of
the present invention directly controls the status (MESI) of the
stored data lines within the L2 cache memory, and the target
associative way information on line writes. Each set of four tags
corresponding to the cache lines in all four ways in the tag array
contains one least recently used (LRU) field for determining the
victim line during a read miss. The L2 cache memory implements a
pseudo-LRU algorithm for its line replacement policy. In one
embodiment, tag array 202 includes an LRU array having an entry for
each set in order to replace the least recently used lines in the
L2 cache memory. Therefore, each set is associated with one LRU
entry, four tag entries and four state entries.
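The text does not specify which pseudo-LRU algorithm is used. As an
illustration only, the following sketch assumes the common
tree-based scheme for four ways, which keeps three bits per set. The
class name and bit layout are assumptions and are not taken from the
original disclosure.

```python
class TreePLRU4:
    """Tree-based pseudo-LRU for one 4-way set (assumed scheme).

    Each bit points toward the side holding the next victim:
      root  = 0 -> victim is in ways 0/1, 1 -> victim is in ways 2/3
      left  = 0 -> way 0, 1 -> way 1
      right = 0 -> way 2, 1 -> way 3
    """

    def __init__(self):
        self.root = 0
        self.left = 0
        self.right = 0

    def touch(self, way: int):
        """Record an access: make every bit on the path point away from it."""
        if way not in (0, 1, 2, 3):
            raise ValueError("way must be 0..3")
        if way < 2:
            self.root = 1
            self.left = 1 if way == 0 else 0
        else:
            self.root = 0
            self.right = 1 if way == 2 else 0

    def victim(self) -> int:
        """Return the way the bits currently point to."""
        if self.root == 0:
            return self.left
        return 2 + self.right
```

Because only three bits are kept, the victim chosen is an
approximation and may differ from the true least recently used way.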
The L2 cache memory supports data integrity through parity. The
data integrity is intended to cover transmission errors on the
backside bus and storage errors in tag array 202 and data array
206.
In one embodiment, data array 206 is a 288 Kbyte, 4-way set
associative SRAM that is logically organized as a
4 × 2K × 288. The 288 bit word length is comprised of four
72 bit wide chunks, where each chunk contains 64 data bits plus 8
ECC bits. In an alternate embodiment, data array 206 is a 576
Kbyte, 4-way set associative SRAM that is logically organized as a
4 x 4K x 288.
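The stated capacities follow directly from the organization. The short calculation below is a sketch that checks the arithmetic; the constant and function names are illustrative.

```python
# Capacity check for the data array organizations described above.
WAYS = 4
BITS_PER_LINE = 288          # four 72-bit chunks
CHUNKS_PER_LINE = 4
DATA_BITS_PER_CHUNK = 64     # the remaining 8 bits per chunk are ECC

def array_kbytes(ways, sets, bits_per_line):
    """Total array size in Kbytes for a ways x sets x bits organization."""
    return ways * sets * bits_per_line // 8 // 1024

total_2k = array_kbytes(WAYS, 2 * 1024, BITS_PER_LINE)
data_2k = array_kbytes(WAYS, 2 * 1024, CHUNKS_PER_LINE * DATA_BITS_PER_CHUNK)
total_4k = array_kbytes(WAYS, 4 * 1024, BITS_PER_LINE)

print(total_2k, data_2k, total_4k)  # 288 256 576
```

Of the 288 Kbytes in the first organization, 256 Kbytes hold data and the remaining 32 Kbytes hold ECC bits.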
The L2 cache memory allows pipelined read accesses for producing a
"seamless data" stream along with mechanisms to allow data line
read capability and tag array access capability. If a memory
request is made to the L2 cache memory, a read operation occurs.
During a read operation to the L2 cache memory, an address is
driven on the address bus portion of the backside bus. The address
represents a portion of the memory request. The address is latched
into address latch 201. The set number portion of the address from
bits [15:5] in address latch 201 is used to select a set in the
four separate ways. The tag bits for the set in each of the ways
are sent to tag comparison logic 203 with the tag bit portion of
the address from bits [35:16] in address latch 201. Thus, the
memory request is compared with the tag entry at the page location
in each of the ways to determine if the information is in the
cache. In the currently preferred embodiment, since there are four
ways, there are four separate comparisons made. These comparisons
occur in parallel to determine if the information requested is
stored in the cache.
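The address decomposition and four-way comparison described above can be sketched as follows. The field boundaries are the ones given in this paragraph (tag in bits [35:16], set in bits [15:5], critical chunk in bits [4:3]); a 4096-set array would shift the set and tag fields by one bit. The names `L2Address`, `split_address` and `lookup` are illustrative, and the hardware performs the four comparisons in parallel rather than in a loop.

```python
from typing import NamedTuple, Optional, Sequence

class L2Address(NamedTuple):
    tag: int        # A[35:16], 20 bits
    set_index: int  # A[15:5], 11 bits (2K sets)
    chunk: int      # A[4:3], critical chunk within the 32-byte line

def split_address(addr: int) -> L2Address:
    """Split a 36-bit physical address into tag, set and chunk fields."""
    if not 0 <= addr < (1 << 36):
        raise ValueError("address must fit in 36 bits")
    return L2Address(
        tag=(addr >> 16) & 0xFFFFF,
        set_index=(addr >> 5) & 0x7FF,
        chunk=(addr >> 3) & 0x3,
    )

def lookup(tags_in_set: Sequence[int], valid_in_set: Sequence[bool],
           tag: int) -> Optional[int]:
    """Compare the request tag with the tag of each way; return the hit way or None."""
    for way, (stored, valid) in enumerate(zip(tags_in_set, valid_in_set)):
        if valid and stored == tag:
            return way
    return None
```

A return value of `None` corresponds to the miss case, in which the L2 cache memory signals the processor instead of returning data.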
If one of the tags matches the tag from the address, then there is a
hit. If the results of the comparison determine there is a hit, the
tag comparison logic 203 outputs a signal to MUX select logic 207
causing MUX select logic 207 to select the page and corresponding
cache line in data array 206. The cache line data selected is
output to the data output register 208 which latches the data for
output onto the data bus portion 220 of the backside bus. Using the
critical chunk portion of the address from bits [4:3] in address
latch 201, the chunk MUX selection logic 209 selects the desired
chunk of data for output onto data bus 220, which is output to the
processor. In one embodiment, the L2 cache memory transfers a
32-byte cache line in burst fashion in four chunks, with the
critical chunk first. On the other hand, if none of the tags from
tag array 202 match the tag portion of the address of the memory
request, then there is a miss, which is indicated to the processor
using one of the signals 213.
When data is being written into the L2 cache memory due to the L1
cache of the processor writing data back to the L2 cache memory or
due to information from system memory being input into the
processor, a cache line is sent to the L2 cache memory. In one
embodiment, the cache line is 32-bytes. The data is received by
chunk write buffer 205 from data bus 220. Since data bus 220 is
only 8 bytes wide and the cache line is 32 bytes wide, the data is
received into the chunk write buffer one chunk at a time. The write
control signals 210 cause the data from data bus 220 to be directed
to its designated chunk location in chunk write buffer 205. Once
chunk write buffer 205 is filled, write control signals 210 cause
the data to be transferred from chunk write buffer 205 into data
array 206. Thus, chunks of data can be re-ordered when written into
the chunk write buffer and therefore need not be received by the L2
cache memory in the same order as they occur in the cache line.
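The behavior of the chunk write buffer can be modeled in a few lines. This is a sketch under the sizes given above (four 8-byte chunks per 32-byte line); the class and method names are illustrative and do not correspond to hardware signal names.

```python
class ChunkWriteBuffer:
    """Software model of a four-entry chunk write buffer (illustrative)."""
    CHUNKS = 4
    CHUNK_BYTES = 8

    def __init__(self):
        self.slots = [None] * self.CHUNKS

    def write_chunk(self, chunk_no: int, data: bytes) -> None:
        """Latch one 8-byte chunk into its designated location."""
        if not 0 <= chunk_no < self.CHUNKS:
            raise ValueError("chunk number must be 0..3")
        if len(data) != self.CHUNK_BYTES:
            raise ValueError("a chunk is exactly 8 bytes")
        self.slots[chunk_no] = data

    def is_full(self) -> bool:
        return all(slot is not None for slot in self.slots)

    def write_line(self) -> bytes:
        """Return the assembled 32-byte line and empty the buffer."""
        if not self.is_full():
            raise RuntimeError("line is incomplete")
        line = b"".join(self.slots)
        self.slots = [None] * self.CHUNKS
        return line
```

Because each chunk is steered to its own slot, the arrival order does not affect the assembled line.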
FIG. 3 is a block diagram of a memory subsystem of the computer
system of the present invention. Boundary lines are shown to define
the portion of the memory subsystem that is contained within the
processor and that portion which is external to the processor.
Referring to FIG. 3, an L2 cache memory 301 is coupled to a
backside bus 302. Backside bus 302 is also coupled to backside bus
logic (BBL) 303. Backside bus logic 303 is also coupled to external
bus logic 304, instruction fetch unit (IFU) 305 and the data cache
(DCU) 306. The external bus logic 304 is also coupled to the
instruction fetch unit 305, data cache 306 and the system bus 307.
Also shown coupled to system bus 307 is system memory 308.
Instruction fetch unit (IFU) 305 includes instruction fetch logic
and the instruction cache and fetches instructions for execution in
the processor. When the instruction cache of IFU 305 desires to
fetch more instructions, it sends a request on signal 310 to
external bus logic 304. The external bus logic 304 sends the
request on request lines 315 to backside bus logic 303 which
forwards the request to the L2 cache memory 301. In response, the
L2 cache memory 301 performs a read operation. If there is a hit in
L2 cache memory 301, the instructions are returned to instruction
fetch unit 305 on the data return signal path 311 from backside bus
logic 303 and backside bus 302.
The DCU 306 holds temporary copies of data for use by the processor
in executing the instructions. In one embodiment, DCU 306 comprises
an L1 cache memory. In executing instructions, if the data is not
in the DCU 306 (that is, the L1 cache), a request is made on
request signal path 312 to external bus logic 304. External bus
logic 304 sends a request on request lines 315 to backside bus
logic 303 which forwards the request onto L2 cache memory 301. If
the data is in L2 cache memory 301 (if L2 cache memory 301 hits),
then the data is forwarded and returned to DCU 306 on data return
signal path 313. When data is written back from DCU 306 or if there
is no longer room in DCU 306, data is sent out to backside bus 302
via backside bus logic 303 to L2 cache memory 301.
In one embodiment of the present invention, data return signal path
311 and data return signal path 313 comprise a single signal path.
In this embodiment, backside bus logic 303 utilizes the same signal
path to return data to both the IFU 305 and the DCU 306, thereby
reducing the number of signal paths necessary within the processor
chip.
Any time data from L2 cache memory 301 is to be written back to
system memory 308, a request is made to external bus logic 304 via
request lines 316 from backside bus logic 303. The data is written
from the backside bus 302 through backside bus logic 303 to
external bus logic 304 via data path 327. External bus logic 304
controls the data by writing the data on external bus 317 to system
memory 308 via system bus 307. Any time there is a miss to L2 cache
memory 301, external bus logic 304 sends the request to system
memory 308 using external bus 317 and system bus 307. The data
returned is received by external bus logic 304 and sent to backside
bus logic 303 via data path 327 for storage in L2 cache memory 301.
The data is then written to DCU 306 via data return lines 313 and
to L2 cache memory 301 via backside bus 302.
Processor accesses from the instruction fetch unit 305 are always
fetch operations with no intent to modify the data. Similarly, read
operations from the data cache 306 that are generated by the
processor executing a read operation are always data fetch
operations. These requests are forwarded to the system bus 307, if
they miss the L2 cache memory 301, as read operations from the
memory. When the data is returned by the system memory 308, a
signal on the system bus 307 indicates if copies of this data also
exist in other cache memories in a multiple processor system. If
they do, then the data is placed in the L2 cache memory 301 and
data cache memory 306 and marked with a state of Shared (S). On the
other hand, if no other cache memory has a copy of the data as
indicated on the system bus 307, the data can be placed in the L2
cache memory 301 and the data cache memory 306 and marked Exclusive
(E). In one embodiment of the present invention, instructions which
are returned by the system memory 308 are placed in the L2 cache
memory 301 and the instruction cache of IFU 305 and marked with a
state of Shared (S).
Whenever the processor intends to modify data, it issues a request
to the data cache memory 306 to obtain a copy of the data with
complete ownership. If the data is not present in the data cache
memory 306, a similar request is sent to the L2 cache memory 301.
If the request also misses the L2 cache memory 301, it is then
forwarded to the system bus 307 as a data read request with a
command modifier indicating to all other processors in the system
bus 307 that they must relinquish ownership of this cache line of
data and invalidate all copies of the line in their cache memories.
The data is then returned to the requesting processor granting it
exclusive ownership and is placed in the L2 cache memory 301 and
data cache memory 306 in the Modified (M) state. The processor is
then free to update any portion of the cache line it chooses. This
results in a write command being issued to the data cache memory
306 with the modified data and the data cache memory 306 updates
its copy of the cache line to reflect the change. If the processor
chooses to modify the data in that cache line yet again, it can do
so within the data cache memory 306 without issuing further requests
to the L2 cache memory 301 and the system bus 307.
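The state assigned to a line that is filled from system memory, as described for ordinary fetches and for requests with intent to modify, can be summarized as a small decision function. This is a sketch; `State` and `fill_state` are illustrative names, and the instruction-fetch rule reflects the one embodiment mentioned above.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

def fill_state(is_instruction_fetch: bool, intent_to_modify: bool,
               shared_signal: bool) -> State:
    """State given to a line returned by system memory after an L2 miss."""
    if intent_to_modify:
        # Other caches have invalidated their copies; line is owned.
        return State.MODIFIED
    if is_instruction_fetch:
        # In one embodiment, instructions are always marked Shared.
        return State.SHARED
    # Data read: the system bus reports whether other caches hold a copy.
    return State.SHARED if shared_signal else State.EXCLUSIVE
```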
In one embodiment of the present invention, the L2 cache memory 301
and the data cache 306 and IFU cache 305 memories do not rely on
any special inclusion properties. That is, a cache line may be
valid in the L2 cache memory 301 and may not be in either the data
cache memory 306 or the instruction cache of the IFU 305. In an
alternate embodiment, the processor may follow a protocol by which
the state of a cache line in the data cache memory 306 is always
lower than or equal to the state of the same cache line in the L2
cache memory 301, wherein, for example, the hierarchy for a MESI
based cache coherency is Modified, Exclusive, Shared, and then
Invalid (listed from highest priority to lowest). This is referred
to herein as a "dirty" or "modified" inclusion. Implementing such a
priority, the processor would be able to determine that if the
cache line in the data cache memory 306 is shown to be Modified,
then the processor knows that the state of the same cache line in
the L2 cache memory 301 was changed to Modified upon the change in
state in the data cache memory 306 and that the L2 cache memory 301
still reflects such a state for the cache line. In another
embodiment, if a subsequent change to the same cache line is to
occur, the processor could request the state of the cache line as
currently indicated by the L2 cache memory 301. Upon seeing that
the state is Modified, the processor knows that the data in the
data cache memory 306 may be changed. Another mechanism by which
the processor is able to identify whether a change in state needs
to be made to a cache line in the L2 cache memory 301 upon the
occurrence of a subsequent write to the same cache line in the data
cache memory 306 is to include a "more modified" indication (e.g.,
a bit location) for the cache line in the data cache memory 306
that indicates when the data in the data cache memory 306 has been
subsequently modified. In such an implementation, when such "more
modified" data is to be written back to the external bus (main
memory), then the processor is able to determine which data should
be written, specifically whether the data in the L2 cache memory
301 suffices or only the data in the data cache memory 306 will
do.
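The "dirty" inclusion rule and the "more modified" indication can be expressed compactly. The sketch below assumes the state ordering given above (Modified highest, Invalid lowest); the function names and the string return values are illustrative.

```python
# Rank of each MESI state, highest priority first per the ordering above.
RANK = {"I": 0, "S": 1, "E": 2, "M": 3}

def obeys_dirty_inclusion(l1_state: str, l2_state: str) -> bool:
    """True if the data cache state is lower than or equal to the L2 state."""
    return RANK[l1_state] <= RANK[l2_state]

def writeback_source(more_modified: bool) -> str:
    """Choose which copy of a Modified line must be written to main memory.

    If the data cache copy changed after the L2 copy was last updated,
    only the data cache copy will do; otherwise the L2 copy suffices.
    """
    return "data cache" if more_modified else "L2 cache"
```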
Several variations of this scenario can exist regarding
modifications to data in the cache memories. For example, the data
the processor intends to modify could be in the data cache memory
306 or the L2 cache memory 301, or both, because of a previous data
fetch transaction. However, the data is marked Shared (S)
indicating that although the cache memory has the latest copy of
the data, other cache memories in the system also have copies of
it. Under such a circumstance, the data cache memory 306 or the L2
cache memory 301 must make a request on the system bus 307
requesting ownership of the line but not necessarily perform a data
fetch operation. This request causes all other cache memories in
the system to mark their copies of the cache line as Invalid (I),
granting exclusive ownership of the cache line to the requesting
processor. This operation needs to be performed in order to ensure
that no more than one processor at a time attempts to modify the
data in a cache line.
If two processors were to attempt to do so, then the request for
ownership for the cache line from one of them will reach the system
bus 307 before the other and it will get the opportunity to make
its modifications to the cached data first. The integration of the
control for the L2 cache memory 301 into the processor permits
considerable optimization of the control logic of these complex
operations. Moreover, the interaction between sequential
transactions targeted at the same cache line can now be resolved
optimally without holding them up and waiting for the entire
previous transaction to complete. To this end, multiple operations
may be merged into a single transaction by the processor. This has a
dramatic effect upon the performance of multiple processor systems
that are sharing the same data out of the system memory 308.
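The system bus request needed before a write, as a function of the states already held in the data cache memory and the L2 cache memory, can be sketched as follows. The Shared and Invalid cases follow the description above; the assumption that an Exclusive or Modified copy needs no bus request is standard MESI behavior rather than something stated in this passage. The function name and return strings are illustrative.

```python
RANK = {"I": 0, "S": 1, "E": 2, "M": 3}

def bus_action_for_write(l1_state: str, l2_state: str) -> str:
    """Return the system bus request required before modifying a line."""
    best = max(l1_state, l2_state, key=RANK.__getitem__)
    if best in ("M", "E"):
        # Already owned (standard MESI assumption): no bus request.
        return "none"
    if best == "S":
        # Latest data is present; request ownership without a data fetch.
        return "invalidate_only"
    # Missed both caches: read the line and invalidate other copies.
    return "read_and_invalidate"
```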
By integrating the control for the L2 cache memory 301 into the
processor, the present invention is able to better coordinate the
activities and transactions that occur in the computer system. The
single control of the processor is functionally positioned with
respect to the data cache memory 306, the L2 cache memory 301 and
the external bus to allow the controller to obtain information and
respond, if necessary, to any transaction on the three ports (data
cache memory 306, the L2 cache memory 301 and the external bus)
that is currently in progress. The controller of the present
invention can merge transactions sourced from one or more of the
data cache memory 306, the L2 cache memory 301 and the external bus
into a single transaction directed to the processor core or a
port.
For example, if the processor core desires to modify a cache line
in the data cache memory 306 which is marked as Exclusive (E) in
the data cache memory 306 and the L2 cache memory 301, when the
data is modified in the data cache memory 306, the L2 cache memory
301 must be set to Modified (M) (if following the protocol
discussed above). Then, if there is no Modified state written into
the L2 cache memory 301, a later modification to different data in
the same cache line occurs, and the cache line is to be evicted
from the data cache memory 306, then the state must be changed to
Modified (M) in the L2 cache memory 301 and the L2 cache memory 301
must be updated with the modified data. The controller of the
present invention is able to merge the operations into a single
operation. In a similar fashion, the controller of the present
invention is able to recognize conflicts between transactions with
the processor core and the L2 cache memory 301 and merge the
transactions, such that multiple transactions, as a whole, are
optimized into fewer operations, when possible.
Information is transferred from backside bus logic 303 to L2 cache
memory 301 in chunks. The size of each chunk is the size of the
data bus of backside bus 302. In one embodiment of the present
invention, a cache line is 32 bytes and the width of the data bus
of backside bus 302 is 8 bytes. Thus, transferring information from
backside bus logic 303 to L2 cache memory 301 comprises
transferring four chunks over backside bus 302.
Backside bus logic 303 operates to optimize the bandwidth of
backside bus 302. By optimizing the bandwidth of backside bus 302,
backside bus logic 303 reduces IFU 305 idle time because IFU 305 is
not required to wait for instructions to be returned from L2 cache
memory 301.
Backside bus logic 303 optimizes the bandwidth of backside bus 302
by giving read requests priority over write requests. A read
request may be received from either IFU 305 requesting an
instruction, or DCU 306 requesting data. A write request is
received from DCU 306 either requesting to write data back to
system memory or requesting to write data to L2 cache memory
301.
Read requests are immediately forwarded to L2 cache memory 301. If
there is a hit in L2 cache memory 301, the information, either
instructions or data, is returned to backside bus logic 303 via
backside bus 302. In one embodiment of the present invention, this
information is burst over backside bus 302. Thus, the entire cache
line is transferred over backside bus 302 in four consecutive clock
cycles. Backside bus logic 303 then returns the information to the
requesting agent, IFU 305 or DCU 306.
Backside bus logic 303 does not, however, immediately burst a cache
line for a write request over backside bus 302 to L2 cache memory
301. Rather, a write request to L2 cache memory 301 is buffered in
backside bus logic 303 and is transferred to L2 cache memory 301 in
chunks when backside bus 302 is available. Chunks received by L2
cache memory 301 are stored in the chunk write buffer. Backside bus
302 is available when instructions or data from a read request are
not being returned to backside bus logic 303 over backside bus 302
from L2 cache memory 301.
Backside bus 302 is frequently available for a period of clock
cycles subsequent to a read request. This is because although IFU
305 retrieves instructions from L2 cache memory 301 in 32-byte
lines, IFU 305 typically cannot execute those instructions in a
single clock cycle. Thus, one or two clock cycles are available
immediately subsequent to the read request in which IFU 305 is not
performing another read request, thereby leaving backside bus 302
unused. These one or two clock cycles are long enough to transfer
one or two chunks to L2 cache memory 301; however, this period is
typically not of sufficient duration to allow all four chunks of a
write request to be transferred to L2 cache memory 301. It should
be noted that under certain circumstances, four or more consecutive
clock cycles may exist in which IFU 305 is not performing a request
which requires the use of backside bus 302. In these situations,
backside bus logic 303 transfers all four chunks to L2 cache memory
301 in four consecutive clock cycles.
In addition, many of the write requests serviced by backside bus
logic 303 are a result of read requests which missed L2 cache
memory 301. For these requests, backside bus logic 303 waits for
external bus logic 304 to fetch the instructions from system memory
308 via processor system bus 307. Processor system bus 307,
however, typically operates at a speed which is a fraction of that
at which IFU 305, backside bus logic 303, and backside bus 302 all
operate. Thus, if backside bus logic 303 were to stall IFU 305
while waiting for the entire 32-byte cache line to be fetched from
system memory 308, the operation of IFU 305 would be significantly
delayed. For example, if processor system bus 307 is operating at
one-fourth the speed of IFU 305, then the fetch operation to system
memory 308 would result in a delay of 16 clock cycles for IFU 305
because external bus logic 304 would be retrieving a single 8-byte
chunk from system memory 308 every four clock cycles.
Backside bus logic 303 prevents such a long delay by writing single
chunks to L2 cache memory 301 as the chunks and backside bus 302
become available. Thus, IFU 305 can continue to issue requests to
L2 cache memory 301 via backside bus logic 303 and backside bus 302
because backside bus logic 303 is not waiting for an entire cache
line. Thus, in the example above, backside bus logic 303 transfers
a single chunk to L2 cache memory 301 when it receives the chunk
from external bus logic 304, assuming backside bus 302 is
available, thereby utilizing backside bus 302 once every four clock
cycles rather than utilizing it for the entire 16 clock cycle
period.
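The figures in the example above follow from the chunk count and the clock ratio. A minimal sketch of the arithmetic, with illustrative function names:

```python
CHUNKS_PER_LINE = 4  # 32-byte line over an 8-byte bus

def line_fetch_cycles(clock_ratio: int, chunks: int = CHUNKS_PER_LINE) -> int:
    """Processor clock cycles needed to fetch a whole line from system
    memory when the system bus delivers one chunk every `clock_ratio` cycles."""
    return chunks * clock_ratio

def backside_bus_utilization(clock_ratio: int,
                             chunks: int = CHUNKS_PER_LINE) -> float:
    """Fraction of those cycles in which the backside bus carries a chunk
    when chunks are forwarded individually as they arrive."""
    return chunks / line_fetch_cycles(clock_ratio, chunks)

print(line_fetch_cycles(4))         # 16
print(backside_bus_utilization(4))  # 0.25
```

At a clock ratio of four, the backside bus is occupied by the fill in one cycle out of every four, leaving the remaining cycles free for read requests.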
In one embodiment of the present invention, numerous
micro-operations may be utilized to control the L2 cache memory
behavior. In one implementation, these micro-operations include
control beyond reading and writing an entire cache line. In this
implementation, backside bus logic 303 gives priority to any read
operations from IFU 305, where these read operations involve the
transferring of data or commands from L2 cache memory 301 to
backside bus logic 303. Thus, any write operations (that is, those
operations requiring data or commands to be transferred to L2 cache
memory 301 from backside bus logic 303) are not performed until
backside bus 302 is available.
Thus, backside bus logic 303 services read requests immediately and
services write requests when backside bus 302 is available.
Therefore, IFU 305 idle time is reduced because requests whose
instructions or data are required for the operation of IFU 305 are
serviced immediately, while requests whose data has already been
utilized by IFU 305 are not necessarily serviced immediately.
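The read-priority policy summarized above can be illustrated with a simple cycle-by-cycle model. This is a sketch only: the set of cycles occupied by read data is supplied by the caller, and the name `schedule` and the timeline labels are assumptions made for illustration.

```python
from collections import deque

def schedule(read_data_cycles, pending_write_chunks, total_cycles):
    """Return a per-cycle timeline of backside bus use.

    read_data_cycles: set of cycle numbers in which read data occupies the bus.
    pending_write_chunks: chunk numbers buffered in the backside bus logic.
    Reads always win; one buffered write chunk is sent in each idle cycle.
    """
    writes = deque(pending_write_chunks)
    timeline = []
    for cycle in range(total_cycles):
        if cycle in read_data_cycles:
            timeline.append("R")
        elif writes:
            timeline.append(f"W{writes.popleft()}")
        else:
            timeline.append("-")
    return timeline

# Two four-cycle read bursts separated by a two-cycle gap.
print(schedule({0, 1, 2, 3, 6, 7, 8, 9}, [0, 1, 2, 3], 12))
```

In this example the four write chunks are split across two gaps of two cycles each, so neither read burst is delayed.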
Controlling the L2 Cache Memory
The processor of the present invention controls the functionality
of the L2 cache memory, such that the L2 cache memory is a slave on
the backside bus. That is, the control of the L2 cache memory is
partitioned between the processor and the L2 cache memory in such a
way that the processor of the present invention controls and knows
what the L2 cache memory is doing (e.g., whether the L2 cache
memory is transferring data). In the present invention, processor
101 controls the L2 cache memory using opcodes sent by the
processor to the L2 cache memory.
The processor controls the L2 cache memory behavior through
micro-operations which provide the functionality needed for all
processor micro-operations requiring the L2 cache memory. In one
embodiment, these micro-operations are used to perform all write
operations to the L2 cache memory. Also, the processor uses these
micro-operations to maintain the cache consistency. These
micro-operations function within the confines of the interface
provided between the processor and the L2 cache memory in order to
implement single die control in a multi-die package. A brief
description of each micro-operation performed by the L2 cache
memory is given below.
L2 Read Inquire: The processor uses this micro-operation to read an
entire line from L2 cache memory if the line is resident in the L2
cache memory. The processor sends a complete address to the L2
cache memory. The L2 cache memory examines the specified set number
to see if it has the specified tag. The L2 cache memory signals a
hit to the processor if a tag match is found and the status of the
line is modified, exclusive or shared. On a hit, the L2 cache
memory updates the appropriate LRU field and returns the way
information, the value of the status, and the data line in a burst
fashion over four cycles to the processor. If the tag did not
match, the L2 cache memory returns the way information, status and
full line address (tag and set) of the cache line to be replaced.
If a miss occurred, the LRU field is not updated, but will be
updated when the replacement line is written using the Write Line
Update micro-operation.
State Write: The processor may change the status of any given line
in the L2 cache memory with this micro-operation. The processor
indicates the set address, the way, and the status value to be
stored. No tag matching is done on the address and the L2 cache
memory LRU field is not modified.
State Read: The processor reads the status of any given line in the
L2 cache memory with this micro-operation. The processor indicates
the set address and the way to be read. The L2 cache memory returns
the status of the specified line. No tag matching is done on the
address and the L2 cache memory LRU field is not modified. This
operation may be used for testing purposes.
Tag Inquire: This operation prompts the L2 cache memory to
determine if it contains a given line. The L2 cache memory examines
the specified set number to determine if it has the specified tag.
The L2 cache memory then signals a hit to the processor if a tag
match is found and the status of the line is something other than
invalid (an illegal status encoding will also signal a hit, but will
also cause a BERR). The L2 cache memory data array is not accessed
and the L2 cache memory LRU field is not modified. This operation
may be used in response to a request from a device on the external
bus made to determine if data that is desired is currently stored
in the cache memory.
Tag Inquire Update: This micro-operation is identical to Tag
Inquire, except that the LRU field is updated upon a match.
Tag Array Read: This micro-operation is used to read an entry in
the tag array without accessing the data array. The processor
provides the set address and the way information. The L2 cache
memory returns the tag status and the full address. The address is
returned over the data lines using the natural alignment of the
tag. This scheme avoids bi-directionality on the address lines. The
L2 cache memory LRU field is not modified.
Data Array Read: This micro-operation is used to read an entry in
the data array without accessing the tag array. The processor
provides the set address and the way information. The data line is
sent back in a burst fashion over a four cycle burst. The L2 cache
memory LRU field is not modified. This operation may be used to
test the array.
Write Chunk: The processor uses this micro-operation to load a data
chunk into the Chunk Write Buffer on the L2 cache memory chip by
specifying the chunk number, driving a data word onto the data bus,
and signaling the L2 cache memory to write the chunk. This
micro-operation can occur independently of any other L2 cache
memory operation not involving the data bus. It is the
responsibility of the processor to prevent data bus collisions when
issuing Write Chunk micro-operations. No data array or tag array
accesses are made during this operation and the LRU field is not
modified. This operation is used to write a single chunk into the
L2 Chunk Write Buffer, which, at a later time, transfers the
contents of the Chunk Write Buffer to the L2 cache memory.
Write Line: This micro-operation is used as the final step in
writing an entire line into the L2 cache memory. The line to be
written is first stored in the L2 cache memory Chunk Write Buffer
using the Write Chunk micro-operation described above. The
processor then uses the Write Line micro-operation along with
specifying the tag and set address, the way information, and the
status of the line to be written. The entire line contained in the
Chunk Write Buffer is then written into the L2 cache memory data
array. The tag and the status are written into the tag array and
status array, respectively. The Write Line micro-operation may be
issued concurrently with the last of the four Write Chunk commands.
The LRU field is not modified with this micro-operation.
Write Line Update: This micro-operation is identical to the Write
Line with the addition that the LRU field is updated.
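The micro-operations listed above, and the sequence used to write a full line, can be captured as follows. This is a sketch: the enumeration values are arbitrary placeholders and are not the OP[3:0] encodings, which are not given here; the names are illustrative.

```python
from enum import Enum, auto

class MicroOp(Enum):
    # Values are arbitrary; they are NOT the hardware opcode encodings.
    L2_READ_INQUIRE = auto()
    STATE_WRITE = auto()
    STATE_READ = auto()
    TAG_INQUIRE = auto()
    TAG_INQUIRE_UPDATE = auto()
    TAG_ARRAY_READ = auto()
    DATA_ARRAY_READ = auto()
    WRITE_CHUNK = auto()
    WRITE_LINE = auto()
    WRITE_LINE_UPDATE = auto()

# Micro-operations that can modify the LRU field, per the descriptions above
# (Read Inquire only on a hit, Tag Inquire Update only on a match).
MAY_UPDATE_LRU = frozenset({
    MicroOp.L2_READ_INQUIRE,
    MicroOp.TAG_INQUIRE_UPDATE,
    MicroOp.WRITE_LINE_UPDATE,
})

def line_write_sequence(update_lru: bool) -> list:
    """Four Write Chunk operations followed by the committing Write Line."""
    final = MicroOp.WRITE_LINE_UPDATE if update_lru else MicroOp.WRITE_LINE
    return [MicroOp.WRITE_CHUNK] * 4 + [final]
```

The four Write Chunk operations need not be issued back to back, and the final Write Line may be issued concurrently with the last of them.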
The processor sends these micro-operations to the control logic of
the L2 cache memory which decodes the micro-operations to control
the sequence of outputs and to provide data integrity. The decoding
performed produces signals that initiate access cycles. The control
block of the L2 cache memory also performs output control. The
control logic also includes logic for performing data integrity
operations.
The micro-operations operate on the L2 cache memory using the
interface provided by the backside bus. One embodiment of the
backside bus provides the following L2 cache memory external
interface:
______________________________________
NAME | CLASS | I/O | DESCRIPTION
______________________________________
A[35:3] | Processor Addr | Input | Physical Address presented by the processor to the L2 cache memory. The Address field is only valid when ADS is asserted. For several of the L2 cache memory operations, only subfields of the address field are examined.
D[63:0] | Processor Data | IOput | Data is transferred to and from the L2 cache memory over this bidirectional bus. A dead clock is needed to prevent bus contention during direction switching. During certain operations, this bus can contain an address field or control register data, driven by the L2 cache memory.
ADS | Processor Cntl | Input | Upon assertion, the AdDress Strobe indicates to the L2 cache memory that a valid micro-operation is being requested by the processor.
CLK | Processor Cntl | Input | Full frequency Clock sent to the L2 cache memory from the processor.
OP[3:0] | Processor Cntl | Input | The L2 cache memory OPeration is encoded in this field. OP is used by L2 cache memory only when ADS is asserted.
SW[3:0] | Processor Cntl | Input | The Status Write bits are supplied by the processor on a status update operation. The processor controls the MESI status bits of each tag in the directory array.
RESET | Processor Cntl | Input | Buffered System RESET pin.
WC[1:0] | Processor Cntl | Input | The Write Chunk field encodes into which write chunk buffer location the chunk data should be latched. It is used by the L2 cache memory only when WCVAL is asserted, and is otherwise undefined.
WCVAL | Processor Cntl | Input | The Write Chunk VALid indicates to the L2 cache memory that a data chunk is to be stored in its chunk write buffer. The chunk number is specified by WC.
WO[1:0] | Processor Cntl | Input | The 2-bit Way Out field is an extension of the address field in that it specifies which way is being referenced by the processor in the given operation.
HIT | Processor Cntl | Output | HIT. Indicates status information on a line read or tag inquire.
WR[1:0] | Processor Cntl | Output | The Way Read field provides the processor the way number of the line that is referenced by the current micro-operation.
SR[3:0] | Processor Cntl | Output | The Status Read bits of a given line in the L2 cache memory are driven by the L2 cache memory.
AP[1:0] | Processor Integrity | Input | Two bits of Address Parity protect each entry in the L2 cache memory tag array. The address is examined for parity correctness on transmission from processor to L2 cache memory. The transmitted parity bits are stored in the tag array. The address parity is interleaved over the field, with AP[1] protecting the odd bits of A, and AP[0] protecting the even bits of A. Parity is odd.
DECC[7:0] | Processor Integrity | IOput | The Data Error Checking and Correcting bits are supplied with each data transfer, regardless of direction, and have the same timing as the D bits. These pins are driven with a 1 when the L2 cache memory is driving an address back to the processor.
OUTP | Processor Integrity | Input | OUTput Parity, sent to the L2 cache memory from the processor, calculated over Way Out, and Opcode field.
WCPAR | Processor Integrity | Input | Write Chunk PARity, sent to L2 cache memory from the processor, calculated over the chunk number specified by WC.
BERR | Processor Integrity | Output | The L2 cache memory detected a Storage Error in the tag or status array, and asserts this signal to the processor. If a transmission error from the processor to the L2 cache memory is detected, the BERR signal is asserted to indicate Transmission Error.
INP | Processor Integrity | Output | INput Parity, sent to the processor from the L2 cache memory, computed over the Hit/Miss, Way Read, and Storage/Transmission Error signals.
______________________________________
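As an illustration of the interleaved address parity described for AP[1:0], the sketch below computes the two parity bits for an address on A[35:3]. It assumes that "odd bits" and "even bits" refer to odd- and even-numbered bit positions, and that odd parity means each protected group together with its parity bit contains an odd number of ones; both readings are assumptions, as is the function name.

```python
def address_parity(addr: int) -> tuple:
    """Return (AP1, AP0) for the address bits A[35:3].

    AP1 covers the odd-numbered positions, AP0 the even-numbered ones.
    Each parity bit is chosen so that the covered bits plus the parity
    bit contain an odd number of ones (odd parity, assumed convention).
    """
    odd_ones = sum((addr >> i) & 1 for i in range(3, 36) if i % 2 == 1)
    even_ones = sum((addr >> i) & 1 for i in range(3, 36) if i % 2 == 0)
    ap1 = 1 - (odd_ones % 2)
    ap0 = 1 - (even_ones % 2)
    return ap1, ap0
```

Splitting the parity over two interleaved groups means that an error affecting two adjacent address bits flips both parity checks instead of cancelling out in a single one.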
Through the use of the above signals, the present invention allows
access to the L2 cache memory by the processor. However, the L2
cache memory is able to complete the access in any manner. The
processor does not need to know how the access is being completed.
Because of this partitioning, any L2 cache memory may be designed
for use in the computer system of the present invention. Other
signals for bond control and testing, as well as for testing the
processor, have been omitted to avoid obscuring the present
invention.
Note that the processor die includes the control logic for
generating the micro-operations. Using the interface provided
between the L2 cache memory die and the processor die, the
processor provides a unilateral control path between the two dice,
with only the bi-directional data bus as an exception. In this
manner, the partitioning between the two dice allows one of the
dice to include the necessary functionality to control the other
die(s).
The sequence of operations sent to the L2 cache is controlled by
the processor. This control logic can make optimal use of the L2,
as it has access to the L1 cache, the processor requests and the
system bus requests. The logic is designed to provide the highest
priority to all processor accesses. These accesses use the address
bus and the arrays for a short period of time but then use the data
bus to read out the data chunks over four cycles. The control logic
can recognize this condition and issue tag inquire instructions,
which do not require the data bus, to operate in parallel.
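The overlap described above can be sketched as a small in-order issue model in Python. This is only an illustration under stated assumptions: the one-cycle address phase, the operation names, and the `schedule` function are hypothetical, and the sketch does not model the priority rules. Only the four-cycle data phase comes from the text.

```python
DATA_CYCLES = 4  # from the text: data chunks are read out over four cycles
ADDR_CYCLES = 1  # assumption: address bus and arrays are busy for one cycle


def schedule(ops):
    """Issue operations in order and return a list of (op, issue_cycle).

    'read' needs the address bus and then the data bus;
    'tag_inquire' needs only the address bus and the tag array.
    """
    addr_free = 0  # first cycle at which the address bus is free
    data_free = 0  # first cycle at which the data bus is free
    issued = []
    for op in ops:
        start = addr_free
        if op == "read":
            # The data phase follows the address phase directly, so a
            # read waits until the data bus will be free at that time.
            start = max(start, data_free - ADDR_CYCLES)
            data_free = start + ADDR_CYCLES + DATA_CYCLES
        addr_free = start + ADDR_CYCLES
        issued.append((op, start))
    return issued
```

In this model, two tag inquiries placed between two reads issue while the first read still occupies the data bus, so they do not delay the second read.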
In the present invention, the processor performs an L1 cache memory
look-up before sending a memory request out to the L2 cache memory.
In another embodiment, the L1 and L2 cache memory look-ups are sent
at the same time. In this case, the information may be obtained one
or two cycles earlier than in the prior art.
With a single point of control in the processor, the complexity of
the memory subsystem is reduced, because only a simple interface is
needed between the processor and the L2 cache memory. This
interface includes only an address bus, a data bus and some status
signals sent from the processor to the L2 cache memory for
providing information to the L2 cache memory in one direction. Only
the data bus is bi-directional.
Due to the partitioning in the present invention, multiple types of
cache memory organizations may be utilized with the present
invention. The cache memory need only be able to satisfy the memory
request from the processor according to its organization. This
allows the L2 cache memory to be upgraded to a different
organization in the future without impacting the particular site in
the circuit board. The only requirement is that the signals from
the IC package remain the same. In the present invention, the
directory and data of the L2 cache memory operate independently.
Because of this, each can be upgraded independently. It is possible
for the data array to be transferring data over the data bus while
the control is receiving a command to look up the directory to
perform a tag match.
In order to accommodate multiple requests in a single cycle, a
duplicate of the directory could be utilized. In this case, the
processor can examine the directory as well as another item on the
bus. In another embodiment, a handshake signal may be used to
indicate that the cache memory is not ready, and at a proper time,
a signal could be used to indicate that the cache memory is
ready.
By interleaving a data transfer cycle with a directory look-up, the
look-up can be totally hidden under the data transfer, such that it
can be done inexpensively.
The processor and the L2 cache memory in the present invention may
be integrated in a computer system such as that in FIG. 4.
Referring first to FIG. 4, an overview of a computer system of the
present invention is shown in block diagram form. It will be
understood that while FIG. 4 is useful for providing an overall
description of the computer system of the present invention, a
number of details of the system are not shown. As necessary for
disclosure of the present invention, further detail is set forth
with reference to the other figures provided with this
specification. Further, the present invention is described with
reference to its preferred embodiment.
As illustrated in FIG. 4, a computer system, as may be utilized by
the present invention, generally comprises a processor-system bus
or other communication means 401 for communicating information and
a processor 402 coupled with processor-system bus 401 for
processing information. In the present invention, processor-system
bus 401 includes address, data and control buses. In the currently
preferred embodiment, processor 402 includes an internal cache
memory, commonly referred to as a level one (L1) cache memory for
temporarily storing data and instructions on-chip. A level two (L2)
cache memory 404 is coupled to processor 402 for temporarily
storing data and instructions for use by processor 402. In the
currently preferred embodiment, cache memory 404 is included in the
same chip package as processor 402.
Also coupled to processor-system bus 401 is processor 403 for
processing information in conjunction with processor 402. Processor
403 may comprise a parallel processor, such as a processor similar
to or the same as processor 402, or may comprise a co-processor,
such as a digital signal processor. A level three (L3) cache memory
411 for temporarily storing data and instructions for use by other
devices in the computer system (e.g., processor 402, processor 403,
etc.) and a L3 cache controller 410 for controlling access to L3
cache memory 411 may also be coupled to processor-system bus 401.
The L3 cache controller 410 is also coupled to memory-system bus
415.
A memory-system bus or other communication means 415 for
communicating information is coupled to processor 402 for providing
processor 402 and other devices in the computer system access to
the memory and input/output (I/O) subsystems. A memory controller
422 is coupled with memory-system bus 415 for controlling access to
a random access memory (RAM) or other dynamic storage device 421
(commonly referred to as a main memory) for storing information and
instructions for processor 402 (and processor 403). A mass data
storage device 425, such as a magnetic disk and disk drive, for
storing information and instructions, and a display device 423,
such as a cathode ray tube (CRT), liquid crystal display (LCD),
etc., for displaying information to the computer user are coupled
to memory-system bus 415.
An input/output (I/O) bridge 424 is coupled to memory-system bus
415 and I/O bus 431 to provide a communication path or gateway for
devices on either memory-system bus 415 or I/O bus 431 to access or
transfer data between devices on the other bus. Specifically,
bridge 424 turns the byte/word/dword data transfer traffic from I/O
bus 431 into line size traffic on memory-system bus 415.
I/O bus 431 communicates information between devices in the
computer system. Devices that may be coupled to I/O bus 431
include a display device 432, such as a cathode ray tube, liquid
crystal display, etc., an alphanumeric input device 433 including
alphanumeric and other keys, etc., for communicating information
and command selections to other devices in the computer system
(e.g., processor 402) and a cursor control device 434 for
controlling cursor movement. Moreover, a hard copy device 435, such
as a plotter or printer, for providing a visual representation of
the computer images and a mass storage device 436, such as a
magnetic disk and disk drive, for storing information and
instructions may also be coupled to I/O bus 431.
Of course, certain implementations and uses of the present
invention may neither require nor include all of the above
components. For example, in certain implementations, the L3 cache
controller and L3 cache memory may not be required. In such
implementations, processors 402 and 403 will reside directly on
memory-system bus 415. In other implementations, it may not be required to
provide a display device for displaying information. Certain
implementations of the present invention may include other
components.
In one embodiment, the L2 cache memory and the processor of the
present invention are contained within the same integrated circuit
package. FIG. 5A illustrates a section view of one embodiment of
the dual die package of the present invention. Referring to FIG.
5A, package 500 includes chip carrier 501 for encapsulating and
protecting multiple integrated circuit (IC) dice, leads or pins 502
for electrically coupling dice in package 500 to the circuit board
(not shown), wirebond 505, processor die 503 and L2 cache memory
die 504. The coupling between the circuit board and package 500 has
not been shown to avoid obscuring the present invention. Die 503 is
also coupled to leads 502, such that electrical signals may be
transferred from die 503 (and die 504) to off-chip (i.e., to the
printed wiring board). In one embodiment, die 503 is wire bonded to
leads 502 of package 500 in a manner well-known to those skilled in
the art. Although only six leads are shown, the actual number of
leads for the package is generally much larger, numbering in the
hundreds.
The leads 502 and the chip carrier 501 are comprised of materials
standard in the art. For instance, leads 502 may comprise copper
leads. Similarly, chip carrier 501 may be a ceramic carrier, a
plastic carrier or other material suitable for encasing the
multiple dice of package 500.
In one embodiment, processor die 503 and L2 cache memory die 504
are integrated circuits fabricated using well-known metal-oxide
semiconductor (MOS) techniques. Although only die 503 and die 504
are shown, other embodiments may include more than two dice
integrated into the same package. Processor die 503 and cache
memory die 504 are electrically coupled to each other by wirebond
505. Moreover, die 503 and die 504 are coupled to chip carrier 501
using, for example, an adhesive, such as epoxy.
Die 503 and die 504 may be electrically coupled to each other in
numerous other ways. In one embodiment, L2 cache memory die 504 is
coupled directly on top of die 503 using solder bumps (e.g., solder
bumps 506) in a flip-chip type configuration, such as shown in FIG.
5B. Processor die 503 and L2 cache memory die 504 may be coupled to
a trace 507 using wirebonds 509, such as shown in FIG. 5C. In this
case, trace 507 acts as part of the electrical connection between
the dice. In yet another embodiment, die 503 and die 504 may be
solder bumped to a common silicon substrate 511 using solder bumps
510 as shown in FIG. 5D. In this case, silicon substrate 511
provides a communication path between the dice.
FIG. 6 is a flowchart showing the steps followed by one embodiment
of the present invention in transferring a write request to an L2
cache memory. The write request is first received and buffered in
the backside bus logic, step 610. The backside bus logic then
checks whether a read request is currently utilizing the backside
bus, step 620. The read request could be utilizing the backside bus
by, for example, backside bus logic transferring the request to the
L2 cache memory or by the L2 cache memory transferring the data
corresponding to the read request to the backside bus logic. If a
read request is currently utilizing the backside bus, then the
backside bus logic allows the read request to continue, step
630.
However, if a read request is not utilizing the backside bus, then
a chunk of the data for the write request is transferred to the
chunk write buffer in the L2 cache memory, step 640. The L2 cache
memory then checks whether additional chunks remain to be
transferred, step 650. In one embodiment, the L2 cache memory makes
this determination based on receipt of a signal from the backside
bus logic which indicates all chunks of the write request have been
transferred. If no additional chunks remain to be transferred, then
the L2 cache memory stores the contents of the chunk write buffer
into the appropriate L2 cache line, step 660. Otherwise, the
backside bus logic continues to transfer additional chunks of the
write request to the chunk write buffer until the entire write
request is transferred (steps 620 through 650).
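The flow of FIG. 6 can be sketched in Python as a cycle-by-cycle model. This is a minimal sketch under stated assumptions: the class and function names, the four-chunk line, and the per-cycle `read_busy` list are hypothetical. The call to `commit` stands in for the signal from the backside bus logic indicating that all chunks have been transferred.

```python
class L2Cache:
    """Toy model of an L2 cache with a chunk write buffer."""

    def __init__(self, chunks_per_line=4):  # four chunks is an assumption
        self.chunks_per_line = chunks_per_line
        self.chunk_write_buffer = [None] * chunks_per_line
        self.data_array = {}

    def write_chunk(self, index, chunk):
        # Step 640: hold the chunk temporarily in the chunk write buffer.
        self.chunk_write_buffer[index] = chunk

    def commit(self, line_address):
        # Step 660: move the buffered line into the data array.
        self.data_array[line_address] = list(self.chunk_write_buffer)
        self.chunk_write_buffer = [None] * self.chunks_per_line


def transfer_write(l2, line_address, chunks, read_busy):
    """Transfer one write request, yielding the bus to reads.

    read_busy holds one bool per cycle, True when a read request is
    using the backside bus. Returns the number of cycles consumed, or
    None if the write did not complete within the given cycles.
    """
    next_chunk = 0
    cycles = 0
    for busy in read_busy:
        cycles += 1
        if busy:
            continue  # Steps 620 and 630: the read request continues.
        l2.write_chunk(next_chunk, chunks[next_chunk])  # Step 640
        next_chunk += 1
        if next_chunk == len(chunks):  # Step 650: no chunks remain.
            l2.commit(line_address)    # Step 660
            return cycles
    return None
```

For example, with reads occupying cycles 1, 4 and 5, a four-chunk write completes in cycle 7, and the line reaches the data array only at that point.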
It should be noted that additional requests may be received by the
backside bus logic from the instruction fetch unit or data cache
unit between chunk transfers. Thus, subsequent read requests may
prevent transfer of the current write request to the L2 cache
memory, and subsequent write requests may be buffered in the
backside bus logic until transfer of the current write request is
complete.
The present invention provides for a more reliable connection
because the distance the signals travel between the two dice is
reduced. This distance is substantially shorter than the distance
of traveling over the printed wiring board and the pins. Moreover,
the fact that both dice are contained in one package prevents
signals from having to cross the packaging material (e.g., the
ceramic), travel through the printed circuit board and then
continue into another chip. Therefore, the present invention
provides excellent electrical characteristics for high
frequencies.
The present invention also reduces the total number of pins
required for the microprocessor and the cache memory. In the prior
art, a certain number of pins are used to transfer information
between the cache memory and the microprocessor. This transfer
occurs over the printed wiring board. However, because the cache
memory is contained in the same package, the pins used to transfer
data between the cache memory and the microprocessor are no longer
required. Thus, the overall number of pins is reduced.
The present invention allows multiple dice developed using
different technologies to be integrated into the same package. Each
of the dice may be developed and optimized independently. In other
words, a processor may be designed and manufactured based on
technology that optimizes the processor while a cache memory may be
designed and manufactured based on technology optimized for the
cache memory (e.g., SRAM technology). Therefore, even though the
two have different design complexities, both may be integrated into
the same package.
The present invention also provides a single chip replacement
upgrade capability when a new processor and cache memory pair is to
be placed in the system. In this case, all that is required to
obtain an upgrade is the extraction of one chip from the printed
wiring board. Moreover, the cache memory can be upgraded
independently because of the partitioning. Normally, the printed
wiring board is designed to include a cache memory having a
predetermined number of pins and particular electrical
characteristics. Therefore, to change the cache memory requires a
change in the printed wiring board. Thus, using the present
invention, the cache memory may be upgraded without having to
upgrade the entire printed wiring board.
Furthermore, the present invention provides reduced electromagnetic
interference (EMI). The package, in effect, shields radiation
between the two dice due to the shorter connection. Thus, the
physical proximity of the multiple dice provides for a
much more uniform thermal environment. This environment can also
reduce the effects of drift, because the characteristic process
variations due to temperature are lowered when both dice are at
approximately the same temperature.
Whereas many alterations and modifications of the present invention
will no doubt become apparent to a person of ordinary skill in the
art after having read the foregoing description, it is to be
understood that the particular embodiment shown and described by
way of illustration is in no way intended to be considered
limiting. Therefore, references to details of a preferred
embodiment are not intended to limit the scope of the claims which
in themselves recite only those features regarded as essential to
the invention.
Thus, a method and apparatus for transferring information between a
processor and a memory system has been described.
* * * * *