High performance memory architecture Braceras, George M. ; et al. [International Business Machines Corporation]

Patent Application Summary

U.S. patent application number 09/761460 was filed with the patent office on 2001-01-16 and published on 2002-01-17 for high performance memory architecture. This patent application is currently assigned to International Business Machines Corporation. Invention is credited to Braceras, George M., Pilo, Harold.

Application Number	20020006070 09/761460
Family ID	26876708
Filed Date	2001-01-16

United States Patent Application	20020006070
Kind Code	A1
Braceras, George M. ; et al.	January 17, 2002

High performance memory architecture

Abstract

A high performance memory array architecture is provided to minimize the delays within each array. The architecture of the array equalizes the access time to all memory elements by optimizing the positioning of the subarrays with respect to buffering and rebuffering elements used in the array which cause delays.

Inventors:	Braceras, George M.; (Essex Junction, VT) ; Pilo, Harold; (Underhill, VT)
Correspondence Address:	International Business Machines Corporation Intellectual Property Law-Mail 972E 1000 River Street Essex Junction VT 05452 US
Assignee:	International Business Machines Corporation Armonk NY 10504
Family ID:	26876708
Appl. No.:	09/761460
Filed:	January 16, 2001

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number
60180882	Feb 8, 2000

Current U.S. Class:	365/230.03
Current CPC Class:	G11C 11/418 20130101; G11C 11/417 20130101; G11C 7/1039 20130101
Class at Publication:	365/230.03
International Class:	G11C 008/00

Claims

What is claimed is:

1. A semiconductor memory array system having an input which receives an address signal and an output which transmits stored data comprising; a plurality of subarrays having sense circuits, precharging circuits, timing circuits and memory elements arranged in a rectangular shaped matrix of rows and columns which stores data; a plurality of wordline driver circuits located along the center line of the matrix which decode the address and drive a wordline signal within the subarray; and, a plurality of rebuffers receiving the address signal and transmitting it to the wordline driver and column select driver circuits to the selected subarray which accesses and transmits the data to a plurality of data rebuffers positioned in the middle of the matrix to transmit the data to the output whereby the access data from each subarray is about the same.

2. A semiconductor memory array system of claim 1, wherein, addresses originate from a single location and are transmitted to a location central to all subarrays.

3. A semiconductor memory array system of claim 2, wherein, addresses at central location to all subarrays are driven equally to all global decode drivers.

4. A semiconductor memory array system of claim 1, wherein, memory subarrays are subdivided into memory sub-blocks.

5. A semiconductor memory array system of claim 4, wherein, data from each subarray in a memory sub-block is driven to a central location in the said memory sub-block.

6. A semiconductor memory array system of claim 5, wherein, data from the central location in each memory sub-block is driven to a location central to all memory sub-blocks and subsequently driven to a memory output driver.

7. A semiconductor memory array system having an input which receives an address signal and an output which transmits stored data comprising; address signals that are balanced in a manner of a clock tree.

8. A semiconductor memory array system having an input which receives an address signal and an output which transmits stored data comprising; data signals that are balanced in a manner of a clock tree.

Description

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] The present invention relates to a high-performance semiconductor memory architecture and, in particular, to a high performance architecture of a Static Random Access Memory (SRAM) cell.

[0003] 2. Description of the Related Art

[0004] In recent years, the operating speed and integration density of integrated circuits have improved significantly with the miniaturization of semiconductor elements. In particular, BiCMOS LSI, which combines CMOS FETs and bipolar transistors to enable high-speed operation and low power dissipation, is being developed.

[0005] This trend is driven by the recent demand for higher speed and greater functionality in electronic instruments.

[0006] Miniaturization of semiconductor elements has advanced in response to this requirement. Fine-pattern processing, however, requires corresponding facilities and equipment, so it is not easy to develop rapidly. Attempts have therefore been made to increase operating speed through circuit design. For example, high-speed memory circuits utilizing pipeline systems were proposed and fabricated. The pipeline system performs reading and writing of information at a shorter time interval than the information read time of the memory circuit (address access time: the time from the input of an information read signal to the output of recorded information from a memory cell, hereinafter referred to as access time) by dividing the operation of the memory circuit along the signal flow and operating the respective circuits independently.

[0007] Later, an improved pipeline technique known as the "wave-pipeline technique" was proposed to further raise the operating frequency of the conventional pipeline technique. In the wave-pipeline technique, a plurality of signals are propagated on the data path as wave signals. With this technique, an operation equivalent to a conventional two-stage pipeline is realized without interference and with a reduction in power dissipation and chip area.

[0008] In the wave-pipeline technique, the operational speed of the system is improved without using intermediate registers or latch circuits. A plurality of coherent data waves are aligned in sequence in the combinational circuit by clocking the flip-flops with a period shorter than the propagation delay time of the combinational circuit. If all the signal paths for the signal components of a wave signal, extending from the input to the output of the combinational circuit, have substantially equal delay, the individual wave signals can propagate toward the output section without interfering with one another.

[0009] If address signals are applied to a data path with a cycle time that exceeds the access time, no read-out data is output during the self-delay of the memory core. In a memory system using the wave-pipeline technique, address input signals are applied with a period shorter than the critical path delay of the memory core section.

[0010] A key to implementing the wave-pipeline technique in a semiconductor memory system lies in reducing the difference in signal delay time caused by the different locations of the memory cells being accessed or by differences between data path lengths.
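
The constraint described in paragraphs [0008] to [0010] can be illustrated with a small calculation. The sketch below is a simplified model and not part of the disclosure: it assumes that successive accesses are launched one period apart and that waves stay separated only if the slowest path of one access arrives no later than the fastest path of the next. The function names, the `margin_ps` parameter, and the example delays are hypothetical.

```python
def min_wave_period_ps(path_delays_ps, margin_ps=0):
    """Smallest launch period at which consecutive waves stay separated.

    Simplified model: wave k is launched at k*T and reaches the output
    between k*T + min(delays) and k*T + max(delays). Wave k+1 must not
    arrive before wave k has fully arrived, so T >= max - min + margin.
    """
    spread = max(path_delays_ps) - min(path_delays_ps)
    return spread + margin_ps


def waves_in_flight(path_delays_ps, period_ps):
    """Number of accesses inside the memory core at once (rounded up)."""
    longest = max(path_delays_ps)
    return -(-longest // period_ps)  # integer ceiling division


# Illustrative numbers only; they are not taken from the disclosure.
unbalanced = [1000, 1500]  # fast and slow path delays in ps
balanced = [1450, 1500]

print(min_wave_period_ps(unbalanced))   # 500
print(min_wave_period_ps(balanced))     # 50
print(waves_in_flight(balanced, 500))   # 3
```

In this model the achievable period is bounded by the spread between the fastest and slowest paths rather than by the longest path itself, which is why equalizing path delays matters more than shortening any single path.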

[0011] It is to be noted that in a memory system of a large capacity, a refinement in the process results in a reduced metal film thickness, and a reduction in size of a memory cell also results in a reduced metal line width.

[0012] As the capacity increases, the metal signal wiring and bit lines tend to become longer. The resistance of the signal wiring and the bit lines therefore increases, which increases both the signal delay time caused by the signal wiring and the bit lines and the difference in signal delay time between data paths.

[0013] Static Random Access Memory (SRAM) devices are comprised of a rectangular matrix of memory cells. Individual memory cells are accessed by the intersection of decoded row and column addresses. Because the SRAM receives these addresses in only one input location, it follows that some memory elements are close to the input while others are farther away from the input. Or, in terms that are important to high performance memories, there are fast memory elements and slow memory elements within the SRAM. Normally, this is not an issue because the speed of the SRAM is dictated by the slowest cells, and the fastest cells meet the specifications with margin. As SRAM densities and performance increase, the speed difference between the fast and slow elements can become a significant percentage of the cycle time of the memories and start to impact performance. This will become evident by reviewing the operation of a conventional SRAM and considering the differences between an access to a slow memory element and an access to a fast memory element in a typical prior art SRAM design shown in FIG. 1.

[0014] FIG. 1 illustrates a quadrant of a typical conventional 16 Mb SRAM whose design has been optimized for performance. The shaded blocks, labeled SUB0UL (upper-left) through SUB15LR (lower-right), represent 64 of the SRAM's 256 subarrays. Each subarray is a small independent memory structure that contains all the sensing, precharge, and timing circuitry needed to access its memory elements. SRAM designs use the subarray structure to minimize the number of cells activated within any given cycle, thereby reducing the chip's active power. For this design, only 2 of the 64 quadrant subarrays will be active in a cycle. The subarrays are designed using the standard dummy wordline and dummy bit technique, described more fully in U.S. Pat. Nos. 5,268,869 and 4,425,633, which are examples of this technique as used for the past 15 years to precisely time the sensing circuitry. One benefit of this sensing method is an almost constant access time across the subarray, leaving only the subarray selection and the common global data buses between subarrays as sources of an access delta between memory elements. In this case, we will compare the access times of a slow subarray 11 and a fast subarray 19.

[0015] For the existing architecture, subarray 11 is accessed by an address signal 1 that drives from the center of the chip through three sections of wire 2, 3 and 4, each having a delay RC1, and the two rebuffers 5 and 6, before reaching the global wordline driver 7. The global wordline driver circuits decode the address and drive a global wordline signal across the array on a wire 9 with delay RC2. Due to the large array size, the global wordline is applied to rebuffer 10 before driving to subarray 11 across another wire 12 having a delay RC2. Note that for simplicity this diagram only illustrates the selection of the subarray through the global wordline. In reality, the global wordline selects the subarray in conjunction with several column selection signals. The wiring and buffering of the column signals is handled in a manner similar to the global wordline. Once selected, subarray 11 accesses its local memory cells and then drives data along a data bus 13 with delay RC3, through a data rebuffer circuit 14, along a second data bus 15 with delay RC4, through a second data rebuffer circuit 16, and finally along a third data bus 17 having a delay RC5 before reaching the SRAM output drivers 18.

[0016] The fast subarray 19 is selected similarly, except in this case the addresses only need to travel through one section of wire of delay RC1 and the two rebuffers 5 and 8 before reaching the global wordline driver circuit 20. The global wordline drives through wire 21 with delay RC2 and selects subarray 19 without going through the global wordline rebuffer 23. Following the access to the subarray's local memory elements the data drives directly into the first data rebuffer circuit 22 and subsequently the second data rebuffer 16 without having additional wire delays. After the second stage rebuffer, the data travels along the data path of delay RC5 before reaching the output drivers 18.

[0017] To give a better appreciation of the timing differences between the fast and slow subarrays, the following Table I translates the various delays discussed above into specific values, based on typical 16 Mb SRAM design parameters.

TABLE I
SUB0UL access	Delay (ps)	SUB8LR access	Delay (ps)	Access delta (ps)
RC1 + I0	83	RC1 + I0	83	0
RC1 + I2 + RC1	169	I1	40	129
Gwl driver	200	Gwl driver	200	0
RC2	56	RC2	56	0
Gwlbuff + RC2	136	-	-	136
Subdelay	900	Subdelay	900	0
RC3	56	-	-	56
Data rebuff1	50	Data rebuff1	50	0
RC4	150	-	-	150
Data rebuff2	50	Data rebuff2	50	0
RC5	40	RC5	40	0
Total	1,890	Total	1,419	471

[0018] As shown in Table I, the total difference between accessing a fast and a slow subarray is 471 ps, or almost 25% of the product's 2 ns cycle. This timing difference (access delta) limits performance and complicates the design of a high performance SRAM device.
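
The totals in Table I can be checked by summing the individual delay components. The short sketch below is added only as a worked check; the dictionary keys and variable names are illustrative labels, and the delay values are copied from Table I.

```python
# Delay components in picoseconds, copied from Table I.
slow_sub0ul = {
    "RC1 + I0": 83,
    "RC1 + I2 + RC1": 169,
    "Gwl driver": 200,
    "RC2": 56,
    "Gwlbuff + RC2": 136,
    "Subdelay": 900,
    "RC3": 56,
    "Data rebuff1": 50,
    "RC4": 150,
    "Data rebuff2": 50,
    "RC5": 40,
}
fast_sub8lr = {
    "RC1 + I0": 83,
    "I1": 40,
    "Gwl driver": 200,
    "RC2": 56,
    "Subdelay": 900,
    "Data rebuff1": 50,
    "Data rebuff2": 50,
    "RC5": 40,
}

CYCLE_PS = 2000  # the 2 ns cycle quoted in paragraph [0018]

slow_total = sum(slow_sub0ul.values())
fast_total = sum(fast_sub8lr.values())
access_delta = slow_total - fast_total

print(slow_total, fast_total, access_delta)                   # 1890 1419 471
print(f"{100 * access_delta / CYCLE_PS:.2f}% of the cycle")   # 23.55% of the cycle
```

The four components that appear only on the slow path (the extra address wiring and rebuffer, the global wordline rebuffer with its wire, RC3, and RC4) account for the entire 471 ps.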

SUMMARY OF THE INVENTION

[0019] Accordingly, it is an object of the present invention to provide an architecture which will minimize the delays within each cell thereby allowing the cycle time to be reduced by preventing fast subarray accesses from colliding with the slower data from the more remote subarrays.

[0020] In accordance with the present invention, the architecture of the array is laid out to equalize access to all memory elements. To the greatest extent possible, the cells are located around the periphery as if on the rim of a wheel. With such an arrangement, the address signal is fed through the center of the array and propagates radially to the selected subarray. The data from the subarray then follows a radial path back through the center of the array to the output drivers. In this way, the access times to the fastest and slowest subarrays are about the same.

BRIEF DESCRIPTION OF THE DRAWINGS

[0021] FIG. 1 is a schematic illustration of a conventional memory system; and

[0022] FIG. 2 is a schematic of a semiconductor memory system according to the present invention.

DESCRIPTION OF PREFERRED EMBODIMENT

[0023] The primary intent of this SRAM architecture is to equalize access time to all memory elements, much as if the memory cells were located an equal distance from the input. This becomes important for high performance SRAMs when cross-chip access deltas become greater than 10% of the product's desired cycle time. The access deltas, or slope in the product's timings, limit how fast the SRAM can be cycled by causing collisions when a fast memory location is accessed immediately following a slow location. The basic method proposed to equalize access delays is similar to that used in the design of a microprocessor's balanced clock tree, except that in this case the balancing is performed on the address decode path and data path of a memory chip. For fast, large memory designs, it is necessary to rebuffer signals as the wires traverse the chip. Balancing is achieved by strategically positioning the components that cause delay, such as the rebuffers, data buffers, and drivers, and by using extra wire tracks to "wire back" to the fast subarray. In this manner, a balanced access can be achieved, as described below with reference to FIG. 2, which is a block diagram of the improved architecture of the present invention. This will become apparent by comparing the access time of a slow subarray 11 and a fast subarray 24 as shown in FIG. 2. Note that the reference numerals of FIG. 1 are reused in FIG. 2 to designate corresponding elements for ease of understanding. The memory is accessed when an address signal 1 drives from the center of the chip through two sections of wire 2 and 3, each with a delay RC1, and two rebuffers 5 and 6. Note that the address does not stop at the first place where it can be used (the lower right memory subarrays); instead it drives right past it to the center point between the near and far sections of the memory, and then wires back.
Once they reach the second rebuffer 6, the addresses are evenly wired to strategically positioned global wordline drivers in the center of the array through a wire 25 having a delay RC2, a third address rebuffer 26, and the two wires 27 and 28 of equal delay RC1. By wiring the addresses and positioning the rebuffers and decoders as shown in this example, all global wordline drivers on this large memory chip provide approximately equal access time. There is now no difference between the upper left section and the lower right, and any access delta is contained within a 16-subarray group. From the global wordline driver, the fast subarray 24 is immediately selected, while the slow subarray 11 is selected after the wire 30 with a delay RC2. To minimize the access delta in the 16-subarray group, the proposed architecture is extended to the memory data bus. Both fast and slow subarrays send their data down data buses 31 and 32, each having a delay RC6, to the first stage data rebuffer 33, which is repositioned to help balance the data path. From the first stage rebuffer 33 the data is sent to the second stage rebuffer 35 on a wire 34 having a delay RC7. Again, it should be noted that this wire goes back in the direction from which the data from subarray 24 came and requires additional space (wire tracks) or an additional level of metal. This is a good tradeoff to achieve the best performance. The second stage data rebuffer 35 is now positioned in the middle of the SRAM quadrant, allowing an equal data path from each of the four 16-subarray groups (UL, UR, LL, LR). The second stage rebuffer then drives the data to the SRAM output drivers 37 along a wire 36 with a delay RC4.
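
The effect of driving past the near subarrays to a center point and then wiring back can be illustrated with a one-dimensional toy model. This sketch is not the layout of FIG. 2: it assumes eight subarrays in a single row, one unit of wire delay per position, no buffer delays, and an input and output both located at position 0. All function names and parameters are hypothetical.

```python
def end_fed_round_trip(position, hop_delay=1.0):
    """Address enters at position 0 and data returns to position 0."""
    return 2 * position * hop_delay


def center_fed_round_trip(position, center, hop_delay=1.0):
    """Address drives past the near elements to `center`, then wires back.

    Data follows the reverse route: subarray -> center -> output at 0.
    """
    to_center = center * hop_delay
    center_to_subarray = abs(position - center) * hop_delay
    return 2 * (to_center + center_to_subarray)


def access_delta(delays):
    """Difference between the slowest and the fastest access."""
    return max(delays) - min(delays)


positions = range(1, 9)  # eight subarrays in a row (assumed)
center = 4.5             # midpoint between subarrays 4 and 5

end_fed = [end_fed_round_trip(p) for p in positions]
center_fed = [center_fed_round_trip(p, center) for p in positions]

print(access_delta(end_fed))          # 14.0
print(access_delta(center_fed))       # 6.0
print(max(end_fed), max(center_fed))  # 16.0 16.0
```

In this model the slowest access is unchanged and only the fastest accesses are slowed, so the access delta shrinks without lengthening the worst case. Repeating the same step inside each half would shrink the delta further, which is the idea behind a balanced tree.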

[0024] The following Table II translates the delays for the fast and slow subarrays discussed above into specific values based on SRAM design parameters.

TABLE II
SUB0UL access	Delay (ps)	SUB7UL access	Delay (ps)	Access delta (ps)
RC1 + I0	83	RC1 + I0	83	0
RC1 + I1 + RC2	169	RC1 + I1 + RC2	169	0
I2 + RC1	83	I2 + RC1	83	0
GWL driver	200	GWL driver	200	0
RC2	56	-	-	56
Subdelay	900	Subdelay	900	0
RC6	29	RC6	29	0
Data rebuff1	50	Data rebuff1	50	0
RC7	120	RC7	120	0
Data rebuff2	50	Data rebuff2	50	0
RC4	150	RC4	150	0
Total	1,890	Total	1,834	56

[0025] To summarize, the SRAM architecture proposed here significantly reduces access deltas across a large memory array, thereby allowing the cycle time to be reduced by preventing fast subarray accesses from colliding with the slower data from the more remote subarrays. In this example, the cycle can be reduced by more than 400 ps compared with the prior art architecture.
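
The figures in Table II and the cycle-time improvement stated above can be checked in the same way as Table I. The sketch below is only a worked check with illustrative variable names. It uses 169 ps for the second row of the fast subarray, the value consistent with the stated per-row delta of 0 and the stated total of 1,834 ps, and it takes the prior-art access delta of 471 ps from Table I.

```python
# Delay components in picoseconds, following Table II.
slow_sub0ul = {
    "RC1 + I0": 83,
    "RC1 + I1 + RC2": 169,
    "I2 + RC1": 83,
    "GWL driver": 200,
    "RC2": 56,
    "Subdelay": 900,
    "RC6": 29,
    "Data rebuff1": 50,
    "RC7": 120,
    "Data rebuff2": 50,
    "RC4": 150,
}
# The fast subarray sees every component except the extra RC2 wire.
fast_sub7ul = {k: v for k, v in slow_sub0ul.items() if k != "RC2"}

PRIOR_ART_DELTA_PS = 471  # access delta from Table I

slow_total = sum(slow_sub0ul.values())
fast_total = sum(fast_sub7ul.values())
new_delta = slow_total - fast_total

print(slow_total, fast_total, new_delta)  # 1890 1834 56
print(PRIOR_ART_DELTA_PS - new_delta)     # 415
```

The 415 ps difference between the two access deltas corresponds to the reduction of more than 400 ps stated in paragraph [0025].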

[0026] The memory system shown and described in connection with the above embodiment is preferred for use in the memory array of a large-capacity, high-speed SRAM, which may be used, for example, in a cache memory in which a high-speed CPU and a high-speed bus are directly coupled to achieve operation compatible with the high-speed CPU. However, this embodiment may also be applicable to large-capacity, high-speed semiconductor memory systems other than SRAM, such as Dynamic Random Access Memory (DRAM). In addition, while the present invention has been disclosed above in connection with a preferred embodiment, it should be understood that the invention may take various forms without departing from its spirit and scope.

* * * * *