U.S. patent number 9,959,929 [Application Number 14/609,963] was granted by the patent office on 2018-05-01 for memory device and method having on-board processing logic for facilitating interface with multiple processors, and computer system using same.
This patent grant is currently assigned to Micron Technology, Inc. The grantee listed for this patent is MICRON TECHNOLOGY, INC. Invention is credited to David Resnick.
United States Patent |
9,959,929 |
Resnick |
May 1, 2018 |
Memory device and method having on-board processing logic for
facilitating interface with multiple processors, and computer
system using same
Abstract
A memory device includes an on-board processing system that
facilitates the ability of the memory device to interface with a
plurality of processors operating in a parallel processing manner.
The processing system includes circuitry that performs processing
functions on data stored in the memory device in an indivisible
manner. More particularly, the system reads data from a bank of
memory cells or cache memory, performs a logic function on the data
to produce results data, and writes the results data back to the
bank or the cache memory. The logic function may be a Boolean logic
function or some other logic function.
Inventors: | Resnick; David (Boise, ID) |
Applicant: |
Name | City | State | Country | Type |
MICRON TECHNOLOGY, INC. | Boise | ID | US | |
Assignee: | Micron Technology, Inc. (Boise, ID) |
Family ID: | 40351434 |
Appl. No.: | 14/609,963 |
Filed: | January 30, 2015 |
Prior Publication Data
Document Identifier | Publication Date |
US 20150143040 A1 | May 21, 2015 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number | Issue Date |
13243917 | Sep 23, 2011 | 8977822 | |
11893593 | Aug 15, 2007 | 8055852 | |
Current U.S. Class: | 1/1 |
Current CPC Class: | G11C 14/0018 (20130101); G06F 12/0246 (20130101); G11C 7/1006 (20130101); G11C 16/0408 (20130101); G06F 2212/7207 (20130101); G11C 2207/2245 (20130101) |
Current International Class: | G06F 12/00 (20060101); G11C 14/00 (20060101); G11C 7/10 (20060101); G06F 12/02 (20060101); G11C 16/04 (20060101) |
Field of Search: | ;711/105,154,104,5,147,118 |
References Cited
U.S. Patent Documents
Foreign Patent Documents
0718769 | Jun 1996 | EP |
5978915 | Nov 1999 | TW |
200635383 | Oct 2006 | TW |
7174429 | Feb 2007 | TW |
200729859 | Aug 2007 | TW |
Other References
First Office Action received for TW application No. 097130579, Oct.
30, 2012. cited by applicant .
International Search Report and Written Opinion in International
Application No. PCT/US2008/072809, dated Feb. 26, 2009. cited by
applicant .
IEEE 100 The Authoritative Dictionary of IEEE Standard Terms, IEEE,
Seventh Ed., Dec. 2000, pp. 787-788. cited by applicant .
Office Action of the Intellectual Property Office for TW Appl. No.
097130579 dated Oct. 15, 2014. cited by applicant .
Fang, et al., "Active Memory Operations", ACM, Jun. 2007, 232-241.
cited by applicant .
Resnick, et al., TW Office Action for Taiwan Application No.
097130579 dated Apr. 3, 2013. cited by applicant .
"Application Note: Accelerate Common GUI Operations With a
TPDRAM-Based Frame Buffer; AN-43-01 Accelerate GUI Operations;
Application Note: Use of TPDRAM for Smarter/Faster Network
Applications, AN-43-02 TPDRAM for Network Applications", Micron
Semiconductor, Inc. (1994), Retrieved Feb. 2010, pp. 6-75 through
6-90. cited by applicant.
|
Primary Examiner: Chery; Mardochee
Attorney, Agent or Firm: Dorsey & Whitney LLP
Parent Case Text
CROSS-REFERENCE TO RELATED APPLICATION
This application is a continuation of U.S. patent application Ser.
No. 13/243,917, filed Sep. 23, 2011, issued as U.S. Pat. No.
8,977,822 on Mar. 10, 2015, which is a continuation of U.S. patent
application Ser. No. 11/893,593, filed Aug. 15, 2007 and issued as
U.S. Pat. No. 8,055,852 on Nov. 8, 2011. These applications and
patents are incorporated herein by reference in their entirety, for
any purpose.
Claims
What is claimed is:
1. An integrated circuit memory device comprising: a plurality of
memory cells; a logic unit configured to perform a logic function
on data initially received from a first location in the plurality
of memory cells and output a result data responsive to the logic
function; a select circuit coupled to a plurality of terminals and
the logic unit, the select circuit configured to receive write data
via the plurality of terminals and the result data from the logic
unit and to select between the plurality of terminals and the logic
unit, based on a command; and a write driver coupled to the select
circuit and configured to receive one of the write data and the
result data and, in response to receiving the result data, write
the result data to the first location of the plurality of memory
cells, wherein the result data replaces the data initially received
from the first location.
2. The integrated circuit memory device as claimed in claim 1,
wherein in response to a read command, the select circuit selects
the logic unit.
3. The integrated circuit memory device as claimed in claim 2,
wherein the read command is provided to the memory by an external
component.
4. The integrated circuit memory device as claimed in claim 2,
wherein when the plurality of terminals is configured to receive
data to be written to the memory, the select circuit is configured
to select the plurality of terminals in response.
5. The integrated circuit memory device as claimed in claim 1,
wherein the logic function comprises an AND operation.
6. The integrated circuit memory device as claimed in claim 1,
wherein the logic function comprises an OR function.
7. The integrated circuit memory device as claimed in claim 1,
wherein the logic function comprises an arithmetic operation.
8. The integrated circuit memory device as claimed in claim 1,
wherein the logic function comprises an XOR operation.
9. The integrated circuit memory device as claimed in claim 1,
wherein the logic function comprises a NAND operation.
10. A method, comprising: providing data read from a first location
of a plurality of memory cells to a logic unit; performing, by the
logic unit, a logic function on the data initially read from the
first location; providing a result data of the logic function to a
select circuit in response to the logic function performed on the
data; providing, with the select circuit, the result data of the
logic function to a write driver responsive to a selection signal,
wherein the selection signal is based on a command; and writing, by
the write driver, the result data to the first location of the
plurality of memory cells, wherein the result data replaces the
data initially read from the first location, wherein the memory
cells, the logic unit, and the write driver are incorporated into
an integrated circuit memory device.
11. The method as claimed in claim 10, wherein a cache memory is
configured to load the data in the plurality of memory cells, and
wherein providing data read from the plurality of memory cells to
the logic unit comprises reading the data from the cache memory and
wherein providing the result data of the logic function to the
write driver comprises sending the result data to the cache
memory.
12. The method as claimed in claim 10, wherein the logic function
comprises an AND operation.
13. The method as claimed in claim 10, wherein the logic function
comprises an OR operation.
14. The method as claimed in claim 10, wherein the logic function
comprises an arithmetic operation.
15. The method as claimed in claim 10, wherein the logic function
comprises an XOR operation.
16. The method as claimed in claim 10, wherein the logic function
comprises a NAND operation.
17. A system comprising: a controller and an integrated circuit
memory device coupled to the controller, wherein the controller is
configured to provide a command to the integrated circuit memory
device and wherein the integrated circuit memory device is
configured to receive write data from the controller; the
integrated circuit memory device comprising: a plurality of memory
cells; a command decoder; a logic unit configured to perform a
logic function on data initially read from a first location in the
plurality of memory cells in response to the command and output a
result data of the logic function; a select circuit configured to
receive the write data and the result data and selectively provide
the result data responsive to the command; and a write driver
configured to receive the result data and provide the result data
to the plurality of memory cells to be stored at the first location
in the plurality of memory cells, wherein the result data replaces
the data initially read from the first location.
18. The system as claimed in claim 17, wherein the select circuit
is configured to selectively couple a data path to the write driver
to provide the write data on the data path to the write driver.
19. The system as claimed in claim 17, wherein the logic function
comprises an AND operation.
20. The system as claimed in claim 17, wherein the logic function
comprises an OR operation.
Description
TECHNICAL FIELD
This invention relates generally to memory devices, and, more
particularly, to a memory device and method that facilitates access
by multiple memory access devices, as well as memory systems and
computer systems using the memory devices.
BACKGROUND
As computer and computer system architecture continues to evolve,
the number of processing cores and threads within cores is
increasing geometrically. This geometric increase is expected to
continue, even for simple, relatively inexpensive computer systems.
For server systems, system sizes measured in the number of
processors are increasing at an even faster rate.
Although this rapid increase in the number of cores and threads
enhances the performance of computer systems, it also has the
effect of making it difficult to apply the increasing parallelism
to single applications. This limitation exists even for high-end
processing tasks that naturally lend themselves to parallel
processing, such as, for example, weather prediction. One of the
major reasons for this limitation is that the number of
communication paths between processors, cores, and threads
increases disproportionately to the number of times the task is
divided into smaller and smaller pieces. Conceptually, this problem
can be analogized to the size of a processing task being represented by
the volume of a 3D cube. Each time this volume is divided into
smaller cubes, the total surface area of the cubes, which
represents data that must be communicated between the processors
working on sub-cubes, increases. Every time that the number of
processors goes up by a factor of eight the total amount of
information to be communicated between the greater number of
processors doubles.
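The cube analogy can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function name `total_surface_area` and the unit edge length are assumptions made for the example and do not appear in the disclosure.

```python
def total_surface_area(edge: float, splits_per_axis: int) -> float:
    """Total surface area after cutting a cube of the given edge length
    into splits_per_axis**3 equal sub-cubes."""
    n_cubes = splits_per_axis ** 3
    sub_edge = edge / splits_per_axis
    return n_cubes * 6 * sub_edge ** 2


base = total_surface_area(1.0, 1)        # 1 processor working on the whole cube
eight = total_surface_area(1.0, 2)       # 8 processors, one sub-cube each
sixty_four = total_surface_area(1.0, 4)  # 64 processors

ratio_8 = eight / base          # growth in surface area going from 1 to 8
ratio_64 = sixty_four / eight   # growth going from 8 to 64
```

Each eightfold increase in the number of sub-cubes halves the edge length, so the per-cube area falls by a factor of four while the count rises by a factor of eight, which gives the doubling described above.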
One reason for these problems caused by increasing parallelism is
that most systems communicate by sending messages between
processors, rather than sharing memory. This approach results in
high latencies and high software overheads, although it may
simplify some complex system architecture, operating system, and
compiler issues. Unfortunately, as the level of parallelism
increases, the processors in the system reach the point where all
they are doing is managing message traffic rather than actually
doing useful work.
There is therefore a need for a system and method that can reduce
software overhead and eliminate or at least reduce performance
bottlenecks thereby improving system performance and architectural
scalability at relatively low cost.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a computer system according to one
embodiment.
FIG. 2 is a block diagram of a portion of a system memory device
containing processing logic according to one embodiment that may be
used in the computer system of FIG. 1 to allow operations to be
carried out in the memory device in an indivisible manner.
FIG. 3 is a block diagram of a memory device according to one
embodiment that may be used in the computer system of FIG. 1.
DETAILED DESCRIPTION
A computer system 10 according to one embodiment is shown in FIG.
1. The computer system 10 includes several parallel processors
14.sub.1-N connected to a common processor bus 16. Also connected
to the processor bus 16 are a system controller 20 and a level 2
("L2") cache 24. As is well known in the art, each of the
processors 14.sub.1-N may include a level 1 ("L1") cache.
The system controller 20 drives a display 26 through a graphics
accelerator 28, which may include a graphics processor and graphics
memory of conventional design. Also connected to the system
controller 20 is an input/output ("I/O") bus 30, such as a
peripheral component interconnect ("PCI") bus, to which are
connected a keyboard 32, a mass storage device 34, such as a hard
disk drive, and other peripheral devices 36. Of course there can
also be systems such as servers that do not have directly connected
keyboard, graphics or display capabilities, for example.
The computer system 10 also includes system memory 40, which may be
a dynamic random access memory ("DRAM") device or sets of such
devices. The system memory 40 is controlled by memory controller
circuitry 44 in the system controller 20 through a memory bus 46,
which normally includes a command/status bus, an address bus and a
data bus. There are also systems in which the system and memory
controller is implemented directly within a processor IC. As
described so far, the computer system 10 is conventional. However,
the system memory 40 departs from conventional systems by including
in the system memory 40 a processing system 50 that enhances the
ability of the parallel processors 14.sub.1-N to access the system
memory 40 in an efficient manner. It should also be understood that
the system 50 may be used in memory devices in a computer or other
processor-based systems that differ from the computer system 10
shown in FIG. 1. For example, servers and other high-end systems
will generally not include the graphics accelerator 28, the display
26, the keyboard 32, etc., but will have disk systems or simply
connect to a network of other similar processors with attached
memory.
The processing system 50 includes circuitry that allows the system
memory 40 to be naturally coherent by carrying out operations in
the memory device in an indivisible manner. The system reduces or
eliminates coherency issues and may improve communication for all
levels in the computer system 10. The processing system 50 or a
processing system according to some other embodiment can be
implemented in the system memory 40 while keeping the internal
organization of the memory system substantially the same as in
conventional system memories. For example, bank timing and memory
data rates can be substantially the same. Further, the system 50
need not be particularly fast as the operations needed are
generally simple and fit with current and anticipated memory clock
rates.
In general, it is preferable for the processing to be initiated and
to be performed as a single indivisible operation. An example is
where a byte in a 32-bit word is updated (read and then written)
while preventing access to the word while the update is being
executed. Functions like these, which are sometimes referred to as
"atomic," are desired when parallel processes access and update
shared data. The processing system 50 allows the system memory 40
to be naturally coherent by performing operations as an indivisible
whole with a single access. The coherency circuitry reduces or
eliminates coherency issues and may improve communication for all
levels in the computer system 10. The coherency circuitry operates
most advantageously when used with other extensions to the
functionality of memory devices, such as that provided by a cache
system.
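The byte-update example above can be sketched in software. This is a minimal model under stated assumptions: the class `WordMemory` and its method `update_byte` are hypothetical names, and a `threading.Lock` merely stands in for the memory device preventing access to the word during the update; it is not the mechanism of the processing system 50.

```python
import threading


class WordMemory:
    """Toy model of one 32-bit word whose updates are indivisible."""

    def __init__(self) -> None:
        self.word = 0
        # Assumption: a lock stands in for the device blocking other accesses.
        self._busy = threading.Lock()

    def update_byte(self, byte_index: int, value: int) -> None:
        shift = 8 * byte_index
        with self._busy:
            old = self.word                                 # read
            cleared = old & ~(0xFF << shift) & 0xFFFFFFFF   # drop the old byte
            self.word = cleared | ((value & 0xFF) << shift)  # write


mem = WordMemory()
# Four parallel writers, each updating a different byte of the same word.
threads = [
    threading.Thread(target=mem.update_byte, args=(i, 0x11 * (i + 1)))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
result = mem.word
```

Because each read and write pair completes as a unit, no writer can overwrite a byte that another writer has just stored, whatever order the threads run in.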
One embodiment of a processing system 50 is shown in FIG. 2. The
system 50 includes a select circuit 54, which may be a multiplexer,
that routes write data to a column of a Memory Bank 58 through a
set of write drivers 56. The write data are routed to the column
from either a data bus of the memory device 40 or Boolean Logic 60.
The Boolean Logic 60 receives read data from a set of sense
amplifiers and page registers 56. The read data are also applied to
the data bus of the memory device 40.
In operation, the select circuit 54 normally couples write data
directly to the write drivers 56 of the Bank 58. However, in
response to a command from the memory controller 44, the select
circuit 54 routes data from the Boolean Logic 60 to the write
drivers 56. In response to a read command, the read data are
applied to the Boolean Logic 60, and the Boolean Logic 60 then
performs a Boolean logic operation on the read data and writes data
resulting from the operation back to the location in the Bank 58
where the data was read. If the memory device 40 includes a cache
memory, the Boolean Logic 60 can instead perform an operation on
data read from the cache memory before writing the result data back
to the same location in the cache memory.
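The data path just described can be modeled roughly as follows. This is a sketch only: the class and method names are hypothetical, the 8-bit column width is an assumption, and the bank is reduced to a flat list of columns.

```python
from typing import Callable, List

DATA_MASK = 0xFF  # assumed 8-bit column width, for illustration only


class ProcessingSystemModel:
    """Toy model of the FIG. 2 data path (names are hypothetical)."""

    def __init__(self, columns: int) -> None:
        self.bank: List[int] = [0] * columns  # stands in for Memory Bank 58

    def _write_driver(self, column: int, data: int) -> None:
        self.bank[column] = data & DATA_MASK

    def _select(self, use_logic: bool, bus_data: int, logic_data: int) -> int:
        # Select circuit 54: pass bus data normally, logic output on command.
        return logic_data if use_logic else bus_data

    def write(self, column: int, bus_data: int) -> None:
        """Ordinary write: bus data goes straight to the write drivers."""
        self._write_driver(column, self._select(False, bus_data, 0))

    def read_modify_write(self, column: int, logic: Callable[[int], int]) -> int:
        read_data = self.bank[column]   # from the sense amplifiers
        result = logic(read_data)       # Boolean Logic 60
        self._write_driver(column, self._select(True, 0, result))
        return read_data                # read data also reaches the data bus


mem = ProcessingSystemModel(columns=8)
mem.write(3, 0b10101010)
old = mem.read_modify_write(3, lambda d: ~d)  # complement every bit in place
```

After the call, column 3 holds the complemented value while `old` carries the data as it was read, mirroring the statement that the read data are also applied to the data bus.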
Although the system 50 shown in FIG. 2 uses Boolean Logic 60, other
embodiments may use circuits or logic that perform other increased
functions. In general, this increased functionality may be logic
functions, such as AND, OR, etc. functions, arithmetic operations,
such as ADD and SUB, and similar operations that can update and
change the contents of memory. Arithmetic functions would be very
useful to multiple different kinds of software. However, as
indicated above, the system 50 performs Boolean logic operations
since they are also very useful functions to implement as flags and
for general communication between computation threads, cores, and
clusters. A Boolean operation is a standalone bit-operation since
no communication between bits participating in the operation is
generally required, and can be implemented efficiently on a memory
die. As each Boolean operation is simple, the logic implementing
the functions does not have to be fast compared to the memory
clock. These functions provide coherency directly as memory is
modified in the memory device. These functions, in conjunction with
the protection capability described previously, enable system
implementation of a set of easy to use but novel memory
functions.
Typical logical functions that may be implemented by the Boolean
Logic 60 are shown in Table 1, below. The increased functionality
can provide solutions to many of the issues that surround the
increased parallelism of new computer implementations.
The basic operation that is performed to implement the logic
functions is:
WriteData .OP. MemData → MemData
where ".OP." is
a value designating a specified Boolean logic function. Memory data
is modified by data contained in what is basically a Write
operation, with the result returned to the same place in memory
that sourced the data. An on-chip data cache can be the source and/or
sink of the data that is operated on by the Boolean Logic 60. If
the data source is a memory bank rather than a cache memory, an
Activate to a bank specified in the command should also be issued,
with the page data loaded into the normal row buffer. Write data
accompanying the command is then applied to the row buffer at the
specified column addresses. The result is written back to memory,
though this could be under control of a Precharge bit in the
Boolean logic 60. The operation is thus a Write, but with memory
data itself modifying what is written back to memory. If the data
source is a cache memory, then a cache row is fetched, such as by
using tag bits as described previously. After the data read from
the cache memory is transformed by the logic operation, the result
data are stored at the same location in the cache memory.
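The bank-sourced sequence described above (Activate, apply the write data to the row buffer, write the result back) can be sketched as below. The class `BankModel`, its method names, the 8-bit column width, and the choice to perform the write-back at precharge are all assumptions made for illustration.

```python
import operator
from typing import Callable, List, Optional


class BankModel:
    """Toy bank with a row buffer (hypothetical names, 8-bit columns assumed)."""

    def __init__(self, rows: int, cols: int) -> None:
        self.cells: List[List[int]] = [[0] * cols for _ in range(rows)]
        self.row_buffer: Optional[List[int]] = None
        self.open_row: Optional[int] = None

    def activate(self, row: int) -> None:
        # Activate: load the page data into the normal row buffer.
        self.open_row = row
        self.row_buffer = list(self.cells[row])

    def logic_write(self, col: int, write_data: int,
                    op: Callable[[int, int], int]) -> None:
        # WriteData .OP. MemData -> MemData, applied at the column address.
        assert self.row_buffer is not None, "bank must be activated first"
        self.row_buffer[col] = op(write_data, self.row_buffer[col]) & 0xFF

    def precharge(self) -> None:
        # Assumption: the result is written back to the cells at precharge.
        assert self.row_buffer is not None and self.open_row is not None
        self.cells[self.open_row] = list(self.row_buffer)
        self.row_buffer = None
        self.open_row = None


bank = BankModel(rows=2, cols=4)
bank.cells[1][2] = 0b1100
bank.activate(1)
bank.logic_write(2, 0b1010, operator.xor)  # complement where write bits are 1
bank.precharge()
stored = bank.cells[1][2]
```

The command looks like a Write from the outside, but the value that lands in memory depends on what was already stored there.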
In operation, there may be multiple different kinds of OPs, so as
to enable memory bits to be set, cleared and complemented. As
detailed below, this write-up shows eight different operations. A
particular set of command bits are not shown here to encode the
particular Boolean logic function because the implementation can be
independent of the cache memory operations described previously. If
combined with the use of a cache memory, a cache reference command
as described above may be used. This cache reference command may be
encoded using a respective set of RAS, CAS, WE, DM command signals.
A set of commands is shown in Table 1, below. The manner in which
those command bits map to DRAM command bits may be defined in a
variety of manners. However, one embodiment of a set of
instructions and an instruction mapping is shown in Table 1 in
which "W" designates a write bit received by the memory device, "M"
designates a bit of data read from either a bank of memory cells or
the cache memory, "·" is an AND function, "+" is an OR function, and
"⊕" is an exclusive OR function.
FIG. 3 shows one embodiment of a memory device 80. The memory
device 80 includes at least one bank of memory cells 84 coupled to
an addressing circuit 86 that is coupled between external terminals
88 and the at least one bank of memory cells 84. The memory device
80 also includes a data path 90 coupled between external
terminals 92 and the at least one bank of memory cells 84. Also
included in the memory device 80 is a command decoder 94 coupled to
external terminals 96. The command decoder 94 is operable to
generate control signals to control the operation of the memory
device 80. Finally, the memory device 80 includes a processing
system 98 coupled to the at least one bank of memory cells 84. The
processing system is operable to perform a processing function on
data read from the at least one bank of memory cells 84 to provide
results data and to write the results data to the at least one bank
of memory cells 84. The processing system 50 shown in FIG. 2 may be
used as the processing system 98, or some other embodiment of a
processing system may be used as the processing system 98.
TABLE 2 Boolean Functions
OP Code (octal) | Primary Equation | Alternate Equation | Common Name | Operation |
0 | W · M | | AND | Clear on 0's |
1 | ~W · M | | | Clear on 1's |
2 | W ⊕ M | | XOR | Complement on 1's |
3 | ~W · ~M | ~(W + M) | NOR | NOR |
4 | ~(W · M) | ~W + ~M | NAND | NAND |
5 | ~(W ⊕ M) | | EQV | Complement on 0's |
6 | ~(W · ~M) | ~W + M | | Set on 0's |
7 | ~(~W · ~M) | W + M | OR | Set on 1's |
Notes: 1 "W" is a write bit coming from the input pins. 2 "M" is a
memory bit. 3 "·" is AND. 4 "+" is OR. 5 "⊕" is Exclusive OR. 6 "~"
denotes complement (NOT), placed to match the Common Name and
Operation columns.
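The eight functions in the table can be exercised with a short sketch. The complemented terms used here are inferred from the Common Name and Operation columns (for example, "Clear on 1's" is taken to mean NOT W AND M), and the names `BOOLEAN_OPS` and `apply_op` as well as the 8-bit data width are assumptions made for the example.

```python
MASK = 0xFF  # assumed 8-bit data width, for illustration only

# Octal OP code -> function of (write bits W, memory bits M).
BOOLEAN_OPS = {
    0: lambda w, m: w & m,     # AND:  clear on 0's
    1: lambda w, m: ~w & m,    #       clear on 1's
    2: lambda w, m: w ^ m,     # XOR:  complement on 1's
    3: lambda w, m: ~(w | m),  # NOR
    4: lambda w, m: ~(w & m),  # NAND
    5: lambda w, m: ~(w ^ m),  # EQV:  complement on 0's
    6: lambda w, m: ~w | m,    #       set on 0's
    7: lambda w, m: w | m,     # OR:   set on 1's
}


def apply_op(op_code: int, write_data: int, mem_data: int) -> int:
    """Return the new memory contents for WriteData .OP. MemData."""
    return BOOLEAN_OPS[op_code](write_data, mem_data) & MASK


W = 0b00001111
M = 0b00110011
results = {code: apply_op(code, W, M) for code in range(8)}
```

With these operands, code 1 clears the low four bits of M (where W is 1) and code 6 sets the high four bits (where W is 0), which is how such operations can set, clear, and complement flag bits in a single memory access.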
From the foregoing it will be appreciated that, although specific
embodiments of the invention have been described herein for
purposes of illustration, various modifications may be made without
deviating from the spirit and scope of the invention. Accordingly,
the invention is not limited except as by the appended claims.
* * * * *