U.S. patent application number 12/953055 was filed with the patent office on 2010-11-23 and published on 2011-03-17 for configurable processor module accelerator using a programmable logic device.
This patent application is currently assigned to DRC Computer Corporation. Invention is credited to Steven Casselman, Stephen Sample.
Application Number | 20110066832 12/953055 |
Family ID | 38982411 |
Filed Date | 2010-11-23 |
United States Patent Application | 20110066832 |
Kind Code | A1 |
Casselman; Steven ; et al. | March 17, 2011 |
Configurable Processor Module Accelerator Using A Programmable Logic Device
Abstract
A configurable processor module accelerator using a programmable
logic device is described. According to one embodiment, the
accelerator module includes a circuit board having coupled thereto
a first programmable logic device, a controller, and a first
memory. The first programmable logic device has access to a
bitstream which is stored in the first memory. Access to the
bitstream by the first programmable logic device is controlled by
the controller. The bitstream is capable of being instantiated in
the first programmable logic device using programmable logic
thereof to provide at least a transport interface for communication
between the first programmable logic device and one or more other
devices associated with the motherboard using the microprocessor
interface.
Inventors: | Casselman; Steven; (Santa Clara, CA) ; Sample; Stephen; (Saratoga, CA) |
Assignee: | DRC Computer Corporation, Sunnyvale, CA |
Family ID: | 38982411 |
Appl. No.: | 12/953055 |
Filed: | November 23, 2010 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
11829804 | Jul 27, 2007 | 7856546 |
12953055 | | |
60820730 | Jul 28, 2006 | |
60826060 | Sep 18, 2006 | |
60865356 | Nov 10, 2006 | |
Current U.S. Class: | 712/231 ; 712/E9.074 |
Current CPC Class: | G06F 13/409 20130101 |
Class at Publication: | 712/231 ; 712/E09.074 |
International Class: | G06F 9/44 20060101 G06F009/44 |
Claims
1. An accelerator module, comprising: a first programmable logic
device having access to a bitstream; the bitstream capable of being
stored in a first memory; the access by the first programmable
logic device to the bitstream being controlled by a controller; the
bitstream to provide a transport interface for communication
between the first programmable logic device and another device
using a microprocessor interface; and the transport interface
capable of direct communication via the microprocessor interface
with a microprocessor located.
2. The accelerator module according to claim 1, wherein the
controller is configured to monitor the first programmable logic
device and configured to cause instantiation of the transport
interface in the first programmable logic device responsive to a
detected condition.
3. The accelerator module according to claim 1, further comprising:
a second memory; and the bitstream capable of being instantiated in
the first programmable logic device using programmable logic
thereof to additionally provide a memory interface for
communication with the second memory.
4-27. (canceled)
Description
[0001] This application claims the benefit of priority to each of
U.S. Provisional Patent Application No. 60/820,730 entitled "FPGA
Co-Processor For Accelerated Computation" and filed on Jul. 28,
2006; U.S. Provisional Patent Application No. 60/826,060 entitled
"General Purpose Coprocessor Socket on Server Motherboards" and
filed on Sept. 18, 2006; and U.S. Provisional Patent Application
No. 60/865,356 entitled "FPGA Co-Processor With On-Board Dram
Memory" and filed on Nov. 10, 2006, each of which is incorporated
by reference herein in its entirety for all purposes to the extent
such subject matter is not inconsistent herewith.
FIELD
[0002] The invention relates generally to computer systems and,
more particularly, to an accelerator module capable of being
coupled for communication with a microprocessor bus.
BACKGROUND
[0003] Co-processors have been used to accelerate computational
performance. For example, some early microprocessors did not
include floating-point circuitry due to integrated circuit die area
limitations. As used herein, "include" and "including" mean
including without limitation. Unfortunately, performing
floating-point computations in software can be quite slow.
[0004] Accordingly, a co-processor configured to work with a
microprocessor was created. Instructions for the co-processor could
thus be passed through the microprocessor, such as for performing a
floating-point computation for example. As integrated circuit
technology improved, microprocessor and co-processor were combined
together in a single die. So, for example, some recent
microprocessors are capable of performing floating-point
operations.
[0005] Still, conventional microprocessors have a fixed set of
circuitry for carrying out instructions from their Instruction Set
Architecture ("ISA"). So while instructions from known ISAs may be
used for carrying out computational algorithms in a conventional
microprocessor, the execution of such instructions is limited to
the fixed set of circuitry of the microprocessor. In short,
microprocessors may not be well suited for carrying out some
complex algorithms or highly specialized algorithms, and thus
execution of such algorithms as program applications using a
microprocessor may be slow.
[0006] More recently, multi-microprocessor computing systems have
been implemented. In such systems, one microprocessor may act as a
Central Processing Unit ("CPU") and one or more other of such
microprocessors may act as auxiliary processors to improve
computational throughput. However, such microprocessors are still
limited to their fixed set of circuitry and associated ISA, and
thus may still be relatively slow when executing complex algorithms
or highly specialized algorithms.
[0007] A microprocessor interface conventionally has more available
pins than an edge connector associated with a peripheral circuit
board interface. Conventionally, a socket may be attached to a
microprocessor interface of a motherboard to facilitate addition of
a microprocessor, which may be added after manufacture of the
motherboard. Thus, in some instances, motherboards are sold
separately from microprocessors.
[0008] Programmable Logic Devices ("PLDs"), such as those that have
field programmable gates which may be arrayed as in Field
Programmable Gate Arrays ("FPGAs") for example, have programmable
logic that may be tailored for carrying out various tasks. For
purposes of clarity by way of example and not limitation, FPGAs are
described below; however, it should be understood that other
integrated circuits that include programmable logic, such as field
programmable gates, may be used.
[0009] Execution of complex algorithms or highly specialized
algorithms may be done in hardware via programmable logic tailored
to carrying out such algorithms. Execution of complex algorithms or
highly specialized algorithms instantiated, in whole or in part, in
programmable logic may be substantially faster than executing them
in software using a microprocessor or microprocessors.
[0010] However, motherboards or system boards capable of handling
one or more microprocessors are more common in computing systems
than PLDs, such as FPGAs for example, for a variety of known
reasons. Accordingly, some developers have created FPGA
accelerators implemented as expansion cards that plug into one or
more peripheral circuit board edge connection slots of a
motherboard. However, expansion board FPGA accelerators
("peripheral accelerators") are limited by the edge connection
interface pin density and associated performance of the peripheral
communication interface to which they interconnect. An example of a
peripheral interface is a Peripheral Component Interconnect ("PCI"). A
peripheral circuit board interface, such as a PCI for example, is
relatively slow as compared with a microprocessor interface.
Examples of microprocessor interfaces include a Front Side Bus
("FSB") and a HyperTransport ("HT") link, among other types of
microprocessor interfaces.
[0011] A configuration bitstream or a partial bitstream may be
pre-designed to provide one or more functional blocks when
instantiated in programmable logic. Such a pre-designed bitstream
or partial bitstream is conventionally derived from what is
generally referred to as a "core." For example an HT link core is
available from Xilinx, Inc. for providing a configuration bitstream
that may be instantiated in an FPGA from that vendor.
Conventionally, a core is usable in a variety of applications;
however, a core may include pre-defined placement or pre-defined
routing, or a combination thereof. These types of pre-designed
cores are sometimes known as "floor-planned" cores. Such
floor-planned cores may be pre-designed for a particular family of
products. Additionally, cores may allow a user to enter parameters
to activate functionality, change functionality, and adjust
interface parameters, among other known parameterizations.
SUMMARY
[0012] One or more embodiments generally relate to computer systems
and more particularly, to an accelerator module capable of being
coupled for communication with a microprocessor bus.
[0013] A configurable processor module accelerator using a
programmable logic device is described. According to one
embodiment, the accelerator module includes a circuit board having
coupled thereto a first programmable logic device, a controller,
and a first memory. The first programmable logic device has access
to a bitstream which is stored in the first memory. Access to the
bitstream by the first programmable logic device is controlled by
the controller. The bitstream is capable of being instantiated in
the first programmable logic device using programmable logic
thereof to provide at least a transport interface for communication
between the first programmable logic device and one or more other
devices associated with the motherboard using the microprocessor
interface.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] Accompanying drawing(s) show exemplary embodiment(s) in
accordance with one or more aspects of the invention; however, the
accompanying drawing(s) should not be taken to limit the invention
to the embodiment(s) shown, but are for explanation and
understanding only.
[0015] FIG. 1 is a perspective view block diagram depicting an
exemplary embodiment of a multiprocessor-capable computing
system.
[0016] FIG. 2 is a block diagram depicting an exemplary embodiment
of a reconfigurable processor unit ("RPU").
[0017] FIG. 3 is a block diagram depicting an exemplary embodiment
of some of the functional blocks of the Field Programmable Gate
Array ("FPGA") of the RPU of FIG. 2.
[0018] FIG. 4 is a perspective view depicting an exemplary
alternative embodiment to the RPU of FIG. 2, namely with an
additional connector.
[0019] FIG. 5 is a perspective view block diagram depicting another
exemplary embodiment of an RPU.
[0020] FIG. 6 is a flow diagram depicting an exemplary embodiment
of a boot flow for the RPU of FIG. 2 or the RPUs of FIGS. 5 and
10.
[0021] FIG. 7 is a flow diagram depicting an exemplary embodiment
of a configuration flow.
[0022] FIG. 8 is a flow diagram depicting an exemplary embodiment
of a configuration bitstream generation flow.
[0023] FIG. 9 is a block diagram depicting an exemplary embodiment
of a bank allocation.
[0024] FIG. 10 is a block diagram depicting yet another exemplary
embodiment of an RPU.
DETAILED DESCRIPTION
[0025] In the following description, numerous specific details are
set forth to provide a more thorough description of the specific
embodiments. It should be apparent, however, to one skilled in the
art, that the invention may be practiced without all the specific
details given below. In other instances, well known features have
not been described in detail so as not to obscure the embodiments.
For ease of illustration, the same number labels are used in
different diagrams to refer to the same items; however, in
alternative embodiments the items may be different.
[0026] In the following description, for purposes of explanation,
specific nomenclature is set forth to provide a thorough
understanding of the various inventive concepts disclosed herein.
However, it will be apparent to one skilled in the art that these
specific details are not required in order to practice the various
inventive concepts disclosed herein.
[0027] Some portions of the detailed descriptions that follow are
presented in terms of algorithms and symbolic representations of
operations on data bits within a computer memory. These algorithmic
descriptions and representations are the means used by those
skilled in the data processing arts to most effectively convey the
substance of their work to others skilled in the art. An algorithm
is here, and generally, conceived to be a self-consistent sequence
of steps leading to a desired result. The steps are those requiring
physical manipulations of physical quantities. Usually, though not
necessarily, these quantities take the form of electrical or
magnetic signals capable of being stored, transferred, combined,
compared, and otherwise manipulated. It has proven convenient at
times, principally for reasons of common usage, to refer to these
signals as bits, values, elements, symbols, characters, terms,
numbers, or the like.
[0028] It should be borne in mind, however, that all of these and
similar terms are to be associated with the appropriate physical
quantities and are merely convenient labels applied to these
quantities. Unless specifically stated otherwise as apparent from
the following discussion, it is appreciated that throughout the
description, discussions utilizing terms such as "processing" or
"computing" or "calculating" or "determining" or "displaying" or
the like, refer to the action and processes of a computer system,
or similar electronic computing device, that manipulates and
transforms data represented as physical (electronic) quantities
within the computer system's registers and memories into other data
similarly represented as physical quantities within the computer
system memories or registers or other such information storage,
transmission or display devices.
[0029] The present system and methods also relate to an apparatus
for performing the operations herein. This apparatus may be
specially constructed for the required purposes, or it may comprise
a general-purpose computer selectively activated or reconfigured by
a computer program stored in the computer. Such a computer program
may be stored in a computer readable storage medium, such as, but
not limited to, any type of disk including floppy disks, optical
disks, CD-ROMs, and magneto-optical disks, read-only memories
("ROMs"), random access memories ("RAMs"), EPROMs, EEPROMs,
magnetic or optical cards, or any type of media suitable for
storing electronic instructions, and each coupled to a computer
system bus.
[0030] The algorithms and displays presented herein are not
inherently related to any particular computer or other apparatus.
Various general-purpose systems may be used with programs in
accordance with the teachings herein, or it may prove convenient to
construct a more specialized apparatus to perform the required
method steps. The required structure for a variety of these systems
will appear from the description below. In addition, the invention
is not described with reference to any particular programming
language. It will be appreciated that a variety of programming
languages may be used to implement the teachings of the invention
as described herein.
[0031] For purposes of clarity by way of example and not
limitation, an HT link is described even though it shall be
apparent from such description that other known types of
microprocessor interfaces may be used. An HT link is a packet-based
input/output ("I/O") link which may be implemented using two
unidirectional sets of signals. The HT link, which nominally is a
point-to-point bus architecture, may be used to couple a
microprocessor to an accelerator module. Basically, one set of
signals from one HT capable device to another includes a clock
signal, a control signal, and a set of command address and data
("CAD") signals. Control signaling ("CTL") is used to differentiate
between control signaling and data signaling of CAD. In an HT link,
each byte of CAD has a control signal. A clock signal is used for
both CAD and CTL signals. Each byte of CAD, and its associated CTL,
has a separate clock signal.
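The signal grouping described above may be sketched as a small data model. This is an illustrative sketch only: the class name `HTLinkDirection` and its fields are hypothetical, and the model simply assumes, as stated in the text, one CTL signal and one clock signal per byte of CAD. It counts logical signals, not physical pins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HTLinkDirection:
    """One unidirectional signal set of an HT link (hypothetical model)."""

    cad_bits: int  # width of the CAD bus in bits; assumed to be a multiple of 8

    def __post_init__(self):
        if self.cad_bits <= 0 or self.cad_bits % 8 != 0:
            raise ValueError("cad_bits must be a positive multiple of 8")

    @property
    def cad_bytes(self) -> int:
        return self.cad_bits // 8

    @property
    def ctl_signals(self) -> int:
        # Per the description: each byte of CAD has a control signal.
        return self.cad_bytes

    @property
    def clk_signals(self) -> int:
        # Per the description: each CAD byte, with its CTL, has a separate clock.
        return self.cad_bytes

    @property
    def total_signals(self) -> int:
        return self.cad_bits + self.ctl_signals + self.clk_signals


link = HTLinkDirection(cad_bits=16)
print(link.cad_bytes, link.ctl_signals, link.clk_signals, link.total_signals)
```

A full HT link, being two unidirectional sets of signals, would pair two such objects, one per direction.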
[0032] An accelerator module as described herein is referred to as
a reconfigurable processor unit ("RPU"). An RPU may be coupled to a
motherboard as a stand-alone processor, namely without a separate
microprocessor coupled to the same motherboard or without a
separate microprocessor coupled to a related motherboard, such as
in a blade system. For example, an FPGA included with the RPU may
have an embedded processor or may have a soft processor
instantiated in configurable logic. However, at least one
microprocessor is described as being coupled to a same motherboard
for purposes of clarity by way of example and not limitation. As
described herein, an RPU may have one or more HT links, which
facilitates a scalable HT fabric.
[0033] An embodiment relates generally to an accelerator module
suitable for coupling to a microprocessor interface of a
motherboard. The accelerator module includes a circuit board having
coupled thereto a first programmable logic device, a controller,
and a first memory. The first programmable logic device has access
to a bitstream which is stored in the first memory. Access to the
bitstream by the first programmable logic device is controlled by
the controller. The bitstream is capable of being instantiated in
the first programmable logic device using programmable logic
thereof to provide at least a transport interface for communication
between the first programmable logic device and one or more other
devices associated with the motherboard using the microprocessor
interface. The transport interface is capable of direct
communication via the microprocessor interface with a
microprocessor located on the motherboard.
[0034] Another embodiment relates generally to another accelerator
module. A circuit board has coupled thereto a first programmable
logic device, a controller, and a first memory. The first
programmable logic device has access to a bitstream which is stored
in the first memory. Access to the bitstream by the first
programmable logic device is controlled by the controller. The
bitstream is capable of being instantiated in the first
programmable logic device using programmable logic thereof to
provide at least a transport interface for communication between
the first programmable logic device and one or more other devices
associated with a motherboard using a microprocessor interface of
the motherboard. The circuit board is configured for
interconnecting the first programmable logic device and the
controller to the microprocessor interface.
[0035] Yet another embodiment relates generally to a method for
accelerating data processing. A boot sequence is initiated for an
accelerator module directly coupled to a microprocessor interface.
A first programmable logic device of the accelerator module is
configured responsive to a bitstream to instantiate a first
interface in the first programmable logic device. A configuration
bitstream is obtained via the first interface instantiated in the
first programmable logic device. The first interface is capable of
direct communication with a microprocessor coupled to the
microprocessor interface. A user design is instantiated in the
first programmable logic device responsive to the configuration
bitstream. An algorithm or portion thereof is co-processed using
the user design.
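The ordering of the method steps above may be illustrated with a minimal software model. This is a sketch under stated assumptions, not an actual RPU interface: the names `RpuState`, `RpuBootModel`, `boot`, `load_user_design`, and `coprocess` are hypothetical, bitstreams are represented as opaque byte strings, and the "user design" is stood in for by an ordinary Python callable.

```python
from enum import Enum, auto


class RpuState(Enum):
    POWER_ON = auto()
    INTERFACE_READY = auto()    # first interface instantiated from a bitstream
    USER_DESIGN_READY = auto()  # user design instantiated from a configuration bitstream


class RpuBootModel:
    """Hypothetical model of the step ordering; not a real RPU API."""

    def __init__(self, stored_bitstream: bytes):
        self.stored_bitstream = stored_bitstream
        self.state = RpuState.POWER_ON
        self.user_design = None

    def boot(self) -> None:
        # Configure the programmable logic device to instantiate the first interface.
        if not self.stored_bitstream:
            raise ValueError("no bitstream available for the first interface")
        self.state = RpuState.INTERFACE_READY

    def load_user_design(self, configuration_bitstream: bytes) -> None:
        # The configuration bitstream is obtained via the first interface,
        # so that interface must already be instantiated.
        if self.state is RpuState.POWER_ON:
            raise RuntimeError("first interface must be instantiated first")
        self.user_design = configuration_bitstream
        self.state = RpuState.USER_DESIGN_READY

    def coprocess(self, kernel, data):
        # Stand-in for co-processing an algorithm, or portion thereof.
        if self.state is not RpuState.USER_DESIGN_READY:
            raise RuntimeError("no user design instantiated")
        return kernel(data)


rpu = RpuBootModel(stored_bitstream=b"transport-interface")
rpu.boot()
rpu.load_user_design(b"user-design")
result = rpu.coprocess(sum, [1, 2, 3])
```

The model only enforces that the first interface exists before a configuration bitstream is obtained through it, and that a user design exists before co-processing is attempted.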
[0036] FIG. 1 is a perspective view block diagram depicting an
exemplary embodiment of a multiprocessor-capable computing system
100. Computing system 100 includes a motherboard 120. Coupled to
motherboard 120 may be one or more dynamic random access memory
("DRAM") modules ("module memory") 104 coupled to motherboard 120
via associated edge connectors 105, such as to provide system
memory. Additionally, motherboard 120 may include one or more
peripheral cards 102 coupled via associated edge connectors
103.
[0037] Motherboard 120 may include one or more microprocessor
sockets 106, which are interconnect compatible with microprocessor
101. Of note, two of the four sockets 106 illustratively shown do
not have any device plugged into them. A microprocessor socket 106
includes an array of holes (not shown for purposes of clarity)
which is to be mated with the pin grid array ("PGA") of a
microprocessor 101. A variety of different PGAs may fit into a
variety of sockets. Alternatively, what is known as a Land Grid
Array ("LGA") may be used. Furthermore, it is not necessary that a
microprocessor 101 be coupled to motherboard 120 via a socket 106,
as microprocessor 101 may be mounted to motherboard 120, by flow or
wave soldering, or other methods of attaching an integrated circuit
chip to a circuit board.
[0038] Likewise, RPU 110 may be coupled to motherboard 120 by a
microprocessor socket 106 configured for a PGA or LGA, or more
directly coupled to motherboard 120 such as by soldering for
example. However, for purposes of clarity by way of example and not
limitation, it shall be assumed that RPU 110 and microprocessor 101
are both coupled to motherboard 120 via respective sockets 106.
[0039] For purposes of clarity by way of example and not
limitation, it shall be assumed that microprocessor 101 is an
Opteron microprocessor available from Advanced Micro Devices
("AMD"). However, it shall be appreciated that any of a variety of
other known types of microprocessors, including other
microprocessors available from AMD, as well as microprocessors
available from Intel, and ARM, among other microprocessor
manufacturers, may be used. Some microprocessor bus architectures
are not designed to allow arbitrary devices to be coupled to them
for direct communication with the microprocessor. Instead, a
bridging device, which is part of the microprocessor chipset, is
used to convert the microprocessor bus or "front side bus" into a
standard bus to which other devices may be attached.
[0040] However, in general, fabric of a microprocessor interface
may be expanded beyond merely using general-purpose
microprocessors. Because an Opteron application is described,
"directly communicating" or "direct communication," including
variations thereof, generally means that a bridge or other
intermediary device need not be used for communicating with a
microprocessor via a microprocessor interface. Motherboard 120 may
include many known components which are omitted here for purposes
of clarity and not limitation. In this example, motherboard 120 may
be a K8SRE(S2891) motherboard from Tyan Computer Corporation;
however, many other known motherboards may be used from this or
other vendors.
[0041] Even though in the example four sockets are shown for
possibly receiving at least one and as many as four RPUs 110, it
should be appreciated that fewer or more microprocessor physical
interfaces ("microprocessor interfaces") 198 may be present as is
known. Each socket 106 of motherboard 120 may have an instance of
an RPU 110 plugged into it. In other words, motherboard 120 need
not have any microprocessor 101 plugged into any of its
microprocessor sockets 106.
[0042] Thus, for example, a high performance computing or server
system ("computing system") may be built with multiple
motherboards, as generally indicated by dots 197, connected by
high-speed buses of a back plane (not shown). In such computing
systems, one or more of such motherboards 120 may have one or more
RPUs 110 without any microprocessor 101. Furthermore, in such
systems other motherboards 120 may have one or more microprocessors
101 without any RPUs 110. Alternatively or additionally, in such
systems, one or more other motherboards 120 may have a combination
of one or more RPUs 110 and one or more microprocessors 101. Again,
for purposes of clarity by way of example and not limitation, a
microprocessor 101 of a computing system 100 with a single
motherboard 120 is described, as any of the other configurations
described shall be understood from the description herein of a
computing system 100 with a single motherboard 120.
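The motherboard populations described above (RPUs only, microprocessors only, or a combination) may be summarized in a short sketch. The function name `classify_motherboard` and the default of four sockets are assumptions for illustration; the four-socket figure follows the example of FIG. 1, and other socket counts may be present.

```python
def classify_motherboard(num_rpus: int, num_microprocessors: int,
                         num_sockets: int = 4) -> str:
    """Classify how the microprocessor sockets of one motherboard are populated.

    Hypothetical helper: num_sockets defaults to the four sockets shown
    in the example; fewer or more may exist on a given motherboard.
    """
    if num_rpus < 0 or num_microprocessors < 0:
        raise ValueError("device counts must be non-negative")
    if num_rpus + num_microprocessors > num_sockets:
        raise ValueError("more devices than available sockets")
    if num_rpus and num_microprocessors:
        return "mixed"
    if num_rpus:
        return "rpu-only"
    if num_microprocessors:
        return "microprocessor-only"
    return "empty"


# The configuration of FIG. 1 as described: two of four sockets left empty.
print(classify_motherboard(num_rpus=1, num_microprocessors=1))
```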
[0043] From the following description, it will be appreciated that
no modification to motherboard 120 need be made in order to
accommodate RPU 110. Thus, RPU 110 may be directly inserted into a
microprocessor socket 106 of motherboard 120. For purposes of
clarity and not limitation, it shall be assumed that a well-known
microprocessor interface for Opteron microprocessors, namely a 940
pin PGA socket defined by AMD, is used. This socket is commonly
referred to as a "940 socket", and again is used by way of example
and not limitation, as any of a variety of known types of
microprocessor interfaces available from AMD and other vendors may
be used. RPU 110 may access system memory, such as module memory
104 via a microprocessor interface associated with microprocessor
socket 106. By providing direct communication between RPU 110 and
microprocessor 101, as well as system memory, via a microprocessor
interface, data rates may be increased over conventional levels,
and latency bottlenecks may be at least substantially reduced by
having RPU 110 carry out the execution of all or portions of
applications, such as complex or specialized algorithms for
example, in programmed programmable logic.
[0044] Application acceleration may be obtained by off-loading
central processing unit ("CPU")-intensive or specialized software
subroutines, or a combination thereof, to RPU 110. RPU 110 may be
dynamically tailored to perform execution of instructions
associated with such CPU intensive or specialized software
subroutines. Thus, one or more applications, rather than being
executed in software, are executed at least in part in hardware,
namely programmable logic programmed to execute all or a portion of
a set of instructions. By executing such instructions in hardware,
such applications may be substantially accelerated as compared with
executing them in software using a general-purpose
microprocessor.
[0045] RPU 110 may be configured to be a special-purpose processor
or co-processor, which may be tailored to an application. Moreover,
because RPU 110 may be reconfigured for any of a variety of
applications, a reconfigurable application specific computing
environment is provided, which may be more economical than
providing an application specific computing environment which is
not reconfigurable. Additionally, because of enhanced data rates
and substantially reduced latency associated with a microprocessor
interface, as compared with for example a peripheral bus, the
ability to configure FPGA 200 of RPU 110 in a substantially reduced
amount of time, as well as the ability to move data at higher
bandwidths with reduced latency, allows for significant performance
advantages. While RPU 110 may be used to provide significant
performance benefits in CPU-intensive applications, such as
computer modeling, computer simulation, computer rendering,
computer synthesis, database searching/sequencing, database
sorting, cryptographic encoding/decoding, and data
compressing/decompressing, among other known CPU-intensive
applications, it should be appreciated that RPU 110 is not limited
to CPU-intensive applications.
[0046] HT links 107, as generally indicated by arrows, provide
electrical continuity within motherboard 120 for an HT interface
for communicating with microprocessor 101. Even though a
rectangular pattern for interconnecting microprocessor sockets 106
is illustratively shown, it should be appreciated that other
configurations of HT links 107, including diagonal, may be used.
Use of microprocessor sockets 106 allows microprocessors 101 as
well as RPUs 110 to be relatively easily removed or added to a
computing system 100. Accordingly, it should be appreciated that
system 100 need not be static in this regard. Thus, if an
application is more dependent upon RPUs 110 than microprocessors
101, microprocessors 101 may be exchanged for RPUs 110, and vice
versa.
[0047] Because motherboards 120 may be manufactured in large
quantities to support more general-purpose computing needs, the
ability to socket RPU 110 to a conventional motherboard 120 without
having to alter the configuration of motherboard 120 facilitates
deployment of RPUs in a variety of existing computing systems. Of
note, use of RPU 110 in some existing systems may involve some
minor changes. For example, Basic Input/Output System ("BIOS")
changes or other programming changes may be involved. Furthermore,
physical changes, such as by setting dip switches for example, may
be involved. However, by using microprocessor interfaces which are
common in computer systems, the number of these minor changes may
be reduced. Thus, having an RPU 110 which is compatible with a
common microprocessor interface leverages the ability of migrating
RPUs 110 to servers and workstations.
[0048] The mechanical and electrical properties associated with at
least a portion of connection locations of a PGA of microprocessor
101 for interfacing to an HT link 107 may be the same as those for
RPU 110. However, RPU 110 need not use all the connections
available to a microprocessor interface via microprocessor socket
106, as RPU 110 may use substantially less than all of the
connections available via microprocessor socket 106. Alternatively,
as described below in additional detail, nearly all of the
available pin locations of a microprocessor socket 106 may be
used.
[0049] Referring now to FIG. 2, there is shown a block diagram
depicting an exemplary embodiment of an RPU 110. RPU 110 includes
FPGA 200, nonvolatile memory 204, and high-speed memory 202, as
well as a controller 203. More particularly for this exemplary
embodiment, nonvolatile memory 204 may be flash memory.
Furthermore, high-speed memory 202 may be static random access
memory ("SRAM") 202, and controller 203 may be complex programmable
logic device ("CPLD") 203. However, it should be appreciated from
the following description that these particular types of
components may be changed. For example, an ASIC may replace CPLD
203. Likewise, read-only memory ("ROM") may replace flash memory
204. Finally, depending on the speed at which high-speed memory 202
is to be accessed, random access memories having slower speeds than
SRAM 202 may be used, such as some forms of dynamic random access
memory ("DRAM"), including reduced latency DRAM ("RLDRAM").
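The component substitutions named above may be captured in a small sketch. The option sets list only the alternatives mentioned in this paragraph and are not exhaustive, since other components may be used; the names `RpuParts` and `is_described_variant` are hypothetical.

```python
from dataclasses import dataclass

# Alternatives named in the description; illustrative, not exhaustive.
CONTROLLER_OPTIONS = {"CPLD", "ASIC"}
NONVOLATILE_OPTIONS = {"flash", "ROM"}
HIGH_SPEED_OPTIONS = {"SRAM", "DRAM", "RLDRAM"}


@dataclass(frozen=True)
class RpuParts:
    """Hypothetical parts list for an RPU alongside its FPGA."""

    controller: str = "CPLD"
    nonvolatile_memory: str = "flash"
    high_speed_memory: str = "SRAM"

    def is_described_variant(self) -> bool:
        # True when every part is one of the alternatives named in the text.
        return (self.controller in CONTROLLER_OPTIONS
                and self.nonvolatile_memory in NONVOLATILE_OPTIONS
                and self.high_speed_memory in HIGH_SPEED_OPTIONS)


default_parts = RpuParts()
variant = RpuParts(controller="ASIC", nonvolatile_memory="ROM",
                   high_speed_memory="RLDRAM")
print(default_parts.is_described_variant(), variant.is_described_variant())
```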
[0050] For example, FPGA 200 may be an XC4VLX60FF668 available from
Xilinx, Inc. Moreover, CPLD 203 may be an XC2C384-7FT256 CPLD
available from Xilinx, Inc. FPGA 200 and CPLD 203 may both be
obtained from Xilinx, Inc., where FPGA 200 and CPLD 203 have
interfaces designed for connecting to one another. The part numbers
above are merely examples of parts that may be used; however, it
should be appreciated that other integrated circuits for each of
the above described chips may be used. For example, other FPGAs or
CPLDs, both those available from Xilinx and those from other vendors,
may be used. Other components of RPU 110, such as resistors,
capacitors, buffers, and oscillators, among others, have been
omitted for purposes of clarity and not limitation.
[0051] With renewed reference to FIG. 1, and continuing reference
to FIG. 2, computing system 100 and RPU 110 are further described.
SRAM 202, FPGA 200, flash memory 204, and CPLD 203 are coupled to a
printed circuit board ("PCB") 298. The opposite side of PCB 298 may
have extending therefrom pins 199 for plugging into a
microprocessor socket 106.
[0052] HT links 107 may be directly coupled with pins 199 for
direct communication with pins of FPGA 200 via PCB 298. However,
SRAM 202 and flash memory 204 are not coupled to a microprocessor
interface 198 associated with microprocessor socket 106, and CPLD
203 is generally not coupled to microprocessor interface 198 other
than the coupling to microprocessor interface 198 for a small
number of control signals. SRAM 202 may be used as an alternate
storage for configuration information or as a memory resource for
an application being executed by RPU 110, or a combination thereof.
However, resources other than SRAM 202 may be used for either or
both of these purposes, and thus SRAM 202 may be optional. Of note,
internal SRAM of FPGA 200 may be used, where FPGA 200 is configured
internally via an Internal Configuration Access Port ("ICAP").
[0053] FPGA 200 of RPU 110 may be put in direct communication with
microprocessor 101 via an HT link 107. There may be more than one
HT link 107, as generally indicated by HT links 107-1 through
107-N, for N a positive integer greater than one (collectively
herein HT links 107). For example, N may be equal to 3, where each
HT link 107 represents a 16-bit wide bus. Collectively, HT links
107 may be considered a microprocessor bus 210.
[0054] FPGA 200 may be directly coupled to HT links 107, and thus
is in direct communication with multiple HT compatible devices,
such as one or more other RPUs or one or more microprocessors, or a
combination thereof. Thus, FPGA 200 may be configured to
communicate with multiple HT link-compatible devices directly via
HT links 107.
[0055] RPU 110 may appear as a non-coherent bus device to
microprocessor 101. For example, RPU 110 may appear as a PCI device
to microprocessor 101. However, in contrast to a PCI device, RPU
110 communicates directly via HT links 107 with microprocessor 101.
Alternatively, another non-coherent bus device interface, such as
RapidIO, HyperTransport, or PCI Express for example, may be used
instead of PCI. Thus, software, or more particularly Application
Program Interfaces ("APIs"), written for PCI may be migrated to RPU
110. As described below in additional detail, this means that
source code, written for example in a high-level programming
language such as C, for a PCI device may be directly converted to a
hardware description language ("HDL") version thereof for
instantiation in programmable logic fabric of FPGA 200 of RPU 110.
However, RPU 110, while appearing as a non-coherent bus device to
microprocessor 101 for purposes of facilitating rapid deployment,
need not appear as a non-coherent bus device. Accordingly, it
should be understood that RPU 110 may be configured to appear as a
coherent bus device to microprocessor 101.
[0056] Furthermore, FPGA 200 may be coupled for direct
communication with module memory 104. Continuing the
above-described example of an AMD Opteron motherboard, AMD 64's
Direct Connect Architecture may be used by RPU 110 not only for
directly communicating with module memory 104, but additionally for
memory mapping a portion of such module memory 104 to RPU 110 as a
primary user thereof. In other words, each microprocessor socket
106 may be associated with a bank of DRAM memory of module memory
104. For an RPU 110 that is plugged into a socket 106, the portion
of module memory 104 associated with that socket becomes dedicated
to such RPU 110. Thus, RPU 110 is capable of directly communicating
with such portion of module memory 104 dedicated thereto,
namely without having to pass through intermediate chips for
bridging or arbitrated busing. Of note, this dedicated portion of
module memory 104 may be used for accelerating an application or
portion thereof being executed by such an RPU 110, as this
dedicated portion of memory provides a substantially high bandwidth
and a substantially low latency. In addition, memory of module
memory 104 associated with other of sockets 106 may be accessed by
means of one or more HT links 107 and one or more microprocessors
101. For these accesses to non-dedicated memory, RPU 110 does not
do any arbitration; rather, such accesses may for example be
arbitrated by a memory controller forming part of microprocessor
101.
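By way of illustration and not limitation, the distinction between
dedicated and non-dedicated portions of module memory 104 may be
sketched in Python as follows. The function name and the integer
socket identifiers are hypothetical and are used only for this
example; they do not describe an actual interface of RPU 110.

```python
def memory_access_path(rpu_socket: int, bank_socket: int) -> str:
    """Describe how an RPU in `rpu_socket` reaches the memory bank
    associated with `bank_socket` (socket numbers are illustrative)."""
    if rpu_socket == bank_socket:
        # Dedicated portion: no bridging chips or arbitrated busing.
        return "direct via memory bus 211"
    # Non-dedicated portion: the RPU does no arbitration itself.
    return ("via HT link 107, arbitrated by the memory controller "
            "of microprocessor 101")


print(memory_access_path(1, 1))
print(memory_access_path(1, 0))
```

The single comparison reflects the description above: only the bank
associated with the socket holding the RPU is reached without an
intermediate microprocessor.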
[0057] FPGA 200 is coupled in this example through microprocessor
socket 106 to HT links 107 and AMD 64's Direct Connect Architecture
for coupling for example to a module of module memory 104 via
memory bus 211. Again, it should be appreciated that performance
may be enhanced by improved throughput and reduced latency when
communicating information to and from RPU 110 via memory bus
211.
[0058] PCB 298 may include an SRAM bus 214, a CPLD/FPGA bus 216,
and a flash memory bus 213. CPLD 203 provides means for
communicating a default configuration from flash memory 204 for
FPGA 200. This default configuration obtained from flash memory 204
is provided to CPLD 203 via flash memory bus 213.
[0059] FIG. 3 is a block diagram depicting an exemplary embodiment
of some of the functional blocks of FPGA 200 after a configuration
thereof. FPGA 200 may have instantiated in programmable logic
thereof a bitstream derived from a CPLD interface core to provide
CPLD interface 350. Additionally, other core derived bitstreams may
be instantiated in programmable logic of FPGA 200 to provide
support functions in addition to CPLD interface 350, as described
below in additional detail. For example an HT core derived
bitstream may be instantiated in FPGA 200 to provide HT interface
301 for communicating with one or more HT links 107. Additionally,
a core derived bitstream may be instantiated in FPGA 200 to provide
arbitration block 302 for addressing and arbitrating communications
with non-dedicated portions of module memory 104 via one or more HT
links 107 and one or more microprocessors 101. Notably, the
non-dedicated portions of module memory 104 may be considered
"system memory" as they are dedicated to one or more
microprocessors 101. Arbitration block 302 may be configured to
support Direct Memory Access ("DMA"). Optionally a core derived
bitstream may be instantiated in FPGA 200 to provide SRAM interface
303 for communicating with SRAM 202. Of note, data may be
communicated to and from SRAM interface 303 or user design 399 for
example via one or more HT links 107 as arbitrated by
DMA/arbitration block 302. Furthermore, of note, rather than SRAM
202, RLDRAM may be used, in which embodiment an RLDRAM interface
303 may be instantiated in programmable logic of FPGA 200. A
portion of SRAM bus 214 may be shared by CPLD 203 and FPGA 200 for
communicating with SRAM 202. Optionally, a core derived bitstream
may be instantiated in FPGA 200 for providing DRAM interface 304
for communicating with a dedicated portion or non-system memory
portion of module memory 104 via memory bus 211.
[0060] HT interface 301, DRAM interface 304, SRAM interface 303,
DMA/arbitration block 302, and CPLD interface 350 (hereinafter
collectively "support functions 300") may be coupled to user
available programmable logic fabric 310 via wrapper interface 305.
Wrapper interface 305 may be configured to provide a substantially
consistent interface coupling one or more of support functions 300
to user available programmable logic fabric 310. For example,
suppose one or more of support functions 300 are to be added or
modified; while such modifications to support functions 300 likely
will involve reconfiguration of user available programmable logic
fabric 310 for instantiation of a user design 399 therein, such
modifications are unlikely to result in having to modify the
interface of user design 399. Thus, by providing a consistent
wrapper interface, effort associated with having to modify user
design 399 may be avoided.
[0061] The physical size of RPU 110, including physical
configuration of PCB 298, may be limited with respect to physical
configuration of a microprocessor and heat sink combination to
avoid neighboring components of motherboard 120. For example, by
limiting the physical size of RPU 110 to the volume conventionally
used by an Opteron heat sink, deployment of RPU 110 is facilitated.
More particularly, AMD has defined the length, width, height, and
mounting hardware for such a heat sink and motherboard
manufacturers adhere to this specification to ensure their
motherboard is compatible with third party heat sinks.
[0062] FIG. 4 is a perspective view depicting an exemplary
alternative embodiment to RPU 110, namely RPU 410. RPU 410 is
generally the same as RPU 110 other than an additional connector
402 is included. Connector 402 may be mated with connector 401 of
daughter card 400. Daughter card 400 may include one or more
additional chips for expanding functionality of RPU 410. For
example, such additional functionality may include one or more of
additional memory or additional HT links. Examples of additional
memory may include flash, SRAM, DRAM, and ROM, among other known
types of memory.
[0063] FIG. 5 is a perspective view block diagram depicting an
exemplary embodiment of an RPU 510. RPU 510 like RPU 110 of both
FIGS. 1 and 2 includes FPGA 200, nonvolatile memory 204, high-speed
memory 202, and CPLD 203. Of note, PCB 598 of RPU 510 has more pins
522 than circuit board 298. For this example, FPGA 200 may be an
XC4VLX200-11FF1513C available from Xilinx, Inc. RPU 510 further
includes sockets 506 for receiving respective DRAM modules 206.
Additional RAM 205, which may be RLDRAM, may be included as part of
RPU 510. Like the description of RPU 110, other support components
for RPU 510 are not described for purposes of clarity and not
limitation.
[0064] First, by having RPU-on-board DRAM modules 206, memory I/O
constraints are reduced, as memory bandwidth is increased. Thus, by
using RPU 510 not only can CPU constrained processes be
accelerated, but additionally such processes may be further
accelerated by lifting memory I/O constraints associated with
accessing module memory 104. Secondly, by having DRAM modules 206
more closely coupled with FPGA 200 than, for example, module memory
104, access performance of memory with reference to modules 206 may
be enhanced.
[0065] FPGA 200 may communicate with DRAM modules 206 via DRAM
interface 304 of FIG. 3. Any of a variety of known types of DRAM
may be used, such as DDR DRAM and RLDRAM for example. The number of
pins 522 used for RPU 510 for coupling to microprocessor socket 106
may be substantially greater than that of RPU 110. The higher pin
count allows for one or more additional HT links 107 and additional
functionality of DRAM interface 304, as well as some additional
control and monitoring signals.
[0066] FIG. 6 is a flow diagram depicting an exemplary embodiment
of a boot flow 600 for RPU 110 or RPU 510. With renewed reference
to FIGS. 1 through 5 and continuing reference to FIG. 6, boot flow
600 is further described.
[0067] At 601, a power acceptable ("OK") signal is obtained for
example from microprocessor 101. This may be a signal which
transitions from a logic low to a logic high state to indicate that
acceptable power levels have been obtained. The power OK signal,
such as power OK signal 290, is provided to CPLD 203. Responsive to
power OK signal 290 being in a logic high state, CPLD 203 is reset
to initiate RPU 110 or RPU 510 configuration. Alternatively, CPLD
203 may have logic that recognizes when power is first applied and
may then configure FPGA 200 automatically with a default
configuration from flash memory 204 without waiting for a power OK
signal 290 to be asserted.
[0068] Accordingly, when power is initially supplied or a
microprocessor reset signal is applied, FPGA 200 may be configured
with a default configuration automatically from flash memory 204.
Additionally, FPGA 200 may be configured with a default
configuration automatically from flash memory 204 if FPGA 200, or
more generally RPU 110 or RPU 510, ceases to properly operate due
to any of a variety of conditions or otherwise exceeds an
environmental operating threshold. Monitor logic is built into FPGA
200 and CPLD 203 which checks for correct operation of FPGA 200.
Monitor logic may be used to initiate reconfiguring with a default
configuration if FPGA 200 or CPLD 203 senses a fault condition.
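By way of illustration and not limitation, the conditions described
above under which a default configuration is loaded from flash memory
204 may be summarized in the following Python sketch. The event names
are hypothetical labels chosen for this example and do not correspond
to actual signal names of CPLD 203 or FPGA 200.

```python
from enum import Enum, auto


class Event(Enum):
    POWER_OK_ASSERTED = auto()     # power OK signal 290 goes logic high
    POWER_FIRST_APPLIED = auto()   # CPLD recognizes initial power-up
    MICROPROCESSOR_RESET = auto()  # microprocessor reset signal applied
    FAULT_DETECTED = auto()        # monitor logic senses a fault
    NORMAL_OPERATION = auto()      # nothing requiring reconfiguration


# Events that, per the description, lead to a default configuration.
DEFAULT_CONFIGURATION_TRIGGERS = frozenset({
    Event.POWER_OK_ASSERTED,
    Event.POWER_FIRST_APPLIED,
    Event.MICROPROCESSOR_RESET,
    Event.FAULT_DETECTED,
})


def should_load_default_configuration(event: Event) -> bool:
    """Return True if `event` triggers loading the default bitstream."""
    return event in DEFAULT_CONFIGURATION_TRIGGERS
```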
[0069] At 602, from an address of flash memory 204, which for
example may be referred to as address 0, a boot sequence is
initiated. Optionally, a CPLD bitstream select input from a pin
associated with microprocessor socket 106 or microprocessor
interface 198 may additionally be used. This bitstream select pin
(not shown) may be used to cause CPLD 203 to load an alternative
configuration bitstream out of flash memory 204. This alternative
configuration bitstream may start at a different address than the
start address of the boot sequence so as to avoid confusion with a
primary default configuration bitstream. The alternative default
configuration bitstream may be used for example in the event that
the primary default configuration bitstream becomes corrupted or
for providing an alternative default configuration for FPGA
200.
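By way of illustration and not limitation, selection between the
primary and alternative default configuration bitstreams may be
sketched in Python as follows. The description states only that the
alternative bitstream starts at a different address than the boot
sequence; the particular alternative address below is an assumption
made for this example.

```python
PRIMARY_START_ADDRESS = 0x000000      # "address 0" in the description
ALTERNATIVE_START_ADDRESS = 0x800000  # hypothetical; any distinct address


def boot_start_address(bitstream_select: bool) -> int:
    """Return the flash address from which the CPLD begins reading.

    `bitstream_select` models the optional bitstream select pin: when
    asserted, the alternative default configuration is loaded.
    """
    if bitstream_select:
        return ALTERNATIVE_START_ADDRESS
    return PRIMARY_START_ADDRESS
```

Keeping the two start addresses distinct is what avoids confusion
between the primary and alternative bitstreams.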
[0070] CPLD 203 may use flash memory bus 213 (which is used to write
and read information to and from flash memory 204 under control of
CPLD 203) to read a configuration bitstream from flash memory 204
for providing to a select map interface of FPGA 200 via SRAM bus
214. Alternatively or additionally, a dedicated configuration bus
212 of FIG. 10 may be used for configuration and other
communication between FPGA 200 and CPLD 203. A configuration
bitstream may thus be provided from flash memory 204 to CPLD 203
and then to a select map port of FPGA 200 via a dedicated
configuration bus 212 of FIG. 10. Additionally, there may be
dedicated configuration RAM 205 of FIG. 10 connected in parallel
with flash memory 204. Of note, functions of reconfiguration and
user memory are not shared by the same SRAM device in RPU 510 as
described with reference to RPU 110.
[0071] Asynchronous flash reads may be relatively slow in
comparison to communication between SRAM 202 and SRAM interface 303
via SRAM bus 214. Reconfiguration, in contrast to an initial
or start-up default configuration, may be more time sensitive for
supporting ongoing operations, for example real-time processing.
Accordingly, one or more reconfiguration bitstreams 281 may be
loaded into SRAM 202 from memory accessible via motherboard 120.
Alternatively, reconfiguration may be done from flash memory 204,
and thus the one or more configuration bitstreams 280 stored in
flash memory 204 may include one or more reconfiguration
bitstreams. Again, flash memory 204 may be accessed via CPLD 203
for writing information thereto, although this may be done at a
slower rate as compared to writing to SRAM 202.
[0072] At 603, FPGA 200 is configured with a default configuration.
This default configuration pattern is sufficient to operate HT
interface 301. This means that microprocessor 101 may recognize
FPGA 200 for communication via one or more HT links 107. HT
interface 301 may then be used to transfer data to flash memory 204
under control of CPLD 203. Flash memory 204 may contain a default
FPGA configuration bitstream 280 for instantiation in programmable
logic of support functions 300. Thus, such default configuration
bitstream 280 may be sufficient to operate HT interface 301, as
well as one or more of SRAM interface 303, DRAM interface 304, or
DMA/arbitration block 302.
[0073] As previously described, CPLD 203 initially configures FPGA
200 using a select map port of FPGA 200 (not shown for FPGA 200).
Flash memory 204 and CPLD 203 may be initially loaded with a
default configuration before being soldered onto or otherwise
coupled to PCB 298 of RPU 110 or PCB 598 of RPU 510. Flash memory
204 and CPLD 203 may be reloaded while FPGA 200 is operating by
transferring new or additional configuration data over HT interface
301. However, flash memory 204 generally provides semi-permanent
storage for a default FPGA configuration bitstream which is
generally changed infrequently. Furthermore, CPLD 203 provides
basic support functions for RPU 110 or RPU 510 and likewise is
generally changed infrequently.
[0074] Optionally, for purposes of verification, blocks of data
stored in flash memory 204 read out to CPLD 203 may be compared
against supposed equivalent blocks of data loaded into SRAM 202.
Thus, SRAM 202 may be used as a buffer to load in what should be an
equivalent configuration or reconfiguration bitstream for
comparison with a configuration or reconfiguration bitstream in
flash memory 204. Furthermore, SRAM 202 may be used as buffer
memory for loading a configuration or reconfiguration bitstream
into flash memory 204 under control of CPLD 203.
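By way of illustration and not limitation, the optional block-wise
verification described above may be sketched in Python as follows.
The function name and the block size of 256 bytes are assumptions
made for this example; the description does not specify a block size
or a particular comparison routine.

```python
from typing import List


def mismatched_blocks(flash_data: bytes, sram_data: bytes,
                      block_size: int = 256) -> List[int]:
    """Return the indices of blocks that differ between two images.

    `flash_data` models data read out of flash memory 204 and
    `sram_data` models the supposedly equivalent data in SRAM 202.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    length = max(len(flash_data), len(sram_data))
    differing = []
    for index, offset in enumerate(range(0, length, block_size)):
        flash_block = flash_data[offset:offset + block_size]
        sram_block = sram_data[offset:offset + block_size]
        if flash_block != sram_block:
            differing.append(index)
    return differing
```

An empty result indicates that the two bitstream images are
equivalent; a length difference is reported as a mismatch in the
affected blocks.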
[0075] SRAM 202 may be read from or written to under control of
CPLD 203. This may be at a lower speed than with respect to
communication with FPGA 200 via SRAM interface 303. However, for a
runtime reconfiguration of FPGA 200, SRAM 202 may be loaded with a
reconfiguration bitstream from an HT link 107. FPGA 200 may then
inform CPLD 203 to initiate a reconfiguration from a configuration
bitstream in SRAM 202.
[0076] For a Xilinx CPLD and FPGA pair, there may be a dedicated set of
signals for configuration and communication between CPLD 203 and
FPGA 200. These signals include the capability to transfer data and
addresses between FPGA 200 and CPLD 203, to allow FPGA 200 to
indicate to CPLD 203 when a configuration cycle has completed, and to
pass a power OK signal 290 to FPGA 200, among other operations
consistent with the description herein. Moreover, CPLD 203 may
include an address register and a configuration register in
accordance with the description herein.
[0077] FIG. 7 is a flow diagram depicting an exemplary embodiment
of a configuration flow 700. Configuration flow 700 is described
with continuing reference to FIG. 7 and with renewed reference to
FIGS. 1 through 5. At 701, microprocessor 101 transfers or causes
transfer of a configuration bitstream over HT bus 210 for writing
to FPGA 200 of RPU 110 or RPU 510. This configuration bitstream may
include a user design 399 for instantiation in user available
programmable logic fabric 310. Additionally or alternatively, this
configuration bitstream may include additional or revised
definitions for one or more of support functions 300.
[0078] At 702, FPGA 200 saves the configuration bitstream obtained
at 701. The configuration bitstream obtained may be saved for
example in on board SRAM or DRAM, such as using memory interfaces
303 or 304, respectively. If, however, full reconfiguration of FPGA
200 is to be performed, the configuration bitstream is generally
saved in SRAM 202 as configuration bitstream 281. For full
reconfiguration, configuration data may be lost when DRAM interface
304 ceases to operate during the configuration process. SRAM 202
may be controlled using CPLD 203 instead of SRAM interface 303 in
FPGA 200, so configuration data is retained while FPGA 200 is being
reprogrammed with configuration bitstream 281. Once SRAM interface
303 is instantiated in FPGA 200 responsive to reconfiguration,
optionally control may be transferred from CPLD 203 to SRAM
interface 303 to speed up reconfiguration.
[0079] Operations at 701 and 702 may overlap one another for
concurrently obtaining a configuration bitstream and then saving
the configuration bitstream as it is being obtained. This may save
time, in particular when fully configuring FPGA 200 as the amount
of configuration data may be substantial. For partial
reconfiguration, less time may be saved by having operations at 701
and 702 overlap one another.
[0080] At 703, microprocessor 101 uses HT bus 210 to send FPGA
200 an address of the configuration bitstream stored in memory at
702. Additionally at 703, microprocessor 101 sends a command to FPGA
200 of RPU 110 or RPU 510 to reconfigure itself. This command
indicates whether to perform a partial reconfiguration or a full
reconfiguration. At 704, this command may be interpreted by FPGA
200 as to whether partial or full reconfiguration is to be
performed for initiating the reconfiguration.
[0081] During partial reconfiguration, one or more support
functions 300 may remain active, for example when configuration
data transferred over HT bus 210 to FPGA 200 is only to configure
or reconfigure a user design in user available programmable logic
fabric 310. This is interpreted as a partial reconfiguration, which
consumes significantly less time than a full reconfiguration. Data
for partial reconfiguration may be saved in DRAM 206 or SRAM 202.
Optionally, a configuration bitstream may be stored in internal RAM
of FPGA 200 when doing partial reconfiguration. Since FPGA 200 is
not completely erased and continues to operate during partial
reconfiguration, downloading and reconfiguration may proceed in
parallel. Additionally, modifications to one or more of support
functions 300 other than HT interface 301 may be considered for
partial reconfiguration depending on one or more of the application
and the extent of the modifications.
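By way of illustration and not limitation, the storage locations
described above for a configuration bitstream may be tabulated in
Python as follows. The dictionary and function names are hypothetical;
the entries merely restate the description, in which a bitstream for
full reconfiguration is generally saved in SRAM 202 because that
memory remains under control of CPLD 203 while FPGA 200 is erased.

```python
from typing import Tuple

# Where a bitstream may be held, keyed by kind of reconfiguration.
STORAGE_OPTIONS = {
    "full": ("SRAM 202",),
    "partial": ("DRAM 206", "SRAM 202", "internal RAM of FPGA 200"),
}


def storage_options(kind: str) -> Tuple[str, ...]:
    """Return the memories in which a bitstream of `kind` may be saved."""
    try:
        return STORAGE_OPTIONS[kind]
    except KeyError:
        raise ValueError(f"unknown reconfiguration kind: {kind!r}")
```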
[0082] When RPU 110 or RPU 510 is used to accelerate computational
algorithms, frequent reconfiguration may be involved, and thus
reconfiguration time becomes a limiting factor in determining the
amount of acceleration that may be obtained. Accordingly, partial
reconfiguration may be used for such applications.
[0083] Partial reconfiguration at 705 may involve FPGA 200 loading
a partial reconfiguration bitstream into internal memory of FPGA
200 for reconfiguration using an ICAP (not shown for FPGA 200).
Thus, dedicated hardware resources of FPGA 200 may be used for
reading and passing such partial reconfiguration bitstream to
program configuration memory associated with user available
programmable logic fabric 310 to partially reconfigure or
instantiate a user design 399. After loading of reconfiguration
data is complete, new or revised logic functions specified by the
partial reconfiguration data become active and may be used.
[0084] If full reconfiguration is determined at 704, then at 706
CPLD 203 takes over control of SRAM 202 and erases programmable
logic RAM of FPGA 200. Thereafter, CPLD 203 transfers or causes
the transfer of a full set of reconfiguration data to FPGA 200.
This is similar to boot flow 600 of FIG. 6, except that the
reconfiguration data comes from SRAM 202 under control of CPLD 203
instead of flash memory 204. Alternatively a default configuration
could be initiated as previously described with reference to boot
flow 600 of FIG. 6.
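By way of illustration and not limitation, the branch at 704 of
configuration flow 700 between partial reconfiguration at 705 and full
reconfiguration at 706 may be sketched in Python as follows. The
enumeration and function names are hypothetical, and the returned
strings are only a paraphrase of the operations described above.

```python
from enum import Enum
from typing import List


class Reconfiguration(Enum):
    PARTIAL = "partial"
    FULL = "full"


def reconfiguration_steps(kind: Reconfiguration) -> List[str]:
    """Return the operations performed for the commanded reconfiguration."""
    if kind is Reconfiguration.PARTIAL:
        return [
            "705: FPGA 200 loads the partial bitstream into internal memory",
            "705: ICAP programs configuration memory of user fabric 310",
            "705: new or revised logic functions become active",
        ]
    if kind is Reconfiguration.FULL:
        return [
            "706: CPLD 203 takes over control of SRAM 202",
            "706: CPLD 203 erases programmable logic RAM of FPGA 200",
            "706: CPLD 203 transfers full bitstream from SRAM 202 to FPGA 200",
        ]
    raise ValueError(f"unsupported reconfiguration: {kind!r}")


for step in reconfiguration_steps(Reconfiguration.FULL):
    print(step)
```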
[0085] For a user design 399 of FIG. 3 instantiated in programmable
logic, whether by full reconfiguration or partial reconfiguration,
it should be appreciated that such user design may be used to
accelerate execution of an application. For example, microprocessor
101 may hand off to RPU 110 or 510 an algorithm, or portion thereof,
of an application for co-processing by RPU 110 or 510. Thus, a result
of co-processing may be output from RPU 110 or 510, such as for
microprocessor 101, in substantially less time than if the
co-processing were done using another microprocessor.
[0086] FIG. 8 is a flow diagram depicting an exemplary embodiment
of a configuration bitstream generation flow 800. A purpose of RPUs
as described herein is accelerating computational algorithms. These
algorithms are typically described in a high-level computer
language, such as C for example. Unfortunately, the C language is
designed to execute on a sequential processor, such as for example
the Opteron from AMD or the Pentium from Intel.
[0087] Using an FPGA-based co-processor directly to execute an
algorithm described in the C language would thus offer little or no
acceleration since it would not utilize parallelism that may be
instantiated in the programmable logic of an FPGA. Advantages of an
FPGA-based co-processor as compared to a sequential processor are
the degree of parallelism and the amount of memory bandwidth that
may be implemented. In order to use FPGA 200 more effectively to
accelerate performance, the high-level computer language
description of a user's design, such as for a computational
algorithm, may be translated into an HDL, such as VHDL or Verilog,
listing at 801. Tools are available from companies, such as
Celoxica, that do this translation. Additionally, there are
variations of the C language, such as for example unified parallel
C ("UPC"), in which some parallelism is made visible to the user. A
user design in one of such dialects of C may translate into a
higher performing design when instantiated in FPGA 200 than the
same user design described in the more ubiquitous C language.
[0088] At 802, a constraints file with constraints is generated for
the user design. These constraints include both physical and timing
constraints. Physical constraints may be used to ensure that user
design 399 to be instantiated in user available programmable logic
fabric 310 connects correctly and does not conflict with support
functions 300. Timing constraints may be used to estimate the
operating speed of user design 399 after instantiation in user
available programmable logic fabric 310 and may be used to prevent
potential timing problems, such as race conditions for example.
[0089] At 803, the HDL listing from 801 is synthesized into a
circuit/network listing ("netlist"). Synthesis at 803 converts the
user design from an HDL description to a netlist of FPGA
primitives. Synthesis at 803 is guided by constraints in the
constraints file obtained at 802, such as to at least meet
performance targets. The Xilinx tool XST may be used for this
synthesis.
[0090] At 804, the netlist for a user design obtained at 803 is
combined with a netlist for pre-designed support functions 300 and
a netlist for associated pre-designed wrapper interface 305.
Support functions 300 and wrapper interface 305 netlists may be
combined together, and thus are hereinafter collectively referred
to as a support functions netlist. The support functions netlist
may have a pre-assigned fixed placement in FPGA 200. This
pre-assigned fixed placement facilitates combining the support
functions netlist with the user design netlist without affecting
operation of wrapper interface 305 and support functions 300.
Furthermore, sections of the support functions 300 may be
substantially sensitive to timing, and correct operation may be
promoted by a pre-assigned fixed placement. Accordingly, optionally
the support functions netlist may have a predetermined and fixed
routing other than with respect to connecting to the user's
design.
[0091] At 805, the combined netlist obtained at 804 is placed and
routed with the support functions netlist. Placement and routing is
performed by the appropriate FPGA software tools. These are
available from the FPGA vendor. Constraints in the constraints file
generated at 802 guide the placement and routing to ensure that
target performance and functionality parameters are met.
[0092] At 806, a full or partial configuration bitstream for FPGA
200 is generated. This is performed by a tool supplied by the FPGA
vendor. The configuration bitstream is then ready for download into
FPGA 200. Of note, overlap with a default core configuration may be
excluded from the instantiation of the configuration bitstream.
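By way of illustration and not limitation, the ordering of operations
801 through 806 of configuration bitstream generation flow 800 may be
sketched in Python as follows. This sketch simplifies the flow to a
linear chain, whereas the constraints file generated at 802 actually
guides both synthesis at 803 and placement and routing at 805. The
placeholder tools below are hypothetical stand-ins and do not invoke
any vendor software.

```python
from typing import Callable, Dict, List, Tuple

FLOW_800: List[Tuple[int, str]] = [
    (801, "translate high-level description into an HDL listing"),
    (802, "generate constraints file (physical and timing)"),
    (803, "synthesize HDL listing into a netlist"),
    (804, "combine user netlist with support functions netlist"),
    (805, "place and route the combined netlist"),
    (806, "generate full or partial configuration bitstream"),
]


def run_flow(source: str, tools: Dict[int, Callable[[str], str]]) -> str:
    """Apply the tool registered for each operation, in flow order."""
    artifact = source
    for operation, _description in FLOW_800:
        artifact = tools[operation](artifact)
    return artifact


# Placeholder tools that only record which operation has been applied.
placeholder_tools = {
    operation: (lambda artifact, op=operation: f"{artifact}->{op}")
    for operation, _description in FLOW_800
}

print(run_flow("design.c", placeholder_tools))
```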
[0093] FIG. 9 is a block diagram depicting an exemplary embodiment
of bank allocation 900. Bank allocation 900 is for the embodiment of
RPU 510 of FIG. 5. Bank allocation of RPU 110, which is a subset of
bank allocation for RPU 510, shall be understood from the following
description of bank allocation 900 for RPU 510.
[0094] FPGA 200 is divided up into banks of pins. Bank allocation
is used to group pins with similar I/O characteristics into FPGA
banks. FPGAs have a fixed number of pin groups or banks available,
where all pins in a bank have the same I/O voltage levels and
conform to a similar I/O standard. In addition, pins from the same
block of support functions 300 may be physically grouped together
to minimize or reduce the distance that signals within the block travel.
In addition, certain groups of pins within a block of support
functions 300 may include a clock pin in the same bank of FPGA 200
due to routing limitations within FPGA 200. Bank allocation 900 is
particular to the above-referenced FPGA part from Xilinx, Inc., and
thus other bank allocations may vary depending on the FPGA
selected.
[0095] Bank 5, bank 9, and a portion of bank 13 may be used for
communication with one of DRAMs 206, and bank 7, bank 11, and a
portion of bank 15 may be used for communication with another of
DRAMs 206. A portion of bank 13 and bank 3 may be used for
accessing an RLDRAM, which may be used instead of SRAM 202.
Likewise, another portion of bank 15 and bank 4 may be used for
another of such RLDRAMs. These RLDRAMs are illustratively shown in
FIG. 5 as separate SRAMs 202.
[0096] Portions of banks 1, 6, and 10 may be used for an HT-2 link,
and remaining portions of banks 1, 6, and 10 may be used for DRAM
interface 304. A portion of bank 2 and a portion of bank 8 may be
used for an HT-1 link, and remaining portions of banks 2 and 8 may
likewise be used for DRAM interface 304. Portions of banks 12, 14,
and 16 may be used for an HT-0 link, and remaining portions of
banks 12, 14, and 16 may be used for DRAM interface 304.
[0097] However, regional clock pins are used in all banks except in
banks 1 through 4. Furthermore, bank 0 may be used for JTAG access
and other control signals. Additionally, bank 1 may be used for
CPLD control signals as well as JTAG signals.
[0098] Accordingly, it should be appreciated that no HT link shares
any bank with any other HT link. This facilitates modularity in
instantiating one or more HT links via HT interface 301.
Accordingly, individual HT links may be brought up or down without
affecting other HT links. Likewise, DRAM 206 busing may be coupled
to two separate sets of banks to facilitate modularity of design
for instantiating support for separate DRAMs 206 in DRAM interface
304. Furthermore, SRAM or RLDRAM 202 busing may be coupled to two
separate sets of banks to facilitate modularity of design for
instantiating support for separate SRAMs or RLDRAMs 202 in SRAM
or RLDRAM interface 303. Lastly, motherboard DRAM interfacing does
not share any bank with any other memory of RPU 510 to facilitate
modularity of design for instantiating support for mapping to
separate DRAMs or portions thereof of module memory 104 in DRAM
interface 304.
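By way of illustration and not limitation, the property that no HT
link shares any bank with any other HT link may be checked with the
following Python sketch. The bank numbers are taken from the
description of bank allocation 900 above; the function name is
hypothetical. Portions of these banks are also used for DRAM interface
304, which this sketch does not model.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

# Banks used, in whole or in part, by each HT link in bank allocation 900.
HT_LINK_BANKS: Dict[str, Set[int]] = {
    "HT-0": {12, 14, 16},
    "HT-1": {2, 8},
    "HT-2": {1, 6, 10},
}


def shared_banks(
    allocation: Dict[str, Set[int]],
) -> Dict[Tuple[str, str], Set[int]]:
    """Return the banks shared by each pair of groups, if any."""
    conflicts = {}
    for (name_a, banks_a), (name_b, banks_b) in combinations(
            sorted(allocation.items()), 2):
        common = banks_a & banks_b
        if common:
            conflicts[(name_a, name_b)] = common
    return conflicts


print(shared_banks(HT_LINK_BANKS))  # {} because the link banks are disjoint
```

An empty result corresponds to the modularity described above, in
which individual HT links may be brought up or down without affecting
other HT links.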
[0099] FIG. 10 is a block diagram depicting an exemplary embodiment
of an RPU 1000. RPU 1000 includes FPGA (field-programmable gate
array) 200, RLDRAMs 202a-d, CPLD 203, flash memory 204 and RAM 205,
along with other components such as resistors, capacitors, power
converters, buffers and oscillators which have been omitted for
clarity. In one embodiment, FPGA 200 is an XC4VLX200-10FF1513C
available from Xilinx, Inc.; although, there are numerous FPGAs
available from Xilinx and other vendors such as Altera which would
also be suitable. According to one embodiment, RLDRAMs 202a-202d
are MT49H16M18HT-33 parts from Micron Technology corporation, CPLD
203 is an XC2C384-7FTG256 from Xilinx, Inc., flash memory 204 is a
RC28F256P30B85 from Intel corporation and RAM 205 is a
MT45W8MW16BGX-708WT from Micron Technology. In each case, there are
numerous alternative components which could be used instead of
those listed here.
[0100] FPGA 200 is connected through bus 211 and microprocessor
socket 106 to motherboard module memory 104. It is also connected
through bus 210 and socket 106 to motherboard microprocessor 101.
In one embodiment, bus 210 is an HT bus capable of one or more HT
links 107 of FIG. 2. HT bus 210 has high bandwidth and low latency
characteristics and is available on microprocessor 101. Other buses
such as PCI, PCI Express or RapidIO could be used instead with the
appropriate motherboard components for providing a microprocessor
interface associated with a microprocessor socket 106. HT bus 210
may thus form a direct connection between microprocessor 101 and
RPU 1000 without passing through any intermediate chips or buses.
This direct connection may be used to increase throughput and
reduce latency when transferring data to and from RPU 1000.
[0101] On motherboards that support multiple HT buses or links,
there may be several HT buses 210 connected to the same or
different microprocessors 101 or to other motherboard components.
In one embodiment, microprocessor socket 106 and FPGA 200 support
up to three 16-bit HT buses.
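As a rough illustration of what several 16-bit HT buses provide in
aggregate, the following sketch computes raw bandwidth per direction
from the link count, the link width and a transfer rate. The function
name and the transfer rate are hypothetical; no transfer rate is
stated in this description, so it is a required parameter, and
protocol overhead is ignored.

```python
def ht_raw_bandwidth(links: int, width_bits: int,
                     transfers_per_second: float) -> float:
    """Raw bytes per second, per direction, summed over all links.

    Assumes every link has the same width and transfer rate, and
    ignores packet and protocol overhead.
    """
    if not 1 <= links <= 3:
        # The embodiment described above supports up to three HT buses.
        raise ValueError("expected between one and three HT buses")
    if width_bits <= 0 or transfers_per_second <= 0:
        raise ValueError("width and transfer rate must be positive")
    # Each transfer moves width_bits bits on each link.
    return links * width_bits * transfers_per_second / 8


# Purely illustrative rate of 1e9 transfers per second per link.
one_link = ht_raw_bandwidth(1, 16, 1e9)
three_links = ht_raw_bandwidth(3, 16, 1e9)
```

With the illustrative rate above, one 16-bit link moves 2e9 bytes per
second in each direction and three links move 6e9.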
[0102] FPGA 200 connects to RLDRAMs 202a-d. RLDRAMs 202a-d are
divided into two banks with two RLDRAMs in each bank. These two
banks are supported by separate sets of banks of pins of FPGA 200
as described with reference to FIG. 9. The two banks are connected
to FPGA 200 via memory buses 214a and 214b. RLDRAM devices are used
in place of SRAM in one embodiment because they provide a
combination of large capacity, low latency and high bandwidth.
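The bank arrangement above can be sketched as a small lookup table.
Two points in this sketch are assumptions rather than statements of
this description: the pairing of RLDRAMs 202a and 202b on memory bus
214a and of RLDRAMs 202c and 202d on memory bus 214b, and the reading
of the MT49H16M18 part number as 16 Mi words of 18 bits each.

```python
# Assumed organization read from the part number MT49H16M18.
DEPTH_WORDS = 16 * 2**20   # "16M" words per device (assumption)
WIDTH_BITS = 18            # "18" bits per word (assumption)

# Assumed pairing of devices to memory buses; the description states
# only that each of the two banks holds two RLDRAMs.
BANKS = {
    "214a": ["202a", "202b"],
    "214b": ["202c", "202d"],
}


def device_bits() -> int:
    """Raw storage of one RLDRAM in bits."""
    return DEPTH_WORDS * WIDTH_BITS


def bank_bits(bus: str) -> int:
    """Raw storage of all RLDRAMs on one memory bus, in bits."""
    return len(BANKS[bus]) * device_bits()


def bus_for(device: str) -> str:
    """Return the memory bus that a given RLDRAM is attached to."""
    for bus, devices in BANKS.items():
        if device in devices:
            return bus
    raise KeyError(device)


total_bits = sum(bank_bits(bus) for bus in BANKS)
```

Under these assumptions each device holds 301,989,888 bits, each bank
twice that, and the four devices together four times that.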
[0103] FPGA 200 is connected to CPLD 203 via dedicated
configuration bus 212 and CPLD/FPGA bus 216. CPLD 203 additionally
connects to flash memory 204 and RAM 205 via memory bus 213.
[0104] CPLD 203, along with flash memory 204 and RAM 205, may be
used to configure FPGA 200. Stored
data to configure FPGA 200 may come either from flash memory 204 or
RAM 205.
[0105] Flash memory 204 may be used to contain configuration data
that is infrequently changed or is retained when RPU 1000 is
powered off. In contrast, RAM 205 may be used for configuration
data that changes frequently. For example, a system where RPU 1000
is used to accelerate different mathematical algorithms at
different times may involve the use of RAM 205 to enhance
performance over the use of flash memory 204. In this type of
system, configuration data may be transferred from microprocessor
101 over HT bus 210 through FPGA 200, over CPLD/FPGA bus 216, then
through CPLD 203 and over memory bus 213 to RAM 205. Such
configuration data may be stored in RAM 205, for example at least
until it is used to reconfigure FPGA 200. During reconfiguration,
the stored
configuration data is transferred from RAM 205 over memory bus 213
to CPLD 203. CPLD 203 then reconfigures FPGA 200 over configuration
bus 212. There are many ways to configure FPGA 200 including serial
configuration, select map configuration with any of a variety of
widths, and JTAG configuration. Select map configuration is
described herein with respect to the exemplary embodiments;
however, other configuration routes may be used in accordance with
the description herein.
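The two data paths described above, staging configuration data into
RAM 205 and then reconfiguring FPGA 200 from a stored copy, can be
sketched as a toy software model. The class RpuModel and its method
names are hypothetical and are not part of any embodiment. The hop
lists are transcribed from the description above, and the path from
flash memory 204 is assumed to use memory bus 213 in the same way as
the path from RAM 205.

```python
from dataclasses import dataclass, field

# Hops for staging configuration data into RAM 205.
STAGE_PATH = ("microprocessor 101", "HT bus 210", "FPGA 200",
              "CPLD/FPGA bus 216", "CPLD 203", "memory bus 213",
              "RAM 205")
# Hops for reconfiguring FPGA 200 from RAM 205.
RECONFIG_PATH = ("RAM 205", "memory bus 213", "CPLD 203",
                 "configuration bus 212", "FPGA 200")


@dataclass
class RpuModel:
    """Toy model of the configuration stores of RPU 1000."""
    flash: bytes = b""        # flash memory 204, infrequently changed
    ram: bytes = b""          # RAM 205, frequently changed
    fpga_config: bytes = b""  # data last loaded into FPGA 200
    log: list = field(default_factory=list)

    def stage_to_ram(self, data: bytes) -> None:
        """Store new configuration data in RAM 205 for later use."""
        self.ram = data
        self.log.append(" -> ".join(STAGE_PATH))

    def reconfigure(self, source: str = "ram") -> None:
        """Load FPGA 200 from the chosen store via CPLD 203."""
        stores = {"ram": ("RAM 205", self.ram),
                  "flash": ("flash memory 204", self.flash)}
        if source not in stores:
            raise ValueError("source must be 'ram' or 'flash'")
        name, data = stores[source]
        if not data:
            raise RuntimeError("no configuration data in " + name)
        self.fpga_config = data
        # Assumed: flash data follows the same hops after the store.
        self.log.append(" -> ".join((name,) + RECONFIG_PATH[1:]))


rpu = RpuModel(flash=b"default-design")
rpu.reconfigure("flash")          # infrequently changed design
rpu.stage_to_ram(b"fft-design")   # design swapped in at run time
rpu.reconfigure("ram")
```

In this sketch a design that changes often is staged once into RAM 205
and then loaded through CPLD 203, while the copy in flash memory 204
remains available as a stored default.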
[0106] While the foregoing describes exemplary embodiment(s) in
accordance with one or more aspects of the invention, other and
further embodiment(s) in accordance with the one or more aspects of
the invention may be devised without departing from the scope
thereof, which is determined by the claim(s) that follow and
equivalents thereof. For example, even though separate integrated
circuits have been illustratively shown for purposes of
implementing an RPU, it should be appreciated that an RPU as
described herein may be integrated as a single chip. Claim(s)
listing steps do not imply any order of the steps. Trademarks are
the property of their respective owners.
* * * * *