U.S. patent application number 13/342157 was published by the patent office on 2012-07-05 for a flexible multi-processing system.
This patent application is currently assigned to Intellectual Ventures I LLC. The invention is credited to Dominik J. Schmidt and Robert Warren Sherburne, Jr.
Publication Number | 20120173864 |
Application Number | 13/342157 |
Family ID | 32710478 |
Publication Date | 2012-07-05 |
United States Patent Application | 20120173864 |
Kind Code | A1 |
Schmidt; Dominik J.; et al. | July 5, 2012 |
FLEXIBLE MULTI-PROCESSING SYSTEM
Abstract
A processor includes a scalar computation unit; a vector
co-processor coupled to the scalar computation unit; and one or
more function-specific engines coupled to the scalar computation
unit, the engines adapted to minimize data exchange penalties by
processing small in-out bit slices.
Inventors: | Schmidt; Dominik J. (Stanford, CA); Sherburne, Jr.; Robert Warren (Kentfield, CA) |
Assignee: | Intellectual Ventures I LLC, Wilmington, DE |
Appl. No.: | 13/342157 |
Filed: | January 2, 2012 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
10184402 | Jun 28, 2002 | 8090928 |
13342157 | | |
Current U.S. Class: | 713/100 |
Current CPC Class: | G06F 9/3887 20130101; G06F 15/8076 20130101; G06F 9/3897 20130101; H04M 2201/34 20130101; G06F 9/30036 20130101; G06F 9/3877 20130101 |
Class at Publication: | 713/100 |
International Class: | G06F 9/06 20060101 G06F009/06 |
Claims
1-20. (canceled)
21. An apparatus, comprising: a computation unit configured to
execute control software to process data for wireless transmission,
wherein the control software includes a plurality of function calls
respectively corresponding to a plurality of digital signal
processing functions; a plurality of dedicated hardware engines
configured to perform respective ones of the plurality of digital
signal processing functions; and a switch fabric coupled to the
computation unit and to the plurality of dedicated hardware
engines, the switch fabric being configured to perform operations
including reconfiguring the apparatus from processing data for
transmission via a first wireless protocol to processing data for
transmission via a second wireless protocol.
22. The apparatus of claim 21, wherein the computation unit is
configured to cause at least first and second dedicated hardware
engines of the plurality of dedicated hardware engines to:
responsive to ones of the plurality of function calls included in
the control software, perform the corresponding digital signal
processing functions; and subsequently send information to the
computation unit.
23. The apparatus of claim 21, wherein the computation unit
comprises: a scalar computation unit configured to provide
configuration settings to the plurality of dedicated hardware
engines at a beginning of a computation sequence; and a vector
co-processor configured to perform parallel computational
operations.
24. The apparatus of claim 23, wherein the scalar computation unit
is further configured to cause the vector co-processor to perform
the parallel computational operations in response to identifying the
parallel computational operations in the control software.
25. The apparatus of claim 21, wherein the first wireless protocol
is a cellular radio protocol, and wherein the second wireless
protocol is a short-range wireless protocol.
26. The apparatus of claim 21, wherein the plurality of dedicated
hardware engines are implemented as part of an integrated
circuit.
27. The apparatus of claim 21, wherein one or more of the plurality
of dedicated hardware engines is configured to be reconfigured from
use with the first wireless protocol to use with the second
wireless protocol by changing configuration settings.
28. The apparatus of claim 21, wherein the plurality of dedicated
hardware engines includes at least two different dedicated hardware
engines selected from the group consisting of: a convolutional
decoding engine, a modulation engine, a transform engine, an error
correction engine, and a cryptographic engine.
29. A wireless device, comprising: a radio frequency (RF) front end
configured to receive an RF signal from an antenna; and a logic
portion coupled to the RF front-end, the logic portion including: a
computation unit configured to execute control software to process
data packets for transmission via a wireless protocol, wherein the
control software includes a plurality of function calls
respectively corresponding to a plurality of digital signal
processing functions; a plurality of dedicated hardware engines
configured to perform respective ones of the plurality of digital
signal processing functions; and a switch fabric coupled to the
computation unit and to the plurality of dedicated hardware
engines, the switch fabric being configured to perform operations
including reconfiguring the logic portion from processing data for
transmission via a first wireless protocol to processing data for
transmission via a second wireless protocol.
30. The wireless device of claim 29, wherein the reconfiguring the
logic portion includes disabling one or more of the plurality of
dedicated hardware engines without disabling other ones of the
plurality of dedicated hardware engines.
31. The wireless device of claim 29, wherein the plurality of
dedicated hardware engines includes a modulation engine and a
cryptographic engine.
32. The wireless device of claim 29, further comprising: a cellular
radio core configured to support transmission via the first
wireless protocol; and a short-range wireless transceiver core
configured to support transmission via the second wireless
protocol.
33. The wireless device of claim 29, wherein the plurality of
dedicated hardware engines includes an orthogonal frequency
division multiplexing (OFDM) engine.
34. The wireless device of claim 29, wherein the plurality of
dedicated hardware engines includes a transform engine and an error
correction engine.
35. The wireless device of claim 34, wherein the transform engine
is an FHT engine and the error correction engine is a CRC
engine.
36. The wireless device of claim 29, wherein the computation unit
is configured to provide configuration and parametric settings to
the plurality of dedicated hardware engines at a beginning of a
computation sequence.
37. The wireless device of claim 29, wherein the plurality of
dedicated hardware engines are implemented in one
application-specific integrated circuit.
38. The wireless device of claim 29, wherein the wireless device is
configured to communicate one or more of the data packets in
parallel via at least two different wireless protocols.
39. An apparatus, comprising: first means for executing control
software to process data packets for transmission via a wireless
protocol, wherein the control software includes a plurality of
function calls respectively corresponding to a plurality of digital
signal processing functions; a plurality of dedicated hardware
engines configured to perform respective ones of the plurality of
digital signal processing functions; and second means for
reconfiguring the apparatus from processing data for transmission
via a first wireless protocol to processing data for transmission
via a second wireless protocol, the reconfiguring including
disabling one or more of the plurality of dedicated hardware
engines without disabling remaining ones of the plurality of
dedicated hardware engines.
40. The apparatus of claim 39, wherein the plurality of dedicated
hardware engines are implemented as part of a single integrated
circuit.
Description
BACKGROUND
[0001] The present invention relates to a flexible processing
system.
[0002] Advances in computer technology have provided high
performance, miniaturized computers that are inexpensive. Even with
these impressive achievements, manufacturers are constantly looking
for improvements in areas such as user-friendliness and
connectivity so that users can be productive any time anywhere.
Wireless communications networks offer the user such capabilities.
However, the speed and computational robustness of present-day
wireless communications systems leave much to be desired.
[0003] In response, the industry is adopting new technologies such
as 802.11A, GPRS and EDGE wireless networking technologies that
drive transparent connections between all computing,
communications, audio and video devices. 802.11A transceivers
operate in the 5 GHz band at data rates of up to 54 Mbps, in
contrast to the 2.4 GHz band and the 11 Mbps maximum data rate of
802.11B transceivers.
[0004] General Packet Radio Service (GPRS) brings packet data
connectivity to the Global System for Mobile Communications (GSM)
market. GPRS integrates GSM and Internet Protocol (IP) technologies
and is a bearer for different types of wireless data applications
with bursty data, especially WAP-based information retrieval and
database access. GPRS packet-switched data technology makes
efficient use of radio and network resources. Session set-up is
nearly instantaneous, while higher bit rates enable convenient
personal and business applications. Consequently, GPRS not only
makes wireless applications more usable, but also opens up a
variety of new applications in personal messaging and wireless
corporate intranet access.
[0005] EDGE stands for Enhanced Data rates for Global Evolution.
EDGE is the result of a joint effort between TDMA operators,
vendors and carriers and the GSM Alliance to develop a common set
of third-generation wireless standards that support high-speed
modulation. EDGE is a major component in the UWC-136 standard that
TDMA carriers have proposed as their third-generation standard of
choice. Using existing infrastructure, EDGE technology enables data
transmission speeds of up to 384 kilobits per second.
[0006] The new standards such as 802.11A, EDGE and GPRS achieve
increased transmission throughput by using complex digital signal
processing algorithms, many of which require high processing power
exceeding that offered by today's baseband processors.
[0007] One way to increase processing power is to perform
computations in parallel using hardwired, dedicated processors that
are optimized for one particular radio frequency (RF) protocol.
Although highly effective when geared to handle one RF protocol,
this approach is relatively inflexible and cannot be easily
switched to handle today's multi-mode cellular telephones that need
to communicate with a plurality of RF protocols.
[0008] Another way to increase processing power is to perform
computations in parallel using general-purpose processors. Although
flexible in programmability, such an approach may not provide the
highest possible computational power that may be needed when
performing digital signal processing for specific wireless
applications such as 802.11A or GPRS applications.
[0009] Yet another approach uses reconfigurable logic computer
architectures that include an array of programmable logic and
programmable interconnect elements. The elements can be configured
and reconfigured by the end user to implement a wide range of logic
functions and digital circuits and to implement custom
algorithm-specific circuits that accelerate the execution of the
algorithm. High levels of performance are achieved because the
gate-level customizations made possible with FPGAs result in an
extremely efficient circuit organization that uses customized
data-paths and "hardwired" control structures. These circuits
exhibit significant fine-grained, gate-level parallelism that is
not achievable with programmable, instruction-based technologies
such as microprocessors or supercomputers. This makes such
architectures especially well suited to applications requiring the
execution of multiple computations during the processing of a large
amount of data. A basic reconfigurable system consists of two
elements: a reconfigurable circuit resource of sufficient size and
complexity, and a library of circuit descriptions (configurations)
that can be down-loaded into the resource to configure it. The
reconfigurable resource would consist of a uniform array of
orthogonal logic elements (general-purpose elements with no fixed
functionality) that would be capable of being configured to
implement any desired digital function. The configuration library
would contain the basic logic and interconnect primitives that
could be used to create larger and more complex circuit
descriptions. The circuit descriptions in the library could also
include more complex structures such as counters, multiplexers,
small memories, and even structures such as controllers, large
memories and microcontroller cores. For example, U.S. Pat. No.
5,784,636, issued to Rupp on Jul. 21, 1998, discusses a reconfigurable
processor architecture using a programmable logic structure called
an Adaptive Logic Processor (ALP). The Rupp structure is similar to
an extendible field programmable gate array (FPGA) and is optimized
for the implementation of program specific pipeline functions,
where the function may be changed any number of times during the
progress of a computation. A Reconfigurable Pipeline Instruction
Control (RPIC) unit is used for loading the pipeline functions into
the ALP during the configuration process and coordinating the
operations of the ALP with other information processing structures,
such as memory, I/O devices, and arithmetic processing units.
Multiple components having the Rupp reconfigurable architecture may
be combined to produce high-performance parallel processing systems
based on the Single Instruction Multiple Data (SIMD) architecture
concept.
SUMMARY
[0010] A processor includes a scalar computation unit; a vector
co-processor coupled to the scalar computation unit; and one or
more function-specific engines coupled to the scalar computation
unit, the engines adapted to minimize data exchange penalties by
processing small in-out bit slices.
[0011] Implementations of the system may include one or more of the
following. The hardware blocks have their own local memory and rely
on the scalar processor only for configuration and parametric
settings at the beginning of each computation sequence. The vector
co-processor performs computationally intensive operations, as
`functions` within the software algorithm implementation. The
hardware blocks act as subroutines, expanding the data flow locally
to achieve high throughput without a large bus-capacitance penalty.
The frequency of the hardware and processor can be scaled from the
baseline crystal frequency to a maximum operating frequency. Each
hardware block has a synchronized switch, so that the block can be
turned off without affecting the delay to the other blocks. The
switch adds the same delay whether or not the hardware block is
on. A flexible analog interface can provide a varying
bit-width and sampling frequency. The analog interface also handles
variable filtering, DC offset compensation and I/Q mismatch
compensation, such that the processing load can be shared among the
digital and analog elements. This allows the use of
direct-conversion radios as well as the more traditional
super-heterodyne radios. The specific hardware subroutines can be
re-used from protocol to protocol by changing the input parameters
and the clock frequency.
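The engine behavior described above (local memory, configuration by the scalar processor at the start of a sequence, a synchronized on/off switch with a fixed delay, and re-use through new parameters) can be sketched as a small behavioral model. It is written in Python for brevity and is not the disclosed hardware: the `HardwareEngine` class, the `SWITCH_DELAY_CYCLES` value and the example `scale_kernel` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

SWITCH_DELAY_CYCLES = 2  # assumed fixed latency of the synchronized switch

Kernel = Callable[[List[int], Dict[str, int]], List[int]]


@dataclass
class HardwareEngine:
    """Behavioral model of one function-specific engine (hypothetical)."""
    name: str
    kernel: Kernel
    enabled: bool = True
    params: Dict[str, int] = field(default_factory=dict)
    local_memory: List[int] = field(default_factory=list)

    def configure(self, params: Dict[str, int]) -> None:
        # Written once by the scalar unit at the start of a computation sequence.
        self.params = dict(params)

    def process(self, samples: List[int]) -> Tuple[List[int], int]:
        # Data is copied into the engine's local memory, not streamed over a bus.
        self.local_memory = list(samples)
        if self.enabled:
            out = self.kernel(self.local_memory, self.params)
        else:
            out = list(self.local_memory)  # switched off: data passes through
        # The switch adds the same delay whether the engine is on or off.
        return out, SWITCH_DELAY_CYCLES


def scale_kernel(x: List[int], p: Dict[str, int]) -> List[int]:
    """Example kernel: fixed-point gain, re-usable with different parameters."""
    return [(v * p["gain"]) >> p["shift"] for v in x]


engine = HardwareEngine("scaler", scale_kernel)
engine.configure({"gain": 3, "shift": 1})
out_on, delay_on = engine.process([2, 4, 6])    # [3, 6, 9]
engine.enabled = False
out_off, delay_off = engine.process([2, 4, 6])  # [2, 4, 6]
```

Re-using the same engine for another protocol would, in this model, only require a new `configure` call with different parameters.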
[0012] Advantages of the system may include one or more of the
following. The system uses a RISC-like architecture with a vector
co-processor and an extensive library of engines or
function-specific hardware blocks. The engines perform vector
operations, but they are not generic arithmetic units. Rather, they
aggregate several specific multiply, add, and compare operations to
perform a high-level function such as the FFT. This is advantageous
because simple control software for the RISC controller can be
written in ANSI-C without the need for complex DSP or VLIW
languages, and the engines or hardware blocks can be turned on and
off as simple subroutines within embedded code. The RISC controller can also run
upper layer protocol stacks. This allows for hardware re-use, since
the same processor will process initial packet data and also
provide the necessary configuration parameters to the vector
processor.
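To illustrate how an engine can aggregate multiplies and adds into a higher-level function such as the FFT, and how control software can invoke it like a subroutine, the following sketch uses a textbook radix-2 decimation-in-time FFT. It is written in Python rather than the ANSI-C mentioned above, assumes a power-of-two input length, and uses the hypothetical names `fft_engine` and `control_software`; the disclosure does not specify the internal structure of any FFT engine.

```python
import cmath
from typing import List


def fft_engine(x: List[complex]) -> List[complex]:
    """Radix-2 decimation-in-time FFT built only from butterflies
    (one complex multiply plus an add and a subtract)."""
    n = len(x)
    if n == 1:
        return list(x)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    even = fft_engine(x[0::2])
    odd = fft_engine(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle multiply
        out[k] = even[k] + t                            # butterfly add
        out[k + n // 2] = even[k] - t                   # butterfly subtract
    return out


def control_software(samples: List[complex]) -> int:
    """Hypothetical control routine: call the engine, then do simple
    scalar post-processing (find the strongest frequency bin)."""
    spectrum = fft_engine(samples)
    return max(range(len(spectrum)), key=lambda k: abs(spectrum[k]))


tone = [cmath.exp(2j * cmath.pi * 3 * n / 8) for n in range(8)]
peak_bin = control_software(tone)  # 3
```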
[0013] Most of the implementation is in hardware, which has the
highest computing power density (MIPS/mW/cm2). The RISC engine is
small, and the Vector co-processor is also small. By implementing
many of the instructions and subroutines in hardware, code size can
be limited, thereby reducing the embedded SRAM instruction memory.
New protocols can be implemented by adding new hardware accelerator
blocks (RAKE, correlator, etc.) and simply scaling the process
generation (milliwatts/Megahertz). The system's bus-less design
gives significant power savings since the bus capacitance does not
need to switch with every cycle.
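The power argument for the bus-less design can be made concrete with the standard first-order expression for CMOS dynamic switching power. This is general textbook knowledge rather than a figure from this disclosure: α is the fraction of cycles in which a node switches, C the switched capacitance, V_DD the supply voltage and f the clock frequency, and the second expression applies the same model to a bus alone.

```latex
P_{\mathrm{dyn}} = \alpha \, C \, V_{DD}^{2} \, f
\qquad
P_{\mathrm{bus}} = \alpha_{\mathrm{bus}} \, C_{\mathrm{bus}} \, V_{DD}^{2} \, f
```

Under this model, a large bus capacitance that no longer toggles every cycle (a smaller α_bus) contributes proportionally less power; the actual savings depend on activity factors that the disclosure does not quantify.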
[0014] A high-performance, low-overhead wireless communication
system that expands the functionality and capabilities of a computer
system is provided. The system effectively combines
multiple components required to implement cellular radio, 802.11A
and/or Bluetooth™ into a single integrated circuit device. The
complete integration of components greatly reduces manufacturing
costs. Another benefit is that a single-chip solution
results in much lower communication overhead than
prior-art multiple-chip card systems. The system provides for fast,
easy migration of existing designs to high performance, high
efficiency single chip solutions. Many elements of the LAN and WAN
architecture are the same and can be re-used. For example, the
Gaussian filter is used both in GSM communication and in Bluetooth
communication. Similarly, the MLSE decoder and convolutional
decoder are present in almost every wireless protocol, so they can
be used without resource duplication. The system provides a
combination of software/DSP/ASIC resources that are globally and
transparently `alterable` and that can be scaled to provide vast
processing power to handle the requirements of RF digital signal
processing.
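The Gaussian-filter re-use mentioned above can be illustrated with a single tap generator parameterized by the bandwidth-time product BT: GSM's GMSK uses BT = 0.3 and Bluetooth's GFSK uses BT = 0.5, so one block could serve both by changing an input parameter. The sketch uses the standard sampled Gaussian impulse response; the function name `gaussian_taps`, the 4 samples per symbol and the 4-symbol span are illustrative assumptions, not values from the disclosure.

```python
import math
from typing import List


def gaussian_taps(bt: float, sps: int = 4, span: int = 4) -> List[float]:
    """Sampled Gaussian pulse-shaping filter normalized to unit DC gain.

    bt:   bandwidth-time product of the filter
    sps:  samples per symbol period (assumed)
    span: filter length in symbol periods (assumed)
    """
    n_taps = span * sps + 1
    mid = (n_taps - 1) / 2
    taps = []
    for i in range(n_taps):
        t = (i - mid) / sps  # time in symbol periods, centered on zero
        taps.append(math.exp(-2 * math.pi ** 2 * bt ** 2 * t ** 2 / math.log(2)))
    total = sum(taps)
    return [h / total for h in taps]


gsm_gmsk_taps = gaussian_taps(bt=0.3)        # GSM GMSK
bluetooth_gfsk_taps = gaussian_taps(bt=0.5)  # Bluetooth GFSK
```

Only the `bt` argument differs between the two calls, which mirrors the parameter-based re-use described in the summary.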
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] The accompanying drawings, which are incorporated in and
form a part of this specification, illustrate embodiments of the
invention and, together with the description, serve to explain the
principles of the invention:
[0016] FIG. 1 is a block diagram of a single chip processor.
[0017] FIG. 2 shows an exemplary vector engine of the processor.
[0018] FIG. 3 shows an exemplary scalar engine of the processor.
DESCRIPTION
[0019] Reference will now be made in detail to the preferred
embodiments of the invention, examples of which are illustrated in
the accompanying drawings. While the invention will be described in
conjunction with the preferred embodiments, it will be understood
that they are not intended to limit the invention to these
embodiments. On the contrary, the invention is intended to cover
alternatives, modifications and equivalents, which may be included
within the spirit and scope of the invention as defined by the
appended claims. Furthermore, in the following detailed description
of the present invention, numerous specific details are set forth
in order to provide a thorough understanding of the present
invention. However, it will be obvious to one of ordinary skill in
the art that the present invention may be practiced without these
specific details. In other instances, well-known methods,
procedures, components, and circuits have not been described in
detail so as not to unnecessarily obscure aspects of the present
invention.
[0020] FIG. 1 shows a block diagram of a processing system that
supports a multi-mode wireless communicator device. The
processing system includes a scalar computation unit, a vector
co-processor coupled to the scalar computation unit; and one or
more function-specific engines coupled to the scalar computation
unit and the vector co-processor. The function-specific engines are
adapted to minimize data exchange penalties by processing small
in-out bit slices. In the processing system, an instruction memory
10 communicates with a vector co-processor 20. Vector co-processor
20 receives data from a vector register file 22. The vector
processor 20 also communicates with a Reconfigurable Switch Fabric
44. Also in communication with the Reconfigurable Switch Fabric 44
is a Scalar Processor 30. The Scalar Processor 30 receives
instructions from the Instruction Memory 10 and a Scalar Vector
Register File 24. The Scalar Processor 30, Vector Co-processor 20
and Reconfigurable Switch Fabric 44 communicate with a Cache Memory
32, which in turn communicates with a Memory Controller 34. The
Memory Controller writes to a Buffer 38, which can be a FIFO output
buffer. The Memory Controller 34 also receives inputs from a buffer
36 such as a FIFO input. The FIFO input 36 and FIFO output 38
communicate with an intelligent analog subsystem 40. The Memory
Controller 34 in turn controls a DRAM main memory 42.
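The FIFO buffers 36 and 38 decouple the analog subsystem's sample timing from the memory controller. A minimal sketch of such a buffer as a fixed-depth ring buffer follows; the class name `RingFifo`, the depth of 4 and the reject-when-full policy are assumptions for illustration, since the disclosure does not describe the FIFO internals.

```python
from typing import List, Optional


class RingFifo:
    """Fixed-depth first-in, first-out buffer implemented as a ring buffer."""

    def __init__(self, depth: int) -> None:
        self.slots: List[Optional[int]] = [None] * depth
        self.head = 0   # index of the oldest stored sample
        self.count = 0  # number of samples currently stored

    def push(self, sample: int) -> bool:
        """Store a sample from the producer side (e.g. the ADC).
        Returns False when full; the caller must stall or drop the sample."""
        if self.count == len(self.slots):
            return False
        self.slots[(self.head + self.count) % len(self.slots)] = sample
        self.count += 1
        return True

    def pop(self) -> Optional[int]:
        """Remove the oldest sample for the consumer side (e.g. the memory
        controller). Returns None when empty."""
        if self.count == 0:
            return None
        sample = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return sample


fifo = RingFifo(depth=4)
accepted = [fifo.push(s) for s in range(5)]  # [True, True, True, True, False]
drained = [fifo.pop() for _ in range(4)]     # [0, 1, 2, 3]
```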
[0021] In accordance with the present invention, the processing
system of FIG. 1 that supports a multi-mode wireless communicator
device can include an analog portion integrated on the substrate
(e.g. the intelligent analog subsystem 40). The analog portion can
include a radio frequency (RF) front-end adapted to receive an RF
signal from an antenna, and an analog to digital converter (ADC)
coupled to the RF front-end to digitize the RF signal.
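The digitizing step, together with the flexible bit-width of the analog interface described in the summary, can be sketched as a uniform quantizer whose word length is a parameter. This is a simplified model under stated assumptions: `quantize`, the full-scale value of 1.0 and the clip-to-range behavior are hypothetical, and a real converter adds sampling, noise and filtering effects that are not modeled here.

```python
from typing import List


def quantize(samples: List[float], bits: int, full_scale: float = 1.0) -> List[int]:
    """Map analog sample values to signed integer codes of the given width.

    Values are scaled so that +/-full_scale maps to +/-(2**(bits-1) - 1);
    anything outside that range is clipped to the nearest valid code.
    Note that Python's round() rounds exact halves to the even integer.
    """
    max_code = (1 << (bits - 1)) - 1
    min_code = -(1 << (bits - 1))
    codes = []
    for s in samples:
        code = round(s / full_scale * max_code)
        codes.append(max(min_code, min(max_code, code)))
    return codes


analog = [0.0, 0.25, -1.0, 2.0]
codes_8bit = quantize(analog, bits=8)  # [0, 32, -127, 127]
codes_4bit = quantize(analog, bits=4)  # [0, 2, -7, 7]
```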
[0022] The Reconfigurable Switch Fabric 44 also communicates with a
plurality of function-specific blocks. For example, the
Reconfigurable Switch Fabric communicates with a Viterbi Block 46,
OFDM Block 48, GMSK Block 50, Scrambler Block 52, Viterbi Block
54, FHT Block 56, Mapper Block 58, CRC Block 60, and AES Block
62.
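One way to picture the Reconfigurable Switch Fabric 44 changing protocols is as a table of the blocks each protocol needs, with blocks outside the active set switched off and shared blocks left running. The sketch below is hypothetical: the `PROFILES` contents (which blocks 802.11a or GSM would use) are illustrative guesses rather than assignments from the disclosure, and `SwitchFabric` is an invented name.

```python
from typing import Dict, Iterable, Set

# Hypothetical protocol profiles: which blocks of FIG. 1 each mode switches on.
PROFILES: Dict[str, Set[str]] = {
    "802.11a": {"OFDM48", "Scrambler52", "Viterbi46", "Mapper58", "CRC60", "AES62"},
    "GSM": {"GMSK50", "Viterbi46", "CRC60"},
}

ALL_BLOCKS = ["Viterbi46", "OFDM48", "GMSK50", "Scrambler52", "Viterbi54",
              "FHT56", "Mapper58", "CRC60", "AES62"]


class SwitchFabric:
    def __init__(self, blocks: Iterable[str]) -> None:
        self.enabled: Dict[str, bool] = {b: False for b in blocks}

    def reconfigure(self, protocol: str) -> Set[str]:
        """Enable only the blocks the protocol needs and switch the rest off.
        Returns the blocks whose on/off state changed."""
        wanted = PROFILES[protocol]
        changed = {b for b, on in self.enabled.items() if on != (b in wanted)}
        for b in self.enabled:
            self.enabled[b] = b in wanted
        return changed

    def active(self) -> Set[str]:
        return {b for b, on in self.enabled.items() if on}


fabric = SwitchFabric(ALL_BLOCKS)
fabric.reconfigure("802.11a")
changed = fabric.reconfigure("GSM")
# Viterbi46 and CRC60 are shared, so they stay on and are not in `changed`.
```

Returning only the changed blocks reflects the idea that some engines can be disabled without disturbing the others.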
[0023] Referring now to FIG. 2, an exemplary implementation of the
Vector Processor 20 of FIG. 1 is detailed. The Vector Processor 20
includes a Vector Register File 22. Further, the Vector Register
File 22 communicates with a plurality of Blocks 65. Each Block 65
includes a multiplier 66, which communicates with an accumulator
68. The accumulator 68 also receives data from the Vector Register
File 64. The output of the accumulator 68 is provided to a
multiplexor 76. One input to the multiplexor 76 is a Logic
Operation Block 70; another input to the multiplexor 76 is a Shifter
74. The multiplexor 76 in turn communicates with a Cross Bar 78,
which communicates with a multiplexor 80, which in turn
communicates with a Second Cross Bar 82.
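The data flow of FIG. 2 through one Block 65 (multiplier into accumulator, a multiplexor choosing among the accumulator, logic and shifter results, then a crossbar routing lanes) can be modeled behaviorally as follows. The sketch assumes 16-bit two's-complement lanes with wrap-around, an AND as the logic operation and a left shift as the shifter operation; the `select` names and the `perm` routing list are hypothetical.

```python
from typing import List


def to_s16(value: int) -> int:
    """Wrap an integer to a signed 16-bit two's-complement lane value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def lane(a: int, b: int, acc: int, select: str, shift: int = 0) -> int:
    """One Block 65: compute all candidate results, then let the mux pick one."""
    mac = to_s16(acc + a * b)     # multiplier 66 feeding accumulator 68
    logic = to_s16(a & b)         # logic operation block 70 (AND assumed)
    shifted = to_s16(a << shift)  # shifter 74 (left shift assumed)
    return {"mac": mac, "logic": logic, "shift": shifted}[select]  # mux 76


def vector_step(va: List[int], vb: List[int], vacc: List[int],
                select: str, perm: List[int], shift: int = 0) -> List[int]:
    """Run every lane in lock-step (SIMD), then route results through the
    crossbar: output lane i takes the result of lane perm[i]."""
    results = [lane(a, b, c, select, shift) for a, b, c in zip(va, vb, vacc)]
    return [results[src] for src in perm]  # cross bar 78


reversed_mac = vector_step([1, 2, 3, 4], [10, 20, 30, 40], [0, 0, 0, 0],
                           select="mac", perm=[3, 2, 1, 0])  # [160, 90, 40, 10]
overflow = vector_step([300], [200], [0], select="mac", perm=[0])  # [-5536]
```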
[0024] Referring now to FIG. 3, an embodiment of the Scalar
Processor 30 is detailed. In this embodiment, an adder 84 receives
data from a program counter register (PCR) 86. The PCR 86
communicates with an Instruction Memory Block 88. The Instruction
Memory also communicates with a Destruction Coder 90 whose output
is provided to a decoder 92. The Instruction Memory 88 also
communicates with a Register File 24, whose output is provided to
Buffers 96 and 97. The outputs of the buffers 96 and 97 are provided
to a Multiplexor 98, Logic Operation Block 101 and Shifter 103,
respectively. The outputs of the Demultiplexor 98, Logic Operation
Block 101 and Shifter 103 are provided to a Multiplexor 105, which
in turn drives buffer Blocks 107 and 109. Blocks 107 and 109 in
turn communicate with a Data Memory Block 111. Blocks 107 and 109
and Data Memory 111 also communicate with a Demultiplexor 113, which
in turn communicates with a Buffer 115 whose output is looped back
to the Register File 94.
[0025] The scalar processor is used for flow control. The vector
processor is used for parallel computation of vector operations.
Applications of vector operations include DCT, FFT, convolution, FIR
filtering, etc. At every cycle the processor fetches a new
instruction, which can be of either scalar or vector type. Scalar and
vector instructions are intermixed in the same program. Vector
instructions are executed in SIMD (single-instruction,
multiple-data) mode. Both the scalar and the vector processors are
pipelined. This processor should be easy to implement in 0.18 micron
CMOS technology.
[0026] The scalar instructions include:
[0027] ADD
[0028] SUB
[0029] AND
[0030] OR
[0031] XOR
[0032] LSHIFT
[0033] RSHIFT
[0034] JMP
[0035] BEQ
[0036] BNE
[0037] LDI
[0038] LOAD
[0039] STORE
The vector instructions include:
VADD (vector add)
VSUB (vector subtract)
VMUL (vector multiply)
VMADD (vector multiply-add)
VSHIFT
VAND
VOR
VXOR
VLOAD
VSTORE
[0040] The data path of the scalar processor is 32 bits wide. The
data path of the vector processor is 16 bits wide (or the width of
the A/D word).
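The intermixing of scalar and vector instructions, with a 32-bit scalar path and 16-bit vector lanes, can be sketched as a small interpreter for a subset of the instructions listed above. The tuple encoding, register counts, four-lane width, operand order and wrap-around (rather than saturating) arithmetic are all assumptions made for illustration; the disclosure does not define an instruction encoding. The example program keeps flow control scalar (a BNE loop counter) while the arithmetic is SIMD (VMADD).

```python
from typing import Dict, List, Tuple

Instr = Tuple  # hypothetical encoding: (opcode, operand, ...)


def wrap32(value: int) -> int:
    """Scalar data path: unsigned 32-bit wrap-around (assumed)."""
    return value & 0xFFFFFFFF


def wrap16(value: int) -> int:
    """Vector lane: signed 16-bit wrap-around (assumed)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def run(program: List[Instr], memory: Dict[int, List[int]], lanes: int = 4):
    r = [0] * 8                          # scalar register file
    v = [[0] * lanes for _ in range(4)]  # vector register file
    pc = 0
    while pc < len(program):
        op, *args = program[pc]          # fetch and decode
        pc += 1                          # default: fall through
        if op == "LDI":
            r[args[0]] = wrap32(args[1])
        elif op == "ADD":
            r[args[0]] = wrap32(r[args[1]] + r[args[2]])
        elif op == "SUB":
            r[args[0]] = wrap32(r[args[1]] - r[args[2]])
        elif op == "BNE":                # branch if r[a] != r[b]
            if r[args[0]] != r[args[1]]:
                pc = args[2]
        elif op == "JMP":
            pc = args[0]
        elif op == "VLOAD":              # vector register <- memory block
            v[args[0]] = [wrap16(x) for x in memory[args[1]]]
        elif op == "VSTORE":             # memory block <- vector register
            memory[args[1]] = list(v[args[0]])
        elif op == "VADD":
            v[args[0]] = [wrap16(a + b) for a, b in zip(v[args[1]], v[args[2]])]
        elif op == "VMADD":              # vd += va * vb in every lane
            d, a, b = args
            v[d] = [wrap16(acc + x * y) for acc, x, y in zip(v[d], v[a], v[b])]
        else:
            raise ValueError(f"opcode {op} not modeled in this sketch")
    return r, v, memory


program = [
    ("LDI", 1, 3),       # r1 = iteration count (scalar)
    ("LDI", 2, 0),       # r2 = 0, loop exit value
    ("LDI", 3, 1),       # r3 = 1, decrement
    ("VLOAD", 0, 100),   # v0 = samples
    ("VLOAD", 1, 200),   # v1 = coefficients
    ("VMADD", 2, 0, 1),  # index 5: v2 += v0 * v1 (vector)
    ("SUB", 1, 1, 3),    # r1 -= 1 (scalar)
    ("BNE", 1, 2, 5),    # loop to index 5 while r1 != 0
    ("VSTORE", 2, 300),  # memory[300] = v2
]
regs, vregs, mem = run(program, {100: [1, 2, 3, 4], 200: [5, 6, 7, 8]})
# mem[300] == [15, 36, 63, 96]
```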
[0041] In one implementation, the processor of FIG. 1 is
implemented in an integrated CMOS device with radio frequency (RF)
circuits, including a cellular radio core, a short-range wireless
transceiver core, and a sniffer, alongside digital circuits,
including a reconfigurable processor (such as the core of FIG. 1),
a high-density memory array core, and a router. The high-density
memory array core can include various memory technologies such as
flash memory and dynamic random access memory (DRAM), among others,
on different portions of the memory array core.
[0042] In another implementation, a `pipeline` architecture is
achieved by linking the processors in series and performing
differing operations on each (this is more suitable for processing
GPRS data) and then switching to a parallel implementation for
high-speed standards. The general-purpose cores have granular
control over clock speeds, which can be multiples of the master
clock to achieve synchronous operation to allow precise control
over the processors.
[0043] Additionally, dedicated hardware can be provided to handle
specific algorithms more efficiently than the processing cores. The
number of active processors is controlled depending on the
application, so that power is not used when it is not needed. This
embodiment does not rely on complex clock control methods to
conserve power, since the individual clocks are not run at high
speed, but rather the unused processor is simply turned off when
not needed.
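The clocking policy of the two preceding paragraphs (core clocks that are integer multiples of the master clock, and unused processors switched off rather than slowed down) can be sketched as a selection function. The function `select_multiplier`, the core names and the 13 MHz master clock in the example are assumptions for illustration; the disclosure gives neither a master-clock frequency nor a selection rule.

```python
import math
from typing import Dict


def select_multiplier(required_hz: float, master_hz: float, max_multiplier: int) -> int:
    """Return the smallest integer k with k * master_hz >= required_hz.

    Keeping every core clock an integer multiple of the master clock keeps the
    cores synchronous; k == 0 means the core is not needed and is switched off.
    """
    if required_hz <= 0:
        return 0
    k = math.ceil(required_hz / master_hz)
    if k > max_multiplier:
        raise ValueError("required rate exceeds the maximum core clock")
    return k


MASTER_HZ = 13e6  # assumed master clock for illustration
demand = {"gprs_pipeline": 20e6, "ofdm_parallel": 50e6, "idle_core": 0.0}
plan: Dict[str, int] = {core: select_multiplier(hz, MASTER_HZ, 8)
                        for core, hz in demand.items()}
# plan == {"gprs_pipeline": 2, "ofdm_parallel": 4, "idle_core": 0}
```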
[0044] Through the router, the multi-mode wireless communicator
device can detect and communicate with any wireless system it
encounters at a given frequency. The router performs the switch in
real time through an engine that keeps track of the addresses of
where the packets are going. The router can send packets in
parallel through two or more separate pathways. For example, if a
Bluetooth™ connection is established, the router knows which
address it is looking at and will be able to immediately route
packets using another connection standard. In doing this, the
router, working with the RF sniffer, periodically scans its radio
environment (`ping`) to decide on the optimal transmission medium. The
router can send some packets in parallel through both the primary
and secondary communication channels to make sure some of the
packets arrive at their destinations.
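The router behavior, ranking the media found by the periodic scan and sending some packets over both the primary and the secondary channel, can be sketched as follows. The link-quality scores, the `duplicate_every` policy and the function names are hypothetical; the disclosure does not say how the optimal medium is chosen or which packets are duplicated.

```python
from typing import Dict, List, Tuple


def choose_links(scan: Dict[str, float]) -> Tuple[str, str]:
    """Rank the media found by the periodic scan (higher score is better)
    and return the best two as (primary, secondary)."""
    ranked = sorted(scan, key=scan.get, reverse=True)
    if len(ranked) < 2:
        raise ValueError("redundant routing needs at least two media")
    return ranked[0], ranked[1]


def route(packets: List[bytes], scan: Dict[str, float],
          duplicate_every: int = 1) -> Dict[str, List[bytes]]:
    """Send every packet on the primary medium and copy every
    `duplicate_every`-th packet onto the secondary medium as well."""
    primary, secondary = choose_links(scan)
    routing: Dict[str, List[bytes]] = {primary: [], secondary: []}
    for i, pkt in enumerate(packets):
        routing[primary].append(pkt)
        if i % duplicate_every == 0:
            routing[secondary].append(pkt)
    return routing


scan = {"bluetooth": 0.4, "802.11a": 0.9, "gprs": 0.6}  # hypothetical scores
routing = route([b"p0", b"p1", b"p2", b"p3"], scan, duplicate_every=2)
# routing == {"802.11a": [b"p0", b"p1", b"p2", b"p3"], "gprs": [b"p0", b"p2"]}
```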
[0045] The processor controls the cellular radio core and the
short-range wireless transceiver core to provide a seamless
dual-mode network integrated circuit that operates with a plurality
of distinct and unrelated communications standards and protocols
such as Global System for Mobile Communications (GSM), General
Packet Radio Service (GPRS), Enhanced Data Rates for GSM Evolution
(EDGE) and Bluetooth™. The cell phone core provides wide area
network (WAN) access, while the short-range wireless transceiver
core supports local area network (LAN) access. The reconfigurable
processor core has embedded read-only memory (ROM) containing
software such as IEEE 802.11, GSM, GPRS, EDGE, and/or Bluetooth™
protocol software, among others.
[0046] Although specific embodiments of the present invention have
been illustrated in the accompanying drawings and described in the
foregoing detailed description, it will be understood that the
invention is not limited to the particular embodiments described
herein, but is capable of numerous rearrangements, modifications,
and substitutions without departing from the scope of the
invention. The following claims are intended to encompass all such
modifications.
* * * * *