U.S. patent application number 11/195,429 was filed with the patent office on 2005-08-02 and published on 2006-02-02 for programmable processor architecture hierarchical compilation.
Invention is credited to Amit Ramchandran and John Reid Hauser, Jr.
Application Number | 11/195,429 |
Publication Number | 2006/0026578 |
Family ID | 35733871 |
Filed Date | 2005-08-02 |
United States Patent
Application |
20060026578 |
Kind Code |
A1 |
Ramchandran; Amit; et al. |
February 2, 2006 |
Programmable processor architecture hierarchical compilation
Abstract
One embodiment of the present invention includes a heterogeneous,
high-performance, scalable processor having at least one W-type
sub-processor capable of processing W bits or greater in parallel,
W being an integer value, and at least one N-type sub-processor
capable of processing N bits in parallel, N being an integer value
smaller than W. A scenario compiler is
hierarchical flow of compilation and used with other compilation
and assembler blocks to generate binary code based on different
types of codes to allow for efficient processing based on the
sub-processors while maintaining low power consumption when the
binary code is executed.
Inventors: |
Ramchandran; Amit; (San
Jose, CA) ; Hauser; John Reid JR.; (Berkeley,
CA) |
Correspondence
Address: |
Maryam Imam, Esq.; LAW OFFICES OF IMAM
Suite 1010
111 North Market Street
San Jose
CA
95113
US
|
Family ID: |
35733871 |
Appl. No.: |
11/195429 |
Filed: |
August 2, 2005 |
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
11/180,068 | Jul 12, 2005 |
11/195,429 | Aug 2, 2005 |
60/598,417 | Aug 2, 2004 |
Current U.S.
Class: |
717/149 ;
712/E9.046; 712/E9.049; 712/E9.071 |
Current CPC
Class: |
G06F 15/7842 20130101;
G06F 1/3203 20130101; G06F 9/3824 20130101; G06F 9/3895 20130101;
G06F 15/781 20130101; G06F 9/30036 20130101; G06F 15/8053 20130101;
G06F 9/3885 20130101; G06F 9/3828 20130101; G06F 9/30032 20130101;
G06F 15/7867 20130101; G06F 9/3877 20130101; G06F 9/30014
20130101 |
Class at
Publication: |
717/149 |
International
Class: |
G06F 9/45 20060101
G06F009/45 |
Claims
1. A software architecture for execution on a heterogeneous,
high-performance, scalable processor having at least one W-type
sub-processor capable of processing W bits, or more, in parallel, W
being an integer value, and having at least one N-type sub-processor
capable of processing N bits in parallel, N being an integer value
and smaller than W, the software architecture comprising: a scenario
compiler for pre-compiling a scenario to create a binary code based
on assembly code, high level language code and scenario description
language code, the scenario compiler including a plurality of
applications, each application including one or more kernels, the
scenario compiler pre-compiling the scenario for efficient
execution thereof by a plurality of sub-processors, each
sub-processor including a control circuit including high level code
for execution thereof, the control circuit being a high-level-language
programmable controller for the sub-processor, wherein a
hierarchical compilation of different types of programming codes
allows for efficient binary code creation while reducing power
consumption when the binary code is executed by the
sub-processors.
2. A software architecture, as recited in claim 1, further
including a schedule and synchronization block communicating with
the scenario compiler for generating code, based on scenario
description language (SDL), to operate with one or more of the
sub-processors.
3. A software architecture, as recited in claim 2, further
including a high level language compiler block receiving input from
the synchronization block for compiling high level code.
4. A software architecture, as recited in claim 3, further
including an assembler block coupled to receive information from
the high level language compiler block and from an assembly code
block, which provides assembly code written by a user, the
assembler block for assembling the assembly code and the
information received from the high level language compiler
block.
5. A software architecture, as recited in claim 4, further
including a binary code block for generating binary code based on
assembly code, high level code and SDL.
6. A software architecture, as recited in claim 5, further
including a scenario description and optional optimization block
coupled to the scenario description block, wherein upon the
generation of binary code, a user's design goals are verified and,
if the design goals are not met, the scenario description and
optional optimization block modifies the scenario.
7. A software architecture, as recited in claim 6, wherein the
sub-processors each include applications having kernels, the
kernels being engines for execution of computationally intensive
code.
8. A software architecture, as recited in claim 7, further
including a scenario description block coupled to the scenario
compiler block for generating SDL for describing inter-dependencies
between the kernels.
9. A software architecture, as recited in claim 8, further
including a low-level assembler and linker block coupled to the
optimizing assembler block for assembling the lowest-level
code.
10. A software architecture, as recited in claim 9, wherein the
low-level assembler and linker block further includes a latency
verification block responsive to an N number of previous
instructions and a current instruction for verifying the presence
of N number of previous instructions used by a user for
instructions requiring previous instructions.
11. A software architecture, as recited in claim 10, wherein the
latency verification block for verifying the user's instruction,
which includes use of previous instructions, against latency
rules.
12. A software architecture, as recited in claim 11, further
including shared memory coupled to the sub-processors wherein the
kernel of one of the sub-processors hands off to another
sub-processor by placing, in the shared memory, information to be
used by the another sub-processor.
13. A method of generating and executing code on a heterogeneous,
high-performance, scalable processor having at least one W-type
sub-processor capable of processing W bits, or more, in parallel, W
being an integer value, and having at least one N-type sub-processor
capable of processing N bits in parallel, N being an integer value
and smaller than W, the method comprising:
pre-compiling a scenario to create a binary code based on assembly
code, high level language code and scenario description language
code; and generating efficient binary code to be executed by the
sub-processors based on applications including kernels, the kernels
for executing computationally intensive code, the execution of the
binary code by the sub-processors causing reduced power
consumption and providing flexible coding options to a user.
14. A method of generating and executing code, as recited in claim
13, further including performing latency verification to prevent a
user from using erroneous previous instructions.
Description
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of U.S. Provisional
Patent Application No. 60/598,417, entitled "Quasi-Adiabatic
Programmable Processor Architecture" and filed on Aug. 2, 2004 and
is a continuation-in-part of U.S. patent application Ser. No.
11/180,068, filed on Jul. 12, 2005 and entitled "PROGRAMMABLE
PROCESSOR ARCHITECTURE", the disclosures of both of which are
incorporated herein by reference as though set forth in full.
BACKGROUND OF THE INVENTION
[0002] 1. Field of the Invention
[0003] This invention relates generally to the field of processors
and more particularly, to processors having low power consumption,
high performance, low die area, and flexibly and scalably employed
in multimedia and communications applications.
[0004] 2. Description of the Prior Art
[0005] With the advent of the popularity of consumer gadgets, such
as cell or mobile phones, digital cameras, iPods and personal
digital assistants (PDAs), many new standards for communication with
these gadgets have been adopted by the industry at large. Some of these
standards include H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS,
MP3 and Security. However, an emerging problem is the use of
different standards dictating communications of and between
different gadgets requiring tremendous development effort. One of
the reasons for the foregoing problem is that no processor or
sub-processor, currently available in the marketplace, is easily
programmable for use by all digital devices and conforming to the
various mandated standards. It is only a matter of time before this
problem grows as new trends in consumer electronics warrant even
more standards adopted by the industry in the future.
[0006] One of the emerging, if not current, requirements of
processors is low power consumption combined with the ability to
execute code sufficient to process multiple applications.
Current power consumption is on the order of sub-hundreds of
milliwatts per application, whereas, the goal is to be under
sub-hundreds of milliwatts for executing multiple applications.
Another requirement of processors is low cost. Due to the wide
utilization of processors in consumer products, the processor must
be inexpensive to manufacture, otherwise, its use in most common
consumer electronics is not pragmatic.
[0007] To provide specific examples of current processor problems,
the problems associated with RISCs, which are used in some consumer
products, microprocessors, which are used in other consumer
products, digital signal processors (DSPs), which are used in yet
other consumer products, application specific integrated
circuits (ASICs), which are used in still other consumer products,
and some of the other well-known processors, each exhibiting a
unique problem, are briefly described below. These problems, along
with the advantages of using each, are outlined below in a "Cons"
section discussing the disadvantages thereof and a "Pros" section
discussing the benefits thereof.
[0008] A. RISC/Super Scalar Processors
[0009] RISC and Super Scalar processors have been the most widely
accepted architectural solution for all general purpose computing.
They are often enhanced with application specific accelerators for
solving certain specialized problems within the context of a
general solution.
[0010] Examples include: ARM series, ARC series, StrongARM series,
and MIPS series.
[0011] Pros: [0012] Industry wide acceptance has led to a more
mature tool chain and wide software choices [0013] A robust
programming model has resulted from a very efficient automatic code
generator used to generate binaries from high level languages like
C. [0014] Processors in the category are very good general purpose
solutions. [0015] Moore's Law can be effectively used for
increasing performance.
[0016] Cons: [0017] The general purpose nature of the architecture
does not leverage common/specific characteristics of a set or
sub-set of applications for better price, power and performance.
[0018] They consume moderate to high amounts of power with respect
to the amount of computation provided. [0019] Performance increase
is mostly achieved at the expense of pipeline latency which
adversely affects several multimedia and communication algorithms.
[0020] Complicated hardware scheduler, sophisticated control
mechanisms and significantly reduced restrictions for more
efficient automatic code generation for general algorithms have
made this category of solutions less area efficient.
[0021] B. Very Long Instruction Word (VLIW) and DSPs
[0022] VLIW architectures eliminated some of the inefficiencies
found in RISC and Super Scalar architectures to create a fairly
general solution in the digital signal processing space.
Parallelism was significantly increased. The onus of scheduling was
transferred from hardware to software to save area.
[0023] Examples include: TI 64xx, TI 55xx, StarCore SC140, ADI
SHARC series.
[0024] Pros: [0025] Restricting the solution to the signal
processing space improved 3P in comparison with RISC and Super
Scalar architectures [0026] VLIW architectures provide higher level
of parallelism relative to RISC and superscalar architectures.
[0027] An efficient tool chain and industry wide acceptance was
generated fairly rapidly. [0028] Automatic code generation and
programmability are showing significant improvements as more
processors designed for signal processing fall into this
category.
[0029] Cons: [0030] Although problem solving capability is reduced
to the digital signal processing space, it is too broad for a
general solution like a VLIW machine to have efficient 3P. [0031]
Control is both expensive and power consuming especially for
primitive control code in many multimedia and communication
applications. [0032] Several power and area inefficient techniques
were used to make automatic code generation easy. Strong reliance
on these techniques by the software community is carrying forward
this inefficiency from generation to generation. [0033] VLIW
architectures are not well suited for processing serial code.
[0034] C. Reconfigurable Computing
[0035] Several efforts in industry and academia over the last 10
years were focused towards making a flexible solution with ASIC
like price, power and performance characteristics. Many have
challenged existing and matured laws and design paradigms with
little industry success. Most of the attempts have been in the
direction of creating solutions based on coarser grain FPGA like
architectures.
[0036] Pros: [0037] Some designs restricted to a specific
application while providing needed flexibility within that
application proved to be price, power, performance competitive
[0038] Research showed that such restricted yet flexible solutions
can be created to address many application hotspots.
[0039] Cons: [0040] Several designs in this space did not provide
an efficient and easy programming solution and therefore were not
widely accepted by a community adept in programming DSPs. [0041]
Automatic code generation from higher level languages like C was
either virtually impossible or highly inefficient for many of the
designs. [0042] 3P advantage was lost when an attempt was made to
combine heterogeneous applications using one type of interconnect
and one level of granularity. Degree of utilization of the provided
parallelism suffered heavily. [0043] Reconfiguration overhead was
significant in 3P for most designs. [0044] In many cases, the
external interface was complicated because the proprietary
reconfigurable fabric did not match industry standard system design
methodologies. [0045] Reconfigurable machines are uni-processors
and rely heavily on a tightly integrated RISC even for processing
primitive control.
[0046] D. Array of Processors
[0047] Some recent approaches are focused on making reconfigurable
systems better suited to process heterogeneous applications.
Solutions in this direction connect multiple processors optimized
for either one or a set of applications to create a processor array
fabric.
[0048] Pros: [0049] Different processors optimized for different
sets of applications when connected together using an efficient
fabric can help solve a wide range of problems. [0050] A uniform
scaling model allows any number of processors to be connected
together as performance requirements increase. [0051] Complex
algorithms can be
efficiently partitioned.
[0052] Cons: [0053] Although performance requirements may be
adequately answered, power and price inefficiencies are too high.
[0054] The programming model varies from processor to processor.
This makes the job of the application developer much harder. [0055]
Uniform scaling of multiple processors is a very expensive and
power consuming resource. This has been shown to display some
non-determinism that may be detrimental to the performance of the
entire system. [0056] The programming model at the system level
suffers from complexity of communicating data, code and control
information without any shared memory resources--since shared
memory is not uniformly scalable. [0057] Extensive and repetitive
glue logic required to connect different types of processors to a
homogeneous network adds to the area inefficiencies, increases
power and adds to the latency.
[0058] In light of the foregoing, there is a need for a low-power,
inexpensive, efficient, high-performance, flexibly programmable,
heterogeneous processor for allowing execution of one or more
multimedia applications simultaneously.
SUMMARY OF THE INVENTION
[0059] Briefly, one embodiment of the present invention includes a
heterogeneous, high-performance, scalable processor having at least
one W-type sub-processor capable of processing W bits or greater in
parallel, W being an integer value, and at least one N-type
sub-processor capable of processing N bits in parallel, N being an
integer value smaller than W. A scenario compiler is
included in a hierarchical flow of compilation and used with other
compilation and assembler blocks to generate binary code based on
different types of codes to allow for efficient processing based on
the sub-processors while maintaining low power consumption when the
binary code is executed.
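By way of a purely illustrative sketch, the ordering of the blocks in the hierarchical compilation flow recited in the claims can be modeled in C as a chain of stages. The stage names and the logging scheme below are assumptions for illustration only, not part of the disclosed embodiments:

```c
#include <string.h>

/* Illustrative model of the hierarchical compilation flow: scenario
 * description language (SDL) feeds the scenario compiler, followed by
 * schedule and synchronization, high level language compilation,
 * assembly and binary code generation.  Each "stage" here merely
 * records that it ran; real stages would transform code.  The
 * caller's buffer must be large enough for the full log. */
static void run_stage(char *log, const char *stage)
{
    strcat(log, stage);
    strcat(log, ";");
}

void hierarchical_compile(char *log)
{
    log[0] = '\0';
    run_stage(log, "sdl");        /* scenario description block   */
    run_stage(log, "scenario");   /* scenario compiler            */
    run_stage(log, "sched-sync"); /* schedule/synchronization     */
    run_stage(log, "hll");        /* high level language compiler */
    run_stage(log, "asm");        /* assembler block              */
    run_stage(log, "binary");     /* binary code block            */
}
```

The sketch fixes only the relative ordering of the blocks; an actual implementation would pass code between stages rather than append to a log.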
IN THE DRAWINGS
[0060] FIG. 1 shows an application 10 with reference to a digital
product 12 including an embodiment of the present invention.
[0061] FIG. 2 shows an exemplary integrated circuit 20 including a
heterogenous, high-performance, scalable processor 22 coupled to a
memory controller and direct memory access (DMA) circuit 24 in
accordance with an embodiment of the present invention.
[0062] FIG. 3 shows, in conceptual form, an architecture 300
including software architecture 302 in combination with some of the
hardware components 304 of the circuit 20 of FIG. 2.
[0063] FIG. 4 shows, in conceptual form, the process of
hierarchical software compilation, in block flow form, in
accordance with a method of the present invention.
[0064] FIG. 5 shows, in conceptual form, the process of and
apparatus for latency verification used in assembly coding and
included within the block 434 of FIG. 4 in accordance with a method
and apparatus of the present invention.
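The latency verification of FIG. 5, further recited in claims 10, 11 and 14, can be sketched as a check that an instruction consuming the result of a previous instruction is issued only after that result's latency has elapsed. The record layout and the one-instruction-per-cycle assumption below are hypothetical, not the disclosed apparatus:

```c
#include <stddef.h>

/* Hypothetical instruction record: which earlier result the
 * instruction reads (distance in instructions, 0 = none) and the
 * latency, in cycles, of the instruction that produced that result. */
struct instr {
    int uses_result_from;  /* how many instructions back, 0 if none */
    int producer_latency;  /* cycles before that result is valid    */
};

/* Returns the index of the first instruction that reads a result
 * before it is ready, or -1 if the program passes the latency rules.
 * Assumes one instruction issues per cycle. */
int verify_latency(const struct instr *prog, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int d = prog[i].uses_result_from;
        if (d > 0 && d < prog[i].producer_latency)
            return (int)i;   /* result not yet available */
    }
    return -1;
}
```

A user's assembly would be rejected at the instruction returned, mirroring the claimed verification of a user's instruction against the latency rules.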
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[0065] A sub-processor ("CoolProcessor") is provided employing logic
"macro-functional-units" (function-oriented dedicated logic),
replacing the classical fixed datapath, complex execution unit,
and register set used in general purpose CPUs and DSP engines and
replacing also the identical processing element used in homogeneous
multiprocessors.
[0066] As shown and described below with reference to FIG. 2, the
sub-processor employs a heterogeneous interconnect between
heterogeneous processors, designed to match multimedia and
communications applications.
[0067] One embodiment of the present invention employs four
sub-processors (referred to as "black boxes" or "processor" in the
provisional application No. 60/598,417, entitled "Quasi-Adiabatic
Programmable Processor Architecture"). In this patent document, a
processor 22 comprises a plurality of sub-processors. The four
sub-processors are split into two categories. The letter "W"
designates CoolW sub-processors, capable of handling operands
requiring wide datapaths. The CoolW sub-processor, moreover,
supports a wider range of data widths. The sub-processor is also capable of
executing 64-bit IEEE-standard floating-point instructions. Its
performance is greater than 49 MFLOPS at 150 MHz. The
floating-point instruction set includes addition, subtraction, and
multiplication.
[0068] The letter "N" indicates a CoolN sub-processor serving
narrow datapaths, such as required for average-quality imaging and
finite-field operations in communications. Each sub-processor
comprises a heterogeneous software-programmable datapath connecting
compute engines (in the CoolW sub-processor type) or compute
engines (in the CoolN sub-processor type). The internal compute
engines are referred to as MFUs. Multiple instances of the MFUs are
nonuniformly distributed between the two types of
sub-processors.
[0069] A control circuit within each sub-processor operates as an
engine and is a high-level-language programmable controller for the
sub-processor. The control circuit is aided by a core sequencer
underscoring the hard-wired nature of the MFUs: each unit is aimed
at executing efficiently only a fraction of the overall job. A
rather large instruction memory, per sub-processor, holds code for
the control circuit, internal interconnects, I/O, and MFUs
requiring it. Sufficient shared buffer memory is provided to store
operands and results of complex computations that make average
demands on operand life.
[0070] A general purpose processor (GPP, referred to as "ARM926" in
the provisional application) runs system software and generic
applications (applications other than multimedia and
communications). The GPP includes its own instruction and data
memory or cache.
[0071] The interconnect is based on the Sonics "smart" SoC bus. An
SoC architecture can include any number of sub-processors but the
number of sub-processors defines the number of threads, as will be
apparent shortly.
[0072] Referring now to FIG. 1, an application 10 is shown with
reference to a digital product 12 including an embodiment of the
present invention. FIG. 1 is intended to provide the reader with a
perspective regarding some, but not necessarily all, of the
advantages of a product, which includes an embodiment of the
present invention relative to those available in the
marketplace.
[0073] Accordingly, the product 12 is a converging product in that
it incorporates all of the applications that need to be executed by
today's mobile phone device 14, digital camera device 16, digital
recording or music device 18 and PDA device 20. The product 12 is
capable of executing one or more of the functions of the devices
14-20 simultaneously yet utilizing less power.
[0074] The product 12 is typically battery-operated and therefore
consumes little power even when executing multiple applications of
the applications executed by the devices 14-20. It is also capable
of executing code to effectuate operations in conformance with a
multitude of applications including but not limited to: H264,
MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and Security.
[0075] FIG. 2 shows an exemplary integrated circuit 20 including a
heterogenous, high-performance, scalable processor 22 coupled to a
memory controller and direct memory access (DMA) circuit 24 in
accordance with an embodiment of the present invention. As further
shown in FIG. 2, the processor 22 is coupled to an interface circuit
26 through a general purpose bus 30, to an interface circuit 28
through a general purpose bus 31 and, through the bus 31, to a
general purpose processor 32. The
circuit 20 is further shown to include a clock, reset and power
management circuit 34 for generating a clock utilized by the
remaining circuits of the circuit 20, a reset signal utilized in the
same manner and circuitry for managing power by the same. There is
further included in the circuit 20, a Joint Test Action Group
(JTAG) circuit 36. JTAG is used as a standard for testing
chips.
[0076] The interface circuit 26 shown coupled to the bus 30 and
interface circuit 28, shown coupled to the bus 31, include the
blocks 40-66, which are generally known to those of ordinary skill
in the art and used by current processors.
[0077] The processor 22, which is a heterogeneous multi-processor,
is shown to include shared data memory 70, shared data memory 72, a
CoolW sub-processor (or block) 74, a CoolW sub-processor (or block)
76, a CoolN sub-processor (or block) 78 and a CoolN sub-processor
(or block) 80. Each of the blocks 74-80 has associated therewith an
instruction memory, for example, the CoolW block 74 has associated
therewith an instruction memory 82, the CoolW block 76 has
associated therewith an instruction memory 84, CoolN block 78 has
associated therewith an instruction memory 86 and the CoolN block
80 has associated therewith an instruction memory 88. Similarly,
each of the blocks 74-80 has associated therewith a control block.
The block 74 has associated therewith a control block 90, the block
76 has associated therewith a control block 92, the block 78 has
associated therewith a control block 94 and the block 80 has
associated therewith a control circuit 96. The block 74 and 76 are
designed to generally operate efficiently for 16, 24, 32 and 64-bit
operations or applications, whereas, the blocks 78 and 80 are
designed to generally operate efficiently for 1, 4, or 8-bit
operations or applications.
[0078] The blocks 74-80 are essentially sub-processors and the
CoolW blocks 74 and 76 are wide (or W) type of blocks, whereas, the
CoolN blocks 78 and 80 are narrow (or N) type of blocks. Wide and
narrow refers to the relative number of parallel bits processed or
routed within a sub-processor and that gives the heterogeneous
characteristic of the processor 22. Furthermore, the circuit 24 is
coupled directly to one of the sub-processors, i.e. one of the
blocks 74-80 resulting in the lowest latency path through the
sub-processor to which it is coupled. In FIG. 2, the circuit 24 is
shown directly coupled to the block 76 although it may be coupled
to any of the blocks 74, 78 or 80. Higher priority agents or tasks
may be assigned to the block which is directly coupled to the
circuit 24.
[0079] It should be noted that while four blocks 74-80 are shown,
other numbers of blocks may be utilized; however, utilizing
additional blocks clearly results in additional die space and
higher manufacturing costs.
[0080] Complicated applications requiring great processing power
are not scattered in the circuit 20, rather, they are grouped or
confined to a particular sub-processor or block for processing,
which substantially improves power consumption by eliminating or at
least reducing wire (metal) or routing lengths thereby reducing
wire capacitance. Additionally, utilization is increased and
activity is reduced contributing to lower power consumption.
[0081] The circuit 20 is an example of a system on chip (or SoC)
offering Quasi-Adiabatic Programmable sub-Processors for multimedia
and communications applications. Two types of sub-processors are
included, as previously indicated: W type and N type. The W type or
Wide type processor is designed for high Power, Price, Performance
(3P) efficiency in applications requiring 16, 24, 32 and 64 bits of
processing. The N type or Narrow type processor is designed for high
efficiency in applications requiring 8, 4 and 1 bit of processing.
While these bit widths are used in the embodiments of the present
invention, by way of figures and description, other numbers of bits
may be readily employed.
[0082] Different applications require different performance or
processing capabilities and are thus executed by a different type
of block or sub-processor. Take, for instance, applications that are
typically executed by DSPs: they would generally be processed by
W type sub-processors, such as the blocks 74 or 76 of FIG. 2,
because they characteristically include commonly occurring DSP
kernels. Such applications include, but are not limited to, fast
Fourier transform (FFT) or inverse FFT (IFFT), adaptive finite
impulse response (FIR) filters, Discrete Cosine Transform (DCT) or
inverse DCT (IDCT), Real/Complex FIR filter, IIR filter, Root
Raised Cosine (RRC) filter, Color Space Converter, 3D
Bilinear Texture Mapping, Gouraud Shading, Golay Correlation,
Bilinear Interpolation, Median/Row/Column Filter, Alpha Blending,
Higher-Order Surface Tessellation, Vertex Shade (Trans/Light),
Triangle Setup, Full-Screen Anti-aliasing and Quantization.
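As an illustration of one of the DSP kernels listed above, a minimal direct-form FIR filter is sketched below in C. The function name, the use of single-precision floats and the buffer layout are illustrative assumptions only, not part of the disclosed architecture:

```c
#include <stddef.h>

/* Direct-form real FIR filter: y[n] = sum over k of h[k] * x[n-k].
 * The inner multiply-accumulate loop is the kind of wide-operand
 * work a W-type sub-processor is designed to execute efficiently. */
void fir_filter(const float *x, size_t x_len,
                const float *h, size_t taps,
                float *y)
{
    for (size_t n = 0; n < x_len; n++) {
        float acc = 0.0f;
        /* accumulate over the taps that have valid input samples */
        for (size_t k = 0; k < taps && k <= n; k++)
            acc += h[k] * x[n - k];
        y[n] = acc;
    }
}
```

Feeding an impulse through the filter reproduces the tap coefficients, which is a convenient sanity check for any FIR implementation.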
[0083] Other commonly occurring DSP kernels can be executed by N
type sub-processors, such as blocks 78 and 80 and include, but are
not limited to, Variable Length Codec, Viterbi Codec, Turbo Codec,
Cyclic Redundancy Check, Walsh Code Generator,
Interleaver/De-Interleaver, LFSR, Scrambler, De-spreader,
Convolution Encoder, Reed-Solomon Codec, Scrambling Code Generator,
and Puncturing/De-puncturing.
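To illustrate the narrow, bit-oriented work such kernels perform, a bit-serial CRC-32 (Cyclic Redundancy Check, the IEEE 802.3 polynomial in reflected form) is sketched below. This is a textbook software sketch, not the disclosed hardware:

```c
#include <stdint.h>
#include <stddef.h>

/* Bit-serial CRC-32 (reflected polynomial 0xEDB88320, initial value
 * 0xFFFFFFFF, final inversion).  The inner loop is pure 1-bit
 * shift/xor work, the kind of narrow operation an N-type
 * sub-processor targets. */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}
```

The standard check value for the nine ASCII bytes "123456789" is 0xCBF43926, which any conforming CRC-32 implementation must reproduce.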
[0084] Both W and N type sub-processors are capable of keeping net
activity and the resulting energy per transition low while
maintaining high performance with increased utilization in
comparison with existing architectural approaches like RISC,
Reconfigurable, Superscalar, VLIW and Multi-processor approaches.
The sub-processor architecture of the processor 22 reduces die size
resulting in an optimal processing solution and includes a novel
architecture referred to as "Quasi-Adiabatic" or "COOL"
architecture. Programmable processors in accordance therewith are
referred to as Quasi-Adiabatic Programmable or COOL Processors.
[0085] Quasi-Adiabatic Programmable or COOL Processors optimize
data path, control, memory and functional unit granularity to match
a finite subset of applications, as described previously. The way
in which this is accomplished will be clear relative to a
discussion and presentation of figures relating to the different
units or blocks or circuits and their inter-operations of the
processor 22, as presented below.
[0086] "Quasi-Adiabatic Programmable" stands for Concurrent
Applications of heterOgeneous intercOnnect and functionaL units
(COOL) Processors. In terms of thermodynamics, adiabatic processes
do not waste heat and transfer all of the used energy to performing
useful work. Due to the non-adiabatic nature of existing standard
processes, circuit design, and logic cell library design techniques,
one cannot make a truly adiabatic processor. However, among the
different possible processor architectures, some may be closer to
adiabatic than others. The various embodiments of the present
invention show a class of processor architectures which are
significantly closer to adiabatic as compared to the architectures
of the prior art, while they are, nevertheless, programmable. They
are referred to as "Quasi-Adiabatic Programmable Processors".
[0087] The integrated circuit 20 allows as many applications as can
be supported by the resources within the processor 22 to be
executed together or concurrently and the number of such
applications far exceeds that which is supported by current
processors. Examples of applications that can be simultaneously or
concurrently executed by the integrated circuit 20 include but are
not limited to downloading an application from a wireless device
while decoding a movie that has been received, thus, a movie can be
downloaded and decoded simultaneously. Due to achieving
simultaneous application execution on the integrated circuit 20,
which has a small die size or silicon real estate as compared to
the number of applications it supports, costs of manufacturing the
integrated circuit are significantly lower than that which is
required for multiple devices of FIG. 1. Additionally, the
processor 22 offers a single programmable framework to a user to
implement multiple functions, such as multimedia complex
applications. Of important value is the ability of the integrated
circuit 20 and namely, the processor 22, to support future
standards adopted by the industry, which are expected to be of
greater complexity than that of today's standards.
[0088] Each of the blocks 74-80 can execute only one sequence (or
stream) of programs at a given time. A sequence of programs refers
to a function associated with a particular application.
For example, an FFT is a type of sequence. However, different
sequences may be dependent on one another. For example, an FFT
program, once completed, may store its results in the memory 70 and
the next sequence may then use the stored result. Different
sequences sharing information in this manner, or being dependent
upon each other in this manner, is referred to as "stream flow".
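The "stream flow" handoff just described can be sketched as a producer sequence leaving its results in shared memory (such as the memories 70 and 72 of FIG. 2) for a dependent sequence to consume. The buffer size, field names and single-flag protocol below are hypothetical simplifications:

```c
#include <stddef.h>

#define BLOCK_LEN 8  /* illustrative result-block size */

/* Shared-memory block: a ready flag set by the producer and
 * cleared by the consumer, plus the handed-off results. */
struct shared_block {
    volatile int ready;
    float data[BLOCK_LEN];
};

/* First sequence (e.g. an FFT on one sub-processor) writes its
 * results into shared memory and signals the dependent sequence. */
void producer_handoff(struct shared_block *m, const float *results)
{
    for (size_t i = 0; i < BLOCK_LEN; i++)
        m->data[i] = results[i];
    m->ready = 1;
}

/* Dependent sequence takes the stored result if one is available;
 * returns 1 on success, 0 if no result has been handed off yet. */
int consumer_take(struct shared_block *m, float *out)
{
    if (!m->ready)
        return 0;
    for (size_t i = 0; i < BLOCK_LEN; i++)
        out[i] = m->data[i];
    m->ready = 0;
    return 1;
}
```

A real implementation would need the synchronization provided by the schedule and synchronization block rather than a bare flag; the sketch only shows the shared-memory dependency between sequences.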
[0089] In FIG. 2, the memories 70 and 72 each include 8 blocks of
16 kilobytes of memory, however, in other embodiments, different
size memory may be utilized.
[0090] The instruction memories 82, 84, 86 and 88 are used to store
instructions for execution by the blocks 74-80, respectively.
[0091] FIG. 3 shows, in conceptual form, an architecture 300
including software architecture 302 in combination with some of the
hardware components 304 of the circuit 20 of FIG. 1. The hardware
components 304 includes the processor 32, the circuit 26 and
circuit 28 and the processor 22, as described and shown with
respect to previous figures.
[0092] Included within the software architecture 302, a hardware
abstraction layer or low level drivers 306 and an operating systems
driver 308 cause interfacing or communication between the hardware
components 304 and the software architecture 302. The software
architecture 302 is further shown to include a CoolBios (basic
input output system) 310 coupled to the hardware components 304 and
to a scenario 312, which is for causing multiple applications 314
to be executed, each application 314 including kernels 316 for
execution of computationally-intense functions, such as fast
Fourier transforms (FFTs), DCTs, Finite Impulse Response (FIR)
filtering and others known in the industry. The software
architecture 302 is further shown to include a system level
software changes scenarios 318, which is shown to communicate with
an operating systems interface (OSI) 322 and an operating system
320. The operating system 320 is further shown to communicate with
the scenario 312, applications 314, and kernels 316. The kernels
316 are engines for execution of computationally intensive code,
generally in assembly or other low level code.
[0093] Each of the applications 314 includes many kernels 316, such
as conditional encoding (CE), cyclic redundancy coding (CRC), down
sampling (DS), variable length coding (VLC), discrete cosine
transform (DCT), motion estimation (ME) and motion compensation
(MC), that consume most of the compute time in an application. The
scenario-level software 310 contains hooks to quasi-statically
change the execution pattern of applications contained within that
scenario. The system level software changes scenarios 318 causes
scenarios to be changed while running on the hardware 304.
From a software perspective, each of the kernels 316 is written in
assembly code for executing an FFT or other
computationally-intensive functions while the scenario 312 and each
of the applications 314 are in a higher level language, such as "C"
for reasons that will become apparent shortly. For now, suffice it
to say that the combination of assembly and a higher level language
being executed on a sub-processor, CoolW or CoolN, and a control
block included therein, as in the hardware architecture of FIG. 2, causes
simultaneous or concurrent execution of applications, in a
hierarchical manner and while maintaining low power
consumption.
[0094] The CoolBios 310 includes a set of software functions that
allow input and output communication with the processor 22 and
eliminates the need for a full operating system running on the
processor 22.
[0095] The hardware components 304 and software architecture 302
provide an environment to load and execute a multi-application
scenario. A "scenario", as referred to herein, is a set of
applications, such as the applications 314, executing concurrently.
Some examples of each of the applications 314, as shown in FIG. 3,
include but are not limited to JPEG, MP3, H.264 and 802.11g. A
scenario 312 interfaces with the operating system 320 and
higher-level software through the OSI 322 and the drivers 308.
[0096] The software architecture 302 and the hardware components
304 of FIG. 3 allow an operating system (OS) to be loaded onto the
processor 32 and the drivers 308 to ultimately allow a scenario 312
to be loaded for causing multiple applications to be executed
concurrently.
[0097] The scenario 312 includes information, in its header,
overhead information, to cause turning on or off each of the
different applications 314. For example, the JPEG application can
be turned off while the remaining applications, such as MP3, H.264
and 802.11g remain on. This effectively aids in reducing power
consumption, as the need for power is reduced when an application
that is not currently being used is turned off. Remaining
processing power, i.e. that which is not currently being used, may
be devoted to executing a new application with some limitations, as
are now discussed.
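As a rough illustration of the on/off header just described, the fragment below uses the application names from FIG. 3, but the header representation itself is an invented sketch, not the patent's binary format:

```python
# Hypothetical scenario header: one on/off flag per application, mirroring
# the description of the scenario 312 (representation invented for clarity).
scenario_header = {"JPEG": True, "MP3": True, "H.264": True, "802.11g": True}

def set_application(header, name, on):
    """Turn a single application on or off without touching the others."""
    header[name] = on

def active_applications(header):
    """Applications currently consuming processing power."""
    return [name for name, on in header.items() if on]

# Turn JPEG off while MP3, H.264 and 802.11g remain on, as in the example.
set_application(scenario_header, "JPEG", False)
```

Turning a flag off models removing that application's demand for power; the remaining capacity could, within the limitations discussed next, be devoted elsewhere.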
[0098] Essentially, there are three modes of operation within the
software architecture 302. One is real-time mode, an example of
which is 802.11g, which has hardware time constraints. In this
case, it is not feasible to add another application because a
scenario 312 that includes an 802.11g application has already
compiled the latter, and in the presence of a pre-compiled
application, a new application cannot be added. Generally, in the
presence of applications having a timing constraint, a new
application is not readily added, nor are scenarios dynamically
changed, because doing so disturbs the processing balance; however,
this is not an issue in mobile applications because scenarios are
not readily changed in such applications.
[0099] The scenario 312 is pre-compiled and quasi-statically
scheduled, which refers to turning applications on or off. The
pre-compiled and scheduled scenario 312, which is in binary form,
is then stored in one of the sub-processors, such as the
sub-processor 74. Turning off an application prevents "choking" of the system,
that is, bandwidth is improved.
[0100] The system level software changes scenarios 318 causes
changing of the scenario 312, which, as previously-stated, may be
done dynamically. The code in the latter is in "C" or a high level
code. The scenario 312 is written in scenario descriptive language
(SDL), which is a unique and proprietary language with all rights
reserved by 3Plus1 Technology, Inc. of Saratoga, Calif.
[0101] On the right-hand side of FIG. 3, the hierarchical
software/compiler characteristics of the architecture 300 are
shown, in conceptual form, and in reference to a software tools
hierarchical column 340, a hierarchical level column 342 and a
hardware hierarchical levels column 344. Each of the pieces of
software of the software architecture 302 is taken through a
different tool so as to avoid a flat methodology.
[0102] The drivers 306 and 308 are used as tools for the general
purpose processor (GPP) 32 on the highest level of the tool column
340, while, in the next level of the hierarchical tools, a scenario
compiler 348 is used by an application programmer to allocate
resources for execution on one or more particular sub-processors.
The kernels 316 are then advantageously partitioned: an application
is divided into smaller portions, or threads, switching from one
kernel to another.
[0103] The number of threads is limited to the number of
sub-processors. The way in which applications are handed from one
kernel to another is by the kernel 316 that is currently operating
finishing a particular function, saving the result of the function
in shared memory and signaling completion of its function, and then
another kernel 316 utilizing the stored information in shared
memory to perform another function. A synchronization code is used
for this hand-off, which is done by the scenario 312; the
particular tool is the scenario compiler 348, and the process is
automated. Thus, synchronization and control code are generated
automatically due to the presence of the threads.
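The hand-off just described, one kernel finishing, saving to shared memory, signaling, and the next kernel consuming, can be sketched with ordinary threads. This is a Python model of the mechanism, not the generated synchronization code itself; the names and workload are invented:

```python
import threading

shared = {}               # stands in for the shared memory of FIG. 2
done = threading.Event()  # stands in for the generated synchronization code

def kernel_a():
    # First kernel: finish a function, save the result, signal completion.
    shared["result"] = sum(x * x for x in range(8))
    done.set()

def kernel_b(out):
    # Second kernel: wait for the completion signal, then use the stored result.
    done.wait()
    out.append(shared["result"] * 2)

out = []
t1 = threading.Thread(target=kernel_a)
t2 = threading.Thread(target=kernel_b, args=(out,))
t2.start(); t1.start()   # start order is irrelevant: the event serializes them
t1.join(); t2.join()
```

Because the consuming kernel blocks on the event rather than on the producing kernel directly, the two remain independent sequences coordinated only through shared memory and the synchronization signal.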
[0104] In the next level of the tool hierarchy, as shown in the
column 340, a controller/compiler 350 is used to compile the high
level language being employed, such as "C", and includes two parts,
an optimizing assembler 352 and a low level assembler 354. The goal
is to allow the programmer to write mostly C or high level code,
rather than assembly, as the former is easier. This is readily
allowed for given the sub-processor and hierarchical architecture
of the present invention. The compiler 350 is optimized for each
sub-processor, such as CoolW or CoolN. That is, high level code,
written by a user or programmer is compiled, pursuant to certain
rules, for storage and execution by a sub-processor and a control
block located therein, as previously shown and discussed.
[0105] By changing scenarios, multiple applications can be
performed; for example, digital camera and PDA functions can be
performed in a single device simultaneously. The ability to do so
results in foregoing the dynamic ability to change or add a
scenario, as might be done in a personal computer, but this
limitation is completely tolerable because, in mobile handheld
device applications, a device that is to be used with a certain
scenario does not normally need to be quickly reprogrammed to
include another scenario.
[0106] By way of example, if a manufacturer introduces a product,
such as a PDA, this is compiled along with other applications, such
as a digital camera or MP3, etc., and a pre-compiled binary code is
created using the hierarchical software tools compilation and the
sub-processor-based hardware architecture of the present invention.
Such pre-compiled code and multiple applications make up a
scenario. While another scenario may be pre-compiled, it is a rare
occurrence due to the reluctance of the manufacturer to quickly
introduce another product. Given time, another product is likely to
be introduced, warranting another scenario, but the time to switch
to another scenario is far from urgent.
[0107] In FIG. 3, column 342 states which part or component of the
hardware is utilized for the corresponding tool of column 340. That
is, viewed in a row, each location of the row within column 342
corresponds to a like location in column 340. Thus, the GPP IDE 346
is handled by the hardware 304, the scenario compiler 348 is
handled by the processor 22, the compiler 350 is handled by a
control block of one of the sub-processors, and the assemblers 352
and 354 are handled by one or more of the sub-processors, such as
the sub-processor 72. The particular hardware hierarchical levels
are correspondingly enumerated in column 344 by reference.
[0108] In the low level assembler 354, scheduling is done and all
of the hardware components are available, whereas the optimizing
assembler 352 includes more restrictions because it operates at a
higher level, yet it is able to schedule more. Area and power are
saved by less scheduling. The hierarchical flow of column 340 and
the hardware architecture of the processor of FIG. 2 allow for an
efficient, low power and flexible processing tool. In the prior
art, while the assembler 354 and the GPP IDE 346 are used, the
remainder of the column 340 is not.
[0109] With continued reference to FIG. 3, a hierarchical
compilation involves partitioning the application code into a
general purpose processor component that allows interaction between
the processor 32 and all other hardware components. This
general-purpose processor component is mainly to allow switching
between different scenarios.
[0110] Scenarios are compiled to run on a combination of multiple
sub-processors that communicate through shared memory. The scenario
compiler 348 is the tool that schedules the coarse grain data
dependency graph wherein kernels and control code in one or more
applications communicate with each other and with the controlling
general-purpose processor. Dependencies are resolved to determine
trigger conditions based upon which synchronization code is
generated to evaluate these conditions at run-time. The compiler
350 targets the subset of a sub-processor, or the control block
located therein (such as the control block 90), that executes
application control code and the scenario control and
synchronization code.
[0111] The optimizing assembler 352 and the low-level assembler 354
target functions written in sub-processor assembly. They
incorporate many scheduling techniques often found in higher level
compilers such as register allocation and software pipelining. The
binary software objects generated by these assemblers execute
either on a CoolW or CoolN processor.
[0112] The scenario compiler 348 schedules the correct operation of
the applications' functions and allocates data resources. The
scenario compiler uses scheduling algorithms from the existing art
to create the schedule. The scenario compiler 348 emits the source
code (a compilable program, written in C) that implements the
scenario scheduler. The scenario scheduler implements, in software,
the schedule chosen by the scenario compiler. That is, it manages
application resources (data--placed into shared and external
memory--and functions) that are partitioned among the multiple
processor cores contained within the target device. The scheduler
ensures the correct sequencing and synchronization of functions and
data that are in use within each of the multiple processors. The
scenario compiler 348 also generates code to correctly access
peripherals and DMA controllers as referenced by SDL-specific
language features based on information about the target
heterogeneous multi-processor(s) provided to the scenario compiler
348.
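One way to picture the scenario compiler's scheduling of the coarse-grain data dependency graph is a topological ordering in which each function's trigger condition, namely all of its inputs being ready, is satisfied before it runs. The sketch below uses invented kernel names and Kahn's algorithm, which is only one of the existing-art scheduling algorithms the patent alludes to:

```python
from collections import deque

# Hypothetical coarse-grain dependency graph: kernel -> kernels it depends on.
deps = {"FFT": [], "DCT": [], "ME": ["FFT"], "MC": ["ME", "DCT"]}

def schedule(deps):
    """Kahn's algorithm: emit a run order in which every function's trigger
    condition (all of its inputs already produced) holds when it executes."""
    remaining = {f: set(d) for f, d in deps.items()}
    ready = deque(sorted(f for f, d in remaining.items() if not d))
    order = []
    while ready:
        f = ready.popleft()
        order.append(f)
        # f's output is now available: discharge it from every waiter.
        for g, d in remaining.items():
            if f in d:
                d.remove(f)
                if not d:
                    ready.append(g)
    return order

order = schedule(deps)
```

In the patent's flow the resolved order would then be emitted as scheduler source code in C, with run-time synchronization code evaluating each trigger condition; here the ordering alone conveys the idea.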
[0113] The SDL allows for a collection of functionality used in the
present invention. The Scenario Description Language (SDL) is a
language created for the purpose of creating high-level, abstract
descriptions of scenarios and the applications contained within.
SDL is compact, human-readable, and scalable. SDL provides language
syntax and semantics to describe: the flow of data into and out of
the sub-processors and between functions executing on the
sub-processor; the amount of storage required to stream data
through the applications executing on the sub-processor; the
priority of each application to facilitate the creation of a
functionally correct schedule that satisfies latency requirements;
the amount of data (and its type) produced and consumed by each
function; the maximum (worst case) execution time of each function,
which is used in the creation of the schedule; and the placement of
each function onto W- or N-type sub-processors.
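SDL itself is proprietary and its syntax is not reproduced here, but the properties it is said to capture can be mirrored in a plain data structure. The field names and figures below are invented purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class FunctionDesc:
    """One function in a hypothetical scenario description, mirroring the
    properties SDL is described as capturing (field names are invented)."""
    name: str
    bytes_in: int           # amount of data consumed per invocation
    bytes_out: int          # amount of data produced per invocation
    worst_case_cycles: int  # maximum execution time, used to build the schedule
    placement: str          # "W" or "N" type sub-processor

def buffer_needed(funcs):
    """Storage required to stream data through the application: a simple
    upper bound that sums each function's output size."""
    return sum(f.bytes_out for f in funcs)

# A toy two-function application, with placements onto W- and N-type units.
app = [FunctionDesc("FFT", 1024, 1024, 5000, "W"),
       FunctionDesc("VLC", 1024, 256, 1200, "N")]
```

A scenario compiler consuming such descriptions could check that worst-case times fit the latency requirements and that each function's placement matches an available sub-processor type.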
[0114] FIG. 4 shows, in conceptual form, the process of
hierarchical software compilation, in block flow form, in
accordance with a method of the present invention. In FIG. 4, the
blocks that are shown shaded, such as blocks 402, 412, 422, 420,
424, 428, 440 and 430-438, are based on a sub-processor and need to
be executed thereby, whereas the remaining blocks, having no
shading, are independent of any particular sub-processor.
[0115] Generally, FIG. 4 shows the way in which software is
compiled for use by the processor 22. In FIG. 4, there is shown
further details of some of the blocks of FIG. 3. Specifically, the
scenario compiler 348 of FIG. 3 is the scenario compiler block 418
of FIG. 4, and the scenario description block 416, the adjust
scenario description and optional optimization block 408, and the
processor-specific data block 422 serve as support for the scenario
compiler block 418. The controller/compiler block 424 is the same
as the compiler 350 of FIG. 3, the optimizing assembler 432 (which
receives the assembly code block 430) is the same as the optimizing
assembler 352 of FIG. 3, and the low level assembler and linker
block 434 of FIG. 4 is the same as the low level assembler 354 of
FIG. 3. Thus, the description of these blocks will not be repeated.
[0116] The block 416 allows the programmer to meet his/her design
goals without having to optimize either the high level code or the
assembly code. Having the SDL allows for allocating a function from
one block to another block at a high level. The block 416 serves as
a street map. The adjust partitioning and kernels block of FIG. 21
of the "PROGRAMMABLE PROCESSOR ARCHITECTURE" patent application,
incorporated herein by reference, is the same as the block 408.
[0117] The scenario description block 416 serves as input to the
scenario compiler block 418, as does the block 422. The output of
the block 418 serves as input to the block 420 and the block 408
serves as input to the block 416. The block 416 describes
inter-dependencies between the kernels 316 and applications 314 of
FIG. 3. The SDL is used by the block 416. The optimizing block 410
is used to optimize high level code and assembly code. Thus, the
block 410 provides input to the existing assembly code block 412
and the existing high level code block 414. The scenario
compiler block 418 receives two sets of information, one is a
bottom-up set of information and another is a top-down set of
information. An example of the former is the kernels 316, i.e.
FFTs, DCTs, etc., which are provided by the block 422 to the block
418 and are optionally assembled with power information during
assembly. An example of the latter is provided by the block 416,
which is programmed in SDL to serve as control code and for
defining inter-dependencies of the kernels and requirements of the
application. An example of the requirements of the application is
the length of time that can be used for processing a frame of
information, among other time-related requirements.
[0118] The block 418, once provided with the foregoing top-down and
bottom-up information, performs a best match process in the form of
a schedule. The schedule, for example, provides information
regarding the inter-dependencies of the sub-processors execution of
which requires synchronization code for the control circuit of a
sub-processor. The schedule information and synchronization
information are provided by the block 420, which receives input
from the block 418. The output of the block 420 is provided as
input to the block 424. Having the block 420 receive its input
from the block 418 is generally not done in prior art
techniques due to their design/hardware limitations. That is, the
hardware architecture, based on sub-processors, as shown in
previous figures and the referenced patent document, allows for
scheduling and synchronization after the block 418 performs its
operation. This allows for the control circuit and each
sub-processor to be the same as the other and for the code to be
transportable.
[0119] The non-native compilation and simulation block 428 is for
compiling in the absence of a processor; that is, during
development, while the hardware is not yet ready, compilation is
performed in a "non-native" environment, whereas the native
simulation block 440 operates in the native environment. The block
428 allows for both assembly and high level code compilation while
a native compiler, or the actual compiler to be ultimately
employed, is not yet ready. Thus, an off-the-shelf compiler, i.e.
non-native, may be employed and combined with assembly code for
simulation. This is sub-processor specific. The kernels 316 and the
control code compete for execution time.
[0120] In FIG. 4, the output of the existing assembly functions
block 402 serves as input to the existing assembly code 412, which
also receives input from the block 410. The output of the block 412
serves as input to the block 432, which also receives input from
the block 430. The output of the block 432 serves as input to the
block 434 and the block 434 serves as input to the block 436, which
provides input to the native simulation block 440 and the
implementation complete block 438. The block 440 provides input to
a decision block 442, which determines whether or not design goals
have been met; if so, the block 438 is performed, and if not,
either the code is optimized by the block 410 or the scenario
description is adjusted by the block 408. The output of the blocks
412, 414 and 420 are all provided to the block 428. That is,
assembly code, high level code and schedule and synchronization
code are all provided to the block 428 for execution or simulation.
The output of the block 428 is provided to a decision block 426 for
determining whether or not design goals are met; if so, the process
is exited; otherwise, the scenario may be adjusted by the block
408. The output of the block 414 is provided to the block 424 for
compilation thereof and the output of the latter is provided to the
block 430.
[0121] Optimization is done on a partition-basis. That is, high
level code is optimized separately from assembly code and from SDL.
For example, assembly code is optimized by the block 432, high
level code is optimized by the block 410 and SDL is optimized by
the block 416. This is a divide and conquer approach allowing
advantageous optimization of each type of code that is not
attainable without such a division of code. The block 424 receives
high level code and compiles the same, but outputs assembly code to
the block 430, which is optimized by the block 432. The output of
the block 432 is provided to the block 434 for creation of still
further low level code and the output of the block 434 is provided
to the block 436 for generation of binary object code to be used by
a sub-processor. The assembly code that is written by the
programmer is provided from the block 412 to the block 432 for
assembling.
[0122] FIG. 5 shows, in conceptual form, the process of and
apparatus for latency verification used in assembly coding and
included within the block 434 of FIG. 4 in accordance with a method
and apparatus of the present invention. In FIG. 5, instructions 502
are shown to be provided to the block 434, which will also be
referred to as the latency verification block. That is, the current
instruction is referred to as "instruction n", while the
instruction previous to the latter is referred to as "instruction
n-1", while the instruction previous to the latter is referred to
as "instruction n-2" and the instruction previous to that is
referred to as "instruction n-3" and so on. That is, an n-1
instruction is delayed by a program cycle from an n instruction and
so on.
[0123] The block 434 performs various functions, shown in FIG. 5,
in blocks or diamond shapes. Such functions include identifying all
instructions directly contributing to instruction n, at 508, which
is then used to determine the latency rules relevant to those
instructions, at 506, and the result is provided to a decision
block, at 510, for identifying the earliest instruction. If the
earliest instruction is identifiable, the process continues to 514,
checking the latency rules against the register value latencies,
and if any error results, the process continues to 516, at which
time an error is reported. If no error is detected at 514, the next
instruction is processed at 518. The process of FIG. 5 is done
during compilation and serves as a check for the programmer.
[0124] At 506, rules are used to determine what the actual
latencies are using a database of rules. At 510, this determination
is made because previous instructions are not necessarily known.
For example, the first instruction of a sub-routine is one where
its previous instruction is not necessarily known. At 512, worst
case possibilities are determined. At 514, latency rules are
checked against the register value latencies. A latency is
basically a delayed or previous instruction. That is, the
programmer's annotation is compared to the rules for latency and if
there is a mismatch, an error is reported at 516. An example of the
programmer's annotation is discussed hereinbelow.
[0125] A computer processor implements strict read-after-write
behavior for a register when an instruction that reads the register
always obtains the value written by the most recent previously
executed instruction that writes the register. To achieve strict
read-after-write behavior for a register, it is sometimes necessary
to delay the execution of an instruction that reads the register by
one or more clock cycles from when it would otherwise execute. An
instruction so delayed is said to be stalled for the one or more
clock cycles of delay. The advantages of implementing strict
read-after-write behavior for all registers are: [0126] (1) The
same sequence of instructions can execute correctly on a wider
range of processor implementations, and [0127] (2) Assembly
language programming is made easier.
[0128] For some processors, particularly those designed for
high-performance digital signal processing and related embedded
systems, the complexity of the processor has been reduced by not
implementing strict read-after-write behavior for all registers.
For such a processor, when an instruction reads a register it may
not obtain the value written by the most recently executed
instruction to write the register, but instead an older value of
the register. Although the specific behavior is always
deterministic and can be documented as a set of latency rules, for
some processors these rules are quite complex taken together. For
processors of this kind, unfortunately, assembly language
programmers have invariably been burdened with observing the
latency rules entirely on their own without any automated
verification from the programming tools that they are applying the
rules correctly. When the latency rules are complex, programmers
naturally make avoidable errors that may not be discovered until
program testing.
[0129] Latency Verification:
[0130] In FIG. 5, a process is described by which an assembler or
other programming tool can verify that a processor's latency rules
are being applied correctly by the programmer. First, an assembly
language program is annotated by the programmer as follows:
[0131] For each register read by each instruction, a syntactic
annotation is made in the program text to indicate which previous
register value the programmer expects the instruction to obtain for
the register. The lack of an annotation is either an error or
indicates a default assumption. For example, the default assumption
could be that the expected value obtained for a register is the
value written by the most recent previously executed instruction
that writes the register (i.e., the same as strict read-after-write
behavior). Whenever the programmer expects a value different from
the default assumption, an annotation is required. For example, if
the programmer expects the value obtained for a register to be the
value written by the n-th earlier instruction that writes the
register (n>1), the annotation could be that a distinctive
character be repeated n-1 times adjacent to the register denotation
in the instruction. If, for example, this distinctive character is
the dollar sign ($), then the assembly language instruction [0132]
add r1, $$r2, $r3 would indicate that the programmer expects the
value obtained for register r3 to be the value written by the
second previous instruction to write r3, and the value obtained for
register r2 to be the value written by the third previous
instruction to write r2. In the above example, the current value of
register r1, the value of register r2 from two writes ago, and the
previous value of register r3 are being added. The assembler or block 434
checks to ensure that all of these values are available by
performing the process of FIG. 5. It should be noted that the
annotation need not be a dollar sign, rather, it can be any
notation.
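The dollar-sign annotation convention can be parsed mechanically. The sketch below, using "$" as in the example above though, as noted, any notation could serve, recovers, for each register operand, which previous write the programmer expects (1 meaning the most recent write, the default):

```python
import re

def parse_annotations(instruction):
    """From text like 'add r1, $$r2, $r3', extract each register operand and
    how many writes back the programmer expects: number of '$' plus one."""
    opcode, _, operands = instruction.partition(" ")
    regs = {}
    for tok in operands.split(","):
        m = re.match(r"\s*(\$*)(\w+)\s*$", tok)
        dollars, reg = m.group(1), m.group(2)
        # No '$' -> 1 (most recent write, i.e. strict read-after-write default);
        # each '$' pushes the expectation one write further back.
        regs[reg] = len(dollars) + 1
    return opcode, regs

op, regs = parse_annotations("add r1, $$r2, $r3")
```

For the example instruction this yields an expectation of the third previous write of r2 and the second previous write of r3, matching the description above; the destination r1 is parsed like any other operand.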
[0133] Given these annotations, for each instruction, the assembler
or other programming tool automatically determines whether the
programmer's expectations are correct, by examining the sequences
of instructions that can execute previous to the given instruction
along all paths leading to the given instruction, and applying the
documented latency rules to these sequences. FIG. 5 shows one
possible manifestation of this process as part of a modified
assembler. Block 502 has an instruction n together with a sequence
of earlier instructions, as previously discussed above.
Accordingly, given the hooks placed into the compiler and assembler
of the present invention, programming is made simpler and more
programmer-friendly while maintaining low power consumption.
[0134] Block 434 determines whether the latency annotations are
correct for instruction n for this path, while block 520 performs
the other usual functions of an assembler for instruction n. In
block 508, the earlier instructions that contribute to the inputs
of instruction n are identified. Block 506 determines, from the
complete set of latency rules, those rules that are relevant to the
interaction between each earlier instruction that contributes to
the inputs of instruction n and instruction n itself. Where the
instructions that may precede instruction n are unknown (for
example, at the entrance to a subroutine), worst-case assumptions
must be made (blocks 510 and 512). Finally, wherever the annotation
made by the programmer does not match the behavior of the actual
processor, as determined by the latency rules, an error is reported
(blocks 514 and 516).
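The verification loop of FIG. 5 can be modeled in miniature. The sketch below assumes a single, invented latency rule, namely that a write becomes visible only two instructions later, and flags any read whose annotation disagrees with the value that rule says the hardware delivers; a real tool would consult the processor's full documented rule set:

```python
WRITE_LATENCY = 2  # hypothetical rule: a write is visible 2 instructions later

def verify(program):
    """program: ordered list of (written_reg_or_None, {read_reg: expected_n})
    tuples, where expected_n is the programmer's annotation (1 = most recent
    write).  Returns indices of instructions whose annotation mismatches what
    the latency rule delivers (FIG. 5's 'report error' outcome).  Reads with
    no visible prior write (unknown history) are skipped, standing in for the
    worst-case assumptions of blocks 510 and 512."""
    errors = []
    writes = {}  # reg -> list of instruction indices that wrote it
    for j, (wr, reads) in enumerate(program):
        for reg, expected_n in reads.items():
            history = writes.get(reg, [])
            # Writes actually visible at instruction j under the latency rule.
            visible = [i for i in history if j - i >= WRITE_LATENCY]
            if visible:
                # Which previous write (1 = most recent overall) is delivered.
                delivered_n = len(history) - history.index(visible[-1])
                if delivered_n != expected_n:
                    errors.append(j)
        if wr is not None:
            writes.setdefault(wr, []).append(j)
    return errors

# Two back-to-back writes of r2, then a read annotated as expecting the most
# recent write: the rule delivers the older value, so an error is reported.
errs = verify([("r2", {}), ("r2", {}), (None, {"r2": 1})])
```

Annotating the same read as expecting the second previous write instead would satisfy the rule and produce no error, illustrating how the check confirms rather than replaces the programmer's reasoning.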
[0135] Although the present invention has been described in terms
of specific embodiments, it is anticipated that alterations and
modifications thereof will no doubt become apparent to those
skilled in the art. It is therefore intended that the following
claims be interpreted as covering all such alterations and
modification as fall within the true spirit and scope of the
invention.
* * * * *