U.S. patent application number 11/528434 was filed with the patent office on 2006-09-28 and published on 2007-03-29 for systems and methods for selectively decoupling a parallel extended instruction pipeline.
This patent application is currently assigned to ARC International (UK) Limited. Invention is credited to Aris Aristodemou, Carl Norman Graham, Simon Jones, Seow Chuan Lim, Yazid Nemouchi, Kar-Lik Wong.
Application Number | 11/528434 |
Publication Number | 20070074004 |
Family ID | 37968194 |
Filed Date | 2006-09-28 |
United States Patent Application | 20070074004 |
Kind Code | A1 |
Wong; Kar-Lik; et al. | March 29, 2007 |
Systems and methods for selectively decoupling a parallel extended
instruction pipeline
Abstract
Systems and methods for selectively decoupling a parallel
extended processor pipeline. A main processor pipeline and parallel
extended pipeline are coupled via an instruction queue. The main
pipeline can instruct the parallel pipeline to execute instructions
directly or to begin fetching and executing its own instructions
autonomously. During autonomous operation of the parallel pipeline,
instructions from the main pipeline accumulate in the instruction
queue. The parallel pipeline can return to main pipeline controlled
execution through a single instruction. A lightweight mechanism, in
the form of a condition code visible to the main processor, allows
software to decide at run time, based on the queue status, whether
further instructions should be issued to the parallel extended
pipeline, thereby maximizing overall performance.
Inventors: |
Wong; Kar-Lik (Wokingham, GB); Graham; Carl Norman (London, GB);
Lim; Seow Chuan (Thatcham, GB); Jones; Simon (London, GB);
Nemouchi; Yazid (Sandhurst, GB); Aristodemou; Aris (Friern Barnet,
GB) |
Correspondence
Address: |
HUNTON & WILLIAMS LLP;INTELLECTUAL PROPERTY DEPARTMENT
1900 K STREET, N.W.
SUITE 1200
WASHINGTON
DC
20006-1109
US
|
Assignee: |
ARC International (UK)
Limited
|
Family ID: |
37968194 |
Appl. No.: |
11/528434 |
Filed: |
September 28, 2006 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
60721108 | Sep 28, 2005 | |
Current U.S.
Class: |
712/34 |
Current CPC
Class: |
G06F 9/3808 20130101;
G06F 9/30076 20130101; G06F 9/3887 20130101; G06F 9/3875 20130101;
G06F 9/3877 20130101; G06F 9/30018 20130101; G06F 13/28 20130101;
G06F 9/3885 20130101; G06F 9/3802 20130101; G06F 9/3893 20130101;
G06T 3/4007 20130101; H04N 19/43 20141101; G06F 9/30032 20130101;
H04N 19/436 20141101; H04N 19/86 20141101; H04N 19/182 20141101;
H04N 19/82 20141101; H04N 19/14 20141101; H04N 19/61 20141101; G06F
9/3897 20130101; G06F 9/3867 20130101; H04N 19/117 20141101; G06F
9/30003 20130101; H04N 19/523 20141101; H04N 19/176 20141101 |
Class at
Publication: |
712/034 |
International
Class: |
G06F 15/00 20060101
G06F015/00 |
Claims
1. A microprocessor architecture comprising: a first processor
instruction pipeline, comprising a front end portion and a rear
portion; a second processor instruction pipeline, comprising a
front end portion and a rear portion; and an instruction queue
coupling the first and second instruction pipeline between their
respective front end and rear portions.
2. The microprocessor architecture according to claim 1, wherein the
instruction queue is located in the second instruction pipeline
between that pipeline's front end and rear portions.
3. The microprocessor architecture according to claim 1, wherein
the queue is configured to store instructions issued by the first
instruction pipeline to the second instruction pipeline.
4. The microprocessor architecture according to claim 1, wherein
the first instruction pipeline is configured to be able to instruct
the second instruction pipeline to operate autonomously.
5. The microprocessor architecture according to claim 4, wherein
operating autonomously comprises fetching and executing its own
instructions via the second pipeline's front end portion.
6. The microprocessor architecture according to claim 4, wherein
operating autonomously comprises operating on a different clock
frequency than the first instruction pipeline.
7. The microprocessor architecture according to claim 5, wherein
instructions issued to the second instruction pipeline accumulate
in the queue during its autonomous operation.
8. The microprocessor architecture according to claim 7, wherein
the instruction queue comprises at least one condition code.
9. The microprocessor architecture according to claim 8, wherein
the at least one condition code comprises a code indicative of at
least one state of the queue selected from the group consisting of
queue having fewer than a predetermined number of free slots, queue
having more than a predetermined number of free slots, and queue full.
10. The microprocessor architecture according to claim 9, wherein
the first processor instruction pipeline uses the at least one
condition code to determine whether to send an instruction to the
queue or to branch to another instruction that does not require the
second instruction pipeline.
11. The microprocessor architecture according to claim 7, wherein
the second instruction pipeline is adapted to return from
autonomous operation to first instruction pipeline controlled
operation by executing a return instruction.
12. The microprocessor architecture according to claim 11, wherein
instructions accumulated in the queue are executed by the second
instruction pipeline when it returns from autonomous operation.
13. A method of dynamically decoupling a parallel extended
processor pipeline from a main processor pipeline comprising:
sending an instruction from the main processor pipeline to the
parallel extended processor pipeline instructing the parallel
extended processor pipeline to operate autonomously; operating the
parallel extended processor pipeline autonomously; storing
subsequent instructions from the main processor pipeline to the
parallel extended processor pipeline in an instruction queue;
executing an instruction with the parallel extended processor
pipeline to cease autonomous execution; and thereafter executing
instructions supplied by the main processor pipeline in the
queue.
14. The method according to claim 13, further comprising executing
an instruction on the main processor pipeline to check a condition
code of the instruction queue before sending subsequent
instructions to the queue.
15. The method according to claim 14, further comprising either
branching to another instruction that doesn't require the parallel
extended processor pipeline or sending the instruction to the
instruction queue based on the condition code.
16. The method according to claim 13, wherein operating the
parallel extended processor pipeline autonomously comprises
fetching and executing its own instructions via that pipeline's own
front end, independent of instructions fetched and executed by the
main processor pipeline.
17. The method according to claim 13, wherein operating the
parallel extended processor pipeline autonomously comprises
operating at a different clock frequency than the main processor
pipeline.
18. The method according to claim 13, wherein the main processor
pipeline continues to fetch and execute instructions while the
parallel extended processor pipeline is operating autonomously.
19. The method according to claim 13, wherein executing an
instruction with the parallel extended processor pipeline to cease
autonomous execution comprises returning from autonomous operation
to first instruction pipeline controlled operation without being
instructed to do so by the first instruction pipeline.
20. A method of performing dynamically controlled parallel
instruction processing in a microprocessor comprising: fetching and
executing instructions with a main processor pipeline; sending
instructions from the main processor pipeline to a parallel
extended processor pipeline via an instruction queue coupling the
two pipelines; and if the instruction is an instruction to be
executed by the parallel extended pipeline, executing that
instruction with the parallel extended pipeline; otherwise if the
instruction is an instruction instructing that parallel extended
pipeline to begin autonomous execution, thereafter fetching and
executing instructions autonomously with the parallel extended
pipeline independent of the main pipeline's instruction fetches,
and storing instructions from main pipeline for the parallel
extended pipeline in the instruction queue until autonomous
processing has ceased.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Patent
Application No. 60/721,108 titled "SIMD Architecture and Associated
Systems and Methods," filed Sep. 28, 2005, the disclosure of which
is hereby incorporated by reference in its entirety.
FIELD OF THE INVENTION
[0002] The invention relates generally to embedded microprocessor
architecture and more specifically to systems and methods for
selectively decoupling an extended instruction pipeline from a main
pipeline in a microprocessor-based system.
BACKGROUND OF THE INVENTION
[0003] Processor extension logic is utilized to extend a
microprocessor's capability. Typically, this logic is in parallel
to and accessible by the main processor pipeline. It is often used
to perform specific, repetitive, computationally intensive
functions thereby freeing up the main processor pipeline.
[0004] In conventional microprocessors, there are essentially two
types of parallel pipeline architectures: tightly coupled and
loosely coupled, or decoupled. In the former, instructions are
fetched and executed serially in the main processor pipeline. If
the instruction is an instruction to be processed by the extension
logic, the instruction is sent to that logic. Because every
instruction originates from the main pipeline the two pipelines are
said to be tightly coupled. This limits the degree of concurrency
exploitable between the pipelines.
[0005] In the second architecture, the parallel instruction
pipeline containing the extension logic is capable of fetching and
executing its own instructions and hence maximizing concurrency.
However, control and synchronization between the two pipelines
becomes difficult when programming a processor having such a
decoupled architecture. Thus, there exists a need for a parallel
pipeline architecture that can fully exploit the advantages of
parallelism without suffering from the design complexity of loosely
or completely decoupled pipelines.
SUMMARY OF THE INVENTION
[0006] Accordingly, at least one embodiment of the invention
provides a microprocessor architecture. The microprocessor
architecture according to this embodiment comprises a first
processor instruction pipeline, comprising a front end portion and
a rear portion, a second processor instruction pipeline, comprising
a front end portion and a rear portion, and an instruction queue
coupling the first and second instruction pipeline between their
respective front end and rear portions.
[0007] Another embodiment of the invention provides a method of
dynamically decoupling a parallel extended processor pipeline from
a main processor pipeline. The method according to this embodiment
comprises sending an instruction from the main processor pipeline
to the parallel extended processor pipeline instructing the
parallel extended processor pipeline to operate autonomously,
operating the parallel extended processor pipeline autonomously,
storing subsequent instructions from the main processor pipeline to
the parallel extended processor pipeline in an instruction queue,
executing an instruction with the parallel extended processor
pipeline to cease autonomous execution, and thereafter executing
instructions supplied by the main processor pipeline in the
queue.
[0008] Still a further embodiment of the invention provides a
method of performing dynamically controlled parallel instruction
processing in a microprocessor. The method according to this
embodiment comprises fetching and executing instructions with a
main processor pipeline, sending instructions from the main
processor pipeline to a parallel extended processor pipeline via an
instruction queue coupling the two pipelines, and if the
instruction is an instruction to be executed by the parallel
extended pipeline, executing that instruction with the parallel
extended pipeline, otherwise if the instruction is an instruction
instructing that parallel extended pipeline to begin autonomous
execution, thereafter fetching and executing instructions
autonomously with the parallel extended pipeline independent of the
main pipeline's instruction fetches, and storing instructions from
main pipeline for the parallel extended pipeline in the instruction
queue until autonomous processing has ceased.
[0009] These and other embodiments and advantages of the present
invention will become apparent from the following detailed
description, taken in conjunction with the accompanying drawings,
illustrating by way of example the principles of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] In order to facilitate a fuller understanding of the present
disclosure, reference is now made to the accompanying drawings, in
which like elements are referenced with like numerals. These
drawings should not be construed as limiting the present
disclosure, but are intended to be exemplary only.
[0011] FIG. 1 is a functional block diagram illustrating a
microprocessor-based system including a main processor core and a
SIMD media accelerator according to at least one embodiment of the
invention;
[0012] FIG. 2 is a block diagram illustrating a conventional
multistage microprocessor pipeline having a pair of parallel data
paths;
[0013] FIG. 3 is a block diagram illustrating another conventional
multiprocessor design having a pair of parallel processor
pipelines;
[0014] FIG. 4 is a block diagram illustrating a dynamically
decoupleable multi-stage microprocessor pipeline according to at
least one embodiment of the invention;
[0015] FIG. 5 is a flow chart detailing the steps of a method for
sending instructions for operating a main processor pipeline and an
extended processor pipeline according to at least one embodiment of
the invention; and
[0016] FIG. 6 is a flow chart detailing the steps of a method for
dynamically decoupling an extended processor pipeline from a main
pipeline according to at least one embodiment of the invention.
DETAILED DESCRIPTION
[0017] The following description is intended to convey a thorough
understanding of the embodiments described by providing a number of
specific embodiments and details involving microprocessor
architecture and systems and methods for selectively decoupling an
extended instruction pipeline from a main instruction pipeline. It
should be appreciated, however, that the present invention is not
limited to these specific embodiments and details, which are
exemplary only. It is further understood that one possessing
ordinary skill in the art, in light of known systems and methods,
would appreciate the use of the invention for its intended purposes
and benefits in any number of alternative embodiments, depending
upon specific design and other needs.
[0018] Referring now to FIG. 1, a functional block diagram
illustrating a microprocessor-based system 5 including a main
processor core 10 and a SIMD media accelerator 50 according to at
least one embodiment of the invention is depicted. The diagram
illustrates a microprocessor 5 comprising a standard single
instruction single data (SISD) processor core 10 having a multistage
instruction pipeline 12 and a SIMD media engine 50. In various
embodiments, the processor core 10 may be a processor core such as
the ARC 700 embedded processor core available from ARC International
of Elstree, United Kingdom, and as described in provisional patent
application No. 60/572,238 filed May 19, 2004 entitled
"Microprocessor Architecture," which is hereby incorporated by
reference in its entirety. Alternatively, in various embodiments,
the processor core may be a different processor core.
[0019] In various embodiments, a single instruction issued by the
processor pipeline 12 may cause up to 16 16-bit elements to be
operated on in parallel through the use of the 128-bit data path 55
in the media engine 50. In various embodiments, the SIMD engine 50
utilizes closely coupled memory units. In various embodiments, the
SIMD data memory 52 (SDM) is a 128-bit wide data memory that
provides low latency access to and from the 128-bit vector register
file 51. The SDM contents are transferable to and from system main
memory via a DMA unit 54 thereby freeing up the processor core 10
and the SIMD core 50. In various embodiments, a SIMD code memory 56
(SCM) allows the SIMD unit to fetch instructions from a localized
code memory, allowing the SIMD pipeline to dynamically decouple
from the processor core 10 resulting in truly parallel operation
between the processor core and SIMD media engine as will be
discussed in greater detail in the context of FIGS. 4-6.
[0020] Therefore, in various embodiments, the microprocessor
architecture will permit the processor-based system 5 to operate in
both closely coupled and decoupled modes of operation. In the
closely coupled mode of operation, the SIMD program code fetch is
exclusively handled by the main processor core 10. In the decoupled
mode of operation, the SIMD pipeline 50 executes code fetched from
a local memory 56 independent of the processor core 10. The
processor core 10 may therefore instruct the SIMD pipeline 50 to
execute autonomously in this decoupled mode, for example, to
perform media tasks such as audio processing, entropy
encoding/decoding, discrete cosine transforms (DCTs) and inverse
DCTs, motion compensation, and de-block filtering.
[0021] Referring now to FIG. 2, a block diagram illustrating a
conventional multistage microprocessor pipeline having a pair of
parallel data paths is depicted. In a microprocessor employing a
variable-length pipeline, data paths required to support different
instructions typically have a different number of stages. Data
paths supporting specialized extension instructions for performing
digital signal processing or other complex but repetitive functions
may be used only some of the time during processor execution and
remain idle otherwise. Thus, whether or not these instructions are
currently needed will affect the number of effective stages in the
processor pipeline.
[0022] Extending a general-purpose microprocessor with application
specific extension instructions can often add significant length to
the instruction pipeline. In the pipeline of FIG. 2, pipeline
stages F1 to F4 at the front end 100 of the processor pipeline are
responsible for functions such as instruction fetch, decode and
issue. These pipeline stages are used to handle all instructions
issued by the microprocessor. After these stages, the pipeline
splits into parallel data paths 110 and 115 incorporating stages
E1-E3 and D1-D4 respectively. These parallel sub-paths represent
pipeline stages used to support different instructions/data
operations. For example, stages E1-E3 may be the primary/default
processor pipeline, while stages D1-D4 comprise the extended
pipeline designed for processing specific instructions. This type
of architecture can be characterized as coupled or tightly coupled
to the extent that regardless of whether instructions are destined
for default pipeline stages E1-E3 or extended pipeline D1-D4, they
all must pass through stages F1-F4, until a decision is made as to
which portion of the pipeline will perform the remaining processing
steps.
[0023] By using the single pipeline front-end to fetch and issue
all instructions, the processor pipeline of FIG. 2 achieves the
advantage that instructions can be freely intermixed,
irrespective of whether the instructions are executed by the data
path in sub-paths E1-E3 or D1-D4. Thus, all instructions appear as
a single thread of program execution. This type of pipeline
architecture also has the advantage of greatly simplified program
design and debugging, thereby reducing the time to market in
product developments. It is admittedly a highly flexible
architecture. However, a limitation of this architecture is that
the sequential nature of instruction execution significantly limits
the exploitable parallelism between the data paths that could
otherwise be used to improve overall performance. This negatively
affects performance relative to other parallel pipeline
architectures.
[0024] FIG. 3 is a block diagram illustrating another conventional
multiprocessor architecture having a pair of parallel instruction
pipelines. The processor pipeline of FIG. 3 contains a front end
120 comprised of stages F1-F4 and a rear portion 125 comprised of
stages E1-E3. However, the processor also contains a parallel data
path having a front end 135 comprised of front end stages G1-G2 and
rear portion 140 comprised of stages D1-D4. Unlike the architecture
of FIG. 2, this architecture contains truly parallel pipelines to
the extent that front portions 120 and 135 can each fetch
instructions separately. This type of parallel architecture may be
characterized as loosely coupled or decoupled because the
application specific extension data path G1-G2 and D1-D4 is
autonomous and can execute instructions in parallel to the main
pipeline consisting of F1-F4 and E1-E3. This arrangement enhances
exploitable parallelism over the architecture depicted in FIG. 2.
However, as the two parallel pipelines become independent,
mechanisms are required to synchronize their operations, as
represented by dashed line 130. These mechanisms, typically
implemented using specific instructions and bus structures, are
often not a natural part of a program and are inserted as
afterthoughts to "fix" the disconnect between the main pipeline and
the extended pipeline. As a consequence, the resulting program
utilizing both instruction pipelines becomes difficult to design
and optimize.
[0025] Referring now to FIG. 4, a block diagram illustrating a
dynamically decoupleable multi-stage microprocessor pipeline
according to at least one embodiment of the invention is provided.
The pipeline architecture according to this embodiment ameliorates
at least some and preferably most or all of the above-noted
limitations of conventional parallel pipeline architectures. This
exemplary pipeline depicted in FIG. 4 consists of a front end
portion 145 comprising stages F1-F4, a rear portion 150 comprising
stages E1-E3, and a parallel extendible pipeline having a front
portion 160 comprising stages G1-G2 and a rear portion 165
comprising stages D1-D4. In the pipeline depicted in FIG. 4,
instructions can be issued from the CPU to the extendible pipeline
D1 to D4. To decouple the extendible pipeline D1 to D4 from the
front portion 145 of the main pipeline F1 to F4, a queue 155 is
added between the two pipelines. The queue serves to delay
execution of instructions issued by the front end portion 145 of
the main pipeline if the extension pipeline is not ready. A
tradeoff can be made during system design to decide on how many
entries should be in the queue 155 to ensure that the extension
pipeline is sufficiently decoupled from the main pipeline.
Additionally, in various embodiments, the main pipeline can issue a
Sequence Run (vrun) instruction to instruct the extension pipeline
to use its own front end 160, G1 to G2 in the diagram, to execute
instruction sequences stored in a record memory 156, causing the
extension pipeline to fetch and execute instructions autonomously.
In various embodiments, while the extension pipeline, G1-G2 and
D1-D4, is performing operations, the main pipeline can keep issuing
extension instructions that accumulate in the queue 155 until the
extension pipeline executes a Sequence Record End (vendrec)
instruction. After the vendrec instruction is executed, the
extension pipeline resumes executing instructions issued to the
queue 155.
[0026] Therefore, instead of trying to get what effectively becomes
two independent processors to work together as in the pipeline
depicted in FIG. 3, the pipeline depicted in FIG. 4 is designed to
switch between being coupled, that is, executing instructions
issued by the main pipeline front end 145, and being decoupled, that
is, executing autonomously in the extended pipeline. As such, the
instructions vrun and vendrec, which dynamically switch the
pipeline between the coupling states, can be designed to be light
weight, executing in, for example, a single cycle. These
instructions can then be seen as parallel analogs of the
conventional call and return instructions. That is, when
instructing the extension pipeline to fetch and execute
instructions autonomously, the main processor pipeline is issuing a
parallel function call that runs concurrently with its own thread
of instruction execution to maximize speedup of the application.
The two threads of instruction execution eventually join back into
one after the extension pipeline executes the vendrec instruction
which is the last instruction of the program thread autonomously
executed by the extension pipeline.
[0027] In addition to efficient operation, another advantage of
this architecture is that during debugging, such as, for example,
instruction stepping, the two parallel threads can be forced to be
serialized such that the CPU front portion 145 will not issue any
instruction after issuing vrun to the extension pipeline until the
latter fetches and executes the vendrec instruction. In various
embodiments, this will give the programmer the view of a single
program thread that has the same functional behavior of the
parallel program when executed normally and hence will greatly
simplify the task of debugging.
[0028] Another advantage of the processor pipeline containing a
parallel extendible pipeline that can be dynamically coupled and
decoupled is the ability to use two separate clock domains. In low
power applications, it is often necessary to run specific parts of
the integrated circuit at varying clock frequencies, in order to
reduce and/or minimize power consumption. Using dynamic decoupling,
the front end portion 145 of the main pipeline can utilize an
operating clock frequency different from that of the parallel
pipeline 165 of stages D1-D4 with the primary clock partitioning
occurring naturally at the queue 155, labeled Q in FIG. 4.
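The effect of running the two sides of the queue at different clock rates can be explored with a small behavioral simulation. The sketch below is an assumption-laden illustration: it models the extension side as consuming one queue entry every `divider` main-clock cycles, ignores clock-domain-crossing logic entirely, and uses the hypothetical names simulate, queue_depth, and divider.

```python
from collections import deque


def simulate(issue_count, queue_depth, divider):
    """Return (main_cycles, stall_cycles, peak_occupancy).

    The main pipeline tries to issue one extension instruction per main
    clock cycle. The extension side removes one entry every `divider`
    main cycles (assumed clock ratio). A full queue stalls the issuer.
    """
    queue = deque()
    issued = stalls = peak = cycle = 0
    while issued < issue_count or queue:
        cycle += 1
        # Extension clock edge: consume one entry if any is waiting.
        if cycle % divider == 0 and queue:
            queue.popleft()
        # Main clock edge: issue if there is room, otherwise stall.
        if issued < issue_count:
            if len(queue) < queue_depth:
                queue.append(issued)
                issued += 1
            else:
                stalls += 1
        peak = max(peak, len(queue))
    return cycle, stalls, peak


deep = simulate(issue_count=8, queue_depth=4, divider=2)
shallow = simulate(issue_count=8, queue_depth=2, divider=2)
```

Under these assumptions, a four-entry queue absorbs a burst of eight instructions with no stall cycles at a 1/2 clock ratio, while a two-entry queue stalls the issuer for four cycles.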
[0029] Referring now to FIG. 5, a flow chart of an exemplary method
for sending instructions from a main processor pipeline to an
extended processor pipeline according to at least one embodiment of
the invention is depicted. Operation of the method begins in step
200 and proceeds to step 205, where an instruction is fetched by
the main processor pipeline. In step 210, because the instruction
is determined to be one for processing by the parallel extended
pipeline, the instruction is passed from the main pipeline to the
parallel extended pipeline via an instruction queue coupling the
two pipelines. In various embodiments, if the parallel extended
pipeline is currently processing instructions from the queue, that
instruction will be processed in turn by the parallel extended
pipeline as specified in step 220. Otherwise, the instruction will
remain in the queue until the parallel extended pipeline has ceased
its autonomous operation. In step 225, while the instruction is
either sitting in the queue or being processed by the parallel
pipeline, the main pipeline is able to continue processing
instructions. The queue provides a mechanism for the main pipeline
to offload instructions to the parallel extended pipeline without
stalling the main pipeline. Operation of the method stops in step
230.
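The flow of FIG. 5 can be illustrated with a short routing sketch. This is a hedged example, not the disclosed implementation: it assumes an unbounded queue and identifies extension instructions by a hypothetical "v" mnemonic prefix. The function name run_main_pipeline and all mnemonics are illustrative.

```python
from collections import deque

EXTENSION_PREFIX = "v"  # assumption: extension mnemonics start with "v"


def run_main_pipeline(program, queue):
    """Fetch each instruction and route it (steps 205-225 of FIG. 5)."""
    executed_by_main = []
    for instruction in program:                   # step 205: fetch
        if instruction.startswith(EXTENSION_PREFIX):
            queue.append(instruction)             # step 210: pass via the queue
        else:
            executed_by_main.append(instruction)  # step 225: main continues
    return executed_by_main


queue = deque()
main_log = run_main_pipeline(["add", "vmul", "ld", "vadd", "st"], queue)
```

The main pipeline never waits on the extension pipeline in this sketch; the queued instructions simply remain in order until the extension pipeline retrieves them.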
[0030] FIG. 6 is a flow chart of an
exemplary method for dynamically decoupling an extended processor
pipeline from a main pipeline according to at least one embodiment
of the invention. Operation of the method begins in step 300 and
proceeds to step 305 where the main processor pipeline sends a run
instruction to the parallel extended pipeline via the instruction
queue coupling the pipelines. In step 310, the parallel pipeline
retrieves the run instruction from the queue. As noted above, this
may occur instantly or after the parallel pipeline has retrieved
and processed other instructions in front of the run instruction in
the queue. In various embodiments, this run instruction will
specify the starting location, in a record memory accessible by the
parallel extended pipeline, of a sequence of recorded instructions.
Next, in step 315, based on receipt of the run
instruction, the parallel extended pipeline begins executing the
series of recorded instructions, that is, it begins autonomous
operation. In various embodiments this comprises fetching and
executing its own instructions independent of the main pipeline's
instruction stack. Also, in various embodiments, the parallel
extended pipeline may operate at a different clock frequency than
the main pipeline, such as, for example, a fraction of the main
pipeline's frequency (e.g., 1/2, 1/4, etc.). Concurrently with the
parallel extended pipeline's
autonomous execution, the main processor pipeline can continue
sending instructions to the parallel extended pipeline as depicted
in step 320. Then, in step 325, after the parallel pipeline has
processed an end instruction recorded at the end of the sequence of
recorded instructions, autonomous operation of that pipeline
ceases. In step 330, the parallel pipeline returns to the queue to
process any queued instructions received from the main pipeline. In
step 335, the parallel extended pipeline continues processing
instructions issued by the main pipeline that appear in the queue
until an instruction to begin autonomous operation is received.
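The method of FIG. 6 can also be viewed as a two-state machine. The sketch below is one possible reading, with hypothetical names (Mode, next_mode, trace) and placeholder instruction strings. It tracks only the mode of the parallel extended pipeline as it sees each instruction, not the queue contents.

```python
from enum import Enum


class Mode(Enum):
    COUPLED = "executing queued instructions from the main pipeline"
    AUTONOMOUS = "executing a recorded sequence"


def next_mode(mode, instruction):
    """Mode after the extended pipeline processes one instruction."""
    if mode is Mode.COUPLED and instruction == "run":
        return Mode.AUTONOMOUS   # steps 310-315: begin autonomous operation
    if mode is Mode.AUTONOMOUS and instruction == "end":
        return Mode.COUPLED      # steps 325-330: return to the queue
    return mode                  # any other instruction leaves the mode unchanged


def trace(instructions, mode=Mode.COUPLED):
    """Return the mode name after each instruction in order."""
    names = []
    for instruction in instructions:
        mode = next_mode(mode, instruction)
        names.append(mode.name)
    return names


# "vadd" and "vsub" come from the queue; "rec1" and "rec2" are recorded.
modes = trace(["vadd", "run", "rec1", "rec2", "end", "vsub"])
```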
[0031] As discussed above, in the microprocessor architecture
according to the various embodiments of the invention, a main
processor pipeline is extended through a dynamically coupled
parallel SIMD instruction pipeline. In various embodiments, the
main processor pipeline may issue instructions to the extended
pipeline through an instruction queue that effectively decouples
the extended pipeline. In various embodiments, the extended SIMD
pipeline is also able to run prerecorded macros that are stored in
a local SIMD instruction memory so that a single macro instruction
sent to the SIMD pipeline via the queue allows many pre-determined
instructions to be executed as discussed in commonly assigned U.S.
patent applications XX/XXX,XXX titled, "Systems and Methods for
Recording Instructions Sequences in a Microprocessor Having a
Dynamically Decoupleable Extended Instruction Pipeline," filed
concurrently herewith, the disclosure of which is hereby
incorporated by reference in its entirety. This architecture, among
other things, allows the SIMD media engine (the extended pipeline)
to operate in parallel with the primary pipeline (processor core)
and allows the processor core to operate far in advance of the
parallel SIMD pipeline.
[0032] One consideration of using an instruction queue to decouple
the extended SIMD pipeline from the processor core (main pipeline)
is that it becomes possible for the processor core to issue too
many instructions causing the queue to become full. When the main
processor pipeline can no longer issue instructions to the queue,
the pipeline will have to stall until the queue frees up a slot for
the instruction that caused the pipeline to stall. Pipeline stalls
have a negative effect on overall system performance. In this case
in particular, a pipeline stall means that the processor core will
stop being able to operate in parallel, therefore negating the
gains derived from the dynamically decoupled extended parallel SIMD
pipeline.
[0033] Therefore, in order to prevent the main processor pipeline
from issuing instructions to the queue when it is full, thereby
causing the main pipeline to stall, in various embodiments, the
SIMD pipeline queue uses condition codes to notify the processor
pipeline of the condition of the queue. In various embodiments, the
SIMD queue sets a condition code of QF for queue nearly full
whenever there are fewer than a predetermined number of empty slots
remaining in the queue. In various embodiments, this number may be
16. However, in various embodiments, the number may be different
than 16. In various embodiments, the SIMD queue sets a condition
code of QNF as the opposite of QF when more than the predetermined
number of slots remain available.
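As a minimal sketch of how such condition codes might be derived, the following Python function computes QF and QNF from the queue occupancy. The function name and the depth and occupied parameters are hypothetical. Treating QNF as the strict complement of QF, including the case where exactly the predetermined number of slots remain, is an assumption of the sketch.

```python
def queue_condition_codes(depth, occupied, threshold=16):
    """Return (QF, QNF) for a queue of `depth` slots with `occupied` in use.

    `threshold` is the predetermined number of empty slots; 16 is the
    example value given in the text.
    """
    if not 0 <= occupied <= depth:
        raise ValueError("occupied must be between 0 and depth")
    empty_slots = depth - occupied
    qf = empty_slots < threshold  # nearly full: fewer than threshold slots left
    qnf = not qf                  # assumption: QNF is the complement of QF
    return qf, qnf
```

For example, under these assumptions a 64-slot queue holding 49 instructions has 15 empty slots and would report QF, whereas the same queue holding 10 instructions would report QNF.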
[0034] In various embodiments, rather than using several
instructions to load these status values and test the value before
branching on the test result, two conditional branch instructions
using these condition codes directly test for such conditions,
thereby reducing the number of instructions required to perform
this task. In various embodiments, these instructions will only
branch when the condition code used is set. In various embodiments,
these instructions may have the mnemonic "BQF" for branch when
queue is nearly full and "BQNF" for branch when queue is not nearly
full. Such condition codes make the queue full status an integral
part of the main processor programming model and make it possible
to make frequent light-weight intelligent decisions by software to
maximize overall performance. These condition codes are maintained
by the queue itself based on the queue's status. The instructions that
check the condition codes are branch instructions that are specified
to check the particular condition codes. In various embodiments of
the invention, checking of the condition code is done by placing
condition code checking branch instructions where necessary, such
as before issuing any instructions to the extended pipeline. Thus,
the condition codes provide an easy mechanism for preventing main
pipeline stalls caused by trying to issue instructions to a full
queue.
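The branch behavior described above can be sketched as a next-program-counter function. This is an illustrative model only: the function name, the fixed 4-byte instruction size, and the representation of QNF as the complement of the QF flag are assumptions and do not describe the actual instruction encoding.

```python
def next_pc(mnemonic, pc, target, qf):
    """Return the next program counter after a BQF or BQNF branch.

    `qf` is the queue-nearly-full condition code as maintained by the
    queue; the branch is taken only when its condition code is set.
    """
    instruction_size = 4  # assumed size, for illustration only
    if mnemonic == "BQF":
        taken = qf
    elif mnemonic == "BQNF":
        taken = not qf  # assumption: QNF is the complement of QF
    else:
        raise ValueError(f"unsupported mnemonic: {mnemonic}")
    return target if taken else pc + instruction_size
```

A single call models one branch instruction, in contrast to a sequence that would first load the queue status, then test it, and only then branch on the result.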
[0035] These two conditional branch instructions allow the main
processor pipeline to regularly check the status of the queue
before issuing more instructions into the extended SIMD pipeline
queue. The main processor core can use these instructions to avoid
stalling the processor when the queue is full or nearly full, and
branch to another task that does not involve the SIMD engine until
these queue conditions change. Therefore, these instructions
provide the processor with an effective and relatively low overhead
means of scheduling work load on the available resources while
preventing main pipeline stalls.
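The scheduling pattern described above may be illustrated with a small cycle-level simulation in Python. It is a sketch under stated assumptions rather than a description of any actual implementation: the function name, the default queue depth, threshold and drain period, and the notion of a unit of other work that takes one cycle are all hypothetical.

```python
from collections import deque


def schedule(simd_work, other_work, depth=8, threshold=2, drain_period=3):
    """Issue `simd_work` instructions, doing other work while QF is set.

    Returns (stall_cycles, other_units_done, simd_instructions_issued).
    """
    queue = deque()
    issued = other_done = stalls = cycle = 0
    while issued < simd_work or queue:
        # Extended pipeline retires one instruction every drain_period cycles.
        if queue and cycle % drain_period == 0:
            queue.popleft()
        qf = (depth - len(queue)) < threshold  # queue-nearly-full code
        if issued < simd_work:
            if qf and other_done < other_work:
                other_done += 1        # branch taken: run a non-SIMD task
            elif len(queue) < depth:
                queue.append(issued)   # slot available: issue to the queue
                issued += 1
            else:
                stalls += 1            # queue full and no other work: stall
        cycle += 1
    return stalls, other_done, issued
```

In this model, when enough other work is available the main pipeline never issues into a full queue and so records no stall cycles; with no other work available, the same workload fills the queue and stalls.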
[0036] The embodiments of the present inventions are not to be
limited in scope by the specific embodiments described herein. For
example, although many of the embodiments disclosed herein have
been described with reference to systems and methods for dynamically
decoupling a parallel pipeline in a microprocessor-based system having a main
instruction pipeline and an extended instruction pipeline, the
principles herein are equally applicable to other aspects of
microprocessor design and function. Indeed, various modifications
of the embodiments of the present inventions, in addition to those
described herein, will be apparent to those of ordinary skill in
the art from the foregoing description and accompanying drawings.
Thus, such modifications are intended to fall within the scope of
the following appended claims. Further, although some of the
embodiments of the present invention have been described herein in
the context of a particular implementation in a particular
environment for a particular purpose, those of ordinary skill in
the art will recognize that their usefulness is not limited thereto
and that the embodiments of the present inventions can be
beneficially implemented in any number of environments for any
number of purposes. Accordingly, the claims set forth below should
be construed in view of the full breadth and spirit of the
embodiments of the present inventions as disclosed herein.
* * * * *