U.S. patent application number 15/620837 was filed with the patent office on 2017-06-13 and published on 2017-09-28 for run-time code parallelization with monitoring of repetitive instruction sequences during branch mis-prediction.
The applicant listed for this patent is CENTIPEDE SEMI LTD. Invention is credited to Jonathan Friedmann, Shay Koren, Alberto Mandler, Noam Mizrahi.
Application Number: 20170277544 (Appl. No. 15/620837)
Document ID: /
Family ID: 54063505
Publication Date: 2017-09-28

United States Patent Application 20170277544
Kind Code: A1
Mizrahi; Noam; et al.
September 28, 2017
Run-Time Code Parallelization with Monitoring of Repetitive
Instruction Sequences During Branch Mis-Prediction
Abstract
A processor includes an execution pipeline and monitoring circuitry. The execution pipeline is configured to execute instructions of program code. The monitoring circuitry is configured to monitor the instructions in a segment of a repetitive sequence of the instructions so as to construct a specification of register access by the monitored instructions, to parallelize execution of the repetitive sequence based on the specification, and to terminate monitoring of the instructions and discard the specification in response to detecting a branch mis-prediction in the monitored instructions.
Inventors: Mizrahi; Noam (Hod Hasharon, IL); Mandler; Alberto (Zichron Yaakov, IL); Koren; Shay (Tel-Aviv, IL); Friedmann; Jonathan (Even Yehuda, IL)

Applicant: CENTIPEDE SEMI LTD., Netanya, IL
Family ID: 54063505
Appl. No.: 15/620837
Filed: June 13, 2017
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
PCT/IB2015/059467 | Dec 9, 2015 |
15620837 | |
14583119 | Dec 25, 2014 | 9135015
PCT/IB2015/059467 | |
Current U.S. Class: 1/1
Current CPC Class: G06F 9/3851 20130101; G06F 11/3065 20130101; G06F 11/1402 20130101; G06F 11/3055 20130101; G06F 11/30 20130101; G06F 9/3808 20130101; G06F 9/3861 20130101; G06F 11/3466 20130101; G06F 9/3838 20130101; G06F 11/302 20130101
International Class: G06F 9/38 20060101 G06F009/38; G06F 11/30 20060101 G06F011/30
Claims
1. A processor, comprising: an execution pipeline, which is configured to execute instructions of program code; and monitoring circuitry, which is configured to monitor the instructions in a segment of a repetitive sequence of the instructions so as to construct a specification of register access by the monitored instructions, to parallelize execution of the repetitive sequence based on the specification, and to terminate monitoring of the instructions and discard the specification in response to detecting a branch mis-prediction in the monitored instructions.
2. The processor according to claim 1, wherein the monitoring circuitry is further configured to generate a flow-control trace traversed by the monitored instructions, and to correct the flow-control trace so as to compensate for the branch mis-prediction.

3. The processor according to claim 1, wherein the monitoring circuitry is configured to continue monitoring the instructions during parallelized execution.

4. The processor according to claim 1, wherein the monitoring circuitry is configured to continue to monitor the instructions and construct the specification after discarding the specification.

5. The processor according to claim 1, wherein the monitoring circuitry is configured to generate a flow-control trace of the monitored instructions based on an output of a fetch unit in the execution pipeline.

6. The processor according to claim 1, wherein the monitoring circuitry is configured to generate a flow-control trace of the monitored instructions based on an output of a decoding unit in the execution pipeline.

7. The processor according to claim 1, wherein the monitoring circuitry is configured to generate a flow-control trace of the monitored instructions based on outputs of both a fetch unit and a decoding unit in the execution pipeline.

8. The processor according to claim 1, wherein the monitoring circuitry is configured to record in the specification a location in the sequence of a last write operation to a register, based on an output of a fetch unit in the execution pipeline.

9. The processor according to claim 1, wherein the monitoring circuitry is configured to record in the specification a location in the sequence of a last write operation to a register, based on the instructions being executed in the execution pipeline.

10. The processor according to claim 1, wherein the monitoring circuitry is configured to record in the specification a location in the sequence of a last write operation to a register, based on the instructions that are committed and are not flushed due to the branch mis-prediction.

11. The processor according to claim 1, wherein the monitoring circuitry is configured to collect the register access only after evaluating respective branch conditions of conditional branch instructions of the sequence.

12. The processor according to claim 1, wherein the monitoring circuitry is configured to generate a flow-control trace for the monitored instructions, including for a branch instruction that is not known to a branch prediction unit of the processor.
13. A processor, comprising: an execution pipeline, which is configured to execute instructions of program code; and monitoring circuitry, which is configured to monitor the instructions in a segment of a repetitive sequence of the instructions so as to construct a specification of register access by the monitored instructions, to parallelize execution of the repetitive sequence based on the specification, and to retain the specification in the processor only provided that no branch mis-prediction is detected in the monitored instructions.
14. A method, comprising: in a processor that executes instructions
of program code, monitoring the instructions in a segment of a
repetitive sequence of the instructions so as to construct a
specification of register access by the monitored instructions;
parallelizing execution of the repetitive sequence based on the
specification; and in response to detecting a branch mis-prediction
in the monitored instructions, terminating monitoring of the
instructions and discarding the specification.
15. The method according to claim 14, wherein monitoring the
instructions further comprises generating a flow-control trace
traversed by the monitored instructions, and comprising correcting
the flow-control trace so as to compensate for the branch
mis-prediction.
16. The method according to claim 14, and comprising continuing
monitoring the instructions during parallelized execution.
17. The method according to claim 14, and comprising continuing to
monitor the instructions and construct the specification after
discarding the specification.
18. The method according to claim 14, wherein monitoring the
instructions comprises generating a flow-control trace of the
monitored instructions based on an output of a fetch unit in an
execution pipeline of the processor.
19. The method according to claim 14, wherein monitoring the
instructions comprises generating a flow-control trace of the
monitored instructions based on an output of a decoding unit in an
execution pipeline of the processor.
20. The method according to claim 14, wherein monitoring the
instructions comprises generating a flow-control trace of the
monitored instructions based on outputs of both a fetch unit and a
decoding unit in an execution pipeline of the processor.
21. The method according to claim 14, wherein monitoring the
instructions comprises recording in the specification a location in
the sequence of a last write operation to a register, based on an output
of a fetch unit in an execution pipeline of the processor.
22. The method according to claim 14, wherein monitoring the
instructions comprises recording in the specification a location in
the sequence of a last write operation to a register, based on the
instructions being executed in an execution pipeline of the
processor.
23. The method according to claim 14, wherein monitoring the
instructions comprises recording in the specification a location in
the sequence of a last write operation to a register, based on the
instructions that are committed and are not flushed due to the
branch mis-prediction.
24. The method according to claim 14, wherein monitoring the
instructions comprises collecting the register access only after
evaluating respective branch conditions of conditional branch
instructions of the sequence.
25. The method according to claim 14, wherein monitoring the
instructions comprises generating a flow-control trace for the
monitored instructions, including for a branch instruction that is
not known to a branch prediction unit of the processor.
26. A method, comprising: in a processor that executes instructions
of program code, monitoring the instructions in a segment of a
repetitive sequence of the instructions so as to construct a
specification of register access by the monitored instructions;
parallelizing execution of the repetitive sequence based on the
specification; and retaining the specification in the processor
only provided that no branch mis-prediction is detected in the
monitored instructions.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation-in-part of PCT
Application PCT/IB2015/059467, filed Dec. 9, 2015, which claims
priority from U.S. patent application Ser. No. 14/583,119, filed
Dec. 25, 2014. The disclosures of these related applications are
incorporated herein by reference.
FIELD OF THE INVENTION
[0002] The present invention relates generally to processor design,
and particularly to methods and systems for run-time code
parallelization.
BACKGROUND OF THE INVENTION
[0003] Various techniques have been proposed for dynamically
parallelizing software code at run-time. For example, Akkary and
Driscoll describe a processor architecture that enables dynamic
multithreading execution of a single program, in "A Dynamic
Multithreading Processor," Proceedings of the 31st Annual
International Symposium on Microarchitecture, December, 1998,
which is incorporated herein by reference.
[0004] Marcuello et al. describe a processor microarchitecture
that simultaneously executes multiple threads of control obtained
from a single program by means of control speculation techniques
that do not require compiler or user support, in "Speculative
Multithreaded Processors," Proceedings of the 12th International
Conference on Supercomputing, 1998, which is incorporated herein by
reference.
[0005] Marcuello and Gonzalez present a microarchitecture that
spawns speculative threads from a single-thread application at
run-time, in "Clustered Speculative Multithreaded Processors,"
Proceedings of the 13th International Conference on
Supercomputing, 1999, which is incorporated herein by
reference.
[0006] In "A Quantitative Assessment of Thread-Level Speculation
Techniques," Proceedings of the 14th International Parallel
and Distributed Processing Symposium, 2000, which is incorporated
herein by reference, Marcuello and Gonzalez analyze the benefits of
different thread speculation techniques and the impact of value
prediction, branch prediction, thread initialization overhead and
connectivity among thread units.
[0007] Ortiz-Arroyo and Lee describe a multithreading architecture
called Dynamic Simultaneous Multithreading (DSMT) that executes
multiple threads from a single program on a simultaneous
multithreading processor core, in "Dynamic Simultaneous
Multithreaded Architecture," Proceedings of the 16th
International Conference on Parallel and Distributed Computing
Systems (PDCS'03), 2003, which is incorporated herein by
reference.
SUMMARY OF THE INVENTION
[0008] An embodiment of the present invention that is described herein provides a processor including an execution pipeline and monitoring circuitry. The execution pipeline is configured to execute instructions of program code. The monitoring circuitry is configured to monitor the instructions in a segment of a repetitive sequence of the instructions so as to construct a specification of register access by the monitored instructions, to parallelize execution of the repetitive sequence based on the specification, and to terminate monitoring of the instructions and discard the specification in response to detecting a branch mis-prediction in the monitored instructions.
[0009] In an embodiment, the monitoring circuitry is further configured to generate a flow-control trace traversed by the monitored instructions, and to correct the flow-control trace so as to compensate for the branch mis-prediction. In another embodiment, the monitoring circuitry is configured to continue monitoring the instructions during parallelized execution. In yet another embodiment, the monitoring circuitry is configured to continue to monitor the instructions and construct the specification after discarding the specification.
[0010] In an example embodiment, the monitoring circuitry is configured to generate a flow-control trace of the monitored instructions based on an output of a fetch unit in the execution pipeline. In another embodiment, the monitoring circuitry is configured to generate a flow-control trace of the monitored instructions based on an output of a decoding unit in the execution pipeline. In yet another embodiment, the monitoring circuitry is configured to generate a flow-control trace of the monitored instructions based on outputs of both a fetch unit and a decoding unit in the execution pipeline.
[0011] In some embodiments, the monitoring circuitry is configured to record in the specification a location in the sequence of a last write operation to a register, based on an output of a fetch unit in the execution pipeline. In other embodiments, the monitoring circuitry is configured to record in the specification a location in the sequence of a last write operation to a register, based on the instructions being executed in the execution pipeline. In still other embodiments, the monitoring circuitry is configured to record in the specification a location in the sequence of a last write operation to a register, based on the instructions that are committed and are not flushed due to the branch mis-prediction.
[0012] In some embodiments, the monitoring circuitry is configured to collect the register access only after evaluating respective branch conditions of conditional branch instructions of the sequence. In some embodiments, the monitoring circuitry is configured to generate a flow-control trace for the monitored instructions, including for a branch instruction that is not known to a branch prediction unit of the processor.
[0013] There is additionally provided, in accordance with an embodiment of the present invention, a processor including an execution pipeline and monitoring circuitry. The execution pipeline is configured to execute instructions of program code. The monitoring circuitry is configured to monitor the instructions in a segment of a repetitive sequence of the instructions so as to construct a specification of register access by the monitored instructions, to parallelize execution of the repetitive sequence based on the specification, and to retain the specification in the processor only provided that no branch mis-prediction is detected in the monitored instructions.
[0014] There is also provided, in accordance with an embodiment of
the present invention, a method including, in a processor that
executes instructions of program code, monitoring the instructions
in a segment of a repetitive sequence of the instructions so as to
construct a specification of register access by the monitored
instructions. Execution of the repetitive sequence is parallelized
based on the specification. In response to detecting a branch
mis-prediction in the monitored instructions, monitoring of the
instructions is terminated and the specification is discarded.
[0015] There is further provided, in accordance with an embodiment
of the present invention, a method including, in a processor that
executes instructions of program code, monitoring the instructions
in a segment of a repetitive sequence of the instructions so as to
construct a specification of register access by the monitored
instructions. Execution of the repetitive sequence is parallelized
based on the specification. The specification is retained in the
processor only provided that no branch mis-prediction is detected
in the monitored instructions.
[0016] The present invention will be more fully understood from the
following detailed description of the embodiments thereof, taken
together with the drawings in which:
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] FIG. 1 is a block diagram that schematically illustrates a
processor that performs run-time code parallelization, in
accordance with an embodiment of the present invention;
[0018] FIG. 2 is a diagram that schematically illustrates run-time
parallelization of a program loop, in accordance with an embodiment
of the present invention; and
[0019] FIG. 3 is a flow chart that schematically illustrates a
method for mitigating branch mis-prediction during monitoring of a
repetitive instruction sequence, in accordance with an embodiment
of the present invention.
DETAILED DESCRIPTION OF EMBODIMENTS
Overview
[0020] Embodiments of the present invention that are described
herein provide improved methods and devices for run-time
parallelization of code in a processor. In the disclosed
embodiments, the processor identifies a repetitive sequence of
instructions, and creates and executes multiple parallel code
sequences referred to as segments, which carry out different
occurrences of the sequence. The segments are scheduled for
parallel execution by multiple hardware threads.
[0021] For example, the repetitive sequence may comprise a loop, in
which case the segments comprise multiple loop iterations, parts of
an iteration or the continuation of a loop. As another example, the
repetitive sequence may comprise a function, in which case the
segments comprise multiple function calls, parts of a function or
function continuation. The parallelization is carried out at
run-time, on pre-compiled code. The term "repetitive sequence"
generally refers to any instruction sequence that is revisited and
executed multiple times.
[0022] In some embodiments, upon identifying a repetitive sequence,
the processor monitors the instructions in the sequence and
constructs a "scoreboard"--a specification of access to registers
by the monitored instructions. The scoreboard is associated with
the specific flow-control trace traversed by the monitored
sequence. The processor decides how and when to create and execute
the multiple segments based on the information collected in the
scoreboard and the trace.
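The scoreboard construction described above can be sketched in software as follows. This is a minimal illustrative model only, not the patent's actual hardware structures; the class, register names and trace encoding are assumptions made for clarity.

```python
# Hypothetical sketch of scoreboard construction during segment
# monitoring: record register writes and the flow-control trace
# (sequence of PC values) traversed by the monitored instructions.

class Scoreboard:
    """Specification of register access by the monitored instructions."""

    def __init__(self):
        self.write_counts = {}    # register -> number of writes seen
        self.last_write_pos = {}  # register -> position of last write
        self.trace = []           # flow-control trace: PC values traversed

    def monitor(self, position, pc, dst_regs):
        # Called once per monitored instruction.
        self.trace.append(pc)
        for reg in dst_regs:
            self.write_counts[reg] = self.write_counts.get(reg, 0) + 1
            self.last_write_pos[reg] = position

# Monitoring a toy three-instruction segment:
sb = Scoreboard()
sb.monitor(0, 0x100, ["r1"])   # r1 <- load ...
sb.monitor(1, 0x104, ["r2"])   # r2 <- r1 + 4
sb.monitor(2, 0x108, ["r1"])   # r1 <- r1 + r2

assert sb.write_counts["r1"] == 2
assert sb.last_write_pos["r1"] == 2
```

The scoreboard is then looked up by the trace it is associated with when the processor decides how to create and execute the parallel segments.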
[0023] Further aspects of instruction monitoring are addressed in a
U.S. Patent Application entitled "Run-time code parallelization
with continuous monitoring of repetitive instruction sequences,"
Attorney docket no. 1279-1004, and a U.S. Patent Application
entitled "Register classification for run-time code
parallelization," Attorney docket no. 1279-1004.1, which are
assigned to the assignee of the present patent application and
whose disclosures are incorporated herein by reference.
[0024] In some embodiments, the processor fetches and processes
instructions in its execution pipeline. Branch mis-prediction may
occur when a conditional branch instruction is predicted to take a
branch but during actual execution the branch is not taken, or vice
versa. Upon detecting branch mis-prediction, the processor
typically flushes the subsequent instructions and respective
results.
[0025] When branch mis-prediction occurs in a segment whose
instructions are being monitored, the register-access information
in the scoreboard will typically be incorrect or at least
incomplete. Some embodiments described herein provide techniques
for correcting the register-access information collected in the
scoreboard after detecting a branch mis-prediction event.
[0026] In an example embodiment, the processor stops monitoring of
the segment in question and discards the register-access
information collected in it. In other words, in some embodiments
the processor retains the scoreboard only provided that no branch
mis-prediction is detected in the monitored instructions. In other
embodiments, the processor rolls back the scoreboard to the state
prior to the mis-prediction, and continues to monitor the segment
following the correct branch decision.
[0027] The processor may roll back the scoreboard in various ways, such as by saving in advance the states of the scoreboard prior to conditional branch instructions, and reverting to a previously-saved state when needed. Alternatively, the processor may roll back the scoreboard by tracing back the instructions that follow the mis-prediction and decrementing the register-access counters back to their values prior to the mis-prediction. Rolling back may be carried out for all conditional branch instructions, or only for a selected subset of the conditional branch instructions. Example criteria for selecting the subset are also described.
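The two roll-back options described above can be sketched as follows. The data structures are assumptions for illustration; in the patent they are hardware state, not Python objects.

```python
# Illustrative sketch of scoreboard roll-back on branch mis-prediction:
# option 1 saves the state before a conditional branch and reverts to it;
# option 2 traces back flushed instructions and decrements the counters.

import copy

class RollbackScoreboard:
    def __init__(self):
        self.write_counts = {}
        self.checkpoints = []   # saved states, one per conditional branch

    def record_write(self, reg):
        self.write_counts[reg] = self.write_counts.get(reg, 0) + 1

    def checkpoint(self):
        # Option 1: save the scoreboard state in advance of a branch.
        self.checkpoints.append(copy.deepcopy(self.write_counts))

    def revert(self):
        # On mis-prediction, restore the previously-saved state.
        self.write_counts = self.checkpoints.pop()

    def undo_writes(self, flushed_dst_regs):
        # Option 2: decrement counters for each flushed write.
        for reg in flushed_dst_regs:
            self.write_counts[reg] -= 1

sb = RollbackScoreboard()
sb.record_write("r1")
sb.checkpoint()           # just before a conditional branch
sb.record_write("r2")     # speculative writes on the wrong path
sb.record_write("r2")
sb.revert()               # mis-prediction detected: roll back
assert sb.write_counts == {"r1": 1}
```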
[0028] In some embodiments, as part of the monitoring process, the
processor generates the flow-control trace to be associated with
the scoreboard. Upon detecting mis-prediction, the processor
typically corrects the generated flow-control trace, as well, using
any of the methods described above.
[0029] In other disclosed embodiments, the processor reduces the
impact of mis-prediction by proper choice of the execution-pipeline
stage at which the flow-control trace is generated, and the
execution-pipeline stage at which the register-access information
is collected.
[0030] In various embodiments, the processor may generate the trace
from the instructions immediately after fetching, immediately after
decoding, or a combination of the two.
[0031] The register-access information may be collected, for
example, immediately after decoding, after execution (including
execution of mis-predicted instructions that will be flushed), or
after committing (including only instructions that will not be
flushed).
System Description
[0032] FIG. 1 is a block diagram that schematically illustrates a
processor 20, in accordance with an embodiment of the present
invention. Processor 20 runs pre-compiled software code, while
parallelizing the code execution. Parallelization decisions are
performed by the processor at run-time, by analyzing the program
instructions as they are fetched from memory and decoded.
[0033] In the present example, processor 20 comprises an execution
pipeline that comprises one or more fetching units 24, one or more
decoding units 28, an Out-of-Order (OOO) buffer 32, and execution
units 36. Fetching units 24 fetch program instructions from a
multi-level instruction cache memory, which in the present example
comprises a Level-1 (L1) instruction cache 40 and a Level-2 (L2)
instruction cache 44.
[0034] A branch prediction unit 48 predicts the flow-control traces
(referred to herein as "traces" for brevity) that are expected to
be traversed by the program during execution. The predictions are
typically based on the addresses or Program-Counter (PC) values of
previous instructions fetched by fetching units 24. Based on the
predictions, branch prediction unit 48 instructs fetching units 24
which new instructions are to be fetched. The flow-control
predictions of unit 48 also affect the parallelization of code
execution, as will be explained below.
[0035] Instructions decoded by decoding units 28 are stored in OOO
buffer 32, for out-of-order execution by execution units 36, i.e.,
not in the order in which they have been compiled and stored in
memory. Alternatively, the buffered instructions may be executed
in-order. The buffered instructions are then issued for execution
by the various execution units 36. In the present example,
execution units 36 comprise one or more Multiply-Accumulate (MAC)
units, one or more Arithmetic Logic Units (ALU), one or more
Load/Store units, and a branch execution unit (BRA). Additionally
or alternatively, execution units 36 may comprise other suitable
types of execution units, for example Floating-Point Units
(FPU).
[0036] The results produced by execution units 36 are stored in a
register file and/or a multi-level data cache memory, which in the
present example comprises a Level-1 (L1) data cache 52 and a
Level-2 (L2) data cache 56. In some embodiments, L2 data cache
memory 56 and L2 instruction cache memory 44 are implemented as
separate memory areas in the same physical memory, or simply share
the same memory without fixed pre-allocation.
[0037] In some embodiments, processor 20 further comprises a thread
monitoring and execution unit 60 that is responsible for run-time
code parallelization. The functions of unit 60 are explained in
detail below.
[0038] The configuration of processor 20 shown in FIG. 1 is an
example configuration that is chosen purely for the sake of
conceptual clarity. In alternative embodiments, any other suitable
processor configuration can be used. For example, in the
configuration of FIG. 1, multi-threading is implemented using
multiple fetch units 24 and multiple decoding units 28. Each
hardware thread may comprise a fetch unit assigned to fetch
instructions for the thread and a decoding unit assigned to decode
the fetched instructions. Additionally or alternatively,
multi-threading may be implemented in many other ways, such as
using multiple OOO buffers, separate execution units per thread
and/or separate register files per thread. In another embodiment,
different threads may comprise different respective processing
cores.
[0039] As yet another example, the processor may be implemented
without cache or with a different cache structure, without branch
prediction or with a separate branch prediction per thread. The
processor may comprise additional elements such as a reorder buffer
(ROB) or register renaming logic, to name just a few. Further
alternatively, the disclosed techniques can be carried out with
processors having any other suitable micro-architecture.
[0040] Processor 20 can be implemented using any suitable hardware,
such as using one or more Application-Specific Integrated Circuits
(ASICs), Field-Programmable Gate Arrays (FPGAs) or other device
types. Additionally or alternatively, certain elements of processor
20 can be implemented using software, or using a combination of
hardware and software elements. The instruction and data cache
memories can be implemented using any suitable type of memory, such
as Random Access Memory (RAM).
[0041] Processor 20 may be programmed in software to carry out the
functions described herein. The software may be downloaded to the
processor in electronic form, over a network, for example, or it
may, alternatively or additionally, be provided and/or stored on
non-transitory tangible media, such as magnetic, optical, or
electronic memory.
Run-Time Code Parallelization Based on Segment Monitoring
[0042] In some embodiments, unit 60 in processor 20 identifies
repetitive instruction sequences and parallelizes their execution.
Repetitive instruction sequences may comprise, for example,
respective iterations of a program loop, respective occurrences of
a function or procedure, or any other suitable sequence of
instructions that is revisited and executed multiple times. In the
present context, the term "repetitive instruction sequence" refers
to an instruction sequence whose flow-control trace (e.g., sequence
of PC values) has been executed in the past at least once. Data
values (e.g., register values) may differ from one execution to
another.
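The definition above — a sequence whose flow-control trace has been executed at least once before — can be sketched as a simple recurrence check. The hashing scheme is an assumption for illustration only.

```python
# Minimal sketch of identifying a "repetitive instruction sequence":
# a flow-control trace (sequence of PC values) seen at least once before.

seen_traces = set()

def is_repetitive(trace):
    """Return True if this exact PC sequence was executed before."""
    key = hash(tuple(trace))
    if key in seen_traces:
        return True
    seen_traces.add(key)
    return False

loop_body = [0x200, 0x204, 0x208, 0x20C]   # PCs of one loop iteration
assert not is_repetitive(loop_body)         # first execution: not yet seen
assert is_repetitive(loop_body)             # revisited: repetitive
```

Note that only the trace must repeat; data values (e.g., register values) may differ between executions, as stated above.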
[0043] In the disclosed embodiments, processor 20 parallelizes a
repetitive instruction sequence by invoking and executing multiple
code segments in parallel or semi-parallel using multiple hardware
threads. Each thread executes a respective code segment, e.g., a
respective iteration of a loop, multiple (not necessarily
successive) loop iterations, part of a loop iteration, continuation
of a loop, a function or part or continuation thereof, or any other
suitable type of segment.
[0044] Parallelization of segments in processor 20 is performed
using multiple hardware threads. In the example of FIG. 1, although
not necessarily, each thread comprises a respective fetch unit 24
and a respective decoding unit 28 that have been assigned by unit
60 to perform one or more segments. In another example embodiment,
a given fetch unit 24 is shared between two or more threads.
[0045] In practice, data dependencies exist between segments. For
example, a calculation performed in a certain loop iteration may
depend on the result of a calculation performed in a previous
iteration. The ability to parallelize segments depends to a large
extent on such data dependencies.
[0046] FIG. 2 is a diagram that demonstrates run-time
parallelization of a program loop, in accordance with an example
embodiment of the present invention. The present example refers to
parallelization of instructions, but the disclosed technique can be
used in a similar manner for parallelizing micro-ops, as well. The
top of the figure shows an example program loop (reproduced from
the bzip benchmark of the SPECint test suite) and the dependencies
between instructions. Some dependencies are between instructions in
the same loop iteration, while others are between an instruction in
a given loop iteration and an instruction in a previous
iteration.
[0047] The bottom of the figure shows how unit 60 parallelizes this
loop using four threads TH1 . . . TH4, in accordance with an
embodiment of the present invention. The table spans a total of
eleven cycles, and lists which instructions of which threads are
executed during each cycle. Each instruction is represented by its
iteration number and the instruction number within the iteration.
For example, "14" stands for the 4th instruction of the
1st loop iteration. In this example instructions 5 and 7 are
omitted and perfect branch prediction is assumed.
[0048] The staggering in execution of the threads is due to data
dependencies. For example, thread TH2 cannot execute instructions
21 and 22 (the first two instructions in the second loop iteration)
until cycle 1, because instruction 21 (the first instruction in the
second iteration) depends on instruction 13 (the third instruction
of the first iteration). Similar dependencies exist across the
table. Overall, this parallelization scheme is able to execute two
loop iterations in six cycles, or one iteration every three
cycles.
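The staggered schedule of FIG. 2 follows from a dependency-only timing rule: an instruction may issue one cycle after all instructions it depends on. A sketch of that rule, using a made-up dependency graph (not the actual bzip loop), might look like:

```python
# Sketch of dependency-only scheduling: earliest cycle per instruction,
# ignoring resource constraints such as execution-unit availability.

def earliest_cycles(deps):
    """deps: instruction -> list of instructions it depends on.
    Returns instruction -> earliest cycle it can execute."""
    cycles = {}
    def cycle_of(inst):
        if inst not in cycles:
            preds = deps.get(inst, [])
            cycles[inst] = 0 if not preds else 1 + max(map(cycle_of, preds))
        return cycles[inst]
    for inst in deps:
        cycle_of(inst)
    return cycles

# "21" depends on "13", so the second iteration starts staggered.
deps = {"11": [], "12": ["11"], "13": ["12"],
        "21": ["13"], "22": ["21"], "23": ["22"]}
c = earliest_cycles(deps)
assert c["13"] == 2
assert c["21"] == 3   # cannot start until after instruction 13
```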
[0049] It is important to note that the parallelization shown in
FIG. 2 considers only data dependencies between instructions, and
does not consider other constraints such as availability of
execution units. Therefore, the cycles in FIG. 2 do not necessarily
translate directly into respective clock cycles. For example,
instructions that are listed in FIG. 2 as executed in a given cycle
may actually be executed in more than one clock cycle, because they
compete for the same execution units 36.
[0050] In some embodiments, unit 60 decides how to parallelize the
code by monitoring the instructions in the processor pipeline. In
response to identifying a repetitive instruction sequence, unit 60
starts monitoring the sequence as it is fetched, decoded and
executed by the processor.
[0051] In some implementations, the functionality of unit 60 may be
distributed among the multiple hardware threads, such that a given
thread can be viewed as monitoring its instructions during
execution. Nevertheless, for the sake of clarity, the description
that follows assumes that monitoring functions are carried out by
unit 60.
[0052] As part of the monitoring process, unit 60 generates the
flow-control trace traversed by the monitored instructions, and a
monitoring table that is referred to herein as a scoreboard. The
scoreboard of a segment typically comprises some classification of
the registers. In addition, for at least some of the registers, the
scoreboard indicates the location in the monitored sequence of the
last write operation to the register.
[0053] Any suitable indication may be used to indicate the location
of the last write operation, such as a count of the number of
writes to the register or the address of the last write operation.
The last-write indication enables unit 60 to determine, for
example, when it is permitted to execute an instruction in a
subsequent segment that depends on the value of the register.
Additional aspects of scoreboard generation can be found in the U.S.
patent applications with Attorney docket nos. 1279-1004 and
1279-1004.1, cited above.
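By way of illustration only, a scoreboard fragment that tracks both forms of last-write indication mentioned above might look like the following sketch. The class and field names are hypothetical, not from the patent.

```python
# Illustrative scoreboard fragment tracking, per register, the two
# last-write indications described in the text: a count of write
# operations and the location (address or index) of the last write.
class Scoreboard:
    def __init__(self):
        self.write_count = {}  # register -> number of writes so far
        self.last_write = {}   # register -> location of the last write

    def record_write(self, reg, location):
        """Record one write to `reg` at the given location in the
        monitored sequence."""
        self.write_count[reg] = self.write_count.get(reg, 0) + 1
        self.last_write[reg] = location

sb = Scoreboard()
sb.record_write("r1", 3)
sb.record_write("r1", 7)  # a later write supersedes the earlier one
```

A subsequent segment that depends on r1 can consult the last-write location to determine when the register value is final.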
Handling Branch Mis-Prediction During Segment Monitoring
[0054] In some embodiments, processor 20 fetches and processes
instructions speculatively, based on a prediction of the branch
decisions that will be taken at future branch instructions. Branch
prediction is carried out by branch prediction unit 48, and affects
the instructions that are fetched for execution by fetch units
24.
[0055] Depending on the actual code and on the performance of unit
48, branch prediction may be erroneous. An event in which a
conditional branch was predicted taken but was in fact not taken, or
vice versa, is referred to herein as a branch mis-prediction, or
simply a mis-prediction for brevity. In the embodiment of FIG. 1,
the branch execution unit (BRA) compares the
branch prediction to the actual branch decision and outputs a
mis-prediction indication in case of a mismatch.
[0056] As noted above, in some embodiments monitoring unit 60 monitors
the flow-control trace and the register access during execution. In
other embodiments unit 60 may monitor the flow-control trace and
the register access in various segments simultaneously during
parallel execution. When mis-prediction occurs in a segment being
monitored, the resulting trace and scoreboard will typically be
incorrect. For example, the scoreboard may comprise register-access
information that was collected over instructions that follow the
mis-predicted branch and will later be flushed.
[0057] In some embodiments, unit 60 takes various measures for
correcting the scoreboard in the event of mis-prediction. The
correction methods described below refer mainly to correction of
the register-access information. In some embodiments, unit 60 uses
these methods to correct the generated flow-control trace as
well.
[0058] In some embodiments, in response to a detected
mis-prediction event, unit 60 stops monitoring of the segment and
discards the register-access information collected so far in the
segment. Monitoring will typically be re-attempted in another
segment. In these embodiments, unit 60 retains the scoreboard for the
segment in question only provided that no branch mis-prediction is
detected.
[0059] In other embodiments, unit 60 does not discard the
register-access information, but rather rolls back the
register-access information to its state prior to the
mis-prediction. After rolling back, unit 60 may resume the
monitoring process along the correct trace.
[0060] Unit 60 may roll back the scoreboard information in various
ways. In some embodiments, unit 60 traces back over the
instructions that follow the mis-prediction, and corrects the
register-access information to remove the contribution of these
instructions. For example, if the register-access information
comprises counts of write operations to registers, unit 60 may
decrement the counts to remove the contribution of write operations
that follow the mis-prediction. If the register-access information
comprises some other indications of the locations of the last write
operations to registers, unit 60 may correct these indications, as
well.
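The count-decrementing roll-back described above can be sketched as follows. The helper is hypothetical, and the flushed instructions are represented only by their destination registers:

```python
# Sketch of an exact roll-back: trace back over the instructions that
# followed the mis-predicted branch and decrement the per-register
# write counts to remove their contribution. `flushed` lists the
# destination registers of each flushed instruction.
def roll_back_counts(write_count, flushed):
    for dests in reversed(flushed):
        for reg in dests:
            write_count[reg] -= 1
            if write_count[reg] == 0:
                del write_count[reg]  # no surviving writes to reg
    return write_count

counts = {"r1": 3, "r2": 1}
roll_back_counts(counts, [["r1"], ["r2"]])  # two flushed instructions
```

After the roll-back, the counts reflect only the writes that preceded the mis-predicted branch.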
[0061] In alternative embodiments, unit 60 prepares in advance for
a possible roll-back of the scoreboard to a conditional branch
instruction, by saving the state that the scoreboard had prior to
that instruction. If mis-prediction occurs in this instruction,
unit 60 may revert to the saved state of the scoreboard and resume
monitoring from that point. The saved state of the scoreboard
typically comprises the register-access information and the
register classification prior to the branch instruction. The state
may correspond to the exact conditional branch instruction, to the
preceding instruction, or to another suitable instruction that is
prior to the conditional branch instruction.
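The save-and-revert scheme of this paragraph might be realized along the following lines. The class and method names are illustrative assumptions, not an interface defined by the patent:

```python
import copy

# Sketch of scoreboard checkpointing: before a selected conditional
# branch, the current scoreboard state (register-access information
# and classification) is saved; on mis-prediction at that branch,
# monitoring reverts to the saved state and resumes.
class CheckpointedScoreboard:
    def __init__(self):
        self.state = {"write_count": {}, "classification": {}}
        self.checkpoints = {}  # branch location -> saved state

    def save_checkpoint(self, branch_loc):
        self.checkpoints[branch_loc] = copy.deepcopy(self.state)

    def restore(self, branch_loc):
        """Revert to the state saved before the given branch; raises
        KeyError if none was saved, in which case monitoring of the
        segment must be aborted and re-attempted elsewhere."""
        self.state = copy.deepcopy(self.checkpoints[branch_loc])

sb = CheckpointedScoreboard()
sb.state["write_count"]["r1"] = 2
sb.save_checkpoint(0x40)           # before the conditional branch
sb.state["write_count"]["r1"] = 5  # writes past the branch
sb.restore(0x40)                   # mis-prediction detected: revert
```

The deep copy is what makes the checkpoint immune to later updates of the live scoreboard.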
[0062] In some embodiments, unit 60 saves the scoreboard state
prior to every conditional branch instruction, enabling roll-back
following any mis-prediction. In alternative embodiments, unit 60
saves the scoreboard state for only a selected subset of the
conditional branch instructions in the segment. This technique
reduces the required memory space, but enables roll-back for only
some of the possible mis-predictions. If mis-prediction occurs
in an instruction for which no prior scoreboard state has been
saved, unit 60 typically has to abort monitoring the segment and
re-attempt monitoring in another segment.
[0063] Unit 60 may select the subset of conditional branch
instructions (for which the prior state of the scoreboard is saved)
using any suitable criterion. Typically, the criterion aims to
select conditional branch instructions that are likely to be
mis-predicted, and exclude conditional branch instructions that are
likely to be predicted correctly. In one embodiment, the subset to
be selected is specified in the code or by a compiler that compiles
the code. In another embodiment, the subset is chosen by unit 60 at
runtime. For example, unit 60 may accumulate mis-prediction
statistics and select conditional branch instructions in which
branch prediction accuracy is below a certain level.
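The runtime criterion suggested here could be sketched as follows. The data layout and the threshold value are illustrative assumptions:

```python
# Sketch of runtime selection: accumulate per-branch prediction
# statistics and save scoreboard checkpoints only for branches whose
# prediction accuracy falls below a threshold. The 0.95 threshold is
# an arbitrary illustrative value.
def select_checkpoint_branches(stats, threshold=0.95):
    """stats: branch location -> (correct_predictions, total_seen)."""
    return {
        branch
        for branch, (correct, total) in stats.items()
        if total > 0 and correct / total < threshold
    }

# Branch 0x10 predicts at 99% accuracy and is excluded; branch 0x20
# predicts at 80% accuracy and is selected for checkpointing.
chosen = select_checkpoint_branches({0x10: (99, 100), 0x20: (80, 100)})
```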
[0064] The embodiments described above refer mainly to correction
of the last-write indications in the scoreboard following
mis-prediction. Additionally or alternatively, unit 60 may correct
any other suitable register access information in the scoreboard
that may be affected by mis-prediction. For example, the scoreboard
typically comprises a classification of the registers accessed by
the monitored instructions, based on the order in which each register
is used as an operand or as a destination in the monitored
instructions. The classification may distinguish, for example,
between local (L) registers whose first occurrence is as a
destination, global (G) registers that are used only as operands,
and global-local (GL) registers whose first occurrence is as an
operand and that are subsequently used as destinations.
[0065] In some embodiments, unit 60 may re-classify one or more of
the registers so as to reflect their correct classification prior
to the mis-prediction. Any of the correction methods described
above (e.g., reverting to previously-saved states or tracing back
the instruction sequence) can be used for this purpose.
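One way to realize such re-classification, shown purely as a sketch (the patent does not prescribe a mechanism), is to recompute the L/G/GL classes over only the instructions that precede the mis-prediction:

```python
# Recompute register classification from scratch over the surviving
# prefix of the monitored sequence. Each instruction is a pair
# (operand_registers, destination_registers); the classes follow the
# definitions in the text: L = first seen as destination, G = used
# only as an operand, GL = operand first, destination later.
def classify(instructions):
    cls = {}
    for operands, dests in instructions:
        for reg in operands:
            cls.setdefault(reg, "G")
        for reg in dests:
            if reg not in cls:
                cls[reg] = "L"
            elif cls[reg] == "G":
                cls[reg] = "GL"
    return cls

# Replaying only the pre-mis-prediction instructions:
# r1 = f(r2), then r3 = g(r1, r3).
classes = classify([(["r2"], ["r1"]), (["r1", "r3"], ["r3"])])
```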
[0066] The embodiments described above are depicted purely by way
of example. In alternative embodiments, unit 60 may correct the
scoreboard in response to branch mis-prediction in any other
suitable way.
[0067] For example, in some embodiments unit 60 performs only an
approximate correction of the specification that only approximately
compensates for the effect of the mis-prediction. In these
embodiments, unit 60 may roll back the specification to a state
that approximates the state prior to the mis-prediction, rather
than to the exact prior state. The approximation may comprise, for
example, an approximation of the last-write indications of certain
registers. In the present context, both exact and approximate
corrections are considered types of specification corrections, and
both exact and approximate compensation for the mis-prediction are
considered types of compensation.
[0068] FIG. 3 is a flow chart that schematically illustrates a
method for mitigating branch mis-prediction during monitoring of a
repetitive instruction sequence, in accordance with an embodiment
of the present invention. The method begins with unit 60 of
processor 20 monitoring instructions of a repetitive instruction
sequence, at a monitoring step 70. As part of the monitoring
process, in some embodiments unit 60 generates the predicted
flow-control trace traversed by the instructions and the
corresponding scoreboard.
[0069] At an invocation step 74, unit 60 invokes multiple hardware
threads to execute respective segments of the repetitive
instruction sequence. For at least some of the segments, unit 60
continues to monitor the instructions during execution in the
threads.
[0070] At a mis-prediction detection step 78, processor 20 checks
whether branch mis-prediction has occurred in a given segment being
executed. If no mis-prediction is encountered, the method loops
back to step 74 above.
[0071] In case of branch mis-prediction, unit 60 corrects the
scoreboard to compensate for the effect of the instructions
following the mis-prediction, at a correction step 82. Unit 60 may
use any of the techniques described above, or any other suitable
technique, for this purpose. In some embodiments, the correction
involves correction of the register-access information as well as
correction of the generated flow-control trace.
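The control flow of FIG. 3 can be summarized in the following sketch. The `unit` object and its methods are hypothetical stand-ins for monitoring unit 60, not an interface defined by the patent:

```python
# Steps 70-82 of FIG. 3: monitor the repetitive sequence, invoke
# parallel threads, and on each detected mis-prediction correct the
# scoreboard before continuing.
def monitor_and_parallelize(unit, sequence):
    unit.monitor(sequence)                 # step 70: monitor sequence
    while not unit.done():
        unit.invoke_threads(sequence)      # step 74: invoke threads
        if unit.misprediction_detected():  # step 78: check
            unit.correct_scoreboard()      # step 82: correct scoreboard
```

When no mis-prediction is detected, control simply loops back to step 74, matching the flow chart.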
Pipeline Considerations in Mitigating Branch Mis-Prediction
[0072] In some embodiments, unit 60 reduces the impact of branch
mis-prediction by properly choosing the stage in the execution
pipeline at which the trace is generated and the stage in the
execution pipeline at which the register-access information is
collected. Generally, trace generation and collection of
register-access information need not be performed at the same
pipeline stage.
[0073] In some embodiments, unit 60 generates the trace from the
branch instructions being fetched, i.e., based on the branch
instructions at the output of fetching units 24. In alternative
embodiments, unit 60 generates the trace from the branch
instructions being decoded, i.e., based on the branch instructions
at the output of decoding units 28.
[0074] In yet another embodiment, unit 60 generates the trace based
on a combination of branch instructions at the output of decoding
units 28, and branch instructions at the output of fetch units
24.
[0075] In some embodiments, unit 60 collects the register-access
information (e.g., classification of registers and locations of
last write operations to registers) at the output of decoding units
28, i.e., from the instructions being decoded.
[0076] In other embodiments, unit 60 collects the register-access
information based on the instructions being executed in execution
units 36, but before the instructions and results are finally
committed. In these embodiments, the register-access information
includes the contribution of instructions that follow
mis-prediction and will later be flushed (as in the case of
collecting the register-access information after the decoding
unit). In an alternative embodiment, unit 60 collects the
register-access information based only on the instructions that are
committed, i.e., without considering instructions that are flushed
due to mis-prediction.
[0077] In yet another embodiment, unit 60 collects the
register-access information and/or generates the trace after
evaluating the conditions of conditional branch instructions by the
branch execution unit, i.e., at a stage where the branch
instructions are no longer conditional.
[0078] Further additionally or alternatively, unit 60 may generate
the flow-control trace and/or collect the register-access
information based on any other suitable pipeline stages.
[0079] Generally speaking, monitoring instructions early in the
pipeline helps to invoke parallel execution more quickly and
efficiently, but on the other hand is more affected by
mis-prediction. Monitoring instructions later in the pipeline
causes slower parallelization, but is on the other hand less
sensitive to mis-prediction.
[0080] In some embodiments, unit 60 is able to generate a trace
even when monitoring a conditional branch instruction that is not yet
known to branch prediction unit 48. This scenario may occur, for
example, when a repetitive instruction sequence is first
encountered and not yet identified as repetitive. Nevertheless, the
branch instruction is still recorded by the decoding unit (or by a
register-renaming unit), and unit 60 may still be able to generate
a trace. Typically, the trace will be generated with a branch not
taken for this instruction.
[0081] It will be appreciated that the embodiments described above
are cited by way of example, and that the present invention is not
limited to what has been particularly shown and described
hereinabove. Rather, the scope of the present invention includes
both combinations and sub-combinations of the various features
described herein, as well as variations and modifications thereof
which would occur to persons skilled in the art upon reading the
foregoing description and which are not disclosed in the prior art.
Documents incorporated by reference in the present patent
application are to be considered an integral part of the
application except that to the extent any terms are defined in
these incorporated documents in a manner that conflicts with the
definitions made explicitly or implicitly in the present
specification, only the definitions in the present specification
should be considered.
* * * * *