U.S. patent application number 14/724175 was filed with the patent office on May 28, 2015 and published on 2016-04-28 for a processing method including pre-issue load-hit-store (LHS) hazard prediction to reduce rejection of load instructions.
The applicant listed for this patent is INTERNATIONAL BUSINESS MACHINES CORPORATION. The invention is credited to Sundeep Chadha, Richard James Eickemeyer, John Barry Griswell, JR., and Dung Quoc Nguyen.
Application Number | 14/724175 |
Publication Number | 20160117174 |
Family ID | 55792061 |
Publication Date | 2016-04-28 |
United States Patent Application | 20160117174 |
Kind Code | A1 |
Chadha; Sundeep; et al. | April 28, 2016 |
PROCESSING METHOD INCLUDING PRE-ISSUE LOAD-HIT-STORE (LHS) HAZARD
PREDICTION TO REDUCE REJECTION OF LOAD INSTRUCTIONS
Abstract
A processing method supporting out-of-order execution (OOE)
includes load-hit-store (LHS) hazard prediction at the instruction
dispatch phase, reducing load instruction rejections and queue
flushes at the execution phase. The instruction dispatch unit (IDU)
detects likely LHS hazards by generating entries for pending stores
in an LHS detection table. The entries in the table contain an
address field (generally the immediate field) of the store
instruction and the register number of the store. The IDU compares
the address field and register number of each load with the entries
in the table to determine whether a likely LHS hazard exists. If an
LHS hazard is detected, the load is dispatched to the issue queue of
the load-store unit (LSU) with a tag corresponding to the matching
store instruction, causing the LSU to issue the load only after the
corresponding store has been issued for execution.
Inventors: |
Chadha; Sundeep; (AUSTIN, TX)
Eickemeyer; Richard James; (ROCHESTER, MN)
Griswell, JR.; John Barry; (AUSTIN, TX)
Nguyen; Dung Quoc; (AUSTIN, TX) |
Applicant: |
Name | City | State | Country | Type |
INTERNATIONAL BUSINESS MACHINES CORPORATION | Armonk | NY | US | |
Family ID: | 55792061 |
Appl. No.: | 14/724175 |
Filed: | May 28, 2015 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
14522811 | Oct 24, 2014 | |
14724175 | | |
Current U.S. Class: | 712/206 |
Current CPC Class: | G06F 9/3838 20130101; G06F 9/3834 20130101 |
International Class: | G06F 9/38 20060101 G06F009/38; G06F 9/30 20060101 G06F009/30 |
Claims
1. A method of operation of a processor core, the method
comprising: fetching instructions of an instruction stream;
dispatching instructions of the instruction stream by an
instruction dispatch unit of the processor core that dispatches the
instructions to issue queues, according to a type of the
instructions; detecting likely load-hit-store hazards prior to the
dispatch of load instructions to an issue queue of a load-store
unit of the processor core; and identifying the likely
load-hit-store hazards to the load-store unit, whereby rejections
of the load instructions by the load-store unit due to
load-hit-store hazards are reduced.
2. The method of claim 1, wherein the instruction dispatch unit
detects store instructions of the instruction stream during the
dispatching of the store instructions and stores store address
information associated with the store instructions in corresponding
entries in a load-hit-store detection table, and wherein the
detecting likely load-hit-store hazards comprises detecting load
instructions of the instruction stream and comparing the store
address information of the entries in the table with load address
information of load instructions of the instruction stream.
3. The method of claim 2, further comprising: responsive to the
detecting of a store operation, writing the store address
information associated with the store operation to the
load-hit-store detection table and dispatching the store operation
to the issue queue of the load-store unit of the processor core;
responsive to the detecting of a load instruction, comparing the
load address information of the load instruction to entries in the
load-hit-store detection table corresponding to store operations
occurring earlier in the instruction stream to determine if a
likely load-hit-store hazard exists between the load instruction
and a given one of the store operations; responsive to the
comparing determining that the likely load-hit-store hazard exists
between the load instruction and the given store operation,
dispatching the load instruction to the issue queue of the
load-store unit of the processor core along with a tag identifying
the given store operation; and responsive to the comparing
determining that the likely load-hit-store hazard does not exist
between the load instruction and the given store operation,
dispatching the load instruction to the issue queue of the
load-store unit of the processor core without the tag.
4. The method of claim 2, wherein the store address information is
one or both of an immediate field of the store instruction and one
or more base register numbers of the store instruction.
5. The method of claim 3, further comprising: the load-store unit
examining a next entry of the issue queue to determine whether or
not a next operation is a load instruction with a corresponding
tag; the load-store unit, responsive to determining that the load
instruction with a corresponding tag is not present, processing the
next entry for execution by the load-store unit; the load-store
unit examining the next entry of the issue queue to determine
whether or not the next operation is a store operation; the
load-store unit, responsive to determining that the next operation
is a store operation, examining the issue queue to determine
whether a load instruction having a corresponding tag matching a
tag of the store operation is present; the load-store unit,
responsive to determining that the next operation is a store
operation, processing the next entry for execution by the
load-store unit; and the load-store unit, responsive to determining
that the load instruction having the corresponding tag matching the
tag of the store operation is present, processing the load
instruction for execution by the load-store unit subsequent to
processing the next entry.
6. The method of claim 3, further comprising: responsive to
detecting a store operation in the instruction stream, comparing
entries in the load-hit-store detection table with the store
address information of the store operation; and responsive to the
comparing detecting a match between the store address information
of the store instruction and an entry in the load-hit-store
detection table, invalidating the entry in the load-hit-store
detection table prior to the instruction dispatch unit storing an
entry corresponding to the store instruction in the load-hit-store
detection table, whereby only a single valid entry in the
load-hit-store detection table contains identical store address
information at any time.
7. The method of claim 3, wherein the comparing compares a
most-recently-stored matching entry in the load-hit-store detection
table that has a match between the load address information of the
load instruction and the most-recently-stored matching entry in the
load-hit-store detection table, whereby multiple valid entries in
the load-hit-store detection table may match a particular load
address information, without causing a load-hit-store hazard.
Description
[0001] The present Application is a Continuation of U.S. patent
application Ser. No. 14/522,811, filed on Oct. 24, 2014 and claims
priority thereto under 35 U.S.C. §120. The disclosure of the
above-referenced parent U.S. Patent Application is incorporated
herein by reference.
BACKGROUND OF THE INVENTION
[0002] 1. Field of the Invention
[0003] The present invention is related to processing systems and
processors, and more specifically to techniques for predicting
load-hit-store hazards at dispatch time to reduce rejection of
dispatched load instructions.
[0004] 2. Description of Related Art
[0005] In pipelined processors supporting out-of-order execution
(OOE), overlaps between store and load instructions causing
load-hit-store hazards represent a serious bottleneck in the data
flow between the load-store unit (LSU) and the instruction dispatch
unit (IDU). In particular, in a typical pipelined processor, when
the LSU detects a load-hit-store hazard, it rejects the load
instruction that depends on the result of the store instruction,
generally several times, and reissues the load instruction, flushing
all newer instructions that follow it. The above-described reject and reissue
operation not only consumes resources of the load-store data
path(s) within the processor, but can also consume issue queue
space in the load-store execution path(s) by filling the load-store
issue queue with rejected load instructions that must be reissued.
When such an LHS hazard occurs in a program loop, the reject and
reissue operation can lead to a dramatic reduction in system
performance.
[0006] In some systems, the reissued load instruction entries are
tagged with dependency flags, so that subsequent reissues will only
occur after the store operation on which the load instruction
depends, preventing recurrence of the reissue operations. However,
rejection of the first issue of the load instruction and the
consequent flushing of newer instructions still represents a
significant performance penalty in OOE processors.
[0007] It would therefore be desirable to provide a method for
managing load-store operations with reduced rejection and reissue
of operations, in particular load rejections due to load-hit-store
hazards.
BRIEF SUMMARY OF THE INVENTION
[0008] The invention is embodied in a method that reduces rejection
of load instructions by predicting likely load-hit-store hazards.
The method is a method of operation of a processor core.
[0009] The processor core supports out-of-order execution and
detects likely load-hit-store hazards. When an instruction dispatch
unit decodes a
fetched instruction, if the instruction is a store instruction,
address information is stored in a load-hit-store detection table.
The address information is generally the base registers used to
generate the effective address of the store operation in
register-based addressing and/or the immediate field of the
instruction for immediate addressing. When a subsequent load
instruction is encountered, the instruction dispatch unit checks
the load-hit-store detection table to determine whether or not an
entry in the table has matching address information. If a matching
entry exists in the table, the instruction dispatch unit forwards
the load instruction with a tag corresponding to the entry, so that
the load-store unit will execute the load instruction after the
corresponding store has been executed. If no matching entry exists
in the table, the load instruction is dispatched untagged.
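To make "matching address information" concrete, the following Python sketch models the comparison performed at dispatch. It is a minimal illustration under stated assumptions, not the patented logic: the `AddrInfo` type, the three addressing-mode labels (`"reg"`, `"disp"`, `"imm"`) and the helper names are hypothetical, and the addressing mode stands in for the store instruction type that selects which fields are compared. No effective address is computed; only the decoded fields are compared.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AddrInfo:
    """Hypothetical decoded address fields of a load or store.

    mode: "reg"  register-based (e.g. RA + RB), base registers only
          "disp" base register plus immediate displacement
          "imm"  immediate field only
    """
    mode: str
    base_regs: Tuple[int, ...] = ()
    immediate: Optional[int] = None


def address_key(info: AddrInfo) -> tuple:
    # The addressing type selects which fields take part in the comparison;
    # register contents are never read, so no effective address is formed.
    if info.mode == "reg":
        return ("reg", info.base_regs)
    if info.mode == "imm":
        return ("imm", info.immediate)
    return ("disp", info.base_regs, info.immediate)


def likely_lhs(store: AddrInfo, load: AddrInfo) -> bool:
    """Predict a hazard when the load's address fields equal the store's."""
    return address_key(store) == address_key(load)


store = AddrInfo("disp", (1,), 8)                        # e.g. a store to 8(r1)
print(likely_lhs(store, AddrInfo("disp", (1,), 8)))      # True
print(likely_lhs(store, AddrInfo("disp", (1,), 16)))     # False
```

A store and a later load that both use displacement 8 from base register 1 are predicted to conflict, while a load at displacement 16 from the same base register is not, even though the prediction never looks at the register's value.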
[0010] The foregoing and other objectives, features, and advantages
of the invention will be apparent from the following, more
particular, description of the preferred embodiment of the
invention, as illustrated in the accompanying drawings.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING
[0011] The novel features believed characteristic of the invention
are set forth in the appended claims. The invention itself,
however, as well as a preferred mode of use, further objectives,
and advantages thereof, will best be understood by reference to the
following detailed description of the invention when read in
conjunction with the accompanying Figures, wherein like reference
numerals indicate like components, and:
[0012] FIG. 1 is a block diagram illustrating a processing system
in accordance with an embodiment of the present invention.
[0013] FIG. 2 is a block diagram illustrating details of a
processor core 20 in accordance with an embodiment of the present
invention.
[0014] FIG. 3 is a block diagram illustrating details within
processor core 20 of FIG. 2 in accordance with an embodiment of the
present invention.
[0015] FIG. 4 is a table depicting entries within LHS detection
table 41 of processor core 20 in accordance with an embodiment of
the present invention.
[0016] FIG. 5 is a flowchart depicting a method of dispatching
load/store instructions in accordance with an embodiment of the
present invention.
[0017] FIG. 6 is a flowchart depicting a method of issuing
load/store instructions in accordance with an embodiment of the
present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0018] The present invention relates to processors and processing
systems in which rejections of load instructions due to
load-hit-store (LHS) hazards are reduced by predicting the
occurrence of such hazards using an LHS prediction table to track
dispatched stores that may or may not have been issued/executed.
Load instructions are examined at dispatch time to determine whether
a pending store exists, i.e., a store that has neither been
committed for a cache write nor otherwise been flushed from the
load-store execution path. If an LHS hazard is detected, the load
instruction is dispatched with an instruction tag (ITAG) matching the
ITAG of the store instruction corresponding to the entry in the LHS
prediction table, so that the load-store unit will issue the load
instruction dependent on the store result, i.e., will retain the
load instruction in its issue queue until the store instruction is
committed or flushed, preventing rejections of load instructions due
to identification of LHS hazards during issue of the load
instructions.
[0019] Referring now to FIG. 1, a processing system in accordance
with an embodiment of the present invention is shown. The depicted
processing system includes a number of processors 10A-10D, each in
conformity with an embodiment of the present invention. The
depicted multi-processing system is illustrative, and a processing
system in accordance with other embodiments of the present
invention includes uni-processor systems having symmetric
multi-threading (SMT) cores. Processors 10A-10D are identical in
structure and include cores 20A-20B and a local storage 12, which
may be a cache level, or a level of internal system memory.
Processors 10A-10D are coupled to a main system memory 14 and a
storage subsystem 16, which includes non-removable drives and
optical drives for reading media such as a CD-ROM 17. The
illustrated processing system also includes input/output (I/O)
interfaces and devices 18 such as mice and keyboards for receiving
user input and graphical displays for displaying information. While
the system of FIG. 1 is used to provide an illustration of a system
in which the processor architecture of the present invention is
implemented, it is understood that the depicted architecture is not
limiting and is intended to provide an example of a suitable
computer system in which the techniques of the present invention
are applied.
[0020] Referring now to FIG. 2, details of processor cores 20A-20B
of FIG. 1 are illustrated in depicted processor core 20. Processor
core 20 includes an instruction fetch unit (IFU) 22 that fetches
one or more instruction streams from cache or system memory and
presents the instruction stream(s) to an instruction decode unit
24. An instruction dispatch unit (IDU) 26 dispatches the decoded
instructions to a number of internal processor pipelines. The
processor pipelines each include one of issue queues 27A-27D and an
execution unit provided by branch execution unit (BXU) 28,
condition result unit (CRU) 29, load-store unit (LSU) 30 or
floating point units (FPUs) 31A-31B. Registers such as a counter
register (CTR) 23A, a condition register (CR) 23B, general-purpose
registers (GPR) 23D, and floating-point result registers (FPR) 23C
provide locations for results of operations performed by the
corresponding execution unit(s). A global completion table (GCT) 21
provides an indication of pending operations that is marked as
completed when the results of an instruction are transferred to the
corresponding one of result registers 23A-23D. In embodiments of
the present invention, LHS prediction logic 40 within IDU 26
manages a LHS detection table 41 that contains entries for all
pending store operations, e.g., all store operations that have not
reached the point of irrevocable execution. IDU 26 also manages
register mapping via a register mapper 25 that allocates storage in
the various register sets so that concurrent execution of program
code can be supported by the various pipelines. LSU 30 is coupled
to a store queue (STQ) 42 and a load queue (LDQ) 43, in which
pending store and load operations are respectively queued for
access to a data cache 44 that provides for loading and
storing of data values in memory that are needed or modified by the
pipelines in core 20. Data cache 44 is coupled to one or more
translation look-aside buffers (TLB) 45 that map real or virtual
addresses in data cache 44 to addresses in an external memory
space.
[0021] Referring now to FIG. 3, a block diagram illustrating
details of IDU 26 within processor core 20 of FIG. 2 is shown. LHS
prediction logic 40 provides tracking of pending store operations
by generating entries for each store instruction decoded by
instruction decode unit 24 in LHS detection table 41. When store
instructions are received by IDU 26, address information associated
with the store instruction, which in the particular embodiment are
the base registers and/or the immediate value used in calculating
the effective address (EA) of the store operation, is inserted in
LHS detection table 41. The entry is also populated with an
instruction tag (ITAG) identifying the particular store
instruction, so that the entry in LHS detection table 41 can be
invalidated when the particular store instruction completes, along
with other information such as the thread identifier for the
instruction, the valid bit and the store instruction type, which is
used to determine which field(s) to compare for address matching.
FIG. 4 shows an exemplary LHS detection table 41 containing two
valid entries and one entry that has been retired due to
completion/commit of the store instruction to data cache 44 or
invalidated due to a flush. When IDU 26 receives a load
instruction, LHS prediction logic 40 compares the address information
(e.g., immediate field and/or base registers, depending on the type
of addressing) of the load instruction with each entry in LHS
detection table 41, which may be facilitated by implementing LHS
detection table 41 with a content-addressable memory (CAM) that
produces the ITAG of the LHS detection table entry given the
address information, thread identifier and store instruction type
for valid entries. LHS detection table 41 may alternatively be
organized as a first-in-first-out (FIFO) queue. The load
instruction is then dispatched to issue queue 27D with the ITAG of
the entry, in order to cause LSU 30 to retain the load instruction
in issue queue 27D until the store instruction causing the LHS
hazard in conjunction with the load instruction has issued,
completed, or has been otherwise irrevocably committed or flushed.
In one embodiment of the invention, the lookup in LHS detection
table 41 locates the most recent entry matching the look-up
information, so that if multiple matching entries exist in LHS
detection table 41, the load instruction will be queued until the
last store instruction causing an LHS hazard has been
issued/completed/committed/flushed. In another embodiment, before
an entry is generated in LHS detection table 41, a look-up is
performed to determine if a matching entry exists, and if so, the
existing entry is invalidated or updated with a new ITAG. If LHS
detection table 41 is full, the oldest entry is overwritten.
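The table behavior described in this paragraph can be sketched in Python as follows. This is a hypothetical software model, not the hardware design: it searches a small list linearly in place of a CAM, and the class and method names (`LhsDetectionTable`, `record_store`, `lookup_load`, `retire`) are illustrative. It follows the embodiment that invalidates an existing matching entry before a new store is recorded, reuses an invalid slot when one is free, and otherwise overwrites the oldest entry; the lookup still returns the most recent match, which would matter under the other embodiment.

```python
from dataclasses import dataclass
from itertools import count
from typing import Hashable, List, Optional


@dataclass
class LhsEntry:
    key: Hashable     # address information plus store type (hypothetical encoding)
    itag: int         # instruction tag of the store
    thread_id: int
    seq: int          # allocation order: larger means more recent
    valid: bool = True


class LhsDetectionTable:
    """Hypothetical software model of LHS detection table 41."""

    def __init__(self, capacity: int = 8) -> None:
        self.capacity = capacity
        self.entries: List[LhsEntry] = []
        self._seq = count()

    def _matches(self, key: Hashable, thread_id: int) -> List[LhsEntry]:
        # Linear search standing in for a CAM lookup over valid entries.
        return [e for e in self.entries
                if e.valid and e.key == key and e.thread_id == thread_id]

    def record_store(self, key: Hashable, itag: int, thread_id: int) -> None:
        # Keep at most one valid entry per address: invalidate any prior match.
        for e in self._matches(key, thread_id):
            e.valid = False
        new = LhsEntry(key, itag, thread_id, next(self._seq))
        free = [i for i, e in enumerate(self.entries) if not e.valid]
        if free:
            self.entries[free[0]] = new          # reuse a retired/invalid slot
        elif len(self.entries) < self.capacity:
            self.entries.append(new)
        else:                                    # full: overwrite the oldest entry
            oldest = min(range(len(self.entries)),
                         key=lambda i: self.entries[i].seq)
            self.entries[oldest] = new

    def lookup_load(self, key: Hashable, thread_id: int) -> Optional[int]:
        """Return the ITAG of the most recent matching store, or None."""
        matches = self._matches(key, thread_id)
        if not matches:
            return None
        return max(matches, key=lambda e: e.seq).itag

    def retire(self, itag: int) -> None:
        """Invalidate the entry of a store that completed or was flushed."""
        for e in self.entries:
            if e.itag == itag:
                e.valid = False
```

Because a load only matches entries with the same thread identifier, a store from one thread does not delay a load from another thread in this model.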
[0022] It should be noted that the above-described matching does
not detect all LHS hazards. For example, a store instruction using
immediate addressing may hit the same address as a load instruction
using register or register-indirect addressing, and no matching
entry in LHS detection table 41 will be found for the load. A load
involved in such an LHS hazard will instead be rejected during the
issue phase, after the full EA has been computed for both the load
and store instructions. However, most LHS hazards are expected to be
detected under normal circumstances, dramatically reducing the
number of load rejects due to LHS hazards. Further, an entry may be
found in LHS detection table 41 that is flagged as an LHS hazard but
is not one in actuality, for example, when a base register value has
been modified between a register-addressed load and a preceding
register-addressed store using the same base register pair.
Therefore, the method detects likely LHS hazards, not guaranteed
address conflicts/overlaps. However, such false positives are
expected to be rare compared with the number of actual LHS hazards
detected.
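The two failure modes can be illustrated by comparing the dispatch-time prediction with the effective addresses that become known only later. In this sketch, the key encoding `(mode, base registers, immediate)`, the helper names and the register values are assumptions chosen for illustration, and exact address equality stands in for the byte-range overlap check that real hardware would perform.

```python
from typing import Dict, Optional, Tuple

# Hypothetical encoding: (addressing mode, base registers, immediate or None).
Key = Tuple[str, Tuple[int, ...], Optional[int]]


def predicted(store: Key, load: Key) -> bool:
    # Dispatch time: only the decoded address fields are compared.
    return store == load


def effective_address(key: Key, gprs: Dict[int, int]) -> int:
    # Issue time: sum of base register values plus any immediate.
    _mode, regs, imm = key
    return sum(gprs[r] for r in regs) + (imm or 0)


def classify(store: Key, store_gprs: Dict[int, int],
             load: Key, load_gprs: Dict[int, int]) -> str:
    guess = predicted(store, load)
    actual = effective_address(store, store_gprs) == effective_address(load, load_gprs)
    if guess and actual:
        return "predicted hazard"
    if guess:
        return "false positive: load waits unnecessarily"
    if actual:
        return "missed: load rejected at issue"
    return "no hazard"


# Immediate-addressed store vs. register-addressed load to the same address.
print(classify(("imm", (), 0x1000), {}, ("reg", (3,), None), {3: 0x1000}))
# Same base register pair, but r3 was modified between the store and the load.
print(classify(("reg", (3, 4), None), {3: 0x1000, 4: 0x10},
               ("reg", (3, 4), None), {3: 0x2000, 4: 0x10}))
```

The first case is caught only by the conventional issue-time check; the second costs a needless wait but no reject or flush.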
[0023] Referring now to FIG. 5, a method of operation of processor
core 20 in accordance with an embodiment of the present invention
is illustrated in a flowchart. As illustrated in FIG. 5, when IFU 22
fetches an instruction (step 60) and the instruction is decoded
(step 61), if the instruction is a store instruction (decision 62),
and if there is an existing entry in LHS detection table 41 that
matches the base registers (register-based addressing) and/or
immediate field (immediate addressing) of the store instruction
(decision 63), the existing entry is invalidated, or alternatively
over-written (step 64). The base registers and immediate field of
the store instruction are written to an entry in LHS detection
table 41 (step 65) and the store instruction is dispatched (step
66). If the instruction is not a store instruction (decision 62),
but is a load instruction (decision 67), if the base registers
(register-based addressing) or immediate field (immediate
addressing) match an entry in LHS detection table 41 (decision 68),
the load instruction is dispatched to issue queue 27D with an ITAG
of the store instruction corresponding to the table entry (step
70). Otherwise, the load instruction is dispatched without an ITAG
(step 69), as are instructions that are neither load nor store
instructions. Until the system is shut down (decision 71), steps
60-70 are repeated.
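The FIG. 5 flow can be sketched as a loop over decoded instructions. This is a simplified model under several assumptions: the detection table is a Python dict keyed by hypothetical address information, so writing a store's entry naturally over-writes any existing match (the over-write alternative of step 64); thread identifiers and entry retirement on completion or flush are omitted; and all instructions are appended to a single list rather than routed to separate issue queues. The `Instr` type and `dispatch` function are illustrative names only.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Iterable, List, Optional, Tuple


@dataclass
class Instr:
    itag: int
    op: str                          # "load", "store", or any other opcode class
    key: Optional[Hashable] = None   # decoded address information (hypothetical)


# (itag, op, ITAG of the store the load must wait for, or None)
Dispatched = Tuple[int, str, Optional[int]]


def dispatch(stream: Iterable[Instr]) -> List[Dispatched]:
    lhs_table: Dict[Hashable, int] = {}   # address information -> store ITAG
    out: List[Dispatched] = []
    for ins in stream:                    # steps 60-61: fetched and decoded
        if ins.op == "store":             # decision 62
            # Decision 63, steps 64-65: assignment over-writes a matching entry.
            lhs_table[ins.key] = ins.itag
            out.append((ins.itag, "store", None))             # step 66
        elif ins.op == "load":            # decision 67
            dep = lhs_table.get(ins.key)                      # decision 68
            out.append((ins.itag, "load", dep))               # step 70 or 69
        else:
            out.append((ins.itag, ins.op, None))              # step 69
    return out


program = [
    Instr(1, "store", ("disp", (1,), 8)),
    Instr(2, "add"),
    Instr(3, "load", ("disp", (1,), 8)),
    Instr(4, "load", ("disp", (1,), 16)),
]
print(dispatch(program))
# [(1, 'store', None), (2, 'add', None), (3, 'load', 1), (4, 'load', None)]
```

Only load 3 carries a dependency, because its address fields match store 1; load 4 is dispatched untagged.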
[0024] Referring now to FIG. 6, another method of operation of
processor core 20 in accordance with an embodiment of the present
invention is illustrated in a flowchart. As illustrated in FIG. 6,
LSU 30 peeks at issue queue 27D (step 80), and if the next
instruction has an existing dependency (decision 81), such as a
dependency generated by the method of FIG. 5 when the load
instruction is dispatched with an ITAG of a store for which an LHS
hazard is predicted, the peek moves to the next instruction (step
82). If the next instruction does not have an existing dependency
(decision 81), the instruction is a load instruction (decision 83),
and LDQ 43 is not full (decision 84), the load instruction is issued
to LDQ 43 (step 87). Similarly, if the instruction is a store
instruction (decision 85) and STQ 42 is not full (decision 86), the
store instruction is issued to STQ 42 (step 87). Until the system is
shut down (decision 88), steps 80-87 are repeated.
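The FIG. 6 issue flow can be modeled in a similar way. In this hypothetical sketch, each issue-queue entry is an `(itag, op, dependency)` tuple as produced at dispatch, a tagged load becomes eligible once the store it names has been issued to the STQ (the text also allows waiting for completion, commit or flush), and LDQ 43 and STQ 42 are plain bounded deques that never drain, so a full queue simply leaves its entries in place. The scan repeats until a pass issues nothing.

```python
from collections import deque
from typing import Deque, List, Optional, Set, Tuple

Entry = Tuple[int, str, Optional[int]]   # (itag, "load"/"store", dependency ITAG)


def issue(issue_queue: List[Entry], ldq_size: int = 4,
          stq_size: int = 4) -> Tuple[List[int], List[Entry]]:
    """Return (ITAGs in issue order, entries still waiting in the queue)."""
    ldq: Deque[int] = deque()            # stands in for LDQ 43 (never drained)
    stq: Deque[int] = deque()            # stands in for STQ 42 (never drained)
    issued_stores: Set[int] = set()
    pending = list(issue_queue)
    order: List[int] = []
    progress = True
    while progress:                      # repeat the loop until a pass issues nothing
        progress = False
        for entry in list(pending):      # step 80: peek entries oldest-first
            itag, op, dep = entry
            if dep is not None and dep not in issued_stores:
                continue                 # decision 81 -> step 82: skip the load
            if op == "load" and len(ldq) < ldq_size:       # decisions 83-84
                ldq.append(itag)
            elif op == "store" and len(stq) < stq_size:    # decisions 85-86
                stq.append(itag)
                issued_stores.add(itag)
            else:
                continue                 # target queue full: entry stays put
            pending.remove(entry)        # step 87: entry leaves the issue queue
            order.append(itag)
            progress = True
    return order, pending


print(issue([(3, "load", 1), (4, "load", None), (1, "store", None)]))
# ([4, 1, 3], [])
```

The tagged load 3 is bypassed while the independent load 4 and store 1 issue; it issues on the next pass, so no reject or flush occurs. In this model, a load that names a store not present in the queue would wait indefinitely, since stores issued earlier are not tracked.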
[0025] While the invention has been particularly shown and
described with reference to the preferred embodiments thereof, it
will be understood by those skilled in the art that the foregoing
and other changes in form and details may be made therein without
departing from the spirit and scope of the invention.