U.S. patent application number 10/770787 was filed with the patent office on 2004-02-03 and published on 2004-08-12 for loop handling for single instruction multiple datapath processor architectures.
This patent application is currently assigned to ChipWrights Design, Inc., a Massachusetts corporation. Invention is credited to Redford, John.
Application Number | 20040158691 10/770787 |
Family ID | 24858560 |
Filed Date | 2004-02-03 |
United States Patent Application | 20040158691 |
Kind Code | A1 |
Redford, John | August 12, 2004 |
Loop handling for single instruction multiple datapath processor architectures
Abstract
A method of controlling the enabling of processor datapaths in a
SIMD processor during a loop processing operation is described. The
information used by the method includes an allocation between the
data items and a memory, a size of the array, and a number of
remaining parallel passes of the datapaths in the loop processing
operation. A computer instruction is also provided, which includes
a loop handling instruction that specifies the enabling of one of a
plurality of processor datapaths during processing an array of data
items. The instruction includes a count field that specifies the
number of remaining parallel loop passes to process the array and a
count field that specifies the number of serial loop passes to
process the array. Different instructions can be used to handle
different allocations of passes to parallel datapaths. The
instruction also uses information about the total number of
datapaths.
Inventors: | Redford, John (Cambridge, MA) |
Correspondence Address: | FISH & RICHARDSON PC, 225 FRANKLIN ST, BOSTON, MA 02110, US |
Assignee: | ChipWrights Design, Inc., a Massachusetts corporation |
Family ID: | 24858560 |
Appl. No.: | 10/770787 |
Filed: | February 3, 2004 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
10770787 | Feb 3, 2004 | |
09711556 | Nov 13, 2000 | 6732253 |
Current U.S. Class: | 712/13; 712/E9.039; 712/E9.071; 712/E9.078 |
Current CPC Class: | G06F 9/3885 20130101; G06F 15/8007 20130101; G06F 9/3455 20130101; G06F 9/325 20130101; G06F 9/345 20130101 |
Class at Publication: | 712/013 |
International Class: | G06F 015/00 |
Claims
What is claimed is:
1. A method of controlling whether to enable one of a plurality of
processor datapaths in a SIMD processor that are operating on data
elements in an array, comprising: determining whether to enable the
datapath based on information about parameters of the SIMD
processor and the array, and a processing state of the datapaths
relative to the data items in the array.
2. The method of claim 1 wherein the information includes an
allocation between the data items and a memory.
3. The method of claim 2 wherein the information includes whether
the allocation is unity-stride, contiguous, or striped stride.
4. The method of claim 1 wherein the information includes a total
number of parallel loop passes in a loop processing operation being
performed by the datapaths.
5. The method of claim 1 wherein the information indicates a size
of the array.
6. The method of claim 1 wherein the processing state is a number
of remaining parallel loop passes in the loop processing
operation.
7. The method of claim 1 wherein the information includes a number
of said processor datapaths.
8. The method of claim 1 wherein the information includes an
allocation between the data items and a memory, a total number of
parallel loop passes in a loop processing operation being performed
by the datapaths, a size of the array, a number of remaining
parallel passes of the datapaths in the loop processing operation,
and a number of said processor datapaths.
9. The method of claim 8 wherein the allocation between the data
items and the memory is a unity-stride.
10. The method of claim 9 wherein the total number of loop passes
is determined by dividing a total number of serial loop passes by
the total number of datapaths implemented and rounded up to a next
integer.
11. The method of claim 10 wherein enabling comprises: determining
whether the total number of parallel loop passes minus the number
of remaining loop passes multiplied by the total number of
datapaths implemented plus a datapath number is less than the total
number of serial loop passes.
12. The method of claim 8 wherein the allocation between the data
items and the memory is a contiguous stride.
13. The method of claim 12 wherein the total number of parallel
loop passes is determined by dividing the total number of serial
loop passes by the number of datapaths and rounded up to a next
integer.
14. The method of claim 13 wherein enabling comprises: determining
whether the total number of parallel loop passes multiplied by a
datapath number plus the total number of parallel loop passes minus
a number of remaining parallel loop passes is less than the total
number of serial loop passes.
15. The method of claim 8 wherein the allocation is a striped
stride.
16. The method of claim 15 wherein enabling comprises: determining
whether the total number of parallel loop passes times a datapath
number plus the total number of parallel loop passes minus a number
of remaining parallel loop passes is less than the total number of
serial loop passes.
17. A computer instruction comprising: a loop handling instruction
that specifies the enabling of one of a plurality of processor
datapaths during processing an array of data items.
18. The instruction of claim 17 further comprising: a parallel
count field that specifies the number of remaining parallel loop
passes to process the array.
19. The instruction of claim 17 further comprising: a serial count
field that specifies the number of serial loop passes to process
the array.
20. A processor comprising: a register file; an arithmetic logic
unit coupled to the register file and a program control store that
stores a loop handling instruction that causes the processor to
enable one of a plurality of processor datapaths during processing
of an array of data.
Description
TECHNICAL FIELD
[0001] This invention relates to loop handling operations over an
array of data items in a single instruction multiple datapath
(SIMD) processor architecture.
BACKGROUND
[0002] Parallel processing is an efficient way of processing an
array of data items. A SIMD processor is a parallel processor array
architecture wherein multiple datapaths are controlled by a single
instruction. Each datapath handles one data item at a given time.
In a simple example, in a SIMD processor having four datapaths, the
data items in an array of eight data items would be processed by the
four datapaths in two passes of a loop operation. The
allocation between datapaths and data items may vary, but in one
approach, in a first pass the first data item in the array is
processed by a first datapath, a second data item in the array is
processed by a second datapath, a third data item is processed by a
third datapath, and a fourth data item is processed by a fourth
datapath. In a second pass, a fifth data item is processed by the
first datapath, a sixth data item is processed by the second
datapath, a seventh data item is processed by the third datapath,
and an eighth data item is processed by the fourth datapath.
[0003] Problems may occur when the number of data items in the
array is not an integer multiple of the number of datapaths. For
example, if the simple example above is modified so that there are
four datapaths and an array having seven data items, then during the
second pass the fourth datapath has no eighth data item to process.
As a result, the fourth datapath may
erroneously write over some other data structure in memory, unless
the fourth datapath is disabled during the second pass.
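The overrun described above can be sketched in Python. The names `run_loop` and `mask_overrun`, the 0-based addresses, and the flat-list model of memory are assumptions made for illustration only; they are not part of the described processor.

```python
NUM_DATAPATHS = 4
ARRAY_SIZE = 7

def run_loop(mask_overrun):
    """Simulate two parallel passes over a seven-item array.

    Memory is modeled as seven array slots followed by one slot that
    belongs to an unrelated data structure.
    """
    memory = ["item"] * ARRAY_SIZE + ["other"]
    passes = -(-ARRAY_SIZE // NUM_DATAPATHS)  # ceiling division, gives 2
    for p in range(passes):
        for dp in range(NUM_DATAPATHS):
            address = p * NUM_DATAPATHS + dp
            # A datapath is disabled only when masking is on and the
            # address falls past the end of the array.
            enabled = address < ARRAY_SIZE or not mask_overrun
            if enabled:
                memory[address] = "written"
    return memory

unmasked = run_loop(mask_overrun=False)
masked = run_loop(mask_overrun=True)
print(unmasked[-1])  # the neighboring structure was overwritten
print(masked[-1])    # the neighboring structure is intact
```

With masking off, the fourth datapath writes to the slot just past the array in the second pass; with masking on, that slot is left untouched.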
[0004] One way of avoiding such erroneous overwriting is to force
the size of the array, i.e., the number of data items contained
within the array, to be an integer multiple of the number of
datapaths. Such an approach assumes that programmers have a priori
control of how data items are allocated in the array, which they
may not always have.
[0005] Typically, each datapath in a SIMD processor has an
associated processor enable bit that controls whether a datapath is
enabled or disabled. This allows a datapath to be disabled when,
e.g., the datapath would otherwise overrun the array.
SUMMARY
[0006] In a general aspect, the invention features a method of
controlling whether to enable one of a plurality of processor
datapaths in a SIMD processor that are operating on data elements
in an array, including determining whether to enable the datapath
based on information about parameters of the SIMD processor and the
array, and a processing state of the datapaths relative to the data
items in the array.
[0007] In a preferred embodiment, the information includes an
allocation between the data items and a memory, a total number of
parallel loop passes in a loop processing operation being performed
by the datapaths, a size of the array, and a number of datapaths
(i.e., how many datapaths there are in the SIMD processor). The
processing state is a number of remaining parallel passes of the
datapaths in the loop processing operation.
[0008] The allocation between the data items and the memory may be
unity-stride, contiguous or striped-stride.
[0009] In another aspect, the invention features a computer
instruction including a loop handling instruction that specifies
the enabling of one of a plurality of processor datapaths during
processing an array of data items.
[0010] In a preferred embodiment, the instruction includes a
parallel count field that specifies the number of remaining
parallel loop passes to process the array, and a serial count field
that specifies the number of serial loop passes to process the
array.
[0011] In another aspect, the invention features a processor
including a register file and an arithmetic logic unit coupled to
the register file, and a program control store that stores a loop
handling instruction that causes the processor to enable one of a
plurality of processor datapaths during processing of an array of
data.
[0012] Embodiments of various aspects of the invention may have
one or more of the following advantages.
[0013] Datapaths may be disabled without having prior knowledge of
the number of data items in the array.
[0014] The method is readily extensible to a variety of memory
allocation schemes.
[0015] The loop handling instruction saves instruction memory
because the many operations needed to determine whether to enable
or disable a datapath may be specified with a simple and powerful
single instruction that also saves register space.
[0016] The loop handling instruction saves a programmer from having
to force the number of data items in the array of data items to be
an integer multiple of the number of datapaths.
[0017] Other features and advantages of the invention will be
apparent from the following detailed description and drawings, and
from the claims.
DESCRIPTION OF DRAWINGS
[0018] FIG. 1 is a block diagram of a single instruction multiple
datapath (SIMD) processor.
[0019] FIG. 2 shows a table of how thirty data items in an array
are handled by a SIMD processor having four datapaths during loop
processing in a unity stride allocation of memory.
[0020] FIG. 3 shows the syntax of a loop handling instruction.
[0021] FIG. 4 shows a table of how thirty data items in an array
are handled by a SIMD processor having four datapaths during loop
processing in a contiguous stride allocation of memory.
[0022] FIG. 5 shows the syntax of a loop handling instruction
combined with a loop branch.
[0023] FIG. 6 is a flow diagram of a process of controlling the
enabling of datapaths in a SIMD processor during loop
processing.
[0024] Like reference symbols in the various drawings indicate like
elements.
DETAILED DESCRIPTION
[0025] Referring to FIG. 1, a single instruction multiple datapath
(SIMD) processor 10 includes an instruction cache 12, control logic
14, a serial datapath 16, and a number of parallel datapaths labeled
18a, 18b, 18c, . . . 18n. The parallel datapaths 18 write to a
memory 20. Each of the datapaths 18 has an associated processor
enable (PE) bit 22. Specifically, parallel datapath 18a is
associated with a PE bit 22a, parallel datapath 18b is associated
with a PE bit 22b, and so forth. When a PE bit is enabled, its
associated parallel datapath is enabled and data items may be
written by that parallel datapath. For example, if PE bit 22a is
enabled, data items may be written by parallel datapath 18a; if PE
bit 22b is enabled, data items may be written by parallel datapath
18b. If PE bit 22n is enabled, data items may be written by
parallel datapath 18n. When a PE bit is disabled, its associated
parallel datapath is disabled and data items may not be written by
that parallel datapath.
[0026] In operation, the control logic 14 fetches an instruction
from the instruction cache 12. The instruction is fed to the serial
datapath 16, which provides the instruction to the datapaths 18. The
datapaths 18 are read together and written together unless
the processor enable bit is disabled for a particular datapath.
[0027] One or more of the datapaths 18 may need to be disabled
during a loop processing operation of an array of data items to
prevent an unused datapath from overrunning the end of the array and
erroneously writing over another data structure in memory. Rather
than manually having to determine when during the loop processing
operation to enable and disable datapaths, this determination may
be made on the fly during the loop processing operation, based on
information about parameters of the SIMD processor and the array,
and the processing state of the datapaths relative to the data
items in the array. This information includes: (1) the total number
of parallel loop passes occurring in the loop processing operation,
(2) the number of loop passes that would execute in a serial
datapath design (which indicates the size of the array), (3) the
number of remaining parallel passes occurring in the loop
processing operation, (4) the memory allocation used to allocate
data items of the array among the datapaths, and (5) the number of
parallel datapaths. Instructions that enable or disable a processor
enable bit for a datapath (thereby enabling or disabling the
datapath) during loop processing based on this information are
provided.
[0028] There are many ways to allocate memory for processing of an
array of data items in a SIMD processor. The simplest memory
allocation is one in which each of the NDP datapaths (NDP being the
number of datapaths) takes every NDPth iteration of the loop. This
type of memory allocation is called "unity stride."
[0029] Referring to FIG. 2, for example, a table illustrating how
thirty data items numbered 0 to 29 in an array are handled by a
SIMD processor having four datapaths labeled DP0, DP1, DP2 and DP3,
respectively, during loop processing in a unity stride memory
allocation is shown. In order to process the array, eight parallel
loop passes are executed. In a parallel loop pass 1, data items 0,
1, 2, and 3 are handled by datapaths 0, 1, 2, and 3. In a parallel
loop pass 2, data items 4, 5, 6 and 7 are handled by datapaths 0,
1, 2, and 3. In a final parallel loop pass, parallel loop pass 8,
data items 28 and 29 are handled by datapaths 0 and 1 while
datapaths 2 and 3 must be disabled to avoid overrunning the array
and writing over other data stored in memory.
[0030] The table in FIG. 2 illustrates why this type of memory
allocation is referred to as unity-stride. The "stride" between
data items being processed in each of the parallel datapaths in any
given parallel loop pass is one. That is, the difference between
any two data items being processed by parallel datapaths in a
parallel loop pass is one (or unity).
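The unity-stride table of FIG. 2 can be regenerated with a short Python sketch. The function name `unity_stride_table` and the use of `None` to mark a datapath with no data item are assumptions made for this illustration.

```python
import math

def unity_stride_table(num_items, num_datapaths):
    """Return one row per parallel loop pass; each row lists the data item
    handled by each datapath, or None if the datapath has no item."""
    passes = math.ceil(num_items / num_datapaths)
    table = []
    for p in range(passes):
        row = []
        for dp in range(num_datapaths):
            # Neighboring datapaths handle neighboring items (stride of one).
            item = p * num_datapaths + dp
            row.append(item if item < num_items else None)
        table.append(row)
    return table

table = unity_stride_table(30, 4)
for pass_number, row in enumerate(table, start=1):
    print(pass_number, row)
```

For thirty data items and four datapaths, the sketch yields eight rows, and only the last row contains empty slots, for datapaths 2 and 3.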
[0031] In the unity stride allocation, as the data items
are processed, a pattern emerges. Specifically, the pattern
illustrates that only two datapaths in a final parallel loop pass
need to be disabled. (Obviously, the pattern illustrated in FIG. 2
is trivial; as the number of datapaths and the array size are
increased, the pattern becomes more complex, but is discernible in
time.) From a knowledge of the pattern, the total number of loop
passes that would execute in a serial machine (which indicates the
size of the array), the number of remaining parallel loop passes,
and the number of datapaths, an instruction is provided to
determine whether a particular datapath should be disabled during a
particular parallel loop pass.
[0032] Referring to FIG. 3, a loop processor enable instruction 30
includes a field C representing the number of remaining parallel
loop passes during a loop processing operation, and a field L
representing the overall number of passes needed to service all the
data items in an array in a serial machine architecture. The
instruction 30 includes a memory allocation designation x. In the
example described with reference to FIG. 2, the memory allocation
designation x would refer to a unity-stride memory allocation,
i.e., U, and L=30 since there are thirty data items that would
require thirty loop passes in a serial machine architecture. PE [i,
j] represents the state of the processor enable bit for datapath i
during parallel loop pass j.
[0033] For the unity-stride example described in reference to FIG.
2, the total number of parallel loop passes is determined by
dividing the total number of serial loop passes by the number of
datapaths, and rounding the result up to the next integer. Thus, in
the example the total number of parallel loop passes equals 30/4,
which rounded up to the next integer produces 8.
[0034] Using the knowledge gained from the pattern present in the
unity-stride example and the values of C and L, a processor enable
bit associated with a datapath index i and a parallel loop pass j,
that is, PE [i, j], is enabled if the number of completed parallel
loop passes (the total number of parallel loop passes minus the
number of remaining parallel loop passes) multiplied by the total
number of datapaths, plus the datapath index, is less than the total
number of serial loop passes.
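The unity-stride enable condition can be written as a small Python predicate. This is a sketch under two assumptions that the text does not state explicitly: the count of remaining parallel loop passes includes the pass currently executing, and datapath indices start at 0. The name `pe_unity` is hypothetical.

```python
import math

def pe_unity(i, remaining, serial_passes, num_datapaths):
    """Return True if datapath i should be enabled.

    remaining     : remaining parallel loop passes, assumed to include
                    the current pass (field C)
    serial_passes : loop passes needed on a serial machine (field L)
    """
    total = math.ceil(serial_passes / num_datapaths)
    completed = total - remaining
    return completed * num_datapaths + i < serial_passes

# Thirty data items, four datapaths, as in the FIG. 2 example.
first_pass = [pe_unity(i, 8, 30, 4) for i in range(4)]
last_pass = [pe_unity(i, 1, 30, 4) for i in range(4)]
print(first_pass)
print(last_pass)
```

Under these assumptions, all four datapaths are enabled in the first pass, and only datapaths 0 and 1 are enabled in the eighth pass.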
[0035] Alternatively, SIMD processor 10 may use a contiguous stride
memory allocation. Referring to FIG. 4, a table illustrating how
thirty data items (0 to 29) in an array are handled by SIMD
processor 10 having four datapaths (DP0-DP3) and implementing a
contiguous stride memory allocation is shown. In order to process
all thirty data items in the array, eight parallel passes are
executed. In a parallel loop pass 1, data items 0, 8, 16 and 24 are
handled by datapaths 0, 1, 2 and 3, respectively. In parallel loop
pass 2, data items 1, 9, 17 and 25 are handled by datapaths 0, 1, 2
and 3. As processing continues, a pattern arises. In this specific
example, in parallel loop passes 7 and 8, datapath 3 needs to be
disabled to avoid writing over memory beyond the end of the thirty
data items in the array. All other datapaths are enabled in every
pass.
[0036] The contiguous-stride memory allocation is useful when
neighboring data items are used when working on a particular data
item. For example, if datapath 0 is processing data item 4 in
parallel loop pass 5, it already has data item 3 from parallel loop
pass 4 and will be using data item 5 on the next parallel loop
pass. This memory allocation is called contiguous stride allocation
because each datapath is accessing a contiguous region of the
array.
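The contiguous-stride table of FIG. 4 can be regenerated in the same way. The function name `contiguous_table` and the `None` marker for an empty slot are assumptions made for this illustration.

```python
import math

def contiguous_table(num_items, num_datapaths):
    """Return one row per parallel loop pass; each datapath walks its own
    contiguous region of the array, one item per pass."""
    passes = math.ceil(num_items / num_datapaths)
    table = []
    for p in range(passes):
        row = []
        for dp in range(num_datapaths):
            # Datapath dp owns items dp*passes .. dp*passes + passes - 1.
            item = dp * passes + p
            row.append(item if item < num_items else None)
        table.append(row)
    return table

table = contiguous_table(30, 4)
for pass_number, row in enumerate(table, start=1):
    print(pass_number, row)
```

For thirty data items and four datapaths, datapath 3 has no data item in passes 7 and 8, while datapath 0 handles items 0 through 7 in order.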
[0037] In the contiguous stride memory allocation, a pattern
emerges to suggest that a single datapath needs to be disabled
during executions of, in this example, the last two parallel loop
passes. Referring again to FIG. 3, a memory allocation designation
x=CONT represents a contiguous-stride memory allocation scheme. For
the example described with reference to FIG. 4, the total number of
parallel loop passes needed to process the array of data items is
determined by dividing the total number of serial loop passes by
the number of datapaths and rounding the result up to the next
integer. Thus, in the example, the total number of parallel loop
passes equals 30/4, rounded up to 8.
[0038] From the contiguous-stride memory allocation pattern and the
values of C and L, a processor enable bit associated with a
datapath index i and a parallel loop pass j, that is, PE [i, j], is
enabled if the total number of parallel loop passes multiplied by the
datapath index, plus the total number of parallel loop passes minus
the number of remaining parallel loop passes, is less than the total
number of serial loop passes.
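The contiguous-stride condition can be sketched as a Python predicate. As assumptions for illustration, the remaining-pass count is taken to include the current pass, datapath indices start at 0, and the name `pe_contiguous` is hypothetical.

```python
import math

def pe_contiguous(i, remaining, serial_passes, num_datapaths):
    """Return True if datapath i should be enabled in the current pass."""
    total = math.ceil(serial_passes / num_datapaths)
    completed = total - remaining
    return total * i + completed < serial_passes

# List every (datapath, pass number) pair that is disabled for
# thirty data items and four datapaths (eight parallel passes).
TOTAL = 8
disabled = [
    (i, TOTAL - remaining + 1)
    for remaining in range(TOTAL, 0, -1)
    for i in range(4)
    if not pe_contiguous(i, remaining, 30, 4)
]
print(disabled)
```

Under these assumptions, the only disabled slots are datapath 3 in passes 7 and 8, matching the FIG. 4 example.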
[0039] An interleaved memory system permits many memory accesses to
be done at once. The number of memory banks M in an interleaved
memory system is generally a power of two, since that allows the
memory bank selection to be made using the lowest address bits. If
the stride in a read or write instruction is also a power of two,
the memory interleaving may not help: when the stride is a multiple
of M, all the addresses will
try to access the same memory bank. For example, if M=4 and the
stride is also four, the addresses for the read or write would be
0, 4, 8, and so forth, and they would all have to be handled by
bank 0; banks 1, 2 and 3 would be idle.
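The bank conflict can be shown with a short Python sketch. It assumes the bank is selected by the address modulo M, which is what selection by the lowest address bits amounts to when M is a power of two. The name `banks_touched` is hypothetical.

```python
def banks_touched(stride, num_banks, num_accesses):
    """Return the bank index hit by each of the first num_accesses
    addresses 0, stride, 2*stride, ..."""
    return [(k * stride) % num_banks for k in range(num_accesses)]

conflicting = banks_touched(stride=4, num_banks=4, num_accesses=4)
spread = banks_touched(stride=1, num_banks=4, num_accesses=4)
print(conflicting)  # addresses 0, 4, 8, 12
print(spread)       # addresses 0, 1, 2, 3
```

With a stride of four, every access lands in bank 0; with a stride of one, the four accesses land in four different banks.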
[0040] To avoid having all of the data items processed in the same
memory bank, the stride value may be selected to be an odd number.
Selecting the stride to be an odd number spreads the addresses
evenly among M banks if M is a power of two, since any odd number
and any power of two are mutually prime. In the case of a 30
element array, the stride would be 9, not 8 as with the contiguous
allocation. Datapath 0 would correspond to array elements 0 to 8,
datapath 1 would be associated with array elements 9 to 17,
datapath 2 would correspond to elements 18 to 26, and datapath 3
would be assigned to elements 27 to 29. Datapath 3 would be turned
off for the last six elements, i.e., array elements 30 to 35. This
memory allocation is referred to as a striped-stride memory
allocation.
[0041] The number of parallel loop passes needed to process an
array of data items in a striped-stride memory allocation scheme is
determined by dividing the total number of serial loop passes by the
number of datapaths and rounding the result up to the next odd
integer.
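The striped-stride pass count and the resulting element ranges can be sketched in Python. The function names are hypothetical, and the bank calculation assumes four memory banks selected by the address modulo 4.

```python
import math

def striped_stride(serial_passes, num_datapaths):
    """Divide, round up, then bump an even result to the next odd integer."""
    stride = math.ceil(serial_passes / num_datapaths)
    if stride % 2 == 0:
        stride += 1
    return stride

def element_ranges(serial_passes, num_datapaths):
    """Return (first, last) array element for each datapath. A datapath
    whose region lies wholly past the array gets last < first."""
    stride = striped_stride(serial_passes, num_datapaths)
    ranges = []
    for dp in range(num_datapaths):
        first = dp * stride
        last = min(first + stride, serial_passes) - 1
        ranges.append((first, last))
    return ranges

stride = striped_stride(30, 4)
ranges = element_ranges(30, 4)
first_banks = [first % 4 for first, _ in ranges]
print(stride)
print(ranges)
print(first_banks)
```

For thirty elements and four datapaths, the stride is 9, datapath 3 starts at element 27, and the first access of each datapath falls in a different bank.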
[0042] Referring again to FIG. 3, a memory allocation designation x=S
represents striped-stride allocation. A processor enable bit
associated with a datapath index i and a parallel loop pass j, that is,
PE [i, j], is enabled if the total number of parallel loop passes times
the datapath index, plus the total number of parallel loop passes minus
the number of remaining parallel loop passes, is less than the total
number of serial loop passes.
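The striped-stride condition can be sketched as a Python predicate that uses the odd-rounded pass count. As assumptions for illustration, the remaining-pass count includes the current pass, datapath indices start at 0, and the name `pe_striped` is hypothetical.

```python
import math

def pe_striped(i, remaining, serial_passes, num_datapaths):
    """Return True if datapath i should be enabled in the current pass."""
    total = math.ceil(serial_passes / num_datapaths)
    if total % 2 == 0:
        total += 1  # round up to the next odd integer
    completed = total - remaining
    return total * i + completed < serial_passes

# Enable pattern of datapath 3 over the nine parallel passes needed for
# thirty data items and four datapaths (remaining counts down 9..1).
dp3_pattern = [pe_striped(3, remaining, 30, 4) for remaining in range(9, 0, -1)]
print(dp3_pattern)
```

Under these assumptions, datapath 3 is enabled for its first three passes (elements 27 to 29) and disabled for the last six.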
[0043] Referring to FIG. 5, the loop processor enable instruction
is shown combined with a loop branch instruction 70. This combined
instruction 70 will set the processor enable bit, as described
previously, according to the memory allocation scheme, the overall
number of parallel loop passes and the number of remaining parallel
loop passes, and test whether the number of remaining parallel loop
passes equals zero. If the number of remaining passes is greater than
zero, the branch is performed (i.e., "go to PC+displacement") to
perform the next pass of the loop operation. Otherwise, the loop is
exited, and processing continues. In either case, the number of
remaining parallel loop passes is decremented and the loop
processing operation continues.
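The effect of the combined instruction on a whole loop can be modeled with a Python sketch for the unity-stride case. The ordering used here (compute the enable condition at the top of each pass, decrement the remaining count at the bottom, and loop while the count is greater than zero) is an assumption made for illustration; the text does not fix the exact ordering within the instruction. The name `run_unity_loop` is hypothetical.

```python
import math

def run_unity_loop(data, num_datapaths, body):
    """Apply body to every element of data, one parallel pass at a time,
    skipping any datapath whose index falls past the end of the array."""
    serial_passes = len(data)
    total = math.ceil(serial_passes / num_datapaths)
    remaining = total
    while remaining > 0:  # models the branch back to the loop top
        completed = total - remaining
        for i in range(num_datapaths):
            index = completed * num_datapaths + i
            if index < serial_passes:  # processor enable bit for datapath i
                data[index] = body(data[index])
        remaining -= 1  # the remaining-pass count is decremented
    return data

result = run_unity_loop(list(range(30)), 4, lambda x: x * 2)
print(result)
```

Each of the thirty elements is processed exactly once, and no access is made past the end of the list even though the last pass has only two data items.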
[0044] Referring to FIG. 6, a process 100 of controlling the
enabling of a datapath in a SIMD processor during loop processing
determines 102 the number of serial loop passes to service all of
the data items in an array. The process determines 104 the number
of remaining parallel loop passes to service the array. The process
then tests 106 whether the memory allocation scheme is a unity
stride allocation. If the memory allocation is a unity stride
allocation, the processor enable bit for the datapath servicing the
data item is enabled 108 if the total number of parallel loop
passes minus the number of remaining parallel loop passes, all
multiplied by the total number of datapaths plus the datapath
index, is less than the total number of serial loop passes.
[0045] If the memory allocation is not unity stride, the process
tests 110 whether the memory allocation scheme is a contiguous
stride allocation. If the memory allocation is a contiguous stride
allocation, the processor enable bit for the datapath servicing the
data item is enabled 112 if the total number of parallel loop
passes multiplied by the datapath index plus the total number of
parallel loop passes minus the number of remaining parallel loop
passes is less than the total number of serial loop passes.
[0046] Finally, if the memory allocation is neither unity stride
nor contiguous, the process tests 114 whether the memory allocation
scheme is a striped stride allocation. If the memory allocation is
a striped stride allocation, the processor enable bit for the
datapath servicing the data item is enabled 116 if the total number
of parallel loop passes times the datapath index plus the total
number of parallel loop passes minus the number of remaining
parallel loop passes is less than the total number of serial loop
passes.
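The three branches of process 100 can be combined into one Python sketch. The scheme labels "unity", "contiguous" and "striped", the function names, and the convention that the remaining-pass count includes the current pass are all assumptions made for illustration.

```python
import math

def total_parallel_passes(scheme, serial_passes, num_datapaths):
    """Pass count per scheme; striped stride rounds up to an odd integer."""
    total = math.ceil(serial_passes / num_datapaths)
    if scheme == "striped" and total % 2 == 0:
        total += 1
    return total

def processor_enable(scheme, i, remaining, serial_passes, num_datapaths):
    """Return the processor enable bit for datapath i in the current pass."""
    total = total_parallel_passes(scheme, serial_passes, num_datapaths)
    completed = total - remaining
    if scheme == "unity":
        return completed * num_datapaths + i < serial_passes
    if scheme in ("contiguous", "striped"):
        return total * i + completed < serial_passes
    raise ValueError(f"unknown allocation scheme: {scheme}")

def enabled_slots(scheme, serial_passes, num_datapaths):
    """Count enabled (datapath, pass) slots over the whole loop."""
    total = total_parallel_passes(scheme, serial_passes, num_datapaths)
    return sum(
        processor_enable(scheme, i, remaining, serial_passes, num_datapaths)
        for remaining in range(total, 0, -1)
        for i in range(num_datapaths)
    )

counts = {s: enabled_slots(s, 30, 4) for s in ("unity", "contiguous", "striped")}
print(counts)
```

As a consistency check, the number of enabled slots equals the number of data items for each scheme, so every data item is serviced once and no datapath runs past the array.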
[0047] A number of embodiments of the invention have been
described. Nevertheless, it will be understood that various
modifications may be made without departing from the spirit and
scope of the invention. For example, for processing larger numbers
of data items, a lookup table could be utilized until a time at
which a pattern develops according to the memory allocation scheme
employed. Once the pattern develops, the enabling of datapaths is
determined by the method herein described. Accordingly, other
embodiments are within the scope of the following claims.
* * * * *