U.S. patent application number 16/251887 was filed with the patent office on 2019-07-11 for multi-functional execution lane for image processor.
The applicant listed for this patent is Google LLC. The invention is credited to Albert Meixner, Jason Rupert Redgrave, Ofer Shacham, and Artem Vasilyev.
Application Number | 20190213006 16/251887 |
Document ID | / |
Family ID | 57223790 |
Filed Date | 2019-07-11 |
![](/patent/app/20190213006/US20190213006A1-20190711-D00000.png)
![](/patent/app/20190213006/US20190213006A1-20190711-D00001.png)
![](/patent/app/20190213006/US20190213006A1-20190711-D00002.png)
![](/patent/app/20190213006/US20190213006A1-20190711-D00003.png)
![](/patent/app/20190213006/US20190213006A1-20190711-D00004.png)
![](/patent/app/20190213006/US20190213006A1-20190711-D00005.png)
![](/patent/app/20190213006/US20190213006A1-20190711-D00006.png)
![](/patent/app/20190213006/US20190213006A1-20190711-D00007.png)
![](/patent/app/20190213006/US20190213006A1-20190711-D00008.png)
United States Patent Application | 20190213006 |
Kind Code | A1 |
Vasilyev; Artem; et al. | July 11, 2019 |
MULTI-FUNCTIONAL EXECUTION LANE FOR IMAGE PROCESSOR
Abstract
An apparatus is described that includes an execution unit having
a multiply add computation unit, a first ALU logic unit and a
second ALU logic unit. The ALU unit is to perform first, second,
third and fourth instructions. The first instruction is a multiply
add instruction. The second instruction is to perform parallel ALU
operations with the first and second ALU logic units operating
simultaneously to produce different respective output resultants of
the second instruction. The third instruction is to perform
sequential ALU operations with one of the ALU logic units operating
from an output of the other of the ALU logic units to determine an
output resultant of the third instruction. The fourth instruction
is to perform an iterative divide operation in which the first ALU
logic unit and the second ALU logic unit operate during an iteration to
determine first and second division resultant digit values.
Inventors: | Vasilyev; Artem; (Stanford, CA); Redgrave; Jason Rupert; (Mountain View, CA); Meixner; Albert; (Mountain View, CA); Shacham; Ofer; (Los Altos, CA) |

Applicant:

| Name | City | State | Country | Type |
| --- | --- | --- | --- | --- |
| Google LLC | Mountain View | CA | US | |
Family ID: | 57223790 |
Appl. No.: | 16/251887 |
Filed: | January 18, 2019 |
Related U.S. Patent Documents

| Application Number | Filing Date | Patent Number | Child Application |
| --- | --- | --- | --- |
| 15591955 | May 10, 2017 | 10185560 | 16251887 |
| 14960334 | Dec 4, 2015 | 9830150 | 15591955 |
Current U.S. Class: | 1/1 |
Current CPC Class: | G06F 15/80 20130101; G06F 9/3001 20130101; G06F 9/30014 20130101; G06F 7/57 20130101 |
International Class: | G06F 9/30 20060101 G06F009/30; G06F 7/57 20060101 G06F007/57; G06F 15/80 20060101 G06F015/80 |
Claims
1.-19. (canceled)
20. An image processor comprising an array of processing units,
wherein each processing unit of the array of processing units
comprises: four input ports and two output ports; and a first
arithmetic-logic unit (ALU) and a second ALU configured to perform
a double-width ALU operation, during which: the first ALU is
configured to receive data from a first pair of input ports, to
perform a first full-width ALU operation to compute (i) a lower
half result of the double-width ALU operation and (ii) a carry
term, to provide the lower half result of the double-width ALU
operation to one of the two output ports, and to provide the carry
term to the second ALU, and the second ALU is configured to receive
data from a second pair of input ports and receive the carry term
from the first ALU, to perform a second full-width ALU operation to
compute an upper half result of the double-width ALU operation, and
to provide the upper half result of the double-width ALU operation
to another of the two output ports.
21. The image processor of claim 20, wherein each processing
unit is configured to perform the second full-width ALU operation
after the first full-width ALU operation is complete.
22. The image processor of claim 21, wherein each processing unit
has a carry line between the first ALU and the second ALU to
provide the carry term to the second ALU.
23. The image processor of claim 22, wherein the second ALU is
configured to perform the second full-width ALU operation only upon
receiving the carry term on the carry line.
24. The image processor of claim 20, wherein the first ALU and the
second ALU of each processing unit are further configured to
perform four half-width ALU operations at least partially in
parallel, during which: the first ALU and the second ALU are each
configured to receive input operands from a respective pair of
input ports, to perform a first half-width operation on a lower
half of each of the input operands, to perform a second half-width
operation on an upper half of each of the input operands, and to
write a result to a respective one of the two output ports.
25. The image processor of claim 20, wherein the first ALU and the
second ALU of each processing unit are further configured to
perform a fused operation comprising a second operation performed
serially on the result of a first operation, during which: the
first ALU is configured to receive data from the first pair of
input ports, to perform the first operation, and to provide a
result of the first operation to the second ALU; and the second ALU
is configured to receive data from one input port of the second
pair of input ports and to receive the result of the first
operation from the first ALU, to perform the second operation, and
to provide a result of the second operation to one of the two
output ports.
26. The image processor of claim 25, wherein the first operation
and the second operation are different.
27. A method implemented by a processing unit of an image
processor comprising an array of processing units, the method
comprising: performing, by a first arithmetic-logic unit (ALU) and
a second ALU of the processing unit, a double-width ALU operation
using data received at a first pair of input ports and a second
pair of input ports of the processing unit, including: receiving,
by the first ALU, data from the first pair of input ports of the
processing unit, performing, by the first ALU, a first full-width
ALU operation using the data from the first pair of input ports to
compute a lower half result of the double-width ALU operation and a
carry term, providing, by the first ALU, the lower half result of
the double-width ALU operation to one of two output ports of the
processing unit, providing, by the first ALU, the carry term to the
second ALU, receiving, by the second ALU, data from the second pair
of input ports of the processing unit, receiving, by the second
ALU, the carry term from the first ALU, performing, by the second
ALU, a second full-width ALU operation using the data from the
second pair of input ports and the carry term to compute an upper
half result of the double-width ALU operation, and providing, by
the second ALU, the upper half result of the double-width ALU
operation to another of the two output ports.
28. The method of claim 27, wherein performing the second
full-width ALU operation comprises performing the second full-width
ALU operation after the first full-width ALU operation is
complete.
29. The method of claim 28, wherein each processing unit has a
carry line between the first ALU and the second ALU to provide the
carry term to the second ALU.
30. The method of claim 29, wherein performing the second
full-width ALU operation comprises performing the second full-width ALU
operation only upon receiving the carry term on the carry line.
31. The method of claim 27, further comprising: performing, by the
first ALU and the second ALU, four half-width ALU operations at
least partially in parallel, including: receiving, by the first ALU
and the second ALU, respective input operands from a respective
pair of input ports, performing, by the first ALU and the second
ALU, a first half-width operation on a lower half of each of the
input operands, performing, by the first ALU and the second ALU, a
second half-width operation on an upper half of each of the input
operands, and writing, by the first ALU and the second ALU, a
result to a respective one of the two output ports.
32. The method of claim 27, further comprising: performing, by the
first ALU and the second ALU, a fused operation comprising a second
operation performed serially on the result of a first operation,
including: receiving, by the first ALU, data from the first pair of
input ports, performing, by the first ALU, the first operation, and
providing, by the first ALU, a result of the first operation to the
second ALU, receiving, by the second ALU, data from one input port
of the second pair of input ports, receiving, by the second ALU,
the result of the first operation from the first ALU, performing,
by the second ALU, the second operation, and providing, by the
second ALU, a result of the second operation to one of the two
output ports.
33. The method of claim 32, wherein the first operation and the
second operation are different.
34. An image processor comprising an array of processing units,
wherein each processing unit of the array of processing units is
configured to perform a double-width ALU operation, wherein each
processing unit comprises: four input ports and two output ports;
and means for performing a first full-width ALU operation using
data received at a first pair of the input ports to write a lower
half result of the double-width ALU operation to a first output
port and to generate a carry term; and means for performing a
second full-width ALU operation using the carry term and data
received at a second pair of the input ports and to write an upper
half result of the double-width ALU operation to a second output
port.
35. The image processor of claim 34, wherein each processing unit
is configured to perform the second full-width ALU operation after
the first full-width ALU operation is complete.
36. The image processor of claim 35, wherein each processing unit
has a carry line between the means for performing the first
full-width ALU operation and the means for performing the second
full-width ALU operation.
37. The image processor of claim 36, wherein the means for
performing the second full-width ALU operation is configured to
perform the second full-width ALU operation only upon receiving the
carry term on the carry line.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of and claims priority to
U.S. patent application Ser. No. 15/591,955, filed on May 10, 2017,
which is a continuation of U.S. patent application Ser. No.
14/960,334, filed on Dec. 4, 2015 (now U.S. Pat. No. 9,830,150).
The disclosures of the prior applications are considered part of
and are incorporated by reference in the disclosure of this
application.
FIELD OF INVENTION
[0002] The field of invention pertains generally to the computing
sciences, and, more specifically, to a multi-functional execution
lane for an image processor.
BACKGROUND
[0003] Image processing typically involves the processing of pixel
values that are organized into an array. Here, a spatially
organized two dimensional array captures the two dimensional nature
of images (additional dimensions may include time (e.g., a sequence
of two dimensional images) and data type (e.g., color)). In a
typical scenario, the arrayed pixel values are provided by a camera
that has generated a still image or a sequence of frames to capture
images of motion. Traditional image processors typically fall at
one of two extremes.
[0004] A first extreme performs image processing tasks as software
programs executing on a general purpose processor or general
purpose-like processor (e.g., a general purpose processor with
vector instruction enhancements). Although the first extreme
typically provides a highly versatile application software
development platform, its use of finer grained data structures
combined with the associated overhead (e.g., instruction fetch and
decode, handling of on-chip and off-chip data, speculative
execution) ultimately results in larger amounts of energy being
consumed per unit of data during execution of the program code.
[0005] A second, opposite extreme applies fixed function hardwired
circuitry to much larger blocks of data. The use of larger (as
opposed to finer grained) blocks of data applied directly to custom
designed circuits greatly reduces power consumption per unit of
data. However, the use of custom designed fixed function circuitry
generally results in a limited set of tasks that the processor is
able to perform. As such, the widely versatile programming
environment (that is associated with the first extreme) is lacking
in the second extreme.
[0006] A technology platform that provides for both highly
versatile application software development opportunities combined
with improved power efficiency per unit of data remains a desirable
yet missing solution.
SUMMARY
[0007] An apparatus is described that includes an execution unit
having a multiply add computation unit, a first ALU logic unit and
a second ALU logic unit. The ALU unit is to perform first, second,
third and fourth instructions. The first instruction is a multiply
add instruction. The second instruction is to perform parallel ALU
operations with the first and second ALU logic units operating
simultaneously to produce different respective output resultants of
the second instruction. The third instruction is to perform
sequential ALU operations with one of the ALU logic units operating
from an output of the other of the ALU logic units to determine an
output resultant of the third instruction. The fourth instruction
is to perform an iterative divide operation in which the first ALU
logic unit and the second ALU logic unit alternately operate during an
iteration to determine a quotient digit value.
[0008] An apparatus is described comprising an execution unit of an
image processor. The ALU unit comprises means for executing a first
instruction, the first instruction being a multiply add
instruction. The ALU unit comprises means for executing a second
instruction including performing parallel ALU operations with first
and second ALU logic units operating simultaneously to produce
different respective output resultants of the second instruction.
The ALU unit comprises means for executing a third instruction
including performing sequential ALU operations with one of the ALU
logic units operating from an output of the other of the ALU logic
units to determine an output resultant of the third instruction.
The ALU unit comprises means for executing a fourth instruction
including performing an iterative divide operation in which the
first ALU logic unit and the second ALU logic unit operate to
determine first and second division resultant digit values.
FIGURES
[0009] The following description and accompanying drawings are used
to illustrate embodiments of the invention. In the drawings:
[0010] FIG. 1 shows a stencil processor component of an image
processor;
[0011] FIG. 2 shows an instance of an execution lane and its
coupling to a two dimensional shift register;
[0012] FIG. 3 shows relative delay of functions performed by an
embodiment of the execution lane of FIG. 2;
[0013] FIG. 4 shows a design for a multi-functional execution
lane;
[0014] FIGS. 5a and 5b show circuitry and a methodology to perform
an iterative divide operation;
[0015] FIG. 6 shows a methodology performed by the execution lane
described with respect to FIGS. 3 through 5a,b;
[0016] FIG. 7 shows an embodiment of a computing system.
DETAILED DESCRIPTION
[0017] FIG. 1 shows an embodiment of a stencil processor
architecture 100. A stencil processor, as will be made more clear
from the following discussion, is a processor that is optimized or
otherwise designed to process stencils of image data. One or more
stencil processors may be integrated into an image processor that
performs stencil based tasks on images processed by the processor.
As observed in FIG. 1, the stencil processor includes a data
computation unit 101, a scalar processor 102 and associated memory
103 and an I/O unit 104. The data computation unit 101 includes an
array of execution lanes 105, a two-dimensional shift array
structure 106 and separate random access memories 107 associated
with specific rows or columns of the array.
[0018] The I/O unit 104 is responsible for loading input "sheets"
of image data received into the data computation unit 101 and
storing output sheets of data from the data computation unit
externally from the stencil processor. In an embodiment, the
loading of sheet data into the data computation unit 101 entails
parsing a received sheet into rows/columns of image data and
loading the rows/columns of image data into the two dimensional
shift register structure 106 or respective random access memories
107 of the rows/columns of the execution lane array (described in
more detail below).
[0019] If the sheet is initially loaded into memories 107, the
individual execution lanes within the execution lane array 105 may
then load sheet data into the two-dimensional shift register
structure 106 from the random access memories 107 when appropriate
(e.g., as a load instruction just prior to operation on the sheet's
data). Upon completion of the loading of a sheet of data into the
register structure 106 (whether directly from a sheet generator or
from memories 107), the execution lanes of the execution lane array
105 operate on the data and eventually "write back" finished data
externally from the stencil processor, or, into the random access
memories 107. In the latter case, the I/O unit 104 fetches the data from
the random access memories 107 to form an output sheet which is
then written externally from the stencil processor.
[0020] The scalar processor 102 includes a program controller 109
that reads the instructions of the stencil processor's program code
from instruction memory 103 and issues the instructions to the
execution lanes in the execution lane array 105. In an embodiment,
a single same instruction is broadcast to all execution lanes
within the array 105 to effect a SIMD-like behavior from the data
computation unit 101. In an embodiment, the instruction format of
the instructions read from scalar memory 103 and issued to the
execution lanes of the execution lane array 105 includes a
very-long-instruction-word (VLIW) type format that includes more
than one opcode per instruction. In a further embodiment, the VLIW
format includes both an ALU opcode that directs a mathematical
function performed by each execution lane's ALU and a memory opcode
(that directs a memory operation for a specific execution lane or
set of execution lanes).
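The VLIW word described above can be modeled as a minimal sketch, assuming hypothetical field names (`alu_op`, `mem_op`) that are not part of the specification: one ALU opcode and one memory opcode travel together, and the same word is broadcast to every lane for SIMD-like behavior.

```python
from dataclasses import dataclass

# Hypothetical model of the VLIW-style instruction word: one ALU opcode
# and one memory opcode per instruction, broadcast to all lanes.
@dataclass(frozen=True)
class VliwWord:
    alu_op: str  # directs the mathematical function of each lane's ALU
    mem_op: str  # directs a memory operation for a lane or set of lanes

def issue(word, num_lanes):
    """Broadcast the same instruction word to every execution lane."""
    return [word] * num_lanes

issued = issue(VliwWord(alu_op="ADD", mem_op="NOP"), num_lanes=4)
```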
[0021] The term "execution lane" refers to a set of one or more
execution units capable of executing an instruction (e.g., logic
circuitry that can execute an instruction). An execution lane can,
in various embodiments, include more processor-like functionality
beyond just execution units, however. For example, besides one or
more execution units, an execution lane may also include logic
circuitry that decodes a received instruction, or, in the case of
more MIMD-like designs, logic circuitry that fetches and decodes an
instruction. With respect to MIMD-like approaches, although a
centralized program control approach has largely been described
herein, a more distributed approach may be implemented in various
alternative embodiments (e.g., including program code and a program
controller within each execution lane of the array 105).
[0022] The combination of an execution lane array 105, program
controller 109 and two dimensional shift register structure 106
provides a widely adaptable/configurable hardware platform for a
broad range of programmable functions. For example, application
software developers are able to program kernels having a wide range
of different functional capability as well as dimension (e.g.,
stencil size) given that the individual execution lanes are able to
perform a wide variety of functions and are able to readily access
input image data proximate to any output array location.
[0023] During operation, because of the execution lane array 105
and two-dimensional shift register 106, multiple stencils of an
image can be operated on in parallel (as is understood in the art,
a stencil is typically implemented as a contiguous N×M or
N×M×C group of pixels within an image (where N can
equal M)). Here, e.g., each execution lane executes operations to
perform the processing for a particular stencil worth of data
within the image data, while, the two dimensional shift array
shifts its data to sequentially pass the data of each stencil to
register space coupled to the execution lane that is executing the
tasks for the stencil. Note that the two-dimensional shift register
106 may also be of larger dimension than the execution lane array
105 (e.g., if the execution lane array is of dimension X×X,
the two dimensional shift register 106 may be of dimension
Y×Y where Y>X). Here, in order to fully process stencils,
when the left edges of the stencils are being processed by the
execution lanes, the data in the shift register 106 will "push out"
off the right edge of the execution lane array 105. The extra
dimension of the shift register 106 is able to absorb the data that
is pushed off the edge of the execution lane array.
[0024] Apart from acting as a data store for image data being
operated on by the execution lane array 105, the random access
memories 107 may also keep one or more look-up tables. In various
embodiments one or more scalar look-up tables may also be
instantiated within the scalar memory 103.
[0025] A scalar look-up involves passing the same data value from
the same look-up table from the same index to each of the execution
lanes within the execution lane array 105. In various embodiments,
the VLIW instruction format described above is expanded to also
include a scalar opcode that directs a look-up operation performed
by the scalar processor into a scalar look-up table. The index that
is specified for use with the opcode may be an immediate operand or
fetched from some other data storage location. Regardless, in an
embodiment, a look-up from a scalar look-up table within scalar
memory essentially involves broadcasting the same data value to all
execution lanes within the execution lane array 105 during the same
clock cycle.
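The scalar look-up just described reduces to a simple model: one read by the scalar processor, one value delivered to every lane. A minimal sketch (function and table names are illustrative, not from the specification):

```python
def scalar_lookup_broadcast(table, index, num_lanes):
    """Model of a scalar look-up: the scalar processor performs a single
    read from the scalar look-up table, and the same data value is
    broadcast to every execution lane during the same cycle."""
    value = table[index]        # one access, one index, one table
    return [value] * num_lanes  # every lane receives the identical value

lut = [10, 20, 30, 40]
lanes = scalar_lookup_broadcast(lut, 2, num_lanes=8)
```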
[0026] FIG. 2 shows another, more detailed depiction of the unit
cell for an ALU execution unit 205 within an execution lane 201 and
corresponding local shift register structure. The execution lane
and the register space associated with each location in the
execution lane array is, in an embodiment, implemented by
instantiating the circuitry observed in FIG. 2 at each node of the
execution lane array. As observed in FIG. 2, the unit cell includes
an execution lane 201 coupled to a register file 202 consisting of
four registers R1 through R4. During any cycle, the ALU execution
unit may read from any of registers R1 through R4 and write to any
of registers R1 through R4.
[0027] In an embodiment, the two dimensional shift register
structure is implemented by permitting, during a single cycle, the
contents of any of (only) one of registers R1 through R3 to be
shifted "out" to one of its neighbor's register files through
output multiplexer 203, and, having the contents of any of (only)
one of registers R1 through R3 replaced with content that is
shifted "in" from a corresponding one of its neighbors through
input multiplexers 204 such that shifts between neighbors are in a
same direction (e.g., all execution lanes shift left, all execution
lanes shift right, etc.). In various embodiments, the execution
lanes themselves execute their own respective shift instruction to
effect a large scale SIMD two-dimensional shift of the shift
register's contents. Although it may be common for a same register
to have its contents shifted out and replaced with content that is
shifted in on a same cycle, the multiplexer arrangement 203, 204
permits for different shift source and shift target registers
within a same register file during a same cycle.
[0028] As depicted in FIG. 2 note that during a shift sequence an
execution lane will shift content out from its register file 202 to
each of its left, right, top and bottom neighbors. In conjunction
with the same shift sequence, the execution lane will also shift
content into its register file from a particular one of its left,
right, top and bottom neighbors. Again, the shift out target and
shift in source should be consistent with a same shift direction
for all execution lanes (e.g., if the shift out is to the right
neighbor, the shift in should be from the left neighbor).
[0029] Although in one embodiment the content of only one register
is permitted to be shifted per execution lane per cycle, other
embodiments may permit the content of more than one register to be
shifted in/out. For example, the content of two registers may be
shifted out/in during a same cycle if a second instance of the
multiplexer circuitry 203, 204 observed in FIG. 2 is incorporated
into the design of FIG. 2. Of course, in embodiments where the
content of only one register is permitted to be shifted per cycle,
shifts from multiple registers may take place between mathematical
operations by consuming more clock cycles for shifts between
mathematical operations (e.g., the contents of two registers may be
shifted between math ops by consuming two shift ops between the
math ops).
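The shift behavior described above can be sketched as a toy model of one SIMD shift step, in which every position receives data from the neighbor opposite the shift direction. Edge wrap-around here is purely to keep the model small; as noted earlier, the hardware instead spills edge data into the wider register array.

```python
def shift_grid(grid, direction):
    """One SIMD shift step of a modeled two-dimensional shift register:
    all lanes shift their register content in the same direction on the
    same cycle. Edges wrap in this toy model only."""
    rows, cols = len(grid), len(grid[0])
    # Source offset per direction: shifting "right" means position (r, c)
    # receives the value previously held at (r, c - 1), and so on.
    dr, dc = {"left": (0, 1), "right": (0, -1),
              "up": (1, 0), "down": (-1, 0)}[direction]
    return [[grid[(r + dr) % rows][(c + dc) % cols] for c in range(cols)]
            for r in range(rows)]
```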
[0030] Note that if less than all of the content of an execution lane's
register file is shifted out during a shift sequence, the content
of the non-shifted-out registers of each execution lane remains in
place (does not shift). As such, any non-shifted content that is not
replaced with shifted in content persists local to the execution
lane across the shifting cycle. A memory execution unit, not shown
in FIG. 2 for illustrative ease, may also exist in each execution
lane 201 to load/store data from/to the random access memory space
that is associated with the execution lane's row and/or column
within the execution lane array. Here, the memory unit acts as a
standard M unit in that it is often used to load/store data that
cannot be loaded/stored from/to the execution lane's own register
space. In various embodiments, the primary operation of the M unit
is to write data from a local register into memory, and, read data
from memory and write it into a local register.
[0031] With respect to the ISA opcodes supported by the ALU unit
205 of the hardware execution lane 201, in various embodiments, the
mathematical opcodes supported by the ALU unit 205 may include any
of the following ALU operations: add (ADD), subtract (SUB), move
(MOV), multiply (MUL), multiply-add (MAD), absolute value (ABS),
divide (DIV), shift-left (SHL), shift-right (SHR), return min or
max (MIN/MAX), select (SEL), logical AND (AND), logical OR (OR),
logical XOR (XOR), count leading zeroes (CLZ or LZC) and a logical
complement (NOT). An embodiment of an ALU unit 205 or portion
thereof is described in more detail below with respect to FIGS. 3
through 5. As described just above, memory access instructions can
be executed by the execution lane 201 to fetch/store data from/to
their associated random access memory. Additionally the hardware
execution lane 201 supports shift op instructions (right, left, up,
down) to shift data within the two dimensional shift register
structure. As described above, program control instructions are
largely executed by the scalar processor of the stencil
processor.
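The mathematical opcodes listed above can be modeled as a small dispatch table over 16-bit values. This is an illustrative sketch, not the hardware specification: the 16-bit width, the two's-complement handling of ABS, and the treatment of single-operand operations (NOT, CLZ) ignoring their second argument are all assumptions.

```python
MASK = 0xFFFF  # assumed 16-bit full width

def clz16(a):
    """Count leading zeroes within a 16-bit width."""
    return 16 - a.bit_length()

# Sketch of a subset of the lane ALU's mathematical opcodes.
ALU_OPS = {
    "ADD": lambda a, b: (a + b) & MASK,
    "SUB": lambda a, b: (a - b) & MASK,
    "MUL": lambda a, b: (a * b) & MASK,
    "ABS": lambda a, b: a if a < 0x8000 else (-a) & MASK,  # two's complement
    "SHL": lambda a, b: (a << b) & MASK,
    "SHR": lambda a, b: a >> b,
    "MIN": min,
    "MAX": max,
    "AND": lambda a, b: a & b,
    "OR":  lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NOT": lambda a, b: (~a) & MASK,
    "CLZ": lambda a, b: clz16(a),
}
```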
[0032] FIG. 3 shows a consumption time map for an execution unit,
or portion thereof, of an execution lane as described just above.
Specifically, FIG. 3 maps out the amount of time consumed by each
of a number of different instructions that can be executed by the
execution unit. As observed in FIG. 3, the execution unit can
perform: 1) a multiply-add instruction (MAD) 301; 2) two full width
(FW) or four half width (HW) ALU operations in parallel 302; 3) a
double width (2×W) ALU operation 303; 4) a FUSED operation of
the form ((C op D) op B) 304; and, 5) an iterative divide (DIV)
operation 306.
[0033] As observed in FIG. 3, the MAD operation 301, by itself,
consumes the most time amongst the various instructions that the
execution unit can execute. As such, a design perspective is that
the execution unit can be enhanced with multiple ALU logic units,
besides the logic that performs the MAD operation, to perform,
e.g., multiple ALU operations in parallel (such as operation 302)
and/or multiple ALU operations in series (such as operation
304).
[0034] FIG. 4 shows an embodiment of a design for an execution unit
405 that can support the different instructions illustrated in FIG.
3. As observed in FIG. 4, the execution unit 405 includes a first
ALU logic unit 401 and a second ALU logic unit 402 as well as a
multiply-add logic unit 403. Inputs from the register file are
labeled A, B, C, D while outputs written back to the register file
are labeled X and Y. As such, the execution unit 405 is a 4 input
port, 2 output port execution unit.
[0035] The multiply add logic unit 403, in an embodiment, performs
a full multiply-add instruction. That is, the multiply-add logic
unit 403 performs the function (A*B)+(C,D) where A is a full width
input operand, B is a full width operand and (C,D) is a
concatenation of two full width operands to form a double width
summation term. For example, if full width corresponds to 16 bits,
A is 16 bits, B is 16 bits and the summation term is 32 bits. As is
understood in the art, a multiply add of two full width values can
produce a double width resultant. As such, the resultant of the MAD
operation is written across the X, Y output ports where, e.g., X
includes the top half of the resultant and Y includes the bottom
half of the resultant. In a further embodiment the multiply-add
unit 403 supports a half width multiply add. Here, e.g., the lower
half of A is used as a first multiplicand, the lower half of B is
used as a second multiplicand and either C or D (but not a
concatenation) is used as the addend.
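The full multiply-add (A*B)+(C,D) at a 16-bit full width can be sketched as follows. Modeling C as the upper half of the concatenated addend is an assumption (the text does not fix the order); the split of the double width resultant, with the top half on X and the bottom half on Y, follows the description above.

```python
MASK16 = 0xFFFF

def mad(a, b, c, d):
    """Sketch of the full multiply-add (A*B)+(C,D) at 16-bit full width.
    (C,D) is modeled as the concatenation C:D, with C as the upper half
    (an assumption). The 32-bit resultant is split across the output
    ports: X gets the top half, Y gets the bottom half."""
    addend = (c << 16) | d                  # double width summation term
    result = (a * b + addend) & 0xFFFFFFFF
    x = (result >> 16) & MASK16             # top half of the resultant
    y = result & MASK16                     # bottom half of the resultant
    return x, y
```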
[0036] As mentioned above with respect to FIG. 3, the execution of
the MAD operation may consume more time than a typical ALU logic
unit. As such, the execution unit includes a pair of ALU logic
units 401, 402 to provide not only for parallel execution of ALU
operations but sequential ALU operations as well.
[0037] Here, referring to FIGS. 3 and 4, with respect to the dual
parallel FW operation 302, the first ALU logic unit 401 performs
the first full-width ALU operation (A op B) while the second ALU
performs the second full-width ALU operation (C op D) in parallel
with the first. Again, in an embodiment, full width operation
corresponds to 16 bits. Here, the first ALU logic unit 401 writes
the resultant of (A op B) into register X while the second ALU
logic unit 402 writes the resultant of (C op D) into register
Y.
[0038] In an embodiment, the instruction format for executing the
dual parallel full width ALU operation 302 includes an opcode that
specifies dual parallel full width operation and the destination
registers. In a further embodiment, the opcode, besides specifying
dual parallel full width operation, also specifies one or two ALU
operations. If the opcode only specifies one operation, both ALU
logic units 401, 402 will perform the same operation. By contrast
if the opcode specifies first and second different ALU operations,
the first ALU logic unit 401 performs one of the operations and the
second ALU logic unit 402 performs the second of the
operations.
[0039] With respect to the half width (HW) feature of operation
302, four half width ALU operations are performed in parallel.
Here, inputs A, B, C and D are each understood to include
two separate input operands. That is, e.g., a top half of A
corresponds to a first input operand, a lower half of A corresponds
to a second input operand, a top half of B corresponds to a third
input operand, a lower half of B corresponds to a fourth input
operand, etc.
[0040] As such, ALU logic unit 401 handles two ALU operations in
parallel and ALU logic unit 402 handles two ALU operations in
parallel. Thus, during execution, all four half width operations
are performed in parallel. At the end of the operation 302, ALU
logic unit 401 writes two half width resultants into register X and
ALU logic unit 402 writes two half width resultants into register
Y. As such, there are four separate half width resultants in
registers X and Y.
[0041] In an embodiment, the instruction format not only specifies
that parallel half width operation is to be performed but also
specifies which ALU operation(s) is/are to be performed. In various
embodiments the instruction format may specify that all four
operations are the same and specify only one operation, or may
specify four different operations. In the case of the latter, the
instruction format may alternatively specify the same operation four
times to effect the same operation for all four operations. Various
combinations of these instruction format approaches are also
possible.
[0042] With respect to the double wide ALU operation 303 of FIG. 3,
in an embodiment, the execution unit 405 performs the operation
(A,C) op (B,D) where (A,C) is a concatenation of inputs A and C
that form a first double wide input operand and (B,D) is a
concatenation of inputs B and D that form a second double wide
input operand. Here, a carry term may be passed along carry line
404 from the first ALU logic unit 401 to the second ALU logic unit
402 to carry operations forward from full width to double
width.
[0043] That is, in an embodiment, the C and D terms represent the
lower ordered halves of the two double wide input operands. The
second ALU logic unit 402 performs the specified operation (e.g.,
ADD) on the two lower halves and the resultant that is generated
corresponds to the lower half of the overall double wide resultant.
As such, the resultant from the second ALU logic unit 402 is
written into register Y. The operation on the lower halves may
generate a carry term that is carried to the first ALU logic unit
401 which continues the operation on the two respective upper
halves A and B of the input operands. The resultant from the first
ALU logic unit 401 corresponds to the upper half of the overall
resultant which is written into output register X. Because
operation on the upper halves by the first ALU logic unit 401 may
not be able to start until it receives the carry term from the
second ALU logic unit 402, the operation of the ALU logic units
402, 401 is sequential rather than parallel. As such, as observed
in FIG. 3, double width operations 303 may take approximately twice
as long as parallel full/half width operations 302.
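The carry chaining of the double wide operation can be sketched as follows. ADD is used because the carry line 404 has an obvious software analogue; this is a behavioral model of the sequential hand-off, not of the circuit.

```python
MASK16 = 0xFFFF

def double_wide_add(a, b, c, d):
    """Double wide ADD of (A,C) + (B,D): ALU logic unit 402 first adds
    the lower halves C and D; its carry-out travels carry line 404 to
    ALU logic unit 401, which then adds the upper halves A and B. The
    lower resultant lands in register Y, the upper in register X."""
    lower = (c & MASK16) + (d & MASK16)
    carry = lower >> 16                      # carry term on line 404
    y = lower & MASK16
    x = ((a & MASK16) + (b & MASK16) + carry) & MASK16
    return x, y
```

Because ALU logic unit 401 must wait for the carry from 402, the two additions here are inherently sequential, matching the roughly doubled latency of operation 303 in FIG. 3.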
[0044] Nevertheless, because the MAD operation 301 can consume more
time than two consecutive ALU logic unit operations, the machine
can be built around an execution unit 405 that packs as much
function as possible into the time period consumed by
its longest propagation delay operation. As such, in an embodiment,
the cycle time of the execution unit 405 corresponds to the
execution time of the MAD instruction 301. In an embodiment, the
instruction format for a double wide operation specifies not only
the operation to be performed, but also, that the operation is a
double wide operation.
[0045] With respect to the FUSED operation 304, the execution unit
405 performs the operation (C op D) op B. Here, like the double
wide ALU operation 303 discussed just above, the dual ALU logic
units 401, 402 operate sequentially because the second operation
operates on the resultant of the first operation. Here, the second
ALU logic unit 402 performs the initial operation on full width
inputs C and D. The resultant of the second ALU logic 402, instead
of being written into resultant register space, is instead
multiplexed into an input of the first ALU logic unit 401 via
multiplexer 406. The first ALU logic unit 401 then performs the
second operation and writes the resultant into register X.
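A sketch of the FUSED data flow, with the multiplexed hand-off from ALU logic unit 402 to ALU logic unit 401 reduced to an ordinary function composition (the operator-function interface is illustrative only):

```python
import operator

MASK16 = 0xFFFF

def fused(op1, op2, b, c, d):
    """FUSED operation (C op1 D) op2 B: ALU logic unit 402 computes the
    inner resultant, which is multiplexed (mux 406) into ALU logic unit
    401 instead of being written to register space; 401 then applies the
    second operation with B and writes register X."""
    inner = op1(c & MASK16, d & MASK16) & MASK16   # second ALU, first op
    return op2(inner, b & MASK16) & MASK16         # first ALU, second op
```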
[0046] In a further embodiment, a half width FUSED operation can
also be performed. Here, operation is as described above except
that only half of the input operands are utilized. That is, for
example, in calculating (C op D) op B, only the lower half of C and
the lower half of D are used to determine a half width result for
the first operation, then, only the lower half of B is used along
with the half width resultant of the first operation to perform the
second operation. The resultant is written as a half width value in
register X. Further still, two half width FUSED operations can be
performed in parallel. Here, operation is as described just above,
performed simultaneously with the same logical operations on the
high half of the operands. The result is two half width values
written into register X.
[0047] In an embodiment, the instruction format for a FUSED
operation specifies that a FUSED operation is to be performed and
specifies the two operations. If the same operation is to be performed
twice, in an embodiment, the instruction either specifies the
operation once or specifies it twice. In a further embodiment,
apart from specifying FUSED operation and the operation(s) to be
performed, the instruction format may further specify whether full
width or half width operation is to be performed.
[0048] Operation 306 of FIG. 3 illustrates that an iterative divide
operation can also be performed by the execution unit. In
particular, as explained in more detail below, in various
embodiments both ALU logic units 401, 402 collaboratively
participate in parallel during the iterative divide operation.
[0049] FIGS. 5a and 5b pertain to an embodiment for executing the
iterative divide instruction 306 of FIG. 3. FIG. 5a shows
additional circuitry to be added to the execution unit circuitry
405 of FIG. 4 to enable execution of the iterative divide
instruction (with the exception of the ALU logic units 501, 502
which are understood to be the same ALU logic units 401, 402 of
FIG. 4). FIG. 5b shows an embodiment of the micro-sequence
operation of the execution unit during execution of the iterative
divide instruction. As will become clearer from the following
discussion, a single execution of the instruction essentially
performs two iterations that are akin to the atomic act of long
division in which an attempt is made to divide the leading digit(s)
of a numerator (the value being divided into) by a divisor (the
value being divided into the numerator).
[0050] For simplicity, 16 bit division will be described (those of
ordinary skill will be able to extend the present teachings to
different width embodiments). With the embodiment described herein
performing two long division atomic acts, eight sequential
executions of the instruction are used to fully divide a 16 bit
numerator by a 16 bit divisor. That is, each atomic long division
act corresponds to the processing of a next significant bit of the
numerator. Two such significant bits are processed during a single
execution of the instruction. Therefore, in order to process all
bits of the numerator, eight sequential executions of the
instruction are needed to fully perform the complete division. The
output of a first instruction is written to the register file and
used as the input for the next subsequent instruction.
[0051] Referring to FIG. 5a, the numerator input is provided at the
B input and the divisor is presented at the D input. Again, in the
present embodiment, both of the B and D input operands are 16 bits.
A "packed" 32 bit data structure "PACK" that is a concatenation A,
B of the A and B input operands (the A operand is also 16 bits) can
be viewed as an initial data structure of a complete division
process. As an initial condition A is set to a string of sixteen
zeroes (000 . . . 0) and B is the numerator value.
[0052] Referring to FIGS. 5a and 5b, during a first micro-sequence,
a left shift of the PACK data structure is performed to create a
data structure A[14:0], B[15], referred to as the most significant
word of PACK ("PACK msw"). The divisor D is then subtracted 511
from PACK msw by the second ALU logic unit 502. This operation
corresponds to long division where the divisor is initially divided
into the leading digit of the numerator. Note that in an
embodiment, the ALU logic units 501, 502 are actually three input
ALUs and not two input ALUs as suggested by FIG. 4 (the third input
is reserved for the divisor D for the iterative divide
operation).
[0053] Different data processing procedures are then followed
depending on the sign 512 of the result of the subtraction 511.
Importantly, the first quotient resultant bit (i.e., the first bit
of the division result) is staged to be written into the second to
least significant bit of the Y output port 509 ("NEXT B[1]"). If
the result of the subtraction is negative, the quotient resultant
bit B[1] is set 513 to a 0. If the result of the subtraction is
positive, the quotient resultant bit B[1] is set 514 to a 1. The
setting of this bit corresponds to the process in long division
where the first digit of the quotient result is determined by
establishing whether or not the divisor value can be divided into
the first digit of the numerator.
[0054] Additionally, two different data structures are crafted and
presented to respective input ports ("1", "2") of a multiplexer 506
(which may be the same multiplexer as multiplexer 406 of FIG. 4).
The first data structure corresponds to a left shift of Pack msw
(A[13:0], B[15], B[14]) and is presented at input 1 of the
multiplexer 506. The creation of this data structure corresponds to
the process in long division where the next digit of the numerator
is appended to its most significant neighbor if the divisor does
not divide into the most significant neighbor.
[0055] The second crafted data structure corresponds to a left
shift of the result of the subtraction 511 that was just performed
by the second ALU logic unit 502 appended with bit B[13] and is
presented at the second input ("2") of the multiplexer 506. The
creation of this data structure corresponds to the situation in
long division where a divisor divides into the first digit(s) of
the numerator which sets up a next division into the result of the
difference between first digit(s) of the numerator and a multiple
of the divisor.
[0056] The first or second data structures are then selected by the
multiplexer 506 depending on whether the result of the subtraction
performed by the second ALU logic unit 502 yielded a positive or
negative result. If the subtraction yielded a negative result
(which corresponds to the divisor not being able to be divided into
the next significant digit of the numerator), the first data
structure is selected 513. If the subtraction yielded a positive
result (which corresponds to the divisor being able to be divided
into the next significant digit of the numerator), the second data
structure is selected 514.
[0057] The output of the multiplexer 506 is now understood to be
the new most significant word of the PACK data structure (new PACK
msw) and corresponds to the next value in the long division sequence
into which the divisor is next to be divided. As such,
the first ALU logic unit 501 subtracts 515 the divisor D from the
new PACK msw value. The least significant bit 510 of the Y output
B[0] is staged to be written as a 1 or a 0 depending on the sign of
the subtraction result from the first ALU 501 and represents the
next digit in the quotient resultant 517, 518.
[0058] A second multiplexer 508 selects between first and second
data structures depending 516 on the sign of the first ALU logic
unit's subtraction 515. A first data structure, presented at input
"1" of the second multiplexer 508, corresponds to the new PACK msw
value. A second data structure, presented at input "2" of the
second multiplexer 508, corresponds to the result of the
subtraction performed by the first ALU logic unit 501. Which of the
two data structures is selected depends on the sign of the result
of the subtraction 515 performed by the first ALU 501. If the
result of the subtraction is negative, the multiplexer selects the
new PACK msw value 517. If the result of the subtraction is
positive, the multiplexer selects the new PACK msw-D value 518.
[0059] The output of the second multiplexer 508 corresponds to the
NEXT A value which is written into the register file from the X
output. The value presented at the Y output (B[15:0]) is composed
of the leading edge of the B operand less its two most significant
bits, which were consumed by the two just performed iterations
(B[13:0]). The concatenation of these remainder bits of B with the
two newly calculated quotient digit resultants are written into the
register file as the new B operand NEXT B. For a next iteration,
the X output from the previous instruction is read into the A
operand and the Y output from the previous instruction is read into
the B operand. The process then repeats until all digits of the
original B operand have been processed (which, again, in the case
of a 16 bit B operand will consume eight sequential executions of
the instruction). At the conclusion of all iterations, the final
quotient will be written into the register file from the Y output
and any remainder will be represented in the NEXT A value which is
written into the register file from the X output.
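Functionally, the micro-sequence of FIGS. 5a and 5b implements restoring long division, two quotient bits per instruction execution. The sketch below models that behavior at the level of the A, B, D operands and the X, Y outputs rather than the mux-level data structures; the bit plumbing is simplified, but the invariants (A as running partial remainder, B shifting numerator bits out the top and accumulating quotient bits at the bottom) follow the description above.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1

def divide_step(a, b, d):
    """One execution of the iterative divide instruction: two atomic
    long division steps. A is the partial remainder, B holds the
    remaining numerator bits and accumulates quotient bits; D is the
    divisor. Returns (NEXT A, NEXT B) as written to the X, Y outputs."""
    for _ in range(2):
        a = (a << 1) | (b >> (WIDTH - 1))  # shift next numerator bit into A
        b = (b << 1) & MASK
        if a >= d:
            a -= d        # divisor fits: quotient bit is 1
            b |= 1
    return a & MASK, b    # else the quotient bit stays 0 (restoring step)

def divide(numerator, divisor):
    """Complete division: eight sequential executions, with the X and Y
    outputs of each instruction feeding the A and B inputs of the next.
    Initially A is a string of sixteen zeroes and B is the numerator."""
    a, b = 0, numerator & MASK
    for _ in range(WIDTH // 2):
        a, b = divide_step(a, b, divisor)
    return b, a           # quotient from the Y output, remainder from X
```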
[0060] FIG. 6 shows an embodiment of a methodology performed by the
ALU unit described above. As observed in FIG. 6 the method includes
performing the following with an ALU unit of an image processor.
Executing a first instruction, the first instruction being a
multiply add instruction 601. Executing a second instruction
including performing parallel ALU operations with first and second
ALU logic units operating simultaneously to produce different
respective output resultants of the second instruction 602.
Executing a third instruction including performing sequential ALU
operations with one of the ALU logic units operating from an output
of the other of the ALU logic units to determine an output
resultant of the third instruction 603. Executing a fourth
instruction including performing an iterative divide operation in
which the first ALU logic unit and the second ALU logic unit
operate to determine first and second division resultant digit
values 604.
[0061] It is pertinent to point out that the various image
processor architecture features described above are not necessarily
limited to image processing in the traditional sense and therefore
may be applied to other applications that may (or may not) cause
the image processor to be re-characterized. For example, if any of
the various image processor architecture features described above
were to be used in the creation and/or generation and/or rendering
of animation as opposed to the processing of actual camera images,
the image processor may be characterized as a graphics processing
unit. Additionally, the image processor architectural features
described above may be applied to other technical applications such
as video processing, vision processing, image recognition and/or
machine learning. Applied in this manner, the image processor may
be integrated with (e.g., as a co-processor to) a more general
purpose processor (e.g., that is or is part of a CPU of a computing
system), or, may be a stand alone processor within a computing
system.
[0062] The hardware design embodiments discussed above may be
embodied within a semiconductor chip and/or as a description of a
circuit design for eventual targeting toward a semiconductor
manufacturing process. In the case of the latter, such circuit
descriptions may take the form of a (e.g., VHDL or Verilog)
register transfer level (RTL) circuit description, a gate level
circuit description, a transistor level circuit description or mask
description or various combinations thereof. Circuit descriptions
are typically embodied on a computer readable storage medium (such
as a CD-ROM or other type of storage technology).
[0063] From the preceding sections it is pertinent to recognize that
an image processor as described above may be embodied in hardware
on a computer system (e.g., as part of a handheld device's System
on Chip (SOC) that processes data from the handheld device's
camera). In cases where the image processor is embodied as a
hardware circuit, note that the image data that is processed by the
image processor may be received directly from a camera. Here, the
image processor may be part of a discrete camera, or, part of a
computing system having an integrated camera. In the case of the
latter, the image data may be received directly from the camera or
from the computing system's system memory (e.g., the camera sends
its image data to system memory rather than the image processor).
Note also that many of the features described in the preceding
sections may be applicable to a graphics processor unit (which
renders animation).
[0064] FIG. 7 provides an exemplary depiction of a computing
system. Many of the components of the computing system described
below are applicable to a computing system having an integrated
camera and associated image processor (e.g., a handheld device such
as a smartphone or tablet computer). Those of ordinary skill will
be able to easily delineate between the two.
[0065] As observed in FIG. 7, the basic computing system may
include a central processing unit 701 (which may include, e.g., a
plurality of general purpose processing cores 715_1 through 715_N
and a main memory controller 717 disposed on a multi-core processor
or applications processor), system memory 702, a display 703 (e.g.,
touchscreen, flat-panel), a local wired point-to-point link (e.g.,
USB) interface 704, various network I/O functions 705 (such as an
Ethernet interface and/or cellular modem subsystem), a wireless
local area network (e.g., WiFi) interface 706, a wireless
point-to-point link (e.g., Bluetooth) interface 707 and a Global
Positioning System interface 708, various sensors 709_1 through
709_N, one or more cameras 710, a battery 711, a power management
control unit 712, a speaker and microphone 713 and an audio
coder/decoder 714.
[0066] An applications processor or multi-core processor 750 may
include one or more general purpose processing cores 715 within its
CPU 701, one or more graphical processing units 716, a memory
management function 717 (e.g., a memory controller), an I/O control
function 718 and an image processing unit 719. The general purpose
processing cores 715 typically execute the operating system and
application software of the computing system. The graphics
processing units 716 typically execute graphics intensive functions
to, e.g., generate graphics information that is presented on the
display 703. The memory control function 717 interfaces with the
system memory 702 to write/read data to/from system memory 702. The
power management control unit 712 generally controls the power
consumption of the system 700.
[0067] The image processing unit 719 may be implemented according
to any of the image processing unit embodiments described at length
above in the preceding sections. Alternatively or in combination,
the IPU 719 may be coupled to either or both of the GPU 716 and CPU
701 as a co-processor thereof. Additionally, in various
embodiments, the GPU 716 may be implemented with any of the image
processor features described at length above.
[0068] Each of the touchscreen display 703, the communication
interfaces 704-707, the GPS interface 708, the sensors 709, the
camera 710, and the speaker/microphone codec 713, 714 all can be
viewed as various forms of I/O (input and/or output) relative to
the overall computing system including, where appropriate, an
integrated peripheral device as well (e.g., the one or more cameras
710). Depending on implementation, various ones of these I/O
components may be integrated on the applications
processor/multi-core processor 750 or may be located off the die or
outside the package of the applications processor/multi-core
processor 750.
[0069] In an embodiment, one or more of the cameras 710 includes a
depth camera capable of measuring depth between the camera and an object
in its field of view. Application software, operating system
software, device driver software and/or firmware executing on a
general purpose CPU core (or other functional block having an
instruction execution pipeline to execute program code) of an
applications processor or other processor may perform any of the
functions described above.
[0070] Embodiments of the invention may include various processes
as set forth above. The processes may be embodied in
machine-executable instructions. The instructions can be used to
cause a general-purpose or special-purpose processor to perform
certain processes. Alternatively, these processes may be performed
by specific hardware components that contain hardwired logic for
performing the processes, or by any combination of programmed
computer components and custom hardware components.
[0071] Elements of the present invention may also be provided as a
machine-readable medium for storing the machine-executable
instructions. The machine-readable medium may include, but is not
limited to, floppy diskettes, optical disks, CD-ROMs, and
magneto-optical disks, FLASH memory, ROMs, RAMs, EPROMs, EEPROMs,
magnetic or optical cards, propagation media or other type of
media/machine-readable medium suitable for storing electronic
instructions. For example, the present invention may be downloaded
as a computer program which may be transferred from a remote
computer (e.g., a server) to a requesting computer (e.g., a client)
by way of data signals embodied in a carrier wave or other
propagation medium via a communication link (e.g., a modem or
network connection).
[0072] In the foregoing specification, the invention has been
described with reference to specific exemplary embodiments thereof.
It will, however, be evident that various modifications and changes
may be made thereto without departing from the broader spirit and
scope of the invention as set forth in the appended claims. The
specification and drawings are, accordingly, to be regarded in an
illustrative rather than a restrictive sense.
* * * * *