U.S. patent application number 17/484423, filed September 24, 2021, was published by the patent office on 2022-01-13 for an apparatus, method, and computer-readable medium for activation function prediction in deep neural networks.
The invention is credited to Avishaii Abuhatzera, Gurpreet Singh Kalsi, Kamlesh Pillai, Sreenivas Subramoney, and Bharathwaj Suresh, who are also the listed applicants.
Application Number | 17/484423 |
Publication Number | 20220012571 |
Kind Code | A1 |
Filed Date | 2021-09-24 |
Publication Date | 2022-01-13 |
United States Patent Application
Pillai; Kamlesh; et al.
January 13, 2022
APPARATUS, METHOD, AND COMPUTER-READABLE MEDIUM FOR ACTIVATION
FUNCTION PREDICTION IN DEEP NEURAL NETWORKS
Abstract
Apparatuses and articles of manufacture are disclosed. An
example apparatus includes an activation function control and
decode circuitry to populate an input buffer circuitry with an
input data element bit subset of less than a threshold number of
bits of the input data element retrieved from the memory circuitry.
The activation function and control circuitry also populate a
kernel weight buffer circuitry with a weight data element bit
subset of less than the threshold number of bits of the weight data
element retrieved from the memory circuitry. The apparatus also
including a preprocessor circuitry to calculate a partial
convolution value of at least a portion of the input data element
bit subset and the weight data element bit subset to determine a
predicted sign of the partial convolution value.
Inventors: | Pillai; Kamlesh (Bangalore, IN); Kalsi; Gurpreet Singh (Bangalore, IN); Suresh; Bharathwaj (Bangalore, IN); Subramoney; Sreenivas (Bangalore, IN); Abuhatzera; Avishaii (Qiriat Shemona, IL) |

Applicant:

Name | City | State | Country | Type
Pillai; Kamlesh | Bangalore | | IN |
Kalsi; Gurpreet Singh | Bangalore | | IN |
Suresh; Bharathwaj | Bangalore | | IN |
Subramoney; Sreenivas | Bangalore | | IN |
Abuhatzera; Avishaii | Qiriat Shemona | | IL |
Appl. No.: |
17/484423 |
Filed: |
September 24, 2021 |
International Class: | G06N 3/04 20060101 G06N003/04; G06N 3/10 20060101 G06N003/10 |
Claims
1. An apparatus, comprising: processor circuitry including one or
more of: at least one of a central processing unit, a graphic
processing unit or a digital signal processor, the at least one of
the central processing unit, the graphic processing unit or the
digital signal processor having control circuitry to control data
movement within the processor circuitry, arithmetic and logic
circuitry to perform one or more first operations corresponding to
instructions, and one or more registers to store a result of the
one or more first operations, the instructions in the apparatus; a
Field Programmable Gate Array (FPGA), the FPGA including logic gate
circuitry, a plurality of configurable interconnections, and
storage circuitry, the logic gate circuitry and interconnections to
perform one or more second operations, the storage circuitry to
store a result of the one or more second operations; or an
Application Specific Integrated Circuitry (ASIC) including logic
gate circuitry to perform one or more third operations; the
processor circuitry to perform at least one of the one or more
first operations, the one or more second operations or the one or
more third operations to instantiate: an activation function
control and decode circuitry to populate an input buffer circuitry
with an input data element bit subset of less than a threshold
number of bits of an input data element retrieved from a memory
circuitry; and populate a kernel weight buffer circuitry with a
weight data element bit subset of less than the threshold number of
bits of a weight data element retrieved from the memory circuitry;
and a preprocessor circuitry to calculate a partial convolution
value of at least a portion of the input data element bit subset
and the weight data element bit subset to determine a predicted
sign of the partial convolution value; and send the predicted sign
of the partial convolution value to the activation function control
and decode circuitry.
2. The apparatus of claim 1, wherein the processor circuitry is to
further perform at least one of the one or more first operations,
the one or more second operations or the one or more third
operations to instantiate: the preprocessor circuitry to store the
partial convolution value in a data distribution circuitry in
response to the predicted sign of the partial convolution value
being non-negative; the activation function control and decode
circuitry to cause a remainder processing circuitry to calculate a
full convolution value of the input data element and the weight
data element in response to the predicted sign of the partial
convolution value being non-negative; and the remainder processing
circuitry to calculate the full convolution value from the partial
convolution value and a remaining subset of bits of the input data
and weight data not used to determine the predicted sign of the
partial convolution value, the partial convolution value retrieved
from the data distribution circuitry.
3. The apparatus of claim 2, wherein the partial convolution value
is a first partial convolution value and the portion of the input
data element bit subset and the weight data element bit subset is a
first portion of the input data element bit subset and the weight
data element bit subset, and wherein the processor circuitry is to
further perform at least one of the one or more first operations,
the one or more second operations or the one or more third
operations to instantiate: the preprocessor circuitry to calculate
at least a second partial convolution value of at least a second
portion of the input data element bit subset and the weight data
element bit subset.
4. The apparatus of claim 2, wherein the input data element is a
first input data element, and wherein the processor circuitry is to
further perform at least one of the one or more first operations,
the one or more second operations or the one or more third
operations to instantiate: the input buffer circuitry to include a
plurality of banks to store a plurality of input data elements
comprising an input data tile, the input data tile including the
first input data element.
5. The apparatus of claim 4, wherein the preprocessor circuitry is
a first preprocessor circuitry and the partial convolution value is
a first partial convolution value, and wherein the processor
circuitry is to further perform at least one of the one or more
first operations, the one or more second operations or the one or
more third operations to instantiate: a plurality of preprocessor
circuitries including the first preprocessor circuitry, wherein
each of the plurality of preprocessor circuitries to calculate at
least one of a plurality of partial convolution values, the
plurality of partial convolution values calculated from at least a
portion of each of the plurality of input data elements in the
input data tile.
6. The apparatus of claim 2, wherein the input data is a first
input data, and wherein the processor circuitry is to further
perform at least one of the one or more first operations, the one
or more second operations or the one or more third operations to
instantiate: the preprocessor circuitry to calculate a second
partial convolution value of a second input data and the weight
data while the remainder processing circuitry calculates the full
convolution value of the first input data and the weight data.
7. The apparatus of claim 1, wherein the activation function is a
rectified linear unit (ReLU) function.
8. The apparatus of claim 1, wherein the input data and the weight
data are a 32-bit floating point data type.
9. The apparatus of claim 8, wherein the processor circuitry is to
further perform at least one of the one or more first operations,
the one or more second operations or the one or more third
operations to instantiate: the preprocessor circuitry to calculate
the partial convolution value using a sign bit and one or more
exponent bits of the input data and the weight data.
10. The apparatus of claim 8, wherein the processor circuitry is to
further perform at least one of the one or more first operations,
the one or more second operations or the one or more third
operations to instantiate: the preprocessor circuitry to calculate
the partial convolution value using a sign bit, one or more
exponent bits, and one or more upper mantissa bits of the input
data and the weight data.
11. The apparatus of claim 8, wherein the processor circuitry is to
further perform at least one of the one or more first operations,
the one or more second operations or the one or more third
operations to instantiate: the activation function control and
decode circuitry to arrange the input data and the weight data in
the memory circuitry separately into a sign bit group, an exponent
bits group, an upper mantissa bits group, and a lower mantissa bits
group.
12. A non-transitory computer-readable storage medium comprising
instructions that, when executed, cause one or more processors of a
machine to at least: populate an input buffer circuitry with an
input data element bit subset of less than a threshold number of
bits of an input data element retrieved from a memory
circuitry; populate a kernel weight buffer circuitry with a weight
data element bit subset of less than the threshold number of
bits of a weight data element retrieved from the memory
circuitry; calculate a partial convolution value of at least a
portion of the input data element bit subset and the weight data
element bit subset to determine a predicted sign of the partial
convolution value; and send the predicted sign of the partial
convolution value to an activation function control and decode
circuitry.
13. The non-transitory computer-readable storage medium of claim
12, wherein the instructions, when executed, cause the one or more
processors of the machine to at least: store the partial
convolution value in a data distribution circuitry in response to
the predicted sign of the partial convolution value being
non-negative; calculate a full convolution value of the input data
element and the weight data element in response to the predicted
sign of the partial convolution value being non-negative; and
calculate the full convolution value from the partial convolution
value and a remaining subset of bits of the input data and weight
data not used to determine the predicted sign of the partial
convolution value, the partial convolution value retrieved from the
data distribution circuitry.
14. The non-transitory computer-readable storage medium of claim
13, wherein the partial convolution value is a first partial
convolution value and the portion of the input data element bit
subset and the weight data element bit subset is a first portion of
the input data element bit subset and the weight data element bit
subset, wherein the instructions, when executed, cause the one or
more processors of the machine to: calculate at least a second
partial convolution value of at least a second portion of the input
data element bit subset and the weight data element bit subset.
15. The non-transitory computer-readable storage medium of claim
13, wherein the input data element is a first input data element,
and wherein the instructions, when executed, cause the one or more
processors of the machine to: store a plurality of input data
elements comprising an input data tile, the input data tile
including the first input data element.
16. The non-transitory computer-readable storage medium of claim
15, wherein the partial convolution value is a first partial
convolution value, and wherein the instructions, when executed,
cause the one or more processors of the machine to: calculate at
least one of a plurality of partial convolution values, the
plurality of partial convolution values calculated from at least a
portion of each of the plurality of input data elements in the
input data tile.
17. The non-transitory computer-readable storage medium of claim
13, wherein the input data is a first input data, and wherein the
instructions, when executed, cause the one or more processors of
the machine to: calculate a second partial convolution value of a
second input data and the weight data in parallel to calculating
the full convolution value of the first input data and the weight
data.
18. The non-transitory computer-readable storage medium of claim
12, wherein the activation function is a rectified linear unit
activation function, wherein the input data and the weight data are
a 32-bit floating point data type.
19. The non-transitory computer-readable storage medium of claim
18, wherein the instructions, when executed, cause the one or more
processors of the machine to: calculate the partial convolution
value using a sign bit and one or more exponent bits of the input
data and the weight data.
20. The non-transitory computer-readable storage medium of claim
18, wherein the instructions, when executed, cause the one or more
processors of the machine to: calculate the partial convolution
value using a sign bit, one or more exponent bits, and one or more
upper mantissa bits of the input data and the weight data.
21. The non-transitory computer-readable storage medium of claim
18, wherein the instructions, when executed, cause the one or more
processors of the machine to: arrange the input data and the weight
data in the memory circuitry separately into a sign bit group, an
exponent bits group, an upper mantissa bits group, and a lower
mantissa bits group.
22. An apparatus comprising: means for populating an input buffer
circuitry with an input data element bit subset of less than a
threshold number of bits of an input data element retrieved
from a memory circuitry; means for populating a kernel weight
buffer circuitry with a weight data element bit subset of less than
the threshold number of bits of a weight data element
retrieved from the memory circuitry; means for calculating a
partial convolution value of at least a portion of the input data
element bit subset and the weight data element bit subset to
determine a predicted sign of the partial convolution value; and
means for sending the predicted sign of the partial convolution
value to an activation function control and decode circuitry.
23. The apparatus of claim 22, further comprising: means for
storing the partial convolution value in a data distribution
circuitry in response to the predicted sign of the partial
convolution value being non-negative; means for calculating a full
convolution value of the input data element and the weight data
element in response to the predicted sign of the partial
convolution value being non-negative; and means for calculating the
full convolution value from the partial convolution value and a
remaining subset of bits of the input data and weight data not used
to determine the predicted sign of the partial convolution value,
the partial convolution value retrieved from the data distribution
circuitry.
24. The apparatus of claim 23, wherein the partial convolution
value is a first partial convolution value and the portion of the
input data element bit subset and the weight data element bit
subset is a first portion of the input data element bit subset and
the weight data element bit subset, further comprising: means for
calculating at least a second partial convolution value of at least
a second portion of the input data element bit subset and the
weight data element bit subset.
25. The apparatus of claim 23, wherein the input data element is
a first input data element,
and further comprising: means for storing a plurality of input data
elements comprising an input data tile, the input data tile
including the first input data element.
Description
FIELD OF THE INVENTION
[0001] The invention relates to artificial neural networks. More
specifically, the invention relates to predicting the sign of an
activation function in an artificial neural network.
BACKGROUND
[0002] Artificial neural networks, such as convolutional neural
networks (CNNs), are utilized for many tasks. Among those tasks are
learning to accurately make predictions. For example, a CNN can
receive a large amount of image data and learn, through machine
learning (ML), to classify content in images.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] FIG. 1 is a schematic illustration of an example system
architecture that predicts the sign of an activation function
result.
[0004] FIG. 2 illustrates an example arrangement of rearranged
single-precision floating-point format (FP32) input and weight data
in L1 memory.
[0005] FIG. 3 is a flowchart representative of example machine
readable instructions that may be executed by example processor
circuitry to implement a prediction of the sign for a (rectified
linear unit) ReLU activation function with partial data.
[0006] FIG. 4 is another flowchart representative of example
machine readable instructions that may be executed by example
processor circuitry to implement a prediction of the sign for the
ReLU activation function with partial data.
[0007] FIG. 5 illustrates an example of the layout of a memory
storing the data described in the discussion related to the
flowchart of FIG. 4.
[0008] FIG. 6A illustrates an example number format of an FP32 data
type used for predicting a ReLU activation function result in a
CNN.
[0009] FIG. 6B illustrates an example region of interest where a
reduced precision of an FP32 input value and weight value used to
calculate a partial convolution value may cause a prediction error
of a ReLU activation function result.
[0010] FIG. 7 is a block diagram of an example processor platform
700 structured to execute and/or instantiate the machine readable
instructions and/or operations of FIGS. 3 through 5 to implement
the apparatus of FIG. 1.
[0011] FIG. 8 is a block diagram of an example implementation of
the processor circuitry 712 of FIG. 7.
[0012] FIG. 9 is a block diagram of another example implementation
of the processor circuitry 712 of FIG. 7.
[0013] FIG. 10A illustrates an example distribution graph of ReLU
zero results across all layers (i.e., nodes) of the ResNet-50 model
when run through an ImageNet dataset.
[0014] FIG. 10B-10D illustrate samples of the accuracy of the
predicted negative result on a sample of three different
convolution layers in the ResNet-50 model across a scale of
mantissa bits used in the prediction.
[0015] FIG. 11A illustrates an example distribution graph of ReLU
zero results across all layers (i.e., nodes) of the VGG-16 model
when run through the ImageNet dataset.
[0016] FIG. 11B-11D illustrate samples of the accuracy of the
predicted negative result on a sample of three different
convolution layers in the VGG-16 model across a scale of mantissa
bits used in the prediction.
[0017] The figures are not to scale. Instead, the thickness of the
layers or regions may be enlarged in the drawings. In general, the
same reference numbers will be used throughout the drawing(s) and
accompanying written description to refer to the same or like
parts.
[0018] Unless specifically stated otherwise, descriptors such as
"first," "second," "third," etc., are used herein without imputing
or otherwise indicating any meaning of priority, physical order,
arrangement in a list, and/or ordering in any way, but are merely
used as labels and/or arbitrary names to distinguish elements for
ease of understanding the disclosed examples. In some examples, the
descriptor "first" may be used to refer to an element in the
detailed description, while the same element may be referred to in
a claim with a different descriptor such as "second" or "third." In
such instances, it should be understood that such descriptors are
used merely for identifying those elements distinctly that might,
for example, otherwise share a same name.
[0019] As used herein, the phrase "in communication," including
variations thereof, encompasses direct communication and/or
indirect communication through one or more intermediary components,
and does not require direct physical (e.g., wired) communication
and/or constant communication, but rather additionally includes
selective communication at periodic intervals, scheduled intervals,
aperiodic intervals, and/or one-time events. As used herein,
"processor circuitry" is defined to include (i) one or more special
purpose electrical circuits structured to perform specific
operation(s) and including one or more semiconductor-based logic
devices (e.g., electrical hardware implemented by one or more
transistors), and/or (ii) one or more general purpose
semiconductor-based electrical circuits programmed with
instructions to perform specific operations and including one or
more semiconductor-based logic devices (e.g., electrical hardware
implemented by one or more transistors). Examples of processor
circuitry include programmed microprocessors, Field Programmable
Gate Arrays (FPGAs) that may instantiate instructions, Central
Processor Units (CPUs), Graphics Processor Units (GPUs), Digital
Signal Processors (DSPs), XPUs, or microcontrollers and integrated
circuits such as Application Specific Integrated Circuits (ASICs).
For example, an XPU may be implemented by a heterogeneous computing
system including multiple types of processor circuitry (e.g., one
or more FPGAs, one or more CPUs, one or more GPUs, one or more
DSPs, etc., and/or a combination thereof) and application
programming interface(s) (API(s)) that may assign computing task(s)
to whichever one(s) of the multiple types of the processing
circuitry is/are best suited to execute the computing task(s).
DETAILED DESCRIPTION
[0020] Artificial neural networks, such as convolutional neural
networks (CNNs), are utilized for many tasks. Among those tasks is
learning to accurately make predictions. For example, a CNN can
receive a large amount of image data and learn, through machine
learning (ML), to classify content in images. In a CNN, the
processes of image recognition and image classification commonly
utilize a rectified linear unit (ReLU) as an activation function in
practice. For a given node (also referred to as a layer) in a CNN,
when fitting input data for recognition or classification, the ReLU
activation function calculates the convolution of the input data
with weight and bias parameter values. Whether these values are
floating point, fixed point, or integer based, there is an overhead
associated with such calculations. In a complex neural network that
has a large number of nodes, the overhead will increase. Some of
this overhead is wasted because any ReLU calculation result that
returns a negative value is thrown out and never contributes to the
CNN's output.
[0021] FIG. 1 is a schematic illustration of an example system
architecture that predicts the sign of an activation function
result.
[0022] In some examples, input data, weight data, and bias data
utilized in a CNN are in a 32-bit floating point (FP32) data type
format. The FP32 data type format includes a sign bit (bit [31]), a
set of exponent bits (bits [30:23]), and a set of mantissa bits
(bits [22:0]). In other examples, one or more other data types may
be utilized, such as fixed point or 8-bit integer data types, among
others. The examples described below will largely be utilizing
FP32, but any one or more other data types might be utilized in
practice (e.g., double precision floating point (FP64), 8-bit
integer, 16-bit integer, 32-bit integer, 64-bit integer, etc.). See
FIG. 6A and the corresponding discussion involving FIG. 6A below
for a more detailed review of an example of the FP32 number
format.
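For readers unfamiliar with the bit layout, the FP32 field boundaries described above can be illustrated with a short Python sketch (the helper name `fp32_fields` is ours, not part of the disclosure):

```python
import struct

def fp32_fields(x: float):
    """Slice an FP32 value into its sign, exponent, and mantissa fields."""
    # Reinterpret the float as its raw 32-bit IEEE-754 pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31               # bit [31]
    exponent = (bits >> 23) & 0xFF  # bits [30:23], biased by 127
    mantissa = bits & 0x7FFFFF      # bits [22:0]
    return sign, exponent, mantissa

# 1.0 is stored as sign 0, biased exponent 127, mantissa 0.
print(fp32_fields(1.0))   # (0, 127, 0)
print(fp32_fields(-2.0))  # (1, 128, 0)
```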
[0023] Typical CNNs utilize an activation function per node to map
the input data to a series of weights and biases for image training
and/or classification purposes. One of the most common activation
functions in practice is the ReLU activation function. The examples
described below will largely be utilizing the ReLU function for
ease of explanation. In other examples, other activation functions
that have similar behaviors to the ReLU function may be implemented
in addition to or in place of the ReLU function (e.g., the leaky
ReLU function) in some or all of the CNN nodes that use an
activation function.
[0024] In some examples, the ReLU function consumes the output of a
convolution layer in a CNN. The ReLU function clamps all the
negative output values to zero (i.e., all the operations performed
during the convolution layer resulting in negative values are
neutralized/discarded). Although the ReLU function is efficient
from a storage perspective because calculated convolution values
with negative results are thrown out, there are still
inefficiencies. For example, since the ReLU function throws out
negative value results, a significant volume of convolution
calculations ends up never being used further.
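The clamping behavior described above reduces, in software terms, to a max-with-zero; a minimal Python sketch:

```python
def relu(x: float) -> float:
    # Negative convolution outputs are clamped to zero; positive
    # outputs pass through unchanged.
    return x if x > 0.0 else 0.0

conv_outputs = [-1.5, 0.0, 2.25, -0.75, 3.0]
activations = [relu(v) for v in conv_outputs]
print(activations)  # [0.0, 0.0, 2.25, 0.0, 3.0]
```

Note that the work spent computing `-1.5` and `-0.75` is discarded entirely, which is the inefficiency the disclosed prediction scheme targets.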
[0025] If the result of each convolution calculation were able to
be accurately predicted, the processing circuitry calculating the
convolutions could be instructed to ignore calculations that end up
as negative values. Thus, one purpose of predicting a sign (i.e.,
positive or negative) of a convolution result is to allow the
hardware accelerator(s) performing the calculations to discontinue
further calculations on input values that will have a negative ReLU
result.
[0026] The hardware accelerator(s) process image data (and/or other
data) layer by layer through the CNN in a tiled fashion. A tile is
herein defined as a group of elements, each of which is a portion
of the tile. For example, data from an image may be segmented into
a series of 4×4 blocks of pixels, which also may be referred
to as a 4×4 tile of (pixel) data elements. In some examples,
each element is a base input data building block with which larger
structures may be grouped, such as tiles. In some examples,
hardware accelerators process data through a CNN in a tiled manner
because each element in the tile is not dependent upon any
calculated results of the other elements.
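As a rough pure-Python illustration of the tiling just described (the function name is ours), a 2-D grid of pixels can be segmented into independent non-overlapping tiles:

```python
def split_into_tiles(image, tile=4):
    """Split a 2-D grid (list of equal-length rows) into tile x tile blocks."""
    tiles = []
    for r in range(0, len(image), tile):
        for c in range(0, len(image[0]), tile):
            # Each tile is an independent group of elements, so tiles can be
            # dispatched to separate processing elements in parallel.
            tiles.append([row[c:c + tile] for row in image[r:r + tile]])
    return tiles

image = [[r * 8 + c for c in range(8)] for r in range(8)]  # an 8x8 "image"
tiles = split_into_tiles(image)
print(len(tiles), len(tiles[0]), len(tiles[0][0]))  # 4 4 4
```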
[0027] In the illustrated example in FIG. 1, a series of processing
element array circuitries (100A, 100B, 100C) are present. In some
examples, more processing element array circuitries are present.
Although three processing element array circuitries are shown for
the sake of simplicity in the discussion, many hardware
accelerators are massively parallel and may have hundreds or more
processing element array circuitries. The example processing
element array circuitries 100A-100C are generally arranged in one
or more systolic arrays of multiply-accumulate (MAC) blocks to
increase performance and area efficiency. In some examples, there
may be other blocks in addition to MAC blocks utilized to perform
other types of calculations needed for nodes in the processing
element array circuitries 100A-100C.
[0028] In some examples, circuitry comprising tile processing logic
encapsulated in box 118 of FIG. 1 calculates input and weight
values across each of the elements of a tile for each convolution
node. The output of each convolution node includes a series of
calculations utilizing input data and weight data processed by tile
processing logic 118. The input data is defined herein as the data
input into the CNN. For example, an image might be input into the
CNN for the purpose of training the CNN or for the purpose of
classifying the image once the CNN has been trained. The weight
data is defined herein as a weighted value created through training
the CNN (e.g., through backpropagation) and utilized as part of a
connection between two given nodes. The weight data, when applied
through a series of calculations to an input data from the previous
node (or from the starting node), fits the input data to the model
in the CNN.
[0029] In the illustrated example in FIG. 1, logic
blocks/circuitries at least within tile processing logic 118 are
utilized to perform at least an activation function computation in
one or more CNN nodes. In some examples, the activation function is
a ReLU function (or a similar function to ReLU). Thus, the logic
block/circuitries in FIG. 1 will throw away negative results.
[0030] In some examples, for tile based FP32 operations at the
nodes of a CNN, the output of each convolution node can be
predicted by performing a partial FP32 calculation instead of
performing a full FP32 calculation. More specifically, for a given
example node that performs a ReLU function (or another activation
function similar to ReLU), a partial FP32 calculation on the input
data and the weight data in certain circumstances can lead to an
accurate prediction of the sign (i.e., positive or negative) of the
result. For a function like ReLU, predicting the sign of the result
can lead to a more efficient flow of calculations of the tile of
input data because all predicted negative results allow for
discontinuing any remaining FP32 calculations.
[0031] For FP32 data type calculations, each example input data
value and weight data value can be divided into two distinct
groups/segments of bits (e.g., two subsets of the 32-bit total). In
some examples, a first group includes sign bit (600 in FIG. 6A),
the exponent bits (602 in FIG. 6A), and a set of upper mantissa
bits (604 in FIG. 6A). And a second group includes a set of lower
mantissa bits (606 in FIG. 6A). In some examples, calculations
involving the first group of FP32 bits will be handled by the
preprocessor circuitry 102A-102C and calculations involving the
second group of FP32 bits will be handled by remainder processing
circuitry 104A-104C.
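One way to picture the two-group split is the following sketch, assuming the example division of 4 upper and 19 lower mantissa bits discussed below (the constant and helper name are our illustration, not the disclosure's):

```python
import struct

LOWER_MANTISSA_BITS = 19  # example split: 4 upper mantissa bits, 19 lower

def split_bit_groups(x: float):
    """Return (group1, group2): bits [31:19] and bits [18:0] of an FP32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    group1 = bits >> LOWER_MANTISSA_BITS              # sign + exponent + upper mantissa
    group2 = bits & ((1 << LOWER_MANTISSA_BITS) - 1)  # lower mantissa
    return group1, group2

# 1.0 = 0x3F800000: group1 packs sign 0, exponent 127, upper mantissa 0.
print(split_bit_groups(1.0))  # (2032, 0)
```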
[0032] In some examples, the size of a tile of the input data may
be utilized to help determine an efficient division of mantissa
bits that make up the upper mantissa bits vs. the mantissa bits
that make up the lower mantissa bits. An example mathematical
proof to determine an efficient division of mantissa bits is
described below following the description of FIG. 6B. In one
example, the upper mantissa consists of 4 bits and the lower
mantissa consists of 19 bits (i.e., the dividing line between the
upper mantissa and the lower mantissa is between bits 18 and 19 in
an FP32 number format). In other examples, the dividing line may be
between higher or lower bits than bits 18 and 19.
[0033] While the examples described largely utilize a mantissa
separated into two sections (an upper mantissa and a lower
mantissa), it should be appreciated that in other examples the
mantissa could be split into additional sections, such as in three
sections (a lower mantissa, a middle mantissa section, and an upper
mantissa section) or more.
[0034] In the illustrated example in FIG. 1, the processing element
array circuitries 100A-100C include preprocessor circuitry (102A,
102B, and 102C, respectively) and remainder processing circuitry
(104A, 104B, and 104C, respectively). In some examples, for each
processing element array circuitry 100A-100C, the systolic array(s)
of MAC blocks in the circuitry are separated into two groups, a
group of MAC blocks defined as the preprocessor circuitry 102A-102C
and a group of MAC blocks defined as the remainder processing
circuitries 104A-104C. In some examples, the number of MAC blocks
assigned to each preprocessor circuitry 102A-102C and the number of
MAC blocks assigned to each remainder processing circuitry
104A-104C can be adjusted depending on the need of the input data
workload.
[0035] In some examples, the preprocessor circuitry 102A-102C
calculates a partial convolution of the data using the first subset
of FP32 bits for each of the input data elements and weight data
elements at a given node. More specifically, in some examples, the
following preprocessing operations are performed on the first
subset of FP32 bits of the input data and the weight data by
preprocessor circuitry 102A-102C:
[0036] 1) XOR of sign bit
[0037] 2) Perform multiplication on exponent bits (i.e., addition
of exponents)
[0038] 3) Perform multiplication on upper mantissa bits
[0039] Performing this set of operations on the first group of bits
is herein referred to as calculating a partial convolution value
(using the input data and weight data to do so). The value is a
partial convolution because only a subset of FP32 bits that make up
an input value and a weight value are used. Thus, in some examples,
using the sign bit, the 8-bit exponent, and a 4-bit upper mantissa
(bits [31:19]) from each of the input data and weight data values,
the preprocessor circuitry 102A-102C calculates the partial
convolution value. The result of the calculation will produce a
value that can be positive or negative (or zero), herein referred
to as the predicted sign. In some examples, the preprocessor
circuitry 102A-102C can then send the predicted sign to control and
decode circuitry 106.
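The three preprocessing operations above may be sketched in software as follows. Rather than modeling the XOR/add/multiply datapath bit by bit, this illustrative sketch zeroes the lower 19 mantissa bits of each operand and lets ordinary floating-point multiplication produce the same partial products; the function names are assumptions for illustration, not part of the described apparatus.

```python
import struct

def truncate_fp32(value: float, upper_mantissa_bits: int = 4) -> float:
    """Keep the sign, exponent, and top `upper_mantissa_bits` mantissa
    bits, zeroing the lower mantissa bits the preprocessor ignores."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    mask = 0xFFFFFFFF ^ ((1 << (23 - upper_mantissa_bits)) - 1)
    return struct.unpack(">f", struct.pack(">I", bits & mask))[0]

def predicted_sign(inputs, weights) -> int:
    """Partial convolution over truncated operands; returns -1, 0, or +1
    as the predicted sign of the partial convolution value."""
    partial = sum(truncate_fp32(x) * truncate_fp32(w)
                  for x, w in zip(inputs, weights))
    return (partial > 0) - (partial < 0)
```

A non-negative return corresponds to the case where, under a ReLU-style activation function, the remaining lower-mantissa work would proceed.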
[0040] In some example versions of a ReLU activation function or
another similar function, the convolution data results are utilized
for subsequent nodes in the CNN only if the result for a given node
is positive. In other example versions of a ReLU or similar
activation function, a zero result may also be utilized; thus, in
those versions the CNN nodes send the convolution results to
subsequent nodes as long as the results are non-negative. Either
version can be utilized for this process, but for simplicity the
examples focus on a non-negative convolution result being utilized.
[0041] In some examples, the predicted sign (also herein referred
to as a sign indicator) may be a flag register, a designated bit in
a hardware or software register, a communication packet, or any
other type of signal meant to communicate a piece of information
(e.g., information designating that the calculated partial
convolution value is positive or negative). The sign information is
referred to as "predicted" instead of known because the reduced
number of mantissa bits utilized in the calculation introduces a
certain amount of variability/error vs. the true/ideal value
calculation utilizing all FP32 bits.
[0042] In some examples, the control and decode circuitry 106 (also
referred to herein as the control 106) has logic that controls the
flow of much of the system illustrated in FIG. 1. In some examples,
the control 106 and the processing element array circuitries
100A-100C are each one or more hardware blocks of circuits in a
graphics processing unit (GPU). In other examples, the control 106
and the processing element array circuitries 100A-100C are one or
more blocks of circuits in an accelerator chip designed for
artificial neural networks and/or other artificial intelligence
applications. In yet other examples, the control 106 and the
processing element array circuitries 100A-100C are one or more
blocks of circuits in other hardware such as circuits in a central
processing unit (CPU), in a memory controller, in an I/O
controller, in a field programmable gate array (FPGA) chip, or in
any other possible hardware circuitry where these circuits could be
applicable. In yet other examples, the control 106 and the
processing element array circuitries 100A-100C are implemented
virtually in a software environment and the software environment is
then run on one or more computer systems, such as mobile devices,
laptops, desktops, workstations, and/or servers.
[0043] In the illustrated example in FIG. 1, the control 106
includes logic that loads/populates data into and fetches data from
one or more memory circuitries, such as the L1 memory circuitry 108
and the higher level memory circuitry 110. In some examples, the L1
memory circuitry 108 is on the same die as the control 106 and
processing element array circuitries 100A-100C. In other examples,
the L1 memory circuitry 108 is on an adjacent die in the same
semiconductor package as the control 106 and processing element
array circuitries 100A-100C. In some examples, the higher level
memory circuitry 110 is on an adjacent die in the same
semiconductor package as the control 106 and processing element
array circuitries 100A-100C. In other examples, the higher level
memory circuitry 110 is in a discrete package/location from the
control 106 and processing element array circuitries 100A-100C
(e.g., such as part of discrete SDRAM memory substrates plugged
into a motherboard's memory slot(s)).
[0044] In some examples, the control 106 includes logic to fetch at
least input data and weight data from the higher level memory
circuitry 110. As described above, in some examples, the input data
and weight data that is fetched is in the FP32 format. Once the
input data and weight data have been fetched, they can be stored
into the L1 memory circuitry 108. In some examples, the control 106
performs and/or triggers a process to rearrange the FP32 data
format into the portions that will be operated on independently.
The control 106 then stores/loads the example rearranged data in L1
memory circuitry 108.
[0045] FIG. 2 illustrates an example arrangement of rearranged FP32
input and weight data in L1 memory 108. According to the
illustrated example, the higher level memory 110 has at least a
tile of FP32 format data (200 in FIG. 2). In some examples, the
control (106 in FIG. 1) takes each 32-bit floating point value and
separates it into four portions (i.e., four subsets of the total 32
bits): the 1-bit sign portion, the 8-bit exponent portion, and the
23-bit mantissa portion (which is split into an upper mantissa
portion and a lower mantissa portion). In some examples, these four
portions can be grouped across elements of a tile. For example, if
a tile is made up of a 4×4 set of FP32 elements, then the
control 106 stores 16 portions of each group of data into a
specified memory area in the L1 memory circuitry 108.
[0046] In the illustrated example in FIG. 2, the control 106 stores
16 subsets of 1-bit signs in an all sign bits location 202 (e.g., a
sign bit group of data) of L1 memory circuitry 108, 16 subsets of
8-bit exponents in an all exponent bits location 204 (e.g., an
exponent bits group of data) of L1 memory circuitry 108, 16 subsets
of upper mantissa bits in an all upper mantissa bits location 206
(e.g., an upper mantissa bits group of data) of L1 memory circuitry
108, and 16 subsets of lower mantissa bits in an all lower mantissa
bits location 208 (e.g., a lower mantissa bits group of data) of L1
memory circuitry 108. In some examples, the 16 FP32 elements that
make up the 4×4 tile represent 16 pixels of
an image or 16 of any defined basic block that makes up a larger
set of input data fetched from higher level memory circuitry 110
(e.g., for pixels, the larger set of input data may be an entire
image).
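As a hypothetical software analogue of FIG. 2's arrangement, the rearrangement of a 4×4 tile into the four grouped locations 202-208 may be sketched as follows (the function name and list representation are assumptions for illustration):

```python
import struct

def rearrange_tile(tile):
    """Split a flat sequence of 16 FP32 values (a 4x4 tile) into four
    grouped regions mirroring FIG. 2: all sign bits, all exponent
    fields, all 4-bit upper mantissas, and all 19-bit lower mantissas."""
    signs, exponents, uppers, lowers = [], [], [], []
    for value in tile:
        bits = struct.unpack(">I", struct.pack(">f", value))[0]
        signs.append(bits >> 31)               # location 202: sign bits
        exponents.append((bits >> 23) & 0xFF)  # location 204: exponents
        uppers.append((bits >> 19) & 0xF)      # location 206: upper mantissas
        lowers.append(bits & 0x7FFFF)          # location 208: lower mantissas
    return signs, exponents, uppers, lowers
```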
[0047] Returning to the illustrated example in FIG. 1, the system
includes an input buffer circuitry (IBC) 112 and a kernel weight
buffer circuitry (KWBC) 114. In some examples, the IBC 112 and the
KWBC 114 are portions of a memory in the system in FIG. 1. For
example, the IBC 112 and the KWBC 114 may be portions of the L1
memory circuitry 108 that have been dynamically allocated as
buffers by the control 106. In other examples, the IBC 112 and KWBC
114 are specialized memory storage on or near the control 106 and
the processing element array circuitry 100A-100C chip(s) designated
for artificial neural network matrix math operations. In yet other
examples, the IBC 112 and the KWBC 114 may be any other form of
memory storage capable of storing input data and weight data that
are accessible by other circuitry in the system in FIG. 1. In some
examples, the IBC 112 includes multiple banks of storage to
store several elements, tiles, and/or images simultaneously.
[0048] In some examples, the control 106 loads the IBC 112 and the
KWBC 114 with input data and weight data, respectively, retrieved
from the L1 memory circuitry 108. In some examples, the control 106
initially loads a subset of input data and weight data associated
with the sign bit, the exponent bits, and the upper mantissa bits
into the IBC 112 and the KWBC 114, respectively (e.g., the first
three groupings of bits associated with the rearranged FP32 input
data). In some examples, during a single data load into the IBC 112
and the KWBC 114, the amount of data loaded includes the three
groupings of bits associated with all the elements of a tile of
data. In other examples, during a single data load into the IBC 112
and the KWBC 114, the amount of data loaded includes the three
groupings of bits associated with a single element of a tile. In
yet other examples, during a single data load into the IBC 112 and
the KWBC 114, the amount of data loaded includes the three
groupings of bits associated with more than one tile, which may be
up to and including loading all tiles of an image.
[0049] In some examples, the weight buffer information may not need
to be updated once the CNN is trained. Thus, in some examples, the
weight data for all four groupings of bits associated with the FP32
rearranged data is loaded once into the KWBC 114 at the beginning
of the process for a tile and may be utilized across a series of
partial convolution calculations involving multiple input data
elements across one or more tiles (e.g., potentially for an entire
image of input data calculations).
[0050] In the illustrated example of FIG. 1, once all relevant data
from at least the first three groupings of bits have been loaded
into the IBC 112 and the KWBC 114, the control 106 triggers the
preprocessor circuitries 102A-102C to begin calculating the partial
convolution value (e.g., the series of three preprocessing
operations described above) for each element in the input data. For
example, for a given node in the CNN, preprocessor circuitry 102A
performs the three preprocessor calculations (i.e., XOR the sign
bit, add the exponent bits, and multiply the upper mantissa bits)
using a first element of input data and the weight data associated
with the given node. In some examples, the partial convolution
value may be calculated across all elements in a given tile in
parallel utilizing a group of the preprocessor circuitries
102A-102C.
[0051] In some examples, the control 106 includes logic that can
receive indicators of certain conditions and act on those
conditions (e.g., the control 106 can trigger processes to occur in
other logic blocks in FIG. 1).
[0052] In the illustrated example in FIG. 1, the control 106
receives an indicator of a predicted sign from one or more of the
preprocessor circuitries 102A-102C. As described above, the
predicted sign is determined from one or more of the preprocessor
circuitries 102A-102C calculating a partial convolution result
using a partial set of bits of the input data and weight data
retrieved from the IBC 112 and the KWBC 114.
[0053] In some examples, the preprocessor circuitries 102A-102C
store the partial convolution result value in a data distribution
circuitry (DDC) 116. In some examples, the partial convolution
result value is stored in the DDC 116 only if the predicted sign is
determined to be non-negative. In some examples, the DDC 116 is a
portion of a memory in the system in FIG. 1. For example, the DDC
116 may be a portion of the L1 memory circuitry 108 that has been
dynamically allocated as a buffer by the control 106. In other
examples, the DDC 116 is a specialized memory storage on or near
the control 106 and the processing element array circuitry
100A-100C chip(s) designated for artificial neural network matrix
math operations. In yet other examples, the DDC 116 may be any
other form of memory storage capable of storing results data that
are accessible by other circuitry in the system in FIG. 1. In some
examples, the preprocessor circuitries 102A-102C additionally
include logic circuitry with store/load functionality to directly
store the data in the DDC 116. In other
examples, the control 106 performs the store of the partial
convolution results data to the DDC 116.
[0054] Using the ReLU activation function as the example, if the
predicted sign indicator (determined/calculated by the preprocessor
circuitries 102A-102C and sent to the control 106) is non-negative,
then the control 106 performs one or more resulting functions. In
some examples, the control 106 will trigger (e.g., cause through
some form of indicator/communication) one or more of the remainder
processing circuitries 104A-104C to calculate the remaining portion
of the convolution value using the remaining bits of the input data
and weight data that were not calculated by the one or more
preprocessor circuitries 102A-102C. For example, if the
preprocessor circuitries 102A-102C calculated the partial
convolution value from the sign bit, the 8-bit exponent, and a
4-bit upper mantissa (e.g., the most significant 13 bits total of
the original FP32 operand), then the remainder processing
circuitries 104A-104C calculates the convolution value of the
19-bit lower mantissa.
[0055] The example remainder processing circuitries 104A-104C
combine the result of the 19-bit lower mantissa calculation with the
partial convolution result of the most significant 13 bits stored in
the DDC 116 to create a full convolution value. In the illustrated
example in FIG. 1, the calculated full convolution value (i.e., the
combined result from the upper 13-bit calculation and the lower
19-bit calculation) is stored in the DDC 116. In some examples, the
calculated full convolution value, or at least a portion of the
value, is then loaded into the IBC 112 to allow the processing
element array circuitries 100A-100C to calculate a next partial
convolution value for a next node in the CNN (using a next weight
data for the next node from the KWBC 114).
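The two-pass combination described above may be illustrated with the following sketch, in which the remainder pass folds the lower-mantissa contribution into the stored partial value. The decomposition shown (full product minus truncated product) is an illustrative assumption about how the remaining terms may be accumulated, not a statement of the actual datapath:

```python
import struct

def truncate_fp32(value: float, upper_mantissa_bits: int = 4) -> float:
    """Zero the 19 lower mantissa bits, keeping sign/exponent/upper bits."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    mask = 0xFFFFFFFF ^ ((1 << (23 - upper_mantissa_bits)) - 1)
    return struct.unpack(">f", struct.pack(">I", bits & mask))[0]

def full_convolution(inputs, weights) -> float:
    """First pass: partial convolution from the most significant 13 bits
    of each operand. Second pass: remainder terms involving the discarded
    lower mantissa bits. The combined result is the full convolution."""
    partial = sum(truncate_fp32(x) * truncate_fp32(w)
                  for x, w in zip(inputs, weights))
    remainder = sum(x * w - truncate_fp32(x) * truncate_fp32(w)
                    for x, w in zip(inputs, weights))
    return partial + remainder
```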
[0056] In some examples, if the predicted sign of the partial
convolution value calculated by the preprocessor circuitries
102A-102C is negative, then the control 106 does not trigger a
further calculation by the remainder processing circuitries
104A-104C and the partial convolution value is discarded from
further use. In some examples, the negative predicted sign partial
convolution value is not stored in the DDC 116. In other examples,
the negative predicted sign partial convolution value is stored in
the DDC 116, but upon determining the sign is negative, the control
106 flags the partial convolution value as invalid and the data can
then subsequently be overwritten.
[0057] In some examples, the triggering process takes place on an
entire tile of input data at the same time, across a group of
remainder processing circuitries 104A-104C. In other examples, the
triggering process can take place separately per element (i.e., per
remainder processing circuitry). In some examples, for ReLU or
similar activation functions, remainder processing circuitries
104A-104C that do not receive triggers will not calculate the lower
mantissa bits of a given convolution, thus saving processing
cycles.
[0058] A more detailed set of possible example implementations of
the circuitry logic blocks shown in FIG. 1 are described below in
the discussion related to FIGS. 7-9.
[0059] While an example manner of implementing the apparatus that
predicts signs for the ReLU activation function with partial data
is illustrated in FIG. 1, one or more of the elements, processes,
and/or devices illustrated in FIG. 1 may be combined, divided,
re-arranged, omitted, eliminated, and/or implemented in any other
way. Further, the processing element array circuitries 100A-100C
(including the preprocessor circuitries 102A-102C and the remainder
processing circuitries 104A-104C), the control 106 (i.e., the
activation function control and decode circuitry), the L1 memory
circuitry 108, the higher level memory circuitry 110, the IBC 112,
the KWBC 114, the DDC 116, and/or, more generally, the example
apparatus and system of FIG. 1, may be implemented by hardware,
software, firmware, and/or any combination of hardware, software,
and/or firmware. Thus, for example, any of the example processing
element array circuitries 100A-100C (including the example
preprocessor circuitries 102A-102C and the example remainder
processing circuitries 104A-104C), the example control 106
circuitry, the example L1 memory circuitry 108, the example higher
level memory circuitry 110, the example IBC 112, the example KWBC
114, the example DDC 116, and/or, more generally, the example
system of FIG. 1, could be implemented by processor circuitry,
analog circuit(s), digital circuit(s), logic circuit(s),
programmable processor(s), programmable microcontroller(s),
graphics processing unit(s) (GPU(s)), digital signal processor(s)
(DSP(s)), application specific integrated circuit(s) (ASIC(s)),
programmable logic device(s) (PLD(s)), and/or field programmable
logic device(s) (FPLD(s)) such as Field Programmable Gate Arrays
(FPGAs). When reading any of the apparatus or system claims of this
patent to cover a purely software and/or firmware implementation,
at least one of the example processing element array circuitries
100A-100C (including the example preprocessor circuitries 102A-102C
and the example remainder processing circuitries 104A-104C), the
example control 106 circuitry, the example L1 memory circuitry 108,
the example higher level memory circuitry 110, the example IBC 112,
the example KWBC 114, the example DDC 116, and/or, more generally,
the example apparatus and system of FIG. 1 is/are hereby expressly
defined to include a non-transitory computer readable storage
medium, device or storage disk such as a memory, a digital
versatile disk (DVD), a compact disk (CD), a Blu-ray disk, etc.,
including the software and/or firmware. Further still, the example
apparatus and system of FIG. 1 may include one or more elements,
processes, and/or devices in addition to, or instead of, those
illustrated in FIG. 1, and/or may include more than one of any or
all of the illustrated elements, processes and devices.
[0060] A flowchart representative of example hardware logic
circuitry, machine readable instructions, hardware implemented
state machines, and/or any combination thereof for implementing the
apparatus and system of FIG. 1 is shown in FIG. 3. The machine
readable instructions may be one or more executable programs or
portion(s) of an executable program for execution by processor
circuitry, such as the processor circuitry 712 shown in the example
processor platform 700 discussed below in connection with FIG. 7
and/or the example processor circuitry discussed below in
connection with FIGS. 8 and/or 9. The program may be embodied in
software stored on one or more non-transitory computer readable
storage media such as a CD, a floppy disk, a hard disk drive (HDD),
a DVD, a Blu-ray disk, a volatile memory (e.g., Random Access
Memory (RAM) of any type, etc.), or a non-volatile memory (e.g.,
FLASH memory, an HDD, etc.) associated with processor circuitry
located in one or more hardware devices, but the entire program
and/or parts thereof could alternatively be executed by one or more
hardware devices other than the processor circuitry and/or embodied
in firmware or dedicated hardware. The machine readable
instructions may be distributed across multiple hardware devices
and/or executed by two or more hardware devices (e.g., a server and
a client hardware device). For example, the client hardware device
may be implemented by an endpoint client hardware device (e.g., a
hardware device associated with a user) or an intermediate client
hardware device (e.g., a radio access network (RAN) gateway that
may facilitate communication between a server and an endpoint
client hardware device). Similarly, the non-transitory computer
readable storage media may include one or more mediums located in
one or more hardware devices. Further, although the example program
is described with reference to the flowchart illustrated in FIG. 3,
many other methods of implementing the example apparatus of FIG. 1
may alternatively be used. For example, the order of execution of
the blocks may be changed, and/or some of the blocks described may
be changed, eliminated, or combined. Additionally or alternatively,
any or all of the blocks may be implemented by one or more hardware
circuits (e.g., processor circuitry, discrete and/or integrated
analog and/or digital circuitry, an FPGA, an ASIC, a comparator, an
operational-amplifier (op-amp), a logic circuit, etc.) structured
to perform the corresponding operation without executing software
or firmware. The processor circuitry may be distributed in
different network locations and/or local to one or more hardware
devices (e.g., a single-core processor (e.g., a single core central
processor unit (CPU)), a multi-core processor (e.g., a multi-core
CPU), etc.) in a single machine, multiple processors distributed
across multiple servers of a server rack, multiple processors
distributed across one or more server racks, a CPU and/or a FPGA
located in the same package (e.g., the same integrated circuit (IC)
package or in two or more separate housings, etc.).
[0061] The machine readable instructions described herein may be
stored in one or more of a compressed format, an encrypted format,
a fragmented format, a compiled format, an executable format, a
packaged format, etc. Machine readable instructions as described
herein may be stored as data or a data structure (e.g., as portions
of instructions, code, representations of code, etc.) that may be
utilized to create, manufacture, and/or produce machine executable
instructions. For example, the machine readable instructions may be
fragmented and stored on one or more storage devices and/or
computing devices (e.g., servers) located at the same or different
locations of a network or collection of networks (e.g., in the
cloud, in edge devices, etc.). The machine readable instructions
may require one or more of installation, modification, adaptation,
updating, combining, supplementing, configuring, decryption,
decompression, unpacking, distribution, reassignment, compilation,
etc., in order to make them directly readable, interpretable,
and/or executable by a computing device and/or other machine. For
example, the machine readable instructions may be stored in
multiple parts, which are individually compressed, encrypted,
and/or stored on separate computing devices, wherein the parts when
decrypted, decompressed, and/or combined form a set of machine
executable instructions that implement one or more operations that
may together form a program such as that described herein.
[0062] In another example, the machine readable instructions may be
stored in a state in which they may be read by processor circuitry,
but require addition of a library (e.g., a dynamic link library
(DLL)), a software development kit (SDK), an application
programming interface (API), etc., in order to execute the machine
readable instructions on a particular computing device or other
device. In another example, the machine readable instructions may
need to be configured (e.g., settings stored, data input, network
addresses recorded, etc.) before the machine readable instructions
and/or the corresponding program(s) can be executed in whole or in
part. Thus, machine readable media, as used herein, may include
machine readable instructions and/or program(s) regardless of the
particular format or state of the machine readable instructions
and/or program(s) when stored or otherwise at rest or in
transit.
[0063] The machine readable instructions described herein can be
represented by any past, present, or future instruction language,
scripting language, programming language, etc. For example, the
machine readable instructions may be represented using any of the
following languages: C, C++, Java, C#, Perl, Python, JavaScript,
HyperText Markup Language (HTML), Structured Query Language (SQL),
Swift, etc.
[0064] As mentioned above, the example operations of FIGS. 3
through 5 may be implemented using executable instructions (e.g.,
computer and/or machine readable instructions) stored on one or
more non-transitory computer and/or machine readable media such as
optical storage devices, magnetic storage devices, an HDD, a flash
memory, a read-only memory (ROM), a CD, a DVD, a cache, a RAM of
any type, a register, and/or any other storage device or storage
disk in which information is stored for any duration (e.g., for
extended time periods, permanently, for brief instances, for
temporarily buffering, and/or for caching of the information). As
used herein, the terms non-transitory computer readable medium and
non-transitory computer readable storage medium are expressly
defined to include any type of computer readable storage device
and/or storage disk and to exclude propagating signals and to
exclude transmission media.
[0065] "Including" and "comprising" (and all forms and tenses
thereof) are used herein to be open ended terms. Thus, whenever a
claim employs any form of "include" or "comprise" (e.g., comprises,
includes, comprising, including, having, etc.) as a preamble or
within a claim recitation of any kind, it is to be understood that
additional elements, terms, etc., may be present without falling
outside the scope of the corresponding claim or recitation. As used
herein, when the phrase "at least" is used as the transition term
in, for example, a preamble of a claim, it is open-ended in the
same manner as the terms "comprising" and "including" are open
ended. The term "and/or" when used, for example, in a form such as
A, B, and/or C refers to any combination or subset of A, B, C such
as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with
C, (6) B with C, or (7) A with B and with C. As used herein in the
context of describing structures, components, items, objects and/or
things, the phrase "at least one of A and B" is intended to refer
to implementations including any of (1) at least one A, (2) at
least one B, or (3) at least one A and at least one B. Similarly,
as used herein in the context of describing structures, components,
items, objects and/or things, the phrase "at least one of A or B"
is intended to refer to implementations including any of (1) at
least one A, (2) at least one B, or (3) at least one A and at least
one B. As used herein in the context of describing the performance
or execution of processes, instructions, actions, activities and/or
steps, the phrase "at least one of A and B" is intended to refer to
implementations including any of (1) at least one A, (2) at least
one B, or (3) at least one A and at least one B. Similarly, as used
herein in the context of describing the performance or execution of
processes, instructions, actions, activities and/or steps, the
phrase "at least one of A or B" is intended to refer to
implementations including any of (1) at least one A, (2) at least
one B, or (3) at least one A and at least one B.
[0066] As used herein, singular references (e.g., "a", "an",
"first", "second", etc.) do not exclude a plurality. The term "a"
or "an" object, as used herein, refers to one or more of that
object. The terms "a" (or "an"), "one or more", and "at least one"
are used interchangeably herein. Furthermore, although individually
listed, a plurality of means, elements or method actions may be
implemented by, e.g., the same entity or object. Additionally,
although individual features may be included in different examples
or claims, these may possibly be combined, and the inclusion in
different examples or claims does not imply that a combination of
features is not feasible and/or advantageous.
[0067] FIG. 3 is a flowchart representative of example machine
readable instructions that may be executed by example processor
circuitry to implement a prediction of the sign for the ReLU
activation function with partial data. The process flow is
performed by the processing element array circuitries 100A-100C
(including the preprocessor circuitries 102A-102C and the remainder
processing circuitries 104A-104C), the control 106 (i.e., the
activation function control and decode circuitry), the L1 memory
circuitry 108, the higher level memory circuitry 110, the IBC 112,
the KWBC 114, and the DDC 116, as illustrated in FIG. 1.
[0068] In the illustrated example of FIG. 3, when input data is
sent to a CNN to be processed (e.g., an image is sent through a CNN
to be classified) the process begins, at block 300, where the
control 106 retrieves input data and weight data from memory.
[0069] The example process continues at block 302 with the control
106 populating the IBC 112 with a subset of the input data. In some
examples, the data loaded has been rearranged into groups from an
initial FP32 format. Thus, in some examples, the sign bit, the
exponent bits, and a group of upper mantissa bits make up the
subset of input data loaded into the IBC 112.
[0070] The example process continues at block 304 with the control
106 populating the KWBC 114 with a subset of the weight data.
Similarly to the group of data loaded into the IBC 112 in block 302
above, in some examples, the sign bit, the exponent bits, and a
group of upper mantissa bits make up the subset of weight data
loaded into the KWBC 114.
[0071] The example process continues at block 306 when one or more
of the preprocessor circuitries 102A-102C calculate a partial
convolution value using at least a portion of the input data subset
and the weight data subset. In some examples, the partial
convolution calculation uses the entire subset of the sign bit, the
exponent bits, and the upper mantissa bits. In other examples, an
initial partial convolution calculation uses only the sign bit and
the exponent bits to calculate a first partial convolution value.
In some examples, it is possible to predict the sign of the partial
convolution using only the values of the sign bit and the exponent
bits of the input data and weight data. In these situations, the
entirety of the FP32 mantissa (both upper and lower portions) is
not significant enough to possibly change the predicted sign.
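The exponent-only case may be illustrated as follows. Because each FP32 significand lies in [1, 2), a product's magnitude is bounded between 2^E and 2^(E+2), where E is the sum of the unbiased exponents; when the bounded positive and negative contributions cannot overlap, the sign is decided without any mantissa bits. This sketch (ignoring zeros and denormals for simplicity) is an illustrative assumption, not the described circuit:

```python
import struct

def unbiased_exponent(value: float) -> int:
    """Unbiased FP32 exponent (zeros/denormals not handled specially)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return ((bits >> 23) & 0xFF) - 127

def sign_from_exponents(inputs, weights) -> int:
    """Return -1 or +1 if sign and exponent bits alone decide the sign of
    the convolution sum; return 0 if mantissa bits are still needed."""
    pos_lo = pos_hi = neg_lo = neg_hi = 0.0
    for x, w in zip(inputs, weights):
        e = unbiased_exponent(x) + unbiased_exponent(w)
        lo, hi = 2.0 ** e, 2.0 ** (e + 2)  # significand product in [1, 4)
        if (x < 0) != (w < 0):  # XOR of sign bits: a negative product
            neg_lo += lo
            neg_hi += hi
        else:
            pos_lo += lo
            pos_hi += hi
    if pos_lo > neg_hi:  # positives outweigh any possible negative total
        return 1
    if neg_lo > pos_hi:  # negatives outweigh any possible positive total
        return -1
    return 0
```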
[0072] The example process continues at block 308 when one or more
of the preprocessor circuitries 102A-102C predict the sign of the
partial convolution value calculated in block 306. In some
examples, if the predicted sign is negative, the sign cannot turn
positive no matter what subset of additional less significant bits
are utilized in subsequent calculations of the convolution value,
thus a negative result is known. In some examples, if the predicted
sign is positive, the sign still may possibly turn negative once
additional less significant bits are considered in subsequent
calculations.
[0073] The example process continues at block 310 when one or more
of the preprocessor circuitries 102A-102C send the predicted sign
of the partial convolution value to the control 106. At this point
the process flow of FIG. 3 is finished.
[0074] FIG. 4 is another flowchart representative of example
machine readable instructions that may be executed by example
processor circuitry to implement a prediction of the sign for the
ReLU activation function with partial data. The process flow is
performed by the processing element array circuitries 100A-100C
(including the preprocessor circuitries 102A-102C and the remainder
processing circuitries 104A-104C), the control 106 (i.e., the
activation function control and decode circuitry), the L1 memory
circuitry 108, the higher level memory circuitry 110, the IBC 112,
the KWBC 114, and the DDC 116, as illustrated in FIG. 1.
[0075] In the illustrated example of FIG. 4, the process begins at
block 400 where input data is fed into the CNN to be processed and
the activation function control and decode circuitry (control 106)
populates a memory with tile data elements. In some examples, the
input data includes a series of tiles that make up an image. In
some examples, at least a tile's worth of data is populated in the
memory at a given time. In some examples, the control reads input
data from a higher level memory 110, rearranges the input data, and
populates the input data into an L1 memory 108 in separate groups.
FIG. 2 illustrates an example of how the control may populate the
L1 memory 108 with the input data from a tile. In some examples,
the memory is a designated hardware buffer (e.g., data distribution
circuitry 116). In some examples, the memory is a range of memory
locations in L1 memory 108. In other examples, the memory is any
form of memory capable of storing input data and accessible by the
other circuitry in the system shown in FIG. 1. In some examples,
once the memory is populated with the tile data elements in block
400, the control 106 triggers one or more of the processing element
array circuitries (100A-100C), and, more specifically, one or more
of the preprocessor circuitries 102A-102C, to begin processing the
elements in the tile, beginning with the first element.
[0076] The example process continues at block 402 when one or more
of the preprocessor circuitries 102A-102C perform an exponent
addition with the sign and exponent bits of the input data
populated in the memory and the sign and exponent bits of the
corresponding weight data.
[0077] The example process continues at block 404 when one or more
of the preprocessor circuitries 102A-102C checks the result of the
exponent addition in block 402 for a predicted negative value of
the partial convolution result for a ReLU activation function.
[0078] If the predicted result of the exponent addition is
negative, then the example process continues at block 406 when one
or more of the preprocessor circuitries 102A-102C sends the element
negative flag to the control 106. The element negative flag
received by the control 106 indicates that no more processing of
the element will be done because the element's convolution result
will be negative; thus the ReLU function discards the data.
[0079] If the predicted result of the exponent addition is
non-negative, then the example process continues at block 408 when
one or more of the preprocessor circuitries 102A-102C stores the
partial compute data (e.g., a partial convolution value) into the
memory (i.e., in response to the non-negative value). In some
examples, the partially computed data is only stored into the
memory when the predicted result determined in block 404 is a
non-negative value. In other examples, the partially computed data
is stored into the memory at a location in the process flow of the
flowchart immediately above block 404. In these examples, the
partially computed data from the exponent addition block 402 is
stored into the memory regardless of the predicted sign.
[0080] The example process continues at block 410 when one or more
of the preprocessor circuitries 102A-102C perform a mantissa
multiplication with one or more of the upper mantissa bits (e.g.,
one or more of the most significant mantissa bits) of the input
data populated in the memory and the corresponding upper mantissa
bits of the weight data.
[0081] The example process continues at block 412 when one or more
of the preprocessor circuitries 102A-102C checks the result of the
upper mantissa multiplication for a predicted negative value of the
partial convolution result for a ReLU activation function. In some
examples, the preprocessor circuitries 102A-102C that check for a
predicted negative value utilize the exponent addition result
value(s) (stored in memory as partial compute data in block 408)
with the upper mantissa multiplication result value(s) from block
410 to determine the new combined value (i.e., the partial
convolution value of the input and weight sign bits, exponent bits,
and upper mantissa bits).
[0082] If the predicted result of the upper mantissa multiplication
is negative, then the example process continues at block 406 when
one or more of the preprocessor circuitries 102A-102C sends the
element negative flag to the control 106.
[0083] If the predicted result of the upper mantissa multiplication
is non-negative, then the example process continues at block 414
when one or more of the preprocessor circuitries 102A-102C stores
the partial compute data (i.e., the partial convolution value of
the input and weight sign bits, exponent bits, and upper mantissa
bits) into the memory.
[0084] The example process continues at block 416 when one or more
of the remainder circuitries 104A-104C perform a mantissa
multiplication with one or more of the lower mantissa bits (e.g.,
the remaining mantissa bits not utilized in the upper mantissa
calculation from block 410) of the input data populated in the
memory and the corresponding lower mantissa bits of the weight data. In some
examples, the mantissa multiplication is performed in response to
the control 106 causing one or more of the remainder circuitries
104A-104C to perform. In some examples, the control 106 triggers
one or more of the remainder circuitries 104A-104C to calculate the
mantissa for the remaining bits not utilized in the upper mantissa
calculation (e.g., a remaining subset of bits not used to calculate
the upper mantissa partial convolution result), where the control
initiates the trigger in response to receiving a non-negative
predicted result from one or more of the preprocessor circuitries
102A-102C.
[0085] The example process continues at block 418 when one or more
of the preprocessor circuitries 102A-102C checks the result of the
lower mantissa multiplication for a negative value of the whole
convolution result for a ReLU activation function. In some
examples, the preprocessor circuitries 102A-102C that check for the
negative value utilize the exponent addition result value(s)
(stored in memory as partial compute data in block 408) and the
upper mantissa multiplication result value(s) (stored in memory as
partial compute data in block 414) with the lower mantissa
multiplication result value(s) from block 416 to determine the new
combined value (i.e., the full convolution value of the input and
weight sign bits, exponent bits, upper mantissa bits, and lower
mantissa bits). At this point, there is no longer a predictive
nature of the value of the sign because all 32 bits of the original
FP32 format data are being utilized in the calculation. Therefore,
the sign of the actual convolution result can be determined.
[0086] If the result of the lower mantissa multiplication is
negative, then the example process continues at block 406 when one
or more of the preprocessor circuitries 102A-102C sends the element
negative flag to the control 106.
[0087] If the result of the lower mantissa multiplication is
non-negative, then the example process continues at block 420 when
one or more of the preprocessor circuitries 102A-102C store the
full compute data (i.e., the full convolution value of the input
and weight sign bits, exponent bits, upper mantissa bits, and lower
mantissa bits) into the memory.
[0088] Returning to block 406 in the example process, once the
element negative flag is sent to the control 106, then the example
process continues at block 422 when the control 106 checks whether
all elements have been processed in the input data tile. If all
elements in the tile have been processed, then the example process
is finished.
[0089] If there are still additional elements to be processed in
the input data tile, then the control 106 triggers one or more of
the processing element array circuitries (100A-100C), and, more
specifically, one or more of the preprocessor circuitries
102A-102C, to begin processing next element(s) in the input data
tile and the process repeats.
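In some examples, the staged flow of FIG. 4 can be sketched in software. The following Python sketch (function and variable names are illustrative, not from the patent) refines the convolution sum with progressively more mantissa bits and raises the element negative flag as soon as a partial result turns negative; the qualification conditions derived later in this disclosure are omitted for brevity:

```python
import struct

def truncate(x, keep_bits):
    """Zero all but the top keep_bits of the 23 FP32 fraction bits
    (keep_bits=0 keeps only the sign and exponent fields)."""
    b = struct.unpack('>I', struct.pack('>f', x))[0]
    b &= ~((1 << (23 - keep_bits)) - 1) & 0xFFFFFFFF
    return struct.unpack('>f', struct.pack('>I', b))[0]

def staged_relu_prediction(inputs, weights, upper_bits=4):
    """Mirror of the FIG. 4 control flow: exponent stage (block 402),
    upper mantissa stage (block 410), lower mantissa stage (block 416)."""
    for keep in (0, upper_bits, 23):
        partial = sum(truncate(i, keep) * truncate(w, keep)
                      for i, w in zip(inputs, weights))
        if partial < 0:
            return 'negative'      # element negative flag (block 406)
    return 'non-negative'          # full compute data stored (block 420)
```

For example, `staged_relu_prediction([1.0], [-1.0])` stops at the first stage, while `staged_relu_prediction([1.5], [1.5])` runs all three stages before concluding the result is non-negative.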
[0090] FIG. 5 illustrates an example of the layout of a memory
storing the data described in the discussion related to the
flowchart of FIG. 4. FIG. 5 illustrates the memory locations where
certain results are stored after specific blocks of FIG. 4 have
been performed.
[0091] The example preprocessor circuitries 102A-102C perform the
exponent addition at block 402 in FIG. 4 and the result is stored
in a memory 500 in a sign and exponent results location 502. In
some examples, the memory 500 space shown may be a virtual set of
contiguous addresses located in one or more memory circuitries in
the system in FIG. 1. In other examples, the memory 500 shown may
be physical memory, such as L1 memory 108. In yet other
embodiments, the memory 500 shown may be any type of physical
memory, storage, or buffer capable of storing such data for
components in the system of FIG. 1.
[0092] In some examples, when performing block 408 of the flowchart
in FIG. 4, the preprocessor circuitries 102A-102C store the partial
compute data (determined from block 402 in FIG. 4) in a partial
compute data location 508 in the memory 500. In block 408, the
partial compute data stored consists of the partial convolution of
the input and weight data, computed from the sign bits and the
exponent bits. In some examples, the partial compute data 508
memory storage location can be written to by the control 106 and/or
one or more of the preprocessor circuitries 102A-102C to store the
partial convolution value calculated in exponent addition block
402. In some embodiments, the result of that calculation can be
copied from the sign and exponent location 502 of memory 500.
[0093] The example preprocessor circuitries 102A-102C perform the
upper mantissa multiplication at block 410 in FIG. 4 and the result
is stored in the memory 500 in an upper mantissa results location
504. In some examples, when performing the mantissa multiplication,
the previous partial compute data results that had been stored in
the partial compute data location 508 are read and utilized in
furtherance of computing additional bits of the full FP32
operand.
[0094] In some examples, when performing block 414 of the flowchart
in FIG. 4, the preprocessor circuitries 102A-102C store the partial
compute data (determined from block 410 in FIG. 4) in the partial
compute data location 508 in the memory 500. In block 414, the
partial compute data stored consists of the partial convolution of
the input and weight data, computed from the sign bits, the
exponent bits, and the upper mantissa bits. In some embodiments,
the result of that calculation can be copied from a combination of
the sign and exponent results location 502 and the upper mantissa
results location 504 of memory 500.
[0095] The example remainder processing circuitries 104A-104C
perform the lower mantissa multiplication at block 416 in FIG. 4 and the result
is stored in the memory 500 in a lower mantissa results location
506. In some examples, when performing the mantissa multiplication,
the previous partial compute data results that had been stored in
the partial compute data location 508 are read and utilized in
furtherance of computing the remaining additional bits of the full
FP32 operand.
[0096] In some examples, when performing block 420 of the flowchart
in FIG. 4, the preprocessor circuitries 102A-102C store the full
compute data (determined from block 416 in FIG. 4) in the compute
data location 510 in the memory 500. In block 420, the full
compute data stored consists of the full convolution of the input
and weight data, computed from the sign bits, the exponent bits,
the upper mantissa bits, and the lower mantissa bits. In some
embodiments, the result of that calculation can be copied from a
combination of the sign and exponent results location 502, the
upper mantissa results location 504, and the lower mantissa results
location 506 of memory 500.
[0097] FIG. 6A illustrates an example number format of an FP32 data
type used for predicting a ReLU activation function result in a
CNN. In some examples, with a FP32 data type, a reduced number of
mantissa bits are used to calculate a convolution value from an
input value and a weight value. The example format in FIG. 6A
includes a 1-bit sign value 600 (bit [31]), an 8-bit exponent value
602 (bits [30:23]), an upper mantissa value 604 (N bits), and a
lower mantissa value 606 (23-N bits). For example, if the upper
mantissa value is a 4-bit value (bits [22:19]), then the lower
mantissa value is a 19-bit value (bits [18:0]). In other examples,
different permutations of the bit-size of the upper and lower
mantissa values may be utilized.
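The bit-field split of FIG. 6A can be illustrated with a short Python sketch (a minimal illustration; the helper name and the choice of N=4 are assumptions, not from the patent):

```python
import struct

def split_fp32(x, n=4):
    """Return the FIG. 6A fields of an FP32 value: sign (bit [31]),
    biased exponent (bits [30:23]), upper mantissa (top n fraction
    bits), and lower mantissa (remaining 23 - n fraction bits)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    upper = (bits >> (23 - n)) & ((1 << n) - 1)
    lower = bits & ((1 << (23 - n)) - 1)
    return sign, exponent, upper, lower
```

For example, `split_fp32(-2.5)` yields sign 1, biased exponent 128, upper mantissa 0b0100, and lower mantissa 0.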
[0098] The mantissa bits that are used to predict a ReLU activation
function result begin with the most significant bits of the
mantissa value (i.e., the upper bits; the upper mantissa value).
The mantissa bits that are not used for partial convolution value
prediction include a series of consecutive mantissa bits from the
least significant bit (bit [0]) up to the bit immediately below the
least significant bit of the upper mantissa value. In some
examples, the prediction of the ReLU activation function result
utilizes the sign value 600, the exponent value 602, and the upper
mantissa value 604. Removing the lower mantissa value from a
calculation reduces the precision of the result.
[0099] Consider examining a 32-bit value. In an example first
examination of the value, all 32 bits are visible/available,
therefore predicting the value is not necessary because the entire
value is known (i.e., an ideal calculation using all mantissa
bits). In an example second examination of the value, the most
significant 13 bits of the value are visible (i.e., the least
significant 19 bits are not visible leading to a reduced precision
of the value). The reduced precision of the value may include an
error of up to the maximum size of the not visible least
significant bits.
[0100] Returning to calculating a partial sum of a convolution, the
error corresponds to a region of interest where there may be a
discrepancy between a calculated ideal partial sum value of the
convolution (using all mantissa bits in the calculation) and a
calculated partial sum value of the convolution using a reduced
number of mantissa bits. In some examples, the partial sum that
utilizes the reduced number of mantissa bits may have a different
sign than the ideal partial sum. In some examples, the absolute
value of the actual mantissa will be greater than or equal to the
absolute value of the predicted mantissa.
[0101] FIG. 6B illustrates an example region of interest where a
reduced precision of an FP32 input value and weight value used to
calculate a partial convolution value may cause a prediction error
of a ReLU activation function result. In some examples, the result
loses precision and, in turn, increases a range of possible error
in the prediction due to the calculation not using a subset of the
mantissa bits (e.g., one or more lower/least significant mantissa
bits). In the example described above regarding FIG. 6A, the lower
19 bits of the mantissa of the input value and the weight value are
not utilized in the partial convolution value calculation.
[0102] As shown in FIG. 6B, an example region of interest 608 is
shown on a number line 610 of the example calculated partial
convolution value where there is likely a delta between a predicted
value and the true value. The delta may result in the sign of the
predicted value being different than the sign of the true
value.
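In some examples, this sign flip can be reproduced directly. The Python sketch below (illustrative operand values, with an assumed n=4) builds a two-term sum whose ideal value is positive but whose reduced-mantissa value falls in the region of interest and turns negative:

```python
import struct

def truncate(x, n):
    """Keep only the top n of the 23 FP32 fraction bits."""
    b = struct.unpack('>I', struct.pack('>f', x))[0]
    b &= ~((1 << (23 - n)) - 1) & 0xFFFFFFFF
    return struct.unpack('>f', struct.pack('>I', b))[0]

n = 4
big = 2 - 2 ** -23                     # FP32 value with all mantissa bits set
inputs = [big, 1.0]
weights = [big, -3.875]                # -3.875 is exact in 4 mantissa bits
ideal = sum(i * w for i, w in zip(inputs, weights))
reduced = sum(truncate(i, n) * truncate(w, n)
              for i, w in zip(inputs, weights))
# The positive term loses magnitude under truncation; the negative
# term does not, so the sign of the sum flips.
assert ideal > 0 and reduced < 0
```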
[0103] In some examples, performing convolution using a reduced
number of mantissa bits can produce an erroneous ReLU prediction
only because of the excluded mantissa bits of positive elements.
Excluded mantissa bits of negative elements only drive the result
further negative, in agreement with the ReLU cutoff, and hence do
not contribute to the final error.
[0104] In some examples, it can be determined mathematically that a
subset of the entire input data of FP32 data type can be utilized
to sufficiently predict negative values for convolutional matrix
multiplications involving input data and weights. Thus, not all
32 bits of FP32 data are needed to accurately predict negative
results. Below is a series of mathematical proofs that show some
examples of the region of interest, the maximum possible error in
prediction, and conditions to be checked to qualify the
predictions. Following those requirements, in some examples, a
significant reduction in bits utilized to accurately predict the
sign of a partial convolution value is achievable.
[0105] For the following description, let: [0106] $X_S$ = partial
sum of the convolution operation using reduced mantissa bits. For
example, in a 32 channel CONV operation, $X_S$ can represent the
first 16 channel computation. [0107] $X_{Reduced}$ = partial sum of
the remaining convolution with reduced mantissa bits. [0108]
$X_S^{Reduced}$ = final sum of the CONV operation considering
reduced mantissa bits. [0109] $X_{Ideal}$ = partial sum of the
remaining convolution considering all mantissa bits. [0110]
$X_S^{Ideal}$ = final sum of the CONV operation considering all
mantissa bits.
[0111] This can also be represented as,
$X_S^{Ideal} = X_S + X_{Ideal}$ (Equation 1)
$X_S^{Reduced} = X_S + X_{Reduced}$ (Equation 2)
[0112] In some examples, reducing the number of mantissa bits in a
floating-point number results in the number having a lower absolute
magnitude. However, the sign remains unaffected as the sign bit is
unchanged. Hence, if
$X_{Ideal} < 0$
then $X_{Reduced} > X_{Ideal}$
and $X_S + X_{Reduced} > X_S + X_{Ideal}$
[0113] In some examples, Equations 1 and 2 show that
$X_S^{Reduced} > X_S^{Ideal}$ (Equation 3)
[0114] In some examples, Equation 3 shows that if
$X_S^{Reduced} < 0$, then $X_S^{Ideal} < 0$. An error due to the
addition of a negative value cannot alter the sign of the sum from
positive to negative. Therefore,
if $X_{Ideal} > 0$
then $X_{Reduced} < X_{Ideal}$
and $X_S + X_{Reduced} < X_S + X_{Ideal}$
[0115] Again, in some examples, Equations 1 and 2 show that
$X_S^{Reduced} < X_S^{Ideal}$ (Equation 4)
[0116] In some examples, for Equation 4, $X_S^{Reduced} < 0$ does
not guarantee $X_S^{Ideal} < 0$. Thus, errors due to the
addition of positive values will contribute towards a possible sign
change from positive to negative. These errors can be utilized to
determine a threshold value to compare against to conclude that the
convolution sum is negative when calculating a partial convolution
value using a reduced amount of mantissa bits.
[0117] In some examples, if a positive term in the convolution sum
is given by $C_{Mul} = 2^{E_{Mul}} \times M_{Mul}$, where
$E_{Mul}$ and $M_{Mul}$ are the unbiased exponent and mantissa
value of the term, the maximum error that is possible when the
number of mantissa bits is reduced to n is given by
$C_{ErrMax} = 2^{E_{Mul}-n+1} \times M_{Mul}$.
[0118] In some examples, for any floating-point number given by
$N = (-1)^S \times 2^E \times M$
[0119] where S, E, and M represent the sign, unbiased exponent, and
mantissa value, the maximum possible error when only n mantissa
bits are included is given by
$E_{Max} = -2^{(E-n)} \times (-1)^S$ (Equation 5)
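Equation 5 can be checked numerically. The following Python sketch (illustrative, with an assumed n=4) truncates an FP32 value to its top n mantissa bits and confirms the magnitude of the error stays below $2^{(E-n)}$:

```python
import math
import struct

def truncate_mantissa(x, n):
    """Zero all but the n most significant of the 23 FP32 fraction bits."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    bits &= ~((1 << (23 - n)) - 1) & 0xFFFFFFFF
    return struct.unpack('>f', struct.pack('>I', bits))[0]

x = 3.14159274            # an FP32 value with unbiased exponent E = 1
n = 4
E = math.floor(math.log2(abs(x)))
err = abs(x - truncate_mantissa(x, n))
assert err < 2 ** (E - n)             # |E_Max| bound of Equation 5
assert truncate_mantissa(x, n) <= x   # truncation only shrinks magnitude
```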
[0120] Consider an activation input (I) and weight (W) of a
convolution layer. They are represented as
$I = (-1)^{S_I} \times 2^{E_I} \times M_I$ (Equation 6)
$W = (-1)^{S_W} \times 2^{E_W} \times M_W$ (Equation 7)
[0121] From Equation 5, in some examples, the most erroneous values
that could result from reducing the number of mantissa bits to n in
I (from Equation 6) and W (from Equation 7) are given by
$I_{Reduced} = (-1)^{S_I} \times 2^{E_I} \times M_I - 2^{(E_I-n)} \times (-1)^{S_I}$ (Equation 8)
$W_{Reduced} = (-1)^{S_W} \times 2^{E_W} \times M_W - 2^{(E_W-n)} \times (-1)^{S_W}$ (Equation 9)
[0122] In some examples, the convolution term, when I (from
Equation 6) and W (from Equation 7) are multiplied, is given by
$C_{Ideal} = (-1)^{S_I+S_W} \times 2^{E_I+E_W} \times (M_I \times M_W)$ (Equation 10)
[0123] In some examples, with reduced mantissa in the convolution
step, (Equation 8) and (Equation 9) give
$C_{Reduced} = I_{Reduced} \times W_{Reduced} = (-1)^{S_I+S_W} \times [2^{E_I+E_W} \times (M_I \times M_W) - 2^{E_I+E_W-n} \times (M_I + M_W) + 2^{E_I+E_W-2n}]$
Thus,
$C_{Reduced} = 2^{E_I+E_W} \times [(M_I \times M_W) - 2^{-n} \times (M_I + M_W - 2^{-n})]$ (Equation 11)
[0124] In some examples, the error in the convolution terms due to
reduced mantissa can be obtained from (Equation 10) and (Equation
11):
$C_{Error} = C_{Ideal} - C_{Reduced} = 2^{E_I+E_W-n} \times (M_I + M_W - 2^{-n})$
[0125] In some examples, because $2^{-n}$ is always positive,
$C_{Error} \leq 2^{E_I+E_W-n} \times (M_I + M_W)$ (Equation 12)
[0126] Since $M_I$ and $M_W$ represent the mantissa values,
$1 \leq M_I, M_W < 2$
$M_I + M_W \leq 2 \times M_I \times M_W$
[0127] Therefore, (Equation 12) can be rewritten as
$C_{Error} \leq 2^{E_I+E_W-n} \times (2 \times M_I \times M_W) = 2^{E_I+E_W-n+1} \times (M_I \times M_W)$
[0128] In some examples, (Equation 10) provides
$C_{Error} \leq 2^{-n+1} \times C_{Ideal}$ (Equation 13)
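In some examples, the per-term bound above can be checked numerically. The Python sketch below (illustrative operand values, with an assumed n=4) computes a positive product term with full and reduced mantissas and verifies $C_{Error} \leq 2^{-n+1} \times C_{Ideal}$:

```python
import struct

def truncate(x, n):
    """Keep only the top n of the 23 FP32 fraction bits."""
    b = struct.unpack('>I', struct.pack('>f', x))[0]
    b &= ~((1 << (23 - n)) - 1) & 0xFFFFFFFF
    return struct.unpack('>f', struct.pack('>I', b))[0]

n = 4
I, W = 1.7328125, 1.9015625                   # arbitrary positive operands
c_ideal = truncate(I, 23) * truncate(W, 23)   # all mantissa bits
c_reduced = truncate(I, n) * truncate(W, n)   # reduced mantissa bits
c_error = c_ideal - c_reduced
assert 0 <= c_error <= 2 ** (-n + 1) * c_ideal
```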
[0129] In some examples, Theorem 1 illustrates that only positive
terms will produce errors that can contribute to incorrectly
identifying a negative value. Hence, $S_I + S_W$ is even (either
both I and W are positive or both are negative).
[0130] In (Equation 10), $C_{Ideal}$ can be rewritten as
$C_{Ideal} = 2^{E_{Mul}} \times M_{Mul}$ (Equation 14)
[0131] where $E_{Mul} = E_I + E_W$ and $M_{Mul} = M_I \times M_W$.
[0132] Thus, in some examples, the maximum error in a positive term
in the convolution sum is
$C_{ErrMax} = 2^{E_{Mul}-n+1} \times M_{Mul}$ (Equation 15)
[0133] In some examples, if the convolution sum before the ReLU
activation layer is given by
$C_{Tot} = (-1)^{S_{Tot}} \times 2^{E_{Tot}} \times M_{Tot}$,
and the sum of positive terms in the summation (including the bias
value) is given by $C_{Pos} = 2^{E_{Pos}} \times M_{Pos}$, then
the value of $C_{Tot}$ can be concluded to be negative if
$S_{Tot} = 1$ and $E_{Tot} > E_{Pos} - n$, where n is the number of
mantissa bits used in the computation.
[0134] In some examples, the sum of all product terms in the
convolution is given by
$C_{Tot} = \sum_i (-1)^{S_i} \times 2^{E_i} \times M_i = (-1)^{S_{Tot}} \times 2^{E_{Tot}} \times M_{Tot}$ (Equation 16)
[0135] In some examples, from (Equation 15), the maximum error due
to positive terms in the convolution is given by
$C_{ErrMax}^i = 2^{E_i-n+1} \times M_i$. Thus, in some examples,
the following equation represents when errors are accumulated for
all positive terms (including bias),
$C_{ErrTot} = \sum_{i: S_i=0} C_{ErrMax}^i = \sum_{i: S_i=0} 2^{E_i-n+1} \times M_i$ (Equation 17)
[0136] In some examples, unlike other terms in the convolution sum,
the bias does not involve multiplication of reduced mantissa
numbers. Thus, the maximum error for bias values will be lower.
However, in some examples, the same error is considered (as an
upper bound) to simplify calculations.
[0137] In some examples, the sum of positive terms (including bias)
in the convolution sum is represented as
$C_{Pos} = \sum_{i: S_i=0} 2^{E_i} \times M_i = 2^{E_{Pos}} \times M_{Pos}$ (Equation 18)
[0138] In some examples, using (Equation 18), the total error in
(Equation 17) can be rewritten as,
$C_{ErrTot} = 2^{-n+1} \times C_{Pos}$ (Equation 19)
[0139] In some examples, to conclude that a convolution sum is
zero/negative, the following two conditions should hold:
$|C_{Tot}| \geq C_{ErrTot}$ (Equation 20)
$S_{Tot} = 1$ (Equation 21)
[0140] In some examples, (Equation 20) can be expanded using
(Equation 16) and (Equation 19) to give
$2^{E_{Tot}} \times M_{Tot} \geq 2^{E_{Pos}-n+1} \times M_{Pos}$ (Equation 22)
[0141] In some examples, note that if $E_{Tot} = E_{Pos} - n + 1$,
then the condition $M_{Tot} \geq M_{Pos}$ must hold (as the total
convolution sum ($C_{Tot}$) must be greater than or equal to the
sum of positive convolution terms and bias ($C_{Pos}$)).
[0142] Thus, in some examples, (Equation 22) now becomes
$E_{Tot} \geq E_{Pos} - n + 1$ (Equation 23)
$E_{Tot} > E_{Pos} - n$ (Equation 24)
[0143] Therefore, from (Equation 21) and (Equation 24), in some
examples, it holds that a convolution sum computed using
reduced-mantissa bits is negative (and the ReLU output is zero) if
$S_{Tot} = 1$, $M_{Tot} \geq M_{Pos}$, and
$E_{Tot} > E_{Pos} - n$.
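In some examples, the qualification test derived above reduces to a few comparisons per element. The Python sketch below (a hypothetical helper; names are not from the patent) evaluates the three conditions:

```python
def conclude_negative(s_tot, e_tot, m_tot, e_pos, m_pos, n):
    """Conclude that the reduced-mantissa convolution sum is negative
    (ReLU output zero) when S_Tot = 1, M_Tot >= M_Pos, and
    E_Tot > E_Pos - n, for n mantissa bits used in the computation."""
    return s_tot == 1 and m_tot >= m_pos and e_tot > e_pos - n
```

For example, with n=4, a negative total ($S_{Tot}=1$) with $E_{Tot}=3$ and $M_{Tot}=1.5$, against positive terms with $E_{Pos}=5$ and $M_{Pos}=1.2$, satisfies all three conditions.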
[0144] FIG. 7 is a block diagram of an example processor platform
700 structured to execute and/or instantiate the machine readable
instructions and/or operations of FIGS. 3 through 5 to implement
the apparatus of FIG. 1. The processor platform 700 can be, for
example, a server, a personal computer, a workstation, a
self-learning machine (e.g., a neural network), a mobile device
(e.g., a cell phone, a smart phone, a tablet such as an iPad), an
Internet appliance, a DVD player, a digital video recorder, a
Blu-ray player, a gaming console, a personal video recorder, a set
top box, a headset (e.g., an augmented reality (AR) headset, a
virtual reality (VR) headset, etc.) or other wearable device, or
any other type of computing device.
[0145] The processor platform 700 of the illustrated example
includes processor circuitry 712. The processor circuitry 712 of
the illustrated example is hardware. For example, the processor
circuitry 712 can be implemented by one or more integrated
circuits, logic circuits, FPGAs, microprocessors, CPUs, GPUs, DSPs,
and/or microcontrollers from any desired family or manufacturer.
The processor circuitry 712 may be implemented by one or more
semiconductor based (e.g., silicon based) devices. In this example,
the processor circuitry 712 implements the example processing
element array circuitries 100A-100C (including the example
preprocessor circuitries 102A-102C and the example remainder
processing circuitries 104A-104C), the example control 106
circuitry, the example L1 memory circuitry 108, the example higher
level memory circuitry 110, the example IBC 112, the example KWBC
114, and/or the example DDC 116. In some examples, tile processing
logic 118 and the circuitry within (shown in greater detail in FIG.
1) is located at least partially in processor circuitry 712.
[0146] The processor circuitry 712 of the illustrated example
includes a local memory 713 (e.g., a cache, registers, etc.). The
processor circuitry 712 of the illustrated example is in
communication with a main memory including a volatile memory 714
and a non-volatile memory 716 by a bus 718. The volatile memory 714
may be implemented by Synchronous Dynamic Random Access Memory
(SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS.RTM. Dynamic
Random Access Memory (RDRAM.RTM.), and/or any other type of RAM
device. The non-volatile memory 716 may be implemented by flash
memory and/or any other desired type of memory device. Access to
the main memory 714, 716 of the illustrated example is controlled
by a memory controller 717.
[0147] The processor platform 700 of the illustrated example also
includes interface circuitry 720. The interface circuitry 720 may
be implemented by hardware in accordance with any type of interface
standard, such as an Ethernet interface, a universal serial bus
(USB) interface, a Bluetooth.RTM. interface, a near field
communication (NFC) interface, a PCI interface, and/or a PCIe
interface.
[0148] In the illustrated example, one or more input devices 722
are connected to the interface circuitry 720. The input device(s)
722 permit(s) a user to enter data and/or commands into the
processor circuitry 712. The input device(s) 722 can be implemented
by, for example, an audio sensor, a microphone, a camera (still or
video), a keyboard, a button, a mouse, a touchscreen, a track-pad,
a trackball, an isopoint device, and/or a voice recognition
system.
[0149] One or more output devices 724 are also connected to the
interface circuitry 720 of the illustrated example. The output
devices 724 can be implemented, for example, by display devices
(e.g., a light emitting diode (LED), an organic light emitting
diode (OLED), a liquid crystal display (LCD), a cathode ray tube
(CRT) display, an in-place switching (IPS) display, a touchscreen,
etc.), a tactile output device, a printer, and/or a speaker. The
interface circuitry 720 of the illustrated example, thus, typically
includes a graphics driver card, a graphics driver chip, and/or
graphics processor circuitry such as a GPU.
[0150] The interface circuitry 720 of the illustrated example also
includes a communication device such as a transmitter, a receiver,
a transceiver, a modem, a residential gateway, a wireless access
point, and/or a network interface to facilitate exchange of data
with external machines (e.g., computing devices of any kind) by a
network 726. The communication can be by, for example, an Ethernet
connection, a digital subscriber line (DSL) connection, a telephone
line connection, a coaxial cable system, a satellite system, a
line-of-sight wireless system, a cellular telephone system, an
optical connection, etc.
[0151] The processor platform 700 of the illustrated example also
includes one or more mass storage devices 728 to store software
and/or data. Examples of such mass storage devices 728 include
magnetic storage devices, optical storage devices, floppy disk
drives, HDDs, CDs, Blu-ray disk drives, redundant array of
independent disks (RAID) systems, solid state storage devices such
as flash memory devices, and DVD drives.
[0152] The machine executable instructions 732, which may be
implemented by the machine readable instructions of FIGS. 3 through
5, may be stored in the mass storage device 728, in the volatile
memory 714, in the non-volatile memory 716, and/or on a removable
non-transitory computer readable storage medium such as a CD or
DVD.
[0153] FIG. 8 is a block diagram of an example implementation of
the processor circuitry 712 of FIG. 7. In this example, the
processor circuitry 712 of FIG. 7 is implemented by a
microprocessor 800. For example, the microprocessor 800 may
implement multi-core hardware circuitry such as a CPU, a DSP, a
GPU, an XPU, etc. Although it may include any number of example
cores 802 (e.g., 1 core), the microprocessor 800 of this example is
a multi-core semiconductor device including N cores. The cores 802
of the microprocessor 800 may operate independently or may
cooperate to execute machine readable instructions. For example,
machine code corresponding to a firmware program, an embedded
software program, or a software program may be executed by one of
the cores 802 or may be executed by multiple ones of the cores 802
at the same or different times. In some examples, the machine code
corresponding to the firmware program, the embedded software
program, or the software program is split into threads and executed
in parallel by two or more of the cores 802. The software program
may correspond to a portion or all of the machine readable
instructions and/or operations represented by the flowchart of
FIGS. 3 through 5.
[0154] The cores 802 may communicate by an example bus 804. In some
examples, the bus 804 may implement a communication bus to
effectuate communication associated with one(s) of the cores 802.
For example, the bus 804 may implement at least one of an
Inter-Integrated Circuit (I2C) bus, a Serial Peripheral Interface
(SPI) bus, a PCI bus, or a PCIe bus. Additionally or alternatively,
the bus 804 may implement any other type of computing or electrical
bus. The cores 802 may obtain data, instructions, and/or signals
from one or more external devices by example interface circuitry
806. The cores 802 may output data, instructions, and/or signals to
the one or more external devices by the interface circuitry 806.
Although the cores 802 of this example include example local memory
820 (e.g., Level 1 (L1) cache that may be split into an L1 data
cache and an L1 instruction cache), the microprocessor 800 also
includes example shared memory 810 that may be shared by the cores
(e.g., Level 2 (L2) cache) for high-speed access to data and/or
instructions. Data and/or instructions may be transferred (e.g.,
shared) by writing to and/or reading from the shared memory 810.
The local memory 820 of each of the cores 802 and the shared memory
810 may be part of a hierarchy of storage devices including
multiple levels of cache memory and the main memory (e.g., the main
memory 714, 716 of FIG. 7). Typically, higher levels of memory in
the hierarchy exhibit lower access time and have smaller storage
capacity than lower levels of memory. Changes in the various levels
of the cache hierarchy are managed (e.g., coordinated) by a cache
coherency policy.
[0155] Each core 802 may be referred to as a CPU, DSP, GPU, etc.,
or any other type of hardware circuitry. Each core 802 includes
control unit circuitry 814, arithmetic and logic (AL) circuitry
(sometimes referred to as an ALU) 816, a plurality of registers
818, the L1 cache 820, and an example bus 822. Other structures may
be present. For example, each core 802 may include vector unit
circuitry, single instruction multiple data (SIMD) unit circuitry,
load/store unit (LSU) circuitry, branch/jump unit circuitry,
floating-point unit (FPU) circuitry, etc. The control unit
circuitry 814 includes semiconductor-based circuits structured to
control (e.g., coordinate) data movement within the corresponding
core 802. The AL circuitry 816 includes semiconductor-based
circuits structured to perform one or more mathematic and/or logic
operations on the data within the corresponding core 802. The AL
circuitry 816 of some examples performs integer based operations.
In other examples, the AL circuitry 816 also performs floating
point operations. In yet other examples, the AL circuitry 816 may
include first AL circuitry that performs integer based operations
and second AL circuitry that performs floating point operations. In
some examples, the AL circuitry 816 may be referred to as an
Arithmetic Logic Unit (ALU). The registers 818 are
semiconductor-based structures to store data and/or instructions
such as results of one or more of the operations performed by the
AL circuitry 816 of the corresponding core 802. For example, the
registers 818 may include vector register(s), SIMD register(s),
general purpose register(s), flag register(s), segment register(s),
machine specific register(s), instruction pointer register(s),
control register(s), debug register(s), memory management
register(s), machine check register(s), etc. The registers 818 may
be arranged in a bank as shown in FIG. 8. Alternatively, the
registers 818 may be organized in any other arrangement, format, or
structure including distributed throughout the core 802 to shorten
access time. The bus 822 may implement at least one of an I2C bus,
a SPI bus, a PCI bus, or a PCIe bus.
[0156] Each core 802 and/or, more generally, the microprocessor 800
may include additional and/or alternate structures to those shown
and described above. For example, one or more clock circuits, one
or more power supplies, one or more power gates, one or more cache
home agents (CHAs), one or more converged/common mesh stops (CMSs),
one or more shifters (e.g., barrel shifter(s)) and/or other
circuitry may be present. The microprocessor 800 is a semiconductor
device fabricated to include many transistors interconnected to
implement the structures described above in one or more integrated
circuits (ICs) contained in one or more packages. The processor
circuitry may include and/or cooperate with one or more
accelerators. In some examples, accelerators are implemented by
logic circuitry to perform certain tasks more quickly and/or
efficiently than can be done by a general purpose processor.
Examples of accelerators include ASICs and FPGAs such as those
discussed herein. A GPU or other programmable device can also be an
accelerator. Accelerators may be on-board the processor circuitry,
in the same chip package as the processor circuitry and/or in one
or more separate packages from the processor circuitry.
[0157] FIG. 9 is a block diagram of another example implementation
of the processor circuitry 712 of FIG. 7. In this example, the
processor circuitry 712 is implemented by FPGA circuitry 900. The
FPGA circuitry 900 can be used, for example, to perform operations
that could otherwise be performed by the example microprocessor 800
of FIG. 8 executing corresponding machine readable instructions.
However, once configured, the FPGA circuitry 900 instantiates the
machine readable instructions in hardware and, thus, can often
execute the operations faster than they could be performed by a
general purpose microprocessor executing the corresponding
software.
[0158] More specifically, in contrast to the microprocessor 800 of
FIG. 8 described above (which is a general purpose device that may
be programmed to execute some or all of the machine readable
instructions represented by the flowcharts of FIGS. 3 through 5 but
whose interconnections and logic circuitry are fixed once
fabricated), the FPGA circuitry 900 of the example of FIG. 9
includes interconnections and logic circuitry that may be
configured and/or interconnected in different ways after
fabrication to instantiate, for example, some or all of the machine
readable instructions represented by the flowchart of FIG. 3. In
particular, the FPGA circuitry 900 may be thought of as an array of
logic
gates, interconnections, and switches. The switches can be
programmed to change how the logic gates are interconnected by the
interconnections, effectively forming one or more dedicated logic
circuits (unless and until the FPGA circuitry 900 is reprogrammed).
The configured logic circuits enable the logic gates to cooperate
in different ways to perform different operations on data received
by input circuitry. Those operations may correspond to some or all
of the software represented by the flowchart of FIG. 3. As such,
the FPGA circuitry 900 may be structured to effectively instantiate
some or all of the machine readable instructions of the flowchart
of FIG. 3 as dedicated logic circuits to perform the operations
corresponding to those software instructions in a dedicated manner
analogous to an ASIC. Therefore, the FPGA circuitry 900 may perform
the operations corresponding to some or all of the machine
readable instructions of FIG. 3 faster than the general purpose
microprocessor can execute the same.
[0159] In the example of FIG. 9, the FPGA circuitry 900 is
structured to be programmed (and/or reprogrammed one or more times)
by an end user by a hardware description language (HDL) such as
Verilog. The FPGA circuitry 900 of FIG. 9 includes example
input/output (I/O) circuitry 902 to obtain and/or output data
to/from example configuration circuitry 904 and/or external
hardware (e.g., external hardware circuitry) 906. For example, the
configuration circuitry 904 may implement interface circuitry that
may obtain machine readable instructions to configure the FPGA
circuitry 900, or portion(s) thereof. In some such examples, the
configuration circuitry 904 may obtain the machine readable
instructions from a user, a machine (e.g., hardware circuitry
(e.g., programmed or dedicated circuitry) that may implement an
Artificial Intelligence/Machine Learning (AI/ML) model to generate
the instructions), etc. In some examples, the external hardware 906
may implement the microprocessor 800 of FIG. 8. The FPGA circuitry
900 also includes an array of example logic gate circuitry 908, a
plurality of example configurable interconnections 910, and example
storage circuitry 912. The logic gate circuitry 908 and
interconnections 910 are configurable to instantiate one or more
operations that may correspond to at least some of the machine
readable instructions of FIG. 3 and/or other desired operations.
The logic gate circuitry 908 shown in FIG. 9 is fabricated in
groups or blocks. Each block includes semiconductor-based
electrical structures that may be configured into logic circuits.
In some examples, the electrical structures include logic gates
(e.g., AND gates, OR gates, NOR gates, etc.) that provide basic
building blocks for logic circuits. Electrically controllable
switches (e.g., transistors) are present within each of the logic
gate circuitry 908 to enable configuration of the electrical
structures and/or the logic gates to form circuits to perform
desired operations. The logic gate circuitry 908 may include other
electrical structures such as look-up tables (LUTs), registers
(e.g., flip-flops or latches), multiplexers, etc.
[0160] The interconnections 910 of the illustrated example are
conductive pathways, traces, vias, or the like that may include
electrically controllable switches (e.g., transistors) whose state
can be changed by programming (e.g., using an HDL such as Verilog)
to activate or deactivate one or more connections between
one or more of the logic gate circuitry 908 to program desired
logic circuits.
[0161] The storage circuitry 912 of the illustrated example is
structured to store result(s) of the one or more of the operations
performed by corresponding logic gates. The storage circuitry 912
may be implemented by registers or the like. In the illustrated
example, the storage circuitry 912 is distributed amongst the logic
gate circuitry 908 to facilitate access and increase execution
speed.
[0162] The example FPGA circuitry 900 of FIG. 9 also includes
example Dedicated Operations Circuitry 914. In this example, the
Dedicated Operations Circuitry 914 includes special purpose
circuitry 916 that may be invoked to implement commonly used
functions to avoid the need to program those functions in the
field. Examples of such special purpose circuitry 916 include
memory (e.g., DRAM) controller circuitry, PCIe controller
circuitry, clock circuitry, transceiver circuitry, memory, and
multiplier-accumulator circuitry. Other types of special purpose
circuitry may be present. In some examples, the FPGA circuitry 900
may also include example general purpose programmable circuitry 918
such as an example CPU 920 and/or an example DSP 922. Other general
purpose programmable circuitry 918 may additionally or
alternatively be present such as a GPU, an XPU, etc., that can be
programmed to perform other operations.
[0163] Although FIGS. 8 and 9 illustrate two example
implementations of the processor circuitry 712 of FIG. 7, many
other approaches are contemplated. For example, as mentioned above,
modern FPGA circuitry may include an on-board CPU, such as one or
more of the example CPU 920 of FIG. 9. Therefore, the processor
circuitry 712 of FIG. 7 may additionally be implemented by
combining the example microprocessor 800 of FIG. 8 and the example
FPGA circuitry 900 of FIG. 9. In some such hybrid examples, a first
portion of the machine readable instructions represented by the
flowchart of FIG. 3 may be executed by one or more of the cores 802
of FIG. 8 and a second portion of the machine readable instructions
represented by the flowchart of FIG. 3 may be executed by the FPGA
circuitry 900 of FIG. 9.
[0164] In some examples, the processor circuitry 712 of FIG. 7 may
be in one or more packages. For example, the processor circuitry
800 of FIG. 8 and/or the FPGA circuitry 900 of FIG. 9 may be in one
or more packages. In some examples, an XPU may be implemented by
the processor circuitry 712 of FIG. 7, which may be in one or more
packages. For example, the XPU may include a CPU in one package, a
DSP in another package, a GPU in yet another package, and an FPGA
in still yet another package.
[0165] From the foregoing, it will be appreciated that example
apparatus, methods, and articles of manufacture have been disclosed
that predict results of activation functions in convolutional
neural networks.
[0166] To test the ability of the system illustrated in FIG. 1 to
predict the sign of partial convolution calculations, a series of
tests was run with standard CNN models. FIG. 10A illustrates an
example distribution graph of ReLU zero results across all layers
(i.e., nodes) of the ResNet-50 model. When a layer in the ResNet-50
model outputs a zero, the convolution value at that layer was not
utilized due to a negative result (thus, clamping the output to
zero).
[0167] The dataset used was the ImageNet inference dataset from
ILSVRC2012, which contains 50,000 images from 1,000 classes. As can
be seen, a significant number of results were clamped to zero.
Specifically, 61.14% of the outputs of the ReLU layers were zero
for the ResNet-50 architecture with pretrained ImageNet weights.
Additionally, as can be observed in FIG. 10A, deeper layers of the
model are sparser, with certain layers returning more than 80%
zeros across the dataset. The resulting outputs per layer have an
element value distribution that is mostly confined within -4 to +4
due to batch normalization, and 50% of the elements are confined
within an output range of -1 to +1.
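The clamping behavior described above can be sketched in a few lines of Python. This is an illustrative example only; the layer outputs below are synthetic values, not actual ResNet-50 activations:

```python
def relu(x):
    # ReLU clamps negative convolution outputs to zero
    return x if x > 0.0 else 0.0

# Synthetic convolution outputs (illustrative, not real model values)
conv_outputs = [-2.3, 0.7, -0.1, 3.2, -4.0, 1.1, -0.5, 0.0]
activations = [relu(v) for v in conv_outputs]

# Every negative input maps to zero, so the zero fraction directly
# measures how many convolution results were negative (or exactly zero).
zero_fraction = activations.count(0.0) / len(activations)
```

Here 5 of the 8 synthetic outputs clamp to zero, mirroring the kind of sparsity that the distribution graph of FIG. 10A reports at scale.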
[0168] FIGS. 10B-10D illustrate the accuracy of the predicted
negative result for a sample of three different convolution layers
in the ResNet-50 model across a range of mantissa bits used in the
prediction. The implemented prediction model accuracy shows that as
the upper mantissa bits utilized in the partial convolution
calculation (along with the sign bit and the exponent bits) are
increased from 0 to 3, the negative values correctly predicted
across the dataset increase from about 10% at 0 upper mantissa bits
up to about 70% at 3 upper mantissa bits. Specifically, this shows
the percentage of negative values matching between the predicted
value and the full-precision value computed using all 32 bits.
Thus, the 3 most significant (upper) mantissa bits, combined with
the sign bit and exponent bits of an FP32 input data value, allow
the model to predict almost 7 out of every 10 negative values.
Because 20 of the 32 bits then do not require circuitry
calculations, overall processing requirements are lowered. The
result also means that about 3 out of every 10 values the model
predicts as non-negative turn out to be negative once the full
mantissa is calculated to verify a negative or non-negative value.
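The bit-subset prediction described above can be sketched in software as follows. This is a minimal illustration assuming FP32 operands; the function names are hypothetical, and the patent's hardware operates on buffered bit subsets rather than on Python floats:

```python
import struct

def truncate_fp32(x, keep_mantissa_bits=3):
    # Round-trip x through IEEE-754 single precision, then clear the
    # low (23 - keep_mantissa_bits) mantissa bits, keeping the sign
    # bit, all 8 exponent bits, and the upper mantissa bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - keep_mantissa_bits
    bits &= ~((1 << drop) - 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def predicted_sign_negative(inputs, weights, keep_mantissa_bits=3):
    # Partial convolution (dot product) on truncated operands; only
    # the sign of the accumulated result is used as the prediction.
    acc = sum(truncate_fp32(a, keep_mantissa_bits) *
              truncate_fp32(w, keep_mantissa_bits)
              for a, w in zip(inputs, weights))
    return acc < 0.0  # True => predict the ReLU output is zero
```

Because the truncation preserves the sign and exponent exactly and only coarsens the mantissa, the truncated products stay close in magnitude to the full products, which is why the sign of this partial accumulation predicts the true sign as often as the figures report.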
[0169] FIG. 11A illustrates an example distribution graph of ReLU
zero results across all layers (i.e., nodes) of the VGG-16 model
when run through the same ImageNet dataset. Similar to the
ResNet-50 model above, if a given VGG-16 layer returns a 0 from a
ReLU activation function, the convolution calculation returned a
negative value, which was clamped to zero.
[0170] FIGS. 11B-11D illustrate the accuracy of the predicted
negative result for a sample of three different convolution layers
in the VGG-16 model across a range of mantissa bits used in the
prediction. As can be seen, the predicted negative accuracy ranges
between 60% and 80% when 3 mantissa bits are used in the upper
mantissa calculation. With the example preprocessor circuitries
102A-102C, 20-bit multiplication was eliminated in VGG-16 for about
48% of cases; the technique applies across all types of deep neural
networks/convolutional neural networks. For cases where the
predicted sign is positive, the computed result of the example
preprocessor circuitries 102A-102C can be saved in the DDC 116, and
the result of the remainder processing circuitry 104A-104C, which
performs multiplication of the remaining bits of the mantissa, is
then combined with it in the DDC 116.
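The combination step can likewise be sketched in software. Assuming each FP32 operand is split into an upper part (sign, exponent, top mantissa bits) and an exact remainder, the cross terms computed by the remainder processing recombine with the saved partial result to recover the full product; the function names below are illustrative, not the patent's:

```python
import struct

def split_fp32(x, keep_mantissa_bits=3):
    # "Upper" part: sign, 8 exponent bits, and the top mantissa bits
    # (the portion the preprocessor would use). The remainder is
    # whatever is left, so that upper + remainder == x.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - keep_mantissa_bits
    hi = struct.unpack("<f",
                       struct.pack("<I", bits & ~((1 << drop) - 1)))[0]
    return hi, x - hi

def partial_then_full_dot(inputs, weights, keep=3):
    # The partial (predictor) term uses only the upper parts; when the
    # sign is predicted positive, the remaining cross terms are
    # computed and combined with the saved partial result, analogous
    # to combining preprocessor and remainder results in the DDC.
    pairs = [(split_fp32(a, keep), split_fp32(w, keep))
             for a, w in zip(inputs, weights)]
    partial = sum(ah * wh for (ah, _), (wh, _) in pairs)
    remainder = sum(ah * wl + al * wh + al * wl
                    for (ah, al), (wh, wl) in pairs)
    return partial, partial + remainder
```

Because `upper + remainder == x` holds exactly for each operand, the combined result matches the full-precision dot product up to ordinary floating-point rounding of the individual terms, so no work from the partial calculation is wasted when the prediction is positive.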
[0171] From the foregoing, it will be appreciated that example
systems, methods, apparatus, and articles of manufacture have been
disclosed that predict the sign of an activation function in a
neural network. The disclosed systems, methods, apparatus, and
articles of manufacture improve the efficiency of using a computing
device by predicting the sign of an activation function used for
classification in a neural network prior to calculating all bits of
the mantissa. Predicting the sign of an activation function
accurately with less than full mantissa calculations reduces the
amount of compute cycles required to run a neural network. The
disclosed systems, methods, apparatus, and articles of manufacture
are accordingly directed to one or more improvement(s) in the
operation of a machine such as a computer or other electronic
and/or mechanical device.
[0172] Although certain example apparatus and articles of
manufacture have been disclosed herein, the scope of coverage of
this patent is not limited thereto. On the contrary, this patent
covers all systems, methods, apparatus, and articles of manufacture
fairly falling within the scope of the claims of this patent.
Further examples and combinations thereof include the
following:
[0173] [EXAMPLE PARAGRAPHS MAPPING TO ALL CLAIMS WILL BE INSERTED
WHEN A VERSION OF THE CLAIMS HAVE BEEN APPROVED]
[0174] The following claims are hereby incorporated into this
Detailed Description by this reference, with each claim standing on
its own.
* * * * *