A design methodology for approximate multipliers in convolutional neural networks: A case of MNIST

ABSTRACT

circuits are designed with small calculation error. In addition, there has been much research on approximate multipliers that reduce the error in the final output of a CNN.
As mentioned above, approximate computing has been studied as an effective method specialized for CNNs [14][15][16]. An error-resilience analysis was performed to determine key constraints for the design of approximate multipliers employed in the resulting CNN structure [17]. Moreover, that paper showed the capability of the back-propagation learning algorithm for CNNs containing approximate multipliers on the MNIST and Street View House Numbers datasets. Hammad et al. simulated VGGNet [18] with an approximate multiplier in its layers [19]; the simulation results show that the approximate multiplier preserves the high accuracy of VGGNet. Sim et al. proposed a new Stochastic Computing (SC) multiplication algorithm and its vector extension, called SC-MVM (Matrix-Vector Multiplier) [20]. The experimental results show that the new SC-based CNN is more accurate and 40× to 490× more energy-efficient in computation than conventional SC-based ones. SC is one form of approximate computing and has been actively studied [21,22]. Courbariaux et al. proposed a method that trains neural network parameters as binary weights and activations [23]. This network, called BinaryNet, reduces memory usage and can use XNOR circuits as multipliers, so many approaches for its acceleration and/or implementation have been proposed [24][25][26][27]. Li et al. and Balaji et al. designed 8-bit fixed-point approximate multipliers for LeNet and implemented them on FPGAs [28,29]. These are more efficient in delay and resources than exact computation while keeping high MNIST classification accuracy. As these works show, approximate computing can reduce a CNN's computational resources and trained parameters.
In this paper, we present a design methodology of approximate multipliers for the MNIST CNN. As a preliminary step in the design of approximate multipliers, we design 64 multipliers by simply reducing the bit-width of an 8-bit multiplier, and evaluate the trade-off between the accuracy of MNIST classification and the area and delay of the multipliers. Based on this analysis, we design approximate multipliers with two methods. The first is a comprehensive analysis of the upper 2-bit partial products of the 8-bit multiplier, and the second is a selection of partial products according to the results of the first method and the analysis of the 8-bit multiplier. With these improvements, we further reduce the area and delay of the multipliers while keeping high accuracy in MNIST classification. The rest of this paper is organized as follows. Section 2 outlines a general CNN and the MNIST CNN used in this paper. Section 3 describes a simple approach to bit-width reduction and the corresponding experiments. Section 4 describes a design methodology for approximate multipliers with a two-step approach, and Section 5 concludes this paper with a summary.

CNN FOR MNIST CLASSIFICATION
In this section, we explain a general Convolutional Neural Network (CNN), its convolution layer, and the MNIST CNN used in this work. A CNN is a neural network with deep layers including convolution layers, and is known as a high-accuracy image classification model. One of the most popular image classification models is AlexNet [30]. In 2012, AlexNet won the ILSVRC by a large margin over other models. This victory triggered the prevalence of CNNs, and many neural networks for image classification are based on CNNs today.

Convolution layer
In this subsection, we explain a general convolution layer. A convolution layer performs a convolution operation on an input image with a filter (weights), adds a bias to the outputs, and propagates the outputs to the next layer or activation function. The filter and bias are trained and stored in the convolution layer. In the convolution operation, the products of the input and the filter are summed and stored in the corresponding output. The convolution operation is performed at an interval called the "stride". Figure 1 shows an example of the convolution operation. When a 5×5 input, a 3×3 filter, and a stride of 1 are given, the output size is 3×3. In this case, the convolution layer performs 81 multiplications per filter. In general, a CNN contains many filters. In this work, the CNN contains 30 filters, and the convolution layer performs 432,000 multiplications per image.
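The operation and multiplication counts above can be reproduced with a minimal direct-convolution sketch in Python (the function name `conv2d` is ours, not from the paper):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Direct 2-D convolution (no padding). Returns the output feature map
    and the number of scalar multiplications performed."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1   # output height
    ow = (iw - kw) // stride + 1   # output width
    out = np.zeros((oh, ow))
    mults = 0
    for y in range(oh):
        for x in range(ow):
            patch = image[y * stride:y * stride + kh,
                          x * stride:x * stride + kw]
            out[y, x] = np.sum(patch * kernel)  # multiply-accumulate
            mults += kh * kw
    return out, mults

# 5x5 input, 3x3 filter, stride 1 -> 3x3 output, 81 multiplications
out, mults = conv2d(np.ones((5, 5)), np.ones((3, 3)))
```

With a 28×28 MNIST image and a 5×5 filter, the same count gives 14,400 multiplications per filter, and 30 filters yield the 432,000 multiplications per image stated above.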

An example of CNN for MNIST
In this subsection, we explain the MNIST CNN used in this paper. Figure 2 shows the model of the MNIST CNN and the MNIST dataset. The MNIST CNN consists of Convolution (convolution layer), Pooling (max-pooling layer), Affine (affine layer), ReLU, and Softmax. The following paragraphs describe each of the layers and activation functions, except for the convolution layer.
Pooling selects the maximum of the input elements in the filter range and stores it in the corresponding output. This layer reduces the processing load with a constant approximation. This operation contains no parameters such as weights or biases, and it preserves the spatial information of the input image.
Affine multiplies each input element by a weight value, sums these products, and stores the result in the corresponding output. This layer connects all spatial information to each class. This operation contains weight and bias parameters and does not preserve the spatial information of the input image.
ReLU and Softmax are activation functions. ReLU clamps values of 0 or less to 0 and stores the corresponding output; it is known as the ramp function. Softmax converts input values into probability values. This function is used before the final output, and the class with the highest probability value is the CNN's output.
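The two activation functions can be sketched directly (the max-shift inside the softmax is a standard numerical-stability trick, not something the paper specifies):

```python
import numpy as np

def relu(x):
    # Ramp function: clamp values of 0 or less to 0.
    return np.maximum(x, 0.0)

def softmax(x):
    # Shift by the maximum before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

scores = np.array([2.0, -1.0, 0.5])
probs = softmax(relu(scores))   # probabilities sum to 1
```

The index of the largest entry of `probs` is the class the CNN outputs.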
The CNN was trained on the MNIST dataset. The MNIST dataset contains handwritten digits from "0" to "9" as a collection of 28×28 quasi-binary images whose pixel values range from "0" to "1". MNIST is one of the most elementary datasets for image classification with CNNs, so we first use this dataset to examine the impact of approximate multipliers on image classification.

PRELIMINARY ANALYSIS ON BIT-WIDTH
This section describes how to reduce the bit-width of multiplication and the experiments used to figure out the scale of multiplier that the CNN needs. These experiments are the preparation for Section 4, and their purpose is an exploration of multipliers that keep high image classification accuracy.

Bit-width reduction approach
In this work, we design multipliers of different bit-widths by reducing the partial products of an 8-bit multiplier. An 8-bit multiplier is already small in scale, but MNIST is known as a dataset that can be classified using small-scale circuits. Although the structure of BinaryNet differs from the CNN in this work, BinaryNet achieves high image classification accuracy, so MNIST classification does not seem to need high-precision multiplication.
The 8-bit multiplier is based on an exact 8-bit array multiplier. Since an 8-bit array multiplier is small, there is no need to apply a Wallace tree and/or Booth's algorithm, because of the poor trade-off between their overhead and the resulting performance improvement. Figure 3 shows an exact 8-bit array multiplier. The multiplication uses the multiplicand and the multiplier to obtain partial products, and the sum of the partial products is the product. The summation proceeds from upper right to lower left, and the product of the 8-bit multiplier is a 16-bit value. In this reduction, we replace bits of the multiplicand or the multiplier with "0" in order from the lowest-order bit. By this operation, the corresponding partial products become "0" and can be removed. In addition, since the multiplicand and the multiplier each carry 8 bits of information, we can reduce the multiplicand and the multiplier to different bit-widths. Figure 4 shows an example of a 7×6-bit multiplier: we replace the least-significant bit of the multiplicand and the lower 2 bits of the multiplier with "0", so we can delete 22 partial products and the lower 3 bits of the product. Because the multiplicand and the multiplier can each take 8 different bit-widths, we can create 64 multipliers with different bit-widths.
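The bit-width reduction can be modeled at the behavioral level by masking off the lower bits of each operand before an exact multiplication; a minimal sketch, assuming unsigned 8-bit operands (the function name `truncated_mult` is ours):

```python
def truncated_mult(a, b, a_bits, b_bits, width=8):
    """Zero the lower (width - a_bits) bits of the multiplicand a and the
    lower (width - b_bits) bits of the multiplier b, then multiply exactly.
    This models deleting the corresponding partial products."""
    a_mask = ((1 << a_bits) - 1) << (width - a_bits)  # keep upper a_bits
    b_mask = ((1 << b_bits) - 1) << (width - b_bits)  # keep upper b_bits
    return (a & a_mask) * (b & b_mask)

# 7x6-bit example: drop 1 bit of the multiplicand and 2 bits of the
# multiplier; the lower 3 bits of the product are then always zero.
p = truncated_mult(0b10110111, 0b01101101, 7, 6)
```

With `a_bits = b_bits = 8` no bits are masked and the function reduces to exact 8×8-bit multiplication.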

Preliminary results
We designed the MNIST CNN so that our own multipliers could be applied. We applied the multipliers with different bit-widths to the MNIST CNN and evaluated the trade-offs among the accuracy of MNIST classification and the area and delay of the multiplier. These experiments reveal the scale of multiplier that the CNN needs, and in the following section we improve the multiplier that keeps high image classification accuracy by referring to the experimental results. Because the multiplicand and the multiplier can take different bit-widths, we can create 64 multipliers. We use CNN parameters learned from 2,000 training images and measure the image classification accuracy when the CNN recognizes 1,000 test images. The accuracy with exact computation, i.e., 32-bit floating-point multiplication, is 98.4%. The multipliers are described in Verilog. We synthesize the 64 multipliers with LeonardoSpectrum, optimized to minimize area, and measure the number of gates and the critical path delay using Nangate45, a 45 nm process library. Figures 5(a)-(c) show the experimental results of the bit-width reduction, where the filter data are the multiplicand and the image data are the multiplier. Figure 5(a) shows the accuracy, Figure 5(b) the area, and Figure 5(c) the delay. In Figure 5(a), the accuracy at 8×8 bits is 98.4%, and the accuracies at 3×2 bits and 4×1 bits are almost the same as the exact one. The accuracy at 2×1 bits is 97.1%, and the accuracy below 2×1 bits decreases significantly. Whenever the filter bit-width is 1, the accuracy is extremely low. Therefore, the MNIST CNN needs a multiplier of about 2×2 bits or more to keep high image classification accuracy. Figure 5(b) shows that the area of the multiplier decreases in proportion to its scale.
The area at 3×2 bits is 15 gates, consisting of an HA (half adder) and many AND gates; the area at 4×1 bits is 4 gates (four AND gates); and the area at 2×1 bits is 2 gates (two AND gates). Figure 5(c) shows that the delay of the multiplier decreases in proportion to its scale. The delay at 3×2 bits is 0.10 ns, and the delay at 4×1 bits is 0.02 ns, the same as at 2×1 bits. Table 1 details the cases with the upper 3 bits of the image bit-width. The 3×2-bit and 4×1-bit cases have the same accuracy and contain five partial products. However, the area of the 3×2-bit multiplier is larger than that of the 4×1-bit one, and its delay is longer. Therefore, the 4×1-bit case is more power-efficient than the 3×2-bit case. Moreover, the filter data (multiplicand) are more important to the MNIST CNN than the image data (multiplier). In the experimental results, the MNIST CNN needs 4 gates to achieve almost the same accuracy as classification with exact computation. In addition, with 3 gates at 3×1 bits, the accuracy is 98.0%, so the MNIST CNN needs a multiplier of about 3×2 bits, 4×1 bits, or more to keep high image classification accuracy. Therefore, in the next section, the proposed design methodology is used to achieve higher accuracy with 3 or fewer gates.

A DESIGN METHODOLOGY FOR APPROXIMATE MULTIPLIERS
This section describes a design methodology of approximate multipliers for the MNIST CNN. In the first subsection, we give an overview of the design methodology of approximate multipliers for CNNs. In the next two subsections, we apply the design methodology to the MNIST CNN. At every step, we evaluate the trade-offs among the accuracy of MNIST classification and the area and delay of the multiplier.

A design methodology
This subsection gives an overview of the proposed design methodology of approximate multipliers for CNNs. Figure 6 is a flowchart of the proposed design methodology. In the first half of the methodology, we analyze the partial products of the exact multiplier and select the partial products significant for the CNN as an approximate multiplier. If the accuracy of the CNN is not sufficiently high, or the area or delay does not satisfy the user's requirements, we analyze a broader scope again.
In the second half of the methodology, we analyze the partial products of the exact multiplier in more detail and combine them with the approximate multiplier designed in the first half. If the accuracy of the CNN is lower than the exact one, we analyze a broader scope again. With this design methodology, we can design approximate multipliers that maintain a high recognition rate of the CNN without compromising the user's requirements. In the next two subsections, we apply the design methodology to the MNIST CNN.

Comprehensive analysis of 2-bit partial products
This subsection describes the example of the comprehensive analysis of the 2×2-bit multiplier. According to the experimental results of Section 3, we analyzed only the upper 2-bit partial products. Figure 7 shows an exact 2×2-bit multiplier, where 'a' is a filter bit, 'b' is an image bit, 'p' is the product, and HA is a half-adder circuit. Whereas in Section 3 we reduced the partial products from the lower bits, in our design methodology we select one to four partial products. We comprehensively analyzed the accuracy of MNIST classification and the area and delay of the approximate multipliers composed of one to four partial products of the 2×2-bit multiplier. Through this experiment, we can explore the combinations of partial products that keep high image classification accuracy. The experimental conditions are based on Section 3. Table 2 shows the results of the comprehensive analysis of the 2-bit partial products. When the number of partial products is 1, the case of a7b6 has the highest accuracy, but it is not high enough. When the number of partial products is 2, the case of a6b7 and a7b7 has the highest accuracy, 97.1%; in this case, the number of gates is 2 and the delay is 0.02 ns. When the number of partial products is 3 or 4, no other combination of partial products is more efficient than the case of a6b7 and a7b7. Figure 8 shows the logic-level circuit of a7b7 and a6b7, with the HAs removed. In the second half of the methodology, we try to achieve higher accuracy with a6b7, a7b7, and one more partial product.
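The two-partial-product multiplier selected above reduces to two AND gates with no adders, since a7b7 and a6b7 land on different product bits (p14 and p13). A minimal behavioral sketch in Python for unsigned 8-bit operands (the function name is ours):

```python
def approx_mult_2pp(a, b):
    """Approximate 8-bit multiplier built from only two partial products:
    a7*b7 contributes to bit 14 and a6*b7 to bit 13 of the product.
    With the HAs removed, each partial product is a single AND gate."""
    a7 = (a >> 7) & 1
    a6 = (a >> 6) & 1
    b7 = (b >> 7) & 1
    # The two partial products occupy distinct bit positions,
    # so OR-ing the shifted bits equals their sum.
    return ((a7 & b7) << 14) | ((a6 & b7) << 13)

p = approx_mult_2pp(255, 255)   # exact product would be 65025
```

Any input with b7 = 0 yields a product of 0, which illustrates how aggressive this two-gate approximation is.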

Selection of the third uppermost partial product
This subsection describes the example of selecting the third uppermost partial product. Figure 9 shows the logic-level circuit of a6b7, a7b7, and aibj, where aibj is one of the ten partial products that can be output higher than p11, the fifth uppermost product bit. For these ten cases, we analyze the accuracy of MNIST classification and the area and delay of the multiplier. The experimental conditions are based on Section 3. Table 3 shows the results of selecting the third significant partial product. The number of gates and the delay remain 3 gates and 0.02 ns, respectively, in all cases. The highest accuracy, 98.4%, is achieved with a5b6, which is the same result as exact computation. Therefore, we achieve the same MNIST classification accuracy as exact computation by using an approximate multiplier composed of a7b7, a6b7, and a5b6. Figure 10 shows the logic-level circuit of a6b7, a7b7, and a5b6. In the case of the MNIST CNN, the proposed design methodology is efficient for designing approximate multipliers with small scale, low delay, and high image recognition accuracy. Because a6b7 and a7b7 contain neither a5 nor b6, we consider that a greater variety of multiplicand and multiplier bits improves the CNN's classification accuracy. In addition, a5b6 usually contributes to p11, but here it is output to p12. Therefore, we consider that higher-bit information is not always more important to image classification with a CNN than lower-bit information. Table 4 compares the multipliers with simple bit-width reduction and the approximate multipliers based on the proposed methodology. The labels 8×8, 3×2, 2×2, and 2×1 are based on the analysis of Section 3, the label AP (approximate) 2×1 is based on Subsection 4.2, and the label AP 3×2 is based on Subsection 4.3. The AP 3×2-bit multiplier has the same MNIST recognition accuracy as the exact 8×8- and 3×2-bit multipliers.
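The resulting three-gate multiplier can be sketched behaviorally as follows, with a5b6 routed to p12 as described above rather than its usual position p11 (the function name is ours):

```python
def approx_mult_3pp(a, b):
    """Approximate 8-bit multiplier from three AND gates and no adders:
    a7*b7 -> bit 14, a6*b7 -> bit 13, and a5*b6 routed to bit 12
    (one position above its usual weight of 11, as in the text)."""
    a7 = (a >> 7) & 1
    a6 = (a >> 6) & 1
    a5 = (a >> 5) & 1
    b7 = (b >> 7) & 1
    b6 = (b >> 6) & 1
    # All three partial products occupy distinct bit positions,
    # so OR-ing the shifted bits equals their sum.
    return ((a7 & b7) << 14) | ((a6 & b7) << 13) | ((a5 & b6) << 12)

p = approx_mult_3pp(255, 255)   # exact product would be 65025
```

Compared with the two-partial-product version, inputs with b7 = 0 but b6 = 1 no longer collapse to a zero product, which is consistent with the accuracy recovering to the exact-computation level.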
In addition, the area of the AP 3×2-bit multiplier improved by 80% over the area of the exact 3×2-bit multiplier, and the delay of the AP 3×2-bit multiplier improved by 80% over the delay of the exact 3×2-bit multiplier. Therefore, the design methodology of approximate multipliers is efficient for the MNIST CNN.

CONCLUSIONS
In this paper, we proposed a design methodology of approximate multipliers for CNNs. We achieved high MNIST classification accuracy with approximate multipliers based on the proposed methodology. With this methodology, the approximate multiplier achieved the same MNIST classification accuracy as exact computation while improving area and delay. In the future, we plan to apply the proposed methodology to more complex and larger-scale CNNs. Moreover, we plan to implement the approximated CNN on an FPGA and analyze its impact.