Efficient robust speech recognition with empirical mode decomposition using an FPGA chip with dual core

Received Jan 01, 2020; Revised Feb 14, 2020; Accepted Feb 28, 2020
The purpose of this paper is to accelerate the computation of Empirical Mode Decomposition (EMD) on multi-core embedded systems for robust speech recognition. A reconfigurable chip, a Field Programmable Gate Array (FPGA), is used to implement the designed system. This paper applies EMD to decompose noisy speech signals into several Intrinsic Mode Functions (IMFs). These IMFs are multiplied by their corresponding weights, which are trained by Genetic Algorithms (GA), and combined to recover the original speech. After applying EMD, we obtain cleaner speech for recognition. Because EMD is computationally complex, a dual-core embedded system architecture on an FPGA is proposed to accelerate EMD for robust speech recognition. This enhances the efficiency of embedded speech recognition systems.


INTRODUCTION
Speech recognition has been under development for a long time. However, recognizing speech that is subject to environmental noise is still an open problem. The most important problem in robust speech recognition is the mismatch between the training and application environments caused by noise. Consequently, a speech sensor capable of noise cancellation is important for realizing robust speech recognition. Methods for handling the mismatch problem can be classified into two categories: feature-based methods and model-based methods. Feature-based methods focus on the feature parameters rather than on model parameters for speech or noise [1][2][3][4][5][6][7]. Model-based methods exploit prior knowledge about the distributions of speech and noise for speech feature enhancement [8][9][10][11][12][13]. In this paper, the noisy speech signals are processed to eliminate the noise components before the features are extracted as Mel-Frequency Cepstrum Coefficients (MFCC). Hence, the speech features are cleaner when they are fed into the speech recognition platform, and a better recognition rate for noisy speech can be obtained.
This study applies EMD to decompose noisy speech signals into components containing speech or noise. EMD was first proposed by Prof. Huang and combined with the Hilbert Transform (HT) to analyze nonlinear and non-stationary time series. The combination of EMD and HT is called the Hilbert-Huang Transform (HHT) [14]. EMD was initially applied to signal analysis in geoscience, strength analysis of material structures, trend analysis of the stock market, and so on. One goal of this paper is to find the weights corresponding to the different IMFs and to combine the weighted IMFs to recover the original speech signal. The weights for each IMF are trained by GA to find an optimal combination of IMFs. However, since the EMD process costs a lot of computation time, another goal of this paper is to implement a dual-core architecture on an FPGA to accelerate EMD.

EMPIRICAL MODE DECOMPOSITION (EMD)
In this section, the procedure for performing EMD is introduced. In addition, a strategy based on GA and EMD for robust speech recognition is proposed.

Procedure of EMD operations
The main step of the EMD operation is to divide a speech signal into several intrinsic mode functions (IMFs). The conditions for a data series to be an IMF are described in [14]. Let the original signal be $X(t)$ and let $\mathrm{Temp}(t) = X(t)$.
Step 1: Find the upper envelope $U(t)$ and the lower envelope $L(t)$ of the signal $\mathrm{Temp}(t)$. Calculate the mean of the two envelopes, $m(t) = [U(t) + L(t)]/2$. The component $h(t)$ of $\mathrm{Temp}(t)$ is obtained by equation (1):
$$h(t) = \mathrm{Temp}(t) - m(t) \tag{1}$$
Step 2: Check whether the signal $h(t)$ satisfies the conditions of an IMF. If it does, the first IMF is obtained as $\mathrm{imf}_1(t) = h(t)$; go to the next step. Otherwise, assign the signal $h(t)$ to $\mathrm{Temp}(t)$ and go to Step 1.
Step 3: Calculate the residue $r_1(t)$ as
$$r_1(t) = X(t) - \mathrm{imf}_1(t)$$
Assign the signal $r_1(t)$ to $X(t)$ and repeat Step 1 and Step 2 to find $\mathrm{imf}_2(t)$.
Step 4: Repeat Step 3 to find the subsequent IMFs as follows:
$$r_n(t) = r_{n-1}(t) - \mathrm{imf}_n(t)$$
This step ends when the signal $r_n(t)$ is a constant or a monotone function. After Step 1 to Step 4 of the EMD procedure are finished, the following decomposition of the original signal $X(t)$ is obtained:
$$X(t) = \sum_{i=1}^{n} \mathrm{imf}_i(t) + r_n(t)$$
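As an illustration of Step 1 to Step 4, the sifting procedure can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the implementation used on the FPGA: the function names, the linear interpolation of the envelopes (cubic splines are the more common choice), the pinning of the envelopes to the signal end points, the stopping tolerance `tol`, and the limits `max_sift` and `max_imfs` are all introduced here for illustration only.

```python
import numpy as np


def envelope_mean(x):
    """Return m(t) = [U(t) + L(t)] / 2, or None when x has too few extrema."""
    n = len(x)
    d = np.diff(x)
    maxima = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] >= 0))[0] + 1
    if len(maxima) < 2 or len(minima) < 2:
        return None  # treated as constant or monotone: nothing left to sift
    t = np.arange(n)
    # Assumption: linear envelopes, pinned to the first and last samples.
    upper = np.interp(t, np.r_[0, maxima, n - 1], np.r_[x[0], x[maxima], x[-1]])
    lower = np.interp(t, np.r_[0, minima, n - 1], np.r_[x[0], x[minima], x[-1]])
    return (upper + lower) / 2.0


def sift(x, max_sift=50, tol=0.05):
    """Steps 1 and 2: repeat h = Temp - m until h is accepted as an IMF."""
    temp = x.copy()
    for _ in range(max_sift):
        m = envelope_mean(temp)
        if m is None:
            break
        h = temp - m  # equation (1)
        # Assumed acceptance test: the removed mean is small relative to Temp.
        ratio = np.sum(m ** 2) / (np.sum(temp ** 2) + 1e-12)
        temp = h
        if ratio < tol:
            break
    return temp


def emd(x, max_imfs=8):
    """Steps 3 and 4: peel off IMFs until the residue has too few extrema."""
    residue = np.asarray(x, dtype=float).copy()
    imfs = []
    for _ in range(max_imfs):
        if envelope_mean(residue) is None:
            break
        imf = sift(residue)
        imfs.append(imf)
        residue = residue - imf  # r_n = r_{n-1} - imf_n
    return np.array(imfs), residue


# Example: a fast and a slow sinusoid mixed together.
t = np.linspace(0.0, 1.0, 800)
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 5 * t)
imfs, residue = emd(signal)
```

By construction, the sum of the returned IMFs plus the residue reproduces the input signal, which matches the decomposition of $X(t)$ given above.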

Combining GA with EMD for noise separation
In order to illustrate the effect of noise on the IMFs, EMD is first performed on a clean speech signal, and the obtained IMFs are shown in Figure 1 [15]. Only the leading five IMFs are shown, since the speech signal lies almost entirely in these IMFs; beyond them, hardly any speech components can be found. It can be seen that the higher the order of an IMF, the lower its frequencies. To examine which IMFs contain the noise and which contain the speech, white noise is added to the clean speech signal. The IMFs obtained from the EMD of this noisy speech signal are shown in Figure 2 [15]. Figure 2 shows that the noise lies almost entirely in the 1st IMF. Moreover, comparing Figure 1 with Figure 2, we find that the 2nd and 3rd IMFs of the noisy speech are very similar to the corresponding IMFs of the clean speech. We therefore conclude that the speech signal lies mostly in the 2nd and 3rd IMFs. However, experiments reveal that some speech components still exist in IMFs later than the 3rd. Indeed, the numerical results from our previous work, shown in Table 1 [15], indicate that speech components appear more evidently in the later IMFs as the magnitude of the added noise becomes larger. This reveals that previous works on EMD for speech signals, which used only the 2nd and 3rd IMFs to recover the original speech signal, lose some speech components in the later IMFs. Thus, this paper asserts that the later IMFs should also be included, each multiplied by a weight, to recover the original signal. In fact, the weights of the IMFs needed to recover the original speech vary with the SNR of the noisy speech. Consequently, this paper proposes a strategy that uses GA to train the optimal weighting of the IMFs to recover the speech signal under various noise strengths.
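The noisy test signals described above can be produced by scaling white noise to a target SNR. The following Python sketch shows one way to do this. It assumes Gaussian white noise and the usual power-ratio definition of SNR in decibels; the function name `add_white_noise` and its parameters are hypothetical and are not taken from [15].

```python
import numpy as np


def add_white_noise(clean, snr_db, seed=0):
    """Return (noisy, noise) where the noise power is set by the target SNR in dB."""
    rng = np.random.default_rng(seed)
    clean = np.asarray(clean, dtype=float)
    signal_power = np.mean(clean ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise, noise


# Example: a synthetic tone standing in for one second of 8 kHz speech.
clean = np.sin(2 * np.pi * np.arange(8000) / 80.0)
noisy, noise = add_white_noise(clean, snr_db=10.0)
```

Lowering `snr_db` increases the noise magnitude, which is the condition under which the later IMFs are reported to carry more of the speech.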
In the training phase of the weights for the IMFs, the chromosomes in GA are defined as $\mathrm{Chrm} = [w_1\ w_2\ \dots\ w_n]$, and the recovered speech $\hat{X}(t)$ is then expressed as:
$$\hat{X}(t) = \sum_{i=1}^{n} w_i\,\mathrm{imf}_i(t)$$
The fitness function for the GA used in this study is defined in terms of the output errors $E_{i,k,t}$, in which $E_{i,k,t}$ means the output error for the $t$-th record of the $i$-th literal by the $k$-th person.
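The weight training can be sketched as follows in Python. The population size (16), survival rate (0.5) and per-gene mutation rate (0.05) are the values reported in the experimental section; everything else is an assumption made for illustration. In particular, the fitness used here is the negative mean squared error against a clean reference signal, which stands in for the fitness function based on the output errors and is not the one used in this study. The function names, the uniform crossover, the weight range [0, 1] and the number of generations are likewise hypothetical.

```python
import numpy as np


def recover(weights, imfs):
    """Weighted sum of IMFs: sum_i w_i * imf_i(t)."""
    return weights @ imfs


def fitness(weights, imfs, clean):
    """Assumed stand-in fitness: negative MSE against a clean reference."""
    return -np.mean((recover(weights, imfs) - clean) ** 2)


def train_weights(imfs, clean, pop_size=16, survival=0.5, mutation=0.05,
                  generations=200, seed=0):
    rng = np.random.default_rng(seed)
    n = imfs.shape[0]
    pop = rng.uniform(0.0, 1.0, size=(pop_size, n))  # one row per chromosome
    n_keep = int(pop_size * survival)
    for _ in range(generations):
        scores = np.array([fitness(c, imfs, clean) for c in pop])
        parents = pop[np.argsort(scores)[::-1][:n_keep]]  # best half survives
        children = []
        while len(children) < pop_size - n_keep:
            a, b = parents[rng.choice(n_keep, size=2, replace=False)]
            child = np.where(rng.random(n) < 0.5, a, b)  # random crossover
            mutate = rng.random(n) < mutation            # per-gene mutation
            child = np.where(mutate, rng.uniform(0.0, 1.0, size=n), child)
            children.append(child)
        pop = np.vstack([parents, children])
    scores = np.array([fitness(c, imfs, clean) for c in pop])
    return pop[np.argmax(scores)]


# Example with synthetic "IMFs": the target uses only the 2nd and 3rd rows.
rng = np.random.default_rng(1)
imfs = rng.normal(size=(3, 200))
clean = imfs[1] + 0.5 * imfs[2]
best = train_weights(imfs, clean, generations=100)
```

Because the surviving parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next.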

IMPLEMENTATION OF THE DESIGNED SPEECH RECOGNITION SYSTEM ON FPGA
In this paper, the developed noise cancellation method for speech sensors and the speech recognition system were implemented on an FPGA-based SOC embedded platform. The block diagram of the SOC architecture used in this study is shown in Figure 3. An Altera DE2-70 development board, which includes a Cyclone FPGA chip, is used for this experiment. The constraint of this development board is that there is no operating system on the FPGA chip, so only single-threaded execution is available. This slows down the computation of the speech recognition system. On the board, a push button controls the start and end of voice recording, and a toggle switch controls the sampling rate of the audio codec. The EDA tools Quartus II, SOPC Builder and Nios II are used to develop and simulate the system. In the hardware implementation, SRAM and Flash RAM are used to store the source code and the testing signals, respectively. The I2C protocol is used to control the registers of the platform. In addition, the AUDIO Controller receives the speech data and SEG7 displays the recognition results. The standard control IPs supported by SOPC Builder are adopted to drive the necessary elements: SDRAM, SRAM and LCD. The push button and the toggle switch are connected through the built-in PIO. The SEG7 Controller and the AUDIO Controller are user-defined. In the experiment, the PLL is adjusted to a frequency of 100 MHz with a delay of 3 ns and supplies the system clock.

Dual Core Realization on FPGA
This paper used two Nios II/f (fast) cores, each clocked at 100 MHz, to implement the proposed system. The specifications of the Nios II can be seen in Figure 4. There are two 32-Mbyte SDRAMs on the embedded platform used in this experiment. The test speech and all the parameters for speech recognition, that is, the parameters for the classifier (an HMM here), the EMD and the GA, were stored in SDRAM 1, while the data shared by the two CPUs were stored in SDRAM 2. The memory allocation of SDRAM 1 for the boot programs of the two CPUs, called CPU1 and CPU2, is depicted in Figure 5. Moreover, the memory allocation of SDRAM for the parameters, functions and data necessary for the operation of CPU1 and CPU2 is depicted in Figure 6. The details of each segment are listed as follows:
• .text: the executable code
• .rodata: the read-only data
• .rwdata: variables and pointers
• .heap: dynamically allocated memory
• .stack: function-call parameters and data for temporary variables
The shared memory of the two CPUs is managed by a software MUTEX core supported by the Altera SOPC tools. The parallel processing of EMD by the two CPUs is achieved by separating a speech signal into two parts. The first part and the second part are sent to CPU1 and CPU2, respectively, for the EMD process. After the EMD process for both parts is completed, CPU1 accesses the shared memory for the EMD results and then performs speech recognition with the algorithm implemented on the FPGA. The cooperation of the two CPUs is depicted in Figure 7.
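The cooperation of the two CPUs can be illustrated with a small host-side Python sketch. It only models the coordination pattern (split the signal, process each half, publish the results to mutex-protected shared storage, then let CPU1 collect them) and is not the Nios II firmware. The function `run_dual_core`, the `process` callback and the dictionary standing in for the shared SDRAM are hypothetical names. In standard CPython, threads also do not run CPU-bound work in parallel, so this sketch does not reproduce the measured speed-up.

```python
import threading


def run_dual_core(signal, process):
    """Split the signal in two, process each half in its own worker,
    and collect both results from mutex-protected shared storage."""
    half = len(signal) // 2
    parts = {"cpu1": signal[:half], "cpu2": signal[half:]}
    shared = {}               # stands in for the shared SDRAM region
    mutex = threading.Lock()  # stands in for the mutex core

    def worker(name):
        result = process(parts[name])  # e.g. the EMD of this half
        with mutex:                    # only one writer at a time
            shared[name] = result

    threads = [threading.Thread(target=worker, args=(name,)) for name in parts]
    for th in threads:
        th.start()
    for th in threads:
        th.join()  # wait until both halves are finished

    with mutex:  # "CPU1" reads the shared results for recognition
        return shared["cpu1"], shared["cpu2"]


# Example with a placeholder in place of the EMD routine.
first, second = run_dual_core(list(range(10)), lambda part: [2 * v for v in part])
```

Each worker computes its result outside the lock and holds the mutex only while writing, which keeps the time spent in the shared region short.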

EXPERIMENTAL RESULTS
Ten speech utterances (0 to 9) are recorded at 8 kHz with 16-bit samples in mono. For GA, each generation has 16 chromosomes, the survival rate for each chromosome is 0.5, and the mutation rate for each gene is 0.05. In addition, the chromosomes of the parent generation are randomly crossed over to generate the chromosomes of the next generation. Table 2 and Table 3 show the time cost of EMD and speech recognition using a single core and dual cores, respectively. Moreover, the word-by-word comparisons of computation time between the single-core and dual-core architectures for EMD and speech recognition are depicted in Figure 8 and Figure 9, respectively. It is obvious that the time cost using dual cores is much less than that using a single core. The percentages of time saved for each utterance are listed in Table 4. According to Table 4, the time saved by using dual cores ranges from 23.55% to 49.85% for the EMD process and from 18.22% to 45.20% for recognition. The average

CONCLUSION
A dual-core architecture for reducing the computation time of EMD is proposed in this paper. EMD combined with GA is used to separate noise from contaminated speech. Ten utterances are recorded for the experiments. Experimental results show that the proposed dual-core architecture implemented on an FPGA saves 23.55% to 49.85% of the EMD computation time without degrading the speech recognition rates. However, since the computation time is still too long for real-time applications, more cores will need to be integrated in the future to increase the computing capability of the FPGA.