Energy and Area Effective Hardware Design of Lifting Approach Discrete Wavelet Transform

Received Aug 1, 2018; Revised Oct 20, 2018; Accepted Oct 7, 2018
This paper presents a low-power Discrete Wavelet Transform (DWT) architecture comprising forward and inverse multilevel transforms for a 5/3 lifting scheme (LS) based wavelet filter. The LS filter consists of integer adders and binary shifters rather than the multipliers and dividers used in convolution-based filters; hence it is better suited to energy-efficient hardware. The proposed architecture is described in VHDL. The VHDL code has been simulated and synthesized to obtain a gate-level design that can be implemented effectively in hardware. The Quartus II 9.1 synthesis tools were used to implement the 2D-DWT VHDL code on an Altera DE2 development board with a Cyclone II FPGA device. By targeting the physical FPGA device, the proposed LS wavelet architecture considerably decreases the required hardware cost and power consumption of the design. The architecture uses 127 logic and register elements (slices), less than 1% of the 33216 available, and consumes only 0.033 W. Simulations with gray-scale images of different sizes validate the proposed design and show a speed suitable for numerous real-time applications.

with low complexity and utilize small amounts of memory, they frequently cause blocking artifacts in low-bit-rate transmission [5]. Therefore, these limitations need to be removed and improved features added.
The major feature of the wavelet filter is that it preserves locality information in the transformed result, thereby avoiding the blocking effect of the DCT [6]. After the introduction of the Discrete Wavelet Transform (DWT), numerous codec algorithms were developed to compress the transform coefficients as far as possible [7], including Embedded Zerotree Wavelet (EZW) [8], Set Partitioning in Hierarchical Trees (SPIHT) [9], and Embedded Block Coding with Optimized Truncation (EBCOT) [10]. The most prominent image compression standard, JPEG2000, adopted the EBCOT image codec algorithm [11].
Since the approval of the JPEG2000 image coding standard in 2002, cost efficiency and real-time constraints have remained the major obstacles to hardware implementation of the standard in consumer products [12]. The LS discrete wavelet transform, developed to support real-time image processing requirements, has attracted considerable interest [13]. Presently, the focus is on developing the most effective approach to reduce hardware cost and complexity while still meeting real-time system requirements [14]. These algorithms entail very complicated hardware and demand high energy for data processing because of their computational complexity. In hardware implementations of data-intensive algorithms such as the 2D-DWT, the energy consumed by data storage and transmission forms the major part (about 80%) of the total power budget [16]. The hardware modeling of the 2D-DWT is still at an early stage, since it is a new field of research. Field Programmable Gate Array (FPGA) technology now offers a viable platform for portable devices and real-time applications, making it possible to design high-speed computing systems with reprogrammable features. FPGA devices therefore deserve attention for designing cost-effective, high-performance systems using a Hardware Description Language (HDL) [17]. The proposed 2D-DWT LS architecture is developed using the Very High Speed Integrated Circuit Hardware Description Language (VHDL) methodology.

RELATED WORKS
Numerous computation architectures for implementing the multilevel 2D-DWT with joint lossy and lossless transforms have been proposed in related studies. The key objective of the study in [18] is to embed the 5/3 wavelet computation into the 9/7 computation so as to reduce the number of adders compared to other solutions. In [19], the proposed architecture can be reconfigured for the 5/3 and 9/7 wavelet transforms to lower the power consumption and hardware cost of the design. In [20], the 1D-DWT architecture can be extended to 2D-DWT architectures, and it is analogous to those developed in [21] and [22]. The design in [23] supports a range of transforms such as the 1D-DWT, the 2D-DWT, and multilevel decomposition with the 5/3 DWT. The most frequently used computation schedules in 2D-DWT image codec designs are the row-column (RC) style [7], the line-based (LB) approach [24], and the block-based (BB) approach [25]. The major difference between BB and LB lies in how the original image is traversed. In particular, BB operates on non-overlapping blocks of the image, while LB processes non-overlapping groups of lines. RC is the simplest 2D-DWT image codec design and uses level-by-level logic [26][27][28][29].
In this study, a modified RC hardware 2D-DWT architecture is designed based on the embedded-extension 5/3 LS, reducing the volume of computations as well as the time taken to execute the multilevel wavelet decomposition. Moreover, it uses fewer logic element slices in the target FPGA device, which allows the 2D-DWT module to serve a wider range of real-time and memory-limited mobile applications. It improves wavelet image transmission with various sub-bands, although it limits the quality of the resulting image. The modified RC hardware 2D-DWT architecture is therefore an energy-efficient transmission scheme that is advantageous for applications where image quality is not a major requirement. The paper is organized as follows. Section two explains the wavelet-based image compression system. Section three describes the hardware design methodology of the 2D-DWT VHDL implementation. Section four discusses the performance results, and section five concludes.

PROPOSED 5/3 LS FDWT ARCHITECTURE DESIGN
This section describes the architecture of the programmable DWT processor. The processor can execute the 1D-DWT and the 2D-DWT with multilevel decomposition according to user demands. The proposed 2D-DWT design has been verified in VHDL. The developed 2-D FDWT module comprises three main components: the 5/3 Wavelet Transform Unit (WTU) core, the memory unit, and the 2D-DWT control unit. The first is a synthesizable 5/3 WTU, which forms the core of the design. This core block performs the actual wavelet computation on the image data. Design speed-up and fast implementation were achieved through parallel processing of the lifting modules and reuse of image pixel data. The module fetches the input pixel coefficients from memory under the control signals produced by the 2D-DWT control unit and then executes the 5/3 wavelet transform on them. The calculated wavelet coefficients are then written back to the external memory. Four input registers and two output registers hold four input samples and two output values (approximation and detail) concurrently. The input pixels are accessed through a four-sample register (temporary storage) so that the predict and update steps can run concurrently, as illustrated in Figure 1. These samples can be input samples or earlier coefficients, depending on the "first" signal. If the first signal = 1, input samples are written into the 1st, 2nd, and 3rd addresses.
To carry out the 1-D transforms row-wise, the 2D_DWT Control module starts the 1D_DWT control unit by applying the reset signal. If the external Reset signal is asserted, it disables the 5/3 WTU core and the RAM block and waits for the positive edge of the start signal; meanwhile, no values are written to the data bus. The source image is first stored in an internal RAM. The DWT_2D_Control block triggers the processor upon receiving an active start signal from the system environment. The start signal allows the processor to read 8-bit pixel data from the starting location of the internal RAM and begin the transform computation. The data are read from the memory in sequence.
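The predict and update steps performed by the WTU core can be sketched in software as follows. This is a minimal illustration that assumes the standard reversible 5/3 lifting equations with symmetric boundary extension, as used in JPEG2000; it is not the modified filter or the VHDL of the proposed design, and the function name `fdwt53_1d` is hypothetical. It shows that only integer additions, subtractions, and shifts are needed.

```python
def fdwt53_1d(x):
    """Forward 5/3 lifting on an even-length list of integers.

    Returns (approx, detail). Assumes the standard reversible 5/3
    equations; boundaries use symmetric extension.
    """
    n = len(x)
    assert n >= 2 and n % 2 == 0, "sketch handles even lengths only"
    half = n // 2

    # Predict: each odd sample minus the floor-average of its even neighbours.
    detail = []
    for i in range(half):
        left = x[2 * i]
        # Symmetric extension: the sample past the end mirrors x[n-2].
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]
        detail.append(x[2 * i + 1] - ((left + right) >> 1))

    # Update: each even sample plus a rounded quarter of the neighbouring details.
    approx = []
    for i in range(half):
        # Symmetric extension: the detail before the start mirrors detail[0].
        d_prev = detail[i - 1] if i > 0 else detail[0]
        approx.append(x[2 * i] + ((d_prev + detail[i] + 2) >> 2))

    return approx, detail
```

For a linear ramp of eight samples, 0 to 7, the detail coefficients are zero except at the right boundary, which reflects how the predict step removes locally linear content.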
For every computation level, pixel values are first read in row by row. This continues until the pixel values of all rows have been read and the transformed values saved in the RAM. Once all rows have been read and the computation is complete, the results are written to the internal RAM when the write signal is asserted. After all rows of the image have been transformed, the 2D_DWT Control module restarts the transformation column-wise, thereby completing the level-1 transform. The modified pixel values of the new image data in the RAM are read in column by column and processed in the same way as the rows. The results are saved in the internal RAM after the computations for all columns are finished. The size of the internal RAM is double the original image size. After the processor has completed the transformation for all requested levels, it asserts the ready signal to indicate that the transformed image data can now be read from the RAM.
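The row pass followed by the column pass, repeated per level, can be sketched as below. This is an illustration under stated assumptions rather than the proposed controller: it uses the standard reversible 5/3 lifting step, it stores approximation coefficients in the first half and detail coefficients in the second half of each row or column (a layout chosen here for readability), and it requires image dimensions divisible by 2^levels. The names `lift53` and `fdwt53_2d` are hypothetical.

```python
def lift53(x):
    """One 1-D 5/3 lifting pass; returns approximations followed by details."""
    n = len(x)
    half = n // 2
    # Predict step with symmetric extension at the right boundary.
    d = [x[2 * i + 1] - ((x[2 * i] + x[min(2 * i + 2, n - 2)]) >> 1)
         for i in range(half)]
    # Update step with symmetric extension at the left boundary.
    s = [x[2 * i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2)
         for i in range(half)]
    return s + d


def fdwt53_2d(image, levels):
    """Row-column multilevel transform on a list of lists of integers."""
    out = [row[:] for row in image]
    rows, cols = len(out), len(out[0])
    for _ in range(levels):
        # Horizontal pass: transform every row of the current sub-image.
        for r in range(rows):
            out[r][:cols] = lift53(out[r][:cols])
        # Vertical pass: transform every column of the row-transformed data.
        for c in range(cols):
            column = lift53([out[r][c] for r in range(rows)])
            for r in range(rows):
                out[r][c] = column[r]
        # The next level works only on the low-low quadrant.
        rows //= 2
        cols //= 2
    return out
```

For a constant 4×4 image and two levels, all energy ends up in the single top-left coefficient, which is a quick sanity check of the traversal order.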
This is done by passing parameters to the 2D-DWT control unit, which requests a transform from the 5/3 WTU, waits for it to complete, and repeats this for all rows and columns during the horizontal and vertical passes until the 2D-DWT process ends. The total amount of computation depends on the number of levels, given by the NL signal. Each level requires reading a definite number of rows and columns, representing a specified number of pixels. The third component is the memory module, which stores the original input image pixels and the resulting wavelet transform coefficients. The memory services a request only if the write or the read signal is active high. For simulation purposes, the input image data are enabled by the memory read signal. The input image pixels are loaded into the external memory directly from the input text file in which they are stored. The resulting wavelet transform coefficients are written to the output text file. For each computation phase, pixel values are first read row by row. This continues until the pixel values of all rows have been read and the transformed coefficients stored in the memory. Applying the inverse wavelet transform to the calculated DWT coefficients reproduces the original image with the same number of coefficients.
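The recovery of the original samples by the inverse transform can be illustrated in one dimension. The sketch below assumes the standard reversible 5/3 lifting equations with symmetric extension, not the modified filter of the proposed design; `fdwt53_1d`, `idwt53_1d`, and the sample values are hypothetical. The inverse simply undoes the update step and then the predict step with the signs reversed.

```python
def fdwt53_1d(x):
    """Forward 5/3 lifting (standard reversible form); returns (approx, detail)."""
    n = len(x)
    half = n // 2
    d = [x[2 * i + 1] - ((x[2 * i] + x[min(2 * i + 2, n - 2)]) >> 1)
         for i in range(half)]
    s = [x[2 * i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2)
         for i in range(half)]
    return s, d


def idwt53_1d(s, d):
    """Inverse 5/3 lifting: undo update, then undo predict."""
    half = len(s)
    n = 2 * half
    x = [0] * n
    # Undo the update step to recover the even samples.
    for i in range(half):
        x[2 * i] = s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2)
    # Undo the predict step to recover the odd samples.
    for i in range(half):
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]
        x[2 * i + 1] = d[i] + ((x[2 * i] + right) >> 1)
    return x


pixels = [12, 200, 37, 37, 0, 255, 128, 64]  # arbitrary 8-bit sample values
approx, detail = fdwt53_1d(pixels)
restored = idwt53_1d(approx, detail)
```

Because every lifting step is an integer operation that is subtracted back exactly, `restored` equals `pixels`, and the number of coefficients equals the number of input samples.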

HARDWARE IMPLEMENTATION RESULTS AND ANALYSIS FOR PROPOSED SYSTEM
Conventionally, image processing systems employ two memory blocks, one to store the original image and one for the outputs. An in-place mapping scheme was used here to avoid the second block: the filter's outputs are written over memory contents that have already been used and are no longer required, as shown in Figure 4.
Taking the first 8 successive samples of a selected row of the Lena image, the input coefficients are assumed to be transferred from the input memory (IN) to the filter; the input memory stores coefficient IN(0)=A1 at address 0, coefficient IN(1)=A2 at address 1, and so on, as illustrated in Figure 5.
The pair of coefficients L1(0)=A1, H1(0)=A2 is formed by filtering the first 3 input coefficients after applying the boundary extension. Since these input coefficients are already held in the memory and will not be fetched again from the input sample memory, L1(0) and H1(0) can be stored in their place (addresses 0 and 1). Similarly, coefficients L1(i) and H1(i) can be saved at addresses 2i and 2i+1, respectively, as shown in Figure 6.
The two basic requirements of high-performance 2D-DWT architectures are efficient use of area and high speed. This study proposes a resource-efficient architecture for multilevel 2D-DWT decomposition using the LS filter. In contrast to the usual JPEG2000 Le Gall wavelet transform, the proposed LS is simpler, faster, and requires fewer computation operations. The proposed low-power 2D-DWT architecture using the 5/3 LS is implemented on the Altera DE2 Development and Education board, and the target device is the Cyclone II EP2C35F672C6 FPGA. The physical hardware layout is produced with the Quartus II synthesis tools. Synthesis is a valuable design technique that takes the VHDL code as its source and translates it automatically into a netlist. The proposed 5/3 LS showed considerable improvements in the overall number of calculations and a lower gate count compared to the conventional Le Gall 5/3 lifting filter coding process, as shown in Tables 1 and 2. The synthesis results indicate that the selected algorithm meets the prerequisites of the design procedure. Therefore, a behavioral model can be developed in VHDL and used for the discrete wavelet transform in image processing. The design, which is finally mapped to the gate level, can thus meet real-time requirements.
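The in-place mapping described above, with L1(i) kept at address 2i and H1(i) at address 2i+1, can be sketched on a single buffer. This is an illustration that assumes the standard reversible 5/3 lifting equations with symmetric extension; the function name `lift53_inplace` is hypothetical and the buffer stands in for the single memory block.

```python
def lift53_inplace(mem):
    """Overwrite mem with interleaved 5/3 coefficients (no second buffer).

    After the call, even addresses hold approximation (L) coefficients and
    odd addresses hold detail (H) coefficients. Even length is assumed.
    """
    n = len(mem)
    assert n >= 2 and n % 2 == 0

    # Predict: each odd address is overwritten by its detail coefficient.
    # Even addresses are untouched in this loop, so the original neighbours
    # are still available when they are needed.
    for k in range(1, n, 2):
        right = mem[k + 1] if k + 1 < n else mem[k - 1]  # symmetric extension
        mem[k] -= (mem[k - 1] + right) >> 1

    # Update: each even address is overwritten by its approximation
    # coefficient, using the details already stored at the odd addresses.
    for k in range(0, n, 2):
        left = mem[k - 1] if k > 0 else mem[1]  # symmetric extension
        mem[k] += (left + mem[k + 1] + 2) >> 2

    return mem
```

For the eight-sample ramp 0 to 7, the buffer becomes `[0, 0, 2, 0, 4, 0, 6, 1]`, that is, the approximation values 0, 2, 4, 6 at the even addresses and the detail values 0, 0, 0, 1 at the odd addresses.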
After synthesis and the other verification steps, a Register Transfer Level (RTL) simulation of the DWT module was carried out. The RTL or technology map view helps to check the design visually. It is therefore necessary to write the behavioral description and then simulate it with different versions of the gray-scale test image data, i.e. to write a test bench that verifies the functionality. The test bench is written for the four modules in the same language (VHDL). The VHDL module has been verified by simulation using the ModelSim-Altera software. The Mentor Graphics ModelSim-Altera software can carry out a timing simulation of a VHDL design from the ModelSim-Altera interface or from the command line, as illustrated in Figure 7.
The stimulus signals applied to the FDWT module by the test bench environment are Reset, Clock, and Start. Ready is the output signal returned to the control unit. The simulation was run in ModelSim-Altera 6.5b. The design applies equivalent code in the inverse (IDWT) module, since the inverse transform can be derived directly from the same structural FDWT design. The original pixel data input to the FDWT can be completely recovered from the approximation (average) and detail (wavelet) coefficients. The FDWT and IDWT, both implemented with the lifting scheme using the 5/3 wavelet transform, have similar computational complexity, since they require similar logic resources. The IDWT synthesis process also produces the RTL schematics successfully. Each level of computation requires a fixed number of clock cycles, counted from the assertion of the start signal until the ready signal is raised. The clock, start, and reset input signals from the test bench environment and the ready output signal of the FDWT module are shown in Figure 8. The energy dissipated during the 2D-DWT decomposition is estimated by counting the number of computation operations needed to decompose an image together with the memory data-access load. Thus, the number of cycles used by the 2D-DWT coding process is the most significant factor for energy. Increasing the number of computation levels increases the number of clock cycles needed to execute the task. This is due to the heavy memory access, and hence the large number of cycles, associated with the larger embedded buffer used to store the image information.
Thus, the computation time also rises with the decomposition level for both the 1D-DWT and 2D-DWT processes. In this study, the computation time was normalized to the internal clock rate. Smaller image versions show considerably lower computation time than larger image versions, as presented in Table 3. The overall time expected of the FDWT encoder for all image size versions is calculated as. As the decomposition level of the image rises, more approximation and detail data become available. The processing time varies with both the image size and the number of decomposition levels. The proposed architecture has a lower computation time than the conventional JPEG2000 lifting, as shown in Table 4. The design of a PCB must take into account an approximate value of the device's power consumption in order to develop a suitable power budget and to design related elements such as power supplies, voltage regulators, heat sinks, and the cooling system. To estimate the power consumed by all modules, the power estimator tool in Altera Quartus version 9.1 was used. The power report of the proposed 2D-DWT structure is shown in Figure 9.
The PowerPlay Power Analysis tools are used to evaluate device power consumption accurately. Hardware design involves trade-offs among limiting factors such as area, speed, and power, which are tuned to improve efficiency. The input/output (I/O) pins consume a large amount of power because they are built in a larger geometry than the core in order to support sinking currents for all of the I/O standards [31]. Nonetheless, all the designs have roughly the same number of I/O pins, yet their power consumption varies [31]. The power evaluation showed that the selected chipset consumed around 0.033 W, compared with the other components of the designs. Table 5 gives a comparative analysis of the hardware performance of related architectures in terms of frequency, number of FPGA slices, image size, consumed power, and computing time. The proposed architecture was evaluated at an average clock frequency of 160.23 MHz (period = 6.241 ns). For the 256×256 image size version, the architecture used 127 slices, less than 1% of the 33216 available. This is very low compared to the other architectures, particularly since the number of clock cycles is a principal factor in the energy computation. Simply put, the considerably shorter computation time leads to lower power consumption than in the other architectures. Moreover, the proposed architecture exhibited the shortest computing time among the compared 5/3 and 9/7 LS structures. The proposed energy-efficient LS 5/3 scheme is simple and straightforward to implement and delivers high throughput, especially on hardware-constrained platforms and in applications where memory and power are limited.
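The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the clock frequency, power, and slice counts quoted above; the helper `energy_joules` and the assumption that the 0.033 W figure is the average power drawn for the whole duration of a transform are illustrative, and any cycle count passed to it is hypothetical because the per-image cycle counts are given in the tables rather than here.

```python
F_CLK_HZ = 160.23e6        # reported clock frequency
POWER_W = 0.033            # reported power consumption
USED_SLICES = 127          # reported slices used
TOTAL_SLICES = 33216       # reported slices available

# Clock period in nanoseconds, derived from the frequency.
period_ns = 1e9 / F_CLK_HZ

# Device utilization as a percentage.
utilization_pct = 100.0 * USED_SLICES / TOTAL_SLICES


def energy_joules(cycles, f_clk_hz=F_CLK_HZ, power_w=POWER_W):
    """Energy = power x time, with time = cycles / clock frequency.

    Assumes the power figure is the average over the whole transform.
    """
    return power_w * cycles / f_clk_hz
```

With these values, `period_ns` is approximately 6.241 and `utilization_pct` is approximately 0.38, consistent with the reported period and with a utilization below 1%. The helper also makes explicit why the cycle count dominates the energy estimate at a fixed clock rate and power.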
After analyzing the sources and levels of energy consumption in the wavelet transform, the 5/3 filter technique was modified to further reduce the computation energy and the communication energy required for wavelet-based image compression and wireless transmission, by decreasing the number of arithmetic operations and memory accesses and the number of transmitted bits, respectively.

CONCLUSIONS
The lifting scheme was employed via the LS 5/3 wavelet transform to develop a design in which multipliers are replaced with shifters, reducing the number of operations needed to compute a DWT to approximately one-half of those required by a convolution approach. Consequently, fewer computations are needed and the control logic becomes simpler. Moreover, the lifting scheme is amenable to in-place computation, so the DWT can be executed in low-memory systems.

This study presents the hardware architecture and successful implementation of a lifting-based wavelet transform for the 5/3 wavelet filter. The FDWT and IDWT architectures were designed in VHDL. Both algorithms show similar computational complexity, since they require similar numbers of logic devices. The VHDL module performs the 2-D DWT on images of various dimensions. The synthesis results show that the four proposed versions use almost the same number of slices of the target device, while they differ in the number of clock cycles required for coding. The number of clock cycles also depends on the number of levels required. From the simulation results, it can be concluded that the proposed 5/3 algorithm reduces hardware cost and power consumption in comparison with the traditional JPEG2000 filter. Future work includes further development of the transformation phase as well as the coding phase, in order to build a comprehensive image compression scheme using LS architectures suitable for various portable and embedded wireless devices.