FPGA implementation of Lempel-Ziv data compression

Received Oct 30, 2020; Revised Jan 10, 2021; Accepted Apr 29, 2021

When transmitting data in digital communication, it is desirable to send as few bits as possible, so many techniques are used to compress the data. Lempel-Ziv is one of the most commonly used lossless data compression algorithms. In this paper, a Lempel-Ziv algorithm for data compression is implemented in VHDL. The work is devoted to improving the compression rate, space saving, and utilization of the Lempel-Ziv algorithm using a systolic array approach. The design is validated with VHDL simulations using Xilinx ISE 14.5 and synthesized on a Virtex-6 FPGA chip. The results show that the design provides high compression rates and space-saving percentages as well as improved utilization. Compared with comparable previous designs, the throughput is increased by 50% and the design area is decreased by more than 23%, with a high compression ratio.


INTRODUCTION
Computers deal with many kinds of data, such as text, games, sound, photos, and film. Some of these sources require large amounts of data, which can quickly fill a hard disk or take a long time to transmit over a network. Storing a lot of digital information in a limited amount of space is a recurring problem, so it is worth asking whether the data can be rewritten to take up less space. This may appear like magic, but it does, in fact, work well for some data types. Data compression is used in multimedia formats for images, video, and audio [1,2]. Lossless data compression means that the data is the same at the source and the destination [3,4]. Huffman coding [5,6], run-length coding [7], arithmetic coding [8], and Lempel-Ziv (LZ) compression algorithms [9] are widely used [10] lossless data compression techniques. Among them, the LZ algorithm is a dictionary-based algorithm that can achieve an average compression ratio for lossless data compression and is considered universal. Statistical lossless data compressors are better than dictionary-based ones in cost, area requirement, and compression ratio [11]. In the hardware implementation of dictionary-based methods, three approaches are distinguished: the content-addressable memory (CAM) approach [12], the microprocessor approach [13], and the systolic array approach [5]. The main advantage of the systolic array approach is that it is easy to implement and can achieve a higher clock rate [14]. A comparison of the three approaches is shown in Table 1. Because the LZ algorithm involves a considerable amount of parallel comparison, achieving very high throughput with software approaches may be difficult. The systolic array approach is used in this research to achieve high throughput with lower hardware requirements.
The remainder of this paper is organized as follows. Related work is reviewed in Section 2. The LZSS compression algorithm is explained in Section 3. Section 4 describes the systolic array design for LZ. Section 5 presents the simulation and implementation results of our design. Finally, conclusions are given in Section 6.

RELATED WORK
Lossy data compression allocates the bits necessary to restore the data within a specified fidelity level, measured by a distortion measure; the underlying theory is called rate-distortion theory [8]. In lossless data compression [7], the data must be reconstructed exactly [15]. The Lempel-Ziv compression method is a dictionary method based on replacing text substrings with references to their previous occurrences. The Lempel-Ziv dictionary starts in a predetermined state, but during encoding its content changes depending on the data that has already been encoded. LZ77 [9] and LZ78 [16] are the most famous algorithms. LZSS is one of the most popular variants of LZ77 [17][18][19]. There are many research works on LZ designs for data compression. We introduce some recent and earlier works, such as [20], [14], and [21].
In [22], Marsh and Knapp presented a detailed analysis of how the buffer sizes in the LZ77 algorithm affect throughput and compression ratio. For a chosen buffer size, the required area, the compression ratio, and the achievable throughput of the compressor can be evaluated. The compressor was implemented on a Xilinx XC2V1000 FPGA device using a 512-byte search buffer and a 15-byte coding buffer. Based on post-layout simulations, the architecture can achieve a throughput of 11 Mbps. In [23], an area/power architecture for LZ data compression was developed using systematic design methodologies. A control variable indicating early completion was used to improve latency. The architecture allows a high-level understanding of the tradeoffs involved. Because the architecture is scalable and parameterized, a broad range of options can be considered within a common estimation framework.
In [24], Mohamed A. Abd El Ghany described a parallel algorithm for LZ compression. A control variable indicating early completion was used to further improve latency. The proposed implementation is efficient in terms of speed and area requirements: the design area is decreased by more than 30% and the compression rate is increased by more than 40%. The compression rate was about 13 Mbps. In [25], the design and FPGA implementation of a GZIP compressor based on a systolic array was presented. A single GZIP compression core was implemented on a Virtex-6 FPGA ML605 development board, with data transferred over PCI Express using Xillybus. The throughput of the implementation was over 1.3 Gbps, while the average software throughput was 52 Mbps on the Calgary corpus. In [26], H. Luo, Ye Cai, and Q. Mao presented a multi-core GZIP compressor for HDFS. To increase throughput, the design combines multiple systolic array compression cores. The hardware implementation was evaluated on an Alpha Data Adm-Pcie-KU3 FPGA development board, with data transferred over PCI Express using RIFFA. The peak throughput of the compressor exceeds 1.1 GB/s.

LZSS COMPRESSION ALGORITHM
LZSS, one of the improvements of LZ77, is used in this paper. As an example, consider the window ($n = 9$) shown in Figure 1 with a look-ahead buffer ($L_s = 3$). Let $X_i$, $i = 0, 1, \ldots, n-1$, denote the window content and $Y_j$, $j = 0, 1, \ldots, L_s-1$, denote the look-ahead buffer content (i.e., $Y_j = X_{j+n-L_s}$). Following the LZ concept, the look-ahead buffer content is compared with the dictionary content to find the longest match, of length $L_{max}$, starting at position $I_p$. The output is then represented by the codeword $(I_p, L_{max})$. The codeword length $L_c$ is given by:

$$L_c = l + p \qquad (1)$$

where $w$ bits are needed to represent a symbol in the window, $l = \log_2(L_s)$ bits to represent $L_{max}$, and $p = \log_2(n - L_s)$ bits to represent $I_p$. The compression ratio is then $(l + p) / (L_{max} \cdot w)$. In the dependence graph (DG) in Figure 2, the match length and match signal are propagated from cell to cell. The window content ($X$) and the look-ahead buffer content ($Y$) are broadcast to all cells horizontally and diagonally, respectively. Processor assignment is done by projecting the DG onto the surface normal to the selected projection vector.
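The longest-match search and the codeword-length arithmetic described above can be sketched in software as follows. This is a minimal illustration rather than the hardware design: the function names `longest_match`, `codeword_bits`, and `compression_ratio` are hypothetical, and rounding the logarithms up with `ceil` is an assumption made so that sizes that are not powers of two (such as the $n = 9$, $L_s = 3$ example) give whole bit counts.

```python
from math import ceil, log2


def longest_match(window, n, ls):
    """Return (Ip, Lmax): start position and length of the longest match
    between the look-ahead buffer Y = window[n-ls:n] and the window content.
    Candidate start positions are the n - ls dictionary positions."""
    lookahead = window[n - ls:n]
    best_pos, best_len = 0, 0
    for start in range(n - ls):
        length = 0
        # start + length never exceeds n - 2, so indexing stays inside the window
        while length < ls and window[start + length] == lookahead[length]:
            length += 1
        if length > best_len:  # keep the first position with the longest match
            best_pos, best_len = start, length
    return best_pos, best_len


def codeword_bits(n, ls):
    """Codeword length l + p in bits (ceil is an assumption of this sketch)."""
    l = ceil(log2(ls))      # bits for the match length
    p = ceil(log2(n - ls))  # bits for the match position
    return l + p


def compression_ratio(n, ls, lmax, w):
    """(l + p) / (Lmax * w): codeword bits over the source bits they replace."""
    return codeword_bits(n, ls) / (lmax * w)


if __name__ == "__main__":
    window = "xxabxxabc"  # n = 9, look-ahead buffer "abc"
    print(longest_match(window, 9, 3))
    print(codeword_bits(9, 3), compression_ratio(9, 3, 2, 8))
```

For the 9-symbol window `"xxabxxabc"`, the look-ahead buffer `"abc"` matches two symbols starting at dictionary position 2, so the search returns `(2, 2)`.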

SYSTOLIC ARRAY DESIGN FOR LZ DATA COMPRESSION
The Lempel-Ziv compression design is shown in Figure 3. The systolic array architecture consists of three major components: the SALZC compressor module, the block RAM, and the host controller. The SALZC module does not include the block RAM, so the dictionary size can be increased by replacing the block RAM with a larger one. The host controller is also kept outside the SALZC module so that it can be modified when the dictionary size changes. In our implementation, the window size is $n$ = 1K and the look-ahead buffer length is $L_s = 16$. In the DG of Figure 2, all the nodes in a specific row are integrated into a single processing element (PE), producing a linear array of length $L_s$. The layout is simple because of the regularity of the array: only a single cell (PE) needs to be hand-laid out, and the other 15 PEs are copies of it. Routing is also simplified by the systolic array design. The resulting array of Design-P is given in Figure 5, and the space-time diagram is shown in Table 2.
As shown in Figure 5, the architecture consists of 16 processing elements, which perform the comparisons, and an L-encoder, which outputs the match length. The look-ahead buffer symbols $Y_j$ remain in the PEs during the encoding step and do not change. The dictionary symbols $X_i$ move systolically from left to right with a delay of one clock cycle. Each processing element's match signal $E_i$ is passed to the L-encoder. The encoder output $L_i$ is the match length resulting from the $i-1$ comparisons. The first $L_i$ is obtained after one clock cycle, and a new one is obtained on every clock cycle thereafter. The $Y_j$ symbols are preloaded before the encoding process, which takes $L_s$ extra cycles. The time needed to preload new source symbols during encoding depends on the number of source symbols compressed in the preceding compression step, $L_{max}$. The PE block diagram is presented in Figure 6. Comparing $Y_j$ with the incoming $X_i$ requires only one equality comparator. The comparator result $E_i$ (the match signal) propagates to the L-encoder. The L-encoder block diagram is depicted in Figure 7. The match length $L_i$ is computed from the match signals.
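The behavior of the PE array and the L-encoder can be illustrated with a small functional model. This is a sketch under stated assumptions, not the RTL: it ignores the pipeline timing (the one-cycle delays between PEs), and it assumes that the L-encoder acts as a priority encoder that counts consecutive matches starting from the first PE. The names `l_encoder` and `systolic_match_lengths` are hypothetical.

```python
def l_encoder(match_signals):
    """Count consecutive asserted match signals starting from PE 0.
    Assumption: the L-encoder behaves like a priority encoder."""
    length = 0
    for e in match_signals:
        if not e:
            break
        length += 1
    return length


def systolic_match_lengths(x_stream, lookahead):
    """Functional (not cycle-accurate) model of the linear PE array.
    Each PE j holds Y_j; for every alignment i of the stream, PE j
    compares Y_j with X_{i+j} using a single equality comparator."""
    ls = len(lookahead)
    lengths = []
    for i in range(len(x_stream) - ls + 1):
        signals = [lookahead[j] == x_stream[i + j] for j in range(ls)]
        lengths.append(l_encoder(signals))
    return lengths


if __name__ == "__main__":
    # One match length L_i per alignment of the dictionary stream
    print(systolic_match_lengths("abxabcab", "abc"))
```

In hardware, the comparisons for one alignment are spread over successive clock cycles as $X_i$ moves through the array; the model above collapses them into one step per alignment, which yields the same $L_i$ values.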

Host controller
The host controller includes a match results block (MRB), a codeword generator, and an end of processing block (EOPB), as shown in Figure 8. From Figure 5, it is clear that the L-encoder does not itself generate the maximum match length. To determine $L_{max}$ among the generated $L_i$ values, a match results block (MRB) is needed, as shown in Figure 9. The end of processing block, shown in Figure 10, includes a 4-bit counter and a determination block (DB). The counter is needed to handle the last part of the data stream correctly. The end-of-stream signal does not mean the end of the compression operation; once it is generated, it is used with the 4-bit counter to trigger the encoding of the unprocessed data in the look-ahead buffer. After receiving the enable signal, the counter counts the number of shift operations. The DB, shown in Figure 11, determines from the counter output how many processing elements operate during the encoding step, and generates the end signal once the compression operation is complete. Without the DB, the last part of the stream would be compressed incorrectly, because the number of active PEs in the look-ahead buffer should equal the number of unprocessed symbols. A comparator and a subtractor are the principal components of the DB. If the counter output (the number of symbols already processed in the look-ahead buffer) is less than the number of PEs, the subtractor subtracts it from the number of PEs, giving the number of PEs that operate during the encoding stage. If the counter output equals the number of PEs, all the data in the look-ahead buffer has been processed, and the end signal (finish) is generated.

Figure 11. Determination block (DB)
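The roles of the MRB and the DB can be summarized in a short behavioral sketch. The function names and return conventions are hypothetical; the MRB is modeled as a running maximum over the $L_i$ values, and the DB as the comparator-plus-subtractor logic described above, with 16 PEs assumed as the default.

```python
def match_results_block(lengths):
    """Running maximum over the L_i values produced by the L-encoder.
    Returns (Ip, Lmax); the first occurrence of the maximum is kept."""
    best_pos, best_len = 0, 0
    for pos, length in enumerate(lengths):
        if length > best_len:  # comparator against the stored maximum
            best_pos, best_len = pos, length
    return best_pos, best_len


def determination_block(counter, num_pes=16):
    """Comparator + subtractor model of the DB.
    counter: number of look-ahead symbols already processed.
    Returns (active_pes, finish)."""
    if counter < num_pes:
        return num_pes - counter, False  # subtractor output, not finished
    return 0, True                       # whole look-ahead buffer processed


if __name__ == "__main__":
    print(match_results_block([2, 0, 0, 3, 0, 0]))
    print(determination_block(5))
    print(determination_block(16))
```

With five symbols already processed, the sketch reports 11 active PEs and no finish signal; once the counter reaches 16, it reports finish.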

Block RAM
We use the block RAM as a first-in-first-out (FIFO) buffer, so two counters are needed, as illustrated in Figure 12. The first counter generates the write address. It is initially loaded with the first address of the look-ahead buffer and then counts up to initialize the look-ahead buffer. Afterward, it points to the location where the next input symbol should be inserted. The second counter generates the read address. It points to the first location of the FIFO (equal to the write address + 1). When either counter reaches its maximum value, it wraps around to 0 on the next step.
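The two-counter addressing scheme can be sketched as a small software model. The class name `TwoCounterFifo` and its `step` method are hypothetical, and the model assumes that one symbol is written and one is read per step; it only illustrates how a read address kept one location ahead of the write address, with wrap-around to 0, yields FIFO behavior.

```python
class TwoCounterFifo:
    """Block RAM used as a FIFO, addressed by a write and a read counter."""

    def __init__(self, depth, start=0):
        self.depth = depth
        self.mem = [None] * depth
        self.write_addr = start % depth       # where the next symbol is stored
        self.read_addr = (start + 1) % depth  # write address + 1

    def _advance(self, addr):
        # A counter at its maximum value goes back to 0 on the next step
        return 0 if addr == self.depth - 1 else addr + 1

    def step(self, symbol):
        """Write one symbol, return the content at the read address."""
        self.mem[self.write_addr] = symbol
        oldest = self.mem[self.read_addr]
        self.write_addr = self._advance(self.write_addr)
        self.read_addr = self._advance(self.read_addr)
        return oldest


if __name__ == "__main__":
    fifo = TwoCounterFifo(depth=4)
    print([fifo.step(s) for s in "abcde"])
```

With a depth of 4, the first three reads return empty locations, and from the fourth write onward the oldest stored symbol comes out, as expected of a FIFO.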

Software simulations
The RTL architecture of the SALZC module depicted in Figure 4 is modeled in VHDL, and its simulation result is shown in Figure 13. The SALZC receives a sequence of 16 bytes of data from a text vector file. The first 16 bytes of data are stored in $Y_j$; the module then reads $X_i$ and compares $Y_j$ with $X_i$, with the results appearing on $L_i$ and $Y_0$-out. Here $L_i$ = 1111 because the first 16 bytes of $X_i$ equal the first 16 bytes of $Y_j$, and $Y_0$-out = 01110011 because the first byte of the file is 01110011. The simulation result of the host controller is shown in Figure 14. The host controller outputs a codeword according to the signals received from the SALZC: if there is no match, it outputs a codeword that contains $M_0$; if there is a match, it outputs a codeword that contains the match length and its pointer. The first bit of the codeword specifies whether there is a match.

Figure 14. The host controller simulation result
The code also performs a shift if en-shift = 1 or if load = 1: en-shift is a control signal that performs the initial 16 shifts, and load = 1 loads a new byte. The host controller output also depends on L-ready, which indicates whether the match is ready. $L_i$ gives the match length, and the code shifts by that number of bytes. For example, if $L_i$ = 0110, then Q-Ready = 1 and shift-left = 1 for 6 clock cycles; shift-left then returns to zero, waiting for a new value of $L_i$, or for load if there is no match. As shown in Figure 15, the window acts as the dictionary in our code; it is a FIFO RAM with depth = 1024 and width = 8. The code reads the input data from the text file and then produces the output.

Figure 15. Window simulation result
After the VHDL code of all the components (window, SALZC, and host controller) is verified, the match length from the comparison and the first byte stored in the first PE are fed to the host controller. If there is a match, the host controller compares its length with the maximum length stored previously and outputs a 16-bit codeword containing the match length and its pointer. It then performs a number of shifts equal to the match length, loading the same number of new bytes into the shift register, and compares again. If the comparison was not successful, it outputs the first byte stored in the first PE, performs one shift (loading one new byte), and compares again, as shown in Figure 16. When a match is found, the SALZC module asserts a signal indicating it: as shown in Figure 16, Q_ready = 1 at the time the output carries a length and pointer, and the first bit of the codeword equals one, which provides another check on the output. If the output has zero length, the signal load = 1 is asserted instead, confirming that there was no match.
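The encoding loop described above (emit a flagged codeword and shift by the match length, or emit the first byte and shift by one) can be sketched in software together with a matching decoder. This is an illustration under assumptions, not the VHDL design: the token layout `(1, pointer, length)` / `(0, literal)`, the function names, and the `min_match` parameter are hypothetical, and the dictionary size of 1008 is taken as $n - L_s$ for $n$ = 1K and $L_s = 16$.

```python
def lzss_encode(data, dict_size=1008, ls=16, min_match=1):
    """Software sketch of the host-controller loop.
    Emits (1, pointer, length) on a match and (0, literal) otherwise."""
    tokens = []
    pos = 0
    while pos < len(data):
        start = max(0, pos - dict_size)   # oldest symbol still in the dictionary
        max_len = min(ls, len(data) - pos)
        best_ptr, best_len = 0, 0
        for cand in range(start, pos):
            length = 0
            while length < max_len and data[cand + length] == data[pos + length]:
                length += 1
            if length > best_len:         # keep the maximum, as the MRB does
                best_ptr, best_len = cand - start, length
        if best_len >= min_match:
            tokens.append((1, best_ptr, best_len))
            pos += best_len               # shift by the match length
        else:
            tokens.append((0, data[pos]))
            pos += 1                      # shift by one symbol
    return tokens


def lzss_decode(tokens, dict_size=1008):
    """Inverse of lzss_encode; copies symbol by symbol so that matches
    overlapping the current position are reproduced correctly."""
    out = []
    for token in tokens:
        if token[0] == 1:
            _, ptr, length = token
            start = max(0, len(out) - dict_size)
            for k in range(length):
                out.append(out[start + ptr + k])
        else:
            out.append(token[1])
    return out


if __name__ == "__main__":
    tokens = lzss_encode("abcabcabcx")
    print(tokens)
    print("".join(lzss_decode(tokens)))
```

For the input `"abcabcabcx"`, the first three symbols are emitted as literals, the next six are covered by a single match token pointing at position 0, and the final `x` is a literal; decoding the tokens restores the input.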

Implementation
In this section, we present the lossless compression efficiency achieved by our design. The design is implemented on a Xilinx Virtex-6 FPGA for $n$ = 1K, $L_s = 16$, and $w = 8$. The FPGA utilization summary is shown in Table 3. The compression rate $R_c$ can be estimated as: (2) With a window size $n$ = 1K, $L_s = 16$, $w = 8$, and a clock of 175.408 MHz, our module achieves a space saving of about 55%, a compression ratio of 67.8%, and an average compression rate of up to 25.75 Mbps. The total on-chip power is 3.422 W. Table 4 compares the compression rate of the proposed design with the literature. The work in [26] presents a parallel multi-core GZIP compressor for HDFS, implemented on an Adm-Pcie-KU3 FPGA device, and achieves a compression rate of about 22%. The work in [25] presents the design and implementation of a complete GZIP core architecture on a Virtex-6 ML605 and achieves a compression ratio of about 21.7%. The work in [24] presents the design and implementation of LZSS on a Spartan-II FPGA device and achieves 13 Mbps. Compared with the results in [16], the throughput is increased by 50% and the design area is decreased by more than 23%, which provides an excellent platform for real-time compression applications.

CONCLUSIONS
In this paper, the design and implementation of lossless data compression using the LZ algorithm was described. The whole algorithm is described in VHDL using the Xilinx ISE 14.5 tool. Our systolic array LZ compression (SALZC) module provides a space saving of about 55% and an average compression rate of up to 25.75 Mbps. Compared with the literature, we showed that an LZSS-based systolic array design can achieve a high compression ratio compared with GZIP and a high compression rate compared with other LZ designs. As future work, the host controller may be modified so that it can be used for other string-matching-based LZ algorithms, such as LZW and LZ78.