Design and performance analysis of asynchronous network on chip for streaming data transmission on FPGA

ABSTRACT


INTRODUCTION
One of the developing fields within the system on chip (SoC) research area is network on chip (NoC). Several papers have been published on developments in NoC systems and their applications [1]. In the meantime, there have been several improvements and developments in the structure of synchronous and asynchronous NoCs [2]. The message-passing asynchronous NoC providing guaranteed services over open core protocol (OCP) interfaces (MANGO) has been developed into a fully grown high-speed NoC [3], [4]. The favorable services offered by MANGO are bounded services [5], [6]. The OCP interface connects the NoC to the core. The guaranteed service (GS) network and the best-effort (BE) network are the two main components of the network [7], [8]. Virtual channels support the connection-oriented GS services, which are characterized by hard latency guarantees that promise better utilization. In the BE network, packets are routed by wormhole routers [9]. Earlier research shows that the implementation of asynchronous circuits on field programmable gate arrays (FPGAs) has been very narrow and confined [10], [11]. We therefore aim to implement a more sophisticated approach that makes such implementations easier to carry out.

Int J Reconfigurable & Embedded Syst ISSN: 2089-4864  Design and performance analysis of asynchronous network on chip for streaming … (Trupti Patil) 297

The NoC starts from a best-effort NoC in the elementary asynchronous mode [12]. A router, a master network adapter (NA), and a slave NA are the three components of the NoC. The routers are interconnected in a mesh topology, and wormhole routing is used for communication. Source routing and XY-routing are used to avoid deadlocks [13]. The number of flits in a packet can be unlimited [14].
The four basic units of a NoC are intellectual property (IP) cores, NAs, routers (R0 to R8), and links. Figure 1 shows the outline of a 3×3 NoC module. A NoC lets the cores communicate with each other easily and reliably [15]. Because the FPGA implementation is so far only theoretical, the BE NoC is the preferred choice for implementation. The primary concern of this work is availability, and execution has the lowest priority [16]. The area of the complete structure is low, a consequence of the logic resources accessible on the given FPGA [17]. The next part of this work shows the model used to select the NoC design [18]. The topology selected should suit the constraints imposed by the FPGA. The conditions that the topology has to satisfy are listed in [19]. The link to the next node can be unidirectional, as in a unidirectional torus, or the topology can be a k-ary 2-cube mesh or a torus. For this design, a k-ary 2-cube network with bidirectional links is selected. The main reason for this selection is freedom from deadlock, whereas the torus requires a much larger number of links [20]. If the topology is combined with XY routing, deadlocks can be avoided without virtual channels. The two-dimensional structure of the topology also maps well onto the architecture of the FPGA. The further requirements of a k-ary 2-cube network topology are: four ports for network connections, one port for a core connection, and p described in [21].
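The XY routing mentioned above can be illustrated with a minimal sketch. It assumes routers are addressed by (x, y) coordinates on the mesh and that R0 to R8 are numbered row by row; neither the coordinate scheme nor the function names come from the paper. A packet first travels along X until the column matches, then along Y, so it never turns from Y back to X, which is what removes the cyclic dependencies that cause deadlock.

```python
def xy_route(src, dst):
    """Return the list of (x, y) hops from src to dst using XY routing.

    Assumption: nodes are (x, y) tuples on a 2D mesh; this is an
    illustrative sketch, not the router logic from the paper.
    """
    x, y = src
    path = [(x, y)]
    # Phase 1: move along the X dimension until the column matches.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Phase 2: move along the Y dimension until the row matches.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def router_id(x, y, width=3):
    """Hypothetical row-major numbering (R0..R8 for a 3x3 mesh)."""
    return y * width + x


route = xy_route((0, 0), (2, 1))
print([f"R{router_id(x, y)}" for x, y in route])
```

The hop count always equals the Manhattan distance between source and destination, so the route is minimal on a mesh.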

PROPOSED LOW POWER ROUTER DESIGN FOR LABEL SWITCHING NoCs
Label switching (LS) is used in many networks, such as automatic teller machines (ATMs) and banking applications, because it relies purely on packet relaying: LS carries route information in the form of labels within the network. Another function of LS is to change direction in X-Y coordinates when a packet is transmitted from one route to another. It identifies the next router using forwarding information, quality of service (QoS) guarantees, and traffic priority, and finally assigns the next route label [22]. LS is applied to the transmission of streaming data, at the cost of more area consumption and high power utilization. The microarchitecture of the single router is shown in Figure 2; it consists of a first in first out (FIFO) buffer and its control block, a NoC manager, a crossbar switch, and an arbiter. This work concentrates mainly on reducing power using the bit transition encoder and decoder (BTED) technique, as given in (3) and (4). Existing LS-based NoCs for streaming applications are limited in latency and hardware utilization. These limits are addressed in this work with the help of a NoC manager, which monitors and controls bandwidth sharing and adjusts it automatically. The NoC manager uses a flow graph (FG) to represent communication between source and destination nodes. The FG is updated as packets are stored, and their bandwidths are recorded in a table known as the routing table. The source router in the FG processes the packets produced by the traffic generator, which acts as the processing engine, through its input and output ports. The engine and the input and output ports receive the data and form sink nodes in the FG; from source to destination, the connections through intermediate nodes are represented as edges stored in the FG. The FG is given in Figure 3 and its edges are shown in Table 1. Here Ei denotes the edges connecting the source (s) and destination (d), which can be any of the 64 nodes. Ni is the source node, Nj is the destination node, uij marks a utilized node (already used by another router), and Aij marks an available node (a free node in the FG). Two further fields mark, respectively, a node present in the list of labels used in the pipe and a node present in the FG that is not used by any other router.
During packet transmission, the used-label and free-node fields are equal to "NULL" when no data is available, or when their capacity or bandwidth is completely utilized and cannot serve any other router. During data transmission, an available node has maximum capacity or bandwidth, and a free node is not used by any other router and remains available for data transmission. For effective data transmission, the pipe should have maximum capacity 'c', and it establishes communication between source (s) and destination (d). In the proposed design, the major sub-systems are routers, network adapters, the switching algorithm, the label-based routing technique, and the power optimization method. These are integrated at the SoC level to meet the requirements of the IP with optimal power, area, latency, and throughput. All of these sub-systems are part of the NoC, which is integrated into SoC systems to interface with high-speed Cortex-M33 processors and other controllers through different protocols [23]. In Algorithm 1, the first step creates the FG: the input packet includes both the data and the destination id as a label, and the FG contains the edges and their capacities [24]. All edges in the FG change direction according to the next router, which depends on the destination node. The FG also stores the capacity of every link (path). The third step monitors the number of packets transmitted and received between routers. In the fourth step, the data is stored in the output ports when the source and destination nodes are the same. The remaining steps perform the data transmission based on the bandwidth available at every node, and finally the received data is stored in the pipe stack (ps). Once the packet reaches the destination, the FG updates its edges, pushes the packet data to the output ports, and also stores it in the stack pointer (sp). After the destination node and pipe are identified, the sp is used to update the used list and the pipe.

Figure 3 shows the 8×8 LS-NoC in a 2D mesh topology. The NoC manager is part of every router. Each router has five input and output ports (east, north, south, west, and local) and processing elements, along with IP blocks that store the received packet at the destination node. The single LS-based router is designed using combinational circuits between input and output ports. Data received from the source system, i.e., the device generating electrocardiogram (ECG) signals, is stored in the FIFO if other flits are awaiting traversal or if the arbiter does not grant access to the output port [25]. The FIFO control block (FCB) handles the FIFO pointer arithmetic and controls the signal flow of the corresponding input port. A bad connection is handled as follows:
− When a bad connection is detected, the LS-NoC manager sets the capacity of that link to 0 in the FG.
− Existing pipes that use the connection are deactivated. The pipes are renamed, and the routing tables are modified.
− After the pipes are configured, the FG is updated.
The NoC manager's overhead is made up of two parts: computation and configuration. Identifying a pipe with the flow-based method (Algorithm 1) incurs the computational cost. The configuration overhead consists of transmitting the routing table configuration across the network and updating the routing tables (Table 1).
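The handling of a bad connection described above can be sketched as a small bookkeeping structure. This is only an illustration under stated assumptions: the class name `FlowGraph`, its fields, and the pipe labels `p0` and `p1` are hypothetical and do not appear in the paper, and the renaming of pipes and the routing-table rewrite are left out.

```python
from dataclasses import dataclass, field


@dataclass
class FlowGraph:
    """Hypothetical flow-graph bookkeeping; all names are illustrative."""
    capacity: dict = field(default_factory=dict)  # (u, v) link -> capacity
    pipes: dict = field(default_factory=dict)     # pipe label -> list of links
    active: dict = field(default_factory=dict)    # pipe label -> pipe is live?

    def add_pipe(self, label, links, capacity):
        # Register a pipe and give each of its links a capacity if unset.
        for link in links:
            self.capacity.setdefault(link, capacity)
        self.pipes[label] = list(links)
        self.active[label] = True

    def fail_link(self, link):
        # Step 1: a bad connection gets capacity 0 in the FG.
        self.capacity[link] = 0
        # Step 2: every live pipe that crosses the link is deactivated.
        hit = [p for p, links in self.pipes.items()
               if link in links and self.active[p]]
        for p in hit:
            self.active[p] = False
        # The caller would now re-identify pipes and update routing tables.
        return hit


fg = FlowGraph()
fg.add_pipe("p0", [(0, 1), (1, 2)], capacity=4)
fg.add_pipe("p1", [(3, 4)], capacity=4)
affected = fg.fail_link((1, 2))
print(affected, fg.capacity[(1, 2)])
```

Returning the list of affected pipes keeps the computation part (finding new pipes) separate from the configuration part (pushing routing-table updates), matching the two overhead components named in the text.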

BIT TRANSITION ENCODER/DECODER FOR POWER OPTIMIZATION IN NoC
Power consumption and its optimization are a major challenge in NoCs, since excessive power degrades performance. In this work, as shown in Figures 4 and 5, the bit transition encoder is applied before a packet is transmitted to the source router, and the decoder is applied after the packet is received at the destination router. In any on-chip memory or network, power consumption depends on the number of bit transitions, such as bit 1 to bit 0 (type 1) or bit 0 to bit 1 (type 2). There is no bit transition if both bits are the same, as in bit 0 to bit 0 (type 3) or bit 1 to bit 1 (type 4). The power reduction technique therefore acts only on type 1 and type 2. Power optimization depends purely on the number of bit transitions in the packet data: if a packet has many transitions, the encoder minimizes them before the packet is sent to the next router. The generalized logical expressions for encoding are given in (3) and (4).
Here FI is full invert, which can be either 1 or 0, and HI is half invert, whose bit is the same as FI. The present bit of the packet is XORed with the previous bit to reduce the number of transitions. For example, consider a 16-bit packet, 1010101010101010, which has 15 transitions. After the bit transition encoder is applied to the packet through the XOR operation, the encoded bits are 1111111111111111, as shown in Figure 4. The encoded bits contain 0 transitions, so the number of transitions is reduced from 15 to 0. The encoded packet is transmitted from the source router to the destination router, where the bit transition decoder is applied to recover the original packet. The generalized logical expressions for decoding are given in (5) and (6). After applying (5) and (6), the simulated results are shown in Figure 5: the decoded packet bits are the same as the packet bits transmitted at the source node.
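The worked example above can be reproduced with a short sketch. It assumes the simplest reading of the description: each encoded bit is the XOR of the present bit and the previous bit, with the bit before the first one taken as 0. The FI and HI terms of (3) to (6) are not modeled, and the function names are illustrative.

```python
def count_transitions(bits: str) -> int:
    """Count 0->1 and 1->0 changes between neighbouring bits."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


def encode(bits: str) -> str:
    """XOR each bit with the previous bit (assumed initial previous bit: 0)."""
    prev = "0"
    out = []
    for b in bits:
        out.append("1" if b != prev else "0")
        prev = b
    return "".join(out)


def decode(encoded: str) -> str:
    """Invert encode(): XOR each encoded bit with the previous decoded bit."""
    prev = "0"
    out = []
    for e in encoded:
        cur = "1" if (e == "1") != (prev == "1") else "0"
        out.append(cur)
        prev = cur
    return "".join(out)


packet = "1010101010101010"
encoded = encode(packet)
print(count_transitions(packet), encoded, count_transitions(encoded))
print(decode(encoded) == packet)
```

For the alternating packet this gives 15 transitions before encoding and 0 after, and decoding returns the original bits. The benefit depends on the data: a packet that is already constant gains nothing from this transform.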

RESULTS AND DISCUSSIONS
The proposed LS-NoC with the power optimization technique was successfully synthesized using the Xilinx Design Suite 14.7 software tool and implemented on an Artix-7 FPGA development board. The delay, throughput, and figure of merit are analyzed between the source (R00) and destination (R06) nodes using the simulated results shown in Figure 5. To verify the latencies between different routers, the first router is always taken as the source and the others as destinations; the latency measured from the source to each of the other routers is shown in Table 2. The second column of Table 2 lists the latencies; for example, 10 and 15 are the latencies from router 3 to router 6 (shown in the destination node column). The throughput and frequencies are likewise shown in Table 2. In Figure 6, the first signal is the 100 MHz clock, followed by the input and output data of the source and destination routers, which are highlighted in separate boxes.

Figure 6. Simulated results of 3×3 LS-NoC and their received data at each input and output ports

Throughput (thp): The throughput of the proposed 8×8 NoC system is calculated as the ratio of the total number of bits transmitted to the simulation time in seconds, i.e., the number of bits conveyed per second, and is represented as (7).

Source node: the given data is 11b4.
Destination node: R06, with its input and output ports; the received data is 611b4, where 6 is the label bits, and the received data after decoding is 11b4.
The number of packets transmitted per cycle is determined by the packet size of 16 bits, the flit size of 16 bits, the total latency of every packet transmission, and the total cycle period T. The hardware platform for implementing the NoC with different factors in the proposed system is the Xilinx Design Suite, which has also been used by the prior systems. Table 3 compares previous work with the proposed work. The relative plots and a thorough analysis of the NoC with and without the power optimization technique are shown in Figures 7 and 8.
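The relation between the transmitted data 11b4 and the received word 611b4 can be checked with a small sketch. It assumes the label is placed in the bits directly above a 16-bit payload; the paper does not state the width or position of the label field, and the helper names are hypothetical.

```python
FLIT_BITS = 16  # payload width, taken from the 16-bit flit size in the text


def attach_label(label: int, payload: int) -> int:
    """Place the label above the 16-bit payload (assumed layout)."""
    return (label << FLIT_BITS) | (payload & 0xFFFF)


def strip_label(word: int):
    """Split a received word back into (label, payload)."""
    return word >> FLIT_BITS, word & 0xFFFF


word = attach_label(0x6, 0x11B4)
label, payload = strip_label(word)
print(hex(word), hex(label), hex(payload))
```

With label 6 and payload 11b4 the combined word is 611b4, and removing the label at the destination returns 11b4, matching the values reported for R06.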
As a result, compared with the existing work, the proposed system performs better in all parameters, whether or not LS is used to store and then transmit ECG signals, as shown in Figure 9. The proposed LS-NoC is extended from 3×3 to 8×8, with 64 routers in total, to analyze latency and routing paths; the simulated results of the 3×3 network are shown in Figure 10 for source node 3 and destination node 6. The test bench of the top-level design includes both the LS-NoC and the LEDR, and its implementation includes a 6-rail voltage with full functionality. The inputs for the LEDR come from text files in which the voltage levels are specified; these are looped through continuously during simulation, as shown in Figure 10. Because of this, the VRAIL_EN signal has no effect on the simulated analog input (voltage): the analog input will not rise when VRAIL_EN is asserted, nor will it fall when VRAIL_EN is de-asserted, as shown in Figure 11.

CONCLUSION
The dynamically reconfigurable network on chip (DRNoC) uses a mesh topology and XY-routing with deadlock freedom to minimize latency. The streaming data are converted into packets, which consist of source and destination ids and flit bits. These packets are encoded by adding two additional request signals that act as handshake signals. The proposed design uses asynchronous clocks, and a synchronizer is used to manage synchronization. The router occupies 1302 LUTs and 530 latches, with delay elements using 12% of the LUTs. The router's measured overall output rate was 46 MHz. The prototype consists of three CPUs and three external units connected via a 3×2 mesh. The power and area used by router buffers in a NoC are a major issue in the deep submicron domain, which motivates the elimination of buffers. Compared with another conventional bufferless routing algorithm, the computational results demonstrate that the designed routing algorithm improves average latency by 22%, power consumption by 21%, and area overhead by 44%. An 8×8 switch router with a suitable shortest path detector, such as a minimal spanning tree, is used to design the proposed network architecture for effective run-time routing. The design is written in Verilog hardware description language (HDL), executed in Xilinx Vivado 2018.1, and implemented on a Nexys 4 DDR board from the Artix-7 FPGA family with part number XC7A100T-CSG324, which has 324 pins. It achieves improved accuracy and a 35% latency improvement. Compared with the conventional router, the proposed router increases efficiency by 40%, and the technique outperforms the traditional one in terms of delay, area, and resource allocation.

Figure 1. General structure of NoC connected in a 3-by-3 mesh topology

Figure 2. Proposed label switched-based microarchitecture of single router with single-cycle flit traversal and their internal micro blocks including power optimization using BTED

Figure 3. FG of the LS-NoC and BTED architectures during data packet transmission with two sources and two destinations marked as red and green

Figure 4. Simulated results of power reduction through bit transition encoder technique, here input is 16 bits (1010101010101010) and output is 16 bits (1111111111111111)

Figure 5. Simulated results of power reduction through bit transition decoder technique, here input is 16 bits (1111111111111111) and output is 16 bits (1010101010101010)

Figure 7. Delay calculation between source and destination of 3×3 NoC; the intermediate nodes are 3 and 6, node 3 received the data at 10 ns, node 4 received the data at 40 ns, and node 6 received the data at 16 ns

Table 1. LS-NoC and BTED based NoC and its routing tables from 33 to 16 and from 17 to 5
Ei  Ni  Nj  uij  Aij

Utilized and available (Aij) flows are determined based on capacity or bandwidth. Once the FG is updated, the NoC manager configures the routing table, and label updating is performed in every intermediate router. Along with the FG update, each edge is checked for conflicts; if there is a conflict, the NoC manager identifies an alternative port or router that is unused in the routing table. This table data structure at a node can be written as shown in (2), which relates the pipe label in the edge ending at the node to the pipe label in the following edge.

Algorithm 1. NoC manager: identification of pipe (ps: pipe stack)
Input: Ei = {Ni, Nj, uij, Aij, and the two list fields}, s, d, and c
1. Define the source node as the start of the pipe in the NoC.
2. Define the flow graph FG = {Ei}.
3. Initialize the counter to 0, i.e., k = 0.
4. If s = d, do not perform any processing: store the source packet in the output ports of the same router and update ps = s{data}.
5. If s ≠ d, loop over all edges from s, s+1, … to d:
   a. If the available node capacity is greater than c, send the data packet to the input port of the next router (Data_out), update the FG based on the node id, and push the data into sp.
   b. If the available node capacity is less than c, search for an alternative router and its free input and output ports, and push into sp.
6. If d = destination id, set ps = d{data}, update the FG, and extract the data packet by removing the label bits.
End
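The pipe identification in Algorithm 1 can be approximated by the following sketch. It is not the authors' implementation: a breadth-first search stands in for the per-hop "search for an alternative router" step, the FG is reduced to a dictionary of available link capacities, and the names `identify_pipe` and `available` are hypothetical. Only links whose available capacity exceeds c are used, and the function returns the list of routers forming the pipe, or None when no pipe exists.

```python
from collections import deque


def identify_pipe(available, s, d, c):
    """Find a pipe from s to d over links with available capacity > c.

    available: dict mapping a directed link (u, v) to its free capacity.
    Returns the routers on the pipe in order, or None if none exists.
    """
    if s == d:
        # Source equals destination: data stays at the same router.
        return [s]
    parent = {s: None}
    queue = deque([s])
    while queue:
        node = queue.popleft()
        for (u, v), cap in available.items():
            # Skip links that do not start here, are too full, or revisit.
            if u != node or cap <= c or v in parent:
                continue
            parent[v] = node
            if v == d:
                # Walk the parent chain back to s to recover the pipe.
                pipe = [v]
                while parent[pipe[-1]] is not None:
                    pipe.append(parent[pipe[-1]])
                return pipe[::-1]
            queue.append(v)
    return None


links = {("A", "B"): 4, ("B", "D"): 1, ("A", "C"): 4, ("C", "D"): 4}
print(identify_pipe(links, "A", "D", c=2))
```

In this example the link from B to D has too little free capacity, so the search falls back to the route through C, which mirrors the alternative-router branch of the algorithm.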

Table 2. Source and destination router

Table 3. Summary of the existing work with proposed work