# Survey on Performance and Energy consumption of Fault Tolerance in Network on Chip

#### B. Naresh Kumar Reddy, Vasantha M.H, Nithin Kumar Y.B.

Department of Electronics and Communication Engineering, National Institute of Technology Goa, India.

| Article Info                                                                                 | ABSTRACT                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|----------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Article history:<br>Received Nov 3, 2015<br>Revised Jan 18, 2016<br>Accepted Feb 11, 2016 | Network on Chip (NoC) is a communication subsystem that contains the logic for sending and receiving data among different sources in a single IC, and it adopts VLSI technology to be as compact as possible. However, the increasing probability of failures in NoCs, caused by large-scale integration of components, has raised concern among researchers. In particular, the issues of fault tolerance and the increasing length of global wires in NoCs have to be addressed for on-chip and multi-core architectures. This survey presents a perspective on existing NoC fault-tolerant algorithms and a corresponding distributed fault analysis strategy that helps in observing the fault status of individual NoC components and their adjacent communication links. The analysis of fault-tolerant networks subjected to dynamic workloads for large-scale applications is equally important. This paper mainly emphasizes fault-tolerant NoC strategies, summarizing over thirty research papers. |
| <i>Keyword:</i><br>Core<br>Fault Tolerance<br>Network Interface<br>Network on Chip<br>Router |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |

## Corresponding Author:

B. Naresh Kumar Reddy, Department of Electronics and Communication Engineering, National Institute of Technology Goa, Farmagudi, Ponda, Goa - 403 401. Email: naresh.nitg@gmail.com

# 1. INTRODUCTION

The on-chip interconnect is a driving concern in the development of System-on-Chip (SoC) architectures, because it plays a crucial role in determining the performance, energy and fault tolerance of the overall system as technology scales [1]. Consequently, the design and analysis of scalable on-chip interconnects, also called Network-on-Chip (NoC) architectures, have become a thrust area of recent research. NoC design has also become an attractive alternative on account of the area, energy and reliability constraints of deep sub-micron design. The increasing rate of permanent (hard) faults in NoCs, resulting from accelerated aging effects, forces the system to work in faulty environments. Apart from this, the manufacturing challenges of deep sub-micron technology put an additional constraint on reliable communication among the NoC components. As a result, correspondence-driven, switch-based and similar architectures are evolving as the de facto standard for interfacing many IP blocks using standard topologies such as 2D mesh and torus [2].

A generic NoC architecture consists of several cores, Network Interfaces (NI) and Routers (R). The cores in an NoC are generally arranged in a mesh, as depicted in Figure 1 [3]. These cores can be homogeneous, e.g., CPUs, or heterogeneous, e.g., audio-video cores, wireless transmitters and receivers, etc. In an NoC, every core is connected to a local router via a Network Interface. In a similar fashion, each router is connected to its neighboring routers, forming a packet-based network on chip [4-5].



Figure 1. NoC Architecture

## 1.1. Network Interface

A Network Interface, typically a part of the NoC, acts as the communication medium between a core and its router; it is primarily used for packetization and depacketization of data. Packetization splits the data into packets of a specified standard length and pushes the packets into the attached router [6], whereas depacketization reassembles the packets arriving from the router at the receiving end. In short, a Network Interface is the interface between two pieces of equipment operating on different protocols.
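
The packetization and depacketization steps can be illustrated with a minimal Python sketch. It is not taken from any of the surveyed designs: the `Packet` fields, the 4-byte payload size and the function names are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List

PAYLOAD_BYTES = 4  # assumed fixed payload length per packet (hypothetical value)


@dataclass
class Packet:
    src: int        # source core id
    dst: int        # destination core id
    seq: int        # position of this packet within the message
    last: bool      # True for the final packet of the message
    payload: bytes  # slice of the original data


def packetize(data: bytes, src: int, dst: int, size: int = PAYLOAD_BYTES) -> List[Packet]:
    """Split a message into fixed-length packets, as the sending NI would."""
    chunks = [data[i:i + size] for i in range(0, len(data), size)] or [b""]
    return [Packet(src, dst, seq, seq == len(chunks) - 1, chunk)
            for seq, chunk in enumerate(chunks)]


def depacketize(packets: List[Packet]) -> bytes:
    """Reassemble the message at the receiving NI, tolerating reordering."""
    ordered = sorted(packets, key=lambda p: p.seq)
    return b"".join(p.payload for p in ordered)
```

For example, a 15-byte message is split into four packets, and `depacketize` recovers the original bytes even if the packets arrive in reverse order.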

## 1.2. Routers (R)

Routers are considered the heart of the NoC. A router typically consists of 5 input/output ports: 4 of them are connected to neighboring routers and the remaining one is connected to the local core. Each input port has 4 virtual channels, which behave as FIFO queues. These channels are multiplexed in time and the packets are sent to the crossbar. The Arbitration Unit (AU) controls the crossbar and is responsible for routing the packets. Routers are in general expected to provide fault detection and correction [7-8], and hence some mechanism has to be adopted to implement fault detection and correction while adhering to the basic properties of the router. Different universities and institutes have proposed diverse router architectures, keeping switching and routing algorithms as the basis.
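
As an illustration of how the virtual channels of one input port might be time-multiplexed, the following Python sketch models each channel as a FIFO queue. The round-robin service order and the class name `InputPort` are assumptions; the surveyed papers do not prescribe a particular multiplexing policy here.

```python
from collections import deque


class InputPort:
    """One router input port with several virtual channels (FIFO queues)."""

    def __init__(self, num_vcs: int = 4):
        self.vcs = [deque() for _ in range(num_vcs)]
        self._next = 0  # virtual channel to be examined first on the next cycle

    def enqueue(self, vc: int, flit) -> None:
        """Store an incoming flit in the given virtual channel."""
        self.vcs[vc].append(flit)

    def dequeue(self):
        """Pick the next flit in round-robin order (assumed policy), or None."""
        n = len(self.vcs)
        for offset in range(n):
            idx = (self._next + offset) % n
            if self.vcs[idx]:
                self._next = (idx + 1) % n
                return self.vcs[idx].popleft()
        return None
```

Each call to `dequeue` corresponds to one time slot in which a single flit is forwarded towards the crossbar; empty channels are skipped so that they do not waste slots.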

## 1.3. Network Topology

Network topology is the layout or structure of the network, in both physical and logical terms. It represents the way in which the nodes in a chip are connected to each other. Researchers have proposed various homogeneous and heterogeneous network topologies, keeping performance and power consumption as the design criteria. However, the increase in the number of devices eventually leads to performance degradation in an NoC. In fact, the diminishing performance over time is associated with the increase in the fault rates of the devices, which has turned out to be the dominant reason for system halts in NoCs. Therefore, techniques that keep the system working even in faulty domains are compulsory. The faults in an NoC can be categorized as permanent, intermittent and transient, and researchers are proposing different techniques for each kind of fault based on its behavior with respect to different stimuli and time.

# 2. RELATED RESEARCH

In this section, we present a comprehensive analysis of various contributions to fault tolerance mechanisms in the NoC domain.

## 2.1. Core

When any of the earlier stated faults (permanent, transient or intermittent) occurs at a core in the NoC, the system performance is directly affected, leading to high energy consumption. Chen-Ling Chou et al. proposed a spare core replacement technique [9] to address this issue. The placement of the spare core in the system is chosen randomly using fault-tolerant mapping functions [10]. Weighted Manhattan Distance (WMD), Link Contention Count (LCC) and System Fragmentation Factor (SFF) are a few of the parameters considered for this fault-tolerant mapping function. The authors also described a mapping process that minimizes SFF by continually selecting tiles that have fewer neighbors as well as a smaller Euclidean Distance (ED) in the region. Overall, the placement of a spare core depends not only on the minimum distance between the faulty core and the spare core but also on the failure propagation characteristics over the rest of the system. The method proposed by Fatemeh Khalili and Hamid R. Zarandi [11] for spare core placement performs resource management efficiently and also significantly improves failure containment within the system. In this approach, the placement of the spare core is done using WMD [12], available neighboring tiles (ANT) and unmapped neighboring vertices (UNV). The proposed algorithm is as follows: minimize WMD; if more than one tile satisfies this, choose the one with minimum $ANT(t_{mn}) - UNV(v_i)$; then apply the spare placement algorithm (calculate P<sub>critical</sub> using WMD). This technique reduces communication losses and offers a performance improvement compared to the previously cited work [13]. Murali and De Micheli presented the NMAP algorithm [14], which is comparatively faster, and showed that mapping the cores in a Network on Chip under bandwidth limitations makes it possible to minimize the communication delay. They also explained both minimum-path routing and split-traffic routing.
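
The tile selection rule described for [11] (minimize WMD, then break ties by the smallest $ANT(t_{mn}) - UNV(v_i)$) can be sketched in Python as follows. The survey does not define WMD precisely, so the sketch assumes it is the sum of communication volume times Manhattan distance to the already mapped neighbors; the function names and the data layout are hypothetical, and the spare placement step based on P<sub>critical</sub> is not covered.

```python
from typing import List, Set, Tuple

Tile = Tuple[int, int]  # (x, y) coordinates on the 2D mesh


def manhattan(a: Tile, b: Tile) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def wmd(tile: Tile, mapped_neighbors: List[Tuple[Tile, int]]) -> int:
    """Assumed WMD: communication volume weighted by Manhattan distance."""
    return sum(volume * manhattan(tile, other) for other, volume in mapped_neighbors)


def ant(tile: Tile, free_tiles: Set[Tile]) -> int:
    """Available neighboring tiles: free tiles directly adjacent to `tile`."""
    x, y = tile
    return sum((x + dx, y + dy) in free_tiles
               for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))


def select_tile(free_tiles: Set[Tile],
                mapped_neighbors: List[Tuple[Tile, int]],
                unv: int) -> Tile:
    """Minimize WMD first; break ties by the smallest ANT - UNV."""
    return min(sorted(free_tiles),
               key=lambda t: (wmd(t, mapped_neighbors), ant(t, free_tiles) - unv))
```

With one mapped neighbor at tile (1, 1) and free tiles (0, 0), (0, 1), (0, 2), (1, 0) and (2, 2), tiles (0, 1) and (1, 0) tie on WMD, and (1, 0) is chosen because it has fewer free neighbors.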

When tasks are not accomplished in the core as expected, it implies that there is some internal fault [15] within the core. Onur Derin et al. presented a task remapping strategy [16] to address this kind of issue. It is an online solution that mainly concentrates on permanent failures at cores in the NoC. The authors mention two aspects that have to be accounted for in an NoC [17]. The first is the partition problem, i.e., the selection of cores, and the other is core mapping. The selection of cores is important for running the tasks of the application, and computational optimization is the challenge here. The second aspect, mapping, assigns the selected IPs to the tiles of the NoC, and optimizing communication is the key here. Chao Wang et al. proposed the CRS-TS algorithm [18] to cope with faults in a Processing Element or core. This algorithm operates in two stages. In the first stage, two operations, i.e., a row bi-shift operation and a column shift operation, are performed; that is, CRS is used to generate an initial feasible topology. In the second stage, a tabu search algorithm is customized to revise the initial topology to further reduce the distance and congestion factors.

## 2.2. Network Interface

The Network Interface (NI) is the communication medium between core and router. Most of the faults that occur in the NI are found in the Look-up Tables (LUT) [19], the input and output queues (FIFOs) [20] and the adapter. In the adapter, faults occur when the protocol conversion mechanism is corrupted, which results in routing to wrong destinations. With FIFO faults, data typically gets corrupted in the queue, leading to false data reception and transmission. Similarly, faults in the LUT generate wrong routing paths.

Leandro Fiorin et al. proposed a fault-tolerant NI for Network on Chip [21] to address the above-mentioned faults, focusing on the building blocks of LUT, FIFO and FSM. The LUT can be implemented by a combination of non-programmable content addressable memory (CAM) and RAM lines; two architectural approaches were evaluated for it, namely error correcting/detecting codes and lines [22-23]. The FIFO is implemented using an offset register that stores an offset value, which is added to the read and write pointers so that the respective read and write operations reach the next working element in the FIFO. FSM-controlled protocol adaptation in the NI relies on error detection and correction in the FSM, which works as follows: the FSM state information is encoded in SECDED Hsiao code format, and comparing it with the baseline reveals the presence of errors, which is followed by error correction.

A research paper on multi network interface architectures [24] explained that earlier technologies take a long time to deliver the packets in the case of hard or soft errors in the Network Interface. Hence, instead of relying on a single NI, the idea of multiple NIs has been proposed, so that a failure of packet delivery through one NI is autonomously handled by another NI in the same router. Quad NIs have also been introduced, thereby further improving fault tolerance at the architecture level.

Heikki Kariniemi and Jari Nurmi presented a new algorithm [25] in which MSI is the NI of Micromesh, and data is transferred across the NoC in small fixed-size packets, because small packets can be stored in the buffers or memory. Direct Memory Access (DMA) transfers the data from memory to the MSI hardware, and the MSI is able to detect corrupted or faulty data using Cyclic Redundancy Check (CRC) sums and timers, thereby correcting the errors accordingly.

Anup Das et al. proposed a centralized hardware fault-tolerant NI for NoCs based on spatial division multiplexing [26]. In that design, data is transferred between core and router without loss: the core transfers the data to a FIFO and then to a controller. The controller detects errors and passes the data to the attached distributor, which finally delivers the data via a serializer.
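
The CRC-based detection of corrupted packets mentioned for [25] can be illustrated with a small Python sketch. It uses the standard CRC-32 from `zlib` as a stand-in; the polynomial actually used by the MSI is not stated in the survey, and the function names `add_crc` and `check_crc` are assumptions. The sketch covers detection only, not the subsequent recovery.

```python
import zlib


def add_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload (sender side)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_crc(frame: bytes) -> bool:
    """Recompute the checksum at the receiver and compare it with the trailer."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == received
```

A frame that arrives unchanged passes the check, whereas flipping any single bit of the payload causes `check_crc` to return `False`, at which point the receiver can trigger whatever recovery mechanism the design provides.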

## 2.3. Path

Faults that occur in the path of an NoC are of two types, temporary faults and permanent faults, and Syed M.A.H. Jafri et al. proposed fault-tolerant mechanisms for both. If the fault is a temporary fault observed in a wire or buffer, Error Correcting Codes (ECC) [27] can be used for error detection and correction. When a permanent fault occurs in any one of the wires, it is handled by placing a redundant wire between routers, as described in [28], and a significant reduction in energy consumption can also be achieved. The architecture proposed by Amlan Ganguly et al. draws its inspiration from natural complex networks with small-world properties [29]. A natural complex network generally contains long-range links, which can be realized with single-hop wireless channels. In this architecture, however, the packet latency is high. Considering this drawback, Ogras and Marculescu presented a novel design methodology for inserting application-specific long-range links into a standard mesh NoC architecture [30]. The authors showed that introducing long-range links plays a crucial role in both the static and dynamic cases. During times of heavy traffic, the addition of long-range links reduces the packet latency and thus improves performance and throughput.
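
The effect of a long-range link on a mesh can be illustrated by comparing average hop counts before and after adding one. The Python sketch below is only an illustration under simplifying assumptions: it measures shortest-path hops with breadth-first search rather than packet latency under traffic, and the choice of a 4×4 mesh with a corner-to-corner link is arbitrary, not the application-specific insertion method of [30].

```python
from collections import deque
from itertools import product


def mesh_links(n: int) -> set:
    """Bidirectional links of an n x n 2D mesh, stored as node pairs."""
    links = set()
    for x, y in product(range(n), repeat=2):
        if x + 1 < n:
            links.add(((x, y), (x + 1, y)))
        if y + 1 < n:
            links.add(((x, y), (x, y + 1)))
    return links


def average_hops(n: int, links: set) -> float:
    """Average shortest-path hop count over all ordered node pairs (BFS)."""
    adj = {node: [] for node in product(range(n), repeat=2)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


base = mesh_links(4)
with_long_link = base | {((0, 0), (3, 3))}  # one assumed long-range link
```

Calling `average_hops(4, base)` and `average_hops(4, with_long_link)` shows that the extra link lowers the average hop count, which is the structural reason long-range links can reduce latency.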

## 2.4. Router

Routers play an active role in the whole operation of a Network-on-Chip and are responsible for routing packets to their destinations. Within the router, faults can occur in the buffers, crossbars and switch allocators.

The fine-grained modular router architecture [1] proposed by Jongman Kim et al., known as the row-column decoupled router, is designed to operate in faulty environments by concentrating on fault tolerance, performance and energy. It offers more features than the earlier architecture. The key features include smaller crossbars ($2\times 2$) instead of a larger crossbar ($5\times 5$), a path-sensitive buffering scheme, and the XY routing algorithm, which comes in handy at the time of failures. If router N fails, the algorithm has a mechanism to bypass the traffic from router N-1 directly to router N+1 without touching router N. When one of the cores in a series of adjacent routers gets disconnected from the network due to faults, it is possible to transfer the data between the routers shared by that core by issuing a core recovery command, which provides a backup path for the core [31], as proposed by Khalid Latif et al. Adan Kohler et al. explained in their proposal [32] that by introducing CRC at the router inputs and outputs, data can be transferred safely, since the CRC logic can detect and correct errors in every packet within its domain. In an NoC, router failures are implicitly derived from the faults within a router. Yung-Chang Chang et al. proposed an advanced fault tolerance scheme [33] that incorporates spare routers in the NoC. In the event of a fault, the spare router is inserted at the top of the row. This spare router concept gave rise to two algorithms, the shift-and-replace-allocation (SARA) algorithm and the defect-awareness-path-allocation (DAPA) algorithm. In the former, the local core is connected to another router at the time of a router failure, whereas in the latter, the path is dynamically allocated in the case of routing failures. The paper [33] briefly discusses these two algorithms with an example.
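
XY routing and the bypass of a failed router can be sketched in Python as follows. This is a simplified illustration rather than the mechanism of [1]: the sketch assumes that a faulty router can only be bypassed when traffic passes straight through it (previous, faulty and next routers lie on one row or column), and the function names are hypothetical.

```python
from typing import List, Set, Tuple

Node = Tuple[int, int]  # (x, y) router coordinates on the mesh


def xy_route(src: Node, dst: Node) -> List[Node]:
    """Dimension-ordered routing: travel along X first, then along Y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def apply_bypass(path: List[Node], faulty: Set[Node]) -> List[Node]:
    """Skip faulty intermediate routers that are crossed in a straight line."""
    if len(path) < 2:
        return list(path)
    result = [path[0]]
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        straight = (prev[0] == node[0] == nxt[0]) or (prev[1] == node[1] == nxt[1])
        if node in faulty and straight:
            continue  # traffic goes directly from router N-1 to router N+1
        result.append(node)
    result.append(path[-1])
    return result
```

For a route from (0, 0) to (3, 1) with router (2, 0) marked faulty, the packet moves from (1, 0) directly to (3, 0) and then turns towards (3, 1).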

# 3. CONCLUSION

The performance and energy consumption of NoCs in the presence of faults have been well explored by researchers. Knowledge of best practices, i.e., a rigorous study of different design techniques and methodologies, is obligatory for a fruitful implementation of the router. The following techniques can be summarized with respect to fault tolerance and energy consumption in NoC components:

## Core:

Spare cores are placed among the other free non-faulty processing cores [11] to take over when faults occur in a particular core. One approach places the spare cores at the side, as shown in Figure 2 (b). The FARM paper specifies that spare cores can be placed randomly, as shown in Figure 2 (c), whereas dynamic placement of spare cores is also possible, as put forward by Fatemeh Khalili and Hamid R. Zarandi and shown in Figure 2 (d). Weighing these techniques against each other in the presence of faults, dynamic placement of the spare core turns out to be the clear winner: it offers better performance and optimizes the communication energy consumption.



Figure 2. Different spare core placement

## Network Interface:

Error detection and correction can be carried out by the building blocks of the NI. In the LUT, the CAM and RAM lines account for detecting and correcting the errors. In the FIFO queues, the SECDED encoder paves the way for smoother operation of the NI in the case of permanent faults.

## Path:

As noted, faults in the path are classified as temporary and permanent. Error Correcting Codes (ECC) deal with temporary faults in a wire or buffer. One example is a packet encoded with a Hamming code, which allows the decoder to correct errors if the data packet gets corrupted along the transmission path. With respect to permanent faults, the addition of a spare wire between the routers remarkably reduces the power consumption, as examined in [28].
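
As a concrete instance of Hamming-coded error correction, the Python sketch below implements the textbook Hamming(7,4) code, which corrects any single-bit error in a 7-bit codeword. The choice of this particular code and the function names are assumptions for illustration; the surveyed designs may use longer codes or SECDED variants.

```python
from typing import List


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> List[int]:
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # 1-based index of the faulty bit, 0 if none
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

Flipping any one of the seven transmitted bits, which models a temporary fault on a wire, still yields the original four data bits after decoding.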

## Router:

To handle router failures or broken links between a core and its router, one researcher proposed connecting every core to two routers. This allows data transmission between the routers even if the core gets disconnected from one router, and it also addresses core recovery for the NoC architecture, as it contains a backup path for the core.

Much remains to be investigated in the NoC domain to provide low-cost, low-area solutions for applications in the embedded industry.

## ACKNOWLEDGEMENT

This publication is an outcome of the R&D work undertaken in the project under Visvesvaraya PhD scheme, Department of Electronics and Information Technology, Ministry of Communication & IT, Government of India and Media Lab Asia.

## REFERENCES

- [1] Jongman Kim, Chrysostomos Nicopoulos, Dongkook Park, Vijaykrishnan Narayanan, Mazin S. Yousif and Chita R. Das, "A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks", Proceedings of the 33rd International Symposium on Computer Architecture (ISCA 2006).
- [2] Tobias Bjerregaard and Shankar Mahadevan, "A Survey of Research and Practices of Network-on-Chip", ACM Computing Surveys, Vol. 38, March 2006.
- [3] C. Nicopoulos et al., "Network-on-Chip Architectures: ViChaR: A Dynamic Virtual Channel Regulator for NoC Routers", *Lecture Notes in Electrical Engineering 45, Springer Science Business Media B.V.* 2009.
- [4] Teijo Lehtonen et al., "Fault Tolerance Analysis of NoC Architectures", *IEEE International Symposium on Circuits and Systems*, 2007.
- [5] C. Nicopoulos et al., "Network-on-Chip Architectures: RoCo: The Row–Column Decoupled Router – A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks", *Lecture Notes in Electrical Engineering 45, Springer Science Business Media B.V.* 2009.
- [6] Rong Ye and Qiang Xu, "Energy-Efficient Design Techniques", Energy-Efficient Fault-Tolerant Systems, Springer Science+Business Media New York, 2014.
- [7] Mohammad Hosseinabady and Jose L. Nunez-Yanez, "Fault-Tolerant Reconfigurable On-Chip-Network", Energy-Efficient Fault-Tolerant Systems, Springer Science+Business Media New York, 2014.
- [8] Timo Schonwald et al., "Fully Adaptive Fault-Tolerant Routing Algorithm for Network-on-Chip Architectures", 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD 2007).
- [9] Chen-Ling Chou and Radu Marculescu, "FARM: Fault-Aware Resource Management in NoC based Multiprocessor Platforms", Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011.
- [10] Cristinel Ababei and Rajendra Katti, "Achieving Network on Chip Fault Tolerance by Adaptive Remapping", IEEE International Symposium on Parallel & Distributed Processing, IPDPS, 2009.
- [11] Fatemeh Khalili, Hamid R. Zarandi, "A fault-tolerant core mapping technique in networks-on-chip", IET Comput. Digit. Tech. Vol. 7, Iss.6, pp. 238–245, 2013.
- [12] http://grokbase.com/t/lucene/mahout-dev/082festv1e/weighted-manhattan-distance-metric.
- [13] Fatemeh Khalili, Hamid R. Zarandi, "A Fault-Tolerant Low-Energy Multi-Application Mapping onto NoC-based Multiprocessors", IEEE 15th International Conference on Computational Science and Engineering, 2012.
- [14] Srinivasan Murali and Giovanni De Micheli, "Bandwidth-constrained mapping of cores onto NoC architectures", Design Automation and Test in Europe, pp. 896–901, 2004.
- [15] Wooyoung Jang and David Z. Pan, "A3MAP: Architecture-Aware Analytic Mapping for Networks-on-Chip", Design, Automation & Test in Europe Conference & Exhibition (DATE), 2010.
- [16] Onur Derin et al., "Online Task Remapping Strategies for Fault-tolerant Network-on-Chip Multiprocessors", *NoCS* '11, May 1-4, 2011.
- [17] http://www.doc.ic.ac.uk/~br/berc/integerprog.pdf.
- [18] Chao Wang et al., "An Efficient Topology Reconfiguration Algorithm for NoC based Multiprocessor Arrays", IEEE International Conference on High Performance Computing and Communications, 2013.
- [19] Byeong Kil Lee and Lizy Kurian John, "Hardware Acceleration for Media/Transaction Applications in Network Processors", *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, Vol. 17, No. 12, December 2009.
- [20] Shirish Sathaye et al., "FIFO Design for a High-speed Network Interface", Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011.
- [21] Leandro Fiorin et al., "Fault-Tolerant Network Interfaces for Networks-on-Chip", *IEEE Transactions On Dependable And Secure Computing*, Vol. 11, No. 1, January/February 2014.
- [22] Leandro Fiorin et al., "Design of Fault Tolerant Network Interfaces for NoCs", 14th Euromicro Conference on Digital System Design, 2011.
- [23] Luong D. Hung et al., "Utilization of SECDED for Soft Error and Variation-Induced Defect Tolerance in Caches", Design, Automation & Test in Europe Conference & Exhibition (DATE), 2007.
- [24] Ville Rantala et al., "Multi Network Interface Architectures for Fault Tolerant Network-on-Chip", Design, Automation & Test in Europe Conference & Exhibition (DATE), 2009.
- [25] Heikki Kariniemi and Jari Nurmi, "NoC Interface for Fault-Tolerant Message-Passing Communication on Multiprocessor SoC Platform", *Design, Automation & Test in Europe Conference & Exhibition (DATE)*, 2009.
- [26] Anup Das et al., "Fault-Tolerant Network Interface for Spatial Division Multiplexing Based Network-on-Chip", IEEE comp., 2012.
- [27] MU-YUE HSIAO et al., "Application of Error-Correcting Codes in Computer Reliability Studies", IEEE Transactions on Reliability, Vol 1. No.3, August 1969.
- [28] Syed. M.A.H. Jafri et al., "Energy-aware fault-tolerant network-on-chips for addressing multiple traffic classes", *Microprocessors and Microsystems*, vol. 37 (2013) 811–822.
- [29] Amlan Ganguly et al., "Complex Network Inspired Fault-Tolerant NoC Architectures with Wireless Links", NoCS'11, May 1-4, 2011.
- [30] Umit Y. Ogras, and Radu Marculescu., "It's a Small World After All : NoC Performance Optimization Via Long-Range Link Insertion", *IEEE Transactions On Very Large Scale Integration (VLSI) Systems*, Vol. 14, No. 7, July 2006.
- [31] Khalid Latif et al., "Designing a High Performance and Reliable Networks-on-Chip using Network Interface Assisted Routing Strategy", 15th Euromicro Conference on Digital System Design, 2012.
- [32] Adan Kohler, Gert Schley, and Martin Radetzki, "Fault Tolerant Network on Chip Switching With Graceful Performance Degradation", *IEEE Transactions On Computer-Aided Design Of Integrated Circuits And Systems*, Vol. 29, No. 6, June 2010.
- [33] Yung-Chang Chang et al., "On the Design and Analysis of Fault Tolerant NoC Architecture Using Spare Routers", 16th Asia and South Pacific Design Automation Conference, ASP-DAC, 2011.