Comparing reliabilities of centralized and distributed switching architectures for reconfigurable 2D arrays

Received Jun 30, 2020; Revised Aug 15, 2020; Accepted Jan 23, 2021

Whether used as main processing engines or as special-purpose adjuncts, processor arrays are capable of boosting performance for a variety of computation-intensive applications. For large processor arrays, needed to achieve the required performance level in the age of big data, processor malfunctions, resulting in loss of computational capabilities, form a primary concern. There is no shortage of alternative reconfiguration architectures and associated algorithms for building robust processor arrays. However, a commensurately extensive body of knowledge about the reliability modeling aspects of such arrays is lacking. We study differences between 2D arrays with centralized and distributed switching, pointing out the advantages of the latter in terms of reliability, regularity, modularity, and VLSI realizability. Notions of reliability inversion (modeling uncertainties that might lead us to choose a less-reliable system over one with higher reliability) and modelability (a system property that makes the derivation of tight reliability bounds possible, thus making reliability inversion much less likely) follow as important byproducts of our study.


INTRODUCTION
Reconfiguration can be used at system, architecture, and circuit levels to establish alternate or bypass signal paths in order to circumvent parts that have been rendered unusable by defects or malfunctions. We cite just a few examples: defect tolerance for VLSI yield improvement [1], computation-mapping onto an incomplete or damaged FPGA using alternate components and interconnects [2], and building gracefully-degrading array processors [3]. A number of architecture-level methods for allowing a system to adapt to changing conditions can also fall in this category, although some may not be readily recognized as reconfiguration. Note, for example, that replacing a failed functional unit with a spare one upon failure detection is the same as restructuring an (s + 1)-unit "parallel" system by switching out the previously-active module and switching in one of the s spare modules [4].
In the just-cited references, and in the context of this work, the goal of reconfiguration is returning a processor array containing a number of malfunctioning nodes to the original intact configuration, in order to run the existing algorithms with no modifications. Stated differently, we will not consider the use of a damaged array, providing a degraded level of inter-node connectivity, owing to bandwidth limitation, path dilation, or resource reduction [5], which would require computations to be remapped by increasing processor loads and/or link communication traffic. Also, we won't deal with certain kinds of reconfiguration

RECONFIGURABLE 2D ARRAYS
We will use a small 5 × 5 array of processors, with one spare row and one spare column, to illustrate the advantages of distributed switching over a centralized scheme. This arrangement leads to a 6 × 6 host array. From specific results for this small array, we establish general results by demonstrating that the advantage grows when we expand the array size or increase the redundant resources, that is, use more spare rows/columns. Quantifying the added advantages with size and redundancy remains future work; in this paper, those advantages are established only qualitatively.
An example array is depicted in Figure 1, where tracks of embedded reconfiguration switches are placed between processor rows/columns. Four-port, 3-state switches appear as small circles in Figure 1. Figure 2 depicts the details and usage of these switches within the processor array of Figure 1. The actual method of reconfiguration and its capacity to tolerate malfunctions will be discussed in Sections 3 and 4. Figure 2 represents only one way in which processor bypassing capability might be provided by reconfiguration switches for the purpose of diverting signals from their original destinations in the same row or column to alternate nodes, perhaps in other rows/columns. Upon the replacement of a node, either the new node taking over is initialized to the current state of the outgoing node to continue its work, or else the entire computation is repeated from the beginning or from a previously-saved rollback point. Given that the complexity of initialization or restart is independent of the replacement and reconfiguration architecture, we will not discuss it further in this paper.
Some alternate processor-array reconfiguration schemes use more than one track of reconfiguration switches; others employ switches that are more complex than the one depicted in Figure 2. Figure 3 depicts a particular reconfiguration scheme using two switch tracks, embedded between processor rows/columns. The use of more switches or ones with higher complexity presents a reliability tradeoff to the designer, which isn't easy to quantify in general. One reason is that potential benefits of greater flexibility in reconfiguration may be offset by the higher failure rates of more-numerous or more-complex switches.

Figure 3. Two-dimensional array, outfitted with 3-state reconfiguration switches, lined up on two tracks in the gaps between processor rows/columns

In this paper, we limit our discussion to only one track of switches between array rows and columns. Furthermore, we will focus on identical switches, to simplify reliability models and VLSI realization. However, our switches can be arbitrarily complex, given that, in our reliability model, the failure rate of a switch is an independent parameter unrelated to those of other components. With one spare row and one spare column added to an n × n array, the redundancy factor for nodes is (2n + 1)/n² = O(1/n). Increasing the number of spare rows and spare columns to k leads to the redundancy factor (2kn + k²)/n² = O(k/n). In either case, the redundancy factor is relatively low.
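As a quick sanity check on these redundancy figures, the short sketch below (the function name is ours, for illustration only) evaluates (2kn + k²)/n² for the running 5 × 5 example and for a larger array:

```python
def redundancy_factor(n, k=1):
    """Spare-node overhead of an n x n array with k spare rows and k spare
    columns: the host has (n + k)^2 nodes, so (2kn + k^2) of them are spares."""
    return (2 * k * n + k * k) / (n * n)

# 5 x 5 array with one spare row and one spare column (the paper's example):
print(redundancy_factor(5))    # 11/25 = 0.44
# The O(k/n) behavior makes the overhead modest for large arrays:
print(redundancy_factor(100))  # 201/10000 = 0.0201
```

For the small illustrative array the overhead is substantial (44%), but it shrinks as O(k/n), which is the sense in which the paper calls the redundancy factor "relatively low" for large n.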

RECONFIGURATION WITH CENTRALIZED CONTROL
Consider embedding a 5 × 5 guest array within the 6 × 6 host array of Figure 1. In the beginning, the active nodes are within the 5 rows at the top and the 5 columns on the left. As nodes malfunction, the configuration undergoes changes. Figure 4 depicts an array after reconfiguration, with the assumption that, in addition to switches that re-route signals, the array is equipped with a separate mechanism for bypassing a node within the row/column where it is located.
Intuitively, re-routing of signals is accomplished by shifting rows downward (in the direction of the spare row) and shifting columns rightward (in the direction of the spare column), to compensate for unavailable row/column nodes. We offer no discussion of reconfiguration algorithms (see [8], for example). It suffices for our purposes in this paper to note that reconfiguration allows the guaranteed tolerance of up to two malfunctioning nodes. In other words, there exist certain patterns of three unavailable nodes that defeat this particular scheme's reconfiguration capability [4].
Two schemes can be envisaged for the operational control of the reconfiguration switches. In the first scheme, switches may be controlled by a centrally-located unit. Disadvantages of centralization include the control mechanism constituting a single point of failure, the requirement for long wires to connect the control unit to all the switches, and excessive delay in uploading the configuration information upon a node replacement. In the second scheme, switches are paired with nearby nodes, which control them. One drawback of distribution is that a malfunctioning node may not be able to properly control the switch(es) assigned to it, making other processing resources or certain array configurations inaccessible; this would reduce system reliability and increase the effort needed to model all pertinent scenarios.
Let's assess the reliability of the array in Figures 1 and 4, using a simple combinatorial model. We lump the switching mechanism into a hard core and consider the collection of nodes as a 34-out-of-36 system; recall that our scheme guarantees the tolerance of up to 2 bad nodes. Denoting the node failure rate by λ, the switch failure rate by μ, and noting that there are 60 embedded switches within the 6 × 6 array:

Overall reliability = e^(-60μt) × R_34-out-of-36(e^(-λt))   (3)

where R_k-out-of-n(r) is the k-out-of-n reliability for modules having identical reliability r. Simplifying:

R_34-out-of-36(r) = 630r^34 - 1224r^35 + 595r^36   (4)

Substituting (4) into (3) and plotting the system reliability for different switch complexities, reflected in the ratio ρ = μ/λ, leads to Figure 5. For processors with typical complexities, ρ is likely to be near the lower end of the interval of considered values [0.001, 0.1], whereas for bare-bones PEs used in some arrays, ρ may approach, but likely not get very close to, the higher end.
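The centralized-scheme model is easy to evaluate numerically. The sketch below is a minimal illustration, assuming exponential lifetimes with node failure rate λ (`lam`) and switch failure rate μ = ρλ; the function names are ours, not from the paper:

```python
from math import comb, exp

def k_out_of_n(k, n, r):
    """Probability that at least k of n i.i.d. modules of reliability r work."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

def centralized_reliability(t, lam=1.0, rho=0.01, switches=60):
    """Hard-core switching (60 switches in series) times 34-out-of-36 nodes."""
    mu = rho * lam
    return exp(-switches * mu * t) * k_out_of_n(34, 36, exp(-lam * t))
```

For small normalized mission times the `exp(-60*mu*t)` hard-core factor dominates the unreliability, which is why Figure 5 separates so cleanly by switch complexity ρ in that region.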
Unreliability curves in Figure 5 have expected shapes. When t is small, switching is the main source of unreliability, given the extremely high reliability of the processing part, as a 34-out-of-36 system. This observation explains the significant differences between reliabilities for the three cases with varying switch complexities. For large values of t, multiple node malfunctions that exceed the system's tolerance limit dominate, making switching differences much less relevant.

RECONFIGURATION WITH DISTRIBUTED CONTROL
We now shift our focus to a switching scheme in which system configuration is dictated by good processors, through setting the states of their own internal switches. Malfunctioning processors become irrelevant via their outputs not being considered by good processors. The scheme works, as long as all good nodes know the identity of malfunctioning nodes. This isn't an added burden, because such information is needed even in the centralized scheme.
One can implement distributed switching in many different ways. For our comparisons to be fair, we choose a distributed scheme that offers the same reconfiguration capability as the centralized scheme of Figures 1 and 2. Let's consider modules having two built-in 3-input multiplexers (muxes) which allow them to choose their east and north neighbors from among three candidate nodes. Consider, for example, that a processor's north neighbor can be one of three nodes in the preceding row: the node directly above, plus those before and after it in the same row. This is the same as the capability provided by the scheme in Figures 1 and 2, assuming a single spare row and a single spare column.
In this scheme, each node, with its built-in switches, becomes a little more complex, which we model by increasing the node failure rate from λ to λ + σμ (μ is the failure rate of the original switches of Figure 2 and σ is a new parameter reflecting the distribution overhead, that is, modeling the fact that switches become a bit more complex when we distribute them among the nodes). Reliability equations now become:

r′ = e^(-(λ+σμ)t)

Overall reliability = R_34-out-of-36(r′)

We can consider σ = 2 as a reasonably pessimistic value in our running numerical example: the centralized scheme of Figure 1 includes 60/36 ≈ 1.67 switches per node, each 3-state switch realizable with two 2-to-1 muxes. As depicted in Figure 6, two 3-input muxes are needed for each node in the distributed scheme. Figure 7 depicts the unreliabilities of our example array as ρ assumes the values in {0.001, 0.01, 0.1}. Again, the results are as expected. Reconfiguration switches and processing nodes are merged in the distributed scheme, with successful reconfiguration requiring only that 34 of the 36 nodes be healthy. The proper functioning of the switching mechanism is no longer more critical than the health of processors. Put another way, we have eliminated the system's single point of failure.
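The two schemes can be compared side by side under the same exponential-lifetime assumptions. In the sketch below (function names ours; λ, μ = ρλ, and the overhead factor σ follow the text's model), the distributed scheme folds the switch failure rate into the nodes and drops the hard-core factor:

```python
from math import comb, exp

def k_out_of_n(k, n, r):
    """Probability that at least k of n i.i.d. modules of reliability r work."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

def centralized_reliability(t, lam=1.0, rho=0.01, switches=60):
    """60 series switches (hard core) times 34-out-of-36 nodes."""
    mu = rho * lam
    return exp(-switches * mu * t) * k_out_of_n(34, 36, exp(-lam * t))

def distributed_reliability(t, lam=1.0, rho=0.01, sigma=2.0):
    """Nodes absorb the switching: node failure rate grows to lam + sigma*mu,
    and the single point of failure disappears."""
    mu = rho * lam
    return k_out_of_n(34, 36, exp(-(lam + sigma * mu) * t))

# For small mission times the distributed scheme is markedly more reliable:
t = 0.01
print(distributed_reliability(t) > centralized_reliability(t))  # True
```

The comparison mirrors Figures 5 and 7: with no series hard core, the slightly elevated node failure rate of the distributed scheme is filtered through the forgiving 34-out-of-36 function rather than multiplying the reliability directly.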

One may worry about the impact of choosing σ = 2 on the validity of results. Table 1 shows the outcome of a sensitivity analysis on the value of σ, by considering values of the parameter that represent negligible overhead (σ = 1) and more significant overhead (σ = 3). These extreme values of σ serve to confirm that reliability values show little change when adjusting the value of σ.
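The insensitivity to σ can be reproduced directly from the model. A small sweep over the extreme values considered in Table 1 (same assumed model as before: λ = 1, μ = ρλ with ρ = 0.01; function name ours):

```python
from math import comb, exp

def k_out_of_n(k, n, r):
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

def distributed_reliability(t, lam=1.0, rho=0.01, sigma=2.0):
    """34-out-of-36 nodes, each with failure rate lam + sigma * (rho * lam)."""
    mu = rho * lam
    return k_out_of_n(34, 36, exp(-(lam + sigma * mu) * t))

# Sweep sigma over the extremes of the sensitivity analysis:
for sigma in (1.0, 2.0, 3.0):
    print(sigma, distributed_reliability(0.05, sigma=sigma))
```

Because σμ is tiny relative to λ when ρ is small, moving σ between 1 and 3 perturbs the effective node failure rate by only a few percent, and the k-out-of-n function damps even that.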

COMPARATIVE RELIABILITY EVALUATION
Comparing Figures 5 and 7 reveals the advantages of distributed reconfiguration switching with respect to reliability. The advantages are greatest when the value of t is small, that is, in the expected range of operation for high-reliability systems. In order to highlight the reliability differences for a reasonable case of switch complexity (ρ = 0.01), we superimpose the unreliability curves of Figures 5 and 7 to create Figure 8. Please refer to Section 7 for an explanation of the dotted gray curve. Figure 8 tells us that reliability differences disappear as t approaches 1, making centralized and distributed switching indistinguishable with regard to reliability. Note, however, that the region near t = 1 is practically useless, given the extremely low reliability values.

A natural question at this point is whether the results, pertaining to a specific, fairly small array utilizing a specific switching scheme, carry more-general significance. The three claims, with informal justifications, that follow suggest that the results are indeed more general.

Claim 1: Increasing the size of the array, while keeping the same switching scheme and, thus, tolerance level, leads to a growth of the advantage of distributed switching over centralized switching.

Proof outline for Claim 1: Take the reliability-lower-bound expressions for an h × h array, and consider the next-larger, (h + 1) × (h + 1), array. Prove that the reliability lower-bound deterioration for the larger array is less for distributed switching than for centralized switching. Intuitively, the centralized switch-complexity growth directly affects the reliability, while the increase in module complexity as a result of distribution is moderated by passing through the k-out-of-n filter.

Claim 2: Adding spare rows or spare columns causes the advantage of distributed switching over centralized switching to grow, all else being kept the same.
Proof outline for Claim 2: With 2 spare rows and 2 spare columns, the reliability function changes from (n² + 2n − 1)-out-of-(n² + 2n + 1) to (n² + 2n − 3)-out-of-(n² + 2n + 1), because of the tolerance level improving to 4 processor malfunctions. Here, too, the improvement in the k-out-of-n function value dominates the improvement in the centralized scheme, with its increased switch complexity.

Claim 3: Using more-complex switches to improve the malfunction tolerance level causes the advantage of distributed switching over centralized switching to grow.

Proof outline for Claim 3: The justification is quite similar to that of Claim 2. Other than the filtering effect of the k-out-of-n function, distributing the more-complex switching capabilities is beneficial.
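Claim 1 can be spot-checked numerically under the same exponential-lifetime model. In the sketch below, the switch count 2h(h−1) for an h × h host array is our extrapolation from the 60 switches of the 6 × 6 example (not stated in the paper), and the function names are ours; tolerance is held at 2 malfunctions, as the claim requires:

```python
from math import comb, exp

def k_out_of_n(k, n, r):
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

def central(h, t, lam=1.0, rho=0.01):
    """Centralized scheme for an h x h host: assumed 2h(h-1) series switches
    (matches 60 at h = 6), plus (h^2 - 2)-out-of-h^2 nodes."""
    mu = rho * lam
    n = h * h
    return exp(-2 * h * (h - 1) * mu * t) * k_out_of_n(n - 2, n, exp(-lam * t))

def distrib(h, t, lam=1.0, rho=0.01, sigma=2.0):
    """Distributed scheme: switches absorbed into nodes at rate lam + sigma*mu."""
    mu = rho * lam
    n = h * h
    return k_out_of_n(n - 2, n, exp(-(lam + sigma * mu) * t))

# The distributed-over-centralized reliability gap grows with array size h:
t = 0.005
gaps = [distrib(h, t) - central(h, t) for h in (6, 8, 10)]
print(gaps[0] < gaps[1] < gaps[2])  # True
```

The intuition of the proof outline is visible in the numbers: the centralized hard core shrinks as exp(−2h(h−1)μt), roughly quadratically in h in the exponent, while the distributed scheme's extra node complexity is filtered through the k-out-of-n function.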

THE NOTIONS OF MODELABILITY AND RELIABILITY INVERSION
In reliability engineering, the exact reliability of a system is unknowable. If we could observe system failures by running hundreds of identical copies of a system for decades, we could derive the actual reliability with high confidence. We need a large number of copies and extended running times because failure events are extremely rare in practical, highly-reliable systems, so extensive data collection is needed for results to be statistically valid. A practical alternative is to use an analytic or simulation model, involving simple, pessimistic assumptions about subsystems and their interactions, to derive a reliability lower bound. The latter approach becomes mandatory for evaluating a system at design time, when a physical implementation isn't available. The need for experimentation still exists, even when using models. Parameters for such models are estimated, and the models are corrected/tuned, with help from experimental observations.
The nature of modelability is similar to (design for) analyzability, or "design for analysis" [9], a notion along the same lines as design for manufacturing (manufacturability). For electronic circuits, design for packageability [10] is rather similar. Analyzability can be ensured by insisting that certain design constraints be honored so as to allow the use of simpler tools for analysis. The notions just listed constrain the design process, but, surprisingly, lead to cost reduction and faster development. Modelability makes it easier to derive tighter lower bounds for reliability. Figure 9 compares Systems A and B with respect to their (unknowable) reliabilities and the corresponding model-based lower bounds, revealing System B to be more modelable than System A. Even though modelability is a qualitative notion at this juncture, it can certainly be quantified with further work (which is in the planning stage), in the same way that testability, serviceability, and several other "ilities" were quantified, after starting out as qualitative notions.
To recap, obtaining guaranteed lower bounds on a system's survival probability requires that reliability analysis be based on worst-case assumptions. Reliability experts are aware of the desirability of tight lower-bounds, but sometimes system structure makes it difficult to derive tight bounds. Consider System B with (unknowable) actual reliability of 0.997 and a computed reliability lower bound of 0.995 for mission time t0. Similarly, System A with (unknowable) actual reliability of 0.999 has a computed reliability lower bound of 0.993 for the same mission time. Figure 9 depicts the situation above. Unknowable actual reliabilities force us to make decisions based on lower bounds, leading to the pronouncement that B is more reliable than A at t0. This condition is referred to as reliability inversion [11], in analogy to "priority inversion" in the scheduling of real-time tasks [12], which caused serious trouble during the 1997 Mars Pathfinder mission [13].
Paying greater attention to Figure 9, we see that each of the systems A and B is preferable to the other system for certain ranges of mission time. However, these nuances are hidden from us due to unknowable actual reliabilities. Computed reliability bounds proclaim System B as uniformly preferable to System A with regard to reliability. All of the results reported in this paper should be extensible to alternative reconfiguration architectures and algorithms proposed more recently [14][15][16][17].

CONCLUSION
An important conclusion of our work is that two properties tend to make processor arrays with distributed reconfiguration switching preferable to those with centralized switching: inherently higher reliability and better modelability, the latter property allowing us to use simple combinatorial models to derive tight reliability lower bounds. The author has previously shown that, under certain unusual conditions, centralized switching can exhibit higher reliability (the dashed line in Figure 8), even though its pessimistically-derived reliability bound is lower. Future research must address several remaining problems, as pointed out in the preceding sections. First, Claims 1-3 of Section 5 show that certain changes in array configuration improve the system reliability. We shall aim to quantify these improvements, either by constructing expanded formal proofs or by deriving reliability formulas for different array configurations and redundancy factors. Second, we shall move in the direction of quantifying the modelability notion, thus making it usable for comparative assessment of competing system designs or for producing design guidelines and methodologies. Third, we may look into extending our results to alternative reconfiguration architectures and algorithms proposed more recently, so as to produce results that enjoy broader applicability.