Cost-efficient reconfigurable geometrical bus interconnection system for many-core platforms

ABSTRACT


INTRODUCTION
Multi-core processor designs are now seeing a big shift [1] towards many-core computing [2]. Many-core processor systems are now trending as a platform for massive parallel computing [3] and have also been used as co-processors for multi-core systems [4]. In recent years, we have seen an emergence of extreme computing [5] for big data. Some architectural models for many-core processor systems [6, 7] and performance improvement techniques [8] for multi-core have been developed. Many-core computing is gaining interest for artificial intelligence (AI) based defense applications [9]. Edge computing [10] is a relatively new paradigm in which computational resources are placed at the edge of the network. Use of multi-core is gaining interest for edge computing [11], as data transfer between cores is power demanding and requires a very complex connection infrastructure. With increased very large-scale integration (VLSI) density, heterogeneous many-core based chip multiprocessors (CMP) [12] for big data are on the rise. For these applications running on many-core platforms, we require cost-efficient on-chip interconnection.

SYSTEM ARCHITECTURE AND CONFIGURATIONS

Generalized reconfigurable many-core system platform
From the earlier work on multiple bus systems [30], it has been observed that by using a number of buses equal to one-half of the number of memory-modules or processors, we can achieve a memory bandwidth within 25 % of the crossbar bandwidth. For complete bus connections, all memory-modules are connected to all buses. An increase in buses incurs a high number of bus connections and cost. The earliest work on reduced bus connection schemes [31] presented general theorems and properties for the scheme. With a reduction in the number of buses and bus connections, we propose a reconfigurable many-core platform with n cores, m memory-modules, b buses and bus reduction factor k. Figure 1 shows the system architecture, which includes a bus cache at each bus line and reconfigurable control.
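A reduced-connection rhombic pattern can be sketched as a banded connectivity matrix in which each bus reaches (m − b + 1) consecutive memory-modules, matching the lower bound from [31]. The banded layout and the function name are illustrative assumptions, not the exact pattern of [29]:

```python
# Sketch of a rhombic-like bus-to-memory connectivity matrix. The banded
# pattern (bus i covering the m - b + 1 consecutive memory-modules
# starting at index i) is an assumption for illustration, based on the
# lower bound from [31].

def rhombic_connectivity(m, b):
    """Return conn[i][j] = True if bus i connects memory-module j."""
    width = m - b + 1  # lower bound on modules per bus [31]
    conn = [[False] * m for _ in range(b)]
    for i in range(b):
        for j in range(i, i + width):  # consecutive band starting at bus i
            conn[i][j] = True
    return conn

conn = rhombic_connectivity(m=16, b=8)
# Each bus connects exactly m - b + 1 = 9 memory-modules,
# versus 16 per bus for complete bus connections.
assert all(sum(row) == 9 for row in conn)
```

With m = 16 and b = 8 this yields 72 bus connections instead of the 128 required by complete connections, illustrating where the cost savings come from.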

Geometrical bus interconnection configurations
In our earlier work [29], we complemented [31] and presented a generalized system architecture and characterization. This paper expands our earlier work [29] and proposes four distinct geometrical bus configurations: i) Group Rhombic 2 (GR2), ii) Group Rhombic 4 (GR4), iii) Hierarchical Rhombic (HR) and iv) Quadrant Rhombic (QR). We provide a comprehensive system characterization for these configurations. Although we stated a general bus reduction factor k in (1), we use k = 2 throughout this paper. We chose rhombic as the geometrical pattern base to define these four configurations, as rhombic was shown to be the most cost-effective topology [29]. However, in general, any of the other geometrical patterns presented in [29] could be used as a base. In Figure 2, Figure 3 and Figure 4, all connections marked as "x" refer to the buses numbered on the left.
− GR2: Memory-modules and buses are divided into two groups connected in a rhombic pattern. Figure 2 shows the configuration. All processor cores are connected to all buses.
− GR4: Memory-modules and buses are divided into four groups connected in a rhombic pattern. Figure 3 shows the configuration. All processor cores are connected to all buses.
− HR: Memory-modules and cores are connected in a hierarchical bus system [32] with modifications. Cores are connected in two groups in level 1 and memory-modules are connected in a rhombic pattern in level 2.
Processor core and memory buses are interconnected. Figure 4 shows both memory-module and processor core connections.
− QR: Memory-modules and buses are divided into four quadrants, with each quadrant connected in a rhombic pattern. Figure 5 shows the memory-module connections.
Group and quadrant rhombic configurations are tightly coupled, as both core and memory bus connections are on the same set of buses. Hierarchical rhombic is a loosely coupled configuration since it has separate core and memory bus connections. It was shown in [31] that for any rhombic connection, the lower bound on the number of memory-modules connected to each bus is (m − b + 1) in order to assign buses to memory-modules. Figures 2 through 5 show these configurations. M1-M16 are memory-modules and P1-P8 are processor cores. We explicitly show processor core connections only for the hierarchical configuration. Table 1 summarizes the total cost of interconnections. The first term is the processor core connection cost, the second term is the memory connection cost and the third term is the core-to-memory bus inter-level connection cost for hierarchical rhombic. As observed from Table 1, group rhombic has a reduced number of bus connections compared to regular rhombic [29]. Quadrant rhombic has a slightly higher number of bus connections than rhombic but is inherently fault tolerant to critical bus faults (explained in section 3.6). All four configurations have a reduced number of bus connections compared to complete bus connections. Figure 6 shows the average cost savings, which range from 24 % to 42 % across all system sizes.

Figure 6. Average cost savings

System reconfiguration
The Reconfiguration Control Register (RCR) shown in Figure 7 facilitates reconfiguration of the system for a plurality of processor cores, memory-modules and buses connected to the system and has the following functions:
− Switches SP, SM and SB: Connect cores, memory-modules and buses to the system, respectively.
− Registers SPR, SMR and SBR: Control SP, SM and SB, respectively. For example, if SP(i) = 1, then core i is connected to the system. Similarly, SM(j) = 1 connects memory j and SB(k) = 1 connects bus k to the system.
− Registers SCPR and SCMR: Control the reconfiguration of the interconnection to connect a core or memory to a specific bus. For example, memory "1" is connected to bus "1" for SCM(1,1) = 1.
− On a single bus fault, the interconnection is reconfigured and the SCPR, SCMR and SBR registers are updated. Reconfiguration is also performed for group rhombic to add connections.

Figure 7. System reconfiguration control register (RCR)

Table 2 gives an illustrated RCR for bus "2" for GR2 and QR.
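The RCR state can be modelled as bit vectors and matrices. The register names below mirror those in the text (SPR, SMR, SBR, SCPR, SCMR), but the class and its methods are a hypothetical sketch, not the hardware design:

```python
# Minimal software model of the Reconfiguration Control Register (RCR).
# The class and method names are illustrative; only the register roles
# (SPR/SMR/SBR for connection, SCPR/SCMR for bus assignment) come from
# the text.

class RCR:
    def __init__(self, n_cores, n_mems, n_buses):
        self.SPR = [0] * n_cores   # SP(i) = 1 connects core i
        self.SMR = [0] * n_mems    # SM(j) = 1 connects memory j
        self.SBR = [0] * n_buses   # SB(k) = 1 connects bus k
        # Core-to-bus and memory-to-bus interconnection control.
        self.SCPR = [[0] * n_buses for _ in range(n_cores)]
        self.SCMR = [[0] * n_buses for _ in range(n_mems)]

    def connect_memory(self, j, k):
        """Connect memory j to bus k, e.g. SCM(1,1) = 1 in the text."""
        self.SMR[j] = 1
        self.SBR[k] = 1
        self.SCMR[j][k] = 1

    def bus_fault(self, k):
        """On a bus fault, detach bus k and clear its interconnections."""
        self.SBR[k] = 0
        for row in self.SCMR:
            row[k] = 0
        for row in self.SCPR:
            row[k] = 0

rcr = RCR(n_cores=8, n_mems=16, n_buses=8)
rcr.connect_memory(j=0, k=0)  # memory "1" on bus "1" (0-indexed here)
```

After a `bus_fault`, the reconfiguration step would re-run `connect_memory` with a surviving bus, which is the register update described above.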

GEOMETRICAL BUS INTERCONNECTION VALIDATION AND ARBITRATION
When a random bus is assigned for geometrical bus configurations, some memory requests may not complete in the current memory cycle because no bus connection to the memory exists; as a result, specific bus arbitration is required. For assigning distinct buses to memory-modules, we require that the number of memory-modules connected to each bus for a rhombic connection is equal to (m − b + 1) [31]. This is the lower bound condition. We validate this theorem for our proposed group and quadrant configurations using the corollaries in the following section.
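The lower bound condition can be checked mechanically on a connectivity matrix. This helper is a minimal sketch (the matrix representation and function name are assumptions):

```python
# Check of the lower bound condition from [31]: to assign b distinct
# buses, each bus must connect at least (m - b + 1) memory-modules.
# `conn` is a hypothetical b x m boolean matrix.

def satisfies_lower_bound(conn, m, b):
    """True if every bus meets the (m - b + 1) lower bound."""
    return all(sum(row) >= m - b + 1 for row in conn)

# Complete bus connections trivially satisfy the bound.
complete = [[True] * 16 for _ in range(8)]
assert satisfies_lower_bound(complete, m=16, b=8)
```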

Geometrical bus interconnection configuration validation
Let N(i) be the number of successful distinct bus connections in each group i for i = 1, …, g where g = 2 or 4, or in each quadrant i for i = 1, …, 4.
− Corollary 1: For group rhombic configurations, if a bus is connected to fewer than (m − b + 1) memory-modules, then we cannot assign b distinct buses; in this case, additional connections are needed.
Proof: For b requests, group rhombic satisfies the lower bound condition locally (within a group). However, as not all memory-module connections exist in each group, Σ_{i=1}^{g} N(i) < b. Hence, the lower bound condition for the overall interconnection will not be satisfied, and additional connections are required.
− Corollary 2: Let A = {1, 2, …, b/2} and B = {(b/2)+1, …, b}. For b memory requests, if b/2 memory requests ∈ A and b/2 memory requests ∈ B, then b distinct buses can be assigned for GR2. This can be extended to GR4 with sets A, B, C, D of b/4 memory requests each mapped to each group.
Proof: In group rhombic, with g groups, b/g rhombic connections exist in each group, giving Σ_{i=1}^{g} N(i) = g · b/g = b. In this case, the lower bound is satisfied in each group and b distinct buses can be assigned.
− Corollary 3: For b memory requests, if b/2 requests ∈ A and b/2 requests ∈ B, then b distinct buses can be assigned in quadrant rhombic.
Proof: For quadrant rhombic (see Figure 5), the lower bound condition is applied to the (m/2 − b/2 + 1) memory-modules connected to each quadrant. These connections are contributed by one quadrant each from the left half quadrants (b/2 connections) and the right half quadrants (b/2 connections). In total, we can assign b distinct buses to memory-modules.

Geometrical bus arbitration algorithm for hierarchical and quadrant rhombic
Let A be the set of memory requests sorted in ascending order and let M be a memory-module in A. The algorithm searches for a bus in order for every M in set A and grants the bus if it is connected to M. If the bus number exceeds b, then the next bus is reset to "1" and the search continues. For hierarchical rhombic, we first arbitrate the core bus; the memory bus arbitration is given as:
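The search-and-wraparound arbitration described above can be sketched as follows. Buses are 0-indexed here (the text numbers them from 1), and the function name and the distinct-bus bookkeeping are illustrative assumptions:

```python
# Sketch of the memory-bus arbitration: requests are served in ascending
# order, each request is granted the next connected free bus, and the
# search wraps back to the first bus after passing bus b.
# conn[i][j] = True means bus i reaches memory-module j.

def arbitrate(requests, conn, b):
    granted = {}           # memory-module -> assigned bus
    busy = [False] * b     # buses already granted this cycle
    bus = 0
    for mem in sorted(requests):       # requests in ascending order
        for _ in range(b):             # at most one full wraparound
            if conn[bus][mem] and not busy[bus]:
                granted[mem] = bus
                busy[bus] = True
                bus = (bus + 1) % b    # continue from the next bus
                break
            bus = (bus + 1) % b        # past bus b wraps to bus "1"
    return granted
```

With complete bus connections every request is granted a distinct bus; with reduced connections some requests may go ungranted in the cycle, matching the need for arbitration noted above.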

Geometrical bus arbitration algorithm for group rhombic
Let A be the set of requests, M a memory-module in set A, and i = 0 initially. Let S be the number of successful memory-to-bus connections, with S = 0 initially. Let N be the connectivity defined as before. The algorithm searches for a bus for every M in set A and grants the bus if it is connected to M. If the bus number exceeds b, then the next bus is reset to bus "1" and the search continues. The memory bus arbitration is given as:
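The group rhombic variant additionally counts successful connections (S in the text) so that a shortfall signals that added connections and reconfiguration are needed. A minimal sketch, with a simplified two-group connectivity assumed for the usage example:

```python
# Sketch of group-rhombic arbitration: as buses are granted, S counts
# the successful memory-to-bus connections; S < |requests| signals that
# added connections (R) and reconfiguration are required.

def arbitrate_group(requests, conn, b):
    granted, S = {}, 0
    busy = [False] * b
    bus = 0
    for mem in sorted(requests):
        for _ in range(b):             # at most one full wraparound
            if conn[bus][mem] and not busy[bus]:
                granted[mem] = bus
                busy[bus] = True
                S += 1
                bus = (bus + 1) % b
                break
            bus = (bus + 1) % b
    return granted, S < len(requests)  # (grants, needs added connections)

# Simplified GR2-like split: buses 0-3 reach modules 0-7, buses 4-7
# reach modules 8-15 (an illustrative assumption, not the exact pattern).
conn = [[j < 8 for j in range(16)] for _ in range(4)] + \
       [[j >= 8 for j in range(16)] for _ in range(4)]
```

A balanced request pattern (half per group, as in Corollary 2) is fully granted; five requests crowded into one group leave one ungranted, flagging the need for added connections.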

Geometrical bus arbitration simulation
We conducted extensive geometrical bus arbitration simulations for all system sizes and configurations to verify the algorithm. Figure 8 shows the assigned buses to memory (shaded) for the first simulation. Figure 9 shows the bus assignment (color shaded) for simulation 2, indicating that the added connection R on reconfiguration is required to grant buses to the memory requests. Simulation 3 is shown in Figure 9 (color bolded) for favorable memory requests; no additional connections are required and buses can be assigned.

Memory bandwidth
From the multiple-bus bandwidth analysis [30], with random bus arbitration, the memory bandwidth with a reduced number of buses (b = m/2) and complete bus connections is given by:

The first term in (2) is the crossbar bandwidth; the second term is the reduction in bandwidth due to the reduced number of buses. As the number of bus connections is reduced for the geometrical bus configurations, the bandwidth in (2) is further reduced by:

where N_i is the number of memory-modules connected to bus i and (m − N_i) is the number of memory-modules not connected to bus i. As an equal number of memory-modules is connected to each bus in a rhombic topology [29], the second term in equation (4) gives the average number of memory-modules not connected to any bus,
where BW_G is the memory bandwidth using the geometrical bus configuration with random bus arbitration.
We analytically derive the reduction in bandwidth in (4) using the lower bound condition (m − b + 1) [31].
− Corollary 5: When a bus is arbitrated randomly for the hierarchical and quadrant geometrical bus configurations, the reduction in bandwidth is equal to (b − 1)/2 for N1 memory-modules not connected to a bus.
Proof: For rhombic based connections, we require N1 ≤ m − (m − b + 1), i.e., N1 ≤ (b − 1). This sets a threshold of (b − 1) on N1 for the reduction in bandwidth. The average number of memory-modules not connected to a bus is then equal to (b − 1)/2. The average number of memory-modules not connected to any bus is equal to b(b − 1)/m, which reduces to (b − 1)/2 for m = 2b. The reduction (b − 1)/2 is also equal to the second term in (4). In general, the memory bandwidth with random bus assignment for g groups is given by (5); we see that the reduction in bandwidth increases with g. For hierarchical and quadrant rhombic, we can apply g = 1 as a special case of equation (5), which yields the reduction of (b − 1)/2 as stated in corollary 5.
− Corollary 6: For hierarchical and quadrant configurations, with geometrical bus arbitration, the reduction in bandwidth in (5) is nullified.
Proof: For N2 memory-modules connected to each bus, assigning distinct buses to memory-modules requires N2 ≥ (m − b + 1). For m = 2b, this gives N2 ≥ b + 1. As N2 > b − 1, the reduction in (5) is nullified, giving the effective bandwidth of (2). However, for group rhombic, there still exists some small reduction in bandwidth as N2 < b − 1. But if there is a favorable memory request pattern, then each group locally satisfies the lower bound condition [31], yielding N2 > b − 1 and thus assigning distinct buses to memory-modules. For complete bus connections, we can deduce N2 = m = 2b. As N2 > b − 1, the reduction in bandwidth is nullified, validating that the bandwidth for the multiple bus system [30] with complete bus connections is given by (2).
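The bandwidth loss under random bus assignment can also be estimated by simulation: a request fails when the randomly assigned bus does not reach its memory-module. This Monte Carlo sketch uses a banded rhombic-like connectivity as an illustrative assumption, not the exact model of [30] or [31]:

```python
import random

# Monte Carlo sketch of bandwidth loss under random bus assignment:
# each of b distinct requests is paired with a random distinct bus, and
# a request is granted only if that bus reaches its memory-module.

def avg_granted(conn, m, b, trials=5000, seed=1):
    rng = random.Random(seed)       # fixed seed for reproducibility
    total = 0
    for _ in range(trials):
        mems = rng.sample(range(m), b)    # b distinct memory requests
        buses = rng.sample(range(b), b)   # random distinct bus per request
        total += sum(conn[i][j] for i, j in zip(buses, mems))
    return total / trials

m, b = 16, 8
complete = [[True] * m for _ in range(b)]
# Banded rhombic-like pattern: bus i reaches modules i .. i + (m - b).
banded = [[i <= j <= i + (m - b) for j in range(m)] for i in range(b)]
assert avg_granted(complete, m, b) == b   # no loss with complete connections
assert avg_granted(banded, m, b) < b      # random assignment loses grants
```

With geometrical arbitration replacing random assignment, the granted count returns to b when the lower bound condition holds, as Corollary 6 states.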

Bus load and bus fault tolerance
Bus load is the number of memory-modules connected to each bus. As we increase the bus load, the capacitive loading on the bus increases; a point is eventually reached where the speedup with multiple processors saturates. Memory load is the number of buses connected to a memory-module and dictates the bus fault tolerance. If a bus i ∈ B fails, then it forces the lower bound to (m − b + 2) [31]. We look at two specific bus fault conditions for hierarchical and quadrant connections:
− Critical buses: Buses with a memory load of "1". Bus "1" and bus "b" are critical buses. If a critical bus fails, then memory-module "1" or "m" is completely disconnected. The remedy is to provide an additional connection R to critical buses. Figure 10 shows the bus assignment (color shaded) for a critical bus fault with a memory request {M1, M5, M6, M7} for quadrant rhombic. In this case, memory-module "1" is not disconnected due to the added connection R.
− Non-critical buses: Buses with a memory load > 1 are non-critical buses. When there is a fault on a non-critical bus, the memory bandwidth is degraded to (b − 1); but no memory-module is disconnected and a bus connection always exists for all memory-modules.
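Critical buses can be identified directly from the connectivity matrix: a bus is critical when it is the only bus serving some memory-module. A minimal sketch, using a banded rhombic-like connectivity as an illustrative assumption:

```python
# Sketch of critical-bus detection: a bus is critical when some
# memory-module it serves has a memory load of "1", i.e. that module
# loses its only bus if the bus fails.

def critical_buses(conn, m, b):
    critical = set()
    for j in range(m):
        serving = [i for i in range(b) if conn[i][j]]
        if len(serving) == 1:          # memory load of "1"
            critical.add(serving[0])
    return critical

m, b = 16, 8
# Banded rhombic-like pattern: bus i reaches modules i .. i + (m - b).
banded = [[i <= j <= i + (m - b) for j in range(m)] for i in range(b)]
# Only the first and last bus are critical, matching bus "1" and bus "b".
assert critical_buses(banded, m, b) == {0, b - 1}
```

Adding a connection R for the end modules (the remedy above) would empty this set.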
Group rhombic is less bus fault tolerant, as the number of critical buses increases with the number of groups; the number of critical buses is 2g. Non-critical bus faults can disconnect 2(m/g − b/g + 1) memory-modules. However, the added connections can sustain these faults. Even with the increase in cost for the added connections, the group interconnection is still cost effective.

PERFORMANCE CHARACTERIZATION
In this section, we present the results of our system characterization in terms of cost per bandwidth, cost per degraded bandwidth and system throughput with bus cache. Table 3 shows the memory bandwidth of the geometrical bus interconnection, where col 1 represents the bandwidth with random bus arbitration from eq. (4). For group rhombic, col 2 represents the bandwidth with no added connections and col 3 represents the bandwidth with added connections (both shaded), using the geometrical bus arbitration algorithm in section 3.3. We determined the average bandwidth over 100 iterations of random memory requests. We obtained a 30 to 50 % reduction in memory bandwidth with group rhombic when the algorithm of section 3.3 is applied without any added connections. However, as we added bus connections, the bandwidth increases to (b − 1). For HR and QR, col 2 represents the bandwidth from the algorithm of section 3.2, giving the same bandwidth as with complete bus connections (row 1).

Table 4 shows the cost per bandwidth for all system sizes. For group rhombic, col 1 and col 2 represent the results with no added connections and with added connections (both shown shaded), and col 3 represents the cost per bandwidth for favorable memory requests. Figure 11 shows the average cost per bandwidth across all system sizes. The reduction in average cost per bandwidth varies from 1.3x to 1.8x compared to complete bus connections.

Figure 11. Average cost per bandwidth

Cost per degraded bandwidth
We ran extensive simulations to determine the degraded bandwidth on a single bus fault. Table 5 shows the average cost per bandwidth (col 1) and cost per degraded bandwidth (col 2). Figure 12 shows the average increase in cost per bandwidth (from col 1 to col 2 in Table 5) across all system sizes. The average percentage increase in cost per bandwidth due to bandwidth degradation varies from 3.6 % to 4.8 %.

Effective system throughput with bus cache
As the number of processor cores on a chip multiprocessor increases, there is always a challenge to provide adequate interconnection bandwidth. Use of a multi-level cache can increase the system throughput. However, use of a large number of fast on-chip private core caches increases the system cost. A much slower shared bus cache placed on every bus line can optimize the overall system cost and reduce the average memory access time, leading to lower clocks per instruction (CPI). The hit ratio of the bus cache can be given as:

where BW_c is the crossbar bandwidth [30] and BW_G is the bandwidth using the geometrical bus (Table 3). Table 6 shows the effective system throughput with bus cache. In Table 6, col 1 corresponds to the throughput using the hit ratio h from (6) and col 2 represents the throughput with h increased by 15 % from col 1. As observed in Table 6, we see an increase in the effective throughput of the system when the bus cache hit ratio is increased by 15 %. Figure 13 shows the average throughput for each system size. We observe that the average throughput is higher for hierarchical and quadrant rhombic compared to group rhombic. However, as the system size increases, the difference in throughput between hierarchical/quadrant rhombic and group rhombic reduces, suggesting an advantage in using a bus cache for group rhombic as the system size increases.

Figure 13. Average system throughput with bus cache
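The effect of the bus cache on access time can be illustrated with the standard hit-ratio model. The linear model and the timing values below are illustrative assumptions only; the paper's eq. (6) ties h to the crossbar and geometrical bandwidths:

```python
# Hedged sketch of the bus-cache effect: effective memory access time
# falls as the hit ratio h rises, so throughput rises accordingly.
# t_cache and t_mem are assumed cycle counts, not figures from the paper.

def effective_access_time(h, t_cache, t_mem):
    """Average access time with bus-cache hit ratio h."""
    return h * t_cache + (1 - h) * t_mem

base = effective_access_time(0.60, t_cache=2, t_mem=20)
boosted = effective_access_time(0.60 * 1.15, t_cache=2, t_mem=20)  # h up 15 %
assert boosted < base  # higher hit ratio lowers access time, raising throughput
```

This mirrors the Table 6 observation that a 15 % higher hit ratio raises effective throughput.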

Estimated cost comparisons to NoC CLOS network
We generically compared the cost of the geometrical bus configurations with a circuit switched NoC [16] based on a CLOS network [33] using the same system size. In the CLOS based NoC [16], the network is organized as three lanes (input, middle and output) consisting of crossbar switches. For example, a 20 x 20 circuit switched router has five 4 x 4 switches in the input lane, four 5 x 5 switches in the middle lane and five 4 x 4 switches in the output lane. We assumed a relative cost increase of 4x for each switch input/output in the CLOS NoC compared to the geometrical bus connection switch, due to network layers and buffers. We used a switch input unit cost increase of 2x for the regular crossbar compared to the geometrical bus connection switch, due to the larger switch matrix at each cross point. Table 7 and Figure 14 show the estimated cost comparisons. We observed an average reduction of 2x in the estimated cost of the geometric bus interconnection compared to the CLOS based circuit switched NoC across all configurations.

Figure 14. Estimated average cost reduction factor with circuit switch router NoC
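The CLOS crosspoint count for the 20 x 20 example above can be worked out directly; the 4x relative weight per crosspoint follows the assumption stated in the text, while the helper function itself is illustrative:

```python
# Cost-model sketch for the 20 x 20 CLOS example: five 4x4 input
# switches, four 5x5 middle switches and five 4x4 output switches.

def clos_crosspoints(lanes):
    """lanes: list of (switch_count, rows, cols) per lane."""
    return sum(count * rows * cols for count, rows, cols in lanes)

xp = clos_crosspoints([(5, 4, 4), (4, 5, 5), (5, 4, 4)])
assert xp == 260          # 80 + 100 + 80 crosspoints
clos_cost = 4 * xp        # assumed 4x relative unit cost per crosspoint
```

Comparing `clos_cost` against the bus-connection counts of Table 1 is what produces the roughly 2x average cost reduction reported above.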

Summary of results
In summary, our results are as follows:
− Group rhombic offers the best average cost savings (36 % to 42 %). For non-favorable memory requests, additional connections are required to achieve the same memory bandwidth as (2). Even with the added connections, group rhombic still offers the best average cost savings, making it a good choice for higher system sizes.
− We achieved a 1.5x reduction in cost per memory bandwidth with group rhombic across all system sizes.
− Quadrant and hierarchical rhombic achieve the same memory bandwidth without requiring any additional connections, and we achieved a 1.3x reduction in cost per memory bandwidth. System throughput is higher for hierarchical and quadrant rhombic compared to group rhombic with added connections. However, the throughput difference between hierarchical/quadrant rhombic and group rhombic reduces as the system size increases.
− The average cost reduction for the geometrical bus interconnection compared to the CLOS based NoC is 2x, varying from 1.8x to 2.4x.
− The added connections in group rhombic also allow for good fault tolerance for critical and non-critical bus faults.
− The optimum configuration selection is to use the inherently fault tolerant quadrant rhombic for small system sizes (less than 32 cores) and group rhombic for larger system sizes (greater than 32 cores) to take advantage of the cost reduction at higher system sizes.

CONCLUSION AND FUTURE RESEARCH
For a moderate number of small-sized processor cores, from 16 to 128 (many-cores), a few cost-effective geometrical bus interconnection configurations with a reduced number of buses and a reduced number of bus connections may serve as a cost-effective solution for on-chip interconnection. These configurations may provide an overall system cost advantage for many-core platforms. Today's VLSI density advantage plays a major role in the implementation economies of the interconnection. We achieved low cost per bandwidth and good bus fault tolerance, with bandwidth degraded by less than 5 % across all geometrical bus configurations. By placing a small bus cache on each bus line, we achieved an increase in the overall system throughput. From our results, we conclude that quadrant rhombic is the best option for smaller system sizes (≤ 32 processor cores) and group rhombic is a better option for larger system sizes (> 32 processor cores).
For further research, our first plan is to extend the work to present a comprehensive interconnection system simulation combining on-chip multi-level cache that includes the bus cache. Secondly, we will investigate hybrid approaches combining the geometrical bus interconnection with circuit switched routers for better scalability. Thirdly, we anticipate using machine learning to capture parallel application affinity for the selection of geometrical bus interconnect configurations, and to exploit any locality of memory references in the program that may benefit from group connections. Finally, we plan to prototype the system on one or more FPGAs and assess the overall energy efficiency.