An efficient multi-level cache system for geometrically interconnected many-core chip multiprocessor

ABSTRACT


INTRODUCTION
In recent years, many-core chips are trending as an on-chip computing platform [1]-[3] that can provide massive computational power for heterogeneous computing environments for big data [4] and other compute-intensive embedded artificial intelligence applications [5]. Some recent work [6]-[9] on high-performance computing for big data has focused on processing frameworks, architecture synthesis, and utilization of multiple cores. With increased very large-scale integration (VLSI) density, it may still be manageable to provide heterogeneous computing using a cost-effective on-chip interconnection and cache memory system. Past research on bus-based interconnection for large parallel processing systems [10] determined that a regular multiple-bus interconnection using a number of buses equal to one-half of the cores or memory modules gives comparable memory bandwidth. However, such a reduced bus interconnection is costly for a chip multiprocessor (CMP) due to the large number of bus-core/memory connections. In our earlier research, we proposed a cost-effective interconnection using geometrical patterns for bus-core/memory connections [11] with a reduced number of buses. The approach in [11] was extended to a system-level configuration defined with three geometrical system configurations, termed geometrical bus interconnection (GBI) [12], for bus-memory connections using a rhombic connection pattern as the base. We achieved cost savings of 1.8 to 2.4 with GBI compared to the regular reduced bus interconnection. However, as the overall throughput of a many-core CMP is also determined by cache system performance, achieving high overall CMP throughput with a cost- and performance-efficient interconnection and cache system is highly desirable today.
Providing an adequate and sustained many-core CMP throughput becomes more challenging as it also requires an efficient cache system solution. Toward this challenge, our focus is to present a cost-effective multi-level cache system to improve the overall many-core CMP throughput using the comparable memory bandwidth results from the cost-effective GBI [12]. A typical multi-level cache hierarchy for multi-core systems, as shown in Figure 1, has private L1 and L2 caches per core at levels 1 and 2, and a shared L3 cache as the last level cache (LLC) at level 3. For example, some current mainstream commercial multi-core processors such as the Intel® Core™ i5 have three levels of cache: a per-core L1 with separate instruction and data caches, a per-core unified (instruction/data) L2 cache, and a shared L3 cache as the LLC (shared by all cores).

Figure 1. Traditional multi-level cache system with L1, L2 and L3 for multi-core CMP

Adding a large number of fast on-chip per-core private L1 and L2 caches with a shared L3 may increase cache system cost. As a result, we propose an alternative solution that combines L1 with a relatively slower shared bus cache (SBC) as the LLC, added to every bus line of GBI [12], in which the data requests of all cores are shared via GBI. In addition, our proposed cache system solution may also provide the ability to increase the cache levels and sizes within the cache hierarchy upon cache reconfiguration in order to optimize the system for cost, performance, and power consumption. Some earlier research [13]-[16] has addressed various cache system architectures, issues, and solutions for improved performance. In [13], the authors analyzed memory performance for tiled many-core CMPs. Lin et al. [14] suggested hybrid cache systems that included layers for cache architecture from memory to database to improve performance for specific relational database queries in big data applications. Charles et al.
[15] looked at cache reconfiguration for network-on-chip (NoC) based many-core CMPs. Safayenikoo et al. [16] suggested an energy-efficient cache architecture to address the problem of increased leakage power resulting from the large area of the LLC (as much as 50% of the chip area) due to its increased size. Most of the work reported in [13]-[16] may require a complex cache design process. Our proposed cache system solution is simple and does not add any extra or difficult cache design process. Our main contributions in this paper are as follows: i) propose a shared bus cache (SBC) within a multi-level cache system; ii) present a least recently used (LRU) multi-level cache system simulation to extract hit and miss concurrencies; iii) apply concurrent average memory access time (C-AMAT) [17] to accurately determine the system throughput performance and present our results; and iv) provide conclusions and some insight into future research.

L1-SBC CACHE SYSTEM
Figure 2 shows a system with L1 and a shared bus cache at every bus line of GBI [12]. We term the memory system using only the L1 private cache as L1, the system with L1 and L2 as L12, and the system with L1 and the shared bus cache as L1-SBC throughout this paper.
An efficient multi-level cache system for geometrically interconnected many-core … (Tirumale Ramesh) 95

Concurrent average memory access time (C-AMAT)
Some cache techniques [18]-[20] were suggested earlier for improving the traditional average memory access time of multi-level cache systems. In [18], hardware prefetching was considered to exploit spatial and temporal locality of references. In [19], multi-level caches were considered as primary and secondary memories for proxy servers to access web content. In [20], an LRU replacement policy was proposed that makes use of awareness of the cache miss penalty to ensure that memory access latency is balanced for a memory system built with different memory technologies, termed a "hybrid" system. The work addressed in [18]-[20] covered specific cache techniques that attempted to reduce the average memory access time without considering any cost implications. Our approach is to optimize cache and interconnection cost across the cache levels and apply C-AMAT to exploit parallel concurrency in cache hits and misses, which accurately determines the average memory access time across all levels for data access. An analytical method for determining C-AMAT is briefly provided below. The traditional average memory access time (AMAT) with a multi-level cache system is given in (1) and (2) for the L1 and L12 cache systems respectively:

AMAT_L1 = t1 + (1 - h1) tm (1)

AMAT_L12 = t1 + (1 - h1) [t2 + (1 - h2) tm] (2)
where t1 and t2 are the cache access times for the level 1 and level 2 caches, h1 and h2 are the cache hit ratios for the level 1 and level 2 caches, and tm is the global memory access time. In our approach, we exploit parallel concurrency of core and SBC hits and misses for the SBC supported by GBI, and apply C-AMAT for performance evaluation. Hit concurrency will improve performance, while a cache miss may impact the memory system performance depending on the hit concurrency. Taking advantage of multiple buses with miss concurrency, higher system performance can be achieved. However, the application of C-AMAT needs to ensure that the miss concurrency does not exceed the interconnection bandwidth of the reduced number of buses. Thus, we re-write (1) and (2) as (3) and (4):

C-AMAT_L1 = t1/C_h1 + (1 - h1) tm/C_m (3)

C-AMAT_L12 = t1/C_h1 + (1 - h1) [t2/C_h2 + (1 - h2) tm/C_m] (4)
where C_h1 and C_h2 are the average hit cycle concurrencies at levels 1 and 2 and C_m is the average miss cycle concurrency. In this paper, we evaluate the L1, L12 and L1-SBC systems. We selected the minimum number of L1 and SBC cache blocks to meet the following criterion for hit and miss concurrency, given as (5):

C_m <= n/2 (5)
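The AMAT and C-AMAT expressions referenced as (1)-(4) can be sketched in Python. This is a minimal sketch assuming the standard AMAT and C-AMAT forms; the function names and the parameter names (t1, t2, h1, h2, tm for access times and hit ratios, c_h1, c_h2, c_m for the hit and miss cycle concurrencies) are ours, and the numeric values in the comments are purely illustrative.

```python
def amat_l1(t1, h1, tm):
    """Traditional AMAT for an L1-only system, eq. (1)."""
    return t1 + (1 - h1) * tm

def amat_l12(t1, t2, h1, h2, tm):
    """Traditional AMAT for an L1+L2 system, eq. (2)."""
    return t1 + (1 - h1) * (t2 + (1 - h2) * tm)

def c_amat_l1(t1, h1, tm, c_h1, c_m):
    """C-AMAT for an L1-only system, eq. (3): each latency term is
    divided by its average cycle concurrency."""
    return t1 / c_h1 + (1 - h1) * tm / c_m

def c_amat_l12(t1, t2, h1, h2, tm, c_h1, c_h2, c_m):
    """C-AMAT for a two-level system (L12 or L1-SBC), eq. (4)."""
    return t1 / c_h1 + (1 - h1) * (t2 / c_h2 + (1 - h2) * tm / c_m)
```

Note how hit concurrency (c_h1, c_h2) directly scales down the hit-time terms, while miss concurrency (c_m) scales down the memory-access penalty, matching the discussion above.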
Since the GBI interconnection provides a memory bandwidth of n/2, we can also approximate (4) using the miss concurrency supported by the GBI memory bandwidth, as (6):

C-AMAT_L12 = t1/C_h1 + (1 - h1) [t2/C_h2 + (1 - h2) tm/(n/2)] (6)
When C_m is less than n/2, the interconnection bandwidth is not fully utilized. The C-AMAT given in (6) is smaller than the conservative result using the miss concurrency in (4). The percentage deviation from (4) to (6) varies from 4 to 30% across all cache systems. We see a higher deviation for the L1-SBC system, which is attributed to the fact that the miss concurrency decreases as a result of the higher hit concurrency of the bus cache during the read cycle. In this paper, we include only the conservative results from (3) and (4) for the L1 and L12 cache systems respectively, while at the same time ensuring criterion (5).

Geometrical bus interconnection (GBI) [12] cost
Table 1 gives the average normalized interconnection cost of GBI compared to the fully reduced multiple-bus system [10]. We notice a cost reduction of about 30% across the number of cores, starting from 16.

SBC impact on C-AMAT
In the past, some shared cache techniques [21] have looked at sharing of cache ways based on hash mapping instead of traditional cache set sharing for multi-core platforms. In general, it is known that increasing the number of processor cores can directly increase LLC (last level cache) hit and miss concurrency, giving a reduced C-AMAT. As our system uses a number of buses equal to one-half the number of cores, a memory access that misses in the per-core cache is searched in the SBC. Since the shared reduced number of buses in our approach naturally captures all core accesses via the bus interconnection, placing an SBC at each bus line of GBI closely replicates the traditional shared L3 cache normally used in current commercial processor systems. As we used n/2 SBCs at level 2, any miss in L1 increases the hit concurrency in the SBC. In our approach, we accounted for only pure miss concurrency [17] (a miss is counted only if none of the bus caches has a hit in the hit cycle).
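The pure-miss accounting described above can be sketched as a small helper; this is our own illustrative formulation, where each cycle is represented by the hit/miss outcomes of the b = n/2 bus caches, and a cycle contributes a miss only if every SBC missed:

```python
def pure_misses(cycles):
    """Count pure misses per the accounting above.

    cycles: list of per-cycle outcome lists, one boolean per SBC
            (True = that bus cache hit in this cycle).
    A cycle is a pure miss only if no SBC has a hit in it."""
    return sum(1 for sbc_hits in cycles if not any(sbc_hits))
```

For example, with two buses over three cycles where at least one SBC hits in the first and third cycles, only the middle cycle counts as a miss.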

Cache associativity impact on C-AMAT
Cache associativity can also impact our solution. The authors in [22], [23] noted that higher cache associativity normally increases the cache hit rate, but at the expense of hardware complexity in the cache controller and additional latency in cache search time. However, in our approach, the associativity was selected to ensure that criterion (5) is satisfied. Thus, selecting a direct-mapped cache may help achieve a reduced C-AMAT. In general, miss concurrency in the LLC can normally be supported by using a multi-ported memory, or a multi-bank memory (memory modules) with a single bus. However, for a single-bus system, bus contention impacts throughput performance. The miss concurrency can instead be facilitated by using multi-bank memory modules with a multiple-bus interconnection between the shared cache and the memory modules. In our system, the miss concurrency is supported by the multiple buses in GBI, yielding a lower C-AMAT.

CACHE SYSTEM SIMULATION

System operation with L1-SBC
Figure 3 shows the operation flowchart for the read and write cycles of the L1-SBC system. The SBC is used only during the read cycle, with a write-through policy to update on a cache miss. In the normal no-fault mode, during a read cycle, the data is first searched in L1. If the L1 read is a miss, it is then searched in the SBC. If it is a hit, the data is cached in L1. On a read miss in the SBC, the buses in GBI are arbitrated to utilize the full memory bandwidth, and the data is read from the global memory module and written to both the SBC and the L1 cache. If the currently granted bus fails, the cache system switches to bus fault mode and the interconnection is re-arbitrated to use the other b-1 connected buses. After bus re-arbitration, the data is searched again, first in L1; if it is a hit, the data is cached in L1, otherwise it is searched in the SBC. During a write cycle, if the L1 cache block is present, the data is written into the L1 cache. On an L1 write miss, an L1 cache block is replaced, the data is updated in L1, and the data is then written to global memory using the arbitrated buses in GBI.
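The normal no-fault read and write flow can be sketched as follows. This is an illustrative simplification of the flowchart (caches are modeled as plain dicts, and bus arbitration and the bus fault mode are omitted); the function and variable names are ours:

```python
def read(addr, l1, sbc, memory):
    """Normal no-fault read cycle: search L1, then the SBC, then
    global memory via the arbitrated GBI buses."""
    if addr in l1:
        return l1[addr]            # L1 hit
    if addr in sbc:
        l1[addr] = sbc[addr]       # SBC hit: cache the block into L1
        return l1[addr]
    data = memory[addr]            # SBC miss: read from global memory
    sbc[addr] = data               # write-through update of the SBC ...
    l1[addr] = data                # ... and of the L1 cache
    return data

def write(addr, data, l1, memory):
    """Write cycle: update (or replace) the L1 block, then write to
    global memory over GBI. The SBC is not consulted on writes."""
    l1[addr] = data
    memory[addr] = data
```

The key design point reflected here is that the SBC sits on the read path only, which is why (as shown later) higher read percentages increase SBC hit concurrency.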
The proposed cache system was simulated using publicly available "lrucache" libraries in Python, creating multiple indexed "lrucache" objects to implement L1, L2 and the SBC. We iterated the cache operation over n x 1000 requests for n cores. Table 2 shows the general parameters used for the simulation. Using insight into today's memory technologies, we used the approximate relative bit costs for L1, L2 and the SBC given in Table 2.
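The simulation structure can be sketched without any third-party dependency using an OrderedDict-based LRU cache; this is a minimal stand-in for the "lrucache" objects described above (class name, block counts, and the 16-core configuration below are illustrative, not the paper's actual parameters):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache modeling one L1 (or SBC) instance."""
    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.blocks = OrderedDict()

    def lookup(self, tag):
        """Return True on hit; on miss, install the block, evicting
        the least recently used block if the cache is full."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # refresh recency on hit
            return True
        if len(self.blocks) >= self.nblocks:
            self.blocks.popitem(last=False)  # evict the LRU block
        self.blocks[tag] = True
        return False

# One L1 per core and one SBC per bus line (b = n/2), as in the paper.
n = 16
l1_caches = [LRUCache(64) for _ in range(n)]
sbc_caches = [LRUCache(256) for _ in range(n // 2)]
```

Indexing into `l1_caches` by core id and into `sbc_caches` by the granted bus id mirrors the multiple-object setup described above; hit and miss counts per cycle then feed the concurrency extraction.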

Relative normalized system cost
Table 3 shows the normalized system cost, that is, the total system cost that includes the normalized interconnection cost from Table 1 and the relative cache memory cost from Table 2. As we notice from Table 3, the L2 cache adds 2% additional system cost and the SBC adds 0.5% additional cost. We ran simulations using the minimum number of L1, L2, and SBC cache blocks selected to meet the criterion given in (5). To reduce the cache hit time, we used an optimal cache associativity while at the same time ensuring the concurrency criterion given by (5).

Cache read and write misses criticality impact
It is well known that cache read misses are more critical and incur a larger penalty than write misses. To alleviate this problem, a read-write partitioning policy was suggested in [24] that minimizes read misses using dynamic cache management. To provide more read miss support, our approach includes the SBC during reads only. In general, as the read fraction is increased from 50 to 80% of the processor data requests, we found a drastic improvement in SBC hit concurrency as a result of its exclusive support during the read cycle. However, as not all applications have more data reads than data writes, we treat 50% read data requests as a good baseline for now, and will look at application-centric read/write trade-offs in the future using novel cache protocols. Some novel read/write cost trade-offs for DNA-based data storage [25] have recently been suggested.

L1 and SBC hit concurrencies and miss concurrencies
Tables 4 and 5 show the L1 cache hit concurrency (C_h1), SBC hit concurrency (C_h2), and miss concurrency in the SBC (C_m) for various system sizes for the L1, L12 and L1-SBC systems with 50% and 80% read requests respectively. Figures 5 and 6 show the hit and miss concurrency for L1-SBC with 50% and 80% read requests respectively. For the same number of cores, the miss concurrency decreases for L1-SBC compared to L1 due to the higher hit concurrency in the SBC. The miss concurrency utilization in L1-SBC is about 50% for larger numbers of cores. This is attributed to the fact that the SBC offers higher hit concurrency, yielding reduced memory traffic over the interconnection. Even though the low miss concurrency utilization may suggest that the number of buses for higher numbers of cores could be reduced further, doing so would invariably decrease the hit rate of the SBC due to lower bandwidth availability, thus nullifying any overall advantage. When data reads exceed data writes, the SBC hit concurrency increases by a factor of approximately 1.5 for the same system size.

Concurrent average memory access time (C-AMAT) cycles
We evaluated the concurrent average memory access time (C-AMAT) cycles from (3) and (4). Tables 6 and 7 show the C-AMAT for 50% and 80% read requests respectively. As a result of increased SBC hit concurrency, the C-AMAT decreases with the number of cores. Figure 7 shows the C-AMAT for 50% and 80% read requests respectively. A further reduction in C-AMAT is seen for 80% read requests due to the increase in SBC hit concurrency.

Cache system throughput
The throughput g in GB/sec is given as (7):

g = 2b / [(C-AMAT + tr) tc] (7)

where b is the number of buses with a 2-byte bus data width, tr is the GBI bus arbitration and bus allocation reconfiguration time, and tc is the clock cycle time. We assumed a tr of 1 cycle and a clock cycle time of 0.5 ns. Table 8 summarizes our results for throughput in GB per sec. We used the normalized unit cost from Table 3 and the C-AMAT from (3) and (4). As shown in Table 8, the throughput increases with the number of cores and with the read request percentage, suggesting a good advantage. Figure 8 shows the throughput for 50% and 80% read requests respectively. Figure 9 shows the average throughput improvement factor of the L12 and L1-SBC cache systems over the L1 cache system. We found that the average throughput improvement factor of the L12 cache system across all system sizes is 1.5 for 50% read requests and 1.8 for 80% read requests compared to L1. We determined that the average throughput improvement of the L1-SBC memory system is 2.5 for 50% read requests and 2.4 for 80% read requests compared to the L1 system. As there is a negligible cost increase for L1-SBC (0.5%) over L1, we conclude that the L1-SBC cache is both cost- and performance-efficient compared to the L1 or L12 cache systems; L1-SBC offers a 30 to 60% larger throughput improvement factor over L1 than L12 does.
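Under the stated assumptions (b buses at 2 bytes per bus, a bus reconfiguration overhead of 1 cycle, and a 0.5 ns clock), the throughput calculation can be sketched as follows. This is our own reconstruction of the calculation from those parameters, not the paper's published formula, and the symbol names tr and tc are ours:

```python
def throughput_gb_per_s(b, c_amat_cycles, tr_cycles=1, tc_ns=0.5):
    """Sketch of the throughput calculation: bytes moved per access
    over all b buses, divided by the access time in nanoseconds.
    Since 1 byte/ns == 1 GB/s, no further unit conversion is needed."""
    bytes_per_access = 2 * b                       # 2-byte width per bus
    time_ns = (c_amat_cycles + tr_cycles) * tc_ns  # C-AMAT plus arbitration
    return bytes_per_access / time_ns
```

For example, with b = 8 buses (a 16-core system) and a C-AMAT of 3 cycles, each access moves 16 bytes in 2 ns, giving 8 GB/s; lowering the C-AMAT directly raises the throughput, which is the trend Table 8 reports.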

Cache system throughput with single bus fault
We also ran simulations for L1-SBC with a single bus fault in the system. We assigned the fault to both critical and non-critical buses. A bus is a "critical bus" if a memory module is connected only to that bus. Typically, the rhombic interconnection [11] has a single critical bus. However, with GBI [12], we provided redundant bus paths, making all buses non-critical. Figure 10 shows the percentage degradation of a single-bus-faulted system compared to the normal L1-SBC system with 50% read requests. We noticed that the percentage degradation in throughput for a single bus fault is less than 5% across all system sizes and decreases with a higher number of cores. This suggests good fault tolerance for L1-SBC with an increased number of cores.

CONCLUSION AND FUTURE RESEARCH
Many-core based heterogeneous systems demand high system throughput for big data applications and other compute-intensive embedded applications. By adding a less expensive SBC in association with the expensive per-core L1 private cache within a multi-level cache hierarchy, we can achieve higher system throughput. For better accuracy, we extracted cache hit and miss concurrencies at each level and applied the concurrent average memory access time for the L1, L12 and L1-SBC systems. We conducted simulations of the L1, L12 and L1-SBC cache systems. Our simulation results indicate that by using L1-SBC, we can achieve a 2.5x throughput improvement compared to using only the L1 private cache, and that L1-SBC offers a higher increase in throughput improvement factor than L12 at a negligible increase in SBC cost over L1. We also determined that the throughput degradation of L1-SBC with a single bus fault is less than 5% across all system sizes, and this degradation reduces as the system size increases, suggesting a good advantage for a higher number of cores. As we used the SBC only during read requests, in the future we hope to develop additional novel SBC cache protocols using exclusive and shared modes and include the SBC in both read and write cycles. We also hope to run some heterogeneous computing big data application benchmarks with the LRU L1-SBC system and assess the overall system performance.

Figure 10. Throughput degradation with a single bus fault for 50% read requests

Table 2. Cache system simulation parameters

Table 3. Normalized system cost

Table 5. Cache hit and miss concurrency with 80% read requests

Table 8. Throughput in GB/sec for L1-SBC with 50% and 80% read requests