Processor performance metrics analysis and implementation for MIPS using an open source OS

Received Jan 29, 2021; Revised May 25, 2021; Accepted Jun 10, 2021
Processor efficiency is important in embedded systems. The efficiency of the processor depends on the L1 cache and the translation lookaside buffer (TLB). It is necessary to understand L1 cache and TLB performance under varied load on the processor; hence, this paper studies the performance of the caches and TLB under varying load on a MIPS processor running an operating system (OS). The proposed implementation counts the instruction executions of the respective cache and TLB management operations, and the events are measured using dedicated counters in software. Software counters are used because of the limitations of hardware counters in the MIPS32. Twenty-seven metrics are identified, analyzed, and implemented for the performance measurement of the L1 cache and TLB on the MIPS32 processor. The generated data supports future research in compiler tuning, memory management design for the OS, analysis of architectural issues, system benchmarking, scalability, address space analysis, studies of bus communication among processors and characterization of their workload sharing, and kernel profiling.


INTRODUCTION
Performance is measured from the desired events occurring in the system during program execution. The measure of the events and states of the quantum of work done on the processor is called the processor performance. The workload executed by the processor depends on the loading and storing of data for computation and on the movement of data through a sequence of operations of the processor components, which include the pipeline, the memory cache, and the peripherals. When an OS is loaded on the processor, the performance of the processor is measured with counters in hardware, in software, or in both. Hardware counters are physical counters provided as peripherals of the processor; they are loaded with the measurement counter values for the process being executed, whether an application or an operating system [1]-[3].
Software counters record the event occurrences associated with performance measurement through modified code. With a software counter, counting an event takes at least one additional instruction cycle. Hardware counters, in contrast, do not intrude on the instruction execution cycles, which is an advantage for system performance. Their biggest disadvantage is that the number of hardware counters is limited compared with software counters. Hardware counters are used to fine-tune either the operating system or the executing process once performance bottlenecks have been identified. Software counters are used either for general-purpose performance measurement of the MIPS32 or to measure a specific bottleneck of the MIPS32 under the OS.
This measurement is essential for mapping the behavior and efficiency not only of the underlying processor architecture and its associated subsystems but also of the various processes executing on that architecture. The resulting understanding helps to monitor performance, optimize the code, tune code parameters and code constructs, and model and benchmark the system comprising the architecture, its subsystems, and the executing processes, be it the OS or the applications using the OS. Software counters are of particular interest because they provide the flexibility to define the measurement framework.
The cache and TLB are important to measure, as they determine the efficiency of the processor in the context of the performance framework. The measurement framework addressed in this paper measures the L1 cache and TLB activities on a MIPS32 architecture implementation. The MIPS32 architecture implementation does not have hardware-based counters; hence, software-based counters have been implemented. The OS manages the L1 cache and TLB on the MIPS32 implementation; hence, the software counters play an important role in measuring the performance of the processor. The OS has been instrumented with software counters to get data on the events associated with the L1 cache and TLB. The data generated by the software counters for L1 cache and TLB events is provided in ASCII format for ease of analysis. A set of event data thus generated over a period has been used to generate histograms for validation purposes.
Software counters are additional code added into the OS; hence, a certain drop in performance is expected compared with an implementation of the OS without software counters. The focus of this performance measurement is to generate as much data as possible in the first instance; then, through analysis of the generated data, the instrumentation can be reduced to move the OS closer to the desired performance level. Iterative analysis of the generated data and further tuning of the OS will then be essential.
The challenge posed by the MIPS32 architecture implementation has been the limited free space on the flash memory. The instrumented code therefore needs a small footprint; hence, flash memory space-saving techniques were formulated. The OS source code provided by the implementer of the MIPS32 architecture is a general OS stripped down to suit an embedded system; hence, compatibility with user-loadable modules, file system writing, documentation, data extraction through FTP, and command-line parameter processing is restricted. These challenges have been addressed to design, develop, and validate an online L1 cache and TLB performance measuring system for the MIPS32 architecture implementation.

PROCESSOR PERFORMANCE MEASUREMENT
This section develops the context for the study: the MIPS32 processor architecture and its organization, including the processor pipeline, the memories, and the different levels of cache; the performance measurement metrics; the methods of data input and output for the processor implementation board; and the setup of the development system.

MIPS32 architecture pipeline and cache
The MIPS processor architecture is RISC based and is available with either 32- or 64-bit addressing [4], [5]. The speed of the processor depends on the pipeline and the caches. Owing to advances in pipelines and caches, the processors have wide application, from workstations to complex embedded systems. The pipeline mechanism divides the workload (instructions and data) into an ordered sequence, enabling a faster turnaround time for execution of the workload without interlocking or stalls within the processor pipeline. The pipeline of the MIPS32 processor, shown in Figure 1, consists of five stages.
A cache speeds up the transfer of instructions and data between the memory and the processor. The instructions reside in the instruction memory and the data reside in the data memory [4]-[6]. The CPU requests data, and the cache provides it if it is available in a cache block. If the data is not available in the cache, the cache is refreshed and the affected set of data is invalidated. This invalidation forces the cache to fetch the data required by the processor from main memory through the TLB walk, with a write-back of the modified data. In Figure 1, the instruction cache is denoted Icache and the data cache Dcache. The two caches are separated, following the Harvard architecture principle of computer design, to provide better performance for instruction and data accesses and to avoid starving the CPU [4].

Performance measurement
Processor performance measurement relies on gathering data about events, data movement within the processor, and state transitions. The execution of instructions depends on the events of data movement. Hence, when the CPU places a request on the cache, the responses of the cache and TLB take a higher priority than the other operations requested by the CPU. The performance measurement is achieved using hardware counters or software-based counters. Figure 2 shows the hit-miss operation of the software counter [6]. The metrics are classified into two categories, base and derived. Base metrics consist of the direct, or raw, counts of the monitored events; derived metrics are obtained by combining two or more metrics of either category, that is, base metrics, derived metrics, or a combination of both.
The data generated by the software counters can be archived for either the long or the short term. The amount of data available for analysis depends on the extent of the data archival. The experiment begins with an analysis of the source code to be modified for metric collection, for performance measurement with counters in hardware or software. The performance measurement framework is designed to collect a large amount of metric data under varied load on the processor. Analysis of the metric data gives direction for further code instrumentation, or for a reduction in the instrumentation, in arriving at a measurement framework. Code instrumentation, metric data collection, and analysis form an iterative process, as shown in Figure 3, until the performance measurement framework is frozen.

Development and system setup
An implementation of the 32-bit MIPS processor, the BROADCOM BCM5354, is available in the NETGEAR OSP router WGR614 series [7]. The available open-source operating systems (OOS) supporting the BCM5354 MIPS processor are flavours of BSD [6]-[10] and Linux, which permit modification of the source code to incorporate the defined software counters for the MIPS processor. The source code is generally divided into architecture-independent and architecture-dependent sections. The architecture-independent section is common across all platforms supported by the OOS, while the processor vendor provides the architecture-dependent section. Because the development system is x86 based, an OOS cross-compiler is required to generate the modified kernel for the MIPS processor; these tools are available on the internet and are specific to the chosen OOS. The cross-compiler, commonly referred to as the toolchain [11], [12], is provided by the vendor and is specific to the processor implementation. The environment for the source code and toolchain build needs about 350 MB of disk space; hence, the build environment should be set up in a location with adequate disk space. The kernel version of the development system's operating system has no direct bearing on the toolchain or on the resulting source code build. Nevertheless, it is preferable to have the development system OS patched to the highest possible level.
The development system is configured as per the specification mandated by the chosen OS, and network connectivity is essential for transferring the firmware image from the development system to the chosen MIPS32 hardware implementation. The data transfer can use either TCP or TFTP.

REQUIREMENT ANALYSIS FOR PERFORMANCE MEASUREMENT

Requirement specification
The functional requirements are as follows:
− The ASCII format is to be used for data collection, so that the data is accessible for correlation and for use by applications.
− The counter values are to be shown in the cpuinfo file of the /proc file system.
− The preferred type of performance measurement is the hit-miss-refresh cycle of the Dcache, Icache, Scache, and TLB.
− All operations defined for the cache and the TLB are to be covered.
− A separate counter variable is to be used to count the overflows of each metric.
− The data structure of the performance measurement is to be placed in the architecture-specific …/include/asm-mips directory.
− The metric update routines of the performance measurement are to be placed in the architecture-specific source code.
− The source code modification should ensure minimal change in the firmware footprint size.
− It must be possible to compile the kernel without the metric collection.
− Efficiency of the introduced code is not the goal, as the focus is on getting as much data as possible from the cache and TLB management routines.
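The requirement that the kernel can also be built without metric collection is commonly met with a preprocessor guard. The following is a minimal sketch of that idea, not the code used in the instrumented kernel; the switch CACHE_PERF_METRICS, the helper perf_count, the counter dcache_writeback_count, and the routine flush_dcache_line_example are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build switch: set to 0 to build without metric collection. */
#ifndef CACHE_PERF_METRICS
#define CACHE_PERF_METRICS 1
#endif

#if CACHE_PERF_METRICS
/* Counts one event; this is the extra work a software counter adds. */
static inline void perf_count(unsigned int *counter)
{
    assert(counter != NULL);
    ++*counter;
}
#else
/* Compiled-out variant: the hook costs nothing in a stock build. */
static inline void perf_count(unsigned int *counter)
{
    (void)counter;
}
#endif

/* Illustrative counter, not a name from the instrumented source. */
static unsigned int dcache_writeback_count;

/* Stand-in for a cache management routine carrying one counting hook. */
static void flush_dcache_line_example(void)
{
    /* ... the real cache operation would be issued here ... */
    perf_count(&dcache_writeback_count);
}
```

With the switch set to 1, every call to flush_dcache_line_example increments the counter; with the switch set to 0, the same source compiles with the hook reduced to a no-op, which keeps the footprint of the stock build unchanged.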

System analysis
Write-back and invalidate are the major cache operations. A write-back operation is performed when the CPU has updated the cache, so that the corresponding memory update takes place. An invalidate operation is performed to access a fresh set of data from memory. The write-back and invalidate operations apply to the Icache, Dcache, and Scache lines. Cache initialization must be done first on the Icache, followed by the Dcache. From the analysis of the BCM5354 processor source code, the cache operations, as defined by [3], are:
    #define Index_I_Invalidate_Ins      0x00
    #define Index_I_Writeback_Inv_Data  0x01
    #define Index_I_Writeback_Inv_SData 0x03
    #define Hit_I_Invalidate_Ins        0x10
    #define Hit_I_Invalidate_Data       0x11
    #define Hit_I_Invalidate_SData      0x13
    #define Hit_W_Writeback_Inv_Data    0x15
    #define Hit_W_Writeback_Inv_SData   0x17
    #define Hit_W_Writeback_Ins         0x18
    #define Hit_W_Writeback_Data        0x19
    #define Hit_W_Writeback_SData       0x1b

System design
The cache operations listed in [3] are used by multiple routines to flush either the cache lines or KSeg0; hence, it is necessary to trace the functions that use the defined cache operations and to assign unique metrics to them. The list of traced metrics is intended to give a complete view of the cache operations of the MIPS32 processor. Based on the source code analysis [3], [13]-[19], the function calls defined for the performance measurement of the caches (I, D, and S) and of the TLB operations in the MIPS32 design and implementation are categorized in Table 1. As Table 1 shows, the cache operations can be applied to the lines, the ways, or KSeg0.
A base metric, together with a rollover counter, helps to analyze and track the metric data over a suitably extended period, as required by the design. High-rate activity on the processor causes a metric to roll over. The base and rollover metrics are of type unsigned int. If the processor activity makes it necessary, the data type of the base and rollover metrics will have to change to unsigned long. The data structure of the designed metrics thus consists of the base metrics and the rollover counters. The metric update is organized around the defined cache operations [3], [20]-[30]; hence, the metric computation is based on switch statements. The routines that update the metrics are defined as:
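As an illustration only, a minimal sketch of what such a data structure and update routine might look like is given here. It assumes the unsigned int base and rollover counters and the switch-based dispatch described above; the names perf_metric, cache_perf, metric_inc, and cache_perf_update are hypothetical and are not taken from the instrumented source, and only three of the listed operation codes are shown.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Three of the cache operation codes listed in the system analysis. */
#define Index_I_Invalidate_Ins   0x00
#define Hit_I_Invalidate_Ins     0x10
#define Hit_W_Writeback_Inv_Data 0x15

/* Hypothetical layout: one base and one rollover counter per metric. */
struct perf_metric {
    unsigned int base;      /* raw event count */
    unsigned int rollover;  /* times the base counter wrapped to zero */
};

/* Hypothetical grouping of a few metrics (the paper defines 27). */
struct cache_perf {
    struct perf_metric index_inv_i;
    struct perf_metric hit_inv_i;
    struct perf_metric hit_wb_inv_d;
    struct perf_metric other;        /* operations not tracked separately */
};

/* Counts one event; on overflow the base wraps to zero and the
 * rollover counter records the wrap. */
static void metric_inc(struct perf_metric *m)
{
    assert(m != NULL);
    if (m->base == UINT_MAX) {
        m->base = 0;
        m->rollover++;
    } else {
        m->base++;
    }
}

/* Selects the metric to update from the cache operation code. */
static void cache_perf_update(struct cache_perf *p, unsigned int op)
{
    assert(p != NULL);
    switch (op) {
    case Index_I_Invalidate_Ins:
        metric_inc(&p->index_inv_i);
        break;
    case Hit_I_Invalidate_Ins:
        metric_inc(&p->hit_inv_i);
        break;
    case Hit_W_Writeback_Inv_Data:
        metric_inc(&p->hit_wb_inv_d);
        break;
    default:
        metric_inc(&p->other);
        break;
    }
}
```

In this sketch the total number of events behind a metric is rollover × (UINT_MAX + 1) + base, so a pair of unsigned int counters covers a much longer measurement period than the base counter alone.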

Building environment and hardware setup
The Linux source code is split into two major sections, architecture dependent and architecture independent. The NETGEAR router used here comes with a source code bundle that has a third section, specific to the router board. The build process compiles the architecture-specific code, followed by the Linux-specific code and then the architecture-independent code, and results in the kernel images. The source code [13] is installed in a suitable directory under the home directory of the user. The build then proceeds with the following steps:
g. Cleaning the existing object files and the kernel image vmlinux under the Linux and the router sections of the source code tree:
    cd …/src/router
    make clean
    make router-clean
    cd ../linux/linux
    make clean
h. Building the Linux kernel image from the source code directory .../src/linux/linux to generate the MIPS32 kernel image vmlinux:
    make dep
    make
i. Building the router code in the directory .../src/router:
    make
    make install
j. The firmware image for upgrading the WGR614v9 router is created in the directory .../src/router/mipsel-uclibc, in a file whose name begins with WGR614v9 and ends with the extension chk. An example firmware file name is WGR614v9-12051706.chk.
The router board, which is the hardware implementation of the MIPS32 core by the BROADCOM BCM5354 processor, is shown in Figure 4.

Pseudo-code implementation procedure
The performance metric data collection is implemented in the architecture-specific memory management routines, which are located in the …/src/linux/linux/arch/mips/mm and …/src/linux/linux/include/asm-mips directories; the functions in these directories are listed in Table 1. The pseudo-code developed for collecting the performance metrics consists of the following steps a to e:
a. Calling the metric update function from the function calls associated with the MIPS32 cache and TLB management, passing the designed parameters: the cache or TLB operation and the type of the required operation, that is, on the ways, the line, or KSeg0.
b. Updating the associated metric as designed for the functions listed in Table 1.
c. If the metric counter overflows, it wraps to zero and the corresponding rollover metric counter is incremented.
d. Capturing and printing the values of all the designed metrics in the /proc/cpuinfo file.
e. Repeating steps a to d for each of the function calls listed in Table 1.
The method of implementing the pseudo-code is indicated below:
a. Locating the sections of the source code that handle the cache and TLB function calls from the Linux kernel and the memory management routines, available in the directories indicated above. Based on the design requirements, the parameters are listed in Table 1 and Section 2.3.
g. As per the design requirements, calling the metric display function, created in step d, from the file …/src/linux/linux/arch/mips/kernel/proc.c to display the data in the /proc/cpuinfo file.
Validating the changes on the hardware by building the firmware image, including the source code changes, for the MIPS32 processor implementation in the NETGEAR WGR614v9 router.
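Step d, printing the metrics as ASCII text, can be sketched in user space as follows. This is only an illustration under stated assumptions: the struct named_metric, the function format_metrics, and the "name: rollover base" line layout are hypothetical, and the kernel code would write into the /proc/cpuinfo output rather than a plain buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical view of one metric as it is printed. */
struct named_metric {
    const char *name;
    unsigned int base;      /* raw event count */
    unsigned int rollover;  /* number of overflows of the base counter */
};

/* Writes one "name: rollover base" line per metric into buf.
 * Returns the number of characters written, excluding the final NUL,
 * or -1 if buf is too small to hold every line. */
static int format_metrics(char *buf, size_t len,
                          const struct named_metric *m, size_t n)
{
    size_t used = 0;

    assert(buf != NULL && m != NULL);
    if (len > 0)
        buf[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        int w = snprintf(buf + used, len - used, "%s: %u %u\n",
                         m[i].name, m[i].rollover, m[i].base);
        if (w < 0 || (size_t)w >= len - used)
            return -1;
        used += (size_t)w;
    }
    return (int)used;
}
```

Printing the rollover and base values as plain decimal text keeps the output easy to parse with ordinary text tools, which matches the ASCII requirement stated for data collection.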

RESULTS AND DISCUSSIONS
Working with the processor performance measurement results involves understanding the metrics as per the design considerations and interpreting them to draw conclusions from the observed data. Based on these interpretations, the code was modified where required and the performance measurement was analyzed again.

Metric interpretation
As an example, consider the cache operation Index_I_Invalidate_I_I. As seen in Table 1 under Section 2.3, this operation is performed on the Icache lines, on the Icache ways, and on KSeg0. To obtain the exact number of times the operation Index_Invalidate_I was performed, the metrics for these three cases are added together. Similarly, for each cache and TLB operation defined by the design requirements, the corresponding metrics indicated in Table 1 are added as shown in (1). A combination of metrics yields a derived metric, whereas the value of an individual metric is a base metric. Column 3 of Table 1 indicates the base metrics. The data provided by the base and derived metrics is used to draw histograms tracing the processor cache activity over a period.
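The addition of base metrics into a derived metric can be sketched as follows. This is a hypothetical illustration rather than the paper's own formulation: the names perf_metric, metric_total, and derived_total are assumptions, and the sketch relies on unsigned long long being wider than unsigned int, which holds on common 32-bit targets.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical metric layout: base counter plus rollover counter. */
struct perf_metric {
    unsigned int base;
    unsigned int rollover;
};

/* Total events behind one metric: every rollover stands for
 * UINT_MAX + 1 counted events. */
static unsigned long long metric_total(const struct perf_metric *m)
{
    assert(m != NULL);
    return (unsigned long long)m->rollover
               * ((unsigned long long)UINT_MAX + 1ULL)
           + m->base;
}

/* Derived metric: sum of the base metrics of one operation, for
 * example its line, ways, and KSeg0 variants. */
static unsigned long long derived_total(const struct perf_metric *parts,
                                        size_t n)
{
    unsigned long long sum = 0;

    assert(parts != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        sum += metric_total(&parts[i]);
    return sum;
}
```

For instance, passing the three metrics of one invalidate operation (line, ways, KSeg0) to derived_total gives the number of times that operation was issued in total.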
The histograms are plotted from the data generated by the metric analysis tool met while commands were run, such as those that display the contents of the directories and files under the /proc file system. To get a better perspective on the effect of such commands, consider the command cat /proc/cpuinfo. The execution of the command has the following indicative steps:
a. The shell spawns a new process by creating a process table entry and copying the file descriptors, and associates the process with the cat application.
b. The scheduler places the created process for execution.
c. The code of the cat command is brought into memory from the file system.
d. The executing code parses the file name, which in this case is /proc/cpuinfo.
e. The respective file table entry is accessed to check whether the file is a directory; if it is, the command exits.
f. Opening the file for reading.
g. Reading a line until the end-of-line mark is reached and storing it in the buffer.
h. Accessing the character device driver of the console terminal to interact with the performance measurement application.
i. Opening the device for writing.
j. Returning to the file read operation and calling the print routine to output the data to the character device file.
k. Repeating steps g to j until the end of the file.
l. Releasing the system resources occupied for reading the file /proc/cpuinfo.
m. Releasing the system resources occupied by the cat application.
n. Releasing the system resources associated with the process table entry.
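The file-handling part of these steps (opening the file, reading it line by line, writing each line out, and releasing the file) can be sketched in user space as below. It is a simplified stand-in for the cat application, not its actual source; copy_stream and cat_file are hypothetical names, and the process creation and scheduling steps are left to the OS.

```c
#include <assert.h>
#include <stdio.h>

/* Copies in to out one buffered line at a time.
 * Returns 0 on success and -1 on a read or write error. */
static int copy_stream(FILE *in, FILE *out)
{
    char line[256];

    assert(in != NULL && out != NULL);
    while (fgets(line, sizeof line, in) != NULL) {  /* read into buffer */
        if (fputs(line, out) == EOF)                /* print to output  */
            return -1;
    }
    return ferror(in) ? -1 : 0;                     /* stopped at EOF?  */
}

/* Opens the named file, copies it to out, and releases the file.
 * Returns 0 on success and -1 if the file cannot be opened or copied. */
static int cat_file(const char *path, FILE *out)
{
    FILE *in;
    int rc;

    assert(path != NULL);
    in = fopen(path, "r");      /* open the file for reading */
    if (in == NULL)
        return -1;
    rc = copy_stream(in, out);
    fclose(in);                 /* release the file resources */
    return rc;
}
```

On a Linux system, calling cat_file("/proc/cpuinfo", stdout) would print the file in the same way as the command discussed above; every line read and written in the loop exercises the caches and TLB that the metrics observe.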
Each of the steps a to n involves multiple instructions, each of which may reside either in memory or on the flash file system of the router. Some instructions are already available in memory because they were prefetched during earlier executions. When the required data is not available, the cache is flushed to bring in the new data required by the MIPS32 CPU, and the cache invokes the TLB to perform the data transfer from main memory to the required cache. As shown in the histograms in Figures 5 to 8, there is high activity in the Dcache and Icache during the first five seconds of data collection compared with the TLB and Scache.
As shown in Figure 3, the execution of an instruction causes a fetch of data from memory into the Dcache. Before the data is loaded into the Icache or Dcache, the entire KSeg0, the ways, or the lines are flushed, and the TLB is updated at the same time. Each activity on the caches and TLB is a measurable event, and the corresponding metric is updated. The flushing activity can be either a write-back or an invalidate of the respective cache. The four histograms are shown in Figures 5 to 8. Table 3 lists the code lines that were added to the stock source code [31]-[34]. The files mips32_cache.h, Makefile, proc.c, tlb-r4k.c, Config.h, applets.h, and usage.h were already part of the source code tree, while the files cache_perf_proc.c, cache_perf_mips32.h, and zmet.c were added to the source code tree as part of the implementation of the cache and TLB performance measurement metrics for the MIPS32 architecture. A total of 1009 lines were added to the source code tree, counted without comments or blank lines introduced for formatting the code.
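Activity per time interval, as plotted in such histograms, can be derived from cumulative counter values sampled at regular intervals. The sketch below assumes that approach, which is not described in detail here; the function name interval_deltas and the sampling scheme are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Converts n cumulative samples of an unsigned int counter into
 * n - 1 per-interval event counts. Unsigned subtraction wraps modulo
 * UINT_MAX + 1, so a single counter rollover between two samples
 * still yields the correct count. Returns the number of counts written. */
static size_t interval_deltas(const unsigned int *samples, size_t n,
                              unsigned int *deltas)
{
    assert(samples != NULL && deltas != NULL);
    if (n < 2)
        return 0;
    for (size_t i = 1; i < n; i++)
        deltas[i - 1] = samples[i] - samples[i - 1];
    return n - 1;
}
```

Each resulting value is the height of one histogram bar for that metric, provided the counter wraps at most once between consecutive samples.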

CONCLUSIONS
Most processors use hardware counters, whereas the MIPS32 architecture implementation studied here does not have hardware counters; hence, performance measurement requires software counters. Software counters are defined in the kernel, and code must be written to access them. When software counters are used, there is a risk of counter overflow. The handling of counter overflow is incorporated in the L1 cache and TLB routines of the MIPS32 memory management. The experiments were conducted by designing twenty-seven metrics, based on software engineering principles, for measuring the performance of the MIPS32 processor's cache operations with software counters. The study helps to understand the importance of software counters in OS-based processor performance measurement.