SoC-FPGA systems for the acquisition and processing of electroencephalographic signals

Received Aug 11, 2021. Revised Sep 29, 2021. Accepted Oct 19, 2021.

Real-time acquisition and processing of electroencephalographic signals have promising applications in the implementation of brain-computer interfaces. These devices allow the user to control a device without performing motor actions, and are usually made up of a biopotential acquisition stage and a personal computer (PC). This structure is very flexible and appropriate for research, but for final users it is necessary to migrate to an embedded system, eliminating the PC from the scheme. The strict real-time processing requirements of such systems justify the choice of a system on a chip field-programmable gate array (SoC-FPGA) for its implementation. This article proposes a platform for the acquisition and processing of electroencephalographic signals using this type of device, which combines the parallelism and speed capabilities of an FPGA with the simplicity of a general-purpose processor on a single chip. In this scheme, the FPGA is in charge of the real-time operation, acquiring and processing the signals, while the processor solves the high-level tasks, with the interconnection between processing elements solved by buses integrated into the chip. The proposed scheme was used to implement a brain-computer interface based on steady-state visual evoked potentials, which was used to command a speller. The first tests of the system show that a selection time of 5 seconds per command can be achieved. The time delay between the user’s selection and the system response has been estimated at 343 μs.


INTRODUCTION
System on a chip field-programmable gate arrays (SoC-FPGA) are devices that combine a field-programmable gate array (FPGA) and a general-purpose processor in a single chip. The high degree of parallelism of FPGAs has established them as a fast and economical option, halfway between application-specific integrated circuit (ASIC) devices and general-purpose processors [1], outperforming even digital signal processing (DSP) devices when real-time requirements become more demanding [2]. On the other hand, having a general-purpose processor allows designers to implement a simple and intuitive user interface, along with any other high-level tools required, overcoming the limitations of FPGAs in this regard. By combining these two processing elements in a single chip, SoC-FPGA devices have become a very good option for the development of systems where real-time processing is key. In this way, it is possible to integrate the high-level functionalities of a processor with the real-time operation of an FPGA, ensuring an optimal interconnection between both processing elements.
In these heterogeneous systems, FPGAs are ideal for handling parallel operations of many data channels; and, because they implement computation directly in hardware, they provide a low and constant latency path for tasks such as custom triggering and high-speed closed-loop control. On the other hand, they improve the flexibility of embedded systems, making them easier to update than systems with fixed logic and allowing them to adapt to changing I/O requirements. In this scheme, the designers can solve the tasks that demand low latency using the FPGA, while the embedded processor takes care of the user interface and the rest of the tasks with lesser time constraints, possibly porting an operating system for it.
Currently, two manufacturers account for almost the entire offer of SoC-FPGA systems: Xilinx and Intel (formerly Altera). The de10-nano board, selected for this article, features a Cyclone V 5CSEBA6U23I7 SoC-FPGA, part of the low-end range of Intel Altera devices. This chip combines the Cyclone V FPGA with an ARM Cortex-A9 processor, which runs an Angstrom distribution of the Linux operating system, along with interconnection buses between them. Additionally, the board provides an LTC2308 chip: a 500 ksps, 12-bit, successive approximation register (SAR) analog-to-digital converter (ADC). The 110 K logic elements of the Cyclone V 5CSEBA6U23I7 device are more than enough for the type of processing required for this application, as will be shown in section 3.1, and the ADC included on the board is well suited for the acquisition of the EEG signal. The selection of the de10-nano board therefore constitutes a cost-effective solution. In Intel Altera's documentation the microprocessor is often referred to as a hard processor system (HPS). This acronym will be used in this paper as well.

BCI implementation using SoC
In order to implement a BCI device it is necessary to obtain electroencephalographic signals from the user and process them in real time, in the shortest possible time. The processed information must be classified according to some criterion and then used to control a device. Throughout the process, the user must be provided with an interface that is as intuitive as possible and that gives feedback.
SoC-FPGA systems are ideal for implementing such a system. On the one hand, the FPGA is available to handle the acquisition and processing of the signals and, on the other, the microprocessor is ideal for running the user interface. The classification of the signals and the control of the devices can be handled by either of the two modules. The proposed criterion for choosing between them is the type of device to be controlled. If the purpose of the BCI is to control software (a speller, for example), it can be programmed directly into the microprocessor, together with the signal classifier. Alternatively, if the idea is to control some external hardware (a wheelchair, for example), it is preferable that the FPGA takes care of both the classifier and the device driver, to minimize the latency. If external stimuli are needed to operate the device (visual stimuli, for example), it is preferable to control them from the FPGA, to facilitate their synchronism with the acquired signal.
To guarantee the correct acquisition, storage and processing of the signal there are some basic elements that must be solved within the FPGA. An analog-to-digital converter (ADC) with 12-bit resolution can be used to acquire the signal, as long as an appropriate analog signal conditioning stage is provided. This converter must be managed by a dedicated module in the FPGA. Raw samples should be first saved internally and then exported to some memory shared with the microprocessor, when a certain number of samples is reached. This is done in order for the processor to have access to the raw samples; for graphing, for example, but without having to handle them in real-time.
While there are many possible digital signal processing techniques and methods that can be useful in a BCI system, one of the main ones is the discrete Fourier transform (DFT). This transform provides a way to analyze signals in a time-independent fashion, and its fast Fourier transform (FFT) implementation dramatically reduces the number of operations involved in its calculation. Additionally, using this algorithm, correlations and convolutions can generally be implemented using fewer operations, which allows digital filters to be implemented efficiently [16]. In this scheme, it is the FPGA that must be in charge of calculating the FFT of the signals, saving the results of the processing in some memory shared with the microprocessor. It is desirable that the length of the FFT matches the number of samples that the microprocessor has access to, so that it has access to the raw and processed signal of the last time window at any given moment.
With the acquisition and processing of the signal resolved, it is the microprocessor that must take care of the high-level tasks. The most important one is the user interface, which should be as intuitive and user-friendly as possible. It is desirable that this interface provides the operator with a way to verify the correct acquisition of the signal, by graphing it, for example. In general, these devices carry an operating system, which gives developers the possibility of programming the interface with the high-level design tools they are most familiar with.

SSVEP-based BCI implementation

Steady state visually evoked potentials
SSVEPs are periodic potentials that appear in a user's EEG record when the user is presented with periodic flashing lights with frequencies above 6 Hz, and they show the same periodicity as the stimulus. Implementing a system that uses these potentials to control some type of device involves measuring the EEG signal and processing it to detect them. There are mainly two ways to use visual stimuli to encode information: phase and frequency encoding. In the first case, the user is presented with multiple stimuli flashing at the same frequency, but out of phase with each other [14]. It is the task of the classification algorithm to detect this delay in order to determine which stimulus the user is looking at. In the second case, the user is presented with visual stimuli blinking at different frequencies [12]. The classification algorithm in this case must determine which is the predominant frequency in the obtained record. Mixed schemes may also be used, with stimuli flashing at both different frequencies and different phases [4].

General scheme of the device
In this article a SoC-FPGA was used to implement an SSVEP-based BCI with frequency encoding. The EEG signal is measured using wet electrodes in the user's occipital area. It then goes through an analog signal conditioning stage before being digitized by the LTC2308 ADC, available on the de10-nano board. The signal is stored internally and processed using the FFT algorithm in the FPGA, and is later transferred to the microprocessor through an integrated bus. The microprocessor is responsible for implementing a speller and a signal graphing tool, using the information provided by the FPGA. At the same time, 5 visual stimuli of different frequencies are generated, which allow the user to control the device. The system was designed in a modular way, so that each module serves a specific purpose and is independent of the others. In this way, a widely reusable system is achieved. Figure 1 shows a general diagram of the implemented device.

Signal conditioning stage and isolation barrier
The EEG signal's amplitude is between 2 and 100 µV, its frequency ranges from 0.5 Hz to 100 Hz, it is superimposed on a DC component that can reach hundreds of mV (electrode potentials), and it is immersed in noise and electromagnetic interference [17]. Once the signal is measured through the wet electrodes, an analog conditioning stage needs to be provided to enable its acquisition with the ADC integrated on the board (12 bits, successive approximation). For this purpose, a one-channel AC-coupled differential amplifier was used [18]. This circuit provides a gain of 5832, and includes a feedback stage through a third electrode, with the purpose of reducing the common mode voltage in the differential pair (usually called a driven right leg circuit, or DRL, in the literature [19]). The same amplification stage, with a lower differential mode gain, can be used for obtaining electromyographic (EMG) [20] or electrocardiographic (ECG) [18], [21] signals. The amplifier was complemented with an integrated medical grade isolator ADUM6401, for its power supply, and an optical isolation amplifier based on the IL300 optocoupler, to isolate the signal [22], safeguarding the integrity of the user. An analog active anti-aliasing filter with a cut-off frequency of 160 Hz was also added to the set.

Visual stimuli
To generate the visual stimuli by which the user controls the device, 5 matrices of red LED lights, flashing at 14, 16, 18, 20 and 22 Hz, were used. These visual stimuli were arranged horizontally, in line with the options available in the user interface, as seen in Figure 2. In order to correctly register the SSVEP phenomenon, it is convenient that the periodicity of the stimuli is synchronized with the signal sampling time, so the same timer, set at 1/1024 s and generated in the FPGA, was used to control both modules.
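As an illustration of how stimuli can stay locked to the sampling timer, the following minimal sketch derives each LED state directly from the 1/1024 s tick count. It is a software model under stated assumptions, not the Verilog module used in the design: the function names are hypothetical, and the accumulator-style rule is chosen here because the half-period 1024/(2f) is not an integer number of ticks for 14, 18, 20 and 22 Hz, so a plain divide-by-N counter would not hit those frequencies exactly.

```python
TICK_RATE_HZ = 1024                    # shared timer: one tick every 1/1024 s
STIM_FREQS_HZ = (14, 16, 18, 20, 22)   # stimulus frequencies from the text


def led_state(tick, freq_hz, tick_rate=TICK_RATE_HZ):
    """Return 0 or 1 for a square wave of freq_hz evaluated at an integer tick.

    The wave has 2 * freq_hz half-periods per second; the index of the
    current half-period is floor(2 * freq_hz * tick / tick_rate), and its
    parity gives the LED state. Integer arithmetic keeps it drift-free.
    """
    return ((2 * freq_hz * tick) // tick_rate) % 2


def rising_edges(freq_hz, ticks=TICK_RATE_HZ):
    """Count 0 -> 1 transitions over `ticks` timer ticks (one second by default)."""
    states = [led_state(t, freq_hz) for t in range(ticks + 1)]
    return sum(1 for a, b in zip(states, states[1:]) if (a, b) == (0, 1))


if __name__ == "__main__":
    for f in STIM_FREQS_HZ:
        print(f, "Hz ->", rising_edges(f), "flashes per second")
```

Because every state is a pure function of the tick counter that also triggers the ADC, each 512-sample window contains a whole number of stimulus cycles for all five frequencies.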

Low level processing
The acquisition and storage of the signal and the calculation of its FFT were implemented in the FPGA, coded using the Verilog hardware description language (HDL). Figure 3 shows in more detail the different modules that make up the digital design. These are the ADC driver, the visual stimuli driver, the timer, the FFT module, and the memories and control signals that interface the system with the HPS. In this design a NIOS 2 soft processor was also included. The whole design is synchronized through a 50 MHz clock, available on the de10-nano board. The visual stimuli are controlled through a module that counts positive edges on the timer's clock, generating square signals of the desired frequencies that are synchronized with the timer. The LTC2308 ADC, external to the SoC but included on the board, is driven over the SPI protocol by a dedicated module. The timer fixes the sample rate at 1024 samples per second (sps). The samples from the ADC are first stored in an internal buffer with a capacity of 512 samples. Once this buffer is filled, the 512-point FFT of the samples is calculated, and both the raw and the processed samples are stored in memories shared with the HPS, which is notified. This process is controlled and monitored by the NIOS 2 processor, which is a "soft" processor, i.e., implemented in the logic cells of the FPGA. The use of this processor was considered vital in the system development and debugging stage, since it provides greater versatility and control over the different stages of the process, and can be programmed easily through a hardware abstraction layer (HAL) provided by Intel Altera. In a future implementation of the system it could be dispensed with, by developing a direct communication path between the ADC driver and the FFT module. This would yield a more efficient system while saving FPGA logic cells.
No pre-processing is conducted on the samples provided by the ADC before inputting them to the FFT module, since it was considered unnecessary for the application intended at this stage. In future implementations of the system the data could be preprocessed using FIR [23] or IIR [24] filters, to reduce the out-of-band noise, and adaptive filters, for example a least-mean-square filter [25], to reduce the in-band noise produced by muscle artifacts. This would be expected to result in a more robust system.
The module that implements the FFT is made up of two parts. A core provided by [26] is responsible for calculating the 512-point FFT with a serial input/output interface, while two FIFO memories and dedicated logic handle the input/output interface with the NIOS 2 processor. The core receives input samples in 12-bit format, computes the FFT and provides 16-bit outputs, truncating the values if necessary. To help reduce the internal truncation error of the core, while keeping the resource utilization reasonable, all the computations inside it are carried out in 20-bit words. This includes the twiddle factors necessary for the computation, which are stored in look-up tables to avoid having to compute them on the fly. The combination of a 512-point FFT with a sampling frequency of 1024 sps gives the system a frequency resolution of 2 Hz, so that all the frequencies of interest (14, 16, 18, 20 and 22 Hz) fall exactly on bins of the FFT, minimizing the spectral leakage. Once the computation is done, the output of the FFT is stored in 32-bit words, to facilitate its interface with the rest of the system.
In order to implement communication between the modules, two different standard Altera buses were used, as shown in Figure 4. When the ADC driver signals that the 512 samples are available in the "raw signal memory", the NIOS 2 processor transfers them to a first in first out (FIFO) memory (FIFO 1 in the figure). These transfers are managed by an Avalon memory mapped interface, which is well suited to implementing read and write interfaces for master and slave components such as these. Then the data flows through the FFT. For this, Avalon streaming interfaces, which are ideal for low-latency, unidirectional data transfers, were used. Finally, the processed data is available in another FIFO memory (FIFO 2 in the figure), and the NIOS 2 processor transfers it to the "processed signal memory", again using Avalon memory mapped interfaces. This memory, along with the "raw signal memory", is available to the HPS through the lightweight AXI bus. Both memories, implemented with on-chip RAM, store data in 32-bit words. For more information about the Avalon interfaces, refer to the Altera documentation [27].
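The relation between sampling rate, FFT length and the stimulus frequencies can be checked with a short numerical sketch. The helper names below are hypothetical, and the direct DFT sum is used only for illustration on a synthetic, noise-free sine; it is not the fixed-point FFT core of the design.

```python
import cmath
import math

FS_SPS = 1024                          # sampling rate from the text
N_FFT = 512                            # FFT length from the text
STIM_FREQS_HZ = (14, 16, 18, 20, 22)   # stimulus frequencies from the text


def bin_index(freq_hz, fs=FS_SPS, n=N_FFT):
    """Map a frequency to its FFT bin, refusing frequencies between bins."""
    resolution_hz = fs / n             # 1024 / 512 = 2 Hz per bin
    k = freq_hz / resolution_hz
    if k != int(k):
        raise ValueError(f"{freq_hz} Hz does not fall on a whole bin")
    return int(k)


def dft_bin(samples, k):
    """Direct evaluation of a single DFT bin (illustrative, not an FFT)."""
    n = len(samples)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n)
               for i, x in enumerate(samples))


# Synthetic 16 Hz sine spanning exactly one 512-sample window (8 full cycles).
window = [math.sin(2 * math.pi * 16 * i / FS_SPS) for i in range(N_FFT)]
magnitudes = {f: abs(dft_bin(window, bin_index(f))) for f in STIM_FREQS_HZ}

if __name__ == "__main__":
    print([bin_index(f) for f in STIM_FREQS_HZ])
    print({f: round(m, 6) for f, m in magnitudes.items()})
```

In this idealized case all the energy of the 16 Hz tone lands in bin 8 and the neighbouring stimulus bins stay at numerical zero, which is the leakage-free situation the choice of 2 Hz resolution aims for.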

HPS and FPGA communication
Interfacing the FPGA and the HPS involves selecting one of the three buses integrated in the SoC-FPGA. These are the FPGA-to-HPS Bridge, the HPS-to-FPGA Bridge and the Lightweight HPS-to-FPGA Bridge. In this design there are no HPS slaves that need to be controlled from the FPGA, all the transactions that need to take place have a maximum width of 32 bits, and there is no need for a high-bandwidth bus, since all the real-time processing takes place in the FPGA fabric. For these reasons the Lightweight HPS-to-FPGA Bridge was selected. This bus is driven by a 100 MHz clock, has a fixed data width of 32 bits and exposes an interface that the user can connect to Avalon Memory Mapped slave interfaces. The manufacturer recommends its use for low-bandwidth traffic [27].

High level processing
The high-level processing was coded in the hard processor system (HPS) of the device. This module has direct access to the raw and processed signal in the last window of 512 samples. As seen in Figure 5, the information is mapped directly into the virtual memory of the embedded Linux running on the processor, through the character device "/dev/mem". A program running in Linux must access this device and read the information, which is synchronized by the FPGA modules through the control signals. In this scheme the developer can design high-level programs using the tools that he or she is most familiar with. For this article a classifier algorithm and a speller were programmed in C# and executed through the Mono implementation of the .NET framework [28].
In each time window (512 samples or 0.5 seconds) the classifier algorithm compares the magnitude of the signal's FFT at the frequencies of interest (14, 16, 18, 20 and 22 Hz) with each other and against the mean, selecting the largest one. If a frequency is selected in three consecutive time windows, it is inferred that the user was looking at the visual stimulus related to that frequency. Using this system, a user can select between five different options without moving any muscles other than those of the eyes. The criterion of three consecutive windows was adopted to reduce the probability of false positives, with the downside of reducing the achievable information transfer rate (ITR) of the system.
By relating each visual stimulus to a symbol on the screen, a speller was implemented. The system allows the user to select between 5 symbols so, in a first stage, the GUI, which can be seen in Figure 1, asks the user to select between five groups of five letters. Later, in a second stage, the user is asked to select between the five letters of the selected group. Using this scheme, a total of 25 letters can be selected, with a minimum time of 3 seconds per letter (1.5 seconds per command, two commands per letter). The letter "Y" was omitted due to its low frequency in the Spanish language. The GUI also provides the user with progress bars, related to the three successive selections, and the possibility of observing plots of the raw and processed signal.
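The access pattern through "/dev/mem" can be sketched as follows. This is a Python illustration rather than the C# program used in this work. The base address 0xFF200000 is the one commonly documented for the Cyclone V lightweight HPS-to-FPGA bridge and should be checked against the device documentation; the two memory offsets are purely hypothetical (in practice they are assigned in the system integration tool), and reading the words as unsigned little-endian values is also an assumption.

```python
import mmap
import os
import struct

LWH2F_BASE = 0xFF200000   # assumed lightweight HPS-to-FPGA bridge base address
RAW_OFFSET = 0x0000       # hypothetical offset of the raw signal memory
FFT_OFFSET = 0x1000       # hypothetical offset of the processed signal memory
WINDOW_WORDS = 512        # one time window of 32-bit words


def read_words(path, base, offset, count=WINDOW_WORDS):
    """Map `path` starting at `base` and return `count` 32-bit words at `offset`.

    `base` must be page aligned, as required by mmap.
    """
    span = offset + 4 * count
    # O_SYNC requests uncached access on /dev/mem; harmless on regular files.
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_SYNC", 0))
    try:
        with mmap.mmap(fd, span, access=mmap.ACCESS_READ, offset=base) as region:
            return list(struct.unpack_from(f"<{count}I", region, offset))
    finally:
        os.close(fd)


def read_last_window(path="/dev/mem"):
    """Return (raw, processed) word lists for the last window; needs root on target."""
    raw = read_words(path, LWH2F_BASE, RAW_OFFSET)
    processed = read_words(path, LWH2F_BASE, FFT_OFFSET)
    return raw, processed
```

A real program would also poll or wait on the control signals mentioned above before reading, so that a window is not read while the FPGA is still updating it.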
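The decision rule just described can be sketched in a few lines. This is an illustrative Python model, not the C# classifier of this work: the names are hypothetical, and taking "the mean" to be the mean magnitude of the spectrum excluding the DC bin, as well as clearing the streak after each accepted command, are assumptions made for the example.

```python
from statistics import mean

STIM_FREQS_HZ = (14, 16, 18, 20, 22)   # frequencies of interest from the text
BIN_WIDTH_HZ = 2                       # 1024 sps / 512-point FFT
REQUIRED_STREAK = 3                    # consecutive windows needed for a command


def pick_frequency(magnitudes):
    """Return the dominant stimulus frequency of one window, or None.

    `magnitudes[k]` is the FFT magnitude of bin k. The largest stimulus bin
    wins only if it exceeds the mean of the spectrum (DC excluded), which is
    one possible reading of "against the mean".
    """
    candidates = {f: magnitudes[f // BIN_WIDTH_HZ] for f in STIM_FREQS_HZ}
    best = max(candidates, key=candidates.get)
    baseline = mean(magnitudes[1:])
    return best if candidates[best] > baseline else None


class StreakDetector:
    """Emit a command when the same frequency wins in consecutive windows."""

    def __init__(self, required=REQUIRED_STREAK):
        self.required = required
        self.last = None
        self.count = 0

    def update(self, choice):
        if choice is not None and choice == self.last:
            self.count += 1
        else:
            self.last = choice
            self.count = 1 if choice is not None else 0
        if self.count >= self.required:
            self.last, self.count = None, 0   # start over after a command
            return choice
        return None
```

Feeding `pick_frequency` results into `StreakDetector.update` once per 0.5 s window yields at most one command every 1.5 s, which matches the minimum command time stated in the text.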
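The two-stage selection can be illustrated with a small sketch. The grouping below (consecutive letters of the basic 26-letter Latin alphabet without "Y") is an assumption made for the example; the actual layout of the GUI may differ.

```python
import string

# Assumed layout: 25 letters (no "Y") split into 5 consecutive groups of 5.
LETTERS = [c for c in string.ascii_uppercase if c != "Y"]
GROUPS = [LETTERS[i:i + 5] for i in range(0, len(LETTERS), 5)]

WINDOW_S = 0.5             # one 512-sample window at 1024 sps
WINDOWS_PER_COMMAND = 3    # three consecutive matching windows
COMMANDS_PER_LETTER = 2    # group selection, then letter selection


def select_letter(group_cmd, letter_cmd):
    """Map two successive commands (each 0..4) to a letter."""
    return GROUPS[group_cmd][letter_cmd]


def min_time_per_letter_s():
    """Lower bound on the time needed to spell one letter."""
    return WINDOW_S * WINDOWS_PER_COMMAND * COMMANDS_PER_LETTER


if __name__ == "__main__":
    print(GROUPS)
    print(select_letter(1, 2), min_time_per_letter_s())
```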

Usage of FPGA resources
The design was implemented on the de10-nano board. Table 1 summarizes the resources used to fit it on the device. This summary shows that the design would fit even in smaller FPGA devices, and that plenty of resources remain for implementing further signal processing elements. Additionally, Figure 6 shows the floorplan of the proposed system, obtained from the chip planner view of the software provided by Altera for the Cyclone V 5CSEBA6U23I7 device. The figure illustrates the area occupied by the implemented circuit, which corresponds to the squares shaded in a darker blue than the rest.

Figure 6. Floorplan of the proposed system

Figure 7 shows the register transfer level (RTL) diagram of the proposed system. The block indicated as "soc_system: ADC" is generated by the system integration tool (known as QSYS) of the software provided by Intel Altera, and includes the ADC interface, the FFT core and interconnections, the FIFO memories, the NIOS 2 processor, the memory necessary for its operation and the HPS interface, as well as some interconnection blocks generated by the tool. The detail of this block is shown in Figure 8.

The NIOS 2 processor is driven by the same 50 MHz clock as the rest of the system, but the effective clock of the hardware abstraction layer (HAL) through which the instructions are issued is lower [29]. This effective clock was estimated at 11 MHz, using an oscilloscope and a simple software routine. The FFT module is driven by a 50 MHz clock and outputs one processed sample per clock cycle after a delay of 1105 clock cycles. The execution time of the decision algorithm, measured using the C# Diagnostics library, was 14 µs. The transactions between the HPS and the FPGA in Cyclone V devices have been studied in [30]. That study is accompanied by an Excel table that lets users look up the measured transfer rate for several modes of operation.
For transferring 4096 bytes (512 raw and 512 processed samples of 32 bits each) using the Lightweight HPS-to-FPGA bus, the Angstrom Linux operating system and a 50 MHz clock for the FPGA, a transfer rate of 20.08 MB/s is reported. The estimated latency (L) is therefore given by (1), (2) and (3).
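As a rough check on the estimate, the contributions quoted in the text can be added up in a short sketch. The split into four terms and, in particular, the assumption that the NIOS 2 spends one 11 MHz cycle per 32-bit word over two 512-word copies (raw memory to FIFO 1, FIFO 2 to processed memory) are illustrative assumptions, not the authors' expressions; 1 MB is taken as 10^6 bytes.

```python
NIOS_CLOCK_HZ = 11e6       # effective NIOS 2 clock estimated in the text
FPGA_CLOCK_HZ = 50e6       # clock driving the FFT module
FFT_DELAY_CLOCKS = 1105    # pipeline delay of the FFT core
WINDOW_SAMPLES = 512       # samples per window, one output per clock
BUS_RATE_BPS = 20.08e6     # reported lightweight-bus transfer rate, bytes/s
TRANSFER_BYTES = 4096      # raw plus processed window, 32-bit words
DECISION_S = 14e-6         # measured decision algorithm time

# Assumed: one NIOS 2 cycle per word, two copies of 512 words each.
t_nios = 2 * WINDOW_SAMPLES / NIOS_CLOCK_HZ
# FFT: pipeline delay plus one clock per output sample.
t_fft = (FFT_DELAY_CLOCKS + WINDOW_SAMPLES) / FPGA_CLOCK_HZ
# Transfer of both memories to the HPS.
t_bus = TRANSFER_BYTES / BUS_RATE_BPS

latency_s = t_nios + t_fft + t_bus + DECISION_S

if __name__ == "__main__":
    for name, t in (("NIOS 2", t_nios), ("FFT", t_fft),
                    ("bus", t_bus), ("decision", DECISION_S)):
        print(f"{name:9s}{t * 1e6:8.2f} us")
    print(f"total    {latency_s * 1e6:8.2f} us")
```

Under these assumptions the terms come to roughly 93 µs, 32 µs, 204 µs and 14 µs, for a total of about 343 µs, in line with the estimate quoted in the text.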

Figure 7. Register transfer level diagram of the system

This calculation is an estimate, but it gives an idea of the expected latency of the system. As can be seen, the components that contribute most to this time delay are the NIOS 2 processor and the Lightweight bus. The first one can be dispensed with in future implementations, while the second one can be replaced with the HPS-to-FPGA bus, which supports 128-bit transactions, achieving a 35 MB/s transfer rate under these same conditions [30]. A higher clock frequency in the FPGA may also be used. Without any changes in the design, a quick evaluation with the timing analyzer provided by Intel Altera shows that a 72 MHz clock can be achieved. This is the maximum frequency at which all the transfers between registers complete in a single clock cycle. It may be increased further by optimizing the critical register paths and by evaluating on which paths this condition may be relaxed. That said, the authors consider that a latency on the order of hundreds of microseconds (343 µs in our calculation) is acceptable, as it is practically imperceptible to the user.

Speller test
In order to verify the system's overall performance, a simple high-level experiment was conducted. Two users, one with no previous experience in handling BCIs (user A) and one experienced (user B), were asked to select letters randomly. In the best cases the command selection time for user A was about 5 seconds but, in some cases, it took more than 100 seconds to select a command. User B, on the other hand, was typically able to execute a command in 5 to 10 seconds.
These results, which are shown for demonstrative purposes, are comparable with those presented in [14], where the typical user selection time was between 5 and 8 seconds. As an example, Figure 9 shows the EEG record of user A, obtained in 3 consecutive time windows, corresponding to a successful selection of the option related to a visual stimulus of 16 Hz. The first image corresponds to the raw EEG signal, while the second shows its FFT in each time window.

CONCLUSIONS
SoC-FPGA systems present very attractive characteristics for the development of systems where real-time processing is key. Using these devices, dedicated logic in the FPGA can be used to solve tasks that demand low latency, leaving the processor dedicated to high-level tasks, such as the user interface. Throughout this article it has been discussed how a system of these characteristics can be used to develop a platform for the acquisition and processing of electroencephalographic signals, specifying how the tasks can be distributed in order to achieve a highly efficient BCI. Finally, the implementation of an SSVEP-based BCI system, developed entirely on a de10-nano SoC provided by Altera, has been shown. The time delay between the user's selection and the system response has been estimated at 343 µs, and a first experimental test, which made it possible to verify the operation of the complete system, was carried out. A detailed characterization of the system performance and the proposal of strategies to improve it remain to be conducted. Although the design has been oriented to obtaining and processing SSVEPs with frequency encoding, its modularity allows most of the developed tools to be reused in the implementation of different types of BCIs in the future.