Abstract
Motor imagery brain-computer interfaces (MI-BCIs) have attracted considerable attention in recent years owing to their potential to enhance rehabilitation and the control of prosthetic devices for individuals with motor disabilities. However, accurate classification of motor imagery signals remains challenging due to the high inter-subject variability and non-stationarity of electroencephalogram (EEG) data. Moreover, the acquisition of EEG data can be difficult, so MI-BCIs typically have to be trained with limited data. In this study, several data augmentation techniques are compared with the proposed technique, adaptive cross-subject segment replacement (ACSSR). In conjunction with the proposed deep learning framework, this technique allows pairs of similar subjects to take advantage of one another's data and boost the classification performance of MI-BCIs. The proposed framework features a multi-domain feature extractor based on common spatial patterns with a sliding window and a parallel two-branch convolutional neural network. The performance of the proposed methodology has been evaluated on the multi-class BCI Competition IV Dataset 2a through repeated 10-fold cross-validation. Experimental results indicate that the implementation of the ACSSR method (80.47%) in the proposed framework leads to a considerable improvement in classification performance compared to classification without data augmentation (77.63%) and to other fundamental data augmentation techniques used in the literature. The study contributes to the development of effective MI-BCIs by showcasing the ability of the ACSSR method to address the challenges in motor imagery signal classification tasks.

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
Motor imagery brain-computer interfaces (MI-BCIs) are an important sub-field of brain-computer interface (BCI) research due to their potential applications in neuroscience, rehabilitation, and human-computer interaction. For years, the development of reliable BCI systems for specific applications, such as prosthetic control [1] and stroke rehabilitation [2], has continued to attract great interest in the neuroscience field. Some studies have investigated the use of motor imagery electroencephalogram (MI-EEG) signal classification for developing prosthetic limbs that can be controlled by the user's intention to move their limbs [3, 4]. Other studies have focused on developing MI-EEG classification models for stroke rehabilitation, where the models can be used to evaluate the effectiveness of rehabilitation programs and track patients' progress [5–7]. Given the potential contributions of MI-EEG to the development of motor imagery BCI applications, there is a need for methodologies that perform well across diverse individuals and conditions. Unsurprisingly, pursuing this objective unveils many obstacles, involving both the challenges arising from the complex characteristics of EEG signals and the practical constraints in acquiring an adequate amount of training data. These are serious roadblocks to delivering these tools to those who could use them to improve their overall quality of life.
One of the main challenges is the non-stationarity of EEG signals, meaning there are temporally dependent changes in the statistical characteristics of the signal [8, 9]. The non-stationary nature of EEG signals presents a significant hindrance to developing a single subject-independent model that can generalize well to new subjects. Therefore, there is a growing trend towards subject-specific models that can be tailored to each user using only the respective subject's data. Although subject-specific models seemingly have the potential to address the signal variability in EEG signals, they come with their own set of problems. One significant drawback of these individualized models is their inefficiency with classifiers that require large amounts of training data, such as deep learning (DL) approaches [10, 11]. Given these circumstances, this study aims to address some of these challenges by introducing a novel data augmentation (DA)-oriented framework incorporating a well-directed deep learning model, resolving issues stemming from data scarcity and the dynamic nature of EEG signals.
The rest of the paper is structured as follows. Related works on MI-EEG classification with DL and DA techniques are reviewed in section 2. A comprehensive description of the proposed framework methodology is provided in section 3. Section 4 presents the experiments conducted to compare the proposed data augmentation technique with other existing methods. The results and discussion derived from these experiments are provided in section 5. Finally, section 6 concludes the study.
2. Related works
A considerable amount of research has been undertaken on subject-independent models. However, their effectiveness is often hindered by the dynamic nature of EEG signals and the inherent physiological distinctions among individuals. For this reason, several cross-subject incorporation mechanisms have been proposed, including modified neural networks [12, 13], transfer learning [14–16], and data augmentation approaches [17].
The field of data augmentation in the scope of MI-EEG has witnessed the emergence of several techniques, involving a variety of methodologies aimed at increasing the adaptability and accuracy of MI-EEG classification models. These techniques fall into two principal categories: those that operate on the manipulation of raw EEG waveforms, and those that leverage domain transformation techniques [18–20], such as the short-time Fourier transform (STFT) [21], to derive augmented data in alternative domains (e.g. the frequency, time-frequency, or spatial domain). The data augmentation techniques employed on raw EEG waveforms typically incorporate changes in time-series order, such as the use of sliding window and shifting techniques [22–24]. These methods effectively expand the dataset by creating new instances through slight alterations in the temporal order of the data. In contrast, there are also DA methods that pay particular attention to preserving the temporal order of the EEG signals, such as various types of noise addition [25, 26]. These methods apply controlled changes to the data, increasing the model's generalizability by exposing it to a broader range of signal variations. Several studies have recognized noise addition as a viable MI-EEG data augmentation approach. Li et al [25] investigated the use of Gaussian noise data augmentation with the CP-MixedNet architecture. Their experiments on multiple datasets demonstrated that a standard deviation of 0.001 yielded the best classification results, and they observed that the addition of noise was particularly effective in improving classification accuracy for the feet and tongue imagery classes. Parvan et al [27] employed the addition of Gaussian noise with zero mean and a standard deviation of 0.15 to double the BCI Competition IV 2b dataset, successfully increasing Cohen's Kappa by 0.07 compared to the model without DA. Zhang et al [26] introduced an EEG-Inception network along with a noise addition DA technique. To augment their dataset, they utilized the inherent noise in one subject, particularly the noise above 100 Hz, and applied it to another subject. According to their results, they were able to improve the overall average accuracy by 3.6% on the 4-class BCI Competition IV 2a. They also noted that the accuracy increases for Subjects 2 and 5, referred to as 'poorly performing subjects', were significant, attributing these improvements to the higher variability in noise profiles across trials for these subjects compared to others.
There are also generative models, such as autoencoders (AEs) [28] and generative adversarial networks (GANs) [29], which can work with both raw EEG waveforms and other domain spaces. A basic autoencoder is structured as a sequential encoder-decoder architecture: the encoder maps the input data to a lower-dimensional space, and the decoder reverses this process by reconstructing the original input from the encoder output. Based on this notion, other well-known types of AEs, such as the stacked AE [30] and the variational AE (VAE) [31], have been derived. In recent years, there has been a growing interest in utilizing GANs for EEG data augmentation [32], inspired by their well-established success in image and sound synthesis. However, generating synthetic EEG signal sequences using GANs necessitates the development of specialized techniques and raises certain concerns that need to be addressed. One issue is that GANs typically require a substantial amount of training data to reliably capture the underlying distribution and generate realistic samples [33]; the performance of GANs is therefore highly affected by the number of trial sequences in the dataset. Recording subject EEG data for extended runs is not only expensive and time-consuming, but signal quality and subject characteristics are also susceptible to inconsistencies [34], even from one session to another [35]. Moreover, to acquire effective results, the tuning of GANs can take considerable effort and may need to be repeated for each subject in subject-specific approaches. The limited availability and potential inconsistency of EEG signals thus hinder the actual potential of synthetic EEG data generation using GANs. In a comparative DA study, [36] investigated the performance of different DA methods, such as geometric transformations (GT), AE, and VAE, for the classification of MI data using a deep neural network. They transformed the time-series signals into spectrogram images using the STFT, developed a convolutional neural network (CNN) model to classify the MI-EEG signals, and compared the classification performance of the different DA methods after augmentation. The results showed that a deep convolutional generative adversarial network provided better augmentation performance than the conventional DA methods: GT, AE, and VAE. In another study, [37] proposed a novel BCI framework using autoencoder-based transfer learning, which includes three main components: an autoencoder framework, a joint adversarial network, and a regularized manifold constraint. Their results showed that the proposed framework can achieve better results than state-of-the-art approaches in EEG signal classification tasks.
Common spatial patterns (CSP) is a widely adopted feature extraction technique that has evolved into various forms and maintained its prominence over the years. In the scope of MI-EEG, CSP is used to find spatial filters that maximize the variance of EEG signals from one class (e.g. imagining left-hand movement) while minimizing the variance from the other class (e.g. imagining right-hand movement). The method was initially proposed as a binary-class feature extractor working in a supervised fashion. Later, a study by [38] demonstrated the effectiveness of CSP in classifying multiple classes of motor imagery tasks. They used the one-versus-the-rest (OVR) scheme to handle multi-class classification problems, which has since been widely adopted in the field [39, 40]. In this scheme, a binary CSP filter is calculated for each class against all the remaining classes, transforming a multi-class problem into multiple binary problems, each of which can be solved using the CSP method. The final class decision can then be made based on the output of these binary classifiers. Thanks to its rich characteristics and compatibility with other domains, the CSP algorithm has seen numerous adaptations, each addressing a different limitation of the original method. One of the earliest and most prominent extensions of CSP is filter bank CSP (FBCSP) [41], a more intensive approach incorporating spectral bands (e.g. 0.5–4 Hz, 4–8 Hz). Building on this, [42] developed the discriminative FBCSP, a variant focused on enhancing the discriminative power of the extracted features, further refining the subject-specific approach of FBCSP. Recognizing the challenges of applying CSP with a low number of samples, [43] proposed the regularized CSP (R-CSP), which introduced regularization to the covariance-matrix estimation, improving the stability of the CSP algorithm when training samples are limited. A study by [44] developed the filter bank R-CSP, which combined the filter bank approach of FBCSP with the regularization technique of R-CSP, offering a comprehensive solution to the challenges of frequency selection and the low number of samples. Each of these adaptations filled a specific gap in the original CSP algorithm, contributing to its ongoing relevance in the field of MI-BCI. Normally, CSP algorithms are applied to a signal window of at least 2 s to capture the inherent variance between classes. There has been a recent increase in interest in discovering relations between discrete CSP-transformed time segments of EEG signals, attempting to unveil the time-varying capabilities of CSP-based algorithms. Ghaheri and Ahmadyfard [45] proposed a time-window based CSP algorithm to characterize time information and applied OVR to use CSP for multi-class problems. A recent study by [46] suggested using a sliding window technique where overlapping-band-based FBCSP is applied for the extraction of subject-specific features, later to be used in an LSTM model. They emphasized that sliding window CSP is a step in the right direction for overcoming the non-stationarity of EEG signals, especially considering the variability this causes across multiple sessions and trials.
3. Proposed methodology
3.1. Methodology overview
The proposed methodology presents a data-driven augmentation approach for subject matching using the spatial characteristics of the given EEG signals, centered on the premise that classes with similar distributions across different subjects can be jointly trained for improved classification. This approach enables the creation of new MI-EEG signal sequences by integrating randomly selected waveform segments from similar subject pairs, creating new trial sequences that effectively enrich the target subject's dataset and allow for a wider exploration of class patterns. The process involves calculating class similarity among all subject pairs via the OVR scheme of CSP, aggregating similar subjects, and replacing segments of specified window lengths to generate new EEG sequences. Moreover, the accompanying multi-domain parallel CNN model, thanks to its ability to attend to spatially significant temporal relations in MI-EEG sequences, is a fitting complement to the proposed ACSSR.
3.2. Parallel convolutional neural networks model
The proposed CNN model is designed around a two-branched parallel architecture in which each branch has a specific purpose. The spatiotemporal branch, with the help of a soft attention mechanism, works as a feature extractor that acts like a temporal compass, guiding the temporal branch through learned spatially transformed temporal features. The temporal branch, on the other hand, extracts temporally present features across all channels of the raw signal and generates the final classification results. Sections 3.2.1–3.2.3 offer a detailed explanation of the model's components and workings. Figure 1 shows the overall configuration of the proposed CNN model. The classification performance of the model with and without several data augmentation techniques used in the field of MI-EEG is presented in section 5.
Figure 1. The proposed parallel CNN model, composed of two branches: a spatiotemporal branch acting as a feature extractor for spatially related temporal features, and a temporal branch evaluating the weights extracted from the spatiotemporal branch to generate the final classification output.
3.2.1. Spatiotemporal branch
The spatiotemporal branch is responsible for extracting temporal features using the spatiotemporal input provided through domain transformation by CSP algorithm adopting a sliding window approach. It features two 1D convolutions with average pooling for downscaling the sequences. In each convolutional block, Conv1D, batch normalization, ELU activation, average pooling and dropout layers are included, in that order. At the end of the second average pooling, the branch is forwarded to the attention mechanism along with the temporal branch.
The sliding window approach for CSP involves dividing the EEG signal sequences into smaller chunks of predetermined size. Each chunk is then processed independently using the CSP algorithm, resulting in dynamic CSP representations that can capture spatiotemporal dynamics in the EEG data. For the CSP transform, 3 EEG channels, namely C3, Cz, and C4, are selected to better represent the motor imagery features. The resulting time-dependent CSP components are then fed to the spatiotemporal branch of the parallel CNN model for further feature extraction, and eventually for deriving the attention weights for the subsequent temporal branch. This sliding window CSP features a striding mechanism enabling the overlap of windows at a provided rate, allowing for smoother temporal representation and potentially better feature extraction. Figure 2 illustrates the applied method. The pseudocode of the sliding window CSP algorithm can be found in algorithm 1, and the shape calculations regarding the sliding window CSP are covered in section 4.4.1.
Figure 2. Sliding window CSP. The raw signal of EEG channels C3, C4, and Cz is on top, and the sliding window CSP transformed EEG signal is on the bottom, showing the transformation of the EEG signal into CSP-transformed time-series form.
Algorithm 1. Pseudocode for sliding window CSP.

```
Initialize windowsize and stepsize
Initialize timepoints as the number of timepoints in X
Initialize an empty list Xcsp
for i from 0 to timepoints - windowsize with step stepsize do
    window ← X[:, :, i : i + windowsize]
    Fit csp model with window and y
    Transform window using csp model and store the result in transformedcsp
    Append transformedcsp to Xcsp
end for
Concatenate the Xcsp windows of a trial along the time axis
return Xcsp
```
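As an illustration, the following is a minimal Python sketch of algorithm 1, assuming MNE-Python's `mne.decoding.CSP` as the CSP implementation; the parameter values follow section 4.4.1, and the implementation details of the original study may differ:

```python
import numpy as np
from mne.decoding import CSP  # assumes MNE-Python is installed

def sliding_window_csp(X, y, window_size=10, overlap=0.5, n_components=3):
    """Fit a fresh CSP model per overlapping window and collect the
    transformed features (algorithm 1).

    X: (trials, channels, timepoints), y: (trials,) class labels.
    Returns an array of shape (trials, n_windows, n_components).
    """
    step = int(window_size * (1 - overlap))      # 5 timepoints here
    n_timepoints = X.shape[2]
    X_csp = []
    for i in range(0, n_timepoints - window_size + 1, step):
        window = X[:, :, i:i + window_size]
        csp = CSP(n_components=n_components)     # average-power CSP features
        X_csp.append(csp.fit_transform(window, y))  # (trials, n_components)
    # Stack the transformed windows along a new time axis
    return np.stack(X_csp, axis=1)
```

Note that, for a rigorous train/test split, the per-window CSP models would be fit on training trials only and then applied to test trials; the sketch above omits this bookkeeping for brevity.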
3.2.2. Attention mechanism
A soft attention mechanism is integrated at the end of the spatiotemporal branch. This attention mechanism is effectively driven by the spatiotemporal branch, which provides valuable information about the temporal structure of the data. The attention weights are calculated by applying a softmax activation to the dot product of the feature maps from the final layers of the spatiotemporal and temporal branches. This dot product operation effectively measures the correlation between the transformed CSP and temporal features, identifying the temporal locations of the temporal features that align with the spatial features. This operation is shown in equation (1):
$$A = X_{\mathrm{spt}} \cdot X_{\mathrm{temp}} \qquad (1)$$

where $A$ is a vector of attention scores, and $X_{\mathrm{spt}}$ and $X_{\mathrm{temp}}$ are the outputs of the last convolutional blocks of the spatiotemporal and temporal branches, respectively.
To obtain the attention weights $W$, a softmax activation is applied to the attention scores, normalizing them such that they sum to 1. Given the vector of attention scores $A = (a_1, \ldots, a_K)$ of real numbers, the softmax function maps the attention score vector to a probability distribution $W = (w_1, \ldots, w_K)$, and each attention weight is calculated as shown in equation (2):

$$w_i = \frac{e^{a_i}}{\sum_{j=1}^{K} e^{a_j}} \qquad (2)$$

where $w_i$ represents the attention weight for the $i$th element, and $a_i$ corresponds to the attention score.
To make the dimensions of the attention weights and the temporal features compatible, a dense layer is applied to the attention weights, and the shape-adjusted attention weights $W'$ are acquired as per equation (3):

$$W' = \mathrm{Dense}(W) \qquad (3)$$
Finally, the calculated attention weights are multiplied element-wise with the output of the last convolutional layer in the temporal branch (equation (4)):

$$Y_{\mathrm{temp}} = W' \odot X_{\mathrm{temp}} \qquad (4)$$
where $Y_{\mathrm{temp}}$ is the final output of the attention mechanism, in which the attention weights are applied to the temporal branch. As a whole, this mechanism adaptively emphasizes certain temporal features based on the spatial features. It also conforms with the cross-subject similarity measurements (see section 3.3.1) and subject pairings, as these are performed on the CSP transformed EEG data. Furthermore, the time-dependent nature of the CSP input used in the spatiotemporal branch makes the temporal attention mechanism a suitable addition, as it guides the temporal branch through spatiotemporal features. In figure 3, a sample of attention weights combined across all feature maps for a left-hand trial is presented as an example.
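For concreteness, the following TensorFlow sketch traces equations (1)-(4). The exact tensor shapes are not fully specified in the text, so treating both branch outputs as `(batch, time, filters)` feature maps and attending over the temporal axis of the temporal branch is an assumption here, not necessarily the paper's exact wiring:

```python
import tensorflow as tf
from tensorflow.keras import layers

def cross_branch_attention(x_spt, x_temp, n_filters):
    """Soft attention driven by the spatiotemporal branch.

    x_spt:  (batch, t_spt, n_filters)  -- last spatiotemporal conv block
    x_temp: (batch, t_temp, n_filters) -- last temporal conv block
    """
    # (1) attention scores: dot-product correlation between the feature maps
    scores = tf.einsum('bif,bjf->bj', x_spt, x_temp)       # (batch, t_temp)
    # (2) softmax so that the attention weights sum to 1 over time
    weights = tf.nn.softmax(scores, axis=-1)               # (batch, t_temp)
    # (3) dense layer adjusting the weights' shape to match x_temp
    weights = layers.Dense(n_filters)(weights[..., None])  # (batch, t_temp, n_filters)
    # (4) element-wise emphasis of the temporal features -> Y_temp
    return x_temp * weights
```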
Figure 3. The overall combined distribution of softmax-activated temporal attention maps for a left-hand test trial. A colorbar accompanies the attention map, indicating the weight intensities.
3.2.3. Temporal branch
The temporal branch is the second component of the proposed parallel CNN model, responsible for performing the final classification task while taking advantage of the features extracted from the spatiotemporal branch for guidance. The branch features two convolutional blocks incorporating 1D convolutions, with the output of the second block passing through a flattening layer. Subsequently, a fully connected layer with softmax activation is applied to produce the final classification output. Each convolutional block consists of the following sequential layers: Conv1D, batch normalization, ELU activation, average pooling, and dropout.
To effectively handle the classification of multiple classes in the temporal branch, cross-entropy is employed as the loss function. Integrating softmax activation with cross-entropy loss produces a probability distribution across the classes, enabling the model to evaluate the likelihood of each class for a given input, with the cross-entropy component measuring the difference between the predicted probabilities and the actual labels. This loss is then used to iteratively correct the model's predictions through backpropagation. Such an approach is needed in multi-class scenarios, especially in problems where the distinction between classes can be subtle and nuanced, as in EEG signal classification. Finally, to generate the classification results, the softmax function normalizes the outputs into the range [0, 1], ensuring that the probabilities of all classes sum to one, and the class with the highest probability is selected as the final predicted category.
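For reference, the categorical cross-entropy loss for a single trial with one-hot label $y$ and softmax output $\hat{y}$ over the class set $L$ takes the standard form:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c \in L} y_c \log \hat{y}_c$$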
3.3. Adaptive cross-subject segment replacement
The proposed adaptive cross-subject segment replacement (ACSSR) is a data augmentation technique for generating artificial motor imagery trial signals from the signal sequences of other subjects. The 'adaptive' part of the name stems from its ability to adapt to other subjects to improve classification performance. It can be thought of as a variant of the cross-trial segment replacement approach (described in section 4.5.3) in how it selects trial segments, with the difference that it employs cross-subject trials. The generation of new EEG signal sequences comprises three stages:
- (i) Establishing a subject matching table through the calculation of subject similarity.
- (ii) Selecting balanced imagery class trials using the sampler module (see section 4.5).
- (iii) Randomly replacing segments of predefined lengths in the original sequence with segments from trials of the matched subject to generate new trials.
Figure 4 demonstrates the general workflow of the ACSSR. The main difference between ACSSR and cross-trial segment replacement is that ACSSR retrieves the replacement segment from another subject, selected based on a subject similarity calculation over the existing class data across all subjects, as described in section 3.3.1.
Figure 4. Adaptive cross-subject segment replacement (ACSSR) flowchart. The flowchart shows the general workflow for the division of the signal into segments, replacement of the segments, and adaptively selecting subject pairs after the determination of subject similarity according to the data present in the original set.
3.3.1. Subject similarity calculation
The OVR strategy for CSP is applied to handle the multi-class classification problem. This strategy reduces the complexity of the MI-EEG signals by creating multiple binary CSP filters, where a single class is compared against all other classes. The transformation of EEG data to the CSP feature space has proven particularly useful in the context of clustering different motor imagery classes, as the goal of CSP is to maximize the variance between two distinct classes. Yuksel et al [47] explain the mathematical foundations of CSP in their work.
The OVR strategy creates binary CSP filters for each class. For every class $c$ within the set $L$ representing all motor imagery classes, a binary filter $F_c$ can be expressed as in equation (5):

$$F_c = \mathrm{CSP}(X_c, X_{\bar{c}}) \qquad (5)$$

where $X_c$ is the motor imagery sequences of class $c$, $X_{\bar{c}}$ is the sequences of all other classes, and $\mathrm{CSP}(\cdot, \cdot)$ is a function calculating CSP filters between two classes.
The similarity calculation in this study is performed just after the cross-validation train-test splits are generated and before the training loop starts. Because the use of CSP demands that the labels be known beforehand, the subject similarity method is applied only to the training set in order to prevent data leaks. For each class, a binary filter is created to distinguish between the target class and all other classes, resulting in four different filters per subject and thus a total of 36. In figure 5, a scatter plot visualization is presented for one of the subjects, showing the comparison between the target class and the non-target classes in the OVR scheme. After acquiring the binary CSP filters corresponding to each class following the OVR scheme, the filters are normalized using a min-max scaler, as shown in equation (6):
$$\hat{F}_c = \frac{F_c - \min(F_c)}{\max(F_c) - \min(F_c)} \qquad (6)$$

where $\hat{F}_c$ is the normalized binary CSP filter for class $c$, and $F_c$ denotes the binary CSP filter for class $c$ before normalization.
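A minimal sketch of this OVR filter construction with min-max normalization (equations (5)-(6)), again assuming MNE-Python's CSP; the use of the `filters_` attribute follows MNE's documented behavior, and the exact filter post-processing of the original study may differ:

```python
import numpy as np
from mne.decoding import CSP

def ovr_normalized_csp_filters(X, y, n_components=3):
    """Binary one-versus-rest CSP filters per class, min-max normalized.

    X: (trials, channels, timepoints), y: (trials,) integer labels.
    Returns a dict mapping class label -> normalized filter matrix.
    """
    filters = {}
    for c in np.unique(y):
        csp = CSP(n_components=n_components)
        csp.fit(X, (y == c).astype(int))           # class c vs. the rest
        f_c = csp.filters_[:n_components]           # binary filter F_c, eq. (5)
        filters[c] = (f_c - f_c.min()) / (f_c.max() - f_c.min())  # eq. (6)
    return filters
```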
Figure 5. Scatter plot of target vs. non-target binary CSP filters with 3 CSP components represented in 3D space for Subject 1. Each of the 4 images signifies a class where the target CSP components (red) are separated from the non-target ones (blue).
This normalization facilitates a meaningful comparison by bringing each binary CSP filter to an equal scale. The target-class CSP filters are then used to aggregate the subjects together for each class. Despite its limitations, a t-distributed stochastic neighbor embedding (t-SNE) visualization can offer preliminary insights into how closely the aggregated subjects form clusters based on their target-class CSP filters. t-SNE is known to struggle with preserving global structure [48] and to be sensitive to perplexity and initialization [49], producing varied results in each iteration depending on the initialization parameters; it is nevertheless a useful method for displaying the general characteristics of a complex structure. The t-SNE scatter plot of aggregated subjects of target classes is shown in figure 6.
Figure 6. Scatter plot of aggregated t-SNE representation of subjects of target classes, where Class 0: left-hand imagery, Class 1: right-hand imagery, Class 2: feet imagery, Class 3: tongue imagery. This representation only indicates a general idea of potential distinction, rather than providing exact class boundaries between different subjects.
The Euclidean distances between centroids of subject-aggregated CSP values are calculated to quantify the difference in their characteristics. The calculated Euclidean distances provide a measure of similarity between subjects, which can be used to identify the best match between them. The distance matrices for each class are displayed in the form of heatmaps in figure 10. The subject matching table is derived from these matrices, presenting the closest match for each subject based on the calculated distances (see table 2).
Given a class $c$ with CSP trials $\{x_{c,1}, x_{c,2}, \ldots, x_{c,N_c}\}$, the centroid $O_c$ is derived from equation (7):

$$O_c = \frac{1}{N_c} \sum_{i=1}^{N_c} x_{c,i} \qquad (7)$$

where $N_c$ is the number of trials for class $c$, $i$ is the index of a trial, and $x_{c,i}$ is the CSP feature vector of the $i$th trial for class $c$.
After finding the centroids of each subject for class c, the Euclidean distance between the centroids of two different subjects A and B is determined according to equation (8):
$$D_{AB} = \sqrt{\sum_{j=1}^{N} \left( O_{A,j} - O_{B,j} \right)^2} \qquad (8)$$

where $j$ is the index of the CSP component, and $N$ is the total number of components in the CSP transformed signal. The result, $D_{AB}$, is a scalar representation of the distance between subjects $A$ and $B$ in the CSP feature space. Lower Euclidean distances indicate higher similarity between the pair of subjects for that particular class, while higher distances indicate less similarity.
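A short sketch of the centroid and distance computations (equations (7)-(8)); `csp_a` and `csp_b` are assumed to hold the per-trial CSP feature vectors of one class for subjects A and B:

```python
import numpy as np

def class_centroid(csp_trials):
    """Equation (7): mean CSP feature vector over a class's trials.
    csp_trials: (n_trials, n_features)."""
    return csp_trials.mean(axis=0)

def subject_distance(csp_a, csp_b):
    """Equation (8): Euclidean distance between the two class centroids."""
    return float(np.linalg.norm(class_centroid(csp_a) - class_centroid(csp_b)))
```

Computing this distance for every subject pair and class yields the matrices behind figure 10, from which the closest match per subject (table 2) can be read off.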
4. Experiments
4.1. Experimental setup
All the computational experiments are conducted on a system equipped with 32 GB DDR4 RAM, an AMD Ryzen 5 3600 CPU with 6 cores, and an NVIDIA RTX 2070 GPU with 8 GB of VRAM. Several software packages and their dependencies are used for signal processing, data preparation, and classification. The EEGLAB toolbox [50], under the MATLAB environment [51], is used for importing the raw data and applying band-pass filtering. Python 3.10 [52] is used for data preparation, statistics, plotting, and classification. For data preparation, the pandas [53] and NumPy [54] libraries are employed; for statistical derivations, NumPy and SciPy [55]; for graph and heatmap plotting, Matplotlib [56]; and for classification and related tasks, the Keras library [57] with the TensorFlow backend [58].
4.2. Dataset description
The dataset employed in this study is the multi-class BCI Competition IV 2a dataset [59, 60]. The dataset consists of recordings from 9 subjects who performed 4 different motor imagery (MI) tasks involving imagined movements of various body parts: left-hand, right-hand, feet, and tongue movements. Each subject completed two sessions, conducted on different days. Each session consisted of 72 trials per class, for a total of 288 trials per subject per session. One recording session was designated as training data, which included class labels, while the other session was used as evaluation data without labels; these labels were not made publicly available until the competition had ended.
For electrode placement, a total of 22 EEG channels and 3 monopolar electrooculogram (EOG) channels were used, as shown in figures 7(a) and (b), respectively. The electrodes were positioned in line with the 10–20 system, a standardized method of electrode placement based on the distance of the electrodes from specified anatomical landmarks on the scalp. The electrode placements in the 10–20 system are set by percentages of the distance between these landmarks, with the left mastoid serving as the reference and the right mastoid as the ground. All recordings were conducted with a sampling rate of 250 Hz.
Figure 7. Electrode and EOG placement according to the international 10–20 system: (a) electrode placement based on the international 10–20 system, (b) EOG electrode placement [59, 60]. Reproduced from [60]. CC BY 3.0.
In preparation of the original dataset, the authors applied several pre-processing steps, including a band-pass filter between 0.5 Hz and 100 Hz and a band-stop filter at 50 Hz for the elimination of line noise. Additionally, an expert analysis was performed on the subject datasets to detect and mark any artifacts.
4.3. Experimental design
The experimental design of the motor imagery trials, as shown in figure 8, follows a cue-based paradigm, wherein subjects are instructed to respond to four distinct cues displayed on the screen, representing different imagined movements, and presented at different time intervals. Initially, a cross symbol is shown on a black screen accompanied by a brief warning sound. After a delay of 2 s, an arrow pointing in one of four directions is presented. Each direction corresponds to a specific class, and the subject is required to perform the corresponding imagined movement within a 4 s time frame, completing the trial after a 1.5 s break.
Figure 8. Experimental design of motor imagery tasks in BCI Competition IV 2a. Modified from [59, 60]. Reproduced from [60]. CC BY 3.0.
4.4. Data preparation and pre-processing
4.4.1. Data preparation
The purpose of data preparation is to transform the raw EEG data into the correct shape and scale for the input of the proposed two-branch CNN model (section 3.2). The raw EEG data, consisting of 9 GDF files for each of the training and evaluation sets, are imported via EEGLAB 5 for signal pre-processing (refer to section 4.4.2) and conversion to CSV format [50]. The epoch for each trial sequence starts 1 s after the appearance of the cue arrow and continues until the class arrows on the screen disappear, covering a total duration of 4 s. Each trial sequence therefore spans 1000 frames, i.e. 4 s at a sampling rate of 250 Hz. Equation (9) represents the relationship between the number of frames in the trial sequence ($N$), the duration in seconds ($t$), and the sampling rate ($f_s$):

$$N = t \times f_s = 4 \times 250 = 1000 \qquad (9)$$
The reasoning behind this is to cover the allocated time frame for each type of motor imagery task. The 5 min EOG recording at the start of each session and the 1.5 s breaks between trials are removed from the datasets. Both training and evaluation sets are merged, and their labels are one-hot encoded. To input the datasets into the CNN model, the trial sequences are grouped into larger sets: because each session comprises 288 trials, the total number of trials in the final dataset is 576, accounting for both the training and evaluation sets. The final dimensions of each subject dataset can then be expressed as a 3D matrix, as in equation (10):

$$\mathrm{Shape} = (T, N, C) = (576, 1000, 22) \qquad (10)$$
where T is the number of trials, N is the number of time frames in each trial, and C is the number of EEG channels.
This is the general shape used for the temporal branch of the CNN model. To acquire the shape for the spatiotemporal branch, however, the sliding window CSP steps outlined in algorithm 1 are performed. For the parameters of sliding window CSP, a window size of 10, an overlap of 0.5, and 3 CSP components are chosen, which yields an approximately five-fold decrease in time-axis length. The dimensions obtained for the temporal branch in equation (10) represent the shape to which sliding window CSP is applied; however, only the 3 channels C3, Cz, and C4 are considered for sliding window CSP. The total number of sliding windows, denoted as $N_{sw}$, is a function of the total number of time frames in a trial ($N$), the window size ($W$), and the overlap ($O$). It can be computed using equation (11):

$$N_{sw} = \left\lfloor \frac{N - W}{W(1 - O)} \right\rfloor + 1 \qquad (11)$$

where the floor function $\lfloor \cdot \rfloor$ ensures that the number of sliding windows is an integer.
The transformation is applied to each of the $T$ trials, with the number of CSP components selected as $C_{spt}$. The final dataset dimensions for the spatiotemporal branch, $\mathrm{Shape}_{spt}$, are represented in equation (12):

$$\mathrm{Shape}_{spt} = (T, N_{sw}, C_{spt}) \qquad (12)$$
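Plugging the stated parameters into equations (11) and (12) can be verified with a few lines of Python:

```python
import math

N, W, O = 1000, 10, 0.5   # frames per trial, window size, overlap
n_sw = math.floor((N - W) / (W * (1 - O))) + 1   # equation (11)
print(n_sw)               # 199 -> roughly the five-fold reduction noted above
print((576, n_sw, 3))     # equation (12): (T, N_sw, C_spt)
```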
4.4.2. Signal pre-processing
The pre-processing strategy in this study is centered on the idea of preserving the most valuable information rather than removing the most undesired. This decision is based on the common understanding that deep neural networks are adequately performant in handling noisy or raw data [61, 62], especially when provided with sufficient training samples and robust architectures. Therefore, only a minimal pre-processing stage, utilizing a band-pass filter with finite impulse response and EEG epoching, is applied to the signal sequences.
The signal pre-processing involves the use of a band-pass filter ranging from 8 to 30 Hz [63]. This range is specifically chosen to focus on the 8–13 Hz µ band and the 13–30 Hz β band, which are typically associated with brain activity relevant to MI tasks, thereby increasing the signal-to-noise ratio with respect to the classification of motor imagery tasks. Excessive pre-processing to remove artifacts not only risks losing valuable information in the signal but also has the potential to introduce sampling bias due to the selective removal procedures [64].
4.5. Data augmentation techniques for comparison
In this study, three data augmentation approaches widely used in the MI-EEG literature, namely random time shifting (section 4.5.1), Gaussian noise addition (section 4.5.2), and cross-trial segment replacement (section 4.5.3), are employed for comparison against the proposed ACSSR. The parameters used for the ACSSR data augmentation approach are identical to those used in cross-trial segment replacement; these parameters are stated in section 4.5.3. A sampler module is developed to generate artificial sample sets with balanced classes at a specified dataset size. These dataset sizes are determined as sampling ratios of the original dataset (e.g. 0.25×, 0.5×, 0.75×, 1.0×).
4.5.1. Random time shifting
In this study, random time shifting is applied by moving signals along the temporal axis to create diverse temporal representations of an EEG trial sequence. A randomly generated number within a given range determines how many time points the signals are shifted along the time axis, backwards or forwards. This helps the model adapt to potential variations in the timing of motor imagery-related brain activity across different MI-EEG trials. By including these temporally shifted samples in the training set, the model is exposed to a wider variety of temporal orders, which can lead to improved performance when classifying previously unseen data.

For this purpose, a randomly chosen shift in the range [−10, 10] time points is applied homogeneously to all channels of each trial to create temporal variations.
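A minimal sketch of this augmentation; the circular (wrap-around) boundary handling via `np.roll` is one common choice and an assumption here, as the text does not specify how the vacated edge samples are filled:

```python
import numpy as np

rng = np.random.default_rng()

def random_time_shift(trial, max_shift=10):
    """Shift all channels of a (channels, timepoints) trial by the same
    randomly chosen offset in [-max_shift, max_shift]."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(trial, shift, axis=-1)
```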
4.5.2. Gaussian noise addition
Noise addition as a data augmentation technique in this study involves adding random Gaussian noise, derived from the statistical characteristics of the data, to the original signal. To implement this, the class amplitude mean is calculated based on the trials. Subsequently, Gaussian noise with zero mean and a given standard deviation is produced. This generated noise is then combined with randomly chosen trials to create artificial trial sequences. When combining the noise with a trial, the same noise is applied identically to all 22 channels. This is based on the consideration that introducing random changes across different channels would not be favorable, as EEG signals characteristically have a high inter-channel correlation [46, 65]: the electrode channels influence one another like an echo chamber, and altering this correlation would lead to misleading results. Figure 9 illustrates the data generation process through the addition of Gaussian noise to the original signal sequence.
Figure 9. Addition of Gaussian noise to a 25 ms trial sequence of the feet imagery class: (a) the original EEG signal is represented in red, (b) randomly generated Gaussian noise with a standard deviation of 0.01 is represented in blue, (c) the signal after the addition of Gaussian noise is shown in blue.
In the context of this study, the noise addition is applied across the time frames in each trial ($N$) over the number of EEG channels ($C$). The same noise with a 0.01 standard deviation is added to all 22 channels in a trial sequence, resulting in an augmented trial sequence that incorporates the statistical characteristics of the data. Equation (13) shows the Gaussian noise addition process:
$$\tilde{X}_i = X_i + \mathcal{N}_i(0, \sigma^2) \qquad (13)$$

where $X_i$ denotes the original signal of the $i$th trial, $\tilde{X}_i$ represents the new noise-added signal, and $\mathcal{N}_i(0, \sigma^2)$ is the randomly generated Gaussian noise applied to the $i$th trial, having zero mean and standard deviation $\sigma$.
By applying equation (13), the same generated noise is added to all 22 channels in each trial sequence across 1000 frames.
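A minimal sketch of equation (13); a single noise vector is drawn per trial and broadcast across all channels, preserving the inter-channel correlation as described above:

```python
import numpy as np

rng = np.random.default_rng()

def add_gaussian_noise(trial, sigma=0.01):
    """trial: (channels, timepoints). The same zero-mean Gaussian noise
    vector is added identically to every channel."""
    noise = rng.normal(0.0, sigma, size=trial.shape[-1])  # one noise vector
    return trial + noise  # broadcasts over the channel axis
```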
4.5.3. Cross-trial segment replacement
The cross-trial segment replacement replaces segments of data within each trial with corresponding segments from other trials that are associated with the same imagery class. This involves dividing each trial into smaller segments, randomly selecting a segment to be replaced, and randomly choosing other trials from the same class. A random segment from one of these chosen trials is then selected to replace the original segment.
A blend window, generated randomly, is created using a linear interpolation between 0 and 1. The blend window is then used to proportionally distribute the values from the original and the replacing segment. This replacement strategy is designed to add meaningful variation, while still maintaining the class characteristics of each trial. However, the substantial degree of stochasticity in this approach may create unpredictability and instability. Consequently, this could result in noisy data and might negatively affect the overall performance of the model.
After manual inspection and performance tests, the parameters used in the implementation of this approach and for the proposed ACSSR data augmentation module are selected as a segment length of 50 time frames (0.2 s), and a blend size of 10 time frames.
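The following is a sketch of the blended replacement with the parameters above; the exact blend-window construction is one plausible reading of the description. In ACSSR, the donor trials would come from the matched subject's same-class pool instead of the same subject's, with everything else unchanged:

```python
import numpy as np

rng = np.random.default_rng()

def replace_segment(trial, donor_trials, segment_len=50, blend_len=10):
    """Replace one random segment of `trial` with the corresponding segment
    from a randomly chosen same-class donor trial, blending the borders.

    trial: (channels, timepoints); donor_trials: sequence of same-shape arrays.
    """
    new_trial = trial.copy()
    n_time = trial.shape[-1]
    start = int(rng.integers(0, n_time - segment_len + 1))
    donor = donor_trials[int(rng.integers(len(donor_trials)))]
    segment = donor[:, start:start + segment_len]
    original = trial[:, start:start + segment_len]
    # Linear blend window: fade the donor segment in and out at the borders
    blend = np.ones(segment_len)
    blend[:blend_len] = np.linspace(0.0, 1.0, blend_len)
    blend[-blend_len:] = np.linspace(1.0, 0.0, blend_len)
    new_trial[:, start:start + segment_len] = blend * segment + (1 - blend) * original
    return new_trial
```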
4.6. Hyperparameter tuning
Bayesian Search is used to select the hyperparameters that work best across all subjects. A range of hyperparameters is included in the search, such as the learning rate, batch size, dropout, and number of filters in the convolutional layers. The hyperparameters of both branches are first evaluated independently in single-branch networks with manual inspection. After obtaining a general idea of how both branches perform with the attempted hyperparameter values, the hyperparameter space of the Bayesian Search is narrowed with this prior knowledge to decrease the required time and number of iterations. The overall optimal hyperparameters are listed in table 1.
Table 1. Bayesian Search results for optimal hyperparameters for all subjects.
| Hyperparameter | Optimal Value |
|---|---|
| Learning rate | 0.0001 |
| Batch size | 468 |
| Dropout | 0.4 |
| Base number of filters* | 8 |
*Base number of filters refers to the factor that is doubled in each consecutive convolutional layer (e.g. 32, 64, 128).
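For reference, a minimal sketch of such a search using the KerasTuner package; the search ranges and the `build_parallel_cnn` builder are illustrative assumptions, not the study's exact configuration:

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    # Search space informed by prior single-branch experiments
    lr = hp.Float('learning_rate', 1e-5, 1e-3, sampling='log')
    dropout = hp.Float('dropout', 0.2, 0.6, step=0.1)
    base_filters = hp.Choice('base_filters', [8, 16, 32])
    model = build_parallel_cnn(base_filters, dropout)  # hypothetical builder
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model

tuner = kt.BayesianOptimization(build_model, objective='val_accuracy',
                                max_trials=30)
```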
4.7. Evaluation
For the purposes of evaluating the classification results, 10-fold cross-validation (CV) is used. Prior to the initialization of the CV training loop, a randomly selected 10% portion of the subject dataset, respecting the balance across classes, is reserved as a validation set that remains constant throughout the cross-validation run. The purpose of this validation set is to evaluate each epoch, providing the training and validation losses that drive the selective early stop callback. A brief description of the selective early stop callback is provided in appendix A. The classification performance is reported in terms of accuracy and F1-score, computed as:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (14)$$

$$F1 = \frac{2\,TP}{2\,TP + FP + FN} \qquad (15)$$

where $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives, respectively.
Additionally, a multi-class confusion matrix for each 10-fold CV run is created to further study class-related performance using Cohen's Kappa [66]. The Kappa coefficient, denoted as κ, is a reliability metric that quantifies the level of agreement between two sets of classified data, providing a score ranging from −1 to 1: a score of 1 indicates perfect agreement, 0 indicates agreement by chance, and −1 indicates complete disagreement. Cohen's Kappa is widely used for evaluating models in the field of MI-BCI. It is also a fitting choice for comparison, given its adoption as the primary evaluation metric in BCI Competition IV 2a [59]. Further information regarding Cohen's Kappa is detailed in appendix B.
5. Experimental results and discussion
This section presents the results of the experiments conducted. In this study, various data augmentation techniques are comparatively inspected; their implementations are detailed in section 4. The reported sampling ratios of the data augmentation techniques (e.g. 0.25×, 0.5×) are the best performing ones within the boundaries of the evaluation criteria. Accordingly, the proposed ACSSR methodology performed best when sampling 0.5× of the original dataset. A detailed discussion of the results is provided in section 5.3.
5.1. Class-based subject similarities
The subject-to-subject distances for all 4 classes are calculated in accordance with the method described in section 3.3.1 and demonstrated in figure 10 as an average heatmap across the different folds. This heatmap represents the measure of similarity or dissimilarity between each pair of subjects: lighter shades signify a higher degree of similarity, while darker shades imply a larger distance or dissimilarity. The subject pairings, along with the Euclidean distance between the subjects for each class, are summarized in table 2.
Figure 10. Heatmap of subject-to-subject Euclidean distances for each motor imagery class showing the similarities between subjects.
Table 2. Subject matching table with corresponding subject pairs and their Euclidean distances.
| Subject | Left-Hand match | Left-Hand distance | Right-Hand match | Right-Hand distance | Feet match | Feet distance | Tongue match | Tongue distance |
|---|---|---|---|---|---|---|---|---|
| S1 | Subject 2 | 0.14 | Subject 2 | 0.21 | Subject 3 | 0.05 | Subject 5 | 0.22 |
| S2 | Subject 1 | 0.14 | Subject 1 | 0.21 | Subject 6 | 0.14 | Subject 5 | 0.06 |
| S3 | Subject 7 | 0.19 | Subject 2 | 0.31 | Subject 1 | 0.05 | Subject 8 | 0.14 |
| S4 | Subject 7 | 0.11 | Subject 8 | 0.10 | Subject 6 | 0.18 | Subject 9 | 0.25 |
| S5 | Subject 4 | 0.23 | Subject 4 | 0.15 | Subject 2 | 0.17 | Subject 2 | 0.06 |
| S6 | Subject 1 | 0.21 | Subject 9 | 0.19 | Subject 8 | 0.11 | Subject 9 | 0.06 |
| S7 | Subject 8 | 0.08 | Subject 8 | 0.15 | Subject 1 | 0.08 | Subject 5 | 0.09 |
| S8 | Subject 7 | 0.08 | Subject 4 | 0.14 | Subject 3 | 0.10 | Subject 3 | 0.12 |
| S9 | Subject 2 | 0.22 | Subject 6 | 0.19 | Subject 3 | 0.15 | Subject 6 | 0.06 |
5.2. Classification results
Table 3 presents the comparative classification results in terms of the accuracy metric for each subject, obtained using the DA methods discussed in section 4.5 and the proposed ACSSR method described in section 3.3. The table also includes the results obtained without any DA method, along with the mean and standard deviation of the accuracies. In the table, each data augmentation method is accompanied by the sampling ratio (e.g. 0.25×) yielding its best result. According to the results, time shifting at a sampling ratio of 0.25× achieves 78.14%, followed by noise addition (78.72%) and segment replacement (79.68%), both at a ratio of 0.5×, with the best overall accuracy achieved by the proposed ACSSR method, also at 0.5× (80.47%). Significant differences in the train and validation loss plots are observed when employing the ACSSR approach; for additional details on these plots, see supplementary figures S1 and S2.
Table 3. Comparison of classification accuracy results with different data augmentation techniques.
| Subject | Without DA | Time shifting (0.25×) | Noise addition (0.5×) | Segment replacement (0.5×) | ACSSR (0.5×) |
|---|---|---|---|---|---|
| S1 | 0.8807 | 0.8750 | 0.8846 | 0.8788 | 0.8981 |
| S2 | 0.6173 | 0.6250 | 0.6442 | 0.6673 | 0.6731 |
| S3 | 0.8981 | 0.8904 | 0.8904 | 0.9192 | 0.9269 |
| S4 | 0.7212 | 0.7365 | 0.7462 | 0.7538 | 0.7538 |
| S5 | 0.5654 | 0.5827 | 0.5808 | 0.5788 | 0.5962 |
| S6 | 0.6327 | 0.6269 | 0.6462 | 0.6385 | 0.6750 |
| S7 | 0.9115 | 0.9154 | 0.9154 | 0.9269 | 0.9250 |
| S8 | 0.8788 | 0.8942 | 0.8981 | 0.9019 | 0.8923 |
| S9 | 0.8808 | 0.8865 | 0.8793 | 0.9058 | 0.9019 |
| Mean | 0.7763 | 0.7814 | 0.7872 | 0.7968 | 0.8047 |
| Std. | 0.1329 | 0.1300 | 0.1256 | 0.1303 | 0.1227 |
Abbreviations: DA, data augmentation; ACSSR, adaptive cross-subject segment replacement. The bold represents the highest classification accuracy achieved for each subject and the overall mean.
5.3. Discussion of the results
According to tables 3 and 4, applying the ACSSR augmentation method with a 0.5× sampling ratio yields an overall increase of +3% in both the mean classification accuracy and the F1-score compared to classification without DA, and, as shown in table 5, a +4% increase in Cohen's Kappa. With the ACSSR method, the most apparent increase in accuracy is observed for Subject 2, at over +5%. Subjects 2 and 5 both show a greater improvement in classification performance according to the Kappa results, with increases of over +7%, as can be seen in table 5. The proposed method shows an overall accuracy improvement of approximately +1% compared to the segment replacement method from which it was inspired; the most apparent increase relative to segment replacement is seen for subject 6, at +3%. However, the accuracies of subjects 7, 8, and 9 decreased very slightly. These are the subject datasets on which models usually perform better, compared to the poorly performing subjects 2, 5, and 6. The overall increase for the less performant subjects highlights the capability of the proposed CNN model in utilizing the ACSSR method. Other studies also confirm the lack of homogeneity in quality across the subjects and trials in the dataset [67, 68]. A review study specifically conducted on the BCI Competition IV datasets by [60] suggests that having two sessions on different days had a great impact on the inter-trial variance of the EEG signals; another issue they brought up was the excessive amount of eye artifacts that required attention.
Table 4. Comparison of F1-Score results with different data augmentation techniques.
| Subject | Without DA | Time shifting (0.25×) | Noise addition (0.5×) | Segment replacement (0.5×) | ACSSR (0.5×) |
|---|---|---|---|---|---|
| S1 | 0.8800 | 0.8747 | 0.8834 | 0.8777 | 0.8978 |
| S2 | 0.6035 | 0.6147 | 0.6438 | 0.6621 | 0.6667 |
| S3 | 0.8966 | 0.8895 | 0.8906 | 0.9190 | 0.9264 |
| S4 | 0.7208 | 0.7358 | 0.7461 | 0.7538 | 0.7534 |
| S5 | 0.5624 | 0.5822 | 0.5762 | 0.5742 | 0.5974 |
| S6 | 0.6315 | 0.6269 | 0.6450 | 0.6373 | 0.6744 |
| S7 | 0.9115 | 0.9153 | 0.9155 | 0.9269 | 0.9249 |
| S8 | 0.8777 | 0.8947 | 0.8981 | 0.9020 | 0.8917 |
| S9 | 0.8802 | 0.8865 | 0.8791 | 0.9057 | 0.9016 |
| Mean | 0.7738 | 0.7800 | 0.7860 | 0.7954 | 0.8038 |
Abbreviations: DA, data augmentation; ACSSR, adaptive cross-subject segment replacement. The bold represents the highest classification F1-Score achieved for each subject and the overall mean.
Table 5. Cohen’s Kappa comparison of all experimented data augmentation techniques.
| Subject | Without DA | Time shifting (0.25×) | Noise addition (0.5×) | Segment replacement (0.5×) | ACSSR (0.5×) |
|---|---|---|---|---|---|
| S1 | 0.8410 | 0.8317 | 0.8486 | 0.8378 | 0.8559 |
| S2 | 0.4897 | 0.4980 | 0.5235 | 0.5572 | 0.5642 |
| S3 | 0.8641 | 0.8562 | 0.8539 | 0.8932 | 0.8991 |
| S4 | 0.6282 | 0.6469 | 0.6630 | 0.6714 | 0.6716 |
| S5 | 0.4205 | 0.4458 | 0.4420 | 0.4356 | 0.4980 |
| S6 | 0.5102 | 0.5048 | 0.5305 | 0.5195 | 0.5487 |
| S7 | 0.8820 | 0.8843 | 0.8870 | 0.9042 | 0.8995 |
| S8 | 0.8384 | 0.8592 | 0.8630 | 0.8685 | 0.8561 |
| S9 | 0.8410 | 0.8490 | 0.8380 | 0.8771 | 0.8780 |
| Mean | 0.7017 | 0.7084 | 0.7166 | 0.7294 | 0.7412 |
Abbreviations: DA, data augmentation; ACSSR, adaptive cross-subject segment replacement. The bold represents the highest Cohen’s Kappa coefficient achieved for each subject and the overall mean.
Another metric that should be stated for the generalization properties of the model architecture is the standard deviation across subject accuracy scores. With the ACSSR method, the standard deviation across subjects has decreased from 13.28% to 12.27%. This is significantly lower than robust architectures in the literature such as EEGNet with 13.27% and Shallow ConvNet with 14.54% [69].
Notably, in tables 3 and 4, the relative increase in the F1-score with ACSSR is slightly higher than its impact on the mean accuracy. Specifically, ACSSR increases the mean accuracy from 77.63% (without data augmentation) to 80.47%, a +2.84% improvement. Even more telling is its impact on the F1-score, where it elevates the mean from 77.38% to 80.38%, a +3.00% enhancement. This nuanced improvement suggests that the ACSSR approach is particularly effective in optimizing the precision-recall trade-off, a critical aspect of classification tasks, as the balance between false positives and false negatives is decisive for model robustness. This finding underscores ACSSR's robustness and its potential applicability to different motor imagery datasets, especially in balanced classification tasks, where the objective is not only to maximize overall prediction accuracy but also to ensure fairer classification across all categories.
The subject similarity calculation creates an approximate clustering and matching of subjects using the Euclidean distance in the context of CSP analysis. The results provide valuable insights into the relationships between subjects for each motor imagery class and the distinct ways in which they performed the respective tasks. For instance, as observed in table 2, despite its good overall classification performance of 92.65%, subject 3 was one of the least selected subjects overall, yet it was selected the most for feet imagery. A similar case can be made for subject 5: while its overall classification results suggest poor performance, it is the most popular choice for the tongue imagery class. While these results could change with the introduction of new data from the subjects, they suggest that certain subjects may have specific strengths or weaknesses in performing different motor imagery tasks. Furthermore, because the training and evaluation sets are merged in this study, the dependency on intra-subject similarity is effectively decreased due to cross-session effects [35, 60]. This means that the clustering and matching of subjects is based more on inter-subject similarity than on subject-specific properties, providing a broader perspective on subject relationships. Further analysis and investigation into the characteristics and strategies of individual subjects could provide valuable insights.
For Subject 5 there is a clear struggle with all the methods. The proposed ACSSR was able to increase the classification accuracy of subject 5 by +3% compared to no DA and by +1% compared to time shifting, the second-best DA method for this subject. Moreover, Cohen's Kappa increased considerably for subject 5 relative to the accuracy metric, from 42% to 49%. This relative increase suggests that the classification performance across classes becomes more homogeneous and that weaker classes are less frequently confused with other classes; the decrease in false positives and false negatives contributes to the increase in Kappa values.
6. Conclusion
In conclusion, this study aimed to improve the classification performance of motor imagery signals and thereby contribute to improved brain-computer interfaces. The incorporation of a cross-subject mechanism into the methodology offered a balanced approach, combining the advantages of subject-independent and subject-specific models. This mechanism, along with the attention mechanism implemented within the proposed CNN model, ensured good utilization of relevant temporal and spatial features, contributing to the overall improvement in classification accuracy. The evaluation with 3-times repeated 10-fold CV demonstrated that the ACSSR data augmentation method proposed in this study substantially enhanced classification accuracy on the BCI Competition IV dataset 2a subject datasets in comparison to existing techniques, resulting in an overall improvement of +3% in mean classification accuracy (80.47%) and a +4% increase in mean Cohen's Kappa (74.12%). The effectiveness of the proposed ACSSR is especially notable for subjects whose classification performance was previously at lower levels. A decrease of 1% in the standard deviation across mean subject accuracy scores was achieved, suggesting more consistent performance across subjects compared to some of the established models in the literature. The study's findings further contributed to an improved understanding of how subjects perform different motor imagery tasks and how these are classified by the model, offering valuable insights that may guide the development of more robust BCIs in the future.
Data availability statement
The data that support the findings of this study are available upon reasonable request from the authors. https://bnci-horizon-2020.eu/database/data-sets.
Appendix A: Selective early stop callback
Early stopping is a regularization technique that helps prevent overfitting in machine learning models. It is particularly beneficial in iterative models that converge to a global minimum. After this global minimum, the model is expected to start overfitting to the training data, decreasing its generalization properties. Early stopping halts the training early, at the optimal iteration, in order to protect the generalization properties of the model.
The selective early stop callback mentioned in section 4.7 is a custom-designed callback that works similarly to conventional early stopping, with the goal of selecting the epoch with optimal weights that has a lower chance of overfitting to the training set. When properly configured, it terminates training prematurely, but not before giving the model a fair chance to achieve its maximum potential by eventually reaching a plateau. Regular early stopping stops the training when the validation loss stops improving for a certain number of consecutive epochs; the selective early stop callback, on the other hand, stops the training early when the validation loss becomes significantly higher than the training loss by a predetermined amount, indicating the maximum allowed distance between training and validation loss values. The stopping occurs only after there is no improvement in the validation loss by at least a defined margin over a certain number of consecutive epochs. The set of rules defined in this callback provides more specific and highly effective criteria for preventing overfitting by selecting the epoch weights that are less prone to overfit. The termination and selection criteria are detailed in appendix A.1, and a code sketch follows the criteria below.
A.1. Termination condition and selection criteria
- (i) To trigger the stopping mechanism, two criteria must be met continuously for a certain number of epochs defined by the patience parameter:
  - The validation loss should not show any improvement by at least the min_delta margin over the consecutive epochs.
  - The difference between the training loss and the validation loss should be less than the overfit_threshold hyperparameter.
- (ii) If the training process is stopped early, the following criteria are considered for selecting the weights of the epoch to use:
  - The epoch where the validation loss is closest to the best validation loss within a specified validation loss range defined by the plateau_offset hyperparameter.
  - The epoch where the difference between the training loss and the validation loss is less than the overfit_threshold hyperparameter.
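The sketch below approximates such a callback in Keras; the hyperparameter names follow the text, while the bookkeeping details (e.g. storing the weights of every epoch) are implementation assumptions rather than the authors' exact code:

```python
import numpy as np
import tensorflow as tf

class SelectiveEarlyStop(tf.keras.callbacks.Callback):
    def __init__(self, patience=20, min_delta=1e-4,
                 overfit_threshold=0.1, plateau_offset=0.05):
        super().__init__()
        self.patience = patience
        self.min_delta = min_delta
        self.overfit_threshold = overfit_threshold
        self.plateau_offset = plateau_offset
        self.history = []          # (train_loss, val_loss, weights) per epoch
        self.best_val = np.inf
        self.wait = 0

    def on_epoch_end(self, epoch, logs=None):
        train_loss, val_loss = logs['loss'], logs['val_loss']
        self.history.append((train_loss, val_loss, self.model.get_weights()))
        # (i) both criteria must hold for `patience` consecutive epochs
        no_improvement = val_loss > self.best_val - self.min_delta
        small_gap = (val_loss - train_loss) < self.overfit_threshold
        self.wait = self.wait + 1 if (no_improvement and small_gap) else 0
        self.best_val = min(self.best_val, val_loss)
        if self.wait >= self.patience:
            self.model.stop_training = True
            self._restore_selected_weights()

    def _restore_selected_weights(self):
        # (ii) among non-overfit epochs within `plateau_offset` of the best
        # validation loss, select the epoch closest to the best
        candidates = [h for h in self.history
                      if h[1] <= self.best_val + self.plateau_offset
                      and (h[1] - h[0]) < self.overfit_threshold]
        if candidates:
            _, _, weights = min(candidates, key=lambda h: h[1])
            self.model.set_weights(weights)
```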
The selection process is visualized on a plot showing training and validation loss in figure A1.
Figure A1. Visualization of the selective early stop epoch selection, with hyperparameters and conditions, on a training vs validation loss plot. The selected epoch is shown with a green circle, the lowest validation loss of the training with a red circle; the orange arrow indicates the overfit threshold, and the green rectangle marks the epoch selection area.
Appendix B: Cohen’s Kappa
Cohen’s Kappa is an interrater performance measurement method designed for binary class problems; therefore, for the multi-class case, a one-vs-rest strategy is applied. For each class $c$, a binary problem is created where instances of class $c$ are considered positive and the rest of the classes are considered negative. To calculate Cohen’s Kappa, first the observed agreement $P_o^c$ and the probability of random agreement $P_e^c$ are derived from the confusion matrix, as formulated in equations (B.1) and (B.2), respectively:

$$P_o^c = \frac{TP_c + TN_c}{TP_c + TN_c + FP_c + FN_c} \qquad (B.1)$$

$$P_e^c = \frac{(TP_c + FP_c)(TP_c + FN_c) + (FN_c + TN_c)(FP_c + TN_c)}{(TP_c + TN_c + FP_c + FN_c)^2} \qquad (B.2)$$
where $TP_c$, $TN_c$, $FP_c$, and $FN_c$ are the true positives, true negatives, false positives, and false negatives for the binary problem of class $c$ vs the rest.
The Kappa coefficient for class $c$ is then calculated as per equation (B.3):

$$\kappa_c = \frac{P_o^c - P_e^c}{1 - P_e^c} \qquad (B.3)$$
Finally, the average Cohen’s Kappa of a single fold over all imagery classes $L$ is calculated by equation (B.4):

$$\kappa = \frac{1}{|L|} \sum_{c \in L} \kappa_c \qquad (B.4)$$
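A minimal sketch computing the per-class OVR Kappa and its average (equations (B.1)-(B.4)):

```python
import numpy as np

def ovr_cohens_kappa(y_true, y_pred):
    """Average one-vs-rest Cohen's Kappa over all classes.
    y_true, y_pred: integer label arrays of equal length."""
    n = len(y_true)
    kappas = []
    for c in np.unique(y_true):
        t, p = (y_true == c), (y_pred == c)
        tp = np.sum(t & p);  tn = np.sum(~t & ~p)
        fp = np.sum(~t & p); fn = np.sum(t & ~p)
        p_o = (tp + tn) / n                                           # (B.1)
        p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2  # (B.2)
        kappas.append((p_o - p_e) / (1 - p_e))                        # (B.3)
    return float(np.mean(kappas))                                     # (B.4)
```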
Footnotes
5. A MATLAB toolbox for EEG signal processing.