Background

Data from the Intensive Care Unit (ICU) stored in patient data management systems (PDMS) is increasingly used in clinical practice and research. Large multinational collaborations use data-driven methodology to define and describe patient groups and diagnoses, most notably the Sepsis-3 criteria [1] for adult patients and the Phoenix [2] score for pediatric sepsis. Automated systems based on these definitions are being integrated into ICUs for surveillance [3] and clinical decision support [4, 5].

Biochemical parameters from point-of-care blood gas analysis are commonly used as features in these definitions. In clinical practice, blood for analysis can come from several sources, e.g. arterial, venous, central venous, or capillary. The sample type is typically recorded at the point of care. Manual errors leading to mislabeling of venous samples as arterial, or vice versa, have been described [4, 6]. Moreover, human errors in the ICU are not entirely random but correlate with the number of patient activities performed by bedside clinicians [7, 8] and with illness severity [9]. The distribution of labeling errors could therefore be related to these factors.

Using data with this type of error could lead to flawed conclusions, particularly when using parameters that are known to vary between arterial and venous blood, e.g. hemoglobin oxygen saturation (SO2), pH, arterial partial pressure of oxygen (PaO2), or any parameter derived from those, e.g. the ratio of arterial oxygen tension to inspired oxygen fraction (PaO2/FiO2 or P/F-ratio). These parameters are used extensively for diagnostic and prognostic purposes in the ICU, including quantifying organ failure according to the Sequential Organ Failure Assessment (SOFA) score [10] and grading Acute Respiratory Distress Syndrome (ARDS) [11]. Additionally, overestimation of SO2 based on peripheral oxygen saturation measured by plethysmography (SpO2) is associated with worse outcomes and delayed care delivery [12], and data-driven studies are warranted on predicting this phenomenon, described as silent hypoxemia [13, 14]. Therefore, the development of an accurate automated system for classification of blood gases in the ICU PDMS setting has many potential benefits.

We aimed to develop and validate a supervised machine learning (ML) model that differentiates arterial from non-arterial blood gas samples in ICU PDMS data with the best possible accuracy. We also sought to determine which class of ML model is most suitable for this task, and how such a model should be optimized for detecting erroneously labeled blood gas samples in large ICU datasets.

Methods

Data extraction

We conducted a retrospective, single-center cohort study. All point-of-care blood gas samples labeled as arterial, venous, or central venous from adult and pediatric patients treated in the ICU at Karolinska University Hospital, Huddinge, between January 1 and December 31, 2018, were extracted and included from the Patient Data Management System (PDMS; Centricity Critical Care, GE Healthcare, Chicago, IL, USA). Samples were analyzed on two ABL800 Flex blood gas analyzers (Radiometer Medical A/S, Brønshøj, Denmark), with sample type manually entered by the clinician at the time of analysis. Samples with technical flaws that prevented analysis (e.g. hemolysis, inadequate sample volume) were rejected by the blood gas analyzer and were not transmitted to the PDMS; no other exclusions were made. Clinical data were extracted from the PDMS and the electronic health record (EHR; TakeCare, CompuGroup Medical CGM, Koblenz, Germany).

Data partitioning and features

To enable unbiased final model evaluation, a random subset of patient admissions representing at least 20% of all blood gas samples was selected as a holdout set prior to manual labeling and data analysis. The approximately 80% remaining samples were used as the development set for feature engineering and model training. The development set was further split randomly 80%/20% into training and testing sets, to evaluate model performance during development prior to a final validation in the holdout set. The 80%/20% split between development and holdout was chosen to maximize the amount of training data, and thereby model performance, while still allowing robust evaluation, a common strategy in machine learning [15].

The variables included were pH, partial pressure of carbon dioxide in blood (pCO2), partial pressure of oxygen (pO2), Base Excess (BE), Standard Bicarbonate (StHCO3), Anion Gap, SO2, partial pressure corresponding to 50% hemoglobin oxygen saturation (p50), fraction of Methemoglobin (FMetHb), hemoglobin concentration (Hb), Hematocrit (Hct), and concentrations of Sodium (Na), Potassium (K), Calcium (Ca), Chloride (Cl), Glucose, and Lactate. pH, SO2, pO2, BE, and pCO2 were calculated using Severinghaus’ [16] formula, Siggaard-Andersen’s [17] equations, or numeric inversions thereof [18], if any one of them was missing and the required variables for calculation were present (Supplementary Sect. 2).

Clinical variables included the median SpO2 during the ten minutes before sampling, the SpO2 closest in value to the SO2 within +/- 10 min of blood gas sampling, and the differences between SO2 and these SpO2 values. The highest and lowest mean arterial blood pressure (MAP) and SpO2 within +/- 10 min of sampling were recorded. The FiO2 settings at 5 min before and 15 min after blood gas sampling were recorded, and the P/F-ratio was calculated from SO2 and the FiO2 at 5 min before sampling.

Missing MAP values (likely indicating the absence of an arterial line), changes in FiO2 from before to after blood gas sampling (likely indicating a bedside action based on arterial saturation or PO2), and an SO2 between the lowest and highest measured SpO2 within +/- 10 min (indicating that the SO2 was similar to the currently measured SpO2 values) were considered candidate predictors of blood gas type and encoded as categorical variables. See supplementary Table 1 for full variable definitions and rationale.

True classes of samples

The true sample type was determined through manual review and interpretation of all blood gas samples by a specialist physician in Anesthesia and Intensive Care. Data were presented in a spreadsheet (Excel, Microsoft, Seattle, WA, USA), and the reviewer had access to all filtering, analysis, and sorting tools in that software to aid classification. Clinical data from the PDMS and EHR, including diagnoses, therapeutic procedures, administered medications, and other information such as the original sample type, were also available.

If a sample was suspected to be mislabeled, a second intensivist independently reviewed the case, and a final classification was determined through consensus between the two reviewers. Additionally, all samples labeled as arterial with a recorded pO2 < 6.66 kPa (50 mmHg) were reviewed by both intensivists, regardless of whether mislabeling was suspected by the first reviewer. The final classification (arterial or non-arterial) was considered the ground truth for model development.

Data preparation and feature engineering

In the training cohort, the predictive performance of each individual feature was assessed using a 5-fold cross-validated (CV) area under the receiver operating characteristic curve (AUROC) with the true class as the outcome. For features with a CV AUROC exceeding 0.75, a 24-hour tri-cubic time-weighted mean was calculated. The time-weighted mean difference (TWMD), representing the disparity between the feature value and its time-weighted mean, was then computed and considered as a potential feature.
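The tri-cubic time-weighted mean can be sketched as follows. This is an illustrative implementation, not the authors' code; the kernel normalization and window handling are assumptions:

```python
import numpy as np

def tricubic_weighted_mean(times_h, values, t0, window_h=24.0):
    # Tri-cubic kernel w(u) = (1 - u^3)^3 applied to the normalized time
    # distance u = (t0 - t) / window_h, so recent samples weigh the most.
    times_h = np.asarray(times_h, dtype=float)
    values = np.asarray(values, dtype=float)
    dt = t0 - times_h
    mask = (dt >= 0) & (dt <= window_h)  # only samples in the 24-h lookback
    if not mask.any():
        return float("nan")
    u = dt[mask] / window_h
    w = (1.0 - u**3) ** 3
    return float(np.sum(w * values[mask]) / np.sum(w))
```

The TWMD for a sample is then its observed value minus this weighted mean of the preceding values.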

Imputation and dimensionality reduction

Missing values were imputed individually per dataset using the mean (for continuous features) or mode (for categorical features). The features were mean centered and scaled to unit variance.
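A minimal per-column sketch of this preprocessing step (illustrative only; the actual pipeline used R and may handle columns differently):

```python
import numpy as np
from collections import Counter

def impute_and_scale(col, kind):
    if kind == "continuous":
        # replace NaN with the column mean, then center and scale to unit variance
        x = np.array(col, dtype=float)
        x[np.isnan(x)] = np.nanmean(x)
        return (x - x.mean()) / x.std()
    # categorical: fill missing values with the most frequent level (mode)
    observed = [v for v in col if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [v if v is not None else mode for v in col]
```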

In the training set, Pearson’s correlation coefficient was calculated between all features. If features had a correlation higher than 0.75 with any other feature, a principal component (PC) analysis of the correlated features was performed. Best subset selection in multivariate logistic regression, with sample type as the dependent variable and the Bayesian information criterion as the selection criterion, was used to determine which PCs to include. Those PCs were then calculated in each dataset and used as features instead of the underlying correlated variables.
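The mechanics of replacing correlated columns with principal components can be sketched in numpy. This is a simplified stand-in: it keeps PCs up to a cumulative-explained-variance cutoff, whereas the paper selected PCs by BIC-based best-subset logistic regression:

```python
import numpy as np

def pca_reduce_correlated(X, corr_threshold=0.75, var_kept=0.95):
    # find columns involved in any pairwise correlation above the threshold
    n_feat = X.shape[1]
    corr = np.corrcoef(X, rowvar=False)
    high = np.abs(corr) > corr_threshold
    np.fill_diagonal(high, False)
    involved = np.where(high.any(axis=0))[0]
    if involved.size == 0:
        return X
    rest = np.setdiff1d(np.arange(n_feat), involved)
    # PCA of the correlated block via SVD of the centered columns
    block = X[:, involved] - X[:, involved].mean(axis=0)
    U, S, Vt = np.linalg.svd(block, full_matrices=False)
    explained = np.cumsum(S**2) / np.sum(S**2)
    k = int(np.searchsorted(explained, var_kept)) + 1
    scores = block @ Vt.T[:, :k]  # PC scores replacing the correlated columns
    return np.hstack([X[:, rest], scores])
```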

Unsupervised dimensionality reduction was performed with t-distributed Stochastic Neighbor Embedding (t-SNE) [19] to visually assess whether a low-dimensional representation of the data could separate the classes before and after dimensionality reduction.

Model training and selection

Due to the expected class imbalance (approximately a 10:1 ratio of arterial to non-arterial samples), the area under the precision-recall curve (AUCPR) was used as the primary performance metric throughout the training procedure.
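AUCPR can be computed as average precision, i.e. the precision at the rank of each positive sample averaged over all positives. A minimal numpy sketch (illustrative; ties in scores are not handled specially here):

```python
import numpy as np

def average_precision(y_true, scores):
    # rank samples by decreasing score, then average the precision
    # attained at each position holding a true positive
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)
    precision = tp / np.arange(1, len(y) + 1)
    return float(np.sum(precision[y == 1]) / y.sum())
```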

In the training set, a range of machine learning algorithms was trained and tuned using grid search over each algorithm’s hyperparameter space, with five-fold cross-validation (CV) applied to reduce the risk of overfitting [17]. The algorithms evaluated were random forest (RF), eXtreme Gradient Boosting (XGBoost), support vector machine (SVM), a feed-forward neural network (NN), regularized linear discriminant analysis (RDA), k-nearest neighbors (kNN), and logistic regression (LR) [18,19,20,21,22,23,24]. The two best performing algorithms, defined by the highest AUCPR in the testing set, were selected for further refinement.
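The grid search with five-fold CV follows the usual pattern regardless of algorithm. A generic sketch, where `fit_score` is a hypothetical callback standing in for training one algorithm on a fold and returning its validation metric (AUCPR in the paper):

```python
import numpy as np
from itertools import product

def grid_search_cv(fit_score, grid, X, y, k=5, seed=1):
    # shuffle once, then split indices into k folds
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    best = (None, -np.inf)
    # exhaustively evaluate every hyperparameter combination
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        scores = []
        for f in range(k):
            va = folds[f]
            tr = np.concatenate([folds[j] for j in range(k) if j != f])
            scores.append(fit_score(params, X[tr], y[tr], X[va], y[va]))
        mean = float(np.mean(scores))
        if mean > best[1]:
            best = (params, mean)
    return best
```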

For the two best performing algorithms after this grid search, a forward stepwise feature selection process was performed. At each number of features, a Bayesian optimization process was used to find optimal hyperparameters for each algorithm. The performance on the test set was sequentially evaluated during this process, and if no improvement in AUCPR was seen after three rounds, the process was terminated and the model with the best performance was selected.
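The forward stepwise search with early stopping can be sketched generically. Here `score_fn` is a hypothetical callback that trains and evaluates a model on a given feature subset; the Bayesian hyperparameter re-tuning at each size is folded into it:

```python
def forward_select(score_fn, n_features, patience=3):
    # greedily grow the feature set; stop after `patience` sizes
    # in a row without improvement in the score
    selected, best_set, best_score, stall = [], [], -float("inf"), 0
    remaining = list(range(n_features))
    while remaining and stall < patience:
        # add the single feature that maximizes the score at this size
        cand_scores = {f: score_fn(selected + [f]) for f in remaining}
        f_best = max(cand_scores, key=cand_scores.get)
        selected.append(f_best)
        remaining.remove(f_best)
        if cand_scores[f_best] > best_score:
            best_score, best_set, stall = cand_scores[f_best], list(selected), 0
        else:
            stall += 1
    return best_set, best_score
```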

Finally, all models from the grid search (using the full feature set) and the two best models from the forward feature search were trained on the full training dataset and tested on the holdout set. See supplementary Sect. 3.

Statistical analysis

Model performance was evaluated with AUCPR for overall predictive power, AUROC for discrimination, and the Brier score for calibration. Confidence intervals were estimated using bootstrapping for precision-recall (PR) curves and DeLong’s method for ROC curves [25]. Given the clinically grounded expectation that model output probabilities would cluster near 0 or 1, calibration was assessed visually using a Locally Estimated Scatterplot Smoother (LOESS) plot to avoid binning artifacts and allow a smooth, non-parametric estimation of predicted vs. observed probabilities. Logistic regression was used to assess calibration-in-the-large.
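The percentile bootstrap used for the PR-curve confidence intervals follows this pattern (a sketch; the resample count, seed, and degenerate-resample handling are assumptions):

```python
import numpy as np

def bootstrap_ci(metric, y, p, n_boot=1000, alpha=0.05, seed=42):
    # resample cases with replacement and take percentile bounds
    # of the metric (e.g. AUCPR) over the resamples
    rng = np.random.default_rng(seed)
    stats = []
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if y[idx].sum() in (0, n):  # skip resamples with only one class
            continue
        stats.append(metric(y[idx], p[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```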

The optimal classification threshold was determined using the Fβ score with β = 0.5, prioritizing precision over recall to reduce false positives. Normality of variables was assessed using Q-Q plots. Welch’s t-test and the Wilcoxon rank-sum test were applied for comparisons between continuous variables, as appropriate. The chi-square test was used for comparisons of categorical variables. A p-value < 0.05 was considered statistically significant, with Bonferroni correction applied to account for multiple comparisons when applicable.
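Threshold selection by the Fβ score, which for β < 1 weights precision more heavily than recall, can be sketched as follows (illustrative; the paper's exact threshold grid is not specified):

```python
import numpy as np

def best_fbeta_threshold(y_true, probs, beta=0.5):
    # evaluate every distinct predicted probability as a cutoff and
    # keep the one maximizing F-beta = (1+b^2)*P*R / (b^2*P + R)
    y = np.asarray(y_true)
    p = np.asarray(probs, dtype=float)
    best_t, best_f = 0.5, -1.0
    for t in np.unique(p):
        pred = p >= t
        tp = np.sum(pred & (y == 1))
        if pred.sum() == 0 or tp == 0:
            continue
        prec = tp / pred.sum()
        rec = tp / (y == 1).sum()
        f = (1 + beta**2) * prec * rec / (beta**2 * prec + rec)
        if f > best_f:
            best_t, best_f = float(t), f
    return best_t, best_f
```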

t-SNE was used to visualize class separation in the feature space before and after dimensionality reduction. Feature importance in the final model was evaluated using Shapley Additive Explanations (SHAP) [26, 27].

All calculations were performed in R version 4.2.1 with the packages caret, data.table, ggplot2, Rtsne, and Rcpp, and in C++14 with the Boost library.

Ethical approval

The study was approved by the Swedish Ethical Review Authority (approval number 2019–06203, amendments 2022-04189-02 and 2024-01320-02) with a waiver of informed consent.

Results

A total of 33,800 blood gas samples (30,753 arterial, 3,047 non-arterial) from 691 intensive care admissions were included. The development set consisted of 542 admissions with 26,986 samples (24,463 arterial, 2,523 non-arterial), and the holdout set of 149 admissions with 6,814 samples (6,186 arterial, 628 non-arterial). The number of samples per admission ranged from 1 to 818. Of the 691 admissions, 80 (11%) had at least one mislabeled blood gas in the PDMS during the ICU stay. The total number of erroneously labeled blood gas samples was 150 (0.44%). Patients with at least one error during their ICU stay had a significantly higher mean Simplified Acute Physiology Score (SAPS3) at admission than those with no errors (65.14 vs. 60.46, p = 0.019). The sets were well matched in most aspects, except that the holdout set contained fewer male patients and more patients with surgery prior to admission (clinical characteristics in Table 1).

Table 1 Patient characteristics per set

Of the 39 candidate predictors defined, all but 4 differed significantly between the two classes of blood gas sample in univariate testing after adjusting for multiple testing (supplementary Table 2). The 19 features with the best 5-fold CV AUROC for blood gas type in the development cohort are shown in Table 2; all of them had an AUROC of at least 0.5673 in univariate testing. No variable was missing in more than 6.63% of cases. Full descriptive statistics, including ranges of all blood gas parameters, are found in the supplement (Supplementary Table 3), split by adult or pediatric patients and arterial or venous sample type.

Table 2 The ranges per sample class of the subset of candidate features with CV AUC greater than the median among the 39 candidate predictors in univariate logistic regression in the development cohort

The number of variables was reduced from 39 predictors to 29 using the PCs of correlated variables. All Pearson correlations were lower than 0.75 after this reduction step (Fig. 1). t-SNE of the classes in the reduced feature space still revealed a clear grouping of most venous samples (Fig. 2).

Fig. 1
figure 1

Correlation matrix of the features after reduction of dimensionality with PC of correlated features, the numbers indicate the pairwise Pearson correlation

Fig. 2
figure 2

t-SNE plot from the development cohort after the dimensionality reduction process, colored by true sample class

The development cohort was split proportionally by class 80%/20% into training (21,589 samples) and testing (5,397 samples) sets. Based on AUCPR in the testing set, the best-performing algorithms after grid search of the hyperparameter space were XGBoost and RF. All algorithms evaluated had an AUCPR greater than 0.987 (supplementary Table 5). The XGBoost and RF algorithms were chosen for feature selection and Bayesian optimization of hyperparameters.

In the forward stepwise selection process, the matched SpO2-SO2 difference was the best univariate predictor for both RF and XGBoost (Table 3). In the Bayesian optimization step in the training set, no further improvement in AUCPR was seen in the test set for models with more than 9 features (XGBoost) and 8 features (RF), respectively (Table 3, supplementary Tables 6, 7).

Table 3 Results of forward stepwise feature selection with CV AUCPR after Bayesian optimization at each number of features

Holdout set results

In the holdout set, the best-performing model by AUCPR was XGBoost after Bayesian optimization, with an AUCPR of 0.9974 (95% CI 0.9961–0.9984, Fig. 3), an AUROC of 0.9997 (95% CI 0.9996–0.9999, Fig. 4), and a Brier score of 0.0041 (95% CI 0.0031–0.0051). XGBoost demonstrated the best performance across all evaluated metrics except specificity, where Random Forest showed marginally better performance (Table 4). The XGBoost model had significantly better discrimination than a logistic regression model (Fig. 4, p = 0.02).

Fig. 3
figure 3

PR curves for the final XGBoost model and the Logistic Regression model in the holdout set. The point indicates the best cutoff according to maximum F1-score

Fig. 4
figure 4

ROC curves for the final XGBoost model and the Logistic Regression model in the holdout set. The points indicate the best cutoff according to Youden’s criterion

Table 4 Results in the holdout set of all models after training on the full training set, ranked by AUCPR

Calibration assessment using logistic regression showed borderline significant underprediction of the true probabilities of non-arterial sample type overall (intercept = 0.31, p = 0.05), with mild underconfidence in predictions (slope 1.24, p < 0.01). This pattern was generally consistent with visual assessment of the calibration plot, which also revealed some underprediction in the lower range of probabilities (Fig. 5).

Fig. 5
figure 5

LOESS calibration plot for the XGBoost model. Abbreviations: LOESS Locally Estimated Scatterplot Smoothing, XGBoost eXtreme Gradient Boosting

Visual inspection of the SHAP plot for the optimized XGBoost model supported clinically relevant associations, for example a low SO2 and a large SpO2-SO2 difference were both strongly associated with an increased probability of non-arterial sample type (Fig. 6).

Fig. 6
figure 6

SHAP plot for the final XGBoost model. High model output corresponds to high probability of venous blood gas

Among the blood gases in the holdout data that were initially entered into the dataset as arterial by the bedside clinician, 13 were venous according to the human rater. In this subset, the XGBoost model correctly identified all mislabeled blood gas samples using the cutoff with the highest Fβ score in the test set, with 4 arterial samples misclassified as venous, corresponding to an accuracy of 99.94% (Table 5). The best accuracy in this subset was achieved by the SVM, RF, and XGBoost models using the full feature set, each with only 1 prediction error (supplementary Table 8). For details regarding misclassifications, see Supplementary Sect. 5 and Table 9.

Table 5 Confusion matrices for the XGBoost and logistic regression models among blood gas samples originally entered as ‘arterial’ in the holdout set

Discussion

Summary of findings

We developed and validated a supervised machine learning algorithm capable of classifying blood gas samples from a mixed adult and pediatric ICU population with performance comparable to expert clinical review. In addition, we estimated the prevalence of mislabeled blood gas samples in the ICU and provided descriptive statistics for biochemical parameters based on over 30,000 manually classified samples, representing one of the largest curated blood gas datasets in the literature. Despite existing safeguards, we observed a misclassification rate of 0.44%, suggesting that sample type errors may be more common in the ICU than in standard clinical chemistry settings [2, 3, 7].

Comparison with existing literature

The literature on blood gas sample type errors is limited, but the issue has been recognized in the context of automated SOFA score calculation, and various mitigation strategies have been proposed. For example, the Amsterdam University Medical Centers Database (AmsterdamUMCdb) SOFA algorithm regards all samples with PaO2 < 50 mmHg (6.66 kPa) as non-arterial [15]. Our findings demonstrate that this threshold falls within the observed range for arterial blood gas samples from critically ill patients, highlighting the risk of misclassification when relying solely on absolute PaO2 values.

Descriptive statistics on blood gas ranges from critically ill pediatric patients are not well documented in previous research. Furthermore, while supervised machine learning has been applied to detect other sample type errors in clinical chemistry, no prior study has evaluated its use for classifying blood gas samples in the ICU setting [2]. Compared to other applications of machine learning in clinical medicine, our manual feature engineering including time-weighted averages captures trends in parameters, allowing models to account for changes in physiology.

Strengths and limitations

Our study has several strengths. We conducted a thorough manual review of all included blood gas samples to establish ground truth, independent of any modeling. The holdout set was kept entirely separate with no patient overlap, supporting the external validity of our findings and approximating prospective deployment. The large dataset enabled robust model evaluation with narrow confidence intervals and permitted subgroup-level descriptive statistics.

The final XGBoost model demonstrated high precision in detecting non-arterial samples, allowing reliable performance even in settings where the prior probability of arterial sampling is high. Calibration assessment showed a slight underprediction of non-arterial sample type, with no evidence of overfitting, supported by a low Brier score and strong performance across all metrics in the holdout set. The model achieved this performance with only 9 features, and the ensemble-based algorithms (XGBoost and Random Forest) were consistently better than other methods such as Logistic Regression, Support Vector Machines, and Neural Networks, suggesting complex relationships between input variables and sample type. The SHAP plot suggests clinically relevant relationships; for example, a large difference between SpO2 and SO2 or the absence of arterial blood pressure increased the probability of a venous sample.

Our study has limitations. It is a single-center study, which may limit generalizability to other populations, institutions, or care settings. In particular, the high severity of illness (mean SAPS3 score of 60) suggests a severely ill population, where abnormal blood gas values may be more prevalent. Ground truth classification relied on retrospective review of EHR and PDMS data; reviewers lacked access to real-time bedside context, which may have influenced some classifications. Finally, the model was trained only to distinguish arterial from non-arterial samples and did not differentiate between venous subtypes, which may be relevant in certain contexts.

Clinical implications and future research

This study demonstrates that blood gas sample type errors in the ICU can be detected automatically using machine learning, with potential for real-time implementation to improve diagnostic accuracy and enhance patient safety. Accounting for the broader physiological ranges seen in critically ill patients may reduce misclassification in automated scoring systems and improve interpretation of blood gas parameters.

The matched difference between SpO2 and SO2 was the strongest univariate predictor of sample type, outperforming traditional metrics like pO2 or the difference between SO2 and the local median SpO2. Importantly, a positive SpO2-SO2 difference was occasionally observed even in confirmed arterial samples, supporting the presence of silent hypoxemia in the ICU. Although the underlying mechanisms were not investigated in this study, they may reflect true physiological divergence due to peripheral vasoconstriction, microcirculatory failure, or elevated oxygen extraction. Future research should investigate whether this discrepancy is associated with clinical outcomes. Our model could facilitate such research in large retrospective datasets by distinguishing true silent hypoxemia from apparent SpO2-SO2 mismatches due to sample type misclassification.

Our dataset contains a mixed population of ICU patients, but the sample size is insufficient to allow meaningful subgroup analysis of specific diagnoses, such as septic shock, ARDS, or cardiogenic shock. Future studies could aim to validate this model specifically in these patient groups, where severe physiological derangement could be more prevalent. Our data suggest that sample type errors may be more prevalent in patients with more severe illness, although this was a post-hoc finding and needs to be validated prospectively.

The primary motivation for developing this algorithm was to improve the accuracy of automated SOFA calculation and support sepsis recognition tools. The trained XGBoost model is available upon request, allowing other researchers to apply or validate it in retrospective datasets from other institutions, or in public ICU databases. The approach could also be adapted to identify sample type errors in other clinical settings.

Finally, our results suggest that sample type errors might be more frequent in ICU blood gas analysis than in standard clinical chemistry settings. Additional studies are needed to validate this finding in other institutions and investigate contributing factors such as staffing and workload.

Conclusion

We developed and validated a machine learning model that classifies blood gas sample type in ICU patients using routinely collected clinical data. The model’s high precision supports its potential use in improving automated illness severity assessment and enhancing data quality in both clinical care and research. Our findings also provide new insight into the prevalence of sample type error and the characteristics of arterial and non-arterial blood gas parameters from critically ill adults and children.