Introduction

HAPE is a critical condition caused by hypoxia at high altitudes and is characterized by symptoms such as dry cough, dyspnea, pulmonary crackles, and exercise intolerance. The condition can progress rapidly, and studies report that the mortality rate among untreated patients may reach 50% [1, 2]. This is one of the most common causes of death related to altitude sickness. Common complications of HAPE include high-altitude cerebral edema, pulmonary embolism, pneumonia, and cerebral infarction [3,4,5,6,7]. As socioeconomic development and tourism thrive in high-altitude regions, an increasing number of individuals from low-altitude areas rapidly ascend to altitudes above 2,500 m via air and rail transport. A minority of these individuals may experience HAPE within hours or days after arriving at high altitude due to maladaptation [8, 9]. Timely diagnosis and quantification of pulmonary edema are priorities during treatment [10]. The vast and sparsely populated high-altitude regions, coupled with insufficient medical facilities and infrastructure, limit the widespread use of large imaging equipment such as CT scanners. Consequently, chest X-rays often serve as an accessible and crucial diagnostic tool for detecting pulmonary edema [11,12,13,14,15,16].

Deep learning based on big data is a rapidly advancing technology. Its integration with clinical practice has the potential to create a unified framework for clinical decision support, which could profoundly transform precision medicine [17, 18]. Technically, X-ray attenuation is proportional to the severity of pulmonary edema [19,20,21,22]. This study employs deep learning methodologies and X-ray technology to develop a model capable of early detection and grading of pulmonary edema through an in-depth analysis of cardiopulmonary imaging morphology and X-ray attenuation. The workflow is shown in Fig. 1.

Fig. 1

The study design and workflow of deep learning for lung field segmentation, HAPE identification, and edema grading

Acquisition of the chest radiographs, segmentation and classification

This retrospective study was approved by the Medical Ethics Committee of the General Hospital of Western Theater Command (Approval No: 2024EC7-ky037), which waived the requirement for written informed consent in accordance with national regulations for retrospective analyses of anonymized imaging data. All methods adhered to the Declaration of Helsinki and relevant institutional guidelines.

The hospital images were acquired via two Siemens YSIO MAX machines and one GE Definium 6000 X-ray machine (120 kV, automatic mAs, images extracted in PNG format). We randomly selected 1,000 images and corresponding labels from the ARXIV_V5_CHESTXRAY dataset [23], which includes a total of 15 categories: (1) Atelectasis; (2) Cardiomegaly; (3) Effusion; (4) Infiltration; (5) Mass; (6) Nodule; (7) Pneumonia; (8) Pneumothorax; (9) Consolidation; (10) Edema; (11) Emphysema; (12) Fibrosis; (13) Pleural Thickening; (14) Hernia; and (15) No finding. Additionally, we labeled 2,923 chest X-rays collected from January to December 2023, referencing the 15 categories in the ARXIV_V5_CHESTXRAY dataset. The resulting 3,923 images, 2,303 of which carried edema labels, were designated the pretraining dataset (pretrain_dataset). The training dataset comprised chest radiographs collected from the 950th Hospital of the Chinese People’s Liberation Army: 1,003 radiographs from patients with HAPE and 702 from normal controls, acquired between January 2007 and December 2023. For external validation, a distinct dataset was constructed from more recent cases, encompassing 679 HAPE and 436 normal chest X-rays obtained exclusively between January and December 2023 from the 950th Hospital and the General Hospital of Western Theater Command. Critically, no patient was represented in more than one dataset, ensuring the integrity of the validation process.

Inclusion criteria:

  1. Adults aged 18–94 years.

  2. Complete clinical, laboratory, and imaging data were available.

  3. HAPE cases met both the clinical and imaging criteria described in [24].

Clinical criteria included: acute onset of at least two of the following symptoms after rapid ascent to altitude >2500 m: dyspnea at rest, cough, cyanosis, or frothy sputum; accompanied by at least two of the following signs: tachypnea, tachycardia, central cyanosis, or pulmonary rales. Imaging criteria required the presence of radiographic findings consistent with pulmonary edema (e.g., patchy or diffuse alveolar infiltrates, Kerley B lines) that were not fully explained by cardiac failure or other pulmonary pathologies.

Exclusion criteria

Any subjects meeting the following conditions were excluded from this study.

  1. Poor image quality.

  2. Missing clinical or imaging data.

From the pretrain_dataset, we randomly selected 1,000 images to form the segmentation training dataset (seg_dataset). Three radiology experts (Yu, Jiang, and Du, each with more than ten years of experience in radiology) manually segmented the lung regions in the seg_dataset, achieving an intraclass correlation coefficient (ICC) of 0.93. The images in the pretrain_dataset labeled with edema were assigned a label of 1, whereas the others were assigned a label of 0, creating an edema pretraining dataset for the identification of pulmonary edema. Furthermore, the three radiologists graded the images in the training_dataset and val_dataset on the basis of clinical outcomes, imaging reports, and comprehensive assessments, categorizing the severity of pulmonary edema into four levels: class 0, no edema; class 1, thickened vascular markings with blurred margins; class 2, interstitial edema; and class 3, alveolar edema. Class 0 represents a normal chest X-ray, whereas class 3 represents severe pulmonary edema with diffuse blurred infiltrates, potentially accompanied by pleural effusion and Kerley B lines. In cases of initial disagreement among the radiologists regarding the grading of the same image, a voting process was employed to reach a consensus. The inter-rater reliability for the four-class severity grading, assessed using Fleiss’ Kappa before adjudication, was 0.75, indicating substantial agreement.
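
The pre-adjudication agreement statistic reported above can be computed with Fleiss’ Kappa. A minimal sketch, where the rating matrix is purely illustrative and not our actual annotation data:

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) matrix of rating counts.

    counts[i, j] = number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters.
    """
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Proportion of all assignments falling in each category.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    # Per-item agreement: fraction of rater pairs that agree.
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()        # observed agreement
    p_e = (p_j ** 2).sum()    # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Illustrative example: 3 raters grading 4 images into 4 severity classes.
ratings = np.array([
    [3, 0, 0, 0],   # all raters agree: class 0
    [0, 2, 1, 0],   # two raters say class 1, one says class 2
    [0, 0, 3, 0],   # all agree: class 2
    [0, 0, 1, 2],   # split between classes 2 and 3
])
kappa = fleiss_kappa(ratings)
```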

Model development

Given the scarcity of annotated HAPE-specific datasets, we employed a transfer learning strategy. This involved pre-training models on a larger, more general dataset containing various pulmonary conditions to learn foundational radiographic features, followed by fine-tuning on our targeted HAPE dataset to adapt these features to the specific task of interest.

For image preprocessing, all the images were resized to 1024 × 1024 pixels, and the pixel values were normalized. Training and testing were conducted on an Nvidia RTX 3070. We report the accuracy and microaveraged area under the receiver operating characteristic curve in a “one vs. rest” format (AUROC), along with confidence intervals (CIs) as outcome measures.
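
The preprocessing step can be sketched as follows. This is a minimal illustration using a nearest-neighbour resize and min-max normalization in NumPy; the interpolation method and normalization constants of the actual pipeline are not specified here, so treat these choices as assumptions:

```python
import numpy as np

def preprocess(image: np.ndarray, size: int = 1024) -> np.ndarray:
    """Resize a grayscale radiograph to size x size (nearest neighbour)
    and normalize pixel values to [0, 1]."""
    h, w = image.shape
    rows = np.arange(size) * h // size   # source row index per output row
    cols = np.arange(size) * w // size   # source column index per output column
    resized = image[rows[:, None], cols].astype(np.float32)
    # Min-max normalization; fixed mean/std constants could be used instead.
    lo, hi = resized.min(), resized.max()
    return (resized - lo) / (hi - lo + 1e-8)

# Example: a synthetic 2000 x 1800 radiograph with 12-bit pixel depth.
rng = np.random.default_rng(0)
img = rng.integers(0, 4096, size=(2000, 1800), dtype=np.uint16)
x = preprocess(img)
```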

Deep learning model for chest radiograph segmentation

We developed a convolutional neural network, DeepLabV3_ResNet50, for the image segmentation task on the basis of the seg_dataset. The model was trained for 50 epochs with a learning rate of 0.001, a batch size of 4, a validation batch size of 1, and a stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 0.00001. An active learning strategy was adopted: samples with inaccurate model predictions were manually corrected and fed back into training to iteratively improve model performance.

Deep learning model for HAPE grading

Using masks generated by the automated segmentation model, we cropped the corresponding lung regions and constructed a binary classification deep learning model based on the VGG19 convolutional neural network to determine whether the images exhibited edema. The experiment utilized the pretrain_dataset with the following hyperparameter settings: a batch size of 32, 50 training epochs, an initial learning rate of 0.01, and the SGD optimizer. For normalization, we employed the ImageNet mean and standard deviation.
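
The mask-based cropping step reduces to a bounding-box operation on the binary lung mask. A minimal sketch, where the image and mask are synthetic stand-ins:

```python
import numpy as np

def crop_to_mask(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Crop an image to the tight bounding box of a binary lung mask
    and zero out pixels outside the mask."""
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return image[r0:r1 + 1, c0:c1 + 1] * mask[r0:r1 + 1, c0:c1 + 1]

# Synthetic example: a 100 x 100 image with a 40 x 30 "lung" region.
img = np.ones((100, 100), dtype=np.float32)
mask = np.zeros((100, 100), dtype=np.float32)
mask[20:60, 10:40] = 1
crop = crop_to_mask(img, mask)
```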

The MobileNet_V2 models were first pre-trained on the pretrain_dataset for multi-class edema grading. Subsequently, these pre-trained models were fine-tuned on the HAPE-specific train_dataset, and the validation set (val_dataset) was used to assess predictive performance. The optimal model was selected on the basis of the area under the ROC curve (AUC). MobileNet_V2 was used to grade edema on scales of 0–2 (classes 1 and 2 merged into a single class) and 0–3, with the same parameters as those used for edema recognition.

To address class imbalance, we applied weighted random sampling during training. Data augmentation techniques included random horizontal flipping, rotation (± 10°), and brightness/contrast adjustment. Training employed early stopping with a patience of 10 epochs based on validation loss. All datasets were split at the patient level to prevent data leakage. Hyperparameters (e.g., learning rate, batch size) were optimized on a held-out validation set separate from the final test set.
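
The weighted random sampling described above assigns each training image a weight inversely proportional to the frequency of its class. A minimal sketch of the weight computation, with illustrative labels:

```python
import numpy as np

def sample_weights(labels):
    """Per-sample weights inversely proportional to class frequency,
    so minority-class images are drawn more often during training."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    freq = dict(zip(classes.tolist(), counts.tolist()))
    return np.array([1.0 / freq[y] for y in labels.tolist()])

# Illustrative imbalanced label set: many class 0, few class 1 and 2.
labels = [0] * 8 + [1] * 2 + [2] * 1
w = sample_weights(labels)
```

Weights of this form can then be handed, for example, to PyTorch’s `WeightedRandomSampler`; each class contributes the same total weight, so minority classes are oversampled.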

Results

Patient and chest radiograph characteristics

Patient ages ranged from 20 to 94 years, as shown in Table 1. There were slightly fewer females than males.

Table 1 Demographic characteristics of patients: age and gender

Segmentation

The segmentation model deeplabv3_resnet50 was trained on the seg_dataset over 50 epochs. The results on its training set indicated a global accuracy of 99.21%, a mean intersection over union (mIoU) of 98.09%, and a mean Dice coefficient of 99.03%. Consequently, deeplabv3_resnet50 was adopted as the segmentation model. The training results are illustrated in Fig. 2.

Fig. 2

Demonstrates that the automated segmentation of lung regions by the deep learning model was effective (a row). Patient positioning and the presence of implants had minimal influence on segmentation outcomes (b row); however, the model’s proficiency in delineating challenging areas like the costophrenic angle, diaphragm, and fluid level was subpar (c row)
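
The reported segmentation metrics follow their standard definitions for binary masks; a minimal NumPy sketch on a synthetic example:

```python
import numpy as np

def seg_metrics(pred: np.ndarray, target: np.ndarray):
    """Global pixel accuracy, IoU, and Dice for binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = np.logical_and(pred, target).sum()     # true-positive pixels
    union = np.logical_or(pred, target).sum()
    acc = (pred == target).mean()
    iou = tp / union if union else 1.0
    denom = pred.sum() + target.sum()
    dice = 2 * tp / denom if denom else 1.0
    return acc, iou, dice

# Synthetic 8 x 8 example with slight disagreement (one extra column).
gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1     # 16 ground-truth pixels
pr = np.zeros((8, 8)); pr[2:6, 2:7] = 1     # 20 predicted pixels
acc, iou, dice = seg_metrics(pr, gt)
```

Note that Dice and IoU are deterministically related (Dice = 2·IoU / (1 + IoU)), which is why both scores track each other closely in Results.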

ROC curve analysis

The ROC curves of the HAPE detection model on the val_dataset are shown in Fig. 3. In the binary classification model for edema detection, VGG19 achieved an accuracy of 89.42%, with an AUC of 0.979 (95% CI: 0.975–0.982) on the train_dataset and 0.950 (95% CI: 0.939–0.960) on the val_dataset.

Fig. 3

Illustrates the ROC curve of the VGG-19 model for high-altitude pulmonary edema recognition with an AUC of 0.979 (95% CI 0.975–0.982) for training and an AUC of 0.950 (95% CI 0.939–0.960) for validation

The ROC curves for the edema grading model on the validation dataset are illustrated in Fig. 4. In the three-class classification task, MobileNet_V2 achieved an accuracy of 87.22% and a macroaverage ROC AUC of 0.92. The per-class AUCs were 0.96 (class 0), 0.84 (class 1), and 0.86 (class 2); per-class sensitivities were [0.96, 0.40, 0.80] and specificities [0.89, 0.40, 0.92]. The macroaverage precision and F1 score were both 0.72.

Fig. 4

Displays the ROC curves of the edema grading model. MobileNet_V2 yielded favorable results for the three-class classification task (a), achieving a macroaverage ROC curve AUC of 0.92. The ROC curve AUC values for each class were as follows: class 0: 0.96, class 1: 0.84, class 2: 0.86, and class 3: 0.96. In the four-class classification task (b), MobileNet_V2 achieved a macroaverage ROC curve AUC of 0.89, with individual AUC values for class 0 (0.95), class 1 (0.79), class 2 (0.86), and class 3 (0.96)

For the four-class classification task, MobileNet_V2 achieved an accuracy of 84.54% and a macroaverage ROC AUC of 0.89, with individual AUCs for class 0 (0.95), class 1 (0.79), class 2 (0.86), and class 3 (0.96). Per-class sensitivities were [0.91, 0.16, 0.37, 0.88]; specificities [0.88, 0.99, 0.97, 0.90]; precisions [0.92, 0.36, 0.45, 0.80]; and F1 scores [0.91, 0.22, 0.40, 0.84]. The macroaverage results comprised sensitivity (0.58), specificity (0.93), precision (0.63), and F1 score (0.59). These results indicate that the model’s capability is effectively binary: it performs well in distinguishing class 0 (normal) from class 3 (severe edema) but shows poor sensitivity for the intermediate classes 1 and 2.

Confusion matrix analysis

A confusion matrix was generated for the edema grading model on the validation dataset (Fig. 5). Each cell reports the proportion of cases in which the severity level predicted by the image model matched the true severity level from the consensus score. Predictions for classes 0 and 3 were more accurate than those for classes 1 and 2. Even when classes 1 and 2 were merged into a single category, the performance of the three-category model remained inferior to that of the two-category model.

Fig. 5

Displays the confusion matrix for the multiclass classification of pulmonary edema based on the validation dataset, encompassing HAPE radiographs
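
The merging of classes 1 and 2 described above can be reproduced by collapsing the corresponding rows and columns of the four-class confusion matrix. A sketch with illustrative labels, not our validation data:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows = true class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def merge_classes(cm: np.ndarray, groups):
    """Collapse a confusion matrix by summing rows/columns per group,
    e.g. groups = [[0], [1, 2], [3]] merges classes 1 and 2."""
    k = len(groups)
    merged = np.zeros((k, k), dtype=int)
    for i, gi in enumerate(groups):
        for j, gj in enumerate(groups):
            merged[i, j] = cm[np.ix_(gi, gj)].sum()
    return merged

# Illustrative labels for a four-class grading task.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 2, 1, 1, 2, 3, 0]
cm4 = confusion_matrix(y_true, y_pred, 4)
cm3 = merge_classes(cm4, [[0], [1, 2], [3]])
```

Confusions between classes 1 and 2 land on the diagonal of the merged matrix, which is why merging can raise accuracy without improving the discrimination of intermediate grades.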

Grad-CAM (Gradient-weighted class activation mapping)

Grad-CAM is a visualization technique for interpreting the predictions of deep learning models. It generates heatmaps by computing the gradients of a target class score with respect to the feature maps of a convolutional layer, weighting each feature map by its spatially averaged gradient, and summing the result, thereby highlighting the image regions the model focuses on when making predictions.
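
Given the feature maps and gradients captured from the target convolutional layer, the core Grad-CAM computation reduces to a gradient-weighted sum followed by a ReLU. A NumPy sketch with synthetic arrays (in practice both arrays come from forward/backward hooks on the network):

```python
import numpy as np

def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """feature_maps, gradients: (channels, H, W) arrays from the target layer.

    Weights each channel by its global-average-pooled gradient, sums the
    weighted maps, applies ReLU, and rescales the heatmap to [0, 1]."""
    weights = gradients.mean(axis=(1, 2))              # alpha_k per channel
    cam = np.tensordot(weights, feature_maps, axes=1)  # sum_k alpha_k * A_k
    cam = np.maximum(cam, 0)                           # ReLU
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Synthetic layer activations: 4 channels of 7 x 7 feature maps.
rng = np.random.default_rng(0)
feats = rng.random((4, 7, 7))
grads = rng.random((4, 7, 7))
heatmap = grad_cam(feats, grads)
```

The low-resolution heatmap is then upsampled to the input size and overlaid on the radiograph, as in Fig. 6.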

Discussion

HAPE is a type of altitude-related noncardiogenic pulmonary edema caused primarily by acute exposure to high altitudes, which leads to excessive hypoxic pulmonary artery pressure, increased pulmonary vascular permeability, and impaired pulmonary fluid clearance [1, 3]. Clinically, it is characterized by symptoms such as dyspnea, cough, pink or white frothy sputum, and cyanosis, with auscultation revealing moist rales. In China, over 12 million people live and work at elevations above 2500 m, with more than 7.4 million residing at altitudes exceeding 3000 m. In 2023, Tibet received more than 55 million tourists. Therefore, in environments far from advanced medical facilities, the rapid use of portable X-ray imaging and offline-deployed artificial intelligence models to increase the early diagnostic efficiency and accuracy of pulmonary edema represents a significant exploration that can benefit the physical and mental health of people in sparsely populated high-altitude regions [25,26,27].

In 2018, Warren et al. [28] proposed the Radiographic Assessment of Lung Edema (RALE), which quantifies the severity of pulmonary lesions on chest X-rays and demonstrates a correlation between the RALE score, lung resection weight, severity of hypoxemia, and prognosis. Rajaraman et al. confirmed the ability of visualization and interpretation of convolutional neural network predictions to detect pneumonia in pediatric chest radiographs [29, 30]. Xiaole Fan developed a COVID-19 CT image recognition algorithm based on transformers and CNNs [31]. Guangyu Wang investigated a deep learning approach for diagnosing and differentiating viral, nonviral, and COVID-19 pneumonia via chest X-ray images [32]. Dominik Schulz’s deep learning model accurately predicts and quantifies pulmonary edema in chest X-rays [27].

HAPE differs fundamentally from cardiogenic pulmonary edema, ARDS, and COVID-19 pneumonia in its pathological causes and processes. Technically, however, X-ray attenuation is proportional to the severity of pulmonary edema. The aforementioned studies have demonstrated that deep learning models can be trained to identify and differentiate exudative lesions on X-rays, and theoretically this approach could also be applied to HAPE. To the best of our knowledge, this is currently the only study utilizing a model trained on various types of pulmonary edema images to identify HAPE. We employed transfer learning to address the issue of insufficient HAPE image data.

This experiment employed DeepLabV3 with the ResNet50 backbone for image segmentation on the seg_dataset. The model automatically generates masks from the input images, achieving a global accuracy of 99.21% on its training set, indicating strong performance. The model also demonstrated excellent intersection over union (IoU) and Dice coefficients across categories, measuring 98.09% and 99.03%, respectively. These results suggest that the selected model architecture and parameter settings effectively capture features within the images, resulting in high-precision segmentation.

We constructed a binary classification deep learning model using the VGG19 convolutional neural network to categorize images on the basis of the presence of edema. The model achieved an accuracy of 89.42%, with an area under the curve (AUC) of 0.979 (95% CI: 0.975–0.982) on the training_dataset and 0.950 (95% CI: 0.939–0.960) on the val_dataset. Overall, the model accurately predicted sample categories and effectively distinguished positive from negative samples, and the narrow confidence intervals for both datasets support its stability.

Timely and accurate quantification of pulmonary edema in chest X-ray images is crucial for the management of acute mountain sickness. A four-class deep learning model was developed on the basis of the train_dataset and MobileNet_V2. Through model training and evaluation, we observed an overall accuracy of 84.54%, indicating the model’s applicability in multiclass classification. The macroaverage ROC AUC was 0.89, with a sensitivity (recall) of 0.58, specificity of 0.93, precision of 0.63, and F1 score of 0.59. However, the discrepancies among these metrics reflect the model’s uneven ability to handle different categories.

Categories 0 and 3 showed robust performance, with sensitivities of 0.91 and 0.88, respectively, underscoring the model’s accuracy in identifying these categories. In contrast, categories 1 and 2 displayed lower sensitivities of 0.16 and 0.37, indicating areas for improvement. The limited number of images for grades 1 and 2 could be attributed to either case scarcity or the inherent complexity of manual classification in these grades. Merging classes 1 and 2 into a consolidated category did not significantly enhance overall performance compared with the two-category model: despite an accuracy of 87.22%, the three-category model’s per-class AUCs remained inferior to those of the two-category model, and the discernment of intermediate classifications did not substantially improve over the four-class model. While the model demonstrated favorable performance based on AUC and accuracy metrics, there is room for improvement in sensitivity and precision.

The most critical finding of our study is the model’s markedly reduced sensitivity for intermediate severity HAPE (classes 1 and 2). A model pre-trained on the broad and weakly-labeled concept of ‘edema’ from a general population struggles to master the specific radiographic nuances of early HAPE in a high-altitude cohort. This performance pattern, however, is not merely a failure but an important quantification of this domain shift challenge. It indicates that our model, in its current form, is not a reliable tool for grading early disease but rather serves as a proof-of-concept for distinguishing unequivocally normal from severe cases. This has significant implications for future AI research in rare diseases, emphasizing that fine-tuning alone may be insufficient to overcome large domain gaps.

Majkowska et al. utilized machine learning methods to automatically detect four abnormalities in X-ray images [33]. For the detection of airspace opacity, including pulmonary edema, the reported area under the receiver operating characteristic curve (AUROC) ranged from 0.91 to 0.94. They developed a model capable of detecting clinically relevant findings in chest X-rays at an expert level. Jarrel et al. employed deep learning techniques to diagnose congestive heart failure (CHF) via chest X-ray images. The authors used a BNP threshold of 100 ng/L as a biomarker for CHF, resulting in an AUROC of 0.82 [20]. Horng et al. not only diagnosed the presence of pulmonary edema but also quantified its severity via deep learning methods [34]. However, similar to the aforementioned studies, the efficiency of the deep learning models in recognizing and differentiating between Class 1 and Class 2 was relatively low. The highest recognition efficiency was achieved for Class 0 (no edema) and Class 3 (alveolar edema), which aligns with our findings. This may indicate a lack of sufficient sample data or training materials for these categories, thereby affecting the model’s learning performance.

The Grad-CAM analysis, as delineated in Fig. 6, reveals three critical insights into the model’s decision-making:

  1. Anatomic focus specificity. The model predominantly activates in the perihilar zones, anatomically corresponding to the pulmonary vasculature and cardiac silhouette. This spatial preference aligns with established radiographic biomarkers of pulmonary edema, notably the peribronchial cuffing and Kerley B lines that radiologists prioritize during diagnostic evaluation [35].

  2. Pathophysiological correlation. High-attention clusters colocalize with cardiomediastinal interface blurring and butterfly-pattern alveolar infiltrates [36]. These findings suggest the model’s capacity to capture the interstitial fluid redistribution patterns characteristic of hemodynamic pulmonary edema.

  3. Clinical interpretability validation. The heatmaps showed some concordance with radiologist diagnoses in our multicenter validation cohort, suggesting that the model’s “visual search” strategy emulates expert diagnostic reasoning. Such interpretability metrics are crucial for implementing AI-CAD systems in clinical workflows per the FDA’s SaMD guidelines [37].

Fig. 6

Interpretable visualization of pulmonary pathologies via Grad-CAM. Panels (a) and (c) display original chest X-ray images, while panels (b) and (d) show the corresponding Grad-CAM heatmaps. The color intensity in the heatmaps reflects the model’s attention level, with warmer colors (red and yellow) indicating higher attention to specific regions

Grad-CAM visualizations suggested that the model focused on clinically relevant regions, providing a preliminary level of interpretability and face validity for its predictions, which is a necessary step towards building trustworthy AI systems.

In this study, we developed a deep learning model that integrates image segmentation with the identification and grading of HAPE on the basis of subjective grading by radiologists. Our model exhibited excellent performance. We believe that our approach has several advantages. Typically, radiologists assess the severity of pulmonary edema through classification scoring, which requires experienced physicians. However, mountainous areas are often remote and distant from major cities, making access to large hospitals and experienced radiologists challenging.

Our work remains a preliminary proof-of-concept for binary edema detection and highlights the significant challenges in severity grading. It underscores that substantially more research and validation are required before any consideration of clinical deployment.

However, this study has several limitations. First, the domain shift between the weakly labeled pre-training dataset (general pulmonary edema) and the carefully adjudicated HAPE-specific fine-tuning dataset may have influenced feature learning, though fine-tuning was used to mitigate this effect. Second, the model showed reduced sensitivity in distinguishing intermediate severity grades (classes 1 and 2), which can be attributed to the inherent subtlety of radiographic findings in these categories and relatively lower sample sizes. Third, although lung segmentation performance was high, challenging anatomical variations—such as poor costophrenic angle visualization, subcutaneous emphysema, consolidation, or effusion—occasionally reduced segmentation accuracy. Future work will focus on advanced domain adaptation, expanding intermediate-class samples, and improving robustness to anatomical and pathological variations.

Furthermore, our segmentation model’s difficulty in handling pathologies like consolidation and effusion, traditionally considered limitations, can be reframed as a valuable insight. It reveals a systematic bias in models trained on healthy anatomies: they may inadvertently exclude the very pathological regions crucial for diagnosis. This suggests that for tasks like edema assessment, a pathology-aware segmentation model or an end-to-end network that jointly optimizes segmentation and classification might be necessary future directions, rather than the traditional segmented-then-classify pipeline we employed.

Conclusion

In conclusion, this feasibility study explored a transfer learning framework to address the acute data scarcity problem in HAPE diagnosis. Our results demonstrate that such an approach can facilitate the development of models with strong performance in binary tasks (e.g., normal vs. severe HAPE). However, the model’s inability to reliably grade intermediate severities provides a crucial cautionary tale about the limits of transferring knowledge from general to highly specific medical domains. The perceived limitations—domain shift, imperfect segmentation, and low intermediate-class sensitivity—are, in fact, the primary contributions of this work, as they map the uncharted territory and outline the specific challenges that must be overcome. Future efforts should focus on collecting larger, prospectively validated HAPE datasets, developing domain adaptation techniques explicitly designed for medical imaging, and creating integrated models that do not treat segmentation and diagnosis as separate problems. This study serves not as a presentation of a finished tool, but as a foundational investigation that defines the path forward for AI in high-altitude medicine.