Introduction

In dermatological clinical trials, numerous digital images are captured to evaluate treatment effects. Automated image analysis, lesion detection, and feature extraction can help experts assess results [1]. Furthermore, machine learning (ML) can be used to identify statistical trends and biases across trials. Integrating state-of-the-art (SOTA) computer vision (CV) techniques into electronic clinical outcome assessments (eCOAs) can transform clinical treatment development [2]. Beyond controlled trials, applying CV to less standardized images from physicians or patients offers benefits for reliable and faster skin condition detection, despite challenges such as variability in image quality and capture standards [3]. Advances in CV are increasingly addressing these limitations, providing promising tools for stakeholders.

Currently, ML applications in dermoscopy focus predominantly on malignant skin lesion tasks like classification, detection, and segmentation [4]. Benign skin diseases, while impactful on quality of life, are less explored. These include Atopic Dermatitis (AD), Stasis Dermatitis (SD), Vitiligo (VI), and Alopecia Areata (AA), which affect millions worldwide:

  • AD is a chronic inflammatory condition affecting 20% of children [5]. The symptoms include recurrent lesions, itching, and dryness, as well as acute or chronic manifestations like erythema, oozing, and lichenification [5, 6].

  • SD, linked to chronic venous insufficiency, affects the lower extremities and presents with discoloration, itching, redness, swelling, and pain due to elevated venous pressure [7].

  • VI, the most common depigmenting disorder, affects 0.1–2% of the global population. Characterized by lighter patches from melanocyte loss, it has psychological but not life expectancy impacts, with origins tied to genetics and stress [8].

  • AA, an autoimmune condition affecting 2% of people globally, causes hair loss in patches or universally, affecting all demographic groups [9].

Applying ML and CV techniques to these diseases could enhance the efficiency and reliability of diagnosis and monitoring through automated systems that deliver repeatable and trustworthy outcomes. A systematic screening of the literature on ML and CV approaches for medical skin image analysis, focusing on VI, AA, SD, and AD, aims to reveal the relevant potential, limitations, challenges, and opportunities. In this systematic review, a detailed description of this search is provided, focusing on methods that quantify visual skin patterns to guide useful knowledge extraction for downstream tasks.

Methods

To provide a consistent account of the research conducted in recent years on ML and CV techniques for the four diseases, a systematic search was completed to produce a list of eligible papers and their key findings. The entire procedure is reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, and the respective checklist is provided in Appendix B. The inclusion criteria required each study to:

  • Focus on one or more of the four skin diseases of interest or include one or more of the four skin diseases among other diseases in the multiclass setting. In the case of dermatitis-related work, the paper must explicitly discuss SD or AD.

  • Propose an automatic solution, whether AI-based or hand-engineered, to classify, detect, and/or segment the query class.

  • Report evaluation results on open-source datasets or reasonably described in-house datasets.

The search was conducted on the Scopus database and included all combinations of words that refer to skin diseases and the names ‘vitiligo’, ‘alopecia areata’, ‘atopic dermatitis’, and ‘stasis dermatitis’, in conjunction with AI- and CV-related terms. The alternative terms ‘venous’, ‘gravitational’, ‘congestion’, and ‘varicose’ for stasis and ‘eczema’ for dermatitis, among others commonly found in the literature, were also taken into consideration. The term ‘leukoderma’ differs from VI, but it was included in the search because this practice may uncover interesting CV works transferable to the VI domain due to the similarity of the two diseases. The search time frame spans January 2004 to December 2024 (the specific end date is regulated by the Scopus database update and consultation time points) and captures emerging methodologies during a period in which AI and CV were evolving radically. The specific search queries used, including keywords, language and publication limitations applied, and dates of the last Scopus consultation, can be found in Appendix A. The manual review of all retrieved works, 441 in total, led to the extraction of useful information about the methods used and, in some cases, promising results. Specifically, the screening process comprised two rounds, prior to which an automated deduplication step identified 7 duplicate entries, reducing the total paper count to 434. The first round screened the titles, abstracts, and, where necessary, introductions of the 434 papers, providing a high-level view of each paper’s contents and leading to the rejection of 351 ineligible entries. The second round consisted of the meticulous study of the 83 remaining papers, 46 of which were finally judged eligible for inclusion (Fig. 1). The selected works were scanned to identify methods related to image classification, lesion segmentation, lesion detection, and image processing techniques for image preprocessing and augmentation.

Fig. 1

PRISMA workflow diagram for discovering literature about alopecia areata, vitiligo, atopic and stasis dermatitis

A special focus has also been placed on discovering efforts for automated skin disease severity quantification, using measures such as the Vitiligo Area Scoring Index (VASI) and Severity of Alopecia Tool (SALT) scores, as well as other relevant metrics used by researchers. Both the screening and data collection parts of this review were conducted by three researchers working independently, without the assistance of automation software. Even though in some studies the information about the method was incomplete and the evaluation pipeline offered no strong proof of the robustness and performance of the proposed methodology (a weakness criticized by N. van Geel et al. [10]), a deliberate decision was made to include such studies in the review to extract additional useful information.

Results

The reviewed studies on skin disease lesions are categorized into four main skin conditions (AA, AD, SD, and VI) and classified by downstream task: classification, object detection, segmentation, and severity score calculation. Additionally, some studies include data augmentation techniques to address data scarcity and overfitting. Most studies focus on VI, whereas eczema-related research helps compensate for the limited research on SD and AD because these conditions share visual characteristics. Among the downstream tasks, classification is the most common, followed by segmentation. Severity score calculation ranks lower and often integrates segmentation as part of the process. Object detection is the least emphasized because segmentation provides a more granular, pixel-level analysis, making it preferable for detailed skin lesion assessment. The publication timeline runs from August 2012 to April 2024. As shown in Figs. 2 and 3, the most productive year was 2023 with 21 works, followed by 2022 with 12 works. Works related to AA saw a sudden peak in 2023, possibly due to the release of high-quality datasets, whereas vitiligo publications are spread across all years, in contrast to articles on the other diseases, which appear sporadically.

Fig. 2

Graphical representation of the number of publications per skin disease and computer vision task

Fig. 3

Timeline for the number of publications related to Alopecia Areata, Stasis and Atopic Dermatitis (Eczema Related), and Vitiligo

Machine learning applications for Alopecia Areata

Most existing studies in the field of computer vision for AA refer to the task of image classification [11,12,13,14,15,16,17,18,19]. The existing approaches are based on the typical machine-learning pipeline shown in Fig. 4, which depicts the ML/CV algorithms used in each subtask, namely preprocessing, feature extraction, feature selection, and classification. In the case of neural networks [11, 12, 14,15,16], the subtask choices are fewer since the network architecture handles the feature extraction and classification subtasks entirely. However, Mittal et al. break down the total process into subtasks even when using neural networks [12]: feature extraction is handled by a pretrained VGG16 network, and classification is performed by a Support Vector Machine (SVM) classifier. As shown in Fig. 4, the use of histogram equalization techniques for contrast enhancement, applied either globally or locally (Contrast Limited Adaptive Histogram Equalization, CLAHE), is a popular choice.

Fig. 4

Overview of CV and ML approaches for AA in the relevant literature. GLCM stands for Gray-Level Co-occurrence Matrix, LBP for Local Binary Patterns, AAA for Artificial Algae Algorithm, WNN for Wavelet Neural Network, and MELM for Modified Extreme Learning Machine. The dotted frame, labeled Neural Network, is an optional choice in the pipeline that can replace the handcrafted feature extraction and classification process

The team of Saraswathi and Pushpa has contributed significantly to the field [16,17,18,19,20,21], accounting for 54% of the relevant literature. Their most promising results are reported in [18], showcasing 96.94% accuracy in a four-class AA classification task. These results are obtained by enhancing the Attention-based Balanced Multi-Tasking Ensembling Deep (AB-MTEDeep) model, which combines feature extraction at various scales and residual connections, with Generative Adversarial Network (GAN)-generated training data. A common element throughout all classification studies is the use of the Figaro1k [22] and DermNet [23] datasets.
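Returning to the preprocessing step, the following minimal Python/OpenCV sketch contrasts global histogram equalization with the local CLAHE variant mentioned above; the file name and parameter values are illustrative assumptions, not settings from the cited works.

```python
import cv2

# Load a scalp image and convert to grayscale (hypothetical file name).
gray = cv2.cvtColor(cv2.imread("scalp.jpg"), cv2.COLOR_BGR2GRAY)

# Global histogram equalization: spreads intensities over the full range,
# but may over-amplify contrast in large uniform regions.
global_eq = cv2.equalizeHist(gray)

# CLAHE: equalizes per tile and clips each tile's histogram to limit
# noise amplification, preserving local contrast around lesion borders.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
local_eq = clahe.apply(gray)
```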

Concerning the image segmentation task, the literature review reveals three existing works: two supervised techniques and one unsupervised. Lee et al. [24] tackle a dual segmentation task, segmenting both scalp and hair-loss areas with a U-Net [25], whereas in [26], Bernardis and Castello-Soccio assign each pixel of the image to a cluster according to its hair density, using the K-means clustering algorithm to differentiate between bald, low-density, and normal-hair scalp regions in scalp images. Owing to its simplicity, the system achieves a low inference time, but, as the authors acknowledge, its ability to differentiate AA from other causes of low hair density is limited. The works described by Lee et al. [24] and Gudobba et al. [27], although referring to segmentation, are described in the following subsection because their main objective is closely related to the score (Severity of Alopecia Tool) calculation task.

Severity of alopecia tool quantitative analysis for Alopecia Areata

The segmentation techniques discussed earlier are foundational for the quantitative analysis of AA images. A widely adopted metric for assessing AA severity is the SALT score [28, 29], which evaluates scalp hair loss across four regions—vertex, right profile, left profile, and posterior—each assigned a weight factor and a score from 0 to 1. The SALT score, ranging from 0 (no hair loss) to 100 (complete hair loss), is calculated by summing the weighted scores for each region. It is used to monitor treatment efficacy by comparing baseline (BL) and follow-up (F/U) scores, with percentage changes computed using a specific formula (Eq. 1) [30]. Despite its usefulness, SALT has limitations, such as ignoring factors like the duration and recurrence of hair loss, psychological impact, and involvement of other areas (e.g., eyebrows, eyelashes) [31]. Additionally, the manual nature of the calculation introduces potential subjectivity and errors, thereby making the calculation time-intensive [31].

$$\frac{\mathrm{SALT}_{\mathrm{BL}} - \mathrm{SALT}_{\mathrm{F/U}}}{\mathrm{SALT}_{\mathrm{BL}}} \times 100\% = \%\ \text{change from baseline}$$
(1)
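To make the calculation concrete, the following minimal Python sketch implements the weighted regional SALT score and the percent change of Eq. 1; the four region weights (vertex 40%, posterior 24%, each profile 18%) follow the standard SALT definition [28, 29], while the function and variable names are illustrative.

```python
# Standard SALT region weights: fraction of total scalp surface per region.
SALT_WEIGHTS = {"vertex": 0.40, "right_profile": 0.18,
                "left_profile": 0.18, "posterior": 0.24}

def salt_score(hair_loss: dict) -> float:
    """SALT score (0-100): weighted sum of per-region hair-loss scores (0-1)."""
    return 100 * sum(SALT_WEIGHTS[region] * score
                     for region, score in hair_loss.items())

def percent_change_from_baseline(salt_bl: float, salt_fu: float) -> float:
    """Eq. 1: relative change between baseline (BL) and follow-up (F/U)."""
    return (salt_bl - salt_fu) / salt_bl * 100

baseline = salt_score({"vertex": 0.5, "right_profile": 0.2,
                       "left_profile": 0.2, "posterior": 0.2})   # 32.0
followup = salt_score({"vertex": 0.3, "right_profile": 0.1,
                       "left_profile": 0.1, "posterior": 0.1})   # 18.0
print(percent_change_from_baseline(baseline, followup))          # 43.75%
```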

Automated machine learning techniques for calculating SALT have been studied to address the time and subjectivity challenges. These methods primarily rely on segmentation approaches that differentiate hair from scalp areas and are categorized into one unsupervised and two supervised techniques.

In the supervised domain, Lee et al.'s segmentation method [24] employs the AloNet model, a CNN based on U-Net [25], to classify each pixel as “hair loss” or “scalp area.” This functionality is embedded in a web application to measure hair loss in AA patients. Similarly, Gudobba et al. [27] developed the HairComb algorithm, which uses two encoder-decoder branches based on U-Net and ResNet50 to automatically calculate the hair loss percentage across alopecia subtypes. HairComb is integrated into Trichy, a web-based tool that provides user guidance for image capturing. HairComb reported a 7% absolute error in calculating affected-area percentages, which is comparable to the state-of-the-art SALT algorithm in [24].
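Both systems ultimately reduce a segmentation output to an affected-area percentage; a minimal sketch of that final step, assuming binary masks for the hair-loss prediction and the overall scalp region, is shown below.

```python
import numpy as np

def hair_loss_percentage(loss_mask: np.ndarray, scalp_mask: np.ndarray) -> float:
    """Percentage of the scalp region labeled as hair loss.

    Both arguments are boolean HxW arrays, e.g. thresholded outputs of a
    U-Net-style per-pixel classifier.
    """
    scalp_pixels = scalp_mask.sum()
    if scalp_pixels == 0:
        return 0.0
    return 100 * np.logical_and(loss_mask, scalp_mask).sum() / scalp_pixels
```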

The unsupervised approach proposed by Bernardis et al. [26] leverages pixel intensity of neighboring pixels to create a visual vocabulary for encoding images into vectors, which are clustered into labels—scalp, low density, and normal hair—for distinguishing hair and scalp in AA images. Diverging from these works, Seol et al. [32] explored a two-dimensional planimetric method for calculating the actual surface area of AA as a means of validating SALT scoring.

Machine learning applications for atopic and stasis dermatitis

Studies specifically focusing on AD and SD are relatively scarce. To address this, the search is broadened to include studies from the overarching domain of eczema, with many categorizing eczema diseases generically [33,34,35,36,37,38,39,40]. However, some works focus on specific eczema diseases, particularly the erythematous-squamous (ES) class, which excludes AD and SD [41,42,43]. Others include seborrheic dermatitis and psoriasis within multiclass settings or classify diverse dermatitis cases [44,45,46,47], offering valuable insights into the automatic processing of AD and SD images due to their potential visual similarities in CV models. For example, Zhou et al. [45] found that using a green background improved classification performance for lesions with black and red colors.

In terms of feature extraction mechanisms in classification tasks, Nourin et al. [48] used handcrafted features, such as the Histogram of Oriented Gradients (HOG) and the Gray-Level Co-occurrence Matrix (GLCM), to classify images of eczema, hemangioma, melanoma, and SD, achieving 95.3% accuracy with GLCM and 78% with HOG. In contrast, learned features extracted from deep CNN architectures [49,50,51] yield higher accuracy (96.04–97.5%) on datasets that include SD images. Hybrid features combining handcrafted and learned approaches are examined in [52], where the ReliefF technique refines the feature set, which is then fed into classifiers such as SVM, K-Nearest Neighbors (KNN), and Decision Trees (DT) to achieve 97% accuracy. Gradient-weighted Class Activation Mapping (Grad-CAM) [53] visual explanations accompany these results.
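For readers unfamiliar with these handcrafted descriptors, the sketch below extracts GLCM and HOG feature vectors with scikit-image; the distances, angles, and cell sizes are illustrative choices, not the parameters of the cited studies.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops, hog

def glcm_features(gray: np.ndarray) -> np.ndarray:
    """Second-order texture statistics from a gray-level co-occurrence
    matrix; `gray` is a uint8 image."""
    glcm = graycomatrix(gray, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

def hog_features(gray: np.ndarray) -> np.ndarray:
    """Histogram of Oriented Gradients descriptor over 16x16-pixel cells."""
    return hog(gray, orientations=9, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2))
```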

Srivastava et al. [54] developed an image-processing method for detecting eczema-affected regions. Their approach begins with image preprocessing, including noise reduction, contrast enhancement, and quality improvement, followed by unsupervised segmentation using K-means clustering in the Lab color space, Otsu thresholding [57], and morphological operations. Features such as color, border, and texture are subsequently extracted. Similarly, two studies [55, 56] address segmentation tasks for psoriasis lesions using simple CNNs combined with optimization techniques, specifically the Adaptive Chimp Optimization Algorithm (AChOA) and the Adaptive Golden Eagle Optimization (IGEO). These methods achieve high segmentation accuracy (97%) and may inform segmentation approaches for AD and SD, given the common CV challenges shared by psoriasis and eczema. One of these studies utilized a private clinical dataset of 7,000 images, including 4,200 images of psoriasis and 2,800 images of healthy skin [55].
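A minimal sketch of this style of unsupervised lesion segmentation (K-means in the Lab color space followed by morphological cleanup) is given below; the cluster count, the redness heuristic for picking the lesion cluster, and the kernel size are illustrative assumptions.

```python
import cv2
import numpy as np

def segment_lesion_kmeans(image_bgr: np.ndarray, k: int = 3) -> np.ndarray:
    """Cluster pixels in Lab space and return a binary lesion mask."""
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB)
    pixels = lab.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 5,
                                    cv2.KMEANS_PP_CENTERS)
    # Heuristic: take the cluster with the highest mean a* (red-green)
    # channel, since eczematous lesions tend to be redder than normal skin.
    lesion_cluster = int(np.argmax(centers[:, 1]))
    mask = (labels.reshape(lab.shape[:2]) == lesion_cluster).astype(np.uint8) * 255
    # Morphological opening then closing: remove speckle, fill small holes.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
```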

In addition to these methods, Rajathi et al. [50] applied machine learning techniques to classify digital images into varicose ulcer stages (five stages) and tissue types (four classes), while also performing lesion segmentation and wound area calculation. Furthermore, the segmentation methods discussed in [58,59,60] are described in the following section as part of severity score calculation techniques.

Eczema area and severity index calculation

The Eczema Area and Severity Index (EASI) [61] was developed in 1998 and later validated to meet the demands of investigators in need of a standardized tool for evaluating the severity of AD signs in clinical studies [62]. The EASI formula involves visual estimation in four body regions (head and neck, upper extremities, trunk, and lower extremities), with each region assigned an area score. Each region is then assessed separately for four signs: erythema, edema/papulation, excoriation, and lichenification. Each sign is assigned an intensity score from 0 to 3: 0 (absent), 1 (mild), 2 (moderate), and 3 (severe) [63]. The SCORAD (Severity Scoring of Atopic Dermatitis) index is also validated but combines subjective assessment of patients’ symptoms with observation of signs [62]. Regarding the EASI calculation, a modified formula was used (Eq. 2):

$$\text{Severity Index} = \text{Area Score} \times \text{Intensity Score} \times \text{Region Score}$$
(2)

Area Score is the percentage of eczema/total skin region, while the Intensity Score can range from 4 for mild eczema (setting 1 point for each of redness, thickness, scratching, and lichenification) to 12 (3 points for each) for severe eczema. The Region Score is the percentage of skin affected by eczema in each of the following four body regions: head (including neck), trunk, upper limbs, and lower limbs.
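A minimal Python sketch of this modified calculation (Eq. 2) follows; the function, the per-region aggregation, and the example values are illustrative of the description above rather than taken from a specific cited implementation.

```python
# Eq. 2 per body region. Following the text above: area_score and
# region_score are percentages (0-100); intensity_score is the sum of four
# sign scores (redness, thickness, scratching, lichenification), each worth
# 1-3 points, so it ranges from 4 (mild) to 12 (severe).
def region_severity(area_score: float, intensity_score: int,
                    region_score: float) -> float:
    return area_score * intensity_score * region_score

# Hypothetical patient: (area %, intensity points, region %) per region.
regions = {"head_neck": (30.0, 6, 10.0), "trunk": (10.0, 4, 5.0),
           "upper_limbs": (20.0, 8, 15.0), "lower_limbs": (5.0, 4, 2.0)}
severity = {name: region_severity(*scores) for name, scores in regions.items()}
```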

Machine learning researchers have developed automated systems for calculating the EASI score to help dermatologists achieve more consistent and reliable results. Alam et al. [58] proposed an automated eczema detection and severity measurement model using 85 web-acquired images. Their pipeline includes: (a) a skin-region detection module leveraging the YCbCr color space, (b) an eczema-region detection module using K-means clustering in the Lab color space and morphological operations, (c) a feature extraction mechanism for color, texture (using GLCM), and border attributes, and (d) a classification system comprising two binary SVMs, one distinguishing healthy from eczematous skin and the other classifying eczema severity as mild or severe.
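The first stage, skin detection in the YCbCr color space, is typically implemented by thresholding the chroma channels; the sketch below uses widely cited Cb/Cr bounds as an assumption, not the exact thresholds of [58].

```python
import cv2
import numpy as np

def skin_mask_ycbcr(image_bgr: np.ndarray) -> np.ndarray:
    """Binary skin mask via fixed chroma thresholds in YCbCr space."""
    # OpenCV orders the channels as (Y, Cr, Cb).
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    # Commonly used skin chroma bounds: Cr in [133, 173], Cb in [77, 127].
    lower = np.array([0, 133, 77], dtype=np.uint8)
    upper = np.array([255, 173, 127], dtype=np.uint8)
    return cv2.inRange(ycrcb, lower, upper)
```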

Bang et al. [64] trained four CNN architectures (ResNet V1-2, GoogleNet, and VGG-Net) to determine the optimal encoder for calculating individual EASI components. The accuracy rates were 90.63% for erythema, 89.06% for induration/papulation, 87.50% for excoriation, and 85.94% for lichenification.

Attar et al. [59] developed “EczemaNet2,” an enhanced version of their earlier “EczemaNet” model [60], which integrates U-Net to detect and segment AD regions. EczemaNet initially used an R-CNN-based approach, where cropped regions of interest (ROIs) were fed into seven classifiers, each producing a severity score (0–3) for a specific disease sign. These scores were averaged across ROIs to calculate the EASI score, with additional support for the TISS [48] and SASSAD [49] scores. EczemaNet achieved a Root Mean Square Error (RMSE) of 1.929 ± 0.019 for EASI. EczemaNet2 replaced the R-CNN segmentation stage with two U-Nets, one for skin segmentation and another for AD segmentation. The postprocessing steps merge segmented regions to generate square image crops for classifier input. Data augmentation techniques, including the pix2pix network, resulted in 25% and 40% improvements in segmentation and eczema detection performance, respectively, compared to the original EczemaNet pipeline.

Machine learning applications for vitiligo

Starting with the classification task for VI images, the literature review uncovers a significant number of works. These vary along several axes: the number of classes, the type of feature extraction mechanism, base versus ensemble classification techniques, and the use of data augmentation and transfer learning (TL). The simpler approaches rely on a plain feature extraction-classifier pipeline, without elaborate preprocessing steps or techniques.

Concerning hand-crafted features, the authors of [65] propose the use of Mel Frequency Cepstral Coefficients (MFCC), features often used for audio-related tasks, and i-Vectors as feature vectors paired with SVM and MLP classifiers, the best-performing of the four feature extraction-classifier combinations tested. Nosseir et al. [66] classify warts, hemangiomas, and VI using first- and second-order (GLCM) statistical features computed from pixel values. Regarding learned features, Sharma et al. [67] perform feature extraction using Inception-V3 and test various ML and DL algorithms as classifiers, including a simple Naive Bayes and a CNN, all of which perform well. Bashar et al. [68] test four CNNs as feature extractors and four classifiers using a similar methodology and obtain comparable results. A custom autoencoder CNN, an architecture commonly used for generative tasks, is defined and trained as a classifier in [69], with the authors reporting 90% accuracy on the validation set. In a multiclass context, Algudah et al. [70] classify VI and five other skin diseases using a short custom CNN; the model is simple, and the results are satisfactory and comparable to the work in [71]. Agrawal et al. [72] classify melanoma, VI, and vascular tumor images using a fine-tuned InceptionV3; however, a noticeable 17% gap between training and test accuracy is observed.

Although more complex and prone to overfitting, ensemble models are often used by researchers to improve classification performance [73,74,75]. Liu et al. [73] employ three identical ResNet50 CNNs trained on different color spaces (RGB, HSV, and YCrCb) in an ensemble to identify skin images affected by VI. Saini et al. [74] define a two-model voting classifier trained on GLCM features. Dodia et al. [75] manually construct a small five-class dataset and combine a VGG-16-based feature extractor with a tree ensemble classifier trained using XGBoost [76]. These studies confirm that ensemble models outperform their base classifiers. However, their use of private datasets hinders comparison with state-of-the-art methods.
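As an illustration of the voting-ensemble idea, here is a minimal scikit-learn sketch over precomputed GLCM feature vectors; the choice of base models and soft voting is an assumption, not the exact configuration of [74].

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

# X_train: (n_samples, n_features) matrix of GLCM features; y_train: labels.
ensemble = VotingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("rf", RandomForestClassifier(n_estimators=200))],
    voting="soft",  # average the predicted class probabilities
)
# ensemble.fit(X_train, y_train)
# predictions = ensemble.predict(X_test)
```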

TL approaches are prevalent for utilizing pre-existing knowledge in a more specific domain [68, 77,78,79]. Mishra et al. [77] propose deep supervision for skin classification using activation mapping to create an image mask and targeting the network layer whose effective receptive field aligns best with the activation mask. An auxiliary loss function is fused with the standard loss function during training, thereby improving performance on the VI datasets. Zhang et al. [78] utilize an in-house dataset along with public datasets, offering an alternative open VI dataset. Their three TL-based models demonstrate performance superior to that of human experts. Bashar et al. [68] employ TL with four different DL architectures on the dataset used in earlier work [78].

In a different direction, generative models enhance datasets with synthetic samples. Luo et al. [79] propose a Cycle-Consistent GAN-based augmentation procedure followed by super-resolution of the generated images. This method improves classification accuracy by 9.3% for a ResNet50 model compared to non-augmented TL. Similarly, Mondal et al. [80] train a Wasserstein GAN with Gradient Penalty (WGAN-GP) to augment their dataset and then use CNNs to classify normal skin, leprosy, tinea versicolor, and VI. These models achieve accuracy in the 0.81–0.94 range; however, the dataset size raises validity concerns. Liang et al. [81] introduce a novel method, Multi-hierarchy Contrastive Learning with Pareto Optimality (MHC-PO), which jointly trains models to learn data representations and perform classification tasks.

For VI lesion segmentation, preprocessing often involves classifying pixels into skin and background [82, 83]. Nugroho et al. [84] use independent component analysis (ICA) to represent skin disorders, followed by a region-growing algorithm for segmentation. Weakly supervised methods constitute a small but significant share of the literature [85, 86]. Bian et al. [85] use activation maps [87] of binary classifiers and the SLIC superpixel algorithm to delineate VI-affected areas, improving Intersection over Union (IoU). Low et al. [88] apply face recognition technology to correct image angles. Semi-supervised approaches like the mean-teacher learning framework proposed by Wang et al. [89] address labeled data scarcity by training a student model using pseudo-labels assigned by a teacher model.
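A minimal sketch of the activation-map-plus-superpixel idea (mark a superpixel as lesion when its mean activation exceeds a threshold) is shown below, assuming a precomputed class-activation map; the SLIC parameters and threshold are illustrative, not those of [85].

```python
import numpy as np
from skimage.segmentation import slic

def cam_to_mask(image: np.ndarray, cam: np.ndarray,
                n_segments: int = 300, thresh: float = 0.5) -> np.ndarray:
    """Refine a coarse class-activation map into a lesion mask that follows
    superpixel boundaries.

    image: HxWx3 RGB array; cam: HxW activation map scaled to [0, 1].
    """
    segments = slic(image, n_segments=n_segments, compactness=10)
    mask = np.zeros(cam.shape, dtype=bool)
    for label in np.unique(segments):
        region = segments == label
        if cam[region].mean() > thresh:  # keep strongly activated regions
            mask |= region
    return mask
```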

U-Nets are widely used in medical segmentation tasks, and for VI, U-Nets and their variations have been implemented [88, 90,91,92]. Low et al. [88] optimize U-Net’s encoding path with different CNN architectures, achieving the best results with InceptionResNetV2. Gou et al. [90] train three segmentation models, PSPNet [93], U-Net, and UNet++ [94], on a large annotated dataset, with UNet++ performing best on in-house data but underperforming on open-source data. Li et al. [91] developed a U-Net-inspired model for facial VI cases, incorporating augmentation methods based on lesion color similarity between database and target images.

Unsupervised methods show promise, particularly in scenarios with limited or subjective data labeling. Khatibi et al. [83] propose a stacked ensemble approach combining color-space-specific fully connected networks and clustering algorithms. Mehmood et al. [95] employ a simpler unsupervised technique to classify pixels based on color values, whereas Anthal et al. [96] use learning vector quantization networks for pixel classification. Nurhudatiana et al. [82] utilize Fuzzy C-means clustering for skin-background and VI segmentation based on YCbCr and RGB color spaces. In addition, Geel et al. [97] analyze ImageJ thresholding functions for VI lesion delineation.

Finally, bounding-box detection methods have been explored. Sorour et al. [98] employ a sequential configuration of YOLO-v5 models to detect VI-affected areas as well as melanoma.

Vitiligo area scoring index calculation

The VASI is a widely used standard for assessing the extent of VI lesions, providing repeatable but subjective insights into disease progression. The VASI score is calculated (Eq. 3) as the product of the affected area, measured in “hand units” (each equal to 1% of total body surface area), and the degree of depigmentation, expressed as a percentage from 0 to 100 [99] (Fig. 5). Accurate assessment involves registering the total affected body surface area and the extent of depigmentation for each lesion. While VASI relies on manual evaluation, techniques like superpixels [100] and level-set segmentation with SIFT and RANSAC [101] have been explored to aid in quantifying VI-affected areas, though their methods and outcomes vary. For instance, the superpixel approach lacks detail on calculating body surface area, whereas the level-set method yields percentage scores representing changes over time rather than direct VASI scores.

$$VASI = \sum_{\text{All body sites}} \text{Hand Units} \times \text{Residual Depigmentation}$$
(3)
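A minimal Python sketch of Eq. 3 follows; each lesion entry pairs hand units with residual depigmentation (as a fraction), and the example values are illustrative.

```python
def vasi(lesions: list[tuple[float, float]]) -> float:
    """Eq. 3: sum over body sites of hand units (1 unit = 1% of total body
    surface area) times residual depigmentation (fraction 0-1)."""
    return sum(hand_units * depigmentation
               for hand_units, depigmentation in lesions)

# Example: 2 hand units at 75% depigmentation plus 1.5 units at 100%.
score = vasi([(2.0, 0.75), (1.5, 1.0)])  # -> 3.0 VASI units
```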
Fig. 5

Visual representation of various degrees of depigmentation [99], licensed under CC BY 3.0

Alternative metrics, such as Facial VASI (F-VASI), Total VASI (T-VASI) [102], the Vitiligo European Task Force (VETF) score [103], and the Vitiligo Extent Score (VES) [104], address limitations of the standard VASI by including facial surfaces or focusing on other aspects of disease assessment. These variations provide alternative approaches for quantifying and monitoring VI progression.

Data augmentation in the scope of medical skin images

Data augmentation requires an initial seed of images from which new samples are generated by applying well-established transformations to the original image (rotation, jittering, blurring, contrast enhancement). Alternatives include mixing two original images or copying and pasting parts of one image into a target sample [89, 91]. In the field of dermoscopy, image data augmentation has been applied in several cases, with the main objective of enhancing the training results of machine learning algorithms. As shown in Fig. 6, the data augmentation works related to the diseases in question are divided into four categories: (a) geometric transformations, (b) kernel filters [12, 52, 59, 88], (c) GAN-based [18, 52, 59, 79, 81, 91, 98], and (d) image mixing [89, 91]. In the case of basic image manipulations applied to dermoscopy images, data augmentation techniques provide a copy of the original image by applying a simple transformation. In deep learning approaches, neural style transfer and GANs are frequently employed for dermoscopy image augmentation. Such approaches require a wealth of training samples to produce reliable new samples; therefore, they cannot be deemed useful in cases where data samples are sparse. Luo et al. [79] use a Cycle-Consistent Generative Adversarial Network [105] followed by a super-resolution module to compensate for the scarcity of Wood’s lamp images, in which vitiligo patterns can be discriminated more effectively, and obtain rather promising results. Mondal et al. [80] employed a Wasserstein GAN with Gradient Penalty to generate synthetic VI images and increase the robustness of the overall classification scheme. An interesting example of augmenting macroscopy skin images captured by mobile devices is described by Andrade et al. [106]: a cycle-consistent adversarial network yields effective quantitative metrics in the form of the Fréchet Inception Distance, while qualitative evaluation returns promising results in some cases. The technique could be applied to different skin diseases when comparing different endpoints (dermoscopy vs. macroscopy images from mobile phones) of a lesion between timestamps. In [52], the authors employ a GAN for the generation of synthetic skin lesion images; however, no samples or evaluation of the generative network are provided. Abdelhalim et al. [107] propose a progressively growing GAN for generating skin cancer images. The adversarial network exploits gradually increasing image resolution during generation to address common training inconsistencies and improve image quality at higher resolutions. Although these techniques have rarely been applied to images depicting areas of AA, AD/SD, and VI, their success in the generic setting of skin lesions reveals their potential for increasing the number of image samples for the diseases in question.
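The basic-manipulation category reduces to a handful of label-preserving transforms; a minimal torchvision sketch is shown below, with all parameter values as illustrative assumptions.

```python
from torchvision import transforms

# Label-preserving geometric and photometric transforms commonly used to
# augment skin lesion images during training.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=30),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 1.5)),
    transforms.ToTensor(),
])
# augmented = augment(pil_image)  # applied independently to each sample
```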

Mixing approaches have also been proposed in the relevant literature for increasing the number of image samples. A promising notion with several developed variations is the copy-and-paste procedure for the instance in question [108]. The copy-paste approach has been shown to significantly improve the results of segmentation algorithms compared to plain image transformations. It can be briefly described as detecting the objects of interest in images and overlaying them on a target image: the positions at which the cut-out parts are overlaid are either selected by a context-aware strategy or placed at random locations of the image. Along with the image, the corresponding segmentation mask is modified to reflect the applied variations. More effective techniques for exploiting copy-paste have been proposed in [109,110,111], where contextual information is utilized to detect more suitable locations and to alleviate the artifacts and noise caused by directly copying and pasting parts of an image.
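A minimal numpy sketch of the basic copy-paste operation (random placement, with the segmentation mask updated in step) follows; the context-aware strategies of [109,110,111] would replace the random offset.

```python
import numpy as np

def copy_paste(src_img, src_mask, dst_img, dst_mask, rng=np.random):
    """Paste the masked lesion from a source image at a random location of
    a destination image, updating the destination mask accordingly.

    Images are HxWx3 uint8 arrays; masks are boolean HxW arrays.
    """
    ys, xs = np.where(src_mask)
    lesion = src_img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    patch = src_mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    ph, pw = patch.shape
    top = rng.randint(0, dst_mask.shape[0] - ph + 1)   # random placement
    left = rng.randint(0, dst_mask.shape[1] - pw + 1)
    out_img, out_mask = dst_img.copy(), dst_mask.copy()
    window = (slice(top, top + ph), slice(left, left + pw))
    out_img[window][patch] = lesion[patch]  # overlay lesion pixels only
    out_mask[window] |= patch               # keep the mask consistent
    return out_img, out_mask
```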

Fig. 6

Taxonomy overview of image data augmentation techniques used for the Vitiligo, Atopic and Stasis Dermatitis, and Alopecia Areata skin diseases

Wang et al. [89] use copy-and-paste augmentation to address the data shortage in VI images for the corresponding skin lesion segmentation. In the proposed methodology, the VI images are edited using Poisson blending [112] to paste the object into the target image, with an improvement of the original method based on a mixed-gradient modification. The technique is applied as a preprocessing step to generate additional samples and their corresponding masks, which are used to train a semi-supervised Mean Teacher segmentation scheme. The results demonstrate that the proposed augmentation significantly increases the segmentation metrics and represents a promising path when annotated VI samples are scarce.
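OpenCV exposes Poisson blending, including a mixed-gradient variant, through its seamlessClone function; the sketch below illustrates this style of lesion pasting, with file names and the paste location as assumptions.

```python
import cv2

# Source lesion patch, its binary mask, and a target skin image
# (hypothetical file names).
src = cv2.imread("lesion_patch.png")
dst = cv2.imread("target_skin.png")
mask = cv2.imread("lesion_mask.png", cv2.IMREAD_GRAYSCALE)

# Paste the patch at the center of the destination image.
center = (dst.shape[1] // 2, dst.shape[0] // 2)

# MIXED_CLONE keeps the stronger gradient of source and target at each
# pixel, i.e., the "mixed gradient mode" variant of Poisson blending.
blended = cv2.seamlessClone(src, dst, mask, center, cv2.MIXED_CLONE)
```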

Another interesting approach addressing data scarcity is proposed in [91]. Dedicated to enhancing the performance of deep segmentation architectures, it generates patches of VI lesions on target images by exploiting Progressive Histogram Colour Transfer (PHCT), as originally proposed by Pouli and Reinhard [113]. The authors select the most suitable VI color transfer using a similarity metric and introduce patches of VI-shaped regions into the target image, achieving a significant increase in segmentation performance.

Datasets for the surveyed diseases

A list of the available public datasets containing dermoscopy and in-the-wild skin lesion images for the four diseases is provided in Table 1. In terms of image quantity, the effective numbers are significantly lower than reported because many of the images in each dataset are transformed duplicates of the originals. For example, the Vitiligo Dataset2 is reported to contain 1,187 images in total; however, the number of unique images is considerably smaller once the duplicates derived from horizontal flips are discounted.

Table 1 List of existing image datasets related to the skin diseases in question

Discussion

Through this review, a wider perspective on the diseases under examination is gained. More specifically, the review reports the effects on the skin that can be captured in digital images, the proposed diagnostic methods and treatment plans, and the corresponding efforts to automatically quantify therapies and the depicted visual patterns. The reader can form significant insights regarding best practices that have produced efficient ML and CV schemes, the limitations that have hindered previous attempts, and the open issues that need to be addressed through new approaches. Although most works in digital dermoscopy refer to skin cancer [139,140,141,142], this review is an attempt to turn the interest of dedicated researchers toward other skin diseases that affect quality of life.

In general, data scarcity creates a considerable challenge, and the four skin diseases are no exception. Although several publicly available datasets have been found with corresponding annotations to some extent, such images were not originally captured for ML applications [143]. Images of different resolutions, shapes, and capturing conditions constitute an arduous operational field for CV and AI algorithms to perform in and extract useful outcomes. In addition, many samples include artifacts or regions of interest that are enlarged on top of existing morphological findings. Training skin-condition predictive models requires a large number of images taken under specific conditions (e.g., distance, point of view, areas of body parts). These datasets are usually collected through clinical examinations and are not provided along with the publication of the related results, either for ethical reasons or for purposes of commercializing the produced models. An additional obstacle, even for extensive clinical image datasets, is variation in the annotations and respective scores among qualified dermatologists [144], and even from the same expert at different re-evaluation times. Therefore, the creation of publicly available curated datasets for each disease would greatly assist the development of more effective ML algorithms and their fairer quantitative evaluation [145].

In skin disease research, GANs can model the distribution of skin disease images and generate synthetic samples resembling real manifestations. This review highlights GAN-based approaches as promising for generative AI in addressing dataset scarcity. However, effectively training GANs is challenging and requires large datasets, typically thousands of images, to achieve satisfactory results. Even then, this approach might not improve the robustness of the detection model, as reported in a recent study of AD [42]. Nevertheless, synthetic image generation is a rapidly evolving domain and might overcome the limitations of previous models, such as the one used in that study. Diffusion models [146] can generate high-fidelity, high-resolution synthetic skin disease images. They can capture intricate details and nuances, making them useful for generating diverse synthetic images that represent various skin diseases, textures, colors, and manifestations. This approach provides an innovative way to augment datasets, understand disease progression, and enhance the robustness of ML models in dermatology research and medical imaging. However, it is important to ensure that the generated synthetic images accurately reflect real-life skin conditions and are used responsibly alongside authentic clinical data. In addition, advanced techniques like elaborate copy-paste methods can further increase sample diversity and improve the representation of each lesion. Li et al. [91], apart from proposing a novel technique for increasing the number of VI image samples, point out that different types of VI manifestations exhibit different visual patterns and may therefore need to be treated differently.

Beyond the obvious disadvantages of data scarcity, it should also be considered that most of the presented works use privately owned datasets, and even when publicly available datasets are employed, the experimental setup differs greatly in each case. This diversity hinders the extraction of useful conclusions about the effectiveness of the proposed approaches, the SOTA results, and future directions.

The calculation of severity scores (SALT, EASI, and VASI) is important for providing objective and accurate quantification of each disease and is therefore a key objective of ML approaches. However, most approaches fail to develop pipelines that can effectively calculate the existing scores, resulting in approximations. With reference to AA and SALT, the technological advancements extend to the exploitation of ML techniques through web [24] and mobile [27] applications, thus providing solutions to practical real-life issues and facilitating use far from the supervised environment of a healthcare clinic or a research lab. Research on SD and EASI falls within the broader field of dermatitis and eczema lesions. Challenges such as the complexity of the visual characteristics of the diseases, which can be confused with other skin conditions, as well as research prioritization and data scarcity, hinder the development of ML and CV algorithms. Regarding the quantification of visual patterns for VASI scores, the literature presents efforts using ML to extract information from vitiligo-affected skin areas, focusing on segmentation or on calculating the differentiation between endpoints rather than providing an actual VASI score. In a relevant literature review [10], few studies dealt with the automated calculation of VASI scores. Moreover, in most studies, the evaluation process fails to provide robust evidence of effectiveness, resulting in unreliable quantitative results, if any. The severity scores (SALT, EASI, and VASI) for each disease can create a bridge between health and machine experts for researching new approaches for unbiased and reliable monitoring of skin disease progression. Although the systematic review of existing works highlights the potential of automated methods to accurately segment VI lesions, the lack of an automated tool specifically targeted at the calculation of the VASI score, as defined by the relevant experts, is striking.

Following the latest regulations in the US and Europe for the development of responsible, trustworthy, transparent, and reliable AI, ML models should be enhanced with interpretability properties, in alignment with the requirement to deliver models that can support their results with reasoning, so that research prototypes can be integrated into routine clinical workflows. Indications of explainability approaches for the AI models developed for the four diseases were reported in [52, 85, 86]. Rather than utilizing visual explanations to justify classification results, the visualizations in [85] and [86] are employed as a form of weakly supervised segmentation mask that guides the process to focus on the important regions for each skin disease.

Conclusions

This report presents a systematic review of the research literature concerning benign skin diseases. It examines the applications of CV and ML techniques for extracting knowledge from images, focusing on the specifics of Atopic Dermatitis, Stasis Dermatitis, Alopecia Areata, and Vitiligo. Apart from presenting works that constitute the SOTA on downstream tasks such as classification, detection, and segmentation, the report covers the severity indices applied by relevant experts to assess the depicted lesions, the data augmentation issue, and the existing datasets. The shortcomings of previous implementations and the latest advancements from other fields of medical imaging that can contribute to the tasks in question have been identified to a large extent.

Although a significant number of publicly available datasets are presented herein, the qualitative issues and the actual quantity of unique samples reveal the need for disease-specific datasets with curated annotations. The poor exploitation of these datasets is demonstrated by the extensive usage of privately owned datasets and data augmentation techniques reported in the literature. On the other hand, the interrater variability [147, 148] in the annotations of skin lesions suggests future directions toward unsupervised or self-supervised approaches.

The review discusses the integration of severity scores, such as SALT, EASI, and VASI into ML approaches for monitoring skin diseases. These scores help connect healthcare and ML experts to develop unbiased, reliable methods for tracking disease progression. For AA, SALT has been effectively used in ML-based web and mobile applications, enabling monitoring outside clinical settings. However, ML and CV advancements in dermatitis and eczema are challenged by the complexity of symptoms and data scarcity. Regarding vitiligo, attempts to use ML for VASI score calculation are limited and lack a robust, reliable method. The need for automated tools specifically designed for accurate VASI score calculation is underscored, highlighting the gap in current research. Emphasis should be placed on developing accurate automated VASI tools, improving data collection, refining ML algorithms for complex conditions, integrating multi-modal data, standardizing ML approaches, and creating user-friendly applications for non-clinical use.

Incorporating skin image analysis into clinical workflows and web/mobile applications can facilitate rapid and precise diagnosis, which is crucial for early intervention and improved treatment outcomes. By reducing the time required for an accurate diagnosis, clinicians can initiate appropriate treatments promptly, thereby improving patient management and validating healing progression. The assessment procedure will greatly benefit from descriptive and comparable visualizations that support and justify clinicians’ reports.