1 Introduction

As a rapidly growing form of renewable energy, photovoltaic (PV) energy has been widely adopted worldwide [6, 39]. In sun-rich regions, such as southern Xinjiang in China, the United Arab Emirates, and the deserts of Iran, the number of photovoltaic power plants continues to increase [3, 11, 35]. Owing to geographical factors, these regions often suffer from severe dust pollution and low precipitation, so sand and dust accumulate on the surfaces of the photovoltaic panels and degrade their power generation efficiency. Unclean PV panels can reduce power generation efficiency by approximately 15% [7, 36, 41]. Therefore, cleaning PV panels is essential in the field of photovoltaic energy.

Traditional methods for cleaning photovoltaic panels include manual and vehicle-mounted techniques [2, 10]. However, due to the terrain and dust pollution in PV plant areas, both manual and vehicle-mounted cleaning methods often struggle to accurately locate the PV panels, resulting in wasted water resources and added burdens [37, 38]. Thus, there is a significant need for automatic and precise cleaning solutions, which are expected to enhance the adaptability and flexibility of cleaning robots while minimizing the need for manual intervention [5, 16, 31, 32].

Precise pose recognition is essential for the effective cleaning of photovoltaic (PV) panels. However, this task faces several challenges. The diverse installation angles and orientations of PV panels, interference from ambient light, the degradation of image clarity caused by dust and dirt, and partial occlusion by other panels or environmental factors can all significantly reduce the accuracy and real-time performance of pose detection [15, 25, 48]. Moreover, the reflective properties of PV panel surfaces can make the data collected by sensors unstable, further complicating pose recognition. In real-world scenarios, PV panels are often partially occluded by other objects (e.g., neighboring panels) or by environmental factors such as dust and debris, which makes detection and pose estimation more challenging. Additionally, the scale of the PV panels can vary significantly with their distance from the camera, which introduces further complexity into pose estimation. Recent studies have shown that visual deep learning-based object detection algorithms can enable robots to adaptively recognize the poses of the PV panels requiring cleaning [8, 40].

With the rise of deep learning, neural network-based methods [9, 19,20,21,22,23, 26, 33, 34] have gradually come to dominate the field of object detection. Recently, object detection algorithms have shifted from two-stage methods (such as the R-CNN series [9, 18, 34]) to more efficient single-stage detectors (such as the YOLO and SSD series [26, 33]). These single-stage algorithms perform bounding box regression and object classification within a single network, significantly improving computational efficiency and enabling real-time detection. As a result, they have become vital choices for real-time visual applications. The YOLO series of algorithms [33] has gained widespread adoption in object detection due to its efficiency and accuracy. YOLO divides the entire image into a grid and predicts bounding boxes and class probabilities for each grid cell, enabling fast localization and recognition of objects. YOLO has undergone continuous updates and iterations. YOLOv8 [42], as a recent version in the series, retains the efficient design of its predecessors while introducing several optimizations in network architecture and algorithms, further enhancing detection accuracy and speed, especially in complex scenarios. YOLOv8 has been widely applied to various applications, including autonomous driving [1, 29], industrial inspection [14, 27, 44, 47], and more. YOLOv8 Nano (YOLOv8n) is a lightweight variant of YOLOv8, designed to offer a good balance between computational efficiency and detection accuracy, especially in resource-constrained environments.

Considering the advantages of YOLOv8n in terms of high speed and good performance, we adopt YOLOv8n as the primary method for PV cleaning pose recognition. However, we find that directly using YOLOv8n for pose recognition does not yield satisfactory results: the model's inference time is still long and its detection accuracy is not high enough. Preliminary analysis shows that this sub-optimal performance stems from a mismatch between the network structure, the detection objective function, and the PV pose recognition task. The YOLOv8n backbone contains several convolutional layers that require substantial computation, leading to prolonged inference. Moreover, as discussed above, in real-world pose recognition scenarios, factors such as photovoltaic panel rotation and cleaning robot movement cause scale variations and target rotations, which the traditional IoU-based detection objective function in YOLOv8n struggles to handle effectively, leading to inaccurate detection results.

To cope with the issue of long inference time, we introduce Mobile-ViT, a lightweight Vision Transformer designed for mobile and embedded devices [30]. Mobile-ViT retains the self-attention capability while reducing storage requirements and computation time. To overcome the problem of inaccurate detection results, we incorporate the MPDIoU loss, which not only focuses on the overlap but also considers the actual distance between the detection boxes [28]. This loss function adapts more effectively to target scale changes and captures shape differences, addressing the shortcomings of traditional IoU-based methods, which are sensitive to target scale variations and unsuitable for rotated targets.

Combining YOLOv8n, Mobile-ViT, and the MPDIoU loss, we propose a method called YOLOv8n-Photovoltaic-Pose (YOLOv8n-PP), which leverages the strengths of these components to achieve accurate and efficient pose recognition for photovoltaic cleaning applications. Additionally, recognizing the scarcity of datasets specifically for photovoltaic pose recognition, we have constructed a dataset named “P-Pose”. This dataset consists of PV pose images collected from the photovoltaic power plant of Jingke Technology in Alar City. We preprocess the collected images to align with the pose recognition task and make the dataset publicly available to facilitate future research. The dataset can be accessed at https://gitee.com/nanfnegzhiwoyi/p-pose.

We conduct extensive experiments to validate the effectiveness of our proposed method. We compare the performance of our algorithm with mainstream detection models, such as YOLOv5s, YOLOv7, YOLOv8s, and YOLOv8n. Additionally, we perform a series of ablation studies to verify the effectiveness of incorporating Mobile-ViT and the MPDIoU loss. Under a fair comparison, our YOLOv8n-PP method achieves the best results across various evaluation metrics. Notably, the precision and recall of our approach are 3.45% and 5.78% higher, respectively, than those of the baseline YOLOv8n model.

The main contributions of this work can be summarized as follows:

  • We propose the photovoltaic pose recognition algorithm YOLOv8n-PP based on YOLOv8n, incorporating a lightweight and efficient Mobile-ViT module and the MPDIoU loss.

  • Our approach provides an effective solution for efficient photovoltaic panel cleaning, thereby improving power generation efficiency and addressing challenges in the development of the photovoltaic industry.

  • We have constructed the P-Pose PV pose dataset, which is optimized for the specific characteristics of photovoltaic panel pose recognition and offers strong technical support for future research.

The rest of the paper is organized as follows. Section 2 introduces related work and the necessary fundamentals. Section 3 describes YOLOv8n-PP in detail, including the main network and the key modules. Section 4 presents a comprehensive investigation of the effectiveness of our method. Section 5 concludes this work.

2 Related works

2.1 Pose recognition

In recent years, deep learning models have been applied to pose recognition. Xiang et al. [46] proposed a pose convolutional neural network for human–robot collaborative delivery, calculating the translation matrix and rotation matrix of the target object. Gao et al. [8] proposed a dynamic gesture recognition method based on 3D hand pose estimation. Li et al. [24] proposed the AirPose multi-branch network for estimating aircraft attitude angles. Different from these works, we use the lightweight and high-precision YOLOv8n object detection algorithm for PV pose recognition on mobile devices. To the best of our knowledge, we are the first to investigate the challenging problem of pose recognition for photovoltaic panel cleaning robots.

2.2 The YOLO series of models

The YOLO series of models represents a fast and efficient object detection algorithm capable of performing object localization and recognition in a single forward pass [33]. Compared to two-stage object detection methods [34], YOLO offers faster detection speeds and superior real-time performance. The YOLO series has undergone multiple iterations and optimizations [17]. The recent version, YOLOv8, is particularly advantageous due to its enhanced accuracy and efficiency, making it suitable for various applications that require rapid and reliable object detection. YOLOv8 employs a lightweight network structure by optimizing its backbone, neck, and head modules, significantly reducing the number of model parameters and computational costs while maintaining high detection accuracy. It introduces a Feature Pyramid Network (FPN) and a squeeze-and-excitation (SE) attention mechanism to enhance feature extraction capabilities. Additionally, YOLOv8 utilizes techniques such as depth-wise separable convolution to further compress the model, making it well suited for deployment on resource-constrained devices, including smartphones and embedded systems. YOLOv8 Nano (YOLOv8n) represents the most lightweight variant among the YOLOv8 family, prioritizing efficiency and real-time processing on low-power hardware while offering competitive detection performance.
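As a small illustration of the depth-wise separable convolution mentioned above, the PyTorch sketch below factorizes a standard convolution into a per-channel (depth-wise) convolution followed by a 1×1 point-wise convolution; the channel sizes, normalization, and activation here are illustrative choices rather than the exact YOLOv8 configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise separable convolution: a per-channel (depth-wise) 3x3 conv
    followed by a 1x1 point-wise conv, which cuts parameters and FLOPs
    compared with a standard dense convolution."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Quick shape check on a 640x640 RGB input.
x = torch.randn(1, 3, 640, 640)
print(DepthwiseSeparableConv(3, 16)(x).shape)  # torch.Size([1, 16, 640, 640])
```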

To further boost the lightweight design, Xie et al. [47] introduced a lightweight multi-scale mixed convolution in the backbone network to effectively fuse features extracted at different scales. Furthermore, Wang et al. [44] replaced the backbone with GhostHGNetv2. Ma et al. [27] employed a lightweight detection head called AsDDet to further minimize parameters. Du et al. [4] combined YOLO with attention mechanisms and deployed the model on a mobile vehicle. Our method employs YOLOv8n as the main network and improves its backbone and the bounding box regression loss, achieving better detection accuracy and faster inference speed.

2.3 The mobile-ViT network

The Mobile-ViT network [30] is a lightweight visual transformer model designed specifically for mobile devices [12, 45]. Compared to convolutional neural networks, Mobile-ViT employs an attention-based visual transformation layer, which can effectively capture the global information and semantic features of the image. The Mobile-ViT network architecture consists of three parts: (1) convolutional layers for extracting low-level visual features; (2) visual transformation layers using the self-attention mechanism to capture the global context information of the image; (3) feed-forward networks to further extract and integrate the features. This hybrid architecture effectively combines the advantages of convolutional networks and vision transformers, maintaining good feature extraction capabilities while significantly reducing the computational complexity and parameter size of the model. Compared to standard Transformer models, Mobile-ViT introduces depth-wise separable convolution to dramatically reduce the model’s computational cost. It also employs a lightweight attention mechanism to further improve the inference speed, making it very suitable for deployment on mobile devices.
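A minimal PyTorch sketch of this three-part hybrid design is shown below. For brevity, it treats every spatial location as a transformer token rather than using Mobile-ViT's actual patch unfolding and folding, and all layer sizes are illustrative rather than the official Mobile-ViT configuration.

```python
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    """Simplified Mobile-ViT block: (1) convolutions extract local, low-level
    features, (2) a transformer encoder applies self-attention for global
    context, and (3) a fusion convolution merges the result with the input."""

    def __init__(self, channels: int = 64, dim: int = 96, depth: int = 2):
        super().__init__()
        self.local_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.SiLU(),
            nn.Conv2d(channels, dim, 1, bias=False),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=2 * dim,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj_back = nn.Conv2d(dim, channels, 1, bias=False)
        self.fusion = nn.Conv2d(2 * channels, channels, 3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        y = self.local_conv(x)                    # (B, dim, H, W) local features
        tokens = y.flatten(2).transpose(1, 2)     # (B, H*W, dim) token sequence
        tokens = self.transformer(tokens)         # global self-attention
        y = tokens.transpose(1, 2).reshape(b, -1, h, w)
        y = self.proj_back(y)                     # back to the input channel count
        return self.fusion(torch.cat([x, y], 1))  # fuse local and global features

# Quick shape check: the block preserves spatial resolution and channel count.
print(MobileViTBlockSketch()(torch.randn(1, 64, 32, 32)).shape)  # [1, 64, 32, 32]
```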

Fig. 1 An illustration of the YOLOv8n-PP model. YOLOv8n-PP comprises a backbone network, a feature fusion layer (Neck), and a fusion head

2.4 The MPDIoU loss

Minimum Point Distance IoU (MPDIoU) loss [28] is a loss function designed for bounding box regression, specifically addressing the issues of sample imbalance in object detection models. Traditional bounding box regression loss functions, such as L1 loss and IoU loss, exhibit limitations in scenarios with imbalanced samples [51]. L1 loss fails to effectively distinguish between bounding boxes of varying sizes, resulting in poor detection performance for small objects. Conversely, IoU loss emphasizes the degree of overlap between bounding boxes, often neglecting the precision of object localization [49, 50]. In contrast, MPDIoU introduces the concept of minimum point distance, which measures the distance between two bounding boxes beyond merely their overlap. By minimizing this distance, MPDIoU accounts for both the degree of overlap and the accuracy of localization, thereby enhancing the model’s detection performance for objects of different sizes.

3 YOLOv8n-PP

In this section, we will introduce our method YOLOv8n-PP in detail. We give an overview of the YOLOv8n-PP model in Fig. 1. As illustrated, YOLOv8n-PP is built on the YOLOv8n framework. To achieve faster inference speed, YOLOv8n-PP uses the lightweight Mobile-ViT module to replace the backbone in YOLOv8n. Furthermore, to address the issue of imbalanced sample distribution across different classes in real-world pose recognition, we adopt the MPDIoU bounding box regression loss function to solve the problem of insufficient model generalization performance.

Fig. 2 An illustration of the Mobile-ViT network structure. The lower part (a) shows the backbone module of YOLOv8 with the convolution layer replaced by the Mobile-ViT block. The upper part (b) displays the details of the Mobile-ViT block, which includes several modules using transformers as convolutions and a feature fusion module

Fig. 3 An illustration of the MPDIoU. The calculation of MPDIoU involves the overlap of the two bounding boxes (IoU) and the distance between the two boxes (i.e., \(d_1\) and \(d_2\))

Fig. 4 Comparison of detection results between YOLOv8n and YOLOv8n-PP in occlusion scenarios. The PV panel on the right side of the figure was not detected by YOLOv8n due to partial occlusion, while YOLOv8n-PP successfully detected it

Fig. 5 Comparison of YOLOv8 and YOLOv8n-PP photovoltaic panel detection in a sandy and dusty environment

3.1 The main network

YOLOv8n-PP is based on YOLOv8n, comprising a backbone network, a feature fusion layer (Neck), and a fusion head, as shown in Fig. 1. The backbone module is responsible for generating discriminative visual representations from the input. In YOLOv8n, the backbone employs deep convolutional neural networks and C2f blocks (CBS and C2f in Fig. 1, same as the following), but we replace the backbone module with the Mobile-ViT module, including Mobile-ViT and MobileNetV2 (MV2) blocks. Additionally, we retain the spatial pyramid pooling-fast block (SPPF) in the backbone [13]. The neck module plays a crucial role in aggregating and fusing the multi-scale features produced by the backbone, including feature concatenation (Concat), upsampling (Upsample), and several convolution operations. The head module is designed for final predictions. It takes the feature maps from the neck and generates detection heads for targets at different scales. Each detection head is trained with a bounding box loss (Bbox loss) and a classification loss (Cls loss). We employ the MPDIoU loss as the bounding box loss to fit the PV pose recognition task.
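The following PyTorch sketch illustrates only the backbone → neck → head data flow described above; every module is a toy stand-in (plain convolutions instead of the actual Mobile-ViT/MV2, SPPF, C2f, and decoupled-head blocks), so the channel counts and layer choices are ours, not the model's.

```python
import torch
import torch.nn as nn

class YOLOv8nPPSkeleton(nn.Module):
    """Sketch of the backbone -> neck -> head data flow described above.
    Every stage is a toy stand-in convolution: the real model uses
    Mobile-ViT/MV2 blocks plus SPPF in the backbone, C2f/Upsample/Concat
    blocks in the neck, and decoupled detection heads."""

    def __init__(self, num_outputs: int = 5):                       # 4 box terms + 1 class score
        super().__init__()
        # Backbone: feature maps at strides 8, 16, and 32.
        self.stem = nn.Conv2d(3, 32, 3, stride=2, padding=1)
        self.stage3 = nn.Conv2d(32, 64, 3, stride=4, padding=1)     # P3, stride 8
        self.stage4 = nn.Conv2d(64, 128, 3, stride=2, padding=1)    # P4, stride 16
        self.stage5 = nn.Conv2d(128, 256, 3, stride=2, padding=1)   # P5, stride 32
        # Neck: top-down fusion via upsampling and concatenation.
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse4 = nn.Conv2d(128 + 256, 128, 3, padding=1)
        self.fuse3 = nn.Conv2d(64 + 128, 64, 3, padding=1)
        # Heads: per-scale predictions; boxes are supervised with the MPDIoU loss.
        self.head3 = nn.Conv2d(64, num_outputs, 1)
        self.head4 = nn.Conv2d(128, num_outputs, 1)
        self.head5 = nn.Conv2d(256, num_outputs, 1)

    def forward(self, img: torch.Tensor):
        p3 = self.stage3(self.stem(img))
        p4 = self.stage4(p3)
        p5 = self.stage5(p4)
        f4 = self.fuse4(torch.cat([p4, self.up(p5)], 1))
        f3 = self.fuse3(torch.cat([p3, self.up(f4)], 1))
        return self.head3(f3), self.head4(f4), self.head5(p5)

# Shape check on a 640x640 input: predictions at strides 8, 16, and 32.
outs = YOLOv8nPPSkeleton()(torch.randn(1, 3, 640, 640))
print([tuple(o.shape) for o in outs])  # [(1, 5, 80, 80), (1, 5, 40, 40), (1, 5, 20, 20)]
```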

3.2 Mobile-ViT

The photovoltaic panel cleaning robot needs to recognize diverse photovoltaic poses under mobile viewpoints and restricted inference time, so we use the relatively lightweight Mobile-ViT model to replace the backbone part of YOLOv8n, as shown in Fig. 2a. At the same time, the Mobile-ViT model has a unique image segmentation mechanism, which can effectively handle the difficulty in model training caused by the changing viewpoints. The network structure of Mobile-ViT is shown in Fig. 2b, which includes several modules using transformers as convolutions and a feature fusion module. It utilizes Self-Attention to learn the correlation between different positions in the image, thus comprehensively understanding the inherent structure of the image features.

3.3 MPDIoU

The core idea of MPDIoU is to augment the standard IoU with two point-distance terms: the squared Euclidean distance between the top-left corners of the two target boxes and the squared Euclidean distance between their bottom-right corners, each normalized by the squared diagonal of the input image. Subtracting these two terms from the IoU yields a single measure that jointly reflects the degree of overlap and the localization error of the two boxes:

$$\begin{aligned} d_1^2&=(x_1^B-x_1^A)^2+(y_1^B-y_1^A)^2,\\ d_2^2&=(x_2^B-x_2^A)^2+(y_2^B-y_2^A)^2,\\ MPDIoU&=\frac{A\cap B}{A \cup B}-\frac{d_1^2}{w^2+h^2}-\frac{d_2^2}{w^2+h^2}, \end{aligned}$$
(1)

where A and B are two arbitrary convex shapes, and w and h represent the width and height of the input image, respectively. \((x_1^A, y_1^A)\) and \((x_2^A, y_2^A)\) represent the coordinates of the top-left and bottom-right points of A, while \((x_1^B, y_1^B)\) and \((x_2^B, y_2^B)\) represent the coordinates of the top-left and bottom-right points of B. \(d_1\) and \(d_2\) denote the Euclidean distances between the top-left corners and between the bottom-right corners of A and B, respectively. An illustration of the MPDIoU is shown in Fig. 3.

For the convenience of optimization, the MPDIoU-based loss is defined as

$$\begin{aligned} L_{MPDIoU} = 1 - MPDIoU. \end{aligned}$$
(2)

The MPDIoU-based bounding box regression loss simplifies the similarity comparison between two target boxes, making it applicable to both overlapping and non-overlapping bounding box regression and improving the convergence speed and regression accuracy. Because the corner-point distances jointly encode the center offset and the width–height difference between the photovoltaic pose target boxes, this formulation provides a more comprehensive and accurate evaluation metric for photovoltaic pose object detection.
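A compact PyTorch implementation of Eqs. (1)–(2) could look as follows; the (x1, y1, x2, y2) box layout, the mean reduction, and the epsilon term are our assumptions for illustration rather than details taken from the official implementation.

```python
import torch

def mpdiou_loss(pred: torch.Tensor, target: torch.Tensor,
                img_w: float, img_h: float, eps: float = 1e-7) -> torch.Tensor:
    """MPDIoU bounding box regression loss following Eqs. (1)-(2).
    Boxes are (x1, y1, x2, y2) in pixels with shape (N, 4); img_w and img_h
    are the width and height of the input image used for normalization."""
    # Intersection and union areas for the IoU term.
    lt = torch.max(pred[:, :2], target[:, :2])       # top-left of the overlap
    rb = torch.min(pred[:, 2:], target[:, 2:])       # bottom-right of the overlap
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Squared corner distances d1^2 and d2^2, normalized by the image diagonal.
    d1_sq = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2_sq = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2
    diag_sq = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1_sq / diag_sq - d2_sq / diag_sq
    return (1.0 - mpdiou).mean()                     # L_MPDIoU = 1 - MPDIoU

# Example: one predicted box vs. one ground-truth box on a 950x428 image.
pred = torch.tensor([[100.0, 80.0, 300.0, 240.0]])
gt = torch.tensor([[110.0, 90.0, 310.0, 250.0]])
print(mpdiou_loss(pred, gt, img_w=950, img_h=428))
```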

In this study, we adopt YOLOv8n as the baseline model and improve it into YOLOv8n-PP. YOLOv8n-PP is optimized on the basis of YOLOv8n by adopting Mobile-ViT as the backbone network and introducing the MPDIoU loss function to improve the performance and robustness of the model. Through testing on the extended dataset, we find that YOLOv8n-PP exhibits higher mAP and accuracy than YOLOv8n in various environments (e.g., the occlusion case shown in Fig. 4).

In addition, the expansion of the dataset allows YOLOv8n-PP to better adapt to new environments. We demonstrate the superiority of YOLOv8n-PP in a sandy and dusty environment (see Fig. 5); in this scenario, YOLOv8 misses some PV panels, while YOLOv8n-PP successfully detects all PV panels, which demonstrates its good generalization ability across different environments. To further improve the robustness of the model in variable environments, we added data covering different seasons (e.g., snow and dust) and lighting conditions (e.g., strong reflections) to the training set and retrained the model. After retraining, the performance of YOLOv8n-PP in these added environments improves significantly, with the mAP increasing from 0.884 to 0.911. These results show that data extension plays a key role in improving model performance.

Fig. 6 Preprocessing of the dataset. We apply several image data augmentation techniques to expand the dataset and bring it closer to real-world scenarios

4 Experiments

In this section, we provide a detailed explanation of the datasets and experimental settings. First, we outline the process of constructing the P-Pose dataset, describe the experimental environment, and present the relevant metrics used in our investigation. Subsequently, we present the key results that demonstrate the effectiveness of our method and offer further analysis.

4.1 P-Pose dataset construction and preprocessing

The P-Pose dataset contains photovoltaic pose images collected manually from the PV power plant of Jingke Technology in Alar City.

During the collection process, the collectors considered various factors, such as high and low poses, backlighting and counter-lighting, left and right sides, etc. A total of 116 images were collected. To increase the diversity and applicability of the dataset, these images were preprocessed. First, the images were losslessly compressed and normalized to a format of 950\(\times\)428 pixels with three RGB channels. Then, image data augmentation techniques were used to expand the dataset, including adding noise, blurring, rotation, horizontal flipping, and brightness adjustment (as shown in Fig. 6), ultimately expanding to 2076 images. To establish the training and validation sets, we randomly split all samples in an 8:2 ratio, resulting in 1661 training samples and 415 validation samples.
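As a sketch of how such an augmentation pipeline can be assembled, the snippet below uses the albumentations library (the paper does not specify which tool was used); the probabilities and magnitudes are illustrative rather than the exact settings used to build P-Pose, and bounding boxes in YOLO format are transformed together with the images so the pose labels stay consistent.

```python
import albumentations as A

# Possible pipeline mirroring the augmentations listed above; hypothetical settings.
augment = A.Compose(
    [
        A.GaussNoise(p=0.3),                   # add sensor-like noise
        A.Blur(blur_limit=3, p=0.3),           # mild blurring
        A.Rotate(limit=15, p=0.5),             # small rotations
        A.HorizontalFlip(p=0.5),               # left/right flip
        A.RandomBrightnessContrast(p=0.5),     # brightness adjustment
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Example call: image is an HxWx3 uint8 array, bboxes are normalized
# [cx, cy, w, h] boxes, and labels is the matching list of class ids.
# out = augment(image=image, bboxes=bboxes, class_labels=labels)
# aug_image, aug_boxes = out["image"], out["bboxes"]
```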

Table 1 Configuration of the experimental environment

4.2 Experiment setup

The experimental platform is shown in Table 1. We used a 64-bit Windows 10 operating system for training and validation. The hardware configuration includes an Intel Xeon(R) Silver 4210R CPU @ 2.40 GHz, an NVIDIA GeForce RTX 3060Ti GPU, and 64 GB of RAM. Python 3.8 was used as the programming language, with CUDA 11.8 to accelerate training and PyTorch 2.0.0 as the deep learning framework.

The model training is set to 300 epochs, with a batch size of 16, and an initial learning rate of 0.01. The cosine annealing learning rate adjustment algorithm is adopted, as defined by the following formula:

$$\begin{aligned} l_r = l_{min} + \frac{1}{2}(l_{max} - l_{min})\left( 1 + \cos \left( \frac{T_{cur}}{T_{max}}\pi \right) \right), \end{aligned}$$
(3)

where \(l_r\) is the current learning rate, \(l_{min}\) is the minimum learning rate, \(l_{max}\) is the maximum learning rate, \(T_{cur}\) is the current training epoch, and \(T_{max}\) is the total number of training epochs.
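A minimal sketch of this schedule with PyTorch's built-in cosine annealing scheduler is shown below; the placeholder model, the SGD optimizer, and the minimum learning rate eta_min are assumed values, since the text above only specifies the initial learning rate and the number of epochs.

```python
import torch

# Placeholder model and optimizer; lr=0.01 matches the initial learning
# rate above, while eta_min (l_min) is an assumed value.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300,
                                                       eta_min=1e-5)

for epoch in range(300):
    # ... one training epoch over the P-Pose training split ...
    optimizer.step()    # stands in for the actual forward/backward pass
    scheduler.step()    # decays the learning rate from l_max to l_min per Eq. (3)
```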

4.3 Evaluation metrics

To verify the effectiveness and detection performance of the model, the following metrics are selected: precision & recall, F1-score, mean average precision (mAP), Giga floating point operations (GFLOPs), and frames per second (FPS).

Precision is the ratio of correctly predicted positive samples among all predicted positive samples, while recall is the proportion of correctly predicted positive samples among all actual positive samples. The definitions of precision and recall are as follows:

$$\begin{aligned} Precision=\frac{TP}{TP+FP},\quad Recall=\frac{TP}{TP+FN}, \end{aligned}$$
(4)

where TP denotes the true-positive samples, FP denotes the false-positive samples, and FN denotes the false-negative samples.

F1-score is the harmonic mean of precision and recall, so the F1 curve is often used to compare the performance of different models. The F1-score ranges from 0 to 1, and a higher score indicates better performance. The F1-score is defined as

$$\begin{aligned} F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}. \end{aligned}$$
(5)
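A small helper implementing Eqs. (4)–(5) directly from the detection counts is sketched below; the counts in the example call are hypothetical and only illustrate the computation.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 from detection counts, per Eqs. (4)-(5).
    A prediction is typically counted as TP when its IoU with a ground-truth
    box exceeds the chosen threshold."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts: 178 correct detections, 12 false alarms, 20 missed panels.
print(precision_recall_f1(178, 12, 20))  # approx. (0.937, 0.899, 0.918)
```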

mAP evaluates the model’s performance in object detection tasks by plotting the precision–recall curve and calculating the area under the curve. The "m" represents the mean, and the number after the "@" symbol represents the IoU threshold used to determine positive and negative samples. It is formulated as

$$\begin{aligned} mAP = \frac{1}{N}\sum _{i=1}^{N}\int _{0}^{1}P_i(R_i) \textrm{d}(R_i), \end{aligned}$$
(6)

where \(P_i\) and \(R_i\) denote the precision and recall at the \(i\)-th threshold. Following Li et al. [18], we denote AP as mAP(@0.5:0.95) and \(\hbox {AP}_{50}\) as mAP(@0.5).
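The sketch below computes the area under a precision–recall curve using the all-point interpolation common in detection toolkits; whether the evaluation in this paper uses exactly this interpolation is an assumption, and the PR values in the example are hypothetical.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under a precision-recall curve (Eq. (6)) for one class and one IoU
    threshold, using the monotone envelope and all-point integration. mAP
    averages this value over classes and, for AP@0.5:0.95, over IoU thresholds."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # enforce a decreasing envelope
    idx = np.where(r[1:] != r[:-1])[0]              # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Toy three-point PR curve (hypothetical values).
print(average_precision(np.array([0.2, 0.6, 0.9]), np.array([1.0, 0.8, 0.7])))  # 0.73
```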

GFLOPs measure the complexity of the model, with higher values indicating more computational resources required for inference, whereas lower values suggest a lower computational cost.

FPS represents the number of frames processed per second, which is affected by the algorithm and the hardware configuration of the experimental device (Tables 2, 3).

Table 2 Results of YOLOv8n-PP and other algorithms. The metrics are defined in Sect. 4.3
Table 3 Comparison of model performance metrics
Fig. 7 Comparison of detection results of different models in cases with complex backgrounds

Fig. 8 Comparison of YOLOv8 and YOLOv8n-PP PV panel detection in snowy scenarios

Fig. 9 Comparison of YOLOv8 and YOLOv8n-PP photovoltaic panel detection in strongly reflective scenes

4.4 Comparison results

We compared our method against four mainstream object detection models, YOLOv5s [33], YOLOv7 [43], YOLOv8s, and YOLOv8n, using seven evaluation metrics: Precision, Recall, \(\hbox {AP}_{50}\), F1, GFLOPs, AP, and FPS. With the dataset, model hyper-parameters, and training parameters kept consistent, the YOLOv8n-PP model showed the best performance on all metrics. Compared to the runner-up model, YOLOv8n, YOLOv8n-PP achieved a slight improvement in the GFLOPs and inference-time indicators. Additionally, we observed that the YOLOv5s algorithm performed the worst on most evaluation indicators, which may be due to its relatively shallow feature extraction network.

To further validate the recognition performance of different algorithms on photovoltaic poses, we visualized the detection results using the best trained checkpoint of each model. The results, shown in Fig. 7, include a photovoltaic image with overlapping backgrounds and two different poses for comparison. The findings indicate that the recognition accuracy of the YOLOv8n algorithm exceeded that of the first three algorithms. However, when processing poses of photovoltaic panels in overlapping backgrounds, YOLOv8n may produce inaccurate detection boxes for severely occluded targets or even identify the overlapping rear panels. This phenomenon could interfere with the smart cleaning equipment, hindering its ability to perform cleaning operations effectively. The YOLOv8n-PP model enhances accuracy by incorporating the Mobile-ViT self-attention mechanism and replacing the bounding box regression loss function. Its self-attention mechanism, combined with photovoltaic panel pose detection based on the minimum point distance of detected targets, allows it to accurately identify the photovoltaic panels requiring cleaning in overlapping backgrounds. This capability facilitates precise cleaning instructions, ensuring the effective execution of cleaning operations.

In tests under different environmental conditions (e.g., sunny, occluded, snowy, strongly reflective, sandy, and dusty), YOLOv8n-PP demonstrates higher robustness. In the snowy scenario, YOLOv8 detected only four PV panels with low confidence, whereas YOLOv8n-PP accurately detected five PV panels with generally high confidence (see Fig. 8). Similarly, under strongly reflective conditions, YOLOv8 missed PV panels in the reflective area, while YOLOv8n-PP still stably identified the PV panel targets in that area, demonstrating good adaptability to lighting (see Fig. 9). These comparison results further validate the recognition ability and robustness of YOLOv8n-PP in complex environments.

Table 4 Results of ablation experiments. The same conventions are used as in Table 2
Fig. 10 Comparison of YOLOv8n and YOLOv8n-PP across various metrics

4.5 Ablation study

To verify the effectiveness of the algorithm improvement modules, we established multiple variant models based on the original YOLOv8n model for ablation experiments, including YOLOv8n-M, YOLOv8n-L, and YOLOv8n-PP. YOLOv8n-M integrates the self-attention mechanism of the Mobile-ViT model, while YOLOv8n-L replaces the original detection loss of YOLOv8n with the MPDIoU loss. Finally, YOLOv8n-PP is the complete model that incorporates both the Mobile-ViT module and the MPDIoU loss proposed in this work.

According to the ablation experiment results in Table 4, the original YOLOv8n algorithm exhibits a relatively low F1-score and average precision (AP), along with a higher computational cost (GFLOPs). Therefore, we introduce several enhancements to the original YOLOv8n algorithm to improve its performance. After incorporating Mobile-ViT and replacing the original bounding box regression loss function with MPDIoU, most metrics demonstrate improved performance. The results for YOLOv8n-M and YOLOv8n-L indicate that each module provides individual benefits, and their combination yields an additional performance enhancement, effectively demonstrating the validity of our approach.

The model performances are shown in Table 3. On the dataset covering five typical environments, YOLOv8 achieves a precision of 0.888, a recall of 0.884, and an mAP of 0.884, while YOLOv8n-PP achieves a precision of 0.892, a recall of 0.857, and an mAP of 0.91.

4.6 Further analysis

To evaluate the performance of the improved algorithm more intuitively, the Precision, Recall, and \(\hbox {AP}_{50}\) metrics of the YOLOv8n and YOLOv8n-PP models during training are visualized and analyzed. Figure 10 (left) shows the comparison of the Recall values during the training of YOLOv8n and YOLOv8n-PP. According to the experimental data, the two curves reached a relatively stable state almost simultaneously, but the recall of YOLOv8n-PP was consistently higher than that of YOLOv8n and its recall curve was more stable. Figure 10 (middle) shows the comparison of the Precision values during training. The YOLOv8n algorithm had a relatively low initial Precision value in the first 100 epochs, and the Precision value fluctuated significantly as training progressed. After the 100th epoch, the Precision value stabilized above 90%, but the curve still showed obvious fluctuations until the end of training, indicating that the Precision value was not sufficiently stable. In contrast, the YOLOv8n-PP algorithm had an initial Precision value about 10% higher than YOLOv8n in the first 100 epochs, and its Precision value grew more steadily as training progressed. Before the 100th epoch, the Precision value had already entered a stable state and remained stable until the end of training, with the Precision curve consistently higher than that of YOLOv8n. Figure 10 (right) shows the comparison of the \(\hbox {AP}_{50}\) values. The \(\hbox {AP}_{50}\) value of YOLOv8n fluctuated greatly and had not reached a stable state by the end of training, whereas the \(\hbox {AP}_{50}\) value of YOLOv8n-PP was higher and followed a more stable curve, leveling off in the later stage of training. We also conducted a detailed analysis of YOLOv8 and YOLOv8n-PP in terms of false detections and missed detections. YOLOv8 has a high missed-detection rate in sandy, dusty, and snowy environments, especially under large illumination changes, while YOLOv8n-PP stabilizes target detection and significantly reduces the false-detection rate by introducing Mobile-ViT and MPDIoU.

5 Conclusions

In this work, we conduct an in-depth analysis and optimization of the photovoltaic panel pose detection model and construct a diverse and comprehensive dataset of photovoltaic panel poses to ensure that our method demonstrates strong generalization performance across various environments. We propose a photovoltaic panel pose recognition algorithm, YOLOv8n-Photovoltaic-Pose (YOLOv8n-PP), based on an improved YOLOv8n architecture, which incorporates a more lightweight Mobile-ViT network design and the MPDIoU-optimized bounding box regression loss function. The results indicate that YOLOv8n-PP not only improves detection accuracy but also enhances stability in challenging scenarios, thereby contributing to more effective and reliable photovoltaic panel management. Although YOLOv8n-PP has demonstrated excellent performance in a variety of environments, there is still room for improvement under extreme occlusion and in highly reflective environments. Future research will include deploying YOLOv8n-PP on a PV cleaning robot for field testing and incorporating additional types of sensors (e.g., infrared imaging) to further improve the model's detection performance.