1 Introduction

As a rapidly growing form of renewable energy, photovoltaic (PV) energy has been widely adopted worldwide [6, 39]. In sun-rich regions, such as southern Xinjiang in China, the United Arab Emirates, and the deserts of Iran, the number of photovoltaic power plants continues to increase [3, 11, 35]. Owing to geographical factors, these regions often suffer from severe dust pollution and low precipitation, so sand and dust accumulate on the surfaces of the photovoltaic panels and degrade their power generation efficiency. Unclean PV panels can reduce power generation efficiency by approximately 15% [7, 36, 41]. Therefore, cleaning PV panels is essential in the field of photovoltaic energy.

Traditional methods for cleaning photovoltaic panels include manual and vehicle-mounted techniques [2, 10]. However, due to the terrain and dust pollution in PV plant areas, both manual and vehicle-mounted cleaning methods often struggle to accurately locate the PV panels, resulting in wasted water resources and added burdens [37, 38]. Thus, there is a significant need for automatic and precise cleaning solutions, which are expected to enhance the adaptability and flexibility of cleaning robots while minimizing the need for manual intervention [5, 16, 31, 32].

Precise pose recognition is essential for the effective cleaning of photovoltaic (PV) panels. However, this task faces several challenges. The diverse installation angles and orientations of PV panels, interference from ambient light, the degradation of image clarity caused by dust and dirt, and partial occlusion by other panels or environmental factors can all significantly reduce the accuracy and real-time performance of pose detection [15, 25, 48]. Moreover, the reflective properties of PV panel surfaces can make the data collected by sensors unstable, further complicating pose recognition. In real-world scenarios, PV panels are often partially occluded by other objects (e.g., neighboring panels) or by environmental factors such as dust and debris, which makes detection and pose estimation more challenging. Additionally, the scale of the PV panels can vary significantly with their distance from the camera, which introduces further complexity into pose estimation. Recent studies have shown that visual deep learning-based object detection algorithms can enable robots to adaptively recognize the poses of the PV panels requiring cleaning [8, 40].

With the rise of deep learning, neural network-based methods [9, 19,20,21,22,23, 26, 33, 34] have gradually come to dominate the field of object detection. Recently, object detection algorithms have shifted from two-stage methods (such as the R-CNN series [9, 18, 34]) to more efficient single-stage detectors (such as the YOLO and SSD series [26, 33]). These single-stage algorithms perform bounding box regression and object classification within a single network, significantly improving computational efficiency and enabling real-time detection. As a result, they have become vital choices for real-time visual applications. The YOLO series of algorithms [33] has gained widespread adoption in object detection due to its efficiency and accuracy. YOLO divides the entire image into a grid and predicts bounding boxes and class probabilities for each grid cell, enabling fast localization and recognition of objects. YOLO has undergone continuous updates and iterations. YOLOv8 [42], as a recent version in the series, retains the efficient design of its predecessors while introducing several optimizations in network architecture and algorithms, further enhancing detection accuracy and speed, especially in complex scenarios. YOLOv8 has been widely applied to various applications, including autonomous driving [1, 29], industrial inspection [14, 27, 44, 47], and more. YOLOv8 Nano (YOLOv8n) is a lightweight variant of YOLOv8, designed to offer a good balance between computational efficiency and detection accuracy, especially in resource-constrained environments.

Considering the advantages of YOLOv8n in terms of high speed and good performance, we adopt YOLOv8n as the primary method for PV cleaning pose recognition. However, we find that directly using YOLOv8n for pose recognition does not yield satisfactory results: the model's inference time is still long and its detection accuracy is not high enough. Preliminary analysis shows that this sub-optimal performance stems from a mismatch between the network structure, the detection objective function, and the PV pose recognition task. The YOLOv8n backbone contains several convolutional layers that require substantial computation, leading to prolonged inference. Moreover, as discussed above, in real-world pose recognition scenarios, factors such as photovoltaic panel rotation and cleaning robot movement cause scale variations and target rotations, which the traditional IoU-based detection objective function in YOLOv8n struggles to handle effectively, leading to inaccurate detection results.

To cope with the issue of long inference time, we introduce Mobile-ViT, a lightweight Vision Transformer designed for mobile and embedded devices [30]. Mobile-ViT retains the self-attention capability while reducing storage requirements and computation time. To overcome the problem of inaccurate detection results, we incorporate the MPDIoU loss, which not only focuses on the overlap but also considers the actual distance between the detection boxes [28]. This loss function adapts more effectively to target scale changes and captures shape differences, addressing the shortcomings of traditional IoU-based methods, which are sensitive to target scale variations and unsuitable for rotated targets.

Combining YOLOv8n, Mobile-ViT, and the MPDIoU loss, we propose a method called YOLOv8n-Photovoltaic-Pose (YOLOv8n-PP), which leverages the strengths of these components to achieve accurate and efficient pose recognition for photovoltaic cleaning applications. Additionally, recognizing the scarcity of datasets specifically for photovoltaic pose recognition, we have constructed a dataset named “P-Pose”. This dataset consists of PV pose images collected from the photovoltaic power plant of Jingke Technology in Alar City. We preprocess the collected images to align with the pose recognition task and make the dataset publicly available to facilitate future research. The dataset can be accessed at https://gitee.com/nanfnegzhiwoyi/p-pose.

We conduct extensive experiments to validate the effectiveness of our proposed method. We compare the performance of our algorithm with mainstream detection models, such as YOLOv5s, YOLOv7, YOLOv8s, and YOLOv8n. Additionally, we perform a series of ablation studies to verify the effectiveness of incorporating Mobile-ViT and the MPDIoU loss. Under a fair comparison, our YOLOv8n-PP method achieves the best results across various evaluation metrics. Notably, the precision and recall of our approach are 3.45% and 5.78% higher, respectively, than those of the baseline YOLOv8n model.

The main contributions of this work can be summarized as follows:

  • We propose the photovoltaic pose recognition algorithm YOLOv8n-PP based on YOLOv8n, incorporating a lightweight and efficient Mobile-ViT module and the MPDIoU loss.

  • Our approach provides an effective solution for efficient photovoltaic panel cleaning, thereby improving power generation efficiency and addressing challenges in the development of the photovoltaic industry.

  • We have constructed the P-Pose PV pose dataset, which is optimized for the specific characteristics of photovoltaic panel pose recognition and offers strong technical support for future research.

The rest of the paper is organized as follows. Section 2 introduces related work and the necessary fundamentals. Section 3 describes YOLOv8n-PP in detail, including the main network and the key modules. Section 4 presents a comprehensive investigation of the effectiveness of our method. Section 5 concludes this work.

2 Related works

2.1 Pose recognition

In recent years, deep learning models have been applied to pose recognition. Xiang et al. [46] proposed a pose convolutional neural network for human–robot collaborative delivery, calculating the translation matrix and rotation matrix of the target object. Gao et al. [8] proposed a dynamic gesture recognition method based on 3D hand pose estimation. Li et al. [24] proposed the AirPose multi-branch network for estimating aircraft attitude angles. Different from these works, we use the lightweight and high-precision YOLOv8n object detection algorithm for PV pose recognition on mobile devices. To the best of our knowledge, we are the first to investigate the challenging problem of pose recognition for photovoltaic panel cleaning robots.

2.2 The YOLO series of models

The YOLO series of models represents a fast and efficient object detection algorithm capable of performing object localization and recognition in a single forward pass [33]. Compared to two-stage object detection methods [34], YOLO offers faster detection speeds and superior real-time performance. The YOLO series has undergone multiple iterations and optimizations [17]. The recent version, YOLOv8, is particularly advantageous due to its enhanced accuracy and efficiency, making it suitable for various applications that require rapid and reliable object detection. YOLOv8 employs a lightweight network structure by optimizing its backbone, neck, and head modules, significantly reducing the number of model parameters and computational costs while maintaining high detection accuracy. It introduces a Feature Pyramid Network (FPN) and a squeeze-and-excitation (SE) attention mechanism to enhance feature extraction capabilities. Additionally, YOLOv8 utilizes techniques such as depth-wise separable convolution to further compress the model, making it well suited for deployment on resource-constrained devices, including smartphones and embedded systems. YOLOv8 Nano (YOLOv8n) represents the most lightweight variant among the YOLOv8 family, prioritizing efficiency and real-time processing on low-power hardware while offering competitive detection performance.
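As a small illustration of the depth-wise separable convolution mentioned above, the PyTorch sketch below factorizes a standard convolution into a per-channel (depth-wise) convolution followed by a 1×1 point-wise convolution; the channel sizes, normalization, and activation here are illustrative choices rather than the exact YOLOv8 configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise separable convolution: a per-channel (depth-wise) 3x3 conv
    followed by a 1x1 point-wise conv, which cuts parameters and FLOPs
    compared with a standard dense convolution."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Quick shape check on a 640x640 RGB input.
x = torch.randn(1, 3, 640, 640)
print(DepthwiseSeparableConv(3, 16)(x).shape)  # torch.Size([1, 16, 640, 640])
```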

To further boost the lightweight design, Xie et al. [47] introduced a lightweight multi-scale mixed convolution in the backbone network to effectively fuse features extracted at different scales. Furthermore, Wang et al. [44] replaced the backbone with GhostHGNetv2. Ma et al. [27] employed a lightweight detection head called AsDDet to further minimize parameters. Du et al. [4] combined YOLO with attention mechanisms and deployed the model on a mobile vehicle. Our method employs YOLOv8n as the main network and improves its backbone and the bounding box regression loss, achieving better detection accuracy and faster inference speed.

2.3 The mobile-ViT network

The Mobile-ViT network [30] is a lightweight visual transformer model designed specifically for mobile devices [12, 45]. Compared to convolutional neural networks, Mobile-ViT employs an attention-based visual transformation layer, which can effectively capture the global information and semantic features of the image. The Mobile-ViT network architecture consists of three parts: (1) convolutional layers for extracting low-level visual features; (2) visual transformation layers using the self-attention mechanism to capture the global context information of the image; (3) feed-forward networks to further extract and integrate the features. This hybrid architecture effectively combines the advantages of convolutional networks and vision transformers, maintaining good feature extraction capabilities while significantly reducing the computational complexity and parameter size of the model. Compared to standard Transformer models, Mobile-ViT introduces depth-wise separable convolution to dramatically reduce the model’s computational cost. It also employs a lightweight attention mechanism to further improve the inference speed, making it very suitable for deployment on mobile devices.
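A minimal PyTorch sketch of this three-part hybrid design is shown below. For brevity, it treats every spatial location as a transformer token rather than using Mobile-ViT's actual patch unfolding and folding, and all layer sizes are illustrative rather than the official Mobile-ViT configuration.

```python
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    """Simplified Mobile-ViT block: (1) convolutions extract local, low-level
    features, (2) a transformer encoder applies self-attention for global
    context, and (3) a fusion convolution merges the result with the input."""

    def __init__(self, channels: int = 64, dim: int = 96, depth: int = 2):
        super().__init__()
        self.local_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.SiLU(),
            nn.Conv2d(channels, dim, 1, bias=False),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=2 * dim,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj_back = nn.Conv2d(dim, channels, 1, bias=False)
        self.fusion = nn.Conv2d(2 * channels, channels, 3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        y = self.local_conv(x)                    # (B, dim, H, W) local features
        tokens = y.flatten(2).transpose(1, 2)     # (B, H*W, dim) token sequence
        tokens = self.transformer(tokens)         # global self-attention
        y = tokens.transpose(1, 2).reshape(b, -1, h, w)
        y = self.proj_back(y)                     # back to the input channel count
        return self.fusion(torch.cat([x, y], 1))  # fuse local and global features

# Quick shape check: the block preserves spatial resolution and channel count.
print(MobileViTBlockSketch()(torch.randn(1, 64, 32, 32)).shape)  # [1, 64, 32, 32]
```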

Fig. 1 An illustration of the YOLOv8n-PP model. YOLOv8n-PP comprises a backbone network, a feature fusion layer (Neck), and a fusion head

2.4 The MPDIoU loss

Minimum Point Distance IoU (MPDIoU) loss [28] is a loss function designed for bounding box regression, specifically addressing the issues of sample imbalance in object detection models. Traditional bounding box regression loss functions, such as L1 loss and IoU loss, exhibit limitations in scenarios with imbalanced samples [51]. L1 loss fails to effectively distinguish between bounding boxes of varying sizes, resulting in poor detection performance for small objects. Conversely, IoU loss emphasizes the degree of overlap between bounding boxes, often neglecting the precision of object localization [49, 50]. In contrast, MPDIoU introduces the concept of minimum point distance, which measures the distance between two bounding boxes beyond merely their overlap. By minimizing this distance, MPDIoU accounts for both the degree of overlap and the accuracy of localization, thereby enhancing the model’s detection performance for objects of different sizes.

3 YOLOv8n-PP

In this section, we will introduce our method YOLOv8n-PP in detail. We give an overview of the YOLOv8n-PP model in Fig. 1. As illustrated, YOLOv8n-PP is built on the YOLOv8n framework. To achieve faster inference speed, YOLOv8n-PP uses the lightweight Mobile-ViT module to replace the backbone in YOLOv8n. Furthermore, to address the issue of imbalanced sample distribution across different classes in real-world pose recognition, we adopt the MPDIoU bounding box regression loss function to solve the problem of insufficient model generalization performance.

Fig. 2 An illustration of the Mobile-ViT network structure. The lower part (a) shows the backbone module of YOLOv8 with the convolution layer replaced by the Mobile-ViT block. The upper part (b) displays the details of the Mobile-ViT block, which includes several modules using transformers as convolutions and a feature fusion module

Fig. 3 An illustration of the MPDIoU. The calculation of MPDIoU involves the overlap of the two bounding boxes (IoU) and the distance between the two boxes (i.e., \(d_1\) and \(d_2\))

Fig. 4 Comparison of detection results between YOLOv8n and YOLOv8n-PP in occlusion scenarios. The PV panel on the right side of the figure was not detected by YOLOv8n due to partial occlusion, while YOLOv8n-PP successfully detected it

Fig. 5 Comparison of YOLOv8 and YOLOv8n-PP photovoltaic panel detection in a sandy and dusty environment

3.1 The main network

YOLOv8n-PP is based on YOLOv8n, comprising a backbone network, a feature fusion layer (Neck), and a fusion head, as shown in Fig. 1. The backbone module is responsible for generating discriminative visual representations from the input. In YOLOv8n, the backbone employs deep convolutional neural networks and C2f blocks (CBS and C2f in Fig. 1, same as the following), but we replace the backbone module with the Mobile-ViT module, including Mobile-ViT and MobileNetV2 (MV2) blocks. Additionally, we retain the spatial pyramid pooling-fast block (SPPF) in the backbone [13]. The neck module plays a crucial role in aggregating and fusing the multi-scale features produced by the backbone, including feature concatenation (Concat), upsampling (Upsample), and several convolution operations. The head module is designed for final predictions. It takes the feature maps from the neck and generates detection heads for targets at different scales. Each detection head is trained with a bounding box loss (Bbox loss) and a classification loss (Cls loss). We employ the MPDIoU loss as the bounding box loss to fit the PV pose recognition task.
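The following PyTorch sketch illustrates only the backbone → neck → head data flow described above; every module is a toy stand-in (plain convolutions instead of the actual Mobile-ViT/MV2, SPPF, C2f, and decoupled-head blocks), so the channel counts and layer choices are ours, not the model's.

```python
import torch
import torch.nn as nn

class YOLOv8nPPSkeleton(nn.Module):
    """Sketch of the backbone -> neck -> head data flow described above.
    Every stage is a toy stand-in convolution: the real model uses
    Mobile-ViT/MV2 blocks plus SPPF in the backbone, C2f/Upsample/Concat
    blocks in the neck, and decoupled detection heads."""

    def __init__(self, num_outputs: int = 5):                       # 4 box terms + 1 class score
        super().__init__()
        # Backbone: feature maps at strides 8, 16, and 32.
        self.stem = nn.Conv2d(3, 32, 3, stride=2, padding=1)
        self.stage3 = nn.Conv2d(32, 64, 3, stride=4, padding=1)     # P3, stride 8
        self.stage4 = nn.Conv2d(64, 128, 3, stride=2, padding=1)    # P4, stride 16
        self.stage5 = nn.Conv2d(128, 256, 3, stride=2, padding=1)   # P5, stride 32
        # Neck: top-down fusion via upsampling and concatenation.
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse4 = nn.Conv2d(128 + 256, 128, 3, padding=1)
        self.fuse3 = nn.Conv2d(64 + 128, 64, 3, padding=1)
        # Heads: per-scale predictions; boxes are supervised with the MPDIoU loss.
        self.head3 = nn.Conv2d(64, num_outputs, 1)
        self.head4 = nn.Conv2d(128, num_outputs, 1)
        self.head5 = nn.Conv2d(256, num_outputs, 1)

    def forward(self, img: torch.Tensor):
        p3 = self.stage3(self.stem(img))
        p4 = self.stage4(p3)
        p5 = self.stage5(p4)
        f4 = self.fuse4(torch.cat([p4, self.up(p5)], 1))
        f3 = self.fuse3(torch.cat([p3, self.up(f4)], 1))
        return self.head3(f3), self.head4(f4), self.head5(p5)

# Shape check on a 640x640 input: predictions at strides 8, 16, and 32.
outs = YOLOv8nPPSkeleton()(torch.randn(1, 3, 640, 640))
print([tuple(o.shape) for o in outs])  # [(1, 5, 80, 80), (1, 5, 40, 40), (1, 5, 20, 20)]
```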

3.2 Mobile-ViT

The photovoltaic panel cleaning robot needs to recognize diverse photovoltaic poses under mobile viewpoints and restricted inference time, so we use the relatively lightweight Mobile-ViT model to replace the backbone part of YOLOv8n, as shown in Fig. 2a. At the same time, the Mobile-ViT model has a unique image segmentation mechanism, which can effectively handle the difficulty in model training caused by the changing viewpoints. The network structure of Mobile-ViT is shown in Fig. 2b, which includes several modules using transformers as convolutions and a feature fusion module. It utilizes Self-Attention to learn the correlation between different positions in the image, thus comprehensively understanding the inherent structure of the image features.

3.3 MPDIoU

The core idea of MPDIoU is to augment the standard IoU with two point-distance terms: the squared Euclidean distance between the top-left corners of the two target boxes and the squared Euclidean distance between their bottom-right corners, each normalized by the squared diagonal of the input image. Subtracting these two terms from the IoU yields a single measure that jointly reflects the degree of overlap and the localization error of the two boxes:

$$\begin{aligned} d_1^2&=(x_1^B-x_1^A)^2+(y_1^B-y_1^A)^2,\\ d_2^2&=(x_2^B-x_2^A)^2+(y_2^B-y_2^A)^2,\\ MPDIoU&=\frac{A\cap B}{A \cup B}-\frac{d_1^2}{w^2+h^2}-\frac{d_2^2}{w^2+h^2}, \end{aligned}$$
(1)

where A and B are two arbitrary convex shapes, and w and h represent the width and height of the input image, respectively. \((x_1^A, y_1^A)\) and \((x_2^A, y_2^A)\) represent the coordinates of the top-left and bottom-right points of A, while \((x_1^B, y_1^B)\) and \((x_2^B, y_2^B)\) represent the coordinates of the top-left and bottom-right points of B. \(d_1\) and \(d_2\) denote the Euclidean distances between the top-left corners and between the bottom-right corners of A and B, respectively. An illustration of the MPDIoU is shown in Fig. 3.

For the convenience of optimization, the MPDIoU-based loss is defined as

$$\begin{aligned} L_{MPDIoU} = 1 - MPDIoU. \end{aligned}$$
(2)

The MPDIoU-based bounding box regression loss simplifies the similarity comparison between two target boxes, making it applicable to both overlapping and non-overlapping bounding box regression and improving the convergence speed and regression accuracy. Because the corner-point distances jointly encode the center offset and the width–height difference between the photovoltaic pose target boxes, this formulation provides a more comprehensive and accurate evaluation metric for photovoltaic pose object detection.
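A compact PyTorch implementation of Eqs. (1)–(2) could look as follows; the (x1, y1, x2, y2) box layout, the mean reduction, and the epsilon term are our assumptions for illustration rather than details taken from the official implementation.

```python
import torch

def mpdiou_loss(pred: torch.Tensor, target: torch.Tensor,
                img_w: float, img_h: float, eps: float = 1e-7) -> torch.Tensor:
    """MPDIoU bounding box regression loss following Eqs. (1)-(2).
    Boxes are (x1, y1, x2, y2) in pixels with shape (N, 4); img_w and img_h
    are the width and height of the input image used for normalization."""
    # Intersection and union areas for the IoU term.
    lt = torch.max(pred[:, :2], target[:, :2])       # top-left of the overlap
    rb = torch.min(pred[:, 2:], target[:, 2:])       # bottom-right of the overlap
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Squared corner distances d1^2 and d2^2, normalized by the image diagonal.
    d1_sq = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2_sq = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2
    diag_sq = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1_sq / diag_sq - d2_sq / diag_sq
    return (1.0 - mpdiou).mean()                     # L_MPDIoU = 1 - MPDIoU

# Example: one predicted box vs. one ground-truth box on a 950x428 image.
pred = torch.tensor([[100.0, 80.0, 300.0, 240.0]])
gt = torch.tensor([[110.0, 90.0, 310.0, 250.0]])
print(mpdiou_loss(pred, gt, img_w=950, img_h=428))
```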

In this study, we adopt YOLOv8n as the baseline model and improve it into YOLOv8n-PP. YOLOv8n-PP is optimized on the basis of YOLOv8n by adopting Mobile-ViT as the backbone network and introducing the MPDIoU loss function to improve the performance and robustness of the model. Through testing on the extended dataset, we find that YOLOv8n-PP exhibits higher mAP and accuracy than YOLOv8n in various environments (e.g., the occlusion case shown in Fig. 4).

In addition, the expansion of the dataset allows YOLOv8n-PP to better adapt to new environments. We demonstrate the superiority of YOLOv8n-PP in a sandy and dusty environment (see Fig. 5); in this scenario, YOLOv8 misses some PV panels, while YOLOv8n-PP successfully detects all PV panels, which demonstrates its good generalization ability across different environments. To further improve the robustness of the model in variable environments, we added data covering different seasons (e.g., snow and dust) and lighting conditions (e.g., strong reflections) to the training set and retrained the model. After retraining, the performance of YOLOv8n-PP in these added environments improves significantly, with the mAP increasing from 0.884 to 0.911. These results show that data extension plays a key role in improving model performance.

Fig. 6 Preprocessing of the dataset. We apply several image data augmentation techniques to expand the dataset and bring it closer to real-world scenarios

4 Experiments

In this section, we provide a detailed explanation of the datasets and experimental settings. First, we outline the process of constructing the P-Pose dataset, describe the experimental environment, and present the relevant metrics used in our investigation. Subsequently, we present the key results that demonstrate the effectiveness of our method and offer further analysis.

4.1 P-Pose dataset construction and preprocessing

The P-Pose dataset contains photovoltaic pose images collected manually from the PV power plant of Jingke Technology in Alar City.

During the collection process, the collectors considered various factors, such as high and low poses, backlighting and counter-lighting, left and right sides, etc. A total of 116 images were collected. To increase the diversity and applicability of the dataset, these images were preprocessed. First, the images were losslessly compressed and normalized to a format of 950\(\times\)428 pixels with three RGB channels. Then, image data augmentation techniques were used to expand the dataset, including adding noise, blurring, rotation, horizontal flipping, and brightness adjustment (as shown in Fig. 6), ultimately expanding to 2076 images. To establish the training and validation sets, we randomly split all samples in an 8:2 ratio, resulting in 1661 training samples and 415 validation samples.
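As a sketch of how such an augmentation pipeline can be assembled, the snippet below uses the albumentations library (the paper does not specify which tool was used); the probabilities and magnitudes are illustrative rather than the exact settings used to build P-Pose, and bounding boxes in YOLO format are transformed together with the images so the pose labels stay consistent.

```python
import albumentations as A

# Possible pipeline mirroring the augmentations listed above; hypothetical settings.
augment = A.Compose(
    [
        A.GaussNoise(p=0.3),                   # add sensor-like noise
        A.Blur(blur_limit=3, p=0.3),           # mild blurring
        A.Rotate(limit=15, p=0.5),             # small rotations
        A.HorizontalFlip(p=0.5),               # left/right flip
        A.RandomBrightnessContrast(p=0.5),     # brightness adjustment
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Example call: image is an HxWx3 uint8 array, bboxes are normalized
# [cx, cy, w, h] boxes, and labels is the matching list of class ids.
# out = augment(image=image, bboxes=bboxes, class_labels=labels)
# aug_image, aug_boxes = out["image"], out["bboxes"]
```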

Table 1 Configuration of the experimental environment

4.2 Experiment setup

The experimental platform is shown in Table 1. We used a 64-bit Windows 10 operating system for training and validation. The hardware configuration includes an Intel Xeon(R) Silver 4210R CPU @ 2.40 GHz, an NVIDIA GeForce RTX 3060Ti GPU, and 64 GB of RAM. Python 3.8 was used as the programming language, with CUDA 11.8 to accelerate training and PyTorch 2.0.0 as the deep learning framework.

The model training is set to 300 epochs, with a batch size of 16, and an initial learning rate of 0.01. The cosine annealing learning rate adjustment algorithm is adopted, as defined by the following formula:

$$\begin{aligned} l_r = l_{min} + \frac{1}{2}(l_{max} - l_{min})\left( 1 + \cos \left( \frac{T_{cur}}{T_{max}}\pi \right) \right), \end{aligned}$$
(3)

where \(l_r\) is the current learning rate, \(l_{min}\) is the minimum learning rate, \(l_{max}\) is the maximum learning rate, \(T_{cur}\) is the current training epoch, and \(T_{max}\) is the total number of training epochs.
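A minimal sketch of this schedule with PyTorch's built-in cosine annealing scheduler is shown below; the placeholder model, the SGD optimizer, and the minimum learning rate eta_min are assumed values, since the text above only specifies the initial learning rate and the number of epochs.

```python
import torch

# Placeholder model and optimizer; lr=0.01 matches the initial learning
# rate above, while eta_min (l_min) is an assumed value.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300,
                                                       eta_min=1e-5)

for epoch in range(300):
    # ... one training epoch over the P-Pose training split ...
    optimizer.step()    # stands in for the actual forward/backward pass
    scheduler.step()    # decays the learning rate from l_max to l_min per Eq. (3)
```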

4.3 Evaluation metrics

To verify the effectiveness and detection performance of the model, the following metrics are selected: precision & recall, F1-score, mean average precision (mAP), Giga floating point operations (GFLOPs), and frames per second (FPS).

Precision is the ratio of correctly predicted positive samples among all predicted positive samples, while recall is the proportion of correctly predicted positive samples among all actual positive samples. The definitions of precision and recall are as follows:

$$\begin{aligned} Precision=\frac{TP}{TP+FP},\quad Recall=\frac{TP}{TP+FN}, \end{aligned}$$
(4)

where TP denotes the true-positive samples, FP denotes the false-positive samples, and FN denotes the false-negative samples.

F1-score is the harmonic mean of precision and recall, so the F1 curve is often used to compare the performance of different models. The F1-score ranges from 0 to 1, and a higher score indicates better performance. The F1-score is defined as

$$\begin{aligned} F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}. \end{aligned}$$
(5)
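A small helper implementing Eqs. (4)–(5) directly from the detection counts is sketched below; the counts in the example call are hypothetical and only illustrate the computation.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 from detection counts, per Eqs. (4)-(5).
    A prediction is typically counted as TP when its IoU with a ground-truth
    box exceeds the chosen threshold."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts: 178 correct detections, 12 false alarms, 20 missed panels.
print(precision_recall_f1(178, 12, 20))  # approx. (0.937, 0.899, 0.918)
```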

mAP evaluates the model’s performance in object detection tasks by plotting the precision–recall curve and calculating the area under the curve. The "m" represents the mean, and the number after the "@" symbol represents the IoU threshold used to determine positive and negative samples. It is formulated as

$$\begin{aligned} mAP = \frac{1}{N}\sum _{i=1}^{N}\int _{0}^{1}P_i(R_i) \textrm{d}(R_i), \end{aligned}$$
(6)

where \(P_i\) and \(R_i\) denote the precision and recall at the \(i\)-th threshold. Following Li et al. [18], we denote AP as mAP(@0.5:0.95) and \(\hbox {AP}_{50}\) as mAP(@0.5).
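The sketch below computes the area under a precision–recall curve using the all-point interpolation common in detection toolkits; whether the evaluation in this paper uses exactly this interpolation is an assumption, and the PR values in the example are hypothetical.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under a precision-recall curve (Eq. (6)) for one class and one IoU
    threshold, using the monotone envelope and all-point integration. mAP
    averages this value over classes and, for AP@0.5:0.95, over IoU thresholds."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # enforce a decreasing envelope
    idx = np.where(r[1:] != r[:-1])[0]              # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Toy three-point PR curve (hypothetical values).
print(average_precision(np.array([0.2, 0.6, 0.9]), np.array([1.0, 0.8, 0.7])))  # 0.73
```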

GFLOPs measure the complexity of the model, with higher values indicating more computational resources required for inference, whereas lower values suggest a lower computational cost.

FPS represents the number of frames processed per second, which is affected by the algorithm and the hardware configuration of the experimental device (Tables 2, 3).

Table 2 Results of YOLOv8n-PP and other algorithms. The metrics are defined in Sect. 4.3
Table 3 Comparison of model performance metrics
Fig. 7 Comparison of detection results of different models in cases with complex backgrounds

Fig. 8 Comparison of YOLOv8 and YOLOv8n-PP PV panel detection in snowy scenarios

Fig. 9 Comparison of YOLOv8 and YOLOv8n-PP photovoltaic panel detection in strongly reflective scenes

4.4 Comparison results

We compared our method against four mainstream object detection models, YOLOv5s [33], YOLOv7 [43], YOLOv8s, and YOLOv8n, using seven evaluation metrics: Precision, Recall, \(\hbox {AP}_{50}\), F1, GFLOPs, AP, and FPS. With the dataset, model hyper-parameters, and training parameters kept consistent, the YOLOv8n-PP model showed the best performance on all metrics. Compared to the runner-up model, YOLOv8n, YOLOv8n-PP achieved a slight improvement in the GFLOPs and inference-time indicators. Additionally, we observed that the YOLOv5s algorithm performed the worst on most evaluation indicators, which may be due to its relatively shallow feature extraction network.

To further validate the recognition performance of different algorithms on photovoltaic poses, we visualized the detection results using the best trained checkpoint of each model. The results, shown in Fig. 7, include a photovoltaic image with overlapping backgrounds and two different poses for comparison. The findings indicate that the recognition accuracy of the YOLOv8n algorithm exceeded that of the first three algorithms. However, when processing poses of photovoltaic panels in overlapping backgrounds, YOLOv8n may produce inaccurate detection boxes for severely occluded targets or even identify the overlapping rear panels. This phenomenon could interfere with the smart cleaning equipment, hindering its ability to perform cleaning operations effectively. The YOLOv8n-PP model enhances accuracy by incorporating the Mobile-ViT self-attention mechanism and replacing the bounding box regression loss function. Its self-attention mechanism, combined with photovoltaic panel pose detection based on the minimum point distance of detected targets, allows it to accurately identify the photovoltaic panels requiring cleaning in overlapping backgrounds. This capability facilitates precise cleaning instructions, ensuring the effective execution of cleaning operations.

In tests under different environmental conditions (e.g., sunny, occluded, snowy, strongly reflective, sandy, and dusty), YOLOv8n-PP demonstrates higher robustness. In the snowy scenario, YOLOv8 detected only four PV panels with low confidence, whereas YOLOv8n-PP accurately detected five PV panels with generally high confidence (see Fig. 8). Similarly, under strongly reflective conditions, YOLOv8 missed PV panels in the reflective area, while YOLOv8n-PP still stably identified the PV panel targets in that area, demonstrating good adaptability to lighting (see Fig. 9). These comparison results further validate the recognition ability and robustness of YOLOv8n-PP in complex environments.

Table 4 Results of ablation experiments. The same conventions are used as in Table 2
Fig. 10 Comparison of YOLOv8n and YOLOv8n-PP across various metrics

4.5 Ablation study

To verify the effectiveness of the algorithm improvement modules, we established multiple variant models based on the original YOLOv8n model for ablation experiments, including YOLOv8n-M, YOLOv8n-L, and YOLOv8n-PP. YOLOv8n-M integrates the self-attention mechanism of the Mobile-ViT model, while YOLOv8n-L replaces the original detection loss of YOLOv8n with the MPDIoU loss. Finally, YOLOv8n-PP is the complete model that incorporates both the Mobile-ViT module and the MPDIoU loss proposed in this work.

According to the ablation experiment results in Table 4, the original YOLOv8n algorithm exhibits a relatively low F1-score and average precision (AP), along with a higher computational cost (GFLOPs). Therefore, we introduce several enhancements to the original YOLOv8n algorithm to improve its performance. After incorporating Mobile-ViT and replacing the original bounding box regression loss function with MPDIoU, most metrics demonstrate improved performance. The results for YOLOv8n-M and YOLOv8n-L indicate that each module provides individual benefits, and their combination yields an additional performance enhancement, effectively demonstrating the validity of our approach.

The model performances are shown in Table 3. On the dataset covering five typical environments, YOLOv8 achieves a precision of 0.888, a recall of 0.884, and an mAP of 0.884, while YOLOv8n-PP achieves a precision of 0.892, a recall of 0.857, and an mAP of 0.91.

4.6 Further analysis

To evaluate the performance of the improved algorithm more intuitively, the Precision, Recall, and \(\hbox {AP}_{50}\) metrics of the YOLOv8n and YOLOv8n-PP models during training are visualized and analyzed. Figure 10 (left) shows the comparison of the Recall values during the training of YOLOv8n and YOLOv8n-PP. According to the experimental data, the two curves reached a relatively stable state almost simultaneously, but the recall of YOLOv8n-PP was consistently higher than that of YOLOv8n and its recall curve was more stable. Figure 10 (middle) shows the comparison of the Precision values during training. The YOLOv8n algorithm had a relatively low initial Precision value in the first 100 epochs, and the Precision value fluctuated significantly as training progressed. After the 100th epoch, the Precision value stabilized above 90%, but the curve still showed obvious fluctuations until the end of training, indicating that the Precision value was not sufficiently stable. In contrast, the YOLOv8n-PP algorithm had an initial Precision value about 10% higher than YOLOv8n in the first 100 epochs, and its Precision value grew more steadily as training progressed. Before the 100th epoch, the Precision value had already entered a stable state and remained stable until the end of training, with the Precision curve consistently higher than that of YOLOv8n. Figure 10 (right) shows the comparison of the \(\hbox {AP}_{50}\) values. The \(\hbox {AP}_{50}\) value of YOLOv8n fluctuated greatly and had not reached a stable state by the end of training, whereas the \(\hbox {AP}_{50}\) value of YOLOv8n-PP was higher and followed a more stable curve, leveling off in the later stage of training. We also conducted a detailed analysis of YOLOv8 and YOLOv8n-PP in terms of false detections and missed detections. YOLOv8 has a high missed-detection rate in sandy, dusty, and snowy environments, especially under large illumination changes, while YOLOv8n-PP stabilizes target detection and significantly reduces the false-detection rate by introducing Mobile-ViT and MPDIoU.

5 Conclusions

In this work, we conduct an in-depth analysis and optimization of the photovoltaic panel pose detection model and construct a diverse and comprehensive dataset of photovoltaic panel poses to ensure that our method demonstrates strong generalization performance across various environments. We propose a photovoltaic panel pose recognition algorithm, YOLOv8n-Photovoltaic-Pose (YOLOv8n-PP), based on an improved YOLOv8n architecture, which incorporates a more lightweight Mobile-ViT network design and the MPDIoU-optimized bounding box regression loss function. The results indicate that YOLOv8n-PP not only improves detection accuracy but also enhances stability in challenging scenarios, thereby contributing to more effective and reliable photovoltaic panel management. Although YOLOv8n-PP has demonstrated excellent performance in a variety of environments, there is still room for improvement under extreme occlusion and in highly reflective environments. Future research will include deploying YOLOv8n-PP on a PV cleaning robot for field testing and incorporating additional types of sensors (e.g., infrared imaging) to further improve the model's detection performance.