1 Introduction

Object detection is a highly effective deep learning technique that has demonstrated exceptional success in surface defect detection [1,2,3], owing primarily to its strong classification and localization abilities. It addresses several critical limitations of traditional anomaly detection approaches, such as poor performance on multi-class and multi-scale object tasks, difficulty in processing large-size images, weak real-time capability, and overall inefficiency. By overcoming these pain points, object detection provides a more robust and efficient solution for identifying and localizing defects. The following paragraphs highlight the specific challenges in defect detection and the areas where object detection models offer significant improvements.

Initially, computer vision-based classification methods were employed for defect detection, primarily tasked with identifying images that contain defects [6, 7]. When an image contains only a single type of defect, as illustrated in Fig. 1(a), a classification method can categorize the defect effectively; however, although it recognizes the type of defect present, it cannot accurately localize the defect within the image. When an image contains multi-class and multi-scale defects, as in Fig. 1(b), a classification method can only detect the presence of defects and can neither classify nor localize them accurately. To solve these problems, object detection approaches were developed, which achieve accurate classification and localization of multiple types of defects, as shown in Fig. 1(b). Applying object detection technology to defect detection greatly enhances both detection efficiency and accuracy, leading to remarkable advances in the field of defect identification. You Only Look Once (YOLO) [8,9,10,11], Faster R-CNN [12], and SSD [13], as the most representative object detection algorithms, have achieved excellent performance in various fields such as industrial manufacturing, material inspection, weld inspection, and textile inspection. These achievements are primarily due to the incorporation of advanced technologies, including deep learning and computer vision, which enhance the capabilities of defect detection.

Fig. 1
figure 1

Complicated defects. (a) shows the surface defects with a single category in a large-size image on an open PCB dataset [4]. (b) presents the surface defects with multiple categories in a large-size image on an open GC10-DET dataset [5]

Furthermore, in modern industrial applications, high-resolution, large-size images are essential for detecting subtle surface defects and ensuring product quality. These images reveal fine imperfections more clearly, such as scratches and dents. However, their use also brings challenges, including increased data volume, higher storage and computational demands, and longer training times [14]. Traditional object detection models may find it difficult to handle such high-resolution images effectively, leading to a decrease in detection efficiency and an increase in false alarms.

In response to these challenges, enhancing detection accuracy without substantially increasing model complexity has become a key objective. A common approach to improve detection accuracy is to increase the model depth by stacking additional convolutional blocks. However, this approach also leads to a more complex and computationally intensive model [15]. To alleviate this, multi-scale feature fusion methods have been proposed [16], enabling the model to capture both fine and coarse features while reducing complexity. However, this method has some limitations in improving detection accuracy and cannot realize significant improvement. Therefore, it is crucial to trade off detection accuracy and model complexity in large-size image detection.

To address the issues of bloated structures and poor performance on multi-class, multi-scale surface defect detection in large-size images, this study proposes YOLO-MSD, a lightweight and effective model designed for industrial applications. The model features a four-scale backbone built with a novel Multi-Scale Convolution (MSC) module, which enhances feature extraction and fusion across different resolutions. In addition, we design a streamlined feature pyramid network (SFPN) to reduce the neck’s complexity while improving fusion efficiency. The anchor-free YOLO head is also simplified to lower computational cost and boost inference speed. Extensive evaluations on five public datasets demonstrate that YOLO-MSD outperforms most advanced models, and deployment on a Jetson Xavier NX confirms its suitability for edge applications.

The main contributions are shown below.

  • This study proposes YOLO-MSD, a robust and lightweight object detection model tailored for industrial surface defect detection in large-size images, achieving higher detection accuracy and robustness against multi-scale, multi-class objects under complex backgrounds.

  • This work presents a novel MSC block and an SFPN to enhance feature extraction and fusion efficiency while reducing model complexity, enabling better performance with lower computational costs.

  • Extensive experiments conducted on five public datasets demonstrate that YOLO-MSD outperforms most existing state-of-the-art (SOTA) models in both detection accuracy and efficiency, validating its effectiveness and adaptability for industrial applications.

The remaining sections are organized as follows. Section 2 reviews related work on multi-scale feature extraction structures, defect detection models, and their applications. Section 3 describes the MSC blocks and the MSC-based defect detection model YOLO-MSD. Section 4 presents extensive ablation and comparison experiments. Section 5 concludes the article.

2 Related work

This section provides an overview of defect detection models and a brief exploration into the development of deep learning models for surface defect detection. Additionally, multi-scale feature extraction architectures are discussed. These are essential for capturing defects of varying size and complexity.

2.1 Defect detection models

Early defect detection relied on rule-based image processing techniques such as edge detection, threshold segmentation, and morphological operations [17]. These methods perform effectively on simple and predictable defects, but their accuracy diminishes significantly when faced with complex and irregular ones. With the improvement of computational power and the accumulation of large amounts of labelled data, convolutional neural network (CNN)-based deep learning methods have gradually replaced traditional methods in defect detection [18, 19]. Deep learning models automatically learn image features and can handle defects in complex environments.

Classical CNN architectures such as VGG [20], ResNet [21] and DenseNet [22] achieved early success in defect detection, but these models usually require a large amount of computational resources. To improve detection efficiency and ensure real-time performance in industrial inspection, lightweight networks such as MobileNet [23], EfficientNet [24] and GhostNet [25] have been proposed and widely used to reduce computational overhead while maintaining detection accuracy. These methods mainly target defect classification.

When detecting multiple defects in complex environments, it is necessary not only to identify the class of each defect but also to localize it accurately. Traditional CNN classifiers often struggle in such scenarios, which has driven the adoption of deep learning-based object detection models in industrial applications. Models such as Faster R-CNN, SSD, and YOLO have gained popularity for their ability to precisely classify and locate defects while maintaining high detection accuracy. With the continuous progress of deep learning, the YOLO family, as a representative of one-stage detection, has been repeatedly refined and shows increasingly strong detection performance, as in YOLOX [26], YOLOv8 [27], YOLOv10 [28] and YOLO11 [11]. These improved versions advance accuracy, speed, and model efficiency, making them even more effective for defect detection in complex environments.

To further improve defect detection accuracy, traditional CNN architectures usually extract complex features by stacking more convolutional groups at a single scale, which results in a bloated model structure and increased computational overhead. Moreover, in practical applications, defects in large-size images vary widely in size and type, and such information is difficult to capture fully with single-scale features alone. To address this challenge, multi-scale feature fusion techniques have emerged [16]. By fusing features at different scales, the model can better capture both detailed and global information, thus improving detection accuracy. In the next section, we review the development of multi-scale feature fusion structures and their role in enhancing defect detection performance.

2.2 Multi-scale feature fusion structures

Multi-scale feature fusion has been widely used to improve object detection in complex scenarios. By processing inputs at multiple scales and integrating the resulting features, these methods enhance the model’s ability to detect objects of various sizes, especially in high-resolution images, while balancing accuracy and computational cost [10, 29,30,31]. For YOLO models, multi-scale feature fusion is implemented mainly in the backbone and neck components: the backbone typically consists of a CNN, while the neck is composed of FPN-style structures.

For CNN-based models, ResNet first introduced the residual architecture, enabling direct feature reuse through shortcut connections between the original and convolved features [21]. Building on this concept, Iqbal et al. [32, 33] introduced a series of CNN structures for the automated detection of synovial fluid in human knee joints and the classification of endothelial cells derived from human-induced pluripotent stem cells. In the field of object detection, YOLOv3 incorporated a ResNet-like structure by introducing the Darknet53 backbone for effective multi-scale feature extraction [34]. This was further improved in YOLOv4, which proposed CSPDarknet53 to enhance feature fusion across two scales [35]. YOLOv5 [36], YOLOX [26], and YOLOv7 [37] continued refining this architecture to improve accuracy and efficiency. More recently, YOLOv8 [27], YOLOv10 [28], and YOLO11 [11] have focused on lightweight designs while maintaining high performance across various detection tasks. However, these models perform poorly on large-size images containing multi-scale objects. To address this problem, we employ additional scales for feature extraction.

For the neck structure, YOLOv3 first introduced FPN to fuse features at three scales [34]. YOLOv4 enhanced this by adding PANet, incorporating both top-down and bottom-up paths for better low-level feature integration [35]. YOLOv5 combined FPN and PANet for more effective multi-scale fusion, while YOLOX further optimized PANet [26]. BiFPN introduced learnable weights to adaptively balance features across scales, improving fusion performance [38]. However, these improvements increase network complexity and computational cost. In this work, we aim to design a novel lightweight neck that achieves efficient feature fusion with small overhead.

3 Methodology

This section presents an overview of our proposed surface defect detection model, YOLO-MSD. Section 3.1 describes the architecture of the MSC blocks. Section 3.2 introduces the MSC block-based YOLO-MSD structure.

3.1 MSC blocks

To enhance feature extraction capability and recognition accuracy, traditional backbones often deepen the network by stacking convolutional modules. Although effective, this approach significantly increases computational complexity and overhead. To address this issue, we propose a novel MSC block. The MSC block balances computational overhead and detection accuracy by utilizing multi-scale parallel computation and feature fusion to improve feature extraction, while keeping the number of convolutional operations small.

Figure 2 shows the architecture of the MSC blocks. The primary function of the MSC41 block is to split the input features into four dimensions and then perform feature extraction and fusion across them. In detail, the first CBS group (consisting of a convolution layer, a Batch Normalization layer and a SiLU activation function) extracts features from the input image and increases the number of channels. The output is then split into four scales. The first dimension uses a CBS group with a \(1\times 1\) convolution kernel to adjust the number of channels, while the other dimensions achieve downsampling and channel changes through CBS groups with \(1\times 1\) convolution kernels and a stride of 2. The number of channels in each dimension is one-quarter of the input channels. As a result, we obtain four scales with different feature-map sizes: the first dimension has the largest and the last dimension the smallest. In addition, MaxPooling operations combined with concatenation achieve feature fusion, transmitting large-scale information to the smaller scales and enriching their features.

Fig. 2
figure 2

Structure of the MSC blocks, including MSC41, MSC42, MSC3 and MSC2. CBS denotes a convolution group consisting of a convolution layer, a Batch Normalization layer and a SiLU activation function. MaxP denotes the MaxPooling operation

Specifically, the computation across the four scales proceeds continuously until the target size is achieved, eliminating the redundant operations typically found in traditional feature fusion strategies, which involve separating, fusing, and then separating again. This approach preserves the original features of each scale while seamlessly merging them with the features of the next scale, thus enhancing the model’s overall feature extraction capability.

The MSC41 block is formulated as follows,

$$\begin{aligned} {\left\{ \begin{array}{ll} Y^{C1}_{M41}= P(f^{n/4}_{C31}(f^{n/4}_{C11}(X))) \\ Y^{C2}_{M41}= f^{n/4}_{C31}(f^{n/4}_{C31}(f^{n/4}_{C12}(X))\oplus P(Y^{C1}_{M41})) \\ Y^{C3}_{M41}= f^{n/4}_{C31}(f^{n/4}_{C31}(f^{n/4}_{C12}(f^{n/4}_{C12}(X)))\oplus P(Y^{C2}_{M41})) \\ Y^{C4}_{M41}= f^{n/4}_{C31}(f^{n/4}_{C31}(f^{n/4}_{C12}(f^{n/4}_{C12}(f^{n/4}_{C12}(X))))\oplus P(Y^{C3}_{M41})) \end{array}\right. } \end{aligned}$$
(1)

where \(Y^{C_i}_{M41}\) (\(i = 1, 2, 3, 4\)) represents the output feature at the i-th scale of the MSC41 block. Here, \(C_i\) denotes the i-th dimension. The input X is processed by a CBS operation with n output channels. Each \(f^{c}_{C_{ks}}(\cdot )\) represents a CBS group with kernel size \(k \times k\), c output channels, and stride s. The operator \(P(\cdot )\) denotes a \(3 \times 3\) MaxPooling operation used to downsample features before fusion. \(\oplus\) indicates the concatenation operation. This formulation enables hierarchical feature extraction and fusion across four scales, where each output \(Y^{C_i}_{M41}\) is built upon and enriched by information from the previous scales.
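To make the data flow concrete, the following PyTorch sketch implements Eq. (1). The paper does not specify the stem kernel size or the pooling strides, so the choices below (a 3×3 stem, a stride-1 pool closing the first branch, and stride-2 pools before each fusion) are assumptions made so that tensor shapes align at every concatenation.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """The f^c_{Cks} group: convolution + Batch Normalization + SiLU."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class MSC41(nn.Module):
    """Sketch of the MSC41 block, Eq. (1); n must be divisible by 4."""
    def __init__(self, c_in, n):
        super().__init__()
        c = n // 4
        self.stem = CBS(c_in, n, k=3, s=1)                     # first CBS (kernel size assumed)
        self.fuse_pool = nn.MaxPool2d(3, stride=2, padding=1)  # P(.) before fusion (stride assumed)
        self.keep_pool = nn.MaxPool2d(3, stride=1, padding=1)  # trailing P(.) on the first branch
        self.c1_in = CBS(n, c, 1, 1)                           # f^{n/4}_{C11}
        self.c2_in = CBS(n, c, 1, 2)                           # one f^{n/4}_{C12}
        self.c3_in = nn.Sequential(CBS(n, c, 1, 2), CBS(c, c, 1, 2))
        self.c4_in = nn.Sequential(CBS(n, c, 1, 2), CBS(c, c, 1, 2), CBS(c, c, 1, 2))
        self.c1_conv = CBS(c, c, 3, 1)
        self.pre = nn.ModuleList(CBS(c, c, 3, 1) for _ in range(3))       # inner f^{n/4}_{C31}
        self.post = nn.ModuleList(CBS(2 * c, c, 3, 1) for _ in range(3))  # f^{n/4}_{C31} after concat

    def forward(self, x):
        x = self.stem(x)
        y1 = self.keep_pool(self.c1_conv(self.c1_in(x)))
        y2 = self.post[0](torch.cat([self.pre[0](self.c2_in(x)), self.fuse_pool(y1)], 1))
        y3 = self.post[1](torch.cat([self.pre[1](self.c3_in(x)), self.fuse_pool(y2)], 1))
        y4 = self.post[2](torch.cat([self.pre[2](self.c4_in(x)), self.fuse_pool(y3)], 1))
        return y1, y2, y3, y4
```

With a 512\(\times\)512 input, the four outputs have spatial sizes of 512, 256, 128 and 64, each with n/4 channels, matching the cascade of stride-2 operations in Eq. (1).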

In contrast to the MSC41 block, the MSC42 block reduces the input processing while preserving the feature extraction across the four dimensions initially defined in the MSC41 block. The MSC42 blocks are expressed below,

$$\begin{aligned} {\left\{ \begin{array}{ll} Y^{C1}_{M421} = P(f^{n/2}_{C31}(f^{n/2}_{C31}(Y^{C1}_{M41}))) \\ Y^{C2}_{M421} = f^{n/2}_{C31}(f^{n/2}_{C31}(f^{n/2}_{C12}(Y^{C2}_{M41}))\oplus P(Y^{C1}_{M421})) \\ Y^{C3}_{M421} = f^{n/2}_{C31}(f^{n/2}_{C31}(f^{n/2}_{C12}(Y^{C3}_{M41}))\oplus P(Y^{C2}_{M421})) \\ Y^{C4}_{M421} = f^{n/2}_{C31}(f^{n/2}_{C31}(f^{n/2}_{C12}(Y^{C4}_{M41}))\oplus P(Y^{C3}_{M421})) \end{array}\right. } \end{aligned}$$
(2)
$$\begin{aligned} {\left\{ \begin{array}{ll} Y^{C1}_{M422} = P(f^{n}_{C31}(f^{n}_{C31}(Y^{C1}_{M421}))) \\ Y^{C2}_{M422} = f^{n}_{C31}(f^{n}_{C31}(f^{n}_{C12}(Y^{C2}_{M421}))\oplus P(Y^{C1}_{M422})) \\ Y^{C3}_{M422} = f^{n}_{C31}(f^{n}_{C31}(f^{n}_{C12}(Y^{C3}_{M421}))\oplus P(Y^{C2}_{M422})) \\ Y^{C4}_{M422} = f^{n}_{C31}(f^{n}_{C31}(f^{n}_{C12}(Y^{C4}_{M421}))\oplus P(Y^{C3}_{M422})) \end{array}\right. } \end{aligned}$$
(3)

where \(Y^{Ci}_{M421}\) (\(i=1, 2, 3, 4\)) denotes the i-th scale output of the first MSC42 block, and \(Y^{Ci}_{M422}\) (\(i=1, 2, 3, 4\)) the i-th dimension output of the second MSC42 block. In particular, the fourth dimension of the second MSC42 block already satisfies the output requirements, so no further convolution is needed.

The MSC3 and MSC2 blocks follow similar principles but operate on three and two dimensions, respectively. The expression of MSC3 and MSC2 blocks is as follows,

$$\begin{aligned} {\left\{ \begin{array}{ll} Y^{C1}_{M3} = P(f^{2n}_{C31}(f^{2n}_{C31}(Y^{C1}_{M422}))) \\ Y^{C2}_{M3} = f^{2n}_{C31}(f^{2n}_{C31}(f^{2n}_{C12}(Y^{C2}_{M422}))\oplus P(Y^{C1}_{M3})) \\ Y^{C3}_{M3} = f^{2n}_{C31}(f^{2n}_{C31}(f^{2n}_{C12}(Y^{C3}_{M422}))\oplus P(Y^{C2}_{M3})) \end{array}\right. } \end{aligned}$$
(4)
$$\begin{aligned} {\left\{ \begin{array}{ll} Y^{C1}_{M2} = P(f^{4n}_{C31}(f^{4n}_{C31}(Y^{C1}_{M3}))) \\ Y^{C2}_{M2} = f^{4n}_{C31}(f^{4n}_{C31}(f^{4n}_{C12}(Y^{C2}_{M3}))\oplus P(Y^{C1}_{M2})) \end{array}\right. } \end{aligned}$$
(5)

where \(Y^{Ci}_{M3}\) (\(i=1, 2, 3\)) represents the i-th dimension output of the MSC3 block and \(Y^{Ci}_{M2}\) (\(i=1, 2\)) is the i-th scale output of the MSC2 block.

3.2 Overview of YOLO-MSD

To overcome the challenges of industrial surface defect detection in large-size images and to balance the recognition accuracy and model complexity of YOLO, we propose a novel defect detection network, YOLO-MSD. Figure 3 shows the framework of YOLO-MSD, which mainly consists of the Backbone, Neck and Head. Detailed information about YOLO-MSD is provided below.

Fig. 3
figure 3

Architecture of the YOLO-MSD defect detection model, consisting of three parts: Backbone, Neck, and Head. The Backbone is constructed by a novel MSCNet. An SFPN framework is used as the Neck. A modified Anchor-free YOLO head is employed as the Head. The CBS convolution group consists of a convolution layer, a Batch Normalization (BN) layer and a SiLU activation function

3.2.1 MSC block-based backbone

The main task of the Backbone is to extract deep features from input images; hence, improving its feature extraction capability is paramount. This study proposes a novel MSCNet for the Backbone, enabling multi-scale feature extraction and fusion. The left part of Fig. 3 displays the structure of the Backbone, which employs the novel MSC blocks to extract and fuse features across four scales. The MSC41 block splits the input into four dimensions and exchanges information among them, while the other MSC blocks acquire deep features of the input image from multiple scales. Furthermore, the last three outputs of each scale are concatenated separately to generate three outputs with different resolutions, which are transmitted to the Neck for further feature fusion. Each Backbone output therefore contains feature information from all four dimensions, ensuring that YOLO-MSD adequately extracts features from the input image. The three outputs of the Backbone are expressed below,

$$\begin{aligned} \left\{ \begin{array}{l} Y_{B1} = f^{2n}_{C31}(Y^{C1}_{M3}\oplus Y^{C2}_{M422}\oplus Y^{C3}_{M421}\oplus Y^{C4}_{M41}) \\ Y_{B2} = f^{4n}_{C31}(Y^{C1}_{M2}\oplus Y^{C2}_{M3}\oplus Y^{C3}_{M422}\oplus Y^{C4}_{M421}) \\ Y_{B3} = f^{8n}_{C31}(Y_{CC}\oplus Y^{C2}_{M2}\oplus Y^{C3}_{M3}\oplus Y^{C4}_{M422}) \end{array}\right. \end{aligned}$$
(6)
$$\begin{aligned} Y_{CC} = f^{8n}_{C31}(f^{8n}_{C32}(Y^{C1}_{M2})) \end{aligned}$$
(7)

where \(Y_{B1}\), \(Y_{B2}\) and \(Y_{B3}\) are the large-, middle- and small-resolution outputs of the Backbone, respectively. \(Y_{CC}\) denotes the output of the CC block in Fig. 3.
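The wiring of Eq. (6)-(7) can be sketched as follows. The helper below is hypothetical (the function name and dict layout are ours); it assumes the four tensors on each diagonal share the same spatial size, as the concatenations imply, and that `fuse_b1`/`fuse_b2`/`fuse_b3` are 3×3 CBS groups with 2n, 4n and 8n output channels while `cc_down` and `cc_conv` realize Eq. (7).

```python
import torch

def backbone_heads(feats, fuse_b1, fuse_b2, fuse_b3, cc_down, cc_conv):
    """feats maps "<block>_<scale>" names to feature tensors; each output
    concatenates one tensor per scale on the channel axis, then fuses."""
    y_b1 = fuse_b1(torch.cat([feats["M3_C1"], feats["M422_C2"],
                              feats["M421_C3"], feats["M41_C4"]], dim=1))
    y_b2 = fuse_b2(torch.cat([feats["M2_C1"], feats["M3_C2"],
                              feats["M422_C3"], feats["M421_C4"]], dim=1))
    y_cc = cc_conv(cc_down(feats["M2_C1"]))   # Eq. (7): stride-2 CBS, then 3x3 CBS
    y_b3 = fuse_b3(torch.cat([y_cc, feats["M2_C2"],
                              feats["M3_C3"], feats["M422_C4"]], dim=1))
    return y_b1, y_b2, y_b3
```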

3.2.2 SFPN-based neck

Current YOLO necks often suffer from bloated architectures, bringing a heavy computational burden and structural complexity. To address this problem, this research presents a lightweight SFPN framework for feature fusion. The middle of Fig. 3 shows the detailed construction of the SFPN. For the last scale, because the corresponding Backbone output has already fully fused the four-scale features, we only use a CBS convolution group in the Neck for further fusion. For the middle layer, to achieve feature fusion across three scales, we apply a combination of downsampling, CBS operations, and upsampling to the three neck inputs: the first input is downsampled, the last is upsampled, and the middle input passes through a CBS operation. The resulting features are then concatenated and fed into a final CBS operation for fusion. For the first scale, the last input is upsampled twice and the intermediate neck output once; these are combined with the first input for feature fusion across three dimensions. The process of the Neck is represented by the following expression,

$$\begin{aligned} {\left\{ \begin{array}{ll} \begin{aligned} Y_{N1} & = f^{m}_{C11}(f^{m}_{C11}(Y_{B1})\oplus f^{m}_{up}(Y_{N2})\\ & \oplus f^{m}_{up}(f^{2m}_{up}(Y_{B3}))) \\ \end{aligned}\\ \begin{aligned} Y_{N2} & = f^{2m}_{C11}(f^{2m}_{C12}(Y_{B1})\oplus f^{2m}_{C11}(Y_{B2})\\ & \oplus f^{2m}_{up}(Y_{B3})) \\ \end{aligned} \\ Y_{N3} = f^{4m}_{C11}(Y_{B3}) \end{array}\right. } \end{aligned}$$
(8)

where the Neck outputs \(Y_{N1}\), \(Y_{N2}\), and \(Y_{N3}\) represent large, medium, and small resolutions, respectively. \(f^{m}_{up}\) denotes the upsampling process (including a CBS convolution group and an upsampling operation) with m output channels.

Compared with current YOLO necks, our proposal uses fewer convolution and pooling operations, which reduces the computational burden of the Neck. In addition, we utilize upsampling operations to transmit features of the last scale to the other two scales. Through this approach, the SFPN framework achieves effective feature fusion while maintaining low computational complexity.
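A minimal PyTorch sketch of Eq. (8) follows, reusing the CBS group defined earlier. The 2× nearest-neighbour upsampling mode and the 1×1 kernel inside \(f_{up}\) are assumptions; the channel widths (m, 2m, 4m) follow the equation.

```python
import torch
import torch.nn as nn

class Up(nn.Module):
    """f_up in Eq. (8): a CBS group followed by 2x upsampling (mode assumed)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.cbs = CBS(c_in, c_out, k=1, s=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x):
        return self.up(self.cbs(x))

class SFPN(nn.Module):
    """Sketch of the SFPN neck; c1/c2/c3 are the channel counts of Y_B1..Y_B3."""
    def __init__(self, c1, c2, c3, m):
        super().__init__()
        self.n3 = CBS(c3, 4 * m, 1, 1)        # Y_N3: a single CBS, Eq. (8)
        self.b1_down = CBS(c1, 2 * m, 1, 2)   # f^{2m}_{C12}(Y_B1)
        self.b2_lat = CBS(c2, 2 * m, 1, 1)    # f^{2m}_{C11}(Y_B2)
        self.b3_up = Up(c3, 2 * m)            # f^{2m}_{up}(Y_B3)
        self.n2 = CBS(6 * m, 2 * m, 1, 1)     # fuse -> Y_N2
        self.b1_lat = CBS(c1, m, 1, 1)        # f^{m}_{C11}(Y_B1)
        self.n2_up = Up(2 * m, m)             # f^{m}_{up}(Y_N2)
        self.b3_up2 = Up(2 * m, m)            # second upsampling of Y_B3
        self.n1 = CBS(3 * m, m, 1, 1)         # fuse -> Y_N1

    def forward(self, b1, b2, b3):
        n3 = self.n3(b3)
        n2 = self.n2(torch.cat([self.b1_down(b1), self.b2_lat(b2),
                                self.b3_up(b3)], 1))
        n1 = self.n1(torch.cat([self.b1_lat(b1), self.n2_up(n2),
                                self.b3_up2(self.b3_up(b3))], 1))
        return n1, n2, n3
```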

3.2.3 Anchor-free head

To achieve fast inference and reduce computational complexity, this research adopts an anchor-free YOLO head and removes some convolution operations. Figure 4 displays the architecture of the Head. In detail, a CBS operation first integrates the input channels. The features are then divided into two branches: the first classifies the defects, while the second determines the existence of defects and localizes them. The branch outputs are concatenated into a single output per scale. As a result, the three scales of the YOLO head enable YOLO-MSD to classify and localize defects of various sizes across different dimensions. The YOLO head is expressed as follows,

$$\begin{aligned} \begin{aligned} Y_{Hi}&= f^{nc}_{C}(f^{m}_{C11}(f^{m}_{C11}(Y_{Ni})))\\&\oplus f^{4}_{C}(f^{m}_{C11}(f^{m}_{C11}(Y_{Ni}))) \\&\oplus f^{1}_{C}(f^{m}_{C11}(f^{m}_{C11}(Y_{Ni}))) \end{aligned} \end{aligned}$$
(9)

where \(Y_{Hi}\) (\(i=1, 2, 3\)) denotes the i-th output of the YOLO Head. \(f^{nc}_{C}\) is a convolution with a \(1\times 1\) kernel, nc (number of classes) output channels and a stride of 1. \(f^{4}_{C}\) predicts the bounding box of a defect with four coordinates, and \(f^{1}_{C}\) identifies whether a defect exists.
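The head for a single scale can be sketched as below. Following the two segments described above, the classification branch and the regression/objectness branch are given separate 1×1 CBS stems (a decoupled layout consistent with Fig. 4, though the exact stem sharing is our assumption).

```python
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """Sketch of Eq. (9) for one neck scale; CBS is the group defined earlier."""
    def __init__(self, c_in, m, nc):
        super().__init__()
        self.cls_stem = nn.Sequential(CBS(c_in, m, 1, 1), CBS(m, m, 1, 1))
        self.reg_stem = nn.Sequential(CBS(c_in, m, 1, 1), CBS(m, m, 1, 1))
        self.cls = nn.Conv2d(m, nc, 1)  # f^{nc}_C: per-class scores
        self.reg = nn.Conv2d(m, 4, 1)   # f^{4}_C : four box coordinates
        self.obj = nn.Conv2d(m, 1, 1)   # f^{1}_C : objectness

    def forward(self, x):
        c = self.cls_stem(x)
        r = self.reg_stem(x)
        # concatenate per Eq. (9): nc + 4 + 1 channels per location
        return torch.cat([self.cls(c), self.reg(r), self.obj(r)], dim=1)
```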

Fig. 4
figure 4

Architecture of the YOLO head. Cls., Reg. and Obj. of YOLO Head are used for classification, regression and determination of the presence or absence of an object, respectively

We provide a notation list of the mathematical symbols, as shown in Table 1.

Table 1 Notation list of the mathematical symbols

4 Evaluation

4.1 Experiment configuration

4.1.1 YOLO-MSD family

To meet different application requirements and computational resource constraints, we divide the YOLO-MSD model into five versions: X, L, M, S, and Tiny. Table 2 shows the configuration of each version and the related parameter settings.

Table 2 Configuration of YOLO-MSD family

4.1.2 Evaluation metrics

Several metrics are used to evaluate the performance of YOLO models, including Recall, Precision, Average Precision (AP), mean Average Precision (mAP), the number of model parameters (Param.), Floating Point Operations (FLOPs), the size of the model weights (Size), and inference speed (FPS). These metrics are expressed as follows.

$$\begin{aligned} Rec=\frac{TP}{TP+FN} \end{aligned}$$
(10)
$$\begin{aligned} Pre=\frac{TP}{TP+FP} \end{aligned}$$
(11)

where Pre is the Precision and Rec is the Recall. TP, TN, FP and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. The AP and mAP are formulated as follows.

$$\begin{aligned} AP=\sum (Res_{n+1}-Res_{n})Pre_{max}[Res_{n}, Res_{n+1}] \end{aligned}$$
(12)
$$\begin{aligned} mAP=\frac{1}{m}\sum AP \end{aligned}$$
(13)

where \(Res_n\) is the n-th Recall value, \(Pre_{max}[Res_{n}, Res_{n+1}]\) is the maximum Precision over the interval \([Res_{n}, Res_{n+1}]\), and m represents the number of classes.
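For reference, the following sketch computes Eq. (12)-(13) from a precision-recall curve using all-point interpolation; it assumes the recall values are sorted in ascending order with their matching precision values.

```python
import numpy as np

def average_precision(recall, precision):
    """AP per Eq. (12): sum of recall-step widths times the maximum
    precision attainable at or beyond each recall level."""
    rec = np.concatenate(([0.0], recall, [1.0]))
    pre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(pre) - 2, -1, -1):      # right-to-left running max,
        pre[i] = max(pre[i], pre[i + 1])       # so pre[i] = Pre_max on [rec[i], 1]
    steps = np.where(rec[1:] != rec[:-1])[0]   # indices where recall increases
    return float(np.sum((rec[steps + 1] - rec[steps]) * pre[steps + 1]))

def mean_average_precision(ap_per_class):
    """mAP per Eq. (13): the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```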

Additionally, the MS-COCO criteria are utilized to evaluate the SOTA surface defect detection models; the definitions of these metrics are given in Table 3.

Table 3 Definition of the MS-COCO evaluation metrics

4.1.3 Datasets

In this study, we use five open datasets to evaluate our proposal: the PCB, HRIPCB, GC10-DET, NEU-DET and CRACK datasets. Table 4 outlines their configurations, while Fig. 5 illustrates the class distributions of the PCB, GC10-DET, and NEU-DET datasets. The PCB, HRIPCB and GC10-DET datasets feature large-size images, whereas the NEU-DET and CRACK datasets contain small-size images.

Table 4 Description of five open datasets
Fig. 5
figure 5

Distribution of three multi-category datasets. The X-axis displays the names of categories. The Y-axis illustrates the number of defects in each category. (a) presents the PCB dataset, which contains six categories. (b) depicts the GC10-DET dataset with ten categories. (c) shows the NEU-DET dataset, comprising six categories

The PCB dataset is composed of 693 large-size images of PCB surface defects in six categories: missing_hole, mouse_bite, open_circuit, short, spur, and spurious_copper, as shown in Fig. 5(a). The images come in 10 different sizes, ranging from 2240\(\times\)2016 to 3056\(\times\)2464. The HRIPCB dataset is generated by applying rotation-based data augmentation to all images in the PCB dataset; we use all 693 rotated images as the test set. Figure 5(b) provides the distribution of the GC10-DET dataset, which consists of 2294 metallic surface defect images categorized into ten classes, each with a resolution of 2048\(\times\)1000. The NEU-DET dataset, depicted in Fig. 5(c), contains 1800 images of steel surface defects categorized into six distinct classes. The CRACK dataset consists of 996 images of crack defects within a single category (Table 4).

4.1.4 Implementation details

This study carries out comprehensive experiments on local equipment featuring an Intel® Core\(^\textrm{TM}\) i7-13700KF 24-Core Processor and a single NVIDIA GeForce RTX 4090 GPU. The operating system is Ubuntu 20.04, and the CUDA version is 12.0.

Furthermore, YOLOv8, YOLOv10 and YOLO11 are run on the PyTorch platform, while the other models are implemented on the TensorFlow platform. The training parameters are shown in Table 5. In particular, our proposal is compared with various SOTA models under the same training parameter settings, and no pre-trained weights are used. In addition, since some models did not achieve satisfactory results under this setting on the TensorFlow platform, we adjusted their learning rates appropriately to achieve optimal performance.

Table 5 Training parameters

4.2 Experiments

This section presents extensive experimental results on the selected datasets with advanced object detection models, including the two-stage detector Faster-RCNN [12] and the one-stage detectors YOLOv3 [34], YOLOv4 [35], YOLOv5, YOLOX [26], YOLOv7 [37], YOLOv8 [27], YOLOv10 [28] and the YOLO11 [11] series. The detailed experimental results are presented below.

4.2.1 Ablation experiments on the PCB dataset

The PCB dataset, as a representative industrial surface defect dataset, features large image sizes and relatively small defects. Therefore, this study conducts extensive experiments on the PCB dataset to evaluate our proposal. In this section, we perform ablation experiments to investigate the effectiveness of the MSCNet and SFPN structures.

Effectiveness evaluation of SFPN framework

To verify the advantages of our SFPN, we combine MSCNet with other neck structures to form a series of defect detection models and conduct experiments; the necks include ASFF [42], FPN [43], BiFPN [38], SPAN [44] and the necks of the YOLOX series. Specifically, the configuration of MSCNet is the same as in YOLO-MSD-L, with output channels (224, 448, 896), and the input image size is set to 512. Table 6 displays the experimental results on the PCB dataset. Our proposal outperforms the other neck structures and achieves the best \(mAP\) of 96.67%, while MSCNet+SFPN also attains a very small model size and low computational complexity. These results demonstrate that our SFPN architecture maintains strong feature fusion capability while remaining lightweight.

Table 6 Comparison of the SFPN with other advanced neck models on the PCB dataset

Effectiveness evaluation of MSCNet architecture

To evaluate the effectiveness of the proposed MSC blocks, we construct MSCNet using MSC blocks and compare it with ten widely used backbone architectures: VGG16 [20], ResNet50 [21], MobileNetV2 [45], InceptionV3 [46], Xception [47], EfficientNet [24], DenseNet121 [22], GhostNet [25], DarkNet [34], and CSPDarknet53 [48]. In particular, the SFPN configuration in this section follows YOLO-MSD-L, with neck output channels of (256, 512, 1024), and the input image size is limited to 512. Table 7 displays the experimental results on the PCB dataset. Our proposal outperforms the other backbone architectures and attains the best \(mAP\) of 96.67%. Although MobileNetV2+SFPN achieves the most lightweight structure, its \(mAP\) is 25.34% lower than that of MSCNet+SFPN. GhostNet+SFPN has the lowest computational overhead, but its \(mAP\) is 29.10% lower than that of MSCNet+SFPN. VGG16+SFPN attains the fastest inference speed, yet its \(mAP\) is 1.77% lower than that of MSCNet+SFPN and its \(FLOPs\) are more than twice those of MSCNet+SFPN.

Table 7 Comparison of the MSCNet with other advanced backbone models on the PCB dataset

The excellent performance of MSCNet stems from the architectural design of MSC blocks, which enable parallel multi-scale convolution operations followed by four-scale feature fusion. Unlike other backbones that rely on conventional, residual, separable, or ghost convolutions, MSC blocks are specifically tailored to extract rich feature representations and spatial details across different receptive fields. This design is particularly beneficial for detecting small or variably sized defects. Overall, the MSC block-based MSCNet exhibits robust feature extraction capability, confirming its effectiveness in detecting small defects within large-size images.

Figure 6 illustrates Grad-CAM heatmaps of four representative and optimal models. These visualizations demonstrate that the proposed MSCNet and SFPN modules effectively guide the model to focus on defect-relevant regions. In particular, YOLO-MSD-L exhibits stronger and more concentrated activation responses around true defect areas, indicating better feature representation and localization capability.
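The heatmaps can be reproduced with a generic hook-based Grad-CAM routine such as the sketch below (not the authors' exact visualization code); it weights the target layer's activations by the spatially averaged gradients of a chosen score and rectifies the weighted sum.

```python
import torch

def grad_cam(model, x, target_layer, score_fn):
    """Generic Grad-CAM sketch. score_fn reduces the model output to a
    scalar (e.g. the summed confidence of one class); adapting it to a
    detection head's output layout is left to the caller."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    score = score_fn(model(x))
    score.backward()
    h1.remove(); h2.remove()
    w = grads[0].mean(dim=(2, 3), keepdim=True)   # channel weights from gradients
    cam = torch.relu((w * acts[0]).sum(dim=1))    # weighted activation map
    return cam / cam.max().clamp(min=1e-8)        # normalize; upsample for overlay
```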

Fig. 6
figure 6

Grad-CAM heatmaps of four optimal models on the PCB dataset

4.2.2 Comparison experiments on the PCB dataset

To evaluate the effectiveness of our proposal, this section carries out comparison experiments on the PCB dataset. Table 8 displays the experimental results against advanced object detection models. It is clear that our proposal surpasses the other models by a wide margin. In detail, YOLO-MSD-L achieves the best \(mAP\) of 96.67% and beats the second-best model, YOLOX-X, by more than 5%. Additionally, the YOLO-MSD family is superior to the other models at every scale. Although YOLOv4-Tiny attains the fastest inference speed, the \(mAP\) of YOLO-MSD-L is 21.27% higher than that of YOLOv4-Tiny. Unfortunately, on the PCB dataset, Faster-RCNN (ResNet50) and YOLOv5 fail to perform effectively. Furthermore, the model size, \(Parameters\), and computational cost of the YOLO-MSD series are lower than those of the YOLOX family. Specifically, the \(Size\) and \(Parameters\) of YOLO-MSD-X are more than 20% lower than those of YOLOX-X, and its \(FLOPs\) are about 25% lower. Although YOLOv7 achieves a lighter construction than YOLO-MSD, there is a significant \(mAP\) gap between the YOLOv7 series and the YOLO-MSD family.

Table 8 Comparison with advanced models based on the open dataset PCB

Table 9 shows the comparison of our proposed YOLO-MSD model with current mainstream models on the PCB dataset, covering YOLOv8, YOLOv10, and YOLO11, all trained on the PyTorch platform. Due to differences between the TensorFlow and PyTorch platforms, we only select key evaluation metrics for comparison: mAP, FLOPs, and Parameters. The results show that YOLO-MSD significantly outperforms the other models in detection performance and demonstrates a clear advantage in the number of parameters. Notably, the YOLO-MSD series achieves performance comparable to the other models' X, L, M, and S scales even when evaluated under its L, M, S, and Tiny configurations. Moreover, YOLO-MSD outperforms YOLOv8 in all metrics. Although the computational complexity of YOLO-MSD is slightly higher than that of YOLOv10 and YOLO11 at certain scales, its mAP is significantly higher than both. These results fully validate the effectiveness and superiority of our proposed method in handling large-size images.

Table 9 Comparison with SOTA models based on the open dataset PCB

Figure 7 illustrates the defect detection results of the best-performing model, YOLO-MSD-L, on the PCB dataset. The results display several large-size images, with dimensions ranging from 2775\(\times\)2159 to 3056\(\times\)1586. Although the complexity and large size of the PCB images make the defects difficult to detect with the naked eye, each defect in the images has been effectively detected with a good confidence level. Meanwhile, the defects are accurately classified and located. These results verify the capability of the YOLO-MSD to detect and classify defects in handling complex and large-size inputs.

Fig. 7
figure 7

Defect detection results of YOLO-MSD-L on the PCB dataset. The image sizes range from 2775\(\times\)2159 to 3056\(\times\)2464. The boxes are the locations of the defects. The category and confidence of the defect are shown above the box

4.2.3 Comparison experiments on the HRIPCB dataset

As the HRIPCB dataset is created by augmenting all images in the PCB dataset through rotation, we assess the model trained on the PCB dataset by evaluating it on the validation set of the HRIPCB dataset and reporting the corresponding \(mAP\). Table 10 shows the comparison between our proposed method and SOTA models. Clearly, YOLO-MSD shows superior performance, significantly outperforming other models. Even though the images in the test set underwent rotation transformations, the detection accuracy of our method experienced only a slight decrease, while still maintaining a high level of precision. These results further validate the strong effectiveness and robustness of the proposed approach.

Table 10 Comparison with SOTA models on the open datasets HRIPCB

4.2.4 Comparison experiments on the GC10-DET dataset

This section conducts extensive experiments on the open GC10-DET dataset, which consists of large-size images with defects of different scales, facilitating verification of the model's generalization. Table 11 compares the proposed YOLO-MSD model with several SOTA models, namely YOLOv8, YOLOv10 and YOLO11, on the GC10-DET dataset. The results show that YOLO-MSD achieves the best detection performance with the highest \(mAP\) of 69.09%, outperforming both YOLOv10 and YOLO11. Although YOLOv8 performs slightly better at two smaller scales, its best \(mAP\) of 68.4% still falls 0.69 percentage points short of YOLO-MSD.

Table 11 Comparison with SOTA models based on the open dataset GC10-DET

Table 12 illustrates the experiments comparing SOTA object detection models. Clearly, our YOLO-MSD family outperforms other advanced models, and the \(mAP\) of the YOLO-MSD series exceeds the other models in each dimension. In particular, YOLO-MSD-X attains the best \(mAP\) of 69.09%, outperforming YOLOv5-X by 13.41%, YOLOX-X by 7.11%, and YOLOv7-X by 25.68%. The \(mAP\) of YOLO-MSD-Tiny achieves 61.02%, which is 9.28% higher than the \(mAP\) of YOLOX-Tiny and 17.58% better than the \(mAP\) of YOLOv5-Nano.

Table 12 Comparison with advanced models based on the open dataset GC10-DET

Figure 8 shows the YOLO-MSD-X-based defect detection results on the GC10-DET dataset. The images present a variety of detection challenges, with defects of different sizes and categories. YOLO-MSD-X accurately classifies and locates the majority of these defects, regardless of their size or category, demonstrating its robustness across detection scenarios.

Fig. 8
figure 8

Defect detection results of YOLO-MSD-X on the GC10-DET dataset. The image size is 2048\(\times\)1000. The results include images with a single object, multiple objects of the same class, and multiple objects from different classes. The boxes are the locations of the defects. The category and confidence are shown above the box

These experimental results confirm the superior performance, robustness and generalization of YOLO-MSD in large-size image processing scenarios. They also indicate that our proposal balances detection accuracy and model complexity.

4.2.5 Comparison experiments on the CRACK and NEU-DET datasets

To further assess the generalization capability of the proposed model, we conduct comprehensive experiments on the CRACK and NEU-DET datasets, both of which consist of small-size images. Tables 13-14 show the experiments comparing SOTA object detection models on the CRACK and NEU-DET datasets. Among them, the input image sizes of Faster-RCNN and SSD are set to 600\(\times\)600 and 300\(\times\)300, respectively, while the input image size of the other object detection models is set to 416\(\times\)416.

Table 13 Comparison with advanced models based on the NEU-DET and CRACK datasets
Table 14 Comparison with SOTA models on the open datasets NEU-DET and CRACK

For the NEU-DET dataset, the mAP of YOLO-MSD exceeds that of several advanced models at every scale. Specifically, YOLO-MSD-X achieves a strong mAP of 63.64%, exceeding SSD by 10.20%, YOLOv3 by 53.41%, YOLOv4 by 17.17%, YOLOv5-X by 8.82%, YOLOX-X by 1.94%, and YOLOv7-X by 9.29%. YOLO-MSD-Tiny obtains a good mAP of 62.15%, surpassing YOLOv4-Tiny by 4.91%, YOLOv5-Nano by 25.91% and YOLOX-Tiny by 1.35%. However, the YOLO11 series achieves the best mAP, considerably higher than YOLO-MSD, and the Faster-RCNN, YOLOv8 and YOLOv10 series also outperform YOLO-MSD. Figure 9 shows the P-R curve comparison between YOLO11-L and YOLO-MSD-S on the NEU-DET dataset. YOLO-MSD-S achieves detection performance comparable to YOLO11-L for most defect categories, indicating its strong capability in identifying most surface defects. However, its performance is inferior in the crazing and inclusion categories; in particular, the AP decreases significantly for crazing.

Fig. 9
figure 9

P-R curves of YOLO11-L and YOLO-MSD-S on the NEU-DET dataset

On the CRACK dataset, the YOLO-MSD models also outperform most of the advanced models. In particular, the mAP of our YOLO-MSD models is significantly higher than that of YOLOv4, YOLOv5, YOLOv7 and YOLOv8 across all dimensions. Compared with the YOLOX series, the YOLO-MSD series lags behind only in a few dimensions. However, the YOLOv10 and YOLO11 series surpass the YOLO-MSD models at all scales. These results show that our proposal can be successfully applied to other tasks and achieve strong performance, demonstrating its generalization capability.

Figure 10 presents partial detection results of YOLO-MSD on the NEU-DET and CRACK datasets. The results show that YOLO-MSD can accurately locate various defect types, with prediction boxes closely fitting the target boundaries and high confidence levels. On the CRACK dataset, the model maintains high detection accuracy under different background conditions and crack morphologies, demonstrating good robustness and generalization ability.

Fig. 10
figure 10

Detection results of YOLO-MSD on the NEU-DET and CRACK datasets are presented. Specifically, the results on the NEU-DET dataset are obtained using YOLO-MSD-S, while those on the CRACK dataset are based on YOLO-MSD-X

Besides detection accuracy, we also consider the trade-off between performance and computational complexity, which is crucial for real-world applications on resource-constrained edge devices. Although YOLO-MSD does not achieve the highest mAP across all settings, it maintains competitive detection performance while offering lower FLOPs and fewer parameters at comparable scales. This efficiency makes YOLO-MSD particularly suitable for deployment in practical scenarios requiring lightweight and accurate defect detection.

4.2.6 Deployment on low-performance devices

To further evaluate our proposal, the YOLO-MSD models are deployed on a resource-constrained device, the Jetson Xavier NX. The Jetson NX is equipped with a 6-core CPU and a GPU capable of 21 tera operations per second. The operating system is Ubuntu 20.04.5 LTS, and the TensorFlow framework version used is 2.11.0.

Table 15 presents the detection performance of the YOLO-MSD family deployed on the Jetson Xavier NX using the PCB dataset. As shown in Table 8, the YOLO-MSD models achieve comparable detection accuracy on the Jetson NX to that on the RTX 4090. Specifically, YOLO-MSD-X records a maximum memory usage of 2210 MB, a peak temperature of 46.45°C and a maximum power consumption of 14.59 W during inference. In contrast, YOLO-MSD-Tiny requires only 1359 MB of memory, reaches a temperature of 45.9°C, and consumes just 6.95 W. Due to the large input image sizes in the PCB dataset, the inference speed on the Jetson NX remains relatively low, with the highest observed speed being 20.82 FPS.

Table 15 Deploy the YOLO-MSD family on the Jetson NX

These results indicate a trade-off between detection accuracy and resource consumption across the YOLO-MSD family. Larger models, such as YOLO-MSD-X and YOLO-MSD-L, achieve higher accuracy but require more memory and power. In contrast, smaller variants like YOLO-MSD-Tiny consume fewer resources and run faster, making them more suitable for real-time applications on resource-constrained devices.

Overall, the deployment confirms that YOLO-MSD is a flexible and scalable solution for industrial surface defect detection, with the ability to adapt to different hardware capacities while maintaining high detection performance.
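For completeness, the speed figures above follow the usual warm-up-then-average timing pattern, sketched below. The `infer` callable is a placeholder for the deployed model (not the authors' exact script) and is assumed to block until outputs are ready.

```python
import time

def measure_fps(infer, image, warmup=10, runs=100):
    """Average end-to-end inference speed of a blocking `infer` callable."""
    for _ in range(warmup):   # warm-up: allocator, autotuning, lazy init
        infer(image)
    t0 = time.perf_counter()
    for _ in range(runs):
        infer(image)
    return runs / (time.perf_counter() - t0)   # frames per second
```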

4.3 Discussion

To evaluate our proposal, this study performs extensive experiments on five open datasets. The ablation experiments on the PCB dataset show that the SFPN structure outperforms SOTA neck structures in both feature fusion capability and model lightweighting, while MSCNet outperforms other CNN backbones in feature extraction capability. However, MSCNet is less competitive in terms of lightweighting. In the future, we aim to make the MSCNet structure more lightweight and thus easier to deploy.

Comparative experiments further demonstrate that YOLO-MSD delivers outstanding performance on large-size image datasets such as PCB, HRIPCB, and GC10-DET. These results highlight the strong feature extraction capability of the proposed method and its particular suitability for detecting multi-scale objects in high-resolution scenarios. Compared with existing SOTA models, YOLO-MSD achieves a more favourable balance between detection accuracy and model complexity.

YOLO-MSD also outperforms most advanced models on the datasets with small-size images like NEU-DET and CRACK. However, as shown in Figs. 9-10, YOLO-MSD still faces challenges when detecting less distinguishable defects on small-size images, such as subtle crazing and inclusion defects in the NEU-DET dataset. The results suggest that the current feature extraction capability of YOLO-MSD is not yet sufficient for handling targets with extremely weak visual features under low-resolution conditions. Therefore, future work will aim to enhance the model’s performance on small-size image datasets by improving feature extraction for low-contrast or fine-grained anomalies, while also expanding its adaptability to a broader range of applications.

The feasibility of the proposed approach for edge devices is demonstrated by using it on a low-power device. However, the inference speed on the Jetson NX is relatively slow. Therefore, future work aims to further optimise and lightweight the YOLO-MSD family for deployment on edge devices with even lower processing power.

The superior performance of YOLO-MSD on large-size image datasets can be attributed to its enhanced multi-scale feature extraction and fusion capabilities, which effectively capture complex object structures across high-resolution spatial domains. In contrast, on small-size datasets, the benefits of deep multi-scale representations are partially offset, and feature compression during early layers may lead to information loss for tiny objects. This observation highlights the need for further refinement of the feature extraction mechanism to better accommodate small-size image inputs.

Finally, the successful application of YOLO-MSD in industrial surface defect detection demonstrates its practical potential. The proposed method also holds promise for broader applications, including agricultural defect inspection, remote sensing image analysis, and medical anomaly detection.

5 Conclusion

In this work, we present YOLO-MSD, a lightweight and scalable model developed to address key challenges in industrial surface defect detection, particularly for multi-class, multi-scale tasks involving large-size images. By introducing a four-scale backbone constructed from MSC modules, the model achieves enhanced feature extraction and fusion across different resolutions. To further reduce computational complexity, we propose an SFPN that effectively integrates multi-dimensional information with minimal overhead. Comprehensive experiments on five public datasets demonstrate that YOLO-MSD achieves SOTA performance across various scenarios, including both large-size and small-size images. Additionally, the model delivers real-time inference and has been successfully deployed on the Jetson Xavier NX, confirming its practicality for edge computing environments. Overall, YOLO-MSD offers a robust, efficient, and deployable solution for industrial surface defect detection across diverse operating conditions.