1 Introduction

In the digital age, cybersecurity is increasingly critical. Traditional manual penetration testing is costly and inefficient as cyberspace grows. Automated penetration testing (AutoPT) has emerged to improve efficiency [1], but it still faces limitations, particularly in the intelligence of its decision-making. Reinforcement learning offers solutions to these challenges [2]. Most AutoPT research assumes a fully observable environment modeled as a Markov Decision Process (MDP), which does not reflect the partial information available in practice. The Partially Observable Markov Decision Process (POMDP) model better suits real-world scenarios, as shown by Zhang et al. [3] and Schwartz et al. [4], who incorporate scanning actions and defender strategies. AI in cybersecurity is a promising field, and its application to penetration testing enhances efficiency [5] [6]. Recent advances include frameworks such as GAIL-PT [7], which combines expert knowledge with AI, and the use of large language models [8]. These developments position AutoPT to improve vulnerability identification and provide more effective security solutions.

Based on an analysis of existing methods, this paper first proposes a text embedding method based on raw scanning information, addressing the challenge of incomplete information during penetration testing. Next, an integrated decision-making framework spanning perception to decision-making is introduced. Finally, the effectiveness of the proposed method is verified with the LSTM-PPO algorithm. The remainder of this paper is organized as follows: Sect. 2 reviews and briefly discusses related work. Section 3 describes our proposed model. Section 4 details the proposed raw scanning text embedding approach and the improved LSTM-PPO approach. Section 5 presents experimental results and analysis. Section 6 concludes the paper and discusses future work.

2 Related Works

This section reviews previous studies that have utilized the end-to-end (E2E) approach in relevant research fields and discusses several approaches to improving AutoPT.

The E2E research approach has achieved significant breakthroughs across various fields. In natural language processing, E2E approaches process raw text inputs directly to perform tasks such as machine translation without manual feature extraction [9] [10]. In computer vision, E2E approaches are widely applied to challenges such as object identification, scene detection, and image segmentation, producing outputs such as image classifications and object detections directly from raw images or videos and eliminating the need for manual feature design [11]. For speech recognition, E2E approaches convert speech signals directly into text, again bypassing hand-crafted features [12]. Autonomous driving also benefits from E2E approaches, which link sensor inputs directly to vehicle control outputs [13] [14]. In penetration testing, E2E reinforcement learning models streamline the entire process from scanning information to action execution and enhance the efficiency of security assessments [15] [16]. Thus, there is significant potential for the development of E2E automated penetration testing.

Challenges persist in the practical application of reinforcement-learning-based penetration testing. For example, agents struggle with convergence and decision efficiency because of high-dimensional discrete action spaces. To address these challenges, Yang et al. [17] modeled penetration testing as an MDP and introduced a coverage-based masking mechanism on top of the Proximal Policy Optimization (PPO) algorithm, steering agents toward future exploration and away from previously selected actions. Guo et al. [18] modeled Advanced Persistent Threats (APTs) as POMDPs and proposed the PLAPT framework based on the PPO algorithm, which successfully reduced the dimensionality of large action spaces. Despite this progress in improving the efficiency of penetration testing, these approaches do not yet fully represent the agent’s historical states or consider the realism of training environments, which remain directions for future research.

3 Reinforcement Learning Model

3.1 POMDP of AutoPT

Although reinforcement learning assists agents in decision-making by optimizing reward functions, adjusting environmental states, and defining action spaces, in real-world penetration testing scenarios agents often face constraints due to partial observability. Their access is typically confined to limited and possibly unreliable data sources, such as network traffic and system logs from specific nodes, which limits their ability to fully comprehend the target system. The POMDP, as a widely adopted formalism, offers an approach to decision-making for agents that cannot fully observe the environment state. Therefore, to tackle the obstacles presented by environments with incomplete information, we model the AutoPT problem as a POMDP and employ high-performance reinforcement learning frameworks to handle penetration testing tasks in partially observable environments, thereby overcoming the obstacles encountered by conventional penetration testing approaches. A POMDP is composed of a tuple \(\langle S, A, T, O, R, \gamma \rangle \), where \(S\) represents the state space, \(A\) the action space, \(T\) the transition probability function, \(O\) the observation space, \(R\) the reward function, and \(\gamma \) the discount factor. The decision process of the agent based on the POMDP is shown in Fig. 1. The agent begins in an initial state \(s_0\) containing essential information about the target network. At each time step \(t\), it selects an action \(a_t\) based on the observation \(o_t\) using a policy function \(\pi (o_t, \theta )\). The environment responds with a reward \(r_t\) and updates the state to \(s_{t+1}\).

Fig. 1. Agent decision process based on dynamic fusion of observation information.

This process continues until the agent either exhausts its steps or achieves its goal, completing the penetration test. Through this iterative interaction, the agent improves its ability to identify vulnerabilities, enhancing the efficiency of penetration testing.
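To make this interaction loop concrete, the following Python sketch mirrors the decision process described above under an assumed interface: the `env` object (with `reset`/`step`) and the `policy` object (with `initial_hidden`/`act`) are hypothetical placeholders, not the authors’ implementation.

```python
# Minimal sketch of the POMDP interaction loop described above. The `env` and
# `policy` objects and their methods are hypothetical placeholders.
def run_episode(env, policy, max_steps=100):
    obs = env.reset()                    # initial observation derived from s_0
    hidden = policy.initial_hidden()     # recurrent memory, empty at episode start
    total_reward = 0.0
    for t in range(max_steps):
        action, hidden = policy.act(obs, hidden)    # a_t ~ pi(o_t, theta)
        obs, reward, done, info = env.step(action)  # environment returns r_t and o_{t+1}
        total_reward += reward
        if done:                         # goal reached: penetration test completed
            break
    return total_reward
```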

3.2 Scan Information Embedding

Based on the POMDP model we constructed, \(O = \{o_1, o_2, \ldots , o_n\}\) represents the observation space, i.e., the set of system states observable by the penetration tester for the target hosts. The information regarding the target hosts can be represented as \(H = \{h_1, h_2, \ldots , h_n\}\), for example:

$${h_1} = \left\{ \begin{array}{l} {\textbf {IP:}}\ \text {192.168.1.32}\\ {\textbf {Port:}}\ \text {22, 8000}\\ {\textbf {Services:}}\ \text {ssh OpenSSH 9.1p1 Debian 1, http Werkzeug httpd 1.0.1}\\ {\textbf {OS:}}\ \text {Linux}\\ {\textbf {Vulnerability:}}\ \text {CVE-2017-8291}\\ {\textbf {Web fingerprint:}}\ \text {HTML5 HTTPServer[Werkzeug/1.0.1 Python/3.5.3]} \end{array} \right. $$

To efficiently encode scanning data into the intelligent penetration testing model, we utilize a denoising autoencoder built upon the Transformer architecture, referred to as TSDAE [19], for training action embedding vectors. This approach converts scanning information into vector representations and integrates them into the model’s input, as shown in Fig. 2. Here, “[CLS]” denotes the start of a textual sequence, while “[EOS]” marks its end. After encoding with the TSDAE model, a vector representation of the scanned textual information is obtained. The coding process comprises collecting scanning data, tokenization, generating word embeddings, and passing them through the encoder; by following it, we convert the raw scan data into a structured vector representation for the agent’s subsequent decision-making.

Fig. 2. Information encoding process based on TSDAE.
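As an illustration of how a scan record such as \(h_1\) could be turned into an observation vector, the sketch below uses the sentence-transformers library with a TSDAE-trained encoder; the checkpoint path and the flattening of the record into a single string are assumptions, not the authors’ exact pipeline.

```python
# Sketch: turning one host's scan record into a fixed-size observation vector
# with a TSDAE-trained SentenceTransformer. The checkpoint path is a placeholder.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("output/tsdae-scan-encoder")  # assumed local TSDAE checkpoint

scan_text = (
    "IP: 192.168.1.32; Port: 22,8000; "
    "Services: ssh OpenSSH 9.1p1 Debian 1, http Werkzeug httpd 1.0.1; "
    "OS: Linux; Vulnerability: CVE-2017-8291; "
    "Web fingerprint: HTML5 HTTPServer[Werkzeug/1.0.1 Python/3.5.3]"
)

obs_vector = encoder.encode(scan_text)  # e.g. a 768-dimensional numpy array
print(obs_vector.shape)
```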

3.3 Historical Observation Information Fusion

The fusion of historical information maximizes the utilization of previously observed data to enhance the environment understanding and decision-making capabilities of intelligent penetration testing models. Integrating historical information provides contextual cues, improving the agent’s comprehension of past state information. This allows the agent to better account for previous observations, leading to more rational inference and planning during decision-making. We denote the sequence of historical observations as \(\{o_1, o_2, \ldots , o_t\}\), as illustrated in Fig. 3, where \(o_t\) represents the environmental state observed by the agent at time step \(t\). In the intelligent penetration testing model, we introduce an additional dimension that represents historical observation information as a vector \(h_o \in \mathbb {R}^d\). We then concatenate it with the representation vector \(s_t\) of the current environmental state to form the model input. This approach integrates historical observational data into the model, enhancing its understanding of past states. To achieve this fusion, we introduce an enhanced approach called LSTM-PPO, which is detailed in the following sections.

Fig. 3. Implicit dynamic fusion mechanism by LSTM.
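A minimal PyTorch sketch of this fusion step follows, assuming illustrative dimensions: an LSTM summarizes the past observations into \(h_o\), which is concatenated with the current state \(s_t\). It is a conceptual sketch, not the exact LSTM-PPO architecture detailed in Sect. 4.

```python
# Sketch of the fusion in Sect. 3.3: an LSTM summarizes past observations
# o_1..o_{t-1} into a vector h_o, which is concatenated with the current state s_t.
# Dimensions are illustrative.
import torch
import torch.nn as nn

d_obs, d_hidden = 768, 256
memory = nn.LSTM(input_size=d_obs, hidden_size=d_hidden, batch_first=True)

past_obs = torch.randn(1, 5, d_obs)          # o_1 ... o_5 (one trajectory)
s_t = torch.randn(1, d_obs)                  # current observation/state embedding

_, (h_n, _) = memory(past_obs)               # h_n: (num_layers, batch, d_hidden)
h_o = h_n[-1]                                # summary of historical observations
fused_input = torch.cat([h_o, s_t], dim=-1)  # input to the decision model
```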

4 Our Approach

To address the limitations of current penetration testing decision-making methods, we propose the LSTM-PPO approach, which leverages raw scan data for text embedding. Unlike traditional rule-based methods, this approach employs LSTM networks to capture temporal dependencies in historical data and uses the PPO algorithm to optimize the agent’s policy, enhancing decision-making in the testing environment. As illustrated in Fig. 4, we propose an algorithmic enhancement framework based on historical information fusion, comprising four main modules: the information scanning module, the text embedding module, the LSTM-based policy module (Actor), and the evaluation module (Critic).

Fig. 4. Algorithm improvement based on historical information fusion.

Initially, the agent scans vulnerability data from the Vulhub [20] digital environment and encodes the scan details using the TSDAE model. These encoded vectors are input into the reinforcement learning algorithm to inform action selection. The environment then returns the reward to the agent. In our framework, both the Actor policy network and Critic value network incorporate LSTM networks to capture historical state information. The LSTM processes the agent’s state at each time step, producing encoded memory and prediction data, which are then used by the Actor for action decisions and the Critic for evaluating and refining the policy based on the received rewards.

4.1 TSDAE for Embedding

In our constructed penetration testing environment, after the agent retrieves scanning information from the digital environment, it first encodes the text information. To obtain an embedded representation of the agent’s action space, we utilize a denoising autoencoder based on the Transformer structure to train action embedding vectors, and we use TSDAE to transform vulnerability description text into fixed-size vectors. The training process of TSDAE is outlined in Algorithm 1.

Algorithm 1. Training Procedure of TSDAE for Scanning Information

We begin by tokenizing the text and then use a model pre-trained with the TSDAE objective to obtain its encoded representation. As shown in Fig. 5, encoding with TSDAE yields vector representations of the text, where each element holds the value of the corresponding feature. These features are learned autonomously during the TSDAE pre-training phase, with the goal of better capturing the semantic information in the text. The resulting vector representations allow the model to grasp the textual meaning with greater precision.

Fig. 5. Text embedding procedure.
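The sketch below shows one way Algorithm 1 could be realized with the TSDAE recipe from the sentence-transformers library; the base model, the toy corpus, and the hyperparameters are illustrative assumptions rather than the exact settings used in our experiments.

```python
# Sketch of Algorithm 1 via the sentence-transformers TSDAE recipe. Base model,
# corpus, and hyperparameters are illustrative placeholders.
import nltk
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

nltk.download("punkt")  # used by the default denoising (token deletion) function

model_name = "bert-base-uncased"                       # assumed base encoder
word_embedding = models.Transformer(model_name)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding, pooling])

train_sentences = [                                    # scan-information corpus (toy sample)
    "IP: 192.168.1.32; Port: 22,8000; Services: ssh OpenSSH 9.1p1; OS: Linux; Vulnerability: CVE-2017-8291",
    "IP: 192.168.1.25; Port: 22,2222; Services: ssh OpenSSH 9.1p1; OS: Linux; Vulnerability: CVE-2018-10933",
]
train_data = datasets.DenoisingAutoEncoderDataset(train_sentences)  # adds noise to the inputs
loader = DataLoader(train_data, batch_size=2, shuffle=True)
loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name,
                                       tie_encoder_decoder=True)

model.fit(train_objectives=[(loader, loss)], epochs=1,
          scheduler="constantlr", optimizer_params={"lr": 3e-5},
          weight_decay=0, show_progress_bar=True)
model.save("output/tsdae-scan-encoder")                # checkpoint used in the Sect. 3.2 sketch
```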

4.2 LSTM-PPO Approach

The PPO algorithm is a reinforcement learning technique that updates policies while constraining the step size of each update [21]. Incorporating LSTM networks into the PPO algorithm leverages their ability to handle sequential data and mitigate long-term dependency issues. The inputs to the LSTM-based Actor and Critic modules include the agent’s environmental state at each time step, while the outputs consist of encoded memory and prediction information along with the agent’s current action. The operation of the prediction module can be represented as:

$$\begin{aligned} \tilde{o}_{t} = \mathrm {LSTM}(s_{at-1}, h_{t-1}, c_{t-1}), \end{aligned}$$
(1)

where \(h_{t-1}\) and \(c_{t-1}\) denote the hidden and cell states of the LSTM network after \(t-1\) decision steps, and \(s_{at-1}\) denotes the vector composed of the agent’s state \(s_{t-1}\) and action \(a_{t-1}\). The output memory and prediction information \(\tilde{o}_{t}\) is concatenated with the agent’s state \(s_{t}\) and fed into the policy and value networks, substantially enriching the input information available for decision-making. In terms of network structure, \(\tilde{o}_{t}\) serves as an intermediate variable connecting the LSTM output layer with the input layers of the value and policy networks, making the LSTM a shared prefix network. By feeding the LSTM’s encoding of historical state information into the next decision, the agent fully accounts for its history, supporting more globally informed optimization.
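A minimal PyTorch sketch of Eq. (1) and the shared-prefix wiring follows; the action encoding and all layer sizes are illustrative assumptions, not the exact implementation.

```python
# Sketch of Eq. (1): the shared LSTM prefix consumes the previous state-action
# vector together with its carried hidden/cell state (h_{t-1}, c_{t-1}) and
# emits the memory/prediction vector o_tilde_t. Sizes are illustrative.
import torch
import torch.nn as nn

d_state, d_action, d_hidden = 768, 32, 256
shared_lstm = nn.LSTM(input_size=d_state + d_action, hidden_size=d_hidden, batch_first=True)

s_prev = torch.randn(1, 1, d_state)           # s_{t-1}
a_prev = torch.randn(1, 1, d_action)          # a_{t-1} (e.g. an embedded action)
sa_prev = torch.cat([s_prev, a_prev], dim=-1) # s_{a,t-1}

h_prev = torch.zeros(1, 1, d_hidden)          # h_{t-1}
c_prev = torch.zeros(1, 1, d_hidden)          # c_{t-1}

o_tilde, (h_t, c_t) = shared_lstm(sa_prev, (h_prev, c_prev))
s_t = torch.randn(1, 1, d_state)              # current state s_t
actor_critic_input = torch.cat([o_tilde, s_t], dim=-1)  # shared input to policy and value heads
```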

LSTM-PPO effectively addresses the issue of insufficient observation in partially observable environments, offering a new way to tackle such reinforcement learning challenges. The training procedure of LSTM-PPO is shown in Algorithm 2.

Algorithm 2. LSTM-PPO Training Procedure

The relationship between the Actor, the Critic, the environment, and the reward is shown in Fig. 6, where we add an LSTM to the Actor architecture to better perceive historical state information. In the Actor network, we stack LSTM layers and append a fully connected layer to the final LSTM layer to generate actions. Each LSTM layer feeds its output to the next layer, up to the final fully connected layer.

Fig. 6. Actor-Critic architecture based on LSTM network.
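The following PyTorch sketch illustrates such an Actor head: stacked LSTM layers followed by a fully connected layer and a softmax over actions, in the spirit of Eqs. (2) and (3) below. The layer sizes, number of actions, and sampling step are illustrative assumptions.

```python
# Sketch of the Actor head in Fig. 6: stacked LSTM layers followed by a fully
# connected layer whose softmax output gives the action distribution.
# Layer sizes and the action count are illustrative.
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    def __init__(self, obs_dim=768, hidden_dim=256, num_layers=2, num_actions=50, dropout=0.1):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, num_layers=num_layers,
                            dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_actions)   # appended to the final LSTM layer

    def forward(self, obs_seq, hidden=None):
        out, hidden = self.lstm(obs_seq, hidden)       # each LSTM layer feeds the next
        logits = self.fc(out[:, -1, :])                # hidden state h_T at the final time step
        return torch.softmax(logits, dim=-1), hidden   # action probabilities

actor = LSTMActor()
probs, hidden = actor(torch.randn(1, 5, 768))
action = torch.distributions.Categorical(probs).sample()
```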

The output of the LSTM, denoted as output, is used for further computation of action probabilities and value estimation.

$$\begin{aligned} output = f_{c}(h_{T}), \end{aligned}$$
(2)

where \(h_{T}\) signifies the hidden state at the final time step, and \(f_{c}\) indicates the fully connected layer. To transform the output of the fully connected layer into a probability distribution for actions, we utilize the softmax function:

$$\begin{aligned} P(A_t=a \mid S_t;\theta ) = \frac{e^{Q(S_t,a;\theta )}}{\sum _{a'} e^{Q(S_t,a';\theta )}}, \end{aligned}$$
(3)

where \(Q(S_t, a;\theta )\) represents the estimated value for the state-action pair \((S_{t},a)\) and \(\theta \) denotes the neural network parameters. We use the mean squared error (MSE) loss to update the Critic network parameters:

$$\begin{aligned} L^{Critic}(\theta _{v})=\frac{1}{N}\sum _{i}(V(S_{i};\theta _{v})-V^{target}(S_{i}))^2, \end{aligned}$$
(4)

where \(V^{target}(S_{i})\) represents the target value. The agent processes observation information through the LSTM network architecture and computes reward values and success rates during interactions with the environment. The cumulative reward is denoted as eval_rewards and the success rate as eval_success_rate.

$$\begin{aligned} eval\_rewards = \sum _{i=1}^{N} \sum _{j=1}^{M_{i}} r_{ij}, \end{aligned}$$
(5)

where \(N\) represents the number of targets in the target list, i.e., \(\text {len(target}\_\text {list)}\), and \(M_i\) denotes the total number of steps taken for the \(i\)-th target.

$$\begin{aligned} eval\_success\_rate = \frac{len(success\_list)}{len(target\_list)}, \end{aligned}$$
(6)

where \(\text {len(success}\_\text {list)}\) indicates the number of completed targets and \(\text {len(target}\_\text {list)}\) the total number of targets in the target list. In the LSTM-PPO approach, we leverage scanning information from a Vulhub-based digital environment. By integrating the PPO algorithm with an LSTM network, the method enhances the agent’s environmental perception and decision-making capabilities, yielding more precise penetration testing outcomes.
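For concreteness, the evaluation metrics of Eqs. (5) and (6) can be computed from per-target reward logs as in the short sketch below; the reward values and target list are illustrative placeholders.

```python
# Sketch of Eqs. (5)-(6): eval_rewards sums per-step rewards over all targets,
# and eval_success_rate is the fraction of targets that were compromised.
# The per-target reward lists are illustrative.
step_rewards = {                      # r_ij for each target i and step j
    "192.168.1.25": [0, 0, 10, 100],
    "192.168.1.32": [0, 5, -1],
}
success_list = ["192.168.1.25"]       # targets reached within the step budget
target_list = list(step_rewards)

eval_rewards = sum(sum(r) for r in step_rewards.values())   # Eq. (5)
eval_success_rate = len(success_list) / len(target_list)    # Eq. (6)
print(eval_rewards, eval_success_rate)
```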

5 Experiments

5.1 Dataset

CyberBattleSim [22] and NASim [23] are tools for simulating network attacks and defenses that provide rich simulated network scenarios and attack models. However, they mainly focus on attack and defense behaviors in specific scenarios, lack comprehensive coverage of real system vulnerabilities, and cannot offer a realistic and diverse vulnerability environment. In contrast, Vulhub significantly reduces the complexity of environment setup through pre-configured vulnerability images and simple deployment commands. More importantly, Vulhub is built from real vulnerability environments and therefore reflects actual attack scenarios more accurately, so we choose Vulhub as the experimental scenario for this paper. After setting up the experimental environment, we used the Nmap scanning tool to perform comprehensive port and service scans on each virtual machine, collecting its IP address, open ports, running services, operating system information, and vulnerability information. We convert the results into a standardized JSON format, where each JSON object describes one virtual machine. For example, a virtual machine with the CVE-2018-10933 vulnerability is described as {“ip”: “192.168.1.25”, “port”: [“22”, “2222”], “services”: [“ssh OpenSSH 9.1p1”], “os”: “Linux”, “vulnerability”: [“CVE-2018-10933”]}.
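A short sketch of building and serializing this standardized record for the CVE-2018-10933 example follows; parsing the raw Nmap output into these fields is omitted, and the field names simply mirror the description above.

```python
# Sketch of the standardized JSON record for one Vulhub virtual machine
# (CVE-2018-10933 example from Sect. 5.1). Nmap output parsing is omitted.
import json

host_record = {
    "ip": "192.168.1.25",
    "port": ["22", "2222"],
    "services": ["ssh OpenSSH 9.1p1"],
    "os": "Linux",
    "vulnerability": ["CVE-2018-10933"],
}
print(json.dumps(host_record, indent=2))
```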

5.2 Experimental Results

The number of layers and the dropout rate of the LSTM network play a crucial role in the overall network architecture. To more accurately enhance the agent’s ability to perceive historical information, we conducted ablation experiments on these two parameters and compared how the cumulative reward and success rate evolve with the training step. As shown in Fig. 7, we vary the number of layers from 1 to 5; as shown in Fig. 8, we vary the dropout rate from 0 to 0.5.

Fig. 7. Effect of different layer values in different network scales.

Fig. 8. Effect of different dropout values in different network scales.
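For reference, the swept configuration space can be expressed as in the sketch below, assuming a PyTorch LSTM backbone; the input and hidden sizes are placeholders, and the per-configuration training and logging loop is omitted.

```python
# Sketch of the ablation grid in Sect. 5.2: the number of stacked LSTM layers
# and the dropout rate are the two swept hyperparameters. Sizes are placeholders.
import torch.nn as nn

for num_layers in range(1, 6):                        # 1 to 5 layers (Fig. 7)
    for dropout in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):    # dropout sweep (Fig. 8)
        lstm = nn.LSTM(input_size=768, hidden_size=256,
                       num_layers=num_layers,
                       dropout=dropout if num_layers > 1 else 0.0,
                       batch_first=True)
        # train the LSTM-PPO agent with this configuration and log the
        # cumulative reward and success rate per step (omitted)
```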

To evaluate the effectiveness of our approach, we conducted experiments in a chained network simulation environment with 20, 30, and 40 target hosts, and in a large-scale network simulation environment.

Through meticulous logging of cumulative reward values after each training iteration, we plotted the relationship between training progress and cumulative reward, and between training progress and success rate. As shown in Fig. 9, the cumulative reward increases steadily as training progresses, reflecting the model’s continuous learning and adaptation to the complexity of the task environment. Compared to the PPO, DQN, and random baselines, our model demonstrates an advantage from the initial stages of training, and this advantage is further consolidated and expanded throughout subsequent training iterations.

Fig. 9. Effect of different algorithms in different network scales.

The experimental results in Fig. 9 show that with 20 hosts, the convergence reward of the LSTM-PPO approach increases by 18.86% compared to the PPO algorithm and by 392.06% compared to the DQN algorithm. With 30 hosts, it increases by 19.32% over PPO and by 620.98% over DQN. With 40 hosts, it increases by 10.29% over PPO and by 1762.10% over DQN. In the large-scale network scenario, it increases by 29.47% over PPO and by 383.43% over DQN. Therefore, when the number of hosts is 20 or more, the LSTM-PPO approach outperforms the other algorithms in both cumulative reward and maximum success rate in chain networks and complex networks. This is because, once the network reaches a certain scale, the LSTM architecture can better exploit historical states to assist the agent’s decisions. Consequently, the LSTM-PPO approach achieves faster convergence and higher utility in practical network applications.

6 Conclusion

This paper introduces an E2E AutoPT approach that emphasizes raw scanning information embedding and historical observation fusion. By applying the TSDAE algorithm to represent raw scanning data, we retain relevant information without omitting details. An innovative reinforcement learning approach combining PPO with LSTM is presented to guide the decision-making process by dynamically capturing historical observations and updating internal states. LSTM-PPO thus emerges as an effective AutoPT approach that adapts to dynamic changes in the environment and makes more accurate and robust decisions. Experimental validation demonstrates the superiority of the LSTM-PPO approach across different network scales, providing more effective and efficient solutions for AutoPT agents.

In the future, we intend to investigate more comprehensive PT information embedding approaches that incorporate the topologies of target networks. Furthermore, we are interested in hierarchical organizations of deep reinforcement learning algorithms to facilitate fine-grained decomposition of actions in AutoPT processes. Lastly, we aim to build a fully autonomous AutoPT agent capable of analyzing diverse systems protected by advanced protection approaches, such as dynamic defense mechanisms.