Reinforcement Learning with TensorFlow Agents - Tutorial

This article shows how to implement reinforcement learning with the TensorFlow Agents library, focusing on a DQN agent applied to the CartPole environment. It walks through environment setup, agent configuration, data collection and training, and shows how the average cumulative reward evolves over training iterations.

Some weeks ago, I wrote an article covering different frameworks you can use to implement Reinforcement Learning (RL) in your projects, showing the ups and downs of each of them and wondering if any of them would rule them all at some point. Since then, I’ve come to know TF Agents, a library for RL based on TensorFlow and with the full support of its community (note that TF Agents is not an official Google product, but it is published as a repository under the official TensorFlow account on GitHub).

I am currently using TF Agents on a project and it has been easy to get started with, thanks to its good documentation, including tutorials. It is updated regularly and has lots of contributors, which makes me think it is possible we will see TF Agents become the standard framework for implementing RL in the near future. Because of this, I’ve decided to write this article to give you a quick introduction, so you can also benefit from this library. I have published all the code used here as a Google Colab notebook, so you can easily run it online.

You can find the GitHub repository with all the code and documentation for TF-Agents here. You won’t need to clone their repository, but it’s always useful to have the official GitHub for reference. I have implemented the following example partially following one of their tutorials (1_dqn_tutorial), but I have simplified it further and used it for playing CartPole in this article. Let’s get hands-on.

Installing TF Agents and Dependencies

As already said, TF-Agents runs on TensorFlow, more specifically TensorFlow 2.2.0. In addition, you will need to install the following packages if you don’t have them already:

pip install tensorflow==2.2.0
pip install tf-agents
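
If you want to double-check that compatible versions were picked up (a quick sanity check, not part of the original post), something like the following should work:

import tensorflow as tf
import tf_agents  # should import without errors if the installation succeeded

print(tf.__version__)  # expected to print 2.2.0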

Implementing a DQN Agent for CartPole

We will implement a DQN Agent (Mnih et al. 2015) and use it for CartPole, a classic control problem. If you would like to solve something more exciting like, say, an Atari game, you just need to change the environment name to the one you wish, choosing it from all the available OpenAI environments, as sketched below.

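For instance, a minimal sketch of what that swap would look like (the environment id 'MountainCar-v0' below is just an illustrative choice; DQN needs a discrete action space, so pick an id accordingly):

from tf_agents.environments import suite_gym, tf_py_environment

env = suite_gym.load('MountainCar-v0')  # any installed Gym id with discrete actions
env = tf_py_environment.TFPyEnvironment(env)  # same wrapper we use below for CartPole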

We start by doing all of the necessary imports. As you can see below, we implement quite a few objects from TF-Agents. These are all things we can customize and switch for our implementation.

from __future__ import absolute_import, division, print_function

import base64
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import q_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.utils import common

Environment

[GIF of the CartPole environment, from jaekookang/RL-cartpole.]

Now, we head on to create our environment. In CartPole, we have a cart with a pole on top of it, and the agent’s mission is to learn to keep the pole upright by moving the cart left and right. Note that we will use an environment from suite_gym, already included in TF-Agents, which is a slightly customized (and improved for its use with TF-Agents) version of the OpenAI Gym environments (if you’re interested, you can check the differences with OpenAI’s implementation here). We will also use a wrapper for our environment called TFPyEnvironment, which converts the numpy arrays used for state observations, actions and rewards into TensorFlow tensors. When dealing with TensorFlow models (i.e., neural networks) we use tensors, so by using this wrapper we save the effort of converting these data ourselves.

env = suite_gym.load('CartPole-v1')
env = tf_py_environment.TFPyEnvironment(env)
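
If you are curious about what the wrapped environment exposes, you can print its specs (just a quick inspection, not a required step):

print(env.observation_spec())  # shape and dtype of the state observation
print(env.action_spec())       # the discrete actions available to the agent
print(env.time_step_spec())    # full structure of a time step: step_type, reward, discount, observation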

Agent

There are different agents in TF-Agents we can use: DQN, REINFORCE, DDPG, TD3, PPO and SAC. We will use DQN, as said above. One of the main parameters of the agent is its Q (neural) network, which will be used to estimate the Q-values for the actions in each step. A q_network has two compulsory parameters: input_tensor_spec and action_spec, defining the observation shape and the action shape. We can get these from our environment, so we will define our q_network as follows:

q_net = q_network.QNetwork(env.observation_spec(),
                           env.action_spec())

There are many more parameters we can customize for our q_network, as you can see here, but for now we will go with the defaults. The agent also requires an optimizer to find the values of the q_network parameters. Let’s keep it classic and use Adam.

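For example, if you wanted a deeper or wider Q network, you could pass fc_layer_params (the layer sizes below are arbitrary and only for illustration; in this article we stick to the defaults):

q_net = q_network.QNetwork(env.observation_spec(),
                           env.action_spec(),
                           fc_layer_params=(100, 50))  # two fully connected hidden layers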

optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=0.001)

Finally, we define and initialize our agent with the following parameters:

  • The time_step_spec, which we get from our environment and which defines the structure of our time steps.
  • The action_spec, the same as for the q_network.
  • The Q network we created before.
  • The optimizer we have also created before.
  • The TD error loss function, which plays a role similar to the loss function of a regular neural network.
  • The train step counter, which is just a rank-0 tensor (a.k.a. a scalar) that counts the number of training steps we perform on the environment.

train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(env.time_step_spec(),
                           env.action_spec(),
                           q_network=q_net,
                           optimizer=optimizer,
                           td_errors_loss_fn=common.element_wise_squared_loss,
                           train_step_counter=train_step_counter)

agent.initialize()
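
A detail worth noting before moving on (my own aside, based on the TF-Agents API): the agent exposes two policies. agent.policy is the greedy policy we will use for evaluation, while agent.collect_policy adds exploration (epsilon-greedy in the case of DQN) and is the one we will use to gather experience in the training loop below.

eval_policy = agent.policy             # greedy policy, used for evaluation
collect_policy = agent.collect_policy  # exploratory policy, used to collect data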

Helper Methods: Average Cumulative Return and Collecting Data

We will also need some helper methods. The first one iterates over the environment for a number of episodes, applying the policy to choose which actions to take, and returns the average cumulative reward over these episodes. This will come in handy to evaluate the policy learned by our agent. Below, we also try the method in our environment for 5 episodes.

def compute_avg_return(environment, policy, num_episodes=10):
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0

        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return

    avg_return = total_return / num_episodes
    return avg_return.numpy()[0]

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(env, agent.policy, 5)
returns = [avg_return]
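
As a quick sanity check (my own addition, not part of the original walkthrough), you can compare this against a completely random policy, which TF-Agents provides out of the box:

from tf_agents.policies import random_tf_policy

random_policy = random_tf_policy.RandomTFPolicy(env.time_step_spec(),
                                                env.action_spec())
print(compute_avg_return(env, random_policy, 5))  # random CartPole episodes end quickly, so expect a low return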

We will also implement a method to collect data when training our agent. One of the breakthroughs of DQN was experience replay, in which we store the agent’s experiences (state, action, reward) and use them to train the Q network in batches at each step. This makes learning faster and more stable. To do this, TF-Agents includes the object TFUniformReplayBuffer, which stores these experiences so they can be re-used later, so we first create this object, which we will need later on.

In this method, we take an environment, a policy and a buffer; we get the current time_step, formed by the state observation and the reward at that time_step, the action the policy chooses, and then the next time_step. Then, we store this transition in the replay buffer. Note that the replay buffer stores an object called Trajectory, so we create this object with the elements named before and then save it to the buffer using the add_batch method.

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=env.batch_size,
    max_length=100000)

def collect_step(environment, policy, buffer):
    time_step = environment.current_time_step()
    action_step = policy.action(time_step)
    next_time_step = environment.step(action_step.action)
    traj = trajectory.from_transition(time_step,
                                      action_step,
                                      next_time_step)
    # Add trajectory to the replay buffer
    buffer.add_batch(traj)
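
Since the training loop below calls collect_step several times in a row, you may find it convenient to wrap it in a small helper (entirely optional, my own addition):

def collect_data(environment, policy, buffer, steps):
    # Take `steps` environment steps with the given policy,
    # storing each resulting transition in the replay buffer.
    for _ in range(steps):
        collect_step(environment, policy, buffer)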

Training the Agent

We can finally train our agent. We define the number of collection steps we take in every iteration; after these steps, we train our agent, modifying its policy. For now, let’s just use 1 step per iteration. We also define the batch size with which our Q network will be trained, and an iterator so we can iterate over the agent’s experience.

Then, we gather some initial experience for our buffer and start the usual RL loop: get experience by acting on the environment, train the policy, and repeat. We additionally print the loss every 200 steps and evaluate the performance of the agent every 1000 steps.

collect_steps_per_iteration = 1
batch_size = 64
dataset = replay_buffer.as_dataset(num_parallel_calls=3,
                                   sample_batch_size=batch_size,
                                   num_steps=2).prefetch(3)
iterator = iter(dataset)
num_iterations = 20000

env.reset()

# Fill the replay buffer with some initial experience.
for _ in range(batch_size):
    collect_step(env, agent.policy, replay_buffer)

for _ in range(num_iterations):
    # Collect a few steps using collect_policy and save to the replay buffer.
    for _ in range(collect_steps_per_iteration):
        collect_step(env, agent.collect_policy, replay_buffer)

    # Sample a batch of data from the buffer and update the agent's network.
    experience, unused_info = next(iterator)
    train_loss = agent.train(experience).loss

    step = agent.train_step_counter.numpy()

    # Print loss every 200 steps.
    if step % 200 == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss))

    # Evaluate agent's performance every 1000 steps.
    if step % 1000 == 0:
        avg_return = compute_avg_return(env, agent.policy, 5)
        print('step = {0}: Average Return = {1}'.format(step, avg_return))
        returns.append(avg_return)

Plot

We can now plot how the average cumulative reward varies as we train the agent. For this, we will use matplotlib to make a very simple plot.

iterations = range(0, num_iterations + 1, 1000)
plt.plot(iterations, returns)
plt.ylabel('Average Return')
plt.xlabel('Iterations')
[Plot: Average Return over 5 episodes of our DQN agent. You can see how the performance increases over time as the agent becomes more experienced.]

Complete Code

I have shared all the code in this article as a Google Colab notebook. You can run all the code directly as it is; if you would like to change it, you have to save a copy to your own Google Drive account, and then you can do whatever you like. You can also download it to run it locally on your computer, if you wish.

Where to go from here

  • You can also check other environments in which to try TF-Agents (or any RL algorithm of your choice) in this other article I wrote some time ago.

As usual, thank you for reading! Let me know in the responses what you think about TF-Agents, and also if you have any questions or found any 🐛 in the code.

Originally published at: https://towardsdatascience.com/reinforcement-learning-with-tensorflow-agents-tutorial-4ac7fa858728
