"Reinforcement Learning: An Introduction", Chapter 1

This post walks through the principles of reinforcement learning, how it differs from supervised and unsupervised learning, and how a goal-directed agent interacting with an unknown environment learns to maximize a reward signal. It covers core concepts such as the exploration-exploitation balance, policies, and reward signals.


Contents

Introduction

1.1 Reinforcement Learning

Characteristics of reinforcement learning

  Differences from supervised learning

  Differences from unsupervised learning

Difficulties and challenges

1.2 Elements of Reinforcement Learning

Summary


Introduction

Learning from interaction is a foundational idea underlying nearly all theories of learning and intelligence.

Reinforcement learning can be compared to how we naturally learn. Picture a newborn infant interacting with the world: no teacher instructs it, yet it looks around, moves its arms and legs, and explores the world through trial after trial. Likewise, when learning to drive or holding a conversation, we continually adjust our behavior based on the responses our current actions produce.

1.1 Reinforcement Learning

Reinforcement learning is learning what to do: how to map situations to actions so as to maximize a reward. The learner is not told which action to take; instead, it must discover through repeated trials which actions yield the most reward. Actions affect not only the immediate reward but also the next state and, through it, all subsequent rewards. (How far an action's influence reaches depends on the specific setting.) In formula form: Q(s_1) = r_2 + \gamma \cdot r_3 + \gamma^{2} \cdot r_4 + \gamma^{3} \cdot r_5 + \dots, where the parameter \gamma (gamma) is the discount applied to future rewards. For example, when \gamma = 0, Q(s_1) = r_2, meaning the agent cares only about the immediate reward.
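To make the discounting concrete, here is a minimal Python sketch (not from the book; the function name and reward values are illustrative assumptions) that computes the discounted return above for a finite sequence of rewards.

    # Minimal sketch: Q(s1) = r2 + gamma*r3 + gamma^2*r4 + ...
    # for a finite list of rewards received after state s1.
    def discounted_return(rewards, gamma):
        """Sum rewards after s1, each discounted by gamma per step."""
        total = 0.0
        for k, r in enumerate(rewards):       # k = 0 for r2, 1 for r3, ...
            total += (gamma ** k) * r
        return total

    # With gamma = 0 only the immediate reward r2 counts.
    print(discounted_return([1.0, 2.0, 3.0], gamma=0.0))  # -> 1.0
    print(discounted_return([1.0, 2.0, 3.0], gamma=0.9))  # -> 1.0 + 0.9*2 + 0.81*3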

Classic reinforcement learning diagram: agent-environment interaction
from the lecture "Introduction to Reinforcement Learning" by D. Silver

Characteristics a learning agent must have (a minimal interaction-loop sketch follows this list):

  1. It senses, to some extent, the state of its environment (sensation);
  2. It takes actions that affect the state of the environment (action);
  3. It has a goal, or goals, relating to the state of the environment (goal).
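The loop below is a minimal sketch of the interaction cycle in the diagram above; SomeEnvironment and SomeAgent are hypothetical placeholders, not part of the book or any specific library.

    # Minimal agent-environment interaction loop (illustrative only).
    class SomeEnvironment:
        def reset(self):
            return 0                             # initial state (sensation)
        def step(self, action):
            next_state = action                  # toy dynamics
            reward = 1.0 if action == 1 else 0.0
            done = True                          # toy episode ends after one step
            return next_state, reward, done

    class SomeAgent:
        def act(self, state):
            return 1                             # toy policy: always choose action 1

    env, agent = SomeEnvironment(), SomeAgent()
    state = env.reset()                          # 1. sense the environment's state
    done = False
    while not done:
        action = agent.act(state)                # 2. take an action that affects the state
        state, reward, done = env.step(action)   # 3. receive a reward tied to the goal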

Characteristics of reinforcement learning

  Differences from supervised learning

Supervised learning: learns from a training set of correctly labeled examples; from a set of correctly labeled actions, it infers the correct action to take in new states not present in the training set. (It is impractical for a training set to contain correct, representative actions for every state.)

Reinforcement learning: the agent must be able to learn from the experience it gathers through its own exploration.

  Differences from unsupervised learning

Unsupervised learning: looks for hidden structure in an unlabeled dataset (find hidden structure).

Reinforcement learning: aims to maximize a reward signal (maximize a reward signal).

Common to both: neither relies on a correctly labeled dataset.

In addition, a key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an unknown environment. By contrast, some approaches address only isolated subproblems without explaining how they fit into a larger framework.

Difficulties and challenges

Balancing exploration (trying the unknown) and exploitation (using what is already known): the exploration-exploitation dilemma.

The agent has to exploit its existing experience to obtain reward, but it also has to explore actions it has not yet tried so that it can make better selections in the future. The agent must try a variety of actions and progressively favor those that perform best. On a stochastic task, each action must be tried many times before a reliable estimate of its expected reward can be obtained.
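One common way to handle this trade-off, shown here only as an illustrative sketch (the chapter itself does not prescribe a particular method), is epsilon-greedy action selection: explore with a small probability, otherwise exploit the best-looking action.

    # Epsilon-greedy selection over estimated action values (illustrative sketch).
    import random

    def epsilon_greedy(estimated_values, epsilon=0.1):
        """With probability epsilon pick a random action (explore);
        otherwise pick the action with the highest estimate (exploit)."""
        if random.random() < epsilon:
            return random.randrange(len(estimated_values))            # explore
        return max(range(len(estimated_values)),
                   key=lambda a: estimated_values[a])                 # exploit

    # Example: value estimates for three actions, built from past experience.
    q = [0.2, 0.8, 0.5]
    action = epsilon_greedy(q, epsilon=0.1)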

1.2 Elements of Reinforcement Learning

Besides the agent and the environment, a reinforcement learning system has four sub-elements (see the sketch after this list):

  • a policy
  • a reward signal
  • a value function
  • a model of the environment (optional)
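As a rough illustration (the names and toy values below are assumptions, not from the book), the four sub-elements might appear in code as follows.

    # Toy mapping of the four sub-elements to code (illustrative only).
    policy = {0: 1, 1: 0}                    # policy: maps each state to an action
    value_function = {0: 0.5, 1: 1.2}        # value function: long-run worth of each state

    def reward_signal(state, action):
        # reward signal: immediate feedback from the environment
        return 1.0 if action == 1 else 0.0

    def environment_model(state, action):
        # optional model: predicts the next state and reward, useful for planning
        predicted_next_state = action        # toy dynamics
        predicted_reward = reward_signal(state, action)
        return predicted_next_state, predicted_reward

    # Example: what the model predicts if we follow the policy from state 0.
    print(environment_model(0, policy[0]))   # -> (1, 1.0)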

Summary

Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision making.

What distinguishes reinforcement learning from other computational approaches is its emphasis on an agent learning from direct interaction with its environment, without relying on external supervision or a complete model of the environment.
