NIUHE

The ordinary days we live through may in fact be a succession of miracles.

Microeconomics - Interdependence and the Gains from Trade

Microeconomics - Thinking Like an Economist

Microeconomics - Ten Principles of Economics

AlphaGo, AlphaGo Zero and AlphaZero

Go

Go originated in ancient China and is one of the world's oldest board games. In the Song dynasty, Shen Kuo discussed the number of possible Go games in his Dream Pool Essays (《梦溪笔谈》), writing that "one would roughly have to write the character 万 forty-three times in a row, and that is the full count of games", meaning that the number of variations takes forty-three 万 (ten-thousand) characters to write out, i.e. about \(10^{172}\). Under the rules of Go, stones without liberties cannot survive; once such states are excluded, there are about \(2.08\times10^{170}\) legal positions. Robertson and Munro proved in 1978 that Go is PSPACE-hard, with the memory required to compute a winning strategy exceeding \(10^{600}\), far more than the roughly \(10^{75}\) atoms in the observable universe, which makes Go extremely challenging for traditional search methods.
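To see how these figures relate, here is a short Python check (the script and the comparison are added here for illustration and are not from the sources above): each of the 361 points of a 19×19 board is empty, black, or white, giving \(3^{361}\approx1.74\times10^{172}\) raw configurations, the same order of magnitude as Shen Kuo's forty-three 万, while only about \(2.08\times10^{170}\) of these are legal.

```python
# Naive upper bound: every one of the 361 points is empty, black, or white.
total = 3 ** 361
print(f"3^361       ~ {float(total):.2e}")       # ~1.74e+172

# Shen Kuo's estimate: forty-three ten-thousands multiplied together,
# (10^4)^43 = 10^172, agrees with the naive bound's order of magnitude.
shen_kuo = 10 ** (4 * 43)
print(f"(10^4)^43   ~ {float(shen_kuo):.2e}")    # 1.00e+172

# Fraction of raw configurations that are legal positions (cited figure).
legal = 2.08e170
print(f"legal/naive ~ {legal / float(total):.3f}")  # ~0.012
```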

Paper Translation: Mastering the Game of Go Without Human Knowledge

1. Introduction

A long-standing goal of artificial intelligence is an algorithm that learns superhuman proficiency from scratch in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance, or domain knowledge beyond the rules of the game. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and the winner of its games. This neural network improves the strength of the tree search, resulting in higher-quality move selection and stronger self-play in the next iteration. Starting from scratch, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo.
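The two prediction targets in this abstract, the move selection and the winner, correspond to the combined training objective given later in the paper, \(l=(z-v)^2-\pi^\top\log\mathbf{p}+c\lVert\theta\rVert^2\). Below is a toy numpy illustration of that objective; all of the numbers are invented for the example, and the arrays merely stand in for real network outputs.

```python
import numpy as np

# Toy illustration of the AlphaGo Zero training objective:
#   l = (z - v)^2 - pi . log(p) + c * ||theta||^2
# where (p, v) = f_theta(s) is the network's move distribution and value,
# pi is the MCTS search distribution, and z is the game outcome.

pi = np.array([0.70, 0.20, 0.10])    # MCTS visit-count targets (made up)
p  = np.array([0.60, 0.25, 0.15])    # network move probabilities (made up)
z, v = 1.0, 0.8                      # actual winner vs. predicted value
theta = np.array([0.1, -0.3, 0.2])   # stand-in for the network weights
c = 1e-4                             # L2 regularisation strength

value_loss  = (z - v) ** 2           # winner-prediction term
policy_loss = -pi @ np.log(p)        # move-selection (cross-entropy) term
l2_penalty  = c * theta @ theta
print(value_loss + policy_loss + l2_penalty)
```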

RL - Integrating Learning and Planning

Introduction

In the last lecture, we learned a policy directly from experience; in earlier lectures, we learned a value function directly from experience. In this lecture, we will learn a model directly from experience and use planning to construct a value function or policy, integrating learning and planning into a single architecture (a toy sketch follows the list below).

Model-Based RL

  • Learn a model from experience
  • Plan value function (and/or policy) from model
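As a concrete instance of this loop, here is a minimal Dyna-Q-style sketch in Python. The toy corridor MDP, the helper names, and all hyperparameters are invented for this example and are not from the lecture: each real transition both updates \(Q\) directly and is stored in a learned model, and the model is then sampled for additional planning updates.

```python
import random
from collections import defaultdict

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)                   # move left / right along a corridor

def step(s, a):
    """Toy deterministic corridor MDP: reward 1 on reaching the goal."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

Q = defaultdict(float)               # action-value estimates
model = {}                           # learned model: (s, a) -> (r, s2)
alpha, gamma, eps, n_plan = 0.1, 0.9, 0.1, 20

for _ in range(50):                  # episodes of real experience
    s, done = 0, False
    while not done:
        # epsilon-greedy behaviour in the real environment
        a = random.choice(ACTIONS) if random.random() < eps else \
            max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        # (1) direct RL: update Q from the real transition
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        # (2) model learning: remember what the world did
        model[(s, a)] = (r, s2)
        # (3) planning: extra updates from transitions sampled from the model
        for _ in range(n_plan):
            (ps, pa), (pr, ps2) = random.choice(list(model.items()))
            Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in ACTIONS) - Q[(ps, pa)])
        s = s2

print({s: round(max(Q[(s, a)] for a in ACTIONS), 3) for s in range(N_STATES)})
```

The planning budget `n_plan` is the knob that trades computation for sample efficiency: with `n_plan = 0` this reduces to plain Q-learning, while larger values extract more value-function improvement from each real transition.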

RL - Policy Gradient

Introduction

This lecture covers methods that optimise the policy directly. Instead of working with a value function, as we have done so far, we gather experience and use it to update the policy in the direction that makes it better.

In the last lecture, we approximated the value or action-value function using parameters \(\theta\): \[ \begin{aligned} V_\theta(s) &\approx V^\pi(s)\\ Q_\theta(s, a) &\approx Q^\pi(s, a) \end{aligned} \] A policy was generated directly from the value function using \(\epsilon\)-greedy.
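As a quick sketch of that last step, here is how an \(\epsilon\)-greedy policy can be read off an action-value estimate; the Q-values below are made-up numbers for illustration.

```python
import numpy as np

def epsilon_greedy(q_values, eps=0.1):
    """With probability eps explore uniformly, otherwise act greedily w.r.t. Q."""
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

q_s = np.array([0.2, 0.5, 0.1])   # made-up Q_theta(s, a) for three actions
print(epsilon_greedy(q_s))        # usually 1, occasionally a random action
```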

In this lecture we will directly parametrise the policy \[ \pi_\theta(s, a)=\mathbb{P}[a|s, \theta] \] We will focus again on \(\color{red}{\mbox{model-free}}\) reinforcement learning.
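For illustration, one common concrete choice of \(\pi_\theta\) is a softmax policy over linear features, \(\pi_\theta(s, a)\propto e^{\phi(s,a)^\top\theta}\); the features and weights below are invented placeholders, not anything specified in the lecture.

```python
import numpy as np

def softmax_policy(phi_sa, theta):
    """Return pi_theta(s, .) over all actions; phi_sa has shape (n_actions, d)."""
    prefs = phi_sa @ theta
    prefs -= prefs.max()              # subtract max for numerical stability
    exp = np.exp(prefs)
    return exp / exp.sum()

theta = np.array([0.5, -0.2])         # policy parameters (made up)
phi_sa = np.array([[1.0, 0.0],        # feature vector phi(s, a) per action
                   [0.0, 1.0],
                   [1.0, 1.0]])
pi = softmax_policy(phi_sa, theta)
a = np.random.choice(len(pi), p=pi)   # sample an action from pi_theta(s, .)
print(pi, a)
```

Because every action keeps non-zero probability and the distribution is differentiable in \(\theta\), a parametrisation like this is exactly what the policy-gradient updates in this lecture require.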
