Reinforcement Learning Code in Practice 05: The Dyna-Q Algorithm

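Dyna-Q builds on Q-learning by adding a data replay ("rumination") step: transitions that have already been collected are stored and reused for extra value updates, so the agent learns more from each existing sample (reviewing the old to learn the new, that is, planning from stored experience).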
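
As a minimal sketch of the replay idea, the snippet below applies the same temporal-difference update several times to a single stored transition. The helpers `td_step` and `replay` are hypothetical names used only for illustration; the step size 0.1 and discount 0.95 are taken from the constants in the listing that follows, and the successor state's maximum Q value is assumed to stay at 0 while replaying.

```python
ALPHA = 0.1   # step size (same constant as in the article's code)
GAMMA = 0.95  # discount factor (same constant as in the article's code)

def td_step(q_value, reward, next_max):
    """Apply one Q-learning update to a single Q value (hypothetical helper)."""
    target = reward + GAMMA * next_max
    return q_value + ALPHA * (target - q_value)

def replay(q_value, reward, next_max, n_times):
    """Replay one stored transition n_times; return the Q value after each pass."""
    values = []
    for _ in range(n_times):
        q_value = td_step(q_value, reward, next_max)
        values.append(q_value)
    return values

# A stored transition with reward -1 whose successor state has max Q = 0 (assumed).
trace = replay(q_value=0.0, reward=-1.0, next_max=0.0, n_times=3)
print([round(v, 3) for v in trace])  # [-0.1, -0.19, -0.271]
```

Each replay moves the estimate a further 10% of the way toward the target of -1 without any new interaction with the environment, which is the effect the planning loop relies on.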

import numpy as np
import random

# Return the type of a grid cell
def get_state(row, col):
    if row != 3:
        return 'ground'
    if row == 3 and col == 11:
        return 'terminal'
    if row == 3 and col == 0:
        return 'ground'
    return 'trap'

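# Take an action in a given state and return the new position and the reward
def move(row, col, action):
    # State check: in a trap or at the terminal cell no action can be taken, and the reward is 0
    if get_state(row, col) in ["trap", "terminal"]:
        return row, col, 0
    # Position change for each action: 0 = up, 1 = down, 2 = left, 3 = right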
    if action == 0:
        row -= 1
    if action == 1:
        row += 1
    if action == 2:
        col -= 1
    if action == 3:
        col += 1
    # Clamp to the grid: rows stay within 0..3, columns within 0..11
    row = max(0, row)
    row = min(3, row)
    col = max(0, col)
    col = min(11, col)
    
    # Falling into a trap gives -100; every other step gives -1 so the agent finishes as quickly as possible
    reward = -1
    if get_state(row, col) == 'trap':
        reward = -100
    return row, col, reward

# Initialize the Q table: one score per action for each cell, all zero because nothing is known yet
Q = np.zeros([4, 12, 4])
# Stored transitions used for replay. key: (row, col, action) value: (next_row, next_col, reward)
history = {}

# Choose an action for the current cell
def get_action(row, col):
    # Explore with probability 0.1
    if random.random() < 0.1:
        return np.random.choice(range(4))
    # Otherwise return the action with the highest score in the Q table
    return Q[row, col].argmax()
    
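# Compute the update for the current cell from the reward received and the next cell reached
# (Q-learning does not need the next action)
def update(row, col, action, reward, next_row, next_col):
    """The update differs from SARSA.
        SARSA: estimates the action values of the current epsilon-greedy policy (on-policy)
        Q-learning: directly estimates the optimal action values (off-policy)
        On-policy: the behavior policy and the target policy are the same policy
        Off-policy: the behavior policy and the target policy are not the same policy
    """
    target = reward + Q[next_row, next_col].max() * 0.95
    value = Q[row, col, action]
    # Temporal-difference error, already scaled by the step size 0.1
    td_error = 0.1 * (target - value)
    # Return the scaled error
    return td_error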

def q_planning():
    for _ in range(50):
        # Randomly pick a stored state-action sample
        row, col, action = random.choice(list(history.keys()))
        next_row, next_col, reward = history[(row, col, action)]
        # Value update
        td_error = update(row, col, action, reward, next_row, next_col)
        Q[row, col, action] += td_error

def train():
    for epoch in range(10000):
        # At the start of each episode, pick a random starting row in column 0 to cover more of the environment, and select an initial action
        row = np.random.choice(range(4))
        col = 0
        action = get_action(row, col)
        # Running total of this episode's rewards; it should rise as training progresses
        rewards_sum = 0
        
        # Keep exploring until the episode ends at the terminal cell or in a trap
        while get_state(row, col) not in ["terminal", "trap"]:
            # Take one step from the current state and observe the new state
            next_row, next_col, reward = move(row, col, action)
            next_action = get_action(next_row, next_col)
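            # Direct Q-learning update from the real transition (standard Dyna-Q does this before planning)
            Q[row, col, action] += update(row, col, action, reward, next_row, next_col)
            # Store the sample
            history[(row, col, action)] = next_row, next_col, reward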
            rewards_sum += reward
            
            # Data replay: perform multiple value updates from stored samples
            q_planning()

            # Advance to the next state
            row, col, action = next_row, next_col, next_action
        if epoch % 500 == 0:
            print(f"epoch:{epoch}, rewards_sum:{rewards_sum}")
train()
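
To check what was learned, the greedy policy can be read off the Q table. The following is a minimal sketch, not part of the original code: `render_policy` and `ARROWS` are hypothetical names, the 4x12 layout with traps in the bottom row is assumed to match `get_state` above, and the table is indexed as `q_table[row][col]` so that both a NumPy array and a nested list work.

```python
# Action indices follow the article's convention: 0 = up, 1 = down, 2 = left, 3 = right.
ARROWS = ["^", "v", "<", ">"]

def render_policy(q_table, n_rows=4, n_cols=12):
    """Return the greedy policy as a list of strings, one per grid row."""
    lines = []
    for row in range(n_rows):
        cells = []
        for col in range(n_cols):
            if row == n_rows - 1 and col == n_cols - 1:
                cells.append("G")  # terminal cell
            elif row == n_rows - 1 and col > 0:
                cells.append("#")  # trap cell
            else:
                scores = q_table[row][col]
                # Index of the highest-scoring action (first one wins on ties)
                best = max(range(4), key=lambda a: scores[a])
                cells.append(ARROWS[best])
        lines.append(" ".join(cells))
    return lines

# Demo table (made up for illustration): "right" has the highest score everywhere.
demo = [[[0.0, 0.0, 0.0, 1.0] for _ in range(12)] for _ in range(4)]
for line in render_policy(demo):
    print(line)
```

After `train()` has finished, calling `print("\n".join(render_policy(Q)))` would show the learned greedy action in each cell.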

Original article: http://www.cnblogs.com/demo-deng/p/16881383.html
