Like other areas of machine learning, reinforcement learning (RL) has a set of classic benchmark environments, such as Mountain-Car and Cart-Pole. With the rise of deep reinforcement learning in recent years, new and more complex environments keep appearing, which has given us a series of excellent platforms such as OpenAI Gym, MuJoCo, rllab, DeepMind Lab, TORCS, and PySC2.
My environment
Ubuntu 16.04
Anaconda2
Python 3.6 (it is recommended to create a fresh Anaconda environment; all of the steps below are done inside that conda environment)
tensorflow-gpu 1.4.1 (baselines requires at least 1.4.1)
CUDA 8.0 (for CUDA installation, see https://blog.csdn.net/Hansry/article/details/81008210)
cuDNN 6.0
1. Installing MuJoCo
MuJoCo (Multi-Joint dynamics with Contact) is a physics simulator that can be used for research such as robot control and optimization.
1. Preparation
Download mjpro150 linux from the official website, and click Licence to request a license. You will need to provide your full name, email address, and computer id. To obtain the computer id, download the getid_linux executable for your platform and run:
$ chmod a+x getid_linux    (grant execute permission)
$ ./getid_linux
The output looks something like
LINUX_A1EHAO_Q8BPHTIM10F05D0S3TB3293
After clicking Submit, download the license file mjkey.txt from the inbox of the email address you entered.
2. Environment setup
2.1 Create the hidden folder and copy mjpro150_linux into it:
mkdir ~/.mujoco
cp mjpro150_linux.zip ~/.mujoco
cd ~/.mujoco
unzip mjpro150_linux.zip
2.2 Copy the license file mjkey.txt into the hidden folder you just created:
cp mjkey.txt ~/.mujoco
cp mjkey.txt ~/.mujoco/mjpro150/bin
2.3 Add the environment variables: open ~/.bashrc and append the following lines:
export LD_LIBRARY_PATH=~/.mujoco/mjpro150/bin${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export MUJOCO_KEY_PATH=~/.mujoco${MUJOCO_KEY_PATH}
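After reopening the terminal (or running source ~/.bashrc), you can sanity-check that the variables are visible; a small Python check, assuming the paths from the steps above:
import os

# Sanity check: both variables should point at the MuJoCo installation
print(os.environ.get("LD_LIBRARY_PATH", ""))   # should contain ~/.mujoco/mjpro150/bin
print(os.environ.get("MUJOCO_KEY_PATH", ""))   # should contain ~/.mujoco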
3. Run the simulator to check the installation
cd ~/.mujoco/mjpro150/bin
./simulate ../model/humanoid.xml
2. Installing mujoco_py
First download the mujoco_py source from the official repository and install it. Note that many dependencies may turn out to be missing during installation; just install whatever the error messages ask for.
pip3 install -U 'mujoco-py<1.50.2,>=1.50.1'
If the following executes without errors, the installation succeeded:
>>> import mujoco_py
>>> from os.path import dirname
>>> model = mujoco_py.load_model_from_path(dirname(dirname(mujoco_py.__file__)) +"/xmls/claw.xml")
>>> sim = mujoco_py.MjSim(model)
>>> print(sim.data.qpos)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> sim.step()
>>> print(sim.data.qpos)
[ 2.09217903e-06 -1.82329050e-12 -1.16711384e-07 -4.69613872e-11
-1.43931860e-05 4.73350204e-10 -3.23749942e-05 -1.19854057e-13
-2.39251380e-08 -4.46750545e-07 1.78771599e-09 -1.04232280e-08]
A problem that is almost guaranteed to appear:
Creating window glfw
ERROR: GLEW initialization error: Missing GL version
Press Enter to exit ...
Adding
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/nvidia-390
to ~/.bashrc fixes it (adjust the path to match your installed NVIDIA driver version).
Then run
python body_interaction.py
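For reference, a minimal rendering sketch with mujoco_py (assuming version 1.50 and the claw.xml model shipped with the package; an on-screen window needs the GL setup above) looks roughly like this:
import mujoco_py
from os.path import dirname

# Load the sample claw model bundled with mujoco_py (path assumed from the package layout)
model_path = dirname(dirname(mujoco_py.__file__)) + "/xmls/claw.xml"
model = mujoco_py.load_model_from_path(model_path)
sim = mujoco_py.MjSim(model)
viewer = mujoco_py.MjViewer(sim)   # opens an OpenGL window

for _ in range(1000):
    sim.step()        # advance the physics by one timestep
    viewer.render()   # draw the current state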
3. Installing gym
OpenAI Gym is OpenAI's toolkit for research on reinforcement learning algorithms. It covers a very wide range of environments, from the classic Cart-Pole and Mountain-Car to Atari, Go, and MuJoCo. The official website is https://gym.openai.com/ and the source code lives at https://github.com/openai/gym; its README provides installation and usage examples. Following its installation instructions:
Minimal install
git clone https://github.com/openai/gym.git
cd gym
pip install -e .
Full install
apt-get install -y python-numpy python-dev cmake zlib1g-dev libjpeg-dev xvfb libav-tools xorg-dev python-opengl libboost-all-dev libsdl2-dev swig Pillow libglfw3-dev
pip install -e '.[all]'
Enter ~/gym/examples/scripts and run ./list_envs to see the many environments available in gym.
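The same list can also be obtained from Python; a small sketch, assuming this gym version exposes the standard registry API:
from gym import envs

# Print the id of every environment registered with gym
for spec in envs.registry.all():
    print(spec.id)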
An environment can then be created and run with code like the following:
import gym
env = gym.make('Hero-ram-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
4. Installing baselines
OpenAI Baselines is a set of high-quality implementations of reinforcement learning algorithms. It requires Python >= 3.5, along with OpenMPI and zlib. Some of the baselines examples are based on the MuJoCo physics simulator, which can be installed following the instructions above.
4.1 Download the baselines package and install it:
git clone https://github.com/openai/baselines.git
cd baselines
pip install -e .
4.2 Install tensorflow-gpu. Since our CUDA and cuDNN versions are 8.0 and 6.0, we need the matching tensorflow-gpu build:
conda install tensorflow-gpu=1.4.1
4.3 To run the algorithm tests in baselines, install pytest:
pip install pytest
pytest
Output like the following means the setup succeeded:
================================================================================== test session starts ==================================================================================
platform linux -- Python 3.6.6, pytest-3.6.3, py-1.5.4, pluggy-0.6.0
rootdir: /home/hansry/append/RL/baselines, inifile:
collected 12 items
baselines/common/test_identity.py
Any other missing packages can simply be installed as the errors come up.
Addendum: there is an excellent third-party project on GitHub that combines gazebo and gym: https://github.com/erlerobot/gym-gazebo
5. Using HER (Hindsight Experience Replay) in baselines
Go into the /baselines/baselines/her/experiment folder, which contains config.py, play.py, train.py, and other files.
config.py sets the default parameters, such as the following:
DEFAULT_PARAMS = {
    # env
    'max_u': 1.,  # max absolute value of actions on different coordinates
    # ddpg
    'layers': 3,  # number of layers in the critic/actor networks
    'hidden': 256,  # number of neurons in each hidden layers
    'network_class': 'baselines.her.actor_critic:ActorCritic',
    'Q_lr': 0.001,  # critic learning rate
    'pi_lr': 0.001,  # actor learning rate
    'buffer_size': int(1E6),  # for experience replay
    'polyak': 0.95,  # polyak averaging coefficient
    'action_l2': 1.0,  # quadratic penalty on actions (before rescaling by max_u)
    'clip_obs': 200.,
    'scope': 'ddpg',  # can be tweaked for testing
    'relative_goals': False,
    # training
    'n_cycles': 50,  # per epoch
    'rollout_batch_size': 2,  # per mpi thread
    'n_batches': 40,  # training batches per cycle
    'batch_size': 256,  # per mpi thread, measured in transitions and reduced to even multiple of chunk_length.
    'n_test_rollouts': 10,  # number of test rollouts per epoch, each consists of rollout_batch_size rollouts
    'test_with_polyak': False,  # run test episodes with the target network
    # exploration
    'random_eps': 0.3,  # percentage of time a random action is taken
    'noise_eps': 0.2,  # std of gaussian noise added to not-completely-random actions as a percentage of max_u
    # HER
    'replay_strategy': 'future',  # supported modes: future, none
    'replay_k': 4,  # number of additional goals used for replay, only used if off_policy_data=future
    # normalization
    'norm_eps': 0.01,  # epsilon used for observation normalization
    'norm_clip': 5,  # normalized observations are cropped to this values
}
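As background on the 'replay_strategy' and 'replay_k' entries above: the 'future' strategy relabels each transition with goals that were actually achieved later in the same episode. The following is only a conceptual sketch of that idea (the function names and tuple layout are hypothetical, not the baselines implementation):
import random

def compute_reward(achieved_goal, goal):
    # Sparse-reward sketch: 0 when the goal is reached, -1 otherwise
    return 0.0 if achieved_goal == goal else -1.0

def her_future_relabel(episode, replay_k=4):
    # episode: list of (obs, action, next_obs, achieved_goal, goal) tuples
    relabeled = []
    for t, (obs, action, next_obs, achieved, goal) in enumerate(episode):
        future = episode[t + 1:]
        for _ in range(min(replay_k, len(future))):
            # Use the achieved goal of a random later timestep as the new goal
            new_goal = random.choice(future)[3]
            relabeled.append((obs, action, next_obs, new_goal,
                              compute_reward(achieved, new_goal)))
    return relabeled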
train.py trains the DDPG+HER parameters, i.e. the neural-network weights. The training script exposes a number of command-line options:
@click.option('--env', type=str, default='FetchReach-v1', help='the name of the OpenAI Gym environment that you want to train on')  # the task HER trains on
@click.option('--logdir', type=str, default=None, help='the path to where logs and policy pickles should go. If not specified, creates a folder in /tmp/')  # where logs and the trained policy are saved
@click.option('--n_epochs', type=int, default=50, help='the number of training epochs to run')  # number of training epochs
@click.option('--num_cpu', type=int, default=1, help='the number of CPU cores to use (using MPI)')  # how many CPUs to use
@click.option('--seed', type=int, default=0, help='the random seed used to seed both the environment and the training code')
@click.option('--policy_save_interval', type=int, default=5, help='the interval with which policy pickles are saved. If set to 0, only the best and latest policy will be pickled.')
@click.option('--replay_strategy', type=click.Choice(['future', 'none']), default='future', help='the HER replay strategy to be used. "future" uses HER, "none" disables HER.')
@click.option('--clip_return', type=int, default=1, help='whether or not returns should be clipped')
The training command is:
python train.py --num_cpu=2    (options as needed)
play.py runs a trained policy; the command is:
python play.py policy_best.pkl    (followed by the trained policy file)