Saturday, November 20, 2021

Reinforcement learning summary

 

You need 2 things

A. An environment that simulates the steps, rewards, states, etc. (a minimal custom environment sketch follows the library list below)

B. A model that is capable of learning over time


A. Environment:

Key libraries:

1. OpenAI Gym: contains many prebuilt environments

2. gym-anytrading: contains ready-made environments for trading. https://github.com/AminHP/gym-anytrading

3. The book by Yves Hilpisch includes a custom environment: https://colab.research.google.com/github/yhilpisch/aiif/blob/main/code/09_reinforcement_learning_b.ipynb

4. The book by Stefan Jansen: https://github.com/stefan-jansen/machine-learning-for-trading/tree/main/22_deep_reinforcement_learning
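
For problems the prebuilt environments do not cover, you can write your own by subclassing gym.Env. A minimal sketch of the required interface (the observation, reward, and episode logic below are placeholders, not a real trading environment):

import gym
import numpy as np
from gym import spaces

class MyTradingEnv(gym.Env):
    """Toy skeleton: shows the gym.Env interface (action/observation spaces, reset, step)."""

    def __init__(self):
        super(MyTradingEnv, self).__init__()
        # Placeholder actions: 0 = sell, 1 = hold, 2 = buy
        self.action_space = spaces.Discrete(3)
        # Placeholder observation: 5 arbitrary features
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(5,), dtype=np.float32)
        self.current_step = 0

    def reset(self):
        self.current_step = 0
        return np.zeros(5, dtype=np.float32)

    def step(self, action):
        self.current_step += 1
        obs = np.random.randn(5).astype(np.float32)  # placeholder next state
        reward = 0.0                                 # placeholder reward
        done = self.current_step >= 100              # fixed-length episode
        return obs, reward, done, {}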


B. Model

Key libraries:

1. Stable Baselines

To create a model, you need to define the algorithm and the policy.


Algorithms


This table lists the RL algorithms implemented in the Stable Baselines project, along with some useful characteristics: support for recurrent policies, discrete/continuous action spaces, and multiprocessing.

Name      | Refactored [1] | Recurrent | Box    | Discrete | Multi Processing
A2C       | ✔️             | ✔️        | ✔️     | ✔️       | ✔️
ACER      | ✔️             | ✔️        | ❌ [4] | ✔️       | ✔️
ACKTR     | ✔️             | ✔️        | ✔️     | ✔️       | ✔️
DDPG      | ✔️             | ❌        | ✔️     | ❌       | ✔️ [3]
DQN       | ✔️             | ❌        | ❌     | ✔️       | ❌
HER       | ✔️             | ❌        | ✔️     | ✔️       | ❌
GAIL [2]  | ✔️             | ✔️        | ✔️     | ✔️       | ✔️ [3]
PPO1      | ✔️             | ❌        | ✔️     | ✔️       | ✔️ [3]
PPO2      | ✔️             | ✔️        | ✔️     | ✔️       | ✔️
SAC       | ✔️             | ❌        | ✔️     | ❌       | ❌
TD3       | ✔️             | ❌        | ✔️     | ❌       | ❌
TRPO      | ✔️             | ❌        | ✔️     | ✔️       | ✔️ [3]
[1] Whether or not the algorithm has been refactored to fit the BaseRLModel class.
[2] Only implemented for TRPO.
[3] Multi Processing with MPI.
[4] TODO, in project scope.
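
The Box and Discrete columns refer to the type of action space an algorithm supports (continuous vs. discrete actions). You can check an environment's spaces directly; a small sketch using two standard Gym environments (the specific environment IDs are just examples):

import gym

# CartPole has a discrete action space -> DQN, A2C, ACER, PPO2, ... are applicable
env = gym.make('CartPole-v1')
print(env.action_space)       # Discrete(2)
print(env.observation_space)  # a 4-dimensional Box

# Pendulum has a continuous (Box) action space -> DDPG, SAC, TD3, PPO2, ... are applicable
env = gym.make('Pendulum-v0')
print(env.action_space)       # a 1-dimensional Box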


Policies


Available Policies

MlpPolicy: Policy object that implements actor critic, using an MLP (2 layers of 64)
MlpLstmPolicy: Policy object that implements actor critic, using LSTMs with an MLP feature extraction
MlpLnLstmPolicy: Policy object that implements actor critic, using layer-normalized LSTMs with an MLP feature extraction
CnnPolicy: Policy object that implements actor critic, using a CNN (the nature CNN)
CnnLstmPolicy: Policy object that implements actor critic, using LSTMs with a CNN feature extraction
CnnLnLstmPolicy: Policy object that implements actor critic, using layer-normalized LSTMs with a CNN feature extraction

Once the environment, algorithm, and policy are defined, running RL training is straightforward (a combined end-to-end sketch follows the four steps below):
1. Create the environment:

# 'stocks-v0' is registered by gym_anytrading (import gym_anytrading first)
env = gym.make('stocks-v0', df=df, frame_bound=(5,250), window_size=5)

2. Wrap the environment in a vectorized environment so that the RL algorithm can run multiple environments in parallel

from stable_baselines.common.vec_env import DummyVecEnv

env_maker = lambda: env
env = DummyVecEnv([env_maker])

3. Define the model

from stable_baselines import A2C
model = A2C('MlpLstmPolicy', env, verbose=1)

4. Start training

model.learn(total_timesteps=1000000)
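
Putting the four steps together, a minimal end-to-end sketch (the CSV file name is a placeholder; df is assumed to be a pandas DataFrame of price data in the format gym_anytrading expects):

import gym
import gym_anytrading  # registers the 'stocks-v0' environment
import pandas as pd

from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C

df = pd.read_csv('prices.csv')  # placeholder price data

# 1. Create the environment
env = gym.make('stocks-v0', df=df, frame_bound=(5, 250), window_size=5)

# 2. Wrap it in a vectorized environment
env = DummyVecEnv([lambda: env])

# 3. Define the model (algorithm + policy)
model = A2C('MlpLstmPolicy', env, verbose=1)

# 4. Train
model.learn(total_timesteps=1000000)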

Hyperparameter tuning (a small comparison sketch follows this list):
1. Use different RL algorithms (A2C, PPO2, etc.)
2. Use different policies (MlpPolicy, CnnPolicy, MlpLstmPolicy, etc.)
3. Use different policy parameters
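
The first two points can be explored with a simple loop; a rough sketch (the algorithm/policy combinations, environment, and timestep budget are arbitrary illustrative choices):

import gym

from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C, PPO2

# 4 parallel copies so that the recurrent PPO2 policy satisfies its minibatch requirement
env = DummyVecEnv([lambda: gym.make('CartPole-v1') for _ in range(4)])

for algo in (A2C, PPO2):
    for policy in ('MlpPolicy', 'MlpLstmPolicy'):
        model = algo(policy, env, verbose=0)
        model.learn(total_timesteps=20000)
        model.save('{}_{}'.format(algo.__name__, policy))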

The most common hyperparameters to change are the network architecture and the activation function of the feedforward (MLP) policy, for example:

import gym
import tensorflow as tf

from stable_baselines import PPO2

# Custom MLP policy of two layers of size 32 each with tanh activation function
policy_kwargs = dict(act_fun=tf.nn.tanh, net_arch=[32, 32])
# Create the agent
model = PPO2("MlpPolicy", "CartPole-v1", policy_kwargs=policy_kwargs, verbose=1)
# Retrieve the environment
env = model.get_env()
# Train the agent
model.learn(total_timesteps=100000)
# Save the agent
model.save("ppo2-cartpole")

del model
# the policy_kwargs are automatically loaded
model = PPO2.load("ppo2-cartpole")
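
Once loaded, the trained agent can be evaluated by stepping through the environment with model.predict (a minimal sketch, reusing the env retrieved above):

# Run the loaded agent for a fixed number of steps
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)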

import gym

from stable_baselines.common.policies import FeedForwardPolicy, register_policy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C

# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           net_arch=[dict(pi=[128, 128, 128],
                                                          vf=[128, 128, 128])],
                                           feature_extraction="mlp")

# Create and wrap the environment
env = gym.make('LunarLander-v2')
env = DummyVecEnv([lambda: env])

model = A2C(CustomPolicy, env, verbose=1)
# Train the agent
model.learn(total_timesteps=100000)
# Save the agent
model.save("a2c-lunar")

del model
# When loading a model with a custom policy
# you MUST pass explicitly the policy when loading the saved model
model = A2C.load("a2c-lunar", policy=CustomPolicy)

The net_arch parameter of FeedForwardPolicy allows you to specify the number and size of the hidden layers and how many of them are shared between the policy network and the value network. It is assumed to be a list with the following structure:

  1. An arbitrary length (zero allowed) number of integers each specifying the number of units in a shared layer. If the number of ints is zero, there will be no shared layers.
  2. An optional dict, to specify the following non-shared layers for the value network and the policy network. It is formatted like dict(vf=[<value layer sizes>], pi=[<policy layer sizes>]). If either key (pi or vf) is missing, no non-shared layers (an empty list) are assumed for that network.

In short: [<shared layers>, dict(vf=[<non-shared value network layers>], pi=[<non-shared policy network layers>])].

Examples

Two shared layers of size 128: net_arch=[128, 128]

          obs
           |
         <128>
           |
         <128>
   /               \
action            value

Value network deeper than policy network, first layer shared: net_arch=[128, dict(vf=[256, 256])]

          obs
           |
         <128>
   /               \
action             <256>
                     |
                   <256>
                     |
                   value

Initially shared then diverging: [128, dict(vf=[256], pi=[16])]

          obs
           |
         <128>
   /               \
 <16>             <256>
   |                |
action            value
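
These architectures can also be passed without subclassing, via policy_kwargs (a small sketch using A2C on CartPole; the environment and timestep budget are placeholders):

import gym

from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C

env = DummyVecEnv([lambda: gym.make('CartPole-v1')])

# One shared layer of 128 units, then 16 units for the policy head
# and 256 units for the value head (the last diagram above)
policy_kwargs = dict(net_arch=[128, dict(vf=[256], pi=[16])])

model = A2C('MlpPolicy', env, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=25000)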





Loud fan of desktop

Upon restart, the desktop fan got loud again. I cleaned the dust out of the desktop but it was still loud (lower than the first sound) ...