Saturday, November 20, 2021

Reinforcement learning summary

 

You need 2 things

A. An environment that simulates the steps, rewards, states, etc. (a minimal custom environment sketch follows the library list below)

B. A model that is capable of learning over time


A. Environment:

Key libraries:

1. OpenAI Gym: contains many prebuilt environments

2. gym-anytrading: contains ready-made environments for trading. https://github.com/AminHP/gym-anytrading

3. The book by Yves Hilpisch includes a custom environment: https://colab.research.google.com/github/yhilpisch/aiif/blob/main/code/09_reinforcement_learning_b.ipynb

4. The book by Stefan Jansen: https://github.com/stefan-jansen/machine-learning-for-trading/tree/main/22_deep_reinforcement_learning
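
For problems the prebuilt environments do not cover, you can write your own by subclassing gym.Env. A minimal sketch of the required interface (the observation, reward, and episode logic below are placeholders, not a real trading environment):

import gym
import numpy as np
from gym import spaces

class MyTradingEnv(gym.Env):
    """Toy skeleton: shows the gym.Env interface (action/observation spaces, reset, step)."""

    def __init__(self):
        super(MyTradingEnv, self).__init__()
        # Placeholder actions: 0 = sell, 1 = hold, 2 = buy
        self.action_space = spaces.Discrete(3)
        # Placeholder observation: 5 arbitrary features
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(5,), dtype=np.float32)
        self.current_step = 0

    def reset(self):
        self.current_step = 0
        return np.zeros(5, dtype=np.float32)

    def step(self, action):
        self.current_step += 1
        obs = np.random.randn(5).astype(np.float32)  # placeholder next state
        reward = 0.0                                 # placeholder reward
        done = self.current_step >= 100              # fixed-length episode
        return obs, reward, done, {}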


B. Model

Key libraries:

1. Stable Baselines

To create a model, you need to define the algorithm and the policy.


Algorithms


This table lists the RL algorithms implemented in the Stable Baselines project, along with some useful characteristics: support for recurrent policies, discrete/continuous action spaces, and multiprocessing.

Name      | Refactored [1] | Recurrent | Box    | Discrete | Multi Processing
A2C       | ✔️             | ✔️        | ✔️     | ✔️       | ✔️
ACER      | ✔️             | ✔️        | ❌ [4] | ✔️       | ✔️
ACKTR     | ✔️             | ✔️        | ✔️     | ✔️       | ✔️
DDPG      | ✔️             | ❌        | ✔️     | ❌       | ✔️ [3]
DQN       | ✔️             | ❌        | ❌     | ✔️       | ❌
HER       | ✔️             | ❌        | ✔️     | ✔️       | ❌
GAIL [2]  | ✔️             | ✔️        | ✔️     | ✔️       | ✔️ [3]
PPO1      | ✔️             | ❌        | ✔️     | ✔️       | ✔️ [3]
PPO2      | ✔️             | ✔️        | ✔️     | ✔️       | ✔️
SAC       | ✔️             | ❌        | ✔️     | ❌       | ❌
TD3       | ✔️             | ❌        | ✔️     | ❌       | ❌
TRPO      | ✔️             | ❌        | ✔️     | ✔️       | ✔️ [3]
[1] Whether or not the algorithm has been refactored to fit the BaseRLModel class.
[2] Only implemented for TRPO.
[3] Multi Processing with MPI.
[4] TODO, in project scope.
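
The Box and Discrete columns refer to the type of action space an algorithm supports (continuous vs. discrete actions). You can check an environment's spaces directly; a small sketch using two standard Gym environments (the specific environment IDs are just examples):

import gym

# CartPole has a discrete action space -> DQN, A2C, ACER, PPO2, ... are applicable
env = gym.make('CartPole-v1')
print(env.action_space)       # Discrete(2)
print(env.observation_space)  # a 4-dimensional Box

# Pendulum has a continuous (Box) action space -> DDPG, SAC, TD3, PPO2, ... are applicable
env = gym.make('Pendulum-v0')
print(env.action_space)       # a 1-dimensional Box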


Policies


Available Policies

MlpPolicy: Policy object that implements actor critic, using an MLP (2 layers of 64)
MlpLstmPolicy: Policy object that implements actor critic, using LSTMs with an MLP feature extraction
MlpLnLstmPolicy: Policy object that implements actor critic, using layer-normalized LSTMs with an MLP feature extraction
CnnPolicy: Policy object that implements actor critic, using a CNN (the nature CNN)
CnnLstmPolicy: Policy object that implements actor critic, using LSTMs with a CNN feature extraction
CnnLnLstmPolicy: Policy object that implements actor critic, using layer-normalized LSTMs with a CNN feature extraction

Once the environment, algorithm, and policy are defined, running RL training is straightforward (a combined end-to-end sketch follows the four steps below):
1. Create the environment:

# 'stocks-v0' is registered by gym_anytrading (import gym_anytrading first)
env = gym.make('stocks-v0', df=df, frame_bound=(5,250), window_size=5)

2. Wrap the environment in a vectorized environment so that the RL algorithm can run multiple environments in parallel

from stable_baselines.common.vec_env import DummyVecEnv

env_maker = lambda: env
env = DummyVecEnv([env_maker])

3. Define the model

from stable_baselines import A2C
model = A2C('MlpLstmPolicy', env, verbose=1)

4. Start training

model.learn(total_timesteps=1000000)
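
Putting the four steps together, a minimal end-to-end sketch (the CSV file name is a placeholder; df is assumed to be a pandas DataFrame of price data in the format gym_anytrading expects):

import gym
import gym_anytrading  # registers the 'stocks-v0' environment
import pandas as pd

from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C

df = pd.read_csv('prices.csv')  # placeholder price data

# 1. Create the environment
env = gym.make('stocks-v0', df=df, frame_bound=(5, 250), window_size=5)

# 2. Wrap it in a vectorized environment
env = DummyVecEnv([lambda: env])

# 3. Define the model (algorithm + policy)
model = A2C('MlpLstmPolicy', env, verbose=1)

# 4. Train
model.learn(total_timesteps=1000000)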

Hyperparameter tuning (a small comparison sketch follows this list):
1. Use different RL algorithms (A2C, PPO2, etc.)
2. Use different policies (MlpPolicy, CnnPolicy, MlpLstmPolicy, etc.)
3. Use different policy parameters
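
The first two points can be explored with a simple loop; a rough sketch (the algorithm/policy combinations, environment, and timestep budget are arbitrary illustrative choices):

import gym

from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C, PPO2

# 4 parallel copies so that the recurrent PPO2 policy satisfies its minibatch requirement
env = DummyVecEnv([lambda: gym.make('CartPole-v1') for _ in range(4)])

for algo in (A2C, PPO2):
    for policy in ('MlpPolicy', 'MlpLstmPolicy'):
        model = algo(policy, env, verbose=0)
        model.learn(total_timesteps=20000)
        model.save('{}_{}'.format(algo.__name__, policy))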

The most common hyperparameters to change are the network architecture and the activation function of the feedforward (MLP) policy, for example:

import gym
import tensorflow as tf

from stable_baselines import PPO2

# Custom MLP policy of two layers of size 32 each with tanh activation function
policy_kwargs = dict(act_fun=tf.nn.tanh, net_arch=[32, 32])
# Create the agent
model = PPO2("MlpPolicy", "CartPole-v1", policy_kwargs=policy_kwargs, verbose=1)
# Retrieve the environment
env = model.get_env()
# Train the agent
model.learn(total_timesteps=100000)
# Save the agent
model.save("ppo2-cartpole")

del model
# the policy_kwargs are automatically loaded
model = PPO2.load("ppo2-cartpole")
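
Once loaded, the trained agent can be evaluated by stepping through the environment with model.predict (a minimal sketch, reusing the env retrieved above):

# Run the loaded agent for a fixed number of steps
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)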

import gym

from stable_baselines.common.policies import FeedForwardPolicy, register_policy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C

# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           net_arch=[dict(pi=[128, 128, 128],
                                                          vf=[128, 128, 128])],
                                           feature_extraction="mlp")

# Create and wrap the environment
env = gym.make('LunarLander-v2')
env = DummyVecEnv([lambda: env])

model = A2C(CustomPolicy, env, verbose=1)
# Train the agent
model.learn(total_timesteps=100000)
# Save the agent
model.save("a2c-lunar")

del model
# When loading a model with a custom policy
# you MUST pass explicitly the policy when loading the saved model
model = A2C.load("a2c-lunar", policy=CustomPolicy)

The net_arch parameter of FeedForwardPolicy allows you to specify the number and size of the hidden layers and how many of them are shared between the policy network and the value network. It is assumed to be a list with the following structure:

  1. An arbitrary length (zero allowed) number of integers each specifying the number of units in a shared layer. If the number of ints is zero, there will be no shared layers.
  2. An optional dict, to specify the following non-shared layers for the value network and the policy network. It is formatted like dict(vf=[<value layer sizes>], pi=[<policy layer sizes>]). If either key (pi or vf) is missing, no non-shared layers (an empty list) are assumed for that network.

In short: [<shared layers>, dict(vf=[<non-shared value network layers>], pi=[<non-shared policy network layers>])].

Examples

Two shared layers of size 128: net_arch=[128, 128]

          obs
           |
         <128>
           |
         <128>
   /               \
action            value

Value network deeper than policy network, first layer shared: net_arch=[128, dict(vf=[256, 256])]

          obs
           |
         <128>
   /               \
action             <256>
                     |
                   <256>
                     |
                   value

Initially shared then diverging: [128, dict(vf=[256], pi=[16])]

          obs
           |
         <128>
   /               \
 <16>             <256>
   |                |
action            value
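
These architectures can also be passed without subclassing, via policy_kwargs (a small sketch using A2C on CartPole; the environment and timestep budget are placeholders):

import gym

from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C

env = DummyVecEnv([lambda: gym.make('CartPole-v1')])

# One shared layer of 128 units, then 16 units for the policy head
# and 256 units for the value head (the last diagram above)
policy_kwargs = dict(net_arch=[128, dict(vf=[256], pi=[16])])

model = A2C('MlpPolicy', env, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=25000)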





Loud fan of desktop

Upon restart, the desktop fan got loud again. I cleaned the dust out of the desktop but it was still loud (lower than the first sound) ...