You need two things:
A. An environment that simulates the steps, rewards, states, etc.
B. A model that is capable of learning over time
A. Environment:
Key libraries:
1. OpenAI Gym: Contains many prebuilt environments
2. gym-anytrading: Contains great environments for trading. https://github.com/AminHP/gym-anytrading
3. Yves Hilpisch's book has a custom environment: https://colab.research.google.com/github/yhilpisch/aiif/blob/main/code/09_reinforcement_learning_b.ipynb
4. Stefan Jansen's book: https://github.com/stefan-jansen/machine-learning-for-trading/tree/main/22_deep_reinforcement_learning
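For orientation, the interface these environments share can be sketched in plain Python. The class below is a hypothetical toy and is not taken from any of the libraries above: `ToyTradingEnv`, its two actions, and its reward (the next price change when long, zero when flat) are assumptions chosen only to show how `reset` and `step` produce states, rewards, and a done flag in the classic Gym style.

```python
class ToyTradingEnv:
    """Hypothetical minimal environment using the classic Gym reset/step convention."""

    FLAT, LONG = 0, 1  # assumed action encoding, for illustration only

    def __init__(self, prices, window_size=3):
        if len(prices) <= window_size:
            raise ValueError("need more prices than window_size")
        self.prices = list(prices)
        self.window_size = window_size
        self.t = window_size

    def _observation(self):
        # State: the `window_size` most recent prices before time t.
        return self.prices[self.t - self.window_size:self.t]

    def reset(self):
        self.t = self.window_size
        return self._observation()

    def step(self, action):
        # Reward: the next price change if long, nothing if flat.
        change = self.prices[self.t] - self.prices[self.t - 1]
        reward = change if action == self.LONG else 0.0
        self.t += 1
        done = self.t >= len(self.prices)
        return self._observation(), reward, done, {}


def run_episode(env, policy):
    """Roll out one episode with `policy(obs) -> action`; return the total reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total


# Usage: an "always long" policy on a short made-up price series.
env = ToyTradingEnv([1.0, 2.0, 3.0, 4.0, 6.0], window_size=2)
print(run_episode(env, lambda obs: ToyTradingEnv.LONG))
```

A learning model replaces the fixed `policy` function here; the libraries listed above provide much richer observations, position handling, and reward definitions.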
B. Model
Key libraries
1. Stable Baselines
To create a model, you need to define the algorithm and the policy.
Algorithms
This table lists the RL algorithms implemented in the Stable Baselines project, along with some useful characteristics: support for recurrent policies, discrete/continuous actions, and multiprocessing.
Name | Refactored [1] | Recurrent | Box | Discrete | Multi Processing |
---|---|---|---|---|---|
A2C | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
ACER | ✔️ | ✔️ | ❌ [4] | ✔️ | ✔️ |
ACKTR | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
DDPG | ✔️ | ❌ | ✔️ | ❌ | ✔️ [3] |
DQN | ✔️ | ❌ | ❌ | ✔️ | ❌ |
HER | ✔️ | ❌ | ✔️ | ✔️ | ❌ |
GAIL [2] | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ [3] |
PPO1 | ✔️ | ❌ | ✔️ | ✔️ | ✔️ [3] |
PPO2 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
SAC | ✔️ | ❌ | ✔️ | ❌ | ❌ |
TD3 | ✔️ | ❌ | ✔️ | ❌ | ❌ |
TRPO | ✔️ | ❌ | ✔️ | ✔️ | ✔️ [3] |
[1] | Whether or not the algorithm has been refactored to fit the BaseRLModel class. |
[2] | Only implemented for TRPO. |
[3] | Multiprocessing with MPI. |
[4] | TODO, in project scope. |
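One way to use the table is to encode it as data and filter on the capabilities an environment needs. The snippet below is a sketch: the `ALGORITHMS` dictionary simply transcribes the table above, and `candidates` is a hypothetical helper, not part of Stable Baselines. The example query asks for algorithms that support both recurrent policies and discrete actions, which is what an LSTM policy would need assuming the environment has a discrete action space such as buy/sell.

```python
# Capability flags transcribed from the table above:
# (recurrent, box_actions, discrete_actions, multiprocessing)
ALGORITHMS = {
    "A2C":   (True,  True,  True,  True),
    "ACER":  (True,  False, True,  True),
    "ACKTR": (True,  True,  True,  True),
    "DDPG":  (False, True,  False, True),
    "DQN":   (False, False, True,  False),
    "HER":   (False, True,  True,  False),
    "GAIL":  (True,  True,  True,  True),
    "PPO1":  (False, True,  True,  True),
    "PPO2":  (True,  True,  True,  True),
    "SAC":   (False, True,  False, False),
    "TD3":   (False, True,  False, False),
    "TRPO":  (False, True,  True,  True),
}


def candidates(recurrent=False, box=False, discrete=False, multiprocessing=False):
    """Return the algorithms that support every capability set to True."""
    required = (recurrent, box, discrete, multiprocessing)
    return [
        name
        for name, flags in ALGORITHMS.items()
        # A flag only matters when the caller asked for it.
        if all(have or not need for need, have in zip(required, flags))
    ]


# Usage: algorithms that can pair a recurrent policy with discrete actions.
print(candidates(recurrent=True, discrete=True))
```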
Policies
Available Policies
MlpPolicy | Policy object that implements actor critic, using an MLP (2 layers of 64) |
MlpLstmPolicy | Policy object that implements actor critic, using LSTMs with an MLP feature extraction |
MlpLnLstmPolicy | Policy object that implements actor critic, using layer-normalized LSTMs with an MLP feature extraction |
CnnPolicy | Policy object that implements actor critic, using a CNN (the Nature CNN) |
CnnLstmPolicy | Policy object that implements actor critic, using LSTMs with a CNN feature extraction |
CnnLnLstmPolicy | Policy object that implements actor critic, using layer-normalized LSTMs with a CNN feature extraction |
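# Train A2C with an LSTM policy on gym-anytrading; assumes `df` is a price DataFrame loaded beforehand
import gym
import gym_anytrading  # registers 'stocks-v0'
from stable_baselines import A2C
from stable_baselines.common.vec_env import DummyVecEnv

env_maker = lambda: gym.make('stocks-v0', df=df, frame_bound=(5, 250), window_size=5)
env = DummyVecEnv([env_maker])
model = A2C('MlpLstmPolicy', env, verbose=1)
model.learn(total_timesteps=1000000)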
import gym
import tensorflow as tf

from stable_baselines import PPO2

# Custom MLP policy of two layers of size 32 each with tanh activation function
policy_kwargs = dict(act_fun=tf.nn.tanh, net_arch=[32, 32])
# Create the agent
model = PPO2("MlpPolicy", "CartPole-v1", policy_kwargs=policy_kwargs, verbose=1)
# Retrieve the environment
env = model.get_env()
# Train the agent
model.learn(total_timesteps=100000)
# Save the agent
model.save("ppo2-cartpole")

del model
# the policy_kwargs are automatically loaded
model = PPO2.load("ppo2-cartpole")
import gym

from stable_baselines.common.policies import FeedForwardPolicy, register_policy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C

# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           net_arch=[dict(pi=[128, 128, 128],
                                                          vf=[128, 128, 128])],
                                           feature_extraction="mlp")

# Create and wrap the environment
env = gym.make('LunarLander-v2')
env = DummyVecEnv([lambda: env])

model = A2C(CustomPolicy, env, verbose=1)
# Train the agent
model.learn(total_timesteps=100000)
# Save the agent
model.save("a2c-lunar")

del model
# When loading a model with a custom policy
# you MUST pass explicitly the policy when loading the saved model
model = A2C.load("a2c-lunar", policy=CustomPolicy)
The net_arch parameter of FeedForwardPolicy lets you specify the number and size of the hidden layers and how many of them are shared between the policy network and the value network. It is assumed to be a list with the following structure:
- Any number of integers (zero allowed), each specifying the number of units in a shared layer. If there are no integers, there will be no shared layers.
- An optional dict specifying the non-shared layers that follow for the value network and the policy network. It is formatted like dict(vf=[<value layer sizes>], pi=[<policy layer sizes>]). If either key (pi or vf) is missing, no non-shared layers (an empty list) are assumed for that network.
In short: [<shared layers>, dict(vf=[<non-shared value network layers>], pi=[<non-shared policy network layers>])].
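To make the structure concrete, the sketch below splits a net_arch list into its shared, policy, and value parts following the rules just described. `split_net_arch` is a hypothetical helper written for illustration; it is not the parsing code Stable Baselines uses internally, and rejecting anything that follows the dict is an assumption of this sketch.

```python
def split_net_arch(net_arch):
    """Split a net_arch list into (shared, pi, vf) lists of layer sizes."""
    shared, pi, vf = [], [], []
    for idx, item in enumerate(net_arch):
        if isinstance(item, int):
            # Integers describe layers shared by the policy and value networks.
            shared.append(item)
        elif isinstance(item, dict):
            # Assumption of this sketch: the dict, if present, must come last.
            if idx != len(net_arch) - 1:
                raise ValueError("the dict must be the last element of net_arch")
            # A missing key means no non-shared layers for that network.
            pi = list(item.get("pi", []))
            vf = list(item.get("vf", []))
        else:
            raise TypeError("net_arch items must be ints or a trailing dict")
    return shared, pi, vf


# Usage: a shared layer of 128, then separate policy and value heads.
shared, pi, vf = split_net_arch([128, dict(vf=[256], pi=[16])])
print("shared:", shared, "policy:", pi, "value:", vf)
```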
Examples
Two shared layers of size 128: net_arch=[128, 128]
          obs
           |
         <128>
           |
         <128>
   /               \
action            value
Value network deeper than policy network, first layer shared: net_arch=[128, dict(vf=[256, 256])]
          obs
           |
         <128>
   /               \
action             <256>
                     |
                   <256>
                     |
                   value
Initially shared then diverging: net_arch=[128, dict(vf=[256], pi=[16])]
          obs
           |
         <128>
   /               \
 <16>             <256>
   |                |
action            value