# Reinforcement Learning Methods

**Keywords**: *deep reinforcement learning, quantitative trading, value-based methods, policy-based methods, model-based methods*

**Overview**

**Deep reinforcement learning** (DRL), which balances exploration (of uncharted territory) and exploitation (of current knowledge), has been described in the literature as a promising approach to automating trading in quantitative finance [__1__][__2__][__3__][__4__][__5__][__6__]. Over the years, extensive efforts have been devoted to **developing artificial intelligence techniques** for **finance research** and applications. AI methods can help **quantitative trading** by automating **market-condition recognition** and **trading-strategy execution**; quantitative trading is indeed known for its high degree of **automation** and **continuity**.

**Reinforcement learning** techniques, whose objective is to **optimize the performance of an agent** within an unknown environment (modelled as a **Markov decision process, MDP**), are under highly active development, with new solutions introduced and improved upon regularly. Among the critical tools of autonomous learning are **policy search** and **value function** approximation. Policy search aims to **find an optimal** (stochastic) policy using **gradient-based** or **gradient-free approaches**, handling both continuous and discrete state-action settings [__7__]. The **value function** strategy consists in estimating the expected return in order to derive the optimal policy by comparing all possible actions in a given state.
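To make the value-function idea concrete, here is a minimal sketch of value iteration on a tiny, purely illustrative two-state MDP (the states, actions, and rewards below are invented for the example, not taken from any cited work):

```python
# Hedged sketch: value iteration on a hypothetical deterministic 2-state MDP.
# States: 0 and 1; actions: 0 ("stay") and 1 ("move"). All values illustrative.
import numpy as np

# P[(s, a)] -> (next_state, reward)
P = {
    (0, 0): (0, 0.0),
    (0, 1): (1, 1.0),
    (1, 0): (1, 2.0),  # staying in state 1 pays the best reward
    (1, 1): (0, 0.0),
}
gamma = 0.9
V = np.zeros(2)
for _ in range(300):
    # Bellman optimality backup: V(s) = max_a [ r(s,a) + gamma * V(s') ]
    V_new = np.array([
        max(P[(s, a)][1] + gamma * V[P[(s, a)][0]] for a in (0, 1))
        for s in (0, 1)
    ])
    if np.max(np.abs(V_new - V)) < 1e-9:
        V = V_new
        break
    V = V_new
# Here V converges to [19, 20]: state 1 earns 2 per step forever (2/(1-0.9)=20),
# and state 0 earns 1 for moving there, then 0.9 * 20.
```

Acting greedily with respect to the converged `V` then recovers the optimal policy, which is exactly the sense in which the value function "defines" a policy.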

**Figure 1**: *Interaction between agent and environment, and the general structure of deep reinforcement learning approaches* [__8__]

**Policy-based methods**

**Policy** is what we are trying to find when solving a reinforcement learning problem. In other words, when the agent receives an observation and must decide on its next step, what it needs is a policy, rather than the value of the state or of a particular action [__9__]. In **policy-based methods**, rather than learning a value function that tells us the expected sum of rewards given a state and an action, we directly learn the **policy function** that **maps states to actions**, i.e., we select actions without employing a value function. **Policy learning** assumes a **parametric policy space**; the learning task consists in finding the **parameters** that maximize the correspondence of the policy with the observed preferences [__10__]. The most commonly used algorithms are the **policy gradient (PG) methods**, whose tactic is to find a **neural-network-parameterized policy** that maximizes the expected cumulative reward [__11__].
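The policy-gradient tactic can be sketched with REINFORCE, the simplest PG algorithm, on an illustrative two-armed bandit (the reward probabilities and step counts below are invented for the example; a real trading policy would use a neural network rather than two logits):

```python
# Hedged sketch: REINFORCE (a policy-gradient method) on a toy 2-armed bandit.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)            # policy parameters: one logit per action
arm_means = [0.2, 0.8]         # illustrative Bernoulli reward probabilities
alpha = 0.1                    # learning rate

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(5000):
    probs = softmax(theta)
    a = int(rng.choice(2, p=probs))          # sample an action from the policy
    r = float(rng.random() < arm_means[a])   # Bernoulli reward
    # REINFORCE: theta += alpha * r * grad log pi(a | theta),
    # where grad log pi(a) = one_hot(a) - probs for a softmax policy.
    grad_log = -probs
    grad_log[a] += 1.0
    theta += alpha * r * grad_log
```

After training, the policy's probability mass shifts toward the better-paying arm: the parameters are adjusted directly in the direction that makes rewarded actions more likely, with no value function involved.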

**Value-based methods**

The value-based class of algorithms develops a **value function** that in turn defines a **policy**. **Value**, defined as the **discounted total reward**, is what **we can gather from a state** or by **issuing a particular action from that state**. If the value is known, our decision at every step becomes simple and obvious: we act greedily with respect to value, which guarantees a good total reward at the end of the episode. So, the values of states (in the case of the value iteration method) or of state-action pairs (in the case of **Q-learning**, where *Q means quality* [__Link__]) stand between us and the best reward. Q-learning approximates the expected return, referred to as the Q-value, via iterative updates of a Q-table; the values of this table are obtained from the **Bellman equation** [__9__].
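A minimal sketch of the iterative Q-table update, using an illustrative deterministic two-state MDP (states, actions, and rewards are invented for the example):

```python
# Hedged sketch: tabular Q-learning with the Bellman update on a toy 2-state MDP.
import numpy as np

rng = np.random.default_rng(1)
# P[(s, a)] -> (next_state, reward); staying in state 1 pays the best reward.
P = {(0, 0): (0, 0.0), (0, 1): (1, 1.0),
     (1, 0): (1, 2.0), (1, 1): (0, 0.0)}
gamma, alpha, eps = 0.9, 0.1, 0.1
Q = np.zeros((2, 2))           # Q-table: rows = states, columns = actions

s = 0
for _ in range(20000):
    # Epsilon-greedy exploration over the current Q-table.
    a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = P[(s, a)]
    # Bellman update: move Q(s,a) toward r + gamma * max_a' Q(s', a').
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
# Q[1, 0] converges toward 2 + 0.9 * 20 = 20, the true optimal Q-value.
```

Deep Q-learning replaces this table with a neural network that outputs one Q-value per action, but the update rule is the same Bellman target.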

**Model-based methods**

In **model-based RL**, the agent explicitly **constructs a transition model** of the environment, usually represented by a deep network. This is often hard, depending on the complexity of the environment, but it offers advantages over model-free approaches: for example, the model is generative, which means the agent can draw samples from its own model of the environment and thereby reduce, or even avoid, interaction with the real environment.
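The idea can be sketched with a count-based model on the same kind of illustrative two-state MDP used above (a deep network would replace the count tables for realistic environments; all quantities here are invented for the example): fit a transition and reward model from logged interactions, then plan entirely inside the learned model.

```python
# Hedged sketch: count-based model learning, then planning on the learned model.
import numpy as np

rng = np.random.default_rng(2)
# Ground-truth dynamics, only used to generate experience: P[(s,a)] -> (s', r).
true_P = {(0, 0): (0, 0.0), (0, 1): (1, 1.0),
          (1, 0): (1, 2.0), (1, 1): (0, 0.0)}

# 1) Collect experience with a random behavior policy.
counts = np.zeros((2, 2, 2))   # counts[s, a, s']
rew_sum = np.zeros((2, 2))
s = 0
for _ in range(5000):
    a = int(rng.integers(2))
    s_next, r = true_P[(s, a)]
    counts[s, a, s_next] += 1
    rew_sum[s, a] += r
    s = s_next

# 2) Fit the model: empirical transition probabilities and mean rewards.
n = counts.sum(axis=2, keepdims=True)
P_hat = counts / np.maximum(n, 1)          # P_hat[s, a, s'] ~ Pr(s' | s, a)
R_hat = rew_sum / np.maximum(n[..., 0], 1)

# 3) Plan with value iteration on the learned model -- no further
#    environment interaction is needed from this point on.
gamma, V = 0.9, np.zeros(2)
for _ in range(300):
    V = (R_hat + gamma * P_hat @ V).max(axis=1)
```

Because the environment here is deterministic, the learned model matches the true dynamics and planning recovers the optimal values; in realistic settings, model error is the main trade-off against the reduced interaction cost.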

**Conclusion**

The impact of **Automated Trading Systems** on financial markets grows year after year, and trades generated by algorithms now account for the majority of orders arriving at stock exchanges. Reinforcement learning, like other AI techniques, provides a competitive edge in many industries, including the one of interest here, quantitative finance, where it is employed by sophisticated hedge funds. In this article, three main categories of reinforcement learning algorithms - value-based, policy-based, and model-based - have been discussed. In a future article, the application of RL to quantitative trading will be explored further.

**Related work**

[__Link__] *Machine Learning in Trading: Review of Q-Learning*, 2021