Reinforcement Learning Methods

Keywords: deep reinforcement learning, quantitative trading, value-based methods, policy-based methods, model-based methods



Deep reinforcement learning (DRL), which balances exploration (of uncharted territory) with exploitation (of current knowledge), has been described in the literature as a promising approach to automating trading in quantitative finance [1][2][3][4][5][6]. Over the years, extensive efforts have been devoted to developing artificial intelligence techniques for finance research and applications. AI methods can help quantitative trading by automating market-condition recognition and trading-strategy execution; quantitative trading is indeed known for its high degree of automation and continuity.

Reinforcement learning techniques, whose objective is to optimize the performance of an agent within an unknown environment (modelled as a Markov Decision Process, MDP), are under highly active development, with new solutions regularly introduced and improved upon. Among the key tools of autonomous learning are policy search and value-function approximation. Policy search in RL aims to find an optimal (stochastic) policy, applying gradient-based or gradient-free approaches and handling both continuous and discrete state-action settings [7]. The value-function strategy consists in estimating the expected return of each possible action in a given state in order to derive the optimal policy.
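To make the MDP framing concrete, here is a minimal sketch of a finite MDP with two states and two actions. The state names, transition probabilities, and rewards are purely illustrative toy values, not taken from any real trading setting:

```python
# A minimal finite MDP: states, actions, transition probabilities, rewards.
# All names and numbers below are illustrative assumptions.
import random

states = ["low", "high"]            # e.g. two volatility regimes (toy labels)
actions = ["hold", "trade"]

# transitions[(state, action)] -> list of (next_state, probability, reward)
transitions = {
    ("low", "hold"):   [("low", 0.9, 0.0), ("high", 0.1, 0.0)],
    ("low", "trade"):  [("low", 0.7, 1.0), ("high", 0.3, -0.5)],
    ("high", "hold"):  [("low", 0.2, 0.0), ("high", 0.8, 0.0)],
    ("high", "trade"): [("low", 0.4, 2.0), ("high", 0.6, -1.0)],
}

def step(state, action):
    """Sample a (next_state, reward) pair from the transition model."""
    outcomes = transitions[(state, action)]
    r = random.random()
    cumulative = 0.0
    for next_state, prob, reward in outcomes:
        cumulative += prob
        if r <= cumulative:
            return next_state, reward
    return outcomes[-1][0], outcomes[-1][2]

next_state, reward = step("low", "trade")
```

An RL agent interacts with such an environment only through `step`, observing states and rewards; the transition table itself is hidden from it.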

Figure 1: Interaction between agent and environment and the general structure of the deep reinforcement learning approaches [8]

Policy-based methods

A policy is what we are trying to find when solving a reinforcement learning problem. In other words, when the agent receives an observation and must decide on its next step, what it needs is a policy, rather than the value of the state or of a particular action [9]. In policy-based methods, rather than learning a value function that tells us the expected sum of rewards given a state and an action, we learn directly the policy function that maps states to actions, meaning we select actions without employing a value function. Policy learning assumes a parametric policy space; the learning task consists in finding the parameters that maximize the correspondence of the policy with the observed preferences [10]. The most commonly used algorithms are policy gradient (PG) methods, whose tactic is to parameterize the policy with a neural network and train it to maximize the expected cumulative reward [11].
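As a minimal sketch of the policy-gradient idea (a REINFORCE-style update on a one-step toy problem, not a trading strategy), the snippet below parameterizes a softmax policy over two actions and nudges the parameters in the direction of the log-probability gradient, weighted by the reward. The reward function and learning rate are illustrative assumptions:

```python
# REINFORCE-style policy gradient on a toy one-step problem.
# theta holds one softmax preference per action; no neural network is used
# here, only the same gradient rule a network-parameterized policy would follow.
import math
import random

theta = [0.0, 0.0]   # policy parameters (softmax preferences)
alpha = 0.1          # learning rate (assumed value)

def softmax(prefs):
    exps = [math.exp(p) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs):
    return 0 if random.random() < probs[0] else 1

def reward(action):
    # toy reward model: action 1 pays more on average
    return random.gauss(1.0 if action == 1 else 0.0, 0.1)

for _ in range(2000):
    probs = softmax(theta)
    a = sample_action(probs)
    r = reward(a)
    # gradient of log softmax probability: 1[i == a] - probs[i]
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * r * grad
```

After training, the policy assigns most of its probability to the higher-reward action; a real PG method applies the same update through the weights of a neural network.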

Value-based methods

The value-based class of algorithms develops a value function that in turn defines a policy. The value, defined as the discounted total reward, is what we can gather from a state, or by issuing a particular action from that state. If the value is known, our decision at every step becomes simple and obvious: we just act greedily with respect to value, which guarantees a good total reward at the end of the episode. So the values of states (in the case of the Value Iteration method) or of state-action pairs (in the case of Q-learning, where Q stands for quality, [Link]) stand between us and the best reward. Q-learning approximates the expected return, referred to as the Q-value, via an iterative update of a Q-table; the values of this table are obtained by the Bellman equation [9].
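The iterative Q-table update can be sketched as follows. This is a minimal tabular Q-learning example on a toy loop (state 0, where action 1 pays reward 1); the hyperparameters and environment are illustrative assumptions, not a trading setup:

```python
# Minimal tabular Q-learning with an epsilon-greedy policy.
# Bellman-style update: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))
import random

alpha, gamma, epsilon = 0.5, 0.9, 0.1   # assumed hyperparameters
actions = [0, 1]
Q = {}   # Q-table: (state, action) -> estimated return

def get_q(s, a):
    return Q.get((s, a), 0.0)

def choose_action(s):
    # epsilon-greedy: mostly exploit the table, sometimes explore
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: get_q(s, a))

def update(s, a, r, s_next):
    best_next = max(get_q(s_next, a2) for a2 in actions)
    Q[(s, a)] = get_q(s, a) + alpha * (r + gamma * best_next - get_q(s, a))

# toy interaction loop: in state 0, action 1 yields reward 1, action 0 yields 0
for _ in range(200):
    s = 0
    a = choose_action(s)
    r = 1.0 if a == 1 else 0.0
    update(s, a, r, s)
```

After a few hundred updates the table ranks action 1 above action 0 in state 0, and greedy action selection recovers the better choice directly from the learned values.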

Model-based methods

In model-based RL, the agent explicitly constructs a transition model of the environment, usually represented by a deep network. This is often hard, depending on the complexity of the environment, but it offers some advantages over model-free approaches: for example, the model is generative, which means the agent can draw samples from its model of the environment and can therefore plan while avoiding further interaction with the real environment.
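The generative property can be illustrated with a much simpler stand-in for a deep network: fitting an empirical transition model from observed counts, then sampling imagined transitions from it without touching the real environment. The logged triples below are toy data, not market observations:

```python
# Fit an empirical transition model from observed (state, action, next_state)
# triples, then generate imagined samples from it. All data are toy values.
import random
from collections import defaultdict

observed = [(0, 1, 1), (0, 1, 1), (0, 1, 0), (1, 0, 0), (1, 0, 0)]

# count next-state occurrences per (state, action) pair
counts = defaultdict(lambda: defaultdict(int))
for s, a, s_next in observed:
    counts[(s, a)][s_next] += 1

# normalize counts into probabilities: (state, action) -> {next_state: prob}
model = {}
for key, nexts in counts.items():
    total = sum(nexts.values())
    model[key] = {s_next: n / total for s_next, n in nexts.items()}

def sample_next(s, a):
    """Generate an imagined next state from the learned model."""
    dist = model[(s, a)]
    r, cumulative = random.random(), 0.0
    for s_next, p in dist.items():
        cumulative += p
        if r <= cumulative:
            return s_next
    return s_next

imagined = [sample_next(0, 1) for _ in range(5)]
```

A model-based agent would roll out many such imagined trajectories to plan or to train its policy, spending real environment interactions only to refine the model.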


The impact of automated trading systems on financial markets is growing year on year, and trades generated by algorithms now account for the majority of orders arriving at stock exchanges. Reinforcement learning, like other AI techniques, offers a competitive edge across industries, including the one of interest here, quantitative finance, where it is employed by sophisticated hedge funds. In this article, three main categories of reinforcement learning algorithms (value-based, policy-based, and model-based) have been discussed. A future article will explore the application of RL to quantitative trading further.

Related work

[Link] Machine Learning in Trading: Review of Q-Learning, 2021
