Reinforcement learning in quantitative trading

Financial trading has been a widely researched topic, and different methods have been proposed to trade markets over the last few decades, including fundamental analysis [1], technical analysis [2] and algorithmic trading [3]. In practice, many practitioners use a hybrid of these techniques to make trades [4]. AI-based trading, and the reinforcement learning approach in particular, has attracted considerable interest in industry and academia alike.

Finance-related research using reinforcement learning (RL) has been conducted mainly to enhance the performance of trading algorithms. Moody and Saffell [5] conducted a study on optimal portfolios, asset allocation, and trading systems using recurrent reinforcement learning, which became the foundation for a significant amount of research. In trading systems research, RL is typically applied in a model-free setting: market conditions are fed to the agent as the state, and a reward function scores the agent's trading decisions.

Quantitative trading

Quantitative trading is a way of mining profit from historical market data, and it can be performed automatically by algorithms. The overall purpose of quantitative trading tasks is to maximize long-term profit under a certain risk tolerance. More precisely, algorithmic trading generates profit by consistently buying and selling a given financial instrument; portfolio management tries to preserve a well-balanced portfolio across a variety of financial assets; order execution aims to fulfil an order specified by a strategy at the best price, ideally with minimum execution cost; and market making provides liquidity to the market, profiting from the spread between buy and sell orders [6].

Figure 1: Relationship between quantitative trading tasks [6]

To design profitable quantitative trading strategies, the advantages of RL methods are four-fold:

  1. RL allows training an end-to-end agent, which takes available market data as its input state and outputs trading actions directly.

  2. RL-based methods bypass the difficult task of predicting future stock prices and instead optimize the overall profit directly.

  3. Task-specific constraints (e.g., transaction cost, liquidity and slippage) can easily be incorporated into RL objectives.

  4. RL methods are said to have the potential to generalize across different market conditions.

Before diving into RL methods, let us introduce a few key concepts:

  • Agent : makes the decision on what actions to take in order to collect maximum reward.

  • Environment : task or simulation that gives an observable state.

  • State : the current situation of the agent, as observed from the environment.

  • Action : the decision taken by the agent based on the current state.

  • Reward : result returned by the environment, which can be positive or negative.

These terms are purposely vague as each of those concepts will have a different meaning, which will depend on the environment and the task at hand. In the context of trading, here's what they mean:

  • The agent is the trader, who has access to a brokerage account, periodically checks market conditions, and makes trading decisions.

  • The environment is the market, which provides feedback in the form of profit and loss.

  • The state consists of statistics about the current market, which could be any number of things such as daily moving averages, high of day, volume, etc.

  • The actions can be entering or closing a position (long or short), and many more, such as choosing an asset.

  • The reward can be profit or loss, volatility, the Sharpe ratio, or any other performance or risk metric.
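To make these components concrete, the interaction loop can be sketched in Python. Everything below (the ToyTradingEnv class, the random-walk price model, the always-long placeholder policy) is hypothetical and only meant to show how state, action, and reward fit together:

```python
import random

class ToyTradingEnv:
    """Toy environment: the price follows a random walk and the agent
    holds a position of +1 (long), 0 (flat), or -1 (short).
    All names and dynamics here are illustrative assumptions."""

    def __init__(self, n_steps=100, seed=0):
        self.n_steps = n_steps
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        self.price = 100.0
        self.position = 0
        return (self.price, self.position)           # initial state

    def step(self, action):                          # action in {-1, 0, +1}
        prev_price = self.price
        self.price += self.rng.gauss(0.0, 1.0)       # random-walk price move
        self.position = action
        reward = self.position * (self.price - prev_price)  # P&L as reward
        self.t += 1
        done = self.t >= self.n_steps
        return (self.price, self.position), reward, done

# The agent-environment loop: observe state, act, collect reward.
env = ToyTradingEnv()
state, total = env.reset(), 0.0
done = False
while not done:
    action = 1                                       # placeholder policy: always long
    state, reward, done = env.step(action)
    total += reward
```

A real RL agent would replace the placeholder policy with one learned from the reward signal, and the state would carry the market statistics listed above rather than just the price and position.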

Reinforcement learning

In accordance with the requirements of quantitative trading, reinforcement learning methods act as direct adaptive optimal control of nonlinear systems, and they can be split into three categories:

  • Critic-only approach is frequently applied in financial markets. The idea of this approach is to learn a value function based on which the agent can compare (“criticize”) the expected outcomes of different actions, e.g., “to go long” or “to go short”. During the decision making process, the agent senses the current state of the environment and selects the action with the best outcome according to the value function [7].

Table 1 summarizes the main works, their year of publication, as well as the data sample and resolution.

Table 1: Overview of surveyed works applying the critic-only approach [7].
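As a minimal sketch of the critic-only idea, the snippet below applies tabular Q-learning, a classic value-function method, to a toy market whose state is just the sign of the last price move. The discretization, reward, and all names are illustrative assumptions, not a method taken from the surveyed works:

```python
import random
from collections import defaultdict

# Critic-only sketch: learn Q(state, action) and act greedily on it.
ACTIONS = ["long", "short"]
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = defaultdict(float)                   # Q[(state, action)] -> value estimate
rng = random.Random(0)

def select_action(state):
    """Epsilon-greedy choice based on the learned value function."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """One Q-learning step: move Q toward the bootstrapped target."""
    target = reward + gamma * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (target - Q[(state, action)])

# Toy episode: the state is whether the last price move was up or down.
price_moves = [1, -1, 1, 1, -1]
state = "up"
for move in price_moves:
    action = select_action(state)
    reward = move if action == "long" else -move   # P&L of the position
    next_state = "up" if move > 0 else "down"
    q_update(state, action, reward, next_state)
    state = next_state
```

The value function Q is the "critic": actions are never generated directly, only ranked by their expected outcome, which is exactly the decision rule described above.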

  • Actor-only is the second most common approach. Hereby, the agent senses the state of the environment and acts directly, i.e., without computing and comparing the expected outcomes of different actions. The agent hence learns a direct mapping (a policy) from states to actions. The key advantage of this approach is the continuous action space of the agent (e.g., to obtain fine-grained portfolio weights) and the typically faster convergence of the learning process [7].

Table 2 summarizes the main works that have been done in this category of RL.

Table 2: Overview of surveyed works applying the actor-only approach [7].
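A minimal actor-only sketch, under toy assumptions: the policy maps states directly to an action distribution over {long, short} through a single parameter, updated with a REINFORCE-style gradient. The logistic parameterization and the drifting market model are illustrative assumptions, not the setup of any specific surveyed work:

```python
import math
import random

# Actor-only sketch: learn a policy directly, with no value function.
rng = random.Random(0)
theta = 0.0                              # single policy parameter
lr = 0.05

def p_long(theta):
    """Probability of going long under the logistic (softmax) policy."""
    return 1.0 / (1.0 + math.exp(-theta))

def sample_action(theta):
    return "long" if rng.random() < p_long(theta) else "short"

# Toy market with a slight upward drift, so going long pays on average.
for episode in range(500):
    action = sample_action(theta)
    market_move = rng.gauss(0.1, 1.0)
    reward = market_move if action == "long" else -market_move
    # REINFORCE: grad of log pi(action | theta) for the logistic policy.
    grad = (1 - p_long(theta)) if action == "long" else -p_long(theta)
    theta += lr * reward * grad
```

Because the policy outputs probabilities rather than discrete value comparisons, the same scheme extends naturally to continuous actions such as fine-grained portfolio weights, which is the key advantage noted above.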

  • Actor-critic approach forms the third category and aims at combining the advantages of the actor-only and the critic-only approach. The key idea is to simultaneously use an actor, which determines the agent’s action given the current state of the environment, and a critic, which judges the selected action. Simply speaking, the actor learns to choose the action which is considered best by the critic and the critic learns to improve its judgment. Despite its potential advantages, actor-critic RL is the least well researched approach in financial markets and has only a limited number of supporting works [7].

Table 3 summarizes the main works, their year of publication, as well as the data sample and resolution.

Table 3: Overview of surveyed works applying the actor-critic approach [7].
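The interplay described above can be sketched as follows, again under toy assumptions: the critic maintains a per-state value table, the actor a per-state policy parameter, and the critic's TD error "judges" each selected action and drives both updates. All names and the two-state market model are hypothetical:

```python
import math
import random

# Actor-critic sketch: the actor picks actions, the critic scores them.
rng = random.Random(0)
gamma, actor_lr, critic_lr = 0.99, 0.05, 0.1
theta = {s: 0.0 for s in ("up", "down")}   # actor: policy parameter per state
V = {s: 0.0 for s in ("up", "down")}       # critic: state-value estimates

def p_long(state):
    """Probability of going long in this state under the actor's policy."""
    return 1.0 / (1.0 + math.exp(-theta[state]))

state = "up"
for step in range(200):
    action = "long" if rng.random() < p_long(state) else "short"
    move = rng.gauss(0.0, 1.0)
    reward = move if action == "long" else -move
    next_state = "up" if move > 0 else "down"
    # Critic: the TD error is its judgment of the selected action.
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += critic_lr * td_error
    # Actor: shift the policy toward actions the critic approves of.
    grad = (1 - p_long(state)) if action == "long" else -p_long(state)
    theta[state] += actor_lr * td_error * grad
    state = next_state
```

The actor learns to choose the action the critic considers best, while the critic simultaneously refines its judgment from observed rewards, mirroring the description above.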


In this article, we have discussed how reinforcement learning methods are used in quantitative trading. In a future article, we will discuss the two main categories of RL: policy-based learning and value-based learning [8,9,10,11].
