
Q-Learning

A basic form of Reinforcement Learning

Throughout our lives, we perform many actions in pursuit of our goals. Some of them bring us good rewards and others do not. That everyday pattern of acting and learning from the outcome is a fair analogy for reinforcement learning. Q-Learning is a basic form of reinforcement learning that uses Q-values (also called action values) to iteratively improve the behavior of the learning agent. In this quick post we'll discuss Q-learning and provide the basic background needed to understand the algorithm.

Q-learning is a model-free, off-policy reinforcement learning algorithm that finds the best course of action given the current state of the agent. Depending on where the agent is in the environment, it decides the next action to take.

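The objective of the model is to find the best course of action given its current state. Off-policy means that the policy the agent learns about (the greedy policy, which always takes the highest-valued action) can differ from the policy it actually follows while gathering experience (for example, one that sometimes takes random exploratory actions). The agent can therefore learn the best course of action even while acting outside it.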

Model-free means that the agent does not build or use a model of the environment, that is, predictions of how the environment will respond to its actions. Instead, it learns by trial and error, directly from the rewards it observes.
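The off-policy distinction can be sketched in a few lines. This is a minimal illustration, not code from the post: the epsilon-greedy exploration rule, the function names, and the sample Q-values are all assumptions.

```python
import random

def behavior_action(q_row, epsilon, rng):
    """Behavior policy (assumed epsilon-greedy): what the agent actually does."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))              # explore: any action
    return max(range(len(q_row)), key=lambda a: q_row[a])  # exploit: best known

def greedy_target_value(q_row):
    """Target policy: the value Q-learning learns from, always the best action."""
    return max(q_row)

rng = random.Random(0)
q_row = [0.2, 0.9, 0.5]   # made-up Q-values for three actions in one state
taken = [behavior_action(q_row, epsilon=0.3, rng=rng) for _ in range(1000)]
```

The agent sometimes takes actions 0 or 2 while exploring, yet the value it learns from is always that of the best action (0.9 here). That gap between what is done and what is learned about is what off-policy refers to.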

What’s this ‘Q’? 

The ‘Q’ in Q-learning stands for quality. Quality here represents how useful a given action is in gaining some future reward. 

Basic idea of Q-learning 

 There are three components of Q-Learning: 

  • The State represents the current position of the agent in the environment. States come from a discrete set of environment states, S.

  • The Action is the step taken by the agent when it is in a particular state. Actions come from a discrete set of agent actions, A.

  • The Reward is the positive or negative feedback the agent receives for every action, given by a reward function R: S × A → ℝ.

The basic idea is to map system states to actions so as to maximize the expected future reward.
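To make the three components concrete, here is a small sketch on a hypothetical two-state toy problem. The states, actions, and reward values are invented for illustration and do not come from the post.

```python
# Hypothetical toy problem: a room that is either too cold or too hot.
STATES = ["cold", "hot"]      # S: discrete set of environment states
ACTIONS = ["heat", "cool"]    # A: discrete set of agent actions

def reward(state, action):
    """R: S x A -> real number. +1 for the sensible action, -1 otherwise."""
    good_pairs = {("cold", "heat"), ("hot", "cool")}
    return 1.0 if (state, action) in good_pairs else -1.0

# A policy maps each state to an action. In this toy the immediate reward is
# enough to pick the action; Q-learning extends the idea to long-term reward.
policy = {s: max(ACTIONS, key=lambda a: reward(s, a)) for s in STATES}
```

Q-learning arrives at such a mapping without being given the reward function up front: it only sees the rewards returned for the actions it tries.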


An example of Q-learning is an advertisement recommendation system. In a normal ad recommendation system, the ads you get are based on your previous purchases or websites you may have visited. If you've bought a TV, you will be recommended TVs of various brands.

Using Q-learning, we can optimize the ad recommendation system to recommend products that are frequently bought together. The agent is rewarded when the user clicks on the suggested product.

Billions of ads are served each day on various digital platforms, all with the simple purpose of appealing to viewers. The reinforcement model picks out the ad that a user is most likely to click on.


While running the algorithm, we will come across various solutions and the agent will take multiple paths. The best among them is found by tabulating our findings in a table called a Q-Table. 

A Q-Table helps us find the best action for each state in the environment. At each step we use the Bellman Equation to estimate the expected future reward of the action just taken, and we store that estimate in the table so it can be compared with the estimates for other actions.
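The post does not spell the equation out. In its standard form, the Bellman-based update for a transition from state s to state s' via action a with reward r can be written as follows. The learning rate α and the discount factor γ are the usual symbols, assumed here because the post does not name them.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

Here 0 < α ≤ 1 controls how far each estimate moves toward the new target, and 0 ≤ γ < 1 sets how much future reward counts relative to immediate reward.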

Let us create a Q-Table for an agent that must learn to run, fetch, and sit on command. The steps taken to construct a Q-Table are:

Step 1: Create an initial Q-Table with all values initialized to 0 

When we start, the Q-values for all state-action pairs are 0. Consider the Q-Table shown below, which shows a dog simulator learning to perform actions:

Initial Q-Table
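A minimal sketch of Step 1 follows. The post's table is an image that did not survive, so the layout is an assumption: rows are the commands the dog hears (states) and columns are what the dog can do (actions).

```python
# Assumed layout: one row per command (state), one column per action.
STATES = ["run", "fetch", "sit"]     # the command just given
ACTIONS = ["run", "fetch", "sit"]    # what the dog can do in response

def make_q_table(states, actions):
    """Build a Q-Table as a dict of dicts with every value set to 0."""
    return {s: {a: 0.0 for a in actions} for s in states}

q_table = make_q_table(STATES, ACTIONS)
```

A dict of dicts keeps the sketch dependency-free. For larger problems a 2-D array indexed by state and action numbers is the more common choice.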

Step 2: Choose an action and perform it. Update values in the table 

This is the starting point; no action has been performed yet. Suppose we want the agent to sit first, and it does. The table will change to:

Q-Table after performing an action
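Step 2 can be sketched as below. The post does not say how the action is chosen or how the environment scores it, so both are assumptions here: an epsilon-greedy choice with random tie-breaking, and a hypothetical reward of +1 when the dog obeys the command and -1 otherwise.

```python
import random

ACTIONS = ["run", "fetch", "sit"]

def choose_action(q_row, epsilon, rng):
    """Random action with probability epsilon, otherwise the best known one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q_row.values())
    # Break ties at random so an all-zero row does not always pick "run".
    return rng.choice([a for a in ACTIONS if q_row[a] == best])

def perform(command, action):
    """Hypothetical environment: +1 if the dog obeys the command, else -1."""
    return 1.0 if action == command else -1.0

rng = random.Random(42)
sit_row = {"run": 0.0, "fetch": 0.0, "sit": 0.0}   # Q-Table row for "sit"
action = choose_action(sit_row, epsilon=0.1, rng=rng)
reward = perform("sit", action)
```

With an all-zero row the first choice is effectively random, which is why early episodes are dominated by trial and error.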

Step 3: Get the value of the reward and calculate the Q-value using the Bellman Equation

For the action performed, we need to calculate the value of the actual reward and the Q(S, A) value.

Updating Q-Table with Bellman Equation
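The calculation in Step 3 can be sketched as a single function. The learning rate alpha, the discount factor gamma, and the reward of +1 are illustrative assumptions; the post does not give numeric values.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.5, gamma=0.9):
    """Return the new Q(S, A) after one observed transition."""
    target = reward + gamma * max_next_q    # reward now + discounted best future value
    return q_sa + alpha * (target - q_sa)   # move part of the way toward the target

# The dog sits on command for the first time: old value 0, reward +1,
# and the next state's row is still all zeros.
first = q_update(q_sa=0.0, reward=1.0, max_next_q=0.0)
# first == 0.5

# The same pair is rewarded again, now with a best next-state value of 0.5.
second = q_update(q_sa=first, reward=1.0, max_next_q=0.5)
# second is approximately 0.975
```

Each update moves the stored value only a fraction alpha of the way toward the new target, so a single lucky or unlucky outcome cannot overwrite what the agent has already learned.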

Step 4: Continue the same until the table is filled or an episode ends 

The agent continues acting; for each action, the reward and Q-value are calculated and the table is updated.

Final Q-Table at end of an episode
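Putting the four steps together gives a loop like the sketch below. It is one possible implementation under stated assumptions, not the post's own code: states are the commands given, the hypothetical environment pays +1 for obeying and -1 otherwise, the next command is random, and the hyperparameter values are arbitrary.

```python
import random

COMMANDS = ["run", "fetch", "sit"]   # states (assumed): the command just given
ACTIONS = ["run", "fetch", "sit"]    # what the dog can do

def step(state, action, rng):
    """Hypothetical environment: score the action, then issue a new command."""
    reward = 1.0 if action == state else -1.0
    return reward, rng.choice(COMMANDS)

def train(episodes=200, steps_per_episode=10,
          alpha=0.1, gamma=0.5, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # Step 1: Q-Table initialized to 0.
    q = {s: {a: 0.0 for a in ACTIONS} for s in COMMANDS}
    for _ in range(episodes):
        state = rng.choice(COMMANDS)
        for _ in range(steps_per_episode):
            # Step 2: choose an action (epsilon-greedy) and perform it.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(q[state], key=q[state].get)
            reward, next_state = step(state, action, rng)
            # Step 3: Bellman-based update of Q(S, A).
            best_next = max(q[next_state].values())
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            # Step 4: continue from the next state until the episode ends.
            state = next_state
    return q

q = train()
policy = {s: max(q[s], key=q[s].get) for s in COMMANDS}
```

After training, reading the highest value in each row gives the learned policy, which in this toy setup should map each command to the matching action.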

Real-Life Applications

Online Web Systems Auto-configuration: An RL-based approach can be implemented for automatic configuration of multi-tier web systems; the model can learn to adapt performance parameter settings, efficiently and dynamically, to both workload changes and modifications of virtual machines. According to the experiments, such systems can determine the optimal or near-optimal configuration after about 25 iterations. 

Standard Q-learning can be applied in this case, and the problem can be formulated as a finite MDP: the system's configuration represents the state space, the actions ("increase", "decrease", "keep") for each parameter form the action space, and the reward is defined as the difference between the target response time and the measured response time.
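The MDP formulation above can be sketched as follows. The parameter names, step sizes, and response times are hypothetical, and treating one action as an adjustment to a single parameter is an assumption about how the action space is built.

```python
ADJUSTMENTS = {"increase": +1, "decrease": -1, "keep": 0}

# Hypothetical tunable parameters and step sizes (not from the post).
STEP_SIZES = {"max_clients": 50, "keepalive_timeout": 5}

def actions():
    """One action = adjust a single parameter in one of three ways."""
    return [(param, adj) for param in STEP_SIZES for adj in ADJUSTMENTS]

def apply_action(config, action):
    """Return the next state: a new configuration with one parameter adjusted."""
    param, adjustment = action
    new_config = dict(config)
    new_config[param] += ADJUSTMENTS[adjustment] * STEP_SIZES[param]
    return new_config

def reward(target_ms, measured_ms):
    """Positive when the system responds faster than the target."""
    return target_ms - measured_ms

config = {"max_clients": 150, "keepalive_timeout": 15}
next_config = apply_action(config, ("max_clients", "increase"))
r = reward(target_ms=200.0, measured_ms=250.0)   # 50 ms slower than target
```

Because the reward is negative whenever the measured response time exceeds the target, the agent is pushed toward configurations that meet or beat the target.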

News Recommendations: A deep reinforcement learning framework has been applied to tackle the shortcomings of standard online recommendation systems, such as boring the user by suggesting the same or very similar items, not addressing the dynamic news environment, and failing to incorporate key forms of user feedback (return patterns) when generating recommendations.

Network Traffic Signal Control: An RL-based framework was used to determine an optimal network control policy that reduced average delays and probabilities of intersectional cross-blocking and congestion. In a simulated traffic environment, an RL agent, tasked with controlling traffic signals, was put at the central intersection. The state-space was represented by a feature vector (of 8 dimensions) whose elements each represented a relative traffic flow; 8 non-conflicting phase combinations (for each isolated intersection) were chosen as an action set and the reward was the reduction in delay, compared to previous steps. 
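The state and reward described above might be computed as in this sketch. The post does not define "relative traffic flow", so taking each approach's share of the total vehicle count is an assumption, and the counts and delays are made-up numbers.

```python
def relative_flows(vehicle_counts):
    """State vector: each approach's share of total traffic (assumed definition)."""
    total = sum(vehicle_counts)
    if total == 0:
        return [0.0] * len(vehicle_counts)
    return [count / total for count in vehicle_counts]

def delay_reward(previous_delay, current_delay):
    """Reward: how much the delay dropped compared with the previous step."""
    return previous_delay - current_delay

# Made-up vehicle counts for the 8 flows at the intersection.
state = relative_flows([10, 30, 0, 20, 5, 15, 10, 10])
r = delay_reward(previous_delay=42.0, current_delay=35.5)   # delay fell by 6.5
```

A continuous state vector like this one does not fit a plain Q-Table directly, so it would have to be discretized or paired with a function approximator.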


Reinforcement learning can address dynamic digital marketing problems and so deliver high-quality recommendations that match customers' specific preferences, needs, and behavior. It gives online marketers a simple and reliable means of maximizing ROI, and the Q-Learning algorithm can help increase your brand's Customer Lifetime Value.
Connect with the team at Merkle Sokrati to unlock the potential of reinforcement learning and unleash this state-of-the-art technology and ultimately increase the quality of your marketing outputs.