‘Learn from your mistakes’ is easier said than done. But I never thought this saying would make such an impact on technologists that they would start making machines learn from their own mistakes, so that the machines act more intelligently in the future. This act of parenting is the new cool in the tech world, and the question that always lingers in readers’ minds is how a machine can learn from the mistakes it commits. The logic behind the concept is simple and easy to understand: it is very much like how a person learns from a mistake and uses his senses to avoid committing the same mistake again.
The Idea of Reinforcement Learning
As I said earlier, Reinforcement Learning is easy to understand once you see how the machine analyzes its behavior, how it learns from its own mistakes and how it takes appropriate decisions based on that analysis. Imagine a baby trying to walk. For the first few days it watches how the people around it walk. Its learning starts with seeing how others walk, how they move and what they do while walking, and continues until it stands up and walks by itself. Whenever the baby tries to stand up but falls, it learns from the fall, gets up again and keeps trying until it starts to walk.
The Math and Science behind Reinforcement Learning
In my view, the concept of reinforcement learning was borrowed from video games, where the player (here, the machine) earns credit whenever it takes a right step towards the goal and loses credit whenever it takes a bad decision. In Reinforcement Learning, the player, called the agent, interacts with the environment by taking decisions. If a decision is correct, a reward is added to the score (which starts at 0); if it is wrong, points are deducted. This process continues until the agent wins the game. The overall score is calculated from the various decisions the agent takes, and this information is used to work out the best way of winning the game.
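To make this loop concrete, here is a minimal Python sketch of a single game. The game itself is hypothetical (an agent on positions 0 to 4 that wins on reaching position 4), and so is the scoring rule of +1 for a step towards the goal and -1 for anything else; the post does not prescribe either.

```python
GOAL = 4  # hypothetical game: positions 0..4 on a line, the agent wins at position 4

def play(choose_action, max_steps=100):
    """Run one game; return (final score, whether the agent won)."""
    state, score = 0, 0                                # the score starts at 0
    for _ in range(max_steps):
        action = choose_action(state)                  # +1 = step right, -1 = step left
        new_state = min(max(state + action, 0), GOAL)  # the environment responds (walls at 0 and GOAL)
        score += 1 if new_state > state else -1        # assumed rule: reward progress, penalize anything else
        state = new_state
        if state == GOAL:                              # victory ends the game
            return score, True
    return score, False

def always_right(state):
    return 1

def always_left(state):
    return -1

print(play(always_right))               # (4, True)
print(play(always_left, max_steps=10))  # (-10, False)
```

Swapping in a different `choose_action` function is all it takes to try a different strategy, which is what the agent has to do when it searches for the best way to win.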
Now, if you look at the math behind the concept, the following factors directly affect decision making in Reinforcement Learning:
- Set of states, S
- Set of actions, A
- Reward function, R
- Policy (Pi)
- Value, V
The game ends when the state S reaches the desired state (WIN). The actions are the steps the player takes as the game progresses, and the reward function adds or deducts points according to the result of each step. The total reward collected by the time the game is won is the value (V); more precisely, the value of a state is the expected total reward the agent can collect from that state onward when following a given policy.
Each time the game is won, the policy (Pi) that was followed gets a value (V). After many policies have been generated from many sample games, the one with the highest value V is chosen as the best solution.
$$\text{Solution} = \mathbb{E}\left[R \mid \pi, S\right]$$
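The idea of estimating V for each policy from many sample games and keeping the best one can be sketched in Python as follows. This is a minimal illustration under my own assumptions: a hypothetical "slippery" corridor where a move goes the wrong way 20% of the time, three hand-written candidate policies, and V estimated as the average score over 1,000 simulated games, which is a simple Monte Carlo estimate of the expectation above.

```python
import random

GOAL, SLIP = 4, 0.2  # hypothetical slippery corridor: a move goes the wrong way 20% of the time

def play(policy, rng, max_steps=100):
    """Play one game from position 0 and return the total score."""
    state, score = 0, 0
    for _ in range(max_steps):
        action = policy(state, rng)
        if rng.random() < SLIP:
            action = -action                      # the environment is not fully predictable
        new_state = min(max(state + action, 0), GOAL)
        score += 1 if new_state > state else -1   # assumed scoring: +1 forward, -1 otherwise
        state = new_state
        if state == GOAL:
            break
    return score

def estimate_value(policy, episodes=1000, seed=0):
    """Estimate V for a policy as the average score over many sample games."""
    rng = random.Random(seed)
    return sum(play(policy, rng) for _ in range(episodes)) / episodes

# Hypothetical candidate policies: (state, rng) -> action (+1 right, -1 left).
def always_right(state, rng):
    return 1

def mostly_right(state, rng):
    return 1 if rng.random() < 0.7 else -1

def coin_flip(state, rng):
    return rng.choice([-1, 1])

candidates = {"always right": always_right, "mostly right": mostly_right, "coin flip": coin_flip}
values = {name: estimate_value(p) for name, p in candidates.items()}
best = max(values, key=values.get)   # keep the policy with the highest estimated V
print(best, values)
```

Because the environment is random, a single game says little about a policy; averaging over many games is what makes the comparison between policies meaningful.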
Algorithm behind this calculation
The basic algorithm follows from the concepts of Reinforcement Learning, and the overall algorithm is Q-learning, the method on which Deep Q-Learning is built. It works as follows:
- Initialize the values Q(s, a) for every state ‘s’ and action ‘a’.
- Observe the current state ‘s’.
- Choose an action ‘a’ for that state, usually the one with the highest current value estimate, while occasionally trying a different action so that the agent keeps exploring the environment.
- Take the action, and observe the reward ‘r’ as well as the new state ‘s′’.
- Update the value Q(s, a) using the observed reward and the maximum value estimated for the next state ‘s′’, weighted by a learning rate and a discount factor.
- Set the state to the new state, and repeat the process until the objective of the game is reached.
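The steps above describe tabular Q-learning, where the values live in a table with one entry Q(s, a) per state-action pair; Deep Q-Learning replaces that table with a neural network. A common way to write the update step is Q(s, a) ← Q(s, a) + α [r + γ · max over a′ of Q(s′, a′) − Q(s, a)], where α is the learning rate and γ the discount factor. The Python sketch below applies these steps to a hypothetical five-position corridor; the reward scheme (-0.1 per move, +1 for reaching the goal), α = 0.5, γ = 0.9 and the 10% exploration rate are all assumptions for illustration.

```python
import random

N_STATES, GOAL = 5, 4                  # hypothetical corridor: positions 0..4, goal at 4
ACTIONS = (-1, 1)                      # move left or right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # assumed learning rate, discount factor, exploration rate

def step(state, action):
    new_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if new_state == GOAL else -0.1   # assumed reward scheme
    return new_state, reward

def q_learning(episodes=200, seed=0):
    rng = random.Random(seed)
    # Initialize the values Q(s, a) to zero for every state and action.
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0                                         # observe the current state
        while s != GOAL:
            # Choose an action: usually the best-known one, sometimes a random one.
            if rng.random() < EPSILON:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[(s, act)])
            s_next, r = step(s, a)                    # take the action, observe r and s'
            # Update Q(s, a) towards r + gamma * max_a' Q(s', a'); the goal has no future value.
            future = 0.0 if s_next == GOAL else max(q[(s_next, b)] for b in ACTIONS)
            q[(s, a)] += ALPHA * (r + GAMMA * future - q[(s, a)])
            s = s_next                                # move to the new state and repeat
    return q

q = q_learning()
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(policy)  # [1, 1, 1, 1]
```

After training, the greedy action in every non-goal position is +1 (move right), which is the shortest route to the goal.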
The same concept of Reinforcement Learning was also applied to the board game Go by DeepMind, a research lab owned by Google’s parent company, Alphabet. They eventually formulated a Policy (Pi) from all their outcomes and named the resulting program AlphaGo. To test it, they arranged a Go match between AlphaGo and South Korean champion Lee Sedol in March 2016. The five-game match ended with AlphaGo beating the 9-dan professional by an astonishing 4-1.
AlphaGo combined Reinforcement Learning with Deep Learning to formulate the set of rules that brought it to a winning position. After the match, AlphaGo was even awarded an honorary 9 dan, the highest rank in Go, by the Korea Baduk Association. Google also announced that AlphaGo’s prize money from the match would be donated to charities, including UNICEF.
Reinforcement Learning vs. Artificial Intelligence
Although Reinforcement Learning involves many algorithms for formulating the Policy that leads to a conclusion, at the end of the day it is still a Machine Learning mechanism, and it needs many trial-and-error runs before a proper solution or policy emerges. Compared with the older concepts of Machine Learning, formulating a solution with Reinforcement Learning differs in many ways from formulating one with Artificial Intelligence. What makes Reinforcement Learning unique compared with conventional Machine Learning or Artificial Intelligence techniques is as follows:
- First of all, unlike in other concepts or mechanisms, the agent is never told the objective of the game; it realizes what the goal is only after it starts taking steps forward and reaches it.
- Artificial Intelligence, as described here, is about telling the model what needs to be done and when, sometimes by supplying the correct actions outright. Reinforcement Learning is just the opposite: the agent is never given the correct actions and has to discover them through the rewards it receives.
- Such Artificial Intelligence does not rely on a mathematical or scientific model that it learns from; it simply behaves differently in different situations. Deep Learning and Reinforcement Learning, on the other hand, are devised to attain a positive result, so the outcomes of successive steps become increasingly consistent.
Why is Deep Learning coupled with Reinforcement Learning?
Deep learning is a powerful form of function approximation, used for image and speech recognition (supervised) as well as for dimensionality reduction and deep network pretraining (unsupervised).
Reinforcement learning is more in line with optimal control: an agent learns to create, develop and maintain an optimal policy for the sequence of actions it needs to take, by interacting with an environment. There are various branches within RL, such as temporal-difference learning, Monte Carlo methods and dynamic programming.
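To show how two of these branches differ, here is a small Python sketch of the value targets they learn from. Monte Carlo methods wait until the game ends and use the full discounted sum of rewards, while temporal-difference methods update after a single step by combining the observed reward with the current estimate for the next state. The discount factor of 0.9, the rewards and the next-state estimate of 0.5 are made-up numbers for illustration.

```python
GAMMA = 0.9  # assumed discount factor

def monte_carlo_target(rewards, gamma=GAMMA):
    """Monte Carlo: wait for the episode to end, then use the discounted sum of all its rewards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_target(reward, value_of_next_state, gamma=GAMMA):
    """Temporal difference: one real reward plus the current estimate of the next state's value."""
    return reward + gamma * value_of_next_state

episode = [-0.1, -0.1, 1.0]                   # hypothetical rewards from one short game
print(round(monte_carlo_target(episode), 3))  # 0.62
print(round(td_target(-0.1, 0.5), 3))         # 0.35 (0.5 is a made-up current estimate)
```

The Monte Carlo target needs the whole game before it can be computed; the temporal-difference target is available after one step, at the price of relying on an estimate that may still be wrong.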
Deep learning and reinforcement learning come together (as in Deep Q-Learning, which Google DeepMind used to play Atari games) when a deep neural network is used to approximate the Q function in Q-learning, a popular algorithm that falls under temporal-difference learning.
In the Atari game-playing example, the state space is so large (the states are raw video-game pixels) that storing one Q value per state-action pair in a table is impractical, so a neural network is used to approximate Q instead.
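Here is a minimal NumPy sketch of what approximating Q with a neural network means in code. A tiny two-layer network stands in for the deep convolutional network used on Atari pixels, and a random feature vector stands in for a game frame; the layer sizes, learning rate and discount factor are assumptions. The target is the usual Q-learning one, r + γ · max over a′ of Q(s′, a′), but instead of overwriting a table entry, the network's weights take a gradient step towards it. The published Deep Q-Learning work adds machinery omitted here, such as experience replay and a separate target network.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES, N_HIDDEN, N_ACTIONS = 8, 16, 2   # hypothetical sizes
GAMMA, LR = 0.9, 0.01                         # assumed discount factor and learning rate

# Tiny two-layer network: state features in, one Q value per action out.
params = {
    "W1": rng.normal(0.0, 0.1, (N_FEATURES, N_HIDDEN)),
    "b1": np.zeros(N_HIDDEN),
    "W2": rng.normal(0.0, 0.1, (N_HIDDEN, N_ACTIONS)),
    "b2": np.zeros(N_ACTIONS),
}

def q_values(p, x):
    """Forward pass: Q values for every action, plus the hidden activations."""
    h = np.maximum(0.0, x @ p["W1"] + p["b1"])   # ReLU hidden layer
    return h @ p["W2"] + p["b2"], h

def td_update(p, s, a, r, s_next, done):
    """One semi-gradient Q-learning step on a single transition.
    Returns the squared TD error measured before the step."""
    q_next, _ = q_values(p, s_next)
    # The target is treated as a constant: no gradient flows through it.
    target = r if done else r + GAMMA * q_next.max()
    q, h = q_values(p, s)
    error = q[a] - target
    # Gradient of 0.5 * error**2, which only touches the chosen action's output.
    grad_out = np.zeros(N_ACTIONS)
    grad_out[a] = error
    grad_h = (p["W2"] @ grad_out) * (h > 0)      # backpropagate through ReLU (W2 before its update)
    p["W2"] -= LR * np.outer(h, grad_out)
    p["b2"] -= LR * grad_out
    p["W1"] -= LR * np.outer(s, grad_h)
    p["b1"] -= LR * grad_h
    return error ** 2

# Usage: repeat one terminal transition (reward 1.0) so the target stays fixed.
s, s_next = rng.normal(size=N_FEATURES), rng.normal(size=N_FEATURES)
errors = [td_update(params, s, a=1, r=1.0, s_next=s_next, done=True) for _ in range(200)]
print(f"squared TD error: {errors[0]:.3f} -> {errors[-1]:.5f}")
```

Because the same transition is replayed and the target is fixed, the shrinking squared TD error shows the network's estimate of Q(s, a) being pulled towards the target; in real training, each update would use a different transition.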
Reinforcement Learning is all about devising a plan to achieve a positive end result in the game. But because the algorithm makes the agent take exploratory, sometimes random, steps to analyze the states that follow, it raises doubts about how quickly the algorithm can come up with a solution.
Even though Reinforcement Learning can eventually take us all to the finishing line, the question remains whether it would be effective in a complex environment that keeps changing randomly.
One has to wait patiently for the Reinforcement Learning plan to be deduced and tested before it can be followed. The two biggest disadvantages are:
- This methodology is slow, as the agent takes a considerable amount of time to learn the environment and find the best solution.
- The whole process becomes tedious and difficult when the environment it is performed in is complex.
Taken together, these points leave a cloud of uncertainty over whether Reinforcement Learning is the right algorithm for deducing the best solution in any given environment.
We may have to wait for the outcomes of more such trials in complex environments before reaching a conclusion that fits all our needs and expectations.