Understanding Reinforcement Learning in the LLM Setting + DeepSeek’s GRPO + Code

Saying DeepSeek has taken the internet by storm would be an understatement. The paper made many significant changes to the LLM module, but the focus of this post is the Reinforcement Learning side of things. Reinforcement Learning, in my opinion, will prove to be groundbreaking in the coming years. We will look into why this model is having such an impact, even bigger than the models that came before it. Before we get into the RL specific to DeepSeek, let's look at how Reinforcement Learning works.

What is Reinforcement Learning?

RL is essentially a learning algorithm that we are all familiar with on some level, except that you may not have put a name on it. Reinforcement Learning is how we learnt to talk, walk and survive as a species, and it is arguably the same algorithm by which our brain functions. Think about it this way: if you had everything you have ever wanted right now, would you have any motivation to do anything? No. Your motivation, desires, actions and behaviours are all influenced by how valuable the world around you is right now versus how you would like it to be. It is this delta that causes you to change your behaviour so that it aligns with your goals.

Why do we find some situations more valuable than others? Here is the key: our assessment of the situation we are in is an estimate of how likely we are to reach a desired point from there onwards. For us, a desirable situation is anything that brings us reward. This is the beauty of Reinforcement Learning: it combines principles of human behaviour, neuroscience and machine learning into a mathematically expressive form that works wonders in cases ranging from stock market prediction and robotics to games such as Atari, Dota and Go (AlphaGo), RLHF in LLMs, and now Reinforcement Learning in reasoning LLMs.

Let's look at the following important terms -

Policy - Think of it as the behaviour that decides your actions

       - Your policy can be "always go right", which makes every action you take a move to the right

       - Your policy can be 50% left, 50% right

       - Or your policy can be intelligent, where you go left or right based on the state you are in
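
To make the three kinds of policy above concrete, here is a minimal sketch in Python. It assumes a toy one-dimensional world where the state is an integer position; the function names, the ACTIONS tuple and the goal position are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical action set for a toy 1-D world.
ACTIONS = ("left", "right")

def always_right(state):
    """Deterministic policy: ignores the state and always goes right."""
    return "right"

def coin_flip(state, rng=random):
    """Stochastic policy: 50% left, 50% right, regardless of the state."""
    return rng.choice(ACTIONS)

def state_aware(state, goal=5):
    """State-dependent policy: go right while left of the (assumed) goal,
    otherwise go left."""
    return "right" if state < goal else "left"

# The same state can produce different actions under different policies.
for policy in (always_right, coin_flip, state_aware):
    print(policy.__name__, "->", policy(7))
```

In every case a policy is just a mapping from a state to an action (or to a distribution over actions). What changes is how much of the state it actually uses.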

Actions - A specific activity under the agent's control, through which it interacts with the environment / state

          - You can receive an immediate reward

          - You will cause a transition in the environment, so taking an action always leads to a next state
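
The two effects of an action described above (an immediate reward and a next state) can be sketched as a single transition function. This is only an illustration under assumed rules: the state is an integer position on a line, the goal position and the reward of 1.0 for reaching it are made up, and step is a hypothetical name rather than an API from any library.

```python
def step(state, action, goal=5):
    """Apply one action in a toy 1-D world and return (next_state, reward).

    Assumptions for illustration only: "right" adds 1 to the position,
    "left" subtracts 1, and landing on the goal pays a reward of 1.0.
    """
    if action not in ("left", "right"):
        raise ValueError(f"unknown action: {action!r}")
    next_state = state + 1 if action == "right" else state - 1
    reward = 1.0 if next_state == goal else 0.0
    return next_state, reward

# Taking two actions in a row: each one yields a next state and a reward.
state = 3
for action in ("right", "right"):
    state, reward = step(state, action)
    print(action, "-> state", state, "reward", reward)
```

Note that the reward here depends only on the state you land in. Real environments can also reward or penalise the action itself.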

States - All possible configurations of the environment the agent is acting in

         - For example, all possible positions on a chessboard