## Basic Framework of Reinforcement Learning\n\nImagine you are teaching a puppy the command to sit. You wouldn't tell it how every muscle should move; instead, you would do this:\n\n1. You give the command to sit.\n2. The puppy tries some action (it might sit, lie down, or spin around).\n3. If it sits, you immediately give it a treat as a reward.\n4. If it does something else, you give no reward, or give a slight "wrong" signal.\n5. After multiple attempts, the puppy gradually understands: sitting after hearing "sit" earns a treat. Thus, it learns the command.\n\n**Reinforcement Learning** is about making a computer (or agent) act like this puppy: it learns to make a series of decisions that achieve a long-term goal by interacting with an environment and using the rewards or penalties it receives.\n\nIt is fundamentally different from the more familiar **Supervised Learning** (which learns from standard answers provided by a "teacher") and **Unsupervised Learning** (which finds the inherent structure of data). Reinforcement learning is **learning from experience**, and its core is **trial-and-error** and **delayed rewards**.\n\n* * *\n\n## Core Elements of Reinforcement Learning\n\nTo formally describe this learning process, we introduce a few core concepts that together form the basic framework of reinforcement learning.\n\n### Agent and Environment\n\nThese two form the most basic interacting pair in reinforcement learning.\n\n* **Agent**: The entity that learns and makes decisions. In the example above, the puppy is the agent. In a computer, it can be an algorithm, a program, or a robot.\n* **Environment**: The external world in which the agent resides and with which it interacts. For the puppy, the environment is you, the treats, the floor, and everything else external. 
The environment receives the agent's actions and provides a new state and reward.\n\nTheir relationship is a continuous loop: **Agent observes the environment -> takes an action -> Environment feeds back a new state and reward -> Agent observes again...**\n\n### State, Action, and Reward\n\nThese are the three key pieces of information describing each interaction.\n\n* **State**: A complete description of the environment's situation at a given moment. For example, in the puppy training example, the state might include: the puppy is standing, you have a treat in your hand, and you just said "sit". The state is the basis for the agent's decision-making.\n* **Action**: A choice the agent can make in a given state. For the puppy, the set of actions might be {sit, lie down, stand, spin...}.\n* **Reward**: A scalar signal fed back to the agent by the environment after the agent executes an action. It defines **what is good and what is bad**. The reward is the only compass for the agent's learning. Giving the puppy a treat is a positive reward (for example, +1), and saying "wrong" can be seen as a slight negative reward (for example, -0.1).\n\n### Policy\n\nA **policy** is the agent's brain or code of conduct. It defines which action the agent should take in any given state.\n\nA policy can be as simple as a lookup table or as complex as a deep neural network. The ultimate goal of reinforcement learning is to find an **optimal policy** that maximizes the **long-term cumulative reward** the agent receives from the environment.\n\n* **Example**: A simple policy might be: if the state is hearing the "sit" command, choose the "sit" action with 90% probability and one of the other actions with the remaining 10%.\n\n### Value Function\n\nThe reward tells the agent how good or bad the **current** action is right now, but the agent needs to care more about the **long-term** return. 
The **Value Function** is the tool used to measure this long-term return.\n\nThe question it answers is: starting from the current state and always following a certain policy, how much total reward can I **expect** to receive?\n\n* **State Value Function V(s)**: Measures the long-term value of being in state `s` and following the current policy from there.\n* **Action Value Function Q(s, a)**: Measures the long-term value of **executing a specific action `a`** in state `s`, and then following the current policy. It is often more convenient in practice than the state value function because it can directly guide action selection: the agent can simply pick the action with the highest Q-value.\n\n**Why do we need a value function?** Imagine a game of chess. Capturing an opponent's pawn yields an immediate small reward, but it might lead to being "checkmated" ten moves later, resulting in a massive negative reward. By estimating long-term outcomes, the value function helps the agent avoid chasing small gains that end up losing the whole game.\n\n* * *\n\n## Core Interaction Process: Markov Decision Process\n\nReinforcement learning problems are typically modeled as a **Markov Decision Process (MDP)**. The name sounds complicated, but it is simply a standard framework that mathematically organizes the elements mentioned above to describe the interaction between the agent and the environment.\n\nThe core idea of an MDP is: **the next state and reward depend only on the current state and the action currently taken, regardless of previous history** (i.e., the Markov property).\n\nA complete MDP interaction cycle is as follows:\n\n1. At time `t`, the environment is in state `S_t`.\n2. The agent observes this state.\n3. The agent selects an action `A_t` based on its policy `π`.\n4. The environment receives this action.\n5. Based on its internal dynamics, the environment transitions to the next state `S_{t+1}` and generates a scalar reward `R_{t+1}`, which is fed back to the agent.\n6. 
The time step advances (`t = t+1`), and a new cycle begins.\n\nThe agent's goal is to continuously experience this cycle and learn a policy `π*` that maximizes the **expected value of the cumulative reward (i.e., the return)** starting from any initial state.\n\n* * *\n\n## A Simple Code Example: Grid World\n\nLet's use a classic Grid World example to make these concepts concrete. Suppose there is a 4x4 grid; the agent starts at the starting point `S` and the goal is to reach the endpoint `G`. Stepping on an obstacle `#` results in failure, and every step taken incurs a small penalty (encouraging the agent to reach the endpoint as quickly as possible).\n\n    S . . .\n    . # . .\n    . . # .\n    . . . G\n\n* **State**: The coordinates of each grid cell, such as (0,0), (0,1)... (3,3). There are 16 states in total.\n* **Action**: {Up, Down, Left, Right}.\n* **Reward**:\n  * Reaching `G`: +10\n  * Hitting `#` or going out of bounds: -5\n  * Other normal moves: -0.1 (encourages efficient paths)\n\n* **Policy**: We need to learn a table that records which direction to move in each state (grid cell).\n\nBelow is a highly simplified Python demonstration of Q-learning (a classic reinforcement learning algorithm) used to learn the optimal path in a grid world like this one. Note that the code's default configuration is a 5x5 grid with three obstacles rather than the 4x4 layout shown above; the concepts are the same.\n\n## Example\n\nimport numpy as np\nimport random\nfrom typing import Dict, List, Tuple\n\n# ====================== 1. 
Environment Simulation (Grid World) ======================\n\nclass GridWorldEnv:\n    """Simple grid world environment for demonstrating Q-Learning"""\n\n    def __init__(self, grid_size: Tuple[int, int] = (5, 5),\n                 start_pos: Tuple[int, int] = (0, 0),\n                 goal_pos: Tuple[int, int] = (4, 4),\n                 obstacle_pos: List[Tuple[int, int]] = [(1, 1), (2, 2), (3, 1)]):\n        self.grid_size = grid_size  # (rows, cols)\n        self.start_pos = start_pos\n        self.goal_pos = goal_pos\n        # Copy the list so the shared default argument is never mutated\n        self.obstacle_pos = list(obstacle_pos)\n        self.current_pos = start_pos\n\n        # Action definition: 0-up, 1-down, 2-left, 3-right\n        self.actions = ['up', 'down', 'left', 'right']\n        self.num_actions = len(self.actions)\n\n    def reset(self) -> int:\n        """Reset environment, return initial state index"""\n        self.current_pos = self.start_pos\n        return self.pos_to_state(self.current_pos)\n\n    def pos_to_state(self, pos: Tuple[int, int]) -> int:\n        """Convert (row, col) coordinates to a state index (row-major order)"""\n        return pos[0] * self.grid_size[1] + pos[1]\n\n    def state_to_pos(self, state: int) -> Tuple[int, int]:\n        """Convert a state index back to (row, col) coordinates"""\n        return (state // self.grid_size[1], state % self.grid_size[1])\n\n    def random_action(self) -> int:\n        """Select a random action (exploration)"""\n        return random.randint(0, self.num_actions - 1)\n\n    def action_to_direction(self, action: int) -> str