
Ml Deep Reinforcement Learning

## Deep Reinforcement Learning

Deep reinforcement learning is an interdisciplinary direction in artificial intelligence. It is easiest to understand in two parts.

**Reinforcement Learning** is the core idea. It models how humans and animals learn through trial and error. Imagine teaching a puppy a new command: when it does something right, you give it a treat as a reward; when it does something wrong, there is no reward, or even a mild punishment. After many attempts, the puppy learns which action earns a reward in a given situation. In reinforcement learning, the **Agent** plays the role of this puppy: it interacts with the **Environment** and adjusts its **Policy** based on the **Reward** it receives.

**Deep Learning** is the tool. When reinforcement learning faces very complex environments (such as video game screens or robot sensor data), traditional approaches such as lookup tables and hand-crafted features struggle to extract what matters for decision-making. Deep neural networks excel at processing such high-dimensional raw data (like images and sounds) and learn hierarchical feature representations automatically.

Therefore, **Deep Reinforcement Learning = Reinforcement Learning's Decision Framework + Deep Learning's Perception and Representation Capability**. It enables agents to learn directly from complex raw inputs (such as pixels) how to act in order to achieve long-term goals.

* * *

## Core Concepts and Basic Framework

To understand deep reinforcement learning, you first need to grasp the core roles in its basic framework and the relationships between them.

### Basic Elements of Reinforcement Learning

1. **Agent**
   * **Role**: Learner and decision-maker.
   * **Responsibility**: Observe the environment state, choose actions based on the learned policy, execute them, and receive feedback (new state and reward) from the environment.
2. **Environment**
   * **Role**: Everything external that the agent interacts with.
   * **Responsibility**: Receive the agent's actions, update its own state, and provide the corresponding rewards.
3. **State (s)**
   * **Definition**: A description of the environment's situation at a given moment. In deep reinforcement learning, states are usually high-dimensional, such as a single frame of a game image.
4. **Action (a)**
   * **Definition**: The choices available to the agent in a given state. In a game, these might be "up," "left," "fire," etc.
5. **Reward (r)**
   * **Definition**: The immediate scalar feedback signal the environment returns for the agent's action. The reward is the "compass" for learning: the agent's goal is to maximize the long-term cumulative reward, not just the next reward.
6. **Policy (π)**
   * **Definition**: The agent's behavioral rule, a mapping from states to actions. It tells the agent what action to take in what state. The policy can be deterministic (`a = π(s)`) or stochastic (`a ~ π(a|s)`).
7. **Value Function**
   * **Definition**: Evaluates how good a state or state-action pair is. It represents the **expected cumulative reward** (usually discounted) obtainable from the current state, or after executing the current action.
   * **State Value Function V(s)**: The expected return obtained by following the current policy from state `s`.
   * **Action Value Function Q(s, a)**: The expected return obtained by executing action `a` in state `s` and then following the current policy.

### Interaction Process

The interaction between agent and environment is a continuous cycle:

1. The agent observes the current state `s`.
2. The agent selects an action `a` according to its policy and executes it.
3. The environment returns a reward `r` and a new state `s'`.
4. `s'` becomes the current state, and the cycle starts again.

As this cycle repeats, the agent collects large amounts of interaction data (`s, a, r, s'`) and uses it to improve its policy.
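The cycle and the notion of cumulative reward can be made concrete with a small sketch. Everything here is an illustrative assumption rather than part of the CartPole example later in the article: `CoinMatchEnv` is a hypothetical toy environment, `policy` is a hand-written deterministic policy, and `gamma` is the discount factor that the DQN target later in the article also uses.

```python
import random


class CoinMatchEnv:
    """Hypothetical toy environment: the state is a coin face (0 or 1),
    and the agent earns reward 1.0 when its action matches that face."""

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0
        self.coin = 0

    def reset(self):
        self.steps = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        self.steps += 1
        self.coin = self.rng.randint(0, 1)  # environment moves to a new state
        done = self.steps >= self.episode_length
        return self.coin, reward, done


def policy(state):
    """Deterministic policy a = pi(s): always match the coin face."""
    return state


def run_episode(env, policy):
    """Run one episode and collect (s, a, r, s') transitions."""
    transitions = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions


def discounted_return(rewards, gamma):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... by working backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


transitions = run_episode(CoinMatchEnv(), policy)
rewards = [r for (_, _, r, _) in transitions]
print(len(transitions), discounted_return(rewards, gamma=0.5))
```

A value function such as V(s) is the expectation of this discounted return over many episodes; the sketch computes it for a single episode only.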
* * *

## Main Algorithms of Deep Reinforcement Learning

Deep reinforcement learning algorithms fall into two main categories, **Value-Based** and **Policy-Based**, plus **Actor-Critic** methods that combine the advantages of both.

### 1. Value-Based Methods: Deep Q-Network

The core of these algorithms is to learn the optimal **action value function Q(s, a)**. Once an accurate Q function is learned, the optimal policy is simple: in each state `s`, choose the action `a` that maximizes `Q(s, a)`.

**Deep Q-Network (DQN)** is the landmark work in this family. It uses a deep neural network to approximate the Q function.

**Key Technical Innovations of DQN:**

* **Experience Replay**: The agent stores interaction experiences `(s, a, r, s')` in a memory buffer. During training, it randomly samples a batch of experiences from the buffer. This breaks the correlation between consecutive samples, making training more stable and efficient.
* **Target Network**: A "target network" with the same structure but more slowly updated parameters computes the learning targets (target Q values), while a separate "online network" is used for action selection and is updated at every step. This mitigates the problem of target values shifting constantly during training and greatly improves stability.

**A Simplified DQN Training Process:**

1. Initialize the online network `Q` and the target network `Q_target` with the same parameters, and create an empty experience replay buffer.
2. In the current state `s`, select action `a` with an ε-greedy rule: with probability ε choose a random action, otherwise choose the action with the highest value under `Q`.
3. Execute the action. The environment returns reward `r` and new state `s'`. Store the experience `(s, a, r, s')` in the replay buffer.
4. Randomly sample a batch of experiences from the replay buffer.
5. For each sample, calculate the target Q value `y = r + γ * max_a' Q_target(s', a')`, where `γ` is the discount factor that trades off immediate against future rewards. If `s'` is terminal, the target is just `y = r`.
6. Use `(y - Q(s, a))^2`, averaged over the batch, as the loss and update the online network `Q` by gradient descent.
7. Every fixed number of steps, copy the online network parameters to the target network.
8. Repeat steps 2-7.

**Advantages and Limitations:**

* **Advantages**: Relatively high sample efficiency (stored experiences are reused) and relatively stable training.
* **Limitations**: Hard to apply to continuous action spaces (because it needs to calculate `max_a Q(s,a)`), and it usually learns only deterministic policies.

### 2. Policy-Based Methods: Policy Gradient

These methods directly parameterize the policy `π(a|s; θ)` (for example, with a neural network) and optimize the policy parameters `θ` to maximize the expected return.

**Core Idea**: Calculate the gradient of the expected return `J(θ)` with respect to the policy parameters `θ` (the policy gradient), then update the parameters along the gradient direction so that the policy improves step by step.

The **REINFORCE algorithm** is a classic policy gradient algorithm. Its update formula is:

`θ ← θ + α * ∇_θ log π(a|s; θ) * G_t`

where `G_t` is the cumulative reward from the current time step to the end of the episode, and `α` is the learning rate.

**Advantages and Limitations:**

* **Advantages**: Learns stochastic policies directly and applies naturally to continuous action spaces.
* **Limitations**: Updates require complete episodes and have high variance, which leads to unstable training and low sample efficiency.

### 3. Actor-Critic Methods

The Actor-Critic framework combines value-based and policy-based methods.

* **Actor**: A policy network responsible for generating actions from states. Like an actor, it improves its "acting skills" (the policy) under the guidance of the critic.
* **Critic**: A value network (usually a Q network or V network) responsible for evaluating the actions the actor takes in a given state. Like a critic, it scores the actor's performance.

**Workflow:**

1. The actor selects and executes action `a` based on the current state `s` and its own policy.
2. The environment returns reward `r` and new state `s'`.
3. The critic calculates the TD error (Temporal-Difference error, the difference between its predicted value and the value implied by the observed reward and next state) from `(s, a, r, s')`.
4. The critic uses this error to update its value network, making its scoring more accurate.
5. The actor uses the "score" provided by the critic (such as the TD error or an advantage function) to update its policy network, so that it becomes more inclined to choose high-scoring actions.

**Advantages**: Actor-Critic methods usually have lower variance and are more stable than pure policy gradient methods (like REINFORCE), while handling continuous actions and stochastic policies better than pure value methods (like DQN). **A3C, A2C, PPO, SAC** and others are widely used Actor-Critic algorithms.

* * *

## Practice: Playing CartPole with DQN

Let's get a feel for DQN through the classic control problem `CartPole` (pole balancing). In this environment, a cart can move left and right, and the goal is to keep the pole on the cart upright.

### Environment Setup

We use OpenAI Gym, a reinforcement learning toolkit.

## Example

    # Install necessary libraries (run in Jupyter Notebook or command line)
    # !pip install gym numpy torch
    # Note: the reset() call used in the training loop returns (state, info),
    # which is the gym >= 0.26 API.
    import collections
    import random

    import gym
    import numpy as np
    import torch
    import torch.nn as nn
    import torch.optim as optim

    # Create environment
    env = gym.make('CartPole-v1')
    # observation_space.shape is the tuple (4,), so take its first element
    state_dim = env.observation_space.shape[0]  # State dimension: 4 (cart position, velocity, pole angle, angular velocity)
    action_dim = env.action_space.n  # Action dimension: 2 (left, right)
    print(f"State space dimension: {state_dim}, Action space size: {action_dim}")

### Define Q-Network

This is a simple fully connected neural network that takes the state as input and outputs one Q value per action.
## Example

    class DQN(nn.Module):
        def __init__(self, state_dim, action_dim):
            super().__init__()
            self.fc1 = nn.Linear(state_dim, 128)  # First fully connected layer
            self.fc2 = nn.Linear(128, 128)  # Second fully connected layer
            self.fc3 = nn.Linear(128, action_dim)  # Output layer, one Q value per action

        def forward(self, x):
            x = torch.relu(self.fc1(x))  # Use ReLU activation function to introduce non-linearity
            x = torch.relu(self.fc2(x))
            return self.fc3(x)  # Output Q values, without activation function

### Define Experience Replay Buffer

Used to store and sample past experiences.

## Example

    class ReplayBuffer:
        def __init__(self, capacity):
            # Double-ended queue, automatically discards the oldest experiences when full
            self.buffer = collections.deque(maxlen=capacity)

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            transitions = random.sample(self.buffer, batch_size)
            # Regroup the sampled transitions by field (all states, all actions, ...)
            # so each field can be converted to one batch tensor
            state, action, reward, next_state, done = zip(*transitions)
            return np.array(state), action, reward, np.array(next_state), done

        def size(self):
            return len(self.buffer)

### Define DQN Agent

Integrates network, experience replay, and training logic.
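Before the full agent, the central piece of that training logic, the target `y = r + γ * max_a' Q_target(s', a')` from step 5 of the simplified training process, can be checked in isolation. The following is a minimal pure-Python sketch, not the article's PyTorch implementation: `td_targets` is a hypothetical helper, and the rewards and Q values are made-up numbers chosen so the arithmetic is easy to follow.

```python
def td_targets(rewards, next_q_values, dones, gamma):
    """Compute y = r + gamma * max_a' Q_target(s', a') for each sample.

    When the episode ended at this transition (done is True), there is no
    future reward, so the bootstrap term is dropped and y = r.
    """
    targets = []
    for r, next_q, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(next_q)
        targets.append(r + bootstrap)
    return targets


# Two made-up transitions with two actions each (illustrative values only)
rewards = [1.0, 1.0]
next_q_values = [[2.0, 4.0], [3.0, 1.0]]  # stand-in for Q_target(s', a')
dones = [False, True]

targets = td_targets(rewards, next_q_values, dones, gamma=0.5)
print(targets)  # [3.0, 1.0]
```

The first sample bootstraps from the best next action (`1.0 + 0.5 * 4.0`), while the second is terminal and keeps only its immediate reward. A batched tensor version of the same rule would typically express the terminal case by multiplying the bootstrap term by `(1 - done)`.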
## Example

    class DQNAgent:
        def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.98, epsilon=0.01,
                     target_update_freq=10, buffer_size=10000, batch_size=64):
            self.action_dim = action_dim
            self.q_net = DQN(state_dim, action_dim)  # Online network
            self.target_q_net = DQN(state_dim, action_dim)  # Target network
            self.target_q_net.load_state_dict(self.q_net.state_dict())  # Initial parameters consistent
            self.optimizer = optim.Adam(self.q_net.parameters(), lr=lr)  # Optimizer
            self.gamma = gamma  # Discount factor
            self.epsilon = epsilon  # Exploration rate (final)
            self.target_update_freq = target_update_freq  # Target network update frequency
            self.batch_size = batch_size
            self.buffer = ReplayBuffer(buffer_size)
            self.count = 0  # Record update steps

        def take_action(self, state, epsilon=None):
            """Select action based on epsilon-greedy policy"""
            if epsilon is None:
                epsilon = self.epsilon
            if np.random.random() < epsilon:
                return np.random.randint(self.action_dim)  # Exploration: random selection
            else:
                state = torch.tensor(state, dtype=torch.float).unsqueeze(0)  # Add batch dimension
                with torch.no_grad():
                    q_values = self.q_net(state)
                return q_values.argmax().item()  # Exploitation: select action with maximum Q value

        def update(self):
            """Sample from experience replay buffer and update network"""
            if self.buffer.size() < self.batch_size:
                return

            # 1. Sample
            states, actions, rewards, next_states, dones = self.buffer.sample(self.batch_size)
            # Convert to PyTorch tensors
            states = torch.tensor(states, dtype=torch.float)
            # gather() needs int64 indices; shape becomes [batch_size, 1]
            actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
            rewards = torch.tensor(rewards, dtype=torch.float).unsqueeze(1)
            next_states = torch.tensor(next_states, dtype=torch.float)
            dones = torch.tensor(dones, dtype=torch.float).unsqueeze(1)

            # 2. Calculate current Q values (Q(s, a))
            current_q_values = self.q_net(states).gather(1, actions)  # Only extract Q value corresponding to executed action a

            # 3. Calculate target Q values (r + γ * max_a' Q_target(s', a'))
            with torch.no_grad():
                # max(1) returns (values, indices); keep only the values
                next_q_values = self.target_q_net(next_states).max(1)[0].unsqueeze(1)  # Take maximum Q value of next state
                target_q_values = rewards + self.gamma * next_q_values * (1 - dones)  # If episode ends (done=1), no future reward

            # 4. Calculate loss (mean squared error)
            loss = nn.MSELoss()(current_q_values, target_q_values)

            # 5. Gradient descent update online network
            self.optimizer.zero_grad()
            loss.backward()
            # Optional: gradient clipping to prevent gradient explosion
            # torch.nn.utils.clip_grad_norm_(self.q_net.parameters(), max_norm=10)
            self.optimizer.step()
            self.count += 1

            # 6. Periodically update target network
            if self.count % self.target_update_freq == 0:
                self.target_q_net.load_state_dict(self.q_net.state_dict())

### Training Loop

## Example

    def train_agent(env, agent, num_episodes=500, max_steps=500, initial_epsilon=0.9, epsilon_decay=0.995):
        """Train agent"""
        return_list = []  # Record total reward of each episode
        epsilon = initial_epsilon
        for i_episode in range(num_episodes):
            state, _ = env.reset()
            episode_return = 0
            done = False
            for step in range(max_steps):
                # 1. Select and execute action
                action = agent.take_action(state, epsilon)  # Use decaying exploration rate
                next_state, reward
← Ml Forward And Backward PropagMl Exploration Exploitation β†’