Recurrent Neural Network (RNN) | Rookie Tutorial
Introduction to RNN
A Recurrent Neural Network (RNN) is a type of neural network specifically designed to handle sequential data such as text, speech, and time series.
Unlike traditional feedforward neural networks, RNNs have the ability to "remember" information from previous steps.
RNNs can use the hidden state from the previous step to influence the current step's output, thereby capturing temporal dependencies in sequences.
The Core Idea of RNN
The core of an RNN lies in its recurrent connection: the output at each step depends on the current input and on the previous hidden state, which in turn summarizes all earlier time steps. Because the same step is applied repeatedly, RNNs can process sequences of arbitrary length.
Traditional Neural Networks: Inputs and outputs are independent (e.g., image classification, where each image is unrelated to others).
RNNs: Through recurrent connections, the hidden state from the previous step is passed to the next step, forming a "memory."
- Each step's input = Current data + Hidden state from the previous step.
- The output depends not only on the current input but also on the context of all previous steps.
Just like when reading a sentence, understanding the current word relies on what has been read before (e.g., "He opened __," you might predict "the door" or "the book").
Example: Simple RNN Implementation
import numpy as np

class SimpleRNN:
    def __init__(self, input_size, hidden_size):
        self.Wx = np.random.randn(hidden_size, input_size)   # Input weights
        self.Wh = np.random.randn(hidden_size, hidden_size)  # Hidden state weights
        self.b = np.zeros((hidden_size, 1))                  # Bias term

    def forward(self, x, h_prev):
        # x has shape (input_size, 1); h_prev has shape (hidden_size, 1)
        h_next = np.tanh(np.dot(self.Wx, x) + np.dot(self.Wh, h_prev) + self.b)
        return h_next
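To see how a single step like this is reused along a whole sequence, here is a minimal sketch. The helper name `run_rnn`, the 0.1 weight scale, and the toy sizes are assumptions made for illustration; they are not part of the original example.

```python
import numpy as np

def run_rnn(Wx, Wh, b, inputs, h0):
    """Apply the same tanh RNN step to every element of a sequence."""
    h = h0
    states = []
    for x in inputs:                      # one column vector per time step
        h = np.tanh(Wx @ x + Wh @ h + b)  # the same weights are reused at each step
        states.append(h)
    return states

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5

# Small random weights (the 0.1 scale is an arbitrary choice for this sketch)
Wx = 0.1 * rng.standard_normal((hidden_size, input_size))
Wh = 0.1 * rng.standard_normal((hidden_size, hidden_size))
b = np.zeros((hidden_size, 1))

inputs = [rng.standard_normal((input_size, 1)) for _ in range(seq_len)]
h0 = np.zeros((hidden_size, 1))

states = run_rnn(Wx, Wh, b, inputs, h0)
print(len(states), states[-1].shape)  # 5 (4, 1)
```

The loop works for any sequence length because the hidden state is the only thing carried from one step to the next.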
How RNNs Work
At each time step t, an RNN performs the following calculations:
- Receive the current input $x_t$ and the hidden state from the previous step $h_{t-1}$.
- Compute the new hidden state $h_t = f(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b)$.
- Generate the output $y_t = g(W_{hy} \cdot h_t + c)$.
Here, f is an activation function, typically tanh, and g depends on the task, for example softmax for classification.
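The three steps can be sketched as a single function. This is an illustrative sketch that assumes f is tanh and g is softmax. The names `W_xh`, `W_hh`, `W_hy`, `rnn_step`, and the toy dimensions are assumptions chosen to mirror the usual notation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the rows of a column vector."""
    z = z - np.max(z, axis=0, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=0, keepdims=True)

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b, c):
    """One time step: new hidden state, then the output distribution."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b)  # f = tanh (assumed)
    y_t = softmax(W_hy @ h_t + c)                  # g = softmax (assumed)
    return h_t, y_t

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 3, 5, 4  # toy sizes, chosen arbitrarily
W_xh = 0.1 * rng.standard_normal((n_hid, n_in))
W_hh = 0.1 * rng.standard_normal((n_hid, n_hid))
W_hy = 0.1 * rng.standard_normal((n_out, n_hid))
b = np.zeros((n_hid, 1))
c = np.zeros((n_out, 1))

x_t = rng.standard_normal((n_in, 1))
h_prev = np.zeros((n_hid, 1))
h_t, y_t = rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b, c)
print(h_t.shape, y_t.shape)
```

With a softmax output, the entries of `y_t` are non-negative and sum to 1, so they can be read as class probabilities.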
Advantages and Disadvantages of RNNs
Advantages
- Can handle variable-length sequences.
- Theoretically capable of remembering long-term historical information.
- Shares parameters across all time steps (using the same set of weights for every step).
Disadvantages
- Prone to vanishing and exploding gradients, which makes long-term dependencies difficult to learn.
- Lower computational efficiency, since each time step depends on the previous one and the steps cannot be processed in parallel.
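The vanishing gradient problem can be observed numerically. The sketch below multiplies the per-step Jacobians of a tanh RNN and tracks how the sensitivity of the current hidden state to the initial one shrinks. The recurrent matrix is deliberately built as an orthogonal matrix scaled by 0.5, and the input term is replaced by random noise; both are assumptions chosen to make the effect easy to see, not typical training settings.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 8, 30

# Recurrent weights: an orthogonal matrix scaled by 0.5, so every
# singular value is 0.5 (an arbitrary choice for this demo).
Q, _ = np.linalg.qr(rng.standard_normal((hidden_size, hidden_size)))
Wh = 0.5 * Q

h = np.zeros((hidden_size, 1))
jac_product = np.eye(hidden_size)  # accumulates d h_t / d h_0
norms = []
for _ in range(steps):
    x_term = rng.standard_normal((hidden_size, 1))  # stands in for Wx @ x_t + b
    h = np.tanh(Wh @ h + x_term)
    # Jacobian of one step: diag(1 - h_t^2) @ Wh (row-wise scaling via broadcasting)
    step_jac = (1.0 - h ** 2) * Wh
    jac_product = step_jac @ jac_product
    norms.append(np.linalg.norm(jac_product, 2))

print(f"after 1 step: {norms[0]:.2e}, after {steps} steps: {norms[-1]:.2e}")
```

Because each step's Jacobian has spectral norm at most 0.5 here, the product shrinks at least geometrically, which is why early inputs barely influence the gradient at late steps.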
Long Short-Term Memory Networks (LSTM)
LSTM is an improved architecture of RNNs, specifically designed to address the long-term dependency issues of standard RNNs.
Core Structure of LSTM
| Component | Function |
|---|---|
| Input Gate | Controls the inflow of new information. |
| Forget Gate | Determines which old information to discard. |
| Output Gate | Controls the amount of information outputted. |
| Memory Cell | Stores long-term states. |
Example: Basic Implementation of an LSTM Cell
class LSTMCell:
    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size
        # Combine weights for all gates; columns follow the [h_prev; x] stacking order
        self.W = np.random.randn(4 * hidden_size, input_size + hidden_size)
        self.b = np.random.randn(4 * hidden_size, 1)

    @staticmethod
    def sigmoid(z):
        # Logistic function, squashes values into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def forward(self, x, h_prev, c_prev):
        H = self.hidden_size
        combined = np.vstack((h_prev, x))
        gates = np.dot(self.W, combined) + self.b
        # Split gates into individual components
        f_gate = self.sigmoid(gates[:H])           # Forget gate
        i_gate = self.sigmoid(gates[H:2 * H])      # Input gate
        o_gate = self.sigmoid(gates[2 * H:3 * H])  # Output gate
        c_candidate = np.tanh(gates[3 * H:])       # Candidate memory
        # Update memory and hidden state
        c_next = f_gate * c_prev + i_gate * c_candidate
        h_next = o_gate * np.tanh(c_next)
        return h_next, c_next
How LSTM Solves Long-Term Dependency Problems
- Selective Memory: The forget gate decides whether to retain or discard specific information.
- Gradient Path: The memory cell provides a relatively direct path for gradient propagation.
- Information Protection: The memory cell is changed only through gated, element-wise updates, so its content is not overwritten wholesale at each time step.
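The gradient path point can be illustrated with the cell update `c_next = f_gate * c_prev + i_gate * c_candidate` alone. In this sketch the gate values are fixed by hand, whereas in a real LSTM they are computed from the input and the previous hidden state. The function name `carry_cell_state` and the chosen gate values are assumptions for illustration.

```python
import numpy as np

def carry_cell_state(c, forget_value, steps):
    """Repeat c = f * c + i * candidate with the input gate closed (i = 0)."""
    f_gate = np.full_like(c, forget_value)
    i_gate = np.zeros_like(c)
    c_candidate = np.ones_like(c)  # has no effect because i_gate is 0
    for _ in range(steps):
        c = f_gate * c + i_gate * c_candidate
    return c

steps = 100
c0 = np.array([[1.0], [-2.0]])

kept = carry_cell_state(c0, 0.99, steps)    # forget gate close to 1
dropped = carry_cell_state(c0, 0.5, steps)  # forget gate at 0.5

print("fraction kept with f = 0.99:", float(kept[0, 0] / c0[0, 0]))
print("fraction kept with f = 0.5: ", float(dropped[0, 0] / c0[0, 0]))
```

Along this direct path, the derivative of the cell state with respect to the previous cell state is simply the forget gate, so a forget gate near 1 lets both the stored value and its gradient survive many steps.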
Gated Recurrent Unit (GRU)
GRU is a simplified variant of LSTM that merges the cell state and hidden state into one vector. It uses fewer parameters and often achieves comparable performance.
Core Structure of GRU
| Component | Function |
|---|---|
| Update Gate | Controls how much of the previous state is kept versus replaced by the candidate. |
| Reset Gate | Controls how much of the previous state is used when computing the candidate. |
| Candidate Activation | Proposes a new state from the current input and the reset-scaled previous state. |
Example: Implementation of a GRU Cell
class GRUCell:
    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size
        # Rows: update gate, reset gate, candidate; columns follow [h_prev; x]
        self.W = np.random.randn(3 * hidden_size, input_size + hidden_size)
        self.b = np.random.randn(3 * hidden_size, 1)

    @staticmethod
    def sigmoid(z):
        # Logistic function, squashes values into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def forward(self, x, h_prev):
        H = self.hidden_size
        combined = np.vstack((h_prev, x))
        gates = np.dot(self.W[:2 * H], combined) + self.b[:2 * H]
        # Split gating signals
        z = self.sigmoid(gates[:H])       # Update gate
        r = self.sigmoid(gates[H:2 * H])  # Reset gate
        # The candidate is computed from the reset-scaled previous state
        h_candidate = np.tanh(np.dot(self.W[2 * H:],
                                     np.vstack((r * h_prev, x))) + self.b[2 * H:])
        # Update hidden state
        h_next = (1 - z) * h_prev + z * h_candidate
        return h_next
GRU vs LSTM
| Feature | GRU | LSTM |
|---|---|---|
| Number of Parameters | Fewer | More |
| Training Speed | Faster | Slower |
| Separate Memory Cell | No | Yes |
| Number of Gates | 2 | 3 |
| Performance | Often competitive on small datasets | May perform better on large datasets |
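The parameter difference in the table can be made concrete with a small count. This sketch assumes the stacked layout used in the cell examples of this tutorial: one weight block of shape (hidden_size, input_size + hidden_size) and one bias vector per gate or candidate. The helper `count_params` and the sizes 10 and 20 are illustrative assumptions; library implementations may organize biases differently, so their reported counts can differ.

```python
def count_params(input_size, hidden_size, num_blocks):
    """Count parameters for num_blocks stacked weight blocks, each of shape
    (hidden_size, input_size + hidden_size) with one bias vector."""
    weights = num_blocks * hidden_size * (input_size + hidden_size)
    biases = num_blocks * hidden_size
    return weights + biases

input_size, hidden_size = 10, 20
rnn_params = count_params(input_size, hidden_size, 1)   # single tanh layer
gru_params = count_params(input_size, hidden_size, 3)   # update, reset, candidate
lstm_params = count_params(input_size, hidden_size, 4)  # forget, input, output, candidate

print(rnn_params, gru_params, lstm_params)  # 620 1860 2480
```

Under these assumptions a GRU has three quarters of the parameters of an LSTM with the same sizes.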
Bi-directional RNN (Bi-RNN)
Bi-directional RNNs enhance sequence modeling capabilities by considering both past and future contextual information simultaneously.
Architecture of Bi-RNN
- Forward layer: Processes the sequence in chronological order.
- Backward layer: Processes the sequence in reverse chronological order.
The final output is a combination of these two directions, usually achieved through concatenation or summation.
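A minimal sketch of the two directions combined by concatenation is shown below, using a plain tanh RNN for each direction. The helper names `run_direction`, `make_params`, and `bi_rnn`, the weight scale, and the toy sizes are assumptions for illustration.

```python
import numpy as np

def run_direction(Wx, Wh, b, inputs):
    """Run a tanh RNN over the inputs in the order given."""
    h = np.zeros((Wh.shape[0], 1))
    states = []
    for x in inputs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return states

def make_params(rng, input_size, hidden_size):
    """Small random weights; each direction gets its own set."""
    return (0.1 * rng.standard_normal((hidden_size, input_size)),
            0.1 * rng.standard_normal((hidden_size, hidden_size)),
            np.zeros((hidden_size, 1)))

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 6
fwd_params = make_params(rng, input_size, hidden_size)
bwd_params = make_params(rng, input_size, hidden_size)
inputs = [rng.standard_normal((input_size, 1)) for _ in range(seq_len)]

def bi_rnn(inputs):
    fwd = run_direction(*fwd_params, inputs)
    # Process the reversed sequence, then flip back to the original order
    bwd = run_direction(*bwd_params, inputs[::-1])[::-1]
    return [np.vstack((f, b)) for f, b in zip(fwd, bwd)]

outputs = bi_rnn(inputs)
print(len(outputs), outputs[0].shape)  # 6 (8, 1)
```

At each position, the first `hidden_size` rows summarize the inputs up to that position and the remaining rows summarize the inputs from that position onward, which is how later context becomes visible to earlier positions.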
Applications of Bi-directional RNNs
- Natural Language Processing: Part-of-speech tagging, named entity recognition.
- Speech Recognition: Improving accuracy by leveraging both past and future contexts.
- Bioinformatics: Predicting protein structures.
- Time Series Analysis: Useful when the full sequence is available offline (e.g., filling in missing values); in true forecasting, future values are not available at prediction time.
Bi-directional LSTM/GRU
In modern applications, bi-directional RNNs often use LSTM or GRU as their foundational units.
Practice Exercises
Exercise 1: Implement a Simple RNN
Use Python and NumPy to implement a simple RNN capable of generating text at the character level.
Exercise 2: LSTM-based Sentiment Analysis
Build an LSTM-based sentiment classifier for movie reviews using Keras.
Exercise 3: Bi-directional GRU for Named Entity Recognition
Implement a bi-directional GRU model to identify entities such as names, places, etc., in text.
Exercise 4: Comparative Experiment
Compare the performance of Vanilla RNN, LSTM, and GRU on the same dataset.
Summary and Further Learning
RNNs and their variants are powerful tools for handling sequential data. To master them further:
- Understand how gradients propagate within RNNs.
- Learn how attention mechanisms enhance RNNs.
- Explore the relationship between Transformer architectures and RNNs.
- Practice various sequence modeling tasks such as machine translation and speech synthesis.