Recurrent Neural Network

Introduction to RNN

A Recurrent Neural Network (RNN) is a type of neural network specifically designed to handle sequential data such as text, speech, and time series.

Unlike traditional feedforward neural networks, RNNs have the ability to "remember" information from previous steps.

RNNs can use the hidden state from the previous step to influence the current step's output, thereby capturing temporal dependencies in sequences.

The Core Idea of RNN

The core of an RNN is its recurrent connection: the output at each step depends on the current input and on the previous hidden state, which summarizes the inputs seen so far. Because the same update is applied at every step, an RNN can process sequences of arbitrary length.

Traditional Neural Networks: Inputs and outputs are independent (e.g., image classification, where each image is unrelated to others).

RNNs: Through recurrent connections, the hidden state from the previous step is passed to the next step, forming a "memory."

Each step's input = Current data + Hidden state from the previous step.
The output depends not only on the current input but also on the context of all previous steps.

Just like when reading a sentence, understanding the current word relies on what has been read before (e.g., "He opened __," you might predict "the door" or "the book").

Example: Simple RNN Implementation


import numpy as np

class SimpleRNN:

    def __init__(self, input_size, hidden_size):
        self.Wx = np.random.randn(hidden_size, input_size)  # Input weights
        self.Wh = np.random.randn(hidden_size, hidden_size)  # Hidden state weights
        self.b = np.zeros((hidden_size, 1))  # Bias term

    def forward(self, x, h_prev):
        # x: (input_size, 1) column vector; h_prev: (hidden_size, 1) column vector
        h_next = np.tanh(np.dot(self.Wx, x) + np.dot(self.Wh, h_prev) + self.b)
        return h_next
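
To show how the hidden state carries through a whole sequence, here is a minimal sketch that applies the same tanh update at every time step. The function name `run_rnn`, the sizes, the zero initial state, and the 0.1 weight scaling are illustrative assumptions, not part of the tutorial's class.

```python
import numpy as np

def run_rnn(Wx, Wh, b, inputs):
    """Apply h_t = tanh(Wx @ x_t + Wh @ h_{t-1} + b) over a list of column vectors."""
    h = np.zeros((Wh.shape[0], 1))  # initial hidden state (assumed to be zero)
    states = []
    for x in inputs:
        h = np.tanh(Wx @ x + Wh @ h + b)  # same weights reused at every step
        states.append(h)
    return states

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5
Wx = 0.1 * rng.standard_normal((hidden_size, input_size))
Wh = 0.1 * rng.standard_normal((hidden_size, hidden_size))
b = np.zeros((hidden_size, 1))
inputs = [rng.standard_normal((input_size, 1)) for _ in range(seq_len)]

states = run_rnn(Wx, Wh, b, inputs)
print(len(states), states[-1].shape)  # 5 (4, 1)
```

The loop returns one hidden state per input, so a longer or shorter sequence needs no change to the weights.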

How RNNs Work

At each time step t, an RNN performs the following calculations:

Receive the current input xₜ and the hidden state from the previous step hₜ₋₁.
Compute the new hidden state hₜ = f(Wₕₕ·hₜ₋₁ + Wₓₕ·xₜ + b).
Generate the output yₜ = g(Wₕᵧ·hₜ + c).

Here, f is typically a nonlinearity such as tanh, and g depends on the task, for example softmax for classification.
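
The three steps above can be sketched as a single function. This is an illustration under assumptions: f is taken to be tanh, g is taken to be softmax, and the names `rnn_step`, `W_xh`, `W_hh`, `W_hy` and the sizes are chosen here to mirror the symbols in the formulas.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a column vector."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b, c):
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b)  # new hidden state, f = tanh
    y_t = softmax(W_hy @ h_t + c)                  # output distribution, g = softmax
    return h_t, y_t

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 3, 5, 2
W_xh = 0.1 * rng.standard_normal((n_hid, n_in))
W_hh = 0.1 * rng.standard_normal((n_hid, n_hid))
W_hy = 0.1 * rng.standard_normal((n_out, n_hid))
b = np.zeros((n_hid, 1))
c = np.zeros((n_out, 1))

x_t = rng.standard_normal((n_in, 1))
h_prev = np.zeros((n_hid, 1))
h_t, y_t = rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b, c)
print(h_t.shape, y_t.shape)  # (5, 1) (2, 1)
```

With softmax as g, the entries of `y_t` are non-negative and sum to 1, so they can be read as class probabilities.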

Advantages and Disadvantages of RNNs

Advantages

Can handle variable-length sequences.
Theoretically capable of remembering long-term historical information.
Shares parameters across all time steps (using the same set of weights for every step).

Disadvantages

Prone to vanishing and exploding gradients, which makes long-term dependencies difficult to learn.
Lower computational efficiency since time steps cannot be processed in parallel.
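
The gradient problem can be made concrete with a small numerical sketch. The gradient of a late hidden state with respect to an early one is a product of per-step Jacobians, so its size shrinks or grows geometrically with the number of steps. The setup below is an assumption chosen for illustration: recurrent weights are a scaled orthogonal matrix, a random vector stands in for the input term, and the exploding case drops the tanh to isolate the effect of the weights.

```python
import numpy as np

def tanh_rnn_jacobian_norms(Wh, steps, rng):
    """Spectral norm of d h_t / d h_0 after each step of a tanh RNN."""
    n = Wh.shape[0]
    h = np.zeros((n, 1))
    J = np.eye(n)  # d h_0 / d h_0
    norms = []
    for _ in range(steps):
        drive = rng.standard_normal((n, 1))  # stands in for Wx @ x_t + b
        h = np.tanh(Wh @ h + drive)
        # Chain rule: d h_t / d h_{t-1} = diag(1 - h_t^2) @ Wh
        J = (np.diag((1.0 - h ** 2).ravel()) @ Wh) @ J
        norms.append(np.linalg.norm(J, 2))
    return norms

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # orthogonal, spectral norm 1

# Vanishing: weights with spectral norm 0.5 shrink the gradient at every step.
vanishing = tanh_rnn_jacobian_norms(0.5 * Q, steps=30, rng=rng)

# Exploding: a linear recurrence with spectral norm 1.5 grows as 1.5 ** t.
exploding = np.linalg.norm(np.linalg.matrix_power(1.5 * Q, 30), 2)

print(f"after 30 steps: vanishing={vanishing[-1]:.2e}, exploding={exploding:.2e}")
```

In the vanishing case each step multiplies the norm by at most 0.5, so after 30 steps almost no gradient signal from the first step remains.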

Long Short-Term Memory Networks (LSTM)

LSTM is an RNN variant designed to address the difficulty standard RNNs have in learning long-term dependencies.

Core Structure of LSTM

Component	Function
Input Gate	Controls the inflow of new information.
Forget Gate	Determines which old information to discard.
Output Gate	Controls the amount of information outputted.
Memory Cell	Stores long-term states.

Example: Basic Implementation of an LSTM Cell


import numpy as np

def sigmoid(z):
    # Logistic function, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

class LSTMCell:

    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size  # Needed in forward() to split the gates
        # Combine weights for all gates; rows are ordered forget, input, output, candidate
        self.W = np.random.randn(4 * hidden_size, input_size + hidden_size)
        self.b = np.random.randn(4 * hidden_size, 1)

    def forward(self, x, h_prev, c_prev):
        hs = self.hidden_size
        combined = np.vstack((h_prev, x))  # Shape: (hidden_size + input_size, 1)
        gates = np.dot(self.W, combined) + self.b

        # Split gates into individual components
        f_gate = sigmoid(gates[:hs])  # Forget gate
        i_gate = sigmoid(gates[hs:2 * hs])  # Input gate
        o_gate = sigmoid(gates[2 * hs:3 * hs])  # Output gate
        c_candidate = np.tanh(gates[3 * hs:])  # Candidate memory

        # Update memory and hidden state
        c_next = f_gate * c_prev + i_gate * c_candidate
        h_next = o_gate * np.tanh(c_next)

        return h_next, c_next

How LSTM Solves Long-Term Dependency Problems

Selective Memory: The forget gate decides whether to retain or discard specific information.
Gradient Path: The memory cell provides a relatively direct path for gradient propagation.
Information Protection: The cell state changes only through element-wise gating (scaling by the forget gate and adding gated new content), so stored information is not overwritten at every step.
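
The gradient-path point can be illustrated with a short calculation. In the update c_next = f_gate * c_prev + i_gate * c_candidate, the direct derivative of c_next with respect to c_prev is the forget gate itself, so the gradient along the cell path over many steps is a product of forget-gate values. The sketch below uses assumed constant gate values and ignores the indirect dependence through the hidden state; the helper name `cell_path_gradient` is hypothetical.

```python
import numpy as np

def cell_path_gradient(forget_gates):
    """Product of d c_t / d c_{t-1} = f_t along the direct cell-state path."""
    return float(np.prod(forget_gates))

steps = 100
remembering = cell_path_gradient(np.full(steps, 0.99))  # forget gate held near 1
forgetting = cell_path_gradient(np.full(steps, 0.50))   # forget gate held at 0.5

print(f"gate=0.99: {remembering:.3f}")
print(f"gate=0.50: {forgetting:.3e}")
```

With the gate near 1, roughly a third of the gradient survives 100 steps; with the gate at 0.5, essentially none does. The network can therefore learn to keep the gate open for information it needs to retain.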

Gated Recurrent Unit (GRU)

GRU is a simplified variant of LSTM that uses fewer parameters and often achieves comparable performance.

Core Structure of GRU

Component	Function
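Update Gate	Determines how much of the previous state to keep and how much to replace with the candidate.
Reset Gate	Controls how much of the previous state is used when computing the candidate.
Candidate Activation	The proposed new state, computed from the current input and the reset-scaled previous state.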
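
The role of the update gate is easiest to see at its extremes. The sketch below assumes the convention h_next = (1 - z) * h_prev + z * h_candidate, where z weights the candidate; some references swap the roles of z and 1 - z. The helper name `gru_blend` and the example values are made up for illustration.

```python
import numpy as np

def gru_blend(h_prev, h_candidate, z):
    """Interpolate between the old state and the candidate state."""
    return (1 - z) * h_prev + z * h_candidate

h_prev = np.array([[1.0], [-1.0]])
h_candidate = np.array([[0.2], [0.4]])

keep_old = gru_blend(h_prev, h_candidate, z=0.0)  # equals h_prev
take_new = gru_blend(h_prev, h_candidate, z=1.0)  # equals h_candidate
halfway = gru_blend(h_prev, h_candidate, z=0.5)   # element-wise midpoint

print(keep_old.ravel(), take_new.ravel(), halfway.ravel())
```

In a real cell z is a vector, so each hidden unit can choose independently how much of its old value to carry forward.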

Example: Implementation of a GRU Cell


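import numpy as np

def sigmoid(z):
    # Logistic function, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

class GRUCell:

    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size  # Needed in forward() to split the gates
        # Rows are ordered update gate, reset gate, candidate;
        # columns are ordered [h_prev, x] to match np.vstack in forward()
        self.W = np.random.randn(3 * hidden_size, input_size + hidden_size)
        self.b = np.random.randn(3 * hidden_size, 1)

    def forward(self, x, h_prev):
        hs = self.hidden_size
        combined = np.vstack((h_prev, x))
        # Only the update and reset gates use the unmodified h_prev
        gates = np.dot(self.W[:2 * hs], combined) + self.b[:2 * hs]

        # Split gating signals
        z = sigmoid(gates[:hs])  # Update gate
        r = sigmoid(gates[hs:])  # Reset gate

        # Candidate state uses the reset-scaled previous hidden state
        h_candidate = np.tanh(np.dot(self.W[2 * hs:],
            np.vstack((r * h_prev, x))) + self.b[2 * hs:])

        # Update hidden state
        h_next = (1 - z) * h_prev + z * h_candidate

        return h_next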

GRU vs LSTM

Feature	GRU	LSTM
Number of Parameters	Fewer	More
Training Speed	Faster	Slower
Separate Memory Cell	No	Yes
Number of Gates	2	3
Performance	Often competitive on small datasets	May perform better on large datasets
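
The parameter difference can be checked with simple arithmetic. The sketch below assumes the combined-weight layout used in the cell examples, where each gate or candidate block has a weight matrix of shape (hidden_size, input_size + hidden_size) and one bias vector; the helper name `gated_param_count` and the sizes are illustrative. Library implementations may add extra bias terms, so their counts can differ slightly.

```python
def gated_param_count(input_size, hidden_size, num_blocks):
    """Weights plus biases for `num_blocks` gate/candidate blocks."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block

input_size, hidden_size = 100, 128
lstm_params = gated_param_count(input_size, hidden_size, 4)  # 3 gates + candidate
gru_params = gated_param_count(input_size, hidden_size, 3)   # 2 gates + candidate

print(lstm_params, gru_params, gru_params / lstm_params)  # 117248 87936 0.75
```

Under this layout a GRU always has three quarters of the parameters of an LSTM with the same input and hidden sizes.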

Bi-directional RNN (Bi-RNN)

Bi-directional RNNs enhance sequence modeling capabilities by considering both past and future contextual information simultaneously.

Architecture of Bi-RNN

Forward layer: Processes the sequence in chronological order.
Reverse layer: Processes the sequence in reverse chronological order.

The final output is a combination of these two directions, usually achieved through concatenation or summation.
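
The two-direction design can be sketched with plain tanh RNN passes, combined here by concatenation. The helper names `rnn_pass` and `make_params`, the sizes, and the 0.1 weight scaling are assumptions for illustration. Each direction has its own weights.

```python
import numpy as np

def rnn_pass(inputs, Wx, Wh, b):
    """Run a tanh RNN over `inputs` and return the hidden state at every step."""
    h = np.zeros((Wh.shape[0], 1))
    states = []
    for x in inputs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return states

def make_params(rng, input_size, hidden_size):
    return (0.1 * rng.standard_normal((hidden_size, input_size)),
            0.1 * rng.standard_normal((hidden_size, hidden_size)),
            np.zeros((hidden_size, 1)))

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 6
inputs = [rng.standard_normal((input_size, 1)) for _ in range(seq_len)]
fwd_params = make_params(rng, input_size, hidden_size)
bwd_params = make_params(rng, input_size, hidden_size)

forward_states = rnn_pass(inputs, *fwd_params)
# Process the reversed sequence, then flip the results back to the original order
backward_states = rnn_pass(inputs[::-1], *bwd_params)[::-1]

combined = [np.vstack((hf, hb)) for hf, hb in zip(forward_states, backward_states)]
print(len(combined), combined[0].shape)  # 6 (8, 1)
```

After re-alignment, `combined[t]` holds a summary of the inputs up to step t from the forward pass and a summary of the inputs from step t onward from the backward pass.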

Applications of Bi-directional RNNs

Natural Language Processing: Part-of-speech tagging, named entity recognition.
Speech Recognition: Improving accuracy by leveraging both past and future contexts.
Bioinformatics: Predicting protein structures.
Time Series Analysis: Using context on both sides of each point in offline settings where the full series is available.

Bi-directional LSTM/GRU

In modern applications, bi-directional RNNs often use LSTM or GRU as their foundational units.

Practice Exercises

Exercise 1: Implement a Simple RNN

Use Python and NumPy to implement a simple RNN capable of generating text at the character level.

Exercise 2: LSTM-based Sentiment Analysis

Build an LSTM-based sentiment classifier for movie reviews using Keras.

Exercise 3: Bi-directional GRU for Named Entity Recognition

Implement a bi-directional GRU model to identify entities such as names, places, etc., in text.

Exercise 4: Comparative Experiment

Compare the performance of Vanilla RNN, LSTM, and GRU on the same dataset.

Summary and Further Learning

RNNs and their variants are powerful tools for handling sequential data. To master them further:

Understand how gradients propagate within RNNs.
Learn how attention mechanisms enhance RNNs.
Explore the relationship between Transformer architectures and RNNs.
Practice various sequence modeling tasks such as machine translation and speech synthesis.
