# TensorFlow Text Classification
Text classification is a fundamental task in natural language processing (NLP), referring to the automatic categorization of text documents into one or more predefined categories. In practical applications, text classification is widely used for:
* Spam detection
* Sentiment analysis
* News categorization
* Customer service dialogue classification
* Product review classification
Implementing text classification with TensorFlow typically involves the following steps:
1. Data preparation and preprocessing
2. Text vectorization
3. Model construction
4. Model training
5. Model evaluation
6. Model deployment
* * *
## Environment Preparation
Before starting the project, ensure the following Python libraries are installed (the leading `!` runs the command from a notebook cell; omit it in a terminal):

```
!pip install tensorflow
!pip install numpy
!pip install pandas
!pip install matplotlib
```
Import necessary libraries:
```python
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```
Check TensorFlow version:
```python
print(tf.__version__)
# Sample output: 2.8.0
```
* * *
## Dataset Preparation
We will use the IMDB movie review dataset, a classic binary classification dataset containing 50,000 movie reviews labeled as positive (1) or negative (0) sentiment.
### Loading the Dataset
```python
# Load the IMDB dataset bundled with tf.keras.datasets
imdb = tf.keras.datasets.imdb

# Keep only the top 10,000 most frequent words
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
```
### Data Exploration
Examine the data format:
```python
print("Training samples: {}, Test samples: {}".format(len(train_data), len(test_data)))
# Output: Training samples: 25000, Test samples: 25000

# View the first review (a list of word indices)
print(train_data[0])
# Output: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, ...]
```
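Each review is stored as a list of word indices, so it helps to see how such a list maps back to text. The sketch below is a minimal illustration, not part of the original walkthrough: `decode_review` and `toy_word_index` are hypothetical names, and the toy dictionary stands in for the real vocabulary returned by `imdb.get_word_index()`. It assumes the `load_data` defaults, where indices 0, 1 and 2 are reserved for padding, start-of-sequence and out-of-vocabulary tokens, so real words are shifted by 3.

```python
def decode_review(sequence, word_index, index_from=3):
    """Map an integer-encoded review back to a string of words.

    Indices below index_from are reserved tokens and are shown as "?".
    """
    # word_index maps word -> rank; the encoded data stores rank + index_from
    reverse_index = {rank + index_from: word for word, rank in word_index.items()}
    return " ".join(reverse_index.get(i, "?") for i in sequence)


# Toy vocabulary (hypothetical values) standing in for imdb.get_word_index()
toy_word_index = {"the": 1, "film": 2, "was": 3, "great": 4}

encoded = [1, 4, 5, 6, 7, 2]  # 1 = start token, 2 = out-of-vocabulary token
print(decode_review(encoded, toy_word_index))
# ? the film was great ?
```

With the real dataset, the same helper would be called as `decode_review(train_data[0], imdb.get_word_index())`, which downloads the vocabulary file on first use.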
### Data Preprocessing
Convert integer sequences to multi-hot encoding:
```python
def vectorize_sequences(sequences, dimension=10000):
    # One row per review, one column per word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

# Convert labels to floats
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
```
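To make the encoding concrete, the following sketch runs the same idea on two tiny sequences with a five-word vocabulary. It is an illustration under one stated assumption: the array is allocated as `float32` rather than NumPy's default `float64`, which is a substitution on my part. A 25,000 × 10,000 `float64` array takes roughly 2 GB, and `float32` halves that.

```python
import numpy as np


def vectorize_sequences(sequences, dimension=10000):
    """Multi-hot encode integer sequences: 1.0 where a word index occurs."""
    # float32 is an assumption made here to reduce memory use
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  # repeated indices still yield a single 1.0
    return results


toy_sequences = [[1, 3, 3], [0, 2]]
encoded = vectorize_sequences(toy_sequences, dimension=5)
print(encoded)
# [[0. 1. 0. 1. 0.]
#  [1. 0. 1. 0. 0.]]
```

Note that word order and word counts are discarded: the index 3 appears twice in the first sequence but is recorded only once.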
* * *
## Building the Model
### Model Architecture
We will build a simple fully connected neural network:
```python
model = tf.keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=(10000,)),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
```
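As a sanity check on the architecture, the number of trainable parameters can be worked out by hand: each fully connected layer has `inputs × units` weights plus `units` biases. The helper below is a hypothetical name introduced for this illustration; the result should match the total that `model.summary()` reports.

```python
def dense_param_count(input_dim, layer_units):
    """Return the parameter count of each layer in a stack of Dense layers."""
    counts = []
    for units in layer_units:
        counts.append(input_dim * units + units)  # weights + biases
        input_dim = units  # this layer's output feeds the next layer
    return counts


counts = dense_param_count(10000, [16, 16, 1])
print(counts, sum(counts))
# [160016, 272, 17] 160305
```

Almost all of the parameters sit in the first layer, because the input has 10,000 dimensions.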
### Model Compilation
```python
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
```
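Parameter explanation:
* `optimizer`: the algorithm that updates the weights from the loss gradients; `'rmsprop'` selects RMSprop with its default settings
* `loss`: the function minimized during training; binary cross-entropy measures the gap between the predicted probability and the true 0/1 label
* `metrics`: quantities reported during training and evaluation but not optimized; here, classification accuracy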
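To see what the loss and the metric measure, here is a small NumPy sketch of binary cross-entropy and accuracy on three hand-picked predictions. It is a simplified illustration rather than Keras's own implementation; the function names are hypothetical, and the `eps` clipping value is an assumption added to avoid taking `log(0)`.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype="float64")
    # Clip probabilities away from 0 and 1 so the logarithms stay finite
    y_prob = np.clip(np.asarray(y_prob, dtype="float64"), eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    y_pred = (np.asarray(y_prob) > threshold).astype("float64")
    return float(np.mean(y_pred == np.asarray(y_true, dtype="float64")))


y_true = [1, 0, 1]
y_prob = [0.9, 0.1, 0.8]
loss_value = binary_crossentropy(y_true, y_prob)
accuracy_value = binary_accuracy(y_true, y_prob)
print(round(loss_value, 4), accuracy_value)
# 0.1446 1.0
```

All three predictions fall on the correct side of 0.5, so accuracy is 1.0, while the loss is still positive because the predicted probabilities are not exactly 0 or 1.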
* * *
## Training the Model
### Creating a Validation Set
```python
# Hold out the first 10,000 training samples for validation
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
```
### Training Process
```python
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
```
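The arithmetic behind this call is worth spelling out. With 10,000 samples held out, 15,000 remain for training, and each epoch processes them in batches of 512, with the final batch smaller than the rest. A short sketch of the numbers (variable names are illustrative):

```python
import math

train_samples = 25000 - 10000  # samples left after the validation hold-out
batch_size = 512
epochs = 20

steps_per_epoch = math.ceil(train_samples / batch_size)
last_batch_size = train_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)
# 30 152 600
```

So the progress bar should show 30 steps per epoch, and the weights are updated 600 times over the full run.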
### Visualizing Training Results
```python
history_dict = history.history

# Plot training loss and validation loss
plt.plot(history_dict['loss'], 'bo', label='Training loss')
plt.plot(history_dict['val_loss'], 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Plot training accuracy and validation accuracy
plt.plot(history_dict['accuracy'], 'bo', label='Training acc')
plt.plot(history_dict['val_accuracy'], 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```
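Besides reading the curves by eye, the epoch with the lowest validation loss can be picked out programmatically, which gives a reasonable point at which to stop training before overfitting sets in. The sketch below uses a hypothetical helper, `best_epoch`, and made-up numbers in `toy_history`; with a real run, pass `history.history` instead.

```python
def best_epoch(history_dict, metric="val_loss"):
    """Return (1-based epoch, value) where the given metric is lowest."""
    values = history_dict[metric]
    best_index = min(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Hypothetical numbers for illustration only, not results from a real run
toy_history = {
    "loss": [0.52, 0.31, 0.23, 0.18, 0.14],
    "val_loss": [0.40, 0.31, 0.28, 0.29, 0.33],
}

epoch, value = best_epoch(toy_history)
print(epoch, value)
# 3 0.28
```

In this toy data the training loss keeps falling while the validation loss turns upward after epoch 3, which is the typical signature of overfitting.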
* * *
## Model Evaluation and Prediction
### Evaluating Test Set Performance
```python
results = model.evaluate(x_test, y_test)
print(results)
# Sample output: [0.3245, 0.8732] representing loss and accuracy
```
### Making Predictions
```python
predictions = model.predict(x_test)
print(predictions[0])  # Prediction probability for the first test sample
```
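The model outputs probabilities, not class labels. A common convention, assumed here, is to threshold at 0.5: values above it count as positive (1), the rest as negative (0). The sketch below uses made-up arrays (`toy_predictions`, `toy_labels`) with the same `(n_samples, 1)` shape that `model.predict` returns for this model.

```python
import numpy as np

# Hypothetical probabilities and labels for illustration only
toy_predictions = np.array([[0.92], [0.08], [0.51], [0.49]], dtype="float32")
toy_labels = np.array([1.0, 0.0, 0.0, 0.0], dtype="float32")

# Flatten to one value per sample, then apply the 0.5 threshold
predicted_classes = (toy_predictions.ravel() > 0.5).astype("int32")
accuracy = float(np.mean(predicted_classes == toy_labels))

print(predicted_classes, accuracy)
# [1 0 1 0] 0.75
```

The third sample shows why the raw probability is worth keeping: 0.51 is classified as positive, but the model is close to undecided about it.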
* * *
## Model Optimization Suggestions
1. **Adjust Network Architecture**:
   * Increase or decrease the number of hidden layers
   * Try different numbers of neurons
   * Use different activation functions
2. **Regularization Techniques**:
   * Add Dropout layers to prevent overfitting
   * Use L1/L2 regularization
3. **Optimizer Selection**:
   * Try other optimizers like Adam, SGD
   * Adjust the learning rate
4. **Text Preprocessing Improvements**:
   * Use word embeddings (Embedding) instead of multi-hot encoding
   * Try pre-trained word vectors (e.g., Word2Vec, GloVe)
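The first three suggestions can be combined in one model definition. The sketch below is one possible variant, not a tuned recipe: `build_regularized_model` is a hypothetical helper, and the dropout rate (0.5), L2 factor (0.001) and Adam learning rate (0.001) are assumed starting values that would need validation on real data. It does not cover the embedding-based suggestions.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers


def build_regularized_model(input_dim=10000, units=16, dropout_rate=0.5,
                            l2_factor=0.001, learning_rate=0.001):
    """Baseline network plus Dropout, L2 weight penalties and Adam."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        layers.Dense(units, activation="relu",
                     kernel_regularizer=regularizers.l2(l2_factor)),
        layers.Dropout(dropout_rate),  # randomly zeroes activations during training
        layers.Dense(units, activation="relu",
                     kernel_regularizer=regularizers.l2(l2_factor)),
        layers.Dropout(dropout_rate),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_regularized_model()
model.summary()
```

Dropout layers add no trainable parameters, so the parameter count stays the same as in the baseline network; only the training behavior changes.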
* * *
## Complete Code Example
```python
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt

# Load data
imdb = tf.keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# Data preprocessing: multi-hot encode each review
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

# Build model
model = tf.keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=(10000,)),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Compile model
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train model
# Note: the test set is passed as validation data only to monitor progress;
# do not tune hyperparameters against it.
history = model.fit(x_train, y_train,
                    epochs=4,
                    batch_size=512,
                    validation_data=(x_test, y_test))

# Evaluate model
results = model.evaluate(x_test, y_test)
print("Test loss and accuracy:", results)

# Make predictions
predictions = model.predict(x_test)
print("Prediction probability for the first review:", predictions[0])
```