
# PyTorch Torchtext

The PyTorch ecosystem has `torchvision` for image data and `torchaudio` for audio data, but the official `torchtext` library for text processing has changed substantially over time. This section shows how to preprocess text, build a vocabulary, load data, and perform related operations.

> Note: The torchtext library has been refactored several times. Older releases exposed the original API as `torchtext.legacy`; newer releases removed it in favor of a more modern API. Check which version you have installed, or build your own text processing pipeline as shown in this section.

* * *

## 1. Text Data Preprocessing Basics

Text preprocessing is the first step in any NLP task. It covers tokenization, vocabulary building, encoding, and related operations.

### 1.1 Basic Text Processing Pipeline

Example:

```python
import re
from collections import Counter


class SimpleTokenizer:
    """
    Simple tokenizer: lowercase, strip punctuation, split on whitespace
    """
    def __init__(self):
        # Translation table that deletes punctuation characters
        self.punctuation = str.maketrans('', '', '.,!?;:"\'-()[]{}')

    def tokenize(self, text):
        # Convert to lowercase
        text = text.lower()
        # Remove punctuation
        text = text.translate(self.punctuation)
        # Split on whitespace
        tokens = text.split()
        return tokens


class Vocabulary:
    """
    Vocabulary building
    """
    def __init__(self, min_freq=2, max_size=10000):
        self.min_freq = min_freq
        self.max_size = max_size
        # Index 0 is reserved for padding, index 1 for unknown words
        self.word2idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx2word = {0: '<PAD>', 1: '<UNK>'}
        self.word_count = Counter()

    def build_vocab(self, texts):
        """Build vocabulary from text list"""
        tokenizer = SimpleTokenizer()
        # Count word frequency
        for text in texts:
            tokens = tokenizer.tokenize(text)
            self.word_count.update(tokens)
        # Build vocabulary from the most frequent words
        for word, count in self.word_count.most_common(self.max_size):
            if count < self.min_freq:
                break
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word

    def encode(self, text, max_len):
        """Encode text to index sequence"""
        tokens = SimpleTokenizer().tokenize(text)
        # Words missing from the vocabulary map to <UNK>
        indices = [self.word2idx.get(token, self.word2idx['<UNK>']) for token in tokens]
        # Pad with zeros
        if len(indices) < max_len:
            indices += [self.word2idx['<PAD>']] * (max_len - len(indices))
        return indices

    def decode(self, indices):
        """Decode index sequence to text"""
        tokens = [self.idx2word.get(idx, '<UNK>') for idx in indices]
        return ' '.join(tokens)


# Usage example
texts = [
    "Hello world",
    "This is a test",
    "PyTorch is great for deep learning",
    "Natural language processing is fun",
    "Deep learning enables many applications",
]

vocab = Vocabulary(min_freq=1, max_size=
```
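To make the tokenize, build, encode, and decode steps concrete, here is a compact, standalone sketch of the same pipeline written as plain functions. It is an illustration rather than the tutorial's own code: the function names, the `<PAD>` and `<UNK>` token names, the regex-based tokenizer, and the truncation to `max_len` are all assumptions made for this example.

```python
import re
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"  # assumed names for the special tokens


def tokenize(text):
    """Lowercase, then keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(texts, min_freq=1, max_size=10000):
    """Map each sufficiently frequent token to an integer index."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    word2idx = {PAD: 0, UNK: 1}
    for word, count in counts.most_common(max_size):
        if count < min_freq:
            break  # most_common is sorted by count, so the rest are rarer
        word2idx[word] = len(word2idx)
    return word2idx


def encode(text, word2idx, max_len):
    """Return exactly max_len indices: truncate first, then pad."""
    ids = [word2idx.get(tok, word2idx[UNK]) for tok in tokenize(text)]
    ids = ids[:max_len]
    return ids + [word2idx[PAD]] * (max_len - len(ids))


def decode(ids, word2idx):
    """Map indices back to tokens, skipping padding."""
    idx2word = {i: w for w, i in word2idx.items()}
    return " ".join(idx2word[i] for i in ids if i != word2idx[PAD])


corpus = ["Deep learning is fun", "Deep learning is useful"]
word2idx = build_vocab(corpus)
ids = encode("Deep learning is hard", word2idx, max_len=6)
print(ids)                    # [2, 3, 4, 1, 0, 0]
print(decode(ids, word2idx))  # deep learning is <UNK>
```

The word "hard" never appears in the corpus, so it maps to the unknown-token index 1, and the two trailing zeros are padding. Truncating before padding guarantees a fixed length, which is what lets encoded sequences be stacked into a batch later.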