
ML Training and Test Set Splitting

## Training and Test Set Splitting

In the world of machine learning, data is the fuel that drives all models. How you use this fuel determines whether your model becomes an intelligent engine that accurately predicts the future, or a parrot that only memorizes things by rote.

Today, we will delve into a fundamental concept in machine learning: **training and test set splitting**. This is the first step in building any reliable model, and the key to evaluating a model's true capabilities.

Simply put, splitting training and test sets is like studying and taking exams in school:

* **Training set**: the student's textbook and practice problems, which the model uses to learn patterns and rules in the data.
* **Test set**: the final exam, which checks whether the model truly understands the material rather than having memorized the answers to the practice problems (the training set).

* * *

## Why Must We Split Training and Test Sets?

Imagine a student who only reviews the mock questions given by the teacher, then sits an exam made of exactly the same questions and gets a perfect score. Does this prove the student truly understands the subject? Obviously not. The student might just have memorized the answers.

In machine learning, if we train a model on **all data** and then evaluate its performance on the **same data**, we make the same mistake. The model will appear to perform exceptionally well because it has already "seen" and "memorized" the details in the data, including noise and randomness. Fitting the training data this closely, noise included, is called **overfitting**, and evaluating on the training data hides it.

An overfitted model is like a student who can only recite example problems. Once it encounters new, unseen problems (new data), it performs poorly. Its "generalization ability" is weak.

Therefore, we must split the data into two parts:

1. **Training set**: Used to **teach** the model and let it learn.
2. **Test set**: Used to **test** the model and evaluate its ability to handle **new data it has never seen**.

The test set must be completely isolated from the training set and **must not** be seen by the model at any point during training. Only then can the evaluation results on the test set objectively reflect the model's true generalization ability.

* * *

## How to Split: Common Methods and Strategies

Splitting data sounds simple, but there is much to consider. Different splitting strategies apply to different scenarios.

### 1. Simple Random Splitting

This is the most basic and commonly used method. Shuffle the entire dataset randomly, then split it into two parts according to a chosen ratio.

## Example

```python
# Example: Using Python's scikit-learn library for random splitting

from sklearn.model_selection import train_test_split

# Assume X is feature data, y is label data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Number of training samples: {len(X_train)}")

print(f"Number of test samples: {len(X_test)}")
```

**Code Explanation**:

* `train_test_split`: The core function in scikit-learn for splitting data.
* `X, y`: Input feature data and corresponding labels.
* `test_size=0.2`: Sets the test set to 20% of the data (so the training set takes 80%). You can also specify `train_size=0.8` instead.
* `random_state=42`: Sets a random seed. This ensures that the split is exactly the same every time the code runs, which is crucial for experimental reproducibility. You can set it to any integer.

### 2. Stratified Sampling Splitting

In classification problems, if the class distribution in the dataset is imbalanced (for example, 90% class A and 10% class B), simple random splitting may produce noticeably different class proportions in the training and test sets, which makes the evaluation less fair.

**Stratified sampling** ensures that after splitting, the proportion of each class in the training and test sets remains consistent with the original dataset.

## Example

```python
# Example: Using stratified sampling in classification problems

from collections import Counter

from sklearn.model_selection import train_test_split

# Assume y is classification labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Check class distribution after splitting

print("Original data class distribution: ", Counter(y))

print("Training set class distribution: ", Counter(y_train))

print("Test set class distribution: ", Counter(y_test))
```

**Code Explanation**:

* `stratify=y`: This is the key parameter. It tells the function to perform stratified sampling according to the class distribution of label `y`.

### 3. Time Series Data Splitting

For time series data (such as stock prices or daily temperatures), there are temporal dependencies between data points. We cannot shuffle randomly: a shuffled split would let the model train on future data and then be tested on the past, which never happens in real use.

The common practice is to split in chronological order: **use the first 80% of the time span as the training set, and the last 20% as the test set**.

## Example

```python
# Example: Sequential splitting of time series data
# Assume X and y are already sorted in chronological order

split_index = int(len(X) * 0.8)  # Calculate index at 80% position

X_train, X_test = X[:split_index], X[split_index:]

y_train, y_test = y[:split_index], y[split_index:]

print(f"Training set time range: first {split_index} samples")

print(f"Test set time range: last {len(X) - split_index} samples")
```

* * *

## How to Choose Splitting Ratios?

This is a common question, but there is no fixed answer. Common ratios include:

| Ratio (Training:Test) | Applicable Scenario | Advantages | Disadvantages |
| --- | --- | --- | --- |
| **70:30** | Classic choice for small to medium datasets (thousands to tens of thousands of samples) | Balances training data volume and evaluation reliability | For very small datasets, a 30% test set may still have too few samples, making evaluation unstable |
| **80:20** | The more popular default choice now, especially suitable for deep learning | Provides more data for model learning | Relatively smaller test set, so evaluation variance may be slightly larger |
| **90:10 or 95:5** | When data is very limited and as much as possible must go to training (also used for very large datasets, where even a small percentage yields many test samples) | Maximizes the data available for training | On small datasets the test set is tiny, so evaluation results have low confidence |

**Core Principles**:

1. **Ensure the training set is large enough**: Models need sufficient data to learn effective patterns.
2. **Ensure the test set is large enough**: The test set needs to provide a statistically reliable performance estimate. Usually, it should contain at least several hundred samples for stable evaluation results.
3. **The larger the data volume**, the **smaller** the proportion allocated to the test set can be, because even a small proportion may represent a large number of samples.

* * *

## Advanced Concepts: Validation Set and Cross-Validation

In actual projects, we not only need to evaluate the final model, but also need to adjust the model's **hyperparameters** (such as learning rate or tree depth) during development. If we use the test set directly to tune these, the test set becomes "contaminated" and loses its fairness as the "final examiner".

For this purpose, we introduce the **validation set**.

### Three-Dataset Splitting: Training Set, Validation Set, Test Set

1. **Training set**: Used for learning model parameters.
2. **Validation set**: Used for adjusting hyperparameters, selecting models, or early stopping during training. It is equivalent to a "mock exam".
3. **Test set**: Used for a final, one-time performance evaluation after the model and hyperparameters are fixed. It is the "final exam".

## Example

```python
# Example: First split off the test set, then split the remainder into training and validation sets

from sklearn.model_selection import train_test_split

X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)  # First split off 15% as the final test set

X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.176, random_state=42)  # Split about 15% of the full data from the remaining 85% as the validation set

# Calculate ratio: 0.85 * 0.176 ≈ 0.15, so the final ratio is approximately 70:15:15

print(f"Training set: {len(X_train)}, Validation set: {len(X_val)}, Test set: {len(X_test)}")
```

### K-Fold Cross-Validation

When the data volume is not large, carving out a separate validation set further reduces the training data. **K-fold cross-validation** is a more powerful solution.

Its process is as follows, and it makes effective use of limited data:

## Example

```mermaid
flowchart TD
    A[Original dataset] --> B[Split into K folds]
    B --> C{Perform K rounds of iteration}
    C --> D[Round i: Use the i-th fold as the validation set]
    D --> E[Use the remaining K-1 folds as the training set]
    E --> F[Train the model on the training folds]
    F --> G[Evaluate on the validation fold and record the score]
    G --> C
    C -- After K rounds --> H[Average the K scores as the final estimate]
```

## Example

```python
# Example: Using 5-fold cross-validation to evaluate a model
# Assume X is feature data, y is classification labels

from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)  # A higher max_iter helps the solver converge

scores = cross_val_score(model, X, y, cv=5)  # cv=5 means 5-fold cross-validation

print(f"Scores per fold: {scores}")

print(f"Average score: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")  # Output mean score and two standard deviations
```

**Advantages of Cross-Validation**:

* Every sample is used for both training and validation (in different rounds).
* Evaluation results are more stable and reliable, because they are the average of multiple evaluations.
* It is a standard approach for model selection and tuning on small to medium datasets.

* * *

## Practical Exercise: Hands-On Experience with Data Splitting

Now, let's practice with a simple dataset.

## Example

```python
# 1. Import necessary libraries

import numpy as np

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

# 2. Load iris dataset

iris = load_iris()

X, y = iris.data, iris.target

print(f"Dataset shape: features {X.shape}, labels {y.shape}")

# 3. Simple random splitting (80% training, 20% testing)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Random split -> Training set: {X_train.shape}, Test set: {X_test.shape}")

# 4. Stratified random splitting

X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print(f"Stratified split -> Training set: {X_train_s.shape}, Test set: {X_test_s.shape}")

# 5. Check stratification effect

print("\nOriginal data class distribution: ", np.bincount(y))

print("Test set distribution after random split: ", np.bincount(y_test))  # May be imbalanced

print("Test set distribution after stratified split: ", np.bincount(y_test_s))  # Should be proportional to original distribution
```

**Your Tasks**:

1. Run the above code and observe the output.
2. Try changing `test_size` to 0.3 and observe the changes in training and test set sizes.
3. Try changing `random_state` to another number (such as 7), run again, and observe whether the splitting results change.
4. (Challenge) Do not set the `random_state` parameter, run the code multiple times, and observe whether the splitting results are the same each time.

* * *

## Summary and Key Points

* **Core Purpose**: Splitting training and test sets lets us **evaluate model generalization ability**, detect overfitting, and check that the model can handle new data.
* **Golden Rule**: **The test set must stay completely hidden from the model throughout training** and be used only for final evaluation.
* **Splitting Methods**:
    * **Random splitting**: The most general method.
    * **Stratified splitting**: Suitable for imbalanced data in classification problems.
    * **Sequential splitting**: Suitable for time series data.
* **Splitting Ratios**: There is no absolute standard; balance "sufficient training" against "reliable evaluation". 80:20 or 70:30 are common starting points.
* **Advanced Tools**:
    * **Validation set**: Used for model tuning; protects the purity of the test set.
    * **K-fold cross-validation**: A powerful tool for evaluation and tuning on small to medium datasets, with more robust results.
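The key points above can be combined into a single workflow: hold out a test set first, tune hyperparameters with K-fold cross-validation on the training portion only, and touch the test set exactly once at the end. The following is a minimal sketch rather than a definitive recipe. It assumes the iris dataset and a logistic regression model, and the candidate values for the regularization strength `C` are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Load a small classification dataset (assumption: iris, as in the exercise)
X, y = load_iris(return_X_y=True)

# Step 1: hold out a stratified test set that is not used again until the end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: tune a hyperparameter with 5-fold cross-validation on the training set only
# (the grid of C values below is illustrative)
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_train, y_train)

print(f"Best C from cross-validation: {search.best_params_['C']}")
print(f"Mean cross-validation accuracy: {search.best_score_:.4f}")

# Step 3: one-time final evaluation on the untouched test set
test_score = search.score(X_test, y_test)
print(f"Test set accuracy: {test_score:.4f}")
```

Because the cross-validation folds are drawn only from `X_train`, the test set never influences the choice of `C`, so the final accuracy remains an honest estimate of performance on new data.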