
ML Dimensionality Reduction

## Unsupervised Learning - Dimensionality Reduction

Imagine you are a photographer organizing a gallery of millions of high-resolution photos. Each photo consists of millions of pixels (features). If you want to quickly find every photo of a seaside sunset, comparing the photos pixel by pixel is nearly impossible: the data is both too large and too "wide".

In machine learning we often face a similar dilemma: datasets have hundreds or thousands of features (dimensions). This slows computation and, because of the curse of dimensionality, makes true patterns harder to find, especially when many features are redundant or irrelevant.

**Dimensionality Reduction** is a core technique in unsupervised learning. Like a data fitness coach, it helps us "slim down" high-dimensional data into a lower-dimensional space while preserving as much of the important information as possible. Today, we will learn about dimensionality reduction in an accessible way.

* * *

## Basic Concepts of Dimensionality Reduction

### What is Dimensionality Reduction?

Simply put, **dimensionality reduction is the process of reducing the number of features in a dataset**. It maps data points from the original high-dimensional space to a new, lower-dimensional space through a mathematical transformation.

### Why Dimensionality Reduction?

Dimensionality reduction is far more than discarding data. Its core benefits are:

1. **Visualization**: Humans can intuitively understand at most three dimensions. By reducing data to 2D or 3D, we can plot it and visually inspect its structure, groupings, and outliers.
2. **Efficiency Improvement**: Fewer dimensions mean less storage, faster training, and lower computational cost.
3. **Noise and Redundancy Removal**: Many algorithms (especially distance-based ones like KNN) degrade in high-dimensional spaces when features are irrelevant or duplicated. Dimensionality reduction keeps the informative part of the data.
4. **Alleviating the Curse of Dimensionality**: In high-dimensional spaces data becomes extremely sparse, which makes it hard for many machine learning models to find reliable patterns.

### Core Idea: Information Retention

The key challenge of dimensionality reduction is: **How do we retain as much valuable information from the original data (such as variance or neighborhood structure) as possible while reducing dimensions?** Different algorithms answer this question differently.

* * *

## Detailed Explanation of Mainstream Dimensionality Reduction Algorithms

Dimensionality reduction algorithms fall into two main categories: **Linear Dimensionality Reduction** and **Non-linear Dimensionality Reduction**.

### Linear Dimensionality Reduction: Principal Component Analysis

**Principal Component Analysis (PCA)** is the most classic and commonly used linear dimensionality reduction method. Its goal is to find a new set of orthogonal coordinate axes (called "principal components") such that the variance of the data projected onto these axes is maximized.

#### How PCA Works (Five Steps):

1. **Centering**: Subtract the mean from each feature to move the center of the data distribution to the origin.
2. **Compute Covariance Matrix**: This matrix describes how each pair of features varies together.
3. **Eigenvalue Decomposition**: Calculate the eigenvalues and eigenvectors of the covariance matrix. **Eigenvectors** give the directions of the new coordinate axes (principal components), while **eigenvalues** give the variance of the data along each direction. The larger the eigenvalue, the more information that direction carries.
4. **Select Principal Components**: Sort the eigenvalues from largest to smallest, select the eigenvectors corresponding to the top `k` eigenvalues, and stack them into a projection matrix.
5. **Data Transformation**: Multiply the centered data by this projection matrix to obtain the new data reduced to `k` dimensions.

## Example

```python
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# 1. Load the classic Iris dataset (4 features)
iris = load_iris()
X = iris.data    # Original data: 150 samples, 4 features
y = iris.target  # Labels, used only to color the plot

print(f"Original data shape: {X.shape}")  # Output: (150, 4)

# 2. Create PCA model, specify reduction to 2 dimensions
pca = PCA(n_components=2)

# 3. Fit model (compute principal components) and transform data
X_pca = pca.fit_transform(X)

print(f"Data shape after reduction: {X_pca.shape}")  # Output: (150, 2)
print(f"Variance ratio explained by each component: {pca.explained_variance_ratio_}")
# Output is approximately [0.9246, 0.0531]: the first component retains
# about 92.5% of the variance, the second about 5.3%

# 4. Visualize the reduction results
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, edgecolor='k', alpha=0.7)
plt.xlabel('First Principal Component (PC1)')
plt.ylabel('Second Principal Component (PC2)')
plt.title('PCA: Iris Dataset Dimensionality Reduction Visualization')
plt.colorbar(scatter, label='Iris species')
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()
```

**Code Explanation**:

* `PCA(n_components=2)`: Initializes the model. The `n_components` parameter specifies the number of principal components to retain (i.e., the dimension after reduction).
* `fit_transform(X)`: A combined method that first computes the mean and principal component directions of the data (`fit`), then immediately projects the data into the new space (`transform`).
* `explained_variance_ratio_`: A very important attribute of PCA. It tells us what fraction of the original variance (information) each principal component retains, which helps us decide how many components to keep.

#### Pros and Cons of PCA

* **Pros**: Computationally efficient, based on clear principles, effective at removing linear correlations.
* **Cons**: It is a linear method, so it can only capture structure that lies along straight directions in feature space. For non-linear manifold data like the "Swiss roll", PCA performs poorly.

### Non-linear Dimensionality Reduction: t-SNE

When data has complex non-linear structure, we need non-linear dimensionality reduction methods. **t-Distributed Stochastic Neighbor Embedding (t-SNE)** is one of the most popular non-linear algorithms for visualization.

#### Core Idea of t-SNE

t-SNE focuses on **preserving the local structure of data**.
It tries to keep points that are "similar" (close together) in the high-dimensional space close together in the low-dimensional map, while points that are "dissimilar" in the high-dimensional space end up far apart.

## Example

```python
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import make_swiss_roll  # Generates Swiss roll data

# 1. Generate a non-linear dataset: Swiss roll
X_swiss, color = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)
print(f"Swiss roll data shape: {X_swiss.shape}")  # (1000, 3)

# 2. Try dimensionality reduction using PCA (linear method)
pca = PCA(n_components=2)
X_swiss_pca = pca.fit_transform(X_swiss)

# 3. Dimensionality reduction using t-SNE (non-linear method)
# perplexity is a key parameter for t-SNE, typically between 5 and 50;
# it balances attention between local and global structure
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_swiss_tsne = tsne.fit_transform(X_swiss)

# 4. Comparative visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# PCA results (left panel)
axes[0].scatter(X_swiss_pca[:, 0], X_swiss_pca[:, 1], c=color, cmap='viridis')
axes[0].set_title('PCA Dimensionality Reduction Results')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')

# t-SNE results (right panel)
sc = axes[1].scatter(X_swiss_tsne[:, 0], X_swiss_tsne[:, 1], c=color, cmap='viridis')
axes[1].set_title('t-SNE Dimensionality Reduction Results (perplexity=30)')
axes[1].set_xlabel('t-SNE 1')
axes[1].set_ylabel('t-SNE 2')

# color is each point's position along the rolled-up sheet
plt.colorbar(sc, ax=axes, label='Position along the roll')
plt.show()
```

**Code Explanation**:

* `perplexity` parameter: Can be understood as roughly how many neighbors each point considers. Smaller values focus more on local structure, larger values focus more on global structure. It is the most important tuning parameter for t-SNE.
* `random_state`: Ensures reproducible results, as t-SNE's optimization process is stochastic.
* In the plots, PCA "flattens" the Swiss roll, so points from different layers of the roll overlap and the curled structure is lost. t-SNE keeps neighboring points together and preserves local adjacency much better, although it may split the roll into several segments rather than unroll it into one continuous sheet.

#### Pros and Cons of t-SNE

**Pros**: Excellent visualizations of complex non-linear data; clustering structure often shows up clearly.

**Cons**:

1. **Slow computation**, which makes it a poor fit for very large datasets.
2. Results are **stochastic**: each run may look slightly different.
3. **Sensitive to hyperparameters**: `perplexity` needs tuning.
4. Mainly **used for visualization** (2D/3D). The reduced features are usually not used for subsequent machine learning tasks, because distances in the low-dimensional map no longer carry their original meaning.

* * *

## How to Choose Dimensionality Reduction Methods and Key Parameters

### Algorithm Selection Flowchart

(Figure: algorithm selection flowchart)

### Key Parameter Guide

**PCA: `n_components`**

* Can be set to an integer (e.g., 2) to specify the exact dimension.
* Can be set to a decimal `0 < n < 1` (e.g., 0.95), which keeps the minimum number of principal components needed to reach that **cumulative explained variance ratio**.

## Example

```python
# Retain 95% of variance information
pca = PCA(n_components=0.95)
pca.fit(X)
print(f"To retain 95% variance, {pca.n_components_} principal components are needed")
```

**t-SNE: `perplexity`**

* Typical values are between 5 and 50.
* For small datasets, use a smaller value; scikit-learn requires `perplexity` to be less than the number of samples.

## Example

```python
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# 1. Load MNIST.
# NOTE: the loading step is missing from this listing as published. Loading via
# fetch_openml('mnist_784') and the 5000-sample subset are assumptions.
X_mnist, y_mnist = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X_mnist, y_mnist = X_mnist[:5000], y_mnist[:5000]  # assumed subset to keep t-SNE fast
print(f"MNIST data shape: {X_mnist.shape}")  # 784 dimensions!

# 2. First use PCA to quickly reduce to 50 dimensions, removing a large amount of noise
pca = PCA(n_components=50)
X_mnist_pca = pca.fit_transform(X_mnist)
print(f"Shape after PCA: {X_mnist_pca.shape}")

# 3. Then use t-SNE to reduce 50-dimensional data to 2 dimensions for visualization
# (max_iter was called n_iter in scikit-learn versions before 1.5)
tsne = TSNE(n_components=2, perplexity=40, max_iter=300, random_state=42)
X_mnist_tsne = tsne.fit_transform(X_mnist_pca)

# 4. Visualization
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_mnist_tsne[:, 0], X_mnist_tsne[:, 1],
                      c=y_mnist.astype(int), cmap='tab10', alpha=0.6, s=5)
plt.colorbar(scatter, ticks=range(10), label='Handwritten digits')
plt.title('MNIST Handwritten Digits Dataset t-SNE Visualization after PCA Preprocessing')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.grid(True, linestyle='--', alpha=0.3)
plt.show()
```

**Exercise Goal**: Observe whether different digits (0-9) form clear clusters on the 2D plane.
Try modifying the `perplexity` parameter (e.g., to 10 or 50) and see how the visualization changes.

### Summary and Key Points

**The Essence of Dimensionality Reduction**: It is information compression and refinement, not simply discarding data. The goal is to **express as much of the original information as possible with fewer dimensions**.

**PCA (King of Linearity)**: Finds the main linear directions of the data by maximizing variance. Efficient and stable; well suited to preprocessing and removing linear correlations.

**t-SNE (Visualization Tool)**: Reveals non-linear structure by preserving local similarity between data points. It produces striking plots, but it is slow, its results vary between runs, and it is mainly used for exploratory data analysis.

**Workflow**:

* **Clarify the goal**: Is it visualization, or more compact features for downstream models?
* **Data exploration**: Visualize part of the data first to get a preliminary sense of whether its structure is linear or non-linear.
* **Method experimentation**: Select algorithms based on the goal and the data structure, and adjust the key parameters.
* **Evaluate results**: Judge the reduction through visualization, the fraction of variance retained, or downstream task performance.

Dimensionality reduction is a key to opening the black box of high-dimensional data.

With PCA and t-SNE in hand, you can both survey the overall structure of complex data and prepare compact features for subsequent machine learning models, improving the efficiency and depth of your data analysis.
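
As a closing illustration, the PCA steps listed earlier (centering, covariance matrix, eigenvalue decomposition, component selection, projection) can be sketched directly in NumPy. This is a minimal teaching sketch, not how scikit-learn implements `PCA` (scikit-learn uses a singular value decomposition instead of an explicit eigendecomposition). The function name `pca_from_scratch` and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np

def pca_from_scratch(X, k):
    """Hypothetical helper: project X (n_samples, n_features) onto its top-k components."""
    # 1. Centering: move the data mean to the origin
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix of the features (n_features x n_features)
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigendecomposition; eigh is meant for symmetric matrices
    #    and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Keep the eigenvectors belonging to the k largest eigenvalues
    order = np.argsort(eigvals)[::-1][:k]
    projection = eigvecs[:, order]
    # 5. Project the centered data onto the selected directions
    X_reduced = X_centered @ projection
    explained_ratio = eigvals[order] / eigvals.sum()
    return X_reduced, explained_ratio

# Synthetic data (an assumption for the demo): 200 samples, 3 features,
# where the third feature is almost a copy of the first
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X_demo = np.column_stack([base[:, 0], base[:, 1],
                          base[:, 0] + 0.01 * rng.normal(size=200)])

X_2d, ratio = pca_from_scratch(X_demo, k=2)
print(f"Reduced shape: {X_2d.shape}")
print(f"Explained variance ratio: {ratio}")
```

Because the third feature is nearly redundant, two components should retain almost all of the variance. The signs of the projected coordinates may differ from scikit-learn's output, since an eigenvector is only defined up to its sign.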
← Ml Exploration ExploitationMl Classification Metrics β†’