ML Cluster Analysis

## Unsupervised Learning - Clustering

Imagine walking into a huge library where all the books are piled chaotically on the floor. Your task isn’t to read every book (that would take far too long), but to sort them into piles by subject, such as science fiction, historical biographies, or cookbooks. You have no pre-existing classification list telling you which book belongs to which category; instead, you discover the groups yourself from features like content, cover, and thickness.

In machine learning, **clustering** does exactly this: it is an **unsupervised learning** method that aims to uncover the inherent structure and groupings in data that has no pre-assigned answers (i.e., no "labels").

* * *

## What Are Unsupervised Learning and Clustering?

Before we begin, let’s quickly distinguish between the two major paradigms in machine learning:

* **Supervised Learning**: Like learning with a teacher’s guidance. We give the algorithm many questions (feature data) together with their standard answers (labels), so that it learns the mapping from questions to answers. For example, we show the algorithm many images of cats and dogs (features) and tell it whether each image is a cat or a dog (label); after training, it can identify new images.
* **Unsupervised Learning**: Like letting the machine explore and discover on its own. We provide only the questions (feature data), **without providing answers (labels)**. The algorithm’s task is to uncover patterns, structures, or relationships within the data by itself.
Clustering is one of the core techniques in this paradigm.

**Core idea of clustering**: Partition the samples in a dataset into several **non-overlapping** subsets (called clusters), such that samples within the same cluster are **similar** to each other, while samples in different clusters are **dissimilar**.

Similarity is typically measured mathematically using a **distance** (e.g., Euclidean distance): the smaller the distance, the higher the similarity.

* * *

## Classic Clustering Algorithm: K-Means

`K-Means` is one of the most famous and widely used clustering algorithms, with an intuitive concept and a relatively simple implementation.

### Algorithm Principle and Steps

We can picture the K-Means process as holding elections for representatives and redrawing districts:

1. **Determine the number of clusters K**: First, decide how many groups you want to divide the data into. K must be specified in advance; it is the key parameter of K-Means.
2. **Initialize representatives (centroids)**: Randomly select K points in the data space as the initial "centers" of the clusters, known as **centroids**.
3. **Assign residents (samples)**: Compute the distance from every sample in the dataset to each of the K centroids, and assign each sample to the cluster whose centroid is **closest** to it. All samples are now partitioned into K clusters.
4. **Re-elect representatives (update centroids)**: Each cluster now contains a set of samples. Recompute each cluster’s centroid as the **mean** (average point) of all samples in that cluster.
5. **Repeat and converge**: Repeat step 3 (assignment) and step 4 (update) until the centroid positions no longer change significantly (i.e., the algorithm converges).
At this point, each sample’s cluster assignment stabilizes.

### Code Example and Practice

Let’s demonstrate K-Means using Python’s `scikit-learn` library and a simple dataset.

```python
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Optional font setup, only needed if the plot labels contain Chinese characters
plt.rcParams['font.sans-serif'] = [
    'SimHei', 'Microsoft YaHei',           # Windows
    'PingFang SC', 'Heiti TC',             # macOS
    'WenQuanYi Micro Hei', 'DejaVu Sans',  # Linux
]
# Fix the issue where minus signs display as squares
plt.rcParams['axes.unicode_minus'] = False

# 1. Create a synthetic dataset
# We generate 300 sample points that naturally cluster around 4 centers (for easier observation)
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# X is the feature data; y_true holds the true class labels
# (used only for a final comparison; the clustering algorithm never sees them)

# 2. Visualize the original data
plt.scatter(X[:, 0], X[:, 1], s=50)  # s is the point size
plt.title("Original unlabeled data")
plt.show()

# 3. Apply K-Means clustering
# Specify clustering into 4 clusters
kmeans = KMeans(n_clusters=4, random_state=0, n_init='auto')
# Fit the model and predict a cluster label for each sample
y_kmeans = kmeans.fit_predict(X)

# 4. Get the centroid coordinates
centroids = kmeans.cluster_centers_

# 5. Visualize the clustering results
# Color-code the sample points by cluster
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
# Mark the centroids with large red crosses; alpha controls transparency
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200, alpha=0.8, marker='X')
plt.title("K-Means Clustering Results (K=4)")
plt.show()

# Print the predicted cluster labels for the first 10 samples
print("Cluster labels for the first 10 samples:", y_kmeans[:10])
# Print the centroid coordinates
print("Centroid coordinates of the four clusters:\n", centroids)
```

**Code Explanation**:

* `make_blobs`: Generates a simulated dataset for clustering; `centers=4` means the data points are generated around 4 centers.
* `KMeans(n_clusters=4)`: Creates a K-Means model instance with K = 4 clusters. `n_init` controls how many times the algorithm is run with different initial centroids, keeping the best result; `n_init='auto'` (available since scikit-learn 1.2) lets the library pick that number based on the initialization method.
* `fit_predict(X)`: The core method; fits the model on data `X` and returns a cluster index (0, 1, 2, or 3) for each sample.
* `cluster_centers_`: Attribute that stores the coordinates of the K centroids after training.

Running this code produces two plots. The first shows the raw points in a single color, with no grouping information; the second shows them separated into four color-coded groups, with red X marks at the centroids. This is the power of K-Means!

The script also prints the cluster labels of the first 10 samples, followed by the centroid coordinates:

```
Centroid coordinates of the four clusters:
 [[ 0.94973532  4.41906906]
 [ 1.98258281  0.86771314]
 [-1.37324398  7.75368871]
 [-1.58438467  2.83081263]]
```

* * *

## How to Choose the Optimal K Value?

In the above example, since we knew the data was generated around 4 centers, we easily set `K=4`.
However, in real-world scenarios, we often don’t know how many clusters the data should be divided into. How do we choose K?

A commonly used method is the **"Elbow Method"**. The idea is: as the number of clusters K increases, the sum of squared distances from each sample to its assigned cluster centroid (called `inertia`, or SSE; the closely related average is often called **distortion**) decreases. While K is smaller than the true number of clusters, increasing K reduces this value substantially; once K reaches the true number of clusters, further increases in K yield only small reductions. This inflection point resembles an elbow joint, and the corresponding K value is considered a good choice.

```python
# Elbow Method example: calculate the inertia for different K values
# (reuses X, KMeans, and plt from the previous example)
inertias = []
K_range = range(1, 11)  # Test K from 1 to 10

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=0, n_init='auto')
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)  # inertia_ is the sum of squared errors (SSE)

# Plot the elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia (SSE)')
plt.title('Elbow Method to Find the Optimal K Value')
plt.axvline(x=4, color='r', linestyle='--', alpha=0.5)  # Mark the known value K=4
plt.show()
```

In the generated curve, the rate of decline clearly slows near K=4, forming an "elbow", which suggests K=4 is a reasonable choice.

* * *

## Applications of Clustering

Clustering is a powerful exploratory data analysis tool with a wide range of applications:

1. **Customer Segmentation**: In e-commerce or marketing, cluster customers based on purchasing behavior and demographics (e.g., age, income) to identify groups like "high-value customers" or "price-sensitive customers" for targeted marketing.
2. **Image Segmentation**: Cluster image pixels by color and texture to simplify images or identify foreground/background regions.
3. **Anomaly Detection**: Normal data points typically form dense clusters, while anomalies lie far from any cluster center. Clustering helps detect such outliers.
4. **Document Clustering**: Cluster news articles or research papers to automatically discover trending topics or research areas.
5. **Social Network Analysis**: Cluster users based on relationships and interactions to identify communities or social circles.

* * *

## Practice Exercises and Summary

**Exercise 1: Try Different K Values**
Change the `n_clusters` parameter in the K-Means example code above to 2, 3, 5, and 8 in turn. Observe the resulting plots to understand how the choice of K affects the outcome.

**Exercise 2: Use a Real Dataset**
Try clustering the built-in `iris` dataset in `scikit-learn`. Although this dataset is typically used for classification, you can ignore its labels and apply K-Means using only the feature data (sepal/petal length and width), then compare the clustering results with the true labels to evaluate performance.

```python
from sklearn import datasets

iris = datasets.load_iris()
X_iris = iris.data  # Use only the feature data
```
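One possible way to complete Exercise 2 is sketched below. It is only an illustration: the choice of `adjusted_rand_score` as the comparison metric and the explicit `n_init=10` are assumptions made here, not something the exercise prescribes. The Adjusted Rand Index is convenient because cluster indices are arbitrary (cluster 0 need not correspond to species 0), and this score does not depend on how the clusters are numbered.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
X_iris = iris.data    # features only: sepal/petal length and width
y_iris = iris.target  # true species labels, used only for the final comparison

# K=3 because the dataset contains three species.
# n_init=10 is an assumption chosen here so the sketch also runs on
# scikit-learn versions that predate n_init='auto'.
kmeans_iris = KMeans(n_clusters=3, random_state=0, n_init=10)
labels_iris = kmeans_iris.fit_predict(X_iris)

# Adjusted Rand Index: 1.0 means the clusters match the species perfectly,
# values near 0 mean the agreement is no better than chance.
ari = adjusted_rand_score(y_iris, labels_iris)
print(f"Adjusted Rand Index: {ari:.3f}")
```

A score clearly above 0 but below 1 is the typical outcome here, since two of the three iris species overlap in feature space and K-Means cannot fully separate them.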

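As a closing recap, the five K-Means steps described earlier can be written out in plain NumPy. This is a minimal teaching sketch, not how `scikit-learn` implements the algorithm: the function name `kmeans_sketch`, its parameters, and the demo data are all hypothetical, and the initial centroids are drawn from the samples themselves rather than from arbitrary points in the data space.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, tol=1e-6, seed=0):
    """Plain NumPy illustration of the K-Means loop (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct samples as the initial centroids (a simplification)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: squared Euclidean distance from every sample to every centroid,
        # then assign each sample to its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned samples
        # (an empty cluster simply keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids have essentially stopped moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Tiny demo: two well-separated groups of 2-D points (synthetic data)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2)),
])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(centroids_demo)
```

The result depends on the random starting centroids, which is why library implementations typically offer smarter initialization and multiple restarts.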