
# Sklearn ML Models

Machine learning models are the core tools for automated data analysis, pattern recognition, and prediction. By task type, they generally fall into three groups:

* **Classification models:** Predict discrete categories
* **Regression models:** Predict continuous values
* **Clustering models:** Automatically group data

This chapter introduces these common models in detail and explains how to evaluate and tune them.

* * *

## 1. Classification Models

Classification is one of the most common problems in machine learning. The goal is to map input data to discrete category labels. Common classification models include:

* Logistic Regression
* K-Nearest Neighbors (KNN)
* Support Vector Machine (SVM)
* Decision Tree
* Random Forest

### Logistic Regression

Although logistic regression has "regression" in its name, it is essentially a **probabilistic classification model**, commonly used for binary classification. Its core idea is:

* First compute a linear weighted sum of the features
* Then map the result to a probability between 0 and 1 with the sigmoid function

The core formula of logistic regression is:

$$
P(y = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}
$$

scikit-learn implementation:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Assume X is the feature matrix, y is the label
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Predict
    y_pred = model.predict(X_test)

### K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is an instance-based learning method. To make a prediction, it computes the distance between the query sample and every sample in the training set, selects the K nearest neighbors, and predicts from their labels (by majority vote for classification).

Main parameters:

* **K**: The number of neighbors selected.
* **Distance metric**: Commonly Euclidean distance, but Manhattan distance, Minkowski distance, etc. can also be used.

scikit-learn implementation:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import train_test_split

    # Assume X is the feature matrix, y is the label
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_train, y_train)

    # Predict
    y_pred = model.predict(X_test)

### Support Vector Machine (SVM)

Support Vector Machine is a powerful classification model, especially suitable for high-dimensional data. The basic idea of SVM is to find a hyperplane that maximizes the margin between sample points of different classes. For data that is not linearly separable, SVM uses the kernel trick to implicitly map the data into a higher-dimensional space in which a separating hyperplane can be found.

Kernel functions:

* **Linear kernel**: Suitable for linearly separable data.
* **Gaussian Radial Basis Function (RBF)**: Suitable for non-linear data.
* **Polynomial kernel**: Suitable for data with polynomial relationships.

scikit-learn implementation:

    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    # Assume X is the feature matrix, y is the label
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = SVC(kernel='linear')  # Use linear kernel
    model.fit(X_train, y_train)

    # Predict
    y_pred = model.predict(X_test)

### Decision Tree & Random Forest

A decision tree is a tree-structured classification model that repeatedly splits the data until it is divided into different categories. A random forest builds multiple decision trees and determines the final prediction through voting (classification) or averaging (regression).

A **decision tree** selects the best feature for each split, with the selection criterion usually being **information gain** or **Gini impurity**.
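To make the two split criteria concrete, here is a minimal sketch that computes Gini impurity and entropy for a set of labels, and the information gain of one candidate split. The helper names (`gini`, `entropy`, `information_gain`) and the toy labels are illustrative assumptions, not part of scikit-learn; in `DecisionTreeClassifier` the criterion is selected with `criterion="gini"` or `criterion="entropy"`.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy labels (hypothetical): four samples of class 0, four of class 1
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]  # a split that separates the classes perfectly

print("Gini impurity of parent:", gini(parent))
print("Information gain of split:", information_gain(parent, left, right))
```

For this perfectly separating split, both children are pure, so the information gain equals the full entropy of the parent. A tree would prefer such a split over any split that leaves mixed children.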
A **random forest** reduces overfitting and improves accuracy by combining multiple decision trees. It introduces randomness (such as random feature selection and random subsets of the training data) to increase the diversity of the trees.

scikit-learn implementation:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Assume X is the feature matrix, y is the label
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Decision tree
    dt_model = DecisionTreeClassifier()
    dt_model.fit(X_train, y_train)

    # Random forest
    rf_model = RandomForestClassifier(n_estimators=100)
    rf_model.fit(X_train, y_train)

    # Predict
    dt_pred = dt_model.predict(X_test)
    rf_pred = rf_model.predict(X_test)

* * *

## 2. Regression Models

The goal of regression problems is to predict a continuous output variable. Common regression models include linear regression, ridge regression, and Lasso regression.

### Linear Regression

Linear regression predicts the target variable by fitting a straight line (or, with several features, a hyperplane). Its core assumption is that there is a linear relationship between the features and the target variable.

The model can be represented as:

$$
y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b
$$

* $y$ is the predicted value (target value).
* $x_1, x_2, \dots, x_n$ are the input features.
* $w_1, w_2, \dots, w_n$ are the weights to be learned (model parameters).
* $b$ is the bias term.
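The model described above (a weighted sum of the features plus a bias) can be checked against a fitted estimator: after training, `LinearRegression` exposes the learned weights as `coef_` and the bias as `intercept_`. The sketch below uses small synthetic data, which is an assumption made purely for illustration (the true weights 2.0 and -1.0 and the bias 0.5 are arbitrary), and reproduces `predict` by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): y = 2*x1 - 1*x2 + 0.5 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)

w = model.coef_        # learned weights w_1, w_2
b = model.intercept_   # learned bias b

# Apply the formula directly: y = w_1*x_1 + w_2*x_2 + b
manual_pred = X @ w + b

print("weights:", w)
print("bias:", b)
print("matches predict():", np.allclose(manual_pred, model.predict(X)))
```

The recovered weights and bias should land close to the values used to generate the data, with the gap depending on the noise level and sample size.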
**scikit-learn implementation:**

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Assume X is the feature matrix, y is the target variable
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)

    # Predict
    y_pred = model.predict(X_test)

### Ridge Regression

Ridge regression is a variant of linear regression that uses **L2 regularization** to constrain model complexity and avoid overfitting. By penalizing the magnitude of the regression coefficients, ridge regression handles multicollinearity better than plain linear regression.

**scikit-learn implementation:**

    from sklearn.linear_model import Ridge

    # Reuses X_train, X_test, y_train from the split above
    model = Ridge(alpha=1.0)  # alpha is the regularization parameter
    model.fit(X_train, y_train)

    # Predict
    y_pred = model.predict(X_test)

### Lasso Regression

Lasso regression is also a form of linear regression, but it uses **L1 regularization** to penalize the regression coefficients. Unlike ridge regression, Lasso can shrink some coefficients exactly to zero, thereby performing feature selection.

**scikit-learn implementation:**

    from sklearn.linear_model import Lasso

    # Reuses X_train, X_test, y_train from the split above
    model = Lasso(alpha=0.1)  # alpha is the regularization parameter
    model.fit(X_train, y_train)

    # Predict
    y_pred = model.predict(X_test)

* * *

## 3. Clustering Models

Clustering is an unsupervised learning method whose goal is to divide the objects in a dataset into groups (clusters), so that objects in the same cluster are as similar as possible while objects in different clusters are as different as possible.

### K-Means

K-means is a common clustering algorithm that divides data into K clusters. It optimizes the partition by minimizing the sum of squared distances between each data point and the center of its cluster.
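As a rough illustration of that objective, the sketch below runs the two alternating steps of the standard K-means iteration in NumPy: assign each point to its nearest center, then move each center to the mean of its points. The helper name `kmeans_step` and the synthetic blobs are assumptions made for this example; scikit-learn's `KMeans` is a more complete implementation, with smarter initialization among other things.

```python
import numpy as np

def kmeans_step(X, centers):
    """One K-means iteration: assign points to centers, then recompute centers."""
    # Squared Euclidean distance from every point to every center: shape (n_samples, k)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Objective for this assignment: sum of squared distances to the assigned center
    inertia = d2[np.arange(len(X)), labels].sum()
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = X[labels == k]
        if len(members) > 0:  # keep the old center if a cluster ends up empty
            new_centers[k] = members.mean(axis=0)
    return labels, new_centers, inertia

rng = np.random.default_rng(0)
# Three synthetic 2-D blobs (assumed data, 50 points each)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
# Initialize the centers with three distinct random data points
centers = X[rng.choice(len(X), size=3, replace=False)]

history = []
for _ in range(10):
    labels, centers, inertia = kmeans_step(X, centers)
    history.append(inertia)

print([round(float(v), 2) for v in history])
```

The recorded objective never increases from one iteration to the next, which is why the procedure converges. It may still settle in a local optimum that depends on the initial centers.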
**scikit-learn implementation:**

    from sklearn.cluster import KMeans

    # Assume X is the feature matrix
    model = KMeans(n_clusters=3)
    model.fit(X)

    # Get cluster labels
    labels = model.predict(X)

### DBSCAN (Density-Based Clustering)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It forms clusters by finding regions of relatively high density, so the number of clusters does not need to be specified in advance.

**scikit-learn implementation:**

    from sklearn.cluster import DBSCAN

    # Assume X is the feature matrix
    model = DBSCAN(eps=0.5, min_samples=5)
    model.fit(X)

    # Get cluster labels
    labels = model.labels_

### Hierarchical Clustering

Hierarchical clustering builds clusters by recursively merging or splitting them. Common hierarchical clustering methods include **agglomerative clustering** (bottom-up merging) and **divisive clustering** (top-down splitting).

**scikit-learn implementation:**

    from sklearn.cluster import AgglomerativeClustering

    # Assume X is the feature matrix
    model = AgglomerativeClustering(n_clusters=3)
    labels = model.fit_predict(X)

* * *

## 4. Model Evaluation and Selection

After a model has been trained, sound evaluation methods are needed to judge its generalization ability and to select the best-performing model.

Common evaluation methods include:

* Classification metric evaluation
* Cross validation
* Hyperparameter tuning

### Classification Evaluation Metrics

In classification problems, commonly used evaluation metrics include:

* **Accuracy**
* **Precision**
* **Recall**
* **F1 Score**

They reflect model performance from different angles.
**Explanation of each metric:**

* **Accuracy**: The proportion of correctly predicted samples among all samples. Suitable when the class distribution is relatively balanced.
* **Precision**: Among samples predicted as positive, the proportion that are actually positive. It focuses on "whether a positive prediction is reliable".
* **Recall**: Among all samples that are actually positive, the proportion correctly identified by the model. It focuses on "whether positive samples are missed".

#### F1 Score

The harmonic mean of Precision and Recall:

$$
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

It is used to balance precision against recall.

**scikit-learn implementation:**

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # y_test and y_pred come from a fitted classifier, as in the examples above.
    # The defaults assume binary labels; for multiclass labels pass e.g. average='macro'.
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    print(f"Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1 Score: {f1}")

### Hold-out Evaluation (Train/Test Split)

The most common and intuitive evaluation method is to divide the dataset into:

* Training set: used for model learning
* Test set: used for performance evaluation

This method is simple and efficient, and well suited to getting a quick first estimate of model performance.

### Cross Validation

Cross validation produces a more stable and reliable estimate of model performance by partitioning the dataset multiple times for training and testing.
The common practice is **K-fold cross validation**:

* The dataset is divided into K parts
* Each time, one part is used as the validation set
* The rest is used as the training set
* This is repeated K times and the results are averaged

Example:

    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    scores = cross_val_score(model, X, y, cv=5)

    print("Cross-validation scores:", scores)
    print("Mean score:", scores.mean())

In practice, a single evaluation usually relies on either the hold-out method or cross validation; the two do not need to be combined.

### Grid Search Tuning (GridSearchCV)

Model performance is often strongly affected by hyperparameter settings. Grid search finds the best configuration automatically by trying every combination of the listed parameter values. Combined with cross validation, the tuning results are more robust.

Example:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    parameters = {
        'kernel': ['linear', 'rbf'],
        'C': [1, 10, 100]
    }

    model = SVC()
    grid_search = GridSearchCV(model, parameters, cv=5)
    grid_search.fit(X, y)

    print("Best parameters:", grid_search.best_params_)
    print("Best score:", grid_search.best_score_)
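Beyond the single best configuration, it is often useful to look at how every combination scored. The following sketch runs the same grid on the Iris dataset (an assumption, chosen only so the example is self-contained; any `X`, `y` would do) and reads the per-combination results from `cv_results_`. It also retrieves the refitted best model through `best_estimator_`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Iris stands in for the X, y assumed elsewhere in this chapter
X, y = load_iris(return_X_y=True)

parameters = {"kernel": ["linear", "rbf"], "C": [1, 10, 100]}
grid_search = GridSearchCV(SVC(), parameters, cv=5)
grid_search.fit(X, y)

# One entry per parameter combination: 2 kernels x 3 values of C = 6 rows
results = grid_search.cv_results_
for params, mean, std in zip(results["params"],
                             results["mean_test_score"],
                             results["std_test_score"]):
    print(f"{params}: {mean:.3f} +/- {std:.3f}")

# With the default refit=True, the best configuration is retrained on all of X, y
best_model = grid_search.best_estimator_
print("Best parameters:", grid_search.best_params_)
print("Predictions for the first five samples:", best_model.predict(X[:5]))
```

Comparing the mean scores together with their standard deviations shows whether the best configuration is clearly better or only marginally ahead of its neighbors.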