Scikit-learn Basic Concepts
\nWhen using Scikit-learn for machine learning, it is essential to understand some fundamental concepts.
\nScikit-learn provides a unified and concise API to implement various machine learning algorithms and workflows, enabling us to quickly accomplish a wide range of machine learning tasks.
\nNext, we will elaborate on the following concepts: Data Representation, Model Types, Preprocessing Methods, Evaluation Metrics, Model Tuning, etc.
\n\n
1. Data Representation: Datasets and Features
\nDatasets are one of the most fundamental concepts in Scikit-learn.
\nThe core task of machine learning is to learn patterns from data, making the way data is represented crucial.
\nDatasets
\nIn Scikit-learn, data is typically represented through two main objects: the feature matrix and the target vector.
\nFeature Matrix: Each row represents a data sample, and each column represents a feature (i.e., an input variable). It is a two-dimensional array or matrix, typically stored using a NumPy array or a pandas DataFrame.
\nSuppose we have 3 samples, each with 2 features.
    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])

Target Vector: It represents the target (i.e., output label) for each sample, typically as a one-dimensional array.
\nFor example, in a classification task, the target is the class label for each sample.
\nThe corresponding target vector:
    y = np.array([0, 1, 0])  # class labels: class 0 and class 1

Features and Labels
- Features: These are the input variables used to train the model. In the example above, X is the feature matrix containing all input variables.
- Labels: These are the target outputs for the machine learning model. In supervised learning, labels represent the results we want the model to predict. In the example above, y is the label (target) vector containing the class of each sample.
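As a quick sanity check on this representation, the sketch below rebuilds the same X and y and inspects their shapes. The names n_samples and n_features are introduced here for illustration only.

```python
import numpy as np

# Same toy data as in the text: 3 samples, 2 features each
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
y = np.array([0, 1, 0])

# Rows are samples, columns are features
n_samples, n_features = X.shape
print(X.shape)  # (3, 2)
print(y.shape)  # (3,)

# The feature matrix and the target vector must describe the same samples
rows_match = X.shape[0] == y.shape[0]
print(rows_match)  # True
```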
Dataset Splitting
\nIn practical applications, datasets are usually split into training and testing sets.
\nScikit-learn provides a convenient function, train_test_split(), to accomplish this:
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

- The code above calls train_test_split and assigns the results to four variables: X_train, X_test, y_train, and y_test.
- X and y are the arguments passed to train_test_split, representing the feature matrix and the target vector (labels), respectively. Typically, X is a two-dimensional array and y is a one-dimensional array.
- test_size=0.3 specifies that the test set should be 30% of the original dataset. The remaining 70% is used for training.
- random_state=42 is a random seed that ensures the same split is produced every time the code runs. This is useful in experiments and model validation because it makes results reproducible.
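To see these parameters at work on a dataset larger than the three-sample toy example, here is a minimal sketch that uses the Iris dataset bundled with Scikit-learn (150 samples, 4 features). The choice of Iris and the names X_iris, y_iris, and same_split are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris: 150 samples, 4 features (chosen here only for illustration)
X_iris, y_iris = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)

# Splitting again with the same seed reproduces the same test set
_, X_test_again, _, _ = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)
same_split = np.array_equal(X_test, X_test_again)
print(same_split)  # True
```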
\n
2. Models and Algorithms
\nSupervised and Unsupervised Learning
\nIn Scikit-learn, machine learning models are broadly divided into two categories: supervised learning and unsupervised learning.
\nSupervised Learning: In supervised learning, models learn from labeled data during training, where these labels represent the outcomes we want the model to predict.
\nCommon supervised learning tasks include classification and regression.
- Classification: Assigns data points to predefined categories. For example, determining whether an email is spam or not.
- Regression: Predicts continuous output values. For example, predicting house prices or temperature.
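For the regression case, a minimal sketch might look like the following. It fits LinearRegression to a small synthetic dataset that follows y = 2x + 1 exactly; the data and the names X_reg, y_reg, and reg are made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2x + 1 (illustrative values only)
X_reg = np.array([[1.0], [2.0], [3.0], [4.0]])
y_reg = np.array([3.0, 5.0, 7.0, 9.0])

reg = LinearRegression()
reg.fit(X_reg, y_reg)

# The fitted slope and intercept should be close to 2 and 1
print("Slope:", reg.coef_[0])
print("Intercept:", reg.intercept_)

# Predict a continuous value for an unseen input (close to 11 for x = 5)
prediction = reg.predict(np.array([[5.0]]))
print("Prediction for x=5:", prediction[0])
```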
Using a decision tree for a classification task:
    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)      # learn from the labeled training data
    y_pred = clf.predict(X_test)   # predict class labels for the test samples

Unsupervised Learning: Unsupervised learning refers to scenarios without labeled data, where models learn solely from the features of the input data.
\nCommon unsupervised learning tasks include clustering and dimensionality reduction.
- Clustering: Groups data such that items within the same group share similarities. Common clustering algorithms include K-Means and DBSCAN.
- Dimensionality Reduction: Reduces the number of features in the data, commonly used for data compression or visualization. Common methods include PCA (Principal Component Analysis) and t-SNE (t-Distributed Stochastic Neighbor Embedding).
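As an illustration of dimensionality reduction, the sketch below applies PCA to the Iris dataset and projects its 4 features onto 2 components. The dataset and the names X_iris and X_reduced are assumptions chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Iris has 4 features per sample (used here only for illustration)
X_iris, _ = load_iris(return_X_y=True)

# Keep the 2 directions that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_iris)

print(X_iris.shape)     # (150, 4)
print(X_reduced.shape)  # (150, 2)

# Fraction of the total variance retained by each kept component
print(pca.explained_variance_ratio_)
```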
Using K-Means clustering:
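    from sklearn.cluster import KMeans

    # X_train must contain at least n_clusters samples
    kmeans = KMeans(n_clusters=3)
    kmeans.fit(X_train)                       # no labels are passed
    cluster_labels = kmeans.predict(X_test)   # cluster index for each test sample

Preprocessing and Feature Engineering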
\nBefore using Scikit-learn for machine learning, data preprocessing is usually required, which includes the following common tasks:
1. Standardization: Unifies the scale of features so that each feature has a mean of zero and a variance of one.

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

2. Normalization: Scales feature values to a fixed range (typically between 0 and 1).

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    X_normalized = scaler.fit_transform(X)

3. Categorical Variable Encoding: Converts categorical data into numerical data (e.g., one-hot encoding).
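One-hot encoding can be sketched with OneHotEncoder from sklearn.preprocessing. The color values and the names colors and encoded below are made up for this illustration. The encoder returns a sparse matrix by default, so the sketch converts it to a dense array for printing.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical feature with three distinct values (illustrative data)
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()  # sparse -> dense

# Categories are sorted, so the columns are: blue, green, red
print(encoder.categories_[0])
print(encoded)
# "red" is encoded as [0, 0, 1], "green" as [0, 1, 0], "blue" as [1, 0, 0]
```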
\n\n
3. Model Evaluation and Validation
\nAfter training a machine learning model, its performance must be evaluated to ensure its generalization ability.
\nScikit-learn provides several tools for evaluating model performance.
\nCross-Validation
\nCross-validation is a common model evaluation method, especially when data is limited.
The data is split into several subsets (folds). In each iteration, one fold serves as the validation set and the remaining folds form the training set, so the model is trained and evaluated once per fold. The final result is the average of the per-fold scores.
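To make the procedure concrete, the loop below writes it out by hand with KFold on the Iris dataset. This is a sketch under stated assumptions rather than what cross_val_score does internally: it uses plain shuffled KFold, whereas cross_val_score uses stratified folds by default for classifiers. The dataset and the names X_iris, y_iris, fold_scores, and mean_score are introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# 5 folds; shuffle because the Iris samples are ordered by class
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kfold.split(X_iris):
    # Train a fresh model on 4 folds, evaluate on the held-out fold
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_iris[train_idx], y_iris[train_idx])
    fold_scores.append(model.score(X_iris[val_idx], y_iris[val_idx]))

# Average accuracy across the 5 folds
mean_score = float(np.mean(fold_scores))
print("Per-fold accuracy:", fold_scores)
print("Mean accuracy:", mean_score)
```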
    from sklearn.model_selection import cross_val_score

    # 5-fold cross-validation (needs more data than the 3-sample toy X above)
    scores = cross_val_score(clf, X, y, cv=5)
    print("Cross-validation scores:", scores)

Common Evaluation Metrics
\nEvaluation metrics for classification tasks:
- Accuracy: The proportion of correctly predicted samples out of all samples.
- Precision: The proportion of samples predicted as positive that are actually positive.
- Recall: The proportion of actual positives that were correctly predicted as positive.
- F1 Score: The harmonic mean of precision and recall.
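A small worked example may help tie these definitions to numbers. The labels below are made up for illustration: they give 3 true positives, 1 false positive, 1 false negative, and 1 true negative.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative labels: TP=3, FP=1, FN=1, TN=1
y_true = [1, 0, 1, 1, 0, 1]
y_hat = [1, 0, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_hat)    # (TP + TN) / total = 4 / 6
precision = precision_score(y_true, y_hat)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_hat)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_hat)                # harmonic mean of 0.75 and 0.75

print(accuracy, precision, recall, f1)
```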
    from sklearn.metrics import accuracy_score, classification_report

    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))

Evaluation metrics for regression tasks:
- Mean Squared Error (MSE): The average of the squared differences between predicted and actual values.
- Coefficient of Determination (R²): Measures the model's ability to explain the variance in the data.
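Both regression metrics can be checked by hand on a few numbers. In the sketch below the values are made up for illustration; MSE is computed as the mean of squared residuals and R² as 1 minus the ratio of the residual sum of squares to the total sum of squares, then compared with Scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# MSE by hand: mean of squared residuals = (0.25 + 0.25 + 0 + 1) / 4
mse_manual = np.mean((y_true - y_hat) ** 2)

# R² by hand: 1 - SS_res / SS_tot = 1 - 1.5 / 20
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))  # both 0.375
print(r2_manual, r2_score(y_true, y_hat))             # both about 0.925
```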
    from sklearn.metrics import mean_squared_error, r2_score

    # y_test and y_pred must be continuous targets and predictions from a regression model
    print("MSE:", mean_squared_error(y_test, y_pred))
    print("R²:", r2_score(y_test, y_pred))

4. Model Selection and Tuning
\nGrid Search
Grid Search is a commonly used hyperparameter tuning method. It evaluates every combination of the candidate values listed in a parameter grid, scores each combination with cross-validation, and keeps the best one.
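The cost of an exhaustive search grows with the product of the number of candidate values per hyperparameter. The sketch below counts the combinations for a grid with two hyperparameters of three values each, using ParameterGrid from sklearn.model_selection. The names combinations, n_candidates, and n_fits are introduced for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Two hyperparameters with three candidate values each
param_grid = {"max_depth": [3, 5, 7], "min_samples_split": [2, 5, 10]}

combinations = list(ParameterGrid(param_grid))
n_candidates = len(combinations)  # 3 * 3
print(n_candidates)  # 9

# With 5-fold cross-validation, each candidate is fitted 5 times
n_fits = n_candidates * 5
print(n_fits)  # 45

# Each entry is one concrete setting, e.g. a max_depth and a min_samples_split
print(combinations[0])
```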
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    param_grid = {'max_depth': [3, 5, 7], 'min_samples_split': [2, 5, 10]}
    grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    print("Best parameters:", grid_search.best_params_)

Random Search
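Random Search samples a fixed number of hyperparameter combinations (n_iter) at random from the given lists or distributions. Because the number of fits is set by n_iter rather than by the size of the grid, it is often cheaper than grid search when the search space is large.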
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.tree import DecisionTreeClassifier
    from scipy.stats import randint

    # randint(2, 10) draws integers from 2 to 9 (the upper bound is exclusive)
    param_dist = {'max_depth': [3, 5, 7], 'min_samples_split': randint(2, 10)}
    random_search = RandomizedSearchCV(DecisionTreeClassifier(), param_dist, n_iter=10, cv=5)
    random_search.fit(X_train, y_train)
    print("Best parameters:", random_search.best_params_)
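Putting the pieces together, the following sketch runs a random search end to end on the Iris dataset and then scores the tuned model on the held-out test set. The dataset, the fixed random_state values, and the names search and test_accuracy are assumptions made for this illustration.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

param_dist = {"max_depth": [3, 5, 7], "min_samples_split": randint(2, 10)}

# random_state makes the sampled combinations repeatable
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_dist,
    n_iter=10,
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy:", search.best_score_)

# By default the best model is refitted on all training data,
# so it can be scored directly on the untouched test set
test_accuracy = search.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```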