# Sklearn Data Preprocessing
Data preprocessing is a critical step in machine learning projects: it cleans and transforms raw data into suitable model input, and it directly affects both training and the final performance of the model.
Data preprocessing involves multiple steps, including handling missing values, data transformation, standardization, encoding, and more.
Appropriate preprocessing can not only improve model accuracy but also help the model generalize better.
* * *
## 1. Handling Missing Values
Missing values refer to the absence of values for certain features in a dataset.
Machine learning algorithms usually cannot directly handle missing values, so we need to process them.
### Checking Missing Values
First, check if there are missing values in the dataset.
You can usually use pandas to view missing values in the dataset:
```python
import pandas as pd

# Assume we have a DataFrame df
print(df.isnull().sum())  # Number of missing values in each column
```
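As a minimal sketch of the check above, the following builds a small DataFrame and counts its missing values. The column names and values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "city": ["Paris", "Lima", None, "Oslo"],
    "income": [52000.0, 61000.0, np.nan, np.nan],
})

missing_counts = df.isnull().sum()   # number of missing values per column
missing_share = df.isnull().mean()   # fraction of rows that are missing per column

print(missing_counts)
print(missing_share)
```

Looking at the share of missing values, not just the count, can help decide between filling and deleting.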
### Filling Missing Values
The most common method for handling missing values is filling.
Common filling strategies include:
* **Mean Filling**: Suitable for numerical data.
* **Median Filling**: Suitable for numerical data with outliers, because the median is less affected by extreme values than the mean.
* **Mode Filling**: Suitable for categorical data.
In scikit-learn, SimpleImputer can easily implement missing value filling:
```python
from sklearn.impute import SimpleImputer

# For numerical data, use mean filling
imputer = SimpleImputer(strategy='mean')  # Options include 'mean', 'median', 'most_frequent'
df_imputed = imputer.fit_transform(df)  # Fill missing values; returns a NumPy array, not a DataFrame
```
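The three strategies listed above can be compared on toy data. This is a sketch under assumed inputs: the arrays and their values are hypothetical, and each column is imputed separately so that every strategy is applied to a suitable data type.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column arrays, each with one missing value
heights = np.array([[1.0], [2.0], [np.nan], [3.0]])
salaries = np.array([[10.0], [12.0], [1000.0], [np.nan]])  # contains an outlier
colors = np.array([["red"], ["blue"], ["red"], [np.nan]], dtype=object)

heights_filled = SimpleImputer(strategy="mean").fit_transform(heights)
salaries_filled = SimpleImputer(strategy="median").fit_transform(salaries)
colors_filled = SimpleImputer(strategy="most_frequent").fit_transform(colors)

print(heights_filled.ravel())   # missing height replaced by the mean of 1, 2, 3
print(salaries_filled.ravel())  # missing salary replaced by the median of 10, 12, 1000
print(colors_filled.ravel())    # missing color replaced by the most frequent value
```

Here the median keeps the filled salary close to the typical values, whereas the mean would be pulled toward the outlier.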
### Deleting Missing Values
If the number of missing values is small and removing them won't significantly affect the analysis results, another option is to delete the rows (or columns) that contain missing values.
```python
df_cleaned = df.dropna()  # Delete rows containing missing values
```
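A short sketch of a few dropna() variants may help here. The DataFrame is hypothetical; subset and axis are standard pandas parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 6.0, 7.0],
    "c": [8.0, 9.0, 10.0, 11.0],
})

rows_any = df.dropna()                 # drop rows with at least one missing value
rows_subset = df.dropna(subset=["a"])  # only look at column "a" when dropping rows
cols_any = df.dropna(axis=1)           # drop columns that contain missing values

print(len(rows_any), len(rows_subset), list(cols_any.columns))
```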
* * *
## 2. Data Scaling
Many machine learning algorithms are sensitive to the scale of the input features, so we often rescale the data so that all features are on a comparable scale.
Common scaling methods include:
* **Standardization**: Transforms data to have a mean of 0 and standard deviation of 1. Suitable for most machine learning algorithms.
* **Normalization**: Scales data to a specified range (usually [0, 1]).
### Standardization
Standardization can be achieved through StandardScaler, which transforms each feature to zero mean and unit variance:
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Standardize X
```
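To make the transformation concrete, the sketch below checks StandardScaler against the formula (x - mean) / std on a small, hypothetical matrix. It also reuses the fitted scaler on a new sample, which is how the same statistics would be applied to data not seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with two features on very different scales
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Manual computation; StandardScaler uses the population standard deviation (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# A new sample is scaled with the mean and std learned from X
X_new = np.array([[2.5, 250.0]])
X_new_scaled = scaler.transform(X_new)

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
print(X_new_scaled)
```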
### Normalization
Normalization scales each feature to a specified range (usually [0, 1]).
MinMaxScaler is used to normalize data:
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)  # Normalize X
```
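The default range is [0, 1], and a different range can be requested through the feature_range parameter. The following sketch uses hypothetical data and compares the result with the formula (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix
X = np.array([[1.0, 10.0], [2.0, 20.0], [5.0, 50.0]])

X_normalized = MinMaxScaler().fit_transform(X)

# Manual computation of min-max scaling, column by column
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Same idea, but mapped to the range [-1, 1]
X_symmetric = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

print(X_normalized)
print(X_symmetric)
```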
### Why Do We Need Standardization and Normalization?
* **Standardization**: Very important for algorithms that rely on distances or dot products between samples (such as K-Nearest Neighbors and Support Vector Machines), because features with larger numeric ranges would otherwise dominate the result. Standardization puts all features on a comparable scale.
* **Normalization**: Models trained with gradient descent (such as neural networks) are sensitive to the range of the input data, and normalization helps accelerate convergence.
* * *
## 3. Categorical Variable Encoding
Machine learning models usually cannot directly process string-type categorical variables, so categorical variables need to be converted to numerical data.
Common encoding methods include:
### Label Encoding
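Label encoding maps each category to a unique integer.
In scikit-learn, LabelEncoder is intended for the target variable y and assigns integers in sorted order, so it does not preserve a custom order such as low, medium, high. For ordinal features, OrdinalEncoder with an explicit category order is the usual choice.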
```python
from sklearn.preprocessing import LabelEncoder

# Assume we have a categorical variable y
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)  # Convert categorical variable to integers
```
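One detail worth seeing in code: LabelEncoder assigns integers by sorting the class names, which does not match a meaningful order like low < medium < high. The sketch below uses hypothetical data to contrast it with OrdinalEncoder, where the order is passed explicitly through the categories parameter.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature
levels = np.array(["low", "high", "medium", "low"])

# LabelEncoder sorts the classes: high=0, low=1, medium=2
label_codes = LabelEncoder().fit_transform(levels)

# OrdinalEncoder with an explicit order: low=0, medium=1, high=2
ordinal_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordinal_codes = ordinal_encoder.fit_transform(levels.reshape(-1, 1)).ravel()

print(label_codes)
print(ordinal_codes)
```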
### One-Hot Encoding
One-hot encoding converts each category to a binary vector, suitable for cases where there is no ordinal relationship between categories (e.g., colors, countries, etc.).
OneHotEncoder can convert categorical variables to one-hot encoding.
```python
from sklearn.preprocessing import OneHotEncoder

# Assume we have a categorical variable X (2D, one column per feature)
# sparse_output=False returns a dense matrix (scikit-learn >= 1.2; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X)  # Convert categorical variable to one-hot encoding
```
In pandas, you can also use the get_dummies() function for one-hot encoding:
```python
X_encoded = pd.get_dummies(X)
```
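A small worked example can show what the encoded matrix looks like. The data below is hypothetical. The sketch leaves the output sparse and converts it with toarray(), which avoids depending on the name of the dense-output parameter, and it uses handle_unknown='ignore' so that a category not seen during fitting is encoded as all zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature, one column
colors = np.array([["red"], ["green"], ["blue"], ["green"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(colors).toarray()  # sparse matrix -> dense array

# Columns follow the sorted categories: blue, green, red
categories = list(encoder.categories_[0])

# A category that was not present during fit becomes a row of zeros
unseen = encoder.transform(np.array([["purple"]], dtype=object)).toarray()

print(categories)
print(encoded)
print(unseen)
```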
* * *
## 4. Feature Selection
Feature selection can improve model performance and reduce computational cost by keeping only the most informative features.
Common feature selection methods include:
### Model-Based Feature Selection
Some machine learning models (such as Decision Trees or Random Forests) expose feature importance scores, which can be used to decide which features to keep.
```python
from sklearn.ensemble import RandomForestClassifier

# Train a random forest model
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Get feature importance
importances = clf.feature_importances_
print(importances)
```
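The importance scores alone do not select anything, so here is a sketch of one way to go from scores to a reduced feature matrix. It assumes synthetic data from make_classification and uses SelectFromModel with a median threshold; both the data and the threshold are illustrative choices, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 8 features, of which 3 are informative (an assumption for the demo)
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

importances = clf.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
for idx in ranking:
    print(f"feature_{idx}: {importances[idx]:.3f}")

# Keep the features whose importance is at least the median importance
selector = SelectFromModel(clf, threshold="median", prefit=True)
X_selected = selector.transform(X)
print(X_selected.shape)
```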
### Recursive Feature Elimination (RFE)
RFE repeatedly fits a model, removes the least important features, and refits on the remaining ones until the requested number of features is left.
```python
from sklearn.feature_selection import RFE

# Use the random forest clf from the previous example as the estimator
rfe = RFE(clf, n_features_to_select=3)  # Keep the 3 most important features
X_rfe = rfe.fit_transform(X_train, y_train)
```
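RFE works with any estimator that exposes coefficients or feature importances. As a self-contained sketch, the following uses a logistic regression on synthetic data (both are assumptions for the demo) and inspects which features were kept through the support_ and ranking_ attributes.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 are informative (an assumption for the demo)
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=3, step=1)  # remove one feature per round
X_rfe = rfe.fit_transform(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # selected features have rank 1
print(X_rfe.shape)
```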
* * *
## 5. Feature Engineering
Feature engineering improves model performance by processing, combining, or constructing new features from existing features.
Common methods of feature engineering include:
* **Feature Combination**: Combine two or more features into a new feature.
* **Feature Transformation**: Apply log transformation, square root transformation, etc. to features, for example to reduce skew or to make a non-linear relationship closer to linear.
* **Feature Creation**: Create new features based on existing data, such as extracting date, month, day of week from timestamps.
```python
# For example, combine two numerical features into a new feature
df['new_feature'] = df['feature1'] * df['feature2']
```
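The three methods listed above can be sketched together on a small DataFrame. All column names and values here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "quantity": [1, 2, 3],
    "income": [0.0, 99.0, 9999.0],
    "timestamp": pd.to_datetime(["2024-01-15", "2024-02-29", "2024-12-31"]),
})

# Feature combination: multiply two features
df["revenue"] = df["price"] * df["quantity"]

# Feature transformation: log(1 + x) compresses large values and handles zeros
df["log_income"] = np.log1p(df["income"])

# Feature creation: extract calendar parts from a timestamp
df["month"] = df["timestamp"].dt.month
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday is 0

print(df[["revenue", "log_income", "month", "day_of_week"]])
```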
* * *
## 6. Feature Extraction
Feature extraction aims to extract new, more expressive features from original features.
Common feature extraction methods include: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).
### Principal Component Analysis (PCA)
PCA is a commonly used dimensionality reduction technique that maps data from high-dimensional space to low-dimensional space through linear transformation, so that new features (principal components) retain as much variance in the data as possible.
PCA is particularly suitable for cases with too many features, and can effectively reduce computational complexity.
```python
from sklearn.decomposition import PCA

# Assume X is the feature matrix
pca = PCA(n_components=2)  # Reduce to 2 principal components
X_pca = pca.fit_transform(X)
```
PCA is mainly used in two scenarios:
* **Dimensionality Reduction**: When there are too many features, using PCA for dimensionality reduction can reduce computational cost while retaining the main information in the data.
* **Visualization**: Map high-dimensional data to 2D or 3D space to help us visualize data structure.
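To judge how much information a reduction keeps, the explained_variance_ratio_ attribute reports the share of variance captured by each component. The sketch below assumes the iris dataset bundled with scikit-learn and standardizes the features first, since PCA is sensitive to feature scale.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Bundled example dataset: 150 samples, 4 features
X, y = load_iris(return_X_y=True)

# Standardize so that no feature dominates because of its units
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

retained = pca.explained_variance_ratio_.sum()  # share of total variance kept
print(X_pca.shape)
print(pca.explained_variance_ratio_)
print(retained)
```

The two columns of X_pca can be passed directly to a scatter plot for visualization.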
### Linear Discriminant Analysis (LDA)
LDA is a supervised dimensionality reduction method that looks for linear combinations of features that maximize the separation between classes while minimizing the spread within each class.
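A minimal sketch of LDA as a dimensionality reduction step, assuming the iris dataset bundled with scikit-learn. Unlike PCA, fitting requires the class labels y, and the number of components can be at most the number of classes minus one.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Bundled example dataset: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# With 3 classes, at most 2 discriminant components are available
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)  # labels are required because LDA is supervised

print(X_lda.shape)
```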
LDA is typically used in