
Ml Feature Engineering

## Feature Engineering

Imagine you are a chef preparing a delicious dish. The machine learning model is your "cooking algorithm," and the raw data is the ingredients you bought from the market: vegetables, meat, and seasonings. Some are covered in mud, some come in whole chunks, and some have very strong flavors.

**Feature engineering** is the process of cleaning, cutting, marinating, and combining these raw "ingredients" into "semi-finished products" that are ready to cook. It is the bridge between raw data and machine learning models, and the key step that determines the upper limit of model performance.

Simply put, **feature engineering** is the process of using domain knowledge and a series of technical methods to extract, construct, and select features (variables) from raw data that are more valuable and easier for machine learning models to learn from.

* * *

## 1. Why is Feature Engineering So Important?

In machine learning projects, the quality of data and features determines the upper limit of model performance, while models and algorithms only approach this limit.

Excellent feature engineering can:

1. **Improve model performance**: Good features make it easier for models to discover patterns in data.
2. **Accelerate model training**: Removing irrelevant or redundant features lowers computational cost.
3. **Enhance model generalization ability**: Fewer noisy features reduce the risk of overfitting to noise in the training data.
4. **Meet model requirements**: Different models make different assumptions about data (e.g., linear models assume linear relationships), and feature engineering can bring the data closer to these assumptions.

The following flowchart shows where feature engineering sits in the entire machine learning process:

![Flowchart: where feature engineering sits in the machine learning process](#)

* * *

## 2. Core Operations of Feature Engineering

Feature engineering mainly includes three types of operations: **feature processing**, **feature construction**, and **feature selection**.
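As a rough preview of how the three types of operations can fit together, here is a minimal sketch using a scikit-learn `Pipeline`. The synthetic data, the step names, the polynomial degree, and the choice of `k=5` are all assumptions made for illustration, not recommendations from this article.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (made up for illustration): 100 samples, 4 numerical features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # target computed before values go missing
X[rng.random(X.shape) < 0.05] = np.nan   # inject roughly 5% missing values

pipeline = Pipeline([
    # Feature processing: fill missing values, then put features on one scale
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    # Feature construction: add squares and pairwise products of the features
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    # Feature selection: keep the 5 features with the highest ANOVA F-score
    ("select", SelectKBest(score_func=f_classif, k=5)),
])

X_new = pipeline.fit_transform(X, y)
print("Shape before:", X.shape)
print("Shape after: ", X_new.shape)
```

Wrapping the steps in a pipeline means the same fitted transformations can later be applied to new data with `pipeline.transform`, which is one common way to keep training and test data consistent.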
### 1. Feature Processing

This is the most basic step, aiming to turn raw data into a clean, standardized format.

#### a) Handling Missing Values

Data often contains missing values (such as `NaN` or `NULL`), which need to be handled properly.

| Method | Description | Applicable Scenario |
| --- | --- | --- |
| **Deletion** | Directly delete rows or columns containing missing values | When missing data is very rare or the feature is unimportant |
| **Imputation** | Fill with a certain value, such as mean, median, mode, or a special value (e.g., -1) | Most commonly used method, applicable to various situations |
| **Interpolation** | Use time series or adjacent data points for interpolation calculation | Time series data |

**Code example (using Python's pandas library):**

```python
import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {
    'Age': [25, np.nan, 30, 35, np.nan],
    'Salary': [50000, 54000, np.nan, 62000, 58000],
    'City': ['Beijing', 'Shanghai', 'Guangzhou', np.nan, 'Beijing'],
}
df = pd.DataFrame(data)
print("Original data:")
print(df)

# 1. Delete missing values (drop any row containing NaN)
df_dropped = df.dropna()
print("\nAfter deleting missing values:")
print(df_dropped)

# 2. Fill missing values
df_filled = df.copy()
# Fill numerical columns with the mean (assign back instead of using inplace=True)
df_filled['Age'] = df_filled['Age'].fillna(df_filled['Age'].mean())
df_filled['Salary'] = df_filled['Salary'].fillna(df_filled['Salary'].mean())
# Fill categorical columns with the mode; mode() returns a Series, so take its first value
df_filled['City'] = df_filled['City'].fillna(df_filled['City'].mode()[0])
print("\nAfter filling missing values:")
print(df_filled)
```

#### b) Handling Outliers

Outliers are values that differ significantly from most of the data and may interfere with models. Common detection methods include:

* **Standard Deviation Method**: Values outside the range of mean ± 3 standard deviations are considered outliers.
* **Box Plot Method**: Values less than `Q1 - 1.5*IQR` or greater than `Q3 + 1.5*IQR` are considered outliers (`IQR = Q3 - Q1`).

Treatment methods include deletion, replacement with boundary values, or treating outliers as missing values.

#### c) Data Standardization/Normalization

Many models (such as SVM, KNN, and neural networks) are sensitive to feature scales, so features on different scales need to be transformed to a common scale.

| Method | Formula | Description | Applicable Scenario |
| --- | --- | --- | --- |
| **Standardization** | `(x - mean) / standard deviation` | After processing, data has mean 0 and standard deviation 1 | When data distribution is approximately normal |
| **Normalization** | `(x - min) / (max - min)` | Scale data to [0, 1] interval | When data boundaries are clear and fast computation is needed |

**Code example (using scikit-learn library):**

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

# Sample data: two features on very different scales
data = np.array([[1000, 25],
                 [1500, 30],
                 [800, 20],
                 [1200, 28]])

# Standardization: each column gets mean 0 and standard deviation 1
scaler_standard = StandardScaler()
data_standardized = scaler_standard.fit_transform(data)
print("Standardized data (mean~0, standard deviation~1):")
print(data_standardized)
print(f"Mean: {data_standardized.mean(axis=0)}")
print(f"Standard deviation: {data_standardized.std(axis=0)}")

# Normalization: each column is rescaled to the [0, 1] interval
scaler_minmax = MinMaxScaler()
data_normalized = scaler_minmax.fit_transform(data)
print("\nNormalized data (range [0,1]):")
print(data_normalized)
```

### 2. Feature Construction

Create new, more predictive features by combining or transforming existing features.

#### a) Transforming Numerical Features

* **Polynomial Features**: Create squares, cubes, etc. of features to help linear models learn non-linear relationships.
* **Binning**: Divide a continuous feature such as age into intervals such as "youth," "middle-aged," and "elderly" to discretize continuous data.
* **Mathematical Transformation**: Use logarithmic, exponential, etc.
transformations to change the data distribution.

#### b) Encoding Categorical Features

Most machine learning models cannot directly process text values such as "Beijing" or "Shanghai"; they need to be converted to numbers.

| Method | Description | Characteristics |
| --- | --- | --- |
| **Label Encoding** | Assign a unique integer to each category, e.g., `Beijing:0, Shanghai:1` | Simple, but may introduce false ordinal relationships (models may think 1>0) |
| **One-Hot Encoding** | Create a new binary feature (0 or 1) for each category | Eliminates ordinal misunderstanding, but if there are many categories, it causes feature dimension explosion |

**Code example:**

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Sample data
df_cat = pd.DataFrame({'City': ['Beijing', 'Shanghai', 'Guangzhou', 'Beijing', 'Shenzhen']})

# Label encoding (LabelEncoder is designed for target labels; used on a feature here for illustration)
le = LabelEncoder()
df_cat['City_Label_Encoded'] = le.fit_transform(df_cat['City'])
print("Label encoding result:")
print(df_cat)

# One-hot encoding
# Method 1: Using pandas get_dummies
df_onehot_pd = pd.get_dummies(df_cat['City'], prefix='City')
print("\nOne-hot encoding using pandas:")
print(df_onehot_pd)

# Method 2: Using sklearn OneHotEncoder (more commonly used in pipelines)
# sparse_output=False returns a dense array instead of a sparse matrix (scikit-learn >= 1.2)
ohe = OneHotEncoder(sparse_output=False)
encoded_array = ohe.fit_transform(df_cat[['City']])  # Note: the input must be 2D
print("\nOne-hot encoding using sklearn (array form):")
print(encoded_array)
print("New feature names:", ohe.get_feature_names_out())
```

### 3. Feature Selection

Select the most important subset from all features to reduce dimensionality and overfitting risk.

| Method | Description |
| --- | --- |
| **Filter Method** | Rank and filter features using statistical measures, such as variance or the association between each feature and the target variable (chi-square test, mutual information). Independent of any model. |
| **Wrapper Method** | Treat feature selection as a search problem, using model performance as the evaluation criterion (such as Recursive Feature Elimination, RFE). Good results but high computational cost. |
| **Embedded Method** | Perform feature selection automatically during model training (such as L1-regularized LASSO regression or tree model feature importance). |

**Code example (selection based on feature importance):**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train a random forest model, which calculates feature importance
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Get feature importance
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': importances
}).sort_values('Importance', ascending=False)

print("Feature importance ranking:")
print(feature_importance_df.head(10))  # View top 10 most important features

# Visualization
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'][:10], feature_importance_df['Importance'][:10])
plt.xlabel('Feature Importance')
plt.title('Top 10 Feature Importance')
plt.gca().invert_yaxis()  # Put most important at top
plt.show()

# Assume we select features with importance greater than 0.03
selected_features = feature_importance_df[feature_importance_df['Importance'] > 0.03]['Feature'].tolist()
print(f"\nSelected features: {selected_features}")
```

* * *

## 3. Practical Exercise: Hands-on Processing of a Simple Dataset

**Task**: Perform basic feature engineering on the famous Titanic passenger dataset to prepare for predicting passenger survival.

**Step hints**:

1. Load data (can use `load_dataset('titanic')` from the `seaborn` library).
2. Observe data: check feature types and missing values.
3. Feature processing:
    * Handle missing values (e.g., fill `age` with the median, fill `embarked` with the mode).
    * Perform **label encoding** or **one-hot encoding** on the `sex` (gender) feature.
    * Perform **binning** on `age` to create new age group features.
4. Feature construction:
    * Combine `sibsp` (number of siblings/spouses) and `parch` (number of parents/children) to construct a new `family_size` (family size) feature.
5. Feature selection:
    * Delete features you consider obviously irrelevant (such as `passenger_id`, `name`, `ticket`, if your version of the dataset contains them).
    * Try calculating the correlation between numerical features and the survival target, and filter features accordingly.

Through this exercise, you will see first-hand how raw data becomes more "friendly" to machine learning models through step-by-step feature engineering.

## Summary

Feature engineering is a craft in machine learning that combines **art** (domain knowledge, experience, intuition) and **science** (statistical methods, algorithms). It has no fixed rules and requires repeated trial and iteration based on the specific data, problem, and model.

For beginners, mastering the basic methods introduced in this article and practicing them diligently lays a solid foundation for building effective machine learning models. Remember, **excellent models often stem from excellent features**.
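As one possible starting point for the practical exercise in section 3, the sketch below walks through steps 3 to 5 on a tiny hand-made DataFrame. The rows are invented, and only the column names are borrowed from seaborn's `titanic` dataset, so that the code runs without downloading anything. The bin edges, the `sex_encoded` and `age_group` column names, and the `+ 1` in `family_size` are assumptions for illustration, not part of the exercise statement.

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows; swap in seaborn's load_dataset('titanic') for the real exercise
df = pd.DataFrame({
    'survived': [0, 1, 1, 0, 1, 0, 0, 1],
    'sex': ['male', 'female', 'female', 'male', 'female', 'male', 'male', 'female'],
    'age': [22.0, 38.0, np.nan, 35.0, 4.0, np.nan, 54.0, 27.0],
    'sibsp': [1, 1, 0, 0, 1, 0, 0, 0],
    'parch': [0, 0, 0, 0, 1, 0, 0, 2],
    'fare': [7.25, 71.28, 7.92, 8.05, 16.70, 8.46, 51.86, 11.13],
    'embarked': ['S', 'C', 'S', 'S', None, 'Q', 'S', 'S'],
})

# Step 3: feature processing
df['age'] = df['age'].fillna(df['age'].median())                  # median for age
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])  # mode for embarked
df['sex_encoded'] = (df['sex'] == 'female').astype(int)           # simple 0/1 encoding
# Bin edges are an arbitrary choice for illustration
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 60, 120],
                         labels=['youth', 'middle-aged', 'elderly'])

# Step 4: feature construction (the + 1 counts the passenger; this is an assumption)
df['family_size'] = df['sibsp'] + df['parch'] + 1

# Step 5: correlation of each numerical feature with the target, strongest first
numeric_cols = ['age', 'sibsp', 'parch', 'fare', 'sex_encoded', 'family_size']
correlations = df[numeric_cols].corrwith(df['survived']).sort_values(key=abs, ascending=False)
print(df[['age', 'embarked', 'sex_encoded', 'age_group', 'family_size']])
print("\nCorrelation with survived:")
print(correlations)
```

With only eight invented rows, the correlation values themselves mean nothing; the point is the shape of the workflow. On the real dataset, it is also worth checking for columns that simply restate the target before computing correlations.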