## Hyperparameter Search
In machine learning practice, a familiar frustration is: why do others reach 95% accuracy with the same algorithm while mine only reaches 85%? Beyond differences in data quality and feature engineering, a key factor is often the **hyperparameter** settings.
If the model algorithm is the car's engine, then hyperparameters are the fine-tuning knobs like ignition timing and fuel injection amount. Tune them well, and the engine runs powerfully; tune them poorly, and it may underperform or suffer excessive wear.
This article will systematically guide you through hyperparameter search, a crucial step in model optimization and engineering.
* * *
## What are Hyperparameters?
Before diving into search methods, we must first clarify a core concept: the difference between **hyperparameters** and **model parameters**.
### Model Parameters vs Hyperparameters
| Feature | Model Parameters | Hyperparameters |
| --- | --- | --- |
| **Definition** | Internal variables **learned** by the model from training data. | Configuration variables manually set or algorithmically selected **before** model training begins. |
| **Learning Method** | Automatically adjusted through optimization algorithms (e.g., gradient descent). | Not learned from training data; require external specification. |
| **Examples** | Weights `w` and bias `b` in linear regression; weights and biases in neural networks. | Learning rate, maximum depth of decision trees, number of trees in random forest, K value in KNN. |
| **Impact** | Determines the model's fitting ability for specific data. | Determines the model's **learning process, capacity, and structure**, thereby affecting final performance. |
**A vivid analogy**: Imagine you're learning to cook a new dish (training a model).
* **Model parameters** are like the specific amounts of "a pinch of salt, half a spoon of sugar" you figure out during this cooking session based on ingredients and heat. These amounts are derived through practice (training).
* **Hyperparameters** are decided before you start cooking: will you use **high-heat stir-fry** or **low-heat slow simmer** (learning rate)? How many times will you **stir-fry** total (training epochs)? These choices fundamentally affect your cooking process and final taste.
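The distinction can also be seen directly in code. The following is a minimal sketch, assuming scikit-learn's `LogisticRegression` on the built-in wine dataset (both chosen here only for illustration): the hyperparameters are available as soon as the estimator is constructed, while the learned parameters appear only after `fit()`.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale features so the solver converges quickly

# Hyperparameters: chosen by us before training starts.
clf = LogisticRegression(C=0.5, max_iter=500)
hyperparams = clf.get_params()
print("C =", hyperparams["C"], "| max_iter =", hyperparams["max_iter"])

# Model parameters: they do not exist until fit() has learned them from the data.
has_coef_before_fit = hasattr(clf, "coef_")
clf.fit(X, y)
print("coef_ present before fit:", has_coef_before_fit)
print("coef_ shape after fit:", clf.coef_.shape)        # (n_classes, n_features)
print("intercept_ shape after fit:", clf.intercept_.shape)
```

Changing `C` or `max_iter` changes how training proceeds; the values in `coef_` and `intercept_` are what training produces.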
### Common Hyperparameter Examples
Different machine learning algorithms have their unique hyperparameters:
**General hyperparameters**:
* `learning_rate`: Controls the step size of model parameter updates. Too large a value may overshoot the optimum; too small a value makes learning slow.
* `n_estimators`: Number of weak learners (e.g., trees) in ensemble models.
* `max_iter` / `epochs`: Maximum number of iterations or training epochs.
**Linear models/Neural networks**:
* `alpha` / `lambda`: Strength of regularization terms, used to prevent overfitting.
* `batch_size`: Number of samples used for each parameter update.
* `hidden_layer_sizes`: Hidden layer sizes in neural networks.
**Tree models**:
* `max_depth`: Maximum depth of trees, controlling model complexity.
* `min_samples_split`: Minimum number of samples required to split an internal node.
* `min_samples_leaf`: Minimum number of samples required at a leaf node.
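As a small illustration of how one of these settings shapes the model, the sketch below varies `max_depth` for a single decision tree. The wine dataset and the specific depth values are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

results = {}
for max_depth in (1, 2, None):  # None lets the tree grow until it stops on its own
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X, y)
    results[max_depth] = (tree.get_depth(), tree.get_n_leaves(), tree.score(X, y))
    depth, leaves, train_acc = results[max_depth]
    print(f"max_depth={max_depth}: actual depth={depth}, "
          f"leaves={leaves}, training accuracy={train_acc:.3f}")
```

A deeper tree fits the training data more closely, which is exactly why `max_depth` has to be judged on validation data rather than on training accuracy.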
* * *
## Why Do We Need Hyperparameter Search?
Since hyperparameters are so important, can we simply set them based on experience or intuition? Usually not reliably. Here's why:
* **Huge performance impact**: The same model with different hyperparameter combinations can lead to vastly different performance (e.g., accuracy, F1 score).
* **No universal optimal values**: Optimal hyperparameters highly depend on the specific dataset, task, and model; there is no "magic default parameter" that works everywhere.
* **Enormous combination space**: Multiple hyperparameters interact with each other, forming a high-dimensional search space. Manual trial-and-error is extremely inefficient and tends to stay close to a few familiar settings.
Therefore, we need systematic, automated methods to explore this vast parameter space and find better-performing configurations. This process is called **hyperparameter search** or **hyperparameter optimization**.
Its core goal is to find a set of hyperparameters within acceptable computational cost that optimizes the model's performance metric on **unseen data** (validation set).
* * *
## Mainstream Hyperparameter Search Strategies
### 1. Grid Search
Grid search is the most basic and intuitive search method.
**How it works**:
* Define a list of candidate values for each hyperparameter to be searched.
* The search algorithm generates the **Cartesian product** of these lists, i.e., all possible combinations.
* Iterate through each combination, train the model, and evaluate.
* Select the combination with the best performance on the validation set.
**Example**: Searching two hyperparameters for Support Vector Machine (SVM).
```python
# Suppose we define the following search grid
param_grid = {
    'C': [0.1, 1, 10, 100],          # Regularization parameter (larger C = weaker regularization), 4 candidate values
    'gamma': [0.001, 0.01, 0.1, 1],  # Kernel coefficient, 4 candidate values
}
# Grid search will try 4 * 4 = 16 different combinations
```
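The four steps listed above can be sketched by hand before reaching for a ready-made search class. This is a minimal sketch under stated assumptions: the wine dataset, the `StandardScaler` pipeline, and the single hold-out validation split are illustrative choices, and `ParameterGrid` is scikit-learn's helper for enumerating the Cartesian product.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
candidates = list(ParameterGrid(param_grid))  # every combination of the two lists
print(len(candidates))  # 16

best_score, best_params = -1.0, None
for params in candidates:
    # Train one model per combination and score it on the validation split.
    model = make_pipeline(StandardScaler(), SVC(**params))
    score = model.fit(X_fit, y_fit).score(X_val, y_val)
    if score > best_score:
        best_score, best_params = score, params

print("best combination:", best_params)
print(f"validation accuracy: {best_score:.3f}")
```

In practice `GridSearchCV` wraps this loop and adds cross-validation and parallelism, so the hand-written version is mainly useful for understanding what the class does.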
**Advantages**:
* **Simple and reliable**: It evaluates every combination in the grid, so it is guaranteed to find the best one among the candidates. It can still miss better values that lie between or outside the grid points.
* **Easy to parallelize**: Each parameter combination's training and evaluation are independent, making it very suitable for parallel computing.
**Disadvantages**:
* **Curse of dimensionality**: With slightly more hyperparameters or slightly denser candidate values, the number of combinations grows exponentially, making computational costs unbearable. For example, 5 parameters with 10 values each would require training and evaluating 10^5 = 100,000 models!
* **Low efficiency**: May waste large amounts of computational resources on "unimportant" parameters.
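The growth in cost is simple arithmetic: the grid size is the product of the number of candidate values per hyperparameter, and cross-validation multiplies the number of model fits again. A short sketch (the helper name `grid_size` is hypothetical, introduced only for this example):

```python
from math import prod

def grid_size(values_per_param):
    """Number of combinations in a full grid: the product of the list lengths."""
    return prod(values_per_param)

svm_grid = grid_size([4, 4])       # the SVM grid above
big_grid = grid_size([10] * 5)     # 5 hyperparameters with 10 values each
cv_folds = 5

print(svm_grid)                    # 16
print(big_grid)                    # 100000
print(big_grid * cv_folds)         # 500000 model fits with 5-fold cross-validation
```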
### 2. Random Search
Random search is an effective improvement addressing grid search's shortcomings.
**How it works**:
* Define a **probability distribution** for each hyperparameter (e.g., uniform distribution, log-uniform distribution).
* Draw `n_iter` sets of hyperparameter values, each **randomly sampled** from these distributions, where `n_iter` is the total number of trials you specify.
* Train and evaluate each sampled parameter set.
* Select the combination with the best performance.
**Example**: Using random search to optimize a random forest.
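```python
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# scipy's randint(low, high) excludes `high`; uniform(loc, scale) covers [loc, loc + scale]
param_dist = {
    'n_estimators': randint(100, 500),        # Integer uniform distribution, 100 to 499
    'max_depth': randint(5, 30),              # Integer uniform distribution, 5 to 29
    'min_samples_split': uniform(0.01, 0.2),  # Continuous uniform distribution, 0.01 to 0.21 (a float is read as a fraction of the samples)
}

# Assumed here: a classifier, with X_train and y_train already defined
rf_model = RandomForestClassifier(random_state=42)

# Randomly perform 50 trials
random_search = RandomizedSearchCV(estimator=rf_model,
                                   param_distributions=param_dist,
                                   n_iter=50,
                                   cv=5,
                                   verbose=2)
random_search.fit(X_train, y_train)
```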
**Why is random search more efficient?** Research (Bergstra & Bengio, 2012) shows that for most problems, model performance is typically sensitive to only a few hyperparameters. With a budget of 9 trials over two hyperparameters, a 3 x 3 grid tests only 3 distinct values of each one, while random search tests 9 distinct values **in each dimension**. Random search therefore has a higher probability of landing in the best region of the important parameters, whereas grid search spends much of its budget repeating the same few values of an important parameter while it varies the unimportant ones.
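A quick way to see this effect is to count how many distinct values of one hyperparameter each strategy actually tries under the same budget. The sketch below assumes two hyperparameters on the unit interval and a budget of 9 trials, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
budget = 9  # same number of trials for both strategies

# Grid: 3 values per hyperparameter -> 3 x 3 = 9 combinations.
grid_a, grid_b = np.meshgrid(np.linspace(0, 1, 3), np.linspace(0, 1, 3))
grid_trials = np.column_stack([grid_a.ravel(), grid_b.ravel()])

# Random: 9 independent draws for each hyperparameter.
random_trials = rng.uniform(0, 1, size=(budget, 2))

# Suppose only the first hyperparameter matters: how many distinct values were tested?
grid_distinct = len(np.unique(grid_trials[:, 0]))
random_distinct = len(np.unique(random_trials[:, 0]))
print("grid:  ", grid_distinct)    # 3
print("random:", random_distinct)  # 9
```

If the second hyperparameter barely affects performance, the grid has effectively run only 3 informative experiments, while the random sample has run 9.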
**Advantages**:
* **High computational efficiency**: Under the same computational budget, it has a higher probability of finding better solutions than grid search.
* **Flexible**: Can easily specify probability distributions for parameters (e.g., searching learning rate on a logarithmic scale).
**Disadvantages**:
* **Randomness**: Results may vary with different random seeds, potentially missing certain regions.
* **Memoryless**: Each trial is independent and does not leverage information from previous trials to guide subsequent search.
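The flexibility point above mentions sampling on a logarithmic scale. A minimal sketch of why that matters, assuming a learning-rate range from 1e-4 to 1 (an illustrative choice) and SciPy's `loguniform` distribution:

```python
import numpy as np
from scipy.stats import loguniform, uniform

n = 10_000
low, high = 1e-4, 1.0

# Linear-uniform and log-uniform samples over the same range.
lin = uniform(loc=low, scale=high - low).rvs(size=n, random_state=0)
log = loguniform(low, high).rvs(size=n, random_state=0)

# Share of samples that fall in the two smallest decades (below 0.01).
frac_lin = np.mean(lin < 0.01)
frac_log = np.mean(log < 0.01)
print(f"linear scale: {frac_lin:.3f}")
print(f"log scale:    {frac_log:.3f}")
```

On the linear scale roughly 1% of the trials land below 0.01, so small learning rates are almost never tested; on the log scale each decade receives about the same share of trials. A `loguniform` object can be passed directly as a value in `param_distributions`.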
### 3. Bayesian Optimization
Bayesian optimization is a more intelligent search method, suitable for optimizing functions with very high evaluation costs (e.g., training a large deep learning model takes several days).
**Core idea**:
* **Surrogate model**: Use a computationally cheap probabilistic model (e.g., Gaussian process) to "simulate" the real, computationally expensive objective function (i.e., the relationship between model performance and hyperparameters).
* **Acquisition function**: Based on the surrogate model's uncertainty, select the "most promising" hyperparameter combination for the next evaluation. It balances **exploration** (sampling in high-uncertainty regions) and **exploitation** (sampling near known good-performing regions).
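**Workflow**:
* Evaluate a few initial hyperparameter combinations (for example, randomly sampled ones).
* Fit the surrogate model to all results observed so far.
* Use the acquisition function to select the next combination to try.
* Train and evaluate the real model with that combination, and add the result to the history.
* Repeat from the second step until the trial budget is used up, then return the best combination found.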
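The interplay of surrogate model and acquisition function can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular library implements it: the objective is a cheap one-dimensional stand-in for an expensive validation score, the surrogate is scikit-learn's `GaussianProcessRegressor`, the acquisition function is Expected Improvement, and the names `objective` and `expected_improvement` are hypothetical helpers written for this example.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    """Toy stand-in for an expensive validation score (higher is better)."""
    return -(x - 0.75) ** 2

def expected_improvement(candidates, gp, best_y, xi=0.01):
    """Expected Improvement for maximization, computed from the surrogate's mean and std."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero
    z = (mu - best_y - xi) / sigma
    return (mu - best_y - xi) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
candidates = np.linspace(0, 1, 201).reshape(-1, 1)

# Start with a few random evaluations.
X_obs = rng.uniform(0, 1, size=(3, 1))
y_obs = objective(X_obs).ravel()

for _ in range(10):
    # Fit the surrogate on everything observed so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True, random_state=0)
    gp.fit(X_obs, y_obs)
    # Pick the candidate that maximizes the acquisition function.
    ei = expected_improvement(candidates, gp, y_obs.max())
    x_next = candidates[np.argmax(ei)].reshape(1, 1)
    # Evaluate the real objective there and add the result to the history.
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

best_x = X_obs[np.argmax(y_obs), 0]
print(f"best x = {best_x:.3f}, best score = {y_obs.max():.5f}")
```

Each iteration uses all previous results to decide where to evaluate next, which is the key difference from grid and random search. For real projects, the libraries listed under common tools handle mixed parameter types, parallel trials, and pruning.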
**Advantages**:
* **Sample-efficient**: Typically finds near-optimal solutions in far fewer trials than grid or random search, which makes it especially suitable for expensive models.
* **Adaptive**: Intelligently guides search direction using historical information.
**Disadvantages**:
* **Complex implementation**: More complex than the previous two methods.
* **Difficult to parallelize**: Standard Bayesian optimization is sequential and difficult to parallelize directly (though improved methods exist).
* **High-dimensional spaces**: As hyperparameter dimensionality increases, fitting and optimizing the surrogate model becomes more difficult.
**Common tools**: `scikit-optimize`, `BayesianOptimization`, `Optuna`, `Hyperopt`.
* * *
## Engineering Practices and Considerations
### 1. Validation Strategy: Don't Contaminate Your Test Set!
When searching for hyperparameters, **never** use the test set to guide the search process, as this will lead to information leakage and overly optimistic generalization performance estimates.
**Correct approach**:
* Split data into: **training set**, **validation set**, **test set**.
* Run the hyperparameter search using only the training and validation data: either fit on the training set and score on the validation set, or run cross-validation over the two merged.
* After selecting the best hyperparameters, retrain the final model with this set of parameters on the **complete training set** (or merged training + validation set).
* Finally, use the **test set** that has never participated in any training or tuning process to fairly evaluate the final model's generalization ability.
### 2. Use Cross-Validation
To more robustly evaluate hyperparameter performance and avoid chance results from a single data split, cross-validation should be used.
```python
from sklearn.model_selection import GridSearchCV

# Assumes `model`, `param_grid`, and the merged training + validation data
# (`X_train_val`, `y_train_val`) are already defined.
# Use 5-fold cross-validation for grid search
grid_search = GridSearchCV(estimator=model,
                           param_grid=param_grid,
                           cv=5,                     # 5-fold cross-validation
                           scoring='accuracy',
                           return_train_score=True)
grid_search.fit(X_train_val, y_train_val)  # Here use training + validation data
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
# Get the best model (with the default refit=True, it is already refitted
# with the best parameters on all data passed to fit)
best_model = grid_search.best_estimator_
```
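To see why a single split can give a chance result, it helps to look at the individual fold scores rather than only their mean. A minimal sketch, assuming the wine dataset and a small random forest chosen only for illustration:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)

# One accuracy value per fold; each fold is a different train/validation split.
fold_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("per-fold accuracy:", fold_scores.round(3))
print(f"mean = {fold_scores.mean():.3f}, std = {fold_scores.std():.3f}")
```

The spread between folds shows how much a score from one split alone could mislead. The search classes expose the same information through `cv_results_` (for example the `mean_test_score` and `std_test_score` entries).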
### 3. Hyperparameter Space Design Tips
* **Scale-sensitive parameters**: For parameters like learning rate and regularization strength, their **effective range often spans multiple orders of magnitude**. Search on a logarithmic scale (e.g., `[0.001, 0.01, 0.1, 1]`) rather than linear scale (e.g., `[0.1, 0.2, ..., 1.0]`).
* **Coarse-to-fine**: First perform large-range random search or sparse grid search to locate promising regions, then perform finer search in those regions.
* **Leverage prior knowledge**: Based on algorithm principles and empirical literature, set reasonable initial ranges and distributions.
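The first two tips combine naturally: a coarse, log-spaced first round followed by a denser log-spaced round around the winner. In the sketch below, `refine_around` is a hypothetical helper, and the "winner" of the coarse round is simply assumed so that the candidate values can be shown without training anything.

```python
import numpy as np

def refine_around(best, factor=10, num=5):
    """Log-spaced candidates from best / factor to best * factor (hypothetical helper)."""
    return np.logspace(np.log10(best / factor), np.log10(best * factor), num)

# Round 1: one candidate per order of magnitude.
coarse = np.logspace(-4, 0, 5)     # 1e-4, 1e-3, 1e-2, 1e-1, 1
print(coarse)

# Round 2: assume 1e-2 scored best in round 1 (for illustration only).
best_coarse = 1e-2
fine = refine_around(best_coarse)  # 5 values between 1e-3 and 1e-1
print(fine)
```

Each list would then be passed as the candidate values for one search round, for instance as an entry of `param_grid`.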
### 4. Automation and Toolchain
In actual engineering, hyperparameter search is often integrated into MLOps pipelines.
* **Frameworks**: `Scikit-learn` provides `GridSearchCV` and `RandomizedSearchCV`.
* **Advanced libraries**: `Optuna`, `Ray Tune`, `Keras Tuner` etc. provide more powerful, distributed-friendly search capabilities, supporting advanced features like early stopping and pruning.
* **Cloud services**: Platforms like AWS SageMaker, Google Vertex AI provide managed hyperparameter optimization services.
* * *
## Hands-on Exercise
Now, let's complete a full hyperparameter search exercise using `Scikit-learn`, a random forest classifier, and the wine dataset.
**Task**: Use the wine dataset to optimize a random forest classifier through random search.
```python
# 1. Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import classification_report
from scipy.stats import randint

# 2. Load data and split into 60% sub-training, 20% validation, 20% test
data = load_wine()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# X_train holds 80% of the data; splitting off 25% of it gives 0.25 * 0.8 = 0.2 for validation
X_subtrain, X_val, y_subtrain, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

# 3. Define model and parameter distributions (randint excludes the upper bound)
rf = RandomForestClassifier(random_state=42)
param_dist = {
    'n_estimators': randint(50, 300),       # Number of trees, 50 to 299
    'max_depth': randint(3, 20),            # Maximum depth of trees, 3 to 19
    'min_samples_split': randint(2, 10),    # Minimum samples required to split internal node, 2 to 9
    'min_samples_leaf': randint(1, 5),      # Minimum samples at leaf node, 1 to 4
    'max_features': ['sqrt', 'log2'],       # Number of features to consider for best split
}

# 4. Execute random search (with 3-fold cross-validation)
random_search = RandomizedSearchCV(estimator=rf,
                                   param_distributions=param_dist,
                                   n_iter=30,           # Randomly try 30 parameter sets
                                   cv=3,                # 3-fold cross-validation
                                   scoring='accuracy',
                                   random_state=42,
                                   verbose=1,
                                   n_jobs=-1)           # Use all CPU cores in parallel
random_search.fit(X_subtrain, y_subtrain)

# 5. Output search results
print("=" * 50)
print("Random search best parameters:")
print(random_search.best_params_)
print(f"\nBest cross-validation accuracy: {random_search.best_score_:.4f}")

# 6. Evaluate best model on independent validation set
best_model = random_search.best_estimator_
y_val_pred = best_model.predict(X_val)
print("\nPerformance report on validation set:")
print(classification_report(y_val, y_val_pred, target_names=data.target_names))

# 7. (Final step) Retrain with best parameters on entire training set, and evaluate on test set
final_model = RandomForestClassifier(**random_search.best_params_, random_state=42)
final_model.fit(X_train, y_train)  # Use all training data (sub-training + validation)
y_test_pred = final_model.predict(X_test)
print("=" * 50)
print("Final model performance report on test set (completely new data):")
print(classification_report(y_test, y_test_pred, target_names=data.target_names))
```
Output:

```text
Fitting 3 folds for each of 30 candidates, totalling 90 fits
==================================================
Random search best parameters:
{'max_depth': 9, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 8, 'n_estimators': 156}

Best cross-validation accuracy: 0.9812

Performance report on validation set:
              precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        11
     class_1       1.00      0.94      0.97        16
     class_2       0.90      1.00      0.95         9

    accuracy                           0.97        36
   macro avg       0.97      0.98      0.97        36
weighted avg       0.98      0.97      0.97        36

==================================================
Final model performance report on test set (completely new data):
              precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        14
     class_1       1.00      1.00      1.00        14
     class_2       1.00      1.00      1.00         8

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36
```