ML Hyperparameter Search

## Hyperparameter Search

In machine learning practice, a common puzzle is why someone else reaches 95% accuracy with the same algorithm while your model only reaches 85%. Beyond differences in data quality and feature engineering, a key factor often lies in the **hyperparameter** settings.

If the model algorithm is the car's engine, then hyperparameters are the fine-tuning knobs such as ignition timing and fuel injection amount. Tune them well, and the engine runs powerfully; tune them poorly, and it may underperform or suffer excessive wear.

This article walks through hyperparameter search, a crucial step in model optimization and engineering.

* * *

## What are Hyperparameters?

Before diving into search methods, we must first clarify a core concept: the difference between **hyperparameters** and **model parameters**.

### Model Parameters vs Hyperparameters

| Feature | Model Parameters | Hyperparameters |
| --- | --- | --- |
| **Definition** | Internal variables **learned** by the model from training data. | Configuration variables set manually or selected algorithmically **before** model training begins. |
| **Learning Method** | Automatically adjusted through optimization algorithms (e.g., gradient descent). | Not learned from training data; require external specification. |
| **Examples** | Weights `w` and bias `b` in linear regression; weights and biases in neural networks. | Learning rate, maximum depth of decision trees, number of trees in a random forest, K value in KNN. |
| **Impact** | Determine how the model fits the specific data. | Determine the model's **learning process, capacity, and structure**, thereby affecting final performance. |

**A vivid analogy**: Imagine you're learning to cook a new dish (training a model).

* **Model parameters** are like the specific amounts of "a pinch of salt, half a spoon of sugar" you settle on during this cooking session based on the ingredients and heat. These amounts are derived through practice (training).
* **Hyperparameters** are decided before you start cooking: will you use a **high-heat stir-fry** or a **low-heat slow simmer** (learning rate)? How many times will you **stir-fry** in total (training epochs)? These choices fundamentally affect your cooking process and the final taste.

### Common Hyperparameter Examples

Different machine learning algorithms have their own hyperparameters:

**General hyperparameters**:

* `learning_rate`: Controls the step size of model parameter updates. Too large and training may "skip" the optimal point; too small and learning is slow.
* `n_estimators`: Number of weak learners (e.g., trees) in ensemble models.
* `max_iter` / `epochs`: Maximum number of iterations or training epochs.

**Linear models/Neural networks**:

* `alpha` / `lambda`: Strength of the regularization term, used to prevent overfitting.
* `batch_size`: Number of samples used for each parameter update.
* `hidden_layer_sizes`: Hidden layer sizes in neural networks.

**Tree models**:

* `max_depth`: Maximum depth of trees, controlling model complexity.
* `min_samples_split`: Minimum number of samples required to split an internal node.
* `min_samples_leaf`: Minimum number of samples required at a leaf node.

* * *

## Why Do We Need Hyperparameter Search?

Since hyperparameters are so important, can we simply set them based on experience or intuition? In most cases, no. Here's why:

* **Huge performance impact**: The same model with different hyperparameter combinations can show vastly different performance (e.g., accuracy, F1 score).
* **No universal optimal values**: Optimal hyperparameters depend heavily on the specific dataset, task, and model; there is no "magic default parameter" that works everywhere.
* **Enormous combination space**: Multiple hyperparameters interact with each other, forming a high-dimensional search space. Manual trial-and-error is extremely inefficient and tends to stay close to the few settings you already know.
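To put a number on the combination space, here is a minimal sketch that counts the configurations in a small grid. The three hyperparameters and their candidate values are hypothetical, chosen only for illustration.

```python
from itertools import product
from math import prod

# Hypothetical candidate values for three random forest hyperparameters.
search_space = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [3, 5, 10, 20, None],
    "min_samples_leaf": [1, 2, 4, 8],
}

# Every combination corresponds to one full train-and-evaluate run.
combinations = list(product(*search_space.values()))
n_combinations = len(combinations)

# The same count is the product of the candidate-list lengths: 5 * 5 * 4.
assert n_combinations == prod(len(v) for v in search_space.values())

# With 5-fold cross-validation, each combination is trained five times.
n_fits = n_combinations * 5
print(n_combinations, n_fits)  # 100 500
```

Adding a fourth hyperparameter with five candidates would multiply both numbers by five, which is why trying combinations by hand stops being practical very quickly.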
Therefore, we need systematic, automated methods to explore this vast parameter space and find better-performing configurations. This process is called **hyperparameter search** or **hyperparameter optimization**. Its core goal is to find, within an acceptable computational cost, a set of hyperparameters that optimizes the model's performance metric on **unseen data** (the validation set).

* * *

## Mainstream Hyperparameter Search Strategies

### 1. Grid Search

Grid search is the most basic and intuitive search method.

**How it works**:

* Define a list of candidate values for each hyperparameter to be searched.
* The search algorithm generates the **Cartesian product** of these lists, i.e., all possible combinations.
* Iterate through each combination, train the model, and evaluate it.
* Select the combination with the best performance on the validation set.

**Example**: Searching two hyperparameters for a Support Vector Machine (SVM).

```python
# Suppose we define the following search grid
param_grid = {
    'C': [0.1, 1, 10, 100],         # Regularization parameter (smaller C = stronger regularization), 4 candidate values
    'gamma': [0.001, 0.01, 0.1, 1]  # Kernel coefficient, 4 candidate values
}
# Grid search will try 4 * 4 = 16 different combinations
```

**Advantages**:

* **Simple and reliable**: It evaluates every combination in the grid, so it is guaranteed to find the best of the listed candidates. It can still miss a better value that lies between grid points; a finer grid narrows that gap at a higher cost.
* **Easy to parallelize**: Each parameter combination's training and evaluation are independent, making it very suitable for parallel computing.

**Disadvantages**:

* **Curse of dimensionality**: With slightly more hyperparameters or slightly denser candidate values, the number of combinations grows exponentially, making computational costs unbearable. For example, 5 parameters with 10 values each would require training and evaluating 10^5 = 100,000 models!
* **Low efficiency**: May waste large amounts of computational resources on "unimportant" parameters.

### 2. Random Search

Random search is an effective improvement that addresses grid search's shortcomings.

**How it works**:

* Define a **probability distribution** for each hyperparameter (e.g., uniform distribution, log-uniform distribution).
* For a specified total number of trials (`n_iter`), **randomly sample** one set of hyperparameter values per trial.
* Train and evaluate each sampled parameter set.
* Select the combination with the best performance.

**Example**: Using random search to optimize a random forest.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# X_train and y_train are assumed to be an existing training split
rf_model = RandomForestClassifier(random_state=42)

param_dist = {
    'n_estimators': randint(100, 500),       # Integer uniform distribution, 100 to 499 (upper bound is exclusive)
    'max_depth': randint(5, 30),             # Integer uniform distribution, 5 to 29
    'min_samples_split': uniform(0.01, 0.2)  # Continuous uniform distribution, 0.01 to 0.21 (a float is a fraction of the samples)
}

# Randomly perform 50 trials
random_search = RandomizedSearchCV(
    estimator=rf_model,
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    verbose=2
)
random_search.fit(X_train, y_train)
```

**Why is random search more efficient?** Research (Bergstra & Bengio, 2012) shows that for most problems, model performance is typically sensitive to only a few hyperparameters. With the same number of trials, random search tries many more distinct values **in each dimension**, so it has a higher probability of landing in the best region for the important parameters. Grid search, by contrast, spends most of its budget repeating the same few values of the important parameters while it steps through the unimportant ones.

**Advantages**:

* **High computational efficiency**: Under the same computational budget, it often finds better solutions than grid search.
* **Flexible**: Can easily specify probability distributions for parameters (e.g., searching learning rate on a logarithmic scale).

**Disadvantages**:

* **Randomness**: Results may vary with different random seeds, and some regions may be missed.
* **Memoryless**: Each trial is independent and does not use information from previous trials to guide the subsequent search.
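The per-dimension argument can be checked with a few lines of standard-library Python. This is only an illustrative sketch: the two hyperparameters are unnamed placeholders, the 16-trial budget and the range `[0.001, 1]` are assumptions, and `sample_log_uniform` is a helper written for this example.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

budget = 16  # total number of trials allowed for either strategy

# Grid search: a 4 x 4 grid over two placeholder hyperparameters.
grid_axis = [0.001, 0.01, 0.1, 1.0]
grid_trials = [(a, b) for a in grid_axis for b in grid_axis]


def sample_log_uniform(low_exp, high_exp):
    """Draw a value whose base-10 logarithm is uniform in [low_exp, high_exp]."""
    return 10 ** random.uniform(low_exp, high_exp)


# Random search: 16 independent draws per hyperparameter on a log scale.
random_trials = [
    (sample_log_uniform(-3, 0), sample_log_uniform(-3, 0))
    for _ in range(budget)
]

# Count how many distinct values of the first hyperparameter each strategy tried.
grid_distinct = len({a for a, _ in grid_trials})
random_distinct = len({a for a, _ in random_trials})
print(grid_distinct, random_distinct)
```

Both strategies run 16 trials, but the grid only ever tests 4 values of the first hyperparameter, while random search tests a different value in every trial. If that hyperparameter is the one that matters, random search has explored it four times more densely for the same cost.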
### 3. Bayesian Optimization

Bayesian optimization is a more intelligent search method, suited to objective functions that are very expensive to evaluate (e.g., training a large deep learning model that takes several days).

**Core idea**:

* **Surrogate model**: Use a computationally cheap probabilistic model (e.g., a Gaussian process) to approximate the real, computationally expensive objective function (i.e., the relationship between model performance and hyperparameters).
* **Acquisition function**: Based on the surrogate model's predictions and their uncertainty, select the "most promising" hyperparameter combination for the next evaluation. It balances **exploration** (sampling in high-uncertainty regions) and **exploitation** (sampling near regions already known to perform well).

**Workflow**: Evaluate a few initial configurations, fit the surrogate model to the results, use the acquisition function to choose the next configuration, evaluate it, update the surrogate model, and repeat until the trial budget is used up.

**Advantages**:

* **Sample-efficient**: Can usually find near-optimal solutions in far fewer trials than grid or random search, which makes it especially suitable for expensive models.
* **Adaptive**: Guides the search direction using historical information.

**Disadvantages**:

* **Complex implementation**: More complex than the previous two methods.
* **Difficult to parallelize**: Standard Bayesian optimization is sequential and difficult to parallelize directly (though improved methods exist).
* **High-dimensional spaces**: As hyperparameter dimensionality increases, fitting and optimizing the surrogate model becomes more difficult.

**Common tools**: `scikit-optimize`, `BayesianOptimization`, `Optuna`, `Hyperopt`. Note that `Optuna` and `Hyperopt` default to the Tree-structured Parzen Estimator (TPE) as their surrogate rather than a Gaussian process.

* * *

## Engineering Practices and Considerations

### 1. Validation Strategy: Don't Contaminate Your Test Set!

When searching for hyperparameters, **never** use the test set to guide the search process, as this leads to information leakage and overly optimistic estimates of generalization performance.

**Correct approach**:

* Split data into: **training set**, **validation set**, **test set**.
* Perform hyperparameter search on "training set + validation set" (e.g., using cross-validation).
* After selecting the best hyperparameters, retrain the final model with this set of parameters on the **complete training set** (or the merged training + validation set).
* Finally, use the **test set**, which has never participated in any training or tuning, to fairly evaluate the final model's generalization ability.

### 2. Use Cross-Validation

To evaluate hyperparameters more robustly and avoid chance results from a single data split, cross-validation should be used.

```python
from sklearn.model_selection import GridSearchCV

# model, param_grid, X_train_val and y_train_val are assumed to be defined already
# Use 5-fold cross-validation for grid search
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,                     # 5-fold cross-validation
    scoring='accuracy',
    return_train_score=True
)
grid_search.fit(X_train_val, y_train_val)  # Here use training + validation data

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Get the best model (already refitted with the best parameters on all data passed to fit)
best_model = grid_search.best_estimator_
```

### 3. Hyperparameter Space Design Tips

* **Scale-sensitive parameters**: For parameters like learning rate and regularization strength, the **effective range often spans multiple orders of magnitude**. Search on a logarithmic scale (e.g., `[0.001, 0.01, 0.1, 1]`) rather than a linear scale (e.g., `[0.1, 0.2, ..., 1.0]`).
* **Coarse-to-fine**: First perform a large-range random search or sparse grid search to locate promising regions, then perform a finer search in those regions.
* **Leverage prior knowledge**: Based on algorithm principles and empirical literature, set reasonable initial ranges and distributions.

### 4. Automation and Toolchain

In actual engineering, hyperparameter search is often integrated into MLOps pipelines.

* **Frameworks**: `Scikit-learn` provides `GridSearchCV` and `RandomizedSearchCV`.
* **Advanced libraries**: `Optuna`, `Ray Tune`, `Keras Tuner`, etc. provide more powerful, distributed-friendly search capabilities, supporting advanced features like early stopping and pruning.
* **Cloud services**: Platforms like AWS SageMaker and Google Vertex AI provide managed hyperparameter optimization services.

* * *

## Hands-on Exercise

Now, let's complete a full hyperparameter search exercise using `Scikit-learn` and a random forest classifier.

**Task**: Use the wine dataset to optimize a random forest classifier through random search.

```python
# 1. Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import classification_report
from scipy.stats import randint

# 2. Load data and split
data = load_wine()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 0.25 * 0.8 = 0.2, so the result is 60% search data, 20% validation, 20% test
X_search, X_val, y_search, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# 3. Define model and parameter distributions
rf = RandomForestClassifier(random_state=42)
param_dist = {
    'n_estimators': randint(50, 300),     # Number of trees
    'max_depth': randint(3, 20),          # Maximum depth of trees
    'min_samples_split': randint(2, 10),  # Minimum samples required to split internal node
    'min_samples_leaf': randint(1, 5),    # Minimum samples at leaf node
    'max_features': ['sqrt', 'log2']      # Number of features to consider for best split
}

# 4. Execute random search (with 3-fold cross-validation)
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=30,           # Randomly try 30 parameter sets
    cv=3,                # 3-fold cross-validation
    scoring='accuracy',
    random_state=42,
    verbose=1,
    n_jobs=-1            # Use all CPU cores in parallel
)
random_search.fit(X_search, y_search)

# 5. Output search results
print("=" * 50)
print("Random search best parameters:")
print(random_search.best_params_)
print(f"\nBest cross-validation accuracy: {random_search.best_score_:.4f}")

# 6. Evaluate best model on independent validation set
best_model = random_search.best_estimator_
y_val_pred = best_model.predict(X_val)
print("\nPerformance report on validation set:")
print(classification_report(y_val, y_val_pred, target_names=data.target_names))

# 7. (Final step) Retrain with best parameters on entire training set, and evaluate on test set
final_model = RandomForestClassifier(**random_search.best_params_, random_state=42)
final_model.fit(X_train, y_train)  # Use all training data (search + validation portions)
y_test_pred = final_model.predict(X_test)
print("=" * 50)
print("Final model performance report on test set (completely new data):")
print(classification_report(y_test, y_test_pred, target_names=data.target_names))
```

Output:

```
Fitting 3 folds for each of 30 candidates, totalling 90 fits
==================================================
Random search best parameters:
{'max_depth': 9, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 8, 'n_estimators': 156}

Best cross-validation accuracy: 0.9812

Performance report on validation set:
              precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        11
     class_1       1.00      0.94      0.97        16
     class_2       0.90      1.00      0.95         9

    accuracy                           0.97        36
   macro avg       0.97      0.98      0.97        36
weighted avg       0.98      0.97      0.97        36

==================================================
Final model performance report on test set (completely new data):
              precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        14
     class_1       1.00      1.00      1.00        14
     class_2       1.00      1.00      1.00         8

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36
```

Exact numbers may differ slightly across library versions. The test set here holds only 36 samples, so a perfect score should be read as a good sign rather than a guarantee.
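As a possible follow-up to the exercise, the coarse-to-fine tip from the space design section can be sketched without any ML library. Everything here is an assumption for illustration: `toy_validation_score` is a made-up stand-in for a real validation score (it peaks at a learning rate of 0.03), and the ranges and trial counts are arbitrary.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable


def toy_validation_score(learning_rate):
    """Hypothetical stand-in for a validation score; best value 0 at lr = 0.03."""
    return -(math.log10(learning_rate) - math.log10(0.03)) ** 2


def random_search(low_exp, high_exp, n_trials):
    """Sample learning rates log-uniformly in [10**low_exp, 10**high_exp]."""
    best_lr, best_score = None, -math.inf
    for _ in range(n_trials):
        lr = 10 ** random.uniform(low_exp, high_exp)
        score = toy_validation_score(lr)
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr, best_score


# Stage 1: coarse search over five orders of magnitude.
coarse_lr, coarse_score = random_search(-5, 0, n_trials=20)

# Stage 2: fine search within half a decade on each side of the coarse winner.
center = math.log10(coarse_lr)
fine_lr, fine_score = random_search(center - 0.5, center + 0.5, n_trials=20)

# Keep whichever stage produced the better configuration.
best_lr, best_score = max(
    (coarse_lr, coarse_score), (fine_lr, fine_score), key=lambda pair: pair[1]
)
print(f"coarse best: {coarse_lr:.4g}, after refinement: {best_lr:.4g}")
```

In a real project the toy function would be replaced by a cross-validated score, for example by running `RandomizedSearchCV` twice with a narrower `param_distributions` in the second pass.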
← Ml Model Optimization Common PMl Model Optimization Data Lea β†’