
Sklearn House Prices

Next, we use sklearn to predict Beijing house prices. We proceed step by step through data loading, exploratory analysis, feature engineering, model training, and optimization, showing how sklearn's tools fit together to complete the prediction task.

Content Overview:

**1. Data Generation and Viewing**

* Created simulated data using a dictionary and converted it to a `pandas DataFrame`.
* The data includes house area, number of rooms, floor, year built, location (categorical variable), and house price (target variable).
* Used `df.head()` and `df.describe()` to check the basic structure and summary statistics of the data.

**2. Data Preprocessing**

* **Feature Selection**: Selected the features related to house prices from the original data (area, number of rooms, floor, year built, location).
* **Data Splitting**: Used `train_test_split` to split the dataset into an 80% training set and a 20% test set.
* **Numerical Feature Standardization**: Used `StandardScaler` to standardize numerical features.
* **Categorical Feature Encoding**: Used `OneHotEncoder` to one-hot encode the categorical feature (`location`).
* `ColumnTransformer` combines the processing of numerical and categorical features into one step.

**3. Model Training**

* Used `Pipeline` to chain data preprocessing and model training so that both run as a single step.
* Used a linear regression model (`LinearRegression`) for training.

**4. Model Evaluation**

* Calculated Mean Squared Error (MSE) with `mean_squared_error` and the coefficient of determination (R²) with `r2_score`.
* Printed the evaluation results to check the model's prediction accuracy.

**5. Model Optimization**

* Used `GridSearchCV` to tune the hyperparameters of linear regression, mainly adjusting `fit_intercept` (whether to fit the intercept).
* Found the best hyperparameters through grid search and used the best model to predict the test set.
* Recalculated the evaluation metrics (MSE and R²) for the optimized model.

* * *

## 1. Data Generation and Viewing

First, we construct a simulated DataFrame containing common house price prediction features such as house area, number of rooms, floor, year built, and geographic location (categorical variable).

## Example

```python
import pandas as pd
import numpy as np

# Simulated data: house area (square meters), number of rooms, floor, year built, location (categorical variable)
data = {
    'area': [70, 85, 100, 120, 60, 150, 200, 80, 95, 110],
    'rooms': [2, 3, 3, 4, 2, 5, 6, 3, 3, 4],
    'floor': [5, 2, 8, 10, 3, 15, 18, 7, 9, 11],
    'year_built': [2005, 2010, 2012, 2015, 2000, 2018, 2020, 2008, 2011, 2016],
    'location': ['Chaoyang', 'Haidian', 'Chaoyang', 'Dongcheng', 'Fengtai',
                 'Haidian', 'Chaoyang', 'Fengtai', 'Dongcheng', 'Haidian'],
    'price': [5000000, 6000000, 6500000, 7000000, 4500000,
              10000000, 12000000, 5500000, 6200000, 7500000]  # House price (target variable)
}

# Create DataFrame
df = pd.DataFrame(data)

# View data
print("Data preview:")
print(df.head())
```

Output:

```
Data preview:
   area  rooms  floor  year_built   location    price
0    70      2      5        2005   Chaoyang  5000000
1    85      3      2        2010    Haidian  6000000
2   100      3      8        2012   Chaoyang  6500000
3   120      4     10        2015  Dongcheng  7000000
4    60      2      3        2000    Fengtai  4500000
```

* * *

## 2. Data Preprocessing

Data preprocessing usually includes feature selection, feature transformation, missing value handling, data standardization, etc. We will standardize numerical features and perform one-hot encoding on categorical features.
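Missing value handling is listed among the usual preprocessing steps, but the simulated data has no gaps, so the tutorial does not need it. As a minimal sketch of how it could be added, assuming a hypothetical `X_demo` frame with a few missing entries and a median fill (neither is part of the tutorial's data or method), `SimpleImputer` can be placed in front of `StandardScaler` in the numeric pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data with gaps (not the tutorial's dataset)
X_demo = pd.DataFrame({
    'area': [70.0, np.nan, 100.0, 120.0],
    'rooms': [2.0, 3.0, np.nan, 4.0],
})

# Fill missing values with the column median, then standardize
numeric_with_imputer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

X_demo_transformed = numeric_with_imputer.fit_transform(X_demo)

# Medians learned from the non-missing values of each column
print(numeric_with_imputer.named_steps['imputer'].statistics_)
print(X_demo_transformed)
```

A pipeline like this could stand in for the numeric transformer inside a `ColumnTransformer`. The median is only one possible fill strategy, and whether it is appropriate depends on the data.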
## Example

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Feature selection
X = df[['area', 'rooms', 'floor', 'year_built', 'location']]  # Features
y = df['price']  # Target variable

# Split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build preprocessing steps
numeric_features = ['area', 'rooms', 'floor', 'year_built']
categorical_features = ['location']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())  # Numerical feature standardization
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # Handle new categories in test set
])

# Combine into ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# View the structure of the preprocessed data
X_train_transformed = preprocessor.fit_transform(X_train)
print("Preprocessed training data:")
print(X_train_transformed)
```

The output is:

```
Preprocessed training data:
[[ 0.89826776  1.0440738   1.14636101  0.96800387  0.          0.          0.          1.        ]
 [-0.95622052 -1.23390539 -0.98640366 -1.04544418  1.          0.          0.          0.        ]
 [-0.72440948 -0.474579   -0.55985073 -0.58080232  0.          0.          1.          0.        ]
 [-0.26078741 -0.474579   -0.34657426  0.03872015  1.          0.          0.          0.        ]
 [-0.02897638  0.2847474   0.29325514  0.65824263  0.          0.          0.          1.        ]
 [-1.18803155 -1.23390539 -1.4129566  -1.81984727  0.          0.          1.          0.        ]
 [ 0.20283466  0.2847474   0.07997868  0.50336201  0.          1.          0.          0.        ]
 [ 2.05732294  1.80340019  1.78619041  1.2777651   1.          0.          0.          0.        ]]
```

* * *

## 3. Building the Model

Next, we use a linear regression model to predict house prices, and we use `Pipeline` to integrate the preprocessing and model training steps.
## Example

```python
from sklearn.linear_model import LinearRegression

# Build a Pipeline with preprocessing and regression model
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),    # Data preprocessing steps
    ('regressor', LinearRegression())  # Regression model
])

# Train model
model_pipeline.fit(X_train, y_train)

# Make predictions
y_pred = model_pipeline.predict(X_test)

# Output prediction results
print("\nPrediction results:")
print(y_pred)
```

The output is:

```
Prediction results:
[6375000.00000001 4874999.99999998]
```

* * *

## 4. Model Evaluation

In model evaluation, we usually use Mean Squared Error (MSE) and the Coefficient of Determination (R²) to evaluate the performance of regression models.

## Example

```python
from sklearn.metrics import mean_squared_error, r2_score

# Calculate Mean Squared Error (MSE) and Coefficient of Determination (R²)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Output evaluation results
print("\nModel evaluation:")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Coefficient of determination (R²): {r2:.2f}")
```

The output is:

```
Model evaluation:
Mean Squared Error (MSE): 648125000000.03
Coefficient of determination (R²): -63.81
```

A negative R² means the model predicts worse than always guessing the mean of the test targets. With only two test samples here, the metric is very unstable.

* * *

## 5. Model Optimization (Grid Search)

To improve model performance, we can use `GridSearchCV` to tune the model's hyperparameters. In this example, we don't adjust the parameters of linear regression because it doesn't
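The overview at the top of this tutorial describes this step as a grid search over `fit_intercept`, followed by re-evaluating the best model on the test set. The sketch below is one way that could look, not the author's original code. It rebuilds the data and pipeline so it runs on its own. The parameter name `regressor__fit_intercept` follows from the pipeline step name `regressor`. The choices `cv=4` and `scoring='neg_mean_squared_error'` are assumptions made here because the training set has only eight rows, and R² is not well defined on a validation fold with a single sample.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same simulated data as in section 1
df = pd.DataFrame({
    'area': [70, 85, 100, 120, 60, 150, 200, 80, 95, 110],
    'rooms': [2, 3, 3, 4, 2, 5, 6, 3, 3, 4],
    'floor': [5, 2, 8, 10, 3, 15, 18, 7, 9, 11],
    'year_built': [2005, 2010, 2012, 2015, 2000, 2018, 2020, 2008, 2011, 2016],
    'location': ['Chaoyang', 'Haidian', 'Chaoyang', 'Dongcheng', 'Fengtai',
                 'Haidian', 'Chaoyang', 'Fengtai', 'Dongcheng', 'Haidian'],
    'price': [5000000, 6000000, 6500000, 7000000, 4500000,
              10000000, 12000000, 5500000, 6200000, 7500000],
})
X = df[['area', 'rooms', 'floor', 'year_built', 'location']]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Same preprocessing and model as in sections 2 and 3
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['area', 'rooms', 'floor', 'year_built']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['location']),
])
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression()),
])

# Grid over the one hyperparameter named in the overview
param_grid = {'regressor__fit_intercept': [True, False]}

# cv and scoring are assumptions chosen for this very small dataset
grid_search = GridSearchCV(model_pipeline, param_grid, cv=4,
                           scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)

# Re-evaluate the best model on the held-out test set
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
mse_best = mean_squared_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)
print(f"Optimized Mean Squared Error (MSE): {mse_best:.2f}")
print(f"Optimized coefficient of determination (R²): {r2_best:.2f}")
```

With so few rows, the selected parameters and the resulting scores can change noticeably with a different split, so the numbers printed here should be read as a demonstration of the workflow rather than as a reliable estimate of model quality.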