ML House Price Prediction
## Machine Learning - House Price Prediction

When buying a house, people typically consider multiple factors: location, area, age of the property, number of rooms, population density, transportation conditions, etc. Each factor influences the price, but the influence is neither linear nor isolated - the final price results from combining, balancing, and trading off all of them.

In machine learning, the **regression problem** essentially transforms this "experience-based judgment" into a mathematical model that is computable, reusable, and evaluable.

This chapter walks through the standard machine learning workflow for **house price prediction** from scratch: data understanding → feature analysis → model training → model evaluation → model optimization. The goal is not to memorize APIs, but to understand what each step does and why it is necessary.

* * *

## Part 1: Project Preparation and Environment Setup

### 1.1 Core Tools Used

* **NumPy**: Low-level numerical computation tool providing efficient array operations.
* **Pandas**: Core tool for tabular data analysis, essential for machine learning preprocessing.
* **Matplotlib / Seaborn**: Data visualization tools for understanding data distribution and relationships.
* **Scikit-learn**: Machine learning toolbox covering datasets, models, and evaluation methods.

### 1.2 Install Dependencies

    pip install numpy pandas matplotlib seaborn scikit-learn

* * *

## Part 2: Load and Understand the Dataset

Older tutorials often use the Boston Housing dataset, but it was deprecated and then removed from scikit-learn (as of version 1.2). We will use the **California Housing** dataset instead, which scikit-learn suggests as an alternative and which poses the same kind of regression task.

**Example: Load Data**

    import pandas as pd
    from sklearn.datasets import fetch_california_housing

    # Load the California Housing dataset (downloaded on first use)
    data = fetch_california_housing()

    # Feature data (X)
    df_features = pd.DataFrame(
        data.data,
        columns=data.feature_names
    )

    # Target variable (y): median house value, in units of $100,000
    df_target = pd.DataFrame(
        data.target,
        columns=['MedHouseVal']
    )

    print(df_features.head())
    print(df_target.head())

Output:

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
    0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23
    1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22
    2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24
    3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
    4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25

       MedHouseVal
    0        4.526
    1        3.585
    2        3.521
    3        3.413
    4        3.422

You should understand three key points:

* Each row describes a census block group (a small district), not an individual house, which is why the features are medians and averages
* Each column is a variable that can be used for prediction
* `MedHouseVal` is the target variable we want to predict

* * *

## Part 3: Exploratory Data Analysis (EDA)

Before training a model, you must first answer one question: **Is the data worth learning from?**

### 3.1 Data Structure and Missing Value Check

**Example: Data Overview**

    print("Feature dimensions:", df_features.shape)
    print("Target dimensions:", df_target.shape)

    print("\nData types and missing values:")
    df_features.info()

    print("\nMissing value count:")
    print(df_features.isnull().sum())

Conclusions:

* Sufficient sample size (20,640 rows)
* All features are numerical
* No missing values, so the data is ready for modeling

### 3.2 Relationship Between Single Features and House Prices

Machine learning is not a black box. At least at the beginner stage, you should understand what the model is learning.

**Example: Room Count vs House Price**

    import matplotlib.pyplot as plt

    plt.figure(figsize=(8, 6))
    plt.scatter(df_features['AveRooms'], df_target['MedHouseVal'], alpha=0.4)
    plt.xlabel('Average Rooms')
    plt.ylabel('Median House Value')
    plt.title('Rooms vs House Price')
    plt.grid(True)
    plt.show()

Key observation: prices tend to rise with the average number of rooms, but the points are widely dispersed. No single feature explains the price on its own, which is why a regression model combines several features.

### 3.3 Feature Correlation Analysis

**Example: Correlation Heatmap**

    import seaborn as sns

    # Combine features and target so the target appears in the correlation matrix
    df_all = pd.concat([df_features, df_target], axis=1)
    corr = df_all.corr()

    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, cmap='coolwarm', center=0)
    plt.title('Feature Correlation Heatmap')
    plt.show()

The purpose of this step is not to choose a model, but to confirm that the features have statistical relationships with the target that a model can learn.

* * *

## Part 4: Build Your First Regression Model

### 4.1 Split Training and Test Sets

A model must not be evaluated on the data it was trained on; otherwise the score reflects memorization rather than the ability to generalize, and the evaluation is meaningless.

**Example: Dataset Splitting**

    from sklearn.model_selection import train_test_split

    X = df_features
    y = df_target['MedHouseVal']

    # Hold out 20% of the rows for testing; fix the seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    print("Training set:", X_train.shape)
    print("Test set:", X_test.shape)

### 4.2 Train Linear Regression Model

**Example: Model Training**

    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(X_train, y_train)

    print("Intercept:", model.intercept_)
    print("Coefficients:")

    for name, coef in zip(X.columns, model.coef_):
        print(f"{name}: {coef:.4f}")

The advantage of linear regression is its interpretability: each coefficient is the predicted change in the target for a one-unit increase in that feature, with the other features held fixed. This makes it a good starting point for understanding regression problems.

* * *

## Part 5: Model Prediction and Evaluation

### 5.1 Compare Predicted vs Actual Values

**Example: Prediction vs Actual**

    y_pred = model.predict(X_test)

    result = pd.DataFrame({
        "Actual": y_test.values,
        "Predicted": y_pred
    })

    print(result.head())

### 5.2 Use Evaluation Metrics to Quantify Model Performance

**Example: Evaluation Metrics**

    from sklearn.metrics import mean_squared_error, r2_score

    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print("MSE:", mse)
    print("R2:", r2)

Interpretation:

* A smaller MSE means lower prediction error. Because the target is in units of $100,000, the MSE is in the square of that unit
* An R² closer to 1 means the model explains more of the variance in house values

* * *

## Part 6: Model Optimization - Standardization and Ridge Regression

### 6.1 Why Standardization Matters

Ridge regression penalizes the size of the coefficients, and a coefficient's size depends on the units of its feature. When features sit on very different scales (compare `Population` with `AveBedrms`), the penalty treats them unevenly. Standardizing puts every feature on a comparable scale. Note that the scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.

**Example: Feature Standardization**

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
    X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std

### 6.2 Use Ridge Regression to Prevent Overfitting

**Example: Ridge Regression**

    from sklearn.linear_model import Ridge

    # alpha controls the strength of the L2 penalty
    ridge = Ridge(alpha=1.0)
    ridge.fit(X_train_scaled, y_train)

    y_pred_ridge = ridge.predict(X_test_scaled)

    print("Ridge MSE:", mean_squared_error(y_test, y_pred_ridge))
    print("Ridge R2:", r2_score(y_test, y_pred_ridge))

## Part 7: Complete Workflow Review

![Complete workflow review](https://example.com/wp-content/uploads/2025/12/d13f5388-d5e3-44b5-bd6e-84da9b6c4ba.png)
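The workflow reviewed in Part 7 can also be condensed into one short script. The following is a minimal sketch rather than the tutorial's own code: it chains standardization and ridge regression in a scikit-learn `Pipeline`, so the scaler is always fitted on the training split only. To stay runnable without a download, it uses synthetic data from `make_regression` as a stand-in for California Housing. The helper name `run_workflow` and the synthetic-data settings are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def run_workflow(X, y, alpha=1.0, seed=42):
    """Split, standardize, fit ridge, and evaluate (hypothetical helper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )

    # The pipeline fits the scaler on the training data only,
    # then applies the same scaling when predicting on test data.
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    return {
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
        "n_train": X_train.shape[0],
        "n_test": X_test.shape[0],
    }


# Synthetic stand-in (assumption): 8 numeric features, like California Housing
X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)

metrics = run_workflow(X, y)
print(metrics)
```

To run the same sketch on the real data, pass `data.data` and `data.target` from `fetch_california_housing()` to `run_workflow` in place of the synthetic arrays.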