

# Titanic Survival Prediction | Novice Tutorial

If you're just starting to learn machine learning, you might feel that those complex algorithms and mathematical formulas are far from the real world. But today, we'll experience a complete machine learning project workflow through a classic case study: Titanic survival prediction.

The Titanic dataset is one of the most famous introductory projects in machine learning, based on real passenger information from the 1912 Titanic disaster. Our goal is **to build a model that predicts whether passengers could survive the disaster based on their age, gender, ticket class, and other information**.

This project is classic because it covers the core steps of machine learning projects:

1. **Data Understanding and Exploration**
2. **Data Cleaning and Preprocessing**
3. **Feature Engineering**
4. **Model Selection and Training**
5. **Model Evaluation and Optimization**

Through this practical case, you'll move beyond just reading theory to truly understanding how to apply machine learning to solve real-world problems.

***

## Step 1: Understanding Our Data

Before writing any code, we must first understand the data at hand.
The Titanic dataset typically contains the following fields (features):

| Field Name | Description | Data Type | Notes |
| --- | --- | --- | --- |
| `PassengerId` | Passenger ID | Integer | Unique identifier, not helpful for prediction |
| `Survived` | Survival status | Integer (0/1) | **Target variable**, 0=Not survived, 1=Survived |
| `Pclass` | Ticket class | Integer (1,2,3) | 1=1st class, 2=2nd class, 3=3rd class |
| `Name` | Passenger name | String | Contains titles (e.g., Mr., Miss.), can extract new features |
| `Sex` | Gender | String | `male` or `female` |
| `Age` | Age | Float | Some missing values |
| `SibSp` | Number of siblings/spouses | Integer | |
| `Parch` | Number of parents/children | Integer | |
| `Ticket` | Ticket number | String | Complex structure, limited information |
| `Fare` | Ticket price | Float | |
| `Cabin` | Cabin number | String | Many missing values, first letter may indicate cabin area |
| `Embarked` | Embarkation port | String | C=Cherbourg, Q=Queenstown, S=Southampton |

**Key Insight**: From historical knowledge we know that the "women and children first" principle was followed, and first-class passengers had priority access to lifeboats. Therefore, we expect features like `Sex`, `Age`, and `Pclass` to have a significant impact on prediction results.

***

## Step 2: Data Cleaning and Preprocessing

Raw data is almost never perfect. Data cleaning is like preparing high-quality ingredients for a model, and this step is crucial.

Store the following data in a `train.csv` file:

    PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
    1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
    2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,PC 17599,71.2833,C85,C
    3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
    4,1,1,"Futrelle, Mrs. Jacques Heath",female,35,1,0,113803,53.1,C123,S
    5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
    6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
    7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
    8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
    9,1,3,"Johnson, Mrs. Oscar W",female,27,0,2,347742,11.1333,,S
    10,1,2,"Nasser, Mrs. Nicholas",female,14,1,0,237736,30.0708,,C

Store the following data in a `test.csv` file:

    PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
    11,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
    12,3,"Wilkes, Mrs. James",female,47,1,0,363272,7,,S
    13,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
    14,3,"Dwyer, Miss. Ellen",female,18,0,0,330959,7.75,,Q
    15,1,"Jones, Mr. Charles",male,,1,0,PC 17603,82.1708,B28,C

We'll use Python's `pandas` and `numpy` libraries to complete this work.

## Example

```python
# Import necessary libraries
import pandas as pd
import numpy as np

# Load data
train_data = pd.read_csv('train.csv')  # Training set, contains target variable Survived
test_data = pd.read_csv('test.csv')    # Test set, does not contain Survived, used for final evaluation

# 1. Preliminary data inspection
print("Training set shape:", train_data.shape)
train_data.info()         # Prints data types and non-null counts itself (returns None)
print(train_data.head())  # View first few rows of data
```

Output:

    Training set shape: (10, 12)
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10 entries, 0 to 9
    Data columns (total 12 columns):
     #   Column       Non-Null Count  Dtype
    ---  ------       --------------  -----
     0   PassengerId  10 non-null     int64
     1   Survived     10 non-null     int64
     2   Pclass       10 non-null     int64
     3   Name         10 non-null     object
     4   Sex          10 non-null     object
     5   Age          9 non-null      float64
     6   SibSp        10 non-null     int64
     7   Parch        10 non-null     int64
     8   Ticket       10 non-null     object
     9   Fare         10 non-null     float64
     10  Cabin        3 non-null      object
     11  Embarked     10 non-null     object
    dtypes: float64(2), int64(5), object(5)
    memory usage: 1.1+ KB
       PassengerId  Survived  Pclass                          Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
    0            1         0       3       Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
    1            2         1       1    Cumings, Mrs. John Bradley  female  38.0      1      0          PC 17599  71.2833   C85        C
    2            3         1       3        Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
    3            4         1       1  Futrelle, Mrs. Jacques Heath  female  35.0      1      0            113803  53.1000  C123        S
    4            5         0       3      Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

After running the above code, you may notice two main issues: **missing values** and **non-numeric data**.

### Handling Missing Values

## Example

```python
# Check number of missing values in each column
print(train_data.isnull().sum())

# Handle Age: Fill with median
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].median())

# Handle Embarked (Embarkation port): Fill with mode
# mode() returns a Series (there can be ties), so take its first entry
most_common_port = train_data['Embarked'].mode()[0]
train_data['Embarked'] = train_data['Embarked'].fillna(most_common_port)
test_data['Embarked'] = test_data['Embarked'].fillna(most_common_port)

# Handle Fare (Ticket price): Test set
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())

# Handle Cabin: Too many missing values, so drop the column
train_data.drop(columns=['Cabin'], inplace=True)
test_data.drop(columns=['Cabin'], inplace=True)
```

### Converting Non-Numeric Data

Machine learning models typically only handle numeric data. We need to convert text columns like `Sex` and `Embarked` into numbers.

## Example

```python
# Convert Sex column to numeric: female -> 0, male -> 1
train_data['Sex'] = train_data['Sex'].map({'female': 0, 'male': 1})
test_data['Sex'] = test_data['Sex'].map({'female': 0, 'male': 1})

# Convert Embarked column to numeric (One-Hot Encoding)
# Since there's no ordinal relationship between ports, we shouldn't use simple 0,1,2 mapping
train_data = pd.get_dummies(train_data, columns=['Embarked'])
test_data = pd.get_dummies(test_data, columns=['Embarked'])
```

***

## Step 3: Feature Engineering

Feature engineering is the "magic" of machine learning, where we help models learn better by creating or transforming features.
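As a small illustration of what "transforming" a feature can mean, the sketch below groups a few ages from the sample data into coarse categories with `pd.cut`. The bin edges and the labels (`Child`, `Teen`, `Adult`, `Senior`) are assumptions chosen only for this illustration; they are not part of the tutorial's pipeline.

```python
import pandas as pd

# A few ages taken from the sample train.csv rows
ages = pd.Series([22, 38, 2, 54, 14], name="Age")

# Hypothetical bin edges and labels, chosen only for illustration.
# With the default right=True, the intervals are (0, 12], (12, 18], (18, 60], (60, 120].
bins = [0, 12, 18, 60, 120]
labels = ["Child", "Teen", "Adult", "Senior"]

age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group.tolist())

# The grouped column can then be one-hot encoded like any other categorical feature
age_dummies = pd.get_dummies(age_group, prefix="AgeGroup")
print(age_dummies.columns.tolist())
```

Whether binning helps depends on the model: tree-based models can already split on raw ages, so treat this as something to try and measure rather than a guaranteed improvement.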
Extracting "titles" from the `Name` column is a classic example.

## Example

```python
# Extract titles (e.g., Mr., Mrs., Miss., Master.) from Name column
# Titles often reflect age, social status, and gender, which may affect rescue priority
# The pattern captures the run of letters that follows a space and ends with a period
train_data['Title'] = train_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
test_data['Title'] = test_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# View what titles exist
print(pd.crosstab(train_data['Title'], train_data['Sex']))

# Map rare titles to 'Rare'
title_mapping = {
    'Mr': 'Mr', 'Miss': 'Miss', 'Mrs': 'Mrs', 'Master': 'Master',
    'Dr': 'Rare', 'Rev': 'Rare', 'Col': 'Rare', 'Major': 'Rare',
    'Mlle': 'Miss', 'Countess': 'Rare', 'Ms': 'Miss', 'Lady': 'Rare',
    'Jonkheer': 'Rare', 'Don': 'Rare', 'Dona': 'Rare', 'Mme': 'Mrs',
    'Capt': 'Rare', 'Sir': 'Rare'
}
train_data['Title'] = train_data['Title'].map(title_mapping)
test_data['Title'] = test_data['Title'].map(title_mapping)

# Apply one-hot encoding to the processed Title column
# Note: encoding the two frames separately can produce different columns
# when a title appears in only one of them
train_data = pd.get_dummies(train_data, columns=['Title'])
test_data = pd.get_dummies(test_data, columns=['Title'])

# Create new feature: Family size
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch'] + 1
test_data['FamilySize'] = test_data['SibSp'] + test_data['Parch'] + 1

# Create new feature: Whether traveling alone
train_data['IsAlone'] = (train_data['FamilySize'] == 1).astype(int)
test_data['IsAlone'] = (test_data['FamilySize'] == 1).astype(int)

# Drop unnecessary original columns
columns_to_drop = ['PassengerId', 'Name', 'Ticket', 'SibSp', 'Parch']
train_data.drop(columns_to_drop, axis=1, inplace=True)
test_passenger_ids = test_data['PassengerId']  # Save test set IDs for submission
test_data.drop(columns_to_drop, axis=1, inplace=True)

print("Feature engineering completed. Training set columns:", train_data.columns.tolist())
```

***

## Step 4: Model Selection and Training

Now that we have clean and informative numeric data, we'll split it into **features (X)** and **target variable (y)**, then select a model for training. We'll start with the simple and efficient **Random Forest** model.

## Example

```python
# Import machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Prepare data
# X is the feature matrix, y is the target vector we want to predict
X = train_data.drop('Survived', axis=1)
y = train_data['Survived']

# To evaluate model performance during training, we'll split the data into training and validation sets
# test_size=0.2 means 20% of data will be used for validation, 80% for training
# random_state is a random seed to ensure consistent results
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize random forest classifier
# n_estimators: Number of trees in the forest
# max_depth: Maximum depth of trees, controls model complexity and prevents overfitting
# random_state: Ensures reproducibility
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# Train the model (let the model learn patterns from the data)
model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = model.predict(X_val)

# Evaluate model accuracy
accuracy = accuracy_score(y_val, y_pred)
print(f"Model accuracy on validation set: {accuracy:.4f} (i.e., {accuracy*100:.2f}%)")
```

***

## Step 5: Model Evaluation, Optimization, and Submission

### Evaluation and Optimization

The results from a single training run may not be optimal. We can improve through:

1. **Adjusting model parameters**: Try different `n_estimators` or `max_depth`.
2. **Trying other models**: Such as logistic regression, support vector machines, gradient boosting trees, etc.
3. **Further feature engineering**: For example, binning `Age` or `Fare`.

## Example

```python
# Example: Trying different maximum depths
for depth in [3, 5, 10, None]:  # None means no depth limit
    model_temp = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    model_temp.fit(X_train, y_train)
    y_pred_temp = model_temp.predict(X_val)
    accuracy_temp = accuracy_score(y_val, y_pred_temp)
    print(f"Depth {depth}: Accuracy = {accuracy_temp:.4f}")
```
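The step heading mentions submission, but the code above stops at tuning. One pitfall before predicting on the test set is that one-hot encoding the training and test frames separately can leave them with different columns; in the sample data, for instance, the `Master` title appears only in `train.csv`. The following is a minimal sketch of one way to handle this, assuming small stand-in frames with hypothetical values (`X_train_demo`, `X_test_demo`, and the other `_demo` names are not part of the tutorial). It reindexes the test features to the training columns and builds a two-column result table.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Stand-in training features (hypothetical values) that include Title_Master
X_train_demo = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": [1, 0, 0, 0, 1, 0],
    "Title_Master": [0, 0, 0, 0, 1, 0],
    "Title_Mr": [1, 0, 0, 0, 0, 0],
})
y_train_demo = pd.Series([0, 1, 1, 1, 0, 1], name="Survived")

# Stand-in test features: no passenger has the Master title, so the column is absent
X_test_demo = pd.DataFrame({
    "Pclass": [3, 3, 2],
    "Sex": [1, 0, 1],
    "Title_Mr": [1, 0, 1],
})
test_ids_demo = pd.Series([11, 12, 13], name="PassengerId")

# Align the test columns to the training columns:
# missing columns are added and filled with 0, and the column order is matched
X_test_aligned = X_test_demo.reindex(columns=X_train_demo.columns, fill_value=0)

model_demo = RandomForestClassifier(n_estimators=50, random_state=42)
model_demo.fit(X_train_demo, y_train_demo)

# Pair each saved passenger ID with its predicted label
submission = pd.DataFrame({
    "PassengerId": test_ids_demo,
    "Survived": model_demo.predict(X_test_aligned),
})
print(submission)
```

In the tutorial's own pipeline, the same idea would apply `reindex` to `test_data` using the columns of `X`, and `test_passenger_ids` would supply the ID column. The table could then be written out with `submission.to_csv(..., index=False)`; the output file name is left to you, since the source does not specify one.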