# Titanic Survival Prediction | Novice Tutorial
## Titanic Survival Prediction
If you're just starting to learn machine learning, the algorithms and mathematical formulas can feel far removed from the real world. In this tutorial, we'll walk through a complete machine learning project workflow using a classic case study: Titanic survival prediction.
The Titanic dataset is one of the most famous introductory projects in machine learning, based on real passenger information from the 1912 Titanic disaster. Our goal is **to build a model that predicts whether a passenger survived the disaster based on age, gender, ticket class, and other information**.
This project is a classic because it covers the core steps of a machine learning project:
1. **Data Understanding and Exploration**
2. **Data Cleaning and Preprocessing**
3. **Feature Engineering**
4. **Model Selection and Training**
5. **Model Evaluation and Optimization**
Through this practical case, you'll move beyond just reading theory to truly understanding how to apply machine learning to solve real-world problems.
***
## Step 1: Understanding Our Data
Before writing any code, we must first understand the data at hand. The Titanic dataset typically contains the following fields (features):
| Field Name | Description | Data Type | Notes |
| --- | --- | --- | --- |
| `PassengerId` | Passenger ID | Integer | Unique identifier, not helpful for prediction |
| `Survived` | Survival status | Integer (0/1) | **Target variable**, 0=Not survived, 1=Survived |
| `Pclass` | Ticket class | Integer (1,2,3) | 1=1st class, 2=2nd class, 3=3rd class |
| `Name` | Passenger name | String | Contains titles (e.g., Mr., Miss.), can extract new features |
| `Sex` | Gender | String | `male` or `female` |
| `Age` | Age | Float | Some missing values |
| `SibSp` | Number of siblings/spouses | Integer | |
| `Parch` | Number of parents/children | Integer | |
| `Ticket` | Ticket number | String | Complex structure, limited information |
| `Fare` | Ticket price | Float | |
| `Cabin` | Cabin number | String | Many missing values, first letter may indicate cabin area |
| `Embarked` | Embarkation port | String | C=Cherbourg, Q=Queenstown, S=Southampton |
**Key Insight**: Historical accounts tell us that the "women and children first" principle was largely followed, and that first-class passengers had better access to lifeboats. We therefore expect features like `Sex`, `Age`, and `Pclass` to have a significant impact on the predictions.
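This expectation is easy to check before any modeling. The sketch below is only an illustration: it builds a small frame by hand from the ten sample rows used in Step 2 (the variable name `sample` is our own) and compares survival rates by group. With so few rows the numbers are anecdotal; on the full dataset the same `groupby` calls give a more meaningful picture.

```python
import pandas as pd

# Ten passengers taken from the sample train.csv in Step 2 (only three columns kept)
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    'Pclass':   [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    'Sex':      ['male', 'female', 'female', 'female', 'male',
                 'male', 'male', 'male', 'female', 'female'],
})

# The mean of a 0/1 column is the share of survivors in each group
rate_by_sex = sample.groupby('Sex')['Survived'].mean()
rate_by_class = sample.groupby('Pclass')['Survived'].mean()

print(rate_by_sex)
print(rate_by_class)
```

In this tiny sample every woman survived and every man did not, which is far more extreme than the full dataset but points in the expected direction.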
***
## Step 2: Data Cleaning and Preprocessing
Raw data is almost never perfect.
Data cleaning is like preparing high-quality ingredients for a model, and this step is crucial.
Store the following data in a train.csv file:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W",female,27,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas",female,14,1,0,237736,30.0708,,C
Store the following data in a test.csv file:
PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
11,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
12,3,"Wilkes, Mrs. James",female,47,1,0,363272,7,,S
13,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
14,3,"Dwyer, Miss. Ellen",female,18,0,0,330959,7.75,,Q
15,1,"Jones, Mr. Charles",male,,1,0,PC 17603,82.1708,B28,C
We'll use Python's `pandas` and `numpy` libraries to complete this work.
```python
# Import necessary libraries
import pandas as pd
import numpy as np

# Load data
train_data = pd.read_csv('train.csv')  # Training set, contains the target variable Survived
test_data = pd.read_csv('test.csv')    # Test set, does not contain Survived; used for final predictions

# 1. Preliminary data inspection
print("Training set shape:", train_data.shape)
train_data.info()          # View data types and missing values (info() prints directly)
print(train_data.head())   # View the first few rows of data
```

Output:

```
Training set shape: (10, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  10 non-null     int64
 1   Survived     10 non-null     int64
 2   Pclass       10 non-null     int64
 3   Name         10 non-null     object
 4   Sex          10 non-null     object
 5   Age          9 non-null      float64
 6   SibSp        10 non-null     int64
 7   Parch        10 non-null     int64
 8   Ticket       10 non-null     object
 9   Fare         10 non-null     float64
 10  Cabin        3 non-null      object
 11  Embarked     10 non-null     object
dtypes: float64(2), int64(5), object(5)
memory usage: 1.1+ KB
   PassengerId  Survived  Pclass                          Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3       Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1    Cumings, Mrs. John Bradley  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3        Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1  Futrelle, Mrs. Jacques Heath  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3      Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
```

After running the above code, you may notice two main issues: **missing values** and **non-numeric data**.
### Handling Missing Values
```python
# Check the number of missing values in each column
print(train_data.isnull().sum())

# Handle Age: fill with the median of the training set
# (the same training-set value is reused for the test set so no test information leaks in)
age_median = train_data['Age'].median()
train_data['Age'] = train_data['Age'].fillna(age_median)
test_data['Age'] = test_data['Age'].fillna(age_median)

# Handle Embarked (embarkation port): fill with the mode
# mode() returns a Series, so take its first element to get a single value
most_common_port = train_data['Embarked'].mode()[0]
train_data['Embarked'] = train_data['Embarked'].fillna(most_common_port)
test_data['Embarked'] = test_data['Embarked'].fillna(most_common_port)

# Handle Fare (ticket price): fill any missing values in the test set
test_data['Fare'] = test_data['Fare'].fillna(train_data['Fare'].median())

# Handle Cabin: too many missing values, so drop the column
train_data.drop(columns=['Cabin'], inplace=True)
test_data.drop(columns=['Cabin'], inplace=True)
```

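Dropping `Cabin` is the simplest choice, but the field table notes that its first letter may indicate the cabin area. As an alternative sketch (not what the rest of this tutorial uses), the snippet below keeps two pieces of information before the column is discarded: whether a cabin was recorded at all, and the leading letter. The column names `HasCabin` and `Deck` and the placeholder `'U'` for unknown are our own choices.

```python
import numpy as np
import pandas as pd

# A few Cabin values from the sample data, including missing ones
cabins = pd.DataFrame({'Cabin': ['C85', np.nan, 'C123', 'E46', np.nan]})

# 1 if a cabin number was recorded, 0 otherwise
cabins['HasCabin'] = cabins['Cabin'].notna().astype(int)

# First letter of the cabin number; 'U' (unknown) is a placeholder we chose for missing values
cabins['Deck'] = cabins['Cabin'].str[0].fillna('U')

print(cabins)
```

Whether these extra columns help is something to verify on a validation set; with most cabins missing, `HasCabin` may carry more signal than `Deck`.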
### Converting Non-Numeric Data
Machine learning models typically only handle numeric data. We need to convert text columns like `Sex` and `Embarked` into numbers.
```python
# Convert Sex column to numeric: female -> 0, male -> 1
train_data['Sex'] = train_data['Sex'].map({'female': 0, 'male': 1})
test_data['Sex'] = test_data['Sex'].map({'female': 0, 'male': 1})

# Convert Embarked column to numeric (one-hot encoding)
# Since there's no ordinal relationship between ports, we shouldn't use a simple 0, 1, 2 mapping
train_data = pd.get_dummies(train_data, columns=['Embarked'])
test_data = pd.get_dummies(test_data, columns=['Embarked'])
```

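One caveat with calling `pd.get_dummies` on the training and test sets separately: if a category appears in only one of them, the two frames end up with different columns, and a model trained on one cannot be applied to the other. A minimal sketch of one way to handle this, using toy frames of our own, is to reindex the test frame to the training columns and fill the gaps with 0:

```python
import pandas as pd

# Toy frames: the test set happens to contain no passenger from port 'C'
train_toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q']})
test_toy = pd.DataFrame({'Embarked': ['Q', 'S']})

train_enc = pd.get_dummies(train_toy, columns=['Embarked'], dtype=int)
test_enc = pd.get_dummies(test_toy, columns=['Embarked'], dtype=int)
print(test_enc.columns.tolist())   # Embarked_C is missing here

# Give the test frame exactly the training columns, in the same order;
# categories that never occurred in the test data become all-zero columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc.columns.tolist())
```

The same idea applies to any one-hot encoded column, not just `Embarked`.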
***
## Step 3: Feature Engineering
Feature engineering is the "magic" of machine learning, where we help models learn better by creating or transforming features. Extracting "titles" from the `Name` column is a classic example.
```python
# Extract titles (e.g., Mr, Mrs, Miss, Master) from the Name column
# Titles often reflect age, social status, and gender, which may affect rescue priority
# Pattern: a space, one or more letters (captured), then a literal period
train_data['Title'] = train_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
test_data['Title'] = test_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# View which titles exist
print(pd.crosstab(train_data['Title'], train_data['Sex']))

# Normalize equivalent titles and map rare titles to 'Rare'
title_mapping = {
    'Mr': 'Mr', 'Miss': 'Miss', 'Mrs': 'Mrs',
    'Master': 'Master', 'Dr': 'Rare', 'Rev': 'Rare',
    'Col': 'Rare', 'Major': 'Rare', 'Mlle': 'Miss',
    'Countess': 'Rare', 'Ms': 'Miss', 'Lady': 'Rare',
    'Jonkheer': 'Rare', 'Don': 'Rare', 'Dona': 'Rare',
    'Mme': 'Mrs', 'Capt': 'Rare', 'Sir': 'Rare'
}
# Titles not listed in the mapping would become NaN, so treat them as 'Rare' as well
train_data['Title'] = train_data['Title'].map(title_mapping).fillna('Rare')
test_data['Title'] = test_data['Title'].map(title_mapping).fillna('Rare')

# Apply one-hot encoding to the processed Title column
train_data = pd.get_dummies(train_data, columns=['Title'])
test_data = pd.get_dummies(test_data, columns=['Title'])

# Create new feature: family size (siblings/spouses + parents/children + the passenger)
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch'] + 1
test_data['FamilySize'] = test_data['SibSp'] + test_data['Parch'] + 1

# Create new feature: whether traveling alone
train_data['IsAlone'] = (train_data['FamilySize'] == 1).astype(int)
test_data['IsAlone'] = (test_data['FamilySize'] == 1).astype(int)

# Drop unnecessary original columns
columns_to_drop = ['PassengerId', 'Name', 'Ticket', 'SibSp', 'Parch']
train_data.drop(columns=columns_to_drop, inplace=True)
test_passenger_ids = test_data['PassengerId']  # Save test set IDs for submission
test_data.drop(columns=columns_to_drop, inplace=True)

# Make sure the test set has exactly the training feature columns
# (a title that appears in only one set would otherwise leave mismatched columns)
feature_columns = train_data.columns.drop('Survived')
test_data = test_data.reindex(columns=feature_columns, fill_value=0)

print("Feature engineering completed. Training set columns:", train_data.columns.tolist())
```

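To see what the title extraction does in isolation, here is a small self-contained sketch. It assumes the pattern ` ([A-Za-z]+)\.` (a space, a run of letters, then a literal period), which is a common choice for this dataset; the names are taken from the sample data.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
])

# Capture the letters between a space and the first period that follows them
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss', 'Master']
```

Because surnames are followed by a comma rather than a period, the first match in each name is the title.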
***
## Step 4: Model Selection and Training
Now that we have clean and informative numeric data, we'll split it into **features (X)** and **target variable (y)**, then select a model for training.
We'll start with the simple and efficient **Random Forest** model.
```python
# Import machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Prepare data
# X is the feature matrix, y is the target vector we want to predict
X = train_data.drop('Survived', axis=1)
y = train_data['Survived']

# To evaluate model performance during training, we'll split the data into training and validation sets
# test_size=0.2 means 20% of data will be used for validation, 80% for training
# random_state is a random seed to ensure consistent results
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize random forest classifier
# n_estimators: Number of trees in the forest
# max_depth: Maximum depth of trees, controls model complexity and helps limit overfitting
# random_state: Ensures reproducibility
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# Train the model (let the model learn patterns from the data)
model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = model.predict(X_val)

# Evaluate model accuracy
accuracy = accuracy_score(y_val, y_pred)
print(f"Model accuracy on validation set: {accuracy:.4f} (i.e., {accuracy*100:.2f}%)")
```

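A single 80/20 split gives only one accuracy number, and on a small validation set that number can swing considerably depending on which rows land in it. K-fold cross-validation is a common way to get a steadier estimate. The sketch below uses scikit-learn's `cross_val_score`; the data is a synthetic stand-in generated with `make_classification` (not the Titanic features) so that the block runs on its own. In the tutorial you would pass `X` and `y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

model_cv = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# 5-fold cross-validation: train on 4 folds, score on the 5th, rotate, repeat
scores = cross_val_score(model_cv, X_demo, y_demo, cv=5, scoring='accuracy')

print("Accuracy per fold:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

The spread across folds is as informative as the mean: a large spread suggests the single-split accuracy should not be trusted too much.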
***
## Step 5: Model Evaluation, Optimization, and Submission
### Evaluation and Optimization
A single training run rarely gives the best possible result. We can improve it by:
1. **Adjusting model parameters**: Try different `n_estimators` or `max_depth`.
2. **Trying other models**: Such as logistic regression, support vector machines, gradient boosting trees, etc.
3. **Further feature engineering**: For example, binning `Age` or `Fare`.
```python
# Example: Trying different maximum depths
for depth in [3, 5, 10, None]:  # None means no depth limit
    model_temp = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    model_temp.fit(X_train, y_train)
    y_pred_temp = model_temp.predict(X_val)
    accuracy_temp = accuracy_score(y_val, y_pred_temp)
    print(f"Depth {depth}: Accuracy = {accuracy_temp:.4f}")
```
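
The step title also mentions submission, and `test_passenger_ids` was saved earlier for that purpose. As a sketch of that last step, assuming the two-column format (`PassengerId`, `Survived`) commonly used for this dataset on Kaggle: the toy frames and the names `X_train_toy`, `X_test_toy`, `test_ids_toy`, and `submission` below are stand-ins of our own. In the tutorial you would use `model`, the processed `test_data`, and `test_passenger_ids`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the processed training features, labels, and test features
X_train_toy = pd.DataFrame({'Pclass': [3, 1, 3, 1, 3, 2], 'Sex': [1, 0, 0, 0, 1, 0]})
y_train_toy = pd.Series([0, 1, 1, 1, 0, 1])
X_test_toy = pd.DataFrame({'Pclass': [3, 3, 2], 'Sex': [1, 0, 1]})
test_ids_toy = pd.Series([11, 12, 13], name='PassengerId')

model_toy = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model_toy.fit(X_train_toy, y_train_toy)

# The test features must have the same columns, in the same order, as the training features
predictions = model_toy.predict(X_test_toy[X_train_toy.columns])

submission = pd.DataFrame({
    'PassengerId': test_ids_toy,
    'Survived': predictions,
})

# With no path, to_csv returns the CSV text; pass a file name such as
# 'submission.csv' to write it to disk instead
print(submission.to_csv(index=False))
```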