Data Understanding
Before starting any machine learning project, such as predicting house prices, identifying cats and dogs in images, or recommending movies you might like, we first need to face the most fundamental and critical step: data understanding.

You can think of data understanding as a detective carefully studying all the clues and files before investigating a case. Without understanding the origin, authenticity, and meaning of the clues (data), any subsequent reasoning (modeling) may be built on a wrong foundation.

Data understanding is the cornerstone of the entire machine learning workflow. It determines how we clean the data and select models, and it ultimately affects whether the model succeeds or fails.
What is Data Understanding?
Data understanding, as the name suggests, is about deeply understanding the dataset you have. Its core goal is to answer the following questions:

- What data do I have? (Structure and types of data)
- What is the data quality? (Is the data clean, complete, and reliable?)
- What is the data "saying"? (What patterns, relationships, and distributions are hidden in the data?)

This process does not involve complex code or algorithms. It is more about gaining "intuition" about the data through observation, statistics, and visualization.
Core Steps and Tools for Data Understanding
We will use Pandas, the most popular data analysis library in Python, and the visualization libraries Matplotlib and Seaborn for demonstration. Please make sure you have installed them (pip install pandas matplotlib seaborn).

Step 1: First Meeting - Loading and Overview

First, we need to load the data into the program and quickly browse its overall appearance.
Example

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # 1. Load data (here we use the classic Iris dataset as an example; you can also load your own CSV file)
    # Load from the web
    url = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
    df = pd.read_csv(url)

    # Or load from a local file
    # df = pd.read_csv('your_dataset.csv')

    # 2. View the first few rows of data - first impression
    print("First 5 rows of the data:")
    print(df.head())
    print("\n" + "=" * 50 + "\n")

    # 3. View overall information about the data: number of rows, columns, data types, memory usage
    print("Basic dataset information:")
    df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
    print("\n" + "=" * 50 + "\n")

    # 4. View the shape of the data (how many rows, how many columns)
    print(f"Dataset shape: {df.shape}")  # Output: (number of rows, number of columns)
    print(f"Total of {df.shape[0]} samples and {df.shape[1]} columns.")

Code Explanation:
- df.head(): Like browsing the table of contents of a book, quickly view the first few rows of data to understand what the data looks like.
- df.info(): This is the "medical report" of the data. It will tell you:
  - Names of each column (Column)
  - Number of non-null values (Non-Null Count), which can immediately reveal if there is missing data
  - Data types (Dtype), such as int64 (integer), float64 (decimal), object (text or mixed type)
- df.shape: Directly get the dimensions of the data table as (number of rows, number of columns).
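Beyond head(), info(), and shape, two more quick looks often help at this stage: the data type of each column and how many samples fall into each class. The following is only a minimal sketch. It uses a tiny hand-made table named toy (the values are illustrative and not taken from the real Iris file) so that it runs without a network connection.

```python
import pandas as pd

# Tiny hand-made table standing in for a real dataset (illustrative values only)
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1],
    "petal_length": [1.4, 1.4, 4.9, 5.1, 5.9],
    "species": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
})

# Data type of each column: numeric columns vs. text (label) columns
print(toy.dtypes)

# Split column names by kind so later steps can treat them differently
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
label_cols = toy.select_dtypes(exclude="number").columns.tolist()
print("Numeric columns:", numeric_cols)
print("Non-numeric columns:", label_cols)

# Class balance: how many samples each label has
class_counts = toy["species"].value_counts()
print(class_counts)
```

With a real dataset, replace toy with the df loaded above. A strongly unbalanced class count is worth noting early, because it affects how a model should later be evaluated.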
Step 2: Quality Check - Discovering Missing Values and Outliers

Data is rarely perfect. Common "data diseases" include missing values (some positions are empty) and outliers (some numbers are ridiculously large or small).
Example

    # 1. Check for missing values
    print("Number of missing values per feature:")
    print(df.isnull().sum())
    print("\n" + "=" * 50 + "\n")

    # If there are many missing values, calculate the missing ratio
    missing_ratio = df.isnull().sum() / len(df) * 100
    print("Missing value ratio per feature (%):")
    print(missing_ratio)
    print("\n" + "=" * 50 + "\n")

    # 2. Check the statistical summary of numerical features - can reveal clues of outliers
    print("Statistical summary of numerical features:")
    print(df.describe())

Code Explanation:
- df.isnull().sum(): Calculate the total number of null values (NaN) in each column.
- df.describe(): Generate a statistical summary of the numerical columns, including:
  - count: Count of non-null values (can be used to confirm missing values again)
  - mean: Mean
  - std: Standard deviation (how widely the data fluctuates)
  - min: Minimum value
  - 25%, 50% (median), 75%: Quartiles
  - max: Maximum value
- By observing min and max, you can make a preliminary judgment about whether there are outliers (for example, an age column showing 200 years old).
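Reading min and max by eye works for obvious cases such as the 200-year-old example. One common way to make the check repeatable is the 1.5 x IQR rule, sketched below. The helper name iqr_outliers and the ages series are hypothetical and exist only for illustration. The rule itself is a convention, not a proof that a value is wrong.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical age column with one implausible entry (200) and one missing value
ages = pd.Series([23, 31, 35, 29, 41, 38, 200, None, 27, 33], name="age")

# Missing values are counted first, then dropped before the outlier check
n_missing = int(ages.isnull().sum())
print("Missing values:", n_missing)

flagged = iqr_outliers(ages.dropna())
print("Flagged as possible outliers:", flagged.tolist())
```

A flagged value still needs a human decision: it may be a data entry error, or a rare but genuine observation that should be kept.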
Step 3: Deep Insight - Distribution and Relationship Visualization

Text and numbers are abstract, while charts allow us to intuitively "see" the data. This is the most interesting part of data understanding.
Example

    # Set chart style
    sns.set_theme(style="whitegrid")

    # 1. Univariate distribution - understand the distribution of each feature itself
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))  # Create a 2x2 grid of axes
    features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
    colors = ['skyblue', 'lightgreen', 'salmon', 'gold']

    for ax, feature, color in zip(axes.flat, features, colors):
        # Draw a histogram (distribution) with a kernel density estimate curve
        sns.histplot(data=df, x=feature, kde=True, ax=ax, color=color, bins=20)
        ax.set_title(f'Distribution of {feature}', fontsize=14)
        ax.set_xlabel(feature)
        ax.set_ylabel('Frequency')

    plt.tight_layout()
    plt.show()

    # 2. Box plot - view data distribution and outliers (more intuitive)
    plt.figure(figsize=(10, 6))
    # Select numerical columns to draw the box plot
    df_box = df.drop(columns=['species'])  # 'species' is the text label column, so remove it first
    sns.boxplot(data=df_box)
    plt.title('Box plots for numerical features (to check distribution and outliers)', fontsize=14)
    plt.xticks(rotation=45)
    plt.show()

    # 3. Relationship between variables - scatter plot matrix
    print("\nPlotting a scatter plot matrix of feature relationships... (This helps us identify correlations between features)")
    # Use Seaborn's pairplot; the hue parameter colors points by category (such as Iris species)
    sns.pairplot(df, hue='species', height=2.5)
    plt.suptitle('Scatter plot matrix of feature relationships (colored by class)', y=1.02, fontsize=16)
    plt.show()

    # 4. Correlation heatmap - quantify linear relationships between features
    plt.figure(figsize=(8, 6))
    # Calculate correlation coefficients between numerical features
    numeric_df = df.select_dtypes(include=['float64', 'int64'])
    correlation_matrix = numeric_df.corr()
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, square=True)
    plt.title('Feature correlation heatmap', fontsize=14)
    plt.show()

Chart Explanation:
- Histogram: Shows how the values of a feature (such as petal length) are distributed. Are they concentrated in a certain range, or scattered?
- Box plot:
  - The line in the middle of the box represents the median.
  - The lower and upper boundaries of the box represent the 25th percentile (Q1) and the 75th percentile (Q3).
  - The lower and upper "whiskers" usually extend to the most extreme data points that still fall within the range Q1 - 1.5 x IQR to Q3 + 1.5 x IQR, where IQR = Q3 - Q1.
  - Individual points drawn beyond the whiskers are possible outliers and deserve a closer look.
- Scatter plot matrix: View the relationship between every pair of features at once. Points distributed in a band indicate possible correlation.
- Correlation heatmap: Uses colors and numbers (-1 to 1) to represent the degree of linear correlation between two features.
  - 1: Perfect positive correlation (as one increases, the other also increases)
  - -1: Perfect negative correlation (as one increases, the other decreases)
  - 0: No linear relationship (a nonlinear relationship may still exist)
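A heatmap is easy to read for four features, but with many columns it can help to list the strongest pairs as plain numbers. The sketch below is one possible way to do that, assuming Pearson correlation (the default of DataFrame.corr). The toy table and its column names a to d are made up for illustration.

```python
from itertools import combinations

import pandas as pd

# Toy numeric table (illustrative values only): b rises with a, c falls as a rises
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 4.1, 2.9, 2.2, 1.0],
    "d": [3.0, 1.0, 4.0, 1.0, 5.0],
})

corr = toy.corr()  # Pearson correlation by default

# Visit each unordered pair of columns once (no diagonal, no mirrored duplicates)
pairs = {(x, y): corr.loc[x, y] for x, y in combinations(corr.columns, 2)}

# Rank pairs by absolute correlation, strongest first
ranked = sorted(pairs.items(), key=lambda item: abs(item[1]), reverse=True)

for (x, y), r in ranked:
    print(f"{x} vs {y}: r = {r:+.3f}")
```

Sorting by absolute value keeps strong negative relationships near the top, which a sort on the raw coefficient would push to the bottom.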
Output of Data Understanding: A "Data Investigation Report"
After completing the above steps, you should be able to summarize a clear report about the current dataset, for example:

Investigation Report on Iris Dataset

Data Overview: 150 samples in total, 5 columns (4 numerical features: sepal and petal length and width; 1 categorical label: species).

Data Quality: No missing values, all numerical features are within reasonable biological ranges, no obvious outliers found.

Data Insights:

- Petal length (petal_length) and petal width (petal_width) are highly correlated (>0.96), possibly indicating information redundancy.
- Different Iris species are clearly distinguished by petal size, with clear clustering visible in scatter plots.
- The distribution of sepal width (sepal_width) approximates a normal distribution.
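The overview and quality lines of such a report can be collected with a few lines of code, while the insights still come from looking at the charts. The helper below, quick_report, is a hypothetical name and only a sketch. The toy table uses made-up values with one deliberately missing entry.

```python
import pandas as pd

def quick_report(df: pd.DataFrame) -> dict:
    """Collect the headline numbers for the overview and quality sections."""
    numeric = df.select_dtypes(include="number")
    return {
        "n_samples": len(df),
        "n_columns": df.shape[1],
        "numeric_columns": numeric.columns.tolist(),
        "other_columns": [c for c in df.columns if c not in numeric.columns],
        "missing_values": int(df.isnull().sum().sum()),
    }

# Tiny illustrative table (made-up values, one missing entry)
toy = pd.DataFrame({
    "petal_length": [1.4, 4.7, None, 5.9],
    "petal_width": [0.2, 1.4, 1.8, 2.1],
    "species": ["setosa", "versicolor", "virginica", "virginica"],
})

report = quick_report(toy)
for key, value in report.items():
    print(f"{key}: {value}")
```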