Pandas Dataframe
DataFrame is another core data structure in Pandas, similar to a two-dimensional table or a database table.
DataFrame is a tabular data structure containing a set of ordered columns, where each column can be of different value types (numeric, string, boolean).
DataFrame has both row and column indices and can be thought of as a dictionary of Series objects (sharing a common index).
DataFrame provides various functions for data access, filtering, splitting, merging, reshaping, aggregation, and transformation.
DataFrame is a very flexible and powerful data structure, widely used in tasks such as data analysis, cleaning, transformation, and visualization.
**DataFrame Features:**
* **Two-dimensional structure:** `DataFrame` is a two-dimensional table that can be viewed as an Excel spreadsheet or SQL table, with rows and columns. It can be considered as a dictionary composed of multiple `Series` objects.
* **Column data types:** Different columns can contain different data types, such as integers, floats, strings, or Python objects.
* **Index:** `DataFrame` can have row and column indices, similar to row numbers and column labels in Excel.
* **Mutable size:** Columns can be added and removed, similar to dictionaries in Python.
* **Automatic alignment:** During arithmetic operations or data alignment, `DataFrame` automatically aligns indices.
* **Handling missing data:** `DataFrame` can contain missing data, which Pandas represents using `NaN` (Not a Number).
* **Data operations:** Supports data slicing, indexing, subsetting, and other operations.
* **Time series support:** `DataFrame` has special support for time series data, allowing easy slicing, indexing, and operations on time data.
* **Rich data access functions:** Flexible data access and filtering through methods like `.loc`, `.iloc`, and `.query()`.
* **Flexible data processing functions:** Includes data merging, reshaping, pivoting, grouping, and aggregation.
* **Data visualization:** Although `DataFrame` itself is not a visualization tool, it can be used with visualization libraries like Matplotlib or Seaborn for data visualization.
* **Efficient data input/output:** Convenient reading and writing of data in various formats, such as CSV, Excel, SQL databases, and HDF5.
* **Descriptive statistics:** Provides methods to compute descriptive statistics, such as `.describe()`, `.mean()`, `.sum()`, etc.
* **Flexible data alignment and integration:** Easily merge, join, or update with other `DataFrame` or `Series` objects.
* **Transformation functions:** Transform values in a dataset, for example, using the `.apply()` method to apply custom functions.
* **Rolling window and time series analysis:** Supports rolling window statistics and time series analysis on datasets.
!(#)
!(#)
The DataFrame constructor is as follows:
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Parameter description:
* `data`: The data part of the DataFrame, which can be a dictionary, 2D array, Series, DataFrame, or other objects convertible to DataFrame. If this parameter is not provided, an empty DataFrame is created.
* `index`: The row index of the DataFrame, used to identify each row of data. It can be a list, array, index object, etc. If this parameter is not provided, a default integer index is created.
* `columns`: The column index of the DataFrame, used to identify each column of data. It can be a list, array, index object, etc. If this parameter is not provided, a default integer index is created.
* `dtype`: Specifies the data type of the DataFrame. It can be a NumPy data type, such as `np.int64`, `np.float64`, etc. If this parameter is not provided, the data type is automatically inferred from the data.
* `copy`: Whether to copy the data. The default is False, meaning no data is copied. If set to True, the input data is copied.
Pandas DataFrame is a two-dimensional array structure, similar to a 2D array.
## Example - Creating with a List
import pandas as pd
data =[['Google',10],['',12],['Wiki',13]]
# Create DataFrame
df = pd.DataFrame(data, columns=['Site','Age'])
# Use astype method to set the data type for each column
df['Site']= df['Site'].astype(str)
df['Age']= df['Age'].astype(float)
print(df)
You can also create it using a dictionary:
## Example - Creating with a Dictionary
import pandas as pd
data ={'Site':['Google','','Wiki'],'Age':[10,12,13]}
df = pd.DataFrame(data)
print(df)
The output is as follows:
!(#)
The following example uses ndarrays. The length of the ndarray must be the same. If an index is passed, the length of the index should equal the length of the array. If no index is passed, by default, the index will be range(n), where n is the length of the array.
For ndarrays, refer to: (#)
## Example - Creating with ndarrays
import numpy as np
import pandas as pd
# Create a 2D ndarray containing website and age
ndarray_data = np.array([
['Google',10],
['',12],
['Wiki',13]
])
# Create DataFrame using the DataFrame constructor
df = pd.DataFrame(ndarray_data, columns=['Site','Age'])
# Print the DataFrame
print(df)
The output is as follows:
!(#)
From the above output, we can see that the DataFrame data type is a table containing rows and columns:
!(#)
You can also use a dictionary (key/value), where the dictionary keys are column names:
## Example - Creating with a Dictionary
import pandas as pd
data =[{'a': 1,'b': 2},{'a': 5,'b': 10,'c': 20}]
df = pd.DataFrame(data)
print(df)
The output is:
a b c 0 1 2 NaN1 5 10 20.0
Data for non-corresponding parts is **NaN**.
Pandas can use the loc attribute to return data for a specified row. If no index is set, the first row index is **0**, the second row index is **1**, and so on:
## Example
import pandas as pd
data ={
"calories": [420,380,390],
"duration": [50,40,45]
}
# Load data into a DataFrame object
df = pd.DataFrame(data)
# Return the first row
print(df.loc)
# Return the second row
print(df.loc)
The output is as follows:
calories 420 duration 50Name: 0, dtype: int64 calories 380 duration 40Name: 1, dtype: int64
**Note:** The returned result is actually a Pandas Series data.
You can also return multiple rows using the [[ ... ]] format, where ... is the index of each row, separated by commas:
## Example
import pandas as pd
data ={
"calories": [420,380,390],
"duration": [50,40,45]
}
# Load data into a DataFrame object
df = pd.DataFrame(data)
# Return the first and second rows
print(df.loc[[0,1]])
The output is:
calories duration 0 420 501 380 40
**Note:** The returned result is actually a Pandas DataFrame data.
We can specify index values, as in the following example:
## Example
import pandas as pd
data ={
"calories": [420,380,390],
"duration": [50,40,45]
}
df = pd.DataFrame(data, index =["day1","day2","day3"])
print(df)
The output is:
calories duration day1 420 50 day2 380 40 day3 390 45
Pandas can use the loc attribute to return the row corresponding to a specified index:
## Example
import pandas as pd
data ={
"calories": [420,380,390],
"duration": [50,40,45]
}
df = pd.DataFrame(data, index =["day1","day2","day3"])
# Specify the index
print(df.loc)
The output is:
calories 380 duration 40Name: day2, dtype: int64
* * *
## DataFrame Methods
Common operations and methods for DataFrame are shown in the table below:
| **Method Name** | **Function Description** |
| --- | --- |
| `head(n)` | Returns the first n rows of the DataFrame (default is first 5 rows) |
| `tail(n)` | Returns the last n rows of the DataFrame (default is last 5 rows) |
| `info()` | Displays brief information about the DataFrame, including column names, data types, non-null counts, etc. |
| `describe()` | Returns statistical information for numeric columns of the DataFrame, such as mean, standard deviation, min, etc. |
| `shape` | Returns the number of rows and columns of the DataFrame (rows, columns) |
| `columns` | Returns all column names of the DataFrame |
| `index` | Returns the row index of the DataFrame |
| `dtypes` | Returns the numeric data type of each column |
| `sort_values(by)` | Sorts by a specified column |
| `sort_index()` | Sorts by row index |
| `dropna()` | Drops rows or columns containing missing values (NaN) |
| `fillna(value)` | Fills missing values with a specified value |
| `isnull()` | Checks for missing values, returns a boolean DataFrame |
| `notnull()` | Checks for non-missing values, returns a boolean DataFrame |
| `loc[]` | Selects data by label index |
| `iloc[]` | Selects data by position index |
| `at[]` | Accesses a single element in the DataFrame (more efficient than `loc[]`) |
| `iat[]` | Accesses a single element in the DataFrame (more efficient than `iloc[]`) |
| `apply(func)` | Applies a function to the DataFrame or Series |
| `applymap(func)` | Applies a function to each element of the DataFrame (only for DataFrame) |
| `groupby(by)` | Grouping operation, used for summary statistics by a certain column |
| `pivot_table()` | Creates a pivot table |
| `merge()` | Merges multiple DataFrames (similar to SQL JOIN operation) |
| `concat()` | Concatenates multiple DataFrames by row or column |
| `to_csv()` | Exports the DataFrame to a CSV file |
| `to_excel()` | Exports the DataFrame to an Excel file |
| `to_json()` | Exports the DataFrame to JSON format |
| `to_sql()` | Exports the DataFrame to a SQL database |
| `query()` | Queries the DataFrame using SQL-style syntax |
| `duplicated()` | Returns a boolean DataFrame indicating whether each row is a duplicate |
| `drop_duplicates()` | Drops duplicate rows |
| `set_index()` | Sets the index of the DataFrame |
| `reset_index()` | Resets the index of the DataFrame |
| `transpose()` | Transposes the DataFrame (swaps rows and columns) |
## Example
import pandas as pd
# Create DataFrame
data ={
'Name': ['Alice','Bob','Charlie','David'],
'Age': [25,30,35,40],
'City': ['New York','Los Angeles','Chicago','Houston']
}
df = pd.DataFrame(data)
# View the first two rows of data
print(df.head(2))
# View basic information about the DataFrame
print(df.info())
# Get descriptive statistics
print(df.describe())
# Sort by age
df_sorted = df.sort_values(by='Age', ascending=False)
print(df_sorted)
# Select specified columns
print(df[['Name','Age']])
# Select rows by index
print(df.iloc[1:3])# Select the second to third rows (by position)
# Select rows by label
print(df.loc[1:2])# Select the second to third rows (by label)
# Calculate grouped statistics (group by city, calculate average age)
print(df.groupby('City')['Age'].mean())
# Handle missing values (fill missing values)
df['Age']= df['Age'].fillna(30)
# Export to CSV file
df.to_csv('output.csv', index=False)
The output is:
# View the first two rows of data Name Age City0 Alice 25 New York1 Bob 30 Los Angeles# View basic information about the DataFrameRangeIndex: 4 entries, 0 to 3Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 4 non-null object 1 Age 4 non-null int64 2 City 4 non-null object dtypes: int64(1), object(2) memory usage: 148.0+ bytes # Get descriptive statistics Age count 4.000000 mean 32.500000 std 6.454972 min 25.00000025% 27.50000050% 32.50000075% 37.500000 max 40.000000# Sort by age Name Age City3 David 40 Houston2 Charlie 35 Chicago1 Bob 30 Los Angeles0 Alice 25 New York# Select rows by label Name Age City1 Bob 30 Los Angeles2 Charlie 35 Chicago# Calculate grouped statistics (group by city, calculate average age)CityChicago 35.0Houston 40.0Los Angeles 30.0New York 25.0Name: Age, dtype: float64
* * *
## More DataFrame Explanations
### Creating DataFrame
**From a dictionary:** The dictionary keys become column names, and the values become column data.
## Example
import pandas as pd
# Create DataFrame from a dictionary
df = pd.DataFrame({'Column1': [1,2,3],'Column2': [4,5,6]})
**From a list of lists:** The outer list represents rows, and the inner list represents columns.
# Create DataFrame from a list of lists
## Example
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],
columns=['Column1','Column2','Column3'])
**From a NumPy array:** Provide a 2D NumPy array.
## Example
import numpy as np
# Create DataFrame from a NumPy array
df = pd.DataFrame(np.array([[1,2,3],[4,5,6],[7,8,9]]))
**From Series to create DataFrame:** Using pd.Series().
## Example
# Create DataFrame from Series
s1 = pd.Series(['Alice','Bob','Charlie'])
s2 = pd.Series([25,30,35])
s3 = pd.Series(['New York','Los Angeles','Chicago'])
df = pd.DataFrame({'Name': s1,'Age': s2,'City': s3})
### DataFrame Attributes and Methods
DataFrame objects have many attributes and methods for data operations, indexing, and processing, such as shape, columns, index, head(), tail(), info(), describe(), mean(), sum(), etc.
## Example
# DataFrame attributes and methods
print(df.shape)# Shape
print(df.columns)# Column names
print(df.index)# Index
print(df.head())# First few rows, default is first 5 rows
print(df.tail())# Last few rows, default is last 5 rows
print(df.info())# Data information
print(df.describe())# Descriptive statistics
print(df.mean())# Calculate mean
print(df.sum())# Calculate sum
### Accessing DataFrame Elements
**Accessing columns:** Use column names as attributes or via .loc[], .iloc[], or with label or position indexing.
## Example
# Access by column name
print(df['Column1'])
# Access by attribute
print(df.Name)
# Access via .loc[]
print(df.loc[:,'Column1'])
# Access via .iloc[]
print(df.iloc[:,0])# Assuming 'Column1' is the first column
# Access a single element
print(df['Name'])
**Accessing rows:** Use row labels and .loc[].
## Example
# Access by row label
print(df.loc[0,'Column1'])
### Modifying DataFrame
**Modifying column data:** Assign values directly to the column.
df['Column1'] = [10, 11, 12]
**Adding a new column:** Assign values to the new column.
df['NewColumn'] = [100, 200, 300]
**Adding a new row:** Use loc, append, or concat methods.
## Example
# Use loc to add a new row at a specific index
df.loc=[13,14,15,16]
# Use append to add a new row at the end
new_row ={'Column1': 13,'Column2': 14,'NewColumn': 16}
df = df.append(new_row, ignore_index=True)
**Note:** The **append()** method has been deprecated in pandas version 1.4.0 and will be removed in future versions. The official recommendation is to use concat() as an alternative for data merging operations.
The concat() method is used to merge two or more DataFrames. When you want to add a row to another DataFrame, you can treat the new row as a new DataFrame and then use concat():
## Example
# Use concat to add a new row
new_row = pd.DataFrame([[4,7]], columns=['A','B'])# Create a DataFrame containing only the new row
df = pd.concat([df, new_row], ignore_index=True)# Add the new row to the original DataFrame
print(df)
### Deleting DataFrame Elements
**Deleting a column:** Use the drop method.
df_dropped = df.drop('Column1', axis=1)
**Deleting a row:** Also use the drop method.
df_dropped = df.drop(0) # Delete the row with index 0
### DataFrame Statistical Analysis
**Descriptive statistics:** Use .describe() to view statistical summaries of numeric columns.
df.describe()
**Calculating statistics:** Use aggregation functions like .sum(), .mean(), .max(), etc.
df['Column1'].sum() df.mean()
### DataFrame Index Operations
**Resetting the index:** Use .reset_index().
df_reset = df.reset_index(drop=True)
**Setting the index:** Use .set_index().
df_set = df.set_index('Column1')
### DataFrame Boolean Indexing
Use boolean expressions: Filter the DataFrame based on conditions.
df[df['Column1'] > 2]
### DataFrame Data Types
View data types: Use the dtypes attribute.
df.dtypes
**Convert data types:** Use the astype method.
df['Column1'] = df['Column1'].astype('float64')
### DataFrame Merging and Splitting
**Merging:** Use concat or merge methods.
# Vertical merge pd.concat([df1, df2], ignore_index=True)# Horizontal merge pd.merge(df1, df2, on='Column1')
**Splitting:** Use pivot, melt, or custom functions.
# Long format to wide format df_pivot = df.pivot(index='Column1', columns='Column2', values='Column3')# Wide format to long format df_melt = df.melt(id_vars='Column1', value_vars=['Column2', 'Column3'])
### Indexing and Slicing
DataFrame supports indexing and slicing operations on rows and columns.
## Example
# Indexing and slicing
print(df[['Name','Age']])# Extract multiple columns
print(df[1:3])# Slice rows
print(df.loc[:,'Name'])# Extract a single column
print(df.loc[1:2,['Name','Age']])# Label indexing to extract specified rows and columns
print(df.iloc[:,1:])# Position indexing to extract specified columns
### Notes
* `DataFrame` is a flexible data structure that can hold columns of different data types.
* Column names and row indices can be strings, integers, etc.
* `DataFrame` allows data selection, filtering, modification, and analysis in various ways.
* Through operations on `DataFrame`, you can perform data cleaning, transformation, analysis, and visualization.
YouTip