
Pandas pd.read_csv()

`read_csv()` is the most commonly used data reading function in the pandas library. It reads data from a CSV (Comma-Separated Values) file and returns a DataFrame.

CSV is a simple, widely used data exchange format that stores tabular data as plain text: each line is one record, and fields are separated by commas.

By default, `read_csv()` takes column names from the first row and infers a data type for each column. Its parameters let you handle other separators, file encodings, and missing-value markers.

* * *

## Basic Syntax and Parameters

### Syntax Format

    pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, dtype=None, skiprows=None, nrows=None, na_values=None, ...)

### Parameter Description

| Parameter | Type | Description | Default Value |
| --- | --- | --- | --- |
| filepath_or_buffer | str, path object, or file-like object | Path to the CSV file, URL, or file object | Required |
| sep | str | Field separator (`delimiter` is an alias for `sep`) | ',' |
| header | int, list of int, 'infer' | Row number to use as column names; 0 means the first row | 'infer' |
| names | list-like | Custom list of column names | None |
| index_col | int, str, list of int, list of str, False | Column(s) to use as the row index | None |
| usecols | list-like, callable | Read only the specified columns | None |
| dtype | dtype or dict | Data type for all columns, or a per-column mapping, e.g., {'a': np.float64} | None |
| skiprows | list-like, int, callable | Row numbers to skip (0-indexed), or the number of rows to skip at the start of the file | None |
| nrows | int | Read only the first n data rows | None |
| na_values | scalar, str, list-like, dict | Additional values to recognize as NA/NaN | None |
| encoding | str | File encoding, such as 'utf-8' | None (read as UTF-8) |

### Return Value

* **Return Type**: `pd.DataFrame`
* Returns a two-dimensional labeled data structure, i.e., a pandas DataFrame, which can be used for data analysis and processing. When `chunksize` or `iterator=True` is passed, an iterator that yields DataFrames is returned instead.
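The parameter table lists `dtype` and notes that `filepath_or_buffer` also accepts a file-like object. The following is a minimal sketch of both, assuming a small made-up dataset (the `id` and `score` columns are hypothetical) held in an in-memory `io.StringIO` buffer instead of a file on disk. It shows why `dtype` matters when type inference would otherwise drop leading zeros.

```python
import io

import pandas as pd

# Hypothetical sample data; the id values have leading zeros
csv_text = "id,score\n001,9.5\n002,7.0\n"

# Default inference: the id column is parsed as integers, so "001" becomes 1
df_default = pd.read_csv(io.StringIO(csv_text))

# dtype={"id": str} keeps the id column as text, preserving the leading zeros
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(df_default["id"].tolist())  # [1, 2]
print(df_typed["id"].tolist())    # ['001', '002']
```

A fresh `StringIO` object is created for each call because a buffer that has already been read to the end would yield no data on a second read.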
* * *

## Examples

The following examples walk through the most common uses of `read_csv()`.

### Example 1: Reading Local CSV Files

First, create a simple CSV file, then use `read_csv()` to read it.

## Example

    import pandas as pd

    # Create an example CSV file with some test data
    data = """name,age,city,salary
    Tom,28,Beijing,8000
    Jerry,35,Shanghai,12000
    Mike,42,Guangzhou,15000
    Lucy,26,Shenzhen,7000
    """

    # Write data to file (in actual use, you can directly read existing files)
    with open('employees.csv', 'w', encoding='utf-8') as f:
        f.write(data)

    # Use read_csv to read the CSV file
    # filepath_or_buffer: file path (required)
    df = pd.read_csv('employees.csv')

    # View the result
    print("Read DataFrame:")
    print(df)
    print("\nData types:")
    print(df.dtypes)
    print("\nColumn names:", df.columns.tolist())

**Expected Run Result:**

    Read DataFrame:
        name  age       city  salary
    0    Tom   28    Beijing    8000
    1  Jerry   35   Shanghai   12000
    2   Mike   42  Guangzhou   15000
    3   Lucy   26   Shenzhen    7000

    Data types:
    name      object
    age        int64
    city      object
    salary     int64
    dtype: object

    Column names: ['name', 'age', 'city', 'salary']

**Code Explanation:**

* `pd.read_csv('employees.csv')` is the most basic usage: pass only the file path.
* By default, the first row is used as the column names (`header='infer'`).
* Pandas infers a data type for each column: strings become object, integers become int64.
* The returned DataFrame can be viewed directly with `print()` or analyzed further.

### Example 2: Customizing Column Names and Selecting Specific Columns

In real work, we may need to customize column names, or read only specific columns to reduce memory use and parsing time.
## Example

    import pandas as pd

    # Create test data
    data = """name,age,city,salary,department
    Tom,28,Beijing,8000,IT
    Jerry,35,Shanghai,12000,HR
    Mike,42,Guangzhou,15000,Sales
    Lucy,26,Shenzhen,7000,IT
    """

    with open('employees2.csv', 'w', encoding='utf-8') as f:
        f.write(data)

    # Example 2a: Customizing column names
    # names supplies the new column names; header=0 marks the first row as
    # the existing header so that it is replaced instead of read as data
    df_custom = pd.read_csv('employees2.csv',
                            names=['Name', 'Age', 'City', 'Salary', 'Department'],
                            header=0)
    print("DataFrame after customizing column names:")
    print(df_custom)
    print()

    # Example 2b: Reading only specified columns
    # usecols limits which columns are parsed, saving memory and time
    df_partial = pd.read_csv('employees2.csv', usecols=['name', 'salary'])
    print("Reading only partial columns:")
    print(df_partial)
    print()

    # Example 2c: Using index_col to set the index column
    df_indexed = pd.read_csv('employees2.csv', index_col='name')
    print("Setting name as index:")
    print(df_indexed)

**Expected Run Result:**

    DataFrame after customizing column names:
        Name  Age       City  Salary  Department
    0    Tom   28    Beijing    8000          IT
    1  Jerry   35   Shanghai   12000          HR
    2   Mike   42  Guangzhou   15000       Sales
    3   Lucy   26   Shenzhen    7000          IT

    Reading only partial columns:
        name  salary
    0    Tom    8000
    1  Jerry   12000
    2   Mike   15000
    3   Lucy    7000

    Setting name as index:
           age       city  salary  department
    name
    Tom     28    Beijing    8000          IT
    Jerry   35   Shanghai   12000          HR
    Mike    42  Guangzhou   15000       Sales
    Lucy    26   Shenzhen    7000          IT

**Code Explanation:**

* The `names` list should contain one name per column in the data. If the file already has a header row, also pass `header=0` so that the original header is replaced by `names`; without it, the header row would be read as a data row.
* `usecols` accepts a list of column names (recommended) or a list of column indexes. The order of the list is ignored: columns are returned in the order they appear in the file.
* `index_col` sets a column as the row index, which makes it easy to look up rows by label.

### Example 3: Handling Special Data and Missing Value Processing

CSV files may contain missing values, use a separator other than a comma, or start with rows that need to be skipped.
## Example

    import pandas as pd

    # Create a CSV file with special data
    # Using semicolon as separator, including missing values and NA values
    data = """name;age;city;salary
    Tom;28;Beijing;8000
    Jerry;;Shanghai;12000
    Mike;42;Guangzhou;
    Lucy;26;NA;7000
    """

    with open('employees3.csv', 'w', encoding='utf-8') as f:
        f.write(data)

    # Example 3a: Reading semicolon-separated files
    df_semicolon = pd.read_csv('employees3.csv', sep=';')
    print("Using semicolon separator:")
    print(df_semicolon)
    print("Missing value statistics:")
    print(df_semicolon.isnull())
    print()

    # Example 3b: Specifying which values are considered missing
    # 'NA' is already a default missing-value marker; 'missing' is an extra one
    df_na = pd.read_csv('employees3.csv', sep=';', na_values=['NA', 'missing'])
    print("After customizing NA values:")
    print(df_na)
    print()

    # Example 3c: Skipping rows and limiting row count
    # Assume the first few lines are comments, which can be skipped
    data_with_comment = """# This is an employee data file
    # Creation date: 2024-01-01
    name;age;city;salary
    Tom;28;Beijing;8000
    Jerry;35;Shanghai;12000
    Mike;42;Guangzhou;15000
    Lucy;26;Shenzhen;7000
    """

    with open('employees4.csv', 'w', encoding='utf-8') as f:
        f.write(data_with_comment)

    # skiprows skips the first two lines (comment lines)
    df_skip = pd.read_csv('employees4.csv', sep=';', skiprows=2)
    print("After skipping comment lines:")
    print(df_skip)
    print()

    # nrows reads only the first 3 data rows
    df_nrows = pd.read_csv('employees4.csv', sep=';', skiprows=2, nrows=3)
    print("Reading only the first 3 rows:")
    print(df_nrows)

**Expected Run Result:**

    Using semicolon separator:
        name   age       city   salary
    0    Tom  28.0    Beijing   8000.0
    1  Jerry   NaN   Shanghai  12000.0
    2   Mike  42.0  Guangzhou      NaN
    3   Lucy  26.0        NaN   7000.0
    Missing value statistics:
        name    age   city  salary
    0  False  False  False   False
    1  False   True  False   False
    2  False  False  False    True
    3  False  False   True   False

    After customizing NA values:
        name   age       city   salary
    0    Tom  28.0    Beijing   8000.0
    1  Jerry   NaN   Shanghai  12000.0
    2   Mike  42.0  Guangzhou      NaN
    3   Lucy  26.0        NaN   7000.0

    After skipping comment lines:
        name  age       city  salary
    0    Tom   28    Beijing    8000
    1  Jerry   35   Shanghai   12000
    2   Mike   42  Guangzhou   15000
    3   Lucy   26   Shenzhen    7000

    Reading only the first 3 rows:
        name  age       city  salary
    0    Tom   28    Beijing    8000
    1  Jerry   35   Shanghai   12000
    2   Mike   42  Guangzhou   15000

**Code Explanation:**

* The `sep` parameter can specify any separator, such as a semicolon or a tab.
* By default, empty fields and strings such as `NA`, `N/A`, `NaN`, and `null` are recognized as missing values, which is why the `NA` city is already NaN in Example 3a. The `na_values` parameter adds further markers to this default set, so Example 3b produces the same result for this file.
* Columns that contain missing values are read as float64, so `age` and `salary` are shown as `28.0` and `8000.0`.
* `skiprows` can skip a given number of lines at the beginning of the file, which is useful for files that start with comments.
* `nrows` limits the number of data rows read, which is suitable for previewing the start of a large file.

* * *

## Notes

* When reading large files, consider using the `chunksize` parameter for chunked reading to avoid memory issues.
* When reading files that contain Chinese text, specify the correct `encoding` parameter. Common encodings include 'utf-8', 'gbk', and 'gb2312'.
* If the CSV file has no header row, set `header=None` and specify column names via the `names` parameter.
* For non-standard CSV files, you may need to adjust parameters such as `sep` and `quotechar` for correct parsing.

* * *

## Summary

`read_csv()` is the most fundamental data reading function in pandas. It supports multiple separators, custom column names, index setting, and missing value handling. Knowing how to use its parameters lets you handle a wide range of CSV file formats and lays a solid foundation for subsequent data cleaning and analysis. Readers are encouraged to practice with these common parameters until they become familiar.
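Two points from the Notes are not covered by the examples: chunked reading with `chunksize`, and header-less files read with `header=None` plus `names`. The following is a minimal sketch of both together, assuming a small made-up dataset held in an in-memory `io.StringIO` buffer (the column names and the `chunksize=2` value are chosen only for illustration; a real large file would use a much bigger chunk size).

```python
import io

import pandas as pd

# Hypothetical header-less data: the first line is already a record
raw = "Tom,28,8000\nJerry,35,12000\nMike,42,15000\nLucy,26,7000\nAnn,31,9000\n"

total_salary = 0
rows_seen = 0

# With chunksize set, read_csv returns an iterator instead of one DataFrame;
# each iteration yields a DataFrame holding at most 2 rows
reader = pd.read_csv(
    io.StringIO(raw),
    header=None,                     # the file has no header row
    names=["name", "age", "salary"], # so column names are supplied here
    chunksize=2,
)

for chunk in reader:
    # Only one chunk is held in memory at a time
    total_salary += int(chunk["salary"].sum())
    rows_seen += len(chunk)

print(rows_seen, total_salary)  # 5 51000
```

Aggregating per chunk, as done here for the salary total, keeps memory use bounded by the chunk size instead of the file size.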