Pandas Quiz

## Pandas Knowledge Assessment

Welcome to the **YouTip Pandas Quiz**! This assessment is designed for data scientists, machine learning engineers, and Python developers who want to test and solidify their understanding of the Pandas library. Pandas is the cornerstone of data manipulation and analysis in the Python ecosystem. This quiz covers fundamental to advanced concepts, including Series, DataFrames, data cleaning, indexing, grouping, and merging.

---

## Quiz Questions

Test your knowledge by answering the following 10 multiple-choice questions. The correct answers and detailed explanations are provided at the end of the quiz.

### Q1. Which of the following is the standard way to import the Pandas library?

* A) `import pandas`
* B) `import pandas as pd`
* C) `import pydata as pd`
* D) `from pandas import DataFrame`

### Q2. What is the primary difference between a Pandas Series and a Pandas DataFrame?

* A) A Series is 1-dimensional, while a DataFrame is 2-dimensional.
* B) A Series can only hold numerical data, while a DataFrame can hold any data type.
* C) A DataFrame is immutable, while a Series is mutable.
* D) There is no difference; they are aliases for the same object.

### Q3. How do you read a CSV file named `data.csv` into a Pandas DataFrame?

* A) `df = pd.load_csv('data.csv')`
* B) `df = pd.open_csv('data.csv')`
* C) `df = pd.read_csv('data.csv')`
* D) `df = pd.DataFrame('data.csv')`

### Q4. Which method is used to view the first 5 rows of a DataFrame by default?

* A) `df.first(5)`
* B) `df.show()`
* C) `df.head()`
* D) `df.preview()`

### Q5. How do you select a column named "Age" from a DataFrame named `df`?

* A) `df.get_column("Age")`
* B) `df["Age"]` or `df.Age`
* C) `df.loc("Age")`
* D) `df.select("Age")`

### Q6. What is the purpose of the `df.dropna()` method?

* A) To delete empty columns only.
* B) To drop rows or columns that contain missing (NaN) values.
* C) To replace missing values with zero.
* D) To drop duplicate rows from the DataFrame.

### Q7. How do you filter a DataFrame `df` to only include rows where the "Salary" column is greater than 50,000?

* A) `df.filter(df["Salary"] > 50000)`
* B) `df[df["Salary"] > 50000]`
* C) `df.where("Salary > 50000")`
* D) `df.query(Salary > 50000)`

### Q8. Which of the following is used to group data in Pandas for aggregation?

* A) `df.aggregate_by()`
* B) `df.pivot()`
* C) `df.groupby()`
* D) `df.cluster()`

### Q9. What does the `inplace=True` parameter do when used in methods like `df.drop()`?

* A) It performs the operation and returns a new copy of the DataFrame.
* B) It ensures the operation is performed in-place, modifying the original DataFrame directly without returning a new one.
* C) It saves the modified DataFrame directly to the disk.
* D) It prevents any changes from being made to the DataFrame.

### Q10. How do you merge two DataFrames (`df1` and `df2`) on a common column named "ID"?

* A) `pd.concat([df1, df2], on='ID')`
* B) `df1.join(df2, on='ID')`
* C) `pd.merge(df1, df2, on='ID')`
* D) `df1.combine(df2, column='ID')`

---

## Answer Key & Explanations

### Q1. Correct Answer: **B**

* **Explanation:** While `import pandas` (A) is syntactically correct, `import pandas as pd` (B) is the widely accepted community convention. Using `pd` as an alias keeps your code concise and readable.

### Q2. Correct Answer: **A**

* **Explanation:** A Pandas **Series** is a one-dimensional labeled array capable of holding any data type. A **DataFrame** is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

### Q3. Correct Answer: **C**

* **Explanation:** `pd.read_csv()` is the built-in Pandas function that parses a CSV file and loads it directly into a DataFrame object.

### Q4. Correct Answer: **C**

* **Explanation:** The `df.head(n)` method returns the first `n` rows of the DataFrame. If no argument is passed, it defaults to returning the first 5 rows.
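
The answers to Q1 through Q4 can be tried out together in a few lines. The snippet below is a minimal sketch, not part of the quiz itself: the CSV text and its column names are made up for illustration, and an in-memory buffer stands in for a real `data.csv` file.

```python
import io

import pandas as pd  # Q1: the conventional alias

# Hypothetical CSV content; a real script would pass a path such as 'data.csv'.
csv_text = (
    "name,age\n"
    "Ada,36\nLinus,28\nGrace,45\nAlan,41\nEdsger,52\nBarbara,39\n"
)

# Q3: pd.read_csv accepts a path or a file-like object and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Q2: a DataFrame is 2-dimensional; a single column is a 1-dimensional Series.
ages = df["age"]
print(df.ndim, ages.ndim)

# Q4: head() returns the first 5 rows by default, head(n) the first n.
print(len(df.head()), len(df.head(2)))
```

Reading from `io.StringIO` keeps the example self-contained; swapping in a file path should not change the rest of the code.
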
### Q5. Correct Answer: **B**

* **Explanation:** You can access a column either by using bracket notation `df["Age"]` (highly recommended, especially if column names contain spaces or special characters) or attribute notation `df.Age`. Attribute notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.

### Q6. Correct Answer: **B**

* **Explanation:** `df.dropna()` is used to remove missing values (NaNs). By default, it drops any row containing at least one missing value, but it can be configured to drop columns instead using the `axis` parameter.

### Q7. Correct Answer: **B**

* **Explanation:** This is called **boolean indexing**. `df["Salary"] > 50000` evaluates to a boolean Series (True/False). Passing this boolean Series inside the brackets `df[...]` filters the DataFrame to return only the rows where the condition is `True`.

### Q8. Correct Answer: **C**

* **Explanation:** The `df.groupby()` method is used to split the data into groups based on some criteria, allowing you to apply aggregation functions (like `.sum()`, `.mean()`, or `.count()`) to each group.

### Q9. Correct Answer: **B**

* **Explanation:** By default, most Pandas operations return a new DataFrame and leave the original unchanged. Setting `inplace=True` modifies the original DataFrame directly and makes the method return `None`. It does not reliably save memory, because many operations still build a copy internally.

### Q10. Correct Answer: **C**

* **Explanation:** `pd.merge()` is the primary function used to join two DataFrames database-style on a key column. `pd.concat()` is typically used for stacking DataFrames vertically or horizontally and has no `on` parameter, and `.join()` aligns on the index of the other DataFrame by default.

---

## Key Considerations for Pandas Developers

When working with Pandas in production environments, keep the following best practices in mind:

1. **Avoid Loops (`for` loops):** Pandas is built on top of NumPy, which utilizes vectorized operations written in C. Instead of iterating through rows with a loop, prefer vectorized operations on whole columns. `.apply()` and `.map()` are convenient when no vectorized equivalent exists, but they still call a Python function per element or row and are usually much slower than true vectorization.
2. **Memory Optimization:** For large datasets, optimize memory usage by specifying column data types with the `dtype` parameter when reading files, or downcast numeric types using `pd.to_numeric()` with its `downcast` argument.
3. **Chained Indexing Warning:** Avoid chained indexing like `df[df['A'] > 2]['B'] = 10`. This can raise a `SettingWithCopyWarning`, and the assignment may silently fail to modify `df`. Instead, use explicit indexers like `.loc` or `.iloc`: `df.loc[df['A'] > 2, 'B'] = 10`.
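
The three practices above can be sketched in one short example. The column names `price` and `qty` and all values are hypothetical and chosen only for illustration; `A` and `B` mirror the names used in the chained-indexing item.

```python
import io

import pandas as pd

# Hypothetical CSV content, read with explicit dtypes to control memory use.
csv_text = "A,B,price,qty\n1,0,10.0,1\n2,0,20.0,2\n3,0,30.0,3\n4,0,40.0,4\n"
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"B": "int32", "qty": "int32"},
)

# 1. Vectorized arithmetic on whole columns instead of a row-by-row loop.
df["total"] = df["price"] * df["qty"]

# 2. Downcast a numeric column to the smallest integer dtype that fits its values.
df["A"] = pd.to_numeric(df["A"], downcast="integer")

# 3. One .loc call selects rows and column together, avoiding chained indexing.
df.loc[df["A"] > 2, "B"] = 10

print(df.dtypes)
print(df)
```

Checking `df.dtypes` before and after such changes is a simple way to confirm that the intended types were actually applied.
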
