Learning Pandas

Pandas is a powerful and widely used Python library for data manipulation and analysis. It provides efficient, flexible, and intuitive tools for handling structured data, making tasks like cleaning, transforming, and analyzing datasets much easier. 📊

Using a dedicated library like Pandas saves time and effort. Instead of reinventing the wheel by writing custom code for common data operations, you can leverage Pandas’ built-in functionality to work with data efficiently. 🚀

📚 To learn more, visit the official Pandas documentation: https://pandas.pydata.org/

🔍 How Pandas Works

Most data you’ll encounter when starting out as a data analyst will be tabular—structured in rows and columns, much like a spreadsheet or a database table. Typically, the first row contains column headers, which provide names for each column in the dataset.

Pandas allows you to load tabular data from multiple formats, with CSV files and Excel spreadsheets being the most common. When you load data using Pandas, it is stored in an internal data structure called a DataFrame. This is a flexible and powerful way to represent and manipulate structured data.

Once your data is in a DataFrame, Pandas provides a rich set of helper methods to perform common data analysis tasks, such as:

  • 🔎 Filtering and selecting data
  • 🧹 Handling missing values
  • 🔄 Transforming and reshaping data
  • 📈 Aggregating and summarizing results

With these tools, Pandas makes it easy to clean, analyze, and visualize data efficiently.

📥 Importing Pandas and Loading Data into a DataFrame

Let’s look at a simple example of how Pandas loads and processes tabular data. Click on the “Run Code” button to run the code and display the results.

In Python, when we import a library, we can assign it a shorter alias for convenience. The line:

import pandas as pd

imports the Pandas library and assigns it the alias pd. This is a common convention in the data analysis community and makes code more concise. Using pd makes the code cleaner and easier to read, especially when working with multiple Pandas functions throughout a project.

Now, whenever we need to call a Pandas function, such as read_csv(), we prefix it with pd. instead of typing pandas. in full. For example:

df = pd.read_csv("data/rainfall_data.csv")

This calls the read_csv() function from Pandas, which reads a CSV file and loads its contents into a DataFrame. The file is identified by its path: the folder that contains it plus the file name, written as a string inside quotation marks, e.g. "data/rainfall_data.csv".

Now, df contains the data from rainfall_data.csv. Calling df.head() displays the first 5 rows of the dataset; you can also specify the number of rows, e.g. df.head(10).

Important

All the data files used in this module are stored in a folder called data. It is good practice to keep your Python code separate from your datasets by storing the data in its own folder.

Pandas can read various file formats, such as:

| File Type | Function |
|-----------|----------|
| CSV | `pd.read_csv("file.csv")` |
| Excel | `pd.read_excel("file.xlsx")` |
| JSON | `pd.read_json("file.json")` |

🔎 Exploring the Data in a DataFrame

Pandas makes it easy to explore data efficiently. You can use the .info() method to get column information and the .describe() method to get a statistical summary of the data.

Once you’ve loaded a dataset, you can inspect it using:

print(df.head())      # Show first 5 rows
print(df.info())      # Summary of columns and data types
print(df.describe())  # Statistical summary

🛠️ Working with DataFrame Methods

A DataFrame has many built-in methods for data exploration and manipulation. Mastering these DataFrame methods will help you work with datasets more effectively, making it easier to clean, transform, and analyze data.

Below is a table of commonly used Pandas methods, grouped into two categories:
- Pandas (pd) methods, which are called directly from the Pandas library.
- DataFrame (df) methods, which are called on a specific DataFrame object.

| Type | Method | Example | Description |
|------|--------|---------|-------------|
| Pandas (pd) methods | `pd.read_csv()` | `pd.read_csv("data.csv")` | Reads a CSV file into a DataFrame. |
| Pandas (pd) methods | `pd.concat()` | `pd.concat([df1, df2])` | Combines multiple DataFrames. |
| Pandas (pd) methods | `pd.merge()` | `pd.merge(df1, df2, on="id")` | Merges two DataFrames on a common column. |
| DataFrame (df) methods | `.head()` | `df.head(5)` | Displays the first 5 rows of the DataFrame. |
| DataFrame (df) methods | `.info()` | `df.info()` | Shows summary info about the DataFrame, including column types and missing values. |
| DataFrame (df) methods | `.describe()` | `df.describe()` | Provides summary statistics for numeric columns. |
| DataFrame (df) methods | `.columns` | `df.columns` | Returns the column names (an attribute rather than a method, so no parentheses). |
| DataFrame (df) methods | `.drop()` | `df.drop("column_name", axis=1)` | Removes a specified column from the DataFrame. |
| DataFrame (df) methods | `.groupby()` | `df.groupby("category").mean()` | Groups data by a column and applies an aggregate function. |
| DataFrame (df) methods | `.sort_values()` | `df.sort_values("price", ascending=False)` | Sorts the DataFrame by a column’s values. |

🎯 Understanding Parameters in Pandas Methods

Each of these methods accepts parameters, allowing you to customize how they behave. These parameters can be very useful for refining data operations.

For example, the .drop() method removes columns or rows from a DataFrame:

# Remove a specific column
df = df.drop("column_name", axis=1)

# Remove multiple columns
df = df.drop(["col1", "col2"], axis=1)

# Remove specific rows by index
df = df.drop([0, 1, 2], axis=0)

  • The axis=1 argument tells Pandas to drop a column.
  • The axis=0 argument tells Pandas to drop rows instead (this is the default if axis is not specified).

Pandas’ flexibility with parameters makes it a powerful tool for data manipulation. For example, when loading a CSV file, you can specify custom column names with the names parameter or skip rows during import with the skiprows parameter:

df = pd.read_csv("data.csv", names=["A", "B", "C"], skiprows=2)

Knowing these variations helps you work more efficiently and tailor operations to your specific needs.

Pandas offers many more methods for working with DataFrames. To explore them in detail, check out the official Pandas documentation:
https://pandas.pydata.org/docs/

⛓️ Using Method Chaining

Method chaining allows multiple operations to be performed in one expression, making code cleaner and easier to follow. Instead of creating intermediate variables, you can drop missing values, filter data, group by categories, and sort results in a single readable expression.

df_result = (
    df.dropna()
    .query("`Rainfall (mm)` > 5")
    .groupby("Location")["Rainfall (mm)"]
    .mean()
    .sort_values(ascending=False)
)

Method chaining keeps Pandas code concise and readable while avoiding unnecessary variables. 🚀

📋 Selecting Columns

In Pandas, you can select specific columns using []. A single column is accessed with df["column_name"].

Multiple columns are selected using a list, e.g., df[["col1", "col2"]]. This is useful when working with large datasets where only a subset of data is needed.
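As a quick sketch (using a small, made-up DataFrame in the shape of the rainfall dataset):

```python
import pandas as pd

# Hypothetical sample data shaped like the rainfall dataset
df = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02"],
    "Location": ["City A", "City B"],
    "Rainfall (mm)": [12.5, 3.2],
})

# A single column name returns a Series
locations = df["Location"]

# A list of column names returns a smaller DataFrame
subset = df[["Location", "Rainfall (mm)"]]

print(locations.tolist())    # ['City A', 'City B']
print(list(subset.columns))  # ['Location', 'Rainfall (mm)']
```

Note the double brackets when selecting multiple columns: the inner brackets form a Python list of names, and the outer brackets perform the selection.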


✅ Boolean Filtering in Pandas (Boolean Indexing)

Boolean filtering, also known as Boolean indexing, is a powerful way to filter a DataFrame based on conditions. It lets you extract only the rows that meet certain criteria, which makes it much easier to analyze and manipulate large datasets.

1️⃣ Filtering by a Single Column

Let’s say we want to retrieve only the rows where the Location is "City A". We can do this using the following code:
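A minimal sketch, using a small hypothetical DataFrame shaped like the rainfall dataset:

```python
import pandas as pd

# Hypothetical sample data shaped like the rainfall dataset
df = pd.DataFrame({
    "Location": ["City A", "City B", "City A"],
    "Rainfall (mm)": [12.0, 8.0, 4.0],
})

# Keep only the rows where Location equals "City A"
city_a = df[df["Location"] == "City A"]
print(len(city_a))  # 2
```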

🔍 How does Indexing Syntax Work?

The key part of this code is:

df["Location"] == "City A"

This expression creates a Boolean Series, where each row is evaluated based on whether "Location" is "City A":

Each row in the Location column is checked:

  • If the value is "City A", it returns True.
  • Otherwise, it returns False.

When we write:

df[df["Location"] == "City A"]

we are passing this Boolean Series inside the DataFrame’s indexing brackets ([]).

This tells Pandas to select only the rows where the Boolean Series is True.

So Pandas returns a new DataFrame containing only rows where "Location" is "City A".

We can also filter all rows where rainfall is less than 10 mm:
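Sketched with the same kind of made-up sample data:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Location": ["City A", "City B", "City C"],
    "Rainfall (mm)": [12.0, 8.0, 4.0],
})

# Keep only the rows where rainfall is below 10 mm
light_rain = df[df["Rainfall (mm)"] < 10]
print(light_rain["Location"].tolist())  # ['City B', 'City C']
```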

When filtering a DataFrame, we use Boolean operators to compare values and create conditions. These operators return a Boolean Series (True or False) that can be used for indexing.

| Operator | Description | Example |
|----------|-------------|---------|
| `==` | Equal to | `df["Location"] == "City A"` |
| `!=` | Not equal to | `df["Location"] != "City A"` |
| `<` | Less than | `df["Rainfall (mm)"] < 10` |
| `<=` | Less than or equal to | `df["Temperature (C)"] <= 20` |
| `>` | Greater than | `df["Rainfall (mm)"] > 5` |
| `>=` | Greater than or equal to | `df["Temperature (C)"] >= 25` |
| `&` | AND: both conditions must be True (combining multiple conditions) | `(df["Location"] == "City A") & (df["Rainfall (mm)"] < 10)` |
| `\|` | OR: at least one condition must be True (combining multiple conditions) | `(df["Location"] == "City A") \| (df["Temperature (C)"] > 20)` |
| `~` | NOT: negates a condition (excluding values) | `df[~(df["Location"] == "City A")]` |

For more details, see the official Pandas documentation:
https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing

2️⃣ Filtering by Multiple Criteria

You can filter using multiple criteria with & (AND) and | (OR):

Example: Days in “City B” with Rainfall < 10 mm

  • & (AND): Both conditions must be True for a row to be included.
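A sketch of the AND filter, with hypothetical sample rows (note that each condition needs its own parentheses, because & binds more tightly than the comparison operators):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Location": ["City A", "City B", "City B"],
    "Rainfall (mm)": [12.0, 8.0, 15.0],
})

# Rows must satisfy BOTH conditions
city_b_dry = df[(df["Location"] == "City B") & (df["Rainfall (mm)"] < 10)]
print(len(city_b_dry))  # 1
```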

Example: Days with Temperature > 20°C OR Rainfall > 10 mm

  • | (OR): At least one of the conditions must be True for a row to be included.
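And a sketch of the OR filter, again with made-up values:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Temperature (C)": [25.0, 18.0, 15.0],
    "Rainfall (mm)": [2.0, 12.0, 5.0],
})

# Rows must satisfy AT LEAST ONE condition
warm_or_wet = df[(df["Temperature (C)"] > 20) | (df["Rainfall (mm)"] > 10)]
print(len(warm_or_wet))  # 2
```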

📊 Grouping Data with groupby() in Pandas

In our rainfall dataset, we have multiple rows for each city, each representing a day’s rainfall. If we want to find the average rainfall per city, one way would be to filter the dataset manually for each city and calculate the mean separately. However, this is inefficient—especially with large datasets.

Instead, we can use .groupby() to group all rows by city at once and calculate the average rainfall for each city efficiently. 🚀
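Under that setup, the computation can be sketched as follows (the DataFrame here is hypothetical sample data):

```python
import pandas as pd

# Hypothetical sample data: several daily readings per city
df = pd.DataFrame({
    "Location": ["City A", "City A", "City B", "City B"],
    "Rainfall (mm)": [10.0, 20.0, 5.0, 15.0],
})

# Group the rows by city, then average each group's rainfall
avg_rainfall = df.groupby("Location")["Rainfall (mm)"].mean()
print(avg_rainfall["City A"])  # 15.0
```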

🔧 How This Works

  1. Grouping the Data:
    • .groupby("Location") groups the DataFrame by Location.
    • This means all rows belonging to "City A" are grouped together, all rows for "City B" are grouped together, and so on.
  2. Applying the Aggregation Function:
    • ["Rainfall (mm)"].mean() calculates the average rainfall for each group.
    • Instead of filtering "City A", "City B", and "City C" separately, we compute all means in one step.

💡 Why groupby() is Useful

  • ✅ It avoids repetitive filtering for each city.
  • ✅ It works efficiently even for large datasets.
  • ✅ It can be combined with other aggregate functions like .sum(), .count(), etc.
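For instance, .agg() can apply several aggregate functions in one pass (a sketch with hypothetical sample data):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Location": ["City A", "City A", "City B"],
    "Rainfall (mm)": [10.0, 20.0, 5.0],
})

# Compute mean, total, and number of readings per city in one step
summary = df.groupby("Location")["Rainfall (mm)"].agg(["mean", "sum", "count"])
print(summary.loc["City A", "sum"])  # 30.0
```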

For more details, check the Pandas documentation:
https://pandas.pydata.org/docs/user_guide/groupby.html

🔄 Pivoting Data in Pandas

Pivoting is the process of reshaping data from a long format (where each row represents a single observation) into a wide format (where values are spread across multiple columns). This is useful for analyzing trends, summarizing data, or preparing data for visualization.

In Pandas, we use .pivot() or .pivot_table() to transform data.

📌 Example: Pivoting the Rainfall Dataset

Let’s say we want to restructure our dataset so that each city has its own column, and the rows represent different dates. This makes it easier to compare daily rainfall across multiple locations.
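A sketch of this reshaping, with hypothetical long-format rows:

```python
import pandas as pd

# Hypothetical long-format data: one row per city per day
df = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "Location": ["City A", "City B", "City A"],
    "Rainfall (mm)": [12.0, 8.0, 4.0],
})

# Dates become the index, cities become columns, rainfall fills the cells
wide = df.pivot(index="Date", columns="Location", values="Rainfall (mm)")
print(wide.loc["2024-01-01", "City A"])  # 12.0
```

City B has no reading on 2024-01-02 in this sample, so that cell is NaN.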

🔧 How This Works

  1. Setting the Row Index (index="Date")
    • The Date column becomes the index of the new DataFrame.
  2. Spreading the Location Column (columns="Location")
    • Each unique Location (City A, City B, City C) becomes a column.
  3. Filling the Rainfall (mm) Values (values="Rainfall (mm)")
    • The rainfall values are placed in their respective city columns.
    • If a city had no recorded rainfall on a given date, the value is NaN (missing).

Why Use Pivoting?

- It reshapes data for easier analysis 🚀.
- It makes it simpler to compare multiple categories (e.g., rainfall trends across cities).
- It’s useful for plotting and visualization, where each category needs its own column.

For more details, check the Pandas documentation:
https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-by-pivoting-dataframes

🎯 Summary

In this section, we explored some of Pandas’ most powerful features, including:

  • Boolean Filtering – Selecting rows based on conditions.
  • 📊 Grouping with groupby() – Aggregating data by categories.
  • 🔄 Pivoting Data – Reshaping data for easier analysis.

This is just a small selection of Pandas’ functionality! There are many more methods available to help with data cleaning, transformation, and analysis.

📚 Tip: Keep Exploring
  • Pandas methods often have many parameters that can change how they work.
  • Always check the official Pandas documentation to discover additional options:
    https://pandas.pydata.org/docs/

By mastering Pandas, you can work with data more efficiently and gain deeper insights from your datasets! 🚀