Writing Reusable Code
When working with Pandas, it's easy to write large, repetitive scripts that perform similar operations on different datasets. Breaking repetitive tasks down into functions makes code cleaner, more modular, and easier to debug.
📌 Example: Cleaning and Processing Multiple Datasets
Let's say we have three rainfall datasets (rainfall_2021.csv, rainfall_2022.csv, rainfall_2023.csv) that require the same cleaning steps.
❌ Without Code Decomposition (Highly Repetitive Code)
import pandas as pd
# Load and clean rainfall_2021.csv
df_2021 = pd.read_csv("data/rainfall_2021.csv")
df_2021['Rainfall (mm)'] = df_2021['Rainfall (mm)'].ffill() # Forward fill missing values
df_2021['Date'] = pd.to_datetime(df_2021['Date']) # Convert to datetime
df_2021 = df_2021[df_2021['Rainfall (mm)'] >= 0] # Remove negative values
# Load and clean rainfall_2022.csv
df_2022 = pd.read_csv("data/rainfall_2022.csv")
df_2022['Rainfall (mm)'] = df_2022['Rainfall (mm)'].ffill()
df_2022['Date'] = pd.to_datetime(df_2022['Date'])
df_2022 = df_2022[df_2022['Rainfall (mm)'] >= 0]
# Load and clean rainfall_2023.csv
df_2023 = pd.read_csv("data/rainfall_2023.csv")
df_2023['Rainfall (mm)'] = df_2023['Rainfall (mm)'].ffill()
df_2023['Date'] = pd.to_datetime(df_2023['Date'])
df_2023 = df_2023[df_2023['Rainfall (mm)'] >= 0]
print(df_2021.head(), df_2022.head(), df_2023.head())
💡 Problems:
- Lots of duplicated code – the same logic is repeated for each file.
- Hard to maintain – if we want to change the cleaning steps, we must update every block.
- Not scalable – if we had 10+ files, this approach would be extremely inefficient.
✅ With Code Decomposition (Modular & Reusable)
Instead, we create a function that loads and cleans any rainfall dataset.
import pandas as pd
def load_and_clean_rainfall(filepath):
    """Loads a rainfall dataset, fills missing values, converts dates, and removes negative values."""
    df = pd.read_csv(filepath)
    # Use ffill() instead of fillna(method='ffill') to avoid a FutureWarning
    df['Rainfall (mm)'] = df['Rainfall (mm)'].ffill()
    # Convert the date column to datetime
    df['Date'] = pd.to_datetime(df['Date'])
    # Remove negative rainfall values
    df = df[df['Rainfall (mm)'] >= 0]
    return df
# Process multiple datasets efficiently
df_2021 = load_and_clean_rainfall("data/rainfall_2021.csv")
df_2022 = load_and_clean_rainfall("data/rainfall_2022.csv")
df_2023 = load_and_clean_rainfall("data/rainfall_2023.csv")
print(df_2021.head(), df_2022.head(), df_2023.head())
🔹 Why is This Better?
✅ Avoids Code Repetition – the function is called multiple times instead of repeating the same steps.
✅ Easier Maintenance – if we need to change the cleaning process (e.g., use the median instead of forward fill), we update only one function.
✅ Scalable – works for any number of datasets without rewriting code.
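For example, if we decided to use median imputation instead of forward fill (an assumption for illustration, not part of the workflow above), only one line inside load_and_clean_rainfall() would need to change:
# Inside load_and_clean_rainfall(), swap the ffill() line for median imputation:
df['Rainfall (mm)'] = df['Rainfall (mm)'].fillna(df['Rainfall (mm)'].median())
# Every dataset processed by the function then uses the new strategy automatically.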
🎯 Task: Automate Processing Multiple Files
Write a prompt asking an LLM to automate the processing of multiple datasets by calling the load_and_clean_rainfall() function in a loop over a list of file paths.
filepaths = [
"data/rainfall_2021.csv",
"data/rainfall_2022.csv",
"data/rainfall_2023.csv"
]
🚀 Now, we can handle any number of rainfall datasets with just a few lines of code!
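One possible solution the LLM might produce is sketched below (assuming the cleaned DataFrames are collected in a dictionary keyed by file path; your result may differ):
# Apply the same cleaning function to every file in the list
cleaned_data = {}
for path in filepaths:
    cleaned_data[path] = load_and_clean_rainfall(path)

# Quick check of the first few rows of each cleaned dataset
for path, df in cleaned_data.items():
    print(path)
    print(df.head())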
Python offers even more ways to organize reusable code:
- Classes: Encapsulate data and behavior into reusable objects.
- Modules: Store reusable functions in separate files that can be imported into multiple scripts.
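As a sketch of the module approach, load_and_clean_rainfall() could be saved in a file such as rainfall_utils.py (a hypothetical file name used only for illustration) and imported wherever it is needed:
# rainfall_utils.py – holds the reusable cleaning function defined earlier
# def load_and_clean_rainfall(filepath): ...

# analysis.py – any script that needs the same cleaning step
from rainfall_utils import load_and_clean_rainfall

df_2021 = load_and_clean_rainfall("data/rainfall_2021.csv")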
By following functional decomposition principles, you create modular, reusable, and testable code that is easier to debug and maintain.