Lab 1: Data Cleaning and Preprocessing

In this lab, you’ll analyze the Data Science Salaries 2023 dataset using Pandas for data manipulation and Matplotlib for visualization. This dataset contains real-world salary data for data professionals across multiple countries, company sizes, and experience levels.

🔗 Dataset Source: Data Science Salaries 2023 on GitHub

Dataset Overview: The dataset has 3755 entries and 11 columns, providing insights into salaries, job roles, and company attributes.

Why this matters:

Understanding the structure of the dataset before analysis is crucial. Correctly identifying categorical and numeric variables helps with:

  • Choosing appropriate summary statistics
  • Selecting suitable visualisations
  • Avoiding type-related errors later in the analysis

Tip

Pay attention to columns that may require conversion to categorical types or ordered factors, especially experience_level or company_size, to ensure correct interpretation in analyses.

Objective 🎯

Real-world data is rarely clean. Before analysis or modelling, datasets must be cleaned, corrected, and validated.

In this lab, you will learn how to:

  • Identify data quality issues
  • Handle missing values and duplicates
  • Correct data types
  • Engineer useful features
  • Detect and reason about outliers
  • Use AI responsibly to assist each step

Poor data quality can:

  • Bias results
  • Mislead models
  • Produce incorrect conclusions

Important

Garbage in = garbage out


📌 Guided AI-Assisted Workflow

Step 1: Explore dataset
Role: Professional Data Analyst
Prompt: Examine this dataset and provide a summary of structure, missing values, duplicates, and columns that may need cleaning. Explain your reasoning.

Step 2: Handle missing values & duplicates
Role: Professional Data Scientist
Prompt: Suggest strategies to handle missing values or duplicates. Recommend removal, imputation, or leave-as-is, and explain the trade-offs.

Step 3: Validate data types & categorical levels
Role: Data Analysis Expert
Prompt: Inspect the dataset and identify numeric vs categorical columns. Decide if categorical columns should be converted or ordered, and explain the impact on analysis and visualisations.

Step 4: Feature engineering
Role: Data Analyst Mentor
Prompt: Suggest meaningful derived features or transformations that highlight patterns and improve interpretability. Explain why each feature is useful for analysis or modelling.

Step 5: Outlier detection & handling
Role: Data Quality Specialist
Prompt: Detect extreme values in numeric columns using visual and statistical methods. Decide whether to keep, transform, or remove each outlier and justify your decision.

Task 1: Load Dataset & Identify Issues 🕵️

Objective

Understand the raw dataset and identify potential data issues.

We will use the Data Science Salaries 2023 dataset for this lab.

Your Task

  1. Load the dataset

  2. Inspect:

    • First rows
    • Data types
    • Missing values
    • Duplicates

Prompt Writing Exercise

Write a prompt asking an AI to:

  • Load the dataset
  • Display structure
  • Identify missing values and duplicates
  • Highlight columns needing attention

Example Code
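
A minimal sketch of this step, assuming the CSV has been downloaded locally as ds_salaries.csv (a placeholder filename; point pd.read_csv at wherever your copy lives):

```python
import pandas as pd

# Load the dataset (adjust the path to your local copy)
df = pd.read_csv("ds_salaries.csv")

# First rows
print(df.head())

# Structure: column names, dtypes, non-null counts
df.info()

# Missing values per column
print(df.isna().sum())

# Number of fully duplicated rows
print("Duplicate rows:", df.duplicated().sum())
```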

Reflection

  • Which columns contain missing values?
  • Are there duplicates?
  • Which columns need cleaning before analysis?

Task 2: Handle Missing Values & Duplicates 💧

Objective

Clean the dataset by handling duplicate rows and validating whether missing-value treatment is required.

Important

Cleaning decisions must be driven by evidence from data inspection, not assumptions.

Your Task

  1. Remove duplicate rows
  2. Decide whether missing values require treatment
  3. Justify your decisions

Prompt Writing Exercise

Role: Professional Data Scientist

Task: Clean this dataset by:

  1. Detecting and removing duplicate rows
  2. Checking for missing values in each column
  3. Suggesting an appropriate strategy for handling missing values (e.g., remove, impute, or leave as-is)
  4. Justifying each cleaning decision based on the dataset

Instructions for AI:

  • Provide a step-by-step explanation for duplicates and missing values.
  • Explain the impact of each cleaning step on subsequent analysis.
  • Highlight any potential risks if cleaning is done incorrectly.

Code Implementation (after you try)

Duplicate data consists of rows that repeat information already present, so the standard treatment is simply to drop them from the dataset entirely.
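
A minimal sketch of this step, reusing the df loaded in Task 1:

```python
# Row count before cleaning, for comparison
n_before = len(df)

# Drop exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates().reset_index(drop=True)
print(f"Removed {n_before - len(df)} duplicate rows; {len(df)} rows remain.")

# Verify whether any missing values still require treatment
print(df.isna().sum())
```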

Reflection

  • Were duplicate rows removed?
  • Does the cleaned dataset have fewer rows than before?
  • Are there any missing values remaining?

Output interpretation:

  • Duplicate rows detected and removed
  • No missing values present
  • No imputation is required for this dataset

Important

Handling Duplicates

Duplicate rows represent repeated information and can:

  • Bias summary statistics
  • Over-represent certain job roles or salaries
  • Mislead downstream analysis

👉 Duplicates should be removed completely unless there is a strong reason to keep them.

What If Missing Values Existed? (Illustrative Only)

The dataset contains no missing values, but the following examples show how missing data would be handled if present.

These lines are commented out intentionally to reinforce decision-based cleaning.
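
A sketch of the patterns you would reach for, matching the decision-based approach described next; every call stays commented out because nothing here is actually missing:

```python
# Illustrative only -- this dataset contains no missing values.

# Small proportion missing: drop the affected rows
# df = df.dropna()

# Roughly symmetric numeric column: impute with the mean
# df["salary_in_usd"] = df["salary_in_usd"].fillna(df["salary_in_usd"].mean())

# Skewed numeric column (or outliers present): impute with the median
# df["salary_in_usd"] = df["salary_in_usd"].fillna(df["salary_in_usd"].median())

# Categorical column: impute with the mode
# df["experience_level"] = df["experience_level"].fillna(df["experience_level"].mode()[0])
```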

Important

Handling Missing Values: A Decision-Based Approach

The strategy used to handle missing values depends on:

  1. Proportion of missing data

    • Small proportion (< 5%) → Row-wise removal may be acceptable
  2. Risk of data loss

    • Large proportion (20–50%) → Imputation helps preserve statistical power
  3. Type of variable

    • Mean: roughly symmetric numeric data
    • Median: skewed numeric data or outliers
    • Mode: categorical variables
  4. Post-cleaning Reflection

    • Compare summaries before and after cleaning
    • Ensure distributions are not distorted

In this lab, no imputation is performed because the dataset contains no missing values.

Important

Cleaning is not about applying every possible method.

It is about checking, deciding, and justifying each action.

Task 3: Validate Data Types & Categorical Levels 🔢

Objective

Before performing analysis, ensure that data types are correct and categorical levels make sense. This helps avoid mistakes in summaries, plots, and correlations later.

Note: We don’t always convert columns automatically — sometimes the raw object type is fine for analysis. The goal here is to confirm that the dataset is ready for meaningful exploration.

Your Task

  1. Inspect categorical columns (experience_level, employment_type, job_title, salary_currency, employee_residence, company_location, company_size)
  2. Inspect numeric columns (work_year, salary, salary_in_usd, remote_ratio)
  3. Decide if any categorical columns should be converted to category type, and if any should be ordered (e.g., experience_level: EN → MI → SE → EX)
  4. Check the levels/categories of all categorical variables

AI Prompting Exercise

Role: Data Analysis Expert

Prompt Example:

“Inspect this dataset and identify which columns are numeric and which are categorical. Suggest which categorical columns should be converted to category type and which should be ordered. Explain how these decisions impact data summaries, visualizations, and analysis.”

Example Code
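
A minimal sketch, reusing df from the earlier tasks; the orderings chosen for experience_level (EN → MI → SE → EX) and company_size (S → M → L) are one reasonable option, not the only one:

```python
import pandas as pd

# Numeric vs object columns
print(df.dtypes)

# Inspect the levels of each categorical variable
categorical_cols = ["experience_level", "employment_type", "job_title",
                    "salary_currency", "employee_residence",
                    "company_location", "company_size"]
for col in categorical_cols:
    print(col, "->", df[col].unique())

# Optional: convert to category dtype, ordering where a logical sequence exists
df["experience_level"] = pd.Categorical(
    df["experience_level"], categories=["EN", "MI", "SE", "EX"], ordered=True
)
df["company_size"] = pd.Categorical(
    df["company_size"], categories=["S", "M", "L"], ordered=True
)
```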

Discussion Questions

  1. Which columns must be numeric to allow calculations or correlations?
  2. Which columns are naturally categorical and why?
  3. Should experience_level be ordered or left as unordered? Why?
  4. Are there any object columns that you would leave as-is?
  5. How might incorrect types affect visualizations or summary statistics?

Important
  • Converting object columns to categorical is optional, but it improves plotting, grouping, and modelling
  • Ordered categories are useful when there is a logical sequence
  • Validating types before analysis prevents subtle errors in summary statistics or correlations

Reflection

  • Are categorical columns properly identified?
  • Are numeric columns correctly typed?
  • Does experience_level have the correct order if you chose to order it?

Task 4: Exploratory Data Analysis (EDA)

Objective

Explore how salaries vary by experience level, job role, company size, and remote work.

Your Task

  • Compute summary statistics for salary_in_usd
  • Explore salary by experience_level, company_size, remote_ratio, and job_title
  • Visualize trends using bar plots, boxplots, or histograms
  • Reflect on patterns, trends, or anomalies in the dataset

AI Prompting Exercise

Role: Data Analyst Mentor

Prompt Example:

“Analyze the dataset to explore how salaries vary by experience level, company size, remote ratio, and job title. Provide summary statistics and suggest appropriate visualizations. Highlight key trends, outliers, or anomalies, and explain their possible implications for understanding compensation patterns.”

4.1 Summary statistics for salary

Note

Why this matters: This step gives a quick overview of the salary distribution, including typical pay (median), spread, and extreme values. It helps identify whether salaries are skewed and whether outliers may influence later analysis.
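
A minimal sketch, reusing df from the cleaning tasks:

```python
import matplotlib.pyplot as plt

# Typical pay, spread, and extremes
print(df["salary_in_usd"].describe())

# Histogram to check spread and skewness
plt.hist(df["salary_in_usd"], bins=50)
plt.xlabel("Salary (USD)")
plt.ylabel("Count")
plt.title("Distribution of salary_in_usd")
plt.show()
```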

4.2 Salary by experience level

Visualisation:
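
One way to produce this visualisation, assuming df and matplotlib as before:

```python
import matplotlib.pyplot as plt

# Median salary per experience level (EN -> MI -> SE -> EX if ordered in Task 3)
print(df.groupby("experience_level", observed=True)["salary_in_usd"].median())

# Boxplot shows both the trend and the spread within each level
df.boxplot(column="salary_in_usd", by="experience_level")
plt.suptitle("")
plt.title("Salary by experience level")
plt.xlabel("Experience level")
plt.ylabel("Salary (USD)")
plt.show()
```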

Note

Grouping salaries by experience level helps us understand how pay changes as professionals gain experience. This is useful for identifying career progression trends and setting realistic salary expectations.

4.3 Salary by company size

Visualisation
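
One way to produce this visualisation:

```python
import matplotlib.pyplot as plt

# Median salary per company size (S = small, M = medium, L = large)
size_median = df.groupby("company_size", observed=True)["salary_in_usd"].median()
print(size_median)

size_median.plot(kind="bar")
plt.xlabel("Company size")
plt.ylabel("Median salary (USD)")
plt.title("Median salary by company size")
plt.show()
```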

Note

Company size often reflects available resources and budget. Comparing salaries across company sizes helps assess whether larger organisations tend to offer higher compensation than smaller ones.


4.4 Impact of remote work on salary

Visualisation
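
One way to produce this visualisation; in this dataset remote_ratio is coded 0 (on-site), 50 (hybrid), and 100 (fully remote):

```python
import matplotlib.pyplot as plt

# Median salary per remote-work category
remote_median = df.groupby("remote_ratio")["salary_in_usd"].median()
print(remote_median)

remote_median.plot(kind="bar")
plt.xlabel("Remote ratio (%)")
plt.ylabel("Median salary (USD)")
plt.title("Median salary by remote ratio")
plt.show()
```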

Note

Remote work has become increasingly common. This analysis helps evaluate whether fully remote or hybrid roles are associated with higher or lower salaries compared to on-site positions.

4.5 Top 10 highest-paid job titles

Visualisation
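
One way to produce this visualisation; ranking by median salary is one reasonable definition of "highest-paid" (the mean is another):

```python
import matplotlib.pyplot as plt

# Rank job titles by median salary and keep the top 10
top10 = (df.groupby("job_title")["salary_in_usd"]
           .median()
           .sort_values(ascending=False)
           .head(10))
print(top10)

top10.plot(kind="barh")
plt.gca().invert_yaxis()  # highest-paid title at the top
plt.xlabel("Median salary (USD)")
plt.title("Top 10 highest-paid job titles")
plt.show()
```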

Note

Identifying the highest-paying job titles highlights which roles are most valued in the market. This can guide people in choosing specialisations or understanding demand.

Note

Think about: Are salary differences driven by a single factor, or a combination of experience, role, and work environment? What additional variables might help explain these patterns?

Important

Did you notice that we created visualisations for each analysis?

In real-world datasets, data is often large and complex, making it difficult to understand patterns by looking at raw tables alone. Visualisations help us:

  • Spot trends quickly
  • Compare groups easily
  • Identify outliers and unusual patterns

However, choosing the right type of visualisation is just as important.

The type of plot you use depends on:

  • What question you are trying to answer
  • The type of variables (categorical vs numerical)
  • Whether you are comparing groups, distributions, or relationships

For example:

  • Bar plots → comparing categories
  • Box plots → understanding distributions and outliers
  • Histograms → viewing spread and skewness
  • Pie charts → simple proportions (used carefully)

Good visualisation choices make analysis faster, clearer, and more meaningful.

Reflection

  • Which factor seems to have the strongest relationship with salary?
  • Do higher experience levels always guarantee higher pay?
  • Does remote work appear to increase or decrease salaries?

Task 5: Outlier Detection & Handling ⚠️

Objective

Identify extreme values in numeric columns that could skew your analysis or influence modelling. Decide whether to keep, transform, or remove them.

Important

Note: Outliers are not always “bad.” Sometimes they reflect real-world extremes, so the goal is awareness and justification, not automatic removal.

Your Task

  1. Explore numeric columns: work_year, salary, salary_in_usd, remote_ratio
  2. Detect potential outliers using boxplots and IQR method
  3. Decide for each outlier: keep, transform, or remove
  4. Reflect on how outliers could affect summary statistics, plots, and analysis conclusions

AI Prompting Exercise

Role: Data Quality Specialist

Prompt Example:

“Inspect numeric columns to identify extreme values or potential outliers. Recommend which outliers should be kept, transformed, or removed, and provide reasoning for each choice. Explain how outlier handling might affect summary statistics, visualizations, and downstream analysis.”

Example Code
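
A minimal sketch combining the visual and quantitative approaches, reusing df; the 1.5 × IQR rule used here is the conventional cutoff, not a law:

```python
import matplotlib.pyplot as plt

def iqr_outlier_mask(series):
    """Boolean mask marking values beyond 1.5 * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

for col in ["work_year", "salary", "salary_in_usd", "remote_ratio"]:
    mask = iqr_outlier_mask(df[col])
    print(f"{col}: {mask.sum()} potential outliers flagged by the IQR rule")

# Visual check on the main salary column
plt.boxplot(df["salary_in_usd"])
plt.ylabel("Salary (USD)")
plt.title("Boxplot of salary_in_usd")
plt.show()
```

The counts tell you how many points the rule flags; whether each one is kept, transformed (e.g., log-scaled), or removed is the judgement the reflection below asks you to defend.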

Tip
  • Outlier detection is critical before analysis or modelling
  • Visualization (boxplots) gives intuitive insight, while IQR is a quantitative approach
  • Decisions should be justified with evidence, not arbitrary
  • Handling outliers incorrectly can mislead conclusions or models

Reflection

  • Are all numeric columns checked for outliers?
  • Are outlier decisions documented and justified?
  • Are visualizations interpreted correctly for data quality insights?
  • Do these outliers represent real data points or likely errors/noise?
  • How could these outliers affect your mean, median, and visualizations?
  • Would you remove, transform, or keep these values? Why?
  • How could outlier treatment affect AI-assisted reasoning or modelling downstream?