Lab 1: Data Cleaning and Preprocessing

In this lab, you’ll analyze the Data Science Salaries 2023 dataset using Pandas for data manipulation and Matplotlib for visualization. This dataset contains real-world salary data for data professionals across multiple countries, company sizes, and experience levels.

🔗 Dataset Source: Data Science Salaries 2023 on GitHub

Dataset Overview: The dataset has 3755 entries and 11 columns, providing insights into salaries, job roles, and company attributes.

Why this matters:

Understanding the structure of the dataset before analysis is crucial. Correctly identifying categorical and numeric variables helps with:

  • Choosing appropriate summary statistics
  • Selecting suitable visualisations
  • Avoiding type-related errors later in the analysis

Tip

Pay attention to columns that may require conversion to categorical types or ordered factors, especially experience_level or company_size, to ensure correct interpretation in analyses.

Objective 🎯

Real-world data is rarely clean. Before analysis or modelling, datasets must be cleaned, corrected, and validated.

In this lab, you will learn how to:

  • Identify data quality issues
  • Handle missing values and duplicates
  • Correct data types
  • Engineer useful features
  • Detect and reason about outliers
  • Use AI responsibly to assist each step

Poor data quality can:

  • Bias results
  • Mislead models
  • Produce incorrect conclusions

Important

Garbage in = garbage out


📌 Guided AI-Assisted Workflow

Step 1: Explore dataset
Role: Professional Data Analyst
Prompt: Examine this dataset and provide a summary of structure, missing values, duplicates, and columns that may need cleaning. Explain your reasoning.

Step 2: Handle missing values & duplicates
Role: Professional Data Scientist
Prompt: Suggest strategies to handle missing values or duplicates. Recommend removal, imputation, or leave-as-is, and explain the trade-offs.

Step 3: Validate data types & categorical levels
Role: Data Analysis Expert
Prompt: Inspect the dataset and identify numeric vs categorical columns. Decide if categorical columns should be converted or ordered, and explain the impact on analysis and visualisations.

Step 4: Feature engineering
Role: Data Analyst Mentor
Prompt: Suggest meaningful derived features or transformations that highlight patterns and improve interpretability. Explain why each feature is useful for analysis or modelling.

Step 5: Outlier detection & handling
Role: Data Quality Specialist
Prompt: Detect extreme values in numeric columns using visual and statistical methods. Decide whether to keep, transform, or remove each outlier and justify your decision.

Task 1: Load Dataset & Identify Issues 🕵️

Objective

Understand the raw dataset and identify potential data issues.

We will use the Data Science Salaries 2023 dataset for this lab.

Your Task

  1. Load the dataset

  2. Inspect:

    • First rows
    • Data types
    • Missing values
    • Duplicates

Prompt Writing Exercise

Write a prompt asking an AI to:

  • Load the dataset
  • Display structure
  • Identify missing values and duplicates
  • Highlight columns needing attention

Example Code
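
A minimal sketch of this step, assuming the CSV has been downloaded locally as ds_salaries.csv (a placeholder filename; point pd.read_csv at wherever your copy lives):

```python
import pandas as pd

# Load the dataset (adjust the path to your local copy)
df = pd.read_csv("ds_salaries.csv")

# First rows
print(df.head())

# Structure: column names, dtypes, non-null counts
df.info()

# Missing values per column
print(df.isna().sum())

# Number of fully duplicated rows
print("Duplicate rows:", df.duplicated().sum())
```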

Reflection

  • Which columns contain missing values?
  • Are there duplicates?
  • Which columns need cleaning before analysis?

Task 2: Handle Missing Values & Duplicates 💧

Objective

Clean the dataset by handling duplicate rows and validating whether missing-value treatment is required.

Important

Cleaning decisions must be driven by evidence from data inspection, not assumptions.

Your Task

  1. Remove duplicate rows
  2. Decide whether missing values require treatment
  3. Justify your decisions

Prompt Writing Exercise

Role: Professional Data Scientist

Task: Clean this dataset by:

  1. Detecting and removing duplicate rows
  2. Checking for missing values in each column
  3. Suggesting an appropriate strategy for handling missing values (e.g., remove, impute, or leave as-is)
  4. Justifying each cleaning decision based on the dataset

Instructions for AI:

  • Provide a step-by-step explanation for duplicates and missing values.
  • Explain the impact of each cleaning step on subsequent analysis.
  • Highlight any potential risks if cleaning is done incorrectly.

Code Implementation (after you try)

Duplicate data consists of rows that repeat information already present, so the standard treatment is simply to drop them from the dataset entirely.
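
A minimal sketch of this step, reusing the df loaded in Task 1:

```python
# Row count before cleaning, for comparison
n_before = len(df)

# Drop exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates().reset_index(drop=True)
print(f"Removed {n_before - len(df)} duplicate rows; {len(df)} rows remain.")

# Verify whether any missing values still require treatment
print(df.isna().sum())
```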

Reflection

  • Were duplicate rows removed?
  • Does the cleaned dataset have fewer rows than before?
  • Are there any missing values remaining?

Output interpretation:

  • Duplicate rows detected and removed
  • No missing values present
  • No imputation is required for this dataset

Important

Handling Duplicates

Duplicate rows represent repeated information and can:

  • Bias summary statistics
  • Over-represent certain job roles or salaries
  • Mislead downstream analysis

👉 Duplicates should be removed completely unless there is a strong reason to keep them.

What If Missing Values Existed? (Illustrative Only)

The dataset contains no missing values, but the following examples show how missing data would be handled if present.

These lines are commented out intentionally to reinforce decision-based cleaning.
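
A sketch of the patterns you would reach for, matching the decision-based approach described next; every call stays commented out because nothing here is actually missing:

```python
# Illustrative only -- this dataset contains no missing values.

# Small proportion missing: drop the affected rows
# df = df.dropna()

# Roughly symmetric numeric column: impute with the mean
# df["salary_in_usd"] = df["salary_in_usd"].fillna(df["salary_in_usd"].mean())

# Skewed numeric column (or outliers present): impute with the median
# df["salary_in_usd"] = df["salary_in_usd"].fillna(df["salary_in_usd"].median())

# Categorical column: impute with the mode
# df["experience_level"] = df["experience_level"].fillna(df["experience_level"].mode()[0])
```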

Important

Handling Missing Values: A Decision-Based Approach

The strategy used to handle missing values depends on:

  1. Proportion of missing data

    • Small proportion (< 5%) → Row-wise removal may be acceptable
  2. Risk of data loss

    • Large proportion (20–50%) → Imputation helps preserve statistical power
  3. Type of variable

    • Mean: roughly symmetric numeric data
    • Median: skewed numeric data or outliers
    • Mode: categorical variables
  4. Post-cleaning Reflection

    • Compare summaries before and after cleaning
    • Ensure distributions are not distorted

In this lab, no imputation is performed because the dataset contains no missing values.

Important

Cleaning is not about applying every possible method.

It is about checking, deciding, and justifying each action.

Task 3: Validate Data Types & Categorical Levels 🔢

Objective

Before performing analysis, ensure that data types are correct and categorical levels make sense. This helps avoid mistakes in summaries, plots, and correlations later.

Note: We don’t always convert columns automatically — sometimes the raw object type is fine for analysis. The goal here is to confirm that the dataset is ready for meaningful exploration.

Your Task

  1. Inspect categorical columns (experience_level, employment_type, job_title, salary_currency, employee_residence, company_location, company_size)
  2. Inspect numeric columns (work_year, salary, salary_in_usd, remote_ratio)
  3. Decide if any categorical columns should be converted to category type, and if any should be ordered (e.g., experience_level: EN → MI → SE → EX)
  4. Check the levels/categories of all categorical variables

AI Prompting Exercise

Role: Data Analysis Expert

Prompt Example:

“Inspect this dataset and identify which columns are numeric and which are categorical. Suggest which categorical columns should be converted to category type and which should be ordered. Explain how these decisions impact data summaries, visualizations, and analysis.”

Example Code
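
A minimal sketch, reusing df from the earlier tasks; the orderings chosen for experience_level (EN → MI → SE → EX) and company_size (S → M → L) are one reasonable option, not the only one:

```python
import pandas as pd

# Numeric vs object columns
print(df.dtypes)

# Inspect the levels of each categorical variable
categorical_cols = ["experience_level", "employment_type", "job_title",
                    "salary_currency", "employee_residence",
                    "company_location", "company_size"]
for col in categorical_cols:
    print(col, "->", df[col].unique())

# Optional: convert to category dtype, ordering where a logical sequence exists
df["experience_level"] = pd.Categorical(
    df["experience_level"], categories=["EN", "MI", "SE", "EX"], ordered=True
)
df["company_size"] = pd.Categorical(
    df["company_size"], categories=["S", "M", "L"], ordered=True
)
```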

Discussion Questions

  1. Which columns must be numeric to allow calculations or correlations?
  2. Which columns are naturally categorical and why?
  3. Should experience_level be ordered or left as unordered? Why?
  4. Are there any object columns that you would leave as-is?
  5. How might incorrect types affect visualizations or summary statistics?

Important
  • Converting object columns to categorical is optional, but it improves plotting, grouping, and modelling
  • Ordered categories are useful when there is a logical sequence
  • Validating types before analysis prevents subtle errors in summary statistics or correlations

Reflection

  • Are categorical columns properly identified?
  • Are numeric columns correctly typed?
  • Does experience_level have the correct order if you chose to order it?

Task 4: Exploratory Data Analysis (EDA)

Objective

Explore how salaries vary by experience level, job role, company size, and remote work.

Your Task

  • Compute summary statistics for salary_in_usd
  • Explore salary by experience_level, company_size, remote_ratio, and job_title
  • Visualize trends using bar plots, boxplots, or histograms
  • Reflect on patterns, trends, or anomalies in the dataset

AI Prompting Exercise

Role: Data Analyst Mentor

Prompt Example:

“Analyze the dataset to explore how salaries vary by experience level, company size, remote ratio, and job title. Provide summary statistics and suggest appropriate visualizations. Highlight key trends, outliers, or anomalies, and explain their possible implications for understanding compensation patterns.”

4.1 Summary statistics for salary

Note

Why this matters: This step gives a quick overview of the salary distribution, including typical pay (median), spread, and extreme values. It helps identify whether salaries are skewed and whether outliers may influence later analysis.
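
A minimal sketch, reusing df from the cleaning tasks:

```python
import matplotlib.pyplot as plt

# Typical pay, spread, and extremes
print(df["salary_in_usd"].describe())

# Histogram to check spread and skewness
plt.hist(df["salary_in_usd"], bins=50)
plt.xlabel("Salary (USD)")
plt.ylabel("Count")
plt.title("Distribution of salary_in_usd")
plt.show()
```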

4.2 Salary by experience level

Visualisation:
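
One way to produce this visualisation, assuming df and matplotlib as before:

```python
import matplotlib.pyplot as plt

# Median salary per experience level (EN -> MI -> SE -> EX if ordered in Task 3)
print(df.groupby("experience_level", observed=True)["salary_in_usd"].median())

# Boxplot shows both the trend and the spread within each level
df.boxplot(column="salary_in_usd", by="experience_level")
plt.suptitle("")
plt.title("Salary by experience level")
plt.xlabel("Experience level")
plt.ylabel("Salary (USD)")
plt.show()
```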

Note

Grouping salaries by experience level helps us understand how pay changes as professionals gain experience. This is useful for identifying career progression trends and setting realistic salary expectations.

4.3 Salary by company size

Visualisation
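
One way to produce this visualisation:

```python
import matplotlib.pyplot as plt

# Median salary per company size (S = small, M = medium, L = large)
size_median = df.groupby("company_size", observed=True)["salary_in_usd"].median()
print(size_median)

size_median.plot(kind="bar")
plt.xlabel("Company size")
plt.ylabel("Median salary (USD)")
plt.title("Median salary by company size")
plt.show()
```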

Note

Company size often reflects available resources and budget. Comparing salaries across company sizes helps assess whether larger organisations tend to offer higher compensation than smaller ones.


4.4 Impact of remote work on salary

Visualisation
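
One way to produce this visualisation; in this dataset remote_ratio is coded 0 (on-site), 50 (hybrid), and 100 (fully remote):

```python
import matplotlib.pyplot as plt

# Median salary per remote-work category
remote_median = df.groupby("remote_ratio")["salary_in_usd"].median()
print(remote_median)

remote_median.plot(kind="bar")
plt.xlabel("Remote ratio (%)")
plt.ylabel("Median salary (USD)")
plt.title("Median salary by remote ratio")
plt.show()
```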

Note

Remote work has become increasingly common. This analysis helps evaluate whether fully remote or hybrid roles are associated with higher or lower salaries compared to on-site positions.

4.5 Top 10 highest-paid job titles

Visualisation
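
One way to produce this visualisation; ranking by median salary is one reasonable definition of "highest-paid" (the mean is another):

```python
import matplotlib.pyplot as plt

# Rank job titles by median salary and keep the top 10
top10 = (df.groupby("job_title")["salary_in_usd"]
           .median()
           .sort_values(ascending=False)
           .head(10))
print(top10)

top10.plot(kind="barh")
plt.gca().invert_yaxis()  # highest-paid title at the top
plt.xlabel("Median salary (USD)")
plt.title("Top 10 highest-paid job titles")
plt.show()
```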

Note

Identifying the highest-paying job titles highlights which roles are most valued in the market. This can guide people in choosing specialisations or understanding demand.

Note

Think about: Are salary differences driven by a single factor, or a combination of experience, role, and work environment? What additional variables might help explain these patterns?

Important

Did you notice that we created visualisations for each analysis?

In real-world datasets, data is often large and complex, making it difficult to understand patterns by looking at raw tables alone. Visualisations help us:

  • Spot trends quickly
  • Compare groups easily
  • Identify outliers and unusual patterns

However, choosing the right type of visualisation is just as important.

The type of plot you use depends on:

  • What question you are trying to answer
  • The type of variables (categorical vs numerical)
  • Whether you are comparing groups, distributions, or relationships

For example:

  • Bar plots → comparing categories
  • Box plots → understanding distributions and outliers
  • Histograms → viewing spread and skewness
  • Pie charts → simple proportions (used carefully)

Good visualisation choices make analysis faster, clearer, and more meaningful.

Reflection

  • Which factor seems to have the strongest relationship with salary?
  • Do higher experience levels always guarantee higher pay?
  • Does remote work appear to increase or decrease salaries?

Task 5: Outlier Detection & Handling ⚠️

Objective

Identify extreme values in numeric columns that could skew your analysis or influence modelling. Decide whether to keep, transform, or remove them.

Important

Note: Outliers are not always “bad.” Sometimes they reflect real-world extremes, so the goal is awareness and justification, not automatic removal.

Your Task

  1. Explore numeric columns: work_year, salary, salary_in_usd, remote_ratio
  2. Detect potential outliers using boxplots and IQR method
  3. Decide for each outlier: keep, transform, or remove
  4. Reflect on how outliers could affect summary statistics, plots, and analysis conclusions

AI Prompting Exercise

Role: Data Quality Specialist

Prompt Example:

“Inspect numeric columns to identify extreme values or potential outliers. Recommend which outliers should be kept, transformed, or removed, and provide reasoning for each choice. Explain how outlier handling might affect summary statistics, visualizations, and downstream analysis.”

Example Code
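
A minimal sketch combining the visual and quantitative approaches, reusing df; the 1.5 × IQR rule used here is the conventional cutoff, not a law:

```python
import matplotlib.pyplot as plt

def iqr_outlier_mask(series):
    """Boolean mask marking values beyond 1.5 * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

for col in ["work_year", "salary", "salary_in_usd", "remote_ratio"]:
    mask = iqr_outlier_mask(df[col])
    print(f"{col}: {mask.sum()} potential outliers flagged by the IQR rule")

# Visual check on the main salary column
plt.boxplot(df["salary_in_usd"])
plt.ylabel("Salary (USD)")
plt.title("Boxplot of salary_in_usd")
plt.show()
```

The counts tell you how many points the rule flags; whether each one is kept, transformed (e.g., log-scaled), or removed is the judgement the reflection below asks you to defend.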

Tip
  • Outlier detection is critical before analysis or modelling
  • Visualization (boxplots) gives intuitive insight, while IQR is a quantitative approach
  • Decisions should be justified with evidence, not arbitrary
  • Handling outliers incorrectly can mislead conclusions or models

Reflection

  • Are all numeric columns checked for outliers?
  • Are outlier decisions documented and justified?
  • Are visualizations interpreted correctly for data quality insights?
  • Do these outliers represent real data points or likely errors/noise?
  • How could these outliers affect your mean, median, and visualizations?
  • Would you remove, transform, or keep these values? Why?
  • How could outlier treatment affect AI-assisted reasoning or modelling downstream?