Lab 1: Data Cleaning and Preprocessing
In this lab, you’ll analyze the Data Science Salaries 2023 dataset using Pandas for data manipulation and Matplotlib for visualization. This dataset contains real-world salary data for data professionals across multiple countries, company sizes, and experience levels.
🔗 Dataset Source: Data Science Salaries 2023 on GitHub
Dataset Overview: The dataset has 3755 entries and 11 columns, providing insights into salaries, job roles, and company attributes.
Why this matters:
Understanding the structure of the dataset before analysis is crucial. Correctly identifying categorical and numeric variables helps with:
- Accurate grouping, summarizing, and visualizations
- Proper AI-assisted prompts for data cleaning or feature engineering
- Avoiding mistakes in summary statistics or correlations
Pay attention to columns that may require conversion to categorical types or ordered factors, especially experience_level or company_size, to ensure correct interpretation in analyses.
Objective 🎯
Real-world data is rarely clean. Before analysis or modelling, datasets must be cleaned, corrected, and validated.
In this lab, you will learn how to:
- Identify data quality issues
- Handle missing values and duplicates
- Correct data types
- Engineer useful features
- Detect and reason about outliers
- Use AI responsibly to assist each step
Poor data quality can:
- Bias results
- Mislead models
- Produce incorrect conclusions
Garbage in = garbage out
📌 Guided AI-Assisted Workflow
| Step | Task | Prompt Examples & Tips |
|---|---|---|
| 1 | Explore dataset | Role: Professional Data Analyst Prompt: Examine this dataset and provide a summary of structure, missing values, duplicates, and columns that may need cleaning. Explain your reasoning. |
| 2 | Handle missing values & duplicates | Role: Professional Data Scientist Prompt: Suggest strategies to handle missing values or duplicates. Recommend removal, imputation, or leave-as-is, and explain the trade-offs. |
| 3 | Validate data types & categorical levels | Role: Data Analysis Expert Prompt: Inspect the dataset and identify numeric vs categorical columns. Decide if categorical columns should be converted or ordered, and explain the impact on analysis and visualisations. |
| 4 | Feature engineering | Role: Data Analyst Mentor Prompt: Suggest meaningful derived features or transformations that highlight patterns and improve interpretability. Explain why each feature is useful for analysis or modelling. |
| 5 | Outlier detection & handling | Role: Data Quality Specialist Prompt: Detect extreme values in numeric columns using visual and statistical methods. Decide whether to keep, transform, or remove each outlier and justify your decision. |
Task 1: Load Dataset & Identify Issues 🕵️
Objective
Understand the raw dataset and identify potential data issues.
We will use the Data Science Salaries 2023 dataset for this lab.
Your Task
Load the dataset
Inspect:
- First rows
- Data types
- Missing values
- Duplicates
Prompt Writing Exercise
Write a prompt asking an AI to:
- Load the dataset
- Display structure
- Identify missing values and duplicates
- Highlight columns needing attention
Example Code
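A minimal sketch of what this could look like, assuming the CSV has been downloaded locally as `ds_salaries.csv` (adjust the path to wherever you saved the file):

```python
import pandas as pd

# Assumed local filename; replace with your own path to the dataset
df = pd.read_csv("ds_salaries.csv")

# First rows and overall structure
print(df.head())
print(df.info())

# Missing values per column
print(df.isnull().sum())

# Number of exact duplicate rows
print("Duplicates:", df.duplicated().sum())
```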
Reflection
- Which columns contain missing values?
- Are there duplicates?
- Which columns need cleaning before analysis?
Task 2: Handle Missing Values & Duplicates 💧
Objective
Clean the dataset by handling duplicate rows and validating whether missing-value treatment is required.
Cleaning decisions must be driven by evidence from data inspection, not assumptions.
Your Task
- Remove duplicate rows
- Decide whether missing values require treatment
- Justify your decisions
Prompt Writing Exercise
Role: Professional Data Scientist
Task: Clean this dataset by:
1. Detecting and removing duplicate rows
2. Checking for missing values in each column
3. Suggesting an appropriate strategy for handling missing values (e.g., remove, impute, or leave as-is)
4. Justifying each cleaning decision based on the dataset
Instructions for AI:
- Provide a step-by-step explanation for duplicates and missing values.
- Explain the impact of each cleaning step on subsequent analysis.
- Highlight any potential risks if cleaning is done incorrectly.
Code Implementation (after you try)
Duplicate rows are simply observations that repeat existing records, so the standard treatment is to drop them entirely from the dataset.
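One possible implementation, continuing from the `df` loaded in Task 1:

```python
# Compare row counts before and after dropping duplicates
print("Rows before:", len(df))
print("Duplicate rows:", df.duplicated().sum())

df = df.drop_duplicates()

print("Rows after:", len(df))

# Confirm whether any missing values remain
print(df.isnull().sum())
```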
Reflection
- Were duplicate rows removed?
- Does the cleaned dataset have fewer rows than before?
- Are there any missing values remaining?
Output interpretation:
- Duplicate rows detected and removed
- No missing values present
- No imputation is required for this dataset
Duplicate rows represent repeated information and can:
- Bias summary statistics
- Over-represent certain job roles or salaries
- Mislead downstream analysis
👉 Duplicates should be removed completely unless there is a strong reason to keep them.
What If Missing Values Existed? (Illustrative Only)
The dataset contains no missing values, but the following examples show how missing data would be handled if present.
These lines are commented out intentionally to reinforce decision-based cleaning.
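An illustrative sketch of the kind of commented-out code this refers to (the column names are just examples from this dataset; nothing here should be executed because no values are actually missing):

```python
# Illustrative only: this dataset has no missing values, so these lines stay commented out.

# Row-wise removal (acceptable when only a small fraction of rows are affected):
# df = df.dropna(subset=["salary_in_usd"])

# Mean imputation (roughly symmetric numeric data):
# df["salary_in_usd"] = df["salary_in_usd"].fillna(df["salary_in_usd"].mean())

# Median imputation (skewed numeric data or outliers):
# df["salary_in_usd"] = df["salary_in_usd"].fillna(df["salary_in_usd"].median())

# Mode imputation (categorical variables):
# df["employment_type"] = df["employment_type"].fillna(df["employment_type"].mode()[0])
```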
The strategy used to handle missing values depends on:
- Proportion of missing data
  - Small proportion (< 5%) → row-wise removal may be acceptable
- Risk of data loss
  - Large proportion (20–50%) → imputation helps preserve statistical power
- Type of variable
  - Mean: roughly symmetric numeric data
  - Median: skewed numeric data or outliers
  - Mode: categorical variables
Post-cleaning Reflection
- Compare summaries before and after cleaning
- Ensure distributions are not distorted
In this lab, no imputation is performed because the dataset contains no missing values.
Cleaning is not about applying every possible method.
It is about checking, deciding, and justifying each action.
Task 3: Validate Data Types & Categorical Levels 🔢
Objective
Before performing analysis, ensure that data types are correct and categorical levels make sense. This helps avoid mistakes in summaries, plots, and correlations later.
Note: We don’t always convert columns automatically — sometimes the raw object type is fine for analysis. The goal here is to confirm that the dataset is ready for meaningful exploration.
Your Task
- Inspect categorical columns (`experience_level`, `employment_type`, `job_title`, `salary_currency`, `employee_residence`, `company_location`, `company_size`)
- Inspect numeric columns (`work_year`, `salary`, `salary_in_usd`, `remote_ratio`)
- Decide if any categorical columns should be converted to `category` type, and if any should be ordered (e.g., `experience_level`: EN → MI → SE → EX)
- Check the levels/categories of all categorical variables
AI Prompting Exercise
Role: Data Analysis Expert Prompt Example:
“Inspect this dataset and identify which columns are numeric and which are categorical. Suggest which categorical columns should be converted to category type and which should be ordered. Explain how these decisions impact data summaries, visualizations, and analysis.”
Example Code
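A possible sketch, assuming `df` from the previous tasks; the orderings used (EN → MI → SE → EX for experience level, S → M → L for company size) follow the sequence suggested above:

```python
import pandas as pd

# Current data types
print(df.dtypes)

# Levels of each object (string) column
for col in df.select_dtypes(include="object").columns:
    print(col, "->", sorted(df[col].unique()))

# Optional conversions: ordered categories where a logical sequence exists
df["experience_level"] = pd.Categorical(
    df["experience_level"], categories=["EN", "MI", "SE", "EX"], ordered=True
)
df["company_size"] = pd.Categorical(
    df["company_size"], categories=["S", "M", "L"], ordered=True
)

print(df["experience_level"].dtype)
```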
Discussion Questions
- Which columns must be numeric to allow calculations or correlations?
- Which columns are naturally categorical and why?
- Should `experience_level` be ordered or left as unordered? Why?
- Are there any object columns that you would leave as-is?
- How might incorrect types affect visualizations or summary statistics?
- Converting object columns to categorical is optional, but it improves plotting, grouping, and modelling
- Ordered categories are useful when there is a logical sequence
- Validating types before analysis prevents subtle errors in summary statistics or correlations
Reflection
- Are categorical columns properly identified?
- Are numeric columns correctly typed?
- Does `experience_level` have the correct order if you chose to order it?
Task 4: Exploratory Data Analysis (EDA)
Objective
Explore how salaries vary by experience level, job role, company size, and remote work.
Your Task
- Compute summary statistics for salary_in_usd
- Explore salary by experience_level, company_size, remote_ratio, and job_title
- Visualize trends using bar plots, boxplots, or histograms
- Reflect on patterns, trends, or anomalies in the dataset
AI Prompting Exercise
Role: Data Analyst Mentor Prompt Example:
“Analyze the dataset to explore how salaries vary by experience level, company size, remote ratio, and job title. Provide summary statistics and suggest appropriate visualizations. Highlight key trends, outliers, or anomalies, and explain their possible implications for understanding compensation patterns.”
4.1 Summary statistics for salary
Why this matters: This step gives a quick overview of the salary distribution, including typical pay (median), spread, and extreme values. It helps identify whether salaries are skewed and whether outliers may influence later analysis.
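For example, with the cleaned `df` from the earlier tasks:

```python
# Overall distribution of salary in USD
print(df["salary_in_usd"].describe())

# Median and skewness summarise typical pay and asymmetry
print("Median:", df["salary_in_usd"].median())
print("Skewness:", df["salary_in_usd"].skew())
```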
4.2 Salary by experience level
Visualisation
Grouping salaries by experience level helps us understand how pay changes as professionals gain experience. This is useful for identifying career progression trends and setting realistic salary expectations.
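One way to sketch this with pandas and Matplotlib (assuming `df` from earlier):

```python
import matplotlib.pyplot as plt

# Median salary per experience level
print(df.groupby("experience_level")["salary_in_usd"].median())

# Boxplot of salary by experience level
df.boxplot(column="salary_in_usd", by="experience_level")
plt.suptitle("")  # drop the automatic grouped-boxplot title
plt.title("Salary (USD) by experience level")
plt.xlabel("Experience level")
plt.ylabel("Salary in USD")
plt.show()
```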
4.3 Salary by company size
Visualisation
Company size often reflects available resources and budget. Comparing salaries across company sizes helps assess whether larger organisations tend to offer higher compensation than smaller ones.
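A possible sketch (again assuming `df` from earlier):

```python
import matplotlib.pyplot as plt

# Median salary by company size (S = small, M = medium, L = large)
median_by_size = df.groupby("company_size")["salary_in_usd"].median()
print(median_by_size)

median_by_size.plot(kind="bar")
plt.title("Median salary (USD) by company size")
plt.xlabel("Company size")
plt.ylabel("Median salary in USD")
plt.show()
```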
4.4 Impact of remote work on salary
Visualisation
Remote work has become increasingly common. This analysis helps evaluate whether fully remote or hybrid roles are associated with higher or lower salaries compared to on-site positions.
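A possible sketch; in this dataset `remote_ratio` is recorded as a percentage, so the comparison groups fall out of its distinct values:

```python
import matplotlib.pyplot as plt

# Median salary for each remote_ratio value (e.g., 0 = on-site, 50 = hybrid, 100 = fully remote)
print(df.groupby("remote_ratio")["salary_in_usd"].median())

df.boxplot(column="salary_in_usd", by="remote_ratio")
plt.suptitle("")
plt.title("Salary (USD) by remote ratio")
plt.xlabel("Remote ratio (%)")
plt.ylabel("Salary in USD")
plt.show()
```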
4.5 Top 10 highest-paid job titles
Visualisation
Identifying the highest-paying job titles highlights which roles are most valued in the market. This can guide people in choosing specialisations or understanding demand.
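One way to sketch this, ranking job titles by median salary (assuming `df` from earlier):

```python
import matplotlib.pyplot as plt

# Top 10 job titles by median salary in USD
top10 = (
    df.groupby("job_title")["salary_in_usd"]
    .median()
    .sort_values(ascending=False)
    .head(10)
)
print(top10)

top10.sort_values().plot(kind="barh")
plt.title("Top 10 highest-paid job titles (median salary in USD)")
plt.xlabel("Median salary in USD")
plt.ylabel("Job title")
plt.show()
```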
Think about: Are salary differences driven by a single factor, or a combination of experience, role, and work environment? What additional variables might help explain these patterns?
Did you notice that we created visualisations for each analysis?
In real-world datasets, data is often large and complex, making it difficult to understand patterns by looking at raw tables alone. Visualisations help us:
- Spot trends quickly
- Compare groups easily
- Identify outliers and unusual patterns
However, choosing the right type of visualisation is just as important.
The type of plot you use depends on:
- What question you are trying to answer
- The type of variables (categorical vs numerical)
- Whether you are comparing groups, distributions, or relationships
For example:
- Bar plots → comparing categories
- Box plots → understanding distributions and outliers
- Histograms → viewing spread and skewness
- Pie charts → simple proportions (used carefully)
Good visualisation choices make analysis faster, clearer, and more meaningful.
Reflection
- Which factor seems to have the strongest relationship with salary?
- Do higher experience levels always guarantee higher pay?
- Does remote work appear to increase or decrease salaries?
Task 5: Outlier Detection & Handling ⚠️
Objective
Identify extreme values in numeric columns that could skew your analysis or influence modelling. Decide whether to keep, transform, or remove them.
Note: Outliers are not always “bad.” Sometimes they reflect real-world extremes, so the goal is awareness and justification, not automatic removal.
Your Task
- Explore numeric columns: `work_year`, `salary`, `salary_in_usd`, `remote_ratio`
- Detect potential outliers using boxplots and the IQR method
- Decide for each outlier: keep, transform, or remove
- Reflect on how outliers could affect summary statistics, plots, and analysis conclusions
AI Prompting Exercise
Role: Data Quality Specialist Prompt Example:
“Inspect numeric columns to identify extreme values or potential outliers. Recommend which outliers should be kept, transformed, or removed, and provide reasoning for each choice. Explain how outlier handling might affect summary statistics, visualizations, and downstream analysis.”
Example Code
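A sketch combining a visual check with the IQR rule (assuming `df` from the earlier tasks):

```python
import matplotlib.pyplot as plt

# Visual check: boxplot of the salary column
df.boxplot(column="salary_in_usd")
plt.title("Boxplot of salary_in_usd")
plt.show()

# Quantitative check: 1.5 * IQR rule for each numeric column
for col in ["work_year", "salary", "salary_in_usd", "remote_ratio"]:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((df[col] < lower) | (df[col] > upper)).sum()
    print(f"{col}: {n_outliers} potential outliers outside [{lower:,.0f}, {upper:,.0f}]")
```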
- Outlier detection is critical before analysis or modelling
- Visualization (boxplots) gives intuitive insight, while IQR is a quantitative approach
- Decisions should be justified with evidence, not arbitrary
- Handling outliers incorrectly can mislead conclusions or models
Reflection
- Are all numeric columns checked for outliers?
- Are outlier decisions documented and justified?
- Are visualizations interpreted correctly for data quality insights?
- Do these outliers represent real data points or likely errors/noise?
- How could these outliers affect your mean, median, and visualizations?
- Would you remove, transform, or keep these values? Why?
- How could outlier treatment affect AI-assisted reasoning or modelling downstream?