Lab2_ExploreAndAnalyseData

In this lab, you will work with a cleaned dataset to practice core data analysis skills in Python. You will explore the data, calculate summary statistics, and create visualisations using pandas and matplotlib.

This lab is designed to reflect real-world data analysis: you will ask questions, check assumptions, validate results, and iterate using AI-assisted coding.

You are encouraged to write your own prompts before running any code to get the most out of AI-assisted analysis.

Dataset Description 📊

In this lab, we use the Iris dataset, a classic dataset in machine learning and data analysis. It contains 150 observations of iris flowers described by 5 variables. Four are numeric features that describe the size of the flower parts:

- sepal_length
- sepal_width
- petal_length
- petal_width

The fifth is a categorical variable (species) indicating the species of the iris (setosa, versicolor, virginica).

This dataset is clean, with no missing values, and is widely used for classification, exploratory data analysis, and visualisation exercises.

Its balanced structure (50 observations per species) and simplicity make it ideal for practising data analysis workflows while checking assumptions, calculating summary statistics, and creating visualisations.

📌 Guided AI-Assisted Workflow

Step	Task	Prompt Examples & Tips
1	Explore dataset	Role: Professional Data Analyst Prompt: Examine the Iris dataset and summarise its structure, number of rows and columns, data types, and any missing values. Clearly identify numeric and categorical variables, and explain why this step is important before analysis.
2	Understand variable relationships	Role: Junior Data Scientist Prompt: Compute the correlation matrix for all numeric variables in the Iris dataset and visualise it as a heatmap. Identify strong, weak, and negative relationships, and explain what these relationships suggest about the data.
3	Calculate summary statistics	Role: Data Analyst Prompt: Group the Iris dataset by species and calculate the mean, median, and standard deviation for each numeric feature. Explain how these statistics help distinguish between species.
4	Create visualisations	Role: Data Analysis Mentor Prompt: Create appropriate visualisations (boxplots and scatterplots) to compare numeric features across iris species. Clearly label axes and titles, and explain what patterns, differences, or outliers the visualisations reveal.
5	Interpret and validate results	Role: Data Science Tutor Prompt: Integrate insights from the correlation matrix, summary statistics, and visualisations. Explain whether the results are consistent, which features best differentiate species, and highlight any overlaps or unexpected patterns.

Task 1: Load and Explore the Dataset 🧩

Objective

Understand the structure of the cleaned dataset before performing any analysis.

Prompt Writing Exercise

Write a prompt for AI that:

Loads the cleaned dataset
Displays its structure, first few rows, and column types
Identifies numerical vs categorical variables

Example Code
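The lab does not name the cleaned file, so the following is only a minimal sketch of this step. It assumes the copy of Iris bundled with scikit-learn and renames its columns to match the names used in this lab. The helper load_iris_frame is hypothetical, not part of any library; if you have the lab's own file, replace it with pd.read_csv on that file.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame():
    """Hypothetical helper: Iris as a DataFrame with this lab's column names."""
    raw = load_iris(as_frame=True)
    df = raw.frame.rename(columns={
        "sepal length (cm)": "sepal_length",
        "sepal width (cm)": "sepal_width",
        "petal length (cm)": "petal_length",
        "petal width (cm)": "petal_width",
    })
    # The bundled target is an integer code (0, 1, 2); map it to species names.
    df["species"] = df["target"].map(dict(enumerate(map(str, raw.target_names))))
    return df.drop(columns="target")


df = load_iris_frame()

# Structure: rows and columns, first few rows, column types.
print(df.shape)
print(df.head())
print(df.dtypes)

# Missing values per column (expected to be zero for a cleaned dataset).
print(df.isna().sum())

# Separate numeric and categorical variables by dtype.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)

# Check that the species are evenly represented.
print(df["species"].value_counts())
```

Compare the printed shape, dtypes, and species counts with the dataset description before moving on; any mismatch is a sign that the wrong file or column names were used.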

Observations

The Iris dataset is clean, with no missing values and 150 samples evenly split across three species (setosa, versicolor, virginica). The four numeric features (sepal_length, sepal_width, petal_length, petal_width) vary in range and spread, providing useful distinctions between species. This makes the dataset ideal for exploring relationships, visualising patterns, and practising core data analysis techniques.

Important

Inspecting the cleaned dataset helps confirm:

Missing values are handled
Correct variable types
Next steps for analysis

Task 2: Understand Variable Relationships 🔍

Objective

Investigate how numeric variables relate to each other.

Prompt Writing Exercise

Write a prompt for AI to:

Compute correlations between numeric variables
Visualise the correlation matrix as a heatmap

Correlation Heatmap of Numerical Variables

Understanding Relationships with a Correlation Matrix 📊

When working with multiple numerical variables, it is useful to ask:

Which variables move together?
Are some variables strongly related to a feature of interest, such as petal_length?

A correlation matrix helps answer these questions.

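Correlation (here, the Pearson correlation coefficient) measures the strength and direction of a linear relationship between two numerical variables. Values range from −1 to +1:

+1 → perfect positive linear relationship
0 → little or no linear relationship
−1 → perfect negative linear relationship

Values close to +1 or −1 indicate a strong relationship; values near 0 indicate a weak one.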

A correlation matrix computes these values for all pairs of numerical variables at once.
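To make the scale concrete, here is a small sketch that computes the Pearson coefficient from its definition on three toy series. The function name pearson_r and the toy numbers are illustrative assumptions, not part of the lab dataset.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance of x and y scaled by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # deviations from the mean of x
    yc = y - y.mean()  # deviations from the mean of y
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [3, 5, 7, 9, 11])    # points on an increasing line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # points on a decreasing line
r_curve = pearson_r(x, [4, 1, 0, 1, 4])  # U-shape: related, but not linearly

print(round(r_up, 3), round(r_down, 3), round(r_curve, 3))
```

The third series is fully determined by x yet has a coefficient of zero, which is why a value near 0 should be read as "no linear relationship" rather than "no relationship".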

To make this easier to interpret, we visualise the matrix as a heatmap, where:

colour represents the strength and direction of the relationship
more intense colours usually indicate stronger correlations (the exact mapping depends on the colour map you choose)

Example Code
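A possible version of this step, using only pandas and matplotlib for the analysis and plot. It assumes the copy of Iris bundled with scikit-learn; load_iris_frame is a hypothetical helper that renames the columns to match this lab, and the coolwarm colour map is just one reasonable choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame():
    """Hypothetical helper: Iris as a DataFrame with this lab's column names."""
    raw = load_iris(as_frame=True)
    df = raw.frame.rename(columns={
        "sepal length (cm)": "sepal_length",
        "sepal width (cm)": "sepal_width",
        "petal length (cm)": "petal_length",
        "petal width (cm)": "petal_width",
    })
    df["species"] = df["target"].map(dict(enumerate(map(str, raw.target_names))))
    return df.drop(columns="target")


df = load_iris_frame()

# Pearson correlation between every pair of numeric columns.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))

# Draw the matrix as a heatmap; fixing the colour scale to [-1, 1] keeps
# zero in the middle of the diverging colour map.
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write each coefficient inside its cell so the plot can be read exactly.
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(im, ax=ax, label="Pearson correlation")
ax.set_title("Correlation Heatmap of Numerical Variables")
fig.tight_layout()
# In a notebook the figure renders automatically; in a script, call plt.show().
```

With a diverging colour map such as this one, strongly positive and strongly negative cells sit at opposite ends of the colour scale, and cells near zero appear pale.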

Interpretation of the Correlation Matrix

From the correlation matrix, we can observe the following patterns:

Strong positive correlations:

petal_length and petal_width (0.96) → Longer petals are strongly associated with wider petals.
sepal_length and petal_length (0.87) → Larger sepals tend to accompany longer petals.
sepal_length and petal_width (0.82) → Larger sepals also tend to be associated with wider petals.

Weak or negative correlations:

sepal_width and sepal_length (-0.12) → Sepal width shows almost no linear relationship with sepal length.
sepal_width and petal_length (-0.43) → Sepal width is moderately negatively correlated with petal length.
sepal_width and petal_width (-0.37) → Sepal width has a mild negative correlation with petal width.

Insights:

Petal dimensions (petal_length and petal_width) are highly correlated, meaning they tend to increase together.
Sepal length is positively associated with petal dimensions, suggesting that bigger flowers generally have both larger sepals and petals.
Sepal width behaves more independently and shows weak or negative correlations with the other features, indicating it may carry unique information for distinguishing species.

Overall, this correlation analysis helps quickly identify which numeric features are strongly related and which are relatively independent. This insight is valuable for:

Selecting variables for visualisations (scatter plots between strongly correlated features can highlight species differences).
Avoiding redundancy in modelling, since highly correlated variables (petal length & width) might provide overlapping information.
Guiding exploratory analysis, e.g., focusing on petal dimensions to separate species effectively.

Task 3: Calculate Summary Statistics 📊

Objective

Compute summary statistics to quantify patterns in the dataset and understand how the categorical variable (species) influences the numeric variables.

Your Task

Group the Iris dataset by species and calculate:

Mean, median, and standard deviation of each numeric feature
Count of observations in each species category

Inspect the results to see patterns and differences between species.

Prompt Writing Exercise

Before writing any code, write a prompt for AI that instructs it to:

Group the dataset by species
Calculate mean, median, and standard deviation for numeric variables
Return a clean summary table

Example Code
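One way to build the summary table is sketched below. It assumes the copy of Iris bundled with scikit-learn; load_iris_frame is a hypothetical helper that renames the columns to match this lab.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame():
    """Hypothetical helper: Iris as a DataFrame with this lab's column names."""
    raw = load_iris(as_frame=True)
    df = raw.frame.rename(columns={
        "sepal length (cm)": "sepal_length",
        "sepal width (cm)": "sepal_width",
        "petal length (cm)": "petal_length",
        "petal width (cm)": "petal_width",
    })
    df["species"] = df["target"].map(dict(enumerate(map(str, raw.target_names))))
    return df.drop(columns="target")


df = load_iris_frame()
numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# One row per species; for each feature, one column per statistic.
summary = df.groupby("species")[numeric_cols].agg(["mean", "median", "std"])
print(summary.round(2))

# Number of observations in each species category.
counts = df["species"].value_counts()
print(counts)
```

The result has two-level column labels, so a single value is selected with a (feature, statistic) pair, for example summary.loc["setosa", ("petal_length", "mean")].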

Interpretation of Species-wise Summary Statistics

From this summary table, we can observe:

Setosa: Smallest petals and shortest sepals (but the widest sepals), low variability in petal size
Versicolor: Medium-sized features, moderate variability
Virginica: Largest petals and longest sepals, higher variability

Sepal Dimensions:

Setosa has the shortest sepals (mean length ~5.01 cm) but the widest (mean width ~3.43 cm), so sepal width does not simply increase with overall flower size.
Versicolor and Virginica have progressively longer sepals, with Virginica the longest (mean sepal length ~6.59 cm); Virginica's mean sepal width (~2.97 cm) is still below Setosa's.
Standard deviations show moderate variation in sepal length, especially in Virginica, suggesting more diversity in sepal sizes.

Petal Dimensions:

Setosa petals are very short and narrow (mean length 1.46 cm, width 0.25 cm), which clearly separates it from the other species.
Versicolor and Virginica petals are longer and wider, with Virginica having the largest petals (mean length 5.55 cm, width 2.03 cm).
The higher standard deviations in Versicolor and Virginica indicate more variability in petal sizes compared to Setosa.

Important

Why this matters

Summary statistics provide a numerical overview of trends in the data and highlight which features best separate species. They also prepare you for visualisations, like boxplots or scatterplots, where patterns can be seen more intuitively.

Task 4: Create Visualisations 📈

Objective

Visualise how the flower measurements vary across iris species to better understand patterns, differences, and potential outliers.

Your Task

Compare distributions of sepal_length, sepal_width, petal_length, and petal_width across species using boxplots.
Explore relationships between two features (like petal_length vs petal_width) using scatterplots to observe species separation.
Optionally, use a pairplot (all numeric features vs each other) to see clusters by species.

Prompt Writing Exercise (AI-Assisted Visualisation)

Before writing any code, write a prompt for AI that:

Creates boxplots comparing each numeric feature across species
Generates scatterplots for key feature pairs (e.g., petal_length vs petal_width)
Uses clear axis labels, titles, and legends
Explains briefly what each plot is intended to show

Example Code
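The sketch below shows one way to draw these plots with matplotlib alone. It assumes the copy of Iris bundled with scikit-learn; load_iris_frame is a hypothetical helper that renames the columns to match this lab, and the axis labels assume the measurements are in centimetres.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame():
    """Hypothetical helper: Iris as a DataFrame with this lab's column names."""
    raw = load_iris(as_frame=True)
    df = raw.frame.rename(columns={
        "sepal length (cm)": "sepal_length",
        "sepal width (cm)": "sepal_width",
        "petal length (cm)": "petal_length",
        "petal width (cm)": "petal_width",
    })
    df["species"] = df["target"].map(dict(enumerate(map(str, raw.target_names))))
    return df.drop(columns="target")


df = load_iris_frame()
numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
species_names = sorted(df["species"].unique())

# Boxplots: one panel per feature, one box per species.
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, col in zip(axes.ravel(), numeric_cols):
    groups = [df.loc[df["species"] == s, col] for s in species_names]
    ax.boxplot(groups)
    ax.set_xticklabels(species_names)
    ax.set_xlabel("Species")
    ax.set_ylabel(f"{col} (cm)")
    ax.set_title(f"{col} by species")
fig.tight_layout()

# Scatterplot: petal length against petal width, one colour per species.
fig2, ax2 = plt.subplots(figsize=(6, 5))
for s in species_names:
    subset = df[df["species"] == s]
    ax2.scatter(subset["petal_length"], subset["petal_width"], label=s, alpha=0.7)
ax2.set_xlabel("Petal length (cm)")
ax2.set_ylabel("Petal width (cm)")
ax2.set_title("Petal length vs petal width by species")
ax2.legend(title="Species")
fig2.tight_layout()
# In a notebook the figures render automatically; in a script, call plt.show().
```

The boxplots are intended to compare the distribution of each feature across species, and the scatterplot is intended to show how well two petal measurements separate the species when viewed together.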

Observations

From these visualisations, you should notice:

Setosa is clearly separated from Versicolor and Virginica in both petal length and width.
Versicolor and Virginica have some overlap, but Virginica generally has larger petals.
Sepal dimensions provide some separation, but it is less distinct than for petals.
Boxplots reveal outliers in petal dimensions, especially for Versicolor and Virginica.

Insight for AI-Assisted Analysis

These visualisations confirm the patterns seen in summary statistics:

Petal measurements are the strongest species differentiators.
Sepal measurements show moderate differences.
Outliers and overlaps are easy to spot visually, guiding further analysis (e.g., feature selection for classification).

Important

Choosing the Right Visualisations 🔍

Visualisation is a crucial step in data analysis because it helps you see patterns, relationships, and anomalies that are hard to spot from tables alone. When creating plots:

- Pick relevant features: choose features that are likely to show variation between categories or have interesting relationships (e.g., petal length vs petal width clearly separates species).
- Consider plot type: use boxplots to show distributions and outliers, scatterplots for relationships between two numeric variables, and bar plots for summarising mean ± standard deviation.
- Label axes and add titles: this ensures your plots are readable and interpretable.
- Use colour and grouping wisely: colouring by species or category helps quickly distinguish groups and improves clarity.

Correct choice of visualisation ensures that your analysis communicates insights accurately and effectively, and prevents misinterpretation of the data.

Task 5: Interpret and Validate Relationships 🔍

Objective

Integrate your findings from the correlation analysis, summary statistics, and visualisations to draw meaningful insights about the Iris dataset and validate them in context.

Your Task

Review the correlation matrix from Task 2:
- Identify which numeric features are strongly correlated.
- Consider which relationships are intuitive based on your understanding of iris flowers.
Compare with summary statistics from Task 3:
- Look at the mean, median, and standard deviation for each feature by species.
- Check if these differences match what you observed in the correlation matrix.
Examine your visualisations from Task 4:
- Do scatterplots, boxplots, or barplots support the trends and differences found in the numeric summaries?
- Are there any surprising patterns or outliers?

Observations

Petal length and petal width are highly correlated (≈ 0.96), consistent with the fact that larger petals are proportionally wider.
Sepal length and petal length also show a strong positive correlation (≈ 0.87), suggesting that bigger flowers tend to have larger sepals and petals.
Sepal width shows weak or negative correlations with other features, highlighting it as less predictive on its own.
Species separation: Setosa is clearly distinct in petal length and width compared to Versicolor and Virginica, which aligns with the boxplots and summary statistics.
Outliers or variability: Versicolor and Virginica show higher standard deviation in petal and sepal dimensions, indicating more variation within these species.
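Observations like these can also be checked numerically rather than by eye. The sketch below is one possible check, not the lab's prescribed method: it compares the minimum-to-maximum range of each petal measurement between species to see whether the ranges intersect. It assumes the copy of Iris bundled with scikit-learn; load_iris_frame and ranges_overlap are hypothetical helpers.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame():
    """Hypothetical helper: Iris as a DataFrame with this lab's column names."""
    raw = load_iris(as_frame=True)
    df = raw.frame.rename(columns={
        "sepal length (cm)": "sepal_length",
        "sepal width (cm)": "sepal_width",
        "petal length (cm)": "petal_length",
        "petal width (cm)": "petal_width",
    })
    df["species"] = df["target"].map(dict(enumerate(map(str, raw.target_names))))
    return df.drop(columns="target")


df = load_iris_frame()

# Smallest and largest observed value of each petal feature per species.
ranges = df.groupby("species")[["petal_length", "petal_width"]].agg(["min", "max"])
print(ranges)


def ranges_overlap(feature, a, b):
    """True if the [min, max] intervals of `feature` for species a and b intersect."""
    lo = max(ranges.loc[a, (feature, "min")], ranges.loc[b, (feature, "min")])
    hi = min(ranges.loc[a, (feature, "max")], ranges.loc[b, (feature, "max")])
    return bool(lo <= hi)


for feature in ["petal_length", "petal_width"]:
    print(feature, "setosa vs versicolor:", ranges_overlap(feature, "setosa", "versicolor"))
    print(feature, "versicolor vs virginica:", ranges_overlap(feature, "versicolor", "virginica"))
```

If the plots and the summary statistics tell a consistent story, the Setosa ranges should not intersect those of Versicolor, while the Versicolor and Virginica ranges should. A range check is sensitive to single extreme values, so treat it as a quick sanity check alongside the boxplots rather than a replacement for them.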

Reflection

Do the numeric summaries, correlation matrix, and visualisations tell a consistent story?
Which features are most important for distinguishing species?
How might these insights inform a classification model or further exploratory analysis?

Why This Matters

Validating your findings ensures that your analysis is coherent and interpretable. It helps:

Confirm that your patterns make biological sense.
Identify features that are informative for distinguishing species.
Spot anomalies or unexpected patterns for further investigation.

This step bridges exploratory data analysis with decision-making or modelling, ensuring that insights are data-driven and reliable.