Lab2_ExploreAndAnalyseData
In this lab, you will work with a cleaned dataset to practice core data analysis skills in Python. You will explore the data, calculate summary statistics, and create visualisations using pandas and matplotlib.
This lab is designed to reflect real-world data analysis: you will ask questions, check assumptions, validate results, and iterate using AI-assisted coding.
You are encouraged to write your own prompts before running any code to get the most out of AI-assisted analysis.
Dataset Description
In this lab, we use the Iris dataset, a classic dataset in machine learning and data analysis. It contains 150 observations of iris flowers and 5 variables. Four are numeric features that describe the size of the flower parts:
- sepal_length
- sepal_width
- petal_length
- petal_width

The fifth is a categorical variable (species) indicating the species of the iris (setosa, versicolor, virginica).
This dataset is clean, with no missing values, and is widely used for classification, exploratory data analysis, and visualisation exercises.
Its balanced structure (50 observations per species) and simplicity make it ideal for practising data analysis workflows while checking assumptions, calculating summary statistics, and creating visualisations.
Guided AI-Assisted Workflow
| Step | Task | Prompt Examples & Tips |
|---|---|---|
| 1 | Explore dataset | Role: Professional Data Analyst Prompt: Examine the Iris dataset and summarise its structure, number of rows and columns, data types, and any missing values. Clearly identify numeric and categorical variables, and explain why this step is important before analysis. |
| 2 | Understand variable relationships | Role: Junior Data Scientist Prompt: Compute the correlation matrix for all numeric variables in the Iris dataset and visualise it as a heatmap. Identify strong, weak, and negative relationships, and explain what these relationships suggest about the data. |
| 3 | Calculate summary statistics | Role: Data Analyst Prompt: Group the Iris dataset by species and calculate the mean, median, and standard deviation for each numeric feature. Explain how these statistics help distinguish between species. |
| 4 | Create visualisations | Role: Data Analysis Mentor Prompt: Create appropriate visualisations (boxplots and scatterplots) to compare numeric features across iris species. Clearly label axes and titles, and explain what patterns, differences, or outliers the visualisations reveal. |
| 5 | Interpret and validate results | Role: Data Science Tutor Prompt: Integrate insights from the correlation matrix, summary statistics, and visualisations. Explain whether the results are consistent, which features best differentiate species, and highlight any overlaps or unexpected patterns. |
Task 1: Load and Explore the Dataset
Objective
Understand the structure of the cleaned dataset before performing any analysis.
Prompt Writing Exercise
Write a prompt for AI that:
- Loads the cleaned dataset
- Displays its structure, first few rows, and column types
- Identifies numerical vs categorical variables
Example Code
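The lab does not say where the cleaned file is stored, so the sketch below assumes scikit-learn's bundled copy of Iris as a stand-in. The helper name `load_iris_df` is hypothetical, and the column renaming is only there to match the variable names used in this lab. If you have your own cleaned CSV, `pd.read_csv` on that file would replace the helper.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_df() -> pd.DataFrame:
    """Hypothetical helper: Iris as a DataFrame with the lab's column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.data.copy()
    # scikit-learn uses names like "sepal length (cm)"; rename to match the lab.
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    # Map the integer target codes (0, 1, 2) to species names.
    names = dict(enumerate(bunch.target_names))
    df["species"] = bunch.target.map(names).astype("category")
    return df


df = load_iris_df()

# Structure: size, first rows, column types, missing values.
print(df.shape)
print(df.head())
print(df.dtypes)
print(df.isna().sum())

# Split columns into numeric and categorical by dtype.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)

# Check how many observations each species has.
print(df["species"].value_counts())
```

Checking `isna().sum()` and `value_counts()` yourself, rather than trusting the dataset description, is the "check assumptions" habit this lab is trying to build.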
Observations
The Iris dataset is clean, with no missing values and 150 samples evenly split across three species (setosa, versicolor, virginica). The four numeric features (sepal_length, sepal_width, petal_length, petal_width) vary in range and spread, providing useful distinctions between species. This makes the dataset ideal for exploring relationships, visualising patterns, and practising core data analysis techniques.
Inspecting the cleaned dataset helps confirm that:
- Missing values have been handled
- Variable types are correct
- The planned next steps for analysis make sense
Task 2: Understand Variable Relationships
Objective
Investigate how numeric variables relate to each other.
Prompt Writing Exercise
Write a prompt for AI to:
- Compute correlations between numeric variables
- Visualise the correlation matrix as a heatmap
Correlation Heatmap of Numerical Variables
When working with multiple numerical variables, it is useful to ask:
- Which variables move together?
- Are some variables strongly related to a feature of interest, such as petal length?

A correlation matrix helps answer these questions.

Correlation measures the strength and direction of a linear relationship between two numerical variables. Values range from:
- +1 → perfect positive linear relationship
- 0 → little or no linear relationship
- −1 → perfect negative linear relationship
A correlation matrix computes these values for all pairs of numerical variables at once.
To make this easier to interpret, we visualise the matrix as a heatmap, where:
- colour represents the strength of the relationship
- darker colours indicate stronger correlations
Example Code
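A minimal sketch of the correlation matrix and heatmap, using only pandas and matplotlib. It assumes scikit-learn's bundled Iris copy as the data source. `load_iris_df` and `plot_corr_heatmap` are hypothetical helper names, and the `coolwarm` colour map is just one reasonable choice for values between −1 and +1.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_df() -> pd.DataFrame:
    """Hypothetical helper: Iris as a DataFrame with the lab's column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.data.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    names = dict(enumerate(bunch.target_names))
    df["species"] = bunch.target.map(names).astype("category")
    return df


def plot_corr_heatmap(corr: pd.DataFrame):
    """Draw a correlation matrix as an annotated heatmap and return the figure."""
    fig, ax = plt.subplots(figsize=(6, 5))
    # Fix the colour scale to [-1, 1] so colours are comparable across datasets.
    im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    # Write each coefficient inside its cell.
    for i in range(corr.shape[0]):
        for j in range(corr.shape[1]):
            ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
    fig.colorbar(im, ax=ax, label="Pearson correlation")
    ax.set_title("Correlation Heatmap of Numerical Variables")
    fig.tight_layout()
    return fig


df = load_iris_df()

# Pearson correlation between every pair of numeric columns.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))

# In a notebook the figure renders inline; in a script, call plt.show().
heatmap_fig = plot_corr_heatmap(corr)
```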
Interpretation of the Correlation Matrix:
From the correlation matrix, we can observe the following patterns:
Strong positive correlations:
- petal_length and petal_width (0.96): Longer petals are strongly associated with wider petals.
- sepal_length and petal_length (0.87): Larger sepals tend to accompany longer petals.
- sepal_length and petal_width (0.82): Larger sepals also tend to be associated with wider petals.

Weak or negative correlations:
- sepal_width and sepal_length (-0.12): Sepal width shows almost no linear relationship with sepal length.
- sepal_width and petal_length (-0.43): Sepal width is moderately negatively correlated with petal length.
- sepal_width and petal_width (-0.37): Sepal width has a mild negative correlation with petal width.
Insights:
- Petal dimensions (petal_length and petal_width) are highly correlated, meaning they tend to increase together.
- Sepal length is positively associated with petal dimensions, suggesting that bigger flowers generally have both larger sepals and petals.
- Sepal width behaves more independently and shows weak or negative correlations with the other features, indicating it may carry unique information for distinguishing species.
Overall, this correlation analysis helps quickly identify which numeric features are strongly related and which are relatively independent. This insight is valuable for:
- Selecting variables for visualisations (scatter plots between strongly correlated features can highlight species differences).
- Avoiding redundancy in modelling, since highly correlated variables (petal length & width) might provide overlapping information.
- Guiding exploratory analysis, e.g., focusing on petal dimensions to separate species effectively.
Task 3: Calculate Summary Statistics
Objective
Compute summary statistics to quantify patterns in the dataset and understand how the categorical variable (species) influences the numeric variables.
Your Task
Group the Iris dataset by species and calculate:
- Mean, median, and standard deviation of each numeric feature
- Count of observations in each species category
Inspect the results to see patterns and differences between species.
Prompt Writing Exercise
Before writing any code, write a prompt for AI that instructs it to:
- Group the dataset by species
- Calculate mean, median, and standard deviation for numeric variables
- Return a clean summary table
Example Code
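One way to build the species-wise summary table with `groupby` and `agg`. The data source (scikit-learn's bundled Iris copy) and the helper name `load_iris_df` are assumptions made for this sketch, since the lab does not name a file.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_df() -> pd.DataFrame:
    """Hypothetical helper: Iris as a DataFrame with the lab's column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.data.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    names = dict(enumerate(bunch.target_names))
    df["species"] = bunch.target.map(names).astype("category")
    return df


df = load_iris_df()
numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# One row per species; columns are (feature, statistic) pairs.
grouped = df.groupby("species", observed=True)
summary = grouped[numeric_cols].agg(["mean", "median", "std"]).round(2)

# Number of observations in each species category.
counts = grouped.size()

print(summary)
print(counts)
```

The result has two-level column labels, so a single value can be read with, for example, `summary.loc["setosa", ("petal_length", "mean")]`.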
Interpretation of Species-wise Summary Statistics
From this summary table, we can observe:
- Setosa: Smallest petals and shortest sepals (but the widest sepals on average), with low variability in most features
- Versicolor: Medium-sized petals and sepal length, moderate variability
- Virginica: Largest petals and longest sepals, higher variability
Sepal Dimensions:
- Setosa has the shortest sepals (mean length ~5.01 cm) but the widest (mean width ~3.43 cm). Its sepal length varies little, indicating relatively uniform sepal lengths.
- Sepal length increases from Versicolor to Virginica, with Virginica the longest (mean sepal length ~6.59 cm). Virginica's mean sepal width (~2.97 cm) is still narrower than Setosa's.
- Standard deviations of sepal length show moderate variation, especially in Virginica, suggesting more diversity in sepal sizes.
Petal Dimensions:
- Setosa petals are very short and narrow (mean length 1.46 cm, width 0.25 cm), which clearly separates it from the other species.
- Versicolor and Virginica petals are longer and wider, with Virginica having the largest petals (mean length 5.55 cm, width 2.03 cm).
- The higher standard deviations in Versicolor and Virginica indicate more variability in petal sizes compared to Setosa.
Why this matters
Summary statistics provide a numerical overview of trends in the data and highlight which features best separate species. They also prepare you for visualisations, like boxplots or scatterplots, where patterns can be seen more intuitively.
Task 4: Create Visualisations
Objective
Visualise how the flower measurements vary across iris species to better understand patterns, differences, and potential outliers.
Your Task
- Compare distributions of sepal_length, sepal_width, petal_length, and petal_width across species using boxplots.
- Explore relationships between two features (like petal_length vs petal_width) using scatterplots to observe species separation.
- Optionally, use a pairplot (all numeric features vs each other) to see clusters by species.
Prompt Writing Exercise (AI-Assisted Visualisation)
Before writing any code, write a prompt for AI that:
- Creates boxplots comparing each numeric feature across species
- Generates scatterplots for key feature pairs (e.g., petal_length vs petal_width)
- Uses clear axis labels, titles, and legends
- Explains briefly what each plot is intended to show
Example Code
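A sketch of the boxplots and a petal scatterplot using matplotlib directly. It assumes scikit-learn's bundled Iris copy as the data source. `load_iris_df`, `plot_boxplots`, and `plot_petal_scatter` are hypothetical helper names, and the figure sizes and transparency are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_df() -> pd.DataFrame:
    """Hypothetical helper: Iris as a DataFrame with the lab's column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.data.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    names = dict(enumerate(bunch.target_names))
    df["species"] = bunch.target.map(names).astype("category")
    return df


def plot_boxplots(df: pd.DataFrame, features: list):
    """One boxplot panel per feature, with one box per species."""
    species = [str(s) for s in df["species"].unique()]
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    for ax, feature in zip(axes.ravel(), features):
        data = [df.loc[df["species"] == s, feature].to_numpy() for s in species]
        ax.boxplot(data)
        # Boxes sit at x = 1, 2, 3; label them with the species names.
        ax.set_xticks(range(1, len(species) + 1))
        ax.set_xticklabels(species)
        ax.set_xlabel("Species")
        ax.set_ylabel(f"{feature} (cm)")
        ax.set_title(f"{feature} by species")
    fig.tight_layout()
    return fig


def plot_petal_scatter(df: pd.DataFrame):
    """Petal length vs petal width, coloured by species."""
    fig, ax = plt.subplots(figsize=(6, 5))
    for name, group in df.groupby("species", observed=True):
        ax.scatter(group["petal_length"], group["petal_width"],
                   label=str(name), alpha=0.7)
    ax.set_xlabel("Petal length (cm)")
    ax.set_ylabel("Petal width (cm)")
    ax.set_title("Petal length vs petal width by species")
    ax.legend(title="Species")
    fig.tight_layout()
    return fig


df = load_iris_df()
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# In a notebook the figures render inline; in a script, call plt.show().
box_fig = plot_boxplots(df, features)
scatter_fig = plot_petal_scatter(df)
```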
Observations
From these visualisations, students should notice:
- Setosa is clearly separated from Versicolor and Virginica in both petal length and width.
- Versicolor and Virginica have some overlap, but Virginica generally has larger petals.
- Sepal dimensions provide some separation, but it is less distinct than for petals.
- Boxplots flag a few points as outliers; record which species and features they occur in before interpreting them.
Insight for AI-Assisted Analysis
These visualisations confirm the patterns seen in summary statistics:
- Petal measurements are the strongest species differentiators.
- Sepal measurements show moderate differences.
- Outliers and overlaps are easy to spot visually, guiding further analysis (e.g., feature selection for classification).
Choosing the Right Visualisations
Visualisation is a crucial step in data analysis because it helps you see patterns, relationships, and anomalies that are hard to spot from tables alone. When creating plots:
- Pick relevant features: Choose features that are likely to show variation between categories or have interesting relationships (e.g., petal length vs petal width clearly separates species).
- Consider plot type: Use boxplots to show distributions and outliers, scatterplots for relationships between two numeric variables, and bar plots for summarising mean ± standard deviation.
- Label axes and add titles: This ensures your plots are readable and interpretable.
- Use colour and grouping wisely: Colouring by species or category helps quickly distinguish groups and improves clarity.
Correct choice of visualisation ensures that your analysis communicates insights accurately and effectively, and prevents misinterpretation of the data.
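As an illustration of the bar-plot option, the sketch below plots the mean petal length per species with error bars of one standard deviation. The data source (scikit-learn's bundled Iris copy), the helper name `load_iris_df`, and the choice of petal_length are assumptions for this example only.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_df() -> pd.DataFrame:
    """Hypothetical helper: Iris as a DataFrame with the lab's column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.data.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    names = dict(enumerate(bunch.target_names))
    df["species"] = bunch.target.map(names).astype("category")
    return df


df = load_iris_df()

# Mean and standard deviation of petal length for each species.
stats = df.groupby("species", observed=True)["petal_length"].agg(["mean", "std"])

fig, ax = plt.subplots(figsize=(6, 4))
# Bar height is the mean; the error bar spans mean ± one standard deviation.
ax.bar(list(stats.index.astype(str)), stats["mean"].to_numpy(),
       yerr=stats["std"].to_numpy(), capsize=4)
ax.set_xlabel("Species")
ax.set_ylabel("Petal length (cm)")
ax.set_title("Mean petal length by species (± 1 standard deviation)")
fig.tight_layout()
```

A bar plot like this hides the shape of each distribution, so it works best alongside the boxplots rather than instead of them.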
Task 5: Interpret and Validate Relationships
Objective
Integrate your findings from the correlation analysis, summary statistics, and visualisations to draw meaningful insights about the Iris dataset and validate them in context.
Your Task
Review the correlation matrix from Task 2:
- Identify which numeric features are strongly correlated.
- Consider which relationships are intuitive based on your understanding of iris flowers.
Compare with summary statistics from Task 3:
- Look at the mean, median, and standard deviation for each feature by species.
- Check if these differences match what you observed in the correlation matrix.
Examine your visualisations from Task 4:
- Do scatterplots, boxplots, or barplots support the trends and differences found in the numeric summaries?
- Are there any surprising patterns or outliers?
Observations
- Petal length and petal width are highly correlated (≈ 0.96), consistent with the fact that larger petals are proportionally wider.
- Sepal length and petal length also show a strong positive correlation (≈ 0.87), suggesting that bigger flowers tend to have larger sepals and petals.
- Sepal width shows weak or negative correlations with other features, highlighting it as less predictive on its own.
- Species separation: Setosa is clearly distinct in petal length and width compared to Versicolor and Virginica, which aligns with the boxplots and summary statistics.
- Outliers or variability: Versicolor and Virginica show higher standard deviations than Setosa in petal dimensions and sepal length, indicating more variation within these species.
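One simple way to validate the species-separation observation numerically is to compare the minimum and maximum petal length of each species and test whether the ranges intersect. This is a sketch under assumptions: the data comes from scikit-learn's bundled Iris copy, and `load_iris_df` and `ranges_overlap` are hypothetical helpers, not part of the lab.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_df() -> pd.DataFrame:
    """Hypothetical helper: Iris as a DataFrame with the lab's column names."""
    bunch = load_iris(as_frame=True)
    df = bunch.data.copy()
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    names = dict(enumerate(bunch.target_names))
    df["species"] = bunch.target.map(names).astype("category")
    return df


def ranges_overlap(ranges: pd.DataFrame, a: str, b: str) -> bool:
    """True if the [min, max] intervals of species a and b intersect."""
    return bool(
        ranges.loc[a, "min"] <= ranges.loc[b, "max"]
        and ranges.loc[b, "min"] <= ranges.loc[a, "max"]
    )


df = load_iris_df()

# Smallest and largest petal length observed in each species.
ranges = df.groupby("species", observed=True)["petal_length"].agg(["min", "max"])
print(ranges)

# Compare neighbouring species: no overlap means a clean split on this feature.
for a, b in [("setosa", "versicolor"), ("versicolor", "virginica")]:
    print(f"{a} vs {b}: overlap = {ranges_overlap(ranges, a, b)}")
```

If the numeric check and the boxplots disagree, that is a signal to revisit the plot or the code before drawing conclusions.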
Reflection
- Do the numeric summaries, correlation matrix, and visualisations tell a consistent story?
- Which features are most important for distinguishing species?
- How might these insights inform a classification model or further exploratory analysis?
Validating your findings ensures that your analysis is coherent and interpretable. It helps:
- Confirm that your patterns make biological sense.
- Identify features that are informative for distinguishing species.
- Spot anomalies or unexpected patterns for further investigation.
This step bridges exploratory data analysis with decision-making or modelling, ensuring that insights are data-driven and reliable.