Unit 3 · Data Analysis

📉 Scatterplots & Correlation

How to tell if two variables are related, and how strong that relationship is. This is the foundation of all bivariate data analysis.

What even is this?

Sometimes you have two sets of data, like training hours and goals scored, and you want to know: is one related to the other? A scatterplot lets you see the relationship visually. Pearson's r gives you a number that measures how strong a linear relationship is and which direction it goes.

In the exam you'll be asked to describe what you see in a scatterplot, interpret an r value, and sometimes make predictions. This page covers all of that.

1. The Two Variables

In any bivariate (two-variable) study, one variable explains or predicts the other. You need to know which is which, because it determines which axis each goes on.

📌 Explanatory Variable

The one doing the explaining. Goes on the x-axis. Also called the independent variable.

Example: training hours per week

🎯 Response Variable

The one responding to changes. Goes on the y-axis. Also called the dependent variable.

Example: goals scored per season

💡 Quick tip: Ask yourself: which variable might cause or predict the other? That's your explanatory variable (x). The one that changes as a result is your response variable (y).

2. Describing a Scatterplot

When you describe an association in a scatterplot, cover these four things:

Direction

Positive: as x increases, y increases
Negative: as x increases, y decreases
No correlation: no clear pattern

Form

Linear: points follow a straight line
Non-linear: points follow a curve
(If non-linear, a regression line may not be appropriate)

Strength

Strong: points cluster tightly around the line
Moderate: noticeable trend but spread out
Weak: barely any pattern visible

Outliers

Any points that don't fit the overall pattern. Mention them if they're clearly visible, and note that they can affect the correlation coefficient.

Examples of different associations:

Strong Positive: tight, linear, upward
Moderate Negative: downward, but scattered
No Correlation: random, no pattern

✏️ In the exam, a full description might look like: "There is a strong, positive, linear association between training hours and goals scored, with no obvious outliers."

3. Pearson's r: The Correlation Coefficient

Pearson's r is a single number that measures the strength and direction of a linear association. Your calculator does the hard work; you just need to know how to read it.

−1 ≤ r ≤ +1
[Number line from −1 to +1: negative correlation to the left of 0, no correlation at 0, positive correlation to the right]

r value | Strength | Direction
r = 1 | Perfect | Positive
0.75 ≤ r < 1 | Strong | Positive
0.5 ≤ r < 0.75 | Moderate | Positive
0 < r < 0.5 | Weak | Positive
r = 0 | No linear correlation
−0.5 < r < 0 | Weak | Negative
−0.75 < r ≤ −0.5 | Moderate | Negative
−1 < r ≤ −0.75 | Strong | Negative
r = −1 | Perfect | Negative

⚠️ Correlation ≠ Causation. Just because two things are strongly correlated doesn't mean one causes the other. There might be a third variable involved, or it could just be coincidence. Always interpret carefully.
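
The bands in the classification table above can be sketched as a small Python function. This is only an illustration of how to read the table, not something you need for the exam; the function name `classify_r` is made up for this example.

```python
def classify_r(r: float) -> str:
    """Return the strength and direction label for a Pearson's r value,
    using the bands from the classification table."""
    if not -1 <= r <= 1:
        raise ValueError("Pearson's r must lie between -1 and +1")
    if r == 0:
        return "no linear correlation"

    # Direction comes from the sign; strength from the distance from 0.
    direction = "positive" if r > 0 else "negative"
    size = abs(r)

    if size == 1:
        strength = "perfect"
    elif size >= 0.75:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction}"


print(classify_r(0.99))   # strong positive
print(classify_r(-0.83))  # strong negative
print(classify_r(0.42))   # weak positive
```

Note that the sign is only used for the direction: −0.83 and 0.83 land in the same strength band.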

4. Worked Example
Training & Goals
A sports analyst records the weekly training sessions and goals scored per season for 8 strikers.
Player | A | B | C | D | E | F | G | H
Training sessions/week (x) | 2 | 3 | 4 | 4 | 5 | 6 | 7 | 8
Goals per season (y) | 7 | 9 | 12 | 15 | 16 | 20 | 22 | 25

The scatterplot shows a clear upward trend. The calculator gives r = 0.99. Describe and interpret the association.

Step 1: Identify the variables

The analyst is using training sessions to predict goals scored.

Explanatory (x): training sessions per week
Response (y): goals per season

Step 2: Describe the scatterplot

Looking at the plot: the points go upward left to right (positive), they follow a straight-line pattern (linear), and they cluster tightly around that line (strong). No obvious outliers.

Step 3: Interpret r = 0.99

The value is very close to +1 and falls in the 0.75 ≤ r < 1 band of the classification table, so the association is strong:

There is a strong, positive, linear association between the number of weekly training sessions and goals scored per season.
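
If you want to check the calculator's value, here is a minimal Python sketch that computes Pearson's r by hand (sum of the products of deviations, divided by the square root of the product of the two sums of squared deviations). It assumes the data table reads x = 2, 3, 4, 4, 5, 6, 7, 8 and y = 7, 9, 12, 15, 16, 20, 22, 25 for players A to H; the helper name `pearson_r` is made up for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


sessions = [2, 3, 4, 4, 5, 6, 7, 8]      # training sessions per week (x)
goals = [7, 9, 12, 15, 16, 20, 22, 25]   # goals per season (y)

r = pearson_r(sessions, goals)
print(round(r, 2))  # 0.99
```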

Step 4: Is this causation?

Not necessarily. The strong correlation suggests a relationship, but other factors (player skill, fitness, team quality) also affect goals. We can say the variables are associated, not that one causes the other.

5. Practice Questions

Try to answer each question yourself before reading the answer!

Question 1: A researcher wants to investigate whether the number of hours of sleep a student gets affects their exam score. Identify the explanatory and response variables.
Answer:
Explanatory variable (x): hours of sleep
Response variable (y): exam score

The researcher is using sleep to explain or predict exam performance, so sleep is the explanatory variable. Exam score responds to the amount of sleep.

Question 2: A scatterplot shows data for daily temperature (°C) vs ice cream sales ($). The points go from bottom-left to top-right, are spread quite loosely, and follow a roughly straight path. Describe the association.
Answer:
There is a moderate, positive, linear association between daily temperature and ice cream sales.

Direction: positive (as temperature increases, sales tend to increase)
Form: linear (the points follow a roughly straight-line pattern)
Strength: moderate (the points are loosely spread around the line)
Outliers: none mentioned

Question 3: A study finds r = −0.83 between hours spent watching TV (x) and exam score (y). What does this r value tell you about the relationship?
Answer:
There is a strong, negative, linear association between hours watching TV and exam score.

• r = −0.83 → |r| = 0.83, which falls in the range 0.75 to 1 → strong
• The negative sign → negative direction, as TV hours increase, exam scores tend to decrease

Note: this doesn't mean watching TV causes lower scores; there could be other explanations (less study time, tiredness, etc.).

Question 4: Four datasets have the following Pearson's r values: A: 0.42, B: −0.91, C: 0.78, D: −0.35. Which shows the strongest linear association? Which shows the weakest?
Answer:
Strongest: Dataset B (r = −0.91)
Weakest: Dataset D (r = −0.35)

Strength is determined by the absolute value of r (how close it is to 1 or −1, ignoring the sign):
• A: |r| = 0.42 → weak
• B: |r| = 0.91 → strong ← strongest
• C: |r| = 0.78 → strong
• D: |r| = 0.35 → weak ← weakest

The negative sign on B only tells you the direction is negative; it doesn't make the association weaker.
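
The same comparison can be sketched in a few lines of Python: rank the datasets by the absolute value of r rather than by r itself. The variable names here are made up for this example.

```python
# Pearson's r for each dataset, as given in the question.
r_values = {"A": 0.42, "B": -0.91, "C": 0.78, "D": -0.35}

# Compare by |r|: the sign only gives the direction, not the strength.
strongest = max(r_values, key=lambda name: abs(r_values[name]))
weakest = min(r_values, key=lambda name: abs(r_values[name]))

print(strongest, weakest)  # B D
```

Comparing the raw r values instead would wrongly pick C as the strongest and B as the weakest.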

Question 5: A news article claims: "Research shows a strong positive correlation (r = 0.89) between shoe size and reading ability in children." Does this mean having bigger feet makes you a better reader? Explain.
Answer:
No. Correlation does not imply causation.

While the correlation is strong, bigger feet don't cause better reading. The real explanation is a lurking variable: age.

Older children have both bigger feet and better reading ability; age is driving both. This is a classic example of two variables being correlated simply because they're both related to a third variable.
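
To see how a lurking variable can produce a strong correlation, here is a rough Python sketch. The numbers are invented purely for illustration (they are not from any real study): shoe size and reading score are each built to rise with age, and nothing links them directly. The helper `pearson_r` is a hand-written function for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented illustration data, NOT real measurements.
age = [5, 6, 7, 8, 9, 10, 11, 12]                        # years
shoe_size = [5.5, 5.5, 7.5, 7.5, 9.5, 9.5, 11.5, 11.5]   # rises with age
reading_score = [37, 51, 59, 61, 69, 83, 91, 93]         # rises with age

print(round(pearson_r(age, shoe_size), 2))
print(round(pearson_r(age, reading_score), 2))
print(round(pearson_r(shoe_size, reading_score), 2))
```

All three values come out above 0.9, including the one between shoe size and reading score, even though age is the only thing connecting those two in this made-up data.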