Unit 3 · Data Analysis

📏 Least-Squares Regression

Once you can see a linear relationship in a scatterplot, the regression line lets you describe it with an equation, and use that equation to make predictions.

🔑 Before this clicks: if any of these feel shaky, a 5-minute refresh makes this page way easier:
📉 Scatterplots & Correlation
🧰 Foundations: Reading scales

What even is this?

A regression line is the "line of best fit": a straight line drawn through a scatterplot that sits as close as possible to all the data points. It gives you an equation you can use to predict y from a given x value.

You won't draw this by hand; your calculator finds the equation. Your job is to interpret what the equation means and use it to make predictions.

1
The Regression Equation
ŷ = a + bx

ŷ: predicted y value (response variable)
a: y-intercept (value when x = 0)
b: gradient (slope) (change in y per 1-unit increase in x)
x: explanatory variable (the known value)

The hat on ŷ ("y-hat") means it's a predicted value: what the model estimates, not an actual data point.

[Figure: scatterplot of y (response) against x (explanatory), showing the regression line ŷ = a + bx and a residual.]
The regression line minimises the sum of the squared residuals (a residual is the actual y minus the predicted ŷ), which is why it's called "least-squares".
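
To see what "least-squares" means in numbers, here is a small Python sketch. It assumes the five data points and the regression line from the worked example further down this page. The helper name `sum_squared_residuals` and the two comparison lines are made up for illustration.

```python
# Data from the worked example on this page (study hours, test score %).
hours = [2, 4, 6, 8, 10]
scores = [48, 54, 67, 72, 84]


def sum_squared_residuals(a, b, xs, ys):
    """Add up (actual y - predicted y)^2 for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# The regression line from the worked example: y-hat = 38 + 4.5x.
best = sum_squared_residuals(38, 4.5, hours, scores)

# Two other lines, picked arbitrarily for comparison (illustrative only).
steeper = sum_squared_residuals(38, 5.0, hours, scores)
higher = sum_squared_residuals(40, 4.5, hours, scores)

print(best, steeper, higher)
```

Of the three lines, the regression line should give the smallest total. Any other straight line you try will give a larger one.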
2
Interpreting the Equation

This is the big exam skill: taking the numbers in the equation and explaining what they mean in context.

📈 The Gradient (b)

How much y changes for each 1-unit increase in x.

If ŷ = 4.5x + 38 (so b = 4.5 and a = 38), where x = study hours per week and y = test score:

"For each additional hour of study per week, the predicted test score increases by 4.5 marks on average."

📍 The Y-Intercept (a)

The predicted value of y when x = 0.

Same equation, if x = 0:

"A student who does no study is predicted to score 38 marks."

⚠️ Be careful: x = 0 may not be realistic! Always check whether x = 0 makes sense in context. If x = temperature (°C), then x = 0 means 0 °C, which might be outside the data range and make the intercept meaningless as a real prediction. In that case, just describe what the model says mathematically.
✏️ Exam tip: Always write your interpretation in context, include the variable names and units. "The gradient is 4.5" gets no marks. "For each additional hour of study, the predicted test score increases by 4.5 marks" does.
3
Making Predictions

To predict ŷ, substitute your x value into the equation. To find x for a target y, rearrange and solve.

Predict ŷ from x:
x = 7
ŷ = 4.5(7) + 38
ŷ = 31.5 + 38
ŷ = 69.5

Find x for target ŷ:
ŷ = 80
80 = 4.5x + 38
4.5x = 42
x = 42 ÷ 4.5 ≈ 9.3 hrs
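
The two calculations above can be sketched in Python. This is only an illustration: the function names `predict_score` and `hours_for_target` are invented here, and the numbers come from the example equation ŷ = 4.5x + 38.

```python
A = 38.0  # y-intercept of y-hat = 4.5x + 38
B = 4.5   # gradient of y-hat = 4.5x + 38


def predict_score(hours):
    """Substitute x into the equation to get the predicted y."""
    return A + B * hours


def hours_for_target(target_score):
    """Rearrange y-hat = A + B*x to x = (y-hat - A) / B."""
    return (target_score - A) / B


print(predict_score(7))                # predicted score for 7 hours
print(round(hours_for_target(80), 1))  # hours needed for a predicted 80
```

Rearranging works the same way for any regression equation, as long as the gradient is not zero.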

Interpolation vs Extrapolation

[Figure: x-axis split into three zones. Between x = min data and x = max data is the data range (interpolation, predictions here are more reliable). Below the minimum and above the maximum is extrapolation (less reliable).]
🔍 Interpolation = predicting within the data range → generally reliable.
🚨 Extrapolation = predicting outside the data range → be careful, the pattern may not hold.
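
As a quick check before trusting a prediction, you can compare x with the smallest and largest x values in the data. The sketch below is one way to write that check. The function name `prediction_type` is hypothetical, and the range 2 to 10 is taken from the study-hours example on this page.

```python
def prediction_type(x, x_min, x_max):
    """Label a prediction by comparing x with the range of the collected data."""
    if x_min <= x <= x_max:
        return "interpolation"
    return "extrapolation"


# Study-hours data runs from 2 to 10 hours per week.
print(prediction_type(7, 2, 10))   # inside the data range
print(prediction_type(15, 2, 10))  # outside the data range
```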
4
r², The Coefficient of Determination

Once you have Pearson's r, you can calculate r² (just square it). This tells you what proportion of the variation in y is explained by its linear relationship with x.

r = 0.99 (correlation coefficient)
r² ≈ 0.98 (98% of variation in y is explained by x)
✏️ Exam phrasing: "An r² value of 0.98 means that 98% of the variation in test scores can be explained by the number of study hours per week."
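
One way to see why r² measures "variation explained" is to compare the leftover variation around the regression line with the total variation around the mean of y. The sketch below assumes the data and equation from the worked example on this page, and the variable names are illustrative. It works with unrounded values, so the result is slightly above 0.99² = 0.9801, but it still rounds to 0.98.

```python
# Data from the worked example on this page.
hours = [2, 4, 6, 8, 10]
scores = [48, 54, 67, 72, 84]

mean_score = sum(scores) / len(scores)

# Total variation: squared distances of each score from the mean score.
total_variation = sum((y - mean_score) ** 2 for y in scores)

# Unexplained variation: squared residuals around y-hat = 4.5x + 38.
unexplained = sum((y - (4.5 * x + 38)) ** 2 for x, y in zip(hours, scores))

# Proportion of the variation that the line accounts for.
r_squared = 1 - unexplained / total_variation
print(round(r_squared, 2))
```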
5
Worked Example
📚
Study Hours & Test Scores
Five students tracked their weekly study hours and their most recent test score.
Student                 Alex   Brooke   Cal   Dana   Eli
Study hours/week (x)    2      4        6     8      10
Test score % (y)        48     54       67    72     84

The calculator gives: ŷ = 4.5x + 38  and  r = 0.99. Data range: x = 2 to 10.
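
If you want to check the calculator's output, the sketch below computes the gradient, y-intercept and Pearson's r from the table using the standard least-squares formulas. It is a hand-rolled illustration, not the method your calculator necessarily uses internally, and the variable names are my own.

```python
from math import sqrt

# Data from the table above.
hours = [2, 4, 6, 8, 10]
scores = [48, 54, 67, 72, 84]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Sums of squared deviations and cross deviations from the means.
s_xx = sum((x - mean_x) ** 2 for x in hours)
s_yy = sum((y - mean_y) ** 2 for y in scores)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))

b = s_xy / s_xx                # gradient
a = mean_y - b * mean_x        # y-intercept
r = s_xy / sqrt(s_xx * s_yy)   # Pearson's correlation coefficient

print(f"y-hat = {b}x + {a}, r = {r:.2f}")
```

The values of a, b and r should agree with the calculator output quoted above.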

1

Interpret the gradient

b = 4.5. For each additional hour of study per week, the predicted test score increases by 4.5 percentage points on average.

2

Interpret the y-intercept

a = 38. The model predicts a student who does zero hours of study would score 38%. This is an extrapolation (x=0 is outside the data range of 2 to 10), so it should be interpreted cautiously.

3

Predict the score for 7 hours of study

x = 7 is within the data range (2 to 10) → interpolation, reliable prediction.

ŷ = 4.5(7) + 38 = 31.5 + 38 = 69.5%
4

How many hours to reach 80%?

Set ŷ = 80 and solve for x:

80 = 4.5x + 38
4.5x = 42
x = 42 ÷ 4.5 ≈ 9.3 hours

x ≈ 9.3 is within the data range (2 to 10) → interpolation, reliable prediction.

5

Calculate and interpret r²

r² = 0.99² = 0.9801 ≈ 0.98
98% of the variation in test scores is explained by study hours per week.
6
Practice Questions

Try each question yourself before checking the answer.

1
A regression equation is ŷ = 12x − 5, where x = hours of sunshine per day and y = ice cream sales ($). Interpret the gradient and y-intercept in context.
Answer:
Gradient (b = 12): For each additional hour of sunshine per day, predicted ice cream sales increase by $12.

Y-intercept (a = −5): The model predicts sales of −$5 when there are zero hours of sunshine. A negative sales value isn't meaningful in real life; this is a mathematical result of the equation, not a practical prediction.
2
Using ŷ = 4.5x + 38, predict the test score for a student who studies 5 hours per week. The data was collected for students studying 2 to 10 hours. Is this prediction reliable?
Answer:
ŷ = 4.5(5) + 38 = 22.5 + 38 = 60.5%

x = 5 falls within the data range of 2 to 10 hours → this is interpolation → the prediction is reasonably reliable.
3
A teacher wants to predict the score for a student who studies 15 hours per week (using the same equation and data range of 2 to 10 hours). Should the teacher trust this prediction? Explain.
Answer:
No, this is extrapolation, and the prediction is less reliable.

x = 15 is outside the data range (2 to 10 hours). The linear pattern observed in the data may not continue beyond this range. The teacher should be cautious about using this prediction.
4
A regression model for temperature (x °C) and beach visitors (y) gives ŷ = 20x − 300, r = 0.95. Calculate r² and interpret it.
Answer:
r² = 0.95² = 0.9025 ≈ 0.90

Interpretation: 90% of the variation in the number of beach visitors can be explained by the temperature. The remaining 10% is due to other factors (day of week, public holidays, surf conditions, etc.).
5
Using ŷ = 20x − 300 (data range: x = 18 to 32°C), find the temperature at which the model predicts 400 visitors. Is this prediction reliable?
Answer:
400 = 20x − 300
20x = 700
x = 35°C


x = 35 is outside the data range (18 to 32°C) → this is extrapolation. The prediction is less reliable because the linear relationship may not hold at temperatures above 32°C.
Ready to practise?
🌊
The Beach Report, Escape Room
6 challenges: predict visitor numbers, interpret the slope, work backwards from a target, calculate r² and more. Help the council plan their summer roster.
The Transfer Window, Escape Room
Also covers regression skills, prediction, residuals, slope and simultaneous equations.