What even is this?
A regression line is the "line of best fit", a straight line drawn through a scatterplot that sits as close as possible to all the data points. It gives you an equation you can use to predict y from any x value.
You won't draw this by hand; your calculator finds the equation. Your job is to interpret what the equation means and use it to make predictions.
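ŷ = bx + a
ŷ: the predicted value of y (response variable)
a: the y-intercept (value when x = 0)
b: the gradient (change in y per 1-unit increase in x)
x: the explanatory variable (the known value)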
The hat on ŷ ("y-hat") means it's a predicted value: what the model estimates, not an actual data point.
This is the big exam skill: taking the numbers in the equation and explaining what they mean in context.
📈 The Gradient (b)
How much y changes for each 1-unit increase in x.
If ŷ = 4.5x + 38 and x = study hours, y = test score:
"For each additional hour of study per week, the test score increases by 4.5 marks on average."
📍 The Y-Intercept (a)
The predicted value of y when x = 0.
Same equation, if x = 0:
"A student who does no study is predicted to score 38 marks."
⚠️ Be careful: x = 0 may not be a realistic value in context!
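If you'd like to see the gradient and y-intercept "in action", here is a minimal Python sketch using the example equation ŷ = 4.5x + 38. The function name `predicted_score` and the chosen hours are just for illustration; they aren't part of any calculator or library.

```python
# Example equation from the text: y-hat = 4.5x + 38
B = 4.5   # gradient: marks gained per extra hour of study
A = 38.0  # y-intercept: predicted score when x = 0


def predicted_score(hours):
    """Predicted test score for a given number of study hours per week."""
    return B * hours + A


# Raising x by 1 changes the prediction by exactly the gradient,
# no matter which x value you start from.
for hours in (2, 5, 9):
    change = predicted_score(hours + 1) - predicted_score(hours)
    print(f"{hours} -> {hours + 1} hours: prediction changes by {change}")

# Setting x = 0 leaves only the y-intercept.
print("Prediction at x = 0:", predicted_score(0))
```

Every step of one hour changes the prediction by 4.5, and the prediction at x = 0 is 38, which matches the two interpretations above.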
To predict ŷ, substitute your x value into the equation. To find x for a target y, rearrange and solve.
Predict ŷ when x = 7:
ŷ = 4.5(7) + 38
ŷ = 31.5 + 38
ŷ = 69.5
Find x when ŷ = 80:
80 = 4.5x + 38
4.5x = 42
x ≈ 9.3 hrs
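The two calculations above (substitute to get ŷ, rearrange to get x) can be sketched in Python like this. The function names `predict_y` and `solve_for_x` are made up for this example, and the defaults assume the equation ŷ = 4.5x + 38.

```python
def predict_y(x, b=4.5, a=38.0):
    """Substitute x into y-hat = bx + a."""
    return b * x + a


def solve_for_x(target_y, b=4.5, a=38.0):
    """Rearrange y-hat = bx + a to get x = (y-hat - a) / b."""
    if b == 0:
        raise ValueError("The gradient is zero, so x cannot be found from y.")
    return (target_y - a) / b


print(predict_y(7))                 # 69.5
print(round(solve_for_x(80), 1))    # 9.3
```

Rounding happens only at the end: the unrounded answer is 42 ÷ 4.5 = 9.333..., which is why the answer is best written as x ≈ 9.3.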
Interpolation vs Extrapolation
✅ Interpolation = predicting inside the data range → the prediction is generally reliable.
🚨 Extrapolation = predicting outside the data range → be careful, the pattern may not hold.
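Deciding which one you're doing is just a range check. Here is a small Python sketch; the function name `prediction_type` is hypothetical, and the range of 2 to 10 is assumed from the study-hours data used in this lesson.

```python
def prediction_type(x, x_min, x_max):
    """Classify a prediction by whether x lies inside the observed data range."""
    if x_min <= x <= x_max:
        return "interpolation"
    return "extrapolation"


# Assumed data range: study hours from 2 to 10.
for x in (0, 7, 15):
    print(f"x = {x}: {prediction_type(x, 2, 10)}")
```

Here x = 7 comes out as interpolation, while x = 0 and x = 15 are both extrapolation, so predictions at those values should be treated with caution.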
Once you have Pearson's r, you can calculate r² (just square it). This tells you what proportion of the variation in y is explained by x.
| Student | Alex | Brooke | Cal | Dana | Eli |
|---|---|---|---|---|---|
| Study hours/week (x) | 2 | 4 | 6 | 8 | 10 |
| Test score % (y) | 48 | 54 | 67 | 72 | 84 |
The calculator gives: ŷ = 4.5x + 38 and r = 0.99. Data range: x = 2 to 10.
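If you're curious where the calculator's numbers come from, the sketch below applies the standard least-squares formulas to the table: b = Sxy ÷ Sxx and a = ȳ − b·x̄. This lesson doesn't say how any particular calculator does it internally, so treat this as one way to check the result rather than the calculator's actual method. The variable names are chosen for this example.

```python
from math import sqrt

# Data from the table above
hours = [2, 4, 6, 8, 10]       # x: study hours per week
scores = [48, 54, 67, 72, 84]  # y: test score (%)

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Sums of squares and cross-products about the means
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
s_xx = sum((x - mean_x) ** 2 for x in hours)
s_yy = sum((y - mean_y) ** 2 for y in scores)

b = s_xy / s_xx               # gradient
a = mean_y - b * mean_x       # y-intercept
r = s_xy / sqrt(s_xx * s_yy)  # Pearson's correlation coefficient

print(f"y-hat = {b:.1f}x + {a:.0f}")  # y-hat = 4.5x + 38
print(f"r = {r:.2f}")                 # r = 0.99
```

The output agrees with the calculator: ŷ = 4.5x + 38 and r = 0.99 (to two decimal places).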
Interpret the gradient
b = 4.5. For each additional hour of study per week, the predicted test score increases by 4.5 percentage points on average.
Interpret the y-intercept
a = 38. The model predicts a student who does zero hours of study would score 38%. This is an extrapolation (x=0 is outside the data range of 2 to 10), so it should be interpreted cautiously.
Predict the score for 7 hours of study
ŷ = 4.5(7) + 38 = 69.5%
x = 7 is within the data range (2 to 10) → interpolation, so the prediction is reliable.
How many hours to reach 80%?
Set ŷ = 80 and solve for x:
80 = 4.5x + 38
4.5x = 42
x = 42 ÷ 4.5 ≈ 9.3 hours
x ≈ 9.3 is within the data range (2 to 10) → interpolation, so the prediction is reliable.
Calculate and interpret r²
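For this data set, the calculation is just squaring the calculator's value of r = 0.99. A minimal Python sketch (the variable names are chosen for this example):

```python
r = 0.99            # Pearson's r from the calculator
r_squared = r ** 2  # coefficient of determination

print(f"r^2 = {r_squared:.2f}")
print(f"About {r_squared:.0%} of the variation in test scores "
      "can be explained by study hours per week.")
```

So r² ≈ 0.98: about 98% of the variation in test scores can be explained by the number of study hours per week, and the remaining 2% or so is due to other factors.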
Y-intercept (a = −5): The model predicts sales of −$5 when there are zero hours of sunshine. A negative sales value isn't meaningful in real life; this is a mathematical result of the equation, not a practical prediction.
x = 5 falls within the data range of 2 to 10 hours → this is interpolation → the prediction is reasonably reliable.
x = 15 is outside the data range (2 to 10 hours). The linear pattern observed in the data may not continue beyond this range. The teacher should be cautious about using this prediction.
Interpretation: 90% of the variation in the number of beach visitors can be explained by the daily maximum temperature. The remaining 10% is due to other factors (day of week, public holidays, surf conditions, etc.).
20x = 700
x = 35°C
x = 35 is outside the data range (18 to 32°C) → this is extrapolation. The prediction is less reliable because the linear relationship may not hold at temperatures above 32°C.