Unit 3 · Data Analysis

🔍 Residuals & Residual Plots

Once you have a regression line, how do you know it's actually a good fit? Residuals tell you how far off each prediction was, and a residual plot reveals whether a linear model is appropriate at all.

🔑 Before this clicks: if any of these feel shaky, a 5-minute refresh makes this page way easier:
Least-Squares Regression · Scatterplots

What even is this?

Every point in a scatterplot sits either above or below the regression line. The residual is the gap between the actual data point and what the regression line predicted. It's how wrong (or right) the model was for that point.

A residual plot takes all those gaps and displays them. If they look random, the linear model fits. If they show a pattern, a different model might be more appropriate.

1. What is a Residual?

residual = actual value − predicted value = y − ŷ

Positive Residual (+)

Actual value is above the regression line.
The model under-predicted: reality was higher than expected.

e.g. Predicted 70%, scored 75% → residual = +5

Negative Residual (−)

Actual value is below the regression line.
The model over-predicted: reality was lower than expected.

e.g. Predicted 80%, scored 74% → residual = −6
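
As a minimal sketch, the formula can be checked in a few lines of Python. The function name `residual` is just an illustrative choice; the numbers are the two examples above.

```python
def residual(actual, predicted):
    """Residual = actual value minus predicted value (y - y_hat)."""
    return actual - predicted

# Predicted 70%, scored 75%: the model under-predicted
print(residual(75, 70))  # 5

# Predicted 80%, scored 74%: the model over-predicted
print(residual(74, 80))  # -6
```

The sign does the interpreting for you: positive means the point is above the line, negative means it is below.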

💡 Key property: For a least-squares regression line, the residuals always sum to zero (or very close to zero once values have been rounded). The line exactly balances the over- and under-predictions.
2. Residual Plots

A residual plot graphs the residuals on the y-axis against the x values (or predicted values) on the x-axis, with a horizontal zero line through the middle. The pattern in this plot tells you whether a linear model is appropriate.

[Residual plot: residuals scattered around the zero line, plotted against x values]
Random Scatter
✓ Linear model appropriate
Points spread randomly above and below zero with no pattern. The linear model is a good fit.

[Residual plot: residuals forming a U-shaped curve around the zero line, plotted against x values]
Curved Pattern (U-shape)
✗ Linear model NOT appropriate
Points follow a curve: positive at the ends, negative in the middle. A non-linear model would fit better.
✏️ The rule: If the residual plot shows random scatter around zero → the linear model is appropriate. If it shows a clear pattern → a linear model is not the best choice. A curve means the relationship is non-linear; a fan shape means the size of the errors changes as x changes.
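
To see where a curved pattern comes from, here is a small sketch that fits a least-squares line to data that actually follow y = x². The data and the helper name `fit_line` are made up for illustration; the slope and intercept use the standard least-squares formulas.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data that follow a curve (y = x squared), not a line
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
signs = ["+" if r > 0 else "-" for r in residuals]

print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
print(signs)      # ['+', '-', '-', '-', '+']
```

The residuals are positive at both ends and negative in the middle. That is the U-shape that signals the straight line is missing a curve in the data.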
3. Outliers in Residual Plots

An outlier in a residual plot is a point with an unusually large positive or negative residual: it sits much further from zero than the other points.

[Residual plot: one point, labelled "outlier!", far above the zero line]

The yellow point sits far above the zero line. It has a large positive residual, meaning the actual value was much higher than the model predicted.

Outliers are worth investigating. They might represent:
• A data entry error
• An unusual case (e.g. a record-breaking performance)
• A missing variable that explains the extreme value
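
One rough way to flag candidates for investigation is to compare each residual with the typical residual size. The sketch below is an assumption for illustration, not a standard rule from this unit: it flags any residual more than three times the median absolute residual, and the residual values are made up.

```python
from statistics import median

# Made-up residuals for illustration only
residuals = [0.5, -0.7, 0.3, 4.2, -0.4, 0.6]

# Typical residual size: the median of the absolute residuals
typical = median(abs(r) for r in residuals)

# Assumed cut-off (a judgment call, not a fixed rule): 3x the typical size
threshold = 3 * typical

# Keep (position, residual) for every point beyond the cut-off
flagged = [(i, r) for i, r in enumerate(residuals) if abs(r) > threshold]
print(flagged)  # [(3, 4.2)]
```

A flagged point is only a prompt to look closer. Deciding whether it is an error or a genuinely unusual case still takes a look at the data itself.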

4. Worked Example

📚 Study Hours & Test Scores: Checking the Fit
Using the regression equation ŷ = 4.5x + 38 from the previous topic.

Calculate the residuals for all 5 students, then interpret the residual plot.

Student | Study hours (x) | Actual score (y) | Predicted (ŷ) | Residual (y − ŷ)
Alex | 2 | 48 | 4.5(2) + 38 = 47 | +1
Brooke | 4 | 54 | 4.5(4) + 38 = 56 | −2
Cal | 6 | 67 | 4.5(6) + 38 = 65 | +2
Dana | 8 | 72 | 4.5(8) + 38 = 74 | −2
Eli | 10 | 84 | 4.5(10) + 38 = 83 | +1

Step 1: Check the sum of residuals

(+1) + (−2) + (+2) + (−2) + (+1) = 0 ✓ As expected for a least-squares regression line.

Step 2: Interpret the residual plot

The residuals are small (−2 to +2) and alternate between positive and negative with no curve or trend: they're scattered randomly around zero.

Conclusion: The residual plot shows random scatter → a linear model is appropriate for this data.

Step 3: Are there any outliers?

All residuals are between −2 and +2. No data point is unusually far from the regression line, so there are no obvious outliers.
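
The whole worked example can be reproduced with a short script. This is a sketch using the standard least-squares formulas on the five students' data from the table; the variable names are illustrative.

```python
hours = [2, 4, 6, 8, 10]        # study hours (x)
scores = [48, 54, 67, 72, 84]   # actual scores (y)

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Least-squares slope and intercept
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sxx = sum((x - mean_x) ** 2 for x in hours)
slope = sxy / sxx
intercept = mean_y - slope * mean_x
print(slope, intercept)  # 4.5 38.0

# Residual = actual - predicted for each student
predicted = [slope * x + intercept for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]
print(residuals)       # [1.0, -2.0, 2.0, -2.0, 1.0]
print(sum(residuals))  # 0.0
```

The fitted slope and intercept match ŷ = 4.5x + 38, and the residuals sum to zero, as the key property says they should.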

5. Practice Questions

Try each question yourself before checking the answer.

Q1. A regression equation is ŷ = 3x + 5. A data point has x = 4 and y = 16. Calculate the residual and state whether the point is above or below the regression line.
Answer:
ŷ = 3(4) + 5 = 17
Residual = y − ŷ = 16 − 17 = −1

The residual is negative → the actual point (16) is below the regression line (which predicted 17). The model over-predicted by 1.

Q2. A student is given this table of residuals for a dataset: +2.1, −0.8, +1.4, −1.9, +0.6, −1.4. What is the sum of these residuals? What does this tell you?
Answer:
Sum = 2.1 + (−0.8) + 1.4 + (−1.9) + 0.6 + (−1.4) = 0

The sum is zero, which is consistent with a least-squares regression line: the over- and under-predictions balance out across the dataset. On its own, a zero sum doesn't prove the line is a good fit; you still need to check the residual plot.

Q3. A residual plot for a dataset shows points scattered randomly above and below the zero line, with no discernible pattern. What does this tell you about the linear regression model?
Answer:
A linear regression model is appropriate for this data.

Random scatter in a residual plot means the linear model captures the relationship well. The errors are just random variation; there's no systematic pattern the model is missing.

Q4. A residual plot shows a clear U-shaped pattern: residuals are positive at low x values, negative in the middle, and positive again at high x values. What does this suggest about the model?
Answer:
A linear model is NOT appropriate for this data.

The U-shaped (curved) pattern in the residual plot tells us the underlying relationship is non-linear. The linear model is systematically wrong in predictable ways: it under-predicts at the extremes (positive residuals) and over-predicts in the middle (negative residuals). A curve (e.g. quadratic) would be a better fit.

Q5. Five films have residuals: +0.3, −0.2, +1.8, −0.4, +0.1. Which film is the outlier, and is it above or below the regression line?
Answer:
Film 3 (residual = +1.8) is the outlier.

It has a much larger residual than the others (all the rest are 0.4 or less in absolute value). The positive sign means Film 3's actual value was above the regression line, so the model significantly under-predicted it.

This might be worth investigating: perhaps this film had unusually strong word-of-mouth, a surprise cast announcement, or viral marketing that the model couldn't account for.