What even is this?
Every point in a scatterplot sits above, below, or exactly on the regression line. The residual is the gap between the actual data point and the value the regression line predicted for it (residual = actual − predicted). It shows how wrong (or right) the model was for that point.
A residual plot displays all those gaps together. If they look random, the linear model fits well. If they show a pattern, a different model might be more appropriate.
Positive Residual
Actual value is above the regression line.
The model under-predicted: reality was higher than expected.
e.g. Predicted 70%, scored 75% → residual = +5
Negative Residual
Actual value is below the regression line.
The model over-predicted: reality was lower than expected.
e.g. Predicted 80%, scored 74% → residual = −6
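The two cases above can be checked in a few lines of Python. This is only a minimal sketch: the function names `residual` and `describe` are illustrative and do not come from the lesson.

```python
def residual(actual, predicted):
    """Residual = actual value minus predicted value."""
    return actual - predicted


def describe(actual, predicted):
    """Say whether the model under- or over-predicted (illustrative helper)."""
    r = residual(actual, predicted)
    if r > 0:
        return f"residual = {r:+}: model under-predicted"
    if r < 0:
        return f"residual = {r:+}: model over-predicted"
    return "residual = 0: prediction was exact"


print(describe(75, 70))  # predicted 70%, scored 75%
print(describe(74, 80))  # predicted 80%, scored 74%
```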
A residual plot graphs the residuals on the y-axis against the x values (or predicted values) on the x-axis, with a horizontal zero line through the middle. The pattern in this plot tells you whether a linear model is appropriate.
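To make the layout concrete, here is a rough text version of a residual plot, using the five-student data from the worked example in this section. It is a sketch under stated assumptions: the helper `text_residual_plot` is hypothetical, the residuals are rounded to whole numbers, and the x values are taken to be evenly spaced so that one column per point is enough.

```python
# Data from the five-student worked example in this section.
hours = [2, 4, 6, 8, 10]
scores = [48, 54, 67, 72, 84]

# Residual = actual - predicted, with predictions from y-hat = 4.5x + 38.
# Rounding to whole numbers is an assumption made only to keep the text plot simple.
residuals = [round(y - (4.5 * x + 38)) for x, y in zip(hours, scores)]


def text_residual_plot(residuals):
    """Crude text plot: one row per residual level, one column per point in x order."""
    rows = []
    for level in range(max(residuals), min(residuals) - 1, -1):
        fill = "-" if level == 0 else " "  # dashes mark the horizontal zero line
        cells = ["*" if r == level else fill for r in residuals]
        rows.append(f"{level:+3d} | " + " ".join(cells))
    return "\n".join(rows)


print(text_residual_plot(residuals))
```

Each `*` is one student. The row of dashes is the zero line, so stars above it are positive residuals and stars below it are negative residuals.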
An outlier in a residual plot is a point with an unusually large positive or negative residual. It sits much further from zero than the other points.
The yellow point sits far above the zero line. It has a large positive residual, meaning the actual value was much higher than the model predicted.
Outliers are worth investigating. They might represent:
• A data entry error
• An unusual case (e.g. a record-breaking performance)
• A missing variable that explains the extreme value
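One way to turn "unusually large" into something you can compute is sketched below. The rule (flag any residual more than three times the median absolute residual), the function name `flag_outliers`, and the sample residuals are all assumptions made up for this illustration. The lesson itself only asks you to judge outliers by eye.

```python
from statistics import median


def flag_outliers(residuals, factor=3.0):
    """Return indices of residuals more than `factor` times the median absolute residual.

    The rule and the default factor are assumptions for illustration, not a standard.
    """
    typical = median(abs(r) for r in residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > factor * typical]


# Hypothetical residuals, invented for this example.
example_residuals = [0.5, -1.0, 0.8, -0.6, 6.0, -0.7]
print(flag_outliers(example_residuals))  # index 4 is the point with residual +6.0
```

A flagged point is a prompt to investigate, not proof that the data point is wrong.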
Calculate the residuals for all 5 students using the regression line ŷ = 4.5x + 38, then interpret the residual plot.
| Student | Study hours (x) | Actual score (y) | Predicted (ŷ) | Residual (y − ŷ) |
|---|---|---|---|---|
| Alex | 2 | 48 | 4.5(2)+38 = 47 | +1 |
| Brooke | 4 | 54 | 4.5(4)+38 = 56 | −2 |
| Cal | 6 | 67 | 4.5(6)+38 = 65 | +2 |
| Dana | 8 | 72 | 4.5(8)+38 = 74 | −2 |
| Eli | 10 | 84 | 4.5(10)+38 = 83 | +1 |
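The table can be reproduced with a short script. This is a sketch rather than part of the original exercise: the names `students`, `predict` and `table_residuals` are illustrative, and the line ŷ = 4.5x + 38 is read off the Predicted column.

```python
# (name, study hours x, actual score y) from the table above.
students = [
    ("Alex", 2, 48),
    ("Brooke", 4, 54),
    ("Cal", 6, 67),
    ("Dana", 8, 72),
    ("Eli", 10, 84),
]


def predict(hours):
    """Regression line used in the table: y-hat = 4.5x + 38."""
    return 4.5 * hours + 38


table_residuals = []
for name, hours, actual in students:
    predicted = predict(hours)
    r = actual - predicted  # residual = actual - predicted
    table_residuals.append(r)
    print(f"{name:<7} x={hours:>2}  y={actual}  y-hat={predicted:.0f}  residual={r:+.0f}")
```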
Check the sum of residuals
(+1) + (−2) + (+2) + (−2) + (+1) = 0 ✓ As expected for a least-squares regression line.
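As a cross-check, the sketch below fits a least-squares line to the five data points using the standard slope and intercept formulas, then sums the residuals. The variable names are illustrative. The fit recovers the same line used in the table.

```python
hours = [2, 4, 6, 8, 10]
scores = [48, 54, 67, 72, 84]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Standard least-squares formulas for a line with an intercept.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
s_xx = sum((x - mean_x) ** 2 for x in hours)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

fit_residuals = [y - (slope * x + intercept) for x, y in zip(hours, scores)]

print(slope, intercept)    # 4.5 38.0
print(sum(fit_residuals))  # 0.0
```

With messier data the sum may come out as a tiny non-zero number because of floating-point rounding, so in practice it is safer to check that it is very close to zero.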
Interpret the residual plot
The residuals are small (−2 to +2) and alternate between positive and negative with no clear pattern. They're scattered randomly around zero.
Are there any outliers?
All residuals are between −2 and +2. No data point is unusually far from the regression line, so there are no obvious outliers.
Residual = y − ŷ = 16 − 17 = −1
The residual is negative: the actual point (16) is below the regression line (which predicted 17). The model over-predicted by 1.
The sum is zero (as expected). This confirms these residuals came from a least-squares regression line, which always perfectly balances the over- and under-predictions across the dataset.
Random scatter in a residual plot means the linear model captures the relationship well. The errors are just random variation, and there's no systematic pattern the model is missing.
The U-shaped (curved) pattern in the residual plot tells us the underlying relationship is non-linear. The linear model is systematically wrong in predictable ways: in a U shape the residuals are positive at the extremes and negative in the middle, so the model under-predicts at the extremes and over-predicts in the middle. A curve (e.g. quadratic) would be a better fit.
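A small experiment shows where a curved residual pattern comes from. The sketch below fits a least-squares line to made-up data that follow y = x² exactly. The data and the helper `fit_line` are hypothetical and were invented for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a straight line (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


# Hypothetical curved data, invented for this example: y = x squared.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
curve_residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(curve_residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals still sum to zero, but they are positive at both ends and negative in the middle. A straight line cannot remove that systematic curve.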
Film 3 has a much larger residual than the others (most are under 0.5 in absolute value). The positive sign means Film 3's actual value was above the regression line, so the model significantly under-predicted it.
This might be worth investigating. Perhaps this film had unusually strong word-of-mouth, a surprise cast announcement, or viral marketing that the model couldn't account for.