Author: David Zelený
Linear regression is a common statistical method to quantify the relationship between two quantitative variables, where one can be considered dependent on the other. In this example, let's 1) calculate a linear regression on example data and fit the regression equation, 2) predict fitted values, 3) calculate the explained variation (coefficient of determination, r2) and test the statistical significance, and 4) plot the regression line onto the scatterplot.
The data we will use here are in the dataset cars, with two variables: dist (the distance the car travels after the driver pressed the brake) and speed (the speed of the car at the moment when the brake was pressed). The first six rows look like this:
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
dist is the dependent variable and speed the independent one (increasing speed likely increases the braking distance, not the other way around). We can draw the relationship using the plot function, with the variables defined via the formula interface: dependent_variable ~ independent_variable, data = data.frame:
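For the cars data, this is a single call (a minimal sketch using the built-in dataset):

```r
# scatterplot of braking distance against speed (cars is a built-in R dataset)
plot (dist ~ speed, data = cars)
```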
To calculate the linear regression model, we use the function lm (linear model), which uses the same formula interface as plot:
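For example, storing the result in an object called lm_cars (the name used in the code blocks further below) and printing it:

```r
# fit the linear model and print the estimated coefficients
lm_cars <- lm (dist ~ speed, data = cars)
lm_cars
```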
Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed
    -17.579        3.932
The function lm returns the estimated regression coefficients, namely the intercept and the slope of the regression. Using these coefficients, we can write the equation to calculate fitted values:
dist_predicted = -17.579 + 3.932*speed
We can argue that the intercept coefficient does not make practical sense (at zero speed, the braking distance is obviously zero, not -17.6 feet), but let's ignore this annoying detail for now and focus on other statistical details. We can apply the generic function summary to the object which stores the linear model results, to get further numbers:
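A minimal sketch, assuming the model is stored in lm_cars as above:

```r
# detailed summary of the fitted model: coefficients with standard errors and
# t-tests, r2, F statistic with degrees of freedom, and P-value
lm_cars <- lm (dist ~ speed, data = cars)
summary (lm_cars)
```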
The output reports the most important values: apart from the regression coefficients, also the coefficient of determination (r2), the value of the F statistic and the P-value. These are usually the numbers you need to include when reporting the results of the regression. Some details below:
The coefficient of determination (r2) quantifies how much of the variation in dist was explained by the given explanatory variable (speed). You can imagine that braking distance, in fact, depends on many factors other than just the speed of the car at the moment of pressing the brake - it also depends on how worn the tires are, whether the road is dry, wet or icy, how quickly the driver pressed the brake, etc. Here, speed explains 65% of the variation, which is more than half, but far from all - almost 35% of the variation remains unexplained.
It is important to report all relevant statistical values. For example, when you report the P-value, you also need to include the value of the test statistic (here the F-value) and the degrees of freedom (because the same F-value with different degrees of freedom leads to a different P-value).
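If needed, these values can also be extracted from the summary object rather than read off the printed output (a sketch; the components r.squared and fstatistic are part of the object returned by summary applied to an lm result):

```r
lm_cars <- lm (dist ~ speed, data = cars)
sum_cars <- summary (lm_cars)
sum_cars$r.squared   # coefficient of determination (about 0.65)
sum_cars$fstatistic  # F value plus numerator and denominator degrees of freedom
# the P-value is not stored directly, but can be recalculated from the F statistic:
f <- sum_cars$fstatistic
pf (f[1], f[2], f[3], lower.tail = FALSE)
```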
Finally, we can plot the regression line onto the scatterplot. The simplest way to do it is using the function abline, which can directly use the coefficients from the lm object. You may know that the function abline has arguments v and h for the coordinates of vertical and horizontal lines, respectively, but it also has arguments a and b for the regression coefficients (intercept and slope, respectively). It can extract these values directly from the result of lm:
plot (dist ~ speed, data = cars)
lm_cars <- lm (dist ~ speed, data = cars)
abline (lm_cars, col = 'red')
The more general solution is to use the parameters of the linear regression to predict values of the dependent variable. Since we are considering a linear response here, we can simply predict the values of the dependent variable for the minimum and maximum of the explanatory variable (i.e. braking distance for the minimum and maximum measured speed within the experiment). For this purpose, there is the function predict, which takes two arguments: the object generated by the lm function (here lm_cars), and newdata with the values of the explanatory variable for which the dependent variable should be predicted. Important: newdata must be a list, and the name of the component must be the same as the name of the explanatory variable in the formula you used in lm (here speed). If the name is different (e.g. in lm you use speed, but in predict you use x), then the predict function will ignore the newdata argument and predict values for all values of the explanatory variable in the original data (here 50 speed measurements).
plot (dist ~ speed, data = cars)
lm_cars <- lm (dist ~ speed, data = cars)
speed_min_max <- range (cars$speed)  # range of values, the same as: c(min (cars$speed), max (cars$speed))
dist_pred <- predict (lm_cars, newdata = list (speed = speed_min_max))
lines (dist_pred ~ speed_min_max, col = 'blue')
You can see that the result (blue regression line) is very similar to the one drawn by the abline function (the red regression line above), but not identical - the red line goes from edge to edge of the plotting region, while the blue line goes only from the minimum to the maximum value of speed.
The benefit of the predict solution becomes more apparent if we consider that the relationship between the dependent and explanatory variable may not be linear, and would be better described by a curvilinear shape. For this, we can use the poly function, included in the formula of the lm model, which allows fitting polynomial regression (i.e. y = b0 + b1*x + b2*x2 for a polynomial of the second degree, allowing the curve to "bend" once).
Call:
lm(formula = dist ~ poly(speed, degree = 2), data = cars)

Coefficients:
             (Intercept)  poly(speed, degree = 2)1  poly(speed, degree = 2)2
                   42.98                    145.55                     23.00
The equation of this polynomial regression is then

dist_predicted = 42.98 + 145.55*speed + 23.00*speed2

(strictly speaking, poly by default creates orthogonal polynomial terms, so these coefficients apply to transformed values of speed rather than to the raw speed and speed2; to get coefficients for the raw powers, fit the model with poly (speed, degree = 2, raw = TRUE)).
To draw the predicted polynomial regression curve, we need to use predict (abline can draw only straight lines), and additionally we need to define enough points along the curve to make the curvature smooth.
plot (dist ~ speed, data = cars)
lm_cars_poly <- lm (dist ~ poly (speed, degree = 2), data = cars)
speed_min_max_seq <- seq (min (cars$speed), max (cars$speed), length = 100)  # 100 evenly spaced speed values
dist_pred_2 <- predict (lm_cars_poly, newdata = list (speed = speed_min_max_seq))
lines (dist_pred_2 ~ speed_min_max_seq, col = 'green')
You can see that the green line is practically straight - in fact, the pattern is very close to linear, so even though we fitted a polynomial curve, it is almost straight.
Finally, we can report the statistical values of the linear regression as a legend inside the figure (this is suitable, however, only if there is enough space inside the drawing area). Let's do it for the linear (not the polynomial) regression, and include the parameters of the regression, r2, F-value and P-value:
plot (dist ~ speed, data = cars)
lm_cars <- lm (dist ~ speed, data = cars)
abline (lm_cars, col = 'red')
legend ('topleft', legend = c('y = -17.58 + 3.93*x', 'r2 = 0.65, F = 89.57, P < 0.001'), bty = 'n')