Calculate linear regression, test it and plot it
Linear regression is a common statistical method to quantify the relationship between two quantitative variables, where one can be considered as dependent on the other. In this example, let's 1) calculate linear regression using example data and fit the regression equation, 2) predict fitted values, 3) calculate explained variation (coefficient of determination, r2) and test its statistical significance, and 4) plot the regression line onto the scatterplot.
Data we will use here are in the dataset cars, with two variables, dist (the distance the car travels after the driver pressed the brake) and speed (the speed of the car at the moment when the brake was pressed).
head (cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
Obviously, dist is the dependent variable and speed is the independent one (increasing speed is likely to increase the braking distance, not the other way around). We can draw the relationship using the plot function, with variables defined via the formula interface: dependent_variable ~ independent_variable, data = data.frame:
plot (dist ~ speed, data = cars)
To calculate the linear regression model, we use the function lm (linear model), which uses the same formula interface as the plot function:
lm_cars <- lm (dist ~ speed, data = cars)
lm_cars
Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed  
    -17.579        3.932  
The function lm returns the estimated regression coefficients, namely the intercept and the slope of the regression. Using these coefficients, we can write the equation to calculate fitted values:
dist_predicted = -17.579 + 3.932*speed
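To check that this equation really reproduces what lm calculated, we can extract the coefficients with coef and compare the manually calculated values with those returned by fitted (a minimal sketch; the names b and dist_manual are arbitrary helpers):

b <- coef (lm_cars)                        # b[1] = intercept, b[2] = slope
dist_manual <- b[1] + b[2] * cars$speed    # fitted values calculated from the equation above
head (cbind (manual = dist_manual, fitted = fitted (lm_cars)))  # the two columns should be identical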
We can argue that the intercept coefficient does not make practical sense (at zero speed, the braking distance is obviously zero, not -17.6 feet), but let's ignore this annoying detail now and focus on the other statistics. We can apply the generic function summary on the object which stores the linear model results to get further numbers:
summary (lm_cars)
The commented output reports the most important values: apart from the regression coefficients, also the coefficient of determination (r2), the value of the F statistic and the P-value. These are usually the numbers you need to include when you are reporting the results of the regression. Some details below:
- Parameters of the regression equation are important if you plan to predict the values of the dependent variable for a certain value of the explanatory variable. For example, if you want to know what the predicted braking distance would be for a car driving 15 mph, you can calculate it: dist_predicted = -17.579 + 3.932*15 = 41.4 ft (note that the values of speed and distance are in imperial units - mph = miles per hour, where 1 mph = 1.61 km/h, and ft = feet, where 1 ft = 0.30 m); see also the code sketch after this list.
- Coefficient of determination quantifies how much variation in the dependent variable (dist) was explained by the given explanatory variable (speed). You can imagine that braking distance, in fact, depends on many other factors than just the speed of the car at the moment of pressing the brake - it also depends on how worn the tyres are, whether the road is dry, wet or icy, how quickly the driver pressed the brake, etc. Here, speed explains 65% of the variation, which is more than half, but far from the whole - almost 35% of the variation remains unexplained.
- Before we start to interpret the results, we often want to make sure that there is something to interpret. Even a randomly generated explanatory variable will explain a non-zero amount of variation (you can try it to see!). This means that we need to know whether our explained variation (65%) is higher than the variation that would be explained by some random variable. For this, we may test the null hypothesis H0 that there is no relationship between the dependent and explanatory variable, and if it is rejected, we can conclude that this relationship is significant. In the case of a parametric test of significance for regression, we calculate the F-value and use it (together with appropriate degrees of freedom) to look up the appropriate P-value. The P-value is the probability of obtaining a value of the test statistic (here the F-value) equal to or more extreme than the observed one, even if the null hypothesis is true. The lower the P-value, the more statistically significant the result. In our case, the P-value is very low, namely 0.00000000000149, very close to zero (you don't want to write it in this long format!). When reporting the P-value, we can either report the original value, or report that the value is smaller than an arbitrarily selected threshold, e.g. P < 0.001 (three thresholds are commonly used, P < 0.001, P < 0.01 and P < 0.05; if the P-value is e.g. 0.032, you can report P < 0.05).
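To illustrate the first point above, the same prediction can also be done with the predict function (a minimal sketch; predict is introduced in more detail further below, and the value matches the one calculated by hand):

predict (lm_cars, newdata = list (speed = 15))  # predicted braking distance for 15 mph, ~41.4 ft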
It is important to report all relevant statistical values. For example, when you report the P-value, you also need to include the value of the test statistic (here the F-value) and the number of degrees of freedom (because the same F-value with different degrees of freedom leads to different P-values).
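If you do not want to copy these numbers from the printed output by hand, they can also be extracted from the summary object (a sketch; the name lm_cars_sum is arbitrary, while r.squared and fstatistic are standard components of the object returned by summary applied on an lm result):

lm_cars_sum <- summary (lm_cars)
lm_cars_sum$r.squared    # coefficient of determination (r2)
lm_cars_sum$fstatistic   # F-value with numerator and denominator degrees of freedom
pf (lm_cars_sum$fstatistic[1], lm_cars_sum$fstatistic[2], lm_cars_sum$fstatistic[3], lower.tail = FALSE)  # P-value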
Finally, we can plot the regression line onto the plot. The simplest way to do it is using the function abline, which can directly use the coefficients from the lm object. You may know that the function abline has arguments v and h for the coordinates of a vertical and horizontal line, respectively, but it also has arguments a and b for regression coefficients (intercept and slope, respectively). It can extract these values directly from the result of the lm function:
plot (dist ~ speed, data = cars)
lm_cars <- lm (dist ~ speed, data = cars)
abline (lm_cars, col = 'red')
The more general solution is to use the parameters of the linear regression and predict the values of the dependent variable. Since we are considering a linear response here, we can simply predict the values of the dependent variable for the minimum and maximum values of the explanatory variable (i.e. the braking distance for the minimum and maximum measured speed within the experiment). For this purpose, there is the function predict, which takes two arguments: the object generated by the lm function (here lm_cars), and newdata with the values of the explanatory variable for which the dependent variable should be predicted. Important: newdata must be a list, and the name of the component must be the same as the name of the explanatory variable in the formula you use in lm (here speed). If the name is different (e.g. in lm you use speed, but in predict you use x), then the predict function will (silently, i.e. without warning) ignore the argument and predict values for all numbers inside the explanatory variable (50 speed measurements here).
plot (dist ~ speed, data = cars)
lm_cars <- lm (dist ~ speed, data = cars)
speed_min_max <- range (cars$speed)  # range of values, the same as: c(min (cars$speed), max (cars$speed))
dist_pred <- predict (lm_cars, newdata = list (speed = speed_min_max))
lines (dist_pred ~ speed_min_max, col = 'blue')
You can see that the result (blue regression line) is very similar to the one drawn by the abline function (red regression line above), but not identical - the red line goes from edge to edge of the plotting region, while the blue line goes only from the minimum to the maximum value of speed.
The benefit of the predict solution becomes more apparent if we consider that the relationship between the dependent and explanatory variable may not be linear, and would be better described by a curvilinear shape. For this, we can use the poly function, included in the formula of the lm model, which allows fitting polynomial regression (i.e. y = b0 + b1*x + b2*x^2 for a polynomial of the second degree, allowing the curve to "bend" once).
lm_cars_2 <- lm (dist ~ poly (speed, degree = 2), data = cars)
lm_cars_2
Call:
lm(formula = dist ~ poly(speed, degree = 2), data = cars)

Coefficients:
             (Intercept)  poly(speed, degree = 2)1  poly(speed, degree = 2)2  
                   42.98                    145.55                     23.00  
The equation of this polynomial regression is then
dist_predicted = 42.98 + 145.55*speed + 23.00*speed^2
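One detail to be aware of: by default poly creates orthogonal polynomials, so the coefficients printed above belong to the transformed (orthogonal) polynomial terms rather than directly to speed and speed^2. If you want coefficients on the raw scale of speed, the model can be refitted with the argument raw = TRUE (a sketch; the fitted curve is the same, only the parameterisation of the coefficients differs):

lm_cars_2_raw <- lm (dist ~ poly (speed, degree = 2, raw = TRUE), data = cars)
coef (lm_cars_2_raw)  # intercept, coefficient for speed, coefficient for speed^2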
To draw the predicted polynomial regression curve, we need to use predict (because abline can draw only straight lines), and additionally, we need to define enough points along the curve to make the curvature smooth.
plot (dist ~ speed, data = cars)
speed_seq <- seq (min (cars$speed), max (cars$speed), length = 100)
dist_pred_2 <- predict (lm_cars_2, newdata = list (speed = speed_seq))
lines (dist_pred_2 ~ speed_seq, col = 'green')
You see that the green line is practically straight - in fact, the pattern is very close to linear, so even though we fitted it with a polynomial curve, the result barely bends.
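One possible way to check more formally whether the second-degree polynomial improves the fit over the simple linear regression is to compare the two nested models with anova (a sketch; a non-significant P-value for this comparison suggests that the extra polynomial term does not add much):

anova (lm_cars, lm_cars_2)  # F-test comparing the linear model with the polynomial one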
Finally, we can report the statistical values of the linear regression as a legend inside the figure (this is suitable, however, only if there is enough space inside the drawing area; otherwise you can include them e.g. in the figure caption). Let's do it for the linear (but not the polynomial) regression, and include the parameters of the regression, r2, F-value and P-value:
plot (dist ~ speed, data = cars)
lm_cars <- lm (dist ~ speed, data = cars)
abline (lm_cars, col = 'red')
legend ('topleft', legend = c('y = -17.58 + 3.93*x', 'r2 = 0.65, F = 89.57, P < 0.001'), bty = 'n')
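If you prefer not to type the numbers into the legend by hand, the labels can also be assembled from the model object (a sketch; eq_txt, stat_txt and lm_cars_sum are arbitrary helper names, and the P < 0.001 threshold is kept as reported above):

lm_cars_sum <- summary (lm_cars)
eq_txt <- paste0 ('y = ', round (coef (lm_cars)[1], 2), ' + ', round (coef (lm_cars)[2], 2), '*x')
stat_txt <- paste0 ('r2 = ', round (lm_cars_sum$r.squared, 2), ', F = ', round (lm_cars_sum$fstatistic[1], 2), ', P < 0.001')
legend ('topleft', legend = c(eq_txt, stat_txt), bty = 'n')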