Introduction to R for Ecologists
R語言在生態學的應用
EEB5082 (B44U1940)
Author: David Zelený
Permalink: davidzeleny.net/link/recol
Linear regression is a common statistical method to quantify the relationship between two quantitative variables, where one can be considered as dependent on the other. In this example, let's 1) calculate linear regression using example data and fit the regression equation, 2) predict fitted values, 3) calculate explained variation (coefficient of determination, r2) and test the statistical significance, and 4) plot the regression line onto the scatterplot.
Data we will use here are in the dataset cars, with two variables: dist (the distance the car travels after the driver pressed the brake) and speed (the speed of the car at the moment the brake was pressed).
head (cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
Obviously, dist is the dependent variable and speed is the independent one (increasing speed is likely to increase the braking distance, not the other way around). We can draw the relationship using the plot function, with variables defined via the formula interface: dependent_variable ~ independent_variable, data = data.frame:
plot (dist ~ speed, data = cars)
To calculate the linear regression model, we use the function lm (linear model), which uses the same formula interface as the plot function:
lm_cars <- lm (dist ~ speed, data = cars)
lm_cars
Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed  
    -17.579        3.932  
The function lm returns the estimated regression coefficients, namely the intercept and the slope of the regression. Using these coefficients, we can write the equation for calculating fitted values:
dist_predicted = -17.579 + 3.932*speed
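As a quick sanity check (not part of the original text), we can extract these coefficients with coef and confirm that the equation above reproduces the fitted values R stores in the model object:

```r
lm_cars <- lm (dist ~ speed, data = cars)
b <- coef (lm_cars)  # b[1] = intercept, b[2] = slope

# fitted values calculated manually from the regression equation ...
dist_manual <- b[1] + b[2] * cars$speed

# ... are identical to those returned by the fitted function
all.equal (unname (dist_manual), unname (fitted (lm_cars)))
```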
We can argue that the intercept coefficient does not make practical sense (at zero speed, the braking distance is obviously zero, not -17.6 feet), but let's ignore this annoying detail now and focus on the other statistical details. We can apply the generic function summary to the object which stores the linear model results, to get further numbers:
summary (lm_cars)
The output reports the most important values: apart from the regression coefficients, also the coefficient of determination (r2) and the values of the F statistic and the P-value. These are usually the numbers you need to include when reporting the results of the regression. Some details below:

The coefficient of determination (r2) quantifies how much of the variation in dist was explained by the given explanatory variable (speed). You can imagine that braking distance in fact depends on many factors other than just the speed of the car at the moment of pressing the brake - it also depends on how worn the tires are, whether the road is dry, wet or icy, how quickly the driver pressed the brake, etc. Here, speed explains 65% of the variation, which is more than half, but far from all of it - almost 35% of the variation remains unexplained.

It is important to report all relevant statistical values. For example, when you report a P-value, you also need to include the value of the test statistic (here the F-value) and the numbers of degrees of freedom (because the same F-value with different degrees of freedom leads to a different P-value).
Finally, we can plot the regression line onto the scatterplot. The simplest way to do it is the function abline, which can directly use the coefficients from the lm object. You may know that abline has arguments v and h for the coordinates of vertical and horizontal lines, respectively, but it also has arguments a and b for regression coefficients (intercept and slope, respectively). It can extract these values directly from the result of the lm function:
plot (dist ~ speed, data = cars)
lm_cars <- lm (dist ~ speed, data = cars)
abline (lm_cars, col = 'red')
The more general solution is to use the parameters of the linear regression to predict the values of the dependent variable. Since we are considering a linear response here, it is enough to predict the values of the dependent variable for the minimum and maximum of the explanatory variable (i.e. braking distance for the minimum and maximum measured speed within the experiment). For this purpose, there is the function predict, which takes two arguments: the object generated by the lm function (here lm_cars), and newdata with the values of the explanatory variable for which the dependent variable should be predicted. Important: newdata must be a list, and the name of its component must be the same as the name of the explanatory variable in the formula you use in lm (here speed). If the name is different (e.g. in lm you use speed, but in predict you use x), the predict function will ignore the argument and predict values for all numbers inside the explanatory variable (here 50 speed measurements).
plot (dist ~ speed, data = cars)
lm_cars <- lm (dist ~ speed, data = cars)
speed_min_max <- range (cars$speed)  # range of values, the same as: c(min (cars$speed), max (cars$speed))
dist_pred <- predict (lm_cars, newdata = list (speed = speed_min_max))
lines (dist_pred ~ speed_min_max, col = 'blue')
You can see that the result (blue regression line) is very similar to the one drawn by the abline function (red regression line above), but not identical - the red line goes from edge to edge of the plotting region, while the blue line goes only from the minimum to the maximum value of speed.
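A further convenience of predict (not covered above) is its interval argument, which returns lower and upper confidence limits along with the fitted values, so a confidence band can be added around the regression line; a sketch:

```r
lm_cars <- lm (dist ~ speed, data = cars)
speed_seq <- seq (min (cars$speed), max (cars$speed), length = 100)

# matrix with columns fit, lwr and upr (95% confidence interval by default)
ci <- predict (lm_cars, newdata = list (speed = speed_seq), interval = 'confidence')

plot (dist ~ speed, data = cars)
lines (ci[, 'fit'] ~ speed_seq, col = 'blue')
lines (ci[, 'lwr'] ~ speed_seq, col = 'blue', lty = 'dashed')
lines (ci[, 'upr'] ~ speed_seq, col = 'blue', lty = 'dashed')
```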
The benefit of the predict solution becomes more apparent if we consider that the relationship between the dependent and explanatory variable may not be linear, and would be better described by a curvilinear shape. For this, we can use the poly function, included in the formula of the lm model, which allows creating polynomial regression (i.e. y = b0 + b1*x + b2*x^2 for a polynomial of the second degree, allowing the curve to "bend" once).
lm_cars_2 <- lm (dist ~ poly (speed, degree = 2), data = cars)
lm_cars_2
Call:
lm(formula = dist ~ poly(speed, degree = 2), data = cars)

Coefficients:
             (Intercept)  poly(speed, degree = 2)1  poly(speed, degree = 2)2  
                   42.98                    145.55                     23.00  
The equation of this polynomial regression is then

dist_predicted = 42.98 + 145.55*speed + 23.00*speed^2

(note, however, that poly by default creates orthogonal polynomials, so strictly speaking these coefficients apply to transformed polynomial terms, not directly to raw speed and speed^2 values).
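Because poly by default produces orthogonal polynomials, the printed coefficients belong to transformed terms. A sketch showing that refitting with raw = TRUE gives coefficients which multiply speed and speed^2 directly, while the fitted curve stays exactly the same:

```r
lm_cars_2 <- lm (dist ~ poly (speed, degree = 2), data = cars)
lm_cars_raw <- lm (dist ~ poly (speed, degree = 2, raw = TRUE), data = cars)

round (coef (lm_cars_raw), 3)  # intercept, coefficient for speed, coefficient for speed^2

# both parameterizations describe the same fitted curve
all.equal (unname (fitted (lm_cars_2)), unname (fitted (lm_cars_raw)))
```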
To draw the predicted polynomial regression curve, we need to use predict (because abline can draw only straight lines), and additionally we need to define enough points along the curve to make the curvature smooth.
plot (dist ~ speed, data = cars)
lm_cars_2 <- lm (dist ~ poly (speed, degree = 2), data = cars)
speed_min_max_seq <- seq (min (cars$speed), max (cars$speed), length = 100)
dist_pred_2 <- predict (lm_cars_2, newdata = list (speed = speed_min_max_seq))
lines (dist_pred_2 ~ speed_min_max_seq, col = 'green')
You can see that the green line is practically straight - in fact, the pattern is very close to linear, so even though we fitted a polynomial curve, it comes out almost straight.
Finally, we can report the statistical values of the linear regression as a legend inside the figure (this is suitable, however, only if there is enough space inside the drawing area). Let's do it for the linear (not polynomial) regression, and include the parameters of the regression, r2, F-value and P-value:
plot (dist ~ speed, data = cars)
lm_cars <- lm (dist ~ speed, data = cars)
abline (lm_cars, col = 'red')
legend ('topleft', legend = c('y = -17.58 + 3.93*x', 'r2 = 0.65, F = 89.57, P < 0.001'), bty = 'n')
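Instead of typing the numbers into the legend by hand, they can also be assembled from the model object with sprintf (a sketch; for a true superscript in r2 one would need an expression-based legend):

```r
lm_cars <- lm (dist ~ speed, data = cars)
sum_lm <- summary (lm_cars)
b <- coef (lm_cars)
Fst <- sum_lm$fstatistic
P <- pf (Fst[1], Fst[2], Fst[3], lower.tail = FALSE)

# build the legend strings from the extracted values
eq_text <- sprintf ('y = %.2f + %.2f*x', b[1], b[2])
stat_text <- sprintf ('r2 = %.2f, F = %.2f, %s', sum_lm$r.squared, Fst[1],
                      if (P < 0.001) 'P < 0.001' else sprintf ('P = %.3f', P))

plot (dist ~ speed, data = cars)
abline (lm_cars, col = 'red')
legend ('topleft', legend = c (eq_text, stat_text), bty = 'n')
```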