Introduction to Linear Regression in R

Linear regression is one of the most commonly used statistical methods. Linear modeling and linear regression help us understand the relationship between multiple variables. In the simplest case, linear regression is about understanding the relationship between two variables, X and Y.

One way to understand linear regression is that we have observed data pairs (X, Y) and model them with a linear model of the form

Y = a + bX,

where X is the explanatory variable and Y is the dependent variable. With just two variables, the linear model simply specifies a line. The slope of the line is b, and a is the intercept (the value of Y when X = 0).

We fit a linear equation, i.e. find the line that best explains the observed data.
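For simple linear regression, "best explains" means minimizing the sum of squared residuals, and the estimates have closed-form solutions: b = cov(X, Y) / var(X) and a = mean(Y) - b * mean(X). Here is a minimal sketch, using small toy vectors made up for illustration, showing that computing these by hand matches lm():

# toy data, made up for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# least-squares slope and intercept by hand
b <- cov(x, y) / var(x)
a <- mean(y) - b * mean(x)
c(intercept = a, slope = b)

# lm() gives the same estimates
coef(lm(y ~ x))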

Let us load tidyverse and set the ggplot2 theme to theme_bw().

library(tidyverse)
theme_set(theme_bw())

To illustrate how to perform linear regression analysis in R, we will use simulated data based on lifeExp values from the gapminder dataset. Let us load the gapminder data and take a peek.

library(gapminder)
head(gapminder)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

Let us randomly select 200 lifeExp values from the gapminder data and use them as our X variable.

set.seed(42)
X <- gapminder %>%
  sample_n(200) %>% 
  pull(lifeExp)

Let us create the variable Y based on X by adding random noise. We use random numbers sampled from a normal distribution with mean 5 and sd 2.

set.seed(42)
random_noise <- rnorm(length(X), mean=5, sd=2)

Let us make a histogram to check that the distribution of the noise is centered around 5.

ggplot(mapping = aes(x = random_noise)) + 
  geom_histogram(bins = 50, fill = "steelblue", col = "black")

Let us create Y by adding the random noise to the X values, and store both X and Y in a data frame.

Y <- X + random_noise
df_lm <- data.frame(X=X,Y=Y)
head(df_lm)

##        X        Y
## 1 52.644 60.38592
## 2 57.939 61.80960
## 3 55.191 60.91726
## 4 72.220 78.48573
## 5 71.100 76.90854
## 6 54.043 58.83075
Figure: histogram of the random noise, centered around 5.

Now we have the data and are ready to do linear regression analysis in R. We can fit a linear model with the lm() function. Let us do linear regression with Y and X and save the result in a variable.


# fit linear regression
lm_obj <- lm(Y~X)

We can get the summary of the linear model using the function summary()

# summary
summary(lm_obj)

The summary of the linear model provides a lot of information about the model, and it looks like this.

## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8580 -1.1346 -0.0774  1.2124  5.2405 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.75676    0.64238   8.962  2.4e-16 ***
## X            0.98634    0.01056  93.391  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.946 on 198 degrees of freedom
## Multiple R-squared:  0.9778, Adjusted R-squared:  0.9777 
## F-statistic:  8722 on 1 and 198 DF,  p-value: < 2.2e-16

First, it shows the formula of the linear model, followed by a lot of details about the model we just built.

Let us jump to the section titled "Coefficients" and look at the column "Estimate". It gives the two parameters of the linear model, the intercept a and the slope b, estimated from the data points.

In this example, the intercept is a = 5.75676 and the slope is b = 0.98634.
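Instead of reading these numbers off the printed summary, we can extract them programmatically with coef(), which returns the estimates as a named vector:

# extract the estimated intercept and slope
coef(lm_obj)
# (Intercept) ~ 5.757 and X ~ 0.986, matching the summary above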

Another quantity of interest is the Adjusted R-squared given at the bottom. Adjusted R-squared is a measure of how well the data fit the linear model. You can also think of it as a measure of correlation between the two variables: it tells us how much of the variance in the variable Y is explained by the variable X.
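To make the link with correlation concrete: in simple linear regression with a single predictor, Multiple R-squared equals the squared Pearson correlation between X and Y. A quick check using the objects created above:

# adjusted R-squared straight from the model summary
summary(lm_obj)$adj.r.squared

# with one predictor, multiple R-squared equals cor(X, Y)^2
summary(lm_obj)$r.squared
cor(X, Y)^2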

There are a lot of other features in the summary of the model. We will not worry about that now.

We now know that the variables X and Y in this example are highly correlated and that the linear regression model fits the data nicely.
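Since the model fits well, we can also use it to make predictions for new X values with predict(). A small usage sketch (the X values 50 and 70 here are arbitrary, chosen for illustration):

# predict Y for a couple of new X values
predict(lm_obj, newdata = data.frame(X = c(50, 70)))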

Let us make a scatter plot of X and Y to see their relationship. Let us also add the fitted line with the geom_smooth() function, using method = 'lm' to show the linear model fit.

df_lm %>%
  ggplot(aes(x=X,y=Y)) + 
  geom_point(color="steelblue") + 
  geom_smooth(method='lm',se=FALSE) +
  theme_bw(base_size = 16)

We can see that X and Y are highly correlated with each other.

Figure: scatter plot of X and Y with the linear regression fit line.

Second Linear Regression Example

Now let us try doing linear regression in R with two variables that are not as strongly correlated.

Let us generate random noise again, but this time with mean 20 and a larger standard deviation of 20.

random_noise2 <- rnorm(length(X), mean=20, sd=20)

Let us add the random noise to X and create a new variable Y as before.

Y <- X + random_noise2

Let us go ahead and build a linear model with the new set of data. One difference in the linear model below is that we also pass the data frame containing the data to lm() via the data argument.

# Create a dataframe
df2_lm <- data.frame(X=X,Y=Y)
# build linear regression model with lm()
lm2_obj <- lm(Y~X,data=df2_lm)

We can look at the summary of the linear model using the summary() function as before.

summary(lm2_obj)

And we get a detailed summary as before. Our linear regression model has estimated the intercept a = 11.7267 and the slope b = 1.1431.

If we look at the value of Adjusted R-squared = 0.3824, it is much lower than in the previous model. However, it still explains a sizable chunk of the variance in the data. We learn that there is a linear relationship between the variables X and Y, but it is not as strong as the relationship in the first example.

## 
## Call:
## lm(formula = Y ~ X, data = df2_lm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.879 -12.664  -1.151  12.560  51.866 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.7267     6.2383    1.88   0.0616 .  
## X             1.1431     0.1026   11.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.9 on 198 degrees of freedom
## Multiple R-squared:  0.3855, Adjusted R-squared:  0.3824 
## F-statistic: 124.2 on 1 and 198 DF,  p-value: < 2.2e-16
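With both models in hand, we can pull out their adjusted R-squared values side by side to quantify how much weaker the second relationship is:

# compare goodness of fit of the two models
summary(lm_obj)$adj.r.squared   # first model, ~ 0.978
summary(lm2_obj)$adj.r.squared  # second model, ~ 0.382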

Let us make the scatter plot of the two variables and add the linear model fit line with geom_smooth().

df2_lm %>% 
  ggplot(aes(x=X,y=Y)) + 
  geom_point(color="dodgerblue") +
  theme_bw(base_size=16) +
  geom_smooth(method='lm',se=FALSE, color="black")

By looking at the scatter plot, we can come to the same conclusion that there is a weak linear relationship between the two variables.

Figure: scatter plot of X and Y for the second example, with the linear regression fit line.