Introduction to Linear Regression in Python

Linear regression is one of the most commonly used statistical techniques for understanding the relationship between two quantitative variables (in the simplest case). Simple linear regression models the relationship between two variables X and Y, where X and Y are vectors containing multiple values.

For example, X could be a measure of how well each country is doing economically, like GDP per capita, and Y could be the life expectancy values of multiple countries. Linear regression can be used to understand the relationship between X and Y. In a standard machine learning application, linear regression is often used in prediction scenarios: if there is a strong relationship between X and Y, one can use the inferred relationship to predict the life expectancy of a new country from its gdpPercap value.

Another way to understand linear regression is that we have the observed data pairs (X, Y) and model them with a linear model of the form

Y = a + bX, where X is the explanatory variable and Y is the dependent variable.

The slope of the line is b, and a is the intercept (the value of Y when X = 0).

Basically, linear regression models the relationship between two variables by fitting a linear equation to observed data. By fitting, we mean finding the line that best explains the observed data, typically the line that minimizes the sum of squared vertical distances between the observed points and the line (ordinary least squares).
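
To make the idea of fitting concrete, here is a minimal sketch of an ordinary least squares fit using NumPy's np.polyfit; the toy data values below are made up purely for illustration.

import numpy as np

# hypothetical toy data: y is roughly 2 + 3x plus a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 10.8, 14.1])

# np.polyfit with degree 1 returns the least-squares [slope, intercept]
b, a = np.polyfit(x, y, 1)
print(a, b)  # intercept a close to 2, slope b close to 3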

Let us get started with an example of doing linear regression or fitting a linear model in Python. Let us first load the packages we will use in this example.

import pandas as pd
import numpy as np
# matplotlib for plotting
import matplotlib.pyplot as plt
# seaborn for statistical data visualization
import seaborn as sns
%matplotlib inline

Let us use gapminder data to fit a linear model.

data_url = 'http://bit.ly/2cLzoxH'
gapminder = pd.read_csv(data_url)
print(gapminder.head(n=3))

We will subset the gapminder data to contain only the rows for the continent Americas, just to keep the example simple.

gapminder_america = gapminder.query('continent=="Americas"')
gapminder_america.head()

We will use the life expectancy values as X, the independent variable.

X = gapminder_america.lifeExp.values

For this simple example illustrating the basics of linear regression, we will create Y values that are highly correlated with the X values. The strong correlation between X and Y implies a strong relationship between them. To create a Y that is correlated with X, we will simply add random noise to the existing X values and use the result as Y.

We will first create the random noise by sampling from a normal distribution with a specified mean and standard deviation. SciPy's stats module can generate random numbers from many different distributions.

from scipy.stats import norm
# draw len(X) samples from a normal distribution with mean 10
# and standard deviation 5
random_err = norm.rvs(size=len(X), loc=10, scale=5)

By adding the random numbers to X, we get our Y, the dependent variable.

Y = X + random_err

Now we have our two variables, X and Y, and we are ready to do linear regression analysis. The first step in any linear regression analysis is simply visualizing the data with a scatter plot. So, let us first make a scatter plot of X and Y, using Seaborn's scatterplot function.

sns.scatterplot(x=X, y=Y)

As we created Y from X, we see a high correlation and a clear linear trend between X and Y. By design, this data is ideal for linear modeling.

[Figure: linear regression example scatter plot of X and Y]
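
To put a number on that correlation, a quick check (our addition here) is the Pearson correlation coefficient via NumPy's np.corrcoef; the exact value will vary from run to run because of the random noise.

# Pearson correlation between X and Y; a value near 1 indicates
# a strong positive linear relationship
print(np.corrcoef(X, Y)[0, 1])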

There are multiple options to perform linear regression analysis in Python. Let us use Scikit-learn's linear_model module, and import the LinearRegression class from sklearn.linear_model.

from sklearn.linear_model import LinearRegression

The first step in linear regression using Scikit-learn is to create an instance of the LinearRegression class.

regression = LinearRegression()

We can use the LinearRegression instance to fit our data X and Y. Note that our X and Y are one-dimensional; however, Scikit-learn's LinearRegression needs X as two-dimensional data. We can easily convert the 1-d array into a 2-d array using NumPy's np.newaxis index, as X[:, np.newaxis].
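
To see what this reshaping does, we can compare the shapes before and after (a quick sanity check we have added here):

print(X.shape)                 # (n,): one-dimensional
print(X[:, np.newaxis].shape)  # (n, 1): two-dimensional, as sklearn expects

With the shape fixed, we can fit the model to X and Y: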

linear_model = regression.fit(X[:, np.newaxis], Y)
linear_model

Now we have actually performed a linear regression analysis of our data X and Y. We can use a number of available attributes and methods to understand our linear model. Recall that our linear model is of the form Y = a + bX. By doing the linear regression analysis, we have estimated the parameters a and b from the data.

The parameter a is the intercept, and it gives us the value of Y when X is zero. We can check the estimated intercept using the attribute intercept_.

print(linear_model.intercept_)
2.7483214870653967

We can also get the estimated coefficient b using the attribute coef_.

print(linear_model.coef_)
[0.83047083]

The coefficient b tells us that for every one-unit change in X, Y changes by about 0.83.
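
We can verify this interpretation directly from the estimated parameters; the X values 50 and 51 below are hypothetical, chosen just for illustration.

a = linear_model.intercept_
b = linear_model.coef_[0]
# predicted Y at X = 51 minus predicted Y at X = 50 equals the slope b
print((a + b * 51) - (a + b * 50))  # about 0.83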

We can also quantify the strength of the relationship between the two variables using the score method. The score is basically R^2, the coefficient of determination of the prediction. It tells us the proportion of the variation in Y explained by X.

print(linear_model.score(X[:, np.newaxis], Y))
0.7593183868366187
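
As a sanity check (our addition), we can compute R^2 from its definition, 1 - SS_res/SS_tot, using only the estimated parameters and the data:

# reconstruct the fitted values from the estimated parameters, Y = a + bX
Y_hat = linear_model.intercept_ + linear_model.coef_[0] * X
ss_res = np.sum((Y - Y_hat) ** 2)     # residual sum of squares
ss_tot = np.sum((Y - Y.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)            # matches linear_model.score above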

Since we have estimated the parameters defining the linear model, we can use them together with X to get the predicted values of Y. Scikit-learn's predict method gives us the predicted values of Y.

Y_predicted = linear_model.predict(X[:, np.newaxis])
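
We can also predict Y for an X value that is not in the data; the value 70 below is a hypothetical lifeExp, used just for illustration.

# predict Y for a hypothetical new observation with X = 70
print(linear_model.predict(np.array([[70.0]])))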

We can compare the predicted values of Y with actual values of Y.

df = pd.DataFrame({"Y":Y, "Y_predicted":Y_predicted})
df.head()
           Y  Y_predicted
0  71.545317    72.425546
1  69.915958    74.328204
2  69.568524    75.066801
3  82.356598    75.555886
4  64.934736    76.978406

We can see that the predicted values of Y are pretty close to the actual values of Y. The above example is one of the cases where linear regression is a great way to understand the relationship between two variables.
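
As a final visual check, we can overlay the fitted line on the scatter plot; this plot is our addition, built from the objects created above.

# scatter plot of the observed data with the fitted line on top
sns.scatterplot(x=X, y=Y)
order = np.argsort(X)  # sort so the line is drawn left to right
plt.plot(X[order], Y_predicted[order], color='red')
plt.xlabel('X (lifeExp)')
plt.ylabel('Y')
plt.show()

Points falling close to the red line confirm visually what the high R^2 told us numerically.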