Linear regression is one of the most commonly used statistical techniques for understanding the linear relationship between two or more variables. It is such a common technique that there are a number of ways to perform linear regression analysis in Python. In this post we will do linear regression analysis, more or less from scratch, using matrix multiplication with NumPy instead of one of the readily available functions in Python.
Let us first load the Python packages we will use to build the linear regression model using matrix multiplication with NumPy's linear algebra module.
import pandas as pd
import numpy as np
# import matplotlib
import matplotlib.pyplot as plt
# import seaborn
import seaborn as sns
%matplotlib inline
To build the linear regression model we will use the classic cars dataset from cmdlinetips.com's github page.
data_url = 'https://raw.githubusercontent.com/cmdlinetips/data/master/cars.tsv'
cars = pd.read_csv(data_url, sep="\t")
The cars dataset contains the distances needed for cars traveling at different speeds to stop, recorded in the 1920s.
print(cars.head(n=3))

   speed  dist
0      4     2
1      4    10
2      7     4
Let us first visualize the relationship between speed and dist variables using a scatter plot.
bplot = sns.scatterplot(x='dist', y='speed', data=cars)
bplot.axes.set_title("dist vs speed: Scatter Plot", fontsize=16)
bplot.set_ylabel("Speed (mph)", fontsize=16)
bplot.set_xlabel("Distances taken to stop (feet)", fontsize=16)
We can see a clear linear relationship between the two variables.
Let us store the two columns in two variables named X and Y, where X is the predictor variable
X = cars.dist.values
and Y is the response variable.
Y = cars.speed.values
Our observed data are pairs of x and y values.
With a linear regression model, we fit our observed data using the linear model shown below and estimate the parameters of the model.

Y = \beta_0 + \beta_1 X + \epsilon
Here beta_0 and beta_1 are the intercept and slope of the linear equation. We can combine the predictor variables together as a matrix. In our example we have one predictor variable, so we create a matrix with ones as the first column and X as the second.
We use NumPy's vstack to create a 2-d NumPy array from the two 1-d arrays and call it X_mat.
X_mat=np.vstack((np.ones(len(X)), X)).T
X_mat[0:5,]

array([[ 1.,  2.],
       [ 1., 10.],
       [ 1.,  4.],
       [ 1., 22.],
       [ 1., 16.]])
Linear Regression Model Estimates using Matrix Multiplications
With a little bit of linear algebra, we can get our parameter estimates in the form of matrix multiplications: minimizing the mean squared error of the system of linear equations leads to the normal equations X^T X \beta = X^T Y, whose solution is shown below.

\hat{\beta} = (X^T X)^{-1} X^T Y
We can implement this using NumPy’s linalg module’s matrix inverse function and matrix multiplication function.
beta_hat = np.linalg.inv(X_mat.T.dot(X_mat)).dot(X_mat.T).dot(Y)
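As a side note, explicitly inverting X^T X is fine for a small dataset like this, but an equivalent and numerically more stable approach is to solve the normal equations directly with NumPy's solve function. A minimal sketch of that alternative:

# equivalent alternative: solve the normal equations (X^T X) beta = X^T Y
# directly, without explicitly forming the matrix inverse
beta_hat_alt = np.linalg.solve(X_mat.T.dot(X_mat), X_mat.T.dot(Y))

It gives the same estimates as the explicit inverse above.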
The variable beta_hat contains the estimates of the two parameters of the linear model, computed with matrix multiplication.
print(beta_hat)

[8.28390564 0.16556757]
It is a vector containing the y-axis intercept and the slope of the linear regression model. Let us use these parameters to estimate the values of Y from the X values.
# predict using the estimated coefficients
yhat = X_mat.dot(beta_hat)
We can visualize our estimate of yhat with the scatter plot.
# plot the data and the predictions
plt.scatter(X, Y)
plt.plot(X, yhat, color='red')
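As a quick sanity check on the fit, we can also quantify how much of the variance in Y our line explains by computing the coefficient of determination (R squared) from yhat; a small sketch using only NumPy:

# residual sum of squares and total sum of squares
ss_res = np.sum((Y - yhat) ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
# R squared: fraction of the variance in Y explained by the fitted line
r_squared = 1 - ss_res / ss_tot
print(r_squared)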
We can clearly see that our estimates nicely capture the linear relationship between X and Y. Let us double-check our matrix multiplication estimates of the linear regression parameters using scikit-learn's LinearRegression model.
Verifying Linear Regression Model Estimates using Scikit-learn
Let us load scikit-learn’s linear regression module.
from sklearn.linear_model import LinearRegression
We can build the linear regression model by first initializing the LinearRegression object and then fitting the model with the data.
regression = LinearRegression()
linear_model = regression.fit(X[:, np.newaxis], Y)
We can extract the parameters of the model using the intercept_ and coef_ attributes. And we can see that the estimates are exactly the same as the ones we obtained by the matrix multiplication method.
print(linear_model.intercept_)

8.283905641787172

print(linear_model.coef_)

[0.16556757]
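Instead of eyeballing the printed numbers, we can also verify the agreement programmatically with NumPy's allclose:

# check that both approaches give the same intercept and slope
print(np.allclose(beta_hat, [linear_model.intercept_, linear_model.coef_[0]]))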
In summary, we built a linear regression model in Python from scratch using matrix multiplication and verified our results using scikit-learn's linear regression model. Solving the system of linear equations with matrix multiplication is just one way to do linear regression analysis from scratch. One can also use a number of matrix decomposition techniques like SVD, Cholesky decomposition, and QR decomposition, a good topic for another blog post on linear regression in Python with linear algebra techniques.
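As a small teaser for that, NumPy already ships a least-squares solver, np.linalg.lstsq, which uses SVD under the hood; a minimal sketch of the same fit with it:

# least-squares fit via SVD using NumPy's built-in solver
beta_lstsq, residuals, rank, sing_vals = np.linalg.lstsq(X_mat, Y, rcond=None)
print(beta_lstsq)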