Linear Regression Analysis with statsmodels in Python

statsmodels Python
statsmodels Python

statsmodels Python
Linear Regression is one of the most useful statistical/machine learning techniques. And we have multiple ways to perform Linear Regression analysis in Python including scikit-learn’s linear regression functions and Python’s statmodels package.

statsmodels is a Python module for all things related to statistical analysis and it

provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. An extensive list of result statistics are available for each estimator. The results are tested against existing statistical packages to ensure that they are correct.

In this tutorial, we will learn how to do simple linear regression analysis using statmodels and understand the results.

Let us first load Pandas and Seaborn.

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

One of the advantages with statmodels package is that we can build linear regression model using formula that is very similar to the formula in R. Let us load statmodels’ formula api

import statsmodels.formula.api as smf

We will use Palmer penguins data and the dataset is available as part of Seaborn

penguins = sns.load_dataset("penguins")
penguins =penguins.dropna()
#penguins.head()

In this post, we will perform linear regression using penguins data, where the two variables of interest are strongly associated.

Let us first visualize the data used in the example of linear regression. To do that we will make a scatter plot and see that the two variables are clearly correlated.

sns.set_context("talk", font_scale=1.2)
plt.figure(figsize=(10,8))
sns.scatterplot(x="bill_length_mm",
                y="flipper_length_mm", 
                data=penguins)
Scatter plot for Linear Regression with statsmodels in Python

With linear regression analysis, we are using the data to build a linear model (y = a + bx) and estimate two parameters; intercept a, and slope b of the linear model.

How To Fit a Linear Model with statsmodels?

Let us build our first linear regression model with stats model. As mentioned we can use formula to define linear regression model with statsmodels. For example, to build a linear regression model between tow variables y and x, we use the formula “y~x”, as shown below using ols() function in statsmodels, where ols is short for “Ordinary Least Square”.

# specify linear model with statsmodels
lm_m1 = smf.ols(formula="bill_length_mm ~ flipper_length_mm", data=penguins)

After defining the linear regression model with ols() function we can actually fit the model to the data using fit() function.

# fit the linear model on the data with statsmodels' fit()
lm_fit = lm_m1.fit()

Access Results from statsmodels

The resulting object from fit() function contains all the results from the linear regression model. We can get the estimated parameters from the linear regression fit with params method.

lm_fit.params

Intercept           -7.218558
flipper_length_mm    0.254825
dtype: float64

We can also get the R-squared from the statsmodels’ result object

lm_fit.rsquared

0.4265339132459687

Another way to quickly see the summary of results is to use summary() function.

# get the summary of linear model with statsmodels' summary()
print(lm_fit.summary())

This basically gives the results in a tabular form with a lots of details. For example, in the first table statmodels provides detail about dependent variable, the method used, date and time when the model was run, number of observations, R-squared/adj. R-squared and a few statistics it computed in the model.

The second table contains most useful information from the linear regression model, the estimated parameters, their standard errors, t-statistic, p-value and confident interval.

And the third table contains more advanced statistical measures.


                            OLS Regression Results                            
==============================================================================
Dep. Variable:         bill_length_mm   R-squared:                       0.427
Model:                            OLS   Adj. R-squared:                  0.425
Method:                 Least Squares   F-statistic:                     246.2
Date:                Sat, 23 Jan 2021   Prob (F-statistic):           7.21e-42
Time:                        09:41:27   Log-Likelihood:                -945.20
No. Observations:                 333   AIC:                             1894.
Df Residuals:                     331   BIC:                             1902.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            -7.2186      3.272     -2.206      0.028     -13.655      -0.783
flipper_length_mm     0.2548      0.016     15.691      0.000       0.223       0.287
==============================================================================
Omnibus:                       35.275   Durbin-Watson:                   0.950
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               44.902
Skew:                           0.783   Prob(JB):                     1.78e-10
Kurtosis:                       3.886   Cond. No.                     2.90e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.9e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

statsmodels stores the results in SimpleTable format. We can access the three result tables directly using tables() function on the summary() object. We can see that there are three tables in a list in the summary object as we saw before.


lm_fit.summary().tables
[<class 'statsmodels.iolib.table.SimpleTable'>,
 <class 'statsmodels.iolib.table.SimpleTable'>,
 <class 'statsmodels.iolib.table.SimpleTable'>]

Let us take a look at the the second table that contains the parameter estimates of the linear model

lm_fit.summary().tables[1]

        	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	-7.2186	3.272	-2.206	0.028	-13.655	-0.783
flipper_length_mm	0.2548	0.016	15.691	0.000	0.223	0.287

<h3> Statsmodels results as Pandas Dataframe</h3>

Often you would like to have the results as Pandas dataframe. To convert the statmodels result table to Pandas dataframe, we first convert the table of interest to html file using as_html() function.

lm_results_html = lm_fit.summary().tables[1].as_html()

And then use Pandas’ read_html() function to read the html results as Pandas dataframe.

df = pd.read_html(lm_results_html, header=0, index_col=0)[0]

Now the results are available as Pandas dataframe.

        	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	-7.2186	3.272	-2.206	0.028	-13.655	-0.783
flipper_length_mm	0.2548	0.016	15.691	0.000	0.223	0.287