Linear Regression is one of the most useful statistical/machine learning techniques. And we have multiple ways to perform Linear Regression analysis in Python including scikit-learn’s linear regression functions and Python’s statmodels package.
statsmodels is a Python module for all things related to statistical analysis and it
provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. An extensive list of result statistics are available for each estimator. The results are tested against existing statistical packages to ensure that they are correct.
In this tutorial, we will learn how to do simple linear regression analysis using statmodels and understand the results.
Let us first load Pandas and Seaborn.
import seaborn as sns import pandas as pd import matplotlib.pyplot as plt
One of the advantages with statmodels package is that we can build linear regression model using formula that is very similar to the formula in R. Let us load statmodels’ formula api
import statsmodels.formula.api as smf
We will use Palmer penguins data and the dataset is available as part of Seaborn
penguins = sns.load_dataset("penguins") penguins =penguins.dropna() #penguins.head()
In this post, we will perform linear regression using penguins data, where the two variables of interest are strongly associated.
Let us first visualize the data used in the example of linear regression. To do that we will make a scatter plot and see that the two variables are clearly correlated.
sns.set_context("talk", font_scale=1.2) plt.figure(figsize=(10,8)) sns.scatterplot(x="bill_length_mm", y="flipper_length_mm", data=penguins)
With linear regression analysis, we are using the data to build a linear model (y = a + bx) and estimate two parameters; intercept a, and slope b of the linear model.
How To Fit a Linear Model with statsmodels?
Let us build our first linear regression model with stats model. As mentioned we can use formula to define linear regression model with statsmodels. For example, to build a linear regression model between tow variables y and x, we use the formula “y~x”, as shown below using ols() function in statsmodels, where ols is short for “Ordinary Least Square”.
# specify linear model with statsmodels lm_m1 = smf.ols(formula="bill_length_mm ~ flipper_length_mm", data=penguins)
After defining the linear regression model with ols() function we can actually fit the model to the data using fit() function.
# fit the linear model on the data with statsmodels' fit() lm_fit = lm_m1.fit()
Access Results from statsmodels
The resulting object from fit() function contains all the results from the linear regression model. We can get the estimated parameters from the linear regression fit with params method.
lm_fit.params Intercept -7.218558 flipper_length_mm 0.254825 dtype: float64
We can also get the R-squared from the statsmodels’ result object
lm_fit.rsquared 0.4265339132459687
Another way to quickly see the summary of results is to use summary() function.
# get the summary of linear model with statsmodels' summary() print(lm_fit.summary())
This basically gives the results in a tabular form with a lots of details. For example, in the first table statmodels provides detail about dependent variable, the method used, date and time when the model was run, number of observations, R-squared/adj. R-squared and a few statistics it computed in the model.
The second table contains most useful information from the linear regression model, the estimated parameters, their standard errors, t-statistic, p-value and confident interval.
And the third table contains more advanced statistical measures.
OLS Regression Results ============================================================================== Dep. Variable: bill_length_mm R-squared: 0.427 Model: OLS Adj. R-squared: 0.425 Method: Least Squares F-statistic: 246.2 Date: Sat, 23 Jan 2021 Prob (F-statistic): 7.21e-42 Time: 09:41:27 Log-Likelihood: -945.20 No. Observations: 333 AIC: 1894. Df Residuals: 331 BIC: 1902. Df Model: 1 Covariance Type: nonrobust ===================================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------- Intercept -7.2186 3.272 -2.206 0.028 -13.655 -0.783 flipper_length_mm 0.2548 0.016 15.691 0.000 0.223 0.287 ============================================================================== Omnibus: 35.275 Durbin-Watson: 0.950 Prob(Omnibus): 0.000 Jarque-Bera (JB): 44.902 Skew: 0.783 Prob(JB): 1.78e-10 Kurtosis: 3.886 Cond. No. 2.90e+03 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 2.9e+03. This might indicate that there are strong multicollinearity or other numerical problems.
statsmodels stores the results in SimpleTable format. We can access the three result tables directly using tables() function on the summary() object. We can see that there are three tables in a list in the summary object as we saw before.
lm_fit.summary().tables [<class 'statsmodels.iolib.table.SimpleTable'>, <class 'statsmodels.iolib.table.SimpleTable'>, <class 'statsmodels.iolib.table.SimpleTable'>]
Let us take a look at the the second table that contains the parameter estimates of the linear model
lm_fit.summary().tables[1] coef std err t P>|t| [0.025 0.975] Intercept -7.2186 3.272 -2.206 0.028 -13.655 -0.783 flipper_length_mm 0.2548 0.016 15.691 0.000 0.223 0.287 <h3> Statsmodels results as Pandas Dataframe</h3>
Often you would like to have the results as Pandas dataframe. To convert the statmodels result table to Pandas dataframe, we first convert the table of interest to html file using as_html() function.
lm_results_html = lm_fit.summary().tables[1].as_html()
And then use Pandas’ read_html() function to read the html results as Pandas dataframe.
df = pd.read_html(lm_results_html, header=0, index_col=0)[0]
Now the results are available as Pandas dataframe.
coef std err t P>|t| [0.025 0.975] Intercept -7.2186 3.272 -2.206 0.028 -13.655 -0.783 flipper_length_mm 0.2548 0.016 15.691 0.000 0.223 0.287