Python is popular for statistical analysis because of the large number of libraries. One of the most common statistical calculations is linear regression. statsmodels offers some powerful tools for regression and analysis of variance. Here’s how to get started with linear models.
What is statsmodels?
statsmodels is a Python library for running common statistical tests. It’s especially geared for regression analysis, particularly the kind you’d find in econometrics, but you don’t have to be an economist to use it.
It does have a learning curve, but once you get the hang of it, you’ll find that it’s a lot more flexible to use than the regression functions you’ll find in a spreadsheet program like Excel. It won’t make the plot for you, though. If you want to generate the classic scatterplot with a regression line drawn over it, you’ll want to use a library like Seaborn.
One advantage of using statsmodels is that it’s cross-checked with other statistical software packages like R, Stata, and SAS for accuracy, so this might be the package for you if you’re in professional or academic research.
Simple linear regression
If you just want to determine the relationship of a dependent variable (y), or the endogenous variable in econometric and statsmodels parlance, to the exogenous, independent, or “x” variable, you can do this easily with statsmodels.
If you have experience with R or want a quick way to generate a regression with statsmodels using a pandas DataFrame, you can use R-style formulas.
First, you need to import statsmodels and its formula API:
import statsmodels.formula.api as smf
import statsmodels.api as sm
I’ll use my go-to example dataset, which ships with Seaborn: tips and total bills at a New York City restaurant.
import seaborn as sns
sns.set_theme()
tips = sns.load_dataset('tips')
I can look at the dataset with the head method:
tips.head()
I want to figure out the relationship between the tip and the total bill. I can plot this relationship with a scatterplot in Seaborn.
sns.relplot(x='total_bill',y='tip',data=tips)
It looks like there’s a positive linear relationship. I can try a regression plot:
sns.regplot(x='total_bill',y='tip',data=tips)
It looks like there indeed is a positive linear relationship, as indicated by the line. With a larger bill, it seems that the tip increases as well. This plot won’t give me the coefficients I need for this line. statsmodels will do that.
First, I’ll fit the line.
results = smf.ols('tip ~ total_bill',data = tips).fit()
This uses statsmodels’ formula API, which in turn uses the patsy module; patsy borrows the formula conventions established by the R language. “ols” stands for “Ordinary Least Squares,” the method used to generate the regression: it minimizes the sum of the squared distances from the line to the data points, known as the residuals. The formula sets up the tip column as the endogenous variable and the total bill column as the exogenous variable. The tilde (~) in this context is kind of a fancy equal sign; it looks strange, but it’s a succinct way of specifying the relationship. The “data = tips” argument tells statsmodels to use the tips DataFrame. The .fit() method fits the model to the data points and returns a results object, in this case named “results.”
You can see a summary in one of two ways. You can use Python’s print command if you’re writing a Python script:
print(results.summary())
If you’re using an interactive session, such as IPython or a Jupyter notebook, you can simply type the name of the object, which I’ll demonstrate in a notebook of examples for this article:
results.summary()
This will show the results of the regression. I’ll explain it in detail later.
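If you just want the slope and intercept rather than the full summary table, the fitted results object exposes them directly. Here’s a quick sketch using the params attribute:

# The params attribute holds the fitted intercept and slope
print(results.params)
# You can also pull out individual values by name
print(results.params['Intercept'], results.params['total_bill'])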
You can also use NumPy arrays to perform the regression. You’ll create a “design” matrix of the independent variable plus an intercept column of 1s.
Let’s create x and y values with NumPy:
import numpy as np
x = np.linspace(-10,10,100)
y = 2*x - 3
statsmodels includes the add_constant function to add this intercept column of 1s, creating the design matrix.
X = sm.add_constant(x)
Creating the model is similar to using R-style formulas.
model = sm.OLS(y,X)
results = model.fit()
print(results.summary())
Notice that when using the statsmodels API instead of the formula API, the OLS function is in all caps.
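Once the model is fitted, you can also generate predictions for new x values with the predict method. Here’s a minimal sketch; x_new is just an illustrative range, and its design matrix has to be built with add_constant the same way as before:

# Illustrative new x values to predict on
x_new = np.linspace(-5, 5, 10)
# Add the intercept column so the new design matrix matches the fitted model
X_new = sm.add_constant(x_new)
# predict returns the model's fitted values (here, 2*x - 3) at the new points
print(results.predict(X_new))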
Multiple linear regression
It’s easy to extend the simple regression to more than one exogenous variable. This is useful for checking whether a confounding variable is behind an apparent correlation. Instead of fitting a line over a set of data points on a plane, you’re fitting a plane over the data points in three-dimensional space, or a hyperplane when there are more than two exogenous variables and thus more than three dimensions.
This is easy to accomplish with R-style formulas. Suppose we wanted to add the party size to our tips regression. We can do that:
results = smf.ols('tip ~ total_bill + size',data = tips).fit()
results.summary()
To add regressors, you can just add them to the formula with a plus sign (“+”).
You can also use this to model nonlinear relationships, such as a quadratic equation. Let’s generate another model using NumPy and create a pandas DataFrame:
import pandas as pd
x = np.linspace(-10,10,100)
y = 3*x**2 + 2*x + 5
df = pd.DataFrame({'x':x,'y':y})
We can visualize this with another scatterplot:
sns.relplot(x='x',y='y',data=df)
The plot seems to suggest a classic quadratic parabola.
We can create another formula to model this with statsmodels by adding the square of the x values as a regressor:
results = smf.ols('y ~ x + I(x**2)',data=df).fit()
results.summary()
The “I()” around the x**2 tells patsy to evaluate x**2 as an arithmetic operation on the x column of the pandas DataFrame, rather than interpreting the operator as formula syntax.
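As a quick sanity check, you can print the fitted coefficients and compare them to the values used to generate the data (an intercept of 5, a linear coefficient of 2, and a quadratic coefficient of 3):

# Since the data was generated without noise, the fitted
# coefficients should recover 5, 2, and 3 almost exactly
print(results.params)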
Interpreting regression results
You may wonder how to interpret the regression results.
When you view the results.summary() information, you’ll see a table with some values. Let’s walk through it. The top-right-hand corner has the r-squared, or the square of the correlation coefficient r. This tells you how strongly the independent or exogenous variables correlate with the endogenous or dependent variable, and consequently how good of a fit the model is to the data. The closer the r-squared is to 1, the stronger the correlation. Real-world data rarely correlates perfectly. In this example with multiple regression, the adjusted r-squared is about .463, which is a pretty good correlation.
The adjusted r-squared is helpful for multiple regression: it penalizes the addition of regressors that don’t improve the model, giving you a more honest measure of fit. If you look at the multiple regression we did, you’ll notice that it’s slightly lower than the r-squared.
The table below shows the intercept, the coefficients of each regressor, and how good of a fit they are. With a simple linear regression, the equation fits the slope-intercept form, y = mx + b. The value of the coefficient of the independent variable is the m, and the intercept is the b. This naturally extends to more regressors. The right-hand columns of the table show the confidence interval within which the true values could lie, which was represented on the regression plot by the shaded area.
The “std err” column shows the standard error of each coefficient estimate. The lower the number, the more precisely that coefficient is estimated.
The p-values determine the statistical significance of the variables. statsmodels performs a t-test on each coefficient against the null hypothesis that it is 0. If the p-value is lower than a predetermined threshold, commonly .05, the result is significant.
The bottom of the chart shows the distribution of the residuals, or the differences between the data points and the line. The residuals should be as close to normally distributed as possible for the model to be a good fit. The skew and kurtosis measure how far the residuals differ from the normal distribution. As with the correlation coefficient, real-world data rarely fits perfectly.
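All of these summary values are also available as attributes on the results object, which is handy if you want to use them in further calculations rather than just read them off the table. A minimal sketch, refitting the tips multiple regression from earlier so it’s self-contained:

# Works for any fitted OLS results object
results = smf.ols('tip ~ total_bill + size', data=tips).fit()
print(results.rsquared)        # r-squared
print(results.rsquared_adj)    # adjusted r-squared
print(results.params)          # intercept and coefficients
print(results.bse)             # standard errors of the coefficients
print(results.pvalues)         # p-values from the t-tests
print(results.conf_int())      # confidence intervals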
One-way analysis of variance
You may wonder how to compare a numerical value across a categorical variable. To do that, you need to go to analysis of variance, or ANOVA. It’s easy to accomplish with statsmodels. Since we’re using just one category, this is called one-way ANOVA.
Let’s load in another dataset, this time of penguins in Antarctica.
penguins = sns.load_dataset('penguins')
penguins.head()
Let’s see if species is a significant predictor of bill length. We can use a linear model similar to regression, but with a categorical variable:
penguin_lm = smf.ols('bill_length_mm ~ species',data=penguins).fit()
We can feed this to statsmodels’ anova_lm function:
results = sm.stats.anova_lm(penguin_lm)
We can view the results:
print(results)
We’ll see a table of the results, but the number to pay attention to is the p-value. Since it’s so small that it appears in scientific notation with a large negative exponent, species is a significant predictor of bill length.
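anova_lm returns a pandas DataFrame, so you can also pull the p-value out programmatically instead of reading it off the table. A quick sketch, assuming the p-value column is named 'PR(>F)' as in recent statsmodels versions:

# Extract the p-value for the species term from the ANOVA table
p_value = results.loc['species', 'PR(>F)']
print(p_value)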
Multi-way analysis of variance
We can extend this to another categorical variable with two-way ANOVA. It’s easy to modify our original formula and add the island to the mix to see if that can also predict the bill length:
penguin_multi_lm = smf.ols('bill_length_mm ~ species * island',data=penguins).fit()
results = sm.stats.anova_lm(penguin_multi_lm)
print(results)
It’s similar to multiple regression, but we use multiplication in the formula instead of addition.
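In patsy formulas, the asterisk is shorthand: species * island expands to both main effects plus their interaction term. As an illustration, here’s the same model written out explicitly (penguin_explicit_lm is just an illustrative name):

# 'a * b' in a formula is equivalent to 'a + b + a:b',
# where 'a:b' is the interaction term
penguin_explicit_lm = smf.ols(
    'bill_length_mm ~ species + island + species:island',
    data=penguins
).fit()
print(sm.stats.anova_lm(penguin_explicit_lm))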
Multiple regression and ANOVA like the kinds you’ve just seen would be almost impossible to do by hand, given the amount of data and the complexity of the formulas. statsmodels puts sophisticated modeling at your fingertips.