Machine Learning for Hackers Chapter 5: Linear regression (with categorical regressors)

Introduction

Chapter 5 of Machine Learning for Hackers is a relatively simple exercise in running linear regressions. Therefore, this post will be short, and I’ll only discuss the more interesting regression example, which nicely shows how patsy formulas handle categorical variables.

Linear regression with categorical independent variables

In chapter 5, the authors construct several linear regressions, the last of which is a multiple regression describing the number of page views of the top-viewed web sites. The regression is pretty straightforward, but includes two categorical variables: HasAdvertising, which takes values True or False; and InEnglish, which takes values Yes, No and NA (missing).
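
Written out (my notation, not the book's), the model being fit is:

log(PageViews) = b0 + b1*log(UniqueVisitors) + b2*1{HasAdvertising = True} + b3*1{InEnglish = Yes} + b4*1{InEnglish = No} + error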

If we include these variables in the formula, then patsy/statsmodels will automatically generate the necessary dummy variables. For HasAdvertising, we get a dummy variable equal to one when the value is True, with False as the baseline. For InEnglish, which takes three values, we get two separate dummy variables, one for Yes and one for No, with the missing value serving as the baseline.
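
To see the coding patsy applies, we can build a design matrix by hand on a toy frame (a hypothetical example of mine, not the book's data):

import pandas as pd
from patsy import dmatrix

# Made-up data just to illustrate the coding
toy = pd.DataFrame({'HasAdvertising': [True, False, True],
                    'InEnglish': ['Yes', 'No', 'NA']})
# One column per non-baseline level: HasAdvertising[T.True],
# InEnglish[T.No], InEnglish[T.Yes]
print dmatrix('HasAdvertising + InEnglish', toy)

With patsy handling that for us, the regression itself is just: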

import numpy as np
from statsmodels.formula.api import ols

model = 'np.log(PageViews) ~ np.log(UniqueVisitors) + HasAdvertising + InEnglish'
pageview_fit_multi = ols(model, top_1k_sites).fit()
print pageview_fit_multi.summary()

Results in:

                            OLS Regression Results
==============================================================================
Dep. Variable:      np.log(PageViews)   R-squared:                       0.480
Model:                            OLS   Adj. R-squared:                  0.478
Method:                 Least Squares   F-statistic:                     229.4
Date:                Sat, 24 Nov 2012   Prob (F-statistic):          1.52e-139
Time:                        09:50:25   Log-Likelihood:                -1481.1
No. Observations:                1000   AIC:                             2972.
Df Residuals:                     995   BIC:                             2997.
Df Model:                           4
==========================================================================================
                              coef    std err          t      P>|t|   [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
Intercept                  -1.9450      1.148     -1.695      0.090      -4.197     0.307
HasAdvertising[T.True]      0.3060      0.092      3.336      0.001       0.126     0.486
InEnglish[T.No]             0.8347      0.209      4.001      0.000       0.425     1.244
InEnglish[T.Yes]           -0.1691      0.204     -0.828      0.408      -0.570     0.232
np.log(UniqueVisitors)      1.2651      0.071     17.936      0.000       1.127     1.403
==============================================================================
Omnibus:                       73.424   Durbin-Watson:                   2.068
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               92.632
Skew:                           0.646   Prob(JB):                     7.68e-21
Kurtosis:                       3.744   Cond. No.                         570.
==============================================================================
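
A quick read of the coefficients (my gloss, not the book's): since both page views and visitors are logged, the 1.2651 on np.log(UniqueVisitors) is an elasticity, and the HasAdvertising coefficient is on the log scale, so exponentiating gives its multiplicative effect:

# exp of a log-scale coefficient gives the multiplicative effect
print np.exp(pageview_fit_multi.params['HasAdvertising[T.True]'])
# about 1.36: sites with advertising average roughly 36% more page
# views, holding unique visitors and language fixed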

If we were going to do this without the formula API, we'd have to explicitly create these dummy variables ourselves. For comparison, here's the equivalent code:

import statsmodels.api as sm

top_1k_sites['LogUniqueVisitors'] = np.log(top_1k_sites['UniqueVisitors'])
# HasAdvertising is already boolean (True/False), so cast it directly
top_1k_sites['HasAdvertisingYes'] = top_1k_sites['HasAdvertising'].astype(int)
top_1k_sites['InEnglishYes'] = np.where(top_1k_sites['InEnglish'] == 'Yes', 1, 0)
top_1k_sites['InEnglishNo'] = np.where(top_1k_sites['InEnglish'] == 'No', 1, 0)

linreg_fit = sm.OLS(np.log(top_1k_sites['PageViews']),
                    sm.add_constant(top_1k_sites[['HasAdvertisingYes',
                                                  'LogUniqueVisitors',
                                                  'InEnglishNo',
                                                  'InEnglishYes']],
                                    prepend=True)).fit()
print linreg_fit.summary()
