Machine Learning for Hackers Chapter 8: Principal Components Analysis

The code for Chapter 8 has been sitting around for a long time now. Let’s blow the dust off and check it out. One thing before we start: explaining PCA well is kinda hard. If any experts reading this feel I’ve described something imprecisely (and have a better description), I’m very open to suggestions.

Introduction

Chapter 8 is about Principal Components Analysis (PCA), which the authors perform on data with time series of prices for 24 stocks. In very broad terms, PCA is about projecting many real-life, observed variables onto a smaller number of “abstract” variables, the principal components. Principal components are selected to best preserve the variation and correlation of the original variables. For example, if we have 100 variables in our data, all highly correlated, we can project them down to just a few principal components; i.e., the high correlation between them can be imagined as coming from an underlying factor that drives all of them, with some other, less important factors driving their differences. When variables aren’t highly correlated, more principal components are needed to describe them well.
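To make that concrete, here's a minimal numpy sketch (made-up data, not the book's stock prices): five variables all driven by one common factor collapse onto essentially a single principal component, computed via SVD of the centered data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 observations of 5 variables, all driven mostly by one common factor
factor = rng.normal(size=(200, 1))
data = factor + 0.1 * rng.normal(size=(200, 5))

# PCA via SVD of the centered data matrix
centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)  # share of variance captured by each component
```

With this setup the first component soaks up nearly all the variance; with weakly correlated variables the shares would be spread out much more evenly.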

As you might imagine, PCA can be a very effective ...

Machine Learning for Hackers Chapter 7: Numerical optimization with deterministic and stochastic methods

Introduction

Chapter 7 of Machine Learning for Hackers is about numerical optimization. The authors organize the chapter around two examples of optimization. The first is a straightforward least-squares problem, like those we’ve already encountered doing linear regressions, and is amenable to standard iterative algorithms (e.g. gradient descent). The second is a problem with a discrete search space that isn’t clearly differentiable, and so lends itself to a stochastic/heuristic optimization technique (though we’ll see the optimization problem is basically artificial). The first problem gives us a chance to play around with Scipy’s optimization routines. The second problem has us hand-coding a Metropolis algorithm; this doesn’t show off much new Python, but it’s fun nonetheless.
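As a flavor of the first kind of problem, here's a hand-rolled gradient-descent sketch on a toy least-squares problem (numpy only; the data, step size, and iteration count are all made up, and this isn't the book's code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy least-squares problem: recover beta from y = X @ beta + noise
X = rng.normal(size=(100, 2))
true_beta = np.array([2.0, -1.0])
y = X @ true_beta + 0.05 * rng.normal(size=100)

beta = np.zeros(2)
lr = 0.01  # step size
for _ in range(2000):
    grad = -2 * X.T @ (y - X @ beta) / len(y)  # gradient of mean squared error
    beta -= lr * grad
```

After enough steps, beta lands very close to the true coefficients; Scipy's routines do the same job with smarter step-size and direction choices.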

The notebook for this chapter is at the Github repo here, or you can view it online via nbviewer here.

Ridge regression by least-squares

In chapter 6 we estimated LASSO regressions, which added an L1 penalty on the parameters to the OLS loss function. Ridge regression works the same way, but applies an L2 penalty to the parameters. It’s a somewhat more straightforward optimization problem, since the L2 norm gives us a differentiable loss function.
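A quick sketch of what that looks like in practice (toy data and an arbitrary penalty weight `lam`, both my inventions): minimize the penalized loss numerically with scipy, then check the answer against the closed-form ridge solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 0.5, -0.5]) + 0.1 * rng.normal(size=50)
lam = 1.0  # ridge penalty weight (arbitrary choice for the example)

def ridge_loss(beta):
    """OLS loss plus an L2 penalty on the parameters."""
    resid = y - X @ beta
    return resid @ resid + lam * beta @ beta

numeric = minimize(ridge_loss, np.zeros(3)).x

# Closed-form ridge solution: (X'X + lam * I)^-1 X'y
analytic = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```

Because the loss is a smooth quadratic, the numerical optimizer agrees with the analytic solution to high precision.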

In ...

Machine Learning for Hackers Chapter 6: Regression models with regularization

In my opinion, Chapter 6 is the most important chapter in Machine Learning for Hackers. It introduces the fundamental problem of machine learning: overfitting and the bias-variance tradeoff. And it demonstrates the two key tools for dealing with it: regularization and cross-validation.

It’s also a fun chapter to write in Python, because it lets me play with the fantastic scikit-learn library. scikit-learn is loaded with hi-tech machine learning models, along with convenient “pipeline”-type functions that facilitate the process of cross-validating and selecting hyperparameters for models. Best of all, it’s very well documented.

Fitting a sine wave with polynomial regression

The chapter starts out with a useful toy example: trying to fit a curve to data generated by a sine function over the interval [0, 1] with added Gaussian noise. The natural way to fit nonlinear data like this is with a polynomial function, so that the output y is a function of powers of the input x. But there are two problems with this.

First, we can generate highly correlated regressors by taking powers of x, leading to noisy parameter estimates. The inputs x are evenly spaced numbers on the interval [0, 1]. So x and x ...
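The basic setup is easy to reproduce with numpy (synthetic data and arbitrary polynomial degrees, not the book's exact code): a high-degree polynomial hugs the noisy sample far more closely than a straight line, which is exactly the overfitting risk the chapter is about.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 101)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

def rms_error(deg):
    """In-sample RMS error of a degree-`deg` polynomial fit."""
    fitted = np.polyval(np.polyfit(x, y, deg), x)
    return np.sqrt(np.mean((y - fitted) ** 2))

line_err, poly_err = rms_error(1), rms_error(9)
```

The in-sample error of the degree-9 fit is tiny, but that says nothing about how it would do on fresh data, which is where cross-validation comes in.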

Machine Learning for Hackers Chapter 5: Linear regression (with categorical regressors)

Introduction

Chapter 5 of Machine Learning for Hackers is a relatively simple exercise in running linear regressions. Therefore, this post will be short, and I’ll only discuss the more interesting regression example, which nicely shows how patsy formulas handle categorical variables.

Linear regression with categorical independent variables

In chapter 5, the authors construct several linear regressions, the last of which is a multivariate regression describing the number of page views of top-viewed web sites. The regression is pretty straightforward, but includes two categorical variables: HasAdvertising, which takes values True or False; and InEnglish, which takes values Yes, No and NA (missing).

If we include these variables in the formula, then patsy/statsmodels will automatically generate the necessary dummy variables. For HasAdvertising, we get a dummy variable equal to one when the value is True. For InEnglish, which takes three values, we get two separate dummy variables, one for Yes and one for No, with the missing value serving as the baseline.

from statsmodels.formula.api import ols

model = ('np.log(PageViews) ~ np.log(UniqueVisitors)'
         ' + HasAdvertising + InEnglish')
pageview_fit_multi = ols(model, top_1k_sites).fit()
print(pageview_fit_multi.summary())

Results in:

OLS Regression Results

==============================================================================
Dep. Variable: np.log(PageViews) R-squared: 0.480
Method: Least ...
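If you want to see the treatment coding patsy applies laid out explicitly, pandas' get_dummies does the same thing by hand (toy data of my own, with drop_first=True dropping the baseline level of each categorical):

```python
import pandas as pd

# Toy frame with a two-level and a multi-level categorical (made-up values)
df = pd.DataFrame({
    "HasAdvertising": [True, False, True],
    "InEnglish": ["Yes", "No", "No"],
})

# One dummy per non-baseline level, mirroring what patsy builds from a formula
dummies = pd.get_dummies(df, columns=["HasAdvertising", "InEnglish"],
                         drop_first=True)
```

Here the surviving columns are HasAdvertising_True and InEnglish_Yes; False and No serve as the baselines absorbed into the intercept.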

Machine Learning for Hackers Chapter 4: Priority e-mail ranking

Introduction

I’m not going to write much about this chapter. In my opinion the payoff-to-effort ratio for this project is pretty low. The algorithm for ranking e-mails is straightforward but, I think, seriously flawed. Most of the code in the chapter (and there’s a lot of it) revolves around parsing the text in the files. It’s a good exercise in thinking through feature extraction, but it doesn’t introduce many new ML concepts. And from my perspective, there’s not much opportunity to show off any Python goodness. But, I’ll hit a couple of points that are new and interesting.

The complete code is at the Github repo here, and you can read the notebook via nbviewer here.

1. Vectorized string methods in pandas. Back in Chapter 1, I groused about lacking vectorized functions for operations on strings or dates in pandas. If it wasn’t a numpy ufunc, you had to use the pandas map() method. That’s changed a lot over the summer, and since pandas 0.9.0, we can call vectorized string methods.

For example, here’s the code from my chapter’s program that identifies e-mails that ...
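A generic illustration of the feature, separate from the chapter's code (made-up subject lines): the vectorized methods hang off the .str accessor, chain like their plain-string counterparts, and pass missing values through untouched.

```python
import pandas as pd

subjects = pd.Series(["Re: meeting", "URGENT!!", "re: budget", None])

# Vectorized string ops apply elementwise; None stays missing in the result
is_reply = subjects.str.lower().str.startswith("re:")
```

Before these methods existed, each of those operations meant a map() call with a lambda, so this is a real quality-of-life improvement.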

ARM Chapter 5: Logistic models of well-switching in Bangladesh

The logistic regression we ran for chapter 2 of Machine Learning for Hackers was pretty simple. So I wanted to find an example that would dig a little deeper into statsmodels’s capabilities and the power of the patsy formula language.

So, I’m taking an intermission from Machine Learning for Hackers and am going to show an example from Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models (“ARM”). The chapter has a great example of going through the process of building, interpreting, and diagnosing a logistic regression model. We’ll end up with a model with lots of interactions and variable transforms, which is a great showcase for patsy and the statsmodels formula API.

Logistic model of well-switching in Bangladesh

Our data are information on about 3,000 respondent households in Bangladesh with wells having an unsafe amount of arsenic. The data record the amount of arsenic in the respondent’s well, the distance to the nearest safe well (in meters), whether that respondent “switched” wells by using a neighbor’s safe well instead of their own, as well as the respondent’s years of education and a dummy variable indicating whether they belong to ...
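To preview the modeling setup (on simulated data, since I'm not reproducing the survey here; coefficients, column names, and the additive model without the chapter's interactions are all my own simplifications):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 1000
df = pd.DataFrame({
    "arsenic": rng.uniform(0.5, 5.0, n),   # arsenic level in the respondent's well
    "dist100": rng.uniform(0.0, 3.0, n),   # distance to nearest safe well / 100m
})
# Simulated behavior: more arsenic pushes toward switching, distance pushes against
logit_p = -0.2 + 0.5 * df["arsenic"] - 0.9 * df["dist100"]
df["switch"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit_p))).astype(int)

fit = smf.logit("switch ~ arsenic + dist100", data=df).fit(disp=0)
```

Swapping the formula for something like "switch ~ arsenic * dist100" is all it takes to add the interaction terms the chapter builds up to, which is the point of using the formula API.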

Machine Learning for Hackers Chapter 2, Part 2: Logistic regression with statsmodels

Introduction

I last left chapter 2 of Machine Learning for Hackers (a long time ago), running some kernel density estimators on height and weight data (see here). The next part of the chapter plots a scatterplot of weight vs. height and runs a lowess smoother through it. I’m not going to write any more about the lowess function in statsmodels. I’ve discussed some issues with it (i.e. it’s slow) here. And it’s my sense that the lowess API, as it is now in statsmodels, is not long for this world. The code is all in the IPython notebooks in the Github repo and is pretty straightforward.

Patsy and statsmodels formulas

What I want to skip to here is the logistic regressions the authors run to close out the chapter. Back in the spring, I coded up the chapter in this notebook. At that point, there wasn’t really much cohesion between pandas and statsmodels. You’d end up doing data exploration and munging with pandas, then pulling what you needed out of dataframes into numpy arrays, and passing those arrays to statsmodels. (After writing seemingly needless boilerplate code like X = sm.add_constant(X, prepend = True ...

Machine Learning for Hackers Chapter 3: Naive Bayes Text Classification

I realize I haven’t blogged about the rest of chapter 2 yet. I’ll get back to that, but chapter 3 is on my mind today. If you haven’t seen them yet, IPython notebooks up to chapter 9 are all up in the Github repo. To view them online, you can check the links on this page.

Chapter 3 is about text classification. The authors build a classifier that will identify whether an e-mail is spam or not (“ham”) based on the content of the e-mail’s message. I won’t go into much detail on how the Naive Bayes classifier they use works (beyond what’s evident in the code). The theory is described well in the book and many other places. I’m just going to discuss implementation, assuming you know how the classifier works in theory. The Python code for this project relies heavily on the NLTK (Natural Language Toolkit) package, which is a comprehensive library that includes functions for doing NLP and text analysis, as well as an array of benchmark text corpora to use them on. If you want to go deep into this stuff, two good resources are:
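Even without going into the theory, the core idea fits in a few lines of plain Python. This is a bare-bones sketch with add-one smoothing on a made-up four-document corpus, not the chapter's NLTK pipeline (which trains on the SpamAssassin e-mail corpus):

```python
import math
from collections import Counter

# Tiny labeled corpus (invented); the real chapter uses full e-mail messages
train = [
    ("buy cheap pills now", "spam"),
    ("cheap meds buy now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday?", "ham"),
]

counts = {"spam": Counter(), "ham": Counter()}
for text, label in train:
    counts[label].update(text.split())

def log_score(text, label):
    """log P(label) + sum of log P(word | label), with add-one smoothing."""
    vocab = {w for c in counts.values() for w in c}
    total = sum(counts[label].values())
    prior = math.log(sum(1 for _, l in train if l == label) / len(train))
    return prior + sum(
        math.log((counts[label][w] + 1) / (total + len(vocab)))
        for w in text.split()
    )

def classify(text):
    return max(("spam", "ham"), key=lambda label: log_score(text, label))
```

Working in log-probabilities avoids the numerical underflow you'd hit multiplying many small word probabilities together, a detail the book's implementation also has to deal with.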

Machine Learning for Hackers Chapter 2, Part 1: Summary stats and density estimators

Chapter 2 of MLFH summarizes techniques for exploring your data: determining data types, computing quantiles and other summary statistics, and plotting simple exploratory graphics. I’m not going to replicate it in its entirety; I’m just going to hit some of the more involved or interesting parts. The IPython notebook I created for this chapter, which lives here, contains more code than I’ll present on the blog.

This part’s highlights:

1. Pandas objects, as we’ve seen before, have methods that provide simple summary statistics.
2. The plotting methods in Pandas let you pass parameters to the Matplotlib functions they call. I’ll use this feature to mess around with histogram bins.
3. The gaussian_kde (kernel density estimator) function in scipy.stats.kde provides density estimates similar to R’s density function for Gaussian kernels. The kdensity function in statsmodels.nonparametric.kde provides that and other kernels, but given the state of statsmodels documentation, you would probably only find this function by accident. It’s also substantially slower than gaussian_kde on large data. [Not quite so! See update at the end.]
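Basic usage of gaussian_kde looks like this (on synthetic height-like data of my own, not the chapter's dataset):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(5)
heights = rng.normal(loc=170, scale=10, size=1000)  # hypothetical heights, in cm

kde = gaussian_kde(heights)          # bandwidth picked by Scott's rule by default
grid = np.linspace(130, 210, 200)
density = kde(grid)                  # estimated density evaluated on the grid
```

The resulting curve peaks near the sample mean and integrates to roughly one, just like the output of R's density function.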

Height and weight data

The data analyzed in this chapter are the sexes, heights, and weights of 10,000 ...

Machine Learning for Hackers Chapter 1, Part 5: Trellis graphs

Introduction

This post will wrap up Chapter 1 of MLFH. The only task left is to replicate the authors’ trellis graph on p. 26. The plot is made up of 50 panels, one for each U.S. state, with each panel plotting the number of UFO sightings by month in that state.

The key takeaways from this part are, unfortunately, a bunch of gripes about Matplotlib. Since I can’t transmit, blogospherically, the migraine I got over the two afternoons I spent wrestling with this graph, let me just try to succinctly list my grievances.

1. Out-of-the-box, Matplotlib graphs are uglier than those produced by either lattice or ggplot in R: The default color cycle is made up of dark primary colors. Tick marks and labels are poorly placed in anything but the simplest graphs. Non-data graph elements, like bounding boxes and gridlines, are too prominent and take focus away from the data elements.
2. The API is deeply confusing and difficult to remember. You have various objects that live in various containers. To make adjustments to graphs, you have to remember what container the thing you want to adjust lives in, remember what the object and its property is called, and ...
