posts by Carl

ARM Chapter 5: Logistic models of well-switching in Bangladesh

The logistic regression we ran for chapter 2 of Machine Learning for Hackers was pretty simple. So I wanted to find an example that would dig a little deeper into statsmodels’s capabilities and the power of the patsy formula language.

So, I’m taking an intermission from Machine Learning for Hackers and am going to show an example from Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models (“ARM”). The chapter has a great example of going through the process of building, interpreting, and diagnosing a logistic regression model. We’ll end up with a model with lots of interactions and variable transforms, which is a great showcase for patsy and the statmodels formula API.

Logistic model of well-switching in Bangladesh

Our data are information on about 3,000 respondent households in Bangladesh with wells having an unsafe amount of arsenic. The data record the amount of arsenic in the respondent’s well, the distance to the nearest safe well (in meters), whether that respondent “switched” wells by using a neighbor’s safe well instead of their own, as well as the respondent’s years of education and a dummy variable indicating whether they belong to ...

Machine Learning for Hackers Chapter 2, Part 2: Logistic regression with statsmodels


I last left chapter 2 of Maching Learning for Hackers (a long time ago), running some kernel density estimators on height and weight data (see here. The next part of the chapter plots a scatterplot of weight vs. height and runs a lowess smoother through it. I’m not going to write any more about the lowess function in statsmodels. I’ve discussed some issues with it (i.e. it’s slow) here. And it’s my sense that the lowess API, as it is now in statsmodels, is not long for this world. The code is all in the IPython notebooks in the Github repo and is pretty straightforward.

Patsy and statsmodels formulas

What I want to skip to here is the logistic regressions the authors run to close out the chapter. Back in the spring, I coded up the chapter in this notebook. At this point, there wasn’t really much cohesion between pandas and statsmodels. You’d end up doing data exploration and munging with pandas, then pulling what you needed out of dataframes into numpy arrays, and passing those arrays to statsmodels. (After writing seemingly needless boilerplate code like X = sm.add_constant(X, prepend = True ...

Machine Learning for Hackers Chapter 3: Naive Bayes Text Classification

I realize I haven’t blogged about the rest of chapter 2 yet. I’ll get back to that, but chapter 3 is on my mind today. If you haven’t seen them yet, IPython notebooks up to chapter 9 are all up in the Github repo. To view them online, you can check the links on this page.

Chapter 3 is about text classification. The authors build a classifier that will identify whether an e-mail is spam or not (“ham”) based on the content of the e-mail’s message. I won’t go into much detail on how the Naive Bayes classifier they use works (beyond what’s evident in the code). The theory is described well in the book and many other places. I’m just going to discuss implementation, assuming you know how the classifier works in theory. The Python code for this project relies heavily on the NLTK (Natural Language Toolkit) package, which is a comprehensive library that includes functions for doing NLP and text analysis, as well as an array of benchmark text corpora to use them on. If you want to go deep into this stuff, two good resources are:

Better typography for IPython notebooks

(Warning: ignorant rant coming up)

Like everyone else who’s ever used it, I love the IPython notebook. It’s not only an awesomely productive environment to work in, it’s also the most powerful weapon in the Python evangelist’s arsenal (suck it, Matlab).

I also think it’s not hard to imagine a world where scientific papers are all just literate programs. And the notebook is probably one of the best tools for literate programming around in any language. The intregration of markdown and LaTeX/MathJax into the notebook is just fantastic.

But it does have one weakness as a literate programming tool. The default typography is ugly as sin.

There are several issues, but two major ones are easily fixable.

Long lines

By far the biggest issue is that the text and input cells extend to 100% of the window width. Most people keep their browser windows open wider than is comfortable reading width, so you end up with long hard-to-read lines of text in the markdown cells.

And for the code, it would be nice to have the code cell discourage you from long lines. The variable width cells don’t. I’m an 80-character anal ...

How do you speed up 40,000 weighted least squares calculations? Skip 36,000 of them.

Despite having finished all the programming for Chapter 2 of MLFH a while ago, there’s been a long hiatus since thefirst post on that chapter.


Why the delay? The second part of the code focuses on two procedures: lowess scatterplot smoothing, and logistic regression. When implementing the former in statsmodels, I found that it was running dog slow on the data—in this case a scatterplot of 10,000 height-vs.-weight points. Indeed, for these 10,000 points, lowess, run with the default parameters, required about 23 seconds. After importing modules and defining variables according to my IPython notebook, we can run timeit on the function:

%timeit -n 3 lowess.lowess(heights, weights)

This results in

3 loops, best of 3: 42.6 s per loop

on the machine I’m writing this on  (a Windows laptop with a 2.67 GHz i5 processor; timings are faster, but still in the 30 sec. range on my 2.5 GHz i7 Macbook).

An R user—or really a user of any other statistical package—is going to be confused here. We’re all used to lowess being a relatively instantaneous procedure. It’s an oft-used option for ...

Machine Learning for Hackers Chapter 2, Part 1: Summary stats and density estimators

Chapter 2 of MLFH summarizes techniques for exploring your data: determining data types, computing quantiles and other summary statistics, and plotting simple exploratory graphics. I’m not going to replicate it in its entirety; I’m just going to hit some of the more involved or interesting parts. The IPython notebook I created for this chapter, which lives here, contains more code than I’ll present on the blog.

This part’s highlights:

  1. Pandas objects, as we’ve seen before, have methods that provide simple summary statistics.
  2. The plotting methods in Pandas let you pass parameters to the Matplotlib functions they call. I’ll use this feature to mess around with histogram bins.
  3. The gaussian_kde (kernel density estimator) function in scipy.stats.kde provides density estimates similar to R’s density function for Gaussian kernels. The kdensity function, in statsmodels.nonparametric.kde provides that and other kernels, but given the state of statsmodels documentation, you would probably only find this function by accident. It’s also substantially slower than gaussian_kde on large data. *[Not quite so! See update at the end.]

Height and weight data

The data analyzed in this chapter are the sexes, heights and weights, of 10,000 ...

Machine Learning for Hackers Chapter 1, Part 5: Trellis graphs.


This post will wrap up Chapter 1 of MLFH. The only task left is to replicate the authors’ trellis graph on p. 26. The plot is made up of 50 panels, one for each U.S. state, with each panel plotting the number of UFO sightings by month in that state.

The key takeaways from this part are, unfortunately, a bunch of gripes about Matplotlib. Since I can’t transmit, blogospherically, the migraine I got over the two afternoons I spent wrestling with this graph, let me just try to succinctly list my grievances.

  1. Out-of-the-box, Matplotlib graphs are uglier than those produced by either lattice or ggplot in R: The default color cycle is made up of dark primary colors. Tick marks and labels are poorly placed in anything but the simplest graphs. Non-data graph elements, like bounding boxes and gridlines, are too prominent and take focus away from the data elements.
  2. The API is deeply confusing and difficult to remember. You have various objects that live in various containers. To make adjustments to graphs, you have to remember what container the thing you want to adjust lives in, remember what the object and its property is called, and ...

Machine Learning for Hackers, Chapter 1, Part 4: Data aggregation and reshaping.


In the last part I made some simple summaries of the cleaned UFO data: basic descriptive statistics and historgrams. At the very end, I did some simple data aggregation by summing up the sightings by date, and plotted the resulting time series. In this part, I’ll go further with the aggregation, totalling sightings by state and month.

This takeaway from this part is that Pandas dataframes have some powerful methods for aggregating and manipulating data. I’ll show groupby, reindex, hierarchical indices, and stack and unstack in action.

The shape of data: the long and the wide of it

The first step in aggregating and reshaping data is to figure out the final form you want the data to be in. This form is basically defined by content and shape.

We know what we want the content to be: an entry in the data should give the number of sightings in a state/month combination.

We have two choices for the shape: wide or long. The wide version of this data would have months as the rows and states as the columns; it would be a 248 by 51 table with the number of sigthings as entries. This ...

Shades of Time: I don’t buy it, and that’s why it’s so great.

Over the weekend Drew Conway posted about a data analysis project he’d just completed called Shades of Time. Very briefly, he took a dataset of Time magazine covers from 1923 to March 2012, then used some Python libraries to identify the faces in the covers and identify the skin tone of each face. The result is a really great interactive visualization implemented in d3.js.

From looking at this data, Drew, with some caveats, observes that “it does appear that the variance in skin tones have [sic] changed over time, and in fact the tones are getting darker.” He also notes that there are more faces on covers in later years.

Why I don’t believe it

There’s no real statistical testing done here–no formal quantification how skin-tone representation on covers is changing over time. Instead, I think he’s drawing his conclusion on the vizualization alone, especially the scatterplot in the bottom panel that seems to show more darker tones appearing later in the date (starting in the 70’s, the skin-tone dispersion in his data starts to increase).

He notes that there are difficulties in both identifying faces and skin tones. After going through his ...

Machine Learning for Hackers Chapter 1, Part 3: Simple summaries and plots.


See Part 1 and Part 2 for previous work.

In this part, I’ll replicate the authors’ exploration of the UFO sighting dates via histograms. The key takeaways:

  1. The plotting methods in Pandas are easy and useful.
  2. Unlike R Dates, Python datetimes aren’t compatible with a lot of mathematical operations. We’ll see that you can’t apply quantile or histogram methods to them directly.

Quick data summary methods and datetime complications.

For those playing along at home, I’m at p. 19 of the book. The first thing the authors do here is get a statistical summary of the sighting dates in the data, which are recorded in the DateOccurred variable (which I’ve named date_occurred in my code). This is easy in R using the summary function, which provides the minimum, maximum, and quartiles of the data by default.

Pandas has similar functionality, in a method called describe, which gives the same for numeric variables, plus the count of non-null values and the mean and standard deviation. For example:

s1 = Series(np.random.randn(100))
print s1.describe()

outputs what we’d expect from a series of randomly-generated standard normals:

count 100.000000
mean -0.149274 ...

« Page 2 / 3 »