Despite having finished all the programming for Chapter 2 of MLFH a while ago, there’s been a long hiatus since thefirst post on that chapter.
Why the delay? The second part of the code focuses on two procedures:
lowess scatterplot smoothing, and logistic regression. When implementing
the former in statsmodels, I found that it was running dog slow on
the data—in this case a scatterplot of 10,000 height-vs.-weight points.
Indeed, for these 10,000 points, lowess, run with the default
parameters, required about 23 seconds. After importing modules and
defining variables according to my IPython notebook, we can run
timeit on the function:
%timeit -n 3 lowess.lowess(heights, weights)
This results in
3 loops, best of 3: 42.6 s per loop
on the machine I’m writing this on (a Windows laptop with a 2.67 GHz i5 processor; timings are faster, but still in the 30 sec. range on my 2.5 GHz i7 Macbook).
An R user—or really a user of any other statistical package—is going to be confused here. We’re all used to lowess being a relatively instantaneous procedure. It’s an oft-used option for ...