posts by Carl

Rejection/Feedback: A play in three acts

Act 1

Dear XXXX,

Thank you for coming in to interview with the team last week. Everyone enjoyed speaking with you, but unfortunately it was decided that your background and experience is not an ideal fit for the position. Please be assured that this decision was arrived at after careful and thorough deliberation.

Again, we appreciate your time, and wish you the best of success in your job search.

Sincerely,

Bea Pearson-Handler
Recruiting
FourPaws.com (FourSquare for your Dog!)

Act 2

Hi Bea,

Thank you for the e-mail. I enjoyed meeting with everyone and I’m very sorry to hear the news.

If it would be at all possible, it would be very helpful to me if you could provide me with any feedback regarding my interview or the hiring decision. Really, anything at all would be helpful ...

A Geneva Convention for the Language Wars

I don’t tend to get too sniffy about the quality of discourse on the Internet. I have some appreciation for even the most pointless, uninformed flamewars. (And maybe my take on Web site comments is for another post.) But there’s an increasingly popular genre of article and blog post that’s starting to annoy me a little. You’ve likely read them—they have titles like: “Python is Eating R’s Lunch,” “Why Python is Going to Take Over Data Science,” “Why Python is a Pain in the Ass and Will Never Beat R,” “Why Everyone Will Live on the Moon and Code in Julia in 5 Years,” etc.

This style of article obviously isn’t unique to data analysis languages. It’s a classic nerd flamewar, in the proud tradition of text editor wars and browser wars. Perhaps an added inflammatory agent here is the Data Science hype machine.

And that’s all okay. Go on the Internet and bitch about languages you don’t like, or tell everyone why your preferred one is awesome. That’s what the Internet’s here for. And Lord knows I’ve done it myself.

My only problem is that it ...

Tricked out iterators in Julia

Introduction

I want to spend some time messing about with iterators in Julia. I think they not only provide a familiar and useful entry point into Julia’s type system and dispatch model, but are also interesting in their own right. Clever application of iterators can help to simplify complicated loops, better express their intent, and improve memory usage.

A word of warning about the code here. Much of it isn’t idiomatic Julia, and I wouldn’t necessarily recommend using this style in a serious project. I also can’t speak to its performance vis-à-vis more obvious Julian alternatives. In some cases, the style of the code examples below may help reduce memory usage, but performance is not my main concern. (This may be the first blog post about Julia that’s unconcerned with speed.) Instead, I’m just interested in different ways of expressing iteration problems.

For anyone who’d like to play along at home, there’s an IJulia notebook of this material on GitHub, which can be viewed on nbviewer here.

The Iterator Protocol

What do I mean by iterators? I mean any I in Julia that works on ...

Pardon the dust

Update 9/10/2013: New posts are going up on the blog, but I’m going to keep this post at the top for a while. Consider the site in beta for the moment, and please use the comment section of this post to report any issues. If you’re using IE to try and view the site, I’m sorry. But I’m not that sorry.


Update 9/3/2013: Things should be working reasonably well. A few kinks to work out, and I have to migrate the former site’s comments, but the current site is pretty much ready to go.


This is the new home for my blog, Slender Means. It’s currently in progress: I’m still finishing up the design and fixing weird links and typos from the WordPress-to-Pelican migration.

In the meantime, a more usable version sits at the old home: http://slendrmeans.wordpress.com.

Thanks for visiting!

-c.

Machine Learning for Hackers Chapter 8: Principal Components Analysis

The code for Chapter 8 has been sitting around for a long time now. Let’s blow the dust off and check it out. One thing before we start: explaining PCA well is kinda hard. If any experts reading this feel I’ve described something imprecisely (and have a better description), I’m very open to suggestions.

Introduction

Chapter 8 is about Principal Components Analysis (PCA), which the authors perform on time series of prices for 24 stocks. In very broad terms, PCA is about projecting many real-life, observed variables onto a smaller number of “abstract” variables, the principal components. Principal components are selected to best preserve the variation and correlation of the original variables. For example, if we have 100 variables in our data that are all highly correlated, we can project them down to just a few principal components; i.e., the high correlation between them can be imagined as coming from an underlying factor that drives all of them, with some other, less important factors driving their differences. When variables aren’t highly correlated, more principal components are needed to describe them well.
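
To make that projection idea concrete, here is a minimal sketch (my own illustration, not the chapter’s code) that uses scikit-learn’s PCA on made-up data driven by a single common factor; the chapter does the analogous thing with the stock price series.

import numpy as np
from sklearn.decomposition import PCA

# Made-up data: ten series that all follow one underlying factor,
# plus a little idiosyncratic noise.
rng = np.random.RandomState(0)
factor = rng.normal(size=100)
data = np.column_stack([factor + 0.1 * rng.normal(size=100) for _ in range(10)])

pca = PCA()
scores = pca.fit_transform(data)

# With one dominant factor, the first component captures nearly all the variance.
print(pca.explained_variance_ratio_[0])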

As you might imagine, PCA can be a very effective ...

I’ve seen the best minds of my generation destroyed by Matlab …

(Note: this is very quick and not well thought out. Mostly a conversation starter as opposed to any real thesis on the subject.)

This post is a continuation of a Twitter conversation here, started when John Myles White poked the hornets’ nest. (Python’s nest? Where do Pythons live?)

The gist with John’s code is here.

This isn’t a very thoughtful post. But the conversation was becoming sort of a shootout and my thoughts (half-formed as they are) were a bit longer than a tweet. Essentially, I think the Python performance shootouts—PyPy, Numba, Cython—are missing the point.

The point is, I think, that loops are a crutch. A triply-nested for loop in Julia that increments a counter takes eight lines of code (one to initialize the counter, three for statements, one increment statement, and three end statements). Only one of those lines tells me what the code does.
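
Here’s the flavor of that comparison in Python (my own toy, not John’s benchmark code): a triple loop that increments a counter, and the same count written as a single expression that states what it computes.

from itertools import product

n = 50

# Imperative version: mostly bookkeeping, with one line doing the actual counting.
total = 0
for i in range(n):
    for j in range(n):
        for k in range(n):
            if (i + j + k) % 3 == 0:
                total += 1

# The same count as one expression.
total_expr = sum(1 for i, j, k in product(range(n), repeat=3) if (i + j + k) % 3 == 0)

assert total == total_expr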

But most scientific programmers learned to code in imperative languages and that style of thinking and coding has become natural. I’ve often seen comments like this:

[tweet screenshot: forloop_tweet]

Which I think simply equates readability with familiarity. That isn’t wrong, but it isn’t the whole story.

Anyway, a lot of the ...

Machine Learning for Hackers Chapter 7: Numerical optimization with deterministic and stochastic methods

Introduction

Chapter 7 of Machine Learning for Hackers is about numerical optimization. The authors organize the chapter around two examples. The first is a straightforward least-squares problem like those we’ve already encountered doing linear regressions, and it’s amenable to standard iterative algorithms (e.g., gradient descent). The second has a discrete search space and no obvious derivatives, so it lends itself to a stochastic/heuristic optimization technique (though we’ll see the optimization problem is basically artificial). The first problem gives us a chance to play around with SciPy’s optimization routines. The second has us hand-coding a Metropolis algorithm; that doesn’t show off much new Python, but it’s fun nonetheless.
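
For a flavor of the stochastic half, here is a bare-bones Metropolis-style accept/reject loop for minimizing a function over the integers. This is only a sketch of the idea, not the chapter’s code; the toy objective and the parameter names are mine.

import math
import random

def metropolis_minimize(f, x0, n_steps=10000, temperature=1.0):
    # Random-walk search: always accept improvements, and accept worse
    # moves with a probability that shrinks as they get worse.
    x, fx = x0, f(x0)
    best, best_f = x, fx
    for _ in range(n_steps):
        candidate = x + random.choice([-1, 1])   # propose a neighboring integer
        f_cand = f(candidate)
        if f_cand <= fx or random.random() < math.exp((fx - f_cand) / temperature):
            x, fx = candidate, f_cand
            if fx < best_f:
                best, best_f = x, fx
    return best, best_f

# Toy usage: find the integer minimizing (x - 17)^2.
print(metropolis_minimize(lambda x: (x - 17.0) ** 2, x0=0))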

The notebook for this chapter is at the GitHub repo here, or you can view it online via nbviewer here.

Ridge regression by least-squares

In chapter 6 we estimated LASSO regressions, which add an L1 penalty on the parameters to the OLS loss function. Ridge regression works the same way, but applies an L2 penalty (the sum of squared parameters) instead. Ridge regression is also a somewhat more straightforward optimization problem, since the squared L2 penalty gives us a differentiable loss function.
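
Concretely, the objective is just the sum of squared residuals plus a penalty weight times the sum of squared coefficients, and a general-purpose minimizer will happily chew on it. Here is a rough sketch using scipy.optimize; the data and names are made up for illustration, and the chapter’s notebook has the real version.

import numpy as np
from scipy import optimize

def ridge_loss(beta, X, y, lam):
    # OLS sum of squared errors plus an L2 penalty on the coefficients.
    residuals = y - X.dot(beta)
    return np.sum(residuals ** 2) + lam * np.sum(beta ** 2)

# Made-up data just to show the call.
rng = np.random.RandomState(42)
X = rng.normal(size=(100, 3))
y = X.dot(np.array([1.5, -2.0, 0.5])) + rng.normal(size=100)

fit = optimize.minimize(ridge_loss, x0=np.zeros(3), args=(X, y, 0.1), method='BFGS')
print(fit.x)   # penalized coefficient estimates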

In ...

Machine Learning for Hackers Chapter 6: Regression models with regularization

In my opinion, Chapter 6 is the most important chapter in Machine Learning for Hackers. It introduces the fundamental problem of machine learning: overfitting and the bias-variance tradeoff. And it demonstrates the two key tools for dealing with it: regularization and cross-validation.

It’s also a fun chapter to write in Python, because it lets me play with the fantastic scikit-learn library. scikit-learn is loaded with hi-tech machine learning models, along with convenient “pipeline”-type functions that facilitate the process of cross-validating and selecting hyperparameters for models. Best of all, it’s very well documented.
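
As a taste of those conveniences, here is a bare-bones sketch of chaining preprocessing and a ridge model in a Pipeline and letting GridSearchCV pick the penalty strength by cross-validation. The toy data and parameter grid are mine, and the imports follow the current scikit-learn layout rather than the 2013-era one.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the chapter's examples.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

# Chain scaling and the model, then cross-validate the penalty strength.
pipe = Pipeline([('scale', StandardScaler()), ('ridge', Ridge())])
grid = GridSearchCV(pipe, {'ridge__alpha': [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)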

Fitting a sine wave with polynomial regression

The chapter starts out with a useful toy example—trying to fit a curve to data generated by a sine function over the interval [0, 1] with added Gaussian noise. The natural way to fit nonlinear data like this is with a polynomial function, so that the output y is a function of powers of the input x. But there are two problems with this.

First, we can generate highly correlated regressors by taking powers of x, leading to noisy parameter estimates. The input values x are evenly spaced on the interval [0, 1]. So x and x ...
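
A quick check of that correlation point (illustrative only, not the chapter’s code): on [0, 1], successive powers of x are very nearly collinear.

import numpy as np

# Evenly spaced inputs on [0, 1], as in the chapter's toy example.
x = np.linspace(0, 1, 101)

# Correlations among x, x**2, and x**3: the off-diagonal entries are close to 1.
powers = np.column_stack([x, x ** 2, x ** 3])
print(np.corrcoef(powers, rowvar=False).round(3))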

Machine Learning for Hackers Chapter 5: Linear regression (with categorical regressors)

Introduction

Chapter 5 of Machine Learning for Hackers is a relatively simple exercise in running linear regressions. Therefore, this post will be short, and I’ll only discuss the more interesting regression example, which nicely shows how patsy formulas handle categorical variables.

Linear regression with categorical independent variables

In chapter 5, the authors construct several linear regressions, the last of which is a multivariate regression describing the number of page views of top-viewed web sites. The regression is pretty straightforward, but includes two categorical variables: HasAdvertising, which takes values True or False; and InEnglish, which takes values Yes, No and NA (missing).

If we include these variables in the formula, then patsy/statsmodels will automatically generate the necessary dummy variables. For HasAdvertising, we get a dummy variable equal to one when the value is True. For InEnglish, which takes three values, we get two separate dummy variables, one for Yes and one for No, with the missing value serving as the baseline.

import numpy as np
from statsmodels.formula.api import ols

# top_1k_sites: DataFrame of the top-ranked sites, loaded earlier in the notebook
model = 'np.log(PageViews) ~ np.log(UniqueVisitors) + HasAdvertising + InEnglish'
pageview_fit_multi = ols(model, top_1k_sites).fit()
print pageview_fit_multi.summary()

Results in:

                            OLS Regression Results
==============================================================================
Dep. Variable:      np.log(PageViews)   R-squared:                       0.480
Model:                            OLS   Adj. R-squared:                  0.478
Method:                 Least ...

Machine Learning for Hackers Chapter 4: Priority e-mail ranking

Introduction

I’m not going to write much about this chapter. In my opinion the payoff-to-effort ratio for this project is pretty low. The algorithm for ranking e-mails is pretty straightforward, but seriously flawed. Most of the code in the chapter (and there’s a lot of it) revolves around parsing the text in the files. It’s a good exercise in thinking through feature extraction, but it doesn’t introduce a lot of new ML concepts. And from my perspective, there’s not much opportunity to show off any Python goodness. But I’ll hit a couple of points that are new and interesting.

The complete code is at the GitHub repo here, and you can read the notebook via nbviewer here.

1. Vectorized string methods in pandas. Back in Chapter 1, I groused about the lack of vectorized functions for string or date operations in pandas. If it wasn’t a numpy ufunc, you had to use the pandas map() method. That’s changed a lot over the summer, and since pandas 0.9.0, we can call vectorized string methods.
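
A quick illustration of the .str accessor (a toy example of mine, separate from the chapter’s parsing code):

import pandas as pd

senders = pd.Series(['Alice <alice@example.com>', 'BOB <bob@example.com>', None])

# Vectorized string operations; missing values pass through as NaN.
print(senders.str.lower())
print(senders.str.contains('example.com'))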

For example, here’s the code in my chapter for the program that identifies e-mails that ...
