Slender Means

Rejection/Feedback: A play in three acts

2014-05-07T19:52:00-04:00

Act 1

Dear XXXX,

Thank you for coming in to interview with the team last week. Everyone enjoyed speaking with you, but unfortunately it was decided that your background and experience is not an ideal fit for the position. Please be assured that this decision was arrived at after careful and thorough deliberation.

Again, we appreciate your time, and wish you the best of success in your job search.

Sincerely,

Bea Pearson-Handler
Recruiting
FourPaws.com (FourSquare for your Dog!)

Act 2

Hi Bea,

Thank you for the e-mail. I enjoyed meeting with everyone and I’m very sorry to hear the news.

If it would be at all possible, it would be very helpful to me if you could provide me with any feedback regarding my interview or the hiring decision. Really, anything at all would be helpful and I would very much appreciate it.

Thanks to you and everyone on the team for your time.

Best,
XXXX

Act 3

Dear XXXX,

I’ve included below some notes from the interviewers. Hopefully these will be helpful to you. Again, best of luck in your job search.

Bea Pearson-Handler
Recruiting
FourPaws.com (FourSquare for your Dog!)

9:30-10:30AM “Big Picture” Interview:

Candidate arrived 15 minutes late; claimed he was stuck behind “an Asian driver” on his way to the office.

Questionable personal hygiene.

When asked why he wanted to work at FourPaws, candidate referred to his “20—actually I guess it’s 30—grand” gambling debt, and his bookie, “Jimmy Two-Thumbs.”

When asked where he saw himself in five years, candidate replied “Working at Google.”

10:45AM-12:00PM Case Study Interview:

Candidate was asked how he might improve engagement of female users with certain features of the site. His answer began with “Chicks, am I right?”

When asked why manhole covers were round, candidate replied “Because round is the shape of the Platonic ideal of a hole,” then hedged with “Or, so that fat Con Ed guys can get through them.”

Candidate was asked to estimate the number of piano tuners in Chicago. Candidate responded, “Well, you drop one piano tuner from the middle floor of the building. Then you take the second one and light him on fire at both ends. Wait, what was the question?”

12:00PM - 1:00PM “Casual” Lunch Interview:

Candidate served himself one pork sausage, then generous servings of everything at the Vegan table.

Chewed with mouth open.

When asked what he does in his spare time, candidate said that he was a prolific author of “HGTV slash fiction, especially the Property Brothers.”

Asked if soda was free at the firm. When told yes, asked if we had RC Cola. When told no, asked if we would special order it if he were hired. Seemed strangely adamant on this point; mentioned he would “bring it up in negotiations.”

1:00-2:00PM Technical Interview:

Candidate was given the option to use whiteboard or paper to sketch out code. Candidate said he preferred paper; removed quill pen and inkwell from his shoulder bag.

Candidate was asked to code FizzBuzz. Candidate replied that it was a trick question, since “3 and 5 are prime, so nothing is divisible by them except one.”

Candidate used Bubble Sort to sort an array. When asked if there was a more efficient algorithm, he said yes, but claimed he didn’t want to use it, “because [his] religion forbids it.”

2:00-2:15 HR Wrapup:

Candidate asked if he could be reimbursed for travel expenses, since he drives a 1984 Buick Skylark, “for irony.”

Candidate asked if signing bonuses were typically offered, since he had “a good feeling about the Broncos this year.”

Candidate asked if “that girl at reception, you know, with the shirt,” was single.

☙ Fin ❧

A Geneva Convention for the Language Wars

2014-01-17T14:30:00-05:00

I don’t tend to get too sniffy about the quality of discourse on the Internet. I have some appreciation for even the most pointless, uninformed flamewars. (And maybe my take on Web site comments is for another post.) But there’s an increasingly popular topic of articles and blog posts which is starting to annoy me a little. You’ve likely read them—they have titles like: “Python is Eating R’s Lunch,” “Why Python is Going to Take Over Data Science,” “Why Python is a Pain in the Ass and Will Never Beat R,” “Why Everyone Will Live on the Moon and Code in Julia in 5 Years,” etc.

This style of article obviously isn’t unique to data analysis languages. It’s a classic nerd flamewar, in the proud tradition of text editor wars and browser wars. Perhaps an added inflammatory agent here is the Data Science hype machine.

And that’s all okay. Go on the Internet and bitch about languages you don’t like, or tell everyone why your preferred one is awesome. That’s what the Internet’s here for. And Lord knows I’ve done it myself.

My only problem is that it distracts from more important, more interesting conversations about what’s happening with data analysis languages. Instead of pissing matches and popularity contests, the really interesting phenomenon is how developers are comparing notes, sharing cool innovations, and increasing interoperability. A great example is the IPython notebook. The notebook doesn’t make Python better than other languages; it makes all languages better.

I think it’s a really fascinating time for folks who use and think about computer languages. The last 5 years or so have seen not only the introduction of really cool new languages, but also extraordinary developments in existing ones. I’m psyched about all these languages and I want them all to succeed and get better. Some days I want to code in R, some days in Python. Others in Julia, or Clojure, or F#, or even C++. I don’t want any of them to stagnate or disappear, or be “beaten” by any of the others. And I don’t think that’s happening anyway.

So what’s below is a somewhat tongue-in-cheek list of suggestions for facilitating productive and interesting discussions comparing languages. Many of them are not specific to our little R/Python/Matlab/Julia skirmishes, but apply to lots of different language wars (C++ vs. Java, Python vs. Perl, Ruby vs. Python, Clojure vs. Scala, Haskell vs., I dunno, everybody?). The last section consists of a couple of very general notes about civility. I’m strongly in favor of everyone’s right to be a smug prick on the Internet. But, you know, you should probably try not to be a smug prick on the Internet.

And, please, feel free to offer additions or suggestions in the comments, or in this Gist.

Section 1: Being Aware of Context

§ 1, Article 1

Recognize that languages have goals and communities. It helps to evaluate them in that context. Features that are high priority to you may not be high priority to the majority of users in that language, and vice-versa.

§ 1, Article 2

Recognize that many smart, capable people are very productive in the language you’re slagging. The cool things science and industry are making in the language speak far louder than your casual dismissals of it on a message board.

The same logic goes for language developers. For example, Hadley Wickham is a smart guy and a great programmer; he’s probably not one to waste his time improving a language that’s some irreparably broken dead-end. Same with these guys.

§ 1, Article 3

Recognize that language design is the art of the tradeoff. Don’t complain about a design choice until you understand the logic behind it. In many cases, your preferred design or feature was already considered, and would have led to undesirable outcomes elsewhere. It is helpful and interesting to disagree about how a tradeoff was managed, but do recognize that there was one.

§ 1, Article 4

Distinguish between a feature request and a language critique. If you come to a new language and miss some features of your old language, that’s fine. But that’s not necessarily a failing of the new language.

A living, breathing language is a combination of both its features and its idioms. A feature may not exist because its programmers tend to write code in a way that obviates its need. Sometimes such idioms are crutches to compensate for truly useful features that are missing; other times they are interesting and elegant expressions of a problem that you’re just not accustomed to. Try to spot the difference.

§ 1, Article 5

Pay your dues before dismissing a language. If you gave up on something in a language after finding it too difficult, consider that the problem may be yours. It may not be, but at least consider it.

§ 1, Article 6

Don’t over-sell immature, alpha-version features, no matter how promising they are. Promises don’t cook rice. Sending unsuspecting users to buggy, incomplete libraries just harms your cause in the long run.

Examples:

“Julia has a fast-growing library of packages!” Sure, but less than a handful are close to production quality.
“And now ggplot has been ported to Python!” Not quite yet.

Honest advertising of works-in-progress is encouraged, though. There’s nothing inherently wrong with immature libraries, many of which are fantastic.

§ 1, Article 7

Microbenchmarks are useful for understanding differences between languages and their execution, but are of limited use in pissing contests. No one knows exactly what percentage of the world’s working software is comprised of Fibonacci number calculations, but our best guess is not much.

Section 2: Being Interesting

§ 2, Article 1

Whether one language is going to take over another is not that interesting, nor that meaningful. (When does a language get “taken over?” For Christ’s sakes, there’s still a non-trivial amount of COBOL running out there in the wild.)

Competition is pointless, but comparison is not. Languages are increasingly adopting ideas from each other, building interops with each other, and sharing tooling. Having conversations about this process is far more interesting than running popularity contests.

§ 2, Article 2

Avoid clichéd arguments. They are not necessarily inaccurate, but they are boring.

Examples:

R is a “DSL” or “not a real language” (see Article 3 below); R is “designed by statisticians, not computer scientists.”
“Semantic whitespace in Python sucks.” (Generally, arguments over syntax are boring.)
“Julia doesn’t have as many libraries as ${pretty much anything}.”

In addition to arguments, also avoid clichéd phrases. (See, e.g., “not ready for prime-time.”)

§ 2, Article 3

Supplement abstract terms or subjective impressions with concrete definitions and examples.

Examples of statements that could use concrete support:

“Code in language X is more expressive than language Y.”
“R is a DSL, while Python is a general purpose language.”

Section 3: Being Civil

§ 3, Article 1

Be sure that you can accurately summarize someone’s argument before you start composing your rebuttal.

§ 3, Article 2

You are not so smart that you are entitled to be smug. Some tips:

Nix hyperbolic vocabulary. No one and nothing associated with any language is “stupid,” “dumb,” “crazy”, “broken,” etc.
Use of the word “fail” is strongly discouraged. Use of it as a noun is strictly prohibited.
It is no victory, not even a moral one, to find someone wrong on the internet. Don’t treat it as such. Offer a polite factual correction and allow for the possibility that you’ve misunderstood.

Tricked out iterators in Julia

2014-01-13T15:15:00-05:00

Introduction

I want to spend some time messing about with iterators in Julia. I think they not only provide a familiar and useful entry point into Julia’s type system and dispatch model, they’re also interesting in their own right.1 Clever application of iterators can help to simplify complicated loops, better express their intent, and improve memory usage.

A word of warning about the code here. Much of it isn’t idiomatic Julia and I wouldn’t necessarily recommend using this style in a serious project. I also can’t speak to its performance vis-a-vis more obvious Julian alternatives. In some cases, the style of the code examples below may help reduce memory usage, but performance is not my main concern. (This may be the first blogpost about Julia unconcerned with speed.) Instead, I’m just interested in different ways of expressing iteration problems.

For anyone who’d like to play along at home, there’s an IJulia notebook of this material on Github, which can be viewed on nbviewer here.

The Iterator Protocol

What do I mean by iterators?2 I mean any I in Julia that works on the right hand side of the statement for i = I .... That is, anything you can for-loop over. This includes not only data collections like Arrays, Dicts, and Sets, but also more abstract types like Ranges, as well as what I’ll call “higher order” iterators such as those that result from zip or enumerate functions.

As an equivalent definition, an iterator in Julia is any type that implements the iterator protocol. The iterator protocol is comprised of three methods: start, next, and done. So any type in Julia for which these three methods are defined is an iterator. It might be a dumb iterator or a broken iterator, but it’s an iterator.

Since the for statement works on iterators, and iterators are just a collection of methods, we can define any for loop using calls to those methods.

For example, this simple for loop

arr = [10:-2:1]
for i = arr
    println(i^2)
end

is equivalent to this

state = start(arr);
while !done(arr, state)
    i, state = next(arr, state)
    println(i^2)
end

In this example, the start method provides the initial state of the iterator; the next method returns the value of the array at a given state, as well as what the next state is. Finally, the done method returns true when we’ve gone past the end of the iterator, informing the loop that it should stop.

If you know Python, the idea of the iterator protocol is probably familiar. In Python, any object can be an iterator if it has the methods __iter__ and __next__. But notice the lack of side effects in the Julia implementation: calling start or next on the array has no effect on the array itself. start is basically a constant, always returning the value of the initial state whenever you pass it the same type of iterator. And next doesn’t really increment anything; it’s just a mapping from current state → (value, next state). In general, the iterator itself has no internal state being incremented or changed as you pass through a loop.
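To make the "pure mapping" point concrete, here is a minimal sketch of the same protocol for arrays. The names `istart`, `inext`, `idone`, and `realize` are hypothetical stand-ins defined locally for illustration; they are not the Base functions, so the sketch does not depend on any particular version of Julia.

```julia
# Local stand-ins for the post's start/next/done (hypothetical names, not Base functions).
istart(a::Vector) = 1
inext(a::Vector, i) = (a[i], i + 1)   # state -> (value, next state)
idone(a::Vector, i) = i > length(a)

# Walk any object that implements the three functions, collecting its values.
function realize(itr)
    out = Any[]
    state = istart(itr)
    while !idone(itr, state)
        value, state = inext(itr, state)
        push!(out, value)
    end
    return out
end

arr = [10, 8, 6, 4, 2]
squares = [v^2 for v in realize(arr)]

# The same (iterator, state) pair always maps to the same (value, next state),
# and the array itself is never modified.
repeatable = inext(arr, 2) == inext(arr, 2)
```

All of the bookkeeping lives in the `state` variable inside `realize`; the array is only ever read.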

An iterator’s state

More concretely, what’s the state of an iterator? How the state is defined, and what sequence of states an iterator passes through, depend on the type of the iterator itself. It’s best to look at some examples.

Arrays.

Arrays are very intuitive iterators. They have states that are just integer values from 1 to the length of the array. So start returns 1.

arr = ["one", "two", "three", "four", "five", "six"]
start(arr)

The next mapping is i → i+1, and the value of the iterator at any state i is just arr[i].

for i = 1:4
    println("next(arr, $i) = ", next(arr, i))
end

next(arr, 1) = ("one",2)
next(arr, 2) = ("two",3)
next(arr, 3) = ("three",4)
next(arr, 4) = ("four",5)

If this were a multidimensional array, say 3×2 instead of 6×1, we’d get the same result; iteration would just proceed down the columns of the matrix, since Julia stores arrays in column-major order.

The done method returns true when the state is i = length(arr) + 1. You might think it’d be length(arr), but recall the for-equivalent while loop above. Having done return true at the last index of the array would prevent the loop from executing on the last element. So in our 6-element array, done is true when the state hits 7.

println(done(arr, 6))  # not yet
println(next(arr, 6))

false
("six",7)

done(arr, 7)

true

Ranges

Ranges have states that look similar to those of arrays, except they start at zero.

rng = 11:20  # length 10 range
start(rng) # 0

But the relationship between the current and next state is the same: i → i+1.

for i = [0, 1, 9, 10]
    println("next(rng, $i) = ", next(rng, i))
end

next(rng, 0) = (11,1)
next(rng, 1) = (12,2)
next(rng, 9) = (20,10)
next(rng, 10) = (21,11)

Since we start at zero, the done state is one less than the equivalent array.

done(rng, 10)

true

Unordered collections: Dicts, Sets, etc.

Arrays and ranges have a natural order, so the evolution of state is straightforward. But what about collections such as dictionaries and sets that have no inherent order? As in many languages, such things can be iterated over, but the order of iteration is not easily predictable.

For example, here’s a dictionary:

dictit = {:one => 1, :three => 3, :five => 5, :five => "five!"}

{:one=>1,:three=>3,:five=>"five!"}

The starting state isn’t 0 or 1, as would be natural for an ordered collection.

s0 = start(dictit)

And while next maps state i to state j, the relationship between i and j is not obvious. Here, while the first state is 3, the second is 11, and the rest are similarly weird.

_, s1 = next(dictit, s0)

((:one,1),11)

_, s2 = next(dictit, s1)

((:three,3),13)

_, s3 = next(dictit, s2)

((:five,"five!"),17)

done(dictit, s3)

true

The states, you probably and correctly suspect, are tied to the internal implementation of the dictionary, e.g. how the keys are hashed. So the state doesn’t follow a predictable 1, 2, 3, … order, and what order of elements we get when iterating is essentially unpredictable.

But because for loops handle the iterator’s states for us, we rarely if ever have to worry about the representation of an iterator’s state. The for loop implicitly calls the start, done, and next methods, which do all this bookkeeping for us.

Iterators and Delayed Evaluation

While many iterators are collections of data in memory, like Arrays, Dicts, or Sets, iterators can also represent abstract collections that aren’t held in memory.

Range is a good example. When we iterate over the range 1:10, we get the sequence 1, 2, 3, …, 10. But in memory, this range is comprised of only two integers, 1 and 10. The values in between are only evaluated when we’re looping over it.

From https://github.com/JuliaLang/julia/blob/master/base/range.jl, here’s how a Range’s iterator protocol is defined:

start(r::Ranges) = 0
next{T}(r::Range{T}, i) = (oftype(T, r.start + i*step(r)), i+1)
next{T}(r::Range1{T}, i) = (oftype(T, r.start + i), i+1)
done(r::Ranges, i) = (length(r) <= i)

Notice that the next method calculates the value of the iterator in state i. This is different from an Array iterator, which just reads the element a[i] from memory.

Iterators that exploit delayed evaluation like this can have important performance benefits. If we want to iterate over the integers 1 to 10,000, iterating over an Array means we have to allocate about 80KB to hold it (10,000 eight-byte integers). A Range only requires 16 bytes, the same size as the range 1 to 100,000 or 1 to 100,000,000.
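The arithmetic is easy to check with `sizeof`. This is a small sketch that assumes current Julia, where `1:10_000` is stored as a start/stop pair of `Int`s; the variable names are only illustrative.

```julia
r = 1:10_000
a = collect(r)              # realize every element into an Array

range_bytes = sizeof(r)     # two Int fields: start and stop
array_bytes = sizeof(a)     # one Int per element

# A much longer range costs exactly the same to store.
big_range_bytes = sizeof(1:100_000_000)
```

On a 64-bit machine `sizeof(Int)` is 8, which gives the 16-byte range and the roughly 80KB array.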

Application: Iterating over Fibonacci numbers

Here’s another example of an iterator which computes values on demand, using the next method to do the work. fibit(n) is an iterator over the first n Fibonacci numbers. When the iterator is constructed, it doesn’t calculate all of these numbers. Instead it waits for its next method to be called, providing the next Fibonacci number depending on the current one.

# Iterator produces the first n Fibonacci numbers
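# Declared with `type` (mutable) rather than `immutable`, because `next`
# updates the last2 field in place. As a result each FibIt should be
# iterated only once; call fibit again for a fresh sequence.
type FibIt{T<:Integer}
    last2::(T, T)
    n::Integer
end

fibit(n::Integer) = FibIt((0, 1), n)
# Specify types, e.g. BigInt to prevent overflow.
fibit(n::Integer, T::Type) = FibIt{T}((zero(T), one(T)), n)

Base.start(fi::FibIt) = 1

function Base.next(fi::FibIt, state)
    if state == 1
        # First Fibonacci number; last2 starts out as (0, 1).
        return (fi.last2[2], 2)
    else
        # Slide the window forward: (a, b) -> (b, a + b).
        fi.last2 = (fi.last2[2], sum(fi.last2))
        (fi.last2[2], state + 1)
    end
end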

Base.done(fi::FibIt, state) = state > fi.n

for i = fibit(10)
    print(i, " ")
end

1 1 2 3 5 8 13 21 34 55
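For comparison, here is a sketch of a Fibonacci iterator that keeps the running pair in the iteration state instead of inside the iterator, so the object is never modified and can be walked as many times as you like. It is written in current Julia syntax (`struct`, `where`) instead of the `type`/`immutable` keywords used in this post. `FibSeq`, `istart`, `inext`, `idone`, and `realize` are hypothetical names defined locally, not part of Base.

```julia
# The iterator only records how many numbers to produce.
struct FibSeq{T<:Integer}
    n::Int
end

# State is (current, upcoming, count); nothing is written back to the FibSeq.
istart(::FibSeq{T}) where {T} = (one(T), one(T), 1)
inext(::FibSeq, s) = (s[1], (s[2], s[1] + s[2], s[3] + 1))
idone(f::FibSeq, s) = s[3] > f.n

# Walk the iterator with the three functions above and collect its values.
function realize(itr)
    out = Any[]
    state = istart(itr)
    while !idone(itr, state)
        value, state = inext(itr, state)
        push!(out, value)
    end
    return out
end

fibs = FibSeq{Int}(10)
first_pass = realize(fibs)
second_pass = realize(fibs)          # same object, same sequence again
hundredth = realize(FibSeq{BigInt}(100))[end]
```

The design choice is simply where the pair lives: in the state tuple here, in a field of the iterator above.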

Tasks/Co-routines

This talk of iterators with delayed evaluation may remind Pythonistas of generators. And Julia has a type that is basically equivalent to Python’s generators, called Tasks. A Task is constructed by calling the Task() constructor (or @task macro) on a function with a produce statement, which is similar to Python’s yield.

Instead of using the FibIt type above, we could get an equivalent iterator by defining a Task that produces sequential Fibonacci numbers.

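function fibtask(n::Integer, T::Type)
    a, b = (zero(T), one(T))
    # All produce calls live inside the task body; the loop already emits
    # the leading 1, so no separate produce(1) is needed.
    function _it()
        for i = 1:n
            produce(b)
            a, b = (b, a+b)
        end
    end
    Task(_it)
end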

fibtask(n::Integer) = fibtask(n, Int)

Once we’ve made the task, we get iteration for free.

for i in fibtask(10)
    print(i, " ")
end

1 1 2 3 5 8 13 21 34 55

Whether you create an iterator using a type with the iterator protocol, or by constructing a Task, is up to you. There are pros and cons to each approach. By defining your iterator as a specific type, you can dispatch lots of other functions on it. fibtask, by contrast, returns a plain Task, so defining methods for it means defining methods for all Tasks, which may be undesirable or infeasible. On the other hand, Tasks give you iterators with less code. Below I’ll show an example of an iterator that’s hard to define with the iterator protocol methods, but easy to define as a Task. And of course, Tasks are coroutines, and can be used in those contexts.

Realizing Iterators without loops

So far, we’ve talked about iterators in the context of for loops. We saw that for i = I was a construct for calling I’s start, done and next methods, letting us realize and operate on the values in the iterator.

But there are functions which can take iterators as inputs and implicitly iterate over them to some desired result. This obviates the need for explicit for loops, and can make for cleaner, more functional code. Some examples follow.

`collect` and `reduce`

The collect function takes an iterator input, realizes all its values, and collects them into an array.

collect(fibit(10))

10-element Array{Any,1}:
  1
  1
  2
  3
  5
  8
 13
 21
 34
 55

The reduce function similarly realizes the values of an iterator, but then successively applies an operator to them to give a scalar result.

reduce(+, fibit(10))

That reduce operation is equivalent to the sum function called with an iterator argument.

sum(fibit(10))

In this next line of code, I compute the sum of the reciprocals of the first 10,000 Fibonacci numbers (which should be close to this), using collect to first put them into an array.

sum(1 ./ collect(BigInt, fibit(10_000, BigInt)))

3.359885666243177553172011302918927179688905133731968486495553815325130318996609
e+00 with 256 bits of precision

Comprehensions

The collect function may remind you of an array comprehension, and it is similar, but here we see that array comprehensions don’t work on our iterator:

[f for f = fibit(10)]

no method length(FibIt{Int64})
while loading In[26], in expression starting on line 1
 in anonymous at no file

What’s going on is that the array comprehension wants to allocate an array, then fill it in with the values of the iterator. Since it doesn’t know the iterator’s length (how many values it will produce), it doesn’t know how large an array to allocate.3 We can fix this for our Fibonacci iterator by giving it a length method.

Base.length(it::FibIt) = it.n
[f for f = fibit(10)]

10-element Array{Int64,1}:
  1
  1
  2
  3
  5
  8
 13
 21
 34
 55

Now we can redefine our sum-of-reciprocals using a comprehension instead of collect.

sum([1/f for f = fibit(10_000, BigInt)])

3.359885666243177553172011302918927179688905133731968486495553815325130318996712
e+00 with 256 bits of precision

What if we tried this with our Fibonacci task?

[f for f = fibtask(10)]

no method length(Task)
while loading In[27], in expression starting on line 1
 in anonymous at no file

We get the same issue; Tasks don’t have a length method. The advantage of using the FibIt type is that we can easily define a length method for it. We could give our Fibonacci task a length method only by giving all Tasks a length method, which wouldn’t make sense.

The Iterator Package

When we calculated the sum of the reciprocals of Fibonacci numbers above, we had to realize the values of the Fibonacci iterator before taking the reciprocal, and then sum a collection of all those values. Alternatively, we could have called sum on an iterator that produced 1/x for each Fibonacci number x.

One way to do this would be to create a new iterator type, called ReciprocalFibIt, and give it its own start, next, and done methods. But that feels a little excessive. Wouldn’t it be nicer to be able to construct that iterator from the Fibonacci iterator we already have? Essentially saying, “hey, I want another iterator that gives one over the values of that other iterator.”

That would be an example of what I’ll call a higher-order iterator, which is an iterator constructed from one or more other iterators. zip and enumerate are common examples.

It turns out Julia has a neat little package of useful higher-order iterators called, obviously, Iterators. In the rest of this (already very long) post, I’ll explore some of the things in the package. Pythonistas will notice similarities with the itertools module in the Standard Library.

using Iterators

Imap

The Imap iterator provides us with our wish above: a new iterator that is the result of applying a function to the values of an existing iterator. In the case of our reciprocal Fibonacci numbers, that function is x -> 1/x.

reciprocalfib = imap(x -> 1/x, fibit(10_000, BigInt)) # A new iterator, composed
                                                      # from a FibIt
psi = sum(reciprocalfib) # No collect needed

3.359885666243177553172011302918927179688905133731968486495553815325130318996609
e+00 with 256 bits of precision

So reciprocalfib is itself an iterator, whose values are only realized when it’s passed to the sum function. We didn’t have to allocate any arrays before calling sum as with the collect and comprehension examples above.
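To show how little machinery a map-like higher-order iterator needs, here is a sketch that wraps one iterator in another. This is not the Iterators package's implementation of `imap`; `Mapped`, `lazysum`, and the `istart`/`inext`/`idone` functions are hypothetical names defined locally, written in current Julia syntax.

```julia
# Local protocol stand-ins for arrays (hypothetical names, not Base functions).
istart(a::Vector) = 1
inext(a::Vector, i) = (a[i], i + 1)
idone(a::Vector, i) = i > length(a)

# A higher-order iterator: holds a function and an inner iterator, nothing else.
struct Mapped{F,I}
    f::F
    xs::I
end

# Start and done are delegated; next applies f to whatever the inner iterator yields.
istart(m::Mapped) = istart(m.xs)
function inext(m::Mapped, state)
    value, newstate = inext(m.xs, state)
    return (m.f(value), newstate)
end
idone(m::Mapped, state) = idone(m.xs, state)

# Sum the values one at a time, never building an intermediate array.
function lazysum(itr)
    total = 0.0
    state = istart(itr)
    while !idone(itr, state)
        value, state = inext(itr, state)
        total += value
    end
    return total
end

recips = Mapped(x -> 1 / x, [1, 1, 2, 3, 5, 8])
total = lazysum(recips)
```

The wrapper reuses the inner iterator's state unchanged, which is why it only needs three one-line delegations.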

An IFilter iterator

Since we have a map-like iterator, why not a filter?4 How would it work? Given an iterator that produces values v1, v2, v3, …, the filter iterator would only produce the values that met some predicate, skipping any that didn’t.

This isn’t implemented in the Iterators package (because Base.filter will already do this, see footnote 4). It’s a neat idea, but it turns out to be tricky to define in terms of the iterator protocol. It’s easy with a Task, though.

function ifilter(f::Function, itr)
    function _it()
        for i = itr
            if f(i)
                produce(i)
            end
        end
    end
    Task(_it)
end
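For the curious, the trickiness with the protocol is that the filter cannot know whether it is done without first looking ahead for the next passing value. One way around that, sketched here under the assumption of a look-ahead state, is to have the state carry the value already found. `Filtered`, `advance`, `realize`, and the `istart`/`inext`/`idone` functions are hypothetical local names in current Julia syntax, not part of Base or the Iterators package.

```julia
# Local protocol stand-ins for arrays (hypothetical names, not Base functions).
istart(a::Vector) = 1
inext(a::Vector, i) = (a[i], i + 1)
idone(a::Vector, i) = i > length(a)

struct Filtered{F,I}
    pred::F
    xs::I
end

# Advance the inner iterator from `state` to the next value that passes `pred`.
# Returns (found, value, inner state after that value).
function advance(f::Filtered, state)
    while !idone(f.xs, state)
        value, state = inext(f.xs, state)
        f.pred(value) && return (true, value, state)
    end
    return (false, nothing, state)
end

# The filter's own state is the look-ahead triple produced by `advance`.
istart(f::Filtered) = advance(f, istart(f.xs))
inext(f::Filtered, st) = (st[2], advance(f, st[3]))
idone(f::Filtered, st) = !st[1]

function realize(itr)
    out = Any[]
    state = istart(itr)
    while !idone(itr, state)
        value, state = inext(itr, state)
        push!(out, value)
    end
    return out
end

odds = realize(Filtered(isodd, [1, 2, 3, 4, 5, 6]))
none = realize(Filtered(x -> x > 10, [1, 2, 3]))
```

The Task version above gets the same behavior with far less ceremony, which is the point of this section.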

Application: A list of primes whose digits sum to a prime

Here’s an example of it in action. We’ll begin with a Range iterator from 1 to 1,000. I want to list all of the numbers in that range that (1) are prime and (2) have digits that sum to a prime.

So ifilter takes the predicate test and the original iterator, then produces only those values from the original iterator that pass the test. Turns out there are 89 such primes between 1 and 1,000.

function funnyprimetest(n::Integer)
    sumdigits = sum([parseint(string(c)) for c in string(n)])
    isprime(n) && isprime(sumdigits)
end

collect(ifilter(funnyprimetest, 1:1000))

89-element Array{Any,1}:
   2
   3
   5
   7
  11
  23
  29
  41
  43
  47
  61
  67
  83
   ⋮
 829
 863
 881
 883
 887
 911
 919
 937
 953
 971
 977
 991

Repeat and RepeatForever

Another surprisingly useful iterator is Repeat, which simply produces an object some number of times. Here the iterator is just the string “echo!” five times.

for i = repeated("echo!", 5)
    println(i)
end

echo!
echo!
echo!
echo!
echo!

If we didn’t provide the second argument, the result would be an iterator that goes on infinitely, so its for loop would never terminate. Why would you want that? I’ll show some examples of its use below.

Extension: Repeating impure functions

One thing about the Repeat iterator though, is that the object or value it repeats is fixed at its construction. If you pass it a called function, it will call that function once in the constructor, and then repeatedly return the result of that first call. For pure functions, that’s fine. The first call of sqrt(100) is the same as the second, third, or ten-thousandth call of sqrt(100).

If the function is impure, though, we’ll get undesired results.

for i = repeated(rand(), 10) println(i) end

0.30142748588653046
0.30142748588653046
0.30142748588653046
0.30142748588653046
0.30142748588653046
0.30142748588653046
0.30142748588653046
0.30142748588653046
0.30142748588653046
0.30142748588653046

Here, the rand function was called once in the constructor, and that result was repeated again and again. I’d prefer if I could get 10 separate calls to rand. Here’s one way to get this to work.

Base.next(it::Iterators.Repeat{Function}, state) = it.x(), state - 1
Base.next(it::Iterators.RepeatForever{Function}, state) = it.x(), nothing

# Note the function isn't called in the constructor;
# the `next` function does this.
for i = repeated(rand, 10) println(i) end

0.6621100826024566
0.755346320113107
0.021395943367805037
0.7304018818932669
0.22941680891855865
0.966762896262876
0.13729437119070198
0.028788242666101915
0.5584434146272579
0.09166900954689794

What I’ve done is create new next methods for the Repeat and RepeatForever iterators. When the object of the iterators is a function, the next methods call the function. By passing the iterator an uncalled function object, I avoid the call in the constructor, and defer it to the next method.
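An alternative that avoids redefining methods on the package's own types is a small dedicated iterator that calls its function on every step. This is only a sketch in current Julia syntax; `RepeatCall`, `realize`, `tick`, and the `istart`/`inext`/`idone` functions are hypothetical local names. A counter stands in for `rand` so the difference is easy to see.

```julia
# Calls f afresh on every step, n times in total.
struct RepeatCall{F}
    f::F
    n::Int
end

istart(::RepeatCall) = 1
inext(r::RepeatCall, i) = (r.f(), i + 1)   # the call happens here, once per step
idone(r::RepeatCall, i) = i > r.n

function realize(itr)
    out = Any[]
    state = istart(itr)
    while !idone(itr, state)
        value, state = inext(itr, state)
        push!(out, value)
    end
    return out
end

# An impure function: each call returns the next integer.
counter = Ref(0)
tick() = (counter[] += 1)

eager = fill(tick(), 3)               # tick() is evaluated once, then repeated
lazy = realize(RepeatCall(tick, 3))   # tick is called on every step
```

`eager` repeats a single result, just like `repeated(rand(), 10)` above, while `lazy` sees a new value each time.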

Take and Drop

The Take iterator only iterates over some specified first values of its input iterator. It works well in combination with infinite iterators, like RepeatForever.

randsforever = repeated(rand)

[r for r = take(randsforever, 10)]

10-element Array{Any,1}:
 0.719153
 0.660597
 0.280763
 0.54125 
 0.427029
 0.919311
 0.165029
 0.796911
 0.354417
 0.678271

[r for r = take(randsforever, 20)]

20-element Array{Any,1}:
 0.568741 
 0.614644 
 0.49445  
 0.0942616
 0.518134 
 0.126585 
 0.961748 
 0.698277 
 0.0805089
 0.32351
 0.797422 
 0.513762 
 0.601515 
 0.616174 
 0.460832 
 0.813204 
 0.172391 
 0.444915 
 0.732941 
 0.0550762

The Drop iterator, on the other hand, ignores some specified first values of its input iterator. So, how many values should be printed in this for loop?

for i = drop(take(randsforever, 10_000), 9998)
    println(i)
end

Answer: just the last two, since we take 10,000 random numbers, but drop the first 9,998.

0.26830900957427684
0.5141969888172926

Extension: TakeWhile and TakeUntil

In some cases you may not want to take a fixed number of values from an iterator, but instead you want to take values until some condition is met.

To accomplish this, I’ll create a TakeWhile iterator, which takes values from its input iterator so long as they pass some test.

immutable TakeWhile{I}
    xs::I
    cond::Function
end

takewhile(xs, cond) = TakeWhile(xs, cond)
Base.start(it::TakeWhile) = start(it.xs)

Base.next(it::TakeWhile, state) = next(it.xs, state)

function Base.done(it::TakeWhile, state)
    # Check the underlying iterator first, so next is never called past its end.
    done(it.xs, state) && return true
    i, _ = next(it.xs, state)
    !it.cond(i)
end


tw = takewhile(1:10, x -> x^2 < 25)
collect(tw)

4-element Array{Int64,1}:
 1
 2
 3
 4

Let’s also create a TakeUntil iterator, which takes values until it finds one that passes the test. So the last value produced by this iterator will pass the test and all values before that will have failed.

immutable TakeUntil{I}
    xs::I
    cond::Function
end

takeuntil(xs, cond) = TakeUntil(xs, cond)
Base.start(it::TakeUntil) = start(it.xs), false

function Base.next(it::TakeUntil, state)
    i, s = next(it.xs, state[1])
    i, (s, it.cond(i))
end


function Base.done(it::TakeUntil, state)
    s, iscond = state
    iscond || done(it.xs, s)
end

collect(takeuntil(1:10, x -> x*x >= 25))  # x <= sqrt(25) -> 1:5

5-element Array{Any,1}:
 1
 2
 3
 4
 5

Application: How long does it take a Poisson process to produce a prime number?

As an application of the TakeUntil iterator, here’s an experiment: how many draws do we have to make from a Poisson process until we draw a prime number? For this example, I’ll use a Poisson with mean 5,000.

In the code, we make a Repeat iterator that repeatedly draws from the Poisson. We pass this into takeuntil and this creates an iterator that draws from the Poisson until we find a prime number. While this is happening, we keep track of the number of steps we took through this iterator.

# Draw random integers from a distribution d until you get a prime number.
# Return the number of draws.
function primetime(d, dparams)
    randgen = () -> rand(d(dparams...))
    tu = takeuntil(repeated(randgen), isprime)
    time = 0
    for i = tu
        time += 1
    end
    time
end

primetime_poiss5k = () -> primetime(Poisson, 5000)

What’s the average wait for a prime? Repeating the experiment 10,000 times, we find the average number of draws is between 7 and 8.

mean(repeated(primetime_poiss5k, 10_000))

7.6783

To see the distribution of waiting times, I’ll collect each repetition of the experiment in an array that we can plot.

times = [t::Int for t = repeated(primetime_poiss5k, 10_000)]

Partition

The Partition iterator splits its input iterator into pieces, producing an iterator over iterators. For example, we could use it to partition the Range iterator 1:100 into two iterators, 1:50 and 51:100. We can also make overlapping partitions, for example, 1:50, 2:51, 3:52, etc.

Application: Moving average

One useful application of overlapping partitions is computing moving averages. The following code imports Google’s historical stock price from Yahoo Finance and computes its 60-day moving average.

First, we download the data, creating a 2-D array containing dates, volumes, and prices.

const googdata =
    "http://ichart.finance.yahoo.com/table.csv?s=GOOG&d=0&e=7&f=2014&g=d&a=0&b=7&c=2013&ignore=.csv" |>
        download |>
        open |>
        readall |>
        s -> split(s, "\n") |> 
        a -> map(l -> split(l, ","), a) |>
        a -> filter(l -> contains(l[1], "201"), a) |>
        reverse  # Dates start at most recent, so reverse for chron order.

We then create iterators over the dates and closing prices in the Array. These iteratively extract and parse values from the relevant columns.

dates = imap(r -> date(r[1]), googdata)
close = imap(r -> parsefloat(r[7]), googdata)

Now we can make 60-day sub-period partitions and compute the average of each. Since I’m using imap nothing has been calculated yet. These are all just iterators promising to do work when called.

ma60 = imap(mean, partition(close, 60, 1))

# NB: The Julian way to do this would be
#     [mean(price[i-59:i]) for i = 60:length(price)]

With all these useful iterators defined, I can just collect them into arrays for plotting.

plot(layer(x=collect(dates), y=collect(close), Geom.line),
     layer(x=collect(dates)[60:end], y = collect(ma60), Geom.line),
     Guide.ylabel("Price"),
     Guide.title("GOOG Daily Stock Price 60-Day Moving Avg."))

Groupby

While the Partition iterator makes partitions of specified lengths, the Groupby iterator splits an iterator based on some condition. One caveat is that the input iterator has to be sorted in some way on the groupby condition, so that values with the same condition are adjacent in the iterator.

Application: Do Labor Force figures follow Benford’s Law?

In this example, I’m going to look at Benford’s Law using the Groupby iterator. Benford’s Law posits that the leading digits of numbers in many data sources follow a regular distribution. I’ll use the Groupby iterator to group the data by first digit and check this.

The data I’ll examine is the size of the labor force population in each U.S. county in 2012.

const lfdata = "http://www.bls.gov/lau/laucnty12.txt" |>
                   download |>
                   open |>
                   readall |>
                   s -> split(s, "\r\n") |>
                   a -> filter(x -> length(x) == 125, a) |>  # Rows with data
                   a -> map(x -> strip(x[79:92]), a) |>      # Column w/ LF data
                   a -> map(x -> replace(x, ",", ""), a) |>  # 1,000 -> 1000
                   x -> x[2:end] |>                          # Remove header
                   sort

The analysis is simple with a Groupby iterator. It splits up the data by leading digit, and then I just calculate the frequency of each leading digit in the data by taking the length of each leading-digit group as a share of the total length of the data.

dgroups = groupby(lfdata, s -> s[1]) # Groups by first digit
# Extract the digit from the group members
digits = imap(i -> parseint(string(i[1][1])), dgroups)
# Compute the frequency
frequency = imap(x -> length(x) / length(lfdata), dgroups)

Benford’s Law posits that the frequency of digit d in data should be log₁₀(d+1) - log₁₀(d). This function prints out a table of the observed frequencies next to the expected frequencies per Benford’s Law.
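
As a quick check on that formula (a worked restatement, not part of the original analysis), the expected frequencies telescope to one across the nine digits, and the value for d = 1 matches the 30.10 shown in the table below:

```latex
P(d) = \log_{10}(d+1) - \log_{10}(d) = \log_{10}\!\left(1 + \tfrac{1}{d}\right),
\qquad
\sum_{d=1}^{9} P(d) = \log_{10}(10) - \log_{10}(1) = 1,
\qquad
P(1) = \log_{10} 2 \approx 0.3010 .
```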

benfordcheck = function(obs_freqs, digits)
    pred_freqs = map(d -> log10(d+1) - log10(d), digits)
    println("")
    println("Digit Frequency Compared to Benford's Law")
    println("=========================================")
    println("")
    println("Digit  Observed  Expected  Difference");
    for (d, o, e) in zip(digits, obs_freqs, pred_freqs)
        @printf( "%5d %9.2f %9.2f %11.2f\n", d, 100*o, 100*e, 100*(o-e))
    end

end

We can see the labor force data follows Benford’s Law quite closely.

benfordcheck(frequency, digits)

Digit Frequency Compared to Benford's Law
=========================================

Digit  Observed  Expected  Difference
    1     30.09     30.10       -0.01
    2     16.46     17.61       -1.15
    3     12.02     12.49       -0.48
    4      9.72      9.69        0.03
    5      8.29      7.92        0.37
    6      6.30      6.69       -0.39
    7      6.02      5.80        0.23
    8      5.84      5.12        0.72
    9      5.25      4.58        0.67

To plot the comparison, I can collect the values from our iterators into a DataFrame.

benford_df = DataFrame(# Extract the digit
                       digits = collect(digits),
                       observed = collect(frequency),
                       expected = map(d -> log10(d+1) - log10(d), digits))

Iterate

Though its name might be a little confusing, the Iterate iterator is one of my favorites. It recursively applies a function to a starting value, that is f(...f(f(f(x)))...). I come across applications for it all over the place.

Application: Autoregressive time series processes

One application is producing autoregressive time series processes. An AR(1) process has the form x_{t+1} = p x_t + e_{t+1}, where e is some random noise. If we define the function f(x) = px + e, then x_{t+2} as a function of x_t is f(f(x_t)). Subsequent values can be similarly produced by iteratively applying the function.
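
Written out, two applications of f give the following. This assumes, as the code below does, that every application of f draws a fresh noise term; the general k-step form is just the same substitution repeated.

```latex
x_{t+2} = f(f(x_t)) = p\,(p\,x_t + e_{t+1}) + e_{t+2}
        = p^{2} x_t + p\,e_{t+1} + e_{t+2},
\qquad
x_{t+k} = p^{k} x_t + \sum_{j=1}^{k} p^{\,k-j}\, e_{t+j}.
```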

First the code for the AR(1) function itself, along with a helper function for plotting a realization of the process.

function ar(phi::Float64, sigma::Float64)
    x -> phi * x + sigma * randn()
end

plotar(arseq, title) = plot(x=1:length(arseq), y=arseq, Geom.line,
                            Guide.xlabel("Time"), Guide.ylabel(""),
                            Guide.title(title))

Defining a coefficient and a standard deviation for the random variable, I pass them through a process that creates an iterator that recursively applies the function, starting with a randomly-drawn initial value. Then I collect 250 values of that iterator and plot them.

const ar1coef = 0.9
const ar1sigma = 0.15                                           

(ar1coef, ar1sigma) |>
    x -> apply(ar, x) |>
    f -> iterate(f, ar1sigma*rand()) |>
    i -> collect(take(i, 250)) |>
    s -> plotar(s, "AR(1) Time Series")

This idea can be extended to an AR(p) process, where the current value of x depends on several past values. Whereas the coefficient was a scalar in the AR(1) model, it’s a matrix now, but the formula is otherwise the same.
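
One way to read the matrix version is as the usual companion form, sketched here with the symbols φ_1, ..., φ_p for the coefficients and ε for a standard normal draw (notation chosen for illustration). The state vector stacks the last p values of x; the first row of the matrix applies the coefficients, and the remaining rows just shift the older values down one slot.

```latex
\begin{pmatrix} x_{t+1} \\ x_{t} \\ \vdots \\ x_{t-p+2} \end{pmatrix}
=
\begin{pmatrix}
\phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\
1      & 0      & \cdots & 0          & 0      \\
0      & 1      & \cdots & 0          & 0      \\
\vdots &        & \ddots &            & \vdots \\
0      & 0      & \cdots & 1          & 0
\end{pmatrix}
\begin{pmatrix} x_{t} \\ x_{t-1} \\ \vdots \\ x_{t-p+1} \end{pmatrix}
+
\begin{pmatrix} \sigma\,\varepsilon_{t+1} \\ 0 \\ \vdots \\ 0 \end{pmatrix}
```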

function ar(coeffs::AbstractVector{Float64}, sigma::Float64)
    p = length(coeffs)
    Phi = [coeffs', eye(p)[1:(end-1),:]]
    Sigma = [sigma, zeros(p-1)]
    x -> Phi * x + Sigma .* randn(p)
end

For an example, here are 250 periods simulated from an AR(3) model.

const ar3coeffs = [.9, -.1, -.25]
const ar3sigma = 0.15

(ar3coeffs, ar3sigma) |>
    x -> apply(ar, x) |>
    f -> iterate(f, ar3sigma*rand(3)) |>
    i -> map(first, take(i, 250)) |>
    s -> plotar(s, "AR(3) Time Series")

Conclusion

Most iteration you’ll see in the wild uses simple collections or ranges as the iterator, performing extensive work inside the loop. Sometimes our problem can be better expressed using more complicated iterators whose structure represents the logic of our iteration. One thing to notice in all the examples was that once the iterators were defined, there was very little to do after iterating over them. Typically I was just collecting the iteration values into an array, or reducing them with an operation to a scalar result. We were also able to build the problems in such a way that calculation of values in the iterators was delayed until absolutely necessary.

There are tradeoffs to this sort of style, and much of the stuff in this post was more cute than practical. But it was a fun exploration of how to create types that interact with protocols in Julia. Julia’s type system and dispatch design are very powerful and interesting, and give programmers a lot of flexibility in expressing their problems.

While we’ll see lots of examples of extending Julia’s base functions dispatched on newly-defined types, we won’t see much multiple dispatch, which is an important design feature of Julia. In fact, pretty much everything here could be implemented in a single-dispatch OO language.
Pythonistas may be thinking about the distinction between iterators and iterables. (See, e.g. this Stack Overflow thread.) That distinction doesn’t really apply to Julia. So I won’t use the term iterable here, and I’ll define an iterator in the two ways discussed above: (1) it is valid in a for i = I statement, and (2) it implements the iterator protocol.
This limitation seems to come from the idea that only iterators with known lengths can be counted on to produce multidimensional arrays. This may be changed in future versions of Julia. See, e.g. Issue #550. The collect function uses the push! function to dynamically allocate the array, but collect can only give a 1-D Array output, whereas comprehensions can be multidimensional.
Actually, Julia’s filter function already does this. If you pass that function a predicate or condition function and an iterator, it produces a Filter object that you can then iterate over. This is different from map, which can take an input iterator, but returns the result of mapping the function immediately.

Pardon the dust

2013-09-30T00:00:00-04:00

Update 9/10/2013 New posts are going up on the blog, but I’m going to keep this post at the top for a while. Consider the site in beta for the moment, and please use the comment section of this post to report any issues. If you’re using IE to try and view the site, I’m sorry. But I’m not that sorry.

Update 9/3/2013 Things should be working reasonably well. A few kinks to work out, and I have to migrate the former site’s comments, but the current site is pretty much ready to go.

This is the new home for my blog, Slender Means. It’s currently in-progress, and I’m still finishing up the design, and fixing weird links and typos from the Wordpress to Pelican migration.

In the meantime, a more usable version sits at the old home: http://slendrmeans.wordpress.com.

Thanks for visiting!

-c.

Machine Learning for Hackers Chapter 8: Principal Components Analysis

2013-09-06T17:30:00-04:00

The code for Chapter 8 has been sitting around for a long time now. Let’s blow the dust off and check it out. One thing before we start: explaining PCA well is kinda hard. If any experts reading feel like I’ve described something imprecisely (and have a better description), I’m very open to suggestions.

Introduction

Chapter 8 is about Principal Components Analysis (PCA), which the authors perform on data with time series of prices for 24 stocks. In very broad terms, PCA is about projecting many real-life, observed variables onto a smaller number of “abstract” variables, the principal components. Principal components are selected in order to best preserve the variation and correlation of the original variables. For example, if we have 100 variables in our data, which are all highly correlated, we can project them down to just a few principal components; i.e., the high correlation between them can be imagined as coming from an underlying factor that drives all of them, with some other less important factors driving their differences. When variables aren’t highly correlated, more principal components are needed to describe them well.

As you might imagine, PCA can be a very effective way of dealing with multi-collinearity that crops up in datasets with lots of variables. The downside is that PCA is just a mechanical process that is independent of the phenomenon we’re studying; the “principal components” we find don’t have to have any real-world meaning. They’re just mathematical constructs. Sometimes we can give meaningful interpretations to the principal components by analogizing them to real underlying factors that theoretically drive our data. But this can be tricky, from both a technical and epistemological standpoint.

For the stocks the authors analyze, they ultimately try to reduce their description to a single principal component, which they interpret as a kind of “market-wide” factor, and compare with a broad market index (here the DJIA). This is a not uncommon application of PCA in stock analysis. But they’ve got a technical problem here.

To perform PCA, your data have to have a meaningful covariance matrix (or correlation matrix, but the conditions are equivalent). They analyze stock prices, which are non-stationary time series variables. This means their covariance matrices change with time, so you can’t really estimate a meaningful covariance matrix from a sample of data. Your estimator implicitly assumes the data are stationary, so your estimated covariance matrix is meaningless. If we calculate the stock returns in the data, though, we can do PCA properly, and we’ll see the relationship of the resulting principal component with the broad market index is much cleaner.

If you’re comfortable with PCA already, you don’t really have to worry about the conceptual content of this chapter. If you’re not, my advice is to take this chapter as a decent toy example of where and why one uses PCA, but don’t apply what’s done here elsewhere without learning more first. I’m not going to explain PCA in any detail; I just want to show where PCA functions live in the Python ecosystem and how they work. But, like most machine learning techniques, it shouldn’t be used at home without adult supervision.

As usual the IPython notebook lives at the Github repo here, and can be viewed via nbviewer here.

Stock data munging

The raw data are in a long format, with (no. of stocks) × (no. of days) rows, and three columns (a date, a stock ticker and a price for that ticker on that day). This sort of dataset lends itself to a pandas DataFrame with a hierarchical index—and since there’s only one variable in the data (the price), we’ll squeeze the DataFrame to get a Series. The Dow Jones data, containing just one ticker, is more straightforward.

prices_long = read_csv('data/stock_prices.csv', index_col = [0, 1],
                       parse_dates = True, squeeze = True)
dji_all = read_csv('data/dji.csv', index_col = 0, parse_dates = True,
                   squeeze = True)

With the stock data indexed this way, it’s easy to create a date × ticker DataFrame with prices as entries: we just use unstack.

prices = prices_long.unstack()

Since we’ll ultimately want to perform this analysis with price returns, I’m going to create a similar dataset, just with returns instead of prices (note this will have one less day of data, since we don’t know the return for the first day in the data).

calc_returns = lambda x: np.log(x / x.shift(1))[1:]
returns = prices.apply(calc_returns)

Note I’m using log returns here. Pandas DataFrames have a pct_change method that would provide another way of computing returns.
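
To make the difference between the two kinds of return concrete, here is a minimal sketch on a made-up price series (the numbers are purely illustrative, and plain NumPy arrays stand in for the pandas objects used above). It computes simple percentage returns, which is the quantity pct_change would give, alongside log returns, and checks how they relate.

```python
import numpy as np

# Hypothetical prices for one stock over five days (illustrative only).
prices = np.array([100.0, 102.0, 101.0, 105.0, 104.0])

# Simple (percentage) returns: p_t / p_{t-1} - 1.
simple_returns = prices[1:] / prices[:-1] - 1

# Log returns: log(p_t / p_{t-1}), the same quantity calc_returns computes.
log_returns = np.log(prices[1:] / prices[:-1])

# The two are linked by log_return = log(1 + simple_return).
assert np.allclose(log_returns, np.log1p(simple_returns))

# Log returns add up across days to the whole-period log return.
assert np.isclose(log_returns.sum(), np.log(prices[-1] / prices[0]))
```

For small daily moves the two measures are nearly identical; the additivity of log returns over time is one common reason to prefer them.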

The authors’ PCA strategy here is to extract a “stock index” factor from the stock data by using the first principal component; this is the single principal component that captures the most variation in the underlying data.

The function make_pca_index is going to extract this first principal component, using the PCA function in scikit-learn’s sklearn.decomposition module. This is not the only way to get a PCA in Python; indeed, PCA is mechanically just an eigen-decomposition of the data’s correlation or covariance, so you could do this all from scratch in Numpy. The scikit-learn implementation, though, gives us a convenient PCA object to work with. And as usual with scikit-learn, the documentation is very good.
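
For the curious, the from-scratch route might look roughly like the sketch below. It is not the code used in this post: the data are synthetic (five made-up “stocks” driven by one common factor), and the variable names are only for illustration. It standardizes the columns, eigen-decomposes their correlation matrix, and projects the data onto the top eigenvector to get the first principal component.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic "returns": 500 days x 5 stocks driven by one common factor
# plus idiosyncratic noise. These data are made up for illustration.
factor = rng.normal(size=(500, 1))
data = factor + 0.5 * rng.normal(size=(500, 5))

# Standardize each column (mean zero, variance one).
data_std = (data - data.mean(axis=0)) / data.std(axis=0)

# Correlation matrix of the columns.
corr = np.corrcoef(data_std, rowvar=False)

# eigh is meant for symmetric matrices; sort eigenpairs largest-first.
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The first principal component is the data projected on the top eigenvector.
first_pc = data_std.dot(eigvecs[:, 0])

# Share of total variance captured by each component.
explained_ratio = eigvals / eigvals.sum()
```

Note that an eigenvector is only defined up to sign, so first_pc could just as well come out negated; that is the same ambiguity behind the sign flip in make_pca_index.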

The PCA function works with a covariance or correlation matrix. We’re going to use a correlation matrix here; and our function will just take either stock price or return data, compute its correlation, then find the first principal component of the data. Notice the sign change I do at the end there: the component ended up being negatively related to the data (i.e. when this factor goes down, the data go up, etc.). Principal components are only identified up to sign and scale; hence the problems with interpretation. We’ll make our resulting index a “positive” one by just reversing the sign.

def make_pca_index(data, scale_data = True):
    '''
    Compute the correlation matrix of a set of stock data, and return
    the first principal component.

    By default, the data are scaled to have mean zero and variance one
    before performing the PCA.
    '''
    if scale_data:
        data_std = data.apply(scale)
    else:
        data_std = data
    corrs = np.asarray(data_std.cov())
    pca   = PCA(n_components = 1).fit(corrs)
    # We end up getting a negative value for the index, so we'll reverse
    # the sign to have it be more intuitive.
    mkt_index = -scale(pca.transform(data_std))
    return mkt_index

A PCA index with price data

Let’s copy the authors and make an index from raw price data. Since prices don’t have meaningfully estimated correlations, this isn’t really correct, but it’s useful to compare with what’s in the book.

price_index = make_pca_index(prices)

To see what’s going on, let’s make two plots: a scatter plot of our PCA index with the DJIA, and a time series plot with the two indices. These correspond to figures 8-4 and 8-5 in the book.

plt.figure(figsize = (17, 5))
plt.subplot(121)
plt.plot(price_index, scale(dji), 'k.')
plt.xlabel('PCA index')
plt.ylabel('Dow Jones Index')
ols_fit = OLSreg(scale(dji), price_index)
plt.plot(price_index, ols_fit.fittedvalues, 'r-',
         label = 'R2 = %4.3f' % round(ols_fit.rsquared, 3))
plt.legend(loc = 'upper left')
plt.subplot(122)
plt.plot(dates, price_index, label = 'PCA Price Index')
plt.plot(dates, scale(dji), label = 'DJ Index')
plt.legend(loc = 'upper left')

This actually seems to look okay, and wouldn’t really alert us to any problems if we didn’t know any better. Let’s repeat the exercise with returns, though.

A PCA index with returns data

Since returns are stationary, we can estimate a meaningful correlation matrix, and our PCA will make more sense.

returns_index = make_pca_index(returns)

And the plots:

Looking at these, we see a much more straightforward linear relationship between the returns to the DJIA and the PCA index derived from stock returns.

Explained variance

Since the principal components correspond to the eigenvectors of the correlation matrix, there will be as many of them as there are columns in our data (here 24). As we add components we explain more and more of the original correlation matrix. Once we add all 24, the amount of variation/correlation explained is 100%: all we’ve done is define a rotation of the matrix, so there’s no information lost. But a plot of the cumulative explained variance as we add principal components can help us to see how far and how reliably the data can be reduced.

plt.bar(arange(24) + .5, pca_fit.explained_variance_ratio_.cumsum())
plt.ylim((0, 1))
plt.xlabel('No. of principal components')
plt.ylabel('Cumulative variance explained')
plt.grid(axis = 'y', ls = '-', lw = 1, color = 'white')

Factor loadings

We can also check out the loadings of the principal component across the stocks. What this shows us is how a change in the component relates to the stocks in our data. For example, a component going up might cause half the stock returns in the data to go up and half to go down (it would positively load on some and negatively load on others). We would expect, intuitively, a factor representing “the market,” as we think our first component does, to load on our stocks in the same direction, and roughly the same magnitude. And this is basically what we see.

plt.bar(arange(24), pca_fit.components_[0])

Conclusion

Since PCA is such a widely-used and fundamental technique, it’s important to know how to do it in Python, and the scikit-learn implementation is a good one. Check out the documentation here. Of course, like any statistical technique, PCA can definitely be misused, or at least easily misinterpreted, so handle with care.

I’ve seen the best minds of my generation destroyed by Matlab …

2013-05-11T16:52:00-04:00

(Note: this is very quick and not well thought out. Mostly a conversation starter as opposed to any real thesis on the subject.)

This post is a continuation of a Twitter conversation here, started when John Myles White poked the hornets’ nest. (Python’s nest? Where do Pythons live?)

The gist with John’s code is here.

This isn’t a very thoughtful post. But the conversation was becoming sort of a shootout and my thoughts (half-formed as they are) were a bit longer than a tweet. Essentially, I think the Python performance shootouts—PyPy, Numba, Cython—are missing the point.

The point is, I think, that loops are a crutch. A 3-nested for loop in Julia that increments a counter takes 8 lines of code (1 initialize counter, 3 for statements, 1 increment statement, 3 end statements). Only one of those lines tells me what the code does.

But most scientific programmers learned to code in imperative languages and that style of thinking and coding has become natural. I’ve often seen comments like this:

Which I think simply equates readability with familiarity. That isn’t wrong, but it isn’t the whole story.

Anyway, a lot of the responses to John’s code were showing that, hey, you can get fast loops in Python, with either JITing (PyPy, Numba) or Cython. So here are my thoughts:

1. Cython is great. I’ve used it with great success myself. But Julia gives me fast loops while keeping the dynamic typing; i.e., I’m still writing in Julia. Cython is a manifestation of what the Julia developers call the “two-language problem.” My programmer-productivity happens in the slow, dynamic language, and I swap to a more painful language for critical bottlenecks and glue the two together. Cython is a more pleasant manifestation of the problem, especially since it lets you evolve in an exploratory, piece-meal way from your first language to your second language. But you still end up with code that is nice dynamic-typing and abstractions on the outside; gross static typing and low-level imperative stuff on the inside. (And Cython examples are often clean and simple, but the code can get hairy very quickly.)

2. One of the nice things about the slow for loops in Python and R is that they force you to think about other ways to express your problem. R and Python programmers start thinking about how they can exploit arrays and other ADTs, and higher-order functions to express their problem. Avoiding the loop performance hit is the first reason, but then many of them start to realize they like their code better this way. The adjustment is hard at first, but once you get there, it’s hard to go back.

Forget about the Numpy, PyPy, Cython solutions to John’s problem. I think it’s safe to say his original pure Python code would be considered pretty un-Pythonic, to the extent that’s a thing. Python programmers are discouraged from that style of writing-C-in-Python, for both performance reasons, and conceptual reasons. Python programmers just think the alternatives (e.g. list comprehensions) are more expressive and maintainable. They’re not avoiding for loops because they’re slow: they don’t want to write for loops.

Maybe Julia is the answer to this problem. Since list comprehensions, higher-order-functions (applies, maps, etc.) wrap imperative loops, and Julia loops are fast, then these things can be written in Julia and be fast.

But that requires some thought about how Julia devs want Julia programmers to program. Julia is great and really promising, and it’s got an opportunity to let scientific programmers really raise their game. But I’d hate the big pitch for Julia to be: hey, you can write fast loops! And it would basically become a refuge for people who never learned to properly code R and are fed up with slow loops, or for Matlab guys whose licenses ran out.

Machine Learning for Hackers Chapter 7: Numerical optimization with deterministic and stochastic methods

2013-02-12T18:51:00-05:00

Introduction

Chapter 7 of Machine Learning for Hackers is about numerical optimization. The authors organize the chapter around two examples of optimization. The first is a straightforward least-squares problem like that we’ve encountered already doing linear regressions, and is amenable to standard iterative algorithms (e.g. gradient descent). The second is a problem with a discrete search space, not clearly differentiable, and so lends itself to a stochastic/heuristic optimization technique (though we’ll see the optimization problem is basically artificial). The first problem gives us a chance to play around with Scipy’s optimization routines. The second problem has us hand-coding a Metropolis algorithm; this doesn’t show off much new Python, but it’s fun nonetheless.

The notebook for this chapter is at the GitHub repo here, or you can view it online via nbviewer here.

Ridge regression by least-squares

In chapter 6 we estimated LASSO regressions, which added an L1 penalty on the parameters to the OLS loss-function. The ridge regression works the same way, but applies an L2 penalty to the parameters. The ridge regression is a somewhat more straightforward optimization problem, since the L2 norm we use gives us a differentiable loss function.

In this example, we’ll regress weight on height, similar to chapter 5. We can specify the loss (sum of squared errors) function for the ridge regression with the following function in Python:

y = heights_weights['Weight'].values
Xmat = sm.add_constant(heights_weights['Height'], prepend = True)

def ridge_error(params, y, Xmat, lam):
    '''
    Compute SSE of the ridge regression.
    This is the normal regression SSE, plus the
    L2 cost of the parameters.
    '''
    predicted = np.dot(Xmat, params)
    sse = ((y - predicted) ** 2).sum()
    sse += lam * (params ** 2).sum()

    return sse

The authors use R’s optim function, which defaults to the Nelder-Mead simplex algorithm. This algorithm doesn’t use any gradient or Hessian information to optimize the function. We’ll want to try out some gradient methods, though. Even though the functions for these methods will compute numerical gradients and Hessians for us, for the ridge problem these are easy enough to specify explicitly.

def ridge_grad(params, y, Xmat, lam):
    '''
    The gradient of the ridge regression SSE.
    '''
    grad = np.dot(np.dot(Xmat.T, Xmat), params) - np.dot(Xmat.T, y)
    grad += lam * params
    grad *= 2
    return grad

def ridge_hess(params, y, Xmat, lam):
    '''
    The hessian of the ridge regression SSE.
    '''
    # Differentiating ridge_grad once more keeps its factor of 2.
    return 2 * (np.dot(Xmat.T, Xmat) + np.eye(len(params)) * lam)
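
Hand-coded derivatives are easy to get subtly wrong, so a finite-difference comparison is a cheap sanity check. The sketch below repeats condensed versions of ridge_error and ridge_grad so that it runs on its own; the data are synthetic stand-ins for the heights and weights, and numerical_grad is a hypothetical helper written for this illustration.

```python
import numpy as np

def ridge_error(params, y, Xmat, lam):
    predicted = np.dot(Xmat, params)
    return ((y - predicted) ** 2).sum() + lam * (params ** 2).sum()

def ridge_grad(params, y, Xmat, lam):
    grad = np.dot(np.dot(Xmat.T, Xmat), params) - np.dot(Xmat.T, y)
    return 2 * (grad + lam * params)

def numerical_grad(f, params, eps=1e-6):
    # Central differences, one coordinate at a time.
    grad = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (f(params + step) - f(params - step)) / (2 * eps)
    return grad

# Synthetic data standing in for the heights/weights set (made up).
rng = np.random.RandomState(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y_syn = X.dot(np.array([1.0, 2.0])) + 0.1 * rng.normal(size=50)
p = np.array([0.3, -0.7])

analytic = ridge_grad(p, y_syn, X, 1.0)
numeric = numerical_grad(lambda q: ridge_error(q, y_syn, X, 1.0), p)
```

If analytic and numeric agree to several digits, the gradient is very likely right. scipy.optimize also ships a check_grad function that performs a similar comparison.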

Like the LASSO regressions we worked with in chapter 6, the ridge requires a penalty parameter to weight the L2 cost of the coefficient parameters (called lam in the functions above; lambda is a keyword in Python). The authors assume we’ve already found an appropriate value via cross-validation, and that value is 1.0.

We can now try to minimize the loss function with a couple of different algorithms. First the Nelder-Mead simplex, which should correspond to the authors’ use of optim in R.

# Starting values for the a, b (intercept, slope) parameters
params0 = np.array([0.0, 0.0])

# Nelder-Mead simplex
ridge_fit = opt.fmin(ridge_error, params0, args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)

Optimization terminated successfully.
Current function value: 1612442.197636
Iterations: 117
Function evaluations: 221
Solution: a = -340.565, b = 7.565

Now the Newton conjugate-gradient method. We need to give this function a gradient; the Hessian is optional. First without the Hessian:

ridge_fit = opt.fmin_ncg(ridge_error, params0, fprime = ridge_grad,
                         args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)

Optimization terminated successfully.
Current function value: 1612442.197636
Iterations: 3
Function evaluations: 4
Gradient evaluations: 11
Hessian evaluations: 0
Solution: a = -340.565, b = 7.565

Now supplying the Hessian:

ridge_fit = opt.fmin_ncg(ridge_error, params0, fprime = ridge_grad,
                         fhess = ridge_hess, args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)

Optimization terminated successfully.
Current function value: 1612442.197636
Iterations: 3
Function evaluations: 7
Gradient evaluations: 3
Hessian evaluations: 3
Solution: a = -340.565, b = 7.565

Fortunately, we get the same results for all three methods. Supplying the Hessian to the Newton method shaves some time off, but in this simple application, it’s not really worth coding up a Hessian function (except for fun).
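
Because the ridge loss is quadratic, there is also a closed-form answer to compare the optimizers against: setting the gradient to zero gives (X'X + lam I) params = X'y. The sketch below solves that system on synthetic data (made up for illustration, not the heights and weights); ridge_closed_form is a hypothetical helper name. Like ridge_error above, it penalizes the intercept along with the slope.

```python
import numpy as np

def ridge_closed_form(y, Xmat, lam):
    # Solve (X'X + lam * I) params = X'y, the first-order condition
    # obtained by setting the ridge gradient to zero.
    k = Xmat.shape[1]
    return np.linalg.solve(np.dot(Xmat.T, Xmat) + lam * np.eye(k),
                           np.dot(Xmat.T, y))

# Synthetic data (made up): intercept column plus one regressor.
rng = np.random.RandomState(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y_syn = X.dot(np.array([-3.0, 0.5])) + rng.normal(size=200)

params_hat = ridge_closed_form(y_syn, X, 1.0)

# The gradient of the ridge SSE should vanish at the solution.
grad_at_solution = 2 * (X.T.dot(X.dot(params_hat) - y_syn) + 1.0 * params_hat)
```

An iterative optimizer run on the same data should land on params_hat up to its convergence tolerance.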

Lastly, the BFGS method, supplied with the gradient:

ridge_fit = opt.fmin_bfgs(ridge_error, params0, fprime = ridge_grad,
                          args = (y, Xmat, 1))
print 'Solution: a = %8.3f, b = %8.3f ' % tuple(ridge_fit)

For this simple problem, all of these methods work well. For more complicated problems, there are considerations which would lead you to prefer one over another, or perhaps to use them in combination. There are also several more methods available, some which allow you to solve constrained optimization problems. Check out the very good documentation. Also note that if you’re not into hand-coding gradients, scipy has a function derivative in its misc module that will compute numerical derivatives. In many cases, the functions will do this automatically if you fail to provide a function to their gradient arguments.

Optimizing on sentences with the Metropolis algorithm

The second example in this chapter is a “code-breaking” exercise. We start with a message “here is some sample text”, which we encrypt using a Caesar cipher that shifts each letter in the message to the next letter in the alphabet (with Z going to A). We can represent the cipher (or any cipher) in Python with a dict that maps each letter to its encrypted counterpart.

letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
           'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
           'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
           'y', 'z']

ceasar_cipher = {i: j for i, j in zip(letters, letters[1:] + letters[:1])}
inverse_ceasar_cipher = {ceasar_cipher[k]: k for k in ceasar_cipher}

The inverse_ceasar_cipher dict reverses the cipher, so we can get an original message back from one that’s been encrypted by the Caesar cipher. Based on these structures, let’s make functions that will encrypt and decrypt text.

def cipher_text(text, cipher_dict = ceasar_cipher):
    # Split the string into a list of characters to apply
    # the cipher over.
    strlist = list(text)

    ciphered = ''.join([cipher_dict.get(x) or x for x in strlist])
    return ciphered

def decipher_text(text, cipher_dict = ceasar_cipher):
    # Split the string into a list of characters to apply
    # the decoder over.
    strlist = list(text)

    # Invert the cipher dictionary (k, v) -> (v, k)
    decipher_dict = {cipher_dict[k]: k for k in cipher_dict}

    deciphered = ''.join([decipher_dict.get(x) or x for x in strlist])
    return deciphered
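
As a quick round-trip check of the idea, here is a self-contained sketch in Python 3 (the post's own code is Python 2). The names shift_cipher, inverse_cipher and apply_cipher are stand-ins chosen for this example, not the functions defined above.

```python
import string

letters = list(string.ascii_lowercase)

# Shift-by-one cipher: a -> b, ..., z -> a.
shift_cipher = dict(zip(letters, letters[1:] + letters[:1]))
# Inverting the dict gives the decoder.
inverse_cipher = {v: k for k, v in shift_cipher.items()}

def apply_cipher(text, mapping):
    # Characters without an entry (spaces, punctuation) pass through unchanged.
    return ''.join(mapping.get(ch, ch) for ch in text)

message = 'here is some sample text'
encrypted = apply_cipher(message, shift_cipher)
decrypted = apply_cipher(encrypted, inverse_cipher)
print(encrypted)   # ifsf jt tpnf tbnqmf ufyu
print(decrypted)   # here is some sample text
```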

To decrypt our message, we’ll design a Metropolis algorithm that randomly proposes ciphers, decrypts the message according to the proposed cipher, and sees how probable that message is based on a lexical database of word frequency in Wikipedia.

The following functions are used to generate proposal ciphers for the Metropolis algorithm. The idea is to randomly generate ciphers and see what text they result in. If the text resulting from a proposed cipher is more likely (according to the lexical database) than the text under the current cipher, we accept the proposal. If it’s not, we accept it with a probability that is lower the less likely the resulting text is.

The method of generating new proposals is important. The authors use a method that chooses a key (letter) at random from the current cipher, and swaps it with some other letter. For example, if we start with the Caesar cipher, our proposal might randomly choose to re-map A to N (instead of B). The proposal would then be the same as the Caesar cipher, but with A → N and M → B (since A originally mapped to B and M originally mapped to N). This proposal-generating mechanism is encapsulated in propose_modified_cipher_from_cipher.

This is inefficient in a few ways. First, the letter chosen to modify in the cipher may not even appear in the text, so the proposed cipher won’t modify the text at all and you end up wasting cycles generating a lot of useless proposals. Second, we may end up picking a letter that occurs in a highly likely word, which will increase the probability of generating an inferior proposal.

We’ll suggest another mechanism that, instead of selecting a letter from the current cipher to re-map, will choose a letter from amongst the non-words in the current deciphered text. For example, if our current deciphered text is “hello wqrld”, we will only select amongst {w, q, r, l, d} to modify at random. This minimizes the chances that a modified cipher will turn real words into gibberish and produce less likely text. The function propose_modified_cipher_from_text performs this proposal mechanism.

One way to think of this is that it’s analogous to tuning the variance of the proposal distribution in the typical Metropolis algorithm. If the variance is too low, our algorithm won’t efficiently explore the target distribution. If it’s too high, we’ll end up generating lots of lousy proposals. Our cipher proposal rules can suffer from similar problems.
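
The “hello wqrld” example can be sketched in a few lines of Python 3. The two-word toy_lexicon and the helper gibberish_letters are hypothetical, made up for this illustration; the real lexical database is far larger.

```python
# Hypothetical miniature lexicon: word -> relative frequency.
toy_lexicon = {'hello': 0.6, 'world': 0.4}

def gibberish_letters(deciphered_text, lexicon):
    """Letters that occur in words missing from the lexicon."""
    words = deciphered_text.split()
    non_words = [w for w in words if w not in lexicon]
    # If every word is recognised, fall back to all letters in the text.
    pool = non_words or words
    return sorted(set(''.join(pool)))

candidates = gibberish_letters('hello wqrld', toy_lexicon)
print(candidates)  # ['d', 'l', 'q', 'r', 'w']
```

Letters that only occur in “hello” (h, e, o) are never proposed for re-mapping, so a word that already decodes to something real is less likely to be broken.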

def generate_random_cipher():
    '''
    Randomly generate a cipher dictionary (a one-to-one letter -> letter
    map).
    Used to generate the starting cipher of the algorithm.
    '''
    cipher = []

    input = letters
    output = letters[:]
    random.shuffle(output)

    cipher_dict = {k: v for (k, v) in zip(input, output)}

    return cipher_dict

def modify_cipher(cipher_dict, input, new_output):
    '''
    Swap a single key in a cipher dictionary.

    Old: a -> b, ..., m -> n, ...
    New: a -> n, ..., m -> b, ...
    '''
    decipher_dict = {cipher_dict[k]: k for k in cipher_dict}
    old_output = cipher_dict[input]

    new_cipher = cipher_dict.copy()
    new_cipher[input] = new_output
    new_cipher[decipher_dict[new_output]] = old_output

    return new_cipher

def propose_modified_cipher_from_cipher(text, cipher_dict,
                                        lexical_db = lexical_database):
    '''
    Generates a new cipher by choosing and swapping a key in the
    current cipher.
    '''
    _ = text # Unused
    input = random.sample(cipher_dict.keys(), 1)[0]
    new_output = random.sample(letters, 1)[0]
    return modify_cipher(cipher_dict, input, new_output)

def propose_modified_cipher_from_text(text, cipher_dict,
                                      lexical_db = lexical_database):
    '''
    Generates a new cipher by choosing and swapping a key in the current
    cipher, but only chooses keys that are letters that appear in the
    gibberish words in the current text.
    '''
    deciphered = decipher_text(text, cipher_dict).split()
    # Pool the letters of every word that isn't in the lexical database.
    letters_to_sample = ''.join([t for t in deciphered
                                 if lexical_db.get(t) is None])
    # If every word is a real word, fall back to sampling from all of them.
    letters_to_sample = letters_to_sample or ''.join(set(deciphered))

    input = random.sample(letters_to_sample, 1)[0]
    new_output = random.sample(letters, 1)[0]
    return modify_cipher(cipher_dict, input, new_output)
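
The A → N, M → B example from the text can be checked with a small self-contained sketch (Python 3). The function swap_outputs mirrors the logic of modify_cipher but is rewritten here under a different, made-up name so the block runs on its own.

```python
import string

letters = list(string.ascii_lowercase)
shift_cipher = dict(zip(letters, letters[1:] + letters[:1]))

def swap_outputs(cipher, key, new_output):
    """Re-map `key` to `new_output`, giving the key that used to own
    `new_output` the old output of `key`, so the map stays one-to-one."""
    inverse = {v: k for k, v in cipher.items()}
    old_output = cipher[key]
    proposal = dict(cipher)
    proposal[key] = new_output
    proposal[inverse[new_output]] = old_output
    return proposal

proposal = swap_outputs(shift_cipher, 'a', 'n')
print(proposal['a'], proposal['m'])  # n b

# Exactly two entries changed, and the result is still a permutation.
changed = [k for k in letters if proposal[k] != shift_cipher[k]]
print(changed)  # ['a', 'm']
```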

Next, we need to be able to compute a message’s likelihood (from the lexical database). The log-likelihood of a message is just the sum of the log-likelihoods of each word (one-gram) in the message. If the word is gibberish (i.e., doesn’t occur in the database) it gets a tiny probability, set to floating-point machine epsilon.

def one_gram_prob(one_gram, lexical_db = lexical_database):
    return lexical_db.get(one_gram) or np.finfo(float).eps

def text_logp(text, cipher_dict, lexical_db = lexical_database):
    deciphered = decipher_text(text, cipher_dict).split()
    logp = np.array([math.log(one_gram_prob(w)) for w in
                     deciphered]).sum()
    return logp
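
To get a feel for the scale of the gibberish penalty, here is a toy version in Python 3 using only the standard library. The three-word toy_lexicon and its probabilities are invented for the example; toy_logp is a stand-in for text_logp, applied to already-deciphered text.

```python
import math
import sys

# Hypothetical word probabilities; the real database has many more entries.
toy_lexicon = {'here': 0.002, 'is': 0.03, 'text': 0.001}
EPS = sys.float_info.epsilon  # same value as np.finfo(float).eps

def toy_logp(text, lexicon):
    """Sum of log word probabilities; unknown words get probability EPS."""
    return sum(math.log(lexicon.get(w, EPS)) for w in text.split())

known = toy_logp('here is text', toy_lexicon)
with_gibberish = toy_logp('here is txet', toy_lexicon)
print(known, with_gibberish)

# Each gibberish word costs log(EPS), roughly -36, which swamps the
# contribution of any real word.
print(math.log(EPS))
```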

We can now use these functions in our Metropolis algorithm. Each step in the Metropolis algorithm proposes a cipher, deciphers the text according to the proposal, and computes the log-likelihood of the deciphered message. If the likelihood of the deciphered message is better under the proposal cipher than the current cipher, we definitely accept that proposal for our next step. If not, we only accept the proposal with a probability based on the relative likelihood of the proposal to the current cipher.

I’ll define this function to take an arbitrary proposal function via the proposal_rule argument. So far, this can be one of the two propose_modified_cipher_from_* functions defined above.

def metropolis_step(text, cipher_dict, proposal_rule, lexical_db =
    lexical_database):
    proposed_cipher = proposal_rule(text, cipher_dict)
    lp1 = text_logp(text, cipher_dict)
    lp2 = text_logp(text, proposed_cipher)

    if lp2 > lp1:
        return proposed_cipher
    else:
        a = math.exp(lp2 - lp1)
        x = random.random()
        if x < a:
            return proposed_cipher
        else:
            return cipher_dict
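
The acceptance rule is easy to work through by hand. The sketch below (Python 3) isolates it in a hypothetical helper, accept_probability, assuming a symmetric proposal as the step function above does implicitly. The two log-likelihoods are taken from the results tables later in this post.

```python
import math
import random

def accept_probability(logp_current, logp_proposed):
    """Metropolis acceptance probability, assuming a symmetric proposal."""
    if logp_proposed >= logp_current:
        return 1.0
    return math.exp(logp_proposed - logp_current)

def accept(logp_current, logp_proposed, rng=random):
    """Draw a uniform number and accept if it falls below the probability."""
    return rng.random() < accept_probability(logp_current, logp_proposed)

lp_as = -35.784429   # "were as some simple text"
lp_us = -38.176725   # "were us some simple text"

p_down = accept_probability(lp_as, lp_us)
p_up = accept_probability(lp_us, lp_as)
print(p_down)  # about 0.09: a downhill move is taken roughly 1 time in 11
print(p_up)    # 1.0: an uphill move is always taken
```

This is why the chain occasionally visits “were us some simple text” even after it has found a more likely message.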

To run the algorithm, just wrap the step function inside a loop. There’s no stopping rule for the algorithm, so we have to choose a number of iterations, and hope it’s enough to get us to the optimum. Let’s use 250,000.

message = 'here is some sample text'
ciphered_text = cipher_text(message, ceasar_cipher)
niter = 250000

def metropolis_decipher(ciphered_text, proposal_rule, niter, seed = 4):
    random.seed(seed)
    cipher = generate_random_cipher()

    deciphered_text_list = []
    logp_list = []

    for i in xrange(niter):
        logp = text_logp(ciphered_text, cipher)
        current_deciphered_text = decipher_text(ciphered_text, cipher)

        deciphered_text_list.append(current_deciphered_text)
        logp_list.append(logp)

        cipher = metropolis_step(ciphered_text, cipher, proposal_rule)

    results = DataFrame({'deciphered_text': deciphered_text_list,
                         'logp': logp_list})
    results.index = np.arange(1, niter + 1)
    return results

First let’s look at the authors’ proposal rule. While they managed to get a reasonable decrypted message in about 50,000 iterations, we’re still reading gibberish after 250,000. As they say in the book, their results are an artefact of a lucky seed value.

results0 = metropolis_decipher(ciphered_text,
propose_modified_cipher_from_cipher, niter)
print results0.ix[10000::10000]

               deciphered_text       logp
 10000 kudu of feru fyrvbu hush -86.585205
 20000 wudu of feru fbrkxu hush -87.124919
 30000 kudu of feru fnrbau hush -86.585205
 40000 wudu of feru fmrjiu hush -87.124919
 50000 kudu of feru fyrnbu hush -86.585205
 60000 kudu of feru fxrnvu hush -86.585205
 70000 pudu of feru fvrnlu hush -87.561022
 80000 kudu of feru fvrxgu hush -86.585205
 90000 kudu of feru fbrvtu hush -86.585205
100000 kudu of feru fjrnlu hush -86.585205
110000 kudu of feru fprbju hush -86.585205
120000 kudu of feru fnrjcu hush -86.585205
130000 kudu of feru flrvpu hush -86.585205
140000 puku of feru flrvxu hush -88.028362
150000 kudu of feru fxrviu hush -86.585205
160000 pulu of feru ftrdzu hush -88.323162
170000 wuzu of feru flrxdu hush -89.575925
180000 kudu of feru firamu hush -86.585205
190000 wudu of feru fyrzqu hush -87.124919
200000 wudu of feru fnraxu hush -87.124919
210000 puku of feru fjrnyu hush -88.028362
220000 puku of feru firyau hush -88.028362
230000 pudu of feru fkrcvu hush -87.561022
240000 kudu of feru ftrwzu hush -86.585205
250000 kudu of feru fprxzu hush -86.585205

Now, let’s try the alternative proposal rule, which only chooses letters from gibberish words when it modifies the current cipher to propose a new one. The algorithm doesn’t find the actual message, but it does find a more likely message (according to the lexical database) within 20,000 iterations.

results1 = metropolis_decipher(ciphered_text,
propose_modified_cipher_from_text, niter)
print results1.ix[10000::10000]

                deciphered_text       logp
 10000 were mi isle izlkde text -68.946850
 20000 were as some simple text -35.784429
 30000 were as some simple text -35.784429
 40000 were as some simple text -35.784429
 50000 were as some simple text -35.784429
 60000 were as some simple text -35.784429
 70000 were us some simple text -38.176725
 80000 were as some simple text -35.784429
 90000 were as some simple text -35.784429
100000 were as some simple text -35.784429
110000 were as some simple text -35.784429
120000 were as some simple text -35.784429
130000 were as some simple text -35.784429
140000 were as some simple text -35.784429
150000 were us some simple text -38.176725
160000 were as some simple text -35.784429
170000 were is some sample text -37.012894
180000 were as some simple text -35.784429
190000 were as some simple text -35.784429
200000 were as some simple text -35.784429
210000 were as some simple text -35.784429
220000 were as some simple text -35.784429
230000 were as some simple text -35.784429
240000 were as some simple text -35.784429
250000 were is some sample text -37.012894

The graph below plots the likelihood paths of the algorithm for the two proposal rules. The blue line is the log-likelihood of the original message we’re trying to recover.

Direct calculation of the most likely message

The Metropolis algorithm is kind of pointless for this application. It’s really just jumping around looking for the most likely phrase. But since the log-likelihood of a message is just the sum of the log probabilities of its component words, we just need to look for the most likely word of each length that appears in the ciphered message.

If the message at some point is “fgk tp hpdt”, then, if run long enough, the algorithm should just find the most likely three-letter word, the most likely two-letter word, and the most likely four-letter word. But we can look these up directly.

For example, the message we encrypted is ‘here is some sample text’, which has word lengths 4, 2, 4, 6, 4. What’s the most likely message with these word lengths?

def maxprob_message(word_lens = (4, 2, 4, 6, 4), lexical_db =
lexical_database):
    db_word_series = Series(lexical_db.index)
    db_word_len = db_word_series.str.len()
    max_prob_wordlist = []
    logp = 0.0
    for i in word_lens:
        db_words_i = list(db_word_series[db_word_len == i])
        db_max_prob_word = lexical_db[db_words_i].idxmax()
        logp += math.log(lexical_db[db_words_i].max())
        max_prob_wordlist.append(db_max_prob_word)
    return max_prob_wordlist, logp

maxprob_message()


(['with', 'of', 'with', 'united', 'with'], -25.642396806584493)

So, technically, we should have decoded our message to be “with of with united with” instead of “here is some sample text”. This is not a shining endorsement of this methodology for decrypting messages.

Conclusion

While it was a fun exercise to code up the Metropolis decrypter in this chapter, it didn’t show off any new Python functionality. The ridge problem, while less interesting, showed off some of the optimization algorithms in Scipy. There’s a lot of good stuff in Scipy’s optimize module, and its documentation is worth checking out.

Machine Learning for Hackers Chapter 6: Regression models with regularization

2013-02-08T20:07:00-05:00

In my opinion, Chapter 6 is the most important chapter in Machine Learning for Hackers. It introduces the fundamental problem of machine learning: overfitting and the bias-variance tradeoff. And it demonstrates the two key tools for dealing with it: regularization and cross-validation.

It’s also a fun chapter to write in Python, because it lets me play with the fantastic scikit-learn library. scikit-learn is loaded with hi-tech machine learning models, along with convenient “pipeline”-type functions that facilitate the process of cross-validating and selecting hyperparameters for models. Best of all, it’s very well documented.

Fitting a sine wave with polynomial regression

The chapter starts out with a useful toy example: trying to fit a curve to data generated by a sine function over the interval [0, 1] with added Gaussian noise. The natural way to fit nonlinear data like this is using a polynomial function, so that the output, y, is a function of powers of the input, x. But there are two problems with this.

First, we can generate highly correlated regressors by taking powers of x, leading to noisy parameter estimates. The inputs x are evenly spaced numbers on the interval [0, 1], so x and x² are going to have a correlation over 95%. Similarly for x² and x³. The solution to this is to use orthogonalized polynomial functions: transformations of x that, when summed, result in polynomial functions, but are orthogonal to (and therefore uncorrelated with) each other.
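
Both claims can be checked numerically. The sketch below (Python 3, standard library only) computes the correlations between powers of x, then orthogonalizes the columns with a plain Gram-Schmidt pass. The helpers pearson and orthogonalize are written for this example; the Gram-Schmidt step is an assumption about one way to build such a basis and does not reproduce patsy's Poly contrast, whose scaling differs.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def orthogonalize(columns):
    """Gram-Schmidt (modified variant): strip from each column its
    projection onto the columns already processed."""
    basis = []
    for col in columns:
        resid = list(col)
        for q in basis:
            coef = sum(r * b for r, b in zip(resid, q)) / sum(b * b for b in q)
            resid = [r - coef * b for r, b in zip(resid, q)]
        basis.append(resid)
    return basis

x = [i / 100.0 for i in range(101)]   # 101 evenly spaced points on [0, 1]
x2 = [v ** 2 for v in x]
x3 = [v ** 3 for v in x]

raw_12 = pearson(x, x2)
raw_23 = pearson(x2, x3)
print(raw_12, raw_23)                 # both above 0.95

const, p1, p2, p3 = orthogonalize([[1.0] * 101, x, x2, x3])
orth_12 = pearson(p1, p2)
orth_23 = pearson(p2, p3)
print(orth_12, orth_23)               # essentially zero
```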

Luckily, we can easily calculate these transformations using patsy. The C(x, Poly) transform computes orthonormal polynomial functions of x, and we then extract various orders of the polynomial. So Xpoly[:, :2] selects the 0th- and 1st-order functions, which when summed give us a first-order polynomial (i.e. linear). Similarly, Xpoly[:, :4] gives us the 0th- through 3rd-order functions, which sum up to a cubic polynomial.

sin_data = DataFrame({'x' : np.linspace(0, 1, 101)})
sin_data['y'] = (np.sin(2 * np.pi * sin_data['x']) +
                 np.random.normal(0, 0.1, 101))

x = sin_data['x']
y = sin_data['y']
Xpoly = dmatrix('C(x, Poly)')
Xpoly1 = Xpoly[:, :2]
Xpoly3 = Xpoly[:, :4]
Xpoly5 = Xpoly[:, :6]
Xpoly25 = Xpoly[:, :26]

The problem we encounter now is how to choose what order polynomial to fit to the data. Any data can be fit well (i.e. have a high R²) if we use a high enough order polynomial. But we will start to over-fit our data, capturing noise specific to our sample and leading to poor predictions on new data. The graph below shows the fits to the data of a straight line, a 3rd-order polynomial, a 5th-order polynomial, and a 25th-order polynomial. Notice how the last fit gives us all kinds of degrees of freedom to capture specific datapoints, and the excessive “wiggles” look like we’re fitting to noise.

In machine learning, this problem is solved with regularization—penalizing large parameter estimates in a way that, hopefully, shrinks down the coefficients on all but the most important inputs. Here’s where scikit-learn shines.

Preventing overfitting with regularization

The penalty parameter in a regularized regression is typically found via cross-validation: for each candidate penalty, one repeatedly fits the model on subsets of the data, and the penalty value that gives the best fit across the cross-validation “folds” is chosen. In the book, the authors hand-code a cross-validation scheme, looping over possible penalties and subsets of the data and recording the MSEs.
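
For a sense of what that bookkeeping looks like, here is a minimal hand-rolled sketch in Python 3 using only the standard library. It is not the authors' scheme: to stay short it assumes a single predictor with no intercept, for which the LASSO slope has a closed form (soft-thresholding), and it runs on made-up data. All function names are hypothetical.

```python
import random

def soft_threshold(z, lam):
    """Shrink z toward zero by lam (the LASSO solution operator)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def fit_lasso_1d(xs, ys, lam):
    """Minimise 0.5 * sum((y - b*x)^2) + lam * |b| for a single slope b."""
    z = sum(x * y for x, y in zip(xs, ys))
    return soft_threshold(z, lam) / sum(x * x for x in xs)

def cv_mse(xs, ys, lam, n_folds):
    """Average held-out mean squared error across n_folds folds."""
    n = len(xs)
    fold_mses = []
    for k in range(n_folds):
        test_idx = set(range(k, n, n_folds))
        pairs = list(enumerate(zip(xs, ys)))
        train = [(x, y) for i, (x, y) in pairs if i not in test_idx]
        test = [(x, y) for i, (x, y) in pairs if i in test_idx]
        b = fit_lasso_1d([x for x, _ in train], [y for _, y in train], lam)
        fold_mses.append(sum((y - b * x) ** 2 for x, y in test) / len(test))
    return sum(fold_mses) / n_folds

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]   # made-up data, true slope 2

penalties = [0.0, 0.1, 1.0, 5.0, 20.0, 100.0]
scores = {lam: cv_mse(xs, ys, lam, n_folds=5) for lam in penalties}
best_lam = min(scores, key=scores.get)
print(scores)
print('best penalty:', best_lam)
```

Even in this stripped-down form there are two nested loops and a fair amount of index juggling, which is the part the library functions take off your hands.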

In scikit-learn you can usually automate the cross-validation procedure in one of a couple of ways. Many models have a CV version, or, if not, you can wrap your model in a function like GridSearchCV, which is a convenience function around all the looping and fit-recording entailed in a cross-validation. Here I’ll use the LassoCV function, which performs cross-validation for a LASSO-penalized linear regression.

lasso_model = LassoCV(cv = 15, copy_X = True, normalize = True)
lasso_fit = lasso_model.fit(Xpoly[:, 1:11], y)
lasso_path = lasso_model.score(Xpoly[:, 1:11], y)

The first line sets up the model by specifying some options. The only interesting one here is cv, which specifies how many cross-validation folds to run on each penalty-parameter value. The second line fits the model: here I’m going to run a 10th-order polynomial regression, and let the LASSO penalty shrink away all but the most important orders. Finally, despite its name, lasso_path holds the value returned by score(), which for scikit-learn regressors is the R² of the fitted model on the data passed in; the cross-validated errors that the penalty parameter is supposed to minimize are stored in the fit’s mse_path_ attribute.

After running the fit() method, LassoCV will provide useful output attributes, including the “optimal” penalty parameter, stored in .alpha_. Note that scikit-learn refers to the penalty parameter as alpha, while R’s glmnet, which the authors use to implement the LASSO model, calls it lambda. I’m more accustomed to the penalty parameter being denoted with lambda myself. Note also that glmnet uses alpha elsewhere.

# Plot the average MSE across folds
plt.plot(-np.log(lasso_fit.alphas_),
         np.sqrt(lasso_fit.mse_path_).mean(axis = 1))
plt.ylabel('RMSE (avg. across folds)')
plt.xlabel(r'$-\log(\lambda)$')
# Indicate the lasso parameter that minimizes the average MSE across folds.
plt.axvline(-np.log(lasso_fit.alpha_), color = 'red')

The value of the penalty parameter itself isn’t all that meaningful. So let’s take a look at what the resulting coefficient estimates are when we apply the penalty.

print 'Deg. Coefficient'
print Series(np.r_[lasso_fit.intercept_, lasso_fit.coef_])

Deg. Coefficient
  0    -0.003584
  1    -5.359452
  2     0.000000
  3     4.689958
  4    -0.000000
  5    -0.547131
  6    -0.047675
  7     0.124998
  8     0.133224
  9    -0.171974
 10     0.090685

So the LASSO, after selecting a penalty parameter via cross-validation, results in essentially a 3rd-order polynomial model: y = -5.4x + 4.7x³. This makes sense since, as we saw above, we’d captured the important features of the data by the time we’d fit a 3rd-order polynomial.

Predicting O’Reilly book sales using back-cover descriptions

Next I’ll use the same model to tackle some real data. We have the sales ranks of the top-100 selling O’Reilly books. We’d like to see if we can use the text on the back-cover description of the book to predict its rank. So the output variable is the rank of the book (reversed so that 100 is the top-selling book, and 1 is the 100th best-selling book), while the input variables are all the terms that appear in these 100 books’ back covers. For each book the value of an input variable is the number of times the term appears on its back cover. Many of the input values will be zero (for example, the term “javascript” will occur many times in a book about javascript, but zero times in every other book).

So the matrix of input variables is just our old friend, the term-document matrix. After creating this (using any of the methods described in the posts for chapter 3 or chapter 4), we can just apply LassoCV again.

lasso_model = LassoCV(cv = 10)
lasso_fit = lasso_model.fit(desc_tdm.values, ranks.values)

Because of the size and nature of the input data, this runs pretty slowly (about 3-5 minutes for me). And, because there seems to be no good prediction model to be had here, the model doesn’t always converge. If we do get a convergent run, we find the CV procedure wants us to shrink all the coefficients to zero: no input is worth keeping per the LASSO. (Note that since the x-axis in the graph is -log(penalty), moving left on the axis, towards 0, means more regularization.) This is the same result the authors find.

Logistic regression with cross-validation

With the previous model a bust, the authors regroup and try to fit a simpler output variable: a binary indicator of whether the book is in the top-50 sellers or not. Since they’re modeling a 0/1 outcome, they use a logistic regression. As with the linear models we used above, we can also apply regularizers to logistic regression.

In the book, the authors again code up an explicit cross-validation procedure. The notebook for this chapter has some code that replicates their procedure, but here I’ll discuss a version that uses scikit-learn’s GridSearchCV function, which automates the cross-validation procedure for us. (The term “grid” is a little confusing here, since we’re only optimizing over one variable, the penalty parameter; it is more intuitive in a search over two or more dimensions.)

clf = GridSearchCV(LogisticRegression(C = 1.0, penalty = 'l1'),
c_grid,
score_func = metrics.zero_one_score, cv = n_cv_folds)
clf.fit(trainX, trainy)

We initialize the GridSearchCV procedure by telling it:

What model we’re using: logistic, with a penalty parameter C, initialized at 1.0, using the L1 (LASSO) penalty.
A grid/array of parameter value candidates to search over: here values of C.
A score function to optimize: before we were using the RMSE of the regression, here we’ll use a correct classification rate, given by zero_one_score, in scikit-learn’s metrics module.
The number of cross-validation folds to perform: this is defined elsewhere in the variable n_cv_folds.

Then I fit the model on training data (a random subset of 80). After running this, we can check what value it chose for the penalty parameter, C, and what the in-sample error-rate for this value was.

clf.best_params_, 1.0 - clf.best_score_
({'C': 0.29377144516536563}, 0.375)

And again, let’s plot the error rates against values of C to visualize how regularization affects the model accuracy.

rates = np.array([1.0 - x[1] for x in clf.grid_scores_])
stds = [np.std(1.0 - x[2]) / math.sqrt(n_cv_folds) for x in
clf.grid_scores_]

plt.fill_between(cs, rates - stds, rates + stds, color = 'steelblue',
alpha = .4)
plt.plot(cs, rates, 'o-k', label = 'Avg. error rate across folds')
plt.xlabel('C (regularization parameter)')
plt.ylabel('Avg. error rate (and +/- 1 s.e.)')
plt.legend(loc = 'best')
plt.gca().grid()

After fitting to the training set, we can predict on the test set and see how accurate the model is on new data using the classification_report function.

print metrics.classification_report(testy, clf.predict(testX))

             precision recall f1-score support
          0       0.78   0.44     0.56      16
          1       0.18   0.50     0.27       4
avg / total       0.66   0.45     0.50      20

And the confusion matrix shows we got 9 instances classified correctly (the diagonal), and 11 incorrectly (the off-diagonal).

print ' Predicted'
print ' Class'
print DataFrame(metrics.confusion_matrix(testy, clf.predict(testX)))

Predicted
  Class
  0 1
0 7 9
1 2 2
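
The numbers in the classification report follow directly from this matrix. As a worked check in Python 3 (the helper names precision and recall are defined here, not taken from scikit-learn), with rows as true classes and columns as predicted classes:

```python
# Rows are true classes, columns are predicted classes (as printed above).
confusion = [[7, 9],
             [2, 2]]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(2))
accuracy = correct / total

def precision(cls):
    """Share of the cases predicted as `cls` that really are `cls`."""
    predicted_as_cls = sum(confusion[r][cls] for r in range(2))
    return confusion[cls][cls] / predicted_as_cls

def recall(cls):
    """Share of the true `cls` cases that were predicted as `cls`."""
    return confusion[cls][cls] / sum(confusion[cls])

print(accuracy)                   # 0.45
print(precision(0), recall(0))    # about 0.78 and 0.44
print(precision(1), recall(1))    # about 0.18 and 0.50
```

An accuracy of 45% on a two-class problem is worse than a coin flip, which matches the post's verdict on this model.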

Conclusion

Cross-validation often requires a lot of bookkeeping code. Writing this over and over again for different applications is inefficient and error-prone. So it’s great that scikit-learn has functions that encapsulate the cross-validation process in convenient abstractions/interfaces that do the bookkeeping for you. It also has a wide array of useful, cutting-edge models, and the documentation is not just clear and organized, but also educational: there are lots of examples and exposition that explains how the underlying models work, not just what the API is.

So even though we didn’t build any kick-ass, high-accuracy predictive models here, we did get to explore some fundamental methods in building ML models, and get acquainted with the powerful tools in scikit-learn.

Machine Learning for Hackers Chapter 5: Linear regression (with categorical regressors)

2012-12-28T01:32:00-05:00

Introduction

Chapter 5 of Machine Learning for Hackers is a relatively simple exercise in running linear regressions. Therefore, this post will be short, and I’ll only discuss the more interesting regression example, which nicely shows how patsy formulas handle categorical variables.

Linear regression with categorical independent variables

In chapter 5, the authors construct several linear regressions, the last of which is a multivariate regression describing the number of page views of top-viewed web sites. The regression is pretty straightforward, but includes two categorical variables: HasAdvertising, which takes values True or False; and InEnglish, which takes values Yes, No and NA (missing).

If we include these variables in the formula, then patsy/statsmodels will automatically generate the necessary dummy variables. For HasAdvertising, we get a dummy variable equal to one when the value is True. For InEnglish, which takes three values, we get two separate dummy variables, one for Yes, one for No, with the missing value serving as the baseline.

model = 'np.log(PageViews) ~ np.log(UniqueVisitors) + HasAdvertising + InEnglish'
pageview_fit_multi = ols(model, top_1k_sites).fit()
print pageview_fit_multi.summary()

Results in:

                            OLS Regression Results
==============================================================================
Dep. Variable:      np.log(PageViews)   R-squared:                       0.480
Model:                            OLS   Adj. R-squared:                  0.478
Method:                 Least Squares   F-statistic:                     229.4
Date:                Sat, 24 Nov 2012   Prob (F-statistic):          1.52e-139
Time:                        09:50:25   Log-Likelihood:                -1481.1
No. Observations:                1000   AIC:                             2972.
Df Residuals:                     995   BIC:                             2997.
Df Model:                           4
==========================================================================================
                             coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
Intercept                 -1.9450      1.148     -1.695      0.090        -4.197     0.307
HasAdvertising[T.True]     0.3060      0.092      3.336      0.001         0.126     0.486
InEnglish[T.No]            0.8347      0.209      4.001      0.000         0.425     1.244
InEnglish[T.Yes]          -0.1691      0.204     -0.828      0.408        -0.570     0.232
np.log(UniqueVisitors)     1.2651      0.071     17.936      0.000         1.127     1.403
==============================================================================
Omnibus:                       73.424   Durbin-Watson:                   2.068
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               92.632
Skew:                           0.646   Prob(JB):                     7.68e-21
Kurtosis:                       3.744   Cond. No.                         570.
==============================================================================

If we were going to do this without the formula API, we’d have to explicitly make these dummies. For comparison, here’s what that looks like.

top_1k_sites['LogUniqueVisitors'] = np.log(top_1k_sites['UniqueVisitors'])
top_1k_sites['HasAdvertisingYes'] = np.where(
    top_1k_sites['HasAdvertising'] == 'Yes', 1, 0)
top_1k_sites['InEnglishYes'] = np.where(
    top_1k_sites['InEnglish'] == 'Yes', 1, 0)
top_1k_sites['InEnglishNo'] = np.where(
    top_1k_sites['InEnglish'] == 'No', 1, 0)

linreg_fit = sm.OLS(np.log(top_1k_sites['PageViews']),
                    sm.add_constant(top_1k_sites[['HasAdvertisingYes',
                                                  'LogUniqueVisitors',
                                                  'InEnglishNo',
                                                  'InEnglishYes']],
                                    prepend = True)).fit()
linreg_fit.summary()

Machine Learning for Hackers Chapter 4: Priority e-mail ranking

2012-12-28T00:00:00-05:00

Introduction

I’m not going to write much about this chapter. In my opinion the payoff-to-effort ratio for this project is pretty low. The algorithm for ranking e-mails is pretty straightforward, but in my opinion seriously flawed. Most of the code in the chapter (and there’s a lot of it) revolves around parsing the text in the files. It’s a good exercise in thinking through feature extraction, but it’s not got a lot of new ML concepts. And from my perspective, there’s not much opportunity to show off any Python goodness. But, I’ll hit a couple of points that are new and interesting.

The complete code is at the Github repo here, and you can read the notebook via nbviewer here.

1. Vectorized string methods in pandas. Back in Chapter 1, I groused about lacking vectorized functions for operations on strings or dates in pandas. If it wasn’t a numpy ufunc, you had to use the pandas map() method. That’s changed a lot over the summer, and since pandas 0.9.0, we can call vectorized string methods.

For example, here’s the code from my program for this chapter that identifies e-mails that are part of a thread, by looking for “re:”-like prefixes on the subjects.

reply_pattern   = '(re:|re\[\d\]:)'
fwd_pattern = '(fw:|fw[\d]:)'

def thread_flag(s):
    '''
    Returns True if string s matches the thread patterns.
    If s is a pandas Series, returns a Series of booleans.
    '''
    if isinstance(s, basestring):
        return re.search(reply_pattern, s, flags = re.I) is not None
    else:
        # Pass flags by keyword so it can't be taken for another argument.
        return s.str.contains(reply_pattern, flags = re.I)

def clean_subject(s):
    '''
    Removes all the reply and forward labeling from a
    string (an e-mail subject) s.
    If s is a pandas Series, returns a Series of cleaned
    strings.
    This will help find the initial message in the thread
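    (which won't have any of the reply/forward labeling).
    '''
    # Note: flags must be passed by keyword. The fourth positional argument
    # of re.sub (and the third of .str.replace) is a count, not flags.
    if isinstance(s, basestring):
        s_clean = re.sub(reply_pattern, '', s, flags = re.I)
        s_clean = re.sub(fwd_pattern, '', s_clean, flags = re.I)
        s_clean = s_clean.strip()
    else:
        s_clean = s.str.replace(reply_pattern, '', flags = re.I)
        s_clean = s_clean.str.replace(fwd_pattern, '', flags = re.I)
        s_clean = s_clean.str.strip()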

    return s_clean

In thread_flag, if the input is a pandas series of e-mail subject lines, then the function will use a vectorized string function, called with .str.contains() to see if a pattern matching a reply-type prefix is in the subject. The function will therefore return a pandas series of booleans, that are True for all the subjects that have a reply pattern, and False for all the subjects that don’t.

The function clean_subject, if given a pandas Series input, will use the vectorized string methods .str.replace() and .str.strip() to clean the re- and fwd-like patterns out of the subjects.

Notice there are some differences between the naming of pandas string methods and the base string methods or re module functions that perform similar operations on single strings. For example, there’s no contains function in re; we use re.search(). Similarly .str.replace() does what we’d use re.sub() to do on a single string.
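
For the single-string side of that comparison, here is a small runnable sketch in Python 3 using only the re module. The helper names is_reply and strip_reply, and the sample subjects, are made up for the example.

```python
import re

reply_pattern = r'(re:|re\[\d\]:)'

def is_reply(subject):
    """True if the subject carries a reply-style prefix, ignoring case."""
    return re.search(reply_pattern, subject, flags=re.I) is not None

def strip_reply(subject):
    # flags is passed by keyword here: the fourth positional argument of
    # re.sub is `count`, so a positional re.I would be read as a count.
    return re.sub(reply_pattern, '', subject, flags=re.I).strip()

subjects = ['Re: lunch?', 'RE[2]: lunch?', 'lunch?']
flags_found = [is_reply(s) for s in subjects]
cleaned = [strip_reply(s) for s in subjects]
print(flags_found)   # [True, True, False]
print(cleaned)       # ['lunch?', 'lunch?', 'lunch?']
```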

2. More term-document matrices. In Chapter 3 we built a term-document matrix to extract term-frequency features from a set of e-mails. This chapter has a similar exercise, applied to both e-mail messages and their subjects. In the code for that chapter, I built a TDM function that wrapped the term-document matrix function in the textmining package, adding some options that tried to mimic the tdm function in R’s tm package. I use that same function, tdm_df, here. In the post for that chapter, I lamented that I couldn’t find a decent term-document matrix function for Python. The one in textmining was too barebones and I was surprised there was nothing that fit the bill in NLTK.

In comments to that post, Vishal Goklani pointed me to the CountVectorizer function in scikit-learn (in the sklearn.feature_extraction.text module). Despite the rather generic name, this will give you a TDM from a set of documents, returned in the form of a sparse matrix. Here’s a quick-and-dirty wrapper function that returns a TDM in the form of a pandas DataFrame.

from pandas import DataFrame, Series
from sklearn.feature_extraction.text import CountVectorizer

def sklearn_tdm_df(docs, **kwargs):
    '''
    Create a term-document matrix (TDM) in the form of a pandas DataFrame
    Uses sklearn's CountVectorizer function.

    Parameters
    ----------
    docs: a sequence of documents (files, filenames, or the content) to be
        included in the TDM. See the `input` argument to CountVectorizer.
    **kwargs: keyword arguments for CountVectorizer options.

    Returns
    -------
    tdm_df: A pandas DataFrame with the term-document matrix. Columns are terms,
        rows are documents.
    '''
    # Initialize the vectorizer and get term counts in each document.
    vectorizer = CountVectorizer(**kwargs)
    word_counts = vectorizer.fit_transform(docs)

    # .vocabulary_ is a dict whose keys are the terms in the documents,
    # and whose values are the column indices in the sparse matrix returned
    # by fit_transform()
    vocab = vectorizer.vocabulary_

    # Make a dictionary of Series for each term; convert to DataFrame.
    # Each column is densified first: a sparse column's .data attribute holds
    # only its nonzero entries, which would lose the alignment with documents.
    count_dict = {w: Series(word_counts.getcol(vocab[w]).toarray().ravel())
                  for w in vocab}
    tdm_df = DataFrame(count_dict).fillna(0)
    return tdm_df

# Call the function on e-mail messages. The token_pattern is set so that terms are only
# words with two or more letters (no numbers or punctuation).
# Note: older scikit-learn releases called the decode_error option charset_error.
message_tdm = sklearn_tdm_df(train_df['message'],
                             stop_words = 'english',
                             decode_error = 'ignore',
                             token_pattern = '[a-zA-Z]{2,}')

3. Timezone issues and rank instability. In the book, the authors compute stats measuring how active threads are. This depends on the time-stamps of the messages, which the authors parse out of the e-mail files. They ignore the time-zone information in the time-stamps, and this seems to create some bugs. For example, the following thread has two e-mails:

Name: [sadev] [bug 840] spam_level_char option change/removal
    734    2002-09-06 10:56:23-07:00
    763    2002-09-06 13:56:19-04:00

If you ignore the timezones, it looks like 763 comes three hours after 734. But looking at the timezones, you can see that 734 actually comes four seconds after 763. So this is a far more active thread than the code in the book calculates.
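A quick check of that arithmetic, using only the standard library. The two time-stamps are the ones shown above; everything else is just scaffolding for the comparison.

```python
from datetime import datetime, timedelta, timezone

# The two time-stamps from the thread above, with their UTC offsets.
msg_734 = datetime(2002, 9, 6, 10, 56, 23, tzinfo=timezone(timedelta(hours=-7)))
msg_763 = datetime(2002, 9, 6, 13, 56, 19, tzinfo=timezone(timedelta(hours=-4)))

# Dropping the offsets reproduces the naive comparison.
naive_gap = msg_763.replace(tzinfo=None) - msg_734.replace(tzinfo=None)

# Subtracting timezone-aware datetimes compares the actual instants.
aware_gap = msg_734 - msg_763

print(naive_gap)
print(aware_gap)
```

The naive gap is just under three hours with 763 second, while the timezone-aware gap is four seconds with 734 second.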

This sort of issue has a pretty big effect on the ranks of the messages. The rank is just the product of 5 feature weights (based on sender info, thread activity, and term features). Even though the authors scale the individual feature weights (typically with log-scales), calculating the final rank as a product means you can get big rank differences from what might seem to be practically similar features, even without any bugs. For example, in some cases it doesn’t take a big difference to double a feature’s weight, which then doubles the e-mail’s rank. So it seems to me the ranking procedure in the book is not very stable. This is fine, since it’s just meant to be illustrative, but of course you want to be aware of this issue for a more serious exercise.
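Here is a small sketch of why a product-based rank is so sensitive. The weights are hypothetical numbers chosen for illustration, not values from the book or the notebook.

```python
from functools import reduce
from operator import mul

def rank(weights):
    '''Rank of a message: the product of its feature weights.'''
    return reduce(mul, weights, 1.0)

# Hypothetical weights for two messages that differ in one feature only.
msg_a = [1.5, 2.0, 1.2, 1.1, 1.3]
msg_b = [1.5, 4.0, 1.2, 1.1, 1.3]   # second weight doubled

ratio = rank(msg_b) / rank(msg_a)
print(ratio)
```

Because the weights multiply, a change in any single weight passes through to the rank in full proportion, no matter how similar the other four features are.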

Conclusion

I didn’t go into much detail here. If you’re interested in seeing a lot of Python and pandas text parsing in action, definitely check out the code.

ARM Chapter 5: Logistic models of well-switching in Bangladesh

2012-12-22T19:10:00-05:00

The logistic regression we ran for chapter 2 of Machine Learning for Hackers was pretty simple. So I wanted to find an example that would dig a little deeper into statsmodels’s capabilities and the power of the patsy formula language.

So, I’m taking an intermission from Machine Learning for Hackers and am going to show an example from Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models (“ARM”). The chapter has a great example of going through the process of building, interpreting, and diagnosing a logistic regression model. We’ll end up with a model with lots of interactions and variable transforms, which is a great showcase for patsy and the statsmodels formula API.

Logistic model of well-switching in Bangladesh

Our data cover about 3,000 respondent households in Bangladesh whose wells have an unsafe amount of arsenic. The data record the amount of arsenic in the respondent’s well, the distance to the nearest safe well (in meters), whether that respondent “switched” wells by using a neighbor’s safe well instead of their own, as well as the respondent’s years of education and a dummy variable indicating whether they belong to a community association.

  switch arsenic      dist assoc educ
1      1    2.36 16.826000     0    0
2      1    0.71 47.321999     0    0
3      0    2.07 20.966999     0   10
4      1    1.15 21.486000     0   12
5      1    1.10 40.874001     1   14
...

Our goal is to model the well-switching decision. Since it’s a binary variable (1 = switch, 0 = no switch), we’ll use logistic regression.

The IPython notebook is at the Github repo here, and you can go here to view it on nbviewer. The analysis follows ARM chapter 5.4.

Model 1: Distance to a safe well

For our first pass, we’ll just use the distance to the nearest safe well. Since the distance is recorded in meters, and the effect of one meter is likely to be very small, we can get nicer model coefficients if we scale it. Instead of creating a new scaled variable, we’ll just do it in the formula description using the I() function.

# Released versions of statsmodels name the dataframe argument `data`
model1 = logit('switch ~ I(dist / 100.)', data = df).fit()
print(model1.summary())

Optimization terminated successfully.
Current function value: 2038.118913
Iterations 4
Logit Regression Results

==============================================================================
Dep. Variable: switch No. Observations: 3020
Model: Logit Df Residuals: 3018
Method: MLE Df Model: 1
Date: Sat, 22 Dec 2012 Pseudo R-squ.: 0.01017
Time: 13:05:25 Log-Likelihood: -2038.1
converged: True LL-Null: -2059.0
LLR p-value: 9.798e-11

==================================================================================
coef std err z P>|z| [95.0% Conf. Int.]

----------------------------------------------------------------------------------
Intercept 0.6060 0.060 10.047 0.000 0.488 0.724
I(dist / 100.) -0.6219 0.097 -6.383 0.000 -0.813 -0.431
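To read these coefficients on the probability scale, we can push the linear predictor through the inverse logit. This is a minimal sketch that hard-codes the two coefficients from the summary above; the distances chosen are arbitrary illustration points.

```python
import math

def inv_logit(x):
    '''Map a value on the log-odds scale to a probability.'''
    return 1.0 / (1.0 + math.exp(-x))

# Coefficients from the model1 summary above.
intercept = 0.6060
b_dist100 = -0.6219   # per 100 meters

# Predicted probability of switching at 0, 100, and 200 meters.
probs = {d: inv_logit(intercept + b_dist100 * d / 100.0) for d in (0, 100, 200)}
for d, p in sorted(probs.items()):
    print('{0:3d} m: {1:.2f}'.format(d, p))
```

So a household right next to a safe well has a predicted switching probability of roughly 65%, which falls to about 50% at 100 meters.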

Let’s plot this model. We’ll want to jitter the switch data, since it’s all 0/1 and will over-plot.

Another way to look at this is to plot the densities of distance for switchers and non-switchers. We expect the distribution of switchers to have more mass over short distances and the distribution of non-switchers to have more mass over long distances.

Model 2: Distance to a safe well and the arsenic level of own well

Next, let’s add the arsenic level as a regressor. We’d expect respondents with higher arsenic levels to be more motivated to switch.

model2 = logit('switch ~ I(dist / 100.) + arsenic', data = df).fit()
print(model2.summary())



Optimization terminated successfully.
Current function value: 1965.334134
Iterations 5
Logit Regression Results

==============================================================================
Dep. Variable: switch No. Observations: 3020
Model: Logit Df Residuals: 3017
Method: MLE Df Model: 2
Date: Sat, 22 Dec 2012 Pseudo R-squ.: 0.04551
Time: 13:05:29 Log-Likelihood: -1965.3
converged: True LL-Null: -2059.0
LLR p-value: 1.995e-41

==================================================================================
coef std err z P>|z| [95.0% Conf. Int.]

----------------------------------------------------------------------------------
Intercept 0.0027 0.079 0.035 0.972 -0.153 0.158
I(dist / 100.) -0.8966 0.104 -8.593 0.000 -1.101 -0.692
arsenic 0.4608 0.041 11.134 0.000 0.380 0.542

==================================================================================

Which is what we see. The coefficients are what we’d expect: the farther to a safe well, the less likely a respondent is to switch, but the higher the arsenic level in their own well, the more likely.

Marginal Effects

To see the effect of these on the probability of switching, let’s calculate the marginal effects at the mean of the data.

# In released versions of statsmodels, marginal effects come from get_margeff()
model2.get_margeff(at = 'mean').margeff
array([-0.21806505, 0.11206108])

So, for the mean respondent, an increase of 100 meters to the nearest safe well is associated with a 22 percentage point lower probability of switching, while an increase of 1 in the arsenic level is associated with an 11 percentage point higher probability of switching.
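For a logit model, the marginal effect of a regressor at a point is its coefficient times p(1 - p), where p is the predicted probability at that point. The sketch below applies that formula by hand. The coefficients come from the model 2 summary, but the sample means are approximate values assumed for illustration (about 48 meters and an arsenic level of about 1.66), not computed from the data here.

```python
import math

# Coefficients from the model2 summary above.
coefs = {'Intercept': 0.0027, 'dist100': -0.8966, 'arsenic': 0.4608}

# ASSUMED, approximate sample means; illustrative only.
means = {'dist100': 0.48, 'arsenic': 1.66}

# Predicted probability of switching for the "mean respondent".
linear_pred = coefs['Intercept'] + sum(coefs[k] * means[k] for k in means)
p_mean = 1.0 / (1.0 + math.exp(-linear_pred))

# d p / d x_k = beta_k * p * (1 - p), evaluated at the means.
marginal = {k: coefs[k] * p_mean * (1.0 - p_mean) for k in means}
print(marginal)
```

With these assumed means the hand calculation lands within rounding distance of the two marginal effects reported above.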

Class separability

To get a sense of how well this model might classify switchers and non-switchers, we can plot each class of respondent in (distance-arsenic)-space. We don’t see very clean separation, so we’d expect the model to have a fairly high error rate. But we do notice that the short-distance/high-arsenic region of the graph is mostly composed of switchers, and the long-distance/low-arsenic region is mostly composed of non-switchers.

Model 3: Adding an interaction

It’s sensible that distance and arsenic would interact in the model. In other words, the effect of an additional 100 meters on your decision to switch would be affected by how much arsenic is in your well.

Again, we don’t have to pre-compute an explicit interaction variable. We can just specify an interaction in the formula description using the : operator.

model3 = logit('switch ~ I(dist / 100.) + arsenic + I(dist / 100.):arsenic',
               data = df).fit()
print(model3.summary())


Optimization terminated successfully.
Current function value: 1963.814202
Iterations 5
Logit Regression Results

==============================================================================
Dep. Variable: switch No. Observations: 3020
Model: Logit Df Residuals: 3016
Method: MLE Df Model: 3
Date: Sat, 22 Dec 2012 Pseudo R-squ.: 0.04625
Time: 13:05:33 Log-Likelihood: -1963.8
converged: True LL-Null: -2059.0
LLR p-value: 4.830e-41

==========================================================================================
coef std err z P>|z| [95.0% Conf. Int.]

------------------------------------------------------------------------------------------
Intercept -0.1479 0.118 -1.258 0.208 -0.378 0.083
I(dist / 100.) -0.5772 0.209 -2.759 0.006 -0.987 -0.167
arsenic 0.5560 0.069 8.021 0.000 0.420 0.692
I(dist / 100.):arsenic -0.1789 0.102 -1.748 0.080 -0.379 0.022

==========================================================================================

The coefficient on the interaction is negative and only marginally significant (p = 0.08). While we can’t directly interpret its quantitative effect on switching, we can interpret it qualitatively. Distance has a negative effect on switching, and this negative effect gets stronger when arsenic levels are high. Alternatively, the arsenic level has a positive effect on switching, but this positive effect is reduced as distance to the nearest safe well increases.
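One way to see what the interaction does is to write out the slope of each variable as a function of the other. This sketch hard-codes the coefficients from the model 3 summary; the arsenic and distance values it evaluates are arbitrary illustration points.

```python
# Coefficients from the model3 summary above.
b_dist = -0.5772
b_arsenic = 0.5560
b_interaction = -0.1789

def dist_slope(arsenic):
    '''Log-odds change per 100 meters at a given arsenic level.'''
    return b_dist + b_interaction * arsenic

def arsenic_slope(dist100):
    '''Log-odds change per unit of arsenic at a given distance (in 100 m).'''
    return b_arsenic + b_interaction * dist100

for a in (1.0, 3.0):
    print('arsenic = {0}: distance slope = {1:.4f}'.format(a, dist_slope(a)))
for d in (0.0, 1.0):
    print('distance = {0} (x100 m): arsenic slope = {1:.4f}'.format(d, arsenic_slope(d)))
```

The slope on distance becomes more negative as arsenic rises, and the slope on arsenic shrinks toward zero as distance rises. These are slopes on the log-odds scale, not changes in probability.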

Model 4: Adding education, more interactions, and centering variables

Respondents with more education might have a better understanding of the harmful effects of arsenic and therefore may be more likely to switch. Education is in years, so we’ll scale it for more sensible coefficients. We’ll also include interactions amongst all the regressors.

We’re also going to center the variables, to help with interpretation of the coefficients. Once more, we can just do this in the formula, without pre-computing centered variables.

model_form = ('switch ~ center(I(dist / 100.)) + center(arsenic) + ' +
              'center(I(educ / 4.)) + ' +
              'center(I(dist / 100.)) : center(arsenic) + ' +
              'center(I(dist / 100.)) : center(I(educ / 4.)) + ' +
              'center(arsenic) : center(I(educ / 4.))'
             )
model4 = logit(model_form, data = df).fit()
print(model4.summary())




Optimization terminated successfully.
Current function value: 1945.871775
Iterations 5
Logit Regression Results

==============================================================================
Dep. Variable: switch No. Observations: 3020
Model: Logit Df Residuals: 3013
Method: MLE Df Model: 6
Date: Sat, 22 Dec 2012 Pseudo R-squ.: 0.05497
Time: 13:05:35 Log-Likelihood: -1945.9
converged: True LL-Null: -2059.0
LLR p-value: 4.588e-46

===============================================================================================================
coef std err z P>|z| [95.0% Conf. Int.]

---------------------------------------------------------------------------------------------------------------
Intercept 0.3563 0.040 8.844 0.000 0.277 0.435
center(I(dist / 100.)) -0.9029 0.107 -8.414 0.000 -1.113 -0.693
center(arsenic) 0.4950 0.043 11.497 0.000 0.411 0.579
center(I(educ / 4.)) 0.1850 0.039 4.720 0.000 0.108 0.262
center(I(dist / 100.)):center(arsenic) -0.1177 0.104 -1.137 0.256 -0.321 0.085
center(I(dist / 100.)):center(I(educ / 4.)) 0.3227 0.107 3.026 0.002 0.114 0.532
center(arsenic):center(I(educ / 4.)) 0.0722 0.044 1.647 0.100 -0.014 0.158

===============================================================================================================

Model assessment: binned residual plots

Plotting residuals against regressors can alert us to issues like nonlinearity or heteroskedasticity. Plotting raw residuals in a binary model isn’t usually informative, so we do some smoothing. Here, we’ll average the residuals within bins of the regressor. (A lowess or moving average might also work.)

I’m going to write a function to provide the binned residual data dynamically (and another helper function to plot the data). To create the bins I’m going to use the handy qcut function in pandas, which bins a vector of data into quantiles. Then I’ll use groupby to calculate the bin means and confidence intervals.

import numpy as np
from pandas import DataFrame, qcut

def bin_residuals(resid, var, bins):
    '''
    Compute average residuals within bins of a variable.

    Returns a dataframe indexed by the bins, with the bin midpoint,
    the residual average within the bin, and the confidence interval
    bounds.
    '''
    resid_df = DataFrame({'var': var, 'resid': resid})
    resid_df['bins'] = qcut(var, bins)
    bin_group = resid_df.groupby('bins')
    # Select both columns with a list; tuple-style selection is not
    # supported by recent pandas versions.
    bin_df = bin_group[['var', 'resid']].mean()
    bin_df['count'] = bin_group['resid'].count()
    # Bounds are +/- 2 standard errors of the bin mean
    std_err = bin_group['resid'].std() / np.sqrt(bin_df['count'])
    bin_df['lower_ci'] = -2 * std_err
    bin_df['upper_ci'] = 2 * std_err
    bin_df = bin_df.sort_values('var')
    return bin_df

import matplotlib.pyplot as plt

def plot_binned_residuals(bin_df):
    '''
    Plot binned residual averages and confidence intervals.
    '''
    plt.plot(bin_df['var'], bin_df['resid'], '.')
    plt.plot(bin_df['var'], bin_df['lower_ci'], '-r')
    plt.plot(bin_df['var'], bin_df['upper_ci'], '-r')
    plt.axhline(0, color = 'gray', lw = .5)

# Bin the model 4 residuals by arsenic and by distance, then plot side by side
arsenic_resids = bin_residuals(model4.resid, df['arsenic'], 40)
dist_resids = bin_residuals(model4.resid, df['dist'], 40)
plt.figure(figsize = (12, 5))
plt.subplot(121)
plt.ylabel('Residual (bin avg.)')
plt.xlabel('Arsenic (bin avg.)')
plot_binned_residuals(arsenic_resids)
plt.subplot(122)
plot_binned_residuals(dist_resids)
plt.ylabel('Residual (bin avg.)')
plt.xlabel('Distance (bin avg.)')

Model 5: log-scaling arsenic

The binned residual plot indicates some nonlinearity in the arsenic variable. Note how the model overestimates for low arsenic and underestimates for high arsenic. This suggests a log transformation or something similar.

We can again do this transformation right in the formula.

model_form = ('switch ~ center(I(dist / 100.)) + center(np.log(arsenic)) + ' +
              'center(I(educ / 4.)) + ' +
              'center(I(dist / 100.)) : center(np.log(arsenic)) + ' +
              'center(I(dist / 100.)) : center(I(educ / 4.)) + ' +
              'center(np.log(arsenic)) : center(I(educ / 4.))'
             )

model5 = logit(model_form, data = df).fit()
print(model5.summary())




Optimization terminated successfully.
Current function value: 1931.554102
Iterations 5
Logit Regression Results

==============================================================================
Dep. Variable: switch No. Observations: 3020
Model: Logit Df Residuals: 3013
Method: MLE Df Model: 6
Date: Sat, 22 Dec 2012 Pseudo R-squ.: 0.06192
Time: 13:05:57 Log-Likelihood: -1931.6
converged: True LL-Null: -2059.0
LLR p-value: 3.517e-52

==================================================================================================================
coef std err z P>|z| [95.0% Conf. Int.]

------------------------------------------------------------------------------------------------------------------
Intercept 0.3452 0.040 8.528 0.000 0.266 0.425
center(I(dist / 100.)) -0.9796 0.111 -8.809 0.000 -1.197 -0.762
center(np.log(arsenic)) 0.9036 0.070 12.999 0.000 0.767 1.040
center(I(educ / 4.)) 0.1785 0.039 4.577 0.000 0.102 0.255
center(I(dist / 100.)):center(np.log(arsenic)) -0.1567 0.185 -0.846 0.397 -0.520 0.206
center(I(dist / 100.)):center(I(educ / 4.)) 0.3384 0.108 3.141 0.002 0.127 0.550
center(np.log(arsenic)):center(I(educ / 4.)) 0.0601 0.070 0.855 0.393 -0.078 0.198

==================================================================================================================

And the binned residual plot for arsenic now looks better.

Model error rates

The pred_table() method gives us a confusion matrix for the model. We can use this to compute the error rate of the model.

We should compare this to the null error rate, which comes from a model that just classifies everything as whatever the most prevalent response is. Here 58% of the respondents were switchers, so the null model just classifies everyone as a switcher, and therefore has an error rate of 42%.

print(model5.pred_table())
print('Model Error rate: {0: 3.0%}'.format(
    1 - np.diag(model5.pred_table()).sum() / model5.pred_table().sum()))
print('Null Error Rate: {0: 3.0%}'.format(1 - df['switch'].mean()))

[[ 568. 715.]
[ 387. 1350.]]
Model Error rate: 36%
Null Error Rate: 42%

Conclusion

So this was a more in-depth example of running a logistic regression with statsmodels and the formula API. Unlike last time, when we were just specifying the variables in the model, here we used the formula language to apply transforms and create interactions. I really love this: it drastically reduces the number of steps between thinking up a model and fitting it.

Machine Learning for Hackers Chapter 2, Part 2: Logistic regression with statsmodels

2012-12-21T04:04:00-05:00

Introduction

I last left chapter 2 of Machine Learning for Hackers (a long time ago), running some kernel density estimators on height and weight data (see here). The next part of the chapter plots a scatterplot of weight vs. height and runs a lowess smoother through it. I’m not going to write any more about the lowess function in statsmodels. I’ve discussed some issues with it (i.e. it’s slow) here. And it’s my sense that the lowess API, as it is now in statsmodels, is not long for this world. The code is all in the IPython notebooks in the Github repo and is pretty straightforward.

Patsy and statsmodels formulas

What I want to skip to here is the logistic regressions the authors run to close out the chapter. Back in the spring, I coded up the chapter in this notebook. At that point, there wasn’t really much cohesion between pandas and statsmodels. You’d end up doing data exploration and munging with pandas, then pulling what you needed out of dataframes into numpy arrays, and passing those arrays to statsmodels. (After writing seemingly needless boilerplate code like X = sm.add_constant(X, prepend = True). Who’s out there running all these regressions without constant terms, such that it makes sense to force the user to explicitly add a constant vector to the data matrix?)

Over the summer, though, something quite cool happened. patsy brought a formula interface to Python, and it got integrated into a number of components of statsmodels. Skipper Seabold’s Pydata presentation is a good overview and demo. In a nutshell, statsmodels now talks to your pandas dataframes via an expressive “formula” description of your model.

For example, imagine we had a dataframe, df, with variables x1, x2, and y. If we wanted to regress y on x1 and x2 with the standard statsmodels API, we’d code something like the following:

Xmat = sm.add_constant(df[['x1', 'x2']].values, prepend = True)
yvec = df['y'].values
ols_model = OLS(yvec, Xmat).fit()

Which is tolerable with short variable names. Once you start using longer names or need more RHS variables it becomes a mess. With patsy and the formula API, you just have:

ols_model = ols('y ~ x1 + x2', data = df).fit()

Which is just as simple as using lm in R. You can also specify variable transformations and interactions in the formula, without needing to pre-compute variables for them. It’s pretty slick.

All of this is still brand new, and largely undocumented, so proceed with caution. But I’ve gotten very excited incorporating it into my code. Stuff I wrote just 5 or 6 months ago looks clunky and outdated.

So I’ve updated the IPython notebook for chapter 2, here, to incorporate the formula API. That’s what I’ll discuss in the rest of the post.

Logistic regression with formulas in statsmodels

The authors run a logistic regression to see if they can use a person’s height and weight to determine their gender. I’m not really sure why you’d run such a model (or how meaningful it is once you run it, given how co-linear height and weight are), but it’s easy enough for illustrating how to mechanically run a logistic regression and use it to linearly separate groups.

The dataset contains variables Height, Weight, and Gender. The latter is a string encoded either Male or Female. To run a logistic regression, we’ll want to transform this to a numerical 0/1 variable. We can do this a number of ways, but I’ll use the map method.

heights_weights['Male'] = heights_weights['Gender'].map({'Male': 1, 'Female': 0})

The statsmodels.formula.api module has a number of functions, including ols, logit, and glm. If we import logit from the module we can run a logistic regression easily.

male_logit = logit(formula = 'Male ~ Height + Weight', data = heights_weights).fit()
print(male_logit.summary())

With these results:

Optimization terminated successfully.
Current function value: 2091.297971
Iterations 8
Logit Regression Results

==============================================================================
Dep. Variable: Male No. Observations: 10000
Model: Logit Df Residuals: 9997
Method: MLE Df Model: 2
Date: Thu, 20 Dec 2012 Pseudo R-squ.: 0.6983
Time: 14:41:33 Log-Likelihood: -2091.3
converged: True LL-Null: -6931.5
LLR p-value: 0.000

==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]

------------------------------------------------------------------------------
Intercept 0.6925 1.328 0.521 0.602 -1.911 3.296
Height -0.4926 0.029 -17.013 0.000 -0.549 -0.436
Weight 0.1983 0.005 38.663 0.000 0.188 0.208

==============================================================================

Just for fun, we can also run the logistic regression via a GLM with a binomial family and logit link. This is similar to how I’d run it in R.

male_glm_logit = glm('Male ~ Height + Weight', data = heights_weights,
                     family = sm.families.Binomial(sm.families.links.logit)).fit()
print(male_glm_logit.summary())

And the results are the same:

Generalized Linear Model Regression Results

==============================================================================
Dep. Variable: Male No. Observations: 10000
Model: GLM Df Residuals: 9997
Model Family: Binomial Df Model: 2
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -2091.3
Date: Thu, 20 Dec 2012 Deviance: 4182.6
Time: 14:41:37 Pearson chi2: 9.72e+03
No. Iterations: 8

==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]

------------------------------------------------------------------------------
Intercept 0.6925 1.328 0.521 0.602 -1.911 3.296
Height -0.4926 0.029 -17.013 0.000 -0.549 -0.436
Weight 0.1983 0.005 38.663 0.000 0.188 0.208

==============================================================================

Now we can use the coefficients to plot a separating line in height-weight space.

logit_pars = male_logit.params
intercept = -logit_pars['Intercept'] / logit_pars['Weight']
slope = -logit_pars['Height'] / logit_pars['Weight']
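The intercept and slope come from setting the linear predictor to zero, which is where the predicted probability equals 0.5, and solving for weight. The sketch below checks that with the coefficients hard-coded from the summary above. The function prob_male and the test height of 68 inches are only there for illustration.

```python
import math

# Coefficients from the logit summary above.
b0, b_height, b_weight = 0.6925, -0.4926, 0.1983

# Boundary: b0 + b_height * height + b_weight * weight = 0.
# Solving for weight gives a line in height-weight space.
intercept = -b0 / b_weight
slope = -b_height / b_weight

def prob_male(height, weight):
    '''Predicted probability of Male from the fitted coefficients.'''
    return 1.0 / (1.0 + math.exp(-(b0 + b_height * height + b_weight * weight)))

height = 68.0
weight_on_line = intercept + slope * height
print(weight_on_line, prob_male(height, weight_on_line))
```

Points above the line (heavier at a given height) get a predicted probability above 0.5, and points below it get one below 0.5.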

Let’s plot the data, color-coded by sex, and the separating line.

fig = plt.figure(figsize = (10, 8))
# Women points (coral)
plt.plot(heights_f, weights_f, '.', label = 'Female',
         mfc = 'None', mec='coral', alpha = .4)
# Men points (blue)
plt.plot(heights_m, weights_m, '.', label = 'Male',
         mfc = 'None', mec='steelblue', alpha = .4)
# The separating line
plt.plot(array([50, 80]), intercept + slope * array([50, 80]),
         '-', color = '#461B7E')
plt.xlabel('Height (in.)')
plt.ylabel('Weight (lbs.)')
plt.legend(loc='upper left')

Conclusion

There are several more examples using Patsy formulas with statsmodels functions in later chapters. If you’re accustomed to R’s formula notation, the transition from running models in R to running models in statsmodels is easy. One of the annoying things in Python versus R is the need to pull arrays out of pandas dataframes, because the functions you want to apply to the data (say estimating models, or plotting) don’t interface with the dataframe, but instead with numpy arrays. It’s not terrible, but it adds a layer of friction in the analysis. So it’s great that statsmodels is starting to integrate well with pandas.

Machine Learning for Hackers Chapter 3: Naive Bayes Text Classification

2012-12-20T04:20:00-05:00

I realize I haven’t blogged about the rest of chapter 2 yet. I’ll get back to that, but chapter 3 is on my mind today. If you haven’t seen them yet, IPython notebooks up to chapter 9 are all up in the Github repo. To view them online, you can check the links on this page.

Chapter 3 is about text classification. The authors build a classifier that will identify whether an e-mail is spam or not (“ham”) based on the content of the e-mail’s message. I won’t go into much detail on how the Naive Bayes classifier they use works (beyond what’s evident in the code). The theory is described well in the book and many other places. I’m just going to discuss implementation, assuming you know how the classifier works in theory. The Python code for this project relies heavily on the NLTK (Natural Language Toolkit) package, which is a comprehensive library that includes functions for doing NLP and text analysis, as well as an array of benchmark text corpora to use them on. If you want to go deep into this stuff, two good resources are:

Natural Language Processing with Python by S. Bird, E. Klein, and E. Loper; and
Python Text Processing with NLTK 2.0 Cookbook by J. Perkins

Two versions of the program

I’ve coded up two different versions of this chapter. The first, here, tries to follow the book relatively closely. The general procedure they use is:

Parse and tokenize the e-mails
Create a term-document matrix of the e-mails
Calculate features of the training e-mails using the term-document matrix
Train the classifier on these features
Test the classifier on other sets of spam and ham e-mails

I’m not going to discuss this version in much detail, but you should take a look at the notebook if you’re interested. Two big takeaways from this are:

Python lacks a good term-document matrix tool. I was surprised to find that NLTK, which has so much functionality including helper functions like FreqDist, doesn’t have a function for making term-document matrices similar to the tdm function in R’s tm package. There is a Python module called textmining (which you can install with pip) that does have a term-document matrix function, but it’s pretty rudimentary. What you’ll see in this chapter is that I’ve coded up a term-document matrix function that uses the one in textmining but adds some bells and whistles, and returns the TDM as a (typically sparse) pandas dataframe.
The authors’ classifier suffers from numerical errors. The Naive Bayes classifier calculates the probability that a message is spam by calculating the probability that the message’s terms occur in a spam message. So if the message is just “buy viagra”, and “buy” occurs in 75% of the training spam, and “viagra” occurs in 50% of the training spam, then the classifier assigns this a ‘spam’ probability of .75 * .50 = 37.5%. The problem with this calculation is that there are typically many terms, and the probabilities are often small, so their product can end up smaller than machine precision and underflow to zero. The way around this is to take the sum of the log probabilities (so log(.75) + log(.50)). The authors don’t do this, though, and it’s apparent that they end up with underflow errors. See, for example, the code output on page 89. This is also what leads to them having essentially the same error rates for “hard” ham as they do for “easy” ham in the tables on pages 89 and 92. Once you fix this problem, it turns out the classifier is actually much better for spam and easy ham than it appears in the book, but it’s way worse for hard ham.
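The underflow problem is easy to reproduce. In this sketch the 400 terms and their 1% probabilities are made-up numbers chosen to force the issue; only the two-term “buy viagra” figures come from the example above.

```python
import math

# The two-term example: product and sum-of-logs agree.
two_term_product = 0.75 * 0.50
two_term_log = math.log(0.75) + math.log(0.50)

# Hypothetical long message: 400 terms, each occurring in 1% of training spam.
term_probs = [0.01] * 400

# Naive product: 0.01 ** 400 is far below the smallest positive double,
# so it underflows to exactly zero.
product = 1.0
for p in term_probs:
    product *= p

# Sum of logs: the same quantity on a scale that does not underflow.
log_score = sum(math.log(p) for p in term_probs)

print(two_term_product, math.exp(two_term_log))
print(product)
print(log_score)
```

Once the product is zero for both classes there is nothing left to compare, whereas the log scores can still be ranked against each other.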

I’m going to focus on the second version of the program, though, in the notebook called ch3_nltk.ipynb. You can view it online here. In this version, I use NLTK’s built-in NaiveBayesClassifier function, and avoid creating the TDM (which isn’t really used for much in the original code anyway).

Building a Naive Bayes spam classifier with NLTK

I’ll follow the same logic as the program from chapter 3, but I’ll do so with a workflow more suited to NLTK’s functions. So instead of creating a term-document matrix, and building my own Naive Bayes classifier, I’ll build a features → label association for each training e-mail, and feed a list of these to NLTK’s NaiveBayesClassifier function.

Extracting word features from the e-mail messages

The program begins with some simple code that loads the e-mail files from the directories, extracts the “message” or body of the e-mail, and loads all those messages into a list. This follows the book’s code pretty closely, and we end up with training and testing lists of spam, easy ham, and hard ham. The training data will be the e-mails in the training directories for spam and easy ham. (So, like in the book, we’re not training on any hard ham.)

Each e-mail in our classifier’s training data will have a label (“spam” or “ham”) and a feature set. For this application, the feature set is simply the set of unique words in the e-mail. Below, I’ll turn this into a dictionary to feed into the NaiveBayesClassifier, but first, let’s get the set.

Note: This is similar to a “bag-of-words” model, in that it doesn’t care about word order or other semantic information. But a “bag-of-words” usually considers the frequency of the word within the document (like a histogram of the words), whereas we’re only concerned with whether it’s in an e-mail, not how often it occurs.
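The distinction is easy to see on a toy message. The token list here is invented, and the word → True dictionary at the end is one common shape for presence-only features, shown as an assumption and not necessarily the exact form used in the notebook.

```python
from collections import Counter

# A made-up, already-tokenized message.
tokens = 'buy now buy cheap buy'.split()

# Bag of words: keeps how often each word occurs.
bag_of_words = Counter(tokens)

# Set of words: keeps only whether each word occurs.
set_of_words = set(tokens)

# One possible dictionary form of the presence-only features.
features = {word: True for word in set_of_words}

print(bag_of_words)
print(sorted(set_of_words))
```

Here “buy” counts three times in the bag but contributes only one feature to the set.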

Parsing and tokenizing the e-mails

I’m going to use NLTK’s wordpunct_tokenize function to break the message into tokens. This splits tokens at white space and (most) punctuation marks, and returns the punctuation along with the tokens on each side. So "I don't know. Do you?" becomes ["I", "don", "'", "t", "know", ".", "Do", "you", "?"].
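If you don’t have NLTK handy, the behavior can be approximated with a single regular expression. To the best of my knowledge this is the pattern NLTK’s WordPunctTokenizer is built on, but treat the sketch as a stand-in and not as NLTK’s code; wordpunct_sketch is a name made up here.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    '''Approximate wordpunct-style tokenization with a regex.'''
    return WORDPUNCT.findall(text)

tokens = wordpunct_sketch("I don't know. Do you?")
print(tokens)
```

Note how the apostrophe becomes its own token, which is why contractions end up split into pieces.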

If you look through some of the training e-mails in train_spam_messages and train_ham_messages, you’ll notice a few features that make extracting words tricky.

First, there are a couple of odd text artefacts. The string ‘3D’ shows up in strange places in HTML attributes and other places, and we’ll remove these. Furthermore there seem to be some mid-word line wraps flagged with an ‘=’ where the word is broken across lines. For example, the word ‘apple’ might be split across lines like ‘app=\nle’. We want to strip these out so we can recover ‘apple’. We’ll want to deal with all these first, before we apply the tokenizer.
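Both artefacts look like quoted-printable encoding, where ‘=3D’ stands for a literal ‘=’ and a trailing ‘=’ marks a soft line break. The sketch below cleans a made-up snippet two ways: by hand with string replacement, and with the standard library’s quopri module. The by-hand version restores ‘=3D’ to ‘=’ instead of deleting ‘3D’ outright, which is a small variation on the approach described above.

```python
import quopri
import re

# A made-up snippet showing both artefacts.
raw = '<a href=3D"http://example.com">app=\nle</a>'

# By hand: remove soft line breaks, then restore the encoded '='.
by_hand = re.sub(r'=\n', '', raw).replace('=3D', '=')

# Alternative: let the standard library decode quoted-printable.
decoded = quopri.decodestring(raw.encode('ascii')).decode('ascii')

print(by_hand)
print(decoded)
```

The decoder is the more general option, but it assumes the whole message really is quoted-printable, which may not hold for every e-mail in the corpus.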

Second, there’s a lot of HTML in the messages. We’ll have to decide first whether we want to keep HTML info in our set of words. If we do, wordpunct_tokenize applied to some HTML such as:

"<HEAD></HEAD><BODY><!-- Comment -->"

would give us the tokens:

["<", "HEAD", "></", "HEAD", "><", "BODY", "><!--", "Comment", "-->"]

So if we drop the punctuation tokens, and get the unique set of what remains, we’d have {"HEAD", "BODY", "Comment"}, which seems like what we’d want. For example, it’s nice that this method doesn’t make <HEAD> and </HEAD> separate words in our set, but just captures the existence of this tag with the term "HEAD". It might be a problem that we won’t distinguish between the HTML tag <HEAD> and “head” used as an English word in the message. But for the moment I’m willing to bet that sort of conflation won’t have a big effect on the classifier.

If we don’t want to count HTML information in our set of words, we can set strip_html to True, and we’ll take all the HTML tags out before tokenizing.

Lastly we’ll strip out any “stopwords” from the set. Stopwords are highly common, and therefore low-information, words like “a”, “the”, “he”, etc. Below I’ll use stopwords downloaded from NLTK’s corpus library, with minor modifications, to deal with this. (In other programs I’ve used the stopwords exported from R’s tm package.)

Note that because our tokenizer splits contractions (“she’ll” → “she”, “ll”), we’d like to drop the ends (“ll”). Some of these may be picked up in NLTK’s stopwords list, others we’ll manually add. It’s an imperfect, but easy solution. There are more sophisticated ways of dealing with this which are overkill for our purposes.

Tokenizing, as perhaps you can tell, is a non-trivial operation. NLTK has a host of other tokenizing functions of varying sophistication, and even lets you define your own tokenizing rule using regex.

import re
from nltk.tokenize import wordpunct_tokenize

def get_msg_words(msg, stopwords = [], strip_html = False):
    '''
    Returns the set of unique words contained in an e-mail message.
    Excludes any that are in an optionally-provided list.

    NLTK's 'wordpunct' tokenizer is used, and this will break contractions.
    For example, don't -> (don, ', t). Therefore, it's advisable to supply
    a stopwords list that includes contraction parts, like 'don' and 't'.
    '''

    # Strip out weird '3D' artefacts.
    msg = re.sub('3D', '', msg)

    # Strip out html tags and attributes and html character codes,
    # like '&nbsp;' and '&lt;'.
    if strip_html:
        msg = re.sub(r'<(.|\n)*?>', ' ', msg)
        msg = re.sub(r'&\w+;', ' ', msg)

    # wordpunct_tokenize doesn't split on underscores. We don't
    # want to strip them, since the token first_name may be more
    # informative than 'first' and 'name' apart. But there are tokens
    # with long underscore strings (e.g. 'name_______'). We'll just
    # replace the multiple underscores with a single one, since
    # 'name_____' is probably not distinct from 'name___' or 'name_'
    # in identifying spam.
    msg = re.sub('_+', '_', msg)

    # Note, remove '=\n' soft line breaks before tokenizing, since these
    # sometimes occur within words to indicate line-wrapping.
    msg_words = set(wordpunct_tokenize(msg.replace('=\n', '').lower()))

    # Get rid of stopwords
    msg_words = msg_words.difference(stopwords)

    # Get rid of punctuation tokens, numbers, and single letters.
    msg_words = [w for w in msg_words if re.search('[a-zA-Z]', w) and
                 len(w) > 1]

    return msg_words

Making a `(features, label)` list

The NaiveBayesClassifier function trains on data that’s of the form [(features_1, label_1), (features_2, label_2), ..., (features_N, label_N)] where features_i is a dictionary of features for e-mail i and label_i is the label for e-mail i (spam or ham).

The function features_from_messages iterates through the messages creating this list, but calls an outside function to create the features for each e-mail. This makes the function modular in case we decide to try out some other method of extracting features from the e-mails besides the set of words. It then pairs the features with the e-mail’s label in a tuple and adds the tuple to the list.

The word_indicator function calls get_msg_words() to get an e-mail’s words as a set, then creates a dictionary with entries {word: True} for each word in the set. This is a little counter-intuitive (since we don’t have {word: False} entries for words not in the set) but NaiveBayesClassifier knows how to handle it.

from collections import defaultdict

def features_from_messages(messages, label, feature_extractor, **kwargs):
    '''
    Make a (features, label) tuple for each message in a list of e-mails
    with a certain label ('spam', 'ham') and return a list of these tuples.

    Note every e-mail in 'messages' should have the same label.
    '''
    features_labels = []
    for msg in messages:
        features = feature_extractor(msg, **kwargs)
        features_labels.append((features, label))
    return features_labels

def word_indicator(msg, **kwargs):
    '''
    Create a dictionary of entries {word: True} for every unique
    word in a message.

    Note **kwargs are options to the word-set creator,
    get_msg_words().
    '''
    features = defaultdict(list)
    msg_words = get_msg_words(msg, **kwargs)
    for w in msg_words:
        features[w] = True
    return features

Training and evaluating the classifier

With those functions defined, we can apply them to the training and testing spam and ham messages.

def make_train_test_sets(feature_extractor, **kwargs):
    '''
    Make (feature, label) lists for each of the training
    and testing lists.
    '''
    train_spam = features_from_messages(train_spam_messages, 'spam',
                                        feature_extractor, **kwargs)
    train_ham = features_from_messages(train_easyham_messages, 'ham',
                                       feature_extractor, **kwargs)
    train_set = train_spam + train_ham

    test_spam = features_from_messages(test_spam_messages, 'spam',
                                       feature_extractor, **kwargs)

    test_ham = features_from_messages(test_easyham_messages, 'ham',
                                      feature_extractor, **kwargs)

    test_hardham = features_from_messages(test_hardham_messages, 'ham',
                                          feature_extractor, **kwargs)

    return train_set, test_spam, test_ham, test_hardham

Notice that the training set we’ll use to train the classifier combines both the spam and easy ham training sets (since we need both types of e-mail to train it).

Finally, let’s write a function to train the classifier and check how accurate it is on the test data.

def check_classifier(feature_extractor, **kwargs):
    '''
    Train the classifier on the training spam and ham, then check its
    accuracy on the test data, and show the classifier's most informative
    features.
    '''

    # Make training and testing sets of (features, label) data
    train_set, test_spam, test_ham, test_hardham = \
        make_train_test_sets(feature_extractor, **kwargs)

    # Train the classifier on the training set
    classifier = NaiveBayesClassifier.train(train_set)

    # How accurate is the classifier on the test sets?
    print ('Test Spam accuracy: {0:.2f}%'
           .format(100 * nltk.classify.accuracy(classifier, test_spam)))
    print ('Test Ham accuracy: {0:.2f}%'
           .format(100 * nltk.classify.accuracy(classifier, test_ham)))
    print ('Test Hard Ham accuracy: {0:.2f}%'
           .format(100 * nltk.classify.accuracy(classifier, test_hardham)))

    # Show the top 20 informative features. The method prints its own
    # output and returns None, so it isn't wrapped in a print.
    classifier.show_most_informative_features(20)

The function also prints out the results of NaiveBayesClassifier’s handy show_most_informative_features method. This shows which features are most distinctive of one label or another. For example, if “viagra” shows up in 500 of the spam e-mails, but only 2 of the “ham” e-mails in the training set, and the two sets are the same size, then the method will show that “viagra” is one of the most informative features with a spam:ham ratio of roughly 250:1.
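The ratio in that example is a ratio of conditional probabilities, not of raw counts, so the class sizes matter. Here is a rough sketch of the arithmetic. The function name, the class sizes of 1,000, and the optional smoothing term are all assumptions for illustration; NLTK applies its own smoothing internally, so its reported ratios will differ a little.

```python
def presence_ratio(count_a, total_a, count_b, total_b, smoothing=0.0):
    """Ratio of P(word present | label a) to P(word present | label b).

    `smoothing` is added to each count (and twice to each total, once for
    each outcome, present or absent) to avoid dividing by zero.
    """
    p_a = (count_a + smoothing) / (total_a + 2.0 * smoothing)
    p_b = (count_b + smoothing) / (total_b + 2.0 * smoothing)
    return p_a / p_b

# Hypothetical class sizes: 1,000 spam and 1,000 ham e-mails.
ratio = presence_ratio(500, 1000, 2, 1000)

# Same counts, but twice as much ham: the word is rarer among ham,
# so the ratio doubles.
skewed = presence_ratio(500, 1000, 2, 2000)

print(ratio, skewed)
```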

So how do we do? I’ll check two versions. The first uses the HTML info in the e-mails in the classifier:

check_classifier(word_indicator, stopwords = sw)

Which gives:

Test Spam accuracy: 98.71%
Test Ham accuracy: 97.07%
Test Hard Ham accuracy: 13.71%
Most Informative Features
    align = True          spam : ham = 119.7 : 1.0
    tr = True             spam : ham = 115.7 : 1.0
    td = True             spam : ham = 111.7 : 1.0
    arial = True          spam : ham = 107.7 : 1.0
    cellpadding = True    spam : ham = 97.0 : 1.0
    cellspacing = True    spam : ham = 94.3 : 1.0
    img = True            spam : ham = 80.3 : 1.0
    bgcolor = True        spam : ham = 67.4 : 1.0
    href = True           spam : ham = 67.0 : 1.0
    sans = True           spam : ham = 62.3 : 1.0
    colspan = True        spam : ham = 61.0 : 1.0
    font = True           spam : ham = 61.0 : 1.0
    valign = True         spam : ham = 60.3 : 1.0
    br = True             spam : ham = 59.6 : 1.0
    verdana = True        spam : ham = 57.7 : 1.0
    nbsp = True           spam : ham = 57.4 : 1.0
    color = True          spam : ham = 54.4 : 1.0
    ff0000 = True         spam : ham = 53.0 : 1.0
    ffffff = True         spam : ham = 50.6 : 1.0
    border = True         spam : ham = 49.6 : 1.0

The classifier does a really good job for spam and easy ham, but it’s pretty miserable for hard ham. This may be because hard ham messages tend to be HTML-formatted while easy ham messages aren’t. Note how much the classifier relies on HTML information—nearly all the most informative features are HTML-related.

If we try just using the text of the messages, without the HTML information, we lose a tiny bit of accuracy in identifying spam but do much better with the hard ham.

check_classifier(word_indicator, stopwords = sw, strip_html = True)

shows

Test Spam accuracy: 96.64%
Test Ham accuracy: 98.64%
Test Hard Ham accuracy: 56.05%
Most Informative Features
    dear = True          spam : ham = 41.7 : 1.0
    aug = True           ham : spam = 38.3 : 1.0
    guaranteed = True    spam : ham = 35.0 : 1.0
    assistance = True    spam : ham = 29.7 : 1.0
    groups = True        ham : spam = 27.9 : 1.0
    mailings = True      spam : ham = 25.0 : 1.0
    sincerely = True     spam : ham = 23.0 : 1.0
    fill = True          spam : ham = 23.0 : 1.0
    mortgage = True      spam : ham = 21.7 : 1.0
    sir = True           spam : ham = 21.0 : 1.0
    sponsor = True       ham : spam = 20.3 : 1.0
    article = True       ham : spam = 20.3 : 1.0
    assist = True        spam : ham = 19.0 : 1.0
    income = True        spam : ham = 18.6 : 1.0
    tue = True           ham : spam = 18.3 : 1.0
    mails = True         spam : ham = 18.3 : 1.0
    iso = True           spam : ham = 17.7 : 1.0
    admin = True         ham : spam = 17.7 : 1.0
    monday = True        ham : spam = 17.7 : 1.0
    earn = True          spam : ham = 17.0 : 1.0

Check out the most informative features; they make a lot of sense. Note that it’s mostly spammers who address you with “Dear” and “Sir” and sign off with “Sincerely”. (Probably those Nigerian princes; they tend to be polite.) Other spam flags that gel with our intuition are “guaranteed”, “mortgage”, “assist”, “assistance”, and “income”.

Conclusion

So we’ve built a simple but decent spam classifier with just a tiny amount of code. NLTK provides a wealth of tools for doing this sort of thing more seriously, including ways to extract more sophisticated features and more complex classifiers.

Better typography for IPython notebooks

2012-12-05T05:34:00-05:00

(Warning: ignorant rant coming up)

Like everyone else who’s ever used it, I love the IPython notebook. It’s not only an awesomely productive environment to work in, it’s also the most powerful weapon in the Python evangelist’s arsenal (suck it, Matlab).

I also think it’s not hard to imagine a world where scientific papers are all just literate programs. And the notebook is probably one of the best tools for literate programming around in any language. The integration of markdown and LaTeX/MathJax into the notebook is just fantastic.

But it does have one weakness as a literate programming tool. The default typography is ugly as sin.

There are several issues, but two major ones are easily fixable.

Long lines

By far the biggest issue is that the text and input cells extend to 100% of the window width. Most people keep their browser windows open wider than is comfortable reading width, so you end up with long hard-to-read lines of text in the markdown cells.

And for the code, it would be nice to have the code cell discourage you from long lines. The variable width cells don’t. I’m an 80-character anal retentive, and even I have trouble in the notebook getting a sense of when a line is too long.

When you write a script in a text editor, there’s lots of previous code in the viewable window, so your eye gets a sense of the ‘right-margin’ of the code. (Not to mention many editors will indicate the 80- or whatever-character column, so you know exactly when to break). But in the notebook, your code is typically broken up into smaller blocks, and those blocks are interspersed with output and other cells. It’s hard to get a visual sense of the right margin.

Ugly fonts

Text and markdown cells are typically rendered in Helvetica or Arial. Helvetica is a fine font, obviously, but it’s not really suitable for paragraphs of text (how many books, magazines, newspapers, or academic papers do you see with body text typeset in Helvetica?). And combined with the small size and long lines, it’s hard to read and just plain ugly. I don’t think I have to say anything about Arial.

The way I use the notebook—with markdown cells used for long stretches of explanatory text and result interpretation—it’s better to have the text cells render in a serif font. This way it stands out from the code and output cells more. Serif fonts also have more distinctive italics, and integrate better with LaTeX/MathJax math.

Code cells and interpreter output cells render in whatever your default monospace font is. That’s typically Courier or Courier New. This is fine, but really, this is the 21st century—we can do a lot better.

Update: one more thing

I realize I’ve made one other change that I think is important. The default ordered list in the notebook uses roman numerals (I, II, III, …). I almost always want arabic numerals (1, 2, 3, …) instead. We can change this in the file renderedhtml.css with

.rendered_html ol {list-style:decimal; margin: 1em 2em;}

(Also check the comments for other, and typically better ways to make changes.) You can also modify sub-levels ol ol, ol ol ol, etc. Ideally I’d like to have nested numbers 1.1, 1.1.1, but this isn’t straightforward so I haven’t implemented it. If anyone has tips, I’d be thrilled to hear them.

Fixing it (locally, at least)

(Warning: I don’t know what I’m doing. Don’t make any of these changes, or any others, without backing up the files first.)

(Update: Matthias Bussonnier has an informative post showing the right way to make these changes. If you make the CSS changes I describe below, do it the way he advises, not through the files I describe here.)

The notebook is served through the browser, so its frontend is basically just HTML, Javascript, and CSS. The typography and appearance of the notebook are nearly all driven by CSS files located where IPython is stored on your system. This will differ based on your OS and your Python distribution. On my mac, with the AnacondaCE distribution, the stylesheets are located in /Users/cvogel/anaconda/lib/python2.7/site-packages/IPython/frontend/html/notebook/static. There are several subfolders there, including /css and /codemirror. You can also take a look at the stylesheet files by firing up a notebook, and using your browser’s inspector. If your browser (e.g. Chrome) lets you edit stylesheets on the fly in the inspector, you can try out changes relatively safely.

Here are the edits I’ve made on my system to address the issues above. First, in the /css folder, in the file called notebook.css

1. Set code input cells to be narrower (code that runs past the width will be invisible). I try to set this for about 80 characters plus some buffer. There’s no way to set width as a number of characters in CSS, so you may have to experiment to see what ex-width works with your font.

div.input {
width: 105ex; /* about 80 chars + buffer */
...
}

2. Fixing markdown/text cells. I make changes to the font, width, and linespacing. I’m using Charis SIL, a font based on the classic Bitstream Charter, and freely available here. Shortening the lines and adding some line space (120% to 150% of point size is usually a good range) for legibility.

div.text_cell {
width: 105ex; /* instead of 100% */
...
}

div.text_cell_render {
/*font-family: "Helvetica Neue", Arial, Helvetica, Geneva, sans-serif;*/
font-family: "Charis SIL", serif; /* Make non-code text serif. */
line-height: 145%; /* added for some line spacing of text. */
width: 105ex; /* instead of 'inherit' for shorter lines */
...
}

3. Add styles to specify sizes for headers.

/* Set the size of the headers */
div.text_cell_render h1 {
font-size: 18pt;
}

div.text_cell_render h2 {
font-size: 14pt;
}

Then, in the /codemirror/lib subfolder, there’s a file called codemirror.css. In here we can change the font used for code, both input and interpreter output. I’m using Consolas.

.CodeMirror {
font-family: Consolas, monospace;
}

Obviously these changes only affect notebooks you view on your local machine, and whoever views your notebooks on their own machine, or on nbviewer will see the default style.

Here are before and after shots of these changes:

Fixing it (globally?)

So this is all cute right? And it’s nice that we can do some customizations to the notebook, but, you know, big deal.

I’d argue this is actually more important than just aesthetic tinkering. The IPython notebook is becoming a one-stop-shop for exploration, collaboration, publication, distribution, and replication in data analysis. Like I said above, I think it’s not unreasonable that notebooks could replace a large class of scientific papers. But to do that, it has to perform as well as all the fragmented tools that researchers are currently using. Otherwise, people are going to keep pasting their code and results into Word and Latex documents. In other words, the notebook has to work not just as an interactive environment, but also as a static document. The IPython team realizes this, which is why tools like nbconvert exist.

People are doing amazing things in the notebook. The typography should encourage people to read them, and not just serve as souped-up comments.

Tools are often strongly associated with aesthetic characteristics that are only peripheral to the tool itself. ggplot can make charts that look however you want, but when people think of ggplot, they think of the gray background and the Color Brewer palette. And while the main selling point of ggplot is its abstraction of the graph-making process, I think it was the distinctive and attractive style of its graphs that made it catch on so successfully. On the opposite end of the spectrum, when people think of Stata graphics, they think of this, and wince. And Latex will typeset documents with whatever crazy font you want, but in everyone’s mind, Latex ⇔ Computer Modern (for better or worse). Design defaults are important: they’re marketing and they encourage good habits by your users. It’d be a shame to have it be that people think of the IPython notebook and picture long lines of small, single-spaced Helvetica Neue.

It’s an insanely powerful tool. It’d be awesome if it were beautiful too, and that goal seems eminently do-able.

How do you speed up 40,000 weighted least squares calculations? Skip 36,000 of them.

2012-05-14T23:46:00-04:00

Despite having finished all the programming for Chapter 2 of MLFH a while ago, there’s been a long hiatus since the first post on that chapter.

(S)lowess

Why the delay? The second part of the code focuses on two procedures: lowess scatterplot smoothing, and logistic regression. When implementing the former in statsmodels, I found that it was running dog slow on the data—in this case a scatterplot of 10,000 height-vs.-weight points. Indeed, for these 10,000 points, lowess, run with the default parameters, required about 23 seconds. After importing modules and defining variables according to my IPython notebook, we can run timeit on the function:

%timeit -n 3 lowess.lowess(heights, weights)

This results in

3 loops, best of 3: 42.6 s per loop

on the machine I’m writing this on (a Windows laptop with a 2.67 GHz i5 processor; timings are faster, but still in the 30 sec. range on my 2.5 GHz i7 Macbook).

An R user—or really a user of any other statistical package—is going to be confused here. We’re all used to lowess being a relatively instantaneous procedure. It’s an oft-used option for graphics packages like Lattice and ggplot2 — and it doesn’t take 20-30 seconds to generate a plot with a lowess curve superimposed. So what’s the deal? Is something wrong with the statsmodels implementation?

The naive lowess algorithm

Short answer: no. Long answer: yeah, kinda. Let’s start by looking at the lowess algorithm in general, sticking to the 2-D y-vs.-x scatterplot case. (I don’t really find multi-dimensional lowess useful anyway; maybe others put it to frequent use. If so, I’d like to hear about it).

Let’s say we have data {x₁, …, x_n} and {y₁, …, y_n}. The idea is to fit a set of values {y^*₁, …, y^*_n} where each is the prediction at x_i from a weighted regression using a fixed neighborhood of points around x_i. The weighting scheme puts less weight on points that are far from x_i. The regression can be linear, or polynomial, but linear is typical, and lowess procedures that use polynomials of degree higher than 2 are rare.

After we get this first set of fits, we usually run the regressions a few more times, each time modifying the weights to take into account residuals from the previous fit. These “robustifying” iterations apply successively less weight to outlying points in the data, reducing their influence on the final curve.

Here’s the recipe:

1. Select the number of neighbors, k, to use in each local regression, and the number of robustifying iterations.
2. Sort the data, both x and y, by the order of the x-values.
3. For each x_i in {x₁, …, x_n}:
   A. Find the k points nearest to x_i (the neighborhood).
   B. Calculate the weights for each x_j in the neighborhood. This requires:
      i. Calculating the distance between each x_j and x_i and applying a weighting function to these distances.
      ii. Taking the weights calculated from the previous fit’s residuals (if this is not the first fit) and multiplying them by the distance weights.
   C. Run a regression of the y_js on the x_js in the neighborhood, using the weights calculated in part B above. Predict y^*_i.
4. Calculate the residuals from this fitted series of {y^*₁, …, y^*_n}, and compute a weight from each of them.
5. Repeat 3 and 4 for the specified number of robustifying iterations.
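The recipe above can be sketched in plain Python. This is an illustration, not the statsmodels or R code: the tricube distance weights, the bisquare residual weights with a cutoff of six times the median absolute residual, and all function names are assumptions (they are the conventional choices for lowess, but the recipe itself doesn't name them).

```python
from statistics import median

def tricube(u):
    """Distance weight for a scaled distance u; zero at and beyond u = 1."""
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def weighted_line_predict(xs, ys, ws, x0):
    """Fit a weighted least-squares line to (xs, ys) and predict at x0."""
    sw = sum(ws)
    if sw == 0.0:                      # no usable weights: plain mean
        return sum(ys) / len(ys)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    if sxx == 0.0:                     # all weight on one x-value
        return ybar
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    return ybar + (sxy / sxx) * (x0 - xbar)

def naive_lowess(x, y, k, n_robust=3):
    """Naive lowess: one local regression per point, per iteration."""
    pairs = sorted(zip(x, y))                         # step 2
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(xs)
    robust = [1.0] * n                                # residual weights
    fitted = [0.0] * n
    for _ in range(n_robust + 1):                     # step 5
        for i in range(n):                            # step 3
            # 3A: indices of the k nearest neighbours of xs[i]
            nbrs = sorted(range(n), key=lambda j: abs(xs[j] - xs[i]))[:k]
            radius = max(abs(xs[j] - xs[i]) for j in nbrs) or 1.0
            # 3B: distance weights times residual weights
            ws = [tricube(abs(xs[j] - xs[i]) / radius) * robust[j]
                  for j in nbrs]
            # 3C: local weighted regression, predicted at xs[i]
            fitted[i] = weighted_line_predict(
                [xs[j] for j in nbrs], [ys[j] for j in nbrs], ws, xs[i])
        # Step 4: bisquare weights from the residuals
        resid = [yi - fi for yi, fi in zip(ys, fitted)]
        s = median([abs(r) for r in resid])
        if s < 1e-12:                  # fit is effectively exact: stop
            break
        robust = [(1.0 - (r / (6.0 * s)) ** 2) ** 2
                  if abs(r) < 6.0 * s else 0.0 for r in resid]
    return xs, fitted

# Exactly linear data should be reproduced by the local fits.
line_x = [float(i) for i in range(20)]
line_y = [2.0 * v + 1.0 for v in line_x]
sorted_x, smooth = naive_lowess(line_x, line_y, k=7)
```

Even in this toy form the cost is visible: every point triggers a neighbour search and a regression, and the whole pass repeats for each robustifying iteration.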

Clearly, this is an expensive procedure. For 10,000 points and 3 robustifying iterations (which is the default in R and statsmodels), you’re calculating weights and running regressions 40,000 times (1 initial fit + 3 robustifying iterations, each over 10,000 points). Running R’s lm.fit (which is the lean, fast engine under lm) 40,000 times costs about 11 seconds. Add on all the costs from weight calculations, which will happen 40,000 × k times since a weight needs to be calculated for each of a point’s neighbors, and it’s not surprising that the statsmodels version is as slow as it is. It is an inherently expensive algorithm.

Cheating our way to a faster lowess

The question is, why is R’s lowess so fast? The answer is that R, like most other implementations going back to Cleveland’s lowess.f Fortran program, doesn’t perform lowess calculations on all that data.

If you look at the R help file for lowess, you’ll see that in addition to the parameters we’d expect—the data x and y; a parameter to determine the size of the neighborhood; and the number of robustifying iterations—there’s an argument called delta.

The idea behind delta is the following: x_i that are close together aren’t very interesting. If we’ve already calculated y^*_i from the neighborhood of data around x_i, and |x_{i+1} - x_i| < delta, then we don’t really need to calculate y^*_{i+1}. It’s bound to be near y^*_i.

Instead let’s go out to an x_j that’s farther away from x_i, say the farthest one still within delta distance. Let’s fit another weighted regression here. All those points in between (within that delta distance) can be approximated by a line going between the two regression fits we made. Then, just keep skipping along in these delta-sized steps, back-filling the predictions by linear interpolation as we go, until the end of the data.

How much work have we saved ourselves? Assume as above 10,000 points and 4 iterations (the initial fit plus 3 robustifying passes). If the x’s are uniformly distributed along the axis, and we take delta to be 0.01 * (max(x) - min(x)) (which is the default value in R), then we’re only running 100 regressions per iteration, or 400 overall. Compared to the 40,000 that statsmodels is running, we can see why R is much faster. It’s cheating!
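Here is a small sketch of the skipping logic, to show where the savings come from. It is not the Cython or R code; delta_anchors and interpolate are hypothetical names, and the x-values are assumed to be sorted already.

```python
def delta_anchors(xs, delta):
    """Indices of sorted xs at which a local regression is actually run."""
    n = len(xs)
    anchors = [0]
    i = 0
    while i < n - 1:
        j = i + 1                      # always move at least one point
        # Advance to the farthest point still within delta of xs[i].
        while j + 1 < n and xs[j + 1] - xs[i] <= delta:
            j += 1
        anchors.append(j)
        i = j
    return anchors

def interpolate(xs, anchors, anchor_fits):
    """Back-fill every fitted value by linear interpolation between anchors."""
    fitted = [0.0] * len(xs)
    for idx in range(len(anchors) - 1):
        a, b = anchors[idx], anchors[idx + 1]
        fa, fb = anchor_fits[idx], anchor_fits[idx + 1]
        span = xs[b] - xs[a]
        for m in range(a, b):
            t = (xs[m] - xs[a]) / span if span else 0.0
            fitted[m] = fa + t * (fb - fa)
    fitted[anchors[-1]] = anchor_fits[-1]
    return fitted

# 10,000 evenly spaced points with R's default delta.
xs = [k / 9999.0 for k in range(10000)]
delta = 0.01 * (max(xs) - min(xs))
anchors = delta_anchors(xs, delta)
print(len(anchors))   # roughly 100 regressions per pass instead of 10,000

# Tiny check of the back-filling on a straight line.
small_x = [float(i) for i in range(11)]
small_anchors = delta_anchors(small_x, 2.5)
small_fit = interpolate(small_x, small_anchors,
                        [2.0 * small_x[a] for a in small_anchors])
```

In a real implementation the values in anchor_fits would come from the weighted regressions; here they are supplied directly so the interpolation can be checked on its own.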

This kind of approximating is fine, really. It’s just assuming that, if our model is y = f(x) + e and f(x) is what we’re trying to estimate with lowess, we can take the linear approximation of it in small neighborhoods.

Implementing a faster lowess in Python

Algorithms for lowess written in low level languages aren’t hard to find. In addition to Cleveland’s Fortran implementation, there’s also a C version used by R (which is basically a direct translation of Cleveland’s, but without all the pesky commenting to let you know what it’s doing).

The statsmodels version, though, is nicely organized: broken into sub-functions with clear names, and exploiting vectorized operations. But its slowness is not only because it doesn’t exploit the delta trick. It also runs some expensive operations, like a call to SciPy’s lstsq function in each tight loop.

So, in addition to adding the delta trick, we’d like to speed up those calculations in the tight loop (part 3 in the list above) as much as possible. Luckily, Cython lets us split the difference.

My Cython version of lowess is in my github repo, here, in the file cylowess.py. There’s also an IPython notebook demonstrating it in action, and files comprising a testing suite, comparing its output to R’s.

Let’s take a look at some real squiggly data to see how it works. The Silverman motorcycle collision data, which is available as mcycle in R’s MASS package, is great test data for non-parametric curve fitting procedures. In addition to not having any simple parametric shape, it’s got some edge case issues that can cause problems, like repeated x-values.

This plot compares my lowess implementation with statsmodels’ and R’s:

The aggregate difference between R’s lowess and mine?

print('R and New Lowess MAD: %5.2e' %
      np.mean(np.abs(r_lowess['y'] - new_lowess[:, 1])))


R and New Lowess MAD: 1.62e-13

So it looks like it works.

Now let’s look at some timings. I’ll create some test data: 10,000 points, where x is uniformly distributed on [0, 20], and y = sin(x) + N(0, 0.5).

Statsmodels’ lowess:

%timeit -n 3 smlw.lowess(y, x)


3 loops, best of 3: 22.8 s per loop

The new Cythonized lowess:

%timeit -n 3 cyl.lowess(y, x)


3 loops, best of 3: 10.8 s per loop

This is without the delta trick. Skimming the fat off of those tight-looped operations and Cythonizing them cut the run time in half. 11 seconds still sucks, though, so let’s see what delta gets us.

delta = (x.max() - x.min()) * 0.01
%timeit -n 3 cyl.lowess(y, x, delta = delta)


3 loops, best of 3: 125 ms per loop

Much better. That’s the kind of time skipping 36,000 weighted least-squares calculations will save you. Given that this is some curvy data, is all this linear interpolation acceptable? I’ll re-run both with a better level of the frac parameter; the default is 2/3, but I’ll reduce it to 1/10 to use smaller neighborhoods in the regression and allow for more curvature. Here’s the plot:

sm_lowess = smlw.lowess(y, x, frac = 0.1)
new_lowess = cyl.lowess(y, x, frac = 0.1, delta = delta)

Which looks just as good as the non-interpolated version, but doesn’t leave you twiddling your thumbs.

Conclusion

After all this, we have a version of lowess that’s competitive with R’s lowess function. R also has a much richer loess function, for which there’s no real statsmodels equivalent. loess is a full-blown class from which one can make predictions and compute confidence intervals, among other things. It also allows for fitting a higher-dimensional surface, not just a curve. But I have a day job, so that’s all for some other time. This kind of simple lowess is typically enough for most needs.

With this obsessive compulsive diversion into the guts of lowess out of the way, I’ll wrap up Chapter 2 of MLFH in my next post.

Machine Learning for Hackers Chapter 2, Part 1: Summary stats and density estimators

2012-05-01T04:00:00-04:00

Chapter 2 of MLFH summarizes techniques for exploring your data: determining data types, computing quantiles and other summary statistics, and plotting simple exploratory graphics. I’m not going to replicate it in its entirety; I’m just going to hit some of the more involved or interesting parts. The IPython notebook I created for this chapter, which lives here, contains more code than I’ll present on the blog.

This part’s highlights:

Pandas objects, as we’ve seen before, have methods that provide simple summary statistics.
The plotting methods in Pandas let you pass parameters to the Matplotlib functions they call. I’ll use this feature to mess around with histogram bins.
The gaussian_kde (kernel density estimator) function in scipy.stats.kde provides density estimates similar to R’s density function for Gaussian kernels. The kdensity function, in statsmodels.nonparametric.kde, provides that and other kernels, but given the state of statsmodels’ documentation, you would probably only find this function by accident. It’s also substantially slower than gaussian_kde on large data. [Not quite so! See update at the end.]

Height and weight data

The data analyzed in this chapter are the sexes, heights, and weights of 10,000 people. The raw file is a CSV that I import using read_table in Pandas:

heights_weights = read_table('data/01_heights_weights_genders.csv',
                             sep = ',', header = 0)

Inspecting the data with head,

print heights_weights.head(10)

gives us:

 Gender    Height     Weight
0  Male 73.847017 241.893563
1  Male 68.781904 162.310473
2  Male 74.110105 212.740856
3  Male 71.730978 220.042470
4  Male 69.881796 206.349801
5  Male 67.253016 152.212156
6  Male 68.785081 183.927889
7  Male 68.348516 167.971110
8  Male 67.018950 175.929440
9  Male 63.456494 156.399676

So it looks like heights are in inches, and weights are in pounds. It also looks like the dataset is evenly split between men and women, since

heights_weights.groupby('Gender')['Gender'].count()

results in:

Gender
Female 5000
Male 5000

The data are simple, clean, and appear to have imported correctly. So, we can start looking at some simple summaries.

Numeric summaries, especially quantiles

The first part of Chapter 2 covers the basic summary statistics: means, medians, variances, and quantiles. The authors hand-roll the mean, median, and variance functions to see how each is calculated. All of these methods are available as methods to Pandas series, or as NumPy functions (which are typically what’s called by equivalent Pandas methods).
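For reference, hand-rolled versions of these three statistics might look like the sketch below. The function names are made up here and this is not the book's code. Note that the variance divides by n - 1 (the sample variance), which is what R's var and the pandas var method do by default, whereas NumPy's np.var divides by n unless you pass ddof=1.

```python
def my_mean(values):
    """Arithmetic mean."""
    return sum(values) / float(len(values))

def my_median(values):
    """Middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0

def my_var(values):
    """Sample variance: squared deviations from the mean over n - 1."""
    m = my_mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

sample = [1, 2, 3, 4, 10]
print(my_mean(sample), my_median(sample), my_var(sample))
```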

The describe method of Pandas series and data frames, which we saw in Part 3 of Chapter 1, gives summary statistics. The summary stats for the height variable are:

heights = heights_weights['Height']
heights.describe()

count 10000.000000
mean 66.367560
std 3.847528
min 54.263133
25% 63.505620
50% 66.318070
75% 69.174262
max 78.998742

The heights all lie within a reasonable range, with no apparent outliers from bad data. The default quantile range in describe is 50%, so we get the 75th and 25th percentiles. This can be changed with the percentile_width argument; for example, percentile_width = 90 would give the 95th and 5th percentiles.

There doesn’t seem to be a direct analog to R’s range function, which calculates the difference between the maximum and minimum value of a vector, nor to its quantile function, which can calculate quantiles at any given series of probabilities. These are easy enough to replicate, though.

Note: Nathaniel Smith, in comments, points out that R’s range function doesn’t do this either, but just returns the min and max of a vector. There is a function for this in NumPy, though: the my_range function below gives the same result as would np.ptp(heights.values). ptp is the “peak-to-peak” (min-to-max) function.

Range is trivial:

def my_range(s):
    '''
    Difference between the max and min of an array or Series
    '''
    return s.max() - s.min()

Calling this, we get a range of 79.00 − 54.26 = 24.74 inches.
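A quick way to convince yourself that the hand-rolled range and NumPy's peak-to-peak function agree is to run both on a small array. This is only a sketch on made-up values (the demo array is an assumption), written for Python 3:

```python
import numpy as np

def my_range(s):
    """Difference between the max and min of an array or Series."""
    return s.max() - s.min()

# Made-up values, chosen only to resemble heights in inches.
demo = np.array([54.26, 66.32, 69.17, 79.00])

hand_rolled = my_range(demo)
peak_to_peak = np.ptp(demo)  # NumPy's built-in max-minus-min
print(hand_rolled, peak_to_peak)
```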

Next, a quantiles function to mimic R’s. We can just make a wrapper around the quantile method, mapping it along a sequence of provided probabilities.

def my_quantiles(s, prob = (0.0, 0.25, 0.5, 0.75, 1.0)):
    '''
    Calculate quantiles of a series.

    Parameters:
    -----------
    s : a pandas Series
    prob : a tuple (or other iterable) of probabilities at
           which to compute quantiles. Must be an iterable,
           even for a single probability (e.g. prob = (0.50,)
           not prob = 0.50).

    Returns:
    --------
    A pandas series with the probabilities as an index.
    '''
    q = [s.quantile(p) for p in prob]
    return Series(q, index = prob)

Note that the default argument gives quartiles. We can get deciles by calling:

print my_quantiles(heights, prob = arange(0, 1.1, 0.1))

which spits out:

0.0 54.263133
0.1 61.412701
0.2 62.859007
0.3 64.072407
0.4 65.194221
0.5 66.318070
0.6 67.435374
0.7 68.558072
0.8 69.811620
0.9 71.472149
1.0 78.998742

Note: the quantiles function I’ve written is a little awkward when dealing with a single quantile. Because the list comprehension that computes the quantiles requires that the prob argument be an iterable, you would have to pass a list, tuple, array or other iterable with a single value. You can’t just pass it a float. I’ve hit this issue a few times writing Python functions–where it’s difficult to make code robust to both iterable and singleton arguments. If anyone has tips on this (should I really be doing type checking?), I’d be thrilled to hear them.
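One possible answer, offered as a sketch rather than the definitive fix: let NumPy normalize the argument with np.atleast_1d, which turns a bare float into a one-element array and leaves lists, tuples, and arrays as one-dimensional arrays. The quantiles_any name and the demo series are hypothetical, and the code assumes Python 3 with a recent pandas.

```python
import numpy as np
import pandas as pd

def quantiles_any(s, prob=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Quantiles of a Series; prob may be a single float or an iterable."""
    # A scalar becomes a length-1 array; iterables become 1-d arrays.
    probs = np.atleast_1d(prob).astype(float)
    return pd.Series([s.quantile(p) for p in probs], index=probs)

demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
single = quantiles_any(demo, 0.5)            # bare float now works
pair = quantiles_any(demo, (0.25, 0.75))     # iterables still work
print(single)
print(pair)
```

This avoids explicit type checking: the function never asks what prob is, it just coerces it to the one shape the loop needs.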

Histograms

Next the authors mess around with histograms and density plots to explore the distribution of the data. Noting that different bin sizes for histograms can affect how we perceive the data’s distribution, they plot histograms for a few different bin widths.

In Matplotlib, bins are not specified by their width, as is possible in ggplot. We can either give Matplotlib the number of bins we want it to plot, or specify the actual bin-edge locations. It’s not difficult to translate a desired bin width into either one of these types of argument. I’ll provide the sequence of bins.
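One detail to watch when turning a width into edges: np.arange leaves out its stop value, so edges built as np.arange(min, max, width) generally end short of the maximum, and the largest observations can land outside the last bin. The sketch below uses a hypothetical helper, edges_from_width, on made-up numbers, and checks the edges with np.histogram rather than a plot. As far as I understand it, Matplotlib's hist treats explicit edges the same way, but that part is an assumption.

```python
import numpy as np

def edges_from_width(values, width):
    """Bin edges of the given width that cover min(values)..max(values)."""
    lo, hi = np.min(values), np.max(values)
    # Enough whole bins to reach the maximum, and at least one.
    n_bins = max(int(np.ceil((hi - lo) / width)), 1)
    return lo + width * np.arange(n_bins + 1)

demo = np.array([60.2, 61.7, 63.1, 64.9, 65.0])

edges = edges_from_width(demo, 1.0)
counts, _ = np.histogram(demo, bins=edges)

# For comparison: edges that stop before the maximum.
short_edges = np.arange(demo.min(), demo.max(), 1.0)
short_counts, _ = np.histogram(demo, bins=short_edges)
print(counts.sum(), short_counts.sum())
```

With the covering edges every observation is counted; with the short edges the top of the data is silently dropped.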

First, 1-inch bins:

bins1 = np.arange(heights.min(), heights.max(), 1.0)
heights.hist(bins = bins1, fc = 'steelblue')

Note how I’m using the Pandas hist method, which, using a **kwargs argument, can pass parameters to the Matplotlib plotting functions. Next, 5-inch bins:

bins5 = np.arange(heights.min(), heights.max(), 5.)
heights.hist(bins = bins5, fc = 'steelblue')

And finally, 0.001-inch bins:

bins001 = np.arange(heights.min(), heights.max(), .001)
heights.hist(bins = bins001, fc = 'steelblue')
plt.savefig('height_hist_bins001.png')

These all match the figures in the book, so I’m probably doing it right.

Kernel density estimators in SciPy and statsmodels

R’s density function computes kernel density estimates. The default kernel is Gaussian, but you can also use Epanechnikov, rectangular, triangular, biweight, or cosine kernels.

In Python, it looks like you have two options for kernel density. The first is gaussian_kde from the scipy.stats.kde module. This provides a Gaussian kernel density estimate only. The other is kdensity in the statsmodels.nonparametric.kde module, which provides alternative kernels similar to R.

I actually wasn’t aware of the kdensity function for a while, until I stumbled upon a mention of it on a mailing list archive. I couldn’t find it in the statsmodels documentation. Statsmodels, generally, seems to have a lot of undocumented functionality; not surprising for a young, rapidly-expanding project.

Playing with both functions, I found some pros and cons for each. Obviously kdensity provides an option of kernels, whereas gaussian_kde does not. kdensity also generates simpler output than gaussian_kde. kdensity provides a tuple of two arrays–the grid of points at which the density was estimated, and the estimated density of those points. gaussian_kde provides an object that you have to evaluate on a set of points to get an array of estimated densities. So essentially, you’re calling it twice, and I don’t see much point to that redundancy.

On the other hand, kdensity gets much slower than gaussian_kde as the number of points increases. For the 10,000 points in the heights array, gaussian_kde took about 3.3 seconds to output the array of estimated densities. kdensity wasn’t finished after several minutes. I haven’t looked carefully at the source code of the two functions, but I assume kdensity’s problem is that at some point it creates a temporary NxN array, which for N = 10,000 is going to gum things up. Setting the gridsize argument in kdensity to something even as large as 5000 cuts the size of the temporary array in half, and reduces the running time to about 3 seconds.
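To see why grid size matters for a brute-force estimator, here is a hand-rolled Gaussian KDE in plain NumPy. It is not the statsmodels or SciPy implementation, just a sketch of the general technique; the function name, the fixed bandwidth, and the synthetic sample are all assumptions. The intermediate array has one row per grid point and one column per observation, so halving the grid halves the memory.

```python
import numpy as np

def naive_gaussian_kde(data, grid, bandwidth):
    """Gaussian kernel density estimate evaluated on grid by brute force."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Temporary array of shape (len(grid), len(data)): this is the cost.
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Average the kernels centered on each observation.
    return kernel.sum(axis=1) / (len(data) * bandwidth)

rng = np.random.default_rng(0)
sample = rng.normal(66.4, 3.8, size=2000)    # synthetic heights
grid = np.linspace(sample.min() - 15, sample.max() + 15, 400)

dens = naive_gaussian_kde(sample, grid, bandwidth=0.8)
area = np.sum(dens) * (grid[1] - grid[0])    # Riemann-sum sanity check
print(dens.shape, area)
```

Because the grid is already sorted, the resulting curve can be plotted directly against it with no extra sorting step.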

This is probably worth exploring in a future post. In the meantime, I’m going to stick with gaussian_kde and plot some densities. [Note: See the update below. I’ve updated the IPython notebook for this chapter to use Statsmodels’ KDE class instead of SciPy.]

First, heights:

density = kde.gaussian_kde(heights.values)

fig = plt.figure()
plt.plot(np.sort(heights.values),
density(np.sort(heights.values)))

The sorting of the heights array is to make the lines connect nicely. Otherwise, the lines will connect from point-to-point in the order they occur in the array; we want the density curve to connect points left-to-right.

Notice the slight bi-modality in the figure. What we’re likely seeing is a mixture of male and female distributions. We can plot those separately.

# Pull out male and female heights as arrays over which to compute densities
heights_m = heights[heights_weights['Gender'] == 'Male'].values
heights_f = heights[heights_weights['Gender'] == 'Female'].values
density_m = kde.gaussian_kde(heights_m)
density_f = kde.gaussian_kde(heights_f)

fig = plt.figure()
plt.plot(np.sort(heights_m), density_m(np.sort(heights_m)), label = 'Male')
plt.plot(np.sort(heights_f), density_f(np.sort(heights_f)), label = 'Female')
plt.legend()

We also have a weight variable we can plot.

weights_m = heights_weights[heights_weights['Gender'] == 'Male']['Weight'].values
weights_f = heights_weights[heights_weights['Gender'] == 'Female']['Weight'].values
density_m = kde.gaussian_kde(weights_m)
density_f = kde.gaussian_kde(weights_f)

fig = plt.figure()
plt.plot(np.sort(weights_m), density_m(np.sort(weights_m)), label = 'Male')
plt.plot(np.sort(weights_f), density_f(np.sort(weights_f)), label = 'Female')
plt.legend()

To finish up, let’s move each density plot to its own subplot, to match Figure 2-11 on page 51.

fig, axes = plt.subplots(nrows = 2, ncols = 1, sharex = True, figsize = (9, 6))
plt.subplots_adjust(hspace = 0.1)
axes[0].plot(np.sort(weights_f), density_f(np.sort(weights_f)),
label = 'Female')
axes[0].xaxis.tick_top()
axes[0].legend()
axes[1].plot(np.sort(weights_m), density_m(np.sort(weights_m)),
label = 'Male')
axes[1].legend()

Here I’m using the subplots function, same as in Part 5 of Chapter 1, and sharing the x-axis to make clear the difference between the distributions’ central tendencies.

Conclusion

I’ll wrap up Chapter 2 in the next post, where I’ll look at lowess smoothing in Statsmodels, and get a little taste of logistic regression.

Update!

Statsmodels honcho Skipper Seabold sets me straight in the comments. While the kdensity function is slow, statsmodels has an implementation which uses Fast Fourier Transforms for Gaussian kernels and is substantially faster than SciPy’s gaussian_kde.

For the heights array:

# Create a KDE object
heights_kde = sm.nonparametric.kde.KDE(heights.values)

# Estimate the density by fitting the object (default Gaussian kernel via FFT)
heights_kde.fit()

We can then plot this vector of estimated densities, heights_kde.density, against the points in heights_kde.support.

I’ve updated the IPython notebook for this chapter to use Statsmodels’ KDE throughout, so check it out for more detail.

Machine Learning for Hackers Chapter 1, Part 5: Trellis graphs.

2012-04-27T04:00:00-04:00

Introduction

This post will wrap up Chapter 1 of MLFH. The only task left is to replicate the authors’ trellis graph on p. 26. The plot is made up of 50 panels, one for each U.S. state, with each panel plotting the number of UFO sightings by month in that state.

The key takeaways from this part are, unfortunately, a bunch of gripes about Matplotlib. Since I can’t transmit, blogospherically, the migraine I got over the two afternoons I spent wrestling with this graph, let me just try to succinctly list my grievances.

Out-of-the-box, Matplotlib graphs are uglier than those produced by either lattice or ggplot in R: The default color cycle is made up of dark primary colors. Tick marks and labels are poorly placed in anything but the simplest graphs. Non-data graph elements, like bounding boxes and gridlines, are too prominent and take focus away from the data elements.
The API is deeply confusing and difficult to remember. You have various objects that live in various containers. To make adjustments to graphs, you have to remember what container the thing you want to adjust lives in, remember what the object and its property is called, and then remember how Matplotlib’s getting and setting procedures work.
The pyplot set of commands is supposed to provide convenience functions, but these abstractions seem to leak early and often. Once you need to make finer adjustments, you’re back to the underlying API nightmare.
The documentation is both clear and comprehensive. But where it is clear, it is not comprehensive, and where it is comprehensive, it is not clear. For example, the Artist tutorial is a pretty clear big picture of Matplotlib’s API. Once you need any detail, though, you’re dealing with this.
Creating trellis graphs requires way more manual work than in either lattice or ggplot. The subplot functionality of Matplotlib is highly flexible, but in most cases, the user is going to want the code to do the thinking for them and not manually place every graph (or do a bunch of bookkeeping with loops).

With that off my chest, let me say that I have a ton of respect for Matplotlib’s developers. It is a massively complex library, and clearly very powerful and flexible. I have no doubt that Matplotlib gurus can do amazing things. I’m just trying to convey the non-guru’s perspective. Graphing libraries are difficult to design because they must be incredibly flexible and allow users to manipulate all of the myriad parts of the graph, but at the same time, they can’t overwhelm users with detail when the flexibility isn’t needed. How anyone does it–especially in an open-source project–I don’t know.

It’s also possible that I’m just Doing it Wrong, and in fact there are easy ways to do all the things I’ve complained about. If that’s the case, I hope someone reading this will enlighten me.

Trellis graphs in R and Matplotlib

In my opinion, trellis graphs are the “killer app” of multivariate data visualization. I produce trellis line and scatter plots more than almost any other kind of visualization. As such, it’s important for me to be able to easily produce quality trellis graphs.

Trellis graphs are easy to create in R. The two most popular high-level graphing packages in R, lattice and ggplot, both have simple methods for creating them. Indeed, creating trellis graphs is lattice’s raison d’etre, and the functionality and interface design in the package revolves around dealing with trellis graph and the panels within. In ggplot, the trellis is not such a central focus, but it still has easy-to-use methods for making and modifying trellis graphs (which it refers to as “faceted” graphs).

For example, the graph we want to make is a one liner in lattice:

xyplot(sightings ~ year_month | us_state, data = sightings_counts,
type = 'l', layout = c(5, 10))

Once you get the hang of R’s formula expressions–which doesn’t take long–this is an easy, expressive way to create a trellis graph. The authors use ggplot, which I find a bit less natural, but is still very easy.

Part of what makes trellis graphs so straightforward in R is that the concept of factors, and their use as conditioning variables, is so well-baked into the language. Matplotlib is essentially a plotting utility for NumPy, so it’s designed to plot arrays, not rich data structures. Without factors, without a notion of conditioning, and to a lesser extent, without formulas, trellis graphs just don’t come naturally.

Pandas, though, has structures that, if a plotting library was designed to understand them, might provide for easy trellis-ing. Even though Pandas doesn’t have factors, I could see, for example, a plot method for Pandas’ groupby objects that produces trellis graphs by default.

Plotting the UFO trellis graph

With all that throat-clearing out of the way, let’s get down to plotting the graph. The authors plot 50 state panels, with a 10-by-5 layout. Since I’ve included D.C. in my data, I have to plot 51 panels. You can fit this in a 17-by-3 layout, but that’s pretty awkward. I’d like to have 4 columns instead, but to fit 51 graphs, I’ll need 13 rows. That’s 52 subplots, meaning the 13th row won’t have graphs in every column, only the first three. I’m going to call these last three graphs the hangover graphs, and I’m going to store their count in its own variable to help inform the layout procedures I run later.

Here are the layout parameters, then:

nrow = 13; ncol = 4; hangover = len(us_states) % ncol
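The row count does not have to be hard-coded; it follows from the number of panels and the number of columns. A small sketch of that arithmetic, using a hypothetical helper name (trellis_layout) and integer-only math so it behaves the same under Python 2 and 3:

```python
def trellis_layout(n_panels, ncol):
    """Return (nrow, hangover) for n_panels laid out in ncol columns."""
    # Ceiling division without floats: -(-a // b).
    nrow = -(-n_panels // ncol)
    # Number of panels in the last row when it is only partly filled
    # (0 means the last row is full).
    hangover = n_panels % ncol
    return nrow, hangover

print(trellis_layout(51, 4))   # (13, 3)
print(trellis_layout(50, 5))   # (10, 0)
```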

Now let me get the “framing” objects in place: the figure, the subplot layout, and the titles.

fig, axes = plt.subplots(nrow, ncol, sharey = True,
figsize = (9, 11))

fig.suptitle('Monthly UFO Sightings by U.S. State\nJanuary 1990 through August 2010',
             size = 12)
plt.subplots_adjust(wspace = .05, hspace = .05)

The subplots function is some recently implemented syntactic sugar around Matplotlib’s subplot functionality (see the section on “Easy Pythonic Subplots” here). The sharey argument tells Matplotlib that the panels should all share the same y axis. Technically I want it to share an x axis too, but Matplotlib kept throwing errors when I tried to use the sharex argument with dates on the x-axis. Given the data, the panels will end up sharing an x axis anyway, so this argument isn’t necessary. The function returns two objects: fig refers to the overall figure container, and axes is an array containing each of the subplot/panel objects – so axes[0, 0] is the first panel.

Now the rest of the code:

num_state = 0
for i in range(nrow):
    for j in range(ncol):
        xs = axes[i, j]

        xs.grid(linestyle = '-', linewidth = .25, color = 'gray')

        if num_state < 51:
            st = us_states[num_state]
            sightings_counts.ix[st, ].plot(ax = xs, linewidth = .75)
            xs.text(0.05, .95, st.upper(), transform = axes[i, j].transAxes,
                    verticalalignment = 'top')
            num_state += 1
        else:
            # Make extra subplots invisible
            plt.setp(xs, visible = False)

        xtl = xs.get_xticklabels()
        ytl = xs.get_yticklabels()

        # X-axis tick labels:
        # Turn off tick labels for all but the bottom-most
        # subplots. This includes the plots on the last row, and
        # if the last row doesn't have a subplot in every column
        # put tick labels on the next row up for those last
        # columns.
        #
        # Y-axis tick labels:
        # Put left-axis labels on the first column of subplots,
        # odd rows. Put right-axis labels on the last column
        # of subplots, even rows.
        if i < nrow - 2 or (i < nrow - 1 and (hangover == 0 or j <= hangover - 1)):
            plt.setp(xtl, visible = False)
        if j > 0 or i % 2 == 1:
            plt.setp(ytl, visible = False)
        if j == ncol - 1 and i % 2 == 1:
            xs.yaxis.tick_right()

        plt.setp(xtl, rotation=90.)

Let’s walk through this:

First, set up a counter to keep track of what state we’re plotting. This is a little un-Pythonic, but given what I do inside the loop, I couldn’t think of a better way.

Now, for each row, column in the 13-by-4 array of panels (and this code works for any row/column combination, as long as rows × columns >= 51):

Assign the panel (“axis”) associated with this row, column pair to its own variable.
Draw gray gridlines in the panel.
Go to the state in the us_states list corresponding to the current value of the state counter.
Select this state out of the sightings_counts series and plot its data in the current panel. Then, put a text label with the state’s initials in the upper left corner.
If I’ve gone through all the states, and the state counter variable has reached 51, then make the panel invisible.
Assign the x- and y-axis ticklabel objects for the current panel to variables. We’re going to manipulate their attributes.
Now some tricky stuff. I want to do the following things to the tick labels:
- I want to turn off the x-axis tick labels for all but the bottom-most panels, taking into account the hangover.
- I want to alternate the y-axis tick labels so that they are on the left for odd-numbered rows, and on the right for even-numbered rows. Having labels on both sides makes the graph easier to read, but having them on the same side on every row leads to overcrowding and overlapping.
Finally, I want the x-axis tick labels rotated 90 degrees. This gives space to put as many as possible on the graph without overcrowding (here, we can label every two years).
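Much of the bookkeeping in the steps above can be separated from the plotting itself. The sketch below is one way to do that, not the approach used in the post: a hypothetical generator, panel_positions, walks the panels in row-major order with divmod (so no hand-maintained counter) and flags the panels that have nothing directly beneath them, which is where the x-axis tick labels belong.

```python
def panel_positions(n_panels, ncol):
    """Yield (k, i, j, is_bottom) for each panel in row-major order.

    k is the panel number, (i, j) its row and column, and is_bottom is
    True when no panel sits directly below it.
    """
    for k in range(n_panels):
        i, j = divmod(k, ncol)
        # The panel below would be number k + ncol, if it existed.
        is_bottom = k + ncol >= n_panels
        yield k, i, j, is_bottom

bottom = [(i, j) for k, i, j, is_bottom in panel_positions(51, 4)
          if is_bottom]
print(bottom)   # [(11, 3), (12, 0), (12, 1), (12, 2)]
```

For 51 panels in 4 columns this picks out the three hangover panels in the last row plus the fourth column of the row above, the same panels the tick-label condition in the loop leaves visible.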

Here’s the result:

Not bad, I think. And maybe even better than the out-of-the-box version you get with ggplot. But it was a tremendous amount of work, and I don’t know if I’m going to be able to decipher this code six months from now. It’s just a tremendous amount of bookkeeping I have to do keeping track of what panel I’m in and where it’s located in the layout. There ought to be a function that does this for me.

Conclusion

So that’s it for Chapter 1 of MLFH. Overall, I was pleasantly surprised by Pandas and how easy it made loading, cleaning, and manipulating data. While there are a couple of things from R that I missed, there were several other things I thought were easier and more flexible with Pandas.

On the other hand, going from lattice and ggplot to Matplotlib is like taking a time machine back to the early ‘90s. After reading the documentation and experimenting for several days, I still don’t think I’m sure how it works. Hopefully I’ll get the hang of it as I go forward.

My take is the Python data analysis community is aware of its “visualization gap” vis-a-vis R, and there are tools in the works to solve this issue. I’ve heard whispers about “ggplot for Python” or “D3 for Python.” Everything is still in the early stages, and it will probably be a while before better tools are available.

I’m also a little uncertain about the “x for Python” notion of creating graphing libraries. Matplotlib’s pyplot is essentially a “Matlab for Python” approach to graphics, and I don’t know that works to its credit. I’d much rather have a solid, Pythonic graphing library that lets me easily make publication-quality versions of the workhorse data graphics, than have something that apes the latest faddish graphing tool. There are a lot of smart people working on the problem, though, and I’m really excited to see what happens.

Machine Learning for Hackers, Chapter 1, Part 4: Data aggregation and reshaping.

2012-04-26T04:00:00-04:00

Introduction

In the last part I made some simple summaries of the cleaned UFO data: basic descriptive statistics and histograms. At the very end, I did some simple data aggregation by summing up the sightings by date, and plotted the resulting time series. In this part, I’ll go further with the aggregation, totalling sightings by state and month.

The takeaway from this part is that Pandas dataframes have some powerful methods for aggregating and manipulating data. I’ll show groupby, reindex, hierarchical indices, and stack and unstack in action.

The shape of data: the long and the wide of it

The first step in aggregating and reshaping data is to figure out the final form you want the data to be in. This form is basically defined by content and shape.

We know what we want the content to be: an entry in the data should give the number of sightings in a state/month combination.

We have two choices for the shape: wide or long. The wide version of this data would have months as the rows and states as the columns; it would be a 248 by 51 table with the number of sightings as entries. This is a really natural way to shape the data if we were presenting a table, for example.

One of the things I’ve picked up from my years of using R, though, is a preference for long data. This is because R’s factors and formulas with easy conditioning make it easier to work with long data. The most common example is using lattice plots. To generate a lattice plot of y over x with panels defined by a level of the variable f, you just call xyplot(y ~ x | f). For this to work though, the data must be long, with f a column of factors, and the x column will likely be some values repeated for each level of f. This seems kind of redundant and unwieldy when you’re used to tables and spreadsheets, but it becomes more natural when you start working with tools like lattice or ggplot, using more panel data, or doing more split-apply-combine or map-reduce types of procedures.

Because Pandas dataframes are so organized around indices, and because Pandas allows for hierarchical indexing, we’ll find that it will be a good strategy to shape data in a way that provides for informative indices. This will give us access to a host of powerful methods to manipulate the dataframe. In this case, as we’ll see, by making the data long, we’ll be able to push most of the information into the dataframe’s index.

The long version of our UFO data would have rows defined by a state/month pair, and a column recording the number of sightings for that pair. In R–as the authors do in the book–you’ll have a dataframe with three columns. The first two are the factor variables USState and YearMonth. (I’m not actually sure these are technically factor variables in the authors’ implementation, but they are conceptually). The third is the sightings count.

In Pandas, since the state and month pairs identify unique observations, it’s natural to make these indices of the dataframe. Pandas supports hierarchical indexing by using unique tuples–here a tuple would be (state, month).
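To make the idea concrete, here is a toy long-form series with a (state, month) hierarchical index. It is a sketch with made-up counts, written for Python 3 and a recent pandas; the level names simply mirror the column names used in this post.

```python
import datetime as dt
import pandas as pd

# Three made-up state/month cells.
counts = pd.Series(
    [1, 2, 1],
    index=pd.MultiIndex.from_tuples(
        [("ak", dt.date(1990, 1, 1)),
         ("ak", dt.date(1990, 3, 1)),
         ("al", dt.date(1990, 1, 1))],
        names=["us_state", "year_month"]),
)

ak_only = counts.loc["ak"]                       # every month for one state
one_cell = counts.loc[("al", dt.date(1990, 1, 1))]  # a single state/month
print(ak_only)
print(one_cell)
```

Selecting on the outer level returns a shorter series indexed by month alone, which is exactly the shape a per-state time series plot needs.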

Aggregating the data

Now that we’ve decided the form of the data, let’s implement all this.

The first step is to create a year-month variable. I do this just by taking the date of each sighting, and calculating a new date with the same year and month, but set to the first of the month. This is just another map operation.

ufo_us['year_month'] = ufo_us['date_occurred'].map(lambda x:
dt.date(x.year, x.month, 1))

Note: The authors approach this problem a little differently, using R’s strftime function to turn the dates into a string of the form YYYY-MM. I prefer to keep them numeric (it makes time series plots more sensible), but either way works. My choice of the first day of the month is arbitrary, and just serves to collect the dates into groups.

Then we want to sum up the sightings by state and month. To do this, I’ll use Pandas’ groupby method. groupby, as you’d expect, works like SQL’s GROUP BY statement.

sightings_counts = ufo_us.groupby(['us_state',
'year_month'])['year_month'].count()

You can almost read this statement as an SQL query: SELECT COUNT(year_month) GROUP BY us_state, year_month.

The groupby method applied to the data frame results in a DataFrameGroupBy object, which isn’t much to look at but contains all the information we need to perform calculations by groups of the variables we passed to the method. Calling the year_month column results in a similar SeriesGroupBy object. Finally, calling the count method counts how many non-null observations of year_month there are in each level. The final output is a Series of the counts with a hierarchical index of the groupby variables.
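The same chain of calls can be watched on a four-row toy frame. This is a sketch on made-up data (string months instead of dates, to keep it short), assuming Python 3 and a recent pandas:

```python
import pandas as pd

# Four made-up sightings: two in the same state and month.
sightings = pd.DataFrame({
    "us_state":   ["ak", "ak", "ak", "al"],
    "year_month": ["1990-01", "1990-01", "1990-03", "1990-01"],
})

grouped = sightings.groupby(["us_state", "year_month"])   # DataFrameGroupBy
counts = grouped["year_month"].count()                     # Series of counts
print(counts)
```

Only the three state/month combinations that occur in the data show up in the result, which is the gap the indexing tricks later in this post are meant to fill.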

To aggregate their data in R, the authors use the ddply function, which provides similar groupby-type functionality. I find the plyr functions less intuitive and expressive than Pandas’ syntax. But, the plyr functions are a big improvement over R’s apply functions for complicated calculations.

As the authors do on p. 22, let’s check out the first few Alaska sightings.

print 'First few AK sightings in data:'
print sightings_counts.ix['ak'].head(6)

This spits out:

First few AK sightings in data:
year_month
1990-01-01 1
1990-03-01 1
1990-05-01 1
1993-11-01 1
1994-02-01 1
1994-11-01 1

Note that I have one more observation than the authors do–February 1994. As discussed in Part 2, the authors’ cleaning methodology is going to cut any observations where the U.S. city part of the location data has commas in it. My methodology won’t lose those observations. That seems to be what’s happened here. Looking at that record with:

print 'Extra AK sighting, not on p. 22:'
print ufo_us[(ufo_us['us_state'] == 'ak') &
             (ufo_us['year_month'] == dt.date(1994, 2, 1))] \
            [['year_month','location']]

shows that indeed, my extra observation has a comma in the city record:

Extra AK sighting, not on p. 22:
year_month location
5508 1994-02-01 Savoonga,St. Lawrence Island, AK

Indexing tricks

When we perform the groupby calculations, the resulting series is missing rows where there were no UFO sightings in a state/month. This makes sense of course – groupby goes through the data, finds all the state/month combinations, and turns them into discrete levels within which to perform calculations. If there are no sightings in a state in a month, groupby won’t know to turn that combination into a level.

So, basically, we want to add those levels back into the data and set the associated sightings count to zero. There are two ways to do this in Pandas. The first uses Pandas’ reindex methods. I’ll create a “full” index with every combination of states and months:

ym_list = [dt.date(y, m, 1) for y in range(1990, 2011)
            for m in range(1, 13)
            if dt.date(y, m, 1) <= dt.date(2010, 8, 1)]

full_index = zip(np.sort(us_states * len(ym_list)), ym_list * len(us_states))
full_index = MultiIndex.from_tuples(full_index, names =
                                    ['states', 'year_month'])

The first line is just a list comprehension that creates a list of all the months in the data, from January 1990 to August 2010. The second line creates 51×248 tuples of (state, month) pairs. (I created the list of states, us_states, in Part 2.) The third line creates a Pandas hierarchical index out of these tuples. Hierarchical indices in Pandas can take names that label the levels of the index.

Next, I’ll reindex the sightings_counts series with this full index. Pandas will conform the dataset to the new index we give it, dropping elements whose index level is not in the new index, and making elements for new index levels not in the original. By default Pandas fills in these new elements with NA, but we can tell it to fill these values with zero, and end up with the series we’re looking for.

sightings_counts = sightings_counts.reindex(full_index, fill_value = 0)
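For what it's worth, later pandas releases include MultiIndex.from_product, which builds every combination of the level values directly and so replaces the zip-and-sort step. A sketch on toy data, assuming Python 3 and a recent pandas; the two states and three string months are made up:

```python
import pandas as pd

# Sparse counts: only combinations that actually occurred.
counts = pd.Series(
    [2, 1, 1],
    index=pd.MultiIndex.from_tuples(
        [("ak", "1990-01"), ("ak", "1990-03"), ("al", "1990-01")],
        names=["us_state", "year_month"]),
)

# Every state paired with every month.
full_index = pd.MultiIndex.from_product(
    [["ak", "al"], ["1990-01", "1990-02", "1990-03"]],
    names=["us_state", "year_month"])

filled = counts.reindex(full_index, fill_value=0)
print(filled)
```

The filled series has one row per state/month pair, with zeros where the original had no entry.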

Stacking and unstacking data

There’s another way to get the full time series out of the groupby calculations. Instead of creating the full index of state/month combinations, I can use a trick using Pandas’ stack and unstack methods. stack and unstack turn data from wide to long and vice versa, similar to the melt and cast methods in R’s reshape2 package.

The idea is to first widen (unstack) the data, so that we have states as rows and months as columns. This will force the data to have the 51×248 entries we’re looking for (assuming that there’s a sighting in at least one state every month between January 1990 and August 2010). For the entries in this data frame where there are no sightings–state/months not present in the long data–Pandas will fill in NA. I’ll tell Pandas to fill it with zero instead, and then stack the data again to put it back in long form. Since there is now a number (sometimes zero) for every state/month pair, this new long dataset will have all the rows we need. Here’s the code:

sightings_counts1 = ufo_us.groupby(['us_state', 'year_month'])['year_month'].count()

sightings_counts1 = sightings_counts1.unstack(1).fillna(0).stack()
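The caveat in the parenthesis above is easy to demonstrate on toy data. In this sketch (made-up counts, Python 3 and a recent pandas assumed), no state has a sighting in 1990-02, so the round trip never creates that month:

```python
import pandas as pd

counts = pd.Series(
    [2, 1, 1],
    index=pd.MultiIndex.from_tuples(
        [("ak", "1990-01"), ("ak", "1990-03"), ("al", "1990-01")],
        names=["us_state", "year_month"]),
)

# Widen: one row per state, one column per month seen in the data.
wide = counts.unstack("year_month").fillna(0)

# Back to long form: one row per state/month pair in the wide table.
roundtrip = wide.stack()
print(wide)
print(roundtrip)
```

The result has four rows (two states by two observed months) rather than six, so the trick only matches the explicit reindex when every month appears at least once somewhere in the data.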

Let’s check that we get the same dataset from both methods:

# Check they're the same shape and values.
print 'Shape using handmade MultiIndex:', sightings_counts.shape
print 'Shape using unstack/stack method:', sightings_counts1.shape
print 'Sum absolute difference:', np.sum(np.abs(sightings_counts1 -
sightings_counts))

I check the sum-of-absolute-differences between the series, instead of checking for strict equality, to give some leeway for floating point error (even though these should be integers, there might be some type conversion that happens through these methods). Either way, looks like we have the same result from both methods:

Shape using handmade MultiIndex: (12648,)
Shape using unstack/stack method: (12648,)
Sum absolute difference: 0.0

Conclusion

I’ve got the data just how I want it to plot time series of UFO sightings by state. There were actually very few lines of code in this part. But those few lines of code were doing a lot of work, and represented one of the toughest parts of working with data: getting it in the right shape. It wasn’t long ago that reshaping data was always and everywhere a huge hassle. It still is in some languages (*cough* SAS *cough* Stata *cough*). The combination of hierarchical indexing and stack and unstack methods in Pandas make doing this in Python actually pleasant.

I’m finally going to wrap up Chapter 1 in the next part, in which I create a plot to match the authors’ trellis plot of sightings time series by state. It’s going to be a real Matplotlib adventure.

Shades of Time: I don’t buy it, and that’s why it’s so great.

2012-04-22T04:00:00-04:00

Over the weekend Drew Conway posted about a data analysis project he’d just completed called Shades of Time. Very briefly, he took a dataset of Time magazine covers from 1923 to March 2012, then used some Python libraries to identify the faces in the covers and identify the skin tone of each face. The result is a really great interactive visualization implemented in d3.js.

From looking at this data, Drew, with some caveats, observes that “it does appear that the variance in skin tones have [sic] changed over time, and in fact the tones are getting darker.” He also notes that there are more faces on covers in later years.

Why I don’t believe it

There’s no real statistical testing done here–no formal quantification of how skin-tone representation on covers is changing over time. Instead, I think he’s drawing his conclusion from the visualization alone, especially the scatterplot in the bottom panel that seems to show more darker tones appearing later in the data (starting in the 70’s, the skin-tone dispersion in his data starts to increase).

He notes that there are difficulties in both identifying faces and skin tones. After going through his analysis, I think these algorithms are fragile enough, and the categorization of faces and skin tones is poor enough, that I don’t really buy his conclusion that cover face diversity is increasing.

For example, I reviewed many of the data classified with a dark skin tone that seemed to be contributing to the visual impression of increasing diversity. A good number of them weren’t faces at all, but objects like guns, or parts of the word “TIME.”

Many others were famous white guys. Here’s a list I made from my cursory review:

James Taylor (1971)
Archie Bunker/Carroll O’Connor (1973)
Joni Mitchell (1974)
Gerald Ford (1974, 1975)
Francisco Franco (1975)
Jimmy Carter (1976)
Queen Elizabeth (1976)
John Irving (1981)
Ronald Reagan (1985)
Willem Dafoe, Charlie Sheen (1987)
Ollie North (1987)
Dan Rather (1988)
Michael Eisner (1988)
Statue In Congress (1990)
Garth Brooks (1992)
Roger Keith Coleman (1992)
Serbian Detention Camp Prisoners (1992)
Michael Crichton (1995)
Bill Clinton (I know he’s the first black president, but I don’t think that should count) (1998)
Monica Lewinsky (1998)
John Travolta (1998)
Slobodan Milosevic (1999)
Ted Kaczynski (1999)
John McCain (2000)
Jerry Levin (2000)
George Bush (2000)
Francis Collins (2000)
Yoda (2002)
Trent Lott (2002)
Joe Wilson (2003)
Brad Pitt (2004)
John Kerry (2004)
George Bush (2004)
Bono (2006)
Bill Gates (2006)
Jesus Christ (2006)
John McCain (2006)
Rick Warren (2008)
Sarah Palin (2008)
Lloyd Blankfein (2009)
Tom Hanks (2010)
Jonathan Franzen (2010)
George Washington (2010)

Now, no classification algorithm is perfect, and these covers are complicated, heterogeneous inputs. But just from eyeballing it, this one seems so inaccurate on this data, that I don’t trust that the observed dispersion is the result of more correctly classified darker faces on covers.

Why it’s still awesome

While I don’t think the classification process here is accurate enough to let us draw inferences about skin tone diversity, the fact that I could come to this conclusion after 30 minutes of poking around on a web site really says some interesting things about the process and presentation of the project.

For one, I think it’s a fantastic use of dynamic visualization. I don’t think any aesthetic aspect of it is novel or noteworthy; instead, I think it’s innovative on a more meta level. Oftentimes we think of visualizations as serving one of two processes. The first is pre-model: exploration of raw data to suggest questions, patterns, or models. The second is post-model: presentation of results or model diagnostics.

I’ve been skeptical of d3 and similar frameworks, because I’ve rarely seen dynamic or interactive graphs that do a much better job at these two types of tasks than static graphs. At least not so much better as to justify the added costs of producing them and delivering them to an audience. Also, a lot of what I’ve seen that’s been represented as cool stuff you can do with d3–or Processing, or whatever–is mostly pretty junk; stuff like busy stream graphs and chord graphs and other things I’d put in the high-effort/low-reward quadrant of Kaiser Fung’s return-on-effort matrix.

The visualization for Shades of Time, though, is impressive to me because it’s not really exploring raw data, or presenting results–instead it’s illustrating the process of analyzing the data. To get the list above, I started from the time series chart at the bottom that seemed to show increasing diversity. Then I noted the points in that chart that I felt were most influencing that conclusion. I could then find them in the scrolling chart on the left, click, see on the right panel what raw data (what image on what cover) generated that point, and determine whether the classifier was giving a meaningful result.

After going through it long enough, I decided there really wasn’t enough meaningful output coming from the classifier for me to comfortably believe Drew’s observation. Nonetheless, I think it’s incredibly novel and useful to have a visualization that lets me so easily do a mini-replication of the analysis. This one lets you walk through the major steps, from raw data (the covers in the right panel) to quantification/classification (the skin tone tiles in the left panel) to aggregation and interpretation (the time series scatter plot on the bottom).

It really makes me rethink some of the possibilities of interactive graphics. This isn’t just a stream graph of box office receipts or the Baby Name Wizard, which are mostly just raw data explorers. I think it suggests a whole different application and conceptual framework for interactive graphics. That is, how do we illustrate to an audience the process by which we went from raw data to conclusions, and let them follow along and investigate that process?

Machine Learning for Hackers Chapter 1, Part 3: Simple summaries and plots.

2012-04-19T04:00:00-04:00

Introduction

See Part 1 and Part 2 for previous work.

In this part, I’ll replicate the authors’ exploration of the UFO sighting dates via histograms. The key takeaways:

The plotting methods in Pandas are easy and useful.
Unlike R Dates, Python datetimes aren’t compatible with a lot of mathematical operations. We’ll see that you can’t apply quantile or histogram methods to them directly.

Quick data summary methods and datetime complications.

For those playing along at home, I’m at p. 19 of the book. The first thing the authors do here is get a statistical summary of the sighting dates in the data, which are recorded in the DateOccurred variable (which I’ve named date_occurred in my code). This is easy in R using the summary function, which provides the minimum, maximum, and quartiles of the data by default.

Pandas has similar functionality, in a method called describe, which gives the same for numeric variables, plus the count of non-null values and the mean and standard deviation. For example:

s1 = Series(np.random.randn(100))
print s1.describe()

outputs what we’d expect from a series of randomly-generated standard normals:

count 100.000000
mean -0.149274
std 1.011230
min -2.521374
25% -0.790867
50% -0.167813
75% 0.596617
max 2.231157

If we apply this to the date_occurred series, though, we get something different.

ufo_us['date_occurred'].describe()

results in:

count 52134
unique 8786
top 1999-11-16 00:00:00
freq 185

because Pandas treats datetime series as non-numeric variables (which they technically are).

Note: To compute quantiles for numeric series, Pandas uses SciPy’s scoreatpercentile function, which in turn relies on a simple linear interpolation function (_interpolate in scipy.stats). datetime objects don’t play well with this function, since when you take the difference between two datetimes you don’t get a number, but instead a timedelta tuple, that you can’t perform mathematical operations on until you unpack it. The min and max methods will work on datetimes, though.
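As a rough illustration of the note above, here is a small sketch in plain Python 3 (standard library only, with made-up dates rather than the UFO data). It shows that subtracting datetimes yields a timedelta, and that one possible workaround is to convert dates to ordinal day numbers before computing order statistics. That workaround is my own assumption for illustration, not what Pandas does internally.

```python
import datetime as dt
import statistics

# Made-up sample dates, for illustration only.
dates = [
    dt.datetime(1995, 10, 9),
    dt.datetime(1999, 11, 16),
    dt.datetime(2003, 5, 1),
    dt.datetime(2007, 8, 30),
    dt.datetime(2010, 1, 15),
]

# The difference between two datetimes is a timedelta, not a plain number.
gap = dates[1] - dates[0]
print(type(gap).__name__)

# min and max work directly, since datetimes can be compared.
earliest, latest = min(dates), max(dates)

# Workaround: map each date to its proleptic Gregorian ordinal
# (0001-01-01 is day 1), take the median of those integers,
# and convert the result back to a date.
ordinals = [d.toordinal() for d in dates]
median_date = dt.date.fromordinal(int(statistics.median(ordinals)))
print(earliest.date(), median_date, latest.date())
```

The same trick would work for other quantiles, at the cost of losing any time-of-day information.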

We can get around this by extracting the years from the variable, which will be integers.

years = ufo_us['date_occurred'].map(lambda x: x.year)
print years.describe()

results in:

count 52134.000000
mean 2000.572237
std 10.889045
min 1400.000000
25% 1999.000000
50% 2003.000000
75% 2007.000000
max 2010.000000

which is a little precise for year data, but how is Pandas to know? At any rate, we come to the same conclusion as the authors: that three quarters of the sightings occurred in 1999 or later, and the earliest date in the data is in 1400. (If we check, we’ll see this sighting occurred in Texas, so it’s certainly an error).

Plotting histograms

The authors then plot a histogram of the dates in the data. Like with quantile, the hist plot method (which just calls a Matplotlib histogram) doesn’t work with datetime data. If we try

ufo_us['date_occurred'].hist()

we’ll get an error complaining that datetime can’t be compared with float. So, I’ll just work with the years instead of the full datetime. I can generate the plot with a call to the series’ hist method, one of several plotting methods for Pandas objects that makes it extremely easy to get quick plots of them.

plt.figure()
years.hist(bins = (years.max() - years.min())/30., fc = 'steelblue')
plt.title('Histogram of years with U.S. UFO sightings\nAll years in data')
plt.savefig('quick_hist_all_years.png')

I explicitly set the bins to match the ggplot defaults used in the book. We get this plot, which basically matches the authors’:

The authors then focus on only data after 1990, using R’s subset function to remove earlier observations from the data. This is straightforward in Pandas. I’ll also extract another series with the years of this subset of dates.

ufo_us = ufo_us[ufo_us['date_occurred'] >= dt.datetime(1990, 1, 1)]
years_post90 = ufo_us['date_occurred'].map(lambda x: x.year)

After subsetting, the authors have 46,347 rows left in the data. Looking at the shape attribute of the subsetted data frame, we have 46,780. We’ve picked up some observations from D.C., as well as from our more expansive method of finding U.S. locations.

Another histogram of the subset data looks similar to the authors’ chart on p. 23, but since I’m only histogramming over years, I lose some resolution.

While the histogram is fine for a quick look at the distribution of dates, it’s not a very accurate picture of how sightings evolve over time: the binning really destroys too much information. It makes more sense just to do a time-series plot of total sightings by date. We can do that with some data aggregation and an easy call to the plot method in Pandas.

post90_count = ufo_us.groupby('date_occurred')['date_occurred'].count()
plt.figure()
post90_count.plot()
plt.title('Number of U.S. UFO sightings\nJanuary 1990 through August 2010')
plt.savefig('post90_count_ts.png')

This uses Pandas’ awesome groupby method, which I’ll discuss more in the next part. We get the following figure:

Based on this graph, it looks like there’s a seasonal component to sightings, which wasn’t apparent in the histogram. There are also a few large spikes, especially around the end of the millennium.
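For readers without Pandas handy, the aggregation step behind this plot can be sketched with the standard library. This is only an analogy to the groupby-and-count call, written for Python 3 with made-up dates; `sample_dates`, `daily_counts`, and `series` are hypothetical names.

```python
import datetime as dt
from collections import Counter

# Made-up sighting dates, for illustration only.
sample_dates = [
    dt.date(1999, 11, 16),
    dt.date(1999, 11, 16),
    dt.date(1999, 11, 17),
    dt.date(2000, 1, 1),
    dt.date(1999, 11, 16),
]

# Count how many records share each date (the groupby + count step).
daily_counts = Counter(sample_dates)

# Sort by date so the counts form a time series ready for plotting.
series = sorted(daily_counts.items())
for day, n in series:
    print(day.isoformat(), n)
```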

Conclusion

This part was a relatively easy one. The next part will focus on data aggregation using groupby and reindex methods. Then I’ll wrap up with replicating the authors’ trellis graph.

Machine Learning for Hackers Chapter 1, Part 2: Cleaning date and location data

2012-04-18T04:00:00-04:00

Introduction

In the previous post, I loaded the raw UFO data into a Pandas data frame after cleaning up some irregularities in the text file. Since we’re ultimately concerned with analyzing UFO sightings over time and space, the next step is to clean those variables and prepare them for analysis and visualization.

Some Python techniques to note in this part are:

Like in the last part, Python string methods are going to come in really handy, and be a simple, expressive solution to a lot of problems.
When those aren’t enough, Python has a pretty straightforward set of functions for implementing regular expressions.
The map() method in Pandas can be used to “vectorize” functions along a Series (i.e. a data frame column) and is similar to R’s apply. In general, using a NumPy ufunc (vectorized function) is preferable, but not all operations can be expressed in ufuncs. This is especially true for non-numeric operations, such as for strings or dates.

Cleaning dates: mapping and subsetting.

The first two columns of the data are dates in YYYYMMDD format, and Pandas imported them as integers. R has a function, as.Date, that will operate on a vector of date strings, converting them to numeric dates. In Python, the strptime function in the datetime module performs the same function, but it is not vectorized the way as.Date is. (Note that R also has a strptime that converts date strings to POSIX class objects.) Therefore, we have to use the map method.

def ymd_convert(x):
    '''
    Convert dates in the imported UFO data.
    Clean entries will look like YYYYMMDD. If they're not clean, return NA.
    '''
    try:
        cnv_dt = dt.datetime.strptime(str(x), '%Y%m%d')
    except ValueError:
        cnv_dt = np.nan

    return cnv_dt

ufo['date_occurred'] = ufo['date_occurred'].map(ymd_convert)
ufo['date_reported'] = ufo['date_reported'].map(ymd_convert)

Notice that map here is like R’s apply function (this is a little confusing, since Python also has an apply method that is not like R’s). Since series—columns in Pandas data frames—are just NumPy ndarrays underneath, only NumPy ufuncs will operate on them in a vectorized (fast, elementwise) fashion. Base Python functions, and any more complicated functions you create from them, will have to be explicitly mapped. This is a little different from R, where, since the fundamental object in the language is the vector, functions are more likely vectorized than not. Nonetheless, NumPy ufuncs do cover the gamut of mathematical operations, and for other cases, the map method is easy enough to implement.
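To make the element-by-element nature of map concrete, here is a stand-alone Python 3 sketch that applies a strptime-based converter to a few integers using the built-in map. The sample values are made up, and the hypothetical `ymd_to_datetime` returns None for bad entries instead of np.nan so that it needs nothing beyond the standard library.

```python
import datetime as dt

def ymd_to_datetime(x):
    """Parse an integer or string like 19951009; return None if malformed."""
    try:
        return dt.datetime.strptime(str(x), '%Y%m%d')
    except ValueError:
        return None

# One clean value, one with month and day of zero, one that is not a date.
raw_dates = [19951009, 19950000, 0]

# map calls the function once per element; nothing is vectorized here.
converted = list(map(ymd_to_datetime, raw_dates))

# Keep only the entries that parsed.
clean = [d for d in converted if d is not None]
print(converted)
```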

Then we just get rid of the rows with one date or the other not in proper YYYYMMDD format.

# Get rid of the rows that couldn't be conformed to datetime.
ufo = ufo[(notnull(ufo['date_reported'])) &
          (notnull(ufo['date_occurred']))]

The subsetting of the data frame is done by indexing it with a boolean vector: given such a vector, the df[ ] operation returns the rows where the vector is True.

One can also subset an R data frame this way. R, though, also has a subset function, so that the boolean-indexing form:

ufo = ufo[!is.na(ufo[ , 'date.reported']) & !is.na(ufo[ , 'date.occurred']), ]

is equivalent to:

ufo = subset(ufo, subset = !is.na(date.reported) & !is.na(date.occurred))

The general subset syntax is: df.new = subset(df.orig, subset = condition, select = columns). Since subset looks for the variables referenced in the subset and select arguments in the df.orig environment, there’s no need to call them as df.orig[ , 'var'] or df.orig$var. There are other useful commands that work like this: with, within, and transform, for example.

I find the subset function in R more expressive and easier to read than the boolean masking method, and I miss there being a Pandas equivalent.

Cleaning locations: string functions and regular expressions

Cleaning the date variables was relatively easy. Locations are trickier, and the authors don’t do a particularly thorough job of it. (No knock on them: reading several pages of text cleaning would be deadly boring, and they’re just illustrating some techniques.) I’ll suggest a slightly better method that will pick up some extra data, but even that could probably be improved if we were concerned about getting every bit of information out of this dataset.

The authors assume that valid U.S. locations are going to be in “City, ST” format (e.g., “Iowa City, IA”). Anything else is going to be dropped as either an international record, or not worth cleaning.

They write a function that takes a location record and checks that it fits this pattern by seeing if R’s strsplit function splits it into two elements at a comma. If so, the function returns a vector containing the two elements, otherwise it returns a vector with two NAs (though not quite, see the note below). They then use R’s lapply to apply the function elementwise, and collect the resulting vectors in a list. Then there are some tricks to get the list into an Nx2 matrix, and then put each column of the matrix into a variable in the data frame as USCity and USState.

Note: the authors wrap strsplit in tryCatch assuming that the former will throw an error if there are no commas in the string. My testing shows that’s not the case, and strsplit will just return the original string. The tryCatch wrapper doesn’t have any effect, and that line of code doesn’t appear to drop locations without commas as the authors intend. This isn’t really a problem, since they later subset on records with valid U.S. states, and that ultimately drops the no-comma location records.

It’s easy to write a similar function in Python, using the split method of string objects.

def get_location(l):
    split_location = l.split(',')
    clean_location = [x.strip() for x in split_location]
    if len(split_location) != 2:
        clean_location = ['', '']
    return clean_location

This is a near-direct translation of the authors’ get.location function. Note the strip method and the list comprehension replace the gsub function the authors use to remove leading and trailing white space from the extracted city and states.

But a quick look at the data shows that there are lots of valid U.S. locations that will get dropped with this method. Specifically, the city part of the location contains commas in many records, so the split methods will return more than two elements and we will drop them as invalid. Let’s check out some cases with the following code:

multi_commas = ufo['location'].map(lambda x : x.count(',') > 1)
print 'Number of entries w/ multiple commas', sum(multi_commas)
print ufo['location'][multi_commas][:10]

This returns:

Number of entries w/ multiple commas 1055
1473 Aquaduct (near, over desert, before entering California), CA
1985 Redding (northeast of, out over Millville, approximately), CA
2108 Farmington (SE of, deserted area, Hwy 44), NM
2160 Stouthill (community, nearest city 30 miles, TN), TN
2242 Highway 71 between Clearmont, Missouri and Maryville, Missou, MO
2257 Bayfield (near, Lake Superior, south shore), WI
2287 Unidentified object sig, (VIC, Australia),
2297 Garfield, (VIC, Australia),
2384 Northeast Cape AFS, St Lawrence Island,, AK
2458 Flisa, Solør, Hedemark (Norway),

So there are over a thousand location records with more than one comma, and out of the first ten, seven are valid U.S. locations.

To save these records, I’ll try another method, using regular expressions to search for locations that end with “, ST”-type patterns. Since we’re going to ultimately use map to check this pattern for every row in the data, I’ll compile the pattern first, which typically speeds up repeated searches.

us_state_pattern = re.compile(', [A-Z][A-Z]$', re.IGNORECASE)

Then, I’ll create a function that takes a location record as input, and applies the regex search to it.

def get_location2(l):
    strip_location = l.strip()
    us_state_search = us_state_pattern.search(strip_location)
    if us_state_search is None:
        clean_location = ['', '']
    else:
        us_city = strip_location[ :us_state_search.start()]
        us_state = strip_location[us_state_search.start() + 2: ]
        clean_location = [us_city, us_state]

    return clean_location

To follow this, note that if the regex pattern isn’t found, then the search method returns None, otherwise it returns a search object with several useful attributes. One of them is start, which indicates where in the string the pattern starts. To extract the city, we just take all the characters in the string up to start. The state will start 2 characters later (since we don’t want the comma or space in front). The function, like the previous one, finally returns a two element list with either a city and a state, or two blanks for records that didn’t match the pattern.
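As a quick check of that logic, the sketch below rebuilds the pattern and a compact variant of the function in Python 3 and runs it on a few location strings of the kind shown earlier. The name `split_city_state` is mine, for illustration only.

```python
import re

# Same pattern as in the text: a comma, a space, and two letters at the end.
us_state_pattern = re.compile(r', [A-Z][A-Z]$', re.IGNORECASE)

def split_city_state(location):
    """Return [city, state] if the string ends in ', ST', else ['', '']."""
    stripped = location.strip()
    match = us_state_pattern.search(stripped)
    if match is None:
        return ['', '']
    # match.start() is the index of the comma; the state begins 2 characters on.
    return [stripped[:match.start()], stripped[match.start() + 2:]]

samples = [
    'Iowa City, IA',
    'Farmington (SE of, deserted area, Hwy 44), NM',
    'Garfield, (VIC, Australia),',
]
results = [split_city_state(s) for s in samples]
for s, r in zip(samples, results):
    print(repr(s), '->', r)
```

The multi-comma New Mexico record survives, while the Australian record falls through to the two blanks because it doesn't end in a two-letter code.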

I again use map to apply this function elementwise to the location column:

location_lists = ufo['location'].map(get_location2)

This returns a series of two-element lists. I use list comprehensions to extract the first and second elements out to individual lists, which I assign to us_city and us_state variables in the data frame. It sounds complicated, but in Python it’s just two fairly readable lines of code:

ufo['us_city'] = [city for city, st in location_lists]
ufo['us_state'] = [st.lower() for city, st in location_lists]

The last step in cleaning the location data is to weed out any locations that fit the “City, ST” pattern, but were not in U.S. states–Canadian provinces for example. The authors do this in a straightforward way by making a list of the 50 U.S. states and using R’s match function to see where the U.S. state variable matches a state in the list. They then subset the data frame to records where there is a match.

Note: The authors leave D.C. out of the list of states. It looks like there are about 90 records with D.C. in the state column. Unfortunately a couple of these aren’t Washington, D.C., but are South American “Distrito Capitals.” I’ll add D.C. into the list and subsequent analyses, keeping in mind there are a few false positives. (This may be true for other states as well, like I said at the start, this cleaning isn’t 100% accurate.)

NumPy has an equivalent to the match function, though the name is a little more awkward: in1d. Below, I assign a blank string to any records in us_state that don’t have a match in the state list, then drop them out of the data.

ufo['us_state'][~np.in1d(ufo['us_state'].tolist(), us_states)] = ''
ufo['us_city'][~np.in1d(ufo['us_state'].tolist(), us_states)] = ''

ufo_us = ufo[ufo['us_state'] != '']

The tolist call is necessary because Pandas requires a list argument to [ ], and in1d returns a NumPy array.
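The matching step can also be pictured without NumPy. The sketch below is a plain Python 3 analogy to in1d, not the code used in this post: it builds a boolean mask by testing membership in a set. The short `us_states` set is a stand-in for the full list of state abbreviations, and `candidates` is made up.

```python
# Stand-in for the full list of lower-case state abbreviations (plus D.C.).
us_states = {'ia', 'wi', 'wa', 'mo', 'dc'}

# Candidate codes pulled from "City, ST" style locations;
# 'on' and 'bc' are Canadian provinces.
candidates = ['ia', 'on', 'wa', 'bc', 'dc']

# Elementwise membership test, the role np.in1d plays in the text.
is_us = [st in us_states for st in candidates]

# Blank out the non-matches, then keep only the matches.
cleaned = [st if ok else '' for st, ok in zip(candidates, is_us)]
kept = [st for st in cleaned if st != '']
print(is_us)
print(kept)
```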

And that’s that. In the next post I’ll start exploring the data graphically.

Machine Learning for Hackers Chapter 1, Part 1: Loading data

2012-04-14T04:00:00-04:00

Preface

This is my first Will it Python? post. These posts document my experiences trying to port complete and interesting R projects to Python. I’m beginning by going through the recently published Machine Learning for Hackers (MLFH) by Drew Conway and John Myles White.

More information on the posts is here, and archives are here.

Introduction

The first chapter of MLFH is a gentle introduction to loading, manipulating and graphing data in R. To keep the tutorial interesting, the authors have found a fun dataset of UFO sightings to work through.

Since this chapter is mainly devoted to loading and manipulating data, a lot of the R functionality they exploit is going to have an analog in Pandas. Even though there’s not too much exciting going on in this chapter, it’s a great way to explore how basic data tasks get done in Python. It turns out there are some interesting differences between how R and Python handle even this simple stuff.

In this first post, I’ll focus on just getting the data into the work environment. The complete code for the chapter is located in a Github repo, here.

Data with inconsistent column lengths: break or compensate?

The raw data is contained in a tab-separated file and the authors use R’s read.delim() function to read it into an R dataframe. The data seem to load smoothly, and there are no errors or warnings. There are no headers in the data, so the authors set the header argument of read.delim() to FALSE and name the columns of the dataframe after it’s loaded.

The same procedure in Python uses the read_table() function in Pandas:

ufo = read_table('data/ufo/ufo_awesome.tsv', sep = '\t',
                 na_values = '', header = None)

This, though, will raise an exception, complaining that there are the “wrong number of columns.” R loaded the data without complaint, so what’s going on?

It turns out that read_table() is right to complain. Let’s use Python’s basic file IO to read each line of the file, and separate the line into columns by splitting it at tab characters. We’d expect each line to have six columns. As soon as we hit a line that doesn’t, I’ll break the line-reading loop, and print out the line number and the columns it was split into. This will tell us where the first (if any) bad line is in the file, and give a look at what’s wrong with it.

inpath = 'data/ufo/ufo_awesome.tsv'
inf = open(inpath, 'r')

for i, line in enumerate(inf):
    splitline = line.split('\t')
    if len(splitline) != 6:
        first_bad_line = splitline
        print "First bad row:", i
        for j, col in enumerate(first_bad_line):
            print j, col
        break

inf.close()

This code prints the following output:

First bad row: 754
0 19950704
1 19950706
2 Orlando, FL
3
4 4-5 min
5 I would like to report three yellow oval lights which passed over
Orlando,Florida on July 4, 1995 at aproximately 21:30 (9:30 pm). These
were the sizeof Venus (which they passed close by). Two of them traveled
one after the otherat exactly the same speed and path heading
south-southeast. The third oneappeared about a minute later following
the same path as the other two. Thewhole sighting lasted about 4-5
minutes. There were 4 other witnesses oldenough to report the sighting.
My 4 year old and 5 year old children were theones who called my
attention to the &quot;moving stars&quot;. These objects moved
fasterthan an airplane and did not resemble an aircraft, and were moving
much slowerthan a shooting star. As for them being fireworks, their path
was too regularand coordinated. If anybody else saw this phenomenon,
please contact me at:
6 ler@gnv.ifas.ufl.edu

So we see that in row 754 of the file, we came across a line with seven columns (six tabs). The sixth column of the data is a “long” description of the UFO sighting, and here it looks like there was a tab character within the long description, creating extraneous columns.
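Rather than stopping at the first bad line, one could also tally how many lines have each column count. Below is a small Python 3 sketch of that idea. It runs on a made-up in-memory sample (`sample_tsv`) instead of the real file, so the counts it prints say nothing about the UFO data.

```python
import io
from collections import Counter

# Made-up stand-in for the TSV file: two good rows, one row with an
# extra tab in the description, and one short row.
sample_tsv = (
    "19951009\t19951009\tIowa City, IA\t\t\tMan repts. witnessing\n"
    "19950704\t19950706\tOrlando, FL\t\t4-5 min\tThree lights\tcontact me\n"
    "19950101\t19950103\tShelton, WA\t\t\tTelephoned Report\n"
    "19950510\tColumbia, MO\n"
)

column_counts = Counter()
bad_rows = []
for i, line in enumerate(io.StringIO(sample_tsv)):
    # Drop the trailing newline so it isn't counted as part of the last column.
    n_cols = len(line.rstrip('\n').split('\t'))
    column_counts[n_cols] += 1
    if n_cols != 6:
        bad_rows.append(i)

print(dict(column_counts))
print(bad_rows)
```

On the real file, a tally like this would show at a glance whether the bad rows are rare and whether they tend to be too long or too short.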

Why didn’t R have a problem with this line? We can see what happened if we look on page 15 of MLFH. There the authors show rows of the data where the first column–the date of the sighting–doesn’t match a date format. The first instance of a bad observation in the first column of the R data is ler@gnv.ifas.ufl.edu, which we just saw is actually the first instance of a spurious seventh column. Apparently, read.delim() is inferring the number of columns from the first few rows, then pushing any extra columns to a new row.

I think I much prefer the Pandas behavior here to R’s. Even though R actually did get the data loaded with no fuss, it ended up mangling it pretty badly. Given the size of the dataset, the rarity of these bad rows, and the authors’ cleaning process, it may not have mattered much at the end of the analysis. But that’s not going to be true in every case – and here, R isn’t even throwing a warning to indicate that something might be fishy with the raw data.

Note though, that if the authors had used read.delim() with a col.names argument, then R would have raised an error when it came across a row with more columns than were indicated by the supplied list of column names.

This is a pretty boring problem, but an important one. To sum up:

Lesson 1: R’s read.delim() without either header = TRUE or a col.names argument is dangerous. If you have to load the data to figure out what the column names should be, try loading it again with the column names you’ve assigned.

Preparing the raw data to load into a data frame.

Now that we’ve discovered irregularities in the raw data that are preventing it from fitting neatly into a data frame, we have to fix them.

There are two options, both of which involve processing the file line-by-line. First, we can take the data in the columns after the sixth and append them to the end of the data in the sixth column. The sixth column is a long text description of the event, and the extra columns are likely to be continuations of that description. But we don’t actually end up caring about the long description in our analysis, so I’ll take a second approach and just delete those extra columns.

The procedure is encapsulated in the function below. It reads lines from the original file, inpath, cleans them, and writes the result to outpath. Note that this function doesn’t actually return anything; it’s just a side-effect on the outpath file.

def ufotab_to_sixcols(inpath, outpath):
    '''
    Keep only the first 6 columns of data from messy UFO TSV file.

    The UFO data set is only supposed to have six columns. But...

    The sixth column is a long written description of the UFO sighting, and
    sometimes is broken by tab characters which create extra columns.

    For these records, we only keep the first six columns. This typically
    cuts off some of the long description.

    Sometimes a line has less than six columns. These are not written to
    the output file (i.e., they're dropped from the data). These records
    are usually so compromised as to be uncleanable anyway.

    This function has (is) a side effect on the outpath file, to which it
    writes output.
    '''

    inf = open(inpath, 'r')
    outf = open(outpath, 'w')

    for line in inf:
        splitline = line.split('\t')
        # Skip short lines, which are dirty beyond repair, anyway.
        if len(splitline) < 6:
            continue

        newline = ('\t').join(splitline[ :6])
        # Records that have been truncated won't end in a newline character
        # so add one.
        if newline[-1: ] != '\n':
            newline += '\n'

        outf.write(newline)

    inf.close()
    outf.close()

This function performs the following steps:

1. Open the input file for reading and the output file for writing.
2. Read a line from the original file.
3. Split the line into columns at the tab characters using the split() method.
4. If the line is split into less than six columns, ignore this line and go read the next one.
5. Otherwise rejoin the first six columns of the split line back together with tab characters using the join() method. This results in newline.
6. If there’s not a line break character at the end of newline (which will happen if we’ve cut off the ending column because it was past the sixth column), then add one on.
7. Write newline to the output file.
8. Repeat 2-7 with the next line of the input file.

Note that step 4 means that short lines with less than six columns (5 tabs) don’t get written to the cleaned file. I haven’t investigated in depth why some rows are too short and whether there’s a way to fix those rows instead of tossing them out, but it’s unlikely the fix would be simple or reliable.
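For comparison, the first option mentioned earlier (folding any extra columns back into the long description instead of deleting them) might look something like the sketch below. This is not what I ran; `merge_extra_columns` is a hypothetical helper written for Python 3, the sample line is made up, and joining the pieces with a space is an arbitrary choice.

```python
def merge_extra_columns(line, n_cols=6, sep='\t'):
    """Return line with any columns past n_cols folded into the last one.

    Returns None for lines with fewer than n_cols columns.
    """
    fields = line.rstrip('\n').split(sep)
    if len(fields) < n_cols:
        return None
    head = fields[:n_cols - 1]
    # Rejoin the description pieces that stray tabs had split apart.
    description = ' '.join(fields[n_cols - 1:])
    return sep.join(head + [description]) + '\n'

# A made-up seven-column line with a stray tab inside the description.
messy = ("19950704\t19950706\tOrlando, FL\t\t4-5 min\t"
         "please contact me at:\tsomeone@example.com\n")
fixed = merge_extra_columns(messy)
print(fixed.count('\t') + 1)   # number of columns after merging
```

Short lines still come back as None, so the caller would skip them just as step 4 does.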

I run the function to create a cleaned-up tab-separated file called ufo_awesome_6col.tsv. (The path to the input file, inpath, was already defined).

outpath = 'data/ufo/ufo_awesome_6col.tsv'
ufotab_to_sixcols(inpath, outpath)

Trying `read_table()` again.

Now I’ll try using Pandas and read_table() again to load the file into a data frame. (Since I know what the column names are supposed to be, I’ll just pass them to the function instead of adding them later.)

ufo = read_table('data/ufo/ufo_awesome_6col.tsv', sep = '\t',
                  na_values = '',  header = None,
                  names = ['date_occurred', 'date_reported',
                           'location', 'short_desc', 'duration',
                           'long_desc'])

And this now runs without a hitch. We’ll use the head() and to_string() methods of a Pandas data frame to compare the first six rows of the data to what’s shown in the table on p. 14 of MLFH.

ufo.head(6).to_string(formatters =
                       {'long_desc' : lambda x : x[ :21]})

The dictionary in the formatters argument tells to_string() to only print the first 21 characters in the long description. The result is the following table:

   date_occurred  date_reported              location  short_desc duration                long_desc
0       19951009       19951009         Iowa City, IA         NaN      NaN    Man repts. witnessing
1       19951010       19951011         Milwaukee, WI         NaN    2 min.     Man on Hwy 43 SW of
2       19950101       19950103           Shelton, WA         NaN      NaN     Telephoned Report:CA
3       19950510       19950510          Columbia, MO         NaN    2 min.   Man repts. son&apos;s
4       19950611       19950614           Seattle, WA         NaN      NaN    Anonymous caller rept
5       19951025       19951024  Brunswick County, ND         NaN   30 min.   Sheriff&apos;s office

And this matches the authors’ table on p. 14. So we’re off to a good start. In the next post we’ll clean this data up some more and do some munging to get at the information we’re interested in.
